File name:
Algorithms for hyper-parameter optimization
Development tool:
File size: 268 KB
Downloads: 0
Upload date: 2019-09-03
Description: Algorithms for hyper-parameter optimization.pdf, a research paper describing the TPE procedure of the Bayesian (SMBO) optimization approach.

The contribution of this work is two novel strategies for approximating f by modeling H: a hierarchical Gaussian Process and a tree-structured Parzen estimator. These are described in Section 3 and Section 4 respectively.
3 The Gaussian Process Approach (GP)
Gaussian Processes have long been recognized as a good method for modeling loss functions in the model-based optimization literature [13]. Gaussian Processes (GPs, [14]) are priors over functions that are closed under sampling, which means that if the prior distribution of f is believed to be a GP with mean 0 and kernel k, the conditional distribution of f knowing a sample H = (x_i, f(x_i)) (1 ≤ i ≤ n) of its values is also a GP, whose mean and covariance function are analytically derivable. GPs with generic mean functions can in principle be used, but it is simpler and sufficient for our purposes to only consider zero-mean processes. We do this by centering the function values in the considered data sets. Modelling e.g. linear trends in the GP mean leads to undesirable extrapolation in unexplored regions during SMBO [15].
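As an illustration of this closed-under-sampling property, here is a minimal sketch (Python/NumPy, with a squared-exponential kernel; the kernel choice and all names are assumptions for illustration, not the authors' implementation) of the analytically derivable posterior mean and covariance of a zero-mean GP:

```python
import numpy as np

def sq_exp_kernel(A, B, length_scale=1.0):
    """Squared-exponential kernel k(a, b) = exp(-||a - b||^2 / (2 * l^2))."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dists / length_scale ** 2)

def gp_posterior(X_obs, y_obs, X_query, noise=1e-6):
    """Posterior mean and covariance of a zero-mean GP at X_query.

    y_obs is assumed to already be centered (zero mean), as the paper does
    by centering the function values of the considered data set.
    """
    K = sq_exp_kernel(X_obs, X_obs) + noise * np.eye(len(X_obs))
    K_star = sq_exp_kernel(X_query, X_obs)            # cross-covariance
    K_star_star = sq_exp_kernel(X_query, X_query)
    alpha = np.linalg.solve(K, y_obs)                 # O(n^3): the cubic cost noted in the next paragraph
    mean = K_star @ alpha
    cov = K_star_star - K_star @ np.linalg.solve(K, K_star.T)
    return mean, cov
```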
The above-mentioned closedness property, along with the fact that GPs provide an assessment of prediction uncertainty incorporating the effect of data scarcity, makes the GP an elegant candidate both for finding a candidate x* (Figure 1, step 3) and for fitting a model M_t (Figure 1, step 6). The runtime of each iteration of the GP approach scales cubically in |H| and linearly in the number of variables being optimized; however, the expense of the function evaluations f(x*) typically dominates even this cubic cost.
3.1 Optimizing EI in the GP
We model f with a GP and set y* to the best value found after observing H: y* = min{f(x_i), 1 ≤ i ≤ n}. The model p_M in (1) is then the posterior GP knowing H. The EI function in (1) encapsulates a compromise between regions where the mean function is close to or better than y* and under-explored regions where the uncertainty is high.
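For reference, with a Gaussian posterior of mean mu(x) and standard deviation sigma(x) at a query point, a standard closed form for expected improvement in the minimization setting is sketched below; this is the generic textbook expression, not a reproduction of the paper's equation (1), and the function and variable names are illustrative:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, y_star):
    """EI for minimization: E[max(y_star - Y, 0)] with Y ~ N(mu, sigma^2).

    mu, sigma: posterior mean and standard deviation of the GP at the query point(s).
    y_star:    incumbent threshold (here, the best observed loss).
    """
    sigma = np.maximum(sigma, 1e-12)          # guard against division by zero
    z = (y_star - mu) / sigma
    return (y_star - mu) * norm.cdf(z) + sigma * norm.pdf(z)
```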
EI functions are usually optimized with an exhaustive grid search over the input space, or a Latin hypercube search in higher dimensions. However, some information on the landscape of the EI criterion can be derived from simple computations [16]: 1) it is always non-negative and zero at training points from D, 2) it inherits the smoothness of the kernel k, which is in practice often at least once differentiable, and, noticeably, 3) the EI criterion is likely to be highly multi-modal, especially as the number of training points increases. The authors of [16] used the preceding remarks on the landscape of EI to design an evolutionary algorithm with mixture search, specifically aimed at optimizing EI, that is shown to outperform exhaustive search for a given budget of EI evaluations. We borrow their approach here and go one step further. We keep the Estimation of Distribution (EDA, [17]) approach on the discrete part of our input space (categorical and discrete hyper-parameters), where we sample candidate points according to binomial distributions, while we use the Covariance Matrix Adaptation Evolution Strategy (CMA-ES, [18]) for the remaining part of our input space (continuous hyper-parameters). CMA-ES is a state-of-the-art gradient-free evolutionary algorithm for optimization on continuous domains, which has been shown to outperform the Gaussian search EDA. Notice that such a gradient-free approach allows non-differentiable kernels for the GP regression. We do not take on the use of mixtures in [16], but rather restart the local searches several times, starting from promising places. The use of tessellations suggested by [16] is prohibitive here, as our task often means working in more than 10 dimensions; thus we start each local search at the center of mass of a simplex with vertices randomly picked among the training points.
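A minimal sketch of this restart heuristic, assuming the training configurations are stored as rows of a NumPy array (the choice of d + 1 simplex vertices and all names are illustrative, not taken from the paper's code):

```python
import numpy as np

def restart_points(X_train, n_restarts, rng=np.random.default_rng()):
    """Draw starting points for the EI local searches.

    Each start is the center of mass of a simplex whose d + 1 vertices are
    picked at random (without replacement) among the training points.
    """
    n, d = X_train.shape
    starts = []
    for _ in range(n_restarts):
        idx = rng.choice(n, size=min(d + 1, n), replace=False)
        starts.append(X_train[idx].mean(axis=0))
    return np.stack(starts)

# Each returned point would then seed one continuous local search of the EI
# surface (e.g. a CMA-ES run over the continuous hyper-parameters).
```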
Finally, we remark that not all hyper-parameters are relevant for each point. For example, a DBN with only one hidden layer does not have parameters associated with a second or third layer. Thus it is not enough to place one GP over the entire space of hyper-parameters. We chose to group the hyper-parameters by common use in a tree-like fashion and to place different independent GPs over each group. As an example, for DBNs, this means placing one GP over the common hyper-parameters, including the categorical parameters that indicate which conditional groups to consider, three GPs on the parameters corresponding to each of the three layers, and a few 1-dimensional GPs over individual conditional hyper-parameters, like ZCA energy (see Table 1 for DBN parameters).
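To make the grouping concrete, a hypothetical tree-like layout of a DBN search space might look like the sketch below; the parameter names and group boundaries are illustrative, not the exact groups used in the paper:

```python
# Hypothetical tree-like grouping: one independent GP is fit per group, and a
# group's GP only sees the evaluated points for which that group is active.
hyperparam_groups = {
    "common":  ["n_layers", "learning_rate", "batch_size", "preprocessing"],
    "layer_1": ["layer_1_size", "layer_1_weight_init"],
    "layer_2": ["layer_2_size", "layer_2_weight_init"],   # active only if n_layers >= 2
    "layer_3": ["layer_3_size", "layer_3_weight_init"],   # active only if n_layers >= 3
    "zca":     ["zca_energy"],                             # active only if preprocessing == "zca"
}
```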
4 Tree-structured Parzen Estimator Approach (TPE)
Anticipating that our hyper-parameter optimization tasks will mean high dimensions and small fitness evaluation budgets, we now turn to another modeling strategy and EI optimization scheme for the SMBO algorithm. Whereas the Gaussian-process based approach modeled p(y|x) directly, this strategy models p(x|y) and p(y).
Recall from the introduction that the configuration space X is described by a graph-structured generative process (e.g. first choose a number of DBN layers, then choose the parameters for each). The tree-structured Parzen estimator (TPE) models p(x|y) by transforming that generative process, replacing the distributions of the configuration prior with non-parametric densities. In the experimental section, we will see that the configuration space is described using uniform, log-uniform, quantized log-uniform, and categorical variables. In these cases, the TPE algorithm makes the following replacements: uniform → truncated Gaussian mixture, log-uniform → exponentiated truncated Gaussian mixture, categorical → re-weighted categorical. Using different observations {x^(1), ..., x^(k)} in the non-parametric densities, these substitutions represent a learning algorithm that can produce a variety of densities over the configuration space X. The TPE defines p(x|y) using two such densities:
p(x|y) = ℓ(x)  if y < y*
         g(x)  if y ≥ y*
where ℓ(x) is the density formed by using the observations x^(i) such that the corresponding loss f(x^(i)) was less than y*, and g(x) is the density formed by using the remaining observations. Whereas the GP-based approach favoured quite an aggressive y* (typically less than the best observed loss), the TPE algorithm depends on a y* that is larger than the best observed f(x) so that some points can be used to form ℓ(x). The TPE algorithm chooses y* to be some quantile γ of the observed y values, so that p(y < y*) = γ.
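A minimal one-dimensional sketch of this split, using scipy's Gaussian KDE as a stand-in for the paper's adaptive Parzen estimator (the actual TPE uses truncated, prior-derived mixtures, so this only approximates the idea; all names are illustrative):

```python
import numpy as np
from scipy.stats import gaussian_kde

def tpe_densities(x_obs, y_obs, gamma=0.15):
    """Split observations at the gamma-quantile of y and fit l(x) and g(x).

    x_obs, y_obs: 1-D arrays of configurations and their observed losses.
    gamma:        fraction of points treated as "good", i.e. p(y < y*) = gamma.
    """
    x_obs, y_obs = np.asarray(x_obs, float), np.asarray(y_obs, float)
    y_star = np.quantile(y_obs, gamma)
    good = y_obs < y_star
    l = gaussian_kde(x_obs[good])    # density over configurations whose loss was below y*
    g = gaussian_kde(x_obs[~good])   # density over the remaining configurations
    return l, g, y_star
```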