File name:
Algorithms for hyper-parameter optimization
Development tool:
File size: 268 KB
Downloads: 0
Upload date: 2019-09-03
Description: Algorithms for hyper-parameter optimization.pdf, a research paper describing the TPE procedure of the Bayesian (SMBO) optimization approach.

The contribution of this work is two novel strategies for approximating f by modeling H: a hierarchical Gaussian Process and a tree-structured Parzen estimator. These are described in Section 3 and Section 4 respectively.
3 The Gaussian Process Approach (GP)
Gaussian Processes have long been recognized as a good method for modeling loss functions in the model-based optimization literature [13]. Gaussian Processes (GPs, [14]) are priors over functions that are closed under sampling, which means that if the prior distribution of f is believed to be a GP with mean 0 and kernel k, the conditional distribution of f knowing a sample H = (x_i, f(x_i)) (1 ≤ i ≤ n) of its values is also a GP, whose mean and covariance function are analytically derivable. GPs with generic mean functions can in principle be used, but it is simpler and sufficient for our purposes to only consider zero-mean processes. We do this by centering the function values in the considered data sets. Modelling e.g. linear trends in the GP mean leads to undesirable extrapolation in unexplored regions during SMBO [15].
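As an illustration of this closed-under-sampling property, here is a minimal sketch (Python/NumPy, with a squared-exponential kernel; the kernel choice and all names are assumptions for illustration, not the authors' implementation) of the analytically derivable posterior mean and covariance of a zero-mean GP:

```python
import numpy as np

def sq_exp_kernel(A, B, length_scale=1.0):
    """Squared-exponential kernel k(a, b) = exp(-||a - b||^2 / (2 * l^2))."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dists / length_scale ** 2)

def gp_posterior(X_obs, y_obs, X_query, noise=1e-6):
    """Posterior mean and covariance of a zero-mean GP at X_query.

    y_obs is assumed to already be centered (zero mean), as the paper does
    by centering the function values of the considered data set.
    """
    K = sq_exp_kernel(X_obs, X_obs) + noise * np.eye(len(X_obs))
    K_star = sq_exp_kernel(X_query, X_obs)            # cross-covariance
    K_star_star = sq_exp_kernel(X_query, X_query)
    alpha = np.linalg.solve(K, y_obs)                 # O(n^3): the cubic cost noted in the next paragraph
    mean = K_star @ alpha
    cov = K_star_star - K_star @ np.linalg.solve(K, K_star.T)
    return mean, cov
```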
The above-mentioned closedness property, along with the fact that GPs provide an assessment of prediction uncertainty incorporating the effect of data scarcity, makes the GP an elegant candidate both for finding a candidate x* (Figure 1, step 3) and for fitting a model M_t (Figure 1, step 6). The runtime of each iteration of the GP approach scales cubically in |H| and linearly in the number of variables being optimized; however, the expense of the function evaluations f(x*) typically dominates even this cubic cost.
3.1 Optimizing EI in the GP
We model f with a GP and set y* to the best value found after observing H: y* = min{f(x_i), 1 ≤ i ≤ n}. The model p_M in (1) is then the posterior GP knowing H. The EI function in (1) encapsulates a compromise between regions where the mean function is close to or better than y* and under-explored regions where the uncertainty is high.
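For reference, with a Gaussian posterior of mean mu(x) and standard deviation sigma(x) at a query point, a standard closed form for expected improvement in the minimization setting is sketched below; this is the generic textbook expression, not a reproduction of the paper's equation (1), and the function and variable names are illustrative:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, y_star):
    """EI for minimization: E[max(y_star - Y, 0)] with Y ~ N(mu, sigma^2).

    mu, sigma: posterior mean and standard deviation of the GP at the query point(s).
    y_star:    incumbent threshold (here, the best observed loss).
    """
    sigma = np.maximum(sigma, 1e-12)          # guard against division by zero
    z = (y_star - mu) / sigma
    return (y_star - mu) * norm.cdf(z) + sigma * norm.pdf(z)
```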
EI functions are usually optimized with an exhaustive grid search over the input space, or a Latin hypercube search in higher dimensions. However, some information on the landscape of the EI criterion can be derived from simple computations [16]: 1) it is always non-negative and zero at training points from D, 2) it inherits the smoothness of the kernel k, which is in practice often at least once differentiable, and, noticeably, 3) the EI criterion is likely to be highly multi-modal, especially as the number of training points increases. The authors of [16] used the preceding remarks on the landscape of EI to design an evolutionary algorithm with mixture search, specifically aimed at optimizing EI, that is shown to outperform exhaustive search for a given budget of EI evaluations. We borrow their approach here and go one step further. We keep the Estimation of Distribution (EDA, [17]) approach on the discrete part of our input space (categorical and discrete hyper-parameters), where we sample candidate points according to binomial distributions, while we use the Covariance Matrix Adaptation Evolution Strategy (CMA-ES, [18]) for the remaining part of our input space (continuous hyper-parameters). CMA-ES is a state-of-the-art gradient-free evolutionary algorithm for optimization on continuous domains, which has been shown to outperform the Gaussian search EDA. Notice that such a gradient-free approach allows non-differentiable kernels for the GP regression. We do not take on the use of mixtures in [16], but rather restart the local searches several times, starting from promising places. The use of tessellations suggested by [16] is prohibitive here, as our task often means working in more than 10 dimensions; thus we start each local search at the center of mass of a simplex with vertices randomly picked among the training points.
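A minimal sketch of this restart heuristic, assuming the training configurations are stored as rows of a NumPy array (the choice of d + 1 simplex vertices and all names are illustrative, not taken from the paper's code):

```python
import numpy as np

def restart_points(X_train, n_restarts, rng=np.random.default_rng()):
    """Draw starting points for the EI local searches.

    Each start is the center of mass of a simplex whose d + 1 vertices are
    picked at random (without replacement) among the training points.
    """
    n, d = X_train.shape
    starts = []
    for _ in range(n_restarts):
        idx = rng.choice(n, size=min(d + 1, n), replace=False)
        starts.append(X_train[idx].mean(axis=0))
    return np.stack(starts)

# Each returned point would then seed one continuous local search of the EI
# surface (e.g. a CMA-ES run over the continuous hyper-parameters).
```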
Finally, we remark that not all hyper-parameters are relevant for each point. For example, a DBN with only one hidden layer does not have parameters associated with a second or third layer. Thus it is not enough to place one GP over the entire space of hyper-parameters. We chose to group the hyper-parameters by common use in a tree-like fashion and to place different independent GPs over each group. As an example, for DBNs, this means placing one GP over the common hyper-parameters, including the categorical parameters that indicate which conditional groups to consider, three GPs on the parameters corresponding to each of the three layers, and a few 1-dimensional GPs over individual conditional hyper-parameters, like ZCA energy (see Table 1 for DBN parameters).
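To make the grouping concrete, a hypothetical tree-like layout of a DBN search space might look like the sketch below; the parameter names and group boundaries are illustrative, not the exact groups used in the paper:

```python
# Hypothetical tree-like grouping: one independent GP is fit per group, and a
# group's GP only sees the evaluated points for which that group is active.
hyperparam_groups = {
    "common":  ["n_layers", "learning_rate", "batch_size", "preprocessing"],
    "layer_1": ["layer_1_size", "layer_1_weight_init"],
    "layer_2": ["layer_2_size", "layer_2_weight_init"],   # active only if n_layers >= 2
    "layer_3": ["layer_3_size", "layer_3_weight_init"],   # active only if n_layers >= 3
    "zca":     ["zca_energy"],                             # active only if preprocessing == "zca"
}
```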
4 Tree-structured Parzen Estimator Approach (TPE)
Anticipating that our hyper-parameter optimization tasks will mean high dimensions and small fitness evaluation budgets, we now turn to another modeling strategy and EI optimization scheme for the SMBO algorithm. Whereas the Gaussian-process based approach modeled p(y|x) directly, this strategy models p(x|y) and p(y).
Recall from the introduction that the configuration space X is described by a graph-structured generative process (e.g. first choose a number of DBN layers, then choose the parameters for each). The tree-structured Parzen estimator (TPE) models p(x|y) by transforming that generative process, replacing the distributions of the configuration prior with non-parametric densities. In the experimental section, we will see that the configuration space is described using uniform, log-uniform, quantized log-uniform, and categorical variables. In these cases, the TPE algorithm makes the following replacements: uniform → truncated Gaussian mixture, log-uniform → exponentiated truncated Gaussian mixture, categorical → re-weighted categorical. Using different observations {x^(1), ..., x^(k)} in the non-parametric densities, these substitutions represent a learning algorithm that can produce a variety of densities over the configuration space X. The TPE defines p(x|y) using two such densities:
p(x|y) = ℓ(x)  if y < y*
         g(x)  if y ≥ y*
where ℓ(x) is the density formed by using the observations x^(i) such that the corresponding loss f(x^(i)) was less than y*, and g(x) is the density formed by using the remaining observations. Whereas the GP-based approach favoured quite an aggressive y* (typically less than the best observed loss), the TPE algorithm depends on a y* that is larger than the best observed f(x) so that some points can be used to form ℓ(x). The TPE algorithm chooses y* to be some quantile γ of the observed y values, so that p(y < y*) = γ.
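A minimal one-dimensional sketch of this split, using scipy's Gaussian KDE as a stand-in for the paper's adaptive Parzen estimator (the actual TPE uses truncated, prior-derived mixtures, so this only approximates the idea; all names are illustrative):

```python
import numpy as np
from scipy.stats import gaussian_kde

def tpe_densities(x_obs, y_obs, gamma=0.15):
    """Split observations at the gamma-quantile of y and fit l(x) and g(x).

    x_obs, y_obs: 1-D arrays of configurations and their observed losses.
    gamma:        fraction of points treated as "good", i.e. p(y < y*) = gamma.
    """
    x_obs, y_obs = np.asarray(x_obs, float), np.asarray(y_obs, float)
    y_star = np.quantile(y_obs, gamma)
    good = y_obs < y_star
    l = gaussian_kde(x_obs[good])    # density over configurations whose loss was below y*
    g = gaussian_kde(x_obs[~good])   # density over the remaining configurations
    return l, g, y_star
```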