coordinate search and is able to find a near-optimal hyperparameter configuration with few expensive function evaluations, for either integer or continuous valued hyperparameters.

We name our algorithm Hyperparameter Optimization via RBF and Dynamic coordinate search, or HORD for short. We compare our method against the well-established GP-based algorithms (i.e., GP-EI and GP-PES) and against the tree-based algorithms (i.e., TPE and SMAC) for optimizing the hyperparameters of two types of commonly used neural networks applied to the MNIST and CIFAR-10 benchmark datasets.

Our contributions can be summarized as follows:

• We provide a mixed-integer deterministic-surrogate optimization algorithm for optimizing hyperparameters. The algorithm is capable of optimizing both continuous and integer hyperparameters of deep neural networks, and performs equally well in low-dimensional hyperparameter spaces and exceptionally well in higher dimensions.

• Extensive evaluations demonstrate the superiority of the proposed algorithm over the state of the art, both in finding a near-optimal configuration with fewer function evaluations and in achieving a lower final validation error. As discussed later, HORD obtains speedups of 3.7 to 6 fold over the average computation time of the other algorithms on the 19-dimensional problem, and is faster on average for all dimensions and all problems tested.

Related Work

Surrogate-based optimization (Mockus, Tiesis, and Zilinskas 1978) is a strategy for the global optimization of expensive black-box functions over a constrained domain. The goal is to obtain a near-optimal solution with as few function evaluations as possible. The main idea is to utilize a surrogate model of the expensive function that can be inexpensively evaluated to determine the next most promising point for evaluation. Surrogate-based optimization methods differ in two aspects: the type of model used as a surrogate, and the way the model is used to determine the next most promising point for expensive evaluation.

Bayesian optimization is a type of surrogate-based optimization in which the surrogate is a probabilistic model used to compute the posterior estimate of the distribution of the expensive error function values. The next most promising point is determined by optimizing an acquisition function of choice given the estimated distribution. Bayesian optimization methods mainly differ in how they estimate the error distribution and in how they define the acquisition function.

The GP-EI (Snoek, Larochelle, and Adams 2012) and GP-PES (Hernández-Lobato, Hoffman, and Ghahramani 2014) methods use Gaussian processes to estimate the distribution of the validation error given a hyperparameter configuration, using the history of observations. The two methods use the expected improvement (EI) and the predictive entropy search (PES), respectively, as the acquisition function. Although the GP is simple and flexible, GP-based optimization involves inverting an expensive covariance matrix and thus scales cubically with the number of observations. To address the scalability issue of GP-based methods, TPE was proposed by Bergstra et al. (2011). TPE is a non-standard Bayesian optimization algorithm which uses tree-structured Parzen estimators to model the error distribution in a non-parametric way. SMAC (Hutter, Hoos, and Leyton-Brown 2011) is another tree-based algorithm, which uses random forests to estimate the error density.

Eggensperger et al. (2013) empirically showed that Spearmint's GP-based approaches reach state-of-the-art performance when optimizing few hyperparameters, while TPE's and SMAC's tree-structured approaches achieve the best results in high-dimensional spaces. We therefore compare our algorithm against these four methods.

The authors of (Eggensperger et al. 2015) propose an interesting way of using surrogate-based models for hyperparameter optimization: they build a surrogate of the validation error given a hyperparameter configuration in order to evaluate and compare different hyperparameter optimization algorithms. However, they do not unify the surrogate models with an optimization algorithm.

In (Snoek et al. 2015) the authors address the scalability issue of GP-based models and explore the use of neural networks as an alternative surrogate model. Their neural-network-based hyperparameter optimization algorithm uses only a fraction of the computational resources of GP-based algorithms, especially when the number of function evaluations grows to 2,000. However, as opposed to our proposed method, their approach does not outperform the GP-based algorithms, nor tree-based algorithms such as TPE and SMAC.

Description of The Proposed Method HORD

Here we introduce a novel hyperparameter optimization algorithm called Hyperparameter Optimization using RBF and Dynamic coordinate search, or HORD for short.

The surrogate model

We use the radial basis function (RBF) interpolation model as the surrogate model (Powell 1990) in searching for optimal hyperparameters. Given D hyperparameters and n hyperparameter configurations x_{1:n}, with x_i \in \mathbb{R}^D, and their corresponding validation errors f_{1:n}, where f_i = f(x_i), we define the RBF interpolation model as:

S_n(x) = \sum_{i=1}^{n} \lambda_i \phi(\|x - x_i\|) + p(x)    (2)

Here \phi(r) = r^3 denotes the cubic spline RBF, \|\cdot\| is the Euclidean norm, and p(x) = b^T x + a is the polynomial tail, with b = [b_1, \ldots, b_D]^T \in \mathbb{R}^D and a \in \mathbb{R}. The parameters \lambda_i, i = 1, \ldots, n, b_k, k = 1, \ldots, D, and a are the interpolation model parameters, determined by solving the linear system of equations (Gutmann 2001):

\begin{bmatrix} \Phi & P \\ P^T & 0 \end{bmatrix} \begin{bmatrix} \lambda \\ c \end{bmatrix} = \begin{bmatrix} F \\ 0 \end{bmatrix}    (3)
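As a concrete illustration, the cubic-RBF surrogate of Eq. 2 can be fitted by assembling and solving the linear system of Eq. 3 directly. The sketch below uses NumPy and synthetic (configuration, validation error) pairs; it is an illustration of the equations, not the authors' implementation (which is built on pySOT).

```python
import numpy as np

def fit_rbf_surrogate(X, F):
    """Fit the cubic-RBF interpolant of Eq. 2,
    S_n(x) = sum_i lambda_i * phi(||x - x_i||) + b^T x + a,  phi(r) = r^3,
    by solving the linear system of Eq. 3 (Gutmann 2001)."""
    n, D = X.shape
    # Phi[i, j] = phi(||x_i - x_j||), the cubic spline kernel matrix
    r = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    Phi = r ** 3
    # Rows of P are [x_i^T, 1]; they multiply c = [b; a], the polynomial tail
    P = np.hstack([X, np.ones((n, 1))])
    # Block system [[Phi, P], [P^T, 0]] [lambda; c] = [F; 0]
    A = np.block([[Phi, P], [P.T, np.zeros((D + 1, D + 1))]])
    rhs = np.concatenate([F, np.zeros(D + 1)])
    sol = np.linalg.solve(A, rhs)
    lam, b, a = sol[:n], sol[n:n + D], sol[n + D]

    def S(x):
        # Evaluating S_n(x) is cheap, unlike training a DNN to obtain f(x)
        dists = np.linalg.norm(X - np.asarray(x), axis=1)
        return float(lam @ dists ** 3 + b @ np.asarray(x) + a)

    return S

# Synthetic data standing in for (hyperparameter configuration, error) pairs.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(12, 3))   # n = 12 configurations, D = 3
F = np.sin(X).sum(axis=1) + 0.1 * X[:, 0] ** 2
S = fit_rbf_surrogate(X, F)
# The RBF model interpolates: S(x_i) = f(x_i) at every sampled point.
assert all(abs(S(x) - f) < 1e-8 for x, f in zip(X, F))
```

The interpolation property in the final assertion follows directly from the linear system: each of its first n rows enforces S_n(x_i) = f(x_i).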
Here the elements of the matrix \Phi \in \mathbb{R}^{n \times n} are defined as \Phi_{i,j} = \phi(\|x_i - x_j\|), i, j = 1, \ldots, n, 0 is a zero matrix of compatible dimension, and

P = \begin{bmatrix} x_1^T & 1 \\ \vdots & \vdots \\ x_n^T & 1 \end{bmatrix}, \quad \lambda = \begin{bmatrix} \lambda_1 \\ \vdots \\ \lambda_n \end{bmatrix}, \quad c = \begin{bmatrix} b_1 \\ \vdots \\ b_D \\ a \end{bmatrix}, \quad F = \begin{bmatrix} f(x_1) \\ \vdots \\ f(x_n) \end{bmatrix}    (4)

Searching the hyperparameter space

The training and evaluation of a deep neural network (DNN) can be regarded as a function f that maps the hyperparameter configuration used to train the network to the validation error obtained at the end. Optimizing f(x) with respect to a hyperparameter configuration x as in Eq. (1) is a global optimization problem, since the function f(x) is typically highly multimodal. We employ the RBF interpolation model to approximate the expensive function f(x), and we build the new mixed-integer global optimization HORD algorithm with some features from the continuous global Dynamic coordinate search (DYCORS-LMSRBF) (Regis and Shoemaker 2013) to search the surrogate model for promising hyperparameter configurations.

As in the LMSRBF (Regis and Shoemaker 2007), the DYCORS-LMSRBF algorithm starts by fitting an initial surrogate model S_{n_0} using A_{n_0} = {(x_i, f_i)}_{i=1}^{n_0}. We set n_0 = 2(D + 1), and use the Latin hypercube sampling method to sample x_{1:n_0} hyperparameter configurations and obtain their respective validation errors f_{1:n_0}.

Next, while n < N_max, a user-defined maximal evaluation number, the candidate point generation algorithm (Alg. 1) populates the candidate point set \Omega_n = t_{n,1:m} with m = 100D candidate points, and selects the most promising point for evaluation as x_{n+1}. After N_max iterations, the algorithm returns the hyperparameter configuration x_best that yielded the lowest validation error f(x_best). A formal algorithm description is given in Alg. 1.

Algorithm 1 Hyperparameter Optimization using RBF-based surrogate and DYCORS (HORD)
input: n_0 = 2(D + 1), m = 100D, and N_max.
output: optimal hyperparameters x_best.
1: Use Latin hypercube sampling to sample n_0 points and set I = {x_i}_{i=1}^{n_0}.
2: Evaluating f(x) for the points in I gives A_{n_0} = {(x_i, f(x_i))}_{i=1}^{n_0}.
3: while n < N_max do
4:   Use A_n to fit or update the surrogate model S_n(x) (Eq. 2).
5:   Set x_best = arg min {f(x_i) : i = 1, \ldots, n}.
6:   Compute \varphi_n (Eq. 5), i.e., the probability of perturbing a coordinate.
7:   Populate \Omega_n with m candidate points t_{n,1:m}, where for each candidate y_j \in t_{n,1:m}: (a) set y_j = x_best; (b) select the coordinates of y_j to be perturbed with probability \varphi_n; and (c) add \delta_i sampled from N(0, \sigma_n^2) to the coordinates of y_j selected in (b), rounding to the nearest integer if required.
8:   Calculate V_n^{ev}(t_{n,1:m}) (Eq. 6), V_n^{dm}(t_{n,1:m}) (Eq. 7), and the final weighted score W_n(t_{n,1:m}) (Eq. 8).
9:   Set x* = arg min {W_n(t_{n,1:m})}.
10:  Evaluate f(x*).
11:  Adjust the variance \sigma_n^2 (see text).
12:  Update A_{n+1} = A_n \cup {(x*, f(x*))}.
13: end while
14: Return x_best.

Candidate hyperparameter generation and selection

We describe how candidate hyperparameters are generated and selected in this section.

The set of candidate hyperparameters \Omega_n at iteration n (Step 7 in Alg. 1) is populated with points generated by adding noise \delta_i to some or all of the coordinates¹ of the current x_best. In HORD, the expected number of coordinates perturbed is monotonically decreasing, since perturbing all coordinates of the current x_best results in a point much further from x_best, a significant problem when the dimension D becomes large, e.g., D > 10. Note that, during one iteration, each candidate point is generated by perturbing a potentially different subset of coordinates.

¹ A coordinate in the hyperparameter space is one specific hyperparameter (e.g., the learning rate).

The probability of perturbing a coordinate (Step 6 in Alg. 1) in HORD declines with the number of iterations and is given by:

\varphi_n = \varphi_0 \left[ 1 - \frac{\ln(n - n_0 + 1)}{\ln(N_{max} - n_0)} \right], \quad n_0 \le n < N_{max}    (5)

where \varphi_0 is set to min(20/D, 1), so that the average number of coordinates perturbed is always less than 20. HORD combines features from DYCORS (Regis and Shoemaker 2013) and SO-MI (Mueller, Shoemaker, and Piche 2013) to generate candidates. DYCORS is a continuous-optimization RBF surrogate method that uses Eq. 5 to control perturbations. SO-MI is the first mixed-integer implementation of an RBF surrogate method.

The perturbation \delta_i is sampled from N(0, \sigma_n^2). Initially, the variance \sigma_{n_0}^2 is set to 0.2, and after each T_fail = max(5, D) consecutive iterations with no improvement over the current x_best, it is reduced to \sigma_{n+1}^2 = max(\sigma_n^2 / 2, 0.005). The variance is doubled (capped at the initial value of 0.2) after 3 consecutive iterations with improvement. The adjustment of the variance (Step 11 in Alg. 1) is in place to facilitate the convergence of the optimization algorithm. Once the set of candidate points \Omega has been populated with m = 100D points, we estimate their corresponding validation errors by computing the surrogate values S(t_{1:m}) (for clarity, we leave n out of the notation). Here t denotes a candidate point. We compute the distance of each t \in \Omega from the previously evaluated points x_{1:n} as \Delta(t) = min \|t - x_{1:n}\|, where \|\cdot\| is the Euclidean norm. We also compute \Delta^{max} = max{\Delta(t_{1:m})} and \Delta^{min} = min{\Delta(t_{1:m})}.
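One candidate-generation and selection step (Steps 6-9 of Alg. 1, with the perturbation probability of Eq. 5 and the weighted score of Eq. 8) can be sketched as below. This is a simplified illustration with hypothetical inputs, not the authors' pySOT-based implementation; the surrogate `S` and the evaluated points are assumed given, and the "at least one perturbed coordinate" safeguard is an implementation assumption.

```python
import numpy as np

def generate_and_select(S, X_eval, x_best, n, n0, N_max, D, sigma2,
                        int_mask=None, m=None, w=0.3, rng=None):
    """One candidate-generation/selection step of HORD (Steps 6-9, Alg. 1).

    S        : surrogate model mapping a point to an estimated validation error
    X_eval   : (n, D) array of previously evaluated configurations
    int_mask : boolean mask marking integer-valued hyperparameters
    w        : weight balancing surrogate value vs. distance (Eq. 8)
    """
    if rng is None:
        rng = np.random.default_rng()
    if m is None:
        m = 100 * D
    # Eq. 5: probability of perturbing a coordinate decays with iteration n
    phi0 = min(20.0 / D, 1.0)
    phi_n = phi0 * (1.0 - np.log(n - n0 + 1) / np.log(N_max - n0))

    # Step 7: perturb a random subset of coordinates of x_best per candidate
    perturb = rng.random((m, D)) < phi_n
    none = ~perturb.any(axis=1)          # ensure >= 1 perturbed coordinate
    perturb[none, rng.integers(0, D, none.sum())] = True
    delta = rng.normal(0.0, np.sqrt(sigma2), size=(m, D))
    T = x_best + perturb * delta
    if int_mask is not None:
        T[:, int_mask] = np.round(T[:, int_mask])

    # Step 8: score by surrogate value (Eq. 6) and by distance (Eq. 7)
    s = np.array([S(t) for t in T])
    V_ev = (s - s.min()) / (s.max() - s.min()) if s.max() > s.min() else np.ones(m)
    d = np.array([np.linalg.norm(X_eval - t, axis=1).min() for t in T])
    V_dm = (d.max() - d) / (d.max() - d.min()) if d.max() > d.min() else np.ones(m)

    # Step 9: pick the candidate with the lowest weighted score (Eq. 8)
    W = w * V_ev + (1.0 - w) * V_dm
    return T[np.argmin(W)]

# Demo with a hypothetical quadratic stand-in for the validation error.
rng = np.random.default_rng(1)
D = 5
X_eval = rng.uniform(0.0, 1.0, size=(12, D))
def surrogate(t):
    return float(np.sum((t - 0.5) ** 2))
x_next = generate_and_select(surrogate, X_eval, X_eval[0], n=12, n0=12,
                             N_max=200, D=D, sigma2=0.2, rng=rng)
```

In the full algorithm this step is repeated each iteration, with w cycled through 0.3, 0.5, 0.8, 0.95 and with sigma2 adapted as described in the text.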
Finally, for each t \in \Omega we compute scores for two criteria: V^{ev} for the surrogate estimated value,

V^{ev}(t) = \begin{cases} \dfrac{S(t) - s^{min}}{s^{max} - s^{min}}, & \text{if } s^{max} \ne s^{min} \\ 1, & \text{otherwise} \end{cases}    (6)

and the distance metric V^{dm},

V^{dm}(t) = \begin{cases} \dfrac{\Delta^{max} - \Delta(t)}{\Delta^{max} - \Delta^{min}}, & \text{if } \Delta^{max} \ne \Delta^{min} \\ 1, & \text{otherwise} \end{cases}    (7)

We use V^{ev} and V^{dm} to compute the final weighted score:

W(t) = w V^{ev}(t) + (1 - w) V^{dm}(t)    (8)

The final weighted score (Eq. 8) is the acquisition function used in HORD to select a new evaluation point. Here, w is a cyclic weight for balancing between global and local search. We cycle through the weights 0.3, 0.5, 0.8, 0.95 in a sequential manner. We select the point with the lowest weighted score W as the next evaluated point x_{n+1}.

Muller and Shoemaker (2014) compare the above candidate approach with searching instead with a genetic algorithm. Krityakierne, Akhtar, and Shoemaker (2016) use multi-objective methods with an RBF surrogate to select the next point for evaluation in single-objective optimization of f(x).

Experimental Setup

DNN problems We test HORD on four DNN hyperparameter optimization problems with 6, 8, 15 and 19 hyperparameters. Our first DNN problem consists of 4 continuous and 2 integer hyperparameters of a Multi-layer Perceptron (MLP) network applied to classifying grayscale images of handwritten digits from the popular benchmark dataset MNIST. This problem is referred to as 6-MLP in subsequent discussions.

The MLP network consists of two hidden layers with ReLU activations between them and SoftMax at the end. As the learning algorithm, we use Stochastic Gradient Descent (SGD) to compute θ in Eq. 1. The hyperparameters (denoted as x in Eq. 1) we optimize with HORD and the other algorithms include hyperparameters of the learning algorithm, the layer weight initialization hyperparameters, and network structure hyperparameters. Full details of the datasets used, the hyperparameters being optimized and their respective value ranges, as well as the values used as the initial starting point (always required by SMAC), are provided in the supplementary materials.

The second DNN problem (referred to as 8-CNN in subsequent discussions) has 4 continuous and 4 integer hyperparameters of a more complex Convolutional Neural Network (CNN). The CNN consists of two convolutional blocks, each containing one convolutional layer with batch normalization, followed by ReLU activation and 3 × 3 max-pooling. Following the convolutional blocks are two fully-connected layers with LeakyReLU activation,² and a SoftMax layer at the end. We again use the MNIST dataset, and follow the same problem setup as in the first DNN problem by optimizing similar hyperparameters (full details in the supplementary).

² LeakyReLU is defined as f(x) = max(0, x) + α · min(0, x), where α is a hyperparameter.

The third DNN problem incorporates a higher number of hyperparameters. We increase the number of hyperparameters to 15, 10 continuous and 5 integer, and use the same CNN and setup as in the second experiment. This test problem is referred to as 15-CNN in subsequent discussions.

Finally, we design the fourth DNN problem on the more challenging CIFAR-10 dataset, in an even higher-dimensional hyperparameter space, by optimizing 19 hyperparameters, 14 continuous and 5 integer. This hyperparameter optimization problem is referred to as 19-CNN in subsequent discussions. We optimize the hyperparameters of the same CNN network from problem 15-CNN, except that we include four dropout layers, two in each convolutional block and two after each fully-connected layer. We add the dropout rates of these four layers to the hyperparameters being optimized.

Baseline algorithms We compare HORD against Gaussian processes with expected improvement (GP-EI) (Snoek, Larochelle, and Adams 2012), Gaussian processes with predictive entropy search (GP-PES) (Hernández-Lobato, Hoffman, and Ghahramani 2014), the Tree Parzen Estimator (TPE) (Bergstra et al. 2011), and the Sequential Model-based Algorithm Configuration (SMAC) (Hutter, Hoos, and Leyton-Brown 2011) algorithms.

Evaluation budget, trials and initial points DNN hyperparameter evaluation is typically computationally expensive, so it is desirable to find good hyperparameter values within a very limited evaluation budget. Accordingly, we limit the number of optimization iterations to 200 function evaluations, where one function evaluation involves one full training and evaluation of a DNN. We run at least five trials of each experiment using different random seeds.

The SMAC algorithm starts the optimization from a manually set Initial Starting Point (ISP) (also called the default hyperparameter configuration). The advantage of using a manually set ISP as opposed to random samples becomes evident when optimizing a high number of hyperparameters. Unfortunately, both GP-based algorithms, as well as TPE, cannot directly employ a manually set ISP to guide the search; thus they are at a slight disadvantage.

On the other hand, HORD fits the initial surrogate model on n_0 Latin hypercube samples, but can also include manually added points that guide the search to a better region of the hyperparameter space. We denote this variant of HORD as HORD-ISP and test HORD-ISP on the 15-CNN and 19-CNN problems only. We set the ISP following common guidelines for setting the hyperparameters of a CNN network (exact values in the supplementary). We supply the same ISP to both HORD-ISP and SMAC for the 15-CNN and 19-CNN problems. HORD-ISP performed the same as HORD on the 6 and 8 hyperparameter cases.

Implementation details We implement the HORD algorithm with the open-source surrogate optimization toolbox
pySOT (Eriksson, Bindel, and Shoemaker 2015). HORD has a number of algorithm parameters, including n_0, the number of points in the initial Latin hypercube sample, and m, the number [remainder of this passage was not extracted from the source].

[Figure: validation error versus function evaluations for GP-EI, TPE, SMAC, HORD and HORD-ISP; only the legend and axis ticks survive in the extracted text.]

Table 1: We report the mean and standard deviation (in parentheses) of the test set error obtained with the best found hyperparameter configuration after 200 iterations in at least 5 trials with different random seeds. [The table body and the remainder of the caption were not extracted.]
Table 3: We report the minimum percentage of function evaluations (actual numbers of function evaluations are also reported in parentheses) required by HORD (6-MLP, 8-CNN) and HORD-ISP (15-CNN, 19-CNN) to obtain a significantly (via Rank Sum Test) better best validation error than the best error achieved by the other algorithms after 200 function evaluations. If the difference between algorithms is insignificant at 200 function evaluations, we report it as "–". All statistical tests are performed at the 5 percent significance level.

Data Set   MNIST      MNIST      MNIST      CIFAR-10
Problem    6-MLP      8-CNN      15-CNN     19-CNN
GP-EI      —          —          29% (58)   18% (35)
GP-PES     38% (75)   —          14% (27)   —
TPE        67% (134)  77% (154)  62% (123)  32% (64)
SMAC       39% (77)   72% (144)  29% (58)   33% (66)

[...] dimensional problems are not statistically significant at the 95% level because of variability among trials of both methods. However, HORD (in low dimensions) and HORD-ISP (in higher dimensions) do get the best average answer over all dimensions, and it is the only algorithm that performed well over all the dimensions.

To understand the differences in performance between the algorithms, we should consider the influence of the surrogate type and of the procedure for selecting the next point x at which the algorithm evaluates the expensive function f(x). With regard to surrogate type, comparisons have been made to the widely used EGO algorithm (Jones, Schonlau, and Welch 1998), which has a Gaussian process (GP) surrogate and searches the surrogate by maximizing Expected Improvement (EI). EGO functions poorly at higher dimensions (Regis and Shoemaker 2013); in that same paper, the deterministic RBF surrogate optimization method DYCORS was the best on all three multimodal problems. The previous comparisons are based on numbers of function evaluations. In the current study, non-evaluation time was also seen to be much longer for the Gaussian-process-based methods, presumably because they require a relatively large amount of computing to find the kriging surface and the covariance matrix, and that amount of computing increases rapidly as the dimension increases.

However, there are also major differences among the algorithms in the procedure for selecting the next expensive evaluation point based on the surrogate, and these are possibly as important as the type of surrogate used. All of the algorithms compared randomly generate candidate points and select the one point to expensively evaluate with a sorting criterion based on the surrogate and the locations of previously evaluated points. GP-EI, GP-PES, TPE and SMAC all distribute these candidate points uniformly over the domain. By contrast, HORD uses the strategy in LMSRBF (Regis and Shoemaker 2007) to create candidate points by generating them as perturbations (of length that is a normal random variable) around the current best solution. In (Regis and Shoemaker 2007), the uniform distribution and the distribution of normally random perturbations around the current best solution were compared for an RBF surrogate optimization, and the LMSRBF gave better results on a range of problems.

HORD uses the DYCORS strategy to perturb only a fraction of the dimensions, a strategy not used by the other 4 baseline methods. This strategy greatly improves efficiency on higher-dimensional problems in other applications, as shown in (Regis and Shoemaker 2013). These earlier papers thus indicate that the combination of methods used in HORD to generate trial points and to search among them with a weighted average of surrogate value and distance has been very successful on other kinds of continuous global optimization problems. It is not surprising that these methods also work well in HORD for the mixed-integer problems of multimodal machine learning.

Conclusion

We introduce a new hyperparameter optimization algorithm, HORD, in this paper. Our results show that HORD (and its variant, HORD-ISP) can be significantly faster (e.g., up to 6 times faster than the best of the other methods in our numerical tests) on higher-dimensional problems. HORD is more efficient than previous algorithms because it uses a deterministic radial basis function surrogate, and because in each iteration it generates candidate points in a different way that places more of them closer to the current best solution and reduces the expected number of dimensions that are perturbed as the number of iterations increases. We also present and test HORD-ISP, a variant of HORD with an initial guess of the solution. We demonstrated that the method is very effective for optimizing hyperparameters of deep learning algorithms.

In the future, we plan to extend HORD to incorporate parallelization and mixture surrogate models for improving algorithm efficiency. The potential for improving the efficiency of RBF surrogate based algorithms via parallelization is shown in (Krityakierne, Akhtar, and Shoemaker 2016), and via mixture surrogate models in (Muller and Shoemaker 2014).

Acknowledgements

The work of Jiashi Feng was partially supported by National University of Singapore startup grant R-263-000-C08-133 and Ministry of Education of Singapore AcRF Tier One grant R-263-000-C21-112. C. Shoemaker and T. Akhtar were supported in part by NUS startup funds and by the E2S2 CREATE program sponsored by NRF of Singapore.

References

Bergstra, J.; Bardenet, R.; Bengio, Y.; and Kégl, B. 2011. Algorithms for hyper-parameter optimization. In NIPS.

Bergstra, J.; Yamins, D.; and Cox, D. 2013. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In ICML.

Eggensperger, K.; Feurer, M.; Hutter, F.; Bergstra, J.; Snoek, J.; Hoos, H.; and Leyton-Brown, K. 2013. Towards an empirical foundation for assessing bayesian optimization of hyperparameters. In NIPS Workshop on Bayesian Optimization in Theory and Practice.
Eggensperger, K.; Hutter, F.; Hoos, H. H.; and Leyton-
Brown, K. 2015. Efficient benchmarking of hyperparameter
optimizers via surrogates. In AAAI, 1114–1120.
Eriksson, D.; Bindel, D.; and Shoemaker, C. 2015. Surrogate optimization toolbox (pySOT). https://github.com/dme65/pySOT.
Gutmann, H.-M. 2001. A radial basis function method for
global optimization. Journal of Global Optimization.
Hernández-Lobato, J. M.; Hoffman, M. W.; and Ghahra-
mani, Z. 2014. Predictive entropy search for efficient global
optimization of black-box functions. In NIPS.
Hutter, F.; Hoos, H. H.; and Leyton-Brown, K. 2011. Se-
quential model-based optimization for general algorithm
configuration. In Learning and Intelligent Optimization.
Springer.
Jones, D. R.; Schonlau, M.; and Welch, W. J. 1998. Effi-
cient global optimization of expensive black-box functions.
Journal of Global optimization 13(4):455–492.
Krityakierne, T.; Akhtar, T.; and Shoemaker, C. A. 2016.
Sop: parallel surrogate global optimization with pareto cen-
ter selection for computationally expensive single objective
problems. Journal of Global Optimization 66(3):417–437.
Mockus, J.; Tiesis, V.; and Zilinskas, A. 1978. The appli-
cation of bayesian methods for seeking the extremum. To-
wards Global Optimization.
Mueller, J.; Shoemaker, C. A.; and Piche, R. 2013. SO-MI:
A Surrogate Model Algorithm for Computationally Expen-
sive Nonlinear Mixed-Integer, Black-Box Global Optimiza-
tion Problems. Computers and Operations Research.
Muller, J., and Shoemaker, C. 2014. Influence of ensemble surrogate models and sampling strategy on the solution quality of algorithms for computationally expensive black-box global optimization problems. Journal of Global Optimization 60(2):123–144.
Powell, M. J. 1990. The theory of radial basis function
approximation. University of Cambridge.
Regis, R. G., and Shoemaker, C. A. 2007. A stochastic
radial basis function method for the global optimization of
expensive functions. INFORMS Journal on Computing.
Regis, R. G., and Shoemaker, C. A. 2013. Combining radial
basis function surrogates and dynamic coordinate search in
high-dimensional expensive black-box optimization. Engi-
neering Optimization.
Snoek, J.; Rippel, O.; Swersky, K.; Kiros, J. R.; Satish, N.;
Sundaram, N.; Patwary, M. M. A.; Prabhat; and Adams, R. P.
2015. Scalable bayesian optimization using deep neural net-
works. In ICML.
Snoek, J.; Larochelle, H.; and Adams, R. P. 2012. Practical
bayesian optimization of machine learning algorithms. In
NIPS.
Swersky, K.; Snoek, J.; and Adams, R. P. 2014. Freeze-thaw
bayesian optimization. arXiv preprint arXiv:1406.3896.