Practical Bayesian Optimization of Machine
Learning Algorithms
Arturo Fernandez
CS 294
University of California, Berkeley
Tuesday, April 20, 2016
Motivation
Machine Learning Algorithms (MLAs) have hyperparameters that often need to be tuned:
- model hyperparameters (e.g. in Bayesian models)
- regularization parameters
- optimization procedure parameters
  - step size
  - minibatch size
Can we automate the optimization of these high-level parameters?
With some assumptions and Bayesian magic, yes!
Gaussian Process
Usually we observe inputs xi and outputs yi. For now, we assume yi = f(xi) (no noise) for some unknown function f.
Gaussian Processes (GPs) approach the prediction problem by inferring a distribution over functions given the data, p(f | X, y), and then making predictions as
p(y∗ | x∗, X, y) = ∫ p(y∗ | f, x∗) · p(f | X, y) df
A Gaussian Process defines a prior over functions, which becomes a posterior over functions once we see data.
Gaussian Process
A Gaussian Process is defined so that for any n ∈ N and any points x1, . . . , xn,
(f(x1), . . . , f(xn)) ∼ N(µ, K)
where µ ∈ R^n and Ki,j = κ(xi, xj) for a positive definite kernel function κ.
Key Idea: If xi and xj are deemed similar by the kernel, then the outputs of the function at those points should be similar as well.
Gaussian Process
Let the prior on the regression function be a GP:
f(x) ∼ GP( m(x), κ(x, x′) )
m(x) = E[f(x)]
κ(x, x′) = E[(f(x) − m(x))(f(x′) − m(x′))ᵀ]
are the mean and covariance/kernel function, respectively.
For a finite set of points, this defines a joint Gaussian
p(f | X) = N(f | µ, K)
where µ = (m(x1), . . . , m(xn)); usually m(x) = 0 is used.
GP Noise-free and Multivariate Gaussian Refresher
We see a training set D = {(xi, fi), i ∈ [N]} where fi = f(xi).
Given a test set X∗ of size N∗ × D, we want to predict the outputs f∗.
By definition of the GP,
(f, f∗) ∼ N( (µ, µ∗), [ K  K∗ ; K∗ᵀ  K∗∗ ] )
Thus
f∗ ∼ p(f∗ | X∗, X, f) = N(µ̂, Σ̂)
µ̂ = µ(X∗) + K∗ᵀ K⁻¹ (f − µ(X))
Σ̂ = K∗∗ − K∗ᵀ K⁻¹ K∗
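A minimal numpy sketch of these noise-free posterior formulas, assuming a zero prior mean and a squared exponential kernel (introduced on a later slide); the toy inputs and the sin test function are purely illustrative:

import numpy as np

def sq_exp_kernel(A, B, ell=1.0, sigma_f=1.0):
    # Squared exponential kernel matrix between inputs A (n x d) and B (m x d).
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return sigma_f**2 * np.exp(-0.5 * d2 / ell**2)

# Toy noise-free observations of an "unknown" function (here sin, purely for illustration).
X = np.array([[-2.0], [0.0], [1.5]])
f = np.sin(X).ravel()
X_star = np.linspace(-3.0, 3.0, 50)[:, None]   # test inputs

K = sq_exp_kernel(X, X)                # K
K_star = sq_exp_kernel(X, X_star)      # K_*   (N x N_*)
K_ss = sq_exp_kernel(X_star, X_star)   # K_**

# Posterior with m(x) = 0:  mu_hat = K_*^T K^{-1} f,   Sig_hat = K_** - K_*^T K^{-1} K_*
jitter = 1e-10 * np.eye(len(X))        # small jitter stabilizes the solves
mu_hat = K_star.T @ np.linalg.solve(K + jitter, f)
Sig_hat = K_ss - K_star.T @ np.linalg.solve(K + jitter, K_star)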
Priors and Kernels
Samples from a prior p(f | X), using the squared exponential / Gaussian / RBF kernel
κ(x, x′) = σf² exp( −(x − x′)² / (2ℓ²) )   (1-D case)
ℓ controls the horizontal scale of variation, σf² controls the vertical variation.
Automatic Relevance Determination (ARD) squared exponential kernel:
κ(x, x′) = θ0 exp( −½ (x − x′)ᵀ Diag(θ)⁻¹ (x − x′) )
θ = [θ1², . . . , θd²]
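A hedged numpy sketch of both kernels, plus a draw of sample functions from the zero-mean GP prior; the function names and default parameter values are assumptions for illustration:

import numpy as np

def rbf_1d(x, x_prime, ell=1.0, sigma_f=1.0):
    # 1-D squared exponential kernel: sigma_f^2 exp(-(x - x')^2 / (2 ell^2)).
    return sigma_f**2 * np.exp(-0.5 * (x - x_prime)**2 / ell**2)

def ard_se(A, B, theta0=1.0, theta=None):
    # ARD SE kernel: theta0 exp(-0.5 (x - x')^T Diag(theta)^{-1} (x - x')),
    # where theta is the vector of squared per-dimension length scales [theta_1^2, ..., theta_d^2].
    theta = np.ones(A.shape[1]) if theta is None else np.asarray(theta)
    diff = A[:, None, :] - B[None, :, :]              # pairwise differences, shape (n, m, d)
    return theta0 * np.exp(-0.5 * np.sum(diff**2 / theta, axis=-1))

# Draw a few sample functions from the GP prior p(f | X) = N(0, K) on a 1-D grid.
xs = np.linspace(0.0, 10.0, 200)[:, None]
K = ard_se(xs, xs, theta=[1.0]) + 1e-8 * np.eye(len(xs))   # jitter keeps K numerically PSD
prior_samples = np.random.multivariate_normal(np.zeros(len(xs)), K, size=3)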
Noisy Observations
We actually observe y, where y = f(x) + ε and ε ∼ N(0, σy²); then
Cov(y | X) = K + σy² I =: Ky
Assume E[f(x)] = 0 (so E[y] = 0 as well); then, for a single test input x∗,
µ̂ = k∗ᵀ Ky⁻¹ y
Σ̂ = k∗∗ − k∗ᵀ Ky⁻¹ k∗
where
k∗ = [κ(x∗, x1), . . . , κ(x∗, xN)] and k∗∗ = κ(x∗, x∗)
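Relative to the noise-free sketch earlier, the only change is that the Gram matrix becomes Ky = K + σy² I; a small helper under that assumption (the array-shape conventions are mine, not the slides'):

import numpy as np

def gp_posterior_noisy(K, k_star, k_ss, y, sigma_y2):
    # K: (N, N) Gram matrix; k_star: (N,) vector kappa(x*, x_i); k_ss: scalar kappa(x*, x*);
    # y: (N,) noisy observations; sigma_y2: observation noise variance. Zero prior mean assumed.
    Ky = K + sigma_y2 * np.eye(len(y))                      # Ky = K + sigma_y^2 I
    mu_hat = k_star @ np.linalg.solve(Ky, y)                # k_*^T Ky^{-1} y
    var_hat = k_ss - k_star @ np.linalg.solve(Ky, k_star)   # k_** - k_*^T Ky^{-1} k_*
    return mu_hat, var_hat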
Bayesian Optimization with GP Priors
Setup for Bayesian Optimization (x is the vector of MLA hyperparameters):
1. x ∈ X ⊂ R^D and X is bounded
2. f(x) is drawn from a GP prior
3. We want to minimize f(x) on X
4. Observations are of the form {xn, yn}, n = 1, . . . , N, with yn ∼ N(f(xn), ν)
5. An acquisition function (AF), a : X → R+, is used via xnext = arg max_x a(x) (a minimal loop is sketched below)
- a(x) = a(x; {xn, yn}, θ) depends on the previous observations and the GP hyperparameters
- AFs depend on the model solely through
  - the predictive mean function µ(x; {xn, yn}, θ)
  - the predictive variance function σ²(x; {xn, yn}, θ)
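A self-contained sketch of that loop, assuming a zero-mean GP with a fixed-length-scale RBF kernel and Expected Improvement (defined formally on a later slide) as the acquisition function; all names, defaults, and the grid of candidate points are illustrative assumptions rather than the paper's implementation:

import numpy as np
from scipy.stats import norm

def rbf(A, B, ell=1.0):
    # Squared exponential kernel matrix between A (n x d) and B (m x d), unit amplitude.
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * d2 / ell**2)

def gp_predict(X, y, X_cand, noise=1e-6):
    # Zero-mean GP predictive mean and standard deviation at the candidate points.
    Ky = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, X_cand)
    mu = Ks.T @ np.linalg.solve(Ky, y)
    var = 1.0 - np.sum(Ks * np.linalg.solve(Ky, Ks), axis=0)
    return mu, np.sqrt(np.maximum(var, 1e-12))

def expected_improvement(mu, sigma, f_best):
    # a_EI = sigma * [gamma * Phi(gamma) + phi(gamma)], gamma = (f_best - mu) / sigma.
    gamma = (f_best - mu) / sigma
    return sigma * (gamma * norm.cdf(gamma) + norm.pdf(gamma))

def bayes_opt(f, candidates, n_init=3, n_iter=20, seed=0):
    # Fit a GP to the evaluations so far, pick x_next = arg max_x a(x) over the
    # candidate set, evaluate f there, and repeat. Returns the best point found.
    rng = np.random.default_rng(seed)
    X = candidates[rng.choice(len(candidates), size=n_init, replace=False)]
    y = np.array([f(x) for x in X])
    for _ in range(n_iter):
        mu, sigma = gp_predict(X, y, candidates)
        a = expected_improvement(mu, sigma, y.min())
        x_next = candidates[np.argmax(a)]
        X = np.vstack([X, x_next])
        y = np.append(y, f(x_next))
    return X[np.argmin(y)], y.min()

# Toy usage: minimize a 1-D quadratic over a grid of candidate hyperparameter values.
cands = np.linspace(-3.0, 3.0, 200)[:, None]
best_x, best_y = bayes_opt(lambda x: float((x**2).sum()), cands)

In practice the acquisition function would be maximized with a continuous optimizer rather than over a fixed grid, and the GP hyperparameters would be re-fit or marginalized (as in the slides that follow) at every iteration.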
What is f ?
- The framework is useful when evaluations of f are expensive.
- This is the case when each evaluation requires training a machine learning algorithm.
- Thus, we should be smart about where we evaluate next.
Acquisition Functions
Let φ, Φ be the pdf and cdf of a standard normal, and let xbest = arg min_{xn} f(xn).
1. Probability of Improvement:
aPI(x; {xn, yn}, θ) = Φ(γ(x)) = P(N ≤ γ(x))
γ(x) = ( f(xbest) − µ(x; {xn, yn}, θ) ) / σ(x; {xn, yn}, θ)
where N ∼ N(0, 1).
Points that have a high probability of being infinitesimally less than f(xbest) will be chosen over points that offer larger gains but less certainty.
2. Expected Improvement (over the current best) [BCd10]:
aEI(x; {xn, yn}, θ) = σ(x; {xn, yn}, θ) · [ γ(x)Φ(γ(x)) + φ(γ(x)) ]
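Both acquisition functions are straightforward to evaluate from the GP's predictive mean and standard deviation; a small numpy transcription, vectorized over candidate points (the array-in/array-out signature is an assumption):

import numpy as np
from scipy.stats import norm

def probability_of_improvement(mu, sigma, f_best):
    # a_PI(x) = Phi(gamma(x)),  gamma(x) = (f_best - mu(x)) / sigma(x)
    gamma = (f_best - mu) / sigma
    return norm.cdf(gamma)

def expected_improvement(mu, sigma, f_best):
    # a_EI(x) = sigma(x) * [gamma(x) Phi(gamma(x)) + phi(gamma(x))]   [BCd10]
    gamma = (f_best - mu) / sigma
    return sigma * (gamma * norm.cdf(gamma) + norm.pdf(gamma))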
Covariance Function and its Hyperparameters
The ARD SE kernel is too smooth; instead, use the ARD Matérn 5/2 kernel:
K_M52(x, x′) = θ0 ( 1 + √(5 r²(x, x′)) + (5/3) r²(x, x′) ) exp( −√(5 r²(x, x′)) )
where r²(x, x′) = Σ_{d=1}^D (x_d − x′_d)² / θ_d².
Sample functions are twice differentiable.
D + 3 hyperparameters:
- D length scales θ_{1:D}
- amplitude θ0
- observation noise ν
- constant mean m
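A numpy sketch of this kernel exactly as written above (here theta is the vector of per-dimension length scales θ_{1:D}; names and defaults are assumptions):

import numpy as np

def ard_matern52(A, B, theta0=1.0, theta=None):
    # K_M52(x, x') = theta0 (1 + sqrt(5 r^2) + (5/3) r^2) exp(-sqrt(5 r^2)),
    # with r^2(x, x') = sum_d (x_d - x'_d)^2 / theta_d^2.
    theta = np.ones(A.shape[1]) if theta is None else np.asarray(theta)
    diff = A[:, None, :] - B[None, :, :]          # pairwise differences, shape (n, m, D)
    r2 = np.sum(diff**2 / theta**2, axis=-1)
    sqrt5r = np.sqrt(5.0 * r2)
    return theta0 * (1.0 + sqrt5r + (5.0 / 3.0) * r2) * np.exp(-sqrt5r)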
Integrated Acquisition Function
To be fully Bayesian, we should marginalize over the GP hyperparameters (denoted θ) by computing the Integrated Acquisition Function (IAF)
â(x; {xn, yn}) = ∫ a(x; {xn, yn}, θ) · p(θ | {xn, yn}) dθ
- This expectation is a good way to account for the uncertainty in the chosen hyperparameters
- We can blend a(·) functions arising from samples of the posterior over GP hyperparameters, and then use a Monte Carlo estimate of the Integrated Expected Improvement (IEI)
- To draw these MC samples, use slice sampling [MP10]
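The MC estimate itself is just an average of the acquisition function over hyperparameter samples; a sketch under assumed interfaces (predict_fn, acq_fn, and the shape of hyper_samples are illustrative, and the samples would come from slice sampling [MP10]):

import numpy as np

def integrated_acquisition(x_cand, hyper_samples, predict_fn, acq_fn, f_best):
    # MC estimate of the IAF: average a(x; theta) over samples theta ~ p(theta | {x_n, y_n}).
    # predict_fn(x_cand, theta) -> (mu, sigma) and acq_fn(mu, sigma, f_best) are assumed interfaces.
    acq = np.zeros(len(x_cand))
    for theta in hyper_samples:
        mu, sigma = predict_fn(x_cand, theta)
        acq += acq_fn(mu, sigma, f_best)
    return acq / len(hyper_samples)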
Costs
- We don't just care about minimizing f
- Evaluating f can result in vastly different execution times depending on the MLA hyperparameters
- The paper proposes optimizing expected improvement per second
- We don't know the true f, and we also don't know the duration function c(x) : X → R+
- Solution: model ln c(x) along with f; assuming independence makes the computation easier
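One simple way to realize this, shown only as a hedged sketch: fit a second, independent GP to the log durations and divide EI by the predicted cost (using exp of the predictive mean of ln c(x) is my simplification, not necessarily the paper's exact estimator):

import numpy as np

def ei_per_second(ei, mu_log_cost):
    # ei: EI values at candidate points.
    # mu_log_cost: predictive mean of ln c(x) from an independent GP fit to observed log durations.
    return ei / np.exp(mu_log_cost)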
Parallelization Scheme
Use batch parallelism plus a sequential strategy over yet-to-be-evaluated points, by computing MC estimates of the AF over different possible realizations of the pending y's.
- N evaluations have completed, {xn, yn}, n = 1, . . . , N
- J evaluations are pending at locations {x̄j}, j = 1, . . . , J
- Choose the new point based on the expected AF under all possible outcomes of the pending evaluations:
â(x; {xn, yn}, θ, {x̄j}) = ∫_{R^J} a(x; {xn, yn}, θ, {x̄j, yj}) · p({yj} | {x̄j}, {xn, yn}) dy1 · · · dyJ
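Operationally this becomes a "fantasy" average: sample possible outcomes yj for the pending points from the current GP posterior, recompute the AF as if those outcomes had been observed, and average. A sketch under assumed interfaces (sample_fn and acq_given_data_fn are placeholders, not the paper's code):

import numpy as np

def parallel_acquisition(x_cand, pending_X, n_fantasies, sample_fn, acq_given_data_fn):
    # sample_fn(pending_X) -> one joint draw of fantasy values y_j from the current GP posterior.
    # acq_given_data_fn(x_cand, pending_X, y_fantasy) -> AF values after conditioning the GP
    # on the fantasized observations. Both interfaces are assumptions for this sketch.
    acq = np.zeros(len(x_cand))
    for _ in range(n_fantasies):
        y_fantasy = sample_fn(pending_X)
        acq += acq_given_data_fn(x_cand, pending_X, y_fantasy)
    return acq / n_fantasies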
Methods and Metrics
- Expected improvement with GP hyperparameter marginalization, denoted GP EI MCMC
- Expected improvement with optimized hyperparameters, denoted GP EI Opt
- Expected improvement per second, denoted GP EI per Second
- N-times-parallelized GP EI MCMC, denoted Nx GP EI MCMC
Online LDA
Hyperparameters:
- learning rate ρt = (τ0 + t)^(−κ) → two hyperparameters (τ0, κ)
- minibatch size
The cited paper uses an exhaustive grid search of size 6 × 6 × 8 (288 points).
Results
3-layer CNN
Hyperparameters:
- number of epochs to run the model
- learning rate
- four weight costs (one for each layer and one for the softmax output weights)
- width, scale, and power of the response normalization on the pooling layers
The cited paper uses an exhaustive grid search of size 6 × 6 × 8 (288 points).
Results
References
E. Brochu, V. M. Cora, and N. de Freitas. A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning. ArXiv e-prints, December 2010.
I. Murray and R. Prescott Adams. Slice sampling covariance hyperparameters of latent Gaussian models. ArXiv e-prints, June 2010.
J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian Optimization of Machine Learning Algorithms. ArXiv e-prints, June 2012.