Practical Bayesian Optimization of Machine
Learning Algorithms
Arturo Fernandez
CS 294
University of California, Berkeley
Tuesday, April 20, 2016
Motivation
Machine Learning Algorithms (MLAs) have hyperparameters that often need to be tuned:
- model hyperparameters (e.g. in Bayesian models)
- regularization parameters
- optimization procedure parameters
  - step size
  - minibatch size
Can we automate the optimization of these high-level parameters?
With some assumptions and Bayesian magic, yes!
Gaussian Process
Usually we observe inputs xi and outputs yi. For now, we assume yi = f(xi) (no noise) for some unknown function f.
Gaussian Processes (GPs) approach the prediction problem by inferring a distribution over functions given the data, p(f | X, y), and then making predictions as
p(y∗ | x∗, X, y) = ∫ p(y∗ | f, x∗) · p(f | X, y) df
A Gaussian Process defines a prior over functions, which becomes a posterior over functions once we see data.
Gaussian Process
A Gaussian Process is defined so that for any n ∈ N and any points x1, . . . , xn,
(f(x1), . . . , f(xn)) ∼ N(µ, K)
where µ ∈ R^n and Ki,j = κ(xi, xj) for a positive definite kernel function κ.
Key Idea: If xi and xj are deemed similar by the kernel, then the outputs of the function at those points should be similar as well.
Gaussian Process
Let the prior on the regression function be a GP:
f(x) ∼ GP( m(x), κ(x, x′) )
m(x) = E[f(x)]
κ(x, x′) = E[(f(x) − m(x))(f(x′) − m(x′))ᵀ]
are the mean and covariance/kernel function, respectively.
For a finite set of points, this defines a joint Gaussian
p(f | X) = N(f | µ, K)
where µ = (m(x1), . . . , m(xn)); usually m(x) = 0 is used.
GP Noise-free and Multivariate Gaussian Refresher
We see a training set D = {(xi, fi), i ∈ [N]} where fi = f(xi).
Given a test set X∗ of size N∗ × D, we want to predict the outputs f∗.
By definition of the GP,
(f, f∗) ∼ N( (µ, µ∗), [ K  K∗ ; K∗ᵀ  K∗∗ ] )
Thus
f∗ ∼ p(f∗ | X∗, X, f) = N(µ̂, Σ̂)
µ̂ = µ(X∗) + K∗ᵀ K⁻¹ (f − µ(X))
Σ̂ = K∗∗ − K∗ᵀ K⁻¹ K∗
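A minimal numpy sketch of these noise-free posterior formulas, assuming a zero prior mean and a squared exponential kernel (introduced on a later slide); the toy inputs and the sin test function are purely illustrative:

import numpy as np

def sq_exp_kernel(A, B, ell=1.0, sigma_f=1.0):
    # Squared exponential kernel matrix between inputs A (n x d) and B (m x d).
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return sigma_f**2 * np.exp(-0.5 * d2 / ell**2)

# Toy noise-free observations of an "unknown" function (here sin, purely for illustration).
X = np.array([[-2.0], [0.0], [1.5]])
f = np.sin(X).ravel()
X_star = np.linspace(-3.0, 3.0, 50)[:, None]   # test inputs

K = sq_exp_kernel(X, X)                # K
K_star = sq_exp_kernel(X, X_star)      # K_*   (N x N_*)
K_ss = sq_exp_kernel(X_star, X_star)   # K_**

# Posterior with m(x) = 0:  mu_hat = K_*^T K^{-1} f,   Sig_hat = K_** - K_*^T K^{-1} K_*
jitter = 1e-10 * np.eye(len(X))        # small jitter stabilizes the solves
mu_hat = K_star.T @ np.linalg.solve(K + jitter, f)
Sig_hat = K_ss - K_star.T @ np.linalg.solve(K + jitter, K_star)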
Priors and Kernels
Samples from a prior p(f | X), using the squared exponential / Gaussian / RBF kernel
κ(x, x′) = σf² exp( −(x − x′)² / (2ℓ²) )   (1-D case)
ℓ controls the horizontal scale of variation, σf² controls the vertical variation.
Automatic Relevance Determination (ARD) squared exponential kernel:
κ(x, x′) = θ0 exp( −½ (x − x′)ᵀ Diag(θ)⁻¹ (x − x′) )
θ = [θ1², . . . , θd²]
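A hedged numpy sketch of both kernels, plus a draw of sample functions from the zero-mean GP prior; the function names and default parameter values are assumptions for illustration:

import numpy as np

def rbf_1d(x, x_prime, ell=1.0, sigma_f=1.0):
    # 1-D squared exponential kernel: sigma_f^2 exp(-(x - x')^2 / (2 ell^2)).
    return sigma_f**2 * np.exp(-0.5 * (x - x_prime)**2 / ell**2)

def ard_se(A, B, theta0=1.0, theta=None):
    # ARD SE kernel: theta0 exp(-0.5 (x - x')^T Diag(theta)^{-1} (x - x')),
    # where theta is the vector of squared per-dimension length scales [theta_1^2, ..., theta_d^2].
    theta = np.ones(A.shape[1]) if theta is None else np.asarray(theta)
    diff = A[:, None, :] - B[None, :, :]              # pairwise differences, shape (n, m, d)
    return theta0 * np.exp(-0.5 * np.sum(diff**2 / theta, axis=-1))

# Draw a few sample functions from the GP prior p(f | X) = N(0, K) on a 1-D grid.
xs = np.linspace(0.0, 10.0, 200)[:, None]
K = ard_se(xs, xs, theta=[1.0]) + 1e-8 * np.eye(len(xs))   # jitter keeps K numerically PSD
prior_samples = np.random.multivariate_normal(np.zeros(len(xs)), K, size=3)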
Noisy Observations
We actually observe y, where y = f(x) + ε and ε ∼ N(0, σy²); then
Cov(y | X) = K + σy² I =: Ky
Assume E[f(x)] = 0 (so E[y] = 0 as well); then, for a single test input x∗,
µ̂ = k∗ᵀ Ky⁻¹ y
Σ̂ = k∗∗ − k∗ᵀ Ky⁻¹ k∗
where
k∗ = [κ(x∗, x1), . . . , κ(x∗, xN)] and k∗∗ = κ(x∗, x∗)
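Relative to the noise-free sketch earlier, the only change is that the Gram matrix becomes Ky = K + σy² I; a small helper under that assumption (the array-shape conventions are mine, not the slides'):

import numpy as np

def gp_posterior_noisy(K, k_star, k_ss, y, sigma_y2):
    # K: (N, N) Gram matrix; k_star: (N,) vector kappa(x*, x_i); k_ss: scalar kappa(x*, x*);
    # y: (N,) noisy observations; sigma_y2: observation noise variance. Zero prior mean assumed.
    Ky = K + sigma_y2 * np.eye(len(y))                      # Ky = K + sigma_y^2 I
    mu_hat = k_star @ np.linalg.solve(Ky, y)                # k_*^T Ky^{-1} y
    var_hat = k_ss - k_star @ np.linalg.solve(Ky, k_star)   # k_** - k_*^T Ky^{-1} k_*
    return mu_hat, var_hat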
Bayesian Optimization with GP Priors
Setup for Bayesian Optimization (x is the vector of MLA hyperparameters):
1. x ∈ X ⊂ R^D and X is bounded
2. f(x) is drawn from a GP prior
3. We want to minimize f(x) on X
4. Observations are of the form {xn, yn}, n = 1, . . . , N, with yn ∼ N(f(xn), ν)
5. An acquisition function (AF), a : X → R+, is used via xnext = arg max_x a(x) (a minimal loop is sketched below)
- a(x) = a(x; {xn, yn}, θ) depends on the previous observations and the GP hyperparameters
- AFs depend on the model solely through
  - the predictive mean function µ(x; {xn, yn}, θ)
  - the predictive variance function σ²(x; {xn, yn}, θ)
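A self-contained sketch of that loop, assuming a zero-mean GP with a fixed-length-scale RBF kernel and Expected Improvement (defined formally on a later slide) as the acquisition function; all names, defaults, and the grid of candidate points are illustrative assumptions rather than the paper's implementation:

import numpy as np
from scipy.stats import norm

def rbf(A, B, ell=1.0):
    # Squared exponential kernel matrix between A (n x d) and B (m x d), unit amplitude.
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * d2 / ell**2)

def gp_predict(X, y, X_cand, noise=1e-6):
    # Zero-mean GP predictive mean and standard deviation at the candidate points.
    Ky = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, X_cand)
    mu = Ks.T @ np.linalg.solve(Ky, y)
    var = 1.0 - np.sum(Ks * np.linalg.solve(Ky, Ks), axis=0)
    return mu, np.sqrt(np.maximum(var, 1e-12))

def expected_improvement(mu, sigma, f_best):
    # a_EI = sigma * [gamma * Phi(gamma) + phi(gamma)], gamma = (f_best - mu) / sigma.
    gamma = (f_best - mu) / sigma
    return sigma * (gamma * norm.cdf(gamma) + norm.pdf(gamma))

def bayes_opt(f, candidates, n_init=3, n_iter=20, seed=0):
    # Fit a GP to the evaluations so far, pick x_next = arg max_x a(x) over the
    # candidate set, evaluate f there, and repeat. Returns the best point found.
    rng = np.random.default_rng(seed)
    X = candidates[rng.choice(len(candidates), size=n_init, replace=False)]
    y = np.array([f(x) for x in X])
    for _ in range(n_iter):
        mu, sigma = gp_predict(X, y, candidates)
        a = expected_improvement(mu, sigma, y.min())
        x_next = candidates[np.argmax(a)]
        X = np.vstack([X, x_next])
        y = np.append(y, f(x_next))
    return X[np.argmin(y)], y.min()

# Toy usage: minimize a 1-D quadratic over a grid of candidate hyperparameter values.
cands = np.linspace(-3.0, 3.0, 200)[:, None]
best_x, best_y = bayes_opt(lambda x: float((x**2).sum()), cands)

In practice the acquisition function would be maximized with a continuous optimizer rather than over a fixed grid, and the GP hyperparameters would be re-fit or marginalized (as in the slides that follow) at every iteration.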
What is f ?
- The framework is useful when evaluations of f are expensive.
- This is the case when each evaluation requires training a machine learning algorithm.
- Thus, we should be smart about where we evaluate next.
Acquisition Functions
Let φ, Φ be the pdf and cdf of a standard normal, and let xbest = arg min_{xn} f(xn).
1. Probability of Improvement:
aPI(x; {xn, yn}, θ) = Φ(γ(x)) = P(N ≤ γ(x))
γ(x) = ( f(xbest) − µ(x; {xn, yn}, θ) ) / σ(x; {xn, yn}, θ)
where N ∼ N(0, 1).
Points that have a high probability of being infinitesimally less than f(xbest) will be chosen over points that offer larger gains but less certainty.
2. Expected Improvement (over the current best) [BCd10]:
aEI(x; {xn, yn}, θ) = σ(x; {xn, yn}, θ) · [ γ(x)Φ(γ(x)) + φ(γ(x)) ]
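Both acquisition functions are straightforward to evaluate from the GP's predictive mean and standard deviation; a small numpy transcription, vectorized over candidate points (the array-in/array-out signature is an assumption):

import numpy as np
from scipy.stats import norm

def probability_of_improvement(mu, sigma, f_best):
    # a_PI(x) = Phi(gamma(x)),  gamma(x) = (f_best - mu(x)) / sigma(x)
    gamma = (f_best - mu) / sigma
    return norm.cdf(gamma)

def expected_improvement(mu, sigma, f_best):
    # a_EI(x) = sigma(x) * [gamma(x) Phi(gamma(x)) + phi(gamma(x))]   [BCd10]
    gamma = (f_best - mu) / sigma
    return sigma * (gamma * norm.cdf(gamma) + norm.pdf(gamma))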
Covariance Function and its Hyperparameters
The ARD SE kernel is too smooth; instead, use the ARD Matérn 5/2 kernel:
K_M52(x, x′) = θ0 ( 1 + √(5 r²(x, x′)) + (5/3) r²(x, x′) ) exp( −√(5 r²(x, x′)) )
where r²(x, x′) = Σ_{d=1}^D (x_d − x′_d)² / θ_d².
Sample functions are twice differentiable.
D + 3 hyperparameters:
- D length scales θ_{1:D}
- amplitude θ0
- observation noise ν
- constant mean m
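A numpy sketch of this kernel exactly as written above (here theta is the vector of per-dimension length scales θ_{1:D}; names and defaults are assumptions):

import numpy as np

def ard_matern52(A, B, theta0=1.0, theta=None):
    # K_M52(x, x') = theta0 (1 + sqrt(5 r^2) + (5/3) r^2) exp(-sqrt(5 r^2)),
    # with r^2(x, x') = sum_d (x_d - x'_d)^2 / theta_d^2.
    theta = np.ones(A.shape[1]) if theta is None else np.asarray(theta)
    diff = A[:, None, :] - B[None, :, :]          # pairwise differences, shape (n, m, D)
    r2 = np.sum(diff**2 / theta**2, axis=-1)
    sqrt5r = np.sqrt(5.0 * r2)
    return theta0 * (1.0 + sqrt5r + (5.0 / 3.0) * r2) * np.exp(-sqrt5r)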
Integrated Acquisition Function
To be fully Bayesian, we should marginalize over the GP hyperparameters (denoted θ) by computing the Integrated Acquisition Function (IAF)
â(x; {xn, yn}) = ∫ a(x; {xn, yn}, θ) · p(θ | {xn, yn}) dθ
- This expectation is a good way to account for the uncertainty in the chosen hyperparameters
- We can blend a(·) functions arising from samples of the posterior over GP hyperparameters, and then use a Monte Carlo estimate of the Integrated Expected Improvement (IEI)
- To draw these MC samples, use slice sampling [MP10]
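The MC estimate itself is just an average of the acquisition function over hyperparameter samples; a sketch under assumed interfaces (predict_fn, acq_fn, and the shape of hyper_samples are illustrative, and the samples would come from slice sampling [MP10]):

import numpy as np

def integrated_acquisition(x_cand, hyper_samples, predict_fn, acq_fn, f_best):
    # MC estimate of the IAF: average a(x; theta) over samples theta ~ p(theta | {x_n, y_n}).
    # predict_fn(x_cand, theta) -> (mu, sigma) and acq_fn(mu, sigma, f_best) are assumed interfaces.
    acq = np.zeros(len(x_cand))
    for theta in hyper_samples:
        mu, sigma = predict_fn(x_cand, theta)
        acq += acq_fn(mu, sigma, f_best)
    return acq / len(hyper_samples)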
Costs
- We don't just care about minimizing f
- Evaluating f can result in vastly different execution times depending on the MLA hyperparameters
- The paper proposes optimizing expected improvement per second
- We don't know the true f, and we also don't know the duration function c(x) : X → R+
- Solution: model ln c(x) along with f; assuming independence makes the computation easier
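One simple way to realize this, shown only as a hedged sketch: fit a second, independent GP to the log durations and divide EI by the predicted cost (using exp of the predictive mean of ln c(x) is my simplification, not necessarily the paper's exact estimator):

import numpy as np

def ei_per_second(ei, mu_log_cost):
    # ei: EI values at candidate points.
    # mu_log_cost: predictive mean of ln c(x) from an independent GP fit to observed log durations.
    return ei / np.exp(mu_log_cost)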
Parallelization Scheme
Use batch parallelism plus a sequential strategy over yet-to-be-evaluated points, by computing MC estimates of the AF over different possible realizations of the pending y's.
- N evaluations have completed, {xn, yn}, n = 1, . . . , N
- J evaluations are pending at locations {x̄j}, j = 1, . . . , J
- Choose the new point based on the expected AF under all possible outcomes of the pending evaluations:
â(x; {xn, yn}, θ, {x̄j}) = ∫_{R^J} a(x; {xn, yn}, θ, {x̄j, yj}) · p({yj} | {x̄j}, {xn, yn}) dy1 · · · dyJ
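Operationally this becomes a "fantasy" average: sample possible outcomes yj for the pending points from the current GP posterior, recompute the AF as if those outcomes had been observed, and average. A sketch under assumed interfaces (sample_fn and acq_given_data_fn are placeholders, not the paper's code):

import numpy as np

def parallel_acquisition(x_cand, pending_X, n_fantasies, sample_fn, acq_given_data_fn):
    # sample_fn(pending_X) -> one joint draw of fantasy values y_j from the current GP posterior.
    # acq_given_data_fn(x_cand, pending_X, y_fantasy) -> AF values after conditioning the GP
    # on the fantasized observations. Both interfaces are assumptions for this sketch.
    acq = np.zeros(len(x_cand))
    for _ in range(n_fantasies):
        y_fantasy = sample_fn(pending_X)
        acq += acq_given_data_fn(x_cand, pending_X, y_fantasy)
    return acq / n_fantasies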
Methods and Metrics
- Expected improvement with GP hyperparameter marginalization, denoted GP EI MCMC
- Expected improvement with optimized hyperparameters, denoted GP EI Opt
- Expected improvement per second, denoted GP EI per Second
- N-times-parallelized GP EI MCMC, denoted Nx GP EI MCMC
Online LDA
Hyperparameters:
- learning rate ρt = (τ0 + t)^(−κ) → two hyperparameters (τ0, κ)
- minibatch size
The cited paper uses an exhaustive grid search of size 6 × 6 × 8 (288 points).
Results
3-layer CNN
Hyperparameters:
- number of epochs to run the model
- learning rate
- four weight costs (one for each layer and one for the softmax output weights)
- width, scale, and power of the response normalization on the pooling layers
The cited paper uses an exhaustive grid search of size 6 × 6 × 8 (288 points).
Results
References
E. Brochu, V. M. Cora, and N. de Freitas. A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning. ArXiv e-prints, December 2010.
I. Murray and R. Prescott Adams. Slice sampling covariance hyperparameters of latent Gaussian models. ArXiv e-prints, June 2010.
J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian Optimization of Machine Learning Algorithms. ArXiv e-prints, June 2012.