
Bayesian Optimization Concept Explained in Layman Terms

Bayesian Optimization for Dummies

Wei Wang · Published in Towards Data Science · 9 min read · Mar 18, 2020
Bayesian Optimization has been widely used for hyperparameter tuning in the
Machine Learning world. Despite the many terms and math formulas involved,
the concept behind it turns out to be very simple. The goal of this article is
to share what I learned about Bayesian Optimization with a straightforward
interpretation of the textbook terminology, and hopefully help you understand
what Bayesian Optimization is in a short period of time.

The Overview of Hyperparameter Optimization

For completeness, let's start with a basic overview of hyperparameter
optimization methods, of which there are generally four types:

Manual Search, Random Search, Grid Search, and Bayesian Optimization

Bayesian Optimization differs from Random Search and Grid Search in that
it improves the search speed using past performance, whereas the other
two methods are independent of past evaluations. In that sense, Bayesian
Optimization is like Manual Search. Say you are manually optimizing the
hyperparameters of a Random Forest regression model. First, you would try a
set of parameters, look at the result, change one of the parameters, rerun,
and compare the results, so that you know whether you are moving in the
right direction. Bayesian Optimization does a similar thing: the performance
of past hyperparameters affects future decisions. In comparison, Random
Search and Grid Search do not take past performance into account when
determining new hyperparameters to evaluate. Thus, Bayesian Optimization
is a much more efficient method.

How Bayesian Optimization Works

Let's continue with our example of optimizing hyperparameters for a
Random Forest regression model. Say we want to find the set of
hyperparameters that minimizes RMSE. Here, the function that computes
RMSE is called the objective function. If we knew the probability
distribution of our objective function (in simple words, if we knew
the shape of the objective function), we could simply run gradient
descent and find the global minimum. However, since we don't know the
distribution of the RMSE score (this is actually what we are trying to
find out), we need Bayesian Optimization to help us decipher this black-box
model.
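As a concrete stand-in, here is a minimal sketch of such a black-box objective. In the real setting it would train the Random Forest and return its cross-validated RMSE; here a cheap synthetic curve plays that role so the sketch stays runnable. Its shape (a single minimum at n_estimators = 120) is an assumption for illustration, not something we would know in advance.

```python
import numpy as np

# A stand-in for the expensive black-box objective described above. In the
# real setting this would train the Random Forest and return its
# cross-validated RMSE; here a cheap synthetic curve plays that role.
# The single minimum at n_estimators = 120 is an assumption.
def objective(n_estimators: float) -> float:
    """Pretend RMSE as a function of one hyperparameter."""
    return 12.0 + 0.002 * (n_estimators - 120.0) ** 2

# We can only afford a handful of evaluations, like the 10 samples in Fig 2:
rng = np.random.default_rng(0)
samples_x = rng.uniform(10, 300, size=10)
samples_y = np.array([objective(x) for x in samples_x])
best_so_far = samples_y.min()  # the lowest RMSE seen among the 10 samples
```

In practice each call to `objective` is expensive (a full model training run), which is exactly why we cannot afford to evaluate it everywhere.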

So what is Bayesian Optimization?

Bayesian Optimization builds a probability model of the objective function
and uses it to select hyperparameters to evaluate in the true objective
function.

This sentence might sound complicated but actually delivers a simple
message. Let's break it down:

“Bayesian Optimization builds a probability model of the objective function”

The true objective function is a fixed function. Let’s say it is supposed to look
like Fig 1, but as I mentioned, we don’t know this at the beginning of the
hyperparameter tuning.
Fig 1: The True Objective Function

If we had unlimited resources, we would compute every single point of the
objective function so that we knew its actual shape (in our example, we
would keep calling the Random Forest regression model until we had RMSE
scores for all possible hyperparameter combinations). However, that's
impossible. So let's say we only have 10 samples from the true objective
function, represented as black circles in Fig 2:

Fig 2: 10 samples from the true objective function


Using these 10 samples, we need to build a surrogate model (also called the
response surface model) to approximate the true objective function. Take a
look at Fig 3. The surrogate model is represented by the blue line, and the
blue shade represents the uncertainty (the standard deviation).

Fig 3: Initiate the surrogate model

A surrogate model by definition is "the probability representation of the
objective function", which is essentially a model trained on the
(hyperparameter, true objective function score) pairs. In math, it is p(objective
function score | hyperparameter). There are different ways to construct a
surrogate model, but I will come back to this later.
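One common choice of surrogate (covered later in this article) is a Gaussian Process, which yields both a mean prediction (the blue line in Fig 3) and a standard deviation (the blue shade). As a hedged illustration, here is a minimal from-scratch version; the RBF kernel, its length scale, and the toy training data are all assumptions for the sketch.

```python
import numpy as np

# A minimal Gaussian-process surrogate fitted to (hyperparameter, score)
# pairs. It returns a posterior mean (the blue line in Fig 3) and a
# standard deviation (the blue shade). The RBF kernel and its length
# scale are assumptions; real libraries tune these.
def rbf_kernel(a, b, length_scale=1.0):
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def fit_predict(train_x, train_y, test_x, noise=1e-6):
    """Return the GP posterior mean and std at test_x."""
    K = rbf_kernel(train_x, train_x) + noise * np.eye(len(train_x))
    K_s = rbf_kernel(test_x, train_x)
    K_inv = np.linalg.inv(K)
    mean = K_s @ K_inv @ train_y
    var = 1.0 - np.sum(K_s @ K_inv * K_s, axis=1)  # prior variance is 1
    return mean, np.sqrt(np.maximum(var, 0.0))

# Toy observations standing in for (hyperparameter, score) pairs:
train_x = np.array([0.0, 1.0, 2.0, 3.0])
train_y = np.sin(train_x)
mean, std = fit_predict(train_x, train_y, np.array([1.5]))
```

Note how the posterior interpolates the observations: the std shrinks to near zero at observed points and grows back toward the prior far away from them, which is exactly the "blue shade" behavior in Fig 3.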

“And uses it to select hyperparameters”

Now that we have 10 samples of the objective function, how should we decide
which hyperparameter to try as the 11th sample? We need an acquisition
function (also called the selection function). The next hyperparameter of
choice is where the acquisition function is maximized. In Fig 4, the green
shade is the acquisition function and the red straight line is where it is
maximized. The corresponding hyperparameter and its objective function
score, represented as a red circle, are used as the 11th sample to update
the surrogate model.
Fig 4: Maximize acquisition function to select the next point
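To make the selection step concrete, here is a sketch of an Expected Improvement acquisition (for minimization) evaluated from a surrogate's posterior mean and standard deviation. The candidate grid and the posterior values below are made-up numbers, not read off the article's figures.

```python
import numpy as np
from math import erf

def norm_pdf(z):
    return np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + np.vectorize(erf)(z / np.sqrt(2.0)))

def expected_improvement(mean, std, best_y):
    """EI for minimization: large where the mean is low or the std is high."""
    z = (best_y - mean) / np.maximum(std, 1e-12)
    return (best_y - mean) * norm_cdf(z) + std * norm_pdf(z)

candidates = np.linspace(0, 10, 5)               # hypothetical hyperparameter grid
mean = np.array([13.0, 12.5, 12.2, 13.5, 14.0])  # surrogate posterior mean (assumed)
std = np.array([0.1, 0.8, 0.3, 1.0, 0.2])        # surrogate posterior std (assumed)
best_y = 12.0                                    # lowest score observed so far

ei = expected_improvement(mean, std, best_y)
next_x = candidates[ei.argmax()]                 # the "red line" location in Fig 4
```

Notice that the winner is not the point with the lowest mean: a slightly worse mean with much higher uncertainty can earn a larger Expected Improvement, which is the exploration/exploitation trade-off at work.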

“To evaluate in the true objective function”

As described above, after using the acquisition function to determine the
next hyperparameter, the true objective function score of this new
hyperparameter is obtained. Since the surrogate model is trained on the
(hyperparameter, true objective function score) pairs, adding a new data point
updates the surrogate model.

…Repeat the above steps until the maximum time or maximum number of
iterations is reached. And boom! You now (hopefully) have an accurate
approximation of the true objective function and can easily find the global
minimum among the past evaluated samples. Your Bayesian Optimization is
complete!

Putting It All Together

To summarize, let's look at the pseudo-code below in Fig 5, which comes
from this paper:

Fig 5: The pseudo-code of generic Sequential Model-Based Optimization

Here, SMBO stands for Sequential Model-Based Optimization, which is
another name for Bayesian Optimization. It is “sequential” because the
hyperparameters are added to update the surrogate model one by one; it is
“model-based” because it approximates the true objective function with a
surrogate model that is cheaper to evaluate.

Other notation in the pseudo-code:

H: Observation history of (hyperparameter, score) pairs
T: Maximum number of iterations
f: True objective function (in our example, the RMSE function)
M: Surrogate model, which is updated whenever a new sample is added
S: Acquisition function
x*: The next chosen hyperparameter to evaluate

Let’s go through this loop one more time.

First, initiate a surrogate model and an acquisition function.

Line 3: For each iteration, find the hyperparameter x* where the
acquisition function is optimized. The acquisition function is a function
of the surrogate model, meaning that it is built using the surrogate model
instead of the true objective function (keep reading, and you will see what
this means). Notice that the pseudo-code shows x* being obtained where the
acquisition function is minimized, whereas I kept saying that the
acquisition function should be maximized. Whether to maximize or minimize
depends on how the acquisition function is defined. If you are using
the most common acquisition function, Expected Improvement, then
you should maximize it.

Line 4: Obtain the true objective function score of x* to see how this point
actually performs.

Line 5: Add the (hyperparameter x*, true objective function score) pair to the
history of samples.

Line 6: Retrain the surrogate model using the latest history of samples.

Repeat until the maximum number of iterations is reached. In the end, the
history of (hyperparameter, true objective function score) pairs is returned.
Note that the last record is not necessarily the best-achieved score; you
would have to sort the scores to find the best hyperparameter.
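The loop above can be sketched end-to-end. To keep the sketch self-contained, both the objective and the surrogate below are deliberately crude stand-ins (a quadratic toy objective, and a nearest-neighbor "surrogate" whose uncertainty is the distance to the closest observation); a real implementation would use a Gaussian Process or TPE, as discussed next.

```python
import numpy as np

# A toy rendering of the SMBO pseudo-code in Fig 5. The objective f and the
# surrogate M are crude stand-ins chosen only so the sketch runs end-to-end.
def objective(x):  # stand-in for f, the expensive RMSE function
    return (x - 3.0) ** 2 + 1.0

def acquisition(x, hist_x, hist_y):
    """Stand-in for S: lower is better (low predicted score, high uncertainty)."""
    d = np.abs(hist_x - x)
    nearest = d.argmin()
    mean, uncertainty = hist_y[nearest], d[nearest]
    return mean - uncertainty  # crude exploration bonus for unexplored regions

# Initialize the history H with a few random evaluations
rng = np.random.default_rng(0)
hist_x = rng.uniform(0, 10, size=3)
hist_y = np.array([objective(x) for x in hist_x])

for _ in range(20):                              # Line 3: T iterations
    candidates = rng.uniform(0, 10, size=100)
    scores = [acquisition(c, hist_x, hist_y) for c in candidates]
    x_star = candidates[np.argmin(scores)]       # optimize S, as in Fig 5
    y_star = objective(x_star)                   # Line 4: evaluate f at x*
    hist_x = np.append(hist_x, x_star)           # Line 5: add (x*, score) to H
    hist_y = np.append(hist_y, y_star)           # Line 6: refit M (implicit here)

best_x = hist_x[hist_y.argmin()]                 # sort the history for the best point
```

As the closing note above says, the answer comes from sorting the returned history, not from the last iteration.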

Different Types of Surrogate Models and Acquisition Functions

Rather than getting into the math details of the surrogate models and
acquisition functions, I will only give a general description of the commonly
used types. If you are interested in learning more about how the acquisition
function works with different surrogate models, check this research paper.

The Most Common Acquisition Function: Expected Improvement

Let's start by explaining what the acquisition function is, so that we can
explain how each type of surrogate model is optimized.

The most common acquisition function is Expected Improvement. For a
minimization problem, the formula looks like this:

EI_y*(x) = ∫ max(y* − y, 0) p(y|x) dy

where:

p(y|x): the surrogate model; y is the true objective function score and x is
the hyperparameter

y*: the minimum observed true objective function score so far

y: a new score

Expected Improvement is built on top of the surrogate model, meaning that
different surrogate models result in different ways of optimizing this
acquisition function. We will discuss this in the following sections.

The Most Common Surrogate Model: The Gaussian Process Model

The majority of research papers use the Gaussian Process model as the
surrogate model for its simplicity and ease of optimization. A Gaussian
Process directly models p(y|x). It uses the history of (hyperparameter, true
objective function score) pairs as (x, y) to construct a multivariate Gaussian
distribution.

To maximize the Expected Improvement for the Gaussian Process model,
the new score should be less than the current minimum score (y < y*), so
that max(y* − y, 0) is a large positive number.

Let's look at a concrete example in Fig 6 (which I borrowed from this post):

Fig 6: A dummy example of Score vs Hyperparameter

Let's say the lowest score in Fig 6 is 12, so y* = 12. The Expected
Improvement function will look into the regions where the uncertainty is
high and the mean function is close to or lower than y*. The n_estimators
value that yields the highest Expected Improvement under the multivariate
Gaussian distribution is used as the next input to the real objective
function.

An Alternative Surrogate Model: Tree Parzen Estimators (TPE)

Another surrogate model, implemented in some Python packages (e.g. the
hyperopt library), is TPE. First recall Bayes' rule:

p(x|y) = p(y|x) · p(x) / p(y)

While a Gaussian Process models p(y|x) directly, TPE models p(x|y), the
probability distribution of the hyperparameters given the objective function
score.

Let's use Fig 7 as an example. Instead of choosing y* = 12 as the
Gaussian Process does, the TPE algorithm chooses y* to be some quantile γ of
the observed y values, so that p(y < y*) = γ. In other words, TPE chooses y*
to be a number that's a bit higher than the best observed score, so that it
can separate the current observations into two clusters: better than y* and
worse than y*. See Fig 7 for an illustration.
Fig 7: The black dash line is the selected y*

Given the separated scores, TPE then constructs separate distributions for
the hyperparameters. Thus p(x|y) is written as:

p(x|y) = l(x) if y < y*, and g(x) if y ≥ y*

where l(x) is the distribution of the hyperparameters when the score is lower
than the threshold y*, and g(x) is the distribution when the score is higher
than y*.

The Expected Improvement formula for TPE is then changed (by substituting
Bayes' rule) to:

EI_y*(x) = ∫ max(y* − y, 0) p(x|y) p(y) / p(x) dy

and after some math transformation, it becomes:

EI_y*(x) ∝ (γ + (1 − γ) · g(x) / l(x))⁻¹

The end formula means that to yield a high Expected Improvement, points
with high probability under l(x) and low probability under g(x) should be
chosen as the next hyperparameter. This matches our intuition that the next
hyperparameter should come from the area below the threshold rather than
the area above it. To learn more about the TPE surrogate model, refer to
this article or this article.
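The selection rule above can be sketched as follows. Simple Gaussian kernel density estimates stand in for the Parzen estimators, and the observation history and quantile below are made up for illustration.

```python
import numpy as np

# A sketch of the TPE selection rule: split the observed scores at the
# quantile gamma, fit a density l(x) to the "good" hyperparameters and
# g(x) to the "bad" ones, then pick the candidate maximizing l(x)/g(x).
# Gaussian KDEs stand in for the Parzen estimators; the data are made up.
def kde(points, x, bandwidth=1.0):
    """Gaussian kernel density estimate of `points`, evaluated at `x`."""
    z = (x[:, None] - points[None, :]) / bandwidth
    return np.mean(np.exp(-0.5 * z ** 2), axis=1) / (bandwidth * np.sqrt(2 * np.pi))

hist_x = np.array([1.0, 2.0, 2.5, 6.0, 7.0, 8.0])        # hyperparameters tried
hist_y = np.array([12.5, 12.1, 12.3, 15.0, 16.2, 14.8])  # their scores

gamma = 0.5
y_star = np.quantile(hist_y, gamma)   # threshold separating good from bad
good = hist_x[hist_y < y_star]        # observations better than y*
bad = hist_x[hist_y >= y_star]        # observations worse than y*

candidates = np.linspace(0, 10, 101)
ratio = kde(good, candidates) / kde(bad, candidates)   # proportional to l(x)/g(x)
next_x = candidates[ratio.argmax()]   # likely under l(x), unlikely under g(x)
```

With this made-up history, the good scores cluster at small hyperparameter values, so the ratio l(x)/g(x) drives the next suggestion into that region, matching the intuition above.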

Summary, References, and Further Reading

In this article, I explained the Bayesian Optimization concept (hopefully) in
a straightforward way. For those who want to learn more, below is a list of
resources that I have found helpful:

A Conceptual Explanation of Bayesian Hyperparameter Optimization for
Machine Learning: another Medium article that emphasizes explaining the
TPE method.

Algorithms for Hyper-Parameter Optimization: a great research paper that
explains in detail how Expected Improvement optimization works for
Gaussian Processes and TPE.

Bayesian Methods for Machine Learning: a short 10-minute YouTube video
that introduces other types of acquisition functions.

Bayesian Optimization with extensions, applications, and other sundry items:
a 1.5-hour lecture recording that goes through the concept of Bayesian
Optimization in great detail, including the math behind different types of
surrogate models and acquisition functions. The entire lecture might be
too technical to follow, but at least the first half of the video is very
helpful for understanding the concepts.

All right, that's it! Congratulations on following me all the way through! I
hope this article can rescue you from hours of juggling the overwhelming
terms and math associated with Bayesian Optimization. For questions and
comments, let's connect on LinkedIn and Twitter.
