Bayesian Optimization Concept
Explained in Layman's Terms
Bayesian Optimization for Dummies
Wei Wang
Published in Towards Data Science · 9 min read · Mar 18, 2020
Bayesian Optimization has been widely used for hyperparameter tuning in the
Machine Learning world. Despite the many terms and math formulas involved,
the concept behind it turns out to be very simple. The goal of this article
is to share what I learned about Bayesian Optimization, with a
straightforward interpretation of textbook terminologies, and hopefully it
will help you understand what Bayesian Optimization is in a short amount of
time.
The Overview of Hyperparameter Optimization
For the completeness of this article, let's start with a basic overview of
hyperparameter optimization methods, which generally fall into four types:
Manual Search, Random Search, Grid Search, and Bayesian Optimization
Bayesian Optimization differs from Random Search and Grid Search in that
it uses past evaluation results to speed up the search, whereas the other
two methods choose hyperparameters independently of past evaluations. In
that sense, Bayesian Optimization is like Manual Search. Say you are
manually optimizing the hyperparameters of a Random Forest regression
model. First, you would try a set of parameters, look at the result, change
one of the parameters, rerun, and compare the results; that way you know
whether you are heading in the right direction. Bayesian Optimization does
a similar thing: the performance of past hyperparameters affects future
decisions. In comparison, Random Search and Grid Search do not take past
performance into account when determining new hyperparameters to evaluate.
This makes Bayesian Optimization a much more efficient method.
How Bayesian Optimization Works
Let’s continue using our example of optimizing hyperparameters for a
Random Forest regression model. Say we want to find a set of
hyperparameters that will minimize RMSE. Here, the function to compute
RMSE is called the objective function. If we knew the probability
distribution of our objective function (in simple words, if we knew the
shape of the objective function), then we could simply run gradient descent
and find the global minimum. However, since we don't know the distribution
of the RMSE score (this is actually what we are trying to find out), we
need Bayesian Optimization to help us decipher this black-box function.
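To make this concrete, below is a minimal sketch of such an objective function, assuming scikit-learn. The dataset is a synthetic stand-in, and for simplicity it tunes a single hyperparameter, n_estimators:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for your real training data
X_train, y_train = make_regression(n_samples=200, n_features=10,
                                   noise=10.0, random_state=0)

def objective(n_estimators):
    """Cross-validated RMSE of a Random Forest: the objective function."""
    model = RandomForestRegressor(n_estimators=int(n_estimators),
                                  random_state=0)
    mse = -cross_val_score(model, X_train, y_train,
                           scoring="neg_mean_squared_error",
                           cv=5).mean()
    return np.sqrt(mse)  # the score we want to minimize
```

Each call to this function is expensive (it retrains the model five times), which is exactly why we want to call it as few times as possible.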
So what is Bayesian Optimization?
Bayesian Optimization builds a probability model of
the objective function and uses it to select
hyperparameters to evaluate in the true objective
function.
This sentence might sound complicated but actually delivers a simple
message. Let’s break it down:
“Bayesian Optimization builds a probability model of the objective function”
The true objective function is a fixed function. Let’s say it is supposed to look
like Fig 1, but as I mentioned, we don’t know this at the beginning of the
hyperparameter tuning.
Fig 1: The True Objective Function
If we had unlimited resources, we would compute every single point of the
objective function so that we knew its actual shape (in our example, we
would keep calling the Random Forest regression model until we had the
RMSE scores for all possible hyperparameter combinations). However, that's
impossible.
So let’s say we only have 10 samples from the true objective function,
represented as black circles in Fig 2:
Fig 2: 10 samples from the true objective function
Using these 10 samples, we need to build a surrogate model (also called the
response surface model) to approximate the true objective function. Take a
look at Fig 3. The surrogate model is represented as the blue line, and the
blue shade represents the uncertainty (the standard deviation) around it.
Fig 3: Initiate the surrogate model
A surrogate model by definition is “the probability representation of the
objective function”, which is essentially a model trained on the
(hyperparameter, true objective function score) pairs. In math, it is p(objective
function score | hyperparameter). There are different ways to construct a
surrogate model, but I will come back to this later.
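As a rough illustration (one common choice, discussed later, is a Gaussian Process), a surrogate could be fit to the 10 samples like this, reusing the objective function sketched above:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# The 10 hyperparameter values tried so far and their true RMSE scores
X_obs = np.array([[20], [60], [100], [150], [200],
                  [250], [300], [350], [400], [480]], dtype=float)
y_obs = np.array([objective(x[0]) for x in X_obs])

surrogate = GaussianProcessRegressor(kernel=Matern(nu=2.5),
                                     normalize_y=True)
surrogate.fit(X_obs, y_obs)

# The surrogate approximates p(score | hyperparameter): a mean
# (the blue line in Fig 3) plus a standard deviation (the blue shade)
X_candidates = np.linspace(10, 500, 1000).reshape(-1, 1)
mean, std = surrogate.predict(X_candidates, return_std=True)
```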
“And use it to select hyperparameters”
Now that we have 10 samples of the objective function, how should we decide
which hyperparameter to try as the 11th sample? We need to build an acquisition
function (also called the selection function). The next hyperparameter of
choice is where the acquisition function is maximized. In Fig 4, the green
shade is the acquisition function and the red straight line is where it is
maximized. Therefore, the corresponding hyperparameter and its objective
function score, represented as a red circle, are used as the 11th sample to
update the surrogate model.
Fig 4: Maximize acquisition function to select the next point
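In code, this selection step is just an argmax over a grid of candidates. A minimal sketch, reusing X_candidates and the surrogate from above, and assuming the expected_improvement helper sketched in the Expected Improvement section below:

```python
import numpy as np

# Score every candidate with the acquisition function and pick the
# hyperparameter where it is maximized (the red line in Fig 4)
acq = expected_improvement(X_candidates, surrogate, y_best=y_obs.min())
x_next = float(X_candidates[np.argmax(acq), 0])
```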
“To evaluate in the true objective function”
As described above, after using an acquisition function to determine the
next hyperparameter, the true objective function score of this new
hyperparameter is obtained. Since the surrogate model has been trained on the
(hyperparameter, true objective function score) pairs, adding a new data point
updates the surrogate model.
…Repeat the above steps until the max time or max number of iterations is
reached. And boom! You now (hopefully) have an accurate approximation of the
true objective function and can easily find the global minimum among the
past evaluated samples. Your Bayesian Optimization is complete!
Putting It All Together
To summarize, let’s look at the below pseudo-code in Fig 5, which comes
from this paper:
Fig 5: The pseudo-code of generic Sequential Model-Based Optimization
Here, SMBO stands for Sequential Model-Based Optimization, which is
another name for Bayesian Optimization. It is “sequential” because the
hyperparameters are added to update the surrogate model one by one; it is
“model-based” because it approximates the true objective function with a
surrogate model that is cheaper to evaluate.
Other representations in the pseudo-code:
H : Observation History of (hyperparameter, score) pair
T : Max Number of Iterations
f : True Objective Function (in our example, the RMSE function)
M : Surrogate Function, which is updated whenever a new sample is added
S : Acquisition Function
x* : The Next Chosen Hyperparameter to Evaluate
Let’s go through this loop one more time.
First, initiate a surrogate model and an acquisition function.
Line 3: for each iteration, find the hyperparameter x* where the
acquisition function is maximized. The acquisition function is a function
of the surrogate model, meaning that it is built using the surrogate model
instead of the true objective function (keep reading and this will become
clear). Notice that the pseudo-code shows x* being obtained where the
acquisition function is minimized, whereas I kept saying that the
acquisition function should be maximized. Whether to maximize or minimize
depends entirely on how the acquisition function is defined. If you are
using the most common acquisition function, Expected Improvement, then
you should maximize it.
Line 4: Obtain the objective function score of x* to see how this point
actually performs
Line 5: Include the (hyperparameter x*, true objective function score) in the
history of other samples
Line 6: Train the surrogate model using the latest history of samples
Repeat until the max number of iterations is reached. In the end, the history
of (hyperparameter, true objective function score) pairs is returned. Note that
the last record is not necessarily the best-achieved score; you would have to
sort the scores to find the best hyperparameter.
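Here is a rough Python rendering of that pseudo-code, reusing the hypothetical objective, surrogate, X_candidates, and expected_improvement sketches from this article:

```python
import numpy as np

T = 20                                    # T: max number of iterations
history_x = [20.0, 250.0, 480.0]          # H: seed with a few samples
history_y = [objective(x) for x in history_x]

for t in range(T):
    # Fit the surrogate model M on the history so far (line 6)
    surrogate.fit(np.array(history_x).reshape(-1, 1),
                  np.array(history_y))
    # Line 3: select x* by optimizing the acquisition function S
    acq = expected_improvement(X_candidates, surrogate,
                               y_best=min(history_y))
    x_star = float(X_candidates[np.argmax(acq), 0])
    # Line 4: evaluate the true objective function f at x*
    y_star = objective(x_star)
    # Line 5: add (x*, f(x*)) to the observation history H
    history_x.append(x_star)
    history_y.append(y_star)

# The last record is not necessarily the best: sort/argmin the history
best_x = history_x[int(np.argmin(history_y))]
```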
Different Types of Surrogate Models and Acquisition
Functions
Rather than getting into the math details of the surrogate models and
acquisition function, I will only give a general description of the commonly
used types. If you are interested in learning more about how the acquisition
function works with different surrogate models, check this research paper.
The Most Common Acquisition Function: Expected
Improvement
Let’s start by explaining what the acquisition function is, so that we can
then see how it is optimized under each type of surrogate model.
The most common acquisition function is Expected Improvement. The
formula looks like this:
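EI_{y^*}(x) = \int_{-\infty}^{y^*} (y^* - y) \, p(y \mid x) \, dy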
p(y|x): the surrogate model. y is the true objective function score and x is the
hyperparameter
y*: the minimum observed true objective function score so far
y: a new score
Expected Improvement is built on top of the surrogate model, meaning that
different surrogate models would result in different ways of optimizing this
acquisition function. We will discuss this in the following sections.
The Most Common Surrogate Model: Gaussian Process Model
Most research papers use the Gaussian Process model as the surrogate model
for its simplicity and ease of optimization. A Gaussian Process directly
models p(y|x): it uses the history of (hyperparameter, true objective
function score) pairs as (x, y) to construct a multivariate Gaussian
distribution. To maximize the Expected Improvement under the Gaussian
Process model, the new score should be less than the current minimum score
(y < y*), so that max(y* - y, 0) can be a large positive number.
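Under a Gaussian surrogate, this integral has a well-known closed form in terms of the GP's predicted mean and standard deviation. A minimal sketch, matching the surrogate above (this is the expected_improvement helper referenced earlier):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(X_cand, surrogate, y_best):
    """Closed-form Expected Improvement for minimization.

    X_cand: candidate hyperparameters, shape (n, 1)
    y_best: y*, the minimum observed objective score so far
    """
    mean, std = surrogate.predict(X_cand, return_std=True)
    std = np.maximum(std, 1e-9)      # guard against division by zero
    z = (y_best - mean) / std
    # Large where the predicted mean is close to or below y* (z > 0)
    # and/or the uncertainty std is high
    return (y_best - mean) * norm.cdf(z) + std * norm.pdf(z)
```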
Let’s look at a concrete example in Fig 6 (which I borrowed from this post):
Fig 6: A dummy example of Score vs Hyperparameter
Let’s say in Fig 6 the lowest score is 12; then y* = 12. The Expected
Improvement function will look into the regions where the uncertainty is
high and the mean function is close to or lower than y*. The n_estimators
value that yields the highest Expected Improvement under the multivariate
Gaussian distribution would be used as the next input to the real objective
function.
An Alternative Surrogate Model: Tree Parzen Estimators (TPE)
Another surrogate model that is implemented in some Python packages (e.g.
the hyperopt library) is TPE. First, recall Bayes’ Rule, shown below:
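p(y \mid x) = \frac{p(x \mid y) \, p(y)}{p(x)}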
While the Gaussian Process models p(y|x) directly, TPE models p(x|y), the
probability distribution of hyperparameters given the objective function
score.
Let’s continue with the example from Fig 6. Instead of choosing y* = 12 as
the Gaussian Process does, the TPE algorithm chooses y* to be some quantile
γ of the observed y values, so that p(y < y*) = γ. In other words, TPE
chooses y* to be some number that’s a bit higher than the best observed
score, so that it can separate the current observations into two clusters:
better than y* and worse than y*. See Fig 7 for an illustration.
Fig 7: The black dashed line is the selected y*
Given the separate scores, TPE then constructs separate distributions for the
hyperparameters. Thus p(x|y) is written as:
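p(x \mid y) = \begin{cases} l(x) & \text{if } y < y^* \\ g(x) & \text{if } y \ge y^* \end{cases}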
where l(x) is the distribution of the hyperparameters when the score is lower
than the threshold y* and g(x) is the distribution when the score is higher
than y*.
The Expected Improvement formula for TPE is then changed to:
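EI_{y^*}(x) = \int_{-\infty}^{y^*} (y^* - y) \, p(y \mid x) \, dy = \int_{-\infty}^{y^*} (y^* - y) \, \frac{p(x \mid y) \, p(y)}{p(x)} \, dy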
and after some math transformation, it becomes:
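EI_{y^*}(x) \propto \left( \gamma + \frac{g(x)}{l(x)} (1 - \gamma) \right)^{-1}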
The final formula means that to yield a high Expected Improvement, points
with high probability under l(x) and low probability under g(x) should be
chosen as the next hyperparameter. This matches our intuition that the next
hyperparameter should come from the region below the threshold rather than
the region above it. To learn more about the TPE surrogate model, refer to
this article or this article.
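In practice you rarely implement TPE by hand. As a minimal sketch using the hyperopt library mentioned above, reusing the hypothetical objective function from earlier:

```python
from hyperopt import fmin, tpe, hp, Trials

# Search space: n_estimators drawn from 10..500 in steps of 1
# (quniform yields floats, which objective() casts to int)
space = hp.quniform("n_estimators", 10, 500, 1)

trials = Trials()
best = fmin(
    fn=objective,        # the RMSE objective sketched earlier
    space=space,
    algo=tpe.suggest,    # TPE as the surrogate model
    max_evals=50,        # T: max number of iterations
    trials=trials,       # H: the observation history
)
print(best)              # best hyperparameter found so far
```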
Summary, References, and Further Readings
In this article, I explained the Bayesian Optimization concept (hopefully) in a
straightforward way. For those who want to learn more, below is the list of
resources that I have found helpful:
A Conceptual Explanation of Bayesian Hyperparameter Optimization for
Machine Learning: another Medium article that emphasizes explaining the
TPE method
Algorithms for Hyper-Parameter Optimization: a great research paper that
explains in detail how Expected Improvement optimization works in
Gaussian Process and TPE
Bayesian Methods for Machine Learning: a short 10 min YouTube video that
introduces other types of acquisition functions
Bayesian Optimization with extensions, applications, and other sundry items:
A 1hr 30 min lecture recording that goes through the concept of Bayesian
Optimization in great detail, including the math behind different types of
surrogate models and acquisition functions. The entire lecture might be
too technical to follow, but at least the first half of the video is very
helpful in terms of understanding the concepts.
All right, that’s it! Congratulations on following me all the way through!
I hope my article can rescue you from hours of juggling the overwhelming
terms and math associated with Bayesian Optimization. For questions and
comments, let’s connect on LinkedIn and Twitter.