
Bayesian optimization is a sequential design strategy for global optimization of black-box functions that does not assume any functional forms. It is usually employed to optimize expensive-to-evaluate functions.


In Bayesian optimization, a probabilistic model of the objective function is built using Bayesian inference. This
model is then used to select the next point to evaluate, which is the point that is most likely to improve the
objective function. This process is repeated until the desired level of accuracy is reached.
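
To make this loop concrete, here is a minimal sketch of the procedure described above, assuming scikit-learn's GaussianProcessRegressor as the probabilistic model and expected improvement as the criterion for selecting the next point; the function names, bounds, and iteration counts are illustrative, not prescribed.

import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def expected_improvement(X_cand, gp, f_best):
    """Acquisition function: expected improvement over the best value so far."""
    mu, sigma = gp.predict(X_cand, return_std=True)
    sigma = np.maximum(sigma, 1e-9)              # avoid division by zero
    z = (f_best - mu) / sigma                    # minimization convention
    return (f_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

def bayesian_optimization(objective, bounds, n_init=3, n_iter=20):
    rng = np.random.default_rng(0)
    X = rng.uniform(bounds[0], bounds[1], size=(n_init, 1))   # initial design
    y = np.array([objective(xi) for xi in X.ravel()])
    gp = GaussianProcessRegressor(kernel=RBF(), alpha=1e-6, normalize_y=True)
    for _ in range(n_iter):
        gp.fit(X, y)                                          # probabilistic model of the objective
        X_cand = np.linspace(bounds[0], bounds[1], 500).reshape(-1, 1)
        ei = expected_improvement(X_cand, gp, y.min())
        x_next = X_cand[np.argmax(ei)]                        # point most likely to improve
        X = np.vstack([X, x_next])
        y = np.append(y, objective(x_next[0]))
    return X[np.argmin(y)], y.min()

For instance, bayesian_optimization(lambda x: (x - 2.0) ** 2, (-5.0, 5.0)) would search the interval [-5, 5] for the minimum of a simple test function; a real problem would replace the candidate grid with an optimizer over the acquisition function.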
Bayesian optimization has been shown to be effective in a variety of applications, including machine learning,
engineering, and finance. It is a powerful tool for finding the best possible solution to a problem when the
objective function is expensive to evaluate.
Here are some of the benefits of using Bayesian optimization:
• It is a principled approach to global optimization that does not require any assumptions about the functional form of the objective function.
• It is efficient, as it only evaluates the objective function at a small number of points.
• It is robust to noise in the objective function.
• It can be used to optimize functions with multiple objectives.
Here are some of the challenges of using Bayesian optimization:
• It can be computationally expensive to build the probabilistic model of the objective function.
• It can be difficult to choose the right acquisition function.
• It can be difficult to tune the hyperparameters of the Bayesian optimization algorithm.
Overall, Bayesian optimization is a powerful tool for finding the best possible solution to a problem when the
objective function is expensive to evaluate. It is a principled, efficient, and robust approach to global optimization.
The objective function (O.F.) should be computationally expensive to evaluate, and it should have a probabilistic behavior that produces a noisy value; that is, if the simulation is run twice with the same parameters, the results should be similar but not identical. Think of the example of drilling holes to find oil.
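
As a toy illustration (the notes do not specify a particular simulation), a hypothetical noisy objective could look like the sketch below: repeated evaluations at the same x return similar but not identical values.

import numpy as np

def noisy_objective(x, noise_std=0.1):
    """Hypothetical expensive simulation: a fixed deterministic part plus
    random noise, so two runs with the same x give similar but not
    identical results."""
    rng = np.random.default_rng()          # fresh noise on every call
    return np.sin(3 * x) + 0.3 * x**2 + noise_std * rng.standard_normal()

# noisy_objective(1.0) called twice returns two slightly different values.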
Given two points of the search space, x_i and x_j, the kernel function determines the statistical correlation between those points.

The Gaussian kernel (RBF kernel) is widely used, and it requires two parameters: the length scale l, which controls the smoothness of the function, and σ_f, which controls the vertical variation.

\[
\kappa(x_i, x_j) = \sigma_f^2 \, \exp\!\left(-\frac{\|x_i - x_j\|^2}{2 l^2}\right)
\]

For values of x_i and x_j between -5 and 5, and given values of the length scale l and σ_f, the following correlation function is obtained:
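
A minimal sketch of this kernel and of the correlation matrix over the -5 to 5 range, assuming σ_f = 1 and l = 1 (these values are placeholders):

import numpy as np

def rbf_kernel(xi, xj, sigma_f=1.0, length=1.0):
    """Gaussian (RBF) kernel: sigma_f**2 * exp(-||xi - xj||**2 / (2 * length**2))."""
    sq_dist = np.sum((np.atleast_1d(xi) - np.atleast_1d(xj)) ** 2)
    return sigma_f**2 * np.exp(-sq_dist / (2 * length**2))

# Correlation between every pair of points on a grid from -5 to 5
x = np.linspace(-5, 5, 50)
K = np.array([[rbf_kernel(a, b) for b in x] for a in x])   # 50 x 50 kernel matrix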
The prior predictive distribution is the Gaussian process (GP) that models the probability of obtaining a value of the objective function, f, for any input x.

Initially, for an initial value of σ_f and l, the GP prior is modeled as a multivariate normal distribution, i.e.,

\[
\boldsymbol{f} =
\begin{bmatrix} f_1 \\ f_2 \\ \vdots \\ f_n \end{bmatrix}
\sim N(\boldsymbol{\mu}, \mathbf{K}) =
N\!\left(
\begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix},
\begin{bmatrix}
K_{11} & K_{12} & \cdots & K_{1n} \\
K_{21} & K_{22} & \cdots & K_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
K_{n1} & K_{n2} & \cdots & K_{nn}
\end{bmatrix}
\right),
\quad \text{where } K_{ij} = \kappa(x_i, x_j).
\]

The following figure shows five samples taken from the initial prior, together with a 96% confidence interval.

These samples are created using the Python function


numpy.random.multivariate_normal(mean, cov, size=None,
check_valid='warn', tol=1e-8)

Draw random samples from a multivariate normal distribution.
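
A sketch of how the prior samples can be drawn with that function; the kernel hyperparameters, grid size, and jitter term are assumptions for illustration:

import numpy as np

sigma_f, length = 1.0, 1.0                       # assumed kernel hyperparameters
x = np.linspace(-5, 5, 100)                      # test inputs

# Kernel matrix of the prior, K_ij = kappa(x_i, x_j), with the RBF kernel
sq_dist = (x[:, None] - x[None, :]) ** 2
K = sigma_f**2 * np.exp(-sq_dist / (2 * length**2))
K += 1e-8 * np.eye(len(x))                       # small jitter for numerical stability
mu = np.zeros(len(x))                            # zero prior mean

# Five samples from the GP prior f ~ N(mu, K)
samples = np.random.multivariate_normal(mu, K, size=5)

# Approximate 2-sigma confidence band of the prior
std = np.sqrt(np.diag(K))
upper, lower = mu + 1.96 * std, mu - 1.96 * std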


Given a set of realizations, the posterior predictive distribution is the Gaussian process (GP) that models the probability of obtaining a value of the objective function, f*, for any input x*, given those realizations.

In this case, the values of σ_f and l are not updated, but the mean vector and the covariance matrix are updated using the following expressions:
See Fig. 2.9.

Prior distribution of the O.F. (joint distribution of the observed values f and the test values f*):
\[
\begin{bmatrix} \boldsymbol{f} \\ \boldsymbol{f}_* \end{bmatrix}
\sim N\!\left(\boldsymbol{0},
\begin{bmatrix}
\mathbf{K} & \mathbf{K}_* \\
\mathbf{K}_*^T & \mathbf{K}_{**}
\end{bmatrix}\right),
\quad \text{where }
\mathbf{K} = \kappa(\boldsymbol{x}, \boldsymbol{x}), \;
\mathbf{K}_* = \kappa(\boldsymbol{x}, \boldsymbol{x}_*), \;
\mathbf{K}_{**} = \kappa(\boldsymbol{x}_*, \boldsymbol{x}_*).
\]

Posterior (conditional) distribution of the O.F.:
\[
p(\boldsymbol{f}_* \mid \boldsymbol{x}_*, \boldsymbol{x}, \boldsymbol{f})
= N(\boldsymbol{f}_* \mid \boldsymbol{\mu}_*, \boldsymbol{\Sigma}_*),
\quad
\boldsymbol{\mu}_* = \mathbf{K}_*^T \mathbf{K}^{-1} \boldsymbol{f},
\quad
\boldsymbol{\Sigma}_* = \mathbf{K}_{**} - \mathbf{K}_*^T \mathbf{K}^{-1} \mathbf{K}_*.
\]
Using this technique for 5 given points, the following posterior predictive distribution is obtained (showing 5 samples), without noise (left) and with noise S/N = 10 (right).
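
A sketch of the posterior update given above, with a handful of assumed training points and an optional noise-variance term for the noisy case (the data and values here are placeholders, not the ones behind the figure):

import numpy as np

def rbf(a, b, sigma_f=1.0, length=1.0):
    """RBF kernel matrix between the 1-D input arrays a and b."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return sigma_f**2 * np.exp(-sq_dist / (2 * length**2))

# Assumed observed points (realizations) and test inputs
x_train = np.array([-4.0, -2.0, 0.0, 1.5, 3.0])
f_train = np.sin(x_train)                         # placeholder objective values
x_star = np.linspace(-5, 5, 100)

noise_var = 0.0                                   # set > 0 for the noisy case
K = rbf(x_train, x_train) + noise_var * np.eye(len(x_train))
K_star = rbf(x_train, x_star)                     # K_*  = kappa(x, x_*)
K_ss = rbf(x_star, x_star)                        # K_** = kappa(x_*, x_*)

K_inv = np.linalg.inv(K)
mu_star = K_star.T @ K_inv @ f_train              # posterior mean
Sigma_star = K_ss - K_star.T @ K_inv @ K_star     # posterior covariance

# Five samples from the posterior predictive distribution
samples = np.random.multivariate_normal(
    mu_star, Sigma_star + 1e-8 * np.eye(len(x_star)), size=5)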
The values of the kernel hyperparameters σ_f and l can be further optimized based on the characteristics of the observed data. To do that, we maximize the log of the marginal likelihood of the observations, i.e.,

\[
\hat{\boldsymbol{\theta}} = \arg\max_{\boldsymbol{\theta}} \; p(\boldsymbol{f} \mid \boldsymbol{x}, \boldsymbol{\theta})
\]

\[
p(\boldsymbol{f} \mid \boldsymbol{x}, \boldsymbol{\theta}) =
\frac{1}{\sqrt{(2\pi)^n |\mathbf{K}|}}
\exp\!\left(-\tfrac{1}{2} (\boldsymbol{f} - \boldsymbol{\mu})^T \mathbf{K}^{-1} (\boldsymbol{f} - \boldsymbol{\mu})\right)
\]

\[
\hat{\boldsymbol{\theta}} = \arg\max_{\boldsymbol{\theta}}
\left\{ -\tfrac{1}{2} \boldsymbol{f}^T \mathbf{K}^{-1} \boldsymbol{f}
- \tfrac{n}{2} \log 2\pi - \tfrac{1}{2} \log |\mathbf{K}| \right\}
\]
Using a simple optimization method (gradient descent, for example), this fits the values of σ_f and l, resulting in...

\[
\sigma_f^* = 0.702281, \qquad l^* = 1.384861
\]
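
A sketch of this fit, minimizing the negative log marginal likelihood with scipy.optimize.minimize instead of plain gradient descent; the training data here are placeholders, so the resulting values will differ from those above:

import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(theta, x_train, f_train):
    """Negative of  -1/2 f^T K^{-1} f - n/2 log(2*pi) - 1/2 log|K|  (zero mean)."""
    sigma_f, length = theta
    sq_dist = (x_train[:, None] - x_train[None, :]) ** 2
    K = sigma_f**2 * np.exp(-sq_dist / (2 * length**2))
    K += 1e-8 * np.eye(len(x_train))              # jitter for numerical stability
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, f_train))   # K^{-1} f
    log_det_K = 2 * np.sum(np.log(np.diag(L)))
    n = len(x_train)
    return 0.5 * f_train @ alpha + 0.5 * n * np.log(2 * np.pi) + 0.5 * log_det_K

# Placeholder observations (the data behind sigma_f* and l* above are not given)
x_train = np.array([-4.0, -2.0, 0.0, 1.5, 3.0])
f_train = np.sin(x_train)

result = minimize(neg_log_marginal_likelihood, x0=[1.0, 1.0],
                  args=(x_train, f_train), bounds=[(1e-3, None), (1e-3, None)])
sigma_f_opt, length_opt = result.x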
