Chapter 1
General introduction
optima of non-convex functions on the other hand can be combined to perform
an efficient optimization. More specifically, the genetic algorithm will be used
to locate and reach the proximity of the global optimum of a very challenging
non-convex function, making it behave in that constrained space like a convex
function, and then a switch will be made to a gradient based algorithm. It is clear
that, to make this switch, a suitable criterion has to be found. The programming
language used is Python.
In the remainder of this paper, the structure is as follows. The first part treats
the gradient based algorithms. It first gives a theoretical background of some
gradient based algorithms. Next, the computational cost of those algorithms is
treated. This cost is measured by the total number of calls to the objective
function required to obtain a certain accuracy. This brings us to a comparison of
the efficiency of different gradient based algorithms, along with the behaviour of
the algorithms applied to convex functions of higher dimensions. Next, the
inability of gradient based algorithms to find global optima of non-convex
functions will be shown. In the following part, the genetic algorithms will be
introduced after a brief introduction to global optimization algorithms. Then the
computational cost to accurately find the global optimum of convex test functions
with genetic algorithms will be evaluated, and the variation of this cost with the
number of input variables will be discussed. This cost will be compared to the one
obtained with gradient based algorithms, which will show the superiority of the
latter in terms of computational cost. In the last part, the combination of both
algorithms will be used to accurately capture the global minimum of non-convex
test functions. The cost of this combined algorithm will be determined, and we
will try to establish a criterion for switching from genetic to gradient based
algorithms that applies to different problems with various dimensions.
Chapter 2
∇f (~x) = 0 (2.1)
Since in most cases this equation can be very difficult or even impossible to
solve analytically[15], we have to resort to algorithms that compute a sequence
of points ~x0 , ~x1 , ~x2 , ... so that:
Gradient based methods are methods that principally approach the minimum of
f (~x) by a sequence of successive steps in a given direction, which is defined by
the following equation:
~xn+1 = ~xn + αn · p~n (2.3)
The idea is to choose an initial position ~x0 , and for each step walk along a
direction so that[9]:
f (~xn+1 ) < f (~xn )
The different methods have various strategies for how they choose the search
direction p~n and the step length αn [9]. The success of these methods depends
on the effective choice of both of these parameters. A safe choice for the
search direction is for example a descent direction[15], i.e. one along which the
directional derivative is negative:

∇f(~x)^T · ~p < 0    (2.4)
Indeed, it can easily be proved[3] that this condition guarantees that the function
f can be reduced along this direction. When we write the first order Taylor
expansion in ~x:

f(~x + α·~p) ≈ f(~x) + α · ∇f(~x)^T · ~p    (2.5)

we notice that for a small α > 0, the negativity of the term ∇f(~x)^T · ~p
guarantees that a lower function value can be found along p~.
The stop criterion of the algorithms makes use of the optimality condition
and is defined as follows[15]:

‖∇f(~xn)‖ ≤ ε    (2.6)

where ε is a pre-defined error tolerance. It can be noticed that the stop criterion
doesn't guarantee a global minimum, since the above property also applies at
local minima[15].
2.3 Description of some gradient based algorithms
2.3.1 The Steepest Descent method
2.3.1.1 Search direction
Since we need to choose a descent search direction, the most natural choice
would be the negative gradient [15]:
p~ = −∇f (~x) (2.7)
Indeed, we remember that the gradient of a function is a vector giving the
direction along which the function increases the most; thus, in the opposite
direction the function decreases the most (hence the name Steepest Descent)[9].
We also call this direction the residual ~r of the system. The appellation residual
is due to the fact that, for a quadratic problem, it is somehow an indication of
how far we are from the correct value of the minimum[11].
It is clear that this behaviour would be obtained if we found an α that minimizes
the objective function in the search direction. This corresponds to a so called
exact line search. This kind of line search is the best in terms of accuracy,
but not in terms of cost effectiveness. Moreover, this line search is useful only
when the cost of the minimization to find the step size is low compared to the
cost of computing the search direction[15]. Otherwise, it would be better to
spend the computing time on finding a better search direction instead of on
finding a precise minimum along a particular direction that may itself be
imprecise, having been computed more cheaply than a minimizing α. For example,
in the case of a nonlinear function it is much more difficult to perform an
exact line search than in the case of a quadratic function(see [11]).
Indeed, in this case a general exact line search needs to be defined that consists
in finding the zeros of ∇f (~xn ) · ~rn−1 or in general of ∇f (~xn ) · d~n−1 , where d~ is a
search direction in general. One of the fast algorithms that can be used for this
purpose is the Newton-Raphson algorithm, a detailed description of which can
be found in [11]. This algorithm requires the function f to be twice continuously
differentiable and that it be possible to compute the second derivative of
f (~xn + α · p~n ) with respect to α, since the hessian of f needs to be computed[11].
If the full hessian needs to be computed at each iteration, the cost to calculate
the step size is too high compared to the cost of computing the search direction.
An alternative method to perform a line search is the Secant method(see [11]),
which is less costly but also less accurate, since it involves some approximations
compared to the Newton-Raphson method[11]. To illustrate the zigzagging of the
Steepest Descent method towards the solution, figure 2.1 shows the path towards
the minimum of the 2D-Rosenbrock function, which has its minimum located in
~x = (1, 1)(see 2.4 for definition). As mentioned above, the full Hessian and the
gradient need to be computed even more than once per iteration. Hence, since
calculating the search direction only requires the evaluation of the gradient, it is
clear that using the Newton-Raphson algorithm to calculate the step size is not
justified at all[A].
Figure 2.1: Zigzagging path towards minimum of Rosenbrock function
In the example corresponding to figure 2.1, 654 steps are required in order
to obtain an accuracy in the order of 10−8 . This is even worse when the line
search is inexact: a less accurate estimation of the step size automatically implies
the need for even more iterations to find an accurate estimation of the minimum.
This corresponds to a slow convergence of the method in comparison with the
other gradient based optimization algorithms(see paragraph 2.4.3). This slow
convergence is due to the inaccurate estimation of the search direction. Although
it may sound like a contradiction, the steepest descent is actually not the best
direction along which to walk to find the minimum. Indeed, as previously
explained, this choice for the search direction results in a zigzagging behaviour
towards the solution, which in general causes the method to go back and forth
across the valley where the minimum is located. It would be preferable for the
method to travel in a more straight manner towards the minimum. This is more
or less achieved in the method of Conjugate Gradients, which is an enhancement
of the Steepest Descent method. On the other hand, the Steepest Descent method
always converges. The rate at which it converges depends on the starting point
and, for a quadratic form for example, on how ill-conditioned its matrix is[11].
The inaccurate estimation of the search direction does have its advantage,
though: it doesn't require much computing time. Aside from the line search for
finding the step size, the only expensive operation in each iteration is the
evaluation of a gradient.
Furthermore, the Steepest Descent method doesn't require the step size to be
found with an exact line search, whereas this is a necessary task in other
algorithms that have more accurate search directions. That's why alternative
methods of finding the step size can even consist in choosing a constant step size,
so that the step size doesn't contribute to the expensiveness of the method in any
way. This alternative belongs to the category of the inexact line search methods.
Here, an appropriate constant step size has to be chosen in order for the
algorithm to converge. A too high step size results in divergence of the algorithm,
whereas a too low step size results in too many iterations to achieve a certain
accuracy. Here the step size was found through trial and error and depends on
the objective function. In the case of finding the minimum of the Rosenbrock
function, a quite good step size to achieve an accuracy in the order of 10−8 is
0.00211. The convergence of the Steepest Descent method can be visualized in
figure 2.2 for both line search methods. Note that in order to make a meaningful
comparison, the same starting point has been chosen.
From this figure, it is quite clear that the algorithm converges faster when
using an exact line search. When an appropriate constant step size is chosen,
the algorithm requires 3522 iterations in order to achieve the same accuracy.
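The constant-step-size variant described above can be sketched in Python as follows. The step size, tolerance, iteration cap and starting point below are illustrative assumptions, not the exact values used for the figures:

```python
import numpy as np

def rosenbrock_grad(x):
    # analytic gradient of the 2D Rosenbrock function
    return np.array([
        -400.0 * x[0] * (x[1] - x[0] ** 2) - 2.0 * (1.0 - x[0]),
        200.0 * (x[1] - x[0] ** 2),
    ])

def steepest_descent(grad, x0, alpha=0.0012, tol=1e-4, max_iter=300_000):
    # constant-step-size steepest descent: x_{n+1} = x_n - alpha * grad f(x_n)
    x = np.asarray(x0, dtype=float)
    for n in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:   # stop criterion: small gradient norm
            break
        x = x - alpha * g
    return x, n

x_min, iters = steepest_descent(rosenbrock_grad, x0=[-1.0, 1.0])
# x_min ends up close to the minimum (1, 1), after many iterations
```

The large iteration count needed here illustrates the slow convergence discussed above; a larger constant step makes the iteration diverge, a smaller one makes it even slower.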
of the minimization of a scalar function of 2 variables(e.g. a quadratic form), the
choice of an α would then guarantee that 2 steps are needed for the method to
arrive at the minimum. To this aim, α should be chosen such that the error ~en+1
is orthogonal to the previous search direction d~n , as shown in figure 2.3 [11].
Actually, since we do not know ~en+1 a priori, this method is useless here. But
there is a solution to that. Assume an arbitrary search direction d~n . One can
prove that searching for the best α (the one that minimizes f (~x + α · d~)) is
equivalent to demanding that ~en+1 is A-orthogonal or Conjugate to d~n . Two
vectors d~i and d~j are conjugate to each other when:
~d_i^T · A · ~d_j = 0
mutually conjugate vectors and αn the minimizing step size at each step [11] [1].
Note that this algorithm won't converge in k steps if the function whose minimum
has to be found is not a quadratic form. We shall come back to this point later.
We know by now that there exists an algorithm that computes the minimum
of a quadratic form in k steps, but we actually didn't mention how to build a
set of k mutually conjugate vectors. The eigenvectors of A form a conjugate set,
but this would involve solving an eigenvalue problem, which can be costly. An
alternative is a variant of the Gram-Schmidt orthogonalization process, called
the Conjugate Gram-Schmidt process[11][9]. We will not go into detail on
this(see [11]). Producing a set of k conjugate directions will bring us further,
but it also requires storing all those directions. The solution to this is to
actually conjugate the residuals[11][9]. Recall that we are still developing a
method for the minimization of a quadratic form, so that the residual in this case
corresponds with the gradient of the quadratic form. Conjugating the residuals
implies that we can calculate the conjugate directions by using the residual at
the point reached at step n and the previous search direction d~n−1 :
where the

β_n = (~r_n^T · ~r_n) / (~r_{n−1}^T · ~r_{n−1})    (2.13)
are called the Gram-Schmidt constants. These are defined in the Gram-Schmidt
process mentioned above[11]. The reasoning behind this result can be found
in [11], but here we will not elaborate on it. The use of conjugate residuals also
implies a less costly procedure to create a set of conjugate directions than what
would be expected when looking at the Gram-Schmidt process[11]. As mentioned
earlier, in this case the residuals are actually the gradients of the function to
be minimized, hence the name Method of Conjugate Gradients. Note that in
order for the method of Conjugate Gradients to work, the step size needs to be
found with an exact line search. Until now we have assumed the function to be
minimized to be a quadratic form. For nonlinear functions, the algorithm is quite
the same, but there are different choices for calculating βn . Here we will use the
same βn as in the case of a quadratic form, which is called the Fletcher-Reeves
formula. Let's summarize the algorithm:
Initialize ~d_0 = ~r_0 = −∇f(~x_0). Then, for each iteration n:

1. find the step size α_n with an exact line search along ~d_n
2. ~x_{n+1} = ~x_n + α_n · ~d_n
3. ~r_{n+1} = −∇f(~x_{n+1})
4. β_{n+1} = (~r_{n+1}^T · ~r_{n+1}) / (~r_n^T · ~r_n)
5. ~d_{n+1} = ~r_{n+1} + β_{n+1} · ~d_n
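For a quadratic form the exact line search has a closed form, and the method can be sketched compactly; the test matrix, right hand side and starting point below are illustrative assumptions:

```python
import numpy as np

def conjugate_gradients(A, b, x0, tol=1e-10):
    """Minimize f(x) = 0.5 x^T A x - b^T x for a symmetric positive
    definite A. The residual r = b - A x equals the negative gradient,
    the exact line search has the closed form alpha = r^T r / (d^T A d),
    and beta is the Fletcher-Reeves / Gram-Schmidt constant."""
    x = np.asarray(x0, dtype=float)
    r = b - A @ x          # residual = negative gradient
    d = r.copy()           # first search direction: steepest descent
    iterations = 0
    while np.linalg.norm(r) > tol:
        alpha = (r @ r) / (d @ A @ d)     # exact line search step size
        x = x + alpha * d
        r_new = b - A @ x
        beta = (r_new @ r_new) / (r @ r)  # conjugation constant (eq. 2.13)
        d = r_new + beta * d              # next conjugate direction
        r = r_new
        iterations += 1
    return x, iterations

A = np.array([[3.0, 1.0], [1.0, 2.0]])   # illustrative SPD matrix
b = np.array([1.0, 1.0])
x_star, k = conjugate_gradients(A, b, x0=[0.0, 0.0])
# x_star solves A x = b; k does not exceed the dimension (here 2)
```

In exact arithmetic the loop terminates after at most k steps for a k-dimensional quadratic form, which is the property discussed above.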
To this aim, the idea is to find a ∆x so that f (xn + ∆x) is minimal. Analytically,
this can be achieved by differentiating equation 2.15 with respect to ∆x and
setting the result to zero. This yields for ∆x:
∆x = − f′(xn) / f″(xn)    (2.16)
This result can be generalized by replacing the derivative with the gradient
∇f (~xn ) and the reciprocal of the second derivative with the inverse of the Hessian
matrix Hf (~xn ) [20], so that one obtains following sequence:
~x_{n+1} = ~x_n − [Hf(~x_n)]^{−1} · ∇f(~x_n)    (2.17)
The term −[Hf (~xn )]−1 ∇f (~xn ) is called the Newton direction. Hence, the step
size here is chosen to be equal to 1. The Newton algorithm is then the following:
Newton algorithm: starting from an initial point ~x0 , repeat
~x_{n+1} = ~x_n − [Hf(~x_n)]^{−1} · ∇f(~x_n) until the stop criterion is met.

The Newton method converges quadratically, i.e.:

lim_{k→∞} ‖~x_e^{k+1}‖ / ‖~x_e^{k}‖^2 = C    (2.18)

where ~x_e = ~x_{n+1} − ~x_n and C is a constant which is called the rate of
convergence.
It can be proven that the Steepest Descent[12] and Conjugate Gradients [21]
converge linearly, which in general implies that the Newton method converges
faster than the previous methods[14].
Note that another difference with the last 2 methods is that this method
requires evaluating second order derivatives and inverting the Hessian matrix,
which comes down to solving a system of linear equations, which is, after some
calculations:
[Hf (~xn )] · ~xe = −∇f (~xn ) (2.19)
The Newton algorithm provided by the Python libraries makes use of the so called
Conjugate Gradients iterative method to solve this system of linear equations.
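A minimal sketch of the Newton iteration on the 2D-Rosenbrock function, solving the linear system of equation 2.19 instead of explicitly inverting the Hessian; the starting point and tolerance are illustrative assumptions:

```python
import numpy as np

def rosen_grad(x):
    return np.array([
        -400.0 * x[0] * (x[1] - x[0] ** 2) - 2.0 * (1.0 - x[0]),
        200.0 * (x[1] - x[0] ** 2),
    ])

def rosen_hessian(x):
    return np.array([
        [1200.0 * x[0] ** 2 - 400.0 * x[1] + 2.0, -400.0 * x[0]],
        [-400.0 * x[0], 200.0],
    ])

def newton(grad, hessian, x0, tol=1e-8, max_iter=100):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:
            break
        # Newton direction from [Hf(x)] p = -grad f(x) (equation 2.19),
        # solved as a linear system rather than by inverting the Hessian
        p = np.linalg.solve(hessian(x), -g)
        x = x + p          # step size fixed to 1, as described above
    return x

x_min = newton(rosen_grad, rosen_hessian, x0=[-1.0, 1.0])
```

From this particular starting point the iteration reaches the minimum (1, 1) in very few steps, illustrating the fast convergence discussed in the text.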
Note that the algorithm converges only if the Hessian matrix is positive
definite[20]. The Newton algorithm may converge fast, but the involved second
order derivative calculations and the inversion of the Hessian matrix can be
really expensive, especially if the objective function has a large number of
variables[20]. That's why there also exist various quasi-Newton methods, where
an approximation of the Hessian matrix is built up from changes in the gradient.
One of the most popular methods among them is the subject of the next
paragraph.
function to be computed at each iteration. Besides, a high convergence speed
similar to the Newton method can be achieved, so that we actually combine the
advantages of both previous methods. Here, the Hessian matrix is approximated
and built up from changes in the gradient at each iteration, which is a huge
advantage compared to the Newton method, especially if the objective function
has a large number of variables.
Hence, as for the algorithm, an analogous reasoning as for the Newton method
is used, so that we can write:

~x_{n+1} = ~x_n − α_n · B_n^{−1} · ∇f(~x_n)    (2.20)

where
Bn ≈ Hf (~xn ) (2.21)
Note that here the step size is not fixed to 1 as in the Newton algorithm;
instead, a line search is performed.
The construction of the approximation for the Hessian matrix can be found
in [2]. For the sake of understanding the underlying mathematics behind the
BFGS algorithm, a brief summary of this construction will be given after which
the algorithm will be described. Instead of computing a completely new Bn−1 in
each iteration (instead of Bn , like in the DFP method), information about the
gradient in the previous step will be used to update it. In order to construct the
Bn−1 -matrix, some conditions on the approximated function in step n, related to
the use of the gradient from the previous step n − 1, have to be imposed. By
imposing those conditions, we can derive the Secant equation, from which Bn−1
can be computed. The form of the Secant equation in the case of the BFGS
method is the following:

B_n · ~s_{n−1} = ~y_{n−1},  with ~s_{n−1} = ~x_n − ~x_{n−1} and
~y_{n−1} = ∇f(~x_n) − ∇f(~x_{n−1})    (2.22)
The updated Bn−1 is further required to be as close as possible to the previous
approximation; requiring this closeness to satisfy the so called weighted Frobenius
norm, we get the following unique update of Bn−1 :

B_n^{−1} = (I − ρ_{n−1}·~s_{n−1}·~y_{n−1}^T) · B_{n−1}^{−1} ·
(I − ρ_{n−1}·~y_{n−1}·~s_{n−1}^T) + ρ_{n−1}·~s_{n−1}·~s_{n−1}^T    (2.23)
where

ρ_{n−1} = 1 / (~y_{n−1}^T · ~s_{n−1})    (2.24)
BFGS algorithm: initialize B_0^{−1} = I, then for each iteration n:

1. ~p_{n−1} = −B_{n−1}^{−1} · ∇f(~x_{n−1})
2. find the step size α_{n−1} with a line search
3. ~x_n = ~x_{n−1} + α_{n−1} · ~p_{n−1}
4. ~s_{n−1} = ~x_n − ~x_{n−1}
5. ~y_{n−1} = ∇f(~x_n) − ∇f(~x_{n−1})
6. ρ_{n−1} = 1 / (~y_{n−1}^T · ~s_{n−1})
7. update B_n^{−1} with equation 2.23
As for the convergence of the BFGS method, it can be said that it converges
superlinearly, which is faster than linear convergence but slower than quadratic
convergence. Hence, despite the BFGS algorithm being better in terms of
computational time per iteration than the Newton algorithm, BFGS converges
more slowly towards a minimum [16].
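A sketch of the BFGS iteration with the inverse-Hessian update described above; the backtracking line search and the safeguard that skips the update when ~y^T · ~s is not positive are simplifying assumptions added here, not part of the description in the text:

```python
import numpy as np

def rosen(x):
    return 100.0 * (x[1] - x[0] ** 2) ** 2 + (1.0 - x[0]) ** 2

def rosen_grad(x):
    return np.array([
        -400.0 * x[0] * (x[1] - x[0] ** 2) - 2.0 * (1.0 - x[0]),
        200.0 * (x[1] - x[0] ** 2),
    ])

def bfgs(f, grad, x0, tol=1e-8, max_iter=1000):
    x = np.asarray(x0, dtype=float)
    n = x.size
    B_inv = np.eye(n)                  # start from B_0^{-1} = I
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) <= tol:
            break
        p = -B_inv @ g                 # quasi-Newton search direction
        # backtracking (Armijo) line search, a simple stand-in for the
        # line searches discussed in the text
        alpha = 1.0
        while f(x + alpha * p) > f(x) + 1e-4 * alpha * (g @ p) and alpha > 1e-12:
            alpha *= 0.5
        s = alpha * p                  # s = x_new - x
        x_new = x + s
        g_new = grad(x_new)
        y = g_new - g                  # y = grad f(x_new) - grad f(x)
        if y @ s > 1e-12:              # safeguard: keep B_inv positive definite
            rho = 1.0 / (y @ s)
            I = np.eye(n)
            # weighted-Frobenius-norm inverse-Hessian update discussed above
            B_inv = ((I - rho * np.outer(s, y)) @ B_inv
                     @ (I - rho * np.outer(y, s)) + rho * np.outer(s, s))
        x, g = x_new, g_new
    return x

x_min = bfgs(rosen, rosen_grad, x0=[-1.0, 1.0])
```

Note that only gradients are evaluated: the Hessian approximation is built up entirely from the pairs (s, y), which is the advantage over the Newton method discussed above.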
2.4 Efficiency and comparative study
2.4.1 Introduction
In this section, a comparison will be made between gradient based algorithms
on the basis of their efficiencies. Those efficiencies are evaluated by the number
of function- and gradient calls the algorithm performs, which determines the
computational cost. The algorithms that will be compared with each other are
the Steepest Descent-, the Conjugate Gradient-, the BFGS- and the Newton
algorithm. Furthermore the test functions on which the algorithms will be tested
are the 2nd -, 4th -, 6th -, 8th -, 10th - and 12th -dimensional Rosenbrock
functions:

f(~x) = Σ_{i=1}^{d−1} [ 100 · (x_{i+1} − x_i^2)^2 + (1 − x_i)^2 ]

where d stands for the number of variables. Moreover, the function has a global
minimum fmin,global = 0 located in ~xmin,global = (1, 1, ..., 1).
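The d-dimensional Rosenbrock test function used here can be written compactly in Python (the vectorized form below is one common convention, consistent with the stated global minimum):

```python
import numpy as np

def rosenbrock(x):
    # d-dimensional Rosenbrock function, global minimum 0 at (1, ..., 1)
    x = np.asarray(x, dtype=float)
    return float(np.sum(100.0 * (x[1:] - x[:-1] ** 2) ** 2
                        + (1.0 - x[:-1]) ** 2))

# rosenbrock([1.0] * 6) == 0.0 at the 6D global minimum
```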
Please note that, in order for the comparison to make sense, the same starting
point has to be chosen for both algorithms. Furthermore, a 'good' choice of the
starting point is actually important, since the algorithm could become unstable
for some starting points. To make those choices, we rely on a trial and error
approach, or we can use recommended starting points based on experience. Here
the starting point will be chosen to be ~x0 = (1, 2, 1, 2).
Figure 2.4: Convergence curves and cumulative function- and gradient calls per
iteration for the BFGS (c) and Newton (d) methods
As can be seen in figures B.1a and B.1b, BFGS converges faster than the New-
ton algorithm: BFGS requires 21 iterations to reach an accuracy in the order of
approximately 10−8 whereas Newton requires 27 iterations. Furthermore, BFGS
requires 33 function evaluations and 33 gradient evaluations per 21 iterations
whereas Newton requires 41 function evaluations and 199 gradient evaluations
per 27 iterations. Since gradient evaluations can be done in an efficient way using
the so called 'Adjoint Method'[4], the costs of function and gradient evaluations
are considered to be similar. Hence, with a total of 70 evaluations for the BFGS
method and 240 evaluations for the Newton method, it can be concluded that, to
achieve an accuracy in the order of 10−8 , the Newton method is more expensive.
Taking into account table 2.1, the computational costs can be deduced from
figure 2.6.
(Figure: convergence curves for the BFGS (a), Newton (b), Conjugate Gradients
(c) and Steepest Descent (d) methods.)

Table 2.2: Total computational cost per algorithm for different accuracies
Note that the total cost for the Steepest Descent is, like its number of iterations,
approximate, but it can also be determined exactly by passing the order of
accuracy to the algorithm[A]. Besides, by taking a closer look at the algorithm,
we can conclude that the number of calls is perfectly proportional to the number
of iterations.
Based on those latter values, a ranking of the different algorithms is represented
in table 2.4, as a function of the desired accuracies. The ranking is made with
the numbers 1, ..., 4, where 1 represents the best and 4 the worst.
The Steepest Descent method converges very slowly compared to the other
algorithms, which is the cause of its low performance.
Analogously, this ranking can be made for the higher dimensional Rosenbrock
test functions; it is illustrated in table 2.4. From this table it can be concluded
that, unlike in the 2D-case, CG becomes less interesting than Newton, whereas
BFGS is in all cases the most efficient algorithm. This is because for higher
dimensional test functions, Newton converges at a rate similar to BFGS, which
is faster than Conjugate Gradients, but is on the other hand more expensive per
iteration than BFGS and Conjugate Gradients. Hence, in some cases, the
difference in required iterations between Newton and Conjugate Gradients
outweighs the difference in computational cost per iteration. The Steepest
Descent method is not used in higher dimensions because of the time consuming
determination of an appropriate step size for each variant of the Rosenbrock
function. The corresponding convergence curves and bar plots for the cumulative
function- and gradient calls can be found in appendix B.
Table 2.4: Ranking of the algorithms for the higher dimensional Rosenbrock
functions, per desired accuracy

4D                    10−2   10−4   10−8
Newton                  2      2      2
BFGS                    1      1      1
Conjugate Gradient      3      3      3

6D                    10−2   10−4   10−8
Newton                  2      2      2
BFGS                    1      1      1
Conjugate Gradient      3      3      3

8D                    10−2   10−4   10−8
Newton                  3      3      2
BFGS                    1      1      1
Conjugate Gradient      2      2      3

10D                   10−2   10−4   10−8
Newton                  3      3      2
BFGS                    1      1      1
Conjugate Gradient      2      2      3

12D                   10−2   10−4   10−8
Newton                  2      2      2
BFGS                    1      1      1
Conjugate Gradient      3      3      3
Where d stands for the number of variables. The two-dimensional form can be vi-
sualized in figure 2.7. Moreover, the function has a global minimum fmin,global =
0 located in ~xmin,global = (0, 0, ..., 0).
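The Ackley test function used in the following can be sketched in Python; the parameters a = 20, b = 0.2 and c = 2π are the common defaults of the standard Ackley form, assumed here since the exact definition is given elsewhere in the text:

```python
import numpy as np

def ackley(x, a=20.0, b=0.2, c=2.0 * np.pi):
    # standard d-dimensional Ackley function, global minimum 0 at the origin
    x = np.asarray(x, dtype=float)
    d = x.size
    return float(-a * np.exp(-b * np.sqrt(np.sum(x ** 2) / d))
                 - np.exp(np.sum(np.cos(c * x)) / d) + a + np.e)

val0 = ackley([0.0, 0.0])   # numerically zero at the global minimum
```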
Recall that convex functions are functions whose local minimum coincides
with its global minimum. In case the objective function is non-convex, local
minima exist that do not coincide with its global minimum.
Figure 2.8: Gradient based algorithms only capture the global minimum when
the starting point is in the attraction region: (a) BFGS method stuck in a local
minimum; (b) BFGS method attaining the global minimum
Based on the above examples it can be concluded that our gradient based
algorithms are not capable of capturing the global minimum of non-convex
functions. Since the higher dimensional Ackley functions are non-convex as well,
the same can be concluded for them. Hence, gradient based algorithms can only
minimize convex functions and are not suited for global optimization. We thus
have to resort to global optimization techniques. This will be covered in the
next chapter.
Chapter 3
3.1 Introduction
Global optimization techniques make sure the global minimum of non convex
functions can be found without being trapped in local minima. This is thanks
to the fact that global optimizers search for the minimum in the whole solution
space. However, as already mentioned in the introduction(see 1), global opti-
mizers require a high number of function evaluations and hence are too costly.
This is why metaheuristic strategies will be used to search in the solution space
in a more or less intelligent way. Metaheuristic algorithms were developed
specifically to find a solution that is "good enough" in a computing time that is
"small enough". Specifically, metaheuristics sample from a set of candidate
solutions which is too large to be sampled completely [6] [19]. This surely is
interesting here, since at higher dimensions the search space grows strongly.
Besides, since the idea is to fuse global optimization with gradient based
optimization, we don't need the solution provided by the global optimization to
be accurate. An example of such a metaheuristic method is the evolutionary
algorithm.
One of the most popular evolutionary algorithms is the genetic algorithm[18].
Genetic algorithms are preferable over other metaheuristic methods because they
use both crossover and mutation operators, which make the population more
diverse and thus more immune to being trapped in local optima.
The genetic algorithm iterates over the generated individuals in each generation.
Firstly, an initial population of a given number of randomly chosen individuals
is created. Then, in each iteration (or generation), individuals are selected to
breed a new generation, recombined in the crossover process and mutated. The
selection process groups the population into sets of 3 randomly chosen
individuals, selects from each group the one with the minimal function value,
and deletes the other 2. In the crossover process, the chromosomes (here the
coordinates) of a given percentage of the individuals are recombined, in the hope
that the offspring will have a better fitness. Mutation randomly changes a given
percentage of the new offspring and is meant to prevent falling into local
extrema. Indeed, those changes bring forth less fit solutions, which ensures
genetic diversity. The percentages of the population undergoing mutation and
crossover will be held constant and equal to 20 and 50 percent respectively. The
algorithm terminates when a fixed number of generations is reached. Furthermore,
the search space in which the genetic algorithm will choose its population is the
one defined by the points (−5, −5), (−5, 5), (5, −5) and (5, 5) for 2D functions
(a square around the origin at a distance of 5 from its sides). For functions of
higher dimensions a similar domain is chosen: cubes around the origin at a
distance of 5 from its faces for 3D, and so on.
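The genetic algorithm described above can be sketched as follows. The random seed, the concrete crossover and mutation moves (one-point coordinate swap, re-drawing one coordinate), and the use of the 2D sphere function as test objective are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, for reproducibility

def genetic_minimize(f, dim, pop_size=90, generations=100,
                     bounds=(-5.0, 5.0), p_cross=0.5, p_mut=0.2):
    lo, hi = bounds
    pop = rng.uniform(lo, hi, size=(pop_size, dim))
    for _ in range(generations):
        # selection: group the population in threes, keep from each group
        # the individual with the minimal function value
        rng.shuffle(pop)
        groups = pop[: (pop_size // 3) * 3].reshape(-1, 3, dim)
        winners = np.array([g[np.argmin([f(ind) for ind in g])] for g in groups])
        # breed the new generation by resampling the winners
        pop = winners[rng.integers(0, len(winners), size=pop_size)]
        # crossover: recombine the coordinates of ~50% of the individuals
        for i in range(0, pop_size - 1, 2):
            if rng.random() < p_cross and dim > 1:
                cut = int(rng.integers(1, dim))
                pop[i, cut:], pop[i + 1, cut:] = \
                    pop[i + 1, cut:].copy(), pop[i, cut:].copy()
        # mutation: randomly re-draw one coordinate of ~20% of the offspring
        for i in range(pop_size):
            if rng.random() < p_mut:
                pop[i, int(rng.integers(0, dim))] = rng.uniform(lo, hi)
    best = min(pop, key=f)
    return best, f(best)

# illustrative run on the 2D sphere function, global minimum 0 at the origin
best, val = genetic_minimize(lambda x: float(x[0] ** 2 + x[1] ** 2), dim=2)
```

Even on this simple objective, many function evaluations are spent per generation, which is the cost issue quantified in the following paragraphs.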
It is clear that the larger the initial population size, the more information
the algorithm has in order to find the global optimum. Because of that, the
accuracy of the genetic algorithm depends on the initial population size. This is
illustrated by applying the genetic algorithm to the 2D-Rosenbrock function (see
figure 3.1a). Note that the error is defined as in paragraph 2.4. Although this
error curve is highly fluctuating, we see that it generally has a decreasing trend
for an increasing population size.
Figure 3.1: Accuracy and cost as function of initial population size. Note: pop-
ulation varies from 50 to 980 individuals
Important to mention is that the initial population size has an influence on the
number of generations to convergence. Indeed, the higher the population size,
the more generations needed to converge because more mutations will occur for
a given mutation probability. As mentioned in the introduction, mutation makes
the population more diverse which slows down the genetic algorithm [10]. This
has thus an increasing effect on the cost of the algorithm. A general way
(applicable to all problems suited to genetic algorithms) to determine an optimal
initial population size (for which the benefits in accuracy and in number of
generations to convergence, and thus cost, balance) can be found in [10], but it
will not be applied here. What we will do is find a minimal population size to
obtain an accuracy in the order of 10−2 . As can be seen in figure 3.1a, this
accuracy is obtained for an initial population size of at least 690 individuals.
In figure 3.1b the
computational cost as function of the initial population size is shown. As for the
accuracy as function of the initial population size, this cost function is highly
fluctuating. However it displays in general an increasing trend for an increasing
population size. This cost was determined by running the algorithm for 100 gen-
erations, after which the number of generations to convergence can be deduced
by plotting the minimum as function of the generation, and finally by running
the algorithm for the deduced number of generations. We see that this initial
population size corresponds to a minimal cost (since for higher populations the
same accuracy can be obtained) of around 4500 function calls. Comparing this
result with the computational cost to obtain the same accuracy with the BFGS
algorithm (see table 2.3, section 2.4), which requires only 74 function evaluations,
we can conclude that for minimizing the convex 2D Rosenbrock function, the
BFGS algorithm is far more efficient than the genetic algorithm.
We can thus conclude in general that gradient based algorithms are far more
efficient than genetic algorithms for minimizing convex functions. As mentioned
earlier, genetic algorithms however are capable of finding global minima of
non-convex functions. But if high accuracies are required, the population size has
to be raised strongly, which is too costly, as we saw in the examples above.
Besides, after a certain point, the cost of raising the population further outweighs
the benefit of obtaining a more accurate result [10]. In this case, a switch to
gradient based algorithms can be made to further minimize the function to the
desired accuracy. This will be further investigated in the next section.
3.3 Efficient global optimization: combined genetic-
and gradient based algorithms
In the following, we will try to find an efficient strategy to accurately find
the global minimum of non-convex functions. To this aim, the Ackley function
(see 2.5 for definition) will serve as test function. The genetic algorithm will
be used to reach the vicinity of the global minimum, and then the switch will
be made to a gradient based algorithm to capture the minimum accurately. By
'vicinity' it is meant that the attraction region of the global minimum has to be
reached. Indeed, as already mentioned in section 2.5, the function behaves like
a convex function in this attraction region. The problem then lies in determining
a criterion which tells us that the genetic algorithm has reached the attraction
region and that we can switch to a gradient based algorithm. This criterion
actually consists of conditions on the population size and the number of
generations. The number of generations obviously has to be the one for which
the genetic algorithm converges for a given initial population size. For an
optimization problem in general, the number of generations to convergence is
not known. What we do know is that when the convergence point is reached,
the minimum doesn't change anymore. Hence, the genetic algorithm will be
stopped when 2 successive solutions are the same.
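The combined strategy with this switch criterion can be sketched as follows. The minimum number of generations before the switch is allowed, the Gaussian mutation move, the numerical tolerance used to decide that two successive solutions are "the same", and the use of plain backtracking gradient descent (with a numerical gradient) in place of BFGS/Newton/CG are all simplifying assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed, for reproducibility

def ackley(x, a=20.0, b=0.2, c=2.0 * np.pi):
    x = np.asarray(x, dtype=float)
    d = x.size
    return float(-a * np.exp(-b * np.sqrt(np.sum(x ** 2) / d))
                 - np.exp(np.sum(np.cos(c * x)) / d) + a + np.e)

def num_grad(f, x, h=1e-6):
    # central-difference gradient, used here instead of an analytic gradient
    e = np.eye(x.size)
    return np.array([(f(x + h * e[i]) - f(x - h * e[i])) / (2.0 * h)
                     for i in range(x.size)])

def combined_minimize(f, dim, pop_size=40, max_gen=200, min_gen=30):
    # genetic phase: tournament selection in groups of 3, Gaussian mutation
    pop = rng.uniform(-5.0, 5.0, size=(pop_size, dim))
    prev_best = np.inf
    for gen in range(max_gen):
        rng.shuffle(pop)
        groups = pop[: (pop_size // 3) * 3].reshape(-1, 3, dim)
        winners = np.array([g[np.argmin([f(i) for i in g])] for g in groups])
        pop = winners[rng.integers(0, len(winners), size=pop_size)]
        mutate = rng.random(pop_size) < 0.2
        pop[mutate] += rng.normal(0.0, 0.3, size=(int(mutate.sum()), dim))
        best = min(f(ind) for ind in pop)
        # switch criterion: stop once two successive best solutions are
        # (numerically) the same; min_gen is an extra guard assumed here
        if gen >= min_gen and abs(best - prev_best) < 1e-6:
            break
        prev_best = best
    x = min(pop, key=f)
    f_ga = f(x)
    # gradient phase: backtracking steepest descent polishes the GA result
    for _ in range(500):
        g = num_grad(f, x)
        if np.linalg.norm(g) < 1e-8:
            break
        alpha = 1.0
        while f(x - alpha * g) > f(x) - 1e-4 * alpha * (g @ g) and alpha > 1e-12:
            alpha *= 0.5
        if alpha <= 1e-12:
            break
        x = x - alpha * g
    return x, f_ga, f(x)

x_best, f_ga, f_final = combined_minimize(ackley, dim=2)
```

By construction the gradient phase never increases the function value, so the combined result is at least as good as what the genetic phase delivered; whether it reaches the global minimum depends on whether the genetic phase landed in the attraction region, exactly as discussed below.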
As for the choice of the population size, a first try can be to determine the
optimal population size for which the number of generations to convergence and
the accuracy are balanced (see [10]). A simpler way to choose an initial
population size is the rule of thumb whereby the population size equals 10 times
the number of input variables of the objective function [13].
Consider the 2D-Ackley function. In figure 3.2a we see that the genetic
algorithm didn't succeed in reaching the attraction region using an initial
population size of 20 (10 times the dimension), since all of the gradient based
algorithms (Newton, Conjugate Gradients and BFGS) remain stuck in a local
minimum. As in paragraph 2.5, all the gradient based algorithms follow more or
less the same path, and thus only the BFGS algorithm is shown.
(a) Gradient based algorithms stuck in a local minimum (b) Gradient based algorithms find the global minimum
Figure 3.2: Global minimum search failed and accomplished for 20 and 38 individuals, respectively
It is clear that, for the 2D Ackley test function, the above rule of thumb cannot
be applied. In the following, we investigate whether the combined algorithm
converges to the global minimum for higher population sizes. This is done by
simply running the combined algorithm for 21, 22, ... individuals until we find
a population size for which a point in the attraction region is found. For each run,
the contour plot is drawn, which provides a visualization of whether the genetic
algorithm converges to the attraction region. If a contour plot is not available (as
will be the case for higher dimensional functions), we can tell that the attraction
region has not been reached by inspecting the point attained by the gradient based
algorithm: if the function value at this point is not close to the minimum (with
an error of the order of at least 10^-2), we conclude that the attraction
region has not been reached. In figure 3.2b we see that a point in the attraction
region is found for an initial population of 38, and consequently the combined
algorithm converges accurately to the global minimum. Again, only the
BFGS algorithm is shown. Note that this also holds for populations larger than 38.
We can thus conclude that, for the 2D Ackley function, the population size has
to be approximately 20 times the number of input variables. This larger factor
may be a consequence of the fact that the Ackley function is extremely complex. As
for the cost, to obtain an accuracy of the order of 10^-2, the Conjugate Gradient
algorithm performs best, requiring a computational cost of 14 function/gradient
evaluations. The Newton and BFGS algorithms require 27 and 24 function/gradient
evaluations, respectively. Of course, the cost of the genetic algorithm should
not escape our attention: it requires 89 function evaluations to attain the
attraction region. Besides, when plotting
the best individual (with the lowest function value) per generation, we see that
the number of generations to convergence is equal to 2 (see [C]). This gives a
total computational cost of 113 function evaluations for the Conjugate Gradient
algorithm [C]. Comparing this to the cost of the genetic algorithm alone to find,
with an accuracy of the order of 10^-2, the minimum of a less complex function
such as the 2D Rosenbrock function, we can conclude that this combined algorithm
performs far better.
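The incremental search described above amounts to a simple linear scan over population sizes. A minimal sketch, where `run_combined` is a hypothetical callback mapping an initial population size to the final function value reached by the combined algorithm, and the known minimum value is used as the convergence check:

```python
def smallest_converging_popsize(run_combined, start, stop, f_min=0.0, tol=1e-2):
    """Scan population sizes start, start+1, ..., stop and return the
    first size for which the combined algorithm ends up within `tol`
    of the known minimum value `f_min`; None if no size qualifies."""
    for size in range(start, stop + 1):
        if abs(run_combined(size) - f_min) < tol:
            return size
    return None
```

For the 2D Ackley function this scan returns 38, the value found above.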
The question remains: can we draw this conclusion for other problems in general? And
is this rule applicable to functions with more input variables? Let us take a look
at the Rastrigin function, which is also a very challenging function to optimize
and is defined as follows (see figure 3.3 for the surface and contour plots):
f(\vec{x}) = 10 \cdot d + \sum_{i=1}^{d} \left( x_i^2 - 10 \cdot \cos(2\pi x_i) \right) \qquad (3.1)
where d stands for the number of input variables. This function has a minimum
of 0 at the point ~xmin = (0, ..., 0). For this test, the 2D variant of the function
will be used.
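For completeness, Eq. (3.1) can be transcribed directly into Python (the function name is ours):

```python
import numpy as np

def rastrigin(x):
    """Rastrigin function of Eq. (3.1): 10*d + sum(x_i^2 - 10*cos(2*pi*x_i)).
    Global minimum 0 at the origin, for any dimension d."""
    x = np.asarray(x, dtype=float)
    return 10.0 * x.size + np.sum(x ** 2 - 10.0 * np.cos(2.0 * np.pi * x))
```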
Please note that from now on, no desired accuracy will be mentioned, as we
are interested in investigating the criterion for which the combined algorithm
converges, which is the initial population size. Proceeding as before to find the
minimal population size needed to attain the attraction region of the Ackley function,
the combined algorithm is run for increasing population sizes. In this case, the
combined algorithm converges from an initial population size of 213 individuals
(requiring 4 generations to converge, see [C]), which is far more than what
the Ackley function requires. Instead of using 10 times the number of input
variables, we thus need approximately 100 times the number of input
variables. As for the Ackley function, this could be due to the complexity of the
Rastrigin function.
Let us now look at the higher dimensional Ackley functions. As explained in
section 3.2, the search space extends as the dimension increases, which requires a
larger initial population size to obtain a desired accuracy. So, if we want to reach
the attraction region for higher dimensional functions, a larger population size
will obviously be required. When checking the original rule of thumb on the 8D
Ackley function, we find that an initial population of 80 is not sufficient to attain
the attraction region, since the combined algorithm does not converge to the global
minimum. It is from an initial population size of 173 individuals onwards that the
combined algorithm converges to the global minimum, after 14 generations. See [C]
for the exact coordinates and minima attained by the genetic and combined
algorithms, and the number of generations to convergence. Note that this number
of individuals is very close to 20 times the number of input variables, as for the
2D Ackley function. Hence, we shall say that for the n-dimensional Ackley problem,
the combined algorithm converges for an initial population size of 20 times the
dimension.
It can thus be concluded that, in general, the rule of thumb of multiplying
10 by the number of input variables does not apply to all problems. Indeed, for
very complex 2D problems like the Ackley and Rastrigin functions, the genetic
algorithm requires more than 20 individuals to attain the attraction region. For
higher dimensional problems, this rule of thumb does not apply either; again, this
could be due to the complexity of the higher dimensional Ackley function. Hence,
further research on a general way (i.e. for problems of arbitrary complexity) to
choose the population size needed to attain the attraction region has to be
performed. Furthermore, since such a method should apply to all kinds of problems,
it is suggested that future research determine the initial population size as
described in [10].
Bibliography
[11] Jonathan Richard Shewchuk. An Introduction to the Conjugate Gradient Method Without the Agonizing Pain. https://www.cs.cmu.edu/~quake-papers/painless-conjugate-gradient.pdf. Aug. 1994.
[12] Penn State. Line search methods. Rate of convergence. http://web.khu.ac.kr/~tskim/PatternClass%20Lec%20Note%2004-3%20(Newton's%20Method%20-%20Optimization).pdf.
[13] R. Storn. "On the usage of differential evolution for function optimization". In: Fuzzy Information Processing Society, 1996. NAFIPS., 1996 Biennial Conference of the North American. IEEE, June 1996, pp. 519–523.
[14] Kyung Hee University. 8 Newton's method for minimization. http://web.khu.ac.kr/~tskim/PatternClass%20Lec%20Note%2004-3%20(Newton's%20Method%20-%20Optimization).pdf. Fall 2007.
[15] Jean-Philippe Vert. Nonlinear Optimization: Algorithms 1: Unconstrained Optimization. http://cbio.mines-paristech.fr/~jvert/teaching/2006insead/slides/5_algo1/algo1.pdf. PowerPoint presentation. Spring 2006.
[16] University of Washington, Department of Mathematics. Rates of Convergence and Newton's Method. https://www.math.washington.edu/~burke/crs/408/lectures/L10-Rates-of-conv-Newton.pdf.
[17] Donald R. Jones, Matthias Schonlau and William J. Welch. "Efficient Global Optimization of Expensive Black-Box Functions". In: Journal of Global Optimization 13 (1998), pp. 455–492.
[18] Wikipedia. Evolutionary algorithm. https://en.wikipedia.org/wiki/Evolutionary_algorithm. 2016.
[19] Wikipedia. Metaheuristics. https://en.wikipedia.org/w/index.php?title=Metaheuristic&action=history. 2016.
[20] Wikipedia. Newton's method in optimization. https://en.wikipedia.org/wiki/Newton%27s_method_in_optimization. [Online; accessed 2016]. 2016.
[21] Harlan Crowder and Philip Wolfe. Linear convergence of the conjugate gradient method. http://www.dtic.mil/dtic/tr/fulltext/u2/742126.pdf. May 1972.
Appendices
Appendix A
import scipy as sp
import numpy as np
import math
import Testcases as Tc
import TestcasesGradient as TcG
import Scatterplot2D as Sp2D
import PathContour as PC

# Steepest Descent method with constant alpha. alpha has to be tuned
# to get convergence.
def SteepestDescent(x, f, Ac):
    FuncEval = 0
    GradEval = 0
    alpha = 0.00211
    imax = 10000        # maximum number of iterations (value lost in the
    epsilonCG = 1e-10   # original listing; both values assumed here)
    x = np.array((float(x[0]), float(x[1])))
    fprime = TcG.TestcasesGradientslib
    # *** Initializations ***
    # Gradient
    r = -fprime(x)
    GradEval += 1
    delta_new = np.dot(r, r)
    delta0 = delta_new
    i = 0
    Error = []
    Iterations = []
    CFuncEval = []
    CGradEval = []
    fsol = f(x)
    FuncEval += 1
    # Error measured assuming the true minimum value is 0.
    Error.append(math.log10(abs(fsol)))
    Iterations.append(i)
    CFuncEval.append(FuncEval)
    CGradEval.append(GradEval)
    # Dimensions
    n = 2
    # Vectors of x and y components for drawing the path
    Scatterx = []
    Scattery = []
    Scatterx.append(x[0])
    Scattery.append(x[1])
    # *** Algorithm ***
    while (i < imax) and (delta_new > delta0 * epsilonCG ** 2):
        oldfsol = fsol
        x = x + alpha * r
        r = -fprime(x)
        GradEval += 1
        Scatterx.append(x[0])
        Scattery.append(x[1])
        fsol = f(x)
        FuncEval += 1
        delta_new = np.dot(r, r)
        i += 1
        Error.append(math.log10(abs(fsol)))
        Iterations.append(i)
        CFuncEval.append(FuncEval)
        CGradEval.append(GradEval)
    X = sp.asarray(Scatterx)
    Y = sp.asarray(Scattery)
    Result = []
    Result.append(x)
    Result.append(Error)
    Result.append(Iterations)
    Result.append(CFuncEval)
    Result.append(CGradEval)
    print(math.log10(abs(fsol)))
    print("Solution: %s" % x)
    print("Current function value: %f" % fsol)
    print("Iterations: %d" % i)
    print("Function evaluations: %d" % FuncEval)
    print("Gradient evaluations: %d" % GradEval)
    #Sp2D.Scatterplot2D(Iterations, Error, 'Error vs iteration number',
    #                   'Error', 'Iteration number')
    #PC.ScatterContourplot2D(X, Y, 'Steepest Descent Method', 'y', 'x')
    return Result
Appendix B
The following curves and bar plots are the convergence curves and cumulative
costs per iteration for the gradient based algorithms applied to the
4D, 6D, 8D, 10D and 12D Rosenbrock functions. The gradient based
algorithms used are: Conjugate Gradients, Newton and BFGS. The starting points of
all the algorithms were
1. ~x0 = (1, 2, 1, 2) 4D
It is very important to note that for the Conjugate Gradients and BFGS
algorithms, the number of gradient calls is always equal to the number of function
calls. The computational cost for a given accuracy and function is determined
as follows: first, the number of iterations needed to obtain a certain accuracy
is determined; then, at this iteration number, the cost can be read off from
the cumulative cost per iteration. Note that, as already mentioned in section 2.4,
we consider the costs of function and gradient calls to be equal [4].
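The read-off procedure just described can be expressed as a small lookup over the per-iteration data (a sketch; the argument names are ours):

```python
def cost_for_accuracy(errors, cumulative_costs, target):
    """Return the cumulative number of function/gradient calls at the
    first iteration whose error falls below `target`, mirroring the
    read-off from the convergence and cumulative-cost plots.
    Returns None if the target accuracy is never reached."""
    for error, cost in zip(errors, cumulative_costs):
        if error < target:
            return cost
    return None
```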
(a) Convergence BFGS 4D (b) Convergence Newton 4D
(g) Convergence BFGS 6D (h) Convergence Newton 6D
(m) Convergence BFGS 8D (n) Convergence Newton 8D
(Convergence BFGS 10D) (Convergence Newton 10D)
(y) Convergence BFGS 12D (z) Convergence Newton 12D
Appendix C
In the following, the results of applying the combined algorithms to the
2D and 8D Ackley functions are presented as outputs, giving the coordinates
as well as the function values at the corresponding points. These
coordinates and function values are those reached by the genetic algorithm (used
to reach the attraction region) and by the BFGS, Newton and Conjugate Gradients
algorithms (which were activated once the genetic algorithm reached the attraction
region). Furthermore, the numbers of function/gradient calls of
both the genetic and gradient based algorithms are shown. Finally, the numbers
of generations to convergence for given population sizes of the functions to be
minimized are deduced from the graphs in figures C.3a to C.3c. These
functions are the 2D and 8D Ackley functions and the 2D Rastrigin function.
Please note that generation 0 has to be counted as a generation!
Figure C.1: Obtained coordinates, corresponding function values and required
costs after minimization, with an accuracy of the order of 10^-2, of the 2D Ackley
function with the combined algorithm, using an initial population size of 38 individuals.
Figure C.2: Obtained coordinates, corresponding function values and required
costs after minimization, with unlimited accuracy, of the 8D Ackley function
with the combined algorithm, using an initial population size of 173 individuals.
(a) 2D Ackley: minimum (logarithmic scale) as a function of the generation: best individual per generation
(b) 8D Ackley: minimum (logarithmic scale) as a function of the generation: best individual per generation