
Towards the development of an

efficient optimization strategy


Jawad Bessedjerari

Master 1 project submitted under the supervision of


Prof. dr. ir. Chris Lacor
and the co-supervision of
Ir. Simon Abraham
Academic year
2015-2016
Contents

1 General introduction

2 Gradient based methods
  2.1 Convex functions
  2.2 Gradient based algorithms: generalities
  2.3 Description of some gradient based algorithms
    2.3.1 The Steepest Descent method
    2.3.2 The Conjugate Gradients method
    2.3.3 The Newton method
    2.3.4 Quasi-Newton methods: the BFGS method
  2.4 Efficiency and comparative study
    2.4.1 Introduction
    2.4.2 Efficiency evaluation strategy
    2.4.3 Efficiency evaluation using different n-dimensional Rosenbrock test functions
  2.5 Gradient based optimization: global optimization?

3 Global optimization: genetic algorithms
  3.1 Introduction
  3.2 Genetic algorithms vs Gradient based algorithms
  3.3 Efficient global optimization: combined genetic- and gradient based algorithms

Appendices
  A Python code: The steepest descent method
  B Convergence curves and cumulative cost per iteration for higher dimensional Rosenbrock functions
  C Reached minima, costs and number of generations to convergence for combined algorithms
Chapter 1

General introduction

In many engineering fields, optimization techniques are crucial to achieve design
goals. For example, in the development of commercial aircraft, the aerodynamic
design plays a leading role during the preliminary design stage, in which the
external aerodynamic shape is typically finalized [7]. To be more specific, the
shape design of a wing is optimized so as to minimize the acting drag force [8]. In
practice, this optimization is performed by reducing the acting force as a function
of the design parameters to a mathematical model. This mathematical model
represents an analytical function which in most cases is non-convex. This
category of functions requires global optimization techniques, as many algorithms
tend to be trapped in local optima (see section 2.5). However, a problem encountered
in the field of global optimization is the number of function evaluations. Indeed,
this number is severely limited by time or cost [17], and thus global optimization
in general is made difficult by the long-running computer codes involved.

In this paper, the goal is to address this problem by finding an
efficient optimization strategy that accurately locates the global optimum of a given
analytical function. To this aim, the idea is to combine the properties of both
genetic and gradient based algorithms. The motivation behind this idea is as
follows. Genetic algorithms belong to the class of global optimization methods
and thus have the ability to avoid local optima and to detect where the global
optimum is located. As for the computational cost required to accurately find the
optimum of a convex function, gradient based algorithms are preferred. Indeed,
as previously mentioned, global optimizers require a high number of calls to
capture the (global) optimum of (non-)convex functions accurately. Gradient
based algorithms, however, are not capable of finding the global optimum of non-
convex functions, since they can remain stuck in one of the local optima. Hence,
the relatively low cost of those algorithms to accurately find the minimum of
convex functions on one hand, and the ability of genetic algorithms to find global
optima of non-convex functions on the other hand, can be combined to perform
an efficient optimization. More specifically, the genetic algorithm will be used
to locate and reach the proximity of the global optimum of a very challenging
non-convex function, where the function behaves, in that constrained space, like a
convex function, and then a switch will be made to a gradient based algorithm. It is
clear that, to make this switch, a suitable criterion has to be found. The programming
language used is Python.
The remainder of this paper is structured as follows. The first part treats
the gradient based algorithms. It first gives a theoretical background of some
gradient based algorithms. Next, the issue of the computational cost of those
algorithms is treated. This cost is measured by the total number of calls to the
objective function required to obtain a certain accuracy. This brings us to a
comparison of the efficiency of the different gradient based algorithms, along with
the behaviour of the algorithms applied to convex functions of higher dimensions.
Next, the inability of gradient based algorithms to find global optima of non-convex
functions is shown. In the following part, genetic algorithms are introduced
after a brief introduction to global optimization algorithms. Then the computational
cost of accurately finding the global optimum of convex test functions with genetic
algorithms is evaluated, and the variation of this cost with the number of input
variables is discussed. This cost is compared to the one obtained with gradient based
algorithms, which will show the superiority of the latter in terms of computational
cost over genetic algorithms. In the last part, the combination of both algorithms is
used to accurately capture the global minimum of non-convex test functions. The
cost of this combined algorithm is determined, and we try to establish a
criterion for switching from genetic to gradient based algorithms which applies to
different problems of various dimensions.

Chapter 2

Gradient based methods

2.1 Convex functions


In optimization, a distinction has to be made between different categories of
objective functions. Indeed, this distinction gave rise to a wide range of op-
timization algorithms, each of which is better suited for a specific category of
objective functions. One of the important factors that characterize those cate-
gories is the convexity of the to be optimized functions(see [5] for definition).
Let's now draw our attention to an important feature of convex problems
related to their minima. For this, we’ll define local- and global minima A point
~xmin,local ∈ domf is locally minimal if there exists a neighbourhood δ of ~xmin,local
such that the convex problem f : <n 7−→ < is minimal in this neighbourhood.
A point ~xmin,global is globally minimal if no other function value defined over the
whole domf is lower than f (~xmin,global ). Next, it can be shown that, for convex
problems, any locally minimal point is globally minimal[5]. This feature makes
gradient based algorithms suited for the optimization of convex problems. In
the remainder of this chapter, those algorithms are discussed and accordingly,
the optimizing test functions will be convex functions. Afterwards, as already
specified in the introduction, it will be shown that those algorithms are not suited
for non-convex functions.

2.2 Gradient based algorithms: generalities


Consider a twice continuously differentiable scalar function f. From basic
calculus, we know that the following equation holds at a minimum:

∇f (~x) = 0 (2.1)

Since in most cases this equation is very difficult or even impossible to
solve analytically [15], we have to resort to algorithms that compute a sequence
of points ~x_0, ~x_1, ~x_2, ... so that:

f(~x_{n+1}) < f(~x_n)    (2.2)

Gradient based methods approach the minimum of f(~x) by a sequence of
successive steps in a given direction, defined by the following equation:

~x_{n+1} = ~x_n + α_n · ~p_n    (2.3)

The idea is to choose an initial position ~x_0 and, for each step, walk along a
direction such that [9]:

f(~x_{n+1}) < f(~x_n)
The different methods have various strategies for choosing the search
direction ~p_n and the step length α_n [9]. The success of these methods depends
on effective choices of both of these parameters. A safe choice for the
search direction is a descent direction [15], i.e. one along which the
directional derivative is negative:

p~n · ∇fn < 0 (2.4)

Indeed, it can easily be proved [3] that this condition guarantees that the function
f can be reduced along this direction. Writing the first order Taylor expansion of f
around ~x:

f (~x + α · p~) − f (~x) ≈ α · ∇f (~x) · p~ (2.5)

we notice that for a small α > 0, the negativity of the right hand side guarantees
that a lower function value can be found along p~.
The stop criterion of the algorithms makes use of the optimality condition
and is defined as follows [15]:

||∇f(~x)|| < ε    (2.6)

where ε is a pre-defined error tolerance. Note that the stop criterion
doesn't guarantee a global minimum, since this property also holds at
local minima [15].
2.3 Description of some gradient based algorithms
2.3.1 The Steepest Descent method
2.3.1.1 Search direction
Since we need to choose a descent search direction, the most natural choice
would be the negative gradient [15]:

~p = −∇f(~x)    (2.7)

Indeed, we recall that the gradient of a function is a vector giving the
direction along which the function increases the most, so that in the opposite
direction the function decreases the most (hence the name Steepest Descent) [9].
We also call this direction the residual ~r of the system. The appellation residual
is due to the fact that, for a quadratic problem, it is an indication of
how far we are from the correct value of the minimum [11].

2.3.1.2 Step size


Now that we have our direction, we still need to know how far to walk along it. A
natural choice is to walk until we no longer descend [9]. To accomplish
this, we can perform a line search. A line search is a procedure that chooses α
to minimize f along a direction [11], or in mathematical terms: at step n, choose
α_n such that

f(~x_n + α_n · ~p_n) = min_{α≥0} f(~x_n + α · ~p_n)    (2.8)

Again we know from basic calculus that α minimizes f when the directional
derivative df(~x_n + α · ~p_n)/dα is equal to zero. It can easily be proven that this
implies that the search direction ~d (or in this case the residual ~r) at ~x_{n−1} and
the gradient ∇f at ~x_n have to be orthogonal. Suppose we start at a given point
~x_0 [11]. The first step then writes:

~x_1 = ~x_0 + α_0 · ~r_0    (2.9)

Applying the chain rule to the directional derivative at ~x_1 gives:

df(~x_1)/dα = ∇f(~x_1) · d~x_1/dα = ∇f(~x_1) · ~r_0 = 0    (2.10)
It is obvious that this leads to search directions that are orthogonal to each
other [9], which implies that the method zigzags towards the solution (see
figure 2.1). Note that this behaviour would not be obtained for quadratic
forms with circular contour plots, in which case the steepest descent method
converges in one step and thus doesn't zigzag.

This behaviour is obtained when we find an α that minimizes
the objective function along the search direction. This corresponds to a so called
exact line search. This kind of line search is the best in terms of accuracy, but
not in terms of cost effectiveness. Moreover, this line search is
useful only when the cost of the minimization to find the step size is low compared
to the cost of computing the search direction [15]: otherwise, the computing time
is better spent on finding a better search direction than on finding a precise
minimum along one particular direction. For example, for a nonlinear function it
is much more difficult to perform an exact line search than for a quadratic
function (see [11]). Indeed, in that case a general exact line search needs to be
defined that consists in finding the zeros of ∇f(~x_n) · ~r_{n−1}, or in general of
∇f(~x_n) · ~d_{n−1}, where ~d is a generic search direction. One of the fast algorithms
that can be used for this purpose is the Newton-Raphson algorithm, of which a
detailed description can be found in [11]. This algorithm requires the function f
to be twice continuously differentiable and requires the second derivative of
f(~x_n + α · ~p_n) with respect to α, and hence the Hessian of f, to be computed [11].
If the full Hessian needs to be computed at each iteration, the cost of calculating
the step size is too high compared to the cost of computing the search direction.
An alternative method to perform a line search is the Secant method (see [11]),
which is less costly but also less accurate, since it involves some approximations
compared to the Newton-Raphson method [11]. To illustrate the zigzagging of the
Steepest Descent method towards the solution, figure 2.1 shows the path towards
the minimum of the 2D-Rosenbrock function, which has its minimum located at
~x = (1, 1) (see section 2.4 for the definition). As mentioned above, the full
Hessian and the gradient need to be computed more than once per iteration. Hence,
since calculating the search direction only requires the evaluation of the gradient,
using the Newton-Raphson algorithm to calculate the step size is not justified
(see appendix A).

Figure 2.1: Zigzagging path towards minimum of Rosenbrock function

In the example corresponding to figure 2.1, 654 steps are required in order
to obtain an accuracy in the order of 10−8, and even more iterations would be
needed with an inexact line search, which gives a less accurate estimate of the
step size. This corresponds to a slow convergence of the method in comparison
with the other gradient based optimization algorithms (see paragraph 2.4.3).
The convergence of this algorithm may be slow, but this is due to the poor
choice of the search direction. Although it may sound like a contradiction,
the steepest descent is actually not the best direction along which to walk to
find the minimum. Indeed, as previously explained, this choice of search
direction results in a zigzagging behaviour towards the solution, which in
general causes the method to go back and forth across the valley where the
minimum is located. It would be preferable for the method to travel in a
straighter manner towards the minimum. This is more or less achieved in the
method of Conjugate Gradients, which is an enhancement of the Steepest Descent
method. On the other hand, the method always converges. The rate at which it
converges depends on the starting point and, for a quadratic form, on how
ill-conditioned its matrix is [11]. The simple choice of the search direction
does have an advantage: it requires relatively little computing time. Aside from
the line search for the step size, the only expensive operation per iteration is
the evaluation of the gradient.

Furthermore, the Steepest Descent method doesn't require the step size to be
found with an exact line search, whereas this is a necessary task in other
algorithms that have more accurate search directions. Alternative methods of
finding the step size can therefore even consist in choosing a constant step size,
so that the step size doesn't contribute to the cost of the method in any way.
This alternative belongs to the category of inexact line search methods. Here,
an appropriate constant step size has to be chosen in order for the algorithm
to converge: a too high step size results in divergence of the algorithm, whereas
a too low step size results in too many iterations to achieve a certain accuracy.
The step size was found through trial and error and depends on the objective
function. For finding the minimum of the Rosenbrock function, a quite good step
size to achieve an accuracy in the order of 10−8 is 0.00211. The convergence of
the Steepest Descent method for both line search methods can be visualized in
figure 2.2. Note that in order to make a meaningful comparison, the same starting
point has been chosen.

(a) Exact step (b) Constant step

Figure 2.2: Exact line search vs constant step size

From this figure, it is quite clear that the algorithm converges faster when
using an exact line search. When an appropriate constant step size is chosen,
the algorithm requires 3522 iterations in order to achieve the same accuracy.
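The constant-step variant can be sketched as follows. This is a simplified stand-in for the code of appendix A, using the step size 0.00211 found above by trial and error; the stop criterion on the gradient norm is an assumption of the sketch, whereas the text measures accuracy on the function value.

import numpy as np

def grad_rosenbrock_2d(x):
    # Gradient of f(x) = 100*(x1 - x0^2)^2 + (x0 - 1)^2
    dx0 = -400.0 * x[0] * (x[1] - x[0]**2) + 2.0 * (x[0] - 1.0)
    dx1 = 200.0 * (x[1] - x[0]**2)
    return np.array([dx0, dx1])

def steepest_descent_constant(grad, x0, alpha=0.00211, eps=1e-8,
                              max_iter=100000):
    x = np.asarray(x0, dtype=float)
    n = 0
    while n < max_iter:
        g = grad(x)
        if np.linalg.norm(g) < eps:
            break
        x = x - alpha * g             # p = -grad f(x), equation (2.7)
        n += 1
    return x, n

x_min, n_iter = steepest_descent_constant(grad_rosenbrock_2d, [-0.5, 0.5])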

2.3.2 The Conjugate Gradients method


We recall that in the Steepest Descent method the search directions are
perpendicular to each other, but this causes the method to zigzag towards the
solution, which implies a high number of steps. It would be better if we could
find an α that limits the number of steps to the dimension of the search
space, so as to attain the minimum in a more straightforward manner. In the case
of the minimization of a scalar function of 2 variables (e.g. a quadratic form),
such a choice of α would guarantee that 2 steps are needed for the method to
arrive at the minimum. To this aim, α should be chosen such that the error ~e_{n+1}
is orthogonal to the previous search direction ~d_n, as shown in figure 2.3 [11].

Figure 2.3: Method of orthogonal directions: the error is perpendicular to the previous search direction

Actually, since we do not know ~e_{n+1} a priori, this method is useless as such.
But there is a solution to that. Assume an arbitrary search direction ~d_n. One can
prove that searching for the best α (the one that minimizes f(~x + α · ~d)) is
equivalent to demanding that ~e_{n+1} is A-orthogonal, or conjugate, to ~d_n. Two
vectors ~d_i and ~d_j are conjugate to each other when:

~d_i^T · A · ~d_j = 0    (2.11)

where A is the matrix of the quadratic form. Thus, when we find a minimizing α,
the error ~e_{n+1} will be conjugate to ~d_n [11]. This, together with the fact that
k mutually conjugate vectors are linearly independent, leads to the fact that the
algorithm

~x_{n+1} = ~x_n + α_n · ~d_n

converges in k steps to the minimum of a quadratic form, where k is
the dimension of the search space, taking the ~d_n to be k linearly independent
mutually conjugate vectors and α_n the minimizing step size at each step [11][1].
Note that this algorithm won't be converging in k steps if the function from which
the minimum has to be found is not linear. We shall come back to this point later.
We know until now that there exists an algorithm that computes the minimum
of a quadratic form in k steps, but we actually didn't mention how to build a
set of k mutually conjugate vectors. The eigenvectors of A form a conjugate set
but this would involve solving an eigenvalue problem which can be costly. An
alternative is a variant of the Gramm-Schmidt orthogonalization process which
is called the Conjugate Gram-Schmidt process[11][9]. We will not go into detail
on this(see [11]). The production of a set of k conjugate directions will bring us
further, but it will also require to store all those directions. The solution to this
is to actually conjugate the residuals[11][9]. Recall that we are still developing a
method for the minimization of a quadratic form so that the residual in this case
corresponds with the gradient of the quadratic form. Conjugating the residuals
will imply that we can calculate the conjugate directions by using the residual in
the arrived point at step n and the previous search direction d~n−1 :

d~n = ~rn + βn · d~n−1 (2.12)

where the
~rnT · ~rn
βn = T (2.13)
~rn−1 · ~rn−1
are called the Gram-Schmidt constants. These are defined in the Gram-Schmidt
process mentioned above[11]. The reasoning behind this result can be found
in[11] but here we will not elaborate on it. The use of conjugate residuals imply
also a less costly procedure to create a set of conjugate directions than what
would be expected when looking at the Gram-Schmidt process[11]. As men-
tioned earlier, we have in this case that the residuals are actually the gradients of
the to be minimizing function, hence the name Method of Conjugate Gradients.
Note that in order for the method of the Conjugate Gradients to work, the step
size needs to be found with an exact line search. Until now we have assumed
the function to be minimized to be a quadratic form. For nonlinear functions,
the algorithm is quite the same but there are different choices for calculating βn.
Here we will be using βn to be like in the case of a quadratic form, which is
called the Fletcher-Reeves formula. Let's summarize the algorithm:

Initialize ~d_0 = ~r_0 = −∇f(~x_0)

Conjugate Gradients algorithm:

1. Find α_n that minimizes f(~x_n + α · ~d_n)

2. ~x_{n+1} = ~x_n + α_n · ~d_n

3. ~r_{n+1} = −∇f(~x_{n+1})

4. β_{n+1} = (~r_{n+1}^T · ~r_{n+1}) / (~r_n^T · ~r_n)

5. ~d_{n+1} = ~r_{n+1} + β_{n+1} · ~d_n


Note that for nonlinear functions the algorithm will not converge in k steps,
where k is the dimension of the search space, since the less similar the function
to be minimized is to a quadratic form, the more quickly the search directions
lose conjugacy. We will not elaborate on the proof of this [11]. The solution
is to restart the algorithm after k iterations for better convergence, and also
to restart it whenever a computed search direction is not a descent direction.
The algorithm then stops according to the stop criterion mentioned in
paragraph 2.2. A sketch of the resulting procedure is given below.
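The sketch below implements the nonlinear Conjugate Gradients procedure with the Fletcher-Reeves formula and the restarts just mentioned. Delegating the exact line search to scipy's scalar minimizer is an assumption of this sketch; the text performs it with the Newton-Raphson or Secant method.

import numpy as np
from scipy.optimize import minimize_scalar

def conjugate_gradients(f, grad, x0, eps=1e-8, max_iter=10000):
    x = np.asarray(x0, dtype=float)
    k = x.size                            # restart period = dimension
    r = -grad(x)                          # initial residual
    d = r.copy()                          # initial search direction
    for n in range(max_iter):
        if np.linalg.norm(r) < eps:       # stop criterion (2.6)
            break
        # 1. exact line search along d (here via a 1-D scipy minimizer)
        alpha = minimize_scalar(lambda a: f(x + a * d)).x
        x = x + alpha * d                 # 2. step
        r_new = -grad(x)                  # 3. new residual
        beta = (r_new @ r_new) / (r @ r)  # 4. Fletcher-Reeves formula (2.13)
        d = r_new + beta * d              # 5. new conjugate direction
        # restart every k iterations or when d is not a descent direction
        if (n + 1) % k == 0 or d @ r_new <= 0:
            d = r_new.copy()
        r = r_new
    return x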

2.3.3 The Newton method


Consider a twice continuously differentiable one-dimensional function f and the
following sequence:

x_{n+1} = x_n + ∆x    (2.14)

Another descent direction can be obtained by considering the second order Taylor
expansion of f:

f(x_n + ∆x) ≈ f(x_n) + f′(x_n) · ∆x + ½ · f″(x_n) · ∆x²    (2.15)
As already mentioned, we would like to find ∆x so that

f(x_{n+1}) < f(x_n)

To this aim, the idea is to find a ∆x so that f(x_n + ∆x) is minimal. Analytically,
this can be achieved by differentiating equation 2.15 with respect to ∆x and
setting the result to zero. This yields for ∆x:

∆x = −f′(x_n) / f″(x_n)    (2.16)

This result can be generalized by replacing the derivative with the gradient
∇f(~x_n) and the reciprocal of the second derivative with the inverse of the Hessian
matrix Hf(~x_n) [20], so that one obtains the following sequence:

~xn+1 = ~xn − [Hf (~xn )]−1 ∇f (~xn ) (2.17)

The term −[Hf(~x_n)]⁻¹ ∇f(~x_n) is called the Newton direction. Hence, the step
size here is chosen equal to 1. The Newton algorithm is then the following:

Newton algorithm:

1. ∆~x = −[Hf(~x_n)]⁻¹ ∇f(~x_n)

2. ~x_{n+1} = ~x_n + ∆~x
Note that the objective function is approximated by a quadratic function
around ~xn .
It can be proven that, under a certain set of conditions, this algorithm
converges quadratically [14], which can be mathematically written as:

lim_{k→∞} ||~x_e^(k+1)|| / ||~x_e^(k)||² = C    (2.18)

where ~x_e^(k) = ~x_k − ~x_min is the error at iteration k and C is a constant
called the rate of convergence. It can be proven that the Steepest Descent [12]
and Conjugate Gradients [21] methods converge linearly, which in general implies
that the Newton method converges faster than the previous methods [14].
Note that another difference with the previous 2 methods is that this method
requires evaluating second order derivatives and inverting the Hessian matrix,
which comes down to solving a system of linear equations for the Newton step:

[Hf(~x_n)] · ∆~x = −∇f(~x_n)    (2.19)

The Newton algorithm provided by the Python libraries makes use of the Conjugate
Gradients iterative method to solve this system of linear equations.
Note that the algorithm will converge only if the Hessian matrix is
positive definite [20]. The Newton algorithm may converge fast, but the
second order derivative calculations and the inversion of the Hessian
matrix can be really expensive, especially if the objective function has a
large number of variables [20]. That's why there also exist various quasi-Newton
methods, where an approximation of the Hessian matrix is built up from changes
in the gradient. One of the most popular methods among them is covered in the
next paragraph, after the sketch of the Newton iteration given below.
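A minimal sketch of the Newton iteration (2.17), solving the linear system (2.19) with a direct solver; this is an assumption of the sketch, since the Python library routine referred to above instead solves the system iteratively with Conjugate Gradients.

import numpy as np

def newton(grad, hess, x0, eps=1e-8, max_iter=100):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < eps:       # stop criterion (2.6)
            break
        # Solve Hf(x) * dx = -grad f(x) instead of inverting the Hessian
        dx = np.linalg.solve(hess(x), -g)
        x = x + dx                        # step size fixed to 1
    return x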

2.3.4 Quasi-Newton methods: the BFGS method


As mentioned above, the BFGS method is a quasi-Newton method, which requires
only the gradient of the objective function (like the Steepest Descent method)
to be computed at each iteration. Besides, a high convergence speed can be
achieved, as with the Newton method, so that the advantages of both previous
methods are combined. Here, the Hessian matrix is approximated and built up
from changes in the gradient at each iteration, which is a huge advantage
compared to the Newton method, especially if the objective function has a large
number of variables.

As for the algorithm, a reasoning analogous to that of the Newton method is
used, so that we can write:

~x_{n+1} = ~x_n − α_n · B_n^{-1} · ∇f(~x_n)    (2.20)

where

B_n ≈ Hf(~x_n)    (2.21)

Note that here the step size is not chosen equal to 1 as in the Newton
algorithm; hence, a line search is performed.

The construction of the approximation of the Hessian matrix can be found
in [2]. For the sake of understanding the mathematics underlying the BFGS
algorithm, a brief summary of this construction is given, after which the
algorithm is described. Instead of computing a completely new B_n^{-1} at each
iteration (B_n in the case of the DFP method), information about the gradient
from the previous step is used to update it. In order to construct the B_n^{-1}
matrix, some conditions relating the approximated function at step n to the
gradient from the previous step n − 1 have to be imposed. By imposing those
conditions, we can derive the Secant equation, from which B_n^{-1} can be
computed. The form of the Secant equation in the case of the BFGS method is the
following:

α_{n−1} · B_n · ~d_{n−1} = ∇f(~x_n) − ∇f(~x_{n−1})    (2.22)

where α_{n−1} is the step size at iteration n − 1 and
~d_{n−1} = −B_{n−1}^{-1} · ∇f(~x_{n−1}) the search direction at iteration n − 1.
In order for the Secant equation to have at least one solution B_n, an additional
condition must be satisfied, known as the curvature condition; it is obtained by
multiplying the Secant equation by [α_{n−1} · ~d_{n−1}]^T and imposing the result
to be greater than 0. This condition always holds if the Wolfe, or strong Wolfe,
conditions on the line search procedure are satisfied. Since the Secant equation
is satisfied by infinitely many symmetric positive definite matrices B_n, a
further condition is imposed to ensure the uniqueness of the latter, which
states that B_n^{-1} has to be as close as possible to B_{n−1}^{-1} in some
sense. Choosing this closeness in the sense of the so called weighted Frobenius
norm, we get the following unique update of B_n^{-1}:

B_n^{-1} = (I − ρ_{n−1} · ~s_{n−1} · ~y_{n−1}^T) · B_{n−1}^{-1} · (I − ρ_{n−1} · ~y_{n−1} · ~s_{n−1}^T) + ρ_{n−1} · ~s_{n−1} · ~s_{n−1}^T    (2.23)

where

ρ_{n−1} = 1 / (~y_{n−1}^T · ~s_{n−1})    (2.24)

~s_{n−1} = ~x_n − ~x_{n−1} = α_{n−1} · ~d_{n−1}    (2.25)

~y_{n−1} = ∇f(~x_n) − ∇f(~x_{n−1})    (2.26)


After some linear algebra and by use of the so called Sherman-Morrison-Woodbury
formula, we can also write the approximation of the Hessian matrix B_n at
iteration n:

B_n = B_{n−1} − (B_{n−1} · ~s_{n−1} · ~s_{n−1}^T · B_{n−1}) / (~s_{n−1}^T · B_{n−1} · ~s_{n−1}) + (~y_{n−1} · ~y_{n−1}^T) / (~y_{n−1}^T · ~s_{n−1})    (2.27)

Here we see the fundamental idea of quasi-Newton methods: the approximated
Hessian (and its inverse) is recomputed at each iteration by adding two matrices
to the approximated Hessian of the previous step. These two matrices depend on
changes in the gradient, as mentioned at the beginning of the section and as can
be seen in the two previous equations.

The initial value for the matrix B_0^{-1} can be selected in different ways, one
of which is to choose the identity matrix I. The BFGS algorithm provided by the
Python libraries uses this initial value for the inverted approximated Hessian.
Finally, the algorithm may be summarized as follows:

Initialize B_0^{-1} = I

BFGS algorithm:

1. ~d_n = −B_n^{-1} · ∇f(~x_n)

2. Line search for α_n

3. Update ~x_{n+1} = ~x_n + α_n · ~d_n

4. ~s_n = ~x_{n+1} − ~x_n = α_n · ~d_n

5. ~y_n = ∇f(~x_{n+1}) − ∇f(~x_n)

6. ρ_n = 1 / (~y_n^T · ~s_n)

7. Update B_{n+1}^{-1} = (I − ρ_n · ~s_n · ~y_n^T) · B_n^{-1} · (I − ρ_n · ~y_n · ~s_n^T) + ρ_n · ~s_n · ~s_n^T

As for the convergence of the BFGS method, it can be said that it converges
superlinearly, which is much faster than linear convergence but slower than
quadratic convergence. Hence, although the BFGS algorithm is cheaper per
iteration than the Newton algorithm, it converges more slowly towards a
minimum [16]. A sketch of one such BFGS iteration is given below.
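This sketch implements the inverse Hessian update (2.23) directly; the use of scipy's line_search helper, which enforces the Wolfe conditions mentioned above, is an assumption of the sketch.

import numpy as np
from scipy.optimize import line_search

def bfgs(f, grad, x0, eps=1e-8, max_iter=1000):
    x = np.asarray(x0, dtype=float)
    n = x.size
    B_inv = np.eye(n)                     # B_0^{-1} = I
    I = np.eye(n)
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < eps:       # stop criterion (2.6)
            break
        d = -B_inv @ g                    # search direction (step 1)
        alpha = line_search(f, grad, x, d)[0]   # Wolfe line search (step 2)
        if alpha is None:                 # line search failed: fall back
            alpha = 1e-3
        x_new = x + alpha * d             # step 3
        g_new = grad(x_new)
        s = x_new - x                     # step 4
        y = g_new - g                     # step 5
        rho = 1.0 / (y @ s)               # step 6
        # inverse Hessian update (2.23), step 7
        B_inv = (I - rho * np.outer(s, y)) @ B_inv @ (I - rho * np.outer(y, s)) \
                + rho * np.outer(s, s)
        x, g = x_new, g_new
    return x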

2.4 Efficiency and comparative study
2.4.1 Introduction
In this section, a comparison will be made between gradient based algorithms
on the basis of their efficiency. The efficiency is evaluated by the number
of function and gradient calls the algorithm performs, which determines the
computational cost. The algorithms compared with each other are the Steepest
Descent, the Conjugate Gradients, the BFGS and the Newton algorithm. Furthermore,
the test functions on which the algorithms will be tested are the 2-, 4-, 6-, 8-,
10- and 12-dimensional Rosenbrock functions.

2.4.2 Efficiency evaluation strategy


As we will see later, an important practical tool to evaluate an algorithm's
efficiency is its convergence curve. The convergence curve represents the
variation of an error as a function of the iteration number. Here the error
is defined as the logarithm of the absolute value of the difference between
the function value at the nth iteration and the exact minimum of the function to
be minimized:

Error = log10 |fmin − fn|    (2.28)

The logarithmic scale makes variations of the error at very low values visible.

Note that a numerical algorithm cannot find the exact minimum of a function,
only an accurate estimate of it. The required accuracy depends on the context in
which the algorithm is used to find the minimum of the concerned function and is
relevant for the determination of the computational cost. Indeed, the accuracy
is used as a criterion to stop searching for the minimum and thus influences the
number of iterations that need to be performed, which in turn determines the
total computational cost for finding the (estimated) minimum. This number of
iterations as a function of the accuracy can be deduced from the convergence
curves. The following example illustrates this. Let's look at the convergence
curves generated when applying the Newton and BFGS algorithms to find the
minimum of the 4D Rosenbrock function.
The n-dimensional Rosenbrock function is defined as follows:
f(~x) = Σ_{i=1}^{d−1} [100 · (x_{i+1} − x_i²)² + (x_i − 1)²]    (2.29)

where d stands for the number of variables. Moreover, the function has a global
minimum fmin,global = 0 located at ~xmin,global = (1, 1, ..., 1).
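This test function, together with the call counting used for the cost measurements below, can be sketched in Python as follows; the CountingFunction wrapper is a hypothetical helper for illustration, not part of the code of the appendices.

import numpy as np

def rosenbrock(x):
    # n-dimensional Rosenbrock function, equation (2.29)
    x = np.asarray(x, dtype=float)
    return np.sum(100.0 * (x[1:] - x[:-1]**2)**2 + (x[:-1] - 1.0)**2)

class CountingFunction:
    """Wrap a function and count how often it is evaluated."""
    def __init__(self, f):
        self.f, self.calls = f, 0
    def __call__(self, x):
        self.calls += 1
        return self.f(x)

f_counted = CountingFunction(rosenbrock)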

Please note that in order for the comparison to make sense, the same starting
point has to be chosen for both algorithms. Furthermore, a 'good' choice of the
starting point is important, since the algorithm can become unstable for some
starting points. To make this choice, we rely on trial and error, or on
recommended starting points based on experience. Here the starting point is
chosen as ~x0 = (1, 2, 1, 2).

Figure 2.4: Convergence curves and cumulative function- and gradient calls per iteration for the BFGS and Newton methods ((a) BFGS convergence curve, (b) Newton convergence curve, (c) BFGS function- and gradient calls, (d) Newton function- and gradient calls)

As can be seen in figures 2.4a and 2.4b, BFGS converges faster than the Newton
algorithm: BFGS requires 21 iterations to reach an accuracy in the order of
approximately 10−8, whereas Newton requires 27 iterations. Furthermore, BFGS
requires 33 function evaluations and 33 gradient evaluations over those 21
iterations, whereas Newton requires 41 function evaluations and 199 gradient
evaluations over its 27 iterations. Since gradient evaluations can be done
efficiently using the so called 'Adjoint Method' [4], the costs of function and
gradient evaluations are considered similar. Hence, with a total of 70
evaluations for the BFGS method and 240 evaluations for the Newton method, it
can be concluded that, to achieve an accuracy in the order of 10−8, the Newton
method is more expensive.
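Such a comparison can be reproduced with the scipy optimizers referred to in the text. The sketch below uses scipy's built-in Rosenbrock helpers (an assumption of the sketch) and reads the evaluation counts from the nfev and njev fields of the result object.

import numpy as np
from scipy.optimize import minimize, rosen, rosen_der, rosen_hess

x0 = np.array([1.0, 2.0, 1.0, 2.0])      # starting point used above

res_bfgs = minimize(rosen, x0, jac=rosen_der, method='BFGS',
                    options={'gtol': 1e-8})
res_newton = minimize(rosen, x0, jac=rosen_der, hess=rosen_hess,
                      method='Newton-CG', options={'xtol': 1e-8})

for name, res in [('BFGS', res_bfgs), ('Newton-CG', res_newton)]:
    print(name, 'iterations:', res.nit,
          'f calls:', res.nfev, 'grad calls:', res.njev)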

2.4.3 Efficiency evaluation using different n-dimensional Rosenbrock test functions
In what follows, the same approach as in the previous paragraph will be used
to compare the efficiencies of the 4 algorithms described above. The objective
functions will have different numbers of variables, so as to check whether some
algorithms are better suited than others when the number of variables is higher.
To this aim, we will use the higher dimensional Rosenbrock functions, and the
desired accuracies will be in the order of 10−2, 10−4 and 10−8.
For the sake of illustration, a step by step description of the efficiency
evaluation of the gradient based algorithms will be given using the 2D Rosenbrock
function for different accuracies. As already mentioned in the previous paragraph,
the starting point for the different algorithms has to be the same and is here
chosen as ~x0 = (−0.5, 0.5).
Based on the convergence curves of BFGS, Newton, Conjugate Gradients
and Steepest Descent generated when applying those algorithms to the 2D-
Rosenbrock function, shown in figure 2.5, table 2.1 shows the number of
iterations performed to achieve an accuracy of 10−2, 10−4 and 10−8. The
number of iterations for the Steepest Descent method is approximate but can be
determined precisely by passing the order of accuracy to the algorithm (see
appendix A).

                      10−2    10−4    10−8
Newton                  31      39      42
BFGS                    22      26      28
Conjugate Gradient      23      26      29
Steepest Descent       600    1400    3500

Table 2.1: Required iterations for different accuracies

Taking into account table 2.1, the computational costs can be deduced from
figure 2.6.

Figure 2.5: Convergence to the minimum of the 2D-Rosenbrock function starting at point ~x0 = (−0.5, 0.5) ((a) BFGS, (b) Newton, (c) Conjugate Gradients, (d) Steepest Descent convergence curves)

                      10−2    10−4    10−8
Newton                 200     264     284
BFGS                    74      84      88
Conjugate Gradient     122     130     144
Steepest Descent      1200    2800    7000

Table 2.2: Total computational cost per algorithm for different accuracies

Note that the total cost for the Steepest Descent method is, like its number of
iterations, approximate, but it can also be determined exactly by passing the
order of accuracy to the algorithm (see appendix A). Besides, a closer look at
the algorithm shows that the number of calls is perfectly proportional to the
number of iterations.

Figure 2.6: Cumulative function- and gradient calls per iteration ((a) BFGS, (b) Newton, (c) Conjugate Gradients, (d) Steepest Descent)

Based on those values, a ranking of the different algorithms as a function of
the desired accuracy is presented in table 2.3. The ranking is made with the
help of the numbers 1, ..., 4, where 1 represents the best and 4 the worst.

                      10−2    10−4    10−8
Newton                   3       3       3
BFGS                     1       1       1
Conjugate Gradient       2       2       2
Steepest Descent         4       4       4

Table 2.3: Ranking as function of desired accuracy

As can be concluded, BFGS is the best suited in terms of computational cost,
followed by Conjugate Gradients, Newton and Steepest Descent. The Steepest
Descent method converges very slowly compared to the other algorithms, which is
the cause of its low performance.
An analogous ranking can be made for the higher dimensional Rosenbrock test
functions and is illustrated in table 2.4. From this table it can be concluded
that, unlike in the 2D case, Conjugate Gradients becomes less interesting than
Newton, whereas BFGS is in all cases the most efficient algorithm. This is
because for higher dimensional test functions Newton converges at a rate similar
to BFGS, which is faster than Conjugate Gradients, but is on the other hand more
expensive per iteration than BFGS and Conjugate Gradients. Hence, in some cases,
the difference in required iterations between Newton and Conjugate Gradients
outweighs the difference in computational cost per iteration. The Steepest
Descent method is not used in higher dimensions because of the time consuming
determination of an appropriate step size for each variant of the Rosenbrock
function. The corresponding convergence curves and bar plots for the cumulative
function- and gradient calls can be found in appendix B.


4D                    10−2    10−4    10−8
Newton                   2       2       2
BFGS                     1       1       1
Conjugate Gradient       3       3       3

6D                    10−2    10−4    10−8
Newton                   2       2       2
BFGS                     1       1       1
Conjugate Gradient       3       3       3

8D                    10−2    10−4    10−8
Newton                   3       3       2
BFGS                     1       1       1
Conjugate Gradient       2       2       3

10D                   10−2    10−4    10−8
Newton                   3       3       2
BFGS                     1       1       1
Conjugate Gradient       2       2       3

12D                   10−2    10−4    10−8
Newton                   2       2       2
BFGS                     1       1       1
Conjugate Gradient       3       3       3

Table 2.4: Ranking for higher dimensional Rosenbrock test functions

2.5 Gradient based optimization: global optimization?
Until now we considered only convex test functions. In this section, we will try
gradient based algorithms on non-convex test functions and show that they are
not suited for this category of functions (see [5] for a definition). An example
of such a function is the so called Ackley function, defined as follows:
f(~x) = −20 · exp(−0.2 · √((1/d) · Σ_{i=1}^{d} x_i²)) − exp((1/d) · Σ_{i=1}^{d} cos(2πx_i)) + 20 + e    (2.30)

where d stands for the number of variables. The two-dimensional form is
visualized in figure 2.7. Moreover, the function has a global minimum
fmin,global = 0 located at ~xmin,global = (0, 0, ..., 0).
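A Python sketch of this test function, under the definition in equation (2.30):

import numpy as np

def ackley(x):
    # d-dimensional Ackley function (2.30)
    x = np.asarray(x, dtype=float)
    d = x.size
    term1 = -20.0 * np.exp(-0.2 * np.sqrt(np.sum(x**2) / d))
    term2 = -np.exp(np.sum(np.cos(2.0 * np.pi * x)) / d)
    return term1 + term2 + 20.0 + np.e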

Recall that for convex functions any local minimum coincides with the global
minimum. In case the objective function is non-convex, local minima exist that
do not coincide with the global minimum.

(a) Surface plot Ackley (b) Contour plot Ackley

Figure 2.7: The Ackley function

Since this function is non-convex and, because of its complexity, very
challenging to optimize, we will use its 2-dimensional variant and plot the
paths followed by the algorithms in the search space, starting from an initial
point (the same for each algorithm) that is not in the attraction region. The
attraction region is the set of points that includes the global minimum and
where the function behaves like a convex function. The paths followed by the
algorithms are nearly identical, so one path suffices to show that the
algorithms aren't able to capture the global minimum. The path of the BFGS
algorithm, starting at point ~x0 = (2.4, 2.4), can be visualized in figure 2.8a.
As can be seen, the algorithm stops when it arrives at a local minimum, which
happens to be the point with coordinates (−0.96847759, −0.96847759). This is an
expected result: as mentioned in section 2.2, the stop criterion is met when,
for a specified ε, the gradient of the function has become less than ε, which
means that the function is locally minimal there (cf. the definition in
section 2.1). Let's now try starting points in the vicinity of the global
minimum, for example ~x0 = (0.2, 0.5). Again, the paths followed by all the
algorithms are nearly the same, so we only consider the path of the BFGS
algorithm. Figure 2.8b clearly shows that the global minimum has been attained.

Figure 2.8: Gradient based algorithms only capture the global minimum when the starting point is in the attraction region ((a) BFGS method stuck in a local minimum, (b) BFGS method attaining the global minimum)
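This behaviour can be reproduced with a few lines of scipy, assuming the ackley function sketched in the previous code block: started outside the attraction region the optimizer stalls in a local minimum, started inside it the global minimum at the origin is found.

from scipy.optimize import minimize

# Starting points from the text: outside and inside the attraction region
for x0 in ([2.4, 2.4], [0.2, 0.5]):
    res = minimize(ackley, x0, method='BFGS')
    print('start', x0, '-> found', res.x, 'with f =', res.fun)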

Based on the above examples it can be concluded that our gradient based
algorithms are not capable of capturing the global minimum of non-convex
functions. Since the higher dimensional Ackley functions are non-convex as well,
the same conclusion holds for them. Hence, gradient based algorithms can only
minimize convex functions and are not suited for global optimization. We will
thus have to resort to global optimization techniques, which are covered in the
next chapter.

Chapter 3

Global optimization: genetic algorithms

3.1 Introduction
Global optimization techniques make sure the global minimum of non-convex
functions can be found without being trapped in local minima, thanks to the
fact that global optimizers search for the minimum in the whole solution space.
However, as already mentioned in the introduction (chapter 1), global optimizers
require a high number of function evaluations and hence are too costly. This is
why metaheuristic strategies will be used to search the solution space in a more
or less intelligent way. Metaheuristic algorithms were developed specifically to
find a solution that is "good enough" in a computing time that is "small
enough". Specifically, metaheuristics sample a set of solutions that is too
large to be sampled completely [6][19]. This is certainly of interest here,
since the search space grows strongly at higher dimensions. Besides, since the
idea is to fuse global optimization with gradient based optimization, the
solution provided by the global optimizer does not need to be accurate. An
example of such a metaheuristic method is the evolutionary algorithm.

An evolutionary algorithm is a generic population-based metaheuristic
optimization algorithm that uses mechanisms inspired by biological evolution,
such as reproduction and recombination (crossover), mutation, and selection.
Candidate solutions to the optimization problem play the role of individuals in
a population, and the fitness function, which is the objective function to be
minimized, determines the quality of the solutions (here, the lower the function
value, the better the solution). Evolution of the population then takes place
through the repeated (iterative) application of the above operators. The most
popular type of evolutionary algorithm is the genetic algorithm [18]. Genetic
algorithms are preferred over other metaheuristic methods because they use both
crossover and mutation operators, which make the population more diverse and
thus less prone to being trapped in local optima.
The genetic algorithm iterates over the generated individuals in each
generation; a sketch of the procedure is given at the end of this section.
First, an initial population of a given number of randomly chosen individuals
is created. Then, in each iteration (or generation), individuals are selected to
breed a new generation, recombined in the crossover process and mutated. The
selection process groups the population into sets of 3 randomly chosen
individuals, selects from each group the one with the lowest function value, and
deletes the other 2. In the crossover process, the chromosomes (here the
coordinates) of a given percentage of the individuals are recombined in the hope
that the offspring will have a better fitness. Mutation randomly changes a given
percentage of the new offspring and is meant to prevent falling into local
extrema: those changes bring forth less fit solutions, which ensures genetic
diversity. The percentages of the population undergoing mutation and crossover
will be held constant at 20 and 50 percent respectively. The algorithm
terminates when a fixed number of generations is reached. Furthermore, the
search space in which the genetic algorithm chooses its population is, for 2D
functions, the square defined by the points (−5, −5), (−5, 5), (5, −5) and
(5, 5), i.e. a square around the origin whose sides are at distance 5. For
functions of higher dimension a similar domain is chosen: the cube around the
origin whose faces are at distance 5 for 3D, and so on.
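A minimal sketch of this procedure is given below, as referenced above. The concrete crossover and mutation operators (coordinate mixing, Gaussian perturbation of scale 0.5) and the refilling of the population after selection are simplified assumptions of the sketch, not the actual implementation.

import numpy as np

rng = np.random.default_rng()

def genetic_algorithm(f, d, pop_size, generations, p_mut=0.2, bound=5.0):
    # initial population drawn uniformly from the hypercube [-bound, bound]^d
    # (assumes pop_size >= 3)
    pop = rng.uniform(-bound, bound, size=(pop_size, d))
    for _ in range(generations):
        # tournament selection: groups of 3, keep the fittest of each group
        rng.shuffle(pop)
        groups = pop[:len(pop) // 3 * 3].reshape(-1, 3, d)
        pop = np.array([g[np.argmin([f(ind) for ind in g])] for g in groups])
        # crossover: recombine the coordinates of random pairs of survivors
        # until the population is back to its original size
        children = []
        while len(pop) + len(children) < pop_size:
            a, b = pop[rng.integers(len(pop), size=2)]
            mask = rng.random(d) < 0.5
            children.append(np.where(mask, a, b))
        pop = np.vstack([pop, children])
        # mutation: randomly perturb a fraction p_mut of the population
        for i in rng.integers(len(pop), size=int(p_mut * len(pop))):
            pop[i] = np.clip(pop[i] + rng.normal(scale=0.5, size=d),
                             -bound, bound)
    best = min(pop, key=f)
    return best, f(best)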

3.2 Genetic algorithms vs Gradient based algorithms
In the following, a cost comparison will be made between gradient based
algorithms and genetic algorithms for the minimization of a convex function, the
2D-Rosenbrock function. For this, we will first determine the most efficient way
to minimize this function with the genetic algorithm, which strongly depends on
the initial population size. Afterwards, the cost comparison for minimizing
higher dimensional functions will be discussed.

It is clear that the larger the initial population size, the more information
the algorithm has to find the global optimum. Because of that, the accuracy of
the genetic algorithm depends on the initial population size. This is
illustrated by applying the genetic algorithm to the 2D-Rosenbrock function (see
figure 3.1a). Note that the error is defined as in paragraph 2.4. Although this
curve fluctuates strongly, it generally shows a decreasing trend for an
increasing population size.

(a) Accuracy as function of population (b) Cost as function of population

Figure 3.1: Accuracy and cost as function of initial population size. Note: pop-
ulation varies from 50 to 980 individuals

Important to mention is that the initial population size influences the number
of generations to convergence. Indeed, the higher the population size, the more
generations are needed to converge, because more mutations will occur for a
given mutation probability. As mentioned in the introduction, mutation makes the
population more diverse, which slows down the genetic algorithm [10]. This thus
has an increasing effect on the cost of the algorithm. A general way (applicable
to all problems suited to genetic algorithms) to determine an optimal initial
population size, for which the benefits in accuracy and in number of generations
to convergence, and thus cost, balance, can be found in [10], but will not be
applied here. What we will do is find a minimal population size to obtain an
accuracy in the order of 10−2. As can be seen in figure 3.1a, this accuracy is
obtained for an initial population size of at least 690 individuals. In figure
3.1b the computational cost as a function of the initial population size is
shown. Like the accuracy curve, this cost curve fluctuates strongly, but it
displays in general an increasing trend for an increasing population size. This
cost was determined by running the algorithm for 100 generations, deducing the
number of generations to convergence by plotting the minimum as a function of
the generation, and finally running the algorithm for the deduced number of
generations. We see that this initial population size corresponds to a minimal
cost (since for higher populations the same accuracy can be obtained) of around
4500 function calls. Comparing this result with the computational cost to obtain
the same accuracy with the BFGS algorithm (see table 2.2, section 2.4), which
requires only 74 function evaluations, we can conclude that for minimizing the
convex 2D Rosenbrock function the BFGS algorithm is far more efficient than the
genetic algorithm.

As for the higher dimensional Rosenbrock functions, it is obvious that the
search space extends rapidly as the dimension increases. It is then evident that
the initial population size has to be larger if we desire the same accuracy. As
already seen for the 2D-Rosenbrock function, and as can be concluded in general,
an increasing initial population size in turn increases the algorithm's cost. We
can thus conclude that, as the dimension increases, the number of function calls
to obtain an accuracy in the order of 10−2 can only increase. As a
demonstration, consider the 6D-Rosenbrock function. As already mentioned in
paragraph 2.4, the minimum function value of this function is equal to 0. In
this case, for an initial population size of 1000 individuals, the algorithm
requires 13706 function calls to finally find a point where the function value
equals 9.38 (determined as described above). This is in contrast with the result
for finding the 2D Rosenbrock minimum, where for initially 1000 individuals 5738
function calls are required, and the found point corresponds to a function value
of 0.00471, which is in the order of 10−2. For the sake of completeness, it is
worth mentioning that the BFGS algorithm requires 60 function evaluations to
obtain an accuracy in the order of 10−2 when searching for the minimum of the
6D-Rosenbrock function (see appendix B).

We can thus conclude in general that gradient based algorithms are far more
efficient than genetic algorithms for minimizing convex functions. As mentioned
earlier, genetic algorithms however are capable of finding global minima of
non-convex functions. But if high accuracies are required, the population size
has to be raised strongly, which is too costly, as we saw in the examples above.
Besides, after a certain point, the cost of raising the population further
outweighs the benefit of obtaining a more accurate result [10]. In that case, a
switch to gradient based algorithms can be made to further minimize the function
to the desired accuracy. This will be investigated in the next section.

3.3 Efficient global optimization: combined genetic- and gradient based algorithms
In the following, we will try to find an efficient strategy to accurately find
the global minimum of non-convex functions. To this aim, the Ackley function
(see section 2.5 for the definition) will serve as test function. The genetic
algorithm will be used to reach the vicinity of the global minimum, after which
the switch will be made to a gradient based algorithm to capture the minimum
accurately. By 'vicinity' it is meant that the attraction region of the global
minimum has to be reached. Indeed, as already mentioned in section 2.5, the
function behaves like a convex function in this attraction region. The problem
then lies in determining criteria that tell us that the genetic algorithm has
reached the attraction region so that we can switch to a gradient based
algorithm. These criteria are in fact conditions on the population size and the
number of generations. The number of generations obviously has to be the one for
which the genetic algorithm converges for the given initial population size. For
a general optimization problem the number of generations to convergence is not
known; what we do know is that once the convergence point is reached, the
minimum doesn't change anymore. Hence, the genetic algorithm will be stopped
when 2 successive solutions are the same, as in the sketch below.
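This switching strategy can be sketched as follows, assuming a hypothetical evolve generator that yields the best individual of each generation of the genetic algorithm sketched in section 3.1:

import numpy as np
from scipy.optimize import minimize

def combined_optimizer(f, evolve):
    previous = None
    for best in evolve:                  # best individual per generation
        if previous is not None and np.allclose(best, previous):
            break                        # GA converged: make the switch
        previous = best
    # gradient based refinement from the point reached by the GA
    return minimize(f, best, method='BFGS')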

As for the choice of the population size, a first try can be to determine the
optimal population size for which the number of generations to convergence and
the accuracy are balanced (see [10]). A simpler way to choose an initial
population size is the rule of thumb whereby the population size equals 10 times
the number of input variables of the objective function [13].

Consider the 2D-Ackley function. In figure 3.2a we see that the genetic
algorithm didn't succeed in reaching the attraction region with an initial
population size of 20 (10 times the dimension), since all the gradient based
algorithms (Newton, Conjugate Gradients and BFGS) remain stuck in a local
minimum. As in section 2.5, all the gradient based algorithms follow more or
less the same path, so only the BFGS algorithm is shown.

Figure 3.2: Global minimum search failing and succeeding for 20 and 38 individuals respectively ((a) gradient based algorithms stuck in a local minimum, (b) gradient based algorithm finding the global minimum)

It is clear that for the 2D-Ackley test function the above rule of thumb cannot
be applied. In the following, we investigate whether the combined algorithm
converges to the global minimum for higher population sizes. This is done by
simply running the combined algorithm for 21, 22, ... individuals until we find
a population size for which a point in the attraction region is found. For each
run, the contour plot is drawn, which provides a visualization of whether the
genetic algorithm converges to the attraction region. If a contour plot is not
available (as will be the case for higher dimensional functions), we can tell
that the attraction region has not been reached by looking at the point attained
by the gradient based algorithm: if the function value of this point is not
close to the minimum (with an error of at least the order 10−2), the attraction
region has not been reached. In figure 3.2b we see that a point in the
attraction region is found for an initial population of 38, and consequently the
combined algorithm converges accurately to the global minimum. Again, we show
only the BFGS algorithm. Note that this is also the case for populations larger
than 38. We can thus conclude that, for the 2D Ackley function, the population
size has to be approximately 20 times the number of input variables. This factor
can be a consequence of the fact that the Ackley function is extremely complex.
As for the cost, to obtain an accuracy in the order of 10−2 the Conjugate
Gradients algorithm is the best, requiring 14 function/gradient evaluations,
whereas the Newton and BFGS algorithms require 27 and 24 function/gradient
evaluations respectively. Of course the cost of the genetic algorithm should not
escape our attention: it requires 89 function evaluations to attain the
attraction region. Besides, when plotting the best individual (the one with the
lowest function value) per generation, we see that the number of generations to
convergence is equal to 2 (see appendix C). This gives for the Conjugate
Gradients algorithm a total computational cost of 113 function evaluations (see
appendix C). Comparing this cost to the cost of the genetic algorithm alone for
finding the minimum, with an accuracy in the order of 10−2, of a less complex
function, the 2D Rosenbrock function, we can conclude that this combined
algorithm is far more efficient.

The question still is: can we conclude this in general for other problems? And
is this rule applicable for functions with more input variables? Let's take a
look at the Rastrigin function, which is also a very challenging function to
optimize and is defined as follows (see figure 3.3 for the surface and contour
plots):

f(~x) = 10 · d + Σ_{i=1}^{d} (x_i² − 10 · cos(2πx_i))    (3.1)

where d stands for the number of input variables. This function has a minimum
of 0 at the point ~xmin = (0, ..., 0). For this test the 2D variant of the
function will be used.

(a) Surface plot Rastrigin (b) Contour plot Rastrigin

Figure 3.3: The Rastrigin function

Please note that from now on no desired accuracy will be mentioned, as we are
interested in investigating the criterion for which the combined algorithm
converges, namely the initial population size. Proceeding as before to find the
minimal population size that attains the attraction region, the combined
algorithm is run for increasing population sizes. In this case the combined
algorithm converges from an initial population size of 213 individuals onwards
(requiring 4 generations to converge, see appendix C), which is far more than
what the Ackley function requires. Instead of 10 times the number of input
variables, we thus need approximately 100 times the number of input variables.
As for the Ackley function, this could be due to the complexity of the Rastrigin
function.

Let's now look at the higher dimensional Ackley functions. As explained in
section 3.2, the search space extends as the dimension increases, so a larger
initial population size is required to obtain a desired accuracy, and in particular
to attain the attraction region. When checking the original rule of thumb on
the 8D Ackley function, we conclude that an initial population of 80 is not
sufficient to attain the attraction region, since the combined algorithm does not
converge to the global minimum. It is from an initial population size of 173
individuals onwards that the combined algorithm converges to the global minimum,
after 14 generations. See [C] for the exact coordinates, the minima attained by
the genetic and combined algorithms, and the number of generations to convergence.
Note that this number of individuals is very close to 20 times the number of
input variables, as for the 2D Ackley function. Hence, we shall say that for the
n-dimensional Ackley problem, the combined algorithm converges for an initial
population size of 20 times the dimension.
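
Under the same assumptions, this rule could be checked with the sweep sketched earlier; the call below is purely illustrative and relies on the hypothetical ackley and minimal_population_size sketches above:

# Hypothetical check of the 20-times-the-dimension rule for 8D Ackley.
pop_size_8d = minimal_population_size(ackley, dim=8)
print(pop_size_8d, "individuals vs the rule of thumb:", 20 * 8)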

It can thus be concluded that the rule of thumb of taking 10 times the number
of input variables does not apply to all problems. Indeed, for very complex 2D
problems like the Ackley and Rastrigin functions, the genetic algorithm requires
more than 20 individuals to attain the attraction region. For higher dimensional
problems this rule of thumb does not apply either; again, this could be due to
the complexity of the higher dimensional Ackley function. Hence, further research
on a general way (i.e. for problems of arbitrary complexity) to choose the
population size needed to attain the attraction region has to be performed.
Since such a rule should apply to all kinds of problems, it is suggested that
future research determine the initial population size as described in [10].

Bibliography

[1] A. Astolfi. Optimization: An Introduction. http://www3.imperial.ac.uk/pls/portallive/docs/1/7288263.PDF. Sept. 2006.
[2] Peter Blomgren. Numerical Optimization, Lecture Notes 18: Quasi-Newton Methods, The BFGS Method. http://terminus.sdsu.edu/SDSU/Math693a_f2013/Lectures/18/lecture.pdf. Fall 2015.
[3] Nikos Drakos. Mathematical optimization. http://www.phy.ornl.gov/
csep/mo/node10.html. Aug. 1994.
[4] Austen C. Duffy. An Introduction to Gradient Computation by the Dis-
crete Adjoint Method. http://www.computationalmathematics.org/
topics/files/adjointtechreport.pdf. Summer 2009.
[5] Laurent El Ghaoui. Optimization models and applications. http://livebooklabs.
com/keeppies/c5a5868ce26b8125. Livebook visited July 2015. 2015.
[6] Fred Glover and Kenneth Sorensen. Metaheuristics. http://www.scholarpedia.
org/article/Metaheuristics. 2015.
[7] Boris Epstein, Antony Jameson, Sergey Peigin, Dino Roman, Neal Harrison, and John Vassberg. “Comparative Study of Three-Dimensional Wing Drag Minimization by Different Optimization Techniques”. In: Journal of Aircraft 46.2 (2009), pp. 526–541.
[8] Zhiming Gao and Yichen Ma. “Drag minimization for Stokes flow”. In: Applied Mathematics Letters 21.11 (2008), pp. 1161–1165.
[9] Runar Heggelien Refsnaes. A brief introduction to the conjugate gradient
method. http://www.idi.ntnu.no/~elster/tdt24/tdt24-f09/cg.
pdf. Fall 2009.
[10] Stanley Gotshall and Bart Rylander. Optimal Population Size and the Genetic Algorithm. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.105.2431&rep=rep1&type=pdf.

[11] Jonathan Richard Shewchuk. An Introduction to the Conjugate Gradient Method Without the Agonizing Pain. https://www.cs.cmu.edu/~quake-papers/painless-conjugate-gradient.pdf. Aug. 1994.
[12] Penn State. Line search methods. Rate of convergence. http://web.khu.ac.kr/~tskim/PatternClass%20Lec%20Note%2004-3%20(Newton's%20Method%20-%20Optimization).pdf.
[13] R. Storn. “On the usage of differential evolution for function optimization”.
In: Fuzzy Information Processing Society, 1996. NAFIPS., 1996 Biennial
Conference of the North American. IEEE, June 1996, pp. 519–523.
[14] Kyung Hee University. 8. Newton's method for minimization. http://web.khu.ac.kr/~tskim/PatternClass%20Lec%20Note%2004-3%20(Newton's%20Method%20-%20Optimization).pdf. Fall 2007.
[15] Jean-Philippe Vert. Nonlinear Optimization: Algorithms 1: Unconstrained
Optimization. http://cbio.mines-paristech.fr/~jvert/teaching/
2006insead/slides/5_algo1/algo1.pdf. PowerPoint Presentation.
Spring 2006.
[16] University of Washington, Department of Mathematics. Rates of Convergence and Newton's Method. https://www.math.washington.edu/~burke/crs/408/lectures/L10-Rates-of-conv-Newton.pdf.
[17] Donald R. Jones, Matthias Schonlau, and William J. Welch. “Efficient Global Optimization of Expensive Black-Box Functions”. In: Journal of Global Optimization 13 (1998), pp. 455–492.
[18] Wikipedia. Evolutionary algorithm. https://en.wikipedia.org/wiki/
Evolutionary_algorithm. 2016.
[19] Wikipedia. Metaheuristics. https://en.wikipedia.org/w/index.php?
title=Metaheuristic&action=history. 2016.
[20] Wikipedia. Newton’s method in optimization. https://en.wikipedia.
org/wiki/Newton%27s_method_in_optimization. [Online; accessed
2016]. 2016.
[21] Harlan Crowder and Philip Wolfe. Linear convergence of the conjugate gradient method. http://www.dtic.mil/dtic/tr/fulltext/u2/742126.pdf. May 1972.

Appendices

Appendix A

Python code: The steepest descent method

import numpy as np
import math
import Testcases as Tc
import TestcasesGradient as TcG
import Scatterplot2D as Sp2D
import PathContour as PC

# Steepest Descent method with constant alpha. alpha has to be tuned
# to get convergence.

def SteepestDescent(x, f, Ac):

    FuncEval = 0
    GradEval = 0

    alpha = 0.00211
    x = np.array((float(x[0]), float(x[1])))
    fprime = TcG.TestcasesGradientslib

    # *** Initializations ***
    # Gradient (the search direction is minus the gradient)
    r = -fprime(x)
    GradEval += 1
    delta_new = np.dot(r, r)
    delta0 = delta_new

    i = 0

    Error = []
    Iterations = []
    CFuncEval = []
    CGradEval = []

    fsol = f(x)
    FuncEval += 1

    Error.append(math.log10(abs(fsol)))
    Iterations.append(i)
    CFuncEval.append(FuncEval)
    CGradEval.append(GradEval)

    # Error tolerance and maximum number of iterations
    epsilonCG = 1e-10
    imax = 1e6

    # Dimensions
    n = 2

    # Vectors of x and y components for drawing the descent path
    Scatterx = []
    Scattery = []
    Scatterx.append(x[0])
    Scattery.append(x[1])

    # *** Algorithm ***
    while (i < imax) and (delta_new > delta0 * epsilonCG ** 2):

        oldfsol = fsol

        x = x + alpha * r

        r = -fprime(x)
        GradEval += 1

        Scatterx.append(x[0])
        Scattery.append(x[1])

        fsol = f(x)
        FuncEval += 1

        delta_new = np.dot(r, r)

        i += 1

        # Stop when the required accuracy Ac (log10 of the error) has
        # been reached, or if the algorithm begins to oscillate
        if (math.log10(abs(fsol)) < Ac) or oldfsol < fsol:
            break

        Error.append(math.log10(abs(fsol)))
        Iterations.append(i)
        CFuncEval.append(FuncEval)
        CGradEval.append(GradEval)

    X = np.asarray(Scatterx)
    Y = np.asarray(Scattery)

    Result = [x, Error, Iterations, CFuncEval, CGradEval]

    print(math.log10(abs(fsol)))
    print("Solution: %s" % x)
    print("Current function value: %f" % fsol)
    print("Iterations: %d" % i)
    print("Function evaluations: %d" % FuncEval)
    print("Gradient evaluations: %d" % GradEval)

    # Sp2D.Scatterplot2D(Iterations, Error, 'Error vs iteration number',
    #                    'Error', 'Iteration number')
    # PC.ScatterContourplot2D(X, Y, 'Steepest Descent Method', 'y', 'x')

    return Result
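
For reference, a hypothetical call of this routine could look as follows; the name Tc.Testcaseslib for the objective function is an assumption (only TcG.TestcasesGradientslib appears in the code above):

# Hypothetical usage: minimize a 2D test function starting from (1, 2)
# until the log10 of the absolute function value drops below -8.
result = SteepestDescent([1.0, 2.0], Tc.Testcaseslib, Ac=-8)
x_opt, Error, Iterations, CFuncEval, CGradEval = result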
Appendix B

Convergence curves and cumulative cost per iteration for higher dimensional Rosenbrock functions

The following curves and bar plots are the convergence curves and cumulative
costs per iteration for the gradient based algorithms applied to the 4D, 6D,
8D, 10D and 12D Rosenbrock functions. The gradient based algorithms used are:
Conjugate Gradients, Newton and BFGS. The starting points of all the
algorithms were

1. ~x0 = (1, 2, 1, 2) 4D

2. ~x0 = (1, .5, .5, 1, .5, 1) 6D

3. ~x0 = (2, 1, 2, .5, 2, 1, 2, 2) 8D

4. ~x0 = (2, 1, 2, .5, 2, 1, 2, 2, .5, 2) 10D

5. ~x0 = (1, .5, 1, 2, 1, 2, 1, .5, 1, 2, 1, .5) 12D

It is important to note that for the Conjugate Gradients and BFGS algorithms,
the number of gradient calls is always equal to the number of function calls.
The computational cost for a given accuracy and function is determined as
follows: first, the number of iterations needed to obtain a certain accuracy is
determined; then, at this iteration number, the cost can be read off from the
cumulative cost per iteration. Note that, as already mentioned in 2.4, we
consider the costs of function and gradient calls to be equal [4].
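
This bookkeeping can be expressed compactly. The sketch below assumes the Error, CFuncEval and CGradEval arrays returned by the solvers (as in the Result list of Appendix A):

def cost_for_accuracy(Error, CFuncEval, CGradEval, target):
    """Return the total cost (function plus gradient calls, priced equally)
    at the first iteration whose log10-error drops below `target`,
    or None if that accuracy is never reached."""
    for k, err in enumerate(Error):
        if err < target:
            return CFuncEval[k] + CGradEval[k]
    return None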

[Figures: convergence curves and cumulative cost per iteration for the BFGS, Newton and Conjugate Gradients algorithms applied to the Rosenbrock functions. Panels (a)–(f): 4D; (g)–(l): 6D; (m)–(r): 8D; (s)–(x): 10D; the remaining six panels: 12D.]

Appendix C

Reached minima, costs and number of generations to convergence for the combined algorithms

In the following, the results of the application of the combined algorithms to the
2D and 8D Ackley functions are presented as outputs, in which the coordinates
as well as the function values at the corresponding points are given. These
coordinates and function values are those reached by the genetic algorithm (used
to reach the attraction region) and by the BFGS, Newton and Conjugate Gradients
algorithms (which were activated once the attraction region was reached by the
genetic algorithm). Furthermore, the numbers of function/gradient calls of both
the genetic and gradient based algorithms are shown. Finally, the numbers of
generations to convergence for the given population sizes corresponding to the
functions to be minimized are deduced from the graphs in figures C.3a to C.3c.
These functions are the 2D and 8D Ackley functions and the 2D Rastrigin function.
Please note that generation 0 has to be counted as a generation!

Figure C.1: Obtained coordinates, corresponding function values and required
costs after minimization of the 2D Ackley function, with an accuracy of the order
of 10^{-2}, by the combined algorithm with an initial population size of 38 individuals.

Figure C.2: Obtained coordinates, corresponding function values and required
costs after minimization, with unlimited accuracy, of the 8D Ackley function
by the combined algorithm with an initial population size of 173 individuals.

Figure C.3: Minimum (logarithmic scale) of the best individual per generation,
as a function of the generation, for (a) the 2D Ackley function, (b) the 8D
Ackley function and (c) the 2D Rastrigin function.
