
Descent methods

for unconstrained optimization

Gilles Gasso

INSA Rouen - ASI Department


LITIS Laboratory

September 19, 2019



Plan

1 Formulation

2 Optimality conditions

3 Descent algorithms
Main methods of descent
Step size selection
Summary

4 Illustration of descent methods



Formulation

Unconstrained optimization

Elements of the problem


θ ∈ Rd : vector of unknown real parameters
J : Rd → R : the function to be minimized.
Assumption: J is differentiable all over its domain
domJ = {θ ∈ Rd | J(θ) < ∞}

Problem formulation
(P)   min_{θ ∈ Rd} J(θ)



Formulation

Unconstrained optimization

Examples

J(θ) = ½ θ>Pθ + q>θ + r,  with P a positive definite matrix

J(θ) = cos(θ1 − θ2) + sin(θ1 + θ2) + θ1/4

[surface plots of the two example functions]

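To fix ideas, the two example objectives can be written directly in code. The sketch below is ours, not from the slides; in particular the matrix P, the vector q and the scalar r are an arbitrary positive definite example.

    import numpy as np

    P = np.array([[2.0, 0.5],
                  [0.5, 1.0]])   # symmetric positive definite (assumed example)
    q = np.array([-1.0, 0.5])
    r = 3.0

    def J_quadratic(theta):
        # J(theta) = 1/2 theta' P theta + q' theta + r
        return 0.5 * theta @ P @ theta + q @ theta + r

    def J_trig(theta):
        # J(theta) = cos(theta1 - theta2) + sin(theta1 + theta2) + theta1 / 4
        t1, t2 = theta
        return np.cos(t1 - t2) + np.sin(t1 + t2) + t1 / 4.0

    theta = np.array([1.0, -2.0])
    print(J_quadratic(theta), J_trig(theta))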


Optimality conditions

Different solutions
Global solution
θ ∗ is said to be the global minimum solution of the problem if
J(θ ∗ ) ≤ J(θ), ∀θ ∈ domJ

Local solution
θ̂ is a local minimum solution of problem (P) if it holds
J(θ̂) ≤ J(θ), ∀θ ∈ domJ such that ‖θ̂ − θ‖ ≤ ε,  ε > 0
Illustration: J(θ) = cos(θ1 − θ2) + sin(θ1 + θ2) + θ1/4

[contour plot of J showing the global minimum θ ∗ and a local minimum]



Optimality conditions

Optimality conditions

How do we assess a solution to the problem?



Optimality conditions

First order necessary condition

Theorem [First order condition]


Let J : Rd → R be a differentiable function on its domain. If a vector θ 0 is a
(local or global) solution of the problem (P), then it necessarily satisfies the
condition ∇J(θ 0 ) = 0.

Vocabulary
Any vector θ 0 that verifies ∇J(θ 0 ) = 0 is called a stationary point or critical
point
∇J(θ) ∈ Rd is the gradient vector of J at θ.
The gradient is the unique vector such that the directional derivative can be
written as:
lim_{t→0} [J(θ + th) − J(θ)] / t = ∇J(θ)>h,   h ∈ Rd , t ∈ R
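To make the directional derivative concrete, the following small numerical check (ours, not from the slides) compares the difference quotient (J(θ + th) − J(θ))/t with ∇J(θ)>h on the polynomial example used in the next slides; the test point and direction are arbitrary.

    import numpy as np

    def J(theta):
        t1, t2 = theta
        return t1**4 + t2**4 - 4.0 * t1 * t2

    def grad_J(theta):
        t1, t2 = theta
        return np.array([4.0 * t1**3 - 4.0 * t2, -4.0 * t1 + 4.0 * t2**3])

    theta = np.array([0.5, -1.0])   # arbitrary test point
    h = np.array([1.0, 2.0])        # arbitrary direction
    for t in (1e-2, 1e-4, 1e-6):
        quotient = (J(theta + t * h) - J(theta)) / t
        print(t, quotient, grad_J(theta) @ h)   # the two values agree as t -> 0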
Optimality conditions

Example of a first order optimality condition


J(θ) = θ1⁴ + θ2⁴ − 4θ1 θ2

Gradient: ∇J(θ) = (4θ1³ − 4θ2 , −4θ1 + 4θ2³)>

Stationary points verify ∇J(θ) = 0. Three solutions:
θ (1) = (0, 0)> , θ (2) = (1, 1)> and θ (3) = (−1, −1)>

[contour plot of J showing the three stationary points]
Remarks
θ (2) and θ (3) are local minima but not θ (1)
not every stationary point is a local extremum

We need another optimality condition


How to ensure that a stationary point is a minimum?
Optimality conditions

Hessian matrix
Twice differentiable function
J : Rd → R is said to be a twice differentiable function on its domain
domJ if, at every point θ ∈ domJ, there exists a unique symmetric matrix
H(θ) ∈ Rd×d , called the Hessian matrix, such that
J(θ + h) = J(θ) + ∇J(θ)>h + ½ h>H(θ)h + ‖h‖² ε(h),
where ε(h) is a continuous function at 0 with lim_{h→0} ε(h) = 0

H(θ) is the second derivative matrix


         [ ∂²J/∂θ1∂θ1   ∂²J/∂θ1∂θ2   ···   ∂²J/∂θ1∂θd
H(θ) =        ⋮              ⋮                   ⋮
           ∂²J/∂θd∂θ1   ∂²J/∂θd∂θ2   ···   ∂²J/∂θd∂θd ]

H(θ) = ∇θ> (∇θ J(θ)) is the Jacobian of the gradient function


Optimality conditions

Examples

Example 1
Objective function: J(θ) = θ1⁴ + θ2⁴ − 4θ1 θ2
Gradient: ∇J(θ) = (4θ1³ − 4θ2 , −4θ1 + 4θ2³)>
Hessian matrix: H(θ) = [ 12θ1²   −4
                          −4    12θ2² ]

Example 2
Quadratic objective function: J(θ) = ½ θ>Pθ + q>θ + r
Directional derivative: D(h, θ) = lim_{t→0} [J(θ + th) − J(θ)] / t = (Pθ + q)>h
Gradient: ∇J(θ) = Pθ + q
Hessian matrix: H(θ) = P
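As a quick sanity check of these hand-computed derivatives, they can be reproduced symbolically. A sketch using sympy for Example 1 (the variable names are ours):

    import sympy as sp

    t1, t2 = sp.symbols('theta1 theta2')
    J = t1**4 + t2**4 - 4 * t1 * t2                        # Example 1

    grad = sp.Matrix([sp.diff(J, v) for v in (t1, t2)])    # gradient of J
    H = sp.hessian(J, (t1, t2))                            # Hessian matrix of J

    print(grad)   # entries 4*theta1**3 - 4*theta2 and -4*theta1 + 4*theta2**3
    print(H)      # rows [12*theta1**2, -4] and [-4, 12*theta2**2]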



Optimality conditions

Second order optimality condition

Theorem [Second order optimality condition]


Let J : Rd → R be a twice differentiable function on its domain. If θ 0 is a
minimum of J, then ∇J(θ 0 ) = 0 and H(θ 0 ) is a positive semidefinite matrix.

Remarks
H is positive definite if and only if all its eigenvalues are positive
(positive semidefinite if and only if they are all nonnegative)
H is negative definite if and only if all its eigenvalues are negative
For θ ∈ R, this condition means that the derivative of J at the minimum is
zero, J′(θ) = 0, and its second derivative is nonnegative, i.e. J″(θ) ≥ 0
If, at a stationary point θ 0 , H(θ 0 ) is negative definite, then θ 0 is a local
maximum of J



Optimality conditions

Illustration of the second order optimality condition


J(θ) = θ1⁴ + θ2⁴ − 4θ1 θ2

Gradient: ∇J(θ) = (4θ1³ − 4θ2 , −4θ1 + 4θ2³)>

Stationary points: θ (1) = (0, 0)> , θ (2) = (1, 1)> and θ (3) = (−1, −1)>

Hessian matrix: H(θ) = [ 12θ1²   −4
                          −4    12θ2² ]

[contour plot of J showing the three stationary points]

                        θ (1)              θ (2)              θ (3)
Hessian           [ 0 −4; −4 0 ]     [ 12 −4; −4 12 ]    [ 12 −4; −4 12 ]
Eigenvalues            4, −4              8, 16               8, 16
Type of solution    Saddle point          Minimum             Minimum
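The classification in the table can be reproduced numerically by checking the sign of the Hessian eigenvalues at each stationary point. A small sketch with numpy:

    import numpy as np

    def hess_J(theta):
        t1, t2 = theta
        return np.array([[12.0 * t1**2, -4.0],
                         [-4.0, 12.0 * t2**2]])

    for point in ([0.0, 0.0], [1.0, 1.0], [-1.0, -1.0]):
        eigvals = np.linalg.eigvalsh(hess_J(point))   # eigenvalues of the symmetric Hessian
        if np.all(eigvals > 0):
            kind = "minimum"
        elif np.all(eigvals < 0):
            kind = "maximum"
        else:
            kind = "saddle point"
        print(point, eigvals, kind)   # (0,0): saddle point; (1,1) and (-1,-1): minima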
Optimality conditions

Necessary and sufficient optimality condition

Theorem [2nd order sufficient condition ]


Assume the Hessian matrix H(θ 0 ) of J(θ) at θ 0 exists and is positive
definite. Assume also that the gradient ∇J(θ 0 ) = 0. Then θ 0 is a (local or
global) minimum of problem (P).

Theorem [Sufficient and necessary optimality condition]


Let J be a convex function. Every local solution θ̂ is a global solution θ ∗ .
In particular, if J is moreover differentiable, ∇J(θ ∗ ) = 0 is a necessary and
sufficient condition for θ ∗ to be a global minimum.

Recall
A function J : Rd → R is convex if it verifies
J(αθ + (1 − α)z) ≤ αJ(θ) + (1 − α)J(z), ∀θ, z ∈ domJ, 0≤α≤1



Optimality conditions

How to find the solution(s)?

We have seen how to assess a solution to the problem


A question to be addressed is: how to compute a solution?



Descent algorithms

Principle of descent algorithms

Direction of descent
Let J : Rd → R be a function. The vector h ∈ Rd is called a direction of
descent at θ if there exists α > 0 such that J(θ + αh) < J(θ)

Principle of descent methods


Start from an initial point θ 0
Design a sequence of points {θ k } with θ k+1 = θ k + αk hk
Ensure that the sequence {θ k } converges to a stationary point θ̂

hk : direction of descent
αk is the step size



Descent algorithms

General approach

General algorithm
1: Let k = 0, initialize θ k
2: repeat
3: Find a descent direction hk ∈ Rd
4: Line search: find a step size αk > 0 in the direction hk such that
J(θ k + αk hk ) decreases "enough"
5: Update: θ k+1 ← θ k + αk hk and k ← k + 1
6: until ‖∇J(θ k )‖ < ε

The methods of descent differ by the choice of:


h: gradient algorithm, Newton, Quasi-Newton algorithm
α: Line search, backtracking
Descent algorithms Main methods of descent

Gradient Algorithm
Theorem [the opposite of the gradient is a descent direction]
Let J(θ) be a differentiable function. The direction h = −∇J(θ) ∈ Rd is a
descent direction.

Proof.
J being differentiable, for any t > 0 we have
J(θ + th) = J(θ) + t∇J(θ)>h + t‖h‖ε(th). Setting h = −∇J(θ), we get
J(θ + th) − J(θ) = −t‖∇J(θ)‖² + t‖h‖ε(th). Since ε(th) → 0 as t → 0, for t small
enough we get J(θ + th) − J(θ) < 0. It is then a descent direction.

Characteristics of the gradient algorithm


Choice of the descent direction at θ k : hk = −∇J(θ k )
Complexity of the update: θ k+1 ← θ k − αk ∇J(θ k ) costs O(d)
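A minimal fixed-step sketch of the gradient algorithm on the example J(θ) = θ1⁴ + θ2⁴ − 4θ1θ2 used in these slides; the step size, tolerance and starting point below are our own choices, not values from the course.

    import numpy as np

    def grad_J(theta):
        t1, t2 = theta
        return np.array([4.0 * t1**3 - 4.0 * t2, -4.0 * t1 + 4.0 * t2**3])

    def gradient_descent(theta0, alpha=0.05, eps=1e-6, max_iter=1000):
        theta = np.asarray(theta0, dtype=float)
        for _ in range(max_iter):
            g = grad_J(theta)
            if np.linalg.norm(g) < eps:    # stopping test ||grad J(theta_k)|| < eps
                break
            theta = theta - alpha * g      # update theta_{k+1} = theta_k - alpha * grad J(theta_k)
        return theta

    print(gradient_descent([1.5, 0.5]))    # should approach the stationary point (1, 1)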
Descent algorithms Main methods of descent

Newton algorithm
2nd order approximation of the twice differentiable J at θ k :
J(θ k + h) ≈ J(θ k ) + ∇J(θ k )>h + ½ h>H(θ k )h
with H(θ k ) the (assumed positive definite) Hessian matrix
The direction hk which minimizes this approximation is obtained by setting the
gradient of the model to zero: ∇J(θ k ) + H(θ k )hk = 0 ⇒ hk = −H(θ k )⁻¹∇J(θ k )

Features
Choice of the descent direction at θ k : hk = −H(θ k )⁻¹∇J(θ k )
Complexity of the update: θ k+1 ← θ k − αk H(θ k )⁻¹∇J(θ k ) costs
O(d³) flops
H(θ k ) is not always guaranteed to be a positive definite matrix. Hence
we cannot always ensure that hk is a direction of descent
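A corresponding sketch of the Newton iteration on the same example, computing the direction with a linear solve rather than an explicit inverse; the full step αk = 1 and the starting point are our own choices.

    import numpy as np

    def grad_J(theta):
        t1, t2 = theta
        return np.array([4.0 * t1**3 - 4.0 * t2, -4.0 * t1 + 4.0 * t2**3])

    def hess_J(theta):
        t1, t2 = theta
        return np.array([[12.0 * t1**2, -4.0],
                         [-4.0, 12.0 * t2**2]])

    def newton(theta0, eps=1e-8, max_iter=50):
        theta = np.asarray(theta0, dtype=float)
        for _ in range(max_iter):
            g = grad_J(theta)
            if np.linalg.norm(g) < eps:
                break
            h = np.linalg.solve(hess_J(theta), -g)   # h_k = -H(theta_k)^{-1} grad J(theta_k)
            theta = theta + h                        # full Newton step (alpha_k = 1)
        return theta

    print(newton([1.5, 0.5]))   # should converge in a few iterations, here towards (1, 1)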
Descent algorithms Main methods of descent

Illustration of gradient and Newton methods


Local approximation of the two methods in 1D: the objective J, its first order
approximation (tangent at θ k , used by the gradient method) and its second
order approximation (used by the Newton method).

Directions of descent in 2D: the gradient direction h = −∇J and the Newton
direction h = −H⁻¹∇J, shown on the contours of J.

[figures]



Descent algorithms Main methods of descent

Quasi-Newton method

Main features
Choice of the descent direction at θ k : hk = −B(θ k )⁻¹∇J(θ k )
B(θ k ) is a positive definite approximation of the Hessian matrix
Complexity of the update: typically O(d²)
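Quasi-Newton methods are rarely coded from scratch. As an illustration only, the running example can be minimized with scipy's standard BFGS implementation, which maintains the positive definite approximation B internally; the starting point is our own choice.

    import numpy as np
    from scipy.optimize import minimize

    def J(theta):
        t1, t2 = theta
        return t1**4 + t2**4 - 4.0 * t1 * t2

    def grad_J(theta):
        t1, t2 = theta
        return np.array([4.0 * t1**3 - 4.0 * t2, -4.0 * t1 + 4.0 * t2**3])

    result = minimize(J, x0=np.array([1.5, 0.5]), jac=grad_J, method='BFGS')
    print(result.x)   # expected to land near a local minimum such as (1, 1)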



Descent algorithms  Step size selection

Line search
Assume the direction of descent hk at θ k is fixed. We aim to find the step
size αk > 0 in the direction hk such that the function J(θ k + αk hk )
decreases enough (compared to J(θ k ))

Several options
Fixed step size: use a fixed value α > 0 at each iteration k

θ k+1 ← θ k + αhk

Optimal step size αk∗ (a code sketch follows after this list)

θ k+1 ← θ k + αk∗ hk   with   αk∗ = arg min_{α>0} J(θ k + αhk )

Variable step size: the choice αk is adapted to the current iteration

θ k+1 ← θ k + αk hk
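As announced above, the one-dimensional problem arg min_{α>0} J(θ k + αhk ) can be delegated to a scalar minimizer. A sketch using scipy (the upper bound on α and the 'bounded' method are our own choices):

    import numpy as np
    from scipy.optimize import minimize_scalar

    def J(theta):
        t1, t2 = theta
        return t1**4 + t2**4 - 4.0 * t1 * t2

    def grad_J(theta):
        t1, t2 = theta
        return np.array([4.0 * t1**3 - 4.0 * t2, -4.0 * t1 + 4.0 * t2**3])

    def optimal_step(theta, h, alpha_max=1.0):
        # minimize phi(alpha) = J(theta + alpha * h) over (0, alpha_max]
        phi = lambda alpha: J(theta + alpha * h)
        res = minimize_scalar(phi, bounds=(0.0, alpha_max), method='bounded')
        return res.x

    theta_k = np.array([1.5, 0.5])
    h_k = -grad_J(theta_k)                  # steepest-descent direction
    print(optimal_step(theta_k, h_k))       # (approximately) optimal step along h_k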
Descent algorithms  Step size selection

Line search

Pros and cons


Fixed step size strategy: often not very effective
Optimal step size: can be costly in computation time
Variable step size: the most commonly used approach
The step is only computed approximately
A trade-off between computation cost and the decrease of J



Descent algorithms  Step size selection

Variable step size


Armijo’s rule
Determine the step size αk in order to have a sufficient decrease of J, i.e.

J(θ k + αk hk ) ≤ J(θ k ) + c αk ∇J(θ k )>hk

Usually c is chosen in the range [10⁻⁵ , 10⁻¹ ].
Since hk is a descent direction, ∇J(θ k )>hk < 0, so the condition enforces a
decrease of J.

Backtracking
1: Fix an initial step ᾱ, choose 0 < ρ < 1, α ← ᾱ
2: repeat
3:    α ← ρα
4: until J(θ k + αhk ) ≤ J(θ k ) + c α∇J(θ k )>hk

Choice of the initial step
Newton method: ᾱ = 1
Gradient method: ᾱ = 2 (J(θ k ) − J(θ k−1 )) / (∇J(θ k )>hk )

Interpretation: as long as J does not decrease enough, we shrink the step size
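A direct transcription of this rule (a sketch; the default values of ᾱ, ρ and c are common choices, not prescribed by the slides). It uses the equivalent while-loop form, so the initial step ᾱ is tried before any shrinking.

    import numpy as np

    def backtracking(J, grad_J, theta, h, alpha_bar=1.0, rho=0.5, c=1e-4):
        # shrink alpha until the Armijo sufficient-decrease condition holds
        alpha = alpha_bar
        slope = grad_J(theta) @ h          # directional derivative; negative for a descent direction
        while J(theta + alpha * h) > J(theta) + c * alpha * slope:
            alpha *= rho                   # not enough decrease: reduce the step
        return alpha

    # usage on the running example J(theta) = theta1^4 + theta2^4 - 4 theta1 theta2
    J = lambda th: th[0]**4 + th[1]**4 - 4.0 * th[0] * th[1]
    grad_J = lambda th: np.array([4.0 * th[0]**3 - 4.0 * th[1],
                                  -4.0 * th[0] + 4.0 * th[1]**3])
    theta_k = np.array([1.5, 0.5])
    h_k = -grad_J(theta_k)                 # steepest-descent direction
    print(backtracking(J, grad_J, theta_k, h_k))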
Descent algorithms Summary

Summary of descent methods


General algorithm
1: Initialize θk
2: repeat
3: Find direction of descent hk ∈ Rd
4: Line search: find the step αk > 0
5: Update: θ k+1 ← θ k + αk hk
6: until convergence

Method          Direction of descent h      Complexity   Convergence
Gradient        −∇J(θ)                      O(d)         linear
Quasi-Newton    −B(θ)⁻¹∇J(θ)                O(d²)        superlinear
Newton          −H(θ)⁻¹∇J(θ)                O(d³)        quadratic

Step size computation: backtracking (common) or optimal step size

Complexity of each method: depends on the complexity of calculating h, the
search for α, and the number of iterations performed till convergence
Illustration of descent methods

Gradient method

J along the iterations / Evolution of the iterates

[left: decrease of J(θ (k) ) versus the iteration index k; right: trajectory of
the iterates on the contours of J, starting from θ 0 ]



Illustration of descent methods

Newton method

J along the iterations / Evolution of the iterates

[left: decrease of J(θ (k) ) versus the iteration index k; right: trajectory of
the iterates on the contours of J, starting from θ 0 ]

At each iteration we considered the matrix H(θ) + λI instead of H to guarantee
the positive definiteness of the (modified) Hessian.
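The regularization mentioned above can be sketched as follows; the value of λ is an assumption, its role is simply to shift the eigenvalues of H upwards so that the computed direction is a descent direction.

    import numpy as np

    def regularized_newton_direction(grad, hess, lam=1.0):
        # Newton-like direction computed from H + lambda*I instead of H
        d = hess.shape[0]
        return np.linalg.solve(hess + lam * np.eye(d), -grad)

    # example: near the saddle point of the running example the plain Hessian is indefinite
    H = np.array([[0.0, -4.0], [-4.0, 0.0]])   # Hessian of J at (0, 0)
    g = np.array([0.1, -0.2])                  # an arbitrary (assumed) gradient value
    print(regularized_newton_direction(g, H, lam=5.0))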



Illustration of descent methods

Conclusion

Unconstrained optimization of a smooth objective function


Characterization of the solution(s) requires checking the optimality
conditions
Computation of a solution using descent methods
Gradient descent method
Newton method
Not covered in this lecture:
Convergence analysis of the studied algorithms
Non-smooth optimization
Gradient-free optimization

