
Descent methods

for unconstrained optimization

Gilles Gasso

INSA Rouen - ASI Department


LITIS Laboratory

September 19, 2019



Plan

1 Formulation

2 Optimality conditions

3 Descent algorithms
Main methods of descent
Step size selection
Summary

4 Illustration of descent methods



Formulation

Unconstrained optimization

Elements of the problem


θ ∈ Rd : vector of unknown real parameters
J : Rd → R : the function to be minimized.
Assumption: J is differentiable all over its domain
domJ = {θ ∈ Rd | J(θ) < ∞}

Problem formulation
(P)   min_{θ ∈ Rd} J(θ)



Formulation

Unconstrained optimization

Examples

J(θ) = ½ θ>Pθ + q>θ + r,  with P a positive definite matrix

J(θ) = cos(θ1 − θ2) + sin(θ1 + θ2) + θ1/4

[surface plots of the two example functions]

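To fix ideas, the two example objectives can be written directly in code. The sketch below is ours, not from the slides; in particular the matrix P, the vector q and the scalar r are an arbitrary positive definite example.

    import numpy as np

    P = np.array([[2.0, 0.5],
                  [0.5, 1.0]])   # symmetric positive definite (assumed example)
    q = np.array([-1.0, 0.5])
    r = 3.0

    def J_quadratic(theta):
        # J(theta) = 1/2 theta' P theta + q' theta + r
        return 0.5 * theta @ P @ theta + q @ theta + r

    def J_trig(theta):
        # J(theta) = cos(theta1 - theta2) + sin(theta1 + theta2) + theta1 / 4
        t1, t2 = theta
        return np.cos(t1 - t2) + np.sin(t1 + t2) + t1 / 4.0

    theta = np.array([1.0, -2.0])
    print(J_quadratic(theta), J_trig(theta))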


Optimality conditions

Different solutions
Global solution
θ ∗ is said to be the global minimum solution of the problem if
J(θ ∗ ) ≤ J(θ), ∀θ ∈ domJ

Local solution
θ̂ is a local minimum solution of problem (P) if it holds
J(θ̂) ≤ J(θ), ∀θ ∈ domJ such that ‖θ̂ − θ‖ ≤ ε,  ε > 0
Illustration: J(θ) = cos(θ1 − θ2) + sin(θ1 + θ2) + θ1/4

[contour plot of J showing the global minimum θ ∗ and a local minimum]



Optimality conditions

Optimality conditions

How do we assess a solution to the problem?



Optimality conditions

First order necessary condition

Theorem [First order condition]


Let J : Rd → R be a differentiable function on its domain. If a vector θ 0 is a
(local or global) solution of the problem (P), then it necessarily satisfies the
condition ∇J(θ 0 ) = 0.

Vocabulary
Any vector θ 0 that verifies ∇J(θ 0 ) = 0 is called a stationary point or critical
point
∇J(θ) ∈ Rd is the gradient vector of J at θ.
The gradient is the unique vector such that the directional derivative can be
written as:
lim_{t→0} [J(θ + th) − J(θ)] / t = ∇J(θ)>h,   h ∈ Rd , t ∈ R
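To make the directional derivative concrete, the following small numerical check (ours, not from the slides) compares the difference quotient (J(θ + th) − J(θ))/t with ∇J(θ)>h on the polynomial example used in the next slides; the test point and direction are arbitrary.

    import numpy as np

    def J(theta):
        t1, t2 = theta
        return t1**4 + t2**4 - 4.0 * t1 * t2

    def grad_J(theta):
        t1, t2 = theta
        return np.array([4.0 * t1**3 - 4.0 * t2, -4.0 * t1 + 4.0 * t2**3])

    theta = np.array([0.5, -1.0])   # arbitrary test point
    h = np.array([1.0, 2.0])        # arbitrary direction
    for t in (1e-2, 1e-4, 1e-6):
        quotient = (J(theta + t * h) - J(theta)) / t
        print(t, quotient, grad_J(theta) @ h)   # the two values agree as t -> 0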
Optimality conditions

Example of a first order optimality condition


J(θ) = θ1⁴ + θ2⁴ − 4θ1 θ2

Gradient: ∇J(θ) = (4θ1³ − 4θ2 , −4θ1 + 4θ2³)>

Stationary points verify ∇J(θ) = 0. Three solutions:
θ (1) = (0, 0)> , θ (2) = (1, 1)> and θ (3) = (−1, −1)>

[contour plot of J showing the three stationary points]
Remarks
θ (2) and θ (3) are local minima but not θ (1)
not every stationary point is a local extremum

We need another optimality condition


How to ensure that a stationary point is a minimum?
Optimality conditions

Hessian matrix
Twice differentiable function
J : Rd → R is said to be a twice differentiable function on its domain
domJ if, at every point θ ∈ domJ, there exists a unique symmetric matrix
H(θ) ∈ Rd×d , called the Hessian matrix, such that
J(θ + h) = J(θ) + ∇J(θ)>h + ½ h>H(θ)h + ‖h‖² ε(h),
where ε(h) is a continuous function at 0 with lim_{h→0} ε(h) = 0

H(θ) is the second derivative matrix


         [ ∂²J/∂θ1∂θ1   ∂²J/∂θ1∂θ2   ···   ∂²J/∂θ1∂θd
H(θ) =        ⋮              ⋮                   ⋮
           ∂²J/∂θd∂θ1   ∂²J/∂θd∂θ2   ···   ∂²J/∂θd∂θd ]

H(θ) = ∇θ> (∇θ J(θ)) is the Jacobian of the gradient function


Optimality conditions

Examples

Example 1
Objective function: J(θ) = θ1⁴ + θ2⁴ − 4θ1 θ2
Gradient: ∇J(θ) = (4θ1³ − 4θ2 , −4θ1 + 4θ2³)>
Hessian matrix: H(θ) = [ 12θ1²   −4
                          −4    12θ2² ]

Example 2
Quadratic objective function: J(θ) = ½ θ>Pθ + q>θ + r
Directional derivative: D(h, θ) = lim_{t→0} [J(θ + th) − J(θ)] / t = (Pθ + q)>h
Gradient: ∇J(θ) = Pθ + q
Hessian matrix: H(θ) = P
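As a quick sanity check of these hand-computed derivatives, they can be reproduced symbolically. A sketch using sympy for Example 1 (the variable names are ours):

    import sympy as sp

    t1, t2 = sp.symbols('theta1 theta2')
    J = t1**4 + t2**4 - 4 * t1 * t2                        # Example 1

    grad = sp.Matrix([sp.diff(J, v) for v in (t1, t2)])    # gradient of J
    H = sp.hessian(J, (t1, t2))                            # Hessian matrix of J

    print(grad)   # entries 4*theta1**3 - 4*theta2 and -4*theta1 + 4*theta2**3
    print(H)      # rows [12*theta1**2, -4] and [-4, 12*theta2**2]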



Optimality conditions

Second order optimality condition

Theorem [Second order optimality condition]


Let J : Rd → R be a twice differentiable function on its domain. If θ 0 is a
minimum of J, then ∇J(θ 0 ) = 0 and H(θ 0 ) is a positive semidefinite matrix.

Remarks
H is positive definite if and only if all its eigenvalues are positive
(positive semidefinite if and only if they are all nonnegative)
H is negative definite if and only if all its eigenvalues are negative
For θ ∈ R, this condition means that the derivative of J at the minimum is
zero, J′(θ) = 0, and its second derivative is nonnegative, i.e. J″(θ) ≥ 0
If, at a stationary point θ 0 , H(θ 0 ) is negative definite, then θ 0 is a local
maximum of J



Optimality conditions

Illustration of the second order optimality condition


J(θ) = θ1⁴ + θ2⁴ − 4θ1 θ2

Gradient: ∇J(θ) = (4θ1³ − 4θ2 , −4θ1 + 4θ2³)>

Stationary points: θ (1) = (0, 0)> , θ (2) = (1, 1)> and θ (3) = (−1, −1)>

Hessian matrix: H(θ) = [ 12θ1²   −4
                          −4    12θ2² ]

[contour plot of J showing the three stationary points]

                        θ (1)              θ (2)              θ (3)
Hessian           [ 0 −4; −4 0 ]     [ 12 −4; −4 12 ]    [ 12 −4; −4 12 ]
Eigenvalues            4, −4              8, 16               8, 16
Type of solution    Saddle point          Minimum             Minimum
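The classification in the table can be reproduced numerically by checking the sign of the Hessian eigenvalues at each stationary point. A small sketch with numpy:

    import numpy as np

    def hess_J(theta):
        t1, t2 = theta
        return np.array([[12.0 * t1**2, -4.0],
                         [-4.0, 12.0 * t2**2]])

    for point in ([0.0, 0.0], [1.0, 1.0], [-1.0, -1.0]):
        eigvals = np.linalg.eigvalsh(hess_J(point))   # eigenvalues of the symmetric Hessian
        if np.all(eigvals > 0):
            kind = "minimum"
        elif np.all(eigvals < 0):
            kind = "maximum"
        else:
            kind = "saddle point"
        print(point, eigvals, kind)   # (0,0): saddle point; (1,1) and (-1,-1): minima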
Optimality conditions

Necessary and sufficient optimality condition

Theorem [2nd order sufficient condition ]


Assume the Hessian matrix H(θ 0 ) of J(θ) at θ 0 exists and is positive
definite. Assume also that the gradient ∇J(θ 0 ) = 0. Then θ 0 is a (local or
global) minimum of problem (P).

Theorem [Sufficient and necessary optimality condition]


Let J be a convex function. Every local solution θ̂ is a global solution θ ∗ .
In particular, if J is moreover differentiable, ∇J(θ ∗ ) = 0 is a necessary and
sufficient condition for θ ∗ to be a global minimum.

Recall
A function J : Rd → R is convex if it verifies
J(αθ + (1 − α)z) ≤ αJ(θ) + (1 − α)J(z), ∀θ, z ∈ domJ, 0≤α≤1



Optimality conditions

How to find the solution(s)?

We have seen how to assess a solution to the problem


A question to be addressed is: how to compute a solution?



Descent algorithms

Principle of descent algorithms

Direction of descent
Let J : Rd → R be a function. The vector h ∈ Rd is called a direction of
descent at θ if there exists α > 0 such that J(θ + αh) < J(θ)

Principle of descent methods


Start from an initial point θ 0
Design a sequence of points {θ k } with θ k+1 = θ k + αk hk
Ensure that the sequence {θ k } converges to a stationary point θ̂

hk : direction of descent
αk is the step size



Descent algorithms

General approach

General algorithm
1: Let k = 0, initialize θ k
2: repeat
3: Find a descent direction hk ∈ Rd
4: Line search: find a step size αk > 0 in the direction hk such that
J(θ k + αk hk ) decreases "enough"
5: Update: θ k+1 ← θ k + αk hk and k ← k + 1
6: until ‖∇J(θ k )‖ < ε

The methods of descent differ by the choice of:


h: gradient algorithm, Newton, Quasi-Newton algorithm
α: Line search, backtracking
Descent algorithms Main methods of descent

Gradient Algorithm
Theorem [the opposite of the gradient is a descent direction]
Let J(θ) be a differentiable function. The direction h = −∇J(θ) ∈ Rd is a
descent direction.

Proof.
J being differentiable, for any t > 0 we have
J(θ + th) = J(θ) + t∇J(θ)>h + t‖h‖ε(th). Setting h = −∇J(θ), we get
J(θ + th) − J(θ) = −t‖∇J(θ)‖² + t‖h‖ε(th). Since ε(th) → 0 as t → 0, for t small
enough we get J(θ + th) − J(θ) < 0. It is then a descent direction.

Characteristics of the gradient algorithm


Choice of the descent direction at θ k : hk = −∇J(θ k )
Complexity of the update: θ k+1 ← θ k − αk ∇J(θ k ) costs O(d)
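A minimal fixed-step sketch of the gradient algorithm on the example J(θ) = θ1⁴ + θ2⁴ − 4θ1θ2 used in these slides; the step size, tolerance and starting point below are our own choices, not values from the course.

    import numpy as np

    def grad_J(theta):
        t1, t2 = theta
        return np.array([4.0 * t1**3 - 4.0 * t2, -4.0 * t1 + 4.0 * t2**3])

    def gradient_descent(theta0, alpha=0.05, eps=1e-6, max_iter=1000):
        theta = np.asarray(theta0, dtype=float)
        for _ in range(max_iter):
            g = grad_J(theta)
            if np.linalg.norm(g) < eps:    # stopping test ||grad J(theta_k)|| < eps
                break
            theta = theta - alpha * g      # update theta_{k+1} = theta_k - alpha * grad J(theta_k)
        return theta

    print(gradient_descent([1.5, 0.5]))    # should approach the stationary point (1, 1)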
Descent algorithms Main methods of descent

Newton algorithm
2nd order approximation of the twice differentiable J at θ k :
J(θ k + h) ≈ J(θ k ) + ∇J(θ k )>h + ½ h>H(θ k )h
with H(θ k ) the (assumed positive definite) Hessian matrix
The direction hk which minimizes this approximation is obtained by setting the
gradient of the model to zero: ∇J(θ k ) + H(θ k )hk = 0 ⇒ hk = −H(θ k )⁻¹∇J(θ k )

Features
Choice of the descent direction at θ k : hk = −H(θ k )⁻¹∇J(θ k )
Complexity of the update: θ k+1 ← θ k − αk H(θ k )⁻¹∇J(θ k ) costs
O(d³) flops
H(θ k ) is not always guaranteed to be a positive definite matrix. Hence
we cannot always ensure that hk is a direction of descent
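A corresponding sketch of the Newton iteration on the same example, computing the direction with a linear solve rather than an explicit inverse; the full step αk = 1 and the starting point are our own choices.

    import numpy as np

    def grad_J(theta):
        t1, t2 = theta
        return np.array([4.0 * t1**3 - 4.0 * t2, -4.0 * t1 + 4.0 * t2**3])

    def hess_J(theta):
        t1, t2 = theta
        return np.array([[12.0 * t1**2, -4.0],
                         [-4.0, 12.0 * t2**2]])

    def newton(theta0, eps=1e-8, max_iter=50):
        theta = np.asarray(theta0, dtype=float)
        for _ in range(max_iter):
            g = grad_J(theta)
            if np.linalg.norm(g) < eps:
                break
            h = np.linalg.solve(hess_J(theta), -g)   # h_k = -H(theta_k)^{-1} grad J(theta_k)
            theta = theta + h                        # full Newton step (alpha_k = 1)
        return theta

    print(newton([1.5, 0.5]))   # should converge in a few iterations, here towards (1, 1)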
Descent algorithms Main methods of descent

Illustration of gradient and Newton methods


Local approximation of the two methods in 1D: the objective J, its first order
approximation (tangent at θ k , used by the gradient method) and its second
order approximation (used by the Newton method).

Directions of descent in 2D: the gradient direction h = −∇J and the Newton
direction h = −H⁻¹∇J, shown on the contours of J.

[figures]



Descent algorithms Main methods of descent

Quasi-Newton method

Main features
Choice of the descent direction at θ k : hk = −B(θ k )⁻¹∇J(θ k )
B(θ k ) is a positive definite approximation of the Hessian matrix
Complexity of the update: typically O(d²)
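Quasi-Newton methods are rarely coded from scratch. As an illustration only, the running example can be minimized with scipy's standard BFGS implementation, which maintains the positive definite approximation B internally; the starting point is our own choice.

    import numpy as np
    from scipy.optimize import minimize

    def J(theta):
        t1, t2 = theta
        return t1**4 + t2**4 - 4.0 * t1 * t2

    def grad_J(theta):
        t1, t2 = theta
        return np.array([4.0 * t1**3 - 4.0 * t2, -4.0 * t1 + 4.0 * t2**3])

    result = minimize(J, x0=np.array([1.5, 0.5]), jac=grad_J, method='BFGS')
    print(result.x)   # expected to land near a local minimum such as (1, 1)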



Descent algorithms  Step size selection

Line search
Assume the direction of descent hk at θ k is fixed. We aim to find the step
size αk > 0 in the direction hk such that the function J(θ k + αk hk )
decreases enough (compared to J(θ k ))

Several options
Fixed step size: use a fixed value α > 0 at each iteration k

θ k+1 ← θ k + αhk

Optimal step size αk∗ (a code sketch follows after this list)

θ k+1 ← θ k + αk∗ hk   with   αk∗ = arg min_{α>0} J(θ k + αhk )

Variable step size: the choice αk is adapted to the current iteration

θ k+1 ← θ k + αk hk
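As announced above, the one-dimensional problem arg min_{α>0} J(θ k + αhk ) can be delegated to a scalar minimizer. A sketch using scipy (the upper bound on α and the 'bounded' method are our own choices):

    import numpy as np
    from scipy.optimize import minimize_scalar

    def J(theta):
        t1, t2 = theta
        return t1**4 + t2**4 - 4.0 * t1 * t2

    def grad_J(theta):
        t1, t2 = theta
        return np.array([4.0 * t1**3 - 4.0 * t2, -4.0 * t1 + 4.0 * t2**3])

    def optimal_step(theta, h, alpha_max=1.0):
        # minimize phi(alpha) = J(theta + alpha * h) over (0, alpha_max]
        phi = lambda alpha: J(theta + alpha * h)
        res = minimize_scalar(phi, bounds=(0.0, alpha_max), method='bounded')
        return res.x

    theta_k = np.array([1.5, 0.5])
    h_k = -grad_J(theta_k)                  # steepest-descent direction
    print(optimal_step(theta_k, h_k))       # (approximately) optimal step along h_k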
Descent algorithms  Step size selection

Line search

Pros and cons


Fixed step size strategy: often not very effective
Optimal step size: can be costly in computation time
Variable step size: the most commonly used approach
The step is only computed approximately
A trade-off between computation cost and the decrease of J



Descent algorithms  Step size selection

Variable step size


Armijo’s rule
Determine the step size αk in order to have a sufficient decrease of J, i.e.

J(θ k + αk hk ) ≤ J(θ k ) + c αk ∇J(θ k )>hk

Usually c is chosen in the range [10⁻⁵ , 10⁻¹ ].
Since hk is a descent direction, ∇J(θ k )>hk < 0, so the condition enforces a
decrease of J.

Backtracking
1: Fix an initial step ᾱ, choose 0 < ρ < 1, α ← ᾱ
2: repeat
3:    α ← ρα
4: until J(θ k + αhk ) ≤ J(θ k ) + c α∇J(θ k )>hk

Choice of the initial step
Newton method: ᾱ = 1
Gradient method: ᾱ = 2 (J(θ k ) − J(θ k−1 )) / (∇J(θ k )>hk )

Interpretation: as long as J does not decrease enough, we shrink the step size
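A direct transcription of this rule (a sketch; the default values of ᾱ, ρ and c are common choices, not prescribed by the slides). It uses the equivalent while-loop form, so the initial step ᾱ is tried before any shrinking.

    import numpy as np

    def backtracking(J, grad_J, theta, h, alpha_bar=1.0, rho=0.5, c=1e-4):
        # shrink alpha until the Armijo sufficient-decrease condition holds
        alpha = alpha_bar
        slope = grad_J(theta) @ h          # directional derivative; negative for a descent direction
        while J(theta + alpha * h) > J(theta) + c * alpha * slope:
            alpha *= rho                   # not enough decrease: reduce the step
        return alpha

    # usage on the running example J(theta) = theta1^4 + theta2^4 - 4 theta1 theta2
    J = lambda th: th[0]**4 + th[1]**4 - 4.0 * th[0] * th[1]
    grad_J = lambda th: np.array([4.0 * th[0]**3 - 4.0 * th[1],
                                  -4.0 * th[0] + 4.0 * th[1]**3])
    theta_k = np.array([1.5, 0.5])
    h_k = -grad_J(theta_k)                 # steepest-descent direction
    print(backtracking(J, grad_J, theta_k, h_k))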
Descent algorithms Summary

Summary of descent methods


General algorithm
1: Initialize θk
2: repeat
3: Find direction of descent hk ∈ Rd
4: Line search: find the step αk > 0
5: Update: θ k+1 ← θ k + αk hk
6: until convergence

Method          Direction of descent h      Complexity   Convergence
Gradient        −∇J(θ)                      O(d)         linear
Quasi-Newton    −B(θ)⁻¹∇J(θ)                O(d²)        superlinear
Newton          −H(θ)⁻¹∇J(θ)                O(d³)        quadratic

Step size computation: backtracking (common) or optimal step size

Complexity of each method: depends on the complexity of calculating h, the
search for α, and the number of iterations performed till convergence
Illustration of descent methods

Gradient method

J along the iterations / Evolution of the iterates

[left: decrease of J(θ (k) ) versus the iteration index k; right: trajectory of
the iterates on the contours of J, starting from θ 0 ]



Illustration of descent methods

Newton method

J along the iterations / Evolution of the iterates

[left: decrease of J(θ (k) ) versus the iteration index k; right: trajectory of
the iterates on the contours of J, starting from θ 0 ]

At each iteration we considered the matrix H(θ) + λI instead of H to guarantee
the positive definiteness of the (modified) Hessian.
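The regularization mentioned above can be sketched as follows; the value of λ is an assumption, its role is simply to shift the eigenvalues of H upwards so that the computed direction is a descent direction.

    import numpy as np

    def regularized_newton_direction(grad, hess, lam=1.0):
        # Newton-like direction computed from H + lambda*I instead of H
        d = hess.shape[0]
        return np.linalg.solve(hess + lam * np.eye(d), -grad)

    # example: near the saddle point of the running example the plain Hessian is indefinite
    H = np.array([[0.0, -4.0], [-4.0, 0.0]])   # Hessian of J at (0, 0)
    g = np.array([0.1, -0.2])                  # an arbitrary (assumed) gradient value
    print(regularized_newton_direction(g, H, lam=5.0))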



Illustration of descent methods

Conclusion

Unconstrained optimization of a smooth objective function


Characterization of the solution(s) requires checking the optimality
conditions
Computation of a solution using descent methods
Gradient descent method
Newton method
Not covered in this lecture:
Convergence analysis of the studied algorithms
Non-smooth optimization
Gradient-free optimization

