
School of Computer Science and Applied Mathematics

APPM 3017: Optimization III


Lecture 02 : Convergence Analysis
Lecturer: Matthews Sejeso Date: August 2021

LEARNING OUTCOMES
By the completion of this lecture you should be able to:
1. Describe the convergence of optimization algorithms.
2. Describe the convergence rate of optimization algorithms.
3. Determine the convergence and convergence rate of basic optimization algorithms.
Reference: This lecture is largely based on
• Chapter 2 and 3, Jorge Nocedal and Stephen J. Wright, ‘Numerical Optimization’.
• Chapter 1, Dimitri P. Bertsekas, ‘Nonlinear Programming’.

2.1 Introduction

Optimization methods are iterative, and we cannot expect a nonlinear problem to be solved exactly in a
finite number of iterations. Instead, we hope that the sequence of iterates {xk } generated by the method
converges to a solution of the problem as k → ∞. Convergence is a vital characteristic of optimization
algorithms: it is the property that makes a method a theoretically valid optimization routine. To complete
the convergence analysis, we must also establish the speed at which the sequence {xk } converges to x∗ . We
state the convergence properties of the steepest descent method and Newton’s method.

2.2 Gradient methods

Recall the line search strategy from the previous lecture. At each iteration the algorithm chooses a search
direction dk and then decides how far to move along that direction. The distance to move along dk can
be found by approximately solving the following one-dimensional minimization problem for a stepsize s:

sk = arg min_{s>0} f (xk + s dk ).    (2.1)

In this lecture we assume a fixed stepsize sk = s. Thus, the iteration of a line search method is given by

xk+1 = xk + sk dk .

The success of a line search method depends on effective choices of both the direction dk and the stepsize
sk . Most line search methods require dk to be a descent direction, that is, one for which ∇f (xk )T dk < 0. This
property guarantees that the function f is reduced along the search direction. The search direction often
takes the form

dk = −Dk ∇f (xk ) which implies xk+1 = xk − sk Dk ∇f (xk )

where Dk is a positive definite symmetric matrix. Algorithms of this form are called gradient methods.
Here are some common choices of the matrix Dk , resulting in methods that are widely used:

• Steepest descent method: The choice of Dk is the identity matrix, that is Dk = I. This is the
simplest choice, but it often leads to slow convergence.
• Newton’s method: The choice of Dk is the inverse Hessian of the objective function at the current
iterate, that is Dk = [∇2 f (xk )]−1 . Newton’s method generally converges faster than the steepest
descent method.
The gradient methods discussed above come with proofs of convergence to a stationary point. Proofs for
convergence are usually long and tedious; therefore, we state theorems without proofs. Those who are
interested in the proofs are referred to the references.
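The two choices of Dk above can be sketched as a single routine. The following is an illustrative sketch, not code from the lecture; the quadratic test problem, stepsize and iteration counts are assumptions:

```python
import numpy as np

def gradient_method(grad, D, x0, s, iters):
    """Gradient method x_{k+1} = x_k - s * D(x_k) @ grad(x_k).

    D(x) returns a symmetric positive definite matrix: D = I gives
    steepest descent, D = inverse Hessian gives Newton's method.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = x - s * D(x) @ grad(x)
    return x

# Assumed quadratic test problem f(x) = 1/2 x^T Q x - b^T x, minimizer Q^{-1} b.
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
grad = lambda x: Q @ x - b
x_star = np.linalg.solve(Q, b)

# Steepest descent (D = I): simple, but needs many iterations.
x_sd = gradient_method(grad, lambda x: np.eye(2), [0.0, 0.0], s=0.2, iters=200)
# Newton (D = inverse Hessian): one full step suffices on a quadratic.
x_nt = gradient_method(grad, lambda x: np.linalg.inv(Q), [0.0, 0.0], s=1.0, iters=1)
```

Both runs approach the same minimizer; the difference lies in how many iterations each choice of Dk needs.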

2.3 Convergence

Let Ω be the subset of points that satisfy the first-order necessary optimality conditions of the general
optimization problem. From a theoretical point of view, an optimization algorithm stops when a point
x∗ ∈ Ω is reached; for this reason, the set Ω is called the target set. Convergence properties are
stated with reference to the target set Ω. In the case of unconstrained optimization, a possible target set
is Ω = {x ∈ Rn : ∇f (x) = 0}, whereas in the case of constrained optimization, Ω may be the set of KKT
points.

Definition 2.1 (Limit point). We say x∗ ∈ Rn is a limit point of a sequence {xk } if there exists a
subsequence of {xk } that converges to x∗ .
Let {xk } be a sequence of points produced by an algorithm. The algorithm converges to a solution x∗ if the
sequence {xk } converges to the unique limit point x∗ . An algorithm is said to be globally convergent if a
limit point x∗ of {xk } exists such that x∗ ∈ Ω for any starting point x0 ∈ Rn . An algorithm is said to be
locally convergent if the existence of a limit point x∗ ∈ Ω can be established only when the starting point x0
belongs to some neighbourhood of Ω.
Given the gradient methods, ideally we would like the generated sequence {xk } to converge to a global
minimum. Unfortunately, this is too much to expect unless f is convex, because of the possible
presence of local minima that are not global. A gradient method is guided downhill by the shape of f near
the current iterate while being oblivious to the global structure of f . Thus the method can easily get
attracted to any type of minimum, global or not. Furthermore, if a gradient method starts or lands at any
stationary point, including a local maximum, it stops at that point. Thus, the most we can expect from a
gradient method is that it converges to a stationary point.
Generally, depending on the nature of the cost function f , the sequence {xk } generated by the gradient
method need not have a limit point; in fact {xk } is typically unbounded if f has no local minima. If,
however, we know that the level set {x|f (x) ≤ f (x0 )} is bounded, and the step length is chosen to enforce
descent at each iteration, then the sequence {xk } must be bounded since it belongs to this level set. It
must then have at least one limit point because every bounded sequence has at least one limit point.

2.4 Convergence Rate

The second major issue regarding gradient methods is the rate at which the generated sequence {xk }
converges. The mere fact that {xk } converges to a stationary point x∗ will be of little practical value
unless the points xk are reasonably close to x∗ after relatively few iterations. Thus, the study of the
convergence rate provides the dominant criterion for selecting one algorithm in favour of others for solving
a particular problem.

We limit the discussion to the local analysis approach. The local analysis focuses on the local behaviour of
the method in a neighbourhood of an optimal solution. Local analysis can describe quite accurately the
behaviour of a method near the solution by using Taylor series approximations but ignores the behaviour
of the method entirely when far from the solution. We use the following criterion to establish convergence
rate:
• We restrict attention to sequences {xk } that converge to a unique limit point x∗ .
• Rate of convergence is evaluated using an error function e : Rn → R satisfying e(x) ≥ 0 for all
x ∈ Rn and e(x∗ ) = 0. Typical choices are the Euclidean distance

e(x) = ‖x − x∗ ‖,    (2.2)

and the cost difference

e(x) = |f (x) − f (x∗ )|. (2.3)

• The analysis is asymptotic; that is, we look at the rate of convergence of the tail of the error sequence
{e(xk )}.
• The most widely employed notion of rate of convergence is the Q-rate of convergence. The Q-rate
considers the quotient between two successive iterates given by:

e(xk+1 )/e(xk ) = ‖xk+1 − x∗ ‖/‖xk − x∗ ‖.    (2.4)

We define the asymptotic rate of convergence with respect to the Euclidean distance (2.2); the results
can easily be extended to the cost difference (2.3).

Definition 2.2. Suppose an optimization algorithm generates the sequence {xk }, and the sequence
converges to x∗ , i.e. xk → x∗ as k → ∞. Then if r is the largest real number for which

lim_{k→∞} e(xk+1 )/e(xk )^r = lim_{k→∞} ‖xk+1 − x∗ ‖/‖xk − x∗ ‖^r = γ, for 0 ≤ γ < ∞,    (2.5)

the sequence is said to have an asymptotic rate of convergence of order r, with asymptotic error constant
γ.
If r = 1 and γ = 0, we say that the sequence converges superlinearly (or Q-superlinearly). If r = 1
and 0 < γ < 1, the convergence is said to be linear (or Q-linear). If r = 2, the convergence is said to
be quadratic (or Q-quadratic), where γ is a positive constant not necessarily less than one. In general,
the rate of convergence is of Q-order r if there exists a positive constant γ such that (2.5) holds for all
sufficiently large k. However, a Q-order greater than two is seldom achieved.
Most optimization algorithms that are of interest in practice produce sequences converging either linearly
or superlinearly.
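These definitions are easy to probe numerically. A small sketch with hand-made error sequences (the sample sequences are illustrative choices, not from the lecture):

```python
# Sample error sequences and their successive ratios e_{k+1} / e_k^r.
lin = [0.5 ** k for k in range(1, 20)]           # e_k = (1/2)^k
quad = [0.5 ** (2 ** k) for k in range(1, 6)]    # e_k = (1/2)^(2^k)

# Q-linear: with r = 1 the ratio settles at gamma = 1/2 < 1.
lin_ratios = [lin[k + 1] / lin[k] for k in range(len(lin) - 1)]
# Q-quadratic: with r = 2 the ratio settles at gamma = 1.
quad_ratios = [quad[k + 1] / quad[k] ** 2 for k in range(len(quad) - 1)]
```

Printing the two ratio lists shows the first sitting at 1/2 and the second at 1, matching Definition 2.2.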
Example 2.1. If xk = a^(2^k) where a ∈ [0, 1), show that the sequence {xk } has a quadratic rate of convergence.

Example 2.2. Let xk = a^(2^{−k}) where a > 0 is a constant. Then it is clear that the sequence
{xk } = {a, a^(1/2), a^(1/4), . . . } converges to x∗ = 1, i.e. xk → 1 as k → ∞. To determine the rate of
convergence, choose r = 1 (this will turn out to be a good guess):

lim_{k→∞} e(xk+1 )/e(xk )^r = lim_{k→∞} |a^(2^{−(k+1)}) − 1| / |a^(2^{−k}) − 1|
    = lim_{k→∞} ( |a^(2^{−(k+1)}) − 1| · |a^(2^{−(k+1)}) + 1| ) / ( |a^(2^{−k}) − 1| · |a^(2^{−(k+1)}) + 1| )
    = lim_{k→∞} |a^(2^{−k}) − 1| / ( |a^(2^{−k}) − 1| · |a^(2^{−(k+1)}) + 1| )
    = lim_{k→∞} 1 / |a^(2^{−(k+1)}) + 1| = 1/2.

Hence the sequence converges linearly with asymptotic error constant γ = 1/2.
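This limit can be checked numerically; a quick sketch (the value a = 3 is an arbitrary choice):

```python
# Example 2.2 check: x_k = a^(2^{-k}) -> 1 and |x_{k+1} - 1| / |x_k - 1| -> 1/2.
a = 3.0
xs = [a ** (2.0 ** (-k)) for k in range(26)]
ratios = [abs(xs[k + 1] - 1.0) / abs(xs[k] - 1.0) for k in range(25)]
```

The tail of `ratios` is very close to 0.5, confirming linear convergence with asymptotic error constant 1/2.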

2.5 Steepest descent method: convergence analysis

In this section, we state convergence properties of the steepest descent method. We can learn much about
the steepest descent method by considering a quadratic objective function. We characterize the convergence
properties of the steepest descent method with an exact stepsize and with a fixed stepsize.

Theorem 2.1 (Convergence of steepest descent method). Consider the quadratic function
f (x) = (1/2) xT Qx − bT x + c, with a positive definite matrix Q, and let x∗ be the minimizer.

1. The sequence {xk } generated by the steepest descent method with exact line search converges to x∗
for any initial point x0 .
2. The sequence {xk } generated by the steepest descent method with a fixed stepsize sk = s converges to
x∗ for any initial point x0 if and only if the stepsize satisfies

0 < s < 2/λn ,

where λn is the largest eigenvalue of the matrix Q.
The above theorem establishes the global convergence of the steepest descent method for both exact line
search and a fixed stepsize. The fixed-stepsize algorithm is of practical interest because of its simplicity. In
particular, the algorithm does not require a line search at each iteration to determine sk , because the same
stepsize s is used at each iteration. However, care must be taken when choosing the value of the stepsize.
The stepsize has a direct impact on the speed of the algorithm.
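The bound 0 < s < 2/λn from Theorem 2.1 can be observed directly. A sketch with an assumed diagonal test problem, so that λn = 4 and the threshold is 0.5:

```python
import numpy as np

Q = np.diag([1.0, 4.0])             # eigenvalues 1 and 4, so 2/lambda_n = 0.5
b = np.array([1.0, 1.0])
x_star = np.linalg.solve(Q, b)

def final_error(s, iters=200):
    """Fixed-stepsize steepest descent; return the final error norm."""
    x = np.array([10.0, 10.0])
    for _ in range(iters):
        x = x - s * (Q @ x - b)     # x_{k+1} = x_k - s * grad f(x_k)
    return np.linalg.norm(x - x_star)

err_inside = final_error(0.4)       # s in (0, 0.5): the error shrinks to zero
err_outside = final_error(0.6)      # s > 0.5: the error blows up
```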
We now turn our attention to the convergence rate of the steepest descent method. We focus on the ideal
case, in which the objective function is quadratic and the line searches are exact. For a quadratic function
the exact stepsize is easy to compute and is given by

sk = ∇f (xk )T ∇f (xk ) / ( ∇f (xk )T Q∇f (xk ) ).    (2.6)

If we use this exact minimizer, the steepest descent iteration for a quadratic function is given by

xk+1 = xk − [ ∇f (xk )T ∇f (xk ) / ( ∇f (xk )T Q∇f (xk ) ) ] ∇f (xk ).    (2.7)

Since ∇f (xk ) = Qxk − b, this equation yields a closed-form expression for xk+1 in terms of xk . Figure
2.1b illustrates a typical sequence of iterates generated by the steepest descent method on a two-dimensional
quadratic objective function. The contours of f are ellipsoids whose axes lie along the orthogonal eigenvectors
of Q. Note that the iterates zig-zag towards the solution.
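The iteration (2.7) takes only a few lines to implement. A sketch; the badly scaled matrix Q below is an assumed example with condition number 10:

```python
import numpy as np

def steepest_descent_exact(Q, b, x0, iters):
    """Steepest descent with the exact stepsize (2.6) on f = 1/2 x^T Q x - b^T x."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        g = Q @ x - b                   # gradient of the quadratic
        if np.linalg.norm(g) < 1e-14:
            break
        s = (g @ g) / (g @ (Q @ g))     # exact stepsize, eq. (2.6)
        x = x - s * g                   # iteration (2.7)
    return x

Q = np.array([[10.0, 0.0], [0.0, 1.0]])   # kappa(Q) = 10: elongated contours
b = np.zeros(2)
x = steepest_descent_exact(Q, b, [1.0, 10.0], iters=100)  # minimizer is the origin
```

Recording the iterates of such a run reproduces the zig-zag pattern of Figure 2.1b.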

Figure 2.1: Steepest descent steps. (a) Contours of a well-scaled function. (b) Contours of a badly scaled function.

To quantify the rate of convergence we introduce the weighted norm ‖x‖²_Q = xT Qx. By using the relation
Qx∗ = b, we can show that

(1/2) ‖x − x∗ ‖²_Q = f (x) − f (x∗ ),    (2.8)
so this norm measures the difference between the current objective value and the optimal value. Using
the closed form of the steepest descent iterate (2.7) and noting that ∇f (xk ) = Q(xk − x∗ ), we can derive the
equality

‖xk+1 − x∗ ‖²_Q = ( 1 − (∇f (xk )T ∇f (xk ))² / [ (∇f (xk )T Q∇f (xk )) (∇f (xk )T Q⁻¹∇f (xk )) ] ) ‖xk − x∗ ‖²_Q .    (2.9)

This expression describes the exact decrease in the objective function f at each iteration, but since the
term inside the brackets is difficult to interpret, it is more useful to bound it in terms of the condition
number of the problem.

Theorem 2.2 (Convergence rate of steepest descent method). When the steepest descent method with the
exact line search (2.7) is applied to a strongly convex quadratic function, the error norm (2.8) satisfies

‖xk+1 − x∗ ‖²_Q ≤ ( (λn − λ1 )/(λn + λ1 ) )² ‖xk − x∗ ‖²_Q ,    (2.10)

where 0 < λ1 ≤ λ2 ≤ · · · ≤ λn are the eigenvalues of Q.
The inequalities (2.10) and (2.8) show that the function values f (xk ) converge to the minimum value f (x∗ )
at a linear rate. As a special case of this result, we see that convergence is achieved in one iteration
if all the eigenvalues are equal. In this case, Q is a multiple of the identity matrix, so the contours in
Figure 2.1a are circles and the steepest descent direction always points at the solution. In general, as the
condition number κ(Q) = λn /λ1 increases, the contours of the quadratic function become more elongated,
the zig-zag in Figure 2.1b becomes more pronounced, and (2.10) implies that the convergence degrades.
Even though (2.10) is a worst-case bound, it gives an accurate indication of the behaviour of the algorithm
when n > 2.
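The per-iteration bound (2.10) can be verified empirically. A sketch in which the matrix Q and the starting point are assumed test data:

```python
import numpy as np

Q = np.array([[4.0, 1.0], [1.0, 2.0]])
lam = np.linalg.eigvalsh(Q)                        # eigenvalues in ascending order
bound = ((lam[-1] - lam[0]) / (lam[-1] + lam[0])) ** 2

x = np.array([5.0, -3.0])                          # b = 0 here, so x* = 0
ratios = []
for _ in range(20):
    g = Q @ x                                      # gradient
    s = (g @ g) / (g @ (Q @ g))                    # exact stepsize (2.6)
    x_new = x - s * g
    # ratio of successive weighted errors ||x_{k+1}||_Q^2 / ||x_k||_Q^2
    ratios.append((x_new @ Q @ x_new) / (x @ Q @ x))
    x = x_new
```

Every entry of `ratios` stays below `bound`, as (2.10) predicts.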
The rate of convergence behaviour of the steepest descent method is essentially the same for general
nonlinear objective functions. In the following result we assume that the stepsize is the global minimizer
along the search direction.

Theorem 2.3 (Convergence rate of the steepest descent method on a general nonlinear function). Suppose that
f : Rn → R is twice continuously differentiable, and that the iterates generated by the steepest descent method
with exact line searches converge to a point x∗ at which the Hessian matrix ∇2 f (x∗ ) is positive definite.
Let r be a scalar satisfying

r ∈ ( (λn − λ1 )/(λn + λ1 ), 1 ),

where λ1 ≤ λ2 ≤ · · · ≤ λn are the eigenvalues of ∇2 f (x∗ ). Then for all k sufficiently large, we have

f (xk+1 ) − f (x∗ ) ≤ r² [ f (xk ) − f (x∗ ) ].

In general, we cannot expect the rate of convergence to improve if an inexact line search is used. Therefore,
Theorem 2.3 shows that the steepest descent method can have an unacceptably slow rate of convergence,
even when the Hessian is reasonably well conditioned.
Example 2.3. Consider the function f : R² → R given by

f (x) = (3/2)(x1² + x2²) + (1 + a)x1 x2 − (x1 + x2 ) + b,

where a and b are some unknown real-valued parameters. Answer the following questions:
a. Write the function f in the usual multivariable quadratic form.
b. Find the largest set of values of a and b such that the unique global minimizer of f exists, and write
down the minimizer (in terms of the parameters a and b).
c. Consider the following algorithm:

xk+1 = xk − (2/5) ∇f (xk ).

Find the largest set of values of a and b for which the above algorithm converges to the global minimizer
of f for any initial point x0 .

2.6 Newton’s method: convergence analysis

Convergence analysis of Newton’s method when f is a quadratic function is straightforward. In fact,
Newton’s method reaches the point x∗ such that ∇f (x∗ ) = 0 in just one step, starting from any initial
point x0 . This follows from the fact that the Newton direction minimizes the quadratic approximation of the
objective function at any step.
For a general nonlinear objective function f , the Hessian matrix ∇2 f (xk ) may not always be positive
definite, so dk may not always be a descent direction. We therefore discuss the local convergence properties
of Newton’s method. For all x in the vicinity of a solution point x∗ such that ∇2 f (x∗ ) is
positive definite, the Hessian ∇2 f (x) will also be positive definite. Newton’s method is well defined
in this region and converges quadratically, provided that the stepsizes sk are eventually always 1. The
following theorem summarizes the convergence properties of Newton’s method.

Theorem 2.4. Suppose that f : Rn → R is twice differentiable and that the Hessian ∇2 f (x) is Lipschitz
continuous in a neighbourhood of a solution x∗ at which the second-order sufficient conditions are satisfied.
Consider the iteration xk+1 = xk + dk , where dk is the Newton direction. Then,
1. if the starting point x0 is sufficiently close to x∗ , the sequence {xk } generated by Newton’s method
converges to x∗ ;
2. the rate of convergence of the sequence {xk } generated by Newton’s method is quadratic;
3. in addition, the sequence of gradient norms {‖∇f (xk )‖} converges quadratically to zero.
The above theorem states that Newton’s method is locally convergent (i.e. it converges only for starting points
near the optimal solution x∗ ), and that it has a quadratic rate of convergence. Thus, Newton’s method is very
fast compared to the linear rate of the steepest descent method. There are two approaches for obtaining a
globally convergent iteration based on the Newton step: a line search approach, in which the Hessian ∇2 f (xk )
is modified, if necessary, to make it positive definite and thereby yield a descent direction, and a trust region
approach, in which ∇2 f (xk ) is used to form a quadratic model that is minimized in a ball around the
current iterate xk .
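The quadratic rate of Theorem 2.4 is visible even in one dimension. A sketch with the assumed example f(x) = e^x − 2x, whose minimizer is x∗ = ln 2:

```python
import math

x, x_star = 1.0, math.log(2.0)   # start near the minimizer x* = ln 2
errors = []
for _ in range(5):
    # Newton step x - f'(x)/f''(x), with f'(x) = e^x - 2 and f''(x) = e^x > 0
    x = x - (math.exp(x) - 2.0) / math.exp(x)
    errors.append(abs(x - x_star))
```

Each entry of `errors` is roughly the square of the previous one, which is the signature of quadratic convergence.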

2.7 Summary

In this lecture we discussed convergence properties of the steepest descent and Newton’s methods; Table 2.1
gives a comparative summary of the two algorithms.

                                          Steepest descent          Newton’s method
Information required at each iteration    f (xk ) and ∇f (xk )      f (xk ), ∇f (xk ) and ∇2 f (xk )
Search direction                          dk = −∇f (xk )            dk = −[∇2 f (xk )]−1 ∇f (xk )
Convergence                               Globally convergent       Locally convergent
Behaviour on quadratic functions          Asymptotic convergence    Convergence in one step
Convergence rate                          Linear                    Quadratic

Table 2.1: Comparison between the steepest descent method and Newton’s method.

PROBLEM SET II
1. Show that the sequence xk = 1 + (1/2)^(2^k) is quadratically convergent to 1.
2. Does the sequence xk = 1/k! converge superlinearly? quadratically?
3. Consider the sequence {xk } defined by

xk = (1/4)^(2^k)   if k is even,
xk = xk−1 /k       if k is odd.

Is this sequence superlinearly convergent? Quadratically convergent?


4. Prove that if the sequence {xk } converges to x∗ superlinearly, then

lim_{k→∞} ‖xk+1 − xk ‖ / ‖xk − x∗ ‖ = 1,

and show, using the sequence

xk = 1/k!      if k is odd,
xk = 2 xk−1    if k is even,

as a counterexample, that the converse is not true.


5. Let f : Rn → R be the quadratic function f (x) = (1/2) xT Qx + xT b + c, where Q ∈ Rn×n is a symmetric
positive definite matrix, b ∈ Rn and c ∈ R. Derive the exact line search stepsize for a quadratic function.
6. Consider minimizing the function f (x) = x1² + x2² .

a. Use the steepest descent method xk+1 = xk − s∇f (xk ), where s is chosen to minimize f . Show
that xk+1 will be the optimal solution after only one iteration. You should be able to optimize
f with respect to s analytically. Start from the initial point x0 = [3, 5]T .
b. Show that Newton’s method reaches the optimal solution in only one iteration. Start from the
same initial point as in part a.
7. Consider the function f (x) = 3(x1² + x2²) + 4x1 x2 + 5x1 + 6x2 + 7. Suppose we use the steepest descent
method with a fixed stepsize to find the minimizer of f :

xk+1 = xk − s∇f (xk ).

Find the largest range of values of s for which the algorithm is globally convergent.
8. Consider the least squares optimization problem:

min ‖Ax − b‖₂² ,

where A ∈ Rm×n , m ≥ n, and b ∈ Rm . Answer the following questions:


a. Show that the objective function for the above is a quadratic function and write down the gradient
and Hessian of this quadratic.
b. Write down the steepest descent algorithm with a fixed stepsize for solving the above optimization
problem.
c. Suppose

A = [ 1  0
      0  2 ].

Find the largest range of values of s such that the algorithm in part b
converges to the solution of the problem.
9. Let f : Rn → R be the quadratic function f (x) = (1/2) xT Qx + xT b + c, where Q ∈ Rn×n is a symmetric
positive definite matrix, b ∈ Rn and c ∈ R.
a. Consider Newton’s method xk+1 = xk + dk , where dk is the Newton search direction. Show
that Newton’s method reaches the optimal point x∗ of f in one step.
b. Consider the modified Newton’s method xk+1 = xk + sk dk , where sk = arg min_s f (xk + s dk )
and dk is the Newton search direction. Does the modified Newton’s method possess the same
property as the standard Newton’s method (part a)? Justify your answer.
10. Consider the Rosenbrock function f (x) = 100(x2 − x1²)² + (1 − x1 )². This function is also known as
the banana function because of the shape of its level sets.
a. Prove that the point [1, 1]T is the unique global minimizer of f over R².
b. With the starting point x0 = [0, 0]T , apply two iterations of Newton’s method.
c. Repeat part b using the steepest descent method with a fixed stepsize sk = 0.05 at each
iteration. Compare your results with part b.

