
Newton’s Method for Unconstrained Optimization

Robert M. Freund

March, 2014

© 2014 Massachusetts Institute of Technology. All rights reserved.

1 Newton’s Method

Consider the unconstrained optimization problem:

    (P):   min  f (x)
           x ∈ Rn .

At x = x̄, f (·) can be approximated by its second-order Taylor expansion at x̄, which we denote by h(·):

    f (x) ≈ h(x) := f (x̄) + ∇f (x̄)T (x − x̄) + (1/2)(x − x̄)T H(x̄)(x − x̄) .

Here ∇f (x) is the gradient of f (x) and H(x) is the Hessian of f (x) at x.
Notice that h(·) is a quadratic function. We can attempt to minimize
h(·) by computing a point x for which ∇h(x) = 0. Since the gradient of
h(x) is:
∇h(x) = ∇f (x̄) + H(x̄)(x − x̄) ,
we therefore are motivated to solve:

0 = ∇h(x) = ∇f (x̄) + H(x̄)(x − x̄) ,

the solution of which is given by

x = x̄ − H(x̄)−1 ∇f (x̄) .

The direction −H(x̄)−1 ∇f (x̄) is called the Newton direction, or the Newton
step, at x̄. If we compute the next iterate as

x̄N := x̄ − H(x̄)−1 ∇f (x̄) ,

then we call x̄N the Newton iterate from the point x̄. If we compute all
iterations this way, we call the algorithm Newton’s Method, whose formal
description is presented in Algorithm 1.
Notice that Newton’s Method presumes that H(xk ) is nonsingular at
each iteration.

Algorithm 1 Newton’s Method
Initialize at x0 , and set k ← 0 .
At iteration k :
1. dk := −H(xk )−1 ∇f (xk ). If dk = 0, then stop.
2. Choose step-size αk = 1 .
3. Set xk+1 ← xk + αk dk , k ← k + 1 .

Also notice that there is not necessarily any guarantee of descent in the
algorithm. That is, it is unclear without further assumptions whether or
not it will hold that f (xk+1 ) ≤ f (xk ).
Note that Step 2 could be augmented by a line-search of f (xk + αdk )
to find a good (or even optimal) value of the step-size parameter α with
respect to descent of the function f (·).
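To make Algorithm 1 concrete, here is a minimal sketch in Python. The code is an illustration added to this note: the user-supplied callables grad and hess, the stopping tolerance, and the iteration cap are assumptions rather than part of the algorithm as stated above. The fixed alpha=1.0 corresponds to Step 2; replacing it with a line-search would implement the augmentation just mentioned.

```python
import numpy as np

def newtons_method(grad, hess, x0, alpha=1.0, tol=1e-10, max_iter=100):
    """Sketch of Algorithm 1: x_{k+1} = x_k + alpha * d_k, where
    d_k = -H(x_k)^{-1} grad f(x_k).  Assumes H(x_k) is nonsingular."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        # Solve H(x) d = -grad f(x) rather than forming the inverse explicitly.
        d = np.linalg.solve(hess(x), -g)
        if np.linalg.norm(d) <= tol:  # d_k = 0 (numerically): stop
            break
        x = x + alpha * d
    return x

# Example 1 later in this note: f(x) = 7x - ln(x), whose minimizer is x* = 1/7.
x_min = newtons_method(lambda x: np.array([7.0 - 1.0 / x[0]]),
                       lambda x: np.array([[1.0 / x[0] ** 2]]),
                       x0=[0.1])
print(x_min)  # roughly [0.14285714]
```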
Recall that a matrix M is called SPD if M is symmetric and positive
definite. The following proposition shows that the Newton direction is a
descent direction if H(x) is SPD.

Proposition 1.1 If H(x) is SPD and d := −H(x)−1 ∇f (x) ≠ 0, then d is a descent direction, i.e., f (x + αd) < f (x) for all sufficiently small positive values of α.

Proof: It is sufficient to show that ∇f (x)T d = −∇f (x)T H(x)−1 ∇f (x) < 0. This will clearly be the case if H(x)−1 is SPD. Since H(x) is SPD, for any v ≠ 0 we have

    0 < (H(x)−1 v)T H(x)(H(x)−1 v) = vT H(x)−1 v ,

thereby showing that H(x)−1 is SPD.
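As a quick numerical illustration of Proposition 1.1 (added here; the SPD matrix standing in for H(x) and the random vector standing in for ∇f (x) are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
H = A @ A.T + 5.0 * np.eye(5)   # an SPD matrix playing the role of H(x)
g = rng.standard_normal(5)      # plays the role of grad f(x), nonzero

d = -np.linalg.solve(H, g)      # Newton direction d = -H^{-1} g
print(g @ d)                    # equals -g^T H^{-1} g, which is negative
```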

1.1 Linear, Superlinear, and Quadratic Convergence Rates

It turns out that Newton's method converges to a solution extremely rapidly under certain circumstances. In order to develop this result precisely, we first need to understand some concepts about rates of convergence of sequences in general.

A sequence of numbers {si } exhibits linear convergence if limi→∞ si = s̄ and

    limi→∞ |si+1 − s̄| / |si − s̄| = δ < 1 .    (1)

An example of a linearly convergent sequence is:

    si = (1/10)^i ,   i = 1, 2, . . . .

Here we have s1 = 0.1, s2 = 0.01, s3 = 0.001, etc., and s̄ = 0. Notice that

    |si+1 − s̄| / |si − s̄| = 0.1 ,

so this sequence exhibits linear convergence with δ = 0.1 .
A sequence of numbers {si } exhibits superlinear convergence if the sequence exhibits linear convergence and δ = 0 in (1).
An example of a superlinearly convergent sequence is:

    si = 1/i! ,   i = 1, 2, . . . .

Here we have s1 = 1, s2 = 1/2, s3 = 1/6, s4 = 1/24, s5 = 1/120, etc., and s̄ = 0. Notice that

    limi→∞ |si+1 − s̄| / |si − s̄| = limi→∞ 1/(i + 1) = 0 ,

so this sequence exhibits superlinear convergence.
A sequence of numbers {si } exhibits quadratic convergence if limi→∞ si = s̄ and

    limi→∞ |si+1 − s̄| / |si − s̄|² = δ < ∞ .

An example of a quadratically convergent sequence is:

    si = (1/10)^(2^(i−1)) ,   i = 1, 2, . . . .

Here we have s1 = 0.1, s2 = 0.01, s3 = 0.0001, s4 = 0.00000001, etc., and s̄ = 0. Notice that

    limi→∞ |si+1 − s̄| / |si − s̄|² = limi→∞ 10^(−2^i) / (10^(−2^(i−1)))² = 1 ,

so this sequence exhibits quadratic convergence.
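The three example sequences above are easy to check numerically; the short script below (an added illustration) prints the ratios that define the three convergence rates.

```python
from math import factorial

linear    = [0.1 ** i for i in range(1, 8)]                  # delta = 0.1
superlin  = [1.0 / factorial(i) for i in range(1, 8)]        # delta = 0
quadratic = [0.1 ** (2 ** (i - 1)) for i in range(1, 6)]     # delta = 1

print([linear[i + 1] / linear[i] for i in range(6)])             # all 0.1
print([superlin[i + 1] / superlin[i] for i in range(6)])         # 1/(i+1) -> 0
print([quadratic[i + 1] / quadratic[i] ** 2 for i in range(4)])  # approximately 1
```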

1.2 Quadratic Convergence of Newton’s Method
In this section we will show the quadratic convergence of Newton’s method.
To develop this result, we will need to work with the operator norm of a
matrix M , which is defined as follows:

    ∥M ∥ := max{∥M x∥ : ∥x∥ = 1} .

Theorem 1.1 (Quadratic Convergence Theorem) Suppose f (·) is twice continuously differentiable and x∗ is a point for which ∇f (x∗ ) = 0. Suppose that H(·) satisfies the following two conditions:

• there exists a scalar h > 0 for which ∥[H(x∗ )]−1 ∥ ≤ 1/h , and

• there exist scalars β > 0 and L > 0 for which ∥H(x) − H(y)∥ ≤ L∥x − y∥ for all x, y ∈ B(x∗ , β) := {w : ∥w − x∗ ∥ ≤ β} .

Let x satisfy ∥x − x∗ ∥ < γ := min{β, 2h/(3L)}, and let xN := x − H(x)−1 ∇f (x). Then:

(i) ∥xN − x∗ ∥ ≤ ∥x − x∗ ∥² ( L / (2(h − L∥x − x∗ ∥)) ) ≤ ∥x − x∗ ∥² ( 3L / (2h) )

(ii) ∥xN − x∗ ∥ < ∥x − x∗ ∥ < γ

(iii) Define x0 := x, x1 := xN , and let xi denote the Newton iterate from xi−1 for i = 2, 3, . . .. Then ∥xi − x∗ ∥ → 0 as i → ∞.

Before proving this theorem let us illustrate it with two examples.


Example 1: Let f (x) = 7x − ln(x) for x > 0. Then ∇f (x) = f ′ (x) = 7 − 1/x and H(x) = f ′′ (x) = 1/x² . It is not hard to check that x∗ = 1/7 = 0.142857143 is the unique global minimum of f (·). The Newton direction at x is

    d = −H(x)−1 ∇f (x) = −f ′ (x)/f ′′ (x) = −x² (7 − 1/x) = x − 7x² .

Newton's method will generate the sequence of iterates {xk } satisfying:

    xk+1 = xk + (xk − 7(xk )²) = 2xk − 7(xk )² .

Table 1 shows the value of the Newton iterates x1 , x2 , x3 , . . ., for four different starting points, namely x0 = 1.0, 0.0, 0.1, and 0.01. Notice that the Newton iterates diverge to −∞ for the starting point x0 = 1.0. For the starting point x0 = 0.0 the Newton iterates are all zero. For the starting point x0 = 0.1, the Newton iterates converge to 1/7 = 0.142857143, which is the solution to the problem. Notice also that starting with the second iteration, the number of accurate decimal digits doubles at each iteration. This doubling of the accurate digits is the essence of quadratic convergence. For the starting point x0 = 0.01, the Newton iterates also converge to the solution 1/7 = 0.142857143. For this starting point, we see the doubling of the number of accurate decimal digits starting with the fifth iteration.
It turns out that there is deep theory from which it follows that for this
example Newton’s method will converge quadratically from any starting
point in the range x0 ∈ (0 , 2/7) = (0.0 , 0.2857143).

                      Starting Value x0 of Newton's Method

Iteration   x0 = 1.0            x0 = 0.0    x0 = 0.1       x0 = 0.01
   x1       −5.0                0           0.13           0.0193
   x2       −185.0              0           0.1417         0.03599257
   x3       −239,945.0          0           0.14284777     0.062916884
   x4       −4.0302 × 10^11     0           0.142857142    0.098124028
   x5       −1.1370 × 10^24     0           0.142857143    0.128849782
   x6       −9.0486 × 10^48     0           0.142857143    0.1414837
   x7       −5.7314 × 10^98     0           0.142857143    0.142843938
   x8       −∞                  0           0.142857143    0.142857142
   x9       −∞                  0           0.142857143    0.142857143
   x10      −∞                  0           0.142857143    0.142857143

Table 1: Values of the Newton iterates for f (x) = 7x−ln(x) for four different
starting points.
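The iterates in Table 1 are straightforward to reproduce; the sketch below (an added illustration) simply applies the recursion xk+1 = 2xk − 7(xk)² derived above.

```python
def newton_example1(x0, iters=7):
    """Newton iterates for f(x) = 7x - ln(x): x_{k+1} = 2*x_k - 7*x_k**2."""
    xs = [x0]
    for _ in range(iters):
        xs.append(2.0 * xs[-1] - 7.0 * xs[-1] ** 2)
    return xs

for x0 in (1.0, 0.0, 0.1, 0.01):
    print(x0, newton_example1(x0))
# Starting at 1.0 the iterates blow up toward -infinity; starting at 0.1 or
# 0.01 they converge to 1/7 = 0.142857143, matching Table 1.
```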

Example 2: Let f (x) = − ln(1 − x1 − x2 ) − ln x1 − ln x2 on the domain {(x1 , x2 ) : x1 > 0, x2 > 0, x1 + x2 < 1 }. The gradient ∇f (x) and Hessian

H(x) are easily derived to be:

    ∇f (x) = (  1/(1 − x1 − x2 ) − 1/x1 ,   1/(1 − x1 − x2 ) − 1/x2  )T

and

    H(x) = [  (1/(1 − x1 − x2 ))² + (1/x1 )²        (1/(1 − x1 − x2 ))²
              (1/(1 − x1 − x2 ))²                    (1/(1 − x1 − x2 ))² + (1/x2 )²  ] .
Furthermore, it is simple to verify that x∗ = (1/3, 1/3) is the optimal solution since ∇f (x∗ ) = 0, and that f (x∗ ) = 3.295836866. Table 2 shows the iterate values of Newton's method applied to this problem starting at (x01 , x02 ) = (0.85, 0.05). Here we see the doubling of the accurate decimal digits for x1 and x2 starting around the fourth iteration, and the doubling of the number of leading zeros in the distance ∥xk − x∗ ∥ as well.

k   xk1                   xk2                   ∥xk − x∗ ∥
0 0.85 0.05 0.58925565098879
1 0.717006802721088 0.0965986394557823 0.450831061926011
2 0.512975199133209 0.176479706723556 0.238483249157462
3 0.352478577567272 0.273248784105084 0.0630610294297446
4 0.338449016006352 0.32623807005996 0.00874716926379655
5 0.333337722134802 0.333259330511655 7.41328482837195 e−05
6 0.333333343617612 0.33333332724128 1.19532211855443 e−08
7 0.333333333333333 0.333333333333333 1.57009245868378 e−16

Table 2: Values of the Newton iterates for f (x) = − ln(1 − x1 − x2 ) − ln x1 − ln x2 , starting at (x01 , x02 ) = (0.85, 0.05).
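Table 2 can be reproduced with a few lines of code; the sketch below (an added illustration) hard-codes the gradient and Hessian derived above and takes pure Newton steps from (0.85, 0.05).

```python
import numpy as np

def grad(x):
    s = 1.0 / (1.0 - x[0] - x[1])
    return np.array([s - 1.0 / x[0], s - 1.0 / x[1]])

def hess(x):
    s2 = 1.0 / (1.0 - x[0] - x[1]) ** 2
    return np.array([[s2 + 1.0 / x[0] ** 2, s2],
                     [s2, s2 + 1.0 / x[1] ** 2]])

x = np.array([0.85, 0.05])
x_star = np.array([1.0, 1.0]) / 3.0
for k in range(8):
    print(k, x, np.linalg.norm(x - x_star))
    x = x - np.linalg.solve(hess(x), grad(x))   # pure Newton step (alpha = 1)
```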

Let us now return to an evaluation of Theorem 1.1.


The convergence rate stated in Theorem 1.1 is quadratic, since in particular:

    ∥xN − x∗ ∥ / ∥x − x∗ ∥² ≤ 3L / (2h) .

Furthermore, it holds from part (i) of Theorem 1.1 that in the limit we obtain ∥xN − x∗ ∥ / ∥x − x∗ ∥² ≤ L / (2h). Of course, in practice one typically never knows the values of β, h, or L. Instead one implements the method and hopes that the starting point x0 will yield a quadratically convergent sequence. There is an exceptional family of functions, called self-concordant functions, for which we can easily compute whether a given starting point is in the range of quadratic convergence. This family of functions will be used later on in the construction of specialized algorithms for convex optimization in general.
Notice that the constants β, h, and L in Theorem 1.1 depend on the
choice of norm used. However, Newton’s method itself is invariant with
regard to the norm. Continuing this theme, it is easy to show that Newton’s
method is invariant under invertible linear transformation (see Exercise 7),
but under a linear transformation the values of β, h, and L would change.
This reveals an incompleteness in the theory of Newton’s method as stated
in Theorem 1.1.
Furthermore, notice that Theorem 1.1 does not assume convexity – in-
deed it only assumes that H(x∗ ) is nonsingular and not badly behaved
near x∗ . One can view Newton’s method as trying to solve the equation
“∇f (x) = 0” by successively approximating ∇f (x) by its local linear ap-
proximation:

∇f (x) ≈ ∇f (x̄) + H(x̄)(x − x̄) .


Since the method is essentially only trying to solve ∇f (x) = 0, it is equally
attracted to local minima and local maxima.
As mentioned earlier, Newton’s method presumes that H(xk ) is nonsin-
gular for all iterates. If H(xk ) becomes increasingly near-singular or loses
positive definiteness, one way to “fix” this is to replace H(xk ) by:

H(xk ) + εI , (2)

for some (hopefully) small value of ε. In fact, one can think of the steepest-descent method as a limiting case of the above with ε → +∞ in (2), which effectively causes the Newton step to be close to −(1/ε)∇f (xk ).
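A minimal sketch of the modified step based on (2) (added here; the particular damping value ε is arbitrary):

```python
import numpy as np

def damped_newton_step(H, g, eps=1e-3):
    """Newton step with H(x_k) replaced by H(x_k) + eps*I, as in (2).
    As eps grows, the step approaches -(1/eps) * g, a steepest-descent-like step."""
    return np.linalg.solve(H + eps * np.eye(H.shape[0]), -g)
```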
The work per iteration of Newton’s method is the cost of solving the
equation system
H(xk )(x − xk ) = −∇f (xk ) ,

which is O(n³) operations. When n becomes very large (say, n = 10⁶ ), this becomes prohibitively expensive. To partially overcome this, optimizers have developed so-called "quasi-Newton methods" that use low-rank approximate updates of H(xk )−1 at each iteration in an attempt to do less work per iteration.
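One widely used member of this family is the BFGS method; the rank-two update of the inverse-Hessian approximation below is a standard formula, shown here only as an illustration of what a low-rank update looks like (it is not derived in this note). Each such update costs O(n²) work, compared with the O(n³) linear solve above.

```python
import numpy as np

def bfgs_inverse_update(Hinv, s, y):
    """BFGS rank-two update of Hinv, an approximation to H(x_k)^{-1}.
    Here s = x_{k+1} - x_k and y = grad f(x_{k+1}) - grad f(x_k)."""
    rho = 1.0 / (y @ s)
    V = np.eye(len(s)) - rho * np.outer(s, y)
    return V @ Hinv @ V.T + rho * np.outer(s, s)
```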

2 Proof of Theorem 1.1


The proof of Theorem 1.1 relies on the following two "elementary" facts. For the first fact, let ∥v∥ denote the usual Euclidean norm of a vector, namely ∥v∥ := √(vT v). The operator norm of a matrix M is defined as follows:

    ∥M ∥ := max{∥M x∥ : ∥x∥ = 1} .

Proposition 2.1 Suppose that M is a symmetric matrix. Then the following are equivalent:

1. h > 0 satisfies ∥M −1 ∥ ≤ 1/h .

2. h > 0 satisfies ∥M v∥ ≥ h · ∥v∥ for any vector v .

You are asked to prove this proposition in Exercise 1 at the end of this
lecture note.
Proposition 2.2 Suppose that f (·) is twice differentiable. Then

    ∇f (z) − ∇f (x) = ∫₀¹ [H(x + t(z − x))] (z − x) dt .

Proof: Let ϕ(t) := ∇f (x + t(z − x)). Then ϕ(0) = ∇f (x) and ϕ(1) = ∇f (z), and ϕ′(t) = [H(x + t(z − x))] (z − x). From the fundamental theorem of calculus, we have:

    ∇f (z) − ∇f (x) = ϕ(1) − ϕ(0) = ∫₀¹ ϕ′(t) dt = ∫₀¹ [H(x + t(z − x))] (z − x) dt .

Proof of Theorem 1.1: We have:

    xN − x∗ = x − H(x)−1 ∇f (x) − x∗

            = x − x∗ + H(x)−1 (∇f (x∗ ) − ∇f (x))

            = x − x∗ + H(x)−1 ∫₀¹ [H(x + t(x∗ − x))] (x∗ − x) dt    (from Proposition 2.2)

            = H(x)−1 ∫₀¹ [H(x + t(x∗ − x)) − H(x)] (x∗ − x) dt .

Therefore

    ∥xN − x∗ ∥ ≤ ∥H(x)−1 ∥ ∫₀¹ ∥H(x + t(x∗ − x)) − H(x)∥ ∥x∗ − x∥ dt

               ≤ ∥x − x∗ ∥ ∥H(x)−1 ∥ ∫₀¹ L · t · ∥x − x∗ ∥ dt

               = ∥x − x∗ ∥² ∥H(x)−1 ∥ L ∫₀¹ t dt

               = ∥x − x∗ ∥² ∥H(x)−1 ∥ L / 2 .

We now bound ∥H(x)−1 ∥. Let v be any vector. Then

    ∥H(x)v∥ = ∥H(x∗ )v + (H(x) − H(x∗ ))v∥

            ≥ ∥H(x∗ )v∥ − ∥(H(x) − H(x∗ ))v∥

            ≥ h · ∥v∥ − ∥H(x) − H(x∗ )∥ ∥v∥    (from Proposition 2.1)

            ≥ h · ∥v∥ − L∥x − x∗ ∥ · ∥v∥

            = (h − L∥x − x∗ ∥) · ∥v∥ .

Invoking Proposition 2.1 again, we see that this implies that

    ∥H(x)−1 ∥ ≤ 1 / (h − L∥x − x∗ ∥) .

Combining this with the above yields

    ∥xN − x∗ ∥ ≤ ( L / (2(h − L∥x − x∗ ∥)) ) ∥x − x∗ ∥² ,

which is the first inequality in (i) of the theorem. Noting also that L∥x − x∗ ∥ ≤ 2h/3, it then follows that:

    ∥xN − x∗ ∥ ≤ ( L / (2(h − L∥x − x∗ ∥)) ) ∥x − x∗ ∥² ≤ ( L / (2(h − 2h/3)) ) ∥x − x∗ ∥² = ( 3L / (2h) ) ∥x − x∗ ∥² ,

which is the second inequality of (i).


Because ∥x − x∗ ∥ < γ, there exists ε > 0 for which ∥x − x∗ ∥ ≤ (1 − ε)γ. And because L∥x − x∗ ∥ ≤ (2h/3)(1 − ε) we have:

    ∥xN − x∗ ∥ ≤ ( 3L∥x − x∗ ∥ / (2h) ) ∥x − x∗ ∥ ≤ (1 − ε)∥x − x∗ ∥ < ∥x − x∗ ∥ ,

which establishes (ii) of the theorem, and also shows:

    ∥xN − x∗ ∥ ≤ (1 − ε)∥x − x∗ ∥ .    (3)

Finally, it follows from (3) that ∥xi − x∗ ∥ ≤ (1 − ε)^i ∥x0 − x∗ ∥ → 0 as i → ∞, which shows (iii).

3 Newton’s Method Exercises


1. Prove Proposition 2.1.

2. Suppose we want to minimize the following function:

f (x) = 9x − 4 ln(x − 7)

over the domain X = {x | x > 7} using Newton’s method.

(a) Give an exact formula for the Newton iterate for a given value of
x.
(b) Using a calculator (or a computer, if you wish), compute five
iterations of Newton’s method starting at each of the following
points, and record your answers:

• x = 7.40
• x = 7.20
• x = 7.01
• x = 7.80
• x = 7.88
(c) Verify empirically that Newton’s method will converge to the op-
timal solution for all starting values of x in the range (7, 7.8888).
What behavior does Newton’s method exhibit outside of this
range?

3. Suppose we want to minimize the following function:

f (x) = 6x − 4 ln(x − 2) − 3 ln(25 − x)

over the domain X = {x | 2 < x < 25} using Newton’s method.

(a) Give an exact formula for the Newton iterate for a given value of
x.
(b) Using a calculator (or a computer, if you wish), compute five
iterations of Newton’s method starting at each of the following
points, and record your answers:
• x = 2.60
• x = 2.70
• x = 2.40
• x = 2.80
• x = 3.00
(c) Verify empirically that Newton’s method will converge to the op-
timal solution for all starting values of x in the range (2, 3.05).
What behavior does Newton’s method exhibit outside of this
range?

4. Suppose that we seek to minimize the following function:

f (x1 , x2 ) = −9x1 −10x2 +θ(− ln(100−x1 −x2 )−ln(x1 )−ln(x2 )−ln(50−x1 +x2 )),

where θ is a given parameter, on the domain X = {(x1 , x2 ) | x1 > 0, x2 > 0, x1 + x2 < 100, x1 − x2 < 50}. This exercise asks you to implement Newton's method on this problem, first without a line-search,
and then with a line-search. Note that your implementation should
include code that will check whether or not the current iterate value
is in the domain X, and which will stop the current run and report
“Failure at iteration ” if the current iterate is not in the domain X.
Run your algorithm for θ = 10 and for θ = 100, using the following
starting points.

• x0 = (8, 90)
• x0 = (1, 40)
• x0 = (15, 68.69)
• x0 = (10, 20)

(a) When you run Newton’s method without a line-search for this
problem and with these starting points, what behavior do you observe?
(b) When you run Newton’s method with a line-search for this prob-
lem, what behavior do you observe?

5. Herein we have described Newton's method as a method for finding a point x∗ for which ∇f (x∗ ) = 0. Now consider the following setting, where we have n nonlinear equations in n unknowns x = (x1 , . . . , xn ):

       g1 (x) = 0
          ⋮
       gn (x) = 0 ,
which we conveniently write as

g(x) = 0 .

Let J(x) denote the Jacobian matrix (of partial derivatives) of g(x).
Then at x = x̄ we have

g(x̄ + d) ≈ g(x̄) + J(x̄)d ,

this being the linear approximation of g(x) at x = x̄. Construct a version of Newton's method for solving the equation system g(x) = 0.

6. Suppose that f (x) is a strictly convex twice-continuously differentiable
function whose Hessian H(x) is nonsingular for all x, and consider
Newton's method with a line-search. Given x̄, we compute the Newton direction d = −[H(x̄)]−1 ∇f (x̄), and the next iterate is x̃ := x̄ + ᾱd, where the step-size ᾱ is chosen to satisfy:

       ᾱ := arg minα f (x̄ + αd) .

Suppose that Newton's method with a line-search is started at x0 , and that {x : f (x) ≤ f (x0 )} is a closed and bounded set. Prove that
the iterates of this method converge to the unique global minimum of
f (x).

7. Let y := Ax be an invertible linear transformation. Show that Newton's method is "invariant under invertible linear transformations" in
the following intuitive sense. Let f (x) be our function of interest and
let h(y) := f (A−1 y) be this same function but instead expressed as a
function of y. Let x0 , . . . be the sequence of Newton iterates for f (x)
starting at x0 , and y 0 , . . . be the sequence of Newton iterates for h(y)
starting at y 0 = Ax0 . Then Newton’s method is invariant under in-
vertible linear transformations if y i = Axi for all i = 0, . . .. Show that
indeed this is the case, namely, that y i = Axi for all i = 0, . . ..

