Robert M. Freund
March, 2014
© 2014 Massachusetts Institute of Technology. All rights reserved.
1 Newton’s Method
We consider the problem of minimizing f(x), x ∈ Rⁿ. At a given point x̄, f(·) can be approximated by its quadratic expansion:

h(x) := f(x̄) + ∇f(x̄)ᵀ(x − x̄) + ½(x − x̄)ᵀH(x̄)(x − x̄) .

Here ∇f(x) is the gradient of f(x) and H(x) is the Hessian of f(x) at x.

Notice that h(·) is a quadratic function. We can attempt to minimize h(·) by computing a point x for which ∇h(x) = 0. Since the gradient of h(x) is:

∇h(x) = ∇f(x̄) + H(x̄)(x − x̄) ,

we therefore are motivated to solve:

x = x̄ − H(x̄)⁻¹∇f(x̄) .
The direction −H(x̄)⁻¹∇f(x̄) is called the Newton direction, or the Newton step, at x̄. If we compute the next iterate as

x̄N := x̄ − H(x̄)⁻¹∇f(x̄) ,

then we call x̄N the Newton iterate from the point x̄. If we compute all
iterations this way, we call the algorithm Newton’s Method, whose formal
description is presented in Algorithm 1.
Notice that Newton’s Method presumes that H(xk ) is nonsingular at
each iteration.
Algorithm 1 Newton’s Method
Initialize at x0 , and set k ← 0 .
At iteration k :
1. dk := −H(xk )−1 ∇f (xk ). If dk = 0, then stop.
2. Choose step-size αk = 1 .
3. Set xk+1 ← xk + αk dk , k ← k + 1 .
Also notice that there is not necessarily any guarantee of descent in the
algorithm. That is, it is unclear without further assumptions whether or
not it will hold that f (xk+1 ) ≤ f (xk ).
Note that Step 2 could be augmented by a line-search of f (xk + αdk )
to find a good (or even optimal) value of the step-size parameter α with
respect to descent of the function f (·).
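As a concrete illustration, Algorithm 1 can be sketched in a few lines of Python; the function names (newtons_method, grad, hess) and the quadratic test problem are illustrative choices, not from the text:

```python
import numpy as np

def newtons_method(grad, hess, x0, tol=1e-10, max_iter=50):
    """Algorithm 1 with the full step alpha_k = 1.

    grad(x) returns the gradient of f at x; hess(x) returns the Hessian.
    """
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        # Newton direction: solve H(x_k) d = -grad f(x_k) rather than
        # forming the inverse explicitly.
        d = np.linalg.solve(hess(x), -grad(x))
        if np.linalg.norm(d) < tol:  # d_k = 0 (numerically): stop
            break
        x = x + d                    # step-size alpha_k = 1
    return x

# Example: minimize f(x) = (x1 - 1)^2 + (x2 + 2)^2, minimizer (1, -2).
# Newton's method solves a quadratic in a single step.
grad = lambda x: 2.0 * (x - np.array([1.0, -2.0]))
hess = lambda x: 2.0 * np.eye(2)
print(newtons_method(grad, hess, [5.0, 5.0]))  # -> [ 1. -2.]
```

For a quadratic objective the model h(·) is exact, which is why one iteration suffices in this example.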
Recall that a matrix M is called SPD if M is symmetric and positive definite. The following proposition shows that the Newton direction is a descent direction if H(x) is SPD.

Proposition If H(x) is SPD and d := −H(x)⁻¹∇f(x) ≠ 0, then d is a descent direction, i.e., ∇f(x)ᵀd < 0.

Proof: It is sufficient to show that ∇f(x)ᵀd = −∇f(x)ᵀH(x)⁻¹∇f(x) < 0. This will clearly be the case if H(x)⁻¹ is SPD. Since H(x) is SPD, if v ≠ 0 then v = H(x)w for some w ≠ 0, whereby vᵀH(x)⁻¹v = wᵀH(x)w > 0, and so H(x)⁻¹ is SPD.
1.1 Rates of Convergence
A sequence of numbers {sᵢ} exhibits linear convergence if limᵢ→∞ sᵢ = s̄ and

lim_{i→∞} |s_{i+1} − s̄| / |sᵢ − s̄| = δ < 1 .    (1)
An example of a linearly convergent sequence is:
sᵢ = (1/10)ⁱ , i = 1, 2, . . . .
Here we have s1 = 0.1, s2 = 0.01, s3 = 0.001, etc., and s̄ = 0. Notice that
|s_{i+1} − s̄| / |sᵢ − s̄| = 0.1 ,
so this sequence exhibits linear convergence with δ = 0.1 .
A sequence of numbers {si } exhibits superlinear convergence if the se-
quence exhibits linear convergence and δ = 0 in (1).
An example of a superlinearly convergent sequence is:
sᵢ = 1/i! , i = 1, 2, . . . .
Here we have s₁ = 1, s₂ = 1/2, s₃ = 1/6, s₄ = 1/24, s₅ = 1/120, etc., and s̄ = 0.
Notice that
lim_{i→∞} |s_{i+1} − s̄| / |sᵢ − s̄| = lim_{i→∞} 1/(i + 1) = 0 ,
so this sequence exhibits superlinear convergence.
A sequence of numbers {sᵢ} exhibits quadratic convergence if limᵢ→∞ sᵢ = s̄ and

lim_{i→∞} |s_{i+1} − s̄| / |sᵢ − s̄|² = δ < ∞ .
An example of a quadratically convergent sequence is:
sᵢ = (1/10)^(2^(i−1)) , i = 1, 2, . . . .
Here we have s1 = 0.1, s2 = 0.01, s3 = 0.0001, s4 = 0.00000001, etc., and
s̄ = 0. Notice that
lim_{i→∞} |s_{i+1} − s̄| / |sᵢ − s̄|² = lim_{i→∞} (10^(−2ⁱ)) / (10^(−2^(i−1)))² = 1 ,
so this sequence exhibits quadratic convergence.
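The three example sequences above can be checked numerically; this short sketch (variable names are my own) computes the successive ratios that appear in the definitions:

```python
from math import factorial

# The three example sequences, with sbar = 0 in each case.
linear = [0.1 ** i for i in range(1, 8)]                  # s_i = (1/10)^i
superlin = [1.0 / factorial(i) for i in range(1, 8)]      # s_i = 1/i!
quadratic = [0.1 ** (2 ** (i - 1)) for i in range(1, 5)]  # s_i = (1/10)^(2^(i-1))

# Ratios |s_{i+1} - sbar| / |s_i - sbar| (and / |s_i - sbar|^2 for the
# quadratic case).
lin_ratios = [b / a for a, b in zip(linear, linear[1:])]
sup_ratios = [b / a for a, b in zip(superlin, superlin[1:])]
quad_ratios = [b / a ** 2 for a, b in zip(quadratic, quadratic[1:])]

print(lin_ratios)   # all approximately 0.1 (linear, delta = 0.1)
print(sup_ratios)   # 1/(i+1): 0.5, 0.333..., tending to 0 (superlinear)
print(quad_ratios)  # all approximately 1.0 (quadratic, delta = 1)
```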
1.2 Quadratic Convergence of Newton’s Method
In this section we will show the quadratic convergence of Newton’s method.
To develop this result, we will need to work with the operator norm of a
matrix M , which is defined as follows:
∥M∥ := max_x {∥M x∥ | ∥x∥ = 1} .
Theorem 1.1 (Quadratic Convergence of Newton's Method) Suppose f(x) is twice continuously differentiable and x∗ is a point for which ∇f(x∗) = 0. Suppose that H(x) satisfies the following conditions:

• there exists a scalar h > 0 for which ∥H(x∗)⁻¹∥ ≤ 1/h ;

• there exist scalars β > 0 and L > 0 for which ∥H(x) − H(y)∥ ≤ L∥x − y∥ for all x, y ∈ B(x∗, β) := {w : ∥w − x∗∥ ≤ β} .

Let x satisfy ∥x − x∗∥ < γ := min{β, 2h/(3L)}, and let xN := x − H(x)⁻¹∇f(x). Then:

(i) ∥xN − x∗∥ ≤ ∥x − x∗∥² · L/(2(h − L∥x − x∗∥)) ≤ ∥x − x∗∥² · 3L/(2h) ;

(ii) ∥xN − x∗∥ < ∥x − x∗∥ < γ .
Table 1 shows the value of the Newton iterates x1 , x2 , x3 , . . ., for four dif-
ferent starting points, namely x0 = 1.0, 0.0, 0.1, and 0.01. Notice that the
Newton iterates diverge to −∞ for the starting point x0 = 1.0. For the
starting point x0 = 0.0 the Newton iterates are all zero. For the starting
point x0 = 0.1, the Newton iterates converge to 1/7 = 0.142857143, which
is the solution to the problem. Notice also that starting with the second
iteration, the number of accurate decimal digits doubles at each iteration.
This doubling of the accurate digits is the essence of quadratic convergence.
For the starting point x0 = 0.01, the Newton iterates also converge to the
solution 1/7 = 0.142857143. For this starting point, we see the doubling of
the number of accurate decimal digits starting with the fifth iteration.
It turns out that there is deep theory from which it follows that for this
example Newton’s method will converge quadratically from any starting
point in the range x0 ∈ (0 , 2/7) = (0.0 , 0.2857143).
Table 1: Values of the Newton iterates for f (x) = 7x−ln(x) for four different
starting points.
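For f(x) = 7x − ln(x) the Newton update has the closed form x − f′(x)/f″(x) = x − (7 − 1/x)/(1/x²) = 2x − 7x², so the behavior reported in Table 1 is easy to reproduce; a minimal sketch, with illustrative names:

```python
# Newton iterates for f(x) = 7x - ln(x): f'(x) = 7 - 1/x and
# f''(x) = 1/x^2, so the Newton update simplifies to
# x_{k+1} = x_k - f'(x_k)/f''(x_k) = 2*x_k - 7*x_k**2.
def newton_iterates(x0, n=8):
    xs = [x0]
    for _ in range(n):
        x = xs[-1]
        xs.append(2 * x - 7 * x ** 2)
    return xs

for x0 in (1.0, 0.0, 0.1, 0.01):  # the four starting points of Table 1
    print(x0, newton_iterates(x0)[-1])
# x0 = 1.0 diverges toward -infinity; x0 = 0.0 stays at 0;
# x0 = 0.1 and x0 = 0.01 converge to 1/7 = 0.142857...
```

The error recurrence e_{k+1} = 7 e_k² is visible directly in the simplified update, which is exactly the doubling of accurate digits described above.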
For the function f(x) = −ln(1 − x₁ − x₂) − ln(x₁) − ln(x₂), the gradient ∇f(x) and Hessian H(x) are easily derived to be:

∇f(x) = ( 1/(1−x₁−x₂) − 1/x₁ , 1/(1−x₁−x₂) − 1/x₂ )ᵀ

and

H(x) = [ (1/(1−x₁−x₂))² + (1/x₁)²      (1/(1−x₁−x₂))²
         (1/(1−x₁−x₂))²                (1/(1−x₁−x₂))² + (1/x₂)² ] .
Furthermore, it is simple to verify that x∗ = (1/3, 1/3) is the optimal solution
since ∇f(x∗) = 0, and that f(x∗) = 3.295836866. Table 2 shows the iterate
values of Newton's method applied to this problem starting at (x₁⁰, x₂⁰) =
(0.85, 0.05). Here we see the doubling of the accurate decimal digits for x₁
and x₂ starting around the fourth iteration, and the doubling of the number
of leading zeros in the distance ∥xᵏ − x∗∥ as well.
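The iterations of Table 2 can be reproduced with a few lines of Python; note that the explicit form of the objective used here, f(x) = −ln(1 − x₁ − x₂) − ln(x₁) − ln(x₂), is an inference consistent with the stated gradient, Hessian, and optimal value f(x∗) = 3 ln 3 = 3.295836866:

```python
import numpy as np

# Gradient and Hessian of f(x) = -ln(1 - x1 - x2) - ln(x1) - ln(x2)
# (the objective form is inferred from the text, as noted above).
def grad(x):
    r = 1.0 / (1.0 - x[0] - x[1])
    return np.array([r - 1.0 / x[0], r - 1.0 / x[1]])

def hess(x):
    r2 = 1.0 / (1.0 - x[0] - x[1]) ** 2
    return np.array([[r2 + 1.0 / x[0] ** 2, r2],
                     [r2, r2 + 1.0 / x[1] ** 2]])

x = np.array([0.85, 0.05])  # the starting point used for Table 2
for k in range(10):
    x = x + np.linalg.solve(hess(x), -grad(x))  # full Newton step
print(x)  # converges to x* = (1/3, 1/3)
```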
Furthermore, it holds from part (i) of Theorem 1.1 that in the limit we
obtain ∥xN − x∗∥/∥x − x∗∥² ≤ L/(2h). Of course, in practice one typically never knows the
values of β, h, or L. Instead one implements the method and hopes that
the starting point x0 will yield a quadratically convergent sequence. There
is an exceptional family of functions, called self-concordant functions, for
which we can easily compute whether a given starting point is in the range of
quadratic convergence. This family of functions will be used later on in the
construction of specialized algorithms for convex optimization in general.
Notice that the constants β, h, and L in Theorem 1.1 depend on the
choice of norm used. However, Newton’s method itself is invariant with
regard to the norm. Continuing this theme, it is easy to show that Newton’s
method is invariant under invertible linear transformation (see Exercise 7),
but under a linear transformation the values of β, h, and L would change.
This reveals an incompleteness in the theory of Newton’s method as stated
in Theorem 1.1.
Furthermore, notice that Theorem 1.1 does not assume convexity – in-
deed it only assumes that H(x∗ ) is nonsingular and not badly behaved
near x∗. One can view Newton’s method as trying to solve the equation
“∇f(x) = 0” by successively approximating ∇f(x) by its local linear approximation
∇f(x) ≈ ∇f(xᵏ) + H(xᵏ)(x − xᵏ) and solving the resulting linear system.
A common modification of Newton’s method replaces H(xᵏ) at each iteration by the matrix

H(xᵏ) + εI , (2)

for some (hopefully) small value of ε. In fact, one can think of the steepest-descent
method as a limiting case of the above with ε → +∞ in (2), which
effectively causes the Newton step to be close to −(1/ε)∇f(xᵏ).
The work per iteration of Newton’s method is the cost of solving the
equation system
H(xk )(x − xk ) = −∇f (xk ) ,
which is O(n3 ) operations. When n becomes very large (say, n = 106 ), this
becomes prohibitively expensive. To partially overcome this, optimizers have
developed so-called “quasi-Newton methods” that use low-rank approximate
updates to H(xᵏ)⁻¹ at each iteration in an attempt to do less work per
iteration.
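The text does not name a specific quasi-Newton scheme; the BFGS update of the inverse-Hessian approximation is one standard instance of such a low-rank update, sketched here with illustrative names:

```python
import numpy as np

# Sketch of a low-rank inverse-Hessian update used by quasi-Newton
# methods; this is the BFGS update, one standard choice (the text does
# not specify a particular scheme).
def bfgs_update(Binv, s, y):
    """Rank-two update of the inverse-Hessian approximation Binv, where
    s = x_{k+1} - x_k and y = grad f(x_{k+1}) - grad f(x_k).
    Costs O(n^2) per iteration versus O(n^3) for solving the Newton
    system exactly."""
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    return ((I - rho * np.outer(s, y)) @ Binv @ (I - rho * np.outer(y, s))
            + rho * np.outer(s, s))

# On a quadratic with Hessian A we have y = A s exactly.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
Binv = np.eye(2)
s = np.array([0.0, 1.0])
Binv = bfgs_update(Binv, s, A @ s)
# The secant condition Binv (A s) = s holds after the update.
print(Binv @ (A @ s))  # -> [0. 1.]
```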
Proposition 2.1 Suppose that M is a symmetric matrix. Then the following are equivalent: (a) h > 0 satisfies ∥M⁻¹∥ ≤ 1/h, and (b) h > 0 satisfies ∥Mv∥ ≥ h∥v∥ for all vectors v.

You are asked to prove this proposition in Exercise 1 at the end of this lecture note.
Proposition 2.2 Suppose that f (·) is twice differentiable. Then
∇f(z) − ∇f(x) = ∫₀¹ H(x + t(z − x)) (z − x) dt .
Proof: Let ϕ(t) := ∇f(x + t(z − x)). Then ϕ(0) = ∇f(x) and ϕ(1) = ∇f(z),
and ϕ′(t) = H(x + t(z − x))(z − x). From the fundamental theorem of
calculus, we have:

∇f(z) − ∇f(x) = ϕ(1) − ϕ(0) = ∫₀¹ ϕ′(t) dt = ∫₀¹ H(x + t(z − x)) (z − x) dt .
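Proposition 2.2 is easy to check numerically on a small example; the test function below is my own choice, not from the text:

```python
import numpy as np

# Numerical check of Proposition 2.2 for f(x) = x1^2 * x2 + x2^3
# (an illustrative test function), approximating the integral of
# H(x + t(z - x))(z - x) over t in [0, 1] by the trapezoid rule.
def gradf(x):
    return np.array([2 * x[0] * x[1], x[0] ** 2 + 3 * x[1] ** 2])

def hessf(x):
    return np.array([[2 * x[1], 2 * x[0]],
                     [2 * x[0], 6 * x[1]]])

x = np.array([1.0, 2.0])
z = np.array([-0.5, 3.0])

ts = np.linspace(0.0, 1.0, 2001)
vals = np.array([hessf(x + t * (z - x)) @ (z - x) for t in ts])
integral = ((vals[:-1] + vals[1:]) / 2.0).sum(axis=0) * (ts[1] - ts[0])

print(integral)             # matches gradf(z) - gradf(x)
print(gradf(z) - gradf(x))
```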
Proof of Theorem 1.1: We have:

xN − x∗ = x − H(x)⁻¹∇f(x) − x∗
        = x − x∗ + H(x)⁻¹(∇f(x∗) − ∇f(x))   (since ∇f(x∗) = 0)
        = x − x∗ + H(x)⁻¹ ∫₀¹ H(x + t(x∗ − x))(x∗ − x) dt   (by Proposition 2.2)
        = H(x)⁻¹ ∫₀¹ [H(x + t(x∗ − x)) − H(x)](x∗ − x) dt .

Taking norms and using the Lipschitz condition on H(·) in B(x∗, β):

∥xN − x∗∥ ≤ ∥H(x)⁻¹∥ ∫₀¹ ∥H(x + t(x∗ − x)) − H(x)∥ ∥x∗ − x∥ dt
          ≤ ∥H(x)⁻¹∥ ∥x − x∗∥² L ∫₀¹ t dt
          = ∥x − x∗∥² ∥H(x)⁻¹∥ L / 2 .

It remains to bound ∥H(x)⁻¹∥. For any v ≠ 0, from Proposition 2.1 and the Lipschitz condition:

∥H(x)v∥ ≥ ∥H(x∗)v∥ − ∥(H(x) − H(x∗))v∥ ≥ h∥v∥ − L∥x − x∗∥∥v∥ = (h − L∥x − x∗∥) · ∥v∥ .

Invoking Proposition 2.1 again, we see that this implies that

∥H(x)⁻¹∥ ≤ 1 / (h − L∥x − x∗∥) .
Combining this with the above yields

∥xN − x∗∥ ≤ L/(2(h − L∥x − x∗∥)) · ∥x − x∗∥² ,

which is the first inequality in (i) of the theorem. Noting also that L∥x − x∗∥ ≤ 2h/3, it then follows that:

∥xN − x∗∥ ≤ L/(2(h − L∥x − x∗∥)) · ∥x − x∗∥² ≤ L/(2(h − 2h/3)) · ∥x − x∗∥² = 3L/(2h) · ∥x − x∗∥² ,

which is the second inequality in (i). Next, since ∥x − x∗∥ < 2h/(3L), we can write ∥x − x∗∥ = (1 − ε) · 2h/(3L) for some ε ∈ (0, 1], whereby

∥xN − x∗∥ ≤ 3L∥x − x∗∥/(2h) · ∥x − x∗∥ ≤ (1 − ε)∥x − x∗∥ < ∥x − x∗∥ ,

which establishes (ii) of the theorem, and also shows that ∥xN − x∗∥ < γ, so the argument can be applied inductively to every subsequent Newton iterate.
Consider Newton's method applied to the function

f(x) = 9x − 4 ln(x − 7) ,

whose domain is X = {x ∈ R : x > 7}.
(a) Give an exact formula for the Newton iterate for a given value of
x.
(b) Using a calculator (or a computer, if you wish), compute five
iterations of Newton’s method starting at each of the following
points, and record your answers:
• x = 7.40
• x = 7.20
• x = 7.01
• x = 7.80
• x = 7.88
(c) Verify empirically that Newton’s method will converge to the op-
timal solution for all starting values of x in the range (7, 7.8888).
What behavior does Newton’s method exhibit outside of this
range?
(a) Give an exact formula for the Newton iterate for a given value of
x.
(b) Using a calculator (or a computer, if you wish), compute five
iterations of Newton’s method starting at each of the following
points, and record your answers:
• x = 2.60
• x = 2.70
• x = 2.40
• x = 2.80
• x = 3.00
(c) Verify empirically that Newton’s method will converge to the op-
timal solution for all starting values of x in the range (2, 3.05).
What behavior does Newton’s method exhibit outside of this
range?
f (x1 , x2 ) = −9x1 −10x2 +θ(− ln(100−x1 −x2 )−ln(x1 )−ln(x2 )−ln(50−x1 +x2 )),
Implement Newton's method for minimizing this function, first without
a line-search and then with a line-search. Note that your implementation should
include code that will check whether or not the current iterate value
is in the domain X := {(x₁, x₂) : x₁ > 0, x₂ > 0, x₁ + x₂ < 100, x₁ − x₂ < 50},
and which will stop the current run and report
“Failure at iteration ” if the current iterate is not in the domain X.
Run your algorithm for θ = 10 and for θ = 100, using the following
starting points.
• x0 = (8, 90)
• x0 = (1, 40)
• x0 = (15, 68.69)
• x0 = (10, 20)
(a) When you run Newton’s method without a line-search for this
problem and with these starting points, what behavior do you observe?
(b) When you run Newton’s method with a line-search for this prob-
lem, what behavior do you observe?
g1(x) = 0
⋮
gn(x) = 0 ,

which we conveniently write as

g(x) = 0 .
Let J(x) denote the Jacobian matrix (of partial derivatives) of g(x).
Then at x = x̄ we have the linear approximation g(x) ≈ g(x̄) + J(x̄)(x − x̄).
6. Suppose that f (x) is a strictly convex twice-continuously differentiable
function whose Hessian H(x) is nonsingular for all x, and consider
Newton’s method with a line-search. Given x̄, we compute the Newton
direction d = −[H(x̄)]−1 ∇f (x̄) and the next iterate x̃ is chosen to
satisfy:
x̃ := arg min_α f(x̄ + αd) .