
Nonlinear Programming

2nd Edition
Solutions Manual
Dimitri P. Bertsekas
Massachusetts Institute of Technology
Athena Scientific, Belmont, Massachusetts
NOTE
This solutions manual is continuously updated and improved. Portions of the manual, involving
primarily theoretical exercises, have been posted on the internet at the book’s www page
http://www.athenasc.com/nonlinbook.html
Many thanks are due to several people who have contributed solutions, and particularly to Angelia
Nedic, Asuman Ozdaglar, and Cynara Wu.
Last Updated: May 2005
Solutions Chapter 1
SECTION 1.1
1.1.9 www
For any $x, y \in \Re^n$, from the second order expansion (see Appendix A, Proposition A.23) we have
$$f(y) - f(x) = (y-x)'\nabla f(x) + \tfrac{1}{2}(y-x)'\nabla^2 f(z)(y-x), \qquad (1)$$
where $z$ is some point of the line segment joining $x$ and $y$. Setting $x = 0$ in (1) and using the given property of $f$, it can be seen that $f$ is coercive. Therefore, there exists $x^* \in \Re^n$ such that $f(x^*) = \inf_{x\in\Re^n} f(x)$ (see Proposition A.8 in Appendix A). The condition
$$m\|y\|^2 \le y'\nabla^2 f(x)y, \qquad \forall\ x, y \in \Re^n,$$
is equivalent to strong convexity of $f$. Strong convexity guarantees that there is a unique global minimum $x^*$. By using the given property of $f$ and the expansion (1), we obtain
$$(y-x)'\nabla f(x) + \frac{m}{2}\|y-x\|^2 \le f(y) - f(x) \le (y-x)'\nabla f(x) + \frac{M}{2}\|y-x\|^2.$$
Taking the minimum over $y \in \Re^n$ in the expression above gives
$$\min_{y\in\Re^n}\Big\{(y-x)'\nabla f(x) + \frac{m}{2}\|y-x\|^2\Big\} \le f(x^*) - f(x) \le \min_{y\in\Re^n}\Big\{(y-x)'\nabla f(x) + \frac{M}{2}\|y-x\|^2\Big\}.$$
Note that for any $a > 0$,
$$\min_{y\in\Re^n}\Big\{(y-x)'\nabla f(x) + \frac{a}{2}\|y-x\|^2\Big\} = -\frac{1}{2a}\|\nabla f(x)\|^2,$$
and the minimum is attained for $y = x - \nabla f(x)/a$. Using this relation for $a = m$ and $a = M$, we obtain
$$-\frac{1}{2m}\|\nabla f(x)\|^2 \le f(x^*) - f(x) \le -\frac{1}{2M}\|\nabla f(x)\|^2.$$
The first chain of inequalities follows from here. To show the second relation, use the expansion (1) at the point $x = x^*$, and note that $\nabla f(x^*) = 0$, so that
$$f(y) - f(x^*) = \tfrac{1}{2}(y-x^*)'\nabla^2 f(z)(y-x^*).$$
The rest follows immediately from here and the given property of the function $f$.
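The two-sided bound above is easy to sanity-check numerically. The following sketch (our own illustration, not part of the manual) verifies it on a strongly convex quadratic $f(x) = \frac{1}{2}x'Qx$, whose minimum is $x^* = 0$ and for which $m$ and $M$ are the extreme eigenvalues of $Q$:

```python
import numpy as np

# Check -||grad f(x)||^2/(2m) <= f(x*) - f(x) <= -||grad f(x)||^2/(2M)
# for f(x) = (1/2) x'Qx with positive definite Q, so that x* = 0.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
Q = A @ A.T + np.eye(4)                  # positive definite Hessian
m, M = np.linalg.eigvalsh(Q)[[0, -1]]    # smallest/largest eigenvalues

for _ in range(5):
    x = rng.standard_normal(4)
    f_gap = -0.5 * x @ Q @ x             # f(x*) - f(x), since f(x*) = 0
    g2 = np.linalg.norm(Q @ x) ** 2
    assert -g2 / (2 * m) <= f_gap <= -g2 / (2 * M) + 1e-12
print("bounds hold; m =", round(m, 3), ", M =", round(M, 3))
```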
1.1.11 www
Since $x^*$ is a nonsingular strict local minimum, we have that $\nabla^2 f(x^*) > 0$. The function $f$ is twice continuously differentiable over $\Re^n$, so that there exists a scalar $\delta > 0$ such that
$$\nabla^2 f(x) > 0, \qquad \forall\ x \text{ with } \|x - x^*\| \le \delta.$$
This means that the function $f$ is strictly convex over the open sphere $B(x^*, \delta)$ centered at $x^*$ with radius $\delta$. Then according to Proposition 1.1.2, $x^*$ is the only stationary point of $f$ in the sphere $B(x^*, \delta)$.

If $f$ is not twice continuously differentiable, then $x^*$ need not be an isolated stationary point. The example function $f$ does not have a second derivative at $x = 0$. Note that $f(x) > 0$ for $x \ne 0$, and by definition $f(0) = 0$. Hence, $x^* = 0$ is the unique (singular) global minimum. The first derivative of $f(x)$ for $x \ne 0$ can be calculated as follows:
$$\begin{aligned}
f'(x) &= 2x\left[\sqrt{2} - \sin\Big(\frac{5\pi}{6} - \sqrt{3}\ln(x^2)\Big) + \sqrt{3}\cos\Big(\frac{5\pi}{6} - \sqrt{3}\ln(x^2)\Big)\right]\\
&= 2x\left[\sqrt{2} - 2\cos\frac{\pi}{3}\sin\Big(\frac{5\pi}{6} - \sqrt{3}\ln(x^2)\Big) + 2\sin\frac{\pi}{3}\cos\Big(\frac{5\pi}{6} - \sqrt{3}\ln(x^2)\Big)\right]\\
&= 2x\left[\sqrt{2} + 2\sin\Big(\frac{\pi}{3} - \frac{5\pi}{6} + \sqrt{3}\ln(x^2)\Big)\right]\\
&= 2x\left[\sqrt{2} - 2\cos\big(2\sqrt{3}\ln x\big)\right].
\end{aligned}$$
Solving $f'(x) = 0$ gives $x_k = e^{(1-8k)\pi/(8\sqrt{3})}$ and $y_k = e^{-(1+8k)\pi/(8\sqrt{3})}$ for $k$ integer. The second derivative of $f(x)$, for $x \ne 0$, is given by
$$f''(x) = 2\left[\sqrt{2} - 2\cos\big(2\sqrt{3}\ln x\big) + 4\sqrt{3}\sin\big(2\sqrt{3}\ln x\big)\right].$$
Thus:
$$f''(x_k) = 2\left[\sqrt{2} - 2\cos\frac{\pi}{4} + 4\sqrt{3}\sin\frac{\pi}{4}\right] = 2\left[\sqrt{2} - 2\frac{\sqrt{2}}{2} + 4\sqrt{3}\frac{\sqrt{2}}{2}\right] = 4\sqrt{6}.$$
Similarly,
$$f''(y_k) = 2\left[\sqrt{2} - 2\cos\Big(-\frac{\pi}{4}\Big) + 4\sqrt{3}\sin\Big(-\frac{\pi}{4}\Big)\right] = 2\left[\sqrt{2} - 2\frac{\sqrt{2}}{2} - 4\sqrt{3}\frac{\sqrt{2}}{2}\right] = -4\sqrt{6}.$$
Hence, $\{x_k \mid k \ge 0\}$ is a sequence of nonsingular local minima, which evidently converges to $x^*$, while $\{y_k \mid k \ge 0\}$ is a sequence of nonsingular local maxima converging to $x^*$.
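The stationary points and second derivatives just computed can be checked numerically from the derived closed forms. A short sketch (our own check, not part of the manual):

```python
import numpy as np

# Evaluate the derived formulas
#   f'(x)  = 2x(sqrt(2) - 2cos(2*sqrt(3)*ln x))
#   f''(x) = 2(sqrt(2) - 2cos(2*sqrt(3)*ln x) + 4*sqrt(3)*sin(2*sqrt(3)*ln x))
# at the claimed stationary points x_k and y_k.
s3 = np.sqrt(3.0)
df  = lambda x: 2 * x * (np.sqrt(2) - 2 * np.cos(2 * s3 * np.log(x)))
d2f = lambda x: 2 * (np.sqrt(2) - 2 * np.cos(2 * s3 * np.log(x))
                     + 4 * s3 * np.sin(2 * s3 * np.log(x)))
for k in range(4):
    xk = np.exp((1 - 8 * k) * np.pi / (8 * s3))
    yk = np.exp(-(1 + 8 * k) * np.pi / (8 * s3))
    assert abs(df(xk)) < 1e-12 and abs(df(yk)) < 1e-12
    print(f"k={k}: f''(x_k)={d2f(xk):+.6f}  f''(y_k)={d2f(yk):+.6f}")
# prints +4*sqrt(6) ~ +9.798 at every x_k and -9.798 at every y_k, as claimed
```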
1.1.12 www
(a) Let $x^*$ be a strict local minimum of $f$. Then there is $\delta$ such that $f(x^*) < f(x)$ for all $x$ in the closed sphere centered at $x^*$ with radius $\delta$. Take any local sequence $\{x^k\}$ that minimizes $f$, i.e., $\|x^k - x^*\| \le \delta$ and $\lim_{k\to\infty} f(x^k) = f(x^*)$. Then there is a subsequence $\{x^{k_i}\}$ and a point $\bar x$ such that $x^{k_i} \to \bar x$ and $\|\bar x - x^*\| \le \delta$. By continuity of $f$, we have
$$f(\bar x) = \lim_{i\to\infty} f(x^{k_i}) = f(x^*).$$
Since $x^*$ is a strict local minimum, it follows that $\bar x = x^*$. This is true for any convergent subsequence of $\{x^k\}$, therefore $\{x^k\}$ converges to $x^*$, which means that $x^*$ is locally stable. Next we will show that for a continuous function $f$ every locally stable local minimum must be strict. Assume that this is not true, i.e., there is a local minimum $x^*$ which is locally stable but is not strict. Then for any $\theta > 0$ there is a point $x_\theta \ne x^*$ such that
$$0 < \|x_\theta - x^*\| < \theta \quad \text{and} \quad f(x_\theta) = f(x^*). \qquad (1)$$
Since $x^*$ is a stable local minimum, there is a $\delta > 0$ such that $x^k \to x^*$ for all $\{x^k\}$ with
$$\lim_{k\to\infty} f(x^k) = f(x^*) \quad \text{and} \quad \|x^k - x^*\| < \delta. \qquad (2)$$
For $\theta = \delta$ in (1), we can find a point $x^0 \ne x^*$ for which $0 < \|x^0 - x^*\| < \delta$ and $f(x^0) = f(x^*)$. Then, for $\theta = \frac{1}{2}\|x^0 - x^*\|$ in (1), we can find a point $x^1$ such that $0 < \|x^1 - x^*\| < \frac{1}{2}\|x^0 - x^*\|$ and $f(x^1) = f(x^*)$. Then, again, for $\theta = \frac{1}{2}\|x^1 - x^*\|$ in (1), we can find a point $x^2$ such that $0 < \|x^2 - x^*\| < \frac{1}{2}\|x^1 - x^*\|$ and $f(x^2) = f(x^*)$, and so on. In this way, we have constructed a sequence $\{x^k\}$ of distinct points such that $0 < \|x^k - x^*\| < \delta$, $f(x^k) = f(x^*)$ for all $k$, and $\lim_{k\to\infty} x^k = x^*$. Now, consider the sequence $\{y^k\}$ defined by
$$y^{2m} = x^m, \qquad y^{2m+1} = x^0, \qquad \forall\ m \ge 0.$$
Evidently, the sequence $\{y^k\}$ is contained in the sphere centered at $x^*$ with the radius $\delta$. Also we have that $f(y^k) = f(x^*)$, but $\{y^k\}$ does not converge to $x^*$. This contradicts the assumption that $x^*$ is locally stable. Hence, $x^*$ must be a strict local minimum.

(b) Since $x^*$ is a strict local minimum, we can find $\delta > 0$ such that $f(x) > f(x^*)$ for all $x \ne x^*$ with $\|x - x^*\| \le \delta$. Then $\min_{\|x-x^*\|=\delta} f(x) = f_\delta > f(x^*)$. Let $G_\delta = \max_{\|x-x^*\|\le\delta} |g(x)|$. Now, we have
$$f(x) - \epsilon G_\delta \le f(x) + \epsilon g(x) \le f(x) + \epsilon G_\delta, \qquad \forall\ \epsilon > 0,\ \forall\ x,\ \|x - x^*\| < \delta.$$
Choose $\epsilon_\delta$ such that
$$f_\delta - \epsilon_\delta G_\delta > f(x^*) + \epsilon_\delta G_\delta,$$
and notice that for all $0 \le \epsilon \le \epsilon_\delta$ we have
$$f_\delta - \epsilon G_\delta > f(x^*) + \epsilon G_\delta.$$
Consider the level sets
$$L(\epsilon) = \big\{x \mid f(x) + \epsilon g(x) \le f(x^*) + \epsilon G_\delta,\ \|x - x^*\| \le \delta\big\}, \qquad 0 \le \epsilon \le \epsilon_\delta.$$
Note that
$$L(\epsilon_1) \subset L(\epsilon_2) \subset B(x^*, \delta), \qquad \forall\ 0 \le \epsilon_1 < \epsilon_2 \le \epsilon_\delta, \qquad (3)$$
where $B(x^*, \delta)$ is the open sphere centered at $x^*$ with radius $\delta$. The relation (3) means that the family $\{L(\epsilon)\}$ decreases as $\epsilon$ decreases. Observe that for any $\epsilon \ge 0$, the level set $L(\epsilon)$ is compact. Since $x^*$ is strictly better than any other point $x \in B(x^*, \delta)$, and $x^* \in L(\epsilon)$ for all $0 \le \epsilon \le \epsilon_\delta$, we have
$$\bigcap_{0\le\epsilon\le\epsilon_\delta} L(\epsilon) = \{x^*\}. \qquad (4)$$
According to Weierstrass' theorem, the continuous function $f(x) + \epsilon g(x)$ attains its minimum on the compact set $L(\epsilon)$ at some point $x_\epsilon \in L(\epsilon)$. From (3) it follows that $x_\epsilon \in B(x^*, \delta)$ for any $\epsilon$ in the range $[0, \epsilon_\delta]$. Finally, since $x_\epsilon \in L(\epsilon)$, from (4) we see that $\lim_{\epsilon\downarrow 0} x_\epsilon = x^*$.
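Part (b) is the basic stability fact behind penalty-type perturbations, and it can be visualized with a one-dimensional grid search. In the sketch below (our own example, not from the manual), $f(x) = x^2$ has the strict minimum $x^* = 0$ and $g(x) = \sin 5x$ is a bounded perturbation; the minimizers of $f + \epsilon g$ over the ball drift to $x^*$ as $\epsilon \downarrow 0$:

```python
import numpy as np

# Minimize f(x) + eps*g(x) over the ball B(x*, 1) by grid search and
# watch the minimizers x_eps approach x* = 0 as eps decreases.
xs = np.linspace(-1.0, 1.0, 200001)
f, g = xs ** 2, np.sin(5 * xs)
for eps in (0.5, 0.1, 0.01, 0.001):
    x_eps = xs[np.argmin(f + eps * g)]
    print(f"eps = {eps:<6}  minimizer of f + eps*g ~ {x_eps:+.4f}")
```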
1.1.13 www
In the solution to Exercise 1.1.12 we found numbers $\delta > 0$ and $\epsilon_\delta > 0$ such that for all $\epsilon \in [0, \epsilon_\delta)$ the function $f(x) + \epsilon g(x)$ has a local minimum $x_\epsilon$ within the sphere $B(x^*, \delta) = \{x \mid \|x - x^*\| < \delta\}$. The Implicit Function Theorem can be applied to the continuously differentiable function $G(\epsilon, x) = \nabla f(x) + \epsilon\nabla g(x)$, for which $G(0, x^*) = 0$. Thus, there are an interval $[0, \epsilon_0)$, a number $\delta_0$ and a continuously differentiable function $\phi : [0, \epsilon_0) \to B(x^*, \delta_0)$ such that $\phi(\epsilon) = x_\epsilon$ and
$$\nabla\phi(\epsilon) = -\nabla_\epsilon G\big(\epsilon, \phi(\epsilon)\big)\Big(\nabla_x G\big(\epsilon, \phi(\epsilon)\big)\Big)^{-1}, \qquad \forall\ \epsilon \in [0, \epsilon_0).$$
We may assume that $\epsilon_0$ is small enough so that the first order expansion for $\phi(\epsilon)$ at $\epsilon = 0$ holds, namely
$$\phi(\epsilon) = \phi(0) + \epsilon\nabla\phi(0) + o(\epsilon), \qquad \forall\ \epsilon \in [0, \epsilon_0). \qquad (1)$$
It can be seen that $\nabla_x G(0, \phi(0)) = \nabla_x G(0, x^*) = \nabla^2 f(x^*)$ and $\nabla_\epsilon G(0, \phi(0)) = \nabla g(x^*)'$, which combined with $\phi(\epsilon) = x_\epsilon$, $\phi(0) = x^*$ and (1) gives the desired relation.
SECTION 1.2
1.2.5 www
(a) Given a bounded set $A$, let $r = \sup\{\|x\| \mid x \in A\}$ and $B = \{x \mid \|x\| \le r\}$. Let $L = \max\{\|\nabla^2 f(x)\| \mid x \in B\}$, which is finite because a continuous function on a compact set is bounded. For any $x, y \in A$ we have
$$\nabla f(x) - \nabla f(y) = \int_0^1\nabla^2 f\big(tx + (1-t)y\big)(x - y)\,dt.$$
Notice that $tx + (1-t)y \in B$ for all $t \in [0, 1]$. It follows that
$$\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|,$$
as desired.

(b) The key idea is to show that $x^k$ stays in the bounded set
$$A = \big\{x \mid f(x) \le f(x^0)\big\}$$
and to use a stepsize $\alpha^k$ that depends on the constant $L$ corresponding to this bounded set. Let
$$R = \max\{\|x\| \mid x \in A\}, \qquad G = \max\{\|\nabla f(x)\| \mid x \in A\},$$
and
$$B = \{x \mid \|x\| \le R + 2G\}.$$
Using condition (i) in the exercise, there exists some constant $L$ such that $\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|$ for all $x, y \in B$. Suppose the stepsize $\alpha^k$ satisfies
$$0 < \epsilon \le \alpha^k \le (2 - \epsilon)\gamma^k\min\{1,\ 1/L\},$$
where
$$\gamma^k = \frac{|\nabla f(x^k)'d^k|}{\|d^k\|^2}.$$
Let $\beta^k = \alpha^k(\gamma^k - L\alpha^k/2)$, which can be seen to satisfy $\beta^k \ge \epsilon^2\gamma^k/2$ by our choice of $\alpha^k$. We will show by induction on $k$ that, with such a choice of stepsize, we have $x^k \in A$ and
$$f(x^{k+1}) \le f(x^k) - \beta^k\|d^k\|^2 \qquad (*)$$
for all $k \ge 0$.

To start the induction, we note that $x^0 \in A$ by the definition of $A$. Suppose that $x^k \in A$. By the definition of $\gamma^k$, we have
$$\gamma^k\|d^k\|^2 = \big|\nabla f(x^k)'d^k\big| \le \big\|\nabla f(x^k)\big\|\,\|d^k\|.$$
Thus, $\|d^k\| \le \|\nabla f(x^k)\|/\gamma^k \le G/\gamma^k$. Hence,
$$\|x^k + \alpha^kd^k\| \le \|x^k\| + \alpha^kG/\gamma^k \le R + 2G,$$
which shows that $x^k + \alpha^kd^k \in B$. In order to prove Eq. (*), we now proceed as in the proof of Prop. 1.2.3. A difficulty arises because Prop. A.24 assumes that the inequality $\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|$ holds for all $x, y$, whereas in this exercise this inequality holds only for $x, y \in B$. We thus essentially repeat the proof of Prop. A.24, to obtain
$$\begin{aligned}
f(x^{k+1}) - f(x^k) &= f(x^k + \alpha^kd^k) - f(x^k)\\
&= \int_0^1\alpha^k\nabla f(x^k + \tau\alpha^kd^k)'d^k\,d\tau\\
&\le \alpha^k\nabla f(x^k)'d^k + \bigg|\int_0^1\alpha^k\big(\nabla f(x^k + \alpha^k\tau d^k) - \nabla f(x^k)\big)'d^k\,d\tau\bigg|\\
&\le \alpha^k\nabla f(x^k)'d^k + (\alpha^k)^2\|d^k\|^2\int_0^1L\tau\,d\tau\\
&= \alpha^k\nabla f(x^k)'d^k + \frac{L(\alpha^k)^2}{2}\|d^k\|^2. \qquad (**)
\end{aligned}$$
We have used here the inequality
$$\big\|\nabla f(x^k + \alpha^k\tau d^k) - \nabla f(x^k)\big\| \le \alpha^kL\tau\|d^k\|,$$
which holds because of our definition of $L$ and because $x^k \in A \subset B$, $x^k + \alpha^kd^k \in B$ and (because of the convexity of $B$) $x^k + \alpha^k\tau d^k \in B$, for $\tau \in [0, 1]$.

Inequality (*) now follows from Eq. (**) as in the proof of Prop. 1.2.3. In particular, we have $f(x^{k+1}) \le f(x^k) \le f(x^0)$ and $x^{k+1} \in A$. This completes the induction. The remainder of the proof is the same as in Prop. 1.2.3.
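A minimal sketch of the stepsize rule of part (b) (our own construction, not from the manual), with scaled steepest-descent directions $d^k = -D\nabla f(x^k)$ on a convex quadratic; the inductive descent inequality (*) is asserted at every step:

```python
import numpy as np

# Stepsize alpha_k = (2 - eps)*gamma_k*min(1, 1/L), with
# gamma_k = |grad f(x_k)' d_k| / ||d_k||^2, on f(x) = (1/2) x'Qx.
rng = np.random.default_rng(1)
Q = np.diag([1.0, 10.0, 100.0])
f = lambda x: 0.5 * x @ Q @ x
grad = lambda x: Q @ x
L = np.max(np.linalg.eigvalsh(Q))      # Lipschitz constant of grad f
D = np.diag([1.0, 0.5, 0.2])           # any positive definite scaling
eps = 0.5

x = rng.standard_normal(3)
for k in range(60):
    g = grad(x)
    d = -D @ g
    gamma = abs(g @ d) / (d @ d)
    alpha = (2 - eps) * gamma * min(1.0, 1.0 / L)
    beta = alpha * (gamma - L * alpha / 2)   # >= eps^2*gamma/2 > 0
    assert f(x + alpha * d) <= f(x) - beta * (d @ d) + 1e-12   # ineq. (*)
    x = x + alpha * d
print("f after 60 iterations:", f(x))    # monotonically decreased
```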
1.2.10 www
We have
$$\nabla f(x) - \nabla f(x^*) = \int_0^1\nabla^2 f\big(x^* + t(x - x^*)\big)(x - x^*)\,dt,$$
and since $\nabla f(x^*) = 0$, we obtain
$$(x - x^*)'\nabla f(x) = \int_0^1(x - x^*)'\nabla^2 f\big(x^* + t(x - x^*)\big)(x - x^*)\,dt \ge m\int_0^1\|x - x^*\|^2\,dt.$$
Using the Cauchy-Schwarz inequality $(x - x^*)'\nabla f(x) \le \|x - x^*\|\,\|\nabla f(x)\|$, we have
$$m\int_0^1\|x - x^*\|^2\,dt \le \|x - x^*\|\,\|\nabla f(x)\|,$$
and
$$\|x - x^*\| \le \frac{\|\nabla f(x)\|}{m}.$$
Now define, for all scalars $t$,
$$F(t) = f\big(x^* + t(x - x^*)\big).$$
We have
$$F'(t) = (x - x^*)'\nabla f\big(x^* + t(x - x^*)\big)$$
and
$$F''(t) = (x - x^*)'\nabla^2 f\big(x^* + t(x - x^*)\big)(x - x^*) \ge m\|x - x^*\|^2 \ge 0.$$
Thus $F'$ is an increasing function, and $F'(1) \ge F'(t)$ for all $t \in [0, 1]$. Hence
$$f(x) - f(x^*) = F(1) - F(0) = \int_0^1F'(t)\,dt \le F'(1) = (x - x^*)'\nabla f(x) \le \|x - x^*\|\,\|\nabla f(x)\| \le \frac{\|\nabla f(x)\|^2}{m},$$
where in the last step we used the result shown earlier.
1.2.11 www
Assume condition (i). The same reasoning as in the proof of Prop. 1.2.1 can be used here to show that
$$0 \le \nabla f(\bar x)'\bar p, \qquad (1)$$
where $\bar x$ is a limit point of $\{x^k\}$, namely $\{x^k\}_{k\in\bar K} \to \bar x$, and
$$p^k = \frac{d^k}{\|d^k\|}, \qquad \{p^k\}_{k\in\bar K} \to \bar p. \qquad (2)$$
Since $\nabla f$ is continuous, we can write
$$\nabla f(\bar x)'\bar p = \lim_{k\to\infty,\,k\in\bar K}\nabla f(x^k)'p^k = \liminf_{k\to\infty,\,k\in\bar K}\nabla f(x^k)'p^k \le \frac{\displaystyle\liminf_{k\to\infty,\,k\in\bar K}\nabla f(x^k)'d^k}{\displaystyle\limsup_{k\to\infty,\,k\in\bar K}\|d^k\|} < 0,$$
which contradicts (1). The proof for the other choices of stepsize is the same as in Prop. 1.2.1.

Assume condition (ii). Suppose that $\nabla f(x^k) \ne 0$ for all $k$. For the minimization rule we have
$$f(x^{k+1}) = \min_{\alpha\ge 0} f(x^k + \alpha d^k) = \min_{\theta\ge 0} f(x^k + \theta p^k) \qquad (3)$$
for all $k$, where $p^k = d^k/\|d^k\|$. Note that
$$\nabla f(x^k)'p^k \le -c\|\nabla f(x^k)\|, \qquad \forall\ k. \qquad (4)$$
Let $\hat x^{k+1} = x^k + \hat\alpha^kp^k$ be the iterate generated from $x^k$ via the Armijo rule, with the corresponding stepsize $\hat\alpha^k$ and the descent direction $p^k$. Then from (3) and (4), it follows that
$$f(x^{k+1}) - f(x^k) \le f(\hat x^{k+1}) - f(x^k) \le \sigma\hat\alpha^k\nabla f(x^k)'p^k \le -\sigma c\,\hat\alpha^k\|\nabla f(x^k)\|. \qquad (5)$$
Hence, either $\{f(x^k)\}$ diverges to $-\infty$ or else it converges to some finite value. Suppose that $\{x^k\}_{k\in K} \to \bar x$ and $\nabla f(\bar x) \ne 0$. Then $\lim_{k\to\infty,\,k\in K} f(x^k) = f(\bar x)$, which combined with (5) implies that
$$\lim_{k\to\infty,\,k\in K}\hat\alpha^k\|\nabla f(x^k)\| = 0.$$
Since $\lim_{k\to\infty,\,k\in K}\nabla f(x^k) = \nabla f(\bar x) \ne 0$, we must have $\lim_{k\to\infty,\,k\in K}\hat\alpha^k = 0$. Without loss of generality, we may assume that $\lim_{k\to\infty,\,k\in K}p^k = \bar p$. Now, we can use the same line of argument as in the proof of Prop. 1.2.1 to show that (1) holds. On the other hand, from (4) we have
$$\lim_{k\to\infty,\,k\in K}\nabla f(x^k)'p^k = \nabla f(\bar x)'\bar p \le -c\|\nabla f(\bar x)\| < 0.$$
This contradicts (1), so that $\nabla f(\bar x) = 0$.
1.2.13 www
Consider the stepsize rule (i). From the Descent Lemma (cf. the proof of Prop. 1.2.3), we have for all $k$
$$f(x^{k+1}) \le f(x^k) - \alpha^k\Big(1 - \frac{\alpha^kL}{2}\Big)\|\nabla f(x^k)\|^2.$$
From this relation, we obtain for any minimum $x^*$ of $f$,
$$f(x^*) \le f(x^0) - \frac{\epsilon^2}{2}\sum_{k=0}^{\infty}\|\nabla f(x^k)\|^2.$$
It follows that $\nabla f(x^k) \to 0$, that $\{f(x^k)\}$ converges, and that $\sum_{k=0}^{\infty}\|\nabla f(x^k)\|^2 < \infty$, from which
$$\sum_{k=0}^{\infty}\|x^{k+1} - x^k\|^2 < \infty,$$
since $\nabla f(x^k) = (x^k - x^{k+1})/\alpha^k$.

Using the convexity of $f$, we have for any minimum $x^*$ of $f$,
$$\begin{aligned}
\|x^{k+1} - x^*\|^2 - \|x^k - x^*\|^2 - \|x^{k+1} - x^k\|^2 &\le -2(x^* - x^k)'(x^{k+1} - x^k)\\
&= 2\alpha^k(x^* - x^k)'\nabla f(x^k)\\
&\le 2\alpha^k\big(f(x^*) - f(x^k)\big) \le 0,
\end{aligned}$$
so that
$$\|x^{k+1} - x^*\|^2 \le \|x^k - x^*\|^2 + \|x^{k+1} - x^k\|^2.$$
Hence, for any $m$,
$$\|x^m - x^*\|^2 \le \|x^0 - x^*\|^2 + \sum_{k=0}^{m-1}\|x^{k+1} - x^k\|^2.$$
It follows that $\{x^k\}$ is bounded. Let $\bar x$ be a limit point of $\{x^k\}$, and for any $\epsilon > 0$, let $\bar k$ be such that
$$\|x^{\bar k} - \bar x\|^2 \le \epsilon, \qquad \sum_{i=\bar k}^{\infty}\|x^{i+1} - x^i\|^2 \le \epsilon.$$
Since $\bar x$ is a minimum of $f$, using the preceding relations, for any $k > \bar k$ we have
$$\|x^k - \bar x\|^2 \le \|x^{\bar k} - \bar x\|^2 + \sum_{i=\bar k}^{k-1}\|x^{i+1} - x^i\|^2 \le 2\epsilon.$$
Since $\epsilon$ is arbitrarily small, it follows that the entire sequence $\{x^k\}$ converges to $\bar x$.

The proof for the case of the stepsize rule (ii) is similar. Using the assumptions $\alpha^k \to 0$ and $\sum_{k=0}^{\infty}\alpha^k = \infty$, and the Descent Lemma, we show that $\nabla f(x^k) \to 0$, that $\{f(x^k)\}$ converges, and that $\sum_{k=0}^{\infty}\|x^{k+1} - x^k\|^2 < \infty$. From this point, the preceding proof applies.
1.2.14 www
(a) We have
$$\begin{aligned}
\|x^{k+1} - y\|^2 &= \|x^k - y - \alpha^k\nabla f(x^k)\|^2\\
&= \big(x^k - y - \alpha^k\nabla f(x^k)\big)'\big(x^k - y - \alpha^k\nabla f(x^k)\big)\\
&= \|x^k - y\|^2 - 2\alpha^k(x^k - y)'\nabla f(x^k) + \big(\alpha^k\|\nabla f(x^k)\|\big)^2\\
&= \|x^k - y\|^2 + 2\alpha^k(y - x^k)'\nabla f(x^k) + \big(\alpha^k\|\nabla f(x^k)\|\big)^2\\
&\le \|x^k - y\|^2 + 2\alpha^k\big(f(y) - f(x^k)\big) + \big(\alpha^k\|\nabla f(x^k)\|\big)^2\\
&= \|x^k - y\|^2 - 2\alpha^k\big(f(x^k) - f(y)\big) + \big(\alpha^k\|\nabla f(x^k)\|\big)^2,
\end{aligned}$$
where the inequality follows from Prop. B.3, which states that $f$ is convex if and only if
$$f(y) - f(x) \ge (y - x)'\nabla f(x), \qquad \forall\ x, y.$$

(b) Assume the contrary, that is, $\liminf_{k\to\infty} f(x^k) \ne \inf_{x\in\Re^n} f(x)$. Then, for some $\delta > 0$, there exists $y$ such that $f(y) < f(x^k) - \delta$ for all $k \ge \bar k$, where $\bar k$ is sufficiently large. From part (a), we have
$$\|x^{k+1} - y\|^2 \le \|x^k - y\|^2 - 2\alpha^k\big(f(x^k) - f(y)\big) + \big(\alpha^k\|\nabla f(x^k)\|\big)^2.$$
Summing over all $k$ sufficiently large, we have
$$\sum_{k=\bar k}^{\infty}\|x^{k+1} - y\|^2 \le \sum_{k=\bar k}^{\infty}\Big(\|x^k - y\|^2 - 2\alpha^k\big(f(x^k) - f(y)\big) + \big(\alpha^k\|\nabla f(x^k)\|\big)^2\Big),$$
or
$$0 \le \|x^{\bar k} - y\|^2 - 2\delta\sum_{k=\bar k}^{\infty}\alpha^k + \sum_{k=\bar k}^{\infty}\alpha^k\big(\alpha^k\|\nabla f(x^k)\|^2\big) = \|x^{\bar k} - y\|^2 - \sum_{k=\bar k}^{\infty}\alpha^k\big(2\delta - \alpha^k\|\nabla f(x^k)\|^2\big).$$
By taking $\bar k$ large enough, we may assume (using $\alpha^k\|\nabla f(x^k)\|^2 \to 0$) that $\alpha^k\|\nabla f(x^k)\|^2 \le \delta$ for $k \ge \bar k$. So we obtain
$$0 \le \|x^{\bar k} - y\|^2 - \delta\sum_{k=\bar k}^{\infty}\alpha^k.$$
Since $\sum\alpha^k = \infty$, the term on the right is equal to $-\infty$, yielding a contradiction. Therefore we must have $\liminf_{k\to\infty} f(x^k) = \inf_{x\in\Re^n} f(x)$.

(c) Let $y$ be some $x^*$ such that $f(x^*) \le f(x^k)$ for all $k$. (If no such $x^*$ exists, the desired result follows trivially.) Then
$$\begin{aligned}
\|x^{k+1} - y\|^2 &\le \|x^k - y\|^2 - 2\alpha^k\big(f(x^k) - f(y)\big) + \big(\alpha^k\|\nabla f(x^k)\|\big)^2\\
&\le \|x^k - y\|^2 + \big(\alpha^k\|\nabla f(x^k)\|\big)^2\\
&= \|x^k - y\|^2 + \bigg(\frac{s^k}{\|\nabla f(x^k)\|}\|\nabla f(x^k)\|\bigg)^2\\
&= \|x^k - y\|^2 + (s^k)^2\\
&\le \|x^{k-1} - y\|^2 + (s^{k-1})^2 + (s^k)^2\\
&\le \cdots \le \|x^0 - y\|^2 + \sum_{i=0}^{k}(s^i)^2 < \infty.
\end{aligned}$$
Thus $\{x^k\}$ is bounded. Since $f$ is continuously differentiable, we then have that $\{\nabla f(x^k)\}$ is bounded. Let $M$ be an upper bound for $\|\nabla f(x^k)\|$. Then
$$\sum\alpha^k = \sum\frac{s^k}{\|\nabla f(x^k)\|} \ge \frac{1}{M}\sum s^k = \infty.$$
Furthermore,
$$\alpha^k\|\nabla f(x^k)\|^2 = s^k\|\nabla f(x^k)\| \le s^kM.$$
Since $\sum(s^k)^2 < \infty$, we have $s^k \to 0$, and therefore $\alpha^k\|\nabla f(x^k)\|^2 \to 0$. We can thus apply the result of part (b) to show that $\liminf_{k\to\infty} f(x^k) = \inf_{x\in\Re^n} f(x)$.

Now, since $\liminf_{k\to\infty} f(x^k) = \inf_{x\in\Re^n} f(x)$, there must be a subsequence $\{x^k\}_K$ such that $\{x^k\}_K \to \bar x$ for some $\bar x$ with $f(\bar x) = \inf_{x\in\Re^n} f(x)$, so that $\bar x$ is a global minimum. We have
$$\|x^{k+1} - \bar x\|^2 \le \|x^k - \bar x\|^2 + (s^k)^2,$$
so that
$$\|x^{k+N} - \bar x\|^2 \le \|x^k - \bar x\|^2 + \sum_{m=k}^{k+N-1}(s^m)^2, \qquad \forall\ k,\ N \ge 1.$$
For any $\epsilon > 0$, we can choose $\bar k \in K$ sufficiently large so that for all $k \in K$ with $k \ge \bar k$ we have
$$\|x^k - \bar x\|^2 \le \epsilon \qquad \text{and} \qquad \sum_{m=k}^{\infty}(s^m)^2 \le \epsilon.$$
Then
$$\|x^{k+N} - \bar x\|^2 \le 2\epsilon, \qquad \forall\ N \ge 1.$$
Since $\epsilon > 0$ is arbitrary, we see that $\{x^k\}$ converges to $\bar x$.
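The stepsize of part (c), $\alpha^k = s^k/\|\nabla f(x^k)\|$ with $\sum_k s^k = \infty$ and $\sum_k (s^k)^2 < \infty$, is easy to try out. A small sketch (our own example) with $s^k = 1/(k+1)$ on a convex quadratic:

```python
import numpy as np

# Gradient method with the normalized diminishing stepsize of part (c):
# alpha_k = s_k/||grad f(x_k)||, s_k = 1/(k+1), so that sum s_k = infinity
# and sum s_k^2 < infinity.
Q = np.diag([1.0, 4.0])
f = lambda x: 0.5 * x @ Q @ x
grad = lambda x: Q @ x

x = np.array([5.0, -3.0])
for k in range(5000):
    g = grad(x)
    gnorm = np.linalg.norm(g)
    if gnorm < 1e-12:
        break
    x = x - (1.0 / (k + 1)) / gnorm * g   # step of length s_k toward -grad
print("x =", np.round(x, 4), " f(x) =", f(x))   # approaches the minimum 0
```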
1.2.17 www
By using the descent lemma (Proposition A.24 of Appendix A), we obtain
$$\begin{aligned}
f(x^{k+1}) - f(x^k) &\le -\alpha^k\nabla f(x^k)'\big(\nabla f(x^k) + e^k\big) + \frac{L}{2}(\alpha^k)^2\|\nabla f(x^k) + e^k\|^2\\
&= -\alpha^k\Big(1 - \frac{L}{2}\alpha^k\Big)\|\nabla f(x^k)\|^2 + \frac{L}{2}(\alpha^k)^2\|e^k\|^2 - \alpha^k(1 - L\alpha^k)\nabla f(x^k)'e^k.
\end{aligned}$$
Assume that $\alpha^k < \frac{1}{L}$ for all $k$, so that $1 - L\alpha^k > 0$ for every $k$. Then, using the estimates
$$1 - \frac{L}{2}\alpha^k \ge 1 - L\alpha^k, \qquad \nabla f(x^k)'e^k \ge -\frac{1}{2}\big(\|\nabla f(x^k)\|^2 + \|e^k\|^2\big),$$
and the assumption $\|e^k\| \le \delta$ for all $k$, in the inequality above, we obtain
$$f(x^{k+1}) - f(x^k) \le -\frac{\alpha^k}{2}(1 - L\alpha^k)\big(\|\nabla f(x^k)\|^2 - \delta^2\big) + (\alpha^k)^2\frac{L\delta^2}{2}. \qquad (1)$$
Let $\delta'$ be an arbitrary number satisfying $\delta' > \delta$. Consider the set $\mathcal{K} = \{k \mid \|\nabla f(x^k)\| < \delta'\}$. If the set $\mathcal{K}$ is infinite, then we are done. Suppose that the set $\mathcal{K}$ is finite. Then there is some index $k_0$ such that $\|\nabla f(x^k)\| \ge \delta'$ for all $k \ge k_0$. By substituting this in (1), we can easily find that
$$f(x^{k+1}) - f(x^k) \le -\frac{\alpha^k}{2}\Big((1 - L\alpha^k)\big(\delta'^2 - \delta^2\big) - \alpha^kL\delta^2\Big), \qquad \forall\ k \ge k_0.$$
By choosing $\underline\alpha$ and $\bar\alpha$ such that $0 < \underline\alpha < \bar\alpha < \min\Big\{\frac{\delta'^2 - \delta^2}{\delta'^2L},\ \frac{1}{L}\Big\}$, and $\alpha^k \in [\underline\alpha, \bar\alpha]$ for all $k \ge k_0$, we have that
$$f(x^{k+1}) - f(x^k) \le -\frac{1}{2}\underline\alpha\big(\delta'^2 - \delta^2 - \bar\alpha L\delta'^2\big), \qquad \forall\ k \ge k_0. \qquad (2)$$
Since $\delta'^2 - \delta^2 - \bar\alpha L\delta'^2 > 0$, the sequence $\{f(x^k) \mid k \ge k_0\}$ is strictly decreasing. Summing the inequalities in (2) over $k$ for $k_0 \le k \le N$, we get
$$f(x^{N+1}) - f(x^{k_0}) \le -\frac{N - k_0}{2}\,\underline\alpha\big(\delta'^2 - \delta^2 - \bar\alpha L\delta'^2\big), \qquad \forall\ N > k_0.$$
Taking the limit as $N \to \infty$, we obtain $\lim_{N\to\infty} f(x^N) = -\infty$.
1.2.19 www
(a) Note that
$$\nabla f(x) = \nabla_xF\big(x, g(x)\big) + \nabla g(x)\nabla_yF\big(x, g(x)\big).$$
We can write the given method as
$$x^{k+1} = x^k + \alpha^kd^k = x^k - \alpha^k\nabla_xF\big(x^k, g(x^k)\big) = x^k + \alpha^k\Big({-\nabla f(x^k)} + \nabla g(x^k)\nabla_yF\big(x^k, g(x^k)\big)\Big),$$
so that this method is essentially steepest descent with error
$$e^k = -\nabla g(x^k)\nabla_yF\big(x^k, g(x^k)\big).$$
Claim: The directions $d^k$ are gradient related.

Proof: We first show that $d^k$ is a descent direction. We have
$$\begin{aligned}
\nabla f(x^k)'d^k &= \Big(\nabla_xF\big(x^k, g(x^k)\big) + \nabla g(x^k)\nabla_yF\big(x^k, g(x^k)\big)\Big)'\Big({-\nabla_xF\big(x^k, g(x^k)\big)}\Big)\\
&= -\big\|\nabla_xF\big(x^k, g(x^k)\big)\big\|^2 - \Big(\nabla g(x^k)\nabla_yF\big(x^k, g(x^k)\big)\Big)'\nabla_xF\big(x^k, g(x^k)\big)\\
&\le -\big\|\nabla_xF\big(x^k, g(x^k)\big)\big\|^2 + \big\|\nabla g(x^k)\nabla_yF\big(x^k, g(x^k)\big)\big\|\,\big\|\nabla_xF\big(x^k, g(x^k)\big)\big\|\\
&\le -\big\|\nabla_xF\big(x^k, g(x^k)\big)\big\|^2 + \gamma\big\|\nabla_xF\big(x^k, g(x^k)\big)\big\|^2\\
&= (-1 + \gamma)\big\|\nabla_xF\big(x^k, g(x^k)\big)\big\|^2 < 0 \quad \text{for } \big\|\nabla_xF\big(x^k, g(x^k)\big)\big\| \ne 0. \qquad (1)
\end{aligned}$$
It is straightforward to show that $\|\nabla_xF(x^k, g(x^k))\| = 0$ if and only if $\|\nabla f(x^k)\| = 0$, so that we have $\nabla f(x^k)'d^k < 0$ for $\|\nabla f(x^k)\| \ne 0$. Hence $d^k$ is a descent direction if $x^k$ is nonstationary. Furthermore, for every subsequence $\{x^k\}_{k\in K}$ that converges to a nonstationary point $\bar x$, we have
$$\begin{aligned}
\|d^k\| &= \frac{1}{1-\gamma}\Big[\big\|\nabla_xF\big(x^k, g(x^k)\big)\big\| - \gamma\big\|\nabla_xF\big(x^k, g(x^k)\big)\big\|\Big]\\
&\le \frac{1}{1-\gamma}\Big[\big\|\nabla_xF\big(x^k, g(x^k)\big)\big\| - \big\|\nabla g(x^k)\nabla_yF\big(x^k, g(x^k)\big)\big\|\Big]\\
&\le \frac{1}{1-\gamma}\big\|\nabla_xF\big(x^k, g(x^k)\big) + \nabla g(x^k)\nabla_yF\big(x^k, g(x^k)\big)\big\|\\
&= \frac{1}{1-\gamma}\|\nabla f(x^k)\|,
\end{aligned}$$
and so $\{d^k\}$ is bounded. We have from Eq. (1), $\nabla f(x^k)'d^k \le -(1-\gamma)\|\nabla_xF(x^k, g(x^k))\|^2$. Hence if $\liminf_{k\to\infty,\,k\in K}\nabla f(x^k)'d^k = 0$, then $\lim_{k\to\infty,\,k\in K}\|\nabla_xF(x^k, g(x^k))\| = 0$, from which $\|\nabla_xF(\bar x, g(\bar x))\| = 0$. So $\nabla f(\bar x) = 0$, which contradicts the nonstationarity of $\bar x$. Hence,
$$\liminf_{k\to\infty,\,k\in K}\nabla f(x^k)'d^k < 0,$$
and it follows that the directions $d^k$ are gradient related. From Prop. 1.2.1, we then have the desired result.

(b) Let us assume that, in addition to being continuously differentiable, $h$ has a continuous and nonsingular gradient matrix $\nabla_yh(x, y)$. Then from the Implicit Function Theorem (Prop. A.33), there exists a continuously differentiable function $\phi : \Re^n \to \Re^m$ such that $h(x, \phi(x)) = 0$ for all $x \in \Re^n$. If, furthermore, there exists a $\gamma \in (0, 1)$ such that
$$\big\|\nabla\phi(x)\nabla_yf\big(x, \phi(x)\big)\big\| \le \gamma\big\|\nabla_xf\big(x, \phi(x)\big)\big\|, \qquad \forall\ x \in \Re^n,$$
then from part (a), the method described is convergent.
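As a toy illustration of part (a) (our own example, not from the manual), take $F(x, y) = \frac{1}{2}x^2 + \frac{1}{2}y^2$ and $g(x) = \epsilon\sin x$ with $\epsilon < 1$. Then $\|\nabla g(x)\nabla_yF(x, g(x))\| = \epsilon^2|\sin x||\cos x| \le \epsilon^2|x| = \gamma\|\nabla_xF(x, g(x))\|$ with $\gamma = \epsilon^2 < 1$, and the partial-gradient iteration drives both $x^k$ and $\nabla f(x^k)$ to zero:

```python
import numpy as np

eps, alpha = 0.5, 0.5
# gradient of f(x) = F(x, g(x)) = 0.5*x**2 + 0.5*eps**2*sin(x)**2
grad_f = lambda x: x + eps**2 * np.sin(x) * np.cos(x)

x = 2.0
for k in range(60):
    x = x - alpha * x        # x <- x - alpha * grad_x F(x, g(x)), here = x - alpha*x
print("x =", x, " grad f(x) =", grad_f(x))   # both tend to 0
```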
1.2.20 www
(a) Consider the function $g(\alpha) = f(x^k + \alpha d^k)$ for $0 < \alpha < \alpha^k$, which is convex over $I^k$. Suppose that $\bar x^k = x^k + \bar\alpha d^k$, with $\bar\alpha \in I^k$, minimizes $f$ over $I^k$. Then $g'(\bar\alpha) = 0$, and from convexity it follows that $g'(\alpha^k) = \nabla f(x^{k+1})'d^k > 0$ (since $g'(0) = \nabla f(x^k)'d^k < 0$). Therefore the stepsize will be reduced after this iteration. Now, assume that $\bar x^k \notin I^k$. This means that the derivative $g'(\alpha)$ does not change sign for $0 < \alpha < \alpha^k$, i.e., for all $\alpha$ in the interval $(0, \alpha^k)$ we have $g'(\alpha) < 0$. Hence, $g'(\alpha^k) = \nabla f(x^{k+1})'d^k \le 0$ and we can use the same stepsize $\alpha^k$ in the next iteration.

(b) Here we will use conditions on $\nabla f(x)$ and $d^k$ which imply
$$\begin{aligned}
\nabla f(x^{k+1})'d^k &\le \nabla f(x^k)'d^k + \|\nabla f(x^{k+1}) - \nabla f(x^k)\|\,\|d^k\|\\
&\le \nabla f(x^k)'d^k + \alpha^kL\|d^k\|^2\\
&\le -(c_1 - c_2\alpha^kL)\|\nabla f(x^k)\|^2.
\end{aligned}$$
When the stepsize becomes small enough so that $c_1 - c_2\alpha^{\hat k}L \ge 0$ for some $\hat k$, then $\nabla f(x^{k+1})'d^k \le 0$ for all $k \ge \hat k$ and no further reduction will ever be needed.

(c) The result follows in the same way as in the proof of Prop. 1.2.4. Every limit point of $\{x^k\}$ is a stationary point of $f$. Since $f$ is convex, every limit point of $\{x^k\}$ must be a global minimum of $f$.
1.2.21 www
By using the descent lemma (Prop. A.24 of Appendix A), we obtain
$$f(x^{k+1}) - f(x^k) \le \alpha^k\nabla f(x^k)'(d^k + e^k) + (\alpha^k)^2\frac{L}{2}\|d^k + e^k\|^2. \qquad (1)$$
Taking into account the given properties of $d^k$, $e^k$, the Schwarz inequality, and the inequality $\|y\|\,\|z\| \le \|y\|^2 + \|z\|^2$, we obtain
$$\begin{aligned}
\nabla f(x^k)'(d^k + e^k) &\le -(c_1 - p\alpha^k)\|\nabla f(x^k)\|^2 + q\alpha^k\|\nabla f(x^k)\|\\
&\le -\big(c_1 - (p+1)\alpha^k\big)\|\nabla f(x^k)\|^2 + \alpha^kq^2.
\end{aligned}$$
To estimate the last term on the right-hand side of (1), we again use the properties of $d^k$, $e^k$, and the inequality $\frac{1}{2}\|y + z\|^2 \le \|y\|^2 + \|z\|^2$, which gives
$$\frac{1}{2}\|d^k + e^k\|^2 \le \|d^k\|^2 + \|e^k\|^2 \le 2\big(c_2^2 + (p\alpha^k)^2\big)\|\nabla f(x^k)\|^2 + 2\big(c_2^2 + (q\alpha^k)^2\big) \le 2(c_2^2 + p^2)\|\nabla f(x^k)\|^2 + 2(c_2^2 + q^2), \qquad \forall\ k \ge k_0,$$
where $k_0$ is such that $\alpha^k \le 1$ for all $k \ge k_0$.

By substituting these estimates in (1), we get
$$f(x^{k+1}) - f(x^k) \le -\alpha^k(c_1 - C\alpha^k)\|\nabla f(x^k)\|^2 + (\alpha^k)^2b_2, \qquad \forall\ k \ge k_0,$$
where $C = 1 + p + 2L(c_2^2 + p^2)$ and $b_2 = q^2 + 2L(c_2^2 + q^2)$. By choosing $k_0$ large enough (so that, using $\alpha^k \to 0$, we have $c_1 - C\alpha^k \ge b_1$ for some scalar $b_1 > 0$), we can have
$$f(x^{k+1}) - f(x^k) \le -\alpha^kb_1\|\nabla f(x^k)\|^2 + (\alpha^k)^2b_2, \qquad \forall\ k \ge k_0.$$
Summing up these inequalities over $k$ for $k_0 \le K \le k \le N$ gives
$$f(x^{N+1}) + b_1\sum_{k=K}^{N}\alpha^k\|\nabla f(x^k)\|^2 \le f(x^K) + b_2\sum_{k=K}^{N}(\alpha^k)^2, \qquad \forall\ k_0 \le K \le N. \qquad (2)$$
Therefore
$$\limsup_{N\to\infty}f(x^{N+1}) \le f(x^K) + b_2\sum_{k=K}^{\infty}(\alpha^k)^2, \qquad \forall\ K \ge k_0.$$
Since $\sum_{k=0}^{\infty}(\alpha^k)^2 < \infty$, the last inequality implies
$$\limsup_{N\to\infty}f(x^{N+1}) \le \liminf_{K\to\infty}f(x^K),$$
i.e., $\lim_{k\to\infty}f(x^k)$ exists (possibly infinite). In particular, the relation (2) implies
$$\sum_{k=0}^{\infty}\alpha^k\|\nabla f(x^k)\|^2 < \infty.$$
Thus we have $\liminf_{k\to\infty}\|\nabla f(x^k)\| = 0$ (see the proof of Prop. 1.2.4). To prove that $\lim_{k\to\infty}\|\nabla f(x^k)\| = 0$, assume the contrary, i.e.,
$$\limsup_{k\to\infty}\|\nabla f(x^k)\| \ge \epsilon > 0. \qquad (3)$$
Let $\{m_j\}$ and $\{n_j\}$ be sequences such that
$$m_j < n_j < m_{j+1}, \qquad \frac{\epsilon}{3} < \|\nabla f(x^k)\| \ \text{ for } m_j \le k < n_j, \qquad \|\nabla f(x^k)\| \le \frac{\epsilon}{3} \ \text{ for } n_j \le k < m_{j+1}. \qquad (4)$$
Let $\bar j$ be large enough so that
$$\alpha^k \le 1, \quad \forall\ k \ge m_{\bar j}, \qquad \sum_{k=m_{\bar j}}^{\infty}\alpha^k\|\nabla f(x^k)\|^2 \le \frac{\epsilon^3}{27L(2c_2 + q + p)}.$$
For any $j \ge \bar j$ and any $m$ with $m_j \le m \le n_j - 1$, we have
$$\begin{aligned}
\|\nabla f(x^{n_j}) - \nabla f(x^m)\| &\le \sum_{k=m}^{n_j-1}\|\nabla f(x^{k+1}) - \nabla f(x^k)\|\\
&\le L\sum_{k=m}^{n_j-1}\|x^{k+1} - x^k\|\\
&\le L\sum_{k=m}^{n_j-1}\alpha^k\big(\|d^k\| + \|e^k\|\big)\\
&\le L(c_2 + q)\Bigg(\sum_{k=m}^{n_j-1}\alpha^k\Bigg) + L(c_2 + p)\sum_{k=m}^{n_j-1}\alpha^k\|\nabla f(x^k)\|\\
&\le \bigg(L(c_2 + q)\frac{9}{\epsilon^2} + L(c_2 + p)\frac{3}{\epsilon}\bigg)\sum_{k=m}^{n_j-1}\alpha^k\|\nabla f(x^k)\|^2\\
&\le \frac{9L(2c_2 + p + q)}{\epsilon^2}\sum_{k=m}^{n_j-1}\alpha^k\|\nabla f(x^k)\|^2\\
&\le \frac{9L(2c_2 + p + q)}{\epsilon^2}\cdot\frac{\epsilon^3}{27L(2c_2 + q + p)} = \frac{\epsilon}{3}.
\end{aligned}$$
Therefore
$$\|\nabla f(x^m)\| \le \|\nabla f(x^{n_j})\| + \frac{\epsilon}{3} \le \frac{2\epsilon}{3}, \qquad \forall\ j \ge \bar j,\ m_j \le m \le n_j - 1.$$
From here and (4), we have
$$\|\nabla f(x^m)\| \le \frac{2\epsilon}{3}, \qquad \forall\ m \ge m_{\bar j},$$
which contradicts Eq. (3). Hence $\lim_{k\to\infty}\nabla f(x^k) = 0$. If $\bar x$ is a limit point of $\{x^k\}$, then $\lim_{k\to\infty}f(x^k) = f(\bar x)$, and $\lim_{k\to\infty}\nabla f(x^k) = 0$ implies that $\nabla f(\bar x) = 0$.
SECTION 1.3
1.3.2 www
Let $\beta$ be any scalar with $0 < \beta < 1$ and let $B(x^*, \epsilon) = \{x \mid \|x - x^*\| \le \epsilon\}$ be a closed sphere centered at $x^*$ with radius $\epsilon > 0$ such that for all $x, y \in B(x^*, \epsilon)$ the following hold:
$$\nabla^2 f(x) > 0, \qquad \|\nabla^2 f(x)^{-1}\| \le M_1, \qquad (1)$$
$$\|\nabla f(x) - \nabla f(y)\| \le M_2\|x - y\|, \qquad M_2 = \sup_{x\in B(x^*,\epsilon)}\|\nabla^2 f(x)\|, \qquad (2)$$
$$\|\nabla^2 f(x) - \nabla^2 f(y)\| \le \frac{\beta}{2M_1}, \qquad (3)$$
$$\|d(x) + \nabla^2 f(x)^{-1}\nabla f(x)\| \le \frac{\beta}{2M_2}\|\nabla f(x)\|. \qquad (4)$$
Then, by using these relations and $\nabla f(x^*) = 0$, for any $x \in B(x^*, \epsilon)$ one can obtain
$$\begin{aligned}
\|x + d(x) - x^*\| &\le \|x - x^* - \nabla^2 f(x)^{-1}\nabla f(x)\| + \|d(x) + \nabla^2 f(x)^{-1}\nabla f(x)\|\\
&\le \big\|\nabla^2 f(x)^{-1}\big(\nabla^2 f(x)(x - x^*) - \nabla f(x)\big)\big\| + \frac{\beta}{2M_2}\|\nabla f(x)\|\\
&\le M_1\big\|\nabla^2 f(x)(x - x^*) - \nabla f(x) + \nabla f(x^*)\big\| + \frac{\beta}{2M_2}\|\nabla f(x) - \nabla f(x^*)\|\\
&\le M_1\bigg\|\nabla^2 f(x)(x - x^*) - \int_0^1\nabla^2 f\big(x^* + t(x - x^*)\big)(x - x^*)\,dt\bigg\| + \frac{\beta}{2}\|x - x^*\|\\
&\le M_1\bigg(\int_0^1\big\|\nabla^2 f(x) - \nabla^2 f\big(x^* + t(x - x^*)\big)\big\|\,dt\bigg)\|x - x^*\| + \frac{\beta}{2}\|x - x^*\|\\
&\le \beta\|x - x^*\|.
\end{aligned}$$
This means that if $x^0 \in B(x^*, \epsilon)$ and $\alpha^k = 1$ for all $k$, then we will have
$$\|x^k - x^*\| \le \beta^k\|x^0 - x^*\|, \qquad \forall\ k \ge 0. \qquad (5)$$
Now, we have to prove that for $\epsilon$ small enough the unit initial stepsize will pass the test of the Armijo rule. By the mean value theorem, we have
$$f\big(x + d(x)\big) - f(x) = \nabla f(x)'d(x) + \frac{1}{2}d(x)'\nabla^2 f(\tilde x)d(x),$$
where $\tilde x$ is a point on the line segment joining $x$ and $x + d(x)$. We would like to have
$$\nabla f(x)'d(x) + \frac{1}{2}d(x)'\nabla^2 f(\tilde x)d(x) \le \sigma\nabla f(x)'d(x) \qquad (6)$$
for all $x$ in some neighborhood of $x^*$. Therefore, we must find how small $\epsilon$ should be so that this holds in addition to the conditions given in (1)–(4). By defining
$$p(x) = \frac{\nabla f(x)}{\|\nabla f(x)\|}, \qquad q(x) = \frac{d(x)}{\|\nabla f(x)\|},$$
the condition (6) takes the form
$$(1 - \sigma)p(x)'q(x) + \frac{1}{2}q(x)'\nabla^2 f(\tilde x)q(x) \le 0. \qquad (7)$$
The condition on $d(x)$ is equivalent to
$$q(x) = -\big(\nabla^2 f(x^*)\big)^{-1}p(x) + \nu(x),$$
where $\nu(x)$ denotes a vector function with $\nu(x) \to 0$ as $x \to x^*$. By using the above relation and the fact that $\nabla^2 f(\tilde x) \to \nabla^2 f(x^*)$ as $x \to x^*$, we may write Eq. (7) as
$$(1 - \sigma)p(x)'\big(\nabla^2 f(x^*)\big)^{-1}p(x) - \frac{1}{2}p(x)'\big(\nabla^2 f(x^*)\big)^{-1}p(x) \ge \gamma(x),$$
where $\gamma(x)$ is some scalar function with $\lim_{x\to x^*}\gamma(x) = 0$. Thus Eq. (7) is equivalent to
$$\Big(\frac{1}{2} - \sigma\Big)p(x)'\big(\nabla^2 f(x^*)\big)^{-1}p(x) \ge \gamma(x). \qquad (8)$$
Since $1/2 > \sigma$, $\|p(x)\| = 1$, and $\nabla^2 f(x^*) > 0$, the above relation holds in some neighborhood of the point $x^*$. Namely, there is some $\bar\epsilon \in (0, \epsilon)$ such that (1)–(4) and (8) hold. Then for any initial point $x^0 \in B(x^*, \bar\epsilon)$ the unit initial stepsize passes the test of the Armijo rule, and (5) holds for all $k$. This completes the proof.
1.3.8 www
In this case, the gradient method has the form $x^{k+1} = x^k - \alpha\nabla f(x^k)$. From the descent lemma (Prop. A.24 of Appendix A), we have
$$f(x^{k+1}) - f(x^k) \le -\alpha c\|\nabla f(x^k)\|^2, \qquad (1)$$
where $\alpha < \frac{2}{L}$ and $c = 1 - \alpha L/2$. By using the same arguments as in the proof of Prop. 1.3.3, we can show that
$$\lim_{k\to\infty}d(x^k, X^*) = 0. \qquad (2)$$
We assume that $d(x^k, X^*) \ne 0$; otherwise the method will terminate in a finite number of iterations. Convexity of the function $f$ implies that
$$f(x^k) - f(x^*) \le \nabla f(x^k)'(x^k - x^*) \le \|\nabla f(x^k)\|\,\|x^k - x^*\|, \qquad \forall\ x^* \in X^*,$$
from which, by minimizing over $x^* \in X^*$, we have
$$f(x^k) - f^* \le \|\nabla f(x^k)\|\,d(x^k, X^*). \qquad (3)$$
Let $e^k = f(x^k) - f^*$. Then inequalities (1) and (3) imply that
$$e^{k+1} \le e^k - \alpha c\,\frac{(e^k)^2}{d^2(x^k, X^*)}, \qquad \forall\ k.$$
The rest of the proof is exactly the same as the proof of Prop. 1.3.3, starting from the relation
$$f(x^{k+1}) \le f(x^k) - \frac{c^2\|\nabla f(x^k)\|^2}{2L}.$$
1.3.9 www
Without loss of generality we assume that $c = 0$ (otherwise we make the change of variables $x = y - Q^{-1}c$). The iteration becomes
$$\begin{pmatrix} x^{k+1}\\ x^k \end{pmatrix} = \begin{pmatrix} (1+\beta)I - \alpha Q & -\beta I\\ I & 0 \end{pmatrix}\begin{pmatrix} x^k\\ x^{k-1} \end{pmatrix}.$$
Define
$$A = \begin{pmatrix} (1+\beta)I - \alpha Q & -\beta I\\ I & 0 \end{pmatrix}.$$
If $\mu$ is an eigenvalue of $A$, then for some vectors $u$ and $w$, which are not both $0$, we have
$$A\begin{pmatrix} u\\ w \end{pmatrix} = \mu\begin{pmatrix} u\\ w \end{pmatrix},$$
or equivalently,
$$u = \mu w \qquad \text{and} \qquad \big((1+\beta)I - \alpha Q\big)u - \beta w = \mu u.$$
If we had $\mu = 0$, then it is seen from the above equations that $u = 0$ and also $w = 0$, which is not possible. Therefore $\mu \ne 0$ and $A$ is invertible. We also have from the above equations that $u = \mu w$ and
$$\big((1+\beta)I - \alpha Q\big)u = \Big(\mu + \frac{\beta}{\mu}\Big)u,$$
so that $\mu + \beta/\mu$ is an eigenvalue of $(1+\beta)I - \alpha Q$. Hence, if $\mu$ and $\lambda$ satisfy the equation $\mu + \beta/\mu = 1 + \beta - \alpha\lambda$, then $\mu$ is an eigenvalue of $A$ if and only if $\lambda$ is an eigenvalue of $Q$.

Now, if
$$0 < \alpha < 2\,\frac{1+\beta}{M},$$
where $M$ is the maximum eigenvalue of $Q$, then we have
$$|1 + \beta - \alpha\lambda| < 1 + \beta$$
for every eigenvalue $\lambda$ of $Q$, and therefore also
$$\Big|\mu + \frac{\beta}{\mu}\Big| < 1 + \beta$$
for every eigenvalue $\mu$ of $A$. Let the complex number $\mu$ have the representation $\mu = |\mu|e^{i\theta}$. Then, since $\mu + \beta/\mu$ is a real number, its imaginary part is $0$, or
$$|\mu|\sin\theta - \beta\frac{1}{|\mu|}\sin\theta = 0.$$
If $\sin\theta \ne 0$, we have $|\mu|^2 = \beta < 1$, while if $\sin\theta = 0$, $\mu$ is a real number and the relation $|\mu + \beta/\mu| < 1 + \beta$ is written as $\mu^2 + \beta < (1+\beta)|\mu|$, or $(|\mu| - 1)(|\mu| - \beta) < 0$. Therefore $\beta < |\mu| < 1$. Thus, for all values of $\theta$, we have $\beta \le |\mu| < 1$, so all the eigenvalues of $A$ are strictly within the unit circle, implying that $x^k \to 0$; that is, the method converges to the unique optimal solution.

Assume for the moment that $\alpha$ and $\beta$ are fixed. From the preceding analysis we have that $\mu$ is an eigenvalue of $A$ if and only if $\mu^2 + \beta = (1 + \beta - \alpha\lambda)\mu$, where $\lambda$ is an eigenvalue of $Q$. Thus, the set of eigenvalues of $A$ is
$$\left\{\frac{1 + \beta - \alpha\lambda \pm \sqrt{(1 + \beta - \alpha\lambda)^2 - 4\beta}}{2}\ \Bigg|\ \lambda \text{ is an eigenvalue of } Q\right\},$$
so that the spectral radius of $A$ is
$$\rho(A) = \max\left\{\frac{\Big|\,|1 + \beta - \alpha\lambda| + \sqrt{(1 + \beta - \alpha\lambda)^2 - 4\beta}\,\Big|}{2}\ \Bigg|\ \lambda \text{ is an eigenvalue of } Q\right\}, \qquad (3)$$
where the square root is understood to be imaginary when its argument is negative. For any scalar $c \ge 0$, consider the function $g : \Re_+ \to \Re_+$ given by
$$g(r) = \big|r + \sqrt{r^2 - c}\big|.$$
We claim that
$$g(r) \ge \max\{\sqrt{c},\ 2r - \sqrt{c}\}.$$
Indeed, let us show this relation in each of two cases. Case 1: $r \ge \sqrt{c}$. Then it is seen that $\sqrt{r^2 - c} \ge r - \sqrt{c}$, so that $g(r) \ge 2r - \sqrt{c} \ge \sqrt{c}$. Case 2: $r < \sqrt{c}$. Then $g(r) = \sqrt{r^2 + (c - r^2)} = \sqrt{c} \ge 2r - \sqrt{c}$.

We now apply the relation $g(r) \ge \max\{\sqrt{c},\ 2r - \sqrt{c}\}$ to Eq. (3), with $c = 4\beta$ and with $r = |1 + \beta - \alpha\lambda|$, where $\lambda$ is an eigenvalue of $Q$. We have
$$\rho^2(A) \ge \frac{1}{4}\max\Big\{4\beta,\ \max\big\{2(1 + \beta - \alpha\lambda)^2 - 4\beta \mid \lambda \text{ is an eigenvalue of } Q\big\}\Big\}.$$
Therefore,
$$\rho^2(A) \ge \frac{1}{4}\max\big\{4\beta,\ 2(1 + \beta - \alpha m)^2 - 4\beta,\ 2(1 + \beta - \alpha M)^2 - 4\beta\big\},$$
or
$$\rho^2(A) \ge \max\Big\{\beta,\ \frac{1}{2}(1 + \beta - \alpha m)^2 - \beta,\ \frac{1}{2}(1 + \beta - \alpha M)^2 - \beta\Big\}. \qquad (4)$$
It is easy to verify that for every $\beta$,
$$\max\Big\{\frac{1}{2}(1 + \beta - \alpha m)^2 - \beta,\ \frac{1}{2}(1 + \beta - \alpha M)^2 - \beta\Big\} \ge \frac{1}{2}(1 + \beta - \alpha^*m)^2 - \beta, \qquad (5)$$
where $\alpha^*$ corresponds to the intersection point of the graphs of the functions of $\alpha$ inside the braces, satisfying
$$\frac{1}{2}(1 + \beta - \alpha^*m)^2 - \beta = \frac{1}{2}(1 + \beta - \alpha^*M)^2 - \beta,$$
or
$$\alpha^* = \frac{2(1+\beta)}{m + M}.$$
From Eqs. (4), (5), and the above formula for $\alpha^*$, we obtain
$$\rho^2(A) \ge \max\left\{\beta,\ \frac{1}{2}\Big((1+\beta)\frac{M - m}{m + M}\Big)^2 - \beta\right\}.$$
Again, consider the point $\beta^*$ that corresponds to the intersection point of the graphs of the functions of $\beta$ inside the braces, satisfying
$$\beta^* = \frac{1}{2}\Big((1+\beta^*)\frac{M - m}{m + M}\Big)^2 - \beta^*.$$
We have
$$\beta^* = \bigg(\frac{\sqrt{M} - \sqrt{m}}{\sqrt{M} + \sqrt{m}}\bigg)^2,$$
and
$$\max\left\{\beta,\ \frac{1}{2}\Big((1+\beta)\frac{M - m}{m + M}\Big)^2 - \beta\right\} \ge \beta^*.$$
Therefore,
$$\rho(A) \ge \sqrt{\beta^*} = \frac{\sqrt{M} - \sqrt{m}}{\sqrt{M} + \sqrt{m}}. \qquad (6)$$
Note that equality in Eq. (6) is achievable for the (optimal) values
$$\beta^* = \bigg(\frac{\sqrt{M} - \sqrt{m}}{\sqrt{M} + \sqrt{m}}\bigg)^2 \qquad \text{and} \qquad \alpha^* = \frac{2(1+\beta^*)}{m + M}.$$
In conclusion, we have
$$\min_{\alpha,\beta}\rho(A) = \frac{\sqrt{M} - \sqrt{m}}{\sqrt{M} + \sqrt{m}},$$
and the minimum is attained by some values $\alpha^* > 0$ and $\beta^* \in [0, 1)$. Therefore, the convergence rate of the heavy ball method with optimal choices of the stepsize $\alpha$ and the parameter $\beta$ is governed by
$$\frac{\|x^{k+1}\|}{\|x^k\|} \approx \frac{\sqrt{M} - \sqrt{m}}{\sqrt{M} + \sqrt{m}}.$$
It can be seen that
$$\frac{\sqrt{M} - \sqrt{m}}{\sqrt{M} + \sqrt{m}} \le \frac{M - m}{M + m},$$
so the convergence rate of the heavy ball iteration is faster than that of the steepest descent iteration (cf. Section 1.3.2).
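The rate comparison at the end is easy to observe numerically. The sketch below (our own construction) runs the heavy ball iteration with the optimal $\alpha^*, \beta^*$ against steepest descent with the optimal constant stepsize $2/(m+M)$ on a quadratic with $m = 1$, $M = 100$:

```python
import numpy as np

# Heavy ball: x_{k+1} = x_k - alpha*grad f(x_k) + beta*(x_k - x_{k-1})
# on f(x) = (1/2) x'Qx, versus steepest descent with stepsize 2/(m+M).
Q = np.diag([1.0, 100.0])
m, M = 1.0, 100.0
grad = lambda x: Q @ x

beta = ((np.sqrt(M) - np.sqrt(m)) / (np.sqrt(M) + np.sqrt(m))) ** 2
alpha = 2 * (1 + beta) / (m + M)        # the optimal values derived above
x_prev = x = np.array([1.0, 1.0])
y = x.copy()                            # steepest descent iterate
for k in range(100):
    x, x_prev = x - alpha * grad(x) + beta * (x - x_prev), x
    y = y - (2.0 / (m + M)) * grad(y)
print("heavy ball ||x||:", np.linalg.norm(x))
print("steepest   ||y||:", np.linalg.norm(y))
# heavy ball contracts like (sqrt(M)-sqrt(m))/(sqrt(M)+sqrt(m)) ~ 0.82 per
# step versus (M-m)/(M+m) ~ 0.98 for steepest descent, so ||x|| << ||y||
```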
1.3.10 www
By using the given property of the sequence $\{e^k\}$, we can obtain
$$\|e^{k+1} - e^k\| \le \beta^{k+1-\bar k}\|e^{\bar k} - e^{\bar k-1}\|, \qquad \forall\ k \ge \bar k.$$
Thus, we have
$$\begin{aligned}
\|e^m - e^k\| &\le \|e^m - e^{m-1}\| + \|e^{m-1} - e^{m-2}\| + \cdots + \|e^{k+1} - e^k\|\\
&\le \big(\beta^{m-\bar k} + \beta^{m-\bar k-1} + \cdots + \beta^{k-\bar k+1}\big)\|e^{\bar k} - e^{\bar k-1}\|\\
&\le \beta^{1-\bar k}\|e^{\bar k} - e^{\bar k-1}\|\sum_{j=k}^{m}\beta^j.
\end{aligned}$$
By choosing $k_0 \ge \bar k$ large enough, we can make $\sum_{j=k}^{m}\beta^j$ arbitrarily small for all $m, k \ge k_0$. Therefore $\{e^k\}$ is a Cauchy sequence. Let $\lim_{m\to\infty}e^m = e^*$, and let $m \to \infty$ in the inequality above, which results in
$$\|e^k - e^*\| \le \beta^{1-\bar k}\|e^{\bar k} - e^{\bar k-1}\|\sum_{j=k}^{\infty}\beta^j = \beta^{1-\bar k}\|e^{\bar k} - e^{\bar k-1}\|\frac{\beta^k}{1-\beta} = q_{\bar k}\beta^k \qquad (1)$$
for all $k \ge \bar k$, where $q_{\bar k} = \frac{\beta^{1-\bar k}}{1-\beta}\|e^{\bar k} - e^{\bar k-1}\|$. Define the sequence $\{q_k \mid 0 \le k < \bar k\}$ as follows:
$$q_k = \frac{\|e^k - e^*\|}{\beta^k}, \qquad \forall\ k,\ 0 \le k < \bar k. \qquad (2)$$
Combining (1) and (2), it can be seen that
$$\|e^k - e^*\| \le q\beta^k, \qquad \forall\ k,$$
where $q = \max_{0\le k\le\bar k}q_k$.
1.3.11 www
Since $\alpha^k$ is determined by the Armijo rule, we know that $\alpha^k = \beta^{m_k}s$, where $m_k$ is the first index $m$ for which
$$f\big(x^k - \beta^ms\nabla f(x^k)\big) - f(x^k) \le -\sigma\beta^ms\|\nabla f(x^k)\|^2. \qquad (1)$$
The second order expansion of $f$ yields
$$f\big(x^k - \beta^is\nabla f(x^k)\big) - f(x^k) = -\beta^is\|\nabla f(x^k)\|^2 + \frac{(\beta^is)^2}{2}\nabla f(x^k)'\nabla^2 f(\bar x)\nabla f(x^k),$$
for some $\bar x$ that lies in the segment joining the points $x^k - \beta^is\nabla f(x^k)$ and $x^k$. From the given property of $f$, it follows that
$$f\big(x^k - \beta^is\nabla f(x^k)\big) - f(x^k) \le -\beta^is\Big(1 - \frac{\beta^isM}{2}\Big)\|\nabla f(x^k)\|^2. \qquad (2)$$
Now, let $i_k$ be the first index $i$ for which $1 - \frac{M}{2}\beta^is \ge \sigma$, i.e.,
$$1 - \frac{M}{2}\beta^is < \sigma, \quad \forall\ i,\ 0 \le i < i_k, \qquad \text{and} \qquad 1 - \frac{M}{2}\beta^{i_k}s \ge \sigma. \qquad (3)$$
Then, from (1)-(3), we can conclude that $m_k \le i_k$. Therefore $\alpha^k \ge \hat\alpha^k$, where $\hat\alpha^k = \beta^{i_k}s$. Thus, we have
$$f\big(x^k - \alpha^k\nabla f(x^k)\big) - f(x^k) \le -\sigma\hat\alpha^k\|\nabla f(x^k)\|^2. \qquad (4)$$
Note that (3) implies
$$\sigma > 1 - \frac{M}{2}\beta^{i_k-1}s = 1 - \frac{M}{2\beta}\hat\alpha^k.$$
Hence, $\hat\alpha^k \ge 2\beta(1 - \sigma)/M$. By substituting this in (4), we obtain
$$f(x^{k+1}) - f(x^*) \le f(x^k) - f(x^*) - \frac{2\beta\sigma(1 - \sigma)}{M}\|\nabla f(x^k)\|^2. \qquad (5)$$
The given property of $f$ implies that (see Exercise 1.1.9)
$$f(x) - f(x^*) \le \frac{1}{2m}\|\nabla f(x)\|^2, \qquad \forall\ x \in \Re^n, \qquad (6)$$
$$\frac{m}{2}\|x - x^*\|^2 \le f(x) - f(x^*), \qquad \forall\ x \in \Re^n. \qquad (7)$$
By combining (5) and (6), we obtain
$$f(x^{k+1}) - f(x^*) \le r\big(f(x^k) - f(x^*)\big),$$
with $r = 1 - \frac{4m\beta\sigma(1-\sigma)}{M}$. Therefore, we have
$$f(x^k) - f(x^*) \le r^k\big(f(x^0) - f(x^*)\big), \qquad \forall\ k,$$
which combined with (7) yields
$$\|x^k - x^*\|^2 \le qr^k, \qquad \forall\ k,$$
with $q = \frac{2}{m}\big(f(x^0) - f(x^*)\big)$.
SECTION 1.4
1.4.2 www
From the proof of Prop. 1.4.1, we have
$$\|x^{k+1} - x^*\| \le M\bigg(\int_0^1\big\|\nabla g(x^*) - \nabla g\big(x^* + t(x^k - x^*)\big)\big\|\,dt\bigg)\|x^k - x^*\|.$$
By continuity of $\nabla g$, we can take $\delta$ sufficiently small to ensure that the term under the integral sign is arbitrarily small. Let $\delta_1$ be such that the term under the integral sign is less than $r/M$. Then
$$\|x^{k+1} - x^*\| \le r\|x^k - x^*\|.$$
Now, let
$$M(x) = \int_0^1\nabla g\big(x^* + t(x - x^*)\big)'\,dt.$$
We then have $g(x) = M(x)(x - x^*)$. Note that $M(x^*) = \nabla g(x^*)'$, and that $M(x^*)$ is invertible. By continuity of $\nabla g$, we can take $\delta$ to be such that the region $S_\delta$ around $x^*$ is sufficiently small so that $M(x)'M(x)$ is invertible. Let $\delta_2$ be such that $M(x)'M(x)$ is invertible for $\|x - x^*\| \le \delta_2$. Then the eigenvalues of $M(x)'M(x)$ are all positive. Let $\gamma$ and $\Gamma$ be such that
$$0 < \gamma \le \min_{\|x-x^*\|\le\delta_2}\mathrm{eig}\big(M(x)'M(x)\big) \le \max_{\|x-x^*\|\le\delta_2}\mathrm{eig}\big(M(x)'M(x)\big) \le \Gamma.$$
Then, since $\|g(x)\|^2 = (x - x^*)'M(x)'M(x)(x - x^*)$, we have
$$\gamma\|x - x^*\|^2 \le \|g(x)\|^2 \le \Gamma\|x - x^*\|^2,$$
or
$$\frac{1}{\sqrt{\Gamma}}\|g(x^{k+1})\| \le \|x^{k+1} - x^*\| \qquad \text{and} \qquad r\|x^k - x^*\| \le \frac{r}{\sqrt{\gamma}}\|g(x^k)\|.$$
Since we have already shown that $\|x^{k+1} - x^*\| \le r\|x^k - x^*\|$, we have
$$\|g(x^{k+1})\| \le \frac{r\sqrt{\Gamma}}{\sqrt{\gamma}}\|g(x^k)\|.$$
Let $\hat r = r\sqrt{\Gamma}/\sqrt{\gamma}$. Since $\Gamma/\gamma$ remains bounded as the radius shrinks while the contraction factor can be made arbitrarily small, by letting $\hat\delta$ be sufficiently small we can have $\hat r < r$. Letting $\delta = \min\{\hat\delta, \delta_2\}$, we have both desired results for any $r$.
1.4.5 www
Since $\{x^k\}$ converges to a nonsingular local minimum $x^*$ of the twice continuously differentiable function $f$ and
$$\lim_{k\to\infty}\|H^k - \nabla^2 f(x^k)\| = 0,$$
we have that
$$\lim_{k\to\infty}\|H^k - \nabla^2 f(x^*)\| = 0. \qquad (1)$$
Let $m^k$ and $m$ denote the smallest eigenvalues of $H^k$ and $\nabla^2 f(x^*)$, respectively. The positive definiteness of $\nabla^2 f(x^*)$ and Eq. (1) imply that for any $\epsilon > 0$ with $m - \epsilon > 0$ and $k_0$ large enough, we have
$$0 < m - \epsilon \le m^k \le m + \epsilon, \qquad \forall\ k \ge k_0. \qquad (2)$$
For the truncated Newton method, the direction $d^k$ is such that
$$\frac{1}{2}d^{k\prime}H^kd^k + \nabla f(x^k)'d^k < 0, \qquad \forall\ k \ge 0. \qquad (3)$$
Define $q^k = \frac{d^k}{\|\nabla f(x^k)\|}$ and $p^k = \frac{\nabla f(x^k)}{\|\nabla f(x^k)\|}$. Then Eq. (3) can be written as
$$\frac{1}{2}q^{k\prime}H^kq^k + p^{k\prime}q^k < 0, \qquad \forall\ k \ge 0.$$
By the positive definiteness of $H^k$, we have
$$\frac{m^k}{2}\|q^k\|^2 < \|q^k\|, \qquad \forall\ k \ge 0,$$
where we have used the fact that $\|p^k\| = 1$. Combining this and Eq. (2), we obtain that the sequence $\{q^k\}$ is bounded. Thus, we have
$$\begin{aligned}
\lim_{k\to\infty}\frac{\|d^k + (\nabla^2 f(x^*))^{-1}\nabla f(x^k)\|}{\|\nabla f(x^k)\|} &\le M\lim_{k\to\infty}\frac{\|\nabla^2 f(x^*)d^k + \nabla f(x^k)\|}{\|\nabla f(x^k)\|}\\
&= M\lim_{k\to\infty}\|\nabla^2 f(x^*)q^k + p^k\|\\
&\le M\lim_{k\to\infty}\|\nabla^2 f(x^*) - H^k\|\,\|q^k\| + M\lim_{k\to\infty}\|H^kq^k + p^k\|\\
&= 0,
\end{aligned}$$
where $M = \|(\nabla^2 f(x^*))^{-1}\|$. Now all the conditions of Prop. 1.3.2 are satisfied, so $\{\|x^k - x^*\|\}$ converges superlinearly.
1.4.6 www
For the function $f(x) = \|x\|^3$, we have
$$\nabla f(x) = 3\|x\|x, \qquad \nabla^2 f(x) = 3\|x\|I + \frac{3}{\|x\|}xx' = \frac{3}{\|x\|}\big(\|x\|^2I + xx'\big).$$
Using the formula $(A + CBC')^{-1} = A^{-1} - A^{-1}C(B^{-1} + C'A^{-1}C)^{-1}C'A^{-1}$ [Eq. (A.7) from Appendix A], we have
$$\big(\|x\|^2I + xx'\big)^{-1} = \frac{1}{\|x\|^2}\Big(I - \frac{1}{2\|x\|^2}xx'\Big),$$
and so
$$\big(\nabla^2 f(x)\big)^{-1} = \frac{1}{3\|x\|}\Big(I - \frac{1}{2\|x\|^2}xx'\Big).$$
Newton's method is then
$$\begin{aligned}
x^{k+1} &= x^k - \alpha\big(\nabla^2 f(x^k)\big)^{-1}\nabla f(x^k)\\
&= x^k - \alpha\frac{1}{3\|x^k\|}\Big(I - \frac{1}{2\|x^k\|^2}x^k(x^k)'\Big)3\|x^k\|x^k\\
&= x^k - \alpha\Big(x^k - \frac{1}{2\|x^k\|^2}x^k\|x^k\|^2\Big)\\
&= x^k - \alpha\Big(x^k - \frac{1}{2}x^k\Big)\\
&= \Big(1 - \frac{\alpha}{2}\Big)x^k.
\end{aligned}$$
Thus for $0 < \alpha < 2$, Newton's method converges linearly to $x^* = 0$. For $\alpha = 2$ the method converges in one step. Note that the method also converges linearly for $2 < \alpha < 4$. Proposition 1.4.1 does not apply since $\nabla^2 f(0)$ is not invertible; otherwise, we would have superlinear convergence.

Alternatively, instead of inverting $\nabla^2 f(x)$, we can calculate the Newton direction at a vector $x$ by guessing (based on symmetry) that it has the form $\gamma x$ for some scalar $\gamma$, and by determining the value of $\gamma$ through the equation $\nabla^2 f(x)(\gamma x) = -\nabla f(x)$. In this way, we can verify that $\gamma = -1/2$.
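The closed form $x^{k+1} = (1 - \alpha/2)x^k$ can be confirmed numerically without any of the algebra above. A minimal sketch (our own illustration):

```python
import numpy as np

# Damped Newton step on f(x) = ||x||^3, built from the explicit gradient
# and Hessian; the ratio ||x_{k+1}||/||x_k|| should equal |1 - alpha/2|.
def newton_step(x, alpha):
    n = len(x)
    nx = np.linalg.norm(x)
    H = 3 * nx * np.eye(n) + (3 / nx) * np.outer(x, x)   # Hessian of ||x||^3
    g = 3 * nx * x                                       # gradient of ||x||^3
    return x - alpha * np.linalg.solve(H, g)

x = np.array([2.0, -1.0, 0.5])
alpha = 0.8
for k in range(5):
    x_new = newton_step(x, alpha)
    print(np.linalg.norm(x_new) / np.linalg.norm(x))     # prints 0.6 = 1 - 0.8/2
    x = x_new
```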
SECTION 1.6
1.6.3 www
We have that
$$f(x^{k+1}) \le \max_i\big(1 + \lambda_iP_k(\lambda_i)\big)^2f(x^0) \qquad (1)$$
for any polynomial $P_k$ of degree $k$ and any $k$, where $\{\lambda_i\}$ is the set of eigenvalues of $Q$. Choose $P_k$ such that
$$1 + \lambda P_k(\lambda) = \frac{z_1 - \lambda}{z_1}\cdot\frac{z_2 - \lambda}{z_2}\cdots\frac{z_k - \lambda}{z_k}.$$
Define $I_j = [z_j - \delta_j,\ z_j + \delta_j]$ for $j = 1, \ldots, k$. Since $\lambda_i \in I_j$ for some $j$, we have
$$\big(1 + \lambda_iP_k(\lambda_i)\big)^2 \le \max_{\lambda\in I_j}\big(1 + \lambda P_k(\lambda)\big)^2.$$
Hence
$$\max_i\big(1 + \lambda_iP_k(\lambda_i)\big)^2 \le \max_{1\le j\le k}\max_{\lambda\in I_j}\big(1 + \lambda P_k(\lambda)\big)^2. \qquad (2)$$
For any $j$ and $\lambda \in I_j$ we have
$$\big(1 + \lambda P_k(\lambda)\big)^2 = \frac{(z_1 - \lambda)^2}{z_1^2}\cdot\frac{(z_2 - \lambda)^2}{z_2^2}\cdots\frac{(z_k - \lambda)^2}{z_k^2} \le \frac{(z_j + \delta_j - z_1)^2(z_j + \delta_j - z_2)^2\cdots(z_j + \delta_j - z_{j-1})^2\,\delta_j^2}{z_1^2\cdots z_j^2}.$$
Here we used the fact that $\lambda \in I_j$ implies $\lambda < z_l$ for $l = j+1, \ldots, k$, and therefore
$$\frac{(z_l - \lambda)^2}{z_l^2} \le 1, \qquad \text{for all } l = j+1, \ldots, k.$$
Thus, from (2) we obtain
$$\max_i\big(1 + \lambda_iP_k(\lambda_i)\big)^2 \le R, \qquad (3)$$
where
$$R = \max\left\{\frac{\delta_1^2}{z_1^2},\ \frac{\delta_2^2(z_2 + \delta_2 - z_1)^2}{z_1^2z_2^2},\ \ldots,\ \frac{\delta_k^2(z_k + \delta_k - z_1)^2\cdots(z_k + \delta_k - z_{k-1})^2}{z_1^2z_2^2\cdots z_k^2}\right\}.$$
The desired estimate follows from (1) and (3).
1.6.4 www
It suffices to show that the subspace spanned by $g^0, g^1, \ldots, g^{k-1}$ is the same as the subspace spanned by $g^0, Qg^0, \ldots, Q^{k-1}g^0$, for $k = 1, \ldots, n$. We will prove this by induction. Clearly, for $k = 1$ the statement is true. Assume it is true for $k - 1 < n - 1$, i.e.,
$$\mathrm{span}\{g^0, g^1, \ldots, g^{k-1}\} = \mathrm{span}\{g^0, Qg^0, \ldots, Q^{k-1}g^0\},$$
where $\mathrm{span}\{v^0, \ldots, v^l\}$ denotes the subspace spanned by the vectors $v^0, \ldots, v^l$. Assume that $g^k \ne 0$ (i.e., $x^k \ne x^*$). Since $g^k = \nabla f(x^k)$ and $x^k$ minimizes $f$ over the manifold $x^0 + \mathrm{span}\{g^0, g^1, \ldots, g^{k-1}\}$, from our assumption we have that
$$g^k = Qx^k - b = Q\Big(x^0 + \sum_{i=0}^{k-1}\xi_iQ^ig^0\Big) - b = Qx^0 - b + \sum_{i=0}^{k-1}\xi_iQ^{i+1}g^0.$$
The fact that $g^0 = Qx^0 - b$ yields
$$g^k = g^0 + \xi_0Qg^0 + \xi_1Q^2g^0 + \cdots + \xi_{k-2}Q^{k-1}g^0 + \xi_{k-1}Q^kg^0. \qquad (1)$$
If $\xi_{k-1} = 0$, then from (1) and the inductive hypothesis it follows that
$$g^k \in \mathrm{span}\{g^0, g^1, \ldots, g^{k-1}\}. \qquad (2)$$
We know that $g^k$ is orthogonal to $g^0, \ldots, g^{k-1}$. Therefore (2) is possible only if $g^k = 0$, which contradicts our assumption. Hence $\xi_{k-1} \ne 0$. If $Q^kg^0 \in \mathrm{span}\{g^0, Qg^0, \ldots, Q^{k-1}g^0\}$, then (1) and our inductive hypothesis again imply (2), which is not possible. Thus the vectors $g^0, Qg^0, \ldots, Q^{k-1}g^0, Q^kg^0$ are linearly independent. This combined with (1) and the linear independence of the vectors $g^0, \ldots, g^{k-1}, g^k$ implies that
$$\mathrm{span}\{g^0, g^1, \ldots, g^{k-1}, g^k\} = \mathrm{span}\{g^0, Qg^0, \ldots, Q^{k-1}g^0, Q^kg^0\},$$
which completes the proof.
1.6.5 www
Let $x^k$ be the sequence generated by the conjugate gradient method, and let $d^k$ be the sequence of the corresponding $Q$-conjugate directions. We know that $x^{k+1}$ minimizes $f$ over
$$x^0 + \mathrm{span}\{d^0, d^1, \ldots, d^k\}.$$
Let $\tilde x^k$ be the sequence generated by the method described in the exercise. In particular, $\tilde x^1$ is generated from $x^0$ by steepest descent and line minimization, and for $k \ge 1$, $\tilde x^{k+1}$ minimizes $f$ over the two-dimensional linear manifold
$$\tilde x^k + \mathrm{span}\{\tilde g^k,\ \tilde x^k - \tilde x^{k-1}\},$$
where $\tilde g^k = \nabla f(\tilde x^k)$. We will show by induction that $x^k = \tilde x^k$ for all $k \ge 1$.

Indeed, we have by construction $x^1 = \tilde x^1$. Suppose that $x^i = \tilde x^i$ for $i = 1, \ldots, k$. We will show that $x^{k+1} = \tilde x^{k+1}$. We have that $\tilde g^k$ is equal to $g^k = \beta^kd^{k-1} - d^k$, so it belongs to the subspace spanned by $d^{k-1}$ and $d^k$. Also, $\tilde x^k - \tilde x^{k-1}$ is equal to $x^k - x^{k-1} = \alpha^{k-1}d^{k-1}$. Thus
$$\mathrm{span}\{\tilde g^k,\ \tilde x^k - \tilde x^{k-1}\} = \mathrm{span}\{d^{k-1}, d^k\}.$$
Observe that $x^k$ belongs to
$$x^0 + \mathrm{span}\{d^0, d^1, \ldots, d^{k-1}\},$$
so
$$x^0 + \mathrm{span}\{d^0, d^1, \ldots, d^k\} \supset x^k + \mathrm{span}\{d^{k-1}, d^k\} \supset x^k + \mathrm{span}\{d^k\}.$$
The vector $x^{k+1}$ minimizes $f$ over the linear manifold on the left-hand side above, and also over the linear manifold on the right-hand side above (by the definition of a conjugate direction method). Moreover, $\tilde x^{k+1}$ minimizes $f$ over the linear manifold in the middle above. Hence $x^{k+1} = \tilde x^{k+1}$.
1.6.6 (PARTAN)
Suppose that $x^1, \ldots, x^k$ have been generated by the method of Exercise 1.6.5, which, by the result of that exercise, is equivalent to the conjugate gradient method. Let $y^k$ and $x^{k+1}$ be generated by the two line searches given in the exercise.

By the definition of the conjugate gradient method, $x^k$ minimizes $f$ over
$$x^0 + \mathrm{span}\{g^0, g^1, \ldots, g^{k-1}\},$$
so that
$$g^k \perp \mathrm{span}\{g^0, g^1, \ldots, g^{k-1}\},$$
and in particular
$$g^k \perp g^{k-1}. \qquad (1)$$
Also, since $y^k$ is the vector that minimizes $f$ over the line $y_\alpha = x^k - \alpha g^k$, $\alpha \ge 0$, we have
$$g^k \perp \nabla f(y^k). \qquad (2)$$
Any vector on the line passing through $x^{k-1}$ and $y^k$ has the form
$$y = \alpha x^{k-1} + (1 - \alpha)y^k, \qquad \alpha \in \Re,$$
and the gradient of $f$ at such a vector has the form
$$\nabla f\big(\alpha x^{k-1} + (1 - \alpha)y^k\big) = Q\big(\alpha x^{k-1} + (1 - \alpha)y^k\big) - b = \alpha(Qx^{k-1} - b) + (1 - \alpha)(Qy^k - b) = \alpha g^{k-1} + (1 - \alpha)\nabla f(y^k). \qquad (3)$$
From Eqs. (1)-(3), it follows that $g^k$ is orthogonal to the gradient $\nabla f(y)$ of any vector $y$ on the line passing through $x^{k-1}$ and $y^k$.

In particular, for the vector $x^{k+1}$ that minimizes $f$ over this line, we have that $\nabla f(x^{k+1})$ is orthogonal to $g^k$. Furthermore, because $x^{k+1}$ minimizes $f$ over the line passing through $x^{k-1}$ and $y^k$, $\nabla f(x^{k+1})$ is orthogonal to $y^k - x^{k-1}$. Thus, $\nabla f(x^{k+1})$ is orthogonal to
$$\mathrm{span}\{g^k,\ y^k - x^{k-1}\},$$
and hence also to
$$\mathrm{span}\{g^k,\ x^k - x^{k-1}\},$$
since $x^{k-1}$, $x^k$, and $y^k$ form a triangle whose side connecting $x^k$ and $y^k$ is proportional to $g^k$. Thus $x^{k+1}$ minimizes $f$ over
$$x^k + \mathrm{span}\{g^k,\ x^k - x^{k-1}\},$$
and it is equal to the iterate generated by the algorithm of Exercise 1.6.5.
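The equivalence just established is easy to confirm numerically. Below is a small sketch (our own construction, with an arbitrary random quadratic) that runs the conjugate gradient method and the two-line-search PARTAN scheme side by side and prints the distance between corresponding iterates:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 5)); Q = A @ A.T + 5 * np.eye(5)
b = rng.standard_normal(5)
grad = lambda x: Q @ x - b

def line_min(x, d):                 # exact minimization of f along x + t*d
    t = -(grad(x) @ d) / (d @ Q @ d)
    return x + t * d

# conjugate gradient iterates
x = np.zeros(5); g = grad(x); d = -g
cg = []
for _ in range(5):
    x = line_min(x, d)
    g_new = grad(x)
    d = -g_new + (g_new @ g_new) / (g @ g) * d
    g = g_new
    cg.append(x.copy())

# PARTAN iterates
x_prev = np.zeros(5)
x = line_min(x_prev, -grad(x_prev))          # first step: steepest descent
pt = [x.copy()]
for _ in range(4):
    y = line_min(x, -grad(x))                # steepest-descent line search
    x, x_prev = line_min(y, y - x_prev), x   # line through x^{k-1} and y^k
    pt.append(x.copy())

for a, c in zip(cg, pt):
    print(np.linalg.norm(a - c))             # ~ 0 for every k
```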
1.6.7 www
The objective is to minimize over $\Re^n$ the positive semidefinite quadratic function
$$f(x) = \frac{1}{2}x'Qx + b'x.$$
The value of $x^k$ following the $k$th iteration is
$$x^k = \arg\min\Big\{f(x)\ \Big|\ x = x^0 + \sum_{i=0}^{k-1}\gamma_id^i,\ \gamma_i \in \Re\Big\} = \arg\min\Big\{f(x)\ \Big|\ x = x^0 + \sum_{i=0}^{k-1}\delta_ig^i,\ \delta_i \in \Re\Big\},$$
where the $d^i$ are the conjugate directions and the $g^i$ are the gradient vectors. At the beginning of the $(k+1)$st iteration, there are two possibilities:

(1) $g^k = 0$: In this case, $x^k$ is the global minimum since $f(x)$ is a convex function.

(2) $g^k \ne 0$: In this case, a new conjugate direction $d^k$ is generated. Here, we also have two possibilities:

(a) A minimum is attained along the direction $d^k$ and defines $x^{k+1}$.

(b) A minimum along the direction $d^k$ does not exist. This occurs if there exists a direction $d$ in the manifold spanned by $d^0, \ldots, d^k$ such that $d'Qd = 0$ and $b'd \ne 0$. The problem in this case has no solution.

If the problem has no solution (which occurs if there is some vector $d$ such that $d'Qd = 0$ but $b'd \ne 0$), the algorithm will terminate because the line minimization problem along such a direction $d$ is unbounded from below. If the problem has infinitely many solutions (which will happen if there is some vector $d$ such that $d'Qd = 0$ and $b'd = 0$), then the algorithm will proceed as if the matrix $Q$ were positive definite, i.e., it will find one of the solutions (case 1 occurs). However, in both situations the algorithm will terminate in at most $m$ steps, where $m$ is the rank of the matrix $Q$, because the manifold
$$\Big\{x \in \Re^n\ \Big|\ x = x^0 + \sum_{i=0}^{k-1}\gamma_id^i,\ \gamma_i \in \Re\Big\}$$
will not expand for $k > m$.
1.6.8 www
Let $S_1$ and $S_2$ be subspaces with $S_1 \cap S_2$ a proper subspace of $\Re^n$ (i.e., a subspace of $\Re^n$ other than $\{0\}$ and $\Re^n$ itself). Suppose that the subspace $S_1 \cap S_2$ is spanned by linearly independent vectors $v^k$, $k \in K \subseteq \{1, 2, \ldots, n\}$. Assume that $x^1$ and $x^2$ minimize the given quadratic function $f$ over the manifolds $M_1$ and $M_2$ that are parallel to the subspaces $S_1$ and $S_2$, respectively, i.e.,
$$x^1 = \arg\min_{x\in M_1}f(x) \qquad \text{and} \qquad x^2 = \arg\min_{x\in M_2}f(x),$$
where $M_1 = y^1 + S_1$ and $M_2 = y^2 + S_2$, with some vectors $y^1, y^2 \in \Re^n$. Assume also that $x^1 \ne x^2$. Without loss of generality we may assume that $f(x^2) > f(x^1)$. Since $x^2 \notin M_1$, the vectors $x^2 - x^1$ and $\{v^k \mid k \in K\}$ are linearly independent. From the definition of $x^1$ and $x^2$ we have that
$$\frac{d}{dt}f(x^1 + tv^k)\Big|_{t=0} = 0 \qquad \text{and} \qquad \frac{d}{dt}f(x^2 + tv^k)\Big|_{t=0} = 0,$$
for any $v^k$. When this is written out, we get
$$x^{1\prime}Qv^k - b'v^k = 0 \qquad \text{and} \qquad x^{2\prime}Qv^k - b'v^k = 0.$$
Subtraction of the above two equalities yields
$$(x^1 - x^2)'Qv^k = 0, \qquad \forall\ k \in K.$$
Hence, $x^1 - x^2$ is $Q$-conjugate to all vectors in the intersection $S_1 \cap S_2$. We can use this property to construct a conjugate direction method that does not evaluate gradients and uses only line minimizations, in the following way.

Initialization: Choose any direction $d^1$ and points $y^1$ and $z^1$ such that $M_1^1 = y^1 + \mathrm{span}\{d^1\}$, $M_2^1 = z^1 + \mathrm{span}\{d^1\}$, $M_1^1 \ne M_2^1$. Let $d^2 = x_1^1 - x_2^1$, where $x_i^1 = \arg\min_{x\in M_i^1}f(x)$ for $i = 1, 2$.

Generating a new conjugate direction: Suppose that $Q$-conjugate directions $d^1, d^2, \ldots, d^k$, $k < n$, have been generated. Let $M_1^k = y^k + \mathrm{span}\{d^1, \ldots, d^k\}$ and $x_1^k = \arg\min_{x\in M_1^k}f(x)$. If $x_1^k$ is not optimal, there is a point $z^k$ such that $f(z^k) < f(x_1^k)$. Starting from the point $z^k$ we again search in the directions $d^1, d^2, \ldots, d^k$, obtaining a point $x_2^k$ which minimizes $f$ over the manifold $M_2^k$ generated by $z^k$ and $d^1, d^2, \ldots, d^k$. Since $f(x_2^k) \le f(z^k)$, we have
$$f(x_2^k) < f(x_1^k).$$
As both $x_1^k$ and $x_2^k$ minimize $f$ over manifolds that are parallel to $\mathrm{span}\{d^1, \ldots, d^k\}$, setting $d^{k+1} = x_2^k - x_1^k$ we have that $d^1, \ldots, d^k, d^{k+1}$ are $Q$-conjugate directions (here we have used the property established above).

In this procedure it is important to have a step which, given a nonoptimal point $x$, generates a point $y$ for which $f(y) < f(x)$. If $x$ is an optimal solution, then the step must indicate this fact. Simply, the step must first determine whether $x$ is optimal, and if $x$ is not optimal, it must find a better point. A typical example of such a step is one iteration of the cyclic coordinate descent method, which avoids calculation of derivatives.
SECTION 1.7
1.7.1 www
The proof is by induction. Suppose the relation $D^kq^i = p^i$ holds for all $k$ and $i \le k - 1$. The relation $D^{k+1}q^i = p^i$ also holds for $i = k$ because of the following calculation:
$$D^{k+1}q^k = D^kq^k + \frac{y^ky^{k\prime}q^k}{q^{k\prime}y^k} = D^kq^k + y^k = D^kq^k + (p^k - D^kq^k) = p^k.$$
For $i < k$, we have, using the induction hypothesis $D^kq^i = p^i$,
$$D^{k+1}q^i = D^kq^i + \frac{y^k(p^k - D^kq^k)'q^i}{q^{k\prime}y^k} = p^i + \frac{y^k(p^{k\prime}q^i - q^{k\prime}p^i)}{q^{k\prime}y^k}.$$
Since $p^{k\prime}q^i = p^{k\prime}Qp^i = q^{k\prime}p^i$, the second term on the right-hand side vanishes and we have $D^{k+1}q^i = p^i$. This completes the proof.

To show that $(D^n)^{-1} = Q$, note that from the equation $D^{k+1}q^i = p^i$ we have
$$D^n = \big[p^0 \cdots p^{n-1}\big]\big[q^0 \cdots q^{n-1}\big]^{-1}, \qquad (*)$$
while from the equation $Qp^i = Q(x^{i+1} - x^i) = (Qx^{i+1} - b) - (Qx^i - b) = \nabla f(x^{i+1}) - \nabla f(x^i) = q^i$, we have
$$Q\big[p^0 \cdots p^{n-1}\big] = \big[q^0 \cdots q^{n-1}\big],$$
or equivalently
$$Q = \big[q^0 \cdots q^{n-1}\big]\big[p^0 \cdots p^{n-1}\big]^{-1}. \qquad (**)$$
(Note here that the matrix $[p^0 \cdots p^{n-1}]$ is invertible, since both $Q$ and $[q^0 \cdots q^{n-1}]$ are invertible by assumption.) By comparing Eqs. (*) and (**), it follows that $(D^n)^{-1} = Q$.
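The hereditary relation $D^kq^i = p^i$ and the terminal identity $(D^n)^{-1} = Q$ can be checked numerically. The sketch below (our own illustration) applies the rank-one update of this exercise with randomly chosen steps $p^k$; the denominators $q^{k\prime}y^k$ are assumed nonzero, which holds generically:

```python
import numpy as np

# Rank-one update of this exercise:
#   D_{k+1} = D_k + y_k y_k' / (q_k' y_k),   y_k = p_k - D_k q_k,
# with q_k = Q p_k.  After n linearly independent steps, D_n = Q^{-1}.
rng = np.random.default_rng(4)
n = 5
A = rng.standard_normal((n, n)); Q = A @ A.T + n * np.eye(n)

D = np.eye(n)
for k in range(n):
    p = rng.standard_normal(n)      # any steps with independent q's will do
    q = Q @ p
    y = p - D @ q
    D = D + np.outer(y, y) / (q @ y)   # assumes q'y != 0 (generic case)
print(np.max(np.abs(D @ Q - np.eye(n))))   # ~ 0, so D_n = Q^{-1}
```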
1.7.2 www
For simplicity, we drop superscripts. The BFGS update is given by
$$\begin{aligned}
\bar D &= D + \frac{pp'}{p'q} - \frac{Dqq'D}{q'Dq} + q'Dq\left(\frac{p}{p'q} - \frac{Dq}{q'Dq}\right)\left(\frac{p}{p'q} - \frac{Dq}{q'Dq}\right)'\\
&= D + \frac{pp'}{p'q} - \frac{Dqq'D}{q'Dq} + q'Dq\left(\frac{pp'}{(p'q)^2} - \frac{Dqp' + pq'D}{(p'q)(q'Dq)} + \frac{Dqq'D}{(q'Dq)^2}\right)\\
&= D + \left(1 + \frac{q'Dq}{p'q}\right)\frac{pp'}{p'q} - \frac{Dqp' + pq'D}{p'q}.
\end{aligned}$$
1.7.3 www
(a) For simplicity, we drop superscripts. Let $V = I - \rho qp'$, where $\rho = 1/(q'p)$. We have
$$\begin{aligned}
V'DV + \rho pp' &= (I - \rho qp')'D(I - \rho qp') + \rho pp'\\
&= D - \rho(Dqp' + pq'D) + \rho^2pq'Dqp' + \rho pp'\\
&= D - \frac{Dqp' + pq'D}{q'p} + \frac{(q'Dq)(pp')}{(q'p)^2} + \frac{pp'}{q'p}\\
&= D + \left(1 + \frac{q'Dq}{p'q}\right)\frac{pp'}{p'q} - \frac{Dqp' + pq'D}{p'q},
\end{aligned}$$
and the result now follows using the alternative BFGS update formula of Exercise 1.7.2.

(b) We have, by using repeatedly the update formula for $D$ of part (a),
$$D^k = V^{k-1\prime}D^{k-1}V^{k-1} + \rho^{k-1}p^{k-1}p^{k-1\prime} = V^{k-1\prime}V^{k-2\prime}D^{k-2}V^{k-2}V^{k-1} + \rho^{k-2}V^{k-1\prime}p^{k-2}p^{k-2\prime}V^{k-1} + \rho^{k-1}p^{k-1}p^{k-1\prime},$$
and proceeding similarly,
$$\begin{aligned}
D^k = V^{k-1\prime}V^{k-2\prime}\cdots V^{0\prime}D^0V^0\cdots V^{k-2}V^{k-1} &+ \rho^0V^{k-1\prime}\cdots V^{1\prime}p^0p^{0\prime}V^1\cdots V^{k-1}\\
&+ \rho^1V^{k-1\prime}\cdots V^{2\prime}p^1p^{1\prime}V^2\cdots V^{k-1} + \cdots\\
&+ \rho^{k-2}V^{k-1\prime}p^{k-2}p^{k-2\prime}V^{k-1} + \rho^{k-1}p^{k-1}p^{k-1\prime}.
\end{aligned}$$
Thus, to calculate the direction $-D^k\nabla f(x^k)$, we need only store $D^0$ and the past vectors $p^i$, $q^i$, $i = 0, 1, \ldots, k-1$, and perform the matrix-vector multiplications needed using the above formula for $D^k$. Note that multiplication of a matrix $V^i$ or $V^{i\prime}$ with any vector is relatively simple: it requires only two vector operations, one inner product and one vector addition.
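Part (a) is a one-line numerical check. The following sketch (our own, with arbitrary data) compares the product form $V'DV + \rho pp'$ against the BFGS formula of Exercise 1.7.2:

```python
import numpy as np

# Verify V'DV + rho*pp' equals the BFGS update, where
# V = I - rho*qp' and rho = 1/(q'p).
rng = np.random.default_rng(5)
n = 4
D = np.eye(n) + 0.3 * np.ones((n, n))        # a symmetric positive definite D
p, q = rng.standard_normal(n), rng.standard_normal(n)
if q @ p < 0:
    q = -q                                   # ensure the curvature q'p > 0
rho = 1.0 / (q @ p)

V = np.eye(n) - rho * np.outer(q, p)
D_prod = V.T @ D @ V + rho * np.outer(p, p)

# the BFGS formula from Exercise 1.7.2
D_bfgs = (D + (1 + (q @ D @ q) * rho) * rho * np.outer(p, p)
            - rho * (np.outer(D @ q, p) + np.outer(p, D @ q)))
print(np.max(np.abs(D_prod - D_bfgs)))      # ~ 0
```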
1.7.4 www
Suppose that $D$ is updated by the DFP formula and $H$ is updated by the BFGS formula. Thus the update formulas are
$$\bar D = D + \frac{pp'}{p'q} - \frac{Dqq'D}{q'Dq},$$
$$\bar H = H + \left(1 + \frac{p'Hp}{q'p}\right)\frac{qq'}{q'p} - \frac{Hpq' + qp'H}{q'p}.$$
If we assume that $HD$ is equal to the identity $I$ and form the product $\bar H\bar D$ using the above formulas, we can verify with a straightforward calculation that $\bar H\bar D$ is equal to $I$. Thus, if the initial $H$ and $D$ are inverses of each other, the above updating formulas will generate (at each step) matrices that are inverses of each other.
1.7.5 www
(a) By pre- and postmultiplying the DFP update formula
$$\bar D = D + \frac{pp'}{p'q} - \frac{Dqq'D}{q'Dq}$$
with $Q^{1/2}$, we obtain
$$Q^{1/2}\bar DQ^{1/2} = Q^{1/2}DQ^{1/2} + \frac{Q^{1/2}pp'Q^{1/2}}{p'q} - \frac{Q^{1/2}Dqq'DQ^{1/2}}{q'Dq}.$$
Let
$$\bar R = Q^{1/2}\bar DQ^{1/2}, \qquad R = Q^{1/2}DQ^{1/2}, \qquad r = Q^{1/2}p, \qquad q = Qp = Q^{1/2}r.$$
Then the DFP formula is written as
$$\bar R = R + \frac{rr'}{r'r} - \frac{Rrr'R}{r'Rr}.$$
Consider the matrix
$$P = R - \frac{Rrr'R}{r'Rr}.$$
From the interlocking eigenvalues lemma, the eigenvalues $\mu_1, \ldots, \mu_n$ of $P$ satisfy
$$\mu_1 \le \lambda_1 \le \mu_2 \le \cdots \le \mu_n \le \lambda_n,$$
where $\lambda_1, \ldots, \lambda_n$ are the eigenvalues of $R$. We have $Pr = 0$, so $0$ is an eigenvalue of $P$ and $r$ is a corresponding eigenvector. Hence, since $\lambda_1 > 0$, we have $\mu_1 = 0$. Consider the matrix
$$\bar R = P + \frac{rr'}{r'r}.$$
We have $\bar Rr = r$, so $1$ is an eigenvalue of $\bar R$. The other eigenvalues are the eigenvalues $\mu_2, \ldots, \mu_n$ of $P$, since their corresponding eigenvectors $e_2, \ldots, e_n$ are orthogonal to $r$, so that
$$\bar Re_i = Pe_i = \mu_ie_i, \qquad i = 2, \ldots, n.$$

(b) We have
$$\lambda_1 \le \frac{r'Rr}{r'r} \le \lambda_n,$$
so if we multiply the matrix $R$ with $r'r/(r'Rr)$, its eigenvalue range shifts so that it contains $1$. Since
$$\frac{r'r}{r'Rr} = \frac{p'Qp}{p'Q^{1/2}RQ^{1/2}p} = \frac{p'q}{q'Q^{-1/2}RQ^{-1/2}q} = \frac{p'q}{q'Dq},$$
multiplication of $R$ by $r'r/(r'Rr)$ is equivalent to multiplication of $D$ by $p'q/(q'Dq)$.

(c) In the case of the BFGS update
$$\bar D = D + \left(1 + \frac{q'Dq}{p'q}\right)\frac{pp'}{p'q} - \frac{Dqp' + pq'D}{p'q}$$
(cf. Exercise 1.7.2), we again pre- and postmultiply with $Q^{1/2}$. We obtain
$$\bar R = R + \left(1 + \frac{r'Rr}{r'r}\right)\frac{rr'}{r'r} - \frac{Rrr' + rr'R}{r'r},$$
and an analysis similar to the ones in parts (a) and (b) goes through.
1.7.6 www
(a) We use induction. Assume that the method coincides with the conjugate gradient method up to iteration $k$. For simplicity, denote for all $k$,
$$g^k = \nabla f(x^k).$$
We have, using the facts $p^{k\prime}g^{k+1} = 0$ and $p^k = \alpha^kd^k$,
$$\begin{aligned}
d^{k+1} = -D^{k+1}g^{k+1} &= -\left[I + \left(1 + \frac{q^{k\prime}q^k}{p^{k\prime}q^k}\right)\frac{p^kp^{k\prime}}{p^{k\prime}q^k} - \frac{q^kp^{k\prime} + p^kq^{k\prime}}{p^{k\prime}q^k}\right]g^{k+1}\\
&= -g^{k+1} + \frac{p^kq^{k\prime}g^{k+1}}{p^{k\prime}q^k}\\
&= -g^{k+1} + \frac{(g^{k+1} - g^k)'g^{k+1}}{d^{k\prime}q^k}\,d^k.
\end{aligned}$$
The argument given at the end of the proof of Prop. 1.6.1 shows that this formula is the same as the conjugate gradient formula.

(b) Use a scaling argument, whereby we work in the transformed coordinate system $y = D^{-1/2}x$, in which the matrix $D$ becomes the identity.
37

NOTE This solutions manual is continuously updated and improved. Portions of the manual, involving primarily theoretical exercises, have been posted on the internet at the book’s www page http://www.athenasc.com/nonlinbook.html Many thanks are due to several people who have contributed solutions, and particularly to Angelia Nedic, Asuman Ozdaglar, and Cynara Wu.

Last Updated: May 2005

2

Section 1.1

Solutions Chapter 1
SECTION 1.1

1.1.9

www

For any x, y ∈ Rn , from the second order expansion (see Appendix A, Proposition A.23) we have 1 f (y) − f (x) = (y − x) ∇f (x) + (y − x) ∇2 f (z)(y − x), 2 (1)

where z is some point of the line segment joining x and y. Setting x = 0 in (1) and using the given property of f , it can be seen that f is coercive. Therefore, there exists x∗ ∈ Rn such that f (x∗ ) = inf x∈Rn f (x) (see Proposition A.8 in Appendix A). The condition m||y||2 ≤ y ∇2 f (x)y, ∀ x, y ∈ Rn ,

is equivalent to strong convexity of f . Strong convexity guarantees that there is a unique global minimum x∗ . By using the given property of f and the expansion (1), we obtain (y − x) ∇f (x) + m M ||y − x||2 ≤ f (y) − f (x) ≤ (y − x) ∇f (x) + ||y − x||2 . 2 2

Taking the minimum over y ∈ Rn in the expression above gives
y∈Rn

min

(y − x) ∇f (x) +

m M ||y − x||2 ≤ f (x∗ ) − f (x) ≤ min (y − x) ∇f (x) + ||y − x||2 . y∈Rn 2 2

Note that for any a > 0 1 a min (y − x) ∇f (x) + ||y − x||2 = − ||∇f (x)||2 , n y∈R 2 2a and the minimum is attained for y = x − obtain − 1 1 ||∇f (x)||2 ≤ f (x∗ ) − f (x) ≤ − ||∇f (x)||2 . 2m 2M
∇f (x) a .

Using this relation for a = m and a = M , we

The first chain of inequalities follows from here. To show the second relation, use the expansion (1) at the point x = x∗ , and note that ∇f (x∗ ) = 0, so that f (y) − f (x∗ ) = 1 (y − x∗ ) ∇2 f (z)(y − x∗ ). 2

The rest follows immediately from here and the given property of the function f .

3
