
JOURNAL OF OPTIMIZATION THEORY AND APPLICATIONS: Vol. 20, No. 1, SEPTEMBER 1976

TECHNICAL NOTE
Conjugate Gradient
Versus Steepest Descent 1
J. C. ALLWRIGHT 2

Communicated by D. Q. Mayne

Abstract. It is known that the conjugate-gradient algorithm is at least as good as the steepest-descent algorithm for minimizing quadratic functions. It is shown here that the conjugate-gradient algorithm is actually superior to the steepest-descent algorithm in that, in the generic case, at each iteration it yields a lower cost than does the steepest-descent algorithm, when both start at the same point.

Key Words. Convergence of optimization algorithms, steepest-descent methods, conjugate-gradient methods.

1. Introduction

A new result concerning the behavior of the steepest-descent and conjugate-gradient algorithms for minimizing quadratic functions on R^n is developed here.
To establish notation, the algorithms are stated next for a quadratic cost function

J[x] = ½⟨x, Qx⟩ − ⟨b, x⟩, (1)

with x ∈ R^n and Q^T = Q ≥ 0. In addition, b ∈ R[Q] (the range of Q), to ensure that J achieves its minimum on R^n.
The conjugate-gradient algorithm is as follows:

(i) choose x̂_0 ∈ R^n, set d_0 = −ĝ_0, k = 0;
(ii) set α̂_k = arg min_{α∈R} J[x̂_k + α d_k] = −⟨ĝ_k, d_k⟩/⟨d_k, Q d_k⟩;

1 Thanks are due to Professor R. W. Sargent, Imperial College, London, England, for suggestions concerning presentation.
2 Lecturer, Department of Computing and Control, Imperial College, London, England.
© 1976 Plenum Publishing Corporation, 227 West 17th Street, New York, N.Y. 10011. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, microfilming, recording, or otherwise, without written permission of the publisher.

(iii) x̂_{k+1} = x̂_k + α̂_k d_k;
(iv) β_{k+1} = ‖ĝ_{k+1}‖²/‖ĝ_k‖²;
(v) d_{k+1} = −ĝ_{k+1} + β_{k+1} d_k;
(vi) set k = k + 1 and go to (ii).

Here, ĝ_k = Q x̂_k − b is the gradient of J with respect to x, at x̂_k.
Steepest descent is the same, but with β_k ≡ 0. The sequences so generated are denoted {x_k}, {g_k}, {α_k}.
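As a concrete illustration, both algorithms can be sketched in a few lines of NumPy. This is not code from the note: the function names and the test problem are my own choices, and the updates simply follow steps (i)-(vi) above, with β_k = 0 giving steepest descent.

```python
import numpy as np

def steepest_descent(Q, b, x0, iters):
    # Steepest descent for J[x] = 0.5<x, Qx> - <b, x>: the conjugate-
    # gradient recursion with beta_k = 0, i.e. d_k = -g_k at every step.
    x = x0.copy()
    traj = [x.copy()]
    for _ in range(iters):
        g = Q @ x - b                       # gradient of J at x
        alpha = (g @ g) / (g @ Q @ g)       # exact line search along -g
        x = x - alpha * g
        traj.append(x.copy())
    return traj

def conjugate_gradient(Q, b, x0, iters):
    # Steps (i)-(vi) of the note, with x-hat written as plain x.
    x = x0.copy()
    g = Q @ x - b
    d = -g
    traj = [x.copy()]
    for _ in range(iters):
        alpha = -(g @ d) / (d @ Q @ d)      # step (ii)
        x = x + alpha * d                   # step (iii)
        g_new = Q @ x - b
        beta = (g_new @ g_new) / (g @ g)    # step (iv)
        d = -g_new + beta * d               # step (v)
        g = g_new
        traj.append(x.copy())
    return traj

# Hypothetical test problem: Q symmetric positive definite, so b is in R[Q].
Q = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x0 = np.zeros(2)
xs_sd = steepest_descent(Q, b, x0, 2)
xs_cg = conjugate_gradient(Q, b, x0, 2)
```

On this 2-dimensional positive-definite problem the first iterates of the two methods coincide, and conjugate gradient reaches the exact minimizer of (1) in two iterations, consistent with its optimality over x_0 + B_k.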
The algorithms will be assumed to start at the same point x_0, so x̂_0 = x_0.
It can be shown that both x_k (generated by steepest descent) and x̂_k (generated by conjugate gradient) belong to the same linear variety x_0 + B_k, for all k ≥ 1. The variety is the translation by x_0 of the linear subspace

B_k ≜ L[g_0, Qg_0, …, Q^{k−1} g_0] (2)

spanned by the initial gradient g_0, Qg_0, …, and Q^{k−1} g_0. It is known (Ref. 1, Sections 8.2 and 8.3) that x̂_k is optimal in x_0 + B_k, and so the steepest-descent algorithm cannot give a lower cost J at any iteration k than that given by the conjugate-gradient algorithm, i.e.,

J[x_k] ≥ J[x̂_k].
It is known also (Ref. 1, Section 7.6 and Exercise 10 of Section 8.8) that

E[x_k] ≤ θ^{2k} E[x_0], (3)
E[x̂_k] ≤ 4 δ^{2k} E[x_0], (4)

where

E[x] = J[x] − J*,
J* = min_{x∈R^n} J[x],
θ = (1 − γ)/(1 + γ),
δ = (1 − √γ)/(1 + √γ),
γ = a/A.

Here, a is the smallest eigenvalue of Q and A is the largest eigenvalue.
Now, δ < θ, so (3) and (4) suggest that, asymptotically, J[x̂_k] converges to J* faster, with increasing iterations k, than J[x_k] converges. They do not prove, however, that

J[x̂_k] < J[x_k],

as they give only upper bounds on the cost sequences. Such an inequality is established here.
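For intuition about how much smaller δ is than θ, the two factors can be computed for a sample spectrum. The eigenvalue extremes below are assumed illustrative values, not taken from the note.

```python
import math

# Bound factors from (3)-(4) for an assumed spectrum a = 1, A = 100.
a, A = 1.0, 100.0
gamma = a / A                                             # gamma = a/A
theta = (1 - gamma) / (1 + gamma)                         # steepest-descent factor
delta = (1 - math.sqrt(gamma)) / (1 + math.sqrt(gamma))   # conjugate-gradient factor
print(theta, delta)
```

Here θ ≈ 0.980 while δ ≈ 0.818, so the conjugate-gradient bound (4) contracts far faster per iteration; but, being only upper bounds, neither (3) nor (4) orders the two cost sequences themselves.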

2. Results

Theorem 2.1. In the generic case, J[x_0] ≠ J* and g_0 not an eigenvector of Q, the following result holds:

J[x̂_1] = J[x_1] and J[x̂_k] < J[x_k], ∀k > 1.

For the case with g_0 an eigenvector of Q, it is easy to see that x_1 = x̂_1 with g_1 = ĝ_1 = 0, so that J is minimized on R^n by the first iteration of either algorithm. The rest of this note is devoted to proving Theorem 2.1.
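Before turning to the proof, the claimed behavior is easy to check numerically. The sketch below uses an example problem of my own (with g_0 not an eigenvector of Q, so the generic case applies); a single routine serves both methods, since setting β_k = 0 turns conjugate gradient into steepest descent.

```python
import numpy as np

def J(x, Q, b):
    # The quadratic cost (1): J[x] = 0.5<x, Qx> - <b, x>.
    return 0.5 * (x @ Q @ x) - b @ x

def run(Q, b, x0, iters, use_cg=True):
    # Conjugate gradient when use_cg is True; steepest descent otherwise.
    x = x0.copy()
    g = Q @ x - b
    d = -g
    costs = [J(x, Q, b)]
    for _ in range(iters):
        alpha = -(g @ d) / (d @ Q @ d)               # exact line search
        x = x + alpha * d
        g_new = Q @ x - b
        beta = (g_new @ g_new) / (g @ g) if use_cg else 0.0
        d = -g_new + beta * d
        g = g_new
        costs.append(J(x, Q, b))
    return costs

Q = np.diag([1.0, 4.0, 9.0])
b = Q @ np.ones(3)                  # b in R[Q]
x0 = np.array([2.0, -1.0, 3.0])     # g0 = [1, -8, 18]: not an eigenvector
sd = run(Q, b, x0, 2, use_cg=False)
cg = run(Q, b, x0, 2, use_cg=True)
assert np.isclose(sd[1], cg[1])     # first iterations identical
assert cg[2] < sd[2]                # strictly lower CG cost for k = 2
```

The observed pattern, equality at k = 1 and a strict gap afterwards, is exactly the content of Theorem 2.1.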
Two lemmas related to the steepest-descent algorithm will be helpful.

Lemma 2.1. For the steepest-descent algorithm applied to J of (1), g_k ≠ 0 for all finite k > 0 iff g_0 is nonzero and is not an eigenvector of Q.

Lemma 2.1 is due, essentially, to Langlois (Ref. 2). A somewhat different proof will be given here nevertheless, as Ref. 2 is not well known and because the result plays an important role in the proof of Theorem 2.1.

Proof. For the steepest-descent algorithm,

g_{j+1} = g_j + α_j Q g_j, (5)

where

α_j = −⟨g_j, g_j⟩/⟨g_j, Q g_j⟩. (6)

So, if g_0 is zero or is an eigenvector of Q, then g_1 = 0; consequently, g_k = 0, ∀k ≥ 1. This proves necessity.
Suppose next that g_j ≠ 0 and is not an eigenvector of Q. Then,

Q g_j = θ g_j + δ, (7)

where

δ ≠ 0 and ⟨δ, g_j⟩ = 0.

Clearly,

θ = ⟨g_j, Q g_j⟩/‖g_j‖²,

and so, as

g_j = Q x_j − b ∈ R[Q] = ⊥N[Q],

where ⊥N[Q] is the orthogonal complement of the nullspace of Q, θ > 0. Also, α_j = −1/θ. Hence, from (5) and (7),

g_{j+1} = g_j − (1/θ)(θ g_j + δ) = −(1/θ) δ ≠ 0; (8)

consequently, from (7) and (8),

⟨g_j, Q g_{j+1}⟩ = ⟨Q g_j, g_{j+1}⟩ = −(1/θ)‖δ‖² ≠ 0. (9)


So, from (9),

⟨g_j, g_{j+1}⟩ ≠ 0

if g_{j+1} is an eigenvector of Q. But, from (5) and (6),

⟨g_j, g_{j+1}⟩ = 0.

Hence, g_{j+1} cannot be an eigenvector and g_{j+1} ≠ 0 [from (8)] if g_j ≠ 0 and is not an eigenvector. Using this result iteratively when g_0 ≠ 0 and is not an eigenvector establishes sufficiency and completes the proof.
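The dichotomy in Lemma 2.1 can be seen numerically. The sketch below (the matrix and starting gradients are my own choices) applies the gradient recursion (5)-(6): an eigenvector starting gradient dies in one step, while a generic one stays nonzero at every finite iteration.

```python
import numpy as np

Q = np.diag([1.0, 4.0])

def next_grad(g):
    # One step of the steepest-descent gradient recursion (5)-(6):
    # g_{j+1} = g_j + a_j Q g_j, a_j = -||g_j||^2 / <g_j, Q g_j>.
    a_j = -(g @ g) / (g @ Q @ g)
    return g + a_j * (Q @ g)

# If g0 is an eigenvector of Q, the gradient vanishes after one step ...
g_eig = np.array([0.0, 2.0])
assert np.allclose(next_grad(g_eig), 0.0)

# ... otherwise it remains nonzero at every finite iteration.
g = np.array([1.0, 1.0])
for _ in range(25):
    g = next_grad(g)
    assert np.linalg.norm(g) > 0.0
```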
Now, let H_k = H_k^T project R^n orthogonally onto B_k, and consider applying the steepest-descent algorithm to minimizing

J_k[x] ≜ ⟨g_0, x − x_0⟩ + ½⟨x − x_0, H_k Q H_k (x − x_0)⟩ (10)

on R^n, starting at x_0^k = x_0. Denote the sequences so constructed by {x_j^k}, {α_j^k}, and {g_j^k}, where g_j^k is the gradient of J_k[x] at x_j^k. For the first k iterations, relationships between vectors associated with the steepest-descent algorithm for minimizing J[x], generating a sequence {x_k} with gradients {g_k}, and the steepest-descent algorithm for minimizing J_k[x] are provided by the following lemma.

Lemma 2.2. For all k ≥ 1,

(i) x_k ∈ x_0 + B_k,
(ii) g_j^k = g_j, ∀j ∈ {0, 1, …, k − 1},
(iii) g_k^k = H_k g_k.

Proof. For the steepest-descent algorithm for minimizing J,

g_{i+1} = g_i + α_i Q g_i.

Using this iteratively gives

g_i ∈ L[g_0, Qg_0, …, Q^i g_0] = B_{i+1}. (11)

Lemma 2.2 (i) follows immediately from this and the fact that, for steepest descent,

x_k = x_0 + Σ_{i=0}^{k−1} α_i g_i.

Now, suppose that i < k and that

x_i^k = x_i, g_i^k = g_i. (12)

Then, from (2) and (11),

g_i ∈ B_{i+1} ⊂ B_k. (13)

Therefore, by (12) and the fact that H_k projects onto B_k,

H_k g_i^k = H_k g_i = g_i ∈ B_k.

Hence,

α_i^k = −‖g_i^k‖²/⟨g_i^k, H_k Q H_k g_i^k⟩ = −‖g_i‖²/⟨g_i, Q g_i⟩ = α_i, (14-1)
x_{i+1}^k = x_i^k + α_i^k g_i^k = x_i + α_i g_i = x_{i+1}, (14-2)
g_{i+1}^k = g_i^k + α_i^k H_k Q H_k g_i^k = H_k(g_i + α_i Q g_i) = H_k g_{i+1}. (14-3)

So, for i < k − 1,

g_{i+1}^k = g_{i+1},

as then, from (13),

Q g_i ∈ B_{i+2} ⊂ B_k.

Thus, for i < k − 1, (12) implies that

x_{i+1}^k = x_{i+1}, g_{i+1}^k = g_{i+1},

from which Lemma 2.2 (ii) follows inductively, as x_0^k = x_0 and, from (10), g_0^k = g_0.
Lemma 2.2 (iii) is established by (14) when i = k − 1.
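Lemma 2.2 can also be confirmed numerically. In the sketch below (an example problem of my own, with Q > 0 assumed for simplicity), H_k is built from an orthonormal basis of B_k, and the gradient recursions for J and J_k are run side by side.

```python
import numpy as np

Q = np.diag([1.0, 4.0, 9.0, 16.0])
b = Q @ np.ones(4)                    # b in R[Q]
x0 = np.array([3.0, -2.0, 1.0, 2.0])
g0 = Q @ x0 - b                       # g0 is not an eigenvector of Q
k = 3

# Orthogonal projector H_k onto B_k = L[g0, Q g0, ..., Q^{k-1} g0].
K = np.column_stack([np.linalg.matrix_power(Q, i) @ g0 for i in range(k)])
V, _ = np.linalg.qr(K)                # orthonormal basis for B_k
H = V @ V.T                           # H_k = H_k^T = H_k^2

def grad_seq(A, g_start, iters):
    # Steepest-descent gradient recursion (5)-(6) for a cost with Hessian A.
    g, gs = g_start.copy(), [g_start.copy()]
    for _ in range(iters):
        a_j = -(g @ g) / (g @ A @ g)
        g = g + a_j * (A @ g)
        gs.append(g.copy())
    return gs

gs = grad_seq(Q, g0, k)               # gradients g_j for J
gsk = grad_seq(H @ Q @ H, g0, k)      # gradients g_j^k for J_k

for j in range(k):                    # Lemma 2.2 (ii): g_j^k = g_j, j < k
    assert np.allclose(gs[j], gsk[j])
assert np.allclose(gsk[k], H @ gs[k]) # Lemma 2.2 (iii): g_k^k = H_k g_k
```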

Proof of Theorem 2.1. The first iterations of the conjugate-gradient and steepest-descent algorithms for minimizing J[x] are identical, so

J[x̂_1] = J[x_1].

Consider next the generic case, with g_0 nonzero and not an eigenvector of Q. Then, Lemma 2.1 for the steepest-descent algorithm (applied to J) gives g_j ≠ 0 for all finite j. This does not show directly that g_k^k ≠ 0 (desired later), as g_k^k = H_k g_k [Lemma 2.2 (iii)] and so could be zero even though g_k ≠ 0. By Lemma 2.2 (ii), however, g_1^k = g_1 ≠ 0, ∀k > 1, so application of Lemma 2.1 to the minimization of J_k[x] using steepest descent reveals that g_0^k is nonzero and is not an eigenvector of H_k Q H_k, and thus that g_k^k ≠ 0.
As g_k^k = H_k g_k [by Lemma 2.2 (iii)] and H_k² = H_k = H_k^T,

⟨g_k, g_k^k⟩ = ‖g_k^k‖² > 0.

So, for some positive α which is sufficiently small,

J[x_k − α g_k^k] = J[x_k] − α⟨g_k, g_k^k⟩ + ½α²⟨g_k^k, Q g_k^k⟩ < J[x_k].

From Lemma 2.2 (i) and the fact that g_k^k ∈ B_k,

x_k − α g_k^k ∈ x_0 + B_k.

Hence, as x̂_k (for the conjugate-gradient algorithm) minimizes J on x_0 + B_k,

J[x̂_k] ≤ J[x_k − α g_k^k] < J[x_k], ∀k > 1,

which proves the theorem.

References

1. LUENBERGER, D. G., Introduction to Linear and Nonlinear Programming, Addison-Wesley Publishing Company, Reading, Massachusetts, 1973.
2. LANGLOIS, W. E., Conditions for Termination of the Method of Steepest Descent After a Finite Number of Iterations, IBM Journal of Research and Development, Vol. 10, pp. 98-99, 1966.
