# An Introduction to the Conjugate Gradient Method Without the Agonizing Pain

Edition 1 1 4 Jonathan Richard Shewchuk August 4, 1994

School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213

Abstract
The Conjugate Gradient Method is the most prominent iterative method for solving sparse systems of linear equations. Unfortunately, many textbook treatments of the topic are written with neither illustrations nor intuition, and their victims can be found to this day babbling senselessly in the corners of dusty libraries. For this reason, a deep, geometric understanding of the method has been reserved for the elite brilliant few who have painstakingly decoded the mumblings of their forebears. Nevertheless, the Conjugate Gradient Method is a composite of simple, elegant ideas that almost anyone can understand. Of course, a reader as intelligent as yourself will learn them almost effortlessly. The idea of quadratic forms is introduced and used to derive the methods of Steepest Descent, Conjugate Directions, and Conjugate Gradients. Eigenvectors are explained and used to examine the convergence of the Jacobi Method, Steepest Descent, and Conjugate Gradients. Other topics include preconditioning and the nonlinear Conjugate Gradient Method. I have taken pains to make this article easy to read. Sixty-six illustrations are provided. Dense prose is avoided. Concepts are explained in several different ways. Most equations are coupled with an intuitive interpretation.

Supported in part by the Natural Sciences and Engineering Research Council of Canada under a 1967 Science and Engineering Scholarship and by the National Science Foundation under Grant ASC-9318163. The views and conclusions contained in this document are those of the author and should not be interpreted as representing the ofﬁcial policies, either express or implied, of NSERC, NSF, or the U.S. Government.

Keywords: conjugate gradient method, preconditioning, convergence analysis, agonizing pain

5. Thinking with Eigenvectors and Eigenvalues 5.1. Eigen do it if I try 5.2. Jacobi iterations 5.3. A Concrete Example

¢¡£¢¡¡¢¡¡£¢¡¡¢¡¡£¢¡¡¢¡¡£¢¡¡¢¡¡¢£¡¡¢¡                                                                         ¢¡£¢¡¡¢¡¡£¢¡¡¢¡¡£¢¡¡¢¡¡£¢¡¡¢¡¡¢£¡¡¢¡¡¢                                                                                                                                                      ¢¡£¢¡¡¢¡¡£¢¡¡¢¡¡£¢¡¡¢¡¡£¢¡¡¢¡¡¢£¡¡¢¡¡

4. The Method of Steepest Descent

2. Notation

1. Introduction

Contents

6. Convergence Analysis of Steepest Descent 6.1. Instant Results 6.2. General Convergence

¢¡£¢¡¡¢¡¡£¢¡¡¢¡¡£¢¡¡¢¡¡£¢¡¡¢¡¡¢£¡¡¢                                                                                                                                                    ¢¡£¢¡¡¢¡¡£¢¡¡¢¡¡£¢¡¡¢¡¡£¢¡¡¢¡¡¢£¡¡¢¡¡¢£

7. The Method of Conjugate Directions 7.1. Conjugacy 7.2. Gram-Schmidt Conjugation 7.3. Optimality of the Error Term

¢¡£¢¡¡¢¡¡£¢¡¡¢¡¡£¢¡¡¢¡¡£¢¡¡¢¡¡¢£                                                                 ¢¡£¢¡¡¢¡¡£¢¡¡¢¡¡£¢¡¡¢¡¡£¢¡¡¢¡¡¢£                                                                                                                                                  ¢¡£¢¡¡¢¡¡£¢¡¡¢¡¡£¢¡¡¢¡¡£¢¡¡¢¡¡¢£¡¡¢¡¡¢£¡¡

9. Convergence Analysis of Conjugate Gradients 9.1. Picking Perfect Polynomials 9.2. Chebyshev Polynomials

¢¡£¢¡¡¢¡¡£¢¡¡¢¡¡£¢¡¡¢¡¡£¢¡¡¢¡¡¢£¡¡                                                                                                                                    ¢¡£¢¡¡¢¡¡£¢¡¡¢¡¡£¢¡¡¢¡¡£¢¡¡¢¡¡¢£

11. Starting and Stopping 11.1. Starting 11.2. Stopping

¢¡£¢¡¡¢¡¡£¢¡¡¢¡¡£¢¡¡¢¡¡£¢¡¡¢¡¡¢£¡¡¢¡¡¢£¡¡¢                                                                                                                                                                          ¢¡£¢¡¡¢¡¡£¢¡¡¢¡¡£¢¡¡¢¡¡£¢¡¡¢¡¡¢£¡¡¢¡¡¢£¡¡¢¡

13. Conjugate Gradients on the Normal Equations

12. Preconditioning

10. Complexity

8. The Method of Conjugate Gradients

14. The Nonlinear Conjugate Gradient Method 14.1. Outline of the Nonlinear Conjugate Gradient Method 14.2. General Line Search 14.3. Preconditioning

¢¡£¢¡¡¢¡¡£¢¡¡¢¡¡£¢¡¡¢¡¡£¢¡¡¢¡¡¢£¡¡¢¡¡¢                                                                             ¢¡£¢¡¡¢¡¡£¢¡¡¢¡¡£¢¡¡¢¡¡£¢¡¡¢¡¡¢£¡¡¢¡                                                                                                              ¢¡£¢¡¡¢¡¡£¢¡¡¢¡¡£¢¡

B Canned Algorithms B1. Steepest Descent B2. Conjugate Gradients B3. Preconditioned Conjugate Gradients A Notes

¢¡£¢¡¡¢¡¡£¢¡¡¢¡¡£¢¡¡¢¡¡£¢¡¡¢                                                         ¢¡£¢¡¡¢¡¡£¢¡¡¢¡¡£¢¡¡¢¡¡£¢¡¡¢¡¡¢£¡¡¢¡                                                                                                                                                    ¢¡£¢¡¡¢¡¡£¢¡¡¢¡¡£¢¡¡¢¡¡£¢¡¡¢¡¡¢£¡¡¢¡¡¢

i 49 49 50 51 42 42 43 47 38 38 38 32 33 35 21 21 25 26 13 13 17 9 9 11 12 41 30 39 37 6 2 1 1 48

The Solution to Minimizes the Quadratic Form C2. Nonlinear Conjugate Gradients with Newton-Raphson and Fletcher-Reeves B5. Optimality of Chebyshev Polynomials             ¢¡£¢¡¡                  ¢¡£¢¡¡¢¡  B4. A Symmetric Matrix Has Orthogonal Eigenvectors.D Homework  ¢¡£¢¡¡¢¡¡£¢¡¡¢¡¡£¢¡¡¢¡¡£¢¡¡                                                       ¢¡£¢¡¡¢¡¡£¢¡¡¢¡¡£¢¡                                                                          ¢¡£¢¡¡¢¡¡£¢¡¡¢¡¡£¢  ii C Ugly Proofs C1. C3. Preconditioned Nonlinear Conjugate Gradients with Secant and Polak-Ribi` re e 52 53 54 54 54 55 56  © § ¥ ¨¦¤ .

iv .

4

2

-4

-2

2

4

6

1

-4

-6

Figure 1: Sample two-dimensional linear system. The solution lies at the intersection of the lines. A matrix is positive-deﬁnite if, for every nonzero vector ,

This may mean little to you, but don’t feel bad; it’s not a very intuitive idea, and it’s hard to imagine how a matrix that is positive-deﬁnite might look differently from one that isn’t. We will get a feeling for what positive-deﬁniteness is about when we see how it affects the shape of quadratic forms. Finally, don’t forget the important basic identities and

3. The Quadratic Form A quadratic form is simply a scalar, quadratic function of a vector with the form

where is a matrix, and are vectors, and is a scalar constant. I shall show shortly that if and positive-deﬁnite, is minimized by the solution to . Throughout this paper, I will demonstrate ideas with the simple sample problem

The system is illustrated in Figure 1. In general, the solution lies at the intersection point of hyperplanes, each having dimension 1. For this problem, the solution is 2 2 . The corresponding quadratic form appears in Figure 2. A contour plot of is illustrated in Figure 3.

i ) v B u7¥ t §

I `¥ F Y

¥

§

d

i qh

B 

i ¢h

3 2 2 6

2 8

0

¤

d eA

¥ ) cb¦¤ ) ¥ © B ¥

1 2

V V G VI G £¤ XR§ WUH¤ F

) ¤ ) G § )I G T2SRQPH¤ F

0

A

-2

2

6

B C§ ¥

2

¥

1

¥

¥

A

3

2

§ ¥

D ¥ E¦¤ ) ¥

¥

¥

2

Jonathan Richard Shewchuk
2

2

2

1

8

§ I a`¥ F Y

d

f § gR¤

I w¥ F Y

I `¥ F Y

¥

¤

(2)

1

1

1.

(3) is symmetric

¤

(4) 

3

150

4 2 0 50 0 6 -2 2 0 -2 -6 -4 4

2

2

4

2

-4

-2

2

4

6

1

-2

-4

-6

 y 6`x

Figure 3: Contours of the quadratic form. Each ellipsoidal curve has constant

.

   

¥

¥

 y `x

Figure 2: Graph of a quadratic form

. The minimum point of this surface is the solution to

¥

-4

1

I `¥ F Y
.

100

¥

6

4

2

-4

-2 -2

2

4

6

-4

-6

-8

. The gradient is a vector ﬁeld that, for a given point , points in the direction of greatest increase of Figure 4 illustrates the gradient vectors for Equation 3 with the constants given in Equation 4. At the bottom by setting equal to zero. of the paraboloid bowl, the gradient is zero. One can minimize With a little bit of tedious math, one can apply Equation 5 to Equation 3, and derive

If

is symmetric, this equation reduces to

Setting the gradient to zero, we obtain Equation 1, the linear system we wish to solve. Therefore, the solution to is a critical point of . If is positive-deﬁnite as well as symmetric, then this

cdRw¥ F  Y © B ¥ ¤ § I

A

¥) ¤

1 2

1 2

I `¥ F Y

I w¥ F  Y

I `¥ F Y

¥  I F  ( w¥ Y  !   § I F  & `¥ F Y " aw¥ Y I   && %& `¥ F Y "  I

I w¥ F Y

¤

I w¥ F Y

§ I w¥ F  Y

 y 6`x

increase of

1 2

. . .



  6yx

of the quadratic form. For every , the gradient points in the direction of steepest , and is orthogonal to the contour lines.

is shaped like a paraboloid bowl. (I’ll have more

¥

¥

4

Jonathan Richard Shewchuk
2

1

¤

(5)

(6)

¤

(7)

(b) For a negative-deﬁnite matrix. then is a saddle point. could be negative-deﬁnite — the result of negating a positive-deﬁnite matrix (see Figure 2. (d) For an indeﬁnite matrix. In three dimensions or higher. might be singular. there are several other possibilities. and techniques like Steepest Descent and CG will likely fail. (If is not . Why go to the trouble of converting the linear system into a tougher-looking problem? The methods under study — Steepest Descent and CG — were developed and are intuitively understood in terms of minimization problems like Figure 2. but hold it upside-down). then by Inequality 2. A line that runs through the bottom of the valley is the set of solutions. It follows that ¤ © §F ¥I hgq¤ I `¥ Y A ) T¤ F ¥ ¥¥ © V ¥ ¤ § A F I w¥ Y aI jiF Y § ¥¥ 2 2 1 I `¥ F Y ¥¥ ¥ I `¥ F Y ¥¥ © § ¥ rf¦¤ 2 2 1 I `¥ F Y ¥ (8) I `¥ F Y I w¥ F Y ¥ ¥ i I w¥ F Y I q¤ ¤ A ) T¤ F ¤ ¤ Y ¥ . Steepest Descent and CG will not work. the latter term is positive for all is a global minimum of . not in terms of intersecting hyperplanes such as Figure 1. in which case no solution is unique. solution is a minimum of . The values of and determine where the minimum point of the paraboloid lies. The fact that is a paraboloid is our best intuition of what it means for a matrix to be positive-deﬁnite. so can be solved by ﬁnding an that minimizes 1 symmetric.) 2 . (c) For a singular (and positive-indeﬁnite) matrix. then Equation 6 hints that CG will ﬁnd a solution to the system 2 1 is symmetric. a singular matrix can also have a saddle. the set of solutions is a line or hyperplane having a uniform value for . Note that Why do symmetric positive-deﬁnite matrices have this nice property? Consider the relationship between 1 . ¥ l§ Ri   wkB jiF ¤ ) `B jiF I¥ I¥ 1 2 . If is none of the above. Figure 5 demonstrates the possibilities. From Equation 3 one can show (Appendix C1) at some arbitrary point and at the solution point that if is symmetric (be it positive-deﬁnite or not).The Quadratic Form (a) (b) 5 1 (c) (d) 1 Figure 5: (a) Quadratic form for a positive-deﬁnite matrix. Because the solution is a saddle point. ¤ Y ¤ ¤ d © ¥ Y If is positive-deﬁnite as well. If is not positive-deﬁnite. but do not affect the paraboloid’s shape.

2 0.The Method of Steepest Descent 7 (b) 0 -2. (a) Starting at ¥ n m ¥ 140 120 100 80 60 40 20 ¥ -6 2 4 2 1 ¥¥ -4 ¥ 0 -2 y ¥ n p ¥ m n eA n 4 F y I 4m r m ¥ Y n m ¥ -4 -2 2 4 6 1 2. is minimized where the gradient is orthogonal to the search line. 2 2 1 -2 -1 1 2 3 1 -1 -2 -3 is shown at several locations along the search line (solid arrows).  x Figure 7: The gradient x ¥   j~  Figure 6: The method of Steepest Descent. Each gradient’s projection onto the line is also shown (dotted arrows). The gradient vectors represent the direction of steepest increase of . The bottommost point is our target.5 -5 2.5 2 0 -2.4 0.5 0 (c) -4 -2 -2 -4 2 1 4 6 0. (d) The gradient at the bottommost point is orthogonal to the gradient of the previous step. and the projections represent the rate of increase as one traverses the search line.6 -6 2 2 . take a step in the direction of steepest descent of . (b) Find the point on the intersection of these two surfaces that minimizes . (c) This parabola is the intersection of surfaces. On the search line.5 1 5 (d) I w¥ F Y ¥ ¥ x x x 2 (a) 4 2 150 100 50 0 .

Equation 13 can be used for every iteration thereafter. The computational cost of Steepest Descent is dominated by matrix-vector products. The disadvantage of using this recurrence is that the sequence deﬁned by Equation 13 is generated without any feedback from the value of . as written above. By premultiplying both sides of Equation 12 by and adding . requires two matrix-vector multiplications per iteration. The algorithm.  W ~  n4 m ¥   j~  Figure 8: Here. we have 1 Although Equation 10 is still needed to compute 0 . which occurs in both Equations 11 and 13.4 2 -4 -2 2 4 6 0 -2 -4 -6 Putting it all together. Note the zigzag path. The product . which appears because each gradient is orthogonal to the previous gradient. fortunately. the method of Steepest Descent starts at 2 2 and converges at 2 ¥ n r m   n 4 m r ¤ n 4 m y B n 4 m r § n 64 m r   n4m © y r n 4 m eA n 4 @7§ n m ¥ i n 4 m r ¤ n 4 m) r n n n § 4m m4 r 4 m) r n m ¥ ¤ B © @¦c§ 4 m in 4 ¥ ¤ ¦B y ¥ r 8 Jonathan Richard Shewchuk 2 1 64 @¥ m n m ¥ n 4m ¥ r ¤ ¥ 2 . so that accumulation of ﬂoating point roundoff error may cause to converge to some point near . I must digress to ensure that you have a solid understanding of eigenvectors. Before analyzing the convergence of Steepest Descent. This effect can be avoided by periodically using Equation 10 to recompute the correct residual. the method of Steepest Descent is: 1 The example is run until it converges in Figure 8. need only be computed once. (10) (11) (12) (13) . one can be eliminated.

then will grow to inﬁnity (Figure 10). CG won’t make sense either. the eigenvectors of the stiffness matrix associated with a one-dimensional uniformly-spaced mesh are sine waves. diverges to inﬁnity. If 1. As increases. Why should you care? Iterative methods often depend on applying to a vector over and over again. The value is an eigenvalue of . but it won’t turn sideways. I knew eigenvectors and eigenvalues like the back of my head. feel free to skip this section. Each time is applied. Thinking with Eigenvectors and Eigenvalues 9 After my one course in linear algebra. As increases. there are practical applications for eigenvectors. you recall solving problems involving eigendoohickeys. If your instructor was anything like mine. it’s still an eigenvector. then will vanish as approaches inﬁnity (Figure 9). Steepest Descent and CG do not calculate the value of any eigenvectors as part of the algorithm1 . because . For instance. If 1. may change length or reverse its direction. has a corresponding eigenvalue of 2.   §  G 4 converges 4    0 §  G    G   G    G 4   y  D   y G  y  §  G y I  y F G §  G G G  . without an intuitive grasp of them. v Bv B2 v B3 v 1 However. there is some scalar constant such that .   Figure 10: Here. and to express vibrations as a linear combination of these eigenvectors is equivalent to performing a discrete Fourier transform. Eigenvectors are used primarily as an analysis tool. In other words. B2 v v Bv B3 v to zero. 5. Eigen do it if I try An eigenvector of a matrix is a nonzero vector that does not rotate when is applied to it (except perhaps to point in precisely the opposite direction). but you never really understood them. the vector is also an eigenvector with eigenvalue . Unfortunately.Thinking with Eigenvectors and Eigenvalues 5. ¦           Figure 9: is an eigenvector of with a corresponding eigenvalue of 0 5. if you scale an eigenvector. the vector grows or shrinks according to the value of .1. When is repeatedly applied to an eigenvector . The eigenvectors of the stiffness matrix associated with a discretized structure of uniform density represent the natural modes of vibration of the structure being studied. If you’re already eigentalented. In other words. For any constant . one of two things can happen.

must be positive also. Consider that the set of eigenvectors symmetric has eigenvectors that are linearly independent). because the eigenvectors that compose converge to zero when is repeatedly applied. Each eigenvector has a corresponding eigenvalue. and summing the result. the is positive (for nonzero ). If the magnitudes of all the eigenvalues are smaller than one. Hence. This set is not unique. for instance. whose associated eigenvalues are 1 0 7 and 2 2. one eigenvector converges to zero while the other diverges. Any -dimensional vector can be expressed as a linear combination of these eigenvectors. also diverges. When is repeatedly applied. The eigenvalues may or may not be equal to each other. the eigenvalues of the identity matrix are all one.    ¦   e      )    @)  G §  G ) §  G     ¥4 G  Figure 11: The vector ¥ ¥4 G  G # #  i ""  i  i     G   G    4     ¥ ¥ 4 I G F  i   4  § I aG F    A   §  G A  R¥ G 4 4 G § 4 4 ¥ #  i ""  i  i     G 4 G   G ¥ G G 10 G  . This is why numerical analysts attach importance to the spectral radius of a matrix: max is an eigenvalue of . This fact can be proven from the deﬁnition of eigenvalue:   By the deﬁnition of positive-deﬁnite. because each eigenvector can be scaled by an arbitrary nonzero constant. one can examine the effect of on each eigenvector separately. The rule that converges to zero if and only if all the generalized eigenvalues have magnitude smaller than one still holds. converge to zero. If one of the eigenvalues has magnitude greater than one. The details are too complicated to describe in this article. we have will 1 2 1 1 2 2 . hence. then there exists a set of linearly independent eigenvectors of . If we want to converge to zero quickly. These are uniquely deﬁned for a given matrix. but is harder to prove. In Figure 11. On repeated application. will diverge to inﬁnity. and every nonzero vector is an eigenvector of . What if is applied to a vector that is not an eigenvector? A very important skill in understanding linear algebra — the skill this section is written to teach — is to think of a vector as a sum of other vectors forms a basis for (because a whose behavior is understood. The effect of repeatedly applying to is best understood by examining the effect of on each eigenvector. Here’s a useful fact: the eigenvalues of a positive-deﬁnite matrix are all positive. and preferably as small as possible. a vector is illustrated as a sum of two eigenvectors 1 and 2 . a name that betrays the well-deserved hostility they have earned from frustrated linear algebraists. These matrices are known as defective. B3 x v 1 v2 x Bx B2 x (solid arrow) can be expressed as a linear combination of eigenvectors (dashed arrows). should be less than one. Applying to is equivalent to applying to the eigenvectors. and because matrix-vector multiplication is distributive. denoted 1 2 . denoted 1 2 .Jonathan Richard Shewchuk If is symmetric (and often if it is not). It’s important to mention that there is a family of nonsymmetric matrices that do not have a full set of independent eigenvectors. but the behavior of defective matrices can be analyzed in terms of generalized eigenvectors and generalized eigenvalues.

The characteristic polynomial of (from Equation 4) is  ¤   det 0   ¤   §  qB 9§F I ¤   ¦9 § 0 §  ¤  § I ¤  qcB 9§F ¤ I UG F  ¤ B ¤  B 0  Figure 12: The eigenvectors of ¥ ¥ 12 Jonathan Richard Shewchuk 2 1 B B  f ¤  ¨B 0 ¤  y `x G . for any eigenvector with eigenvalue . Then. or even for every positive-deﬁnite . I shall solve the system speciﬁed by Equation 4. 2 and the eigenvalues are 7 and 2. First. 5. the rate of convergence . 0 0  The determinant of is called the characteristic polynomial. 0 Eigenvectors are nonzero. Unfortunately. is not generally symmetric (even if is). Jacobi does not converge of the Jacobi Method depends largely on for every . However. which depends on . A Concrete Example To demonstrate these ideas. Each eigenvalue is proportional to the steepness of the corresponding slope.4 2 -4 -2 2 7 -2 2 4 6 -4 -6 are directed along the axes of the paraboloid deﬁned by the quadratic form .3. so must be singular. By deﬁnition. To ﬁnd the eigenvector associated with 1 2 1 2   4 2 § 4 2 §  B  ¢ h  h  f B B g§  qB 0§F f I ¤  B  det 2 6 9 14 2 1 §  i   I B §F I B §F § h A  B  § 3 2 7 2 7. and may even be defective. Each eigenvector is labeled with its associated eigenvalue. we need a method of ﬁnding eigenvalues and eigenvectors. It is a degree polynomial in whose roots are the set of eigenvalues.

we see that these eigenvectors coincide with the axes of the familiar ellipsoid. These are converging normally at the rate deﬁned by their eigenvalues. and not just bizarre torture devices inﬂicted on you by your professors for the pleasure of watching you suffer (although the latter is a nice fringe beneﬁt). we ﬁnd that 2 1 is an eigenvector corresponding to the eigenvalue 2. Then. and 2 1 with eigenvalue 2 3. Instant Results is an eigenvector To understand the convergence of Steepest Descent. (Negative eigenvalues indicate that decreases along the axis. we must express as a linear combination of eigenvectors. Choosing 1 gives us instant convergence. say. and (e)). Using the constants speciﬁed by Equation 4. For a more general analysis. These are graphed in Figure 13(a). Figure 13(f) plots the eigenvector components as arrowheads. as in Figure 11. the residual is also an eigenvector. let’s see the Jacobi Method in action. 1 2 . In Figure 12. The point lies on one of the axes of the ellipsoid. let’s ﬁrst consider the case where with eigenvalue . By the same method. and we shall furthermore require these eigenvectors to be orthonormal. 6. Figure 13(b) shows the convergence of the Jacobi method.1. and are not related to the axes of the paraboloid. I hope that this section has convinced you that eigenvectors are useful tools.Convergence Analysis of Steepest Descent 13 Any solution to this equation is an eigenvector. and that a larger eigenvalue corresponds to a steeper slope. It is proven in Appendix C2 that if is ª © ¤ n4 m¥ ¤ n4 p m )v i © ¢xBt   § n 4n r 4 m r «  A n 4 p n p I 4 m «9 B F nm 4 r n ) 4 m r m § m ) n r n4 r 4n m r 4 n m r n¤ 4 m) r A n 4 m p § n 64 m p 4 m m) n 4 9 n n m p « C§ 4 m p ¦u§ 4 m r B ¤ B h 1 3 0 ª ¢B © n4 p m A n4 h m ¥ 0 2 3 2 3 4 3 h h 0 0 B f h f B f A n4 h m ¥ f 1 3 0 0 2 2 0 1 3 0 2 8 )v i xBt )v i t  §  B )v f  B B f i j©t § § n 64 m ¥ G « 0 « n y V  § 4m Y . The mysterious path the algorithm takes can be understood by watching the eigenvector components of each successive error term (Figures 13(c). (d). Convergence Analysis of Steepest Descent 6. Equation 12 gives 1 0 Figure 14 demonstrates why it takes only one step to converge to the exact solution.) Now. we have 1 1 6 1 6 The eigenvectors of are 2 1 with eigenvalue 2 3. and so the residual points directly to the center of the ellipsoid. note that they do not coincide with the eigenvectors of . as in Figures 5(b) and 5(d).

e) The error vectors 0 . these eigenvectors are not the axes of the paraboloid.14 Jonathan Richard Shewchuk 2 (a) 4 4 2 2 -4 -2 2 4 6 1 -4 -2 2 4 6 -2 0 -4 -6 -6 2 (c) 4 4 2 2 -4 -2 2 4 6 1 -4 -2 2 1 4 6 -2 0 -4 -4 -6 -6 2 (e) 2 (f) 4 4 2 2 -4 -2 2 4 6 1 -4 -2 2 4 6 -2 2 -2 -4 -4 -6 -6 Figure 13: Convergence of the Jacobi Method. Each eigenvector component of the error is converging to zero at the expected rate based on its eigenvalue. (a) The eigenvectors of are shown with their corresponding eigenvalues. ° ¯­ ° ¯­ ° ¯­ ® ® ® ¥ ¥ n p m ¥ ¥  n 0 47 0 47 -2 -4 2 (d) ¥ ¥ ¥ m¥   ~  ¥ ¥ ¥    B ¬ W '~  n p m   ¥ ¥ ¥ 2 (b) 1 1 n p m 1 . (f) Arrowheads represent the eigenvector components of the ﬁrst four error vectors. (c. Unlike the eigenvectors of . (b) The Jacobi Method starts at 2 2 and converges at 2 2 . d. 2 (solid arrows) and their eigenvector components (dashed arrows). 1 .

This choice gives us the useful property that Express the error term as a linear combination of eigenvectors 2 2 2 2 2 2 2 3 i ¥¯£ ¸ ££ ££ I º¥\$£ ¸ £ where is the length of each component of ¸ £   £  £ · § n 4 m r ¤ n 4 m) r i ¸ £ n rn r n r ££ · § 4 m 4 m) § ¹ 4 m ¹ i ¥ £ ¸ £ £ § ¸ £ · £ F £ "£ F § n 4 m p ¤ n 4 m ) p  · I ) · i ¸ £ n pn p n p £ · § 4 m 4 m) § ¹ 4 m ¹ £ n p n r ¤ · BC§ 4 m ¦B § 4 m n4 p m i ¸ £ £ "£ 5 § n 4 m p #· . there exists a set of orthogonal eigenvectors of . As we can scale eigenvectors arbitrarily. From Equations 17 and 18 we have the following identities: (19) (20) i i 1 0 1 ¥ µ i  l § µ ¶ ¶ § ¤ £ ³ ± ´§ ² )  ¥ 2 1  (17) (18) ¸ (21) (22) (23) .Convergence Analysis of Steepest Descent 4 15 2 -4 -2 2 4 6 -2 -4 -6 Figure 14: Steepest Descent converges to the exact solution on the ﬁrst iteration if the error term is an eigenvector. symmetric. let us choose so that each eigenvector is of unit length.

choose ¥ n4 p m ¸ § £¸ £ 3  A n n p n I 4 m 9 B F £ £ 3  4 m p § 64 m p  « n y n 4 Vp § 4m m ¸ n4 r £  £ ¸ £ 3 A n4 p m ££ £ 3 m n r n4 r 4n m r 4 n m r n¤ 4 m) r A n 4 m p 4 m m) § § ¥ 16 Jonathan Richard Shewchuk 2 1 n 64 p m n y V  § 4m ¸ £ 8¯£ n4 r B m (24) . the residual must point to the center 1. of the sphere. Equations 20 and 22 are just Pythagoras’ Law. if 1 . once again. hence. there is instant convergence. the ellipsoid is spherical. no matter what point we start at. Equation 12 gives 1 2 2 2 3 has only one eigenvector component. Now let’s examine the case where achieved in one step by choosing is arbitrary. As before. too can be expressed as a sum of eigenvector components. and the length of Equation 19 shows that these components are . Equation 24 becomes 2 3 2 2 1 0 Figure 15 demonstrates why. but all the eigenvectors have a common eigenvalue .6 4 2 -4 -2 2 4 6 -2 -4 Figure 15: Steepest Descent converges to the exact solution on the ﬁrst iteration if the eigenvalues are all equal. Because all the eigenvalues are equal. then convergence is We saw in the last example that. Now we can proceed with the analysis.

For this reason. The weights 2 ensure that longer components are given precedence. In fact. This norm is easier to work with than the Euclidean norm. some of the shorter components of of might actually increase in length (though never for long).2. the Jacobi Method is a smoother. 6. the methods of Steepest Descent and Conjugate Gradients are called roughers. on any given iteration. and is in some sense a more natural norm. General Convergence To bound the convergence of Steepest Descent in the general case. if there are several unequal. the fraction in Equation 24 is best thought of as a weighted average of the values of 1 . we shall deﬁne the energy norm 1 2 (see Figure 16). By contrast. Steepest Descent and Conjugate Gradients are not smoothers. because every eigenvector component is reduced on every iteration. nonzero eigenvalues. then no choice of will eliminate all the eigenvector components. examination of Equation 8 shows that minimizing is equivalent to n4 p m n » W¹ 4 m p ¹ n4 y m ¥ £ ¸ ¥ 2 1 £ V ½ I p ) ¼F § W¹ p ¹ ¤ p » n4 p m .Convergence Analysis of Steepest Descent 8 17 6 4 2 -6 -4 -2 2 -2 Figure 16: The energy norm of these two vectors is equal. and our choice becomes a sort of compromise. As a result. However. although they are often erroneously identiﬁed as such in the mathematical literature.

23) (25) 2 2 2 2 3 2 The analysis depends on ﬁnding an upper bound for . we see from the graph that is zero. a little basic calculus reveals that Equation 26 is maximized when . If the eigenvalues are equal. If 0 is an eigenvector. then the slope is zero (or inﬁnite). An Æ Å È§ Ç ¸ ¸ ª £ I ¥ 2 2 2 1 n4 r n4 r m ¤ m) 2 ¤ A Å A Å I Æ Æ ¼F IA Æ ¼F ¼F B § ¸ ¸ ¸ Å ¸ A  I ¸F  ¸ A  F I   IA  F B § Ã ¸ ¸I ª § Æ n Ä ²ª  Å  m4 p §  Ä  §  Ã ¸ ¸ £ £ 3F i ¸£  £ £ 3F n Ã @» ¹ 4 m p ¹ § £ £I £ 3F ¸B § Ã ¸ I  Á £ £ £ £ £ £ 3F À n I ¥ £ 3 F £ I¸ £  B » ¹ 4m p ¹ § 3F I  Á n n p n n r I 4 m p ¤ 4 m n) ¼F I n 4 m r ¤ 4 m) ÂF B À » ¹ n 4 m p ¹ § I 4 m r 4 m) ÂrF n 4n m r ¤ 4 m) r n n n r B » ¹ 4m p ¹ § I 4 m r n 4 m) ÂF Á" n ¾ 4n r 4 m r ÀwA ¿gn n n 4 r 4 m r A n 4 p »¹ m ¹ § n m 4 r n¤ 4 m ) r m4 r 4 m) r B n m 4 r n¤ 4 m ) r m ) m ) n 4 r n 4 r n 4 eA n 4 p n 4 r n 4 y A n 4 p n 4 p y m ¤ m) m m ¤ m m m ¤ m § n n yeA n 4 ¼)p F n 4 r n 4 tA n 4 )¼F y mp m I m) m I 4m r 4m ¤ § ) n 64 p n 64 p m ¤ m) 1 1 n p m Ã I `¥ F Y Ã Å ¤ Å ¤ n § » ¹ 64 m p ¹ n I 4m ¥F Y Ã Å 18 minimizing Jonathan Richard Shewchuk . I shall derive a result for 2. In Figure 17. then the condition number is one. which depends on the starting point. we have (by Equation 12) Å v§ Æ ÅÇ ¤ (26) Æ . The latter ﬁgure gives us our best intuition for why a large condition number can be bad: forms a trough.1 2 2 2 (by symmetry of ) 2 2 2 2 2 2 1 1 2 2 2 2 2 3 2 (by Identities 21. the condition number is small. Figures 18(a) and 18(b) are examples with a large condition number. Figure 18 illustrates examples from near each of the four corners of Figure 17. which determines the rate of convergence of Steepest Descent. The slope of (relative to the coordinate system deﬁned by the 1 2 eigenvectors). one can see a faint ridge deﬁned by this line. These quadratic forms are graphed in the coordinate system deﬁned by their eigenvectors. The spectral condition number of is deﬁned to be 1. is graphed as a function of and in Figure 17. but is usually at its worst when is large (Figure 18(b)). again. 22. Assume that 1 2 . The graph conﬁrms my two examples. so the quadratic form is nearly spherical. and convergence is quick regardless of the starting point. we see that is zero. is denoted 2 1 . In Figures 18(c) and 18(d). Steepest Descent can converge quickly if a fortunate starting point is chosen (Figure 18(a)). Holding constant (because is ﬁxed). Figure 19 plots worst-case starting points for our sample matrix . With this norm. so convergence is instant. These starting points fall on the lines deﬁned by 2 1 . To demonstrate how the weights and eigenvalues affect convergence. and Steepest Descent bounces back and forth between the sides of the trough while making little progress along its length. We have 2 1 1 2 2 1 1 2 2 1 1 2 2 2 2 2 2 3 2 2 2 1 2 2 2 3 1 2 3 2 2 2 The value of .

(c) Small and . (d) Small . Ë   Ê Ë     Ë Ê   Ê Ê Ë Ë  number of .4 0. Convergence is fast when or are small. and are both large. For a ﬁxed matrix.6 100 80 60 0. convergence is worst when ÆÆ 20 Ã Å Ë Ì  ¤cÊ Ê . large .8 0. 2 (b) 1 1 Ë °  ¯­ ® Ê É Figure 17: Convergence of Steepest Descent as a function of (the slope of ) and (the condition ).2 0 20 40 15 5 10 10 2 (a) 6 4 2 -4 -2 -2 -4 6 4 2 -4 -2 -2 -4 2 4 2 4 6 4 2 1 -4 -2 -2 -4 2 4 2 (c) 2 (d) 6 4 2 1 -4 -2 -2 -4 2 4 Figure 18: These four examples represent points near the corresponding four corners of the graph in Figure 17. (a) Large .Convergence Analysis of Steepest Descent 19 0. small . (b) An example of poor convergence.

positive-deﬁnite matrix is deﬁned to be the ratio of the largest to smallest eigenvalue. The convergence results for Steepest Descent are Ò A Å i n Ñ » n Å vÎ W¹ 4 m p ¹ » W¹ m p ¹ 4 B 1 1 Å A ÅI Å A Å Å A Å A Å 4 2 4 4 3 4 4 3 3 2 2 (27) 0 and (28) Í Å § Æ ¥ i # ²ª o0 Å 4 Ï  Ð Ï § A Å   Å ABÅ I Å¼F BA ¼F Å Å ÅB B ¥ 20 Jonathan Richard Shewchuk 2 Î Î § § Ã Ã n 1 m¥  SË  Ã D  Å . Dashed lines represent steps toward convergence. the larger its condition number ). Each step taken intersects the paraboloid axes (gray arrows) at precisely a 45 angle.2. upper bound for (corresponding to the worst-case starting points) is found by setting 2 2 2: 1 5 5 5 2 2 1 1 1 1 Inequality 27 is plotted in Figure 20. the slower the convergence of Steepest Descent. 3 5. The more ill-conditioned the matrix (that is. In Section 9.4 0 2 -4 -2 2 4 6 -2 -4 -6 Figure 19: Solid lines represent the starting points that give the worst convergence for Steepest Descent. if the condition number of a symmetric. Here. If the ﬁrst iteration starts from a worst-case point. it is proven that Equation 27 is also valid for 2. so do all succeeding iterations.

8 0.6 0. The ﬁrst (horizontal) step leads to the correct 1 -coordinate.1. Wouldn’t it be better if. Conjugacy Steepest Descent often ﬁnds itself taking steps in the same direction as earlier steps (see Figure 8). In general. the second (vertical) step will hit home.4 0. so that we need never step in n p m Å n4 Ô m  n y   n 4 m Ô n 4 m eA n 4 2R§ n m¥ ¥ Ò A Å   Ñ B Å n4 p n m p n4m p ¤ n4 ) p m ¤ m) n V # Ô i ""  i n Ô i n Ô m    m m 4 m p 64 2¥ m Î n I F B m F § w¥ ¥ F Y Y ÓI I n 4 @¥¥ F YY I w  m B Ã n4 y m ¥ n Ô m (29) . After steps. and that step will be just the right length to line up evenly with . we’ll be done. using the coordinate axes as search directions. every time we took a step. Figure 21 illustrates this idea. 0 1 2 1 2 (by Equation 8) 0 0 1 1 2 7. The Method of Conjugate Directions 7. we’ll take exactly one step.The Method of Conjugate Directions 21 1 0. we got it right the ﬁrst time? Here’s an idea: let’s pick a set of orthogonal search directions 0 1 1 . In each search direction. use the fact that 1 should be orthogonal to .2 0 20 40 60 80 100 Figure 20: Convergence of Steepest Descent (per iteration) worsens as the condition number of the matrix increases. Notice that 1 is orthogonal to 0 . for each step we choose a point 1 To ﬁnd the value of .

this 1 be -orthogonal to orthogonality condition is equivalent to ﬁnding the minimum point along the search direction .4 2 -4 -2 2 4 6 -2 1 0 0 1 -4 -6 Figure 21: The Method of Orthogonal Directions. as in Figure 22(b). without knowing The solution is to make the search directions -orthogonal instead of orthogonal. the direction of again. Two vectors are -orthogonal. we haven’t accomplished anything. we have 1 0 0 (by Equation 29) Unfortunately. Imagine if this article were printed on bubble gum. the problem would already be solved. and . if 0 Figure 22(a) shows what -orthogonal vectors look like. and you grabbed Figure 22(a) by the ends and stretched it until the ellipses appeared circular. Using this condition. or conjugate. Not coincidentally. The vectors would then appear orthogonal. Our new requirement is that (see Figure 23(a)). this method only works if you already know the answer. because we can’t compute and if we knew . as in n4 p m n4 Ô m n4 Ô m ¥ n4 y m n   4 n4m Ôn4 Ô n m) B m p 4 m) Ô n4 Ô m n   m¥ ¥ n n § £ m Ô ¤ 4 m) Ô n p m § §ÕI n n y § 4m n 4 Ô n 4 ¡A n 4 ¼F n 4 Ô y m m m p m) 64 m p n 4 m ) Ô ¤ ¥ 22 Jonathan Richard Shewchuk 2 1 ¤ n Ô m n 64 p m n m @¥ ¤ n4 Ô m n4 p m ¤ n£ Ô m (30) . Unfortunately.

To see this.2 4 4 2 2 -4 -2 2 4 1 -4 -2 2 4 -2 -2 -4 -4 (a) (b) because these pairs of vectors are orthogonal. n4 y m   n Ô n Ô   4 4m n m 4 r n¤ 4 m ) Ô m ) n4 Ô n4m Ô n4m p ¤ n4 ) Ô B m ¤ m) n n § 64 n m p ¤ 4 m ) Ô n § 4 m Ô 64 y m) Ôr B n n § 64 m ¥ Ô ) I 64 m ¥ F  Y yÔ n § ÕI 64 2¥ F Y Ô m 1 1  W Figure 22: These pairs of vectors are -orthogonal 1 0 0 0 0 1 when the search directions are ¥ ¥ The Method of Conjugate Directions 23 2 ¥ n y § 4m §  ¥ ¤ 1 (31) (32) . we can calculate this expression. set the directional derivative to zero: 1 Following the derivation of Equation 30. here is the expression for -orthogonal: Unlike Equation 30. Note that if the search vector were the residual. Steepest Descent. this formula would be identical to the formula used by Steepest Descent.

(a) The ﬁrst step is taken along some direction 0 . This fact gives us a new way to look at the error term. we ﬁnd that . 0 0 The values of can be found by a mathematical trick.2 4 4 2 2 -4 -2 2 4 6 1 -4 -2 2 4 6 0 -4 -6 (a) Figure 23: The method of Conjugate Directions converges in steps. The minimum point 1 is chosen by the constraint that 1 must be -orthogonal to 0 . it is possible to eliminate all the values but one from Expression 33 by premultiplying the expression by : 0 0 0 (By Equation 29). (b) The initial error 0 can be expressed as a sum of -orthogonal components (gray arrows). Because the search directions are -orthogonal. express the error term as a linear ° ¦× ®  ° Ø­ ® n p m -2 Ö ¥ Ù 5£ n § mp V ·# n ± Ô n ±m Ô n ±m p ¤ n ± ) Ô m ¤ m) n ± Ô n ±m Ô n n y m ¤ A) n p n ± I 4 m Ô 4 m V 65±4 3 m ¼F ¤ m ) Ô n ± Ô n ±m Ô n m p ¤ n ± ) ÔÙ m n mn ± ¤ n ) ¤ m± Ô ¤ m ) ÔÙ ± m n£ Ô n ± Ôn£ £ m ¤ m) m · n4 Ù n y m Ú§ 4 m B  ° 9 ® £ ¥ 1 0 1 -2 0 -4 -6 (b) ¥ ¥ ¥ Ù § Ù n § ±m n n± § m p ¤ m) Ô n p n± Ô m ¤ m) § § ¥ n p m ° Ø­ ® £ n Ù m¥ ° ¥× ® n Ô n m m¥ 24 Jonathan Richard Shewchuk 2 1 (33) n± ¤ m) Ô . namely. in steps. As the following equation shows. Each step of Conjugate Directions eliminates one of these components. the process of building up component by component can also be Ô 0 (by -orthogonality of ¥ ¤ 1 0 Ô (by -orthogonality of vectors) vectors) (34) ¤ Ô   n £ m ¯£ 1  To prove that this procedure really does compute combination of search directions. By Equations 31 and 34.

set 0 . set 0. and for 0 0 The difﬁculty with using Gram-Schmidt conjugation in the method of Conjugate Directions is that all 3 operations the old search vectors must be kept in memory to construct each new one. Gram-Schmidt Conjugation All that is needed now is a set of -orthogonal search directions to generate them. Begin with two linearly independent vectors 0 and . the proof is complete. and . use the same trick used to ﬁnd Ù i n ± Ø± 4 mÔ Ü 5± A Û n Ô · 4 § 4m 1 4Û 4n m Ô # Û i i Û i Û    V ""  Ô  n 4 m ¯ Ô V4 áß n £ Ô n £m Ô m ¤ ) Ü n £ m Ô ¤ 4) Û B n£ A n£ Ô 4 Û ¤ m ) Ü Ô¯£ 4 m ¤) ± Ô Ø± 4 5 A n £ m Ô ¤ 4 ) Û V 4· ¶ ÝD  ¤  ¤ à D  n n § £ m Ô ¤ 4 m) Ô Ü ±4 § § £4 Ü  After iterations. and áß d(0) d(0) d(1) F I ÂÞ Ô 0 (by -orthogonality of áß ¤ 0 µ i n£ Ô D  m n£ Ô n± m ¤ m) 1 £ where the are deﬁned for . there is a simple way (36) : vectors) (37) . Fortunately. To construct any components that are not -orthogonal to the previous vectors (see Figure 24). After conjugation. every component is cut away. called a conjugate Gram-Schmidt process. 1 â Figure 24: Gram-Schmidt conjugation of two vectors. and   n £m Ô n £m Ù Ù n n 5£ n n m£ Ô £ m · B £ m Ô £ m V4 n n y 5£ A m£ Ô £ m · V4 1 0 1 0 n ²# § mp Ù 5 £ 4 V ·# 5£ V ·# n p m § § n § 4m p Û § n mÔ (35) 0. Suppose we have a set of linearly independent vectors 0 1 1 .The Method of Conjugate Directions viewed as a process of cutting down the error term component by component (see Figure 23(b)). The coordinate axes will . Set 0 to 0 . which is parallel to 0 . although more intelligent choices are possible. The vector 1 is composed of two components: . In other words. take and subtract out do in a pinch. To ﬁnd their values. and furthermore u0 u+ u1 u*   ã 0â ° ¦× ® â ãâ ® k ° ¥× ° ¥× ä â ® â ® k ° ¥× â 1 . which is -orthogonal (or conjugate) 0 . . only the -orthogonal portion remains. 0 25 1 0 1 7.2.

Optimality of the Error Term Conjugate Directions has an interesting property: it ﬁnds at every step the best solution within the bounds be the -dimensional of where it’s been allowed to explore. also known as Gaussian elimination.4 2 -4 -2 2 4 6 -2 -4 -6 Figure 25: The method of Conjugate Directions using the axial unit vectors. Conjugate Directions becomes equivalent to performing Gaussian elimination (see Figure 25). are required to generate the full set. the method of Conjugate Directions enjoyed little use until the discovery of CG — which is a method of Conjugate Directions — cured these disadvantages. An important key to understanding the method of Conjugate Directions (and also CG) is to notice that Figure 25 is just a stretched copy of Figure 21! Remember that when one is performing the method of Conjugate Directions (including CG). In fact. In fact. some authors derive CG by trying to minimize within 0 . What do I mean by “best 1 1 solution”? I mean that Conjugate Directions chooses the value from 0 that minimizes (see Figure 26).3. In the same way that the error term can be expressed as a linear combination of search directions n » W¹ 4 m p ¹  4 å An mp 4 å ¥ n » W¹ 4 m p ¹ 4 å An mp 4 å A n mp n4 p m ¥ 26 Jonathan Richard Shewchuk 2 1 i   i i Ô  n V 4 m Ô ""  n m Ô n m ¯ . the value is chosen from 0 . Where has it been allowed to explore? Let subspace span 0 . if the search vectors are constructed by conjugation of the axial unit vectors. 7. one is simultaneously performing the method of Orthogonal Directions in a stretched (scaled) space. As a result.

á e (0) Ù n£ Ô n£ Ôn£ m ¤ m) m ¤ Ù Ù4 n £ n n 5 ± n m± Ô ¤ m ) Ô ± m £ m · # V m@¥ » W¹ p ¹ n p m á d(0) á ¤ áß d(1) æ Y 4 å An m ¥ å An n m @¥ 45 £ V ·# 45 £ V ·# m¥ áß n Ô m 4 å An mp § § n » W¹ p ¹ m @¥ n » W¹ 4 m p ¹ n íì ­ }¦|ì m 2¥ p n I 4m ¥F Y n p m 0 e e (1) (2) . 1 appears perpendicular to 0 because they are -orthogonal. the point on 0 2 that minimizes . The ellipsoid is a contour on which the energy norm is constant.The Method of Conjugate Directions 27 (Equation 35). Hence. let’s turn to the intuition. why will be minimized over all of 0 ? n n Ô m m¥  n Ô m n4 p m   ¤ n »W¹ m p ¹ å A n m¥ n p m n Ô m Ô 2 (by -orthogonality of vectors). 1 is the point on 0 that minimizes 1 . è ç ® ¨° Ø­ ° ¯­ ® ë ®× ®× ° ¥g ° ¥êé n Ô m n m¥ ¤ » W¹ p ¹ » W¹ p ¹ å An ¤ ç ®­ è ç ® ° ¯k X8° Ø­ Figure 26: In this ﬁgure. Perhaps the best way to visualize the workings of Conjugate Directions is by comparing the space we are working in with a “stretched” space. why should we expect that will still be minimized in the direction of 0 ? After taking steps. the shaded area is span 0 2 0 0 1 . we have already seen in Section 7. Lines that appear perpendicular in these illustrations are -orthogonal. After two steps. after Conjugate Directions takes a second step. and stops at the point 1 . so 0 1 must be tangent at 1 to the circle that 1 lies on. lines that appear perpendicular in these illustrations are orthogonal. However. minimizing along a second search direction 1 . Any other vector chosen from 0 must have these same terms in its expansion. Figures 27(b) and 27(d) show the same drawings in spaces that are stretched (along the eigenvector axes) so that the ellipsoidal contour lines become spherical. takes a step in the direction of 0 . Why should we expect this to be the minimum point on 0 1 ? The answer is found in Figure 27(b): in this stretched space. On the other hand. The error vector 1 is a radius of the concentric circles that represent contours on which is constant. the Method of Conjugate Directions begins at 0 . Figures 27(a) and 27(c) demonstrate the behavior of Conjugate Directions in 2 and 3 . Conjugate Directions ﬁnds 2 . 1 1 (by Equation 35) 1 Each term in this summation is associated with a search direction that has not yet been traversed. In Figure 27(a). which proves that must have the minimum energy norm. its energy norm can be expressed as a summation. as in Figure 22. where the error vector 1 is -orthogonal to 0 . Having proven the optimality with equations. 1 This is not surprising.1 that -conjugacy of the search direction and the error term is equivalent to minimizing (and therefore ) along the search direction.

The plane 0 tangent to the inner ellipsoid at 2 .28 Jonathan Richard Shewchuk 0 0 1 1 1 1 (a) 0 0 1 0 1 0 0 2 0 (c) Figure 27: Optimality of the Method of Conjugate Directions. Lines that appear perpendicular are -orthogonal. Two concentric ellipsoids are shown. The line 0 1 is tangent to the outer ellipsoid at 2 is 1 . (b) The same problem in a “stretched” space. Lines that appear perpendicular are orthogonal. (d) Stretched view of the three-dimensional problem. is at the center of both. (c) A three-dimensional problem. (a) A two-dimensional problem.  èfî° 9 ç ® ° 0 ® n Ô m 1 1 2 0 (d) å An 1 0 2 1 m @¥ 1 1 n m @¥ n ¥ m @¥ n m Ô n n mr m 2¥ å An m¥ n Ô m 0 1 0 (b) ¥ n p m n r m n Ô m n m 2¥ 1 å An 0 1 1 1 1 n 2 0 m¥ m¥ å A n @¥ m ° 0 ® èfî° 9 ç ® ¥ n m ¥ n n Ô n p m m n r m ¥n m 2¥ m2¥ n Ô m n n mr m¥ å An m ¥  å An m¥ n m @¥ n Ô m n Ô m .

because 0 is tangent at 1 to a circle whose center is . The point 2 would be at the center of the paper. the hyperplane 0 is tangent to the ellipsoid on which lies. If 1 points to the minimum point. we need never step in that direction again. you have the Method of Conjugate Directions. the sight you’d see would be identical to Figure 27(b). in a direction -orthogonal to 2 . and the residual is orthogonal to these previous vectors as well (see Figure 28). It follows that is orthogonal to as well. and is tangent to the smaller ellipsoid at 2 . To show this fact mathematically. pulling a string connected to a bead that is constrained to lie in 0 expanding subspace is enlarged by a dimension. Recall that once we take a step in a search direction. which is the point in 0 2 closest to . they are perpendicular in this diagram.   0 (by Equation 39) i   i V 4 Û ""  Û Ô 0 (by -orthogonality of -vectors). suppose you were staring down at the plane 0 2 as if it were a sheet of paper. underneath the plane. Because the search directions are constructed from the vectors. and the point would lie underneath the paper. Each time the solution point . the bead becomes free to move a little closer to you. it would be straight down from 2 to . a three-dimensional example is more revealing. we will succeed. However. We have seen that. the error term is evermore -orthogonal to all the old search directions. The point 1 lies on the outer ellipsoid. Because 0 and 1 are perpendicular. Figures 27(c) and 27(d) each show two concentric ellipsoids. 0 and 1 appear perpendicular because they are -orthogonal. Recall from Section 4 that the residual at any point is orthogonal to the ellipsoidal surface at that point. Now if you stretch the space so it looks like Figure 27(c). but we are constrained to walk along the search direction 1 . Another important property of Conjugate Directions is visible in these illustrations. Another way to understand what is happening in Figure 27(d) is to imagine yourself standing at the . From Equation 40 (and Figure 28). and 2 lies on the inner ellipsoid. (40) (41) (42) n Ô m n Ô m n å m A ¥n m n Ô 2¥ m n4 r m n n m @¥ m ¥ n ¤ 4 m) Ô B n Ô m ¥ 4 å An m¥ ¥ n4 n m¥ m¥ n Ô m ¤ ¤ å A n 2¥ m n å A m n¥ m ¥   n 4 m r 4) Û § n 4 m r n 4 m ) Ô n µ  in£  Ü 5± m n n± Ô An m£ r m ) ï± 4 · £ m V4 n£ r m m ¥ Û Û µ o i  ¤ Ù4 n n n 5 £ m£ Ô ¤ 4 m ) Ô £ m · # B V å A n 2¥ m ¤ n Ô m n m¥ ¹ p¹ n Ô m n r 4) Û r 4) Û n m¥ m 2¥ 4 å An m¥ n4 p n ¤ B m ¦u§ 4 m r å § § n Ô¥ m § n n § £ m r 4 m) Ô n £ p n4 Ô m ¤ m) B n £ rn4 Ô m m) n4 r m n Ô m å n Ô m ¤ n Ô m å An n m¥ m¥ 4 å ¥ ¥ n Ô m ¥ 4 å ¥ . Look carefully at these ﬁgures: the plane 0 2 slices through the larger ellipsoid. The plane 2 0 is tangent to the sphere on which 2 lies. premultiply Equation 35 by : 1 (38) (39) We could have derived this identity by another tack. and on 0 want to walk to the point that minimizes 2 . Is there any reason to expect that 1 will point the right way? Figure 27(d) supplies an answer. Now. If you took a third step. directly under the point 2 . The point is at the center of the ellipsoids. This is proven by taking the inner product of Equation 36 and : 1 0 There is one more identity we will use later. the residual is evermore orthogonal to all the old search directions. 1 points directly to 2 . we can rephrase our question.The Method of Conjugate Directions 29 In Figure 27(b). It is clear that 1 must point to the solution . Looking at Figure 27(c). the subspace spanned by 0 1 is . at each step. Because 1 is -orthogonal to 0 . Suppose you and I are standing at 1 . Because .

linearly independent search direction unless the residual is zero. The endpoints of 2 and 2 lie on a plane parallel to 2 . The Method of Conjugate Gradients It may seem odd that an article about CG doesn’t describe CG until page 30. they span the same subspace 2 (the gray-colored plane). The error term 2 is -orthogonal to 2 . First. Equation 41 becomes Interestingly. this fact implies that each new subspace 1 is formed 1 1 and the subspace . This choice makes sense for many reasons. so it’s guaranteed always to produce a new. I note that as with the method of Steepest Descent. CG is simply the method of Conjugate Directions where the search directions are constructed by conjugation of the residuals (that is. Recalling that .30 Jonathan Richard Shewchuk r(2) Figure 28: Because the search directions 0 1 are constructed from the vectors 0 1 . in which case the problem is already solved. it is also orthogonal to the previous residuals (see Figure 29). but all the machinery is now in place. As each residual is orthogonal to the previous search subspace span 0 1 1 directions. from the union of the previous subspace span span 0 0 2 2 0 1 1 0 0 0 0 0 4 å   n m r n m Ô V4 V4 in ¤i n m r ¤ mÔ i 0 ° ¯ð ® è â P â â è  â   n4 Ô ¤n4 y n4 r n n m y¡A m n 4 ¼Bp F m I 4m Ô 4m ¤ n 6m 4 p B ¤ m B ° ¥× ®  ° Ø­ ® i   in ¤ i ""  i n m r    ¤ ""  m Ô 4 å ¤ n4 r m l§   µ  à n4 r 4 Û m § è ° ¦× ® ° ¥g ° ¥× ®× ® 4 å n n § £ m r 4 m) r i n r ¤  § in mÔ ¤ m ¯ § 4 å å ñ4 4 å pn V 4 m Ô § n 64 r m § § áß áß i   i i r  n V 4 m r ""  n m r n m   u 2 d (2) e (2) d(1) d(0) u1 u0 á á áß ° ¥× ® n V4 Ô m ¤ è è â (43) (44) . the is equal to . by setting ). To conclude this section. the number of matrix-vector products per iteration can be reduced to one by using a recurrence to ﬁnd the residual: 1 1 8. the residual 2 is orthogonal to 2 . In fact. Hence. because 2 is constructed from 2 by Gram-Schmidt conjugation. Let’s consider the implications of this choice. the residuals worked for Steepest Descent. and a new search direction 2 is constructed (from 2 ) to be -orthogonal to 2 . there is an even better reason to choose the residual. Equation 43 shows that each new residual is just a linear combination of the previous residual and . As we shall see. the residual has the nice property that it’s orthogonal to the previous search directions (Equation 39). Because the search vectors are built from the residuals. so why not for Conjugate Directions? Second.

because 1 is already -orthogonal to all of the previous search directions except Recall from Equation 37 that the Gram-Schmidt constants are and Equation 43. most of the to ensure the -orthogonality of new search vectors. because both the space complexity and time complexity per iteration are reduced from 2 to . It is no longer necessary to store old search vectors As if by magic.) 1 1 1 1 1 1 (By Equation 37. This major advance is what makes CG as important an algorithm as it is. Henceforth. each new residual is orthogonal to all the previous residuals and search directions. I shall use the abbreviation 1 . simplify this expression. the fact that the next residual 1 (Equation 39) implies that is orthogonal to is -orthogonal to . 2 is a linear combination of 2 and 1 . Taking the inner product of 1 . The endpoints of 2 and 2 lie on a plane parallel to 2 (the shaded subspace).The Method of Conjugate Gradients 31 d(0) Figure 29: In the method of Conjugate Gradients. and each new search direction is constructed (from the residual) to be -orthogonal to all the previous residuals and search directions. let us 1 1 1 1 0 1 otherwise (By Equation 44. It has a pleasing property: because is included in 1 . Gram-Schmidt conjugation 1 1 ! becomes easy. a subspace created by repeatedly applying a matrix to a vector. Simplifying further: (by Equation 32) 1 1 1 1   (by Equation 42) n4 Ô m n 64 r m n £ Ô n £ \$ª n £ Ô n 4 r m ¤ m ) Ô m ¤ m) ò§ £ 4 B ° ¥× ® ° ¥× ® ° ð ® ° ð ® 4 å   A µ D  i A µ  §  i A µ  i ¦µ § §  Ü 64 å á ¤ ¤ n4 r m ° ¥× ® n m46 r å 4 ¤ ý i ö ÷ ý ÷ i ù 6ûø"z ü ù }ü 6ûúø÷ z ù 6ûø{ ó xùø÷ »xùúø÷ õ i ôôö ÷ i n 4 r n 4 r ù ûúø{ ô m i ÷ n 4m) r n 4 m r xùø{ B óô õ m ) n  £ rn4 r n £ rn4 r m m B m m n £ Ô n4 rn £ ) y n £ rn4) r m ¤ m) m B m m ) á n V4 r n V4m r n4m r n4 r ) m m) n V4 r n V4m Ô n4m r n4 r ) m m) áß d(1) r(1) r(0) á áß r(2) ¤ á d (2) e (2) § § áß n4 m Ü £4 n 64 r m Ü n n § £ m Ô ¤ 4 m) r n n n y § £m Ô ¤ 4m r £m n  £ r n 4) r m m) § Ü § £4 ¢ þ 64 å è Ü Ü n Vÿ þ F 4 Â4 §  F4 m I ÂÞ I ÂÞ ¤  . This subspace is called a Krylov subspace. In CG.) 0 terms have disappeared. where is the number of nonzero entries of .

“Conjugated Gradients” would be more accurate. Let’s put it all together into one piece now. 9. so why should we care about convergence analysis? In practice. and cancellation i (by Equations 32 and 42) ¥ Ü   4n m Ô n 64 m A n 64 m r § n 64 m Ô Ü i n4 r n4m r n n 64 m r n ) 64 r § 64 m i n Ô n m y m) n r n r i n 4 m Ô n ¤ 4 m eA Bn 4 m § n 64 m 4 m 4 m y 4 m R§ 64 m ¥ ¥ n4 Ô n4m Ô n y n m 4 r n¤ 4 m ) r § 4 m m ) n n m ¥ ¤ B © @¦c§ m r § m Ô in ¥ ¥ 32 Jonathan Richard Shewchuk 2 1 n m ¥  (45) (46) (47) (48) (49) . and the conjugate directions are not all gradients. Convergence Analysis of Conjugate Gradients CG is complete after iterations. because the gradients are not conjugate. The name “Conjugate Gradients” is a bit of a misnomer. The method of Conjugate Gradients is: 0 0 0 1 1 1 1 1 1 1 1 The performance of CG on our sample problem is demonstrated in Figure 30. accumulated ﬂoating point roundoff error causes the residual to gradually lose accuracy.4 2 -4 -2 2 4 6 0 -2 -4 -6 Figure 30: The method of Conjugate Gradients.

but the precise relationship is not important here. Picking Perfect Polynomials We have seen that at each step of CG. and interest only resurged when evidence for its effectiveness as an iterative procedure was published in the seventies. As in the analysis of Steepest Descent. where ¤ § I ¦¢¤ F § § 4 å 1 0 0 § § XI F 4 § £ £ A  4 § § I §F § . Section 6. can take either a scalar or a matrix as its argument. for instance.3 that CG chooses the coefﬁcients that minimize .1. then 2 2 2 . express 0 as a linear combination of orthogonal unit eigenvectors 0 1 and we ﬁnd that § 2 2 2 n p m n » W¹ 4 m p ¹   §  ¤  §  ¤   I §F 4 £ £ § £ Ü n4 n4 y m m 5£ A n  n m p £ ¤ £ ·  § 4m p 4  i  in in in   n m p 4 ¤ ""i    m i p ¤ m i p ¤ i m p ¤  r     n m r V 4 ¤ ""  n m r ¤ n m r ¤ n m  n4 p 4 å An mp m is chosen from 2 0 3 0 0 0 2 0 0 0 1 ¸ £   £8 v £  £ I 8§F 4 t ¸ £· £ £ "8 £  I 8§F 4 £ · ¸ £ £ £  I 8§F 4 £  £ in p F m q¤ 4 I ¥¤¦ i ¸ £ £ ""£ 5 § n m p #·  I §F 4  §¥ ¥§ £ n p m § § · §  I¢¤  F 4 A n § 4m p n § 4m p ¤ n4 p m § § ¢¡   n » ¹ 4m p ¹ § ¤ 0 . and so has our outlook. 2 2 1. for if 2 2 . (Note that . but the latter problem is not easily curable. and so on. Today. Let’s examine the effect of applying this polynomial to 0 . Times have changed. The expression in parentheses above can be expressed as a polynomial. and will evaluate to the same. Convergence analysis is seen less as a ward against ﬂoating point error. Let be a polynomial of degree . the error term has the form The coefﬁcients are related to the values and . and more as a proof that CG is useful for problems that are out of the reach of any exact algorithm. Because of this loss of conjugacy.1 describes the conditions under which CG converges on the ﬁrst iteration. The ﬁrst iteration of CG is identical to the ﬁrst iteration of Steepest Descent. 9. For a ﬁxed . so without changes. The former problem could be dealt with as it was for Steepest Descent. convergence analysis is important because CG is commonly used for problems so large it is not feasible to run even iterations. What is important is the proof in Section 7. CG chooses this polynomial when it chooses the coefﬁcients. This ﬂexible notation comes in handy for eigenvectors.) which you should convince yourself that .Convergence Analysis of Conjugate Gradients 33 error causes the search vectors to lose -orthogonality. the mathematical community discarded CG during the 1960s. 2 Now we can express the error term as 0 if we require that 0 1. the value span span Krylov subspaces such as this have another pleasing property.

There is only one polynomial of degree zero that satisﬁes 0 0 1. we have of the worst eigenvector. This is because a polynomial of degree two can be ﬁt to three points ( 2 0 1. Note that 1 2 5 9 and 1 7 5 9. the that minimizes this expression for our sample problem with eigenvalues 2 and 7.75 -1 2 7 1 0. and thereby accommodate separate eigenvalues. graphed in Figure 31(a).25 -0.25 -0. but convergence is only as good as the convergence be the set of eigenvalues of .5 -0.34 0 Jonathan Richard Shewchuk (a) 1 1 0. after two iterations. In general. Equation 50 evaluates to zero.75 0.25 -0. 2 2 0. given the constraint that 0 1. and 2 7 0).25 -0. a polynomial of degree can ﬁt 1 points. however. Letting Λ min max Λ Λ min max Figure 31 illustrates. Be forewarned.75 0.25 -0. and so the energy norm of the error term after one iteration of CG is no greater than 5 9 its initial value. The optimal polynomial of degree one is 1 1 2 9.5 -0.75 0. the number of iterations required to compute an exact solution is at most the number of distinct eigenvalues. graphed in Figure 31(b).75 0. 2 0 2   Figure 31: The convergence of CG after iterations depends on how close a polynomial ¨    §F 4 t n m»  §F 4 t n m»  I §F  I §F ¨ § §    § TI F § ¤I F Î §  n » ¹ 4m p ¹ I ¢¤ F § § ¤  I §F  I §F  § ¤I F § § § § sI F §  dI §F (b) of degree can (50) § § . their eigenvalues may be omitted from consideration in Equation 50.25 -0.5 0. that these eigenvectors may be reintroduced by ﬂoating point roundoff error. (There is one other possibility for early termination: 0 may already be -orthogonal to some of the eigenvectors of . and that is 0 1. If eigenvectors are missing from the expansion of 0 .5 0.75 -1 2 7 2 (c) 2 (d) 1 0. The foregoing discussing reinforces our understanding that CG yields the exact result after iterations. Given inﬁnite ﬂoating point precision. Figure 31(c) shows that. and furthermore proves that CG is quicker if there are duplicated eigenvalues.75 -1 2 7 1 0.25 -0.75 -1 2 7 CG ﬁnds the polynomial that minimizes this expression.5 -0.) ª¥ B A    §I §F § ¦I F ¤ n §§ m¥  n m 2¥ ª ª ª B´§pI F § 4 § §  ø © §  ø © 2   »¹n m p¹ vI ¸ £ £ 8 £ v · I 2 2 ¤  2 y  be to zero on each eigenvalue.25 -0.5 0. for several values of .5 -0.5 0.

5 1 -1 -0.5 -1 -1. The polynomials that accomplish this are based on Chebyshev polynomials.) Several Chebyshev polynomials are illustrated in Figure 32. and ﬂoating point roundoff occurs.  I ÃF 4   v i xBt ñ Ã   B v i xBt l ñ Ã  A Ã F  1 2 2 2 1 Ð Ï  s # 4 0 Ï Ã   I B Ã 4  ¤ v  o0 i # 4  t Ð Ï Ï Ð Ï  o0 A B ÃF 4I B Ã # 4 0 Ï Ã 2 1.5 1 -1 -0.5 I ÃF   I ÃF -1 -0.Convergence Analysis of Conjugate Gradients 2 5 35 10 49 -1 -0. try working it out for equal to 1 or 2.5 -1 -1.5 -2 0.5 1 0. If we know something about the characteristics of the eigenvalues of . I shall assume the most general case: the eigenvalues are evenly distributed between and . and furthermore that is maximum on the domain 1 1 among all such polynomials. because it is easier for CG to choose a than when they are irregularly distributed between polynomial that makes Equation 50 small.5 2 1.5 1 0.5 -1 -1.5 1 Figure 32: Chebyshev polynomials of degree 2.5 1 0.5 I ÃF  I ÃF § I Ã F 4     I ÃF4  Î  I ÃF4   . We also ﬁnd that CG converges more quickly when eigenvalues are clustered together (Figure 31(d)) and .5 -2 0.5 -2 0. Chebyshev Polynomials A useful approach is to minimize Equation 50 over the range rather than at a ﬁnite number of points. it is sometimes possible to suggest a polynomial that leads to a proof of a fast convergence.5 -2 0.5 -0.5 1 0. Loosely speaking.5 2 1. however. increases as quickly as possible outside the boxes in the illustration.2.5 -0. 5. 10.5 -0. the number of distinct eigenvalues is large. The Chebyshev polynomial of degree is 1 (If this expression doesn’t look like a polynomial to you. and 49. 9. The Chebyshev polynomials have the property that 1 (in fact.5 -0. For the remainder of this analysis. they oscillate between 1 and 1) on the domain 1 1 .5 1 Ã Ã 2 1.5 -1 -1.

Compare with Figure 31(c).36 2 Jonathan Richard Shewchuk 0. where it is known that there are only two eigenvalues.25 1 -0. The denominator enforces our requirement that 0 1.5 0. so it is more common to express the convergence of CG with the weaker inequality 2 1 1 0   W¹ n m p ¹ » Á A Å À   n Å © » W¹ m p ¹ 4 B © Á 1 1 1 1 ( V %4 1 1 1 0 n »¹ mp¹ 1 0 1 0 Ð Ï Ï  o0 Î  Î # 4 0  ¦ Ð Ï  o0 §I F 4 § ¿  øø ' V &0 ' ¾ 11 ))   ¿  ø '  &0 ' 4 ¾    V  ø ' V ' &1 0&1) ('0) (' 4      A Å B Å Á © ÀwA BA ÅÅ © À ! § © 4 Ò ©  n BA ÅÅ Ñ 4 § »W¹ m p ¹ Ò # 4  V B  o0 Ï Ð Ï # Ï A Ï Ñ 4 Î V 4   Ðo0 # 4 0 Ï §  I §F 4 Î W¹ n 4 m p ¹ » §    y Figure 33: The polynomial &%\$  #  "!  n » ¹ 4m p ¹  I §F 1 ¨ § (51) (52) . 2 It is shown in Appendix C3 that Equation 50 is minimized by choosing 2 This polynomial has the oscillating properties of Chebyshev polynomials on the domain (see Figure 33).183 times its initial value.75 -1 2 3 4 5 6 7 8 that minimizes Equation 50 for 2 and 7 in the general case.75 0. The energy norm of the error term after two iterations is no greater than 0. The numerator has a maximum value of one on the interval between and . so from Equation 50 we have 2 The second addend inside the square brackets converges to zero as grows. This curve is a scaled version of the Chebyshev polynomial of degree 2.5 -0.25 -0.

Figure 34 charts the convergence per iteration of CG.8 0. it is clear that the convergence of CG is much quicker than that of Steepest Descent (see Figure 35). the ﬁrst iteration of CG is an iteration of Steepest Descent. . operations.Complexity 37 1 0. Suppose we wish to perform enough iterations to reduce the norm of the error by a factor of . 0 .6 0. For many problems. for example. 10. In practice. that is. ignoring the lost factor of 2. CG usually converges faster than Equation 52 would suggest. matrix-vector multiplication requires is sparse and entries in the matrix. the convergence result for Steepest Descent. it is not necessarily true that every iteration of CG enjoys faster convergence. where is the number of non-zero In general. Setting Equation 28. because of good eigenvalue distributions or good starting points. Complexity The dominating operations during an iteration of either Steepest Descent or CG are matrix-vector products.4 0. However. we obtain The ﬁrst step of CG is identical to a step of Steepest Descent. This is just the linear polynomial case illustrated in Figure 31(b). Comparing Equations 52 and 28. Equation 28 can be used to show that the maximum number of iterations required to achieve this bound using Steepest Descent is whereas Equation 52 suggests that the maximum number of iterations CG requires is   6 \$Ò Å © 1 2 ln 2 i Ò 6 \$ 1 1 ln 2 2 ¤ Å þ §  5Ñ 5Ñ F I ÂþÞ Å 3 4Î  3 7Î  Ã n ¹ mp¹ 2 F ñ I ÂÞ þ Î ¹ n4m p ¹ . Compare with Figure 20.2 0 20 40 60 80 100 Figure 34: Convergence of Conjugate Gradients (per iteration) as a function of condition number. including those listed in the introduction. The factor of 2 in Equation 52 allows CG a little slack for these poor iterations. 1 in Equation 51.

Starting and Stopping In the preceding presentation of the Steepest Descent and Conjugate Gradient algorithms. a division by zero will result. Stopping When Steepest Descent or CG reaches the minimum point.1. 11. either Steepest Descent or CG will eventually converge when used to solve linear systems. versus 3 2 for CG. It seems. and the choice of starting point will determine which minimum the procedure converges to. If you have a rough estimate of the value of . and Steepest Descent has a time complexity 5 3 for three-dimensional problems. how to choose a starting point. then. and if Equation 11 or 48 is evaluated an iteration later. Both algorithms have a space complexity of . several details have been omitted. that one must stop ¥ Å ÂþÞ F Å FI I fÂþÞ F I ½ ÂÞ F I ½F ÂÞ ñ I z ½ ÂÞ Å Ã 38 Jonathan Richard Shewchuk n § m ¥ F I ½ ÂÂÞ Þ F Ô I n m¥ þF I Å © ÂÞ . and when to stop. whereas CG has a time complexity of . the residual becomes zero. I conclude that Steepest Descent has a time complexity of . because there may be several local minima. particularly. Nonlinear minimization (coming up in Section 14) is trickier. set 0 0. Thus. Steepest Descent has a time complexity of 2 for two-dimensional problems. If not. Finite difference and ﬁnite element approximations of second-order elliptic boundary value problems 2 posed on -dimensional domains often have . or whether it will converge at all.2. use it as the starting value 0 .40 35 30 25 20 15 10 5 0 200 400 600 800 1000 Figure 35: Number of iterations of Steepest Descent required to match one iteration of CG. versus 4 3 for CG. Starting There’s not much to say about starting. though. of 11. 11.

though. We can circumvent this difﬁculty. To complicate things. often. because for every symmetric. Suppose that symmetric. accumulated roundoff error in the recursive formulation of the residual (Equation 47) may yield a false zero residual. positive-deﬁnite there is a (not necessarily unique) matrix that has the property that . Usually. then for . even if and are. one wishes to stop before convergence is complete. (Such an can be obtained. this problem could be resolved by restarting with Equation 45. we can iteratively If . this value is some small fraction of the initial residual ( 0 ). 12. can be found by Steepest Descent or CG. but is easier to invert. it is customary to stop when the norm of the residual falls below a speciﬁed value. positive-deﬁnite matrix that approximates . This instance. or if the eigenvalues of 1 is not generally solve Equation 53 more quickly than the original problem. See Appendix B for sample code. The catch is that symmetric nor deﬁnite. for 1 and 1 have the same eigenvalues. then 1 is true because if is an eigenvector of is an eigenvector of with eigenvalue : 1 1 1 The system can be transformed into the problem 1 1 1 which we solve ﬁrst for . Preconditioning Preconditioning is a technique for improving the condition number of a matrix. by Cholesky factorization. The process of using CG to solve this system is called the Transformed Preconditioned Conjugate Gradient Method: 0 0 0 1 1 1 1 1 C C C C C 1 1 1 1 C 1 in 1 1 ) V  ¤ V  C¥ 9 ¤ V     ) 9 §  ¤ V ¤  9 )   ) V  ¤ V  £ ¤ V § u)  i ¥ )  ¨¥ § Ü n An  n n m4 Ô 64 m m46 r § 64 m Ô Ü i n4 r n4m r n n 64 m r n ) 64 m r § 4 m ) m i 4n m Ô V  ¤ V  n 4 m y B n 4 m r § n 64 m r )i n 4 Ô n 4 eA n 4 n y m m m 2¥ § 64 2¥ m i n 4 m Ô V  ¤ V  n 4 m ) Ô n y ) n4 rn4 r § 4m m m) n n m ) 2¥ V  ¤ V  s© V  § m r § m Ô B 9) C \$ 9  §  ¤ V  I V  )  F I  )  F I V  ¤ V  F § ) ) C C C ) V  ¤ V  ¥ i © V  ¥ V  ¤ V  § ) C C 9 1 1 © R¤ § ¥ 9 n ¹ mr¹  © V 2 8 ¹ n 4 m r ¹ ¤ is a ¤ V § ¥ ¦¤ V 9 C C C 9 C \$ C ¤ V 9 (53) C C C ¤ 9 C C 9  C¥ © § ¥ ¤ I F ¢¤ @Å   AI F Bq¤ V @9@Å . Because the error term is not available. however. We can solve indirectly by solving 1 1 are better clustered than those of .Preconditioning 39 immediately when the residual is zero.) The matrices 1 with eigenvalue . Because is symmetric and positive-deﬁnite.

it is only necessary to be able to once per iteration. only to derive a Preconditioned Steepest Descent Method that does not use . it is possible The matrix does not appear in these equations. However. The problem remains of ﬁnding a preconditioner that approximates 1 well enough to improve convergence enough to make up for the cost of computing the product 1 . and I can only scratch the surface here. we derive the Untransformed Preconditioned Conjugate Gradient and Method: 0 0 0 1 0 1 1 1 1 1 1 1 1 1 1 1 is needed. 1 . must be computed. By the same means.) Within this constraint. and using the identities variable substitutions can eliminate . there is a surprisingly rich supply compute the effect of applying of possibilities.-4 -2 2 4 6 -2 -4 -6 -8 Figure 36: Contour lines of the quadratic form of the diagonally preconditioned sample problem. and occais determined by the condition number of The effectiveness of a preconditioner sionally by its clustering of eigenvalues. a few careful This method has the undesirable characteristic that 1 and . (It is not necessary to explicitly compute or 1 to a vector. Setting 1 1 . n4 r m V 9 ¤ V 9  V 9 9 9 9 1 1 n4 Ô  n4 Ô m ) § m ¥ V Ü   4n m Ô n 64 m A n 64 m r V § n 64 m Ô Ü n4 r i n4 r n m n 64 m r V n ) 64 r § 4 m i n Ô m n V y n m) r n r i 4 m ¤ 4 m B 4 m § n 4 m n 4 Ô n 4 eA n 4 y ¥ m m m R§ 64 m ¥ i n4 Ô n4m Ô n y n 4 r m ¤ )n 4 r § 4 m i m V m) n n r m V § mÔ n m¥ ¤ B © 2o§ m r C 4n m r V  § n 4 m r  in 9 9 9 C ¥ 40 Jonathan Richard Shewchuk 2 1 9 9 V V 9 9 § V  V  )  n4 n ¥ m 2)  § 4 m ¥ ¤ C .

1. these different expressions are no longer equivalent. Outline of the Nonlinear Conjugate Gradient Method To derive nonlinear CG.14. In nonlinear CG. expression In linear CG. 14. there are several equivalent expressions for the value of . e 0. In nonlinear CG. To restart CG is to forget the past search Using this value is equivalent to restarting CG if directions. cycle inﬁnitely without converging. such as engineering design. e Fortunately. Polake Ribi` re often converges much more quickly. and nonlinear regression. whereas the Polak-Ribi` re method can. and the Polak-Ribi` re e formula: 1 1 1 1 The Fletcher-Reeves method converges if the starting point is sufﬁciently close to the desired minimum. The Nonlinear Conjugate Gradient Method CG can be used not only to ﬁnd the minimum point of a quadratic form. Performing a line search along this search direction is much more difﬁcult than in the linear case. neural net training. a value of that the gradient is orthogonal to the search direction. there are three changes to the linear algorithm: the recursive formula for the residual cannot be used. and a variety of procedures that minimizes is found by ensuring can be used. researchers are still investigating the best choice. but to minimize any continuous function for which the gradient can be computed. Jonathan Richard Shewchuk (in Equation 46) is computed by taking the inner product of Ü n4 rn4m r n n 4 m r n ) 64 m r § 64 m ) m n 4 Ô v n 4 Ô n 4 eA n 4 F y m )I m m m 2¥  tY RP IQ I w¥ F Y Ü . Here is an outline of the nonlinear CG method: 0 0 0 Find that minimizes 1 1 1 i 1 or 1 max 0 S 1 1 1 1  i R © Ü  § Ü i 4n m n n I 4m r B Ü r n 4 m) r n n ³ § 64 m m46 ÂrF 64 m) r i n n r ¥F B i n I Ô n 64 m eA nY C§ n 64 m y 4 ¥ m R§ 64 m i n n tA 4 m n 4 m y I 4 m Ô 4 m i 4 @¥ F Y m n n n I m ¥ F  Y C§ m r § m B  R 1 1 Ô ¤ n n I 4 m ¥ F  Y C§ 4 m r B n n ¡A n 4 F y m¥ Y I 4m Ô 4m Ü n n 4n m r 4 m) r   n n n § 64 m I 4 m r B 64 m ÂrF 64 m) r y Ü R © © Ü n4 y m Ü i n4 r n4m r n n 64 m r n ) 4 r § 64 m m) m ¥ Ô ¤) ¤ )Ô Ô n4 y m Y 42 stability is improved if the value with itself. As with the linear CG. The search directions are computed by Gram-Schmidt conjugation of the residuals as with linear CG. the residual is always set to the negation of the gradient. and there are several different choices for . convergence of the Polak-Ribi` re method can be guaranteed by choosing max 0 . However. . We can use any algorithm that ﬁnds the zeros of the . which we used in linear CG for its ease of computation. and start CG anew in the direction of steepest descent. Applications include a variety of optimization problems. it becomes more complicated to compute the step size . in rare cases. Two choices are the Fletcher-Reeves formula.

Figure 37(a) is a function with many local minima. e Because CG can only generate conjugate vectors in an -dimensional space. Both methods require that be twice continuously differentiable. In this example. and may not even ﬁnd a local minimum if has no lower bound. this function is deceptively difﬁcult to minimize.) 14.   Ô I `¥ F  Y ) Ô { { 5 h Ô eA F y Ô y A 5 I I ¥ Y Ô f y 2 2 0 2 2 0 2 1 2 2 . then an efﬁcient algorithm for polynomial zero-ﬁnding can be used. Figure 37(b) demonstrates the convergence of nonlinear CG with the Fletcher-Reeves formula. Ô ) Y Y     "\$" ""   (     &  " " &&    ""     '&&%  "  " y ` `   n 4 m Ô n 64 m 1  y YX Ô tA y Ô `¥ F  Y ) Ô eA I y A Ô ) v w¥ F  tY I yÔ ¥F Y Ô Ü " "    "  "  ` ` y eA F IÔ ¥ Y A n 64 r n 64 Ô m § m y "¯"   ! § I F " " w¥  Y    "  "   ` ` Vy I T y WeA w¥ F Y UI Ô eA Ô ) v `¥ F  tY I y eA F I w¥ Y   y § T y UI Ô eA Y Y Y ¥F Y yÔ ¥F Y Ô Y I `¥ F  Y  Y (56) (57) . The Newton-Raphson method relies on the Taylor series approximation 2 2 where is the Hessian matrix 2 2 1 2 2 1 1 1 2 2 2 2 2 2 2 1 2 `  ¯"  ` ` . .. it makes sense to restart CG every iterations. Notice that there are several minima. corresponding to the ﬁrst line search in Figure 37(b). General Line Search Depending on the value of . However. The less similar is to a quadratic function. . For instance. it might be possible to use a fast algorithm to ﬁnd the zeros of . Figure 38 shows the effect when nonlinear CG is restarted every second iteration. CG is not nearly as effective as in the linear case. we will only consider general-purpose algorithms. Figure 37(c) shows a cross-section of the surface. Two iterative methods for zero-ﬁnding are the Newton-Raphson method and the Secant method. (It will become clear shortly that the term “conjugacy” still has some meaning in nonlinear CG. both the Fletcher-Reeves method and the Polak-Ribi` re e method appear the same. Newton-Raphson also requires that it be possible to compute the second derivative of with respect to . . if is polynomial in . Figure 37 illustrates nonlinear CG. especially if is small. . . . CG is not guaranteed to converge to the global minimum. the line search ﬁnds a value of corresponding to a nearby minimum. (For this particular example. Figure 37(d) shows the superior convergence of Polak-Ribi` re CG.2.The Nonlinear Conjugate Gradient Method 1 1 43 Nonlinear CG comes with few of the convergence guarantees of linear CG. the more quickly the search directions lose conjugacy.) Another problem is that a general function may have many local minima.

Unlike linear CG.04 -0.04 -4 -2 2 4 6 -200 -2 Figure 37: Convergence of the nonlinear Conjugate Gradient Method. ere ¥ n m @¥ 2 ¥ y ¥¥ 0 2 1 -2 2 (d) ¥ 4 -250 n 6 0 m ¥ I `¥ F Y 250 2 ¥ 2 (b) n eA n 4 F y m¥ Y I 4m Ô ¥ 1 1 .02 0. (c) Cross-section of the surface corresponding to the ﬁrst line search.44 Jonathan Richard Shewchuk (a) 6 4 500 0 2 2 4 0 -2 -4 -2 6 -4 -2 2 4 6 (c) 600 6 400 4 200 0 -0. (a) A complicated function with many local minima and maxima. (b) Convergence path of Fletcher-Reeves CG. (d) Convergence path of Polak-Ribi` CG.02 0. convergence does not occur in two steps.

This procedure is iterated until convergence is reached. ¥ y n 2 m @¥ a  ¥ ¥  2 1 .5 1 1.5 -0.25 -0.75 -1 0.75 0.25 -1 -0.5 0. calculate the ﬁrst and second derivatives. Starting from the point . and use them to construct a quadratic approximation to the function (dashed curve).5 2 Figure 39: The Newton-Raphson method for minimizing a one-dimensional function (solid curve).The Nonlinear Conjugate Gradient Method 45 6 4 0 -4 -2 2 4 6 -2 Figure 38: Nonlinear CG can be more effective with periodic restarts. A new point is chosen at the base of the parabola.5 -0. 1 0.

Demanding too little precision could cause a failure of convergence. giving y A F IÔ ¥ Y y eA F IÔ ¥ Y ut4 r y C§ t 64 r b B Ô  YÓ) Ô Ô Ô ) Y ) Y  Y ¥ . the closer is to the solution. because is just the familiar matrix . Of course. The closer the starting point is to the solution. To perform an exact line search without computing . The varies with . In fact. because with . where is derivative of an arbitrary small nonzero number: 0 2 which becomes a better approximation to the second derivative as and approach zero. the algorithm is prohibitively slow. it is possible to circumvent this problem by performing an approximate line search that uses only the diagonal elements of . On the other hand. If we substitute Expression 58 for the third term of the Taylor expansion (Equation 56). 1 Both the Newton-Raphson and Secant methods should be terminated when is reasonably close to the exact solution. the more quickly the search directions lose conjugacy. but instead of choosing the parabola by ﬁnding the ﬁrst and second derivative at a point. These evaluations may be inexpensive if can be analytically simpliﬁed. For some applications. one CG iteration may include many Newton-Raphson iterations. the more similar the convergence of nonlinear CG is to that of linear CG. In other words. To perform an exact line search of a non-quadratic function. Typically. the Secant method also approximates with a parabola. but if the full matrix must be evaluated repeatedly. we have Minimize by setting its derivative to zero: Like Newton-Raphson. then this parabolic approximation is exact. the search directions are conjugate if they are -orthogonal. though. In general. because conjugacy will break  l§   Ô ) v w¥ F  Y 7B Ô ) v I Ô I t i p ¥ ¥ b y 7A IÔ b ¥F Y Ô ) v I`¥ F  Y tÝ Ô v Ô Ô B) v w¥ ) F  I tY I y b A F ¥  tY Ô b g hby A b ¥ A A Ô v F y eA F y Ô I t ¥ Y Ô ) `¥  Y q I Ô ¥ F  tY r "t4 sy bB qu§ T B y § yÔ ¥F Y Ô 2 0 (58) (59) b  Y Ô  ) Ô Y  Y n 4 @¥ m b c§ y  Y b § y ¥ y i Ô I t ) v w¥ F  Y ÝB { |z { y 5 v I Ô eA ¥ F Y z ÝB 5 t y A F IÔ ¥ Y Ô Ô y    Ô )Y  ) Y C§ B  Y  Y Ôb v Ô b )I b e y f { v I Ô eA  Y  Y A ¥ F  tY { |z ¥F Y z t T y dI Ô ¡A Y y eA F IÔ ¥ Y ¤  Y 46 Jonathan Richard Shewchuk is approximately minimized by setting Expression 57 to zero. the Secant method approximates the second by evaluating the ﬁrst derivative at the distinct points 0 and . if is a quadratic form. repeated steps must be taken along the line is zero. there are functions for which it is not possible to evaluate at all. on subsequent iterations we will choose to be the value of from the previous Secant method iteration. we step to the bottom of the parabola (see Figure 39). hence. we will choose an arbitrary on the ﬁrst Secant method iteration. The more quickly varies meaning of “conjugate” keeps changing. but demanding too ﬁne precision makes the computation unnecessarily slow and gains nothing.The function The truncated Taylor series approximates with a parabola. then . the less varies from iteration to iteration. The values of until and must be evaluated at each step. if we let denote the value of calculated during Secant iteration . it ﬁnds the ﬁrst derivative at two different points (see Figure 40).

Notice that both curves have the same 0. by sampling at three different points.5 Figure 40: The Secant method for minimizing a one-dimensional function (solid curve). it is likely to converge to that point. 0 and 2). a quick but inexact line search is often the better policy (for instance. the preconditioner attempts to transform the quadratic property that form so that it is similar to a sphere. it is possible to generate a parabola that approximates without the need to calculate even a ﬁrst derivative of . calculate the ﬁrst derivatives at two different points (here. and use them to construct a quadratic approximation to the function (dashed curve). down quickly anyway if varies much with . The Newton-Raphson method has a better convergence rate. and also at 2. a nonlinear CG preconditioner performs this transformation for a region near . and the procedure is iterated until convergence. a new point is chosen at the base slope at of the parabola. As in the Newton-Raphson method.e. use only a ﬁxed number of Newton-Raphson or Secant method iterations). but its success may depend on a good choice of the parameter . Starting from the point . inexact line search may lead to the construction of a search direction that is not a descent nonpositive?). Even when it is too expensive to compute the full Hessian .. and restart CG if necessary direction.The Nonlinear Conjugate Gradient Method 1 47 0. Preconditioning Nonlinear CG can be preconditioned by choosing a preconditioner that approximates and has the 1 is easy to compute.5 -1 1 2 3 4 -0. For instance. in time). In linear CG. The result of nonlinear CG generally depends strongly on the starting point. The Secant method only requires ﬁrst derivatives of .3. Each method has its own advantages. 14. and if CG with the Newton-Raphson or Secant method starts near a local maximum. A common solution is to test for this eventuality (is by setting .5 -1 -1. it is often reasonable to compute its diagonal b  Y F I ÂÞ  a v Y y  9 v y ¡A F IÔ ¥ Y Ô)r  Y  ¥ Y ¥ I `¥ F  Y  b v Ô  Y ) Ô r V r § Ô 9 Y  b v n4 m¥  . Unfortunately. Therefore. A bigger problem with both methods is that they cannot distinguish minima from maxima. It is easy to derive a variety of other methods as well. and is to be preferred if can be calculated (or well approximated) quickly (i.

CG was generalized to nonlinear problems in 1964 by Fletcher and Reeves [6]. sparse matrices by Reid [13] in 1971. A more thorough analysis of CG convergence is provided by van der Sluis and van der Vorst [16]. Convergence bounds for CG in terms of Chebyshev polynomials were developed by Kaniel [12]. A preconditioner should be positive-deﬁnite. CG was popularized as an iterative method for large. based on work by Davidon [4] and Fletcher and Powell [5]. A survey of iterative methods for the solution of linear systems is offered by Barrett et al. A history and extensive annotated bibliography of CG to the mid-seventies is provided by Golub and O’Leary [9]. so ) nonpositive diagonal elements cannot be allowed.6 4 0 -4 -2 2 4 6 -2 Figure 41: The preconditioned nonlinear Conjugate Gradient Method. Here. for use as a preconditioner. However. shortly thereafter. they jointly published what is considered the seminal reference on CG [11]. with the Polak-Ribi` re formula. Convergence of nonlinear CG with inexact line searches was analyzed by Daniel [3]. the method of Conjugate Gradients was discovered independently by Hestenes [10] and Stiefel [15]. and were independently reinvented by Fox. In the early ﬁfties. the diagonal elements of the Hessian may not all be positive. A Notes Conjugate Direction methods were probably ﬁrst presented by Schmidt [14] in 1908. A conservative solution is to not precondition (set when the Hessian cannot be guaranteed to be positive-deﬁnite.  § 9 ¥ ¥ ¥  Y n m¥ 2 ¥ 48 Jonathan Richard Shewchuk 2 1 Ü . [1]. The choice of for nonlinear CG is discussed by Gilbert and Nocedal [8]. and Wilkinson [7] in 1948. Most research since that time has focused on nonsymmetric systems. using the Polak-Ribi` formula and ere a diagonal preconditioner. Huskey. Figure 41 demonstrates the convergence of diagonally preconditioned nonlinear CG. The space has been “stretched” to show the improvement in circularity of the contour lines around the minimum. I have cheated by using the diagonal of iteration. be forewarned that if is sufﬁciently far from a local minimum. on the same function illustrated e at the solution point to precondition every in Figure 37.

for large . Steepest Descent Given the inputs 1: . the exact machine. a maximum number of iterations If is divisible by 50 1 This algorithm terminates when the maximum number of iterations 0 .Canned Algorithms B Canned Algorithms 49 The code given in this section represents efﬁcient implementations of the algorithms discussed in this article. Of course. and if this test holds true. but once every 50 iterations. the exact residual is recalculated to remove accumulated ﬂoating point error. or when . This prevents the procedure from terminating early due to ﬂoating point roundoff error. the residual need not be corrected at all (in practice. If the tolerance is large. B1. might be appropriate. . If the tolerance is close to the limits of the ﬂoating point precision of the 2 . Ð Ï  o§ Ù 2 Î Ù  else Ù w  w  w w   w  ww  While and 2 D 0 A  Ù r g r y g) r B r ¥¤kBs© r  r yeA ý ü ¥ y¥ r Ù Ð ¤   sÏ ¡ Ù r )r ¥ ¤ B ¦c© 0 2 0 do Ð Ï  o§ Ù ¥ w w wwy xr Ù Ù Ù © ¤ . The fast recursive formula for the residual is usually used. a test should be added after is evaluated to check if 0 residual should also be recomputed and reevaluated. and an error tolerance n ¹ mr¹ 2 Î ¹ n4m r ¹  ©   S 2 has been exceeded. a starting value . the number 50 is arbitrary. this correction is rarely used).

Ù  Ü Au  Ô A r ÔÜ Ù r g z« # Ù rÙ «) # y r r B ¥¦¤cB© r  Ô yeA ý z ¥ ¥ y Ô Ù # ¤  Ù  ÐoÏP¨#  Ù D « « Ù r)u «# r r g Ô ¥ ¤ B ¦o© r 0 0 and 2 0 do Ð Ï  o§ w wx h Ygfe  i @d w  w Y  ww  y w y  w @ w w  2 Y  ww ww  w x ¥ © ¤ .50 B2. Conjugate Gradients Given the inputs 1: Jonathan Richard Shewchuk . and an error tolerance  S 2 . a starting value . a maximum number of iterations While If is divisible by 50 else 1 See the comments at the end of Section B1. .

a (perhaps implicitly deﬁned) preconditioner number of iterations . a starting value . a maximum . and an error tolerance 1: 51 While If is divisible by 50 else 1 1 The statement “ be in the form of a matrix. which may not actually See also the comments at the end of Section B1. .Canned Algorithms B3. Ù  Ü Au  Ô A ÔÜ Ù rÙ g z« # Ù «) # r y r V r B ¥¦¤cB© r  Ô yeA ý z ¥ y¥ Ô ¤ Ù #   Ù  ÐoÏP¨#  Ù D « « Ù Ô u «# r r) V Ô ¥ ¤ B ¦o© r 1 0 0 and 2 0 do 9 w wx h Yg e k i @d  w w k  Y  9w w jw  y k w y  w @ w w  2 Y  w 9j w ww  w x  £ 2 ¥ r V 9 lw k  ÐoÏ§ © ¤ . 1 ” implies that one should apply the preconditioner. Preconditioned Conjugate Gradients Given the inputs .

a starting value . A fast inexact line and/or by approximating the Hessian with its search can be accomplished by using a small diagonal.52 Jonathan Richard Shewchuk B4. Nonlinear Conjugate Gradients with Newton-Raphson and Fletcher-Reeves Given a function . has been exceeded. the solution is to choose a better starting point or a more sophisticated line search. a maximum number of Newton-Raphson iterations . In the latter case. and a Newton-Raphson error tolerance 1: 0 0 0 While and 2 0 do 0 Do 1 while and 2 2 If 1 or 0 0 1 This algorithm terminates when the maximum number of iterations 0 . Nonlinear CG is restarted (by setting ) whenever a search direction is computed that is not a descent direction. CG might not be the most appropriate minimization algorithm. The calculation of may result in a divide-by-zero error. It is also restarted once every iterations. or when Each Newton-Raphson iteration adds to . a CG error tolerance 1. or because is not twice continuously differentiable. This may occur because the starting point 0 is not sufﬁciently close to the desired minimum. the iterations are terminated when each update falls below a given tolerance ( ). or when the number of iterations exceeds . n m 2¥ Ô y I `¥ F  Y  Ð  sÏ Ð Ï  oP µ Ð Ï  o§  w w ux r w gi¶Ô Î Ô ) r A  d¶ Ü § w Ô ¶r w h YgA e w iÙ¶ÔÜ  r  Ù@d w # w r «) #gYz « w Ùr zÙ y I F Ð `µ ¥  Y µ B  sA Ï µ  µ w 5D ý ytA z n ý m Ô `m z ¥ w y¥ z t m n  m m` r B w Ù Ô ) uw z µ Ô Ù 2 D Y« # Ù  oPÏ¨ w Ð  #«  Ù w  w # ÙÙ r ) ru« wÔ rg Iw¥ F  Y B wr ww x¶ Y Ð  oÏ r  ¥ µ A Ð  oÏ µ w hÔ Ô y ¥ 5 Î ¹ Ô y¹ y Y n ¹ mr¹ 2 Î ¹ n4m r ¹    52 . to improve convergence for small . a maximum number of CG iterations . In the former case.

the iterations are terminated when each update falls below a given tolerance ( ).Canned Algorithms B5. A fast inexact line search can be accomplished by using a small . The parameter 0 determines the value of in Equation 59 for the ﬁrst step of each Secant method minimization. a maximum number of Secant method iterations . a Secant method step parameter 0 . or when the number of iterations exceeds . or when Each Secant method iteration adds to . and a Secant method error tolerance 1: 0 0 Calculate a preconditioner 1 0 While and 2 0 do 0 0 0 Do 1 while and 2 2 Calculate a preconditioner 1 If 1 or 0 0 else 1 This algorithm terminates when the maximum number of iterations 0 . it may be necessary to adjust this parameter Ð  oÏ µ Ô y Ð Ï  oP b Ð Ï  o§ Ð  oÏ I `¥ F  µ A w Ü Ô A k zÔ w  x ww Ü k z¶Ô  § Î A d¶ w h ø  h Yg e ¶ w ¶Ü w 9 Ù k' r V) gr @ o« # V jw kÙ Y T 9 k Y« ) # r Ù w w z 4z wÏ Ù w yr zÙ I F y Ð wµ ¥  Y µ B  oA Ï µ  ü µ 5D w w p ts« rqp u V Ô @yu  twvxy uA y ¥ w w y¥ Ô ) v I`¥ F  tY w p ü Ô ) v I Ô b A ¥ F  tY tY« rp ws q b ) w Ùy Ô qÔ B w z µ Ù 2 D « # Ù  oÏ§¨Y#« w  Ù Ù Ð  Ô w w # Ù ) r oY« iÔ k w r V 9nw k I`¥ F  Y T 9 I `¥ F  Y B wyr ww x¶ b ¥ b Ô y Ð  oÏ ¥ µ  5 5 Î ¹ Ô y¹ Y n ¹ mr¹ 2 Î ¹ n4m r ¹   2 . a maximum number of CG iterations . a starting value . Preconditioned Nonlinear Conjugate Gradients with Secant and Polak-Ribi` re e 53 Given a function . has been exceeded. a CG error tolerance 1. Unfortunately.