
Derivation of the Conjugate Gradient Method

Olof Runborg
June 1, 2011

1 Goal
We want to find xk ∈ Kk (b, A) which minimizes ||Axk − b||A−1 when A is a
symmetric positive definite matrix. Normally one would need to first compute
a basis for Kk (b, A) and then do a least squares fit to find the solution. How-
ever, for symmetric positive definite matrices we can do it iteratively with the
conjugate gradient method.
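
For reference, the direct (non-iterative) approach described above can be sketched in a few
lines of NumPy: build a basis of Kk (b, A), then solve the small projected system
V T AV c = V T b to which the weighted least squares problem reduces (cf. the normal equations
in Section 1.3). The function name, the QR-based orthonormalization and the random test
problem are illustrative assumptions, not part of the derivation.

```python
import numpy as np

def krylov_ls_solution(A, b, k):
    # Minimize ||A x - b||_{A^{-1}} over x in K_k(b, A) directly:
    # build a basis of the Krylov space, then solve the projected
    # system V^T A V c = V^T b and return x = V c.
    n = len(b)
    K = np.empty((n, k))
    v = b.astype(float)
    for j in range(k):
        K[:, j] = v
        v = A @ v                     # A^j b -> A^(j+1) b
    V, _ = np.linalg.qr(K)            # orthonormal basis of K_k(b, A)
    c = np.linalg.solve(V.T @ A @ V, V.T @ b)
    return V @ c

# illustrative test problem: a random symmetric positive definite matrix
rng = np.random.default_rng(0)
M = rng.standard_normal((6, 6))
A = M @ M.T + 6 * np.eye(6)
b = rng.standard_normal(6)
print(np.linalg.norm(A @ krylov_ls_solution(A, b, 6) - b))   # ~0, since K_6(b, A) = R^6 here
```

The cost of this reference approach grows with k, since the whole basis is kept and a new
projected system is solved each time; the conjugate gradient method produces the same xk
with a fixed amount of work per iteration.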

1.1 Ansatz for the method


We let
xk+1 = xk + αk+1 pk+1 , x0 = 0, p1 = b, (1)
with αk and search directions pk to be determined such that the iterates
xk belong to Kk (b, A) and minimize ||Ax − b||A−1 . Moreover, we want to be
able to compute αk and pk cheaply from quantities available from the previous
iteration.
By simple induction

xk = α1 p1 + · · · + αk pk =: Sk ak , k ≥ 1,

where
Sk = [p1 p2 · · · pk ], ak = [α1 α2 · · · αk ]T .
We also define the residuals

r k = Axk − b,

and note that after multiplying (1) by A and subtracting b we get

rk+1 = rk + αk+1 Apk+1 , r0 = −b. (2)

The method stops if rk = 0 for some k. We say that the method fails if
pk = 0 for some k. We will assume for now that the method has not failed
before iteration k, i.e. that pℓ ≠ 0 for ℓ ≤ k. At the end we will then show that
this implies that it will not fail in iteration k + 1. Since p1 = b ≠ 0 it follows
by induction that the method will never fail.

1.2 Choice of pk -directions
Since we look for xk ∈ Kk (b, A) and xk = Sk ak we want to pick pk such that
range(Sk ) = Kk (b, A). We will show here that this follows if we take
pk+1 = r k − ASk z k , (3)
where z k is chosen such that it minimizes ||pk+1 ||2 . It is thus the least squares
solution given by the normal equations
SkT AT ASk z k = SkT Ar k . (4)
Note that in the steepest descent method we take just pk+1 = r k . The sub-
traction of ASk z k makes the search much more efficient. The construction has
several interesting consequences.
1. The pk -directions are A-conjugate (or just conjugate), which means that

pTk Apℓ = 0, k ≠ ℓ. (5)

This follows from the fact that SkT AT pk+1 = SkT AT (r k − ASk z k ) = 0 for
all k. It also implies that
SkT ASk = Dk , Dk is diagonal with elements dℓℓ = pTℓ Apℓ . (6)

2. If the method has not failed before iteration k then


range(Sk ) = span{p1 , . . . , pk } = span{r0 , . . . , rk−1 } = Kk (b, A).

Moreover, {p1 , . . . , pk } and {r0 , . . . , rk−1 } are linearly independent.
By (2) we have Apk ∈ span{rk , rk−1 } and therefore
range(ASk ) = span{Ap1 , . . . , Apk } ⊂ span{r0 , . . . , r k }.
Hence, from (3), we get that pk ∈ span{r0 , . . . , r k−1 } for all k, and then
span{p1 , . . . , pk } ⊂ span{r0 , . . . , rk−1 }. (7)
Since A is symmetric positive definite, Dk has only positive entries, unless
pℓ = 0 for some ℓ ≤ k. This means that if the method has not failed
before iteration k, the directions p1 , . . . , pk are linearly independent by
(6). (Otherwise there would be a vector x ≠ 0 such that Sk x = 0 and
then Dk x = 0, a contradiction.) Hence, the left hand side space in (7) has
dimension k, which is therefore also the dimension of the right hand side.
Then r 0 , . . . , rk−1 are linearly independent and the spaces must be equal.
Finally, we can show that the space is in fact equal to the Krylov space
Kk (b, A). We use induction. Clearly span{r 0 } = K1 (b, A). Suppose
span{p1 , . . . , pk−1 } = Kk−1 (b, A). Then, for ℓ ≤ k − 1 we have r ℓ−1 ∈
Kk−1 (b, A) and by (2)
r ℓ ∈ span{r ℓ−1 , Apℓ } ⊂ span{Kk−1 (b, A), Ab, . . . , Ak−1 b} ⊂ Kk (b, A).
But since the residuals {rℓ } are linearly independent we must have equal-
ity, span{r 0 , . . . , r k−1 } = Kk (b, A).
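
The construction (3)–(4) and the properties above are easy to check numerically. The sketch
below (with an illustrative random test problem; the helper name is not from the text) builds
the directions with an explicit least squares solve, using the coefficient formula derived in
the next subsection, and verifies the conjugacy (5)–(6) and that each pk lies in Kk (b, A).

```python
import numpy as np

def directions_from_least_squares(A, b, m):
    # p_1 = b; p_{k+1} = r_k - A S_k z_k with z_k from the least squares
    # problem (4); x and r are updated as in (1) and (2) with
    # alpha_k = -p_k^T r_{k-1} / (p_k^T A p_k)  (derived in Section 1.3).
    n = len(b)
    P = np.zeros((n, m))                      # columns are p_1, ..., p_m
    x, r = np.zeros(n), -b.astype(float)      # x_0 = 0, r_0 = -b
    for k in range(m):
        if k == 0:
            p = b.astype(float)
        else:
            S = P[:, :k]
            z, *_ = np.linalg.lstsq(A @ S, r, rcond=None)
            p = r - A @ (S @ z)
        alpha = -(p @ r) / (p @ (A @ p))
        x, r = x + alpha * p, r + alpha * (A @ p)
        P[:, k] = p
    return P

rng = np.random.default_rng(1)
M = rng.standard_normal((6, 6))
A = M @ M.T + 6 * np.eye(6)
b = rng.standard_normal(6)
P = directions_from_least_squares(A, b, 6)

D = P.T @ A @ P
print(np.max(np.abs(D - np.diag(np.diag(D)))))   # ~0: the directions are A-conjugate, cf. (5)-(6)

# each p_k lies in K_k(b, A); with linear independence this gives range(S_k) = K_k(b, A)
K = np.column_stack([np.linalg.matrix_power(A, j) @ b for j in range(6)])
resid = []
for k in range(1, 7):
    Q, _ = np.linalg.qr(K[:, :k])
    pk = P[:, k - 1] / np.linalg.norm(P[:, k - 1])
    resid.append(np.linalg.norm(pk - Q @ (Q.T @ pk)))
print(max(resid))                                # ~0: p_k is in K_k(b, A) for every k
```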

1.3 Simplified minimization problem
Since {pj } span the Krylov space Kk (b, A) we can simplify the minimization
problem

min ||Axk − b||A−1 over xk ∈ Kk (b, A) ⇒ min ||ASk ak − b||A−1 over ak ∈ Rk .

The solution to this weighted least squares problem is given by minimizing

(ASk a − b)T A−1 (ASk a − b) = aT SkT ASk a − 2aT SkT b + bT A−1 b,

over a, which has the solution ak satisfying

SkT ASk ak = SkT b ⇒ Dk ak = SkT b.

Hence, αk = pTk b/pTk Apk . By also using (5) we get

pTk b = pTk (Axk−1 − r k−1 ) = pTk (ASk−1 ak−1 − rk−1 ) = −pTk r k−1 ,

which shows that


αk = − pTk rk−1 / (pTk Apk ).
We also have

SkT r k = SkT (Axk − b) = 0 ⇒ pTℓ rk = 0, ℓ ≤ k.

This implies that


rk ⊥ Kk (b, A),
and, by induction,
rk ⊥ rℓ , k ≠ ℓ.
Hence
1. The residual at iteration k is orthogonal to the search space Kk (b, A), cf.
Galerkin methods,
2. The previously computed residuals form an orthogonal basis for Kk (b, A).
Consequently, if pTk r k−1 = 0 then pk ∈ Kk−1 (b, A) which is impossible since
{pk } are linearly independent. Therefore αk ≠ 0.
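
The point of the diagonal system Dk ak = SkT b is that the coefficients decouple: each αk
involves only pk . A tiny self-contained illustration follows; it uses eigenvectors of A as a
stand-in for a set of A-conjugate directions (an assumption made only for this example, not
how the method builds its pk ).

```python
import numpy as np

# With A-conjugate columns in S, the projected system S^T A S a = S^T b is
# diagonal, so alpha_k = p_k^T b / (p_k^T A p_k), and S^T r = 0 afterwards.
rng = np.random.default_rng(2)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5 * np.eye(5)            # symmetric positive definite
b = rng.standard_normal(5)

_, Q = np.linalg.eigh(A)               # eigenvectors of A are A-conjugate
S = Q[:, :3]                           # three conjugate directions (illustration only)
D = S.T @ A @ S                        # diagonal, entries p_k^T A p_k
alpha = (S.T @ b) / np.diag(D)         # decoupled coefficients
x = S @ alpha
print(np.max(np.abs(S.T @ (A @ x - b))))   # ~0: the residual is orthogonal to range(S)
```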

1.4 Simplified computation of pk+1


Finding pk+1 involves solving the least squares problem (4), where z k depends
on all previous pk -directions via the Sk matrix. We will now show that this can
be simplified since pℓ+1 ∈ span{rℓ , pℓ }. Write

z k = γk [wTk αk ]T , wk ∈ Rk−1 ,

and with αk ≠ 0 as defined above. Then

pk+1 = rk − ASk z k = rk − γk ASk−1 w k − γk αk Apk


= rk − γk ASk−1 w k − γk (r k − rk−1 )
= (1 − γk ) rk + γk (rk−1 − ASk−1 w k ) .

We note that rk ⊥ (ASk−1 wk − r k−1 ) for any wk , since range(ASk−1 ) ⊂


Kk (b, A). Therefore,

||pk+1 ||2 = (1 − γk )2 ||r k ||2 + γk2 ||rk−1 − ASk−1 w k ||2 . (8)

In order to minimize this over z k , i.e. over γk and wk , we see that wk


must be the minimizer of ||r k−1 − ASk−1 wk ||. Hence, wk = z k−1 and rk−1 −
ASk−1 wk = pk . This shows the claim and we have
pk+1 = (1 − γk ) rk + γk pk = (1 − γk ) (r k + βk pk ), βk = γk / (1 − γk ).

From the fact that pTk+1 Apk = 0 by (5) we get an expression for βk ,

βk = − rTk Apk / (pTk Apk ).

It is also possible to derive an expression for γk . By minimizing (8) we get

γk = ||r k ||2 / (||r k ||2 + ||pk ||2 ).

However, as we will see below, γk is actually not needed in the final optimized
algorithm.
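
The claim is that the cheap two-term recursion reproduces exactly the direction obtained from
the full least squares problem (3)–(4). A sketch of that comparison follows; the random
symmetric positive definite test matrix and the helper names are illustrative assumptions.

```python
import numpy as np

def next_p_lstsq(A, S, r):
    # p_{k+1} = r_k - A S_k z_k, with z_k from the least squares problem (4)
    z, *_ = np.linalg.lstsq(A @ S, r, rcond=None)
    return r - A @ (S @ z)

def next_p_recursion(A, r, p):
    # p_{k+1} = (1 - gamma_k)(r_k + beta_k p_k) with the formulas above
    beta = -(r @ (A @ p)) / (p @ (A @ p))
    gamma = (r @ r) / (r @ r + p @ p)
    return (1.0 - gamma) * (r + beta * p)

rng = np.random.default_rng(3)
M = rng.standard_normal((8, 8))
A = M @ M.T + 8 * np.eye(8)
b = rng.standard_normal(8)

x, r = np.zeros(8), -b          # x_0 = 0, r_0 = -b
p, cols = b.copy(), []          # p_1 = b; cols collects the columns of S_k
for k in range(5):
    alpha = -(p @ r) / (p @ (A @ p))            # alpha_{k+1}
    x, r = x + alpha * p, r + alpha * (A @ p)   # updates (1) and (2)
    cols.append(p)
    p_new = next_p_recursion(A, r, p)
    diff = np.linalg.norm(next_p_lstsq(A, np.column_stack(cols), r) - p_new)
    print(diff)                                 # ~0: the recursion matches (3)-(4)
    p = p_new
```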

1.5 Non-failure
We have left to show that the method does not fail. Recall that in Section 1.1
we assumed that pℓ ≠ 0 and rℓ ≠ 0 for 0 ≤ ℓ ≤ k. The results we have shown so
far therefore only hold up to ℓ = k. We can now finish the induction argument
by noting that whenever pk ≠ 0 and r k ≠ 0 we must have pk+1 ≠ 0 by (8), since
the minimizing γk then satisfies 0 < γk < 1, so both terms in (8) are positive.
The results in the previous sections therefore hold for all ℓ, until r ℓ = 0 and the
method stops.

1.6 Basic algorithm


We summarize the basic algorithm that follows from the considerations above:
1. Given rk , pk , xk .
2. Compute βk = − rTk Apk / (pTk Apk ), γk = ||rk ||2 / (||r k ||2 + ||pk ||2 ).

3. Update pk+1 = (1 − γk )(r k + βk pk ).
4. Compute αk+1 = − pTk+1 r k / (pTk+1 Apk+1 )

5. Update xk+1 = xk + αk+1 pk+1


6. Update r k+1 = rk + αk+1 Apk+1
7. Iterate.
This can be implemented such that only one matrix vector multiply is used per
iteration. We will see this more clearly in the next section.
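
A direct transcription of these steps into NumPy might look as follows. This is a sketch, not
a reference implementation; the function name and the tolerance-based stopping rule are choices
made here, and caching qk = Apk is what gives one matrix-vector multiply per iteration.

```python
import numpy as np

def cg_basic(A, b, tol=1e-12, maxiter=None):
    # Steps 1-7 above, with q_k = A p_k cached so that only one
    # matrix-vector multiply per iteration is needed.
    # Assumes b != 0 (as in the note, p_1 = b).
    n = len(b)
    maxiter = n if maxiter is None else maxiter
    x = np.zeros(n)
    r = -np.asarray(b, dtype=float)        # r_0 = A x_0 - b
    p = np.asarray(b, dtype=float).copy()  # p_1 = b
    q = A @ p                              # q_1 = A p_1
    alpha = -(p @ r) / (p @ q)             # alpha_1 = -p_1^T r_0 / (p_1^T A p_1)
    x = x + alpha * p
    r = r + alpha * q
    for _ in range(1, maxiter):
        if np.linalg.norm(r) <= tol:
            break
        beta = -(r @ q) / (p @ q)              # step 2
        gamma = (r @ r) / (r @ r + p @ p)      # step 2
        p = (1.0 - gamma) * (r + beta * p)     # step 3
        q = A @ p                              # the single matrix-vector multiply
        alpha = -(p @ r) / (p @ q)             # step 4
        x = x + alpha * p                      # step 5
        r = r + alpha * q                      # step 6
    return x
```

For symmetric positive definite A, cg_basic(A, b) reproduces np.linalg.solve(A, b) after at
most n iterations in exact arithmetic; the version in the next section removes γk and most of
the inner products.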

1.7 Optimized algorithm


We make a few observations which simplify the algorithm. First, if we define
pk = (1 − γk )p̂k , α̂k = p̂Tk b / (p̂Tk Ap̂k ), β̂k = − rTk Ap̂k / (p̂Tk Ap̂k ),
then
α̂k p̂k = αk pk , β̂k p̂k = βk pk .
We can therefore use (p̂k , α̂k , β̂k ) instead of (pk , αk , βk ) in the algorithm, avoid-
ing γk entirely. Moreover,
r Tk r k = r Tk (r k−1 + α̂k Ap̂k ) = α̂k r Tk Ap̂k ,
and
p̂Tk r k−1 = (r k−1 + β̂k−1 p̂k−1 )T rk−1 = rTk−1 r k−1 .
Hence,
rTk r k / (rTk−1 rk−1 ) = α̂k rTk Ap̂k / (p̂Tk r k−1 ) = − rTk Ap̂k / (p̂Tk Ap̂k ) = β̂k .
This leads to an algorithm with one matrix vector multiply, two inner products
and three saxpy operations per iteration:
1. Given rk , p̂k , xk and sk−1 = rTk−1 r k−1
2. Compute sk = rTk r k (inner product)
3. Compute β̂k = sk /sk−1 (scalar)

4. Update p̂k+1 = r k + β̂k p̂k (saxpy)


5. Compute q k+1 = Ap̂k+1 (matrix vector multiply)
6. Compute α̂k+1 = − sk / (p̂Tk+1 q k+1 ) (scalar, inner product)

7. Update xk+1 = xk + α̂k+1 p̂k+1 (saxpy)


8. Update r k+1 = rk + α̂k+1 q k+1 (saxpy)
9. Iterate.
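
A sketch of the optimized algorithm in the same spirit is given below (again with an
illustrative function name, stopping rule and test problem). With this note's sign convention
rk = Axk − b, the first direction is taken as p̂1 = r 0 (proportional to p1 = b), which is
what makes the identity p̂Tk r k−1 = rTk−1 r k−1 hold also for k = 1, and the scalar in step 6
then carries the minus sign, consistent with the identities above.

```python
import numpy as np

def cg_optimized(A, b, tol=1e-12, maxiter=None):
    # One matrix-vector multiply, two inner products and three saxpys per
    # iteration.  Sign convention of the note: r_k = A x_k - b, hat p_1 = r_0.
    n = len(b)
    maxiter = n if maxiter is None else maxiter
    x = np.zeros(n)
    r = -np.asarray(b, dtype=float)     # r_0 = A x_0 - b
    p = r.copy()                        # hat p_1 = r_0
    s_old = r @ r                       # s_0 = r_0^T r_0
    q = A @ p
    alpha = -s_old / (p @ q)            # hat alpha_1
    x = x + alpha * p
    r = r + alpha * q
    for _ in range(1, maxiter):
        s = r @ r                       # step 2 (inner product)
        if np.sqrt(s) <= tol:
            break
        beta = s / s_old                # step 3: hat beta_k = s_k / s_{k-1}
        p = r + beta * p                # step 4 (saxpy)
        q = A @ p                       # step 5 (matrix-vector multiply)
        alpha = -s / (p @ q)            # step 6: hat alpha_{k+1} = -s_k / (hat p^T q)
        x = x + alpha * p               # step 7 (saxpy)
        r = r + alpha * q               # step 8 (saxpy)
        s_old = s
    return x

# illustrative check against a dense solve
rng = np.random.default_rng(4)
M = rng.standard_normal((20, 20))
A = M @ M.T + 20 * np.eye(20)
b = rng.standard_normal(20)
print(np.linalg.norm(cg_optimized(A, b) - np.linalg.solve(A, b)))   # ~0
```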
