
Derivation of the Conjugate Gradient Method

Olof Runborg
June 1, 2011

1 Goal
We want to find xk ∈ Kk (b, A) which minimizes ||Axk − b||A−1 when A is a
symmetric positive definite matrix. Normally one would need to first compute
a basis for Kk (b, A) and then do a least squares fit to find the solution. How-
ever, for symmetric positive definite matrices we can do it iteratively with the
conjugate gradient method.
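
For reference, the direct (non-iterative) approach described above can be sketched in a few
lines of NumPy: build a basis of Kk (b, A), then solve the small projected system
V T AV c = V T b to which the weighted least squares problem reduces (cf. the normal equations
in Section 1.3). The function name, the QR-based orthonormalization and the random test
problem are illustrative assumptions, not part of the derivation.

```python
import numpy as np

def krylov_ls_solution(A, b, k):
    # Minimize ||A x - b||_{A^{-1}} over x in K_k(b, A) directly:
    # build a basis of the Krylov space, then solve the projected
    # system V^T A V c = V^T b and return x = V c.
    n = len(b)
    K = np.empty((n, k))
    v = b.astype(float)
    for j in range(k):
        K[:, j] = v
        v = A @ v                     # A^j b -> A^(j+1) b
    V, _ = np.linalg.qr(K)            # orthonormal basis of K_k(b, A)
    c = np.linalg.solve(V.T @ A @ V, V.T @ b)
    return V @ c

# illustrative test problem: a random symmetric positive definite matrix
rng = np.random.default_rng(0)
M = rng.standard_normal((6, 6))
A = M @ M.T + 6 * np.eye(6)
b = rng.standard_normal(6)
print(np.linalg.norm(A @ krylov_ls_solution(A, b, 6) - b))   # ~0, since K_6(b, A) = R^6 here
```

The cost of this reference approach grows with k, since the whole basis is kept and a new
projected system is solved each time; the conjugate gradient method produces the same xk
with a fixed amount of work per iteration.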

1.1 Ansatz for the method


We let
xk+1 = xk + αk+1 pk+1 , x0 = 0, p1 = b, (1)
with αk and search directions pk to be determined such that the iterates
xk belong to Kk (b, A) and minimize ||Ax − b||A−1 . Moreover, we want to be
able to compute αk and pk cheaply from quantities available from the previous
iteration.
By simple induction

xk = α1 p1 + · · · + αk pk =: Sk ak , k ≥ 1,

where
Sk = [p1 p2 · · · pk ], ak = [α1 α2 · · · αk ]T .
We also define the residuals

r k = Axk − b,

and note that after multiplying (1) by A and subtracting b we get

rk+1 = rk + αk+1 Apk+1 , r0 = −b. (2)

The method stops if rk = 0 for some k. We say that the method fails if
pk = 0 for some k. We will assume for now that the method has not failed
before iteration k, i.e. that pℓ ≠ 0 for ℓ ≤ k. At the end we will then show that
this implies that it will not fail in iteration k + 1. Since p1 = b ≠ 0 it follows
by induction that the method will never fail.

1.2 Choice of pk -directions
Since we look for xk ∈ Kk (b, A) and xk = Sk ak we want to pick pk such that
range(Sk ) = Kk (b, A). We will show here that this follows if we take
pk+1 = r k − ASk z k , (3)
where z k is chosen such that it minimizes ||pk+1 ||2 . It is thus the least squares
solution given by the normal equations
SkT AT ASk z k = SkT Ar k . (4)
Note that in the steepest descent method we take just pk+1 = r k . The sub-
traction of ASk z k makes the search much more efficient. The construction has
several interesting consequences.
1. The pk -directions are A-conjugate (or just conjugate), which means that

pTk Apℓ = 0, k ≠ ℓ. (5)

This follows from the fact that SkT AT pk+1 = SkT AT (r k − ASk z k ) = 0 for
all k. It also implies that
SkT ASk = Dk , Dk is diagonal with elements dℓℓ = pTℓ Apℓ . (6)

2. If the method has not failed before iteration k then


range(Sk ) = span{p1 , . . . , pk } = span{r0 , . . . , rk−1 } = Kk (b, A).

Moreover, {p1 , . . . , pk } and {r0 , . . . , rk−1 } are linearly independent.
By (2) we have Apk ∈ span{rk , rk−1 } and therefore
range(ASk ) = span{Ap1 , . . . , Apk } ⊂ span{r0 , . . . , r k }.
Hence, from (3), we get that pk ∈ span{r0 , . . . , r k−1 } for all k, and then
span{p1 , . . . , pk } ⊂ span{r0 , . . . , rk−1 }. (7)
Since A is symmetric positive definite, Dk has only positive entries, unless
pℓ = 0 for some ℓ ≤ k. This means that if the method has not failed
before iteration k, the directions p1 , . . . , pk are linearly independent by
(6). (Otherwise there would be a vector x ≠ 0 such that Sk x = 0 and
then Dk x = 0, a contradiction.) Hence, the left hand side space in (7) has
dimension k, which is therefore also the dimension of the right hand side.
Then r 0 , . . . , rk−1 are linearly independent and the spaces must be equal.
Finally, we can show that the space is in fact equal to the Krylov space
Kk (b, A). We use induction. Clearly span{r 0 } = K1 (b, A). Suppose
span{p1 , . . . , pk−1 } = Kk−1 (b, A). Then, for ℓ ≤ k − 1 we have r ℓ−1 ∈
Kk−1 (b, A) and by (2)
r ℓ ∈ span{r ℓ−1 , Apℓ } ⊂ span{Kk−1 (b, A), Ab, . . . , Ak−1 b} ⊂ Kk (b, A).
But since the residuals {rℓ } are linearly independent we must have equal-
ity, span{r 0 , . . . , r k−1 } = Kk (b, A).
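
The construction (3)–(4) and the properties above are easy to check numerically. The sketch
below (with an illustrative random test problem; the helper name is not from the text) builds
the directions with an explicit least squares solve, using the coefficient formula derived in
the next subsection, and verifies the conjugacy (5)–(6) and that each pk lies in Kk (b, A).

```python
import numpy as np

def directions_from_least_squares(A, b, m):
    # p_1 = b; p_{k+1} = r_k - A S_k z_k with z_k from the least squares
    # problem (4); x and r are updated as in (1) and (2) with
    # alpha_k = -p_k^T r_{k-1} / (p_k^T A p_k)  (derived in Section 1.3).
    n = len(b)
    P = np.zeros((n, m))                      # columns are p_1, ..., p_m
    x, r = np.zeros(n), -b.astype(float)      # x_0 = 0, r_0 = -b
    for k in range(m):
        if k == 0:
            p = b.astype(float)
        else:
            S = P[:, :k]
            z, *_ = np.linalg.lstsq(A @ S, r, rcond=None)
            p = r - A @ (S @ z)
        alpha = -(p @ r) / (p @ (A @ p))
        x, r = x + alpha * p, r + alpha * (A @ p)
        P[:, k] = p
    return P

rng = np.random.default_rng(1)
M = rng.standard_normal((6, 6))
A = M @ M.T + 6 * np.eye(6)
b = rng.standard_normal(6)
P = directions_from_least_squares(A, b, 6)

D = P.T @ A @ P
print(np.max(np.abs(D - np.diag(np.diag(D)))))   # ~0: the directions are A-conjugate, cf. (5)-(6)

# each p_k lies in K_k(b, A); with linear independence this gives range(S_k) = K_k(b, A)
K = np.column_stack([np.linalg.matrix_power(A, j) @ b for j in range(6)])
resid = []
for k in range(1, 7):
    Q, _ = np.linalg.qr(K[:, :k])
    pk = P[:, k - 1] / np.linalg.norm(P[:, k - 1])
    resid.append(np.linalg.norm(pk - Q @ (Q.T @ pk)))
print(max(resid))                                # ~0: p_k is in K_k(b, A) for every k
```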

1.3 Simplified minimization problem
Since {pj } span the Krylov space Kk (b, A) we can simplify the minimization
problem

min ||Axk − b||A−1 over xk ∈ Kk (b, A) ⇒ min ||ASk ak − b||A−1 over ak ∈ Rk .

The solution to this weighted least squares problem is given by minimizing

(ASk a − b)T A−1 (ASk a − b) = aT SkT ASk a − 2aT SkT b + bT A−1 b,

over a, which has the solution ak satisfying

SkT ASk ak = SkT b ⇒ Dk ak = SkT b.

Hence, αk = pTk b/pTk Apk . By also using (5) we get

pTk b = pTk (Axk−1 − r k−1 ) = pTk (ASk−1 ak−1 − rk−1 ) = −pTk r k−1 ,

which shows that


αk = − pTk rk−1 / (pTk Apk ).
We also have

SkT r k = SkT (Axk − b) = 0 ⇒ pTℓ rk = 0, ℓ ≤ k.

This implies that


rk ⊥ Kk (b, A),
and, by induction,
rk ⊥ rℓ , k ≠ ℓ.
Hence
1. The residual at iteration k is orthogonal to the search space Kk (b, A), cf.
Galerkin methods,
2. The previously computed residuals form an orthogonal basis for Kk (b, A).
Consequently, if pTk r k−1 = 0 then pk ∈ Kk−1 (b, A) which is impossible since
{pk } are linearly independent. Therefore αk ≠ 0.
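
The point of the diagonal system Dk ak = SkT b is that the coefficients decouple: each αk
involves only pk . A tiny self-contained illustration follows; it uses eigenvectors of A as a
stand-in for a set of A-conjugate directions (an assumption made only for this example, not
how the method builds its pk ).

```python
import numpy as np

# With A-conjugate columns in S, the projected system S^T A S a = S^T b is
# diagonal, so alpha_k = p_k^T b / (p_k^T A p_k), and S^T r = 0 afterwards.
rng = np.random.default_rng(2)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5 * np.eye(5)            # symmetric positive definite
b = rng.standard_normal(5)

_, Q = np.linalg.eigh(A)               # eigenvectors of A are A-conjugate
S = Q[:, :3]                           # three conjugate directions (illustration only)
D = S.T @ A @ S                        # diagonal, entries p_k^T A p_k
alpha = (S.T @ b) / np.diag(D)         # decoupled coefficients
x = S @ alpha
print(np.max(np.abs(S.T @ (A @ x - b))))   # ~0: the residual is orthogonal to range(S)
```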

1.4 Simplified computation of pk+1


Finding pk+1 involves solving the least squares problem (4), where z k depends
on all previous pk -directions via the Sk matrix. We will now show that this can
be simplified since pℓ+1 ∈ span{rℓ , pℓ }. Write

z k = γk [wTk αk ]T , wk ∈ Rk−1 ,

and with αk ≠ 0 as defined above. Then

pk+1 = rk − ASk z k = rk − γk ASk−1 w k − γk αk Apk


= rk − γk ASk−1 w k − γk (r k − rk−1 )
= (1 − γk ) rk + γk (rk−1 − ASk−1 w k ) .

We note that rk ⊥ (ASk−1 wk − r k−1 ) for any wk , since range(ASk−1 ) ⊂


Kk (b, A). Therefore,

||pk+1 ||2 = (1 − γk )2 ||r k ||2 + γk2 ||rk−1 − ASk−1 w k ||2 . (8)

In order to minimize this over z k , i.e. over γk and wk , we see that wk


must be the minimizer of ||r k−1 − ASk−1 wk ||. Hence, wk = z k−1 and rk−1 −
ASk−1 wk = pk . This shows the claim and we have
pk+1 = (1 − γk ) rk + γk pk = (1 − γk ) (r k + βk pk ), βk = γk / (1 − γk ).

From the fact that pTk+1 Apk = 0 by (5) we get an expression for βk ,

βk = − rTk Apk / (pTk Apk ).

It is also possible to derive an expression for γk . By minimizing (8) we get

γk = ||r k ||2 / (||r k ||2 + ||pk ||2 ).

However, as we will see below, γk is actually not needed in the final optimized
algorithm.
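
The claim is that the cheap two-term recursion reproduces exactly the direction obtained from
the full least squares problem (3)–(4). A sketch of that comparison follows; the random
symmetric positive definite test matrix and the helper names are illustrative assumptions.

```python
import numpy as np

def next_p_lstsq(A, S, r):
    # p_{k+1} = r_k - A S_k z_k, with z_k from the least squares problem (4)
    z, *_ = np.linalg.lstsq(A @ S, r, rcond=None)
    return r - A @ (S @ z)

def next_p_recursion(A, r, p):
    # p_{k+1} = (1 - gamma_k)(r_k + beta_k p_k) with the formulas above
    beta = -(r @ (A @ p)) / (p @ (A @ p))
    gamma = (r @ r) / (r @ r + p @ p)
    return (1.0 - gamma) * (r + beta * p)

rng = np.random.default_rng(3)
M = rng.standard_normal((8, 8))
A = M @ M.T + 8 * np.eye(8)
b = rng.standard_normal(8)

x, r = np.zeros(8), -b          # x_0 = 0, r_0 = -b
p, cols = b.copy(), []          # p_1 = b; cols collects the columns of S_k
for k in range(5):
    alpha = -(p @ r) / (p @ (A @ p))            # alpha_{k+1}
    x, r = x + alpha * p, r + alpha * (A @ p)   # updates (1) and (2)
    cols.append(p)
    p_new = next_p_recursion(A, r, p)
    diff = np.linalg.norm(next_p_lstsq(A, np.column_stack(cols), r) - p_new)
    print(diff)                                 # ~0: the recursion matches (3)-(4)
    p = p_new
```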

1.5 Non-failure
We have left to show that the method does not fail. Recall that in Section 1.1
we assumed that pℓ ≠ 0 and rℓ ≠ 0 for 0 ≤ ℓ ≤ k. The results we have shown so
far therefore only hold up to ℓ = k. We can now finish the induction argument
by noting that whenever pk ≠ 0 and r k ≠ 0 we must have pk+1 ≠ 0 by (8), since
the minimizing γk then satisfies 0 < γk < 1, so both terms in (8) are positive.
The results in the previous sections therefore hold for all ℓ, until r ℓ = 0 and the
method stops.

1.6 Basic algorithm


We summarize the basic algorithm that follows from the considerations above:
1. Given rk , pk , xk .
2. Compute βk = − rTk Apk / (pTk Apk ), γk = ||rk ||2 / (||r k ||2 + ||pk ||2 ).

3. Update pk+1 = (1 − γk )(r k + βk pk ).
4. Compute αk+1 = − pTk+1 r k / (pTk+1 Apk+1 )

5. Update xk+1 = xk + αk+1 pk+1


6. Update r k+1 = rk + αk+1 Apk+1
7. Iterate.
This can be implemented such that only one matrix vector multiply is used per
iteration. We will see this more clearly in the next section.
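
A direct transcription of these steps into NumPy might look as follows. This is a sketch, not
a reference implementation; the function name and the tolerance-based stopping rule are choices
made here, and caching qk = Apk is what gives one matrix-vector multiply per iteration.

```python
import numpy as np

def cg_basic(A, b, tol=1e-12, maxiter=None):
    # Steps 1-7 above, with q_k = A p_k cached so that only one
    # matrix-vector multiply per iteration is needed.
    # Assumes b != 0 (as in the note, p_1 = b).
    n = len(b)
    maxiter = n if maxiter is None else maxiter
    x = np.zeros(n)
    r = -np.asarray(b, dtype=float)        # r_0 = A x_0 - b
    p = np.asarray(b, dtype=float).copy()  # p_1 = b
    q = A @ p                              # q_1 = A p_1
    alpha = -(p @ r) / (p @ q)             # alpha_1 = -p_1^T r_0 / (p_1^T A p_1)
    x = x + alpha * p
    r = r + alpha * q
    for _ in range(1, maxiter):
        if np.linalg.norm(r) <= tol:
            break
        beta = -(r @ q) / (p @ q)              # step 2
        gamma = (r @ r) / (r @ r + p @ p)      # step 2
        p = (1.0 - gamma) * (r + beta * p)     # step 3
        q = A @ p                              # the single matrix-vector multiply
        alpha = -(p @ r) / (p @ q)             # step 4
        x = x + alpha * p                      # step 5
        r = r + alpha * q                      # step 6
    return x
```

For symmetric positive definite A, cg_basic(A, b) reproduces np.linalg.solve(A, b) after at
most n iterations in exact arithmetic; the version in the next section removes γk and most of
the inner products.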

1.7 Optimized algorithm


We make a few observations which simplify the algorithm. First, if we define
pk = (1 − γk )p̂k , α̂k = p̂Tk b / (p̂Tk Ap̂k ), β̂k = − rTk Ap̂k / (p̂Tk Ap̂k ),
then
α̂k p̂k = αk pk , β̂k p̂k = βk pk .
We can therefore use (p̂k , α̂k , β̂k ) instead of (pk , αk , βk ) in the algorithm, avoid-
ing γk entirely. Moreover,
r Tk r k = r Tk (r k−1 + α̂k Ap̂k ) = α̂k r Tk Ap̂k ,
and
p̂Tk r k−1 = (r k−1 + β̂k−1 p̂k−1 )T rk−1 = rTk−1 r k−1 .
Hence,
rTk r k / (rTk−1 rk−1 ) = α̂k rTk Ap̂k / (p̂Tk r k−1 ) = − rTk Ap̂k / (p̂Tk Ap̂k ) = β̂k .
This leads to an algorithm with one matrix vector multiply, two inner products
and three saxpy operations per iteration:
1. Given rk , p̂k , xk and sk−1 = rTk−1 r k−1
2. Compute sk = rTk r k (inner product)
3. Compute β̂k = sk /sk−1 (scalar)

4. Update p̂k+1 = r k + β̂k p̂k (saxpy)


5. Compute q k+1 = Ap̂k+1 (matrix vector multiply)
6. Compute α̂k+1 = − sk / (p̂Tk+1 q k+1 ) (scalar, inner product)

7. Update xk+1 = xk + α̂k+1 p̂k+1 (saxpy)


8. Update r k+1 = rk + α̂k+1 q k+1 (saxpy)
9. Iterate.
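
A sketch of the optimized algorithm in the same spirit is given below (again with an
illustrative function name, stopping rule and test problem). With this note's sign convention
rk = Axk − b, the first direction is taken as p̂1 = r 0 (proportional to p1 = b), which is
what makes the identity p̂Tk r k−1 = rTk−1 r k−1 hold also for k = 1, and the scalar in step 6
then carries the minus sign, consistent with the identities above.

```python
import numpy as np

def cg_optimized(A, b, tol=1e-12, maxiter=None):
    # One matrix-vector multiply, two inner products and three saxpys per
    # iteration.  Sign convention of the note: r_k = A x_k - b, hat p_1 = r_0.
    n = len(b)
    maxiter = n if maxiter is None else maxiter
    x = np.zeros(n)
    r = -np.asarray(b, dtype=float)     # r_0 = A x_0 - b
    p = r.copy()                        # hat p_1 = r_0
    s_old = r @ r                       # s_0 = r_0^T r_0
    q = A @ p
    alpha = -s_old / (p @ q)            # hat alpha_1
    x = x + alpha * p
    r = r + alpha * q
    for _ in range(1, maxiter):
        s = r @ r                       # step 2 (inner product)
        if np.sqrt(s) <= tol:
            break
        beta = s / s_old                # step 3: hat beta_k = s_k / s_{k-1}
        p = r + beta * p                # step 4 (saxpy)
        q = A @ p                       # step 5 (matrix-vector multiply)
        alpha = -s / (p @ q)            # step 6: hat alpha_{k+1} = -s_k / (hat p^T q)
        x = x + alpha * p               # step 7 (saxpy)
        r = r + alpha * q               # step 8 (saxpy)
        s_old = s
    return x

# illustrative check against a dense solve
rng = np.random.default_rng(4)
M = rng.standard_normal((20, 20))
A = M @ M.T + 20 * np.eye(20)
b = rng.standard_normal(20)
print(np.linalg.norm(cg_optimized(A, b) - np.linalg.solve(A, b)))   # ~0
```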
