
Dennis Deturck and Herbert S. Wilf

Department of Mathematics

University of Pennsylvania

Philadelphia, PA 19104-6395

Copyright 2002, Dennis Deturck and Herbert Wilf

April 30, 2002


Contents

1 Differential and Difference Equations

1.1 Introduction

1.2 Linear equations with constant coefficients

1.3 Difference equations

1.4 Computing with difference equations

1.5 Stability theory

1.6 Stability theory of difference equations

2 The Numerical Solution of Differential Equations

2.1 Euler’s method

2.2 Software notes

2.3 Systems and equations of higher order

2.4 How to document a program

2.5 The midpoint and trapezoidal rules

2.6 Comparison of the methods

2.7 Predictor-corrector methods

2.8 Truncation error and step size

2.9 Controlling the step size

2.10 Case study: Rocket to the moon

2.11 Maple programs for the trapezoidal rule

2.11.1 Example: Computing the cosine function

2.11.2 Example: The moon rocket in one dimension

2.12 The big leagues

2.13 Lagrange and Adams formulas

3 Numerical linear algebra

3.1 Vector spaces and linear mappings

3.2 Linear systems

3.3 Building blocks for the linear equation solver

3.4 How big is zero?

3.5 Operation count

3.6 To unscramble the eggs

3.7 Eigenvalues and eigenvectors of matrices

3.8 The orthogonal matrices of Jacobi

3.9 Convergence of the Jacobi method

3.10 Corbató’s idea and the implementation of the Jacobi algorithm

3.11 Getting it together

3.12 Remarks

Chapter 1

Differential and Difference Equations

1.1 Introduction

In this chapter we are going to study diﬀerential equations, with particular emphasis on how

to solve them with computers. We assume that the reader has previously met diﬀerential

equations, so we’re going to review the most basic facts about them rather quickly.

A diﬀerential equation is an equation in an unknown function, say y(x), where the

equation contains various derivatives of y and various known functions of x. The problem

is to “ﬁnd” the unknown function. The order of a diﬀerential equation is the order of the

highest derivative that appears in it.

Here’s an easy equation of first order:

$$y'(x) = 0. \tag{1.1.1}$$

The unknown function is $y(x) = \text{constant}$, so we have solved the given equation (1.1.1).

The next one is a little harder:

$$y'(x) = 2y(x). \tag{1.1.2}$$

A solution will, no doubt, arrive after a bit of thought, namely $y(x) = e^{2x}$. But if $y(x)$ is a solution of (1.1.2), then so is $10y(x)$, or $49.6y(x)$, or in fact $cy(x)$ for any constant $c$. Hence $y = ce^{2x}$ is a solution of (1.1.2). Are there any other solutions? No, there aren’t, because if $y$ is any function that satisfies (1.1.2) then

$$(ye^{-2x})' = e^{-2x}(y' - 2y) = 0, \tag{1.1.3}$$

and so $ye^{-2x}$ must be a constant, $C$.

In general, we can expect that if a diﬀerential equation is of the ﬁrst order, then the

most general solution will involve one arbitrary constant $C$. This is not always the case, since we can write down differential equations that have no solutions at all. We would have, for instance, a fairly hard time (why?) finding a real function $y(x)$ for which

$$(y')^2 = -y^2 - 2. \tag{1.1.4}$$

There are certain special kinds of diﬀerential equations that can always be solved, and

it’s often important to be able to recognize them. Among these are the “first-order linear” equations

$$y'(x) + a(x)y(x) = 0, \tag{1.1.5}$$

where a(x) is a given function of x.

Before we describe the solution of these equations, let’s discuss the word linear. To say

that an equation is linear is to say that if we have any two solutions $y_1(x)$ and $y_2(x)$ of the equation, then $c_1 y_1(x) + c_2 y_2(x)$ is also a solution of the equation, where $c_1$ and $c_2$ are any two constants (in other words, the set of solutions forms a vector space).

Equation (1.1.1) is linear; in fact, $y_1(x) = 7$ and $y_2(x) = 23$ are both solutions, and so is $7c_1 + 23c_2$. Less trivially, the equation

$$y''(x) + y(x) = 0 \tag{1.1.6}$$

is linear. The linearity of (1.1.6) can be checked right from the equation itself, without knowing what the solutions are (do it!). For an example, though, we might note that $y = \sin x$ is a solution of (1.1.6), that $y = \cos x$ is another solution of (1.1.6), and finally, by linearity, that the function $y = c_1 \sin x + c_2 \cos x$ is a solution, whatever the constants $c_1$ and $c_2$.

Now let’s consider an instance of the first-order linear equation (1.1.5):

$$y'(x) + xy(x) = 0. \tag{1.1.7}$$

So we’re looking for a function whose derivative is $-x$ times the function. Evidently $y = e^{-x^2/2}$ will do, and the general solution is $y(x) = ce^{-x^2/2}$.

If instead of (1.1.7) we had

$$y'(x) + x^2 y(x) = 0,$$

then we would have found the general solution $ce^{-x^3/3}$.

As a last example, take

$$y'(x) - (\cos x)\,y(x) = 0. \tag{1.1.8}$$

The right medicine is now $y(x) = e^{\sin x}$. In the next paragraph we’ll give the general rule of which the above are three examples. The reader might like to put down the book at this point and try to formulate the rule for solving (1.1.5) before going on to read about it.

Ready? What we need is to choose some antiderivative $A(x)$ of $a(x)$, and then the solution is $y(x) = ce^{-A(x)}$.

Since that was so easy, next let’s put a more interesting right hand side into (1.1.5), by

considering the equation

$$y'(x) + a(x)y(x) = b(x) \tag{1.1.9}$$

where now $b(x)$ is also a given function of $x$. (Is (1.1.9) a linear equation? Are you sure?)

To solve (1.1.9), once again choose some antiderivative $A(x)$ of $a(x)$, and then note that we can rewrite (1.1.9) in the equivalent form

$$e^{-A(x)} \frac{d}{dx}\left( e^{A(x)} y(x) \right) = b(x).$$

Now if we multiply through by $e^{A(x)}$ we see that

$$\frac{d}{dx}\left( e^{A(x)} y(x) \right) = b(x)e^{A(x)} \tag{1.1.10}$$

so, if we integrate both sides,

$$e^{A(x)} y(x) = \int^x b(t)e^{A(t)}\,dt + \text{const.}, \tag{1.1.11}$$

where on the right side, we mean any antiderivative of the function under the integral sign. Consequently

$$y(x) = e^{-A(x)} \left( \int^x b(t)e^{A(t)}\,dt + \text{const.} \right). \tag{1.1.12}$$

As an example, consider the equation

$$y' + \frac{y}{x} = x + 1. \tag{1.1.13}$$

We find that $A(x) = \log x$, and then from (1.1.12) we get

$$y(x) = \frac{1}{x}\left( \int^x (t+1)t\,dt + C \right) = \frac{x^2}{3} + \frac{x}{2} + \frac{C}{x}. \tag{1.1.14}$$
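A formula like (1.1.14) is easy to spot-check numerically. The following is a minimal Python sketch, not part of the original notes: the function names, the sample points, and the step size `h` are all illustrative assumptions. It estimates $y'$ with a central difference and evaluates the left side minus the right side of (1.1.13).

```python
def y(x, C):
    """Candidate general solution (1.1.14): x^2/3 + x/2 + C/x."""
    return x**2 / 3 + x / 2 + C / x


def residual(x, C, h=1e-5):
    """y' + y/x - (x + 1), with y' estimated by a central difference."""
    dy = (y(x + h, C) - y(x - h, C)) / (2 * h)
    return dy + y(x, C) / x - (x + 1)


# The residual should be tiny for every constant C and every x > 0 tried.
max_residual = max(abs(residual(x, C))
                   for x in (0.5, 1.0, 2.0, 3.0)
                   for C in (-2.0, 0.0, 1.0, 7.5))
print(max_residual)
```

The residual is not exactly zero because the derivative is only approximated; a value far below the size of the terms involved is what indicates agreement.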

We may be doing a disservice to the reader by beginning with this discussion of certain

types of diﬀerential equations that can be solved analytically, because it would be erroneous

to assume that most, or even many, such equations can be dealt with by these techniques.

Indeed, the reason for the importance of the numerical methods that are the main subject

of this chapter is precisely that most equations that arise in “real” problems are quite

intractable by analytical means, so the computer is the only hope.

Despite the above disclaimer, in the next section we will study yet another important

family of diﬀerential equations that can be handled analytically, namely linear equations

with constant coeﬃcients.

Exercises 1.1

1. Find the general solution of each of the following equations:

(a) $y' = 2\cos x$

(b) $y' + \frac{2}{x}\,y = 0$

(c) $y' + xy = 3$

(d) $y' + \frac{1}{x}\,y = x + 5$

(e) $2yy' = x + 1$

2. Show that the equation (1.1.4) has no real solutions.

3. Go to your computer or terminal and familiarize yourself with the equipment, the

operating system, and the speciﬁc software you will be using. Then write a program

that will calculate and print the sum of the squares of the integers 1, 2, . . . , 100. Run

this program.

4. For each part of problem 1, ﬁnd the solution for which y(1) = 1.

1.2 Linear equations with constant coeﬃcients

One particularly pleasant, and important, type of linear diﬀerential equation is the variety

with constant coeﬃcients, such as

$$y'' + 3y' + 2y = 0. \tag{1.2.1}$$

It turns out that what we have to do to solve these equations is to try a solution of a certain

form, and we will then ﬁnd that all of the solutions indeed are of that form.

Let’s see if the function $y(x) = e^{\alpha x}$ is a solution of (1.2.1). If we substitute in (1.2.1), and then cancel the common factor $e^{\alpha x}$, we are left with the quadratic equation

$$\alpha^2 + 3\alpha + 2 = 0$$

whose solutions are $\alpha = -2$ and $\alpha = -1$. Hence for those two values of $\alpha$ our trial function $y(x) = e^{\alpha x}$ is indeed a solution of (1.2.1). In other words, $e^{-2x}$ is a solution, $e^{-x}$ is a solution, and since the equation is linear,

$$y(x) = c_1 e^{-2x} + c_2 e^{-x} \tag{1.2.2}$$

is also a solution, where $c_1$ and $c_2$ are arbitrary constants. Finally, (1.2.2) must be the most general solution since it has the “right” number of arbitrary constants, namely two.

Trying a solution in the form of an exponential is always the correct ﬁrst step in solving

linear equations with constant coeﬃcients. Various complications can develop, however, as

illustrated by the equation

$$y'' + 4y' + 4y = 0. \tag{1.2.3}$$

Again, let’s see if there is a solution of the form $y = e^{\alpha x}$. This time, substitution into (1.2.3) and cancellation of the factor $e^{\alpha x}$ leads to the quadratic equation

$$\alpha^2 + 4\alpha + 4 = 0, \tag{1.2.4}$$

whose two roots are identical, both being $-2$. Hence $e^{-2x}$ is a solution, and of course so is $c_1 e^{-2x}$, but we don’t yet have the general solution because there is, so far, only one arbitrary constant. The difficulty, of course, is caused by the fact that the roots of (1.2.4) are not distinct.

In this case, it turns out that $xe^{-2x}$ is another solution of the differential equation (1.2.3) (verify this), so the general solution is $(c_1 + c_2 x)e^{-2x}$.
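The two quadratic cases met so far can be reproduced in a few lines. This is a rough Python sketch under stated assumptions: the helper names `characteristic_roots` and `residual_repeated` are made up for illustration, and the derivatives of $xe^{-2x}$ are worked out by hand rather than computed symbolically.

```python
import cmath
import math


def characteristic_roots(b, c):
    """Roots of alpha^2 + b*alpha + c = 0 via the quadratic formula."""
    disc = cmath.sqrt(b * b - 4 * c)
    return (-b + disc) / 2, (-b - disc) / 2


def residual_repeated(x):
    """y'' + 4y' + 4y for y = x*exp(-2x), using hand-computed derivatives."""
    e = math.exp(-2 * x)
    y, dy, d2y = x * e, (1 - 2 * x) * e, (4 * x - 4) * e
    return d2y + 4 * dy + 4 * y


roots_121 = characteristic_roots(3, 2)   # distinct roots, equation (1.2.1)
roots_123 = characteristic_roots(4, 4)   # repeated root, equation (1.2.3)
print(roots_121, roots_123)
print(max(abs(residual_repeated(x)) for x in (0.0, 0.5, 1.5, 3.0)))
```

Using `cmath.sqrt` rather than `math.sqrt` means the same helper also handles a negative discriminant, where the roots come out complex.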

Suppose that we begin with an equation of third order, and that all three roots turn

out to be the same. For instance, to solve the equation

$$y''' + 3y'' + 3y' + y = 0 \tag{1.2.5}$$

we would try $y = e^{\alpha x}$, and we would then be facing the cubic equation

$$\alpha^3 + 3\alpha^2 + 3\alpha + 1 = 0, \tag{1.2.6}$$

whose “three” roots are all equal to $-1$. Now, not only is $e^{-x}$ a solution, but so are $xe^{-x}$ and $x^2 e^{-x}$.

To see why this procedure works in general, suppose we have a linear differential equation with constant coefficients, say

$$y^{(n)} + a_1 y^{(n-1)} + a_2 y^{(n-2)} + \cdots + a_n y = 0. \tag{1.2.7}$$

If we try to find a solution of the usual exponential form $y = e^{\alpha x}$, then after substitution into (1.2.7) and cancellation of the common factor $e^{\alpha x}$, we would find the polynomial equation

$$\alpha^n + a_1 \alpha^{n-1} + a_2 \alpha^{n-2} + \cdots + a_n = 0. \tag{1.2.8}$$

The polynomial on the left side is called the characteristic polynomial of the given differential equation. Suppose now that a certain number $\alpha = \alpha^*$ is a root of (1.2.8) of multiplicity $p$. To say that $\alpha^*$ is a root of multiplicity $p$ of the equation is to say that $(\alpha - \alpha^*)^p$ is a factor of the characteristic polynomial. Now look at the left side of the given differential equation (1.2.7). We can write it in the form

$$(D^n + a_1 D^{n-1} + a_2 D^{n-2} + \cdots + a_n)\,y = 0, \tag{1.2.9}$$

in which $D$ is the differential operator $d/dx$. In the parentheses in (1.2.9) we see the polynomial $\varphi(D)$, where $\varphi$ is exactly the characteristic polynomial in (1.2.8).

Since $\varphi(\alpha)$ has the factor $(\alpha - \alpha^*)^p$, it follows that $\varphi(D)$ has the factor $(D - \alpha^*)^p$, so the left side of (1.2.9) can be written in the form

$$g(D)(D - \alpha^*)^p\, y = 0, \tag{1.2.10}$$

where $g$ is a polynomial of degree $n - p$. Now it’s quite easy to see that $y = x^k e^{\alpha^* x}$ satisfies (1.2.10) (and therefore (1.2.7) also) for each $k = 0, 1, \ldots, p-1$. Indeed, if we substitute this function $y$ into (1.2.10), we see that it is enough to show that

$$(D - \alpha^*)^p (x^k e^{\alpha^* x}) = 0, \qquad k = 0, 1, \ldots, p-1. \tag{1.2.11}$$

However, by the product rule, $(D - \alpha^*)(x^k e^{\alpha^* x}) = kx^{k-1}e^{\alpha^* x}$, and if we apply $(D - \alpha^*)$ again,

$$(D - \alpha^*)^2 (x^k e^{\alpha^* x}) = k(k-1)x^{k-2}e^{\alpha^* x},$$

etc. Each application lowers the power of $x$ by one, so since $k < p$ it is clear that $(D - \alpha^*)^p (x^k e^{\alpha^* x}) = 0$, as claimed.

To summarize, then, if we encounter a root $\alpha^*$ of the characteristic equation, of multiplicity $p$, then corresponding to $\alpha^*$ we can find exactly $p$ linearly independent solutions of the differential equation, namely

$$e^{\alpha^* x},\ xe^{\alpha^* x},\ x^2 e^{\alpha^* x},\ \ldots,\ x^{p-1}e^{\alpha^* x}. \tag{1.2.12}$$

Another way to state it is to say that the portion of the general solution of the given differential equation that corresponds to a root $\alpha^*$ of the characteristic polynomial equation is $Q(x)e^{\alpha^* x}$, where $Q(x)$ is an arbitrary polynomial whose degree is one less than the multiplicity of the root $\alpha^*$.

One last mild complication may arise from roots of the characteristic equation that are

not real numbers. These don’t really require any special attention, but they do present a few

options. For instance, to solve $y'' + 4y = 0$, we find the characteristic equation $\alpha^2 + 4 = 0$, and the complex roots $\pm 2i$. Hence the general solution is obtained by the usual rule as

$$y(x) = c_1 e^{2ix} + c_2 e^{-2ix}. \tag{1.2.13}$$

This is a perfectly acceptable form of the solution, but we could make it look a bit prettier by using Euler’s formula, which says that

$$\begin{aligned} e^{2ix} &= \cos 2x + i\sin 2x \\ e^{-2ix} &= \cos 2x - i\sin 2x. \end{aligned} \tag{1.2.14}$$

Then our general solution would look like

$$y(x) = (c_1 + c_2)\cos 2x + (ic_1 - ic_2)\sin 2x. \tag{1.2.15}$$

But $c_1$ and $c_2$ are just arbitrary constants, hence so are $c_1 + c_2$ and $ic_1 - ic_2$, so we might as well rename them $c_1$ and $c_2$, in which case the solution would take the form

$$y(x) = c_1 \cos 2x + c_2 \sin 2x. \tag{1.2.16}$$
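The passage from (1.2.13) to (1.2.15) can be checked numerically with complex arithmetic. The sketch below is illustrative only: the function names and the particular constants are assumptions, with the constants chosen as complex conjugates so that the resulting solution happens to be real.

```python
import cmath
import math


def exponential_form(x, c1, c2):
    """Solution written as in (1.2.13)."""
    return c1 * cmath.exp(2j * x) + c2 * cmath.exp(-2j * x)


def trig_form(x, c1, c2):
    """Same solution written as in (1.2.15)."""
    return (c1 + c2) * math.cos(2 * x) + (1j * c1 - 1j * c2) * math.sin(2 * x)


# Conjugate constants: c1 + c2 = 3 and i*c1 - i*c2 = 1 are both real.
c1, c2 = 1.5 - 0.5j, 1.5 + 0.5j
max_gap = max(abs(exponential_form(x, c1, c2) - trig_form(x, c1, c2))
              for x in (0.0, 0.3, 1.0, 2.5))
print(max_gap)
print(exponential_form(1.0, c1, c2))
```

With these constants the solution is $3\cos 2x + \sin 2x$, which suggests why the trigonometric form is the convenient one when real solutions are wanted.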

Here’s an example that shows the various possibilities:

$$y^{(8)} - 5y^{(7)} + 27y^{(6)} - 97y^{(5)} + 245y^{(4)} - 531y^{(3)} + 765y^{(2)} - 567y' + 162y = 0. \tag{1.2.17}$$

The equation was cooked up to have a characteristic polynomial that can be factored as

$$(\alpha - 2)(\alpha^2 + 9)^2(\alpha - 1)^3. \tag{1.2.18}$$

Hence the roots of the characteristic equation are 2 (simple), $3i$ (multiplicity 2), $-3i$ (multiplicity 2), and 1 (multiplicity 3).
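Since the example was built backwards from its factored form, the coefficients of the differential equation can be recovered by multiplying the factors of (1.2.18) out again. Here is a small Python sketch of that expansion; the helpers `poly_mul` and `poly_pow` are illustrative names, and polynomials are represented as plain coefficient lists, highest power first.

```python
def poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists, highest power first."""
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


def poly_pow(p, k):
    """Raise a polynomial to a non-negative integer power by repeated multiplication."""
    result = [1]
    for _ in range(k):
        result = poly_mul(result, p)
    return result


# (alpha - 2) * (alpha^2 + 9)^2 * (alpha - 1)^3
coeffs = poly_mul(poly_mul([1, -2], poly_pow([1, 0, 9], 2)),
                  poly_pow([1, -1], 3))
print(coeffs)  # coefficients of alpha^8 down to the constant term
```

The printed list can be compared entry by entry with the coefficients of $y^{(8)}, y^{(7)}, \ldots, y$ in the differential equation.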

Corresponding to the root 2, the general solution will contain the term $c_1 e^{2x}$. Corresponding to the double root at $3i$ we have terms $(c_2 + c_3 x)e^{3ix}$ in the solution. From the double root at $-3i$ we get a contribution $(c_4 + c_5 x)e^{-3ix}$, and finally from the triple root at 1 we get $(c_6 + c_7 x + c_8 x^2)e^{x}$. The general solution is the sum of these eight terms. Alternatively, we might have taken the four terms that come from $\pm 3i$ in the form

$$(c_2 + c_3 x)\cos 3x + (c_4 + c_5 x)\sin 3x. \tag{1.2.19}$$

Exercises 1.2

1. Obtain the general solutions of each of the following differential equations:

(a) $y'' + 5y' + 6y = 0$

(b) $y'' - 8y' + 7y = 0$

(c) $(D + 3)^2 y = 0$

(d) $(D^2 + 16)^2 y = 0$

(e) $(D + 3)^3 (D^2 - 25)^2 (D + 2)^3 y = 0$

2. Find a curve $y = f(x)$ that passes through the origin with unit slope, and which satisfies $(D + 4)(D - 1)y = 0$.

1.3 Diﬀerence equations

Whereas a diﬀerential equation is an equation in an unknown function, a diﬀerence equation

is an equation in an unknown sequence. For example, suppose we know that a certain sequence of numbers $y_0, y_1, y_2, \ldots$ satisfies the following conditions:

$$y_{n+2} + 5y_{n+1} + 6y_n = 0, \qquad n = 0, 1, 2, \ldots \tag{1.3.1}$$

and furthermore that $y_0 = 1$ and $y_1 = 3$.

Evidently, we can compute as many of the $y_n$’s as we need from (1.3.1); thus we would get $y_2 = -21$, $y_3 = 87$, $y_4 = -309$, and so forth. The entire sequence of $y_n$’s is determined by the difference equation (1.3.1) together with the two starting values.
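Computing terms directly from a recurrence of this kind takes only a short loop. The following Python sketch is one possible way to do it; the function name `iterate_recurrence` and its argument layout are assumptions made for illustration, with the coefficients given in the order they appear after the leading term.

```python
def iterate_recurrence(coeffs, start, count):
    """First `count` terms of y[n+p] + a1*y[n+p-1] + ... + ap*y[n] = 0.

    coeffs = [a1, ..., ap]; start = [y0, ..., y(p-1)].
    """
    y = list(start)
    while len(y) < count:
        # a1 multiplies the newest term, ap the oldest one still needed.
        y.append(-sum(a * y[-1 - k] for k, a in enumerate(coeffs)))
    return y[:count]


# Equation (1.3.1): y[n+2] + 5*y[n+1] + 6*y[n] = 0 with y0 = 1, y1 = 3.
terms = iterate_recurrence([5, 6], [1, 3], 6)
print(terms)  # [1, 3, -21, 87, -309, 1023]
```

Because everything here is integer arithmetic, the terms are exact no matter how far the loop runs.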

Such equations are encountered when diﬀerential equations are solved on computers.

Naturally, the computer can provide the values of the unknown function only at a discrete

set of points. These values are computed by replacing the given differential equation by a difference equation that approximates it, and then calculating successive approximate values of the desired function from the difference equation.

Can we somehow “solve” a diﬀerence equation by obtaining a formula for the values

of the solution sequence? The answer is that we can, as long as the diﬀerence equation is

linear and has constant coeﬃcients, as in (1.3.1). Just as in the case of diﬀerential equations

with constant coefficients, the correct strategy for solving them is to try a solution of the right form. In the previous section, the right form to try was $y(x) = e^{\alpha x}$. Now the winning combination is $y = \alpha^n$, where $\alpha$ is a constant.

In fact, let’s substitute $\alpha^n$ for $y_n$ in (1.3.1) to see what happens. The left side becomes

$$\alpha^{n+2} + 5\alpha^{n+1} + 6\alpha^n = \alpha^n(\alpha^2 + 5\alpha + 6) = 0. \tag{1.3.2}$$

Just as we were able to cancel the common factor $e^{\alpha x}$ in the differential equation case, so here we can cancel the $\alpha^n$, and we’re left with the quadratic equation

$$\alpha^2 + 5\alpha + 6 = 0. \tag{1.3.3}$$

The two roots of this characteristic equation are $\alpha = -2$ and $\alpha = -3$. Therefore the sequence $(-2)^n$ satisfies (1.3.1) and so does $(-3)^n$. Since the difference equation is linear, it follows that

$$y_n = c_1(-2)^n + c_2(-3)^n \tag{1.3.4}$$

is also a solution, whatever the values of the constants $c_1$ and $c_2$.

Now it is evident from (1.3.1) itself that the numbers $y_n$ are uniquely determined if we

prescribe the values of just two of them. Hence, it is very clear that when we have a solution

that contains two arbitrary constants we have the most general solution.

When we take account of the given data $y_0 = 1$ and $y_1 = 3$, we get the two equations

$$\begin{cases} 1 = c_1 + c_2 \\ 3 = (-2)c_1 + (-3)c_2 \end{cases} \tag{1.3.5}$$

from which $c_1 = 6$ and $c_2 = -5$. Finally, we use these values of $c_1$ and $c_2$ in (1.3.4) to get

$$y_n = 6(-2)^n - 5(-3)^n, \qquad n = 0, 1, 2, \ldots. \tag{1.3.6}$$

Equation (1.3.6) is the desired formula that represents the unique solution of the given

diﬀerence equation together with the prescribed starting values.
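The step from the starting values to the constants, and the agreement of the resulting formula with the recurrence, can both be checked mechanically. This is a minimal sketch assuming two distinct characteristic roots; `fit_constants` is a hypothetical helper name, and exact rational arithmetic from Python's standard `fractions` module is used so that no rounding enters.

```python
from fractions import Fraction


def fit_constants(r1, r2, y0, y1):
    """Solve c1 + c2 = y0 and c1*r1 + c2*r2 = y1 for distinct roots r1, r2."""
    c2 = Fraction(y1 - r1 * y0, r2 - r1)
    c1 = y0 - c2
    return c1, c2


c1, c2 = fit_constants(-2, -3, 1, 3)
closed_form = [c1 * (-2) ** n + c2 * (-3) ** n for n in range(8)]

# Direct iteration of y[n+2] = -5*y[n+1] - 6*y[n] from the same starting values.
direct = [1, 3]
for _ in range(6):
    direct.append(-5 * direct[-1] - 6 * direct[-2])

print(c1, c2)
print(closed_form == direct)
```

The same two-line elimination works for any second-order equation with distinct roots; a repeated root would need the $(c_1 + c_2 n)\alpha^n$ form instead.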

Let’s step back a few paces to get a better view of the solution. Notice that the formula

(1.3.6) expresses the solution as a linear combination of nth powers of the roots of the

associated characteristic equation (1.3.3). When $n$ is very large, is the number $y_n$ a large number or a small one? Evidently the powers of $-3$ overwhelm those of $-2$, so the sequence

will behave roughly like a constant times powers of −3. This means that we should expect

the members of the sequence to alternate in sign and to grow rapidly in magnitude.

So much for the equation (1.3.1). Now let’s look at the general case, in the form of a

linear difference equation of order $p$:

$$y_{n+p} + a_1 y_{n+p-1} + a_2 y_{n+p-2} + \cdots + a_p y_n = 0. \tag{1.3.7}$$

We try a solution of the form $y_n = \alpha^n$, and after substituting and canceling, we get the characteristic equation

$$\alpha^p + a_1 \alpha^{p-1} + a_2 \alpha^{p-2} + \cdots + a_p = 0. \tag{1.3.8}$$

This is a polynomial equation of degree $p$, so it has $p$ roots, counting multiplicities, somewhere in the complex plane.

Let $\alpha^*$ be one of these $p$ roots. If $\alpha^*$ is simple (i.e., has multiplicity 1) then the part of the general solution that corresponds to $\alpha^*$ is $c(\alpha^*)^n$. If, however, $\alpha^*$ is a root of multiplicity $k > 1$ then we must multiply the solution $c(\alpha^*)^n$ by an arbitrary polynomial in $n$, of degree $k - 1$, just as in the corresponding case for differential equations we used an arbitrary polynomial in $x$ of degree $k - 1$.

We illustrate this, as well as the case of complex roots, by considering the following

difference equation of order five:

$$y_{n+5} - 5y_{n+4} + 9y_{n+3} - 9y_{n+2} + 8y_{n+1} - 4y_n = 0. \tag{1.3.9}$$

This example is rigged so that the characteristic equation can be factored as

$$(\alpha^2 + 1)(\alpha - 2)^2(\alpha - 1) = 0 \tag{1.3.10}$$

from which the roots are obviously $i$, $-i$, 2 (multiplicity 2), 1.

Corresponding to the roots $i$, $-i$, the portion of the general solution is $c_1 i^n + c_2 (-i)^n$. Since

$$i^n = e^{in\pi/2} = \cos\left(\frac{n\pi}{2}\right) + i\sin\left(\frac{n\pi}{2}\right) \tag{1.3.11}$$

and similarly for $(-i)^n$, we can also take this part of the general solution in the form

$$c_1 \cos\left(\frac{n\pi}{2}\right) + c_2 \sin\left(\frac{n\pi}{2}\right). \tag{1.3.12}$$

The double root $\alpha = 2$ contributes $(c_3 + c_4 n)2^n$, and the simple root $\alpha = 1$ adds $c_5$ to the general solution, which in its full glory is

$$y_n = c_1 \cos\left(\frac{n\pi}{2}\right) + c_2 \sin\left(\frac{n\pi}{2}\right) + (c_3 + c_4 n)2^n + c_5. \tag{1.3.13}$$

The five constants would be determined by prescribing five initial values, say $y_0$, $y_1$, $y_2$, $y_3$ and $y_4$, as we would expect for the equation (1.3.9).
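Each of the five building blocks in (1.3.13) should satisfy (1.3.9) on its own, and that is simple to test with exact integer arithmetic. In the sketch below (names such as `lhs` and `basis` are illustrative assumptions), the values of $\cos(n\pi/2)$ and $\sin(n\pi/2)$ are entered as their period-four integer patterns so that no floating-point error is involved.

```python
def lhs(y, n):
    """Left side of (1.3.9) at index n, for a sequence given as a function y(n)."""
    return (y(n + 5) - 5 * y(n + 4) + 9 * y(n + 3)
            - 9 * y(n + 2) + 8 * y(n + 1) - 4 * y(n))


basis = {
    "cos(n*pi/2)": lambda n: (1, 0, -1, 0)[n % 4],
    "sin(n*pi/2)": lambda n: (0, 1, 0, -1)[n % 4],
    "2^n": lambda n: 2 ** n,
    "n*2^n": lambda n: n * 2 ** n,
    "1": lambda n: 1,
}

# Every basis sequence should give exactly zero at every index tried.
all_zero = all(lhs(y, n) == 0 for y in basis.values() for n in range(20))
print(all_zero)

# For contrast, 3^n is not built from a characteristic root and fails the test.
print(lhs(lambda n: 3 ** n, 0))
```

By linearity, any combination of the five sequences with constants $c_1, \ldots, c_5$ then satisfies the equation as well.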

Exercises 1.3

1. Obtain the general solution of each of the following difference equations:

(a) $y_{n+1} = 3y_n$

(b) $y_{n+1} = 3y_n + 2$

(c) $y_{n+2} - 2y_{n+1} + y_n = 0$

(d) $y_{n+2} - 8y_{n+1} + 12y_n = 0$

(e) $y_{n+2} - 6y_{n+1} + 9y_n = 1$

(f) $y_{n+2} + y_n = 0$

2. Find the solution of the given difference equation that takes the prescribed initial values:

(a) $y_{n+2} = 2y_{n+1} + y_n$; $y_0 = 0$; $y_1 = 1$

(b) $y_{n+1} = \alpha y_n + \beta$; $y_0 = 1$

(c) $y_{n+4} + y_n = 0$; $y_0 = 1$; $y_1 = -1$; $y_2 = 1$; $y_3 = -1$

(d) $y_{n+2} - 5y_{n+1} + 6y_n = 0$; $y_0 = 1$; $y_1 = 2$

3. (a) For each of the difference equations in problems 1 and 2, evaluate

$$\lim_{n\to\infty} \frac{y_{n+1}}{y_n} \tag{1.3.14}$$

if it exists.

(b) Formulate and prove a general theorem about the existence of, and value of, the limit in part (a) for a linear difference equation with constant coefficients.

(c) Reverse the process: given a polynomial equation, find its root of largest absolute value by computing from a certain difference equation and evaluating the ratios of consecutive terms.

(d) Write a computer program to implement the method in part (c). Use it to calculate the largest root of the equation

$$x^8 = x^7 + x^6 + x^5 + \cdots + 1. \tag{1.3.15}$$

1.4 Computing with diﬀerence equations

This is, after all, a book about computing, so let’s begin with computing from diﬀerence

equations since they will give us a chance to discuss some important questions that concern

the design of computer programs. For a sample difference equation we’ll use

$$y_{n+3} = y_{n+2} + 5y_{n+1} + 3y_n \tag{1.4.1}$$

together with the starting values $y_0 = y_1 = y_2 = 1$. The reader might want, just for practice, to find an explicit formula for this sequence by the methods of the previous section.

Suppose we want to compute a large number of these $y$’s in order to verify some property that they have, for instance to check that

$$\lim_{n\to\infty} \frac{y_{n+1}}{y_n} = 3 \tag{1.4.2}$$

which must be true since 3 is the root of largest absolute value of the characteristic equation.

As a ﬁrst approach, we might declare y to be a linear array of some size large enough to

accommodate the expected length of the calculation. Then the rest is easy. For each $n$, we would calculate the next $y_{n+1}$ from (1.4.1) and divide it by its predecessor $y_n$ to get

a new ratio. If the new ratio agrees suﬃciently well with the previous ratio we announce

that the computation has terminated and print the new ratio as our answer. Otherwise, we

move the new ratio to the location of the old ratio, increase n and try again.

If we were to write this out as a formal procedure (algorithm) it might look like:

    y_0 := 1; y_1 := 1; y_2 := 1; n := 2;
    newrat := −10; oldrat := 1;
    while |newrat − oldrat| ≥ 0.000001 do
        oldrat := newrat; n := n + 1;
        y_n := y_{n−1} + 5 ∗ y_{n−2} + 3 ∗ y_{n−3};
        newrat := y_n / y_{n−1}
    endwhile
    print newrat; Halt.

We’ll use the symbol ‘:=’ to mean that we are to compute the quantity on the right, if

necessary, and then store it in the place named on the left. It can be read as ‘is replaced

by’ or ‘is assigned.’ Also, the block that begins with ‘while’ and ends with ‘endwhile’

represents a group of instructions that are to be executed repeatedly until the condition

that follows ‘while’ becomes false, at which point the line following ‘endwhile’ is executed.
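For readers who would like to run the procedure, here is one possible Python rendering of the array-based version. It is a sketch rather than a definitive translation: the function name, the `tol` parameter, and the `max_terms` safety cap are additions that are not in the pseudocode, and a Python list stands in for the array.

```python
def ratio_with_history(tol=1e-6, max_terms=1000):
    """Iterate y[n] = y[n-1] + 5*y[n-2] + 3*y[n-3], keeping every term."""
    y = [1.0, 1.0, 1.0]               # y_0, y_1, y_2
    newrat, oldrat = -10.0, 1.0       # deliberately far apart to enter the loop
    while abs(newrat - oldrat) >= tol and len(y) < max_terms:
        oldrat = newrat
        y.append(y[-1] + 5 * y[-2] + 3 * y[-3])
        newrat = y[-1] / y[-2]
    return newrat, len(y)


ratio, terms_stored = ratio_with_history()
print(ratio, terms_stored)
```

The second returned value reports how many terms ended up in storage, which is the cost the array approach pays.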

The procedure just described is fast, but it uses lots of storage. If, for instance, such a

program needed to calculate 79 y’s before convergence occurred, then it would have used

79 locations of array storage. In fact, the problem above doesn’t need that many locations

because convergence happens a lot sooner. Suppose you wanted to ﬁnd out how much

sooner, given only a programmable hand calculator with ten or twenty memory locations.

Then you might appreciate a calculation procedure that needs just four locations to hold

all necessary y’s.

That’s fairly easy to accomplish, though. At any given moment in the program, all we need in order to find the next $y$ are the previous three $y$’s. So why not save only those three?

We’ll use the previous three to calculate the next one, and stow it for a moment in a fourth

location. Then we’ll compute the new ratio and compare it with the old. If they’re not

close enough, we move each one of the three newest y’s back one step into the places where

we store the latest three y’s and repeat the process. Formally, it might be:

    y := 1; ym1 := 1; ym2 := 1;
    newrat := −10; oldrat := 1;
    while |newrat − oldrat| ≥ 0.000001 do
        ym3 := ym2; ym2 := ym1; ym1 := y;
        oldrat := newrat;
        y := ym1 + 5 ∗ ym2 + 3 ∗ ym3;
        newrat := y/ym1
    endwhile;
    print newrat; Halt.

The calculation can now be done in exactly six memory locations (y, ym1, ym2, ym3,

oldrat, newrat) no matter how many y’s have to be calculated, so you can undertake it

on your hand calculator with complete conﬁdence. The price that we pay for the memory

saving is that we must move the data around a bit more.
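The constant-storage idea carries over directly to a modern language. The Python sketch below keeps the variable names of the procedure above; the function wrapper, the `tol` parameter, and the `max_steps` cap are illustrative additions. A single tuple assignment plays the role of the three shifting statements.

```python
def ratio_rolling(tol=1e-6, max_steps=1000):
    """Same iteration, but only the last few values are ever kept."""
    y, ym1, ym2 = 1.0, 1.0, 1.0
    newrat, oldrat = -10.0, 1.0
    steps = 0
    while abs(newrat - oldrat) >= tol and steps < max_steps:
        # Shift the window back one place: the right side is evaluated first,
        # so this matches ym3 := ym2; ym2 := ym1; ym1 := y.
        ym3, ym2, ym1 = ym2, ym1, y
        oldrat = newrat
        y = ym1 + 5 * ym2 + 3 * ym3
        newrat = y / ym1
        steps += 1
    return newrat, steps


ratio, steps = ratio_rolling()
print(ratio, steps)
```

The storage used does not depend on `steps`, which is the whole point of the rearrangement.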

One should not think that such programming methods are only for hand calculators. As

we progress through the numerical solution of diﬀerential equations we will see situations

in which each of the quantities that appears in the diﬀerence equation will itself be an


array (!), and that very large numbers, perhaps thousands, of these arrays will need to be

computed. Even large computers might quake at the thought of using the ﬁrst method

above, rather than the second, for doing the calculation. Fortunately, it will almost never

be necessary to save in memory all of the computed values simultaneously. Normally, they

will be computed, and then printed or plotted, and never needed except in the calculation

of their immediate successors.

Exercises 1.4

1. The Fibonacci numbers $F_0, F_1, F_2, \ldots$ are defined by the recurrence formula $F_{n+2} = F_{n+1} + F_n$ for $n = 0, 1, 2, \ldots$ together with the starting values $F_0 = 0$, $F_1 = 1$.

(a) Write out the first ten Fibonacci numbers.

(b) Derive an explicit formula for the $n$th Fibonacci number $F_n$.

(c) Evaluate your formula for $n = 0, 1, 2, 3, 4$.

(d) Prove directly from your formula that the Fibonacci numbers are integers. (This is perfectly obvious from their definition, but is not so obvious from the formula!)

(e) Evaluate

$$\lim_{n\to\infty} \frac{F_{n+1}}{F_n} \tag{1.4.3}$$

(f) Write a computer program that will compute Fibonacci numbers and print out the limit in part (e) above, correct to six decimal places.

(g) Write a computer program that will compute the first 40 members of the modified Fibonacci sequence in which $F_0 = 1$ and $F_1 = (1 - \sqrt{5})/2$. Do these computed numbers seem to be approaching zero? Explain carefully what you see and why it happens.

(h) Modify the program of part (g) to run in higher (or double) precision arithmetic. Does it change any of your answers?

2. Find the most general solution of each of the following difference equations:

(a) $y_{n+1} - 2y_n + y_{n-1} = 0$

(b) $y_{n+1} = 2y_n$

(c) $y_{n+2} + y_n = 0$

(d) $y_{n+2} + 3y_{n+1} + 3y_n + y_{n-1} = 0$

1.5 Stability theory

In the study of natural phenomena it is most often true that a small change in conditions

will produce just a small change in the state of the system being studied. If, for example,

a very slight increase in atmospheric pollution could produce dramatically large changes

in populations of ﬂora and fauna, or if tiny variations in the period of the earth’s rotation


produced huge changes in climatic conditions, the world would be a very diﬀerent place to

live in, or to try to live in. In brief, we may say that most aspects of nature are stable.

When physical scientists attempt to understand some facet of nature, they often will

make a mathematical model. This model will usually not faithfully reproduce all of the

structure of the original phenomenon, but one hopes that the important features of the

system will be preserved in the model, so that predictions will be possible. One of the most

important features to preserve is that of stability.

For instance, the example of atmospheric pollution and its eﬀect on living things referred

to above is important and very complex. Therefore considerable eﬀort has gone into the

construction of mathematical models that will allow computer studies of the eﬀects of

atmospheric changes. One of the ﬁrst tests to which such a model should be subjected is

that of stability: does it faithfully reproduce the observed fact that small changes produce

small changes? What is true in nature need not be true in a man-made model that is a

simpliﬁcation or idealization of the real world.

Now suppose that we have gotten ourselves over this hurdle, and we have constructed

a model that is indeed stable. The next step might be to go to the computer and do

calculations from the model, and to use these calculations for predicting the eﬀects of

various proposed actions. Unfortunately, yet another layer of approximation is usually

introduced at this stage, because the model, even though it is a simpliﬁcation of the real

world, might still be too complicated to solve exactly on a computer.

For instance, many models use differential equations. Models of the weather, of the motion of fluids, of the movement of astronomical objects, of spacecraft, of population growth, of predator-prey relationships, of electric circuit transients, and so forth, all involve differential equations. Digital computers solve differential equations by approximating them by

diﬀerence equations, and then solving the diﬀerence equations. Even though the diﬀerential

equation that represents our model is indeed stable, it may be that the diﬀerence equation

that we use on the computer is no longer stable, and that small changes in initial data on

the computer, or small roundoﬀ errors, will produce not small but very large changes in the

computed solution.

An important job of the numerical analyst is to make sure that this does not happen,

and we will ﬁnd that this theme of stability recurs throughout our study of computer

approximations.

As an example of instability in diﬀerential equations, suppose that some model of a

system led us to the equation

$$y'' - y' - 2y = 0 \tag{1.5.1}$$

together with the initial data

$$y(0) = 1; \qquad y'(0) = -1. \tag{1.5.2}$$

We are thinking of the independent variable t as the time, so we will be interested in

the solution as t becomes large and positive.

The general solution of (1.5.1) is $y(t) = c_1 e^{-t} + c_2 e^{2t}$. The initial conditions tell us that $c_1 = 1$ and $c_2 = 0$, hence the solution of our problem is $y(t) = e^{-t}$, and it represents a function that decays rapidly to zero with increasing t. In fact, when t = 10, the solution has the value 0.000045.

Now let’s change the initial data (1.5.2) just a bit, by asking for a solution with $y'(0) = -0.999$. It’s easy to check that the solution is now

$$y(t) = (0.999666\ldots)e^{-t} + (0.000333\ldots)e^{2t} \tag{1.5.3}$$

instead of just $y(t) = e^{-t}$. If we want the value of the solution at t = 10, we would find that it has changed from 0.000045 to about 161,722.

At t = 20 the change is even more impressive, from 0.000000002 to about $7.8 \times 10^{13}$, just from changing the initial value of $y'$ from −1 to −0.999. Let’s hope that there are no phenomena in nature that behave in this way, or our lives hang by a slender thread indeed!
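The coefficients in (1.5.3) can be checked by hand. As a short worked verification, using only the general solution $y(t) = c_1 e^{-t} + c_2 e^{2t}$ of (1.5.1) and the perturbed initial data, the two conditions give a pair of linear equations for $c_1$ and $c_2$:

```latex
\begin{aligned}
y(0)  &= c_1 + c_2 = 1, \\
y'(0) &= -c_1 + 2c_2 = -0.999 .
\end{aligned}
\qquad\Longrightarrow\qquad
3c_2 = 0.001, \quad
c_2 = \tfrac{1}{3000} = 0.000333\ldots, \quad
c_1 = 1 - c_2 = 0.999666\ldots
```

Adding the two equations eliminates $c_1$, which is why the coefficient of the growing term is exactly one third of the perturbation 0.001.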

Now exactly what is the reason for the observed instability of the equation (1.5.1)? The general solution of the equation contains a falling exponential term $c_1 e^{-t}$, and a rising exponential term $c_2 e^{2t}$. By prescribing the initial data (1.5.2) we suppressed the growing term, and picked out only the decreasing one. A small change in the initial data, however, results in the presence of both terms in the solution.

Now it’s time for a formal

Deﬁnition: A diﬀerential equation is said to be stable if for every set of initial data (at

t = 0) the solution of the diﬀerential equation remains bounded as t approaches inﬁnity.

A diﬀerential equation is called strongly stable if, for every set of initial data (at t = 0)

the solution not only remains bounded, but approaches zero as t approaches inﬁnity.

What makes the equation (1.5.1) unstable, then, is the presence of a rising exponential in its general solution. In other words, if we have a differential equation whose general solution contains a term $e^{\alpha t}$ in which α is positive, that equation is unstable.

Let’s restrict attention now to linear differential equations with constant coefficients. We know from section 1.2 that the general solution of such an equation is a sum of terms of the form

$$(\text{polynomial in } t)\, e^{\alpha t}. \tag{1.5.4}$$

Under what circumstances does such a term remain bounded as t becomes large and positive?

Certainly if α is negative then the term stays bounded. Likewise, if α is a complex

number and its real part is negative, then the term remains bounded. If α has positive real

part the term is unbounded.

This takes care of all of the possibilities except the case where α is zero, or more generally, the complex number α has zero real part (is purely imaginary). In that case the question of whether $(\text{polynomial in } t)\, e^{\alpha t}$ remains bounded depends on whether the “polynomial in t” is of degree zero (a constant polynomial) or of higher degree. If the polynomial is constant then the term does indeed remain bounded for large positive t, whereas otherwise the term will grow as t gets large, for some values of the initial conditions, thereby violating the definition of stability.

Now recall that the “polynomial in t” is in fact a constant if the root α is a simple root of the characteristic equation of the differential equation, and otherwise it is of higher degree. This observation completes the proof of the following:

Theorem 1.5.1 A linear differential equation with constant coefficients is stable if and only if all of the roots of its characteristic equation lie in the closed left half plane (real part at most zero), and those that lie on the imaginary axis, if any, are simple. Such an equation is strongly stable if and only if all of the roots of its characteristic equation lie in the open left half plane, that is, none lie on the imaginary axis.
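The test in Theorem 1.5.1 is mechanical once the characteristic roots and their multiplicities are known. The following is a minimal Python sketch of that test, not part of the original text: the function name `classify_ode`, the representation of the roots as (root, multiplicity) pairs, and the tolerance `tol` used to decide that a real part is zero are all assumptions made for illustration.

```python
def classify_ode(roots, tol=1e-12):
    """Apply Theorem 1.5.1 to a list of (root, multiplicity) pairs.

    Each root may be a real or complex number. The tolerance tol is a
    hypothetical choice for deciding that a real part equals zero.
    Returns 'strongly stable', 'stable', or 'unstable'.
    """
    root_on_axis = False
    for root, multiplicity in roots:
        real_part = complex(root).real
        if real_part > tol:
            return "unstable"            # rising exponential present
        if abs(real_part) <= tol:
            if multiplicity > 1:
                return "unstable"        # repeated root on the imaginary axis
            root_on_axis = True          # bounded, but does not decay
    return "stable" if root_on_axis else "strongly stable"


print(classify_ode([(-1, 1), (2, 1)]))      # roots of equation (1.5.1)
print(classify_ode([(1j, 1), (-1j, 1)]))    # roots of y'' + y = 0
print(classify_ode([(-1, 2)]))              # double root at -1
```

The three calls cover the three possible verdicts. Finding the roots and their multiplicities is left to the caller, since deciding numerically whether two computed roots coincide is itself a delicate matter.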

Exercises 1.5

1. Determine for each of the following differential equations whether it is strongly stable, stable, or unstable.

(a) $y'' - 5y' + 6y = 0$

(b) $y'' + 5y' + 6y = 0$

(c) $y'' + 3y = 0$

(d) $(D + 3)^3 (D + 1) y = 0$

(e) $(D + 1)^2 (D^2 + 1)^2 y = 0$

(f) $(D^4 + 1) y = 0$

2. Make a list of some natural phenomena that you think are unstable. Discuss.

3. The differential equation $y'' - y = 0$ is to be solved with the initial conditions $y(0) = 1$, $y'(0) = -1$, and then solved again with $y(0) = 1$, $y'(0) = -0.99$. Compare the two solutions when x = 20.

4. For exactly which real values of the parameter λ is each of the following differential equations stable? . . . strongly stable?

(a) $y'' + (2 + \lambda) y' + y = 0$

(b) $y'' + \lambda y' + y = 0$

(c) $y' + \lambda y = 1$

1.6 Stability theory of diﬀerence equations

In the previous section we discussed the stability of diﬀerential equations. The key ideas

were that such an equation is stable if every one of its solutions remains bounded as t

approaches inﬁnity, and strongly stable if the solutions actually approach zero.

Similar considerations apply to diﬀerence equations, and for similar reasons. As an

example, take the equation

$$y_{n+1} = \tfrac{5}{2} y_n - y_{n-1} \qquad (n \ge 1) \tag{1.6.1}$$

along with the initial conditions

$$y_0 = 1; \qquad y_1 = 0.5. \tag{1.6.2}$$

It’s easy to see that the solution is $y_n = 2^{-n}$, and of course, this is a function that rapidly approaches zero with increasing n.

Now let’s change the initial data (1.6.2), say to

$$y_0 = 1; \qquad y_1 = 0.50000001 \tag{1.6.3}$$

instead of (1.6.2).

The solution of the difference equation with these new data is

$$y_n = (0.0000000066\ldots)\, 2^{n} + (0.9999999933\ldots)\, 2^{-n}. \tag{1.6.4}$$

The point is that the coefficient of the growing term $2^n$ is small, but $2^n$ grows so fast that after a while the first term in (1.6.4) will be dominant. For example, when n = 30, the solution is $y_{30} = 7.16$, compared to the value $y_{30} = 0.0000000009$ of the solution with the original initial data (1.6.2). A change of one part in fifty million in the initial condition produced, thirty steps later, an answer nearly eight billion times as large.
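This experiment is easy to repeat on a computer. The sketch below, written in Python rather than the Maple used elsewhere in the text, simply runs the recurrence (1.6.1) forward from the two sets of starting values (1.6.2) and (1.6.3); the function name `iterate` is an assumption made for illustration.

```python
def iterate(y0, y1, steps):
    """Run y[n+1] = (5/2)*y[n] - y[n-1] and return [y_0, y_1, ..., y_steps]."""
    ys = [y0, y1]
    for _ in range(steps - 1):
        ys.append(2.5 * ys[-1] - ys[-2])
    return ys


original = iterate(1.0, 0.5, 30)            # initial data (1.6.2)
perturbed = iterate(1.0, 0.50000001, 30)    # initial data (1.6.3)

for n in (0, 10, 20, 30):
    print(f"n={n:2d}  original={original[n]:.6e}  perturbed={perturbed[n]:.6e}")
```

With the original data every value is an exact power of 2 in binary floating point, so the decaying solution survives. With almost any other starting value the growing component takes over, exactly as (1.6.4) predicts.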

The fault lies with the difference equation, because it has both rising and falling components to its general solution. It should be clear that it is hopeless to do extended computation with an unstable difference equation, since a small roundoff error may alter the solution beyond recognition several steps later.

As in the case of diﬀerential equations, we’ll say that a diﬀerence equation is stable

if every solution remains bounded as n grows large, and that it is strongly stable if every

solution approaches zero as n grows large. Again, we emphasize that every solution must

be well behaved, not just the solution that is picked out by a certain set of initial data. In

other words, the stability, or lack of it, is a property of the equation and not of the starting

values.

Now consider the case where the difference equation is linear with constant coefficients. Then we know that the general solution is a sum of terms of the form

$$(\text{polynomial in } n)\, \alpha^{n}. \tag{1.6.5}$$

Under what circumstances will such a term remain bounded or approach zero?

Suppose $|\alpha| < 1$. Then the powers of α approach zero, and multiplication by a polynomial in n does not alter that conclusion. Suppose $|\alpha| > 1$. Then the sequence of powers grows unboundedly, and multiplication by a nonzero polynomial only speeds the parting guest.

Finally suppose the complex number α has absolute value 1. Then the sequence of its

powers remains bounded (in fact they all have absolute value 1), but if we multiply by a

nonconstant polynomial, the resulting expression would grow without bound.

To summarize then, the term (1.6.5), if the polynomial is not identically zero, approaches zero with increasing n if and only if $|\alpha| < 1$. It remains bounded as n increases if and only if either (a) $|\alpha| < 1$ or (b) $|\alpha| = 1$ and the polynomial is of degree zero (a constant). Now we have proved:

Theorem 1.6.1 A linear diﬀerence equation with constant coeﬃcients is stable if and only

if all of the roots of its characteristic equation have absolute value at most 1, and those of

absolute value 1 are simple. The equation is strongly stable if and only if all of the roots

have absolute value strictly less than 1.
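As a quick illustration of the theorem, here is how it applies to the example (1.6.1) that opened this section. Moving all terms to one side and substituting the trial solution $y_n = \alpha^n$ gives the characteristic equation:

```latex
y_{n+1} - \tfrac{5}{2}\, y_n + y_{n-1} = 0
\quad\Longrightarrow\quad
\alpha^2 - \tfrac{5}{2}\,\alpha + 1 = (\alpha - 2)\left(\alpha - \tfrac{1}{2}\right) = 0,
\qquad \alpha = 2 \ \text{ or } \ \alpha = \tfrac{1}{2}.
```

Because the root α = 2 has absolute value greater than 1, the equation is unstable, which agrees with the behavior seen in (1.6.4).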

Exercises 1.6

1. Determine, for each of the following difference equations, whether it is strongly stable, stable, or unstable.

(a) $y_{n+2} - 5y_{n+1} + 6y_n = 0$

(b) $8y_{n+2} + 2y_{n+1} - 3y_n = 0$

(c) $3y_{n+2} + y_n = 0$

(d) $3y_{n+3} + 9y_{n+2} - y_{n+1} - 3y_n = 0$

(e) $4y_{n+4} + 5y_{n+2} + y_n = 0$

2. The difference equation $2y_{n+2} + 3y_{n+1} - 2y_n = 0$ is to be solved with the initial conditions $y_0 = 2$, $y_1 = 1$, and then solved again with $y_0 = 2$, $y_1 = 0.99$. Compare $y_{20}$ for the two solutions.

3. For exactly which real values of the parameter λ is each of the following difference equations stable? . . . strongly stable?

(a) $y_{n+2} + \lambda y_{n+1} + y_n = 0$

(b) $y_{n+1} + \lambda y_n = 1$

(c) $y_{n+2} + y_{n+1} + \lambda y_n = 0$

4. (a) Consider the (constant-coefficient) difference equation

$$a_0 y_{n+p} + a_1 y_{n+p-1} + a_2 y_{n+p-2} + \cdots + a_p y_n = 0. \tag{1.6.6}$$

Show that this difference equation cannot be stable if $|a_p / a_0| > 1$.

(b) Give an example to show that the converse of the statement in part (a) is false. Namely, exhibit a difference equation for which $|a_p / a_0| < 1$ but the equation is unstable anyway.

Chapter 2

The Numerical Solution of

Diﬀerential Equations

2.1 Euler’s method

Our study of numerical methods will begin with a very simple procedure, due to Euler.

We will state it as a method for solving a single diﬀerential equation of ﬁrst order. One of

the nice features of the subject of numerical integration of diﬀerential equations is that the

techniques that are developed for just one ﬁrst order diﬀerential equation will apply, with

very little change, both to systems of simultaneous ﬁrst order equations and to equations of

higher order. Hence the consideration of a single equation of ﬁrst order, seemingly a very

special case, turns out to be quite general.

By an initial-value problem we mean a differential equation together with enough given values of the unknown function and its derivatives at an initial point $x_0$ to determine the solution uniquely.

Let’s suppose that we are given an initial-value problem of the form

$$y' = f(x, y); \qquad y(x_0) = y_0. \tag{2.1.1}$$

Our job is to find numerical approximate values of the unknown function y at points x to the right of (larger than) $x_0$.

What we actually will find will be approximate values of the unknown function at a discrete set of points $x_0$, $x_1 = x_0 + h$, $x_2 = x_0 + 2h$, $x_3 = x_0 + 3h$, etc. At each of these points $x_n$ we will compute $y_n$, our approximation to $y(x_n)$.

Hence, suppose that the spacing h between consecutive points has been chosen. We propose to start at the point $x_0$ where the initial data are given, and move to the right, obtaining $y_1$ from $y_0$, then $y_2$ from $y_1$ and so forth until sufficiently many values have been found.

Next we need to derive a method by which each value of y can be obtained from its immediate predecessor. Consider the Taylor series expansion of the unknown function y(x) about the point $x_n$:

$$y(x_n + h) = y(x_n) + h\, y'(x_n) + h^2\, \frac{y''(X)}{2}, \tag{2.1.2}$$

where we have halted the expansion after the first power of h and in the remainder term, the point X lies between $x_n$ and $x_n + h$.

Now equation (2.1.2) is exact, but of course it cannot be used for computation because

the point X is unknown. On the other hand, if we simply “forget” the error term, we’ll

have only an approximate relation instead of an exact one, with the consolation that we

will be able to compute from it. The approximate relation is

$$y(x_n + h) \approx y(x_n) + h\, y'(x_n). \tag{2.1.3}$$

Next define $y_{n+1}$ to be the approximate value of $y(x_{n+1})$ that we obtain by using the right side of (2.1.3) instead of (2.1.2). Then we get

$$y_{n+1} = y_n + h\, y'_n. \tag{2.1.4}$$

Now we have a computable formula for the approximate values of the unknown function, because the quantity $y'_n$ can be found from the differential equation (2.1.1) by writing

$$y'_n = f(x_n, y_n), \tag{2.1.5}$$

and if we do so then (2.1.4) takes the form

$$y_{n+1} = y_n + h f(x_n, y_n). \tag{2.1.6}$$

This is Euler’s method, in a very explicit form, so that the computational procedure is clear. Equation (2.1.6) is in fact a recurrence relation, or difference equation, whereby each value of $y_n$ is computed from its immediate predecessor.

Let’s use Euler’s method to obtain a numerical solution of the differential equation

$$y' = 0.5y \tag{2.1.7}$$

together with the starting value y(0) = 1. The exact solution of this initial-value problem is obviously $y(x) = e^{x/2}$.

Concerning the approximate solution by Euler’s method, we have, by comparing (2.1.7) with (2.1.1), f(x, y) = 0.5y, so

$$y_{n+1} = y_n + h\, \frac{y_n}{2} = \left(1 + \frac{h}{2}\right) y_n. \tag{2.1.8}$$

Therefore, in this example, each $y_n$ will be obtained from its predecessor by multiplication by $1 + \frac{h}{2}$. To be quite specific, let’s take h to be 0.05. Then we show below, for each value of x = 0, 0.05, 0.10, 0.15, 0.20, . . . the approximate value of y computed from (2.1.8) and the exact value $y(x_n) = e^{x_n/2}$:

    x        Euler(x)     Exact(x)
    0.00      1.00000      1.00000
    0.05      1.02500      1.02532
    0.10      1.05063      1.05127
    0.15      1.07689      1.07788
    0.20      1.10381      1.10517
    0.25      1.13141      1.13315
    ...       ...          ...
    1.00      1.63862      1.64872
    2.00      2.68506      2.71828
    3.00      4.39979      4.48169
    ...       ...          ...
    5.00     11.81372     12.18249
   10.00    139.56389    148.41316

Table 1
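The table can be regenerated with a few lines of code. The following is a minimal Python sketch of the recurrence (2.1.6), applied to the example (2.1.7) with h = 0.05; the function name `euler` and its argument list are assumptions made for illustration, and Python is used here in place of the Maple of the later sections.

```python
import math


def euler(f, x0, y0, h, steps):
    """Return [(x_0, y_0), ..., (x_steps, y_steps)] from y_{n+1} = y_n + h*f(x_n, y_n)."""
    points = [(x0, y0)]
    y = y0
    for n in range(steps):
        x = x0 + n * h                  # x_n, computed from n rather than by repeated addition
        y = y + h * f(x, y)             # one Euler step
        points.append((x0 + (n + 1) * h, y))
    return points


table = euler(lambda x, y: 0.5 * y, 0.0, 1.0, 0.05, 200)

# Print the rows shown in the table: x, Euler(x), Exact(x).
for n in (0, 1, 2, 3, 4, 5, 20, 40, 60, 100, 200):
    x, y = table[n]
    print(f"{x:6.2f}  {y:10.5f}  {math.exp(x / 2):10.5f}")
```

The right-hand side f is passed in as a function, so the same routine serves for any single first-order equation of the form (2.1.1).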

Considering the extreme simplicity of the approximation, it seems that we have done

pretty well by this equation. Let’s continue with this simple example by asking for a formula

for the numbers that are called Euler(x) in the above table. In other words, exactly what

function of x is Euler(x)?

To answer this, we note first that each computed value $y_{n+1}$ is obtained according to (2.1.8) by multiplying its predecessor $y_n$ by $1 + \frac{h}{2}$. Since $y_0 = 1$, it is clear that we will compute $y_n = (1 + \frac{h}{2})^n$. Now we want to express this in terms of x rather than n. Since $x_n = nh$, we have n = x/h, and since h = 0.05 we have n = 20x. Hence the computed approximation to y at a particular point x will be $1.025^{20x}$, or equivalently

$$\mathrm{Euler}(x) = (1.638616\ldots)^{x}. \tag{2.1.9}$$

The approximate values can now easily be compared with the true solution, since

$$\mathrm{Exact}(x) = e^{x/2} = \left(e^{1/2}\right)^{x} = (1.648721\ldots)^{x}. \tag{2.1.10}$$

Therefore both the exact solution of this differential equation and its computed solution have the form $(\text{const.})^x$. The correct value of “const.” is $e^{1/2}$, and the value that is, in effect, used by Euler’s method is $(1 + \frac{h}{2})^{1/h}$. For a fixed value of x, we see that if we use Euler’s method with smaller and smaller values of h (neglecting the increase in roundoff error that is sure to result), the values Euler(x) will converge to Exact(x), because

$$\lim_{h \to 0} \left(1 + \frac{h}{2}\right)^{1/h} = e^{1/2}. \tag{2.1.11}$$
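A numerical experiment makes the convergence visible without proving it. The sketch below, in Python, evaluates the constant used by Euler's method for a sequence of step sizes and prints how far it is from $e^{1/2}$; the function name `euler_base` and the particular step sizes are assumptions chosen for illustration.

```python
import math


def euler_base(h):
    """The constant (1 + h/2)**(1/h) that Euler's method uses in place of e**(1/2)."""
    return (1 + h / 2) ** (1 / h)


target = math.exp(0.5)
h = 0.1
previous_error = None
for _ in range(6):
    error = target - euler_base(h)
    note = "" if previous_error is None else f"  ratio to previous = {previous_error / error:.3f}"
    print(f"h = {h:.6f}  base = {euler_base(h):.6f}  error = {error:.6f}{note}")
    previous_error = error
    h /= 2
```

Each halving of h roughly halves the error, which is the behavior one expects from a method whose error is proportional to the first power of the step size.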

Exercises 2.1

1. Verify the limit (2.1.11).

2. Use a calculator or a computer to integrate each of the following diﬀerential equations

forward ten steps, using a spacing h = 0.05 with Euler’s method. Also tabulate the

exact solution at each value of x that occurs.

(a) $y'(x) = x\,y(x)$; $y(0) = 1$

(b) $y'(x) = x\,y(x) + 2$; $y(0) = 1$

(c) $y'(x) = \dfrac{y(x)}{1 + x}$; $y(0) = 1$

(d) $y'(x) = -2x\,y(x)^2$; $y(0) = 10$

2.2 Software notes

One of the main themes of our study will be the preparation of programs that not only

work, but also are easily readable and useable by other people. The act of communication

that must take place before a program can be used by persons other than its author is a

diﬃcult one to carry out, and we will return several times to the principles that have evolved

as guides to the preparation of readable software. Here are some of these guidelines.

1. Documentation

The documentation of a program is the set of written instructions to a user that inform

the user about the purpose and operation of the program. At the moment that the job

of writing and testing a program has been completed it is only natural to feel an urge to

get the whole thing over with and get on to the next job. Besides, one might think, it’s

perfectly obvious how to use this program. Some programs may be obscure, but not this

one.

It is amazing how rapidly our knowledge of our very own program fades. If we come

back to a program after a lapse of a few months’ time, it often happens that we will have no

idea what the program did or how to use it, at least not without making a large investment

of time.

For that reason it is important that when our program has been written and tested it

should be documented immediately, while our memory of it is still green. Furthermore, the

best place for documentation is in the program itself, in “comment” statements. That way

one can be sure that when the comments are needed they will be available.

The ﬁrst mission of program documentation is to describe the purpose of the program.

State clearly the problem that the program solves, or the exact operation that it performs

on its input in order to get its output.

Already in this first mission, a good bit of technical skill can be brought to bear that will be very helpful to the user, by intertwining the description of the program purpose with the names of the communicating variables in the program.

Let’s see what that means by considering an example. Suppose we have written a

subroutine that searches through a speciﬁed row of a matrix to ﬁnd the element of largest

absolute value, and outputs a column in which it was found. Such a routine, in Maple for

instance, might look like this:

search:=proc(A,i)
  local j, winner, jwin;
  winner:=-1;
  for j from 1 to coldim(A) do
    if (abs(A[i,j])>winner) then
      winner:=abs(A[i,j]) ; jwin:=j
    fi
  od;
  return(jwin);
end;

Now let’s try our hand at documenting this program:

“The purpose of this program is to search a given row of a matrix to ﬁnd an

element of largest absolute value and return the column in which it was found.”

That is pretty good documentation, perhaps better than many programs get. But we

can make it a lot more useful by doing the intertwining that we referred to above. There

we said that the description should be related to the communicating variables. Those

variables are the ones that the user can see. They are the input and output variables of

the subroutine. In most important computer languages, the communicating variables are

announced in the ﬁrst line of the coding of a procedure or subroutine. Maple follows this

convention at least for the input variables, although the output variable is usually speciﬁed

in the “return” statement.

In the ﬁrst line of the little subroutine above we see the list (A, i) of its input variables

(or “arguments”). These are the ones that the user has to understand, as opposed to the

other “local” variables that live inside the subroutine but don’t communicate with the

outside world (like j, winner, jwin, which are listed on the second line of the program).

The best way to help the user to understand these variables is to relate them directly

to the description of the purpose of the program.

“The purpose of this program is to search row i of a given matrix A to find an entry of largest absolute value, and return the column jwin where that entry lives.”
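To show what this looks like when the description lives inside the program itself, here is a hedged Python rendering of the same routine, with the documentation placed in a comment block that names the communicating variables. It is a sketch, not the Maple routine: the matrix is assumed to be a list of rows, and indices start at 0 rather than 1.

```python
def search(A, i):
    """Search row i of the matrix A for an entry of largest absolute value
    and return the column index jwin where that entry lives.

    Assumptions of this sketch: A is a list of rows (each a list of numbers)
    and both i and the returned jwin are 0-based.
    """
    winner = -1          # largest absolute value seen so far
    jwin = None          # column where it was seen
    for j in range(len(A[i])):
        if abs(A[i][j]) > winner:
            winner = abs(A[i][j])
            jwin = j
    return jwin


A = [[3, -7, 2],
     [1, 0, -1]]
print(search(A, 0))
print(search(A, 1))
```

The names A, i and jwin appear both in the description and in the code, so a reader can match each phrase of the documentation to the variable that carries it. When several entries tie for largest absolute value, the first one found wins, just as in the Maple version.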

We’ll come back to the subject of documentation, but now let’s mention another ingredient of ease of use of programs, and that is:

2. Modularity

It is important to divide a long program into a number of smaller modules, each with a

clearly stated set of inputs and outputs, and each with its own documentation. That means

that we should get into the habit of writing lots of subroutines or procedures, because

the subroutine or procedure mode of expression forces one to be quite explicit about the

relationship of the block of coding to the rest of the world.

When we are writing a large program we would all write a subroutine if we found that

a certain sequence of steps was being called for repeatedly. Beyond this, however, there are

numerous inducements for breaking oﬀ subroutines even if the block of coding occurs just

once in the main program.

For one thing it’s easier to check out the program. The testing procedure would consist

of ﬁrst testing each of the subroutines separately on test problems designed just for them.

Once the subroutines work, it would remain only to test their relationships to the calling

program.

For another reason, we might discover a better, faster, more elegant, or what-have-you

method of performing the task that one of these subroutines does. Then we would be able

to yank out the former subroutine and plug in the new one, while being careful only to make

sure that the new subroutine relates to the same inputs and outputs as the old one. If jobs

within a large program are not broken into subroutines it can be much harder to isolate the

block of coding that deals with a particular function and remove it without aﬀecting the

whole works.

For another reason, if one be needed, it may well happen that even though the job that

is done by the subroutine occurs only once in the current program, it may recur in other

programs as yet undreamed of. If one is in the habit of writing small independent modules

and stringing them together to make large programs, then it doesn’t take long before one

has a library of useful subroutines, each one tested, working and documented, that will

greatly simplify the task of writing future programs.

Finally, the practice of subdividing the large jobs into the smaller jobs of which they are

composed is an extremely valuable analytical skill, one that is useful not only in programming, but in all sorts of organizational activities where smaller efforts are to be pooled in

order to produce a larger eﬀect. It is therefore a quality of mind that provides much of its

own justiﬁcation.

In this book, the major programs that are the objects of study have been broken up into

subroutines in the expectation that the reader will be able to start writing and checking out

these modules even before the main ideas of the current subject have been fully explained.

This was done in part because some of these programs are quite complex, and it would be

unreasonable to expect the whole program to be written in a short time. It was also done

to give examples of the process of subdivision that we have been talking about.

For instance, the general linear algebra program for solving systems of linear simultaneous equations in Chapter 3 has been divided into six modules, and they are described in

section 3.3. The reader might wish to look ahead at those routines and to verify that even

though their relationship to the whole job of solving equations is by no means clear now,

nonetheless, because of the fact that they are independent and self-contained, they can be

programmed and checked out at any time without waiting for the full explanation.

One more ingredient that is needed for the production of useful software is:

3. Style

We don’t mean style in the sense of “class,” although this is as welcome in programming

as it is elsewhere. There have evolved a number of elements of good programming style,

and these will mainly be discussed as they arise. But two of them (one trivial and one quite

deep) are:

(a) Indentation: The instructions that lie within the range of a loop are indented in the

program listing further to the right than the instructions that announce that the loop

is about to begin, or that it has just terminated.

(b) Top-down structuring: When we visualize the overall logical structure of a complicated program we see a grand loop, within which there are several other loops and branchings, within which . . . etc. According to the principles of top-down design the looping and branching structure of the program should be visible at once in the listing. That is, we should see an announcement of the opening of the grand loop, then

indented under that perhaps a two-way branch (if-then-else), where, under the “then”

one sees all that will happen if the condition is met, and under the “else” one sees

what happens if it is not met.

When we say that we see all that will happen, we mean that there are not any “go-to”

instructions that would take our eye out of the ﬂow of the if-then-else loop to some

other page. It all happens right there on the same page, under “then” and under

“else”.

These few words can scarcely convey the ideas of structuring, which we leave to the

numerous examples in the sequel.

2.3 Systems and equations of higher order

We have already remarked that the methods of numerical integration for a single ﬁrst-

order diﬀerential equation carry over with very little change to systems of simultaneous

diﬀerential equations of ﬁrst order. In this section we’ll discuss exactly how this is done,

and furthermore, how the same idea can be applied to equations of higher order than the

ﬁrst. Euler’s method will be used as the example, but the same transformations will apply

to all of the methods that we will study.

In Euler’s method for one equation, the approximate value of the unknown function at the next point $x_{n+1} = x_n + h$ is calculated from

$$y_{n+1} = y_n + h f(x_n, y_n). \tag{2.3.1}$$

Now suppose that we are trying to solve not just a single equation, but a system of N simultaneous equations, say

$$y_i'(x) = f_i(x, y_1, y_2, \ldots, y_N) \qquad i = 1, \ldots, N. \tag{2.3.2}$$

Equation (2.3.2) represents N equations, in each of which just one derivative appears, and whose right-hand side may depend on x, and on all of the unknown functions, but not on their derivatives. The “$f_i$” indicates that, of course, each equation can have a different right-hand side.

Now introduce the vector $\mathbf{y}(x)$ of unknown functions

$$\mathbf{y}(x) = [y_1(x), y_2(x), y_3(x), \ldots, y_N(x)] \tag{2.3.3}$$

and the vector $\mathbf{f} = \mathbf{f}(x, \mathbf{y}(x))$ of right-hand sides

$$\mathbf{f} = [f_1(x, \mathbf{y}), f_2(x, \mathbf{y}), \ldots, f_N(x, \mathbf{y})]. \tag{2.3.4}$$

In terms of these vectors, equation (2.3.2) can be rewritten as

$$\mathbf{y}'(x) = \mathbf{f}(x, \mathbf{y}(x)). \tag{2.3.5}$$

We observe that equation (2.3.5) looks just like our standard form (2.1.1) for a single

equation in a single unknown function, except for the bold face type, i.e., except for the

fact that y and f now represent vector quantities.

To apply a numerical method such as that of Euler, then, all we need to do is to take

the statement of the method for a single diﬀerential equation in a single unknown function,

and replace y(x) and f(x, y(x)) by vector quantities as above. We will then have obtained

the generalization of the numerical method to systems.

To be specific, Euler’s method for a single equation is

$$y_{n+1} = y_n + h f(x_n, y_n) \tag{2.3.6}$$

so Euler’s method for a system of differential equations will be

$$\mathbf{y}_{n+1} = \mathbf{y}_n + h\, \mathbf{f}(x_n, \mathbf{y}_n). \tag{2.3.7}$$

This means that if we know the entire vector $\mathbf{y}$ of unknown functions at the point $x = x_n$, then we can find the entire vector of unknown functions at the next point $x_{n+1} = x_n + h$ by means of (2.3.7).

In detail, if $y_i(x_n)$ denotes the computed approximate value of the unknown function $y_i$ at the point $x_n$, then what we must calculate are

$$y_i(x_{n+1}) = y_i(x_n) + h f_i(x_n, y_1(x_n), y_2(x_n), \ldots, y_N(x_n)) \tag{2.3.8}$$

for each i = 1, 2, . . . , N.

As an example, take the pair of differential equations

$$\begin{cases} y_1' = x + y_1 + y_2 \\ y_2' = y_1 y_2 + 1 \end{cases} \tag{2.3.9}$$

together with the initial values $y_1(0) = 0$, $y_2(0) = 1$.

Now the vector of unknown functions is $\mathbf{y} = [y_1, y_2]$, and the vector of right-hand sides is $\mathbf{f} = [x + y_1 + y_2,\; y_1 y_2 + 1]$. Initially, the vector of unknowns is $\mathbf{y} = [0, 1]$. Let’s choose a step size h = 0.05. Then we calculate

$$\begin{bmatrix} y_1(0.05) \\ y_2(0.05) \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \end{bmatrix} + 0.05 \begin{bmatrix} 0 + 0 + 1 \\ 0 \cdot 1 + 1 \end{bmatrix} = \begin{bmatrix} 0.05 \\ 1.05 \end{bmatrix} \tag{2.3.10}$$

and

$$\begin{bmatrix} y_1(0.10) \\ y_2(0.10) \end{bmatrix} = \begin{bmatrix} 0.05 \\ 1.05 \end{bmatrix} + 0.05 \begin{bmatrix} 0.05 + 0.05 + 1.05 \\ 0.05 \cdot 1.05 + 1 \end{bmatrix} = \begin{bmatrix} 0.1075 \\ 1.102625 \end{bmatrix} \tag{2.3.11}$$

and so forth. At each step we compute the vector of approximate values of the two unknown

functions from the corresponding vector at the immediately preceding step.
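The two hand computations (2.3.10) and (2.3.11) can be checked with a short program. The following Python sketch applies the vector form (2.3.7) to the system (2.3.9); the names `f` and `euler_step`, and the choice to represent the vector of unknowns as a Python list, are assumptions made for illustration.

```python
def f(x, y):
    """Right-hand sides of system (2.3.9); y is the list [y1, y2]."""
    return [x + y[0] + y[1], y[0] * y[1] + 1]


def euler_step(f, xin, yin, h):
    """Advance the whole vector one step: yout = yin + h*f(xin, yin)."""
    rhs = f(xin, yin)                       # evaluated once, from the old values only
    return [yi + h * fi for yi, fi in zip(yin, rhs)]


h = 0.05
y_at_005 = euler_step(f, 0.0, [0.0, 1.0], h)    # compare with (2.3.10)
y_at_010 = euler_step(f, 0.05, y_at_005, h)     # compare with (2.3.11)
print(y_at_005)
print(y_at_010)
```

Up to floating point roundoff in the last digits, the printed vectors agree with [0.05, 1.05] and [0.1075, 1.102625].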

Let’s consider the preparation of a computer program that will carry out the solution, by Euler’s method, of a system of N simultaneous equations of the form (2.3.2), which we will rewrite just slightly, in the form

$$y_i' = f_i(x, \mathbf{y}) \qquad i = 1, \ldots, N. \tag{2.3.12}$$

Note that on the left is the derivative of just one of the unknown functions, and on the right there may appear all N of them in each equation.

Evidently we will need an array Y of length N to hold the values of the N unknown

functions at the current point x. Suppose we have computed the array Y at a point x, and

we want to get the new array Y at the point x +h. Exactly what do we do?

Some care is necessary in answering this question because there is a bit of a snare in the underbrush. The new values of the N unknown functions are calculated from (2.3.8) or (2.3.12) in a certain order. For instance, we might calculate $y_1(x + h)$, then $y_2(x + h)$, then $y_3(x + h)$, . . . , then $y_N(x + h)$.

The question is this: when we compute $y_1(x + h)$, where shall we put it? If we put it into Y[1], the first position of the Y array in storage, then the previous contents of Y[1] are lost, i.e., the value of $y_1(x)$ is lost. But we aren’t finished with $y_1(x)$ yet; it’s still needed to compute $y_2(x + h)$, $y_3(x + h)$, etc. This is because the new value $y_i(x + h)$ depends (or might depend), according to (2.3.8), on the old values of all of the unknown functions, including those whose new values have already been computed before we begin the computation of $y_i(x + h)$.

If the point still is murky, go back to (2.3.11) and notice how, in the calculation of

y

2

(0.10) we needed to know y

1

(0.05) even though y

1

(0.10) had already been computed.

Hence if we had put y

1

(0.10) into an array to replace the old value y

1

(0.05) we would not

have been able to obtain y

2

(0.10).

The conclusion is that we need at least two arrays, say YIN and YOUT, each of length N.

The array YIN holds the unknown functions evaluated at x, and YOUT will hold their values

at x+h. Initially YIN holds the given data at $x_0$. Then we compute all of the unknowns at $x_0 + h$, and store them in YOUT as we find them. When all have been done, we print them if desired, move all entries of YOUT back to YIN, increase x by h and repeat.
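The snare described above is easy to see in a small experiment. The following is a minimal Python sketch (not the Maple used in this book); the test system u' = v, v' = -u, the starting values, and all function names are assumptions chosen only for illustration. It takes one Euler step twice: once with separate input and output arrays, and once overwriting a single array as it goes.

```python
def f(x, y, i):
    """Right-hand side i (0-based) of the hypothetical system u' = v, v' = -u."""
    u, v = y
    return v if i == 0 else -u


def euler_step_two_arrays(x, yin, h):
    """Correct Euler step: every component is computed from the old values in yin."""
    return [yin[i] + h * f(x, yin, i) for i in range(len(yin))]


def euler_step_in_place(x, y, h):
    """Faulty step: y[0] is overwritten before y[1] is computed from it."""
    y = list(y)  # work on a copy so the caller's list is left untouched
    for i in range(len(y)):
        y[i] = y[i] + h * f(x, y, i)
    return y


yin = [1.0, 1.0]  # assumed starting values u(0) = 1, v(0) = 1
h = 0.1
yout = euler_step_two_arrays(0.0, yin, h)
bad = euler_step_in_place(0.0, yin, h)
print(yout)  # second component uses the old u
print(bad)   # second component has used the already-updated u
```

The two results differ in the second component, because the in-place version has already replaced u(x) by u(x+h) when it evaluates the right-hand side for v.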

The principal building block in this structure would be a subroutine that would advance

the solution exactly one step. The main program would initialize the arrays, call this

subroutine, increase x, move data from the output array YOUT back to the input array YIN,

print, etc. The single-step subroutine is shown below. We will use it later on to help us

get a solution started when we use methods that need information at more than one point

before they can start the integration.

Eulerstep:=proc(xin,yin,h,n) local i,yout;

# This program numerically integrates the system

# y’=f(x,y) one step forward by Euler’s method using step size

# h. Enter with values at xin in yin. Exit with values at xin+h

# in yout. Supply f as a function subprogram.

yout:=[seq(evalf(yin[i]+h*f(xin,yin,i)),i=1..n)];

return(yout);

end;

A few remarks about the program are in order. One structured data type in Maple is a list of things, in this case, a list of floating point numbers. The seq command (for “sequence”) creates an expression sequence, in this case one of length n since i goes from 1 to n in the seq command. The brackets [ and ] convert that sequence into a list, which here plays the role of a vector. The evalf command ensures that the results of the computation of the components of yout are floating point numbers.

Our next remark about the program concerns the function subprogram f, which calcu-

lates the right hand sides of the diﬀerential equation. This subprogram, of course, must be

supplied by the user. Here is a sample of such a program, namely the one that describes

the system (2.3.9). In that case we have $f_1(x, y) = x + y_1 + y_2$ and $f_2(x, y) = y_1 y_2 + 1$. This translates into the following:

f:=proc(x,y,i);

# Calculates the right-hand sides of the system of differential

# equations.

if i=1 then return(x+y[1]+y[2]) else return(y[1]*y[2]+1) fi;

end;
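For readers who work outside Maple, the same pair of routines can be sketched in Python. This is only an illustrative transliteration: the names mirror the Maple ones, the index i is kept 1-based to match the text, and the starting values in the driver are assumptions, since the initial data of (2.3.9) are not restated here.

```python
def f(x, y, i):
    """Right-hand side number i (1-based, as in the text) of system (2.3.9)."""
    if i == 1:
        return x + y[0] + y[1]
    return y[0] * y[1] + 1.0


def eulerstep(xin, yin, h, n):
    """Advance y' = f(x, y) one Euler step from xin; return the values at xin + h."""
    return [yin[i - 1] + h * f(xin, yin, i) for i in range(1, n + 1)]


# Illustrative driver; the starting values below are assumed, not taken from the text.
x, y, h = 0.0, [0.0, 1.0], 0.05
for _ in range(2):
    y = eulerstep(x, y, h, 2)  # the returned list takes over the role of YIN
    x += h
    print(f"{x:.2f}", y)
```

Because eulerstep builds a fresh list from yin, the old values stay intact until every new component has been computed.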

Our last comment about the program to solve systems is that it is perfectly possible to

use it in such a way that we would not have to move the contents of the vector YOUT back

to the vector YIN at each step. In other words, we could save N move operations, where N

is the number of equations. Such savings might be signiﬁcant in an extended calculation.

To achieve this saving, we write two blocks of programming in the main program. One

block takes the contents of YIN as input, advances the solution one step by Euler’s method

and prints, leaving the new vector in YOUT. Then, without moving anything, another block

of programming takes the contents of YOUT as input, advances the solution one step, leaves

the new vector in YIN, and prints. The two blocks call Eulerstep alternately as the integration

proceeds to the right. The reader might enjoy writing this program, and thinking about

how to generalize the idea to the situation where the new value of y is computed from two

previously computed values, rather than from just one (then three blocks of programming

would be needed).

Now we’ve discussed the numerical solution of a single diﬀerential equation of ﬁrst order,

and of a system of simultaneous diﬀerential equations of ﬁrst order, and there remains the

treatment of equations of higher order than the ﬁrst. Fortunately, this case is very easily

reduced to the varieties that we have already studied.

For example, suppose we want to solve a single equation of the second order, say

$$y'' + x y' + (x + 1)\cos y = 2. \tag{2.3.13}$$

The strategy is to transform the single second-order equation into a pair of simultaneous

ﬁrst order equations that can then be handled as before. To do this, choose two unknown

functions u and v. The function u is to be the unknown function y in (2.3.13), and the

function v is to be the derivative of u. Then u and v satisfy two simultaneous ﬁrst-order

diﬀerential equations:

$$\begin{aligned} u' &= v \\ v' &= -xv - (x + 1)\cos u + 2 \end{aligned} \tag{2.3.14}$$

and these are exactly of the form (2.3.5) that we have already discussed!
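As a concrete illustration, the right-hand sides of (2.3.14) might be coded as below. This is a Python sketch rather than Maple; the 0-based indexing, the helper euler_step, and the initial data y(0) = 0, y'(0) = 1 are all assumptions made for the example.

```python
import math


def f(x, y, i):
    """Right-hand sides of (2.3.14); y[0] holds u and y[1] holds v (0-based i)."""
    u, v = y
    if i == 0:
        return v                                   # u' = v
    return -x * v - (x + 1.0) * math.cos(u) + 2.0  # v' = -xv - (x+1) cos u + 2


def euler_step(x, y, h):
    """One Euler step for the pair (u, v), using only the old values in y."""
    return [y[i] + h * f(x, y, i) for i in range(len(y))]


# Assumed initial data u(0) = y(0) = 0 and v(0) = y'(0) = 1.
y1 = euler_step(0.0, [0.0, 1.0], 0.05)
print(y1)
```

The first component of the result approximates y(0.05) and the second approximates y'(0.05).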

The same trick works on a general diﬀerential equation of Nth order

$$y^{(N)} + G(x, y, y', y'', \dots, y^{(N-1)}) = 0. \tag{2.3.15}$$

We introduce N unknown functions $u_0, u_1, \dots, u_{N-1}$, and let them be the solutions of the system of N simultaneous first order equations

$$\begin{aligned} u_0' &= u_1 \\ u_1' &= u_2 \\ &\;\;\vdots \\ u_{N-2}' &= u_{N-1} \\ u_{N-1}' &= -G(x, u_0, u_1, \dots, u_{N-1}). \end{aligned} \tag{2.3.16}$$

The system can now be dealt with as before.
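The reduction is mechanical enough to automate. Here is a hedged Python sketch; make_rhs is a hypothetical helper, not part of the text. It turns a user-supplied G and the order N into a right-hand-side function for the system (2.3.16), with u[k] standing for the k-th derivative of y.

```python
def make_rhs(G, N):
    """Return f(x, u, i) for system (2.3.16), where u[k] stands for y^(k)."""
    def f(x, u, i):
        if i < N - 1:
            return u[i + 1]  # u_i' = u_{i+1}
        return -G(x, *u)     # u_{N-1}' = -G(x, u_0, ..., u_{N-1})
    return f


# Example: y'' + y = 0, so N = 2 and G(x, y, y') = y.
f = make_rhs(lambda x, y, yp: y, 2)
print(f(0.0, [3.0, 4.0], 0), f(0.0, [3.0, 4.0], 1))
```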

Exercises 2.3

1. Write each of the following as a system of simultaneous ﬁrst-order initial-value prob-

lems in the standard form (2.3.2):

(a) $y'' + x^2 y = 0$; $y(0) = 1$; $y'(0) = 0$

(b) $u' + xv = 2$; $v' + e^{uv} = 0$; $u(0) = 0$; $v(0) = 0$

(c) $u' + xv' = 0$; $v' + x^2 u = 1$; $u(1) = 1$; $v(1) = 0$

(d) $y^{iv} + 3xy''' + x^2 y'' + 2y' + y = 0$; $y(0) = y'(0) = y''(0) = y'''(0) = 1$

(e) $x'''(t) + t^3 x(t) + y(t) = 0$; $y''(t) + x(t)^2 = t^3$; $x(0) = 2$; $x'(0) = 1$; $x''(0) = 0$; $y(0) = 1$; $y'(0) = 0$

2. For each of the parts of problem 1, write the function subprogram that will compute

the right-hand sides, as required by the Eulerstep subroutine.

3. For each of the parts of problem 1, assemble and run on the computer the Euler program, together with the relevant function subprogram of problem 2, to print out the solutions for fifty steps of integration, each of size h = 0.03. Begin with $x = x_0$, the point at which the initial data was given.

4. Reprogram the Eulerstep subroutine, as discussed in the text, to avoid the movement

of YOUT back to YIN.

5. Modify your program as necessary (in Maple, take advantage of the plot command)

to produce graphical output (graph all of the unknown functions on the same axes).

Test your program by running it with Euler as it solves $y'' + y = 0$ with $y(0) = 0$, $y'(0) = 1$, and $h = \pi/60$ for 150 steps.

6. Write a program that will compute successive values $y_p, y_{p+1}, \dots$ from a difference

equation of order p. Do this by storing the y’s as they are computed in a circular list,

so that it is never necessary to move back the last p computed values before ﬁnding

the next one. Write your program so that it will work with vectors, so you can solve

systems of diﬀerence equations as well as single ones.

2.4 How to document a program

One of the main themes of our study will be the preparation of programs that not only

work, but also are easily readable and useable by other people. The act of communication

that must take place before a program can be used by persons other than its author is a

diﬃcult one to carry out, and we will return several times to the principles that serve as

guides to the preparation of readable software.

In this section we discuss further the all-important question of program documentation,

already touched upon in section 2.2. Some very nontrivial skills are called for in the creation

of good user-oriented program descriptions. One of these is the ability to enter the head of

another person, the user, and to relate to the program that you have just written through

the user’s eyes.

It’s hard to enter someone else’s head. One of the skills that make one person a better

teacher than another person is of the same kind: the ability to see the subject matter that

is being taught through the eyes of another person, the student. If one can do that, or even make a good try at it, then obviously one will be much better able to deal with the questions that really concern the audience. Relatively few actually do this to any great extent, not, I think, because it’s an ability that one either has or doesn’t have, but because few efforts are made to train this skill and to develop it.

We’ll try to make our little contribution here.

(A) What does it do?

The ﬁrst task should be to describe the precise purpose of the program. Put yourself in

the place of a potential user who is looking at a particular numerical instance of the problem

that needs solving. That user is now thumbing through a book full of program descriptions

in the library of a computer center 10,000 miles away in order to ﬁnd a program that will

do the problem in question. Your description must answer that question.

Let’s now assume that your program has been written in the form of a subroutine or

procedure, rather than as a main program. Then the list of global, or communicating

variables is plainly in view, in the opening statement of the subroutine.

As we noted in section 2.2, you should state the purpose of your program using the

global variables of the subroutine in the same sentence. For one example of the diﬀerence

that makes, see section 2.2. For another, a linear equation solver might be described by

saying

“This program solves a system of simultaneous equations. To use it, put the

right-hand sides into the vector B, put the coeﬃcients into the matrix A and call

the routine. The answer will be returned in X.”

We are, however, urging the reader to do it this way:

“This program solves the equations AX=B, where A is an N-by-N matrix and B

is an N-vector.”

Observe that the second description is shorter, only about half as long, and yet more

informative. We have found out not only what the program does, but how that function

relates to the global variables of the subroutine. This was done by using a judicious sprin-

kling of symbols in the documentation, along with the words. Don’t use only symbols, or

only words, but weave them together for maximum information.

Notice also that the ambiguous term “right-hand side” that appeared in the ﬁrst form

has been done away with in the second form. The phrase was ambiguous because exactly

what ends up on the right-hand side and what on the left is an accident of how we happen

to write the equations, and your audience may not do it the same way you do.

(B) How is it done?

This is usually the easy part of program documentation because it is not the purpose of

this documentation to give a course in mathematics or algorithms or anything else. Hence

most of the time a reference to the literature is enough, or perhaps if the method is a

standard one, just give its name. Often though, variations on the standard method have

been chosen, and the user must be informed about those:

“. . . is solved by Gaussian elimination, using complete positioning for size. . . ”

“. . . the input array A is sorted by the Quicksort method (see D.E. Knuth, The

Art of Computer Programming, volume 3). . . ”

“. . . the eigenvalues and vectors are found by the Jacobi method, using Corbató’s method of avoiding the search for the largest off-diagonal element (see,

for instance, the description in D.R. Wilson, A First Course in Mathematical

Software).”

“. . . is found by the Simplex method, except that Charnes’ selection rule (see

F.A. Ficken, The Simplex Method. . . ) is not programmed, and so. . . ”

(C) Describe the global variables

Now it gets hard again. The global variables are the ones through which the subroutine

communicates with the user. Generally speaking, the user doesn’t care about variables

that are entirely local to your subroutine, but is vitally concerned with the communicating

variables.

First the user has to know exactly how each of the global variables is related to the

problem that is being solved. This calls for a brief verbal description of the variable, and

what it has to do with the functioning of the program.

“A[i] is the ith element of the input list that is to be sorted, i=1..N”

“WHY is set by the subroutine to TRUE unless the return is because of overflow,

and then it will be set to FALSE.”

“B[i,j] is the coeﬃcient of X[j] in the ith one of the input equations BX=C.”

“option is set by the calling program on input. Set it to 0 if the output is to

be rounded to the nearest integer, else set it to m if the output is to be rounded

to m decimal places (m ≤ 12).”

It is extremely important that each and every global variable of the subroutine should

get such a description. Just march through the parentheses in the subroutine or procedure

heading, and describe each variable in turn.

Next, the user will need more information about each of the global variables than just its

description as above. Also required is the “type” of the variable. Some computer languages

force each program to declare the types of their variables right in the opening statement.

Others declare types by observing various default rules with exceptions stated. In any case,

a little redundancy never hurts, and the program documentation should declare the type of

each and every global variable.

It’s easy to declare along with the types of the variables, their dimensions if they are

array variables. For instance we may have a

solver:=proc(A,X,n,ndim,b);

in which the communicating variables have the following types:

A ndim-by-ndim array of ﬂoating point numbers

X vector of ﬂoating point numbers of length n

n integer

ndim integer

b vector of ﬂoating point numbers of length n

The best way to announce all of these types and dimensions of global variables to the

user is simply to list them, as above, in a table.

Now surely we’ve ﬁnished taking the pulse, blood pressure, etc. of the global variables,

haven’t we? Well, no, we haven’t. There’s still more vital data that a user will need to

know about these variables. There isn’t any standard name like “type” to apply to this

information, so we’ll call it the “role” of the variable.

First, for some of the global variables of the subroutine, it may be true that their values

at the time the subroutine is called are quite irrelevant to the operation of the subroutine.

This would be the case for the output variables, and in certain other situations. For some

other variables, the values at input time would be crucial. The user needs to know which are

which. Just for one example, if the value at input time is irrelevant, then the user can feel

free to use the same storage for other temporary purposes between calls to the subroutine.

Second, it may happen that certain variables are returned by the subroutine with their

values unchanged. This is particularly true for “implicitly passed” global variables, i.e.,

variables whose values are used by the subroutine but which do not appear explicitly in the

argument list. In such cases, the user may be delighted to hear the good news. In other

cases, the action of a subroutine may change an input variable, so if the user needs to use

those quantities again it will be necessary to save them somewhere else before calling the

subroutine. In either case, the user needs to be informed.

Third, it may be that the computation of the value of a certain variable is one of the

main purposes of the subroutine. Such variables are the outputs of the program, and the

user needs to know which these are (whether they are explicit in the heading or the return

statement, or are “implicit”).

Although some high-level computer languages require type declarations immediately

in the opening instruction of a subroutine, none require the descriptions of the roles of

the variables (well, Pascal requires the VAR declaration, and Maple separates the input

variables from the output ones, but both languages allow implicit passing and changing

of global variables). These are, however, important for the user to know, so let’s invent

a shorthand for describing them in the documentation of the programs that occur in this

book.

First, if the value at input time is important, let’s say that the role of the variable is I,

otherwise it is I’.

Second, if the value of the variable is changed by the action of the subroutine, we’ll say

that its role is C, else C’.

Finally, if the computation of this variable is one of the main purposes of the subroutine, its role is O (as in output), else O’.

In the description of each communicating variable, all three of these should be specified. Thus, a variable X might have role IC’O’, or a variable why might be of role I’CO, etc.

To sum up, the essential features of program documentation are a description of what the program does, phrased in terms of the global variables, a statement of how it gets the job done, and a list of all of the global variables, showing for each one its name, type, dimension (or structure) if any, its role, and a brief verbal description.

Refer back to the short program in section 2.2, that searches for the largest element in

a row of a matrix. Here is the table of information about its global variables:

Name    Type                     Role     Description

A       floating point matrix    IC’O’    The input matrix

i       integer                  IC’O’    Which row to search

jwin    integer                  I’CO     Column containing largest element
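To show how the three parts of the documentation might sit together in the source itself, here is a hedged Python sketch of such a row search. It is not the section 2.2 program, which is not reproduced here; the function name, the 0-based indexing and the tie-breaking rule are assumptions made for the example.

```python
def row_max_column(A, i):
    """Return jwin, the column index of the largest element in row i of A.

    Method: a single left-to-right scan of row i; ties go to the leftmost column.

    Name   Type                      Role    Description
    A      list of lists of floats   IC'O'   The input matrix
    i      int                       IC'O'   Which row to search (0-based)
    jwin   int (return value)        I'CO    Column containing largest element
    """
    jwin = 0
    for j in range(1, len(A[i])):
        if A[i][j] > A[i][jwin]:
            jwin = j
    return jwin


A = [[1.0, 5.0, 2.0], [7.0, 0.0, 7.0]]
print(row_max_column(A, 0), row_max_column(A, 1))
```

The opening sentence states the purpose in terms of the communicating variables, the second line names the method, and the table gives name, type, role and description for each variable.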

Exercises 2.4

Write programs that perform each of the jobs stated below. In each case, after testing

the program, document it with comments. Give a complete table of information about the

global variables in each case.

(a) Find and print all of the prime numbers between M and N.

(b) Find the elements of largest and smallest absolute values in a given linear array

(vector), and their positions in the array.

(c) Sort the elements of a given linear array into ascending order of size.

(d) Deal out four bridge hands (13 cards each from a standard 52-card deck – This one

is not so easy!).

(e) Solve a quadratic equation (any quadratic equation!).

2.5 The midpoint and trapezoidal rules

Euler’s formula is doubtless the simplest numerical integration procedure for diﬀerential

equations, but the accuracy that can be obtained with it is insuﬃcient for most applications.

In this section and those that follow, we want to introduce a whole family of methods for

the solution of diﬀerential equations, called the linear multistep methods, in which the user

can choose the degree of precision that will suﬃce for the job, and then select a member of

the family that will achieve it.

Before describing the family in all of its generality, we will produce two more of its

members, which illustrate the diﬀerent sorts of creatures that inhabit the family in question.

Recall that we derived Euler’s method by chopping oﬀ the Taylor series expansion of the

solution after the linear term. To get a more accurate method we could, of course, keep the

quadratic term, too. However, that term involves a second derivative, and we want to avoid

the calculation of higher derivatives because our diﬀerential equations will always be written

as ﬁrst-order systems, so that only the ﬁrst derivative will be conveniently computable.

We can have greater accuracy without having to calculate higher derivatives if we’re

willing to allow our numerical integration procedure to involve values of the unknown func-

tion and its derivative at more than one point. In other words, in Euler’s method, the next

value of the unknown function, at x + h, is gotten from the values of $y$ and $y'$ at just one backwards point x. In the more accurate formulas that we will discuss next, the new value of $y$ depends on $y$ and $y'$ at more than one point, for instance, at $x$ and $x - h$, or at several points.

As a primitive example of this kind, we will now discuss the midpoint rule. We begin

once again with the Taylor expansion of the unknown function $y(x)$ about the point $x_n$:

$$y(x_n + h) = y(x_n) + h y'(x_n) + h^2 \frac{y''(x_n)}{2} + h^3 \frac{y'''(x_n)}{6} + \cdots. \tag{2.5.1}$$

Now we rewrite equation (2.5.1) with h replaced by −h to get

$$y(x_n - h) = y(x_n) - h y'(x_n) + h^2 \frac{y''(x_n)}{2} - h^3 \frac{y'''(x_n)}{6} + \cdots \tag{2.5.2}$$

and then subtract these equations, obtaining

$$y(x_n + h) - y(x_n - h) = 2h y'(x_n) + 2h^3 \frac{y'''(x_n)}{6} + \cdots. \tag{2.5.3}$$

Now, just as we did in the derivation of Euler’s method, we will truncate the right side of (2.5.3) after the first term, ignoring the terms that involve $h^3$, $h^5$, etc. Further, let’s use $y_n$ to denote the computed approximate value of $y(x_n)$ (and $y_{n+1}$ for the approximate $y(x_{n+1})$, etc.). Then we have

$$y_{n+1} - y_{n-1} = 2h y_n'. \tag{2.5.4}$$

If, as usual, we are solving the differential equation $y' = f(x, y)$, then finally (2.5.4) takes the form

$$y_{n+1} = y_{n-1} + 2h f(x_n, y_n) \tag{2.5.5}$$

and this is the midpoint rule. The name arises from the fact that the first derivative $y_n'$ is being approximated by the slope of the chord that joins the two points $(x_{n-1}, y_{n-1})$ and $(x_{n+1}, y_{n+1})$, instead of the chord joining $(x_n, y_n)$ and $(x_{n+1}, y_{n+1})$ as in Euler’s method.

At first sight it seems that (2.5.5) can be used just like Euler’s method, because it is a recurrence formula in which we compute the next value $y_{n+1}$ from the two previous values $y_n$ and $y_{n-1}$. Indeed the rules are quite similar, except for the fact that we can’t get started with the midpoint rule until we know two consecutive values $y_0$, $y_1$ of the unknown function at two consecutive points $x_0$, $x_1$. Normally a differential equation is given together with just one value of the unknown function, so if we are to use the midpoint rule we’ll need to manufacture one more value of $y(x)$ by some other means.

This kind of situation will come up again and again as we look at more accurate methods,

because to obtain greater precision without computing higher derivatives we will get the

next approximate value of y from a recurrence formula that may involve not just one or two,

but several of its predecessors. To get such a formula started we will have to ﬁnd several

starting values in addition to the one that is given in the statement of the initial-value

problem.

To get back to the midpoint rule, we can get it started most easily by calculating $y_1$, the approximation to $y(x_0 + h)$, from Euler’s method, and then switching to the midpoint rule to carry out the rest of the calculation.

Let’s do this, for example with the same diﬀerential equation (2.1.7) that we used to

illustrate Euler’s rule, so we can compare the two methods. The problem consists of the equation $y' = 0.5y$ and the initial value $y(0) = 1$. We’ll use the same step size $h = 0.05$ as before.

Now to start the midpoint rule we need two consecutive values of y, in this case at x = 0

and x = 0.05. At 0.05 we use the value that Euler’s method gives us, namely $y_1 = 1.025$ (see Table 1). It’s easy to continue the calculation now from (2.5.5).

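For instance

$$\begin{aligned} y_2 &= y_0 + 2h(0.5 y_1) \\ &= 1 + 0.1(0.5 \cdot 1.025) \\ &= 1.05125 \end{aligned} \tag{2.5.6}$$

and

$$\begin{aligned} y_3 &= y_1 + 2h(0.5 y_2) \\ &= 1.025 + 0.1(0.5 \cdot 1.05125) \\ &= 1.0775625. \end{aligned} \tag{2.5.7}$$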

In the table below we show for each x the value computed from the midpoint rule, from

Euler’s method, and from the exact solution $y(x) = e^{x/2}$. The superior accuracy of the midpoint rule is apparent.

x        Midpoint(x)    Euler(x)      Exact(x)

0.00       1.00000        1.00000       1.00000

0.05       1.02500        1.02500       1.02532

0.10       1.05125        1.05063       1.05127

0.15       1.07756        1.07689       1.07788

0.20       1.10513        1.10381       1.10517

0.25       1.13282        1.13141       1.13315

...

1.00       1.64847        1.63862       1.64872

2.00       2.71763        2.68506       2.71828

3.00       4.48032        4.39979       4.48169

...

5.00      12.17743       11.81372      12.18249

10.00    148.31274      139.56389     148.41316

Table 2

[Figure 2.1: The trapezoidal rule (the curve y = f(x) between x = a and x = b)]
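The midpoint column can be reproduced with a few lines of code. The following Python sketch (not the book's Maple; the function name midpoint_solve is an assumption) starts with one Euler step, as described above, and then applies (2.5.5).

```python
def midpoint_solve(f, x0, y0, h, steps):
    """Return [y_0, ..., y_steps] (steps >= 1): one Euler step, then rule (2.5.5)."""
    ys = [y0, y0 + h * f(x0, y0)]  # y_1 comes from Euler's method
    for n in range(1, steps):
        xn = x0 + n * h
        ys.append(ys[n - 1] + 2.0 * h * f(xn, ys[n]))  # y_{n+1} = y_{n-1} + 2h f(x_n, y_n)
    return ys


ys = midpoint_solve(lambda x, y: 0.5 * y, 0.0, 1.0, 0.05, 5)
for n, y in enumerate(ys):
    print(f"{0.05 * n:.2f}  {y:.5f}")
```

The values at x = 0.10 and x = 0.15 agree with (2.5.6) and (2.5.7).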

Next, we introduce a third method of numerical integration, the trapezoidal rule. The

best way to obtain it is to convert the diﬀerential equation that we’re trying to solve into

an integral equation, and then use the trapezoidal approximation for the integral.

We begin with the differential equation $y' = f(x, y(x))$, and we integrate both sides from $x$ to $x + h$, getting

$$y(x + h) = y(x) + \int_x^{x+h} f(t, y(t))\,dt. \tag{2.5.8}$$

Now if we approximate the right-hand side in any way by a weighted sum of values of

the integrand at various points we will have found an approximate method for solving our

diﬀerential equation.

The trapezoidal rule states that for an approximate value of an integral

$$\int_a^b f(t)\,dt \tag{2.5.9}$$

we can use, instead of the area under the curve between x = a and x = b, the area of the trapezoid whose sides are the x axis, the lines x = a and x = b, and the line through the points (a, f(a)) and (b, f(b)), as shown in Figure 2.1. That area is $\frac{1}{2}(f(a) + f(b))(b - a)$.

If we apply the trapezoidal rule to the integral that appears in (2.5.8), we obtain

$$y(x_n + h) \approx y(x_n) + \frac{h}{2}\bigl(f(x_n, y(x_n)) + f(x_n + h, y(x_n + h))\bigr) \tag{2.5.10}$$

in which we have used the “≈” sign rather than the “=” because the right hand side is not

exactly equal to the integral that really belongs there, but is only approximately so.

If we use our usual abbreviation $y_n$ for the computed approximate value of $y(x_n)$, then (2.5.10) becomes

$$y_{n+1} = y_n + \frac{h}{2}\bigl(f(x_n, y_n) + f(x_{n+1}, y_{n+1})\bigr). \tag{2.5.11}$$

This is the trapezoidal rule in the form that is useful for diﬀerential equations.

At first sight, (2.5.11) looks like a recurrence formula from which the next approximate value, $y_{n+1}$, of the unknown function, can immediately be computed from the previous value, $y_n$. However, this is not the case.

Upon closer examination one observes that the next value $y_{n+1}$ appears not only on the left-hand side, but also on the right (it’s hiding in the second $f$ on the right side).

In order to find the value $y_{n+1}$ it appears that we need to carry out an iterative process. First we would guess $y_{n+1}$ (guessing $y_{n+1}$ to be equal to $y_n$ wouldn’t be all that bad, but we can do better). If we use this guess value on the right side of (2.5.11) then we would be able to calculate the entire right-hand side, and then we could use that value as a new “improved” value of $y_{n+1}$.

Now if the new value agrees with the old sufficiently well the iteration would halt, and we would have found the desired value of $y_{n+1}$. Otherwise we would use the improved value on the right side just as we previously used the first guess. Then we would have a “more improved” guess, etc.

Fortunately, in actual use, it turns out that one does not actually have to iterate to

convergence. If a good enough guess is available for the unknown value, then just one

reﬁnement by a single application of the trapezoidal formula is suﬃcient. This is not the

case if a high quality guess is unavailable. We will discuss this point in more detail in

section 2.9. The pair of formulas, one of which supplies a very good guess to the next

value of y, and the other of which reﬁnes it to a better guess, is called a predictor-corrector

pair, and such pairs form the basis of many of the highly accurate schemes that are used in

practice.

As a numerical example, take the diﬀerential equation

$$y' = 2x e^{y} + 1 \tag{2.5.12}$$

with the initial value $y(0) = 1$. If we use $h = 0.05$, then our first task is to calculate $y_1$, the approximate value of $y(0.05)$. The trapezoidal rule asserts that

$$y_1 = 1 + 0.025\,(2 + 0.1 e^{y_1}) \tag{2.5.13}$$

and sure enough, the unknown number $y_1$ appears on both sides.

Let’s guess $y_1 = 1$. Since this is not a very inspired guess, we will have to iterate the trapezoidal rule to convergence. Hence, we use this guess on the right side of (2.5.13), compute the right side, and obtain $y_1 = 1.056796$. If we use this new guess the same way, the result is $y_1 = 1.057193$. Then we get 1.057196, and since this is in sufficiently close agreement with the previous result, we declare that the iteration has converged. Then we take $y_1 = 1.057196$ for the computed value of the unknown function at $x = 0.05$, and we go next to $x = 0.1$ to repeat the same sort of thing to get $y_2$, the computed approximation to $y(0.1)$.
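The iteration just described can be sketched in a few lines of Python (again an illustration, not the book's Maple). The function name trapezoid_step, the stopping tolerance and the iteration cap are assumptions; the equation, step size and starting guess are those of the example above.

```python
import math


def trapezoid_step(f, x, y, h, guess, tol=1e-6, max_iter=50):
    """Solve (2.5.11) for y_{n+1} by repeated substitution; return every iterate."""
    fx = f(x, y)
    iterates = []
    for _ in range(max_iter):
        new = y + 0.5 * h * (fx + f(x + h, guess))
        iterates.append(new)
        if abs(new - guess) < tol:  # two successive guesses agree closely enough
            break
        guess = new
    return iterates


its = trapezoid_step(lambda x, y: 2.0 * x * math.exp(y) + 1.0, 0.0, 1.0, 0.05, guess=1.0)
for k, value in enumerate(its, start=1):
    print(k, f"{value:.6f}")
```

The first three iterates match the values quoted in the text; each pass shrinks the change between successive guesses by roughly a factor of one hundred, which is why so few passes are needed.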

In Table 3 we show the results of using the trapezoidal rule (where we have iterated until two successive guesses are within $10^{-4}$) on our test equation $y' = 0.5y$, $y(0) = 1$ as the column Trap(x). For comparison, we show Midpoint(x) and Exact(x).

x        Trap(x)        Midpoint(x)    Exact(x)

0.00       1.00000        1.00000        1.00000

0.05       1.02532        1.02500        1.02532

0.10       1.05127        1.05125        1.05127

0.15       1.07789        1.07756        1.07788

0.20       1.10518        1.10513        1.10517

0.25       1.13316        1.13282        1.13315

...

1.00       1.64876        1.64847        1.64872

2.00       2.71842        2.71763        2.71828

3.00       4.48203        4.48032        4.48169

...

5.00      12.18402       12.17743       12.18249

10.00    148.45089      148.31274      148.41316

Table 3

2.6 Comparison of the methods

We are now in possession of three methods for the numerical solution of differential equations. They are Euler’s method

$$y_{n+1} = y_n + h y_n', \tag{2.6.1}$$

the trapezoidal rule

$$y_{n+1} = y_n + \frac{h}{2}(y_n' + y_{n+1}'), \tag{2.6.2}$$

and the midpoint rule

$$y_{n+1} = y_{n-1} + 2h y_n'. \tag{2.6.3}$$

In order to compare the performance of the three techniques it will be helpful to have a standard differential equation on which to test them. The most natural candidate for such an equation is $y' = Ay$, where A is constant. The reasons for this choice are first that the equation is easy to solve exactly, second that the difference approximations are also relatively easy to solve exactly, so comparison is readily done, third that by varying the sign of A we can study behavior of either growing or shrinking (unstable or stable) solutions, and finally that many problems in nature have solutions that are indeed exponential, at least over the short term, so this is an important class of differential equations.

We will, however, write the test equation in a slightly different form for expository reasons, namely as

$$y' = -\frac{y}{L}; \qquad y(0) = 1; \qquad L > 0 \tag{2.6.4}$$

where L is a constant. The most interesting and revealing case is where the true solution is a decaying exponential, so we will assume that L > 0. Further, we will assume that y(0) = 1 is the given initial value.

The exact solution is of course

$$\mathrm{Exact}(x) = e^{-x/L}. \tag{2.6.5}$$

Notice that if x increases by L, the solution changes by a factor of e. Hence L, called the

relaxation length of the problem, can be conveniently visualized as the distance over which

the solution falls by a factor of e.

Now we would like to know how well each of the methods (2.6.1)–(2.6.3) handles the

problem (2.6.4).

Suppose first that we ask Euler’s method to solve the problem. If we substitute $y' = f(x, y) = -y/L$ into (2.6.1), we get

$$y_{n+1} = y_n + h\left(-\frac{y_n}{L}\right) = \left(1 - \frac{h}{L}\right) y_n. \tag{2.6.6}$$

Before we solve this recurrence, let’s comment on the ratio h/L that appears in it. Now

L is the distance over which the solution changes by a factor of e, and h is the step size

that we are going to use in the numerical integration. Instinctively, one feels that if the

solution is changing rapidly in a certain region, then h will have to be kept small there if

good accuracy is to be retained, while if the solution changes only slowly, then h can be

larger without sacriﬁcing too much accuracy. The ratio h/L measures the step size of the

integration in relation to the distance over which the solution changes appreciably. Hence,

h/L is exactly the thing that one feels should be kept small for a successful numerical

solution.

Since h/L occurs frequently below, we will denote it with the symbol τ.

Now the solution of the recurrence equation (2.6.6), with the starting value $y_0 = 1$, is obviously

$$y_n = (1 - \tau)^n, \qquad n = 0, 1, 2, \ldots. \tag{2.6.7}$$
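As a quick check of (2.6.7), the recurrence (2.6.6) can be iterated directly and compared with the closed form. The sketch below is only illustrative: the function name `euler_decay` and the sample values h = 0.05, L = 1 are assumptions made for the example, not something taken from the text.

```python
import math

def euler_decay(h, L, steps):
    """Iterate y_{n+1} = (1 - h/L) * y_n from y_0 = 1, i.e. recurrence (2.6.6)."""
    tau = h / L
    y = 1.0
    values = [y]
    for _ in range(steps):
        y = (1.0 - tau) * y
        values.append(y)
    return values

h, L, steps = 0.05, 1.0, 20            # illustrative values only
ys = euler_decay(h, L, steps)

# Closed form (2.6.7) and the exact solution (2.6.5) at the same mesh points.
closed_form = [(1.0 - h / L) ** n for n in range(steps + 1)]
exact = [math.exp(-n * h / L) for n in range(steps + 1)]

print(ys[-1], closed_form[-1], exact[-1])
```

The iterated values and the closed form should agree to roundoff, while both differ visibly from the exact solution at x = 1.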

Next we study the trapezoidal approximation to the same equation (2.6.4). We substitute $y' = f(x, y) = -y/L$ in (2.6.2) and get

$$y_{n+1} = y_n + \frac{h}{2}\left(-\frac{y_n}{L} - \frac{y_{n+1}}{L}\right). \tag{2.6.8}$$

The unknown $y_{n+1}$ appears, as usual with this method, on both sides of the equation. However, for the particularly simple equation that we are now studying, there is no difficulty in solving (2.6.8) for $y_{n+1}$ (without any need for an iterative process) and obtaining

$$y_{n+1} = \frac{1 - \frac{\tau}{2}}{1 + \frac{\tau}{2}}\, y_n. \tag{2.6.9}$$

Together with the initial value $y_0 = 1$, this implies that

$$y_n = \left(\frac{1 - \frac{\tau}{2}}{1 + \frac{\tau}{2}}\right)^n, \qquad n = 0, 1, 2, \ldots. \tag{2.6.10}$$


Before we deal with the midpoint rule, let’s pause to examine the two methods whose solutions we have just found. Note that for a given value of h, all three of (a) the exact solution, (b) Euler’s solution and (c) the trapezoidal solution are of the form $y_n = (\text{constant})^n$, in which the three values of “constant” are

$$\text{(a)}\ \ e^{-\tau} \qquad \text{(b)}\ \ 1 - \tau \qquad \text{(c)}\ \ \frac{1 - \frac{\tau}{2}}{1 + \frac{\tau}{2}}. \tag{2.6.11}$$

It follows that to compare the two approximate methods with the “truth,” all we have to do is see how close the constants (b) and (c) above are to the true constant (a). If we remember that τ is being thought of as small compared to 1, then we have the power series expansion of $e^{-\tau}$

$$e^{-\tau} = 1 - \tau + \frac{\tau^2}{2} - \frac{\tau^3}{6} + \cdots \tag{2.6.12}$$

to compare with 1 − τ and with the power series expansion of

$$\frac{1 - \frac{\tau}{2}}{1 + \frac{\tau}{2}} = 1 - \tau + \frac{\tau^2}{2} - \frac{\tau^3}{4} + \cdots. \tag{2.6.13}$$

The comparison is now clear. Both the Euler and the trapezoidal methods yield approximate solutions of the form $(\text{constant})^n$, where “constant” is near $e^{-\tau}$. The trapezoidal rule does a better job of being near $e^{-\tau}$ because its constant agrees with the power series expansion of $e^{-\tau}$ through the quadratic term, whereas that of the Euler method agrees only up to the linear term.
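The comparison of the three constants in (2.6.11) can also be made numerically. The following sketch (the helper name `growth_constants` and the sample values of τ are assumptions for illustration) prints how far the Euler and trapezoidal constants are from $e^{-\tau}$; from (2.6.12) and (2.6.13) one expects gaps of roughly $\tau^2/2$ and $\tau^3/12$ respectively.

```python
import math

def growth_constants(tau):
    """Per-step factors from (2.6.11): exact, Euler, trapezoidal."""
    exact = math.exp(-tau)
    euler = 1.0 - tau
    trap = (1.0 - tau / 2.0) / (1.0 + tau / 2.0)
    return exact, euler, trap

for tau in (0.1, 0.05, 0.025):          # illustrative values of h/L
    exact, euler, trap = growth_constants(tau)
    # Leading terms of the gaps: tau^2/2 for Euler, tau^3/12 for trapezoidal.
    print(tau, exact - euler, tau**2 / 2, exact - trap, tau**3 / 12)
```

Halving τ should cut the Euler gap by about 4 and the trapezoidal gap by about 8.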

Finally we study the nature of the approximation that is provided by the midpoint rule. We will find that a new and important phenomenon rears its head in this case. The analysis begins just as it did in the previous two cases: We substitute the right-hand side $f(x, y) = -y/L$ for $y'$ in (2.6.3) to get

$$y_{n+1} = y_{n-1} + 2h\left(-\frac{y_n}{L}\right). \tag{2.6.14}$$

One important feature is already apparent. Instead of facing a ﬁrst-order diﬀerence

equation as we did in (2.6.6) for Euler’s method and in (2.6.9) for the trapezoidal rule, we

have now to contend with a second-order diﬀerence equation.

Since the equation is linear with constant coefficients, we know to try a solution of the form $y_n = r^n$. This leads to the quadratic equation

$$r^2 + 2\tau r - 1 = 0. \tag{2.6.15}$$

Evidently the discriminant of this equation is positive, so its roots are distinct. If we denote these two roots by $r_+(\tau)$ and $r_-(\tau)$, then the general solution of the difference equation (2.6.14) is

$$y_n = c\,(r_+(\tau))^n + d\,(r_-(\tau))^n, \tag{2.6.16}$$

where c and d are constants whose values are determined by the initial data $y_0$ and $y_1$.

The Euler and trapezoidal approximations were each of the form $(\text{constant})^n$. This one is a sum of two terms of that kind. We will see that $r_+(\tau)$ is a very good approximation to $e^{-\tau}$. The other term, $(r_-(\tau))^n$, is, so to speak, the price that we pay for getting such a good approximation in $r_+(\tau)$. We hope that the other term will stay small relative to the first term, so as not to disturb the closeness of the approximation. We will see, however, that it need not be so obliging, and that in fact it might do whatever it can to spoil things.

The two roots of the quadratic equation are

$$r_+(\tau) = -\tau + \sqrt{1 + \tau^2}, \qquad r_-(\tau) = -\tau - \sqrt{1 + \tau^2}. \tag{2.6.17}$$

When τ = 0 the first of these is +1, so when τ is small $r_+(\tau)$ is near +1, and it is the root that is trying to approximate the exact constant $e^{-\tau}$ as well as possible. In fact it does pretty well, because the power series expansion of $r_+(\tau)$ is

$$r_+(\tau) = 1 - \tau + \frac{\tau^2}{2} - \frac{\tau^4}{8} + \cdots \tag{2.6.18}$$

so it agrees with $e^{-\tau}$ through the quadratic terms.

What about $r_-(\tau)$? Its Taylor series is

$$r_-(\tau) = -1 - \tau - \frac{\tau^2}{2} + \cdots. \tag{2.6.19}$$

The bad news is now before us: When τ is a small positive number, the root $r_-(\tau)$ is larger than 1 in absolute value. This means that the stability criterion of Theorem 1.6.1 is violated, so we say that the midpoint rule is unstable.
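The behavior of the two roots is easy to inspect numerically. The sketch below evaluates (2.6.17) for one sample value of τ; the helper name `midpoint_roots` and the choice τ = 0.05 are assumptions made for illustration.

```python
import math

def midpoint_roots(tau):
    """Roots of r^2 + 2*tau*r - 1 = 0, written as in (2.6.17)."""
    s = math.sqrt(1.0 + tau * tau)
    return -tau + s, -tau - s

tau = 0.05                               # illustrative value of h/L
r_plus, r_minus = midpoint_roots(tau)

print("r_plus      =", r_plus)
print("exp(-tau)   =", math.exp(-tau))   # r_plus should be close to this
print("|r_minus|   =", abs(r_minus))     # exceeds 1 for any tau > 0
```

Since the product of the roots of (2.6.15) is −1, a root $r_+$ slightly less than 1 forces $|r_-|$ to be slightly greater than 1.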

In practical terms, we observe that $r_+(\tau)$ is close to $e^{-\tau}$, so the first term on the right of (2.6.16) is very close to the exact solution. The second term of (2.6.16), the so-called parasitic solution, is small compared to the first term when n is small, because the constant d will be small compared with c. However, as we move to the right, n increases, and the second term will eventually dominate the first: the first term shrinks to zero as n increases, because that’s what the exact solution does, while the second term increases steadily in size. In fact, since $r_-(\tau)$ is negative and larger than 1 in absolute value, the second term will alternate in sign as n increases, and grow without bound in magnitude.

In Table 4 below we show the result of integrating the problem $y' = -y$, $y(0) = 1$ with each of the three methods that we have discussed, using a step size of h = 0.05. To get the midpoint method started, we used the exact value of y(0.05) (i.e., we cheated), and in the trapezoidal rule we iterated to convergence with $\epsilon = 10^{-4}$. The instability of the midpoint rule is quite apparent.


    x        Euler(x)       Trap(x)        Midpoint(x)    Exact(x)
   0.0       1.00000        1.00000         1.00000       1.00000
   1.0       0.35849        0.36780         0.36806       0.36788
   2.0       0.12851        0.13527         0.13552       0.13534
   3.0       0.04607        0.04975         0.05005       0.04979
   4.0       0.01652        0.01830         0.01888       0.01832
   5.0       0.00592        0.00673         0.00822       0.00674
  10.0       0.00004        0.00005         0.21688       0.00005
  14.55      3.3 × 10^-7    4.8 × 10^-7   -20.48          4.8 × 10^-7
  15.8       9.1 × 10^-8    1.4 × 10^-7    71.45          1.4 × 10^-7

Table 4
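The midpoint column of Table 4 comes from a very short computation, and a minimal sketch of it is given here. The function name `midpoint_decay` is an assumption; the starting value at x = 0.05 is taken from the exact solution, as described above. Double-precision arithmetic may differ from the table in the last printed digit.

```python
import math

def midpoint_decay(h, steps):
    """Midpoint rule y_{n+1} = y_{n-1} - 2*h*y_n for y' = -y, y(0) = 1.

    The second starting value is the exact y(h), as in the text.
    Returns the list [y_0, y_1, ..., y_steps].
    """
    ys = [1.0, math.exp(-h)]
    for n in range(1, steps):
        ys.append(ys[n - 1] - 2.0 * h * ys[n])
    return ys

h = 0.05
ys = midpoint_decay(h, 320)

# Mesh indices corresponding to x = 1, 5, 10, 14.55 and 15.8.
for n in (20, 100, 200, 291, 316):
    print(f"x = {n * h:6.2f}   midpoint = {ys[n]: .5f}   exact = {math.exp(-n * h):.2e}")
```

The computed values track the exact solution at first, then the parasitic term takes over, alternating in sign and growing.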

In addition to the above discussion of accuracy, we summarize here three additional properties of integration methods as they relate to the examples that we have already studied.

First, a numerical integration method might be iterative or noniterative. A method is

noniterative if it expresses the next value of the unknown function quite explicitly in terms

of values of the function and its derivatives at preceding points. In an iterative method,

at each step of the solution process the next value of the unknown is deﬁned implicitly by

an equation, which must be solved to obtain the next value of the unknown function. In

practice, we may either solve this equation completely by an iteration or do just one step

of the iteration, depending on the quality of available estimates for the unknown value.

Second, a method is self-starting if the next value of the unknown function is obtained

from the values of the function and its derivatives at exactly one previous point. It is not

self-starting if values at more than one backward point are needed to get the next one. In

the latter case some other method will have to be used to get the ﬁrst few computed values.

Third, we can define a numerical method to be stable if, when it is applied to the equation $y' = -y/L$, where L > 0, then for all sufficiently small positive values of the step size h a stable difference equation results, i.e., the computed solution (neglecting roundoff) remains bounded as n → ∞.

We summarize below the properties of the three methods that we have been studying.

                   Euler     Midpoint     Trapezoidal
  Iterative         No         No            Yes
  Self-starting     Yes        No            Yes
  Stable            Yes        No            Yes

Of the three methods, the trapezoidal rule is clearly the best, though for eﬃcient use

it needs the help of some other formula to predict the next value of y and thereby avoid

lengthy iterations.


2.7 Predictor-corrector methods

The trapezoidal rule diﬀers from the other two that we’ve looked at in that it does not

explicitly tell us what the next value of the unknown function is, but instead gives us an

equation that must be solved in order to ﬁnd it. At ﬁrst sight this seems like a nuisance,

but in fact it is a boon, because it enables us to regulate the step size during the course of

a calculation, as we will discuss in section 2.9.

Let’s take a look at the process by which we refine a guessed value of $y_{n+1}$ to an improved value, using the trapezoidal formula

$$y_{n+1} = y_n + \frac{h}{2}\bigl(f(x_n, y_n) + f(x_{n+1}, y_{n+1})\bigr). \tag{2.7.1}$$

Suppose we let $y^{(k)}_{n+1}$ represent some guess to the value of $y_{n+1}$ that satisfies (2.7.1). Then the improved value $y^{(k+1)}_{n+1}$ is computed from

$$y^{(k+1)}_{n+1} = y_n + \frac{h}{2}\bigl(f(x_n, y_n) + f(x_{n+1}, y^{(k)}_{n+1})\bigr). \tag{2.7.2}$$

We want to find out about how rapidly the successive values $y^{(k)}_{n+1}$, k = 1, 2, . . . approach a limit, if at all. To do this, we rewrite equation (2.7.2), this time replacing k by k − 1 to obtain

$$y^{(k)}_{n+1} = y_n + \frac{h}{2}\bigl(f(x_n, y_n) + f(x_{n+1}, y^{(k-1)}_{n+1})\bigr) \tag{2.7.3}$$

and then subtract (2.7.3) from (2.7.2) to get

$$y^{(k+1)}_{n+1} - y^{(k)}_{n+1} = \frac{h}{2}\bigl(f(x_{n+1}, y^{(k)}_{n+1}) - f(x_{n+1}, y^{(k-1)}_{n+1})\bigr). \tag{2.7.4}$$

Next we use the mean-value theorem on the difference of f values on the right-hand side, yielding

$$y^{(k+1)}_{n+1} - y^{(k)}_{n+1} = \frac{h}{2}\,\frac{\partial f}{\partial y}\bigg|_{(x_{n+1},\,\eta)} \left(y^{(k)}_{n+1} - y^{(k-1)}_{n+1}\right), \tag{2.7.5}$$

where η lies between $y^{(k)}_{n+1}$ and $y^{(k-1)}_{n+1}$.

From the above we see at once that the difference between two consecutive iterated values of $y_{n+1}$ will be $\frac{h}{2}\frac{\partial f}{\partial y}$ times the difference between the previous two iterated values. It follows that the iterative process will converge if h is kept small enough so that $\frac{h}{2}\frac{\partial f}{\partial y}$ is less than 1 in absolute value. We refer to $\frac{h}{2}\frac{\partial f}{\partial y}$ as the local convergence factor of the trapezoidal rule.

If the factor is a lot less than 1 (and this can be assured by keeping h small enough),

then the convergence will be extremely rapid.
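The role of the local convergence factor can be seen by watching the corrector iterates directly. In the sketch below, the function name `corrector_iterates`, the test problem $y' = -y/L$ with L = 1, and the deliberately crude starting guess are all assumptions chosen for illustration. For this linear f the ratio of successive differences should equal $\frac{h}{2}\frac{\partial f}{\partial y} = -h/(2L)$ exactly, up to roundoff.

```python
def corrector_iterates(f, x_next, y_n, f_n, h, guess, count):
    """Apply the trapezoidal corrector (2.7.2) `count` times, keeping every iterate."""
    iterates = [guess]
    for _ in range(count):
        iterates.append(y_n + 0.5 * h * (f_n + f(x_next, iterates[-1])))
    return iterates

# Illustrative problem: y' = -y/L with L = 1, one step from (x_n, y_n) = (0, 1).
L, h = 1.0, 0.05
f = lambda x, y: -y / L

its = corrector_iterates(f, h, 1.0, f(0.0, 1.0), h, guess=1.0, count=4)

# Differences between consecutive iterates, and the ratios of those differences.
diffs = [b - a for a, b in zip(its, its[1:])]
ratios = [d2 / d1 for d1, d2 in zip(diffs, diffs[1:])]

print(its)
print(ratios)    # each ratio should be close to -h / (2 * L)
```

With h = 0.05 the factor is small, so each pass of the corrector gains more than one decimal digit.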

In actual practice, one uses an iterative formula together with another formula (the

predictor) whose mission is to provide an intelligent ﬁrst guess for the iterative method


to use. The predictor formula will be explicit, or noniterative. If the predictor formula is

clever enough, then it will happen that just a single application of the iterative reﬁnement

(corrector formula) will be suﬃcient, and we won’t have to get involved in a long convergence

process.

If we use the trapezoidal rule for a corrector, for instance, then a clever predictor would be the midpoint rule. The reason for this will become clear if we look at both formulas together with their error terms. We will see in the next section that the error terms are as follows:

$$y_{n+1} = y_{n-1} + 2h\,y'_n + \frac{h^3}{3}\,y'''(X_m) \tag{2.7.6}$$

$$y_{n+1} = y_n + \frac{h}{2}\left(y'_n + y'_{n+1}\right) - \frac{h^3}{12}\,y'''(X_t). \tag{2.7.7}$$

Now the exact locations of the points $X_m$ and $X_t$ are unknown, but we will assume here that h is small enough that we can regard the two values of $y'''$ that appear as being the same.

As far as the powers of h that appear in the error terms go, we see that the third power

occurs in both formulas. We say then, that the midpoint predictor and the trapezoidal

corrector constitute a matched pair. The error in the trapezoidal rule is about one fourth

as large as, and of opposite sign from, the error in the midpoint method.

The midpoint guess is therefore quite “intelligent”. The subsequent iterative refinement of that guess needs to reduce the error only by a factor of four. Now let $y_P$ denote the midpoint predicted value, $y^{(1)}_{n+1}$ denote the first refined value, and $y_{n+1}$ be the final converged value given by the trapezoidal rule. Then we have

$$\begin{aligned} y_{n+1} &= y_n + \frac{h}{2}\left(y'_n + f(x_{n+1}, y_{n+1})\right) \\ y^{(1)}_{n+1} &= y_n + \frac{h}{2}\left(y'_n + f(x_{n+1}, y_P)\right) \end{aligned} \tag{2.7.8}$$

and by subtraction

$$y^{(1)}_{n+1} - y_{n+1} = \frac{h}{2}\,\frac{\partial f}{\partial y}\,(y_P - y_{n+1}). \tag{2.7.9}$$

This shows that, however far from the converged value the first guess was, the refined value is $\frac{h}{2}\frac{\partial f}{\partial y}$ times closer. Hence if we can keep $\frac{h}{2}\frac{\partial f}{\partial y}$ no bigger than about 1/4, then the distance from the first refined value to the converged value will be no larger than the size of the error term in the method, so there would be little point in gilding the iteration any further.

The conclusion is that when we are dealing with a matched predictor-corrector pair, we need do only a single refinement of the corrector if the step size is kept moderately small. Furthermore, “moderately small” means that the step size times the local value of $\frac{\partial f}{\partial y}$ should be small compared to 1. For this reason, iteration to full convergence is rarely done in practice.
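A predictor-corrector step with a single refinement can be sketched as follows. This is a minimal illustration under stated assumptions, not a production integrator: the function name `predictor_corrector` is made up here, the test problem is $y' = -y$, and the second starting value is taken from the exact solution because the midpoint rule is not self-starting.

```python
import math

def predictor_corrector(f, x0, y0, y1, h, steps):
    """Midpoint predictor followed by ONE trapezoidal correction per step.

    y0 and y1 are starting values at x0 and x0 + h.
    Returns the lists of mesh points and computed values.
    """
    xs = [x0, x0 + h]
    ys = [y0, y1]
    for n in range(1, steps):
        f_n = f(xs[n], ys[n])
        x_next = xs[n] + h
        # Predictor: midpoint rule (2.7.6) with the error term dropped.
        predicted = ys[n - 1] + 2.0 * h * f_n
        # Corrector: a single pass of the trapezoidal refinement (2.7.2).
        corrected = ys[n] + 0.5 * h * (f_n + f(x_next, predicted))
        xs.append(x_next)
        ys.append(corrected)
    return xs, ys

h = 0.05
xs, ys = predictor_corrector(lambda x, y: -y, 0.0, 1.0, math.exp(-h), h, 20)
print(xs[-1], ys[-1], math.exp(-xs[-1]))
```

Only two evaluations of f are needed per step, which is the practical payoff of not iterating to convergence.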


2.8 Truncation error and step size

We have so far regarded the step size h as a silent partner, more often than not choosing it to be equal to 0.05, for no particular reason. It is evident, however, that the accuracy of the calculation is strongly affected by the step size. If h is chosen too large, the computed solution may be quite far from the true solution of the differential equation. If it is chosen too small, the calculation will become unnecessarily time-consuming, and roundoff errors may build up excessively because of the numerous arithmetic operations that are being carried out.

Speaking in quite general terms, if the true solution of the differential equation is rapidly changing, then we will need a small value of h, that is, small compared to the local relaxation length (see section 2.6), and if the solution changes slowly, then a larger value of h will do.

Frequently in practice we deal with equations whose solutions change very rapidly over

part of the range of integration and slowly over another part. Examples of this are provided

by the study of the switching on of a complicated process, such as beginning a multi-stage

chemical reaction, turning on a piece of electronic equipment, starting a power reactor,

etc. In such cases there usually are rapid and ephemeral or “transient” phenomena that

occur soon after startup, and that disappear quickly. If we want to follow these transients

accurately, we may need to choose a very tiny step size. After the transients die out,

however, the steady-state solution may be a very quiet, slowly varying or nearly constant

function, and then a much larger value of h will be adequate.

If we are going to develop software that will be satisfactory for such problems, then the program will obviously have to choose, and re-choose, its own step size as the calculation proceeds. While following a rapid transient it should use a small mesh size, then it should gradually increase h as the transient fades, use a large step while the solution is steady, decrease it again if further quick changes appear, and so forth, all without operator intervention.

Before we go ahead to discuss methods for achieving this step size control, let’s observe

that one technique is already available in the material of the previous section. Recall that

if we want to, we can implement the trapezoidal rule by ﬁrst guessing, or predicting, the

unknown at the next point by using Euler’s formula, and then correcting the guess to

complete convergence by iteration.

The ﬁrst guess will be relatively far away from the ﬁnal converged value if the solution

is rapidly varying, but if the solution is slowly varying, then the guess will be rather good.

It follows that the number of iterations required to produce convergence is one measure of

the appropriateness of the current value of the step size: if many iterations are needed, then

the step size is too big. Hence one way to get some control on h is to follow a policy of

cutting the step size in half whenever more than, say, one or two iterations are necessary.

This suggestion is not suﬃciently sensitive to allow doubling the stepsize when only one

iteration is needed, however, and somewhat more delicacy is called for in that situation.

Furthermore this is a very time-consuming approach since it involves a complete iteration

to convergence, when in fact a single turn of the crank is enough if the step size is kept


small enough.

The discussion does, however, point to the fundamental idea that underlies the automatic control of step size during the integration. That basic idea is precisely that we can estimate the correctness of the step size by watching how well the first guess in our iterative process agrees with the corrected value. The correction process itself, when viewed this way, is seen to be a powerful ally of the software user, rather than the “pain in the neck” it seemed to be when we first met it.

Indeed, why would anyone use the cumbersome procedure of guessing and reﬁning (i.e.,

prediction and correction) as we do in the trapezoidal rule, when many other methods are

available that give the next value of the unknown immediately and explicitly? No doubt

the question crossed the reader’s mind, and the answer is now emerging. It will appear that

not only does the disparity between the ﬁrst prediction and the corrected value help us to

control the step size, it actually can give us a quantitative estimate of the local error in the

integration, so that if we want to, we can print out the approximate size of the error along

with the solution.

Our next job will be to make these rather qualitative remarks into quantitative tools, so

we must discuss the estimation of the error that we commit by using a particular diﬀerence

approximation to a diﬀerential equation, instead of that equation itself, on one step of the

integration process. This is the single-step truncation error. It does not tell us how far our

computed solution is from the true solution, but only how much error is committed in a

single step.

The easiest example, as usual, is Euler’s method. In fact, in equation (2.1.2) we have already seen the single-step error of this method. That equation was

$$y(x_n + h) = y(x_n) + h\,y'(x_n) + h^2\,\frac{y''(X)}{2} \tag{2.8.1}$$

where X lies between $x_n$ and $x_n + h$. In Euler’s procedure, we drop the third term on the right, the “remainder term,” and compute the solution from the rest of the equation. In doing this we commit a single-step truncation error that is equal to

$$E = h^2\,\frac{y''(X)}{2}, \qquad x_n < X < x_n + h. \tag{2.8.2}$$

Thus, Euler’s method is exact (E = 0) if the solution is a polynomial of degree 1 or less ($y'' = 0$). Otherwise, the single-step error is proportional to $h^2$, so if we cut the step size in half, the local error is reduced to 1/4 of its former value, approximately, and if we double h the error is multiplied by about 4.
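The $h^2$ behavior of the single-step error is easy to observe. The sketch below takes one Euler step for $y' = -y$ from the exact value y(0) = 1 and compares the result with the exact solution; the test problem, the two step sizes, and the function name `euler_one_step_error` are assumptions chosen for illustration.

```python
import math

def euler_one_step_error(h):
    """Exact solution minus one Euler step for y' = -y starting from y(0) = 1."""
    euler = 1.0 + h * (-1.0)          # y_0 + h * f(x_0, y_0) with f = -y
    return math.exp(-h) - euler

e_big = euler_one_step_error(0.10)
e_small = euler_one_step_error(0.05)
ratio = e_big / e_small               # close to 4 if the error behaves like h^2

print(e_big, e_small, ratio)
```

The ratio is not exactly 4 because $y''(X)$ in (2.8.2) is evaluated at a slightly different point for each step size.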

We could use (2.8.2) to estimate the error by somehow computing an estimate of $y''$. For instance, we might differentiate the differential equation $y' = f(x, y)$ once more, and compute $y''$ directly from the resulting formula. This is usually more trouble than it is worth, though, and we will prefer to estimate E by more indirect methods.

Next we derive the local error of the trapezoidal rule. There are various special methods

that might be used to do this, but instead we are going to use a very general method that


is capable of dealing with the error terms of almost every integration rule that we intend

to study.

First, let’s look a little more carefully at the capability of the trapezoidal rule, in the form

$$y_{n+1} - y_n - \frac{h}{2}\left(y'_n + y'_{n+1}\right) = 0. \tag{2.8.3}$$

Of course, this is a recurrence by means of which we propagate the approximate solution to the right. It certainly is not exactly true if $y_n$ denotes the value of the true solution at the point $x_n$ unless that true solution is very special. How special?

Suppose the true solution is y(x) = 1 for all x. Then (2.8.3) would be exactly valid. Suppose y(x) = x. Then (2.8.3) is again exactly satisfied, as the reader should check. Furthermore, if $y(x) = x^2$, then a brief calculation reveals that (2.8.3) holds once more. How long does this continue? Our run of good luck has just expired, because if $y(x) = x^3$ then (check this) the left side of (2.8.3) is not 0, but is instead $-h^3/2$.

We might say that the trapezoidal rule is exact on 1, x, and $x^2$, but not $x^3$, i.e., that it is an integration rule of order two (“order” is an overworked word in differential equations). It follows by linearity that the rule is exact on any quadratic polynomial.
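The checks suggested above can be carried out mechanically. The sketch below substitutes the monomials 1, x, $x^2$, $x^3$ into the left side of (2.8.3) using exact rational arithmetic; the function name `trapezoid_residual` and the particular point x = 3/10 and step h = 1/20 are arbitrary choices made for illustration.

```python
from fractions import Fraction

def trapezoid_residual(y, dy, x, h):
    """Left side of (2.8.3) with a known function y (derivative dy) substituted."""
    return y(x + h) - y(x) - h / 2 * (dy(x) + dy(x + h))

x, h = Fraction(3, 10), Fraction(1, 20)      # arbitrary illustrative point and step

residuals = []
for k in range(4):
    # y(t) = t^k and its derivative k * t^(k-1); the default argument pins k.
    y = lambda t, k=k: t ** k
    dy = lambda t, k=k: k * t ** (k - 1) if k > 0 else Fraction(0)
    residuals.append(trapezoid_residual(y, dy, x, h))

print(residuals)          # zero for k = 0, 1, 2
print(-h ** 3 / 2)        # the value expected for k = 3
```

Because the arithmetic is exact, the first three residuals vanish identically rather than merely to roundoff.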

By way of contrast, it is easy to verify that Euler’s method is exact for a linear function, but fails on $x^2$. Since the error term for Euler’s method in (2.8.2) is of the form $\text{const} \cdot h^2 \cdot y''(X)$, it is perhaps reasonable to expect the error term for the trapezoidal rule to look like $\text{const} \cdot h^3 \cdot y'''(X)$.

Now we have two questions to handle, and they are respectively easy and hard:

(a) If the error term in the trapezoidal rule really is $\text{const} \cdot h^3 \cdot y'''(X)$, then what is “const”?

(b) Is it true that the error term is $\text{const} \cdot h^3 \cdot y'''(X)$?

We’ll do the easy one first, anticipating that the answer to (b) is affirmative so the effort won’t be wasted. If the error term is of the form stated, then the trapezoidal rule can be written as

$$y(x + h) - y(x) - \frac{h}{2}\left(y'(x + h) + y'(x)\right) = c \cdot h^3 \cdot y'''(X), \tag{2.8.4}$$

where X lies between x and x + h. To find c all we have to do is substitute $y(x) = x^3$ into (2.8.4) and we find at once that c = −1/12. The single-step truncation error of the trapezoidal rule would therefore be

$$E = -h^3\,\frac{y'''(X)}{12}, \qquad x < X < x + h. \tag{2.8.5}$$
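For readers who would like the substitution spelled out, here is one way to do it. Taking x = 0 is an assumption made only to shorten the algebra; since $y''' = 6$ is constant for $y = x^3$, the location of X does not matter.

```latex
% Substitute y(x) = x^3 into (2.8.4) at x = 0, so y'(x) = 3x^2 and y'''(x) = 6.
\begin{align*}
\text{left side:}\quad
  & y(h) - y(0) - \frac{h}{2}\bigl(y'(h) + y'(0)\bigr)
    = h^3 - \frac{h}{2}\,(3h^2) = -\frac{h^3}{2}, \\
\text{right side:}\quad
  & c \cdot h^3 \cdot y'''(X) = 6c\,h^3, \\
\text{hence}\quad
  & 6c = -\frac{1}{2}
    \quad\Longrightarrow\quad c = -\frac{1}{12}.
\end{align*}
```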

Now let’s see that question (b) has the answer “yes” so that (2.8.5) is really right.

To do this we start with a truncated Taylor series with the integral form of the remainder, rather than with the differential form of the remainder. In general the series is

$$y(x) = y(0) + x\,y'(0) + x^2\,\frac{y''(0)}{2!} + \cdots + x^n\,\frac{y^{(n)}(0)}{n!} + R_n(x) \tag{2.8.6}$$


where

$$R_n(x) = \frac{1}{n!}\int_0^x (x - s)^n\, y^{(n+1)}(s)\, ds. \tag{2.8.7}$$

Indeed, one of the nice ways to prove Taylor’s theorem begins with the right-hand side of (2.8.7), plucked from the blue, and then repeatedly integrates by parts, lowering the order of the derivative of y and the power of (x − s) until both reach zero.

In (2.8.6) we choose n = 2, because the trapezoidal rule is exact on polynomials of degree 2, and we write it in the form

$$y(x) = P_2(x) + R_2(x) \tag{2.8.8}$$

where $P_2(x)$ is the quadratic (in x) polynomial $P_2(x) = y(0) + x\,y'(0) + x^2\,y''(0)/2$.

Next we define a certain operation that transforms a function y(x) into a new function, namely into the left-hand side of equation (2.8.4). We call the operator L so it is defined by

$$Ly(x) = y(x + h) - y(x) - \frac{h}{2}\left(y'(x) + y'(x + h)\right). \tag{2.8.9}$$

Now we apply the operator L to both sides of equation (2.8.8), and we notice immediately that $LP_2(x) = 0$, because the rule is exact on quadratic polynomials (this is why we chose n = 2 in (2.8.6)). Hence we have

$$Ly(x) = LR_2(x). \tag{2.8.10}$$

Notice that we have here a remainder formula for the trapezoidal rule. It isn’t in a very satisfactory form yet, so we will now make it a bit more explicit by computing $LR_2(x)$. First, in the integral expression (2.8.7) for $R_2(x)$ we want to replace the upper limit of the integral by +∞. We can do this by writing

$$R_2(x) = \frac{1}{2!}\int_0^\infty H_2(x - s)\, y'''(s)\, ds \tag{2.8.11}$$

where $H_2(t) = t^2$ if t > 0 and $H_2(t) = 0$ if t < 0.

Now if we bear in mind the fact that the operator L acts only on x, and that s is a dummy variable of integration, we find that

$$LR_2(x) = \frac{1}{2!}\int_0^\infty LH_2(x - s)\, y'''(s)\, ds. \tag{2.8.12}$$

Choose x = h. Then if s lies between 0 and h we find

$$LH_2(x - s) = (h - s)^2 - \frac{h}{2}\bigl(2(h - s)\bigr) = -s(h - s) \tag{2.8.13}$$

(Caution: Do not read past the right-hand side of the first equals sign unless you can verify the correctness of what you see there!), whereas if s > h then $LH_2(x - s) = 0$.


Then (2.8.12) becomes

$$LR_2(h) = -\frac{1}{2}\int_0^h s(h - s)\, y'''(s)\, ds. \tag{2.8.14}$$

This is a much better form for the remainder, but we still have not answered the “hard” question (b). To finish it off we need a form of the mean-value theorem of integral calculus, namely

Theorem 2.8.1 If p(x) is nonnegative, and g(x) is continuous, then

$$\int_a^b p(x)\,g(x)\,dx = g(X)\int_a^b p(x)\,dx \tag{2.8.15}$$

where X lies between a and b.

The theorem asserts that a weighted average of the values of a continuous function is

itself one of the values of that function. The vital hypothesis is that the “weight” p(x) does

not change sign.

Now in (2.8.14), the function s(h − s) does not change sign on the s-interval (0, h), so

$$LR_2(h) = -\frac{1}{2}\,y'''(X)\int_0^h s(h - s)\, ds \tag{2.8.16}$$

and if we do the integral we obtain, finally,

Theorem 2.8.2 The trapezoidal rule with remainder term is given by the formula

$$y(x_{n+1}) - y(x_n) = \frac{h}{2}\left(y'(x_n) + y'(x_{n+1})\right) - \frac{h^3}{12}\,y'''(X), \tag{2.8.17}$$

where X lies between $x_n$ and $x_{n+1}$.
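For completeness, the integral in (2.8.16) can be evaluated as follows, which is where the factor 1/12 in the theorem comes from:

```latex
\begin{align*}
\int_0^h s(h - s)\, ds
  &= \frac{h \cdot h^2}{2} - \frac{h^3}{3} = \frac{h^3}{6}, \\
LR_2(h)
  &= -\frac{1}{2}\, y'''(X) \cdot \frac{h^3}{6}
   = -\frac{h^3}{12}\, y'''(X).
\end{align*}
```

This agrees with the constant c = −1/12 found earlier by substituting $y(x) = x^3$ into (2.8.4).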

The proof of this theorem involved some ideas that carry over almost unchanged to very general kinds of integration rules. Therefore it is important to make sure that you completely understand its derivation.

2.9 Controlling the step size

In equation (2.8.5) we saw that if we can estimate the size of the third derivative during

the calculation, then we can estimate the error in the trapezoidal rule as we go along, and

modify the step size h if necessary, to keep that error within preassigned bounds.

To see how this can be done, we will quote, without proof, the result of a similar derivation for the midpoint rule. It says that

$$y(x_{n+1}) - y(x_{n-1}) = 2h\,y'(x_n) + \frac{h^3}{3}\,y'''(X), \tag{2.9.1}$$


where X is between $x_{n-1}$ and $x_{n+1}$. Thus the midpoint rule is also of second order. If the step size were halved, the local error would be cut to one eighth of its former value. The error in the midpoint rule is, however, about four times as large as that in the trapezoidal formula, and of opposite sign.
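Both claims, the factor of about four between the two error terms and the factor of eight on halving h, can be checked on a sample function. In the sketch below the choice $y(x) = e^{-x}$, the point x = 0, the step h = 0.05, and the two helper names are assumptions made for illustration.

```python
import math

def midpoint_residual(y, dy, x, h):
    """y(x+h) - y(x-h) - 2*h*y'(x): the term dropped by the midpoint rule."""
    return y(x + h) - y(x - h) - 2.0 * h * dy(x)

def trapezoid_residual(y, dy, x, h):
    """y(x+h) - y(x) - (h/2)*(y'(x) + y'(x+h)): the term dropped by the trapezoidal rule."""
    return y(x + h) - y(x) - 0.5 * h * (dy(x) + dy(x + h))

y = lambda x: math.exp(-x)          # sample solution, chosen for illustration
dy = lambda x: -math.exp(-x)
h = 0.05

mid = midpoint_residual(y, dy, 0.0, h)
trap = trapezoid_residual(y, dy, 0.0, h)
ratio = mid / trap                  # roughly -4: four times as large, opposite sign

mid_halved = midpoint_residual(y, dy, 0.0, h / 2)
shrink = mid / mid_halved           # roughly 8: the local error scales like h^3

print(mid, trap, ratio, shrink)
```

The ratio is only approximately −4 because the two remainder terms evaluate $y'''$ at different points.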

Now suppose we adopt an overall strategy of predicting the value $y_{n+1}$ of the unknown by means of the midpoint rule, and then refining the prediction to convergence with the trapezoidal corrector. We want to estimate the size of the single-step truncation error, using only the following data, both of which are available during the calculation: (a) the initial guess, from the midpoint method, and (b) the converged corrected value, from the trapezoidal rule.

We begin by defining three different kinds of values of the unknown function at the “next” point $x_{n+1}$. They are

(i) the quantity $p_{n+1}$ is defined as the predicted value of $y(x_{n+1})$ obtained from using the midpoint rule, except that backwards values are not the computed ones, but are the exact ones instead. In symbols,

$$p_{n+1} = y(x_{n-1}) + 2h\,y'(x_n). \tag{2.9.2}$$

Of course, $p_{n+1}$ is not available during an actual computation.

(ii) the quantity $q_{n+1}$ is the value that we would compute from the trapezoidal corrector if for the backward value we use the exact solution $y(x_n)$ instead of the calculated solution $y_n$. Thus $q_{n+1}$ satisfies the equation

$$q_{n+1} = y(x_n) + \frac{h}{2}\bigl(f(x_n, y(x_n)) + f(x_{n+1}, q_{n+1})\bigr). \tag{2.9.3}$$

Again, $q_{n+1}$ is not available to us during calculation.

(iii) the quantity $y(x_{n+1})$, which is the exact solution itself. It satisfies two different equations, one of which is

$$y(x_{n+1}) = y(x_n) + \frac{h}{2}\bigl(f(x_n, y(x_n)) + f(x_{n+1}, y(x_{n+1}))\bigr) - \frac{h^3}{12}\,y'''(X) \tag{2.9.4}$$

and the other of which is (2.9.1). Note that the two X’s may be different.

Now, from (2.9.1) and (2.9.2) we have at once that

$$y(x_{n+1}) = p_{n+1} + \frac{h^3}{3}\,y'''(X). \tag{2.9.5}$$

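Next, from (2.9.3) and (2.9.4) we get

$$\begin{aligned} y(x_{n+1}) &= q_{n+1} + \frac{h}{2}\bigl(f(x_{n+1}, y(x_{n+1})) - f(x_{n+1}, q_{n+1})\bigr) - \frac{h^3}{12}\,y'''(X) \\ &= q_{n+1} + \frac{h}{2}\,\bigl(y(x_{n+1}) - q_{n+1}\bigr)\,\frac{\partial f}{\partial y}(x_{n+1}, Y) - \frac{h^3}{12}\,y'''(X) \end{aligned} \tag{2.9.6}$$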


where we have used the mean-value theorem, and Y lies between $y(x_{n+1})$ and $q_{n+1}$. Now if we subtract $q_{n+1}$ from both sides, we will observe that $y(x_{n+1}) - q_{n+1}$ will then appear on both sides of the equation. Hence we will be able to solve for it, with the result that

$$y(x_{n+1}) = q_{n+1} - \frac{h^3}{12}\,y'''(X) + \text{terms involving } h^4. \tag{2.9.7}$$

Now let’s make the working hypothesis that $y'''$ is constant over the range of values of x considered, namely from $x_n - h$ to $x_n + h$. The $y'''(X)$ in (2.9.7) is thereby decreed to be equal to the $y'''(X)$ in (2.9.5), even though the X’s are different. Under this assumption, we can eliminate $y(x_{n+1})$ between (2.9.5) and (2.9.7) and obtain

$$q_{n+1} - p_{n+1} = \frac{5}{12}\,h^3\,y''' + \text{terms involving } h^4. \tag{2.9.8}$$

We see that this expresses the unknown, but assumed constant, value of $y'''$ in terms of the difference between the initial prediction and the final converged value of $y(x_{n+1})$. Now we ignore the “terms involving $h^4$” in (2.9.8), solve for $y'''$, and then for the estimated single-step truncation error we have

$$\text{Error} = -\frac{h^3}{12}\,y''' \approx -\frac{1}{12}\cdot\frac{12}{5}\,(q_{n+1} - p_{n+1}) = -\frac{1}{5}\,(q_{n+1} - p_{n+1}). \tag{2.9.9}$$

The quantity $q_{n+1} - p_{n+1}$ is not available during the calculation, but as an estimator we can use the computed predicted value and the computed converged value, because these differ only in that they use computed, rather than exact, backwards values of the unknown function.

Hence, we have here an estimate of the single-step truncation error that we can conveniently compute, print out, or use to control the step size.
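One step of this procedure, including the error estimate, might be coded along the following lines. This is a sketch under stated assumptions: the function name `pc_step_with_estimate`, the convergence tolerance `tol`, and the iteration cap `max_iter` are all choices made here for illustration, and the test problem is $y' = -y$ started from exact values at x = 0 and x = 0.05.

```python
import math

def pc_step_with_estimate(f, x_n, y_prev, y_n, h, tol=1e-12, max_iter=50):
    """One midpoint-predictor / trapezoidal-corrector step, iterated to convergence.

    Returns (predicted, corrected, error_estimate), where the estimate is
    -(corrected - predicted) / 5, following (2.9.9) with computed values
    used in place of the unavailable q_{n+1} and p_{n+1}.
    """
    f_n = f(x_n, y_n)
    x_next = x_n + h
    predicted = y_prev + 2.0 * h * f_n          # midpoint predictor
    corrected = predicted
    for _ in range(max_iter):
        refined = y_n + 0.5 * h * (f_n + f(x_next, corrected))
        converged = abs(refined - corrected) < tol
        corrected = refined
        if converged:
            break
    return predicted, corrected, -(corrected - predicted) / 5.0

h = 0.05
f = lambda x, y: -y
# Step from x = 0.05 to x = 0.10, with exact values at x = 0 and x = 0.05.
pred, corr, est = pc_step_with_estimate(f, h, 1.0, math.exp(-h), h)
print(f"{pred:.6f}  {corr:.6f}  {est:.2e}")
```

The estimate costs nothing beyond a subtraction, since both the predicted and corrected values are already in hand at the end of the step.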

The derivation of this formula was of course dependent on the fact that we used the midpoint method for the predictor and the trapezoidal rule for the corrector. If we had used a different pair, however, the same argument would have worked, provided only that the error terms of the predictor and corrector formulas both involved the same derivative of y, i.e., both formulas were of the same order.

Hence, “matched pairs” of predictor and corrector formulas, i.e., pairs in which both are of the same order, are most useful for carrying out extended calculations in which the local errors are continually monitored and the step size is regulated accordingly.

Let’s pause to see how this error estimator would have turned out in the case of a general matched pair of predictor-corrector formulas, instead of just for the midpoint and trapezoidal rule combination. Suppose the predictor formula has an error term

$$y_{\text{exact}} - y_{\text{predicted}} = \lambda\, h^q\, y^{(q)}(X) \tag{2.9.10}$$


and suppose that the error in the corrector formula is given by

$$y_{\text{exact}} - y_{\text{corrected}} = \mu\, h^q\, y^{(q)}(X). \tag{2.9.11}$$

Then a derivation similar to the one that we have just done will show that the estimator for the single-step error that is available during the progress of the computation is

$$\text{Error} \approx \frac{\mu}{\lambda - \mu}\,\bigl(y_{\text{corrected}} - y_{\text{predicted}}\bigr). \tag{2.9.12}$$
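As a small sanity check, the factor μ/(λ − μ) can be evaluated for the midpoint and trapezoidal pair, for which (2.9.5) and (2.9.7) give λ = 1/3 and μ = −1/12. The helper name `estimator_coefficient` is made up for this sketch.

```python
from fractions import Fraction

def estimator_coefficient(lam, mu):
    """The factor mu / (lam - mu) that appears in (2.9.12)."""
    return mu / (lam - mu)

# Midpoint predictor: lam = 1/3.  Trapezoidal corrector: mu = -1/12.
coeff = estimator_coefficient(Fraction(1, 3), Fraction(-1, 12))
print(coeff)
```

The result is the factor −1/5 that appears in (2.9.9).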

In the table below we show the result of integrating the differential equation $y' = -y$ with y(0) = 1 using the midpoint and trapezoidal formulas with h = 0.05 as the predictor and corrector, as described above. The successive columns show x, the predicted value at x, the converged corrected value at x, the single-step error estimated from the approximation (2.9.9), and the actual single-step error obtained by computing

$$y(x_{n+1}) - y(x_n) - \frac{h}{2}\left(y'(x_n) + y'(x_{n+1})\right) \tag{2.9.13}$$

using the true solution $y(x) = e^{-x}$. The calculation was started by (cheating and) using the exact solution at 0.00 and 0.05.

x      Pred(x)    Corr(x)    Errest(x)     Error(x)
0.00   -------    -------    -------       -------
0.05   -------    -------    -------       -------
0.10   0.904877   0.904828   98 × 10^-7    94 × 10^-7
0.15   0.860747   0.860690   113 × 10^-7   85 × 10^-7
0.20   0.818759   0.818705   108 × 10^-7   77 × 10^-7
0.25   0.778820   0.778768   102 × 10^-7   69 × 10^-7
0.30   0.740828   0.740780   97 × 10^-7    61 × 10^-7
0.35   0.704690   0.704644   93 × 10^-7    55 × 10^-7
0.40   0.670315   0.670271   88 × 10^-7    48 × 10^-7
0.45   0.637617   0.637575   84 × 10^-7    43 × 10^-7
0.50   0.606514   0.606474   80 × 10^-7    37 × 10^-7
...    ...        ...        ...           ...
0.95   0.386694   0.386669   51 × 10^-7    5 × 10^-7
1.00   0.367831   0.367807   48 × 10^-7    3 × 10^-7

Table 5
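The predicted, corrected and estimated-error columns can be regenerated with a few lines of code. The following Python sketch is an illustration under our own assumptions: the function name, the inner convergence tolerance of 1e-14, and the cap of 100 corrector iterations are not from the text. It starts from the exact values at 0.00 and 0.05, predicts with the midpoint rule, iterates the trapezoidal rule to convergence, and estimates the error as one fifth of the predicted value minus the corrected value. It does not attempt to reproduce the last column of the table.

```python
import math


def predictor_corrector_rows(h=0.05, x_end=1.0):
    """Rows (x, predicted, corrected, error estimate) for y' = -y, y(0) = 1."""
    f = lambda x, y: -y
    y_prev, y_curr = 1.0, math.exp(-h)  # exact starting values, as in the text
    rows = []
    n = 1
    while n * h < x_end - 1e-12:
        x_n, x_next = n * h, (n + 1) * h
        pred = y_prev + 2 * h * f(x_n, y_curr)  # midpoint predictor
        corr = pred
        for _ in range(100):  # trapezoidal corrector, iterated to convergence
            new = y_curr + (h / 2) * (f(x_n, y_curr) + f(x_next, corr))
            converged = abs(new - corr) < 1e-14
            corr = new
            if converged:
                break
        rows.append((x_next, pred, corr, (pred - corr) / 5))
        y_prev, y_curr = y_curr, corr
        n += 1
    return rows


rows = predictor_corrector_rows()
for x, pred, corr, errest in rows[:3]:
    print(f"{x:4.2f}  {pred:.6f}  {corr:.6f}  {errest:.2e}")
```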

Now that we have a simple device for estimating the single-step truncation error, namely

by using one ﬁfth of the distance between the ﬁrst guess and the corrected value, we can

regulate the step size so as to keep the error between preset limits. Suppose we would like to

keep the single-step error in the neighborhood of $10^{-8}$. We might then adopt, say, $5 \times 10^{-8}$ as the upper limit of tolerable error and, for instance, $10^{-9}$ as the lower limit.

Why should we have a lower limit? If the calculation is being done with more precision

than necessary, the step size will be smaller than needed, and we will be wasting computer

time as well as possibly building up roundoﬀ error.

58 The Numerical Solution of Diﬀerential Equations

Now that we have ﬁxed these limits we should be under no delusion that our computed

solution will depart from the true solution by no more than $5 \times 10^{-8}$, or whatever. What we are controlling is the one-step truncation error, a lot better than nothing, but certainly not the same as the total accumulated error.

With the upper and lower tolerances set, we embark on the computation. First, since

the midpoint method needs two backward values before it can be used, something special

will have to be done to get the procedure moving at the start. This is typical of numerical

solution of diﬀerential equations. Special procedures are needed at the beginning to build

up enough computed values so that the predictor-corrector formulas can be used thereafter,

or at least until the step size is changed.

In the present case, since we’re going to use the trapezoidal corrector anyway, we might

as well use the trapezoidal rule, unassisted by midpoint, with complete iteration to convergence, to get the value of y at the first point $x_0 + h$ beyond the initial point $x_0$.

Now we have two consecutive values of y, and the generic calculation can begin. From

any two consecutive values, the midpoint rule is used to predict the next value of y. This

predicted value is also saved for future use in error estimation. The predicted value is then

reﬁned by the trapezoidal rule.

With the trapezoidal value in hand, the local error is then estimated by calculating

one-ﬁfth of the diﬀerence between that value and the midpoint guess.

If the absolute value of the local error lies between the preset limits $10^{-9}$ and $5 \times 10^{-8}$,

then we just go on to the next step. This means that we augment x by h, and move back the

newer values of the unknown function to the locations that hold older values (we remember,

at any moment, just two past values).

Otherwise, suppose the local error was too large. Then we must reduce the step size

h, say by cutting it in half. When all this is done, some message should be printed out

that announces the change, and then we should restart the procedure, with the new value

of h, from the “farthest backward” value of x for which we still have the corresponding y

in memory. One reason for this is that we may ﬁnd out right at the outset that our very

ﬁrst choice of h is too large, and perhaps it may need to be halved, say, three times before

the errors are tolerable. Then we would like to restart each time from the same originally

given data point $x_0$, rather than let the computation creep forward a little bit with step

sizes that are too large for comfort.

Finally, suppose the local error was too small. Then we double the step size, print a

message to that eﬀect, and restart, again from the smallest possible value of x.
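The three cases just described amount to a small decision rule. Here is a minimal Python sketch of it, assuming the midpoint and trapezoidal pair (so that the local error is estimated by one fifth of the gap between prediction and correction) and the illustrative limits $10^{-9}$ and $5 \times 10^{-8}$ mentioned above. The function name and the returned strings are our own.

```python
def step_decision(y_predicted, y_corrected, lower=1e-9, upper=5e-8):
    """Decide what to do with the step size after one predictor-corrector step.

    Returns "halve" if the estimated local error is too large, "double" if it
    is too small, and "keep" otherwise.
    """
    local_error = abs(y_predicted - y_corrected) / 5
    if local_error > upper:
        return "halve"
    if local_error < lower:
        return "double"
    return "keep"


print(step_decision(1.0, 1.0 + 1e-7))   # estimate 2e-8, inside the limits
print(step_decision(1.0, 1.000001))     # estimate 2e-7, too large
print(step_decision(1.0, 1.0))          # estimate 0, too small
```

In the two cases other than "keep" the computation is restarted from the farthest backward point still in memory, as the text explains.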

Now let’s apply the philosophy of structured programming to see how the whole thing

should be organized. We ask ﬁrst for the major logical blocks into which the computation

is divided. In this case we see

(i) a procedure midpt. Input to this procedure will be x, h, $y_{n-1}$, $y_n$. Output from it will be $y_{n+1}$ computed from the midpoint formula. No arrays are involved. The three values of y in question occupy just three memory locations. The leading statement in this routine might be


midpt:=proc(x,h,y0,ym1,n);

and its return statement would be return(yp1);. One might think that it is scarcely

necessary to have a separate subroutine for such a simple calculation. The spirit of

structured programming dictates otherwise. Someday one might want to change from

the midpoint predictor to some other predictor. If organized as a subroutine, then it’s quite easy to disentangle it from the program and replace it. This is the “modular” approach.

(ii) a procedure trapez. This routine will be called from two or three diﬀerent places in the

main routine: when starting, when restarting with a new value of h, and in a generic

step, where it is the corrector formula. Operation of the routine is different, too,

depending on the circumstances. When starting or restarting, there is no externally

supplied guess to help it. It must ﬁnd its own way to convergence. On a generic step

of the integration, however, we want it to use the prediction supplied by midpt, and

then just do a single correction.

One way to handle this is to use a logical variable start. If trapez is called with start

= TRUE, then the subroutine would supply a ﬁnal converged value without looking for any

input guess. Suppose the ﬁrst line of trapez is

trapez:=proc(x,h,yin,yguess,n,start,eps);

and its return statement is return(yout);. When called with start = TRUE, then the

routine might use yin as its ﬁrst guess to yout, and iterate to convergence from there, using

eps as its convergence criterion. If start = FALSE, it will take yguess as an estimate of

yout, then use the trapezoidal rule just once, to reﬁne the value as yout.
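To make the two modules concrete, here is a hedged Python sketch of what midpt and trapez might look like for a single equation $y' = f(x, y)$. It is not the authors' code: the argument order, the explicit f parameter, the omission of the parameter n, and the max_iter safeguard are our assumptions. Only the behavior follows the description above: midpt applies the midpoint formula once, and trapez either iterates to convergence (start true) or applies a single correction to a supplied guess (start false).

```python
def midpt(f, x, h, y_prev, y_curr):
    """Midpoint predictor: returns y_prev + 2*h*f(x, y_curr).

    x is the abscissa of y_curr; y_prev is the value one step earlier.
    """
    return y_prev + 2 * h * f(x, y_curr)


def trapez(f, x, h, y_in, y_guess, start, eps, max_iter=100):
    """Trapezoidal rule from x to x + h.

    start=True : ignore y_guess, begin from y_in, iterate until two
                 successive iterates differ by less than eps.
    start=False: apply the trapezoidal formula once to y_guess.
    """
    f_in = f(x, y_in)
    if not start:
        return y_in + (h / 2) * (f_in + f(x + h, y_guess))
    y_out = y_in
    for _ in range(max_iter):
        y_new = y_in + (h / 2) * (f_in + f(x + h, y_out))
        if abs(y_new - y_out) < eps:
            return y_new
        y_out = y_new
    raise RuntimeError("trapezoidal iteration did not converge")


# Usage on y' = -y, y(0) = 1 with h = 0.05.
f = lambda x, y: -y
h = 0.05
y1 = trapez(f, 0.0, h, 1.0, None, True, 1e-12)     # starting step
y2_pred = midpt(f, h, h, 1.0, y1)                  # generic step: predict
y2 = trapez(f, h, h, y1, y2_pred, False, 1e-12)    # generic step: correct once
print(y1, y2_pred, y2)
```

Because each routine depends only on its own arguments, either one can be replaced without touching the other, which is the point of the modular organization.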

The combination of these two modules plus a small main program that calls them as

needed, constitutes the whole package. Each of the subroutines and the main routine should

be heavily documented in a self-contained way. That is, the descriptions of trapez, and of

midpt, should precisely explain their operation as separate modules, with their own inputs

and outputs, quite independently of the way they happen to be used by the main program

in this problem. It should be possible to unplug a module from this application and use it

without change in another. The documentation of a procedure should never make reference

to how it is used by a certain calling program, but should describe only how it transforms

its own inputs into its own outputs.

In the next section we are going to take an intermission from the study of integration

rules in order to discuss an actual physical problem, the ﬂight of a spacecraft to the moon.

This problem will need the best methods that we can ﬁnd for a successful, accurate solution.

Then in section 2.12, we’ll return to the modules discussed above, and will display complete

computer programs that carry them out. In the meantime, the reader might enjoy trying to

write such a complete program now, and comparing it with the one in that section.


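[Figure 2.2: 1D Moon Rocket. The x-axis, showing the Earth centered at x = 0 with its surface at x = R, the rocket at x(t), and the Moon centered at x = D with its near surface at x = D − r.]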

2.10 Case study: Rocket to the moon

Now we have a reasonably powerful apparatus for integration of initial-value problems and

systems, including the automatic regulation of step size and built-in error estimation. In

order to try out this software on a problem that will use all of its capability, in this section

we are going to derive the diﬀerential equations that govern the ﬂight of a rocket to the

moon. We do this ﬁrst in a one-dimensional model, and then in two dimensions. It will

be very useful to have these equations available for testing various proposed integration

methods. Great accuracy will be needed, and the ability to change the step size, both to increase it and to decrease it, will be essential, or else the computation will become

intolerably long. The variety of solutions that can be obtained is quite remarkable.

First, in the one-dimensional simpliﬁed model, we place the center of the earth at the

origin of the x-axis, and let R denote the earth’s radius. At the point x = D we place the

moon, and we let its radius be r. Finally, at a position x = x(t) is our rocket, making its

way towards the moon.

We will use Newton’s law of gravitation to ﬁnd the net gravitational force on the

rocket, and equate it to the mass of the rocket times its acceleration (Newton’s second law of

motion). According to Newton’s law of gravitation, the gravitational force exerted by one

body on another is proportional to the product of their masses and inversely proportional

to the square of the distance between them. If we use K for the constant of proportionality,

then the force on the rocket due to the earth is

$$-K\,\frac{M_E\, m}{x^2}, \tag{2.10.1}$$

whereas the force on the rocket due to the moon’s gravity is

$$K\,\frac{M_M\, m}{(D - x)^2} \tag{2.10.2}$$

where $M_E$, $M_M$ and m are, respectively, the masses of the earth, the moon and the rocket.

The acceleration of the rocket is of course $x''(t)$, and so the assertion that the net force is equal to mass times acceleration takes the form:

$$m x'' = -K\,\frac{M_E\, m}{x^2} + K\,\frac{M_M\, m}{(D - x)^2}. \tag{2.10.3}$$

This is a (nasty) diﬀerential equation of second order in the unknown function x(t), the

position of the rocket at time t. Note the nonlinear way in which this unknown function

appears on the right-hand side.


A second-order differential equation deserves two initial values, and we will oblige. First, let’s agree that at time t = 0 the rocket was on the surface of the earth, and second,

that the rocket was ﬁred at the moon with a certain initial velocity V . Hence, the initial

conditions that go with (2.10.3) are

$$x(0) = R; \qquad x'(0) = V. \tag{2.10.4}$$

Now, just a quick glance at (2.10.3) shows that m cancels out, so let’s remove it, but

not before pointing out the immense signiﬁcance of that fact. It implies that the motion

of the rocket is independent of its mass. By performing a now-legendary experiment with

rocks of diﬀerent sizes dropping from the Tower of Pisa, Galileo demonstrated that fact to

an incredulous world.

At any rate, (2.10.3) now reads as

$$x'' = -\frac{K M_E}{x^2} + \frac{K M_M}{(D - x)^2}. \tag{2.10.5}$$

We can make this equation a good bit prettier by changing the units of distance and time

from miles and seconds (or whatever) to a set of more natural units for the problem.

For our unit of distance we choose R, the radius of the earth. If we divide (2.10.5)

through by R, we can write the result as

$$\left(\frac{x}{R}\right)'' = -\frac{K M_E / R^3}{\left(\dfrac{x}{R}\right)^2} + \frac{K M_M / R^3}{\left(\dfrac{D}{R} - \dfrac{x}{R}\right)^2}. \tag{2.10.6}$$

Now instead of the unknown function x(t), we deﬁne y(t) = x(t)/R. Then y(t) is the

position of the rocket, expressed in earth radii, at time t. Further, the ratio D/R that

occurs in (2.10.6) is a dimensionless quantity, whose numerical value is about 60. Hence

(2.10.6) has now been transformed to

$$y'' = -\frac{K M_E}{R^3\, y^2} + \frac{K M_M}{R^3\, (60 - y)^2}. \tag{2.10.7}$$

Next we tackle the new time units. Since y is now dimensionless, the dimension of the

left side of the equation is the reciprocal of the square of a time. If we look next at the

ﬁrst term on the right, which of course has the same dimension, we see that the quantity

$R^3/(K M_E)$ is the square of a time, so

$$T_0 = \sqrt{\frac{R^3}{K M_E}} \tag{2.10.8}$$

is a time. Its numerical value is easier to calculate if we change the formula ﬁrst, as follows.

Consider a body of mass m on the surface of the earth. Its weight is the magnitude of

the force exerted on it by the earth’s gravity, namely $K M_E m / R^2$. Its weight is also equal to m times the acceleration of the body, namely the acceleration due to gravity, usually denoted by g, and having the value 32.2 feet/sec$^2$.


It follows that

$$\frac{K M_E\, m}{R^2} = m g, \tag{2.10.9}$$

and if we substitute into (2.10.8) we ﬁnd that our time unit is

$$T_0 = \sqrt{\frac{R}{g}}. \tag{2.10.10}$$

We take R = 4000 miles, and find $T_0$ is about 13 minutes and 30 seconds. We propose to measure time in units of $T_0$. To that end, we multiply through equation (2.10.7) by $T_0^2$ and get

$$T_0^2\, y'' = -\frac{1}{y^2} + \frac{M_M / M_E}{(60 - y)^2}. \tag{2.10.11}$$

The ratio of the mass $M_M$ of the moon to the mass $M_E$ of the earth is about 0.012.

Furthermore, we will now introduce a new independent variable τ and a new dependent

variable u = u(τ) by the relations

$$u(\tau) = y(\tau T_0); \qquad t = \tau T_0. \tag{2.10.12}$$

Thus, $u(\tau)$ represents the position of the rocket, measured in units of the radius of the earth, at a time $\tau$ that is measured in units of $T_0$, i.e., in units of 13.5 minutes.
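As a quick check of the figure of 13.5 minutes, the following Python sketch evaluates (2.10.10) with R = 4000 miles and g = 32.2 feet/sec$^2$. The conversion factor of 5280 feet per mile and the variable names are ours.

```python
import math

FEET_PER_MILE = 5280          # unit conversion, not stated in the text
R_FEET = 4000 * FEET_PER_MILE  # radius of the earth, taken as 4000 miles
G = 32.2                       # acceleration due to gravity in feet/sec^2

T0_SECONDS = math.sqrt(R_FEET / G)   # formula (2.10.10)
T0_MINUTES = T0_SECONDS / 60

print(f"T0 = {T0_SECONDS:.1f} seconds = {T0_MINUTES:.2f} minutes")
```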

The substitution of (2.10.12) into (2.10.11) yields the diﬀerential equation for the scaled

distance u(τ) as a function of the scaled time τ in the form

$$u'' = -\frac{1}{u^2} + \frac{0.012}{(60 - u)^2}. \tag{2.10.13}$$

Finally we must translate the initial conditions (2.10.4) into conditions on the new

variables. The ﬁrst condition is easy: u(0) = 1. Next, if we diﬀerentiate (2.10.12) and set

τ = 0 we get

$$u'(0) = \frac{T_0 V}{R} = \frac{V}{R / T_0}. \tag{2.10.14}$$

This is a ratio of two velocities. In the numerator is the velocity with which the rocket is launched. What is the significance of the velocity $R/T_0$?

We claim that it is, aside from a numerical factor, the escape velocity from the earth, if

there were no moon. Perhaps the quickest way to see this is to go back to equation (2.10.11)

and drop the second term on the right-hand side (the one that comes from the moon). Then

we will be looking at the diﬀerential equation that would govern the motion if the moon

were absent. This equation can be solved. Multiply both sides by $2y'$, and it becomes

$$T_0^2 \left((y')^2\right)' = \left(\frac{2}{y}\right)', \tag{2.10.15}$$

and integration yields

$$T_0^2\, (y')^2 = \frac{2}{y} + C. \tag{2.10.16}$$


Now let t = 0 and find that $C = T_0^2 V^2 / R^2 - 2$, so

$$T_0^2\, (y')^2 = \frac{2}{y} + \left(\frac{T_0^2 V^2}{R^2} - 2\right). \tag{2.10.17}$$

Suppose the rocket is launched with suﬃcient initial velocity to escape from the earth. Then

the function y(t) will grow without bound. Hence let y → ∞ on the right side of (2.10.17).

For all values of y, the left side is a square, and therefore a nonnegative quantity. Hence

the right side, which approaches the constant C, must also be nonnegative. Thus C ≥ 0 or,

equivalently

$$V \ge \sqrt{2}\,\frac{R}{T_0}. \tag{2.10.18}$$

Thus, if the rocket escapes, then (2.10.18) is true, and the converse is easy to show also.

Hence the quantity $\sqrt{2}\, R/T_0$ is the escape velocity from the earth. We shall denote it by $V_{\text{esc}}$. Its numerical value is approximately 25,145 miles per hour.
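The numerical value of the escape velocity can be checked directly from $\sqrt{2}\, R/T_0$ with $T_0 = \sqrt{R/g}$. The Python sketch below is ours; it assumes R = 4000 miles, g = 32.2 feet/sec$^2$ and 5280 feet per mile, and the variable names are illustrative.

```python
import math

R_MILES = 4000.0
G_FEET_PER_SEC2 = 32.2
FEET_PER_MILE = 5280.0  # unit conversion, not stated in the text

t0_seconds = math.sqrt(R_MILES * FEET_PER_MILE / G_FEET_PER_SEC2)
t0_hours = t0_seconds / 3600.0

# Escape velocity sqrt(2) * R / T0, in miles per hour.
v_esc_mph = math.sqrt(2.0) * R_MILES / t0_hours
print(f"escape velocity is about {v_esc_mph:.0f} miles per hour")
```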

Now we can return to (2.10.12) to translate the initial conditions on $x'(t)$ into initial conditions on $u'(\tau)$. In terms of the escape velocity, it becomes $u'(0) = \sqrt{2}\, V/V_{\text{esc}}$. We might say that if we choose to measure distance in units of earth radii, and time in units of $T_0$, then velocities turn out to be measured in units of escape velocity, aside from the $\sqrt{2}$.

In summary, the diﬀerential equation and the initial conditions have the ﬁnal form

$$\begin{aligned} u'' &= -\frac{1}{u^2} + \frac{0.012}{(60 - u)^2} \\ u(0) &= 1 \\ u'(0) &= \sqrt{2}\,\frac{V}{V_{\text{esc}}} \end{aligned} \tag{2.10.19}$$
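To hand the problem (2.10.19) to a numerical integrator it must be written as a first-order system in the position u and the velocity w = u'. The Python sketch below does this; the function names, the tuple representation of the state, and the constant names are our assumptions. It also locates the point at which the two gravitational pulls balance, which follows from setting the right-hand side of the differential equation to zero.

```python
import math

MASS_RATIO = 0.012      # moon mass / earth mass, from the text
MOON_DISTANCE = 60.0    # earth-moon distance in earth radii, from the text


def rocket_rhs(tau, state):
    """Right-hand side of (2.10.19) as a first-order system.

    state = (u, w) with w = u'. Returns (u', w').
    """
    u, w = state
    accel = -1.0 / u**2 + MASS_RATIO / (MOON_DISTANCE - u) ** 2
    return (w, accel)


def initial_state(v_over_vesc):
    """Initial data for launch speed V given as a fraction of V_esc."""
    return (1.0, math.sqrt(2.0) * v_over_vesc)


# Where 1/u^2 = 0.012/(60 - u)^2, i.e. (60 - u)/u = sqrt(0.012).
neutral_point = MOON_DISTANCE / (1.0 + math.sqrt(MASS_RATIO))

print(rocket_rhs(0.0, initial_state(1.0)))
print(neutral_point)
```

Beyond the neutral point the moon's pull dominates and the rocket speeds up again.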

Since that was all so easy, let’s try the two-dimensional case next. Here, the earth is

centered at the origin of the xy-plane, and the moon is moving. Let the coordinates of the

moon at time t be $(x_m(t), y_m(t))$. For example, if we take the orbit of the moon to be a circle of radius D, then we would have $x_m(t) = D\cos(\omega t)$ and $y_m(t) = D\sin(\omega t)$.

If we put the rocket at a generic position (x(t), y(t)) on the way to the moon, then we

have the configuration shown in figure 2.3.

Consider the net force on the rocket in the x direction. It is given by

$$F_x = -\frac{K M_E\, m \cos\theta}{x^2 + y^2} + \frac{K M_M\, m \cos\psi}{(x - x_m)^2 + (y - y_m)^2}, \tag{2.10.20}$$

where the angles θ and ψ are shown in figure 2.3. From that figure, we see that

$$\cos\theta = \frac{x}{\sqrt{x^2 + y^2}} \tag{2.10.21}$$

and

$$\cos\psi = \frac{x_m - x}{\sqrt{(x_m - x)^2 + (y_m - y)^2}}. \tag{2.10.22}$$


[Figure 2.3: The 2D Moon Rocket. The Earth at the origin, the rocket at (x(t), y(t)), the Moon at (x_M(t), y_M(t)), and the angles θ and ψ.]

Now we substitute into (2.10.20), and equate the force in the x direction to $m x''(t)$, to obtain the differential equation

$$m x''(t) = -\frac{K M_E\, m\, x}{(x^2 + y^2)^{3/2}} + \frac{K M_M\, m\, (x_m - x)}{\left((x_m - x)^2 + (y_m - y)^2\right)^{3/2}}. \tag{2.10.23}$$

If we carry out a similar analysis for the y-component of the force on the rocket, we get

$$m y''(t) = -\frac{K M_E\, m\, y}{(x^2 + y^2)^{3/2}} + \frac{K M_M\, m\, (y_m - y)}{\left((x_m - x)^2 + (y_m - y)^2\right)^{3/2}}. \tag{2.10.24}$$

We are now looking at two (even nastier) simultaneous diﬀerential equations of the

second order in the two unknown functions x(t), y(t) that describe the position of the

rocket. To go with these equations, we need four initial conditions. We will suppose that at

time t = 0, the rocket is on the earth’s surface, at the point (R, 0). Further, at time t = 0,

it will be ﬁred with an initial speed of V , in a direction that makes an angle α with the

positive x-axis. Thus, our initial conditions are

$$\begin{cases} x(0) = R; & y(0) = 0 \\ x'(0) = V\cos\alpha; & y'(0) = V\sin\alpha. \end{cases} \tag{2.10.25}$$

The problem has now been completely deﬁned. It remains to change the units into the same

natural dimensions of distance and time that were used in the one-dimensional problem.

This time we leave the details to the reader, and give only the results. If u(τ) and v(τ)

denote the x and y coordinates of the rocket, measured in units of earth radii, at a time τ

measured in units of $T_0$ (see (2.10.10)), then it turns out that u and v satisfy the differential equations

$$\begin{aligned} u'' &= -\frac{u}{(u^2 + v^2)^{3/2}} + \frac{0.012\,(u_m - u)}{\left((u_m - u)^2 + (v_m - v)^2\right)^{3/2}} \\ v'' &= -\frac{v}{(u^2 + v^2)^{3/2}} + \frac{0.012\,(v_m - v)}{\left((u_m - u)^2 + (v_m - v)^2\right)^{3/2}}. \end{aligned} \tag{2.10.26}$$

Furthermore, the initial data (2.10.25) take the form

$$\begin{cases} u(0) = 1; & v(0) = 0 \\ u'(0) = \sqrt{2}\,\dfrac{V\cos\alpha}{V_{\text{esc}}}; & v'(0) = \sqrt{2}\,\dfrac{V\sin\alpha}{V_{\text{esc}}}. \end{cases} \tag{2.10.27}$$


In these equations, the functions $u_m(\tau)$ and $v_m(\tau)$ are the x and y coordinates of the moon, in units of R, at the time $\tau$. Just to be specific, let’s decree that the moon is in a circular orbit of radius 60R, and that it completes a revolution every twenty eight days. Then, after a brief session with a hand calculator or a computer, we discover that the equations

$$\begin{aligned} u_m(\tau) &= 60\cos(0.002103745\,\tau) \\ v_m(\tau) &= 60\sin(0.002103745\,\tau) \end{aligned} \tag{2.10.28}$$

represent the position of the moon.
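The "brief session with a computer" can be sketched as follows. This Python fragment is an illustration under our own naming: it computes the moon's angular speed per unit of scaled time from the 28 day period and a time unit of 13.5 minutes, and then codes the right-hand sides of the two-dimensional system as a first-order system in (u, v, u', v'). The state layout and function names are assumptions, not the authors' program.

```python
import math

T0_MINUTES = 13.5                    # time unit, from the text
PERIOD_MINUTES = 28 * 24 * 60        # one revolution every 28 days
OMEGA = 2 * math.pi * T0_MINUTES / PERIOD_MINUTES  # radians per unit of tau

MASS_RATIO = 0.012
MOON_ORBIT_RADIUS = 60.0             # in earth radii


def moon_position(tau):
    """Coordinates (u_m, v_m) of the moon at scaled time tau."""
    return (MOON_ORBIT_RADIUS * math.cos(OMEGA * tau),
            MOON_ORBIT_RADIUS * math.sin(OMEGA * tau))


def rocket_rhs_2d(tau, state):
    """First-order form of the 2D equations; state = (u, v, u', v')."""
    u, v, du, dv = state
    um, vm = moon_position(tau)
    r_earth_cubed = (u * u + v * v) ** 1.5
    r_moon_cubed = ((um - u) ** 2 + (vm - v) ** 2) ** 1.5
    ddu = -u / r_earth_cubed + MASS_RATIO * (um - u) / r_moon_cubed
    ddv = -v / r_earth_cubed + MASS_RATIO * (vm - v) / r_moon_cubed
    return (du, dv, ddu, ddv)


print(OMEGA)
print(rocket_rhs_2d(0.0, (1.0, 0.0, 0.0, 0.0)))
```

At $\tau = 0$ with the rocket at (1, 0) the moon lies on the positive x-axis, so the x-acceleration agrees with the one-dimensional model and the y-acceleration vanishes.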

2.11 Maple programs for the trapezoidal rule

In this section we will ﬁrst display a complete Maple program that can numerically solve a

system of ordinary diﬀerential equations of the ﬁrst order together with given initial values.

After discussing those programs, we will illustrate their operation by doing the numerical

solution of the one dimensional moon rocket problem.

We will employ Euler’s method to predict the values of the unknowns at the next point

x + h from their values at x, and then we will apply the trapezoidal rule to correct these

predicted values until suﬃcient convergence has occurred.

First, here is the program that does the Euler method prediction.

> eulermethod:=proc(yin,x,h,f)

> local yout,ll,i:

> # Given the array yin of unknowns at x, uses Euler method to return

> # the array of values of the unknowns at x+h. The function f(x,y) is

> # the array-valued right hand side of the given system of ODE’s.

> ll:=nops(yin):

> yout:=[]:

> for i from 1 to ll do

> yout:=[op(yout),yin[i]+h*f(x,yin,i)];

> od:

> RETURN(yout):

> end:

Next, here is the program that takes as input an array of guessed values of the unknowns

at x +h and reﬁnes the guess to convergence using the trapezoidal rule.

> traprule:=proc(yin,x,h,eps,f)

> local ynew,yfirst,ll,toofar,yguess,i,allnear,dist;

> # Input is the array yin of values of the unknowns at x. The program

> # first calls eulermethod to obtain the array ynew of guessed values

> # of y at x+h. It then refines the guess repeatedly, using the trapezoidal

> # rule, until the previous guess, yguess, and the refined guess, ynew, agree

> # within a tolerance of eps in all components. Program then computes dist,

> # which is the largest deviation of any component of the final converged

> # solution from the initial Euler method guess. If dist is too large


> # the mesh size h should be decreased; if too small, h should be increased.

> ynew:=eulermethod(yin,x,h,f);

> yfirst:=ynew;

> ll:=nops(yin);

> allnear:=false;

> while(not allnear) do

> yguess:=ynew;

> ynew:=[];

> for i from 1 to ll do

> ynew:=[op(ynew),yin[i]+(h/2)*(f(x,yin,i)+f(x+h,yguess,i))];

> od;

> allnear:=true;

> for i from 1 to ll do allnear:=allnear and abs(ynew[i]-yguess[i])<eps od:

> od; #end while

> dist:=max(seq(abs(ynew[i]-yfirst[i]),i=1..ll));

> RETURN([dist,ynew]):

> end:

The two programs above each operate at a single point x and seek to compute the

unknowns at the next point x +h. Now we need a global view, that is a program that will

call the above repeatedly and increment the value of x until the end of the desired range

of x. The global routine also needs to check whether or not the mesh size h needs to be

changed at each point and to do so when necessary.

> trapglobal:=proc(f,y0,h0,xinit,xfinal,eps,nprint)

> local x,y,y1,h,j,arr,dst,cnt:

> # Finds solution of the ODE system y’=f(x,y), where y is an array

> # and f is array-valued. y0 is initial data array at x=xinit.

> # Halts when x>xfinal. eps is convergence criterion for

> # trapezoidal rule; Prints every nprint-th value that is computed.

> x:=xinit:y:=y0:arr:=[[x,y[1]]]:h:=h0:cnt:=0:

> while x<=xfinal do

> y1:=traprule(y,x,h,eps,f):

> y:=y1[2]:dst:=y1[1];

> # Is dst too large? If so, halve the mesh size h and repeat.

> while dst>3*eps do

> h:=h/2; lprint(‘At x=‘,x,‘h was reduced to‘,h);

> y1:=traprule(y,x,h,eps,f):

> y:=y1[2]:dst:=y1[1];

> od:

> # Is dst too small? If so, double the mesh size h and repeat.

> while dst<.0001*eps do

> h:=2*h; lprint(‘At x=‘,x,‘h was increased to‘,h);

> y1:=traprule(y,x,h,eps,f):

> y:=y1[2]:dst:=y1[1];

> od:

> # Adjoin newly computed values to the output array arr.

> x:=x+h; arr:=[op(arr),[x,y[2]]]:

> # Decide if we should print this line of output or not.

> cnt:=cnt+1: if cnt mod nprint =0 or x>=xfinal then print(x,y) fi;


> od:

> RETURN(arr);

> end:

The above three programs comprise a general package that can numerically solve systems

of ordinary diﬀerential equations. The applicability of the package is limited mainly by the

fact that Euler’s method and the trapezoidal rule are fairly primitive approximations to

the truth, and therefore one should not expect dazzling accuracy when these routines are

used over long intervals of integration.
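For readers without Maple, here is a compact Python sketch of the same Euler and trapezoidal package. It is not a line-by-line translation, and two design choices are ours: a rejected step is retried from the last accepted values of x and y, and when the distance dist is very small the step is accepted and h is doubled only for the following step. The names trap_step, trap_global, h_min and max_iter are hypothetical, and here f(x, y) returns the whole list of right-hand sides rather than one component at a time.

```python
import math


def trap_step(y, x, h, eps, f, max_iter=100):
    """One step from x to x + h: Euler guess, then trapezoidal iteration.

    Returns (dist, y_new); dist is the largest componentwise distance
    between the converged values and the initial Euler guess.
    """
    f0 = f(x, y)
    first = [yi + h * di for yi, di in zip(y, f0)]
    new = first
    for _ in range(max_iter):
        guess = new
        fg = f(x + h, guess)
        new = [yi + (h / 2) * (d0 + d1) for yi, d0, d1 in zip(y, f0, fg)]
        if all(abs(p - q) < eps for p, q in zip(new, guess)):
            break
    else:
        raise RuntimeError("trapezoidal iteration did not converge")
    dist = max(abs(p - q) for p, q in zip(new, first))
    return dist, new


def trap_global(f, y0, h0, x_init, x_final, eps, h_min=1e-10):
    """Integrate y' = f(x, y) from x_init until x >= x_final."""
    x, y, h = x_init, list(y0), h0
    out = [(x, y)]
    while x < x_final:
        dist, y_trial = trap_step(y, x, h, eps, f)
        if dist > 3 * eps:            # too far from the Euler guess: reject
            if h / 2 < h_min:
                raise RuntimeError("step size underflow")
            h = h / 2
            continue                  # retry from the same accepted (x, y)
        x, y = x + h, y_trial         # accept the step
        out.append((x, y))
        if dist < 1e-4 * eps:         # very close: use a larger step next time
            h = 2 * h
    return out


def cosine_system(x, y):
    """y1' = y2, y2' = -y1, so that y1 = cos x when y(0) = (1, 0)."""
    return [y[1], -y[0]]


solution = trap_global(cosine_system, [1.0, 0.0], math.pi / 100, 0.0, 1.0, 1e-4)
x_last, y_last = solution[-1]
print(x_last, y_last[0], math.cos(x_last))
```

The thresholds 3*eps and 0.0001*eps are the ones used in the Maple listing above.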

2.11.1 Example: Computing the cosine function

We will now give two examples of the operation of the above programs. First let’s compute

the cosine function. We will numerically integrate the equation $y'' + y = 0$ with initial conditions $y(0) = 1$ and $y'(0) = 0$ over the range from x = 0 to x = 2π. To do this we use two unknown functions $y_1$, $y_2$ which are subject to the equations $y_1' = y_2$ and $y_2' = -y_1$, together with initial values $y_1(0) = 1$, $y_2(0) = 0$. Then the function $y_1(x)$ will be the cosine function.

To use the routines above, we need to program the function f = f(x, y) that gives the

right hand sides of the input ODE’s. This is done as follows:

> f:=proc(x,u,j)

> if j=1 then RETURN(u[2]) else RETURN(-u[1]) fi:

> end:

That’s all. To run the programs we type the line

> trapglobal(f,[1,0],.031415927,0,6.3,.0001,50):

This means that we want to solve the system whose right hand sides are as given by f,

with initial condition array [1, 0]. The initial choice of the mesh size h is π/100, in order to

facilitate the comparison of the values that will be output with those of the cosine function

at the same points. The range of x over which we are asking for the solution is from x = 0

to x = 6.3. Our convergence criterion eps is set to .0001, and we are asking the program

to print every 50th line of output that it computes. Maple responds to this call with the

following output.

At x= 0 h was reduced to .1570796350e-1

.7853981750, [ .6845520546, -.7289635418]

1.570796368, [-.0313962454, -.9995064533]

2.356194568, [-.7289550591, -.6845604434]

3.141592768, [-.9995063863, .0313843153]

3.926990968, [-.6845688318, .7289465759]

4.712389168, [ .0313723853, .9995063195]

5.497787368, [ .7289380928, .6845772205]

6.283185568, [ .9995062524, -.0313604551]


We observe that the program ﬁrst decided that the value of h that we gave it was too big

and it was cut in half. Next we see that the accuracy is pretty good. At x = π we have 3

or 4 correct digits, and we still have them at x = 2π. The values of the negative of the sine

function, which are displayed above as the second column of unknowns, are less accurate at

those points. To improve the accuracy we might try reducing the error tolerance eps, but

realistically we will have to confess that the major source of imprecision lies in the Euler

and Trapezoidal combination itself, which, although it provides a good introduction to the

philosophy of these methods, is too crude to yield results of great accuracy over a long range

of integration.

2.11.2 Example: The moon rocket in one dimension

As a second example we will run the moon rocket in one dimension. The equations that

we’re solving now are given by (2.10.19). So all we need to do is to program the right hand

sides f, which we do as follows.

> f:=proc(x,u,j)

> if j=1 then RETURN(u[2]) else RETURN(-1/u[1]^2+.012/(60-u[1])^2) fi:

> end:

Then we invoke our routines by the statement

trapglobal(f,[1,1.4142],.02,0,250,.0001,1000):

This means that we are solving our system with initial data $(1, \sqrt{2})$, with an initial mesh size

of h = .02, integrating over the range of time from t = 0 until t = 250, with a convergence

tolerance eps of .0001, and printing the output every 1000 lines. We inserted an extra

line of program also, so as to halt the calculation as soon as y[1], the distance from earth,

reached 59.75, which is the surface of the moon.

Maple responded with the following output.

At x= 0 h was reduced to .1000000000e-1

10.00000000, [7.911695068, .5027323920]

20.00000000, [12.36222569, .4022135344]

30.00000000, [16.11311118, .3523608113]

40.00000000, [19.46812426, .3206445364]

50.00000000, [22.55572225, .2979916199]

60.00000000, [25.44558364, .2806820525]

70.00000000, [28.18089771, .2668577906]

80.00000000, [30.79081775, .2554698522]

90.00000000, [33.29624630, .2458746369]

100.0000000, [35.71287575, .2376534139]

110.0000000, [38.05293965, .2305225677]

120.0000000, [40.32630042, .2242857188]

130.0000000, [42.54117736, .2188074327]

140.0000000, [44.70467985, .2139997868]


150.0000000, [46.82325670, .2098188816]

160.0000000, [48.90316649, .2062733168]

170.0000000, [50.95113244, .2034559376]

At x= 174.2300000 h was increased to .2000000000e-1

At x= 183.4500000 h was increased to .4000000000e-1

188.0900000, [54.61091112, .2014073390]

At x= 211.7300000 h was reduced to .2000000000e-1

211.8100000, [59.75560109, .3622060968]

So the trip was somewhat eventful. The mesh size was reduced to .01 right away. It was

increased again after 174 time units because at that time the rocket was moving quite

slowly, and again after 183 time units for the same reason. Impact on the surface of the

moon occurred at time 211.81. Since each time unit corresponds to 13.5 minutes, this means

that the whole trip took 211.8 × 13.5 minutes, or 47.65 hours, nearly two days.
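The conversion from scaled time back to hours is a one-line computation. In this small Python check the variable names are ours and the impact time is the value printed in the output above.

```python
T0_MINUTES = 13.5      # length of one time unit, from the text
IMPACT_TIME = 211.81   # scaled time of impact, from the printed output

trip_hours = IMPACT_TIME * T0_MINUTES / 60
trip_days = trip_hours / 24
print(f"{trip_hours:.2f} hours, or {trip_days:.2f} days")
```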

The reader might enjoy experimenting with this situation a little bit. For instance, if

we reduce the initial velocity from 1.4142 by a small amount, then the rocket will reach

some maximum distance from Earth and will fall back to the ground without ever having

reached the moon.

2.12 The big leagues

The three integration rules that we have studied are able to handle small-to-medium sized

problems in reasonable time, and with good accuracy. Some problems, however, are more

demanding than that. Our two-dimensional moon shot is an example of such a situation,

where even a good method like the trapezoidal rule is unable to give the pinpoint accuracy

that is needed. In this section we discuss a general family of methods, of which all three

of the rules that we have studied are examples, that includes methods of arbitrarily high

accuracy. These are the linear multistep methods.

A general linear multistep method is of the form

$$y_{n+1} = \sum_{i=0}^{p} a_{-i}\, y_{n-i} + h \sum_{i=-1}^{p} b_{-i}\, y'_{n-i}. \tag{2.12.1}$$

In order to compute the next value of y, namely $y_{n+1}$, we need to store p + 1 backwards values of y and p + 1 backwards values of $y'$. A total of p + 2 points are involved in the formula, and so we can call (2.12.1) a (p + 2)-point formula.

The trapezoidal rule, for example, arises by taking p = 0, $a_0 = 1$, $b_1 = 1/2$, $b_0 = 1/2$. Euler’s method has p = 0, $a_0 = 1$, $b_1 = 0$, $b_0 = 1$, whereas for the midpoint rule we have p = 1, $a_0 = 0$, $a_{-1} = 1$, $b_1 = 0$, $b_0 = 2$, $b_{-1} = 0$.

We can recognize an explicit, or noniterative, formula by looking at $b_1$. If $b_1$ is nonzero, then $y_{n+1}$ appears implicitly on the right side of (2.12.1) as well as on the left, and the formula does not explicitly tell us the value of $y_{n+1}$. Otherwise, if $b_1 = 0$, the formula is explicit. In either case, if p > 0 the formula will need help getting started or restarted, whereas if p = 0 it is self-starting.


The general linear multistep formula contains 2p + 3 constants

$$a_0, \dots, a_{-p} \quad \text{and} \quad b_1, b_0, b_{-1}, \dots, b_{-p}.$$

These constants are chosen to give the method various properties that we may deem to be

desirable in a particular application. For instance, if we value explicit formulas, then we

may set $b_1 = 0$ immediately, thereby using one of the 2p + 3 free parameters, and leaving 2p + 2 others.

One might think that the remaining parameters should be chosen so as to give the

highest possible accuracy, in some sense. However, for a ﬁxed p, the more accuracy we

demand, the more we come into conﬂict with stability, another highly desirable feature.

Indeed, if we demand “too much accuracy” for a ﬁxed p, it will turn out that no stable

formulas exist. An important theorem of the subject, due to Dahlquist, states roughly that

we cannot use more than about half of the 2p + 3 “degrees of freedom” to achieve high

accuracy if we want to have a stable formula (and we do!).

First let’s discuss the conditions of accuracy. These are usually handled by asking that

the equation (2.12.1) should be exactly true if the unknown function y happens to be a

polynomial of low enough degree. For instance, suppose $y(x) = 1$ for all $x$. Substitute
$y_k = 1$ and $y'_k = 0$ for all $k$ into (2.12.1), and there follows the condition

$$\sum_{i=0}^{p} a_{-i} = 1. \tag{2.12.2}$$

Notice that in all three of the methods we have been studying, the sum of the a’s is

indeed equal to 1.

Now suppose we want our multistep formula to be exact not only when $y(x) = 1$, but
also when $y(x) = x$. We substitute $y_k = kh$, $y'_k = 1$ for all $k$ into (2.12.1) and obtain,
after some simplification and use of (2.12.2),

$$-\sum_{i=0}^{p} i\,a_{-i} + \sum_{i=-1}^{p} b_{-i} = 1. \tag{2.12.3}$$

The reader should check that this condition is also satisﬁed by all three of the methods we

have studied.

In general let’s find the condition for the formula to integrate the function $y(x) = x^r$
and all lower powers of $x$ exactly for some fixed value of $r$. Hence, in (2.12.1), we substitute
$y_k = (kh)^r$ and $y'_k = r(kh)^{r-1}$. After cancelling the factor $h^r$, we get

$$(n+1)^r = \sum_{i=0}^{p} a_{-i}(n-i)^r + r\sum_{i=-1}^{p} b_{-i}(n-i)^{r-1}. \tag{2.12.4}$$

Now clearly we do $x^r$ and all lower powers of $x$ exactly if and only if we do $(x+c)^r$ and
all lower powers of $x$ exactly. The conditions are therefore translation invariant, so we can
choose one special value of $n$ in (2.12.4) if we want.

Let’s choose $n = 0$, because the result simplifies then to

$$(-1)^r = \sum_{i=0}^{p} i^r a_{-i} - r\sum_{i=-1}^{p} i^{r-1} b_{-i}, \qquad r = 0, 1, 2, \dots \tag{2.12.5}$$

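A small technical remark here is that when $r = 1$ and $i = 0$ we see a $0^0$ in the second
sum on the right. This should be interpreted as 1, in accordance with (2.12.3). The same
convention applies to the $0^0$ in the first sum when $r = 0$, where the second sum drops out
because of its factor $r$, so that (2.12.5) reduces to (2.12.2).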

By the order (of accuracy) of a linear multistep method we mean the highest power of x

that the formula integrates exactly, or equivalently, the largest number r for which (2.12.5)

is true (together with its analogues for all numbers between 0 and r).

The accuracy conditions enable us to take the role of designer, and to construct accurate
formulas with desirable characteristics. The reader should verify that the trapezoidal rule is
the most accurate of all possible two-point formulas, and should search for the most accurate
of all possible three-point formulas (ignoring the stability question altogether).

Now we must discuss the question of stability. Just as in section 2.6, stability will be

judged with respect to the performance of our multistep formula on a particular diﬀerential

equation, namely

$$y' = -\frac{y}{L} \qquad (L > 0) \tag{2.12.6}$$

with y(0) = 1.

To see how well the general formula (2.12.1) does with this diﬀerential equation, whose

solution is a falling exponential, substitute $y'_k = -y_k/L$ for all $k$ in (2.12.1), to obtain

$$(1 + \tau b_1)\,y_{n+1} - \sum_{i=0}^{p} (a_{-i} - \tau b_{-i})\,y_{n-i} = 0 \tag{2.12.7}$$

where as in section 2.6, we have written τ = h/L, the ratio of the step size to the relaxation

length of the problem.

Equation (2.12.7) is a linear diﬀerence equation with constant coeﬃcients of order p+1.

If as usual with such equations, we look for solutions of the form $\alpha^n$, then after substitution
and cancellation we obtain the characteristic equation

$$(1 + \tau b_1)\,\alpha^{p+1} - \sum_{i=0}^{p} (a_{-i} - \tau b_{-i})\,\alpha^{p-i} = 0. \tag{2.12.8}$$

This is a polynomial equation of degree p + 1, so it has p + 1 roots somewhere in the

complex plane. These roots depend on the value of τ, that is, on the step size that we use

to do the integration.

We can’t expect that these roots will have absolute value less than 1 however large we

choose the step size h. All we can hope for is that it should be possible, by choosing h small

enough, to get all of these roots to lie inside the unit disk in the complex plane.

Just for openers, let’s see where the roots are when h = 0. In fact, if when h = 0 some

root has absolute value strictly greater than 1, then there is no hope at all for the formula,

because for all suﬃciently small h there will be a root outside the unit disk, and the method

will be unstable.

Now when h = 0, τ = 0 also, so the polynomial equation (2.12.8) becomes simply

$$\alpha^{p+1} - \sum_{i=0}^{p} a_{-i}\,\alpha^{p-i} = 0. \tag{2.12.9}$$

This is also a polynomial equation of degree p + 1. However, its coeﬃcients don’t depend

on the step size, or even on the values of the various b’s in the formula. The coeﬃcients,

and therefore the roots, depend only on the a’s that we use.

Hence, our ﬁrst necessary condition for the stability of the formula (2.12.1) is that the

polynomial equation (2.12.9) should have no roots of absolute value greater than 1. The

reader should now check the three methods that we have been using to see if this condition

is satisﬁed.

Next we have to study what happens to the roots when $h$ is small and positive. Qualitatively,
here’s what happens. One of the roots of (2.12.9) is always $\alpha = 1$, because condition
(2.12.2) is always satisfied in practice, since all it asks is that we correctly integrate the
equation $y' = 0$, which is surely not an excessive request.

The one root of (2.12.9) that is = 1 is our good friend, and we’ll call it the principal

root. As h increases slightly away from 0 the principal root moves slightly away from 1.

Let $\alpha_1(h)$ denote the value of the principal root at some small value of $h$. Then $\alpha_1(0) = 1$.
Now for small positive $h$, it turns out that the principal root tries as hard as it can to be a
good approximation to $e^{-\tau}$. This means that the portion of the solution of the difference
equation (2.12.7) that comes from the one root $\alpha_1(h)$ is

$$\alpha_1(h)^n = (\text{“nearly” } e^{-\tau})^n = \text{“nearly” } e^{-n\tau} = \text{“nearly” } e^{-nh/L}. \tag{2.12.10}$$

But $e^{-nh/L}$ is the exact solution of the differential equation (2.12.6) that we’re trying to

solve. Therefore the principal root is trying to give us a very accurate approximation to

the exact solution.

Well then, what are all of the other p roots of the diﬀerence equation (2.12.7) doing for

us? The answer is that they are trying as hard as they can to make our lives diﬃcult. The

high quality of our approximate solution derives from the nearness of the principal root to
$e^{-\tau}$. This high quality is bought at the price of using a $(p+2)$-point multistep formula, and
this forces the characteristic equation to be of degree $p+1$, and hence the remaining roots
have to be there, too.

People have invented various names for these other, non-principal, roots of the diﬀerence

equations. One that is printable is “parasitic,” so we’ll call them the parasitic roots. We

would be happiest if they would all be zero, but failing that, we would like them to be as

small as possible in absolute value.

A picture of a good multistep method in action, then, with some small value of h, shows

one root of the characteristic equation near 1, more precisely, very near $e^{-\tau}$, and all of the

other roots near the origin.

Now let’s try to make this picture a bit more quantitative. We return to the polynomial
equation (2.12.8), and attempt to find a power series expansion for the principal root $\alpha_1(\tau)$
in powers of $\tau$. The expansion will be in the form

$$\alpha_1(\tau) = 1 + q_1\tau + q_2\tau^2 + \cdots \tag{2.12.11}$$

where the $q$’s are to be determined. If we substitute (2.12.11) into (2.12.8), we get

$$(1 + \tau b_1)(1 + q_1\tau + q_2\tau^2 + \cdots)^{p+1} - \sum_{i=0}^{p} (a_{-i} - \tau b_{-i})(1 + q_1\tau + q_2\tau^2 + \cdots)^{p-i} = 0. \tag{2.12.12}$$

Now we equate the coeﬃcient of each power of τ to zero. First, the coeﬃcient of the

zeroth power of τ is

$$1 - \sum_{i=0}^{p} a_{-i} \tag{2.12.13}$$

and, according to (2.12.2), this is indeed zero if our multistep formula correctly integrates

a constant function.

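Second, the coefficient of $\tau$ is

$$b_1 + (p+1)\,q_1 - \sum_{i=0}^{p} \left[a_{-i}(p-i)\,q_1 - b_{-i}\right] = 0. \tag{2.12.14}$$

By (2.12.2) the terms containing $q_1$ collapse to $q_1\bigl(1 + \sum_{i=0}^{p} i\,a_{-i}\bigr)$, and by (2.12.3)
the $b$’s add up to $1 + \sum_{i=0}^{p} i\,a_{-i}$. Hence, if we use (2.12.2) and (2.12.3), this simplifies
instantly to the simple statement that $q_1 = -1$.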

So far, we have shown by direct calculation that if our multistep formula is of order
at least 1 (i.e., if (2.12.4) holds for $r = 0$ and $r = 1$), then the expansion (2.12.11) of the
principal root agrees with the expansion of $e^{-\tau}$ through terms of first degree in $\tau$.

Much more is true, although we will not prove it here: the expansion of the principal root
agrees with the expansion of $e^{-\tau}$ through terms of degree equal to the order of the multistep
method under consideration. The proof can be done just by continuing the argument that
we began above, but we omit the details.

Thus the careful determination of the $a$’s and the $b$’s so as to make the formula as
accurate as possible results in the principal root being “nearly” $e^{-\tau}$. Equal care must be
taken to assure that the parasitic roots stay small.

We illustrate these ideas with a new multistep formula,

$$y_{n+1} = y_{n-1} + \frac{h}{3}\left(y'_{n+1} + 4y'_n + y'_{n-1}\right). \tag{2.12.15}$$

This, like the midpoint rule, is a three-point method ($p = 1$). It is iterative, because
$b_1 = 1/3 \neq 0$, and it happens to be very accurate. Indeed, we can quickly check that

the accuracy conditions (2.12.5) are satisfied for $r = 0, 1, 2, 3, 4$. The method is of
fourth order, and in fact its error term can be found by the method of section 2.8 to be
$-h^5 y^{(v)}(X)/90$, where $X$ lies between $x_{n-1}$ and $x_{n+1}$.
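
The accuracy conditions (2.12.5) can be checked mechanically. What follows is a minimal sketch in Python using exact rational arithmetic. The function names `satisfies` and `order`, and the layout of the coefficient dictionaries (the key is the summation index $i$, the value is $a_{-i}$ or $b_{-i}$), are choices made here for illustration and are not taken from the text.

```python
from fractions import Fraction as F

def satisfies(a, b, r):
    """Test condition (2.12.5) for one power r.

    a maps i = 0..p to a_{-i}; b maps i = -1..p to b_{-i}.
    """
    def power(base, exp):
        # 0**0 is interpreted as 1, as the text prescribes.
        return F(1) if exp == 0 else F(base) ** exp

    lhs = F(-1) ** r
    rhs = sum(power(i, r) * a_i for i, a_i in a.items())
    if r > 0:  # for r = 0 the second sum is multiplied by zero
        rhs -= r * sum(power(i, r - 1) * b_i for i, b_i in b.items())
    return lhs == rhs

def order(a, b, r_max=12):
    """Largest r such that (2.12.5) holds for 0, 1, ..., r (or -1 if none)."""
    r = 0
    while r <= r_max and satisfies(a, b, r):
        r += 1
    return r - 1

euler       = ({0: 1},       {-1: 0,       0: 1})
trapezoidal = ({0: 1},       {-1: F(1, 2), 0: F(1, 2)})
midpoint    = ({0: 0, 1: 1}, {-1: 0,       0: 2,       1: 0})
formula_15  = ({0: 0, 1: 1}, {-1: F(1, 3), 0: F(4, 3), 1: F(1, 3)})  # (2.12.15)

for name, (a, b) in [("Euler", euler), ("trapezoidal", trapezoidal),
                     ("midpoint", midpoint), ("(2.12.15)", formula_15)]:
    print(name, order(a, b))
```

Under these assumptions the sketch reports order 1 for Euler's method, order 2 for the trapezoidal and midpoint rules, and order 4 for (2.12.15).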

Now we examine the stability of this method. First, when $h = 0$ the equation (2.12.9)
that determines the roots is just $\alpha^2 - 1 = 0$, so the roots are $+1$ and $-1$. The root at $+1$
is the friendly one. As $h$ increases slightly to small positive values, that root will follow the
power series expansion of $e^{-\tau}$ very accurately, in fact, through the first four powers of $\tau$.

The root at −1 is to be regarded with apprehension, because it is poised on the brink of

causing trouble. If as h grows to a small positive value, this root grows in absolute value,

then its powers will dwarf the powers of the principal root in the numerical solution, and

all accuracy will eventually be lost.

To see if this happens, let’s substitute a power series

$$\alpha_2(h) = -1 + q_1\tau + q_2\tau^2 + \cdots \tag{2.12.16}$$

into the characteristic equation (2.12.8), which in the present case is just the quadratic
equation

$$\left(1 + \frac{\tau}{3}\right)\alpha^2 + \frac{4\tau}{3}\,\alpha - \left(1 - \frac{\tau}{3}\right) = 0. \tag{2.12.17}$$

After substituting, we quickly find that $q_1 = -1/3$, and our apprehension was fully
warranted, because for small $\tau$ the root acts like $-1 - \tau/3$, so for all small positive values of $\tau$
this lies outside of the unit disk, so the method will be unstable.
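
The behavior of the two roots of (2.12.17) can also be seen numerically. The following is a small sketch, not part of the original text, that solves the quadratic with the ordinary quadratic formula; the function name `roots_2_12_17` is hypothetical.

```python
import math

def roots_2_12_17(tau):
    """Return (principal, parasitic) roots of
    (1 + tau/3) a^2 + (4 tau/3) a - (1 - tau/3) = 0."""
    A = 1 + tau / 3
    B = 4 * tau / 3
    C = -(1 - tau / 3)
    # The discriminant equals 4 + 4 tau^2 / 3, so it is positive for every tau.
    disc = math.sqrt(B * B - 4 * A * C)
    principal = (-B + disc) / (2 * A)
    parasitic = (-B - disc) / (2 * A)
    return principal, parasitic

for tau in (0.01, 0.05, 0.1):
    principal, parasitic = roots_2_12_17(tau)
    # Compare the principal root with exp(-tau) and the parasitic root
    # with its first-order approximation -1 - tau/3.
    print(tau, principal, math.exp(-tau), parasitic, -1 - tau / 3)
```

For each small positive $\tau$ tried here, the principal root tracks $e^{-\tau}$ closely while the parasitic root has absolute value slightly larger than 1, which is the instability described above.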

In the next section, we are going to describe a family of multistep methods, called the

Adams methods, that are stable, and that oﬀer whatever accuracy one might want, if one

is willing to save enough backwards values of the y’s. First we will develop a very general

tool, the Lagrange interpolation formula, that we’ll need in several parts of the sequel, and

following that we discuss the Adams formulas. The secret weapon of the Adams formulas

is that when h = 0, one of the roots (the friendly one) is as usual sitting at 1, ready to

develop into the exponential series, and all of the unfriendly roots are huddled together at

the origin, just as far out of trouble as they can get.

2.13 Lagrange and Adams formulas

Our next job is to develop formulas that can give us as much accuracy as we want in a

numerical solution of a diﬀerential equation. This means that we want methods in which

the formulas span a number of points, i.e., in which the next value of y is obtained from

several backward values, instead of from just one or two as in the methods that we have

already studied. Furthermore, these methods will need some assistance in getting started,

so we will have to develop matched formulas that will provide them with starting values of

the right accuracy.

All of these jobs can be done with the aid of a formula, due to Lagrange, whose mission

in life is to ﬁt a polynomial to a given set of data points, so let’s begin with a little example.

Problem: Through the three points (1, 17), (3, 10), (7, −5) there passes a unique quadratic

polynomial. Find it.

Solution:

$$P(x) = 17\left[\frac{(x-3)(x-7)}{(1-3)(1-7)}\right] + 10\left[\frac{(x-1)(x-7)}{(3-1)(3-7)}\right] + (-5)\left[\frac{(x-1)(x-3)}{(7-1)(7-3)}\right]. \tag{2.13.1}$$

Let’s check that this is really the right answer. First of all, (2.13.1) is a quadratic
polynomial, since each term is. Does it pass through the three given points? When $x = 1$,
the second and third terms vanish, because of the factor $x - 1$, and the first term has the
value 17, as is obvious from the cancellation that takes place. Similarly when $x = 3$, the
first and third terms are zero and the second is 10, etc.
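
The same check can be carried out by a few lines of code. This is an illustrative sketch only; the function name `lagrange` is hypothetical, and exact fractions are used so that the comparison with the data is not disturbed by rounding.

```python
from fractions import Fraction

def lagrange(points, x):
    """Evaluate at x the polynomial through the given (x_i, y_i) points."""
    total = Fraction(0)
    for i, (xi, yi) in enumerate(points):
        term = Fraction(yi)
        for j, (xj, _) in enumerate(points):
            if j != i:
                # One factor (x - x_j)/(x_i - x_j) for every j other than i.
                term *= Fraction(x - xj) / Fraction(xi - xj)
        total += term
    return total

points = [(1, 17), (3, 10), (7, -5)]
for xi, yi in points:
    print(xi, lagrange(points, xi), yi)   # the two values agree at each data point
```

The function works for any number of data points with distinct abscissas, since it simply builds the sum of products one factor at a time.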

Now we can jump from this little example all the way to the general formula. If we want
to find the polynomial of degree $n - 1$ that passes through $n$ given data points

$$(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n),$$

then all we have to do is to write it out:

$$P(x) = \sum_{i=1}^{n} y_i \left\{ \prod_{\substack{j=1 \\ j \neq i}}^{n} \frac{x - x_j}{x_i - x_j} \right\}. \tag{2.13.2}$$

This is the Lagrange interpolation formula. The $i$th term of the sum above contains the product
of $n - 1$ factors $(x - x_j)/(x_i - x_j)$, namely all those factors except for the one in which
$j = i$.

Next, consider the special case of the above formula in which the points $x_1, x_2, \dots, x_n$
are chosen to be equally spaced. To be specific, we might as well suppose that $x_j = jh$ for
$j = 1, 2, \dots, n$.

If we substitute in (2.13.2), we get

$$P(x) = \sum_{i=1}^{n} y_i \left\{ \prod_{\substack{j=1 \\ j \neq i}}^{n} \frac{x - jh}{(i - j)h} \right\}. \tag{2.13.3}$$

Consider the product of the denominators above. It is

$$(i-1)(i-2)\cdots(1)(-1)(-2)\cdots(i-n)\,h^{n-1} = (-1)^{n-i}\,(i-1)!\,(n-i)!\,h^{n-1}. \tag{2.13.4}$$

Finally we replace $x$ by $xh$ and substitute in (2.13.3) to obtain

$$P(xh) = \sum_{i=1}^{n} \frac{(-1)^{n-i}}{(i-1)!\,(n-i)!}\, y_i \left\{ \prod_{\substack{j=1 \\ j \neq i}}^{n} (x - j) \right\} \tag{2.13.5}$$

and that is about as simple as we can get it.

Now we’ll use this result to obtain a collection of excellent methods for solving diﬀerential

equations, the so-called Adams formulas. Begin with the obvious fact that

$$y((p+1)h) = y(ph) + h\int_{p}^{p+1} y'(ht)\,dt. \tag{2.13.6}$$

Instead of integrating the exact function $y'$ in this formula, we will integrate the
polynomial that agrees with $y'$ at a number of given data points. First, we replace $y'$ by the
Lagrange interpolating polynomial that agrees with it at the $p+1$ points $h, 2h, \dots, (p+1)h$.
This transforms (2.13.6) into

$$y_{p+1} = y_p + h\sum_{i=1}^{p+1} \frac{(-1)^{p-i+1}}{(i-1)!\,(p-i+1)!}\, y'(ih) \int_{p}^{p+1} \left\{ \prod_{\substack{j=1 \\ j \neq i}}^{p+1} (x - j) \right\} dx. \tag{2.13.7}$$

We can rewrite this in the more customary form of a linear multistep method:

$$y_{n+1} = y_n + h\sum_{i=-1}^{p-1} b_{-i}\, y'_{n-i}. \tag{2.13.8}$$

This involves replacing $i$ by $p - i$ in (2.13.7), so the numbers $b_{-i}$ are given by

$$b_{-i} = \frac{(-1)^{i+1}}{(p-1-i)!\,(i+1)!} \int_{p}^{p+1} \left\{ \prod_{\substack{j=1 \\ j \neq p-i}}^{p+1} (x - j) \right\} dx, \qquad i = -1, 0, \dots, p-1. \tag{2.13.9}$$

Now to choose a formula in this family, all we have to do is specify $p$. For instance, let’s
take $p = 2$. Then we find

$$\begin{aligned}
b_1 &= \frac{1}{2}\int_2^3 (x-1)(x-2)\,dx = \frac{5}{12} \\
b_0 &= -\int_2^3 (x-1)(x-3)\,dx = \frac{2}{3} \\
b_{-1} &= \frac{1}{2}\int_2^3 (x-2)(x-3)\,dx = -\frac{1}{12}
\end{aligned} \tag{2.13.10}$$

so we have the third-order implicit Adams method

$$y_{n+1} = y_n + \frac{h}{12}\left(5y'_{n+1} + 8y'_n - y'_{n-1}\right). \tag{2.13.11}$$
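
Formula (2.13.9) lends itself to direct computation. The sketch below, which is an illustration rather than anything prescribed by the text, evaluates the integrals exactly by expanding each product into a polynomial with rational coefficients. The helper names `times_linear`, `integrate` and `implicit_adams` are hypothetical.

```python
from fractions import Fraction
from math import factorial

def times_linear(coeffs, j):
    """Multiply a polynomial (coeffs[k] is the coefficient of x**k) by (x - j)."""
    out = [Fraction(0)] * (len(coeffs) + 1)
    for k, c in enumerate(coeffs):
        out[k + 1] += c
        out[k] -= j * c
    return out

def integrate(coeffs, lo, hi):
    """Exact integral of the polynomial from lo to hi."""
    return sum(c * (Fraction(hi) ** (k + 1) - Fraction(lo) ** (k + 1)) / (k + 1)
               for k, c in enumerate(coeffs))

def implicit_adams(p):
    """Coefficients of (2.13.8) from (2.13.9), as a dict {1: b_1, 0: b_0, -1: b_{-1}, ...}."""
    b = {}
    for i in range(-1, p):
        poly = [Fraction(1)]
        for j in range(1, p + 2):
            if j != p - i:              # omit the factor with j = p - i
                poly = times_linear(poly, j)
        sign = (-1) ** (i + 1)
        b[-i] = sign * integrate(poly, p, p + 1) / (factorial(p - 1 - i) * factorial(i + 1))
    return b

print(implicit_adams(2))   # the coefficients 5/12, 2/3, -1/12 of (2.13.10)
```

Calling `implicit_adams(p)` for other values of `p` produces the coefficients of the other members of the family in the same way.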

If this process is repeated for various values of p we get the whole collection of these

formulas. For reference, we show the ﬁrst four of them below, complete with their error

terms.

$$y_{n+1} = y_n + h\,y'_{n+1} - \frac{h^2}{2}\,y'' \tag{2.13.12}$$

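$$y_{n+1} = y_n + \frac{h}{2}\left(y'_{n+1} + y'_n\right) - \frac{h^3}{12}\,y''' \tag{2.13.13}$$

$$y_{n+1} = y_n + \frac{h}{12}\left(5y'_{n+1} + 8y'_n - y'_{n-1}\right) - \frac{h^4}{24}\,y^{(iv)} \tag{2.13.14}$$

$$y_{n+1} = y_n + \frac{h}{24}\left(9y'_{n+1} + 19y'_n - 5y'_{n-1} + y'_{n-2}\right) - \frac{19h^5}{720}\,y^{(v)} \tag{2.13.15}$$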

If we compare these formulas with the general linear multistep method (2.12.1) then we
quickly find the reason for the stability of these formulas. Notice that only one backwards
value of $y$ itself is used, namely $y_n$. The other backwards values are all of the derivatives.
Now look at equation (2.12.9) again. It is the polynomial equation that determines the
roots of the characteristic equation of the method, in the limiting case where the step size
is zero.

If we use a formula with all the $a$’s equal to 0, except that $a_0 = 1$, then that polynomial
equation becomes simply

$$\alpha^{p+1} - \alpha^{p} = 0. \tag{2.13.16}$$

The roots of this are 1, 0, 0,. . . ,0. The root at 1, as h grows to a small positive value, is

the one that gives us the accuracy of the computed solution. The other roots all start at

the origin, so when h is small they too will be small, instead of trying to cause trouble for

us. All of the Adams formulas are stable for this reason.

In Table 6 we show the behavior of the three roots of the characteristic equation (2.12.8)

as it applies to the fourth-order method (2.13.15). It happens that the roots are all real in

this case. Notice that the friendly root is the one of largest absolute value for all τ ≤ 0.9.

  τ     e^{-τ}     Friendly(τ)   Root2(τ)    Root3(τ)
 0.0   1.000000    1.000000     0.000000    0.000000
 0.1   0.904837    0.904837     0.058536   -0.075823
 0.2   0.818731    0.818723     0.081048   -0.116824
 0.3   0.740818    0.740758     0.098550   -0.153914
 0.4   0.670320    0.670068     0.113948   -0.189813
 0.5   0.606531    0.605769     0.128457   -0.225454
 0.6   0.548812    0.546918     0.142857   -0.261204
 0.7   0.496585    0.492437     0.157869   -0.297171
 0.8   0.449329    0.440927     0.174458   -0.333333
 0.9   0.406570    0.390073     0.194475   -0.369595
 1.0   0.367879    0.333333     0.224009   -0.405827

Table 6
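
Entries of such a table can be spot-checked by substituting them back into the characteristic equation (2.12.8). The sketch below is an illustration added here, with the hypothetical function name `char_poly`; it uses the coefficients of (2.13.15), namely $p = 2$, $a_0 = 1$, $b_1 = 9/24$, $b_0 = 19/24$, $b_{-1} = -5/24$, $b_{-2} = 1/24$, and only two of the tabulated rows.

```python
def char_poly(alpha, tau):
    """Left side of (2.12.8) for the fourth-order method (2.13.15)."""
    return ((1 + 9 * tau / 24) * alpha ** 3
            - (1 - 19 * tau / 24) * alpha ** 2
            - (5 * tau / 24) * alpha
            + tau / 24)

# Two rows of the table: tau -> (friendly root, root 2, root 3)
rows = {
    0.8: (0.440927, 0.174458, -0.333333),
    1.0: (0.333333, 0.224009, -0.405827),
}

for tau, roots in rows.items():
    residuals = [char_poly(alpha, tau) for alpha in roots]
    # Each residual should be tiny, limited by the six-digit rounding of the table.
    print(tau, residuals)
```

A residual that is small compared with the rounding of the table indicates that the tabulated value is indeed a root for that $\tau$.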

Now we have found the implicit Adams formulas. In each of them the unknown $y_{n+1}$
appears on both sides of the equation. They are, therefore, useful as corrector formulas.

To find matching predictor formulas is also straightforward. We return to (2.13.6), and
for a predictor method of order $p + 1$ we replace $y'$ by the interpolating polynomial that

agrees with it at the data points $0, h, 2h, 3h, \dots, ph$ (but not at $(p+1)h$ as before). Then,
in place of (2.13.7) we get

$$y_{p+1} = y_p + h\sum_{i=0}^{p} \frac{(-1)^{p-i}}{i!\,(p-i)!}\, y'(ih) \int_{p}^{p+1} \left\{ \prod_{\substack{j=0 \\ j \neq i}}^{p} (x - j) \right\} dx. \tag{2.13.17}$$

As before, we can write this in the more familiar form

$$y_{n+1} = y_n + h\sum_{i=0}^{p} b_{-i}\, y'_{n-i} \tag{2.13.18}$$

where the numbers $b_{-i}$ are now given by

$$b_{-i} = \frac{(-1)^{i}}{(p-i)!\,i!} \int_{p}^{p+1} \left\{ \prod_{\substack{j=0 \\ j \neq p-i}}^{p} (x - j) \right\} dx, \qquad i = 0, 1, \dots, p. \tag{2.13.19}$$

We tabulate below these formulas in the cases $p = 1, 2, 3$, together with their error
terms:

$$y_{n+1} = y_n + \frac{h}{2}\left(3y'_n - y'_{n-1}\right) + \frac{5h^3}{12}\,y''' \tag{2.13.20}$$

$$y_{n+1} = y_n + \frac{h}{12}\left(23y'_n - 16y'_{n-1} + 5y'_{n-2}\right) + \frac{3h^4}{8}\,y^{(iv)} \tag{2.13.21}$$

$$y_{n+1} = y_n + \frac{h}{24}\left(55y'_n - 59y'_{n-1} + 37y'_{n-2} - 9y'_{n-3}\right) + \frac{251h^5}{720}\,y^{(v)} \tag{2.13.22}$$

Notice, for example, that the explicit fourth-order formula (2.13.22) has about 13 times
as large an error term as the implicit fourth-order formula (2.13.15). This is typical for
matched pairs of predictor-corrector formulas. As we have noted previously, a single
application of a corrector formula reduces the error by a factor of about $h\,\partial f/\partial y$, so if we
keep $h$ small enough so that $h\,\partial f/\partial y$ is less than 1/13 or thereabouts, then a single
application of the corrector formula
will produce an estimate of the next value of $y$ with full attainable accuracy.
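
To make the pairing concrete, here is a minimal sketch of one predictor step with (2.13.22) followed by a single corrector pass with (2.13.15). It is an illustration under stated assumptions, not the text's program: the function name `adams_pece` is hypothetical, the test problem is $y' = -y$ with $y(0) = 1$, and the four starting values are simply taken from the exact solution instead of being produced by starting formulas.

```python
import math

def adams_pece(f, x0, start, h, n_steps):
    """Advance y' = f(x, y) by n_steps, given the starting values y_0..y_3.

    Each step predicts with the explicit formula (2.13.22) and then applies
    the implicit formula (2.13.15) once, using the predicted value on the right.
    """
    ys = list(start)
    fs = [f(x0 + k * h, y) for k, y in enumerate(ys)]
    for _ in range(n_steps):
        n = len(ys) - 1
        x_next = x0 + (n + 1) * h
        predicted = ys[n] + h / 24 * (55 * fs[n] - 59 * fs[n - 1]
                                      + 37 * fs[n - 2] - 9 * fs[n - 3])
        corrected = ys[n] + h / 24 * (9 * f(x_next, predicted) + 19 * fs[n]
                                      - 5 * fs[n - 1] + fs[n - 2])
        ys.append(corrected)
        fs.append(f(x_next, corrected))
    return ys

h = 0.1
start = [math.exp(-k * h) for k in range(4)]   # assumption: exact starting values
ys = adams_pece(lambda x, y: -y, 0.0, start, h, 7)
print(ys[-1], math.exp(-1.0))                  # computed and exact y(1)
```

Only two evaluations of `f` are made per step, one for the predicted value and one for the corrected value, which is the economy that makes such pairs attractive.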

We are now fully equipped with Adams formulas for prediction and correction, in

matched pairs, of whatever accuracy is needed, all of them stable. The use of these pairs

requires special starting formulas, since multistep methods cannot get themselves started

or restarted without assistance. Once again the Lagrange interpolation formula comes to

the rescue.

This time we begin with a slight variation of (2.13.6),

$$y(mh) = y(0) + \int_{0}^{mh} y'(t)\,dt. \tag{2.13.23}$$

Next, replace $y'(t)$ by the Lagrange polynomial that agrees with it at $0, h, 2h, \dots, ph$ (for
$p \geq m$). We then obtain

$$y_m = y_0 + h\sum_{i=0}^{p} \frac{(-1)^{p-i}}{i!\,(p-i)!}\, y'_i \int_{0}^{m} \left\{ \prod_{\substack{j=0 \\ j \neq i}}^{p} (t - j) \right\} dt, \qquad m = 1, 2, \dots, p. \tag{2.13.24}$$

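We can rewrite these equations in the form

$$y_m = y_0 + h\sum_{j=0}^{p} \lambda_j\, y'_j, \qquad m = 1, 2, \dots, p \tag{2.13.25}$$

where the coefficients $\lambda_i$ are given by

$$\lambda_i = \frac{(-1)^{p-i}}{i!\,(p-i)!} \int_{0}^{m} \left\{ \prod_{\substack{j=0 \\ j \neq i}}^{p} (t - j) \right\} dt, \qquad i = 0, \dots, p. \tag{2.13.26}$$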

Of course, when these formulas are used on a differential equation $y' = f(x, y)$, each
of the values of $y'$ on the right side of (2.13.25) is replaced by an $f(x, y)$ value. Therefore
equations (2.13.25) are a set of $p$ simultaneous equations in $p$ unknowns $y_1, y_2, \dots, y_p$ ($y_0$
is, of course, known). We can solve them with an iteration in which we first guess all $p$ of
the unknowns, and then refine the guesses all at once by using (2.13.25) to give us the new
guesses from the old, until sufficient convergence has occurred.

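For example, if we take $p = 3$, then the starting formulas are

$$\begin{aligned}
y_1 &= y_0 + \frac{h}{24}\left(9y'_0 + 19y'_1 - 5y'_2 + y'_3\right) - \frac{19h^5}{720}\,y^{(v)} \\
y_2 &= y_0 + \frac{h}{3}\left(y'_0 + 4y'_1 + y'_2\right) - \frac{h^5}{90}\,y^{(v)} \\
y_3 &= y_0 + \frac{3h}{8}\left(y'_0 + 3y'_1 + 3y'_2 + y'_3\right) - \frac{3h^5}{80}\,y^{(v)}.
\end{aligned} \tag{2.13.27}$$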
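
The coefficients in these starting formulas can be generated from (2.13.26) by exact arithmetic. The following sketch is illustrative only; the function name `starting_coeffs` is hypothetical, and the error terms are not computed.

```python
from fractions import Fraction
from math import factorial

def starting_coeffs(p, m):
    """Return [lambda_0, ..., lambda_p] of (2.13.26) for the formula giving y_m."""
    lam = []
    for i in range(p + 1):
        poly = [Fraction(1)]                 # poly[k] is the coefficient of t**k
        for j in range(p + 1):
            if j != i:
                # Multiply poly by (t - j): new[k] = poly[k-1] - j * poly[k].
                poly = [(poly[k - 1] if k > 0 else 0)
                        - j * (poly[k] if k < len(poly) else 0)
                        for k in range(len(poly) + 1)]
        # Exact integral of poly from 0 to m.
        integral = sum(c * Fraction(m) ** (k + 1) / (k + 1)
                       for k, c in enumerate(poly))
        lam.append((-1) ** (p - i) * integral / (factorial(i) * factorial(p - i)))
    return lam

for m in (1, 2, 3):
    print(m, starting_coeffs(3, m))
```

For $p = 3$ the three lists reproduce the multipliers of $h$ in the three starting formulas, with a zero coefficient on $y'_3$ in the formula for $y_2$.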

The philosophy of using a matched pair of Adams formulas for propagation of the solution,
together with the starter formulas shown above, has the potential of being implemented in a
computer program in which the user could specify the desired precision by giving the value
of $p$. The program could then calculate from the formulas above the coefficients in the
predictor-corrector pair and the starting method, and then proceed with the integration.
This would make a very versatile code.

Chapter 3

Numerical linear algebra

3.1 Vector spaces and linear mappings

In this chapter we will study numerical methods for the solution of problems in linear

algebra, that is to say, of problems that involve matrices or systems of linear equations.

Among these we mention the following:

(a) Given an n × n matrix, calculate its determinant.

(b) Given m linear algebraic equations in n unknowns, find the most general solution of
the system, or discover that it has none.

(c) Invert an n × n matrix, if possible.

(d) Find the eigenvalues and eigenvectors of an n × n matrix.

As usual, we will be very much concerned with the development of eﬃcient software

that will accomplish the above purposes.

We assume that the reader is familiar with the basic constructs of linear algebra: vector
space, linear dependence and independence of vectors, Euclidean n-dimensional space,
spanning sets of vectors, basis vectors. We will quickly review some additional concepts
that will be helpful in our work. For a more complete discussion of linear algebra, see any
of the references cited at the end of this chapter.

And now, to business. The ﬁrst major concept we need is that of a linear mapping.

Let V and W be two vector spaces over the real numbers (we’ll stick to the real numbers

unless otherwise speciﬁed). We say that T is a linear mapping from V to W if T associates

with every vector v of V a vector Tv of W (so T is a mapping) in such a way that

$$T(\alpha v' + \beta v'') = \alpha Tv' + \beta Tv'' \tag{3.1.1}$$

for all vectors $v'$, $v''$ of V and real numbers (or scalars) α and β (i.e., T is linear). Notice

that the “+” signs are diﬀerent on the two sides of (3.1.1). On the left we add two vectors

of V , on the right we add two vectors of W.

Here are a few examples of linear mappings.

First, let V and W both be the same, namely the space of all polynomials of some given
degree n. Consider the mapping that associates with a polynomial f of V its derivative
$Tf = f'$ in W. It’s easy to check that this mapping is linear.

Second, suppose V is Euclidean two-dimensional space (the plane) and W is Euclidean

three-dimensional space. Let T be the mapping that carries the vector (x, y) of V to the

vector (3x +2y, x −y, 4x +5y) of W. For instance, T(2, −1) = (4, 3, 3). Then T is a linear

mapping.

More generally, let A be a given m × n matrix of real numbers, let V be Euclidean

n-dimensional space and let W be Euclidean m-space. The mapping T that carries a vector

x of V into Ax of W is a linear mapping. That is, any matrix generates a linear mapping

between two appropriately chosen (to match the dimensions of the matrix) vector spaces.

The importance of studying linear mappings in general, and not just matrices, comes

from the fact that a particular mapping can be represented by many diﬀerent matrices.

Further, it often happens that problems in linear algebra that seem to be questions about

matrices, are in fact questions about linear mappings. This means that we can change to

a simpler matrix that represents the same linear mapping before answering the question,

secure in the knowledge that the answer will be the same. For example, if we are given a

square matrix and we want its determinant, we seem to confront a problem about matrices.

In fact, any of the matrices that represent the same mapping will have the same determinant

as the given one, and making this kind of observation and identifying simple representatives

of the class of relevant matrices can be quite helpful.

To get back to the matter at hand, suppose the vector spaces V and W are of dimensions
m and n, respectively. Then we can choose in V a basis of m vectors, say $e_1, e_2, e_3, \dots, e_m$,
and in W there is a basis of n vectors $f_1, f_2, \dots, f_n$. Let T be a linear mapping from V to
W. Then we have the situation that is sketched in figure 3.1 below.

We claim now that the action of T on every vector of V is known if we know only its
effect on the m basis vectors of V. Indeed, suppose we know $Te_1, Te_2, \dots, Te_m$. Then let
x be any vector in V. Express x in terms of the basis of V,

$$x = \alpha_1 e_1 + \alpha_2 e_2 + \cdots + \alpha_m e_m. \tag{3.1.2}$$

Now apply T to both sides and use the linearity of T (extended, by induction, to linear
combinations of more than two vectors) to obtain

$$Tx = \alpha_1 (Te_1) + \alpha_2 (Te_2) + \cdots + \alpha_m (Te_m). \tag{3.1.3}$$

The right side is known, and the claim is established.

So, to describe a linear mapping, “all we have to do” is describe its action on a set of
basis vectors of V. If $e_i$ is one of these, then $Te_i$ is a vector in W. As such, $Te_i$ can be
written as a linear combination of the basis vectors of W. The coefficients of this linear
combination will evidently depend on i, so we write

$$Te_i = \sum_{j=1}^{n} t_{ji}\, f_j, \qquad i = 1, \dots, m. \tag{3.1.4}$$

[Figure 3.1: The action of a linear mapping]

Now the mn numbers $t_{ji}$, $i = 1, \dots, m$, $j = 1, \dots, n$, together with the given sets of basis
vectors for V and W, are enough to describe the linear operator T completely. Indeed, if
we know all of those numbers, then by (3.1.4) we know what T does to every basis vector
of V, and then by (3.1.3) we know the action of T on every vector of V.

To summarize, an n × m matrix $t_{ji}$ represents a linear mapping T from a vector space V
with a distinguished basis $E = \{e_1, e_2, \dots, e_m\}$, to a vector space W with a distinguished
basis $F = \{f_1, f_2, \dots, f_n\}$, in the sense that from a knowledge of $(t, E, F)$ we know the full
mapping T.
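
As a small illustration of (3.1.3) and (3.1.4), the sketch below builds the matrix of the mapping $T(x, y) = (3x + 2y,\ x - y,\ 4x + 5y)$ used as an example earlier, taking the standard bases of the plane and of three-dimensional space. The names `T`, `E`, `t` and `apply` are choices made here for illustration.

```python
def T(v):
    """The linear mapping from the plane to 3-space used as an example in the text."""
    x, y = v
    return (3 * x + 2 * y, x - y, 4 * x + 5 * y)

E = [(1, 0), (0, 1)]             # standard basis e_1, e_2 of V
images = [T(e) for e in E]       # T e_i, written in the standard basis f_1, f_2, f_3 of W

# t[j][i] is the coefficient of f_j in T e_i, as in (3.1.4): an n-by-m array.
t = [[images[i][j] for i in range(len(E))] for j in range(3)]

def apply(t, alphas):
    """Compute T x from the coordinates alphas of x, as in (3.1.3)."""
    return tuple(sum(t[j][i] * alphas[i] for i in range(len(alphas)))
                 for j in range(len(t)))

print(t)
print(apply(t, (2, -1)), T((2, -1)))   # both give the image of (2, -1)
```

Only the images of the two basis vectors were needed to build `t`, after which the mapping can be applied to any vector of the plane.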

Next, suppose once more that T is a linear mapping from V to W. Since T is linear,

it is easy to see that T carries the 0 vector of V into the 0 vector of W. Consider the set

of all vectors of V that are mapped by T into the zero vector of W. This set is called the

kernel of T, and is written ker(T). Thus

$$\ker(T) = \{x \in V \mid Tx = 0_W\}. \tag{3.1.5}$$

Now ker(T) is not just a set of vectors, it is itself a vector space, that is a vector subspace

of V . Indeed, one sees immediately that if x and y belong to ker(T) and α and β are scalars,

then

T(αx +βy) = αT(x) +βT(y) = 0, (3.1.6)

so αx +βy belongs to ker(T) also.

Since ker(T) is a vector space, we can speak of its dimension. If ν = dim ker(T), then ν
is called the nullity of the mapping T.

Consider also the set of vectors w of W that are of the form w = Tv, for some vector

v ∈ V (possibly many such v’s exist). This set is called the image of T, and is written

$$\operatorname{im}(T) = \{w \in W \mid w = Tv,\ v \in V\}. \tag{3.1.7}$$

Once again, we remark that im(T) is more than just a set of vectors, it is in fact a vector
subspace of W, since if $w'$ and $w''$ are both in im(T), and if α and β are scalars, then we
have $w' = Tv'$ and $w'' = Tv''$ for some $v'$, $v''$ in V. Hence

$$\alpha w' + \beta w'' = \alpha Tv' + \beta Tv'' = T(\alpha v' + \beta v'') \tag{3.1.8}$$

so $\alpha w' + \beta w''$ lies in im(T), too.

The dimension of the vector (sub)space im(T) is called the rank of the mapping T.

A celebrated theorem of Sylvester asserts that

rank(T) + nullity(T) = dim(V ). (3.1.9)

By the rank of a matrix A we mean any of the following:

(a) the maximum number of linearly independent rows that we can ﬁnd in A

(b) same for columns

(c) the largest number r for which there is an r × r nonzero sub-determinant in A (i.e., a
set of r rows and r columns, not necessarily consecutive, such that the r × r submatrix
that they determine has a nonzero determinant).

It is true that the rank of a linear mapping T from V to W is equal to the rank of any

matrix A that represents T with respect to some pair of bases of V and W.

It is also true that the rank of a matrix is not computable unless infinite-precision
arithmetic is used. In fact, the 2 × 2 matrix

$$\begin{pmatrix} 1 & 1 \\ 1 & 1 + 10^{-20} \end{pmatrix} \tag{3.1.10}$$

has rank 2, but if the [2,2]-entry is changed to 1, the rank becomes 1. No computer program

will be able to tell the diﬀerence between these two situations unless it is doing arithmetic

to at least 21 digits of precision. Therefore, unless our programs do exact arithmetic on

rational numbers, or do ﬁnite ﬁeld arithmetic, or whatever, the rank will be uncomputable.

What we are saying, really, is just that the rank of a matrix is not a continuous function of

the matrix entries.
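
The point can be seen in a few lines of code. This sketch, added for illustration, compares the determinant of the matrix (3.1.10) computed in ordinary double-precision floating point with the same determinant computed in exact rational arithmetic; the helper name `det2` is hypothetical.

```python
from fractions import Fraction

def det2(m):
    """Determinant of a 2-by-2 matrix given as a list of rows."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

exact = [[Fraction(1), Fraction(1)],
         [Fraction(1), 1 + Fraction(1, 10 ** 20)]]
floating = [[1.0, 1.0],
            [1.0, 1.0 + 1e-20]]     # 1.0 + 1e-20 rounds to 1.0 in double precision

print(det2(exact) != 0)      # exact arithmetic sees a nonzero determinant: rank 2
print(det2(floating) == 0)   # floating point sees a zero determinant: apparent rank 1
```

Double precision carries roughly 16 significant decimal digits, which is why the perturbation of size $10^{-20}$ disappears before the determinant is ever formed.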

Now we are ready to consider one of the most important problems of numerical linear

algebra: the solution of simultaneous linear equations.

Let A be a given m × n matrix, let b be a given column vector of length m, and consider
the system

$$Ax = b \tag{3.1.11}$$

of m linear simultaneous equations in n unknowns $x_1, x_2, \dots, x_n$.

Consider the set of all solution vectors x of (3.1.11). Is it a vector space? That’s right,

it isn’t, unless b happens to be the zero vector (why?).

Suppose then that we intend to write a computer program that will in some sense present

as output all solutions x of (3.1.11). What might the output look like?

If b = 0, i.e., if the system is homogeneous, there is no diﬃculty, for in that case the

solution set is ker(A), a vector space, and we can describe it by printing out a set of basis

vectors of ker(A).

If the right-hand side vector b is not 0, then consider any two solutions $x'$ and $x''$ of
(3.1.11). Then $Ax' = b$, $Ax'' = b$, and by subtraction, $A(x' - x'') = 0$. Hence $x' - x''$ belongs
to ker(A), so if $e_1, e_2, \dots, e_\nu$ are a basis for ker(A), then
$x' - x'' = \alpha_1 e_1 + \alpha_2 e_2 + \cdots + \alpha_\nu e_\nu$.

If b ≠ 0, then, we can describe all possible solutions of (3.1.11) by printing out one
particular solution $x'$, and a list of the basis vectors of ker(A), because then all solutions
are of the form

$$x = x' + \alpha_1 e_1 + \alpha_2 e_2 + \cdots + \alpha_\nu e_\nu. \tag{3.1.12}$$

Therefore, a computer program that alleges that it solves (3.1.11) should print out a

basis for the kernel of A together with, in case $b \neq 0$, any one particular solution of the

system. We will see how to accomplish this in the next section.

Exercises 3.1

1. Show by examples that for every n, the rank of a given $n \times n$ matrix A is a discontinuous

function of the entries of A.

2. Consider the vector space V of all polynomials in x of degree at most 2. Let T be the

linear mapping that sends each polynomial to its derivative.

(a) What is the rank of T?

(b) What is the image of T?

(c) For the basis $\{1, x, x^2\}$ of V, find the $3 \times 3$ matrix that represents T.

(d) Regard T as a mapping from V to the space W of polynomials of degree 1. Use

the basis given in part (c) for V, and the basis $\{1, x - 1\}$ for W, and find the $2 \times 3$ matrix that represents T with respect to these bases.

(e) Check that the ranks of the matrices in (c) and (d) are equal.

3. Let T be a linear mapping of Euclidean 3-dimensional space to itself. Suppose T takes

the vector (1, 1, 1) to (1, 2, 3), and T takes (1, 0, −1) to (2, 0, 1) and T takes (3, −1, 0)

to (1, 1, 2). Find T(1, 2, 4).

4. Let A be an $n \times n$ matrix with integer entries having absolute values at most M.

What is the maximum number of binary digits that could be needed to represent all

of the elements of A?

5. If T acts on the vector space of polynomials of degree at most n according to $T(f) = f' - 3f$, find ker(T), im(T), and rank(T).

6. Why isn’t the solution set of (3.1.11) a vector space if $b \neq 0$?


7. Let $a_{ij} = r_i s_j$ for $i, j = 1, \ldots, n$. Show that the rank of A is at most 1.

8. Suppose that the matrix

$$A = \begin{pmatrix} 0 & 1 & 1 \\ 1 & 0 & -1 \\ -2 & 1 & 1 \end{pmatrix} \tag{3.1.13}$$

represents a certain linear mapping from V to V with respect to the basis

$$\{(1, 0, 0),\ (0, 1, 0),\ (1, 1, 1)\} \tag{3.1.14}$$

of V. Find the matrix that represents the same mapping with respect to the basis

$$\{(0, 0, 1),\ (0, 1, 1),\ (1, 1, 1)\}. \tag{3.1.15}$$

Check that the determinant is unchanged.

9. Find a system of linear equations whose solution set consists of the vector (1, 2, 0)

plus any linear combination of (−1, 0, 1) and (0, 1, 0).

10. Construct two sets of two equations in two unknowns such that

(a) their coefficient matrices differ by at most $10^{-12}$ in each entry, and

(b) their solutions differ by at least $10^{+12}$ in each entry.

3.2 Linear systems

The method that we will use for the computer solution of m linear equations in n unknowns

will be a natural extension of the familiar process of Gaussian elimination. Let’s begin with

a little example, say of the following set of two equations in three unknowns:

$$\begin{aligned} x + y + z &= 2 \\ x - y - z &= 5. \end{aligned} \tag{3.2.1}$$

If we subtract the second equation from the ﬁrst, then the two equations can be written

$$\begin{aligned} x + y + z &= 2 \\ 2y + 2z &= -3. \end{aligned} \tag{3.2.2}$$

We divide the second equation by 2, and then subtract it from the ﬁrst, getting

$$\begin{aligned} x &= \tfrac{7}{2} \\ y + z &= -\tfrac{3}{2}. \end{aligned} \tag{3.2.3}$$

The value of z can now be chosen arbitrarily, and then x and y will be determined. To

make this more explicit, we can rewrite the solution in the form

$$\begin{pmatrix} x \\ y \\ z \end{pmatrix} = \begin{pmatrix} \tfrac{7}{2} \\ -\tfrac{3}{2} \\ 0 \end{pmatrix} + z \begin{pmatrix} 0 \\ -1 \\ 1 \end{pmatrix}. \tag{3.2.4}$$


In (3.2.4) we see clearly that the general solution is the sum of a particular solution $(7/2, -3/2, 0)$, plus any multiple of the basis vector $(0, -1, 1)$ for the kernel of the coefficient matrix of (3.2.1).
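A claim of this kind is easy to check mechanically. Here is a minimal sketch in Python (chosen only for illustration) that uses exact fractions. The helper names `matvec` and `general_solution` are hypothetical. It verifies that the particular solution plus any multiple of the kernel vector satisfies (3.2.1), for a few arbitrarily chosen values of z.

```python
from fractions import Fraction

A = [[1, 1, 1],
     [1, -1, -1]]          # coefficient matrix of (3.2.1)
b = [2, 5]                 # right-hand side of (3.2.1)
particular = [Fraction(7, 2), Fraction(-3, 2), Fraction(0)]
kernel_vec = [0, -1, 1]

def matvec(M, v):
    """Product of the matrix M (a list of rows) with the vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def general_solution(z):
    """The solution (3.2.4) for a given value of the free unknown z."""
    return [p + z * k for p, k in zip(particular, kernel_vec)]

# The kernel vector is sent to zero, so every choice of z gives a solution.
print(matvec(A, kernel_vec))
for z in (-2, 0, Fraction(1, 3), 10):
    print(z, matvec(A, general_solution(z)) == b)
```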

The calculation can be compactiﬁed by writing the numbers in a matrix and omitting

the names of the unknowns. A vertical line in the matrix will separate the left sides and

the right sides of the equations. Thus, the original system (3.2.1) is

$$\left(\begin{array}{ccc|c} 1 & 1 & 1 & 2 \\ 1 & -1 & -1 & 5 \end{array}\right). \tag{3.2.5}$$

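Now we do $\overrightarrow{\mathrm{row}}(2) := \overrightarrow{\mathrm{row}}(1) - \overrightarrow{\mathrm{row}}(2)$ and we have

$$\left(\begin{array}{ccc|c} 1 & 1 & 1 & 2 \\ 0 & 2 & 2 & -3 \end{array}\right). \tag{3.2.6}$$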

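Next $\overrightarrow{\mathrm{row}}(2) := \overrightarrow{\mathrm{row}}(2)/2$, and then $\overrightarrow{\mathrm{row}}(1) := \overrightarrow{\mathrm{row}}(1) - \overrightarrow{\mathrm{row}}(2)$ bring us to the final form

$$\left(\begin{array}{ccc|c} 1 & 0 & 0 & \tfrac{7}{2} \\ 0 & 1 & 1 & -\tfrac{3}{2} \end{array}\right) \tag{3.2.7}$$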

which is the matrix equivalent of (3.2.3).
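The three row operations just described can be replayed in a few lines. This is a sketch only, in Python with exact fractions, and the variable names `M` and `step_326` are ours. Rows are stored as lists whose last entry is the right-hand side, and note that Python numbers the rows from 0.

```python
from fractions import Fraction

# The matrix (3.2.5); the last column holds the right-hand sides.
M = [[Fraction(v) for v in row] for row in [[1, 1, 1, 2],
                                            [1, -1, -1, 5]]]

# row(2) := row(1) - row(2)
M[1] = [a - b for a, b in zip(M[0], M[1])]
step_326 = [row[:] for row in M]        # a copy, to compare with (3.2.6)

# row(2) := row(2) / 2
M[1] = [v / 2 for v in M[1]]

# row(1) := row(1) - row(2)
M[0] = [a - b for a, b in zip(M[0], M[1])]

for row in M:
    print([str(v) for v in row])        # the rows of (3.2.7)
```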

Now we will step up to a slightly larger example, to see some of the situations that

can arise. We won’t actually write the numerical values of the coeﬃcients, but we’ll use

asterisks instead. So consider the three equations in ﬁve unknowns shown below.

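$$\left(\begin{array}{ccccc|c} \ast & \ast & \ast & \ast & \ast & \ast \\ \ast & \ast & \ast & \ast & \ast & \ast \\ \ast & \ast & \ast & \ast & \ast & \ast \end{array}\right). \tag{3.2.8}$$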

The ﬁrst step is to create a 1 in the extreme upper left corner, by dividing the ﬁrst row

through by the [1,1] element. We will assume for the moment that the various numbers

that we want to divide by are not zero. Later on we will take extensive measures to assure

this.

After we divide the ﬁrst row by the [1,1] element, we use the 1 in the upper left-hand

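corner to zero out the entries below it in column 1. That is, we let $t = a_{21}$ and then do

$$\overrightarrow{\mathrm{row}}(2) := \overrightarrow{\mathrm{row}}(2) - t * \overrightarrow{\mathrm{row}}(1). \tag{3.2.9}$$

Then let $t = a_{31}$ and do

$$\overrightarrow{\mathrm{row}}(3) := \overrightarrow{\mathrm{row}}(3) - t * \overrightarrow{\mathrm{row}}(1). \tag{3.2.10}$$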

The result is that we now have the matrix

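$$\left(\begin{array}{ccccc|c} 1 & \ast & \ast & \ast & \ast & \ast \\ 0 & \ast & \ast & \ast & \ast & \ast \\ 0 & \ast & \ast & \ast & \ast & \ast \end{array}\right). \tag{3.2.11}$$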


We pause for a moment to consider what we’ve done, in terms of the original set of

simultaneous equations. First, to divide a row of the matrix by a number corresponds

to dividing an equation through by the same number. Evidently this does not change

the solutions of the system of equations. Next, to add a constant multiple of one row to

another in the matrix corresponds to adding a multiple of one equation to another, and this

also doesn’t aﬀect the solutions. Finally, in terms of the linear mapping that the matrix

represents, what we are doing is changing the sets of basis vectors, keeping the mapping

ﬁxed, in such a way that the matrix that represents the mapping becomes a bit more

acceptable to our taste.

Now in (3.2.11) we divide through the second row by $a_{22}$ (again blissfully assuming that $a_{22}$ is not zero) to create a 1 in the [2,2] position. Then we use that 1 to create zeroes (just one zero in this case) below it in the second column by letting $t = a_{32}$ and doing

$$\overrightarrow{\mathrm{row}}(3) := \overrightarrow{\mathrm{row}}(3) - t * \overrightarrow{\mathrm{row}}(2). \tag{3.2.12}$$

Finally, we divide the third row by the [3,3] element to obtain

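$$\left(\begin{array}{ccccc|c} 1 & \ast & \ast & \ast & \ast & \ast \\ 0 & 1 & \ast & \ast & \ast & \ast \\ 0 & 0 & 1 & \ast & \ast & \ast \end{array}\right). \tag{3.2.13}$$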

This is the end of the so-called forward solution, the ﬁrst phase of the process of obtaining

the general solution.

Again, let’s think about the system of equations that is represented here. What is

special about them is that the ﬁrst unknown does not appear in the second equation, and

neither the ﬁrst nor the second unknown appears in the third equation.

To finish the solution of such a system of equations we would use the third equation to express $x_3$ in terms of $x_4$ and $x_5$, then the second equation would give us $x_2$ in terms of $x_4$ and $x_5$, and finally the first equation would yield $x_1$, also expressed in terms of $x_4$ and $x_5$. Hence, we would say that $x_4$ and $x_5$ are free, and that the others are determined by them. More precisely, we should say that the kernel of A is two-dimensional, i.e., it has a basis consisting of two vectors.

Let’s see how all of this will look if we were to operate directly on the matrix (3.2.13).

The second phase of the solution, which we are now beginning, is called the backwards substitution, because we start with the last equation and work backwards.

First we use the 1 in the [3,3] position to create zeros in the third column above that 1.

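To do this we let $t = a_{23}$ and then we do

$$\overrightarrow{\mathrm{row}}(2) := \overrightarrow{\mathrm{row}}(2) - t * \overrightarrow{\mathrm{row}}(3). \tag{3.2.14}$$

Then we let $t = a_{13}$ and set

$$\overrightarrow{\mathrm{row}}(1) := \overrightarrow{\mathrm{row}}(1) - t * \overrightarrow{\mathrm{row}}(3) \tag{3.2.15}$$

resulting in

$$\left(\begin{array}{ccccc|c} 1 & \ast & 0 & \ast & \ast & \ast \\ 0 & 1 & 0 & \ast & \ast & \ast \\ 0 & 0 & 1 & \ast & \ast & \ast \end{array}\right). \tag{3.2.16}$$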


Observe that now $x_3$ does not appear in the equations before it. Next we use the 1 in the [2,2] position to create a zero in column 2 above that 1 by letting $t = a_{12}$ and

$$\overrightarrow{\mathrm{row}}(1) := \overrightarrow{\mathrm{row}}(1) - t * \overrightarrow{\mathrm{row}}(2). \tag{3.2.17}$$

Of course, none of our previously constructed zeros gets wrecked by this process, and we

have arrived at the reduced echelon form of the original system of equations

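$$\left(\begin{array}{ccccc|c} 1 & 0 & 0 & \ast & \ast & \ast \\ 0 & 1 & 0 & \ast & \ast & \ast \\ 0 & 0 & 1 & \ast & \ast & \ast \end{array}\right). \tag{3.2.18}$$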

We need to be more careful about the next step, so it’s time to use numbers instead of

asterisks. For instance, suppose that we have now arrived at

$$\left(\begin{array}{ccccc|c} 1 & 0 & 0 & a_{14} & a_{15} & a_{16} \\ 0 & 1 & 0 & a_{24} & a_{25} & a_{26} \\ 0 & 0 & 1 & a_{34} & a_{35} & a_{36} \end{array}\right). \tag{3.2.19}$$

Each of the unknowns is expressible in terms of $x_4$ and $x_5$:

$$\begin{aligned}
x_1 &= a_{16} - a_{14} x_4 - a_{15} x_5 \\
x_2 &= a_{26} - a_{24} x_4 - a_{25} x_5 \\
x_3 &= a_{36} - a_{34} x_4 - a_{35} x_5 \\
x_4 &= x_4 \\
x_5 &= x_5.
\end{aligned} \tag{3.2.20}$$

This means that we have found the general solution of the given system by ﬁnding a

particular solution and a pair of basis vectors for the kernel of A. They are, respectively,

the vector and the two columns of the matrix shown below:

$$\begin{pmatrix} a_{16} \\ a_{26} \\ a_{36} \\ 0 \\ 0 \end{pmatrix}, \qquad \begin{pmatrix} a_{14} & a_{15} \\ a_{24} & a_{25} \\ a_{34} & a_{35} \\ -1 & 0 \\ 0 & -1 \end{pmatrix}. \tag{3.2.21}$$

As this shows, we ﬁnd a particular solution by ﬁlling in extra zeros in the last column

until its length matches the number (ﬁve in this case) of unknowns. We ﬁnd a basis matrix

(i.e., a matrix whose columns are a basis for the kernel of A) by extending the fourth and

ﬁfth columns of the reduced row echelon form of A with −I, where I is the identity matrix

whose size is equal to the nullity of the system, in this case 2.
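The recipe of the last paragraph is short enough to write out. What follows is a hedged sketch in Python, not the program developed later in this chapter. The function name `solution_from_rref` and the numerical entries are hypothetical, and we assume a consistent system already in the form (3.2.19), with no column interchanges to undo.

```python
def solution_from_rref(R, n):
    """Read off a particular solution and a kernel basis.

    R is a k x (n+1) matrix of the form [I | F | c] as in (3.2.19):
    a k x k identity, then n - k further columns, then the right side.
    """
    k = len(R)
    nullity = n - k
    # Particular solution: the last column, padded with zeros to length n.
    particular = [row[n] for row in R] + [0] * nullity
    # Basis vectors: each remaining column of R, extended by a column of -I.
    basis = []
    for j in range(nullity):
        col = [row[k + j] for row in R]
        col += [-1 if t == j else 0 for t in range(nullity)]
        basis.append(col)
    return particular, basis

# Hypothetical numbers standing in for a14, ..., a36 in (3.2.19).
R = [[1, 0, 0, 2, 3, 4],
     [0, 1, 0, 5, 6, 7],
     [0, 0, 1, 8, 9, 10]]
particular, basis = solution_from_rref(R, 5)
print(particular)
print(basis)
```

Each vector in `basis` is one column of the matrix in (3.2.21). Multiplying the first five columns of R by any of them gives zero, which is the check one should always make.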

It’s time to deal with the case where one of the ∗’s that we divide by is actually a zero.

In fact we will have to discuss rather carefully what we mean by zero. In numerical work on

computers, in the presence of rounding errors, it is unreasonable to expect a 0 to be exactly

zero. Instead we will set a certain threshold level, and numbers that are smaller than that

will be declared to be zero. The hard part will be the determination of the right threshold,


but let’s postpone that question for a while, and make the convention that in this and the

following sections, when we speak of matrix entries being zero, we will mean that their size

is below our current tolerance level.

With that understanding, suppose we are carrying out the row reduction of a certain

system, and we’ve arrived at a stage like this:

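$$\left(\begin{array}{ccccc|c}
1 & \ast & \ast & \ast & \ast & \ast \\
0 & 1 & \ast & \ast & \ast & \ast \\
0 & 0 & 0 & \ast & \ast & \ast \\
0 & 0 & \ast & \ast & \ast & \ast \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots
\end{array}\right). \tag{3.2.22}$$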

Normally, the next step would be to divide by $a_{33}$, but it is zero. This means that $x_3$ happens not to appear in the third equation. However, $x_3$ might appear in some later equation.

If so, we can renumber the equations so that later equation becomes the third equation,

and continue the process. In the matrix, this means that we would exchange two rows, so

as to bring a nonzero entry into the [3,3] position, and continue.

It is possible, though, that $x_3$ does not appear in any later equation. Then all entries $a_{i3} = 0$ for $i \geq 3$. Then we could ask for some other unknown $x_j$ for $j > 3$ that does appear in the third equation or in some later one. In the matrix, this amounts to searching through the whole rectangle that lies “southeast” of the [3,3] position, extending over to, but not beyond, the vertical line, to find a nonzero entry, if there is one.

If $a_{ij}$ is such a nonzero entry, then we want next to bring $a_{ij}$ into the pivot position [3,3]. We can do this in two steps. First we exchange rows 3 and i (interchange two equations). Second, exchange columns 3 and j (interchange the numbers of the unknowns, so that $x_3$ becomes $x_j$ and $x_j$ becomes $x_3$). We must remember somehow that we renumbered the unknowns, so we’ll be able to recognize the answers when we see them. The calculation can now proceed from the rejuvenated pivot element in the [3,3] position.

Else, it may happen that the rectangle southeast of [3,3] consists entirely of zeros, like

this:

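$$\left(\begin{array}{ccccc|c}
1 & \ast & \ast & \ast & \ast & \ast \\
0 & 1 & \ast & \ast & \ast & \ast \\
0 & 0 & 0 & 0 & 0 & \ast \\
0 & 0 & 0 & 0 & 0 & \ast \\
0 & 0 & 0 & 0 & 0 & \ast \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots
\end{array}\right). \tag{3.2.23}$$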

What then? We’re ﬁnished with the forward solution. The equations from the third onwards

have only zeros on their left-hand sides. If any solutions at all exist, then those equations

had better have only zeros on their right-hand sides also. The last three ∗’s in the last

column (and all the entries below them) must all be zeros, or the calculation halts with the

announcement that the input system was inconsistent (i.e., has no solutions). If the system

is consistent, then of course we ignore the ﬁnal rows of zeros, and we do the backwards

solution just as in the preceding case.


It follows that in all cases, whether there are more equations than unknowns, fewer, or

the same number, the backwards solution always begins with a matrix that has a diagonal

of 1’s stretching from top to bottom (the bottom may have moved up, though!), with only

zero entries below the 1’s on the diagonal.

Speaking in theoretical, rather than practical terms for a moment, the number of nonzero

rows in the coeﬃcient matrix at the end of the forward solution phase is the rank of the

matrix. Speaking practically again, this number simply represents a number of rows beyond

which we cannot do any more reductions because the matrix entries are all indistinguishable

from zero. Perhaps a good name for it is pseudorank. The pseudorank should be thought

of then, not as some interesting property of the matrix that we have just computed, but as

the number of rows we were able to reduce before roundoﬀ errors became overwhelming.

Exercises 3.2

In problems 1–5, solve each system of equations by transforming the coeﬃcient matrix

to reduced echelon form. To “solve” a system means either to show that no solution exists

or to ﬁnd all possible solutions. In the latter case, exhibit a particular solution and a basis

for the kernel of the coeﬃcient matrix.

1.

2x − y + z = 6

3x + y + 2z = 3

7x − y + 4z = 15

8x + y + 5z = 12

(3.2.24)

2.

x + y + z + q = 0

x − y − z − q = 0

(3.2.25)

3.

x + y + z = 3

3x − y − 2z = −1

5x + y = 7

(3.2.26)

4.

3x + u + v + w + t = 1

x − u + 2v + w − 3t = 2

(3.2.27)

5.

x + 3y − z = 4

2x − y + 2z = 6

x + y + z = 6

3x − y − z = −2

(3.2.28)

6. Construct a set of four equations in four unknowns, of rank two.

7. Construct a set of four equations in three unknowns, with a unique solution.

8. Construct a system of homogeneous equations that has a three-dimensional vector

space of solutions.


9. Under precisely what conditions is the set of all solutions of the system Ax = b a

vector space?

10. Given a set of m vectors in n-space, describe an algorithm that will extract a maximal

subset of linearly independent vectors.

11. Given a set of m vectors, and one more vector w, describe an algorithm that will

decide whether or not w is in the vector subspace spanned by the given set.

3.3 Building blocks for the linear equation solver

Let’s now try to amplify some of the points raised by the informal discussion of the procedure

for solving linear equations, with a view towards the development of a formal algorithm.

First, let’s deal with the fact that a diagonal element might be zero (in the fuzzy sense

deﬁned in the previous section) at the time when we want to divide by it.

Consider the moment during the forward solution when we have made $i - 1$ 1’s down the diagonal, all the entries below the 1’s are zeros, and we next want to put a 1 into the [i, i] position and use that 1 as a pivot to reduce the entries below it to zeros.

Previously we had said that this could be done by dividing through the ith row by the

[i, i] element, unless that element is zero, in which case we carry out a search for some

nonzero element in the rectangle that lies southeast of [i, i] in the matrix. After careful

analysis, it turns out that an even more conservative approach is better: the best procedure

consists in searching through the entire southeast rectangle, whether or not the [i, i] element

is zero, to ﬁnd the largest matrix element in absolute value.

If this complete search is done every time, whether or not the [i, i] element is zero, then

it develops that the sizes of the matrix elements do not grow very much as the algorithm

unfolds, and the growth of the numerical errors is also kept to a minimum.

If that largest element found in the rectangle is, say, $a_{uv}$, then we bring $a_{uv}$ into the pivot position [i, i] by interchanging rows u and i (renumbering the equations) and interchanging columns v and i (renumbering the unknowns). Then we proceed as before.

It may seem wasteful to make a policy of carrying out a complete search of the rectangle whenever we are ready to find the next pivot element, especially when a nonzero element already occupies the pivot position, but it turns out that the extra labor is well rewarded with better numerical accuracy and stability.
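A tiny experiment shows what is at stake. The sketch below is in Python and assumes ordinary double-precision floats. The function `solve2` and the test system are hypothetical illustrations of ours, and for brevity the function only interchanges the two rows instead of searching a whole rectangle. It solves the same pair of equations twice, once dividing by a very small pivot and once after bringing a large entry into the pivot position.

```python
def solve2(a11, a12, a21, a22, b1, b2, swap_rows=False):
    """Solve a 2 x 2 system by elimination in floating point.

    With swap_rows=True the two equations are interchanged first, which
    for the system below puts the larger entry of column 1 in the pivot
    position.
    """
    if swap_rows:
        a11, a12, b1, a21, a22, b2 = a21, a22, b2, a11, a12, b1
    t = a21 / a11              # multiplier that eliminates x1 from row 2
    a22 = a22 - t * a12
    b2 = b2 - t * b1
    x2 = b2 / a22              # back substitution
    x1 = (b1 - a12 * x2) / a11
    return x1, x2

# Test system: 1e-20*x1 + x2 = 1 and x1 + x2 = 2.
# Its true solution is extremely close to x1 = 1, x2 = 1.
tiny_pivot = solve2(1e-20, 1.0, 1.0, 1.0, 1.0, 2.0)
large_pivot = solve2(1e-20, 1.0, 1.0, 1.0, 1.0, 2.0, swap_rows=True)
print(tiny_pivot)     # x1 comes out as 0.0, which is badly wrong
print(large_pivot)    # both unknowns come out as 1.0
```

Dividing by the tiny pivot creates entries of size about $10^{20}$, and the original data in the second equation is then lost to rounding. With the large pivot nothing grows, and the answer is correct to full working precision.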

If we are solving m equations in n unknowns, then we need to carry along an extra array of length n. Let’s call it $\tau_j$, $j = 1, \ldots, n$. This array will keep a record of the column interchanges that we do as we do them, so that in the end we will be able to identify the output. Initially, we put $\tau_j = j$ for $j = 1, \ldots, n$. If at a certain moment we are about to interchange, say, the pth column and the qth column, then we will also interchange the entries $\tau_p$ and $\tau_q$. At all times then, $\tau_j$ will hold the number of the column where the current jth column really belongs.
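The bookkeeping can be sketched as follows, in Python. The function name `swap_columns` is hypothetical, and Python numbers columns from 0, so here the array starts out as $\tau_j = j$ for $j = 0, \ldots, n-1$.

```python
def swap_columns(C, tau, p, q):
    """Interchange columns p and q of C, and the matching entries of tau."""
    for row in C:
        row[p], row[q] = row[q], row[p]
    tau[p], tau[q] = tau[q], tau[p]

C = [[10, 20, 30],
     [40, 50, 60]]
original = [row[:] for row in C]     # kept only so that we can check tau
tau = [0, 1, 2]

swap_columns(C, tau, 0, 2)
swap_columns(C, tau, 1, 2)

print(C)      # the columns have been shuffled
print(tau)    # tau[j] names the original column now sitting in position j
for j in range(3):
    print([row[j] for row in C] == [row[tau[j]] for row in original])
```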


It must be noted that there is a fundamental diﬀerence between the interchange of rows

and the interchange of columns. An interchange of two rows corresponds simply to listing

the equations that we are trying to solve in a slightly diﬀerent sequence, but has no eﬀect on

the solutions. On the other hand, an interchange of two columns amounts to renumbering

two of the unknowns. Hence we must keep track of the column interchanges while we are

doing them, so we’ll be able to tell which unknown is which at output time, but we don’t

need to record row interchanges.

At the end of the calculation then, the output arrays will have to be shuﬄed. The reader

might want to think about how to carry out that rearrangement, and we will return to it

in section 3.6 under the heading of “to unscramble the eggs”.

The next item to consider is that we would like our program to be able to solve not

just one system Ax = b, but several systems of simultaneous equations, each of the form

Ax = b, where the left-hand sides are all the same, but the right-hand sides are diﬀerent.

The data for our program will therefore be an m by n+p matrix whose ﬁrst n columns will

contain the coeﬃcient matrix A and whose last p columns will be the p diﬀerent right-hand

side vectors b.

Why are we allowing several diﬀerent right sides? Some of the main customers for our

program will be matrices A whose inverses we want to calculate. To ﬁnd, say, the ﬁrst

column of the inverse of A we want to solve Ax = b, where b is the ﬁrst column of the

identity matrix. For the second column of the inverse, b would be the second column of

the identity matrix, and so on. Hence, to find $A^{-1}$, if A is an $n \times n$ matrix, we must solve

n systems of simultaneous equations each having the same left-hand side A, but with n

diﬀerent right-hand side vectors.

It is convenient to solve all n of these systems at once because the reduction that we

apply to A itself to bring it into reduced echelon form is useful in solving all n of these

systems, and we avoid having to repeat that part of the job n times. Thus, for matrix

inversion, and for other purposes too, it is very handy to have the capability of solving

several systems with a common left-hand side at once.
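To see the economy of carrying all the right-hand sides along together, here is a simplified sketch in Python with exact fractions. It is not the program of this chapter. The name `solve_many` is hypothetical, the matrix is assumed square and nonsingular, the only interchange is a row swap to avoid a zero pivot, and the forward and backward phases are merged into one pass for brevity.

```python
from fractions import Fraction

def solve_many(A, B):
    """Reduce [A | B] to [I | X] and return X, so that A X = B.

    A is n x n and assumed nonsingular; B holds one right side per column.
    """
    n = len(A)
    M = [[Fraction(v) for v in A[i]] + [Fraction(v) for v in B[i]]
         for i in range(n)]
    for i in range(n):
        # Find a row at or below row i with a nonzero entry in column i.
        r = next(k for k in range(i, n) if M[k][i] != 0)
        M[i], M[r] = M[r], M[i]
        piv = M[i][i]
        M[i] = [v / piv for v in M[i]]          # create the 1 on the diagonal
        for k in range(n):
            if k != i:                          # clear the rest of column i
                t = M[k][i]
                M[k] = [a - t * b for a, b in zip(M[k], M[i])]
    return [row[n:] for row in M]

A = [[2, 1],
     [5, 3]]
identity = [[1, 0],
            [0, 1]]
A_inv = solve_many(A, identity)     # the right sides are the columns of I
print([[str(v) for v in row] for row in A_inv])
```

Each row operation is applied once to the whole row, right-hand sides included, so the work of reducing A is shared by all of the systems.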

The next point concerns the linear array τ that we are going to use to keep track of the

column interchanges. Instead of storing it in its own private array, it’s easier to adjoin it

to the matrix A that we’re working on, as an extra row, for then when we interchange two

columns we will automatically interchange the corresponding elements of τ, and thereby

avoid separate programming.

This means that the full matrix that we will be working with in our program will be $(m + 1) \times (n + p)$ if we are solving p systems of m equations in n unknowns, one for each of the p right-hand sides. In the program itself, let’s call this matrix C. So C will be thought of as being partitioned into blocks of sizes as shown below:

$$C = \begin{pmatrix} A : m \times n & \mathrm{RHS} : m \times p \\ \tau : 1 \times n & 0 : 1 \times p \end{pmatrix}. \tag{3.3.1}$$

Now a good way to begin the writing of a program such as the general-purpose matrix


analysis program that we now have in mind is to consider the diﬀerent procedures, or

modules, into which it may be broken up. We suggest that the individual blocks that we

are about to discuss should be written as separate subroutines, each with its own clearly

deﬁned input and output, each with its own documentation, and each with its own local

variable names. They should then be tested one at a time, by giving them small, suitable

test problems. If this is done, then the main routine won’t be much more than a string of

calls to the various blocks.

1. Procedure searchmat(C,r,s,i1,j1,i2,j2)

This routine is given an $r \times s$ array C, and two positions in the matrix, say $[i_1, j_1]$ and $[i_2, j_2]$. It then carries out a search of the rectangular submatrix of C whose northwest corner is at $[i_1, j_1]$ and whose southeast corner is at $[i_2, j_2]$, inclusive, in order to find an element of largest absolute value that lives in that rectangle. The subroutine returns this element of largest magnitude, as big, and the row and column in which it lives, as iloc, jloc.

Subroutine searchmat will be called in at least two diﬀerent places in the main routine.

First, it will do the search for the next pivot element in the southeast rectangle. Second, it

can be used to determine if the equations are consistent by searching the right-hand sides

of equations r + 1, . . . , m (r is the pseudorank) to see if they are all zero (i.e., below our

tolerance level).
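One possible rendering of searchmat, given as a sketch in Python and not as the definitive routine, follows. Two choices here are our own assumptions: the dimensions r and s are omitted because a Python list knows its own size, and `big` is returned as an absolute value. Positions are numbered from 1, as in the text.

```python
def searchmat(C, i1, j1, i2, j2):
    """Search rows i1..i2 and columns j1..j2 of C (1-based, inclusive).

    Returns (big, iloc, jloc): the largest absolute value found in the
    rectangle, and the row and column where it lives.
    """
    big, iloc, jloc = 0, i1, j1
    for i in range(i1, i2 + 1):
        for j in range(j1, j2 + 1):
            if abs(C[i - 1][j - 1]) > big:
                big, iloc, jloc = abs(C[i - 1][j - 1]), i, j
    return big, iloc, jloc

C = [[1, -2, 3],
     [4, -9, 6],
     [7, 8, 5]]
print(searchmat(C, 2, 2, 3, 3))   # searches the southeast 2 x 2 rectangle
print(searchmat(C, 3, 1, 3, 3))   # searches the last row only
```

For the consistency test, one would call the same routine on the right-hand side columns of the rows beyond the pseudorank and compare the returned `big` with the tolerance.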

2. Procedure switchrow(C,r,s,i,j,k,l)

The program is given an $r \times s$ matrix C, and four integers i, j, k and l. The subroutine

interchanges rows i and j of C, between columns k and l inclusive, and returns a variable

called sign with a value of −1, unless i = j, in which case it does nothing to C and returns

a +1 in sign.

3. Procedure switchcol(C,r,s,i,j,k,l)

This subroutine is like the previous one, except it interchanges columns i and j of C,

between rows k and l inclusive. It also returns a variable called sign with a value of −1,

unless i = j, in which case it does nothing to C and returns a +1 in sign.

The subroutines switchrow and switchcol are used during the forward solution in the obvious way, and again after the back solution has been done, to unscramble the output (see procedure 6 below).
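The two interchange routines might look like this in Python. This is a sketch under our own assumptions: r and s are omitted because Python lists carry their sizes, indices are 1-based as in the text, and the sign is given back as the function's return value.

```python
def switchrow(C, i, j, k, l):
    """Interchange rows i and j of C in columns k..l; return the sign."""
    if i == j:
        return 1                      # nothing to do, sign unchanged
    for q in range(k - 1, l):
        C[i - 1][q], C[j - 1][q] = C[j - 1][q], C[i - 1][q]
    return -1

def switchcol(C, i, j, k, l):
    """Interchange columns i and j of C in rows k..l; return the sign."""
    if i == j:
        return 1
    for q in range(k - 1, l):
        C[q][i - 1], C[q][j - 1] = C[q][j - 1], C[q][i - 1]
    return -1

C = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
s1 = switchrow(C, 1, 3, 2, 3)   # rows 1 and 3, but only in columns 2..3
s2 = switchcol(C, 1, 2, 1, 2)   # columns 1 and 2, but only in rows 1..2
s3 = switchrow(C, 2, 2, 1, 3)   # i = j, so C is left alone
print(C, s1, s2, s3)
```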

4. Procedure pivot(C,r,s,i,k,u)

Given the $r \times s$ matrix C, and three integers i, k and u, the subroutine assumes that $C_{ii} = 1$. It then stores $C_{ki}$ in the local variable t, sets $C_{ki}$ to zero, and reduces row k of C, in columns u to s, by doing the operation $C_{kq} := C_{kq} - t * C_{iq}$ for $q = u, \ldots, s$.

The use of the parameter u in this subroutine allows the flexibility for economical operation in both the forward and back solution. In the forward solution, we take $u = i + 1$

and it reduces the whole row k. In the back solution we use u = n + 1, because the rest of

row k will have already been reduced.

5. Procedure scale(C,r,s,i,u)


Given an $r \times s$ matrix C and integers i and u, the routine stores $C_{ii}$ in a variable called piv. It then does $C_{ij} := C_{ij}/\mathrm{piv}$ for $j = u, \ldots, s$ and returns the value of piv.
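Procedures 4 and 5 are the arithmetic heart of the method, so a sketch of both may be helpful. It is written in Python with 1-based indices as in the text. As an assumption of ours, r and s are not passed, since s can be read off the matrix, and exact fractions are used so that the printed result is easy to check.

```python
from fractions import Fraction

def scale(C, i, u):
    """Divide row i of C by its diagonal entry, in columns u..s."""
    s = len(C[0])
    piv = C[i - 1][i - 1]             # saved before the row is changed
    for j in range(u, s + 1):
        C[i - 1][j - 1] = C[i - 1][j - 1] / piv
    return piv

def pivot(C, i, k, u):
    """Use the 1 at position [i, i] to clear the entry at [k, i]."""
    s = len(C[0])
    t = C[k - 1][i - 1]
    C[k - 1][i - 1] = 0
    for q in range(u, s + 1):
        C[k - 1][q - 1] = C[k - 1][q - 1] - t * C[i - 1][q - 1]

C = [[Fraction(2), Fraction(4), Fraction(6)],
     [Fraction(1), Fraction(5), Fraction(9)]]
piv = scale(C, 1, 1)      # row 1 becomes 1 2 3, and piv is 2
pivot(C, 1, 2, 2)         # row 2 becomes 0 3 6
print(piv, [[str(v) for v in row] for row in C])
```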

6. Procedure scramb(C,r,s,n)

This procedure permutes the first n rows of the $r \times s$ matrix C according to the permutation that occupies the positions $C_{r1}, C_{r2}, \ldots, C_{rn}$ on input.

The use of this subroutine is explained in detail in section 3.6 (q.v.). Its purpose is to

rearrange the rows of the output matrix that holds a basis for the kernel, and also the rows of

the output matrix that holds particular solutions of the given system(s). After rearrangement

the rows will correspond to the original numbering of the unknowns, thereby compensating

for the renumbering that was induced by column interchanges during the forward solution.

This subroutine poses some interesting questions if we require that it should not use any

additional array space beyond the input matrix itself.

7. Procedure ident(C,r,s,i,j,n,q)

This procedure inserts q times the $n \times n$ identity matrix into the $n \times n$ submatrix whose Northwest corner is at position [i, j] of the $r \times s$ matrix C.

Now let’s look at the assembly of these building blocks into a complete matrix analysis

procedure called matalg(C,r,s,m,n,p,opt,eps). Input items to it are:

• An $r \times s$ matrix C (as well as the values of r and s), whose Northwest $m \times n$ submatrix contains the matrix of coefficients of the system(s) of equations that are about to be solved. The values of m and n must also be provided to the procedure. It is assumed that $r = 1 + \max(m, n)$. Unless the inverse of the coefficient matrix is wanted, the Northeast $m \times p$ submatrix of C holds p different right-hand side vectors for which we want solutions.

• The numbers r, s, m, n and p.

• A parameter opt that is equal to 1 if we want an inverse, equal to 2 if we want to see the determinant of the coefficient matrix (if square) as well as a basis for the kernel (if it is nontrivial) and a set of p particular solution vectors.

• A real parameter eps that is used to bound roundoﬀ error.

Output items from the procedure matalg are:

• The pseudorank r

• The determinant det if m = n

• An $n \times r$ matrix basis, whose columns are a basis for the kernel of the coefficient matrix.

• An $n \times p$ matrix partic, whose columns are particular solution vectors for the given systems.


In case opt = 1 is chosen, the procedure will fill the last m columns and rows of C with an $m \times m$ identity matrix, set p = n = m, and proceed as before, leaving the inverse matrix in the same place, on output.

Let’s remark on how the determinant is calculated. The reduction of the input matrix

to echelon form in the forward solution phase entails the use of three kinds of operations.

First we divide a row by a pivot element. Second, we multiply a row by a number and add

it to another row. Third, we exchange a pair of rows or columns.

The ﬁrst operation divides the determinant by that same pivot element. The second

has no eﬀect on the determinant. The third changes the sign of the determinant, at any

rate if the rows or columns are distinct. At the end of the forward solution the matrix is

upper triangular, with 1’s on the diagonal, hence its determinant is clearly 1.

What must have been the value of the determinant of the input matrix? Clearly it

must have been equal to the product of all the pivot elements that were used during the

reduction, together with a plus or minus sign from the row or column interchanges.

Hence, to compute the determinant, we begin by setting det to 1. Then, each time a new

pivot element is selected, we multiply det by it. Finally, whenever a pair of diﬀerent rows

or columns are interchanged we reverse the sign of det. Then det holds the determinant

of the input matrix when the forward solution phase has ended.
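The determinant rule can be tried out in a few lines. The following is only a sketch in Python, with exact fractions so that the answers can be compared with hand calculation, and the function name `det_by_elimination` is our own. It does the complete search of the southeast rectangle at each stage, multiplies det by each pivot, and reverses the sign at each interchange of distinct rows or columns.

```python
from fractions import Fraction

def det_by_elimination(A):
    """Determinant of a square matrix via forward elimination."""
    n = len(A)
    M = [[Fraction(v) for v in row] for row in A]
    det = Fraction(1)
    for i in range(n):
        # Complete search: largest |entry| with row and column >= i.
        big, u, v = max((abs(M[a][b]), a, b)
                        for a in range(i, n) for b in range(i, n))
        if big == 0:
            return Fraction(0)        # no pivot left: the matrix is singular
        if u != i:                    # row interchange reverses the sign
            M[i], M[u] = M[u], M[i]
            det = -det
        if v != i:                    # so does a column interchange
            for row in M:
                row[i], row[v] = row[v], row[i]
            det = -det
        piv = M[i][i]
        det *= piv                    # each pivot multiplies the determinant
        M[i] = [x / piv for x in M[i]]
        for k in range(i + 1, n):
            t = M[k][i]
            M[k] = [x - t * y for x, y in zip(M[k], M[i])]
    return det

print(det_by_elimination([[1, 2], [3, 4]]))
print(det_by_elimination([[0, 1, 1], [1, 0, -1], [-2, 1, 1]]))
```

The second matrix is the one in (3.1.13). Expanding along its first row gives the determinant 2, which agrees with what the elimination produces.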

Now we have described the basic modules out of which a general purpose program for

linear equations can be constructed. In the next section we are going to discuss the vexing

question of roundoﬀ error and how to set the tolerance level below which entries are declared

to be zero. A complete formal algorithm that ties together all of these modules, with control

of rounding error, is given at the end of the next section.

Exercises 3.3

1. Make a test problem for the major program that you’re writing by tracing through a

solution the way the computer would:

Take one of the systems that appears at the end of section 3.2. Transform it to

reduced row echelon form step by step, being sure to carry out a complete search of

the Southeast rectangle each time, and to interchange rows and columns to bring the

largest element found into the pivot position. Record the column interchanges in τ,

as described above. Record the status of the matrix C after each major loop so you’ll

be able to test your program thoroughly and easily.

2. Repeat problem 1 on a system of each major type: inconsistent, unique solution, many

solutions.

3. Construct a formal algorithm that will invert a matrix, using no more array space

than the matrix itself. The idea is that the input matrix is transformed, a column at

a time, into the identity matrix, and the identity matrix is transformed, a column at

a time, into the inverse. Why store all of the extra columns of the identity matrix?

(Good luck!)


4. Show that a matrix A is of rank one if and only if its entries are of the form $A_{ij} = f_i g_j$ for all i and j.

5. Show that the operation $\overrightarrow{\mathrm{row}}(i) := c * \overrightarrow{\mathrm{row}}(j) + \overrightarrow{\mathrm{row}}(i)$ applied to a matrix A has the same effect as first applying that same operation to the identity matrix I to get a certain matrix E, and then computing EA.

6. Show that the operation of scaling $\overrightarrow{\mathrm{row}}(i)$:

$$a_{ik} := \frac{a_{ik}}{a_{ii}}, \qquad k = 1, \ldots, n \tag{3.3.2}$$

has the same effect as first dividing the ith row of the identity matrix by $a_{ii}$ to get a certain matrix E, and then computing EA.

7. Suppose we do a complete forward solution without ever searching or interchanging

rows or columns. Show that the forward solution amounts to discovering a lower

triangular matrix L and an upper triangular matrix U such that LA = U (think of L

as a product of several matrices E such as you found in the preceding two problems).

3.4 How big is zero?

The story of the linear algebra subroutine has just two pieces untold: the ﬁrst concerns

how small we will allow a number to be without calling it zero, and the second concerns

the rearrangement of the output to compensate for interchanges of rows and columns that

are done during the row-echelon reduction.

The main reduction loop begins with a search of the rectangle that lies Southeast of the

pivot position [i, i], in order to locate the largest element that lives there and to use it for

the next pivot. If that element is zero, the forward solution halts because the remaining

pivot candidates are all zero.

But “how zero” do they have to be? Certainly it would be too much to insist, when

working with sixteen decimal digits, that a number should be exactly equal to zero. A

little more natural would be to declare that any number that is no larger than the size of

the accumulated roundoﬀ error in the calculation should be declared to be zero, since our

microscope lens would then be too clouded to tell the diﬀerence.

It is important that we should know how large roundoﬀ error is, or might be. Indeed,

if we set too small a threshold, then numbers that “really are” zero will slip through, the

calculation will continue after it should have terminated because of unreliability of the

computed entries, and so forth. If the threshold is too large, we will declare numbers

to be zero that aren’t, and our numerical solution will terminate too quickly because the

computed matrix elements will be declared to be unreliable when really they are perfectly

OK.

The phenomenon of roundoﬀ error occurs because of the ﬁnite size of a computer word.

If a word consists of d binary digits, then when two d-digit binary numbers are multiplied

together, the answer that should be 2d bits long gets rounded oﬀ to d bits when it is stored.

By doing so we incur a rounding error whose size is at most 1 unit in the (d + 1)st place.

Then we proceed to add that answer to other numbers with errors in them, and to

multiply, divide, and so forth, some large number of times. The accumulation of all of this

rounding error can be quite signiﬁcant in an extended computation, particularly when a

good deal of cancellation occurs from subtraction of nearly equal quantities.
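The effect of a finite word length is easy to observe directly. The following is a minimal sketch, assuming IEEE double precision (about sixteen decimal digits), which is what Python floats use on common hardware; the helper name `naive_difference` is made up for this illustration.

```python
import sys

# Machine epsilon: the gap between 1.0 and the next larger double.
eps = sys.float_info.epsilon

def naive_difference(big: float, small: float) -> float:
    """Compute (big + small) - big in floating point; exact arithmetic gives small."""
    return (big + small) - big

# Near 1e16 adjacent doubles are 2 apart, so adding 1.0 is rounded away entirely.
lost = naive_difference(1.0e16, 1.0)

# Here the small term survives, but only a few of its digits are still correct.
kept = naive_difference(1.0, 1.0e-8)
relative_error = abs(kept - 1.0e-8) / 1.0e-8
```

In the second case the absolute error is of the order of `eps`, yet relative to the small quantity being recovered it is many orders of magnitude larger. That is the cancellation effect described above.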

The question is to determine the level of rounding error that is present, while the

calculation is proceeding. Then, when we arrive at a stage where the numbers of interest

are about the same size as the rounding errors that may be present in them, we had better

halt the calculation.

How can we estimate, during the course of a calculation, the size of the accumulated

roundoﬀ error? There are a number of theoretical a priori estimates for this error, but in

any given computation these would tend to be overly conservative, and we would usually

terminate the calculation too soon, thinking that the errors were worse than they actually

were.

We prefer to let the computer estimate the error for us while it’s doing the calculation.

True, it will have to do more work, but we would rather have it work a little harder if the

result will be that we get more reliable answers.

Here is a proposal for estimating the accumulated rounding error during the progress

of a computation. This method was suggested by Professors Nijenhuis and Wilf. We carry

along an additional matrix of the same size as the matrix C, the one that has the coeﬃcients

and right-hand sides of the equations that we are solving. In this extra matrix we are going

to keep estimates of the roundoﬀ error in each of the elements of the matrix C.

In other words, we are going to keep two whole matrices, one of which will contain the

coeﬃcients and the right-hand sides of the equations, and the other of which will contain

estimates of the roundoﬀ error that is present in the elements of the ﬁrst one.

At any time during the calculation that we want to know how reliable a certain matrix

entry is, we’ll need only to look at the corresponding entry of the error matrix to ﬁnd out.

Let's call this auxiliary matrix R (as in roundoff). Initially an element $R_{ij}$ might be as large as $2^{-d}|C_{ij}|$ in magnitude, and of either sign. Therefore, to initialize the R matrix we choose a number uniformly at random in the interval $[-|2^{-d}C_{ij}|,\ |2^{-d}C_{ij}|]$, and store it in $R_{ij}$ for each i and j. Hence, to begin with, the matrix R is set to randomly chosen values in the range in which the actual roundoff errors lie.
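The initialization just described can be sketched in a few lines. This is an illustration only, not the book's Maple routine: the function name `init_error_matrix`, the list-of-lists storage, and the choice d = 52 are assumptions made for the example.

```python
import random

def init_error_matrix(C, d, rng=None):
    """Return R with R[i][j] uniform in [-2^-d |C[i][j]|, 2^-d |C[i][j]|]."""
    rng = rng or random.Random()
    unit = 2.0 ** (-d)
    return [[rng.uniform(-unit * abs(c), unit * abs(c)) for c in row] for row in C]

# A small sample matrix; a fixed seed makes the run repeatable.
C = [[3.0, -1.0, 2.0], [-1.0, 3.0, 0.0]]
R = init_error_matrix(C, d=52, rng=random.Random(0))
```

An entry of C that is exactly zero receives a zero error estimate, since the interval collapses to a point.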

Then, as the calculation unfolds, we do arithmetic on the matrix C of two kinds. We

either scale a row by dividing it through by the pivot element, or we pivot a row against

another row. In each case let’s look at the eﬀect that the operation has on the corresponding

roundoﬀ error estimator in the R matrix.

In the first case, consider a scaling operation, in which a certain row is divided by the pivot element. Specifically, suppose we are dividing row i through by the element $C_{ii}$, and let $R_{ii}$ be the corresponding entry of the error matrix. Then, in view of the fact that

$$\frac{C_{ij}+R_{ij}}{C_{ii}+R_{ii}} = \frac{C_{ij}}{C_{ii}} + \frac{R_{ij}}{C_{ii}} - \frac{R_{ii}C_{ij}}{C_{ii}^{2}} + \text{terms involving products of two or more errors} \tag{3.4.1}$$

we see that the error entries $R_{ij}$ in the row that is being divided through by $C_{ii}$ should be computed as

$$R_{ij} := \frac{R_{ij}}{C_{ii}} - \frac{R_{ii}C_{ij}}{C_{ii}^{2}}. \tag{3.4.2}$$
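As a rough sketch of how the update (3.4.2) might sit next to the scaling itself, here is a Python version with 0-based indices. The name `scale_row_with_error` is hypothetical; the book's own routines for this step are `scaler` and `scale`.

```python
def scale_row_with_error(C, R, i):
    """Divide row i of C by the pivot C[i][i], updating R by the first-order rule (3.4.2)."""
    pivot, pivot_err = C[i][i], R[i][i]   # saved first, because both entries get overwritten
    for j in range(len(C[i])):
        # R_ij := R_ij / C_ii - R_ii * C_ij / C_ii^2, using the old C_ij
        R[i][j] = R[i][j] / pivot - pivot_err * C[i][j] / pivot ** 2
        C[i][j] = C[i][j] / pivot

# Powers of two keep the arithmetic exact, so the result can be checked by hand.
C = [[4.0, 2.0, 8.0]]
R = [[2.0 ** -20, 2.0 ** -20, 0.0]]
scale_row_with_error(C, R, 0)
```

Note that the estimate for the pivot position itself becomes zero, which matches the fact that the pivot is replaced by exactly 1.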

In the second case, suppose we are doing a pivoting operation on the kth row. Then for each column q we do the operation $C_{kq} := C_{kq} - t * C_{iq}$, where $t = C_{ki}$. Now let's replace $C_{kq}$ by $C_{kq}+R_{kq}$, replace $C_{iq}$ by $C_{iq}+R_{iq}$ and replace t by $t + t'$ (where $t' = R_{ki}$). Then substitute these expressions into the pivot operation above, and keep terms that are of first order in the errors (i.e., that do not involve products of two of the errors).

Then $C_{kq}+R_{kq}$ is replaced by

$$\begin{aligned} C_{kq}+R_{kq}-(t+t')*(C_{iq}+R_{iq}) &= (C_{kq}-t*C_{iq}) + (R_{kq}-t*R_{iq}-t'*C_{iq}) \\ &= (\text{new } C_{kq}) + (\text{new error } R_{kq}). \end{aligned} \tag{3.4.3}$$

It follows that as a result of the pivoting, the error estimator is updated as follows:

$$R_{kq} := R_{kq} - C_{ki}*R_{iq} - R_{ki}*C_{iq}. \tag{3.4.4}$$
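The pivoting update (3.4.4) can be sketched the same way. Again this is only an illustration with 0-based indices and a made-up name, `pivot_row_with_error`; it is not the book's `pivotr` routine.

```python
def pivot_row_with_error(C, R, i, k):
    """Subtract C[k][i] times row i from row k, updating R by the first-order rule (3.4.4)."""
    t, t_err = C[k][i], R[k][i]   # saved first: column i of row k is overwritten in the loop
    for q in range(len(C[k])):
        # R_kq := R_kq - C_ki * R_iq - R_ki * C_iq
        R[k][q] = R[k][q] - t * R[i][q] - t_err * C[i][q]
        C[k][q] = C[k][q] - t * C[i][q]

# Row 0 has already been scaled so that its pivot is 1.
C = [[1.0, 0.5], [2.0, 3.0]]
R = [[0.0, 2.0 ** -23], [2.0 ** -20, 2.0 ** -20]]
pivot_row_with_error(C, R, 0, 1)
```

The entry that is driven to zero also gets a zero error estimate here, because the scaled pivot carries no error of its own.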

Equations (3.4.2) and (3.4.4) completely describe the evolution of the R matrix. It

begins life as random roundoﬀ error; it gets modiﬁed along with the matrix elements whose

errors are being estimated, and in return, we are supplied with good error estimates of each

entry while the calculation proceeds.

Before each scaling and pivoting sequence we will need to update the R matrix as

described above. Then, when we search the now-famous Southeast rectangle for the new

pivot element we accept it if it is larger in absolute value than its corresponding roundoff

estimator, and otherwise we declare the rectangle to be identically zero and halt the forward

solution.
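The acceptance test for a new pivot might look like the following sketch. The function `find_pivot` and its 0-based calling convention are assumptions for this example; the caller is assumed to guarantee that the rectangle is not empty.

```python
def find_pivot(C, R, i, m, n):
    """Search rows i..m-1 and columns i..n-1 (0-based) for the entry of largest |C|.

    Returns (row, col) if that entry exceeds its roundoff estimate in R, else None,
    meaning the whole rectangle is to be treated as zero.
    """
    cells = ((r, c) for r in range(i, m) for c in range(i, n))
    r, c = max(cells, key=lambda rc: abs(C[rc[0]][rc[1]]))
    return (r, c) if abs(C[r][c]) > abs(R[r][c]) else None

C = [[1.0, 0.0, 0.0],
     [0.0, 1e-17, -3e-17],
     [0.0, 2e-17, 1e-17]]
R_large = [[1e-16] * 3 for _ in range(3)]   # errors swamp the lower-right block
R_small = [[1e-20] * 3 for _ in range(3)]   # errors are negligible

halted = find_pivot(C, R_large, 1, 3, 3)
accepted = find_pivot(C, R_small, 1, 3, 3)
```

With the larger error estimates the lower-right block is declared zero and the forward solution would halt with pseudorank 1; with the smaller ones the entry at (1, 2) is accepted as the next pivot.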

The R matrix is also used to check the consistency of the input system. At the end of

the forward solution all rows of the coeﬃcient matrix from a certain row onwards are ﬁlled

with zeros, in the sense that the entries are below the level of their corresponding roundoﬀ

estimator. Then the corresponding right-hand side vector entries should also be zero in

the same sense, else as far as the algorithm can tell, the input system was inconsistent.

With typical ambiguity of course, this means either that the input system was “really”

inconsistent, or just that rounding errors have built up so severely that we cannot decide

on consistency, and continuation of the “solution” would be meaningless.

Algorithm matalg(C,r,s,m,n,p,opt,eps). The algorithm operates on the matrix C,

which is of dimension r × s, where r = max(m, n) + 1. It solves p systems of m equations

in n unknowns, unless opt = 1, in which case it will calculate the inverse of the matrix in

the ﬁrst m = n rows and columns of C.

    matalg:=proc(C,r,s,m,n,p,opt,eps)
    local R,i,j,Det,Done,ii,jj,Z,k,psrank;
    # if opt = 1 that means inverse is expected
    if opt=1 then ident(C,r,s,1,n+1,n,1) fi;
    # initialize random error matrix
    R:=matrix(r,s,(i,j)->0.000000000001*(rand()-500000000000)*eps*C[i,j]);
    # set row permutation to the identity
    for j from 1 to n do C[r,j]:=j od;
    # begin forward solution
    Det:=1; Done:=false; i:=0;
    while ((i<m) and not(Done))do
      # find largest in SE rectangle
      Z:=searchmat(C,r,s,i+1,i+1,m,n);ii:=Z[1][1]; jj:=Z[1][2];
      if abs(Z[2])>abs(R[ii,jj]) then
        i:=i+1;
        # switch rows
        Det:=Det*switchrow(C,r,s,i,ii,i,s);
        Z:=switchrow(R,r,s,i,ii,i,s);
        # switch columns
        Det:=Det*switchcol(C,r,s,i,jj,1,r);
        Z:=switchcol(R,r,s,i,jj,1,r);
        # divide by pivot element
        Z:=scaler(C,R,r,s,i,i);
        Det:=Det*scale(C,r,s,i,i);
        for k from i+1 to m do
          # reduce row k against row i
          Z:=pivotr(C,R,r,s,i,k,i+1);
          Z:=pivot(C,r,s,i,k,i+1);
        od;
      else Done:=true fi;
    od;
    psrank:=i;
    # end forward solution; begin consistency check
    if psrank<m then
      Det:=0;
      for j from 1 to p do
        # check that right hand sides are 0 for i>psrank
        Z:=searchmat(C,r,s,psrank+1,n+j,m,n+j);
        if abs(Z[2])>abs(R[Z[1][1],Z[1][2]]) then
          printf("Right hand side %d is inconsistent",j);
          return;
        fi;
      od;
    fi;
    # equations are consistent, do back solution
    for j from psrank to 2 by -1 do
      for i from 1 to j-1 do
        Z:=pivotr(C,R,r,s,j,i,psrank+1);
        Z:=pivot(C,r,s,j,i,psrank+1);
        C[i,j]:=0; R[i,j]:=0;
      od;
    od;
    # end back solution, insert minus identity in basis
    if psrank<n then
      # fill bottom of basis matrix with -I
      Z:=ident(C,r,s,psrank+1,psrank+1,n-psrank,-1);
      # fill under right-hand sides with zeroes
      for i from psrank+1 to n do for j from n+1 to s do C[i,j]:=0 od od;
      # fill under R matrix with zeroes
      for i from psrank+1 to n do for j from n-psrank to s do R[i,j]:=0 od od;
    fi;
    # permute rows prior to output
    Z:=scramb(C,r,s,n);
    # copy row r of C to row r of R
    for j from 1 to n do R[r,j]:=C[r,j] od;
    Z:=scramb(R,r,s,n);
    return(Det,psrank,evalm(R));
    end;

If the procedure terminates successfully, it returns a list containing three items: the ﬁrst

is the determinant (if there is one), the second is the pseudorank of the coeﬃcient matrix,

and the third is the matrix of estimated roundoﬀ errors. The matrix C (which is called by

name in the procedure, which means that the input matrix is altered by the procedure) will

contain a basis for the kernel of the coeﬃcient matrix in columns psrank + 1 to n, and p

particular solution vectors, one for each input right-hand side, in columns n + 1 to n + p.

Two new sub-procedures are called by this procedure, namely scaler and pivotr.

These are called immediately before the action of scale or pivot, respectively, and their

mission is to update the R matrix in accordance with equations (3.4.2) or (3.4.4) to take

account of the impending scaling or pivoting operation.

Exercises 3.4

1. Break oﬀ from the complete algorithm above, the forward solution process. State it

formally as algorithm forwd, list its global variables, and describe precisely its eﬀect

on them. Do the same for the backwards solution.

2. When the program runs, it gives the solutions and their roundoﬀ error estimates.

Work out an elegant way to print the answers and the error estimates. For instance,

there’s no point in giving 12 digits of roundoﬀ error estimate. That’s too much. Just

print the number of digits of the answers that can be trusted. How would you do

that? Write subroutine prnt that will carry it out.

3. Suppose you want to re-run a problem, with a diﬀerent set of random numbers in

the roundoﬀ matrix initialization. How would you do that? Run one problem three

or four times to see how sensitive the roundoﬀ estimates are to the choice of random

values that start them oﬀ.

4. Show your program to a person who is knowledgeable in programming, but who is

not one of your classmates. Ask that person to use your program to solve some set of

three simultaneous equations in ﬁve unknowns.

Do not answer any questions verbally about how the program works or how to use it.

Refer all such questions to your written documentation that accompanies the program.

If the other person is able to run your program and understand the answers, award

yourself a gold medal in documentation. Otherwise, improve your documentation and

let the person try again. When successful, try again on someone else.

5. Select two vectors f, g of length 10 by choosing their elements at random. Form the 10 × 10 matrix of rank 1 whose elements are $f_i g_j$. Do this three times and add the resulting matrices to get a single 10 × 10 matrix of rank three.

Run your program on the coeﬃcient matrix you just constructed, in order to see if

the program is smart enough to recognize a matrix of rank three when it sees one, by

halting the forward solution with pseudorank = 3.

Repeat the above experiment 50 times, and tabulate the frequencies with which your

program “thought” that the 10 × 10 matrix had various ranks.

3.5 Operation count

With any numerical algorithm it is important to know how much work is involved in carrying

it out. In this section we are going to estimate the labor involved in solving linear systems

by the method of the previous sections.

Let’s recognize two kinds of labor: arithmetic operations, and other operations, both as

applied to elements of the matrix. The arithmetic operations are +, −, ×, ÷, all lumped

together, and by other operations we mean comparisons of size, movement of data, and

other operations performed directly on the matrix elements. Of course there are many

“other operations,” not involving the matrix elements directly, such as augmenting counters,

testing for completion of loops, etc., that go on during the reduction of the matrix, but the

two categories above represent a good measure of the work done. We’re not going to include

the management of the roundoﬀ error matrix R in our estimates, because its eﬀect would

be simply to double the labor involved. Hence, remember to double all of the estimates of

the labor that we are about to derive if you’re using the R matrix.

We consider a generic stage in the forward solution where we have been given m equations

in n unknowns with p right-hand sides, and during the forward solution phase we have

just arrived at the [i, i] element.

The next thing to do is to search the Southeast rectangle for the largest element. The

rectangle contains about (m − i)(n − i) elements. Hence the search requires that many

comparisons.

Then we exchange two rows (n+p−i operations), exchange two columns (m operations)

and divide a row by the pivot element (n +p −i arithmetic operations).

Next, for each of the m−i−1 rows below the ith, and for each of the n+p−i elements of

one of those rows, we do two arithmetic operations when we do the elementary row operation

that produces a zero in the ith column. This requires, therefore, 2(n + p − i)(m − i − 1)

arithmetic operations.

For the forward phase of the solution, therefore, we count

$$A_f = \sum_{i=1}^{r} \{2(n+p-i)(m-i-1) + (n+p-i)\} \tag{3.5.1}$$

arithmetic operations altogether, where r is the pseudorank of the matrix, because the

forward solution halts after the rth row with only zeros below.

The non-arithmetic operations in the forward phase amount to

$$N_f = \sum_{i=1}^{r} \{(m-i)(n-i) + (n+p-i) + m\}. \tag{3.5.2}$$

Let’s leave these sums for a while, and go to the backwards phase of the solution. We

do the columns in reverse order, from column r back to 1, and when we have arrived at a

generic column j, we want to create zeroes in all of the positions above the 1 in the [j, j]

position.

To do this we perform the elementary row operation

$$\overrightarrow{\mathrm{row}}(i) := \overrightarrow{\mathrm{row}}(i) - A_{ij} * \overrightarrow{\mathrm{row}}(j) \tag{3.5.3}$$

to each of the j − 1 rows above row j. Let i be the number of one of these rows. Then, exactly how many elements of $\overrightarrow{\mathrm{row}}(i)$ are acted upon by the elementary row operation above?

Certainly the elements in $\overrightarrow{\mathrm{row}}(i)$ that lie in columns 1 through j − 1 are unaffected, because only zero elements are in $\overrightarrow{\mathrm{row}}(j)$ below them, thanks to the forward reduction process.

Furthermore, the elements in $\overrightarrow{\mathrm{row}}(i)$ that lie in columns j + 1 through r are unaffected, for a different reason. Indeed, any such entry is already zero, because it lies above an entry of 1 in some diagonal position that has already had its turn in the back solution (remember that we're doing the columns in the sequence r, r − 1, . . . , 1). Not only is such an entry zero, but it remains zero, because the entry of $\overrightarrow{\mathrm{row}}(j)$ below it is also zero, having previously been deleted by the action of a diagonal element below it.

Hence in $\overrightarrow{\mathrm{row}}(i)$, the elements that are affected by the elementary row operation (3.5.3) are those that lie in columns j, r + 1, . . . , n, n + 1, . . . , n + p (be sure to write [or modify!] the program so that the row reduction (3.5.3) acts only on those columns!). We have now shown that exactly n + p − r + 1 entries of each row above $\overrightarrow{\mathrm{row}}(j)$ are affected (note that the number is independent of j and i), so

$$A_b = \sum_{j=1}^{r} (n+p-r+1)(j-1) \tag{3.5.4}$$

arithmetic operations are done during the back solution, and no other operations.

It remains only to do the various sums, and for this purpose we recall that

$$\sum_{i=1}^{N} i = \frac{N(N+1)}{2}, \qquad \sum_{i=1}^{N} i^{2} = \frac{N(N+1)(2N+1)}{6}. \tag{3.5.5}$$

Then it is straightforward to find the total number of arithmetic operations from $A_f + A_b$ as

$$\mathrm{Arith}(m,n,p,r) = \frac{r^{3}}{6} - (2m+n+p-5)\frac{r^{2}}{2} + \left((n+p)\left(2m-\tfrac{5}{2}\right) - m + \tfrac{1}{3}\right) r \tag{3.5.6}$$

and the total of the non-arithmetic operations from $N_f$ as

$$\mathrm{NonArith}(m,n,p,r) = \frac{r^{3}}{3} - (m+n)\frac{r^{2}}{2} + (6mn+3m+3n+6p-2)\frac{r}{6}. \tag{3.5.7}$$
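The closed forms can be checked against the sums they came from. The sketch below evaluates the sums (3.5.1), (3.5.2) and (3.5.4) term by term and compares them with (3.5.6) and (3.5.7) in exact rational arithmetic; the function names are invented for this check.

```python
from fractions import Fraction

def forward_arith(m, n, p, r):
    """The sum (3.5.1)."""
    return sum(2 * (n + p - i) * (m - i - 1) + (n + p - i) for i in range(1, r + 1))

def back_arith(n, p, r):
    """The sum (3.5.4)."""
    return sum((n + p - r + 1) * (j - 1) for j in range(1, r + 1))

def forward_nonarith(m, n, p, r):
    """The sum (3.5.2)."""
    return sum((m - i) * (n - i) + (n + p - i) + m for i in range(1, r + 1))

def arith_closed(m, n, p, r):
    """The closed form (3.5.6)."""
    r = Fraction(r)
    linear = (n + p) * (2 * m - Fraction(5, 2)) - m + Fraction(1, 3)
    return r ** 3 / 6 - (2 * m + n + p - 5) * r ** 2 / 2 + linear * r

def nonarith_closed(m, n, p, r):
    """The closed form (3.5.7)."""
    r = Fraction(r)
    return r ** 3 / 3 - (m + n) * r ** 2 / 2 + (6 * m * n + 3 * m + 3 * n + 6 * p - 2) * r / 6

# For example, two equations in two unknowns with one right-hand side and full rank:
small_case = (forward_arith(2, 2, 1, 2) + back_arith(2, 1, 2), forward_nonarith(2, 2, 1, 2))
```

For that small case the counts are 3 arithmetic operations and 8 non-arithmetic ones, and both closed forms give the same values.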

Let’s look at a few important special cases. First, suppose we are solving one system

of n equations in n unknowns that has a unique solution. Then we have m = n = r and

p = 1. We ﬁnd that

$$\mathrm{Arith}(n, n, 1, n) = \frac{2}{3}n^{3} + O(n^{2}) \tag{3.5.8}$$

where $O(n^{2})$ refers to some function of n that is bounded by a constant times $n^{2}$ as n grows large. Similarly, for the non-arithmetic operations on matrix elements we find $\frac{1}{3}n^{3} + O(n^{2})$ in this case.

It follows that a system of n equations can be solved for about one third of the price,

in terms of arithmetic operations, of one matrix multiplication, at least if matrices are

multiplied in the usual way (did you know that there is a faster way to multiply two

matrices? We will see one later on).

Now what is the price of a matrix inversion by this method? Then we are solving n

systems of n equations in n unknowns, all with the same left-hand side. Hence we have

r = m = n = p, and we ﬁnd that

$$\mathrm{Arith}(n, n, n, n) = \frac{13}{6}n^{3} + O(n^{2}). \tag{3.5.9}$$

Hence we can invert a matrix by this method for about the same price as solving 3.25

systems of equations! At ﬁrst glance, it may seem as if the cost should be n times as great

because we are solving n systems. The great economy results, of course, from the common

left-hand sides.
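The two leading coefficients and their ratio can be confirmed numerically from the closed form (3.5.6). This is a small sketch; the function name `arith` and the choice n = 10^6 are arbitrary.

```python
from fractions import Fraction

def arith(m, n, p, r):
    """Closed form (3.5.6) for the total arithmetic operation count."""
    r = Fraction(r)
    linear = (n + p) * (2 * m - Fraction(5, 2)) - m + Fraction(1, 3)
    return r ** 3 / 6 - (2 * m + n + p - 5) * r ** 2 / 2 + linear * r

n = 10 ** 6
solve_cost = arith(n, n, 1, n)       # one system of n equations, unique solution
invert_cost = arith(n, n, n, n)      # n right-hand sides, i.e. a matrix inversion

solve_coeff = float(solve_cost / n ** 3)    # close to 2/3
ratio = float(invert_cost / solve_cost)     # close to 13/4 = 3.25
```

The ratio 13/4 is simply the quotient of the leading coefficients 13/6 and 2/3.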

The cost of the non-arithmetic operations remains at $\frac{1}{3}n^{3} + O(n^{2})$.

If we want only the determinant of a square matrix A, or want only the rank of A then

we need to do only the forward solution, and we can save the cost of the back solution. We

leave it to the reader to work out the cost of a determinant, or of ﬁnding the rank.

3.6 To unscramble the eggs

Now we have reached the last of the issues that needs to be discussed in order to plan a

complete linear equation solving routine, and it concerns the rearrangement of the output

so that it ends up in the right order.

During the operation of the forward solution algorithm we found it necessary to interchange

rows and columns so that the largest element of the Southeast rectangle was brought

into the pivot position. As we mentioned previously, we don’t need to keep a record of the

row interchanges, because they correspond simply to solving the equations in a diﬀerent

sequence. We must remember the column interchanges that occur along the way though,

because each time we do one of them we are, in eﬀect, renumbering the unknowns.

To remember the column interchanges we glue onto our array C an additional row, just

for bookkeeping purposes. Its elements are called $\tau_j$, j = 1, . . . , n, and it is kept at the bottom of the matrix. More precisely, the elements of $\{\tau_j\}$ are the first n entries of the new last row of the matrix, where the row contains n + p entries altogether, the last p of which are not used (refer to (3.3.1) to see the complete partitioning of the matrix C).

Now suppose we have arrived at the end of the back solution, and the answers to the

original question are before us, except that they are scrambled. Here’s an example of the

kind of situation that might result:

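$$\begin{pmatrix} 1 & 0 & 0 & a & b & c \\ 0 & 1 & 0 & d & e & f \\ 0 & 0 & 1 & g & h & k \\ \cdot & \cdot & \cdot & \cdot & \cdot & \cdot \\ \cdot & \cdot & \cdot & \cdot & \cdot & \cdot \\ 3 & 5 & 2 & 1 & 4 & * \end{pmatrix}. \tag{3.6.1}$$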

The matrix above represents schematically the reduced row echelon form in a problem

where there are ﬁve unknowns (n = 5), the pseudorank r = 3, just one right-hand side

vector is given (p = 1), and the permutations that were carried out on the columns are

recorded in the array τ : [3, 5, 2, 1, 4] shown in the last row of the matrix as it would be

stored in a computation.

The question now is, how do we express the general solution of the given set of equations?

To ﬁnd the answer, let’s go back to the set of equations that (3.6.1) stands for. The ﬁrst of

these is

$$x_3 = c - a x_1 - b x_4 \tag{3.6.2}$$

because the numbering of the unknowns is as shown in the τ array. The next two equations are

$$\begin{aligned} x_5 &= f - d x_1 - e x_4 \\ x_2 &= k - g x_1 - h x_4. \end{aligned} \tag{3.6.3}$$

If we add the two trivial equations $x_1 = x_1$ and $x_4 = x_4$, then we get the whole solution vector which, after re-ordering the equations, can be written as

$$\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{pmatrix} = \begin{pmatrix} 0 \\ k \\ c \\ 0 \\ f \end{pmatrix} + (-x_1) * \begin{pmatrix} -1 \\ g \\ a \\ 0 \\ d \end{pmatrix} + (-x_4) * \begin{pmatrix} 0 \\ h \\ b \\ -1 \\ e \end{pmatrix}. \tag{3.6.4}$$

Now we are looking at a display of the output as we would like our subroutine to give

it. The three vectors on the right side of (3.6.4) are, respectively, a particular solution of

the given system of equations, and the two vectors of a basis for the kernel of the coeﬃcient

matrix.

The question can now be rephrased: exactly what operations must be done to the matrix

shown in (3.6.1) that represents the situation at the end of the back solution, in order to

obtain the three vectors in (3.6.4)?

The first things to do are, as we have previously noted, to append the negative of a 2 × 2 identity matrix to the bottom of the fourth and fifth columns of (3.6.1), and to lengthen the last column on the right by appending two more zeros. That brings us to the matrix

$$\begin{pmatrix} 1 & 0 & 0 & a & b & c \\ 0 & 1 & 0 & d & e & f \\ 0 & 0 & 1 & g & h & k \\ & & & -1 & 0 & 0 \\ & & & 0 & -1 & 0 \\ [3 & 5 & 2 & 1 & 4] & * \end{pmatrix}. \tag{3.6.5}$$

The ﬁrst two of the three long columns above will be the basis for the kernel, and the last

column above will be the particular solution, but only after we do the right rearrangement.

Now here is the punch line: the right rearrangement to do is to permute the rows of

those three long columns as described by the permutation τ.

That means that the ﬁrst row becomes the third, the second row becomes the ﬁfth, the

third row becomes the second, the fourth row is the new ﬁrst, and the old ﬁfth row is the

new fourth. The reader is invited to carry out on the rows the interchanges just described,

and to compare the result with what we want, namely with (3.6.4). It will be seen that we

have gotten the desired result.
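For readers who would rather let a machine do the bookkeeping, here is a small sketch that carries out the row rearrangement on the three long columns of (3.6.5), with the symbolic entries held as strings. The helper `permute_rows` is hypothetical and uses a second list for the output rather than working in place.

```python
def permute_rows(rows, tau):
    """Return a new list in which old row k (1-based) sits at position tau[k]."""
    new_rows = [None] * len(rows)
    for k, row in enumerate(rows):
        new_rows[tau[k] - 1] = row
    return new_rows

tau = [3, 5, 2, 1, 4]
long_columns = [
    ["a", "b", "c"],
    ["d", "e", "f"],
    ["g", "h", "k"],
    ["-1", "0", "0"],
    ["0", "-1", "0"],
]
unscrambled = permute_rows(long_columns, tau)

# Read the result column by column: two kernel vectors, then the particular solution.
columns = [[row[c] for row in unscrambled] for c in range(3)]
```

The three columns obtained this way are the kernel vectors and the particular solution that appear on the right side of (3.6.4).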

The point that is just a little surprising is that to undo the column interchanges that

are recorded by τ, we do row interchanges. Just roughly, the reason for this is that we

begin by wanting to solve Ax = b, and instead we end up solving (AE)y = b, where E is

a matrix obtained from the identity by elementary column operations. Evidently, x = Ey,

which means that we must perform row operations on y to recover the answers in the right

order.

Now we can leave the example above, and state the rule in general. We are given p

systems of m simultaneous equations each, all having a common m × n coefficient matrix

A, in n unknowns. At the end of the back solution we will have before us a matrix of the

form

$$\begin{pmatrix} I(r, r) & B(r, n-r) & P(r, p) \end{pmatrix} \tag{3.6.6}$$

where I(r, r) is the r × r identity matrix, r is the pseudorank of A, and B and P are matrices

of the sizes shown.

We adjoin under B the negative of the (n − r) × (n − r) identity matrix, and under P we adjoin an (n − r) × p block of zeros. Next, we forget the identity matrix on the left, and we consider the entire remaining n × (n − r + p) matrix as a whole, call it T, say. Now we exchange the rows of T according to the permutation array τ. Precisely, row 1 of T will be row $\tau_1$ of the new T, row 2 will be row $\tau_2$, . . . . Conceptually, we should regard the

old T and the new T as occupying diﬀerent areas of storage, so that the new T is just a

rearrangement of the rows of the old.

Now the ﬁrst n −r columns of the new T are a basis for the kernel of A, and should be

output as such, and the jth one of the last p columns of the new T is a particular solution

of the jth one of the input systems of equations, and should be output as such.
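The general rule can be sketched as follows, with the old T and the new T kept in separate storage as the text suggests. The function `unscramble`, its argument layout, and the small test system are all assumptions for illustration; the sketch also assumes the pseudorank r is at least 1.

```python
def unscramble(B, P, tau):
    """Build T = [[B, P], [-I, 0]] and send row k of T to row tau[k] (1-based).

    B is r x (n - r), P is r x p, tau has length n.
    Returns (kernel basis vectors, particular solutions), each a list of length-n lists.
    """
    r, n = len(B), len(tau)
    free, p = n - r, len(P[0])
    T = [list(B[k]) + list(P[k]) for k in range(r)]
    for k in range(free):
        # rows of -I under B, zeros under P
        T.append([-1.0 if c == k else 0.0 for c in range(free)] + [0.0] * p)
    new_T = [None] * n
    for k in range(n):
        new_T[tau[k] - 1] = T[k]
    kernel = [[row[c] for row in new_T] for c in range(free)]
    particular = [[row[free + c] for row in new_T] for c in range(p)]
    return kernel, particular

# Hypothetical system: 2*x1 + x2 = 5 and 3*x1 + x3 = 7. Taking the columns in the
# order tau = [2, 3, 1] puts it in the form (3.6.6) with B = [[2], [3]], P = [[5], [7]].
A = [[2.0, 1.0, 0.0], [3.0, 0.0, 1.0]]
kernel, particular = unscramble([[2.0], [3.0]], [[5.0], [7.0]], [2, 3, 1])

def apply(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]
```

Multiplying A by the kernel vector gives zero, and multiplying it by the particular solution reproduces the right-hand side, which is a convenient sanity check on any implementation of the rule.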

Although conceptually we should think of the old T and the new T as occupying distinct

arrays in memory, in fact it is perfectly possible to carry out the whole row interchange

procedure described above in just one array, the one that holds T, without ever “stepping

on our own toes,” so let’s consider that problem.

Suppose a linear array $a = [a_1, \dots, a_n]$ is given, along with a permutation array $\tau = [\tau_1, \dots, \tau_n]$. We want to rearrange the entries of the array a according to the permutation τ without using any additional array storage. Thus the present entry $a_1$ will end up as the output $a_{\tau_1}$, the initial $a_2$ will end up as $a_{\tau_2}$, etc.

To do this with no extra array storage, let's first pick up the element $a_1$ and move it to $a_{\tau_1}$, being careful to store the original $a_{\tau_1}$ in a temporary location t so it won't be destroyed. Next we move the contents of t to its destination, and so forth. After a certain number of steps (maybe only 1), we will be back to $a_1$.

Here's an example to help clarify the situation. Suppose the arrays a and τ at input time were:

$$\begin{aligned} a &= [5, 7, 13, 9, 2, 8] \\ \tau &= [3, 4, 5, 2, 1, 6]. \end{aligned} \tag{3.6.7}$$

So we move the 5 in position $a_1$ to position $a_3$ (after putting the 13 into a safe place), and then the 13 goes to position $a_5$ (after putting the 2 into a safe place) and the 2 is moved into position $a_1$, and we're back where we started. The a array now has become

$$a = [2, 7, 5, 9, 13, 8]. \tag{3.6.8}$$

The job, however, is not finished. Somehow we have to recognize that the elements $a_2$, $a_4$ and $a_6$ haven't yet been moved, while the others have been moved to their destinations.

For this purpose we will ﬂag the array positions. A convenient place to hang a ﬂag is in

the sign position of an entry of the array τ, since we’re sure that the entries of τ are all

supposed to be positive. Therefore, initially we’ll change the signs of all of the entries of τ

to make them negative. Then as elements are moved around in the a array we will reverse

the sign of the corresponding entry of the τ array. In that way we can always begin the

next block of entries of a to move by searching τ for a negative entry. When none exist, the

job is ﬁnished.

Here’s a complete algorithm, in Maple:

    shuffle:=proc(a,tau,n) local i,j,t,q,u,v;
    #permutes the entries of a according to the permutation tau
    #
    # flag entries of tau with negative signs
    for i from 1 to n do tau[i]:=-tau[i] od;
    for i from 1 to n do
      # has entry i been moved?
      if tau[i]<0 then
        # move the block of entries beginning at a[i]
        t:=i; q:=-tau[i]; tau[i]:=q;
        u:=a[i]; v:=u;
        while q<>t do
          v:=a[q]; a[q]:=u; tau[q]:=-tau[q];
          u:=v; q:=tau[q]; od;
        a[t]:=v;
      fi;
    od;
    return(1);
    end;
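For readers without access to Maple, here is a line-by-line Python sketch of the same idea. It is a translation made for this illustration, not part of the original program: the values in `tau` stay 1-based as in the text, while Python list positions are 0-based, so every array access subtracts 1.

```python
def shuffle(a, tau):
    """Rearrange list a in place so that old a_k lands at position tau_k (1-based).

    The sign of each tau entry serves as the 'not yet moved' flag, so no extra
    array is needed. tau is restored to its original positive values on exit.
    """
    n = len(a)
    for i in range(n):
        tau[i] = -tau[i]                 # flag every position as not yet moved
    for i in range(n):
        if tau[i] < 0:                   # position i+1 starts an unprocessed cycle
            start = i + 1
            q = -tau[i]
            tau[i] = q                   # clear the flag
            carried = a[i]
            while q != start:
                # drop the carried item at its destination, pick up what was there
                carried, a[q - 1] = a[q - 1], carried
                tau[q - 1] = -tau[q - 1]
                q = tau[q - 1]
            a[i] = carried

a = [5, 7, 13, 9, 2, 8]
tau = [3, 4, 5, 2, 1, 6]
shuffle(a, tau)
```

On the sample arrays of (3.6.7) the first cycle produces the intermediate state (3.6.8); the second cycle then exchanges the 7 and the 9, and the 8 stays where it is.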

The reader should carefully trace through the complete operation of this algorithm on

the sample arrays shown above. In order to apply the method to the linear equation solving

program, the entries C[r + 1, i], i = 1, . . . , n are interpreted as $\tau_i$, and the array a of length n whose entries are going to be moved is one of the columns r + 1, . . . , n + p of the matrix C in rows 1, . . . , n.

3.7 Eigenvalues and eigenvectors of matrices

Our next topic in numerical linear algebra concerns the computation of the eigenvalues and

eigenvectors of matrices. Until further notice, all matrices will be square. If A is n × n, by an eigenvector of A we mean a vector x ≠ 0 such that

$$Ax = \lambda x \tag{3.7.1}$$

where the scalar λ is called an eigenvalue of A. We say that the eigenvector x corresponds to,

or belongs to, the eigenvalue λ. We will see that in fact the eigenvalues of A are properties

of the linear mapping that A represents, rather than of the matrix A, so we can exploit

changes of basis in the computation of eigenvalues.

For an example, consider the 2 × 2 matrix

$$A = \begin{pmatrix} 3 & -1 \\ -1 & 3 \end{pmatrix}. \tag{3.7.2}$$

If we write out the vector equation (3.7.1) for this matrix, it becomes the two scalar equations

$$\begin{aligned} 3x_1 - x_2 &= \lambda x_1 \\ -x_1 + 3x_2 &= \lambda x_2. \end{aligned} \tag{3.7.3}$$

These are two homogeneous equations in two unknowns, and therefore they have no solution

other than the zero vector unless the determinant

$$\begin{vmatrix} 3-\lambda & -1 \\ -1 & 3-\lambda \end{vmatrix} \tag{3.7.4}$$

is equal to zero. This condition yields a quadratic equation for λ whose two roots are λ = 2

and λ = 4. These are the two eigenvalues of the matrix (3.7.2).
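For a 2 × 2 matrix the quadratic in question is λ² − (trace)λ + (determinant) = 0, so the eigenvalues follow from the quadratic formula. The sketch below does just that; the function name is made up, and it handles only the case of real roots.

```python
import math

def eigenvalues_2x2(A):
    """Real eigenvalues of a 2x2 matrix from det(A - lambda*I) = 0."""
    (a, b), (c, d) = A
    trace, det = a + d, a * d - b * c
    disc = trace * trace - 4 * det        # discriminant of lambda^2 - trace*lambda + det
    if disc < 0:
        raise ValueError("eigenvalues are complex")
    root = math.sqrt(disc)
    return (trace - root) / 2, (trace + root) / 2

lams = eigenvalues_2x2([[3, -1], [-1, 3]])
```

For the matrix (3.7.2) the trace is 6 and the determinant is 8, so the quadratic is λ² − 6λ + 8 = (λ − 2)(λ − 4).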

For the same 2 × 2 example, let's now find the eigenvectors (by a method that doesn't bear the slightest resemblance to the numerical method that we will discuss later). First, to find the eigenvector that belongs to the eigenvalue λ = 2, we go back to (3.7.3) and replace λ by 2 to obtain the two equations

$$\begin{aligned} x_1 - x_2 &= 0 \\ -x_1 + x_2 &= 0. \end{aligned} \tag{3.7.5}$$

These equations are, of course, redundant since λ was chosen to make them so. They are

satisﬁed by any vector x of the form c ∗ [1, 1], where c is an arbitrary constant. If we refer

back to the deﬁnition (3.7.1) of eigenvectors we notice that if x is an eigenvector then so is

cx, so eigenvectors are determined only up to constant multiples. The ﬁrst eigenvector of

our 2 × 2 matrix is therefore any multiple of the vector [1, 1].

To ﬁnd the eigenvector that belongs to the eigenvalue λ = 4, we return to (3.7.3), replace

λ by 4, and solve the equations. The result is that any scalar multiple of the vector [1, −1]

is an eigenvector corresponding to the eigenvalue λ = 4.

The two statements that [1, 1] is an eigenvector and that [1, −1] is an eigenvector can

either be written as two vector equations:

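$$\begin{pmatrix} 3 & -1 \\ -1 & 3 \end{pmatrix} \begin{pmatrix} 1 \\ 1 \end{pmatrix} = 2 \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \qquad \begin{pmatrix} 3 & -1 \\ -1 & 3 \end{pmatrix} \begin{pmatrix} 1 \\ -1 \end{pmatrix} = 4 \begin{pmatrix} 1 \\ -1 \end{pmatrix} \tag{3.7.6}$$

or as a single matrix equation

$$\begin{pmatrix} 3 & -1 \\ -1 & 3 \end{pmatrix} \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix} \begin{pmatrix} 2 & 0 \\ 0 & 4 \end{pmatrix}. \tag{3.7.7}$$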
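The single matrix equation is quick to confirm by multiplying out both sides, with A taken from (3.7.2). The helper `matmul` below is a plain list-of-lists product written for this check.

```python
def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]

A = [[3, -1], [-1, 3]]
P = [[1, 1], [1, -1]]      # eigenvectors [1, 1] and [1, -1] as columns
Lam = [[2, 0], [0, 4]]     # the matching eigenvalues on the diagonal

left = matmul(A, P)
right = matmul(P, Lam)
```

Both products come out as the matrix with rows [2, 4] and [2, −4]: each column of P has simply been multiplied by its own eigenvalue.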

Observe that the matrix equation (3.7.7) states that AP = PΛ, where A is the given

2 × 2 matrix, P is a (nonsingular) matrix whose columns are eigenvectors of A, and Λ is the

diagonal matrix that carries the eigenvalues of A down the diagonal (in order corresponding

to the eigenvectors in the columns of P). This matrix equation AP = PΛ leads to one of the

many important areas of application of the theory of eigenvalues, namely to the computation

of functions of matrices.

Suppose we want to calculate $A^{2147}$, where A is the 2 × 2 matrix (3.7.2). A direct

calculation, by raising A to higher and higher powers would take quite a while (although

not as long as one might think at ﬁrst sight! Exactly what powers of A would you compute?

How many matrix multiplications would be required?).

A better way is to begin with the relation AP = PΛ and to observe that in this case

the matrix P is nonsingular, and so P has an inverse. Since P has the eigenvectors of

A in its columns, the nonsingularity of P is equivalent to the linear independence of the

eigenvectors. Hence we can write

$$A = P\Lambda P^{-1}. \tag{3.7.8}$$

This is called the spectral representation of A, and the set of eigenvalues is often called the

spectrum of A.

Equation (3.7.8) is very helpful in computing powers of A. For instance

$$A^{2} = (P\Lambda P^{-1})(P\Lambda P^{-1}) = P\Lambda^{2}P^{-1},$$

and for every m, $A^{m} = P\Lambda^{m}P^{-1}$. It is of course quite easy to find high powers of the

diagonal matrix Λ, because we need only raise the entries on the diagonal to that power.

Thus for example,

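$$
A^{2147} =
\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}
\begin{bmatrix} 2^{2147} & 0 \\ 0 & 4^{2147} \end{bmatrix}
\begin{bmatrix} 1/2 & 1/2 \\ 1/2 & -1/2 \end{bmatrix}.
\tag{3.7.9}
$$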
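Multiplying out the three factors of (3.7.9) for a general exponent m gives a closed form. The display below is a worked check added for illustration, assuming A is the matrix that appears in (3.7.6); the case m = 1 recovers A itself.

```latex
A^{m} = \frac{1}{2}
\begin{bmatrix}
2^{m} + 4^{m} & 2^{m} - 4^{m} \\
2^{m} - 4^{m} & 2^{m} + 4^{m}
\end{bmatrix},
\qquad
m = 1:\quad
\frac{1}{2}\begin{bmatrix} 6 & -2 \\ -2 & 6 \end{bmatrix}
= \begin{bmatrix} 3 & -1 \\ -1 & 3 \end{bmatrix} = A .
```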

Not only can we compute powers from the spectral representation (3.7.8), we can equally

well obtain any polynomial in the matrix A. For instance,

$$13A^{3} + 78A^{19} - 43A^{31} = P\,(13\Lambda^{3} + 78\Lambda^{19} - 43\Lambda^{31})\,P^{-1}. \tag{3.7.10}$$

Indeed if f is any polynomial, then

$$f(A) = P f(\Lambda) P^{-1} \tag{3.7.11}$$

and f(Λ) is easy to calculate because it just has the numbers $f(\lambda_i)$ down the diagonal and zeros elsewhere.

Finally, it’s just a short hop to the conclusion that (3.7.11) remains valid even if f is not a polynomial, but is represented by an everywhere-convergent power series (we don’t even need that much, but this statement suffices for our present purposes). So for instance, if A is the above 2 × 2 matrix, then

$$e^{A} = P e^{\Lambda} P^{-1} \tag{3.7.12}$$

where $e^{\Lambda}$ has $e^{2}$ and $e^{4}$ on its diagonal.

We have now arrived at a very important area of application of eigenvalues and eigenvectors, to the solution of systems of differential equations. A system of n linear simultaneous differential equations in n unknown functions can be written simply as $y' = Ay$, with say y(0) given as initial data. The solution of this system of differential equations is $y(t) = e^{At} y(0)$, where the matrix $e^{At}$ is calculated by writing $A = P\Lambda P^{-1}$ if possible, and then putting $e^{At} = P e^{\Lambda t} P^{-1}$.

Hence, whenever we can ﬁnd the spectral representation of a matrix A, we can calculate

functions of the matrix and can solve diﬀerential equations that involve the matrix.
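As an illustration (a worked example added here, using the 2 × 2 matrix of (3.7.6) and an initial vector chosen arbitrarily for the purpose), the recipe gives the matrix exponential and the solution explicitly:

```latex
e^{At} = P e^{\Lambda t} P^{-1}
= \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}
  \begin{bmatrix} e^{2t} & 0 \\ 0 & e^{4t} \end{bmatrix}
  \begin{bmatrix} 1/2 & 1/2 \\ 1/2 & -1/2 \end{bmatrix}
= \frac{1}{2}
  \begin{bmatrix}
  e^{2t} + e^{4t} & e^{2t} - e^{4t} \\
  e^{2t} - e^{4t} & e^{2t} + e^{4t}
  \end{bmatrix},
\qquad
y(0) = \begin{bmatrix} 1 \\ 0 \end{bmatrix}
\;\Longrightarrow\;
y(t) = \frac{1}{2}\begin{bmatrix} e^{2t} + e^{4t} \\ e^{2t} - e^{4t} \end{bmatrix}.
```

One can check directly that $y_1' = e^{2t} + 2e^{4t} = 3y_1 - y_2$, as the first equation of the system requires.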

So, when can we find a spectral representation of a given n × n matrix A? If we can find a set of n linearly independent eigenvectors for A, then all we need to do is to arrange them in the columns of a new matrix P. Then P will be invertible, and we’ll be all finished. Conversely, if we somehow have found a spectral representation of A à la (3.7.8), then the columns of P obviously do comprise a set of n independent eigenvectors of A.

That changes the question. What kind of an n × n matrix A has a set of n linearly independent eigenvectors? This is quite a hard problem, and we won’t answer it completely. Instead, we give an example of a matrix that does not have as many independent eigenvectors as it “ought to,” and then we’ll specialize our discussion to a kind of matrix that is guaranteed to have a spectral representation.

For an example we don’t have to look any further than

$$A = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}. \tag{3.7.13}$$

The reader will have no diﬃculty in checking that this matrix has just one eigenvalue, λ = 0,

and that corresponding to that eigenvalue there is just one independent eigenvector, and

therefore there is no spectral representation of this matrix.
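For readers who would like to see the check written out, here is one way to do it (a short verification added for illustration):

```latex
\det(A - \lambda I)
= \det\begin{bmatrix} -\lambda & 1 \\ 0 & -\lambda \end{bmatrix}
= \lambda^{2} = 0
\;\Longrightarrow\; \lambda = 0,
\qquad
A \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
= \begin{bmatrix} x_2 \\ 0 \end{bmatrix}
= \begin{bmatrix} 0 \\ 0 \end{bmatrix}
\;\Longrightarrow\; x_2 = 0 .
```

Every eigenvector is therefore a multiple of [1, 0], so no invertible matrix P of eigenvectors can be assembled.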

Now first we’re going to devote our attention to the real symmetric matrices, i.e., to matrices A for which $A_{ij} = A_{ji}$ for all i, j = 1, . . . , n. These matrices occur in many

important applications, and they always have a spectral representation. Indeed, much more

is true, as is shown by the following fundamental theorem of the subject, whose proof

is deferred to section 3.9, where it will emerge (see Theorem 3.9.1) as a corollary of an

algorithm.

Theorem 3.7.1 (The Spectral Theorem) Let A be an n × n real symmetric matrix. Then the eigenvalues and eigenvectors of A are real. Furthermore, we can always find a set of n eigenvectors of A that are pairwise orthogonal to each other (so they are surely independent).

Recall that the eigenvectors of the symmetric 2 × 2 matrix (3.7.2) were [1, 1] and [1, −1], and these are indeed orthogonal to each other, though we didn’t comment on it at the time.

We’re going to follow a slightly unusual route now, that will lead us simultaneously to

a proof of the fundamental theorem (the “spectral theorem”) above, and to a very elegant


computer algorithm, called the method of Jacobi, for the computation of eigenvalues and

eigenvectors of real symmetric matrices.

In the next section we will introduce a very special family of matrices, ﬁrst studied by

Jacobi, and we will examine their properties in some detail. Once we understand these

properties, a proof of the spectral theorem will appear, with almost no additional work.

Following that we will show how the algorithm of Jacobi can be implemented on a

computer, as a fast and pretty program in which all of the eigenvalues and eigenvectors

of a real symmetric matrix are found simultaneously, and are delivered to your door as an

orthogonal set.

Throughout these algorithms certain themes will recur. Speciﬁcally, we will see several

situations in which we have to compute a certain angle and then carry out a rotation of

space through that angle. Since the themes occur so often we are going to abstract from

them certain basic modules of algorithms that will be used repeatedly.

This choice will greatly simplify the preparation of programs, but at a price, namely

that each module will not always be exactly optimal in terms of machine time for execution

in each application, although it will be nearly so. Consequently it was felt that the price

was worth the beneﬁt of greater universality. We’ll discuss these points further, in context,

as they arise.

3.8 The orthogonal matrices of Jacobi

A matrix P is called an orthogonal matrix if it is real, square, and if $P^{-1} = P^{T}$, i.e., if $P^{T}P = PP^{T} = I$. If we visualize the way a matrix is multiplied by its transpose, it will be

clear that an orthogonal matrix is one in which each of the rows (columns) is a unit vector

and any two distinct rows (columns) are orthogonal to each other.

For example, the 2 × 2 matrix

$$\begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix} \tag{3.8.1}$$

is an orthogonal matrix for every real θ.
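The claim can be confirmed by multiplying the matrix (3.8.1) by its transpose; the computation below is added as a quick check.

```latex
\begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix}
\begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}
=
\begin{bmatrix}
\cos^{2}\theta + \sin^{2}\theta & -\cos\theta\sin\theta + \sin\theta\cos\theta \\
-\sin\theta\cos\theta + \cos\theta\sin\theta & \sin^{2}\theta + \cos^{2}\theta
\end{bmatrix}
= I .
```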

We will soon prove that a real symmetric matrix always has a set of n pairwise orthogonal eigenvectors. If we take such a set of vectors, normalize them by dividing each by its length, and arrange them in the consecutive columns of a matrix P, then P will be an orthogonal matrix, and further we will have AP = PΛ. Since $P^{T} = P^{-1}$, we can multiply on the right by $P^{T}$ and obtain

$$A = P\Lambda P^{T}, \tag{3.8.2}$$

and this is the spectral theorem for a symmetric matrix A.

Conversely, if we can find an orthogonal matrix P such that $P^{T}AP$ is a diagonal matrix D, then we will have found a complete set of pairwise orthogonal eigenvectors of A (the columns of P), and the eigenvalues of A (on the diagonal of D).


In this section we are going to describe a numerical procedure that will ﬁnd such an

orthogonal matrix, given a real symmetric matrix A. As soon as we prove that the method

works, we will have proved the spectral theorem at the same time. Hence the method is

of theoretical as well as algorithmic importance. It is important to notice that we will

not have to ﬁnd an eigenvalue, then ﬁnd a corresponding eigenvector, then ﬁnd another

eigenvalue and another vector, etc. Instead, the whole orthogonal matrix whose columns

are the desired vectors will creep up on us at once.

The first thing we have to do is to describe some special orthogonal matrices that will be used in the algorithm. Let n, p and q be given positive integers, with n ≥ 2 and p ≠ q, and let θ be a real number. We define the matrix $J_{pq}(\theta)$ by saying that J is just like the n × n identity matrix except that in the four positions that lie at the intersections of rows and columns p and q we find the entries (3.8.1).

More precisely, $J_{pq}(\theta)$ has in position [p, p] the entry cos θ, it has sin θ in the [p, q] entry, −sin θ in the [q, p] entry, cos θ in entry [q, q], and otherwise it agrees with the identity matrix, as shown below:

$$
\begin{matrix} \\ \\ \text{row } p \\ \\ \text{row } q \\ \\ \\ \end{matrix}
\begin{bmatrix}
1 & \cdots & 0 & \cdots & 0 & \cdots & 0 \\
\vdots & \ddots & \vdots & & \vdots & & \vdots \\
0 & \cdots & \cos\theta & \cdots & \sin\theta & \cdots & 0 \\
\vdots & & \vdots & \ddots & \vdots & & \vdots \\
0 & \cdots & -\sin\theta & \cdots & \cos\theta & \cdots & 0 \\
\vdots & & \vdots & & \vdots & \ddots & \vdots \\
0 & \cdots & 0 & \cdots & 0 & \cdots & 1
\end{bmatrix}.
\tag{3.8.3}
$$
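The entry-by-entry description above translates directly into a few lines of Maple. The following is only a sketch: the procedure name jacobimat is made up for this illustration and is not one of the programs developed in the text.

```maple
# Hypothetical helper (not part of the text's programs): returns J_pq(theta)
# as an n x n Matrix, following the entry-by-entry description above.
jacobimat := proc(n, p, q, theta)
    local J;
    # start from the n x n identity matrix
    J := Matrix(n, n, (r, c) -> `if`(r = c, 1, 0));
    # overwrite the four entries at the intersections of rows/columns p and q
    J[p, p] := cos(theta);
    J[p, q] := sin(theta);
    J[q, p] := -sin(theta);
    J[q, q] := cos(theta);
    RETURN(J)
end:

# Example: n = 4, rotation in the plane of coordinates 2 and 4
J := jacobimat(4, 2, 4, Pi/6):
# J times its transpose should reduce to the 4 x 4 identity matrix
map(simplify, J . LinearAlgebra:-Transpose(J));
```

The final line is a quick orthogonality check; any other angle and any other pair p ≠ q should pass it equally well.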

Not only is $J_{pq}(\theta)$ an orthogonal matrix, there is a reasonably pleasant way to picture its action on n-dimensional space. Since the 2 × 2 matrix of (3.8.1) is the familiar rotation of the plane through an angle θ, we can say that the matrix $J_{pq}(\theta)$ carries out a special kind of rotation of n-dimensional space, namely one in which a certain plane, the plane of the pth and qth coordinate, is rotated through the angle θ, and the remaining coordinates are all left alone. Hence $J_{pq}(\theta)$ carries out a two-dimensional rotation of n-dimensional space.

These matrices of Jacobi turn out to be useful in a host of numerical algorithms for the

eigenproblem. The ﬁrst application that we’ll make of them will be to the real symmetric

matrices, but later we’ll ﬁnd that the same two-dimensional rotations will play important

roles in the solution of non-symmetric problems as well.

First, let’s see how they can help us with symmetric matrices. What we propose to do is the following. If a real symmetric matrix A is given, we will determine p, q, and the angle θ in such a way that the matrix $JAJ^{T}$ is a little bit more diagonal (whatever that means!) than A is. It turns out that this can always be done, at any rate unless A is already diagonal, so we will have the germ of a numerical procedure for computing eigenvalues and eigenvectors.


Indeed, suppose we have found out how to determine such an angle θ, and let’s then see what the whole process would look like. Starting with A, we would find p, q, and θ, and then the matrix $JAJ^{T}$ is somehow a little more diagonal than A was. Now $JAJ^{T}$ is still a symmetric matrix (try to transpose it and see what happens) so we can do it again. After finding another p, q and θ we will have $J''J'A(J''J')^{T}$ a bit “more diagonal” (here $J'$ and $J''$ denote the first and second rotation matrices) and so forth.

Now suppose that after some large number of repetitions of this process we ﬁnd that

the current matrix is very diagonal indeed, so that perhaps aside from roundoﬀ error it is

a diagonal matrix D. Then we will know that

$$D = (\text{product of all } J\text{’s used})\; A\; (\text{product of all } J\text{’s used})^{T}. \tag{3.8.4}$$

If we let P denote the product of all J’s used, then we have $PAP^{T} = D$, so the rows of P (that is, the columns of $P^{T}$, since $AP^{T} = P^{T}D$) will be the (approximate) eigenvectors of A and the diagonal elements of D will be its eigenvalues. The matrix P will automatically be an orthogonal matrix, since it is the product of such matrices, and the product of orthogonal matrices is always orthogonal (proof?).
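Both of the small facts used in this outline follow from the rule $(XY)^{T} = Y^{T}X^{T}$. The lines below are one way to write them out, with P and Q standing for any two orthogonal matrices of the same size (Q is a symbol introduced only for this check):

```latex
(PQ)^{T}(PQ) = Q^{T}P^{T}P\,Q = Q^{T}Q = I,
\qquad
(JAJ^{T})^{T} = (J^{T})^{T}A^{T}J^{T} = JAJ^{T}
\quad (\text{since } A^{T} = A).
```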

That, at any rate, is the main idea of Jacobi’s method (he introduced it in order to

study planetary orbits!). Let’s now ﬁll in the details.

First we’ll deﬁne what we mean by “more diagonal”. For any square, real matrix A, let

Od(A) denote the sum of the squares of the oﬀ-diagonal entries of A. From now on, instead

of “B is more diagonal than A,” we’ll be able to say Od(B) < Od(A), which is much more

professional.
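Since Od will be needed as a stopping test later on, it may help to see it as a procedure. This is a minimal sketch; the name offdiag and its calling sequence are assumptions made for this illustration, not part of the text's programs.

```maple
# Hypothetical helper, modelled on the text's Od(A): the sum of the squares
# of the off-diagonal entries of an n x n matrix M.
offdiag := proc(M, n)
    local i, j, total;
    total := 0;
    for i from 1 to n do
        for j from 1 to n do
            if i <> j then total := total + M[i, j]^2 fi
        od
    od;
    RETURN(total)
end:

# The 2 x 2 matrix of (3.7.6) has off-diagonal entries -1 and -1
offdiag(Matrix([[3, -1], [-1, 3]]), 2);
```

For this matrix the two off-diagonal entries each contribute 1, so the call returns 2.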

Now we claim that if A is a real symmetric matrix, and it is not already a diagonal matrix, then we can find p, q and θ such that $\mathrm{Od}(J_{pq}(\theta)\,A\,J_{pq}(\theta)^{T}) < \mathrm{Od}(A)$. We’ll do this by a very direct computation of the elements of $JAJ^{T}$ (we’ll need them anyway for the computer program), and then we will be able to see what the new value of Od is.

So fix p, q and θ. Then by direct multiplication of the matrix in (3.8.3) by A we find that

$$
(JA)_{ij} =
\begin{cases}
(\cos\theta)A_{pj} + (\sin\theta)A_{qj} & \text{if } i = p \\
-(\sin\theta)A_{pj} + (\cos\theta)A_{qj} & \text{if } i = q \\
A_{ij} & \text{otherwise.}
\end{cases}
\tag{3.8.5}
$$

Then after one more multiplication, this time on the right by the transpose of the matrix in (3.8.3), we find that

$$
(JAJ^{T})_{ij} =
\begin{cases}
CA_{kp} + SA_{kq} & \text{if } (i,j) = (k,p) \text{ or } (p,k), \text{ with } k \notin \{p,q\} \\
-SA_{kp} + CA_{kq} & \text{if } (i,j) = (k,q) \text{ or } (q,k), \text{ with } k \notin \{p,q\} \\
C^{2}A_{pp} + 2SCA_{pq} + S^{2}A_{qq} & \text{if } i = j = p \\
S^{2}A_{pp} - 2SCA_{pq} + C^{2}A_{qq} & \text{if } i = j = q \\
CS(A_{qq} - A_{pp}) + (C^{2} - S^{2})A_{pq} & \text{if } i = p,\ j = q \text{ or } i = q,\ j = p \\
A_{ij} & \text{otherwise.}
\end{cases}
\tag{3.8.6}
$$

In (3.8.6) we have written C for cos θ and S for sin θ, and we have used the symmetry of A.

Now we are going to choose the angle θ so that the elements $A_{pq}$ and $A_{qp}$ are reduced to zero, assuming that they were not already zero. To do this we refer to the formula in (3.8.6) for $A_{pq}$, equate it to zero, and then solve for θ. The result is

$$\tan 2\theta = \frac{2A_{pq}}{A_{pp} - A_{qq}} \tag{3.8.7}$$

and we will choose the value of θ that lies between −π/4 and π/4.
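The step from (3.8.6) to (3.8.7) uses only the double-angle formulas $2CS = \sin 2\theta$ and $C^{2} - S^{2} = \cos 2\theta$. Written out (as an added intermediate step, with the same symbols as above):

```latex
CS\,(A_{qq} - A_{pp}) + (C^{2} - S^{2})\,A_{pq} = 0
\;\Longleftrightarrow\;
\tfrac{1}{2}\sin 2\theta\,(A_{qq} - A_{pp}) + \cos 2\theta\,A_{pq} = 0
\;\Longleftrightarrow\;
\tan 2\theta = \frac{2A_{pq}}{A_{pp} - A_{qq}} .
```

The last step assumes $A_{pp} \neq A_{qq}$. If the two diagonal entries happen to be equal, the condition reduces to $\cos 2\theta = 0$, and θ = ±π/4 does the job.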

With this value of θ, we will have reduced one single off-diagonal element of A to zero in the new symmetric matrix $JAJ^{T}$.

The full Jacobi algorithm consists in repeatedly executing these plane rotations, each time choosing the largest off-diagonal element $A_{pq}$ and annihilating it by the choice (3.8.7) of θ. After each rotation, Od(A) will be a little smaller than it was before. We will prove that Od(A) converges to zero.

It is important to note that a plane rotation that annihilates $A_{pq}$ may “revive” some other $A_{rs}$ that was set to zero by an earlier plane rotation. Hence we should not think of the zero as “staying put”.

Let’s now see exactly what happens to Od(A) after a single rotation. If we sum the squares of all of the off-diagonal elements of $JAJ^{T}$ using the formulas (3.8.6), but remembering that the new $A_{pq} = 0$, then it’s quite easy to check that the new sum of squares is exactly equal to the old sum of squares minus the squares of the two entries $A_{pq}$ and $A_{qp}$ that were reduced to zero. Hence we have

Theorem 3.8.1 Let A be an n × n real, symmetric matrix that is not diagonal. If $A_{pq} \neq 0$ for some p ≠ q, then we can choose θ as in equation (3.8.7) so that if $J = J_{pq}(\theta)$ then

$$\mathrm{Od}(JAJ^{T}) = \mathrm{Od}(A) - 2A_{pq}^{2} < \mathrm{Od}(A). \tag{3.8.8}$$
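The check behind (3.8.8) rests on one identity, spelled out here as an added intermediate step. For each k outside {p, q}, the two new entries in columns p and q of row k have the same sum of squares as the old ones, because the cross terms cancel and $C^{2} + S^{2} = 1$:

```latex
(CA_{kp} + SA_{kq})^{2} + (-SA_{kp} + CA_{kq})^{2}
= (C^{2} + S^{2})\,(A_{kp}^{2} + A_{kq}^{2})
= A_{kp}^{2} + A_{kq}^{2} .
```

The same holds for the entries in rows p and q by symmetry, and every other off-diagonal entry is unchanged. The only net change is therefore the loss of $A_{pq}^{2} + A_{qp}^{2} = 2A_{pq}^{2}$.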

3.9 Convergence of the Jacobi method

We have now described the fundamental operation of the Jacobi algorithm, namely the

plane rotation in n-dimensional space that sends a real symmetric matrix A into $JAJ^{T}$,

and we have explicit formulas for the new matrix elements. There are still a number of

quite substantive points to discuss before we will be able to assemble an eﬃcient program

for carrying out the method. However, in line with the philosophy that it is best to break

up large programs into small, manageable chunks, we are now ready to prepare the ﬁrst

module of the Jacobi program.

What we want is to be able to execute the rotation through an angle θ, according to the

formulas (3.8.6) of the previous section. This could be accomplished by a single subroutine

that would take the symmetric matrix A and the sine and cosine of the rotation angle, and

execute the operation (3.8.6).

If we keep later applications in mind, then the best choice for a module will be one that will, on demand, multiply a given not-necessarily-symmetric matrix on the left by J or on the right by $J^{T}$, depending on the call. This is one of the situations we referred to earlier where the most universal choice of subroutine will not be the most economical one in every application, but we will get a lot of mileage out of this routine!

Hence, suppose we are given

1. an n × n real matrix A

2. the sine S and cosine C of a rotation angle

3. the plane [p, q] of the rotation, and

4. a parameter option that will equal 1 if we want to do JA, and 2 if we want $AJ^{T}$.

The procedure will be called

Procedure rotate(s,c,p,q,option);

What the procedure will do is exactly this. If called with option = 1, it will multiply A on the left by J, according to the formulas (3.8.5), and exit. If option = 2, it will multiply A on the right by $J^{T}$. The Maple procedure is as follows:

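rotate:=proc(s,c,p,q,opt) local j,temp;
global A,n;
# s, c: sine and cosine of the rotation angle; [p,q]: plane of the rotation
if opt=1 then
  # A := J*A, which changes only rows p and q (formulas (3.8.5))
  for j from 1 to n do
    temp:=evalf(c*A[p,j]+s*A[q,j]);
    A[q,j]:=evalf(-s*A[p,j]+c*A[q,j]);
    A[p,j]:=temp;
  od
else
  # A := A*J^T, which changes only columns p and q
  for j from 1 to n do
    temp:=evalf(c*A[j,p]+s*A[j,q]);
    A[j,q]:=evalf(-s*A[j,p]+c*A[j,q]);
    A[j,p]:=temp;
  od
fi;
RETURN()
end: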

To carry out one iteration of the Jacobi method, we will have to call rotate twice, once with option = 1 and then with option = 2.

The amount of computational labor that is done by this module is O(n) per call, since only two lines of the matrix are affected by its operation.
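As a usage sketch, here is one complete Jacobi step applied to the 2 × 2 matrix of (3.7.6). It assumes that the procedure rotate above has already been entered. Because the two diagonal entries are equal, the angle is taken to be θ = π/4, which makes the new [1, 2] entry in (3.8.6) vanish.

```maple
# Assumes the procedure rotate from the text has been read in already.
# A and n are the globals that rotate expects.
A := Matrix([[3., -1.], [-1., 3.]]):
n := 2:
# theta = Pi/4, so sine and cosine are both 1/sqrt(2)
s := evalf(sqrt(2)/2):
c := evalf(sqrt(2)/2):
rotate(s, c, 1, 2, 1):    # A := J A      (rows 1 and 2 change)
rotate(s, c, 1, 2, 2):    # A := A J^T    (columns 1 and 2 change)
print(A);
```

Up to roundoff the printed matrix should be diagonal, with entries close to 2 and 4, the eigenvalues found earlier.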

Next, let’s prove that the results of applying one rotation after another do in fact

converge to a diagonal matrix.


Theorem 3.9.1 Let A be a real symmetric matrix. Suppose we follow the strategy of searching for the off-diagonal element of largest absolute value, choosing θ so as to zero out that element by carrying out a Jacobi rotation on A, and then repeating the whole process on the resulting matrix, etc. Then the sequence of matrices that is thereby obtained approaches a diagonal matrix D.

Proof. At a certain stage of the iteration, let $A_{pq}$ denote the off-diagonal element of largest absolute value. Since the maximum of any set of numbers is at least as big as the average of that set (proof?), it follows that the maximum of the squares of the off-diagonal elements of A is at least as big as the average square of an off-diagonal element. The average square is equal to the sum of the squares of the off-diagonal elements divided by the number of such elements, i.e., divided by n(n − 1). Hence the average square is exactly Od(A)/(n(n − 1)), and therefore

$$A_{pq}^{2} \ge \frac{\mathrm{Od}(A)}{n(n-1)}. \tag{3.9.1}$$

Now the effect of a single rotation of the matrix is to reduce Od(A) by $2A_{pq}^{2}$, so equation (3.8.8) yields

$$
\mathrm{Od}(JAJ^{T}) = \mathrm{Od}(A) - 2A_{pq}^{2}
\le \mathrm{Od}(A) - \frac{2\,\mathrm{Od}(A)}{n(n-1)}
= \left(1 - \frac{2}{n(n-1)}\right)\mathrm{Od}(A).
\tag{3.9.2}
$$

Hence a single rotation will reduce Od(A) by a multiplicative factor of 1 −2/(n(n −1)) at

least. Since this factor is less than 1, it follows that the sum of squares of the oﬀ-diagonal

entries approaches zero as the number of plane rotations grows without bound, completing

the proof.

The proof told us even more since it produced a quantitative estimate of the rate at which Od(A) approaches zero. Indeed, after r rotations, the sum of squares will have dropped to at most

$$\left(1 - \frac{2}{n(n-1)}\right)^{r} \mathrm{Od}(\text{original } A). \tag{3.9.3}$$

If we put r = n(n − 1)/2, then we see that Od(A) has dropped by at least a factor of (approximately) e. Hence, after doing an average of one rotation per off-diagonal element, the function Od(A) is no more than 1/e times its original value. After doing an average of, say, t rotations per off-diagonal element (i.e., tn(n − 1)/2 rotations), the function Od(A) will have dropped to about $e^{-t}$ times its original value. If we want it to drop to, say, $10^{-m}$ times its original value then we can expect to need no more than about m(ln 10)n(n − 1)/2 rotations.
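The two facts behind these estimates, together with a sample evaluation, can be written out as follows. The abbreviation N and the choice n = 10, m = 12 are introduced here only for illustration.

```latex
\left(1 - \frac{1}{N}\right)^{N} < e^{-1}
\quad\text{with } N = \frac{n(n-1)}{2},
\qquad
e^{-t} \le 10^{-m} \iff t \ge m \ln 10 ,
\qquad
n = 10,\ m = 12:\quad
12\,(\ln 10)\cdot 45 \approx 27.63 \times 45 \approx 1243 \text{ rotations.}
```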

To put it in very concrete terms, suppose we’re working in double precision (12-digit) arithmetic and we are willing to decree that convergence has taken place if Od has been reduced by a factor of $10^{-12}$. Then at most $12(\ln 10)\,n(n-1)/2 < 6(\ln 10)\,n^{2} \approx 13.8\,n^{2}$ rotations will have to be done. Of course in practice we will be watching the function Od(A) as it drops, so there won’t be any need to know in advance how many iterations are needed. We can stop when the actual observed value is small enough. Still, it’s comforting to know that at most $O(n^{2})$ iterations will be enough to do the job.

Now let’s re-direct our thoughts to the grand iteration process itself. At each step

we apply a rotation matrix to the current symmetric matrix in order to make it “more

diagonal”. At the same time, of course, we must keep track of the product of all of the

rotation matrices that we have so far used, because that is the matrix that ultimately will

be an orthogonal matrix with the eigenvectors of A across its rows.

Let’s watch this happen. Begin with A. After one rotation we have $J_{1}AJ_{1}^{T}$, after two iterations we have $J_{2}J_{1}AJ_{1}^{T}J_{2}^{T}$, after three we have $J_{3}J_{2}J_{1}AJ_{1}^{T}J_{2}^{T}J_{3}^{T}$, etc. After all iterations have been done, and we are looking at a matrix that is “diagonal enough” for our purposes, the matrix we see is $PAP^{T} = D$, where P is obtained by starting with the identity matrix and multiplying successively on the left by the rotational matrices J that are used, and D is (virtually) diagonal.

Since $PAP^{T} = D$, we have $AP^{T} = P^{T}D$, so the columns of $P^{T}$, or equivalently the rows of P, are the eigenvectors of A.

Now, we have indeed proved that the repeated rotations will diagonalize A. We have not

proved that the matrices P themselves converge to a certain ﬁxed matrix. This is true, but

we omit the proof. One thing we do want to do, however, is to prove the spectral theorem,

Theorem 3.7.1, itself, since we have long since done all of the work.

Proof (of the Spectral Theorem 3.7.1): Consider the mapping f that associates with every orthogonal matrix P the matrix $f(P) = P^{T}AP$. The set of orthogonal matrices is compact, and the mapping f is continuous. Hence the image of the set of orthogonal matrices under f is compact. Hence there is a matrix D in that image that minimizes the continuous function Od over the image, i.e., $D = P^{T}AP$ for some orthogonal P that minimizes $\mathrm{Od}(P^{T}AP)$. Suppose D is not diagonal. Then we could find a Jacobi rotation that would produce another matrix in the image whose Od would be lower, which is a contradiction (of the fact that Od(D) was minimal). Hence D is diagonal. So there is an orthogonal matrix P such that $P^{T}AP = D$, i.e., AP = PD. Hence the columns of P are n pairwise orthogonal eigenvectors of A, and the proof of the spectral theorem is complete.

Now let’s get on with the implementation of the algorithm.

3.10 Corbató’s idea and the implementation of the Jacobi algorithm

It’s time to sit down with our accountants and add up the costs of the Jacobi method. First, we have seen that $O(n^{2})$ rotations will be sufficient to reduce the off-diagonal sum of squares below some pre-assigned threshold level. Now, what is the price of a single rotation? Here are the steps:

(i) Search for the off-diagonal element having the largest absolute value. The cost seems to be equal to the number of elements that have to be looked at, namely n(n − 1)/2, which we abbreviate as $O(n^{2})$.

(ii) Calculate θ, sin θ and cos θ, and then carry out a rotation on the matrix A. This costs O(n), since only four lines of A are changed.

(iii) Update the matrix P of eigenvectors by multiplying it by the rotation matrix. Since only two rows of P change, this cost is O(n) also.

The longest part of the job is the search for the largest oﬀ-diagonal element. The search

is n times as expensive in time as either the rotation of A or the update of the eigenvector

matrix.

For this reason, in the years since Jacobi first described his algorithm, a number of other strategies for dealing with the eigenvalue problem have been worked out. One of these is called the cyclic Jacobi method. In that variation, one does not search, but instead goes marching through the matrix one element at a time. That is to say, first do a rotation that reduces $A_{12}$ to zero. Next do a rotation that reduces $A_{13}$ to zero (of course, $A_{12}$ doesn’t stay put, but becomes nonzero again!). Then do $A_{14}$ and so forth, returning to $A_{12}$ again after $A_{n-1,n}$, cycling as long as necessary. This method avoids the search, but the proof that it converges at all is quite complex, and the exact rate of convergence is unknown.

A variation on the cyclic method is called the threshold Jacobi method, in which we

go through the entries cyclically as above, but we do not carry out a rotation unless the

magnitude of the current matrix entry exceeds a certain threshold (“throw it back, it’s too

small”). This method also has an uncertain rate of convergence.

At a deeper level, two newer methods due to Givens and Householder have been developed. These methods work not by trying to diagonalize A by rotations, but instead by trying to tri-diagonalize A by rotations. A tri-diagonal matrix is one whose entries are all zero except for those on the diagonal, the sub-diagonal and the super-diagonal (i.e., $A_{ij} = 0$ unless |i − j| ≤ 1).

The advantage of tri-diagonalization is that it is a ﬁnite process: it can be done in such

a way that elements, once reduced to zero, stay zero instead of bouncing back again as

they do in the Jacobi method. The disadvantage is that, having arrived at a tri-diagonal

matrix, one is not ﬁnished, but instead one must then confront the question of obtaining

the eigenvalues and eigenvectors of a tri-diagonal matrix, a nontrivial operation.

One of the reasons for the wide use of the Givens and Householder methods has been that they get the answers in just $O(n^{3})$ time, instead of the $O(n^{4})$ time in which the original Jacobi method operates.

Thanks to a suggestion of Corbató (F. J. Corbató, On the coding of Jacobi’s method for computing eigenvalues and eigenvectors of symmetric matrices, JACM, 10: 123-125, 1963), however, it is now easy to run the original Jacobi method in $O(n^{3})$ time also. The suggestion is all the more remarkable because of its simplicity and the fact that it lives at the software level, rather than at the mathematical level. What it does is just this: it allows us to use the largest off-diagonal entry at each stage while paying a price of a mere O(n) for the privilege, instead of the $O(n^{2})$ billing mentioned above.

We can do this by carrying along an additional linear array during the calculation. The ith entry of this array, say loc[i], contains the number of a column in which an off-diagonal element of largest absolute value in row i lives, i.e., $A_{i,\mathrm{loc}(i)}$ is an entry in row i of A of largest absolute value in that row.

Now of course if some benefactor is kind enough to hand us this array, then it would be a simple matter to find the biggest off-diagonal element of the entire matrix A. We would just look through the n − 1 numbers $|A_{i,\mathrm{loc}(i)}|$ (i = 1, 2, . . . , n − 1) to find the largest one. If the largest one is the one where i = p, say, then the desired matrix element would be $A_{p,\mathrm{loc}(p)}$. Hence the cost of using this array is O(n).
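The O(n) scan just described might look as follows in Maple. This is only a sketch under stated assumptions: the procedure name findmax and its calling sequence (matrix, loc array and n passed as arguments rather than as globals) are choices made for this illustration.

```maple
# Hypothetical helper (an assumption, not the text's own code): scan the
# n-1 row champions recorded in loc and return the position [p, q] of an
# off-diagonal entry of largest absolute value, at a cost of O(n).
findmax := proc(A, loc, n)
    local i, p, big;
    p := 1;
    big := abs(A[1, loc[1]]);
    for i from 2 to n-1 do
        if abs(A[i, loc[i]]) > big then
            big := abs(A[i, loc[i]]);
            p := i
        fi
    od;
    RETURN([p, loc[p]])
end:

# Example with a 3 x 3 symmetric matrix; loc records, for rows 1 and 2,
# the column of the largest entry above the diagonal.
A := Matrix([[4, 1, -5], [1, 3, 2], [-5, 2, 6]]):
loc := [3, 3]:
findmax(A, loc, 3);
```

In this example the champion of row 1 is the entry −5 in column 3 and the champion of row 2 is the entry 2 in column 3, so the call returns [1, 3].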

Since there are no such benefactors as described above, we are going to have to pay a

price for the care and feeding of this array. How much does it cost to create and to maintain

it?

Initially, we just search through the whole matrix and set it up. This clearly costs $O(n^{2})$ operations, but we pay just once. Now let’s turn to a typical intermediate stage in the calculation, and see what the price is for updating the loc array.

Given an array loc, suppose now that we carry out a single rotation on the matrix

A. Precisely how do we go about modifying loc so it will correspond to the new matrix?

The rotated matrix diﬀers from the previous matrix in exactly two rows and two columns.

Certainly the two rows, p and q, that have been completely changed will simply have to be

searched again in order to ﬁnd the new values of loc. This costs O(n) operations.

What about the other n − 2 rows of the matrix? In the ith one of those rows exactly two entries were changed, namely the ones in the pth column and in the qth column. Suppose the largest element that was previously in row i was not in either the pth or the qth column, i.e., suppose loc[i] ∉ {p, q}. Then that previous largest element will still be there in the rotated matrix. In that case, in order to discover the new entry of largest absolute value in row i we need only compare at most three numbers: the new $|A_{ip}|$, the new $|A_{iq}|$, and the old $|A_{i,\mathrm{loc}(i)}|$. The column in which the largest of these three numbers is found will be the new loc[i]. The price paid is at most three comparisons, and this does the updating job in every row except those that happen to have had loc[i] ∈ {p, q}.

In the latter case we can still salvage something. If we are replacing the entry of previously largest absolute value in the row i, we might after all get lucky and replace it with an even larger number, in which case we would again know the new loc[i]. Christmas, however, comes just once a year, and since the general trend of the off-diagonal entries is downwards, and the previous largest entry was uncommonly large, most of the time we’ll be replacing the former largest entry with a smaller entry. In that case we’ll just have to re-search the entire row to find the new champion.

The number of rows that must be searched in their entireties in order to update the loc array is therefore at most two (for rows p and q) plus the number of rows i in which loc[i] happens to be equal to p or to q. It is reasonable to expect that the probability of the event loc[i] ∈ {p, q} is about two chances out of n, since it seems that it ought not to be any more likely that the winner was previously in those two columns than in any other columns. This has never been in any sense proved, but we will assume that it is so. Then the expected number of rows that will have to be completely searched will be about four, on the average (the pth, the qth, and an average of about two others).


It follows that the expected cost of maintaining the loc array is O(n) per rotation. The cost of finding the largest off-diagonal element has therefore been reduced to O(n) per rotation, after all bills have been paid. Hence the cost of finding that element is comparable with all of the other operations that go on in the algorithm, and it poses no special problem.

Using Corbató’s suggestion, and subject to the equidistribution hypothesis mentioned above, the cost of the complete Jacobi algorithm for eigenvalues and eigenvectors is $O(n^{3})$. We show below the complete Maple procedure for updating the array loc immediately after a rotation has been done in the plane of p and q.

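update:=proc(p,q) local i,r;
global loc,A,n;
for i from 1 to n-1 do
  if i=p or i=q then
    # rows p and q changed completely, so search them afresh
    searchrow(i)
  else
    r:=loc[i];
    if r=p or r=q then
      # the former champion of row i was itself changed; its old value was
      # not saved, so the row is simply searched again
      searchrow(i)
    else
      # the former champion is intact; only the new entries in columns p
      # and q (where they lie above the diagonal) can displace it
      if p>i and abs(A[i,p])>abs(A[i,loc[i]]) then loc[i]:=p; fi;
      if q>i and abs(A[i,q])>abs(A[i,loc[i]]) then loc[i]:=q; fi;
    fi;
  fi;
od;
RETURN();
end: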

The above procedure uses a small auxiliary routine:

Procedure searchrow(i)

This procedure searches the portion of row i of the n × n matrix A that lies above the main diagonal and places the index of a column that contains an entry of largest absolute value in loc[i].

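searchrow:=proc(i)
local j,bigg;
global loc,A,n,P;
# scan the entries of row i above the diagonal (columns i+1..n)
bigg:=0;
loc[i]:=i+1;  # default, so that loc[i] is defined even if the whole row is zero
for j from i+1 to n do
  if abs(A[i,j])>bigg then
    bigg:=abs(A[i,j]);loc[i]:=j;fi;
od;
RETURN();
end: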

We should mention that the search can be speeded up a little bit more by using a data

structure called a heap. What we want is to store the locations of the biggest elements in

each row in such a way that we can quickly access the biggest of all of them.


If we had stored the set of winners in a heap, or priority queue, then we would have

been able to ﬁnd the overall winner in a single step, and the expense would have been the

maintenance of the heap structure, at a cost of a mere O(log n) operations. This would have

reduced the program overhead by an additional twenty percent or thereabouts, although the

programming job itself would have gotten harder.

To learn about heaps and how to use them, consult books on data structures, computer

science, and discrete mathematics.

3.11 Getting it together

The various pieces of the procedure that will ﬁnd the eigenvalues and eigenvectors of a

real symmetric matrix by the method of Jacobi are now in view. It’s time to discuss the

assembly of those pieces.

Procedure jacobi(eps,dgts)

The input to the jacobi procedure is the $n \times n$ matrix A, and a parameter eps that we will use as a reduction factor to test whether or not the current matrix is “diagonal enough,” so that the procedure can terminate its operation. Further input is dgts, which is the number of significant digits that you would like to be carried in the calculation.

The input matrix A itself is global, which is to say that its value is set outside of the jacobi procedure, and it is available to all procedures that are involved.

The output of procedure jacobi will be an orthogonal matrix P that will hold the eigenvectors of A in its rows, and a linear array eig that will hold the eigenvalues of A.

The input matrix A will be destroyed by the action of the procedure.

The first step in the operation of the procedure will be to compute the original off-diagonal sum of squares, test. The Jacobi process will halt when this sum of squares of all off-diagonal elements has been reduced to eps*test or less.

Next we set the matrix P to the $n \times n$ identity matrix, and initialize the array loc by calling the subroutine searchrow for each of rows 1 through n − 1 of A. This completes the initialization.

The remaining steps all get done while the sum of the squares of the oﬀ-diagonal entries

of the current matrix A exceeds eps times test.
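In symbols, the stopping rule just described reads as follows. The name Od(A) for the off-diagonal sum of squares and the superscript (0) for the input matrix are notation assumed here for illustration. The last line restates the bookkeeping step big:=big-x^2/2 of the program, where x = 2A_pq is taken before the rotation.

```latex
\begin{aligned}
\mathrm{Od}(A) &= \sum_{i\neq j} A_{ij}^2 = 2\sum_{i<j} A_{ij}^2,
  \qquad \mathtt{test} = \mathrm{Od}\bigl(A^{(0)}\bigr),\\
\text{halt when}\quad \mathrm{Od}(A) &\le \mathtt{eps}\cdot\mathtt{test},\\
\mathrm{Od}(A_{\mathrm{new}}) &= \mathrm{Od}(A_{\mathrm{old}}) - 2A_{pq}^2
  \quad\text{after a rotation that annihilates } A_{pq}.
\end{aligned}
```

The first equality uses the symmetry of A, which is why the program sums only over the entries above the diagonal and doubles them.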

We first find the largest off-diagonal element by searching through the numbers $A_{i,\mathrm{loc}(i)}$ for the largest in absolute value. If that is $A_{p,\mathrm{loc}(p)}$, we will then know $p$, $q = \mathrm{loc}(p)$, $A_{pq}$, $A_{pp}$ and $A_{qq}$, and it will be time to compute the sine and cosine of the rotation angle $\theta$. This is a fairly ticklish operation, since the Jacobi method is sensitive to small inaccuracies in these quantities. Also, note that we are going to calculate $\sin\theta$ and $\cos\theta$ without actually calculating $\theta$ itself. After careful analysis it turns out that the best formulas for the purpose are:

$$
\begin{aligned}
x &:= 2A_{pq}\\
y &:= A_{pp}-A_{qq}\\
t &:= \sqrt{x^2+y^2}\\
\sin\theta &:= \operatorname{sign}(xy)\left[\frac{1-|y|/t}{2}\right]^{1/2}\\
\cos\theta &:= \left[\frac{1+|y|/t}{2}\right]^{1/2}.
\end{aligned}
\tag{3.11.1}
$$
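As a consistency check on (3.11.1), one can confirm that these values satisfy the half-angle identities. The check below assumes that the rotation angle is fixed by tan 2θ = x/y, which is the condition under which a plane rotation acting as in the program's eigenvector update annihilates A_pq; that convention is set in the earlier sections and is taken here as an assumption. The numbers in the last line are a hypothetical example.

```latex
\begin{aligned}
\cos 2\theta &= \frac{|y|}{t}, \qquad
\cos^2\theta = \frac{1+\cos 2\theta}{2} = \frac{1+|y|/t}{2}, \qquad
\sin^2\theta = \frac{1-\cos 2\theta}{2} = \frac{1-|y|/t}{2},\\
\sin 2\theta &= 2\sin\theta\cos\theta
  = \operatorname{sign}(xy)\sqrt{1-\frac{y^2}{t^2}}
  = \operatorname{sign}(xy)\,\frac{|x|}{t},\\
\tan 2\theta &= \operatorname{sign}(xy)\,\frac{|x|}{|y|} = \frac{x}{y}
  \qquad (xy \neq 0),\\
A_{pq}=1,\ A_{pp}=3,\ A_{qq}=1
  &\;\Longrightarrow\; x=2,\ y=2,\ t=2\sqrt{2},\quad
  \cos\theta = \cos\tfrac{\pi}{8}\approx 0.92388,\quad
  \sin\theta = \sin\tfrac{\pi}{8}\approx 0.38268 .
\end{aligned}
```

Because cos 2θ is taken nonnegative, the angle satisfies |θ| ≤ π/4, so each step is the smaller of the possible rotations.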

Having computed sin θ and cos θ, we can now call rotate twice, as discussed in section 3.9, first with opt=1 and again with opt=2. The matrix A has now been transformed into the next stage of its march towards diagonalization.

Next we multiply the matrix P on the left by the same $n \times n$ orthogonal matrix of Jacobi (3.8.3), in order to update the matrix that will hold the eigenvectors of A on output. Notice that this multiplication affects only rows p and q of the matrix P, so only 2n elements are changed.
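Written out entrywise, with s = sin θ and c = cos θ, the update of P takes the following form. This is a restatement of the loop over j in the jacobi listing that follows, not an additional step; the primes, which mark new values, are notation assumed here.

```latex
P'_{pj} = c\,P_{pj} + s\,P_{qj}, \qquad
P'_{qj} = -s\,P_{pj} + c\,P_{qj}, \qquad j = 1,\dots,n,
\qquad P'_{ij} = P_{ij}\ \ (i \neq p,\, q).
```

Both formulas use the old value of P_pj, which is why the program holds the new P_pj in a temporary variable until P_qj has been computed.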

Next we call update, as discussed in section 3.10, to modify the loc array to correspond

to the newly rotated matrix, and we are at the end (or od) of the while that was started

a few paragraphs ago. In other words, we’re ﬁnished. The Maple program for the Jacobi

algorithm follows.

jacobi:=proc(eps,dgts)
local test,eig,i,j,iter,big,p,q,x,y,t,s,c;
global loc,n,P,A;
with(linalg):                                  # initialize
iter:=0;n:=rowdim(A);Digits:=dgts;
loc:=array(1..n);
test:=add(add(2*A[i,j]^2,j=i+1..n),i=1..n-1);  # sum of squares of off-diagonal entries
P:=matrix(n,n,(i,j)->if i=j then 1 else 0 fi); # initialize eigenvector matrix
for i from 1 to n-1 do searchrow(i) od;        # set up initial loc array
big:=test;
while big>eps*test do                          # begin next sweep
  x:=0;                                        # find largest off-diagonal element
  for i from 1 to n-1 do
    if abs(A[i,loc[i]])>x then x:=abs(A[i,loc[i]]);p:=i; fi;od;
  q:=loc[p];
  x:=2*A[p,q]; y:=A[p,p]-A[q,q];               # find sine and cosine of theta
  t:=evalf(sqrt(x^2+y^2));
  s:=sign(x*y)*evalf(sqrt(0.5*(1-abs(y)/t)));
  c:=evalf(sqrt(0.5*(1+abs(y)/t)));
  rotate(s,c,p,q,1);rotate(s,c,p,q,2);         # apply rotations to A
  for j from 1 to n do                         # update matrix of eigenvectors
    t:=c*P[p,j]+s*P[q,j];
    P[q,j]:=-s*P[p,j]+c*P[q,j];
    P[p,j]:=t;
  od;
  update(p,q);                                 # update loc array
  big:=big-x^2/2;iter:=iter+1;                 # go do next sweep
od;                                            # end of while
eig:=[seq(A[i,i],i=1..n)];                     # output eigenvalue array
print(eig,P,iter);                             # print eigenvalues, eigenvectors, and number of sweeps needed
RETURN();
end:

To use the programs one does the following. First, enter the four procedures jacobi, update, rotate, searchrow into a Maple worksheet. Next enter the matrix A whose eigenvalues and eigenvectors are wanted. Then choose dgts, the number of digits of accuracy to be maintained, and eps, the fraction by which the original off-diagonal sum of squares must be reduced for convergence.

As an example, a call jacobi(.00000001,15) will carry 15 digits along in the computation, and will terminate when the sum of squares of the off-diagonal elements is at most .00000001 times what it was on the input matrix.
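A complete session might look like the sketch below. It assumes that all four procedures have already been entered, including rotate from section 3.9, whose behavior is taken on trust here. The tridiagonal test matrix is a hypothetical example, not one taken from the text.

```maple
with(linalg):
# Hypothetical symmetric test matrix; jacobi reads it through the global A.
A := matrix(3,3,[[2,1,0],[1,2,1],[0,1,2]]):
# Reduce the off-diagonal sum of squares by a factor of 10^(-8), carrying 15 digits.
jacobi(.00000001,15);
```

The exact eigenvalues of this matrix are 2 − √2, 2 and 2 + √2, so the printed list should approximate those three numbers in some order. Since the procedure destroys A, the matrix must be re-entered before any further run.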

3.12 Remarks

For a parting volley in the direction of eigenvalues, let’s review some connections with the

ﬁrst section of this chapter, in which we studied linear mappings, albeit sketchily.

It’s worth noting that the eigenvalues of a matrix really are the eigenvalues of the linear

mapping that the matrix represents with respect to some basis.

In fact, suppose T is a linear mapping of $E^n$ (Euclidean n-dimensional space) to itself. If we choose a basis for $E^n$ then T is represented by an $n \times n$ matrix A with respect to that basis. Now if we change to a different basis, then the same linear mapping is represented by $B = HAH^{-1}$, where H is a nonsingular $n \times n$ matrix. The proof of this fact is by a straightforward calculation, and can be found in standard references on linear algebra.

First question: what happens to the determinant if we change the basis? Answer: nothing, because

$$
\det(HAH^{-1}) = \det(H)\det(A)\det(H^{-1}) = \det(A). \tag{3.12.1}
$$

Hence the value of the determinant is a property of the linear mapping T, and will be the

same for every matrix that represents T in some basis. Hence we can speak of det(T), the

determinant of the linear mapping itself.

Next question: what happens to the eigenvalues if we change basis? Suppose x is an eigenvector of A for the eigenvalue λ. Then $Ax = \lambda x$. If we change basis, A changes to $B = HAH^{-1}$, or $A = H^{-1}BH$. Hence $H^{-1}BHx = \lambda x$, or $B(Hx) = \lambda(Hx)$. Since H is nonsingular, $Hx \neq 0$. Therefore $Hx$ is an eigenvector of B with the same eigenvalue λ. The eigenvalues are therefore independent of the basis, and are properties of the linear mapping T itself. Hence we can speak of the eigenvalues of a linear mapping T.
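Both invariance statements are easy to try out in a worksheet. The matrices A and H below are hypothetical examples chosen for illustration; any nonsingular H would serve.

```maple
with(linalg):
A := matrix(2,2,[[2,1],[1,3]]):     # hypothetical example matrix
H := matrix(2,2,[[1,2],[0,1]]):     # a nonsingular change of basis
B := evalm(H &* A &* inverse(H)):   # the same mapping in the new basis
det(A), det(B);                     # both determinants equal 5
eigenvals(A);
eigenvals(B);                       # the same two eigenvalues as for A
```

Here B works out to the matrix with rows (4, −1) and (1, 1). It is no longer symmetric, yet it has the same determinant 5 and the same eigenvalues (5 ± √5)/2 as A.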

In the Jacobi method we carry out transformations $A \to JAJ^T$, where $J^T = J^{-1}$. Hence this transformation corresponds exactly to looking at the underlying linear mapping in a different basis, in which the matrix that represents it is a little more diagonal than before.

Since J is an orthogonal matrix, it preserves lengths and angles because it preserves inner products between vectors:

$$
\langle Jx, Jy\rangle = \langle x, J^TJy\rangle = \langle x, y\rangle. \tag{3.12.2}
$$

Therefore the method of Jacobi works by rotating a basis slightly, into a new basis in which the matrix is closer to being diagonal.
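For a single Jacobi rotation the orthogonality can be checked by hand. The sketch below writes only the 2 × 2 block of J in the (p, q) plane, with c = cos θ and s = sin θ; the placement of the signs follows the eigenvector update in the jacobi program and is otherwise an assumption, as is the name J_2.

```latex
J_2 = \begin{pmatrix} c & s \\ -s & c \end{pmatrix}, \qquad
J_2^{T} J_2
  = \begin{pmatrix} c^2+s^2 & cs-sc \\ sc-cs & s^2+c^2 \end{pmatrix}
  = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix},
\qquad\text{so}\qquad
\|Jx\|^2 = \langle Jx, Jx\rangle = \langle x, x\rangle = \|x\|^2 .
```

Outside the (p, q) plane J acts as the identity, so the same computation shows that the full n × n matrix satisfies J^T J = I.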


The next one is a little harder: y (x) = 2y(x). We assume that the reader has previously met diﬀerential equations.1.Chapter 1 Diﬀerential and Diﬀerence Equations 1. we can expect that if a diﬀerential equation is of the ﬁrst order. The order of a diﬀerential equation is the order of the highest derivative that appears in it. namely y(x) = e2x . and so ye−2x must be a constant. if y(x) is a solution of (1. But. A diﬀerential equation is an equation in an unknown function. Here’s an easy equation of ﬁrst order: y (x) = 0.2).1. where the equation contains various derivatives of y and various known functions of x.1). Are there any other solutions? No there aren’t. so we’re going to review the most basic facts about them rather quickly. C. so we have solved the given equation (1. arrive after a bit of thought.1.1) The unknown function is y(x) = constant.1.2) A solution will. then so is 10y(x). with particular emphasis on how to solve them with computers. This is not always the case.6y(x). because if y is any function that satisﬁes (1. or in fact cy(x) for any constant c. Hence y = ce2x is a solution of (1.2) then (ye−2x ) = e−2x (y − 2y) = 0. (1. or 49.2). now doubt.1.1. then the most general solution will involve one arbitrary constant C.1 Introduction In this chapter we are going to study diﬀerential equations.3) . say y(x). The problem is to “ﬁnd” the unknown function. (1.1. (1. In general.

y1 (x) = 7 and y2 (x) = 23 are both solutions.1.1. Among there are the “ﬁrst-order linear” equations y (x) + a(x)y(x) = 0.5) where a(x) is a given function of x. Less trivially. take . (1. a fairly hard time (why?) ﬁnding a real function y(x) for which (y )2 = −y 2 − 2.1. by linearity.6 Diﬀerential and Diﬀerence Equations since we can write down diﬀerential equations that have no solutions at all. the set of solutions forms a vector space). that the function y = c1 sin x + c2 cos x is a solution.1. The right medicine is now y(x) = esin x .1. whatever the constants c1 and c2 . Evidently y = 2 2 e−x /2 will do.5): y (x) + xy(x) = 0.1. next let’s put a more interesting right hand side into (1.7) we had y (x) + x2 y(x) = 0.1. Since that was so easy. and the general solution is y(x) = ce−x /2 .1. Now let’s consider an instance of the ﬁrst order linear equation (1.1. the equation y (x) + y(x) = 0 (1. for instance. and then the solution is y(x) = ce−A(x) .6) can be checked right from the equation itself.1. The reader might like to put down the book at this point and try to formulate the rule for solving (1. though. that y = cos x is another solution of (1.4) There are certain special kinds of diﬀerential equations that can always be solved.8) y (x) − (cos x) y(x) = 0.9) . (1. If instead of (1. and it’s often important to be able to recognize them.1.6) is linear. we might note that y = sin x is a solution of (1. We would have. let’s discuss the word linear.5). Equation (1. Ready? What we need is to choose some antiderivative A(x) of a(x). (1. In the next paragraph we’ll give the general rule of which the above are three examples. (1. without knowing what the solutions are (do it!).1. where c1 and c2 are any two constants (in other words.7) So we’re looking for a function whose derivative is −x times the function. and ﬁnally.1. 3 /3 then we would have found the general solution ce−x As a last example.6). by considering the equation y (x) + a(x)y(x) = b(x) (1.1. 
For an example. To say that an equation is linear is to say that if we have any two solutions y1 (x) and y2 (x) of the equation. Before we describe the solution of these equations.1) is linear. and so is 7c1 + 23c2 .6). in fact. The linearity of (1.5) before going on to read about it. then c1 y1 (x) + c2 y2 (x) is also a solution of the equation.

Indeed.12) we get y(x) = x 1 (t + 1)t dt + C x x2 x C = + + . 3 2 x (1. the reason for the importance of the numerical methods that are the main subject of this chapter is precisely that most equations that arise in “real” problems are quite intractable by analytical means. so the computer is the only hope.13) x b(t)eA(t) dt + const. then from (1.14) We may be doing a disservice to the reader by beginning with this discussion of certain types of diﬀerential equations that can be solved analytically.1.1. if we integrate both sides.1. we mean any antiderivative of the function under the integral sign. namely linear equations with constant coeﬃcients.9).1. or even many. Consequently y(x) = e−A(x) As an example.1 Introduction 7 where now b(x) is also a given function of x (Is (1. and then note that we can rewrite (1. . dx Now if we multiply through by eA(x) we see that d eA(x) y(x) = b(x)eA(x) dx so .9) a linear equation? Are you sure?). eA(x) y(x) = x (1. consider the equation y + y = x + 1. in the next section we will study yet another important family of diﬀerential equations that can be handled analytically. once again choose some antiderivative A(x) of a(x). such equations can be dealt with by these techniques.9) in the equivalent form e−A(x) d eA(x) y(x) = b(x). x (1.1 1.1. because it would be erroneous to assume that most.1. Find the general solution of each of the following equations: (a) y = 2 cos x . Despite the above disclaimer.10) b(t)eA(t) dt + const.1.11) where on the right side.1.12) We ﬁnd that A(x) = log x. (1. To solve (1. . (1.1.1. Exercises 1.

4.2. ﬁnd the solution for which y(1) = 1.1). and important.2. let’s see if there is a solution of the form y = eαx . .2.1) It turns out that what we have to do to solve these equations is to try a solution of a certain form. In other words.2.2. Then write a program that will calculate and print the sum of the squares of the integers 1. (1. type of linear diﬀerential equation is the variety with constant coeﬃcients. and the speciﬁc software you will be using. where c1 and c2 are arbitrary constants. For each part of problem 1. (1.1). .2. e−2x is a solution. If we substitute in (1. This time.2.1.3) Again.3) and cancellation of the factor eαx leads to the quadratic equation α2 + 4α + 4 = 0. 1.2. Go to your computer or terminal and familiarize yourself with the equipment. . 3. (1.2. and since the equation is linear. Finally.1). and then cancel the common factor eαx . Various complications can develop. Run this program. and we will then ﬁnd that all of the solutions indeed are of that form.2) must be the most general solution since it has the “right” number of arbitrary constants. e−x is a solution. namely two. . y(x) = c1 e−2x + c2 e−x (1. such as y + 3y + 2y = 0 .2 Linear equations with constant coeﬃcients One particularly pleasant.2) is also a solution. Let’s see if the function y(x) = eαx is a solution of (1. Hence for those two values of α our trial function y(x) = eαx is indeed a solution of (1. we are left with the quadratic equation α2 + 3α + 2 = 0 whose solutions are α = −2 and α = −1. (1. Show that the equation (1. substitution into (1. 2. as illustrated by the equation y + 4y + 4y = 0 .4) . the operating system. however. 100.4) has no real solutions.8 2 y=0 x (c) y + xy = 3 1 (d) y + y = x + 5 x (e) 2yy = x + 1 (b) y + Diﬀerential and Diﬀerence Equations 2. Trying a solution in the form of an exponential is always the correct ﬁrst step in solving linear equations with constant coeﬃcients.

Now it’s quite easy to see that y = xk eα x satisﬁes (1. and that all three roots turn out to be the same. .2. We can write it in the form (D n + a1 D n−1 + a2 D n−2 + · · · + an )y = 0 .2.2. so the left side of (1.2.7) If we try to ﬁnd a solution of the usual exponential form y = eαx . Indeed. but so are xe−x and x2 e−x . if we substitute this function y into (1.7). .7) also) for each k = 0. of course. is caused by the fact that the roots of (1. Now look at the left side of the given diﬀerential equation (1. .2. Since ϕ(α) has the factor (α − α∗ )p . (1. . (1. Hence e−2x is a solution.2.11) .2.9) in which D is the diﬀerential operator d/dx. . only one arbitrary constant.2. Suppose that we begin with an equation of third order. we would ﬁnd the polynomial equation αn + a1 αn−1 + a2 αn−2 + · · · + an = 0 . where ϕ is exactly the characteristic polynomial in (1. say y (n) + a1 y (n−1) + a2 y (n−2) + · · · + an y = 0 (1. . (1.9) can be written in the form g(D)(D − α∗ )p y = 0 . To see why this procedure works in general.7) and cancellation of the common factor eαx . In this case.8) The polynomial on the left side is called the characteristic polynomial of the given diﬀerential equation. suppose we have a linear diﬀerential equation with constant coeﬁccients. In the parentheses in (1. to solve the equation y + 3y + 3y + y = 0 we would try y = eαx .2. we see that it is enough to show that ∗ (D − α∗ )p (xk eα x ) = 0 k = 0.2.8) of multiplicity p.2. .1.3) (verify this). Now. To say that α∗ is a root of multiplicity p of the equation is to say that (α − α∗ )p is a factor of the characteristic polynomial. both being −2.8).10).4) are not distinct. it turns out that xe−2x is another solution of the diﬀerential equation (1. but we don’t yet have the general solution because there is.2. it follows that ϕ(D) has the factor (D − α∗ )p .2. so far.6) (1. (1. 1.2. The diﬃculty. Suppose now that a certain number α = α∗ is a root of (1.2.9) we see the polynomial ϕ(D).2.2. 
so the general solution is (c1 + c2 x)e−2x . p − 1 .10) ∗ where g is a polynomial of degree n − p. For instance. not only is e−x a solution.5) whose “three” roots are all equal to −1. then after substitution into (1. (1. . and we would then be facing the cubic equation α3 + 3α2 + 3α + 1 = 0 .10) (and therefore (1. 1. p − 1.2 Linear equations with constant coeﬃcients 9 whose two roots are identical.2. and of course so is c1 e−2x .

namely ∗ ∗ ∗ ∗ eα x . where Q(x) is an arbitrary polynomial whose degree is one less than the multiplicity of the root α∗ . to solve y + 4y = 0. xeα x . if we encounter a root α∗ of the characteristic equation.10 Diﬀerential and Diﬀerence Equations ∗ ∗ However. (1. (1. To summarize. These don’t really require any special attention. −3i (multiplicity 2). x2 eα x .13) This is a perfectly acceptable form of the solution. (1. xp−1 eα x .2. Now since k < p it is clear that (D − α∗ )p (xk e−α x ) = 0.17) The equation was cooked up to have a characteristic polynomial that can be factored as (α − 2)(α2 + 9)2 (α − 1)3 .2.16) Hence the roots of the characteristic equation are 2 (simple). (1. which says that e2ix = cos 2x + i sin 2x e−2ix = cos 2x − i sin 2x. . then corresponding to α∗ we can ﬁnd exactly p linearly independent solutions of the diﬀerential equation. as claimed. but we could make it look a bit prettier by using deMoivre’s theorem. ∗ ∗ (D − α∗ )2 (xk e−α x ) = k(k − 1)xk−2 eα x .2. Then our general solution would look like y(x) = (c1 + c2 ) cos 2x + (ic1 − ic2 ) sin 2x. 3i (multiplicity 2).2. . of multiplicity p. (D − α∗ )(xk e−α x ) = kxk−1 eα x . .2. Here’s an example that shows the various possibilities: y (8) − 5y (7) + 17y (6) − 997y (5) + 110y (4) − 531y (3) + 765y (2) − 567y + 162y = 0.2. For instance. then. hence so are c1 + c2 and ic1 − ic2 .14) But c1 and c2 are just arbitrary constants. but they do present a few options. so we might as well rename them c1 and c2 .18) (1. and if we apply (D − α∗ ) again.12) Another way to state it is to say that the portion of the general solution of the given diﬀerential equation that corresponds to a root α∗ of the characteristic polynomial equation ∗ is Q(x)eα x . One last mild complication may arise from roots of the characteristic equation that are not real numbers.15) (1. and 1 (multiplicity 3). and the complex roots ±2i. . ∗ etc. in which case the solution would take the form y(x) = c1 cos 2x + c2 sin 2x. 
we ﬁnd the characteristic equation α2 + 4 = 0. Hence the general solution is obtained by the usual rule as y(x) = c1 e2ix + c2 e−2ix .2. (1. .

Evidently. Can we somehow “solve” a diﬀerence equation by obtaining a formula for the values of the solution sequence? The answer is that we can.3 Diﬀerence equations 11 Corresponding to the root 2. a diﬀerence equation is an equation in an unknown sequence. Corresponding to the double root at 3i we have terms (c2 + c3 x)e3ix in the solution. . and ﬁnally from the triple root at 1 we get (c6 + c7 x + c8 x2 )ex .2.1). and which satisﬁes (D + 4)(D − 1)y = 0. The general solution is the sum of these eight terms. (1. For example. y3 = 87. and then calculating successive approximate values of the desired function from the diﬀerence equation. . . . Just as in the case of diﬀerential equations with constant coeﬃcients. . Obtain the general solutions of each of the following diﬀerential equations: (a) y + 5y + 6y = 0 (b) y − 8y + 7y = 0 (c) (D + 3)2 y = 0 (d) (D2 + 16)2 y = 0 (e) (D + 3)3 (D 2 − 25)2 (D + 2)3 y = 0 2. y4 = −309 so forth.3.3. Naturally. suppose we know that a certain sequence of numbers y0 . Find a curve y = f (x) that passes through the origin with unit slope. the general solution will contain the term c1 e2x . the correct strategy for solving them is to try a solution of the n = 0.3.1. we might have taken the four terms that come from 3i in the form (c2 + c3 x) cos 3x + (c4 + c5 x) sin 3x. the computer can provide the values of the unknown function only at a discrete set of points. (1. 1. From the double root at −3i we get a contribution (c4 + c5 x)e−3ix . thus we would get y2 = −21. satisﬁes the following conditions: yn+2 + 5yn+1 + 6yn = 0 and furthermore that y0 = 1 and y1 = 3. Exercises 1.1) . as long as the diﬀerence equation is linear and has constant coeﬃcients. .1). y2 . as in (1. The entire sequence of yn ’s is determined by the diﬀerence equation (1.19) 1. Such equations are encountered when diﬀerential equations are solved on computers.2 1.3 Diﬀerence equations Whereas a diﬀerential equation is an equation in an unknown function. y1 . 2.3. 
Alternatively. we can compute as many of the yn ’s as we need from (1. These values are computed by replacing the given diﬀerential equations by a diﬀerence equation that approximates it.1) together with the two starting values.

2.3. is the number yn a large number or a small one? Evidently the powers of −3 overwhelm those of −2. Therefore the sequence (−2)n satisﬁes (1.3.3.3. 1.3) The two roots of this characteristic equation are α = −2 and α = −3.3.12 Diﬀerential and Diﬀerence Equations right form. .3.1). (1. So much for the equation (1. and we’re left with the quadratic equation α2 + 5α + 6 = 0. where α is a constant.1) itself that the numbers yn are uniquely determined if we prescribe the values of just two of them. and after substituting and canceling.6) is the desired formula that represents the unique solution of the given diﬀerence equation together with the prescribed starting values.3. let’s substitute αn for yn in (1. Now let’s look at the general case. . Hence. (1. (1. so the sequence will behave roughly like a constant times powers of −3. (1. Since the diﬀerence equation is linear. The left side becomes αn+2 + 5αn+1 + 6αn = αn (α2 + 5α + 6) = 0.1) to see what happens. When n is very large.3.3.3.3. in the form of a linear diﬀerence equation of order p: yn+p + a1 yn+p−1 + a2 yn+p−2 + · · · + ap yn = 0. the right form to try was y(x) = eαx . Now it is evident from (1. Finally. .6) Equation (1.4) is also a solution. This means that we should expect the members of the sequence to alternate in sign and to grow rapidly in magnitude.4) to get yn = 6(−2)n − 5(−3)n n = 0. . it is very clear that when we have a solution that contains two arbitrary constants we have the most general solution. whatever the values of the constants c1 and c2 . so here we can cancel the αn .3. When we take account of the given data y0 = 1 and y1 = 3. we use these values of c1 and c2 in (1. Let’s step back a few paces to get a better view of the solution.8) . Notice that the formula (1.2) Just as we were able to cancel the common factor eαx in the diﬀerential equation case.6) expresses the solution as a linear combination of nth powers of the roots of the associated characteristic equation (1.3. 
it follows that yn = c1 (−2)n + c2 (−3)n (1. Now the winning combination is y = αn . In fact.5) from which c1 = 6 and c2 = −5.3.3).3. we get the two equations 1 = c1 + c2 3 = (−2)c1 + (−3)c2 (1.7) We try a solution of the form yn = αn . In the previous section. we get the characteristic equation αp + a1 αp−1 + a2 αp−2 + · · · + ap = 0.1) and so does (−3)n . (1.

3 Diﬀerence equations 13 This is a polynomial equation of degree p. of degree k − 1. by considering the following diﬀerence equation of order ﬁve: yn+5 − 5yn+4 + 9yn+3 − 9yn+2 + 8yn+1 − 4yn = 0. Exercises 1. −i. somewhere in the complex plane. say y0 .3 1.3. and the simple root α = 1 adds c5 to the general solution.12) (1.10) (1. y1 . (1. we can also take this part of the general solution in the form c1 cos nπ 2 + c2 sin nπ . as well as the case of complex roots.3. y2 . −i. If. 1. Corresponding to the roots i.13) The ﬁve constants would be determined by prescribing ﬁve initial values. Let α∗ be one of these p roots. This example is rigged so that the characteristic equation can be factored as (α2 + 1)(α − 2)2 (α − 1) = 0 from which the roots are obviously i. Obtain the general solution of each of the following diﬀerence equations: (a) yn+1 = 3yn (b) yn+1 = 3yn + 2 (c) yn+2 − 2yn+1 + yn = 0 (d) yn+2 − 8yn+1 + 12yn = 0 (e) yn+2 − 6yn+1 + 9yn = 1 (f) yn+2 + yn = 0 . y3 and y4 . α∗ is a root of multiplicity k > 1 then we must multiply the solution c(α∗ )n by an arbitrary polynomial in n.9) The double root α = 2 contributes (c3 + c4 n)2n .1. counting multiplicities..9). Since nπ nπ in = einπ/2 = cos + i sin (1. 2 (1.e.11) 2 2 and similarly for (−i)n . We illustrate this.3. however. the portion of the general solution is c1 in + c2 (−i)n . as we would expect for the equation (1. which in its full glory is yn = c1 cos nπ 2 + c2 sin nπ 2 + (c3 + c4 n)2n + c5 . has multiplicity 1) then the part of the general solution that corresponds to α∗ is c(α∗ )n . just as in the corresponding case for diﬀerential equations we used an arbitrary polynomial in x of degree k − 1. 2 (multiplicity 2). so it has p roots. If α∗ is simple (i.3.3.3.

Find the solution of the given diﬀerence equation that takes the prescribed initial values: (a) (b) (c) (d) yn+2 = 2yn+1 + yn . For each n. y0 = 0. If the new ratio agrees suﬃciently well with the previous ratio we announce that the computation has terminated and print the new ratio as our answer. we move the new ratio to the location of the old ratio. (a) For each of the diﬀerence equations in problems 1 and 2. y0 = 1. and value of the limit in part (a) for a linear diﬀerence equation with constant coeﬃcients. Suppose we want to compute a large number of these y’s in order to verify some property that they have. we might declare y to be a linear array of some size large enough to accommodate the expected length of the calculation. y1 = −1. y3 = −1 yn+2 − 5yn+1 + 6yn = 0. ﬁnd its root of largest absolute value by computing from a certain diﬀerence equation and evaluating the ratios of consecutive terms.1) together with the starting values y0 = y1 = y2 = 1. we would divide it by its predecessor yn to get a new ratio. y1 = 1 yn+1 = αyn + β. we would calculate the next yn+1 from (1. (d) Write a computer program to implement the method in part (c). for instance to check that yn+1 lim =3 (1.1). Use it to calculate the largest root of the equation x8 = x7 + x6 + x5 + · · · + 1. y0 = 1 yn+4 + yn = 0. (b) Formulate and prove a general theorem about the existence of.3.4.3. The reader might want. just for practice.4. increase n and try again. As a ﬁrst approach. evaluate yn+1 lim n→∞ yn (1. a book about computing. (c) Reverse the process: given a polynomial equation. y2 = 1. Then the rest is easy. Otherwise. y0 = 1.2) n→∞ yn which must be true since 3 is the root of largest absolute value of the characteristic equation.14 Diﬀerential and Diﬀerence Equations 2. after all. y1 = 2 3. (1. If we were to write this out as formal procedure (algorithm) it might look like: .15) 1.4 Computing with diﬀerence equations This is. 
so let’s begin with computing from diﬀerence equations since they will give us a chance to discuss some important questions that concern the design of computer programs.4. For a sample diﬀerence equation we’ll use yn+3 = yn+2 + 5yn+1 + 3yn (1.14) if it exists. to ﬁnd an explicit formula for this sequence by the methods of the previous section.

    y0 := 1; y1 := 1; y2 := 1; n := 2;
    newrat := -10; oldrat := 1;
    while |newrat - oldrat| ≥ 0.000001 do
        oldrat := newrat; n := n + 1;
        yn := yn-1 + 5 * yn-2 + 3 * yn-3;
        newrat := yn / yn-1
    endwhile
    print newrat; Halt.

We'll use the symbol ':=' to mean that we are to compute the quantity on the right, if necessary, and then store it in the place named on the left. It can be read as 'is replaced by' or 'is assigned.' Also, the block that begins with 'while' and ends with 'endwhile' represents a group of instructions that are to be executed repeatedly until the condition that follows 'while' becomes false, at which point the line following 'endwhile' is executed.

The procedure just described is fast, but it uses lots of storage. If, for instance, such a program needed to calculate 79 y's before convergence occurred, then it would have used 79 locations of array storage. In fact, the problem above doesn't need that many locations because convergence happens a lot sooner. Suppose you wanted to find out how much sooner, given only a programmable hand calculator with ten or twenty memory locations. Then you might appreciate a calculation procedure that needs just four locations to hold all necessary y's.

That's fairly easy to accomplish, though. At any given moment in the program, what we need to find the next y are just the previous three y's. So why not save only those three? We'll use the previous three to calculate the next one, and stow it for a moment in a fourth location. Then we'll compute the new ratio and compare it with the old. If they're not close enough, we move each one of the three newest y's back one step into the places where we store the latest three y's and repeat the process. Formally, it might be:

    y := 1; ym1 := 1; ym2 := 1;
    newrat := -10; oldrat := 1;
    while |newrat - oldrat| ≥ 0.000001 do
        ym3 := ym2; ym2 := ym1; ym1 := y;
        oldrat := newrat;
        y := ym1 + 5 * ym2 + 3 * ym3;
        newrat := y / ym1
    endwhile;
    print newrat; Halt.

The calculation can now be done in exactly six memory locations (y, ym1, ym2, ym3, oldrat, newrat) no matter how many y's have to be calculated, so you can undertake it on your hand calculator with complete confidence. The price that we pay for the memory saving is that we must move the data around a bit more.

One should not think that such programming methods are only for hand calculators. As we progress through the numerical solution of differential equations we will see situations in which each of the quantities that appears in the difference equation will itself be an

array (!), and that very large numbers, perhaps thousands, of these arrays will need to be computed. Even large computers might quake at the thought of using the first method above, rather than the second, for doing the calculation. Fortunately, it will almost never be necessary to save in memory all of the computed values simultaneously. Normally, they will be computed, and then printed or plotted, and never needed except in the calculation of their immediate successors.

Exercises 1.4

1. The Fibonacci numbers $F_0, F_1, F_2, \ldots$ are defined by the recurrence formula $F_{n+2} = F_{n+1} + F_n$ for $n = 0, 1, 2, \ldots$ together with the starting values $F_0 = 0$, $F_1 = 1$.

(a) Write out the first ten Fibonacci numbers.

(b) Derive an explicit formula for the nth Fibonacci number $F_n$.

(c) Evaluate your formula for n = 0, 1, 2, 3, 4.

(d) Prove directly from your formula that the Fibonacci numbers are integers (This is perfectly obvious from their definition, but is not so obvious from the formula!).

(e) Evaluate

$$\lim_{n\to\infty} \frac{F_{n+1}}{F_n} \tag{1.4.3}$$

(f) Write a computer program that will compute Fibonacci numbers and print out the limit in part (e) above, correct to six decimal places.

(g) Write a computer program that will compute the first 40 members of the modified Fibonacci sequence in which $F_0 = 1$ and $F_1 = (1 - \sqrt{5})/2$. Do these computed numbers seem to be approaching zero? Explain carefully what you see and why it happens.

(h) Modify the program of part (g) to run in higher (or double) precision arithmetic. Does it change any of your answers?

2. Find the most general solution of each of the following difference equations:

(a) $y_{n+1} - 2y_n + y_{n-1} = 0$

(b) $y_{n+1} = 2y_n$

(c) $y_{n+2} + y_n = 0$

(d) $y_{n+2} + 3y_{n+1} + 3y_n + y_{n-1} = 0$

1.5 Stability theory

In the study of natural phenomena it is most often true that a small change in conditions will produce just a small change in the state of the system being studied. If, for example, a very slight increase in atmospheric pollution could produce dramatically large changes in populations of flora and fauna, or if tiny variations in the period of the earth's rotation

produced huge changes in climatic conditions, the world would be a very different place to live in, or to try to live in. In brief, we may say that most aspects of nature are stable.

When physical scientists attempt to understand some facet of nature, they often will make a mathematical model. This model will usually not faithfully reproduce all of the structure of the original phenomenon, but one hopes that the important features of the system will be preserved in the model. One of the most important features to preserve is that of stability.

For instance, the example of atmospheric pollution and its effect on living things referred to above is important and very complex. Therefore considerable effort has gone into the construction of mathematical models that will allow computer studies of the effects of atmospheric changes, so that predictions will be possible. One of the first tests to which such a model should be subjected is that of stability: does it faithfully reproduce the observed fact that small changes produce small changes? What is true in nature need not be true in a man-made model that is a simplification or idealization of the real world.

Now suppose that we have gotten ourselves over this hurdle, and we have constructed a model that is indeed stable. The next step might be to go to the computer and do calculations from the model, and to use these calculations for predicting the effects of various proposed actions. Unfortunately, yet another layer of approximation is usually introduced at this stage, because the model, even though it is a simplification of the real world, might still be too complicated to solve exactly on a computer.

For instance, many models use differential equations. Models of the weather, of the motion of fluids, of the movement of astronomical objects, of spacecraft, of population growth, of predator-prey relationships, of electric circuit transients, and so forth, all involve differential equations. Digital computers solve differential equations by approximating them by difference equations, and then solving the difference equations. Even though the differential equation that represents our model is indeed stable, it may be that the difference equation that we use on the computer is no longer stable, and that small changes in initial data on the computer, or small roundoff errors, will produce not small but very large changes in the computed solution.

An important job of the numerical analyst is to make sure that this does not happen, and we will find that this theme of stability recurs throughout our study of computer approximations.

As an example of instability in differential equations, suppose that some model of a system led us to the equation

$$y'' - y' - 2y = 0 \tag{1.5.1}$$

together with the initial data

$$y(0) = 1, \qquad y'(0) = -1. \tag{1.5.2}$$

We are thinking of the independent variable t as the time, so we will be interested in the solution as t becomes large and positive. The general solution of (1.5.1) is $y(t) = c_1 e^{-t} + c_2 e^{2t}$. The initial conditions tell us that $c_1 = 1$ and $c_2 = 0$, hence the solution of our problem is $y(t) = e^{-t}$, and it represents a

function that decays rapidly to zero with increasing t. In fact, when t = 10, the solution has the value 0.000045.

Now let's change the initial data (1.5.2) just a bit, by asking for a solution with $y'(0) = -0.999$. It's easy to check that the solution is now

$$y(t) = (0.999666\ldots)e^{-t} + (0.000333\ldots)e^{2t} \tag{1.5.3}$$

instead of just $y(t) = e^{-t}$. If we want the value of the solution at t = 10, we would find that it has changed from 0.000045 to about 161,720. At t = 20 the change is even more impressive, from 0.000000002 to about $7.8 \times 10^{13}$, just from changing the initial value of $y'$ from −1 to −0.999. Let's hope that there are no phenomena in nature that behave in this way, or our lives hang by a slender thread indeed!

Now exactly what is the reason for the observed instability of the equation (1.5.1)? The general solution of the equation contains a falling exponential term $c_1 e^{-t}$, and a rising exponential term $c_2 e^{2t}$. By prescribing the initial data (1.5.2) we suppressed the growing term, and picked out only the decreasing one. A small change in the initial data, however, results in the presence of both terms in the solution.

Now it's time for a formal

Definition: A differential equation is said to be stable if for every set of initial data (at t = 0) the solution of the differential equation remains bounded as t approaches infinity. A differential equation is called strongly stable if, for every set of initial data (at t = 0) the solution not only remains bounded, but approaches zero as t approaches infinity.

What makes the equation (1.5.1) unstable, then, is the presence of a rising exponential in its general solution. In other words, if we have a differential equation whose general solution contains a term $e^{\alpha t}$ in which α is positive, that equation is unstable.

Let's restrict attention now to linear differential equations with constant coefficients. We know from section 1.2 that the general solution of such an equation is a sum of terms of the form

$$(\text{polynomial in } t)\,e^{\alpha t}. \tag{1.5.4}$$

Under what circumstances does such a term remain bounded as t becomes large and positive?

Certainly if α is negative then the term stays bounded. Likewise, if α is a complex number and its real part is negative, then the term remains bounded. If α has positive real part the term is unbounded. This takes care of all of the possibilities except the case where α is zero, or more generally, the complex number α has zero real part (is purely imaginary). In that case the question of whether $(\text{polynomial in } t)\,e^{\alpha t}$ remains bounded depends on whether the "polynomial in t" is of degree zero (a constant polynomial) or of higher degree. If the polynomial is constant then the term does indeed remain bounded for large positive t, whereas otherwise the term will grow as t gets large, for some values of the initial conditions, thereby violating the definition of stability.
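The sensitivity of equation (1.5.1) to its initial data can be checked numerically. The following is a minimal Python sketch, not part of the original text; the helper names `solve_coefficients` and `y` are illustrative assumptions. It fits $c_1$ and $c_2$ in the general solution $c_1 e^{-t} + c_2 e^{2t}$ to the initial data and evaluates the result at t = 10 and t = 20.

```python
import math

def solve_coefficients(y0, yp0):
    """Return (c1, c2) so that y(t) = c1*exp(-t) + c2*exp(2t)
    satisfies y(0) = y0 and y'(0) = yp0."""
    # The two conditions are: c1 + c2 = y0 and -c1 + 2*c2 = yp0.
    c2 = (y0 + yp0) / 3.0
    c1 = y0 - c2
    return c1, c2

def y(t, y0, yp0):
    """Evaluate the solution of y'' - y' - 2y = 0 with the given initial data."""
    c1, c2 = solve_coefficients(y0, yp0)
    return c1 * math.exp(-t) + c2 * math.exp(2.0 * t)

for t in (10.0, 20.0):
    # Original data y'(0) = -1 versus perturbed data y'(0) = -0.999.
    print(t, y(t, 1.0, -1.0), y(t, 1.0, -0.999))
```

With the original data the coefficient of the growing term is exactly zero, so only $e^{-t}$ survives; with the perturbed data that coefficient is about 0.000333 and the term $e^{2t}$ soon dominates.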

Now recall that the "polynomial in t" is in fact a constant if the root α is a simple root of the characteristic equation of the differential equation, and otherwise it is of higher degree. This observation completes the proof of the following:

Theorem 1.5.1 A linear differential equation with constant coefficients is stable if and only if all of the roots of its characteristic equation lie in the left half plane, and those that lie on the imaginary axis, if any, are simple. Such an equation is strongly stable if and only if all of the roots of its characteristic equation lie in the left half plane, and none lie on the imaginary axis.

Exercises 1.5

1. Determine for each of the following differential equations whether it is strongly stable, stable, or unstable.

(a) $y'' - 5y' + 6y = 0$

(b) $y'' + 5y' + 6y = 0$

(c) $y'' + 3y = 0$

(d) $(D + 3)^3 (D + 1)y = 0$

(e) $(D + 1)^2 (D^2 + 1)^2 y = 0$

(f) $(D^4 + 1)y = 0$

2. Make a list of some natural phenomena that you think are unstable. Discuss.

3. The differential equation $y'' - y = 0$ is to be solved with the initial conditions $y(0) = 1$, $y'(0) = -1$, and then solved again with $y(0) = 1$, $y'(0) = -0.99$. Compare the two solutions when x = 20.

4. For exactly which real values of the parameter λ is each of the following differential equations stable? . . . strongly stable?

(a) $y'' + (2 + \lambda)y' + y = 0$

(b) $y'' + \lambda y' + y = 0$

(c) $y' + \lambda y = 1$

1.6 Stability theory of difference equations

In the previous section we discussed the stability of differential equations. The key ideas were that such an equation is stable if every one of its solutions remains bounded as t approaches infinity, and strongly stable if the solutions actually approach zero. Similar considerations apply to difference equations, and for similar reasons. As an example, take the equation

$$y_{n+1} = \frac{5}{2}\,y_n - y_{n-1} \qquad (n \ge 1) \tag{1.6.1}$$

along with the initial conditions

$$y_0 = 1, \qquad y_1 = 0.5. \tag{1.6.2}$$

It's easy to see that the solution is $y_n = 2^{-n}$, and of course, this is a function that rapidly approaches zero with increasing n.

Now let's change the initial data (1.6.2), say to

$$y_0 = 1, \qquad y_1 = 0.50000001 \tag{1.6.3}$$

instead of (1.6.2). The solution of the difference equation with these new data is

$$y_n = (0.0000000066\ldots)\,2^n + (0.9999999933\ldots)\,2^{-n}. \tag{1.6.4}$$

The point is that the coefficient of the growing term $2^n$ is small, but $2^n$ grows so fast that after a while the first term in (1.6.4) will be dominant. For example, when n = 30, the solution is $y_{30} = 7.16$, compared to the value $y_{30} = 0.0000000009$ of the solution with the original initial data (1.6.2). A change of one part in fifty million in the initial condition produced, thirty steps later, an answer nearly eight billion times as large.

The fault lies with the difference equation, because it has both rising and falling components to its general solution. It should be clear that it is hopeless to do extended computation with an unstable difference equation, since a small roundoff error may alter the solution beyond recognition several steps later.

As in the case of differential equations, we'll say that a difference equation is stable if every solution remains bounded as n grows large, and that it is strongly stable if every solution approaches zero as n grows large. Again, we emphasize that every solution must be well behaved, not just the solution that is picked out by a certain set of initial data. In other words, the stability, or lack of it, is a property of the equation and not of the starting values.

Now consider the case where the difference equation is linear with constant coefficients. Then we know that the general solution is a sum of terms of the form

$$(\text{polynomial in } n)\,\alpha^n. \tag{1.6.5}$$

Under what circumstances will such a term remain bounded or approach zero?

Suppose |α| < 1. Then the powers of α approach zero, and multiplication by a polynomial in n does not alter that conclusion. Suppose |α| > 1. Then the sequence of powers grows unboundedly, and multiplication by a nonzero polynomial only speeds the parting guest. Finally suppose the complex number α has absolute value 1. Then the sequence of its powers remains bounded (in fact they all have absolute value 1), but if we multiply by a nonconstant polynomial, the resulting expression would grow without bound.

To summarize then, the term (1.6.5), if the polynomial is not identically zero, approaches zero with increasing n if and only if |α| < 1. It remains bounded as n increases if and only

if either (a) |α| < 1 or (b) |α| = 1 and the polynomial is of degree zero (a constant). Now we have proved:

Theorem 1.6.1 A linear difference equation with constant coefficients is stable if and only if all of the roots of its characteristic equation have absolute value at most 1, and those of absolute value 1 are simple. The equation is strongly stable if and only if all of the roots have absolute value strictly less than 1.

Exercises 1.6

1. Determine, for each of the following difference equations whether it is strongly stable, stable, or unstable.

(a) $y_{n+2} - 5y_{n+1} + 6y_n = 0$

(b) $8y_{n+2} + 2y_{n+1} - 3y_n = 0$

(c) $3y_{n+2} + y_n = 0$

(d) $3y_{n+3} + 9y_{n+2} - y_{n+1} - 3y_n = 0$

(e) $4y_{n+4} + 5y_{n+2} + y_n = 0$

2. The difference equation $2y_{n+2} + 3y_{n+1} - 2y_n = 0$ is to be solved with the initial conditions $y_0 = 2$, $y_1 = 1$, and then solved again with $y_0 = 2$, $y_1 = 0.99$. Compare $y_{20}$ for the two solutions.

3. For exactly which real values of the parameter λ is each of the following difference equations stable? . . . strongly stable?

(a) $y_{n+2} + \lambda y_{n+1} + y_n = 0$

(b) $y_{n+1} + \lambda y_n = 1$

(c) $y_{n+2} + y_{n+1} + \lambda y_n = 0$

4. (a) Consider the (constant-coefficient) difference equation

$$a_0 y_{n+p} + a_1 y_{n+p-1} + a_2 y_{n+p-2} + \cdots + a_p y_n = 0. \tag{1.6.6}$$

Show that this difference equation cannot be stable if $|a_p/a_0| > 1$.

(b) Give an example to show that the converse of the statement in part (a) is false. Namely, exhibit a difference equation for which $|a_p/a_0| < 1$ but the equation is unstable anyway.
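Returning to the example (1.6.1) of this section, the effect of the perturbed starting value can be reproduced in a few lines. This is a minimal Python sketch, not part of the original text; the function name `iterate` is an illustrative assumption. It keeps only the two most recent values, in the spirit of the memory-saving program of section 1.4.

```python
def iterate(y0, y1, steps):
    """Apply y[n+1] = 2.5*y[n] - y[n-1] and return y[steps] (steps >= 1),
    storing only the two most recent values."""
    prev, curr = y0, y1
    for _ in range(steps - 1):
        prev, curr = curr, 2.5 * curr - prev
    return curr

original = iterate(1.0, 0.5, 30)          # starting data y0 = 1, y1 = 0.5
perturbed = iterate(1.0, 0.50000001, 30)  # starting data y0 = 1, y1 = 0.50000001
print(original, perturbed)
```

The characteristic roots of (1.6.1) are 2 and 1/2, so by Theorem 1.6.1 the equation is unstable; the run with the perturbed data is dominated by the $2^n$ component after thirty steps, while the run with the original data stays on the decaying solution $2^{-n}$.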


Chapter 2

The Numerical Solution of Differential Equations

2.1 Euler's method

Our study of numerical methods will begin with a very simple procedure, due to Euler. We will state it as a method for solving a single differential equation of first order. One of the nice features of the subject of numerical integration of differential equations is that the techniques that are developed for just one first order differential equation will apply, with very little change, both to systems of simultaneous first order equations and to equations of higher order. Hence the consideration of a single equation of first order, seemingly a very special case, turns out to be quite general.

By an initial-value problem we mean a differential equation together with enough given values of the unknown function and its derivatives at an initial point $x_0$ to determine the solution uniquely.

Let's suppose that we are given an initial-value problem of the form

$$y' = f(x, y); \qquad y(x_0) = y_0. \tag{2.1.1}$$

Our job is to find numerical approximate values of the unknown function y at points x to the right of (larger than) $x_0$.

What we actually will find will be approximate values of the unknown function at a discrete set of points $x_0$, $x_1 = x_0 + h$, $x_2 = x_0 + 2h$, $x_3 = x_0 + 3h$, etc. At each of these points $x_n$ we will compute $y_n$, our approximation to $y(x_n)$.

Hence, suppose that the spacing h between consecutive points has been chosen. We propose to start at the point $x_0$ where the initial data are given, and move to the right, obtaining $y_1$ from $y_0$, then $y_2$ from $y_1$ and so forth until sufficiently many values have been found.

Next we need to derive a method by which each value of y can be obtained from its immediate predecessor. Consider the Taylor series expansion of the unknown function y(x)

about the point $x_n$:

$$y(x_n + h) = y(x_n) + h\,y'(x_n) + h^2\,\frac{y''(X)}{2}, \tag{2.1.2}$$

where we have halted the expansion after the first power of h and in the remainder term, the point X lies between $x_n$ and $x_n + h$.

Now equation (2.1.2) is exact, but of course it cannot be used for computation because the point X is unknown. On the other hand, if we simply "forget" the error term, we'll have only an approximate relation instead of an exact one, with the consolation that we will be able to compute from it. The approximate relation is

$$y(x_n + h) \approx y(x_n) + h\,y'(x_n). \tag{2.1.3}$$

Next define $y_{n+1}$ to be the approximate value of $y(x_{n+1})$ that we obtain by using the right side of (2.1.3) instead of (2.1.2). Then we get

$$y_{n+1} = y_n + h\,y'_n. \tag{2.1.4}$$

Now we have a computable formula for the approximate values of the unknown function, because the quantity $y'_n$ can be found from the differential equation (2.1.1) by writing

$$y'_n = f(x_n, y_n), \tag{2.1.5}$$

and if we do so then (2.1.4) takes the form

$$y_{n+1} = y_n + h f(x_n, y_n). \tag{2.1.6}$$

This is Euler's method, in a very explicit form, so that the computational procedure is clear. Equation (2.1.6) is in fact a recurrence relation, or difference equation, whereby each value of $y_n$ is computed from its immediate predecessor.

Let's use Euler's method to obtain a numerical solution of the differential equation

$$y' = 0.5y \tag{2.1.7}$$

together with the starting value y(0) = 1. The exact solution of this initial-value problem is obviously $y(x) = e^{x/2}$.

Concerning the approximate solution by Euler's method, we have, by comparing (2.1.7) with (2.1.1), f(x, y) = 0.5y, so

$$y_{n+1} = y_n + h\,\frac{y_n}{2} = \left(1 + \frac{h}{2}\right) y_n. \tag{2.1.8}$$

Therefore, in this example, each $y_n$ will be obtained from its predecessor by multiplication by $1 + \frac{h}{2}$. To be quite specific, let's take h to be 0.05. Then we show below, for each value of x = 0, 0.05, 0.10, 0.15, 0.20, . . . the approximate value of y computed from (2.1.8) and the exact value $y(x_n) = e^{x_n/2}$:

        x      Euler(x)     Exact(x)
      0.00      1.00000      1.00000
      0.05      1.02500      1.02532
      0.10      1.05063      1.05127
      0.15      1.07689      1.07788
      0.20      1.10381      1.10517
      0.25      1.13141      1.13315
      . . .
      1.00      1.63862      1.64872
      2.00      2.68506      2.71828
      3.00      4.39979      4.48169
      . . .
      5.00     11.81372     12.18249
     10.00    139.56389    148.41316

                  table 1

Considering the extreme simplicity of the approximation, it seems that we have done pretty well by this equation. Let's continue with this simple example by asking for a formula for the numbers that are called Euler(x) in the above table. In other words, exactly what function of x is Euler(x)?

To answer this, we note first that each computed value $y_{n+1}$ is obtained according to (2.1.8) by multiplying its predecessor $y_n$ by $1 + \frac{h}{2}$. Since $y_0 = 1$, it is clear that we will compute $y_n = (1 + \frac{h}{2})^n$. Now we want to express this in terms of x rather than n. Since $x_n = nh$, we have n = x/h, and since h = 0.05 we have n = 20x. Hence the computed approximation to y at a particular point x will be $1.025^{20x}$, or equivalently

$$\text{Euler}(x) = (1.638616\ldots)^x. \tag{2.1.9}$$

The approximate values can now easily be compared with the true solution, since

$$\text{Exact}(x) = e^{x/2} = \left(e^{1/2}\right)^x = (1.648721\ldots)^x. \tag{2.1.10}$$

Therefore both the exact solution of this differential equation and its computed solution have the form $(\text{const.})^x$. The correct value of "const." is $e^{1/2}$, and the value that is, in effect, used by Euler's method is $(1 + \frac{h}{2})^{1/h}$. For a fixed value of x, we see that if we use Euler's method with smaller and smaller values of h (neglecting the increase in roundoff error that is sure to result), the values Euler(x) will converge to Exact(x), because

$$\lim_{h \to 0} \left(1 + \frac{h}{2}\right)^{1/h} = e^{1/2}. \tag{2.1.11}$$

Exercises 2.1

1. Verify the limit (2.1.11).

2. Use a calculator or a computer to integrate each of the following differential equations forward ten steps, using a spacing h = 0.05 with Euler's method. Also tabulate the exact solution at each value of x that occurs.

(a) $y'(x) = xy(x)$; $y(0) = 1$

(b) $y'(x) = xy(x) + 2$; $y(0) = 1$

(c) $y'(x) = \dfrac{y(x)}{1 + x}$; $y(0) = 1$

(d) $y'(x) = -2xy(x)^2$; $y(0) = 10$

2.2 Software notes

One of the main themes of our study will be the preparation of programs that not only work, but also are easily readable and useable by other people. The act of communication that must take place before a program can be used by persons other than its author is a difficult one to carry out, and we will return several times to the principles that have evolved as guides to the preparation of readable software. Here are some of these guidelines.

1. Documentation

The documentation of a program is the set of written instructions to a user that inform the user about the purpose and operation of the program. At the moment that the job of writing and testing a program has been completed it is only natural to feel an urge to get the whole thing over with and get on to the next job. Besides, one might think, it's perfectly obvious how to use this program. Some programs may be obscure, but not this one.

It is amazing how rapidly our knowledge of our very own program fades. If we come back to a program after a lapse of a few months' time, it often happens that we will have no idea what the program did or how to use it, at least not without making a large investment of time.

For that reason it is important that when our program has been written and tested it should be documented immediately, while our memory of it is still green. Furthermore, the best place for documentation is in the program itself, in "comment" statements. That way one can be sure that when the comments are needed they will be available.

The first mission of program documentation is to describe the purpose of the program. State clearly the problem that the program solves, or the exact operation that it performs on its input in order to get its output.

Already in this first mission, a good bit of technical skill can be brought to bear that will be very helpful to the user, by intertwining the description of the program purpose with the names of the communicating variables in the program.

Let's see what that means by considering an example. Suppose we have written a subroutine that searches through a specified row of a matrix to find the element of largest

absolute value, and outputs a column in which it was found. Such a routine, in Maple for instance, might look like this:

    search:=proc(A,i)
    local j, winner, jwin;
    # winner holds the largest absolute value seen so far in row i
    winner:=-1;
    for j from 1 to coldim(A) do
      if (abs(A[i,j])>winner) then
        winner:=abs(A[i,j]) ; jwin:=j fi
    od;
    # jwin is the column where that largest entry was found
    return(jwin);
    end;

Now let's try our hand at documenting this program:

"The purpose of this program is to search a given row of a matrix to find an element of largest absolute value and return the column in which it was found."

That is pretty good documentation, perhaps better than many programs get. But we can make it a lot more useful by doing the intertwining that we referred to above. There we said that the description should be related to the communicating variables. Those variables are the ones that the user can see. They are the input and output variables of the subroutine. In most important computer languages, the communicating variables are announced in the first line of the coding of a procedure or subroutine. Maple follows this convention at least for the input variables, although the output variable is usually specified in the "return" statement. In the first line of the little subroutine above we see the list (A, i) of its input variables (or "arguments"). These are the ones that the user has to understand, as opposed to the other "local" variables that live inside the subroutine but don't communicate with the outside world (like j, winner, jwin, which are listed on the second line of the program).

The best way to help the user to understand these variables is to relate them directly to the description of the purpose of the program.

"The purpose of this program is to search row i of a given matrix A to find an entry of largest absolute value, and return the column jwin where that entry lives."

We'll come back to the subject of documentation, but now let's mention another ingredient of ease of use of programs, and that is:

2. Modularity

It is important to divide a long program into a number of smaller modules, each with a clearly stated set of inputs and outputs, and each with its own documentation. That means that we should get into the habit of writing lots of subroutines or procedures, because

the subroutine or procedure mode of expression forces one to be quite explicit about the relationship of the block of coding to the rest of the world.

When we are writing a large program we would all write a subroutine if we found that a certain sequence of steps was being called for repeatedly. Beyond this, however, there are numerous inducements for breaking off subroutines even if the block of coding occurs just once in the main program.

For one thing it's easier to check out the program. The testing procedure would consist of first testing each of the subroutines separately on test problems designed just for them. Once the subroutines work, it would remain only to test their relationships to the calling program.

For another reason, we might discover a better, faster, more elegant, or what-have-you method of performing the task that one of these subroutines does. Then we would be able to yank out the former subroutine and plug in the new one, while being careful only to make sure that the new subroutine relates to the same inputs and outputs as the old one. If jobs within a large program are not broken into subroutines it can be much harder to isolate the block of coding that deals with a particular function and remove it without affecting the whole works.

For another reason, if one be needed, it may well happen that even though the job that is done by the subroutine occurs only once in the current program, it may recur in other programs as yet undreamed of. If one is in the habit of writing small independent modules and stringing them together to make large programs, then it doesn't take long before one has a library of useful subroutines, each one tested, working and documented, that will greatly simplify the task of writing future programs.

Finally, the practice of subdividing the large jobs into the smaller jobs of which they are composed is an extremely valuable analytical skill, one that is useful not only in programming, but in all sorts of organizational activities where smaller efforts are to be pooled in order to produce a larger effect. It is therefore a quality of mind that provides much of its own justification.

In this book, the major programs that are the objects of study have been broken up into subroutines in the expectation that the reader will be able to start writing and checking out these modules even before the main ideas of the current subject have been fully explained. This was done in part because some of these programs are quite complex, and it would be unreasonable to expect the whole program to be written in a short time. It was also done to give examples of the process of subdivision that we have been talking about.

For instance, the general linear algebra program for solving systems of linear simultaneous equations in Chapter 3 has been divided into six modules, and they are described in section 3.3. The reader might wish to look ahead at those routines and to verify that even though their relationship to the whole job of solving equations is by no means clear now, nonetheless, because of the fact that they are independent and self-contained, they can be programmed and checked out at any time without waiting for the full explanation.

One more ingredient that is needed for the production of useful software is:

3. Style

We don't mean style in the sense of "class," although this is as welcome in programming as it is elsewhere. There have evolved a number of elements of good programming style, and these will mainly be discussed as they arise. But two of them (one trivial and one quite deep) are:

(a) Indentation: The instructions that lie within the range of a loop are indented in the program listing further to the right than the instructions that announce that the loop is about to begin, or that it has just terminated.

(b) Top-down structuring: When we visualize the overall logical structure of a complicated program we see a grand loop, within which there are several other loops and branchings, within which . . . etc. According to the principles of top-down design the looping and branching structure of the program should be visible at once in the listing. That is, we should see an announcement of the opening of the grand loop, then indented under that perhaps a two-way branch (if-then-else), where, under the "then" one sees all that will happen if the condition is met, and under the "else" one sees what happens if it is not met. When we say that we see all that will happen, we mean that there are not any "go-to" instructions that would take our eye out of the flow of the if-then-else loop to some other page. It all happens right there on the same page, under "then" and under "else".

These few words can scarcely convey the ideas of structuring, which we leave to the numerous examples in the sequel.

2.3 Systems and equations of higher order

We have already remarked that the methods of numerical integration for a single first-order differential equation carry over with very little change to systems of simultaneous differential equations of first order. In this section we'll discuss exactly how this is done, and furthermore, how the same idea can be applied to equations of higher order than the first. Euler's method will be used as the example, but the same transformations will apply to all of the methods that we will study.

In Euler's method for one equation, the approximate value of the unknown function at the next point $x_{n+1} = x_n + h$ is calculated from

$$y_{n+1} = y_n + h f(x_n, y_n). \tag{2.3.1}$$

Now suppose that we are trying to solve not just a single equation, but a system of N simultaneous equations, say

$$y_i'(x) = f_i(x, y_1, y_2, \ldots, y_N) \qquad i = 1, \ldots, N. \tag{2.3.2}$$
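As a reference point for the generalization to systems, the single-equation formula (2.3.1) can be sketched in a few lines of Python. This is an illustration only, not a program from the text; the function name `euler` and its argument list are assumptions. It is applied to the earlier example $y' = 0.5y$, $y(0) = 1$ with h = 0.05, advanced twenty steps to x = 1.

```python
import math

def euler(f, x0, y0, h, steps):
    """Advance y' = f(x, y) from (x0, y0) by `steps` Euler steps of size h
    and return the final approximate value of y."""
    x, y = x0, y0
    for n in range(steps):
        y = y + h * f(x, y)   # y_{n+1} = y_n + h f(x_n, y_n)
        x = x0 + (n + 1) * h  # recompute x from n to limit accumulated roundoff
    return y

approx = euler(lambda x, y: 0.5 * y, 0.0, 1.0, 0.05, 20)
print(approx, math.exp(0.5))
```

The computed value agrees with the entry Euler(1.00) = 1.63862 of table 1, against the exact value $e^{1/2} = 1.64872\ldots$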

Equation (2.3.2) represents N equations, in each of which just one derivative appears, and whose right-hand side may depend on x, and on all of the unknown functions, but not on their derivatives. The "$f_i$" indicates that, of course, each equation can have a different right-hand side.

Now introduce the vector $\mathbf{y}(x)$ of unknown functions

$$\mathbf{y}(x) = [y_1(x), y_2(x), y_3(x), \ldots, y_N(x)] \tag{2.3.3}$$

and the vector $\mathbf{f} = \mathbf{f}(x, \mathbf{y}(x))$ of right-hand sides

$$\mathbf{f} = [f_1(x, \mathbf{y}), f_2(x, \mathbf{y}), \ldots, f_N(x, \mathbf{y})]. \tag{2.3.4}$$

In terms of these vectors, equation (2.3.2) can be rewritten as

$$\mathbf{y}'(x) = \mathbf{f}(x, \mathbf{y}(x)). \tag{2.3.5}$$

We observe that equation (2.3.5) looks just like our standard form (2.1.1) for a single equation in a single unknown function, except for the bold face type, i.e., except for the fact that $\mathbf{y}$ and $\mathbf{f}$ now represent vector quantities.

To apply a numerical method such as that of Euler, then, all we need to do is to take the statement of the method for a single differential equation in a single unknown function, and replace y(x) and f(x, y(x)) by vector quantities as above. We will then have obtained the generalization of the numerical method to systems.

To be specific, Euler's method for a single equation is

$$y_{n+1} = y_n + h f(x_n, y_n) \tag{2.3.6}$$

so Euler's method for a system of differential equations will be

$$\mathbf{y}_{n+1} = \mathbf{y}_n + h\,\mathbf{f}(x_n, \mathbf{y}_n). \tag{2.3.7}$$

This means that if we know the entire vector $\mathbf{y}$ of unknown functions at the point $x = x_n$, then we can find the entire vector of unknown functions at the next point $x_{n+1} = x_n + h$ by means of (2.3.7). In detail, if $y_i(x_n)$ denotes the computed approximate value of the unknown function $y_i$ at the point $x_n$, then what we must calculate are

$$y_i(x_{n+1}) = y_i(x_n) + h f_i(x_n, y_1(x_n), y_2(x_n), \ldots, y_N(x_n)) \tag{2.3.8}$$

for each i = 1, 2, . . . , N.

As an example, take the pair of differential equations

$$y_1' = x + y_1 + y_2, \qquad y_2' = y_1 y_2 + 1 \tag{2.3.9}$$

together with the initial values $y_1(0) = 0$, $y_2(0) = 1$.
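The vector form of Euler's method translates almost word for word into code. The following minimal Python sketch is an illustration, not the text's own program; the names `euler_system_step` and `rhs` are assumptions. It takes one step for the example pair of equations with h = 0.05, evaluating every right-hand side from the old vector before any component is updated.

```python
def euler_system_step(f, x, y, h):
    """One Euler step for the system y' = f(x, y); returns a new list
    and leaves the input list y unchanged."""
    slopes = f(x, y)  # all right-hand sides are evaluated at the old y
    return [yi + h * si for yi, si in zip(y, slopes)]

def rhs(x, y):
    """Right-hand sides of the example pair of equations."""
    y1, y2 = y
    return [x + y1 + y2, y1 * y2 + 1.0]

y_next = euler_system_step(rhs, 0.0, [0.0, 1.0], 0.05)
print(y_next)
```

Because the function builds a fresh list, the old values remain available until the whole step has been computed.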

Now the vector of unknown functions is $\mathbf{y} = [y_1, y_2]$, and the vector of right-hand sides is $\mathbf{f} = [x + y_1 + y_2,\; y_1 y_2 + 1]$. Initially, the vector of unknowns is $\mathbf{y} = [0, 1]$. Let's choose a step size $h = 0.05$. Then we calculate

$$\begin{bmatrix} y_1(0.05) \\ y_2(0.05) \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \end{bmatrix} + 0.05 \begin{bmatrix} 0 + 0 + 1 \\ 0 \cdot 1 + 1 \end{bmatrix} = \begin{bmatrix} 0.05 \\ 1.05 \end{bmatrix} \tag{2.3.10}$$

and

$$\begin{bmatrix} y_1(0.10) \\ y_2(0.10) \end{bmatrix} = \begin{bmatrix} 0.05 \\ 1.05 \end{bmatrix} + 0.05 \begin{bmatrix} 0.05 + 0.05 + 1.05 \\ 0.05 \cdot 1.05 + 1 \end{bmatrix} = \begin{bmatrix} 0.1075 \\ 1.102625 \end{bmatrix} \tag{2.3.11}$$

and so forth. At each step we compute the vector of approximate values of the two unknown functions from the corresponding vector at the immediately preceding step.

Let's consider the preparation of a computer program that will carry out the solution, by Euler's method, of a system of $N$ simultaneous equations of the form (2.3.2), which we will rewrite just slightly, in the form

$$y_i' = f_i(x, \mathbf{y}), \qquad i = 1, \ldots, N. \tag{2.3.12}$$

Note that on the left is the derivative of just one of the unknown functions, and on the right there may appear all $N$ of them in each equation.

Evidently we will need an array Y of length N to hold the values of the N unknown functions at the current point x. Suppose we have computed the array Y at a point x, and we want to get the new array Y at the point x + h. Exactly what do we do?

Some care is necessary in answering this question because there is a bit of a snare in the underbrush. The new values of the $N$ unknown functions are calculated from (2.3.8) or (2.3.12) in a certain order. For instance, we might calculate $y_1(x + h)$, then $y_2(x + h)$, then $y_3(x + h)$, ..., then $y_N(x + h)$.

The question is this: when we compute $y_1(x + h)$, where shall we put it? If we put it into Y[1], the first position of the Y array in storage, then the previous contents of Y[1] are lost, i.e., the value of $y_1(x)$ is lost. But we aren't finished with $y_1(x)$ yet; it's still needed to compute $y_2(x + h)$, $y_3(x + h)$, etc. This is because the new value $y_i(x + h)$ depends (or might depend), according to (2.3.8), on the old values of all of the unknown functions, including those whose new values have already been computed before we begin the computation of $y_i(x + h)$.

If the point still is murky, go back to (2.3.11) and notice how, in the calculation of $y_2(0.10)$, we needed to know $y_1(0.05)$ even though $y_1(0.10)$ had already been computed. Hence if we had put $y_1(0.10)$ into an array to replace the old value $y_1(0.05)$ we would not have been able to obtain $y_2(0.10)$.

The conclusion is that we need at least two arrays, say YIN and YOUT, each of length N. The array YIN holds the unknown functions evaluated at x, and YOUT will hold their values at x + h. Initially YIN holds the given data at $x_0$. Then we compute all of the unknowns at $x_0 + h$, and store them in YOUT as we find them. When all have been done, we print them if desired, move all entries of YOUT back to YIN, increase x by h and repeat.
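The two-array bookkeeping described above can be sketched in a few lines. This is only an illustration, written in Python rather than the Maple used in this book; the names `f`, `euler_step`, `yin`, `yout`, and `history` are assumptions made for the sketch. It takes two Euler steps of size h = 0.05 on the example system (2.3.9).

```python
def f(x, y):
    """Right-hand sides of the example system (2.3.9)."""
    y1, y2 = y
    return [x + y1 + y2, y1 * y2 + 1.0]

def euler_step(x, yin, h):
    """Advance one Euler step; yin is only read, the result is a new list."""
    rhs = f(x, yin)                      # uses only the OLD values of every unknown
    return [yin[i] + h * rhs[i] for i in range(len(yin))]

x, h = 0.0, 0.05
yin = [0.0, 1.0]                         # initial data y1(0) = 0, y2(0) = 1
history = []
for _ in range(2):
    yout = euler_step(x, yin, h)         # all new values are collected in YOUT
    x += h
    history.append((x, yout))
    yin = yout                           # plays the part of "move YOUT back to YIN"
for xk, yk in history:
    print(xk, yk)
```

Because `euler_step` never writes into `yin`, every component of the new vector is computed from the old values, which is exactly the point of keeping two arrays.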

The principal building block in this structure would be a subroutine that would advance the solution exactly one step. The main program would initialize the arrays, call this subroutine, increase x, move data from the output array YOUT back to the input array YIN, print, etc. The single-step subroutine is shown below. We will use it later on to help us get a solution started when we use methods that need information at more than one point before they can start the integration.

    Eulerstep:=proc(xin,yin,h,n) local i,yout;
    # This program numerically integrates the system
    # y'=f(x,y) one step forward by Euler's method using step size
    # h. Enter with values at xin in yin. Exit with values at xin+h
    # in yout. Supply f as a function subprogram.
    yout:=[seq(evalf(yin[i]+h*f(xin,yin,i)),i=1..n)];
    return(yout);
    end;

A few remarks about the program are in order. One structured data type in Maple is a list of things, in this case, a list of floating point numbers. The seq command (for "sequence") creates such a list, in this case a list of length n since i goes from 1 to n in the seq command. The brackets [ and ] convert the list into a vector. The evalf command ensures that the results of the computation of the components of yout are floating point numbers.

Our next remark about the program concerns the function subprogram f, which calculates the right-hand sides of the differential equation. This subprogram, of course, must be supplied by the user. Here is a sample of such a program, namely the one that describes the system (2.3.9). In that case we have $f_1(x, \mathbf{y}) = x + y_1 + y_2$ and $f_2(x, \mathbf{y}) = y_1 y_2 + 1$. This translates into the following:

    f:=proc(x,y,i)
    # Calculates the right-hand sides of the system of differential
    # equations. The index i selects which equation is evaluated.
    if i=1 then return(x+y[1]+y[2]) else return(y[1]*y[2]+1) fi;
    end;

Our last comment about the program to solve systems is that it is perfectly possible to use it in such a way that we would not have to move the contents of the vector YOUT back to the vector YIN at each step. In other words, we could save N move operations, where N is the number of equations. Such savings might be significant in an extended calculation.

To achieve this saving, we write two blocks of programming in the main program. One block takes the contents of YIN as input, advances the solution one step by Euler's method and prints, leaving the new vector in YOUT. Then, without moving anything, another block of programming takes the contents of YOUT as input, advances the solution one step, leaves the new vector in YIN, and prints. The two blocks call Euler alternately as the integration proceeds to the right. The reader might enjoy writing this program, and thinking about how to generalize the idea to the situation where the new value of y is computed from two


previously computed values, rather than from just one (then three blocks of programming would be needed).

Now we've discussed the numerical solution of a single differential equation of first order, and of a system of simultaneous differential equations of first order, and there remains the treatment of equations of higher order than the first. Fortunately, this case is very easily reduced to the varieties that we have already studied. For example, suppose we want to solve a single equation of the second order, say

$$y'' + x y' + (x + 1) \cos y = 2. \tag{2.3.13}$$

The strategy is to transform the single second-order equation into a pair of simultaneous first-order equations that can then be handled as before. To do this, choose two unknown functions $u$ and $v$. The function $u$ is to be the unknown function $y$ in (2.3.13), and the function $v$ is to be the derivative of $u$. Then $u$ and $v$ satisfy two simultaneous first-order differential equations:

$$\begin{aligned} u' &= v \\ v' &= -x v - (x + 1) \cos u + 2 \end{aligned} \tag{2.3.14}$$

and these are exactly of the form (2.3.5) that we have already discussed!

The same trick works on a general differential equation of $N$th order

$$y^{(N)} + G(x, y, y', y'', \ldots, y^{(N-1)}) = 0. \tag{2.3.15}$$
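As a concrete check of the second-order reduction, the pair (2.3.14) can be written as a right-hand-side function and tested against (2.3.13). The sketch below is in Python rather than Maple, and the names `rhs` and `residual` are hypothetical helpers introduced only for this illustration.

```python
import math

def rhs(x, w):
    """Right-hand sides of (2.3.14); w = [u, v] with u = y and v = y'."""
    u, v = w
    return [v, -x * v - (x + 1.0) * math.cos(u) + 2.0]

def residual(x, y, yp, ypp):
    """Left side minus right side of the second-order equation (2.3.13)."""
    return ypp + x * yp + (x + 1.0) * math.cos(y) - 2.0

# Pick arbitrary values of x, y, y' and let the system supply y'' = v'.
x, y, yp = 0.3, 0.7, -1.2
u_prime, v_prime = rhs(x, [y, yp])
print(u_prime, v_prime, residual(x, y, yp, v_prime))
```

The residual should vanish up to rounding, which confirms that the first-order pair carries the same information as the original equation. A function of this shape is what a one-step routine for systems would call.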

We introduce $N$ unknown functions $u_0, u_1, \ldots, u_{N-1}$, and let them be the solutions of the system of $N$ simultaneous first-order equations

$$\begin{aligned} u_0' &= u_1 \\ u_1' &= u_2 \\ &\;\;\vdots \\ u_{N-2}' &= u_{N-1} \\ u_{N-1}' &= -G(x, u_0, u_1, \ldots, u_{N-1}). \end{aligned} \tag{2.3.16}$$

The system can now be dealt with as before.

Exercises 2.3

1. Write each of the following as a system of simultaneous first-order initial-value problems in the standard form (2.3.2):
   (a) $y'' + x^2 y = 0$; $y(0) = 1$; $y'(0) = 0$
   (b) $u' + x v = 2$; $v' + e^{uv} = 0$; $u(0) = 0$; $v(0) = 0$
   (c) $u' + x v = 0$; $v' + x^2 u = 1$; $u(1) = 1$; $v(1) = 0$
   (d) $y^{iv} + 3x y''' + x^2 y'' + 2y' + y = 0$; $y(0) = y'(0) = y''(0) = y'''(0) = 1$
   (e) $x'''(t) + t^3 x(t) + y(t) = 0$; $y''(t) + x(t)^2 = t^3$; $x(0) = 2$; $x'(0) = 1$; $x''(0) = 0$; $y(0) = 1$; $y'(0) = 0$

2. For each of the parts of problem 1, write the function subprogram that will compute the right-hand sides, as required by the Eulerstep subroutine.

3. For each of the parts of problem 1, assemble and run on the computer the Euler program, together with the relevant function subprogram of problem 2, to print out the solutions for fifty steps of integration, each of size h = 0.03. Begin with $x = x_0$, the point at which the initial data was given.

4. Reprogram the Eulerstep subroutine, as discussed in the text, to avoid the movement of YOUT back to YIN.

5. Modify your program as necessary (in Maple, take advantage of the plot command) to produce graphical output (graph all of the unknown functions on the same axes). Test your program by running it with Euler as it solves $y'' + y = 0$ with $y(0) = 0$, $y'(0) = 1$, and $h = \pi/60$ for 150 steps.

6. Write a program that will compute successive values $y_p, y_{p+1}, \ldots$ from a difference equation of order $p$. Do this by storing the y's as they are computed in a circular list, so that it is never necessary to move back the last $p$ computed values before finding the next one. Write your program so that it will work with vectors, so you can solve systems of difference equations as well as single ones.

2.4 How to document a program

One of the main themes of our study will be the preparation of programs that not only work, but also are easily readable and useable by other people. The act of communication that must take place before a program can be used by persons other than its author is a diﬃcult one to carry out, and we will return several times to the principles that serve as guides to the preparation of readable software. In this section we discuss further the all-important question of program documentation, already touched upon in section 2.2. Some very nontrivial skills are called for in the creation of good user-oriented program descriptions. One of these is the ability to enter the head of another person, the user, and to relate to the program that you have just written through the user’s eyes. It’s hard to enter someone else’s head. One of the skills that make one person a better teacher than another person is of the same kind: the ability to see the subject matter that is being taught through the eyes of another person, the student. If one can do that, or even make a good try at it, then obviously one will be able much better to deal with the questions that are really concerning the audience. Relatively few actually do this to any great extent not, I think, because it’s an ability that one either has or doesn’t have, but because few eﬀorts are made to train this skill and to develop it. We’ll try to make our little contribution here.

(A) What does it do?

The ﬁrst task should be to describe the precise purpose of the program. Put yourself in the place of a potential user who is looking at a particular numerical instance of the problem that needs solving. That user is now thumbing through a book full of program descriptions in the library of a computer center 10,000 miles away in order to ﬁnd a program that will do the problem in question. Your description must answer that question. Let’s now assume that your program has been written in the form of a subroutine or procedure, rather than as a main program. Then the list of global, or communicating variables is plainly in view, in the opening statement of the subroutine. As we noted in section 2.2, you should state the purpose of your program using the global variables of the subroutine in the same sentence. For one example of the diﬀerence that makes, see section 2.2. For another, a linear equation solver might be described by saying “This program solves a system of simultaneous equations. To use it, put the right-hand sides into the vector B, put the coeﬃcients into the matrix A and call the routine. The answer will be returned in X.” We are, however, urging the reader to do it this way: “This program solves the equations AX=B, where A is an N-by-N matrix and B is an N-vector.” Observe that the second description is shorter, only about half as long, and yet more informative. We have found out not only what the program does, but how that function relates to the global variables of the subroutine. This was done by using a judicious sprinkling of symbols in the documentation, along with the words. Don’t use only symbols, or only words, but weave them together for maximum information. Notice also that the ambiguous term “right-hand side” that appeared in the ﬁrst form has been done away with in the second form. 
The phrase was ambiguous because exactly what ends up on the right-hand side and what on the left is an accident of how we happen to write the equations, and your audience may not do it the same way you do.

(B) How is it done?

This is usually the easy part of program documentation because it is not the purpose of this documentation to give a course in mathematics or algorithms or anything else. Hence most of the time a reference to the literature is enough, or perhaps if the method is a standard one, just give its name. Often though, variations on the standard method have been chosen, and the user must be informed about those:

“. . . is solved by Gaussian elimination, using complete positioning for size. . . ”

“. . . the input array A is sorted by the Quicksort method (see D.E. Knuth, The Art of Computer Programming, volume 3). . . ”

“. . . the eigenvalues and vectors are found by the Jacobi method, using Corbató’s method of avoiding the search for the largest off-diagonal element (see, for instance, the description in D.R. Wilson, A First Course in Mathematical Software). . . ”

“. . . is found by the Simplex method, except that Charnes’ selection rule (see F.A. Ficken, The Simplex Method. . . ) is not programmed, and so. . . ”

(C) Describe the global variables

Now it gets hard again. The global variables are the ones through which the subroutine communicates with the user. Generally speaking, the user doesn’t care about variables that are entirely local to your subroutine, but is vitally concerned with the communicating variables.

First the user has to know exactly how each of the global variables is related to the problem that is being solved. This calls for a brief verbal description of the variable, and what it has to do with the functioning of the program.

“A[i] is the ith element of the input list that is to be sorted, i=1..N”

“WHY is set by the subroutine to TRUE unless the return is because of overflow, and then it will be set to FALSE.”

“B[i,j] is the coefficient of X[j] in the ith one of the input equations BX=C.”

“option is set by the calling program on input. Set it to 0 if the output is to be rounded to the nearest integer, else set it to m if the output is to be rounded to m decimal places (m ≤ 12).”

It is extremely important that each and every global variable of the subroutine should get such a description. Just march through the parentheses in the subroutine or procedure heading, and describe each variable in turn.

Next, the user will need more information about each of the global variables than just its description as above. Also required is the “type” of the variable. Some computer languages force each program to declare the types of their variables right in the opening statement. Others declare types by observing various default rules with exceptions stated. In any case, a little redundancy never hurts, and the program documentation should declare the type of each and every global variable.

It’s easy to declare, along with the types of the variables, their dimensions if they are array variables. For instance we may have a

    solver:=proc(A,X,n,ndim,b);

in which the communicating variables have the following types:

    A       ndim-by-ndim array of floating point numbers
    X       vector of floating point numbers of length n
    n       integer
    ndim    integer
    b       vector of floating point numbers of length n

The best way to announce all of these types and dimensions of global variables to the user is simply to list them, as above, in a table.

Now surely we’ve finished taking the pulse, blood pressure, etc., of the global variables, haven’t we? Well, no, we haven’t. There’s still more vital data that a user will need to know about these variables. There isn’t any standard name like “type” to apply to this information, so we’ll call it the “role” of the variable.

First, for some of the global variables of the subroutine, it may be true that their values at the time the subroutine is called are quite irrelevant to the operation of the subroutine. This would be the case for the output variables, and in certain other situations. For some other variables, the values at input time would be crucial. The user needs to know which are which. Just for one example, if the value at input time is irrelevant, then the user can feel free to use the same storage for other temporary purposes between calls to the subroutine.

Second, it may happen that certain variables are returned by the subroutine with their values unchanged. This is particularly true for “implicitly passed” global variables, i.e., variables whose values are used by the subroutine but which do not appear explicitly in the argument list. In such cases, the user may be delighted to hear the good news. In other cases, the action of a subroutine may change an input variable, so if the user needs to use those quantities again it will be necessary to save them somewhere else before calling the subroutine. In either case, the user needs to be informed.

Third, it may be that the computation of the value of a certain variable is one of the main purposes of the subroutine. Such variables are the outputs of the program, and the user needs to know which these are (whether they are explicit in the heading or the return statement, or are “implicit”).

Although some high-level computer languages require type declarations immediately in the opening instruction of a subroutine, none require the descriptions of the roles of the variables (well, Pascal requires the VAR declaration, and Maple separates the input variables from the output ones, but both languages allow implicit passing and changing of global variables). These are, however, important for the user to know, so let’s invent a shorthand for describing them in the documentation of the programs that occur in this book.

First, if the value at input time is important, let’s say that the role of the variable is I, otherwise it is I’.

Second, if the value of the variable is changed by the action of the subroutine, we’ll say that its role is C, else C’.

Finally, if the computation of this variable is one of the main purposes of the subroutine, its role is O (as in output), else O’.

In the description of each communicating variable, all three of these should be specified. Thus, a variable X might have role IC’O’, or a variable why might be of role I’CO, etc.

To sum up, the essential features of program documentation are a description of what the program does, phrased in terms of the global variables, a statement of how it gets the job done, and a list of all of the global variables, showing for each one its name, type, dimension (or structure) if any, its role, and a brief verbal description.

Refer back to the short program in section 2.2 that searches for the largest element in a row of a matrix. Here is the table of information about its global variables:

    Name    Type                     Role     Description
    A       floating point matrix    IC’O’    The input matrix
    i       integer                  IC’O’    Which row to search
    jwin    integer                  I’CO     Column containing largest element

Exercises 2.4

Write programs that perform each of the jobs stated below. In each case, after testing the program, document it with comments. Give a complete table of information about the global variables in each case.
   (a) Find and print all of the prime numbers between M and N.
   (b) Find the elements of largest and smallest absolute values in a given linear array (vector), and their positions in the array.
   (c) Sort the elements of a given linear array into ascending order of size.
   (d) Deal out four bridge hands (13 cards each from a standard 52-card deck; this one is not so easy!).
   (e) Solve a quadratic equation (any quadratic equation!).

2.5 The midpoint and trapezoidal rules

Euler’s formula is doubtless the simplest numerical integration procedure for differential equations, but the accuracy that can be obtained with it is insufficient for most applications. In this section and those that follow, we want to introduce a whole family of methods for the solution of differential equations, called the linear multistep methods, in which the user can choose the degree of precision that will suffice for the job, and then select a member of the family that will achieve it.

Before describing the family in all of its generality, we will produce two more of its members, which illustrate the different sorts of creatures that inhabit the family in question.

Recall that we derived Euler’s method by chopping off the Taylor series expansion of the solution after the linear term. To get a more accurate method we could, of course, keep the quadratic term, too. However, that term involves a second derivative, and we want to avoid the calculation of higher derivatives because our differential equations will always be written as first-order systems, so that only the first derivative will be conveniently computable.

We can have greater accuracy without having to calculate higher derivatives if we’re willing to allow our numerical integration procedure to involve values of the unknown function and its derivative at more than one point. In other words, in Euler’s method, the next value of the unknown function, at $x + h$, is gotten from the values of $y$ and $y'$ at just one backwards point $x$. In the more accurate formulas that we will discuss next, the new value of $y$ depends on $y$ and $y'$ at more than one point, for instance, at $x$ and $x - h$, or at several points.

As a primitive example of this kind, we will now discuss the midpoint rule. We begin once again with the Taylor expansion of the unknown function $y(x)$ about the point $x_n$:

$$y(x_n + h) = y(x_n) + h y'(x_n) + h^2 \frac{y''(x_n)}{2} + h^3 \frac{y'''(x_n)}{6} + \cdots. \tag{2.5.1}$$

Now we rewrite equation (2.5.1) with $h$ replaced by $-h$ to get

$$y(x_n - h) = y(x_n) - h y'(x_n) + h^2 \frac{y''(x_n)}{2} - h^3 \frac{y'''(x_n)}{6} + \cdots \tag{2.5.2}$$

and then subtract these equations, obtaining

$$y(x_n + h) - y(x_n - h) = 2h y'(x_n) + 2h^3 \frac{y'''(x_n)}{6} + \cdots. \tag{2.5.3}$$

Now, just as we did in the derivation of Euler’s method, we will truncate the right side of (2.5.3) after the first term, ignoring the terms that involve $h^3$, $h^5$, etc. Further, let’s use $y_n$ to denote the computed approximate value of $y(x_n)$ (and $y_{n+1}$ for the approximate $y(x_{n+1})$, etc.). Then we have

$$y_{n+1} - y_{n-1} = 2h y_n'. \tag{2.5.4}$$

If, as usual, we are solving the differential equation $y' = f(x, y)$, then finally (2.5.4) takes the form

$$y_{n+1} = y_{n-1} + 2h f(x_n, y_n) \tag{2.5.5}$$

and this is the midpoint rule. The name arises from the fact that the first derivative $y_n'$ is being approximated by the slope of the chord that joins the two points $(x_{n-1}, y_{n-1})$ and $(x_{n+1}, y_{n+1})$, instead of the chord joining $(x_n, y_n)$ and $(x_{n+1}, y_{n+1})$ as in Euler’s method.

At first sight it seems that (2.5.5) can be used just like Euler’s method, because it is a recurrence formula in which we compute the next value $y_{n+1}$ from the two previous values $y_n$ and $y_{n-1}$. Indeed the rules are quite similar, except for the fact that we can’t get started with the midpoint rule until we know two consecutive values $y_0$, $y_1$ of the unknown function at two consecutive points $x_0$, $x_1$. Normally a differential equation is given together with just one value of the unknown function, so if we are to use the midpoint rule we’ll need to manufacture one more value of $y(x)$ by some other means.

25 .00 2. from Euler’s method.5 ∗ 1. . . 1.1(0.02500 1.07756 1. in this case at x = 0 and x = 0.56389 Exact(x) 1.5y2 ) = 1. .71763 4. namely y1 = 1.10 0.00000 1. .00000 1.10381 1.00 Midpoint(x) 1.025 (see Table 1). .00 10.1. .00 .05 we use the value that Euler’s method gives us. Now to start the midpoint rule we need two consecutive values of y.05 as before. .7) that we used to illustrate Euler’s rule.05125) = 1. we can get it started most easily by calculating y1 . from Euler’s method. and then switching to the midpoint rule to carry out the rest of the calculation. Let’s do this.02500 1.07689 1.05127 1. . 12. . To get such a formula started we will have to ﬁnd several starting values in addition to the one that is given in the statement of the initial-value problem.1(0.5y and the initial value y(0) = 1.13141 .05 0.025) = 1.81372 139.48032 .5.31274 Euler(x) 1. the approximation to y(x0 + h).02532 1.05.10517 1. . We’ll use the same step size h = 0.39979 .05125 and y3 = y1 + 2h(0.00000 1.00 3.5y1 ) = 1 + 0.6) . for example with the same diﬀerential equation (2.64872 2.68506 4.18249 148.025 + 0.71828 4. 11.64847 2.7) (2. .5 ∗ 1. x 0.13315 . 1. and from the exact solution y(x) = ex/2 . but several of its predecessors. 1.5.07788 1.63862 2. because to obtain greater precision without computing higher derivatives we will get the next approximate value of y from a recurrence formula that may involve not just one or two.00 0. . In the table below we show for each x the value computed from the midpoint rule. At 0. 5.41316 (2.48169 . 1.5.10513 1.05063 1.5). To get back to the midpoint rule. 12.17743 148. . .0775625 . It’s easy to continue the calculation now from (2.40 The Numerical Solution of Diﬀerential Equations This kind of situation will come up again and again as we look at more accurate methods.05125 1. The superior accuracy of the midpoint rule is apparent.20 0.15 0.13282 . The problem consists of the equation y = 0. . . For instance y2 = y0 + 2h(0. 
so we can compare the two methods.
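The start-up procedure just described (one Euler step, then the midpoint rule) can be sketched as follows. This is a minimal illustration in Python rather than Maple; the function name `midpoint_solve` and the list `ys` are assumptions of the sketch, and the Euler value at x = 1 is obtained from the closed form $(1 + 0.5h)^{20}$ that Euler's method produces for this particular equation.

```python
import math

def f(x, y):
    return 0.5 * y                       # the test equation y' = 0.5 y

def midpoint_solve(x0, y0, h, nsteps):
    """Return [y0, y1, ..., y_nsteps]; y1 comes from one Euler step."""
    ys = [y0, y0 + h * f(x0, y0)]        # Euler supplies the second starting value
    for n in range(1, nsteps):
        xn = x0 + n * h
        ys.append(ys[n - 1] + 2.0 * h * f(xn, ys[n]))   # rule (2.5.5)
    return ys

h = 0.05
ys = midpoint_solve(0.0, 1.0, h, 20)     # 20 steps reach x = 1
euler_at_1 = (1.0 + 0.5 * h) ** 20       # closed form of Euler's method here
exact_at_1 = math.exp(0.5)
print(ys[2], ys[3], ys[20], euler_at_1, exact_at_1)
```

Note that the list keeps every computed value, so the two predecessors needed by the rule are always at hand; a production routine would more likely keep only the last two.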

Next, we introduce a third method of numerical integration, the trapezoidal rule. The best way to obtain it is to convert the differential equation that we’re trying to solve into an integral equation, and then use the trapezoidal approximation for the integral.

We begin with the differential equation $y' = f(x, y(x))$, and we integrate both sides from $x$ to $x + h$, getting

$$y(x + h) = y(x) + \int_x^{x+h} f(t, y(t))\, dt. \tag{2.5.8}$$

Now if we approximate the right-hand side in any way by a weighted sum of values of the integrand at various points we will have found an approximate method for solving our differential equation.

The trapezoidal rule states that for an approximate value of an integral

$$\int_a^b f(t)\, dt \tag{2.5.9}$$

we can use, instead of the area under the curve between $x = a$ and $x = b$, the area of the trapezoid whose sides are the $x$ axis, the lines $x = a$ and $x = b$, and the line through the points $(a, f(a))$ and $(b, f(b))$, as shown in Figure 2.1. That area is $\tfrac{1}{2}(f(a) + f(b))(b - a)$.

[Figure 2.1: The trapezoidal rule]

If we apply the trapezoidal rule to the integral that appears in (2.5.8), we obtain

$$y(x_n + h) \approx y(x_n) + \frac{h}{2}\bigl(f(x_n, y(x_n)) + f(x_n + h, y(x_n + h))\bigr) \tag{2.5.10}$$

in which we have used the “≈” sign rather than the “=” because the right-hand side is not exactly equal to the integral that really belongs there, but is only approximately so.

If we use our usual abbreviation $y_n$ for the computed approximate value of $y(x_n)$, then (2.5.10) becomes

$$y_{n+1} = y_n + \frac{h}{2}\bigl(f(x_n, y_n) + f(x_{n+1}, y_{n+1})\bigr). \tag{2.5.11}$$

As a numerical example. compute the right side.5.056796. but also on the right (it’s hiding in the second f on the right side). we use this guess on the right side of (2. and obtain y1 = 1. First we would guess yn+1 (guessing yn+1 to be equal to yn wouldn’t be all that bad. (2.1ey1 ) and sure enough.13). it turns out that one does not actually have to iterate to convergence. This is not the case if a high quality guess is unavailable. can immediately be computed from the previous value. and such pairs form the basis of many of the highly accurate schemes that are used in practice. take the diﬀerential equation y = 2xey + 1 (2. is called a predictor-corrector pair. Now if the new value agrees with the old suﬃciently well the iteration would halt.05. the computed approximation to y(0.5. Then we get 1. and since this is in suﬃciently close agreement with the previous result. The pair of formulas. Then we would have a “more improved” guess. the result is y1 = 1.5. For comparison.12) with the initial value y(0) = 1.11) looks like a recurrence formula from which the next approximate value. If we use h = 0. one of which supplies a very good guess to the next value of y. and then we could use that value as a new “improved” value of yn+1 . and we would have found the desired value of yn+1 . (2.5. we show Midpoint(x) and Exact(x). then just one reﬁnement by a single application of the trapezoidal formula is suﬃcient. this is not the case. yn . Let’s guess y1 = 1.42 The Numerical Solution of Diﬀerential Equations This is the trapezoidal rule in the form that is useful for diﬀerential equations. If a good enough guess is available for the unknown value. of the unknown function. y(0) = 1 as the column Trap(x). and the other of which reﬁnes it to a better guess. the unknown number y1 appears on both sides. Upon closer examination one observes that the next value yn+1 appears not only on the left-hand side.9.1 to repeat the same sort of thing to get y2 . in actual use. etc. 
the approximate value of y(0. and we go next to x = 0. At ﬁrst sight. then our ﬁrst task is to calculate y1 . In Table 3 we show the results of using the trapezoidal rule (where we have iterated until two successive guesses are within 10−4 ) on our test equation y = 0.057196.057193. If we use this guess value on the right side of (2. Hence. However.05).11) then we would be able to calculate the entire right-hand side. we declare that the iteration has converged. but we can do better).5. We will discuss this point in more detail in section 2. Then we take y1 = 1.025(2 + 0. The trapezoidal rule asserts that y1 = 1 + 0. Fortunately. yn+1 .1). Since this is not a very inspired guess. If we use this new guess the same way. Otherwise we would use the improved value on the right side just as we previously used the ﬁrst guess.13) .057196 for the computed value of the unknown function at x = 0. In order to ﬁnd the value yn+1 it appears that we need to carry out an iterative process.5y.05. we will have to iterate the trapezoidal rule to convergence.

        x       Trap(x)      Midpoint(x)     Exact(x)
      0.00      1.00000        1.00000       1.00000
      0.05      1.02532        1.02500       1.02532
      0.10      1.05127        1.05125       1.05127
      0.15      1.07789        1.07756       1.07788
      0.20      1.10518        1.10513       1.10517
      0.25      1.13316        1.13282       1.13315
       ...        ...            ...           ...
      1.00      1.64876        1.64847       1.64872
      2.00      2.71842        2.71763       2.71828
      3.00      4.48203        4.48032       4.48169
       ...        ...            ...           ...
      5.00     12.18402       12.17743      12.18249
       ...        ...            ...           ...
     10.00    148.45089      148.31274     148.41316

                            Table 3

2.6 Comparison of the methods

We are now in possession of three methods for the numerical solution of differential equations. They are Euler’s method

$$y_{n+1} = y_n + h y_n', \tag{2.6.1}$$

the trapezoidal rule

$$y_{n+1} = y_n + \frac{h}{2}(y_n' + y_{n+1}'), \tag{2.6.2}$$

and the midpoint rule

$$y_{n+1} = y_{n-1} + 2h y_n'. \tag{2.6.3}$$

In order to compare the performance of the three techniques it will be helpful to have a standard differential equation on which to test them. The most natural candidate for such an equation is $y' = Ay$, where $A$ is constant. The reasons for this choice are first that the equation is easy to solve exactly, second that the difference approximations are also relatively easy to solve exactly, so comparison is readily done, third that by varying the sign of $A$ we can study behavior of either growing or shrinking (stable or unstable) solutions, and finally that many problems in nature have solutions that are indeed exponential, at least over the short term, so this is an important class of differential equations.

We will, however, write the test equation in a slightly different form for expository reasons, namely as

$$y' = -\frac{y}{L}; \qquad y(0) = 1; \qquad L > 0 \tag{2.6.4}$$

where $L$ is a constant. The most interesting and revealing case is where the true solution is a decaying exponential, so we will assume that $L > 0$. Further, we will assume that $y(0) = 1$ is the given initial value.

The exact solution is of course

$$\mathrm{Exact}(x) = e^{-x/L}. \tag{2.6.5}$$

Notice that if $x$ increases by $L$, the solution changes by a factor of $e$. Hence $L$, called the relaxation length of the problem, can be conveniently visualized as the distance over which the solution falls by a factor of $e$.

Now we would like to know how well each of the methods (2.6.1)–(2.6.3) handles the problem (2.6.4).

Suppose first that we ask Euler's method to solve the problem. If we substitute $y' = f(x, y) = -y/L$ into (2.6.1), we get

$$y_{n+1} = y_n + h\left(-\frac{y_n}{L}\right) = \left(1 - \frac{h}{L}\right) y_n. \tag{2.6.6}$$

Before we solve this recurrence, let's comment on the ratio $h/L$ that appears in it. Now $L$ is the distance over which the solution changes by a factor of $e$, and $h$ is the step size that we are going to use in the numerical integration. Instinctively, one feels that if the solution is changing rapidly in a certain region, then $h$ will have to be kept small there if good accuracy is to be retained, while if the solution changes only slowly, then $h$ can be larger without sacrificing too much accuracy. The ratio $h/L$ measures the step size of the integration in relation to the distance over which the solution changes appreciably. Hence, $h/L$ is exactly the thing that one feels should be kept small for a successful numerical solution. Since $h/L$ occurs frequently below, we will denote it with the symbol $\tau$.

Now the solution of the recurrence equation (2.6.6), with the starting value $y_0 = 1$, is obviously

$$y_n = (1 - \tau)^n, \qquad n = 0, 1, 2, \ldots \tag{2.6.7}$$

Next we study the trapezoidal approximation to the same equation (2.6.4). We substitute $y' = f(x, y) = -y/L$ in (2.6.2) and get

$$y_{n+1} = y_n + \frac{h}{2}\left(-\frac{y_n}{L} - \frac{y_{n+1}}{L}\right). \tag{2.6.8}$$

The unknown $y_{n+1}$ appears, as usual with this method, on both sides of the equation. However, for the particularly simple equation that we are now studying, there is no difficulty in solving (2.6.8) for $y_{n+1}$ (without any need for an iterative process) and obtaining

$$y_{n+1} = \frac{1 - \frac{\tau}{2}}{1 + \frac{\tau}{2}}\, y_n. \tag{2.6.9}$$

Together with the initial value $y_0 = 1$, this implies that

$$y_n = \left(\frac{1 - \frac{\tau}{2}}{1 + \frac{\tau}{2}}\right)^n, \qquad n = 0, 1, 2, \ldots \tag{2.6.10}$$
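The two recurrences (2.6.6) and (2.6.9) are easy to check numerically. The following Python sketch iterates them and compares the results with the exact solution (2.6.5). The function name `compare_euler_trapezoid` and the default values h = 0.05, L = 1 and 20 steps are illustrative assumptions, not part of the text.

```python
import math

def compare_euler_trapezoid(h=0.05, L=1.0, steps=20):
    """Iterate recurrences (2.6.6) and (2.6.9) for y' = -y/L, y(0) = 1."""
    tau = h / L
    euler = trap = 1.0
    for _ in range(steps):
        euler = (1.0 - tau) * euler                          # Euler, (2.6.6)
        trap = (1.0 - tau / 2.0) / (1.0 + tau / 2.0) * trap  # trapezoidal, (2.6.9)
    exact = math.exp(-steps * h / L)                         # (2.6.5) at x = steps * h
    return euler, trap, exact

euler, trap, exact = compare_euler_trapezoid()
print(f"Euler {euler:.5f}  Trap {trap:.5f}  Exact {exact:.5f}")
```

With these defaults the comparison is made at x = 1, and the trapezoidal value lies noticeably closer to the exact one than the Euler value does.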

Before we deal with the midpoint rule, let's pause to examine the two methods whose solutions we have just found. Note that for a given value of $h$, all three of (a) the exact solution, (b) Euler's solution and (c) the trapezoidal solution are of the form $y_n = (\text{constant})^n$, in which the three values of "constant" are

$$\text{(a)}\;\; e^{-\tau} \qquad \text{(b)}\;\; 1 - \tau \qquad \text{(c)}\;\; \frac{1 - \frac{\tau}{2}}{1 + \frac{\tau}{2}}. \tag{2.6.11}$$

It follows that to compare the two approximate methods with the "truth," all we have to do is see how close the constants (b) and (c) above are to the true constant (a). If we remember that $\tau$ is being thought of as small compared to 1, then we have the power series expansion of $e^{-\tau}$

$$e^{-\tau} = 1 - \tau + \frac{\tau^2}{2} - \frac{\tau^3}{6} + \cdots \tag{2.6.12}$$

to compare with $1 - \tau$ and with the power series expansion of

$$\frac{1 - \frac{\tau}{2}}{1 + \frac{\tau}{2}} = 1 - \tau + \frac{\tau^2}{2} - \frac{\tau^3}{4} + \cdots. \tag{2.6.13}$$

The comparison is now clear. Both the Euler and the trapezoidal methods yield approximate solutions of the form $(\text{constant})^n$, where "constant" is near $e^{-\tau}$. The trapezoidal rule does a better job of being near $e^{-\tau}$ because its constant agrees with the power series expansion of $e^{-\tau}$ through the quadratic term, whereas that of the Euler method agrees only up to the linear term.

Finally we study the nature of the approximation that is provided by the midpoint rule. We will find that a new and important phenomenon rears its head in this case. The analysis begins just as it did in the previous two cases: We substitute the right-hand side $f(x, y) = -y/L$ for $y'$ in (2.6.3) to get

$$y_{n+1} = y_{n-1} + 2h\left(-\frac{y_n}{L}\right). \tag{2.6.14}$$

One important feature is already apparent. Instead of facing a first-order difference equation as we did in (2.6.6) for Euler's method and in (2.6.9) for the trapezoidal rule, we have now to contend with a second-order difference equation.

Since the equation is linear with constant coefficients, we know to try a solution of the form $y_n = r^n$. This leads to the quadratic equation

$$r^2 + 2\tau r - 1 = 0. \tag{2.6.15}$$

Evidently the discriminant of this equation is positive, so its roots are distinct. If we denote these two roots by $r_+(\tau)$ and $r_-(\tau)$, then the general solution of the difference equation (2.6.14) is

$$y_n = c\,(r_+(\tau))^n + d\,(r_-(\tau))^n, \tag{2.6.16}$$

where $c$ and $d$ are constants whose values are determined by the initial data $y_0$ and $y_1$.

The Euler and trapezoidal approximations were each of the form $(\text{constant})^n$. This one is a sum of two terms of that kind. We will see that $r_+(\tau)$ is a very good approximation to $e^{-\tau}$. The other term, $(r_-(\tau))^n$ is, so to speak, the price that we pay for getting such a good approximation in $r_+(\tau)$. We hope that the other term will stay small relative to the first term, so as not to disturb the closeness of the approximation. We will see, however, that it need not be so obliging, and that in fact it might do whatever it can to spoil things.

The two roots of the quadratic equation are

$$r_+(\tau) = -\tau + \sqrt{1 + \tau^2}, \qquad r_-(\tau) = -\tau - \sqrt{1 + \tau^2}. \tag{2.6.17}$$

When $\tau = 0$ the first of these is $+1$, so when $\tau$ is small $r_+(\tau)$ is near $+1$, and it is the root that is trying to approximate the exact constant $e^{-\tau}$ as well as possible. In fact it does pretty well, because the power series expansion of $r_+(\tau)$ is

$$r_+(\tau) = 1 - \tau + \frac{\tau^2}{2} - \frac{\tau^4}{8} + \cdots \tag{2.6.18}$$

so it agrees with $e^{-\tau}$ through the quadratic terms.

What about $r_-(\tau)$? Its Taylor series is

$$r_-(\tau) = -1 - \tau - \frac{\tau^2}{2} + \cdots. \tag{2.6.19}$$

The bad news is now before us: When $\tau$ is a small positive number, the root $r_-(\tau)$ is larger than 1 in absolute value. This means that the stability criterion of Theorem 1.6.1 is violated, so we say that the midpoint rule is unstable.

In practical terms, we observe that $r_+(\tau)$ is close to $e^{-\tau}$, so the first term on the right of (2.6.16) is very close to the exact solution. The second term of (2.6.16), the so-called parasitic solution, is small compared to the first term when $n$ is small, because the constant $d$ will be small compared with $c$. However, as we move to the right, $n$ increases, and the second term will eventually dominate the first, because the first term is shrinking to zero as $n$ increases, because that's what the exact solution does, while the second term increases steadily in size. In fact, since $r_-(\tau)$ is negative and larger than 1 in absolute value, the second term will alternate in sign as $n$ increases, and grow without bound in magnitude.

In Table 4 below we show the result of integrating the problem $y' = -y$, $y(0) = 1$ with each of the three methods that we have discussed, using a step size of $h = 0.05$. To get the midpoint method started, we used the exact value of $y(0.05)$ (i.e., we cheated), and in the trapezoidal rule we iterated to convergence with $\epsilon = 10^{-4}$. The instability of the midpoint rule is quite apparent.
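The growth of the parasitic term can be watched directly. The sketch below applies the midpoint recurrence (2.6.14) to $y' = -y/L$ with the exact value used for $y_1$, and also evaluates the two roots (2.6.17). The function name `midpoint_decay`, the choice L = 1 and the particular steps printed are assumptions made for illustration.

```python
import math

def midpoint_decay(h=0.05, L=1.0, steps=316):
    """Apply the midpoint recurrence (2.6.14) to y' = -y/L, y(0) = 1.

    The second starting value y_1 is taken from the exact solution.
    Returns the list [y_0, y_1, ..., y_steps].
    """
    tau = h / L
    y = [1.0, math.exp(-tau)]
    for n in range(1, steps):
        y.append(y[n - 1] - 2.0 * tau * y[n])   # y_{n+1} = y_{n-1} - 2*tau*y_n
    return y

tau = 0.05
r_plus = -tau + math.sqrt(1.0 + tau * tau)    # close to exp(-tau)
r_minus = -tau - math.sqrt(1.0 + tau * tau)   # parasitic root, |r_minus| > 1

y = midpoint_decay()
for n in (20, 100, 200, 291, 316):
    x = n * 0.05
    print(f"x = {x:5.2f}  midpoint = {y[n]: .5f}  exact = {math.exp(-x):.2e}")
```

For small x the computed values track the decaying exponential, while for large x they alternate in sign and grow, which is the behavior predicted by the root $r_-(\tau)$.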

   x       Euler(x)        Trap(x)         Midpoint(x)    Exact(x)
  0.0      1.00000         1.00000          1.00000       1.00000
  1.0      0.35849         0.36780          0.36806       0.36788
  2.0      0.12851         0.13527          0.13552       0.13534
  3.0      0.04607         0.04975          0.05005       0.04979
  4.0      0.01652         0.01830          0.01888       0.01832
  5.0      0.00592         0.00673          0.00822       0.00674
 10.0      0.00004         0.00005          0.21688       0.00005
 14.55     3.3 × 10^-7     4.8 × 10^-7    -20.48          4.8 × 10^-7
 15.8      9.1 × 10^-8     1.4 × 10^-7     71.45          1.4 × 10^-7

Table 4

In addition to the above discussion of accuracy, we summarize here three additional properties of integration methods as they relate to the examples that we have already studied.

First, a numerical integration method might be iterative or noniterative. A method is noniterative if it expresses the next value of the unknown function quite explicitly in terms of values of the function and its derivatives at preceding points. In an iterative method, at each step of the solution process the next value of the unknown is defined implicitly by an equation, which must be solved to obtain the next value of the unknown function. In practice, we may either solve this equation completely by an iteration or do just one step of the iteration, depending on the quality of available estimates for the unknown value.

Second, a method is self-starting if the next value of the unknown function is obtained from the values of the function and its derivatives at exactly one previous point. It is not self-starting if values at more than one backward point are needed to get the next one. In the latter case some other method will have to be used to get the first few computed values.

Third, we can define a numerical method to be stable if when it is applied to the equation $y' = -y/L$, where $L > 0$, then for all sufficiently small positive values of the step size $h$ a stable difference equation results, i.e., the computed solution (neglecting roundoff) remains bounded as $n \to \infty$.

We summarize below the properties of the three methods that we have been studying.

               Iterative    Self-starting    Stable
 Euler            No            Yes            Yes
 Midpoint         No            No             No
 Trapezoidal      Yes           Yes            Yes

Of the three methods, the trapezoidal rule is clearly the best, though for efficient use it needs the help of some other formula to predict the next value of $y$ and thereby avoid lengthy iterations.

2.7 Predictor-corrector methods

The trapezoidal rule differs from the other two that we've looked at in that it does not explicitly tell us what the next value of the unknown function is, but instead gives us an equation that must be solved in order to find it. At first sight this seems like a nuisance, but in fact it is a boon, because it enables us to regulate the step size during the course of a calculation, as we will discuss in section 2.9.

Let's take a look at the process by which we refine a guessed value of $y_{n+1}$ to an improved value, using the trapezoidal formula

$$y_{n+1} = y_n + \frac{h}{2}\left(f(x_n, y_n) + f(x_{n+1}, y_{n+1})\right). \tag{2.7.1}$$

Suppose we let $y_{n+1}^{(k)}$ represent some guess to the value of $y_{n+1}$ that satisfies (2.7.1). Then the improved value $y_{n+1}^{(k+1)}$ is computed from

$$y_{n+1}^{(k+1)} = y_n + \frac{h}{2}\left(f(x_n, y_n) + f(x_{n+1}, y_{n+1}^{(k)})\right). \tag{2.7.2}$$

We want to find out about how rapidly the successive values $y_{n+1}^{(k)}$, $k = 1, 2, \ldots$ approach a limit, if at all. To do this, we rewrite equation (2.7.2), this time replacing $k$ by $k - 1$ to obtain

$$y_{n+1}^{(k)} = y_n + \frac{h}{2}\left(f(x_n, y_n) + f(x_{n+1}, y_{n+1}^{(k-1)})\right) \tag{2.7.3}$$

and then subtract (2.7.3) from (2.7.2) to get

$$y_{n+1}^{(k+1)} - y_{n+1}^{(k)} = \frac{h}{2}\left(f(x_{n+1}, y_{n+1}^{(k)}) - f(x_{n+1}, y_{n+1}^{(k-1)})\right). \tag{2.7.4}$$

Next we use the mean-value theorem on the difference of $f$ values on the right-hand side, yielding

$$y_{n+1}^{(k+1)} - y_{n+1}^{(k)} = \frac{h}{2}\,\frac{\partial f}{\partial y}\bigg|_{(x_{n+1},\,\eta)} \left(y_{n+1}^{(k)} - y_{n+1}^{(k-1)}\right), \tag{2.7.5}$$

where $\eta$ lies between $y_{n+1}^{(k)}$ and $y_{n+1}^{(k-1)}$.

From the above we see at once that the difference between two consecutive iterated values of $y_{n+1}$ will be $\frac{h}{2}\frac{\partial f}{\partial y}$ times the difference between the previous two iterated values. It follows that the iterative process will converge if $h$ is kept small enough so that $\frac{h}{2}\frac{\partial f}{\partial y}$ is less than 1 in absolute value. We refer to $\frac{h}{2}\frac{\partial f}{\partial y}$ as the local convergence factor of the trapezoidal rule.

If the factor is a lot less than 1 (and this can be assured by keeping $h$ small enough), then the convergence will be extremely rapid.

In actual practice, one uses an iterative formula together with another formula (the predictor) whose mission is to provide an intelligent first guess for the iterative method

to use. The predictor formula will be explicit, or noniterative. If the predictor formula is clever enough, then it will happen that just a single application of the iterative refinement (corrector formula) will be sufficient, and we won't have to get involved in a long convergence process.

If we use the trapezoidal rule for a corrector, for instance, then a clever predictor would be the midpoint rule. The reason for this will become clear if we look at both formulas together with their error terms. We will see in the next section that the error terms are as follows:

$$y_{n+1} = y_{n-1} + 2h y_n' + \frac{h^3}{3}\, y'''(X_m) \tag{2.7.6}$$

$$y_{n+1} = y_n + \frac{h}{2}\left(y_n' + y_{n+1}'\right) - \frac{h^3}{12}\, y'''(X_t). \tag{2.7.7}$$

Now the exact locations of the points $X_m$ and $X_t$ are unknown, but we will assume here that $h$ is small enough that we can regard the two values of $y'''$ that appear as being the same.

As far as the powers of $h$ that appear in the error terms go, we see that the third power occurs in both formulas. We say then, that the midpoint predictor and the trapezoidal corrector constitute a matched pair. The error in the trapezoidal rule is about one fourth as large as, and of opposite sign from, the error in the midpoint method.

The midpoint guess is therefore quite "intelligent". The subsequent iterative refinement of that guess needs to reduce the error only by a factor of four. Now let $y_P$ denote the midpoint predicted value, $y_{n+1}^{(1)}$ denote the first refined value, and $y_{n+1}$ be the final converged value given by the trapezoidal rule. Then we have

$$y_{n+1}^{(1)} = y_n + \frac{h}{2}\left(y_n' + f(x_{n+1}, y_P)\right)$$

$$y_{n+1} = y_n + \frac{h}{2}\left(y_n' + f(x_{n+1}, y_{n+1})\right) \tag{2.7.8}$$

and by subtraction

$$y_{n+1}^{(1)} - y_{n+1} = \frac{h}{2}\,\frac{\partial f}{\partial y}\left(y_P - y_{n+1}\right). \tag{2.7.9}$$

This shows that, however far from the converged value the first guess was, the refined value is $\frac{h}{2}\frac{\partial f}{\partial y}$ times closer. Hence if we can keep $\frac{h}{2}\frac{\partial f}{\partial y}$ no bigger than about 1/4, then the distance from the first refined value to the converged value will be no larger than the size of the error term in the method, so there would be little point in gilding the iteration any further.

The conclusion is that when we are dealing with a matched predictor-corrector pair, we need do only a single refinement of the corrector if the step size is kept moderately small. Furthermore, "moderately small" means that the step size times the local value of $\frac{\partial f}{\partial y}$ should be small compared to 1. For this reason, iteration to full convergence is rarely done in practice.
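The relation (2.7.9) can be observed on the test equation $y' = -y$, for which $\frac{h}{2}\frac{\partial f}{\partial y} = -h/2$ exactly. The Python sketch below takes one midpoint-predictor step followed by a single trapezoidal correction and compares it with the fully converged trapezoidal value. The function names `pc_step` and `trapezoid_converged`, the tolerance and the step size are illustrative assumptions, not taken from the text.

```python
import math

def pc_step(f, x_n, y_prev, y_n, h):
    """One midpoint-predictor / trapezoidal-corrector step, single refinement.

    y_prev and y_n are the values at x_n - h and x_n.
    Returns (predicted, corrected) values at x_n + h.
    """
    f_n = f(x_n, y_n)
    y_pred = y_prev + 2.0 * h * f_n                        # midpoint rule
    y_corr = y_n + 0.5 * h * (f_n + f(x_n + h, y_pred))    # one pass of the corrector
    return y_pred, y_corr

def trapezoid_converged(f, x_n, y_n, h, guess, tol=1e-14, max_iter=100):
    """Iterate the corrector (2.7.2) until successive values agree to tol."""
    f_n = f(x_n, y_n)
    y = guess
    for _ in range(max_iter):
        y_new = y_n + 0.5 * h * (f_n + f(x_n + h, y))
        if abs(y_new - y) < tol:
            return y_new
        y = y_new
    return y

def f(x, y):
    return -y                      # test equation y' = -y

h = 0.05
y0, y1 = 1.0, math.exp(-h)         # exact starting values at x = 0 and x = h
y_pred, y_once = pc_step(f, h, y0, y1, h)
y_full = trapezoid_converged(f, h, y1, h, y_pred)
ratio = (y_once - y_full) / (y_pred - y_full)
print(f"predicted error {y_pred - y_full:.3e}, after one refinement {y_once - y_full:.3e}")
print(f"ratio {ratio:.6f} versus (h/2) df/dy = {-h / 2:.6f}")
```

Because the factor here is only 0.025 in magnitude, a single refinement already leaves the value well within the size of the method's own error term.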

2.8 Truncation error and step size

We have so far regarded the step size $h$ as a silent partner, more often than not choosing it to be equal to 0.05, for no particular reason. It is evident, however, that the accuracy of the calculation is strongly affected by the step size. If $h$ is chosen too large, the computed solution may be quite far from the true solution of the differential equation, if too small then the calculation will become unnecessarily time-consuming, and roundoff errors may build up excessively because of the numerous arithmetic operations that are being carried out.

Speaking in quite general terms, if the true solution of the differential equation is rapidly changing, then we will need a small value of $h$, that is, small compared to the local relaxation length (see section 2.6), and if the solution changes slowly, then a larger value of $h$ will do.

Frequently in practice we deal with equations whose solutions change very rapidly over part of the range of integration and slowly over another part. Examples of this are provided by the study of the switching on of a complicated process, such as beginning a multi-stage chemical reaction, turning on a piece of electronic equipment, starting a power reactor, etc. In such cases there usually are rapid and ephemeral or "transient" phenomena that occur soon after startup, and that disappear quickly. If we want to follow these transients accurately, we may need to choose a very tiny step size. After the transients die out, however, the steady-state solution may be a very quiet, slowly varying or nearly constant function, and then a much larger value of $h$ will be adequate.

If we are going to develop software that will be satisfactory for such problems, then the program will obviously have to choose, and re-choose its own step size as the calculation proceeds. While following a rapid transient it should use a small mesh size, then it should gradually increase $h$ as the transient fades, use a large step while the solution is steady, decrease it again if further quick changes appear, and so forth, all without operator intervention.

Before we go ahead to discuss methods for achieving this step size control, let's observe that one technique is already available in the material of the previous section. Recall that if we want to, we can implement the trapezoidal rule by first guessing, or predicting, the unknown at the next point by using Euler's formula, and then correcting the guess to complete convergence by iteration.

The first guess will be relatively far away from the final converged value if the solution is rapidly varying, but if the solution is slowly varying, then the guess will be rather good. It follows that the number of iterations required to produce convergence is one measure of the appropriateness of the current value of the step size: if many iterations are needed, then the step size is too big. Hence one way to get some control on $h$ is to follow a policy of cutting the step size in half whenever more than, say, one or two iterations are necessary.

This suggestion is not sufficiently sensitive to allow doubling the step size when only one iteration is needed, however, and somewhat more delicacy is called for in that situation. Furthermore this is a very time-consuming approach since it involves a complete iteration to convergence, when in fact a single turn of the crank is enough if the step size is kept

small enough.

The discussion does, however, point to the fundamental idea that underlies the automatic control of step size during the integration. That basic idea is precisely that we can estimate the correctness of the step size by watching how well the first guess in our iterative process agrees with the corrected value. The correction process itself, when viewed this way, is seen to be a powerful ally of the software user, rather than the "pain in the neck" it seemed to be when we first met it.

Indeed, why would anyone use the cumbersome procedure of guessing and refining (i.e., prediction and correction) as we do in the trapezoidal rule, when many other methods are available that give the next value of the unknown immediately and explicitly? No doubt the question crossed the reader's mind, and the answer is now emerging. It will appear that not only does the disparity between the first prediction and the corrected value help us to control the step size, it actually can give us a quantitative estimate of the local error in the integration, so that if we want to, we can print out the approximate size of the error along with the solution.

Our next job will be to make these rather qualitative remarks into quantitative tools, so we must discuss the estimation of the error that we commit by using a particular difference approximation to a differential equation, instead of that equation itself, on one step of the integration process. This is the single-step truncation error. It does not tell us how far our computed solution is from the true solution, but only how much error is committed in a single step.

The easiest example, as usual, is Euler's method. In fact, in equation (2.1.2) we have already seen the single-step error of this method. That equation was

$$y(x_n + h) = y(x_n) + h y'(x_n) + \frac{h^2}{2}\, y''(X) \tag{2.8.1}$$

where $X$ lies between $x_n$ and $x_n + h$. In Euler's procedure, we drop the third term on the right, the "remainder term," and compute the solution from the rest of the equation. In doing this we commit a single-step truncation error that is equal to

$$E = \frac{h^2}{2}\, y''(X), \qquad x_n < X < x_n + h. \tag{2.8.2}$$

Thus, Euler's method is exact ($E = 0$) if the solution is a polynomial of degree 1 or less ($y'' = 0$). Otherwise, the single-step error is proportional to $h^2$, so if we cut the step size in half, the local error is reduced to 1/4 of its former value, approximately, and if we double $h$ the error is multiplied by about 4.

We could use (2.8.2) to estimate the error by somehow computing an estimate of $y''$. For instance, we might differentiate the differential equation $y' = f(x, y)$ once more, and compute $y''$ directly from the resulting formula. This is usually more trouble than it is worth, though, and we will prefer to estimate $E$ by more indirect methods.

Next we derive the local error of the trapezoidal rule. There are various special methods that might be used to do this, but instead we are going to use a very general method that

is capable of dealing with the error terms of almost every integration rule that we intend to study.

First, let's look a little more carefully at the capability of the trapezoidal rule, in the form

$$y_{n+1} - y_n - \frac{h}{2}\left(y_n' + y_{n+1}'\right) = 0. \tag{2.8.3}$$

Of course, this is a recurrence by means of which we propagate the approximate solution to the right. It certainly is not exactly true if $y_n$ denotes the value of the true solution at the point $x_n$ unless that true solution is very special. How special?

Suppose the true solution is $y(x) = 1$ for all $x$. Then (2.8.3) would be exactly valid. Suppose $y(x) = x$. Then (2.8.3) is again exactly satisfied, as the reader should check. Furthermore, if $y(x) = x^2$, then a brief calculation reveals that (2.8.3) holds once more. How long does this continue? Our run of good luck has just expired, because if $y(x) = x^3$ then (check this) the left side of (2.8.3) is not 0, but is instead $-h^3/2$.

We might say that the trapezoidal rule is exact on $1$, $x$, and $x^2$, but not $x^3$, i.e., that it is an integration rule of order two ("order" is an overworked word in differential equations). It follows by linearity that the rule is exact on any quadratic polynomial.

By way of contrast, it is easy to verify that Euler's method is exact for a linear function, but fails on $x^2$. Since the error term for Euler's method in (2.8.2) is of the form $\text{const} \cdot h^2 \cdot y''(X)$, it is perhaps reasonable to expect the error term for the trapezoidal rule to look like $\text{const} \cdot h^3 \cdot y'''(X)$.

Now we have two questions to handle, and they are respectively easy and hard:

(a) If the error term in the trapezoidal rule really is $\text{const} \cdot h^3 \cdot y'''(X)$, then what is "const"?

(b) Is it true that the error term is $\text{const} \cdot h^3 \cdot y'''(X)$?

We'll do the easy one first, anticipating that the answer to (b) is affirmative so the effort won't be wasted. If the error term is of the form stated, then the trapezoidal rule can be written as

$$y(x + h) - y(x) - \frac{h}{2}\left(y'(x) + y'(x + h)\right) = c \cdot h^3 \cdot y'''(X), \tag{2.8.4}$$

where $X$ lies between $x$ and $x + h$. To find $c$ all we have to do is substitute $y(x) = x^3$ into (2.8.4) and we find at once that $c = -1/12$. The single-step truncation error of the trapezoidal rule would therefore be

$$E = -\frac{h^3}{12}\, y'''(X), \qquad x < X < x + h. \tag{2.8.5}$$

Now let's see that question (b) has the answer "yes" so that (2.8.5) is really right.

To do this we start with a truncated Taylor series with the integral form of the remainder, rather than with the differential form of the remainder. In general the series is

$$y(x) = y(0) + x y'(0) + x^2\,\frac{y''(0)}{2!} + \cdots + x^n\,\frac{y^{(n)}(0)}{n!} + R_n(x) \tag{2.8.6}$$

where

$$R_n(x) = \frac{1}{n!}\int_0^x (x - s)^n\, y^{(n+1)}(s)\, ds. \tag{2.8.7}$$

Indeed, one of the nice ways to prove Taylor's theorem begins with the right-hand side of (2.8.7), plucked from the blue, and then repeatedly integrates by parts, lowering the order of the derivative of $y$ and the power of $(x - s)$ until both reach zero.

In (2.8.6) we choose $n = 2$, because the trapezoidal rule is exact on polynomials of degree 2, and we write it in the form

$$y(x) = P_2(x) + R_2(x) \tag{2.8.8}$$

where $P_2(x)$ is the quadratic (in $x$) polynomial $P_2(x) = y(0) + x y'(0) + x^2 y''(0)/2$.

Next we define a certain operation that transforms a function $y(x)$ into a new function, namely into the left-hand side of equation (2.8.4). We call the operator $L$ so it is defined by

$$Ly(x) = y(x + h) - y(x) - \frac{h}{2}\left(y'(x) + y'(x + h)\right). \tag{2.8.9}$$

Now we apply the operator $L$ to both sides of equation (2.8.8), and we notice immediately that $LP_2(x) = 0$, because the rule is exact on quadratic polynomials (this is why we chose $n = 2$ in (2.8.6)). Hence we have

$$Ly(x) = LR_2(x). \tag{2.8.10}$$

Notice that we have here a remainder formula for the trapezoidal rule. It isn't in a very satisfactory form yet, so we will now make it a bit more explicit by computing $LR_2(x)$.

First, in the integral expression (2.8.7) for $R_2(x)$ we want to replace the upper limit of the integral by $+\infty$. We can do this by writing

$$R_2(x) = \frac{1}{2!}\int_0^\infty H_2(x - s)\, y'''(s)\, ds \tag{2.8.11}$$

where $H_2(t) = t^2$ if $t > 0$ and $H_2(t) = 0$ if $t < 0$.

Now if we bear in mind the fact that the operator $L$ acts only on $x$, and that $s$ is a dummy variable of integration, we find that

$$LR_2(x) = \frac{1}{2!}\int_0^\infty LH_2(x - s)\, y'''(s)\, ds. \tag{2.8.12}$$

Choose $x = 0$, so that the step in question runs from $0$ to $h$. Then if $s$ lies between 0 and $h$ we find

$$LH_2(x - s) = (h - s)^2 - \frac{h}{2}\left(2(h - s)\right) = -s(h - s) \tag{2.8.13}$$

(Caution: Do not read past the right-hand side of the first equals sign unless you can verify the correctness of what you see there!), whereas if $s > h$ then $LH_2(x - s) = 0$.
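The claims about the operator $L$ of (2.8.9) can be confirmed with exact rational arithmetic. The Python sketch below applies $L$ to the monomials $1, x, x^2, x^3$ and to the truncated power $H_2(x - s)$, evaluated at $x = 0$, which is the evaluation point at which the formula $-s(h - s)$ of (2.8.13) holds for $L$ as defined in (2.8.9). The helper names `L_op` and `LH2` and the sample values of h, x and s are assumptions chosen for illustration.

```python
from fractions import Fraction

def L_op(y, dy, x, h):
    """The operator of (2.8.9): y(x+h) - y(x) - (h/2) * (y'(x) + y'(x+h))."""
    return y(x + h) - y(x) - h / 2 * (dy(x) + dy(x + h))

h = Fraction(1, 10)
x = Fraction(3, 7)              # arbitrary test point

# Monomials t^k together with their derivatives, k = 0, 1, 2, 3.
for k in range(4):
    y = lambda t, k=k: t ** k
    dy = lambda t, k=k: k * t ** (k - 1) if k > 0 else Fraction(0)
    print(f"L applied to x^{k}:", L_op(y, dy, x, h))

def LH2(s, h):
    """L applied to x -> H_2(x - s), evaluated at x = 0."""
    H2 = lambda t: t * t if t > 0 else Fraction(0)
    dH2 = lambda t: 2 * t if t > 0 else Fraction(0)
    return L_op(lambda t: H2(t - s), lambda t: dH2(t - s), Fraction(0), h)

s = Fraction(1, 30)             # a point with 0 < s < h
print("L H_2 at s = 1/30:", LH2(s, h), " and -s(h - s) =", -s * (h - s))
```

The first three monomials give zero, the cubic gives $-h^3/2$, and the truncated power reproduces $-s(h - s)$ for $0 < s < h$ and zero for $s > h$.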

Then (2.8.12) becomes

$$LR_2(0) = -\frac{1}{2}\int_0^h s(h - s)\, y'''(s)\, ds. \tag{2.8.14}$$

This is a much better form for the remainder, but we still do not have the "hard" question (b). To finish it off we need a form of the mean-value theorem of integral calculus, namely

Theorem 2.8.1 If $p(x)$ is nonnegative, and $g(x)$ is continuous, then

$$\int_a^b p(x) g(x)\, dx = g(X) \int_a^b p(x)\, dx \tag{2.8.15}$$

where $X$ lies between $a$ and $b$.

The theorem asserts that a weighted average of the values of a continuous function is itself one of the values of that function. The vital hypothesis is that the "weight" $p(x)$ does not change sign.

Now in (2.8.14), the function $s(h - s)$ does not change sign on the $s$-interval $(0, h)$, so

$$LR_2(0) = -\frac{1}{2}\, y'''(X) \int_0^h s(h - s)\, ds \tag{2.8.16}$$

and if we do the integral we obtain, finally,

Theorem 2.8.2 The trapezoidal rule with remainder term is given by the formula

$$y(x_{n+1}) - y(x_n) = \frac{h}{2}\left(y'(x_n) + y'(x_{n+1})\right) - \frac{h^3}{12}\, y'''(X), \tag{2.8.17}$$

where $X$ lies between $x_n$ and $x_{n+1}$.

The proof of this theorem involved some ideas that carry over almost unchanged to very general kinds of integration rules. Therefore it is important to make sure that you completely understand its derivation.

2.9 Controlling the step size

In equation (2.8.5) we saw that if we can estimate the size of the third derivative during the calculation, then we can estimate the error in the trapezoidal rule as we go along, and modify the step size $h$ if necessary, to keep that error within preassigned bounds.

To see how this can be done, we will quote, without proof, the result of a similar derivation for the midpoint rule. It says that

$$y(x_{n+1}) - y(x_{n-1}) = 2h y'(x_n) + \frac{h^3}{3}\, y'''(X), \tag{2.9.1}$$

where $X$ is between $x_{n-1}$ and $x_{n+1}$. Thus the midpoint rule is also of second order. If the step size were halved, the local error would be cut to one eighth of its former value. The error in the midpoint rule is, however, about four times as large as that in the trapezoidal formula, and of opposite sign.

Now suppose we adopt an overall strategy of predicting the value $y_{n+1}$ of the unknown by means of the midpoint rule, and then refining the prediction to convergence with the trapezoidal corrector. We want to estimate the size of the single-step truncation error, using only the following data, both of which are available during the calculation: (a) the initial guess, from the midpoint method, and (b) the converged corrected value, from the trapezoidal rule.

We begin by defining three different kinds of values of the unknown function at the "next" point $x_{n+1}$. They are

(i) the quantity $p_{n+1}$ is defined as the predicted value of $y(x_{n+1})$ obtained from using the midpoint rule, except that backwards values are not the computed ones, but are the exact ones instead. In symbols,

$$p_{n+1} = y(x_{n-1}) + 2h y'(x_n). \tag{2.9.2}$$

Of course, $p_{n+1}$ is not available during an actual computation.

(ii) the quantity $q_{n+1}$ is the value that we would compute from the trapezoidal corrector if for the backward value we use the exact solution $y(x_n)$ instead of the calculated solution $y_n$. Thus $q_{n+1}$ satisfies the equation

$$q_{n+1} = y(x_n) + \frac{h}{2}\left(f(x_n, y(x_n)) + f(x_{n+1}, q_{n+1})\right). \tag{2.9.3}$$

Again, $q_{n+1}$ is not available to us during calculation.

(iii) the quantity $y(x_{n+1})$, which is the exact solution itself. It satisfies two different equations, one of which is

$$y(x_{n+1}) = y(x_n) + \frac{h}{2}\left(f(x_n, y(x_n)) + f(x_{n+1}, y(x_{n+1}))\right) - \frac{h^3}{12}\, y'''(X) \tag{2.9.4}$$

and the other of which is (2.9.1). Note that the two $X$'s may be different.

Now, from (2.9.1) and (2.9.2) we have at once that

$$y(x_{n+1}) = p_{n+1} + \frac{h^3}{3}\, y'''(X). \tag{2.9.5}$$

Next, from (2.9.3) and (2.9.4) we get

$$y(x_{n+1}) = q_{n+1} + \frac{h}{2}\left(f(x_{n+1}, y(x_{n+1})) - f(x_{n+1}, q_{n+1})\right) - \frac{h^3}{12}\, y'''(X)$$

$$\phantom{y(x_{n+1})} = q_{n+1} + \frac{h}{2}\left(y(x_{n+1}) - q_{n+1}\right)\frac{\partial f}{\partial y}(x_{n+1}, Y) - \frac{h^3}{12}\, y'''(X) \tag{2.9.6}$$

where we have used the mean-value theorem, and $Y$ lies between $y(x_{n+1})$ and $q_{n+1}$. Now if we subtract $q_{n+1}$ from both sides, we will observe that $y(x_{n+1}) - q_{n+1}$ will then appear on both sides of the equation. Hence we will be able to solve for it, with the result that

$$y(x_{n+1}) = q_{n+1} - \frac{h^3}{12}\, y'''(X) + \text{terms involving } h^4. \tag{2.9.7}$$

Now let's make the working hypothesis that $y'''$ is constant over the range of values of $x$ considered, namely from $x_n - h$ to $x_n + h$. The $y'''(X)$ in (2.9.7) is thereby decreed to be equal to the $y'''(X)$ in (2.9.5), even though the $X$'s are different. Under this assumption, we can eliminate $y(x_{n+1})$ between (2.9.5) and (2.9.7) and obtain

$$q_{n+1} - p_{n+1} = \frac{5}{12}\, h^3 y''' + \text{terms involving } h^4. \tag{2.9.8}$$

We see that this expresses the unknown, but assumed constant, value of $y'''$ in terms of the difference between the initial prediction and the final converged value of $y(x_{n+1})$. Now we ignore the "terms involving $h^4$" in (2.9.8), solve for $y'''$, and then for the estimated single-step truncation error we have

$$\text{Error} = -\frac{h^3}{12}\, y''' \approx -\frac{1}{12}\cdot\frac{12}{5}\left(q_{n+1} - p_{n+1}\right) = -\frac{1}{5}\left(q_{n+1} - p_{n+1}\right). \tag{2.9.9}$$

The quantity $q_{n+1} - p_{n+1}$ is not available during the calculation, but as an estimator we can use the computed predicted value and the computed converged value, because these differ only in that they use computed, rather than exact backwards values of the unknown function. Hence, we have here an estimate of the single-step truncation error that we can conveniently compute, print out, or use to control the step size.

The derivation of this formula was of course dependent on the fact that we used the midpoint method for the predictor and the trapezoidal rule for the corrector. If we had used a different pair, however, the same argument would have worked, provided only that the error terms of the predictor and corrector formulas both involved the same derivative of $y$, i.e., both formulas were of the same order. Hence, "matched pairs" of predictor and corrector formulas, i.e., pairs in which both are of the same order, are most useful for carrying out extended calculations in which the local errors are continually monitored and the step size is regulated accordingly.

Let's pause to see how this error estimator would have turned out in the case of a general matched pair of predictor-corrector formulas, instead of just for the midpoint and trapezoidal rule combination. Suppose the predictor formula has an error term

$$y_{\text{exact}} - y_{\text{predicted}} = \lambda h^q y^{(q)}(X) \tag{2.9.10}$$

Suppose also that the error in the corrector formula is given by

$$y_{\mathrm{exact}} - y_{\mathrm{corrected}} = \mu h^{q} y^{(q)}(X). \tag{2.9.11}$$

Then a derivation similar to the one that we have just done will show that the estimator for the single-step error that is available during the progress of the computation is

$$\mathrm{Error} \approx \frac{\mu}{\lambda-\mu}\,\bigl(y_{\mathrm{corrected}} - y_{\mathrm{predicted}}\bigr). \tag{2.9.12}$$

With the midpoint predictor and the trapezoidal corrector we have $q = 3$, $\lambda = 1/3$ and $\mu = -1/12$, so the estimate reduces to one fifth of $y_{\mathrm{predicted}} - y_{\mathrm{corrected}}$.

In the table below we show the result of integrating the differential equation $y' = -y$ with $y(0) = 1$ using the midpoint and trapezoidal formulas with $h = 0.05$ as the predictor and corrector, as described above. The successive columns show $x$, the predicted value at $x$, the converged corrected value at $x$, the single-step error estimated from the approximation (2.9.9), and the actual single-step error obtained by computing

$$y(x_{n+1}) - y(x_n) - \frac{h}{2}\bigl(y'(x_n) + y'(x_{n+1})\bigr) \tag{2.9.13}$$

using the true solution $y(x) = e^{-x}$. The calculation was started by (cheating and) using the exact solution at 0.00 and 0.05.

   x      Pred(x)     Corr(x)     Errest(x)      Error(x)
  0.10    0.904877    0.904828     98 × 10^-7    94 × 10^-7
  0.15    0.860747    0.860690    113 × 10^-7    85 × 10^-7
  0.20    0.818759    0.818705    108 × 10^-7    77 × 10^-7
  0.25    0.778820    0.778768    102 × 10^-7    69 × 10^-7
  0.30    0.740828    0.740780     97 × 10^-7    61 × 10^-7
  0.35    0.704690    0.704644     93 × 10^-7    55 × 10^-7
  0.40    0.670315    0.670271     88 × 10^-7    48 × 10^-7
  0.45    0.637617    0.637575     84 × 10^-7    43 × 10^-7
  0.50    0.606514    0.606474     80 × 10^-7    37 × 10^-7
  ...
  0.95    0.386694    0.386669     51 × 10^-7     5 × 10^-7
  1.00    0.367831    0.367807     48 × 10^-7     3 × 10^-7

  table 5

Now that we have a simple device for estimating the single-step truncation error, namely by using one fifth of the distance between the first guess and the corrected value, we can regulate the step size so as to keep the error between preset limits. Suppose we would like to keep the single-step error in the neighborhood of $10^{-8}$, for instance. We might then adopt, say, $5 \times 10^{-8}$ as the upper limit of tolerable error and, for instance, $10^{-9}$ as the lower limit. Why should we have a lower limit? If the calculation is being done with more precision than necessary, the step size will be smaller than needed, and we will be wasting computer time as well as possibly building up roundoff error.
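The table can be reproduced with a few lines of code. The following is a minimal Python sketch, not the authors' program: the function names `midpoint_predict`, `trapezoid_correct` and `table_rows`, the convergence tolerance `tol` and the iteration cap `max_iter` are all assumptions made for illustration. It starts from the exact values at 0.00 and 0.05, predicts with the midpoint rule, iterates the trapezoidal rule to convergence, and reports one fifth of the difference as the error estimate.

```python
import math

def f(x, y):
    """Right-hand side of the test problem y' = -y."""
    return -y

def midpoint_predict(y_prev, y_curr, x_curr, h):
    # Midpoint rule: y_{n+1} = y_{n-1} + 2 h f(x_n, y_n).
    return y_prev + 2.0 * h * f(x_curr, y_curr)

def trapezoid_correct(y_curr, x_curr, h, guess, tol=1e-13, max_iter=100):
    # Iterate y_{n+1} = y_n + (h/2)(f(x_n, y_n) + f(x_{n+1}, y_{n+1}))
    # starting from `guess`, until successive iterates agree within tol.
    f_curr = f(x_curr, y_curr)
    y_next = guess
    for _ in range(max_iter):
        improved = y_curr + 0.5 * h * (f_curr + f(x_curr + h, y_next))
        if abs(improved - y_next) < tol:
            return improved
        y_next = improved
    return y_next

def table_rows(h=0.05, n_steps=19):
    """Return rows (x, predicted, corrected, estimated error) for x = 2h, 3h, ..."""
    xs = [0.0, h]
    ys = [1.0, math.exp(-h)]          # exact starting values, as in the text
    rows = []
    for n in range(1, n_steps + 1):
        pred = midpoint_predict(ys[n - 1], ys[n], xs[n], h)
        corr = trapezoid_correct(ys[n], xs[n], h, pred)
        xs.append(xs[n] + h)
        ys.append(corr)               # the corrected value is the one carried forward
        rows.append((xs[-1], pred, corr, (pred - corr) / 5.0))
    return rows

if __name__ == "__main__":
    for x, pred, corr, errest in table_rows():
        print(f"{x:4.2f}  {pred:.6f}  {corr:.6f}  {errest:.1e}")
```

Only the predicted, corrected and estimated-error columns are computed here; the sketch makes no attempt to reproduce the last column of the table.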

Now that we have fixed these limits we should be under no delusion that our computed solution will depart from the true solution by no more than $5 \times 10^{-8}$, or whatever. What we are controlling is the one-step truncation error, a lot better than nothing, but certainly not the same as the total accumulated error.

With the upper and lower tolerances set, we embark on the computation. First, since the midpoint method needs two backward values before it can be used, something special will have to be done to get the procedure moving at the start. This is typical of numerical solution of differential equations. Special procedures are needed at the beginning to build up enough computed values so that the predictor-corrector formulas can be used thereafter, or at least until the step size is changed.

In the present case, since we're going to use the trapezoidal corrector anyway, we might as well use the trapezoidal rule, unassisted by midpoint, with complete iteration to convergence, to get the value of $y$ at the first point $x_0 + h$ beyond the initial point $x_0$.

Now we have two consecutive values of $y$, and the generic calculation can begin. From any two consecutive values, the midpoint rule is used to predict the next value of $y$. This predicted value is also saved for future use in error estimation. The predicted value is then refined by the trapezoidal rule.

With the trapezoidal value in hand, the local error is then estimated by calculating one-fifth of the difference between that value and the midpoint guess.

If the absolute value of the local error lies between the preset limits $10^{-9}$ and $5 \times 10^{-8}$, then we just go on to the next step. This means that we augment $x$ by $h$, and move back the newer values of the unknown function to the locations that hold older values (we remember, at any moment, just two past values).

Otherwise, suppose the local error was too large. Then we must reduce the step size $h$, say by cutting it in half. When all this is done, some message should be printed out that announces the change, and then we should restart the procedure, with the new value of $h$, from the "farthest backward" value of $x$ for which we still have the corresponding $y$ in memory. One reason for this is that we may find out right at the outset that our very first choice of $h$ is too large, and perhaps it may need to be halved, say, three times before the errors are tolerable. Then we would like to restart each time from the same originally given data point $x_0$, rather than let the computation creep forward a little bit with step sizes that are too large for comfort.

Finally, suppose the local error was too small. Then we double the step size, print a message to that effect, and restart, again from the smallest possible value of $x$.

Now let's apply the philosophy of structured programming to see how the whole thing should be organized. We ask first for the major logical blocks into which the computation is divided. In this case we see

(i) a procedure midpt. Input to this procedure will be $x$, $h$, $y_{n-1}$, $y_n$. Output from it will be $y_{n+1}$ computed from the midpoint formula. No arrays are involved. The three values of $y$ in question occupy just three memory locations. The leading statement in this routine might be

    midpt:=proc(x,h,y0,ym1,n);

and its return statement would be return(yp1).

One might think that it is scarcely necessary to have a separate subroutine for such a simple calculation. The spirit of structured programming dictates otherwise. Someday one might want to change from the midpoint predictor to some other predictor. If organized as a subroutine, then it's quite easy to disentangle it from the program and replace it. This is the "modular" approach.

(ii) a procedure trapez. This routine will be called from two or three different places in the main routine: when starting, when restarting with a new value of $h$, and in a generic step, where it is the corrector formula. Operation of the routine is different, too, depending on the circumstances. When starting or restarting, there is no externally supplied guess to help it. It must find its own way to convergence. On a generic step of the integration, however, we want it to use the prediction supplied by midpt, and then just do a single correction.

One way to handle this is to use a logical variable start. If trapez is called with start = TRUE, then the subroutine would supply a final converged value without looking for any input guess. If start = FALSE, it would refine a guess that is supplied from outside. Suppose the first line of trapez is

    trapez:=proc(x,h,yin,yguess,n,start,eps);

and its return statement is return(yout). When called with start = TRUE, the routine might use yin as its first guess to yout, and iterate to convergence from there, using eps as its convergence criterion. If start = FALSE, it will take yguess as an estimate of yout, then use the trapezoidal rule just once, to refine the value as yout.

The combination of these two modules plus a small main program that calls them as needed constitutes the whole package. Each of the subroutines and the main routine should be heavily documented in a self-contained way. That is, the descriptions of trapez, and of midpt, should precisely explain their operation as separate modules, with their own inputs and outputs, quite independently of the way they happen to be used by the main program in this problem. It should be possible to unplug a module from this application and use it without change in another. The documentation of a procedure should never make reference to how it is used by a certain calling program, but should describe only how it transforms its own inputs into its own outputs.

In the next section we are going to take an intermission from the study of integration rules in order to discuss an actual physical problem, the flight of a spacecraft to the moon. This problem will need the best methods that we can find for a successful, accurate solution. Then in section 2.12 we'll return to the modules discussed above, and will display complete computer programs that carry them out. In the meantime, the reader might enjoy trying to write such a complete program now, and comparing it with the one in that section.
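For readers who want to try the exercise, here is one possible shape for the two modules and a small driver, written as a Python sketch rather than in Maple. It handles a single scalar equation only. The argument lists differ from the Maple headers quoted above (the right-hand side `f` is passed explicitly, and the parameter `n` is dropped), and the driver name `integrate`, the cap `max_iter` and the default tolerances are assumptions made for this illustration, not the authors' code.

```python
import math

def midpt(x, h, y_prev, y_curr, f):
    """Midpoint predictor: value at x + h from y_prev at x - h and y_curr at x."""
    return y_prev + 2.0 * h * f(x, y_curr)

def trapez(x, h, y_in, y_guess, start, eps, f, max_iter=100):
    """Trapezoidal rule for the step from x to x + h, given y_in at x.

    start=True : ignore y_guess; iterate to convergence (tolerance eps)
                 using y_in as the first guess.
    start=False: apply the rule exactly once to refine y_guess.
    """
    f_in = f(x, y_in)
    if not start:
        return y_in + 0.5 * h * (f_in + f(x + h, y_guess))
    y_out = y_in
    for _ in range(max_iter):
        y_new = y_in + 0.5 * h * (f_in + f(x + h, y_out))
        if abs(y_new - y_out) < eps:
            return y_new
        y_out = y_new
    return y_out

def integrate(f, x0, y0, x_end, h, lo=1e-9, hi=5e-8, eps=1e-12):
    """Advance until x >= x_end; return (x, y, h) at the newest computed point."""
    x_old, y_old = x0, y0        # farthest-backward point still in memory
    x_cur = y_cur = None         # newer point; None means a (re)start is needed
    while x_cur is None or x_cur < x_end:
        if x_cur is None:
            # Start or restart: trapezoidal rule alone, iterated to convergence.
            y_cur = trapez(x_old, h, y_old, None, True, eps, f)
            x_cur = x_old + h
            continue
        pred = midpt(x_cur, h, y_old, y_cur, f)
        corr = trapez(x_cur, h, y_cur, pred, False, eps, f)
        err = abs(corr - pred) / 5.0       # one fifth of predictor-corrector gap
        if err > hi:
            h, x_cur = h / 2.0, None       # too large: halve, restart from x_old
        elif err < lo:
            h, x_cur = 2.0 * h, None       # too small: double, restart from x_old
        else:
            x_old, y_old = x_cur, y_cur    # accept: move newer values back
            x_cur, y_cur = x_cur + h, corr
    return x_cur, y_cur, h

if __name__ == "__main__":
    x, y, h = integrate(lambda x, y: -y, 0.0, 1.0, 1.0, 0.05)
    print(x, y, math.exp(-x), h)
```

The messages that the text asks for when the step size changes are omitted to keep the sketch short; a print statement in each of the two restart branches would supply them.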

2.10 Case study: Rocket to the moon

Now we have a reasonably powerful apparatus for integration of initial-value problems and systems, including the automatic regulation of step size and built-in error estimation. In order to try out this software on a problem that will use all of its capability, in this section we are going to derive the differential equations that govern the flight of a rocket to the moon. We do this first in a one-dimensional model, and then in two dimensions. It will be very useful to have these equations available for testing various proposed integration methods. Great accuracy will be needed, and the ability to change the step size, both to increase it and to decrease it, will be essential, or else the computation will become intolerably long. The variety of solutions that can be obtained is quite remarkable.

First, in the one-dimensional simplified model, we place the center of the earth at the origin of the $x$-axis, and let $R$ denote the earth's radius. At the point $x = D$ we place the moon, and we let its radius be $r$. Finally, at a position $x = x(t)$ is our rocket, making its way towards the moon.

[Figure 2.2: 1D Moon Rocket. The $x$-axis with the Earth centered at $x = 0$ and its surface at $x = R$, the Rocket at $x(t)$, and the Moon centered at $x = D$ with its surface at $x = D - r$.]

We will use Newton's law of gravitation to find the net gravitational force on the rocket, and equate it to the mass of the rocket times its acceleration (Newton's second law of motion). According to Newton's law of gravitation, the gravitational force exerted by one body on another is proportional to the product of their masses and inversely proportional to the square of the distance between them. If we use $K$ for the constant of proportionality, then the force on the rocket due to the earth is

$$-K\,\frac{M_E m}{x^2}, \tag{2.10.1}$$

whereas the force on the rocket due to the moon's gravity is

$$K\,\frac{M_M m}{(D-x)^2}, \tag{2.10.2}$$

where $M_E$, $M_M$ and $m$ are, respectively, the masses of the earth, the moon and the rocket.

The acceleration of the rocket is of course $x''(t)$, and so the assertion that the net force is equal to mass times acceleration takes the form:

$$m x'' = -K\,\frac{M_E m}{x^2} + K\,\frac{M_M m}{(D-x)^2}. \tag{2.10.3}$$

This is a (nasty) differential equation of second order in the unknown function $x(t)$, the position of the rocket at time $t$. Note the nonlinear way in which this unknown function appears on the right-hand side.

A second-order differential equation deserves two initial values, and we will oblige. First, let's agree that at time $t = 0$ the rocket was on the surface of the earth, and second, that the rocket was fired at the moon with a certain initial velocity $V$. Hence, the initial conditions that go with (2.10.3) are

$$x(0) = R; \qquad x'(0) = V. \tag{2.10.4}$$

Now, just a quick glance at (2.10.3) shows that $m$ cancels out, so let's remove it, but not before pointing out the immense significance of that fact. It implies that the motion of the rocket is independent of its mass. For performing a now-legendary experiment with rocks of different sizes dropping from the Tower of Pisa, Galileo demonstrated that fact to an incredulous world. At any rate, (2.10.3) now reads as

$$x'' = -\frac{K M_E}{x^2} + \frac{K M_M}{(D-x)^2}. \tag{2.10.5}$$

We can make this equation a good bit prettier by changing the units of distance and time from miles and seconds (or whatever) to a set of more natural units for the problem.

For our unit of distance we choose $R$, the radius of the earth. If we divide (2.10.5) through by $R$, we can write the result as

$$\left(\frac{x}{R}\right)'' = -\frac{K M_E / R^3}{(x/R)^2} + \frac{K M_M / R^3}{\left(\frac{D}{R} - \frac{x}{R}\right)^2}. \tag{2.10.6}$$

Now instead of the unknown function $x(t)$, we define $y(t) = x(t)/R$. Then $y(t)$ is the position of the rocket, expressed in earth radii, at time $t$. Further, the ratio $D/R$ that occurs in (2.10.6) is a dimensionless quantity, whose numerical value is about 60. Hence (2.10.6) has now been transformed to

$$y'' = -\frac{K M_E / R^3}{y^2} + \frac{K M_M / R^3}{(60-y)^2}. \tag{2.10.7}$$

Next we tackle the new time units. Since $y$ is now dimensionless, the dimension of the left side of the equation is the reciprocal of the square of a time. If we look next at the first term on the right, which of course has the same dimension, we see that the quantity $R^3/(K M_E)$ is the square of a time, so

$$T_0 = \sqrt{\frac{R^3}{K M_E}} \tag{2.10.8}$$

is a time. Its numerical value is easier to calculate if we change the formula first, as follows. Consider a body of mass $m$ on the surface of the earth. Its weight is the magnitude of the force exerted on it by the earth's gravity, namely $K M_E m / R^2$. Its weight is also equal to $m$ times the acceleration of the body, namely the acceleration due to gravity, usually denoted by $g$, and having the value 32.2 feet/sec$^2$.

It follows that

$$\frac{K M_E m}{R^2} = m g, \tag{2.10.9}$$

and if we substitute into (2.10.8) we find that our time unit is

$$T_0 = \sqrt{\frac{R}{g}}. \tag{2.10.10}$$

We take $R = 4000$ miles, and find $T_0$ is about 13 minutes and 30 seconds.

We propose to measure time in units of $T_0$. To that end, we multiply through equation (2.10.7) by $T_0^2$ and get

$$T_0^2\, y'' = -\frac{1}{y^2} + \frac{M_M / M_E}{(60-y)^2}. \tag{2.10.11}$$

The ratio of the mass $M_M$ of the moon to the mass $M_E$ of the earth is about 0.012. Furthermore, we will now introduce a new independent variable $\tau$ and a new dependent variable $u = u(\tau)$ by the relations

$$u(\tau) = y(\tau T_0); \qquad t = \tau T_0. \tag{2.10.12}$$

Thus, $u(\tau)$ represents the position of the rocket, measured in units of the radius of the earth, at a time $\tau$ that is measured in units of $T_0$, i.e., in units of 13.5 minutes.

The substitution of (2.10.12) into (2.10.11) yields the differential equation for the scaled distance $u(\tau)$ as a function of the scaled time $\tau$ in the form

$$u'' = -\frac{1}{u^2} + \frac{0.012}{(60-u)^2}. \tag{2.10.13}$$

Finally we must translate the initial conditions (2.10.4) into conditions on the new variables. The first condition is easy: $u(0) = 1$. Next, if we differentiate (2.10.12) and set $\tau = 0$ we get

$$u'(0) = \frac{T_0 V}{R} = \frac{V}{R/T_0}. \tag{2.10.14}$$

This is a ratio of two velocities. In the numerator is the velocity with which the rocket is launched. What is the significance of the velocity $R/T_0$? We claim that it is, aside from a numerical factor, the escape velocity from the earth, if there were no moon.

Perhaps the quickest way to see this is to go back to equation (2.10.11) and drop the second term on the right-hand side (the one that comes from the moon). Then we will be looking at the differential equation that would govern the motion if the moon were absent. This equation can be solved. Multiply both sides by $2y'$, and it becomes

$$T_0^2 \left((y')^2\right)' = \left(\frac{2}{y}\right)', \tag{2.10.15}$$

and integration yields

$$T_0^2\, (y')^2 = \frac{2}{y} + C. \tag{2.10.16}$$

Now let $t = 0$ and find that $C = T_0^2 V^2 / R^2 - 2$, so

$$T_0^2\, (y')^2 = \frac{2}{y} + \left(\frac{T_0^2 V^2}{R^2} - 2\right). \tag{2.10.17}$$

Suppose the rocket is launched with sufficient initial velocity to escape from the earth. Then the function $y(t)$ will grow without bound. Hence let $y \to \infty$ on the right side of (2.10.17). For all values of $y$, the left side is a square, and therefore a nonnegative quantity. Hence the right side, which approaches the constant $C$, must also be nonnegative. Thus $C \ge 0$ or, equivalently,

$$V \ge \sqrt{2}\,\frac{R}{T_0}. \tag{2.10.18}$$

Thus, if the rocket escapes, then (2.10.18) is true, and the converse is easy to show also. Hence the quantity $\sqrt{2}\,R/T_0$ is the escape velocity from the earth. We shall denote it by $V_{esc}$. Its numerical value is approximately 25,145 miles per hour.

Now we can return to (2.10.14) to translate the initial conditions on $x'(t)$ into initial conditions on $u'(\tau)$. In terms of the escape velocity, it becomes $u'(0) = \sqrt{2}\, V / V_{esc}$. We might say that if we choose to measure distance in units of earth radii, and time in units of $T_0$, then velocities turn out to be measured in units of escape velocity, aside from the $\sqrt{2}$.

In summary, the differential equation and the initial conditions have the final form

$$u'' = -\frac{1}{u^2} + \frac{0.012}{(60-u)^2}, \qquad u(0) = 1, \qquad u'(0) = \sqrt{2}\,\frac{V}{V_{esc}}. \tag{2.10.19}$$

Since that was all so easy, let's try the two-dimensional case next. Here, the earth is centered at the origin of the $xy$-plane, and the moon is moving. Let the coordinates of the moon at time $t$ be $(x_m(t), y_m(t))$. For example, if we take the orbit of the moon to be a circle of radius $D$, then we would have $x_m(t) = D\cos(\omega t)$ and $y_m(t) = D\sin(\omega t)$.

If we put the rocket at a generic position $(x(t), y(t))$ on the way to the moon, then we have the configuration shown in figure 2.3. Consider the net force on the rocket in the $x$ direction. It is given by

$$F_x = -\frac{K M_E m \cos\theta}{x^2 + y^2} + \frac{K M_M m \cos\psi}{(x - x_m)^2 + (y - y_m)^2}, \tag{2.10.20}$$

where the angles $\theta$ and $\psi$ are shown in figure 2.3. From that figure, we see that

$$\cos\theta = \frac{x}{\sqrt{x^2 + y^2}} \tag{2.10.21}$$

and

$$\cos\psi = \frac{x_m - x}{\sqrt{(x_m - x)^2 + (y_m - y)^2}}. \tag{2.10.22}$$

[Figure 2.3: The 2D Moon Rocket. The Earth at the origin, the rocket at $(x(t), y(t))$ and the Moon at $(x_M(t), y_M(t))$, with the angles $\theta$ and $\psi$ marked.]

Now we substitute into (2.10.20), and equate the force in the $x$ direction to $m x''(t)$, to obtain the differential equation

$$m x''(t) = -\frac{K M_E m\, x}{(x^2 + y^2)^{3/2}} + \frac{K M_M m\,(x_m - x)}{\left((x_m - x)^2 + (y_m - y)^2\right)^{3/2}}. \tag{2.10.23}$$

If we carry out a similar analysis for the $y$-component of the force on the rocket, we get

$$m y''(t) = -\frac{K M_E m\, y}{(x^2 + y^2)^{3/2}} + \frac{K M_M m\,(y_m - y)}{\left((x_m - x)^2 + (y_m - y)^2\right)^{3/2}}. \tag{2.10.24}$$

We are now looking at two (even nastier) simultaneous differential equations of the second order in the two unknown functions $x(t)$, $y(t)$ that describe the position of the rocket. To go with these equations, we need four initial conditions. We will suppose that at time $t = 0$, the rocket is on the earth's surface, at the point $(R, 0)$. Further, at time $t = 0$, it will be fired with an initial speed of $V$, in a direction that makes an angle $\alpha$ with the positive $x$-axis. Thus, our initial conditions are

$$x(0) = R; \quad y(0) = 0; \quad x'(0) = V\cos\alpha; \quad y'(0) = V\sin\alpha. \tag{2.10.25}$$

The problem has now been completely defined. It remains to change the units into the same natural dimensions of distance and time that were used in the one-dimensional problem. This time we leave the details to the reader, and give only the results. If $u(\tau)$ and $v(\tau)$ denote the $x$ and $y$ coordinates of the rocket, measured in units of earth radii, at a time $\tau$ measured in units of $T_0$ (see (2.10.10)), then it turns out that $u$ and $v$ satisfy the differential equations

$$u'' = -\frac{u}{(u^2 + v^2)^{3/2}} + \frac{0.012\,(u_m - u)}{\left((u_m - u)^2 + (v_m - v)^2\right)^{3/2}}$$
$$v'' = -\frac{v}{(u^2 + v^2)^{3/2}} + \frac{0.012\,(v_m - v)}{\left((u_m - u)^2 + (v_m - v)^2\right)^{3/2}}. \tag{2.10.26}$$

Furthermore, the initial data (2.10.25) take the form

$$u(0) = 1; \quad v(0) = 0; \quad u'(0) = \sqrt{2}\,\frac{V\cos\alpha}{V_{esc}}; \quad v'(0) = \sqrt{2}\,\frac{V\sin\alpha}{V_{esc}}. \tag{2.10.27}$$


In these equations, the functions $u_m(\tau)$ and $v_m(\tau)$ are the $x$ and $y$ coordinates of the moon, in units of $R$, at the time $\tau$. Just to be specific, let's decree that the moon is in a circular orbit of radius $60R$, and that it completes a revolution every twenty-eight days. Then, after a brief session with a hand calculator or a computer, we discover that the equations

$$u_m(\tau) = 60\cos(0.002103745\,\tau), \qquad v_m(\tau) = 60\sin(0.002103745\,\tau) \tag{2.10.28}$$

represent the position of the moon.
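The numerical constants of this section are easy to check by machine. The Python sketch below is an illustration only: the names `t0_seconds`, `v_esc_mph`, `omega`, `moon_position` and `rocket_rhs` are hypothetical, and the conversion factor of 5280 feet per mile is the one ingredient not stated in the text. It recomputes $T_0$ from (2.10.10), the escape velocity $\sqrt{2}\,R/T_0$, and the moon's angular rate (which comes out to the quoted value when $T_0$ is rounded to 13.5 minutes), and then codes the right-hand sides of (2.10.26) as a first-order system in $(u, v, u', v')$.

```python
import math

FEET_PER_MILE = 5280.0   # assumption: standard statute mile
R_MILES = 4000.0         # earth radius used in the text
G_FT_S2 = 32.2           # acceleration due to gravity, feet/sec^2
MASS_RATIO = 0.012       # M_M / M_E
MOON_DISTANCE = 60.0     # D / R

# Time unit T0 = sqrt(R / g), equation (2.10.10), in seconds.
t0_seconds = math.sqrt(R_MILES * FEET_PER_MILE / G_FT_S2)

# Escape velocity sqrt(2) * R / T0, in miles per hour.
v_esc_mph = math.sqrt(2.0) * R_MILES / (t0_seconds / 3600.0)

# Angular rate of the moon per unit of scaled time: one revolution in
# 28 days, with T0 rounded to 13.5 minutes as in the text.
omega = 2.0 * math.pi / (28 * 24 * 60 / 13.5)

def moon_position(tau):
    """Coordinates (u_m, v_m) of the moon from (2.10.28)."""
    return (MOON_DISTANCE * math.cos(omega * tau),
            MOON_DISTANCE * math.sin(omega * tau))

def rocket_rhs(tau, state):
    """state = (u, v, u', v'); returns (u', v', u'', v'') using (2.10.26)."""
    u, v, du, dv = state
    um, vm = moon_position(tau)
    r_earth_cubed = (u * u + v * v) ** 1.5
    r_moon_cubed = ((um - u) ** 2 + (vm - v) ** 2) ** 1.5
    ddu = -u / r_earth_cubed + MASS_RATIO * (um - u) / r_moon_cubed
    ddv = -v / r_earth_cubed + MASS_RATIO * (vm - v) / r_moon_cubed
    return (du, dv, ddu, ddv)

if __name__ == "__main__":
    print("T0 in minutes:", t0_seconds / 60.0)
    print("escape velocity in mph:", v_esc_mph)
    print("omega:", omega)
    print("rhs at launch:", rocket_rhs(0.0, (1.0, 0.0, 0.0, 0.0)))
```

At $\tau = 0$ the moon lies on the positive $x$-axis, so with the rocket at $(1, 0)$ the $u''$ component agrees with the one-dimensional right-hand side in (2.10.13), which is a convenient consistency check on the two-dimensional formulas.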

2.11 Maple programs for the trapezoidal rule

In this section we will first display complete Maple programs that can numerically solve a system of ordinary differential equations of the first order together with given initial values. After discussing those programs, we will illustrate their operation by doing the numerical solution of the one-dimensional moon rocket problem. We will employ Euler's method to predict the values of the unknowns at the next point x + h from their values at x, and then we will apply the trapezoidal rule to correct these predicted values until sufficient convergence has occurred. First, here is the program that does the Euler method prediction.

> eulermethod:=proc(yin,x,h,f)
> local yout,ll,i:
> # Given the array yin of unknowns at x, uses Euler method to return
> # the array of values of the unknowns at x+h. The function f(x,y) is
> # the array-valued right hand side of the given system of ODE's.
>   ll:=nops(yin):
>   yout:=[]:
>   for i from 1 to ll do
>     yout:=[op(yout),yin[i]+h*f(x,yin,i)];
>   od:
>   RETURN(yout):
> end:

Next, here is the program that takes as input an array of guessed values of the unknowns at x + h and reﬁnes the guess to convergence using the trapezoidal rule.

> traprule:=proc(yin,x,h,eps,f)
> local ynew,yfirst,ll,toofar,yguess,i,allnear,dist;
> # Input is the array yin of values of the unknowns at x. The program
> # first calls eulermethod to obtain the array ynew of guessed values
> # of y at x+h. It then refines the guess repeatedly, using the trapezoidal
> # rule, until the previous guess, yguess, and the refined guess, ynew, agree
> # within a tolerance of eps in all components. Program then computes dist,
> # which is the largest deviation of any component of the final converged
> # solution from the initial Euler method guess. If dist is too large
> # the mesh size h should be decreased; if too small, h should be increased.
>   ynew:=eulermethod(yin,x,h,f);
>   yfirst:=ynew;
>   ll:=nops(yin);
>   allnear:=false;
>   while(not allnear) do
>     yguess:=ynew;
>     ynew:=[];
>     for i from 1 to ll do
>       ynew:=[op(ynew),yin[i]+(h/2)*(f(x,yin,i)+f(x+h,yguess,i))];
>     od;
>     allnear:=true;
>     for i from 1 to ll do allnear:=allnear and abs(ynew[i]-yguess[i])<eps od:
>   od; #end while
>   dist:=max(seq(abs(ynew[i]-yfirst[i]),i=1..ll));
>   RETURN([dist,ynew]):
> end:

The two programs above each operate at a single point x and seek to compute the unknowns at the next point x + h. Now we need a global view, that is, a program that will call the above repeatedly and increment the value of x until the end of the desired range of x. The global routine also needs to check whether or not the mesh size h needs to be changed at each point and to do so when necessary.

> trapglobal:=proc(f,y0,h0,xinit,xfinal,eps,nprint)
> local x,y,y1,h,j,arr,dst,cnt:
> # Finds solution of the ODE system y'=f(x,y), where y is an array
> # and f is array-valued. y0 is initial data array at x=xinit.
> # Halts when x>xfinal. eps is convergence criterion for
> # trapezoidal rule; Prints every nprint-th value that is computed.
>   x:=xinit:y:=y0:arr:=[[x,y[1]]]:h:=h0:cnt:=0:
>   while x<=xfinal do
>     y1:=traprule(y,x,h,eps,f):
>     y:=y1[2]:dst:=y1[1];
>     # Is dst too large? If so, halve the mesh size h and repeat.
>     while dst>3*eps do
>       h:=h/2; lprint(`At x=`,x,`h was reduced to`,h);
>       y1:=traprule(y,x,h,eps,f):
>       y:=y1[2]:dst:=y1[1];
>     od:
>     # Is dst too small? If so, double the mesh size h and repeat.
>     while dst<.0001*eps do
>       h:=2*h; lprint(`At x=`,x,`h was increased to`,h);
>       y1:=traprule(y,x,h,eps,f):
>       y:=y1[2]:dst:=y1[1];
>     od:
>     # Adjoin newly computed values to the output array arr.
>     x:=x+h; arr:=[op(arr),[x,y[2]]]:
>     # Decide if we should print this line of output or not.
>     cnt:=cnt+1: if cnt mod nprint =0 or x>=xfinal then print(x,y) fi;
>   od:
>   RETURN(arr);
> end:

The above three programs comprise a general package that can numerically solve systems of ordinary differential equations. The applicability of the package is limited mainly by the fact that Euler's method and the trapezoidal rule are fairly primitive approximations to the truth, and therefore one should not expect dazzling accuracy when these routines are used over long intervals of integration.
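For comparison, the same three-layer organization can be sketched in Python. This is a loose analogue and not a translation of the Maple package: the names `euler_step`, `trap_step` and `solve`, the caps `max_iter` and `h_max`, and the convention that `f(x, y)` returns the whole list of right-hand sides are assumptions of the sketch. It also makes one design choice explicit: every retry with a changed mesh size is taken from the last accepted point `(x, y)`, and the accepted state is overwritten only after the step size has been settled.

```python
import math

def euler_step(f, x, y, h):
    """Euler prediction of all unknowns at x + h."""
    return [yi + h * fi for yi, fi in zip(y, f(x, y))]

def trap_step(f, x, y, h, eps, max_iter=50):
    """Refine the Euler guess with the trapezoidal rule until successive
    iterates agree within eps in every component. Returns (dist, y_new),
    where dist is the largest deviation from the initial Euler guess."""
    first = euler_step(f, x, y, h)
    f_here = f(x, y)
    guess = first
    for _ in range(max_iter):
        f_there = f(x + h, guess)
        new = [yi + 0.5 * h * (a + b) for yi, a, b in zip(y, f_here, f_there)]
        if all(abs(n - g) < eps for n, g in zip(new, guess)):
            break
        guess = new
    dist = max(abs(n - e) for n, e in zip(new, first))
    return dist, new

def solve(f, y0, h, x0, x_end, eps, h_max=None):
    """Integrate from x0 until x >= x_end; returns a list of (x, y) pairs."""
    if h_max is None:
        h_max = x_end - x0               # keeps the doubling loop finite
    x, y = x0, list(y0)
    out = [(x, y)]
    while x < x_end:
        dist, y_trial = trap_step(f, x, y, h, eps)
        while dist > 3 * eps:            # too large: halve h, retry from (x, y)
            h /= 2.0
            dist, y_trial = trap_step(f, x, y, h, eps)
        while dist < 1e-4 * eps and h < h_max:   # too small: double h, retry
            h *= 2.0
            dist, y_trial = trap_step(f, x, y, h, eps)
        x, y = x + h, y_trial            # accept the step
        out.append((x, y))
    return out

if __name__ == "__main__":
    # y1' = y2, y2' = -y1 with y1(0) = 1, y2(0) = 0, so that y1 = cos x.
    path = solve(lambda x, u: [u[1], -u[0]], [1.0, 0.0],
                 math.pi / 100, 0.0, 2 * math.pi, 1e-4)
    x_last, y_last = path[-1]
    print(x_last, y_last, math.cos(x_last))
```

The thresholds 3*eps and .0001*eps are the ones used in trapglobal; the printing of intermediate values and of step-size messages is left out.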

2.11.1 Example: Computing the cosine function

We will now give two examples of the operation of the above programs. First let's compute the cosine function. We will numerically integrate the equation $y'' + y = 0$ with initial conditions $y(0) = 1$ and $y'(0) = 0$ over the range from $x = 0$ to $x = 2\pi$. To do this we use two unknown functions $y_1$, $y_2$ which are subject to the equations $y_1' = y_2$ and $y_2' = -y_1$, together with initial values $y_1(0) = 1$, $y_2(0) = 0$. Then the function $y_1(x)$ will be the cosine function.

To use the routines above, we need to program the function $f = f(x, y)$ that gives the right hand sides of the input ODE's. This is done as follows:

> f:=proc(x,u,j)
>   if j=1 then RETURN(u[2]) else RETURN(-u[1]) fi:
> end:

That's all. To run the programs we type the line

> trapglobal(f,[1,0],.031415927,0,6.3,.0001,50):

This means that we want to solve the system whose right hand sides are as given by f, with initial condition array [1, 0]. The initial choice of the mesh size h is $\pi/100$, in order to facilitate the comparison of the values that will be output with those of the cosine function at the same points. The range of x over which we are asking for the solution is from x = 0 to x = 6.3. Our convergence criterion eps is set to .0001, and we are asking the program to print every 50th line of output that it computes. Maple responds to this call with the following output.

At x= 0 h was reduced to .1570796350e-1
.7853981750, [ .6845520546, -.7289635418]
1.570796368, [-.0313962454, -.9995064533]
2.356194568, [-.7289550591, -.6845604434]
3.141592768, [-.9995063863, .0313843153]
3.926990968, [-.6845688318, .7289465759]
4.712389168, [ .0313723853, .9995063195]
5.497787368, [ .7289380928, .6845772205]
6.283185568, [ .9995062524, -.0313604551]

We observe that the program first decided that the value of h that we gave it was too big and it was cut in half. Next we see that the accuracy is pretty good. At $x = \pi$ we have 3 or 4 correct digits, and we still have them at $x = 2\pi$. The values of the negative of the sine function, which are displayed above as the second column of unknowns, are less accurate at those points. To improve the accuracy we might try reducing the error tolerance eps, but realistically we will have to confess that the major source of imprecision lies in the Euler and trapezoidal combination itself, which, although it provides a good introduction to the philosophy of these methods, is too crude to yield results of great accuracy over a long range of integration.

2.11.2 Example: The moon rocket in one dimension

As a second example we will run the moon rocket in one dimension. The equations that we're solving now are given by (2.10.19). So all we need to do is to program the right hand sides f, which we do as follows.

> f:=proc(x,u,j)
>   if j=1 then RETURN(u[2]) else RETURN(-1/u[1]^2+.012/(60-u[1])^2) fi:
> end:

Then we invoke our routines by the statement

> trapglobal(f,[1,1.4142],.02,0,250,.0001,1000):

This means that we are solving our system with initial data $(1, \sqrt{2})$, with an initial mesh size of h = .02, integrating over the range of time from t = 0 until t = 250, with a convergence tolerance eps of .0001, and printing the output every 1000 lines. We inserted an extra line of program also, so as to halt the calculation as soon as y[1], the distance from earth, reached 59.75, which is the surface of the moon.

Maple responded with the following output.

At x= 0 h was reduced to .1000000000e-1
10.00000000, [7.911695068, .5027323920]
20.00000000, [12.36222569, .4022135344]
30.00000000, [16.11311118, .3523608113]
40.00000000, [19.46812426, .3206445364]
50.00000000, [22.55572225, .2979916199]
60.00000000, [25.44558364, .2806820525]
70.00000000, [28.18089771, .2668577906]
80.00000000, [30.79081775, .2554698522]
90.00000000, [33.29624630, .2458746369]
100.0000000, [35.71287575, .2376534139]
110.0000000, [38.05293965, .2305225677]
120.0000000, [40.32630042, .2242857188]
130.0000000, [42.54117736, .2188074327]
140.0000000, [44.70467985, .2139997868]

150.0000000, [46.82325670, .2098188816]
160.0000000, [48.90316649, .2062733168]
170.0000000, [50.95113244, .2034559376]
At x= 174.2300000 h was increased to .2000000000e-1
At x= 183.7300000 h was increased to .4000000000e-1
188.0900000, [54.61091112, .2014073390]
At x= 188.4500000 h was reduced to .2000000000e-1
211.8100000, [59.75560109, .3622060968]

So the trip was somewhat eventful. The mesh size was reduced to .01 right away. It was increased again after 174 time units because at that time the rocket was moving quite slowly, and again after 183 time units for the same reason. Impact on the surface of the moon occurred at time 211.81. Since each time unit corresponds to 13.5 minutes, this means that the whole trip took 211.8 · 13.5 minutes, or 47.65 hours, nearly two days.

The reader might enjoy experimenting with this situation a little bit. For instance, if we reduce the initial velocity from 1.4142 by a small amount, then the rocket will reach some maximum distance from Earth and will fall back to the ground without ever having reached the moon.

2.12 The big leagues

The three integration rules that we have studied are able to handle small-to-medium sized problems in reasonable time, and with good accuracy. Some problems, however, are more demanding than that. Our two-dimensional moon shot is an example of such a situation, where even a good method like the trapezoidal rule is unable to give the pinpoint accuracy that is needed. In this section we discuss a general family of methods, of which all three of the rules that we have studied are examples, that includes methods of arbitrarily high accuracy. These are the linear multistep methods.

A general linear multistep method is of the form

$$y_{n+1} = \sum_{i=0}^{p} a_{-i}\, y_{n-i} + h \sum_{i=-1}^{p} b_{-i}\, y'_{n-i}. \tag{2.12.1}$$

In order to compute the next value of $y$, namely $y_{n+1}$, we need to store $p + 1$ backwards values of $y$ and $p + 1$ backwards values of $y'$. A total of $p + 2$ points are involved in the formula, and so we can call (2.12.1) a $p + 2$-point formula.

The trapezoidal rule, for example, arises by taking $p = 0$, $a_0 = 1$, $b_1 = 1/2$, $b_0 = 1/2$. Euler's method has $p = 0$, $a_0 = 1$, $b_1 = 0$, $b_0 = 1$, whereas for the midpoint rule we have $p = 1$, $a_0 = 0$, $a_{-1} = 1$, $b_1 = 0$, $b_0 = 2$, $b_{-1} = 0$.

We can recognize an explicit, or noniterative, formula by looking at $b_1$. If $b_1$ is nonzero, then $y_{n+1}$ appears implicitly on the right side of (2.12.1) as well as on the left, and the formula does not explicitly tell us the value of $y_{n+1}$. Otherwise, if $b_1 = 0$, the formula is explicit. In either case, if $p > 0$ the formula will need help getting started or restarted, whereas if $p = 0$ it is self-starting.
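To make the indexing in (2.12.1) concrete, here is a small Python sketch of one step of an explicit linear multistep formula, that is, one with $b_1 = 0$. The function name `multistep_step` and the list layout are assumptions of this illustration: position $i$ of each list holds the quantity with subscript $-i$ (for coefficients) or $n - i$ (for stored values), for $i = 0, \ldots, p$.

```python
import math

def multistep_step(a, b, ys, dys, h):
    """One step of y_{n+1} = sum_i a_{-i} y_{n-i} + h sum_i b_{-i} y'_{n-i}
    for an explicit formula (b_1 = 0), with i running from 0 to p.

    a[i]   holds a_{-i}      b[i]   holds b_{-i}
    ys[i]  holds y_{n-i}     dys[i] holds y'_{n-i}
    """
    history = sum(ai * yi for ai, yi in zip(a, ys))
    slopes = sum(bi * di for bi, di in zip(b, dys))
    return history + h * slopes

# Coefficients of the two explicit rules studied so far.
EULER = {"a": [1.0], "b": [1.0]}                  # p = 0
MIDPOINT = {"a": [0.0, 1.0], "b": [2.0, 0.0]}     # p = 1

if __name__ == "__main__":
    # Test problem y' = -y with exact values at x = 0 and x = 0.05.
    h = 0.05
    y_n, y_nm1 = math.exp(-h), 1.0
    euler_value = multistep_step(EULER["a"], EULER["b"], [y_n], [-y_n], h)
    midpoint_value = multistep_step(MIDPOINT["a"], MIDPOINT["b"],
                                    [y_n, y_nm1], [-y_n, -y_nm1], h)
    print(euler_value, midpoint_value)
```

An implicit formula such as the trapezoidal rule would need, in addition, an iteration on the term $h\, b_1\, y'_{n+1}$, which this sketch deliberately leaves out.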

The general linear multistep formula contains $2p + 3$ constants $a_0, \ldots, a_{-p}$ and $b_1, b_0, b_{-1}, \ldots, b_{-p}$. These constants are chosen to give the method various properties that we may deem to be desirable in a particular application. For instance, if we value explicit formulas, then we may set $b_1 = 0$ immediately, thereby using one of the $2p + 3$ free parameters, and leaving $2p + 2$ others.

One might think that the remaining parameters should be chosen so as to give the highest possible accuracy, in some sense. However, for a fixed $p$, the more accuracy we demand, the more we come into conflict with stability, another highly desirable feature. Indeed, if we demand "too much accuracy" for a fixed $p$, it will turn out that no stable formulas exist. An important theorem of the subject, due to Dahlquist, states roughly that we cannot use more than about half of the $2p + 3$ "degrees of freedom" to achieve high accuracy if we want to have a stable formula (and we do!).

First let's discuss the conditions of accuracy. These are usually handled by asking that the equation (2.12.1) should be exactly true if the unknown function $y$ happens to be a polynomial of low enough degree. For instance, suppose $y(x) = 1$ for all $x$. Substitute $y_k = 1$ and $y'_k = 0$ for all $k$ into (2.12.1), and there follows the condition

$$\sum_{i=0}^{p} a_{-i} = 1. \tag{2.12.2}$$

Notice that in all three of the methods we have been studying, the sum of the $a$'s is indeed equal to 1.

Now suppose we want our multistep formula to be exact not only when $y(x) = 1$, but also when $y(x) = x$. We substitute $y_k = kh$, $y'_k = 1$ for all $k$ into (2.12.1) and obtain, after some simplification, and use of (2.12.2),

$$-\sum_{i=0}^{p} i\, a_{-i} + \sum_{i=-1}^{p} b_{-i} = 1. \tag{2.12.3}$$

The reader should check that this condition is also satisfied by all three of the methods we have studied.

In general let's find the condition for the formula to integrate the function $y(x) = x^r$ and all lower powers of $x$ exactly for some fixed value of $r$. Hence, in (2.12.1), we substitute $y_k = (kh)^r$ and $y'_k = r(kh)^{r-1}$. After cancelling the factor $h^r$, we get

$$(n+1)^r = \sum_{i=0}^{p} a_{-i}\,(n-i)^r + r \sum_{i=-1}^{p} b_{-i}\,(n-i)^{r-1}. \tag{2.12.4}$$

Now clearly we do $x^r$ and all lower powers of $x$ exactly if and only if we do $(x + c)^r$ and all lower powers of $x$ exactly. The conditions are therefore translation invariant, so we can choose one special value of $n$ in (2.12.4) if we want.
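The checks that the reader is asked to make can be carried out mechanically. The following Python sketch is an illustration under stated assumptions: the dictionary layout (key $i$ holds $a_{-i}$ or $b_{-i}$, so that key $-1$ is the coefficient $b_1$ of $y'_{n+1}$) and the names `METHODS`, `exact_for_power` and `order` are hypothetical. It tests condition (2.12.4) in exact rational arithmetic for the three methods, taking $0^0 = 1$, and reports the largest $r$ up to which every power is integrated exactly.

```python
from fractions import Fraction

# Key i holds a_{-i} (in "a") or b_{-i} (in "b"); key -1 is b_1.
METHODS = {
    "euler":       {"a": {0: 1},       "b": {-1: 0, 0: 1}},
    "midpoint":    {"a": {0: 0, 1: 1}, "b": {-1: 0, 0: 2, 1: 0}},
    "trapezoidal": {"a": {0: 1},       "b": {-1: Fraction(1, 2), 0: Fraction(1, 2)}},
}

def exact_for_power(method, r, n=0):
    """True if condition (2.12.4) holds for the power r at the chosen n."""
    lhs = Fraction(n + 1) ** r
    rhs = sum(c * Fraction(n - i) ** r for i, c in method["a"].items())
    if r > 0:
        # The derivative term is multiplied by r, so it vanishes when r = 0.
        rhs += r * sum(c * Fraction(n - i) ** (r - 1)
                       for i, c in method["b"].items())
    return lhs == rhs

def order(method, r_max=10):
    """Largest r <= r_max such that powers 0, 1, ..., r all satisfy (2.12.4)."""
    r = 0
    while r <= r_max and exact_for_power(method, r):
        r += 1
    return r - 1

if __name__ == "__main__":
    for name, method in METHODS.items():
        print(name, order(method))
```

Because the conditions are translation invariant, calling `exact_for_power` with a different value of `n` gives the same verdicts; the default `n = 0` is simply the most convenient choice.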

Let's choose $n = 0$, because the result simplifies then to

$$(-1)^r = \sum_{i=0}^{p} i^r\, a_{-i} - r \sum_{i=-1}^{p} i^{r-1}\, b_{-i}, \qquad r = 0, 1, 2, \ldots \tag{2.12.5}$$

A small technical remark here is that when $r = 1$ and $i = 0$ we see a $0^0$ in the second sum on the right. This should be interpreted as 1, in accordance with (2.12.3).

By the order (of accuracy) of a linear multistep method we mean the highest power of $x$ that the formula integrates exactly, or equivalently, the largest number $r$ for which (2.12.5) is true (together with its analogues for all numbers between 0 and $r$).

The accuracy conditions enable us to take the role of designer, and to construct accurate formulas with desirable characteristics. The reader should verify that the trapezoidal rule is the most accurate of all possible two-point formulas, and should search for the most accurate of all possible three-point formulas (ignoring the stability question altogether).

Now we must discuss the question of stability. Just as in section 2.6, stability will be judged with respect to the performance of our multistep formula on a particular differential equation, namely

$$y' = -\frac{y}{L} \qquad (L > 0) \tag{2.12.6}$$

with $y(0) = 1$, whose solution is a falling exponential.

To see how well the general formula (2.12.1) does with this differential equation, substitute $y'_k = -y_k/L$ for all $k$ in (2.12.1), to obtain

$$(1 + \tau b_1)\, y_{n+1} - \sum_{i=0}^{p} (a_{-i} - \tau b_{-i})\, y_{n-i} = 0 \tag{2.12.7}$$

where as in section 2.6, we have written $\tau = h/L$, the ratio of the step size to the relaxation length of the problem.

Equation (2.12.7) is a linear difference equation with constant coefficients of order $p + 1$. If as usual with such equations, we look for solutions of the form $\alpha^n$, then after substitution and cancellation we obtain the characteristic equation

$$(1 + \tau b_1)\, \alpha^{p+1} - \sum_{i=0}^{p} (a_{-i} - \tau b_{-i})\, \alpha^{p-i} = 0. \tag{2.12.8}$$

This is a polynomial equation of degree $p + 1$, so it has $p + 1$ roots somewhere in the complex plane. These roots depend on the value of $\tau$, that is, on the step size that we use to do the integration. We can't expect that these roots will have absolute value less than 1 however large we choose the step size $h$. All we can hope for is that it should be possible, by choosing $h$ small enough, to get all of these roots to lie inside the unit disk in the complex plane.

Just for openers, let's see where the roots are when $h = 0$. In fact, if when $h = 0$ some root has absolute value strictly greater than 1, then there is no hope at all for the formula,

Now when $h = 0$, $\tau = 0$ also, so the polynomial equation (2.12.8) becomes simply

$$\alpha^{p+1} - \sum_{i=0}^{p} a_{-i}\, \alpha^{p-i} = 0. \tag{2.12.9}$$

This is also a polynomial equation of degree $p + 1$. However, its coefficients don’t depend on the step size, or even on the values of the various $b$’s in the formula. The coefficients, and therefore the roots, depend only on the $a$’s that we use.

Hence, our first necessary condition for the stability of the formula (2.12.1) is that the polynomial equation (2.12.9) should have no roots of absolute value greater than 1. The reader should now check the three methods that we have been using to see if this condition is satisfied.

Next we have to study what happens to the roots when $h$ is small and positive. Qualitatively, here’s what happens. One of the roots of (2.12.9) is always $\alpha = 1$, because condition (2.12.2) is always satisfied in practice, since all it asks is that we correctly integrate the equation $y' = 0$, which is surely not an excessive request.

The one root of (2.12.9) that is equal to 1 is our good friend, and we’ll call it the principal root. Now for small positive $h$, it turns out that the principal root tries as hard as it can to be a good approximation to $e^{-\tau}$. Let $\alpha_1(h)$ denote the value of the principal root at some small value of $h$. Then $\alpha_1(0) = 1$. As $h$ increases slightly away from 0 the principal root moves slightly away from 1. This means that the portion of the solution of the difference equation (2.12.7) that comes from the one root $\alpha_1(h)$ is

$$\alpha_1(h)^n = (\text{“nearly” } e^{-\tau})^n = \text{“nearly” } e^{-n\tau} = \text{“nearly” } e^{-nh/L}. \tag{2.12.10}$$

But $e^{-nh/L}$ is the exact solution of the differential equation (2.12.6) that we’re trying to solve. Therefore the principal root is trying to give us a very accurate approximation to the exact solution.

Well then, what are all of the other $p$ roots of the difference equation (2.12.7) doing for us? The answer is that they are trying as hard as they can to make our lives difficult. The high quality of our approximate solution derives from the nearness of the principal root to $e^{-\tau}$. This high quality is bought at the price of using a $(p+2)$-point multistep formula, and this forces the characteristic equation to be of degree $p + 1$, and hence the remaining roots have to be there, too.

People have invented various names for these other, non-principal, roots of the difference equations. One that is printable is “parasitic,” so we’ll call them the parasitic roots. We would be happiest if they would all be zero, but failing that, we would like them to be as small as possible in absolute value.
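The distinction between the principal root and a parasitic root can be seen numerically for a three-point formula ($p = 1$), where (2.12.8) is only a quadratic. The sketch below is an illustration and not part of the original notes: the function name `char_roots_two_step` and its argument order are hypothetical, and the sample method is assumed to be the midpoint rule in the form $y_{n+1} = y_{n-1} + 2h y'_n$.

```python
import cmath
import math

def char_roots_two_step(a0, am1, b1, b0, bm1, tau):
    """Roots of (2.12.8) for p = 1, that is, of
    (1 + tau*b1)*alpha**2 - (a0 - tau*b0)*alpha - (am1 - tau*bm1) = 0."""
    A = 1 + tau * b1
    B = -(a0 - tau * b0)
    C = -(am1 - tau * bm1)
    disc = cmath.sqrt(B * B - 4 * A * C)
    return (-B + disc) / (2 * A), (-B - disc) / (2 * A)

# Assumed sample method: midpoint rule, a_0 = 0, a_{-1} = 1, b_0 = 2, other b's zero.
tau = 0.1
principal, parasitic = char_roots_two_step(0, 1, 0, 2, 0, tau)
print(abs(principal - math.exp(-tau)))  # small: the principal root tracks e^{-tau}
print(abs(parasitic))                   # exceeds 1 for this method
```

For this assumed example the second root starts at $-1$ when $\tau = 0$ and moves outside the unit disk as soon as $\tau$ is positive, which is the kind of behavior one hopes a parasitic root will not show.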

A picture of a good multistep method in action, then, with some small value of $h$, shows one root of the characteristic equation near 1, more precisely, very near $e^{-\tau}$, and all of the other roots near the origin.

Now let’s try to make this picture a bit more quantitative. We return to the polynomial equation (2.12.8), and attempt to find a power series expansion for the principal root $\alpha_1(\tau)$ in powers of $\tau$. The expansion will be in the form

$$\alpha_1(\tau) = 1 + q_1 \tau + q_2 \tau^2 + \cdots \tag{2.12.11}$$

where the $q$’s are to be determined. If we substitute (2.12.11) into (2.12.8), we get

$$(1 + \tau b_1)(1 + q_1 \tau + q_2 \tau^2 + \cdots)^{p+1} - \sum_{i=0}^{p} (a_{-i} - \tau b_{-i})(1 + q_1 \tau + q_2 \tau^2 + \cdots)^{p-i} = 0. \tag{2.12.12}$$

Now we equate the coefficient of each power of $\tau$ to zero. First, the coefficient of the zeroth power of $\tau$ is

$$1 - \sum_{i=0}^{p} a_{-i} \tag{2.12.13}$$

and, according to (2.12.2), this is indeed zero if our multistep formula correctly integrates a constant function. Second, the coefficient of $\tau$ is

$$b_1 + (p + 1) q_1 - \sum_{i=0}^{p} \left[ a_{-i} (p - i) q_1 - b_{-i} \right] = 0. \tag{2.12.14}$$

However, if we use (2.12.2) and (2.12.3), this simplifies instantly to the simple statement that $q_1 = -1$.

So far, we have shown by direct calculation that if our multistep formula is of order at least 1 (i.e., if (2.12.4) holds for $r = 0$ and $r = 1$), then the expansion (2.12.11) of the principal root agrees with the expansion of $e^{-\tau}$ through terms of first degree in $\tau$. Much more is true, although we will not prove it here: the expansion of the principal root agrees with the expansion of $e^{-\tau}$ through terms of degree equal to the order of the multistep method under consideration. The proof can be done just by continuing the argument that we began above, but we omit the details.

Thus the careful determination of the $a$’s and the $b$’s so as to make the formula as accurate as possible all results in the principal root being “nearly” $e^{-\tau}$. Equal care must be taken to assure that the parasitic roots stay small.

We illustrate these ideas with a new multistep formula,

$$y_{n+1} = y_{n-1} + \frac{h}{3}\left( y'_{n+1} + 4 y'_n + y'_{n-1} \right). \tag{2.12.15}$$

This, like the midpoint rule, is a three-point method ($p = 1$). It is iterative, because $b_1 = 1/3 \neq 0$, and it happens to be very accurate. Indeed, we can quickly check that the accuracy conditions (2.12.5) are satisfied for $r = 0, 1, 2, 3, 4$.
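The check of the accuracy conditions (2.12.5) is mechanical, so it can be handed to a short program. The following is a minimal sketch under stated assumptions: the names `satisfies` and `order` are hypothetical, the coefficients are passed as lists `a = [a_0, a_{-1}, ..., a_{-p}]` and `b = [b_1, b_0, ..., b_{-p}]`, and exact rational arithmetic is used so that equality tests are meaningful.

```python
from fractions import Fraction

def satisfies(r, a, b):
    """Test the accuracy condition (2.12.5) for one value of r.
    a = [a_0, a_{-1}, ..., a_{-p}], b = [b_1, b_0, ..., b_{-p}]."""
    def power(i, k):
        # i**k with the convention 0**0 = 1 used in the text
        return Fraction(1) if k == 0 else Fraction(i) ** k
    first = sum(power(i, r) * a[i] for i in range(len(a)))
    if r == 0:
        second = 0  # the second sum is multiplied by r = 0
    else:
        # b[i + 1] holds b_{-i} for i = -1, 0, ..., p
        second = sum(power(i, r - 1) * b[i + 1] for i in range(-1, len(b) - 1))
    return Fraction(-1) ** r == first - r * second

def order(a, b, r_max=20):
    """Largest r such that (2.12.5) holds for 0, 1, ..., r."""
    r = 0
    while r <= r_max and satisfies(r, a, b):
        r += 1
    return r - 1

third = Fraction(1, 3)
print(order([0, 1], [third, 4 * third, third]))  # formula (2.12.15)
```

Under these conventions the same function can be pointed at the trapezoidal rule with `order([1], [Fraction(1, 2), Fraction(1, 2)])`.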

The method is of fourth order, and in fact its error term can be found by the method of section 2.8 to be $-h^5 y^{(v)}(X)/90$, where $X$ lies between $x_{n-1}$ and $x_{n+1}$.

Now we examine the stability of this method. First, when $h = 0$ the equation (2.12.9) that determines the roots is just $\alpha^2 - 1 = 0$, so the roots are $+1$ and $-1$. The root at $+1$ is the friendly one. As $h$ increases slightly to small positive values, that root will follow the power series expansion of $e^{-\tau}$ very accurately, in fact, through the first four powers of $\tau$.

The root at $-1$ is to be regarded with apprehension, because it is poised on the brink of causing trouble. If, as $h$ grows to a small positive value, this root grows in absolute value, then its powers will dwarf the powers of the principal root in the numerical solution, and all accuracy will eventually be lost.

To see if this happens, let’s substitute a power series

$$\alpha_2(h) = -1 + q_1 \tau + q_2 \tau^2 + \cdots \tag{2.12.16}$$

into the characteristic equation (2.12.8), which in the present case is just the quadratic equation

$$\left(1 + \frac{\tau}{3}\right)\alpha^2 + \frac{4\tau}{3}\,\alpha - \left(1 - \frac{\tau}{3}\right) = 0. \tag{2.12.17}$$

After substituting, we quickly find that $q_1 = -1/3$, and our apprehension was fully warranted, because for small $\tau$ the root acts like $-1 - \tau/3$, so for all small positive values of $\tau$ this lies outside of the unit disk, so the method will be unstable.

In the next section, we are going to describe a family of multistep methods, called the Adams methods, that are stable, and that offer whatever accuracy one might want, if one is willing to save enough backwards values of the $y$’s. The secret weapon of the Adams formulas is that when $h = 0$, one of the roots (the friendly one) is as usual sitting at 1, ready to develop into the exponential series, and all of the unfriendly roots are huddled together at the origin, just as far out of trouble as they can get. First we will develop a very general tool, the Lagrange interpolation formula, that we’ll need in several parts of the sequel, and following that we discuss the Adams formulas.

2.13 Lagrange and Adams formulas

Our next job is to develop formulas that can give us as much accuracy as we want in a numerical solution of a differential equation. This means that we want methods in which the formulas span a number of points, i.e., in which the next value of $y$ is obtained from several backward values, instead of from just one or two as in the methods that we have already studied. Furthermore, these methods will need some assistance in getting started, so we will have to develop matched formulas that will provide them with starting values of the right accuracy.

All of these jobs can be done with the aid of a formula, due to Lagrange, whose mission in life is to fit a polynomial to a given set of data points, so let’s begin with a little example.

Problem: Through the three points $(1, 17)$, $(3, 10)$, $(7, -5)$ there passes a unique quadratic polynomial. Find it.

Solution:

$$P(x) = 17\,\frac{(x - 3)(x - 7)}{(1 - 3)(1 - 7)} + 10\,\frac{(x - 1)(x - 7)}{(3 - 1)(3 - 7)} + (-5)\,\frac{(x - 1)(x - 3)}{(7 - 1)(7 - 3)}. \tag{2.13.1}$$

Let’s check that this is really the right answer. First of all, (2.13.1) is a quadratic polynomial, since each term is. Does it pass through the three given points? When $x = 1$, the second and third terms vanish, because of the factor $x - 1$, and the first term has the value 17, as is obvious from the cancellation that takes place. Similarly when $x = 3$, the first and third terms are zero and the second is 10, etc.

Now we can jump from this little example all the way to the general formula. If we want to find the polynomial of degree $n - 1$ that passes through $n$ given data points $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, then all we have to do is to write it out:

$$P(x) = \sum_{i=1}^{n} y_i \prod_{\substack{j=1 \\ j \neq i}}^{n} \frac{x - x_j}{x_i - x_j}. \tag{2.13.2}$$

This is the Lagrange interpolation formula. In the $i$th term of the sum above is the product of $n - 1$ factors $(x - x_j)/(x_i - x_j)$, namely of all those factors except for the one in which $j = i$.

Next, consider the special case of the above formula in which the points $x_1, x_2, \ldots, x_n$ are chosen to be equally spaced. To be specific, we might as well suppose that $x_j = jh$ for $j = 1, 2, \ldots, n$. If we substitute in (2.13.2), we get

$$P(x) = \sum_{i=1}^{n} y_i \prod_{\substack{j=1 \\ j \neq i}}^{n} \frac{x - jh}{(i - j)h}. \tag{2.13.3}$$

Consider the product of the denominators above. It is

$$(i - 1)(i - 2) \cdots (1)(-1)(-2) \cdots (i - n)\, h^{n-1} = (-1)^{n-i} (i - 1)!\,(n - i)!\, h^{n-1}. \tag{2.13.4}$$

Finally we replace $x$ by $xh$ and substitute in (2.13.3) to obtain

$$P(xh) = \sum_{i=1}^{n} \frac{(-1)^{n-i}}{(i - 1)!\,(n - i)!}\; y_i \prod_{\substack{j=1 \\ j \neq i}}^{n} (x - j) \tag{2.13.5}$$

and that is about as simple as we can get it.
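Formula (2.13.2) translates almost word for word into code. The sketch below is only an illustration: the function name `lagrange` is hypothetical, and exact rational arithmetic is assumed so that the check against the little example above involves no rounding.

```python
from fractions import Fraction

def lagrange(points, x):
    """Evaluate the interpolating polynomial (2.13.2) through `points` at x."""
    total = Fraction(0)
    for i, (xi, yi) in enumerate(points):
        term = Fraction(yi)
        for j, (xj, _) in enumerate(points):
            if j != i:
                # one factor (x - x_j)/(x_i - x_j) of the i-th product
                term *= Fraction(x - xj, xi - xj)
        total += term
    return total

data = [(1, 17), (3, 10), (7, -5)]
print([str(lagrange(data, x)) for x in (1, 3, 7)])  # reproduces 17, 10, -5
```

Evaluating at any other abscissa, for instance `lagrange(data, 5)`, gives the value of the quadratic (2.13.1) there.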

Now we’ll use this result to obtain a collection of excellent methods for solving differential equations, the so-called Adams formulas. Begin with the obvious fact that

$$y((p + 1)h) = y(ph) + h \int_{p}^{p+1} y'(ht)\, dt. \tag{2.13.6}$$

Instead of integrating the exact function $y'$ in this formula, we will integrate the polynomial that agrees with $y'$ at a number of given data points. First, we replace $y'$ by the Lagrange interpolating polynomial that agrees with it at the $p + 1$ points $h, 2h, \ldots, (p+1)h$. This transforms (2.13.6) into

$$y_{p+1} = y_p + h \sum_{i=1}^{p+1} \frac{(-1)^{p-i+1}}{(i - 1)!\,(p - i + 1)!}\; y'(ih) \int_{p}^{p+1} \prod_{\substack{j=1 \\ j \neq i}}^{p+1} (x - j)\, dx. \tag{2.13.7}$$

We can rewrite this in the more customary form of a linear multistep method:

$$y_{n+1} = y_n + h \sum_{i=-1}^{p-1} b_{-i}\, y'_{n-i}. \tag{2.13.8}$$

This involves replacing $i$ by $p - i$ in (2.13.7), so the numbers $b_{-i}$ are given by

$$b_{-i} = \frac{(-1)^{i+1}}{(p - 1 - i)!\,(i + 1)!} \int_{p}^{p+1} \prod_{\substack{j=1 \\ j \neq p-i}}^{p+1} (x - j)\, dx, \qquad i = -1, 0, \ldots, p - 1. \tag{2.13.9}$$

Now to choose a formula in this family, all we have to do is specify $p$. For instance, let’s take $p = 2$. Then we find

$$b_1 = \frac{1}{2} \int_2^3 (x - 1)(x - 2)\, dx = \frac{5}{12}, \qquad b_0 = -\int_2^3 (x - 1)(x - 3)\, dx = \frac{2}{3}, \qquad b_{-1} = \frac{1}{2} \int_2^3 (x - 2)(x - 3)\, dx = -\frac{1}{12} \tag{2.13.10}$$

so we have the third-order implicit Adams method

$$y_{n+1} = y_n + \frac{h}{12}\left( 5 y'_{n+1} + 8 y'_n - y'_{n-1} \right). \tag{2.13.11}$$

If this process is repeated for various values of $p$ we get the whole collection of these formulas. For reference, we show the first four of them below, complete with their error terms.

$$y_{n+1} = y_n + h\, y'_{n+1} - \frac{h^2}{2}\, y'' \tag{2.13.12}$$

492437 0.2.000000 0.740818 0. except that a0 = 1.369595 -0.12.14) (2.7 0. All of the Adams formulas are stable for this reason.390073 0. and for a predictor method of order p + 1 we replace y by the interpolating polynomial that Root2(τ ) 0.13.113948 0.13. In each of them the unknown yn+1 appears on both sides of the equation.12. (2.904837 0. The other backwards values are all of the derivatives.157869 0. 0.13.449329 0. The other roots all start at the origin. The root at 1.16) The roots of this are 1.13.075823 -0. It happens that the roots are all real in this case.2 0.128457 0.058536 0.081048 0.174458 0. useful as corrector formulas. as h grows to a small positive value. therefore.116824 -0. so when h is small they too will be small. instead of trying to cause trouble for us.9.440927 0.605769 0. in the limiting case where the step size is zero.548812 0. It is the polynomial equation that determines the roots of the characteristic equation of the method.142857 0.9 1.1 0. We return to (2. .13) (2. In Table 6 we show the behavior of the three roots of the characteristic equation (2. then that polynomial equation becomes simply αp+1 − αp = 0. Notice that the friendly root is the one of largest absolute value for all τ ≤ 0.3 0.496585 0.13 Lagrange and Adams formulas yn+1 = yn + yn+1 yn+1 h h3 (yn+1 + yn ) − y 2 12 h h4 = yn + (5yn+1 + 8yn − yn−1 ) − y (iv 12 24 h 19h5 (v = yn + (9yn+1 + 19yn − 5yn−1 + yn−2 ) − y 24 720 77 (2.224009 Root3(τ ) 0.8 0.670320 0.6).818731 0.333333 -0.000000 0.261204 -0.904837 0.0.9) again.297171 -0.098550 0.15)..1) then we quickly ﬁnd the reason for the stability of these formulas.405827 . τ 0.13.0 0.5 0.0 e−τ 1.153914 -0.406570 0. 0.13. Notice that only one backwards value of y itself is used.367879 Friendly(τ ) 1.546918 0.818723 0.6 0.4 0.12.000000 0. .189813 -0. .225454 -0.194475 0.740758 0.606531 0. If we use a formula with all the ai = 0.000000 -0. They are. Now look at equation (2.670068 0. 
To ﬁnd matching predictor formulas is also straightforward.8) as it applies to the fourth-order method (2.333333 table 6 Now we have found the implicit Adams formulas. is the one that gives us the accuracy of the computed solution.15) If we compare these formulas with the general linear multistep method (2. namely yn .

Then, in place of (2.13.7) we get

$$y_{p+1} = y_p + h \sum_{i=0}^{p} \frac{(-1)^{p-i}}{i!\,(p - i)!}\; y'(ih) \int_{p}^{p+1} \prod_{\substack{j=0 \\ j \neq i}}^{p} (x - j)\, dx. \tag{2.13.17}$$

As before, we can write this in the more familiar form

$$y_{n+1} = y_n + h \sum_{i=0}^{p} b_{-i}\, y'_{n-i} \tag{2.13.18}$$

where the numbers $b_{-i}$ are now given by

$$b_{-i} = \frac{(-1)^{i}}{(p - i)!\, i!} \int_{p}^{p+1} \prod_{\substack{j=0 \\ j \neq p-i}}^{p} (x - j)\, dx, \qquad i = 0, 1, \ldots, p. \tag{2.13.19}$$

We tabulate below these formulas in the cases $p = 1, 2, 3$, together with their error terms:

$$y_{n+1} = y_n + \frac{h}{2}\left( 3 y'_n - y'_{n-1} \right) + \frac{5 h^3}{12}\, y''' \tag{2.13.20}$$

$$y_{n+1} = y_n + \frac{h}{12}\left( 23 y'_n - 16 y'_{n-1} + 5 y'_{n-2} \right) + \frac{3 h^4}{8}\, y^{(iv)} \tag{2.13.21}$$

$$y_{n+1} = y_n + \frac{h}{24}\left( 55 y'_n - 59 y'_{n-1} + 37 y'_{n-2} - 9 y'_{n-3} \right) + \frac{251 h^5}{720}\, y^{(v)} \tag{2.13.22}$$

Notice, for example, that the explicit fourth-order formula (2.13.22) has about 13 times as large an error term as the implicit fourth-order formula (2.13.15). This is typical for matched pairs of predictor-corrector formulas. As we have noted previously, a single application of a corrector formula reduces error by about $h\,\partial f/\partial y$, so if we keep $h$ small enough so that $h\,\partial f/\partial y$ is less than 1/13 or thereabouts, then a single application of the corrector formula will produce an estimate of the next value of $y$ with full attainable accuracy.

We are now fully equipped with Adams formulas for prediction and correction, in matched pairs, all of them stable, of whatever accuracy is needed. The use of these pairs requires special starting formulas, since multistep methods cannot get themselves started or restarted without assistance.

Once again the Lagrange interpolation formula comes to the rescue. This time we begin with a slight variation of (2.13.6),

$$y(mh) = y(0) + \int_{0}^{mh} y'(t)\, dt. \tag{2.13.23}$$

Next, replace $y'(t)$ by the Lagrange polynomial that agrees with it at $0, h, 2h, \ldots, ph$ (for $p \ge m$). We then obtain

$$y_m = y_0 + h \sum_{i=0}^{p} \frac{(-1)^{p-i}}{i!\,(p - i)!}\; y'_i \int_{0}^{m} \prod_{\substack{j=0 \\ j \neq i}}^{p} (t - j)\, dt, \qquad m = 1, 2, \ldots, p. \tag{2.13.24}$$

We can rewrite these equations in the form

$$y_m = y_0 + h \sum_{j=0}^{p} \lambda_j\, y'_j, \qquad m = 1, 2, \ldots, p \tag{2.13.25}$$

where the coefficients $\lambda_i$ (which depend on $m$ as well as on $i$) are given by

$$\lambda_i = \frac{(-1)^{p-i}}{i!\,(p - i)!} \int_{0}^{m} \prod_{\substack{j=0 \\ j \neq i}}^{p} (t - j)\, dt, \qquad i = 0, 1, \ldots, p. \tag{2.13.26}$$

Of course, when these formulas are used on a differential equation $y' = f(x, y)$, each of the values of $y'$ on the right side of (2.13.25) is replaced by an $f(x, y)$ value. Therefore equations (2.13.25) are a set of $p$ simultaneous equations in $p$ unknowns $y_1, y_2, \ldots, y_p$ ($y_0$ is, of course, known). We can solve them with an iteration in which we first guess all $p$ of the unknowns, and then refine the guesses all at once by using (2.13.25) to give us the new guesses from the old, until sufficient convergence has occurred.

For example, if we take $p = 3$, then the starting formulas are

$$\begin{aligned}
y_1 &= y_0 + \frac{h}{24}\left( 9 y'_0 + 19 y'_1 - 5 y'_2 + y'_3 \right) - \frac{19 h^5}{720}\, y^{(v)} \\
y_2 &= y_0 + \frac{h}{3}\left( y'_0 + 4 y'_1 + y'_2 \right) - \frac{h^5}{90}\, y^{(v)} \\
y_3 &= y_0 + \frac{3h}{8}\left( y'_0 + 3 y'_1 + 3 y'_2 + y'_3 \right) - \frac{3 h^5}{80}\, y^{(v)}.
\end{aligned} \tag{2.13.27}$$

The philosophy of using a matched pair of Adams formulas for propagation of the solution, together with the starter formulas shown above, has the potential of being implemented in a computer program in which the user could specify the desired precision by giving the value of $p$. The program could then calculate from the formulas above the coefficients in the predictor-corrector pair and the starting method, and then proceed with the integration. This would make a very versatile code.
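The coefficient calculation mentioned above can be sketched directly from (2.13.9), (2.13.19) and (2.13.26). All three are integrals of Lagrange basis polynomials on integer nodes, so one helper serves for all of them. This is a sketch rather than a definitive implementation: the names `product_poly`, `integrate`, `weights`, `adams_corrector`, `adams_predictor` and `starters` are hypothetical, and the Lagrange denominators are computed as plain products $\prod (i - j)$ instead of through the factorial form (2.13.4).

```python
from fractions import Fraction

def product_poly(roots):
    """Coefficients (lowest degree first) of the product of (x - r) over roots."""
    coeffs = [Fraction(1)]
    for r in roots:
        shifted = [Fraction(0)] + coeffs                    # x * poly
        scaled = [-r * c for c in coeffs] + [Fraction(0)]   # -r * poly
        coeffs = [s + t for s, t in zip(shifted, scaled)]
    return coeffs

def integrate(coeffs, lo, hi):
    """Exact integral of the polynomial from lo to hi."""
    def antiderivative(x):
        return sum(c * Fraction(x) ** (k + 1) / (k + 1) for k, c in enumerate(coeffs))
    return antiderivative(hi) - antiderivative(lo)

def weights(nodes, lo, hi):
    """Integrals over [lo, hi] of the Lagrange basis polynomials on the nodes."""
    result = []
    for i in nodes:
        others = [j for j in nodes if j != i]
        denom = 1
        for j in others:
            denom *= i - j
        result.append(integrate(product_poly(others), lo, hi) / denom)
    return result

def adams_corrector(p):
    # nodes 1..p+1 on [p, p+1]; returns [b_1, b_0, ..., b_{-(p-1)}] as in (2.13.9)
    return weights(range(1, p + 2), p, p + 1)[::-1]

def adams_predictor(p):
    # nodes 0..p on [p, p+1]; returns [b_0, b_{-1}, ..., b_{-p}] as in (2.13.19)
    return weights(range(0, p + 1), p, p + 1)[::-1]

def starters(p):
    # row m-1 holds lambda_0, ..., lambda_p of (2.13.26) for y_m, m = 1..p
    return [weights(range(0, p + 1), 0, m) for m in range(1, p + 1)]

print([str(b) for b in adams_corrector(2)])  # coefficients of (2.13.11)
```

With these helpers, `adams_predictor(3)` and `starters(3)` can be compared against (2.13.22) and (2.13.27).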


Chapter 3

Numerical linear algebra

3.1 Vector spaces and linear mappings

In this chapter we will study numerical methods for the solution of problems in linear algebra, that is to say, of problems that involve matrices or systems of linear equations. Among these we mention the following:

(a) Given an $n \times n$ matrix, calculate its determinant.

(b) Given $m$ linear algebraic equations in $n$ unknowns, find the most general solution of the system, or discover that it has none.

(c) Invert an $n \times n$ matrix, if possible.

(d) Find the eigenvalues and eigenvectors of an $n \times n$ matrix.

As usual, we will be very much concerned with the development of efficient software that will accomplish the above purposes.

We assume that the reader is familiar with the basic constructs of linear algebra: vector space, linear dependence and independence of vectors, Euclidean $n$-dimensional space, spanning sets of vectors, basis vectors. We will quickly review some additional concepts that will be helpful in our work. For a more complete discussion of linear algebra, see any of the references cited at the end of this chapter. And now, to business.

The first major concept we need is that of a linear mapping. Let $V$ and $W$ be two vector spaces over the real numbers (we’ll stick to the real numbers unless otherwise specified). We say that $T$ is a linear mapping from $V$ to $W$ if $T$ associates with every vector $v$ of $V$ a vector $Tv$ of $W$ (so $T$ is a mapping) in such a way that

$$T(\alpha v' + \beta v'') = \alpha T v' + \beta T v'' \tag{3.1.1}$$

for all vectors $v'$, $v''$ of $V$ and real numbers (or scalars) $\alpha$ and $\beta$ (i.e., $T$ is linear). Notice that the “+” signs are different on the two sides of (3.1.1). On the left we add two vectors of $V$, on the right we add two vectors of $W$.

1. . respectively. The right side is known. Consider the mapping that associates with a polynomial f of V its derivative T f = f in W . let V and W both be the same. Let T be the mapping that carries the vector (x. suppose V is Euclidean two-dimensional space (the plane) and W is Euclidean three-dimensional space. More generally. and not just matrices. So. . Second. suppose we know T e1 . suppose the vector spaces V and W are of dimensions m and n. 3. 3). T e2 . . . and in W there is a basis of n vectors f1 . That is. any of the matrices that represent the same mapping will have the same determinant as the given one. to linear combinations of more than two vectors) to obtain T x = α1 (T e1 ) + α2 (T e2 ) + · · · + αm (T em ). 4x + 5y) of W . Then we have the situation that is sketched in ﬁgure 3. “all we have to do” is describe its action on a set of basis vectors of V . To get back to the matter at hand. Then T is a linear mapping. and making this kind of observation and identifying simple representatives of the class of relevant matrices can be quite helpful. The mapping T that carries a vector x of V into Ax of W is a linear mapping. let A be a given m × n matrix of real numbers. (3. let V be Euclidean n-dimensional space and let W be Euclidean m-space. For instance. . by induction. T ei can be written as a linear combination of the basis vectors of W . so we write n (3. and the claim is established. . −1) = (4. we seem to confront a problem about matrices. comes from the fact that a particular mapping can be represented by many diﬀerent matrices. Indeed. to describe a linear mapping. (3. It’s easy to check that this mapping is linear. are in fact questions about linear mappings. Let T be a linear mapping from V to W . it often happens that problems in linear algebra that seem to be questions about matrices. x = α1 e1 + α2 e2 + · · · + αm em . If ei is one of these. The importance of studying linear mappings in general. . 
secure in the knowledge that the answer will be the same. .2) Now apply T to both sides and use the linearity of T (extended. e2 . x − y. em .82 Here are a few examples of linear mappings. any matrix generates a linear mapping between two appropriately chosen (to match the dimensions of the matrix) vector spaces. e3 . We claim now that the action of T on every vector of V is known if we know only its eﬀect on the m basis vectors of V . Then we can choose in V a basis of m vectors. T em . T (2. Express x in terms of the basis of V . For example. This means that we can change to a simpler matrix that represents the same linear mapping before answering the question. m. namely the space of all polynomials of some given degree n. . Numerical linear algebra First. .1 below. then T ei is a vector in W . In fact. Further. .1. fn .1. if we are given a square matrix and we want its determinant. say e1 . . f2 . . The coeﬃcients of this linear combination will evidently depend on i. y) of V to the vector (3x + 2y.3) T ei = j=1 tji fj i = 1.4) . . Then let x be any vector in V . As such. . .
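Relations (3.1.3) and (3.1.4) can be tried out on the second example above, the mapping from the plane to three-dimensional space. The sketch below assumes the standard bases in both spaces; the helper names `matrix_of` and `apply` are hypothetical.

```python
def matrix_of(T, m):
    """Build the array t[j][i] of (3.1.4) from the images T(e_i) of the
    standard basis vectors of m-dimensional Euclidean space."""
    images = []
    for i in range(m):
        e_i = [1 if k == i else 0 for k in range(m)]
        images.append(T(e_i))
    n = len(images[0])
    # entry [j][i] is the j-th coordinate of T(e_i)
    return [[images[i][j] for i in range(m)] for j in range(n)]

def apply(t, x):
    """Compute T x from the coordinates of x, as in (3.1.3)."""
    return [sum(t[j][i] * x[i] for i in range(len(x))) for j in range(len(t))]

def T(v):
    x, y = v
    return [3 * x + 2 * y, x - y, 4 * x + 5 * y]

t = matrix_of(T, 2)
print(t)                  # [[3, 2], [1, -1], [4, 5]]
print(apply(t, [2, -1]))  # [4, 3, 3]
```

Only the two images $Te_1$ and $Te_2$ were needed to recover the value of $T$ at $(2, -1)$.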

[Figure 3.1: The action of a linear mapping]

Now the $mn$ numbers $t_{ji}$, $i = 1, \ldots, m$, $j = 1, \ldots, n$, together with the given sets of basis vectors for $V$ and $W$, are enough to describe the linear operator $T$ completely. Indeed, if we know all of those numbers, then by (3.1.4) we know what $T$ does to every basis vector of $V$, and then by (3.1.3) we know the action of $T$ on every vector of $V$.

To summarize, an $n \times m$ matrix $t_{ji}$ represents a linear mapping $T$ from a vector space $V$ with a distinguished basis $E = \{e_1, e_2, \ldots, e_m\}$, to a vector space $W$ with a distinguished basis $F = \{f_1, f_2, \ldots, f_n\}$, in the sense that from a knowledge of $(t, E, F)$ we know the full mapping $T$.

Next, suppose once more that $T$ is a linear mapping from $V$ to $W$. Since $T$ is linear, it is easy to see that $T$ carries the 0 vector of $V$ into the 0 vector of $W$. Consider the set of all vectors of $V$ that are mapped by $T$ into the zero vector of $W$. This set is called the kernel of $T$, and is written $\ker(T)$. Thus

$$\ker(T) = \{x \in V \mid Tx = 0_W\}. \tag{3.1.5}$$

Now $\ker(T)$ is not just a set of vectors, it is itself a vector space, that is, a vector subspace of $V$. Indeed, one sees immediately that if $x$ and $y$ belong to $\ker(T)$ and $\alpha$ and $\beta$ are scalars, then

$$T(\alpha x + \beta y) = \alpha T(x) + \beta T(y) = 0, \tag{3.1.6}$$

so $\alpha x + \beta y$ belongs to $\ker(T)$ also.

Since $\ker(T)$ is a vector space, we can speak of its dimension. If $\nu = \dim \ker(T)$, then $\nu$ is called the nullity of the mapping $T$.

Consider also the set of vectors $w$ of $W$ that are of the form $w = Tv$, for some vector $v \in V$ (possibly many such $v$’s exist). This set is called the image of $T$, and is written

$$\operatorname{im}(T) = \{w \in W \mid w = Tv,\; v \in V\}. \tag{3.1.7}$$

Once again, we remark that $\operatorname{im}(T)$ is more than just a set of vectors, it is in fact a vector subspace of $W$, since if $w'$ and $w''$ are both in $\operatorname{im}(T)$, and if $\alpha$ and $\beta$ are scalars, then we have $w' = Tv'$ and $w'' = Tv''$ for some $v'$, $v''$ in $V$. Hence

$$\alpha w' + \beta w'' = \alpha T v' + \beta T v'' = T(\alpha v' + \beta v'') \tag{3.1.8}$$

so $\alpha w' + \beta w''$ lies in $\operatorname{im}(T)$, too. The dimension of the vector (sub)space $\operatorname{im}(T)$ is called the rank of the mapping $T$.

A celebrated theorem of Sylvester asserts that

$$\operatorname{rank}(T) + \operatorname{nullity}(T) = \dim(V). \tag{3.1.9}$$

By the rank of a matrix $A$ we mean any of the following:

(a) the maximum number of linearly independent rows that we can find in $A$

(b) same for columns

(c) the largest number $r$ for which there is an $r \times r$ nonzero sub-determinant in $A$ (i.e., a set of $r$ rows and $r$ columns, not necessarily consecutive, such that the $r \times r$ submatrix that they determine has a nonzero determinant).

It is true that the rank of a linear mapping $T$ from $V$ to $W$ is equal to the rank of any matrix $A$ that represents $T$ with respect to some pair of bases of $V$ and $W$.

It is also true that the rank of a matrix is not computable unless infinite-precision arithmetic is used. In fact, the $2 \times 2$ matrix

$$\begin{pmatrix} 1 & 1 \\ 1 & 1 + 10^{-20} \end{pmatrix} \tag{3.1.10}$$

has rank 2, but if the [2,2]-entry is changed to 1, the rank becomes 1. No computer program will be able to tell the difference between these two situations unless it is doing arithmetic to at least 21 digits of precision. Therefore, unless our programs do exact arithmetic on rational numbers, or do finite field arithmetic, or whatever, the rank will be uncomputable. What we are saying, really, is just that the rank of a matrix is not a continuous function of the matrix entries.

Now we are ready to consider one of the most important problems of numerical linear algebra: the solution of simultaneous linear equations.

Let $A$ be a given $m \times n$ matrix, let $b$ be a given column vector of length $m$, and consider the system

$$Ax = b \tag{3.1.11}$$

of $m$ linear simultaneous equations in $n$ unknowns $x_1, x_2, \ldots, x_n$.

Consider the set of all solution vectors $x$ of (3.1.11). Is it a vector space? That’s right, it isn’t, unless $b$ happens to be the zero vector (why?).
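Returning for a moment to the rank example (3.1.10): the claim that ordinary floating-point arithmetic cannot see the difference between the two matrices is easy to try. The sketch below is an illustration only; the function name `rank` is hypothetical, and it uses plain Gaussian elimination with an exact-zero test, which is meaningful for rational entries but, as the text warns, not for rounded ones.

```python
from fractions import Fraction

def rank(rows):
    """Rank by Gaussian elimination; exact when the entries are Fractions."""
    m = [list(r) for r in rows]
    found = 0
    for col in range(len(m[0])):
        pivot = next((r for r in range(found, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue  # no usable entry in this column
        m[found], m[pivot] = m[pivot], m[found]
        for r in range(found + 1, len(m)):
            factor = m[r][col] / m[found][col]
            m[r] = [x - factor * y for x, y in zip(m[r], m[found])]
        found += 1
    return found

eps = Fraction(1, 10**20)
exact = [[Fraction(1), Fraction(1)], [Fraction(1), 1 + eps]]
floating = [[float(x) for x in row] for row in exact]
print(rank(exact))     # 2
print(rank(floating))  # 1, because 1 + 1e-20 rounds to 1.0 in double precision
```

The same routine gives different answers for what is nominally the same matrix, which is the discontinuity of the rank showing itself.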

Suppose then that we intend to write a computer program that will in some sense present as output all solutions $x$ of (3.1.11). What might the output look like?

If $b = 0$, i.e., if the system is homogeneous, there is no difficulty, for in that case the solution set is $\ker(A)$, a vector space, and we can describe it by printing out a set of basis vectors of $\ker(A)$.

If the right-hand side vector $b$ is not 0, then consider any two solutions $x'$ and $x''$ of (3.1.11). Then $Ax' = b$, $Ax'' = b$, and by subtraction, $A(x' - x'') = 0$. Hence $x' - x''$ belongs to $\ker(A)$, so if $e_1, e_2, \ldots, e_\nu$ are a basis for $\ker(A)$, then $x' - x'' = \alpha_1 e_1 + \alpha_2 e_2 + \cdots + \alpha_\nu e_\nu$. If $b \neq 0$ then, we can describe all possible solutions of (3.1.11) by printing out one particular solution $x''$, and a list of the basis vectors of $\ker(A)$, because then all solutions are of the form

$$x = x'' + \alpha_1 e_1 + \alpha_2 e_2 + \cdots + \alpha_\nu e_\nu. \tag{3.1.12}$$

Therefore, a computer program that alleges that it solves (3.1.11) should print out a basis for the kernel of $A$ together with, in case $b \neq 0$, any one particular solution of the system. We will see how to accomplish this in the next section.

Exercises 3.1

1. Show by examples that for every $n$, the rank of a given $n \times n$ matrix $A$ is a discontinuous function of the entries of $A$.

2. Consider the vector space $V$ of all polynomials in $x$ of degree at most 2. Let $T$ be the linear mapping that sends each polynomial to its derivative.

(a) What is the rank of $T$?

(b) What is the image of $T$?

(c) For the basis $\{1, x, x^2\}$ of $V$, find the $3 \times 3$ matrix that represents $T$.

(d) Regard $T$ as a mapping from $V$ to the space $W$ of polynomials of degree 1. Use the basis given in part (c) for $V$, and the basis $\{1, x - 1\}$ for $W$, and find the $2 \times 3$ matrix that represents $T$ with respect to these bases.

(e) Check that the ranks of the matrices in (c) and (d) are equal.

3. Let $T$ be a linear mapping of Euclidean 3-dimensional space to itself. Suppose $T$ takes the vector (1, …) to (…), and $T$ takes (1, …) to (…), and $T$ takes (3, …) to (…). Find $T$(1, …, 4).

4. Let $A$ be an $n \times n$ matrix with integer entries having absolute values at most $M$. What is the maximum number of binary digits that could be needed to represent all of the elements of $A$?

5. If $T$ acts on the vector space of polynomials of degree at most $n$ according to $T(f) = f' - 3f$, find $\ker(T)$, $\operatorname{im}(T)$, and $\operatorname{rank}(T)$.

6. Why isn’t the solution set of (3.1.11) a vector space if $b \neq 0$?

7. Let $a_{ij} = r_i s_j$ for $i, j = 1, \ldots, n$. Show that the rank of $A$ is at most 1.

8. Suppose that the matrix

$$A = \begin{pmatrix} 0 & 1 & 1 \\ 1 & 0 & -1 \\ -2 & 1 & 1 \end{pmatrix} \tag{3.1.13}$$

represents a certain linear mapping from $V$ to $V$ with respect to the basis

{(1, …), …, (…, 1)} (3.1.14)

of $V$. Find the matrix that represents the same mapping with respect to the basis

{(0, …), …, (…, 1)}. (3.1.15)

Check that the determinant is unchanged.

9. Find a system of linear equations whose solution set consists of the vector (1, …, 0) plus any linear combination of (−1, …, 1) and (0, …).

10. Construct two sets of two equations in two unknowns such that (a) their coefficient matrices differ by at most $10^{-12}$ in each entry, and (b) their solutions differ by at least $10^{+12}$ in each entry.

3.2 Linear systems

The method that we will use for the computer solution of $m$ linear equations in $n$ unknowns will be a natural extension of the familiar process of Gaussian elimination. Let’s begin with a little example, say of the following set of two equations in three unknowns:

$$\begin{aligned} x + y + z &= 2 \\ x - y - z &= 5. \end{aligned} \tag{3.2.1}$$

If we subtract the second equation from the first, then the two equations can be written

$$\begin{aligned} x + y + z &= 2 \\ 2y + 2z &= -3. \end{aligned} \tag{3.2.2}$$

We divide the second equation by 2, and then subtract it from the first, getting

$$\begin{aligned} x &= \tfrac{7}{2} \\ y + z &= -\tfrac{3}{2}. \end{aligned} \tag{3.2.3}$$

The value of $z$ can now be chosen arbitrarily, and then $x$ and $y$ will be determined. To make this more explicit, we can rewrite the solution in the form

$$\begin{pmatrix} x \\ y \\ z \end{pmatrix} = \begin{pmatrix} \tfrac{7}{2} \\ -\tfrac{3}{2} \\ 0 \end{pmatrix} + z \begin{pmatrix} 0 \\ -1 \\ 1 \end{pmatrix}. \tag{3.2.4}$$

In (3.2.4) we see clearly that the general solution is the sum of a particular solution $(7/2, -3/2, 0)$, plus any multiple of the basis vector $(0, -1, 1)$ for the kernel of the coefficient matrix of (3.2.1).

The calculation can be compactified by writing the numbers in a matrix and omitting the names of the unknowns. A vertical line in the matrix will separate the left sides and the right sides of the equations. Thus, the original system (3.2.1) is

$$\left(\begin{array}{ccc|c} 1 & 1 & 1 & 2 \\ 1 & -1 & -1 & 5 \end{array}\right). \tag{3.2.5}$$

Now we do $\text{row}(2) := \text{row}(1) - \text{row}(2)$ and we have

$$\left(\begin{array}{ccc|c} 1 & 1 & 1 & 2 \\ 0 & 2 & 2 & -3 \end{array}\right). \tag{3.2.6}$$

Next $\text{row}(2) := \text{row}(2)/2$, and then $\text{row}(1) := \text{row}(1) - \text{row}(2)$ bring us to the final form

$$\left(\begin{array}{ccc|c} 1 & 0 & 0 & \tfrac{7}{2} \\ 0 & 1 & 1 & -\tfrac{3}{2} \end{array}\right) \tag{3.2.7}$$

which is the matrix equivalent of (3.2.3).

Now we will step up to a slightly larger example, to see some of the situations that can arise. We won’t actually write the numerical values of the coefficients, but we’ll use asterisks instead. So consider the three equations in five unknowns shown below.

$$\left(\begin{array}{ccccc|c} * & * & * & * & * & * \\ * & * & * & * & * & * \\ * & * & * & * & * & * \end{array}\right) \tag{3.2.8}$$

The first step is to create a 1 in the extreme upper left corner, by dividing the first row through by the [1,1] element. We will assume for the moment that the various numbers that we want to divide by are not zero. Later on we will take extensive measures to assure this.

After we divide the first row by the [1,1] element, we use the 1 in the upper left-hand corner to zero out the entries below it in column 1. That is, we let $t = a_{21}$ and then do

$$\text{row}(2) := \text{row}(2) - t * \text{row}(1). \tag{3.2.9}$$

Then let $t = a_{31}$ and do

$$\text{row}(3) := \text{row}(3) - t * \text{row}(1). \tag{3.2.10}$$

The result is that we now have the matrix

$$\left(\begin{array}{ccccc|c} 1 & * & * & * & * & * \\ 0 & * & * & * & * & * \\ 0 & * & * & * & * & * \end{array}\right). \tag{3.2.11}$$
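The row operations that carried (3.2.5) to (3.2.7) can be sketched in a few lines. This is a sketch under the same assumption the text makes for the moment, namely that every number we divide by is nonzero; the function name `reduce_rows` is hypothetical, and the rows are combined as $\text{row}(r) := \text{row}(r) - t * \text{row}(k)$, which differs in sign from the step used to reach (3.2.6) but leads to the same final form.

```python
from fractions import Fraction

def reduce_rows(aug):
    """Reduce an augmented matrix by row operations.
    Assumes every diagonal entry met along the way is nonzero."""
    m = [[Fraction(x) for x in row] for row in aug]
    rows = len(m)
    for k in range(rows):
        if m[k][k] == 0:
            raise ZeroDivisionError("zero pivot; not handled in this sketch")
        m[k] = [x / m[k][k] for x in m[k]]          # create a 1 in position [k, k]
        for r in range(k + 1, rows):                # zero out the entries below it
            t = m[r][k]
            m[r] = [x - t * y for x, y in zip(m[r], m[k])]
    for k in range(rows - 1, -1, -1):               # then clear the entries above each 1
        for r in range(k):
            t = m[r][k]
            m[r] = [x - t * y for x, y in zip(m[r], m[k])]
    return m

result = reduce_rows([[1, 1, 1, 2], [1, -1, -1, 5]])
print([[str(x) for x in row] for row in result])
# [['1', '0', '0', '7/2'], ['0', '1', '1', '-3/2']]
```

Exact fractions are used here only to make the comparison with (3.2.7) unambiguous.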

We pause for a moment to consider what we’ve done. First, to divide a row of the matrix by a number corresponds to dividing an equation through by the same number. Evidently this does not change the solutions of the system of equations. Next, to add a constant multiple of one row to another in the matrix corresponds to adding a multiple of one equation to another, and this also doesn’t affect the solutions. Finally, in terms of the linear mapping that the matrix represents, what we are doing is changing the sets of basis vectors, keeping the mapping fixed, in such a way that the matrix that represents the mapping becomes a bit more acceptable to our taste.

Now in (3.2.11) we divide through the second row by $a_{22}$ (again blissfully assuming that $a_{22}$ is not zero) to create a 1 in the [2,2] position. Then we use that 1 to create zeroes (just one zero in this case) below it in the second column by letting $t = a_{32}$ and doing

$$\text{row}(3) := \text{row}(3) - t * \text{row}(2). \tag{3.2.12}$$

Finally, we divide the third row by the [3,3] element to obtain

$$\left(\begin{array}{ccccc|c} 1 & * & * & * & * & * \\ 0 & 1 & * & * & * & * \\ 0 & 0 & 1 & * & * & * \end{array}\right). \tag{3.2.13}$$

This is the end of the so-called forward solution, the first phase of the process of obtaining the general solution.

Again, let’s think about the system of equations that is represented here. What is special about them is that the first unknown does not appear in the second equation, and neither the first nor the second unknown appears in the third equation.

To finish the solution of such a system of equations we would use the third equation to express $x_3$ in terms of $x_4$ and $x_5$, then the second equation would give us $x_2$ in terms of $x_4$ and $x_5$, and finally the first equation would yield $x_1$, also expressed in terms of $x_4$ and $x_5$. Hence, we would say that $x_4$ and $x_5$ are free, and that the others are determined by them. More precisely, we should say that the kernel of $A$ has a two-dimensional basis.

The second phase of the solution, that we are now beginning, is called the backwards substitution, because we start with the last equation and work backwards.

Let’s see how all of this will look if we were to operate directly on the matrix (3.2.13). First we use the 1 in the [3,3] position to create zeros in the third column above that 1. To do this we let $t = a_{23}$ and then we do

$$\text{row}(2) := \text{row}(2) - t * \text{row}(3). \tag{3.2.14}$$

Then we let $t = a_{13}$ and set

$$\text{row}(1) := \text{row}(1) - t * \text{row}(3) \tag{3.2.15}$$

resulting in

$$\left(\begin{array}{ccccc|c} 1 & * & 0 & * & * & * \\ 0 & 1 & 0 & * & * & * \\ 0 & 0 & 1 & * & * & * \end{array}\right). \tag{3.2.16}$$

Observe that now $x_3$ does not appear in the equations before it. Next we use the 1 in the [2,2] position to create a zero in column 2 above that 1 by letting $t = a_{12}$ and

\[
\overrightarrow{\mathrm{row}}(1) := \overrightarrow{\mathrm{row}}(1) - t * \overrightarrow{\mathrm{row}}(2). \qquad (3.2.17)
\]

Of course, none of our previously constructed zeros gets wrecked by this process, and we have arrived at the reduced echelon form of the original system of equations

\[
\left[\begin{array}{ccccc|c} 1 & 0 & 0 & * & * & * \\ 0 & 1 & 0 & * & * & * \\ 0 & 0 & 1 & * & * & * \end{array}\right]. \qquad (3.2.18)
\]

We need to be more careful about the next step, so it's time to use numbers instead of asterisks. For instance, suppose that we have now arrived at

\[
\left[\begin{array}{ccccc|c} 1 & 0 & 0 & a_{14} & a_{15} & a_{16} \\ 0 & 1 & 0 & a_{24} & a_{25} & a_{26} \\ 0 & 0 & 1 & a_{34} & a_{35} & a_{36} \end{array}\right]. \qquad (3.2.19)
\]

Each of the unknowns is expressible in terms of $x_4$ and $x_5$:

\[
\begin{aligned}
x_1 &= a_{16} - a_{14} x_4 - a_{15} x_5 \\
x_2 &= a_{26} - a_{24} x_4 - a_{25} x_5 \\
x_3 &= a_{36} - a_{34} x_4 - a_{35} x_5 \\
x_4 &= x_4 \\
x_5 &= x_5.
\end{aligned} \qquad (3.2.20)
\]

This means that we have found the general solution of the given system by finding a particular solution and a pair of basis vectors for the kernel of $A$. They are, respectively, the vector and the two columns of the matrix shown below:

\[
\begin{bmatrix} a_{16} \\ a_{26} \\ a_{36} \\ 0 \\ 0 \end{bmatrix},
\qquad
\begin{bmatrix} a_{14} & a_{15} \\ a_{24} & a_{25} \\ a_{34} & a_{35} \\ -1 & 0 \\ 0 & -1 \end{bmatrix}. \qquad (3.2.21)
\]

As this shows, we find a particular solution by filling in extra zeros in the last column until its length matches the number (five in this case) of unknowns. We find a basis matrix (i.e., a matrix whose columns are a basis for the kernel of $A$) by extending the fourth and fifth columns of the reduced row echelon form of $A$ with $-I$, where $I$ is the identity matrix whose size is equal to the nullity of the system, in this case 2.

It's time to deal with the case where one of the ∗'s that we divide by is actually a zero. In fact we will have to discuss rather carefully what we mean by zero. In numerical work on computers, in the presence of rounding errors, it is unreasonable to expect a 0 to be exactly zero. Instead we will set a certain threshold level, and numbers that are smaller than that will be declared to be zero. The hard part will be the determination of the right threshold,

but let's postpone that question for a while, and make the convention that in this and the following sections, when we speak of matrix entries being zero, we will mean that their size is below our current tolerance level.

With that understanding, suppose we are carrying out the row reduction of a certain system, and we've arrived at a stage like this:

\[
\left[\begin{array}{cccccc|c}
1 & * & * & * & \cdots & * & * \\
0 & 1 & * & * & \cdots & * & * \\
0 & 0 & 0 & * & \cdots & * & * \\
0 & 0 & * & * & \cdots & * & * \\
\vdots & \vdots & \vdots & \vdots & & \vdots & \vdots
\end{array}\right] \qquad (3.2.22)
\]

Normally, the next step would be to divide by $a_{33}$, but it is zero. This means that $x_3$ happens not to appear in the third equation. However, $x_3$ might appear in some later equation. If so, we can renumber the equations so that later equation becomes the third equation, and continue the process. In the matrix, this means that we would exchange two rows, so as to bring a nonzero entry into the [3,3] position, and continue.

It is possible, though, that $x_3$ does not appear in any later equation. Then all entries $a_{i3} = 0$ for $i \ge 3$. Then we could ask for some other unknown $x_j$ for $j > 3$ that does appear in some equation later than the third. In the matrix, this amounts to searching through the whole rectangle that lies "southeast" of the [3,3] position, extending over to, but not beyond, the vertical line, to find a nonzero entry, if there is one.

If $a_{ij}$ is such a nonzero entry, then we want next to bring $a_{ij}$ into the pivot position [3,3]. We can do this in two steps. First we exchange rows 3 and $i$ (interchange two equations). Second, exchange columns 3 and $j$ (interchange the numbers of the unknowns, so that $x_3$ becomes $x_j$ and $x_j$ becomes $x_3$). We must remember somehow that we renumbered the unknowns, so we'll be able to recognize the answers when we see them. The calculation can now proceed from the rejuvenated pivot element in the [3,3] position.

Else, it may happen that the rectangle southeast of [3,3] consists entirely of zeros, like this:

\[
\left[\begin{array}{cccccc|c}
1 & * & * & * & \cdots & * & * \\
0 & 1 & * & * & \cdots & * & * \\
0 & 0 & 0 & 0 & \cdots & 0 & * \\
0 & 0 & 0 & 0 & \cdots & 0 & * \\
0 & 0 & 0 & 0 & \cdots & 0 & * \\
\vdots & \vdots & \vdots & \vdots & & \vdots & \vdots
\end{array}\right] \qquad (3.2.23)
\]

What then? We're finished with the forward solution. The equations from the third onwards have only zeros on their left-hand sides. If any solutions at all exist, then those equations had better have only zeros on their right-hand sides also. The last three ∗'s in the last column (and all the entries below them) must all be zeros, or the calculation halts with the announcement that the input system was inconsistent (i.e., has no solutions). If the system is consistent, then of course we ignore the final rows of zeros, and we do the backwards solution just as in the preceding case.
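The consistency test just described, together with the way a particular solution and a kernel basis are read off a reduced matrix such as (3.2.19), can be sketched in a few lines. This is a hedged illustration in Python rather than the book's Maple; the name `general_solution`, the fixed tolerance `tol`, and the assumption that the leading block is already an identity (with no column renumbering to undo) are all choices made only for this example.

```python
def general_solution(M, rank, n, tol=1e-12):
    """Read the general solution off a reduced augmented matrix.

    M    : list of rows, each with n coefficients followed by one right-hand side
    rank : number of leading 1's; M[i][j] is assumed to equal (1 if i == j else 0)
           for i, j < rank, and rows rank, rank+1, ... have zero left-hand sides
    Returns None if the system is inconsistent, otherwise (particular, basis).
    """
    # Rows with zero left-hand sides must have (numerically) zero right-hand sides.
    for row in M[rank:]:
        if abs(row[n]) > tol:
            return None
    # Particular solution: the last column, padded with zeros up to length n.
    particular = [M[i][n] for i in range(rank)] + [0.0] * (n - rank)
    # Kernel basis: each free column, extended by the matching column of -I.
    basis = []
    for j in range(rank, n):
        v = [M[i][j] for i in range(rank)] + [0.0] * (n - rank)
        v[j] = -1.0
        basis.append(v)
    return particular, basis

if __name__ == "__main__":
    M = [[1.0, 0.0, 0.0, 2.0, 3.0, 4.0],
         [0.0, 1.0, 0.0, 5.0, 6.0, 7.0],
         [0.0, 0.0, 1.0, 8.0, 9.0, 10.0]]
    print(general_solution(M, rank=3, n=5))
```

In a real program the fixed tolerance would be replaced by whatever notion of "zero" is in force, and the rows of the output would still have to be rearranged to undo any column interchanges.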

It follows that in all cases, whether there are more equations than unknowns, fewer, or the same number, the backwards solution always begins with a matrix that has a diagonal of 1's stretching from top to bottom (the bottom may have moved up, though!), with only zero entries below the 1's on the diagonal.

Speaking in theoretical, rather than practical terms for a moment, the number of nonzero rows in the coefficient matrix at the end of the forward solution phase is the rank of the matrix. Speaking practically again, this number simply represents a number of rows beyond which we cannot do any more reductions because the matrix entries are all indistinguishable from zero. Perhaps a good name for it is pseudorank. The pseudorank should be thought of then, not as some interesting property of the matrix that we have just computed, but as the number of rows we were able to reduce before roundoff errors became overwhelming.

Exercises 3.2

In problems 1–5, solve each system of equations by transforming the coefficient matrix to reduced echelon form. To "solve" a system means either to show that no solution exists or to find all possible solutions. In the latter case, exhibit a particular solution and a basis for the kernel of the coefficient matrix.

1.
\[
\begin{aligned}
2x - y + z &= 6 \\
3x + y + 2z &= 3 \\
7x - y + 4z &= 15 \\
8x + y + 5z &= 12
\end{aligned} \qquad (3.2.24)
\]

2.
\[
\begin{aligned}
x + y + z + q &= 0 \\
x - y - z - q &= 0
\end{aligned} \qquad (3.2.25)
\]

3.
\[
\begin{aligned}
x + y + z &= 3 \\
3x - y - 2z &= -1 \\
5x + y &= 7
\end{aligned} \qquad (3.2.26)
\]

4.
\[
\begin{aligned}
3x + u + v + w + t &= 1 \\
x - u + 2v + w - 3t &= 2
\end{aligned} \qquad (3.2.27)
\]

5.
\[
\begin{aligned}
x + 3y - z &= 4 \\
2x - y + 2z &= 6 \\
x + y + z &= 6 \\
3x - y - z &= -2
\end{aligned} \qquad (3.2.28)
\]

6. Construct a set of four equations in four unknowns, of rank two.

7. Construct a set of four equations in three unknowns, with a unique solution.

8. Construct a system of homogeneous equations that has a three-dimensional vector space of solutions.

9. Under precisely what conditions is the set of all solutions of the system $Ax = b$ a vector space?

10. Given a set of $m$ vectors in $n$ space, describe an algorithm that will extract a maximal subset of linearly independent vectors.

11. Given a set of $m$ vectors, and one more vector $w$, describe an algorithm that will decide whether or not $w$ is in the vector subspace spanned by the given set.

3.3 Building blocks for the linear equation solver

Let's now try to amplify some of the points raised by the informal discussion of the procedure for solving linear equations, with a view towards the development of a formal algorithm.

First, let's deal with the fact that a diagonal element might be zero (in the fuzzy sense defined in the previous section) at the time when we want to divide by it.

Consider the moment when we are carrying out the forward solution, we have made $i-1$ 1's down the diagonal, all the entries below the 1's are zeros, and next we want to put a 1 into the [i, i] position and use that 1 as a pivot to reduce the entries below it to zeros.

Previously we had said that this could be done by dividing through the $i$th row by the [i, i] element, unless that element is zero, in which case we carry out a search for some nonzero element in the rectangle that lies southeast of [i, i] in the matrix. After careful analysis, it turns out that an even more conservative approach is better: the best procedure consists in searching through the entire southeast rectangle, whether or not the [i, i] element is zero, to find the largest matrix element in absolute value.

If this complete search is done every time, whether or not the [i, i] element is zero, then it develops that the sizes of the matrix elements do not grow very much as the algorithm unfolds, and the growth of the numerical errors is also kept to a minimum.

If that largest element found in the rectangle is, say, $a_{uv}$, then we bring $a_{uv}$ into the pivot position [i, i] by interchanging rows $u$ and $i$ (renumbering the equations) and interchanging columns $v$ and $i$ (renumbering the unknowns). Then we proceed as before.

It may seem wasteful to make a policy of carrying out a complete search of the rectangle whenever we are ready to find the next pivot element, and especially even if a nonzero element already occupies the pivot position anyway, without a search, but it turns out that the extra labor is well rewarded with optimum numerical accuracy and stability.

If we are solving $m$ equations in $n$ unknowns, then we need to carry along an extra array of length $n$. Let's call it $\tau_j$, $j = 1, \ldots, n$. This array will keep a record of the column interchanges that we do as we do them, so that in the end we will be able to identify the output. At all times then, $\tau_j$ will hold the number of the column where the current $j$th column really belongs. Initially, we put $\tau_j = j$ for $j = 1, \ldots, n$. If at a certain moment we are about to interchange, say, the $p$th column and the $q$th column, then we will also interchange the entries $\tau_p$ and $\tau_q$.
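The complete search and the bookkeeping in $\tau$ can be illustrated as follows. This is a minimal sketch in Python, not the book's Maple code: indices are 0-based (so `tau` starts out as `[0, 1, ..., n-1]` instead of $\tau_j = j$), and the names `search_rectangle` and `bring_to_pivot` are hypothetical.

```python
def search_rectangle(C, i, m, n):
    """Locate the entry of largest absolute value in rows i..m-1, columns i..n-1.

    The right-hand side columns (index n and beyond) are deliberately not searched.
    """
    best = (i, i)
    for u in range(i, m):
        for v in range(i, n):
            if abs(C[u][v]) > abs(C[best[0]][best[1]]):
                best = (u, v)
    return best

def bring_to_pivot(C, tau, i, m, n):
    """Move the largest entry of the southeast rectangle into position [i][i]."""
    u, v = search_rectangle(C, i, m, n)
    # Interchange rows u and i: the equations are merely listed in another order.
    C[i], C[u] = C[u], C[i]
    # Interchange columns v and i: this renumbers two unknowns ...
    for row in C:
        row[i], row[v] = row[v], row[i]
    # ... so the renumbering is recorded in tau.
    tau[i], tau[v] = tau[v], tau[i]
    return C[i][i]

if __name__ == "__main__":
    # Two equations in three unknowns; the last column is the right-hand side.
    C = [[1.0, 2.0, 3.0, 6.0],
         [4.0, -9.0, 5.0, 0.0]]
    tau = [0, 1, 2]
    print(bring_to_pivot(C, tau, 0, 2, 3), C, tau)
```

Here `tau[j]` names the original unknown that currently sits in column `j`, which is exactly the information needed to relabel the answers at output time.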

It must be noted that there is a fundamental difference between the interchange of rows and the interchange of columns. An interchange of two rows corresponds simply to listing the equations that we are trying to solve in a slightly different sequence, but has no effect on the solutions. On the other hand, an interchange of two columns amounts to renumbering two of the unknowns. Hence we must keep track of the column interchanges while we are doing them, so we'll be able to tell which unknown is which at output time, but we don't need to record row interchanges.

At the end of the calculation then, the output arrays will have to be shuffled. The reader might want to think about how to carry out that rearrangement, and we will return to it in section 3.6 under the heading of "to unscramble the eggs".

The next item to consider is that we would like our program to be able to solve not just one system $Ax = b$, but several systems of simultaneous equations, each of the form $Ax = b$, where the left-hand sides are all the same, but the right-hand sides are different. The data for our program will therefore be an $m$ by $n + p$ matrix whose first $n$ columns will contain the coefficient matrix $A$ and whose last $p$ columns will be the $p$ different right-hand side vectors $b$.

Why are we allowing several different right sides? Some of the main customers for our program will be matrices $A$ whose inverses we want to calculate. To find, say, the first column of the inverse of $A$ we want to solve $Ax = b$, where $b$ is the first column of the identity matrix. For the second column of the inverse, $b$ would be the second column of the identity matrix, and so on. Hence, to find $A^{-1}$, if $A$ is an $n \times n$ matrix, we must solve $n$ systems of simultaneous equations each having the same left-hand side $A$, but with $n$ different right-hand side vectors.

It is convenient to solve all $n$ of these systems at once because the reduction that we apply to $A$ itself to bring it into reduced echelon form is useful in solving all $n$ of these systems, and we avoid having to repeat that part of the job $n$ times. Thus, for matrix inversion, and for other purposes too, it is very handy to have the capability of solving several systems with a common left-hand side at once.

The next point concerns the linear array $\tau$ that we are going to use to keep track of the column interchanges. Instead of storing it in its own private array, it's easier to adjoin it to the matrix $A$ that we're working on, as an extra row, for then when we interchange two columns we will automatically interchange the corresponding elements of $\tau$, and thereby avoid separate programming.

This means that the full matrix that we will be working with in our program will be $(m + 1) \times (n + p)$ if we are solving $m$ equations in $n$ unknowns with $p$ right-hand sides. In the program itself, let's call this matrix $C$. So $C$ will be thought of as being partitioned into blocks of sizes as shown below:

\[
C = \left[\begin{array}{c|c} A:\ m \times n & \mathrm{RHS}:\ m \times p \\ \hline \tau:\ 1 \times n & 0:\ 1 \times p \end{array}\right]. \qquad (3.3.1)
\]

Now a good way to begin the writing of a program such as the general-purpose matrix

analysis program that we now have in mind is to consider the different procedures, or modules, into which it may be broken up. We suggest that the individual blocks that we are about to discuss should be written as separate subroutines, each with its own documentation, each with its own clearly defined input and output, and each with its own local variable names. They should then be tested one at a time, by giving them small, suitable test problems. If this is done, then the main routine won't be much more than a string of calls to the various blocks.

1. Procedure searchmat(C,r,s,i1,j1,i2,j2)

This routine is given an $r \times s$ array $C$, and two positions in the matrix, say $[i_1, j_1]$ and $[i_2, j_2]$. It then carries out a search of the rectangular submatrix of $C$ whose northwest corner is at $[i_1, j_1]$ and whose southeast corner is at $[i_2, j_2]$, inclusive, in order to find an element of largest absolute value that lives in that rectangle. The subroutine returns this element of largest magnitude, as big, and the row and column in which it lives, as iloc, jloc.

Subroutine searchmat will be called in at least two different places in the main routine. First, it will do the search for the next pivot element in the southeast rectangle. Second, it can be used to determine if the equations are consistent by searching the right-hand sides of equations $r + 1, \ldots, m$ ($r$ is the pseudorank) to see if they are all zero (i.e., below our tolerance level).

2. Procedure switchrow(C,r,s,i,j,k,l)

The program is given an $r \times s$ matrix $C$, and four integers $i$, $j$, $k$ and $l$. The subroutine interchanges rows $i$ and $j$ of $C$, between columns $k$ and $l$ inclusive, and returns a variable called sign with a value of $-1$, unless $i = j$, in which case it does nothing to $C$ and returns a $+1$ in sign.

3. Procedure switchcol(C,r,s,i,j,k,l)

This subroutine is like the previous one, except it interchanges columns $i$ and $j$ of $C$, between rows $k$ and $l$ inclusive. It also returns a variable called sign with a value of $-1$, unless $i = j$, in which case it does nothing to $C$ and returns a $+1$ in sign.

The subroutines switchrow and switchcol are used during the forward solution in the obvious way, and again after the back solution has been done, to unscramble the output (see procedure scramb below).

4. Procedure pivot(C,r,s,i,k,u)

Given the $r \times s$ matrix $C$, and three integers $i$, $k$ and $u$, the subroutine assumes that $C_{ii} = 1$. It then stores $C_{ki}$ in the local variable $t$, sets $C_{ki}$ to zero, and reduces row $k$ of $C$, in columns $u$ to $s$, by doing the operation $C_{kq} := C_{kq} - t * C_{iq}$ for $q = u, \ldots, s$.

The use of the parameter $u$ in this subroutine allows the flexibility for economical operation in both the forward and back solution. In the forward solution, we take $u = i + 1$ and it reduces the whole row $k$. In the back solution we use $u = n + 1$, because the rest of row $k$ will have already been reduced.

5. Procedure scale(C,r,s,i,u)

Given an $r \times s$ matrix $C$ and integers $i$ and $u$, the routine stores $C_{ii}$ in a variable called piv. It then does $C_{ij} := C_{ij}/\mathrm{piv}$ for $j = u, \ldots, s$ and returns the value of piv.

6. Procedure scramb(C,r,s,n)

This procedure permutes the first $n$ rows of the $r \times s$ matrix $C$ according to the permutation that occupies the positions $C_{r1}, C_{r2}, \ldots, C_{rn}$ on input. The use of this subroutine is explained in detail in section 3.6 (q.v.). Its purpose is to rearrange the rows of the output matrix that holds a basis for the kernel, and also the rows of the output matrix that holds particular solutions of the given system(s). After rearrangement the rows will correspond to the original numbering of the unknowns, thereby compensating for the renumbering that was induced by column interchanges during the forward solution. This subroutine poses some interesting questions if we require that it should not use any additional array space beyond the input matrix itself.

7. Procedure ident(C,r,s,i,j,n,q)

This procedure inserts $q$ times the $n \times n$ identity matrix into the $n \times n$ submatrix whose Northwest corner is at position [i, j] of the $r \times s$ matrix $C$.

Now let's look at the assembly of these building blocks into a complete matrix analysis procedure called matalg(C,r,s,m,n,p,opt,eps). Input items to it are:

• An $r \times s$ matrix $C$ (as well as the values of $r$ and $s$), whose Northwest $m \times n$ submatrix contains the matrix of coefficients of the system(s) of equations that are about to be solved. The values of $m$ and $n$ must also be provided to the procedure. It is assumed that $r = 1 + \max(m, n)$. Unless the inverse of the coefficient matrix is wanted, the Northeast $m \times p$ submatrix of $C$ holds $p$ different right-hand side vectors for which we want solutions.

• The numbers $r$, $s$, $m$, $n$ and $p$.

• A parameter option that is equal to 1 if we want an inverse, equal to 2 if we want to see the determinant of the coefficient matrix (if square) as well as a basis for the kernel (if it is nontrivial) and a set of $p$ particular solution vectors.

• A real parameter eps that is used to bound roundoff error.

Output items from the procedure matalg are:

• The pseudorank $r$

• The determinant det if $m = n$

• An $n \times r$ matrix basis, whose columns are a basis for the kernel of the coefficient matrix.

• An $n \times p$ matrix partic, whose columns are particular solution vectors for the given systems.

In case opt = 1 is chosen, the procedure will fill the last $m$ columns and rows of $C$ with an $m \times m$ identity matrix, set $p = n = m$, and proceed as before, leaving the inverse matrix in the same place, on output.

Let's remark on how the determinant is calculated. The reduction of the input matrix to echelon form in the forward solution phase entails the use of three kinds of operations. First we divide a row by a pivot element. Second, we multiply a row by a number and add it to another row. Third, we exchange a pair of rows or columns.

The first operation divides the determinant by that same pivot element. The second has no effect on the determinant. The third changes the sign of the determinant, at any rate if the rows or columns are distinct.

At the end of the forward solution the matrix is upper triangular, with 1's on the diagonal, hence its determinant is clearly 1. What must have been the value of the determinant of the input matrix? Clearly it must have been equal to the product of all the pivot elements that were used during the reduction, together with a plus or minus sign from the row or column interchanges.

Hence, to compute the determinant, we begin by setting det to 1. Then, each time a new pivot element is selected, we multiply det by it. Finally, whenever a pair of different rows or columns are interchanged we reverse the sign of det. Then det holds the determinant of the input matrix when the forward solution phase has ended.

Now we have described the basic modules out of which a general purpose program for linear equations can be constructed. In the next section we are going to discuss the vexing question of roundoff error and how to set the tolerance level below which entries are declared to be zero. A complete formal algorithm that ties together all of these modules, with control of rounding error, is given at the end of the next section.

Exercises 3.3

1. Make a test problem for the major program that you're writing by tracing through a solution the way the computer would: Take one of the systems that appears at the end of section 3.2. Transform it to reduced row echelon form step by step, being sure to carry out a complete search of the Southeast rectangle each time, and to interchange rows and columns to bring the largest element found into the pivot position. Record the column interchanges in $\tau$, as described above. Record the status of the matrix $C$ after each major loop so you'll be able to test your program thoroughly and easily.

2. Repeat problem 1 on a system of each major type: inconsistent, unique solution, many solutions.

3. Construct a formal algorithm that will invert a matrix, using no more array space than the matrix itself. The idea is that the input matrix is transformed, a column at a time, into the identity matrix, and the identity matrix is transformed, a column at a time, into the inverse. Why store all of the extra columns of the identity matrix? (Good luck!)

4. Show that the operation $\overrightarrow{\mathrm{row}}(i) := c * \overrightarrow{\mathrm{row}}(j) + \overrightarrow{\mathrm{row}}(i)$ applied to a matrix $A$ has the same effect as first applying that same operation to the identity matrix $I$ to get a certain matrix $E$, and then computing $EA$.

5. Show that the operation of scaling $\overrightarrow{\mathrm{row}}(i)$:

\[
a_{ik} := \frac{a_{ik}}{a_{ii}}, \qquad k = 1, \ldots, n \qquad (3.3.2)
\]

has the same effect as first dividing the $i$th row of the identity matrix by $a_{ii}$ to get a certain matrix $E$, and then computing $EA$.

6. Suppose we do a complete forward solution without ever searching or interchanging rows or columns. Show that the forward solution amounts to discovering a lower triangular matrix $L$ and an upper triangular matrix $U$ such that $LA = U$ (think of $L$ as a product of several matrices $E$ such as you found in the preceding two problems).

7. Show that a matrix $A$ is of rank one if and only if its entries are of the form $A_{ij} = f_i g_j$ for all $i$ and $j$.

3.4 How big is zero?

The story of the linear algebra subroutine has just two pieces untold: the first concerns how small we will allow a number to be without calling it zero, and the second concerns the rearrangement of the output to compensate for interchanges of rows and columns that are done during the row-echelon reduction.

The main reduction loop begins with a search of the rectangle that lies Southeast of the pivot position [i, i], in order to locate the largest element that lives there and to use it for the next pivot. If that element is zero, the forward solution halts because the remaining pivot candidates are all zero.

But "how zero" do they have to be? Certainly it would be too much to insist, when working with sixteen decimal digits, that a number should be exactly equal to zero. A little more natural would be to declare that any number that is no larger than the size of the accumulated roundoff error in the calculation should be declared to be zero, since our microscope lens would then be too clouded to tell the difference.

It is important that we should know how large roundoff error is, or might be. Indeed, if we set too small a threshold, then numbers that "really are" zero will slip through, the calculation will continue after it should have terminated because of unreliability of the computed entries, and so forth. If the threshold is too large, we will declare numbers to be zero that aren't, and our numerical solution will terminate too quickly because the computed matrix elements will be declared to be unreliable when really they are perfectly OK.

The phenomenon of roundoff error occurs because of the finite size of a computer word. If a word consists of $d$ binary digits, then when two $d$-digit binary numbers are multiplied

together, the answer that should be $2d$ bits long gets rounded off to $d$ bits when it is stored. By doing so we incur a rounding error whose size is at most 1 unit in the $(d + 1)$st place. Then we proceed to add that answer to other numbers with errors in them, and to multiply, divide, and so forth, some large number of times. The accumulation of all of this rounding error can be quite significant in an extended computation, particularly when a good deal of cancellation occurs from subtraction of nearly equal quantities.

The question is to determine the level of rounding error that is present, while the calculation is proceeding. Then, when we arrive at a stage where the numbers of interest are about the same size as the rounding errors that may be present in them, we had better halt the calculation.

How can we estimate, during the course of a calculation, the size of the accumulated roundoff error? There are a number of theoretical a priori estimates for this error, but in any given computation these would tend to be overly conservative, and we would usually terminate the calculation too soon, thinking that the errors were worse than they actually were.

We prefer to let the computer estimate the error for us while it's doing the calculation. True, it will have to do more work, but we would rather have it work a little harder if the result will be that we get more reliable answers.

Here is a proposal for estimating the accumulated rounding error during the progress of a computation. This method was suggested by Professors Nijenhuis and Wilf. We carry along an additional matrix of the same size as the matrix $C$, the one that has the coefficients and right-hand sides of the equations that we are solving. In this extra matrix we are going to keep estimates of the roundoff error in each of the elements of the matrix $C$.

In other words, we are going to keep two whole matrices, one of which will contain the coefficients and the right-hand sides of the equations, and the other of which will contain estimates of the roundoff error that is present in the elements of the first one.

At any time during the calculation that we want to know how reliable a certain matrix entry is, we'll need only to look at the corresponding entry of the error matrix to find out. Let's call this auxiliary matrix $R$ (as in roundoff).

Initially an element $R_{ij}$ might be as large as $2^{-d}|C_{ij}|$ in magnitude, and of either sign. Therefore, to initialize the $R$ matrix we choose a number uniformly at random in the interval $[-|2^{-d} C_{ij}|, |2^{-d} C_{ij}|]$, and store it in $R_{ij}$ for each $i$ and $j$. Hence, to begin with, the matrix $R$ is set to randomly chosen values in the range in which the actual roundoff errors lie.

Then, as the calculation unfolds, we do arithmetic on the matrix $C$ of two kinds. We either scale a row by dividing it through by the pivot element, or we pivot a row against another row. In each case let's look at the effect that the operation has on the corresponding roundoff error estimator in the $R$ matrix.

In the first case, consider a scaling operation, in which a certain row is divided by the pivot element. Specifically, suppose we are dividing row $i$ through by the element $C_{ii}$, and

let $R_{ii}$ be the corresponding entry of the error matrix. Then, in view of the fact that

\[
\frac{C_{ij} + R_{ij}}{C_{ii} + R_{ii}} = \frac{C_{ij}}{C_{ii}} + \frac{R_{ij}}{C_{ii}} - \frac{R_{ii} C_{ij}}{C_{ii}^2} + \text{terms involving products of two or more errors} \qquad (3.4.1)
\]

we see that the error entries $R_{ij}$ in the row that is being divided through by $C_{ii}$ should be computed as

\[
R_{ij} := \frac{R_{ij}}{C_{ii}} - \frac{R_{ii} C_{ij}}{C_{ii}^2}. \qquad (3.4.2)
\]

In the second case, suppose we are doing a pivoting operation on the $k$th row. Then for each column $q$ we do the operation $C_{kq} := C_{kq} - t * C_{iq}$, where $t = C_{ki}$. Now let's replace $C_{kq}$ by $C_{kq} + R_{kq}$, replace $C_{iq}$ by $C_{iq} + R_{iq}$ and replace $t$ by $t + t'$ (where $t' = R_{ki}$). Then substitute these expressions into the pivot operation above, and keep terms that are of first order in the errors (i.e., that do not involve products of two of the errors). Then $C_{kq} + R_{kq}$ is replaced by

\[
\begin{aligned}
C_{kq} + R_{kq} - (t + t') * (C_{iq} + R_{iq}) &= (C_{kq} - t * C_{iq}) + (R_{kq} - t * R_{iq} - t' * C_{iq}) \\
&= (\text{new } C_{kq}) + (\text{new error } R_{kq}).
\end{aligned} \qquad (3.4.3)
\]

It follows that as a result of the pivoting, the error estimator is updated as follows:

\[
R_{kq} := R_{kq} - C_{ki} * R_{iq} - R_{ki} * C_{iq}. \qquad (3.4.4)
\]

Equations (3.4.2) and (3.4.4) completely describe the evolution of the $R$ matrix. It begins life as random roundoff error, it gets modified along with the matrix elements whose errors are being estimated, and in return, we are supplied with good error estimates of each entry while the calculation proceeds.

Before each scaling and pivoting sequence we will need to update the $R$ matrix as described above. Then, when we search the now-famous Southeast rectangle for the new pivot element we accept it if it is larger in absolute value than its corresponding roundoff estimator, and otherwise we declare the rectangle to be identically zero and halt the forward solution.

The $R$ matrix is also used to check the consistency of the input system. At the end of the forward solution all rows of the coefficient matrix from a certain row onwards are filled with zeros, in the sense that the entries are below the level of their corresponding roundoff estimator. Then the corresponding right-hand side vector entries should also be zero in the same sense, else as far as the algorithm can tell, the input system was inconsistent. With typical ambiguity of course, this means either that the input system was "really" inconsistent, or just that rounding errors have built up so severely that we cannot decide on consistency, and continuation of the "solution" would be meaningless.

Algorithm matalg(C,r,s,m,n,p,opt,eps). The algorithm operates on the matrix $C$, which is of dimension $r \times s$, where $r = \max(m, n) + 1$. It solves $p$ systems of $m$ equations in $n$ unknowns, unless opt = 1, in which case it will calculate the inverse of the matrix in the first $m = n$ rows and columns of $C$.
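Before the full listing, the two update rules (3.4.2) and (3.4.4) can be looked at in isolation. The sketch below is an illustration in Python, not the book's Maple routines scaler and pivotr; it folds the error update and the arithmetic on $C$ into one function each, uses 0-based indices, and its function names are invented here. Zeroing `R[k][i]` after a pivot is an assumption of this sketch.

```python
def scale_with_error(C, R, i, u):
    """Divide row i of C by its pivot C[i][i] in columns u.., updating R by (3.4.2)."""
    piv, piv_err = C[i][i], R[i][i]      # keep the old pivot and its error estimate
    for j in range(u, len(C[i])):
        R[i][j] = R[i][j] / piv - piv_err * C[i][j] / piv ** 2
        C[i][j] = C[i][j] / piv
    return piv

def pivot_with_error(C, R, i, k, u):
    """Reduce row k against row i in columns u.., updating R by (3.4.4)."""
    t, t_err = C[k][i], R[k][i]          # t = C_ki and t' = R_ki
    for q in range(u, len(C[k])):
        R[k][q] = R[k][q] - t * R[i][q] - t_err * C[i][q]
        C[k][q] = C[k][q] - t * C[i][q]
    C[k][i] = 0.0
    R[k][i] = 0.0                        # assumption: the eliminated entry carries no error

if __name__ == "__main__":
    C = [[2.0, 4.0, 6.0], [1.0, 3.0, 5.0]]
    R = [[1e-6, -2e-6, 3e-6], [-1e-6, 2e-6, 1e-6]]
    scale_with_error(C, R, 0, 0)
    pivot_with_error(C, R, 0, 1, 1)
    print(C)
    print(R)
```

One way to gain confidence in such a sketch is to run the same operations on the perturbed matrix $C + R$ with a zero error matrix and confirm that the result agrees with the updated $C + R$ up to terms of second order in the errors.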

matalg:=proc(C,r,s,m,n,p,opt,eps)
local R,i,j,Det,Done,ii,jj,Z,k,psrank;
# if opt = 1 that means inverse is expected
if opt=1 then ident(C,r,s,1,n+1,m,1) fi;
# initialize random error matrix
R:=matrix(r,s,(i,j)->0.000000000001*(rand()-500000000000)*eps*C[i,j]);
# set row permutation to the identity
for j from 1 to n do C[r,j]:=j od;
# begin forward solution
i:=0; Det:=1; Done:=false;
while ((i<m) and not(Done))do
  # find largest in SE rectangle
  Z:=searchmat(C,r,s,i+1,i+1,m,n);
  ii:=Z[1][1]; jj:=Z[1][2];
  if abs(Z[2])>abs(R[ii,jj]) then
    i:=i+1;
    # switch rows
    Det:=Det*switchrow(C,r,s,i,ii,i,s);
    Z:=switchrow(R,r,s,i,ii,i,s);
    # switch columns
    Det:=Det*switchcol(C,r,s,i,jj,1,r);
    Z:=switchcol(R,r,s,i,jj,1,r);
    # divide by pivot element
    Z:=scaler(C,R,r,s,i,i);
    Det:=Det*scale(C,r,s,i,i);
    for k from i+1 to m do
      # reduce row k against row i
      Z:=pivotr(C,R,r,s,i,k,i+1);
      Z:=pivot(C,r,s,i,k,i+1);
    od;
  else Done:=true fi;
od;
psrank:=i;
# end forward solution; begin consistency check
if psrank<m then
  Det:=0;
  for j from 1 to p do
    # check that right hand sides are 0 for i>psrank
    Z:=searchmat(C,r,s,psrank+1,n+j,m,n+j);
    if abs(Z[2])>abs(R[Z[1][1],Z[1][2]]) then
      printf("Right hand side %d is inconsistent",j);
      return;
    fi;
  od;
fi;
# equations are consistent, do back solution
for j from psrank to 2 by -1 do
  for i from 1 to j-1 do
    Z:=pivotr(C,R,r,s,j,i,psrank+1);
    Z:=pivot(C,r,s,j,i,psrank+1);
    C[i,j]:=0; R[i,j]:=0;
  od;
od;
# end back solution, insert minus identity in basis
if psrank<n then
  # fill bottom of basis matrix with -I
  Z:=ident(C,r,s,psrank+1,psrank+1,n-psrank,-1);
  # fill under right-hand sides with zeroes
  for i from psrank+1 to n do for j from n+1 to s do C[i,j]:=0 od od;
  # fill under R matrix with zeroes
  for i from psrank+1 to n do for j from n-psrank to s do R[i,j]:=0 od od;
fi;
# permute rows prior to output
Z:=scramb(C,r,s,n);
# copy row r of C to row r of R
for j from 1 to n do R[r,j]:=C[r,j] od;
Z:=scramb(R,r,s,n);
return(Det,psrank,evalm(R));
end;

If the procedure terminates successfully, it returns a list containing three items: the first is the determinant (if there is one), the second is the pseudorank of the coefficient matrix, and the third is the matrix of estimated roundoff errors. The matrix C (which is called by name in the procedure, which means that the input matrix is altered by the procedure) will contain a basis for the kernel of the coefficient matrix in columns psrank + 1 to n, and p particular solution vectors, one for each input right-hand side, in columns n + 1 to n + p.

Two new sub-procedures are called by this procedure, namely scaler and pivotr. These are called immediately before the action of scale or pivot, respectively, and their mission is to update the R matrix in accordance with equations (3.4.2) or (3.4.4) to take account of the impending scaling or pivoting operation.

Exercises 3.4

1. Break off from the complete algorithm above, the forward solution process. State it formally as algorithm forwd, list its global variables, and describe precisely its effect on them. Do the same for the backwards solution.

2. When the program runs, it gives the solutions and their roundoff error estimates. Work out an elegant way to print the answers and the error estimates. For instance, there's no point in giving 12 digits of roundoff error estimate. That's too much. Just print the number of digits of the answers that can be trusted. How would you do that? Write subroutine prnt that will carry it out.


3. Suppose you want to re-run a problem, with a different set of random numbers in the roundoff matrix initialization. How would you do that? Run one problem three or four times to see how sensitive the roundoff estimates are to the choice of random values that start them off.

4. Show your program to a person who is knowledgeable in programming, but who is not one of your classmates. Ask that person to use your program to solve some set of three simultaneous equations in five unknowns. Do not answer any questions verbally about how the program works or how to use it. Refer all such questions to your written documentation that accompanies the program. If the other person is able to run your program and understand the answers, award yourself a gold medal in documentation. Otherwise, improve your documentation and let the person try again. When successful, try again on someone else.

5. Select two vectors f, g of length 10 by choosing their elements at random. Form the 10 × 10 matrix of rank 1 whose elements are f_i g_j. Do this three times and add the resulting matrices to get a single 10 × 10 matrix of rank three. Run your program on the coefficient matrix you just constructed, in order to see if the program is smart enough to recognize a matrix of rank three when it sees one, by halting the forward solution with pseudorank = 3. Repeat the above experiment 50 times, and tabulate the frequencies with which your program “thought” that the 10 × 10 matrix had various ranks.
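The test matrix of exercise 5 can be built in a few lines. The following is only a sketch of the construction, written in Python rather than Maple; the function name `random_rank_three`, the choice of entries uniform in [-1, 1], and the seed are assumptions made for illustration, not part of the exercise.

```python
import random

def random_rank_three(n=10, seed=None):
    """Sum of three random outer products f g^T, so the rank is at most three."""
    rng = random.Random(seed)
    total = [[0.0] * n for _ in range(n)]
    for _ in range(3):
        f = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        g = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        for i in range(n):
            for j in range(n):
                total[i][j] += f[i] * g[j]   # element f_i g_j of one rank-1 term
    return total

A = random_rank_three(10, seed=1)
```

In exact arithmetic this matrix has rank three (for generic random vectors); in floating point, the interesting question posed by the exercise is whether the pseudorank test still reports 3.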

3.5 Operation count

With any numerical algorithm it is important to know how much work is involved in carrying it out. In this section we are going to estimate the labor involved in solving linear systems by the method of the previous sections. Let’s recognize two kinds of labor: arithmetic operations, and other operations, both as applied to elements of the matrix. The arithmetic operations are +, −, ×, ÷, all lumped together, and by other operations we mean comparisons of size, movement of data, and other operations performed directly on the matrix elements. Of course there are many “other operations,” not involving the matrix elements directly, such as augmenting counters, testing for completion of loops, etc., that go on during the reduction of the matrix, but the two categories above represent a good measure of the work done. We’re not going to include the management of the roundoﬀ error matrix R in our estimates, because its eﬀect would be simply to double the labor involved. Hence, remember to double all of the estimates of the labor that we are about to derive if you’re using the R matrix. We consider a generic stage in the forward solution where we have been given m equations in n unknowns with p right-hand sides, and during the forward solution phase we have just arrived at the [i, i] element.


The next thing to do is to search the Southeast rectangle for the largest element. The rectangle contains about (m − i)(n − i) elements, hence the search requires that many comparisons. Then we exchange two rows (n + p − i operations), exchange two columns (m operations) and divide a row by the pivot element (n + p − i arithmetic operations). Next, for each of the m − i − 1 rows below the ith, and for each of the n + p − i elements of one of those rows, we do two arithmetic operations when we do the elementary row operation that produces a zero in the ith column. This requires, therefore, 2(n + p − i)(m − i − 1) arithmetic operations. For the forward phase of the solution, therefore, we count

$$A_f = \sum_{i=1}^{r} \bigl\{ 2(n+p-i)(m-i-1) + (n+p-i) \bigr\} \tag{3.5.1}$$

arithmetic operations altogether, where r is the pseudorank of the matrix, because the forward solution halts after the rth row with only zeros below. The non-arithmetic operations in the forward phase amount to

$$N_f = \sum_{i=1}^{r} \bigl\{ (m-i)(n-i) + (n+p-i) + m \bigr\}. \tag{3.5.2}$$

Let’s leave these sums for a while, and go to the backwards phase of the solution. We do the columns in reverse order, from column r back to 1, and when we have arrived at a generic column j, we want to create zeroes in all of the positions above the 1 in the [j, j] position. To do this we perform the elementary row operation

$$\overrightarrow{\mathrm{row}(i)} := \overrightarrow{\mathrm{row}(i)} - A_{ij} * \overrightarrow{\mathrm{row}(j)} \tag{3.5.3}$$

to each of the j − 1 rows above row j. Let i be the number of one of these rows. Then, exactly how many elements of $\overrightarrow{\mathrm{row}(i)}$ are acted upon by the elementary row operation above? Certainly the elements in $\overrightarrow{\mathrm{row}(i)}$ that lie in columns 1 through j − 1 are unaffected, because only zero elements are in $\overrightarrow{\mathrm{row}(j)}$ below them, thanks to the forward reduction process.

Furthermore, the elements in $\overrightarrow{\mathrm{row}(i)}$ that lie in columns j + 1 through r are unaffected, for a different reason. Indeed, any such entry is already zero, because it lies above an entry of 1 in some diagonal position that has already had its turn in the back solution (remember that we’re doing the columns in the sequence r, r − 1, . . . , 1). Not only is such an entry zero, but it remains zero, because the entry of $\overrightarrow{\mathrm{row}(j)}$ below it is also zero, having previously been deleted by the action of a diagonal element below it.

Hence in $\overrightarrow{\mathrm{row}(i)}$, the elements that are affected by the elementary row operation (3.5.3) are those that lie in columns j, r + 1, . . . , n, n + 1, . . . , n + p (be sure to write [or modify!] the program so that the row reduction (3.5.3) acts only on those columns!). We have now shown that exactly n + p − r + 1 entries of each row above $\overrightarrow{\mathrm{row}(j)}$ are affected (note that the number is independent of j and i), so

$$A_b = \sum_{j=1}^{r} (n+p-r+1)(j-1) \tag{3.5.4}$$

arithmetic operations are done during the back solution, and no other operations. It remains only to do the various sums, and for this purpose we recall that

$$\sum_{i=1}^{N} i = \frac{N(N+1)}{2}, \qquad \sum_{i=1}^{N} i^2 = \frac{N(N+1)(2N+1)}{6}. \tag{3.5.5}$$

Then it is straightforward to find the total number of arithmetic operations from $A_f + A_b$ as

$$\mathrm{Arith}(m,n,p,r) = \frac{r^3}{6} - \frac{r^2}{2}\,(2m+n+p-5) + \Bigl( (n+p)\bigl(2m - \tfrac{5}{2}\bigr) - m + \tfrac{1}{3} \Bigr)\, r \tag{3.5.6}$$

and the total of the non-arithmetic operations from $N_f$ as

$$\mathrm{NonArith}(m,n,p,r) = \frac{r^3}{3} - \frac{r^2}{2}\,(m+n) + \frac{r}{6}\,(6mn + 3m + 3n + 6p - 2). \tag{3.5.7}$$
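The closed forms can be checked mechanically against the sums they came from. The sketch below, in Python with exact rational arithmetic, evaluates the sums (3.5.1), (3.5.2) and (3.5.4) term by term and compares them with the expressions for Arith and NonArith; the function names are illustrative choices and are not part of the Maple program of the text.

```python
from fractions import Fraction

def arith_by_summation(m, n, p, r):
    """A_f + A_b, summed term by term from (3.5.1) and (3.5.4)."""
    forward = sum(2 * (n + p - i) * (m - i - 1) + (n + p - i)
                  for i in range(1, r + 1))
    backward = sum((n + p - r + 1) * (j - 1) for j in range(1, r + 1))
    return forward + backward

def arith_closed_form(m, n, p, r):
    """Closed form (3.5.6), evaluated exactly."""
    r = Fraction(r)
    return (r**3 / 6
            - r**2 * (2 * m + n + p - 5) / 2
            + ((n + p) * Fraction(4 * m - 5, 2) - m + Fraction(1, 3)) * r)

def nonarith_by_summation(m, n, p, r):
    """N_f, summed term by term from (3.5.2)."""
    return sum((m - i) * (n - i) + (n + p - i) + m for i in range(1, r + 1))

def nonarith_closed_form(m, n, p, r):
    """Closed form (3.5.7), evaluated exactly."""
    r = Fraction(r)
    return r**3 / 3 - r**2 * (m + n) / 2 + r * (6 * m * n + 3 * m + 3 * n + 6 * p - 2) / 6

# Compare on a grid of small problem sizes with r <= min(m, n).
agree = all(
    arith_by_summation(m, n, p, r) == arith_closed_form(m, n, p, r)
    and nonarith_by_summation(m, n, p, r) == nonarith_closed_form(m, n, p, r)
    for m in range(1, 7) for n in range(1, 7)
    for p in range(1, 4) for r in range(1, min(m, n) + 1)
)
```

Because both sides are polynomials in r, agreement on such a grid is strong evidence that the algebra was done correctly, although it is of course not a proof.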

Let’s look at a few important special cases. First, suppose we are solving one system of n equations in n unknowns that has a unique solution. Then we have m = n = r and p = 1. We find that

$$\mathrm{Arith}(n,n,1,n) = \tfrac{2}{3}\, n^3 + O(n^2) \tag{3.5.8}$$

where $O(n^2)$ refers to some function of n that is bounded by a constant times $n^2$ as n grows large. Similarly, for the non-arithmetic operations on matrix elements we find $\tfrac{1}{3} n^3 + O(n^2)$ in this case.

It follows that a system of n equations can be solved for about one third of the price, in terms of arithmetic operations, of one matrix multiplication, at least if matrices are multiplied in the usual way (did you know that there is a faster way to multiply two matrices? We will see one later on).

Now what is the price of a matrix inversion by this method? In that case we are solving n systems of n equations in n unknowns, all with the same left-hand side. Hence we have r = m = n = p, and we find that

$$\mathrm{Arith}(n,n,n,n) = \tfrac{13}{6}\, n^3 + O(n^2). \tag{3.5.9}$$

Hence we can invert a matrix by this method for about the same price as solving 3.25 systems of equations! At first glance, it may seem as if the cost should be n times as great because we are solving n systems. The great economy results, of course, from the common left-hand sides. The cost of the non-arithmetic operations remains at $\tfrac{1}{3} n^3 + O(n^2)$.

If we want only the determinant of a square matrix A, or want only the rank of A, then we need to do only the forward solution, and we can save the cost of the back solution. We leave it to the reader to work out the cost of a determinant, or of finding the rank.

3.6 To unscramble the eggs

Now we have reached the last of the issues that needs to be discussed in order to plan a complete linear equation solving routine, and it concerns the rearrangement of the output so that it ends up in the right order.

During the operation of the forward solution algorithm we found it necessary to interchange rows and columns so that the largest element of the Southeast rectangle was brought into the pivot position. As we mentioned previously, we don’t need to keep a record of the row interchanges, because they correspond simply to solving the equations in a different sequence. We must remember the column interchanges that occur along the way though, because each time we do one of them we are, in effect, renumbering the unknowns.

To remember the column interchanges we glue onto our array C an additional row, just for bookkeeping purposes. Its elements are called $\tau_j$, j = 1, . . . , n, and it is kept at the bottom of the matrix. More precisely, the elements of $\{\tau_j\}$ are the first n entries of the new last row of the matrix, where the row contains n + p entries altogether, the last p of which are not used (refer to (3.3.1) to see the complete partitioning of the matrix C).

Now suppose we have arrived at the end of the back solution, and the answers to the original question are before us, except that they are scrambled. Here’s an example of the kind of situation that might result:

$$\begin{bmatrix} 1 & 0 & 0 & a & b & c \\ 0 & 1 & 0 & d & e & f \\ 0 & 0 & 1 & g & h & k \\ 3 & 5 & 2 & 1 & 4 & * \end{bmatrix} \tag{3.6.1}$$

The matrix above represents schematically the reduced row echelon form in a problem where there are five unknowns (n = 5), the pseudorank r = 3, just one right-hand side vector is given (p = 1), and the permutations that were carried out on the columns are recorded in the array τ: [3, 5, 2, 1, 4] shown in the last row of the matrix as it would be stored in a computation.

The question now is, how do we express the general solution of the given set of equations? To find the answer, let’s go back to the set of equations that (3.6.1) stands for. The first of these is

$$x_3 = c - a x_1 - b x_4 \tag{3.6.2}$$

because the numbering of the unknowns is as shown in the τ array. The next two equations are

$$x_5 = f - d x_1 - e x_4, \qquad x_2 = k - g x_1 - h x_4. \tag{3.6.3}$$

If we add the two trivial equations $x_1 = x_1$ and $x_4 = x_4$, then we get the whole solution vector which, after re-ordering the equations, can be written as

$$\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{bmatrix} = \begin{bmatrix} 0 \\ k \\ c \\ 0 \\ f \end{bmatrix} + (-x_1) * \begin{bmatrix} -1 \\ g \\ a \\ 0 \\ d \end{bmatrix} + (-x_4) * \begin{bmatrix} 0 \\ h \\ b \\ -1 \\ e \end{bmatrix}. \tag{3.6.4}$$

Now we are looking at a display of the output as we would like our subroutine to give it. The three vectors on the right side of (3.6.4) are, respectively, a particular solution of the given system of equations, and the two vectors of a basis for the kernel of the coefficient matrix.

The question can now be rephrased: exactly what operations must be done to the matrix shown in (3.6.1) that represents the situation at the end of the back solution, in order to obtain the three vectors in (3.6.4)?

The first things to do are, as we have previously noted, to append the negative of a 2 × 2 identity matrix to the bottom of the fourth and fifth columns of (3.6.1), and to lengthen the last column on the right by appending two more zeros. That brings us to the matrix

$$\begin{bmatrix} 1 & 0 & 0 & a & b & c \\ 0 & 1 & 0 & d & e & f \\ 0 & 0 & 1 & g & h & k \\ & & & -1 & 0 & 0 \\ & & & 0 & -1 & 0 \\ [3 & 5 & 2 & 1 & 4] & * \end{bmatrix} \tag{3.6.5}$$

The first two of the three long columns above will be the basis for the kernel, and the last column above will be the particular solution, but only after we do the right rearrangement.

Now here is the punch line: the right rearrangement to do is to permute the rows of those three long columns as described by the permutation τ. That means that the first row becomes the third, the second row becomes the fifth, the third row becomes the second, the fourth row is the new first, and the old fifth row is the new fourth. The reader is invited to carry out on the rows the interchanges just described, and to compare the result with what we want, namely with (3.6.4). It will be seen that we have gotten the desired result.

The point that is just a little surprising is that to undo the column interchanges that are recorded by τ, we do row interchanges. Just roughly, the reason for this is that we begin by wanting to solve Ax = b, and instead we end up solving (AE)y = b, where E is

To do this with no extra array storage. and should be output as such. . and the jth one of the last p columns of the new T is a particular solution of the jth one of the input systems of equations. and under P we adjoin an (n − r) × p block of zeros. 2. r is the pseudorank of A.3. r) is the r×r identity matrix. 2. . 1. etc.6. r) B(r. let’s ﬁrst pick up the element a1 and move it to aτ1 . which means that we must perform row operations on y to recover the answers in the right order. Suppose a linear array a = [a1 . Although conceptually we should think of the old T and the new T as occupying distinct arrays in memory. say.6 To unscramble the eggs 107 a matrix obtained from the identity by elementary column operations. We want to rearrange the entries of the array a according to the permuation τ without using any additional array storage. in n unknowns. 13. 9. 7. 4. We adjoin under B the negative of the (n − r) × (n − r) identity matrix. and should be output as such. 13. 5. and state the rule in general. and we’re back where we started. Evidently. At the end of the back solution we will have before us a matrix of the form I(r. and B and P are matrices of the sizes shown. Conceptually. so that the new T is just a rearrangement of the rows of the old. τn ].6. Now we can leave the example above. without ever “stepping on our own toes.6. Now we exchange the rows of T according to the permuation array τ . being careful to store the original aτ1 in a temporary location t so it won’t be destroyed. So we move the 5 in position a1 to position a3 (after putting the 13 into a safe place). in fact it is perfectly possible to carry out the whole row interchange procedure described above in just one array. (3. 9. we will be back to a1 . After a certain number of steps (maybe only 1). Suppose the arrays a and τ at input time were: a = [5. n − r) P (r.” so let’s consider that problem. 8] (3. . . The a array now has become a = [2. . 7. 
Next we move the contents of t to its destination.8) . 8]. all having a commom m × n coeﬃcient matrix A. We are given p systems of m simultaneous equations each. the one that holds T . Her’s an example to help clarify the situation.7) τ = [3. 6]. Next. row 2 will be row τ2 . call it T . . Thus the present array a1 will end up as the output aτ1 . we should regard the old T and the new T as occupying diﬀerent areas of storage. along with a permutation array τ = [τ1 . . and then the 13 goes to position a5 (after putting the 2 into a safe place) and the 2 is moved into position a1 . 5. . . and so forth. . the initial a2 will end up as aτ2 . row 1 of T will be row τ1 of the new T . Preciesly. we forget the identity matrix on the left. Now the ﬁrst n − r columns of the new T are a basis for the kernel of A. . x = Ey.6) where I(r. . p) (3. an ] is given. and we consider the entire remaining n × (n − r + p) matrix as a whole.
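The idea of carrying one entry at a time around a cycle of τ can be sketched in a few lines. The Python below is only an illustration of that single step, checked against a version that is allowed extra storage; the function names and the small sample data are hypothetical, and indices are kept 1-based in `tau` to match the text.

```python
def permute_with_copy(a, tau):
    """Reference version with extra storage: entry a[i] ends up in position tau[i]."""
    out = [None] * len(a)
    for i, dest in enumerate(tau):
        out[dest - 1] = a[i]
    return out

def move_one_cycle(a, tau, start):
    """In place, move only the entries on the cycle of tau that passes through `start`."""
    carried = a[start - 1]            # the element we have picked up
    q = tau[start - 1]                # its destination
    while q != start:
        # drop the carried element at q and pick up what was stored there
        carried, a[q - 1] = a[q - 1], carried
        q = tau[q - 1]
    a[start - 1] = carried            # the cycle closes where it began

a = list("abcdef")                    # hypothetical sample data
tau = [2, 3, 1, 5, 4, 6]              # cycles: 1 -> 2 -> 3 -> 1, 4 -> 5 -> 4, 6 fixed
expected = permute_with_copy(a, tau)

move_one_cycle(a, tau, 1)
after_first_cycle = list(a)           # entries 4, 5, 6 are still where they started
move_one_cycle(a, tau, 4)
after_second_cycle = list(a)
```

After the first call only one cycle has been moved, which is exactly why some bookkeeping is needed to find the entries that have not yet had their turn.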

The job, however, is not finished. Somehow we have to recognize that the elements $a_2$, $a_4$ and $a_6$ haven’t yet been moved, while the others have been moved to their destinations. For this purpose we will flag the array positions. A convenient place to hang a flag is in the sign position of an entry of the array τ, since we’re sure that the entries of τ are all supposed to be positive. Therefore, initially we’ll change the signs of all of the entries of τ to make them negative. Then as elements are moved around in the a array we will reverse the sign of the corresponding entry of the τ array. In that way we can always begin the next block of entries of a to move by searching τ for a negative entry. When none exist, the job is finished.

Here’s a complete algorithm, in Maple:

    shuffle:=proc(a,tau,n)
    local i,j,t,q,u,v;
    # permutes the entries of a according to the permutation tau
    #
    # flag entries of tau with negative signs
    for i from 1 to n do tau[i]:=-tau[i] od;
    for i from 1 to n do
      # has entry i been moved?
      if tau[i]<0 then
        # move the block of entries beginning at a[i]
        t:=i; q:=-tau[i]; tau[i]:=q;
        u:=a[i]; v:=u;
        while q<>t do
          v:=a[q]; a[q]:=u; tau[q]:=-tau[q];
          u:=v; q:=tau[q];
        od;
        a[t]:=v;
      fi;
    od;
    return(1);
    end;

The reader should carefully trace through the complete operation of this algorithm on the sample arrays shown above.

In order to apply the method to the linear equation solving program, the entries C[r + 1, i], i = 1, . . . , n are interpreted as $\tau_i$, and the array a of length n whose entries are going to be moved is one of the columns r + 1, . . . , n + p of the matrix C in rows 1, . . . , n.

3.7 Eigenvalues and eigenvectors of matrices

Our next topic in numerical linear algebra concerns the computation of the eigenvalues and eigenvectors of matrices. Until further notice, all matrices will be square. If A is n × n, by an eigenvector of A we mean a vector x ≠ 0 such that

$$Ax = \lambda x \tag{3.7.1}$$

where the scalar λ is called an eigenvalue of A. We say that the eigenvector x corresponds to, or belongs to, the eigenvalue λ. We will see that in fact the eigenvalues of A are properties of the linear mapping that A represents, rather than of the matrix A, so we can exploit changes of basis in the computation of eigenvalues.

For an example, consider the 2 × 2 matrix

$$A = \begin{bmatrix} 3 & -1 \\ -1 & 3 \end{bmatrix}. \tag{3.7.2}$$

If we write out the vector equation (3.7.1) for this matrix, it becomes the two scalar equations

$$3x_1 - x_2 = \lambda x_1, \qquad -x_1 + 3x_2 = \lambda x_2. \tag{3.7.3}$$

These are two homogeneous equations in two unknowns, and therefore they have no solution other than the zero vector unless the determinant

$$\begin{vmatrix} 3-\lambda & -1 \\ -1 & 3-\lambda \end{vmatrix} \tag{3.7.4}$$

is equal to zero. This condition yields a quadratic equation for λ whose two roots are λ = 2 and λ = 4. These are the two eigenvalues of the matrix (3.7.2).

For the same 2 × 2 example, let’s now find the eigenvectors (by a method that doesn’t bear the slightest resemblance to the numerical method that we will discuss later). First, to find the eigenvector that belongs to the eigenvalue λ = 2, we go back to (3.7.3) and replace λ by 2 to obtain the two equations

$$x_1 - x_2 = 0, \qquad -x_1 + x_2 = 0. \tag{3.7.5}$$

These equations are, of course, redundant since λ was chosen to make them so. They are satisfied by any vector x of the form c ∗ [1, 1], where c is an arbitrary constant. If we refer back to the definition (3.7.1) of eigenvectors we notice that if x is an eigenvector then so is cx, so eigenvectors are determined only up to constant multiples. The first eigenvector of our 2 × 2 matrix is therefore any multiple of the vector [1, 1].

To find the eigenvector that belongs to the eigenvalue λ = 4, we return to (3.7.3), replace λ by 4, and solve the equations. The result is that any scalar multiple of the vector [1, −1] is an eigenvector corresponding to the eigenvalue λ = 4.

The two statements that [1, 1] is an eigenvector and that [1, −1] is an eigenvector can either be written as two vector equations:

$$\begin{bmatrix} 3 & -1 \\ -1 & 3 \end{bmatrix} \begin{bmatrix} 1 \\ 1 \end{bmatrix} = 2 \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \qquad \begin{bmatrix} 3 & -1 \\ -1 & 3 \end{bmatrix} \begin{bmatrix} 1 \\ -1 \end{bmatrix} = 4 \begin{bmatrix} 1 \\ -1 \end{bmatrix} \tag{3.7.6}$$

or as a single matrix equation

$$\begin{bmatrix} 3 & -1 \\ -1 & 3 \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \begin{bmatrix} 2 & 0 \\ 0 & 4 \end{bmatrix}. \tag{3.7.7}$$
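The hand computation above is easy to confirm numerically. The following Python sketch finds the roots of the characteristic quadratic of the 2 × 2 example and checks the two eigenvector equations; the helper names are illustrative, and the quadratic-formula approach is only sensible for this tiny example, not a general numerical method.

```python
import math

A = [[3, -1], [-1, 3]]

def eigenvalues_2x2(M):
    """Roots of lambda^2 - trace*lambda + det = 0 (assumes the roots are real)."""
    trace = M[0][0] + M[1][1]
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    root = math.sqrt(trace * trace - 4 * det)
    return ((trace - root) / 2, (trace + root) / 2)

def matvec(M, x):
    """Ordinary matrix-vector product."""
    return [sum(M[i][k] * x[k] for k in range(len(x))) for i in range(len(M))]

eigs = eigenvalues_2x2(A)
Ax_first = matvec(A, [1, 1])      # should equal 2 * [1, 1]
Ax_second = matvec(A, [1, -1])    # should equal 4 * [1, -1]
```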

Observe that the matrix equation (3.7.7) states that AP = PΛ, where A is the given 2 × 2 matrix, P is a (nonsingular) matrix whose columns are eigenvectors of A, and Λ is the diagonal matrix that carries the eigenvalues of A down the diagonal (in order corresponding to the eigenvectors in the columns of P). This matrix equation AP = PΛ leads to one of the many important areas of application of the theory of eigenvalues, namely to the computation of functions of matrices.

Suppose we want to calculate $A^{2147}$, where A is the 2 × 2 matrix (3.7.2). A direct calculation, by raising A to higher and higher powers, would take quite a while (although not as long as one might think at first sight! Exactly what powers of A would you compute? How many matrix multiplications would be required?).

A better way is to begin with the relation AP = PΛ and to observe that in this case the matrix P is nonsingular, and so P has an inverse. Since P has the eigenvectors of A in its columns, the nonsingularity of P is equivalent to the linear independence of the eigenvectors. Hence we can write

$$A = P \Lambda P^{-1}. \tag{3.7.8}$$

This is called the spectral representation of A, and the set of eigenvalues is often called the spectrum of A.

Equation (3.7.8) is very helpful in computing powers of A. For instance

$$A^2 = (P \Lambda P^{-1})(P \Lambda P^{-1}) = P \Lambda^2 P^{-1},$$

and for every m, $A^m = P \Lambda^m P^{-1}$. It is of course quite easy to find high powers of the diagonal matrix Λ, because we need only raise the entries on the diagonal to that power. Thus for example,

$$A^{2147} = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \begin{bmatrix} 2^{2147} & 0 \\ 0 & 4^{2147} \end{bmatrix} \begin{bmatrix} 1/2 & 1/2 \\ 1/2 & -1/2 \end{bmatrix}. \tag{3.7.9}$$

Not only can we compute powers from the spectral representation (3.7.8), we can equally well obtain any polynomial in the matrix A. For instance,

$$13A^3 + 78A^{19} - 43A^{31} = P\,(13\Lambda^3 + 78\Lambda^{19} - 43\Lambda^{31})\,P^{-1}. \tag{3.7.10}$$

Indeed if f is any polynomial, then

$$f(A) = P f(\Lambda) P^{-1} \tag{3.7.11}$$

and f(Λ) is easy to calculate because it just has the numbers $f(\lambda_i)$ down the diagonal and zeros elsewhere.

Finally, it’s just a short hop to the conclusion that (3.7.11) remains valid even if f is not a polynomial, but is represented by an everywhere-convergent power series (we don’t even need that much, but this statement suffices for our present purposes). So for instance, if A is the above 2 × 2 matrix, then

$$e^A = P e^{\Lambda} P^{-1} \tag{3.7.12}$$

where $e^{\Lambda}$ has $e^2$ and $e^4$ on its diagonal.
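As a quick check on the use of the spectral representation for powers, the sketch below computes a power of the 2 × 2 example both ways, with exact integer and rational arithmetic in Python. The matrices P, its inverse, and the eigenvalues 2 and 4 are those of the example; the function names are illustrative assumptions.

```python
from fractions import Fraction

A = [[3, -1], [-1, 3]]
P = [[1, 1], [1, -1]]
P_inv = [[Fraction(1, 2), Fraction(1, 2)], [Fraction(1, 2), Fraction(-1, 2)]]

def matmul(X, Y):
    """Product of two square matrices of the same size, given as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def power_via_spectrum(m):
    """A^m = P Lambda^m P^{-1}, with Lambda = diag(2, 4)."""
    lam_m = [[2**m, 0], [0, 4**m]]
    return matmul(matmul(P, lam_m), P_inv)

def power_by_repeated_multiplication(m):
    """A^m by m - 1 successive multiplications, for comparison (m >= 1)."""
    result = A
    for _ in range(m - 1):
        result = matmul(result, A)
    return result

same_for_m_10 = power_via_spectrum(10) == power_by_repeated_multiplication(10)
```

Python's unbounded integers make even the exponent 2147 of the text feasible by either route here, but the spectral route needs only two scalar powers and two small matrix products.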

We have now arrived at a very important area of application of eigenvalues and eigenvectors, to the solution of systems of differential equations. A system of n linear simultaneous differential equations in n unknown functions can be written simply as y′ = Ay, with say y(0) given as initial data. The solution of this system of differential equations is $y(t) = e^{At} y(0)$, where the matrix $e^{At}$ is calculated by writing $A = P \Lambda P^{-1}$ if possible, and then putting $e^{At} = P e^{\Lambda t} P^{-1}$.

Hence, whenever we can find the spectral representation of a matrix A, we can calculate functions of the matrix and can solve differential equations that involve the matrix.

So, when can we find a spectral representation of a given n × n matrix A? If we can find a set of n linearly independent eigenvectors for A, then all we need to do is to arrange them in the columns of a new matrix P. Then P will be invertible, and we’ll be all finished. Conversely, if we somehow have found a spectral representation of A à la (3.7.8), then the columns of P obviously do comprise a set of n independent eigenvectors of A.

That changes the question. What kind of an n × n matrix A has a set of n linearly independent eigenvectors? This is quite a hard problem, and we won’t answer it completely. Instead, we give an example of a matrix that does not have as many independent eigenvectors as it “ought to,” and then we’ll specialize our discussion to a kind of matrix that is guaranteed to have a spectral representation.

For an example we don’t have to look any further than

$$A = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}. \tag{3.7.13}$$

The reader will have no difficulty in checking that this matrix has just one eigenvalue, λ = 0, and that corresponding to that eigenvalue there is just one independent eigenvector, and therefore there is no spectral representation of this matrix.

Now first we’re going to devote our attention to the real symmetric matrices, i.e., to matrices A for which $A_{ij} = A_{ji}$ for all i, j = 1, . . . , n. These matrices occur in many important applications, and they always have a spectral representation. Indeed, much more is true, as is shown by the following fundamental theorem of the subject, whose proof is deferred to section 3.9, where it will emerge (see Theorem 3.9.1) as a corollary of an algorithm.

Theorem 3.7.1 (The Spectral Theorem). Let A be an n × n real symmetric matrix. Then the eigenvalues and eigenvectors of A are real. Furthermore, we can always find a set of n eigenvectors of A that are pairwise orthogonal to each other (so they are surely independent).

Recall that the eigenvectors of the symmetric 2 × 2 matrix (3.7.2) were [1, 1] and [1, −1], and these are indeed orthogonal to each other, though we didn’t comment on it at the time.

We’re going to follow a slightly unusual route now, that will lead us simultaneously to a proof of the fundamental theorem (the “spectral theorem”) above, and to a very elegant computer algorithm, called the method of Jacobi, for the computation of eigenvalues and eigenvectors of real symmetric matrices.

In the next section we will introduce a very special family of matrices, first studied by Jacobi, and we will examine their properties in some detail. Once we understand these properties, a proof of the spectral theorem will appear, with almost no additional work.

Following that we will show how the algorithm of Jacobi can be implemented on a computer, as a fast and pretty program in which all of the eigenvalues and eigenvectors of a real symmetric matrix are found simultaneously, and are delivered to your door as an orthogonal set.

Throughout these algorithms certain themes will recur. Specifically, we will see several situations in which we have to compute a certain angle and then carry out a rotation of space through that angle. Since the themes occur so often we are going to abstract from them certain basic modules of algorithms that will be used repeatedly. This choice will greatly simplify the preparation of programs, but at a price, namely that each module will not always be exactly optimal in terms of machine time for execution in each application, although it will be nearly so. Consequently it was felt that the price was worth the benefit of greater universality. We’ll discuss these points further, as they arise, in context.

3.8 The orthogonal matrices of Jacobi

A matrix P is called an orthogonal matrix if it is real, square, and if $P^{-1} = P^T$, i.e., if $P^T P = P P^T = I$. If we visualize the way a matrix is multiplied by its transpose, it will be clear that an orthogonal matrix is one in which each of the rows (columns) is a unit vector and any two distinct rows (columns) are orthogonal to each other.

For example, the 2 × 2 matrix

$$\begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix} \tag{3.8.1}$$

is an orthogonal matrix for every real θ.

We will soon prove that a real symmetric matrix always has a set of n pairwise orthogonal eigenvectors. If we take such a set of vectors, normalize them by dividing each by its length, and arrange them in the consecutive columns of a matrix P, then P will be an orthogonal matrix, and further we will have AP = PΛ. Since $P^T = P^{-1}$, we can multiply on the right by $P^T$ and obtain

$$A = P \Lambda P^T, \tag{3.8.2}$$

and this is the spectral theorem for a symmetric matrix A.

Conversely, if we can find an orthogonal matrix P such that $P^T A P$ is a diagonal matrix D, then we will have found a complete set of pairwise orthogonal eigenvectors of A (the columns of P), and the eigenvalues of A (on the diagonal of D).

In this section we are going to describe a numerical procedure that will find such an orthogonal matrix, given a real symmetric matrix A. As soon as we prove that the method works, we will have proved the spectral theorem at the same time. Hence the method is of theoretical as well as algorithmic importance. It is important to notice that we will not have to find an eigenvalue, then find a corresponding eigenvector, then find another eigenvalue and another vector, etc. Instead, the whole orthogonal matrix whose columns are the desired vectors will creep up on us at once.

The first thing we have to do is to describe some special orthogonal matrices that will be used in the algorithm. Let n, p and q be given positive integers, with n ≥ 2 and p ≠ q, and let θ be a real number. We define the matrix $J_{pq}(\theta)$ by saying that J is just like the n × n identity matrix except that in the four positions that lie at the intersections of rows and columns p and q we find the entries (3.8.1). More precisely, $J_{pq}(\theta)$ has in position [p, p] the entry cos θ, it has sin θ in the [p, q] entry, − sin θ in the [q, p] entry, cos θ in entry [q, q], and otherwise it agrees with the identity matrix, as shown below:

$$J_{pq}(\theta) = \begin{bmatrix} 1 & & & & & & \\ & \ddots & & & & & \\ & & \cos\theta & \cdots & \sin\theta & & \\ & & \vdots & \ddots & \vdots & & \\ & & -\sin\theta & \cdots & \cos\theta & & \\ & & & & & \ddots & \\ & & & & & & 1 \end{bmatrix} \begin{matrix} \\ \\ \leftarrow \text{row } p \\ \\ \leftarrow \text{row } q \\ \\ \\ \end{matrix} \tag{3.8.3}$$

Not only is $J_{pq}(\theta)$ an orthogonal matrix, there is a reasonably pleasant way to picture its action on n-dimensional space. Since the 2 × 2 matrix of (3.8.1) is the familiar rotation of the plane through an angle θ, we can say that the matrix $J_{pq}(\theta)$ carries out a special kind of rotation of n-dimensional space, namely one in which a certain plane, the plane of the pth and qth coordinate, is rotated through the angle θ, and the remaining coordinates are all left alone. Hence $J_{pq}(\theta)$ carries out a two-dimensional rotation of n-dimensional space.

These matrices of Jacobi turn out to be useful in a host of numerical algorithms for the eigenproblem. The first application that we’ll make of them will be to the real symmetric matrices, but later we’ll find that the same two-dimensional rotations will play important roles in the solution of non-symmetric problems as well.

First, let’s see how they can help us with symmetric matrices. What we propose to do is the following. If a real symmetric matrix A is given, we will determine p, q, and the angle θ in such a way that the matrix $JAJ^T$ is a little bit more diagonal (whatever that means!) than A is. It turns out that this can always be done, at any rate unless A is already diagonal, so we will have the germ of a numerical procedure for computing eigenvalues and eigenvectors.
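The definition of the plane rotation matrix translates directly into code. The Python sketch below builds the matrix described above for given n, p, q and θ and checks numerically that it is orthogonal; the function names, the tolerance 1e-12 and the sample values n = 4, p = 2, q = 4, θ = 0.7 are arbitrary choices for illustration, and p, q are 1-based as in the text.

```python
import math

def jacobi_rotation(n, p, q, theta):
    """n x n identity, except for the four entries in rows/columns p and q."""
    J = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    c, s = math.cos(theta), math.sin(theta)
    J[p - 1][p - 1] = c
    J[p - 1][q - 1] = s
    J[q - 1][p - 1] = -s
    J[q - 1][q - 1] = c
    return J

def times_transpose(M):
    """M * M^T; entry [i][j] is the inner product of rows i and j."""
    n = len(M)
    return [[sum(M[i][k] * M[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]

J = jacobi_rotation(4, 2, 4, 0.7)
JJt = times_transpose(J)
is_orthogonal = all(
    abs(JJt[i][j] - (1.0 if i == j else 0.0)) < 1e-12
    for i in range(4) for j in range(4)
)
```

In a working program one would not store the whole matrix, since only rows and columns p and q of the matrix being rotated are affected; the explicit form is used here only to make the definition concrete.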

Indeed, suppose we have found out how to determine such an angle θ, and let’s then see what the whole process would look like. Starting with A, we would find p, q and θ, and then the matrix $JAJ^T$ is somehow a little more diagonal than A was. Now $JAJ^T$ is still a symmetric matrix (try to transpose it and see what happens) so we can do it again. After finding another p, q and θ we will have $J''J'A(J''J')^T$ a bit “more diagonal” and so forth.

Now suppose that after some large number of repetitions of this process we find that the current matrix is very diagonal indeed, so that perhaps aside from roundoff error it is a diagonal matrix D. Then we will know that

$$D = (\text{product of all } J\text{’s used})\, A\, (\text{product of all } J\text{’s used})^T. \tag{3.8.4}$$

If we let P denote the product of all J’s used, then we have $PAP^T = D$, that is, $AP^T = P^T D$, so the columns of $P^T$ (the rows of P) will be the (approximate) eigenvectors of A and the diagonal elements of D will be its eigenvalues. The matrix P will automatically be an orthogonal matrix, since it is the product of such matrices, and the product of orthogonal matrices is always orthogonal (proof?).

That, at any rate, is the main idea of Jacobi’s method (he introduced it in order to study planetary orbits!). Let’s now fill in the details.

First we’ll define what we mean by “more diagonal”. For any square, real matrix A, let Od(A) denote the sum of the squares of the off-diagonal entries of A. From now on, instead of “B is more diagonal than A,” we’ll be able to say Od(B) < Od(A), which is much more professional.

Now we claim that if A is a real symmetric matrix, and it is not already a diagonal matrix, then we can find p, q and θ such that $\mathrm{Od}(J_{pq}(\theta) A J_{pq}(\theta)^T) < \mathrm{Od}(A)$. We’ll do this by a very direct computation of the elements of $JAJ^T$ (we’ll need them anyway for the computer program), and then we will be able to see what the new value of Od is.

So fix p, q and θ. Then by direct multiplication of the matrix in (3.8.3) by A we find that

$$(JA)_{ij} = \begin{cases} (\cos\theta) A_{pj} + (\sin\theta) A_{qj} & \text{if } i = p \\ -(\sin\theta) A_{pj} + (\cos\theta) A_{qj} & \text{if } i = q \\ A_{ij} & \text{otherwise.} \end{cases} \tag{3.8.5}$$

Then after one more multiplication, this time on the right by the transpose of the matrix in (3.8.3), we find that

$$(JAJ^T)_{ij} = \begin{cases} C A_{ip} + S A_{iq} & \text{if } i \notin \{p,q\},\ j = p \text{ (and symmetrically for the entry } [p, i]) \\ -S A_{ip} + C A_{iq} & \text{if } i \notin \{p,q\},\ j = q \text{ (and symmetrically for the entry } [q, i]) \\ C^2 A_{pp} + 2SC A_{pq} + S^2 A_{qq} & \text{if } i = j = p \\ S^2 A_{pp} - 2SC A_{pq} + C^2 A_{qq} & \text{if } i = j = q \\ CS (A_{qq} - A_{pp}) + (C^2 - S^2) A_{pq} & \text{if } i = p,\ j = q \text{ or } i = q,\ j = p \\ A_{ij} & \text{otherwise.} \end{cases} \tag{3.8.6}$$

In (3.8.6) we have written C for cos θ and S for sin θ.

Now we are going to choose the angle θ so that the elements $A_{pq}$ and $A_{qp}$ are reduced to zero, assuming that they were not already zero. To do this we refer to the formula in

and then solve for θ.1 Let A be an n × n real.7) so that if J = Jpq (θ) then Od(JAJ T ) = Od(A) − 2A2 < Od(A).6) for Apq . but remembering that the new Apq =0. 115 (3.8. in line with the philosophy that it is best to break up large programs into small.6). pq (3. If we sum the squares of all of the oﬀ-diagonal elements of JAJ T using the formulas (3. Od(A) will be a little smaller than it was before. we will have reduced one single oﬀ-diagonal element of A to zero in the new symmetric matrix JAJ T . If we keep later applications in mind.8.8. It is important to note that a plane rotation that annihilates Apq may “revive” some other Ars that was set to zero by an earlier plane rotation. and execute the operation (3. then we can choose θ as in equation (3. This is one of the situations we referred to earlier . symmetric matrix that is not diagonal. each time choosing the largest oﬀ-diagonal element Apq and annihilating it by the choice (3. namely the plane rotation in n-dimensional space that sends a real symmetric matrix A into JAJ T .8. we are now ready to prepare the ﬁrst module of the Jacobi program. We will prove that Od(A) converges to zero. However.8. then the best choice for a module will be one that will.9 Convergence of the Jacobi method We have now described the fundamental operation of the Jacobi algorithm. and we have explicit formulas for the new matrix elements.8) 3.7) of θ. Hence we should not think of the zero as “staying put”.7) and we will choose the value of θ that lies between − π and 4 With this value of θ.8. Hence we have Theorem 3.8. The result is tan 2θ = 2Apq App − Aqq π 4.6) of the previous section. then it’s quite easy to check that the new sum of squares is exactly equal to the old sum of squares minus the squares of the two entries Apq and Aqp that were reduced to zero.3. depending on the call. multiply a given not-necessarily-symmetric matrix on the left by J or on the right by J T .6). 
This could be accomplished by a single subroutine that would take the symmetric matrix A and the sine and cosine of the rotation angle.8. manageable chunks. Let’s now see exactly what happens to Od(A) after a single rotation. according to the formulas (3. The full Jacobi algorithm consists in repeatedly executing these plane rotations. There are still a number of quite substantive points to discuss before we will be able to assemble an eﬃcient program for carrying out the method. What we want is to be able to execute the rotation through an angle θ.9 Convergence of the Jacobi method (3. on demand. After each rotation. equate it to zero.8. If Apq = 0 for some p = q.

j]+s*A[q. the plane [p. . and exit. A[j.8. the sine S and cosine C of a rotation angle 3. The Maple procedure is as follows: rotate:=proc(s. The amount of computational labor that is done by this module is O(N ) per call. od fi. according to the formulas (3.j]:=temp. A[q.p. A[j. since only two lines of the matrix are aﬀected by its operation. q] of the rotation. and 2 if we want AJ T .option).j]). A[p.temp.p]+s*A[j.j]+c*A[q. What the procedure will do is exactly this.j]. The procedure will be called Procedure rotate(s.q.c.p]:=temp. it will multiply A on the left by J.q]. and 4.q]:=-s*A[j.n. we will have to call rotate twice. global A.q]. a parameter option that will equal 1 if we want to do JA.p]+c*A[j. If called with option = 1. but we will get a lot of mileage out of this routine! Hence.116 Numerical linear algebra where the most universal choice of subroutine will not be the most economical one in every application. suppose we are given 1. an n × n real matrix A 2. If option = 2. once with option = 1 and then with option = 2.opt) local j.j]:=-s*A[p. RETURN() end: To carry out one iteration of the Jacobi method.p. if opt=1 then for j from 1 to n do temp:=evalf(c*A[p.5).c. it will multiply A on the right by J T . Next.q. let’s prove that the results of applying one rotation after another do in fact converge to a diagonal matrix. od else for j from 1 to n do temp:=c*A[j.
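Theorem 3.8.1 can be checked on a small example. The sketch below is written in Python rather than Maple; `od`, `rotate` and `annihilate` are hypothetical names, indices are 0-based, and the matrix is passed as an argument instead of being global, all of which are assumptions of the sketch. The angle is taken from (3.8.7) with $-\pi/4 \le \theta \le \pi/4$.

```python
import math

def od(A):
    """Sum of the squares of the off-diagonal entries of A."""
    n = len(A)
    return sum(A[i][j] ** 2 for i in range(n) for j in range(n) if i != j)

def rotate(A, s, c, p, q, opt):
    """opt = 1: A <- J A (rows p, q change); opt = 2: A <- A J^T (columns p, q change)."""
    n = len(A)
    if opt == 1:
        for j in range(n):
            apj, aqj = A[p][j], A[q][j]
            A[p][j] = c * apj + s * aqj
            A[q][j] = -s * apj + c * aqj
    else:
        for j in range(n):
            ajp, ajq = A[j][p], A[j][q]
            A[j][p] = c * ajp + s * ajq
            A[j][q] = -s * ajp + c * ajq

def annihilate(A, p, q):
    """Rotate in the [p, q] plane with theta from (3.8.7), so the new A[p][q] is zero."""
    if A[p][p] == A[q][q]:
        theta = math.pi / 4              # tan(2 theta) is infinite in this case
    else:
        theta = 0.5 * math.atan(2.0 * A[p][q] / (A[p][p] - A[q][q]))
    s, c = math.sin(theta), math.cos(theta)
    rotate(A, s, c, p, q, 1)             # J A
    rotate(A, s, c, p, q, 2)             # (J A) J^T

A = [[4.0, 1.0, 2.0],
     [1.0, 3.0, 0.5],
     [2.0, 0.5, 1.0]]
before = od(A)          # 2 * (1 + 4 + 0.25) = 10.5
annihilate(A, 0, 2)     # the entry being annihilated is 2.0
after = od(A)           # (3.8.8) predicts before - 2 * 2.0**2 = 2.5, up to roundoff
```

Calling `rotate` once with `opt = 1` and once with `opt = 2` mirrors the two calls described in the text; the order does not matter for the final matrix, since matrix multiplication is associative.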

Theorem 3.9.1 Let $A$ be a real symmetric matrix. Suppose we follow the strategy of searching for the off-diagonal element of largest absolute value, choosing $\theta$ so as to zero out that element by carrying out a Jacobi rotation on $A$, and then repeating the whole process on the resulting matrix, etc. Then the sequence of matrices that is thereby obtained approaches a diagonal matrix $D$.

Proof. At a certain stage of the iteration, let $A_{pq}$ denote the off-diagonal element of largest absolute value. Since the maximum of any set of numbers is at least as big as the average of that set (proof?), it follows that the maximum of the squares of the off-diagonal elements of $A$ is at least as big as the average square of an off-diagonal element. The average square is equal to the sum of the squares of the off-diagonal elements divided by the number of such elements, i.e., divided by $n(n-1)$. Hence the average square is exactly $\mathrm{Od}(A)/(n(n-1))$, and therefore

$$A_{pq}^2 \ge \frac{\mathrm{Od}(A)}{n(n-1)}. \tag{3.9.1}$$

Now the effect of a single rotation of the matrix is to reduce $\mathrm{Od}(A)$ by $2A_{pq}^2$, so equation (3.8.8) yields

$$\mathrm{Od}(JAJ^T) = \mathrm{Od}(A) - 2A_{pq}^2 \le \mathrm{Od}(A) - \frac{2\,\mathrm{Od}(A)}{n(n-1)} = \left(1 - \frac{2}{n(n-1)}\right)\mathrm{Od}(A). \tag{3.9.2}$$

Hence a single rotation will reduce $\mathrm{Od}(A)$ by a multiplicative factor of $1 - 2/(n(n-1))$ at least. Since this factor is less than 1, it follows that the sum of squares of the off-diagonal entries approaches zero as the number of plane rotations grows without bound, completing the proof.

The proof told us even more since it produced a quantitative estimate of the rate at which $\mathrm{Od}(A)$ approaches zero. Indeed, after $r$ rotations, the sum of squares will have dropped to at most

$$\left(1 - \frac{2}{n(n-1)}\right)^{r} \mathrm{Od}(\text{original } A). \tag{3.9.3}$$

If we put $r = n(n-1)/2$, then we see that $\mathrm{Od}(A)$ has dropped by at least a factor of (approximately) $e$. Hence, after doing an average of one rotation per off-diagonal element, the function $\mathrm{Od}(A)$ is no more than $1/e$ times its original value. After doing an average of, say, $t$ rotations per off-diagonal element (i.e., $tn(n-1)/2$ rotations), the function $\mathrm{Od}(A)$ will have dropped to about $e^{-t}$ times its original value. If we want it to drop to, say, $10^{-m}$ times its original value then we can expect to need no more than about $m(\ln 10)n(n-1)/2$ rotations.

To put it in very concrete terms, suppose we're working in double precision (12-digit) arithmetic and we are willing to decree that convergence has taken place if $\mathrm{Od}$ has been reduced by a factor of $10^{-12}$. Then at most $12(\ln 10)n(n-1)/2 < 6(\ln 10)n^2 \approx 13.8n^2$ rotations will have to be done. Of course in practice we will be watching the function $\mathrm{Od}(A)$ as it drops, so there won't be any need to know in advance how many iterations are needed. We can stop when the actual observed value is small enough. Still, it's comforting to know that at most $O(n^2)$ iterations will be enough to do the job.

Now let's re-direct our thoughts to the grand iteration process itself. At each step we apply a rotation matrix to the current symmetric matrix in order to make it "more diagonal". At the same time, of course, we must keep track of the product of all of the rotation matrices that we have so far used, because that is the matrix that ultimately will be an orthogonal matrix with the eigenvectors of $A$ across its rows.

Let's watch this happen. Begin with $A$. After one rotation we have $J_1 A J_1^T$, after two iterations we have $J_2 J_1 A J_1^T J_2^T$, after three we have $J_3 J_2 J_1 A J_1^T J_2^T J_3^T$, etc. After all iterations have been done, and we are looking at a matrix that is "diagonal enough" for our purposes, the matrix we see is $PAP^T = D$, where $P$ is obtained by starting with the identity matrix and multiplying successively on the left by the rotational matrices $J$ that are used, and $D$ is (virtually) diagonal. Since $PAP^T = D$, we have $AP^T = P^T D$, so the columns of $P^T$, or equivalently the rows of $P$, are the eigenvectors of $A$.

Now, we have indeed proved that the repeated rotations will diagonalize $A$. We have not proved that the matrices $P$ themselves converge to a certain fixed matrix. This is true, but we omit the proof. One thing we do want to do, however, is to prove the spectral theorem, Theorem 3.7.1, itself, since we have long since done all of the work.

Proof (of the Spectral Theorem 3.7.1): Consider the mapping $f$ that associates with every orthogonal matrix $P$ the matrix $f(P) = P^T A P$. The set of orthogonal matrices is compact, and the mapping $f$ is continuous. Hence the image of the set of orthogonal matrices under $f$ is compact. Hence there is a matrix $D$ in that image that minimizes the continuous function $\mathrm{Od}(f(P)) = \mathrm{Od}(P^T A P)$. Suppose $D$ is not diagonal. Then we could find a Jacobi rotation that would produce another matrix in the image whose $\mathrm{Od}$ would be lower, which is a contradiction (of the fact that $\mathrm{Od}(D)$ was minimal). Hence $D$ is diagonal. So there is an orthogonal matrix $P$ such that $P^T A P = D$, i.e., $AP = PD$. Hence the columns of $P$ are $n$ pairwise orthogonal eigenvectors of $A$, and the proof of the spectral theorem is complete.

Now let's get on with the implementation of the algorithm.

3.10 Corbató's idea and the implementation of the Jacobi algorithm

It's time to sit down with our accountants and add up the costs of the Jacobi method. First, we have seen that $O(n^2)$ rotations will be sufficient to reduce the off-diagonal sum of squares below some pre-assigned threshold level. Now, what is the price of a single rotation? Here are the steps:

(i) Search for the off-diagonal element having the largest absolute value. The cost seems to be equal to the number of elements that have to be looked at, namely $n(n-1)/2$, which we abbreviate as $O(n^2)$.

(ii) Calculate $\theta$, $\sin\theta$ and $\cos\theta$, and then carry out a rotation on the matrix $A$. This costs $O(n)$, since only four lines of $A$ are changed.

(iii) Update the matrix $P$ of eigenvectors by multiplying it by the rotation matrix. Since only two rows of $P$ change, this cost is $O(n)$ also.

The longest part of the job is the search for the largest off-diagonal element. The search is $n$ times as expensive in time as either the rotation of $A$ or the update of the eigenvector matrix.

For this reason, in the years since Jacobi first described his algorithm, a number of other strategies for dealing with the eigenvalue problem have been worked out. One of these is called the cyclic Jacobi method. In that variation, one does not search, but instead goes marching through the matrix one element at a time. That is to say, first do a rotation that reduces $A_{12}$ to zero. Next do a rotation that reduces $A_{13}$ to zero (of course, $A_{12}$ doesn't stay put, but becomes nonzero again!). Then do $A_{14}$ and so forth, returning to $A_{12}$ again after $A_{n-1,n}$, cycling as long as necessary. This method avoids the search, but the proof that it converges at all is quite complex, and the exact rate of convergence is unknown.

A variation on the cyclic method is called the threshold Jacobi method, in which we go through the entries cyclically as above, but we do not carry out a rotation unless the magnitude of the current matrix entry exceeds a certain threshold ("throw it back, it's too small"). This method also has an uncertain rate of convergence.

At a deeper level, two newer methods due to Givens and Householder have been developed. These methods work not by trying to diagonalize $A$ by rotations, but instead to tri-diagonalize $A$ by rotations. A tri-diagonal matrix is one whose entries are all zero except for those on the diagonal, the sub-diagonal and the super-diagonal (i.e., $A_{ij} = 0$ unless $|i - j| \le 1$).

The advantage of tri-diagonalization is that it is a finite process: it can be done in such a way that elements, once reduced to zero, stay zero instead of bouncing back again as they do in the Jacobi method. The disadvantage is that, having arrived at a tri-diagonal matrix, one is not finished, but instead one must then confront the question of obtaining the eigenvalues and eigenvectors of a tri-diagonal matrix, a nontrivial operation.

One of the reasons for the wide use of the Givens and Householder methods has been that they get the answers in just $O(n^3)$ time, instead of the $O(n^4)$ time in which the original Jacobi method operates. Thanks to a suggestion of Corbató (F. J. Corbató, On the coding of Jacobi's method for computing eigenvalues and eigenvectors of symmetric matrices, JACM, 10: 123-125, 1963), however, it is now easy to run the original Jacobi method in $O(n^3)$ time also. The suggestion is all the more remarkable because of its simplicity and the fact that it lives at the software level, rather than at the mathematical level. What it does is just this: it allows us to use the largest off-diagonal entry at each stage while paying a price of a mere $O(n)$ for the privilege, instead of the $O(n^2)$ billing mentioned above.

We can do this by carrying along an additional linear array during the calculation. The $i$th entry of this array, say loc[i], contains the number of a column in which an off-diagonal element of largest absolute value in row $i$ lives, i.e., $A_{i,\mathrm{loc}(i)}$ is an entry in row $i$ of $A$ of largest absolute value in that row.

Now of course if some benefactor is kind enough to hand us this array, then it would be a simple matter to find the biggest off-diagonal element of the entire matrix $A$. We would just look through the $n-1$ numbers $|A_{i,\mathrm{loc}(i)}|$ $(i = 1, 2, \ldots, n-1)$ to find the largest one. If the largest one is the one where $i = p$, say, then the desired matrix element would be $A_{p,\mathrm{loc}(p)}$. Hence the cost of using this array is $O(n)$.

Since there are no such benefactors as described above, we are going to have to pay a price for the care and feeding of this array. How much does it cost to create and to maintain it?

Initially, we just search through the whole matrix and set it up. This clearly costs $O(n^2)$ operations, but we pay just once. Now let's turn to a typical intermediate stage in the calculation, and see what the price is for updating the loc array.

Given an array loc, suppose now that we carry out a single rotation on the matrix $A$. Precisely how do we go about modifying loc so it will correspond to the new matrix? The rotated matrix differs from the previous matrix in exactly two rows and two columns.

Certainly the two rows, $p$ and $q$, that have been completely changed will simply have to be searched again in order to find the new values of loc. This costs $O(n)$ operations.

What about the other $n-2$ rows of the matrix? In the $i$th one of those rows exactly two entries were changed, namely the ones in the $p$th column and in the $q$th column. Suppose the largest element that was previously in row $i$ was not in either the $p$th or the $q$th column, i.e., suppose loc[i] $\notin \{p, q\}$. Then that previous largest element will still be there in the rotated matrix. In that case, in order to discover the new entry of largest absolute value in row $i$ we need only compare at most three numbers: the new $|A_{ip}|$, the new $|A_{iq}|$, and the old $|A_{i,\mathrm{loc}(i)}|$. The column in which the largest of these three numbers is found will be the new loc[i]. The price paid is at most three comparisons, and this does the updating job in every row except those that happen to have had loc[i] $\in \{p, q\}$.

In the latter case we can still salvage something. If we are replacing the entry of previously largest absolute value in the row $i$, we might after all get lucky and replace it with an even larger number, in which case we would again know the new loc(i). Christmas, however, comes just once a year, and since the general trend of the off-diagonal entries is downwards, and the previous largest entry was uncommonly large, most of the time we'll be replacing the former largest entry with a smaller entry. In that case we'll just have to re-search the entire row to find the new champion.

The number of rows that must be searched in their entireties in order to update the loc array is therefore at most two (for rows $p$ and $q$) plus the number of rows $i$ in which loc[i] happens to be equal to $p$ or to $q$. It is reasonable to expect that the probability of the event loc[i] $\in \{p, q\}$ is about two chances out of $n$, since it seems that it ought not to be any more likely that the winner was previously in those two columns than in any other columns. This has never been in any sense proved, but we will assume that it is so. Then the expected number of rows that will have to be completely searched will be about four, on the average (the $p$th, the $q$th, and an average of about two others). It follows that the expected cost of maintaining the loc array is $O(n)$ per rotation. The cost of finding the largest off-diagonal element has therefore been reduced to $O(n)$ per rotation, after all bills have been paid. Hence the cost of finding that element is comparable with all of the other operations that go on in the algorithm, and it poses no special problem.

Using Corbató's suggestion, and subject to the equidistribution hypothesis mentioned above, the cost of the complete Jacobi algorithm for eigenvalues and eigenvectors is $O(n^3)$. We show below the complete Maple procedure for updating the array loc immediately after a rotation has been done in the plane of $p$ and $q$.

    update:=proc(p,q) local i,r; global loc,A,n,P;
    for i from 1 to n-1 do
       if i=p or i=q then searchrow(i)       # rows p and q changed completely
       else
          r:=loc[i];
          if r=p or r=q then
             # the old champion of row i was overwritten and its old value
             # is no longer available for comparison, so search the row again
             searchrow(i)
          else
             # the old champion survives; only columns p and q can beat it
             if abs(A[i,p])>=abs(A[i,loc[i]]) then loc[i]:=p; fi;
             if abs(A[i,q])>=abs(A[i,loc[i]]) then loc[i]:=q; fi;
          fi;
       fi;
    od;
    RETURN();
    end:

The above procedure uses a small auxiliary routine:

Procedure searchrow(i)

This procedure searches the portion of row $i$ of the $n \times n$ matrix $A$ that lies above the main diagonal and places the index of a column that contains an entry of largest absolute value in loc[i].

    searchrow:=proc(i) local j,bigg; global loc,A,n;
    bigg:=0;
    loc[i]:=i+1;                             # default, in case the whole row is zero
    for j from i+1 to n do
       if abs(A[i,j])>bigg then bigg:=abs(A[i,j]);loc[i]:=j; fi;
    od;
    RETURN();
    end:

We should mention that the search can be speeded up a little bit more by using a data structure called a heap. What we want is to store the locations of the biggest elements in each row in such a way that we can quickly access the biggest of all of them.
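The bookkeeping for the loc array can be sketched in a few lines. The following is a Python sketch, not a translation of the Maple listings: the names `search_row`, `largest_offdiag`, `rotate_symmetric` and `update_loc` are hypothetical, indices are 0-based, and columns $p$ and $q$ are compared only when they lie above the diagonal of row $i$, which is a design choice of this sketch so that `loc[i]` always points into the upper triangle.

```python
import math

def search_row(A, i):
    """Column j > i holding an entry of largest absolute value in row i."""
    return max(range(i + 1, len(A)), key=lambda j: abs(A[i][j]))

def largest_offdiag(A, loc):
    """Use loc to find the largest off-diagonal entry with O(n) comparisons."""
    p = max(range(len(loc)), key=lambda i: abs(A[i][loc[i]]))
    return p, loc[p]

def rotate_symmetric(A, p, q):
    """Replace A by J A J^T in place, with theta chosen as in (3.8.7)."""
    if A[p][p] == A[q][q]:
        theta = math.pi / 4
    else:
        theta = 0.5 * math.atan(2.0 * A[p][q] / (A[p][p] - A[q][q]))
    s, c = math.sin(theta), math.cos(theta)
    for j in range(len(A)):                      # rows p and q
        apj, aqj = A[p][j], A[q][j]
        A[p][j], A[q][j] = c * apj + s * aqj, -s * apj + c * aqj
    for j in range(len(A)):                      # columns p and q
        ajp, ajq = A[j][p], A[j][q]
        A[j][p], A[j][q] = c * ajp + s * ajq, -s * ajp + c * ajq

def update_loc(A, loc, p, q):
    """Repair loc after a rotation in the [p, q] plane."""
    for i in range(len(loc)):
        if i in (p, q) or loc[i] in (p, q):
            loc[i] = search_row(A, i)            # full O(n) search of the row
        else:
            for j in (p, q):                     # old champion survives
                if j > i and abs(A[i][j]) > abs(A[i][loc[i]]):
                    loc[i] = j

A = [[1.0, 0.2, -3.0, 0.5],
     [0.2, 2.0, 0.7, 2.5],
     [-3.0, 0.7, 3.0, -0.1],
     [0.5, 2.5, -0.1, 4.0]]
n = len(A)
loc = [search_row(A, i) for i in range(n - 1)]   # initial O(n^2) setup: [2, 3, 3]
p, q = largest_offdiag(A, loc)                   # (0, 2), the entry -3.0
rotate_symmetric(A, p, q)
update_loc(A, loc, p, q)
```

After the update, each `loc[i]` again marks a largest entry of the above-diagonal part of row `i`, so the next call to `largest_offdiag` is correct without a full search of the matrix.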

If we had stored the set of winners in a heap, or priority queue, then we would have been able to find the overall winner in a single step, and the expense would have been the maintenance of the heap structure, at a cost of a mere $O(\log n)$ operations. This would have reduced the program overhead by an additional twenty percent or thereabouts, although the programming job itself would have gotten harder. To learn about heaps and how to use them, consult books on data structures, computer science, and discrete mathematics.

3.11 Getting it together

The various pieces of the procedure that will find the eigenvalues and eigenvectors of a real symmetric matrix by the method of Jacobi are now in view. It's time to discuss the assembly of those pieces.

Procedure jacobi(eps,dgts)

The input to the jacobi procedure is the $n \times n$ matrix $A$, and a parameter eps that we will use as a reduction factor to test whether or not the current matrix is "diagonal enough," so that the procedure can terminate its operation. Further input is dgts, which is the number of significant digits that you would like to be carried in the calculation. The input matrix $A$ itself is global, which is to say that its value is set outside of the jacobi procedure, and it is available to all procedures that are involved.

The output of procedure jacobi will be an orthogonal matrix $P$ that will hold the eigenvectors of $A$ in its rows, and a linear array eig that will hold the eigenvalues of $A$. The input matrix $A$ will be destroyed by the action of the procedure.

The first step in the operation of the procedure will be to compute the original off-diagonal sum of squares test. The Jacobi process will halt when this sum of squares of all off-diagonal elements has been reduced to eps*test or less.

Next we set the matrix $P$ to the $n \times n$ identity matrix, and initialize the array loc by calling the subroutine searchrow for each row of $A$. This completes the initialization. The remaining steps all get done while the sum of the squares of the off-diagonal entries of the current matrix $A$ exceeds eps times test.

We first find the largest off-diagonal element by searching through the numbers $A_{i,\mathrm{loc}(i)}$ for the largest in absolute value. If that is $A_{p,\mathrm{loc}(p)}$, we will then know $p$, $q$, $A_{pq}$, $A_{pp}$ and $A_{qq}$, and it will be time to compute the sine and cosine of the rotation angle $\theta$. This is a fairly ticklish operation, since the Jacobi method is sensitive to small inaccuracies in these quantities. Also, note that we are going to calculate $\sin\theta$ and $\cos\theta$ without actually calculating $\theta$ itself. After careful analysis it turns out that the best formulas for the purpose are:

$$\begin{aligned}
x &:= 2A_{pq} \\
y &:= A_{pp} - A_{qq} \\
t &:= \sqrt{x^2 + y^2} \\
\sin\theta &:= \mathrm{sign}(xy)\left(\frac{1 - |y|/t}{2}\right)^{1/2} \\
\cos\theta &:= \left(\frac{1 + |y|/t}{2}\right)^{1/2}
\end{aligned} \tag{3.11.1}$$

Having computed $\sin\theta$ and $\cos\theta$, we can now call rotate twice, as discussed in section 3.9, first with opt=1 and again with opt=2. The matrix $A$ has now been transformed into the next stage of its march towards diagonalization.

Next we multiply the matrix $P$ on the left by the same $n \times n$ orthogonal matrix of Jacobi (3.8.3), in order to update the matrix that will hold the eigenvectors of $A$ on output. Notice that this multiplication affects only rows $p$ and $q$ of the matrix $P$, so only $2n$ elements are changed.

Next we call update, as discussed in section 3.10, to modify the loc array to correspond to the newly rotated matrix, and we are at the end (or od) of the while that was started a few paragraphs ago. In other words, we're finished. The Maple program for the Jacobi algorithm follows.

    jacobi:=proc(eps,dgts)
    local test,eig,i,j,iter,big,p,q,x,y,t,s,c,x1;
    global loc,n,P,A;
    with(linalg):                                     # initialize
    iter:=0;n:=rowdim(A);Digits:=dgts;
    loc:=array(1..n);
    test:=add(add(2*A[i,j]^2,j=i+1..n),i=1..n-1);     #sum squares o-d entries
    big:=test;
    for i from 1 to n-1 do searchrow(i) od;           #set up initial loc array
    P:=matrix(n,n,(i,j)->if i=j then 1 else 0 fi);    #initialize eigenvector matrix
    while big>eps*test do                             #begin next sweep
       x:=0;                                          #find largest o.d. element
       for i from 1 to n-1 do
          if abs(A[i,loc[i]])>x then x:=abs(A[i,loc[i]]);p:=i; fi;
       od;
       q:=loc[p];
       x:=2*A[p,q]; y:=A[p,p]-A[q,q];                 #find sine and cosine of theta
       t:=evalf(sqrt(x^2+y^2));
       s:=sign(x*y)*evalf(sqrt(0.5*(1-abs(y)/t)));
       c:=evalf(sqrt(0.5*(1+abs(y)/t)));
       rotate(s,c,p,q,1);rotate(s,c,p,q,2);           #apply rotations to A
       for j from 1 to n do                           #update matrix of eigenvectors
          t:=c*P[p,j]+s*P[q,j];
          P[q,j]:=-s*P[p,j]+c*P[q,j];
          P[p,j]:=t;
       od;
       update(p,q);                                   #update loc array
       big:=big-x^2/2;iter:=iter+1;                   #go do next sweep
    od;                                               #end of while
    eig:=[seq(A[i,i],i=1..n)];                        #output eigenvalue array
    print(eig,P,iter);                                #print eigenvals, vecs, and
    RETURN();                                         # no. of sweeps needed
    end:

To use the programs one does the following. First, enter the four procedures jacobi, update, rotate, searchrow into a Maple worksheet. Next enter the matrix $A$ whose eigenvalues and eigenvectors are wanted. Then choose dgts, the number of digits of accuracy to be maintained, and eps, the fraction by which the original off-diagonal sum of squares must be reduced for convergence.

As an example, a call jacobi(.00000001,15) will carry 15 digits along in the computation, and will terminate when the sum of squares of the off-diagonal elements is .00000001 times what it was on the input matrix.

3.12 Remarks

For a parting volley in the direction of eigenvalues, let's review some connections with the first section of this chapter, in which we studied linear mappings, albeit sketchily.

It's worth noting that the eigenvalues of a matrix really are the eigenvalues of the linear mapping that the matrix represents with respect to some basis.

In fact, suppose $T$ is a linear mapping of $E^n$ (Euclidean $n$-dimensional space) to itself. If we choose a basis for $E^n$ then $T$ is represented by an $n \times n$ matrix $A$ with respect to that basis. Now if we change to a different basis, then the same linear mapping is represented by $B = HAH^{-1}$, where $H$ is a nonsingular $n \times n$ matrix. The proof of this fact is by a straightforward calculation, and can be found in standard references on linear algebra.

First question: what happens to the determinant if we change the basis? Answer: nothing, because

$$\det(HAH^{-1}) = \det(H)\det(A)\det(H^{-1}) = \det(A). \tag{3.12.1}$$

Hence the value of the determinant is a property of the linear mapping $T$, and will be the same for every matrix that represents $T$ in some basis. Hence we can speak of $\det(T)$, the determinant of the linear mapping itself.

Next question: what happens to the eigenvalues if we change basis? Suppose $x$ is an eigenvector of $A$ for the eigenvalue $\lambda$. Then $Ax = \lambda x$. If we change basis, $A$ changes to $B = HAH^{-1}$, or $A = H^{-1}BH$. Hence $H^{-1}BHx = \lambda x$, or $B(Hx) = \lambda(Hx)$. Therefore $Hx$ is an eigenvector of $B$ with the same eigenvalue $\lambda$. The eigenvalues are therefore independent of the basis, and are properties of the linear mapping $T$ itself. Hence we can speak of the eigenvalues of a linear mapping $T$.

In the Jacobi method we carry out transformations $A \to JAJ^T$, where $J^T = J^{-1}$. Hence this transformation corresponds exactly to looking at the underlying linear mapping in a different basis, in which the matrix that represents it is a little more diagonal than before. Since $J$ is an orthogonal matrix, it preserves lengths and angles because it preserves inner products between vectors:

$$\langle Jx, Jy \rangle = \langle x, J^T J y \rangle = \langle x, y \rangle. \tag{3.12.2}$$

Therefore the method of Jacobi works by rotating a basis slightly, into a new basis in which the matrix is closer to being diagonal.
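To see the pieces of sections 3.9 to 3.11 working together, here is a compact sketch of the whole iteration in Python rather than Maple. It is not the jacobi procedure of the text: the name `jacobi_eigen`, the `max_rotations` guard, the 0-based indexing, and the choice to work on a copy of the input are assumptions of the sketch. It also searches the whole upper triangle and recomputes the off-diagonal sum of squares at every step, for brevity, instead of maintaining the loc array, so each rotation costs $O(n^2)$ here. The sine and cosine come from (3.11.1).

```python
import math

def jacobi_eigen(A, eps=1e-12, max_rotations=10000):
    """Return (eigenvalues, P, rotations); the rows of P are the eigenvectors."""
    n = len(A)
    A = [row[:] for row in A]                       # leave the caller's matrix alone
    P = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    test = sum(2.0 * A[i][j] ** 2 for i, j in pairs)   # original Od(A)
    big = test
    rotations = 0
    while big > eps * test and rotations < max_rotations:
        p, q = max(pairs, key=lambda ij: abs(A[ij[0]][ij[1]]))
        x = 2.0 * A[p][q]
        y = A[p][p] - A[q][q]
        t = math.hypot(x, y)
        if t == 0.0:
            break
        sgn = 1.0 if x * y >= 0.0 else -1.0         # sign(xy), taking sign(0) = +1
        s = sgn * math.sqrt(max(0.0, 0.5 * (1.0 - abs(y) / t)))
        c = math.sqrt(0.5 * (1.0 + abs(y) / t))
        for j in range(n):                          # A <- J A
            apj, aqj = A[p][j], A[q][j]
            A[p][j], A[q][j] = c * apj + s * aqj, -s * apj + c * aqj
        for j in range(n):                          # A <- A J^T
            ajp, ajq = A[j][p], A[j][q]
            A[j][p], A[j][q] = c * ajp + s * ajq, -s * ajp + c * ajq
        for j in range(n):                          # P <- J P
            ppj, pqj = P[p][j], P[q][j]
            P[p][j], P[q][j] = c * ppj + s * pqj, -s * ppj + c * pqj
        big = sum(2.0 * A[i][j] ** 2 for i, j in pairs)
        rotations += 1
    return [A[i][i] for i in range(n)], P, rotations

vals2, P2, rot2 = jacobi_eigen([[2.0, 1.0], [1.0, 2.0]])   # eigenvalues 3 and 1
A0 = [[4.0, 1.0, 2.0],
      [1.0, 3.0, 0.5],
      [2.0, 0.5, 1.0]]
vals3, P3, rot3 = jacobi_eigen(A0, eps=1e-20)
```

Because every step is a change of basis $A \to JAJ^T$, the trace of the matrix is unchanged, so the computed eigenvalues of `A0` should sum to $4 + 3 + 1 = 8$, and each row $v$ of `P3` should satisfy $A_0 v \approx \lambda v$ for the matching entry $\lambda$ of `vals3`.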
