Lectures on Numerical Analysis

Dennis Deturck and Herbert S. Wilf
Department of Mathematics
University of Pennsylvania
Philadelphia, PA 19104-6395
Copyright 2002, Dennis Deturck and Herbert Wilf
April 30, 2002
Contents

1 Differential and Difference Equations
   1.1 Introduction
   1.2 Linear equations with constant coefficients
   1.3 Difference equations
   1.4 Computing with difference equations
   1.5 Stability theory
   1.6 Stability theory of difference equations

2 The Numerical Solution of Differential Equations
   2.1 Euler's method
   2.2 Software notes
   2.3 Systems and equations of higher order
   2.4 How to document a program
   2.5 The midpoint and trapezoidal rules
   2.6 Comparison of the methods
   2.7 Predictor-corrector methods
   2.8 Truncation error and step size
   2.9 Controlling the step size
   2.10 Case study: Rocket to the moon
   2.11 Maple programs for the trapezoidal rule
      2.11.1 Example: Computing the cosine function
      2.11.2 Example: The moon rocket in one dimension
   2.12 The big leagues
   2.13 Lagrange and Adams formulas

3 Numerical linear algebra
   3.1 Vector spaces and linear mappings
   3.2 Linear systems
   3.3 Building blocks for the linear equation solver
   3.4 How big is zero?
   3.5 Operation count
   3.6 To unscramble the eggs
   3.7 Eigenvalues and eigenvectors of matrices
   3.8 The orthogonal matrices of Jacobi
   3.9 Convergence of the Jacobi method
   3.10 Corbató's idea and the implementation of the Jacobi algorithm
   3.11 Getting it together
   3.12 Remarks
Chapter 1
Differential and Difference Equations
1.1 Introduction
In this chapter we are going to study differential equations, with particular emphasis on how
to solve them with computers. We assume that the reader has previously met differential
equations, so we’re going to review the most basic facts about them rather quickly.
A differential equation is an equation in an unknown function, say y(x), where the
equation contains various derivatives of y and various known functions of x. The problem
is to “find” the unknown function. The order of a differential equation is the order of the
highest derivative that appears in it.
Here’s an easy equation of first order:
y'(x) = 0. (1.1.1)
The unknown function is y(x) = constant, so we have solved the given equation (1.1.1).
The next one is a little harder:

y'(x) = 2y(x). (1.1.2)

A solution will, no doubt, arrive after a bit of thought, namely y(x) = e^{2x}. But, if y(x) is a solution of (1.1.2), then so is 10y(x), or 49.6y(x), or in fact cy(x) for any constant c. Hence y = ce^{2x} is a solution of (1.1.2). Are there any other solutions? No there aren't, because if y is any function that satisfies (1.1.2) then

(ye^{−2x})' = e^{−2x}(y' − 2y) = 0, (1.1.3)

and so ye^{−2x} must be a constant, C.
In general, we can expect that if a differential equation is of the first order, then the most general solution will involve one arbitrary constant C. This is not always the case, since we can write down differential equations that have no solutions at all. We would have, for instance, a fairly hard time (why?) finding a real function y(x) for which

(y')^2 = −y^2 − 2. (1.1.4)
There are certain special kinds of differential equations that can always be solved, and it's often important to be able to recognize them. Among these are the "first-order linear" equations

y'(x) + a(x)y(x) = 0, (1.1.5)

where a(x) is a given function of x.
Before we describe the solution of these equations, let's discuss the word linear. To say that an equation is linear is to say that if we have any two solutions y_1(x) and y_2(x) of the equation, then c_1 y_1(x) + c_2 y_2(x) is also a solution of the equation, where c_1 and c_2 are any two constants (in other words, the set of solutions forms a vector space).
Equation (1.1.1) is linear; in fact, y_1(x) = 7 and y_2(x) = 23 are both solutions, and so is 7c_1 + 23c_2. Less trivially, the equation

y''(x) + y(x) = 0 (1.1.6)

is linear. The linearity of (1.1.6) can be checked right from the equation itself, without knowing what the solutions are (do it!). For an example, though, we might note that y = sin x is a solution of (1.1.6), that y = cos x is another solution of (1.1.6), and finally, by linearity, that the function y = c_1 sin x + c_2 cos x is a solution, whatever the constants c_1 and c_2. Now let's consider an instance of the first order linear equation (1.1.5):

y'(x) + xy(x) = 0. (1.1.7)
So we're looking for a function whose derivative is −x times the function. Evidently y = e^{−x^2/2} will do, and the general solution is y(x) = ce^{−x^2/2}.

If instead of (1.1.7) we had

y'(x) + x^2 y(x) = 0,

then we would have found the general solution ce^{−x^3/3}.
As a last example, take

y'(x) − (cos x) y(x) = 0. (1.1.8)

The right medicine is now y(x) = e^{sin x}. In the next paragraph we'll give the general rule of which the above are three examples. The reader might like to put down the book at this point and try to formulate the rule for solving (1.1.5) before going on to read about it.

Ready? What we need is to choose some antiderivative A(x) of a(x), and then the solution is y(x) = ce^{−A(x)}.
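This rule is easy to check numerically. The sketch below (ours, not the authors'; any numerical language would do) applies it to equation (1.1.7), where a(x) = x and A(x) = x^2/2, and verifies that a central-difference estimate of y' makes y' + xy vanish to within the accuracy of the difference quotient:

```python
import math

def y(x, c=2.5):
    # Candidate solution of y'(x) + x*y(x) = 0, from the rule y = c*e^{-A(x)}
    # with a(x) = x and antiderivative A(x) = x**2 / 2.
    return c * math.exp(-x**2 / 2)

def residual(x, h=1e-5):
    # Central-difference estimate of y'(x), plus x*y(x); should be nearly zero.
    yprime = (y(x + h) - y(x - h)) / (2 * h)
    return yprime + x * y(x)

for x in [-1.0, 0.0, 0.7, 2.0]:
    assert abs(residual(x)) < 1e-8
```

The constant c here is arbitrary, as the rule promises; changing it leaves the residual at the same tiny size.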
Since that was so easy, next let's put a more interesting right hand side into (1.1.5), by considering the equation

y'(x) + a(x)y(x) = b(x) (1.1.9)

where now b(x) is also a given function of x (Is (1.1.9) a linear equation? Are you sure?).

To solve (1.1.9), once again choose some antiderivative A(x) of a(x), and then note that we can rewrite (1.1.9) in the equivalent form

e^{−A(x)} (d/dx)(e^{A(x)} y(x)) = b(x).

Now if we multiply through by e^{A(x)} we see that

(d/dx)(e^{A(x)} y(x)) = b(x)e^{A(x)} (1.1.10)

so, if we integrate both sides,

e^{A(x)} y(x) = ∫^x b(t)e^{A(t)} dt + const., (1.1.11)

where on the right side, we mean any antiderivative of the function under the integral sign. Consequently

y(x) = e^{−A(x)} ( ∫^x b(t)e^{A(t)} dt + const. ). (1.1.12)
As an example, consider the equation

y' + y/x = x + 1. (1.1.13)

We find that A(x) = log x, then from (1.1.12) we get

y(x) = (1/x) ( ∫^x (t + 1)t dt + C ) = x^2/3 + x/2 + C/x. (1.1.14)
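As a quick check (ours) that (1.1.14) really satisfies (1.1.13), exact rational arithmetic confirms y' + y/x = x + 1 for several sample values of x and C; the derivative of (1.1.14) is entered by hand:

```python
from fractions import Fraction as F

def solves(x, C):
    # y(x) = x^2/3 + x/2 + C/x is the claimed solution (1.1.14).
    y = x**2 / 3 + x / 2 + C / x
    # Its derivative, computed by hand: y' = 2x/3 + 1/2 - C/x^2.
    dy = 2 * x / 3 + F(1, 2) - C / x**2
    # The left side of (1.1.13) should equal x + 1 exactly.
    return dy + y / x == x + 1

assert all(solves(F(p), F(q)) for p, q in [(1, 0), (2, 5), (-3, 7)])
```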
We may be doing a disservice to the reader by beginning with this discussion of certain
types of differential equations that can be solved analytically, because it would be erroneous
to assume that most, or even many, such equations can be dealt with by these techniques.
Indeed, the reason for the importance of the numerical methods that are the main subject
of this chapter is precisely that most equations that arise in “real” problems are quite
intractable by analytical means, so the computer is the only hope.
Despite the above disclaimer, in the next section we will study yet another important
family of differential equations that can be handled analytically, namely linear equations
with constant coefficients.
Exercises 1.1
1. Find the general solution of each of the following equations:
(a) y' = 2 cos x

(b) y' + (2/x)y = 0

(c) y' + xy = 3

(d) y' + (1/x)y = x + 5

(e) 2yy' = x + 1
2. Show that the equation (1.1.4) has no real solutions.
3. Go to your computer or terminal and familiarize yourself with the equipment, the
operating system, and the specific software you will be using. Then write a program
that will calculate and print the sum of the squares of the integers 1, 2, . . . , 100. Run
this program.
4. For each part of problem 1, find the solution for which y(1) = 1.
1.2 Linear equations with constant coefficients
One particularly pleasant, and important, type of linear differential equation is the variety
with constant coefficients, such as
y'' + 3y' + 2y = 0. (1.2.1)
It turns out that what we have to do to solve these equations is to try a solution of a certain
form, and we will then find that all of the solutions indeed are of that form.
Let's see if the function y(x) = e^{αx} is a solution of (1.2.1). If we substitute in (1.2.1), and then cancel the common factor e^{αx}, we are left with the quadratic equation

α^2 + 3α + 2 = 0

whose solutions are α = −2 and α = −1. Hence for those two values of α our trial function y(x) = e^{αx} is indeed a solution of (1.2.1). In other words, e^{−2x} is a solution, e^{−x} is a solution, and since the equation is linear,

y(x) = c_1 e^{−2x} + c_2 e^{−x} (1.2.2)

is also a solution, where c_1 and c_2 are arbitrary constants. Finally, (1.2.2) must be the most general solution since it has the "right" number of arbitrary constants, namely two.
Trying a solution in the form of an exponential is always the correct first step in solving
linear equations with constant coefficients. Various complications can develop, however, as
illustrated by the equation
y'' + 4y' + 4y = 0. (1.2.3)

Again, let's see if there is a solution of the form y = e^{αx}. This time, substitution into (1.2.3) and cancellation of the factor e^{αx} leads to the quadratic equation

α^2 + 4α + 4 = 0, (1.2.4)
whose two roots are identical, both being −2. Hence e^{−2x} is a solution, and of course so is c_1 e^{−2x}, but we don't yet have the general solution because there is, so far, only one arbitrary constant. The difficulty, of course, is caused by the fact that the roots of (1.2.4) are not distinct.

In this case, it turns out that xe^{−2x} is another solution of the differential equation (1.2.3) (verify this), so the general solution is (c_1 + c_2 x)e^{−2x}.
Suppose that we begin with an equation of third order, and that all three roots turn out to be the same. For instance, to solve the equation

y''' + 3y'' + 3y' + y = 0 (1.2.5)

we would try y = e^{αx}, and we would then be facing the cubic equation

α^3 + 3α^2 + 3α + 1 = 0, (1.2.6)

whose "three" roots are all equal to −1. Now, not only is e^{−x} a solution, but so are xe^{−x} and x^2 e^{−x}.
To see why this procedure works in general, suppose we have a linear differential equation with constant coefficients, say

y^{(n)} + a_1 y^{(n−1)} + a_2 y^{(n−2)} + ··· + a_n y = 0. (1.2.7)

If we try to find a solution of the usual exponential form y = e^{αx}, then after substitution into (1.2.7) and cancellation of the common factor e^{αx}, we would find the polynomial equation

α^n + a_1 α^{n−1} + a_2 α^{n−2} + ··· + a_n = 0. (1.2.8)
The polynomial on the left side is called the characteristic polynomial of the given differential equation. Suppose now that a certain number α = α* is a root of (1.2.8) of multiplicity p. To say that α* is a root of multiplicity p of the equation is to say that (α − α*)^p is a factor of the characteristic polynomial. Now look at the left side of the given differential equation (1.2.7). We can write it in the form

(D^n + a_1 D^{n−1} + a_2 D^{n−2} + ··· + a_n)y = 0, (1.2.9)

in which D is the differential operator d/dx. In the parentheses in (1.2.9) we see the polynomial ϕ(D), where ϕ is exactly the characteristic polynomial in (1.2.8).
Since ϕ(α) has the factor (α − α*)^p, it follows that ϕ(D) has the factor (D − α*)^p, so the left side of (1.2.9) can be written in the form

g(D)(D − α*)^p y = 0, (1.2.10)

where g is a polynomial of degree n − p. Now it's quite easy to see that y = x^k e^{α*x} satisfies (1.2.10) (and therefore (1.2.7) also) for each k = 0, 1, . . . , p − 1. Indeed, if we substitute this function y into (1.2.10), we see that it is enough to show that

(D − α*)^p (x^k e^{α*x}) = 0 for k = 0, 1, . . . , p − 1. (1.2.11)
However, (D − α*)(x^k e^{α*x}) = kx^{k−1} e^{α*x}, and if we apply (D − α*) again,

(D − α*)^2 (x^k e^{α*x}) = k(k − 1)x^{k−2} e^{α*x},

etc. Now since k < p it is clear that (D − α*)^p (x^k e^{α*x}) = 0, as claimed.
To summarize, then, if we encounter a root α* of the characteristic equation, of multiplicity p, then corresponding to α* we can find exactly p linearly independent solutions of the differential equation, namely

e^{α*x}, xe^{α*x}, x^2 e^{α*x}, . . . , x^{p−1} e^{α*x}. (1.2.12)

Another way to state it is to say that the portion of the general solution of the given differential equation that corresponds to a root α* of the characteristic polynomial equation is Q(x)e^{α*x}, where Q(x) is an arbitrary polynomial whose degree is one less than the multiplicity of the root α*.
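The repeated-root rule can be spot-checked numerically. The sketch below (our check, with arbitrarily chosen constants) verifies that (c_1 + c_2 x)e^{−2x} satisfies equation (1.2.3), using central differences for y' and y'':

```python
import math

def y(x, c1=1.3, c2=-0.7):
    # General solution (c1 + c2*x)e^{-2x} claimed for y'' + 4y' + 4y = 0.
    return (c1 + c2 * x) * math.exp(-2 * x)

def residual(x, h=1e-4):
    # Central-difference approximations of y'(x) and y''(x).
    d1 = (y(x + h) - y(x - h)) / (2 * h)
    d2 = (y(x + h) - 2 * y(x) + y(x - h)) / h**2
    # Left side of (1.2.3); should vanish up to discretization error.
    return d2 + 4 * d1 + 4 * y(x)

assert all(abs(residual(x)) < 1e-6 for x in [0.0, 0.5, 1.0, 2.0])
```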
One last mild complication may arise from roots of the characteristic equation that are not real numbers. These don't really require any special attention, but they do present a few options. For instance, to solve y'' + 4y = 0, we find the characteristic equation α^2 + 4 = 0, and the complex roots ±2i. Hence the general solution is obtained by the usual rule as

y(x) = c_1 e^{2ix} + c_2 e^{−2ix}. (1.2.13)

This is a perfectly acceptable form of the solution, but we could make it look a bit prettier by using deMoivre's theorem, which says that

e^{2ix} = cos 2x + i sin 2x
e^{−2ix} = cos 2x − i sin 2x. (1.2.14)

Then our general solution would look like

y(x) = (c_1 + c_2) cos 2x + (ic_1 − ic_2) sin 2x. (1.2.15)
But c_1 and c_2 are just arbitrary constants, hence so are c_1 + c_2 and ic_1 − ic_2, so we might as well rename them c_1 and c_2, in which case the solution would take the form

y(x) = c_1 cos 2x + c_2 sin 2x. (1.2.16)
Here's an example that shows the various possibilities:

y^{(8)} − 5y^{(7)} + 27y^{(6)} − 97y^{(5)} + 245y^{(4)} − 531y^{(3)} + 765y^{(2)} − 567y' + 162y = 0. (1.2.17)

The equation was cooked up to have a characteristic polynomial that can be factored as

(α − 2)(α^2 + 9)^2 (α − 1)^3. (1.2.18)

Hence the roots of the characteristic equation are 2 (simple), 3i (multiplicity 2), −3i (multiplicity 2), and 1 (multiplicity 3).
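Since the equation was cooked up from the factored polynomial (1.2.18), its coefficients can be recovered by multiplying out the factors; a short script (ours) does the bookkeeping:

```python
def polymul(p, q):
    # Multiply two polynomials given as coefficient lists, highest power first.
    r = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            r[i + j] += a * b
    return r

# (alpha - 2)(alpha^2 + 9)^2 (alpha - 1)^3, the factorization (1.2.18)
poly = [1]
for factor in [[1, -2], [1, 0, 9], [1, 0, 9], [1, -1], [1, -1], [1, -1]]:
    poly = polymul(poly, factor)

# Coefficients of the characteristic polynomial, highest power first.
assert poly == [1, -5, 27, -97, 245, -531, 765, -567, 162]
```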
Corresponding to the root 2, the general solution will contain the term c_1 e^{2x}. Corresponding to the double root at 3i we have terms (c_2 + c_3 x)e^{3ix} in the solution. From the double root at −3i we get a contribution (c_4 + c_5 x)e^{−3ix}, and finally from the triple root at 1 we get (c_6 + c_7 x + c_8 x^2)e^x. The general solution is the sum of these eight terms. Alternatively, we might have taken the four terms that come from ±3i in the form

(c_2 + c_3 x) cos 3x + (c_4 + c_5 x) sin 3x. (1.2.19)
Exercises 1.2
1. Obtain the general solutions of each of the following differential equations:
(a) y'' + 5y' + 6y = 0

(b) y'' − 8y' + 7y = 0

(c) (D + 3)^2 y = 0

(d) (D^2 + 16)^2 y = 0

(e) (D + 3)^3 (D^2 − 25)^2 (D + 2)^3 y = 0

2. Find a curve y = f(x) that passes through the origin with unit slope, and which satisfies (D + 4)(D − 1)y = 0.
1.3 Difference equations
Whereas a differential equation is an equation in an unknown function, a difference equation is an equation in an unknown sequence. For example, suppose we know that a certain sequence of numbers y_0, y_1, y_2, . . . satisfies the following conditions:

y_{n+2} + 5y_{n+1} + 6y_n = 0 (n = 0, 1, 2, . . .) (1.3.1)

and furthermore that y_0 = 1 and y_1 = 3.
Evidently, we can compute as many of the y_n's as we need from (1.3.1); thus we would get y_2 = −21, y_3 = 87, y_4 = −309, and so forth. The entire sequence of y_n's is determined by the difference equation (1.3.1) together with the two starting values.
Such equations are encountered when differential equations are solved on computers.
Naturally, the computer can provide the values of the unknown function only at a discrete
set of points. These values are computed by replacing the given differential equations by
a difference equation that approximates it, and then calculating successive approximate
values of the desired function from the difference equation.
Can we somehow “solve” a difference equation by obtaining a formula for the values
of the solution sequence? The answer is that we can, as long as the difference equation is
linear and has constant coefficients, as in (1.3.1). Just as in the case of differential equations
with constant coefficients, the correct strategy for solving them is to try a solution of the right form. In the previous section, the right form to try was y(x) = e^{αx}. Now the winning combination is y = α^n, where α is a constant.
In fact, let's substitute α^n for y_n in (1.3.1) to see what happens. The left side becomes

α^{n+2} + 5α^{n+1} + 6α^n = α^n (α^2 + 5α + 6) = 0. (1.3.2)

Just as we were able to cancel the common factor e^{αx} in the differential equation case, so here we can cancel the α^n, and we're left with the quadratic equation

α^2 + 5α + 6 = 0. (1.3.3)
The two roots of this characteristic equation are α = −2 and α = −3. Therefore the sequence (−2)^n satisfies (1.3.1) and so does (−3)^n. Since the difference equation is linear, it follows that

y_n = c_1 (−2)^n + c_2 (−3)^n (1.3.4)

is also a solution, whatever the values of the constants c_1 and c_2.
Now it is evident from (1.3.1) itself that the numbers y_n are uniquely determined if we prescribe the values of just two of them. Hence, it is very clear that when we have a solution that contains two arbitrary constants we have the most general solution.

When we take account of the given data y_0 = 1 and y_1 = 3, we get the two equations

1 = c_1 + c_2
3 = (−2)c_1 + (−3)c_2 (1.3.5)

from which c_1 = 6 and c_2 = −5. Finally, we use these values of c_1 and c_2 in (1.3.4) to get

y_n = 6(−2)^n − 5(−3)^n (n = 0, 1, 2, . . .). (1.3.6)

Equation (1.3.6) is the desired formula that represents the unique solution of the given difference equation together with the prescribed starting values.
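A few lines of code (ours) confirm the formula: run the recurrence (1.3.1) from the starting values and compare with (1.3.6) term by term, in exact integer arithmetic:

```python
# Generate the sequence directly from y_{n+2} = -5*y_{n+1} - 6*y_n.
y = [1, 3]                                 # y_0 = 1, y_1 = 3
for n in range(30):
    y.append(-5 * y[n + 1] - 6 * y[n])

# The closed form (1.3.6) should reproduce every term exactly.
assert all(y[n] == 6 * (-2)**n - 5 * (-3)**n for n in range(32))
```

The first few terms are 1, 3, −21, 87, −309, . . . , matching the values computed earlier.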
Let’s step back a few paces to get a better view of the solution. Notice that the formula
(1.3.6) expresses the solution as a linear combination of nth powers of the roots of the
associated characteristic equation (1.3.3). When n is very large, is the number y_n a large
number or a small one? Evidently the powers of −3 overwhelm those of −2, so the sequence
will behave roughly like a constant times powers of −3. This means that we should expect
the members of the sequence to alternate in sign and to grow rapidly in magnitude.
So much for the equation (1.3.1). Now let's look at the general case, in the form of a linear difference equation of order p:

y_{n+p} + a_1 y_{n+p−1} + a_2 y_{n+p−2} + ··· + a_p y_n = 0. (1.3.7)

We try a solution of the form y_n = α^n, and after substituting and canceling, we get the characteristic equation

α^p + a_1 α^{p−1} + a_2 α^{p−2} + ··· + a_p = 0. (1.3.8)
This is a polynomial equation of degree p, so it has p roots, counting multiplicities, somewhere in the complex plane.

Let α* be one of these p roots. If α* is simple (i.e., has multiplicity 1) then the part of the general solution that corresponds to α* is c(α*)^n. If, however, α* is a root of multiplicity k > 1 then we must multiply the solution c(α*)^n by an arbitrary polynomial in n, of degree k − 1, just as in the corresponding case for differential equations we used an arbitrary polynomial in x of degree k − 1.
We illustrate this, as well as the case of complex roots, by considering the following difference equation of order five:

y_{n+5} − 5y_{n+4} + 9y_{n+3} − 9y_{n+2} + 8y_{n+1} − 4y_n = 0. (1.3.9)

This example is rigged so that the characteristic equation can be factored as

(α^2 + 1)(α − 2)^2 (α − 1) = 0 (1.3.10)

from which the roots are obviously i, −i, 2 (multiplicity 2), 1.
Corresponding to the roots i, −i, the portion of the general solution is c_1 i^n + c_2 (−i)^n. Since

i^n = e^{inπ/2} = cos(nπ/2) + i sin(nπ/2) (1.3.11)

and similarly for (−i)^n, we can also take this part of the general solution in the form

c_1 cos(nπ/2) + c_2 sin(nπ/2). (1.3.12)
The double root α = 2 contributes (c_3 + c_4 n)2^n, and the simple root α = 1 adds c_5 to the general solution, which in its full glory is

y_n = c_1 cos(nπ/2) + c_2 sin(nπ/2) + (c_3 + c_4 n)2^n + c_5. (1.3.13)

The five constants would be determined by prescribing five initial values, say y_0, y_1, y_2, y_3 and y_4, as we would expect for the equation (1.3.9).
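Because every ingredient here is an integer for integer n (cos(nπ/2) and sin(nπ/2) just cycle through 1, 0, −1, 0), a short script (ours, with constants picked arbitrarily) verifies exactly that (1.3.13) satisfies (1.3.9) for every n:

```python
def cosq(n): return [1, 0, -1, 0][n % 4]   # cos(n*pi/2), exact for integer n
def sinq(n): return [0, 1, 0, -1][n % 4]   # sin(n*pi/2)

def y(n, c=(2, -3, 1, 4, 7)):
    # The general solution (1.3.13) with arbitrarily chosen constants c1..c5.
    c1, c2, c3, c4, c5 = c
    return c1 * cosq(n) + c2 * sinq(n) + (c3 + c4 * n) * 2**n + c5

# The left side of (1.3.9) should vanish for every n, whatever the constants.
for n in range(20):
    lhs = (y(n + 5) - 5 * y(n + 4) + 9 * y(n + 3)
           - 9 * y(n + 2) + 8 * y(n + 1) - 4 * y(n))
    assert lhs == 0
```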
Exercises 1.3
1. Obtain the general solution of each of the following difference equations:
(a) y_{n+1} = 3y_n

(b) y_{n+1} = 3y_n + 2

(c) y_{n+2} − 2y_{n+1} + y_n = 0

(d) y_{n+2} − 8y_{n+1} + 12y_n = 0

(e) y_{n+2} − 6y_{n+1} + 9y_n = 1

(f) y_{n+2} + y_n = 0

2. Find the solution of the given difference equation that takes the prescribed initial values:

(a) y_{n+2} = 2y_{n+1} + y_n; y_0 = 0; y_1 = 1

(b) y_{n+1} = αy_n + β; y_0 = 1

(c) y_{n+4} + y_n = 0; y_0 = 1; y_1 = −1; y_2 = 1; y_3 = −1

(d) y_{n+2} − 5y_{n+1} + 6y_n = 0; y_0 = 1; y_1 = 2
3. (a) For each of the difference equations in problems 1 and 2, evaluate

lim_{n→∞} y_{n+1}/y_n (1.3.14)

if it exists.

(b) Formulate and prove a general theorem about the existence of, and value of the limit in part (a) for a linear difference equation with constant coefficients.

(c) Reverse the process: given a polynomial equation, find its root of largest absolute value by computing from a certain difference equation and evaluating the ratios of consecutive terms.

(d) Write a computer program to implement the method in part (c). Use it to calculate the largest root of the equation

x^8 = x^7 + x^6 + x^5 + ··· + 1. (1.3.15)
1.4 Computing with difference equations
This is, after all, a book about computing, so let's begin with computing from difference equations, since they will give us a chance to discuss some important questions that concern the design of computer programs. For a sample difference equation we'll use

y_{n+3} = y_{n+2} + 5y_{n+1} + 3y_n (1.4.1)

together with the starting values y_0 = y_1 = y_2 = 1. The reader might want, just for practice, to find an explicit formula for this sequence by the methods of the previous section.
Suppose we want to compute a large number of these y's in order to verify some property that they have, for instance to check that

lim_{n→∞} y_{n+1}/y_n = 3 (1.4.2)

which must be true since 3 is the root of largest absolute value of the characteristic equation.
As a first approach, we might declare y to be a linear array of some size large enough to accommodate the expected length of the calculation. Then the rest is easy. For each n, we would calculate the next y_{n+1} from (1.4.1), and we would divide it by its predecessor y_n to get a new ratio. If the new ratio agrees sufficiently well with the previous ratio we announce that the computation has terminated and print the new ratio as our answer. Otherwise, we move the new ratio to the location of the old ratio, increase n and try again.

If we were to write this out as a formal procedure (algorithm) it might look like:
y_0 := 1; y_1 := 1; y_2 := 1; n := 2;
newrat := −10; oldrat := 1;
while |newrat − oldrat| ≥ 0.000001 do
    oldrat := newrat; n := n + 1;
    y_n := y_{n−1} + 5 ∗ y_{n−2} + 3 ∗ y_{n−3};
    newrat := y_n / y_{n−1}
endwhile
print newrat; Halt.
We’ll use the symbol ‘:=’ to mean that we are to compute the quantity on the right, if
necessary, and then store it in the place named on the left. It can be read as ‘is replaced
by’ or ‘is assigned.’ Also, the block that begins with ‘while’ and ends with ‘endwhile’
represents a group of instructions that are to be executed repeatedly until the condition
that follows ‘while’ becomes false, at which point the line following ‘endwhile’ is executed.
The procedure just described is fast, but it uses lots of storage. If, for instance, such a
program needed to calculate 79 y’s before convergence occurred, then it would have used
79 locations of array storage. In fact, the problem above doesn’t need that many locations
because convergence happens a lot sooner. Suppose you wanted to find out how much
sooner, given only a programmable hand calculator with ten or twenty memory locations.
Then you might appreciate a calculation procedure that needs just four locations to hold
all necessary y’s.
That’s fairly easy to accomplish, though. At any given moment in the program, what
we need to find the next y are just the previous three y’s. So why not save only those three?
We’ll use the previous three to calculate the next one, and stow it for a moment in a fourth
location. Then we’ll compute the new ratio and compare it with the old. If they’re not
close enough, we move each one of the three newest y’s back one step into the places where
we store the latest three y’s and repeat the process. Formally, it might be:
y := 1; ym1 := 1; ym2 := 1;
newrat := −10; oldrat := 1;
while |newrat − oldrat| ≥ 0.000001 do
    ym3 := ym2; ym2 := ym1; ym1 := y;
    oldrat := newrat;
    y := ym1 + 5 ∗ ym2 + 3 ∗ ym3;
    newrat := y/ym1
endwhile;
print newrat; Halt.
The calculation can now be done in exactly six memory locations (y, ym1, ym2, ym3,
oldrat, newrat) no matter how many y’s have to be calculated, so you can undertake it
on your hand calculator with complete confidence. The price that we pay for the memory
saving is that we must move the data around a bit more.
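In a present-day language the second procedure is only a few lines. Here is a direct Python transcription (ours); the tuple assignment does the data shuffling in one step:

```python
# Four-location scheme for the ratio limit of (1.4.1): keep only the
# latest three y's (ym1, ym2, ym3 after the shift) plus the new y.
y, ym1, ym2 = 1.0, 1.0, 1.0          # y_2, y_1, y_0
newrat, oldrat = -10.0, 1.0

while abs(newrat - oldrat) >= 0.000001:
    ym3, ym2, ym1 = ym2, ym1, y      # shift the saved y's back one step
    oldrat = newrat
    y = ym1 + 5 * ym2 + 3 * ym3      # the difference equation (1.4.1)
    newrat = y / ym1

print(newrat)                         # approaches 3, the dominant root
```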
One should not think that such programming methods are only for hand calculators. As
we progress through the numerical solution of differential equations we will see situations
in which each of the quantities that appears in the difference equation will itself be an
array (!), and that very large numbers, perhaps thousands, of these arrays will need to be
computed. Even large computers might quake at the thought of using the first method
above, rather than the second, for doing the calculation. Fortunately, it will almost never
be necessary to save in memory all of the computed values simultaneously. Normally, they
will be computed, and then printed or plotted, and never needed except in the calculation
of their immediate successors.
Exercises 1.4
1. The Fibonacci numbers F_0, F_1, F_2, . . . are defined by the recurrence formula F_{n+2} = F_{n+1} + F_n for n = 0, 1, 2, . . . together with the starting values F_0 = 0, F_1 = 1.

(a) Write out the first ten Fibonacci numbers.

(b) Derive an explicit formula for the nth Fibonacci number F_n.

(c) Evaluate your formula for n = 0, 1, 2, 3, 4.

(d) Prove directly from your formula that the Fibonacci numbers are integers (This is perfectly obvious from their definition, but is not so obvious from the formula!).

(e) Evaluate

lim_{n→∞} F_{n+1}/F_n (1.4.3)

(f) Write a computer program that will compute Fibonacci numbers and print out the limit in part (e) above, correct to six decimal places.

(g) Write a computer program that will compute the first 40 members of the modified Fibonacci sequence in which F_0 = 1 and F_1 = (1 − √5)/2. Do these computed numbers seem to be approaching zero? Explain carefully what you see and why it happens.

(h) Modify the program of part (g) to run in higher (or double) precision arithmetic. Does it change any of your answers?
2. Find the most general solution of each of the following difference equations:

(a) y_{n+1} − 2y_n + y_{n−1} = 0

(b) y_{n+1} = 2y_n

(c) y_{n+2} + y_n = 0

(d) y_{n+2} + 3y_{n+1} + 3y_n + y_{n−1} = 0
1.5 Stability theory
In the study of natural phenomena it is most often true that a small change in conditions
will produce just a small change in the state of the system being studied. If, for example,
a very slight increase in atmospheric pollution could produce dramatically large changes
in populations of flora and fauna, or if tiny variations in the period of the earth’s rotation
produced huge changes in climatic conditions, the world would be a very different place to
live in, or to try to live in. In brief, we may say that most aspects of nature are stable.
When physical scientists attempt to understand some facet of nature, they often will
make a mathematical model. This model will usually not faithfully reproduce all of the
structure of the original phenomenon, but one hopes that the important features of the
system will be preserved in the model, so that predictions will be possible. One of the most
important features to preserve is that of stability.
For instance, the example of atmospheric pollution and its effect on living things referred
to above is important and very complex. Therefore considerable effort has gone into the
construction of mathematical models that will allow computer studies of the effects of
atmospheric changes. One of the first tests to which such a model should be subjected is
that of stability: does it faithfully reproduce the observed fact that small changes produce
small changes? What is true in nature need not be true in a man-made model that is a
simplification or idealization of the real world.
Now suppose that we have gotten ourselves over this hurdle, and we have constructed
a model that is indeed stable. The next step might be to go to the computer and do
calculations from the model, and to use these calculations for predicting the effects of
various proposed actions. Unfortunately, yet another layer of approximation is usually
introduced at this stage, because the model, even though it is a simplification of the real
world, might still be too complicated to solve exactly on a computer.
For instance, many models use differential equations. Models of the weather, of the motion of fluids, of the movement of astronomical objects, of spacecraft, of population growth, of predator-prey relationships, of electric circuit transients, and so forth, all involve differential equations. Digital computers solve differential equations by approximating them by
difference equations, and then solving the difference equations. Even though the differential
equation that represents our model is indeed stable, it may be that the difference equation
that we use on the computer is no longer stable, and that small changes in initial data on
the computer, or small roundoff errors, will produce not small but very large changes in the
computed solution.
An important job of the numerical analyst is to make sure that this does not happen,
and we will find that this theme of stability recurs throughout our study of computer
approximations.
As an example of instability in differential equations, suppose that some model of a system led us to the equation

y'' − y' − 2y = 0 (1.5.1)

together with the initial data

y(0) = 1; y'(0) = −1. (1.5.2)
We are thinking of the independent variable t as the time, so we will be interested in
the solution as t becomes large and positive.
The general solution of (1.5.1) is y(t) = c_1 e^{−t} + c_2 e^{2t}. The initial conditions tell us that c_1 = 1 and c_2 = 0, hence the solution of our problem is y(t) = e^{−t}, and it represents a
function that decays rapidly to zero with increasing t. In fact, when t = 10, the solution
has the value 0.000045.
Now let’s change the initial data (1.5.2) just a bit, by asking for a solution with y'(0) = −0.999. It’s easy to check that the solution is now
y(t) = (0.999666 . . .)e^{−t} + (0.000333 . . .)e^{2t} (1.5.3)
instead of just y(t) = e^{−t}. If we want the value of the solution at t = 10, we would find that it has changed from 0.000045 to about 161,722.
At t = 20 the change is even more impressive, from 0.000000002 to about 7.8 × 10^{13}, just from changing the initial value of y'(0) from −1 to −0.999. Let’s hope that there are no phenomena in nature that behave in this way, or our lives hang by a slender thread indeed!
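To see the size of this effect concretely, here is a short computation (a sketch in Python; the text itself prescribes no software at this point) that builds the perturbed solution from its two exponential terms:

```python
from math import exp

# Coefficients of y(t) = c1*e^{-t} + c2*e^{2t} from the perturbed data
# y(0) = 1, y'(0) = -0.999:
#   c1 + c2 = 1  and  -c1 + 2*c2 = -0.999,  so  c2 = 0.001/3.
c2 = 0.001 / 3
c1 = 1 - c2

def y(t):
    return c1 * exp(-t) + c2 * exp(2 * t)

print(y(10))     # dominated by the rising term: about 1.6e5
print(exp(-10))  # the unperturbed solution at t = 10: about 0.000045
```

The tiny coefficient c2 is no protection: e^{2t} quickly overwhelms it.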
Now exactly what is the reason for the observed instability of the equation (1.5.1)? The general solution of the equation contains a falling exponential term c_1 e^{−t}, and a rising exponential term c_2 e^{2t}. By prescribing the initial data (1.5.2) we suppressed the growing term, and picked out only the decreasing one. A small change in the initial data, however, results in the presence of both terms in the solution.
Now it’s time for a formal
Definition: A differential equation is said to be stable if for every set of initial data (at
t = 0) the solution of the differential equation remains bounded as t approaches infinity.
A differential equation is called strongly stable if, for every set of initial data (at t = 0)
the solution not only remains bounded, but approaches zero as t approaches infinity.
What makes the equation (1.5.1) unstable, then, is the presence of a rising exponential
in its general solution. In other words, if we have a differential equation whose general
solution contains a term e^{αt} in which α is positive, that equation is unstable.
Let’s restrict attention now to linear differential equations with constant coefficients.
We know from section 1.2 that the general solution of such an equation is a sum of terms
of the form
(polynomial in t) e^{αt}. (1.5.4)
Under what circumstances does such a term remain bounded as t becomes large and posi-
tive?
Certainly if α is negative then the term stays bounded. Likewise, if α is a complex
number and its real part is negative, then the term remains bounded. If α has positive real
part the term is unbounded.
This takes care of all of the possibilities except the case where α is zero, or more generally,
the complex number α has zero real part (is purely imaginary). In that case the question
of whether (polynomial in t)e^{αt} remains bounded depends on whether the “polynomial in t”
is of degree zero (a constant polynomial) or of higher degree. If the polynomial is constant
then the term does indeed remain bounded for large positive t, whereas otherwise the term
will grow as t gets large, for some values of the initial conditions, thereby violating the
definition of stability.
Now recall that the “polynomial in t” is in fact a constant if the root α is a simple
root of the characteristic equation of the differential equation, and otherwise it is of higher
degree. This observation completes the proof of the following:
Theorem 1.5.1 A linear differential equation with constant coefficients is stable if and
only if all of the roots of its characteristic equation lie in the left half plane, and those that
lie on the imaginary axis, if any, are simple. Such an equation is strongly stable if and only
if all of the roots of its characteristic equation lie in the left half plane, and none lie on the
imaginary axis.
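The theorem reduces stability testing to locating roots. For a second-order equation y'' + by' + cy = 0 the test can be mechanized in a few lines; the following Python sketch is ours (the name classify and the restriction to second order are illustrative choices, not part of the text):

```python
import cmath

def classify(b, c):
    # Roots of the characteristic equation r^2 + b*r + c = 0.
    d = cmath.sqrt(b * b - 4 * c)
    r1, r2 = (-b + d) / 2, (-b - d) / 2
    if r1.real < 0 and r2.real < 0:
        return "strongly stable"   # both roots in the open left half plane
    if r1.real > 0 or r2.real > 0:
        return "unstable"          # a root in the right half plane
    # At least one root on the imaginary axis: stable only if it is simple.
    return "stable" if r1 != r2 else "unstable"

print(classify(-1, -2))  # y'' - y' - 2y = 0, roots 2 and -1: unstable
print(classify(5, 6))    # y'' + 5y' + 6y = 0, roots -2 and -3: strongly stable
```

For equations of higher order one would compute all the roots of the characteristic polynomial and apply the same test to each.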
Exercises 1.5
1. Determine for each of the following differential equations whether it is strongly stable,
stable, or unstable.
(a) y'' − 5y' + 6y = 0
(b) y'' + 5y' + 6y = 0
(c) y'' + 3y = 0
(d) (D + 3)^3 (D + 1)y = 0
(e) (D + 1)^2 (D^2 + 1)^2 y = 0
(f) (D^4 + 1)y = 0
2. Make a list of some natural phenomena that you think are unstable. Discuss.
3. The differential equation y'' − y = 0 is to be solved with the initial conditions y(0) = 1, y'(0) = −1, and then solved again with y(0) = 1, y'(0) = −0.99. Compare the two solutions when x = 20.
4. For exactly which real values of the parameter λ is each of the following differential
equations stable? . . . strongly stable?
(a) y'' + (2 + λ)y' + y = 0
(b) y'' + λy' + y = 0
(c) y' + λy = 1
1.6 Stability theory of difference equations
In the previous section we discussed the stability of differential equations. The key ideas
were that such an equation is stable if every one of its solutions remains bounded as t
approaches infinity, and strongly stable if the solutions actually approach zero.
Similar considerations apply to difference equations, and for similar reasons. As an
example, take the equation
y_{n+1} = (5/2) y_n − y_{n−1} (n ≥ 1) (1.6.1)
along with the initial equations
y_0 = 1; y_1 = 0.5. (1.6.2)
It’s easy to see that the solution is y_n = 2^{−n}, and of course, this is a function that rapidly approaches zero with increasing n.
Now let’s change the initial data (1.6.2), say to
y_0 = 1; y_1 = 0.50000001 (1.6.3)
instead of (1.6.2).
The solution of the difference equation with these new data is
y_n = (0.0000000066 . . .)2^n + (0.9999999933 . . .)2^{−n}. (1.6.4)
The point is that the coefficient of the growing term 2^n is small, but 2^n grows so fast that after a while the first term in (1.6.4) will be dominant. For example, when n = 30, the solution is y_30 = 7.16, compared to the value y_30 = 0.0000000009 of the solution with the original initial data (1.6.2). A change of one part in fifty million in the initial condition produced, thirty steps later, an answer one billion times as large.
The fault lies with the difference equation, because it has both rising and falling com-
ponents to its general solution. It should be clear that it is hopeless to do extended com-
putation with an unstable difference equation, since a small roundoff error may alter the
solution beyond recognition several steps later.
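The blowup is easy to reproduce. This sketch (in Python; the details of the loop are ours) simply iterates (1.6.1) from the perturbed starting values (1.6.3):

```python
# Iterate y_{n+1} = (5/2) y_n - y_{n-1} from the perturbed data (1.6.3)
# and compare with the exact unperturbed solution 2^{-n}.
prev, cur = 1.0, 0.50000001
for n in range(1, 30):               # advance from y_1 up to y_30
    prev, cur = cur, 2.5 * cur - prev
print(cur)        # about 7.16, versus the exact 2^{-30} ≈ 0.0000000009
```

Note that even with the exact data (1.6.2), the rounding of 0.5·y_n in floating-point arithmetic would eventually seed the growing term in just the same way.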
As in the case of differential equations, we’ll say that a difference equation is stable
if every solution remains bounded as n grows large, and that it is strongly stable if every
solution approaches zero as n grows large. Again, we emphasize that every solution must
be well behaved, not just the solution that is picked out by a certain set of initial data. In
other words, the stability, or lack of it, is a property of the equation and not of the starting
values.
Now consider the case where the difference equation is linear with constant coefficients. Then we know that the general solution is a sum of terms of the form
(polynomial in n) α^n. (1.6.5)
Under what circumstances will such a term remain bounded or approach zero?
Suppose |α| < 1. Then the powers of α approach zero, and multiplication by a polynomial in n does not alter that conclusion. Suppose |α| > 1. Then the sequence of powers grows unboundedly, and multiplication by a nonzero polynomial only speeds the parting guest.
Finally suppose the complex number α has absolute value 1. Then the sequence of its powers remains bounded (in fact they all have absolute value 1), but if we multiply by a nonconstant polynomial, the resulting expression would grow without bound.
To summarize then, the term (1.6.5), if the polynomial is not identically zero, approaches zero with increasing n if and only if |α| < 1. It remains bounded as n increases if and only if either (a) |α| < 1 or (b) |α| = 1 and the polynomial is of degree zero (a constant). Now we have proved:
Theorem 1.6.1 A linear difference equation with constant coefficients is stable if and only
if all of the roots of its characteristic equation have absolute value at most 1, and those of
absolute value 1 are simple. The equation is strongly stable if and only if all of the roots
have absolute value strictly less than 1.
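As with Theorem 1.5.1, the test is mechanical once the roots are in hand. Here is a sketch for the second-order case y_{n+2} + b y_{n+1} + c y_n = 0 (Python; the function name and the restriction to second order are our own illustrative choices):

```python
import cmath

def classify(b, c):
    # Roots of the characteristic equation r^2 + b*r + c = 0.
    d = cmath.sqrt(b * b - 4 * c)
    r1, r2 = (-b + d) / 2, (-b - d) / 2
    if abs(r1) < 1 and abs(r2) < 1:
        return "strongly stable"   # both roots inside the unit circle
    if abs(r1) > 1 or abs(r2) > 1:
        return "unstable"          # a root outside the unit circle
    # At least one root on the unit circle: stable only if it is simple.
    return "stable" if r1 != r2 else "unstable"

print(classify(-2.5, 1))       # (1.6.1): roots 2 and 1/2, hence unstable
print(classify(0.25, -0.375))  # roots 1/2 and -3/4, hence strongly stable
```

The only change from the differential-equation test is the region: the unit disk replaces the left half plane.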
Exercises 1.6
1. Determine, for each of the following difference equations whether it is strongly stable,
stable, or unstable.
(a) y_{n+2} − 5y_{n+1} + 6y_n = 0
(b) 8y_{n+2} + 2y_{n+1} − 3y_n = 0
(c) 3y_{n+2} + y_n = 0
(d) 3y_{n+3} + 9y_{n+2} − y_{n+1} − 3y_n = 0
(e) 4y_{n+4} + 5y_{n+2} + y_n = 0
2. The difference equation 2y_{n+2} + 3y_{n+1} − 2y_n = 0 is to be solved with the initial conditions y_0 = 2, y_1 = 1, and then solved again with y_0 = 2, y_1 = 0.99. Compare y_20 for the two solutions.
3. For exactly which real values of the parameter λ is each of the following difference
equations stable? . . . strongly stable?
(a) y_{n+2} + λy_{n+1} + y_n = 0
(b) y_{n+1} + λy_n = 1
(c) y_{n+2} + y_{n+1} + λy_n = 0
= 0
4. (a) Consider the (constant-coefficient) difference equation
a_0 y_{n+p} + a_1 y_{n+p−1} + a_2 y_{n+p−2} + · · · + a_p y_n = 0. (1.6.6)
Show that this difference equation cannot be stable if |a_p/a_0| > 1.
(b) Give an example to show that the converse of the statement in part (a) is false. Namely, exhibit a difference equation for which |a_p/a_0| < 1 but the equation is unstable anyway.
Chapter 2
The Numerical Solution of
Differential Equations
2.1 Euler’s method
Our study of numerical methods will begin with a very simple procedure, due to Euler.
We will state it as a method for solving a single differential equation of first order. One of
the nice features of the subject of numerical integration of differential equations is that the
techniques that are developed for just one first order differential equation will apply, with
very little change, both to systems of simultaneous first order equations and to equations of
higher order. Hence the consideration of a single equation of first order, seemingly a very
special case, turns out to be quite general.
By an initial-value problem we mean a differential equation together with enough given values of the unknown function and its derivatives at an initial point x_0 to determine the solution uniquely.
Let’s suppose that we are given an initial-value problem of the form
y' = f(x, y); y(x_0) = y_0. (2.1.1)
Our job is to find numerical approximate values of the unknown function y at points x to the right of (larger than) x_0.
What we actually will find will be approximate values of the unknown function at a discrete set of points x_0, x_1 = x_0 + h, x_2 = x_0 + 2h, x_3 = x_0 + 3h, etc. At each of these points x_n we will compute y_n, our approximation to y(x_n).
Hence, suppose that the spacing h between consecutive points has been chosen. We propose to start at the point x_0 where the initial data are given, and move to the right, obtaining y_1 from y_0, then y_2 from y_1 and so forth until sufficiently many values have been found.
Next we need to derive a method by which each value of y can be obtained from its
immediate predecessor. Consider the Taylor series expansion of the unknown function y(x)
about the point x_n:
y(x_n + h) = y(x_n) + h y'(x_n) + (h^2/2) y''(X), (2.1.2)
where we have halted the expansion after the first power of h and in the remainder term, the point X lies between x_n and x_n + h.
Now equation (2.1.2) is exact, but of course it cannot be used for computation because
the point X is unknown. On the other hand, if we simply “forget” the error term, we’ll
have only an approximate relation instead of an exact one, with the consolation that we
will be able to compute from it. The approximate relation is
y(x_n + h) ≈ y(x_n) + h y'(x_n). (2.1.3)
Next define y_{n+1} to be the approximate value of y(x_{n+1}) that we obtain by using the right side of (2.1.3) instead of (2.1.2). Then we get
y_{n+1} = y_n + h y'_n. (2.1.4)
Now we have a computable formula for the approximate values of the unknown function, because the quantity y'_n can be found from the differential equation (2.1.1) by writing
y'_n = f(x_n, y_n), (2.1.5)
and if we do so then (2.1.4) takes the form
y_{n+1} = y_n + h f(x_n, y_n). (2.1.6)
This is Euler’s method, in a very explicit form, so that the computational procedure is
clear. Equation (2.1.6) is in fact a recurrence relation, or difference equation, whereby each
value of y_n is computed from its immediate predecessor.
Let’s use Euler’s method to obtain a numerical solution of the differential equation
y' = 0.5y (2.1.7)
together with the starting value y(0) = 1. The exact solution of this initial-value problem is obviously y(x) = e^{x/2}.
Concerning the approximate solution by Euler’s method, we have, by comparing (2.1.7) with (2.1.1), f(x, y) = 0.5y, so
y_{n+1} = y_n + h(y_n/2) = (1 + h/2) y_n. (2.1.8)
Therefore, in this example, each y_n will be obtained from its predecessor by multiplication by 1 + h/2. To be quite specific, let’s take h to be 0.05. Then we show below, for each value of x = 0, 0.05, 0.10, 0.15, 0.20, . . . the approximate value of y computed from (2.1.8) and the exact value y(x_n) = e^{x_n/2}:
x       Euler(x)    Exact(x)
0.00    1.00000     1.00000
0.05    1.02500     1.02532
0.10    1.05063     1.05127
0.15    1.07689     1.07788
0.20    1.10381     1.10517
0.25    1.13141     1.13315
...     ...         ...
1.00    1.63862     1.64872
2.00    2.68506     2.71828
3.00    4.39979     4.48169
...     ...         ...
5.00    11.81372    12.18249
10.00   139.56389   148.41316

table 1
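The Euler(x) column of table 1 takes only a few lines to reproduce. Here is one way (a Python sketch; the text itself uses no particular language at this point):

```python
from math import exp

h = 0.05
y = 1.0                      # y(0) = 1
for n in range(20):          # twenty steps of size 0.05 reach x = 1
    y = (1 + h / 2) * y      # the recurrence (2.1.8)
print(round(y, 5))           # the table's Euler(1.00)
print(round(exp(0.5), 5))    # the table's Exact(1.00)
```

Changing the range and rounding reproduces the other rows of the table.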
Considering the extreme simplicity of the approximation, it seems that we have done
pretty well by this equation. Let’s continue with this simple example by asking for a formula
for the numbers that are called Euler(x) in the above table. In other words, exactly what
function of x is Euler(x)?
To answer this, we note first that each computed value y_{n+1} is obtained according to (2.1.8) by multiplying its predecessor y_n by 1 + h/2. Since y_0 = 1, it is clear that we will compute y_n = (1 + h/2)^n. Now we want to express this in terms of x rather than n. Since x_n = nh, we have n = x/h, and since h = 0.05 we have n = 20x. Hence the computed approximation to y at a particular point x will be 1.025^{20x}, or equivalently
Euler(x) = (1.638616 . . .)^x. (2.1.9)
. (2.1.9)
The approximate values can now easily be compared with the true solution, since
Exact(x) = e^{x/2} = (e^{1/2})^x = (1.648721 . . .)^x. (2.1.10)
Therefore both the exact solution of this differential equation and its computed solution have the form (const.)^x. The correct value of “const.” is e^{1/2}, and the value that is, in effect, used by Euler’s method is (1 + h/2)^{1/h}. For a fixed value of x, we see that if we use Euler’s method with smaller and smaller values of h (neglecting the increase in roundoff error that is sure to result), the values Euler(x) will converge to Exact(x), because
lim_{h→0} (1 + h/2)^{1/h} = e^{1/2}. (2.1.11)
Exercises 2.1
1. Verify the limit (2.1.11).
2. Use a calculator or a computer to integrate each of the following differential equations
forward ten steps, using a spacing h = 0.05 with Euler’s method. Also tabulate the
exact solution at each value of x that occurs.
(a) y'(x) = x y(x); y(0) = 1
(b) y'(x) = x y(x) + 2; y(0) = 1
(c) y'(x) = y(x)/(1 + x); y(0) = 1
(d) y'(x) = −2x y(x)^2; y(0) = 10
2.2 Software notes
One of the main themes of our study will be the preparation of programs that not only
work, but also are easily readable and useable by other people. The act of communication
that must take place before a program can be used by persons other than its author is a
difficult one to carry out, and we will return several times to the principles that have evolved
as guides to the preparation of readable software. Here are some of these guidelines.
1. Documentation
The documentation of a program is the set of written instructions to a user that inform
the user about the purpose and operation of the program. At the moment that the job
of writing and testing a program has been completed it is only natural to feel an urge to
get the whole thing over with and get on to the next job. Besides, one might think, it’s
perfectly obvious how to use this program. Some programs may be obscure, but not this
one.
It is amazing how rapidly our knowledge of our very own program fades. If we come
back to a program after a lapse of a few months’ time, it often happens that we will have no
idea what the program did or how to use it, at least not without making a large investment
of time.
For that reason it is important that when our program has been written and tested it
should be documented immediately, while our memory of it is still green. Furthermore, the
best place for documentation is in the program itself, in “comment” statements. That way
one can be sure that when the comments are needed they will be available.
The first mission of program documentation is to describe the purpose of the program.
State clearly the problem that the program solves, or the exact operation that it performs
on its input in order to get its output.
Already in this first mission, a good bit of technical skill can be brought to bear that will be very helpful to the user, by intertwining the description of the program purpose with the names of the communicating variables in the program.
Let’s see what that means by considering an example. Suppose we have written a
subroutine that searches through a specified row of a matrix to find the element of largest
absolute value, and outputs a column in which it was found. Such a routine, in Maple for
instance, might look like this:
search:=proc(A,i)
  local j, winner, jwin;
  winner:=-1;
  for j from 1 to coldim(A) do
    if (abs(A[i,j])>winner) then
      winner:=abs(A[i,j]); jwin:=j
    fi
  od;
  return(jwin);
end;
Now let’s try our hand at documenting this program:
“The purpose of this program is to search a given row of a matrix to find an
element of largest absolute value and return the column in which it was found.”
That is pretty good documentation, perhaps better than many programs get. But we
can make it a lot more useful by doing the intertwining that we referred to above. There
we said that the description should be related to the communicating variables. Those
variables are the ones that the user can see. They are the input and output variables of
the subroutine. In most important computer languages, the communicating variables are
announced in the first line of the coding of a procedure or subroutine. Maple follows this
convention at least for the input variables, although the output variable is usually specified
in the “return” statement.
In the first line of the little subroutine above we see the list (A, i) of its input variables
(or “arguments”). These are the ones that the user has to understand, as opposed to the
other “local” variables that live inside the subroutine but don’t communicate with the
outside world (like j, winner, jwin, which are listed on the second line of the program).
The best way to help the user to understand these variables is to relate them directly
to the description of the purpose of the program.
“The purpose of this program is to search row i of a given matrix A to find an entry of largest absolute value, and to return the column jwin where that entry lives.”
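For readers following along in a language other than Maple, the same routine with the same style of documentation might look like this in Python (a sketch; the 0-based indexing is Python’s convention, not Maple’s):

```python
def search(A, i):
    """Search row i of the matrix A (a list of lists) for an entry of
    largest absolute value, and return the column jwin where that
    entry lives."""
    winner = -1.0
    jwin = None
    for j in range(len(A[i])):
        if abs(A[i][j]) > winner:
            winner = abs(A[i][j])
            jwin = j
    return jwin

print(search([[3, -7, 2], [0, 1, 0]], 0))  # column 1: -7 has the largest absolute value in row 0
```

The docstring does the same intertwining: it names the communicating variables A, i and jwin, and nothing else.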
We’ll come back to the subject of documentation, but now let’s mention another ingre-
dient of ease of use of programs, and that is:
2. Modularity
It is important to divide a long program into a number of smaller modules, each with a
clearly stated set of inputs and outputs, and each with its own documentation. That means
that we should get into the habit of writing lots of subroutines or procedures, because
the subroutine or procedure mode of expression forces one to be quite explicit about the
relationship of the block of coding to the rest of the world.
When we are writing a large program we would all write a subroutine if we found that
a certain sequence of steps was being called for repeatedly. Beyond this, however, there are
numerous inducements for breaking off subroutines even if the block of coding occurs just
once in the main program.
For one thing it’s easier to check out the program. The testing procedure would consist
of first testing each of the subroutines separately on test problems designed just for them.
Once the subroutines work, it would remain only to test their relationships to the calling
program.
For another reason, we might discover a better, faster, more elegant, or what-have-you
method of performing the task that one of these subroutines does. Then we would be able
to yank out the former subroutine and plug in the new one, while being careful only to make
sure that the new subroutine relates to the same inputs and outputs as the old one. If jobs
within a large program are not broken into subroutines it can be much harder to isolate the
block of coding that deals with a particular function and remove it without affecting the
whole works.
For another reason, if one be needed, it may well happen that even though the job that
is done by the subroutine occurs only once in the current program, it may recur in other
programs as yet undreamed of. If one is in the habit of writing small independent modules
and stringing them together to make large programs, then it doesn’t take long before one
has a library of useful subroutines, each one tested, working and documented, that will
greatly simplify the task of writing future programs.
Finally, the practice of subdividing the large jobs into the smaller jobs of which they are
composed is an extremely valuable analytical skill, one that is useful not only in program-
ming, but in all sorts of organizational activities where smaller efforts are to be pooled in
order to produce a larger effect. It is therefore a quality of mind that provides much of its
own justification.
In this book, the major programs that are the objects of study have been broken up into
subroutines in the expectation that the reader will be able to start writing and checking out
these modules even before the main ideas of the current subject have been fully explained.
This was done in part because some of these programs are quite complex, and it would be
unreasonable to expect the whole program to be written in a short time. It was also done
to give examples of the process of subdivision that we have been talking about.
For instance, the general linear algebra program for solving systems of linear simulta-
neous equations in Chapter 3, has been divided into six modules, and they are described in
section 3.3. The reader might wish to look ahead at those routines and to verify that even
though their relationship to the whole job of solving equations is by no means clear now,
nonetheless, because of the fact that they are independent and self-contained, they can be
programmed and checked out at any time without waiting for the full explanation.
One more ingredient that is needed for the production of useful software is:
3. Style
We don’t mean style in the sense of “class,” although this is as welcome in programming
as it is elsewhere. There have evolved a number of elements of good programming style,
and these will mainly be discussed as they arise. But two of them (one trivial and one quite
deep) are:
(a) Indentation: The instructions that lie within the range of a loop are indented in the
program listing further to the right than the instructions that announce that the loop
is about to begin, or that it has just terminated.
(b) Top-down structuring: When we visualize the overall logical structure of a compli-
cated program we see a grand loop, within which there are several other loops and
branchings, within which . . . etc. According to the principles of top-down design the
looping and branching structure of the program should be visible at once in the list-
ing. That is, we should see an announcement of the opening of the grand loop, then
indented under that perhaps a two-way branch (if-then-else), where, under the “then”
one sees all that will happen if the condition is met, and under the “else” one sees
what happens if it is not met.
When we say that we see all that will happen, we mean that there are not any “go-to”
instructions that would take our eye out of the flow of the if-then-else loop to some
other page. It all happens right there on the same page, under “then” and under
“else”.
These few words can scarcely convey the ideas of structuring, which we leave to the
numerous examples in the sequel.
2.3 Systems and equations of higher order
We have already remarked that the methods of numerical integration for a single first-
order differential equation carry over with very little change to systems of simultaneous
differential equations of first order. In this section we’ll discuss exactly how this is done,
and furthermore, how the same idea can be applied to equations of higher order than the
first. Euler’s method will be used as the example, but the same transformations will apply
to all of the methods that we will study.
In Euler’s method for one equation, the approximate value of the unknown function at the next point x_{n+1} = x_n + h is calculated from
y_{n+1} = y_n + h f(x_n, y_n). (2.3.1)
Now suppose that we are trying to solve not just a single equation, but a system of N simultaneous equations, say
y'_i(x) = f_i(x, y_1, y_2, . . . , y_N)   i = 1, . . . , N. (2.3.2)
Equation (2.3.2) represents N equations, in each of which just one derivative appears, and whose right-hand side may depend on x, and on all of the unknown functions, but not on their derivatives. The “f_i” indicates that, of course, each equation can have a different right-hand side.
Now introduce the vector y(x) of unknown functions
y(x) = [y_1(x), y_2(x), y_3(x), . . . , y_N(x)] (2.3.3)
and the vector f = f(x, y(x)) of right-hand sides
f = [f_1(x, y), f_2(x, y), . . . , f_N(x, y)]. (2.3.4)
In terms of these vectors, equation (2.3.2) can be rewritten as
y'(x) = f(x, y(x)). (2.3.5)
We observe that equation (2.3.5) looks just like our standard form (2.1.1) for a single
equation in a single unknown function, except for the bold face type, i.e., except for the
fact that y and f now represent vector quantities.
To apply a numerical method such as that of Euler, then, all we need to do is to take
the statement of the method for a single differential equation in a single unknown function,
and replace y(x) and f(x, y(x)) by vector quantities as above. We will then have obtained
the generalization of the numerical method to systems.
To be specific, Euler’s method for a single equation is
y_{n+1} = y_n + h f(x_n, y_n) (2.3.6)
so Euler’s method for a system of differential equations will be
y_{n+1} = y_n + h f(x_n, y_n), (2.3.7)
where now y and f are the vector quantities above. This means that if we know the entire vector y of unknown functions at the point x = x_n, then we can find the entire vector of unknown functions at the next point x_{n+1} = x_n + h by means of (2.3.7).
In detail, if y_i(x_n) denotes the computed approximate value of the unknown function y_i at the point x_n, then what we must calculate are
y_i(x_{n+1}) = y_i(x_n) + h f_i(x_n, y_1(x_n), y_2(x_n), . . . , y_N(x_n)) (2.3.8)
for each i = 1, 2, . . . , N.
As an example, take the pair of differential equations
y'_1 = x + y_1 + y_2
y'_2 = y_1 y_2 + 1 (2.3.9)
together with the initial values y_1(0) = 0, y_2(0) = 1.
Now the vector of unknown functions is y = [y_1, y_2], and the vector of right-hand sides is f = [x + y_1 + y_2, y_1 y_2 + 1]. Initially, the vector of unknowns is y = [0, 1]. Let’s choose a step size h = 0.05. Then we calculate
[y_1(0.05), y_2(0.05)] = [0, 1] + 0.05 [0 + 0 + 1, 0 ∗ 1 + 1] = [0.05, 1.05] (2.3.10)
and
[y_1(0.10), y_2(0.10)] = [0.05, 1.05] + 0.05 [0.05 + 0.05 + 1.05, 0.05 ∗ 1.05 + 1] = [0.1075, 1.102625] (2.3.11)
and so forth. At each step we compute the vector of approximate values of the two unknown
functions from the corresponding vector at the immediately preceding step.
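The two steps above are easy to check with a few lines of code. This sketch (in Python rather than the Maple used later in the chapter) applies (2.3.8) to the system (2.3.9):

```python
def f(x, y):
    # Vector of right-hand sides for the system (2.3.9); y = [y1, y2].
    return [x + y[0] + y[1], y[0] * y[1] + 1]

def euler_step(x, y, h):
    # One step of Euler's method for a system, i.e. equation (2.3.8).
    fx = f(x, y)
    return [y[i] + h * fx[i] for i in range(len(y))]

h = 0.05
y = euler_step(0.0, [0.0, 1.0], h)   # [0.05, 1.05], as in (2.3.10)
y = euler_step(0.05, y, h)           # [0.1075, 1.102625], as in (2.3.11)
print(y)
```

Notice that euler_step reads the whole input vector before writing any output; this point will matter in the programming discussion below.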
Let’s consider the preparation of a computer program that will carry out the solution, by Euler’s method, of a system of N simultaneous equations of the form (2.3.2), which we will rewrite just slightly, in the form
y'_i = f_i(x, y)   i = 1, . . . , N. (2.3.12)
Note that on the left is just one of the unknown functions, and on the right there may appear all N of them in each equation.
Evidently we will need an array Y of length N to hold the values of the N unknown
functions at the current point x. Suppose we have computed the array Y at a point x, and
we want to get the new array Y at the point x +h. Exactly what do we do?
Some care is necessary in answering this question because there is a bit of a snare in
the underbrush. The new values of the N unknown functions are calculated from (2.3.8) or
(2.3.12) in a certain order. For instance, we might calculate y_1(x + h), then y_2(x + h), then y_3(x + h), . . . , then y_N(x + h).
The question is this: when we compute y_1(x + h), where shall we put it? If we put it into Y[1], the first position of the Y array in storage, then the previous contents of Y[1] are lost, i.e., the value of y_1(x) is lost. But we aren’t finished with y_1(x) yet; it’s still needed to compute y_2(x + h), y_3(x + h), etc. This is because the new value y_i(x + h) depends (or might depend), according to (2.3.8), on the old values of all of the unknown functions, including those whose new values have already been computed before we begin the computation of y_i(x + h).
If the point still is murky, go back to (2.3.11) and notice how, in the calculation of y_2(0.10) we needed to know y_1(0.05) even though y_1(0.10) had already been computed. Hence if we had put y_1(0.10) into an array to replace the old value y_1(0.05) we would not have been able to obtain y_2(0.10).
The conclusion is that we need at least two arrays, say YIN and YOUT, each of length N.
The array YIN holds the unknown functions evaluated at x, and YOUT will hold their values
at x+h. Initially YIN holds the given data at x_0. Then we compute all of the unknowns at x_0 + h, and store them in YOUT as we find them. When all have been done, we print them if desired, move all entries of YOUT back to YIN, increase x by h and repeat.
The principal building block in this structure would be a subroutine that would advance
the solution exactly one step. The main program would initialize the arrays, call this
subroutine, increase x, move data from the output array YOUT back to the input array YIN,
print, etc. The single-step subroutine is shown below. We will use it later on to help us
get a solution started when we use methods that need information at more than one point
before they can start the integration.
Eulerstep:=proc(xin,yin,h,n)
  local i, yout;
  # This program numerically integrates the system
  # y'=f(x,y) one step forward by Euler's method using step size
  # h. Enter with values at xin in yin. Exit with values at xin+h
  # in yout. Supply f as a function subprogram.
  yout:=[seq(evalf(yin[i]+h*f(xin,yin,i)),i=1..n)];
  return(yout);
end;
A few remarks about the program are in order. One structured data type in Maple
is a list of things, in this case, a list of floating point numbers. The seq command (for
“sequence”) creates such a list, in this case a list of length n since i goes from 1 to n in the
seq command. The brackets [ and ] convert the list into a vector. The evalf command
ensures that the results of the computation of the components of yout are floating point
numbers.
Our next remark about the program concerns the function subprogram f, which calcu-
lates the right hand sides of the differential equation. This subprogram, of course, must be
supplied by the user. Here is a sample of such a program, namely the one that describes
the system (2.3.9). In that case we have f_1(x, y) = x + y_1 + y_2 and f_2(x, y) = y_1 y_2 + 1. This translates into the following:
f:=proc(x,y,i);
  # Calculates the right-hand sides of the system of differential
  # equations.
  if i=1 then return(x+y[1]+y[2]) else return(y[1]*y[2]+1) fi;
end;
Our last comment about the program to solve systems is that it is perfectly possible to
use it in such a way that we would not have to move the contents of the vector YOUT back
to the vector YIN at each step. In other words, we could save N move operations, where N
is the number of equations. Such savings might be significant in an extended calculation.
To achieve this saving, we write two blocks of programming in the main program. One
block takes the contents of YIN as input, advances the solution one step by Euler’s method
and prints, leaving the new vector in YOUT. Then, without moving anything, another block
of programming takes the contents of YOUT as input, advances the solution one step, leaves
the new vector in YIN, and prints. The two blocks call Eulerstep alternately as the integration
proceeds to the right. The reader might enjoy writing this program, and thinking about
how to generalize the idea to the situation where the new value of y is computed from two
previously computed values, rather than from just one (then three blocks of programming
would be needed).
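A Python sketch of this two-block arrangement (a translation of the idea, with buffer names a and b standing in for YIN and YOUT; the right-hand side f and the single-step routine are repeated so the sketch is self-contained):

```python
def f(x, y, i):
    # Sample right-hand sides, as in the text: f_1 = x + y_1 + y_2, f_2 = y_1*y_2 + 1.
    return x + y[0] + y[1] if i == 1 else y[0] * y[1] + 1

def euler_step(x, y, h, n):
    return [y[i] + h * f(x, y, i + 1) for i in range(n)]

def integrate(x, y, h, n, steps):
    # Two buffers alternate as input and output, so nothing is copied between steps.
    a = y[:]
    b = a
    for k in range(steps):
        if k % 2 == 0:
            b = euler_step(x, a, h, n)   # block 1 reads a, writes b
        else:
            a = euler_step(x, b, h, n)   # block 2 reads b, writes a
        x += h
    return b if steps % 2 == 1 else a
```

The final result sits in whichever buffer was written last, which is why the return statement checks the parity of the step count.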
Now we’ve discussed the numerical solution of a single differential equation of first order,
and of a system of simultaneous differential equations of first order, and there remains the
treatment of equations of higher order than the first. Fortunately, this case is very easily
reduced to the varieties that we have already studied.
For example, suppose we want to solve a single equation of the second order, say
y'' + x y' + (x + 1) cos y = 2.                                  (2.3.13)
The strategy is to transform the single second-order equation into a pair of simultaneous
first order equations that can then be handled as before. To do this, choose two unknown
functions u and v. The function u is to be the unknown function y in (2.3.13), and the
function v is to be the derivative of u. Then u and v satisfy two simultaneous first-order
differential equations:
u' = v
v' = −xv − (x + 1) cos u + 2                                     (2.3.14)
and these are exactly of the form (2.3.5) that we have already discussed!
The same trick works on a general differential equation of Nth order
y^(N) + G(x, y, y', y'', . . . , y^(N−1)) = 0.                   (2.3.15)
We introduce N unknown functions u_0, u_1, . . . , u_{N−1}, and let them be the solutions of the
system of N simultaneous first order equations

u'_0 = u_1
u'_1 = u_2
. . .
u'_{N−2} = u_{N−1}
u'_{N−1} = −G(x, u_0, u_1, . . . , u_{N−1}).                     (2.3.16)
The system can now be dealt with as before.
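The reduction can also be done mechanically in code. In this hypothetical Python sketch, reduce_to_system accepts the G of (2.3.15) and returns the right-hand sides of the system (2.3.16); the sample G encodes equation (2.3.13), so the result is exactly the pair (2.3.14):

```python
import math

def reduce_to_system(G, N):
    # Given y^(N) + G(x, y, y', ..., y^(N-1)) = 0, return the right-hand
    # sides of the equivalent first-order system (2.3.16) for
    # u = (u_0, ..., u_{N-1}), where u_k plays the role of y^(k).
    def rhs(x, u):
        return [u[k + 1] for k in range(N - 1)] + [-G(x, u)]
    return rhs

# Example: y'' + x y' + (x + 1) cos y = 2 is y'' + G = 0 with
# G(x, u) = x*u_1 + (x + 1)*cos(u_0) - 2, which recovers the pair (2.3.14).
rhs = reduce_to_system(lambda x, u: x * u[1] + (x + 1) * math.cos(u[0]) - 2, 2)
```

At x = 0 with u_0 = 0, u_1 = 1, the system gives u'_0 = 1 and u'_1 = −(0 + 1·cos 0 − 2) = 1, in agreement with (2.3.14).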
Exercises 2.3
1. Write each of the following as a system of simultaneous first-order initial-value prob-
lems in the standard form (2.3.2):
(a) y'' + x^2 y = 0;  y(0) = 1;  y'(0) = 0
(b) u' + xv = 2;  v' + e^{uv} = 0;  u(0) = 0;  v(0) = 0
(c) u' + xv' = 0;  v' + x^2 u = 1;  u(1) = 1;  v(1) = 0
(d) y^{iv} + 3xy''' + x^2 y'' + 2y' + y = 0;  y(0) = y'(0) = y''(0) = y'''(0) = 1
(e) x'''(t) + t^3 x(t) + y(t) = 0;  y''(t) + x(t)^2 = t^3;  x(0) = 2;  x'(0) = 1;  x''(0) = 0;
    y(0) = 1;  y'(0) = 0
2. For each of the parts of problem 1, write the function subprogram that will compute
the right-hand sides, as required by the Eulerstep subroutine.
3. For each of the parts of problem 1, assemble and run on the computer the Euler
program, together with the relevant function subprogram of problem 2, to print out
the solutions for fifty steps of integration, each of size h = 0.03. Begin with x = x_0,
the point at which the initial data was given.
4. Reprogram the Eulerstep subroutine, as discussed in the text, to avoid the movement
of YOUT back to YIN.
5. Modify your program as necessary (in Maple, take advantage of the plot command)
to produce graphical output (graph all of the unknown functions on the same axes).
Test your program by running it with Euler as it solves y'' + y = 0 with y(0) = 0,
y'(0) = 1, and h = π/60 for 150 steps.
6. Write a program that will compute successive values y_p, y_{p+1}, . . . from a difference
equation of order p. Do this by storing the y’s as they are computed in a circular list,
so that it is never necessary to move back the last p computed values before finding
so that it is never necessary to move back the last p computed values before finding
the next one. Write your program so that it will work with vectors, so you can solve
systems of difference equations as well as single ones.
2.4 How to document a program
One of the main themes of our study will be the preparation of programs that not only
work, but also are easily readable and useable by other people. The act of communication
that must take place before a program can be used by persons other than its author is a
difficult one to carry out, and we will return several times to the principles that serve as
guides to the preparation of readable software.
In this section we discuss further the all-important question of program documentation,
already touched upon in section 2.2. Some very nontrivial skills are called for in the creation
of good user-oriented program descriptions. One of these is the ability to enter the head of
another person, the user, and to relate to the program that you have just written through
the user’s eyes.
It’s hard to enter someone else’s head. One of the skills that make one person a better
teacher than another person is of the same kind: the ability to see the subject matter that
is being taught through the eyes of another person, the student. If one can do that, or
even make a good try at it, then obviously one will be able much better to deal with the
questions that are really concerning the audience. Relatively few actually do this to any
great extent, not, I think, because it’s an ability that one either has or doesn’t have, but
because few efforts are made to train this skill and to develop it.
We’ll try to make our little contribution here.
(A) What does it do?
The first task should be to describe the precise purpose of the program. Put yourself in
the place of a potential user who is looking at a particular numerical instance of the problem
that needs solving. That user is now thumbing through a book full of program descriptions
in the library of a computer center 10,000 miles away in order to find a program that will
do the problem in question. Your description must answer that question.
Let’s now assume that your program has been written in the form of a subroutine or
procedure, rather than as a main program. Then the list of global, or communicating
variables is plainly in view, in the opening statement of the subroutine.
As we noted in section 2.2, you should state the purpose of your program using the
global variables of the subroutine in the same sentence. For one example of the difference
that makes, see section 2.2. For another, a linear equation solver might be described by
saying
“This program solves a system of simultaneous equations. To use it, put the
right-hand sides into the vector B, put the coefficients into the matrix A and call
the routine. The answer will be returned in X.”
We are, however, urging the reader to do it this way:
“This program solves the equations AX=B, where A is an N-by-N matrix and B
is an N-vector.”
Observe that the second description is shorter, only about half as long, and yet more
informative. We have found out not only what the program does, but how that function
relates to the global variables of the subroutine. This was done by using a judicious sprin-
kling of symbols in the documentation, along with the words. Don’t use only symbols, or
only words, but weave them together for maximum information.
Notice also that the ambiguous term “right-hand side” that appeared in the first form
has been done away with in the second form. The phrase was ambiguous because exactly
what ends up on the right-hand side and what on the left is an accident of how we happen
to write the equations, and your audience may not do it the same way you do.
(B) How is it done?
This is usually the easy part of program documentation because it is not the purpose of
this documentation to give a course in mathematics or algorithms or anything else. Hence
most of the time a reference to the literature is enough, or perhaps if the method is a
standard one, just give its name. Often though, variations on the standard method have
been chosen, and the user must be informed about those:
“. . . is solved by Gaussian elimination, using complete pivoting for size. . . ”
“. . . the input array A is sorted by the Quicksort method (see D.E. Knuth, The
Art of Computer Programming, volume 3). . . ”
“. . . the eigenvalues and vectors are found by the Jacobi method, using
Corbató’s method of avoiding the search for the largest off-diagonal element (see,
for instance, the description in D.R. Wilson, A First Course in Mathematical
Software).”
“. . . is found by the Simplex method, except that Charnes’ selection rule (see
F.A. Ficken, The Simplex Method. . . ) is not programmed, and so. . . ”
(C) Describe the global variables
Now it gets hard again. The global variables are the ones through which the subroutine
communicates with the user. Generally speaking, the user doesn’t care about variables
that are entirely local to your subroutine, but is vitally concerned with the communicating
variables.
First the user has to know exactly how each of the global variables is related to the
problem that is being solved. This calls for a brief verbal description of the variable, and
what it has to do with the functioning of the program.
“A[i] is the ith element of the input list that is to be sorted, i=1..N”
“WHY is set by the subroutine to TRUE unless the return is because of overflow,
and then it will be set to FALSE.”
“B[i,j] is the coefficient of X[j] in the ith one of the input equations BX=C.”
“option is set by the calling program on input. Set it to 0 if the output is to
be rounded to the nearest integer, else set it to m if the output is to be rounded
to m decimal places (m ≤ 12).”
It is extremely important that each and every global variable of the subroutine should
get such a description. Just march through the parentheses in the subroutine or procedure
heading, and describe each variable in turn.
Next, the user will need more information about each of the global variables than just its
description as above. Also required is the “type” of the variable. Some computer languages
force each program to declare the types of their variables right in the opening statement.
Others declare types by observing various default rules with exceptions stated. In any case,
a little redundancy never hurts, and the program documentation should declare the type of
each and every global variable.
It’s easy to declare along with the types of the variables, their dimensions if they are
array variables. For instance we may have a
solver:=proc(A,X,n,ndim,b);
in which the communicating variables have the following types:
A ndim-by-ndim array of floating point numbers
X vector of floating point numbers of length n
n integer
ndim integer
b vector of floating point numbers of length n
The best way to announce all of these types and dimensions of global variables to the
user is simply to list them, as above, in a table.
Now surely we’ve finished taking the pulse, blood pressure, etc. of the global variables,
haven’t we? Well, no, we haven’t. There’s still more vital data that a user will need to
know about these variables. There isn’t any standard name like “type” to apply to this
information, so we’ll call it the “role” of the variable.
First, for some of the global variables of the subroutine, it may be true that their values
at the time the subroutine is called are quite irrelevant to the operation of the subroutine.
This would be the case for the output variables, and in certain other situations. For some
other variables, the values at input time would be crucial. The user needs to know which are
which. Just for one example, if the value at input time is irrelevant, then the user can feel
free to use the same storage for other temporary purposes between calls to the subroutine.
Second, it may happen that certain variables are returned by the subroutine with their
values unchanged. This is particularly true for “implicitly passed” global variables, i.e.,
variables whose values are used by the subroutine but which do not appear explicitly in the
argument list. In such cases, the user may be delighted to hear the good news. In other
cases, the action of a subroutine may change an input variable, so if the user needs to use
those quantities again it will be necessary to save them somewhere else before calling the
subroutine. In either case, the user needs to be informed.
Third, it may be that the computation of the value of a certain variable is one of the
main purposes of the subroutine. Such variables are the outputs of the program, and the
user needs to know which these are (whether they are explicit in heading or the return
statement, or are “implicit”).
Although some high-level computer languages require type declarations immediately
in the opening instruction of a subroutine, none require the descriptions of the roles of
the variables (well, Pascal requires the VAR declaration, and Maple separates the input
variables from the output ones, but both languages allow implicit passing and changing
of global variables). These are, however, important for the user to know, so let’s invent
a shorthand for describing them in the documentation of the programs that occur in this
book.
First, if the value at input time is important, let’s say that the role of the variable is I,
otherwise it is I’.
Second, if the value of the variable is changed by the action of the subroutine, we’ll say
that its role is C, else C’.
Finally, if the computation of this variable is one of the main purposes of the subroutine,
its role is O (as in output), else O’.
In the description of each communicating variable, all three of these should be specified.
Thus, a variable X might have role IC’O’, or a variable WHY might be of role I’CO, etc.
To sum up, the essential features of program documentation are a description of what
the program does, phrased in terms of the global variables, a statement of how it gets the
job done, and a list of all of the global variables, showing for each one its name, type,
dimension (or structure) if any, its role, and a brief verbal description.
Refer back to the short program in section 2.2, that searches for the largest element in
a row of a matrix. Here is the table of information about its global variables:
Name   Type                     Role    Description
A      floating point matrix    IC’O’   The input matrix
i      integer                  IC’O’   Which row to search
jwin   integer                  I’CO    Column containing largest element
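To see how such a table travels with its program, here is a hypothetical Python rendering of that routine with the table carried in the comment block (the search criterion, plain largest rather than largest in absolute value, is our guess at the section 2.2 program, shown only to illustrate the documentation style):

```python
def rowsearch(A, i):
    # Finds the column containing the largest element in row i of the matrix A.
    #
    # Name   Type                     Role    Description
    # A      floating point matrix    IC'O'   The input matrix
    # i      integer                  IC'O'   Which row to search
    # jwin   integer                  I'CO    Column containing largest element
    jwin = 0
    for j in range(1, len(A[i])):
        if A[i][j] > A[i][jwin]:
            jwin = j
    return jwin
```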
Exercises 2.4
Write programs that perform each of the jobs stated below. In each case, after testing
the program, document it with comments. Give a complete table of information about the
global variables in each case.
(a) Find and print all of the prime numbers between M and N.
(b) Find the elements of largest and smallest absolute values in a given linear array
(vector), and their positions in the array.
(c) Sort the elements of a given linear array into ascending order of size.
(d) Deal out four bridge hands (13 cards each from a standard 52-card deck – This one
is not so easy!).
(e) Solve a quadratic equation (any quadratic equation!).
2.5 The midpoint and trapezoidal rules
Euler’s formula is doubtless the simplest numerical integration procedure for differential
equations, but the accuracy that can be obtained with it is insufficient for most applications.
In this section and those that follow, we want to introduce a whole family of methods for
the solution of differential equations, called the linear multistep methods, in which the user
can choose the degree of precision that will suffice for the job, and then select a member of
the family that will achieve it.
Before describing the family in all of its generality, we will produce two more of its
members, which illustrate the different sorts of creatures that inhabit the family in question.
Recall that we derived Euler’s method by chopping off the Taylor series expansion of the
solution after the linear term. To get a more accurate method we could, of course, keep the
quadratic term, too. However, that term involves a second derivative, and we want to avoid
the calculation of higher derivatives because our differential equations will always be written
as first-order systems, so that only the first derivative will be conveniently computable.
We can have greater accuracy without having to calculate higher derivatives if we’re
willing to allow our numerical integration procedure to involve values of the unknown func-
tion and its derivative at more than one point. In other words, in Euler’s method, the next
value of the unknown function, at x + h, is gotten from the values of y and y' at just one
backwards point x. In the more accurate formulas that we will discuss next, the new value
of y depends on y and y' at more than one point, for instance, at x and x − h, or at several
points.
As a primitive example of this kind, we will now discuss the midpoint rule. We begin
once again with the Taylor expansion of the unknown function y(x) about the point x_n:

y(x_n + h) = y(x_n) + h y'(x_n) + (h^2/2) y''(x_n) + (h^3/6) y'''(x_n) + · · · .      (2.5.1)
Now we rewrite equation (2.5.1) with h replaced by −h to get
y(x_n − h) = y(x_n) − h y'(x_n) + (h^2/2) y''(x_n) − (h^3/6) y'''(x_n) + · · ·        (2.5.2)
and then subtract these equations, obtaining
y(x_n + h) − y(x_n − h) = 2h y'(x_n) + (2h^3/6) y'''(x_n) + · · · .                   (2.5.3)
Now, just as we did in the derivation of Euler’s method, we will truncate the right side
of (2.5.3) after the first term, ignoring the terms that involve h^3, h^5, etc. Further, let’s
use y_n to denote the computed approximate value of y(x_n) (and y_{n+1} for the approximate
y(x_{n+1}), etc.). Then we have

y_{n+1} − y_{n−1} = 2h y'_n .                                                         (2.5.4)
If, as usual, we are solving the differential equation y' = f(x, y), then finally (2.5.4) takes
the form

y_{n+1} = y_{n−1} + 2h f(x_n, y_n)                                                    (2.5.5)
and this is the midpoint rule. The name arises from the fact that the first derivative y'_n is
being approximated by the slope of the chord that joins the two points (x_{n−1}, y_{n−1}) and
(x_{n+1}, y_{n+1}), instead of the chord joining (x_n, y_n) and (x_{n+1}, y_{n+1}) as in Euler’s method.
At first sight it seems that (2.5.5) can be used just like Euler’s method, because it is a
recurrence formula in which we compute the next value y_{n+1} from the two previous values
y_n and y_{n−1}. Indeed the rules are quite similar, except for the fact that we can’t get started
with the midpoint rule until we know two consecutive values y_0, y_1 of the unknown function
at two consecutive points x_0, x_1. Normally a differential equation is given together with
just one value of the unknown function, so if we are to use the midpoint rule we’ll need to
manufacture one more value of y(x) by some other means.
This kind of situation will come up again and again as we look at more accurate methods,
because to obtain greater precision without computing higher derivatives we will get the
next approximate value of y from a recurrence formula that may involve not just one or two,
but several of its predecessors. To get such a formula started we will have to find several
starting values in addition to the one that is given in the statement of the initial-value
problem.
To get back to the midpoint rule, we can get it started most easily by calculating y_1,
the approximation to y(x_0 + h), from Euler’s method, and then switching to the midpoint
rule to carry out the rest of the calculation.
Let’s do this, for example with the same differential equation (2.1.7) that we used to
illustrate Euler’s rule, so we can compare the two methods. The problem consists of the
equation y' = 0.5y and the initial value y(0) = 1. We’ll use the same step size h = 0.05 as
before.
Now to start the midpoint rule we need two consecutive values of y, in this case at x = 0
and x = 0.05. At 0.05 we use the value that Euler’s method gives us, namely y_1 = 1.025
(see Table 1). It’s easy to continue the calculation now from (2.5.5).
For instance

y_2 = y_0 + 2h(0.5 y_1)
    = 1 + 0.1(0.5 ∗ 1.025)
    = 1.05125                                                                         (2.5.6)

and

y_3 = y_1 + 2h(0.5 y_2)
    = 1.025 + 0.1(0.5 ∗ 1.05125)
    = 1.0775625 .                                                                     (2.5.7)
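The whole computation is a few lines of Python (a sketch in place of the book's Maple; f is the right-hand side 0.5y, and one Euler step supplies y_1 as described):

```python
def midpoint(f, x0, y0, h, steps):
    # Midpoint rule y_{n+1} = y_{n-1} + 2h f(x_n, y_n), started with one Euler step.
    ys = [y0, y0 + h * f(x0, y0)]
    for n in range(1, steps):
        ys.append(ys[n - 1] + 2 * h * f(x0 + n * h, ys[n]))
    return ys

ys = midpoint(lambda x, y: 0.5 * y, 0.0, 1.0, 0.05, 3)
# ys[1], ys[2], ys[3] reproduce the values 1.025, 1.05125, 1.0775625 computed above.
```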
In the table below we show for each x the value computed from the midpoint rule, from
Euler’s method, and from the exact solution y(x) = e^{x/2}. The superior accuracy of the
midpoint rule is apparent.
x       Midpoint(x)   Euler(x)    Exact(x)
0.00    1.00000       1.00000     1.00000
0.05    1.02500       1.02500     1.02532
0.10    1.05125       1.05063     1.05127
0.15    1.07756       1.07689     1.07788
0.20    1.10513       1.10381     1.10517
0.25    1.13282       1.13141     1.13315
. . .
1.00    1.64847       1.63862     1.64872
2.00    2.71763       2.68506     2.71828
3.00    4.48032       4.39979     4.48169
. . .
5.00    12.17743      11.81372    12.18249
10.00   148.31274     139.56389   148.41316
Figure 2.1: The trapezoidal rule (the curve y = f(x) between x = a and x = b, approximated by a trapezoid)
table 2
Next, we introduce a third method of numerical integration, the trapezoidal rule. The
best way to obtain it is to convert the differential equation that we’re trying to solve into
an integral equation, and then use the trapezoidal approximation for the integral.
We begin with the differential equation y' = f(x, y(x)), and we integrate both sides
from x to x + h, getting

y(x + h) = y(x) + ∫_x^{x+h} f(t, y(t)) dt.                                            (2.5.8)
Now if we approximate the right-hand side in any way by a weighted sum of values of
the integrand at various points we will have found an approximate method for solving our
differential equation.
The trapezoidal rule states that for an approximate value of an integral

∫_a^b f(t) dt                                                                         (2.5.9)

we can use, instead of the area under the curve between x = a and x = b, the area of the
trapezoid whose sides are the x axis, the lines x = a and x = b, and the line through the
points (a, f(a)) and (b, f(b)), as shown in Figure 2.1. That area is (1/2)(f(a) + f(b))(b − a).
If we apply the trapezoidal rule to the integral that appears in (2.5.8), we obtain

y(x_n + h) ≈ y(x_n) + (h/2)(f(x_n, y(x_n)) + f(x_n + h, y(x_n + h)))                  (2.5.10)
in which we have used the “≈” sign rather than the “=” because the right hand side is not
exactly equal to the integral that really belongs there, but is only approximately so.
If we use our usual abbreviation y_n for the computed approximate value of y(x_n), then
(2.5.10) becomes

y_{n+1} = y_n + (h/2)(f(x_n, y_n) + f(x_{n+1}, y_{n+1})).                             (2.5.11)
This is the trapezoidal rule in the form that is useful for differential equations.
At first sight, (2.5.11) looks like a recurrence formula from which the next approximate
value, y_{n+1}, of the unknown function, can immediately be computed from the previous
value, y_n. However, this is not the case.
Upon closer examination one observes that the next value y_{n+1} appears not only on the
left-hand side, but also on the right (it’s hiding in the second f on the right side).
In order to find the value y_{n+1} it appears that we need to carry out an iterative process.
First we would guess y_{n+1} (guessing y_{n+1} to be equal to y_n wouldn’t be all that bad, but
we can do better). If we use this guess value on the right side of (2.5.11) then we would
be able to calculate the entire right-hand side, and then we could use that value as a new
“improved” value of y_{n+1}.
Now if the new value agrees with the old sufficiently well the iteration would halt, and
we would have found the desired value of y_{n+1}. Otherwise we would use the improved value
on the right side just as we previously used the first guess. Then we would have a “more
improved” guess, etc.
Fortunately, in actual use, it turns out that one does not actually have to iterate to
convergence. If a good enough guess is available for the unknown value, then just one
refinement by a single application of the trapezoidal formula is sufficient. This is not the
case if a high quality guess is unavailable. We will discuss this point in more detail in
section 2.9. The pair of formulas, one of which supplies a very good guess to the next
value of y, and the other of which refines it to a better guess, is called a predictor-corrector
pair, and such pairs form the basis of many of the highly accurate schemes that are used in
practice.
As a numerical example, take the differential equation

y' = 2x e^y + 1                                                                       (2.5.12)

with the initial value y(0) = 1. If we use h = 0.05, then our first task is to calculate y_1, the
approximate value of y(0.05). The trapezoidal rule asserts that

y_1 = 1 + 0.025(2 + 0.1 e^{y_1})                                                      (2.5.13)

and sure enough, the unknown number y_1 appears on both sides.
Let’s guess y_1 = 1. Since this is not a very inspired guess, we will have to iterate the
trapezoidal rule to convergence. Hence, we use this guess on the right side of (2.5.13),
compute the right side, and obtain y_1 = 1.056796. If we use this new guess the same way,
the result is y_1 = 1.057193. Then we get 1.057196, and since this is in sufficiently close
agreement with the previous result, we declare that the iteration has converged. Then we
take y_1 = 1.057196 for the computed value of the unknown function at x = 0.05, and we go
next to x = 0.1 to repeat the same sort of thing to get y_2, the computed approximation to
y(0.1).
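This iterate-to-convergence procedure is easy to mechanize. A Python sketch (the tolerance 1e-6 and the starting guess y_new = y_n are our choices):

```python
import math

def trap_step(f, x, y, h, tol=1e-6):
    # One step of the trapezoidal rule (2.5.11): iterate the implicit formula
    # to convergence, starting from the crude first guess y_new = y.
    y_new = y
    while True:
        y_next = y + (h / 2) * (f(x, y) + f(x + h, y_new))
        if abs(y_next - y_new) < tol:
            return y_next
        y_new = y_next

# y' = 2x e^y + 1 with y(0) = 1 and h = 0.05: this is exactly the fixed-point
# iteration on (2.5.13), converging to y_1 = 1.057196.
y1 = trap_step(lambda x, y: 2 * x * math.exp(y) + 1, 0.0, 1.0, 0.05)
```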
In Table 3 we show the results of using the trapezoidal rule (where we have iterated
until two successive guesses are within 10^{−4}) on our test equation y' = 0.5y, y(0) = 1 as
the column Trap(x). For comparison, we show Midpoint(x) and Exact(x).
x       Trap(x)     Midpoint(x)   Exact(x)
0.00    1.00000     1.00000       1.00000
0.05    1.02532     1.02500       1.02532
0.10    1.05127     1.05125       1.05127
0.15    1.07789     1.07756       1.07788
0.20    1.10518     1.10513       1.10517
0.25    1.13316     1.13282       1.13315
. . .
1.00    1.64876     1.64847       1.64872
2.00    2.71842     2.71763       2.71828
3.00    4.48203     4.48032       4.48169
. . .
5.00    12.18402    12.17743      12.18249
10.00   148.45089   148.31274     148.41316
table 3
2.6 Comparison of the methods
We are now in possession of three methods for the numerical solution of differential
equations. They are Euler’s method

y_{n+1} = y_n + h y'_n ,                                                              (2.6.1)

the trapezoidal rule

y_{n+1} = y_n + (h/2)(y'_n + y'_{n+1}),                                               (2.6.2)

and the midpoint rule

y_{n+1} = y_{n−1} + 2h y'_n .                                                         (2.6.3)
In order to compare the performance of the three techniques it will be helpful to have
a standard differential equation on which to test them. The most natural candidate for
such an equation is y' = Ay, where A is constant. The reasons for this choice are first
that the equation is easy to solve exactly, second that the difference approximations are
also relatively easy to solve exactly, so comparison is readily done, third that by varying the
sign of A we can study behavior of either growing or shrinking (stable or unstable) solutions,
and finally that many problems in nature have solutions that are indeed exponential, at
least over the short term, so this is an important class of differential equations.
We will, however, write the test equation in a slightly different form for expository
reasons, namely as

y' = −y/L ;   y(0) = 1 ;   L > 0                                                      (2.6.4)

where L is a constant. The most interesting and revealing case is where the true solution
is a decaying exponential, so we will assume that L > 0. Further, we will assume that
y(0) = 1 is the given initial value.
The exact solution is of course

Exact(x) = e^{−x/L} .                                                                 (2.6.5)

Notice that if x increases by L, the solution changes by a factor of e. Hence L, called the
relaxation length of the problem, can be conveniently visualized as the distance over which
the solution falls by a factor of e.
Now we would like to know how well each of the methods (2.6.1)–(2.6.3) handles the
problem (2.6.4).
Suppose first that we ask Euler’s method to solve the problem. If we substitute y' =
f(x, y) = −y/L into (2.6.1), we get

y_{n+1} = y_n + h (−y_n/L)
        = (1 − h/L) y_n .                                                             (2.6.6)
Before we solve this recurrence, let’s comment on the ratio h/L that appears in it. Now
L is the distance over which the solution changes by a factor of e, and h is the step size
that we are going to use in the numerical integration. Instinctively, one feels that if the
solution is changing rapidly in a certain region, then h will have to be kept small there if
good accuracy is to be retained, while if the solution changes only slowly, then h can be
larger without sacrificing too much accuracy. The ratio h/L measures the step size of the
integration in relation to the distance over which the solution changes appreciably. Hence,
h/L is exactly the thing that one feels should be kept small for a successful numerical
solution.
Since h/L occurs frequently below, we will denote it with the symbol τ.
Now the solution of the recurrence equation (2.6.6), with the starting value y_0 = 1, is
obviously

y_n = (1 − τ)^n      n = 0, 1, 2, . . . .                                             (2.6.7)
Next we study the trapezoidal approximation to the same equation (2.6.4). We substi-
tute y' = f(x, y) = −y/L in (2.6.2) and get

y_{n+1} = y_n + (h/2)(−y_n/L − y_{n+1}/L).                                            (2.6.8)
The unknown y_{n+1} appears, as usual with this method, on both sides of the equation.
However, for the particularly simple equation that we are now studying, there is no difficulty
in solving (2.6.8) for y_{n+1} (without any need for an iterative process) and obtaining

y_{n+1} = ((1 − τ/2)/(1 + τ/2)) y_n .                                                 (2.6.9)
Together with the initial value y_0 = 1, this implies that

y_n = ((1 − τ/2)/(1 + τ/2))^n      n = 0, 1, 2, . . . .                               (2.6.10)
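Both closed forms are easy to confirm numerically. This Python check (τ = 0.1 and 50 steps are our choices) runs the recurrences (2.6.6) and (2.6.9) and compares them with (2.6.7), (2.6.10), and the exact value e^{−nτ}:

```python
import math

tau = 0.1        # tau = h/L, our choice for the check
n_steps = 50

# Run the recurrences (2.6.6) and (2.6.9) directly.
y_euler, y_trap = 1.0, 1.0
for _ in range(n_steps):
    y_euler = (1 - tau) * y_euler
    y_trap = ((1 - tau / 2) / (1 + tau / 2)) * y_trap

# The closed forms (2.6.7) and (2.6.10), and the exact value e^{-n*tau}.
closed_euler = (1 - tau) ** n_steps
closed_trap = ((1 - tau / 2) / (1 + tau / 2)) ** n_steps
exact = math.exp(-n_steps * tau)
```

The trapezoidal value lands much nearer the exact e^{−nτ} than the Euler value does, foreshadowing the comparison of the two constants below.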
Before we deal with the midpoint rule, let’s pause to examine the two methods whose
solutions we have just found. Note that for a given value of h, all three of (a) the exact solu-
tion, (b) Euler’s solution and (c) the trapezoidal solution are of the form y_n = (constant)^n,
in which the three values of “constant” are

(a) e^{−τ}     (b) 1 − τ     (c) (1 − τ/2)/(1 + τ/2).                                 (2.6.11)
It follows that to compare the two approximate methods with the “truth,” all we have
to do is see how close the constants (b) and (c) above are to the true constant (a). If we
remember that τ is being thought of as small compared to 1, then we have the power series
expansion of e^{−τ}

e^{−τ} = 1 − τ + τ^2/2 − τ^3/6 + · · ·                                                (2.6.12)
to compare with 1 − τ and with the power series expansion of

(1 − τ/2)/(1 + τ/2) = 1 − τ + τ^2/2 − τ^3/4 + · · · .                                 (2.6.13)
The comparison is now clear. Both the Euler and the trapezoidal methods yield
approximate solutions of the form (constant)^n, where “constant” is near e^{−τ}. The
trapezoidal rule does a better job of being near e^{−τ} because its constant agrees with
the power series expansion of e^{−τ} through the quadratic term, whereas that of the Euler
method agrees only up to the linear term.
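A quick numerical check of this comparison (a Python sketch with values of τ chosen by us): halving τ should cut the error of the Euler constant by about a factor of 4, since its error is O(τ^2), and the error of the trapezoidal constant by about a factor of 8, since its error is O(τ^3):

```python
import math

def euler_const(tau):
    return 1 - tau                          # constant (b) of (2.6.11)

def trap_const(tau):
    return (1 - tau / 2) / (1 + tau / 2)    # constant (c) of (2.6.11)

def errs(tau):
    # Errors of the two constants relative to the true constant e^{-tau}.
    exact = math.exp(-tau)
    return abs(exact - euler_const(tau)), abs(exact - trap_const(tau))

e1, t1 = errs(0.1)
e2, t2 = errs(0.05)
# e1/e2 should be near 4 (quadratic error); t1/t2 near 8 (cubic error).
```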
Finally we study the nature of the approximation that is provided by the midpoint
rule. We will find that a new and important phenomenon rears its head in this case. The
analysis begins just as it did in the previous two cases: We substitute the right-hand side
f(x, y) = −y/L for y' in (2.6.3) to get

y_{n+1} = y_{n−1} + 2h (−y_n/L).                                                      (2.6.14)
One important feature is already apparent. Instead of facing a first-order difference
equation as we did in (2.6.6) for Euler’s method and in (2.6.9) for the trapezoidal rule, we
have now to contend with a second-order difference equation.
Since the equation is linear with constant coefficients, we know to try a solution of the
form y_n = r^n. This leads to the quadratic equation

r^2 + 2τr − 1 = 0.                                                                    (2.6.15)
Evidently the discriminant of this equation is positive, so its roots are distinct. If we denote
these two roots by r_+(τ) and r_−(τ), then the general solution of the difference equation
(2.6.14) is

y_n = c (r_+(τ))^n + d (r_−(τ))^n ,                                                   (2.6.16)
where c and d are constants whose values are determined by the initial data y
0
and y
1
.
The Euler and trapezoidal approximations were each of the form (constant)^n. This one
is a sum of two terms of that kind. We will see that r_+(τ) is a very good approximation
to e^{-τ}. The other term, (r_−(τ))^n, is, so to speak, the price that we pay for getting such a
good approximation in r_+(τ). We hope that the other term will stay small relative to the
first term, so as not to disturb the closeness of the approximation. We will see, however,
that it need not be so obliging, and that in fact it might do whatever it can to spoil things.
The two roots of the quadratic equation are

    r_+(τ) = −τ + √(1 + τ^2)
    r_−(τ) = −τ − √(1 + τ^2).        (2.6.17)
When τ = 0 the first of these is +1, so when τ is small r_+(τ) is near +1, and it is the root
that is trying to approximate the exact constant e^{-τ} as well as possible. In fact it does
pretty well, because the power series expansion of r_+(τ) is

    r_+(τ) = 1 − τ + τ^2/2 − τ^4/8 + · · ·        (2.6.18)

so it agrees with e^{-τ} through the quadratic terms.
What about r_−(τ)? Its Taylor series is

    r_−(τ) = −1 − τ − τ^2/2 + · · · .        (2.6.19)
The bad news is now before us: When τ is a small positive number, the root r_−(τ) is
larger than 1 in absolute value. This means that the stability criterion of Theorem 1.6.1 is
violated, so we say that the midpoint rule is unstable.
In practical terms, we observe that r_+(τ) is close to e^{-τ}, so the first term on the right
of (2.6.16) is very close to the exact solution. The second term of (2.6.16), the so-called
parasitic solution, is small compared to the first term when n is small, because the constant
d will be small compared with c. However, as we move to the right, n increases, and the
second term will eventually dominate the first, because the first term is shrinking to zero
as n increases, because that’s what the exact solution does, while the second term increases
steadily in size. In fact, since r_−(τ) is negative and larger than 1 in absolute value, the
second term will alternate in sign as n increases, and grow without bound in magnitude.
In Table 4 below we show the result of integrating the problem y′ = −y, y(0) = 1 with
each of the three methods that we have discussed, using a step size of h = 0.05. To get the
midpoint method started, we used the exact value of y(0.05) (i.e., we cheated), and in the
trapezoidal rule we iterated to convergence with ε = 10^{-4}. The instability of the midpoint
rule is quite apparent.
      x      Euler(x)     Trap(x)      Midpoint(x)   Exact(x)
     0.0     1.00000      1.00000        1.00000     1.00000
     1.0     0.35849      0.36780        0.36806     0.36788
     2.0     0.12851      0.13527        0.13552     0.13534
     3.0     0.04607      0.04975        0.05005     0.04979
     4.0     0.01652      0.01830        0.01888     0.01832
     5.0     0.00592      0.00673        0.00822     0.00674
    10.0     0.00004      0.00005        0.21688     0.00005
    14.55    3.3×10^{-7}  4.8×10^{-7}  −20.48        4.8×10^{-7}
    15.8     9.1×10^{-8}  1.4×10^{-7}   71.45        1.4×10^{-7}

                              Table 4
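The computation behind Table 4 is easy to reproduce. The Python sketch below (an illustration, not the authors' code) integrates y′ = −y, y(0) = 1 with all three methods; for this linear test equation the Euler and converged trapezoidal updates reduce to multiplication by their respective constants, and the midpoint run is started with the exact y(0.05), just as in the table:

```python
import math

def integrate(method, h, n_steps):
    # Integrate y' = -y, y(0) = 1 with n_steps steps of size h and
    # return the computed value at x = n_steps * h.
    y = [1.0]
    if method == "euler":
        for _ in range(n_steps):
            y.append(y[-1] * (1.0 - h))
    elif method == "trapezoidal":
        r = (1.0 - h / 2.0) / (1.0 + h / 2.0)   # iterated-to-convergence factor
        for _ in range(n_steps):
            y.append(y[-1] * r)
    elif method == "midpoint":
        y.append(math.exp(-h))    # start from the exact y(h), as in Table 4
        for _ in range(n_steps - 1):
            y.append(y[-2] + 2.0 * h * (-y[-1]))
    return y[-1]

h, x_end = 0.05, 10.0
n = round(x_end / h)
for method in ("euler", "trapezoidal", "midpoint"):
    print(method, integrate(method, h, n))
print("exact", math.exp(-x_end))
```

By x = 10 the midpoint value has been swamped by the parasitic solution, while the other two methods stay near the exact answer.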
In addition to the above discussion of accuracy, we summarize here three additional
properties of integration methods as they relate to the examples that we have already
studied.
First, a numerical integration method might be iterative or noniterative. A method is
noniterative if it expresses the next value of the unknown function quite explicitly in terms
of values of the function and its derivatives at preceding points. In an iterative method,
at each step of the solution process the next value of the unknown is defined implicitly by
an equation, which must be solved to obtain the next value of the unknown function. In
practice, we may either solve this equation completely by an iteration or do just one step
of the iteration, depending on the quality of available estimates for the unknown value.
Second, a method is self-starting if the next value of the unknown function is obtained
from the values of the function and its derivatives at exactly one previous point. It is not
self-starting if values at more than one backward point are needed to get the next one. In
the latter case some other method will have to be used to get the first few computed values.
Third, we can define a numerical method to be stable if, when it is applied to the equation
y′ = −y/L, where L > 0, then for all sufficiently small positive values of the step size h a
stable difference equation results, i.e., the computed solution (neglecting roundoff) remains
bounded as n → ∞.
We summarize below the properties of the three methods that we have been studying.
                     Euler    Midpoint    Trapezoidal
    Iterative         No         No           Yes
    Self-starting     Yes        No           Yes
    Stable            Yes        No           Yes
Of the three methods, the trapezoidal rule is clearly the best, though for efficient use
it needs the help of some other formula to predict the next value of y and thereby avoid
lengthy iterations.
2.7 Predictor-corrector methods
The trapezoidal rule differs from the other two that we’ve looked at in that it does not
explicitly tell us what the next value of the unknown function is, but instead gives us an
equation that must be solved in order to find it. At first sight this seems like a nuisance,
but in fact it is a boon, because it enables us to regulate the step size during the course of
a calculation, as we will discuss in section 2.9.
Let’s take a look at the process by which we refine a guessed value of y_{n+1} to an improved
value, using the trapezoidal formula

    y_{n+1} = y_n + (h/2) (f(x_n, y_n) + f(x_{n+1}, y_{n+1})).        (2.7.1)
Suppose we let y^{(k)}_{n+1} represent some guess to the value of y_{n+1} that satisfies (2.7.1). Then
the improved value y^{(k+1)}_{n+1} is computed from

    y^{(k+1)}_{n+1} = y_n + (h/2) (f(x_n, y_n) + f(x_{n+1}, y^{(k)}_{n+1})).        (2.7.2)
We want to find out how rapidly the successive values y^{(k)}_{n+1}, k = 1, 2, . . . approach
a limit, if at all. To do this, we rewrite equation (2.7.2), this time replacing k by k − 1, to
obtain

    y^{(k)}_{n+1} = y_n + (h/2) (f(x_n, y_n) + f(x_{n+1}, y^{(k−1)}_{n+1}))        (2.7.3)
and then subtract (2.7.3) from (2.7.2) to get

    y^{(k+1)}_{n+1} − y^{(k)}_{n+1} = (h/2) (f(x_{n+1}, y^{(k)}_{n+1}) − f(x_{n+1}, y^{(k−1)}_{n+1})).        (2.7.4)
Next we use the mean-value theorem on the difference of f values on the right-hand
side, yielding

    y^{(k+1)}_{n+1} − y^{(k)}_{n+1} = (h/2) (∂f/∂y)|_{(x_{n+1}, η)} (y^{(k)}_{n+1} − y^{(k−1)}_{n+1}),        (2.7.5)

where η lies between y^{(k)}_{n+1} and y^{(k−1)}_{n+1}.
From the above we see at once that the difference between two consecutive iterated
values of y_{n+1} will be (h/2)(∂f/∂y) times the difference between the previous two iterated
values. It follows that the iterative process will converge if h is kept small enough so that
(h/2)(∂f/∂y) is less than 1 in absolute value. We refer to (h/2)(∂f/∂y) as the local
convergence factor of the trapezoidal rule.
If the factor is a lot less than 1 (and this can be assured by keeping h small enough),
then the convergence will be extremely rapid.
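For the test equation y′ = −y we have ∂f/∂y = −1, so the convergence factor has magnitude h/2. A few lines of Python (a sketch under that assumption; the names are ours) make the geometric convergence visible:

```python
def corrector_iterates(f, x_n, y_n, h, y_guess, n_iter):
    # Repeatedly apply the trapezoidal corrector (2.7.2):
    #   y^(k+1) = y_n + (h/2) * (f(x_n, y_n) + f(x_n + h, y^(k)))
    ys = [y_guess]
    for _ in range(n_iter):
        ys.append(y_n + (h / 2.0) * (f(x_n, y_n) + f(x_n + h, ys[-1])))
    return ys

f = lambda x, y: -y      # test equation: df/dy = -1, so the factor is -h/2
h = 0.05
ys = corrector_iterates(f, 0.0, 1.0, h, 1.0, 5)   # crude first guess: y_guess = y_n
diffs = [abs(b - a) for a, b in zip(ys, ys[1:])]
ratios = [d2 / d1 for d1, d2 in zip(diffs, diffs[1:])]
print(ratios)   # each ratio equals |h/2 * df/dy| = 0.025 here
```

Each iteration shrinks the gap to the converged value by a factor of 40, so only a few turns of the crank are ever needed.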
In actual practice, one uses an iterative formula together with another formula (the
predictor) whose mission is to provide an intelligent first guess for the iterative method
to use. The predictor formula will be explicit, or noniterative. If the predictor formula is
clever enough, then it will happen that just a single application of the iterative refinement
(corrector formula) will be sufficient, and we won’t have to get involved in a long convergence
process.
If we use the trapezoidal rule for a corrector, for instance, then a clever predictor would
be the midpoint rule. The reason for this will become clear if we look at both formulas
together with their error terms. We will see in the next section that the error terms are as
follows:
    y_{n+1} = y_{n−1} + 2h y′_n + (h^3/3) y′′′(X_m)        (2.7.6)

    y_{n+1} = y_n + (h/2) (y′_n + y′_{n+1}) − (h^3/12) y′′′(X_t).        (2.7.7)
Now the exact locations of the points X_m and X_t are unknown, but we will assume here
that h is small enough that we can regard the two values of y′′′ that appear as being the
same.
As far as the powers of h that appear in the error terms go, we see that the third power
occurs in both formulas. We say then, that the midpoint predictor and the trapezoidal
corrector constitute a matched pair. The error in the trapezoidal rule is about one fourth
as large as, and of opposite sign from, the error in the midpoint method.
The midpoint guess is therefore quite “intelligent”. The subsequent iterative refinement
of that guess needs to reduce the error only by a factor of four. Now let y_P denote the
midpoint predicted value, y^{(1)}_{n+1} denote the first refined value, and y_{n+1} be the final
converged value given by the trapezoidal rule. Then we have

    y_{n+1} = y_n + (h/2) (y′_n + f(x_{n+1}, y_{n+1}))

    y^{(1)}_{n+1} = y_n + (h/2) (y′_n + f(x_{n+1}, y_P))        (2.7.8)
and by subtraction
    y^{(1)}_{n+1} − y_{n+1} = (h/2) (∂f/∂y) (y_P − y_{n+1}).        (2.7.9)
This shows that, however far from the converged value the first guess was, the refined
value is (h/2)(∂f/∂y) times closer. Hence if we can keep (h/2)(∂f/∂y) no bigger than about
1/4, then the distance from the first refined value to the converged value will be no larger
than the size of the error term in the method, so there would be little point in gilding the
iteration any further.
The conclusion is that when we are dealing with a matched predictor-corrector pair,
we need do only a single refinement of the corrector if the step size is kept moderately
small. Furthermore, “moderately small” means that the step size times the local value of
∂f/∂y should be small compared to 1. For this reason, iteration to full convergence is rarely
done in practice.
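Here is a minimal Python sketch (our own illustration, not the authors' code) of one generic step of this matched pair on y′ = −y: a midpoint prediction followed by a single trapezoidal correction, started from exact backward values:

```python
import math

def pc_step(f, x, y_prev, y_curr, h):
    # midpoint predictor followed by a single trapezoidal correction
    y_pred = y_prev + 2.0 * h * f(x, y_curr)
    y_corr = y_curr + (h / 2.0) * (f(x, y_curr) + f(x + h, y_pred))
    return y_pred, y_corr

f = lambda x, y: -y
h = 0.05
# start from exact values at x = 0 and x = h, as the text's tables do
y_prev, y_curr, x = 1.0, math.exp(-h), h
y_pred, y_corr = pc_step(f, x, y_prev, y_curr, h)
print(y_pred, y_corr, math.exp(-2.0 * h))
```

A single correction already brings the predicted value several times closer to the true y(0.1), which is exactly the behavior the error analysis above leads us to expect.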
2.8 Truncation error and step size
We have so far regarded the step size h as a silent partner, more often than not choosing
it to be equal to 0.05, for no particular reason. It is evident, however, that the accuracy of
the calculation is strongly affected by the step size. If h is chosen too large, the computed
solution may be quite far from the true solution of the differential equation, if too small
then the calculation will become unnecessarily time-consuming, and roundoff errors may
build up excessively because of the numerous arithmetic operations that are being carried
out.
Speaking in quite general terms, if the true solution of the differential equation is rapidly
changing, then we will need a small value of h, that is, small compared to the local relaxation
length (see p. 44), and if the solution changes slowly, then a larger value of h will do.
Frequently in practice we deal with equations whose solutions change very rapidly over
part of the range of integration and slowly over another part. Examples of this are provided
by the study of the switching on of a complicated process, such as beginning a multi-stage
chemical reaction, turning on a piece of electronic equipment, starting a power reactor,
etc. In such cases there usually are rapid and ephemeral or “transient” phenomena that
occur soon after startup, and that disappear quickly. If we want to follow these transients
accurately, we may need to choose a very tiny step size. After the transients die out,
however, the steady-state solution may be a very quiet, slowly varying or nearly constant
function, and then a much larger value of h will be adequate.
If we are going to develop software that will be satisfactory for such problems, then
the program will obviously have to choose, and re-choose its own step size as the calcula-
tion proceeds. While following a rapid transient it should use a small mesh size, then it
should gradually increase h as the transient fades, use a large step while the solution is
steady, decrease it again if further quick changes appear, and so forth, all without operator
intervention.
Before we go ahead to discuss methods for achieving this step size control, let’s observe
that one technique is already available in the material of the previous section. Recall that
if we want to, we can implement the trapezoidal rule by first guessing, or predicting, the
unknown at the next point by using Euler’s formula, and then correcting the guess to
complete convergence by iteration.
The first guess will be relatively far away from the final converged value if the solution
is rapidly varying, but if the solution is slowly varying, then the guess will be rather good.
It follows that the number of iterations required to produce convergence is one measure of
the appropriateness of the current value of the step size: if many iterations are needed, then
the step size is too big. Hence one way to get some control on h is to follow a policy of
cutting the step size in half whenever more than, say, one or two iterations are necessary.
This suggestion is not sufficiently sensitive to allow doubling the step size when only one
iteration is needed, however, and somewhat more delicacy is called for in that situation.
Furthermore this is a very time-consuming approach since it involves a complete iteration
to convergence, when in fact a single turn of the crank is enough if the step size is kept
small enough.
The discussion does, however, point to the fundamental idea that underlies the auto-
matic control of step size during the integration. That basic idea is precisely that we can
estimate the correctness of the step size by watching how well the first guess in our iterative
process agrees with the corrected value. The correction process itself, when viewed this
way, is seen to be a powerful ally of the software user, rather than the “pain in the neck”
it seemed to be when we first met it.
Indeed, why would anyone use the cumbersome procedure of guessing and refining (i.e.,
prediction and correction) as we do in the trapezoidal rule, when many other methods are
available that give the next value of the unknown immediately and explicitly? No doubt
the question crossed the reader’s mind, and the answer is now emerging. It will appear that
not only does the disparity between the first prediction and the corrected value help us to
control the step size, it actually can give us a quantitative estimate of the local error in the
integration, so that if we want to, we can print out the approximate size of the error along
with the solution.
Our next job will be to make these rather qualitative remarks into quantitative tools, so
we must discuss the estimation of the error that we commit by using a particular difference
approximation to a differential equation, instead of that equation itself, on one step of the
integration process. This is the single-step truncation error. It does not tell us how far our
computed solution is from the true solution, but only how much error is committed in a
single step.
The easiest example, as usual, is Euler’s method. In fact, in equation (2.1.2) we have
already seen the single-step error of this method. That equation was
    y(x_n + h) = y(x_n) + h y′(x_n) + h^2 y′′(X)/2        (2.8.1)
where X lies between x_n and x_n + h. In Euler’s procedure, we drop the third term on the
right, the “remainder term,” and compute the solution from the rest of the equation. In
doing this we commit a single-step truncation error that is equal to

    E = h^2 y′′(X)/2,        x_n < X < x_n + h.        (2.8.2)
Thus, Euler’s method is exact (E = 0) if the solution is a polynomial of degree 1 or less
(y′′ = 0). Otherwise, the single-step error is proportional to h^2, so if we cut the step size in
half, the local error is reduced to 1/4 of its former value, approximately, and if we double
h the error is multiplied by about 4.
We could use (2.8.2) to estimate the error by somehow computing an estimate of y′′.
For instance, we might differentiate the differential equation y′ = f(x, y) once more, and
compute y′′ directly from the resulting formula. This is usually more trouble than it is
worth, though, and we will prefer to estimate E by more indirect methods.
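The h^2 behavior of (2.8.2) is easy to confirm numerically. The Python sketch below (our illustration) computes the single-step error of Euler's method for y′ = −y, started from the exact value, at two step sizes:

```python
import math

def euler_single_step_error(h, x):
    # E = y(x+h) - [y(x) + h*y'(x)] for y' = -y, starting from the exact
    # value y(x) = e^{-x}; by (2.8.2) this is h^2 * y''(X) / 2.
    y = math.exp(-x)
    return math.exp(-(x + h)) - (y + h * (-y))

e1 = euler_single_step_error(0.1, 0.0)
e2 = euler_single_step_error(0.05, 0.0)
print(e1, e2, e1 / e2)   # the ratio is close to 4
```

Halving h cuts the single-step error by very nearly a factor of 4, as the analysis predicts.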
Next we derive the local error of the trapezoidal rule. There are various special methods
that might be used to do this, but instead we are going to use a very general method that
is capable of dealing with the error terms of almost every integration rule that we intend
to study.
First, let’s look a little more carefully at the capability of the trapezoidal rule, in the
form

    y_{n+1} − y_n − (h/2) (y′_n + y′_{n+1}) = 0.        (2.8.3)
Of course, this is a recurrence by means of which we propagate the approximate solution
to the right. It certainly is not exactly true if y_n denotes the value of the true solution at
the point x_n unless that true solution is very special. How special?
Suppose the true solution is y(x) = 1 for all x. Then (2.8.3) would be exactly valid.
Suppose y(x) = x. Then (2.8.3) is again exactly satisfied, as the reader should check.
Furthermore, if y(x) = x^2, then a brief calculation reveals that (2.8.3) holds once more.
How long does this continue? Our run of good luck has just expired, because if y(x) = x^3
then (check this) the left side of (2.8.3) is not 0, but is instead −h^3/2.
We might say that the trapezoidal rule is exact on 1, x, and x^2, but not on x^3, i.e., that it
is an integration rule of order two (“order” is an overworked word in differential equations).
It follows by linearity that the rule is exact on any quadratic polynomial.
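This exactness claim is easy to check by machine. The Python sketch below (our illustration) applies the left-hand side of (2.8.3), viewed as an operator on functions, to the monomials 1, x, x^2, x^3:

```python
def L_operator(y, dy, x, h):
    # the left-hand side of (2.8.3) applied to a function y with derivative dy
    return y(x + h) - y(x) - (h / 2.0) * (dy(x) + dy(x + h))

h, x = 0.1, 0.7   # an arbitrary step and base point
cases = [
    (lambda t: 1.0,    lambda t: 0.0),         # y = 1
    (lambda t: t,      lambda t: 1.0),         # y = x
    (lambda t: t * t,  lambda t: 2.0 * t),     # y = x^2
    (lambda t: t ** 3, lambda t: 3.0 * t * t), # y = x^3
]
for y, dy in cases:
    print(L_operator(y, dy, x, h))
# the first three values are 0 (up to roundoff); the last is -h^3/2 = -0.0005
```

Note that the defect for x^3 is independent of the base point x, which is what lets us pin down the constant in the error term below.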
By way of contrast, it is easy to verify that Euler’s method is exact for a linear function,
but fails on x^2. Since the error term for Euler’s method in (2.8.2) is of the form
const · h^2 · y′′(X), it is perhaps reasonable to expect the error term for the trapezoidal rule
to look like const · h^3 · y′′′(X).
Now we have two questions to handle, and they are respectively easy and hard:

(a) If the error term in the trapezoidal rule really is const · h^3 · y′′′(X), then what is
“const”?

(b) Is it true that the error term is const · h^3 · y′′′(X)?
We’ll do the easy one first, anticipating that the answer to (b) is affirmative so the effort
won’t be wasted. If the error term is of the form stated, then the trapezoidal rule can be
written as

    y(x + h) − y(x) − (h/2) (y′(x + h) + y′(x)) = c · h^3 · y′′′(X),        (2.8.4)
where X lies between x and x + h. To find c all we have to do is substitute y(x) = x^3
into (2.8.4) and we find at once that c = −1/12. The single-step truncation error of the
trapezoidal rule would therefore be

    E = −h^3 y′′′(X)/12,        x < X < x + h.        (2.8.5)
Now let’s see that question (b) has the answer “yes” so that (2.8.5) is really right.
To do this we start with a truncated Taylor series with the integral form of the remainder,
rather than with the differential form of the remainder. In general the series is

    y(x) = y(0) + x y′(0) + x^2 y′′(0)/2! + · · · + x^n y^{(n)}(0)/n! + R_n(x)        (2.8.6)
where

    R_n(x) = (1/n!) ∫_0^x (x − s)^n y^{(n+1)}(s) ds.        (2.8.7)

Indeed, one of the nice ways to prove Taylor’s theorem begins with the right-hand side of
(2.8.7), plucked from the blue, and then repeatedly integrates by parts, lowering the order
of the derivative of y and the power of (x − s) until both reach zero.
In (2.8.6) we choose n = 2, because the trapezoidal rule is exact on polynomials of
degree 2, and we write it in the form

    y(x) = P_2(x) + R_2(x)        (2.8.8)

where P_2(x) is the quadratic (in x) polynomial P_2(x) = y(0) + x y′(0) + x^2 y′′(0)/2.
Next we define a certain operation that transforms a function y(x) into a new function,
namely into the left-hand side of equation (2.8.4). We call the operator L, so it is defined
by

    Ly(x) = y(x + h) − y(x) − (h/2) (y′(x) + y′(x + h)).        (2.8.9)
Now we apply the operator L to both sides of equation (2.8.8), and we notice immediately
that LP_2(x) = 0, because the rule is exact on quadratic polynomials (this is why we chose
n = 2 in (2.8.6)). Hence we have

    Ly(x) = LR_2(x).        (2.8.10)
Notice that we have here a remainder formula for the trapezoidal rule. It isn’t in a very
satisfactory form yet, so we will now make it a bit more explicit by computing LR_2(x).
First, in the integral expression (2.8.7) for R_2(x) we want to replace the upper limit of the
integral by +∞. We can do this by writing

    R_2(x) = (1/2!) ∫_0^∞ H_2(x − s) y′′′(s) ds        (2.8.11)

where H_2(t) = t^2 if t > 0 and H_2(t) = 0 if t < 0.
Now if we bear in mind the fact that the operator L acts only on x, and that s is a
dummy variable of integration, we find that

    LR_2(x) = (1/2!) ∫_0^∞ LH_2(x − s) y′′′(s) ds.        (2.8.12)
Choose x = 0, so that the step runs from 0 to h. Then if s lies between 0 and h we find

    LH_2(x − s) = (h − s)^2 − (h/2) (2(h − s)) = −s(h − s)        (2.8.13)

(Caution: Do not read past the right-hand side of the first equals sign unless you can verify
the correctness of what you see there!), whereas if s > h then LH_2(x − s) = 0.
Then (2.8.12) becomes

    LR_2(0) = −(1/2) ∫_0^h s(h − s) y′′′(s) ds.        (2.8.14)
This is a much better form for the remainder, but we still have not disposed of the “hard”
question (b). To finish it off we need a form of the mean-value theorem of integral calculus,
namely
Theorem 2.8.1 If p(x) is nonnegative, and g(x) is continuous, then

    ∫_a^b p(x) g(x) dx = g(X) ∫_a^b p(x) dx        (2.8.15)

where X lies between a and b.
The theorem asserts that a weighted average of the values of a continuous function is
itself one of the values of that function. The vital hypothesis is that the “weight” p(x) does
not change sign.
Now in (2.8.14), the function s(h − s) does not change sign on the s-interval (0, h), so

    LR_2(0) = −(1/2) y′′′(X) ∫_0^h s(h − s) ds        (2.8.16)
and if we do the integral we obtain, finally,

Theorem 2.8.2 The trapezoidal rule with remainder term is given by the formula

    y(x_{n+1}) − y(x_n) = (h/2) (y′(x_n) + y′(x_{n+1})) − (h^3/12) y′′′(X),        (2.8.17)

where X lies between x_n and x_{n+1}.
The proof of this theorem involved some ideas that carry over almost unchanged to
very general kinds of integration rules. Therefore it is important to make sure that you
completely understand its derivation.
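As a sanity check on (2.8.17), the Python sketch below (our illustration) verifies that for y(x) = e^{-x} the defect of one trapezoidal step lies in the interval the theorem predicts, since there y′′′(X) = −e^{-X}:

```python
import math

def trapezoidal_defect(x, h):
    # y(x+h) - y(x) - (h/2)(y'(x) + y'(x+h)) for y(x) = e^{-x}
    return (math.exp(-(x + h)) - math.exp(-x)
            - (h / 2.0) * (-math.exp(-x) - math.exp(-(x + h))))

h = 0.05
d = trapezoidal_defect(0.0, h)
# By (2.8.17) the defect equals -(h^3/12) y'''(X) = (h^3/12) e^{-X} for some
# X in (0, h), so it must lie between h^3 e^{-h}/12 and h^3/12.
lo, hi = h ** 3 * math.exp(-h) / 12.0, h ** 3 / 12.0
print(lo, d, hi)   # d lies between lo and hi
```

The observed defect (about 1.02 × 10^{-5}) sits comfortably inside the predicted bracket.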
2.9 Controlling the step size
In equation (2.8.5) we saw that if we can estimate the size of the third derivative during
the calculation, then we can estimate the error in the trapezoidal rule as we go along, and
modify the step size h if necessary, to keep that error within preassigned bounds.
To see how this can be done, we will quote, without proof, the result of a similar
derivation for the midpoint rule. It says that

    y(x_{n+1}) − y(x_{n−1}) = 2h y′(x_n) + (h^3/3) y′′′(X),        (2.9.1)
where X is between x_{n−1} and x_{n+1}. Thus the midpoint rule is also of second order. If the
step size were halved, the local error would be cut to one eighth of its former value. The
error in the midpoint rule is, however, about four times as large as that in the trapezoidal
formula, and of opposite sign.
Now suppose we adopt an overall strategy of predicting the value y_{n+1} of the unknown
by means of the midpoint rule, and then refining the prediction to convergence with the
trapezoidal corrector. We want to estimate the size of the single-step truncation error,
using only the following data, both of which are available during the calculation: (a) the
initial guess, from the midpoint method, and (b) the converged corrected value, from the
trapezoidal rule.
We begin by defining three different kinds of values of the unknown function at the
“next” point x_{n+1}. They are

(i) the quantity p_{n+1}, defined as the predicted value of y(x_{n+1}) obtained from using the
midpoint rule, except that the backward values used are not the computed ones, but are the
exact ones instead. In symbols,

    p_{n+1} = y(x_{n−1}) + 2h y′(x_n).        (2.9.2)

Of course, p_{n+1} is not available during an actual computation.
(ii) the quantity q_{n+1} is the value that we would compute from the trapezoidal corrector
if for the backward value we use the exact solution y(x_n) instead of the calculated
solution y_n. Thus q_{n+1} satisfies the equation

    q_{n+1} = y(x_n) + (h/2) (f(x_n, y(x_n)) + f(x_{n+1}, q_{n+1})).        (2.9.3)

Again, q_{n+1} is not available to us during calculation.
(iii) the quantity y(x_{n+1}), which is the exact solution itself. It satisfies two different
equations, one of which is

    y(x_{n+1}) = y(x_n) + (h/2) (f(x_n, y(x_n)) + f(x_{n+1}, y(x_{n+1}))) − (h^3/12) y′′′(X)        (2.9.4)

and the other of which is (2.9.1). Note that the two X’s may be different.
Now, from (2.9.1) and (2.9.2) we have at once that

    y(x_{n+1}) = p_{n+1} + (h^3/3) y′′′(X).        (2.9.5)
Next, from (2.9.3) and (2.9.4) we get

    y(x_{n+1}) = q_{n+1} + (h/2) (f(x_{n+1}, y(x_{n+1})) − f(x_{n+1}, q_{n+1})) − (h^3/12) y′′′(X)
               = q_{n+1} + (h/2) (y(x_{n+1}) − q_{n+1}) (∂f/∂y)(x_{n+1}, Y) − (h^3/12) y′′′(X)        (2.9.6)
where we have used the mean-value theorem, and Y lies between y(x_{n+1}) and q_{n+1}. Now if
we subtract q_{n+1} from both sides, we will observe that y(x_{n+1}) − q_{n+1} will then appear on
both sides of the equation. Hence we will be able to solve for it, with the result that

    y(x_{n+1}) = q_{n+1} − (h^3/12) y′′′(X) + terms involving h^4.        (2.9.7)
Now let’s make the working hypothesis that y′′′ is constant over the range of values of
x considered, namely from x_n − h to x_n + h. The y′′′(X) in (2.9.7) is thereby decreed to be
equal to the y′′′(X) in (2.9.5), even though the X’s are different. Under this assumption,
we can eliminate y(x_{n+1}) between (2.9.5) and (2.9.7) and obtain

    q_{n+1} − p_{n+1} = (5/12) h^3 y′′′ + terms involving h^4.        (2.9.8)
We see that this expresses the unknown, but assumed constant, value of y′′′ in terms
of the difference between the initial prediction and the final converged value of y(x_{n+1}).
Now we ignore the “terms involving h^4” in (2.9.8), solve for y′′′, and then for the estimated
single-step truncation error we have

    Error = −(h^3/12) y′′′ ≈ −(1/12)(12/5)(q_{n+1} − p_{n+1}) = −(1/5)(q_{n+1} − p_{n+1}).        (2.9.9)
The quantity q_{n+1} − p_{n+1} is not available during the calculation, but as an estimator
we can use the computed predicted value and the computed converged value, because these
differ only in that they use computed, rather than exact, backward values of the unknown
function.
Hence, we have here an estimate of the single-step truncation error that we can conveniently
compute, print out, or use to control the step size.
The derivation of this formula was of course dependent on the fact that we used the
midpoint method for the predictor and the trapezoidal rule for the corrector. If we had
used a different pair, however, the same argument would have worked, provided only that
the error terms of the predictor and corrector formulas both involved the same derivative
of y, i.e., both formulas were of the same order.
Hence, “matched pairs” of predictor and corrector formulas, i.e., pairs in which both
are of the same order, are most useful for carrying out extended calculations in which the
local errors are continually monitored and the step size is regulated accordingly.
Let’s pause to see how this error estimator would have turned out in the case of a general
matched pair of predictor-corrector formulas, instead of just for the midpoint and trapezoidal
rule combination. Suppose the predictor formula has an error term

    y_exact − y_predicted = λ h^q y^{(q)}(X)        (2.9.10)

and suppose that the error in the corrector formula is given by

    y_exact − y_corrected = µ h^q y^{(q)}(X).        (2.9.11)
Then a derivation similar to the one that we have just done will show that the estimator
for the single-step error that is available during the progress of the computation is

    Error ≈ (µ/(λ − µ)) (y_predicted − y_corrected).        (2.9.12)
In the table below we show the result of integrating the differential equation y′ = −y
with y(0) = 1, using the midpoint and trapezoidal formulas with h = 0.05 as the predictor
and corrector, as described above. The successive columns show x, the predicted value at x,
the converged corrected value at x, the single-step error estimated from the approximation
(2.9.9), and the actual single-step error obtained by computing

    y(x_{n+1}) − y(x_n) − (h/2) (y′(x_n) + y′(x_{n+1}))        (2.9.13)

using the true solution y(x) = e^{-x}. The calculation was started by (cheating and) using
the exact solution at 0.00 and 0.05.
      x      Pred(x)     Corr(x)     Errest(x)      Error(x)
    0.00     -------     -------      -------        -------
    0.05     -------     -------      -------        -------
    0.10     0.904877    0.904828     98×10^{-7}    94×10^{-7}
    0.15     0.860747    0.860690    113×10^{-7}    85×10^{-7}
    0.20     0.818759    0.818705    108×10^{-7}    77×10^{-7}
    0.25     0.778820    0.778768    102×10^{-7}    69×10^{-7}
    0.30     0.740828    0.740780     97×10^{-7}    61×10^{-7}
    0.35     0.704690    0.704644     93×10^{-7}    55×10^{-7}
    0.40     0.670315    0.670271     88×10^{-7}    48×10^{-7}
    0.45     0.637617    0.637575     84×10^{-7}    43×10^{-7}
    0.50     0.606514    0.606474     80×10^{-7}    37×10^{-7}
      .          .           .            .             .
      .          .           .            .             .
    0.95     0.386694    0.386669     51×10^{-7}     5×10^{-7}
    1.00     0.367831    0.367807     48×10^{-7}     3×10^{-7}

                              Table 5
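The first row of Table 5 can be reproduced with a short Python sketch (our illustration): one step from exact backward values, the estimate (2.9.9), and the true defect (2.9.13):

```python
import math

f = lambda x, y: -y
h = 0.05

def converged_trapezoid(x, y_n, y_guess, eps=1e-12):
    # iterate the trapezoidal corrector to convergence
    y = y_guess
    while True:
        y_new = y_n + (h / 2.0) * (f(x, y_n) + f(x + h, y))
        if abs(y_new - y) < eps:
            return y_new
        y = y_new

# one step from exact backward values at x = 0 and x = h, as Table 5 was started
x, y_prev, y_curr = h, 1.0, math.exp(-h)
pred = y_prev + 2.0 * h * f(x, y_curr)        # midpoint predictor
corr = converged_trapezoid(x, y_curr, pred)   # converged trapezoidal value
est = -(corr - pred) / 5.0                    # estimate (2.9.9)
actual = (math.exp(-(x + h)) - math.exp(-x)
          - (h / 2.0) * (-math.exp(-x) - math.exp(-(x + h))))  # defect (2.9.13)
print(pred, corr, est, actual)
```

The estimate and the true defect agree to within a few percent, which is all we need in order to steer the step size.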
Now that we have a simple device for estimating the single-step truncation error, namely
by using one fifth of the distance between the first guess and the corrected value, we can
regulate the step size so as to keep the error between preset limits. Suppose we would like to
keep the single-step error in the neighborhood of 10^{-8}. We might then adopt, say, 5×10^{-8}
as the upper limit of tolerable error and, for instance, 10^{-9} as the lower limit.
Why should we have a lower limit? If the calculation is being done with more precision
than necessary, the step size will be smaller than needed, and we will be wasting computer
time as well as possibly building up roundoff error.
Now that we have fixed these limits we should be under no delusion that our computed
solution will depart from the true solution by no more than 5×10^{-8}, or whatever. What
we are controlling is the one-step truncation error, a lot better than nothing, but certainly
not the same as the total accumulated error.
With the upper and lower tolerances set, we embark on the computation. First, since
the midpoint method needs two backward values before it can be used, something special
will have to be done to get the procedure moving at the start. This is typical of numerical
solution of differential equations. Special procedures are needed at the beginning to build
up enough computed values so that the predictor-corrector formulas can be used thereafter,
or at least until the step size is changed.
In the present case, since we’re going to use the trapezoidal corrector anyway, we might
as well use the trapezoidal rule, unassisted by midpoint, with complete iteration to convergence,
to get the value of y at the first point x_0 + h beyond the initial point x_0.
Now we have two consecutive values of y, and the generic calculation can begin. From
any two consecutive values, the midpoint rule is used to predict the next value of y. This
predicted value is also saved for future use in error estimation. The predicted value is then
refined by the trapezoidal rule.
With the trapezoidal value in hand, the local error is then estimated by calculating
one-fifth of the difference between that value and the midpoint guess.
If the absolute value of the local error lies between the preset limits 10^{-9} and 5×10^{-8},
then we just go on to the next step. This means that we augment x by h, and move back the
newer values of the unknown function to the locations that hold older values (we remember,
at any moment, just two past values).
Otherwise, suppose the local error was too large. Then we must reduce the step size
h, say by cutting it in half. When all this is done, some message should be printed out
that announces the change, and then we should restart the procedure, with the new value
of h, from the “farthest backward” value of x for which we still have the corresponding y
in memory. One reason for this is that we may find out right at the outset that our very
first choice of h is too large, and perhaps it may need to be halved, say, three times before
the errors are tolerable. Then we would like to restart each time from the same originally
given data point x_0, rather than let the computation creep forward a little bit with step
sizes that are too large for comfort.
Finally, suppose the local error was too small. Then we double the step size, print a
message to that effect, and restart, again from the smallest possible value of x.
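The whole policy just described can be sketched in a few dozen lines of Python (an illustration only; the tolerances, the simplified restart rule, and all names are our own renderings of the text's description):

```python
import math

def adaptive_integrate(f, x0, y0, x_end, h, err_max=5e-8, err_min=1e-9):
    def trap_step(x, y, h):
        # trapezoidal rule iterated to convergence (a fixed iteration count
        # is plenty for these small values of (h/2) * df/dy)
        z = y
        for _ in range(50):
            z = y + (h / 2.0) * (f(x, y) + f(x + h, z))
        return z

    # self-starting first step: the trapezoidal rule alone
    x_back, y_back = x0, y0          # "farthest backward" stored point
    x, y = x0 + h, trap_step(x0, y0, h)
    while x < x_end - 1e-9:
        pred = y_back + 2.0 * h * f(x, y)                   # midpoint predictor
        corr = y + (h / 2.0) * (f(x, y) + f(x + h, pred))   # single correction
        err = abs(pred - corr) / 5.0                        # estimate (2.9.9)
        if err > err_max:        # error too large: halve h and restart the pair
            h /= 2.0
            x, y = x_back + h, trap_step(x_back, y_back, h)
        elif err < err_min:      # error needlessly small: double h and restart
            h *= 2.0
            x, y = x_back + h, trap_step(x_back, y_back, h)
        else:                    # accept the step; shift the two-value history
            x_back, y_back, x, y = x, y, x + h, corr
    return x, y, h

x, y, h = adaptive_integrate(lambda x, y: -y, 0.0, 1.0, 5.0, 0.05)
print(x, y, math.exp(-x), h)
```

Run on y′ = −y, the driver immediately halves the initial h = 0.05 a few times until the local error estimate falls inside the window, then marches along with the step size it settled on.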
Now let’s apply the philosophy of structured programming to see how the whole thing
should be organized. We ask first for the major logical blocks into which the computation
is divided. In this case we see
(i) a procedure midpt. Input to this procedure will be x, h, y_{n−1}, y_n. Output from it will be y_{n+1}, computed from the midpoint formula. No arrays are involved. The three values of y in question occupy just three memory locations. The leading statement in this routine might be
midpt:=proc(x,h,y0,ym1,n);
and its return statement would be return(yp1);. One might think that it is scarcely necessary to have a separate subroutine for such a simple calculation. The spirit of structured programming dictates otherwise. Someday one might want to change from the midpoint predictor to some other predictor. If organized as a subroutine, then it’s quite easy to disentangle it from the program and replace it. This is the “modular” approach.
(ii) a procedure trapez. This routine will be called from two or three different places in the main routine: when starting, when restarting with a new value of h, and in a generic step, where it is the corrector formula. Operation of this routine is different, too, depending on the circumstances. When starting or restarting, there is no externally supplied guess to help it. It must find its own way to convergence. On a generic step of the integration, however, we want it to use the prediction supplied by midpt, and then just do a single correction.
One way to handle this is to use a logical variable start. If trapez is called with start
= TRUE, then the subroutine would supply a final converged value without looking for any
input guess. Suppose the first line of trapez is
trapez:=proc(x,h,yin,yguess,n,start,eps);
and its return statement is return(yout);. When called with start = TRUE, then the
routine might use yin as its first guess to yout, and iterate to convergence from there, using
eps as its convergence criterion. If start = FALSE, it will take yguess as an estimate of
yout, then use the trapezoidal rule just once, to refine the value as yout.
The combination of these two modules plus a small main program that calls them as
needed, constitutes the whole package. Each of the subroutines and the main routine should
be heavily documented in a self-contained way. That is, the descriptions of trapez, and of
midpt, should precisely explain their operation as separate modules, with their own inputs
and outputs, quite independently of the way they happen to be used by the main program
in this problem. It should be possible to unplug a module from this application and use it
without change in another. The documentation of a procedure should never make reference
to how it is used by a certain calling program, but should describe only how it transforms
its own inputs into its own outputs.
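A bare-bones sketch of the two modules (in Python rather than Maple, with hypothetical names and signatures echoing midpt and trapez, and with a single scalar unknown for brevity) may clarify how the start flag changes the corrector's behavior:

```python
def midpt(f, x, h, y_cur, y_prev):
    # Midpoint predictor: y_{n+1} = y_{n-1} + 2 h f(x_n, y_n).
    return y_prev + 2 * h * f(x, y_cur)

def trapez(f, x, h, yin, yguess, start, eps):
    # start = True : no external guess is available; iterate the
    #                trapezoidal rule to convergence, starting from yin.
    # start = False: take yguess (the midpoint prediction) and apply
    #                the trapezoidal rule just once to refine it.
    if not start:
        return yin + (h / 2) * (f(x, yin) + f(x + h, yguess))
    yout = yin
    while True:
        new = yin + (h / 2) * (f(x, yin) + f(x + h, yout))
        if abs(new - yout) < eps:
            return new
        yout = new
```

Each routine depends only on its own arguments, so either one could be unplugged and replaced, which is the point of the modular organization described above.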
In the next section we are going to take an intermission from the study of integration
rules in order to discuss an actual physical problem, the flight of a spacecraft to the moon.
This problem will need the best methods that we can find for a successful, accurate solution.
Then in section 2.12, we’ll return to the modules discussed above, and will display complete
computer programs that carry them out. In the meantime, the reader might enjoy trying to
write such a complete program now, and comparing it with the one in that section.
Figure 2.2: 1D Moon Rocket (the earth centered at x = 0 with surface at x = R, the rocket at x = x(t), and the moon, of radius r, centered at x = D)
2.10 Case study: Rocket to the moon
Now we have a reasonably powerful apparatus for integration of initial-value problems and
systems, including the automatic regulation of step size and built-in error estimation. In
order to try out this software on a problem that will use all of its capability, in this section
we are going to derive the differential equations that govern the flight of a rocket to the
moon. We do this first in a one-dimensional model, and then in two dimensions. It will
be very useful to have these equations available for testing various proposed integration
methods. Great accuracy will be needed, and the ability to change the step size, both to increase it and to decrease it, will be essential, or else the computation will become
intolerably long. The variety of solutions that can be obtained is quite remarkable.
First, in the one-dimensional simplified model, we place the center of the earth at the
origin of the x-axis, and let R denote the earth’s radius. At the point x = D we place the
moon, and we let its radius be r. Finally, at a position x = x(t) is our rocket, making its
way towards the moon.
We will use Newton’s law of gravitation to find the net gravitational force on the
rocket, and equate it to the mass of the rocket times its acceleration (Newton’s second law of
motion). According to Newton’s law of gravitation, the gravitational force exerted by one
body on another is proportional to the product of their masses and inversely proportional
to the square of the distance between them. If we use K for the constant of proportionality,
then the force on the rocket due to the earth is
−K M_E m / x^2 ,   (2.10.1)
whereas the force on the rocket due to the moon’s gravity is
K M_M m / (D − x)^2   (2.10.2)
where M_E, M_M and m are, respectively, the masses of the earth, the moon and the rocket.
The acceleration of the rocket is of course x''(t), and so the assertion that the net force is equal to mass times acceleration takes the form:

m x'' = −K M_E m / x^2 + K M_M m / (D − x)^2 .   (2.10.3)
This is a (nasty) differential equation of second order in the unknown function x(t), the
position of the rocket at time t. Note the nonlinear way in which this unknown function
appears on the right-hand side.
A second-order differential equation deserves two initial values, and we will oblige. First, let’s agree that at time t = 0 the rocket was on the surface of the earth, and second,
that the rocket was fired at the moon with a certain initial velocity V . Hence, the initial
conditions that go with (2.10.3) are
x(0) = R ;   x'(0) = V.   (2.10.4)
Now, just a quick glance at (2.10.3) shows that m cancels out, so let’s remove it, but
not before pointing out the immense significance of that fact. It implies that the motion
of the rocket is independent of its mass. By performing a now-legendary experiment, dropping rocks of different sizes from the Tower of Pisa, Galileo demonstrated that fact to an incredulous world.
At any rate, (2.10.3) now reads as

x'' = −K M_E / x^2 + K M_M / (D − x)^2 .   (2.10.5)
We can make this equation a good bit prettier by changing the units of distance and time
from miles and seconds (or whatever) to a set of more natural units for the problem.
For our unit of distance we choose R, the radius of the earth. If we divide (2.10.5)
through by R, we can write the result as
(x/R)'' = −(K M_E / R^3) · 1/(x/R)^2 + (K M_M / R^3) · 1/(D/R − x/R)^2 .   (2.10.6)
Now instead of the unknown function x(t), we define y(t) = x(t)/R. Then y(t) is the
position of the rocket, expressed in earth radii, at time t. Further, the ratio D/R that
occurs in (2.10.6) is a dimensionless quantity, whose numerical value is about 60. Hence
(2.10.6) has now been transformed to
y'' = −(K M_E / R^3) / y^2 + (K M_M / R^3) / (60 − y)^2 .   (2.10.7)
. (2.10.7)
Next we tackle the new time units. Since y is now dimensionless, the dimension of the
left side of the equation is the reciprocal of the square of a time. If we look next at the
first term on the right, which of course has the same dimension, we see that the quantity
R^3/(K M_E) is the square of a time, so

T_0 = √( R^3 / (K M_E) )   (2.10.8)
is a time. Its numerical value is easier to calculate if we change the formula first, as follows.
Consider a body of mass m on the surface of the earth. Its weight is the magnitude of
the force exerted on it by the earth’s gravity, namely K M_E m / R^2. Its weight is also equal to m times the acceleration of the body, namely the acceleration due to gravity, usually denoted by g, and having the value 32.2 feet/sec^2.
It follows that

K M_E m / R^2 = m g,   (2.10.9)
and if we substitute into (2.10.8) we find that our time unit is

T_0 = √( R/g ).   (2.10.10)
We take R = 4000 miles, and find T_0 is about 13 minutes and 30 seconds. We propose to measure time in units of T_0. To that end, we multiply equation (2.10.7) through by T_0^2 and get
T_0^2 y'' = −1/y^2 + (M_M/M_E) / (60 − y)^2 .   (2.10.11)

The ratio of the mass M_M of the moon to the mass M_E of the earth is about 0.012.
Furthermore, we will now introduce a new independent variable τ and a new dependent
variable u = u(τ) by the relations
u(τ) = y(τ T_0) ;   t = τ T_0.   (2.10.12)

Thus, u(τ) represents the position of the rocket, measured in units of the radius of the earth, at a time τ that is measured in units of T_0, i.e., in units of 13.5 minutes.
The substitution of (2.10.12) into (2.10.11) yields the differential equation for the scaled
distance u(τ) as a function of the scaled time τ in the form
u'' = −1/u^2 + 0.012/(60 − u)^2 .   (2.10.13)
Finally we must translate the initial conditions (2.10.4) into conditions on the new
variables. The first condition is easy: u(0) = 1. Next, if we differentiate (2.10.12) and set
τ = 0 we get
u'(0) = T_0 V / R = V / (R/T_0).   (2.10.14)

This is a ratio of two velocities. In the numerator is the velocity with which the rocket is launched. What is the significance of the velocity R/T_0?
We claim that it is, aside from a numerical factor, the escape velocity from the earth, if
there were no moon. Perhaps the quickest way to see this is to go back to equation (2.10.11)
and drop the second term on the right-hand side (the one that comes from the moon). Then
we will be looking at the differential equation that would govern the motion if the moon
were absent. This equation can be solved. Multiply both sides by 2y', and it becomes

T_0^2 [ (y')^2 ]' = [ 2/y ]' ,   (2.10.15)
and integration yields
T_0^2 (y')^2 = 2/y + C.   (2.10.16)
Now let t = 0 and find that C = T_0^2 V^2 / R^2 − 2, so

T_0^2 (y')^2 = 2/y + ( T_0^2 V^2 / R^2 − 2 ).   (2.10.17)
Suppose the rocket is launched with sufficient initial velocity to escape from the earth. Then
the function y(t) will grow without bound. Hence let y → ∞ on the right side of (2.10.17).
For all values of y, the left side is a square, and therefore a nonnegative quantity. Hence
the right side, which approaches the constant C, must also be nonnegative. Thus C ≥ 0 or,
equivalently
V ≥ √2 · R/T_0 .   (2.10.18)
Thus, if the rocket escapes, then (2.10.18) is true, and the converse is easy to show also.
Hence the quantity √2 R/T_0 is the escape velocity from the earth. We shall denote it by V_esc. Its numerical value is approximately 25,145 miles per hour.
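Both numerical values quoted above are easy to reproduce. The following quick check is ours, not part of the original text; it works in feet and seconds, with R = 4000 miles and g = 32.2 feet/sec^2 as given earlier:

```python
import math

R = 4000 * 5280          # earth radius in feet (4000 miles)
g = 32.2                 # acceleration due to gravity, ft/sec^2

T0 = math.sqrt(R / g)               # the time unit (2.10.10), in seconds
V_esc = math.sqrt(2) * R / T0       # escape velocity, ft/sec
V_esc_mph = V_esc * 3600 / 5280     # converted to miles per hour

print(T0 / 60)       # about 13.5 minutes
print(V_esc_mph)     # about 25,145 miles per hour
```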
Now we can return to (2.10.12) to translate the initial conditions on x'(t) into initial conditions on u'(τ). In terms of the escape velocity, it becomes u'(0) = √2 V/V_esc. We might say that if we choose to measure distance in units of earth radii, and time in units of T_0, then velocities turn out to be measured in units of escape velocity, aside from the factor √2.
In summary, the differential equation and the initial conditions have the final form
u'' = −1/u^2 + 0.012/(60 − u)^2
u(0) = 1
u'(0) = √2 · V/V_esc   (2.10.19)
Since that was all so easy, let’s try the two-dimensional case next. Here, the earth is
centered at the origin of the xy-plane, and the moon is moving. Let the coordinates of the
moon at time t be (x_m(t), y_m(t)). For example, if we take the orbit of the moon to be a circle of radius D, then we would have x_m(t) = D cos(ωt) and y_m(t) = D sin(ωt).
If we put the rocket at a generic position (x(t), y(t)) on the way to the moon, then we have the configuration shown in figure 2.3.
Consider the net force on the rocket in the x direction. It is given by

F_x = −K M_E m cos θ / (x^2 + y^2) + K M_M m cos ψ / ((x − x_m)^2 + (y − y_m)^2) ,   (2.10.20)
where the angles θ and ψ are shown in figure 2.3. From that figure, we see that

cos θ = x / √(x^2 + y^2)   (2.10.21)
and
cos ψ = (x_m − x) / √((x_m − x)^2 + (y_m − y)^2) .   (2.10.22)
Figure 2.3: The 2D Moon Rocket (the earth at the origin, the rocket at (x(t), y(t)), the moon at (x_M(t), y_M(t)); the angles θ and ψ are measured at the rocket)
Now we substitute into (2.10.20), and equate the force in the x direction to m x''(t), to obtain the differential equation

m x''(t) = −K M_E m x / (x^2 + y^2)^{3/2} + K M_M m (x_m − x) / ((x_m − x)^2 + (y_m − y)^2)^{3/2} .   (2.10.23)
If we carry out a similar analysis for the y-component of the force on the rocket, we get
m y''(t) = −K M_E m y / (x^2 + y^2)^{3/2} + K M_M m (y_m − y) / ((x_m − x)^2 + (y_m − y)^2)^{3/2} .   (2.10.24)
We are now looking at two (even nastier) simultaneous differential equations of the
second order in the two unknown functions x(t), y(t) that describe the position of the
rocket. To go with these equations, we need four initial conditions. We will suppose that at
time t = 0, the rocket is on the earth’s surface, at the point (R, 0). Further, at time t = 0,
it will be fired with an initial speed of V , in a direction that makes an angle α with the
positive x-axis. Thus, our initial conditions are
x(0) = R ;   y(0) = 0
x'(0) = V cos α ;   y'(0) = V sin α.   (2.10.25)
The problem has now been completely defined. It remains to change the units into the same
natural dimensions of distance and time that were used in the one-dimensional problem.
This time we leave the details to the reader, and give only the results. If u(τ) and v(τ) denote the x and y coordinates of the rocket, measured in units of earth radii, at a time τ measured in units of T_0 (see (2.10.10)), then it turns out that u and v satisfy the differential equations
u'' = −u/(u^2 + v^2)^{3/2} + 0.012 (u_m − u)/((u_m − u)^2 + (v_m − v)^2)^{3/2}
v'' = −v/(u^2 + v^2)^{3/2} + 0.012 (v_m − v)/((u_m − u)^2 + (v_m − v)^2)^{3/2} .   (2.10.26)
Furthermore, the initial data (2.10.25) take the form
u(0) = 1 ;   v(0) = 0
u'(0) = √2 · V cos α / V_esc ;   v'(0) = √2 · V sin α / V_esc.   (2.10.27)
In these equations, the functions u_m(τ) and v_m(τ) are the x and y coordinates of the moon, in units of R, at the time τ. Just to be specific, let’s decree that the moon is in a circular orbit of radius 60R, and that it completes a revolution every twenty-eight days. Then, after a brief session with a hand calculator or a computer, we discover that the equations
u_m(τ) = 60 cos(0.002103745 τ)
v_m(τ) = 60 sin(0.002103745 τ)   (2.10.28)
represent the position of the moon.
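The constant 0.002103745 is just 2π divided by the 28-day period expressed in units of T_0 = 13.5 minutes; a two-line check (ours, not part of the original text) reproduces it:

```python
import math

T0_minutes = 13.5                      # the time unit, in minutes
period = 28 * 24 * 60 / T0_minutes     # 28 days, measured in units of T0
omega = 2 * math.pi / period           # radians per time unit
print(omega)                           # about 0.002103745
```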
2.11 Maple programs for the trapezoidal rule
In this section we will first display a complete Maple program that can numerically solve a
system of ordinary differential equations of the first order together with given initial values.
After discussing those programs, we will illustrate their operation by doing the numerical
solution of the one dimensional moon rocket problem.
We will employ Euler’s method to predict the values of the unknowns at the next point
x + h from their values at x, and then we will apply the trapezoidal rule to correct these
predicted values until sufficient convergence has occurred.
First, here is the program that does the Euler method prediction.
> eulermethod:=proc(yin,x,h,f)
> local yout,ll,i:
> # Given the array yin of unknowns at x, uses Euler method to return
> # the array of values of the unknowns at x+h. The function f(x,y) is
> # the array-valued right hand side of the given system of ODE’s.
> ll:=nops(yin):
> yout:=[]:
> for i from 1 to ll do
> yout:=[op(yout),yin[i]+h*f(x,yin,i)];
> od:
> RETURN(yout):
> end:
Next, here is the program that takes as input an array of guessed values of the unknowns
at x +h and refines the guess to convergence using the trapezoidal rule.
> traprule:=proc(yin,x,h,eps,f)
> local ynew,yfirst,ll,toofar,yguess,i,allnear,dist;
> # Input is the array yin of values of the unknowns at x. The program
> # first calls eulermethod to obtain the array ynew of guessed values
> # of y at x+h. It then refines the guess repeatedly, using the trapezoidal
> # rule, until the previous guess, yguess, and the refined guess, ynew, agree
> # within a tolerance of eps in all components. Program then computes dist,
> # which is the largest deviation of any component of the final converged
> # solution from the initial Euler method guess. If dist is too large
> # the mesh size h should be decreased; if too small, h should be increased.
> ynew:=eulermethod(yin,x,h,f);
> yfirst:=ynew;
> ll:=nops(yin);
> allnear:=false;
> while(not allnear) do
> yguess:=ynew;
> ynew:=[];
> for i from 1 to ll do
> ynew:=[op(ynew),yin[i]+(h/2)*(f(x,yin,i)+f(x+h,yguess,i))];
> od;
> allnear:=true;
> for i from 1 to ll do allnear:=allnear and abs(ynew[i]-yguess[i])<eps od:
> od; #end while
> dist:=max(seq(abs(ynew[i]-yfirst[i]),i=1..ll));
> RETURN([dist,ynew]):
> end:
The two programs above each operate at a single point x and seek to compute the
unknowns at the next point x +h. Now we need a global view, that is a program that will
call the above repeatedly and increment the value of x until the end of the desired range
of x. The global routine also needs to check whether or not the mesh size h needs to be
changed at each point and to do so when necessary.
> trapglobal:=proc(f,y0,h0,xinit,xfinal,eps,nprint)
> local x,y,y1,h,j,arr,dst,cnt:
> # Finds solution of the ODE system y’=f(x,y), where y is an array
> # and f is array-valued. y0 is initial data array at x=xinit.
> # Halts when x>xfinal. eps is convergence criterion for
> # trapezoidal rule; Prints every nprint-th value that is computed.
> x:=xinit:y:=y0:arr:=[[x,y[1]]]:h:=h0:cnt:=0:
> while x<=xfinal do
> y1:=traprule(y,x,h,eps,f):
> y:=y1[2]:dst:=y1[1];
> # Is dst too large? If so, halve the mesh size h and repeat.
> while dst>3*eps do
> h:=h/2; lprint(‘At x=‘,x,‘h was reduced to‘,h);
> y1:=traprule(y,x,h,eps,f):
> y:=y1[2]:dst:=y1[1];
> od:
> # Is dst too small? If so, double the mesh size h and repeat.
> while dst<.0001*eps do
> h:=2*h; lprint(‘At x=‘,x,‘h was increased to‘,h);
> y1:=traprule(y,x,h,eps,f):
> y:=y1[2]:dst:=y1[1];
> od:
> # Adjoin newly computed values to the output array arr.
> x:=x+h; arr:=[op(arr),[x,y[1]]]:
> # Decide if we should print this line of output or not.
> cnt:=cnt+1: if cnt mod nprint =0 or x>=xfinal then print(x,y) fi;
> od:
> RETURN(arr);
> end:
The above three programs comprise a general package that can numerically solve systems of ordinary differential equations. The applicability of the package is limited mainly by the fact that Euler’s method and the trapezoidal rule are fairly primitive approximations to the truth, and therefore one should not expect dazzling accuracy when these routines are used over long intervals of integration.
2.11.1 Example: Computing the cosine function
We will now give two examples of the operation of the above programs. First let’s compute
the cosine function. We will numerically integrate the equation y'' + y = 0 with initial conditions y(0) = 1 and y'(0) = 0 over the range from x = 0 to x = 2π. To do this we use two unknown functions y_1, y_2 which are subject to the equations y_1' = y_2 and y_2' = −y_1, together with initial values y_1(0) = 1, y_2(0) = 0. Then the function y_1(x) will be the cosine function.
To use the routines above, we need to program the function f = f(x, y) that gives the
right hand sides of the input ODE’s. This is done as follows:
> f:=proc(x,u,j)
> if j=1 then RETURN(u[2]) else RETURN(-u[1]) fi:
> end:
That’s all. To run the programs we type the line
> trapglobal(f,[1,0],.031415927,0,6.3,.0001,50):
This means that we want to solve the system whose right hand sides are as given by f,
with initial condition array [1, 0]. The initial choice of the mesh size h is π/100, in order to
facilitate the comparison of the values that will be output with those of the cosine function
at the same points. The range of x over which we are asking for the solution is from x = 0
to x = 6.3. Our convergence criterion eps is set to .0001, and we are asking the program
to print every 50th line of output that it computes. Maple responds to this call with the
following output.
At x= 0 h was reduced to .1570796350e-1
.7853981750, [ .6845520546, -.7289635418]
1.570796368, [-.0313962454, -.9995064533]
2.356194568, [-.7289550591, -.6845604434]
3.141592768, [-.9995063863, .0313843153]
3.926990968, [-.6845688318, .7289465759]
4.712389168, [ .0313723853, .9995063195]
5.497787368, [ .7289380928, .6845772205]
6.283185568, [ .9995062524, -.0313604551]
We observe that the program first decided that the value of h that we gave it was too big
and it was cut in half. Next we see that the accuracy is pretty good. At x = π we have 3
or 4 correct digits, and we still have them at x = 2π. The values of the negative of the sine
function, which are displayed above as the second column of unknowns, are less accurate at
those points. To improve the accuracy we might try reducing the error tolerance eps, but
realistically we will have to confess that the major source of imprecision lies in the Euler
and Trapezoidal combination itself, which, although it provides a good introduction to the
philosophy of these methods, is too crude to yield results of great accuracy over a long range
of integration.
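For comparison, the same Euler-predict, trapezoidal-correct scheme can be transcribed into a few lines of Python (our own sketch, not the Maple package above), applied to the cosine system:

```python
import math

def f(x, y):
    # right-hand side of the system y1' = y2, y2' = -y1
    return [y[1], -y[0]]

def trap_step(x, y, h, eps=1e-10):
    # Euler prediction, then trapezoidal corrections to convergence
    fx = f(x, y)
    guess = [yi + h * fi for yi, fi in zip(y, fx)]
    while True:
        fg = f(x + h, guess)
        new = [yi + (h / 2) * (a + b) for yi, a, b in zip(y, fx, fg)]
        if max(abs(n - g) for n, g in zip(new, guess)) < eps:
            return new
        guess = new

x, y = 0.0, [1.0, 0.0]
h = math.pi / 100
for _ in range(100):           # march from x = 0 to x = pi
    y = trap_step(x, y, h)
    x += h
print(y[0])                    # close to cos(pi) = -1, good to a few digits
```

As with the Maple output above, a few correct digits is all one should expect from this simple combination of methods.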
2.11.2 Example: The moon rocket in one dimension
As a second example we will run the moon rocket in one dimension. The equations that
we’re solving now are given by (2.10.19). So all we need to do is to program the right hand
sides f, which we do as follows.
> f:=proc(x,u,j)
> if j=1 then RETURN(u[2]) else RETURN(-1/u[1]^2+.012/(60-u[1])^2) fi:
> end:
Then we invoke our routines by the statement
trapglobal(f,[1,1.4142],.02,0,250,.0001,1000):
This means that we are solving our system with initial data (1, √2), with an initial mesh size
of h = .02, integrating over the range of time from t = 0 until t = 250, with a convergence
tolerance eps of .0001, and printing the output every 1000 lines. We inserted an extra
line of program also, so as to halt the calculation as soon as y[1], the distance from earth,
reached 59.75, which is the surface of the moon.
Maple responded with the following output.
At x= 0 h was reduced to .1000000000e-1
10.00000000, [7.911695068, .5027323920]
20.00000000, [12.36222569, .4022135344]
30.00000000, [16.11311118, .3523608113]
40.00000000, [19.46812426, .3206445364]
50.00000000, [22.55572225, .2979916199]
60.00000000, [25.44558364, .2806820525]
70.00000000, [28.18089771, .2668577906]
80.00000000, [30.79081775, .2554698522]
90.00000000, [33.29624630, .2458746369]
100.0000000, [35.71287575, .2376534139]
110.0000000, [38.05293965, .2305225677]
120.0000000, [40.32630042, .2242857188]
130.0000000, [42.54117736, .2188074327]
140.0000000, [44.70467985, .2139997868]
150.0000000, [46.82325670, .2098188816]
160.0000000, [48.90316649, .2062733168]
170.0000000, [50.95113244, .2034559376]
At x= 174.2300000 h was increased to .2000000000e-1
At x= 183.4500000 h was increased to .4000000000e-1
188.0900000, [54.61091112, .2014073390]
At x= 211.7300000 h was reduced to .2000000000e-1
211.8100000, [59.75560109, .3622060968]
So the trip was somewhat eventful. The mesh size was reduced to .01 right away. It was
increased again after 174 time units because at that time the rocket was moving quite
slowly, and again after 183 time units for the same reason. Impact on the surface of the
moon occurred at time 211.81. Since each time unit corresponds to 13.5 minutes, this means
that the whole trip took 211.81 × 13.5 minutes, or 47.65 hours – nearly two days.
The reader might enjoy experimenting with this situation a little bit. For instance, if
we reduce the initial velocity from 1.4142 by a small amount, then the rocket will reach
some maximum distance from Earth and will fall back to the ground without ever having
reached the moon.
2.12 The big leagues
The three integration rules that we have studied are able to handle small-to-medium sized
problems in reasonable time, and with good accuracy. Some problems, however, are more
demanding than that. Our two-dimensional moon shot is an example of such a situation,
where even a good method like the trapezoidal rule is unable to give the pinpoint accuracy
that is needed. In this section we discuss a general family of methods, of which all three
of the rules that we have studied are examples, that includes methods of arbitrarily high
accuracy. These are the linear multistep methods.
A general linear multistep method is of the form

y_{n+1} = Σ_{i=0}^{p} a_{−i} y_{n−i} + h Σ_{i=−1}^{p} b_{−i} y'_{n−i} .   (2.12.1)
In order to compute the next value of y, namely y_{n+1}, we need to store p + 1 backwards values of y and p + 1 backwards values of y'. A total of p + 2 points are involved in the formula, and so we can call (2.12.1) a (p + 2)-point formula.
The trapezoidal rule, for example, arises by taking p = 0, a_0 = 1, b_1 = 1/2, b_0 = 1/2. Euler’s method has p = 0, a_0 = 1, b_1 = 0, b_0 = 1, whereas for the midpoint rule we have p = 1, a_0 = 0, a_{−1} = 1, b_1 = 0, b_0 = 2, b_{−1} = 0.
We can recognize an explicit, or noniterative, formula by looking at b_1. If b_1 is nonzero, then y_{n+1} appears implicitly on the right side of (2.12.1) as well as on the left, and the formula does not explicitly tell us the value of y_{n+1}. Otherwise, if b_1 = 0, the formula is explicit. In either case, if p > 0 the formula will need help getting started or restarted, whereas if p = 0 it is self-starting.
The general linear multistep formula contains 2p + 3 constants

a_0, . . . , a_{−p} and b_1, b_0, b_{−1}, . . . , b_{−p}.

These constants are chosen to give the method various properties that we may deem to be desirable in a particular application. For instance, if we value explicit formulas, then we may set b_1 = 0 immediately, thereby using one of the 2p + 3 free parameters, and leaving 2p + 2 others.
One might think that the remaining parameters should be chosen so as to give the
highest possible accuracy, in some sense. However, for a fixed p, the more accuracy we
demand, the more we come into conflict with stability, another highly desirable feature.
Indeed, if we demand “too much accuracy” for a fixed p, it will turn out that no stable
formulas exist. An important theorem of the subject, due to Dahlquist, states roughly that
we cannot use more than about half of the 2p + 3 “degrees of freedom” to achieve high
accuracy if we want to have a stable formula (and we do!).
First let’s discuss the conditions of accuracy. These are usually handled by asking that
the equation (2.12.1) should be exactly true if the unknown function y happens to be a
polynomial of low enough degree. For instance, suppose y(x) = 1 for all x. Substitute y_k = 1 and y'_k = 0 for all k into (2.12.1), and there follows the condition

Σ_{i=0}^{p} a_{−i} = 1.   (2.12.2)
Notice that in all three of the methods we have been studying, the sum of the a’s is
indeed equal to 1.
Now suppose we want our multistep formula to be exact not only when y(x) = 1, but also when y(x) = x. We substitute y_k = kh, y'_k = 1 for all k into (2.12.1) and obtain, after some simplification, and use of (2.12.2),

−Σ_{i=0}^{p} i a_{−i} + Σ_{i=−1}^{p} b_{−i} = 1.   (2.12.3)
The reader should check that this condition is also satisfied by all three of the methods we
have studied.
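Conditions (2.12.2) and (2.12.3) are easy to check mechanically. In the sketch below (ours, not part of the original text), the coefficient dictionaries are keyed so that a[i] holds a_{−i} for i = 0, . . . , p and b[i] holds b_{−i} for i = −1, . . . , p, so b[-1] is the coefficient b_1 of y'_{n+1}:

```python
# coefficient sets for the three methods of this chapter
methods = {
    "trapezoidal": ({0: 1.0}, {-1: 0.5, 0: 0.5}),
    "euler":       ({0: 1.0}, {-1: 0.0, 0: 1.0}),
    "midpoint":    ({0: 0.0, 1: 1.0}, {-1: 0.0, 0: 2.0, 1: 0.0}),
}

def check(a, b):
    cond1 = sum(a.values())                                      # (2.12.2)
    cond2 = -sum(i * c for i, c in a.items()) + sum(b.values())  # (2.12.3)
    return cond1, cond2

for name, (a, b) in methods.items():
    print(name, check(a, b))    # both conditions give 1 for each method
```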
In general let’s find the condition for the formula to integrate the function y(x) = x^r and all lower powers of x exactly, for some fixed value of r. Hence, in (2.12.1), we substitute y_k = (kh)^r and y'_k = r(kh)^{r−1}. After cancelling the factor h^r, we get

(n + 1)^r = Σ_{i=0}^{p} a_{−i} (n − i)^r + r Σ_{i=−1}^{p} b_{−i} (n − i)^{r−1} .   (2.12.4)
Now clearly we do x^r and all lower powers of x exactly if and only if we do (x + c)^r and
all lower powers of x exactly. The conditions are therefore translation invariant, so we can
choose one special value of n in (2.12.4) if we want.
Let’s choose n = 0, because the result then simplifies to

(−1)^r = Σ_{i=0}^{p} i^r a_{−i} − r Σ_{i=−1}^{p} i^{r−1} b_{−i} ,   r = 0, 1, 2, . . . .   (2.12.5)
A small technical remark here is that when r = 1 and i = 0 we see a 0^0 in the second sum on the right. This should be interpreted as 1, in accordance with (2.12.3).
By the order (of accuracy) of a linear multistep method we mean the highest power of x
that the formula integrates exactly, or equivalently, the largest number r for which (2.12.5)
is true (together with its analogues for all numbers between 0 and r).
The accuracy conditions enable us to take the role of designer, and to construct accurate formulas with desirable characteristics. The reader should verify that the trapezoidal rule is the most accurate of all possible two-point formulas, and should search for the most accurate of all possible three-point formulas (ignoring the stability question altogether).
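Condition (2.12.5) also gives a mechanical way to find the order of a method. The sketch below is ours, with the same coefficient indexing as before (a[i] holds a_{−i}, b[i] holds b_{−i}, b[-1] is b_1); it returns the largest r for which (2.12.5) holds for 0, 1, . . . , r. Note that Python's 0**0 evaluates to 1, which is exactly the convention required above:

```python
def order(a, b, rmax=10):
    # Largest r such that (2.12.5) holds for 0, 1, ..., r, i.e. the
    # formula integrates 1, x, ..., x^r exactly.
    for r in range(rmax + 1):
        rhs = sum(i ** r * c for i, c in a.items())
        if r > 0:
            rhs -= r * sum(i ** (r - 1) * c for i, c in b.items())
        if abs((-1) ** r - rhs) > 1e-9:
            return r - 1
    return rmax

trap     = ({0: 1.0}, {-1: 0.5, 0: 0.5})                  # p = 0
euler    = ({0: 1.0}, {-1: 0.0, 0: 1.0})                  # p = 0
midpoint = ({0: 0.0, 1: 1.0}, {-1: 0.0, 0: 2.0, 1: 0.0})  # p = 1

print(order(*trap), order(*euler), order(*midpoint))   # 2 1 2
```

Running it confirms that the trapezoidal and midpoint rules have order two while Euler's method has order one.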
Now we must discuss the question of stability. Just as in section 2.6, stability will be
judged with respect to the performance of our multistep formula on a particular differential
equation, namely
y' = −y/L   (L > 0)   (2.12.6)

with y(0) = 1.
To see how well the general formula (2.12.1) does with this differential equation, whose solution is a falling exponential, substitute y'_k = −y_k/L for all k in (2.12.1), to obtain

(1 + τ b_1) y_{n+1} − Σ_{i=0}^{p} (a_{−i} − τ b_{−i}) y_{n−i} = 0   (2.12.7)
where as in section 2.6, we have written τ = h/L, the ratio of the step size to the relaxation
length of the problem.
Equation (2.12.7) is a linear difference equation with constant coefficients of order p+1.
If as usual with such equations, we look for solutions of the form α^n, then after substitution and cancellation we obtain the characteristic equation

(1 + τ b_1) α^{p+1} − Σ_{i=0}^{p} (a_{−i} − τ b_{−i}) α^{p−i} = 0.   (2.12.8)
This is a polynomial equation of degree p + 1, so it has p + 1 roots somewhere in the
complex plane. These roots depend on the value of τ, that is, on the step size that we use
to do the integration.
We can’t expect that these roots will have absolute value less than 1 however large we
choose the step size h. All we can hope for is that it should be possible, by choosing h small
enough, to get all of these roots to lie inside the unit disk in the complex plane.
Just for openers, let’s see where the roots are when h = 0. In fact, if when h = 0 some
root has absolute value strictly greater than 1, then there is no hope at all for the formula,
because for all sufficiently small h there will be a root outside the unit disk, and the method
will be unstable.
Now when h = 0, τ = 0 also, so the polynomial equation (2.12.8) becomes simply

α^{p+1} − Σ_{i=0}^{p} a_{−i} α^{p−i} = 0.   (2.12.9)
This is also a polynomial equation of degree p + 1. However, its coefficients don’t depend
on the step size, or even on the values of the various b’s in the formula. The coefficients,
and therefore the roots, depend only on the a’s that we use.
Hence, our first necessary condition for the stability of the formula (2.12.1) is that the
polynomial equation (2.12.9) should have no roots of absolute value greater than 1. The
reader should now check the three methods that we have been using to see if this condition
is satisfied.
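For the three methods of this chapter the check takes only a few lines (a verification sketch of ours); the polynomial (2.12.9) involves only the a's:

```python
import math

# roots of alpha^(p+1) - sum_{i=0}^{p} a_{-i} alpha^(p-i) = 0
# trapezoidal and Euler (p = 0): alpha - a_0 = 0, single root a_0 = 1
root_onestep = 1.0

# midpoint (p = 1, a_0 = 0, a_{-1} = 1): alpha^2 - a_0*alpha - a_{-1} = 0
a0, am1 = 0.0, 1.0
disc = math.sqrt(a0 * a0 + 4 * am1)
roots_midpoint = ((a0 + disc) / 2, (a0 - disc) / 2)   # +1 and -1

print(root_onestep, roots_midpoint)   # all roots lie on the unit circle
```

No root exceeds 1 in absolute value, so all three methods pass this necessary condition; note, though, that the midpoint rule's second root sits exactly on the unit circle rather than inside it.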
Next we have to study what happens to the roots when h is small and positive. Qualitatively, here’s what happens. One of the roots of (2.12.9) is always α = 1, because condition (2.12.2) is always satisfied in practice, since all it asks is that we correctly integrate the equation y' = 0, which is surely not an excessive request.
The one root of (2.12.9) that is equal to 1 is our good friend, and we’ll call it the principal root. As h increases slightly away from 0 the principal root moves slightly away from 1. Let α_1(h) denote the value of the principal root at some small value of h. Then α_1(0) = 1.
Now for small positive h, it turns out that the principal root tries as hard as it can to be a
good approximation to e
−τ
. This means that the portion of the solution of the difference
equation (2.12.7) that comes from the one root α
1
(h) is
α
1
(h)
n
= (“nearly” e
−τ
)
n
= “nearly” e
−nτ
= “nearly” e
−nh/L
.
(2.12.10)
But e
−nh/L
is the exact solution of the differential equation (2.12.6) that we’re trying to
solve. Therefore the principal root is trying to give us a very accurate approximation to
the exact solution.
Well then, what are all of the other p roots of the difference equation (2.12.7) doing for us? The answer is that they are trying as hard as they can to make our lives difficult. The high quality of our approximate solution derives from the nearness of the principal root to e^{−τ}. This high quality is bought at the price of using a (p+2)-point multistep formula, and this forces the characteristic equation to be of degree p + 1, and hence the remaining roots have to be there, too.
People have invented various names for these other, non-principal, roots of the difference equation. One that is printable is "parasitic," so we'll call them the parasitic roots. We would be happiest if they would all be zero, but failing that, we would like them to be as small as possible in absolute value.
A picture of a good multistep method in action, then, with some small value of h, shows one root of the characteristic equation near 1 (more precisely, very near e^{−τ}), and all of the other roots near the origin.
Now let's try to make this picture a bit more quantitative. We return to the polynomial equation (2.12.8), and attempt to find a power series expansion for the principal root α₁(τ) in powers of τ. The expansion will be in the form
\[
\alpha_1(\tau) = 1 + q_1\tau + q_2\tau^2 + \cdots \tag{2.12.11}
\]
where the q's are to be determined. If we substitute (2.12.11) into (2.12.8), we get
\[
(1 + \tau b_1)(1 + q_1\tau + q_2\tau^2 + \cdots)^{p+1} - \sum_{i=0}^{p}(a_{-i} - \tau b_{-i})(1 + q_1\tau + q_2\tau^2 + \cdots)^{p-i} = 0. \tag{2.12.12}
\]
Now we equate the coefficient of each power of τ to zero. First, the coefficient of the zeroth power of τ is
\[
1 - \sum_{i=0}^{p} a_{-i} \tag{2.12.13}
\]
and, according to (2.12.2), this is indeed zero if our multistep formula correctly integrates a constant function.
Second, the coefficient of τ is
\[
b_1 + (p+1)q_1 - \sum_{i=0}^{p}\left[a_{-i}(p-i)q_1 - b_{-i}\right] = 0. \tag{2.12.14}
\]
However, if we use (2.12.2) and (2.12.3), this simplifies instantly to the simple statement that q₁ = −1.
So far, we have shown by direct calculation that if our multistep formula is of order at least 1 (i.e., if (2.12.4) holds for r = 0 and r = 1), then the expansion (2.12.11) of the principal root agrees with the expansion of e^{−τ} through terms of first degree in τ.
Much more is true, although we will not prove it here: the expansion of the principal root agrees with the expansion of e^{−τ} through terms of degree equal to the order of the multistep method under consideration. The proof can be done just by continuing the argument that we began above, but we omit the details.
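We can watch this agreement numerically for the trapezoidal rule, which is of order 2. Here is a small Python sketch (the helper name is ours): for y′ = −y/L with τ = h/L, the trapezoidal characteristic equation is (1 + τ/2)α − (1 − τ/2) = 0, and its one root should differ from e^{−τ} only by O(τ³).

```python
import math

def trapezoid_principal_root(tau):
    # For the trapezoidal rule applied to y' = -y/L (tau = h/L), the
    # characteristic equation is (1 + tau/2)*alpha - (1 - tau/2) = 0,
    # so the principal (and only) root is:
    return (1 - tau / 2) / (1 + tau / 2)

# The rule is of order 2, so the root agrees with exp(-tau) through
# the tau^2 term; the difference is O(tau^3) (in fact about tau^3/12).
for tau in (0.1, 0.05, 0.025):
    assert abs(trapezoid_principal_root(tau) - math.exp(-tau)) < tau**3
```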
Thus the careful determination of the a's and the b's, done so as to make the formula as accurate as possible, results in the principal root being "nearly" e^{−τ}. Equal care must be taken to assure that the parasitic roots stay small.
We illustrate these ideas with a new multistep formula,
\[
y_{n+1} = y_{n-1} + \frac{h}{3}\left(y'_{n+1} + 4y'_n + y'_{n-1}\right). \tag{2.12.15}
\]
This, like the midpoint rule, is a three-point method (p = 1). It is iterative, because b₁ = 1/3 ≠ 0, and it happens to be very accurate. Indeed, we can quickly check that
the accuracy conditions (2.12.5) are satisfied for r = 0, 1, 2, 3, 4. The method is of fourth order, and in fact its error term can be found by the method of section 2.8 to be −h⁵y⁽ᵛ⁾(X)/90, where X lies between x_{n−1} and x_{n+1}.
Now we examine the stability of this method. First, when h = 0 the equation (2.12.9) that determines the roots is just α² − 1 = 0, so the roots are +1 and −1. The root at +1 is the friendly one. As h increases slightly to small positive values, that root will follow the power series expansion of e^{−τ} very accurately, in fact, through the first four powers of τ.
The root at −1 is to be regarded with apprehension, because it is poised on the brink of causing trouble. If, as h grows to a small positive value, this root grows in absolute value, then its powers will dwarf the powers of the principal root in the numerical solution, and all accuracy will eventually be lost.
To see if this happens, let's substitute a power series
\[
\alpha_2(h) = -1 + q_1\tau + q_2\tau^2 + \cdots \tag{2.12.16}
\]
into the characteristic equation (2.12.8), which in the present case is just the quadratic equation
\[
\left(1 + \frac{\tau}{3}\right)\alpha^2 + \frac{4\tau}{3}\,\alpha - \left(1 - \frac{\tau}{3}\right) = 0. \tag{2.12.17}
\]
After substituting, we quickly find that q₁ = −1/3, and our apprehension was fully warranted, because for small τ the root acts like −1 − τ/3: for all small positive values of τ this lies outside of the unit disk, so the method will be unstable.
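We can watch this instability happen. The following Python sketch (an illustration of ours, not a program from the text) integrates y′ = −y with formula (2.12.15), starting from two exact values. The parasitic root near −(1 + τ/3) amplifies tiny startup and truncation errors until they swamp the true solution e^{−x}:

```python
import math

def integrate_2_12_15(h, n_steps):
    """Apply formula (2.12.15) to y' = -y. Since f = -y, the implicit
    formula can be solved exactly for y[n+1]:
        (1 + h/3) y[n+1] = (1 - h/3) y[n-1] - (4h/3) y[n]."""
    y_prev, y_curr = 1.0, math.exp(-h)      # exact starting values
    for _ in range(n_steps - 1):
        y_next = ((1 - h/3) * y_prev - (4*h/3) * y_curr) / (1 + h/3)
        y_prev, y_curr = y_curr, y_next
    return y_curr

h, n = 0.05, 1000
computed = integrate_2_12_15(h, n)
exact = math.exp(-h * n)                    # e^(-50), about 2e-22
# the computed value is garbage: many orders of magnitude too large,
# because the parasitic component grows like (1 + h/3)^n
assert abs(computed) > 1e6 * exact
```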
In the next section, we are going to describe a family of multistep methods, called the
Adams methods, that are stable, and that offer whatever accuracy one might want, if one
is willing to save enough backwards values of the y’s. First we will develop a very general
tool, the Lagrange interpolation formula, that we’ll need in several parts of the sequel, and
following that we discuss the Adams formulas. The secret weapon of the Adams formulas
is that when h = 0, one of the roots (the friendly one) is as usual sitting at 1, ready to
develop into the exponential series, and all of the unfriendly roots are huddled together at
the origin, just as far out of trouble as they can get.
2.13 Lagrange and Adams formulas
Our next job is to develop formulas that can give us as much accuracy as we want in a
numerical solution of a differential equation. This means that we want methods in which
the formulas span a number of points, i.e., in which the next value of y is obtained from
several backward values, instead of from just one or two as in the methods that we have
already studied. Furthermore, these methods will need some assistance in getting started,
so we will have to develop matched formulas that will provide them with starting values of
the right accuracy.
All of these jobs can be done with the aid of a formula, due to Lagrange, whose mission
in life is to fit a polynomial to a given set of data points, so let’s begin with a little example.
Problem: Through the three points (1, 17), (3, 10), (7, −5) there passes a unique quadratic
polynomial. Find it.
Solution:
\[
P(x) = 17\left[\frac{(x-3)(x-7)}{(1-3)(1-7)}\right] + 10\left[\frac{(x-1)(x-7)}{(3-1)(3-7)}\right] + (-5)\left[\frac{(x-1)(x-3)}{(7-1)(7-3)}\right]. \tag{2.13.1}
\]
Let's check that this really is the right answer. First of all, (2.13.1) is a quadratic polynomial, since each term is. Does it pass through the three given points? When x = 1, the second and third terms vanish, because of the factor x − 1, and the first term has the value 17, as is obvious from the cancellation that takes place. Similarly, when x = 3, the first and third terms are zero and the second is 10, etc.
Now we can jump from this little example all the way to the general formula. If we want to find the polynomial of degree n − 1 that passes through n given data points
\[
(x_1, y_1),\ (x_2, y_2),\ \ldots,\ (x_n, y_n),
\]
then all we have to do is to write it out:
\[
P(x) = \sum_{i=1}^{n} y_i \left(\prod_{\substack{j=1 \\ j\neq i}}^{n} \frac{x - x_j}{x_i - x_j}\right). \tag{2.13.2}
\]
This is the Lagrange interpolation formula. The ith term of the sum above contains the product of n − 1 factors (x − x_j)/(x_i − x_j), namely all of those factors except for the one in which j = i.
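Formula (2.13.2) translates directly into code. Here is a small Python version (ours; exact rational arithmetic is used so the checks below are exact):

```python
from fractions import Fraction

def lagrange(points, x):
    """Evaluate the Lagrange interpolating polynomial (2.13.2)
    through the given (x_i, y_i) points at the point x."""
    total = Fraction(0)
    for i, (xi, yi) in enumerate(points):
        term = Fraction(yi)
        for j, (xj, _) in enumerate(points):
            if j != i:
                term *= Fraction(x - xj, xi - xj)  # factor (x - x_j)/(x_i - x_j)
        total += term
    return total

# the quadratic of the little example must reproduce its three points
pts = [(1, 17), (3, 10), (7, -5)]
for xi, yi in pts:
    assert lagrange(pts, xi) == yi
```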
Next, consider the special case of the above formula in which the points x₁, x₂, ..., x_n are chosen to be equally spaced. To be specific, we might as well suppose that x_j = jh for j = 1, 2, ..., n.
If we substitute in (2.13.2), we get
\[
P(x) = \sum_{i=1}^{n} y_i \left(\prod_{\substack{j=1 \\ j\neq i}}^{n} \frac{x - jh}{(i-j)h}\right). \tag{2.13.3}
\]
Consider the product of the denominators above. It is
\[
(i-1)(i-2)\cdots(1)(-1)(-2)\cdots(i-n)\,h^{n-1} = (-1)^{n-i}(i-1)!\,(n-i)!\,h^{n-1}. \tag{2.13.4}
\]
Finally we replace x by xh and substitute in (2.13.3) to obtain
\[
P(xh) = \sum_{i=1}^{n} \frac{(-1)^{n-i}}{(i-1)!\,(n-i)!}\; y_i \left(\prod_{\substack{j=1 \\ j\neq i}}^{n} (x-j)\right) \tag{2.13.5}
\]
and that is about as simple as we can get it.
Now we'll use this result to obtain a collection of excellent methods for solving differential equations, the so-called Adams formulas. Begin with the obvious fact that
\[
y((p+1)h) = y(ph) + h\int_{p}^{p+1} y'(ht)\,dt. \tag{2.13.6}
\]
Instead of integrating the exact function y′ in this formula, we will integrate the polynomial that agrees with y′ at a number of given data points. First, we replace y′ by the Lagrange interpolating polynomial that agrees with it at the p + 1 points h, 2h, ..., (p+1)h.
This transforms (2.13.6) into
\[
y_{p+1} = y_p + h\sum_{i=1}^{p+1} \frac{(-1)^{p-i+1}}{(i-1)!\,(p-i+1)!}\; y'(ih) \int_{p}^{p+1} \left(\prod_{\substack{j=1 \\ j\neq i}}^{p+1} (x-j)\right) dx. \tag{2.13.7}
\]
We can rewrite this in the more customary form of a linear multistep method:
\[
y_{n+1} = y_n + h\sum_{i=-1}^{p-1} b_{-i}\, y'_{n-i}. \tag{2.13.8}
\]
This involves replacing i by p − i in (2.13.7), so the numbers b_{−i} are given by
\[
b_{-i} = \frac{(-1)^{i+1}}{(p-1-i)!\,(i+1)!} \int_{p}^{p+1} \left(\prod_{\substack{j=1 \\ j\neq p-i}}^{p+1} (x-j)\right) dx, \qquad i = -1, 0, \ldots, p-1. \tag{2.13.9}
\]
Now to choose a formula in this family, all we have to do is specify p. For instance, let’s
take p = 2. Then we find
\[
\begin{aligned}
b_1 &= \tfrac{1}{2}\int_2^3 (x-1)(x-2)\,dx = \tfrac{5}{12} \\
b_0 &= -\int_2^3 (x-1)(x-3)\,dx = \tfrac{2}{3} \\
b_{-1} &= \tfrac{1}{2}\int_2^3 (x-2)(x-3)\,dx = -\tfrac{1}{12}
\end{aligned} \tag{2.13.10}
\]
so we have the third-order implicit Adams method
\[
y_{n+1} = y_n + \frac{h}{12}\left(5y'_{n+1} + 8y'_n - y'_{n-1}\right). \tag{2.13.11}
\]
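The integrals in (2.13.9) are integrals of polynomials with rational coefficients, so the b's can be generated mechanically. A Python sketch of ours, using exact rational arithmetic, reproduces the p = 2 coefficients (2.13.10):

```python
from fractions import Fraction
from math import factorial

def poly_mul(a, b):
    """Multiply two polynomials given as coefficient lists (low degree first)."""
    out = [Fraction(0)] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def poly_integral(coeffs, lo, hi):
    """Integrate sum_k c_k x^k from lo to hi, exactly."""
    return sum(c * (Fraction(hi)**(k + 1) - Fraction(lo)**(k + 1)) / (k + 1)
               for k, c in enumerate(coeffs))

def implicit_adams_b(p):
    """The coefficients b_{-i}, i = -1, 0, ..., p-1, from formula (2.13.9),
    returned as a dictionary keyed by i."""
    bs = {}
    for i in range(-1, p):
        prod = [Fraction(1)]
        for j in range(1, p + 2):
            if j != p - i:
                prod = poly_mul(prod, [Fraction(-j), Fraction(1)])  # factor (x - j)
        sign = Fraction((-1) ** (i + 1))
        bs[i] = sign * poly_integral(prod, p, p + 1) \
                / (factorial(p - 1 - i) * factorial(i + 1))
    return bs

b = implicit_adams_b(2)   # b[-1] is b_1, b[0] is b_0, b[1] is b_{-1}
assert (b[-1], b[0], b[1]) == (Fraction(5, 12), Fraction(2, 3), Fraction(-1, 12))
```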
If this process is repeated for various values of p we get the whole collection of these
formulas. For reference, we show the first four of them below, complete with their error
terms.
\[
y_{n+1} = y_n + hy'_n + \frac{h^2}{2}\,y'' \tag{2.13.12}
\]
\[
y_{n+1} = y_n + \frac{h}{2}\left(y'_{n+1} + y'_n\right) - \frac{h^3}{12}\,y''' \tag{2.13.13}
\]
\[
y_{n+1} = y_n + \frac{h}{12}\left(5y'_{n+1} + 8y'_n - y'_{n-1}\right) - \frac{h^4}{24}\,y^{(iv)} \tag{2.13.14}
\]
\[
y_{n+1} = y_n + \frac{h}{24}\left(9y'_{n+1} + 19y'_n - 5y'_{n-1} + y'_{n-2}\right) - \frac{19h^5}{720}\,y^{(v)} \tag{2.13.15}
\]
If we compare these formulas with the general linear multistep method (2.12.1), then we quickly find the reason for the stability of these formulas. Notice that only one backwards value of y itself is used, namely y_n. The other backwards values are all values of the derivative.
Now look at equation (2.12.9) again. It is the polynomial equation that determines the roots of the characteristic equation of the method, in the limiting case where the step size is zero.
If we use a formula with all the a_i = 0, except that a_0 = 1, then that polynomial equation becomes simply
\[
\alpha^{p+1} - \alpha^p = 0. \tag{2.13.16}
\]
The roots of this are 1, 0, 0,. . . ,0. The root at 1, as h grows to a small positive value, is
the one that gives us the accuracy of the computed solution. The other roots all start at
the origin, so when h is small they too will be small, instead of trying to cause trouble for
us. All of the Adams formulas are stable for this reason.
In Table 6 we show the behavior of the three roots of the characteristic equation (2.12.8)
as it applies to the fourth-order method (2.13.15). It happens that the roots are all real in
this case. Notice that the friendly root is the one of largest absolute value for all τ ≤ 0.9.
τ     e^{−τ}      Friendly(τ)  Root2(τ)   Root3(τ)
0.0 1.000000 1.000000 0.000000 0.000000
0.1 0.904837 0.904837 0.058536 -0.075823
0.2 0.818731 0.818723 0.081048 -0.116824
0.3 0.740818 0.740758 0.098550 -0.153914
0.4 0.670320 0.670068 0.113948 -0.189813
0.5 0.606531 0.605769 0.128457 -0.225454
0.6 0.548812 0.546918 0.142857 -0.261204
0.7 0.496585 0.492437 0.157869 -0.297171
0.8 0.449329 0.440927 0.174458 -0.333333
0.9 0.406570 0.390073 0.194475 -0.369595
1.0 0.367879 0.333333 0.224009 -0.405827
Table 6
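The table entries are easy to reproduce. For y′ = −y/L, formula (2.13.15) gives the characteristic equation (1 + 9τ/24)α³ − (1 − 19τ/24)α² − (5τ/24)α + τ/24 = 0, and a few Newton steps started at e^{−τ} pick out the friendly root (a Python sketch of ours):

```python
import math

def friendly_root(tau):
    """Newton's method for the principal root of the characteristic
    equation of (2.13.15) applied to y' = -y/L:
        (1 + 9t/24) a^3 - (1 - 19t/24) a^2 - (5t/24) a + t/24 = 0."""
    c3, c2 = 1 + 9*tau/24, -(1 - 19*tau/24)
    c1, c0 = -5*tau/24, tau/24
    a = math.exp(-tau)              # excellent starting guess
    for _ in range(50):
        p  = ((c3*a + c2)*a + c1)*a + c0
        dp = (3*c3*a + 2*c2)*a + c1
        a -= p / dp
    return a

# spot-check two rows of Table 6
assert abs(friendly_root(0.1) - 0.904837) < 5e-6
assert abs(friendly_root(0.5) - 0.605769) < 5e-6
```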
Now we have found the implicit Adams formulas. In each of them the unknown y_{n+1} appears on both sides of the equation. They are, therefore, useful as corrector formulas.
To find matching predictor formulas is also straightforward. We return to (2.13.6), and for a predictor method of order p + 1 we replace y′ by the interpolating polynomial that agrees with it at the data points 0, h, 2h, 3h, ..., ph (but not at (p+1)h as before). Then, in place of (2.13.7) we get
\[
y_{p+1} = y_p + h\sum_{i=0}^{p} \frac{(-1)^{p-i}}{i!\,(p-i)!}\; y'(ih) \int_{p}^{p+1} \left(\prod_{\substack{j=0 \\ j\neq i}}^{p} (x-j)\right) dx. \tag{2.13.17}
\]
As before, we can write this in the more familiar form
\[
y_{n+1} = y_n + h\sum_{i=0}^{p} b_{-i}\, y'_{n-i} \tag{2.13.18}
\]
where the numbers b_{−i} are now given by
\[
b_{-i} = \frac{(-1)^{i}}{(p-i)!\,i!} \int_{p}^{p+1} \left(\prod_{\substack{j=0 \\ j\neq p-i}}^{p} (x-j)\right) dx, \qquad i = 0, 1, \ldots, p. \tag{2.13.19}
\]
We tabulate below these formulas in the cases p = 1, 2, 3, together with their error terms:
\[
y_{n+1} = y_n + \frac{h}{2}\left(3y'_n - y'_{n-1}\right) + \frac{5h^3}{12}\,y''' \tag{2.13.20}
\]
\[
y_{n+1} = y_n + \frac{h}{12}\left(23y'_n - 16y'_{n-1} + 5y'_{n-2}\right) + \frac{3h^4}{8}\,y^{(iv)} \tag{2.13.21}
\]
\[
y_{n+1} = y_n + \frac{h}{24}\left(55y'_n - 59y'_{n-1} + 37y'_{n-2} - 9y'_{n-3}\right) + \frac{251h^5}{720}\,y^{(v)} \tag{2.13.22}
\]
Notice, for example, that the explicit fourth-order formula (2.13.22) has about 13 times as large an error term as the implicit fourth-order formula (2.13.15). This is typical for matched pairs of predictor-corrector formulas. As we have noted previously, a single application of a corrector formula reduces the error by a factor of about h ∂f/∂y, so if we keep h small enough that h ∂f/∂y is less than 1/13 or thereabouts, then a single application of the corrector formula will produce an estimate of the next value of y with full attainable accuracy.
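As an illustration (a Python sketch of ours, not a program from the text), here is a second-order pair in action as a predict-evaluate-correct-evaluate (PECE) scheme: the two-step predictor (2.13.20) followed by one application of the trapezoidal corrector (2.13.13), applied to y′ = −y. The second starting value is taken to be exact; in practice it would come from a starting formula such as (2.13.27).

```python
import math

def pece(f, y0, y1, h, n_steps):
    """Advance by: predict with the two-step Adams formula (2.13.20),
    evaluate, correct once with the trapezoidal formula (2.13.13),
    evaluate (a PECE scheme). y0, y1 are values at x = 0 and x = h."""
    y_prev, y_curr = y0, y1
    x = h
    for _ in range(n_steps - 1):
        f_prev, f_curr = f(x - h, y_prev), f(x, y_curr)
        y_pred = y_curr + (h / 2) * (3 * f_curr - f_prev)        # predictor
        y_next = y_curr + (h / 2) * (f(x + h, y_pred) + f_curr)  # corrector
        y_prev, y_curr = y_curr, y_next
        x += h
    return y_curr

h = 0.01
y = pece(lambda x, y: -y, 1.0, math.exp(-h), h, 100)
assert abs(y - math.exp(-1.0)) < 1e-4      # second-order accuracy at x = 1
```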
We are now fully equipped with Adams formulas for prediction and correction, in
matched pairs, of whatever accuracy is needed, all of them stable. The use of these pairs
requires special starting formulas, since multistep methods cannot get themselves started
or restarted without assistance. Once again the Lagrange interpolation formula comes to
the rescue.
This time we begin with a slight variation of (2.13.6),
\[
y(mh) = y(0) + \int_0^{mh} y'(t)\,dt. \tag{2.13.23}
\]
Next, replace y′(t) by the Lagrange polynomial that agrees with it at 0, h, 2h, ..., ph (for p ≥ m). We then obtain
\[
y_m = y_0 + h\sum_{i=0}^{p} \frac{(-1)^{p-i}}{i!\,(p-i)!}\; y'_i \int_0^{m} \left(\prod_{\substack{j=0 \\ j\neq i}}^{p} (t-j)\right) dt, \qquad m = 1, 2, \ldots, p. \tag{2.13.24}
\]
We can rewrite these equations in the form
\[
y_m = y_0 + h\sum_{j=0}^{p} \lambda_j\, y'_j, \qquad m = 1, 2, \ldots, p \tag{2.13.25}
\]
where the coefficients λ_i (which depend on m as well as on i) are given by
\[
\lambda_i = \frac{(-1)^{p-i}}{i!\,(p-i)!} \int_0^{m} \left(\prod_{\substack{j=0 \\ j\neq i}}^{p} (t-j)\right) dt, \qquad i = 0, \ldots, p. \tag{2.13.26}
\]
Of course, when these formulas are used on a differential equation y′ = f(x, y), each of the values of y′ on the right side of (2.13.25) is replaced by an f(x, y) value. Therefore equations (2.13.25) are a set of p simultaneous equations in the p unknowns y₁, y₂, ..., y_p (y₀ is, of course, known). We can solve them with an iteration in which we first guess all p of the unknowns, and then refine the guesses all at once by using (2.13.25) to give us the new guesses from the old, until sufficient convergence has occurred.
For example, if we take p = 3, then the starting formulas are
\[
\begin{aligned}
y_1 &= y_0 + \frac{h}{24}\left(9y'_0 + 19y'_1 - 5y'_2 + y'_3\right) - \frac{19h^5}{720}\,y^{(v)} \\[4pt]
y_2 &= y_0 + \frac{h}{3}\left(y'_0 + 4y'_1 + y'_2\right) - \frac{h^5}{90}\,y^{(v)} \\[4pt]
y_3 &= y_0 + \frac{3h}{8}\left(y'_0 + 3y'_1 + 3y'_2 + y'_3\right) - \frac{3h^5}{80}\,y^{(v)}.
\end{aligned} \tag{2.13.27}
\]
The philosophy of using a matched pair of Adams formulas for propagation of the solution, together with the starter formulas shown above, has the potential of being implemented in a computer program in which the user could specify the desired precision by giving the value of p. The program could then calculate from the formulas above the coefficients in the predictor-corrector pair and the starting method, and then proceed with the integration. This would make a very versatile code.
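For instance, the p = 3 starting formulas (2.13.27), with the error terms dropped, can be solved by exactly the iteration described above. A Python sketch of ours, for y′ = −y:

```python
import math

def starting_values(f, y0, h, sweeps=60):
    """Solve the p = 3 starting system (2.13.27), error terms dropped,
    for y1, y2, y3 by straight fixed-point iteration."""
    y1 = y2 = y3 = y0                      # initial guesses
    for _ in range(sweeps):
        f0 = f(0.0, y0)
        f1, f2, f3 = f(h, y1), f(2*h, y2), f(3*h, y3)
        y1 = y0 + h/24 * (9*f0 + 19*f1 - 5*f2 + f3)
        y2 = y0 + h/3 * (f0 + 4*f1 + f2)
        y3 = y0 + 3*h/8 * (f0 + 3*f1 + 3*f2 + f3)
    return y1, y2, y3

h = 0.1
ys = starting_values(lambda x, y: -y, 1.0, h)
for m, ym in enumerate(ys, start=1):
    assert abs(ym - math.exp(-m * h)) < 1e-6   # fifth-order accurate starts
```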
Chapter 3
Numerical linear algebra
3.1 Vector spaces and linear mappings
In this chapter we will study numerical methods for the solution of problems in linear
algebra, that is to say, of problems that involve matrices or systems of linear equations.
Among these we mention the following:
(a) Given an n × n matrix, calculate its determinant.
(b) Given m linear algebraic equations in n unknowns, find the most general solution of the system, or discover that it has none.
(c) Invert an n × n matrix, if possible.
(d) Find the eigenvalues and eigenvectors of an n × n matrix.
As usual, we will be very much concerned with the development of efficient software
that will accomplish the above purposes.
We assume that the reader is familiar with the basic constructs of linear algebra: vec-
tor space, linear dependence and independence of vectors, Euclidean n-dimensional space,
spanning sets of vectors, basis vectors. We will quickly review some additional concepts
that will be helpful in our work. For a more complete discussion of linear algebra, see any
of the references cited at the end of this chapter.
And now, to business. The first major concept we need is that of a linear mapping.
Let V and W be two vector spaces over the real numbers (we’ll stick to the real numbers
unless otherwise specified). We say that T is a linear mapping from V to W if T associates
with every vector v of V a vector Tv of W (so T is a mapping) in such a way that
\[
T(\alpha v' + \beta v'') = \alpha Tv' + \beta Tv'' \tag{3.1.1}
\]
for all vectors v′, v′′ of V and real numbers (or scalars) α and β (i.e., T is linear). Notice that the “+” signs are different on the two sides of (3.1.1). On the left we add two vectors of V, on the right we add two vectors of W.
Here are a few examples of linear mappings.
First, let V and W both be the same, namely the space of all polynomials of some given degree n. Consider the mapping that associates with a polynomial f of V its derivative Tf = f′ in W. It's easy to check that this mapping is linear.
Second, suppose V is Euclidean two-dimensional space (the plane) and W is Euclidean
three-dimensional space. Let T be the mapping that carries the vector (x, y) of V to the
vector (3x +2y, x −y, 4x +5y) of W. For instance, T(2, −1) = (4, 3, 3). Then T is a linear
mapping.
More generally, let A be a given m × n matrix of real numbers, let V be Euclidean n-dimensional space and let W be Euclidean m-space. The mapping T that carries a vector x of V into Ax of W is a linear mapping. That is, any matrix generates a linear mapping between two appropriately chosen (to match the dimensions of the matrix) vector spaces.
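For instance, the mapping of the second example above corresponds to a 3 × 2 matrix, and applying the matrix is applying the mapping. A short Python check (ours):

```python
# the 3 x 2 matrix of the mapping T(x, y) = (3x + 2y, x - y, 4x + 5y)
A = [[3, 2],
     [1, -1],
     [4, 5]]

def apply_matrix(matrix, vec):
    """The linear mapping generated by a matrix: v -> matrix * v."""
    return tuple(sum(row[j] * vec[j] for j in range(len(vec))) for row in matrix)

assert apply_matrix(A, (2, -1)) == (4, 3, 3)   # matches T(2, -1) in the text
```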
The importance of studying linear mappings in general, and not just matrices, comes
from the fact that a particular mapping can be represented by many different matrices.
Further, it often happens that problems in linear algebra that seem to be questions about
matrices, are in fact questions about linear mappings. This means that we can change to
a simpler matrix that represents the same linear mapping before answering the question,
secure in the knowledge that the answer will be the same. For example, if we are given a
square matrix and we want its determinant, we seem to confront a problem about matrices.
In fact, any of the matrices that represent the same mapping will have the same determinant
as the given one, and making this kind of observation and identifying simple representatives
of the class of relevant matrices can be quite helpful.
To get back to the matter at hand, suppose the vector spaces V and W are of dimensions m and n, respectively. Then we can choose in V a basis of m vectors, say e₁, e₂, ..., e_m, and in W there is a basis of n vectors f₁, f₂, ..., f_n. Let T be a linear mapping from V to W. Then we have the situation that is sketched in Figure 3.1 below.
We claim now that the action of T on every vector of V is known if we know only its effect on the m basis vectors of V. Indeed, suppose we know Te₁, Te₂, ..., Te_m. Then let x be any vector in V. Express x in terms of the basis of V,
\[
x = \alpha_1 e_1 + \alpha_2 e_2 + \cdots + \alpha_m e_m. \tag{3.1.2}
\]
Now apply T to both sides and use the linearity of T (extended, by induction, to linear combinations of more than two vectors) to obtain
\[
Tx = \alpha_1(Te_1) + \alpha_2(Te_2) + \cdots + \alpha_m(Te_m). \tag{3.1.3}
\]
The right side is known, and the claim is established.
So, to describe a linear mapping, "all we have to do" is describe its action on a set of basis vectors of V. If e_i is one of these, then Te_i is a vector in W. As such, Te_i can be written as a linear combination of the basis vectors of W. The coefficients of this linear combination will evidently depend on i, so we write
\[
Te_i = \sum_{j=1}^{n} t_{ji}\, f_j, \qquad i = 1, \ldots, m. \tag{3.1.4}
\]
[Figure 3.1: The action of a linear mapping. T carries the basis vectors e₁, e₂, ..., e_m of V to vectors in W, which has basis f₁, f₂, ..., f_n.]
Now the mn numbers t_{ji}, i = 1, ..., m, j = 1, ..., n, together with the given sets of basis vectors for V and W, are enough to describe the linear operator T completely. Indeed, if we know all of those numbers, then by (3.1.4) we know what T does to every basis vector of V, and then by (3.1.3) we know the action of T on every vector of V.
To summarize, an n × m matrix (t_{ji}) represents a linear mapping T from a vector space V with a distinguished basis E = {e₁, e₂, ..., e_m}, to a vector space W with a distinguished basis F = {f₁, f₂, ..., f_n}, in the sense that from a knowledge of (t, E, F) we know the full mapping T.
Next, suppose once more that T is a linear mapping from V to W. Since T is linear, it is easy to see that T carries the 0 vector of V into the 0 vector of W. Consider the set of all vectors of V that are mapped by T into the zero vector of W. This set is called the kernel of T, and is written ker(T). Thus
\[
\ker(T) = \{\, x \in V \mid Tx = 0_W \,\}. \tag{3.1.5}
\]
Now ker(T) is not just a set of vectors; it is itself a vector space, that is, a vector subspace of V. Indeed, one sees immediately that if x and y belong to ker(T) and α and β are scalars, then
\[
T(\alpha x + \beta y) = \alpha T(x) + \beta T(y) = 0, \tag{3.1.6}
\]
so αx + βy belongs to ker(T) also.
Since ker(T) is a vector space, we can speak of its dimension. If ν = dim ker(T), then ν is called the nullity of the mapping T.
Consider also the set of vectors w of W that are of the form w = Tv for some vector v ∈ V (possibly many such v's exist). This set is called the image of T, and is written
\[
\mathrm{im}(T) = \{\, w \in W \mid w = Tv,\ v \in V \,\}. \tag{3.1.7}
\]
Once again, we remark that im(T) is more than just a set of vectors; it is in fact a vector subspace of W, since if w′ and w′′ are both in im(T), and if α and β are scalars, then we have w′ = Tv′ and w′′ = Tv′′ for some v′, v′′ in V. Hence
\[
\alpha w' + \beta w'' = \alpha Tv' + \beta Tv'' = T(\alpha v' + \beta v'') \tag{3.1.8}
\]
so αw′ + βw′′ lies in im(T), too.
The dimension of the vector (sub)space im(T) is called the rank of the mapping T.
A celebrated theorem of Sylvester asserts that
rank(T) + nullity(T) = dim(V ). (3.1.9)
By the rank of a matrix A we mean any of the following:
(a) the maximum number of linearly independent rows that we can find in A
(b) same for columns
(c) the largest number r for which there is an r × r nonzero sub-determinant in A (i.e., a set of r rows and r columns, not necessarily consecutive, such that the r × r submatrix that they determine has a nonzero determinant).
It is true that the rank of a linear mapping T from V to W is equal to the rank of any matrix A that represents T with respect to some pair of bases of V and W.
It is also true that the rank of a matrix is not computable unless infinite-precision arithmetic is used. In fact, the 2 × 2 matrix
\[
\begin{bmatrix} 1 & 1 \\ 1 & 1 + 10^{-20} \end{bmatrix} \tag{3.1.10}
\]
has rank 2, but if the [2,2] entry is changed to 1, the rank becomes 1. No computer program will be able to tell the difference between these two situations unless it is doing arithmetic to at least 21 digits of precision. Therefore, unless our programs do exact arithmetic on rational numbers, or do finite field arithmetic, or whatever, the rank will be uncomputable. What we are saying, really, is just that the rank of a matrix is not a continuous function of the matrix entries.
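In ordinary double-precision floating point the two matrices are not merely hard to tell apart; they are stored identically, as this Python check (ours) shows:

```python
# in IEEE double precision, 1 + 1e-20 rounds back to exactly 1, so the
# rank-2 matrix (3.1.10) collapses into the rank-1 matrix of all ones
a = [[1.0, 1.0],
     [1.0, 1.0 + 1e-20]]
assert a[1][1] == 1.0

# the determinant, which would distinguish rank 2 from rank 1, is exactly 0
det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
assert det == 0.0
```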
Now we are ready to consider one of the most important problems of numerical linear
algebra: the solution of simultaneous linear equations.
Let A be a given m × n matrix, let b be a given column vector of length m, and consider the system
\[
Ax = b \tag{3.1.11}
\]
of m linear simultaneous equations in n unknowns x₁, x₂, ..., x_n.
Consider the set of all solution vectors x of (3.1.11). Is it a vector space? That’s right,
it isn’t, unless b happens to be the zero vector (why?).
Suppose then that we intend to write a computer program that will in some sense present
as output all solutions x of (3.1.11). What might the output look like?
If b = 0, i.e., if the system is homogeneous, there is no difficulty, for in that case the
solution set is ker(A), a vector space, and we can describe it by printing out a set of basis
vectors of ker(A).
If the right-hand side vector b is not 0, then consider any two solutions x′ and x′′ of (3.1.11). Then Ax′ = b and Ax′′ = b, and by subtraction, A(x′ − x′′) = 0. Hence x′ − x′′ belongs to ker(A), so if e₁, e₂, ..., e_ν are a basis for ker(A), then x′ − x′′ = α₁e₁ + α₂e₂ + ⋯ + α_ν e_ν.
If b ≠ 0, then, we can describe all possible solutions of (3.1.11) by printing out one particular solution x′, and a list of the basis vectors of ker(A), because then all solutions are of the form
\[
x = x' + \alpha_1 e_1 + \alpha_2 e_2 + \cdots + \alpha_\nu e_\nu. \tag{3.1.12}
\]
Therefore, a computer program that alleges that it solves (3.1.11) should print out a basis for the kernel of A together with, in case b ≠ 0, any one particular solution of the system. We will see how to accomplish this in the next section.
Exercises 3.1
1. Show by examples that for every n, the rank of a given n × n matrix A is a discontinuous function of the entries of A.
2. Consider the vector space V of all polynomials in x of degree at most 2. Let T be the
linear mapping that sends each polynomial to its derivative.
(a) What is the rank of T?
(b) What is the image of T?
(c) For the basis {1, x, x²} of V, find the 3 × 3 matrix that represents T.
(d) Regard T as a mapping from V to the space W of polynomials of degree 1. Use the basis given in part (c) for V, and the basis {1, x − 1} for W, and find the 2 × 3 matrix that represents T with respect to these bases.
(e) Check that the ranks of the matrices in (c) and (d) are equal.
3. Let T be a linear mapping of Euclidean 3-dimensional space to itself. Suppose T takes
the vector (1, 1, 1) to (1, 2, 3), and T takes (1, 0, −1) to (2, 0, 1) and T takes (3, −1, 0)
to (1, 1, 2). Find T(1, 2, 4).
4. Let A be an n × n matrix with integer entries having absolute values at most M.
What is the maximum number of binary digits that could be needed to represent all
of the elements of A?
5. If T acts on the vector space of polynomials of degree at most n according to T(f) = f′ − 3f, find ker(T), im(T), and rank(T).
6. Why isn't the solution set of (3.1.11) a vector space if b ≠ 0?
7. Let a_{ij} = r_i s_j for i, j = 1, ..., n. Show that the rank of A is at most 1.
8. Suppose that the matrix
\[
A = \begin{bmatrix} 0 & 1 & 1 \\ 1 & 0 & -1 \\ -2 & 1 & 1 \end{bmatrix} \tag{3.1.13}
\]
represents a certain linear mapping from V to V with respect to the basis
\[
\{(1, 0, 0),\ (0, 1, 0),\ (1, 1, 1)\} \tag{3.1.14}
\]
of V. Find the matrix that represents the same mapping with respect to the basis
\[
\{(0, 0, 1),\ (0, 1, 1),\ (1, 1, 1)\}. \tag{3.1.15}
\]
Check that the determinant is unchanged.
9. Find a system of linear equations whose solution set consists of the vector (1, 2, 0)
plus any linear combination of (−1, 0, 1) and (0, 1, 0).
10. Construct two sets of two equations in two unknowns such that
(a) their coefficient matrices differ by at most 10^{−12} in each entry, and
(b) their solutions differ by at least 10^{+12} in each entry.
3.2 Linear systems
The method that we will use for the computer solution of m linear equations in n unknowns
will be a natural extension of the familiar process of Gaussian elimination. Let’s begin with
a little example, say of the following set of two equations in three unknowns:
\[
\begin{aligned}
x + y + z &= 2 \\
x - y - z &= 5.
\end{aligned} \tag{3.2.1}
\]
If we subtract the second equation from the first, then the two equations can be written
\[
\begin{aligned}
x + y + z &= 2 \\
2y + 2z &= -3.
\end{aligned} \tag{3.2.2}
\]
We divide the second equation by 2, and then subtract it from the first, getting
\[
\begin{aligned}
x &= \tfrac{7}{2} \\
y + z &= -\tfrac{3}{2}.
\end{aligned} \tag{3.2.3}
\]
The value of z can now be chosen arbitrarily, and then x and y will be determined. To
make this more explicit, we can rewrite the solution in the form
\[
\begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} \tfrac{7}{2} \\ -\tfrac{3}{2} \\ 0 \end{bmatrix} + z \begin{bmatrix} 0 \\ -1 \\ 1 \end{bmatrix}. \tag{3.2.4}
\]
In (3.2.4) we see clearly that the general solution is the sum of a particular solution
(7/2, −3/2, 0), plus any multiple of the basis vector (0, −1, 1) for the kernel of the coef-
ficient matrix of (3.2.1).
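A direct check, in exact rational arithmetic (a Python sketch of ours), confirms that every member of the family (3.2.4) satisfies both equations of (3.2.1):

```python
from fractions import Fraction

def solution(z):
    """The general solution (3.2.4): the particular solution
    (7/2, -3/2, 0) plus z times the kernel basis vector (0, -1, 1)."""
    return Fraction(7, 2), Fraction(-3, 2) - z, z

for z in (Fraction(0), Fraction(1), Fraction(-5, 3)):
    x, y, w = solution(z)
    assert x + y + w == 2     # first equation of (3.2.1)
    assert x - y - w == 5     # second equation of (3.2.1)
```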
The calculation can be compactified by writing the numbers in a matrix and omitting the names of the unknowns. A vertical line in the matrix will separate the left sides and the right sides of the equations. Thus, the original system (3.2.1) is
\[
\left[\begin{array}{ccc|c} 1 & 1 & 1 & 2 \\ 1 & -1 & -1 & 5 \end{array}\right]. \tag{3.2.5}
\]
Now we do row(2) := row(1) − row(2) and we have
\[
\left[\begin{array}{ccc|c} 1 & 1 & 1 & 2 \\ 0 & 2 & 2 & -3 \end{array}\right]. \tag{3.2.6}
\]
Next row(2) := row(2)/2, and then row(1) := row(1) − row(2), bring us to the final form
\[
\left[\begin{array}{ccc|c} 1 & 0 & 0 & \tfrac{7}{2} \\ 0 & 1 & 1 & -\tfrac{3}{2} \end{array}\right] \tag{3.2.7}
\]
which is the matrix equivalent of (3.2.3).
Now we will step up to a slightly larger example, to see some of the situations that
can arise. We won’t actually write the numerical values of the coefficients, but we’ll use
asterisks instead. So consider the three equations in five unknowns shown below.
\[
\left[\begin{array}{ccccc|c} * & * & * & * & * & * \\ * & * & * & * & * & * \\ * & * & * & * & * & * \end{array}\right]. \tag{3.2.8}
\]
The first step is to create a 1 in the extreme upper left corner, by dividing the first row
through by the [1,1] element. We will assume for the moment that the various numbers
that we want to divide by are not zero. Later on we will take extensive measures to assure
this.
After we divide the first row by the [1,1] element, we use the 1 in the upper left-hand corner to zero out the entries below it in column 1. That is, we let t = a₂₁ and then do
\[
\mathrm{row}(2) := \mathrm{row}(2) - t \cdot \mathrm{row}(1). \tag{3.2.9}
\]
Then let t = a₃₁ and do
\[
\mathrm{row}(3) := \mathrm{row}(3) - t \cdot \mathrm{row}(1). \tag{3.2.10}
\]
The result is that we now have the matrix
\[
\left[\begin{array}{ccccc|c} 1 & * & * & * & * & * \\ 0 & * & * & * & * & * \\ 0 & * & * & * & * & * \end{array}\right]. \tag{3.2.11}
\]
We pause for a moment to consider what we’ve done, in terms of the original set of
simultaneous equations. First, to divide a row of the matrix by a number corresponds
to dividing an equation through by the same number. Evidently this does not change
the solutions of the system of equations. Next, to add a constant multiple of one row to
another in the matrix corresponds to adding a multiple of one equation to another, and this
also doesn’t affect the solutions. Finally, in terms of the linear mapping that the matrix
represents, what we are doing is changing the sets of basis vectors, keeping the mapping
fixed, in such a way that the matrix that represents the mapping becomes a bit more
acceptable to our taste.
Now in (3.2.11) we divide through the second row by a₂₂ (again blissfully assuming that a₂₂ is not zero) to create a 1 in the [2,2] position. Then we use that 1 to create zeroes (just one zero in this case) below it in the second column by letting t = a₃₂ and doing
\[
\mathrm{row}(3) := \mathrm{row}(3) - t \cdot \mathrm{row}(2). \tag{3.2.12}
\]
Finally, we divide the third row by the [3,3] element to obtain
\[
\left[\begin{array}{ccccc|c} 1 & * & * & * & * & * \\ 0 & 1 & * & * & * & * \\ 0 & 0 & 1 & * & * & * \end{array}\right]. \tag{3.2.13}
\]
This is the end of the so-called forward solution, the first phase of the process of obtaining
the general solution.
Again, let’s think about the system of equations that is represented here. What is
special about them is that the first unknown does not appear in the second equation, and
neither the first nor the second unknown appears in the third equation.
To finish the solution of such a system of equations we would use the third equation to express x_3 in terms of x_4 and x_5, then the second equation would give us x_2 in terms of x_4 and x_5, and finally the first equation would yield x_1, also expressed in terms of x_4 and x_5.

Hence, we would say that x_4 and x_5 are free, and that the others are determined by them. More precisely, we should say that the kernel of A has a two-dimensional basis.
Let’s see how all of this will look if we were to operate directly on the matrix (3.2.13).
The second phase of the solution, that we are now beginning, is called the backwards substitution, because we start with the last equation and work backwards.
First we use the 1 in the [3,3] position to create zeros in the third column above that 1.
To do this we let t = a_{23} and then we do

    row(2) := row(2) − t ∗ row(3).    (3.2.14)

Then we let t = a_{13} and set

    row(1) := row(1) − t ∗ row(3)    (3.2.15)
resulting in
    [ 1 ∗ 0 ∗ ∗ ∗ ]
    [ 0 1 0 ∗ ∗ ∗ ]
    [ 0 0 1 ∗ ∗ ∗ ]    (3.2.16)
Observe that now x_3 does not appear in the equations before it. Next we use the 1 in the [2,2] position to create a zero in column 2 above that 1 by letting t = a_{12} and

    row(1) := row(1) − t ∗ row(2).    (3.2.17)
Of course, none of our previously constructed zeros gets wrecked by this process, and we
have arrived at the reduced echelon form of the original system of equations
    [ 1 0 0 ∗ ∗ ∗ ]
    [ 0 1 0 ∗ ∗ ∗ ]
    [ 0 0 1 ∗ ∗ ∗ ]    (3.2.18)
We need to be more careful about the next step, so it’s time to use numbers instead of
asterisks. For instance, suppose that we have now arrived at
    [ 1 0 0 a_{14} a_{15} a_{16} ]
    [ 0 1 0 a_{24} a_{25} a_{26} ]
    [ 0 0 1 a_{34} a_{35} a_{36} ]    (3.2.19)
Each of the unknowns is expressible in terms of x_4 and x_5:

    x_1 = a_{16} − a_{14} x_4 − a_{15} x_5
    x_2 = a_{26} − a_{24} x_4 − a_{25} x_5
    x_3 = a_{36} − a_{34} x_4 − a_{35} x_5
    x_4 = x_4
    x_5 = x_5.    (3.2.20)
This means that we have found the general solution of the given system by finding a
particular solution and a pair of basis vectors for the kernel of A. They are, respectively,
the vector and the two columns of the matrix shown below:
    [ a_{16} ]        [ a_{14}  a_{15} ]
    [ a_{26} ]        [ a_{24}  a_{25} ]
    [ a_{36} ]   ,    [ a_{34}  a_{35} ]    (3.2.21)
    [   0    ]        [  −1      0    ]
    [   0    ]        [   0     −1    ]
As this shows, we find a particular solution by filling in extra zeros in the last column
until its length matches the number (five in this case) of unknowns. We find a basis matrix
(i.e., a matrix whose columns are a basis for the kernel of A) by extending the fourth and
fifth columns of the reduced row echelon form of A with −I, where I is the identity matrix
whose size is equal to the nullity of the system, in this case 2.
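The recipe just stated can be sketched as a short Python fragment (an illustration only, with 0-based indexing; the function name `general_solution` and its calling convention are invented for this sketch). Given the rows of the reduced echelon form (3.2.19), it pads the last column with zeros to get a particular solution, and extends each free column with the −I block of (3.2.21) to get a kernel basis:

```python
def general_solution(rred, n, rank):
    """rred: the `rank` nonzero rows of the reduced echelon form,
    each of length n+1 with the right-hand side in the last position.
    Returns (particular solution, list of kernel basis vectors)."""
    nfree = n - rank
    # particular solution: last column, padded with zeros to length n
    partic = [row[n] for row in rred] + [0.0] * nfree
    # one kernel basis vector per free column
    basis = []
    for j in range(rank, n):
        v = [row[j] for row in rred] + [0.0] * nfree
        v[j] = -1.0                      # the -I block of (3.2.21)
        basis.append(v)
    return partic, basis

# Two equations in three unknowns, already in reduced form:
#   x1 + 2*x3 = 5,   x2 + 3*x3 = 6
p, B = general_solution([[1.0, 0.0, 2.0, 5.0],
                         [0.0, 1.0, 3.0, 6.0]], 3, 2)
# p is a particular solution; each vector in B is in the kernel.
```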
It’s time to deal with the case where one of the ∗’s that we divide by is actually a zero.
In fact we will have to discuss rather carefully what we mean by zero. In numerical work on
computers, in the presence of rounding errors, it is unreasonable to expect a 0 to be exactly
zero. Instead we will set a certain threshold level, and numbers that are smaller than that
will be declared to be zero. The hard part will be the determination of the right threshold,
but let’s postpone that question for a while, and make the convention that in this and the
following sections, when we speak of matrix entries being zero, we will mean that their size
is below our current tolerance level.
With that understanding, suppose we are carrying out the row reduction of a certain
system, and we’ve arrived at a stage like this:
    [ 1 ∗ ∗ ∗ ∗ ∗ ]
    [ 0 1 ∗ ∗ ∗ ∗ ]
    [ 0 0 0 ∗ ∗ ∗ ]
    [ 0 0 ∗ ∗ ∗ ∗ ]
    [ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ]    (3.2.22)
Normally, the next step would be to divide by a_{33}, but it is zero. This means that x_3 happens not to appear in the third equation. However, x_3 might appear in some later equation.
If so, we can renumber the equations so that later equation becomes the third equation,
and continue the process. In the matrix, this means that we would exchange two rows, so
as to bring a nonzero entry into the [3,3] position, and continue.
It is possible, though, that x_3 does not appear in any later equation. Then all entries a_{i3} = 0 for i ≥ 3. Then we could ask for some other unknown x_j for j > 3 that does appear
in some equation later than the third. In the matrix, this amounts to searching through
the whole rectangle that lies “southeast” of the [3,3] position, extending over to, but not
beyond, the vertical line, to find a nonzero entry, if there is one.
If a_{ij} is such a nonzero entry, then we want next to bring a_{ij} into the pivot position [3,3]. We can do this in two steps. First we exchange rows 3 and i (interchange two equations). Second, exchange columns 3 and j (interchange the numbers of the unknowns, so that x_3 becomes x_j and x_j becomes x_3). We must remember somehow that we renumbered the unknowns, so we’ll be able to recognize the answers when we see them. The calculation can now proceed from the rejuvenated pivot element in the [3,3] position.
Else, it may happen that the rectangle southeast of [3,3] consists entirely of zeros, like
this:
    [ 1 ∗ ∗ ∗ ∗ ∗ ]
    [ 0 1 ∗ ∗ ∗ ∗ ]
    [ 0 0 0 0 0 ∗ ]
    [ 0 0 0 0 0 ∗ ]
    [ 0 0 0 0 0 ∗ ]
    [ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ]    (3.2.23)
What then? We’re finished with the forward solution. The equations from the third onwards
have only zeros on their left-hand sides. If any solutions at all exist, then those equations
had better have only zeros on their right-hand sides also. The last three ∗’s in the last
column (and all the entries below them) must all be zeros, or the calculation halts with the
announcement that the input system was inconsistent (i.e., has no solutions). If the system
is consistent, then of course we ignore the final rows of zeros, and we do the backwards
solution just as in the preceding case.
It follows that in all cases, whether there are more equations than unknowns, fewer, or
the same number, the backwards solution always begins with a matrix that has a diagonal
of 1’s stretching from top to bottom (the bottom may have moved up, though!), with only
zero entries below the 1’s on the diagonal.
Speaking in theoretical, rather than practical terms for a moment, the number of nonzero
rows in the coefficient matrix at the end of the forward solution phase is the rank of the
matrix. Speaking practically again, this number simply represents a number of rows beyond
which we cannot do any more reductions because the matrix entries are all indistinguishable
from zero. Perhaps a good name for it is pseudorank. The pseudorank should be thought
of then, not as some interesting property of the matrix that we have just computed, but as
the number of rows we were able to reduce before roundoff errors became overwhelming.
Exercises 3.2
In problems 1–5, solve each system of equations by transforming the coefficient matrix
to reduced echelon form. To “solve” a system means either to show that no solution exists
or to find all possible solutions. In the latter case, exhibit a particular solution and a basis
for the kernel of the coefficient matrix.
1.
2x − y + z = 6
3x + y + 2z = 3
7x − y + 4z = 15
8x + y + 5z = 12
(3.2.24)
2.
x + y + z + q = 0
x − y − z − q = 0
(3.2.25)
3.
x + y + z = 3
3x − y − 2z = −1
5x + y = 7
(3.2.26)
4.
3x + u + v + w + t = 1
x − u + 2v + w − 3t = 2
(3.2.27)
5.
x + 3y − z = 4
2x − y + 2z = 6
x + y + z = 6
3x − y − z = −2
(3.2.28)
6. Construct a set of four equations in four unknowns, of rank two.
7. Construct a set of four equations in three unknowns, with a unique solution.
8. Construct a system of homogeneous equations that has a three-dimensional vector
space of solutions.
9. Under precisely what conditions is the set of all solutions of the system Ax = b a
vector space?
10. Given a set of m vectors in n space, describe an algorithm that will extract a maximal
subset of linearly independent vectors.
11. Given a set of m vectors, and one more vector w, describe an algorithm that will
decide whether or not w is in the vector subspace spanned by the given set.
3.3 Building blocks for the linear equation solver
Let’s now try to amplify some of the points raised by the informal discussion of the procedure
for solving linear equations, with a view towards the development of a formal algorithm.
First, let’s deal with the fact that a diagonal element might be zero (in the fuzzy sense
defined in the previous section) at the time when we want to divide by it.
Consider the moment when, while carrying out the forward solution, we have made i − 1 1’s down the diagonal, all the entries below the 1’s are zeros, and next we want to put a 1 into the [i, i] position and use that 1 as a pivot to reduce the entries below it to zeros.
Previously we had said that this could be done by dividing through the ith row by the
[i, i] element, unless that element is zero, in which case we carry out a search for some
nonzero element in the rectangle that lies southeast of [i, i] in the matrix. After careful
analysis, it turns out that an even more conservative approach is better: the best procedure
consists in searching through the entire southeast rectangle, whether or not the [i, i] element
is zero, to find the largest matrix element in absolute value.
If this complete search is done every time, whether or not the [i, i] element is zero, then
it develops that the sizes of the matrix elements do not grow very much as the algorithm
unfolds, and the growth of the numerical errors is also kept to a minimum.
If that largest element found in the rectangle is, say, a
uv
, then we bring a
uv
into the pivot
position [i, i] by interchanging rows u and i (renumbering the equations) and interchanging
columns v and i (renumbering the unknowns). Then we proceed as before.
It may seem wasteful to make a policy of carrying out a complete search of the rectangle
whenever we are ready to find the next pivot element, and especially even if a nonzero
element already occupies the pivot position anyway, without a search, but it turns out that
the extra labor is well rewarded with optimum numerical accuracy and stability.
If we are solving m equations in n unknowns, then we need to carry along an extra array of length n. Let’s call it τ_j, j = 1, . . . , n. This array will keep a record of the column interchanges that we do as we do them, so that in the end we will be able to identify the output. Initially, we put τ_j = j for j = 1, . . . , n. If at a certain moment we are about to interchange, say, the pth column and the qth column, then we will also interchange the entries τ_p and τ_q. At all times then, τ_j will hold the number of the column where the current jth column really belongs.
It must be noted that there is a fundamental difference between the interchange of rows
and the interchange of columns. An interchange of two rows corresponds simply to listing
the equations that we are trying to solve in a slightly different sequence, but has no effect on
the solutions. On the other hand, an interchange of two columns amounts to renumbering
two of the unknowns. Hence we must keep track of the column interchanges while we are
doing them, so we’ll be able to tell which unknown is which at output time, but we don’t
need to record row interchanges.
At the end of the calculation then, the output arrays will have to be shuffled. The reader might want to think about how to carry out that rearrangement, and we will return to it in section 3.6 under the heading of “to unscramble the eggs”.
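A tiny illustration of this bookkeeping (in Python with 0-based indexing; the function name `swapcol` is invented for this sketch, not one of the procedures defined below): the τ entries are swapped together with the columns, so each one always records where its column really belongs.

```python
def swapcol(M, tau, p, q):
    """Interchange columns p and q of M, and the corresponding
    entries tau[p] and tau[q] of the permutation record."""
    for row in M:
        row[p], row[q] = row[q], row[p]
    tau[p], tau[q] = tau[q], tau[p]

tau = [0, 1, 2]                 # initially tau[j] = j
M = [[1.0, 2.0, 3.0]]
swapcol(M, tau, 0, 2)
# Column 0 of the working matrix is now really column 2 of the
# original, and tau[0] records exactly that.
```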
The next item to consider is that we would like our program to be able to solve not
just one system Ax = b, but several systems of simultaneous equations, each of the form
Ax = b, where the left-hand sides are all the same, but the right-hand sides are different.
The data for our program will therefore be an m by n+p matrix whose first n columns will
contain the coefficient matrix A and whose last p columns will be the p different right-hand
side vectors b.
Why are we allowing several different right sides? Some of the main customers for our
program will be matrices A whose inverses we want to calculate. To find, say, the first
column of the inverse of A we want to solve Ax = b, where b is the first column of the
identity matrix. For the second column of the inverse, b would be the second column of
the identity matrix, and so on. Hence, to find A^{−1}, if A is an n × n matrix, we must solve n systems of simultaneous equations each having the same left-hand side A, but with n different right-hand side vectors.
It is convenient to solve all n of these systems at once because the reduction that we
apply to A itself to bring it into reduced echelon form is useful in solving all n of these
systems, and we avoid having to repeat that part of the job n times. Thus, for matrix
inversion, and for other purposes too, it is very handy to have the capability of solving
several systems with a common left-hand side at once.
The next point concerns the linear array τ that we are going to use to keep track of the
column interchanges. Instead of storing it in its own private array, it’s easier to adjoin it
to the matrix A that we’re working on, as an extra row, for then when we interchange two
columns we will automatically interchange the corresponding elements of τ, and thereby
avoid separate programming.
This means that the full matrix that we will be working with in our program will be (m + 1) × (n + p) if we are solving p systems of m equations in n unknowns with p right-hand sides. In the program itself, let’s call this matrix C. So C will be thought of as being
partitioned into blocks of sizes as shown below:
    C = [ A : m × n     RHS : m × p ]
        [ τ : 1 × n       0 : 1 × p ]    (3.3.1)
Now a good way to begin the writing of a program such as the general-purpose matrix
analysis program that we now have in mind is to consider the different procedures, or
modules, into which it may be broken up. We suggest that the individual blocks that we
are about to discuss should be written as separate subroutines, each with its own clearly
defined input and output, each with its own documentation, and each with its own local
variable names. They should then be tested one at a time, by giving them small, suitable
test problems. If this is done, then the main routine won’t be much more than a string of
calls to the various blocks.
1. Procedure searchmat(C,r,s,i1,j1,i2,j2)
This routine is given an r × s array C, and two positions in the matrix, say [i_1, j_1] and [i_2, j_2]. It then carries out a search of the rectangular submatrix of C whose northwest corner is at [i_1, j_1] and whose southeast corner is at [i_2, j_2], inclusive, in order to find an element of largest absolute value that lives in that rectangle. The subroutine returns this element of largest magnitude, as big, and the row and column in which it lives, as iloc, jloc.
Subroutine searchmat will be called in at least two different places in the main routine.
First, it will do the search for the next pivot element in the southeast rectangle. Second, it
can be used to determine if the equations are consistent by searching the right-hand sides
of equations r + 1, . . . , m (r is the pseudorank) to see if they are all zero (i.e., below our
tolerance level).
2. Procedure switchrow(C,r,s,i,j,k,l)
The program is given an r × s matrix C, and four integers i, j, k and l. The subroutine
interchanges rows i and j of C, between columns k and l inclusive, and returns a variable
called sign with a value of −1, unless i = j, in which case it does nothing to C and returns
a +1 in sign.
3. Procedure switchcol(C,r,s,i,j,k,l)
This subroutine is like the previous one, except it interchanges columns i and j of C,
between rows k and l inclusive. It also returns a variable called sign with a value of −1,
unless i = j, in which case it does nothing to C and returns a +1 in sign.
The subroutines switchrow and switchcol are used during the forward solution in the
obvious way, and again after the back solution has been done, to unscramble the output
(see procedure 5 below).
4. Procedure pivot(C,r,s,i,k,u)
Given the r × s matrix C, and three integers i, k and u, the subroutine assumes that C_{ii} = 1. It then stores C_{ki} in the local variable t, sets C_{ki} to zero, and reduces row k of C, in columns u to s, by doing the operation C_{kq} := C_{kq} − t ∗ C_{iq} for q = u, . . . , s.
The use of the parameter u in this subroutine allows the flexibility for economical operation in both the forward and back solution. In the forward solution, we take u = i + 1
and it reduces the whole row k. In the back solution we use u = n + 1, because the rest of
row k will have already been reduced.
5. Procedure scale(C,r,s,i,u)
Given an r × s matrix C and integers i and u, the routine stores C_{ii} in a variable called piv. It then does C_{ij} := C_{ij}/piv for j = u, . . . , s and returns the value of piv.
6. Procedure scramb(C,r,s,n)
This procedure permutes the first n rows of the r × s matrix C according to the permutation that occupies the positions C_{r1}, C_{r2}, . . . , C_{rn} on input.
The use of this subroutine is explained in detail in section 3.6 (q.v.). Its purpose is to
rearrange the rows of the output matrix that holds a basis for the kernel, and also the rows of
the output matrix that holds particular solutions of the given system(s). After rearrangement
the rows will correspond to the original numbering of the unknowns, thereby compensating
for the renumbering that was induced by column interchanges during the forward solution.
This subroutine poses some interesting questions if we require that it should not use any
additional array space beyond the input matrix itself.
7. Procedure ident(C,r,s,i,j,n,q)
This procedure inserts q times the n × n identity matrix into the n × n submatrix whose Northwest corner is at position [i, j] of the r × s matrix C.
Now let’s look at the assembly of these building blocks into a complete matrix analysis
procedure called matalg(C,r,s,m,n,p,opt,eps). Input items to it are:
• An r × s matrix C (as well as the values of r and s), whose Northwest m × n submatrix contains the matrix of coefficients of the system(s) of equations that are about to be solved. The values of m and n must also be provided to the procedure. It is assumed that r = 1 + max(m, n). Unless the inverse of the coefficient matrix is wanted, the Northeast m × p submatrix of C holds p different right-hand side vectors for which we want solutions.
• The numbers r, s, m, n and p.
• A parameter opt that is equal to 1 if we want an inverse, equal to 2 if we want to see the determinant of the coefficient matrix (if square) as well as a basis for the kernel (if it is nontrivial) and a set of p particular solution vectors.
• A real parameter eps that is used to bound roundoff error.
Output items from the procedure matalg are:
• The pseudorank r
• The determinant det if m = n
• An n × r matrix basis, whose columns are a basis for the kernel of the coefficient matrix.
• An n × p matrix partic, whose columns are particular solution vectors for the given systems.
In case opt = 1 is chosen, the procedure will fill the last m columns and rows of C with an m × m identity matrix, set p = n = m, and proceed as before, leaving the inverse matrix
in the same place, on output.
Let’s remark on how the determinant is calculated. The reduction of the input matrix
to echelon form in the forward solution phase entails the use of three kinds of operations.
First we divide a row by a pivot element. Second, we multiply a row by a number and add
it to another row. Third, we exchange a pair of rows or columns.
The first operation divides the determinant by that same pivot element. The second
has no effect on the determinant. The third changes the sign of the determinant, at any
rate if the rows or columns are distinct. At the end of the forward solution the matrix is
upper triangular, with 1’s on the diagonal, hence its determinant is clearly 1.
What must have been the value of the determinant of the input matrix? Clearly it
must have been equal to the product of all the pivot elements that were used during the
reduction, together with a plus or minus sign from the row or column interchanges.
Hence, to compute the determinant, we begin by setting det to 1. Then, each time a new
pivot element is selected, we multiply det by it. Finally, whenever a pair of different rows
or columns are interchanged we reverse the sign of det. Then det holds the determinant
of the input matrix when the forward solution phase has ended.
Now we have described the basic modules out of which a general purpose program for
linear equations can be constructed. In the next section we are going to discuss the vexing
question of roundoff error and how to set the tolerance level below which entries are declared
to be zero. A complete formal algorithm that ties together all of these modules, with control
of rounding error, is given at the end of the next section.
Exercises 3.3
1. Make a test problem for the major program that you’re writing by tracing through a
solution the way the computer would:
Take one of the systems that appears at the end of section 3.2. Transform it to
reduced row echelon form step by step, being sure to carry out a complete search of
the Southeast rectangle each time, and to interchange rows and columns to bring the
largest element found into the pivot position. Record the column interchanges in τ,
as described above. Record the status of the matrix C after each major loop so you’ll
be able to test your program thoroughly and easily.
2. Repeat problem 1 on a system of each major type: inconsistent, unique solution, many
solutions.
3. Construct a formal algorithm that will invert a matrix, using no more array space
than the matrix itself. The idea is that the input matrix is transformed, a column at
a time, into the identity matrix, and the identity matrix is transformed, a column at
a time, into the inverse. Why store all of the extra columns of the identity matrix?
(Good luck!)
4. Show that a matrix A is of rank one if and only if its entries are of the form A_{ij} = f_i g_j for all i and j.
5. Show that the operation row(i) := c ∗ row(j) + row(i) applied to a matrix A has the same effect as first applying that same operation to the identity matrix I to get a certain matrix E, and then computing EA.
6. Show that the operation of scaling row(i):

    a_{ik} := a_{ik} / a_{ii},    k = 1, . . . , n    (3.3.2)

has the same effect as first dividing the ith row of the identity matrix by a_{ii} to get a certain matrix E, and then computing EA.
7. Suppose we do a complete forward solution without ever searching or interchanging
rows or columns. Show that the forward solution amounts to discovering a lower
triangular matrix L and an upper triangular matrix U such that LA = U (think of L
as a product of several matrices E such as you found in the preceding two problems).
3.4 How big is zero?
The story of the linear algebra subroutine has just two pieces untold: the first concerns
how small we will allow a number to be without calling it zero, and the second concerns
the rearrangement of the output to compensate for interchanges of rows and columns that
are done during the row-echelon reduction.
The main reduction loop begins with a search of the rectangle that lies Southeast of the
pivot position [i, i], in order to locate the largest element that lives there and to use it for
the next pivot. If that element is zero, the forward solution halts because the remaining
pivot candidates are all zero.
But “how zero” do they have to be? Certainly it would be too much to insist, when
working with sixteen decimal digits, that a number should be exactly equal to zero. A
little more natural would be to declare that any number that is no larger than the size of
the accumulated roundoff error in the calculation should be declared to be zero, since our
microscope lens would then be too clouded to tell the difference.
It is important that we should know how large roundoff error is, or might be. Indeed,
if we set too small a threshold, then numbers that “really are” zero will slip through, the
calculation will continue after it should have terminated because of unreliability of the
computed entries, and so forth. If the threshold is too large, we will declare numbers
to be zero that aren’t, and our numerical solution will terminate too quickly because the
computed matrix elements will be declared to be unreliable when really they are perfectly
OK.
The phenomenon of roundoff error occurs because of the finite size of a computer word.
If a word consists of d binary digits, then when two d-digit binary numbers are multiplied
together, the answer that should be 2d bits long gets rounded off to d bits when it is stored.
By doing so we incur a rounding error whose size is at most 1 unit in the (d + 1)st place.
Then we proceed to add that answer to other numbers with errors in them, and to
multiply, divide, and so forth, some large number of times. The accumulation of all of this
rounding error can be quite significant in an extended computation, particularly when a
good deal of cancellation occurs from subtraction of nearly equal quantities.
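For instance, in standard double precision the sum 1.0 + 10^{−15} cannot be stored exactly, and subtracting 1.0 back exposes the rounding (a two-line Python experiment; the variable names are ours):

```python
x = 1.0 + 1e-15       # the sum is rounded to the nearest machine number
diff = x - 1.0        # "should" be 1e-15, but carries the rounding error
# The relative error of diff is large, even though x itself
# is accurate to about one part in 10^16.
rel_err = abs(diff - 1e-15) / 1e-15
```

Here the subtraction is exact; the damage was done when the sum was rounded, and the cancellation merely brings it to the surface.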
The question is to determine the level of rounding error that is present, while the
calculation is proceeding. Then, when we arrive at a stage where the numbers of interest
are about the same size as the rounding errors that may be present in them, we had better
halt the calculation.
How can we estimate, during the course of a calculation, the size of the accumulated
roundoff error? There are a number of theoretical a priori estimates for this error, but in
any given computation these would tend to be overly conservative, and we would usually
terminate the calculation too soon, thinking that the errors were worse than they actually
were.
We prefer to let the computer estimate the error for us while it’s doing the calculation.
True, it will have to do more work, but we would rather have it work a little harder if the
result will be that we get more reliable answers.
Here is a proposal for estimating the accumulated rounding error during the progress
of a computation. This method was suggested by Professors Nijenhuis and Wilf. We carry
along an additional matrix of the same size as the matrix C, the one that has the coefficients
and right-hand sides of the equations that we are solving. In this extra matrix we are going
to keep estimates of the roundoff error in each of the elements of the matrix C.
In other words, we are going to keep two whole matrices, one of which will contain the
coefficients and the right-hand sides of the equations, and the other of which will contain
estimates of the roundoff error that is present in the elements of the first one.
At any time during the calculation that we want to know how reliable a certain matrix
entry is, we’ll need only to look at the corresponding entry of the error matrix to find out.
Let’s call this auxiliary matrix R (as in roundoff). Initially an element R_{ij} might be as large as 2^{−d} |C_{ij}| in magnitude, and of either sign. Therefore, to initialize the R matrix we choose a number uniformly at random in the interval [−2^{−d} |C_{ij}|, 2^{−d} |C_{ij}|], and store it in R_{ij} for each i and j. Hence, to begin with, the matrix R is set to randomly chosen values in the range in which the actual roundoff errors lie.
Then, as the calculation unfolds, we do arithmetic on the matrix C of two kinds. We
either scale a row by dividing it through by the pivot element, or we pivot a row against
another row. In each case let’s look at the effect that the operation has on the corresponding
roundoff error estimator in the R matrix.
In the first case, consider a scaling operation, in which a certain row is divided by the pivot element. Specifically, suppose we are dividing row i through by the element C_{ii}, and let R_{ii} be the corresponding entry of the error matrix. Then, in view of the fact that

    (C_{ij} + R_{ij}) / (C_{ii} + R_{ii}) = C_{ij}/C_{ii} + R_{ij}/C_{ii} − R_{ii} C_{ij}/C_{ii}^2
                                            + terms involving products of two or more errors    (3.4.1)
we see that the error entries R_{ij} in the row that is being divided through by C_{ii} should be computed as

    R_{ij} := R_{ij}/C_{ii} − R_{ii} C_{ij}/C_{ii}^2.    (3.4.2)
In the second case, suppose we are doing a pivoting operation on the kth row. Then for each column q we do the operation C_{kq} := C_{kq} − t ∗ C_{iq}, where t = C_{ki}. Now let’s replace C_{kq} by C_{kq} + R_{kq}, replace C_{iq} by C_{iq} + R_{iq} and replace t by t + t′ (where t′ = R_{ki}). Then substitute these expressions into the pivot operation above, and keep terms that are of first order in the errors (i.e., that do not involve products of two of the errors).
Then C_{kq} + R_{kq} is replaced by

    C_{kq} + R_{kq} − (t + t′) ∗ (C_{iq} + R_{iq}) = (C_{kq} − t ∗ C_{iq}) + (R_{kq} − t ∗ R_{iq} − t′ ∗ C_{iq})
                                                   = (new C_{kq}) + (new error R_{kq}).    (3.4.3)
It follows that as a result of the pivoting, the error estimator is updated as follows:

    R_{kq} := R_{kq} − C_{ki} ∗ R_{iq} − R_{ki} ∗ C_{iq}.    (3.4.4)
Equations (3.4.2) and (3.4.4) completely describe the evolution of the R matrix. It
begins life as random roundoff error; it gets modified along with the matrix elements whose
errors are being estimated, and in return, we are supplied with good error estimates of each
entry while the calculation proceeds.
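The two update rules might be sketched as follows (a Python illustration of (3.4.2) and (3.4.4) only, with 0-based indexing; these are not the scaler and pivotr procedures of the Maple program below, and the function names are ours). Note that each error row must be computed from the old values of C, before C itself is changed:

```python
def scale_update(C, R, i):
    """Update the error row R[i] per (3.4.2), then scale C's row i
    by its pivot C[i][i]."""
    piv, rpiv = C[i][i], R[i][i]
    R[i] = [r / piv - rpiv * c / piv**2 for c, r in zip(C[i], R[i])]
    C[i] = [c / piv for c in C[i]]

def pivot_update(C, R, i, k):
    """Update the error row R[k] per (3.4.4), then reduce C's row k
    against the (already scaled) row i."""
    t, terr = C[k][i], R[k][i]
    R[k] = [rk - t * ri - terr * ci
            for rk, ri, ci in zip(R[k], R[i], C[i])]
    C[k] = [ck - t * ci for ck, ci in zip(C[k], C[i])]

C = [[2.0, 4.0], [1.0, 3.0]]
R = [[0.0, 0.1], [0.0, 0.0]]     # illustrative error estimates
scale_update(C, R, 0)            # row 0 becomes [1, 2]
pivot_update(C, R, 0, 1)         # row 1 becomes [0, 1]
# The error 0.1 in C[0][1] has propagated (scaled and sign-flipped)
# into R[1][1].
```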
Before each scaling and pivoting sequence we will need to update the R matrix as
described above. Then, when we search the now-famous Southeast rectangle for the new
pivot element we accept it if it is larger in absolute value than its corresponding roundoff
estimator, and otherwise we declare the rectangle to be identically zero and halt the forward
solution.
The R matrix is also used to check the consistency of the input system. At the end of
the forward solution all rows of the coefficient matrix from a certain row onwards are filled
with zeros, in the sense that the entries are below the level of their corresponding roundoff
estimator. Then the corresponding right-hand side vector entries should also be zero in
the same sense, else as far as the algorithm can tell, the input system was inconsistent.
With typical ambiguity of course, this means either that the input system was “really”
inconsistent, or just that rounding errors have built up so severely that we cannot decide
on consistency, and continuation of the “solution” would be meaningless.
Algorithm matalg(C,r,s,m,n,p,opt,eps). The algorithm operates on the matrix C, which is of dimension r × s, where r = max(m, n) + 1. It solves p systems of m equations in n unknowns, unless opt = 1, in which case it will calculate the inverse of the matrix in the first m = n rows and columns of C.
matalg:=proc(C,r,s,m,n,p,opt,eps)
local R,i,j,Det,Done,ii,jj,Z,k,psrank;
# if opt = 1 that means inverse is expected
if opt=1 then ident(C,r,s,1,n+1,n,1) fi;
# initialize random error matrix
R:=matrix(r,s,(i,j)->0.000000000001*(rand()-500000000000)*eps*C[i,j]);
# set the permutation record tau (stored in row r) to the identity
for j from 1 to n do C[r,j]:=j od;
# begin forward solution
Det:=1; Done:=false; i:=0;
while ((i<m) and not(Done))do
# find largest in SE rectangle
Z:=searchmat(C,r,s,i+1,i+1,m,n);ii:=Z[1][1]; jj:=Z[1][2];
if abs(Z[2])>abs(R[ii,jj]) then
i:=i+1;
# switch rows
Det:=Det*switchrow(C,r,s,i,ii,i,s);
Z:=switchrow(R,r,s,i,ii,i,s);
# switch columns
Det:=Det*switchcol(C,r,s,i,jj,1,r);
Z:=switchcol(R,r,s,i,jj,1,r);
# divide by pivot element
Z:=scaler(C,R,r,s,i,i);
Det:=Det*scale(C,r,s,i,i);
for k from i+1 to m do
# reduce row k against row i
Z:=pivotr(C,R,r,s,i,k,i+1);
Z:=pivot(C,r,s,i,k,i+1);
od;
else Done:=true fi;
od;
psrank:=i;
# end forward solution; begin consistency check
if psrank<m then
Det:=0;
for j from 1 to p do
# check that right hand sides are 0 for i>psrank
Z:=searchmat(C,r,s,psrank+1,n+j,m,n+j);
if abs(Z[2])>abs(R[Z[1][1],Z[1][2]]) then
printf("Right hand side %d is inconsistent",j);
return;
fi;
od;
fi;
# equations are consistent, do back solution
3.4 How big is zero? 101
for j from psrank to 2 by -1 do
for i from 1 to j-1 do
Z:=pivotr(C,R,r,s,j,i,psrank+1);
Z:=pivot(C,r,s,j,i,psrank+1);
C[i,j]:=0; R[i,j]:=0;
od;
od;
# end back solution, insert minus identity in basis
if psrank<n then
# fill bottom of basis matrix with -I
Z:=ident(C,r,s,psrank+1,psrank+1,n-psrank,-1);
# fill under right-hand sides with zeroes
for i from psrank+1 to n do for j from n+1 to s do C[i,j]:=0 od od;
# fill under R matrix with zeroes
for i from psrank+1 to n do for j from n-psrank to s do R[i,j]:=0 od od;
fi;
# permute rows prior to output
Z:=scramb(C,r,s,n);
# copy row r of C to row r of R
for j from 1 to n do R[r,j]:=C[r,j] od;
Z:=scramb(R,r,s,n);
return(Det,psrank,evalm(R));
end;
If the procedure terminates successfully, it returns a list containing three items: the first
is the determinant (if there is one), the second is the pseudorank of the coefficient matrix,
and the third is the matrix of estimated roundoff errors. The matrix C (which is called by
name in the procedure, which means that the input matrix is altered by the procedure) will
contain a basis for the kernel of the coefficient matrix in columns psrank + 1 to n, and p
particular solution vectors, one for each input right-hand side, in columns n + 1 to n +p.
Two new sub-procedures are called by this procedure, namely scaler and pivotr.
These are called immediately before the action of scale or pivot, respectively, and their
mission is to update the R matrix in accordance with equations (3.4.2) or (3.4.4) to take
account of the impending scaling or pivoting operation.
Exercises 3.4
1. Break the forward solution process off from the complete algorithm above. State it
formally as algorithm forwd, list its global variables, and describe precisely its effect
on them. Do the same for the backwards solution.
2. When the program runs, it gives the solutions and their roundoff error estimates.
Work out an elegant way to print the answers and the error estimates. For instance,
there’s no point in giving 12 digits of roundoff error estimate. That’s too much. Just
print the number of digits of the answers that can be trusted. How would you do
that? Write subroutine prnt that will carry it out.
3. Suppose you want to re-run a problem, with a different set of random numbers in
the roundoff matrix initialization. How would you do that? Run one problem three
or four times to see how sensitive the roundoff estimates are to the choice of random
values that start them off.
4. Show your program to a person who is knowledgeable in programming, but who is
not one of your classmates. Ask that person to use your program to solve some set of
three simultaneous equations in five unknowns.
Do not answer any questions verbally about how the program works or how to use it.
Refer all such questions to your written documentation that accompanies the program.
If the other person is able to run your program and understand the answers, award
yourself a gold medal in documentation. Otherwise, improve your documentation and
let the person try again. When successful, try again on someone else.
5. Select two vectors f, g of length 10 by choosing their elements at random. Form the
10 × 10 matrix of rank 1 whose elements are f_i g_j. Do this three times and add the
resulting matrices to get a single 10 × 10 matrix of rank three.
Run your program on the coefficient matrix you just constructed, in order to see if
the program is smart enough to recognize a matrix of rank three when it sees one, by
halting the forward solution with pseudorank = 3.
Repeat the above experiment 50 times, and tabulate the frequencies with which your
program “thought” that the 10 × 10 matrix had various ranks.
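A Python sketch of this experiment may help before wiring it into matalg. The function and threshold names below are invented for illustration, and a fixed tolerance eps stands in for the text's per-entry roundoff matrix R:

```python
# Build a 10 x 10 matrix of rank 3 as a sum of three random outer products
# f g^T, then estimate its rank by forward elimination with full pivoting
# and a size threshold eps (a crude stand-in for the roundoff matrix R).
import random

def rank_estimate(M, eps=1e-9):
    A = [row[:] for row in M]      # work on a copy
    n = len(A)
    r = 0
    for _ in range(n):
        # find the largest entry in the southeast rectangle
        piv, pi, pj = 0.0, -1, -1
        for i in range(r, n):
            for j in range(r, n):
                if abs(A[i][j]) > abs(piv):
                    piv, pi, pj = A[i][j], i, j
        if abs(piv) <= eps:        # everything left is "zero"
            break
        A[r], A[pi] = A[pi], A[r]                 # row exchange
        for row in A:
            row[r], row[pj] = row[pj], row[r]     # column exchange
        for k in range(r + 1, n):                 # clear below the pivot
            f = A[k][r] / A[r][r]
            for j in range(r, n):
                A[k][j] -= f * A[r][j]
        r += 1
    return r

n = 10
M = [[0.0] * n for _ in range(n)]
for _ in range(3):                 # add three rank-one pieces
    f = [random.random() for _ in range(n)]
    g = [random.random() for _ in range(n)]
    for i in range(n):
        for j in range(n):
            M[i][j] += f[i] * g[j]
print(rank_estimate(M))            # almost always 3
```

Running this many times, as the exercise asks, shows how reliably the threshold test recognizes the rank.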
3.5 Operation count
With any numerical algorithm it is important to know how much work is involved in carrying
it out. In this section we are going to estimate the labor involved in solving linear systems
by the method of the previous sections.
Let’s recognize two kinds of labor: arithmetic operations, and other operations, both as
applied to elements of the matrix. The arithmetic operations are +, −, ×, ÷, all lumped
together, and by other operations we mean comparisons of size, movement of data, and
other operations performed directly on the matrix elements. Of course there are many
“other operations,” not involving the matrix elements directly, such as augmenting counters,
testing for completion of loops, etc., that go on during the reduction of the matrix, but the
two categories above represent a good measure of the work done. We’re not going to include
the management of the roundoff error matrix R in our estimates, because its effect would
be simply to double the labor involved. Hence, remember to double all of the estimates of
the labor that we are about to derive if you’re using the R matrix.
We consider a generic stage in the forward solution where we have been given m
equations in n unknowns with p right-hand sides, and during the forward solution phase we
have just arrived at the [i, i] element.
The next thing to do is to search the Southeast rectangle for the largest element. The
rectangle contains about (m−i)(n−i) elements, hence the search requires that many
comparisons.
Then we exchange two rows (n+p−i operations), exchange two columns (m operations),
and divide a row by the pivot element (n+p−i arithmetic operations).
Next, for each of the m−i−1 rows below the ith, and for each of the n+p−i elements of
one of those rows, we do two arithmetic operations when we do the elementary row operation
that produces a zero in the ith column. This requires, therefore, 2(n+p−i)(m−i−1)
arithmetic operations.
For the forward phase of the solution, therefore, we count

A_f = Σ_{i=1}^{r} { 2(n+p−i)(m−i−1) + (n+p−i) }   (3.5.1)

arithmetic operations altogether, where r is the pseudorank of the matrix, because the
forward solution halts after the rth row with only zeros below.
The non-arithmetic operations in the forward phase amount to

N_f = Σ_{i=1}^{r} { (m−i)(n−i) + (n+p−i) + m }.   (3.5.2)
Let’s leave these sums for a while, and go to the backwards phase of the solution. We
do the columns in reverse order, from column r back to 1, and when we have arrived at a
generic column j, we want to create zeroes in all of the positions above the 1 in the [j, j]
position.
To do this we perform the elementary row operation

row(i) := row(i) − A_ij · row(j)   (3.5.3)

to each of the j−1 rows above row j. Let i be the number of one of these rows. Then,
exactly how many elements of row(i) are acted upon by the elementary row operation above?
Certainly the elements in row(i) that lie in columns 1 through j−1 are unaffected, because
only zero elements are in row(j) below them, thanks to the forward reduction process.
Furthermore, the elements in row(i) that lie in columns j+1 through r are unaffected,
for a different reason. Indeed, any such entry is already zero, because it lies above an entry
of 1 in some diagonal position that has already had its turn in the back solution (remember
that we’re doing the columns in the sequence r, r−1, . . . , 1). Not only is such an entry zero,
but it remains zero, because the entry of row(j) below it is also zero, having previously
been deleted by the action of a diagonal element below it.
Hence in row(i), the elements that are affected by the elementary row operation (3.5.3)
are those that lie in columns j, r+1, . . . , n, n+1, . . . , n+p (be sure to write [or modify!]
the program so that the row reduction (3.5.3) acts only on those columns!). We have now
shown that exactly n+p−r+1 entries of each row above row(j) are affected (note that
the number is independent of j and i), so

A_b = Σ_{j=1}^{r} (n+p−r+1)(j−1)   (3.5.4)

arithmetic operations are done during the back solution, and no other operations.
It remains only to do the various sums, and for this purpose we recall that

Σ_{i=1}^{N} i = N(N+1)/2        Σ_{i=1}^{N} i^2 = N(N+1)(2N+1)/6.   (3.5.5)
Then it is straightforward to find the total number of arithmetic operations from A_f + A_b
as

Arith(m, n, p, r) = r^3/6 − (2m+n+p−5)·r^2/2 + ((n+p)(2m−5/2) − m + 1/3)·r   (3.5.6)

and the total of the non-arithmetic operations from N_f as

NonArith(m, n, p, r) = r^3/3 − (m+n)·r^2/2 + (6mn + 3m + 3n + 6p − 2)·r/6.   (3.5.7)
Let’s look at a few important special cases. First, suppose we are solving one system
of n equations in n unknowns that has a unique solution. Then we have m = n = r and
p = 1. We find that

Arith(n, n, 1, n) = (2/3)·n^3 + O(n^2)   (3.5.8)

where O(n^2) refers to some function of n that is bounded by a constant times n^2 as n grows
large. Similarly, for the non-arithmetic operations on matrix elements we find (1/3)·n^3 + O(n^2)
in this case.
It follows that a system of n equations can be solved for about one third of the price,
in terms of arithmetic operations, of one matrix multiplication, at least if matrices are
multiplied in the usual way (did you know that there is a faster way to multiply two
matrices? We will see one later on).
Now what is the price of a matrix inversion by this method? Then we are solving n
systems of n equations in n unknowns, all with the same left-hand side. Hence we have
r = m = n = p, and we find that

Arith(n, n, n, n) = (13/6)·n^3 + O(n^2).   (3.5.9)
Hence we can invert a matrix by this method for about the same price as solving 3.25
systems of equations! At first glance, it may seem as if the cost should be n times as great
because we are solving n systems. The great economy results, of course, from the common
left-hand sides.
The cost of the non-arithmetic operations remains at (1/3)·n^3 + O(n^2).
If we want only the determinant of a square matrix A, or want only the rank of A then
we need to do only the forward solution, and we can save the cost of the back solution. We
leave it to the reader to work out the cost of a determinant, or of finding the rank.
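The leading term of (3.5.8) can be checked empirically by instrumenting a bare-bones elimination routine and counting arithmetic operations on matrix elements. The Python sketch below is deliberately simpler than matalg (no pivot search, no column exchanges, no R matrix), so only the n^3 term is comparable:

```python
# Count the arithmetic operations (+, -, *, /) performed on matrix elements
# while solving one n x n system by forward elimination and back substitution,
# and compare the leading-order count with (2/3) n^3 from equation (3.5.8).
# This sketch omits pivot search, column exchanges, and the roundoff matrix R.
import random

def solve_and_count(n):
    A = [[random.random() for _ in range(n + 1)] for _ in range(n)]  # augmented
    ops = 0
    # forward elimination: divide the pivot row, then clear below it
    for i in range(n):
        piv = A[i][i]
        for j in range(i, n + 1):
            A[i][j] /= piv
            ops += 1
        for k in range(i + 1, n):
            f = A[k][i]
            for j in range(i, n + 1):
                A[k][j] -= f * A[i][j]
                ops += 2               # one multiply, one subtract
    # back substitution: clear above each pivot (right-hand side only)
    for j in range(n - 1, 0, -1):
        for i in range(j):
            A[i][n] -= A[i][j] * A[j][n]
            ops += 2
    return ops

n = 60
ops = solve_and_count(n)
print(ops / n**3)    # approaches 2/3 as n grows
```

The operation count here is deterministic (it does not depend on the random entries), so the ratio can be tabulated for several n to watch the O(n^2) term die away.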
3.6 To unscramble the eggs
Now we have reached the last of the issues that needs to be discussed in order to plan a
complete linear equation solving routine, and it concerns the rearrangement of the output
so that it ends up in the right order.
During the operation of the forward solution algorithm we found it necessary to inter-
change rows and columns so that the largest element of the Southeast rectangle was brought
into the pivot position. As we mentioned previously, we don’t need to keep a record of the
row interchanges, because they correspond simply to solving the equations in a different
sequence. We must remember the column interchanges that occur along the way though,
because each time we do one of them we are, in effect, renumbering the unknowns.
To remember the column interchanges we glue onto our array C an additional row, just
for bookkeeping purposes. Its elements are called τ_j, j = 1, . . . , n, and it is kept at the
bottom of the matrix. More precisely, the elements of {τ_j} are the first n entries of the new
last row of the matrix, where the row contains n+p entries altogether, the last p of which
are not used (refer to (3.3.1) to see the complete partitioning of the matrix C).
Now suppose we have arrived at the end of the back solution, and the answers to the
original question are before us, except that they are scrambled. Here’s an example of the
kind of situation that might result:
[ 1  0  0  a  b  c ]
[ 0  1  0  d  e  f ]
[ 0  0  1  g  h  k ]
[ .  .  .  .  .  . ]
[ 3  5  2  1  4  * ]   (3.6.1)

The matrix above represents schematically the reduced row echelon form in a problem
where there are five unknowns (n = 5), the pseudorank r = 3, just one right-hand side
vector is given (p = 1), and the permutations that were carried out on the columns are
recorded in the array τ = [3, 5, 2, 1, 4], shown in the last row of the matrix as it would be
stored in a computation.
The question now is, how do we express the general solution of the given set of equations?
To find the answer, let’s go back to the set of equations that (3.6.1) stands for. The first of
these is
x_3 = c − a·x_1 − b·x_4   (3.6.2)

because the numbering of the unknowns is as shown in the τ array. The next two equations
are

x_5 = f − d·x_1 − e·x_4
x_2 = k − g·x_1 − h·x_4.   (3.6.3)
If we add the two trivial equations x_1 = x_1 and x_4 = x_4, then we get the whole solution
vector which, after re-ordering the equations, can be written as

[ x_1 ]   [ 0 ]            [ −1 ]            [  0 ]
[ x_2 ]   [ k ]            [  g ]            [  h ]
[ x_3 ] = [ c ]  + (−x_1)· [  a ]  + (−x_4)· [  b ].   (3.6.4)
[ x_4 ]   [ 0 ]            [  0 ]            [ −1 ]
[ x_5 ]   [ f ]            [  d ]            [  e ]
Now we are looking at a display of the output as we would like our subroutine to give
it. The three vectors on the right side of (3.6.4) are, respectively, a particular solution of
the given system of equations, and the two vectors of a basis for the kernel of the coefficient
matrix.
The question can now be rephrased: exactly what operations must be done to the matrix
shown in (3.6.1) that represents the situation at the end of the back solution, in order to
obtain the three vectors in (3.6.4)?
The first things to do are, as we have previously noted, to append the negative of a 2 × 2
identity matrix to the bottom of the fourth and fifth columns of (3.6.1), and to lengthen
the last column on the right by appending two more zeros. That brings us to the matrix
[ 1  0  0   a   b   c ]
[ 0  1  0   d   e   f ]
[ 0  0  1   g   h   k ]
[          −1   0   0 ]
[           0  −1   0 ]
[ 3  5  2   1   4   * ]   (3.6.5)
The first two of the three long columns above will be the basis for the kernel, and the last
column above will be the particular solution, but only after we do the right rearrangement.
Now here is the punch line: the right rearrangement to do is to permute the rows of
those three long columns as described by the permutation τ.
That means that the first row becomes the third, the second row becomes the fifth, the
third row becomes the second, the fourth row is the new first, and the old fifth row is the
new fourth. The reader is invited to carry out on the rows the interchanges just described,
and to compare the result with what we want, namely with (3.6.4). It will be seen that we
have gotten the desired result.
The point that is just a little surprising is that to undo the column interchanges that
are recorded by τ, we do row interchanges. Just roughly, the reason for this is that we
begin by wanting to solve Ax = b, and instead we end up solving (AE)y = b, where E is
a matrix obtained from the identity by elementary column operations. Evidently, x = Ey,
which means that we must perform row operations on y to recover the answers in the right
order.
Now we can leave the example above, and state the rule in general. We are given p
systems of m simultaneous equations each, all having a common m × n coefficient matrix
A, in n unknowns. At the end of the back solution we will have before us a matrix of the
form

[ I(r, r)   B(r, n−r)   P(r, p) ]   (3.6.6)

where I(r, r) is the r × r identity matrix, r is the pseudorank of A, and B and P are matrices
of the sizes shown.
We adjoin under B the negative of the (n−r) × (n−r) identity matrix, and under P
we adjoin an (n−r) × p block of zeros. Next, we forget the identity matrix on the left,
and we consider the entire remaining n × (n−r+p) matrix as a whole; call it T, say. Now
we exchange the rows of T according to the permutation array τ. Precisely, row 1 of T will
be row τ_1 of the new T, row 2 will be row τ_2, and so on. Conceptually, we should regard the
old T and the new T as occupying different areas of storage, so that the new T is just a
rearrangement of the rows of the old.
Now the first n−r columns of the new T are a basis for the kernel of A, and should be
output as such, and the jth one of the last p columns of the new T is a particular solution
of the jth one of the input systems of equations, and should be output as such.
Although conceptually we should think of the old T and the new T as occupying distinct
arrays in memory, in fact it is perfectly possible to carry out the whole row interchange
procedure described above in just one array, the one that holds T, without ever “stepping
on our own toes,” so let’s consider that problem.
Suppose a linear array a = [a_1, . . . , a_n] is given, along with a permutation array
τ = [τ_1, . . . , τ_n]. We want to rearrange the entries of the array a according to the permutation τ
without using any additional array storage. Thus the present array entry a_1 will end up as the
output a_{τ_1}, the initial a_2 will end up as a_{τ_2}, etc.
To do this with no extra array storage, let’s first pick up the element a_1 and move it to
a_{τ_1}, being careful to store the original a_{τ_1} in a temporary location t so it won’t be destroyed.
Next we move the contents of t to its destination, and so forth. After a certain number of
steps (maybe only 1), we will be back to a_1.
Here’s an example to help clarify the situation. Suppose the arrays a and τ at input time
were:

a = [5, 7, 13, 9, 2, 8]
τ = [3, 4, 5, 2, 1, 6].   (3.6.7)

So we move the 5 in position a_1 to position a_3 (after putting the 13 into a safe place), and
then the 13 goes to position a_5 (after putting the 2 into a safe place) and the 2 is moved
into position a_1, and we’re back where we started. The a array now has become

a = [2, 7, 5, 9, 13, 8].   (3.6.8)
The job, however, is not finished. Somehow we have to recognize that the elements a_2,
a_4 and a_6 haven’t yet been moved, while the others have been moved to their destinations.
For this purpose we will flag the array positions. A convenient place to hang a flag is in
the sign position of an entry of the array τ, since we’re sure that the entries of τ are all
supposed to be positive. Therefore, initially we’ll change the signs of all of the entries of τ
to make them negative. Then as elements are moved around in the a array we will reverse
the sign of the corresponding entry of the τ array. In that way we can always begin the
next block of entries of a to move by searching τ for a negative entry. When none exist, the
job is finished.
Here’s a complete algorithm, in Maple:
shuffle:=proc(a,tau,n) local i,j,t,q,u,v;
# permutes the entries of a according to the permutation tau
#
# flag entries of tau with negative signs
  for i from 1 to n do tau[i]:=-tau[i] od;
  for i from 1 to n do
    # has entry i been moved?
    if tau[i]<0 then
      # move the block of entries beginning at a[i]
      t:=i; q:=-tau[i]; tau[i]:=q;
      u:=a[i]; v:=u;
      while q<>t do
        v:=a[q]; a[q]:=u; tau[q]:=-tau[q];
        u:=v; q:=tau[q];
      od;
      a[t]:=v;
    fi;
  od;
  return(1);
end;
The reader should carefully trace through the complete operation of this algorithm on
the sample arrays shown above. In order to apply the method to the linear equation solving
program, the entries C[r+1, i], i = 1, . . . , n are interpreted as τ_i, and the array a of length
n whose entries are going to be moved is one of the columns r+1, . . . , n+p of the matrix
C in rows 1, . . . , n.
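For readers who prefer to experiment outside Maple, here is a sketch of the same cycle-following idea in Python. It uses an explicit boolean flag array instead of negated τ entries (an assumption of this sketch, since Python lists make a separate flag array clearer), and 0-based indices:

```python
# Apply the permutation tau to the list a in place: a[i] moves to position
# tau[i] for every i. Cycles are followed one at a time, exactly as in the
# text's shuffle procedure, but "already moved" is recorded in a flag array
# rather than in the sign of tau.
def shuffle(a, tau):
    moved = [False] * len(a)
    for i in range(len(a)):
        if moved[i]:
            continue
        t, q, u = i, tau[i], a[i]
        while q != t:                 # follow the cycle starting at i
            a[q], u = u, a[q]         # drop u at its destination, pick up the old occupant
            moved[q] = True
            q = tau[q]
        a[t] = u                      # close the cycle
        moved[t] = True

a = [5, 7, 13, 9, 2, 8]
tau = [2, 3, 4, 1, 0, 5]   # the text's tau = [3,4,5,2,1,6], shifted to 0-based
shuffle(a, tau)
print(a)   # the fully shuffled array: [2, 9, 5, 7, 13, 8]
```

Note that (3.6.8) in the text shows the array after the first cycle only; the result printed here is the array after all three cycles have been processed.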
3.7 Eigenvalues and eigenvectors of matrices
Our next topic in numerical linear algebra concerns the computation of the eigenvalues and
eigenvectors of matrices. Until further notice, all matrices will be square. If A is n × n, by
an eigenvector of A we mean a vector x ≠ 0 such that

Ax = λx   (3.7.1)
where the scalar λ is called an eigenvalue of A. We say that the eigenvector x corresponds to,
or belongs to, the eigenvalue λ. We will see that in fact the eigenvalues of A are properties
of the linear mapping that A represents, rather than of the matrix A, so we can exploit
changes of basis in the computation of eigenvalues.
For an example, consider the 2 × 2 matrix

A = [  3  −1 ]
    [ −1   3 ].   (3.7.2)
If we write out the vector equation (3.7.1) for this matrix, it becomes the two scalar
equations

3x_1 − x_2 = λx_1
−x_1 + 3x_2 = λx_2.   (3.7.3)
These are two homogeneous equations in two unknowns, and therefore they have no solution
other than the zero vector unless the determinant

| 3−λ   −1  |
|  −1   3−λ |   (3.7.4)
is equal to zero. This condition yields a quadratic equation for λ whose two roots are λ = 2
and λ = 4. These are the two eigenvalues of the matrix (3.7.2).
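For a symmetric 2 × 2 matrix this computation amounts to one application of the quadratic formula to the characteristic polynomial λ^2 − (trace)λ + det. A quick Python sketch (the helper name is ours):

```python
# Eigenvalues of the symmetric matrix [[a, b], [b, d]] from its characteristic
# polynomial det(A - lambda I) = lambda^2 - (a + d) lambda + (a d - b^2) = 0.
import math

def eig2x2_sym(a, b, d):
    tr, det = a + d, a * d - b * b
    disc = math.sqrt(tr * tr - 4 * det)   # nonnegative for symmetric matrices
    return (tr - disc) / 2, (tr + disc) / 2

print(eig2x2_sym(3, -1, 3))   # (2.0, 4.0)
```

For the matrix (3.7.2): trace 6, determinant 8, so the roots are 2 and 4, in agreement with the text.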
For the same 2 × 2 example, let’s now find the eigenvectors (by a method that doesn’t
bear the slightest resemblance to the numerical method that we will discuss later). First, to
find the eigenvector that belongs to the eigenvalue λ = 2, we go back to (3.7.3) and replace
λ by 2 to obtain the two equations

x_1 − x_2 = 0
−x_1 + x_2 = 0.   (3.7.5)
These equations are, of course, redundant, since λ was chosen to make them so. They are
satisfied by any vector x of the form c · [1, 1], where c is an arbitrary constant. If we refer
back to the definition (3.7.1) of eigenvectors we notice that if x is an eigenvector then so is
cx, so eigenvectors are determined only up to constant multiples. The first eigenvector of
our 2 × 2 matrix is therefore any multiple of the vector [1, 1].
To find the eigenvector that belongs to the eigenvalue λ = 4, we return to (3.7.3), replace
λ by 4, and solve the equations. The result is that any scalar multiple of the vector [1, −1]
is an eigenvector corresponding to the eigenvalue λ = 4.
The two statements that [1, 1] is an eigenvector and that [1, −1] is an eigenvector can
either be written as two vector equations:

[  3  −1 ] [ 1 ]     [ 1 ]       [  3  −1 ] [  1 ]     [  1 ]
[ −1   3 ] [ 1 ] = 2 [ 1 ],      [ −1   3 ] [ −1 ] = 4 [ −1 ]   (3.7.6)

or as a single matrix equation

[  3  −1 ] [ 1   1 ]   [ 1   1 ] [ 2  0 ]
[ −1   3 ] [ 1  −1 ] = [ 1  −1 ] [ 0  4 ].   (3.7.7)
Observe that the matrix equation (3.7.7) states that AP = PΛ, where A is the given
2 × 2 matrix, P is a (nonsingular) matrix whose columns are eigenvectors of A, and Λ is the
diagonal matrix that carries the eigenvalues of A down the diagonal (in order corresponding
to the eigenvectors in the columns of P). This matrix equation AP = PΛ leads to one of the
many important areas of application of the theory of eigenvalues, namely to the computation
of functions of matrices.
Suppose we want to calculate A^2147, where A is the 2 × 2 matrix (3.7.2). A direct
calculation, by raising A to higher and higher powers, would take quite a while (although
not as long as one might think at first sight! Exactly what powers of A would you compute?
How many matrix multiplications would be required?).
A better way is to begin with the relation AP = PΛ and to observe that in this case
the matrix P is nonsingular, and so P has an inverse. Since P has the eigenvectors of
A in its columns, the nonsingularity of P is equivalent to the linear independence of the
eigenvectors. Hence we can write

A = PΛP^{-1}.   (3.7.8)
This is called the spectral representation of A, and the set of eigenvalues is often called the
spectrum of A.
Equation (3.7.8) is very helpful in computing powers of A. For instance

A^2 = (PΛP^{-1})(PΛP^{-1}) = PΛ^2 P^{-1},

and for every m, A^m = PΛ^m P^{-1}. It is of course quite easy to find high powers of the
diagonal matrix Λ, because we need only raise the entries on the diagonal to that power.
Thus for example,

A^2147 = [ 1   1 ] [ 2^2147     0    ] [ 1/2   1/2 ]
         [ 1  −1 ] [   0     4^2147 ] [ 1/2  −1/2 ].   (3.7.9)
Not only can we compute powers from the spectral representation (3.7.8), we can equally
well obtain any polynomial in the matrix A. For instance,

13A^3 + 78A^19 − 43A^31 = P(13Λ^3 + 78Λ^19 − 43Λ^31)P^{-1}.   (3.7.10)
Indeed if f is any polynomial, then

f(A) = P f(Λ) P^{-1}   (3.7.11)

and f(Λ) is easy to calculate because it just has the numbers f(λ_i) down the diagonal and
zeros elsewhere.
Finally, it’s just a short hop to the conclusion that (3.7.11) remains valid even if f is
not a polynomial, but is represented by an everywhere-convergent power series (we don’t
even need that much, but this statement suffices for our present purposes). So for instance,
if A is the above 2 × 2 matrix, then

e^A = P e^Λ P^{-1}   (3.7.12)

where e^Λ has e^2 and e^4 on its diagonal.
We have now arrived at a very important area of application of eigenvalues and
eigenvectors, to the solution of systems of differential equations. A system of n linear
simultaneous differential equations in n unknown functions can be written simply as y′ = Ay,
with, say, y(0) given as initial data. The solution of this system of differential equations is
y(t) = e^{At} y(0), where the matrix e^{At} is calculated by writing A = PΛP^{-1} if possible, and
then putting e^{At} = P e^{Λt} P^{-1}.
Hence, whenever we can find the spectral representation of a matrix A, we can calculate
functions of the matrix and can solve differential equations that involve the matrix.
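For the 2 × 2 example (3.7.2) this recipe can be carried out by hand and checked numerically. The Python sketch below hard-codes P, Λ and P^{-1} for that one matrix; it is not a general matrix exponential:

```python
# e^{At} = P e^{Lambda t} P^{-1} for A = [[3, -1], [-1, 3]], whose eigenvalues
# are 2 and 4 with P = [[1, 1], [1, -1]] and P^{-1} = [[1/2, 1/2], [1/2, -1/2]].
# The product P diag(e^{2t}, e^{4t}) P^{-1} is multiplied out by hand.
import math

def exp_At(t):
    e2, e4 = math.exp(2 * t), math.exp(4 * t)
    return [[(e2 + e4) / 2, (e2 - e4) / 2],
            [(e2 - e4) / 2, (e2 + e4) / 2]]

def solve(t, y0):
    """y(t) = e^{At} y(0) for the system y' = Ay."""
    M = exp_At(t)
    return [M[0][0] * y0[0] + M[0][1] * y0[1],
            M[1][0] * y0[0] + M[1][1] * y0[1]]

y = solve(0.5, [1.0, 0.0])   # y(0.5) for initial data y(0) = (1, 0)
```

A quick sanity check: at t = 0 the formula gives the identity matrix, and the derivative of y at 0 is A·y(0), as the differential equation demands.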
So, when can we find a spectral representation of a given n × n matrix A? If we can
find a set of n linearly independent eigenvectors for A, then all we need to do is to arrange
them in the columns of a new matrix P. Then P will be invertible, and we’ll be all finished.
Conversely, if we somehow have found a spectral representation of A à la (3.7.8), then the
columns of P obviously do comprise a set of n independent eigenvectors of A.
That changes the question. What kind of an n × n matrix A has a set of n linearly
independent eigenvectors? This is quite a hard problem, and we won’t answer it completely.
Instead, we give an example of a matrix that does not have as many independent eigenvectors
as it “ought to,” and then we’ll specialize our discussion to a kind of matrix that is
guaranteed to have a spectral representation.
For an example we don’t have to look any further than

A = [ 0  1 ]
    [ 0  0 ].   (3.7.13)
The reader will have no difficulty in checking that this matrix has just one eigenvalue, λ = 0,
and that corresponding to that eigenvalue there is just one independent eigenvector, and
therefore there is no spectral representation of this matrix.
Now first we’re going to devote our attention to the real symmetric matrices, i.e., to
matrices A for which A_ij = A_ji for all i, j = 1, . . . , n. These matrices occur in many
important applications, and they always have a spectral representation. Indeed, much more
is true, as is shown by the following fundamental theorem of the subject, whose proof
is deferred to section 3.9, where it will emerge (see Theorem 3.9.1) as a corollary of an
algorithm.
Theorem 3.7.1 (The Spectral Theorem) – Let A be an n × n real symmetric matrix. Then
the eigenvalues and eigenvectors of A are real. Furthermore, we can always find a set of n
eigenvectors of A that are pairwise orthogonal to each other (so they are surely independent).
Recall that the eigenvectors of the symmetric 2 × 2 matrix (3.7.2) were [1, 1] and [1, −1],
and these are indeed orthogonal to each other, though we didn’t comment on it at the time.
We’re going to follow a slightly unusual route now, that will lead us simultaneously to
a proof of the fundamental theorem (the “spectral theorem”) above, and to a very elegant
computer algorithm, called the method of Jacobi, for the computation of eigenvalues and
eigenvectors of real symmetric matrices.
In the next section we will introduce a very special family of matrices, first studied by
Jacobi, and we will examine their properties in some detail. Once we understand these
properties, a proof of the spectral theorem will appear, with almost no additional work.
Following that we will show how the algorithm of Jacobi can be implemented on a
computer, as a fast and pretty program in which all of the eigenvalues and eigenvectors
of a real symmetric matrix are found simultaneously, and are delivered to your door as an
orthogonal set.
Throughout these algorithms certain themes will recur. Specifically, we will see several
situations in which we have to compute a certain angle and then carry out a rotation of
space through that angle. Since the themes occur so often we are going to abstract from
them certain basic modules of algorithms that will be used repeatedly.
This choice will greatly simplify the preparation of programs, but at a price, namely
that each module will not always be exactly optimal in terms of machine time for execution
in each application, although it will be nearly so. Consequently it was felt that the price
was worth the benefit of greater universality. We’ll discuss these points further, in context,
as they arise.
3.8 The orthogonal matrices of Jacobi
A matrix P is called an orthogonal matrix if it is real, square, and if P^{-1} = P^T, i.e., if
P^T P = P P^T = I. If we visualize the way a matrix is multiplied by its transpose, it will be
clear that an orthogonal matrix is one in which each of the rows (columns) is a unit vector
and any two distinct rows (columns) are orthogonal to each other.
For example, the 2 × 2 matrix

[  cos θ   sin θ ]
[ −sin θ   cos θ ]   (3.8.1)

is an orthogonal matrix for every real θ.
We will soon prove that a real symmetric matrix always has a set of n pairwise orthogonal
eigenvectors. If we take such a set of vectors, normalize them by dividing each by its length,
and arrange them in the consecutive columns of a matrix P, then P will be an orthogonal
matrix, and further we will have AP = PΛ. Since P^T = P^{-1}, we can multiply on the right
by P^T and obtain

A = PΛP^T,   (3.8.2)

and this is the spectral theorem for a symmetric matrix A.
Conversely, if we can find an orthogonal matrix P such that P^T A P is a diagonal matrix
D, then we will have found a complete set of pairwise orthogonal eigenvectors of A (the
columns of P), and the eigenvalues of A (on the diagonal of D).
In this section we are going to describe a numerical procedure that will find such an
orthogonal matrix, given a real symmetric matrix A. As soon as we prove that the method
works, we will have proved the spectral theorem at the same time. Hence the method is
of theoretical as well as algorithmic importance. It is important to notice that we will
not have to find an eigenvalue, then find a corresponding eigenvector, then find another
eigenvalue and another vector, etc. Instead, the whole orthogonal matrix whose columns
are the desired vectors will creep up on us at once.
The first thing we have to do is to describe some special orthogonal matrices that will
be used in the algorithm. Let n, p and q be given positive integers, with n ≥ 2 and p ≠ q,
and let θ be a real number. We define the matrix J_pq(θ) by saying that J is just like the
n × n identity matrix except that in the four positions that lie at the intersections of rows
and columns p and q we find the entries (3.8.1).
More precisely, J_pq(θ) has in position [p, p] the entry cos θ, it has sin θ in the [p, q] entry,
−sin θ in the [q, p] entry, cos θ in entry [q, q], and otherwise it agrees with the identity
matrix, as shown below:
            [ 1  0  0    .      .     .    0  0 ]
            [ 0  1  0    .      .     .    0  0 ]
            [ .     .                         . ]
row p  →    [ 0  0  .  cos θ   .   sin θ  .  0 ]
            [ .          .     .     .       . ]
row q  →    [ 0  0  . −sin θ   .   cos θ  .  0 ]
            [ .                      .        . ]
            [ 0  0    .      .     .    .  1  0 ]
            [ 0  0    .      .     .    .  0  1 ]   (3.8.3)
Not only is J_pq(θ) an orthogonal matrix, there is a reasonably pleasant way to picture
its action on n-dimensional space. Since the 2 × 2 matrix of (3.8.1) is the familiar rotation
of the plane through an angle θ, we can say that the matrix J_pq(θ) carries out a special kind
of rotation of n-dimensional space, namely one in which a certain plane, the plane of the
pth and qth coordinates, is rotated through the angle θ, and the remaining coordinates are
all left alone. Hence J_pq(θ) carries out a two-dimensional rotation of n-dimensional space.
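To make the definition concrete, here is a small Python sketch (ours, not part of the book's Maple code; 0-based indices) that builds J_pq(θ) and checks that it is orthogonal:

```python
import math

def jacobi_J(n, p, q, theta):
    """Build J_pq(theta): the n-by-n identity, except for the 2-by-2
    rotation block in rows/columns p and q (0-based indices)."""
    J = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    c, s = math.cos(theta), math.sin(theta)
    J[p][p], J[p][q] = c, s
    J[q][p], J[q][q] = -s, c
    return J

def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

J = jacobi_J(5, 1, 3, 0.7)
JT = [list(col) for col in zip(*J)]   # transpose of J
I5 = matmul(J, JT)                    # should be the 5-by-5 identity
```

Since the only nontrivial block of J is a plane rotation, J·J^T reproduces the identity to machine precision.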
These matrices of Jacobi turn out to be useful in a host of numerical algorithms for the
eigenproblem. The first application that we’ll make of them will be to the real symmetric
matrices, but later we’ll find that the same two-dimensional rotations will play important
roles in the solution of non-symmetric problems as well.
First, let’s see how they can help us with symmetric matrices. What we propose to do
is the following. If a real symmetric matrix A is given, we will determine p, q, and the
angle θ in such a way that the matrix JAJ^T is a little bit more diagonal (whatever that
means!) than A is. It turns out that this can always be done, at any rate unless A is already
diagonal, so we will have the germ of a numerical procedure for computing eigenvalues and
eigenvectors.
Indeed, suppose we have found out how to determine such an angle θ, and let's then see
what the whole process would look like. Starting with A, we would find p, q, and θ, and
then the matrix JAJ^T is somehow a little more diagonal than A was. Now JAJ^T is still a
symmetric matrix (try to transpose it and see what happens) so we can do it again. After
finding another p, q and θ we will have J′J A (J′J)^T a bit “more diagonal” and so forth.
Now suppose that after some large number of repetitions of this process we find that
the current matrix is very diagonal indeed, so that perhaps aside from roundoff error it is
a diagonal matrix D. Then we will know that
D = (product of all J’s used)A(product of all J’s used)
T
. (3.8.4)
If we let P denote the product of all J's used, then we have PAP^T = D, so the rows
of P will be the (approximate) eigenvectors of A and the diagonal elements of D will be
its eigenvalues. The matrix P will automatically be an orthogonal matrix, since it is the
product of such matrices, and the product of orthogonal matrices is always orthogonal
(proof?).
That, at any rate, is the main idea of Jacobi’s method (he introduced it in order to
study planetary orbits!). Let’s now fill in the details.
First we’ll define what we mean by “more diagonal”. For any square, real matrix A, let
Od(A) denote the sum of the squares of the off-diagonal entries of A. From now on, instead
of “B is more diagonal than A,” we’ll be able to say Od(B) < Od(A), which is much more
professional.
Now we claim that if A is a real symmetric matrix, and it is not already a diagonal
matrix, then we can find p, q and θ such that Od(J_pq(θ) A J_pq(θ)^T) < Od(A). We'll do this
by a very direct computation of the elements of JAJ^T (we'll need them anyway for the
computer program), and then we will be able to see what the new value of Od is.
So fix p, q and θ. Then by direct multiplication of the matrix in (3.8.3) by A we find
that

    (JA)_ij =  (cos θ)A_pj + (sin θ)A_qj     if i = p
               −(sin θ)A_pj + (cos θ)A_qj    if i = q
               A_ij                          otherwise          (3.8.5)
Then after one more multiplication, this time on the right by the transpose of the matrix
in (3.8.3), we find that

    (JAJ^T)_ij =
        C·A_ip + S·A_iq                     if i ∉ {p, q}, j = p, or i = p, j ∉ {p, q}
        −S·A_ip + C·A_iq                    if i ∉ {p, q}, j = q, or i = q, j ∉ {p, q}
        C²·A_pp + 2SC·A_pq + S²·A_qq        if i = j = p
        S²·A_pp − 2SC·A_pq + C²·A_qq        if i = j = q
        CS(A_qq − A_pp) + (C² − S²)·A_pq    if i = p, j = q or i = q, j = p
        A_ij                                otherwise.
                                                                (3.8.6)

In (3.8.6) we have written C for cos θ and S for sin θ.
Now we are going to choose the angle θ so that the elements A_pq and A_qp are reduced
to zero, assuming that they were not already zero. To do this we refer to the formula in
(3.8.6) for the new A_pq, equate it to zero, and then solve for θ. The result is

    tan 2θ = 2A_pq / (A_pp − A_qq)          (3.8.7)

and we will choose the value of θ that lies between −π/4 and π/4.
With this value of θ, we will have reduced one single off-diagonal element of A to zero
in the new symmetric matrix JAJ^T.
The full Jacobi algorithm consists in repeatedly executing these plane rotations, each
time choosing the largest off-diagonal element A_pq and annihilating it by the choice (3.8.7)
of θ. After each rotation, Od(A) will be a little smaller than it was before. We will prove
that Od(A) converges to zero.

It is important to note that a plane rotation that annihilates A_pq may “revive” some
other A_rs that was set to zero by an earlier plane rotation. Hence we should not think of
the zero as “staying put”.
Let’s now see exactly what happens to Od(A) after a single rotation. If we sum the
squares of all of the off-diagonal elements of JAJ
T
using the formulas (3.8.6), but remem-
bering that the new A
pq
=0, then it’s quite easy to check that the new sum of squares is
exactly equal to the old sum of squares minus the squares of the two entries A
pq
and A
qp
that were reduced to zero. Hence we have
Theorem 3.8.1 Let A be an n × n real, symmetric matrix that is not diagonal. If A_pq ≠ 0
for some p ≠ q, then we can choose θ as in equation (3.8.7) so that if J = J_pq(θ) then

    Od(JAJ^T) = Od(A) − 2A_pq² < Od(A).          (3.8.8)
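Theorem 3.8.1 is easy to check numerically. The following sketch (in Python, with our own function names; the book's code is in Maple) applies one rotation with θ chosen by (3.8.7) and compares Od before and after:

```python
import math

def od(A):
    """The book's Od(A): sum of squares of the off-diagonal entries."""
    n = len(A)
    return sum(A[i][j] ** 2 for i in range(n) for j in range(n) if i != j)

def jacobi_step(A, p, q):
    """Return J A J^T with theta chosen by (3.8.7) so that the new
    (p, q) entry is zero (0-based indices)."""
    n = len(A)
    theta = 0.5 * math.atan2(2.0 * A[p][q], A[p][p] - A[q][q])
    c, s = math.cos(theta), math.sin(theta)
    B = [row[:] for row in A]
    for j in range(n):                      # rows p and q of J A
        B[p][j] = c * A[p][j] + s * A[q][j]
        B[q][j] = -s * A[p][j] + c * A[q][j]
    for i in range(n):                      # columns p and q of (J A) J^T
        B[i][p], B[i][q] = (c * B[i][p] + s * B[i][q],
                            -s * B[i][p] + c * B[i][q])
    return B

A = [[4.0, 1.0, 2.0], [1.0, 3.0, 0.5], [2.0, 0.5, 1.0]]
before, apq = od(A), A[0][2]
B = jacobi_step(A, 0, 2)
# Od should drop by exactly 2*A[0][2]^2, and the (0, 2) entry should vanish.
```

The half-angle call `0.5 * atan2(2*A_pq, A_pp − A_qq)` is just one way of solving (3.8.7); any branch of tan 2θ annihilates the (p, q) entry.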
3.9 Convergence of the Jacobi method
We have now described the fundamental operation of the Jacobi algorithm, namely the
plane rotation in n-dimensional space that sends a real symmetric matrix A into JAJ^T,
and we have explicit formulas for the new matrix elements. There are still a number of
quite substantive points to discuss before we will be able to assemble an efficient program
for carrying out the method. However, in line with the philosophy that it is best to break
up large programs into small, manageable chunks, we are now ready to prepare the first
module of the Jacobi program.
What we want is to be able to execute the rotation through an angle θ, according to the
formulas (3.8.6) of the previous section. This could be accomplished by a single subroutine
that would take the symmetric matrix A and the sine and cosine of the rotation angle, and
execute the operation (3.8.6).
If we keep later applications in mind, then the best choice for a module will be one that
will, on demand, multiply a given not-necessarily-symmetric matrix on the left by J or on
the right by J^T, depending on the call. This is one of the situations we referred to earlier
where the most universal choice of subroutine will not be the most economical one in every
application, but we will get a lot of mileage out of this routine!
Hence, suppose we are given

1. an n × n real matrix A,

2. the sine S and cosine C of a rotation angle,

3. the plane [p, q] of the rotation, and

4. a parameter option that will equal 1 if we want to do JA, and 2 if we want AJ^T.
The procedure will be called
Procedure rotate(s,c,p,q,option);
What the procedure will do is exactly this. If called with option = 1, it will multiply A
on the left by J, according to the formulas (3.8.5), and exit. If option = 2, it will multiply
A on the right by J^T. The Maple procedure is as follows:
rotate:=proc(s,c,p,q,opt) local j,temp;
global A,n;
if opt=1 then
for j from 1 to n do
temp:=evalf(c*A[p,j]+s*A[q,j]);
A[q,j]:=-s*A[p,j]+c*A[q,j];
A[p,j]:=temp;
od
else
for j from 1 to n do
temp:=c*A[j,p]+s*A[j,q];
A[j,q]:=-s*A[j,p]+c*A[j,q];
A[j,p]:=temp;
od
fi;
RETURN()
end:
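For readers following along outside Maple, the same two-sided update can be transcribed into Python (our names; 0-based indices), together with a tiny end-to-end check:

```python
import math

def rotate(A, s, c, p, q, opt):
    """Python transcription of the Maple rotate: opt=1 does A <- J A
    (rows p, q change); opt=2 does A <- A J^T (columns p, q change).
    Each call costs O(n), since only two lines of A are touched."""
    n = len(A)
    if opt == 1:
        for j in range(n):
            temp = c * A[p][j] + s * A[q][j]
            A[q][j] = -s * A[p][j] + c * A[q][j]
            A[p][j] = temp
    else:
        for j in range(n):
            temp = c * A[j][p] + s * A[j][q]
            A[j][q] = -s * A[j][p] + c * A[j][q]
            A[j][p] = temp

A = [[2.0, 1.0], [1.0, 2.0]]
theta = math.pi / 4          # for this A, (3.8.7) gives theta = pi/4
s, c = math.sin(theta), math.cos(theta)
rotate(A, s, c, 0, 1, 1)     # A <- J A
rotate(A, s, c, 0, 1, 2)     # A <- A J^T
# A is now (numerically) diag(3, 1): the eigenvalues of [[2,1],[1,2]].
```

For this 2 × 2 matrix a single pair of calls diagonalizes A completely; in general, of course, many rotations are needed.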
To carry out one iteration of the Jacobi method, we will have to call rotate twice, once
with option = 1 and then with option = 2.
The amount of computational labor that is done by this module is O(n) per call, since
only two lines of the matrix are affected by its operation.
Next, let’s prove that the results of applying one rotation after another do in fact
converge to a diagonal matrix.
Theorem 3.9.1 Let A be a real symmetric matrix. Suppose we follow the strategy of search-
ing for the off-diagonal element of largest absolute value, choosing θ so as to zero out that
element by carrying out a Jacobi rotation on A, and then repeating the whole process on the
resulting matrix, etc. Then the sequence of matrices that is thereby obtained approaches a
diagonal matrix D.
Proof. At a certain stage of the iteration, let A_pq denote the off-diagonal element of largest
absolute value. Since the maximum of any set of numbers is at least as big as the average of
that set (proof?), it follows that the maximum of the squares of the off-diagonal elements of
A is at least as big as the average square of an off-diagonal element. The average square is
equal to the sum of the squares of the off-diagonal elements divided by the number of such
elements, i.e., divided by n(n − 1). Hence the average square is exactly Od(A)/(n(n − 1)),
and therefore

    A_pq² ≥ Od(A)/(n(n − 1)).          (3.9.1)
Now the effect of a single rotation of the matrix is to reduce Od(A) by 2A_pq², so equation
(3.8.8) yields

    Od(JAJ^T) = Od(A) − 2A_pq² ≤ Od(A) − 2·Od(A)/(n(n − 1)) = (1 − 2/(n(n − 1))) Od(A).
                                                                                (3.9.2)
Hence a single rotation will reduce Od(A) by a multiplicative factor of at least
1 − 2/(n(n − 1)). Since this factor is less than 1, it follows that the sum of squares of the off-diagonal
entries approaches zero as the number of plane rotations grows without bound, completing
the proof.
The proof told us even more, since it produced a quantitative estimate of the rate at
which Od(A) approaches zero. Indeed, after r rotations, the sum of squares will have
dropped to at most

    (1 − 2/(n(n − 1)))^r · Od(original A).          (3.9.3)
If we put r = n(n − 1)/2, then we see that Od(A) has dropped by at least a factor of
(approximately) e. Hence, after doing an average of one rotation per off-diagonal element,
the function Od(A) is no more than 1/e times its original value. After doing an average of,
say, t rotations per off-diagonal element (i.e., tn(n − 1)/2 rotations), the function Od(A)
will have dropped to about e^{−t} times its original value. If we want it to drop to, say, 10^{−m}
times its original value then we can expect to need no more than about m(ln 10)n(n − 1)/2
rotations.

To put it in very concrete terms, suppose we're working in double precision (12-digit)
arithmetic and we are willing to decree that convergence has taken place if Od has been
reduced by 10^{−12}. Then at most 12(ln 10)n(n − 1)/2 < 6(ln 10)n² ≈ 13.8n² rotations will
have to be done. Of course in practice we will be watching the function Od(A) as it drops,
so there won’t be any need to know in advance how many iterations are needed. We can
stop when the actual observed value is small enough. Still, it's comforting to know that at
most O(n²) iterations will be enough to do the job.
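The per-rotation guarantee behind (3.9.3) can be watched in action. The following sketch (our Python, not the book's Maple) runs twenty max-element rotations on a random symmetric matrix and checks that every step shrinks Od(A) by at least the factor 1 − 2/(n(n − 1)):

```python
import math, random

def od(A):
    """The book's Od(A): sum of squares of the off-diagonal entries."""
    n = len(A)
    return sum(A[i][j] ** 2 for i in range(n) for j in range(n) if i != j)

def one_max_rotation(A):
    """Zero the off-diagonal entry of largest absolute value,
    choosing theta by (3.8.7), and return J A J^T."""
    n = len(A)
    p, q = max(((i, j) for i in range(n) for j in range(i + 1, n)),
               key=lambda ij: abs(A[ij[0]][ij[1]]))
    theta = 0.5 * math.atan2(2.0 * A[p][q], A[p][p] - A[q][q])
    c, s = math.cos(theta), math.sin(theta)
    B = [row[:] for row in A]
    for j in range(n):                       # rows p and q of J A
        B[p][j] = c * A[p][j] + s * A[q][j]
        B[q][j] = -s * A[p][j] + c * A[q][j]
    for i in range(n):                       # columns p and q of (J A) J^T
        B[i][p], B[i][q] = (c * B[i][p] + s * B[i][q],
                            -s * B[i][p] + c * B[i][q])
    return B

random.seed(17)
n = 6
A = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(i, n):
        A[i][j] = A[j][i] = random.uniform(-1.0, 1.0)

factor = 1.0 - 2.0 / (n * (n - 1))
vals = [od(A)]
for _ in range(20):
    A = one_max_rotation(A)
    vals.append(od(A))
# Each rotation obeys od_new <= factor * od_old, as (3.9.2) promises.
```

In practice the decay is much faster than this worst-case bound; the bound is what guarantees convergence.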
Now let’s re-direct our thoughts to the grand iterations process itself. At each step
we apply a rotation matrix to the current symmetric matrix in order to make it “more
diagonal”. At the same time, of course, we must keep track of the product of all of the
rotation matrices that we have so far used, because that is the matrix that ultimately will
be an orthogonal matrix with the eigenvectors of A across its rows.
Let’s watch this happen. Begin with A. After one rotation we have J
1
AJ
T
1
, after
two iterations we have J
2
J
1
AJ
T
1
J
T
2
, after three we have J
3
J
2
J
1
AJ
T
1
J
T
2
J
T
3
, etc. After all
iterations have been done, and we are looking at a matrix that is “diagonal enough” for
our purposes, the matrix we see is PAP^T = D, where P is obtained by starting with the
identity matrix and multiplying successively on the left by the rotational matrices J that
are used, and D is (virtually) diagonal.
Since PAP^T = D, we have AP^T = P^T D, so the columns of P^T, or equivalently the
rows of P, are the eigenvectors of A.
Now, we have indeed proved that the repeated rotations will diagonalize A. We have not
proved that the matrices P themselves converge to a certain fixed matrix. This is true, but
we omit the proof. One thing we do want to do, however, is to prove the spectral theorem,
Theorem 3.7.1, itself, since we have long since done all of the work.
Proof (of the Spectral Theorem 3.7.1): Consider the mapping f that associates with every
orthogonal matrix P the matrix f(P) = P^T AP. The set of orthogonal matrices is compact,
and the mapping f is continuous. Hence the image of the set of orthogonal matrices under
f is compact. Hence there is a matrix D in that image that minimizes the continuous
function Od(f(P)) = Od(P^T AP). Suppose D is not diagonal. Then we could find a Jacobi
rotation that would produce another matrix in the image whose Od would be lower, which
is a contradiction (of the fact that Od(D) was minimal). Hence D is diagonal. So there is
an orthogonal matrix P such that P^T AP = D, i.e., AP = PD. Hence the columns of P are
n pairwise orthogonal eigenvectors of A, and the proof of the spectral theorem is complete.
Now let’s get on with the implementation of the algorithm.
3.10 Corbató's idea and the implementation of the Jacobi
algorithm
It’s time to sit down with our accountants and add up the costs of the Jacobi method.
First, we have seen that O(n²) rotations will be sufficient to reduce the off-diagonal sum of
squares below some pre-assigned threshold level. Now, what is the price of a single rotation?
Here are the steps:

(i) Search for the off-diagonal element having the largest absolute value. The cost seems
to be equal to the number of elements that have to be looked at, namely n(n − 1)/2,
which we abbreviate as O(n²).
(ii) Calculate θ, sin θ and cos θ, and then carry out a rotation on the matrix A. This costs
O(n), since only four lines of A are changed.
(iii) Update the matrix P of eigenvectors by multiplying it by the rotation matrix. Since
only two rows of P change, this cost is O(n) also.
The longest part of the job is the search for the largest off-diagonal element. The search
is n times as expensive in time as either the rotation of A or the update of the eigenvector
matrix.
For this reason, in the years since Jacobi first described his algorithm, a number of other
strategies for dealing with the eigenvalue problem have been worked out. One of these is
called the cyclic Jacobi method. In that variation, one does not search, but instead goes
marching through the matrix one element at a time. That is to say, first do a rotation that
reduces A_12 to zero. Next do a rotation that reduces A_13 to zero (of course, A_12 doesn't
stay put, but becomes nonzero again!). Then do A_14 and so forth, returning to A_12 again
after A_{n−1,n}, cycling as long as necessary. This method avoids the search, but the proof that
it converges at all is quite complex, and the exact rate of convergence is unknown.
A variation on the cyclic method is called the threshold Jacobi method, in which we
go through the entries cyclically as above, but we do not carry out a rotation unless the
magnitude of the current matrix entry exceeds a certain threshold (“throw it back, it’s too
small”). This method also has an uncertain rate of convergence.
At a deeper level, two newer methods due to Givens and Householder have been
developed. These methods work not by trying to diagonalize A by rotations, but instead to
tri-diagonalize A by rotations. A tri-diagonal matrix is one whose entries are all zero
except for those on the diagonal, the sub-diagonal and the super-diagonal (i.e., A_ij = 0 unless
|i − j| ≤ 1).
The advantage of tri-diagonalization is that it is a finite process: it can be done in such
a way that elements, once reduced to zero, stay zero instead of bouncing back again as
they do in the Jacobi method. The disadvantage is that, having arrived at a tri-diagonal
matrix, one is not finished, but instead one must then confront the question of obtaining
the eigenvalues and eigenvectors of a tri-diagonal matrix, a nontrivial operation.
One of the reasons for the wide use of the Givens and Householder methods has been
that they get the answers in just O(n³) time, instead of the O(n⁴) time in which the original
Jacobi method operates.
Thanks to a suggestion of Corbató (F. J. Corbató, On the coding of Jacobi's method
for computing eigenvalues and eigenvectors of symmetric matrices, JACM, 10: 123–125,
1963), however, it is now easy to run the original Jacobi method in O(n³) time also. The
suggestion is all the more remarkable because of its simplicity and the fact that it lives at
the software level, rather than at the mathematical level. What it does is just this: it allows
us to use the largest off-diagonal entry at each stage while paying a price of a mere O(n)
for the privilege, instead of the O(n²) billing mentioned above.
We can do this by carrying along an additional linear array during the calculation. The
ith entry of this array, say loc[i], contains the number of a column in which an off-diagonal
element of largest absolute value in row i lives, i.e., A_{i,loc[i]} is an entry in row i of A of
largest absolute value in that row.

Now of course if some benefactor is kind enough to hand us this array, then it would be
a simple matter to find the biggest off-diagonal element of the entire matrix A. We would
just look through the n − 1 numbers |A_{i,loc[i]}| (i = 1, 2, . . . , n − 1) to find the largest one.
If the largest one is the one where i = p, say, then the desired matrix element would be
A_{p,loc[p]}. Hence the cost of using this array is O(n).
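In Python terms (a sketch with our own names; 0-based indices rather than the book's 1-based Maple), the array and the O(n) global search look like this:

```python
# loc[i] = column of the largest |A[i][j]| with j > i, for each row i.
A = [[0.0,  3.0, -1.0,  0.5],
     [3.0,  0.0,  2.0, -4.0],
     [-1.0, 2.0,  0.0,  0.25],
     [0.5, -4.0,  0.25, 0.0]]
n = 4
loc = [max(range(i + 1, n), key=lambda j: abs(A[i][j]))
       for i in range(n - 1)]

# O(n) global search: scan only the n-1 per-row champions.
p = max(range(n - 1), key=lambda i: abs(A[i][loc[i]]))
q = loc[p]
# For this matrix the winner is the entry A[1][3] = -4.0.
```

Only the strict upper triangle needs tracking, since A stays symmetric throughout the iteration.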
Since there are no such benefactors as described above, we are going to have to pay a
price for the care and feeding of this array. How much does it cost to create and to maintain
it?
Initially, we just search through the whole matrix and set it up. This clearly costs
O(n²) operations, but we pay just once. Now let's turn to a typical intermediate stage in
the calculation, and see what the price is for updating the loc array.
the calculation, and see what the price is for updating the loc array.
Given an array loc, suppose now that we carry out a single rotation on the matrix
A. Precisely how do we go about modifying loc so it will correspond to the new matrix?
The rotated matrix differs from the previous matrix in exactly two rows and two columns.
Certainly the two rows, p and q, that have been completely changed will simply have to be
searched again in order to find the new values of loc. This costs O(n) operations.
What about the other n − 2 rows of the matrix? In the ith one of those rows exactly two
entries were changed, namely the ones in the pth column and in the qth column. Suppose
the largest element that was previously in row i was not in either the pth or the qth column,
i.e., suppose loc[i] ∉ {p, q}. Then that previous largest element will still be there in the
rotated matrix. In that case, in order to discover the new entry of largest absolute value in
row i we need only compare at most three numbers: the new |A_ip|, the new |A_iq|, and the
old |A_{i,loc[i]}|. The column in which the largest of these three numbers is found will be the
new loc[i]. The price paid is at most three comparisons, and this does the updating job
in every row except those that happen to have had loc[i] ∈ {p, q}.
In the latter case we can still salvage something. If we are replacing the entry of
previously largest absolute value in the row i, we might after all get lucky and replace it
with an even larger number, in which case we would again know the new loc[i]. Christmas,
however, comes just once a year, and since the general trend of the off-diagonal entries is
downwards, and the previous largest entry was uncommonly large, most of the time we'll
be replacing the former largest entry with a smaller entry. In that case we'll just have to
re-search the entire row to find the new champion.

The number of rows that must be searched in their entireties in order to update the
loc array is therefore at most two (for rows p and q) plus the number of rows i in which
loc[i] happens to be equal to p or to q. It is reasonable to expect that the probability of
the event loc[i] ∈ {p, q} is about two chances out of n, since it seems that it ought not to
be any more likely that the winner was previously in those two columns than in any other
columns. This has never been in any sense proved, but we will assume that it is so. Then
the expected number of rows that will have to be completely searched will be about four,
on the average (the pth, the qth, and an average of about two others).
It follows that the expected cost of maintaining the loc array is O(n) per rotation.
The cost of finding the largest off-diagonal element has therefore been reduced to O(n) per
rotation, after all bills have been paid. Hence the cost of finding that element is comparable
with all of the other operations that go on in the algorithm, and it poses no special problem.
Using Corbató's suggestion, and subject to the equidistribution hypothesis mentioned
above, the cost of the complete Jacobi algorithm for eigenvalues and eigenvectors is O(n³).
We show below the complete Maple procedure for updating the array loc immediately after
a rotation has been done in the plane of p and q.
update:=proc(p,q) local i,r;
global loc,A,n;
for i from 1 to n-1 do
if i=p or i=q then searchrow(i)     # rows p and q were fully changed
else
r:=loc[i];
if r=p or r=q then searchrow(i)     # old champion was overwritten
else                                # old champion survives: compare it
if abs(A[i,p])>=abs(A[i,r]) then loc[i]:=p; fi;       # with the two
if abs(A[i,q])>=abs(A[i,loc[i]]) then loc[i]:=q; fi;  # new entries
fi;fi;od;
RETURN();
end:
The above procedure uses a small auxiliary routine:

Procedure searchrow(i)

This procedure searches the portion of row i of the n × n matrix A that lies above the main
diagonal and places the index of a column that contains an entry of largest absolute value
in loc[i].
searchrow:=proc(i)
local j,bigg;
global loc,A,n,P;
bigg:=0;
for j from i+1 to n do
if abs(A[i,j])>bigg then
bigg:=abs(A[i,j]);loc[i]:=j;fi;
od;
RETURN();
end:
We should mention that the search can be speeded up a little bit more by using a data
structure called a heap. What we want is to store the locations of the biggest elements in
each row in such a way that we can quickly access the biggest of all of them.
122 Numerical linear algebra
If we had stored the set of winners in a heap, or priority queue, then we would have
been able to find the overall winner in a single step, and the expense would have been the
maintenance of the heap structure, at a cost of a mere O(log n) operations. This would have
reduced the program overhead by an additional twenty percent or thereabouts, although the
programming job itself would have gotten harder.
To learn about heaps and how to use them, consult books on data structures, computer
science, and discrete mathematics.
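As a small illustration of the idea (our sketch, using Python's heapq and a lazy-deletion convention, not any structure from the book), store each row's champion keyed by the negated absolute value, so the global winner sits at the top of the heap:

```python
import heapq

# One entry per row: (-|value|, row, col), so the heap's smallest
# element is the off-diagonal entry of largest absolute value.
entries = [(-3.0, 0, 1), (-4.0, 1, 3), (-0.25, 2, 3)]
heap = list(entries)
heapq.heapify(heap)              # O(n) one-time setup

neg, p, q = heap[0]              # O(1) peek at the overall winner

# When a row's champion changes, push the new entry; stale entries are
# discarded lazily when popped.  Each push/pop costs O(log n).
heapq.heappush(heap, (-5.0, 2, 1))
neg2, p2, q2 = heap[0]
```

A production version would also pop and discard entries whose (row, col) no longer matches loc, which is the usual lazy-deletion bookkeeping for priority queues.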
3.11 Getting it together
The various pieces of the procedure that will find the eigenvalues and eigenvectors of a
real symmetric matrix by the method of Jacobi are now in view. It’s time to discuss the
assembly of those pieces.
Procedure jacobi(eps,dgts)
The input to the jacobi procedure is the n × n matrix A, and a parameter eps that
we will use as a reduction factor to test whether or not the current matrix is “diagonal
enough,” so that the procedure can terminate its operation. Further input is dgts, which
is the number of significant digits that you would like to be carried in the calculation.

The input matrix A itself is global, which is to say that its value is set outside of the
jacobi procedure, and it is available to all procedures that are involved.

The output of procedure jacobi will be an orthogonal matrix P that will hold the
eigenvectors of A in its rows, and a linear array eig that will hold the eigenvalues of A.
The input matrix A will be destroyed by the action of the procedure.
The first step in the operation of the procedure will be to compute test, the original
off-diagonal sum of squares. The Jacobi process will halt when this sum of squares of all
off-diagonal elements has been reduced to eps*test or less.
Next we set the matrix P to the n × n identity matrix, and initialize the array loc by
calling the subroutine searchrow for each row of A. This completes the initialization.
The remaining steps all get done while the sum of the squares of the off-diagonal entries
of the current matrix A exceeds eps times test.
We first find the largest off-diagonal element by searching through the numbers A_{i,loc[i]}
for the largest in absolute value. If that is A_{p,loc[p]}, we will then know p, q, A_pq, A_pp and
A_qq, and it will be time to compute the sine and cosine of the rotation angle θ. This is
a fairly ticklish operation, since the Jacobi method is sensitive to small inaccuracies in
these quantities. Also, note that we are going to calculate sin θ and cos θ without actually
calculating θ itself. After careful analysis it turns out that the best formulas for the purpose
are:

    x := 2A_pq
    y := A_pp − A_qq
    t := sqrt(x² + y²)
    sin θ := sign(xy) ((1 − |y|/t)/2)^{1/2}
    cos θ := ((1 + |y|/t)/2)^{1/2}.
                                              (3.11.1)
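A quick numerical sanity check of (3.11.1) (a Python sketch; the function name is ours): the computed values satisfy s² + c² = 1 and, plugged into the formula (3.8.6) for the new (p, q) entry, they annihilate it.

```python
import math

def sin_cos(Apq, App, Aqq):
    """The formulas (3.11.1): sin and cos of theta, computed without
    ever forming theta itself."""
    x, y = 2.0 * Apq, App - Aqq
    t = math.hypot(x, y)
    s = math.copysign(1.0, x * y) * math.sqrt(0.5 * (1.0 - abs(y) / t))
    c = math.sqrt(0.5 * (1.0 + abs(y) / t))
    return s, c

s, c = sin_cos(1.0, 4.0, 1.0)       # A_pq = 1, A_pp = 4, A_qq = 1
# New (p, q) entry from (3.8.6): C*S*(A_qq - A_pp) + (C^2 - S^2)*A_pq.
new_pq = c * s * (1.0 - 4.0) + (c * c - s * s) * 1.0
```

(`copysign` stands in for Maple's sign here; like the book's formula, it needs special handling if xy = 0, a case not exercised above.)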
Having computed sin θ and cos θ, we can now call rotate twice, as discussed in sec-
tion 3.9, first with opt=1 and again with opt=2. The matrix A has now been transformed
into the next stage of its march towards diagonalization.
Next we multiply the matrix P on the left by the same n × n orthogonal matrix of Jacobi
(3.8.3), in order to update the matrix that will hold the eigenvectors of A on output. Notice
that this multiplication affects only rows p and q of the matrix P, so only 2n elements are
changed.
Next we call update, as discussed in section 3.10, to modify the loc array to correspond
to the newly rotated matrix, and we are at the end (or od) of the while that was started
a few paragraphs ago. In other words, we’re finished. The Maple program for the Jacobi
algorithm follows.
jacobi:=proc(eps,dgts)
local test,eig,i,j,iter,big,p,q,x,y,t,s,c,x1;
global loc,n,P,A;
with(linalg): # initialize
iter:=0;n:=rowdim(A);Digits:=dgts;
loc:=array(1..n);
test:=add(add(2*A[i,j]^2,j=i+1..n),i=1..n-1); #sum squares o-d entries
P:=matrix(n,n,(i,j)->if i=j then 1 else 0 fi); #initialize eigenvector matrix
for i from 1 to n-1 do searchrow(i) od; #set up initial loc array
big:=test;
while big>eps*test do #begin next sweep
x:=0; #find largest o.d. element
for i from 1 to n-1 do
if abs(A[i,loc[i]])>x then x:=abs(A[i,loc[i]]);p:=i; fi;od;
q:=loc[p];
x:=2*A[p,q]; y:=A[p,p]-A[q,q]; #find sine and cosine of theta
t:=evalf(sqrt(x^2+y^2));
s:=sign(x*y)*evalf(sqrt(0.5*(1-abs(y)/t)));
c:=evalf(sqrt(0.5*(1+abs(y)/t)));
rotate(s,c,p,q,1);rotate(s,c,p,q,2); #apply rotations to A
for j from 1 to n do #update matrix of eigenvectors
t:=c*P[p,j]+s*P[q,j];
P[q,j]:=-s*P[p,j]+c*P[q,j];
P[p,j]:=t;
od;
update(p,q); #update loc array
big:=big-x^2/2;iter:=iter+1; #go do next sweep
od; #end of while
eig:=[seq(A[i,i],i=1..n)]; #output eigenvalue array
print(eig,P,iter); #print eigenvals, vecs, and
RETURN(); # no. of sweeps needed
end:
To use the programs one does the following. First, enter the four procedures jacobi,
update, rotate, searchrow into a Maple worksheet. Next enter the matrix A whose
eigenvalues and eigenvectors are wanted. Then choose dgts, the number of digits of accuracy
to be maintained, and eps, the fraction by which the original off-diagonal sum of squares
must be reduced for convergence.

As an example, a call jacobi(.00000001,15) will carry 15 digits along in the computation,
and will terminate when the sum of squares of the off-diagonal elements is .00000001
times what it was for the input matrix.
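To see the whole pipeline end to end, here is a compact Python analogue of the jacobi procedure (our code, with the search done naively rather than via the loc array, so it costs O(n²) per rotation), applied to a matrix whose eigenvalues are known in closed form:

```python
import math

def jacobi_eig(A, eps=1.0e-16):
    """Max-element Jacobi iteration on a symmetric matrix (0-based).
    Returns (sorted eigenvalues, P) with P A P^T ~ diagonal, so the
    rows of P approximate the eigenvectors."""
    n = len(A)
    A = [row[:] for row in A]
    P = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    test = sum(A[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
    big = test
    while big > eps * test:
        p, q = max(((i, j) for i in range(n) for j in range(i + 1, n)),
                   key=lambda ij: abs(A[ij[0]][ij[1]]))
        theta = 0.5 * math.atan2(2.0 * A[p][q], A[p][p] - A[q][q])
        c, s = math.cos(theta), math.sin(theta)
        for j in range(n):                          # A <- J A
            A[p][j], A[q][j] = (c * A[p][j] + s * A[q][j],
                                -s * A[p][j] + c * A[q][j])
        for i in range(n):                          # A <- A J^T
            A[i][p], A[i][q] = (c * A[i][p] + s * A[i][q],
                                -s * A[i][p] + c * A[i][q])
        for j in range(n):                          # P <- J P
            P[p][j], P[q][j] = (c * P[p][j] + s * P[q][j],
                                -s * P[p][j] + c * P[q][j])
        big = sum(A[i][j] ** 2 for i in range(n)
                  for j in range(n) if i != j)
    return sorted(A[i][i] for i in range(n)), P

# This tridiagonal matrix has eigenvalues 2 - sqrt(2), 2, 2 + sqrt(2).
M = [[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]]
eigs, P = jacobi_eig(M)
```

Note that P is accumulated by left-multiplying the rotations, exactly as the Maple code updates rows p and q of P, so on exit the rows of P hold the eigenvectors.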
3.12 Remarks
For a parting volley in the direction of eigenvalues, let’s review some connections with the
first section of this chapter, in which we studied linear mappings, albeit sketchily.
It’s worth noting that the eigenvalues of a matrix really are the eigenvalues of the linear
mapping that the matrix represents with respect to some basis.
In fact, suppose T is a linear mapping of E^n (Euclidean n-dimensional space) to itself.
If we choose a basis for E^n then T is represented by an n × n matrix A with respect to that
basis. Now if we change to a different basis, then the same linear mapping is represented
by B = HAH^{−1}, where H is a nonsingular n × n matrix. The proof of this fact is by a
straightforward calculation, and can be found in standard references on linear algebra.
First question: what happens to the determinant if we change the basis? Answer:
nothing, because

    det(HAH^{−1}) = det(H) det(A) det(H^{−1}) = det(A).          (3.12.1)
Hence the value of the determinant is a property of the linear mapping T, and will be the
same for every matrix that represents T in some basis. Hence we can speak of det(T), the
determinant of the linear mapping itself.
Next question: what happens to the eigenvalues if we change basis? Suppose x is an
eigenvector of A for the eigenvalue λ. Then Ax = λx. If we change basis, A changes to
B = HAH^{−1}, or A = H^{−1}BH. Hence H^{−1}BHx = λx, or B(Hx) = λ(Hx). Therefore Hx
is an eigenvector of B with the same eigenvalue λ. The eigenvalues are therefore independent
of the basis, and are properties of the linear mapping T itself. Hence we can speak of the
eigenvalues of a linear mapping T.
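This invariance is easy to witness numerically; for a 2 × 2 example (matrices of our own choosing) it suffices to compare trace and determinant, which together determine the eigenvalues:

```python
# B = H A H^{-1} must have the same eigenvalues as A; for 2x2 matrices
# it is enough to compare trace and determinant.
A = [[2.0, 1.0], [0.0, 3.0]]       # upper triangular: eigenvalues 2 and 3
H = [[1.0, 1.0], [1.0, 2.0]]       # det H = 1, so H^{-1} = [[2,-1],[-1,1]]
Hinv = [[2.0, -1.0], [-1.0, 1.0]]

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

B = matmul(matmul(H, A), Hinv)
trace_B = B[0][0] + B[1][1]                        # should stay 2 + 3 = 5
det_B = B[0][0] * B[1][1] - B[0][1] * B[1][0]      # should stay 2 * 3 = 6
```

Although B looks nothing like A entrywise, its characteristic polynomial, and hence its spectrum, is unchanged.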
In the Jacobi method we carry out transformations A → JAJ^T, where J^T = J^{−1}. Hence
this transformation corresponds exactly to looking at the underlying linear mapping in a
different basis, in which the matrix that represents it is a little more diagonal than before.
Since J is an orthogonal matrix, it preserves lengths and angles because it preserves inner
products between vectors:

    ⟨Jx, Jy⟩ = ⟨x, J^T Jy⟩ = ⟨x, y⟩.          (3.12.2)
Therefore the method of Jacobi works by rotating a basis slightly, into a new basis in which
the matrix is closer to being diagonal.

2

Contents
1 Differential and Difference Equations 1.1 1.2 1.3 1.4 1.5 1.6 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Linear equations with constant coefficients . . . . . . . . . . . . . . . . . . . Difference equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Computing with difference equations . . . . . . . . . . . . . . . . . . . . . . Stability theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Stability theory of difference equations . . . . . . . . . . . . . . . . . . . . . 5 5 8 11 14 16 19 23 23 26 29 34 38 43 48 50 54 60 65 67 68 69 74

2 The Numerical Solution of Differential Equations 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 Euler’s method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Software notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Systems and equations of higher order . . . . . . . . . . . . . . . . . . . . . How to document a program . . . . . . . . . . . . . . . . . . . . . . . . . . The midpoint and trapezoidal rules . . . . . . . . . . . . . . . . . . . . . . . Comparison of the methods . . . . . . . . . . . . . . . . . . . . . . . . . . . Predictor-corrector methods . . . . . . . . . . . . . . . . . . . . . . . . . . . Truncation error and step size . . . . . . . . . . . . . . . . . . . . . . . . . . Controlling the step size . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

2.10 Case study: Rocket to the moon . . . . . . . . . . . . . . . . . . . . . . . . 2.11 Maple programs for the trapezoidal rule . . . . . . . . . . . . . . . . . . . . 2.11.1 Example: Computing the cosine function . . . . . . . . . . . . . . . 2.11.2 Example: The moon rocket in one dimension . . . . . . . . . . . . . 2.12 The big leagues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.13 Lagrange and Adams formulas . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . o 3. . . . . . . . .3 3. . . . . .12 Remarks . . . . . . . . . . . . . . . . . . . . . . .5 3. . . . . . . . . . . . . .1 3. . . . Convergence of the Jacobi method . . .11 Getting it together . . . . . Operation count . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4 3 Numerical linear algebra 3. . . . . . . . . . . Eigenvalues and eigenvectors of matrices . . . . . . . . . . . . . . . . . . To unscramble the eggs . .2 3. . . . . . . . . .7 3. . . . . . . . . . . . . . .4 3. . . . . . . . . . . . . . . . . . . . . . The orthogonal matrices of Jacobi . . . . . . . . . . . . . . . . . . . Linear systems . 3. . . . . . . . . . . . . . . . . . . . .9 CONTENTS 81 81 86 92 97 102 105 108 112 115 118 122 124 Vector spaces and linear mappings . . . . . . . . Building blocks for the linear equation solver . How big is zero? . . . . . . . . . . . . . . . . . . . . . 3. . .8 3. . . . . . . . . . . .6 3.10 Corbat´’s idea and the implementation of the Jacobi algorithm . . . . . . . . .

Chapter 1

Differential and Difference Equations

1.1 Introduction

In this chapter we are going to study differential equations, with particular emphasis on how to solve them with computers. We assume that the reader has previously met differential equations, so we're going to review the most basic facts about them rather quickly.

A differential equation is an equation in an unknown function, say y(x), where the equation contains various derivatives of y and various known functions of x. The problem is to "find" the unknown function. The order of a differential equation is the order of the highest derivative that appears in it.

Here's an easy equation of first order:

    y'(x) = 0.                                                            (1.1.1)

The unknown function is y(x) = constant, so we have solved the given equation (1.1.1).

The next one is a little harder:

    y'(x) = 2y(x).                                                        (1.1.2)

A solution will, no doubt, arrive after a bit of thought, namely y(x) = e^{2x}. But, if y(x) is a solution of (1.1.2), then so is 10y(x), or 49.6y(x), or in fact cy(x) for any constant c. Hence y = ce^{2x} is a solution of (1.1.2). Are there any other solutions? No there aren't, because if y is any function that satisfies (1.1.2), then

    (y e^{-2x})' = e^{-2x} (y' - 2y) = 0,                                 (1.1.3)

and so y e^{-2x} must be a constant, C.

In general, we can expect that if a differential equation is of the first order, then the most general solution will involve one arbitrary constant C. This is not always the case,

since we can write down differential equations that have no solutions at all. We would have, in fact, a fairly hard time (why?) finding a real function y(x) for which

    (y')^2 = -y^2 - 2.                                                    (1.1.4)

There are certain special kinds of differential equations that can always be solved, and it's often important to be able to recognize them. Among them are the "first-order linear" equations

    y'(x) + a(x) y(x) = 0,                                                (1.1.5)

where a(x) is a given function of x.

Before we describe the solution of these equations, let's discuss the word linear. To say that an equation is linear is to say that if we have any two solutions y_1(x) and y_2(x) of the equation, then c_1 y_1(x) + c_2 y_2(x) is also a solution of the equation, where c_1 and c_2 are any two constants (in other words, the set of solutions forms a vector space).

Equation (1.1.1) is linear. In fact, y_1(x) = 7 and y_2(x) = 23 are both solutions, and so is 7c_1 + 23c_2, whatever the constants c_1 and c_2. Less trivially, the equation

    y''(x) + y(x) = 0                                                     (1.1.6)

is linear. The linearity of (1.1.6) can be checked right from the equation itself, without knowing what the solutions are (do it!). For an example, though, we might note that y = sin x is a solution of (1.1.6), that y = cos x is another solution of (1.1.6), and finally, by linearity, that the function y = c_1 sin x + c_2 cos x is a solution, whatever the constants c_1 and c_2.

Now let's consider an instance of the first-order linear equation (1.1.5):

    y'(x) + x y(x) = 0.                                                   (1.1.7)

So we're looking for a function whose derivative is -x times the function. Evidently y = e^{-x^2/2} will do, and the general solution is y(x) = c e^{-x^2/2}. If instead of (1.1.7) we had y'(x) + x^2 y(x) = 0, then we would have found the general solution c e^{-x^3/3}.

As a last example, take

    y'(x) - (cos x) y(x) = 0.                                             (1.1.8)

The right medicine is now y(x) = e^{sin x}. In the next paragraph we'll give the general rule of which the above are three examples. The reader might like to put down the book at this point and try to formulate the rule for solving (1.1.5) before going on to read about it.

Ready? What we need is to choose some antiderivative A(x) of a(x), and then the solution is y(x) = c e^{-A(x)}.

Since that was so easy, next let's put a more interesting right-hand side into (1.1.5), by considering the equation

    y'(x) + a(x) y(x) = b(x)                                              (1.1.9)

where now b(x) is also a given function of x (Is (1.1.9) a linear equation? Are you sure?).

To solve (1.1.9), once again choose some antiderivative A(x) of a(x), and then note that we can rewrite (1.1.9) in the equivalent form

    e^{-A(x)} (d/dx) ( e^{A(x)} y(x) ) = b(x).

Now if we multiply through by e^{A(x)} we see that

    (d/dx) ( e^{A(x)} y(x) ) = b(x) e^{A(x)},                             (1.1.10)

so, if we integrate both sides,

    e^{A(x)} y(x) = ∫^x b(t) e^{A(t)} dt + const.,                        (1.1.11)

where on the right side, we mean any antiderivative of the function under the integral sign. Consequently

    y(x) = e^{-A(x)} [ ∫^x b(t) e^{A(t)} dt + const. ].                   (1.1.12)

As an example, consider the equation

    y' + y/x = x + 1.                                                     (1.1.13)

We find that A(x) = log x, and then from (1.1.12) we get

    y(x) = (1/x) [ ∫^x (t + 1) t dt + C ] = x^2/3 + x/2 + C/x.            (1.1.14)

We may be doing a disservice to the reader by beginning with this discussion of certain types of differential equations that can be solved analytically, because it would be erroneous to assume that most, or even many, such equations can be dealt with by these techniques. Indeed, the reason for the importance of the numerical methods that are the main subject of this chapter is precisely that most equations that arise in "real" problems are quite intractable by analytical means, so the computer is the only hope.

Despite the above disclaimer, in the next section we will study yet another important family of differential equations that can be handled analytically, namely linear equations with constant coefficients.

Exercises 1.1

1. Find the general solution of each of the following equations:

(a) y' = 2 cos x

(b) y' + (2/x) y = 0

(c) y' + x y = 3

(d) y' + (1/x) y = x + 5

(e) 2y y' = x + 1

2. Go to your computer or terminal and familiarize yourself with the equipment, the operating system, and the specific software you will be using. Then write a program that will calculate and print the sum of the squares of the integers 1, 2, . . . , 100. Run this program.

3. For each part of problem 1, find the solution for which y(1) = 1.

4. Show that the equation (1.1.4) has no real solutions.

1.2 Linear equations with constant coefficients

One particularly pleasant, and important, type of linear differential equation is the variety with constant coefficients, such as

    y'' + 3y' + 2y = 0.                                                   (1.2.1)

It turns out that what we have to do to solve these equations is to try a solution of a certain form, and we will then find that all of the solutions indeed are of that form.

Let's see if the function y(x) = e^{αx} is a solution of (1.2.1). If we substitute in (1.2.1), and then cancel the common factor e^{αx}, we are left with the quadratic equation

    α^2 + 3α + 2 = 0,

whose solutions are α = -2 and α = -1. Hence for those two values of α our trial function y(x) = e^{αx} is indeed a solution of (1.2.1). In other words, e^{-2x} is a solution, e^{-x} is a solution, and since the equation is linear,

    y(x) = c_1 e^{-2x} + c_2 e^{-x}                                       (1.2.2)

is also a solution, where c_1 and c_2 are arbitrary constants. Finally, (1.2.2) must be the most general solution since it has the "right" number of arbitrary constants, namely two.

Trying a solution in the form of an exponential is always the correct first step in solving linear equations with constant coefficients. Various complications can develop, however, as illustrated by the equation

    y'' + 4y' + 4y = 0.                                                   (1.2.3)

This time, let's see if there is a solution of the form y = e^{αx}. Again, substitution into (1.2.3) and cancellation of the factor e^{αx} leads to the quadratic equation

    α^2 + 4α + 4 = 0,                                                     (1.2.4)
whose two roots are identical, both being -2. Hence e^{-2x} is a solution, and of course so is c_1 e^{-2x}, but we don't yet have the general solution because there is, so far, only one arbitrary constant. The difficulty, of course, is caused by the fact that the roots of (1.2.4) are not distinct.

In this case, it turns out that x e^{-2x} is another solution of the differential equation (1.2.3) (verify this), so the general solution is (c_1 + c_2 x) e^{-2x}.

Suppose that we begin with an equation of third order, and that all three roots turn out to be the same. For instance, to solve the equation

    y''' + 3y'' + 3y' + y = 0                                             (1.2.5)

we would try y = e^{αx}, and we would then be facing the cubic equation

    α^3 + 3α^2 + 3α + 1 = 0,                                              (1.2.6)

whose "three" roots are all equal to -1. Now, not only is e^{-x} a solution, but so are x e^{-x} and x^2 e^{-x}.

To see why this procedure works in general, suppose we have a linear differential equation with constant coefficients, say

    y^{(n)} + a_1 y^{(n-1)} + a_2 y^{(n-2)} + · · · + a_n y = 0.          (1.2.7)

If we try to find a solution of the usual exponential form y = e^{αx}, then after substitution into (1.2.7) and cancellation of the common factor e^{αx}, we would find the polynomial equation

    α^n + a_1 α^{n-1} + a_2 α^{n-2} + · · · + a_n = 0.                    (1.2.8)

The polynomial on the left side is called the characteristic polynomial of the given differential equation. Suppose now that a certain number α = α* is a root of (1.2.8) of multiplicity p. To say that α* is a root of multiplicity p of the equation is to say that (α - α*)^p is a factor of the characteristic polynomial. Now look at the left side of the given differential equation (1.2.7). We can write it in the form

    (D^n + a_1 D^{n-1} + a_2 D^{n-2} + · · · + a_n) y = 0,                (1.2.9)

in which D is the differential operator d/dx. In the parentheses in (1.2.9) we see the polynomial φ(D), where φ is exactly the characteristic polynomial in (1.2.8).

Since φ(α) has the factor (α - α*)^p, it follows that φ(D) has the factor (D - α*)^p, so the left side of (1.2.9) can be written in the form

    g(D) (D - α*)^p y = 0,                                                (1.2.10)

where g is a polynomial of degree n - p. Now it's quite easy to see that y = x^k e^{α*x} satisfies (1.2.10) (and therefore (1.2.7) also) for each k = 0, 1, . . . , p - 1. Indeed, if we substitute this function y into (1.2.10), we see that it is enough to show that

    (D - α*)^p (x^k e^{α*x}) = 0        for k = 0, 1, . . . , p - 1.      (1.2.11)
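Such claims are easy to spot-check on a computer. The sketch below (in Python, which is not the language used elsewhere in this book; it is purely illustrative) verifies by central finite differences that y = x e^{-2x} — the k = 1 solution belonging to the double root α* = -2 of (1.2.4) — really does satisfy y'' + 4y' + 4y = 0 to within discretization error:

```python
import math

def y(x):
    # trial solution x * e^{-2x} for the double root alpha = -2 of (1.2.4)
    return x * math.exp(-2.0 * x)

def residual(x, h=1e-4):
    # approximate y' and y'' by central finite differences
    d1 = (y(x + h) - y(x - h)) / (2.0 * h)
    d2 = (y(x + h) - 2.0 * y(x) + y(x - h)) / (h * h)
    # left-hand side of y'' + 4y' + 4y = 0; should be nearly zero
    return d2 + 4.0 * d1 + 4.0 * y(x)

for x in [0.0, 0.5, 1.0, 2.0]:
    assert abs(residual(x)) < 1e-5
```

The residual is of order h^2, so with h = 10^-4 it is far below the 10^-5 tolerance used here.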

However,

    (D - α*) (x^k e^{α*x}) = k x^{k-1} e^{α*x},

and if we apply (D - α*) again,

    (D - α*)^2 (x^k e^{α*x}) = k(k-1) x^{k-2} e^{α*x},

etc. Now since k < p it is clear that (D - α*)^p (x^k e^{α*x}) = 0, as claimed.

To summarize, then, if we encounter a root α* of the characteristic equation, of multiplicity p, then corresponding to α* we can find exactly p linearly independent solutions of the differential equation, namely

    e^{α*x}, x e^{α*x}, x^2 e^{α*x}, . . . , x^{p-1} e^{α*x}.             (1.2.12)

Another way to state it is to say that the portion of the general solution of the given differential equation that corresponds to a root α* of the characteristic polynomial equation is Q(x) e^{α*x}, where Q(x) is an arbitrary polynomial whose degree is one less than the multiplicity of the root α*.

One last mild complication may arise from roots of the characteristic equation that are not real numbers. These don't really require any special attention, but they do present a few options. For instance, to solve y'' + 4y = 0, we find the characteristic equation α^2 + 4 = 0, and the complex roots ±2i. Hence the general solution is obtained by the usual rule as

    y(x) = c_1 e^{2ix} + c_2 e^{-2ix}.                                    (1.2.13)

This is a perfectly acceptable form of the solution, but we could make it look a bit prettier by using deMoivre's theorem, which says that

    e^{2ix} = cos 2x + i sin 2x
    e^{-2ix} = cos 2x - i sin 2x.                                         (1.2.14)

Then our general solution would look like

    y(x) = (c_1 + c_2) cos 2x + (i c_1 - i c_2) sin 2x.                   (1.2.15)

But c_1 and c_2 are just arbitrary constants, hence so are c_1 + c_2 and i c_1 - i c_2, so we might as well rename them c_1 and c_2, in which case the solution would take the form

    y(x) = c_1 cos 2x + c_2 sin 2x.                                       (1.2.16)

Here's an example that shows the various possibilities:

    y^{(8)} - 5y^{(7)} + 27y^{(6)} - 97y^{(5)} + 245y^{(4)} - 531y^{(3)} + 765y'' - 567y' + 162y = 0.    (1.2.17)

The equation was cooked up to have a characteristic polynomial that can be factored as

    (α - 2)(α^2 + 9)^2 (α - 1)^3.                                         (1.2.18)

Hence the roots of the characteristic equation are 2 (simple), 3i (multiplicity 2), -3i (multiplicity 2), and 1 (multiplicity 3).

Corresponding to the root 2, the general solution will contain the term c_1 e^{2x}. Corresponding to the double root at 3i we have terms (c_2 + c_3 x) e^{3ix} in the solution. From the double root at -3i we get a contribution (c_4 + c_5 x) e^{-3ix}, and finally from the triple root at 1 we get (c_6 + c_7 x + c_8 x^2) e^x. The general solution is the sum of these eight terms. Alternatively, we might have taken the four terms that come from ±3i in the form

    (c_2 + c_3 x) cos 3x + (c_4 + c_5 x) sin 3x.                          (1.2.19)

Exercises 1.2

1. Obtain the general solutions of each of the following differential equations:

(a) y'' + 5y' + 6y = 0

(b) y'' - 8y' + 7y = 0

(c) (D + 3)^2 y = 0

(d) (D^2 + 16)^2 y = 0

(e) (D + 3)^3 (D^2 - 25)^2 (D + 2)^3 y = 0

2. Find a curve y = f(x) that passes through the origin with unit slope, and which satisfies (D + 4)(D - 1)y = 0.

1.3 Difference equations

Whereas a differential equation is an equation in an unknown function, a difference equation is an equation in an unknown sequence. For example, suppose we know that a certain sequence of numbers y_0, y_1, y_2, . . . satisfies the following conditions:

    y_{n+2} + 5y_{n+1} + 6y_n = 0        n = 0, 1, 2, . . . ,             (1.3.1)

and furthermore that y_0 = 1 and y_1 = 3.

Evidently, we can compute as many of the y_n's as we need from (1.3.1); thus we would get y_2 = -21, y_3 = 87, y_4 = -309, and so forth. The entire sequence of y_n's is determined by the difference equation (1.3.1) together with the two starting values.

Difference equations are encountered when differential equations are solved on computers. Naturally, the computer can provide the values of the unknown function only at a discrete set of points. These values are computed by replacing the given differential equation by a difference equation that approximates it, and then calculating successive approximate values of the desired function from the difference equation.

Can we somehow "solve" a difference equation by obtaining a formula for the values of the solution sequence? The answer is that we can, as long as the difference equation is linear and has constant coefficients, as in (1.3.1). Just as in the case of differential equations with constant coefficients, the correct strategy for solving them is to try a solution of the right form. In the previous section, the right form to try was y(x) = e^{αx}. Now the winning combination is y = α^n, where α is a constant.
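Computing the first few members of the sequence is a one-liner on a machine. As a quick illustration (in Python — the book's own programs use Maple and pseudocode, so this is just a sketch), iterating (1.3.1) in the form y_{n+2} = -5y_{n+1} - 6y_n reproduces the values quoted above:

```python
# iterate y_{n+2} = -5*y_{n+1} - 6*y_n from the data y0 = 1, y1 = 3
y = [1, 3]
for n in range(6):
    y.append(-5 * y[-1] - 6 * y[-2])

print(y[:5])   # [1, 3, -21, 87, -309]
```

Note the alternating signs and the rapid growth in magnitude, which the explicit solution derived next will explain.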

So let's substitute α^n for y_n in (1.3.1) to see what happens. The left side becomes

    α^{n+2} + 5α^{n+1} + 6α^n = α^n (α^2 + 5α + 6) = 0.                   (1.3.2)

Just as we were able to cancel the common factor e^{αx} in the differential equation case, so here we can cancel the α^n, and we're left with the quadratic equation

    α^2 + 5α + 6 = 0.                                                     (1.3.3)

The two roots of this characteristic equation are α = -2 and α = -3. Therefore the sequence (-2)^n satisfies (1.3.1) and so does (-3)^n. Since the difference equation is linear, it follows that

    y_n = c_1 (-2)^n + c_2 (-3)^n                                         (1.3.4)

is also a solution, whatever the values of the constants c_1 and c_2.

Now it is evident from (1.3.1) itself that the numbers y_n are uniquely determined if we prescribe the values of just two of them. Hence, it is very clear that when we have a solution that contains two arbitrary constants we have the most general solution. When we take account of the given data y_0 = 1 and y_1 = 3, we get the two equations

    1 = c_1 + c_2
    3 = (-2) c_1 + (-3) c_2                                               (1.3.5)

from which c_1 = 6 and c_2 = -5. Finally, we use these values of c_1 and c_2 in (1.3.4) to get

    y_n = 6 (-2)^n - 5 (-3)^n        n = 0, 1, 2, . . . .                 (1.3.6)

Equation (1.3.6) is the desired formula that represents the unique solution of the given difference equation together with the prescribed starting values.

Let's step back a few paces to get a better view of the solution. Notice that the formula (1.3.6) expresses the solution as a linear combination of nth powers of the roots of the associated characteristic equation (1.3.3). When n is very large, is the number y_n a large number or a small one? Evidently the powers of -3 overwhelm those of -2, so the sequence will behave roughly like a constant times powers of -3. This means that we should expect the members of the sequence to alternate in sign and to grow rapidly in magnitude.

So much for the equation (1.3.1). Now let's look at the general case, in the form of a linear difference equation of order p:

    y_{n+p} + a_1 y_{n+p-1} + a_2 y_{n+p-2} + · · · + a_p y_n = 0.        (1.3.7)

We try a solution of the form y_n = α^n, and after substituting and canceling, we get the characteristic equation

    α^p + a_1 α^{p-1} + a_2 α^{p-2} + · · · + a_p = 0.                    (1.3.8)
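Before taking up the general case in earnest, it is worth checking the explicit formula (1.3.6) against direct iteration of (1.3.1), and watching the ratios of consecutive terms approach the dominant root -3. A short Python sketch (illustrative only; Python's exact integer arithmetic makes the comparison an exact equality):

```python
# direct iteration of (1.3.1): y_{n+2} = -5*y_{n+1} - 6*y_n, y0 = 1, y1 = 3
ys = [1, 3]
for n in range(30):
    ys.append(-5 * ys[-1] - 6 * ys[-2])

# every computed term agrees exactly with the closed form (1.3.6)
for n, yn in enumerate(ys):
    assert yn == 6 * (-2) ** n - 5 * (-3) ** n

# the ratio of consecutive terms approaches the dominant root -3
print(ys[-1] / ys[-2])   # approximately -3
```

The ratio differs from -3 only by a term of order (2/3)^n, which is already below 10^-5 here.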

This is a polynomial equation of degree p, so it has p roots, counting multiplicities, somewhere in the complex plane. Let α* be one of these p roots. If α* is simple (i.e., has multiplicity 1) then the part of the general solution that corresponds to α* is c(α*)^n. If, however, α* is a root of multiplicity k > 1 then we must multiply the solution c(α*)^n by an arbitrary polynomial in n, of degree k - 1, just as in the corresponding case for differential equations we used an arbitrary polynomial in x of degree k - 1.

We illustrate this, as well as the case of complex roots, by considering the following difference equation of order five:

    y_{n+5} - 5y_{n+4} + 9y_{n+3} - 9y_{n+2} + 8y_{n+1} - 4y_n = 0.       (1.3.9)

This example is rigged so that the characteristic equation can be factored as

    (α^2 + 1)(α - 2)^2 (α - 1) = 0,                                       (1.3.10)

from which the roots are obviously i, -i, 2 (multiplicity 2), and 1.

Corresponding to the roots i, -i, the portion of the general solution is c_1 i^n + c_2 (-i)^n. Since

    i^n = e^{inπ/2} = cos(nπ/2) + i sin(nπ/2)                             (1.3.11)

and similarly for (-i)^n, we can also take this part of the general solution in the form

    c_1 cos(nπ/2) + c_2 sin(nπ/2).                                        (1.3.12)

The double root α = 2 contributes (c_3 + c_4 n) 2^n, and the simple root α = 1 adds c_5 to the general solution, which in its full glory is

    y_n = c_1 cos(nπ/2) + c_2 sin(nπ/2) + (c_3 + c_4 n) 2^n + c_5.        (1.3.13)

The five constants would be determined by prescribing five initial values, say y_0, y_1, y_2, y_3 and y_4, as we would expect for the equation (1.3.9).

Exercises 1.3

1. Obtain the general solution of each of the following difference equations:

(a) y_{n+1} = 3y_n

(b) y_{n+1} = 3y_n + 2

(c) y_{n+2} - 2y_{n+1} + y_n = 0

(d) y_{n+2} - 8y_{n+1} + 12y_n = 0

(e) y_{n+2} - 6y_{n+1} + 9y_n = 1

(f) y_{n+2} + y_n = 0

2. Find the solution of the given difference equation that takes the prescribed initial values:

(a) y_{n+2} = 2y_{n+1} + y_n;  y_0 = 0, y_1 = 1

(b) y_{n+2} - 5y_{n+1} + 6y_n = 0;  y_0 = 1, y_1 = 2

(c) y_{n+4} + y_n = 0;  y_0 = 1, y_1 = -1, y_2 = 1, y_3 = -1

(d) y_{n+1} = α y_n + β;  y_0 = 1

3. (a) For each of the difference equations in problems 1 and 2, evaluate

    lim_{n→∞} y_{n+1}/y_n                                                 (1.3.14)

if it exists.

(b) Formulate and prove a general theorem about the existence of, and value of, the limit in part (a) for a linear difference equation with constant coefficients.

(c) Reverse the process: given a polynomial equation, find its root of largest absolute value by computing from a certain difference equation and evaluating the ratios of consecutive terms.

(d) Write a computer program to implement the method in part (c). Use it to calculate the largest root of the equation

    x^8 = x^7 + x^6 + x^5 + · · · + 1.                                    (1.3.15)

1.4 Computing with difference equations

This is, after all, a book about computing, so let's begin with computing from difference equations, since they will give us a chance to discuss some important questions that concern the design of computer programs. For a sample difference equation we'll use

    y_{n+3} = y_{n+2} + 5y_{n+1} + 3y_n                                   (1.4.1)

together with the starting values y_0 = y_1 = y_2 = 1. The reader might want, just for practice, to find an explicit formula for this sequence by the methods of the previous section.

Suppose we want to compute a large number of these y's in order to verify some property that they have, for instance to check that

    lim_{n→∞} y_{n+1}/y_n = 3,                                            (1.4.2)

which must be true since 3 is the root of largest absolute value of the characteristic equation.

As a first approach, we might declare y to be a linear array of some size large enough to accommodate the expected length of the calculation. Then the rest is easy. For each n, we would calculate the next y_{n+1} from (1.4.1), and we would divide it by its predecessor y_n to get a new ratio. If the new ratio agrees sufficiently well with the previous ratio we announce that the computation has terminated and print the new ratio as our answer. Otherwise, we move the new ratio to the location of the old ratio, increase n and try again.

If we were to write this out as a formal procedure (algorithm) it might look like:

    y0 := 1; y1 := 1; y2 := 1; n := 2;
    newrat := -10; oldrat := 1;
    while |newrat - oldrat| >= 0.000001 do
        oldrat := newrat; n := n + 1;
        yn := yn-1 + 5 * yn-2 + 3 * yn-3;
        newrat := yn/yn-1
    endwhile
    print newrat; Halt.

We'll use the symbol ':=' to mean that we are to compute the quantity on the right, and then store it in the place named on the left. It can be read as 'is replaced by' or 'is assigned.' Also, the block that begins with 'while' and ends with 'endwhile' represents a group of instructions that are to be executed repeatedly until the condition that follows 'while' becomes false, at which point the line following 'endwhile' is executed.

The procedure just described is fast, but it uses lots of storage. If, for instance, such a program needed to calculate 79 y's before convergence occurred, then it would have used 79 locations of array storage. In fact, the problem above doesn't need that many locations because convergence happens a lot sooner. Suppose you wanted to find out how much sooner, given only a programmable hand calculator with ten or twenty memory locations. Then you might appreciate a calculation procedure that needs just four locations to hold all necessary y's.

That's fairly easy to accomplish. At any given moment in the program, what we need to find the next y are just the previous three y's. So why not save only those three? We'll use the previous three to calculate the next one, and stow it for a moment in a fourth location. Then we'll compute the new ratio and compare it with the old. If they're not close enough, we move each one of the three newest y's back one step into the places where we store the latest three y's and repeat the process. The price that we pay for the memory saving is that we must move the data around a bit more.

Formally, it might be:

    y := 1; ym1 := 1; ym2 := 1;
    newrat := -10; oldrat := 1;
    while |newrat - oldrat| >= 0.000001 do
        ym3 := ym2; ym2 := ym1; ym1 := y;
        oldrat := newrat;
        y := ym1 + 5 * ym2 + 3 * ym3;
        newrat := y/ym1
    endwhile;
    print newrat; Halt.

The calculation can now be done in exactly six memory locations (y, ym1, ym2, ym3, oldrat, newrat) no matter how many y's have to be calculated, so you can undertake it on your hand calculator with complete confidence.

One should not think that such programming methods are only for hand calculators, though. As we progress through the numerical solution of differential equations we will see situations in which each of the quantities that appears in the difference equation will itself be an

. together with the starting values F0 = 0. of these arrays will need to be computed.3) (f) Write a computer program that will compute Fibonacci numbers and print out the limit in part (e) above. F1 = 1. 2. 1. (b) Derive an explicit formula for the nth Fibonacci number Fn .5 Stability theory In the study of natural phenomena it is most often true that a small change in conditions will produce just a small change in the state of the system being studied. for doing the calculation. rather than the second. (a) Write out the first ten Fibonacci numbers.16 Differential and Difference Equations array (!). (h) Modify the program of part (h) to run in higher (or double) precision arithmetic. but is not so obvious from the formula!). . 4. (e) Evaluate n→∞ lim Fn+1 Fn (1. .4 1. for example. it will almost never be necessary to save in memory all of the computed values simultaneously. and then printed or plotted. are defined by the recurrence formula Fn+2 = Fn+1 + Fn for n = 0. . Do these computed numbers seem to be approaching zero? Explain carefully what you see and why it happens. 3. Does it change any of your answers? 2. 1. Normally. . correct to six decimal places. The Fibonacci numbers F0 . If. F2 . or if tiny variations in the period of the earth’s rotation . . perhaps thousands. (g) Write a computer program that will compute the first√ members of the modified 40 Fibonacci sequence in which F0 = 1 and F1 = (1 − 5)/2. 2. Fortunately. F1 .4. they will be computed. (d) Prove directly from your formula that the Fibonacci numbers are integers (This is perfectly obvious from their definition. and never needed except in the calculation of their immediate successors. (c) Evaluate your formula for n = 0. and that very large numbers. Even large computers might quake at the thought of using the first method above. 
Find the most general solution of each of the following difference equations: (a) yn+1 − 2yn + yn−1 = 0 (b) yn+1 = 2yn (c) yn+2 + yn = 0 (d) yn+2 + 3yn+1 + 3yn + yn−1 = 0 1. a very slight increase in atmospheric pollution could produce dramatically large changes in populations of flora and fauna. Exercises 1.

and so forth. and it represents a . so that predictions will be possible. An important job of the numerical analyst is to make sure that this does not happen. yet another layer of approximation is usually introduced at this stage. even though it is a simplification of the real world. or to try to live in.2) We are thinking of the independent variable t as the time. because the model. When physical scientists attempt to understand some facet of nature. will produce not small but very large changes in the computed solution. and we will find that this theme of stability recurs throughout our study of computer approximations. Therefore considerable effort has gone into the construction of mathematical models that will allow computer studies of the effects of atmospheric changes. For instance. might still be too complicated to solve exactly on a computer. This model will usually not faithfully reproduce all of the structure of the original phenomenon. may models use differential equations.5 Stability theory 17 produced huge changes in climatic conditions. they often will make a mathematical model. of population growth. and then solving the difference equations. but one hopes that the important features of the system will be preserved in the model. Models of the weather. we may say that most aspects of nature are stable. For instance. and to use these calculations for predicting the effects of various proposed actions.1. The next step might be to go to the computer and do calculations from the model. the world would be a very different place to live in. The initial conditions tell us that c1 = 1 and c2 = 0. As an example of instability in differential equations. Now suppose that we have gotten ourselves over this hurdle. all involve differential equations. and we have constructed a model that is indeed stable.5. Digital computers solve differential equations by approximating them by difference equations. In brief.1) is y(t) = c1 e−t + c2 e2t . 
produced huge changes in climatic conditions, the world would be a very different place to live in, or to try to live in. In brief, we may say that most aspects of nature are stable.

When physical scientists attempt to understand some facet of nature, they often will make a mathematical model. This model will usually not faithfully reproduce all of the structure of the original phenomenon, but one hopes that the important features of the system will be preserved in the model, so that predictions will be possible. One of the most important features to preserve is that of stability.

For instance, the example of atmospheric pollution and its effect on living things referred to above is important and very complex. Therefore considerable effort has gone into the construction of mathematical models that will allow computer studies of the effects of atmospheric changes, and to use these calculations for predicting the effects of various proposed actions.

Many models use differential equations. Models of the weather, of the motion of fluids, of the movement of astronomical objects, of spacecraft, of population growth, of predator-prey relationships, of electric circuit transients, and so forth, all involve differential equations.

One of the first tests to which such a model should be subjected is that of stability: does it faithfully reproduce the observed fact that small changes produce small changes? What is true in nature need not be true in a man-made model that is a simplification or idealization of the real world.

Now suppose that we have gotten ourselves over this hurdle, and we have constructed a model that is indeed stable. The next step might be to go to the computer and do calculations from the model, because the model, even though it is a simplification of the real world, might still be too complicated to solve exactly on a computer.

Unfortunately, yet another layer of approximation is usually introduced at this stage. Digital computers solve differential equations by approximating them by difference equations, and then solving the difference equations. Even though the differential equation that represents our model is indeed stable, it may be that the difference equation that we use on the computer is no longer stable, and that small changes in initial data on the computer, or small roundoff errors, will produce not small but very large changes in the computed solution. An important job of the numerical analyst is to make sure that this does not happen, and we will find that this theme of stability recurs throughout our study of computer approximations.

As an example of instability in differential equations, suppose that some model of a system led us to the equation

    y'' - y' - 2y = 0                                                     (1.5.1)

together with the initial data

    y(0) = 1,    y'(0) = -1.                                              (1.5.2)

We are thinking of the independent variable t as the time, so we will be interested in the solution as t becomes large and positive.

The general solution of (1.5.1) is y(t) = c_1 e^{-t} + c_2 e^{2t}. The initial conditions tell us that c_1 = 1 and c_2 = 0, hence the solution of our problem is y(t) = e^{-t}, and it represents a

. If α has positive real part the term is unbounded. and picked out only the decreasing one. . that equation is unstable. A small change in the initial data.000045 to about 7. In that case the question of whether (polynomial in t)eαt remains bounded depend on whether the “polynomial in t” is of degree zero (a constant polynomial) or of higher degree. when t = 10.000333 . or our lives hang by a slender thread indeed! Now exactly what is the reason for the observed instability of the equation (1.34.)e2t (1. the solution has the value 0. for every set of initial data (at t = 0) the solution not only remains bounded.5. (1.18 Differential and Difference Equations function that decays rapidly to zero with increasing t. . whereas otherwise the term will grow as t gets large. however.999. If we want the value of the solution at t = 10. thereby violating the definition of stability. . Likewise. A differential equation is called strongly stable if. or more generally.2) we suppressed the growing term. If the polynomial is constant then the term does indeed remain bounded for large positive t.5. By prescribing the initial data (1.1)? The general solution of the equation contains a falling exponential term c1 e−t . by asking for a solution with y (0) = −0. Let’s hope that there are no phenomena in nature that behave in this way. for some values of the initial conditions. from 0.5. In fact. but approaches zero as t approaches infinity.)e−t + (0. and a rising exponential term c2 e2t . Now it’s time for a formal Definition: A differential equation is said to be stable if for every set of initial data (at t = 0) the solution of the differential equation remains bounded as t approaches infinity. the complex number α has zero real part (is purely imaginary). .4) Under what circumstances does such a term remain bounded as t becomes large and positive? Certainly if α is negative then the term stays bounded. Now let’s change the initial data (1. then the term remains bounded. 
results in the presence of both terms in the solution.5.000045.5.00000002 to 161.999. if α is a complex number and its real part is negative. just from changing the initial value of y from −1 to −0. This takes care of all of the possibilities except the case where α is zero. if we have a differential equation whose general solution contains a term eαt in which α is positive. In other words. Let’s restrict attention now to linear differential equations with constant coefficients. What makes the equation (1. At t = 20 the change is even more impressive.2) just a bit. we would find that it has changed from 0. It’s easy to check that the solution is now y(t) = (0.2 that the general solution of such an equation is a sum of terms of the form (polynomial in t)eαt .1) unstable.5. is the presence of a rising exponential in its general solution. We know from section 1. 720+.3) instead of just y(t) = e−t .999666 . then.
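The dramatic effect of the rising exponential is easy to reproduce. A small Python sketch (purely illustrative; the coefficients below are the ones obtained above from the perturbed data y(0) = 1, y'(0) = -0.999):

```python
import math

def y_exact(t):
    # solution with y(0) = 1, y'(0) = -1: the growing term is suppressed
    return math.exp(-t)

def y_perturbed(t):
    # solution with y(0) = 1, y'(0) = -0.999: a tiny amount of e^{2t} leaks in
    c2 = 0.001 / 3.0            # coefficient of the growing term e^{2t}
    c1 = 1.0 - c2               # coefficient of the decaying term e^{-t}
    return c1 * math.exp(-t) + c2 * math.exp(2.0 * t)

print(y_exact(10.0))       # about 0.000045
print(y_perturbed(10.0))   # about 161,720 -- the growing term dominates
```

A change of one part in a thousand in y'(0) has changed the answer at t = 10 by a factor of several billion.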

. or unstable. 4. y (0) = −0. For exactly which real values of the parameter λ is each of the following differential equations stable? . take the equation 5 yn+1 = yn − yn−1 2 (n ≥ 1) (1.5 1.6 Stability theory of difference equations 19 Now recall that the “polynomial in t” is in fact a constant if the root α is a simple root of the characteristic equation of the differential equation. (a) y − 5y + 6y = 0 (b) y + 5y + 6y = 0 (c) y + 3y = 0 (d) (D + 3)3 (D + 1)y = 0 (e) (D + 1)2 (D 2 + 1)2 y = 0 (f) (D4 + 1)y = 0 2. and none lie on the imaginary axis.1 A linear differential equation with constant coefficients is stable if and only if all of the roots of its characteristic equation lie in the left half plane. and then solved again with y(0) = 1. . Determine for each of the following differential equations whether it is strongly stable. Make a list of some natural phenomena that you think are unstable. y (0) = −1. and those that lie on the imaginary axis. and for similar reasons. if any. Similar considerations apply to difference equations. 3. Exercises 1. This observation completes the proof of the following: Theorem 1. The differential equation y −y = 0 is to be solved with the initial conditions y(0) = 1.99. and otherwise it is of higher degree. As an example. are simple.1) .6.6 Stability theory of difference equations In the previous section we discussed the stability of differential equations.5. The key ideas were that such an equation is stable if every one of its solutions remains bounded as t approaches infinity. stable. Discuss. and strongly stable if the solutions actually approach zero. Such an equation is strongly stable if and only if all of the roots of its characteristic equation lie in the left half plane. strongly stable? (a) y + (2 + λ)y + y = 0 (b) y + λy + y = 0 (c) y + λy = 1 1.1. Compare the two solutions when x = 20.

along with the initial equations

    y_0 = 1,   y_1 = 0.5.    (1.6.2)

It's easy to see that the solution is y_n = 2^{-n}, a function that rapidly approaches zero with increasing n.

Now let's change the initial data (1.6.2), just a bit, say to

    y_0 = 1,   y_1 = 0.50000001    (1.6.3)

instead of (1.6.2). The solution of the difference equation with these new data is

    y = (0.0000000066...)2^n + (0.9999999933...)2^{-n}.    (1.6.4)

The point is that the coefficient of the growing term 2^n is small, but 2^n grows so fast that after a while the first term in (1.6.4) will be dominant. For instance, when n = 30 the solution is y_30 = 7.15..., compared to the value y_30 = 0.0000000009 of the solution with the original initial data (1.6.2). A change of one part in fifty million in the initial condition produced, thirty steps later, an answer one billion times as large. The fault lies with the difference equation, because it has both rising and falling components to its general solution. It should be clear that it is hopeless to do extended computation with an unstable difference equation, since a small roundoff error may alter the solution beyond recognition several steps later.

As in the case of differential equations, we'll say that a difference equation is stable if every solution remains bounded as n grows large, and that it is strongly stable if every solution approaches zero as n grows large. Again, we emphasize that every solution must be well behaved: the stability, or lack of it, is a property of the equation and not of the starting values.

Now consider the case where the difference equation is linear with constant coefficients. Then we know that the general solution is a sum of terms of the form (polynomial in n)α^n. Under what circumstances will such a term remain bounded or approach zero?

Suppose |α| < 1. Then the powers of α approach zero, and multiplication by a polynomial in n does not alter that conclusion. Suppose |α| > 1. Then the sequence of powers grows unboundedly, and multiplication by a nonzero polynomial only speeds the parting guest. Finally, suppose the complex number α has absolute value 1. Then the sequence of its powers remains bounded (in fact they all have absolute value 1), but if we multiply by a nonconstant polynomial, the resulting expression would grow without bound.

To summarize, then, the term

    (polynomial in n)α^n    (1.6.5)

approaches zero with increasing n if and only if |α| < 1, and it remains bounded as n increases if and only if either (a) |α| < 1 or (b) |α| = 1 and the polynomial is of degree zero (a constant).

Now we have proved:

Theorem 1.6.1 A linear difference equation with constant coefficients is stable if and only if all of the roots of its characteristic equation have absolute value at most 1, and those of absolute value 1 are simple. The equation is strongly stable if and only if all of the roots have absolute value strictly less than 1.

Exercises 1.6

1. Determine, for each of the following difference equations, whether it is strongly stable, stable, or unstable.
   (a) y_{n+2} - 5y_{n+1} + 6y_n = 0
   (b) 8y_{n+2} + 2y_{n+1} - 3y_n = 0
   (c) 3y_{n+2} + y_n = 0
   (d) 3y_{n+3} + 9y_{n+2} - y_{n+1} - 3y_n = 0
   (e) 4y_{n+4} + 5y_{n+2} + y_n = 0

2. The difference equation 2y_{n+2} + 3y_{n+1} - 2y_n = 0 is to be solved with the initial conditions y_0 = 2, y_1 = 1, and then solved again with y_0 = 2, y_1 = 0.99. Compare y_20 for the two solutions.

3. For exactly which real values of the parameter λ is each of the following difference equations stable? strongly stable?
   (a) y_{n+2} + λy_{n+1} + y_n = 0
   (b) y_{n+1} + λy_n = 1
   (c) y_{n+2} + y_{n+1} + λy_n = 0

4. (a) Consider the (constant-coefficient) difference equation

    a_0 y_{n+p} + a_1 y_{n+p-1} + a_2 y_{n+p-2} + · · · + a_p y_n = 0.    (1.6.6)

   Show that this difference equation cannot be stable if |a_p/a_0| > 1.
   (b) Give an example to show that the converse of the statement in part (a) is false. Namely, exhibit a difference equation for which |a_p/a_0| < 1 but the equation is unstable anyway.
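The instability of the example in this section is easy to witness on a computer. The programs in this book are written in Maple; purely as an illustration (this sketch is not from the text), here is the recurrence y_{n+1} = (5/2)y_n - y_{n-1} run in Python with the original data y_0 = 1, y_1 = 0.5 and with the value y_1 perturbed by one part in fifty million:

```python
def iterate(y0, y1, nmax):
    """Run the recurrence y_{n+1} = 2.5*y_n - y_{n-1}, starting from
    the initial values y0 and y1, and return y_nmax."""
    prev, cur = y0, y1
    for _ in range(nmax - 1):
        prev, cur = cur, 2.5 * cur - prev
    return cur

exact = iterate(1.0, 0.5, 30)             # original data: y_n = 2^(-n)
perturbed = iterate(1.0, 0.50000001, 30)  # y_1 off by one part in fifty million

print(exact)      # 9.31...e-10
print(perturbed)  # about 7.15
```

Thirty steps suffice to turn a perturbation in the eighth decimal place into an answer of order 10, exactly as the analysis above predicts.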


Chapter 2

The Numerical Solution of Differential Equations

2.1 Euler's method

Our study of numerical methods will begin with a very simple procedure, due to Euler. We will state it as a method for solving a single differential equation of first order. One of the nice features of the subject of numerical integration of differential equations is that the techniques that are developed for just one first-order differential equation will apply, with very little change, both to systems of simultaneous first-order equations and to equations of higher order. Hence the consideration of a single equation of first order, seemingly a very special case, turns out to be quite general.

By an initial-value problem we mean a differential equation together with enough given values of the unknown function and its derivatives at an initial point x_0 to determine the solution uniquely.

Let's suppose that we are given an initial-value problem of the form

    y' = f(x, y),   y(x_0) = y_0.    (2.1.1)

Our job is to find numerical approximate values of the unknown function y at points x to the right of (larger than) x_0.

What we actually will find will be approximate values of the unknown function at a discrete set of points x_0, x_1 = x_0 + h, x_2 = x_0 + 2h, x_3 = x_0 + 3h, etc. At each of these points x_n we will compute y_n, our approximation to y(x_n).

Hence, suppose that the spacing h between consecutive points has been chosen. We propose to start at the point x_0 where the initial data are given, and move to the right, obtaining y_1 from y_0, then y_2 from y_1, and so forth until sufficiently many values have been found. Next we need to derive a method by which each value of y can be obtained from its immediate predecessor. Consider the Taylor series expansion of the unknown function y(x)

about the point x_n:

    y(x_n + h) = y(x_n) + h y'(x_n) + (h^2/2) y''(X),    (2.1.2)

where we have halted the expansion after the first power of h, and in the remainder term the point X lies between x_n and x_n + h. Equation (2.1.2) is exact, but of course it cannot be used for computation because the point X is unknown. On the other hand, if we simply "forget" the error term, we'll have only an approximate relation instead of an exact one, with the consolation that we will be able to compute from it. The approximate relation is

    y(x_n + h) ≈ y(x_n) + h y'(x_n).    (2.1.3)

Next define y_{n+1} to be the approximate value of y(x_{n+1}) that we obtain by using the right side of (2.1.3) instead of (2.1.2). Then we get

    y_{n+1} = y_n + h y'_n.    (2.1.4)

Now we have a computable formula for the approximate values of the unknown function, because the quantity y'_n can be found from the differential equation (2.1.1) by writing

    y'_n = f(x_n, y_n),    (2.1.5)

and if we do so then (2.1.4) takes the form

    y_{n+1} = y_n + h f(x_n, y_n).    (2.1.6)

This is Euler's method, in a very explicit form, so that the computational procedure is clear. Equation (2.1.6) is in fact a recurrence relation, or difference equation, whereby each value of y_n is computed from its immediate predecessor.

To be quite specific, let's use Euler's method to obtain a numerical solution of the differential equation

    y' = 0.5y    (2.1.7)

together with the starting value y(0) = 1. The exact solution of this initial-value problem is obviously y(x) = e^{x/2}.

Concerning the approximate solution by Euler's method, we have, by comparing (2.1.7) with (2.1.1), f(x, y) = 0.5y, so

    y_{n+1} = y_n + h (y_n/2) = (1 + h/2) y_n.    (2.1.8)

Therefore, in this example, each y_n will be obtained from its predecessor by multiplication by 1 + h/2. Let's take h to be 0.05. Then we show below, for each value of x = 0.05, 0.10, 0.15, 0.20, ..., the approximate value of y computed from (2.1.8) and the exact value y(x_n) = e^{x_n/2}:

        x       Euler(x)     Exact(x)
      0.00      1.00000      1.00000
      0.05      1.02500      1.02532
      0.10      1.05063      1.05127
      0.15      1.07689      1.07788
      0.20      1.10381      1.10517
      1.00      1.63862      1.64872
      2.00      2.68506      2.71828
      3.00      4.39979      4.48169
     10.00    139.56389    148.41316

                  table 1

Considering the extreme simplicity of the approximation, it seems that we have done pretty well by this equation. Let's continue with this simple example by asking for a formula for the numbers that are called Euler(x) in the above table. In other words, exactly what function of x is Euler(x)? To answer this, we note first that each computed value y_{n+1} is obtained according to (2.1.8) by multiplying its predecessor y_n by 1 + h/2. Since y_0 = 1, it is clear that we will compute y_n = (1 + h/2)^n. Now we want to express this in terms of x rather than n. Since x_n = nh, we have n = x/h, and since h = 0.05 we have n = 20x. Hence the computed approximation to y at a particular point x will be

    Euler(x) = (1.025)^{20x},    (2.1.9)

or equivalently

    Euler(x) = ((1.025)^{20})^x = (1.638616...)^x.    (2.1.10)

Therefore both the exact solution of this differential equation and its computed solution have the form (const.)^x, since

    Exact(x) = e^{x/2} = (e^{1/2})^x = (1.648721...)^x.

The correct value of "const." is e^{1/2} = 1.648721..., and the value that is, in effect, used by Euler's method is (1 + h/2)^{1/h}. For a fixed value of x, we see that if we use Euler's method with smaller and smaller values of h (neglecting the increase in roundoff error that is sure to result), the values Euler(x) will converge to Exact(x), because

    lim_{h→0} (1 + h/2)^{1/h} = e^{1/2}.    (2.1.11)

Exercises 2.1
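The table above is easy to reproduce. The book's own programs are in Maple; as an illustration only (this sketch is not from the text), here is Euler's method applied in Python to the test equation y' = 0.5y with y(0) = 1 and h = 0.05:

```python
import math

def euler(f, x0, y0, h, nsteps):
    """Advance the solution of y' = f(x, y), y(x0) = y0, by nsteps
    Euler steps of size h (one step per loop turn), returning the
    final computed y."""
    x, y = x0, y0
    for _ in range(nsteps):
        y = y + h * f(x, y)   # Euler step
        x = x + h
    return y

h = 0.05
approx = euler(lambda x, y: 0.5 * y, 0.0, 1.0, h, 20)  # Euler(1.00) = 1.63862...
exact = math.exp(0.5)                                  # Exact(1.00) = 1.64872...
```

Halving h (and doubling nsteps so that the same point x = 1 is reached) moves the computed value closer to the exact one, in accordance with the limit (2.1.11).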

1. Verify the limit (2.1.11).

2. Use a calculator or a computer to integrate each of the following differential equations forward ten steps, using a spacing h = 0.05 with Euler's method. Also tabulate the exact solution at each value of x that occurs.
   (a) y'(x) = x y(x), y(0) = 1
   (b) y'(x) = x y(x) + 2, y(0) = 1
   (c) y'(x) = y(x)/(1 + x), y(0) = 1
   (d) y'(x) = -2x y(x)^2, y(0) = 10

2.2 Software notes

One of the main themes of our study will be the preparation of programs that not only work, but also are easily readable and useable by other people. The act of communication that must take place before a program can be used by persons other than its author is a difficult one to carry out, and we will return several times to the principles that have evolved as guides to the preparation of readable software. Here are some of these guidelines.

1. Documentation

The documentation of a program is the set of written instructions to a user that inform the user about the purpose and operation of the program. At the moment that the job of writing and testing a program has been completed it is only natural to feel an urge to get the whole thing over with and get on to the next job. Some programs may be obscure, one might think, but not this one: it's perfectly obvious how to use this program. Yet it is amazing how rapidly our knowledge of our very own program fades. If we come back to a program after a lapse of a few months' time, it often happens that we will have no idea what the program did or how to use it, at least not without making a large investment of time. For that reason it is important that when our program has been written and tested it should be documented immediately, while our memory of it is still green. Besides, the best place for documentation is in the program itself, in "comment" statements. That way one can be sure that when the comments are needed they will be available.

The first mission of program documentation is to describe the purpose of the program. State clearly the problem that the program solves, or the exact operation that it performs on its input in order to get its output. Already in this first mission, a good bit of technical skill can be brought to bear that will be very helpful to the user, by intertwining the description of the program purpose with the names of the communicating variables in the program.

Let's see what that means by considering an example. Suppose we have written a subroutine that searches through a specified row of a matrix to find the element of largest

absolute value. Such a routine, in Maple for instance, might look like this:

    search:=proc(A,i) local j,winner,jwin;
    winner:=-1;
    for j from 1 to coldim(A) do
      if (abs(A[i,j])>winner) then winner:=abs(A[i,j]); jwin:=j fi
    od;
    return(jwin);
    end;

Now let's try our hand at documenting this program: "The purpose of this program is to search a given row of a matrix to find an element of largest absolute value and return the column in which it was found." That is pretty good documentation, perhaps better than many programs get. But we can make it a lot more useful by doing the intertwining that we referred to above: "The purpose of this program is to search row i of a given matrix A to find an entry of largest absolute value, and to return the column jwin where that entry lives."

In most important computer languages, the communicating variables are announced in the first line of the coding of a procedure or subroutine. Maple follows this convention at least for the input variables, although the output variable is usually specified in the "return" statement. In the first line of the little subroutine above we see the list (A, i) of its input variables (or "arguments"). They, together with jwin, are the input and output variables of the subroutine. Those variables are the ones that the user can see, as opposed to the other "local" variables that live inside the subroutine but don't communicate with the outside world (like j and winner, which are listed on the first line of the program after the word "local"). The communicating variables are the ones that the user has to understand, and the best way to help the user to understand them is to relate them directly to the description of the purpose of the program.
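For readers who want to experiment outside Maple, a hypothetical Python version of the same subroutine, with its documentation intertwined in the same way, might read as follows (this is an illustration, not part of the book's software):

```python
def search(A, i):
    """Search row i of the matrix A (a list of lists) for an entry of
    largest absolute value, and return the column index jwin where
    that entry lives."""
    winner = -1.0   # largest absolute value seen so far
    jwin = None     # column where it was found
    for j, entry in enumerate(A[i]):
        if abs(entry) > winner:
            winner = abs(entry)
            jwin = j
    return jwin
```

Here Python indexes rows and columns from 0 rather than 1; everything else mirrors the Maple routine above.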

We'll come back to the subject of documentation, but now let's mention another ingredient of ease of use of programs, and that is:

2. Modularity

It is important to divide a long program into a number of smaller modules, each with a clearly stated set of inputs and outputs, and each with its own documentation. That means that we should get into the habit of writing lots of subroutines or procedures, because the subroutine or procedure mode of expression forces one to be quite explicit about the relationship of the block of coding to the rest of the world.

When we are writing a large program we would all write a subroutine if we found that a certain sequence of steps was being called for repeatedly. Beyond this, however, there are numerous inducements for breaking off subroutines even if the block of coding occurs just once in the main program.

For one thing, it's easier to check out the program. The testing procedure would consist of first testing each of the subroutines separately on test problems designed just for them. Once the subroutines work, it would remain only to test their relationships to the calling program.

For another reason, if jobs within a large program are not broken into subroutines it can be much harder to isolate the block of coding that deals with a particular function and remove it without affecting the whole works. We might discover a better, faster, more elegant, or what-have-you method of performing the task that one of these subroutines does. Then we would be able to yank out the former subroutine and plug in the new one, while being careful only to make sure that the new subroutine relates to the same inputs and outputs as the old one.

For another reason, it may well happen that even though the job that is done by the subroutine occurs only once in the current program, it may recur in other programs as yet undreamed of. If one is in the habit of writing small independent modules and stringing them together to make large programs, then it doesn't take long before one has a library of useful subroutines, each one tested, working and documented, that will greatly simplify the task of writing future programs.

Finally, the practice of subdividing the large jobs into the smaller jobs of which they are composed is an extremely valuable analytical skill, one that is useful not only in programming, but in all sorts of organizational activities where smaller efforts are to be pooled in order to produce a larger effect. It is therefore a quality of mind that provides much of its own justification.

In this book, the major programs that are the objects of study have been broken up into subroutines in the expectation that the reader will be able to start writing and checking out these modules even before the main ideas of the current subject have been fully explained. This was done in part because some of these programs are quite complex, and it would be unreasonable to expect the whole program to be written in a short time. It was also done to give examples of the process of subdivision that we have been talking about. For instance, the general linear algebra program for solving systems of linear simultaneous equations in Chapter 3 has been divided into six modules, and they are described in section 3.3. The reader might wish to look ahead at those routines and to verify that even though their relationship to the whole job of solving equations is by no means clear now, nonetheless, because of the fact that they are independent and self-contained, they can be programmed and checked out at any time without waiting for the full explanation.

One more ingredient that is needed for the production of useful software is:

3. Style

We don't mean style in the sense of "class," although this is as welcome in programming as it is elsewhere. There have evolved a number of elements of good programming style, and these will mainly be discussed as they arise. But two of them (one trivial and one quite deep) are:

(a) Indentation: The instructions that lie within the range of a loop are indented in the program listing further to the right than the instructions that announce that the loop is about to begin, or that it has just terminated.

(b) Top-down structuring: When we visualize the overall logical structure of a complicated program we see a grand loop, within which there are several other loops and branchings, etc. According to the principles of top-down design the looping and branching structure of the program should be visible at once in the listing. That is, we should see an announcement of the opening of the grand loop, then indented under that perhaps a two-way branch (if-then-else); under the "then" one sees all that will happen if the condition is met, and under the "else" one sees what happens if it is not met. When we say that we see all that will happen, we mean that there are not any "go-to" instructions that would take our eye out of the flow of the if-then-else structure to some other page. It all happens right there on the same page.

These few words can scarcely convey the ideas of structuring, which we leave to the numerous examples in the sequel.

2.3 Systems and equations of higher order

We have already remarked that the methods of numerical integration for a single first-order differential equation carry over with very little change to systems of simultaneous differential equations of first order. In this section we'll discuss exactly how this is done, and furthermore, how the same idea can be applied to equations of higher order than the first. Euler's method will be used as the example, but the same transformations will apply to all of the methods that we will study.

In Euler's method for one equation, the approximate value of the unknown function at the next point x_{n+1} = x_n + h is calculated from

    y_{n+1} = y_n + h f(x_n, y_n).    (2.3.1)

Now suppose that we are trying to solve not just a single equation, but a system of N simultaneous equations, say

    y_i'(x) = f_i(x, y_1, y_2, ..., y_N),   i = 1, 2, ..., N.    (2.3.2)

Equation (2.3.2) represents N equations, in each of which just one derivative appears, and whose right-hand side may depend on x, and on all of the unknown functions, but not on their derivatives. The "f_i" indicates that, of course, each equation can have a different right-hand side.

Now introduce the vector y(x) of unknown functions

    y(x) = [y_1(x), y_2(x), y_3(x), ..., y_N(x)]    (2.3.3)

and the vector f = f(x, y(x)) of right-hand sides

    f = [f_1(x, y), f_2(x, y), ..., f_N(x, y)].    (2.3.4)

In terms of these vectors, equation (2.3.2) can be rewritten as

    y'(x) = f(x, y(x)).    (2.3.5)

We observe that equation (2.3.5) looks just like our standard form (2.1.1) for a single equation in a single unknown function, except for the bold face type, i.e., except for the fact that y and f now represent vector quantities.

To apply a numerical method such as that of Euler, then, all we need to do is to take the statement of the method for a single differential equation in a single unknown function, and replace y(x) and f(x, y(x)) by vector quantities as above. We will then have obtained the generalization of the numerical method to systems.

To be specific, Euler's method for a single equation is y_{n+1} = y_n + h f(x_n, y_n), so Euler's method for a system of differential equations will be

    y_{n+1} = y_n + h f(x_n, y_n).    (2.3.6)

This means that if we know the entire vector y of unknown functions at the point x = x_n, then we can find the entire vector of unknown functions at the next point x_{n+1} = x_n + h by means of (2.3.6).

In detail, if ȳ_i(x_n) denotes the computed approximate value of the unknown function y_i at the point x_n, then what we must calculate are

    ȳ_i(x_{n+1}) = ȳ_i(x_n) + h f_i(x_n, ȳ_1(x_n), ȳ_2(x_n), ..., ȳ_N(x_n))    (2.3.7)

for each i = 1, 2, ..., N.

As an example, take the pair of differential equations

    y_1' = x + y_1 + y_2
    y_2' = y_1 y_2 + 1    (2.3.8)

together with the initial values y_1(0) = 0, y_2(0) = 1.
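The first two Euler steps for this system, which are worked out by hand below, can also be checked in code. The book's routines are in Maple; the following is only an illustrative Python sketch (not from the text) of one vector Euler step applied twice, with step size h = 0.05:

```python
def f(x, y):
    """Right-hand sides of the example system:
    y1' = x + y1 + y2,  y2' = y1*y2 + 1."""
    return [x + y[0] + y[1], y[0] * y[1] + 1.0]

def euler_step(x, yin, h):
    """One Euler step for a system: yout[i] = yin[i] + h * f_i(x, yin).
    Building a fresh output list (the YOUT array of the discussion
    below) keeps the old values in yin available while the new ones
    are computed."""
    return [yi + h * fi for yi, fi in zip(yin, f(x, yin))]

h = 0.05
y = [0.0, 1.0]              # initial data y1(0) = 0, y2(0) = 1
y = euler_step(0.0, y, h)   # approximately [0.05, 1.05]
y = euler_step(0.05, y, h)  # approximately [0.1075, 1.102625]
```

The two computed vectors agree with the hand calculation carried out in the text.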

In this example, the vector of unknown functions is y = [y_1, y_2], and the vector of right-hand sides is f = [x + y_1 + y_2, y_1 y_2 + 1]. Initially, the vector of unknowns is y = [0, 1]. Let's choose a step size h = 0.05. At each step we compute the vector of approximate values of the two unknown functions from the corresponding vector at the immediately preceding step, by Euler's method (2.3.6). Then we calculate

    [ȳ_1(0.05), ȳ_2(0.05)] = [0, 1] + 0.05 [0 + 0 + 1, 0·1 + 1] = [0.05, 1.05]    (2.3.10)

and

    [ȳ_1(0.10), ȳ_2(0.10)] = [0.05, 1.05] + 0.05 [0.05 + 0.05 + 1.05, 0.05·1.05 + 1] = [0.1075, 1.102625]    (2.3.11)

and so forth.

Let's consider the preparation of a computer program that will carry out the solution, by Euler's method, of a system of N simultaneous equations of the form (2.3.2), which we will rewrite just slightly, in the form

    y_i' = f_i(x, y),   i = 1, 2, ..., N.    (2.3.12)

Note that on the left is just one of the unknown functions, and on the right there may appear all N of them in each equation.

Evidently we will need an array Y of length N to hold the values of the N unknown functions at the current point x. Suppose we have computed the array Y at a point x, and we want to get the new array Y at the point x + h. Exactly what do we do? Some care is necessary in answering this question because there is a bit of a snare in the underbrush. Namely, we might calculate ȳ_1(x + h), then ȳ_2(x + h), then ȳ_3(x + h), etc., in a certain order. The question is this: when we compute ȳ_1(x + h), where shall we put it? If we put it into Y[1], the first position of the Y array in storage, then the previous contents of Y[1] are lost, i.e., the value of ȳ_1(x) is lost. But we aren't finished with ȳ_1(x) yet; it's still needed to compute ȳ_2(x + h), ȳ_3(x + h), etc. This is because the new value ȳ_i(x + h) depends (or might depend) on the old values of all of the unknown functions, including those whose new values have already been computed before we begin the computation of ȳ_i(x + h).

If the point still is murky, go back to (2.3.11) and notice how,
in the calculation of y 2 (0. where shall we put it? If we put it into Y[1].10) we needed to know y 1 (0.2. Let’s consider the preparation of a computer program that will carry out the solution.3.05) and y 1 (0.10) = 0. we might calculate y 1 (x + h). y1 y2 + 1].3 Systems and equations of higher order 31 Now the vector of unknown functions is y = [y1 .05 1. If the point still is murky. and we want to get the new array Y at the point x + h. then y 3 (x + h).102625 (2.05 0.05 + 0. At each step we compute the vector of approximate values of the two unknown functions from the corresponding vector at the immediately preceding step..12) in a certain order. in the form yi = fi (x.3. . i.12) Note that on the left is just one of the unknown functions. and the vector of right-hand sides is f = [x + y1 + y2 . move all entries of YOUT back to YIN. y 3 (x+h).3. .05 ∗ 1.05 1.

Then.n) local i. The two blocks call Euler alternately as the integration proceeds to the right. Our next remark about the program concerns the function subprogram f. The reader might enjoy writing this program. yout:=[seq(evalf(yin[i]+h*f(xin. which calculates the right hand sides of the differential equation. print. One structured data type in Maple is a list of things.y) one step forward by Euler’s method using step size # h. The evalf command ensures that the results of the computation of the components of yout are floating point numbers. Here is a sample of such a program.i)). The brackets [ and ] convert the list into a vector.h. In other words. y) = y1 y2 + 1. leaving the new vector in YOUT. Enter with values at xin in yin.i=1. increase x. The single-step subroutine is shown below. To achieve this saving.yout.yin. where N is the number of equations. must be supplied by the user. we could save N move operations. end. move date from the output array YOUT back to the input array YIN. # Calculates the right-hand sides of the system of differential # equations. y) = x + y1 + y2 and f2 (x..9). Such savings might be significant in an extended calculation. in this case a list of length n since i goes from 1 to n in the seq command. Our last comment about the program to solve systems is that it is perfectly possible to use it in such a way that we would not have to move the contents of the vector YOUT back to the vector YIN at each step.i). in this case. end. advances the solution one step by Euler’s method and prints. In that case we have f1 (x. Supply f as a function subprogram. # This program numerically integrates the system # y’=f(x. The main program would initialize the arrays. Exit with values at xin+h # in yout. and thinking about how to generalize the idea to the situation where the new value of y is computed from two . namely the one that describes the system (2.y. of course. and prints. 
The seq command (for “sequence”) creates such a list.n)].32 The Numerical Solution of Differential Equations The principal building block in this structure would be a subroutine that would advance the solution exactly one step. One block takes the contents of YIN as input. A few remarks about the program are in order. without moving anything. call this subroutine. if i=1 then return(x+y[1]+y[2]) else return(y[1]*y[2]+1) fi.3. advances the solution one step. a list of floating point numbers. etc. This subprogram. we write two blocks of programming in the main program. Eulerstep:=proc(xin. We will use it later on to help us get a solution started when we use methods that need information at more than one point before they can start the integration. leaves the new vector in YIN.yin. return(yout). another block of programming takes the contents of YOUT as input. This translates into the following: f:=proc(x.


previously computed values, rather than from just one (then three blocks of programming would be needed). Now we've discussed the numerical solution of a single differential equation of first order, and of a system of simultaneous differential equations of first order, and there remains the treatment of equations of higher order than the first. Fortunately, this case is very easily reduced to the varieties that we have already studied. For example, suppose we want to solve a single equation of the second order, say

    y'' + x y' + (x + 1) cos y = 2.    (2.3.13)

The strategy is to transform the single second-order equation into a pair of simultaneous first-order equations that can then be handled as before. To do this, choose two unknown functions u and v. The function u is to be the unknown function y in (2.3.13), and the function v is to be the derivative of u. Then u and v satisfy two simultaneous first-order differential equations:

    u' = v
    v' = -x v - (x + 1) cos u + 2    (2.3.14)

and these are exactly of the form (2.3.5) that we have already discussed! The same trick works on a general differential equation of Nth order

    y^(N) + G(x, y, y', y'', ..., y^(N-1)) = 0.    (2.3.15)

We introduce N unknown functions u_0, u_1, ..., u_{N-1}, and let them be the solutions of the system of N simultaneous first-order equations

    u_0' = u_1
    u_1' = u_2
    ...
    u_{N-2}' = u_{N-1}
    u_{N-1}' = -G(x, u_0, u_1, ..., u_{N-1}).    (2.3.16)

The system can now be dealt with as before.

Exercises 2.3

1. Write each of the following as a system of simultaneous first-order initial-value problems in the standard form (2.3.2):
   (a) y'' + x^2 y = 0; y(0) = 1; y'(0) = 0
   (b) u' + x v = 2; v' + e^{uv} = 0; u(0) = 0; v(0) = 0
   (c) u' + x v = 0; v' + x^2 u = 1; u(1) = 1; v(1) = 0
   (d) y^(iv) + 3x y''' + x^2 y'' + 2y' + y = 0; y(0) = y'(0) = y''(0) = y'''(0) = 1
   (e) x'''(t) + t^3 x(t) + y(t) = 0; y''(t) + x(t)^2 = t^3; x(0) = 2; x'(0) = 1; x''(0) = 0; y(0) = 1; y'(0) = 0


2. For each of the parts of problem 1, write the function subprogram that will compute the right-hand sides, as required by the Eulerstep subroutine.

3. For each of the parts of problem 1, assemble and run on the computer the Euler program, together with the relevant function subprogram of problem 2, to print out the solutions for fifty steps of integration, each of size h = 0.03. Begin with x = x_0, the point at which the initial data was given.

4. Reprogram the Eulerstep subroutine, as discussed in the text, to avoid the movement of YOUT back to YIN.

5. Modify your program as necessary (in Maple, take advantage of the plot command) to produce graphical output (graph all of the unknown functions on the same axes). Test your program by running it with Euler as it solves y'' + y = 0 with y(0) = 0, y'(0) = 1, and h = π/60 for 150 steps.

6. Write a program that will compute successive values y_p, y_{p+1}, ... from a difference equation of order p. Do this by storing the y's as they are computed in a circular list, so that it is never necessary to move back the last p computed values before finding the next one. Write your program so that it will work with vectors, so you can solve systems of difference equations as well as single ones.
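The order-reducing substitution of this section is mechanical enough to sketch in a few lines of code. As an illustration (a Python sketch, not from the text, and with initial values that are invented for the demonstration, since equation (2.3.13) is stated without any), here is the second-order equation (2.3.13) rewritten as the first-order system (2.3.14) and advanced by one Euler step:

```python
import math

def f(x, u):
    """Right-hand sides of the system (2.3.14): u[0] plays the role of
    y and u[1] of y', so u0' = u1 and u1' = -x*u1 - (x+1)*cos(u0) + 2."""
    return [u[1], -x * u[1] - (x + 1.0) * math.cos(u[0]) + 2.0]

# One Euler step with hypothetical initial data y(0) = 0, y'(0) = 1.
h = 0.05
u = [0.0, 1.0]
u = [ui + h * fi for ui, fi in zip(u, f(0.0, u))]
print(u)  # [0.05, 1.05]
```

The same pattern, with the list u lengthened to N entries, handles the general reduction (2.3.16) of an Nth-order equation.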

2.4

How to document a program

One of the main themes of our study will be the preparation of programs that not only work, but also are easily readable and useable by other people. The act of communication that must take place before a program can be used by persons other than its author is a difficult one to carry out, and we will return several times to the principles that serve as guides to the preparation of readable software. In this section we discuss further the all-important question of program documentation, already touched upon in section 2.2. Some very nontrivial skills are called for in the creation of good user-oriented program descriptions. One of these is the ability to enter the head of another person, the user, and to relate to the program that you have just written through the user's eyes. It's hard to enter someone else's head. One of the skills that make one person a better teacher than another person is of the same kind: the ability to see the subject matter that is being taught through the eyes of another person, the student. If one can do that, or even make a good try at it, then obviously one will be able much better to deal with the questions that are really concerning the audience. Relatively few actually do this to any great extent, not, I think, because it's an ability that one either has or doesn't have, but because few efforts are made to train this skill and to develop it. We'll try to make our little contribution here.

(A) What does it do?

The first task should be to describe the precise purpose of the program. Put yourself in the place of a potential user who is looking at a particular numerical instance of the problem that needs solving. That user is now thumbing through a book full of program descriptions in the library of a computer center 10,000 miles away, trying to find a program that will do the problem in question. Your description must answer that user's question: does this program solve my problem?

Let's now assume that your program has been written in the form of a subroutine or procedure, rather than as a main program. Then the list of global, or communicating, variables is plainly in view, in the opening statement of the subroutine. As we noted in section 2.2, you should state the purpose of your program using the global variables of the subroutine, in the same sentence.

For one example of the difference that makes, see section 2.2. For another, a linear equation solver might be described by saying "This program solves a system of simultaneous equations. To use it, put the right-hand sides into the vector B, put the coefficients into the matrix A and call the routine. The answer will be returned in X."

We are, however, urging the reader to do it this way: "This program solves the equations AX=B, where A is an N-by-N matrix and B is an N-vector."

Observe that the second description is shorter, only about half as long, and yet more informative. We have found out not only what the program does, but how that function relates to the global variables of the subroutine. This was done by using a judicious sprinkling of symbols in the documentation, along with the words. Don't use only symbols, or only words, but weave them together for maximum information. Notice also that the ambiguous term "right-hand side" that appeared in the first form has been done away with in the second form.
The phrase was ambiguous because exactly what ends up on the right-hand side and what on the left is an accident of how we happen to write the equations, and your audience may not do it the same way you do.

(B) How is it done?

This is usually the easy part of program documentation, because it is not the purpose of this documentation to give a course in mathematics or algorithms or anything else. Hence most of the time a reference to the literature is enough, or, if the method is a standard one, just give its name. Often, though, variations on the standard method have been chosen, and the user must be informed about those:

". . . is solved by Gaussian elimination, using complete pivoting for size. . . "

". . . the input array A is sorted by the Quicksort method (see D.E. Knuth, The Art of Computer Programming, volume 3). . . "

". . . the eigenvalues and vectors are found by the Jacobi method, using Corbató's method of avoiding the search for the largest off-diagonal element (see, for instance, the description in D.R. Wilson, A First Course in Mathematical Software). . . "

". . . is found by the Simplex method, except that Charnes' selection rule (see F.A. Ficken, The Simplex Method) is not programmed. . . "

(C) Describe the global variables

Now it gets hard again. The global variables are the ones through which the subroutine communicates with the user. Generally speaking, the user doesn't care about variables that are entirely local to your subroutine, but is vitally concerned with the communicating variables.

First the user has to know exactly how each of the global variables is related to the problem that is being solved. This calls for a brief verbal description of the variable, and what it has to do with the functioning of the program.

"A[i] is the ith element of the input list that is to be sorted, i = 1..N"

"WHY is set by the subroutine to TRUE unless the return is because of overflow, and then it will be set to FALSE."

"B[i,j] is the coefficient of X[j] in the ith one of the input equations BX=C."

"option is set by the calling program on input. Set it to 0 if the output is to be rounded to the nearest integer, else set it to m if the output is to be rounded to m decimal places (m ≤ 12)."

It is extremely important that each and every global variable of the subroutine should get such a description. Just march through the parentheses in the subroutine or procedure heading, and describe each variable in turn.

Next, the user will need more information about each of the global variables than just its description as above. Also required is the "type" of the variable. Some computer languages force each program to declare the types of their variables right in the opening statement. Others declare types by observing various default rules with exceptions stated. In any case, the program documentation should declare the type of each and every global variable; a little redundancy never hurts. It's easy to declare, along with the types of the variables, their dimensions if they are array variables. For instance we may have a

solver:=proc(A,X,n,ndim,b)

in which the communicating variables have the following types:

A      ndim-by-ndim array of floating point numbers
X      vector of floating point numbers of length n
n      integer
ndim   integer
b      vector of floating point numbers of length n

The best way to announce all of these types and dimensions of global variables to the user is simply to list them, as above, in a table.

Now surely we've finished taking the pulse, blood pressure, etc., of the global variables, haven't we? Well, no, we haven't. There's still more vital data that a user will need to know about these variables. There isn't any standard name like "type" to apply to this information, so let's invent a shorthand for describing it in the documentation of the programs that occur in this book; we'll call it the "role" of the variable. Although some high-level computer languages require type declarations immediately in the opening instruction of a subroutine, none require descriptions of the roles of the variables (well, Pascal requires the VAR declaration, and Maple separates the input variables from the output ones, but both languages allow implicit passing and changing of global variables).

First, for some of the global variables of the subroutine, it may be true that their values at the time the subroutine is called are quite irrelevant to the operation of the subroutine. This would be the case for the output variables, for instance. In other cases, the values at input time would be crucial. The user needs to know which are which. This is particularly true for "implicitly passed" global variables, i.e., variables whose values are used by the subroutine but which do not appear explicitly in the argument list. If the value at input time is important, we'll say that the role of the variable is I; if the value at input time is irrelevant, its role is I'.

Second, the action of a subroutine may change an input variable, so if the user needs to use those quantities again it will be necessary to save them somewhere else before calling the subroutine. It may also happen that certain variables are returned by the subroutine with their values unchanged; in such cases the user may be delighted to hear the good news, since the same storage can then be used for other temporary purposes between calls to the subroutine. In either case, the user needs to be informed: if the value of the variable is changed by the action of the subroutine, we'll say that its role is C, else C'.

Third, it may be that the computation of the value of a certain variable is one of the main purposes of the subroutine. Such variables are the outputs of the program, and the user needs to know which these are (whether they are explicit in the heading or the return statement, or are "implicit"). If the computation of this variable is one of the main purposes of the subroutine, its role is O (as in output), else O'.
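As an illustration of these conventions, here is a sketch in Python (the book's programs are in Maple and Fortran style; the routine, its body, and its docstring layout are our own) of a header that supplies description, type, and role for every communicating variable:

```python
def solve(A, B, N):
    """This program solves the equations AX = B, where A is an N-by-N
    matrix and B is an N-vector.  The solution is returned as the
    N-vector X.

    Name  Type                             Role   Description
    A     N-by-N floating point matrix     ICO'   The coefficient matrix (destroyed)
    B     floating point vector, length N  ICO'   The right-hand sides (destroyed)
    N     integer                          IC'O'  Number of equations and unknowns
    X     floating point vector, length N  I'CO   The computed solution
    """
    # Gaussian elimination with partial pivoting; A and B are overwritten,
    # which is exactly why their role above is C rather than C'.
    for k in range(N):
        p = max(range(k, N), key=lambda r: abs(A[r][k]))   # pivot row
        A[k], A[p] = A[p], A[k]
        B[k], B[p] = B[p], B[k]
        for r in range(k + 1, N):
            m = A[r][k] / A[k][k]
            for j in range(k, N):
                A[r][j] -= m * A[k][j]
            B[r] -= m * B[k]
    X = [0.0] * N
    for k in range(N - 1, -1, -1):                         # back-substitution
        X[k] = (B[k] - sum(A[k][j] * X[j] for j in range(k + 1, N))) / A[k][k]
    return X
```

Note how the one-sentence purpose is phrased in terms of the communicating variables A, B, N and X, in the "AX=B" style urged in part (A).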

In the description of each communicating variable, all three of these should be specified. Thus, a variable X might have role IC'O', or a variable why might be of role I'CO, etc.

Just for one example, refer back to the short program in section 2.2 that searches for the largest element in a row of a matrix. Here is the table of information about its global variables:

Name   Type                    Role    Description
A      floating point matrix   IC'O'   The input matrix
i      integer                 IC'O'   Which row to search
jwin   integer                 I'CO    Column containing largest element

To sum up, the essential features of program documentation are a description of what the program does, phrased in terms of the global variables, a statement of how it gets the job done, and a list of all of the global variables, showing for each one its name, type, dimension (or structure) if any, its role, and a brief verbal description.

Exercises 2.4

1. Write programs that perform each of the jobs stated below. In each case, after testing the program, document it with comments. Give a complete table of information about the global variables in each case.

(a) Find and print all of the prime numbers between M and N.

(b) Find the elements of largest and smallest absolute values in a given linear array (vector), and their positions in the array.

(c) Sort the elements of a given linear array into ascending order of size.

(d) Deal out four bridge hands (13 cards each from a standard 52-card deck – this one is not so easy!).

(e) Solve a quadratic equation (any quadratic equation!).

2.5 The midpoint and trapezoidal rules

Euler's formula is doubtless the simplest numerical integration procedure for differential equations, but the accuracy that can be obtained with it is insufficient for most applications. In this section and those that follow, we want to introduce a whole family of methods for the solution of differential equations, called the linear multistep methods, in which the user can choose the degree of precision that will suffice for the job, and then select a member of the family that will achieve it. Before describing the family in all of its generality, we will produce two more of its members, which illustrate the different sorts of creatures that inhabit the family in question.

Recall that we derived Euler's method by chopping off the Taylor series expansion of the solution after the linear term. To get a more accurate method we could, of course, keep the quadratic term, too. However, that term involves a second derivative, and we want to avoid the calculation of higher derivatives because our differential equations will always be written as first-order systems, so that only the first derivative will be conveniently computable.

We can have greater accuracy without having to calculate higher derivatives if we're willing to allow our numerical integration procedure to involve values of the unknown function and its derivative at more than one point. In other words, in Euler's method, the next value of the unknown function, at x + h, is gotten from the values of y and y' at just one backwards point x. In the more accurate formulas that we will discuss next, the new value of y depends on y and y' at more than one point, for instance, at x and x − h, or at several points.

As a primitive example of this kind, we will now discuss the midpoint rule. We begin once again with the Taylor expansion of the unknown function y(x) about the point xn:

y(xn + h) = y(xn) + h y'(xn) + (h²/2) y''(xn) + (h³/6) y'''(xn) + · · · .   (2.5.1)

Now we rewrite equation (2.5.1) with h replaced by −h to get

y(xn − h) = y(xn) − h y'(xn) + (h²/2) y''(xn) − (h³/6) y'''(xn) + · · ·   (2.5.2)

and then subtract these equations, obtaining

y(xn + h) − y(xn − h) = 2h y'(xn) + (2h³/6) y'''(xn) + · · · .   (2.5.3)

Just as we did in the derivation of Euler's method, we will truncate the right side of (2.5.3) after the first term, ignoring the terms that involve h³, h⁵, etc. Further, let's use yn to denote the computed approximate value of y(xn) (and yn+1 for the approximate y(xn+1), etc.). Then we have

yn+1 − yn−1 = 2h y'n.   (2.5.4)

If, as usual, we are solving the differential equation y' = f(x, y), then finally (2.5.4) takes the form

yn+1 = yn−1 + 2h f(xn, yn)   (2.5.5)

and this is the midpoint rule. The name arises from the fact that the first derivative y'n is being approximated by the slope of the chord that joins the two points (xn−1, yn−1) and (xn+1, yn+1), instead of the chord joining (xn, yn) and (xn+1, yn+1) as in Euler's method.

At first sight it seems that (2.5.5) can be used just like Euler's method, because it is a recurrence formula in which we compute the next value yn+1 from the two previous values yn and yn−1. However, we can't get started with the midpoint rule until we know two consecutive values y0, y1 of the unknown function at two consecutive points x0, x1. Normally a differential equation is given together with just one value of the unknown function, so if we are to use the midpoint rule we'll need to manufacture one more value of y(x) by some other means.

This kind of situation will come up again and again as we look at more accurate methods, because to obtain greater precision without computing higher derivatives we will get the next approximate value of y from a recurrence formula that may involve not just one or two, but several of its predecessors. To get such a formula started we will have to find several starting values in addition to the one that is given in the statement of the initial-value problem.

To get back to the midpoint rule, we can get it started most easily by calculating y1, the approximation to y(x0 + h), from Euler's method, and then switching to the midpoint rule to carry out the rest of the calculation.

Let's do this, for example, with the same differential equation that we used to illustrate Euler's rule. The problem consists of the equation and initial value

y' = 0.5 y,   y(0) = 1.   (2.5.6)

We'll use the same step size h = 0.05 as before. Now to start the midpoint rule we need two consecutive values of y, in this case at x = 0 and x = 0.05. At 0.05 we use the value that Euler's method gives us, namely y1 = 1.025 (see Table 1). It's easy to continue the calculation now from (2.5.5). For instance

y2 = y0 + 2h(0.5 y1) = 1 + 0.1(0.5 × 1.025) = 1.05125

and

y3 = y1 + 2h(0.5 y2) = 1.025 + 0.1(0.5 × 1.05125) = 1.0775625.

In the table below we show, for each x, the value computed from the midpoint rule, from Euler's method, and from the exact solution y(x) = e^{x/2}, so we can compare the two methods. The superior accuracy of the midpoint rule is apparent.

x        Midpoint(x)   Euler(x)     Exact(x)
0.00     1.00000       1.00000      1.00000
0.05     1.02500       1.02500      1.02532
0.10     1.05125       1.05063      1.05127
0.15     1.07756       1.07689      1.07788
0.20     1.10513       1.10381      1.10517
0.25     1.13282       1.13141      1.13315
1.00     1.64847       1.63862      1.64872
2.00     2.71763       2.68506      2.71828
3.00     4.48032       4.39979      4.48169
5.00     12.17743      11.81372     12.18249
10.00    148.31274     139.56389    148.41316

table 2
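The calculation above is easy to mechanize. A sketch in Python (our own illustration; the book's programs are in Maple) that manufactures y1 by one Euler step and then runs the midpoint rule on y' = 0.5y:

```python
from math import exp

def midpoint_solve(f, x0, y0, h, nsteps):
    """Integrate y' = f(x, y) by the midpoint rule yn+1 = yn-1 + 2h f(xn, yn),
    using a single Euler step to manufacture the needed second starting value."""
    xs = [x0, x0 + h]
    ys = [y0, y0 + h * f(x0, y0)]          # y1 from Euler's method
    for n in range(1, nsteps):
        ys.append(ys[n - 1] + 2 * h * f(xs[n], ys[n]))
        xs.append(xs[n] + h)
    return xs, ys

xs, ys = midpoint_solve(lambda x, y: 0.5 * y, 0.0, 1.0, 0.05, 200)
print(round(ys[5], 5), round(exp(0.125), 5))   # midpoint vs exact at x = 0.25
```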

Next, we introduce a third method of numerical integration, the trapezoidal rule. The best way to obtain it is to convert the differential equation that we're trying to solve into an integral equation, and then use the trapezoidal approximation for the integral.

We begin with the differential equation y' = f(x, y), and we integrate both sides from x to x + h, getting

y(x + h) = y(x) + ∫_x^{x+h} f(t, y(t)) dt.   (2.5.8)

Now if we approximate the right-hand side in any way by a weighted sum of values of the integrand at various points we will have found an approximate method for solving our differential equation.

The trapezoidal rule states that for an approximate value of an integral

∫_a^b f(t) dt   (2.5.9)

we can use, instead of the area under the curve between x = a and x = b, the area of the trapezoid whose sides are the x axis, the lines x = a and x = b, and the line through the points (a, f(a)) and (b, f(b)), as shown in Figure 2.1. That area is (1/2)(f(a) + f(b))(b − a).

[Figure 2.1: The trapezoidal rule]

If we apply the trapezoidal rule to the integral that appears in (2.5.8), we obtain

y(xn + h) ≈ y(xn) + (h/2)(f(xn, y(xn)) + f(xn + h, y(xn + h)))   (2.5.10)

in which we have used the "≈" sign rather than the "=" because the right hand side is not exactly equal to the integral that really belongs there, but is only approximately so.

If we use our usual abbreviation yn for the computed approximate value of y(xn), then (2.5.10) becomes

yn+1 = yn + (h/2)(f(xn, yn) + f(xn+1, yn+1)).   (2.5.11)

This is the trapezoidal rule in the form that is useful for differential equations. At first sight, (2.5.11) looks like a recurrence formula from which the next approximate value, yn+1, of the unknown function, can immediately be computed from the previous value, yn. However, this is not the case. Upon closer examination one observes that the next value yn+1 appears not only on the left-hand side, but also on the right (it's hiding in the second f on the right side).

In order to find the value yn+1 it appears that we need to carry out an iterative process. First we would guess yn+1 (guessing yn+1 to be equal to yn wouldn't be all that bad, but we can do better). If we use this guess value on the right side of (2.5.11) then we would be able to calculate the entire right-hand side, and then we could use that calculated value as a new "improved" value of yn+1. Now if the new value agrees with the old sufficiently well the iteration would halt, and we would have found the desired value of yn+1. Otherwise we would use the improved value on the right side just as we previously used the first guess. Then we would have a "more improved" guess, etc.

Fortunately, in actual use, it turns out that one does not actually have to iterate to convergence. If a good enough guess is available for the unknown value, then just one refinement by a single application of the trapezoidal formula is sufficient. This is not the case if a high quality guess is unavailable, and then we will have to iterate the trapezoidal rule to convergence. We will discuss this point in more detail in section 2.9. The pair of formulas, one of which supplies a very good guess to the next value of y, and the other of which refines it to a better guess, is called a predictor-corrector pair, and such pairs form the basis of many of the highly accurate schemes that are used in practice.

As a numerical example, take the differential equation

y' = 2x e^y + 1   (2.5.12)

with the initial value y(0) = 1. If we use h = 0.05, then our first task is to calculate y1, the approximate value of y(0.05). The trapezoidal rule asserts that

y1 = 1 + 0.025(2 + 0.1 e^{y1})   (2.5.13)

and sure enough, the unknown number y1 appears on both sides.

Let's guess y1 = 1. Since this is not a very inspired guess, we use it on the right side of (2.5.13), compute the right side, and obtain y1 = 1.056796. If we use this new guess the same way, the result is y1 = 1.057193. Then we get 1.057196, and since this is in sufficiently close agreement with the previous result, we declare that the iteration has converged. Then we take y1 = 1.057196 for the computed value of the unknown function at x = 0.05, and we go next to x = 0.1 to repeat the same sort of thing to get y2, the computed approximation to y(0.1).

In Table 3 we show the results of using the trapezoidal rule (where we have iterated until two successive guesses are within 10⁻⁴) on our test equation y' = 0.5y, y(0) = 1, as the column Trap(x). For comparison, we show Midpoint(x) and Exact(x).
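The iteration just described is short to program. A sketch in Python (our own names, for illustration) of a single trapezoidal step iterated to convergence:

```python
from math import exp

def trapezoidal_step(f, x, y, h, tol=1e-4):
    """One step of the trapezoidal rule, iterating the corrector
    ynew = y + (h/2)(f(x, y) + f(x + h, ynew)) until two successive
    guesses agree to within tol."""
    ynew = y                       # the uninspired first guess y_{n+1} = y_n
    while True:
        improved = y + 0.5 * h * (f(x, y) + f(x + h, ynew))
        if abs(improved - ynew) < tol:
            return improved
        ynew = improved

# the test equation y' = 2x e^y + 1 of (2.5.12), first step from y(0) = 1
f = lambda x, y: 2 * x * exp(y) + 1
y1 = trapezoidal_step(f, 0.0, 1.0, 0.05)
print(round(y1, 6))   # the same value 1.057196 found by hand above
```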

x        Trap(x)      Midpoint(x)   Exact(x)
0.00     1.00000      1.00000       1.00000
0.05     1.02532      1.02500       1.02532
0.10     1.05127      1.05125       1.05127
0.15     1.07789      1.07756       1.07788
0.20     1.10518      1.10513       1.10517
0.25     1.13316      1.13282       1.13315
1.00     1.64876      1.64847       1.64872
2.00     2.71842      2.71763       2.71828
3.00     4.48203      4.48032       4.48169
5.00     12.18402     12.17743      12.18249
10.00    148.45089    148.31274     148.41316

table 3

2.6 Comparison of the methods

We are now in possession of three methods for the numerical solution of differential equations. They are Euler's method

yn+1 = yn + h y'n,   (2.6.1)

the trapezoidal rule

yn+1 = yn + (h/2)(y'n + y'n+1),   (2.6.2)

and the midpoint rule

yn+1 = yn−1 + 2h y'n.   (2.6.3)

In order to compare the performance of the three techniques it will be helpful to have a standard differential equation on which to test them. The most natural candidate for such an equation is y' = Ay, where A is constant. The reasons for this choice are first that the equation is easy to solve exactly, second that the difference approximations are also relatively easy to solve exactly, so comparison is readily done, third that by varying the sign of A we can study behavior of either growing or shrinking (stable or unstable) solutions, and finally that many problems in nature have solutions that are indeed exponential, at least over the short term, so this is an important class of differential equations. We will, however, write the test equation in a slightly different form for expository reasons, namely as

y' = −y/L,   L > 0,   (2.6.4)

where L is a constant. The most interesting and revealing case is where the true solution is a decaying exponential, so we will assume that L > 0. Further, we will assume that y(0) = 1 is the given initial value.

The exact solution is of course

Exact(x) = e^{−x/L}.   (2.6.5)

Notice that if x increases by L, the solution changes by a factor of e. Hence L, called the relaxation length of the problem, can be conveniently visualized as the distance over which the solution falls by a factor of e.

Now we would like to know how well each of the methods (2.6.1)–(2.6.3) handles the problem (2.6.4). Suppose first that we ask Euler's method to solve the problem. We substitute y' = f(x, y) = −y/L into (2.6.1), and we get

yn+1 = yn + h(−yn/L) = (1 − h/L) yn.   (2.6.6)

Before we solve this recurrence, let's comment on the ratio h/L that appears in it. The ratio h/L measures the step size of the integration in relation to the distance over which the solution changes appreciably. Instinctively, one feels that if the solution is changing rapidly in a certain region, then h will have to be kept small there if good accuracy is to be retained, while if the solution changes only slowly, then h can be larger without sacrificing too much accuracy. Hence, h/L is exactly the thing that one feels should be kept small for a successful numerical solution. Since h/L occurs frequently below, we will denote it with the symbol τ.

Now the solution of the recurrence equation (2.6.6), with the starting value y0 = 1, is obviously

yn = (1 − τ)^n,   n = 0, 1, 2, . . . .   (2.6.7)

Next we study the trapezoidal approximation to the same equation (2.6.4). We substitute y' = f(x, y) = −y/L in (2.6.2) and get

yn+1 = yn + (h/2)(−yn/L − yn+1/L).   (2.6.8)

The unknown yn+1 appears, as usual with this method, on both sides of the equation. However, for the particularly simple equation that we are now studying, there is no difficulty in solving (2.6.8) for yn+1 (without any need for an iterative process), obtaining

yn+1 = ((1 − τ/2)/(1 + τ/2)) yn.   (2.6.9)

Together with the initial value y0 = 1, this implies that

yn = ((1 − τ/2)/(1 + τ/2))^n,   n = 0, 1, 2, . . . .   (2.6.10)
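The per-step factors in (2.6.7) and (2.6.10) can be compared numerically with the exact factor e^{−τ}; a quick Python sketch (our own illustration, with τ chosen small):

```python
from math import exp

tau = 0.01                              # tau = h/L, assumed small compared to 1
exact = exp(-tau)                       # exact per-step factor of e^{-x/L}
euler = 1 - tau                         # per-step factor of Euler's method
trap = (1 - tau / 2) / (1 + tau / 2)    # per-step factor of the trapezoidal rule

# Euler's factor misses e^{-tau} already at the tau^2 term (error ~ tau^2/2),
# while the trapezoidal factor misses only at the tau^3 term (error ~ tau^3/12).
print(abs(euler - exact), abs(trap - exact))
```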

Before we deal with the midpoint rule, let's pause to examine the two methods whose solutions we have just found. One important feature is already apparent. For a given value of h, all three of (a) the exact solution, (b) Euler's solution and (c) the trapezoidal solution are of the form

yn = (constant)^n,   (2.6.11)

in which the three values of "constant" are

(a) e^{−τ}
(b) 1 − τ
(c) (1 − τ/2)/(1 + τ/2).

It follows that to compare the two approximate methods with the "truth," all we have to do is see how close the constants (b) and (c) above are to the true constant (a). If we remember that τ is being thought of as small compared to 1, then we have the power series expansion of e^{−τ},

e^{−τ} = 1 − τ + τ²/2 − τ³/6 + · · · ,   (2.6.12)

to compare with 1 − τ and with the power series expansion of

(1 − τ/2)/(1 + τ/2) = 1 − τ + τ²/2 − τ³/4 + · · · .   (2.6.13)

The comparison is now clear. Both the Euler and the trapezoidal methods yield approximate solutions of the form (constant)^n, where "constant" is near e^{−τ}. The trapezoidal rule does a better job of being near e^{−τ} because its constant agrees with the power series expansion of e^{−τ} through the quadratic term, whereas that of the Euler method agrees only up to the linear term.

Finally we study the nature of the approximation that is provided by the midpoint rule. Instead of facing a first-order difference equation as we did in (2.6.6) for Euler's method and in (2.6.9) for the trapezoidal rule, we have now to contend with a second-order difference equation. We will find that a new and important phenomenon rears its head in this case. The analysis begins just as it did in the previous two cases: We substitute the right-hand side f(x, y) = −y/L for y' in (2.6.3) to get

yn+1 = yn−1 + 2h(−yn/L).   (2.6.14)

Since the equation is linear with constant coefficients, we know to try a solution of the form yn = r^n. This leads to the quadratic equation

r² + 2τr − 1 = 0.   (2.6.15)

Evidently the discriminant of this equation is positive, so its roots are distinct. If we denote these two roots by r+(τ) and r−(τ), then the general solution of the difference equation (2.6.14) is

yn = c (r+(τ))^n + d (r−(τ))^n,   (2.6.16)

where c and d are constants whose values are determined by the initial data y0 and y1. The two roots of the quadratic equation are

r+(τ) = −τ + √(1 + τ²)
r−(τ) = −τ − √(1 + τ²).   (2.6.17)

When τ = 0 the first of these is +1, so when τ is small r+(τ) is near +1, and it is the root that is trying to approximate the exact constant e^{−τ} as well as possible. In fact it does pretty well, because the power series expansion of r+(τ) is

r+(τ) = 1 − τ + τ²/2 − τ⁴/8 + · · · ,   (2.6.18)

so it agrees with e^{−τ} through the quadratic terms. We will see that (r+(τ))^n is a very good approximation to the exact solution. What about r−(τ)? Its Taylor series is

r−(τ) = −1 − τ − τ²/2 + · · · .   (2.6.19)

The bad news is now before us: when τ is a small positive number, the root r−(τ) is larger than 1 in absolute value. This means that the stability criterion of Theorem 1.1 is violated, so we say that the midpoint rule is unstable.

In practical terms, the Euler and trapezoidal approximations were each of the form (constant)^n. This one is a sum of two terms of that kind. The first term on the right of (2.6.16) is very close to the exact solution; the second term, the so-called parasitic solution, is, so to speak, the price that we pay for getting such a good approximation in r+(τ). The second term of (2.6.16) is small compared to the first term when n is small, because the constant d will be small compared with c. We hope that the parasitic term will stay small relative to the first term, so as not to disturb the closeness of the approximation. We will see, however, that it need not be so obliging, and that in fact it might do whatever it can to spoil things.

As n increases, as we move to the right, the first term shrinks to zero, because that's what the exact solution does, while the second term increases steadily in size. Furthermore, since r−(τ) is negative and larger than 1 in absolute value, the second term will alternate in sign as n increases, and grow without bound in magnitude. Eventually the second term will dominate the first.

The instability of the midpoint rule is quite apparent. In Table 4 below we show the result of integrating the problem y' = −y, y(0) = 1 with each of the three methods that we have discussed, using a step size of h = 0.05. To get the midpoint method started, we used the exact value of y(0.05) (i.e., we cheated), and in the trapezoidal rule we iterated to convergence with ε = 10⁻⁴.

x       Euler(x)     Trap(x)      Midpoint(x)   Exact(x)
0.0     1.00000      1.00000      1.00000       1.00000
1.0     0.35849      0.36780      0.36806       0.36788
2.0     0.12851      0.13527      0.13552       0.13534
3.0     0.04607      0.04975      0.05005       0.04979
4.0     0.01652      0.01830      0.01888       0.01832
5.0     0.00592      0.00673      0.00822       0.00674
10.0    0.00004      0.00005      0.21688       0.00005
14.55   3.3 × 10⁻⁷   4.8 × 10⁻⁷   −20.45        4.8 × 10⁻⁷
15.8    9.1 × 10⁻⁸   1.4 × 10⁻⁷   71.48         1.4 × 10⁻⁷

table 4

In addition to the above discussion of accuracy, we summarize here three additional properties of integration methods as they relate to the examples that we have already studied.

First, a numerical integration method might be iterative or noniterative. A method is noniterative if it expresses the next value of the unknown function quite explicitly in terms of values of the function and its derivatives at preceding points. In an iterative method, at each step of the solution process the next value of the unknown is defined implicitly by an equation, which must be solved to obtain the next value of the unknown function. In practice, we may either solve this equation completely by an iteration or do just one step of the iteration, depending on the quality of available estimates for the unknown value.

Second, a method is self-starting if the next value of the unknown function is obtained from the values of the function and its derivatives at exactly one previous point. It is not self-starting if values at more than one backward point are needed to get the next one. In the latter case some other method will have to be used to get the first few computed values.

Third, we can define a numerical method to be stable if, when it is applied to the equation y' = −y/L, where L > 0, then for all sufficiently small positive values of the step size h a stable difference equation results, i.e., the computed solution (neglecting roundoff) remains bounded as n → ∞.

We summarize below the properties of the three methods that we have been studying:

               Iterative   Self-starting   Stable
Euler          No          Yes             Yes
Midpoint       No          No              No
Trapezoidal    Yes         Yes             Yes

Of the three methods, the trapezoidal rule is clearly the best, though for efficient use it needs the help of some other formula to predict the next value of y and thereby avoid lengthy iterations.
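The behavior recorded in table 4 is easy to regenerate. A Python sketch (our own illustration; we "cheat" on the midpoint starting value exactly as the text does):

```python
from math import exp

L, h = 1.0, 0.05          # y' = -y, so tau = h/L = 0.05
nsteps = 300              # integrate out to x = 15

# Midpoint rule, started with the exact value y(0.05) = e^{-h}
y = [1.0, exp(-h)]
for n in range(1, nsteps):
    y.append(y[n - 1] - 2 * h * y[n])

# Trapezoidal rule: yn+1 = yn (1 - tau/2)/(1 + tau/2), solved explicitly
factor = (1 - h / 2) / (1 + h / 2)
trap = [factor ** n for n in range(nsteps + 1)]

# Near x = 10 the parasitic term already dominates the midpoint solution,
# while the trapezoidal solution decays as the exact solution does.
print(y[200], trap[200], exp(-10.0))
```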

2.7 Predictor-corrector methods

The trapezoidal rule differs from the other two that we've looked at in that it does not explicitly tell us what the next value of the unknown function is, but instead gives us an equation that must be solved in order to find it. To do this, we may either solve the equation completely by an iteration or do just one step of the iteration. At first sight this seems like a nuisance, but in fact it is a boon, because it enables us to regulate the step size during the course of a calculation, as we will discuss in section 2.9.

Let's take a look at the process by which we refine a guessed value of y_{n+1} to an improved value, using the trapezoidal formula

    y_{n+1} = y_n + (h/2) (f(x_n, y_n) + f(x_{n+1}, y_{n+1})).    (2.7.1)

Suppose we let y_{n+1}^{(k)} represent some guess to the value of y_{n+1} that satisfies (2.7.1). Then the improved value y_{n+1}^{(k+1)} is computed from

    y_{n+1}^{(k+1)} = y_n + (h/2) (f(x_n, y_n) + f(x_{n+1}, y_{n+1}^{(k)})).    (2.7.2)

We want to find out how rapidly the successive values y_{n+1}^{(k)}, k = 1, 2, ..., approach a limit, if at all. To do this, we rewrite equation (2.7.2), this time replacing k by k - 1, to obtain

    y_{n+1}^{(k)} = y_n + (h/2) (f(x_n, y_n) + f(x_{n+1}, y_{n+1}^{(k-1)})),    (2.7.3)

and then subtract (2.7.3) from (2.7.2) to get

    y_{n+1}^{(k+1)} - y_{n+1}^{(k)} = (h/2) (f(x_{n+1}, y_{n+1}^{(k)}) - f(x_{n+1}, y_{n+1}^{(k-1)})).    (2.7.4)

Next we use the mean-value theorem on the difference of f values on the right-hand side, yielding

    y_{n+1}^{(k+1)} - y_{n+1}^{(k)} = (h/2) (∂f/∂y)(x_{n+1}, η) (y_{n+1}^{(k)} - y_{n+1}^{(k-1)}),    (2.7.5)

where η lies between y_{n+1}^{(k)} and y_{n+1}^{(k-1)}.

From the above we see at once that the difference between two consecutive iterated values of y_{n+1} will be (h/2)(∂f/∂y) times the difference between the previous two iterated values. It follows that the iterative process will converge if h is kept small enough that (h/2)(∂f/∂y) is less than 1 in absolute value. We refer to (h/2)(∂f/∂y) as the local convergence factor of the trapezoidal rule. If the factor is a lot less than 1 (and this can be assured by keeping h small enough), then the convergence will be extremely rapid.

In actual practice, one uses the iterative formula together with another formula (the predictor), whose mission is to provide an intelligent first guess for the iterative method to use.
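As a concrete illustration of the convergence factor (our own sketch, not from the text, with the test equation y' = -y as an assumption), the following Python fragment iterates the corrector (2.7.2) and watches the successive differences shrink:

```python
# Sketch (not from the book): iterate the trapezoidal corrector (2.7.2)
# for the test problem y' = f(x, y) = -y.  Consecutive differences should
# shrink by the local convergence factor |h/2 * df/dy| = h/2 = 0.05 here.

def f(x, y):
    return -y

h = 0.1
xn, yn = 0.0, 1.0      # current mesh point and value, y(0) = 1
guess = yn             # a deliberately crude first guess for y_{n+1}

diffs = []
for k in range(6):
    improved = yn + (h / 2) * (f(xn, yn) + f(xn + h, guess))
    diffs.append(abs(improved - guess))
    guess = improved

ratios = [diffs[k + 1] / diffs[k] for k in range(4)]
print(guess)    # the converged trapezoidal value
print(ratios)   # every ratio is close to 0.05
```

Since f is linear in y here, each ratio equals the convergence factor almost exactly; for a general f the ratios would only approach it.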

If the predictor formula is clever enough, then it will happen that just a single application of the iterative refinement (corrector formula) will be sufficient. For this reason, iteration to full convergence is rarely done in practice.

If we use the trapezoidal rule for a corrector, then a clever predictor is the midpoint rule. The predictor formula will be explicit, or noniterative. The reason for this choice of predictor will become clear if we look at both formulas together with their error terms. We will see in the next section that the error terms are as follows:

    y_{n+1} = y_{n-1} + 2h y'_n + (h^3/3) y'''(X_m)    (2.7.6)
    y_{n+1} = y_n + (h/2)(y'_n + y'_{n+1}) - (h^3/12) y'''(X_t).    (2.7.7)

Now the exact locations of the points X_m and X_t are unknown, but we will assume here that h is small enough that we can regard the two values of y''' that appear as being the same. Under that assumption, the error in the trapezoidal rule is about one fourth as large as, and of opposite sign from, the error in the midpoint method. The midpoint guess is therefore quite "intelligent". The subsequent iterative refinement of that guess needs to reduce the error only by a factor of four, so there would be little point in gilding the iteration any further. As far as the powers of h that appear in the error terms go, we see that the third power occurs in both formulas. We say then that the midpoint predictor and the trapezoidal corrector constitute a matched pair.

Now let y_P denote the midpoint predicted value, y_{n+1}^{(1)} the first refined value, and y_{n+1} the final converged value given by the trapezoidal rule. Then we have

    y_{n+1}^{(1)} = y_n + (h/2)(y'_n + f(x_{n+1}, y_P))    (2.7.8)
    y_{n+1} = y_n + (h/2)(y'_n + f(x_{n+1}, y_{n+1})),

and by subtraction

    y_{n+1}^{(1)} - y_{n+1} = (h/2)(∂f/∂y)(y_P - y_{n+1}).    (2.7.9)

This shows that, however far from the converged value the first guess was, the refined value is (h/2)(∂f/∂y) times closer. Hence if we can keep (h/2)(∂f/∂y) no bigger than about 1/4, then the distance from the first refined value to the converged value will be no larger than the size of the error term in the method, and we won't have to get involved in a long convergence process. The conclusion is that when we are dealing with a matched predictor-corrector pair, we need do only a single refinement of the corrector if the step size is kept moderately small. Here "moderately small" means that the step size times the local value of ∂f/∂y should be small compared to 1.
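The "single refinement suffices" claim is easy to check numerically. Here is a small Python sketch (our own illustration; the test problem y' = -y and the starting point are assumptions), comparing one corrected value with the fully converged one:

```python
# Sketch (not from the book): midpoint prediction plus ONE trapezoidal
# correction, versus the fully converged corrector, for y' = -y.
import math

def f(x, y):
    return -y

h, x0 = 0.05, 0.30
y_nm1 = math.exp(-(x0 - h))    # exact backward values, so that only
y_n = math.exp(-x0)            # the one-step behavior is measured

y_pred = y_nm1 + 2 * h * f(x0, y_n)     # midpoint predictor

# one turn of the corrector (2.7.2)
y_once = y_n + (h / 2) * (f(x0, y_n) + f(x0 + h, y_pred))

# corrector iterated to convergence
y_conv = y_pred
for _ in range(50):
    y_conv = y_n + (h / 2) * (f(x0, y_n) + f(x0 + h, y_conv))

print(abs(y_once - y_conv))                 # about 1e-6: one step away
print(abs(y_conv - math.exp(-(x0 + h))))    # about 7e-6: the error term itself
```

The distance left after a single correction is already smaller than the truncation error of the rule itself, so further iteration really would be gilding.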

2.8 Truncation error and step size

We have so far regarded the step size h as a silent partner, more often than not choosing it to be equal to 0.05, for no particular reason. It is evident, however, that the accuracy of the calculation is strongly affected by the step size. If h is chosen too large, the computed solution may be quite far from the true solution of the differential equation; if too small, then the calculation will become unnecessarily time-consuming, and roundoff errors may build up excessively because of the numerous arithmetic operations that are being carried out.

Speaking in quite general terms: if the true solution of the differential equation is rapidly changing, then we will need small values of h, small compared to the local relaxation length (see p. 44); if the solution changes slowly, then a larger value of h will do.

Frequently in practice we deal with equations whose solutions change very rapidly over part of the range of integration and slowly over another part. Examples of this are provided by the study of the switching on of a complicated process, such as beginning a multi-stage chemical reaction, turning on a piece of electronic equipment, starting a power reactor, and so forth. In such cases there usually are rapid and ephemeral, or "transient," phenomena that occur soon after startup and that disappear quickly. If we want to follow these transients accurately, we may need to choose a very tiny step size. After the transients die out, however, the steady-state solution may be a very quiet, slowly varying or nearly constant function, and then a much larger value of h will be adequate.

If we are going to develop software that will be satisfactory for such problems, then the program will obviously have to choose, and re-choose, its own step size as the calculation proceeds. While following a rapid transient it should use a small mesh size; then it should gradually increase h as the transient fades, use a large step while the solution is steady, decrease it again if further quick changes appear, and so forth, all without operator intervention.

Before we go ahead to discuss methods for achieving this step size control, let's observe that one technique is already available in the material of the previous section. Recall that if we want to, we can implement the trapezoidal rule by first guessing, or predicting, the unknown at the next point by using Euler's formula, and then correcting the guess to complete convergence by iteration. The first guess will be relatively far away from the final converged value if the solution is rapidly varying, but if the solution changes slowly, then the guess will be rather good. It follows that the number of iterations required to produce convergence is one measure of the appropriateness of the current value of the step size: if many iterations are needed, then the step size is too big. Hence one way to get some control on h is to follow a policy of cutting the step size in half whenever more than, say, one or two iterations are necessary. This suggestion is not sufficiently sensitive, however, since it does not allow doubling the step size when only one iteration is needed. Furthermore, it is a very time-consuming approach, since it involves a complete iteration to convergence, when in fact a single turn of the crank is enough if the step size is kept small enough.

If a single correction is enough, though, why would anyone use the cumbersome procedure of guessing and refining (i.e., prediction and correction) as we do in the trapezoidal rule, when many other methods are available that give the next value of the unknown immediately and explicitly? No doubt the question crossed the reader's mind; the answer is now emerging. The discussion above points to the fundamental idea that underlies the automatic control of step size during the integration. That basic idea is precisely that we can estimate the correctness of the step size by watching how well the first guess in our iterative process agrees with the corrected value. It will appear that not only does the disparity between the first prediction and the corrected value help us to control the step size, it actually can give us a quantitative estimate of the local error in the integration, so that if we want to, we can print out the approximate size of the error along with the solution. The correction process itself, when viewed this way, is seen to be a powerful ally of the software user, rather than the "pain in the neck" it seemed to be when we first met it.

Our next job will be to make these rather qualitative remarks into quantitative tools, so we must discuss the estimation of the error that we commit by using a particular difference approximation to a differential equation. The easiest example, as usual, is Euler's method. Indeed, we have already seen the single-step error of this method. The equation in question was

    y(x_n + h) = y(x_n) + h y'(x_n) + (h^2/2) y''(X)    (2.8.1)

where X lies between x_n and x_n + h. In Euler's procedure we drop the third term on the right, the "remainder term," and compute the solution from the rest of the equation, instead of from that equation itself. In doing this we commit a single-step truncation error that is equal to

    E = (h^2/2) y''(X),    x_n < X < x_n + h.    (2.8.2)

This is the single-step truncation error, the error committed in one step of the integration process. It does not tell us how far our computed solution is from the true solution, but only how much error is committed in a single step.

Thus the single-step error of Euler's method is proportional to h^2, so if we cut the step size in half, the local error is reduced to 1/4 of its former value, and if we double h the error is multiplied by about 4. Euler's method is exact (E = 0) if the solution is a polynomial of degree 1 or less (so that y'' = 0). Otherwise we could use (2.8.2) to estimate the error by somehow computing an estimate of y''. For instance, we might differentiate the differential equation y' = f(x, y) once more, and compute y'' directly from the resulting formula. This is usually more trouble than it is worth, however, and we will prefer to estimate E by more indirect methods.

Next we derive the local error of the trapezoidal rule. There are various special methods that might be used to do this, but instead we are going to use a very general method that is capable of dealing with the error terms of almost every integration rule that we intend to study.
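As a quick numerical check (our own sketch; the test problem y' = -y is an assumption), the proportionality of Euler's single-step error to h^2 shows up as a ratio near 4 when h is halved:

```python
# Sketch (not from the book): Euler's single-step error E = (h^2/2) y''(X)
# should shrink by a factor of about 4 when h is halved.
# Test problem: y' = -y with y(0) = 1, exact solution y = exp(-x).
import math

def single_step_error(h):
    y_next = 1.0 + h * (-1.0)          # one Euler step from (0, 1)
    return abs(y_next - math.exp(-h))  # compare with the true solution

e_big = single_step_error(0.1)
e_small = single_step_error(0.05)
print(e_big / e_small)   # close to 4
```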

First, let's look a little more carefully at the capability of the trapezoidal rule. The rule, written in the form

    y_{n+1} - y_n - (h/2)(y'_n + y'_{n+1}) = 0,    (2.8.3)

is of course a recurrence by means of which we propagate the approximate solution to the right. It certainly is not exactly true if y_n denotes the value of the true solution at the point x_n, unless that true solution is very special. How special? Suppose the true solution is y(x) = 1 for all x. Then (2.8.3) would be exactly valid. Suppose y(x) = x. Then, as the reader should check, (2.8.3) is again exactly satisfied. Furthermore, if y(x) = x^2, then a brief calculation reveals that (2.8.3) holds once more. It follows by linearity that the rule is exact on any quadratic polynomial. How long does this continue? Our run of good luck has just expired, because if y(x) = x^3 then (check this) the left side of (2.8.3) is not 0, but is instead -h^3/2. We might say that the trapezoidal rule is exact on 1, x, and x^2, but not x^3, i.e., that it is an integration rule of order two ("order" is an overworked word in differential equations). By way of contrast, it is easy to verify that Euler's method is exact for a linear function, but fails on x^2.

Since the error term for Euler's method in (2.8.2) is of the form const * h^2 * y''(X), it is perhaps reasonable to expect the error term for the trapezoidal rule to look like const * h^3 * y'''(X). Now we have two questions to handle, and they are respectively easy and hard:

(a) If the error term in the trapezoidal rule really is const * h^3 * y'''(X), then what is "const"?

(b) Is it true that the error term is const * h^3 * y'''(X)?

We'll do the easy one first, anticipating that the answer to (b) is affirmative so the effort won't be wasted. If the error term is of the form stated, then the trapezoidal rule can be written as

    y(x + h) - y(x) - (h/2)(y'(x + h) + y'(x)) = c * h^3 * y'''(X)    (2.8.4)

where X lies between x and x + h. To find c all we have to do is substitute y(x) = x^3 into (2.8.4), and we find at once that c = -1/12. The single-step truncation error of the trapezoidal rule would therefore be

    E = -(h^3/12) y'''(X),    x < X < x + h.    (2.8.5)

Now let's see that question (b) has the answer "yes," so that (2.8.5) is really right. To do this we start with a truncated Taylor series with the integral form of the remainder, rather than with the differential form of the remainder. In general the series is

    y(x) = y(0) + x y'(0) + x^2 y''(0)/2! + ··· + x^n y^(n)(0)/n! + R_n(x)    (2.8.6)
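The exactness claims and the value of c are easy to confirm by direct computation. The following Python sketch (our own check, not part of the text) applies the left side of (2.8.4) to the monomials:

```python
# Sketch (our check): the trapezoidal recurrence annihilates 1, x, x^2,
# and applied to y = x^3 it yields c * h^3 * y''' with c = -1/12.
h, x = 0.1, 0.7   # arbitrary step and base point

def L(y, yprime):
    """Left side of (2.8.4): y(x+h) - y(x) - (h/2)(y'(x) + y'(x+h))."""
    return y(x + h) - y(x) - (h / 2) * (yprime(x) + yprime(x + h))

assert abs(L(lambda t: 1.0, lambda t: 0.0)) < 1e-14      # exact on 1
assert abs(L(lambda t: t, lambda t: 1.0)) < 1e-14        # exact on x
assert abs(L(lambda t: t * t, lambda t: 2 * t)) < 1e-12  # exact on x^2

# On x^3 the result is c * h^3 * 6, since y''' = 6 for y = x^3.
c = L(lambda t: t ** 3, lambda t: 3 * t * t) / (6 * h ** 3)
print(c)   # -0.0833... = -1/12
```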

where

    R_n(x) = (1/n!) ∫_0^x (x - s)^n y^(n+1)(s) ds.    (2.8.7)

Indeed, one of the nice ways to prove Taylor's theorem begins with the right-hand side of (2.8.7) and repeatedly integrates by parts, lowering the order of the derivative of y and the power of (x - s) until both reach zero, thereby transforming it into the left-hand side of equation (2.8.6).

In (2.8.6) we choose n = 2, because the trapezoidal rule is exact on polynomials of degree 2, and we write it in the form

    y(x) = P_2(x) + R_2(x)    (2.8.8)

where P_2(x) is the quadratic (in x) polynomial P_2(x) = y(0) + x y'(0) + x^2 y''(0)/2.

Next we define a certain operation, plucked from the blue, that transforms a function y(x) into a new function. We call the operator L, so it is defined by

    L y(x) = y(x + h) - y(x) - (h/2)(y'(x) + y'(x + h)).    (2.8.9)

Now we apply the operator L to both sides of equation (2.8.8), and we notice immediately that L P_2(x) = 0, because the rule is exact on quadratic polynomials (this is why we chose n = 2 in (2.8.6)). Hence we have

    L y(x) = L R_2(x).    (2.8.10)

Notice that we have here a remainder formula for the trapezoidal rule. It isn't in a very satisfactory form yet, so we will now make it a bit more explicit by computing L R_2(x).

First, in the integral expression (2.8.7) for R_2(x) we want to replace the upper limit of the integral by +∞. We can do this by writing

    R_2(x) = (1/2!) ∫_0^∞ H_2(x - s) y'''(s) ds    (2.8.11)

where H_2(t) = t^2 if t > 0 and H_2(t) = 0 if t < 0.

Now if we bear in mind the fact that the operator L acts only on x, and that s is a dummy variable of integration, we find that

    L R_2(x) = (1/2!) ∫_0^∞ L H_2(x - s) y'''(s) ds.    (2.8.12)

Now in (2.8.12) choose x = 0, so that L y(0) is precisely the error of one trapezoidal step across the interval [0, h]. Then if s lies between 0 and h, the terms H_2(-s) and H_2'(-s) vanish, and we find

    L H_2(-s) = (h - s)^2 - (h/2)(2(h - s)) = -s(h - s)

(Caution: do not read past the right-hand side of the first equals sign unless you can verify the correctness of what you see there!), whereas if s > h then L H_2(-s) = 0.

Then (2.8.12) becomes

    L R_2(0) = -(1/2) ∫_0^h s(h - s) y'''(s) ds.    (2.8.13)

This is a much better form for the remainder, but we still have not disposed of the "hard" question (b). To finish it off we need a form of the mean-value theorem of integral calculus, namely

Theorem 2.8.1  If p(x) is nonnegative, and g(x) is continuous, then

    ∫_a^b p(x) g(x) dx = g(X) ∫_a^b p(x) dx    (2.8.14)

where X lies between a and b.

The theorem asserts that a weighted average of the values of a continuous function is itself one of the values of that function. The vital hypothesis is that the "weight" p(x) does not change sign. Now in (2.8.13), the function s(h - s) does not change sign on the s-interval (0, h), so

    L R_2(0) = -(1/2) y'''(X) ∫_0^h s(h - s) ds    (2.8.15)

and if we do the integral we obtain, finally,

Theorem 2.8.2  The trapezoidal rule with remainder term is given by the formula

    y(x_{n+1}) - y(x_n) = (h/2)(y'(x_n) + y'(x_{n+1})) - (h^3/12) y'''(X)    (2.8.16)

where X lies between x_n and x_{n+1}.

The proof of this theorem involved some ideas that carry over almost unchanged to very general kinds of integration rules. Therefore it is important to make sure that you completely understand its derivation.

2.9 Controlling the step size

In equation (2.8.16) we saw that if we can estimate the size of the third derivative during the calculation, then we can estimate the error in the trapezoidal rule as we go along, and modify the step size h if necessary, to keep that error within preassigned bounds. To see how this can be done, we first quote, without proof, the result of a similar derivation for the midpoint rule. It says that

    y(x_{n+1}) - y(x_{n-1}) = 2h y'(x_n) + (h^3/3) y'''(X)    (2.9.1)

where X is between x_{n-1} and x_{n+1}. Thus the midpoint rule is also of second order; its error is, however, about four times as large as that in the trapezoidal formula, and of opposite sign. If the step size were halved, the local error would be cut to one eighth of its former value.

Now suppose we adopt an overall strategy of predicting the value y_{n+1} of the unknown by means of the midpoint rule, and then refining the prediction to convergence with the trapezoidal corrector. We want to estimate the size of the single-step truncation error, using only the following data, both of which are available during the calculation: (a) the initial guess, from the midpoint method, and (b) the converged corrected value, from the trapezoidal rule.

We begin by defining three different kinds of values of the unknown function at the "next" point x_{n+1}. They are:

(i) the quantity p_{n+1}, defined as the predicted value of y(x_{n+1}) obtained from using the midpoint rule, except that the backward values used are not the computed ones, but are the exact ones instead. In symbols,

    p_{n+1} = y(x_{n-1}) + 2h y'(x_n).    (2.9.2)

Of course, p_{n+1} is not available to us during an actual computation.

(ii) the quantity q_{n+1}, the value that we would compute from the trapezoidal corrector if for the backward value we use the exact solution y(x_n) instead of the calculated solution y_n. Thus q_{n+1} satisfies the equation

    q_{n+1} = y(x_n) + (h/2)(f(x_n, y(x_n)) + f(x_{n+1}, q_{n+1})).    (2.9.3)

Again, q_{n+1} is not available during the calculation.

(iii) the quantity y(x_{n+1}), which is the exact solution itself.

Now, from (2.9.1) and (2.9.2) we have at once that

    y(x_{n+1}) = p_{n+1} + (h^3/3) y'''(X).    (2.9.4)

Next, y(x_{n+1}) also satisfies the trapezoidal rule with remainder, which by Theorem 2.8.2 reads

    y(x_{n+1}) = y(x_n) + (h/2)(f(x_n, y(x_n)) + f(x_{n+1}, y(x_{n+1}))) - (h^3/12) y'''(X).    (2.9.5)

Note that the X's in (2.9.4) and (2.9.5) may be different. If we subtract (2.9.3) from (2.9.5), we get

    y(x_{n+1}) - q_{n+1} = (h/2)(f(x_{n+1}, y(x_{n+1})) - f(x_{n+1}, q_{n+1})) - (h^3/12) y'''(X)
                         = (h/2) (∂f/∂y)(x_{n+1}, Y) (y(x_{n+1}) - q_{n+1}) - (h^3/12) y'''(X)    (2.9.6)

where we have used the mean-value theorem, and Y lies between y(x_{n+1}) and q_{n+1}.

Now let's make the working hypothesis that y''' is constant over the range of values of x considered, namely from x_n - h to x_n + h. The y'''(X) in (2.9.4) is thereby decreed to be equal to the y'''(X) in (2.9.6), even though the X's are different. If we look at (2.9.6) under this hypothesis, we observe that y(x_{n+1}) - q_{n+1} appears on both sides of the equation. Hence we will be able to solve for it, with the result that

    y(x_{n+1}) = q_{n+1} - (h^3/12) y''' + terms involving h^4.    (2.9.7)

Now we can eliminate y(x_{n+1}) between (2.9.4) and (2.9.7) and obtain

    q_{n+1} - p_{n+1} = (5/12) h^3 y''' + terms involving h^4.    (2.9.8)

We see that this expresses the unknown, but assumed constant, value of y''' in terms of the difference between the initial prediction and the final converged value of y(x_{n+1}). Now we ignore the "terms involving h^4" in (2.9.8), solve for y''', and then for the estimated single-step truncation error we have

    Error = -(h^3/12) y''' ≈ -(1/12)(12/5)(q_{n+1} - p_{n+1}) = -(1/5)(q_{n+1} - p_{n+1}).    (2.9.9)

The quantity q_{n+1} - p_{n+1} is not itself available during the calculation; as an estimator, however, we can use the computed predicted value and the computed converged value, because these differ from p_{n+1} and q_{n+1} only in that they use computed, rather than exact, backward values of the unknown function. Hence we have here an estimate of the single-step truncation error that we can conveniently compute, print out, or use to control the step size.

The derivation of this formula was of course dependent on the fact that we used the midpoint method for the predictor and the trapezoidal rule for the corrector. If we had used a different pair, the same argument would have worked, provided only that the error terms of the predictor and corrector formulas both involved the same derivative of y, i.e., provided that both formulas were of the same order. Such pairs, "matched pairs" of predictor and corrector formulas, are most useful for carrying out extended calculations in which the local errors are continually monitored and the step size is regulated accordingly.

Let's pause to see how this error estimator would have turned out in the case of a general matched pair of predictor-corrector formulas, instead of just for the midpoint and trapezoidal rule combination. Suppose the predictor formula has an error term

    y_exact - y_predicted = λ h^q y^(q)(X)    (2.9.10)
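Here is a numerical check of the estimator (2.9.9) (our own sketch, using the same test problem y' = -y, h = 0.05, and exact starting values that are used in the table of the text):

```python
# Sketch (our check): compare the running error estimate -(1/5)(q - p)
# from (2.9.9) with the true single-step error, for y' = -y, h = 0.05,
# starting from exact values of the solution.
import math

def f(x, y):
    return -y

h = 0.05
x = 0.05                                 # predict the value at x + h = 0.10
y_prev = 1.0                             # exact y(0.00)
y_cur = math.exp(-0.05)                  # exact y(0.05)

pred = y_prev + 2 * h * f(x, y_cur)      # midpoint predictor, p
corr = pred
for _ in range(30):                      # trapezoidal corrector, iterated: q
    corr = y_cur + (h / 2) * (f(x, y_cur) + f(x + h, corr))

est_err = -(corr - pred) / 5             # the estimate (2.9.9)
true_err = math.exp(-(x + h)) - corr     # actual single-step error
print(est_err, true_err)                 # about 9.8e-6 and 9.4e-6
```

These two values match the leading entries (98 × 10^-7 and 94 × 10^-7) of the table in the text.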

and suppose that the error in the corrector formula is given by

    y_exact - y_corrected = μ h^q y^(q)(X).    (2.9.11)

Then a derivation similar to the one that we have just done will show that the estimator for the single-step error that is available during the progress of the computation is

    Error ≈ (μ/(λ - μ)) (y_corrected - y_predicted).    (2.9.12)

For the midpoint and trapezoidal pair we have λ = 1/3 and μ = -1/12, and (2.9.12) reduces to (2.9.9): we estimate the error by using one fifth of the distance between the first guess and the corrected value.

In the table below we show the result of integrating the differential equation y' = -y with y(0) = 1, using the midpoint and trapezoidal formulas with h = 0.05 as the predictor and corrector, as described above. The calculation was started by (cheating and) using the exact solution at 0.00 and 0.05. The successive columns show x, the predicted value at x, the converged corrected value at x, the single-step error estimated from the approximation (2.9.9), and the actual single-step error obtained by computing

    y(x_{n+1}) - y(x_n) - (h/2)(y'(x_n) + y'(x_{n+1}))    (2.9.13)

using the true solution y(x) = e^{-x}.

      x       Pred(x)     Corr(x)     Errest(x)     Error(x)
    ------   ---------   ---------   -----------   ----------
     0.10    0.904877    0.904828     98 × 10^-7   94 × 10^-7
     0.15    0.860747    0.860690    113 × 10^-7   85 × 10^-7
     0.20    0.818759    0.818705    108 × 10^-7   77 × 10^-7
     0.25    0.778820    0.778768    102 × 10^-7   69 × 10^-7
     0.30    0.740828    0.740780     97 × 10^-7   61 × 10^-7
     0.35    0.704690    0.704644     93 × 10^-7   55 × 10^-7
     0.40    0.670315    0.670271     88 × 10^-7   48 × 10^-7
     0.45    0.637617    0.637575     84 × 10^-7   43 × 10^-7
     0.50    0.606514    0.606474     80 × 10^-7   37 × 10^-7
      .          .           .            .            .
     0.95    0.386694    0.386669     51 × 10^-7    5 × 10^-7
     1.00    0.367831    0.367807     48 × 10^-7    3 × 10^-7

Now that we have a simple device for estimating the single-step truncation error, we can regulate the step size so as to keep the error between preset limits. Suppose we would like to keep the single-step error in the neighborhood of 10^-8. We might then adopt, for instance, 5 × 10^-8 as the upper limit of tolerable error and, say, 10^-9 as the lower limit. Why should we have a lower limit? If the calculation is being done with more precision than necessary, the step size will be smaller than needed, and we will be wasting computer time as well as possibly building up roundoff error.

Now that we have fixed these limits, we should be under no delusion that our computed solution will depart from the true solution by no more than 5 × 10^-8. What we are controlling is the one-step truncation error, which is certainly not the same as the total accumulated error.

With the upper and lower tolerances set, we embark on the computation. Special procedures are needed at the beginning to build up enough computed values so that the predictor-corrector formulas can be used thereafter: since the midpoint method needs two backward values before it can be used, something special will have to be done to get the procedure moving at the start. To get the value of y at the first point x_0 + h beyond the initial point x_0, we might as well use the trapezoidal rule all by itself, unassisted by midpoint, with complete iteration to convergence; since we're going to use the trapezoidal corrector anyway, this costs nothing extra. Now we have two consecutive values of y, and the generic calculation can begin.

On a generic step, the midpoint rule is used to predict the next value of y. This predicted value is also saved for future use in error estimation. The predicted value is then refined by the trapezoidal rule, and the local error is then estimated by calculating one-fifth of the difference between that value and the midpoint guess. If the absolute value of the local error lies between the preset limits 10^-9 and 5 × 10^-8, then we just go on to the next step. This means that we augment x by h, and move back the newer values of the unknown function to the locations that hold the older values (we remember, at any moment, just two past values). The three values of y in question occupy just three memory locations; no arrays are involved.

Next, suppose the local error was too large. Then we must reduce the step size h, say by cutting it in half, print some message that announces the change, and then restart the procedure from the "farthest backward" value of x for which we still have the corresponding y in memory. One reason for this is that we may find out right at the outset that our very first choice of h is too large, and perhaps it may need to be halved, say, three times before the errors are tolerable. Then we would like to restart each time from the same originally given data point x_0, rather than let the computation creep forward a little bit with step sizes that are too large for comfort.

Finally, suppose the local error was too small. Then we double the step size, again print some message that announces the change, and restart, again from the smallest possible value of x.

Now let's apply the philosophy of structured programming to see how the whole thing should be organized. We ask first for the major logical blocks into which the computation is divided. In the present case we see (i) a procedure midpt, which carries out the midpoint rule. Input to this procedure will be x, h, y_{n-1}, y_n. Output from it will be y_{n+1}, computed from the midpoint formula. The leading statement in this routine might be
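The whole policy just described can be sketched in a few lines. The book's own programs are written in Maple; purely as an illustration, here is a Python version of the driver, with every restart taken from the initial point. The function f, the tolerances, and the test problem y' = -y are our assumptions:

```python
# Sketch (not the book's program): step-size-controlled integration of
# y' = f(x, y) by the midpoint predictor / trapezoidal corrector pair.
# Whenever the estimated local error leaves [lo, hi], h is halved or
# doubled and the whole calculation restarts from (x0, y0).
import math

def f(x, y):
    return -y

def trapez(x, y, h, yguess, iterate):
    """Trapezoidal correction(s) for the value at x + h, from yguess."""
    ynew = yguess
    for _ in range(50 if iterate else 1):
        ynew = y + (h / 2) * (f(x, y) + f(x + h, ynew))
    return ynew

def solve(x0, y0, xend, h, lo=1e-9, hi=5e-8):
    while True:                     # each restart re-enters here
        n = round((xend - x0) / h)
        y_old = y0
        y_cur = trapez(x0, y0, h, y0, iterate=True)   # special starting step
        changed = False
        for i in range(1, n):
            x = x0 + i * h
            pred = y_old + 2 * h * f(x, y_cur)               # midpoint predictor
            corr = trapez(x, y_cur, h, pred, iterate=False)  # single correction
            err = abs(corr - pred) / 5                       # estimate (2.9.9)
            if err > hi:
                h, changed = h / 2, True     # error too big: halve, restart
                break
            if err < lo:
                h, changed = h * 2, True     # error too small: double, restart
                break
            y_old, y_cur = y_cur, corr
        if not changed:
            return y_cur, h

y_end, h_final = solve(0.0, 1.0, 1.0, 0.05)
print(y_end, h_final)   # y(1) near exp(-1) = 0.36788; h settles at 0.00625
```

Starting from h = 0.05, the estimated error (about 10^-5) forces three halvings before the run gets past its first steps; a production version would also print the change messages and limit the number of restarts.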

midpt:=proc(x,y0,ym1,h,n)

and its return statement would be return(yp1). One might think that it is scarcely necessary to have a separate subroutine for such a simple calculation, but the spirit of structured programming dictates otherwise. Someday one might want to change from the midpoint predictor to some other predictor, and if the predictor is organized as a subroutine, then it's quite easy to disentangle it from the program and replace it.

Next we see (ii) a procedure trapez, which carries out the trapezoidal rule, where it is the corrector formula. This routine will be called from two or three different places in the main routine: when starting, in a generic step, and when restarting with a new value of h. Operation of this routine is different depending on the circumstances. When starting or restarting, there is no externally supplied guess to help it, so it must find its own way to convergence, with complete iteration, using eps as its convergence criterion. On a generic step of the integration, however, we want it to use the prediction supplied by midpt, and then just do a single correction.

One way to handle this is to use a logical variable start. If trapez is called with start = TRUE, then the routine iterates to convergence, using yin as its own first guess to yout, and supplies a final converged value without looking for any externally computed guess. If start = FALSE, then it takes yguess as an estimate of yout, and uses the trapezoidal rule just once to refine that value as yout. Suppose the first line of trapez is

trapez:=proc(x,yin,yguess,h,start,eps,n)

and its return statement is return(yout).

Each of the subroutines and the main routine should be heavily documented in a self-contained way. That is, the descriptions of trapez, and of midpt, should precisely explain their operation as separate modules, with their own inputs and outputs, quite independently of the way they happen to be used by the main program in this problem. The documentation of a procedure should never make reference to how it is used by a certain calling program, but should describe only how it transforms its own inputs into its own outputs. This is the "modular" approach. It should be possible to unplug a module from this application and use it without change in another.

The combination of these two modules, plus a small main program that calls them as needed, constitutes the whole package. In section 2.11 we'll return to the modules discussed above and display complete computer programs that carry them out. In the meantime, the reader might enjoy trying to write such a complete program now, and comparing it with the one in that section.

In the next section we are going to take an intermission from the study of integration rules in order to discuss an actual physical problem, the flight of a spacecraft to the moon. This problem will need the best methods that we can find for a successful, accurate solution.

[Figure 2.2: 1D Moon Rocket. The earth, of radius R, is centered at x = 0; the moon, of radius r, is at x = D; the rocket is at x = x(t) between them.]

2.10 Case study: Rocket to the moon

Now we have a reasonably powerful apparatus for integration of initial-value problems and systems, including the automatic regulation of step size and built-in error estimation. In order to try out this software on a problem that will use all of its capability, in this section we are going to derive the differential equations that govern the flight of a rocket to the moon, first in a one-dimensional simplified model, and then in two dimensions. It will be very useful to have these equations available for testing various proposed integration methods. Great accuracy will be needed, and the ability to change the step size, both to increase it and to decrease it, will be essential, or else the computation will become intolerably long. The variety of solutions that can be obtained is quite remarkable.

First, we place the center of the earth at the origin of the x-axis, and let R denote the earth's radius. At the point x = D we place the moon, and we let its radius be r. Finally, at a position x = x(t), making its way towards the moon, is our rocket.

We will use Newton's law of gravitation to find the net gravitational force on the rocket, and equate it to the mass of the rocket times its acceleration (Newton's second law of motion). According to Newton's law of gravitation, the gravitational force exerted by one body on another is proportional to the product of their masses and inversely proportional to the square of the distance between them. If we use K for the constant of proportionality, then the force on the rocket due to the earth is

    -K M_E m / x^2    (2.10.1)

whereas the force on the rocket due to the moon's gravity is

    K M_M m / (D - x)^2    (2.10.2)

where M_E, M_M and m are, respectively, the masses of the earth, the moon and the rocket, and x = x(t) is the position of the rocket at time t. The acceleration of the rocket is of course x''(t), and so the assertion that the net force is equal to mass times acceleration takes the form

    m x'' = -K M_E m / x^2 + K M_M m / (D - x)^2.    (2.10.3)

This is a (nasty) differential equation of second order in the unknown function x(t). Note the nonlinear way in which this unknown function appears on the right-hand side.

A second-order differential equation deserves two initial values, and we will oblige. Let's agree that at time t = 0 the rocket was on the surface of the earth, and that it was fired at the moon with a certain initial velocity V. Hence, the initial conditions that go with (2.10.3) are

    x(0) = R,    x'(0) = V.    (2.10.4)

Now, just a quick glance at (2.10.3) shows that m cancels out, and we will cancel it, but not before pointing out the immense significance of that fact: it implies that the motion of the rocket is independent of its mass. Galileo demonstrated that fact to an incredulous world by performing a now-legendary experiment with rocks of different sizes dropping from the Tower of Pisa. At any rate, (2.10.3) now reads as

    x'' = -K M_E / x^2 + K M_M / (D - x)^2.    (2.10.5)

We can make this equation a good bit prettier by changing the units of distance and time from miles and seconds (or whatever) to a set of units that are more natural for the problem, as follows. For our unit of distance we choose R, the radius of the earth. If we divide (2.10.5) through by R, we can write the result as

    (x/R)'' = -(K M_E / R^3) (1 / (x/R)^2) + (K M_M / R^3) (1 / (D/R - x/R)^2).    (2.10.6)

Now instead of the unknown function x(t) we define y(t) = x(t)/R. Then y(t) is the position of the rocket, expressed in earth radii, at time t. Further, the ratio D/R that occurs in (2.10.6) is a dimensionless quantity, whose numerical value is about 60. Hence (2.10.6) has now been transformed to

    y'' = -(K M_E / R^3) (1/y^2) + (K M_M / R^3) (1/(60 - y)^2).    (2.10.7)

Next we tackle the new time units. Since y is now dimensionless, the dimension of the left side of the equation is the reciprocal of the square of a time, and if we look at the first term on the right, we see that the quantity R^3 / (K M_E) is the square of a time. Hence

    T_0 = sqrt(R^3 / (K M_E))    (2.10.8)

is a time. We propose to measure time in units of T_0. Its numerical value is easier to calculate if we change the formula first, so let's do that. Consider a body of mass m on the surface of the earth. Its weight is the magnitude of the force exerted on it by the earth's gravity, namely K M_E m / R^2. Its weight is also equal to m times the acceleration of the body, namely the acceleration due to gravity, usually denoted by g, and having the value 32.2 feet/sec^2, which of course has the same dimension.

8) we find that our time unit is T0 = R . and find T0 is about 13 minutes and 30 seconds.16) .012.10.12) and set τ = 0 we get T0 V V u (0) = . What is the significance of the velocity R/T0 ? We claim that it is.10.. we multiply through equation (2.13) Finally we must translate the initial conditions (2. In the numerator is the velocity with which the rocket is launched. y (2. Perhaps the quickest way to see this is to go back to equation (2. measured in units of the radius of the earth. in units of 13.10.10) We take R = 4000 miles. The substitution of (2. (2. u2 (60 − u)2 (2.10. This equation can be solved. and it becomes 2 T0 (y )2 = 2 y .012 + . Then we will be looking at the differential equation that would govern the motion if the moon were absent. t = τ T0 . if there were no moon.62 It follows that The Numerical Solution of Differential Equations KMe m = mg. To that end.10.7) by T0 and get MM 1 ME 2 T0 y = − 2 + .15) and integration yields 2 T0 (y )2 = 2 + C.4) into conditions on the new variables. i. (2. we will now introduce a new independent variable τ and a new dependent variable u = u(τ ) by the relations u(τ ) = y(τ T0 ) .9) (2.14) = R R/T0 This is a ratio of two velocities. g (2.10.11) y (60 − y)2 The ratio of the mass MM of the moon to the mass ME of the earth is about 0.12) into (2.10. at a time τ that is measured in units of T0 . (2.10. Next. The first condition is easy: u(0) = 1. We propose to measure time in units of T0 .12) Thus.10.11) yields the differential equation for the scaled distance u(τ ) as a function of the scaled time τ in the form u =− 1 0. aside from a numerical factor. R2 and if we substitute into (2.11) and drop the second term on the right-hand side (the one that comes from the moon).10. if we differentiate (2. the escape velocity from the earth. (2. Multiply bth sides by 2y .10.e.10.10.10.5 minutes.10. u(τ ) represents the position of the rocket. Furthermore.

and the converse is easy to show also. (xm − x)2 + (ym − y)2 (2. we see that cos θ = x x2 + y 2 and cos ψ = xm − x . the earth is centered at the origin of the xy-plane. Here.17) Suppose the rocket is launched with sufficient initial velocity to escape from the earth. the differential equation and the initial conditions have the final form u =− 1 0. Let the coordinates of the moon at time t be (xm (t). For example.10. Thus C ≥ 0 or. which approaches the constant C.10.19) u(0) = 1 √ V u (0) = 2 Vesc Since that was all so easy. y(t)) on the way to the moon.17).10. In summary. Hence the right side.22) . the left side is a square.16. We might say that if we choose to measure distance in units of earth radii. If we put the rocket at a generic position (x(t).2.10. Consider the net force on the rocket in the x direction. aside from the 2. then we would have xm = D cos(ωt) and ym (t) = D sin(ωt). Now we can return to (2.10 Case study: Rocket to the moon 2 Now let t = 0 and find that C = T0 V 2 /R2 − 2. Hence let y → ∞ on the right side of (2.10.10. R2 (2.10. then we have the configuration shown in figure 1.10. and the moon is moving.10.012 + 2 u (60 − u)2 (2. and time in units of √ T0 .12) to translate the initial conditions on x (t) into initial √ conditions on u (τ ). In terms of the escape velocity. if the rocket escapes. We shall denote it by Vesc . 145 miles per hour. x2 + y 2 (x − xm )2 + (y − ym )2 (2. equivalently √ R V ≥ 2 . let’s try the two-dimensional case next. √ Hence the quantity 2 R/T0 is the escape velocity from the earth. so 2 T0 (y )2 = 63 2 − y 2 T0 V 2 −2 . must also be nonnegative.2. It is given by Fx = − KME m cos θ KMM m cos ψ + . Then the function y(t) will grow without bound. then (2.20) where the angles θ and ψ are shown in figure 1. Its numerical value is approximately 25. ym (t)).18) T0 Thus. For all values of y. and therefore a nonnegative quantity.21) (2.2. it becomes u (0) = 2 V /Vesc . (2. From that figure.18) is true.16. 
then velocities turn out to be measured in units of escape velocity. if we take the orbit of the moon to be a circle of radius D.

to obtain the differential equation mx (t) = − KME mx KMM m(xm − x) + .3: The 2D Moon Rocket Now we substitute into (2.10)).10.20).25) take the form   u(0) = 1 . at the point (R. y(0) = 0 . and give only the results.23) If we carry out a similar analysis for the y-component of the force on the rocket. It remains to change the units into the same natural dimensions of distance and time that were used in the one-dimensional problem. at time t = 0. our initial conditions are x(0) = R . Further. 0).10.24) We are now looking at two (even nastier) simultaneous differential equations of the second order in the two unknown functions x(t). To go with these equations. v(0) = 0 √ V cos α √ V sin α  u (0) = 2 .10.10. we get my (t) = − KME my KMM m(ym − y) + . the rocket is on the earth’s surface. we need four initial conditions.10. then it turns out the u and v satisfy the differential equations u 0.10. v (0) = 2 . (x2 + y 2 )3/2 ((xm − x)2 +)ym − y)2 )3/2 (2. in a direction that makes an angle α with the positive x-axis. y(t) that describe the position of the rocket.25) The problem has now been completely defined. the initial data (2. y (0) = V sin α (2. If u(τ ) and v(τ ) denote the x and y coordinates of the rocket. (u + v 2 )3/2 ((um − u)2 + (vm − v)2 )3/2 Furthermore. at a time τ measured in units of T0 (see (2.012(vm − v) v =− 2 + .10. 2 + y 2 )3/2 (x ((xm − x)2 + (ym − y)2 )3/2 (2.012(um − u) u =− 2 + (u + v 2 )3/2 ((um − u)2 + (vm − v)2 )3/2 (2. it will be fired with an initial speed of V .27) .10. This time we leave the details to the reader.64 The Numerical Solution of Differential Equations yM (t) Moon v       y(t)  ψ $s $$ rocket $$$ $$$ $ $ z$$ θ Earth x(t) xM (t) Figure 2. Vesc Vesc (2. Thus.26) v 0. x (0) = V cos α . We will suppose that at time t = 0. and equate the force in the x direction to mx (t). measured in units of earth radii.
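As a quick sanity check on these unit conversions (a Python sketch; the inputs R = 4000 miles and g = 32.2 feet/sec^2 are the values used above):

```python
import math

R_ft = 4000 * 5280        # earth radius, in feet
g = 32.2                  # acceleration due to gravity, ft/sec^2
T0 = math.sqrt(R_ft / g)  # the time unit of (2.10.10), in seconds

# sqrt(2)*R/T0 is the escape velocity; convert ft/sec to miles per hour
vesc_mph = math.sqrt(2) * (R_ft / T0) * 3600 / 5280

print(round(T0 / 60, 1), round(vesc_mph))  # -> 13.5 25145
```

Both values agree with the text: T_0 is about 13.5 minutes, and V_esc is about 25,145 miles per hour.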


In these equations, the functions u_m(τ) and v_m(τ) are the x and y coordinates of the moon, in units of R, at the time τ. Just to be specific, let's decree that the moon is in a circular orbit of radius 60R, and that it completes a revolution every twenty-eight days. Then, after a brief session with a hand calculator or a computer, we discover that the equations

u_m(τ) = 60 cos(0.002103745 τ)
v_m(τ) = 60 sin(0.002103745 τ)    (2.10.28)

represent the position of the moon.
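The constant 0.002103745 is just the moon's angular velocity in the scaled units: 2π radians per revolution, with the 28-day period converted into units of T_0 = 13.5 minutes. A one-line check in Python:

```python
import math

# 28 days, expressed in time units of T0 = 13.5 minutes
period_in_T0_units = 28 * 24 * 60 / 13.5
omega = 2 * math.pi / period_in_T0_units   # radians per time unit

print(round(omega, 9))  # -> 0.002103745
```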

2.11 Maple programs for the trapezoidal rule

In this section we will first display a complete set of Maple programs that can numerically solve a system of ordinary differential equations of the first order, together with given initial values. After discussing those programs, we will illustrate their operation by doing the numerical solution of the one-dimensional moon rocket problem. We will employ Euler's method to predict the values of the unknowns at the next point x + h from their values at x, and then we will apply the trapezoidal rule to correct these predicted values until sufficient convergence has occurred. First, here is the program that does the Euler method prediction.
> eulermethod:=proc(yin,x,h,f)
> local yout,ll,i:
> # Given the array yin of unknowns at x, uses Euler method to return
> # the array of values of the unknowns at x+h. The function f(x,y) is
> # the array-valued right hand side of the given system of ODE's.
> ll:=nops(yin):
> yout:=[]:
> for i from 1 to ll do
>   yout:=[op(yout),yin[i]+h*f(x,yin,i)];
> od:
> RETURN(yout):
> end:
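For comparison, the same prediction step in Python (a sketch; the name euler_step is ours, and here f returns the whole array of right-hand sides rather than one component at a time):

```python
def euler_step(f, x, y, h):
    # One Euler prediction step for a first-order system y' = f(x, y):
    # each component advances by h times its derivative at x.
    return [yi + h * fi for yi, fi in zip(y, f(x, y))]

# One step of y' = y from y(0) = 1 with h = 0.1 gives 1.1:
print(euler_step(lambda x, y: [y[0]], 0.0, [1.0], 0.1))  # -> [1.1]
```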

Next, here is the program that, given the array of values of the unknowns at x, first obtains an array of guessed values of the unknowns at x + h and then refines the guess to convergence using the trapezoidal rule.
> traprule:=proc(yin,x,h,eps,f)
> local ynew,yfirst,ll,toofar,yguess,i,allnear,dist;
> # Input is the array yin of values of the unknowns at x. The program
> # first calls eulermethod to obtain the array ynew of guessed values
> # of y at x+h. It then refines the guess repeatedly, using the trapezoidal
> # rule, until the previous guess, yguess, and the refined guess, ynew, agree
> # within a tolerance of eps in all components. Program then computes dist,
> # which is the largest deviation of any component of the final converged
> # solution from the initial Euler method guess. If dist is too large
> # the mesh size h should be decreased; if too small, h should be increased.
> ynew:=eulermethod(yin,x,h,f);
> yfirst:=ynew;
> ll:=nops(yin);
> allnear:=false;
> while(not allnear) do
>   yguess:=ynew;
>   ynew:=[];
>   for i from 1 to ll do
>     ynew:=[op(ynew),yin[i]+(h/2)*(f(x,yin,i)+f(x+h,yguess,i))];
>   od;
>   allnear:=true;
>   for i from 1 to ll do
>     allnear:=allnear and abs(ynew[i]-yguess[i])<eps
>   od:
> od: #end while
> dist:=max(seq(abs(ynew[i]-yfirst[i]),i=1..ll));
> RETURN([dist,ynew]):
> end:
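In Python the corrector loop can be sketched as follows (our function names, not the book's). For a single linear equation y' = -y the converged value can be checked against the closed-form trapezoidal step y·(1 - h/2)/(1 + h/2):

```python
def euler_step(f, x, y, h):
    # Euler prediction of the unknowns at x+h from their values at x.
    return [yi + h * fi for yi, fi in zip(y, f(x, y))]

def trap_step(f, x, y, h, eps):
    # Refine the Euler guess with the trapezoidal rule until two
    # successive guesses agree to within eps in every component.
    ynew = euler_step(f, x, y, h)
    yfirst = ynew[:]
    fx = f(x, y)
    while True:
        yguess = ynew
        fg = f(x + h, yguess)
        ynew = [yi + (h / 2) * (a + b) for yi, a, b in zip(y, fx, fg)]
        if all(abs(a - b) < eps for a, b in zip(ynew, yguess)):
            break
    # dist: how far the converged solution moved from the Euler guess
    dist = max(abs(a - b) for a, b in zip(ynew, yfirst))
    return dist, ynew

dist, ynew = trap_step(lambda x, y: [-y[0]], 0.0, [1.0], 0.1, 1e-10)
print(ynew[0])  # converges to (1 - h/2)/(1 + h/2) = 0.95/1.05
```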

The two programs above each operate at a single point x and seek to compute the unknowns at the next point x + h. Now we need a global view, that is, a program that will call the above repeatedly and increment the value of x until the end of the desired range of x is reached. The global routine also needs to check whether or not the mesh size h should be changed at each point, and to do so when necessary.
> trapglobal:=proc(f,y0,h0,xinit,xfinal,eps,nprint)
> local x,y,y1,h,j,arr,dst,cnt:
> # Finds solution of the ODE system y'=f(x,y), where y is an array
> # and f is array-valued. y0 is initial data array at x=xinit.
> # Halts when x>xfinal. eps is convergence criterion for
> # trapezoidal rule; Prints every nprint-th value that is computed.
> x:=xinit: y:=y0: arr:=[[x,y[1]]]: h:=h0: cnt:=0:
> while x<=xfinal do
>   y1:=traprule(y,x,h,eps,f):
>   y:=y1[2]: dst:=y1[1];
>   # Is dst too large? If so, halve the mesh size h and repeat.
>   while dst>3*eps do
>     h:=h/2;
>     lprint(`At x=`,x,`h was reduced to`,h);
>     y1:=traprule(y,x,h,eps,f):
>     y:=y1[2]: dst:=y1[1];
>   od:
>   # Is dst too small? If so, double the mesh size h and repeat.
>   while dst<.0001*eps do
>     h:=2*h;
>     lprint(`At x=`,x,`h was increased to`,h);
>     y1:=traprule(y,x,h,eps,f):
>     y:=y1[2]: dst:=y1[1];
>   od:
>   # Adjoin newly computed values to the output array arr.
>   x:=x+h;
>   arr:=[op(arr),[x,y[2]]]:
>   # Decide if we should print this line of output or not.
>   cnt:=cnt+1:
>   if cnt mod nprint=0 or x>=xfinal then print(x,y) fi;
> od:
> RETURN(arr);
> end:
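For experimenting outside Maple, the whole package can be sketched compactly in Python (our names, not the book's; the step-size tests dist > 3·eps and dist < .0001·eps mirror trapglobal):

```python
import math

def euler_step(f, x, y, h):
    # Euler prediction of the unknowns at x+h from their values at x.
    return [yi + h * fi for yi, fi in zip(y, f(x, y))]

def trap_step(f, x, y, h, eps):
    # Refine the Euler guess with the trapezoidal rule until successive
    # guesses agree within eps; also report dist, the distance between
    # the converged values and the initial Euler guess.
    ynew = euler_step(f, x, y, h)
    yfirst = ynew[:]
    fx = f(x, y)
    while True:
        yguess = ynew
        fg = f(x + h, yguess)
        ynew = [yi + (h / 2) * (a + b) for yi, a, b in zip(y, fx, fg)]
        if all(abs(a - b) < eps for a, b in zip(ynew, yguess)):
            break
    return max(abs(a - b) for a, b in zip(ynew, yfirst)), ynew

def trap_global(f, y0, h0, xinit, xfinal, eps):
    # March across [xinit, xfinal], adjusting h as trapglobal does.
    x, y, h = xinit, list(y0), h0
    while x <= xfinal:
        dist, ynew = trap_step(f, x, y, h, eps)
        while dist > 3 * eps:        # corrector moved too far: halve h
            h /= 2
            dist, ynew = trap_step(f, x, y, h, eps)
        while dist < 1e-4 * eps:     # corrector barely moved: double h
            h *= 2
            dist, ynew = trap_step(f, x, y, h, eps)
        y = ynew
        x += h
    return x, y

# The cosine system y1' = y2, y2' = -y1 of section 2.11.1 makes a test:
xend, yend = trap_global(lambda x, y: [y[1], -y[0]],
                         [1.0, 0.0], 0.031415927, 0.0, 2 * math.pi, 1e-4)
print(xend, yend[0])   # yend[0] should track cos(xend) to a few digits
```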

The above three programs comprise a general package that can numerically solve systems of ordinary differential equations. The applicability of the package is limited mainly by the fact that Euler's method and the trapezoidal rule are fairly primitive approximations to the truth, and therefore one should not expect dazzling accuracy when these routines are used over long intervals of integration.

2.11.1 Example: Computing the cosine function

We will now give two examples of the operation of the above programs. First let's compute the cosine function. We will numerically integrate the equation y'' + y = 0 with initial conditions y(0) = 1 and y'(0) = 0 over the range from x = 0 to x = 2π. To do this we use two unknown functions y_1, y_2, which are subject to the equations y_1' = y_2 and y_2' = -y_1, together with initial values y_1(0) = 1, y_2(0) = 0. Then the function y_1(x) will be the cosine function. To use the routines above, we need to program the function f = f(x, y) that gives the right hand sides of the input ODE's. This is done as follows:

> f:=proc(x,u,j)
> if j=1 then RETURN(u[2]) else RETURN(-u[1]) fi:
> end:

That's all. To run the programs we type the line

> trapglobal(f,[1,0],.031415927,0,6.3,.0001,50):

This means that we want to solve the system whose right hand sides are as given by f, with initial condition array [1, 0]. The initial choice of the mesh size h is π/100, in order to facilitate the comparison of the values that will be output with those of the cosine function at the same points. The range of x over which we are asking for the solution is from x = 0 to x = 6.3. Our convergence criterion eps is set to .0001, and we are asking the program to print every 50th line of output that it computes. Maple responds to this call with the following output.
At x= 0 h was reduced to .1570796350e-1
.7853981750, [ .6845520546, -.7289635418]
1.570796368, [-.0313962454, -.9995064533]
2.356194568, [-.7289550591, -.6845604434]
3.141592768, [-.9995063863,  .0313843153]
3.926990968, [-.6845688318,  .7289465759]
4.712389168, [ .0313723853,  .9995063195]
5.497787368, [ .7289380928,  .6845772205]
6.283185568, [ .9995062524, -.0313604551]

We observe that the program first decided that the value of h that we gave it was too big, and it was cut in half. Next we see that the accuracy is pretty good: at x = π we have 3 or 4 correct digits, and we still have them at x = 2π. The values of the negative of the sine function, which are displayed above as the second column of unknowns, are less accurate at those points. To improve the accuracy we might try reducing the error tolerance eps, but realistically we will have to confess that the major source of imprecision lies in the Euler and trapezoidal combination itself, which, although it provides a good introduction to the philosophy of these methods, is too crude to yield results of great accuracy over a long range of integration.

2.11.2 Example: The moon rocket in one dimension

As a second example we will run the moon rocket in one dimension. The equations that we're solving now are given by (2.10.19), so all we need to do is to program the right hand sides f, which we do as follows:

> f:=proc(x,u,j)
> if j=1 then RETURN(u[2]) else RETURN(-1/u[1]^2+.012/(60-u[1])^2) fi:
> end:

Then we invoke our routines by the statement

> trapglobal(f,[1,1.4142],.02,0,250,.0001,1000):

This means that we are solving our system with initial data (1, √2), i.e., that the rocket is launched at exactly the escape velocity, with an initial mesh size of h = .02, integrating over the range of time from t = 0 until t = 250, with a convergence tolerance eps of .0001, and printing the output every 1000 lines. We inserted an extra line of program also, so as to halt the calculation as soon as y[1], the distance from earth, reached 59.75, which is the surface of the moon. Maple responded with the following output.

At x= 0 h was reduced to .1000000000e-1
[The output continues with a line every 10 time units, each showing the scaled time and the pair [distance, velocity]: the distance climbs steadily toward the moon while the velocity falls from about .50 to near .20 and then rises again under the moon's pull, together with lprint messages recording each further change of h (it was increased after about 174 and again after about 183 time units, and reduced again shortly before impact).]
211.8100000, [59.75560109, .3622060968]

So the trip was somewhat eventful. The mesh size was reduced to .01 right away. It was increased again after 174 time units, because at that time the rocket was moving quite slowly, and again after 183 time units for the same reason; it was reduced once more near the end of the journey, as the rocket picked up speed under the pull of the moon. Impact on the surface of the moon occurred at time 211.81. Since each time unit corresponds to 13.5 minutes, this means that the whole trip took about 47.65 hours – nearly two days.

The reader might enjoy experimenting with this situation a little bit. For instance, if we reduce the initial velocity from 1.4142 by a small amount, then the rocket will reach some maximum distance from Earth and will fall back to the ground without ever having reached the moon.

2.12 The big leagues

The three integration rules that we have studied are able to handle small-to-medium sized problems in reasonable time, and with good accuracy. Some problems, however, are more demanding than that. Our two-dimensional moon shot is an example of such a situation, where even a good method like the trapezoidal rule is unable to give the pinpoint accuracy that is needed. In this section we discuss a general family of methods, of which all three of the rules that we have studied are examples, that includes methods of arbitrarily high accuracy. These are the linear multistep methods. A general linear multistep method is of the form

y_{n+1} = Σ_{i=0}^{p} a_{-i} y_{n-i} + h Σ_{i=-1}^{p} b_{-i} y'_{n-i}.    (2.12.1)

A total of p + 2 points are involved in the formula, and so we can call (2.12.1) a (p+2)-point formula. In order to compute the next value of y, namely y_{n+1}, we need to store p + 1 backwards values of y and p + 1 backwards values of y'. If b_1 is nonzero, then y_{n+1} appears implicitly on the right side of (2.12.1) as well as on the left, and the formula does not explicitly tell us the value of y_{n+1}; such a formula is implicit, or iterative. Otherwise, if b_1 = 0, the formula is explicit, or noniterative. We can recognize an explicit formula by looking at b_1. In either case, if p > 0 the formula will need help getting started or restarted, whereas if p = 0 it is self-starting. For instance, Euler's method has p = 0, a_0 = 1, b_1 = 0, b_0 = 1. The trapezoidal rule, for example, arises by taking p = 0, a_0 = 1, b_1 = 1/2, b_0 = 1/2, whereas for the midpoint rule we have p = 1, a_0 = 0, a_{-1} = 1, b_1 = 0, b_0 = 2, b_{-1} = 0.
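The same trip can be re-flown with a short self-contained Python sketch of the Euler/trapezoidal predictor-corrector (our function names, not the book's). With the initial data and tolerance used above, it too reaches the lunar surface at a scaled time near 212:

```python
def euler_step(f, x, y, h):
    return [yi + h * fi for yi, fi in zip(y, f(x, y))]

def trap_step(f, x, y, h, eps):
    # Euler prediction refined by the trapezoidal corrector.
    ynew = euler_step(f, x, y, h)
    yfirst = ynew[:]
    fx = f(x, y)
    while True:
        yguess = ynew
        fg = f(x + h, yguess)
        ynew = [yi + (h / 2) * (a + b) for yi, a, b in zip(y, fx, fg)]
        if all(abs(a - b) < eps for a, b in zip(ynew, yguess)):
            break
    return max(abs(a - b) for a, b in zip(ynew, yfirst)), ynew

def rocket(x, u):
    # The scaled 1-D moon-rocket equations (2.10.19), first-order form.
    return [u[1], -1.0 / u[0] ** 2 + 0.012 / (60.0 - u[0]) ** 2]

x, y, h, eps = 0.0, [1.0, 1.4142], 0.02, 1e-4
while y[0] < 59.75:                 # halt at the lunar surface
    dist, ynew = trap_step(rocket, x, y, h, eps)
    while dist > 3 * eps:           # step too coarse: halve h
        h /= 2
        dist, ynew = trap_step(rocket, x, y, h, eps)
    while dist < 1e-4 * eps:        # step too fine: double h
        h *= 2
        dist, ynew = trap_step(rocket, x, y, h, eps)
    y = ynew
    x += h
print(round(x, 1))   # arrival time, in units of 13.5 minutes
```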

The general linear multistep formula (2.12.1) contains 2p + 3 constants a_0, ..., a_{-p} and b_1, b_0, ..., b_{-p}. These constants are chosen to give the method various properties that we may deem to be desirable in a particular application. For instance, if we value explicit formulas, then we may set b_1 = 0 immediately, thereby using one of the 2p + 3 free parameters, and leaving 2p + 2 others. One might think that the remaining parameters should be chosen so as to give the highest possible accuracy. However, an important theorem of the subject, due to Dahlquist, states roughly that we cannot use more than about half of the 2p + 3 "degrees of freedom" to achieve high accuracy if we want to have a stable formula (and we do!). Indeed, if we demand "too much accuracy" for a fixed p, it will turn out that no stable formulas exist. In some sense, then, for a fixed p, the more accuracy we demand, the more we come into conflict with stability.

First let's discuss the conditions of accuracy. These are usually handled by asking that the equation (2.12.1) should be exactly true if the unknown function y happens to be a polynomial of low enough degree. For instance, suppose y(x) = 1 for all x. Substitute y_k = 1 and y'_k = 0 for all k into (2.12.1), and there follows the condition

Σ_{i=0}^{p} a_{-i} = 1.    (2.12.2)

The reader should check that in all three of the methods we have been studying, the sum of the a's is indeed equal to 1.

Now suppose we want our multistep formula to be exact not only when y(x) = 1, but also when y(x) = x. We substitute y_k = kh, y'_k = 1 for all k into (2.12.1), and obtain, after some simplification,

- Σ_{i=0}^{p} i a_{-i} + Σ_{i=-1}^{p} b_{-i} = 1.    (2.12.3)

The reader should check that this condition is also satisfied by all three of the methods we have studied.

In general, let's find the condition for the formula to integrate the function y(x) = x^r, and all lower powers of x, exactly, for some fixed value of r. We substitute y_k = (kh)^r and y'_k = r(kh)^{r-1} into (2.12.1). After cancelling the factor h^r, we get

(n+1)^r = Σ_{i=0}^{p} a_{-i}(n-i)^r + r Σ_{i=-1}^{p} b_{-i}(n-i)^{r-1}.    (2.12.4)

Now clearly we do x^r and all lower powers of x exactly if and only if we do (x+c)^r and all lower powers of x exactly. The conditions are therefore translation invariant, so we can choose one special value of n in (2.12.4) if we want.
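Conditions (2.12.2) and (2.12.3) are easy to verify mechanically for the three rules we have studied; here is a sketch in Python, with the coefficients written in the notation of (2.12.1):

```python
from fractions import Fraction as F

# Coefficient lists (a_0..a_{-p}; then b_1, b_0..b_{-p}) of the three
# rules, in the notation of (2.12.1):
methods = {
    "Euler":       ([F(1)],       [F(0), F(1)]),
    "trapezoidal": ([F(1)],       [F(1, 2), F(1, 2)]),
    "midpoint":    ([F(0), F(1)], [F(0), F(2), F(0)]),
}
for name, (a, b) in methods.items():
    cond2 = sum(a)                                          # (2.12.2)
    cond3 = -sum(F(i) * ai for i, ai in enumerate(a)) + sum(b)  # (2.12.3)
    print(name, cond2, cond3)   # -> each line ends with "1 1"
```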

Let's choose n = 0, because the result simplifies then to

(-1)^r = Σ_{i=0}^{p} i^r a_{-i} - r Σ_{i=-1}^{p} i^{r-1} b_{-i}.    (2.12.5)

A small technical remark here is that when r = 1 and i = 0 we see a 0^0 in the second sum on the right. This should be interpreted as 1.

By the order (of accuracy) of a linear multistep method we mean the highest power of x that the formula integrates exactly, that is, the largest number r for which (2.12.5) is true (together with its analogues for all numbers between 0 and r). The reader should verify that the trapezoidal rule is the most accurate of all possible two-point formulas, and should search for the most accurate of all possible three-point formulas (ignoring the stability question altogether). The accuracy conditions enable us to take the role of designer, and to construct accurate formulas with desirable characteristics.

Now we must discuss the question of stability. Just as in section 2.6, stability will be judged with respect to the performance of our multistep formula on a particular differential equation, namely

y' = -y/L    (L > 0),    (2.12.6)

with y(0) = 1, whose solution is a falling exponential. To see how well the general formula (2.12.1) does with this differential equation, substitute y'_k = -y_k/L for all k in (2.12.1), to obtain

(1 + τ b_1) y_{n+1} - Σ_{i=0}^{p} (a_{-i} - τ b_{-i}) y_{n-i} = 0,    (2.12.7)

where, as in section 2.6, we have written τ = h/L, the ratio of the step size to the relaxation length of the problem.

Equation (2.12.7) is a linear difference equation with constant coefficients of order p + 1. If, as usual with such equations, we look for solutions of the form y_n = α^n, then after substitution and cancellation we obtain the characteristic equation

(1 + τ b_1) α^{p+1} - Σ_{i=0}^{p} (a_{-i} - τ b_{-i}) α^{p-i} = 0.    (2.12.8)

This is a polynomial equation of degree p + 1, so it has p + 1 roots somewhere in the complex plane. These roots depend on the value of τ, or equivalently, on the step size that we use to do the integration. We can't expect that these roots will have absolute value less than 1 however large we choose the step size h. All we can hope for is that it should be possible, by choosing h small enough, to get all of these roots to lie inside the unit disk in the complex plane. In fact, if when h = 0 some root has absolute value strictly greater than 1, then there is no hope at all for the formula,

because for all sufficiently small h there will be a root outside the unit disk, and the method will be unstable.

Just for openers, let's see where the roots are when h = 0. Then τ = 0 also, so the polynomial equation (2.12.8) becomes simply

α^{p+1} - Σ_{i=0}^{p} a_{-i} α^{p-i} = 0.    (2.12.9)

Since its coefficients don't depend on the step size, the coefficients, and therefore the roots, depend only on the a's that we use. One of the roots of (2.12.9) is always α = 1, because condition (2.12.2) is always satisfied in practice: all it asks is that we correctly integrate the equation y' = 0, which is surely not an excessive request. The one root of (2.12.9) that is = 1 is our good friend, and we'll call it the principal root. Hence, our first necessary condition for the stability of the formula (2.12.1) is that the polynomial equation (2.12.9) should have no roots of absolute value greater than 1. The reader should now check the three methods that we have been using to see if this condition is satisfied.

Next we have to study what happens to the roots when h is small and positive. Let α_1(h) denote the value of the principal root at some small value of h. Then α_1(0) = 1. As h increases slightly away from 0, the principal root moves slightly away from 1, and it turns out that it tries as hard as it can to be a good approximation to e^{-τ}. Qualitatively, here's what happens. For small positive h, the portion of the solution of the difference equation (2.12.7) that comes from the one root α_1(h) is

α_1(h)^n = ("nearly" e^{-τ})^n = "nearly" e^{-nτ} = "nearly" e^{-nh/L}.    (2.12.10)

But e^{-nh/L} is the exact solution of the differential equation (2.12.6) that we're trying to solve. Therefore the principal root is trying to give us a very accurate approximation to the exact solution. This high quality is bought at the price of using a (p+2)-point multistep formula, and this forces the characteristic equation to be of degree p + 1; hence the remaining roots have to be there.

Well then, what are all of the other p roots of the difference equation (2.12.7) doing for us? The answer is that they are trying as hard as they can to make our lives difficult. People have invented various names for these other, non-principal, roots of the difference equation. One that is printable is "parasitic," so we'll call them the parasitic roots. We would be happiest if they would all be zero, but failing that, we would like them to be as small as possible in absolute value.
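For a concrete look at principal and parasitic roots, take the midpoint rule: its characteristic equation (2.12.8) is α^2 + 2τα - 1 = 0, whose roots are -τ ± √(τ^2 + 1). A quick numerical check (a Python sketch):

```python
import math

# Characteristic equation (2.12.8) for the midpoint rule
# (p = 1, a0 = 0, a_{-1} = 1, b0 = 2, b1 = b_{-1} = 0):
#     alpha^2 + 2*tau*alpha - 1 = 0
tau = 0.1
principal = -tau + math.sqrt(tau * tau + 1)
parasitic = -tau - math.sqrt(tau * tau + 1)

print(round(principal, 6), round(math.exp(-tau), 6))  # -> 0.904988 0.904837
print(abs(parasitic) > 1)                             # -> True
```

The principal root matches e^{-τ} through terms in τ^2, while the parasitic root has absolute value about 1 + τ > 1, which is the source of the slowly growing oscillating error of the midpoint rule.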

A picture of a good multistep method in action, with some small value of h, shows one root of the characteristic equation near 1 (more precisely, very near e^{-τ}), and all of the other roots near the origin. Thus the careful determination of the a's and the b's, so as to make the formula as accurate as possible, results in the principal root being "nearly" e^{-τ}. Equal care must be taken to assure that the parasitic roots stay small.

Now let's try to make this picture a bit more quantitative. We return to the polynomial equation (2.12.8), and attempt to find a power series expansion for the principal root α_1(τ) in powers of τ. The expansion will be in the form

α_1(τ) = 1 + q_1 τ + q_2 τ^2 + ···    (2.12.11)

where the q's are to be determined. If we substitute (2.12.11) into (2.12.8), we get

(1 + τ b_1)(1 + q_1 τ + q_2 τ^2 + ···)^{p+1} - Σ_{i=0}^{p} (a_{-i} - τ b_{-i})(1 + q_1 τ + q_2 τ^2 + ···)^{p-i} = 0.    (2.12.12)

Now we equate the coefficient of each power of τ to zero. First, the coefficient of the zeroth power of τ is

1 - Σ_{i=0}^{p} a_{-i},    (2.12.13)

and, according to (2.12.2), this is indeed zero if our multistep formula correctly integrates a constant function. Second, the coefficient of τ is

b_1 + (p+1)q_1 - Σ_{i=0}^{p} (a_{-i}(p-i)q_1 - b_{-i}) = 0.    (2.12.14)

However, if we use (2.12.2) and (2.12.3), this simplifies instantly to the simple statement that q_1 = -1. So far, then, we have shown by direct calculation that if our multistep formula is of order at least 1 (i.e., if (2.12.4) holds for r = 0 and r = 1), then the expansion (2.12.11) of the principal root agrees with the expansion of e^{-τ} through terms of first degree in τ.

Much more is true, although we will not prove it here: the expansion of the principal root agrees with the expansion of e^{-τ} through terms of degree equal to the order of the multistep method under consideration. The proof can be done just by continuing the argument that we began above, but we omit the details.

We illustrate these ideas with a new multistep formula:

y_{n+1} = y_{n-1} + (h/3)(y'_{n+1} + 4y'_n + y'_{n-1}).    (2.12.15)

This is a three-point method (p = 1), and it happens to be very accurate. It is iterative, because b_1 = 1/3 ≠ 0. The reader should check that the accuracy conditions (2.12.5) are satisfied for r = 0, 1, 2, 3, 4. The method is of fourth order, and in fact its error term can be found, by the method of section 2.8, to be -h^5 y^{(v)}(X)/90, where X lies between x_{n-1} and x_{n+1}.
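Anticipating the stability analysis that follows, we can already look at this formula's characteristic equation numerically: by (2.12.8) it is the quadratic (1 + τ/3)α^2 + (4τ/3)α - (1 - τ/3) = 0. A Python sketch at τ = 0.1:

```python
import math

tau = 0.1
# (1 + tau/3)*a^2 + (4*tau/3)*a - (1 - tau/3) = 0, solved directly
A, B, C = 1 + tau / 3, 4 * tau / 3, -(1 - tau / 3)
d = math.sqrt(B * B - 4 * A * C)
principal = (-B + d) / (2 * A)
parasitic = (-B - d) / (2 * A)

print(abs(principal - math.exp(-tau)) < 1e-5)   # -> True  (4th-order accurate)
print(abs(parasitic) > 1)                       # -> True  (trouble!)
```

The principal root is an excellent approximation to e^{-τ}, but the other root already has absolute value greater than 1, close to 1 + τ/3.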

Now we examine the stability of this method. First, when h = 0 the equation (2.12.9) that determines the roots is just α^2 - 1 = 0, so the roots are +1 and -1. The root at +1 is the friendly one, ready to develop into the exponential series. The root at -1 is to be regarded with apprehension, because it is poised on the brink of causing trouble: if, as h grows to a small positive value, this root grows in absolute value, then its powers will dwarf the powers of the principal root in the numerical solution, and all accuracy will eventually be lost. To see if this happens, let's substitute a power series

α_2(h) = -1 + q_1 τ + q_2 τ^2 + ···    (2.12.16)

into the characteristic equation (2.12.8), which in the present case is just the quadratic equation

(1 + τ/3) α^2 + (4τ/3) α - (1 - τ/3) = 0.    (2.12.17)

After substituting, we quickly find that q_1 = -1/3. Hence for small τ the root acts like -1 - τ/3, so for all small positive values of τ this root lies outside of the unit disk. The method will therefore be unstable, and our apprehension was fully warranted.

In the next section we are going to describe a family of multistep methods, called the Adams methods, that are stable, and that offer whatever accuracy one might want, if one is willing to save enough backwards values of the y's. The secret weapon of the Adams formulas is that when h = 0, one of the roots (the friendly one) is as usual sitting at 1, and all of the unfriendly roots are huddled together at the origin, just as far out of trouble as they can get. Furthermore, these methods will need some assistance in getting started, so we will have to develop matched formulas that will provide them with starting values of the right accuracy.

2.13 Lagrange and Adams formulas

Our next job is to develop formulas that can give us as much accuracy as we want in a numerical solution of a differential equation. This means that we want methods in which the formulas span a number of points, i.e., in which the next value of y is obtained from several backward values, instead of from just one or two as in the methods that we have already studied. All of these jobs can be done with the aid of a formula, due to Lagrange, that we'll need in several parts of the sequel: the Lagrange interpolation formula, whose mission in life is to fit a polynomial to a given set of data points. First we will develop this very general tool, and following that we discuss the Adams formulas. So let's begin with a little example.

yn ).13. 2. (2. (2. n.1) (7 − 1)(7 − 3) Let’s check that this is really the right answer.13.3) Consider the product of the denominators above. First of all. . (i − j)h (2. . (xn .13. and the first term has the vlaue 17. . xi − x j (2. we get  n  n j=1 j=i P (x) = i=1 yi   x − jh  .2. y1 ).2) This is the Lagrange interpolation formula. If we want to find the polynomial of degree n − 1 that passes through n given data points (x1 . Now we can jump from this little example all the way to the general formula.13. Finally we replace x by xh and substitute in (2. (7. the first and third terms are zero and the second is 10.13 Lagrange and Adams formulas 75 Problem: Through the three points (1. If we substitute in (2.13. we might as well suppose that xj = jh for j = 1. −5) there passes a unique quadratic polynomial.4)  n j=1 j=i P (xh) = i=1 (−1)n−i (i − 1)!(n − i)! yi   (x − j)  (2. xn are chosen to be equally spaced. the second and third terms vanish.13. . To be specific. Does it pass through the three given points? When x = 1. . Find it. (3. etc. (x2 .13. y2 ). namely of all those factors except for the one in which j = i. then all we have to do is to write it out:  n  n j=1 j=i P (x) = i=1 yi   x − xj  .5) and that is about as simple as we can get it. . x2 . . .13. since each term is. .2). Next. because of the factor x − 1. consider the special case of the above formula in which the points x1 . It is (i − 1)(i − 2) · · · (1)(−1)(−2) · · · (i − n)hn−1 = (−1)n−i (i − 1)!(n − i)!hn−1 . Solution: P (x) = 17 (x − 3)(x − 7) (1 − 3)(1 − 7) + 10 (x − 1)(x − 7) (3 − 1)(3 − 7) + (−5) (x − 1)(x − 3) . 10). .1) is a quadratic polynomial. . 17). Similarly when x = 3. as is obvious from the cancellation that takes place. . In the ith term of the sum above is the product of n − 1 factors (x − xj )/(xi − xj ).3) to obtain  n (2. .

all we have to do is specify p. (5y 12 n+1 (2. we show the first four of them below.6) into  p+1  p+1 j=1 j=i yp+1 = yp + h i=1 (−1)p−i+1 (i − 1)!(p − i + 1)! p+1 y (ih)j p   (x − j) dx.13.6) Instead of integrating the exact function y in this formula.7) We can rewrite this in the more customary form of a linear multistep method: p−1 yn+1 = yn + h i=−1 b−i yn−i .13. . . 2h. .  (2. so the numbers b−i are given by   p+1 j=1 j=p−i b−i = (−1)i+1 (p − 1 − i)!(i + 1)! p p+1   (x − j) dx  i = −1.13.10) 1 (x − 2)(x − 3)dx = − 12 so we have the third-order implicit Adams method yn+1 = yn + h + 8yn − yn−1 ). (p+1)h. . .8) This involves replacing i by p − i in (2.13. we replace y by the Lagrange interpolating polynomial that agrees with it at the p+1 points h. .76 The Numerical Solution of Differential Equations Now we’ll use this result to obtain a collection of excellent methods for solving differential equations.7). complete with their error terms. This transforms (2. Begin with the obvious fact that p+1 y((p + 1)h) = y(ph) + h p y (ht) dt. we will integrate the polynomial that agrees with y at a number of given data points. let’s take p = 2. (2. h2 y 2 yn+1 = yn + hyn − (2. Then we find b1 = 1 2 1 2 3 2 3 (x − 1)(x − 2)dx = (x − 1)(x − 3)dx = 5 12 2 3 b0 = − b−1 = 2 3 2 (2. For instance.13. p − 1.13.9) Now to choose a formula in this family. 0.12) . . First. . (2.13. (2.11) If this process is repeated for various values of p we get the whole collection of these formulas.13. For reference.13. the so-called Adams formulas.
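The integrals defining the coefficients b_{−i} in (2.13.9) can be evaluated exactly by machine. The following sketch (our own code; the helper names are assumptions) carries out the polynomial products and integrations in exact rational arithmetic, and for p = 2 it reproduces the coefficients 5/12, 8/12, −1/12 of the third-order implicit Adams method (2.13.11).

```python
from fractions import Fraction
from math import factorial

def poly_mul(a, b):
    """Multiply two polynomials given as low-to-high coefficient lists."""
    out = [Fraction(0)] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def poly_def_int(a, lo, hi):
    """Exact definite integral of a coefficient-list polynomial."""
    return sum(c * (Fraction(hi)**(k + 1) - Fraction(lo)**(k + 1)) / (k + 1)
               for k, c in enumerate(a))

def implicit_adams_coeffs(p):
    """Coefficients b_{-i} of (2.13.9) for i = -1, 0, ..., p-1;
    the key i = -1 multiplies y'_{n+1}."""
    b = {}
    for i in range(-1, p):
        poly = [Fraction(1)]
        for j in range(1, p + 2):
            if j != p - i:                      # skip the factor with j = p - i
                poly = poly_mul(poly, [Fraction(-j), Fraction(1)])  # (x - j)
        b[i] = (Fraction((-1)**(i + 1),
                         factorial(p - 1 - i) * factorial(i + 1))
                * poly_def_int(poly, p, p + 1))
    return b

coeffs = implicit_adams_coeffs(2)   # should give 5/12, 8/12, -1/12
```

The same routine with p = 1 yields the trapezoidal weights 1/2, 1/2, which is a quick sanity check on the general formula.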

For reference, we show them below, complete with their error terms:

y_{n+1} = y_n + (h/2)(y'_{n+1} + y'_n) − (h³/12) y'''   (2.13.13)

y_{n+1} = y_n + (h/12)(5y'_{n+1} + 8y'_n − y'_{n−1}) − (h⁴/24) y^{(iv)}   (2.13.14)

y_{n+1} = y_n + (h/24)(9y'_{n+1} + 19y'_n − 5y'_{n−1} + y'_{n−2}) − (19h⁵/720) y^{(v)}   (2.13.15)

If we compare these formulas with the general linear multistep method (2.12.1) then we quickly find the reason for the stability of these formulas. Notice that only one backwards value of y itself is used, namely y_n. The other backwards values are all of the derivatives. Now look at the polynomial equation that determines the roots of the characteristic equation of the method. If we use a formula with all of the a_i = 0 except that a_0 = 1, then in the limiting case where the step size is zero that polynomial equation becomes simply

α^{p+1} − α^p = 0.   (2.13.16)

The roots of this are 1, 0, 0, ..., 0. The root at 1, as h grows to a small positive value, is the one that gives us the accuracy of the computed solution. The other roots all start at the origin, so when h is small they too will be small, instead of trying to cause trouble for us. All of the Adams formulas are stable for this reason.

In Table 6 we show the behavior of the three roots of the characteristic equation (2.12.8) as it applies to the fourth-order method (2.13.15). It happens that the roots are all real in this case. Notice that the friendly root is the one of largest absolute value for all τ ≤ 0.9.

[Table 6: the two small roots Root2(τ) and Root3(τ), the friendly root Friendly(τ), and e^{−τ}, for τ = 0.0, 0.1, ..., 1.0. The friendly root tracks e^{−τ} closely — both are 0.904837 at τ = 0.1 — while the other two roots stay small.]

Now we have found the implicit Adams formulas. In each of them the unknown y_{n+1} appears on both sides of the equation. They are, therefore, useful as corrector formulas. To find matching predictor formulas is also straightforward.
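The behavior recorded in Table 6 can be reproduced numerically. The sketch below (our own code, not the text's) applies the fourth-order method (2.13.15) to the test equation y' = −y, for which the characteristic equation works out to (1 + 9τ/24)α³ − (1 − 19τ/24)α² − (5τ/24)α + τ/24 = 0, and follows the friendly root by Newton's method starting from α = 1.

```python
import math

def friendly_root(tau, iters=20):
    """Root near 1 of the characteristic equation of the fourth-order
    implicit Adams method (2.13.15) applied to y' = -y with step tau."""
    g = lambda a: ((1 + 9*tau/24)*a**3 - (1 - 19*tau/24)*a**2
                   - (5*tau/24)*a + tau/24)
    dg = lambda a: (3*(1 + 9*tau/24)*a**2 - 2*(1 - 19*tau/24)*a
                    - 5*tau/24)
    a = 1.0                       # at tau = 0 the friendly root is exactly 1
    for _ in range(iters):
        a -= g(a) / dg(a)         # Newton's method
    return a
```

For small τ the computed root agrees with e^{−τ} to within a multiple of τ⁵, which is exactly the fourth-order accuracy the table illustrates.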

(2.13. 2h.13. h. . 3h.23) Next. for example. .22) has about 13 times as large an error term as the implicit fourth-order formula (2.7) we get      p yp+1 = yp + h i=0 (−1)p−i i!(p − i)! p+1 p j=0 j=i y (ih) p (x − j) dx. replace y (t) by the Lagrange polynomial that agrees with it at 0. .15). then a single application of the corrector formula ∂y will produce an estimate of the next value of y with full attainable accuracy. .13.22) y 24 720 Notice. p. . 2h.6). We are now fully equipped with Adams formulas for prediction and correction. As we have noted previously. we can write this in the more familiar form p yn+1 = yn + h i=0 b−i yn−i (2.13. .13.13. ph (for p ≥ m).13.13.13. together with their error terms: h 5h3 yn+1 = yn + (3yn − yn−1 ) + (2.78 The Numerical Solution of Differential Equations agrees with it at the data points 0. 3. 2. . so if we keep h small enough so ∂y that h ∂f is less than 1/13 or thereabouts. 1. We then obtain      p ym = y0 + h i=0 (−1)p−i i!(p − i)! m p j=0 j=i yi 0 (t − j) dt m = 1. . since multistep methods cannot get themselves started or restarted without assistance. (2. a single application of a corrector formula reduces error by about h ∂f . in matched pairs. . . in place of (2. This time we begin with a slight variation of (2. ph (but not at (p + 1)h as before).13. . . .21) y 12 8 h 251h5 (v yn+1 = yn + (55yn − 59yn−1 + 37yn−2 − 9yn−3 ) + (2.17) As before. . 2. .13. of whatever accuracy is needed. that the explicit fourth-order formula (2.19) We tabulate below these formulas in the cases p = 1.24) .20) y 2 12 h 3h4 (iv yn+1 = yn + (23yn − 16yn−1 + 5yn−2 ) + (2.18) where the numbers b−i are now given by      b−i = (−1)i (p − i)!i! p p+1 p j=0 j=p−i (x − j) dx i = 0. Then.13. (2. (2. p. Once again the Lagrange interpolation formula comes to the rescue. all of them stable. h. This is typical for matched pairs of predictor-corrector formulas. . mh y(mh) = y(0) + 0 y (t) dt. 
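A minimal sketch of the matched fourth-order pair in action (our own code, not the text's): the explicit formula (2.13.22) predicts, and a single application of the implicit formula (2.13.15) corrects. The test equation y' = −y and the use of exact starting values are assumptions made for illustration.

```python
import math

def f(x, y):
    return -y                      # test equation y' = -y, solution e^{-x}

def pc_solve(h, steps):
    """Fourth-order Adams predictor-corrector on y' = -y, y(0) = 1."""
    xs = [k * h for k in range(4)]
    ys = [math.exp(-x) for x in xs]            # exact starting values
    for n in range(3, steps):
        fn = [f(xs[n-k], ys[n-k]) for k in range(4)]   # f_n ... f_{n-3}
        # explicit (predictor) formula (2.13.22)
        yp = ys[n] + h/24 * (55*fn[0] - 59*fn[1] + 37*fn[2] - 9*fn[3])
        # one application of the implicit (corrector) formula (2.13.15)
        fp = f(xs[n] + h, yp)
        yc = ys[n] + h/24 * (9*fp + 19*fn[0] - 5*fn[1] + fn[2])
        xs.append(xs[n] + h)
        ys.append(yc)
    return xs, ys

xs, ys = pc_solve(0.1, 10)                      # integrate out to x = 1
err = abs(ys[-1] - math.exp(-1.0))
```

With h = 0.1 the error at x = 1 is a few units in the sixth decimal place, consistent with the h⁵-per-step error terms quoted above.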
The use of these pairs requires special starting formulas.

We can rewrite these equations in the form

y_m = y_0 + h sum_{j=0..p} λ_j y'_j,   m = 1, ..., p   (2.13.25)

where the coefficients λ_i are given by

λ_i = ((−1)^{p−i}/(i!(p − i)!)) ∫_0^m prod_{j=0..p, j≠i} (t − j) dt,   i = 0, ..., p.   (2.13.26)

Of course, when these formulas are used on a differential equation y' = f(x, y), each of the values of y' on the right side of (2.13.25) is replaced by an f(x, y) value. Therefore equations (2.13.25) are a set of p simultaneous equations in the p unknowns y_1, y_2, ..., y_p (y_0 is, of course, known). We can solve them with an iteration in which we first guess all p of the unknowns, and then refine the guesses all at once by using (2.13.25) to give us the new guesses from the old, until sufficient convergence has occurred.

For example, if we take p = 3, then the starting formulas are

y_1 = y_0 + (h/24)(9y'_0 + 19y'_1 − 5y'_2 + y'_3) − (19h⁵/720) y^{(v)}

y_2 = y_0 + (h/3)(y'_0 + 4y'_1 + y'_2) − (h⁵/90) y^{(v)}

y_3 = y_0 + (3h/8)(y'_0 + 3y'_1 + 3y'_2 + y'_3) − (3h⁵/80) y^{(v)}.   (2.13.27)

The philosophy of using a matched pair of Adams formulas for propagation of the solution, together with the starter formulas shown above, has the potential of being implemented in a computer program in which the user could specify the desired precision by giving the value of p. The program could then calculate from the formulas above the coefficients in the predictor-corrector pair and the starting method, and then proceed with the integration. This would make a very versatile code.
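The guess-and-refine iteration for the p = 3 starters can be sketched as follows (our own code; the test equation y' = −y and the sweep count are assumptions chosen for illustration).

```python
import math

def starters(f, x0, y0, h, sweeps=50):
    """Solve the p = 3 simultaneous starting formulas (without their
    error terms) by the iteration described in the text: guess all the
    unknowns, then refine them all at once, repeatedly."""
    y = [y0, y0, y0, y0]                  # initial guesses y1 = y2 = y3 = y0
    for _ in range(sweeps):
        fv = [f(x0 + k*h, y[k]) for k in range(4)]
        y[1] = y0 + h/24  * (9*fv[0] + 19*fv[1] - 5*fv[2] + fv[3])
        y[2] = y0 + h/3   * (fv[0] + 4*fv[1] + fv[2])
        y[3] = y0 + 3*h/8 * (fv[0] + 3*fv[1] + 3*fv[2] + fv[3])
    return y

# Starting values for y' = -y, y(0) = 1, h = 0.1:
y = starters(lambda x, y: -y, 0.0, 1.0, 0.1)
```

The converged values agree with e^{−h}, e^{−2h}, e^{−3h} to roughly the size of the h⁵ error terms quoted in (2.13.27), which is exactly the accuracy the fourth-order propagation formulas need.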


1) for all vectors v . calculate its determinant. see any of the references cited at the end of this chapter.1. that is to say. Among these we mention the following: (a) Given an n × n matrix. The first major concept we need is that of a linear mapping. we will be very much concerned with the development of efficient software that will accomplish the above purposes. As usual. (c) Invert an n × n matrix. Let V and W be two vector spaces over the real numbers (we’ll stick to the real numbers unless otherwise specified). We will quickly review some additional concepts that will be helpful in our work. T is linear). On the left we add two vectors of V .e. We assume that the reader is familiar with the basic constructs of linear algebra: vector space.1). of problems that involve matrices or systems of linear equations. Euclidean n-dimensional space. We say that T is a linear mapping from V to W if T associates with every vector v of V a vector T v of W (so T is a mapping) in such a way that T (αv + βv ) = αT v + βT v (3.1 Vector spaces and linear mappings In this chapter we will study numerical methods for the solution of problems in linear algebra. (b) Given m linear algebraic equations in n unknowns. linear dependence and independence of vectors. or discover that it has none. spanning sets of vectors.1. For a more complete discussion of linear algebra.Chapter 3 Numerical linear algebra 3. And now.. Notice that the “+” signs are different on the two sides of (3. basis vectors. to business. v of V and real numbers (or scalars) α and β (i. (d) Find the eigenvalues and eigenvectors of an n × n matrix. on the right we add two vectors of W . . find the most general solution of the system. if possible.

respectively. T em . to describe a linear mapping. 4x + 5y) of W . suppose the vector spaces V and W are of dimensions m and n.1.1. Further. As such.3) T ei = j=1 tji fj i = 1. The mapping T that carries a vector x of V into Ax of W is a linear mapping. . e3 . . to linear combinations of more than two vectors) to obtain T x = α1 (T e1 ) + α2 (T e2 ) + · · · + αm (T em ). More generally. and in W there is a basis of n vectors f1 . comes from the fact that a particular mapping can be represented by many different matrices. em .1. 3. let V be Euclidean n-dimensional space and let W be Euclidean m-space. . suppose we know T e1 . and not just matrices. . . Then we have the situation that is sketched in figure 3. it often happens that problems in linear algebra that seem to be questions about matrices. T ei can be written as a linear combination of the basis vectors of W . We claim now that the action of T on every vector of V is known if we know only its effect on the m basis vectors of V . f2 . fn . . . . Then we can choose in V a basis of m vectors. For example. . any of the matrices that represent the same mapping will have the same determinant as the given one. so we write n (3.4) . .1 below. and making this kind of observation and identifying simple representatives of the class of relevant matrices can be quite helpful. say e1 . Second. . 3). −1) = (4. any matrix generates a linear mapping between two appropriately chosen (to match the dimensions of the matrix) vector spaces. .82 Here are a few examples of linear mappings. Let T be the mapping that carries the vector (x. we seem to confront a problem about matrices. That is. then T ei is a vector in W . The right side is known. It’s easy to check that this mapping is linear. and the claim is established. The coefficients of this linear combination will evidently depend on i. . let V and W both be the same. Consider the mapping that associates with a polynomial f of V its derivative T f = f in W . Indeed. In fact. 
x = α1 e1 + α2 e2 + · · · + αm em . suppose V is Euclidean two-dimensional space (the plane) and W is Euclidean three-dimensional space. (3. namely the space of all polynomials of some given degree n. . Then T is a linear mapping. T e2 .2) Now apply T to both sides and use the linearity of T (extended. This means that we can change to a simpler matrix that represents the same linear mapping before answering the question. y) of V to the vector (3x + 2y. e2 . Numerical linear algebra First. So. (3. For instance. . let A be a given m × n matrix of real numbers. If ei is one of these. To get back to the matter at hand. Then let x be any vector in V . Let T be a linear mapping from V to W . secure in the knowledge that the answer will be the same. by induction. The importance of studying linear mappings in general. T (2. . Express x in terms of the basis of V . m. x − y. are in fact questions about linear mappings. if we are given a square matrix and we want its determinant. “all we have to do” is describe its action on a set of basis vectors of V .
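The recipe "describe T by its action on basis vectors" can be made concrete with the example above. The sketch below (our own code) builds the representing matrix column by column from T e1 and T e2, then checks that applying the matrix agrees with applying T directly.

```python
def T(v):
    """The example mapping from the plane to 3-space:
    T(x, y) = (3x + 2y, x - y, 4x + 5y)."""
    x, y = v
    return (3*x + 2*y, x - y, 4*x + 5*y)

# The columns of the representing matrix are the images of the basis vectors.
e1, e2 = (1, 0), (0, 1)
A = list(zip(T(e1), T(e2)))          # 3x2 matrix as a list of rows

def apply(A, v):
    """Matrix-vector product: the matrix acting as the mapping T."""
    return tuple(sum(a * vj for a, vj in zip(row, v)) for row in A)
```

By linearity, knowing the two columns is enough: `apply(A, v)` reproduces `T(v)` for every vector v, which is the content of equations (3.1.2)-(3.1.4).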

Consider the set of all vectors of V that are mapped by T into the zero vector of W . . j = 1.1. . . (3. together with the given sets of basis vectors for V and W . . . F ) we know the full mapping T . and then by (3. are enough to describe the linear operator T completely. em }. . Next. m. f2 V Figure 3. fn }. Indeed. .7) . then ν is called the nullity of the mapping T .3) we know the action of T on every vector of V . . .5) Now ker(T ) is not just a set of vectors.1 Vector spaces and linear mappings 83 T em . in the sense that from a knowledge of (t.1. suppose once more that T is a linear mapping from V to W . we can speak of its dimension. . e2 B ¨ ¨ ‰ r rr¨¨ e e e … e1e E W m B ¨ ¨ ‰ r rr¨¨ e e … e f1e f . . To summarize. and is written im(T ) = {w ∈ W | w = T v. it is easy to see that T carries the 0 vector of V into the 0 vector of W .3. f2 . Thus ker(T ) = {x ∈ V | T x = 0W }. . Indeed. if we know all of those numbers. one sees immediately that if x and y belong to ker(T ) and α and β are scalars.1.1. to a vector space W with a distinguished basis F = {f1 . it is itself a vector space.6) T (αx + βy) = αT (x) + βT (y) = 0. . If ν = dim ker(T ). This set is called the kernel of T . so αx + βy belongs to ker(T ) also. Since ker(T ) is a vector space. v ∈ V }. . . (3. . e2 .4) we know what T does to every basis vector of V . n. . This set is called the image of T . Consider also the set of vectors w of W that are of the form w = T v. i = 1. and is written ker(T ). . then by (3.1: The action of a linear mapping Now the mn numbers tji . . E.1. Since T is linear. then (3. for some vector v ∈ V (possibly many such v’s exist). that is a vector subspace of V . . an n × m matrix tji represents a linear mapping T from a vector space V with a distinguished basis E = {e1 .

unless b happens to be the zero vector (why?). unless our programs do exact arithmetic on rational numbers.8) has rank 2. . . Is it a vector space? That’s right. then we have w = T v and w = T v for some v . we remark that im(T ) is more than just a set of vectors. such that the r × r submatrix that they determine has a nonzero determinant. a set of r rows and r columns.9) (3. x2 . . is just that the rank of a matrix is not a continuous function of the matrix entries.e. Now we are ready to consider one of the most important problems of numerical linear algebra: the solution of simultaneous linear equations. and consider the system Ax = b (3. No computer program will be able to tell the difference between these two situations unless it is doing arithmetic to at least 21 digits of precision. Let A be a given m × n matrix. By the rank of a matrix A we mean any of the following: (a) the maximum number of linearly independent rows that we can find in A (b) same for columns (c) the largest number r for which there is an r × r nonzero sub-determinant in A (i. It is true that the rank of a linear mapping T from V to W is equal to the rank of any matrix A that represents T with respect to some pair of bases of V and W . let b be a given column vector of length m. A celebrated theorem of Sylvester asserts that rank(T ) + nullity(T ) = dim(V ).11). since if w and w are both in im(T ). It is also true that the rank of a matrix is not computable unless infinite-precision arithmetic is used.1. Therefore.84 Numerical linear algebra Once again.2]-entry is changed to 1.11) of m linear simultaneous equations in n unknowns x1 . the rank will be uncomputable. really. . What we are saying. In fact. but if the [2. Hence αw + βw = αT v + βv = T (αv + βv ) so αw + βw lies in im(T )..10) (3. it isn’t. xn .1.1. it is in fact a vector subspace of W . not necessarily consecutive. or whatever.1. and if α and β are scalars. v in V . Consider the set of all solution vectors x of (3. too. 
the rank becomes 1. . The dimension of the vector (sub)space im(T ) is called the rank of the mapping T . or do finite field arithmetic.1. the 2 × 2 matrix 1 1 1 1 + 10−20 (3.

1. 2.1. 1. . (3. x.1. Hence x −x belongs to ker(A). 1) and T takes (3. 0.e.11). −1) to (2. and rank(T ). . 0. What might the output look like? If b = 0. Ax = b. 2. find the 3 × 3 matrix that represents T . 3).1. x2 } of V . im(T ). Find T (1. and we can describe it by printing out a set of basis vectors of ker(A). 0) to (1. and the basis {1. A(x −x ) = 0. Then Ax = b. (d) Regard T as a mapping from V to the space W of polynomials of degree 1. if the system is homogeneous. the rank of a given n×n matrix A is a discontinuous function of the entries of A.1 1. 1. there is no difficulty. and a list of the basis vectors of ker(A). Use the basis given in part (c) for V .11) a vector space if b = 0? . Why isn’t the solution set of (3. . What is the maximum number of binary digits that could be needed to represent all of the elements of A? 5. Consider the vector space V of all polynomials in x of degree at most 2.11) by printing out one particular solution x .1. a vector space. 6. e2 . in case b = 0. If b = 0 then. because then all solutions are of the form x = x + α1 e1 + α2 e2 + · · · + αν eν . any one particular solution of the system. 4). i. Show by examples that for every n. 2).12) Therefore. and by subtraction. then consider any two solutions x and x of (3. −1. If T acts on the vector space of polynomials of degree at most n according to T (f ) = f − 3f . 2. (a) What is the rank of T ? (b) What is the image of T ? (c) For the basis {1.1 Vector spaces and linear mappings 85 Suppose then that we intend to write a computer program that will in some sense present as output all solutions x of (3. 3. Exercises 3. Let A be an n × n matrix with integer entries having absolute values at most M .11). . x − 1} for W . If the right-hand side vector b is not 0.3.1. for in that case the solution set is ker(A).11) should print out a basis for the kernel of A together with. Let T be the linear mapping that sends each polynomial to its derivative. 4. 
a computer program that alleges that it solves (3. eν are a basis for ker(A). find ker(T ).. we can describe all possible solutions of (3. 1) to (1. and T takes (1. and find the 2 × 3 matrix that represents T with respect to these bases. Suppose T takes the vector (1. then x − x = α1 e1 + α2 e2 + · · · + αν eν . so if e1 . Let T be a linear mapping of Euclidean 3-dimensional space to itself. We will see how to accomplish this in the next section. (e) Check that the ranks of the matrices in (c) and (d) are equal.

1. 0). Let aij = ri sj for i.2.2) We divide the second equation by 2. j = 1. Check that the determinant is unchanged. 1)}.1. getting x 7 = 2 y + z = −3. 1) and (0.3) The value of z can now be chosen arbitrarily. Suppose that the matrix 0 1 1   A =  1 0 −1  −2 1 1 represents a certain linear mapping from V to V with respect to the basis {(1. . . then the two equations can be written x + y + z = 2 2y + 2z = −3. 1. Let’s begin with a little example. (3.4) .1. and then subtract it from the first.1. 1)} (3. (3. 1). . 1. Construct two sets of two equations in two unknowns such that (a) their coefficient matrices differ by at most 10−12 in each entry. (3.86 Numerical linear algebra 7.2 Linear systems The method that we will use for the computer solution of m linear equations in n unknowns will be a natural extension of the familiar process of Gaussian elimination. 2. 2 (3. . we can rewrite the solution in the form  7 x 0    2    3  y  =  − 2  + z  −1  . say of the following set of two equations in three unknowns: x+y+z =2 x − y − z = 5. (1. 0) plus any linear combination of (−1. Show that the rank of A is at most 1. 1. 10. Find a system of linear equations whose solution set consists of the vector (1. Find the matrix that represents the same mapping with respect to the basis {(0. (1. 0.2. n. (0.2. 1. 0.14)   (3. z 1 0      (3. 0).1) If we subtract the second equation from the first.2.15) 3. and then x and y will be determined. 0. (0.13) of V . To make this more explicit. 1). 8. and (b) their solutions differ by at least 10+12 in each entry. 9. 0).

1] element.4) we see clearly that the general solution is the sum of a particular solution (7/2. −3/2.   (3. −1. The calculation can be compactified by writing the numbers in a matrix and omitting the names of the unknowns.2 Linear systems 87 In (3. ∗ ∗ ∗ ∗ ∗ ∗  (3. So consider the three equations in five unknowns shown below. That is.2. We won’t actually write the numerical values of the coefficients.1) is 1 1 1 2 1 −1 −1 5 . Later on we will take extensive measures to assure this.1). we let t = a21 and then do −→ − − − row(2) := −→ − t ∗ −→ row(2) row(1). but we’ll use asterisks instead.2. A vertical line in the matrix will separate the left sides and the right sides of the equations.6) − − − − − Next (−→ 2) := (−→ 2)/2.2. Thus. Then let t = a31 and do −→ − − − row(3) := −→ − t ∗ −→ row(3) row(1).9) (3.11) .2.1] element. 0).2. we use the 1 in the upper left-hand corner to zero out the entries below it in column 1.2.8) The first step is to create a 1 in the extreme upper left corner. plus any multiple of the basis vector (0.  ∗ ∗ ∗ ∗ ∗ ∗    ∗ ∗ ∗ ∗ ∗ ∗ .3). by dividing the first row through by the [1. After we divide the first row by the [1. to see some of the situations that can arise.10) The result is that we now have the matrix 1 ∗ ∗ ∗ ∗ ∗    0 ∗ ∗ ∗ ∗ ∗ . 0 ∗ ∗ ∗ ∗ ∗ (3. (3.2.2. (3. the original system (3.5) − − − Now we do (−→ 2) := (−→ 1) − (−→ 2) and we have row row row 1 1 1 2 0 2 2 −3 . Now we will step up to a slightly larger example.2. We will assume for the moment that the various numbers that we want to divide by are not zero.7) which is the matrix equivalent of (3. 1) for the kernel of the coefficient matrix of (3.2. and then (−→ 1) := (−→ 1) − (−→ 2) bring us to the final form row row row row row 7 1 0 0 2 0 1 1 −3 2 (3.3.2.
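The two-equation example can be replayed step by step in exact arithmetic (our own sketch, not the text's code). The row operations below are the ones described in the text, and the final matrix exhibits the particular solution (7/2, −3/2, 0) and the kernel basis vector (0, −1, 1).

```python
from fractions import Fraction as F

# Augmented matrix for  x + y + z = 2,  x - y - z = 5
M = [[F(1), F(1),  F(1),  F(2)],
     [F(1), F(-1), F(-1), F(5)]]

# row(2) := row(1) - row(2)
M[1] = [a - b for a, b in zip(M[0], M[1])]
# row(2) := row(2) / 2
M[1] = [a / 2 for a in M[1]]
# row(1) := row(1) - row(2)
M[0] = [a - b for a, b in zip(M[0], M[1])]

# Reduced form:  x = 7/2,  y + z = -3/2.  Set the free unknown z = 0:
particular = (M[0][3], M[1][3], F(0))
kernel_basis = (F(0), F(-1), F(1))       # z arbitrary, y = -3/2 - z
```

Any solution is the particular solution plus a multiple of the kernel vector, which is the general-solution structure the text describes.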

and that the others are determined by them. Let’s see how all of this will look if we were to operate directly on the matrix (3. we would say that x4 and x5 are free. and neither the first nor the second unknown appears in the third equation.11) we divide through the second row by a22 (again blissfully assuming that a22 is not zero) to create a 1 in the [2. 0 0 1 ∗ ∗ ∗  (3. in terms of the linear mapping that the matrix represents. we should say that the kernel of A has a two-dimensional basis. Then we let t = a13 and set −→ 1 := −→ − − − row row(1) − t ∗ −→ row(3) resulting in  (3. Then we use that 1 to create zeroes (just one zero in this case) below it in the second column by letting t = a32 and doing −→ − − − row(3) := −→ − t ∗ −→ row(3) row(2). and finally the first equation would yield x1 .14) (3.2. what we are doing is changing the sets of basis vectors. Evidently this does not change the solutions of the system of equations. Next. in terms of the original set of simultaneous equations. keeping the mapping fixed. we divide the third row by the [3. What is special about them is that the first unknown does not appear in the second equation. More precisely.2] position.13) This is the end of the so-called forward solution. To do this we let t = a23 and then we do −→ − −→ − − row(2) := row(2) − t ∗ −→ row(3). the first phase of the process of obtaining the general solution. in such a way that the matrix that represents the mapping becomes a bit more acceptable to our taste.3] position to create zeros in the third column above that 1.2.2. Hence. 0 0 1 ∗ ∗ ∗ (3.88 Numerical linear algebra We pause for a moment to consider what we’ve done. Finally. to add a constant multiple of one row to another in the matrix corresponds to adding a multiple of one equation to another. To finish the solution of such a system of equations we would use the third equation to express x3 in terms of x4 and x5 . First we use the 1 in the [3.13).16) . that we are now beginning. 
and this also doesn’t affect the solutions.15) 1 ∗ 0 ∗ ∗ ∗    0 1 1 ∗ ∗ ∗ . Now in (3. is called the backwards substitution. then the second equation would give us x2 in terms of x4 and x5 . First. let’s think about the system of equations that is represented here.12) 1 ∗ ∗ ∗ ∗ ∗    0 1 ∗ ∗ ∗ ∗ .2. Finally. because we start with the last equation and work backwards. also expressed in terms of x4 and x5 . Again.2.3] element to obtain   (3.2.2. The second phase of the solution. to divide a row of the matrix by a number corresponds to dividing an equation through by the same number.

and numbers that are smaller than that will be declared to be zero.2. in the presence of rounding errors. For instance. 0 0 1 a34 a35 a36 Each of the unknowns is expressible in terms of x4 and x5 : x1 x2 x3 x4 x5 = a16 − a14 x4 − a15 x5 = a26 − a24 x4 − a25 x5 = a36 − a34 x4 − a35 x5 = x4 = x5 . it is unreasonable to expect a 0 to be exactly zero. It’s time to deal with the case where one of the ∗’s that we divide by is actually a zero. none of our previously constructed zeros gets wrecked by this process. the vector and the two columns of the matrix shown below:        a16 a26 a36 0 0     . (3.e. We find a basis matrix (i.21) As this shows.2.2.   (3.17) Of course.2] position to create a zero in column 2 above that 1 by letting t = a12 and −→ − − − row(1) := −→ − t ∗ −→ row(1) row(2). where I is the identity matrix whose size is equal to the nullity of the system. They are. suppose that we have now arrived at  1 0 0 a14 a15 a16    0 1 0 a24 a25 a26  . 0 0 1 ∗ ∗ ∗  (3. In numerical work on computers. Next we use the 1 in the [2.20) This means that we have found the general solution of the given system by finding a particular solution and a pair of basis vectors for the kernel of A. respectively.18) We need to be more careful about the next step.          a14 a24 a34 −1 0 a15 a25 a35 0 −1     .. we find a particular solution by filling in extra zeros in the last column until its length matches the number (five in this case) of unknowns.  (3. In fact we will have to discuss rather carefully what we mean by zero.19) (3.2. Instead we will set a certain threshold level. so it’s time to use numbers instead of asterisks. .2.2 Linear systems 89 Observe that now x3 does not appear in the equations before it. The hard part will be the determination of the right threshold. a matrix whose columns are a basis for the kernel of A) by extending the fourth and fifth columns of the reduced row echelon form of A with −I.3. in this case 2. 
and we have arrived at the reduced echelon form of the original system of equations  1 0 0 ∗ ∗ ∗    0 1 0 ∗ ∗ ∗ .

to find a nonzero entry. ∗ ∗ ∗ ∗ . The equations from the third onwards have only zeros on their left-hand sides. . . if there is one.90 Numerical linear algebra but let’s postpone that question for a while. If any solutions at all exist. It is possible. . . The calculation can now proceed from the rejuvenated pivot element in the [3. we will mean that their size is below our current tolerance level. .     (3. ∗ ∗ ∗ ∗ .    (3. so we’ll be able to recognize the answers when we see them. . Else.e. when we speak of matrix entries being zero. extending over to. If aij is such a nonzero entry. . like ∗ ∗ 0 0 0 . and continue the process. and make the convention that in this and the following sections. .3] position.23) What then? We’re finished with the forward solution. . but it is zero. If so. has no solutions). In the matrix. ∗ ∗ ∗ ∗ . . so as to bring a nonzero entry into the [3. then we want next to bring aij into the pivot position [3. . ∗ 1 0 0 . x3 might appear in some later equation. suppose we are carrying out the row reduction of a certain system. .2. . then those equations had better have only zeros on their right-hand sides also. In the matrix. . . Second. that x3 does not appear in any later equation. With that understanding. ∗ ∗ 0 0 0 . However. this amounts to searching through the whole rectangle that lies “southeast” of the [3. and we’ve arrived at a stage like this:         1 0 0 0 . the next step would be to divide by a33 . . the vertical line. If the system is consistent. We can do this in two steps. . The last three ∗’s in the last column (and all the entries below them) must all be zeros.     . Then all entries ai3 = 0 for i ≥ 3. Then we could ask for some other unknown xj for j > 3 that does appear in some equation later than the third. and we do the backwards solution just as in the preceding case. exchange columns 3 and j (interchange the numbers of the unknowns. . so that x3 becomes xj and xj becomes x3 ). and continue. . 
∗ ∗ 0 ∗ .3] consists entirely of zeros. . then of course we ignore the final rows of zeros. . . or the calculation halts with the announcement that the input system was inconsistent (i.3]. it may happen that the rectangle this:  1 ∗ ∗  0 1 ∗    0 0 0   0 0 0   0 0 0  . . . this means that we would exchange two rows. ∗ ∗ ∗ ∗ ∗ . ··· ··· ··· ··· ··· . First we exchange rows 3 and i (interchange two equations). This means that x3 happens not to appear in the third equation.22) Normally. . . . southeast of [3. . . We must remember somehow that we renumbered the unknowns. but not beyond. ··· ··· ··· ··· .      .3] position.3] position. . we can renumber the equations so that later equation becomes the third equation. .2. though.
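A compact sketch of the whole forward-and-backward procedure with a zero threshold (our own code; the tolerance value and the function name are assumptions). It reduces the augmented matrix, skipping columns whose candidate pivots fall below the threshold (those unknowns become free), and returns the pseudorank, a particular solution obtained by setting the free unknowns to zero, and a kernel basis built by the −I extension described above — or reports inconsistency when a zero row has a nonzero right-hand side.

```python
TOL = 1e-12          # threshold below which an entry is declared zero

def solve_general(A, b):
    """Return (pseudorank, particular solution, kernel basis),
    or None if the system is inconsistent."""
    m, n = len(A), len(A[0])
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    pivots, r = [], 0
    for c in range(n):
        piv = max(range(r, m), key=lambda i: abs(M[i][c]), default=None)
        if piv is None or abs(M[piv][c]) <= TOL:
            continue                       # x_c is a free unknown
        M[r], M[piv] = M[piv], M[r]        # bring the pivot row up
        M[r] = [a / M[r][c] for a in M[r]]
        for i in range(m):
            if i != r:
                t = M[i][c]
                M[i] = [a - t * p for a, p in zip(M[i], M[r])]
        pivots.append(c)
        r += 1
        if r == m:
            break
    # rows beyond the pseudorank must have zero right-hand sides
    if any(abs(M[i][n]) > TOL for i in range(r, m)):
        return None                        # inconsistent system
    free = [c for c in range(n) if c not in pivots]
    particular = [0.0] * n
    for row, c in zip(M, pivots):
        particular[c] = row[n]
    basis = []
    for fc in free:                        # one kernel vector per free unknown
        v = [0.0] * n
        v[fc] = -1.0                       # the -I extension
        for row, c in zip(M, pivots):
            v[c] = row[fc]
        basis.append(v)
    return r, particular, basis
```

On the earlier two-equation example this reports pseudorank 2, the particular solution (7/2, −3/2, 0), and a one-vector kernel basis.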

x 2x x 3x + 3y − z = 4 − y + 2z = 6 + y + z = 6 − y − z = −2 (3. In the latter case. Construct a system of homogeneous equations that has a three-dimensional vector space of solutions. Speaking practically again. Construct a set of four equations in three unknowns. but as the number of rows we were able to reduce before roundoff errors became overwhelming. Speaking in theoretical.2 In problems 1–5.26) (3. 8. the number of nonzero rows in the coefficient matrix at the end of the forward solution phase is the rank of the matrix.28) 6.3. with a unique solution. x + y + z = 3 3x − y − 2z = −1 5x + y = 7 4. 7. whether there are more equations than unknowns. Construct a set of four equations in four unknowns. rather than practical terms for a moment.2. 1. solve each system of equations by transforming the coefficient matrix to reduced echelon form. of rank two.2.2. To “solve” a system means either to show that no solution exists or to find all possible solutions. fewer.24) (3. 2x 3x 7x 8x 2. x + y + z + q = 0 x − y − z − q = 0 3.2. though!).25) − + − + y y y y + z = 6 + 2z = 3 + 4z = 15 + 5z = 12 (3. . this number simply represents a number of rows beyond which we cannot do any more reductions because the matrix entries are all indistinguishable from zero. not as some interesting property of the matrix that we have just computed.2 Linear systems 91 It follows that in all cases. exhibit a particular solution and a basis for the kernel of the coefficient matrix. The pseudorank should be thought of then. 3x + u + v + w + t = 1 x − u + 2v + w − 3t = 2 5. the backwards solution always begins with a matrix that has a diagonal of 1’s stretching from top to bottom (the bottom may have moved up.27) (3. with only zero entries below the 1’s on the diagonal. or the same number.2. Exercises 3. Perhaps a good name for it is pseudorank.

9. Under precisely what conditions is the set of all solutions of the system Ax = b a vector space?

10. Given a set of m vectors in n space, and one more vector w, describe an algorithm that will decide whether or not w is in the vector subspace spanned by the given set.

11. Given a set of m vectors, describe an algorithm that will extract a maximal subset of linearly independent vectors.

3.3 Building blocks for the linear equation solver

Let's now try to amplify some of the points raised by the informal discussion of the procedure for solving linear equations, with a view towards the development of a formal algorithm.

First, let's deal with the fact that a diagonal element might be zero (in the fuzzy sense defined in the previous section) at the time when we want to divide by it. Consider the moment when we are carrying out the forward solution: we have made i − 1 1's down the diagonal, all the entries below the 1's are zeros, and next we want to put a 1 into the [i, i] position and use that 1 as a pivot to reduce the entries below it to zeros. Previously we had said that this could be done by dividing through the ith row by the [i, i] element, unless that element is zero, in which case we carry out a search for some nonzero element in the rectangle that lies southeast of [i, i] in the matrix.

After careful analysis, it turns out that an even more conservative approach is better: the best procedure consists in searching through the entire southeast rectangle, whether or not the [i, i] element is zero, to find the largest matrix element in absolute value. If that largest element found in the rectangle is, say, auv, then we bring auv into the pivot position [i, i] by interchanging rows u and i (renumbering the equations) and interchanging columns v and i (renumbering the unknowns). Then we proceed as before.

It may seem wasteful to make a policy of carrying out a complete search of the rectangle whenever we are ready to find the next pivot element, and especially even if a nonzero element already occupies the pivot position anyway. But it turns out that the extra labor is well rewarded with optimum numerical accuracy and stability. If this complete search is done every time, then it develops that the sizes of the matrix elements do not grow very much as the algorithm unfolds, and the growth of the numerical errors is also kept to a minimum.

If we are solving m equations in n unknowns, then we need to carry along an extra array of length n, so that in the end we will be able to identify the output. This array will keep a record of the column interchanges that we do as we do them. Let's call it τ. Initially, we put τj = j for j = 1, . . . , n. If at a certain moment we are about to interchange, say, the pth column and the qth column, then we will also interchange the entries τp and τq. At all times then, τj will hold the number of the column where the current jth column really belongs.
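The complete-pivoting search and the τ bookkeeping can be sketched concretely. The following Python/NumPy fragment is our own illustration (hypothetical function name, 0-based indices, and for simplicity it treats every column of C as a coefficient column), not the book's program:

```python
import numpy as np

def full_pivot_step(C, tau, i):
    """One forward-solution step with a complete search of the rectangle
    southeast of position (i, i), recording column swaps in tau."""
    m, n = C.shape
    block = np.abs(C[i:m, i:n])                  # search the whole SE rectangle
    u, v = np.unravel_index(np.argmax(block), block.shape)
    u += i; v += i
    if C[u, v] == 0:
        return False                             # no usable pivot: halt here
    C[[i, u], :] = C[[u, i], :]                  # row swap: renumber equations
    C[:, [i, v]] = C[:, [v, i]]                  # column swap: renumber unknowns
    tau[i], tau[v] = tau[v], tau[i]              # record the column interchange
    C[i, :] = C[i, :] / C[i, i]                  # scale so the pivot becomes 1
    for k in range(i + 1, m):                    # reduce the rows below
        C[k, :] -= C[k, i] * C[i, :]
    return True
```

With `tau` initialized to `list(range(n))`, repeated calls for i = 0, 1, . . . carry out the forward solution while `tau` tracks where each unknown ended up.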

It must be noted that there is a fundamental difference between the interchange of rows and the interchange of columns. An interchange of two rows corresponds simply to listing the equations that we are trying to solve in a slightly different sequence, and has no effect on the solutions. An interchange of two columns, on the other hand, amounts to renumbering two of the unknowns. Hence we must keep track of the column interchanges while we are doing them, so that we'll be able to tell which unknown is which at output time, but we don't need to record row interchanges. At the end of the calculation, then, the output arrays will have to be shuffled. The reader might want to think about how to carry out that rearrangement; we will return to it in section 3.6 under the heading of “to unscramble the eggs.”

The next point concerns the linear array τ that we are going to use to keep track of the column interchanges. Instead of storing it in its own private array, it's easier to adjoin it to the matrix A that we're working on, as an extra row, for then when we interchange two columns we will automatically interchange the corresponding elements of τ.

The next item to consider is that we would like our program to be able to solve not just one system Ax = b, but several systems of simultaneous equations, where the left-hand sides are all the same, but the right-hand sides are different. Why are we allowing several different right sides?

Some of the main customers for our program will be matrices A whose inverses we want to calculate. To find, say, the first column of the inverse of A, we want to solve Ax = b, where b is the first column of the identity matrix. For the second column of the inverse, b would be the second column of the identity matrix, and so on. Thus, if A is an n × n matrix, to find A−1 we must solve n systems of simultaneous equations, each of the form Ax = b, each having the same left-hand side A, but with n different right-hand side vectors. It is convenient to solve all n of these systems at once because the reduction that we apply to A itself to bring it into reduced echelon form is useful in solving all n of these systems, and we avoid having to repeat that part of the job n times. Hence, for matrix inversion, and for other purposes too, it is very handy to have the capability of solving several systems with a common left-hand side at once, and thereby avoid separate programming.

The data for our program will therefore be an m by n + p matrix whose first n columns will contain the coefficient matrix A and whose last p columns will be the p different right-hand side vectors b. This means that the full matrix that we will be working with in our program will be (m + 1) × (n + p) if we are solving p systems of m equations in n unknowns with p right-hand sides; let's call this matrix C. So C will be thought of as being partitioned into blocks of sizes as shown below:

    C = [ A: m × n      RHS: m × p ]
        [ τ: 1 × n      0:   1 × p ]          (3.3.1)
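As an illustration of the layout (3.3.1), the following hypothetical Python/NumPy helper builds the working matrix for the matrix-inversion case, where the p = n right-hand sides are the columns of the identity. It is a sketch of the data layout only, not the book's code:

```python
import numpy as np

def augmented_for_inverse(A):
    """Build the working matrix C = [A | I] plus the bookkeeping row tau,
    so that one echelon reduction solves all n systems A x = e_j at once.
    Follows the block partition (3.3.1); 0-based tau for this sketch."""
    n = A.shape[0]
    C = np.zeros((n + 1, 2 * n))
    C[:n, :n] = A                    # coefficient block, m x n with m = n
    C[:n, n:] = np.eye(n)            # p = n right-hand sides: the identity
    C[n, :n] = np.arange(n)          # tau row records column interchanges
    return C
```

Reducing the left block of C to the identity then leaves the inverse sitting in the right block.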

Now a good way to begin the writing of a program such as the general-purpose matrix analysis program that we now have in mind is to consider the different procedures, or modules, into which it may be broken up. We suggest that the individual blocks that we are about to discuss should be written as separate subroutines, each with its own documentation, each with its own clearly defined input and output, and each with its own local variable names. They should then be tested one at a time, by giving them small, suitable test problems. If this is done, then the main routine won't be much more than a string of calls to the various blocks.

1. Procedure searchmat(C,r,s,i1,j1,i2,j2)

This routine is given an r × s array C, and two positions in the matrix, say [i1, j1] and [i2, j2]. It then carries out a search of the rectangular submatrix of C whose northwest corner is at [i1, j1] and whose southeast corner is at [i2, j2], in order to find an element of largest absolute value that lives in that rectangle. The subroutine returns this element of largest magnitude, as big, and the row and column in which it lives, as iloc and jloc.

Subroutine searchmat will be called in at least two different places in the main routine. First, it will do the search for the next pivot element in the southeast rectangle. Second, it can be used to determine if the equations are consistent, by searching the right-hand sides of equations r + 1, . . . , m (r is the pseudorank) to see if they are all zero (i.e., below our tolerance level).

2. Procedure switchrow(C,r,s,i,j,k,l)

The program is given an r × s matrix C, and four integers i, j, k and l. The subroutine interchanges rows i and j of C, between columns k and l inclusive. It also returns a variable called sign with a value of −1, unless i = j, in which case it does nothing to C and returns a +1 in sign.

3. Procedure switchcol(C,r,s,i,j,k,l)

This subroutine is like the previous one, except it interchanges columns i and j of C, between rows k and l inclusive. It also returns a variable called sign with a value of −1, unless i = j, in which case it does nothing to C and returns a +1 in sign.

The subroutines switchrow and switchcol are used during the forward solution in the obvious way, and again after the back solution has been done, to unscramble the output (see procedure scramb below).

4. Procedure pivot(C,r,s,i,k,u)

Given the r × s matrix C, and three integers i, k and u, the subroutine assumes that Cii = 1. It then stores Cki in the local variable t, sets Cki to zero, and reduces row k of C, in columns u to s, by doing the operation Ckq := Ckq − t ∗ Ciq for q = u, . . . , s. The use of the parameter u in this subroutine allows the flexibility for economical operation in both the forward and back solution. In the forward solution, we take u = i + 1 and it reduces the whole row k. In the back solution we use u = n + 1, because the rest of row k will have already been reduced.
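Procedures 1 and 4 translate almost line for line into code. Here is a hypothetical Python/NumPy sketch (0-based indices, inclusive corners for the search, and reducing full row slices rather than the book's explicit index ranges):

```python
import numpy as np

def searchmat(C, i1, j1, i2, j2):
    """Return (big, iloc, jloc): an entry of largest absolute value in the
    submatrix with northwest corner [i1, j1] and southeast corner [i2, j2]."""
    block = np.abs(C[i1:i2 + 1, j1:j2 + 1])
    u, v = np.unravel_index(np.argmax(block), block.shape)
    return C[i1 + u, j1 + v], i1 + u, j1 + v

def pivot(C, i, k, u):
    """Reduce row k against pivot row i (assumed to have C[i, i] = 1),
    acting only on columns u onward, as in procedure 4."""
    t = C[k, i]
    C[k, i] = 0.0
    C[k, u:] -= t * C[i, u:]
```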

5. Procedure scale(C,r,s,i,u)

Given an r × s matrix C and integers i and u, the routine stores Cii in a variable called piv. It then does Cij := Cij/piv for j = u, . . . , s, and returns the value of piv.

6. Procedure ident(C,r,s,i,j,q)

This procedure inserts q times the n × n identity matrix into the n × n submatrix whose northwest corner is at position [i, j] of the r × s matrix C. The values of m and n must also be provided to the procedure.

7. Procedure scramb(C,r,s,n)

This procedure permutes the first n rows of the r × s matrix C according to the permutation that occupies the positions Cr1, Cr2, . . . , Crn on input. After rearrangement the rows will correspond to the original numbering of the unknowns. Its purpose is to rearrange the rows of the output matrix that holds a basis for the kernel, and also the rows of the output matrix that holds particular solutions of the given system(s), thereby compensating for the renumbering that was induced by column interchanges during the forward solution. The use of this subroutine is explained in detail in section 3.6 (q.v.). This subroutine poses some interesting questions if we require that it should not use any additional array space beyond the input matrix itself.

Now let's look at the assembly of these building blocks into a complete matrix analysis procedure called matalg(C,r,s,m,n,p,opt,eps). Input items to it are:

• An r × s matrix C (as well as the values of r and s), whose northwest m × n submatrix contains the matrix of coefficients of the system(s) of equations that are about to be solved. Unless the inverse of the coefficient matrix is wanted, the northeast m × p submatrix of C holds p different right-hand side vectors for which we want solutions. It is assumed that r = 1 + max(m, n).

• The numbers m, n and p.

• A parameter opt that is equal to 1 if we want an inverse, or equal to 2 if we want to see the determinant of the coefficient matrix (if square) as well as a basis for the kernel (if it is nontrivial) and a set of p particular solution vectors.

• A real parameter eps that is used to bound roundoff error.

Output items from the procedure matalg are:

• The pseudorank r

• The determinant det, if m = n

• An n × (n − r) matrix basis, whose columns are a basis for the kernel of the coefficient matrix

• An n × p matrix partic, whose columns are particular solution vectors for the given systems.
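Procedure scramb is the one nonobvious mover of data. The following Python/NumPy sketch states its effect in our own words (0-based indices, and it takes the easy way out with a scratch copy rather than solving the in-place puzzle posed above):

```python
import numpy as np

def scramb(C, n):
    """Permute the first n rows of C according to the permutation stored
    in the bookkeeping row (the last row of C): the values currently in
    row j belong to unknown tau_j, so they move to row tau_j."""
    tau = C[-1, :n].astype(int)
    C[:n][tau] = C[:n].copy()        # row j moves to position tau[j]
```

After this call, row j of the output really does describe unknown j of the original numbering.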

In case opt = 1 is chosen, the procedure will set p = n = m, and will fill the last m columns and rows of C with an m × m identity matrix. The idea is that the input matrix is transformed, a column at a time, into the identity matrix, and the identity matrix is transformed, a column at a time, into the inverse, leaving the inverse matrix in the same place on output.

Now we have described the basic modules out of which a general purpose program for linear equations can be constructed. A complete formal algorithm that ties together all of these modules, with control of rounding error, is given at the end of the next section. In the next section we are going to discuss the vexing question of roundoff error and how to set the tolerance level below which entries are declared to be zero.

Finally, let's remark on how the determinant is calculated. The reduction of the input matrix to echelon form in the forward solution phase entails the use of three kinds of operations. First, we divide a row by a pivot element. Second, we multiply a row by a number and add it to another row. Third, we exchange a pair of rows or columns. At the end of the forward solution the matrix is upper triangular, with 1's on the diagonal; hence its determinant is clearly 1. What must have been the value of the determinant of the input matrix? Clearly it must have been equal to the product of all the pivot elements that were used during the reduction, together with a plus or minus sign from the row or column interchanges. The first operation divides the determinant by that same pivot element. The second has no effect on the determinant. The third changes the sign of the determinant, at any rate if the rows or columns interchanged are distinct.

Hence, to compute the determinant, we begin by setting det to 1. Then, each time a new pivot element is selected, we multiply det by it, and whenever a pair of different rows or columns are interchanged we reverse the sign of det, and proceed as before. Then det holds the determinant of the input matrix when the forward solution phase has ended.

Exercises 3.3

1. Make a test problem for the major program that you're writing by tracing through a solution the way the computer would: Take one of the systems that appears at the end of section 3.2. Transform it to reduced row echelon form step by step, being sure to carry out a complete search of the Southeast rectangle each time, and to interchange rows and columns to bring the largest element found into the pivot position. Record the column interchanges in τ. Record the status of the matrix C after each major loop so you'll be able to test your program thoroughly and easily.

2. Repeat problem 1 on a system of each major type: inconsistent, many solutions, unique solution.

3. Construct a formal algorithm that will invert a matrix, using no more array space than the matrix itself. Why store all of the extra columns of the identity matrix? (Good luck!)

4. Show that a matrix A is of rank one if and only if its entries are of the form Aij = fi gj for all i and j.

5. Show that the operation row(i) := c ∗ row(j) + row(i) applied to a matrix A has the same effect as first applying that same operation to the identity matrix I to get a certain matrix E, and then computing EA.

6. Show that the operation of scaling row(i):

    aik := aik/aii,    k = 1, . . . , n          (3.3.2)

has the same effect as first dividing the ith row of the identity matrix by aii to get a certain matrix E, and then computing EA.

7. Suppose we do a complete forward solution without ever searching or interchanging rows or columns. Show that the forward solution amounts to discovering a lower triangular matrix L and an upper triangular matrix U such that LA = U (think of L as a product of several matrices E such as you found in the preceding two problems).

3.4 How big is zero?

The story of the linear algebra subroutine has just two pieces untold: the first concerns how small we will allow a number to be without calling it zero, and the second concerns the rearrangement of the output to compensate for interchanges of rows and columns that are done during the row-echelon reduction.

The main reduction loop begins with a search of the rectangle that lies Southeast of the pivot position [i, i], in order to locate the largest element that lives there and to use it for the next pivot. If that element is zero, or might be, the forward solution halts because the remaining pivot candidates are all zero. But “how zero” do they have to be? Certainly it would be too much to insist, when working with sixteen decimal digits, say, that a number should be exactly equal to zero. A little more natural would be to declare that any number that is no larger than the size of the accumulated roundoff error in the calculation should be declared to be zero, since our microscope lens would then be too clouded to tell the difference.

If the threshold is too large, we will declare numbers to be zero that aren't, and our numerical solution will terminate too quickly because the computed matrix elements will be declared to be unreliable when really they are perfectly OK. If we set too small a threshold, then numbers that “really are” zero will slip through, and the calculation will continue after it should have terminated because of unreliability of the computed entries.

Hence it is important that we should know how large the roundoff error is. The phenomenon of roundoff error occurs because of the finite size of a computer word.
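The size of a single rounding can be exhibited directly in standard double precision, where a word carries 53 significant bits. A small Python check, using exact rational arithmetic to expose the error (an illustration of the claim above, not the book's code):

```python
from fractions import Fraction

# Storing the product of two doubles rounds the exact double-length
# result back to 53 bits, with relative error at most 2^-53.
a, b = 1.0 / 3.0, 1.0 / 7.0
exact = Fraction(a) * Fraction(b)      # exact product of the stored operands
stored = Fraction(a * b)               # the product as the machine rounds it
rel_err = abs(stored - exact) / exact
assert rel_err <= Fraction(1, 2**53)   # at most one unit in the last place
```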

If a word consists of d binary digits, then when two d-digit binary numbers are multiplied together, the answer that should be 2d bits long gets rounded off to d bits when it is stored. By doing so we incur a rounding error whose size is at most 1 unit in the (d + 1)st place. Then we proceed to add that answer to other numbers with errors in them, and to multiply, divide, and so forth, some large number of times. The accumulation of all of this rounding error can be quite significant in an extended computation, particularly when a good deal of cancellation occurs from subtraction of nearly equal quantities.

The question is to determine the level of rounding error that is present, as the calculation unfolds. Then, when we arrive at a stage where the numbers of interest are about the same size as the rounding errors that may be present in them, we had better halt the calculation.

How can we estimate, during the course of a calculation, the size of the accumulated roundoff error? There are a number of theoretical a priori estimates for this error, but in any given computation these would tend to be overly conservative, and we would usually terminate the calculation too soon, thinking that the errors were worse than they actually were. We prefer to let the computer estimate the error for us while it's doing the calculation. True, it will have to do more work than it did to begin with, but we would rather have it work a little harder if the result will be that we get more reliable answers.

Here is a proposal for estimating the accumulated rounding error during the progress of a computation. This method was suggested by Professors Nijenhuis and Wilf. We carry along an additional matrix of the same size as the matrix C, the one that has the coefficients and right-hand sides of the equations that we are solving. In other words, we are going to keep two whole matrices, one of which will contain the coefficients and the right-hand sides of the equations, and the other of which will contain estimates of the roundoff error that is present in the elements of the first one. Let's call this auxiliary matrix R (as in roundoff). At any time during the calculation that we want to know how reliable a certain matrix entry is, we'll need only to look at the corresponding entry of the error matrix to find out.

Initially, the matrix R is set to randomly chosen values in the range in which the actual roundoff errors lie. An element Rij might initially be as large as 2^−d |Cij| in magnitude, and of either sign. Hence, to initialize the R matrix we choose a number uniformly at random in the interval [−|2^−d Cij|, |2^−d Cij|], and store it in Rij for each i and j.

Then, as the calculation proceeds, we do arithmetic on the matrix C of two kinds: we either scale a row, by dividing it through by the pivot element, or we pivot a row against another row. In each case let's look at the effect that the operation has on the corresponding roundoff error estimator in the R matrix.
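The initialization just described is easy to sketch. A hypothetical Python/NumPy version (d is the number of significant bits, e.g. 53 for IEEE doubles; the function name is our own):

```python
import numpy as np

def init_roundoff_matrix(C, d=53, rng=None):
    """Fill R with entries drawn uniformly from [-2^-d |Cij|, 2^-d |Cij|],
    the range in which the initial rounding errors lie."""
    rng = rng or np.random.default_rng()
    return rng.uniform(-1.0, 1.0, C.shape) * np.abs(C) * 2.0 ** (-d)
```

Note that an entry of C that is exactly zero gets a zero error estimate, as it should.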

First, consider a scaling operation, in which a certain row is divided by the pivot element. Specifically, suppose we are dividing row i through by the element Cii, and let Rii be the corresponding entry of the error matrix. Then, in view of the fact that

    (Cij + Rij)/(Cii + Rii) = Cij/Cii + Rij/Cii − Rii Cij/Cii^2 + terms involving products of two or more errors,          (3.4.1)

we see that the error entries Rij in the row that is being divided through by Cii should be computed as

    Rij := Rij/Cii − Rii Cij/Cii^2.          (3.4.2)

In the second case, suppose we are doing a pivoting operation on the kth row, in which, for each column q, we do the operation Ckq := Ckq − t ∗ Ciq, where t = Cki. Now replace Ckq by Ckq + Rkq, replace Ciq by Ciq + Riq, and replace t by t + t′ (where t′ = Rki). Then substitute these expressions into the pivot operation above, and keep terms that are of first order in the errors (i.e., that do not involve products of two of the errors). Then Ckq + Rkq is replaced by

    (Ckq + Rkq) − (t + t′) ∗ (Ciq + Riq) = (Ckq − t ∗ Ciq) + (Rkq − t ∗ Riq − t′ ∗ Ciq) = (new Ckq) + (new error Rkq).          (3.4.3)

It follows that as a result of the pivoting, the error estimator is updated as follows:

    Rkq := Rkq − Cki ∗ Riq − Rki ∗ Ciq.          (3.4.4)

Equations (3.4.2) and (3.4.4) completely describe the evolution of the R matrix. It begins life as random roundoff error, and thereafter it gets modified along with the matrix elements whose errors are being estimated. Before each scaling and pivoting sequence we will need to update the R matrix as described above, and in return, we are supplied with good error estimates of each entry while the calculation proceeds.

Then, when we search the now-famous Southeast rectangle for the new pivot element, we accept it if it is larger in absolute value than its corresponding roundoff estimator, and otherwise we declare the rectangle to be identically zero and halt the forward solution.

The R matrix is also used to check the consistency of the input system. At the end of the forward solution, all rows of the coefficient matrix from a certain row onwards are filled with zeros, in the sense that the entries are below the level of their corresponding roundoff estimators. Then the corresponding right-hand side vector entries should also be zero in the same sense. If they are not, this means (with typical ambiguity, of course) either that the input system was “really” inconsistent, or just that rounding errors have built up so severely that we cannot decide on consistency. In the first case the input system was inconsistent; else, as far as the algorithm can tell, it was, and continuation of the “solution” would be meaningless.

Algorithm matalg(C,r,s,m,n,p,opt,eps). The algorithm operates on the matrix C, which is of dimension r × s, where r = max(m, n) + 1. It solves p systems of m equations in n unknowns, unless opt = 1, in which case it will calculate the inverse of the matrix in the first m = n rows and columns of C.
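The update rules (3.4.2) and (3.4.4) translate directly into code. A hypothetical Python/NumPy sketch of the two error updates (each must be applied before the corresponding operation on C itself, while C still holds its old values):

```python
import numpy as np

def scaler(C, R, i):
    """Update row i of R for an impending division of row i of C by the
    pivot Cii, per (3.4.2): Rij := Rij/Cii - Rii*Cij/Cii^2."""
    piv, err = C[i, i], R[i, i]
    R[i, :] = R[i, :] / piv - err * C[i, :] / piv**2

def pivotr(C, R, i, k):
    """Update row k of R for an impending pivot of row k against row i,
    per (3.4.4): Rkq := Rkq - Cki*Riq - Rki*Ciq."""
    R[k, :] = R[k, :] - C[k, i] * R[i, :] - R[k, i] * C[i, :]
```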

matalg:=proc(C,r,s,m,n,p,opt,eps)
local R,Z,i,j,k,ii,jj,Det,Done,psrank;
# if opt = 1 that means inverse is expected
if opt=1 then ident(C,r,s,1,n+1,1) fi;
# initialize random error matrix
R:=matrix(r,s,(i,j)->0.000000000001*(rand()-500000000000)*eps*C[i,j]);
# set row permutation to the identity
for j from 1 to n do C[r,j]:=j od;
# begin forward solution
Det:=1; i:=0; Done:=false;
while ((i<m) and not(Done)) do
  # find largest in SE rectangle
  Z:=searchmat(C,r,s,i+1,i+1,m,n);
  ii:=Z[1][1]; jj:=Z[1][2];
  if abs(Z[2])>abs(R[ii,jj]) then
    i:=i+1;
    # switch rows
    Det:=Det*switchrow(C,r,s,i,ii,1,s);
    Z:=switchrow(R,r,s,i,ii,1,s);
    # switch columns
    Det:=Det*switchcol(C,r,s,i,jj,1,r);
    Z:=switchcol(R,r,s,i,jj,1,r);
    # divide by pivot element
    Z:=scaler(C,R,r,s,i,i);
    Det:=Det*scale(C,r,s,i,i);
    for k from i+1 to m do
      # reduce row k against row i
      Z:=pivotr(C,R,r,s,i,k,i+1);
      Z:=pivot(C,r,s,i,k,i+1);
    od;
  else
    Done:=true
  fi;
od;
# end forward solution, begin consistency check
psrank:=i;
if psrank<m then
  Det:=0;
  for j from 1 to p do
    # check that right hand sides are 0 for i>psrank
    Z:=searchmat(C,r,s,psrank+1,n+j,m,n+j);
    if abs(Z[2])>abs(R[Z[1][1],Z[1][2]]) then
      printf("Right hand side %d is inconsistent",j);
      return;
    fi;
  od;
fi;
# equations are consistent, do back solution

for j from psrank to 2 by -1 do
  for i from 1 to j-1 do
    Z:=pivotr(C,R,r,s,j,i,psrank+1);
    Z:=pivot(C,r,s,j,i,psrank+1);
  od;
od;
# end back solution, insert minus identity in basis
if psrank<n then
  # fill bottom of basis matrix with -I
  Z:=ident(C,r,s,psrank+1,psrank+1,-1);
fi;
# fill under right-hand sides with zeroes
for i from psrank+1 to n do
  for j from n+1 to s do C[i,j]:=0 od;
od;
# fill under R matrix with zeroes
for i from psrank+1 to n do
  for j from n-psrank to s do R[i,j]:=0 od;
od;
# copy row r of C to row r of R
for j from 1 to n do R[r,j]:=C[r,j] od;
# permute rows prior to output
Z:=scramb(C,r,s,n);
Z:=scramb(R,r,s,n);
return(Det,psrank,evalm(R));
end;

Two new sub-procedures are called by this procedure, namely scaler and pivotr. These are called immediately before the action of scale or pivot, respectively, and their mission is to update the R matrix in accordance with equation (3.4.2) or (3.4.4), to take account of the impending scaling or pivoting operation.

If the procedure terminates successfully, it returns a list containing three items: the first is the determinant (if there is one), the second is the pseudorank of the coefficient matrix, and the third is the matrix of estimated roundoff errors. The matrix C (which is called by name in the procedure, which means that the input matrix is altered by the procedure) will contain a basis for the kernel of the coefficient matrix in columns psrank + 1 to n, and p particular solution vectors, one for each input right-hand side, in columns n + 1 to n + p.

Exercises 3.4

1. Break the forward solution process off from the complete algorithm above. State it formally as algorithm forwd, list its global variables, and describe precisely its effect on them. Do the same for the backwards solution.

2. Work out an elegant way to print the answers and the error estimates when the program runs, so that it gives the solutions and their roundoff error estimates together. For instance, there's no point in giving 12 digits of roundoff error estimate; that's too much. Just print the number of digits of the answers that can be trusted. How would you do that? Write subroutine prnt that will carry it out.


3. Suppose you want to re-run a problem, with a different set of random numbers in the roundoff matrix initialization. How would you do that? Run one problem three or four times to see how sensitive the roundoff estimates are to the choice of random values that start them off. 4. Show your program to a person who is knowledgeable in programming, but who is not one of your classmates. Ask that person to use your program to solve some set of three simultaneous equations in five unknowns. Do not answer any questions verbally about how the program works or how to use it. Refer all such questions to your written documentation that accompanies the program. If the other person is able to run your program and understand the answers, award yourself a gold medal in documentation. Otherwise, improve your documentation and let the person try again. When successful, try again on someone else. 5. Select two vectors f , g of length 10 by choosing their elements at random. Form the 10 × 10 matrix of rank 1 whose elements are fi gj . Do this three times and add the resulting matrices to get a single 10 × 10 matrix of rank three. Run your program on the coefficient matrix you just constructed, in order to see if the program is smart enough to recognize a matrix of rank three when it sees one, by halting the forward solution with pseudorank = 3. Repeat the above experiment 50 times, and tabulate the frequencies with which your program “thought” that the 10 × 10 matrix had various ranks.
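Exercise 5's construction is worth seeing in miniature. Here NumPy's rank routine stands in for the reader's own elimination program; an illustrative sketch of a single trial, not the requested 50-trial tabulation:

```python
import numpy as np

# A sum of three random rank-one matrices f g^T is, with probability 1,
# a 10 x 10 matrix of rank exactly three.
rng = np.random.default_rng(42)
M = sum(np.outer(rng.standard_normal(10), rng.standard_normal(10))
        for _ in range(3))
assert np.linalg.matrix_rank(M) == 3   # the rank-3 structure is detected
```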

3.5 Operation count

With any numerical algorithm it is important to know how much work is involved in carrying it out. In this section we are going to estimate the labor involved in solving linear systems by the method of the previous sections. Let’s recognize two kinds of labor: arithmetic operations, and other operations, both as applied to elements of the matrix. The arithmetic operations are +, −, ×, ÷, all lumped together, and by other operations we mean comparisons of size, movement of data, and other operations performed directly on the matrix elements. Of course there are many “other operations,” not involving the matrix elements directly, such as augmenting counters, testing for completion of loops, etc., that go on during the reduction of the matrix, but the two categories above represent a good measure of the work done. We’re not going to include the management of the roundoff error matrix R in our estimates, because its effect would be simply to double the labor involved. Hence, remember to double all of the estimates of the labor that we are about to derive if you’re using the R matrix. We consider a generic stage in the forward solution where we have been given m equations in n unknowns with p right-hand sides, and during the forward solution phase we have just arrived at the [i, i] element.


The next thing to do is to search the Southeast rectangle for the largest element. The rectangle contains about (m − i) ∗ (n − i) elements, so the search requires that many comparisons. Then we exchange two rows (n + p − i operations), exchange two columns (m operations), and divide a row by the pivot element (n + p − i arithmetic operations). Next, for each of the m − i − 1 rows below the ith, and for each of the n + p − i elements of one of those rows, we do two arithmetic operations when we do the elementary row operation that produces a zero in the ith column. This requires, therefore, 2(n + p − i)(m − i − 1) arithmetic operations. For the forward phase of the solution, therefore, we count
    Af = Σ_{i=1}^{r} {2(n + p − i)(m − i − 1) + (n + p − i)}          (3.5.1)

arithmetic operations altogether, where r is the pseudorank of the matrix, because the forward solution halts after the rth row with only zeros below. The non-arithmetic operations in the forward phase amount to
    Nf = Σ_{i=1}^{r} {(m − i)(n − i) + (n + p − i) + m}.          (3.5.2)

Let's leave these sums for a while, and go to the backwards phase of the solution. We do the columns in reverse order, from column r back to 1, and when we have arrived at a generic column j, we want to create zeroes in all of the positions above the 1 in the [j, j] position. To do this we perform the elementary row operation

    row(i) := row(i) − Aij ∗ row(j)          (3.5.3)

to each of the j − 1 rows above row j. Let i be the number of one of these rows. Then, exactly how many elements of row(i) are acted upon by the elementary row operation above? Certainly the elements of row(i) that lie in columns 1 through j − 1 are unaffected, because only zero elements are in row(j) below them, thanks to the forward reduction process. Furthermore, the elements of row(i) that lie in columns j + 1 through r are unaffected, for a different reason. Indeed, any such entry is already zero, because it lies above an entry of 1 in some diagonal position that has already had its turn in the back solution (remember that we're doing the columns in the sequence r, r − 1, . . . , 1). Not only is such an entry zero, but it remains zero, because the entry of row(j) below it is also zero, having previously been deleted by the action of a diagonal element below it. Hence in row(i), the elements that are affected by the elementary row operation (3.5.3) are those that lie in columns j, r + 1, . . . , n, n + 1, . . . , n + p (be sure to write [or modify!] the program so that the row reduction (3.5.3) acts only on those columns!). We have now


shown that exactly n + p − r + 1 entries of each row above row(j) are affected (note that the number is independent of j and i), so
    Ab = Σ_{j=1}^{r} (n + p − r + 1)(j − 1)          (3.5.4)
arithmetic operations are done during the back solution, and no other operations. It remains only to do the various sums, and for this purpose we recall that
    Σ_{i=1}^{N} i = N(N + 1)/2,    Σ_{i=1}^{N} i^2 = N(N + 1)(2N + 1)/6.          (3.5.5)

Then it is straightforward to find the total number of arithmetic operations from Af + Ab as

    Arith(m, n, p, r) = r^3/6 − (r^2/2)(2m + n + p − 5) + ((n + p)(2m − 5/2) − m + 1/3) r          (3.5.6)

and the total of the non-arithmetic operations from Nf as

    NonArith(m, n, p, r) = r^3/3 − (r^2/2)(m + n) + (r/6)(6mn + 3m + 3n + 6p − 2).          (3.5.7)
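The closed form (3.5.6) can be checked directly against the defining sums (3.5.1) and (3.5.4). A short Python verification in exact rational arithmetic (our own sanity check, not part of the book's development):

```python
from fractions import Fraction as F

def arith_sum(m, n, p, r):
    """Af + Ab computed straight from the sums (3.5.1) and (3.5.4)."""
    Af = sum(2 * (n + p - i) * (m - i - 1) + (n + p - i) for i in range(1, r + 1))
    Ab = sum((n + p - r + 1) * (j - 1) for j in range(1, r + 1))
    return Af + Ab

def arith_formula(m, n, p, r):
    """The closed form (3.5.6), in exact rationals."""
    return (F(r**3, 6) - F(r**2, 2) * (2*m + n + p - 5)
            + (F(n + p) * (2*m - F(5, 2)) - m + F(1, 3)) * r)

for case in [(7, 5, 2, 4), (6, 6, 1, 6), (5, 5, 5, 5)]:
    assert arith_formula(*case) == arith_sum(*case)
```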

Let's look at a few important special cases. First, suppose we are solving one system of n equations in n unknowns that has a unique solution. Then we have m = n = r and p = 1. We find that

    Arith(n, n, 1, n) = (2/3) n^3 + O(n^2)          (3.5.8)

where O(n^2) refers to some function of n that is bounded by a constant times n^2 as n grows large. Similarly, for the non-arithmetic operations on matrix elements we find

    NonArith(n, n, 1, n) = (1/3) n^3 + O(n^2)

in this case. It follows that a system of n equations can be solved for about one third of the price, in terms of arithmetic operations, of one matrix multiplication, at least if matrices are multiplied in the usual way (did you know that there is a faster way to multiply two matrices? We will see one later on).

Now what is the price of a matrix inversion by this method? Then we are solving n systems of n equations in n unknowns, all with the same left-hand side. Hence we have r = m = n = p, and we find that

    Arith(n, n, n, n) = (13/6) n^3 + O(n^2).          (3.5.9)

Hence we can invert a matrix by this method for about the same price as solving 3.25 systems of equations! At first glance, it may seem as if the cost should be n times as great

 4 ∗  (3. because they correspond simply to solving the equations in a different sequence. just one right-hand side vector is given (p = 1). Now suppose we have arrived at the end of the back solution. The cost of the non-arithmetic operations remains at 1 n3 + O(n2 ). Its elements are called τj . 5. More precisely. and the permuations that were carried out on the columns are recorded in the array τ : [3.6. how do we express the general solution of the given set of equations? To find the answer. We leave it to the reader to work out the cost of a determinant. and the answers to the original question are before us. . The great economy results.1) The matrix above represents schematically the reduced row echelon form in a problem where there are five unknowns (n = 5). in effect. n. 3 5 2 1 b c e f    h k . The first of these is x3 = c − ax1 − bx4 (3. . just for bookkeeping purposes.  . except that they are scrambled.3. . 1.6 To unscramble the eggs Now we have reached the last of the issues that needs to be discussed in order to plan a complete linear equation solving routine. As we mentioned previously. Here’s an example of the kind of situation that might result:         1 0 0 a 0 1 0 d 0 0 1 g . and it concerns the rearrangement of the output so that it ends up in the right order. we don’t need to keep a record of the row interchanges. from the common left-hand sides. 2. During the operation of the forward solution algorithm we found it necessary to interchange rows and columns so that the largest element of the Southeast rectangle was brought into the pivot position. .2) .6 To unscramble the eggs 105 because we are solving n systems. or want only the rank of A then we need to do only the forward solution. . We must remember the column interchanges that occur along the way though. . the last p of which are not used (refer to (3. To remember the column interchanges we glue onto our array C an additional row.6. . 
where the row contains n + p entries altogether.3. . of course. renumbering the unknowns. . . . let’s go back to the set of equations that (3. .  . ··· ··· . j = 1. 4] shown in the last row of the matrix as it would be storded in a computation. the pseudorank r = 3. 3 If we want only the determinant of a square matrix A.6. or of finding the rank. because each time we do one of them we are.1) stands for.1) to see the complete partitioning of the matrix C). and it is kept at the bottomof the matrix. The question now is. the elements of {τj } are the first n entries of the new last row of the matrix. 3. and we can save the cost of the back solution.

because the numbering of the unknowns is as shown in the τ array. The next two equations are

    x5 = f − dx1 − ex4        (3.6.3)
    x2 = k − gx1 − hx4.

If we add the two trivial equations x1 = x1 and x4 = x4, then we get the whole solution vector which, after re-ordering the equations, can be written as

    [ x1 ]   [ 0 ]           [ −1 ]           [  0 ]
    [ x2 ]   [ k ]           [  g ]           [  h ]
    [ x3 ] = [ c ] + (−x1) · [  a ] + (−x4) · [  b ]        (3.6.4)
    [ x4 ]   [ 0 ]           [  0 ]           [ −1 ]
    [ x5 ]   [ f ]           [  d ]           [  e ]

Now we are looking at a display of the output as we would like our subroutine to give it. The three vectors on the right side of (3.6.4) are, respectively, a particular solution of the given system of equations and the two vectors of a basis for the kernel of the coefficient matrix.

The question can now be rephrased: exactly what operations must be done to the matrix shown in (3.6.1), that represents the situation at the end of the back solution, in order to obtain the three vectors in (3.6.4)? The first things to do are, of course, to append the negative of a 2 × 2 identity matrix to the bottom of the fourth and fifth columns of (3.6.1), and to lengthen the last column on the right by appending two more zeros. That brings us to the matrix

    [ 1  0  0   a   b   c ]
    [ 0  1  0   d   e   f ]
    [ 0  0  1   g   h   k ]
    [ 0  0  0  −1   0   0 ]
    [ 0  0  0   0  −1   0 ]
    [ 3  5  2   1   4   ∗ ]        (3.6.5)

The first two of the three long columns above will be the basis for the kernel, and the last column above will be the particular solution, but only after we do the right rearrangement.

Now here is the punch line: the right rearrangement to do is to permute the rows of those three long columns as described by the permutation τ. That means that the first row becomes the third, the second row becomes the fifth, the third row becomes the second, the fourth row is the new first, and the old fifth row is the new fourth. The reader is invited to carry out on the rows the interchanges just described, and to compare the result with what we want, namely with (3.6.4). It will be seen that we have gotten the desired result.

The point that is just a little surprising is that to undo the column interchanges that are recorded by τ, we do row interchanges. Just roughly, the reason for this is that we begin by wanting to solve Ax = b, and instead we end up solving (AE)y = b, where E is a matrix obtained from the identity by elementary column operations. Hence x = Ey, which means that we must perform row operations on y to recover the answers in the right order.

Now we can leave the example above, and state the rule in general. We are given p systems of m simultaneous equations each, in n unknowns, all having a common m × n coefficient matrix A. At the end of the back solution we will have before us a matrix of the form

    [ I(r, r)   B(r, n − r)   P(r, p) ]        (3.6.6)

where I(r, r) is the r × r identity matrix, r is the pseudorank of A, and B and P are matrices of the sizes shown. We adjoin under B the negative of the (n − r) × (n − r) identity matrix, and under P we adjoin an (n − r) × p block of zeros. Next, we forget the identity matrix on the left, and we consider the entire remaining n × (n − r + p) matrix as a whole; call it T.

Now we exchange the rows of T according to the permutation array τ: row 1 of T will be row τ1 of the new T, row 2 will be row τ2, and so forth. Conceptually, we should regard the old T and the new T as occupying different areas of storage, so that the new T is just a rearrangement of the rows of the old. Then the first n − r columns of the new T are a basis for the kernel of A, and the jth one of the last p columns of the new T is a particular solution of the jth one of the input systems of equations.

Although conceptually we should think of the old T and the new T as occupying distinct arrays in memory, in fact it is perfectly possible to carry out the whole row interchange procedure described above in just one array, without ever "stepping on our own toes," so let's consider that problem. Suppose a linear array a = [a1, . . . , an] is given, along with a permutation array τ = [τ1, . . . , τn]. We want to rearrange the entries of the array a according to the permutation τ without using any additional array storage. Precisely, the present array entry a1 will end up as the output aτ1, the initial a2 will end up as aτ2, etc.

To do this with no extra array storage, let's first pick up the element a1 and move it to aτ1, being careful to store the original aτ1 in a temporary location t so it won't be destroyed. Next we move the contents of t to its destination, and so forth. After a certain number of steps (maybe only 1), we will be back to a1.

Here's an example to help clarify the situation. Suppose the arrays a and τ at input time were:

    a = [5, 9, 13, 7, 2, 8]
    τ = [3, 4, 5, 2, 1, 6].        (3.6.7)

So we move the 5 in position a1 to position a3 (after putting the 13 into a safe place), and then the 13 goes to position a5 (after putting the 2 into a safe place), and the 2 is moved into position a1, and we're back where we started. The a array now has become

    a = [2, 9, 5, 7, 13, 8].        (3.6.8)

The job, however, is not finished. Somehow we have to recognize that the elements a2, a4 and a6 haven't yet been moved, while the others have been moved to their destinations. For this purpose we will flag the array positions. A convenient place to hang a flag is in the sign position of an entry of the array τ, since we're sure that the entries of τ are all supposed to be positive. Therefore, initially we'll change the signs of all of the entries of τ to make them negative. Then as elements are moved around in the a array we will reverse the sign of the corresponding entry of the τ array. In that way we can always begin the next block of entries of a to move by searching τ for a negative entry. When none exist, the job is finished. Here's a complete algorithm, in Maple:

    shuffle:=proc(a,tau,n)
    local i,t,q,u,v;
    #permutes the entries of a according to the permutation tau
    # flag entries of tau with negative signs
    for i from 1 to n do tau[i]:=-tau[i] od;
    for i from 1 to n do
       # has entry i been moved?
       if tau[i]<0 then
          # move the block of entries beginning at a[i]
          t:=i;
          q:=-tau[i];
          tau[i]:=q;
          u:=a[i];
          v:=u;
          while q<>t do
             v:=a[q];
             a[q]:=u;
             u:=v;
             tau[q]:=-tau[q];
             q:=tau[q];
          od;
          a[t]:=v;
       fi;
    od;
    return(1);
    end:

The reader should carefully trace through the complete operation of this algorithm on the sample arrays shown above. In order to apply the method to the linear equation solving program, the entries C[r + 1, i], i = 1, . . . , n are interpreted as τi, and the array a whose entries are going to be moved is one of the columns r + 1, . . . , n + p of the matrix C.

3.7 Eigenvalues and eigenvectors of matrices

Our next topic in numerical linear algebra concerns the computation of the eigenvalues and eigenvectors of matrices. Until further notice, all matrices will be square. If A is n × n, by an eigenvector of A we mean a vector x ≠ 0 such that

    Ax = λx        (3.7.1)

where the scalar λ is called an eigenvalue of A. We say that the eigenvector x corresponds to, or belongs to, the eigenvalue λ. If we refer back to the definition (3.7.1) of eigenvectors we notice that if x is an eigenvector then so is cx, so eigenvectors are determined only up to constant multiples. We will see that in fact the eigenvalues of A are properties of the linear mapping that A represents, rather than of the matrix A itself, so we can exploit changes of basis in the computation of eigenvalues.

For an example, consider the 2 × 2 matrix

    A = [  3  −1 ]
        [ −1   3 ]        (3.7.2)

If we write out the vector equation (3.7.1) for this matrix, it becomes the two scalar equations

    3x1 − x2 = λx1
    −x1 + 3x2 = λx2.        (3.7.3)

These are two homogeneous equations in two unknowns, and therefore they have no solution other than the zero vector unless the determinant

    | 3 − λ    −1  |
    |  −1    3 − λ |        (3.7.4)

is equal to zero. This condition yields a quadratic equation for λ whose two roots are λ = 2 and λ = 4. These are the two eigenvalues of the matrix (3.7.2).

For the same 2 × 2 example, let's now find the eigenvectors (by a method that doesn't bear the slightest resemblance to the numerical method that we will discuss later). First, to find the eigenvector that belongs to the eigenvalue λ = 2, we return to (3.7.3) and replace λ by 2 to obtain the two equations

    x1 − x2 = 0
    −x1 + x2 = 0.        (3.7.5)

These equations are, of course, redundant, since λ was chosen to make them so. They are satisfied by any vector x of the form c ∗ [1, 1], where c is an arbitrary constant. The first eigenvector of our 2 × 2 matrix is therefore any multiple of the vector [1, 1].

To find the eigenvector that belongs to the eigenvalue λ = 4, we go back to (3.7.3), replace λ by 4, and solve the equations. The result is that any scalar multiple of the vector [1, −1] is an eigenvector corresponding to the eigenvalue λ = 4.

The two statements that [1, 1] is an eigenvector and that [1, −1] is an eigenvector can either be written as two vector equations:

    [  3  −1 ] [ 1 ]  =  2 [ 1 ]        [  3  −1 ] [  1 ]  =  4 [  1 ]
    [ −1   3 ] [ 1 ]       [ 1 ],       [ −1   3 ] [ −1 ]       [ −1 ]        (3.7.6)

or as a single matrix equation

    [  3  −1 ] [ 1   1 ]  =  [ 1   1 ] [ 2  0 ]
    [ −1   3 ] [ 1  −1 ]     [ 1  −1 ] [ 0  4 ].        (3.7.7)
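These eigenpair claims can be verified by direct multiplication; here is a quick check (a Python sketch with a small matvec helper of our own, not part of the text's programs):

```python
# The matrix A of (3.7.2) and its two claimed eigenpairs.
A = [[3, -1],
     [-1, 3]]

def matvec(M, x):
    # ordinary matrix-vector product
    return [sum(M[i][j] * x[j] for j in range(len(x))) for i in range(len(M))]

v1, lam1 = [1, 1], 2     # eigenpair for lambda = 2
v2, lam2 = [1, -1], 4    # eigenpair for lambda = 4
```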

Observe that the matrix equation (3.7.7) states that AP = PΛ, where A is the given 2 × 2 matrix, P is a (nonsingular) matrix whose columns are eigenvectors of A, and Λ is the diagonal matrix that carries the eigenvalues of A down the diagonal (in order corresponding to the eigenvectors in the columns of P).

This matrix equation AP = PΛ leads to one of the many important areas of application of the theory of eigenvalues, namely to the computation of functions of matrices. Suppose we want to calculate A^2147, where A is the 2 × 2 matrix (3.7.2). A direct calculation, by raising A to higher and higher powers, would take quite a while (although not as long as one might think at first sight! Exactly what powers of A would you compute? How many matrix multiplications would be required?).

A better way is to begin with the relation AP = PΛ and to observe that in this case the matrix P is nonsingular; the nonsingularity of P is equivalent to the linear independence of the eigenvectors. Hence P has an inverse, and we can write

    A = P Λ P^(-1).        (3.7.8)

This is called the spectral representation of A, and the set of eigenvalues is often called the spectrum of A.

Equation (3.7.8) is very helpful in computing powers of A. For instance,

    A² = (P Λ P^(-1))(P Λ P^(-1)) = P Λ² P^(-1),        (3.7.9)

and for every m, A^m = P Λ^m P^(-1). It is of course quite easy to find high powers of the diagonal matrix Λ, because we need only raise the entries on the diagonal to that power. Thus for example,

    A^2147 = [ 1   1 ] [ 2^2147     0    ] [ 1/2   1/2 ]
             [ 1  −1 ] [   0     4^2147  ] [ 1/2  −1/2 ].        (3.7.10)

Not only can we compute powers from the spectral representation (3.7.8), we can equally well obtain any polynomial in the matrix A. Indeed if f is any polynomial, then

    f(A) = P f(Λ) P^(-1),        (3.7.11)

and f(Λ) is easy to calculate because it just has the numbers f(λi) down the diagonal and zeros elsewhere. So for instance,

    13A³ + 78A^19 − 43A^31 = P (13Λ³ + 78Λ^19 − 43Λ^31) P^(-1).

Finally, it's just a short hop to the conclusion that (3.7.11) remains valid even if f is not a polynomial, but is represented by an everywhere-convergent power series (we don't even need that much, but this statement suffices for our present purposes). For instance, if A is the above 2 × 2 matrix, then

    e^A = P e^Λ P^(-1),        (3.7.12)

where e^Λ has e² and e⁴ on its diagonal.
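As an illustration of computing powers via the spectral representation, the following sketch (Python; the helper names are ours) compares A^k = P Λ^k P^(-1) with repeated multiplication for the 2 × 2 matrix (3.7.2):

```python
def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

A = [[3, -1], [-1, 3]]
P = [[1, 1], [1, -1]]               # eigenvectors in the columns
Pinv = [[0.5, 0.5], [0.5, -0.5]]    # inverse of P

def power_spectral(k):
    # A^k = P diag(2^k, 4^k) P^{-1}; entries are exact here, so round back to ints
    Lk = [[2**k, 0], [0, 4**k]]
    M = matmul(matmul(P, Lk), Pinv)
    return [[round(x) for x in row] for row in M]

def power_direct(k):
    # the slow way: k successive matrix multiplications
    M = [[1, 0], [0, 1]]
    for _ in range(k):
        M = matmul(M, A)
    return M
```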

We have now arrived at a very important area of application of eigenvalues and eigenvectors, namely the solution of systems of differential equations. A system of n linear simultaneous differential equations in n unknown functions can be written simply as y′ = Ay, with, say, y(0) given as initial data. The solution of this system of differential equations is y(t) = e^(At) y(0), where the matrix e^(At) is calculated by writing A = P Λ P^(-1), if possible, and then putting e^(At) = P e^(Λt) P^(-1). Hence, whenever we can find the spectral representation of a matrix A, we can calculate functions of the matrix and can solve differential equations that involve the matrix.

So, when can we find a spectral representation of a given n × n matrix A? If we can find a set of n linearly independent eigenvectors for A, then all we need to do is to arrange them in the columns of a new matrix P. Then P will be invertible, we will have AP = PΛ, and we'll be all finished. Conversely, if we somehow have found a spectral representation of A à la (3.7.8), then the columns of P obviously do comprise a set of n independent eigenvectors of A.

What kind of an n × n matrix A has a set of n linearly independent eigenvectors? This is quite a hard problem, and we won't answer it completely. Instead, we give an example of a matrix that does not have as many independent eigenvectors as it "ought to," and then we'll specialize our discussion to a kind of matrix that is guaranteed to have a spectral representation.

For an example we don't have to look any further than

    A = [ 0  1 ]
        [ 0  0 ].        (3.7.13)

The reader will have no difficulty in checking that this matrix has just one eigenvalue, λ = 0, and that corresponding to that eigenvalue there is just one independent eigenvector, and therefore there is no spectral representation of this matrix.

Now we're going to devote our attention to the real symmetric matrices, i.e., to matrices A for which Aij = Aji for all i, j = 1, . . . , n. These matrices occur in many important applications, and they always have a spectral representation. Indeed, much more is true, as is shown by the following fundamental theorem of the subject, whose proof is deferred to section 3.9, where it will emerge (see Theorem 3.9.1) as a corollary of an algorithm.

Theorem 3.7.1 (The Spectral Theorem) – Let A be an n × n real symmetric matrix. Then the eigenvalues and eigenvectors of A are real. Furthermore, we can always find a set of n eigenvectors of A that are pairwise orthogonal to each other (so they are surely independent).

Recall that the eigenvectors of the symmetric 2 × 2 matrix (3.7.2) were [1, 1] and [1, −1], and these are indeed orthogonal to each other, though we didn't comment on it at the time.

We're going to follow a slightly unusual route now, one that will lead us simultaneously to a proof of the fundamental theorem (the "spectral theorem") above, and to a very elegant

computer algorithm, called the method of Jacobi, for the computation of eigenvalues and eigenvectors of real symmetric matrices. In the next section we will introduce a very special family of matrices, first studied by Jacobi, and we will examine their properties in some detail. Once we understand these properties, a proof of the spectral theorem will appear, with almost no additional work. Following that we will show how the algorithm of Jacobi can be implemented on a computer, as a fast and pretty program in which all of the eigenvalues and eigenvectors of a real symmetric matrix are found simultaneously.

Throughout these algorithms certain themes will recur. For example, we will see several situations in which we have to compute a certain angle and then carry out a rotation of space through that angle. Since the themes occur so often, we are going to abstract from them certain basic modules of algorithms that will be used repeatedly. This choice will greatly simplify the preparation of programs, but at a price, namely that each module will not always be exactly optimal in terms of machine time for execution in each application, although it will be nearly so. Consequently it was felt that the price was worth the benefit of greater universality. We'll discuss these points further, in context, as they arise.

3.8 The orthogonal matrices of Jacobi

A matrix P is called an orthogonal matrix if it is real, square, and if P^(-1) = P^T, i.e., if P^T P = P P^T = I. For example, the 2 × 2 matrix

    [  cos θ   sin θ ]
    [ −sin θ   cos θ ]        (3.8.1)

is an orthogonal matrix for every real θ. If we visualize the way a matrix is multiplied by its transpose, it will be clear that an orthogonal matrix is one in which each of the rows (columns) is a unit vector and any two distinct rows (columns) are orthogonal to each other.

We will soon prove that a real symmetric matrix always has a set of n pairwise orthogonal eigenvectors. If we take such a set of vectors, normalize them by dividing each by its length, and arrange them in the consecutive columns of a matrix P, then P will be an orthogonal matrix, and further we will have AP = PΛ. Since P^T = P^(-1), we can multiply on the right by P^T and obtain

    A = P Λ P^T,        (3.8.2)

and this is the spectral theorem for a symmetric matrix A. Conversely, if we can find an orthogonal matrix P such that P^T A P is a diagonal matrix D, then we will have found a complete set of pairwise orthogonal eigenvectors of A (the columns of P), and the eigenvalues of A (on the diagonal of D).
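The defining property P^T P = I is easy to check numerically for the rotation matrix (3.8.1); a minimal sketch (Python; helper names are ours):

```python
import math

def rotation(theta):
    # the 2x2 orthogonal matrix of (3.8.1)
    c, s = math.cos(theta), math.sin(theta)
    return [[c, s], [-s, c]]

def transpose(M):
    return [list(row) for row in zip(*M)]

def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

R = rotation(0.7)
I_check = matmul(transpose(R), R)   # should be the 2x2 identity
```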

In this section we are going to describe a numerical procedure that will find such an orthogonal matrix, given a real symmetric matrix A. It turns out that this can always be done. It is important to notice that we will not have to find an eigenvalue, then find a corresponding eigenvector, then find another eigenvalue and another vector, etc. Instead, the whole orthogonal matrix whose columns are the desired vectors will creep up on us at once. As soon as we prove that the method works, we will have proved the spectral theorem at the same time. Hence the method is of theoretical as well as algorithmic importance.

The first thing we have to do is to describe some special orthogonal matrices that will be used in the algorithm. Let n, p and q be given positive integers, with n ≥ 2 and p ≠ q, and let θ be a real number. We define the matrix Jpq(θ) by saying that J is just like the n × n identity matrix except that in the four positions that lie at the intersections of rows and columns p and q we find the entries of the matrix (3.8.1). More precisely, Jpq(θ) has in position [p, p] the entry cos θ, it has sin θ in the [p, q] entry, −sin θ in the [q, p] entry, cos θ in entry [q, q], and otherwise it agrees with the identity matrix, as shown below:

                          column p         column q
              ⎡ 1                                        ⎤
              ⎢    ·                                     ⎥
    row p →   ⎢         cos θ    · · ·     sin θ         ⎥
              ⎢           ·        ·         ·           ⎥
    row q →   ⎢        −sin θ    · · ·     cos θ         ⎥
              ⎢                                   ·      ⎥
              ⎣                                        1 ⎦        (3.8.3)

Not only is Jpq(θ) an orthogonal matrix; there is a reasonably pleasant way to picture its action on n-dimensional space. Since the 2 × 2 matrix of (3.8.1) is the familiar rotation of the plane through an angle θ, we can say that the matrix Jpq(θ) carries out a special kind of rotation of n-dimensional space, namely one in which a certain plane, the plane of the pth and qth coordinates, is rotated through the angle θ, and the remaining coordinates are all left alone. Hence Jpq(θ) carries out a two-dimensional rotation of n-dimensional space.

These matrices of Jacobi turn out to be useful in a host of numerical algorithms for the eigenproblem. The first application that we'll make of them will be to the real symmetric matrices, but later we'll find that the same two-dimensional rotations will play important roles in the solution of non-symmetric problems as well.

What we propose to do is the following. If a real symmetric matrix A is given, we will determine p, q, and the angle θ in such a way that the matrix JAJ^T is a little bit more diagonal (whatever that means!) than A is, at any rate unless A is already diagonal. So we will have the germ of a numerical procedure for computing eigenvalues and eigenvectors. First, let's see how they can help us with symmetric matrices.
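A direct way to construct Jpq(θ) and confirm that it is orthogonal (a Python sketch with 0-based indices; the names are ours):

```python
import math

def J(n, p, q, theta):
    """The matrix J_pq(theta): the n-by-n identity except for the four
    entries at the intersections of rows and columns p and q
    (0-based indices here)."""
    c, s = math.cos(theta), math.sin(theta)
    M = [[float(i == j) for j in range(n)] for i in range(n)]
    M[p][p] = c
    M[p][q] = s
    M[q][p] = -s
    M[q][q] = c
    return M

def transpose(M):
    return [list(row) for row in zip(*M)]

def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]
```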

We'll define what we mean by "more diagonal." For any square, real matrix A, let Od(A) denote the sum of the squares of the off-diagonal entries of A. From now on, instead of "B is more diagonal than A," we'll be able to say Od(B) < Od(A), which is much more professional.

Now we claim that if A is a real symmetric matrix, and it is not already a diagonal matrix, then we can find p, q and θ such that

    Od(Jpq(θ) A Jpq(θ)^T) < Od(A).        (3.8.4)

Indeed, suppose we have found out how to determine such an angle θ, and let's then see what the whole process would look like. Starting with A, we would find p, q and θ, and then the matrix JAJ^T is somehow a little more diagonal than A was. Now JAJ^T is still a symmetric matrix (try to transpose it and see what happens), so we can do it again. After finding another p, q and θ we will have J′J A (J′J)^T a bit "more diagonal," and so forth.

Now suppose that after some large number of repetitions of this process we find that the current matrix is very diagonal indeed, so that perhaps aside from roundoff error it is a diagonal matrix D. Then we will know that

    D = (product of all J's used) A (product of all J's used)^T.

If we let P denote the product of all J's used, then we have PAP^T = D, so the columns of P^T (that is, the rows of P) will be the (approximate) eigenvectors of A and the diagonal elements of D will be its eigenvalues. The matrix P will automatically be an orthogonal matrix, since it is the product of such matrices, and the product of orthogonal matrices is always orthogonal (proof?).

That, at any rate, is the main idea of Jacobi's method (he introduced it in order to study planetary orbits!). Let's now fill in the details, by a very direct computation of the elements of JAJ^T (we'll need them anyway for the computer program), after which we will be able to see what the new value of Od is. So fix p, q and θ. Then by direct multiplication of the matrix in (3.8.3) by A we find that

    (JA)ij =  (cos θ)Apj + (sin θ)Aqj      if i = p
              −(sin θ)Apj + (cos θ)Aqj     if i = q
              Aij                          otherwise        (3.8.5)

Then after one more multiplication, this time on the right by the transpose of the matrix in (3.8.3), we find that

    (JAJ^T)ij =  Aij                                   if i ∉ {p, q} and j ∉ {p, q}
                 C·Aip + S·Aiq                         if i ∉ {p, q} and j = p
                 −S·Aip + C·Aiq                        if i ∉ {p, q} and j = q
                 C²App + 2SC·Apq + S²Aqq               if i = j = p
                 S²App − 2SC·Apq + C²Aqq               if i = j = q
                 CS(Aqq − App) + (C² − S²)Apq          if i = p, j = q or i = q, j = p        (3.8.6)

In (3.8.6) we have written C for cos θ and S for sin θ (the remaining entries, with i ∈ {p, q} and j ∉ {p, q}, follow from the symmetry of JAJ^T).

Now we are going to choose the angle θ so that the elements Apq and Aqp are reduced to zero, assuming that they were not already zero. To do this we refer to the formula in (3.8.6) for the new element in position [p, q], equate it to zero, and then solve for θ.

However. according to the formulas (3.8. After each rotation. namely the plane rotation in n-dimensional space that sends a real symmetric matrix A into JAJ T . If we keep later applications in mind.8) 3. If Apq = 0 for some p = q. and execute the operation (3. We will prove that Od(A) converges to zero.1 Let A be an n × n real. equate it to zero. but remembering that the new Apq =0. multiply a given not-necessarily-symmetric matrix on the left by J or on the right by J T .6) of the previous section.8. on demand.8. depending on the call. then it’s quite easy to check that the new sum of squares is exactly equal to the old sum of squares minus the squares of the two entries Apq and Aqp that were reduced to zero. Hence we have Theorem 3. then we can choose θ as in equation (3. Hence we should not think of the zero as “staying put”.8. then the best choice for a module will be one that will.7) and we will choose the value of θ that lies between − π and 4 With this value of θ.8.9 Convergence of the Jacobi method We have now described the fundamental operation of the Jacobi algorithm.3. and we have explicit formulas for the new matrix elements. If we sum the squares of all of the off-diagonal elements of JAJ T using the formulas (3.6).8. This is one of the situations we referred to earlier . It is important to note that a plane rotation that annihilates Apq may “revive” some other Ars that was set to zero by an earlier plane rotation. we are now ready to prepare the first module of the Jacobi program.6) for Apq . each time choosing the largest off-diagonal element Apq and annihilating it by the choice (3. we will have reduced one single off-diagonal element of A to zero in the new symmetric matrix JAJ T . and then solve for θ. symmetric matrix that is not diagonal. in line with the philosophy that it is best to break up large programs into small.6). The result is tan 2θ = 2Apq App − Aqq π 4. 
The full Jacobi algorithm consists in repeatedly executing these plane rotations.8. Let’s now see exactly what happens to Od(A) after a single rotation.8.8. There are still a number of quite substantive points to discuss before we will be able to assemble an efficient program for carrying out the method.7) of θ. What we want is to be able to execute the rotation through an angle θ.9 Convergence of the Jacobi method (3. Od(A) will be a little smaller than it was before. This could be accomplished by a single subroutine that would take the symmetric matrix A and the sine and cosine of the rotation angle. manageable chunks. pq (3.7) so that if J = Jpq (θ) then Od(JAJ T ) = Od(A) − 2A2 < Od(A). 115 (3.

where the most universal choice of subroutine will not be the most economical one in every application, but we will get a lot of mileage out of this routine! Hence, suppose we are given

1. an n × n real matrix A;
2. the sine S and cosine C of a rotation angle θ;
3. the plane [p, q] of the rotation; and
4. a parameter option that will equal 1 if we want to do JA, and 2 if we want AJ^T.

What the procedure will do is exactly this. If called with option = 1, it will multiply A on the left by J, according to the formulas (3.8.5) of the previous section, and exit. If option = 2, it will multiply A on the right by J^T. The procedure will be called Procedure rotate(s,c,p,q,option). The amount of computational labor that is done by this module is O(n) per call, since only two lines of the matrix are affected by its operation. The Maple procedure is as follows:

    rotate:=proc(s,c,p,q,opt)
    local j,temp;
    global A,n;
    if opt=1 then
       for j from 1 to n do
          temp:=evalf(c*A[p,j]+s*A[q,j]);
          A[q,j]:=-s*A[p,j]+c*A[q,j];
          A[p,j]:=temp;
       od
    else
       for j from 1 to n do
          temp:=c*A[j,p]+s*A[j,q];
          A[j,q]:=-s*A[j,p]+c*A[j,q];
          A[j,p]:=temp;
       od
    fi;
    RETURN();
    end:

To carry out one iteration of the Jacobi method, we will have to call rotate twice, once with option = 1 and then with option = 2. Next, let's prove that the results of applying one rotation after another do in fact converge to a diagonal matrix.
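Before the proof, readers not using Maple may find a line-by-line transcription of rotate helpful (Python, 0-based indices; the matrix is passed as an argument rather than being global):

```python
def rotate(A, s, c, p, q, option):
    """If option == 1, replace A by J*A (J = J_pq(theta), s = sin theta,
    c = cos theta); if option == 2, replace A by A*J^T.  Only the two
    affected rows (or columns) are touched, so the cost is O(n) per call."""
    n = len(A)
    if option == 1:
        for j in range(n):
            temp = c * A[p][j] + s * A[q][j]
            A[q][j] = -s * A[p][j] + c * A[q][j]
            A[p][j] = temp
    else:
        for j in range(n):
            temp = c * A[j][p] + s * A[j][q]
            A[j][q] = -s * A[j][p] + c * A[j][q]
            A[j][p] = temp
    return A
```

A rotation through θ followed by one through −θ restores the matrix, which makes a convenient sanity check.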

Theorem 3.9.1 Let A be a real symmetric matrix. Suppose we follow the strategy of searching for the off-diagonal element of largest absolute value, choosing θ so as to zero out that element by carrying out a Jacobi rotation on A, and then repeating the whole process on the resulting matrix, etc. Then the sequence of matrices that is thereby obtained approaches a diagonal matrix D.

Proof. At a certain stage of the iteration, let Apq denote the off-diagonal element of largest absolute value. Since the maximum of any set of numbers is at least as big as the average of that set (proof?), it follows that the maximum of the squares of the off-diagonal elements of A is at least as big as the average square of an off-diagonal element. The average square is equal to the sum of the squares of the off-diagonal elements divided by the number of such elements, namely n(n − 1). Hence the average square is exactly Od(A)/(n(n − 1)), and therefore

    Apq² ≥ Od(A) / (n(n − 1)).        (3.9.1)

Now the effect of a single rotation of the matrix is to reduce Od(A) by 2Apq², so equation (3.8.8) yields

    Od(JAJ^T) = Od(A) − 2Apq²
              ≤ Od(A) − 2·Od(A)/(n(n − 1))
              = (1 − 2/(n(n − 1))) · Od(A).        (3.9.2)

Hence a single rotation will reduce Od(A) by a multiplicative factor of 1 − 2/(n(n − 1)) at least. Since this factor is less than 1, it follows that the sum of squares of the off-diagonal entries approaches zero as the number of plane rotations grows without bound, completing the proof.

The proof told us even more, since it produced a quantitative estimate of the rate at which Od(A) approaches zero. Indeed, after r rotations, the sum of squares will have dropped to at most

    (1 − 2/(n(n − 1)))^r · Od(original A).        (3.9.3)

If we put r = n(n − 1)/2, then we see that Od(A) has dropped by at least a factor of (approximately) e, i.e., after doing an average of one rotation per off-diagonal element, the function Od(A) is no more than 1/e times its original value. After doing an average of, say, t rotations per off-diagonal element (i.e., tn(n − 1)/2 rotations), the function Od(A) will have dropped to about e^(−t) times its original value. If we want it to drop to, say, 10^(−m) times its original value then we can expect to need no more than about m(ln 10)n(n − 1)/2 rotations.

To put it in very concrete terms, suppose we're working in double precision (12-digit) arithmetic and we are willing to decree that convergence has taken place if Od has been reduced by a factor of 10^(−12). Then at most 12(ln 10)n(n − 1)/2 < 6(ln 10)n² ≈ 13.8n² rotations will have to be done. Of course in practice we will be watching the function Od(A) as it drops, and we can stop when the actual observed value is small enough.
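The whole strategy of Theorem 3.9.1 (search for the largest off-diagonal element, annihilate it, repeat until Od(A) is below a threshold) fits in a few lines. A sketch (Python; the function name is ours), tried on a 3 × 3 matrix whose eigenvalues are known exactly:

```python
import math

def jacobi_eigenvalues(A, tol=1e-14, max_iter=1000):
    """Classical Jacobi iteration: at each step annihilate the
    off-diagonal element of largest absolute value; stop when
    Od(A) < tol.  Returns the sorted diagonal (the eigenvalues)."""
    n = len(A)
    A = [row[:] for row in A]

    def od():
        return sum(A[i][j]**2 for i in range(n) for j in range(n) if i != j)

    it = 0
    while od() > tol and it < max_iter:
        it += 1
        # search for the off-diagonal element of largest absolute value
        p, q = max(((i, j) for i in range(n) for j in range(i + 1, n)),
                   key=lambda ij: abs(A[ij[0]][ij[1]]))
        if A[p][q] == 0.0:
            break
        # the angle of equation (3.8.7)
        theta = 0.5 * math.atan2(2 * A[p][q], A[p][p] - A[q][q])
        c, s = math.cos(theta), math.sin(theta)
        # A <- J A: combine rows p and q
        for j in range(n):
            Apj, Aqj = A[p][j], A[q][j]
            A[p][j] = c * Apj + s * Aqj
            A[q][j] = -s * Apj + c * Aqj
        # A <- A J^T: combine columns p and q
        for i in range(n):
            Aip, Aiq = A[i][p], A[i][q]
            A[i][p] = c * Aip + s * Aiq
            A[i][q] = -s * Aip + c * Aiq
    return sorted(A[i][i] for i in range(n))
```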

One thing we do want to do, however, is to prove the spectral theorem, since we have long since done all of the work.

Proof (of the Spectral Theorem 3.7.1): Consider the mapping f that associates with every orthogonal matrix P the matrix f(P) = PAP^T. The set of orthogonal matrices is compact, and the mapping f is continuous. Hence the image of the set of orthogonal matrices under f is compact, and so there is a matrix D in that image that minimizes the continuous function Od(f(P)) = Od(PAP^T). Suppose D is not diagonal. Then we could find a Jacobi rotation that would produce another matrix in the image whose Od would be lower, which is a contradiction (of the fact that Od(D) was minimal). Hence D is diagonal. So there is an orthogonal matrix P such that PAP^T = D. Equivalently, AP^T = P^T D, so the columns of P^T, or equivalently the rows of P, are n pairwise orthogonal eigenvectors of A, and the proof of the spectral theorem is complete.

Now let's re-direct our thoughts to the grand iteration process itself. At each step we apply a rotation matrix to the current symmetric matrix in order to make it "more diagonal." Let's watch this happen. Begin with A. After one rotation we have J1 A J1^T; after two iterations we have J2 J1 A J1^T J2^T; after three iterations, J3 J2 J1 A J1^T J2^T J3^T; etc. We have not proved that these matrices themselves converge to a certain fixed matrix (this is true, but we omit the proof). Still, since the off-diagonal sum of squares approaches zero, we have indeed proved that the repeated rotations will diagonalize A: after enough rotations we are looking at a matrix that is "diagonal enough" for our purposes.

At the same time, we must keep track of the product of all of the rotation matrices that we have so far used, because that product is ultimately an orthogonal matrix with the eigenvectors of A across its rows. Indeed, when the iteration halts, the matrix we see is PAP^T = D, where P is obtained by starting with the identity matrix and multiplying successively on the left by the rotation matrices J that are used. Since PAP^T = D, and D is (virtually) diagonal, the rows of P are the (approximate) eigenvectors of A.

Now, we have seen that O(n^2) rotations will be sufficient to reduce the off-diagonal sum of squares below some pre-assigned threshold level. It's comforting to know that at most O(n^2) iterations will be enough to do the job. Now let's get on with the implementation of the algorithm.
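The grand iteration just described — rotate, accumulate the product P of the rotations, stop when the off-diagonal sum of squares is small enough — can be sketched compactly in Python (the book's own implementation, in Maple, appears in section 3.11; the function name jacobi_eigen is ours):

```python
import numpy as np

def jacobi_eigen(A, tol=1e-12):
    """Diagonalize a real symmetric A by repeatedly zeroing the largest
    off-diagonal element.  Returns (D, P) with P A P^T = D (nearly
    diagonal); the rows of P are the approximate eigenvectors."""
    A = np.array(A, dtype=float)
    n = len(A)
    P = np.eye(n)
    od = lambda M: float((M**2).sum() - (np.diag(M)**2).sum())
    target = tol*od(A)
    while od(A) > target:
        M = np.abs(A) - np.diag(np.abs(np.diag(A)))
        p, q = divmod(int(M.argmax()), n)
        x, y = 2.0*A[p, q], A[p, p] - A[q, q]
        t = np.hypot(x, y)
        sgn = 1.0 if x*y >= 0 else -1.0
        s = sgn*np.sqrt(0.5*(1.0 - abs(y)/t))
        c = np.sqrt(0.5*(1.0 + abs(y)/t))
        J = np.eye(n)
        J[p, p], J[p, q], J[q, p], J[q, q] = c, s, -s, c
        A = J @ A @ J.T        # J_k ... J_1 A J_1^T ... J_k^T
        P = J @ P              # multiply successively on the left by each J
    return A, P

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5)); B = B + B.T
D, P = jacobi_eigen(B)

assert np.allclose(P @ P.T, np.eye(5), atol=1e-10)            # P is orthogonal
assert np.allclose(P @ B @ P.T, np.diag(np.diag(D)), atol=1e-4)  # nearly diagonal
assert np.allclose(np.sort(np.diag(D)),
                   np.linalg.eigvalsh(B), atol=1e-4)          # diagonal = eigenvalues
```

The final assertions confirm the two claims above: P stays orthogonal, and P B P^T converges to a diagonal matrix whose entries are the eigenvalues.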

3.10 Corbató's idea and the implementation of the Jacobi algorithm

It's time to sit down with our accountants and add up the costs of the Jacobi method. First, we have seen that O(n^2) rotations will be sufficient to reduce the off-diagonal sum of squares below some pre-assigned threshold level. Next, what is the price of a single rotation? Here are the steps:

(i) Search for the off-diagonal element having the largest absolute value. The cost seems to be equal to the number of elements that have to be looked at, namely n(n-1)/2, which we abbreviate as O(n^2).

(ii) Calculate θ, sin θ and cos θ, and then carry out a rotation on the matrix A, a nontrivial operation. This costs O(n), since only four lines of A are changed.

(iii) Update the matrix P of eigenvectors by multiplying it by the rotation matrix. Since only two rows of P change, this cost is O(n) also.

The longest part of the job is the search for the largest off-diagonal element: the search is n times as expensive in time as either the rotation of A or the update of the eigenvector matrix. For this reason, a number of other strategies for dealing with the eigenvalue problem have been worked out in the years since Jacobi first described his algorithm.

One of these is called the cyclic Jacobi method. In that variation, one does not search, but instead goes marching through the matrix one element at a time. That is to say, first do a rotation that reduces A12 to zero (of course, A12 doesn't stay put, once reduced to zero, but becomes nonzero again!). Next do a rotation that reduces A13 to zero. Then do A14, and so forth, returning to A12 again after Ann, cycling as long as necessary. This method avoids the search, but the proof that it converges at all is quite complex, and the exact rate of convergence is unknown.

A variation on the cyclic method is called the threshold Jacobi method, in which we go through the entries cyclically as above, but we do not carry out a rotation unless the magnitude of the current matrix entry exceeds a certain threshold ("throw it back, it's too small"). This method also has an uncertain rate of convergence.

At a deeper level, two newer methods, due to Givens and to Householder, have been developed. These methods work not by trying to diagonalize A by rotations, but instead by tri-diagonalizing A by rotations. A tri-diagonal matrix is one whose entries are all zero except for those on the diagonal, the sub-diagonal and the super-diagonal (i.e., Aij = 0 unless |i - j| <= 1). The advantage of tri-diagonalization is that it is a finite process: it can be done in such a way that elements, once reduced to zero, stay zero instead of bouncing back again as they do in the Jacobi method. The disadvantage is that, having arrived at a tri-diagonal matrix, one is not finished; one must then confront the question of obtaining the eigenvalues and eigenvectors of a tri-diagonal matrix. One of the reasons for the wide use of the Givens and Householder methods has been that they get the answers in just O(n^3) time, instead of the O(n^4) time in which the original Jacobi method operates.

Thanks to a suggestion of Corbató (F. J. Corbató, "On the coding of Jacobi's method for computing eigenvalues and eigenvectors of symmetric matrices," JACM 10 (1963), 123-125), however, it is now easy to run the original Jacobi method in O(n^3) time also. The suggestion is all the more remarkable because of its simplicity and the fact that it lives at the software level, rather than at the mathematical level. What it does is just this: it allows us to use the largest off-diagonal entry at each stage while paying a price of a mere O(n) for the privilege, instead of the O(n^2) billing mentioned above.

We do this by carrying along an additional linear array during the calculation. The ith entry of this array, say loc[i], contains the number of a column in which an off-diagonal element of largest absolute value in row i lives.
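The parenthetical remark about the cyclic method — that A12, once reduced to zero, becomes nonzero again after later rotations — is easy to see numerically. In the Python sketch below (our illustration, not the book's code; rotate_to_zero uses the sin/cos formulas of section 3.11), we zero the (1,2) entry of a 3 × 3 symmetric matrix, then zero the (1,3) entry, and watch the (1,2) entry bounce back:

```python
import numpy as np

def rotate_to_zero(A, p, q):
    """Return J A J^T, where J is the plane rotation that zeroes A[p,q]
    (sin and cos from the formulas of section 3.11)."""
    x, y = 2.0*A[p, q], A[p, p] - A[q, q]
    t = np.hypot(x, y)
    sgn = 1.0 if x*y >= 0 else -1.0
    s = sgn*np.sqrt(0.5*(1.0 - abs(y)/t))
    c = np.sqrt(0.5*(1.0 + abs(y)/t))
    J = np.eye(len(A))
    J[p, p], J[p, q], J[q, p], J[q, q] = c, s, -s, c
    return J @ A @ J.T

A = np.array([[4.0, 1.0, 2.0],
              [1.0, 3.0, 0.5],
              [2.0, 0.5, 1.0]])
B1 = rotate_to_zero(A, 0, 1)       # zero out A12 (0-based indices 0,1 here)
assert abs(B1[0, 1]) < 1e-12       # ... and it is indeed zero
B2 = rotate_to_zero(B1, 0, 2)      # now zero out A13
assert abs(B2[0, 2]) < 1e-12
assert abs(B2[0, 1]) > 1e-6        # but A12 has bounced back to nonzero
```

This is why the cyclic method must keep cycling, and why its convergence analysis is delicate.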

Given such an array loc, it would be a simple matter to find the biggest off-diagonal element of the entire matrix A: we would just look through the n - 1 numbers |A_{i,loc(i)}| (i = 1, 2, ..., n - 1) to find the largest one. If the largest one is the one where i = p, then the desired matrix element is A_{p,loc(p)}. This costs only O(n) operations.

Now of course, if some benefactor were kind enough to hand us this array, that would settle the matter — but Christmas comes just once a year. Since there are no such benefactors, we are going to have to pay a price for the care and feeding of this array. How much does it cost to create and to maintain it? Initially, we just search through the whole matrix and set it up. This clearly costs O(n^2) operations, but we pay it just once.

Now let's turn to a typical intermediate stage in the calculation. Suppose we carry out a single rotation on the matrix A, in the plane of p and q, and ask what the price is for updating the loc array. Precisely how do we go about modifying loc so that it will correspond to the new matrix? The rotated matrix differs from the previous matrix in exactly two rows and two columns: the pth and the qth. Certainly the two rows, p and q, that have been completely changed will simply have to be searched again in order to find the new values of loc.

What about the other n - 2 rows of the matrix? In the ith one of those rows exactly two entries were changed, namely the ones in the pth column and in the qth column. Suppose the largest element that was previously in row i was not in either the pth or the qth column, i.e., suppose loc[i] ∉ {p, q}. Then that previous largest element will still be there in the rotated matrix. In that case, in order to discover the new entry of largest absolute value in row i, we need only compare at most three numbers: the new |A_{ip}|, the new |A_{iq}|, and the old |A_{i,loc(i)}|. The column in which the largest of these three numbers is found will be the new loc[i]. The price paid is at most three comparisons, and this does the updating job in every row except those that happen to have had loc[i] ∈ {p, q}.

If we are replacing the entry of previously largest absolute value in the row i, we might after all get lucky and replace it with an even larger number, in which case we would again know the new loc(i). But since the general trend of the off-diagonal entries is downwards, most of the time we'll be replacing the former largest entry with a smaller one. In that case we'll just have to re-search the entire row i to find the new champion.

The number of rows that must be searched in their entireties in order to update the loc array is therefore at most two (for rows p and q) plus the number of rows i in which loc[i] happens to be equal to p or to q. It is reasonable to expect that the probability of the event loc[i] ∈ {p, q} is about two chances out of n, since it seems that it ought not to be any more likely that the winner was previously in those two columns than in any others. This has never been in any sense proved, but we will assume that it is so. Under that assumption, the expected number of rows that will have to be completely searched is about four, on the average: the pth, the qth, and an average of about two others. Hence the cost of maintaining the loc array is O(n) per rotation.
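The row-champion bookkeeping just described can be rendered as a Python sketch (the text's own Maple procedures update and searchrow do the same job; the Python here is ours). Whenever the old champion of a row lived in column p or q we take the conservative route and re-search the row; otherwise at most two comparisons suffice. The test checks that the incrementally updated loc agrees with a from-scratch search after a rotation:

```python
import numpy as np

def searchrow(A, loc, i):
    """Put into loc[i] the column (above the diagonal) holding the entry of
    largest absolute value in row i -- the analogue of the Maple searchrow."""
    loc[i] = i + 1 + int(np.abs(A[i, i+1:]).argmax())

def update(A, loc, p, q):
    """After a rotation in the (p,q) plane, repair loc.  Rows p and q, and any
    row whose old champion lived in column p or q, are re-searched in full;
    every other row needs at most two comparisons, because columns p and q
    hold the only entries of that row that changed."""
    n = len(A)
    for i in range(n - 1):
        if i in (p, q) or loc[i] in (p, q):
            searchrow(A, loc, i)              # full O(n) re-search
        else:
            for j in (p, q):                  # compare the two changed entries
                if j > i and abs(A[i, j]) >= abs(A[i, loc[i]]):
                    loc[i] = j

rng = np.random.default_rng(2)
n, p, q = 7, 2, 5
A = rng.standard_normal((n, n)); A = A + A.T
loc = np.zeros(n - 1, dtype=int)
for i in range(n - 1):
    searchrow(A, loc, i)

th = 0.3; c, s = np.cos(th), np.sin(th)       # an arbitrary (p,q) rotation
J = np.eye(n); J[p, p], J[p, q], J[q, p], J[q, q] = c, s, -s, c
A = J @ A @ J.T
update(A, loc, p, q)

fresh = np.zeros(n - 1, dtype=int)            # rebuild loc from scratch ...
for i in range(n - 1):
    searchrow(A, fresh, i)
assert all(abs(A[i, loc[i]]) == abs(A[i, fresh[i]]) for i in range(n - 1))
```

Note that only the above-diagonal part of each row is tracked; by symmetry every off-diagonal element of A is represented once.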

It follows that the expected cost of maintaining the loc array is O(n) per rotation. The cost of finding the largest off-diagonal element has therefore been reduced to O(n) per rotation, so that cost is comparable with all of the other operations that go on in the algorithm, and it poses no special problem. Using Corbató's suggestion, and subject to the equidistribution hypothesis mentioned above, the cost of the complete Jacobi algorithm for eigenvalues and eigenvectors, after all bills have been paid, is O(n^3).

We show below the complete Maple procedure for updating the array loc immediately after a rotation has been done in the plane of p and q. If the previous champion of row i lived in column p or q, we take the conservative route and re-search the row.

update:=proc(p,q) local i,r;
global loc,A,n;
  for i from 1 to n-1 do
    if i=p or i=q then searchrow(i)    # rows p and q changed completely
    else
      r:=loc[i];
      if r=p or r=q then searchrow(i)  # the old champion itself was changed
      else
        # the old champion survives; compare it with the two changed entries
        if p>i and abs(A[i,p])>=abs(A[i,loc[i]]) then loc[i]:=p fi;
        if q>i and abs(A[i,q])>=abs(A[i,loc[i]]) then loc[i]:=q fi;
      fi;
    fi;
  od;
  RETURN();
end:

The above procedure uses a small auxiliary routine, searchrow(i), which searches the portion of row i of the n × n matrix A that lies above the main diagonal and places the index of a column that contains an entry of largest absolute value into loc[i]:

searchrow:=proc(i) local j,bigg;
global loc,A,n;
  bigg:=0; loc[i]:=i+1;
  for j from i+1 to n do
    if abs(A[i,j])>bigg then bigg:=abs(A[i,j]); loc[i]:=j fi;
  od;
  RETURN();
end:

We should mention that the search can be speeded up a little bit more by using a data structure called a heap, or priority queue. What we want is to store the locations of the biggest elements in each row in such a way that we can quickly access the biggest of all of them.
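As a concrete (and purely illustrative) example of the heap idea, here is a Python sketch using the standard heapq module with "lazy deletion": we keep (-|value|, row) pairs, so the smallest heap entry corresponds to the biggest row champion, and stale entries are discarded only when they surface at the top. The data are made up; nothing below is part of the book's Maple code.

```python
import heapq

# Row champions: row index -> |largest off-diagonal entry in that row|.
# (Made-up data for illustration.)
champs = {0: 3.5, 1: 1.2, 2: 2.7}
heap = [(-v, r) for r, v in champs.items()]
heapq.heapify(heap)

def best_row():
    """Row holding the overall largest champion, with lazy deletion of
    stale heap entries."""
    while True:
        v, r = heap[0]
        if champs.get(r) == -v:        # is the top entry still current?
            return r
        heapq.heappop(heap)            # stale; discard and look again

assert best_row() == 0
champs[0] = 0.4                        # row 0's champion shrank after a rotation
heapq.heappush(heap, (-0.4, 0))        # O(log n) maintenance
assert best_row() == 2                 # the new global winner surfaces at once
```

Each rotation would push a handful of updated champions, so the maintenance cost per rotation is O(log n) per changed row.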

If we had stored the set of winners in a heap, then we would have been able to find the overall winner in a single step, and the expense would have been the maintenance of the heap structure, at a cost of a mere O(log n) operations per update. This would have reduced the program overhead by an additional twenty percent or thereabouts, although the programming job itself would have gotten harder. To learn about heaps and how to use them, consult books on data structures, computer science, and discrete mathematics.

3.11 Getting it together

The various pieces of the procedure that will find the eigenvalues and eigenvectors of a real symmetric matrix by the method of Jacobi are now in view. It's time to discuss the assembly of those pieces.

Procedure jacobi(eps,dgts): The input to the jacobi procedure is the n × n matrix A. The matrix A itself is global, which is to say that its value is set outside of the jacobi procedure, and it is available to all procedures that are involved. The input matrix A will be destroyed by the action of the procedure. Further input is dgts, which is the number of significant digits that you would like to be carried in the calculation, and a parameter eps that we will use as a reduction factor to test whether or not the current matrix is "diagonal enough," so that the procedure can terminate its operation. The output of procedure jacobi will be an orthogonal matrix P that will hold the eigenvectors of A in its rows, and a linear array eig that will hold the eigenvalues of A.

The first step in the operation of the procedure will be to compute the original off-diagonal sum of squares, test. Next we set the matrix P to the n × n identity matrix, and initialize the array loc by calling the subroutine searchrow for each row of A. This completes the initialization. The Jacobi process will halt when the sum of squares of all off-diagonal elements has been reduced to eps*test or less.

The remaining steps all get done while the sum of the squares of the off-diagonal entries of the current matrix A exceeds eps*test. We first find the largest off-diagonal element by searching through the numbers A_{i,loc(i)} for the one of largest absolute value. If that is A_{p,loc(p)}, then we know p, q, Apq, App and Aqq, and it is time to compute the sine and cosine of the rotation angle θ. This is a fairly ticklish operation, since the Jacobi method is sensitive to small inaccuracies in these quantities. Note also that we are going to calculate sin θ and cos θ without actually calculating θ itself.

After careful analysis it turns out that the best formulas for the purpose are:

    x := 2 Apq
    y := App - Aqq
    t := (x^2 + y^2)^{1/2}                                      (3.11.1)
    sin θ := sign(xy) ((1 - |y|/t)/2)^{1/2}
    cos θ := ((1 + |y|/t)/2)^{1/2}

Having computed sin θ and cos θ, we can now call rotate twice, first with opt=1 and again with opt=2, as discussed in section 3.8. The matrix A has now been transformed into the next stage of its march towards diagonalization. Next we multiply the matrix P on the left by the same orthogonal rotation matrix of Jacobi, in order to update the matrix that will hold the eigenvectors of A on output. Notice that this multiplication affects only rows p and q of the matrix P, so only 2n elements are changed. Next we call update, as discussed in section 3.10, to modify the loc array so that it corresponds to the newly rotated matrix. We are then at the end (or od) of the while loop that was started a few paragraphs ago.

The Maple program for the Jacobi algorithm follows.

jacobi:=proc(eps,dgts) local test,big,x,y,t,s,c,i,j,p,q,iter;
global loc,A,n,P,eig;
with(linalg):
  # initialize
  iter:=0; Digits:=dgts; n:=rowdim(A);
  test:=add(add(2*A[i,j]^2,j=i+1..n),i=1..n-1);  # sum squares o-d entries
  loc:=array(1..n-1);
  P:=matrix(n,n,(i,j)->if i=j then 1 else 0 fi); # initialize eigenvector matrix
  for i from 1 to n-1 do searchrow(i) od;        # set up initial loc array
  big:=test;
  while big>eps*test do                          # begin next sweep
    x:=0;
    for i from 1 to n-1 do                       # find largest o-d element
      if abs(A[i,loc[i]])>x then x:=abs(A[i,loc[i]]); p:=i fi;
    od;
    q:=loc[p];
    # find sine and cosine of theta
    x:=2*A[p,q]; y:=A[p,p]-A[q,q];
    t:=evalf(sqrt(x^2+y^2));
    s:=sign(x*y)*evalf(sqrt(0.5*(1-abs(y)/t)));
    c:=evalf(sqrt(0.5*(1+abs(y)/t)));
    big:=big-x^2/2;
    rotate(s,c,p,q,1); rotate(s,c,p,q,2);        # apply rotations to A
    for j from 1 to n do                         # update matrix of eigenvectors
      t:=c*P[p,j]+s*P[q,j];
      P[q,j]:=-s*P[p,j]+c*P[q,j];
      P[p,j]:=t;
    od;
    update(p,q);                                 # update loc array
    iter:=iter+1;
  od;                                            # end of while; go do next sweep
  eig:=[seq(A[i,i],i=1..n)];                     # output eigenvalue array
  print(eig,P,iter);                             # print eigenvals, vecs, and
  RETURN();                                      # number of sweeps needed
end:

To use the programs one does the following. First, enter the four procedures jacobi, rotate, update and searchrow into a Maple worksheet. Next, enter the matrix A whose eigenvalues and eigenvectors are wanted. Then choose dgts, the number of digits of accuracy to be maintained, and eps, the fraction by which the original off-diagonal sum of squares must be reduced for convergence. As an example, a call jacobi(.00000001,15) will carry 15 digits along in the computation, and will terminate when the sum of squares of the off-diagonal elements is .00000001 times what it was on the input matrix.

3.12 Remarks

For a parting volley in the direction of eigenvalues, let's review some connections with the first section of this chapter, in which we studied linear mappings, albeit sketchily. It's worth noting that the eigenvalues of a matrix really are the eigenvalues of the linear mapping that the matrix represents with respect to some basis. Suppose T is a linear mapping of E^n (Euclidean n-dimensional space) to itself. If we choose a basis for E^n, then T is represented by an n × n matrix A with respect to that basis. If we change to a different basis, then the same linear mapping is represented by B = HAH^{-1}, where H is a nonsingular n × n matrix. The proof of this fact is by a straightforward calculation, and can be found in standard references on linear algebra.

First question: what happens to the determinant if we change the basis? Answer: nothing, because

    det(HAH^{-1}) = det(H) det(A) det(H^{-1}) = det(A).         (3.12.1)

Hence the value of the determinant is a property of the linear mapping T itself, and will be the same for every matrix that represents T in some basis. Hence we can speak of det(T), the determinant of the linear mapping itself.

Next question: what happens to the eigenvalues if we change basis? Suppose x is an eigenvector of A for the eigenvalue λ, so that Ax = λx. If we change basis, A changes to B = HAH^{-1}, or A = H^{-1}BH. Hence H^{-1}BHx = λx, or B(Hx) = λ(Hx).

Hence Hx is an eigenvector of B with the same eigenvalue λ. The eigenvalues are therefore independent of the basis: they are properties of the linear mapping T itself, and we can speak of the eigenvalues of a linear mapping T.

In the Jacobi method we carry out transformations A → JAJ^T, where J^T = J^{-1}. Hence each such transformation corresponds exactly to looking at the underlying linear mapping in a different basis, in which the matrix that represents it is a little more diagonal than before. Since J is an orthogonal matrix, it preserves lengths and angles, because it preserves inner products between vectors:

    <Jx, Jy> = <x, J^T Jy> = <x, y>.                            (3.12.2)

Therefore the method of Jacobi works by rotating a basis slightly, into a new basis in which the matrix is closer to being diagonal.
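These basis-independence facts are easy to confirm numerically. A small NumPy check (our construction; H is just a random, well-conditioned change-of-basis matrix):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 4))
H = rng.standard_normal((4, 4)) + 4*np.eye(4)    # invertible change of basis
B = H @ A @ np.linalg.inv(H)                     # same mapping, new basis

# (3.12.1): the determinant is unchanged by a change of basis
assert np.isclose(np.linalg.det(A), np.linalg.det(B))

# every eigenvalue of A is (numerically) an eigenvalue of B as well
evA, evB = np.linalg.eigvals(A), np.linalg.eigvals(B)
assert all(np.min(np.abs(evB - lam)) < 1e-8 for lam in evA)

# and if A x = lam x, then B (H x) = lam (H x)
lam, X = np.linalg.eig(A)
x = X[:, 0]
assert np.allclose(B @ (H @ x), lam[0]*(H @ x), atol=1e-8)

# (3.12.2): an orthogonal J preserves inner products
th = 0.7
J = np.array([[np.cos(th), np.sin(th)], [-np.sin(th), np.cos(th)]])
u, v = rng.standard_normal(2), rng.standard_normal(2)
assert np.isclose((J @ u) @ (J @ v), u @ v)
```

The last check is exactly why a Jacobi step is a rotation of the basis: lengths and angles are untouched while the matrix inches toward diagonal form.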
