Gideon A. Ngwa
© Copyright May 2007. Last updated December 11, 2017
2 Introduction 7
2.1 Introduction to Numerical Analysis . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Errors in Calculations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.1 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 Properties of Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4 Some Fundamental Results from Calculus . . . . . . . . . . . . . . . . . . 17
4 Iterative Methods for solving Ax = b 51
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.2 Some special iteration schemes . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.3 Methods of accelerating convergence of an iterative process . . . . . . . . . 55
4.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
6 Polynomial Interpolation 62
6.1 Polynomial Forms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
6.2 Lagrange Form of the Interpolating Polynomial . . . . . . . . . . . . . . . 65
6.2.1 Divided Differences . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
6.3 Newton’s Form of the Interpolating polynomial . . . . . . . . . . . . . . . 68
6.3.1 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
9.4.4 Richardson Extrapolation . . . . . . . . . . . . . . . . . . . . . . . 109
9.4.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
9.5 Runge-Kutta Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
9.5.1 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
9.6 Linear Multi-step Methods for solving First Order Scalar Ordinary Differential Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
9.7 Zero stability for linear multi-step schemes . . . . . . . . . . . . . . . . . . 116
9.8 Accuracy of linear multi-step methods . . . . . . . . . . . . . . . . . . . . 119
9.9 Example of error analysis for an implicit method . . . . . . . . . . . . . . . 120
9.10 Absolute Stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
9.11 Predictor-Corrector Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 123
9.11.1 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
9.12 Some concluding comments . . . . . . . . . . . . . . . . . . . . . . . . . . 124
9.12.1 Some Limitations of Numerical Methods . . . . . . . . . . . . . . . 124
Chapter 1
1.1 Syllabus
1.1.1 Objectives of the Course
The main objective of this course is to introduce the basic concepts of numerical analysis,
numerical linear algebra and to present methods for solving linear and nonlinear equations.
These objectives will be attained through the following syllabus:
1. Introduction
2. Mathematical Preliminaries
(a) Vector spaces, matrices, matrix norms, special matrices, eigenvalues and eigen-
vectors
(b) Similarity transformations, nonnegative matrices, M-matrices
(a) Convergence analysis of iterative methods,
(b) Jacobi, Gauss-Seidel and relaxation methods,
(c) Preconditioning,
(d) Gradient and conjugate gradient methods
6. Numerical differentiation
7. Polynomial interpolation
Week  | Slot    | Topics to be treated                                                                  | Activity
1     | I, II   | Administration, generalities, preliminaries, introduction                            | Lecture
2     | I, II   | Gauss Elimination and LU Factorization                                               | Lecture
2     | III     | Exercises on introductory material                                                   | Tutorial
3-5   | I, II   | LU and QR Factorizations                                                             | Lecture
3     | III     | Exercises on Gauss elimination                                                       | Tutorial
4     | III     | Exercises on LU Factorization                                                        | Tutorial
5     | III     | Practical class                                                                      | Tutorial
6     | TBA     | First Continuous Assessment test (venue, which may be different from class slots, shall be announced) | CA Test
6     | I, II   | Matrix norms, error analysis, condition numbers and iterative improvement            | Lecture
6     | III     | Exercises on norms                                                                   | Tutorial
7     | I, II   | Iterative methods for solving large systems                                          | Lecture
7     | III     | Exercises on iterative methods                                                       | Tutorial
8     | I, II   | Least squares, the algebraic eigenvalue problem                                      | Lecture
8     | III     | Exercises on least squares                                                           | Tutorial
9     | I, II   | Numerical differentiation and applications, polynomial interpolation                 | Lecture
9     | III     | Exercises on numerical differentiation + practical class                             | Tutorial
10    | TBA     | Second Continuous Assessment test (venue, which may be different from class slots, shall be announced) | CA Test
10    | I, II   | Polynomial interpolation, solutions of non-linear equations                          | Lecture
10    | III     | Exercises on polynomial interpolation                                                | Tutorial
11-13 | I, II   | Solutions of non-linear equations                                                    | Lecture
11    | III     | Exercises on solutions of nonlinear systems                                          | Tutorial
13    | II      | Overview of course and examination syllabus                                          | Discussion

Table 1.1: Course outline for MAT 6371 Numerical Linear Algebra
1.2.2 Outcome
At the end of the study period, the student is expected to have the ability to use numerical methods to analyse problems in linear algebra, polynomial interpolation, numerical integration and differentiation, and to be able to differentiate between qualitative and numerical methods of analysis. He or she will be able to design iterative methods for solving linear and nonlinear equations.
Chapter 2
Introduction
3. Approximation theory is well developed and the reader is encouraged to read more on the subject to appreciate its scope. See for example [5, 6].
There are many ways to define a suitable norm, but if A is a linear space, there is a class of norms, known as $L^p$-norms, that are most useful. If, for example, $A = C([a,b];\mathbb{R})$, the set of real-valued continuous functions defined on the compact interval $[a,b]$, then we define

$$\|u\|_p = \left( \int_a^b \omega(x)\,|u(x)|^p\,dx \right)^{1/p}, \qquad \forall u \in C([a,b];\mathbb{R}). \tag{2.1}$$
In this definition, the function $\omega : [a,b] \to \mathbb{R}$ is a fixed weighting function that provides flexibility in measuring the norm of the function. In all cases $\omega \in C([a,b];\mathbb{R})$ and is non-negative on $(a,b)$, so that $\int_a^b \omega(x)\,dx$ exists and is positive. Three particular values of p are important: p = 1, p = 2 and p = ∞, so that if we assume $\omega(x) \equiv 1$ on $[a,b]$, we have

$$\|u\|_1 = \int_a^b |u(x)|\,dx, \tag{2.2}$$
$$\|u\|_2 = \left( \int_a^b |u(x)|^2\,dx \right)^{1/2}, \tag{2.3}$$
$$\|u\|_\infty = \sup\{|u(x)| : x \in [a,b]\}. \tag{2.4}$$
It is clear that these norms can be defined on different sets, in which case we must take into account the specific nature of the set under consideration. For example, if $A = \mathbb{R}^n$, then $u = (u_1, u_2, \cdots, u_n) \in \mathbb{R}^n$ is an n-tuple and we have

$$\|u\|_1 = \sum_{i=1}^{n} |u_i|, \tag{2.5}$$
$$\|u\|_2 = \left( \sum_{i=1}^{n} |u_i|^2 \right)^{1/2}, \tag{2.6}$$
$$\|u\|_\infty = \sup\{|u_i| : i = 1, 2, 3, \cdots, n\}. \tag{2.7}$$
Example 1 Let $[a,b] = [0,3]$. Let $U(x) \equiv 0$ and $\omega(x) \equiv 1$ for all $x \in [0,3]$. For each positive integer $k = 1, 2, 3, \cdots$, let $u_k : x \mapsto u_k(x)$, $x \in [0,3]$, be given by

$$u_k(x) = \begin{cases} k(k^2 x - 1), & \text{for } \tfrac{1}{k^2} \le x \le \tfrac{2}{k^2}, \\ -k(k^2 x - 3), & \text{for } \tfrac{2}{k^2} \le x \le \tfrac{3}{k^2}, \\ 0, & \text{otherwise.} \end{cases}$$

Using the above norms, formulas (2.2)-(2.4), we obtain

$$\|U - u_k\|_1 = \frac{1}{k}, \qquad \|U - u_k\|_2 = \frac{\sqrt{2}}{\sqrt{3}}, \qquad \|U - u_k\|_\infty = k.$$
Thus in the sense of (2.2), the distance between $U$ and $u_k$ becomes small for large values of k; for (2.3) the distance is a constant, the same for any k; and for (2.4), the distance is large for large values of k, so that if we consider $(\|U - u_k\|_p)_{k \ge 1}$ as a sequence of real numbers, we see that

$$\lim_{k \to \infty} \left( \|U - u_k\|_1,\; \|U - u_k\|_2,\; \|U - u_k\|_\infty \right) = \left( 0,\; \frac{\sqrt{2}}{\sqrt{3}},\; \infty \right).$$

In this context we therefore say that the functions $(u_k)_{k \ge 1}$ converge to the function $U$ only in terms of the $\|\cdot\|_1$ distance measurement.
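The three limits in Example 1 can be checked numerically. The following is a minimal Python sketch (the helper name `u_k` and the grid resolution are choices of this note, not of the text) that approximates the three norms by simple quadrature:

```python
import numpy as np

def u_k(x, k):
    """Tent function of Example 1: height k at x = 2/k**2, support [1/k**2, 3/k**2]."""
    rising = k * (k**2 * x - 1.0)
    falling = -k * (k**2 * x - 3.0)
    return np.where((x >= 1 / k**2) & (x <= 2 / k**2), rising,
                    np.where((x > 2 / k**2) & (x <= 3 / k**2), falling, 0.0))

x, dx = np.linspace(0.0, 3.0, 3_000_001, retstep=True)
for k in (1, 5, 25):
    f = u_k(x, k)                        # U(x) = 0, so U - u_k = -u_k
    n1 = np.sum(np.abs(f)) * dx          # ~ 1/k, shrinks with k
    n2 = np.sqrt(np.sum(f**2) * dx)      # ~ sqrt(2/3) = 0.8165 for every k
    ninf = np.max(np.abs(f))             # = k, grows with k
    print(f"k={k:2d}:  1-norm={n1:.4f}  2-norm={n2:.4f}  inf-norm={ninf:.1f}")
```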
Definition 1 A sequence of functions $(q_k)_{k \ge 1}$ is said to converge to a function g with respect to a given norm $\|\cdot\|$ if and only if

$$\lim_{k \to \infty} \|q_k - g\| = 0.$$
Truncation Errors: These arise when we are forced to give an approximate, rather than an exact, answer. For example, suppose that we use the Maclaurin expansion

$$e^\beta = 1 + \beta + \frac{\beta^2}{2!} + \frac{\beta^3}{3!} + \cdots + \frac{\beta^n}{n!} + \cdots$$

to approximate $e^\beta$. We must terminate the expansion at some point and write

$$e^\beta = 1 + \beta + \frac{\beta^2}{2!} + \frac{\beta^3}{3!} + \cdots + \frac{\beta^k}{k!} + E_k,$$

where $E_k$ is the error (called the truncation error) introduced by truncating the series at the given point. What is the effect of this truncation error on the value of the final calculation? When do we feel that we have a "good" approximation?
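To see the truncation error concretely, here is a small Python sketch that sums the first k terms of the Maclaurin series and compares the result with the library exponential; the function name is illustrative only:

```python
import math

def exp_maclaurin(beta, k):
    """Partial sum 1 + beta + ... + beta**k / k! of the Maclaurin series for e^beta."""
    term, total = 1.0, 1.0
    for n in range(1, k + 1):
        term *= beta / n          # builds beta**n / n! incrementally
        total += term
    return total

beta = 2.0
for k in (2, 5, 10, 20):
    approx = exp_maclaurin(beta, k)
    print(f"k={k:2d}: approx={approx:.10f}  truncation error E_k={math.exp(beta) - approx:.3e}")
```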
Round off errors: Also known as rounding errors. These arise because every computer has a finite word length, so that most numbers and results of arithmetic operations cannot be expressed exactly on a computer. For example, a computer with a limited word length can represent the number $\sqrt{2} = 1.4142135623730950488016887242097\cdots$ only as, say, $\sqrt{2} = 1.414214$, where the last digit is rounded.
Errors in the initial data and truncation errors are problem dependent and can be dealt with; rounding errors, however, are machine dependent errors that must be controlled in a computation. A primary concern of the numerical analyst is to reduce the effects of errors from all sources.
There are two ways of measuring the error in a calculation. Suppose $\hat x$ represents an approximation to x. We can measure

$$\text{(i) the absolute error} = \|x - \hat x\|, \qquad \text{(ii) the relative error} = \frac{\|x - \hat x\|}{\|x\|}, \tag{2.9}$$

where $\|\cdot\|$ is a suitable norm. For example, suppose that in computation A we have $x = 0.5 \times 10^{-4}$ and $\hat x = 0.4 \times 10^{-4}$, and in computation B we have $x = 5{,}000$ and $\hat x = 4950$. Then, using the standard Euclidean distance measurement in $\mathbb{R}$, the absolute errors are $0.1 \times 10^{-4}$ and 50 respectively, while the relative errors are 0.2 and 0.01. So computation A has a 20% error while computation B has only a 1% error. The foregoing shows that we have to be careful in deciding what we call an error. The point here is that we need error bounds which will measure how large the error can be. Error bounds must cover all possible cases.
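A one-line check of the two measurements in (2.9), as a Python sketch (the helper name `errors` is ours, not the text's):

```python
import numpy as np

def errors(x, x_hat):
    """Absolute and relative error of the approximation x_hat to x, as in (2.9)."""
    abs_err = np.linalg.norm(np.atleast_1d(x - x_hat))
    rel_err = abs_err / np.linalg.norm(np.atleast_1d(x))
    return abs_err, rel_err

print(errors(0.5e-4, 0.4e-4))   # computation A: (1e-05, 0.2)  -> 20% relative error
print(errors(5000.0, 4950.0))   # computation B: (50.0, 0.01)  ->  1% relative error
```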
2.2.1 Exercises
1. Read about floating point arithmetic, round off errors, machine representation of
numbers, conversion of numbers between different bases, and representation of frac-
tions: Then solve the following simple problems:
(a) convert the following binary numbers to decimal form (1010)2, (100101)2,
(10000001)2, (1101)2
(b) convert the following decimal numbers to binary form: (82)_{10}, (100101)_{10}, (34301)_{10}, (1101)_{10}
(c) convert the following decimal numbers to hexadecimal form: (123)_{10}, (1025)_{10}, (34301)_{10}, (187)_{10}
(d) convert the following hexadecimal numbers to decimal form: (82)_{16}, (100)_{16}, (1A4C)_{16}, (6B.1C)_{16}, (FFF.118)_{16}
2. Write an algorithm that converts integers from a base β to base 10.

3. Write an algorithm that converts a decimal fraction to a number in base β and vice versa. Test your algorithm with the following: $(0.101)_2 = (0.625)_{10}$.
5. For any positive integer N and any fixed $r \ne 1$, recall the formula for the geometric sum

$$G_N = 1 + r + r^2 + r^3 + \cdots + r^N = \frac{1 - r^{N+1}}{1 - r} \equiv Q_N.$$

Write a computer programme (or use a hand calculator) to evaluate $G_N$ and $Q_N$ for arbitrary values of r and N. Let r be chosen close to 1 and note the discrepancy. In which of the two calculations do you have faith? Calculate the relative and absolute errors in these computations.
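One possible programme for this exercise, sketched in Python; it simply prints both values so the discrepancy near r = 1 can be observed:

```python
def compare_geometric(r, N):
    """Sum G_N term by term and compare with the closed form Q_N = (1 - r**(N+1)) / (1 - r)."""
    G, term = 0.0, 1.0
    for _ in range(N + 1):
        G += term
        term *= r
    Q = (1.0 - r**(N + 1)) / (1.0 - r)
    return G, Q, abs(G - Q), abs(G - Q) / abs(G)

for r in (0.5, 0.99999, 1.0000001):
    G, Q, ae, re = compare_geometric(r, 10_000)
    print(f"r={r}: G_N={G:.10g}  Q_N={Q:.10g}  abs err={ae:.3e}  rel err={re:.3e}")
```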
6. The following numbers are given in a decimal computer with a four digit normalized mantissa: $a = 0.4523 \times 10^4$, $b = 0.2115 \times 10^{-3}$, $c = 0.2583 \times 10^1$. Perform the following operations and indicate the error in the results, assuming symmetric rounding: $a + b + c$, $a - b - c$, $ab/c$, $a - b$, $b/(ca)$.
2.3 Properties of Matrices
A system of m linear equations in n unknowns has the general form
a11 x1 + a12 x2 + · · · + a1n xn = b1
a21 x1 + a22 x2 + · · · + a2n xn = b2
··············· (2.10)
am1 x1 + am2 x2 + · · · + amn xn = bm
The coefficients aij and the right hand sides are given numbers. The problem is to find,
if possible, numbers xj , j = 1, 2, · · · , n such that the m equations (2.10) are satisfied
simultaneously. The discussion is greatly facilitated if we understand the concepts of
vectors and matrices.
Definition 2 (A Matrix) A matrix is a rectangular array of (usually real) numbers arranged in rows and columns. It is customary to display an m × n matrix as a rectangular array of m rows and n columns as follows:

$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix} \tag{2.11}$$
At times, it is briefly written as
A = (aij ) (2.12)
The matrix in (2.11) has m rows and n columns; we simply say A is of order m × n. If m = n, A is called a square matrix of order n. If A has only one row, we will call it a row vector, and if A has only one column, it is called a column vector, or simply a vector for short. So both the right-side constants $b_i$, $i = 1, 2, \cdots, m$ and the unknowns $x_i$, $i = 1, 2, \cdots, n$ form vectors:
$$b = \begin{pmatrix} b_1 \\ b_2 \\ b_3 \\ \vdots \\ b_m \end{pmatrix}, \qquad x = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_n \end{pmatrix} \tag{2.13}$$
We say that b is an m-vector and x is an n-vector.
Definition 3 (Equal Matrices) Two matrices $A = (a_{ij})$ and $B = (b_{ij})$ are equal if and only if they are of the same order and $a_{ij} = b_{ij}$ for all i and j.
Definition 4 (Matrix Multiplication) Let $A = (a_{ij})$ be an m × n matrix and $B = (b_{ij})$ be an n × p matrix. Then the matrix $C = (c_{ij})$ is the (matrix) product of A with B (in that order), or C = AB, provided C is of order m × p and

$$c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj} \qquad \text{for } i = 1, \cdots, m; \; j = 1, \cdots, p. \tag{2.14}$$
With this definition we can write our system (2.10) simply as
Ax = b (2.15)
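As an illustration of Definition 4 and of the compact form (2.15), a short Python sketch: the explicit triple loop mirrors (2.14) directly (in practice one would use the built-in matrix product):

```python
import numpy as np

def matmul(A, B):
    """Matrix product per (2.14): c_ij = sum_k a_ik * b_kj."""
    m, n = A.shape
    n2, p = B.shape
    assert n == n2, "inner dimensions must agree"
    C = np.zeros((m, p))
    for i in range(m):
        for j in range(p):
            C[i, j] = sum(A[i, k] * B[k, j] for k in range(n))
    return C

A = np.array([[1.0, 2.0], [3.0, 4.0]])
x = np.array([[5.0], [6.0]])     # a column 2-vector
print(matmul(A, x))              # same as A @ x: the left-hand side Ax of (2.15)
```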
Definition 6 (The identity matrix) If a diagonal matrix of order n has all its diagonal entries equal to 1, then we call it an identity matrix of order n and denote it by I, or $I_n$ if the order is important.
The name identity is chosen for In because In A = A for all n × p matrices, and BIn = B
for all m × n matrices, so that if A is an n × n matrix, In A = AIn = A.
Definition 7 (Matrix inverse) Division by a matrix is in general not defined. However, for square matrices we define a related concept, matrix inversion. We say the square matrix A of order n is invertible provided there exists a square matrix B of order n such that

$$AB = I_n = BA. \tag{2.16}$$
Definition 8 (Matrix addition and scalar multiplication) For any matrices $A = (a_{ij})$, $B = (b_{ij})$ and C of compatible orders and a scalar β, we have

(ii) (A + B) + C = A + (B + C)

(iii) β(A + B) = βA + βB

(v) A(B + C) = AB + AC

(vi) (A + B)C = AC + BC
13
(viii) If all the entries of A are zero, then A is called the null matrix. If A is a null matrix, then A + B = B for all matrices B of the same order.
Using this definition we can rewrite the linear system of equations in several forms. We
have the following results concerning the existence and uniqueness of solutions of linear
equations.
Theorem 2 The linear system Ax = b has at most one solution (i.e., the solution is unique if it exists) if and only if the corresponding homogeneous system Ax = 0 has only the trivial solution x = 0.
Theorem 3 Any homogeneous linear system of equations that has fewer equations than
unknowns has non-trivial (i.e., nonzero) solutions.
In fact we can prove that we cannot expect to get a solution to our linear system for all possible choices of the right-hand side unless we have no more equations than unknowns.
(ii) If B and C are matrices such that BC = I then the homogeneous system Cx = 0
has only the trivial solution x = 0.
We now know that we cannot expect exactly one solution to our system unless there are as many equations as unknowns. We can therefore restrict our attention to square matrices and state the following theorem:
(iii) A is invertible.
The proof of these results can be found in any elementary text in numerical analysis.
β1 e1 + · · · + βm em = 0 ⇒ β1 = β2 = · · · = βm = 0.
Notes 2 (Rules of Transpose) (i) If A and B are matrices such that AB is defined, then $B^T A^T$ is defined and $(AB)^T = B^T A^T$. Note the interchange of order!

If a and b are n-vectors, then $b^T a$ is the 1 × 1 matrix called the scalar or inner product of a and b. For matrices with complex entries, there is the related concept of the Hermitian or conjugate transpose of the matrix A. Thus when A is a complex matrix we write

$$A^H = (\bar a_{ji}) \qquad \forall i, j.$$

$$P i_j = i_{p_j}, \qquad j = 1, \cdots, n, \tag{2.18}$$

where $i_j$ is the j-th column of the identity matrix, for some permutation $p = (p_j)$ of degree n.
(ii) If A is an m × n matrix, then AP is an m × n matrix whose j-th column equals the $p_j$-th column of A, $j = 1, \cdots, n$.

(iii) If A is an n × m matrix, then $P^T A$ is the n × m matrix whose i-th row equals the $p_i$-th row of A, $i = 1, \cdots, n$.
An example will illustrate the issues of permutations.
Example 2 Consider the identity matrix of order 3. Then the matrix P defined by

$$P = \begin{pmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}$$

is a permutation matrix corresponding to the permutation $p^T = (2, 3, 1)$, since $P i_1 = i_2$, $P i_2 = i_3$ and $P i_3 = i_1$. One has

$$P^T = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix}.$$

Hence $P^T i_1 = i_3$, $P^T i_2 = i_1$ and $P^T i_3 = i_2$, illustrating (i) of the theorem. Further, one calculates, for example, that

$$AP = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix} \begin{pmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix} = \begin{pmatrix} 2 & 3 & 1 \\ 5 & 6 & 4 \\ 8 & 9 & 7 \end{pmatrix},$$

hence column 2 of AP is column 3 of A, illustrating (ii) of the theorem. Also,

$$PA = \begin{pmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix} = \begin{pmatrix} 7 & 8 & 9 \\ 1 & 2 & 3 \\ 4 & 5 & 6 \end{pmatrix},$$

and the rows of A have been interchanged by the permutation matrix P, illustrating (iii) of the theorem.
Definition 13 (Eigenvalues and eigenvectors of a matrix.) Let A be an n × n real matrix. If there exist a nonzero vector $x \in \mathbb{R}^n$ and a scalar $\lambda \in \mathbb{C}$ such that $Ax = \lambda x$, then x is called an eigenvector of A and λ is called an eigenvalue.

The pair (λ, x), when it exists, is determined by the matrix A, so that λ satisfies the equation $\mathrm{Det}(A - \lambda I) = 0$ and, for each λ, x is found such that it solves the equation $(A - \lambda I)x = 0$. For an n × n matrix, $\mathrm{Det}(A - \lambda I)$ is a polynomial of degree n in λ, so that by the fundamental theorem of algebra we have at most n distinct eigenvalues and consequently at most n such eigenvectors. Eigenvectors of a real matrix can be complex. The eigenvectors associated with a given eigenvalue λ span the null space of $A - \lambda I$, while the set of all eigenvalues of the matrix is called the spectrum of that matrix. From a matrix we can define a quantity called the spectral radius of A, often denoted by ρ(A), as follows:

$$\rho(A) = \max\{|\lambda| : Ax = \lambda x\} = \max\{|\lambda| : \lambda \text{ is an eigenvalue of } A\}, \tag{2.19}$$

the maximum of the absolute values of all the eigenvalues of A.
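In practice the spectrum and the spectral radius are computed with library routines. A minimal Python sketch, using numpy's `eig` as one possible tool and a small symmetric test matrix of our choosing:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigvals, eigvecs = np.linalg.eig(A)   # columns of eigvecs are eigenvectors
rho = np.max(np.abs(eigvals))         # spectral radius per (2.19)
print("eigenvalues:", eigvals)        # [3. 1.]
print("spectral radius:", rho)        # 3.0
```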
2.4 Some Fundamental Results from Calculus
Rolle's Theorem. If $f : [a,b] \to \mathbb{R}$ is continuous on $[a,b]$ and differentiable on $(a,b)$, and if $f(a) = f(b)$, then there exists $\xi \in (a,b)$ such that $f'(\xi) = 0$.

Mean Value Theorem for Integrals. If $f : [a,b] \to \mathbb{R}$ is continuous on $[a,b]$, then there exists $\xi \in (a,b)$ such that $\int_a^b f(x)\,dx = f(\xi)(b - a)$.

Second Mean Value Theorem for Integrals. If $g : [a,b] \to \mathbb{R}$ is integrable and does not change sign on $[a,b]$, and $f : [a,b] \to \mathbb{R}$ is continuous on $[a,b]$, then there exists $\xi \in (a,b)$ such that $\int_a^b f(x)g(x)\,dx = f(\xi) \int_a^b g(x)\,dx$.
where $f^{(k)}(x^*) = \left.\dfrac{d^k f}{dx^k}\right|_{x = x^*}$, and there exists ξ between x and $x^*$ such that

$$P_n(x) = \underbrace{(x - r_1)(x - r_2)\cdots(x - r_n)}_{\text{product of } n \text{ terms}}\,.$$
Chapter 3
Ax = b
where A is an n × n matrix and b is a given right hand side, to find the unknown x.
A frequently quoted test for invertibility of a matrix is based on the concept of determinants. The relevant theorem states that the matrix A is invertible if and only if $\det(A) \ne 0$. If $\det(A) \ne 0$, then it is even possible to express the solution of Ax = b in terms of determinants, by the so-called Cramer's rule. Calculating determinants is not a very practical thing to do, given the number of operations involved. We shall explore several methods of solving systems of linear equations: direct methods and iterative methods.
Direct methods of solution are those methods which, in the absence of round off errors, will yield the exact solution in a finite number of elementary arithmetic operations. In practice, since computers work with a finite word length, direct methods do not lead to exact solutions because errors creep in from roundoff, instability, loss of significant digits, etc. A large part of numerical analysis is devoted to controlling these errors.

Iterative methods are those which start with an initial approximation and which, by applying a suitably chosen algorithm, lead to successively better approximations. Matrices associated with linear systems are also classified as dense or sparse. Dense matrices have very few zero elements, and the order of such matrices tends to be small. Direct methods fare well when used to solve problems where the matrices are dense. Sparse matrices, on the other hand, have very few non-zero elements, and the order of such a system can be very large. Iterative methods are useful here. In what follows we shall explore methods of solution for both sparse and dense systems. We keep the presentation as brief as possible. Appropriate books are shown in the bibliography.
2. Economy: The method should not demand vastly more operations than alternative
methods which could solve the same problem
3. Robustness: The method should not break down for special cases that cannot be
specified in advance
4. Accuracy and Stability: The effects of limited accuracy on the method should
be to give an exact answer to the perturbed problem for which perturbations can
be bounded. It should be feasible to make this bound smaller than that on pertur-
bations which could arise from uncertainties in data.
We may then say that numerical analysis deals with the design of numerical methods
and the analysis of their performance on individual problems or class of problems.
Indirect methods also exist. In its simplest form, an indirect method involves designing an iterative procedure which one hopes will converge to the sought-after solution. We examine each of these methods in turn.
3.2.1 Gauss Elimination
The elimination comes first: at each stage, we remove one unknown from all but one of the equations to produce a set of equations of order one smaller than the last. Back-substitution follows: the last equation contains just the last unknown, which is found directly. All unknowns found so far can then be substituted into the equation with one extra unknown, which can then be found in the next step.
Solution: The matrix element 5, which is bracketed, is used as the pivot. It is divided into the elements in the column below it to find multiples of the pivotal equation, which are subtracted from the others. The new set of equations is given as

Back substitution:

$$(3.9) \Rightarrow x_3 = 1 \tag{3.10}$$
$$(3.8) \Rightarrow x_2 = (-0.2 + (0.4)x_3)/2 = 1 \tag{3.11}$$
$$(3.7) \Rightarrow x_1 = (18 - 6x_3 - 7x_2)/5 = 1 \tag{3.12}$$
For this small set the result is exact. For sets in which the numbers are less simple, or for large sets, the calculations would in general be affected by rounding errors.

We note that this is the method of systematic elimination that is commonly taught in school. The actual computed result usually depends on the sequence of the equations, because of rounding errors. Moreover, the result depends, in general, on the type of arithmetic used: calculations which would be exact in decimal arithmetic need not be exact in binary arithmetic. The process also relies on the pivots; they must be non-zero because we have to divide by them. Their choice depends on the sequence of equations and the number of variables.
3.2.2 General form of Gauss Elimination
For a system of equations of order n we have:

$$\begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{pmatrix} \tag{3.13}$$

Then we calculate

$$a_{ij}^{(2)} = a_{ij} - m_{i1} a_{1j}, \quad i, j = 2, 3, \cdots, n, \qquad b_i^{(2)} = b_i - m_{i1} b_1, \quad i = 2, 3, \cdots, n.$$

Then we calculate

$$a_{ij}^{(3)} = a_{ij}^{(2)} - m_{i2} a_{2j}^{(2)}, \quad i, j = 3, 4, \cdots, n, \qquad b_i^{(3)} = b_i^{(2)} - m_{i2} b_2^{(2)}, \quad i = 3, 4, \cdots, n.$$
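The elimination formulas above translate directly into code. A minimal Python sketch assuming all pivots $a_{kk}^{(k)}$ are non-zero (no pivoting, as in this subsection); the test matrix is ours:

```python
import numpy as np

def gauss_eliminate(A, b):
    """Forward elimination without pivoting, then back substitution."""
    A = A.astype(float).copy()
    b = b.astype(float).copy()
    n = len(b)
    for k in range(n - 1):                 # stage k
        for i in range(k + 1, n):
            m = A[i, k] / A[k, k]          # multiplier m_ik
            A[i, k+1:] -= m * A[k, k+1:]   # a_ij^(k+1) = a_ij^(k) - m_ik a_kj^(k)
            A[i, k] = 0.0
            b[i] -= m * b[k]               # b_i^(k+1) = b_i^(k) - m_ik b_k^(k)
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):         # back substitution
        x[i] = (b[i] - A[i, i+1:] @ x[i+1:]) / A[i, i]
    return x

A = np.array([[5.0, 7.0, 6.0], [7.0, 11.0, 8.0], [6.0, 8.0, 10.0]])
b = A @ np.ones(3)                          # exact solution is x = (1, 1, 1)
print(gauss_eliminate(A, b))
```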
3.2.3 LU Decomposition
We can use the $m_{ij}$ of the Gauss elimination process to define

$$M_1 = \begin{pmatrix} 1 & 0 & 0 & \cdots & 0 \\ -m_{21} & 1 & 0 & \cdots & 0 \\ -m_{31} & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ -m_{n1} & 0 & 0 & \cdots & 1 \end{pmatrix}, \qquad M_2 = \begin{pmatrix} 1 & 0 & 0 & \cdots & 0 \\ 0 & 1 & 0 & \cdots & 0 \\ 0 & -m_{32} & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & -m_{n2} & 0 & \cdots & 1 \end{pmatrix}, \tag{3.15}$$

$$M_3 = \begin{pmatrix} 1 & 0 & 0 & 0 & \cdots & 0 \\ 0 & 1 & 0 & 0 & \cdots & 0 \\ 0 & 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & -m_{43} & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & -m_{n3} & 0 & \cdots & 1 \end{pmatrix}, \quad \cdots, \quad M_k = \begin{pmatrix} 1 & \cdots & 0 & 0 & \cdots & 0 \\ \vdots & \ddots & \vdots & \vdots & & \vdots \\ 0 & \cdots & 1 & 0 & \cdots & 0 \\ 0 & \cdots & -m_{k+1,k} & 1 & \cdots & 0 \\ \vdots & & \vdots & \vdots & \ddots & \vdots \\ 0 & \cdots & -m_{nk} & 0 & \cdots & 1 \end{pmatrix} \tag{3.16}$$
$$T_n^{-1} = \begin{pmatrix} T_{n-1}^{-1} & -T_{n-1}^{-1}\, t_n\, \tau_{nn}^{-1} \\ 0 & \tau_{nn}^{-1} \end{pmatrix},$$

as can be verified by forming the product $T_n T_n^{-1}$. So $T_n^{-1}$ is upper triangular. The result is trivially true when n = 1, and so it is true for all $n \ge 1$.
So now we can write $L = M^{-1}$, a lower triangular matrix, and $A = LU$. This is the LU decomposition of A, and we can regard the Gauss elimination method as follows: the system

$$LUx = b$$

is equivalent to the pair of systems

$$Ly = b, \qquad Ux = y.$$

So we solve for x by first finding

$$\text{(i)} \quad y = L^{-1}b = Mb \equiv b^{(n)} \quad \text{(the elimination part)} \tag{3.19}$$

and then obtaining

$$\text{(ii)} \quad x = U^{-1}y = (A^{(n)})^{-1}b^{(n)} \quad \text{(the back substitution part)}. \tag{3.20}$$

Thus Gauss elimination and LU decomposition are formally equivalent. However, we may find L and U by other means and then use (i) and (ii) above, that is (3.19) and (3.20), to find x.
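A short Python sketch of the two-step solve (3.19)-(3.20), here using scipy's LU routine as one possible way of "finding L and U by other means"; the test system is ours:

```python
import numpy as np
from scipy.linalg import lu, solve_triangular

A = np.array([[5.0, 7.0, 6.0], [7.0, 11.0, 8.0], [6.0, 8.0, 10.0]])
b = np.array([18.0, 26.0, 24.0])

P, L, U = lu(A)                                # scipy's convention: A = P L U
y = solve_triangular(L, P.T @ b, lower=True)   # forward substitution: L y = P^T b
x = solve_triangular(U, y)                     # back substitution:    U x = y
print(x, np.allclose(A @ x, b))
```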
Lemma 5 Given an elementary lower triangular matrix of index k, $M = I - m e_k^T$. Then

1. $M^{-1} = I + m e_k^T$;

2. $MN = (I - m e_k^T)(I - n e_l^T) = I - m e_k^T - n e_l^T + m e_k^T n e_l^T$. But $e_k^T n = 0$ because $k \le l$, so $MN = I - m e_k^T - n e_l^T$.
Now, in the Gauss elimination process we have introduced $M_1, M_2, \cdots, M_{n-2}, M_{n-1}$, elementary lower triangular matrices of index $1, 2, 3, \cdots, n-1$ respectively, and we can write $L = M^{-1} = M_1^{-1} M_2^{-1} \cdots M_{n-1}^{-1}$. That is,

$$L \equiv \begin{pmatrix} 1 & & & \\ m_{21} & 1 & & \\ m_{31} & m_{32} & 1 & \\ \vdots & \vdots & \ddots & \\ m_{n1} & m_{n2} & \cdots & 1 \end{pmatrix}, \tag{3.22}$$

a unit lower triangular matrix, where the $m_{ik}$ are precisely the multipliers in the Gauss elimination process.
3.2.5 Pivoting
Gauss elimination or LU decomposition breaks down at step k if $a_{kk}^{(k)} = 0$. It also gives inaccurate answers if $a_{kk}^{(k)}$ becomes very small. So it may be necessary or desirable to change the order of the elimination process. Instead of using $a_{kk}^{(k-1)}$ as our pivot at stage $k-1$, we may, in Gauss elimination, interchange row k with a row with a larger row number before we do this step.
Example 4 Consider the system $-1.41x_1 + 2x_2 = 1$, $x_1 - 1.41x_2 + x_3 = 1$, $2x_2 - 1.41x_3 = 1$, and study the effect of pivoting on the solution.

Solution: The exact solution is $x_1 = x_2 = x_3 = 1.69492$. Solving the problem using floating point decimal arithmetic, carrying three significant figures, gives

$$A = \begin{pmatrix} -1.41 & 2 & 0 \\ 1 & -1.41 & 1 \\ 0 & 2 & -1.41 \end{pmatrix}, \qquad b^T = (1, 1, 1),$$

$$L = \begin{pmatrix} 1 & 0 & 0 \\ -0.709 & 1 & 0 \\ 0 & 200 & 1 \end{pmatrix}, \qquad U = \begin{pmatrix} -1.41 & 2 & 0 \\ 0 & 0.010 & 1 \\ 0 & 0 & -201 \end{pmatrix},$$

$$y^T = (1, 1.71, -341), \quad \text{giving} \quad x^T = (0.71, 1.00, 1.70).$$
Clearly fewer than two significant figures are correct. Now repeat the process, but at stage two we find

$$A^{(2)} = \begin{pmatrix} -1.41 & 2 & 0 \\ 0 & 0.010 & 1 \\ 0 & 2 & -1.41 \end{pmatrix}, \qquad b^{(2)} = (1, 1.71, 1)^T.$$

We interchange rows 2 and 3 of both $A^{(2)}$ and $b^{(2)}$ by pre-multiplying $A^{(2)}$ and $b^{(2)}$ with the matrix

$$P = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}$$

to have

$$PA^{(2)} = A^{(2)\prime} = \begin{pmatrix} -1.41 & 2 & 0 \\ 0 & 2 & -1.41 \\ 0 & 0.010 & 1 \end{pmatrix}, \qquad Pb^{(2)} = b^{(2)\prime} = (1, 1, 1.71)^T,$$

from which we have

$$L = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ -0.709 & 0.005 & 1 \end{pmatrix}, \qquad U = \begin{pmatrix} -1.41 & 2 & 0 \\ 0 & 2 & -1.41 \\ 0 & 0 & 1.01 \end{pmatrix},$$

$$y^T = (1, 1, 1.71), \quad \text{giving} \quad x^T = (1.69, 1.69, 1.69).$$
Examination of the process shows that a small $a_{22}^{(2)}$ leads to a large $a_{33}^{(3)}$, a poor $x_3$, an inaccurate $x_2$ and eventually a very bad $x_1$.

More generally, if $u_{kk}$ is small, it is also inaccurate because it is calculated by cancellation of much larger numbers with associated rounding errors of relatively large size. When we divide by $u_{kk}$ in the back-substitution for $x_k$ we get an inaccurate $x_k$. The numerator in this division is also small, and so inaccurate because of cancellation. This further spoils $x_k$, and the error in $x_k$ propagates as errors in $x_{k-1}, x_{k-2}, \cdots, x_1$.
Example 5 (Effect of pivoting) Study the effect of pivoting on the LU decomposition process if

$$A = \begin{pmatrix} 0.001 & 2.000 & 3.000 \\ -1.000 & 3.712 & 4.623 \\ -2.000 & 1.072 & 5.643 \end{pmatrix}$$

and the operations are carried out on a four digit computer (i.e. four significant figures).
Solution: We work with four significant figures.

No pivoting:

$$L = \begin{pmatrix} 1 & 0 & 0 \\ -1000 & 1 & 0 \\ -2000 & 1.997 & 1 \end{pmatrix}, \quad A^{(2)} = \begin{pmatrix} 0.001 & 2.000 & 3.000 \\ 0 & 2004. & 3005. \\ 0 & 4001. & 6006. \end{pmatrix}, \quad A^{(3)} = \begin{pmatrix} 0.001 & 2.000 & 3.000 \\ 0 & 2004. & 3005. \\ 0 & 0 & 5.000 \end{pmatrix}$$
(b) Total Pivoting: This is also known as complete, or row and column, pivoting. At step k, choose $r \ge k$ and $p \ge k$ so that

$$|a_{rp}^{(k)}| \ge |a_{ij}^{(k)}|, \qquad k \le i, j \le n. \tag{3.24}$$

Then interchange rows k and r and also columns k and p. In other words, find the largest element in the whole sub-matrix between $a_{kk}^{(k)}$ and $a_{nn}^{(k)}$, and interchange to put this element in position (k, k).
We note that strategies (a) and (b) demand corresponding interchange in the vector of
right hand side (the data). In addition, Strategy (b) demands interchange of the vector
of unknowns. Very interesting results can be proved for strategy (a), which will lead to
efficient and robust algorithms. Strategy (b) is considerably more expensive and may
not provide additional robustness in the solution process. We can show that making the
interchange is equivalent to pre-multiplying the matrix A by a permutation matrix.
$$P_k = \begin{pmatrix} 1 & & & & & & \\ & \ddots & & & & & \\ & & 0 & \cdots & 1 & & \\ & & \vdots & \ddots & \vdots & & \\ & & 1 & \cdots & 0 & & \\ & & & & & \ddots & \\ & & & & & & 1 \end{pmatrix} \tag{3.25}$$

where the off-diagonal elements are in row k and the row to be interchanged. We have the following result:
2. The result is exactly the same if we had done all the row interchanges before starting
the elimination process
1. That |mik | < 1, ∀i, k is trivial and follows directly from the construction.
Hence

$$A^{(k+1)} = M_k P_k A^{(k)},$$

and

$$A^{(k)} = M_{k-1} P_{k-1} \cdots M_2 P_2 M_1 P_1 A^{(1)}, \qquad A^{(1)} \equiv A.$$

So, since $P_{n-1} P_{n-1} = I$,

and therefore

where $A^{*(1)} = P_{n-1} P_{n-2} \cdots P_1 A^{(1)}$ is A with all its rows permuted.
The next theorem establishes the existence of LU decompositions for all non-singular matrices.

Proof: (By induction.) The result is obviously true for n = 1. Suppose it is true for $n = k - 1$ and partition $A_k$, $L_k$, $U_k$ (where $L_k$, $U_k$ are also principal submatrices of L and U) as

$$A_k = \begin{pmatrix} A_{k-1} & b \\ c^T & a_{kk} \end{pmatrix}, \qquad L_k = \begin{pmatrix} L_{k-1} & 0 \\ m^T & 1 \end{pmatrix}, \qquad U_k = \begin{pmatrix} U_{k-1} & u \\ 0 & u_{kk} \end{pmatrix}. \tag{3.28}$$
By induction, $L_{k-1}$ and $U_{k-1}$ are uniquely determined. Also, by assumption,

Hence $L_{k-1}$, $U_{k-1}$ are non-singular, and u, m can be uniquely determined from the equations (3.29). Hence $u_{kk}$ is also uniquely determined. So we have $L_k$, $U_k$ uniquely. Further,

Thus

Hence we satisfy all the same conditions for k as for $k - 1$, and can continue the induction to step n. This shows that L and U are uniquely determined.
Remark 2 We make the following remark about the existence and uniqueness of LU
factorizations.
1. The existence part of Theorem 8 is often quoted by saying that "The matrix A admits an LU decomposition if and only if A is strictly regular".

2. The uniqueness part of Theorem 8 can also be proved by assuming that A admits two LU decompositions $A = L_1 U_1$ and $A = L_2 U_2$. Then $L_1 U_1 = L_2 U_2 \Rightarrow L_2^{-1} L_1 = U_2 U_1^{-1}$; but by Lemmas 3 and 4, $L_2^{-1} L_1$ is unit lower triangular and $U_2 U_1^{-1}$ is upper triangular, and we have a contradiction, since a lower triangular matrix cannot be equal to an upper triangular matrix unless both of them are the identity matrix. So the only instance in which we can have equality is $L_1 = L_2$ and $U_1 = U_2$.
Corollary 1 A strictly regular matrix A has a unique factorization A = LDU where both L and U have unit diagonal.

Corollary 2 A strictly regular symmetric matrix A admits a unique representation $A = LDL^T$ where both L and $U = L^T$ have unit diagonal.
Proof: Suppose $A = LDL^T$ with D > 0, and let $x \in \mathbb{R}^n \setminus \{0\}$. Since L is nonsingular, $y \overset{\text{def}}{=} L^T x \ne 0$. Then $x^T A x = y^T D y = \sum_{k=1}^n d_{kk} y_k^2 > 0$, hence A is positive definite. Conversely, if A is (symmetric) positive definite, then it is strictly regular (for if $A_k x = 0$ for some $k \le n$ and some nonzero $x \in \mathbb{R}^k$, then with $x_* = (x_1, x_2, \cdots, x_k, 0, \cdots, 0) \in \mathbb{R}^n$ we obtain $x_*^T A x_* = x^T A_k x = 0$, a contradiction to the positive definiteness). Thus by Corollary 2 it admits an $LDL^T$ factorization. Take x such that $L^T x = e_k$, the k-th coordinate unit column vector. Then $0 < x^T A x = e_k^T D e_k = d_{kk}$.
Notes 3 The last result tells us that we can check to see if a matrix is positive definite
by trying to compute its LDLT decomposition. But this is not a standard method to use
except possibly for demonstration purposes.
Definition 18 (Diagonally dominant matrix) An n × n matrix A is said to be strictly diagonally dominant (by rows) if $|a_{ii}| > \sum_{j \ne i} |a_{ij}|$ for all i.
Proof: Take any non-zero vector $x \in \mathbb{R}^n$, and let $x_k$ be the component of largest absolute value; that is, $|x_k| \ge |x_i|$ for all i. Then the strict diagonal dominance of A gives the k-th component of Ax the value

$$|(Ax)_k| = \Big| a_{kk} x_k + \sum_{j \ne k} a_{kj} x_j \Big| \ge |a_{kk}||x_k| - \sum_{j \ne k} |a_{kj}| \underbrace{|x_j|}_{\le |x_k|} > 0,$$

since $\sum_{j \ne k} |a_{kj}| < |a_{kk}|$. Hence $Ax \ne 0$ for any nonzero x; that is, A is nonsingular. The leading submatrices of a strictly diagonally dominant matrix are (even more) strictly diagonally dominant, hence the strict regularity of A.
Corollary 4 No pivoting is necessary in the decomposition process if A is symmetric and
positive definite.
Proof: In this case xT Ax > 0, ∀x ∈ Rn \ {0} and we can choose
Theorem 11 If A is non-singular, there is always a set of row interchanges for which
an LU decomposition exists.
Proof: Suppose otherwise. Then at some point, say step k, there is no non-zero $a_{ik}^{(k)}$, $i = k, k+1, \cdots, n$, that can be chosen as a pivot, so we can partition

$$A^{(k)} = \begin{pmatrix} A_k^{(k)} & \ast \\ 0 & \ast \end{pmatrix}$$

where $A_k^{(k)}$ is a k × k matrix and the (k, k) element of $A_k^{(k)}$ is zero ($A_k^{(k)}$ is upper triangular). Consequently $\mathrm{Det}(A^{(k)}) = 0$ and $A^{(k)}$ is singular, contradicting the hypothesis that A is non-singular.
For a banded matrix, all non-zero elements lie in a band of width 2r + 1 along the main diagonal.

It is frequently required to solve very large systems of the form Ax = b with sparse matrices ($n = 10^6$ is considered small in this context!). The LU factorization of such a sparse matrix will make sense only if L and U inherit the sparsity of the matrix A (so that the cost of computing Ux, say, is comparable with that of computing Ax). To this end we have the following result:
Remark 3 Theorem 12 suggests that, for a factorization of a sparse but not nicely structured matrix, one might try to reorder its rows and columns by a preliminary calculation so that many of the zero elements become leading zero elements in rows and columns, thus reducing the fill-in in L and U.
Example 6 Find the LU decomposition of the matrix

$$A = \begin{pmatrix} 5 & 1 & 1 & 1 & 1 \\ 1 & 1 & 0 & 0 & 0 \\ 1 & 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 & 1 \end{pmatrix}.$$
Solution: Blind application of the theory, without any observation that the matrix is sparse but does not have a structure, gives A = LU with

$$L = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 \\ \tfrac{1}{5} & 1 & 0 & 0 & 0 \\ \tfrac{1}{5} & -\tfrac{1}{4} & 1 & 0 & 0 \\ \tfrac{1}{5} & -\tfrac{1}{4} & -\tfrac{1}{3} & 1 & 0 \\ \tfrac{1}{5} & -\tfrac{1}{4} & -\tfrac{1}{3} & -\tfrac{1}{2} & 1 \end{pmatrix}, \qquad U = \begin{pmatrix} 5 & 1 & 1 & 1 & 1 \\ 0 & \tfrac{4}{5} & -\tfrac{1}{5} & -\tfrac{1}{5} & -\tfrac{1}{5} \\ 0 & 0 & \tfrac{3}{4} & -\tfrac{1}{4} & -\tfrac{1}{4} \\ 0 & 0 & 0 & \tfrac{2}{3} & -\tfrac{1}{3} \\ 0 & 0 & 0 & 0 & \tfrac{1}{2} \end{pmatrix},$$

and the decomposition has gone through, but there is significant fill-in in both L and U; that is, most zeros have been replaced by non-zero numbers, which will complicate the forward and backward substitution processes. However, if we pre-multiply A by the permutation matrix

$$P = \begin{pmatrix} 0 & 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 & 0 \end{pmatrix},$$

it will interchange rows 1 and 5, leading to the matrix

$$PA = \begin{pmatrix} 1 & 0 & 0 & 0 & 1 \\ 1 & 1 & 0 & 0 & 0 \\ 1 & 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 \\ 5 & 1 & 1 & 1 & 1 \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 & 0 \\ 1 & 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 \\ 5 & 1 & 1 & 1 & 1 \end{pmatrix} \begin{pmatrix} 1 & 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 & -1 \\ 0 & 0 & 1 & 0 & -1 \\ 0 & 0 & 0 & 1 & -1 \\ 0 & 0 & 0 & 0 & -1 \end{pmatrix},$$

and we have less fill-in, as can be seen in the decomposition. We can even do better and interchange the first and last rows and columns, by pre- and post-multiplying A by a permutation matrix $\tilde P$ (that you should find), to have the matrix

$$\tilde P A \tilde P = \begin{pmatrix} 1 & 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 & 1 \\ 0 & 0 & 0 & 1 & 1 \\ 1 & 1 & 1 & 1 & 5 \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 1 & 1 & 1 & 1 & 1 \end{pmatrix} \begin{pmatrix} 1 & 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 & 1 \\ 0 & 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix}.$$

There is a distinct advantage in doing a preliminary reordering of the system.
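The fill-in comparison can be reproduced numerically. A hedged Python sketch using scipy's LU factorization; the permutation `perm` encodes the row-and-column reversal discussed above:

```python
import numpy as np
from scipy.linalg import lu

A = np.array([[5, 1, 1, 1, 1],
              [1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 0, 0, 1, 0],
              [1, 0, 0, 0, 1]], dtype=float)

def factor_nonzeros(M):
    """Total number of non-zero entries in the L and U factors of M."""
    _, L, U = lu(M)          # scipy returns M = P @ L @ U
    return np.count_nonzero(L) + np.count_nonzero(U)

perm = [4, 1, 2, 3, 0]       # swap first and last rows and columns
print(factor_nonzeros(A))                      # heavy fill-in
print(factor_nonzeros(A[np.ix_(perm, perm)]))  # much sparser factors
```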
3.2.7 Operation Counts
Gauss Elimination: At the k-th stage, we need $(n-k)$ divisions to form the multipliers and $(n-k)^2$ multiplications to update the remaining entries.

Usually, addition takes less time than multiplication and division, so we compare different processes fairly simply by counting the latter two operations. For the whole elimination process on the matrix, the number of these is

$$\sum_{k=1}^{n-1} (n-k)(n-k+1) = \frac{1}{3}(n^3 - n). \tag{3.34}$$

For each right hand side vector b, the number of multiplications associated with the elimination process is

$$\sum_{k=1}^{n-1} (n-k) = \frac{1}{2}n(n-1). \tag{3.35}$$

In the back substitution, the number of multiplications and divisions is $\sum_{k=1}^{n} k = \frac{1}{2}n(n+1)$. So the total number of operations for each vector b is approximately $n^2$. The count for the complete process with one right hand side is then

$$\frac{1}{3}(n^3 - n) + \frac{1}{2}n(n-1) + \frac{1}{2}n(n+1) = \frac{1}{3}(n^3 + 3n^2 - n) \sim \frac{1}{3}n^3. \tag{3.36}$$
3. We can use the matrix A itself to store the elements of U and those of L.

(a) Solve the system of linear equations of the form Ax = b in the two steps outlined above: solve (i) Ly = b and (ii) Ux = y to obtain the solution x.

(b) Calculate the determinant of the matrix A as $\mathrm{Det}(A) = \mathrm{Det}(L)\mathrm{Det}(U) = \mathrm{Det}(U) = \prod_{i=1}^n u_{ii}$.

(c) Find the inverse of the matrix A as $A^{-1} = U^{-1} L^{-1}$.

5. The cost: Can we estimate the operation count? If we insist that it is only multiplications and divisions that matter (though this is a restriction), then:
The consequence of this definition is the proof of Theorem 8. An alternative proof of that result would go like this: Suppose A is strictly regular and has two LU decompositions such that $A = L_1 U_1$ and $A = L_2 U_2$. Then $L_i$ and $U_i$ are non-singular from the definition of the decomposition, and hence $L_2^{-1} L_1 = U_2 U_1^{-1}$. But by Lemma 4, $L_2^{-1}$ and $U_1^{-1}$ are respectively lower and upper triangular, and so the products $L_2^{-1} L_1$ and $U_2 U_1^{-1}$ are also respectively lower and upper triangular by Lemma 3. Thus the matrices $L_2^{-1} L_1$ and $U_2 U_1^{-1}$ are simultaneously lower and upper triangular, and hence must be diagonal. Furthermore, since $L_2^{-1} L_1$ is a unit lower triangular matrix, we have $L_2^{-1} L_1 = I \Rightarrow L_1 = L_2$. Also $L_2^{-1} L_1 = I = U_2 U_1^{-1} \Rightarrow U_1 = U_2$, and thus the LU decomposition is unique.
Instead of the LU decomposition of a matrix A, we can write A = LDU, where L is unit lower triangular, D is diagonal and U is an upper triangular matrix. We easily establish that if A is symmetric and strictly regular, then we can write A = LDU where D is diagonal and L and U are unit lower and unit upper triangular matrices respectively. In fact, such a decomposition of a symmetric strictly regular A is unique since, in that case, we have $LDU = A = A^T = U^T D L^T$. But since the LU decomposition is unique, $L^T = U$. We can prove the following result about positive definite matrices:

Proof: Exercise.

Some decompositions have special names owing to the special characters of the matrices involved:
QR Factorization
We need the related concept of inner product spaces.
$(I_4)$ $\forall x, y, z \in X$, $(x + y, z) = (x, z) + (y, z)$.

A vector space endowed with an inner product is called an inner product space. If $(x, y) = 0$, the vectors x and y are called orthogonal. A set of vectors $x_i \in X$ is called orthonormal if

$$(x_i, x_j) = \delta_{ij} \overset{\text{def}}{=} \begin{cases} 1, & i = j, \\ 0, & i \ne j. \end{cases}$$

For $x \in X$, the function $\|x\| \overset{\text{def}}{=} (x, x)^{1/2}$ is called the norm of x (induced by the given inner product), and we can prove the Cauchy-Schwarz inequality

$$|(x, y)| \le \|x\|\,\|y\|, \qquad \forall x, y \in X.$$
Example 7 For $X = \mathbb{R}^m$, the following rule defines the so-called Euclidean inner product:

$$(u, v) = u^T v = \sum_{i=1}^{m} u_i v_i, \qquad \forall u, v \in \mathbb{R}^m.$$

Example 8 If $X = C([a, b]; \mathbb{R})$, the associated norm for any two functions $f, g \in X$ may be defined as in (2.1).
Next we define what it means for a matrix to be orthogonal.

Definition 23 (Orthogonal Matrices) An m × n matrix Q (m ≥ n) is said to be orthogonal if $Q^T Q = I$. For example, the matrix

$$Q = \frac{1}{3} \begin{pmatrix} 2 & 2 & 1 \\ 1 & -2 & 2 \\ 2 & -1 & -2 \end{pmatrix}$$

is orthogonal. Thus, the columns $q_1, q_2, \cdots, q_m$ of an orthogonal Q satisfy $q_i^T q_j = \delta_{ij}$; that is, they are orthonormal with respect to the Euclidean inner product in $\mathbb{R}^m$. For a square Q, we get

$$Q^{-1} = Q^T, \qquad I = Q^T Q = Q Q^T = (Q^T)^T Q^T,$$

therefore $Q^T$ is also an orthogonal matrix, so the rows of Q are orthonormal as well. In particular, the columns and rows of an orthogonal $Q \in \mathbb{R}^{n \times n}$ satisfy $\sum_{i=1}^n q_{ij}^2 = \sum_{j=1}^n q_{ij}^2 = 1$. It follows also that a square orthogonal Q is non-singular. Moreover
Due to the bottom zero elements of R, the columns $q_{n+1}, \cdots, q_m$ are not essential for the representation itself. Hence we can safely write

$$\begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & & \vdots \\ a_{n1} & \cdots & a_{nn} \\ \vdots & & \vdots \\ a_{m1} & \cdots & a_{mn} \end{pmatrix} = \begin{pmatrix} q_{11} & \cdots & q_{1n} \\ \vdots & & \vdots \\ q_{n1} & \cdots & q_{nn} \\ \vdots & & \vdots \\ q_{m1} & \cdots & q_{mn} \end{pmatrix} \begin{pmatrix} r_{11} & r_{12} & \cdots & r_{1n} \\ & r_{22} & \cdots & \vdots \\ & & \ddots & \vdots \\ & & & r_{nn} \end{pmatrix} \tag{3.38}$$

and this form of the QR representation is called the skinny QR factorization.
Theorem 15 Every matrix A has a QR factorization. If, furthermore, A is non-singular, then the QR factorization is unique.

Proof: Let A = QR with A non-singular. Then $A^T A = R^T R$ is positive definite, and there is a unique Cholesky factorization $A^T A = \tilde L \tilde L^T$ with $\tilde L$ having a positive main diagonal. Thus $R^T = \tilde L$ is uniquely determined.
Remark 4 (Solving by QR factorization) If A = QR is a square non-singular matrix, we solve Ax = QRx = b by solving first Qy = b (that is, $y = Q^T b$) and then Rx = y, an upper triangular system, by back substitution.
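A minimal Python sketch of Remark 4, using numpy's QR factorization on a non-singular test matrix of our choosing:

```python
import numpy as np
from scipy.linalg import solve_triangular

A = np.array([[2.0, 2.0, 1.0],
              [1.0, -2.0, 2.0],
              [2.0, -1.0, -2.0]])
b = np.array([1.0, 2.0, 3.0])

Q, R = np.linalg.qr(A)         # A = QR, Q orthogonal, R upper triangular
y = Q.T @ b                    # solve Qy = b using Q^{-1} = Q^T
x = solve_triangular(R, y)     # back substitution for Rx = y
print(x, np.allclose(A @ x, b))
```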
Taking the scalar product of both sides of (3.40) with $q_j$ and using orthogonality, we find the first $(k-1)$ coefficients as $r_{jk} = (q_j, a_k)$ for $j = 1, 2, \cdots, k-1$. Substituting these back into (3.40), we obtain

$$r_{kk} q_k = a_k - \sum_{j=1}^{k-1} (q_j, a_k)\, q_j \overset{\text{def}}{=} b_k \;\Rightarrow\; r_{kk} = \|b_k\|, \qquad q_k = \frac{b_k}{\|b_k\|}.$$
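The recursion above is the classical Gram-Schmidt process. A minimal Python sketch (no safeguards against loss of orthogonality; the modified Gram-Schmidt variant is usually preferred numerically):

```python
import numpy as np

def gram_schmidt_qr(A):
    """Classical Gram-Schmidt QR following the recursion above."""
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for k in range(n):
        b = A[:, k].copy()
        for j in range(k):
            R[j, k] = Q[:, j] @ A[:, k]   # r_jk = (q_j, a_k)
            b -= R[j, k] * Q[:, j]
        R[k, k] = np.linalg.norm(b)        # r_kk = ||b_k||
        Q[:, k] = b / R[k, k]              # q_k = b_k / ||b_k||
    return Q, R

A = np.array([[2.0, 2.0, 1.0], [1.0, -2.0, 2.0], [2.0, -1.0, -2.0]])
Q, R = gram_schmidt_qr(A)
print(np.allclose(Q @ R, A), np.allclose(Q.T @ Q, np.eye(3)))
```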
3.2.10 Solving Tridiagonal Systems: The Thomas Algorithm

The general tridiagonal system may be written in the form

$$a_i u_{i-1} + b_i u_i + c_i u_{i+1} = d_i, \qquad i = 0, 1, \cdots, n, \tag{3.41}$$

where $a_i$, $b_i$, $c_i$ and $d_i$ are coefficients of the system, usually arising from the data and parameters of the problem under consideration, and $u_i$ is the approximate value of $u(x_i)$ at the point $x_i \in [a, b]$. Sometimes it is advisable to assemble the unknowns into the single vector $x = (u_{-1}, u_0, u_1, \cdots, u_n, u_{n+1})^T$, so that in matrix form we have

$$\begin{pmatrix} a_0 & b_0 & c_0 & 0 & \cdots & 0 & 0 \\ 0 & a_1 & b_1 & c_1 & \cdots & 0 & 0 \\ \vdots & & \ddots & \ddots & \ddots & & \vdots \\ 0 & \cdots & 0 & a_{n-1} & b_{n-1} & c_{n-1} & 0 \\ 0 & \cdots & 0 & 0 & a_n & b_n & c_n \end{pmatrix} \begin{pmatrix} u_{-1} \\ u_0 \\ u_1 \\ \vdots \\ u_n \\ u_{n+1} \end{pmatrix} = \begin{pmatrix} d_0 \\ d_1 \\ d_2 \\ \vdots \\ d_{n-1} \\ d_n \end{pmatrix} \tag{3.42}$$
In most applications, $u_{-1}$ and $u_{n+1}$ are often zero because their values would fall outside the domain of definition of the system. In any event, all cases are treated separately. In some cases, tridiagonal systems arise when finite difference formulas are used to discretise a second order ordinary differential equation on an interval [a, b].

If the differential equation is an initial value problem, then the two initial conditions will be prescribed for u at the initial point x = a. For example, u and its derivative u′ will be prescribed at x = a. If $u(a) = u_a$ and $u'(a) = u'_a$, then we divide the interval [a, b], wherein a unique solution is known to exist, into points $x_i = a + ih$, $h = (b - a)/n$, so that $x_0 = a$ and $x_n = b$. Then we assume the existence of a fictitious grid line at $i = -1$ and apply the given initial data to have

$$u_0 = u_a, \qquad u'(a) = u'_a \;\Rightarrow\; \frac{u_1 - u_{-1}}{2h} = u'_a \;\Leftrightarrow\; u_{-1} = u_1 - 2h u'_a. \tag{3.43}$$

Substituting i = 0 in (3.41) gives

$$a_0 u_{-1} + b_0 u_0 + c_0 u_1 = d_0, \tag{3.44}$$

and eliminating $u_{-1}$ between the two equations (3.43) and (3.44), we have

$$u_0 = u_a, \qquad u_1 = \frac{d_0 - b_0 u_a + 2 a_0 h u'_a}{a_0 + c_0}, \qquad u_{i+1} = \frac{d_i - b_i u_i - a_i u_{i-1}}{c_i}, \quad i = 1, 2, \cdots, n, \tag{3.45}$$

and the solution scheme is essentially an explicit scheme.
The Thomas algorithm is a more efficient way of solving tridiagonal matrix systems. It is based on LU decomposition, in which the matrix system Ax = b is rewritten as LUx = b, where L is a lower triangular matrix and U is an upper triangular matrix. The system can be efficiently solved by setting Ux = y and then solving first Ly = b for y and then Ux = y for x. The Thomas algorithm consists of two steps. In step 1, decomposing the matrix into A = LU and solving Ly = b are accomplished in a single downward sweep, taking us straight from Ax = b to Ux = y. In step 2, the equation Ux = y is solved for x in an upward sweep.

We start by seeking a solution of the form

$$u_i = e_i u_{i+1} + f_i, \qquad i = n-1, n-2, \cdots, 0, \tag{3.46}$$

and then select the coefficients $e_i$, $f_i$ so that the resulting solution is the sought-after solution. If we assume a solution of the form indicated, then we must have that

$$u_i = \frac{-c_i}{b_i + a_i e_{i-1}}\, u_{i+1} + \frac{d_i - a_i f_{i-1}}{b_i + a_i e_{i-1}}. \tag{3.48}$$

We compare (3.46) and (3.48) to see that the two are of the same form, so that equality will be possible if, for $i = 1, 2, \cdots, n$, we have

$$e_i = \frac{-c_i}{b_i + a_i e_{i-1}}, \qquad f_i = \frac{d_i - a_i f_{i-1}}{b_i + a_i e_{i-1}}. \tag{3.49}$$

The starting values $e_0$ and $f_0$ are obtained by applying the boundary conditions that the system must satisfy at x = a. For example, suppose $u(a) = u_a$ and $u(b) = u_b$. Then we apply the backward recursion (3.46) for $i = n-1, n-2, n-3, \cdots, 1$ with $u_n = u_b$, $e_0 = 0$, $f_0 = u_a$.
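A compact Python sketch of the Thomas recursions (3.46) and (3.49); the boundary handling here assumes the simple case $a_0 = c_n = 0$ rather than the fictitious-node treatment above, and the test system is ours:

```python
import numpy as np

def thomas(a, b, c, d):
    """Thomas algorithm for a_i u_{i-1} + b_i u_i + c_i u_{i+1} = d_i,
    i = 0..n-1, with a[0] = c[-1] = 0. The forward sweep builds e_i, f_i
    as in (3.49); the backward sweep applies u_i = e_i u_{i+1} + f_i (3.46)."""
    n = len(d)
    e, f = np.zeros(n), np.zeros(n)
    e[0], f[0] = -c[0] / b[0], d[0] / b[0]
    for i in range(1, n):
        denom = b[i] + a[i] * e[i - 1]
        e[i] = -c[i] / denom
        f[i] = (d[i] - a[i] * f[i - 1]) / denom
    u = np.zeros(n)
    u[-1] = f[-1]                       # e[n-1] = 0 since c[n-1] = 0
    for i in range(n - 2, -1, -1):
        u[i] = e[i] * u[i + 1] + f[i]
    return u

# Simple test: 2u_0 - u_1 = 1, -u_{i-1} + 2u_i - u_{i+1} = 0, -u_3 + 2u_4 = 0
n = 5
a = -np.ones(n); a[0] = 0.0
c = -np.ones(n); c[-1] = 0.0
b = 2.0 * np.ones(n)
d = np.zeros(n); d[0] = 1.0
print(thomas(a, b, c, d))   # [5/6, 2/3, 1/2, 1/3, 1/6]
```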
3.2.11 Exercises
1. Let A be an n × n matrix and xT be a vector in Rn . For a given column vector
b with bT ∈ Rn , it is required to solve the system of equations Ax = b for the
unknowns x1 , x2 , · · · , xn .
(i) Explain briefly how you would use (a) Gauss elimination and (b) Cramer’s
rule, to find xT ∈ Rn
(ii) Show that the number of operation counts in Gauss elimination is $\frac{1}{3}(n^3 + 3n^2 - n) \approx \frac{1}{3}n^3$ for large n, and for Cramer's rule is $(n+1)!$. [Note: An operation count is the number of arithmetic operations (additions, multiplications and divisions) required to carry out the direct method and obtain the solution. For example, to solve 2x + 3 = 4, we need two operations: one addition and one division.]
(iii) Show how you will use the LU decomposition of A to find the inverse of the matrix, and also show that the number of operation counts to obtain the inverse of a matrix through this route is approximately $\frac{1}{3}n^3$; compare with the number of operation counts that you would get finding the inverse of a matrix by the usual method of the transpose of the matrix of cofactors.
2x1 − x2 = 1
−xi−1 + 2xi − xi+1 = 0, i = 2, · · · , n − 1,
−xn−1 + 2xn = 0
for n = 5.
5. Let

$$A = \begin{pmatrix} 4 & 2 & 1 & 1 \\ 24 & 12 & 3 & 4 \\ 8 & 4 & 8 & 16 \\ 4 & 2 & 1 & -6 \end{pmatrix}, \quad B = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 2 & -1 & 3 & -4 \\ 4 & 1 & 5 & -2 \\ -1 & 2 & -2 & 5 \end{pmatrix}, \quad b = \begin{pmatrix} 3 \\ 20 \\ -8 \\ 10 \end{pmatrix}, \quad c = \begin{pmatrix} -10 \\ -7 \\ -27 \\ -3 \end{pmatrix}.$$

Show that the system Ax = b has an infinite number of solutions in which one of the variables (which one?) can be chosen arbitrarily. Show also that the system Bx = c has an infinite number of solutions in which two of the variables can be chosen arbitrarily. Show further that if, for example, the numbers −27 or −3 are changed, the system Bx = c has no solution.
6. Scaling during the solution of systems Ax = b:

Pivoting is not always the remedy for computational difficulties. Consider the following example:

$$\begin{pmatrix} 10^{-6} & -1 \\ 10^{-6} & 10^{-12} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 1 \\ 10^{-6} \end{pmatrix}. \tag{3.50}$$

During Gauss elimination with partial pivoting, no pivoting will be necessary (that is, it will not be necessary to interchange rows). Suppose we have a six digit computer. Triangularization will give

$$\begin{pmatrix} 10^{-6} & -1 \\ 0 & 1 + 10^{-12} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 1 \\ -1 + 10^{-6} \end{pmatrix}. \tag{3.51}$$

Now, with a six digit computer, $1 + 10^{-6}$ and $1 + 10^{-12}$ are both equal to 1, so $x_2 = -1$, $x_1 = 0$. Using, say, a higher accuracy computer, we have $x_2 = -1 + 10^{-6}$, $x_1 = 1$, so working to six significant digits gives $x_1 = 1$, $x_2 = -1$. Suppose we scale the system: multiply the second equation in (3.50) by $10^6$ to have the system

$$\begin{pmatrix} 10^{-6} & -1 \\ 1 & 10^{-6} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \end{pmatrix}. \tag{3.52}$$

After interchanging rows 1 and 2 of the scaled system (3.52) (pivoting), elimination on the six digit computer returns the accurate solution. So scaling was important here. Now consider the system

$$\begin{pmatrix} 0.1 & -1 & 1 \\ 20 & 40 & 2000 \\ 8 & 15 & 200 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} -1 \\ 1840 \\ 135 \end{pmatrix}. \tag{3.54}$$

Triangularize system (3.54) (by hand) using Gauss elimination with pivoting and record the stages at which pivoting has taken place. Next reconsider the system, this time with scaling. Triangularize again and report on the effects of scaling on the solution process and pivoting strategy.
7. Let $I_2$ be the identity matrix of order 2. Show by direct computation that the non-singular matrix $A = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$ has no LU decomposition, while the singular matrix $A + I_2$ has one.
(a) Solve the problem by calculating the inverse of A and then computing the
solution x = A−1 b
(b) Solve the problem by Cramer's rule, calculating the determinants and returning the solution x.
(c) Solve the problem by Gauss elimination with partial pivoting to return the
solution x
(d) Solve the problem by LU decomposition and return the solution x
(e) Find the inverse and determinant of the matrix by the route of LU decompo-
sition.
(f) Consider the 4×4 system 5x1 +7x2 +6x3 +5x4 = 23, 7x1 +10x2 +8x3 +7x4 = 32,
6x1 + 8x2 + 10x3 + 9x4 = 33, 5x1 + 7x2 + 9x3 + 10x4 = 31.
i. Use your programme that you would have written above to find the inverse
of the coefficient matrix for this problem.
ii. In theory, $A^{-1}A = AA^{-1}$, but in practice, because of round off errors, we cannot always expect equality. Calculate and print $A^{-1}A$ and $AA^{-1}$ from your programme and report any discrepancies.
iii. Small changes in data (right hand side) can sometimes lead to relatively
large changes in the solution. Use your programme to solve the linear
equations by Gauss elimination to solve the problem given in this exam-
ple when the right hand sides is altered to contain, in turn the following
vectors: b1 = (23.01, 31.99, 32.99, 31.01)T , b2 = (23.1, 31.9, 32.9, 31.1)T .
iv. Small changes in data (entries of matrix A) can sometimes lead to relatively
large changes in the solution. Now let the system be modified so that
the (1, 1) entry respectively takes the value 5.01 and 4.99, instead of the
original value 5. Use your programme (LU Factorization) above to find
A−1 in both cases and compare the results with the exact value of A−1 .
(g) The entries of an n × n Hilbert matrix $H = (h_{ij})$ are given by $h_{ij} = \frac{1}{i + j - 1}$. Hilbert matrices are considered badly behaved or ill-conditioned, and are often used to test routines. Use your programme for LU factorization to find the inverse of a 4 × 4 Hilbert matrix. Test your answer by calculating $HH^{-1}$ and $H^{-1}H$. For full effect, you should compute the entries of H rather than reading them as data.
(ii) $\|x\| = 0 \Leftrightarrow x = 0$; definiteness

Exercise: verify that the p-norms are indeed norms in the sense of Definition 25. The triangle inequality for the p-norm, for any p, is known as the Minkowski inequality. To prove the results for other values of p we make use of the Cauchy-Schwarz inequality. Commonly used values of p are p = 1, 2 or p = ∞, as given in (2.5)-(2.7). Starting with the vector norms, we can define a matrix or subordinate norm to have the same properties as the vector norm, as follows:
Definition 27 (Matrix Norm.) For each vector p-norm in $\mathbb{R}^n$, there is a subordinate or consistent matrix p-norm $\|A\|_p$, where $A \in \mathbb{R}^{n \times n}$ is an n × n matrix, defined such that

$$\|A\|_p = \max_{\|x\|_p = 1} \|Ax\|_p = \max_{x \ne 0} \frac{\|Ax\|_p}{\|x\|_p}. \tag{3.56}$$
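Subordinate norms are available directly in numerical libraries. A short Python sketch; the Monte-Carlo check of (3.56) at the end is our illustration, not the text's:

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])

# numpy's `ord` argument matches the vector p-norm the matrix norm is induced by.
print(np.linalg.norm(A, 1))       # max absolute column sum
print(np.linalg.norm(A, np.inf))  # max absolute row sum
print(np.linalg.norm(A, 2))       # largest singular value
print(np.linalg.norm(A, 'fro'))   # Frobenius norm (not subordinate to a vector norm)

# Monte-Carlo check of (3.56): max ||Ax||_2 over random unit vectors -> ||A||_2.
x = np.random.randn(2, 100_000)
x /= np.linalg.norm(x, axis=0)    # normalise columns so ||x||_2 = 1
print(np.max(np.linalg.norm(A @ x, axis=0)))
```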
Solution: We apply the definition exactly, recalling that λ in (3.60) is called an eigenvalue of the matrix A and satisfies the equation $\mathrm{Det}(A - \lambda I) = 0$, leading to a polynomial of degree n in λ whenever A is an n × n matrix, and therefore having n roots by the fundamental theorem of algebra. For the matrix A of the present example, we have

$$\|A\|_1 = \max\{1, 2, 25, 3\} = 25, \qquad \|A\|_\infty = \max\{10, 8, 7, 6\} = 10, \qquad \|A\|_F = \sqrt{181} \approx 13.454,$$
$$\|A\|_2 = \max\{|7.03128|, |-0.401914 + 1.29592i|, |-0.401914 - 1.29592i|, |0.772548|\} = \max\{7.03128, 1.35681, 1.35681, 0.772548\} \approx 7.031.$$
The norms defined in (3.58)-(3.61) are stressed because they are compatible with the vector norms with p = 1, 2, ∞ in the sense of (3.57). So, for example, we have that for all vectors $x \in \mathbb{R}^n$ and for all matrices $A \in \mathbb{R}^{n \times n}$,

$$\|Ax\|_1 \le \|A\|_1 \|x\|_1, \quad \|Ax\|_\infty \le \|A\|_\infty \|x\|_\infty, \quad \|Ax\|_F \le \|A\|_F \|x\|_2, \quad \|Ax\|_2 \le \|A\|_2 \|x\|_2. \tag{3.62}$$

It is important to distinguish between vector and matrix norms even though they bear the same notation in most of the cases. Notice that Ax is a vector, so that $\|Ax\|_1$ denotes the 1-norm of a vector, while $\|A\|_1$ denotes the use of a matrix norm. It is also important to note that the pairs of matrix and vector norms are compatible only in the sense of definition (3.57) and cannot be mixed. For example, if $x = (0, 0, 1, 0)^T$ and A is the matrix of Example 10, then we have $\|x\|_1 = 1$, but $\|Ax\|_1 = 25$ and $\|A\|_\infty \|x\|_1 = 10$. That is, we cannot expect to have the inequality $\|Ax\|_1 \le \|A\|_\infty \|x\|_\infty$ or $\|Ax\|_1 \le \|A\|_\infty \|x\|_1$. Compatibility is a property that connects vector and matrix norms (of the same index). In general, a compatible matrix norm for any matrix A and any index p is the number

From here it is then possible to find this infimum such that, for example, $K_1 = \|A\|_1$, $K_2 = \|A\|_2$ and $K_\infty = \|A\|_\infty$. We note that $K_2 \ne \|A\|_F$. Thus, although we have compatibility in the sense that $\|Ax\|_2 \le \|A\|_F \|x\|_2$, there is indeed a matrix norm $K_2 = \|A\|_2$ (as defined above in terms of the largest absolute eigenvalue of the matrix), smaller for most matrices than $\|A\|_F$, such that $\|Ax\|_2 \le \|A\|_2 \|x\|_2$ for every vector $x \in \mathbb{R}^n$.
$\|A\|_\infty \|A^{-1}\|_\infty \approx 3.3 \times 10^8$, a very large number. Now suppose we alter one element of b slightly and write

$$A = \begin{pmatrix} 1.2969 & 0.8648 \\ 0.2161 & 0.1441 \end{pmatrix}, \qquad b = \begin{pmatrix} 0.8642 \\ 0.14399 \end{pmatrix}.$$

Then the exact solution to the problem becomes $x = (866.8, -1298.9)^T$, a very large deviation from the solution, for very small disturbances in the data. Thus we observe that small changes in the data lead to very large changes in the solution. This kind of system is said to be ill-conditioned. We must find a way of measuring the condition of a system. We have
This bound for δx is sharp in the sense that for all matrices A and right hand side vectors b, there exists δb such that we have equality. Also,

$$\|b\| = \|Ax\| \le \|A\| \|x\| \;\Rightarrow\; \|x\| \ge \frac{\|b\|}{\|A\|}.$$

$$\frac{\|\delta x\|}{\|x\|} \le \kappa(A) \frac{\|\delta b\|}{\|b\|}. \tag{3.64}$$

$$\frac{\|\delta x\|_\infty}{\|x\|_\infty} = 10^{-3} = \frac{\|\delta b\|_\infty}{\|b\|_\infty}$$
However, inequality (3.64) simply shows that we are short of the bound by three orders of magnitude. Complete the computation for case (ii) and see for yourself that (3.64) does indeed give exact bounds.

The point is that we normally would not know the solution x. But since we normally would know the matrix A, we can estimate its condition number and as such should be in a position to say something about the bound on the relative error in the computation.
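A hedged numerical sketch of the ill-conditioned example above: the unperturbed right hand side $b = (0.8642, 0.1440)^T$ is an assumption of this note (it gives the exact solution $(2, -2)^T$), since the original b is not shown here.

```python
import numpy as np

A = np.array([[1.2969, 0.8648],
              [0.2161, 0.1441]])
b = np.array([0.8642, 0.1440])            # assumed unperturbed right hand side
b_perturbed = np.array([0.8642, 0.14399])  # one entry changed in the 5th digit

print(np.linalg.solve(A, b))            # ~ (2, -2)
print(np.linalg.solve(A, b_perturbed))  # ~ (866.8, -1298.9): wildly different
print(np.linalg.cond(A, np.inf))        # kappa_inf(A) ~ 3.3e8
```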
So far we have considered perturbations in the data provided by b. Let us also consider perturbations in A: $(A + \delta A)(x + \delta x) = b \Rightarrow A\delta x = -\delta A(x + \delta x)$, and so
Lemma 6 If C is an n × n matrix with $\|C\| < 1$ for any compatible matrix norm, then $\|(I - C)^{-1}\| \le \frac{1}{1 - \|C\|}$.

Proof: $(A + \delta A)\delta x = \delta b - \delta A x$, so that $(I + A^{-1}\delta A)\delta x = A^{-1}\delta b - (A^{-1}\delta A)x$. Apply Lemma 6 with $\varepsilon = \|A^{-1}\delta A\|$ to get $\|(I + A^{-1}\delta A)^{-1}\| \le \frac{1}{1 - \|A^{-1}\delta A\|} = \frac{1}{1 - \varepsilon}$, so that $\|\delta x\| \le \frac{1}{1 - \varepsilon}(\|A^{-1}\|\|\delta b\| + \varepsilon\|x\|)$. But $\|x\| \ge \frac{\|b\|}{\|A\|}$, so that

$$\frac{\|\delta x\|}{\|x\|} \le \frac{1}{1 - \varepsilon}\left( \varepsilon + \kappa(A)\frac{\|\delta b\|}{\|b\|} \right)$$

as required.
The equations Ax = b give an ill-conditioned system if κ(A) is large. If κ(A) is so large that it becomes comparable with the dominant relative errors in the data (that is, errors such as round off errors or experimental errors in the numbers provided for the data), then all accuracy in the solution may be lost. The bounds on the dominant relative error may be pessimistic; however, they are rigorous and simple to express. This error estimate is also independent of the scaling of A by scalar multiplication.
All the bounds on the relative error are given in terms of $\|A^{-1}\|$, and $A^{-1}$ is costly to find. However, error bounds on $A^{-1}$ itself are more satisfactory: suppose we have found $B \approx A^{-1}$ and we form $R = AB - I$, with $\|R\|$ small. Then $B - A^{-1} = A^{-1}R$, so that

$$\frac{\|B - A^{-1}\|}{\|A^{-1}\|} \le \|R\|. \tag{3.66}$$

Thus, while a small r, where $r = b - Ax$, does not imply small relative errors in x, a small $\|R\|$ does imply small relative errors in B. Also,

$$A^{-1}(I + R) = B \;\Rightarrow\; A^{-1} = B(I + R)^{-1}.$$

Hence,

$$\|A^{-1}\| \le \|B\| \|(I + R)^{-1}\| \le \frac{\|B\|}{1 - \|R\|}, \quad \text{if } \|R\| < 1. \tag{3.67}$$
Example 12 Consider the system

$$\begin{pmatrix} 0.24 & 0.36 & 0.12 \\ 0.12 & 0.16 & 0.24 \\ 0.15 & 0.21 & 0.25 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} 0.84 \\ 0.52 \\ 0.64 \end{pmatrix};$$

estimate the condition number and also determine the condition.

Solution: Solving the problem using Gauss elimination with 2 digit floating point arithmetic all the way, and scaled partial pivoting, the final working array (with the multipliers stored below the diagonal) is

$$\begin{pmatrix} 0.24 & 0.36 & 0.12 & | & 0.84 \\ 0.50 & -0.02 & 0.18 & | & 0.10 \\ 0.63 & 1.0 & -0.01 & | & 0.01 \end{pmatrix}$$

Continuing the solution, we have by back-substitution $\tilde x = (25, -14, -1)^T$. The residual is $r = (0.0, 0.0, 0.08)^T$. In fact the exact solution is $x = (-3, 4, 1)^T$, so that the solution is in error in the first significant digit. The matrix of coefficients in the system has $\|A\|_\infty = 0.72$. Furthermore, the matrix

$$B = \begin{pmatrix} 0.252 & 0.36 & 0.12 \\ 0.112 & 0.16 & 0.24 \\ 0.147 & 0.21 & 0.25 \end{pmatrix}$$
3.3.3 Iterative improvement
The example above shows that when κ(A) is large relative to the precision used, the solution process may lead to relatively large errors in the computations. But it is not always guaranteed to do so. Whether or not the system is ill-conditioned can be ascertained (without even knowledge of the condition number) during iterative improvement.

Suppose we wish to solve Ax = b. We know already that $r = b - A\tilde x^{(1)}$ is the residual, where $\tilde x^{(1)}$ is the first approximation to the solution x. Consequently, $r = Ae$, where $e = x - \tilde x^{(1)}$ is the error committed at this stage of the solution process. Thus we have a linear system in e with the same coefficient matrix as the original system but with a different right hand side. Let $\hat e^{(1)}$ be the approximate solution of $Ae = r$. $\hat e^{(1)}$ will in general not agree with e but should give an indication of the size of e. If $\|\hat e^{(1)}\|/\|\tilde x^{(1)}\| \simeq 10^{-s}$, we can conclude that the first s decimal places of $\tilde x^{(1)}$ probably agree with those of x. We would then expect $\hat e^{(1)}$ to be that close an approximation to e, and hence we expect $\tilde x^{(2)} = \tilde x^{(1)} + \hat e^{(1)}$ to be a better approximation to x than $\tilde x^{(1)}$. If necessary, we can calculate $r^{(2)} = b - A\tilde x^{(2)}$, and again solve $Ae = r^{(2)}$ to obtain a new correction $\tilde x^{(3)} = \tilde x^{(2)} + \hat e^{(2)}$ to x. The number of places in agreement in successive approximations $\tilde x^{(1)}, \tilde x^{(2)}, \tilde x^{(3)}, \cdots$, as well as the size of the residual, should give an indication of the accuracy of the solutions. One normally carries out the iteration until
\[ \frac{\|\hat e^{(k)}\|}{\|\tilde x^{(k)}\|} \simeq 10^{-t} \]
if t decimal places of accuracy are carried during the solution. The number of iteration steps required to achieve this can be shown to increase with κ(A). If κ(A) is very large,
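A minimal sketch of this process in Python is given below, assuming NumPy and SciPy are available. (In practice the residual r should be computed in higher precision than that used for the factorization; this sketch ignores that refinement.)

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    def iterative_improvement(A, b, tol=1e-12, max_iter=10):
        # Factor A once; every correction step reuses the same LU factors.
        lu_piv = lu_factor(A)
        x = lu_solve(lu_piv, b)            # first approximation x~(1)
        for _ in range(max_iter):
            r = b - A @ x                  # residual r = b - A x~(k)
            e_hat = lu_solve(lu_piv, r)    # approximate solution of A e = r
            x = x + e_hat                  # x~(k+1) = x~(k) + e~(k)
            # stop when ||e_hat||/||x|| indicates enough digits of agreement
            if np.linalg.norm(e_hat, np.inf) <= tol * np.linalg.norm(x, np.inf):
                break
        return x

If the corrections $\hat e^{(k)}$ fail to decrease in size, the loop itself is signalling the extreme ill-conditioning described above.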
3.3.4 Exercises
1. Discuss how you would use backward error analysis to measure the effect of rounding
errors in computations.
2. By choosing a suitable matrix norm, show that
\[ \text{(i)}\; \|AB\| \le \|A\|\|B\|, \qquad \text{(ii)}\; \|A^r\| \le \|A\|^r, \]
\[ \text{(iii)}\; \|A\|_\infty = \max_{1\le i\le n}\left(\sum_j |a_{ij}|\right), \quad \text{the maximum absolute row sum.} \]
3. If $\vec y$ is the approximate solution of the system $A\vec x = \vec b$, where $\vec b$ is a non-zero vector, and if we define $\vec r = \vec b - A\vec y$, $\vec e = \vec y - A^{-1}\vec b$, and, for any compatible matrix norm, $\kappa(A) = \|A\|\|A^{-1}\|$, then show that
\[ \frac{\|\vec r\|}{\kappa(A)\|\vec b\|} \le \frac{\|\vec e\|}{\|A^{-1}\vec b\|} \le \frac{\kappa(A)\|\vec r\|}{\|\vec b\|}. \qquad (3.68) \]
Interpret the terms in these inequalities and explain the relevance of these inequalities to the numerical solution of systems of linear equations in fixed length floating point arithmetic.
4. Construct a 7 × 7 matrix A whose entries satisfy $a_{ii} = 0.1$, $i = 1, 2, \cdots, 7$, $a_{i,i+1} = -1$, and $a_{ij} = 0$ for all $j \ne i, i+1$. Now consider the system $A\vec x = \vec b$ where $b_i = 0.1$ if i is odd and $b_i = -1$ if i is even. Find the exact solution of the system. Next change $b_7$ from 0.1 to 0.101, leaving all the other entries the same. What is the effect of this small change on the solution? Can you explain?
5. Consider again the last question, but this time construct a 5 × 5 system. Assume that, due to round-off errors, the solution was computed to be $x_1 = 101$, $x_2 = 10$, $x_3 = 2$, $x_4 = 0.1$ and $x_5 = 1.01$. Compute the residual $\vec r$ and the error $\vec e$. Find a bound for the relative error in terms of the relative residual.

6. Use this result to estimate the condition number, κ(A), for the following matrices:
\[ \text{(i)}\; A = \begin{pmatrix} 1 & -1 & 0 \\ 2 & -1 & 1 \\ 2 & -2 & 1 \end{pmatrix}, \qquad \text{(ii)}\; A = \begin{pmatrix} 7 & 8 & 9 \\ 8 & 9 & 10 \\ 9 & 10 & 8 \end{pmatrix}, \]
using $\|\cdot\|_\infty$. Now, for the very same norm, explicitly calculate κ(A). Comment on how good your estimate is.
7. Further Reading: Complete this section by revising and reading about the fol-
lowing:
Chapter 4

Iterative Methods for solving Ax = b
4.1 Introduction
There are instances in which direct methods such as LU decomposition may not be the best method to apply to a given linear system. Direct methods are attractive when:

1. We have several equations with the same coefficient matrix but with different right hand sides.

2. The matrix A is nearly singular. In this case, small changes in the residual do not imply small errors in the solution.

When direct methods are no longer competitive, iterative methods can be introduced; this is the case, in particular, if the system to be solved is large and sparse, that is, if the number of zero elements in the system is much larger than the number of non-zero elements.

The basic idea behind an iterative method for solving the linear system Ax = b can be described as follows:

1. Split the matrix A in the form A = N − P, where the matrix N is simple and easy to invert.

2. Choose an initial estimate $x^{(0)}$ of the solution x.

3. Generate the sequence $x^{(1)}, x^{(2)}, \cdots, x^{(k)}, \cdots$ of estimates of x through $Nx^{(k+1)} = Px^{(k)} + b$, $k = 0, 1, 2, \cdots$.
Example 13 Consider
\[ A = \begin{pmatrix} 4 & 1 & 0 \\ 2 & 5 & 1 \\ -1 & 2 & 4 \end{pmatrix}, \qquad b = \begin{pmatrix} 1 \\ 0 \\ 3 \end{pmatrix}, \]
and solve Ax = b by iteration.

Solution: Set
\[ N = \begin{pmatrix} 4 & 0 & 0 \\ 0 & 5 & 0 \\ 0 & 0 & 4 \end{pmatrix}, \qquad P = \begin{pmatrix} 0 & -1 & 0 \\ -2 & 0 & -1 \\ 1 & -2 & 0 \end{pmatrix}, \]
so that A = N − P, and take $x^{(0)} = (1, 1, 1)^T$. Then we have
\[ x^{(1)} = \begin{pmatrix} 0 \\ -\tfrac{3}{5} \\ \tfrac{1}{2} \end{pmatrix}, \quad x^{(2)} = \begin{pmatrix} \tfrac{2}{5} \\ -\tfrac{1}{10} \\ \tfrac{21}{20} \end{pmatrix}, \quad \cdots, \quad x^{(7)} = \begin{pmatrix} \tfrac{10601}{32000} \\ -\tfrac{13347}{40000} \\ \tfrac{12751}{12800} \end{pmatrix}, \quad \cdots. \]
The true solution is $x = (1/3, -1/3, 1)^T$. Is the iteration converging? When does the iteration converge?
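A few lines of Python (using NumPy, and the splitting chosen above) reproduce these iterates:

    import numpy as np

    A = np.array([[4., 1., 0.], [2., 5., 1.], [-1., 2., 4.]])
    b = np.array([1., 0., 3.])

    # Splitting A = N - P with N = diag(A), as in Example 13.
    N = np.diag(np.diag(A))
    P = N - A

    x = np.array([1., 1., 1.])              # x^(0)
    for k in range(7):
        x = np.linalg.solve(N, P @ x + b)   # N x^(k+1) = P x^(k) + b
    print(x)                                # approaches (1/3, -1/3, 1)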
Suppose that at step k we have $e^{(k)} = x - x^{(k)}$, where $e^{(k)}$ is the error committed at step k. Since $Nx = Px + b$ and $Nx^{(k+1)} = Px^{(k)} + b$, subtraction gives $Ne^{(k+1)} = Pe^{(k)}$, that is,
\[ e^{(k+1)} = N^{-1}Pe^{(k)} = Me^{(k)}, \quad \text{where } M = N^{-1}P. \]
So if M is in some sense a small matrix, then the errors $e^{(k)}$ must be diminishing, and we can hope that the iteration will converge to the true solution after a certain number of steps. We have the following theorem.
Theorem 17 Let A = N − P with N non-singular, and suppose that $\lambda = \|N^{-1}P\|_\infty < 1$. Then

1. A is non-singular,

2. $\|x^{(j)} - x\|_\infty \le \lambda^j\|x^{(0)} - x\|_\infty$, that is, $\|e^{(j)}\|_\infty \le \lambda^j\|e^{(0)}\|_\infty$, and

3. if x is a solution of the linear system Ax = b, and $(x^{(j)})_{j\ge1}$ is the sequence of points generated by the relation $Nx^{(k+1)} = Px^{(k)} + b$, then $\lim_{j\to\infty} x^{(j)} = x$.
Proof:
1. By contradiction. If A is singular, then there exists $y \ne 0$ such that $Ay = 0$. Therefore $(N - P)y = 0$, or $y = N^{-1}Py$, and so $\|y\|_\infty = \|N^{-1}Py\|_\infty \le \|N^{-1}P\|_\infty\|y\|_\infty$. Since $y \ne 0$, we must have $\lambda \ge 1$. Contradiction. So A is non-singular.

2. Set $M = N^{-1}P$, $e^{(j)} = x^{(j)} - x$; then
\[ e^{(j)} = Me^{(j-1)} = M^2e^{(j-2)} = \cdots = M^je^{(0)}, \]
and thus $\|e^{(j)}\|_\infty \le \|M^j\|_\infty\|e^{(0)}\|_\infty \le \lambda^j\|e^{(0)}\|_\infty$.

3. Since $\lambda < 1$, $\lim_{j\to\infty}\lambda^j = 0$ and the iteration converges.
Now recall Definition 2.3, equation (2.19) on page 16 of the spectral radius of a matrix.
We have the following theorem
Theorem 18 ρ(A) ≤ kAk for any compatible matrix norm
Proof: Let $\lambda_i$, $i = 1, 2, \cdots, n$, be the eigenvalues of the matrix A, and let $x_i$ be the corresponding eigenvectors. Then we have, for each i, $Ax_i = \lambda_ix_i$. Thus $\|\lambda_ix_i\| = |\lambda_i|\|x_i\| = \|Ax_i\| \le \|A\|\|x_i\|$ for each i. Thus $|\lambda_i| \le \|A\|$ for each i, and hence $\rho(A) \le \|A\|$, as required.
Remark 6 The first point here is that Theorems 17 and 18 show us that for any iterative process to converge, the spectral radius of the matrix $M = N^{-1}P$ must be smaller than unity. The matrix $N^{-1}P$ is called the iteration matrix. The second point is that the spectral radius is difficult to find, so we may sometimes be content with the requirement that $\|M\| < 1$ for some compatible matrix norm. The only snag here is that $\|M\|$ can be larger than unity for the chosen norm while the spectral radius, which is not determined by any norm, is less than unity, in which case the iteration will still converge.
4.2 Some special iteration schemes

Write A = D − L − U, where D is the diagonal of A and −L, −U are respectively its strictly lower and strictly upper triangular parts.

1. N = D and P = L + U, so that $M = N^{-1}P = D^{-1}(L + U)$: we have the Jacobi iteration.
Definition 30 (Rate of convergence) The number of decimal digits by which the error is eventually decreased at each step of a convergent iteration gives a measure of the rate of convergence.
Theorem 19 The rate of convergence can be measured by calculating $-\log_{10}(\rho)$, where $\rho$ is the spectral radius of the iteration matrix.

Proof: Let the iteration matrix M have m linearly independent eigenvectors $v_s$ corresponding to the eigenvalues $\lambda_s$, $s = 1, 2, \cdots, m$, ordered so that $|\lambda_1| > |\lambda_2| > \cdots > |\lambda_m|$. Then each vector $e^{(0)}$ with m components may be written as $e^{(0)} = \sum_{s=1}^m c_sv_s$, where each $c_s$ is a scalar. But $v_s$ is an eigenvector of M, so $Mv_s = \lambda_sv_s$. Thus we have
\[ e^{(1)} = Me^{(0)} = \sum_{s=1}^m c_sMv_s = \sum_{s=1}^m c_s\lambda_sv_s. \]
Therefore
\[ e^{(n)} = \sum_{s=1}^m c_s\lambda_s^nv_s = \lambda_1^n\sum_{s=1}^m c_s\left(\frac{\lambda_s}{\lambda_1}\right)^nv_s. \]
For large n the terms $(\lambda_s/\lambda_1)^n$, $s \ge 2$, are negligible, so $e^{(n)} \approx c_1\lambda_1^nv_1$ and $\|e^{(n+1)}\|/\|e^{(n)}\| \approx |\lambda_1| = \rho$. The number of decimal digits gained per iteration is therefore $-\log_{10}(\rho)$,
a positive quantity, since in a convergent iteration $0 < \rho < 1$. More generally, we have that $e^{(n+1)} \simeq \lambda_1e^{(n)}$ for large n.

Lyusternik's Method

Write $e^{(n+1)} = \lambda_1e^{(n)} + \delta^{(n)}$, where $\delta^{(n)}$ is a correction term with small components and $|\lambda_1| < 1$. By definition of the error,
\[ x = x^{(n)} + e^{(n)} = x^{(n+1)} + e^{(n+1)} \qquad (4.4) \]
\[ \phantom{x} = x^{(n+1)} + \lambda_1e^{(n)} + \delta^{(n)}. \qquad (4.5) \]
Eliminating $e^{(n)}$ between (4.4) and (4.5), we have
\[ x = \frac{x^{(n+1)} - \lambda_1x^{(n)}}{1 - \lambda_1} + \frac{\delta^{(n)}}{1 - \lambda_1}. \qquad (4.6) \]
So, if $\|\delta^{(n)}\|$ is small compared to $1 - \lambda_1$, a good approximation to x is
\[ x = \frac{x^{(n+1)} - \lambda_1x^{(n)}}{1 - \lambda_1} = x^{(n)} + \frac{x^{(n+1)} - x^{(n)}}{1 - \lambda_1}, \]
so small differences in successive iterates do not necessarily imply a close approximation to the solution. As an illustration, suppose for example that $\max\{|x_i^{(n+1)} - x_i^{(n)}|,\ i = 1, 2, \cdots, m\} = \varepsilon$ and $\lambda_1 = 0.99$. Then
\[ \max\left\{\frac{|x_i^{(n+1)} - x_i^{(n)}|}{|1 - \lambda_1|},\ i = 1, 2, \cdots, m\right\} = 100\varepsilon, \]
where $x_i^{(n)}$ is the i-th component of $x^{(n)}$. To obtain a solution with maximum error of $\varepsilon$ we must iterate until $\max_i|x_i^{(n+1)} - x_i^{(n)}| \simeq 0.01\varepsilon$.
For most problems the largest eigenvalue $\lambda_1$ will not be known in advance and must be estimated. One way of doing this is the following: for large n, we have
\[ e^{(n)} \simeq \lambda_1e^{(n-1)} \Rightarrow e^{(n+1)} - e^{(n)} = \lambda_1(e^{(n)} - e^{(n-1)}), \]
\[ \text{that is,} \quad x^{(n+1)} - x^{(n)} = \lambda_1(x^{(n)} - x^{(n-1)}), \]
and so
\[ \|x^{(n+1)} - x^{(n)}\| = |\lambda_1|\|x^{(n)} - x^{(n-1)}\| \Rightarrow |\lambda_1| \simeq \frac{\|x^{(n+1)} - x^{(n)}\|}{\|x^{(n)} - x^{(n-1)}\|} \]
for any suitably defined vector norm.
Aitken's Method

If $\lambda_1$ is eliminated instead, we obtain, componentwise,
\[ x_i \approx x_i^{n+1} - \frac{(x_i^{n+1} - x_i^n)^2}{x_i^{n+1} - 2x_i^n + x_i^{n-1}}, \qquad (4.7) \]
and
\[ \lambda_1 \approx \frac{x_i^{n+1} - x_i^n}{x_i^n - x_i^{n-1}}, \quad \text{for each } i, \qquad (4.8) \]
and a better value of $\lambda_1$ could be obtained by taking an average over all the values of i.
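A short Python sketch of (4.7) and (4.8), applied to an artificial linearly convergent scalar sequence (the function names are illustrative only):

    def aitken(x_prev, x_curr, x_next):
        # componentwise Aitken extrapolation, formula (4.7)
        return x_next - (x_next - x_curr) ** 2 / (x_next - 2.0 * x_curr + x_prev)

    def lambda1_estimate(x_prev, x_curr, x_next):
        # estimate of the dominant eigenvalue, formula (4.8)
        return (x_next - x_curr) / (x_curr - x_prev)

    # Linearly convergent test sequence: x_{n+1} = 0.5*x_n + 1, limit 2.
    xs = [0.0]
    for _ in range(2):
        xs.append(0.5 * xs[-1] + 1.0)
    print(lambda1_estimate(*xs))   # 0.5, the "dominant eigenvalue"
    print(aitken(*xs))             # 2.0, recovered exactly in this linear case

For a genuinely linear error recurrence, as here, Aitken's formula recovers the limit exactly; in general it merely accelerates convergence.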
4.4 Exercises
1. Let A = D − L − U be a non-singular matrix. For the system Ax = b, where A is
an n × n matrix and b is an n × 1 column vector, show that
(a) The eigenvalues µ of the Jacobi iteration are found by evaluating a determinant to solve the equation Det(µD − L − U) = 0.
(b) The eigenvalues µ of the Gauss Seidel Iteration are found by evaluating a
determinant to solve the equation Det(µD − µL − U) = 0.
(c) If ω ≥ 1 is a parameter, then the eigenvalues µ of the SOR iteration are found by evaluating a determinant to solve the equation Det(µI − H(ω)) = 0, where
\[ H(\omega) = (I - \omega D^{-1}L)^{-1}\{(1 - \omega)I + \omega D^{-1}U\}. \]
Use the matrix
\[ A = \begin{pmatrix} -4 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & -4 \end{pmatrix} \]
to illustrate that the different iterations shown here will converge and, in the case of the SOR iteration scheme, find the best ω.
3. Consider the system $A\vec x = \vec b$ where
\[ A = \begin{pmatrix} 2 & -1 \\ -1 & 2 \end{pmatrix}, \qquad b = \begin{pmatrix} 1 \\ 1 \end{pmatrix}. \]
(a) Show that the Jacobi iteration will converge for any initial guess.
(b) Show that the Gauss-Seidel iteration will converge for any initial guess.
(c) Show that $0 < \rho(H_G) < \rho(H_J)$, where $H_G$ and $H_J$ are the Gauss-Seidel and Jacobi iteration matrices. What does this inequality say about the relative rates of convergence of the Jacobi and Gauss-Seidel iteration schemes?
(d) Show that there exists an optimum relaxation factor ωb for which an SOR
iteration scheme will converge fastest and find ωb .
(e) Carry out a few iterations for each method and compare the estimates to the
true solution
4. (a) Let $A = (a_{ij})$ be a symmetric positive definite matrix of size n × n. Show that $a_{ii} > 0$ for all i. [Hint: consider $\vec e_i^TA\vec e_i$.] Show that the eigenvalues of A are all positive.
(b) Let D = (dij ) be an n × n diagonal matrix such that dii > 0 for all i. Show
that D is positive definite and symmetric
(c) Show that the Gauss-Seidel iteration for any symmetric positive definite matrix will converge for any initial guess, in the following steps:
(i) Set $A = L + D + U = L + D + L^T$, since A is symmetric. Construct the matrix $P = A - H_G^TAH_G$, where $H_G = -(D + L)^{-1}L^T$, and show that P is symmetric.
(ii) Show that $H_G = I - Q$, where $Q = (D + L)^{-1}A$.
(iii) Now show that $P = Q^T(AQ^{-1} + (Q^T)^{-1}A - A)Q$ (using (ii)). Then show that $P = Q^TDQ$ (using the above and the definition of Q).
(iv) Use parts (a) and (b) of this problem to show that P is positive definite and symmetric.
(v) Suppose λ is an eigenvalue of $H_G$ and $\vec u$ a corresponding eigenvector. Show that $\vec u^TP\vec u > 0 \Rightarrow |\lambda| < 1$. Does this complete the proof?
5. The following numbers, correct to five decimal places, are the third components of the fifth, sixth and seventh iteration vectors, respectively, of a Jacobi iteration scheme: 0.41504, 0.45878, 0.49500. Use Aitken's method to calculate an improved value for this component. Verify whether Lyusternik's method gives the same value to four decimal places when $\lambda_1$ is approximated by (4.8).
7. Practical Problem: Let $\Omega = \{(x, y) \in \mathbb{R}^2 : 0 \le x \le 1,\ 0 \le y \le 1\}$ be the unit square. It is required to solve on Ω the second order partial differential equation
\[ \frac{\partial^2\phi}{\partial x^2} + \frac{\partial^2\phi}{\partial y^2} + 8\pi^2\phi = 0, \quad \forall (x, y) \in \Omega, \]
\[ \phi(x, y) = 0, \quad \forall (x, y) \in \partial\Omega. \]
(a) Show that $\phi(x, y) = \sin(2\pi x)\sin(2\pi y)$ is an exact solution of the given problem.
(b) By using central finite differences to approximate the derivatives, and employing a grid of 1000 points in each coordinate direction, write computer programmes to solve the problem by iteration using
i. the Jacobi iteration,
ii. the Gauss-Seidel iteration,
iii. the SOR iteration.
Record how many iterations are required to obtain a solution correct to 6 decimal places for each iteration scheme. For the SOR scheme, which is the best value of the relaxation parameter? Show that if ω is the relaxation parameter, then $\omega_{best}$ lies between 1 and 2.
Chapter 5
5.1 Introduction
We seek to find numbers $\lambda_i$ and vectors $x_i$ such that
\[ Ax_i = \lambda_ix_i. \qquad (5.1) \]
This process has many practical applications. For example, vibration analysis leads to a
symmetric A because of a vibrational energy principle. The x will represent the shape of
the vibrational mode and λ is related to the frequency.
Definition 31 We say $\lambda \in \mathbb{C}$ is an eigenvalue of the n × n matrix A if there exists $y \ne 0$ such that
\[ Ay = \lambda y, \quad y \in \mathbb{R}^n, \qquad (5.2) \]
or, equivalently,
\[ (A - \lambda I)y = 0, \quad y \in \mathbb{R}^n. \qquad (5.3) \]
Consider the sequence of vectors
\[ z, Az, A^2z, \cdots, A^nz, \cdots \qquad (5.4) \]
where A is an n × n matrix. Suppose that the vector $z = \sum_{k=1}^n c_ky_k$, where each $y_k$ is an eigenvector of A with associated eigenvalue $\lambda_k$; then the terms of the sequence may be written as
\[ A^mz = \sum_{k=1}^n c_k\lambda_k^my_k. \qquad (5.5) \]
Hence the behaviour of the sequence (5.4) may be viewed in terms of the behaviour of the sequences $(\lambda_k^m)_{m\ge0}$. In particular, $\lim_{m\to\infty}A^mz = 0$ if $|\lambda_k| < 1$ for all k. We can order the eigenvalues so that $|\lambda_1| \ge |\lambda_2| \ge \cdots \ge |\lambda_n|$. The eigenvalues are the roots of the characteristic equation
\[ |\lambda I - A| = 0. \qquad (5.8) \]
When this is multiplied out, we get an nth degree polynomial equation for λ, so there are n roots. These may, of course, not all be distinct. For each there will be a non-zero eigenvector x (by the Fredholm alternative theorem). It is well known that for Hermitian matrices all the eigenvalues are real. We write ∗ to denote the complex conjugate transpose and use $A^* = A$. Then $Ax = \lambda x$ gives $\lambda = x^*Ax/x^*x$. Also, $(x^*Ax)^* = x^*Ax$, so $x^*Ax$ is real; hence λ is real, being the ratio of reals. For a real symmetric matrix, x is then real because $A - \lambda I$ is real.
This also shows that we can find λ, given x, using the Rayleigh quotient
\[ \lambda = \frac{x^TAx}{x^Tx}, \qquad (5.10) \]
for
\[ \frac{x^TAx}{x^Tx} = \frac{\lambda\|x\|_2^2}{\|x\|_2^2} = \lambda. \qquad (5.11) \]
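The Rayleigh quotient is commonly paired with the power method (one of the research topics in the exercises below). A minimal Python sketch, on an illustrative symmetric 2 × 2 matrix:

    import numpy as np

    def rayleigh_quotient(A, x):
        # lambda = x^T A x / x^T x, exact when x is an eigenvector
        return (x @ A @ x) / (x @ x)

    A = np.array([[2.0, 1.0], [1.0, 3.0]])
    x = np.array([1.0, 0.0])
    for _ in range(50):
        x = A @ x
        x /= np.linalg.norm(x)      # normalise to avoid overflow
    print(rayleigh_quotient(A, x))  # close to the largest eigenvalue
    print(max(np.linalg.eigvalsh(A)))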
In the same way, if $(\lambda_1, x_1)$ and $(\lambda_2, x_2)$ are eigensolutions with $\lambda_2 \ne \lambda_1$, we have $x_2^TAx_1 = \lambda_1x_2^Tx_1$ and $x_1^TAx_2 = \lambda_2x_1^Tx_2$. So, subtracting and using the symmetry of A,
\[ (\lambda_1 - \lambda_2)x_2^Tx_1 = 0. \]
Since $\lambda_1 \ne \lambda_2$, we must have $x_2^Tx_1 = 0$, and so $x_2$ and $x_1$ are orthogonal. If they are normalised so that $x_2^Tx_2 = \|x_2\|_2^2 = 1$, $x_1^Tx_1 = \|x_1\|_2^2 = 1$, they are called orthonormal.
Definition 32 (orthogonal matrix) A matrix whose rows (columns) are orthonormal
is said to be an orthogonal matrix (unitary in the Hermitian case). Such a matrix X,
satisfies
\[ X^TX = XX^T = I, \quad \text{so that } X^{-1} = X^T. \]
Clearly,
\[ \|X\|_2^2 = \max_{\|x\|_2=1}|x^TX^TXx| = 1. \qquad (5.13) \]
5.2 Exercises
Research and present a write up about the topics indicated and also answer the questions
indicated
1. Numerical methods for estimating eigenvalues and eigenvectors
(a) The power method
(b) The inverse power method
(c) Similarity transform methods
(d) Householder reflections
(e) Givens method
2. Localization of eigenvalues: State Gershgorin's disc theorems about the localization of the eigenvalues of an n × n matrix $A = (a_{ij})$. Let D be another n × n matrix and $\|\cdot\|$ a matrix norm. Prove that if λ is an eigenvalue of A but not of D, then $\|(\lambda I_n - D)^{-1}(A - D)\| \ge 1$, where $I_n$ is the identity matrix of size n. Employ this inequality, with $D = \mathrm{diag}(a_1, a_2, \cdots, a_n)$ and the matrix norm $\|\cdot\|_\infty$, to prove Gershgorin's first theorem. Let q be a real number, q > 2, and $A = (a_{ij})$ a 3 × 3 matrix with $a_{ij} = q^{-|i-j|}$, $i, j = 1, 2, 3$. Determine the Gershgorin disks of the matrix A and show that 0 is not an eigenvalue of A.
3. State and prove Cauchy's interlace theorem for the localization of eigenvalues.
Chapter 6
Polynomial Interpolation
A remedy for loss of significance is to use the shifted power form of the polynomial.
Definition 36 (Shifted Power Form of the Polynomial) The shifted power form of
the polynomial,Pn , takes the form
Pn (x) = a0 + a1 (x − c) + a2 (x − c)2 + a3 (x − c)3 + · · · + an (x − c)n (6.2)
where c, called the centre of the power form, is a fixed constant.
Clearly using the binomial theorem to expand all the powers out and rearranging we can
rewrite the shifted power form of the polynomial (6.2) in the power form (6.1). More
specifically, if c = 0, we have the power form of the polynomial. Thus the shifted power
form is a generalization of the power form. It is good practice to employ the shifted
power form with centre c chosen somewhere in the interval [a, b] when interested in a
polynomial in that interval. How does the form improve on our calculations using the
five-decimal-digit floating-point arithmetic of Example 14?
Example 15 If we choose the centre c = 6000 and rework the linear polynomial of Ex-
ample 14, then P1 (x) = a0 + a1 (x − 6000) and using five-decimal-digit floating-point
arithmetic demanding that P1 (6000) = 1/3 = 0.33333 and P1 (6001) = −2/3 = −0.66667,
we have P1 (x) = 0.33333 − (x − 6000), so that evaluating in the same arithmetic gives
P1 (6000) = 0.33333 and P1 (6001) = −0.66667, and the values are as correct as five digits
can make them. We shall see below that the coefficients in the shifted power form provide derivative values $a_i = P_n^{(i)}(c)/i!$, $i = 0, 1, 2, \cdots, n$, when $P_n$ is of the form (6.2).
A further generalization of the shifted power form is the Newton form of the Polynomial.
Definition 37 (Newton Form of the Polynomial) The Newton form of the polyno-
mial, Pn , takes the form
Pn (x) = a0 + a1 (x − c1 ) + a2 (x − c1 )(x − c2 ) + a3 (x − c1 )(x − c2 )(x − c3 ) + · · ·
+an (x − c1 )(x − c2 )(x − c3 ) · · · (x − cn−1 )(x − cn ) (6.3)
where we now have n centres c1 , c2 , · · · , cn , instead of just one centre, as in the shifted
power form.
It is good practice to choose the centres c1 , c2 , · · · , cn somewhere in the interval [a, b] if we
are interested in a polynomial in this interval. Again, the Newton form of the polynomial
reduces to the shifted power form if c1 = c2 = · · · = cn = c, and to the power form if the
centres c1 , c2 , · · · , cn all equal to zero.
It is inefficient to evaluate each of the n + 1 terms in (6.3) separately and then sum.
This would take n + n(n + 1)/2 additions and n(n + 1)/2 multiplications. Instead, one
notices that the factor x − c1 occurs in all terms but the first; that is,
Pn (x) = a0 + (x − c1 ) {a1 + a2 (x − c2 ) + a3 (x − c2 )(x − c3 )
+ · · · + an (x − c2 )(x − c3 ) · · · (x − cn )} .
Again, each term between the braces but the first contains the factor (x − c2 ); that is,
Pn (x) = a0 + (x − c1 ) {a1 + (x − c2 ) [a2 + a3 (x − c3 ) + · · · + an (x − c3 ) · · · (x − cn )]}
Continuing we get the nested form of the Newton Polynomial.
Definition 38 (Nested Form of the Newton Polynomial) The nested form of the Newton polynomial takes the form
\[ P_n(x) = a_0 + (x - c_1)\Big(a_1 + (x - c_2)\big(a_2 + \cdots + (x - c_{n-1})(a_{n-1} + (x - c_n)a_n)\cdots\big)\Big), \]
whose evaluation for any particular value of x takes 2n additions and n multiplications. The efficiency of the evaluation in the nested form of the Newton polynomial comes from the fact that we start with the innermost bracket and evaluate outward with very little loss of significance.
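A short Python sketch of nested evaluation (the function name is illustrative; with all centres zero it reduces to Horner's rule for the power form):

    def newton_eval(x, coeffs, centres):
        # coeffs  = [a0, a1, ..., an], centres = [c1, c2, ..., cn]
        # n multiplications and 2n additions/subtractions in total
        p = coeffs[-1]
        for a, c in zip(reversed(coeffs[:-1]), reversed(centres)):
            p = a + (x - c) * p    # work outward from the innermost bracket
        return p

    # Example: 1 + 2x written with centres c1 = c2 = 0 (the power form).
    print(newton_eval(4.0, [1.0, 2.0, 0.0], [0.0, 0.0]))   # 9.0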
More sophisticated remedies against loss of significance are offered by representation in Chebyshev polynomials and other orthogonal polynomials. The problem of polynomial representation using orthogonal polynomials is also very appealing for scientific work. We note that we can transform polynomials from one form to the other using a nested multiplication algorithm. For example, the polynomial $p_n(x) = 1 + 2x$ interpolates the points (0, 1), (2, 5), (4, 9), (5, 11) on the interval [0, 5].
Two important reasons for approximating functions with polynomials are that (i)
polynomials are easy to use and (ii) polynomials can be used to provide very good ap-
proximations for functions in C([a, b]; R). The first reason is obvious; e.g., if we wish to integrate or differentiate a complicated function, then we may instead differentiate or integrate its corresponding approximating polynomial. The second reason is provided by
the following theorem of Weierstrass.
Theorem 21 (Weierstrass Approximation Theorem) Let $f \in C([a, b]; \mathbb{R})$. For every $\varepsilon > 0$, there exists a polynomial $p_n$ of degree $n(\varepsilon) \in \mathbb{N}$ such that $\|f - p_n\|_\infty < \varepsilon$.

Proof: Exercise. The Weierstrass approximation theorem states that any continuous function on a finite interval [a, b] may be uniformly approximated by some polynomial.
Given distinct points $x_0, x_1, \cdots, x_n$, let $\pi_{n,k}(x) = \prod_{r\ne k}(x - x_r)$ and define
\[ L_{n,k}(x) = \frac{\pi_{n,k}(x)}{\pi_{n,k}(x_k)}, \quad k = 0, 1, 2, \cdots, n. \qquad (6.9) \]
Each $L_{n,k}(x)$ is a polynomial of degree n, called a Lagrange polynomial, and they have the property that
\[ L_{n,k}(x_r) = \delta_{r,k} = \begin{cases} 1 & r = k \\ 0 & r \ne k \end{cases} \]
Theorem 22 (Lagrange) Given distinct points $x_k$, $k = 0, 1, 2, \cdots, n$, and numbers $y_k$, $k = 0, 1, 2, \cdots, n$, there exists a unique polynomial $p_n(x)$ of degree at most n which satisfies $p_n(x_r) = y_r$, $r = 0, 1, 2, \cdots, n$.

Proof: Existence: the polynomial exists since, by construction,
\[ p_n(x) = \sum_{r=0}^n y_rL_{n,r}(x). \]
Proof: Existence and uniqueness follow from Theorem 22. Since $\pi_n(x_r) = 0$, the relation is satisfied trivially for $x = x_r$, $r = 0, 1, 2, \cdots, n$. Let x be distinct from any $x_r$ and define $\phi(t) = f(t) - p_n(t) - K(x)\pi_n(t)$, where K(x) is independent of t. Then $\phi(x_r) = 0$, $r = 0, 1, 2, \cdots, n$. Let
\[ K(x) = \frac{f(x) - p_n(x)}{\pi_n(x)}, \]
so that $\phi(x) = 0$. Hence φ vanishes at the n + 2 points $x_0, x_1, \cdots, x_n, x$ in the interval [a, b]. Hence, by Rolle's theorem, $\phi'$ must vanish at n + 1 distinct points in [a, b]. Hence, by induction, there exists $\eta \in [a, b]$ such that $\phi^{(n+1)}(\eta) = 0$. But $p_n$ is a polynomial of degree n by construction. So $f^{(n+1)}(\eta) - K(x)(n + 1)! = 0$, that is,
\[ e_n(x) = f(x) - p_n(x) = \frac{\pi_n(x)f^{(n+1)}(\eta)}{(n + 1)!}. \]
Here $e_n(x)$ is the error in the approximating polynomial. Suppose $|f^{(n+1)}(x)| \le M$ for all $x \in [a, b]$; then
\[ \|e_n\|_p \le \frac{M}{(n + 1)!}\|\pi_n\|_p, \]
where $\|f\|_p$ is defined by (2.1).
Remark 7 The interval [a, b] can be mapped onto the interval [−1, 1] by a continuous mapping. Hence, without loss of generality, we can consider the interval [−1, 1]. If the points $x_r$ are equally spaced in the interval, then $\pi_n(x)$ has n + 1 roots in the interval and so oscillates n + 2 times (counting local extrema at end points).

So we may pose the following question: can a judicious choice of interpolation points improve the behaviour of $p_n$? Note that $\cos((n+1)\theta)$, $\theta \in [0, \pi]$, has n + 1 roots in [0, π] and oscillates n + 2 times between ±1. This is similar to the behaviour of $\pi_n$ in [−1, 1] with equally spaced points. So if we can find a mapping that relates x and θ, and has the property that $\cos((n + 1)\theta)$ is a finite polynomial in x, then we could use this function. This is Chebyshev's idea. The mapping is $x = \cos(\theta)$. So we define the Chebyshev polynomials on [−1, 1] by
\[ T_n(x) = \cos(n\theta), \quad \text{where } x = \cos(\theta), \]
so that, for example, $T_0(x) = 1$, $T_1(x) = x$, and $T_{n+1}(x) = 2xT_n(x) - T_{n-1}(x)$.
Notes 5 We note the following properties of the Chebyshev polynomials:

1. $|T_n(x)| \le 1$ for all $x \in [-1, 1]$.

2. $T_n(x) = 2^{n-1}x^n + \cdots$, $n > 0$.

3. $T_{n+1}(x) = 0$ at $\theta_r = \frac{(r + \frac{1}{2})\pi}{n + 1}$, $r = 0, 1, 2, \cdots, n$, that is, at $x_r = \cos\frac{(r + \frac{1}{2})\pi}{n + 1}$, $r = 0, 1, 2, \cdots, n$; so if we choose these values as interpolation points, $\pi_n(x) = 2^{-n}T_{n+1}(x)$.

4. Since $|T_{n+1}| \le 1$, $\pi_n$ must oscillate between $\pm2^{-n}$, and hence we have
\[ \|e_n\|_\infty \le \frac{M}{2^n(n + 1)!}. \]
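The effect of the Chebyshev choice of points is easy to observe numerically. A Python sketch (assuming NumPy; n = 8 is an arbitrary illustration) comparing $\max|\pi_n|$ for equally spaced and Chebyshev points:

    import numpy as np

    def pi_n(x, nodes):
        # pi_n(x) = product of (x - x_r) over the interpolation points
        return np.prod([x - r for r in nodes], axis=0)

    n = 8
    xs = np.linspace(-1.0, 1.0, 2001)
    equi = np.linspace(-1.0, 1.0, n + 1)
    cheb = np.cos((np.arange(n + 1) + 0.5) * np.pi / (n + 1))  # zeros of T_{n+1}

    print(np.max(np.abs(pi_n(xs, equi))))   # noticeably larger
    print(np.max(np.abs(pi_n(xs, cheb))))   # ~ 2**(-n) = 0.00390625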
6.2.1 Divided Differences
One drawback with the Lagrange interpolating polynomial is that should an extra point
be added to the set {xr , yr }, it is necessary to recompute all the Lagrange polynomials.
Definition 41 (Divided differences) Let $p_n$ be the polynomial which interpolates f at the points $x_i$, $i = 0, 1, 2, \cdots, n$, and define the n-th divided difference $f[x_0, x_1, \cdots, x_n]$ to be the coefficient of $x^n$ in $p_n$. Define $f[x_i] = f(x_i)$, $i = 0, 1, 2, \cdots, n$. From the definition of the Lagrange form of the interpolating polynomial, for $n \ge 1$, we have
\[ f[x_0, x_1, \cdots, x_n] = \sum_{k=0}^n\left(\frac{f(x_k)}{\prod_{r\ne k}(x_k - x_r)}\right). \qquad (6.12) \]
6.3 Newton’s Form of the Interpolating polynomial
Corollary 6 (Newton Interpolation) The Lagrange interpolation polynomial can be represented in terms of divided differences by
\[ p_n(x) = \sum_{j=0}^n f[x_0, x_1, \cdots, x_j]\left(\prod_{r=0}^{j-1}(x - x_r)\right). \qquad (6.15) \]
Furthermore,
\[ f[x_0, x_1, \cdots, x_n] = \frac{f[x_1, \cdots, x_n] - f[x_0, x_1, \cdots, x_{n-1}]}{x_n - x_0}. \qquad (6.16) \]
If, in addition to the requirement that a polynomial take given values at the points of a discrete set, we ask for the derivative to be specified as well, then the method is called Hermite polynomial approximation.
Theorem 26 (Hermite Interpolation) Suppose $f \in C^{2n+2}([a, b]; \mathbb{R})$ and that $(x_0, f(x_0)), (x_1, f(x_1)), \cdots, (x_n, f(x_n))$ are n + 1 distinct points with $x_i \in [a, b]$, $i = 0, 1, 2, \cdots, n$. Then there exists a unique polynomial $p_{2n+1}$, of degree at most 2n + 1, that satisfies $p_{2n+1}(x_r) = f(x_r)$, $p'_{2n+1}(x_r) = f'(x_r)$, $r = 0, 1, 2, \cdots, n$, and, for all $x \in [a, b]$, there exists $\eta \in (a, b)$ such that
\[ f(x) = p_{2n+1}(x) + \frac{[\pi_n(x)]^2f^{(2n+2)}(\eta)}{(2n + 2)!}. \]
Proof: Define
\[ H_r(x) = \left[1 - 2(x - x_r)L'_{n,r}(x_r)\right]L_{n,r}^2(x), \qquad K_r(x) = (x - x_r)L_{n,r}^2(x), \qquad (6.17) \]
then write
\[ p_{2n+1}(x) = \sum_{r=0}^n\left[f(x_r)H_r(x) + f'(x_r)K_r(x)\right]. \qquad (6.18) \]
It is then easy to verify that $p_{2n+1}$ so constructed is a polynomial of degree at most 2n + 1, that $p_{2n+1}(x_r) = f(x_r)$ and $p'_{2n+1}(x_r) = f'(x_r)$, and finally that the polynomial so constructed is unique. Full details are left as an exercise.
6.3.1 Exercises
1. We are faced with the issue of polynomial interpolation in the $L^\infty$ norm. In particular, we wish to handle the following problem: given $u \in C[a, b]$, if $\mathcal{P}_n$ is the set of all polynomials of degree less than or equal to n, how do we choose $p_n \in \mathcal{P}_n$ such that $\|u - p_n\|_\infty$ is as small as possible?

(i) Discuss the problem of existence, uniqueness and accuracy of the polynomial approximation in the $L^\infty$ norm, paying particular attention to the Weierstrass theorem and the associated proof using Bernstein polynomials, and to uniform or minimax polynomial approximation (the oscillation theorem, the de la Vallée Poussin theorem, etc.).

(ii) Also discuss the problem of existence, uniqueness and accuracy of the polynomial approximation in the $L^2$ norm, paying particular attention to the approximation by orthogonal polynomials and least squares approximation.

(iii) Now find the zero and first order polynomials $P_0(x)$ and $P_1(x)$ which are orthogonal in the sense that $\int_0^1P_i(x)P_j(x)dx = \delta_{i,j}$, where $\delta_{i,j}$ is the Kronecker delta. Use $P_0$ and $P_1$ to find the approximation $f(x) = \alpha P_0(x) + \beta P_1(x)$ which minimizes $\int_0^1(x^2 - f(x))^2dx$.
2. Divided differences and divided difference tables: The Newton form of the interpolating polynomial may be written in the form
\[ p_n(x) = f[x_0] + f[x_0, x_1](x - x_0) + \cdots + f[x_0, x_1, \cdots, x_n](x - x_0)(x - x_1)\cdots(x - x_{n-1}), \]
where
\[ f[x_0] = f(x_0), \qquad f[x_0, x_1] = \frac{f(x_1) - f(x_0)}{x_1 - x_0}, \]
\[ f[x_0, x_1, \cdots, x_k] = \frac{f[x_1, x_2, \cdots, x_k] - f[x_0, x_1, \cdots, x_{k-1}]}{x_k - x_0}. \]
The coefficients $f[x_0, x_1, \cdots, x_k]$ may be calculated from the following algorithm, which displays the results in the form of a table, as shown in Table 6.1, called the divided difference table.

Algorithm: The divided difference table. Given the first two columns of the table, containing $x_0, x_1, \cdots, x_n$ and the corresponding $f[x_0], f[x_1], \cdots, f[x_n]$:
For k = 1, 2, · · · , n do;
  For i = 0, 1, · · · , n − k do;
    $f[x_i, x_{i+1}, \cdots, x_{i+k}] := \dfrac{f[x_{i+1}, x_{i+2}, \cdots, x_{i+k}] - f[x_i, x_{i+1}, \cdots, x_{i+k-1}]}{x_{i+k} - x_i}$
  end do i
end do k
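A direct transcription of this algorithm into Python might read as follows (a sketch; the data values are illustrative only):

    def divided_difference_table(xs, fs):
        # column k holds f[x_i, ..., x_{i+k}] for i = 0, ..., n-k
        n = len(xs) - 1
        table = [list(fs)]                   # column 0: f[x_i] = f(x_i)
        for k in range(1, n + 1):
            prev = table[k - 1]
            col = [(prev[i + 1] - prev[i]) / (xs[i + k] - xs[i])
                   for i in range(n - k + 1)]
            table.append(col)
        return table

    xs = [0.0, 1.0, 3.0]
    fs = [1.0, 2.0, 10.0]
    for col in divided_difference_table(xs, fs):
        print(col)
    # The Newton coefficients f[x0], f[x0,x1], f[x0,x1,x2] are the first
    # entries of the successive columns.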
The results would then be displayed on a table like the following, from which the
coefficients can be read off. Now answer the following questions:
(i) Find the polynomial of degree ≤ 2 which satisfies $P_2(1) = 1.5709$, $P_2(4) = 1.5724$, $P_2(6) = 1.5751$. Hint: calculate $f[1, 4]$, $f[4, 6]$, $f[1, 4, 6]$.
Table 6.1: The divided difference Table
(ii) Values of a function, f , where f = {(0, 1), (1, 5), (2, 31), (3, 121), (4, 341)},
are those of a certain polynomial of degree less than or equal to 4. Find the
polynomial from a difference table, or otherwise, and then estimate f (5).
(iii) An integral related to the complete elliptic integral is defined by
\[ K(\kappa) = \int_0^{\pi/2}\frac{dx}{\sqrt{1 - \sin^2\kappa\sin^2x}}. \]
From tables we find that K(1) = 1.5709, K(4) = 1.5727, K(6) = 1.5751. Find K(3.5) using a second degree Lagrange polynomial.
3. It can be shown that the Hermite polynomials of order k, $H_k(x)$, satisfy the recurrence relation $H_{k+1}(x) = 2xH_k(x) - 2kH_{k-1}(x)$, with $H_0(x) = 1$ and $H_1(x) = 2x$. Find the first 4 Hermite polynomials and find the zeroes of $H_2$, $H_3$ and $H_4$.
Chapter 7
It is clear that the triple set (xl , xm , xr ) will vary over the set of partition points (x0 , x1 , x2 , · · · , xn )
which for the sake of the second degree polynomial interpolation will be taken three points
at a time. For example, for the triple $(x_5, x_6, x_7)$, $x_l = x_5$, $x_m = x_6$ and $x_r = x_7$. Thus, to find the approximate first derivative of a function y at $x_m$, we may choose to use the second degree polynomial $p_2$ and differentiate to have
point. For example by evaluating the derivative of the approximating polynomial at xr
on the equally spaced mesh points with stepsize h we have
and thus obtain the approximation to the integral I by integrating the corresponding approximating polynomial. If, for example, the Lagrange form of the interpolating polynomial is used, then we have
\[ \int_a^bf(x)dx = \sum_{r=0}^nW_rf(x_r) + R, \quad \text{where } W_r = \int_a^bL_{n,r}(x)dx. \qquad (7.4) \]
Suppose $p_2$ is the second degree Newton interpolating polynomial based on the points $a$, $\frac{a+b}{2}$, $b$, so that $f[a]$, $f[a, \frac{a+b}{2}]$ and $f[a, \frac{a+b}{2}, b]$ are the divided difference coefficients in the Newton form of the interpolating polynomial. Then
\[ \int_a^bf(x)dx \approx \int_a^bp_2(x)dx \]
\[ = \int_a^b\left\{f[a] + f[a, \tfrac{a+b}{2}](x - a) + f[a, \tfrac{a+b}{2}, b](x - a)(x - \tfrac{a+b}{2})\right\}dx \]
\[ = \frac{b - a}{6}\left\{f(a) + 4f\left(\tfrac{a+b}{2}\right) + f(b)\right\}. \]
Thus the second degree Newton interpolation polynomial leads to Simpson's rule for integration.
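A one-panel version of the rule just derived, as a Python sketch:

    import math

    def simpson(f, a, b):
        # (b - a)/6 * ( f(a) + 4 f((a+b)/2) + f(b) )
        return (b - a) / 6.0 * (f(a) + 4.0 * f(0.5 * (a + b)) + f(b))

    # sin on [0, pi]: exact integral is 2; one Simpson panel gives ~2.0944
    print(simpson(math.sin, 0.0, math.pi))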
We can use polynomial interpolation for other purposes. See Chapter 7 of reference [1] and Chapter 6 of reference [3] for more.
7.2.1 Exercises
1. If $\psi(x) = (x - a)(x - \frac{a+b}{2})(x - b)$, verify by direct integration that $\int_a^b\psi(x)dx = 0$. How does this result affect the interpretation of the definite integral as the area under a graph of f? Can you comment?

2. Five rules for approximating the definite integral $I(f) = \int_a^bf(x)dx$ are the following:
(i) For a given step size h and a given differentiable function f, define
\[ \text{(i)}\; f'(\xi) \approx \frac{f(\xi + h) - f(\xi)}{h}, \qquad \text{(ii)}\; f'(\xi) \approx \frac{f(\xi + h) - f(\xi - h)}{2h}, \]
\[ \text{(iii)}\; f''(\xi) \approx \frac{2f(\xi) - 4f(\xi + h) + 2f(\xi + 2h)}{2h^2}, \qquad \text{(iv)}\; f''(\xi) \approx \frac{f(\xi - h) - 2f(\xi) + f(\xi + h)}{h^2}. \]
For $f(x) = \sinh(x)$, calculate $f'(1.4)$ and $f''(1.4)$ and compare your results with the exact answers. Estimate the errors analytically.
(ii) Let $f(x) = \ln(x)$, $x > 0$. On your computer, calculate the sequence of numbers $a_n = f[2 - 2^{-n}, 2 + 2^{-n}]$ (where $f[\cdot, \cdot]$ is the divided difference). Without the effect of rounding errors, $\lim_{n\to\infty}a_n = f'(2) = 0.5$. What really happens on the machine? Please include a segment of the code which you have used to do your calculations and also give a printout of your results for large n.
Chapter 8
8.1 Introduction
The aim of this chapter is to determine methods of approximating solutions of single nonlinear equations. The problem can be posed in many guises: for a given nonlinear function $f : \mathbb{R} \to \mathbb{R}$, find $s \in \mathbb{R}$ such that $f(s) = 0$. Here we illustrate the common methods and address the issues of convergence of the schemes.
4. Verify: if $f(c)f(a) < 0$, take $[a, c] \subset I$ as the new and better estimate of the interval containing the zero of f; else, if $f(c)f(b) < 0$, take $[c, b] \subset I$ as the new approximating interval.
8.2.1 High points for the bisection method

1. The method is guaranteed to converge.

2. The sought-after root is known to lie in an interval of length $2^{-n}(b - a)$ after n steps of the method.

3. After several evaluations of the function, we know and have a lot of information about f near the sought-after zero, but still only use the last calculated value.
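The whole method fits in a few lines of Python (a sketch; the test function is illustrative):

    def bisect(f, a, b, n_steps):
        # assumes f(a) * f(b) < 0; after n steps the root lies in an
        # interval of length (b - a) / 2**n
        assert f(a) * f(b) < 0
        for _ in range(n_steps):
            c = 0.5 * (a + b)
            if f(a) * f(c) < 0:
                b = c          # zero lies in [a, c]
            else:
                a = c          # zero lies in [c, b]
        return 0.5 * (a + b)

    print(bisect(lambda x: x * x - 2.0, 1.0, 2.0, 40))   # ~ sqrt(2)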
Theorem 27 (Fixed point theorem) Let $f \in C([a, b]; [a, b])$ be an injective mapping such that $f(x) \in [a, b]$ for all $x \in [a, b]$. Then there exists s in [a, b] such that $f(s) = s$.

Proof: Given that f is a continuous injective map, the image of [a, b] under f is a subset of [a, b]. Thus $f(a) \ge a$ and $f(b) \le b$. Now, if $f(a) = a$ or $f(b) = b$, the result holds, as we can take s = a or s = b. Suppose, otherwise, that $a < f(x) < b$ for all x in [a, b], that is, $f(x) \ne a$ and $f(x) \ne b$, so that $f(a) > a$ and $f(b) < b$. Consider $g : x \mapsto f(x) - x$. Clearly, g is continuous on [a, b] with $g(a) = f(a) - a > 0$ and $g(b) = f(b) - b < 0$, so $g(a)g(b) < 0$, and so, by the Intermediate Value Theorem, there exists $s \in\ ]a, b[$ such that $g(s) = 0$, that is, $f(s) = s$.
Proof: Geometrically, if the slope of g is greater than unity near the fixed point, then the sequence will diverge. Since $|g'(s)| > 1$, let $M = \frac{1}{2}(1 + |g'(s)|)$, so that $1 < M < |g'(s)|$. Then there exists $\rho > 0$ such that $|g'(x)| > M$ for $|x - s| \le \rho$. Then we can apply the mean value theorem by noting that if $|x_n - s| \le \rho$, there exists $\eta \in [x_n, s]$ such that
\[ x_{n+1} - s = g(x_n) - g(s) = g'(\eta)(x_n - s). \]
Hence,
\[ |x_{n+1} - s| \ge M|x_n - s| \ge |x_n - s|, \]
and the sequence is diverging away from s.
Definition (Contraction) A function $f : I \to \mathbb{R}$ is a contraction on the interval I if
(i) f maps I into itself, and
(ii) there exists L with $0 < L < 1$ such that for all $x, y \in I$, $|f(x) - f(y)| \le L|x - y|$.

Remark 8 (a) Condition (ii) in the definition is called a Lipschitz condition, and L is called a Lipschitz constant.
(b) A function $f : I \to \mathbb{R}$ that satisfies the Lipschitz condition on I (for any L > 0) is said to be Lipschitz continuous with Lipschitz constant L.

If s and t are both fixed points of a contraction f with $s \ne t$, then $|s - t| = |f(s) - f(t)| \le L|s - t| < |s - t|$, contradicting the fact that $0 < L < 1$. So the fixed point must be unique.
Corollary 7 Suppose that f satisfies the Lipschitz condition on the interval $[x_0 - \rho, x_0 + \rho]$ with Lipschitz constant $L < 1$, and that $|x_0 - f(x_0)| \le (1 - L)\rho$. Then the sequence $(x_n)_{n\ge0}$ generated by $x_{n+1} = f(x_n)$ converges to a fixed point $s \in [x_0 - \rho, x_0 + \rho]$.

Proof: One checks that f maps the interval $[x_0 - \rho, x_0 + \rho]$ into itself, and the result follows from Theorem 27.
Corollary 8 (Error bounds) Suppose f satisfies the Lipschitz condition on the interval $[x_0 - \rho, x_0 + \rho]$ with Lipschitz constant $L < 1$ and that $|x_0 - f(x_0)| \le (1 - L)\rho$. Then the sequence $(x_n)_{n\ge1}$ defined by $x_1 = f(x_0)$, $x_{n+1} = f(x_n)$, satisfies
\[ \text{(i)}\; |x_0 - s| \le \frac{1}{1 - L}|x_1 - x_0|, \]
\[ \text{(ii)}\; |x_n - s| \le \frac{L^n}{1 - L}|x_1 - x_0|. \]
Proof:

(i) Since s is a fixed point of f, we have
\[ |x_0 - s| = |x_0 - f(x_0) + f(x_0) - s| \le |x_0 - f(x_0)| + |f(x_0) - f(s)| \le |x_0 - x_1| + L|x_0 - s|, \]
so that
\[ |x_0 - s| \le \frac{1}{1 - L}|x_1 - x_0|. \]

(ii) The Lipschitz condition shows that
\[ |x_{n+1} - x_n| \le L|x_n - x_{n-1}| \le L^2|x_{n-1} - x_{n-2}| \le \cdots \le L^n|x_1 - x_0|. \]
Therefore, for m > n,
\[ |x_m - x_n| \le |x_m - x_{m-1}| + |x_{m-1} - x_{m-2}| + \cdots + |x_{n+1} - x_n| \le L^n(L^{m-n-1} + \cdots + L + 1)|x_1 - x_0| = \frac{L^n(1 - L^{m-n})}{1 - L}|x_1 - x_0|. \]
Now let m → ∞ so that xm → s to establish (ii).
Example 16 Find the real root of the equation $x^3 - 2x^2 - x - 2 = 0$.

(a) By rewriting the equation in the form $x = g(x)$, where $g(x) = 2 + x^{-1} + 2x^{-2}$, show that the contraction mapping theorem works for g in the interval [2, z], z ≥ 3, and calculate the Lipschitz constant. Start from a value of x in this interval and calculate $x_4 - s$. Compare with the exact value of the error.

(b) Repeat with $g(x) = \sqrt[3]{2x^2 + x + 2}$.

Solution: The real solution is
\[ x = \frac{1}{3}\left(2 + \sqrt[3]{44 - 3\sqrt{177}} + \sqrt[3]{44 + 3\sqrt{177}}\right) = 2.65897, \]
correct to five decimal places.

(a) From the equation $x^3 - 2x^2 - x - 2 = 0$, we have $x^3 = 2x^2 + x + 2 \Rightarrow x = 2 + x^{-1} + 2x^{-2}$, yielding the equation $x = g(x)$, where $g(x) = 2 + x^{-1} + 2x^{-2}$. Observe that g is continuous and differentiable on the interval [2, z], z ≥ 3. Thus for all $p, q \in [2, z]$, there exists $\eta \in (2, z)$ such that $g(p) - g(q) = g'(\eta)(p - q)$, by the mean value theorem. In particular, $|g(p) - g(q)| = |g'(\eta)||p - q|$. Here $|g'(x)| = |-x^{-2} - 4x^{-3}| \le \frac{3}{4}$ for all $x \in [2, z]$, z ≥ 3. Thus L, the Lipschitz constant, is $\frac{3}{4}$. Following the error estimates from Corollary 8, we have $|x_4 - s| \le \frac{L^4}{1-L}|x_1 - x_0|$. Let $x_0 = 3$; then $x_1 = g(x_0) = 23/9 = 2.55556$. Therefore $|x_1 - x_0| = 0.44444$, and $\frac{L^4}{1-L}|x_1 - x_0| = 0.5625$. The exact absolute error after 4 steps is $|2.65897 - g(g(g(g(3))))| = 0.00478166$. So the bound holds, and the iteration does far better than the bound suggests.
(b) Similar
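As a quick numerical check of part (a), a short Python sketch (s below is the five-decimal root quoted above):

    def g(x):
        return 2.0 + 1.0 / x + 2.0 / x**2

    s = 2.65897                  # root correct to five decimal places
    x = 3.0
    for n in range(1, 5):
        x = g(x)
        print(n, x, abs(x - s))
    # |x4 - s| ~ 0.0048, well inside the bound L^4/(1-L)*|x1-x0| = 0.5625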
1. Linear Convergence: The sequence $(x_n)_{n\ge1}$ converges linearly to s with rate R if $x_n \to s$ as $n \to \infty$ and
\[ \text{(a)}\; \frac{x_{n+1} - s}{x_n - s} \to r \text{ as } n \to \infty, \qquad \text{(b)}\; R = -\log_e|r|. \]
Clearly, for convergence we require $|r| < 1$, and the larger R (the smaller r), the faster the convergence.
Definition 45 (Relaxation Process) Suppose there exists an interval [a, b] such that $\mathrm{sgn}(f(a)) \ne \mathrm{sgn}(f(b))$ and $f'(x)$ does not vanish on the interval [a, b]. Then we define the relaxation process by
\[ x_{n+1} = x_n - \theta f(x_n), \quad n = 0, 1, 2, \cdots, \]
for some suitably chosen relaxation constant θ. If the sequence $(x_n)_{n\ge1}$ converges to a limit s, then $f(s) = 0$. So for the relaxation process, the mapping g is $g(x) = x - \theta f(x)$.
Theorem 30 (Relaxation Process) Suppose $f \in C^1([a, b]; \mathbb{R})$ and that $f(a)f(b) < 0$. For definiteness, we suppose $f(a) < 0$ and $f(b) > 0$. If $0 < f'(x) < c$ for all $x \in [a, b]$, the relaxation process converges for all $x_0 \in [a, b]$ if $0 < \theta < 1/c$.

Proof (sketch):

1. $|g'(x)| = |1 - \theta f'(x)| < 1$ for all $x \in [a, b]$, so that $-1 < 1 - \theta c < 1 \Rightarrow 0 < \theta < 2/c$.

2. Applying the mean value theorem at points $\eta_1, \eta_2 \in [a, b]$, and using $f(a)f(b) < 0$ to keep the iterates within [a, b], we need $0 < \theta < 1/c$.
8.4 Newton's Method

The relaxation process can be generalized to a mapping of the form $g(x) = x - h(x)f(x)$, where the function h can be chosen to improve the convergence properties of the iteration. Then $g'(x) = 1 - h'(x)f(x) - h(x)f'(x)$, so that if $f(s) = 0$, the asymptotic rate of convergence is determined by $g'(s) = 1 - h(s)f'(s)$, and the convergence becomes faster than linear if $1 - h(s)f'(s) = 0$. A function h that satisfies this condition is $h(x) = 1/f'(x)$, and this gives Newton's method:
\[ x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}, \qquad (8.2) \]
and we call the sequence $(x_n)_{n\ge1}$ generated this way a Newton sequence.
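Formula (8.2) translates directly into Python (a sketch; the example computes $\sqrt{2}$ as the zero of $x^2 - 2$):

    def newton(f, fprime, x0, n_steps=8):
        x = x0
        for _ in range(n_steps):
            x = x - f(x) / fprime(x)     # x_{n+1} = x_n - f(x_n)/f'(x_n)
        return x

    print(newton(lambda x: x * x - 2.0, lambda x: 2.0 * x, 1.0))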
Theorem 31 (On the quadratic convergence of Newton's Iteration) Suppose the Newton sequence $(x_n)_{n\ge1}$ converges to s, so that $f(s) = 0$. If $f'(s) \ne 0$ and $f''$ is continuous in a neighbourhood of s, then the sequence converges quadratically to s.

Proof: Using Taylor's theorem, there exists $\eta_n \in [x_n, s]$ such that
\[ 0 = f(s) = f(x_n) + (s - x_n)f'(x_n) + \frac{1}{2}(s - x_n)^2f''(\eta_n). \]
Thus
\[ s - x_{n+1} = s - x_n + \frac{f(x_n)}{f'(x_n)} = -\frac{1}{2}(s - x_n)^2\frac{f''(\eta_n)}{f'(x_n)}, \]
so that, since $f''$ is continuous,
\[ \frac{s - x_{n+1}}{(s - x_n)^2} \to -\frac{1}{2}\frac{f''(s)}{f'(s)} \quad \text{as } n \to \infty. \]
The next result shows that Newton's method is local.

Theorem 32 (Local convergence of Newton's Method) If $f(s) = 0$, $f'(s) \ne 0$, and $f''$ is continuous in the neighbourhood of the point s, then the Newton sequence converges to s if the initial guess $x_0$ is sufficiently close to s.

Proof: Since $g(x) = x - \frac{f(x)}{f'(x)}$, we have $g'(x) = \frac{f(x)f''(x)}{f'(x)^2}$, and then $g'(s) = 0$, since $f(s) = 0$ and $f'$ is continuous in the neighbourhood of s. Therefore there is a neighbourhood of s where g satisfies a Lipschitz condition with constant less than 1, and the result follows.
1. Secant Method: If the derivative $f'(x_n)$ in Newton's formula is replaced by the difference quotient $\frac{f(x_n) - f(x_{n-1})}{x_n - x_{n-1}}$, we obtain the secant method. Therefore,
\[ x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)} \approx \frac{x_{n-1}f(x_n) - x_nf(x_{n-1})}{f(x_n) - f(x_{n-1})}. \qquad (8.3) \]
2. Regula Falsi Method: The secant method can fail if the initial values $x_0$ and $x_1$ are not sufficiently close to s. A method which is guaranteed to converge is the Regula Falsi method, or method of false position. The method resembles bisection but uses a little more common sense as well. Instead of halving the interval, the method calculates the root of the line joining the two points and then uses the subinterval that contains the root as the next interval. Thus suppose $f(a)f(b) < 0$. Find the equation of the straight line that passes through the points (a, f(a)) and (b, f(b)); if $l_0(x)$ is that equation, find $x_1$ such that $l_0(x_1) = 0$, and then verify: if $f(a)f(x_1) < 0$, then the interval $[a, x_1]$ is the new interval containing the sought-after zero, else it is found in the interval $[x_1, b]$. Thus, at each general step, if $f(x_n)f(x_{n-1}) < 0$, we find the equation of the straight line that passes through the points $(x_{n-1}, f(x_{n-1}))$ and $(x_n, f(x_n))$; if $l_n(x)$ is that equation, we find $x_{n+1}$ such that $l_n(x_{n+1}) = 0$. Thus $x_{n+1}$ is given by
\[ x_{n+1} = \frac{x_nf(x_{n-1}) - x_{n-1}f(x_n)}{f(x_{n-1}) - f(x_n)}, \quad n = 1, 2, \cdots, \quad x_0, x_1 \text{ given.} \qquad (8.4) \]
The next interval containing the zero is chosen once the sign of $f(x_{n+1})$ is known. Convergence is linear.
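Both methods are a few lines of Python each (a sketch; applied to the cubic of Example 16):

    def secant(f, x0, x1, n_steps=10):
        for _ in range(n_steps):
            if f(x1) == f(x0):
                break                    # the iterates have settled
            x0, x1 = x1, (x0 * f(x1) - x1 * f(x0)) / (f(x1) - f(x0))  # (8.3)
        return x1

    def regula_falsi(f, a, b, n_steps=40):
        # keep the bracket [a, b] with f(a) f(b) < 0, as in (8.4)
        for _ in range(n_steps):
            c = (a * f(b) - b * f(a)) / (f(b) - f(a))
            if f(a) * f(c) < 0:
                b = c
            else:
                a = c
        return c

    f = lambda x: x**3 - 2.0 * x**2 - x - 2.0
    print(secant(f, 2.0, 3.0))         # ~ 2.65897
    print(regula_falsi(f, 2.0, 3.0))   # same zero, linear convergence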
8.5.1 Exercises
1. Show that the function $f : x \mapsto f(x)$ defined by $f(x) = \frac{1}{10}(10x^2 - 2x - 5)$, $x \in \mathbb{R}$, has exactly one zero between x = 0.5 and x = 1. Approximate the zero correct to six decimal places using each of the following methods (all of which will be found in any elementary text on numerical analysis): the bisection method, the Regula Falsi method, the modified Regula Falsi method, the secant method, Newton's method, and a fixed point iteration method. Tabulate the results, showing how many steps of the iteration are needed for each method to produce the required approximation.

2. Find an interval containing the real positive zero of the function $f(x) = x^2 - 2x - 2$. Use each of the five methods introduced above to compute the zero correct to two significant figures. In each case, estimate how many steps each method would require to produce 6 significant figures.

Note: Significant figures (also called significant digits) of a number are those digits that carry meaning contributing to its accuracy. Generally this includes all digits except: (i) leading and trailing zeros which serve merely as placeholders to indicate the scale of the number, and (ii) spurious digits introduced, for example, by calculations carried out to greater accuracy than that of the original data, or measurements taken to a greater precision than the equipment supports. The concept of significant figures is often used in connection with rounding. Rounding to n significant figures is a more general-purpose technique than rounding to n decimal places, since it handles numbers of different scales in a uniform way. The term "significant figures" can also refer to a crude form of error representation based around significant-figure rounding.
3. In the bisection method, let L denote the length of the initial interval $[a_0, b_0]$. Let $\{\xi_0, \xi_1, \xi_2, \cdots\}$ represent the successive midpoints generated by the bisection method. Show that
\[ |\xi_{n+1} - \xi_n| = 2^{-(n+2)}L. \]
Also show that the number N of iterations required to guarantee the approximation of a root to an accuracy ε is given by
\[ N > -2 - \frac{\ln(\varepsilon/L)}{\ln 2}. \]
4. Given a continuous real valued function with a simple zero in its domain. Suppose, by the bisection method or otherwise, we have approximated the zero by $x_0$ with error ε; that is, $f(x_0 + \varepsilon) = 0$. Use Taylor's Theorem to derive an improvement to the approximation $x_0$. Hence derive Newton's method of successive approximations. Hint: if $x_0$ approximates the zero with error ε, then $x_1 = x_0 + \varepsilon_1$ is the next approximation, where $\varepsilon_1$ is the approximation for ε obtained by expanding $f(x_0 + \varepsilon)$ in a Taylor series, truncating the series after two terms, and solving the equation $f(x_0 + \varepsilon) = 0$.
5. Find an iterative formula for computing $\sqrt[r]{N}$, where N is any positive number. Hint: if x is the r-th root of N, then $x^r = N$, or $x^r - N = 0$. Now set $f(x) = x^r - N$ and derive the Newton iteration formula. Use your iterative formula to find the cube root of 100 to the nearest thousandth of a unit.
6. Descartes' Rule of Signs: Descartes' rule of signs, first described by René Descartes in his work La Géométrie, is a technique for determining the number of positive or negative real roots of a polynomial. The rule states that if the terms of a single-variable polynomial with real coefficients are ordered by descending variable exponent, then the number of positive real roots of the polynomial is either equal to the number of sign differences between consecutive nonzero coefficients, or less than it by a multiple of 2. The "multiple of 2" part accounts for complex roots of the equation, which always occur in an even number of conjugate pairs. Multiple roots of the same value are counted separately. As a corollary of the rule, the number of negative real roots is the number of sign changes after negating the coefficients of odd-power terms (otherwise seen as substituting the negation of the variable for the variable itself), or less than it by a multiple of 2. For example, the polynomial $x^3 + x^2 - x - 1$ has one sign change, between the second and third terms; therefore it has exactly 1 positive real root. Replacing x by −x gives $-x^3 + x^2 + x - 1$. This polynomial has two sign changes, so the original polynomial has 2 or 0 negative roots. This polynomial factors into $(x + 1)^2(x - 1)$, so the roots are −1 (twice) and 1. Now use Descartes' rule of signs to determine the number of positive roots of the function $f(x) = 5x^3 - 3x^2 - 3x - 8$, $x \in \mathbb{R}$. Use Newton's iteration to determine the roots.
7. Calculate the roots of the following equations to the nearest hundredth of a unit.
(a) x3 = 5, (b) x2 = 2;
(c) x3 + 3x − 5 = 0, (d) ex + x − 3 = 0;
(e) x + ln x = 2, (f) x cos x = x2 ;
(g) x2 − 30x − 110 = 0, (h) x4 + 8x − 12 = 0;
(i) x = 2 + sin(x), (j) x2 + 4 sin x = 0;
8. Apply Newton’s Method to the equation x2 − a = 0, for any positive number a and
derive a well known formula for extracting square roots of numbers.
9. Show that Newton's iteration method may fail if either $f'$ or $f''$ vanishes near a zero of f.
10. Show that we can derive Newton’s method for finding the zeros of a nonlinear
function f by expanding f (xn+1 ) = f (xn + (xn+1 − xn )) ∼ 0 in a 3-term Taylor
Series expansion and neglecting quadratic terms and higher terms if xn+1 − xn is
small.
Chapter 9
Numerical Approximation of
Solutions of Ordinary Differential
Equations
\[ y^{(n)}(x) = f(x, y(x), y^{(1)}(x), y^{(2)}(x), \cdots, y^{(n-1)}(x)). \qquad (9.4) \]
A function ϕ is a solution if
\[ \varphi^{(n)}(x) = f(x, \varphi(x), \varphi^{(1)}(x), \varphi^{(2)}(x), \cdots, \varphi^{(n-1)}(x)). \qquad (9.5) \]
The general solution of (9.5) will normally contain n arbitrary constants and hence there
exists an n-parameter family of solutions. If y(x0 ), y (1) (x0 ), y (2) (x0 ), · · · , y (n−1) (x0 ) are
prescribed at the point x = x0 , we have an initial value problem. We always assume
that y satisfies enough conditions to ensure that a unique solution exists. As a simple
example, we consider the problem y 0 (x) = y which has the general solution y(x) = Cex
where C is an arbitrary constant. If we prescribe the value of y at $x_0$, say by $y(x_0) = y_0$, the differential equation then has the particular solution $y(x) = y_0e^{x-x_0}$.
Differential equations are further classified as linear and nonlinear. An equation is
linear if the function f involves y and its derivatives linearly. Linear equations have the
important property that if $\{y_i,\ i = 1, 2, 3, \cdots, n\}$ is a set of solutions of (9.1), then so is $\sum_{i=1}^nC_iy_i(x)$ for arbitrary constants $C_i$, $i = 1, 2, \cdots, n$. The simple second order equation $y''(x) = y$ is easily verified to have the solutions $y_1 = e^x$, $y_2 = e^{-x}$, and hence, by linearity, $y(x) = C_1e^x + C_2e^{-x}$ is also a solution. The solutions $y_1$ and $y_2$ of a second order equation are said to be linearly independent if the Wronskian $W(y_1, y_2)$ of the solutions does not vanish. Here
\[ W(y_1, y_2) = \begin{vmatrix} y_1 & y_2 \\ y_1' & y_2' \end{vmatrix} = y_1y_2' - y_2y_1'. \qquad (9.6) \]
Amongst linear equations, those with constant coefficients are particularly easy to solve. For example, let
\[ \sum_{i=0}^na_iy^{(i)}(x) = 0 \qquad (9.7) \]
be a differential equation of order n, where the $a_i$ are constants. If we seek a solution of the form $y(x) = Ce^{\beta x}$, for some constant C, then direct substitution shows that
\[ \sum_{i=0}^na_i\beta^i = 0. \qquad (9.8) \]
(9.8) is called the characteristic equation, which is a polynomial of degree n and several
possibilities arise:
1. Distinct real roots: If the nth order equation (9.8) has n distinct roots $\beta_i$, $i = 1, 2, 3, \cdots, n$, then linearity gives the general solution
\[ y(x) = \sum_{i=1}^nC_ie^{\beta_ix} \qquad (9.9) \]
of (9.7). If any of the distinct roots are complex, say $\beta_j = \mu_j + i\omega_j$, then $\bar\beta_j = \mu_j - i\omega_j$ is also a root, and these two roots together contribute the two linearly independent solutions $y_{j1} = e^{\mu_jx}\cos(\omega_jx)$, $y_{j2} = e^{\mu_jx}\sin(\omega_jx)$, corresponding to the complex conjugate pair of roots.
2. Multiple roots: When (9.8) has multiple roots, special techniques are available
for obtaining the linearly independent solutions. In particular, if β is a root of
(9.8) with multiplicity k ≤ n, then yi = xi−1 eβx , i = 1, 2, 3, · · · , k are k linearly
independent solutions corresponding to the multiple root.
Finally, if (9.1) is linear but not homogeneous, that is, if $g(x) \ne 0$, and if ξ(x) is a particular solution, that is, a function $\xi : \mathbb{R} \to \mathbb{R}$ such that $L\xi(x) = g(x)$, then the general solution of the linear nth order differential equation will be given by $y(x) = \xi(x) + \sum_{i=1}^nC_ie^{\beta_ix}$.
Consider, for example, the initial value problem
\[ \frac{d^2y}{dx^2} - 4\frac{dy}{dx} + 3y = x, \quad y(0) = 4/9, \quad y'(0) = 7/3. \]
Solution: To find the particular integral ξ(x), try $\xi(x) = ax + b$, since the right hand side is a polynomial of degree ≤ 1. Substituting gives $\xi(x) = \frac{1}{3}x + \frac{4}{9}$. To find the homogeneous solution, solve $y''(x) - 4y'(x) + 3y = 0$. The characteristic equation gives
\[ \beta^2 - 4\beta + 3 = 0 \Rightarrow \beta = 3,\ \beta = 1, \]
so $y(x) = C_1e^{3x} + C_2e^x + \frac{1}{3}x + \frac{4}{9}$. The initial conditions give $C_1 + C_2 = 0$ and $3C_1 + C_2 = 2$, hence $C_1 = 1$, $C_2 = -1$, and $y(x) = e^{3x} - e^x + \frac{1}{3}x + \frac{4}{9}$.
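A quick numerical confirmation in Python (assuming NumPy; the check point x = 0.37 is arbitrary, and the derivatives are approximated by central differences):

    import numpy as np

    # Roots of the characteristic polynomial beta^2 - 4 beta + 3.
    print(np.roots([1.0, -4.0, 3.0]))        # [3., 1.]

    def y(x):
        return np.exp(3 * x) - np.exp(x) + x / 3.0 + 4.0 / 9.0

    x, h = 0.37, 1e-5
    yp = (y(x + h) - y(x - h)) / (2 * h)
    ypp = (y(x + h) - 2 * y(x) + y(x - h)) / h**2
    print(ypp - 4 * yp + 3 * y(x), x)        # left-hand side ~ x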
Solution: From the theory of ordinary differential equations, we find that χ has equilibrium points at
\[ \sin(\chi) = \frac{b}{a}, \qquad (9.11) \]
which exist when $b^2 \le a^2$. It is straightforward to establish that (9.10) has general solution defined, for any constant κ, by
\[ \tan\left(\frac{\chi}{2}\right) = \begin{cases} \dfrac{a}{b} + c\tan\left(\dfrac{bc}{2}(\kappa - T)\right), & b^2 > a^2, \quad c^2 = 1 - \dfrac{a^2}{b^2},\ c > 0, \\[2mm] \pm1 + \dfrac{2}{\pm a(T - \kappa)}, & b = \pm a, \\[2mm] \dfrac{a}{b} + c\tanh\left(\dfrac{bc}{2}(T - \kappa)\right), & b^2 < a^2, \quad -c^2 = 1 - \dfrac{a^2}{b^2},\ c > 0. \end{cases} \qquad (9.12) \]
The solution (9.12) shows that when $b^2 \le a^2$, χ will tend to one of the equilibrium points given by (9.11) as $T \to \infty$. In this case, $\chi(\infty)$ is a constant, hence $\cos(\chi(\infty))$ is determined, and the solution is realizable.
Now, despite the examples that have been given above, it is important to note that not all differential equations have a solution, that some have infinitely many solutions, and that still others have unique solutions but cannot be solved explicitly to obtain the solution in terms of elementary functions. For example, the differential equation $\frac{dy}{dx} = x^2 + y^2$, $y(0) = y_0$, cannot be solved to obtain a solution in closed form, even though we can show that this equation has a unique solution on an interval containing the point x = 0. On the other hand, we find that
\[ \frac{dy}{dx} = |y(x)|^{1/2}, \quad y(0) = 0 \quad \Rightarrow \quad y(x) = \begin{cases} 0, & 0 \le x \le b, \\ \tfrac{1}{4}(x - b)^2, & x > b, \end{cases} \]
for every b > 0, leading to an infinite number of solutions satisfying the initial condition, while the differential equation
\[ \frac{dy}{dx} = \begin{cases} -y\ln(|y(x)|), & y \ne 0, \\ 0, & y = 0, \end{cases} \quad y(0) = 0 \quad \Rightarrow \quad y(x) = 0,\ \forall x. \]
In these last two examples, both functions on the right hand side are continuous, but continuity alone does not guarantee uniqueness. It is again very easy for us to establish that the differential equation $\frac{dy}{dx} = y^{2/3}$, $y(0) = 0$, has a solution, for every $a \le 0 \le b$, defined by
\[ y_{a,b}(x) = \begin{cases} \tfrac{1}{27}(x - a)^3, & x \le a, \\ 0, & a \le x \le b, \\ \tfrac{1}{27}(x - b)^3, & x \ge b. \end{cases} \]
This solution satisfies the initial condition; in particular, $y(x) \equiv 0$ and $y_{0,0}(x) = \frac{1}{27}x^3$ are two solutions that satisfy the initial condition but do not agree on any open interval containing the origin. So, although the function on the right hand side is continuous, the solution is not unique. These are Lipschitz ideas. We state the following definition.
Definition 46 (Lipschitz Continuity) A function $f : \mathbb{R}^n \to \mathbb{R}^n$ is said to be Lipschitz continuous if there exists a positive constant L, the Lipschitz constant, such that for every $x, y \in \mathbb{R}^n$, $\|f(x) - f(y)\| \le L\|x - y\|$. A map for which $\|f(x) - f(y)\| = \|x - y\|$ for all x, y is called an isometric embedding.
It is easy for us to prove the following theorem:

Theorem 33

Exercise: Show that the converses of parts (a)–(e) are in general false.
The theory assures us that if the function on the right is continuous and Lipschitz continuous in y, then we can construct a unique solution near the initial conditions, according to the following theorem.

Theorem 34 (Existence theorem for ODEs) Suppose that $f : \mathbb{R}^n \times [0, \infty) \to \mathbb{R}^n$ is Lipschitz continuous on the n-rectangle
\[ R = \{(x, y) \in \mathbb{R}^{n+1} : x_0 \le x \le x_m,\ \|y - y_0\| \le y_m\} \]
with Lipschitz constant L; that is, for every $y, z \in \mathbb{R}^n$, $\|f(x, y) - f(x, z)\| \le L\|y - z\|$. Then there exists a continuously differentiable path $(x, y(x))$ (a solution to $y'(x) = f(x, y(x))$, $y(x_0) = y_0$) giving a unique value for each $x \ge x_0$, until the curve leaves R.

Proof: See MAT409.
Thus we can only determine unique solutions locally. For every differential equation, we can work out the rectangle R in which the unique solution exists. It is also possible to extend the domain of validity of the theorem, by a creeping argument, so that the solution is valid for all x on the real line $(-\infty, \infty)$. But this is not always the case. For example, the differential equation $\frac{dx}{dt} = 1 + x^2$, $x(0) = 0$, has the solution $x(t) = \tan(t)$, whose domain of validity, relative to the prescribed initial data, is $\{t \in \mathbb{R} : -\frac{\pi}{2} < t < \frac{\pi}{2}\}$, and it cannot be further extended. (Such an occurrence is often called an "explosion": the solution reaches infinity in finite time.) We can state and prove the following result:
Corollary 9 Suppose that the conditions of Theorem 34 hold, and let $x_0 \in \mathbb{R}$. Then there exists a unique maximal solution $y_{max} : (a, b) \to \mathbb{R}^n$, where $x_0 \in (a, b)$ (a may be $-\infty$ and b may be $+\infty$), with $y_{max}(x_0) = y_0$. Any other solution is a restriction of $y_{max}$ to a sub-interval.
It will appear that we must be able to locate the region R of space within which a unique solution exists before starting our search for solutions. We note that, from the continuity of f and the mean value theorem, for a mean value point η between y and z (taking here $R \subset \mathbb{R}^2$, so that y and z are scalars), we have
\[ |f(x, y) - f(x, z)| = \left|\frac{\partial f}{\partial y}(x, \eta)\right||z - y| \quad \Rightarrow \quad L = \max\{|f_y(x, y)| : (x, y) \in R\}, \]
where $f_y$ is the partial derivative of f with respect to y. Here are some examples:
1. For $y'(x) = \lambda y(x)$, $\lambda \in \mathbb{R}$: $L = |\lambda|$, with $x_m, y_m$ arbitrarily large. We can choose R such that $y_m \ge |y_0\exp(\lambda(x_m - x_0)) - y_0|$.

2. For $y'(x) = \sin(y)$: $L = 1$ and $x_m, y_m$ arbitrarily large. Since $|f(x, y)| \le 1$, we have $|y - y_0| \le |x - x_0|$, so taking $y_m \ge |x_m - x_0|$ ensures existence for all x.
In what follows we shall assume that the method of finding the rectangle wherein the unique solution exists is known, and we address the issue of constructing solutions of the ordinary differential equations. The theory we advance extends to systems of differential equations, taking into consideration the following:
\[ \frac{dx}{dt} = f(x, y), \quad \frac{dy}{dt} = 1, \quad x(t_0) = x_0, \quad y(t_0) = 0, \]
an autonomous first order system in $\mathbb{R}^2$.
2. A k-th order ordinary differential equation in normal form,
\[ y^{(k)} = g(y, y^{(1)}, \cdots, y^{(k-1)}), \qquad (9.15) \]
where the k-th and highest order derivative has been expressed as a function of the remaining lower order derivatives and $g : U \times \mathbb{R}^n \times \cdots \times \mathbb{R}^n \to \mathbb{R}^n$ ((k − 1) factors of $\mathbb{R}^n$), can be reduced to a first order system. Define $h : V \to \mathbb{R}^{nk}$,
\[ (y_1, y_2, y_3, \cdots, y_k) \mapsto (y_2, y_3, \cdots, y_k, g(y_1, y_2, y_3, \cdots, y_k)), \]
and then consider the first order ode
\[ \frac{dy}{dt} = h(y), \quad y \in V. \qquad (9.16) \]
Then $\alpha : I \to U$ is a solution of (9.15) if and only if $\beta : I \to V$ defined by $\beta(t) = (\alpha(t), \alpha'(t), \cdots, \alpha^{(k-1)}(t))$ is a solution of (9.16). So, for example, to solve the equation
\[ \frac{d^3y}{dx^3} + \sin(y)\frac{d^2y}{dx^2} + y = x, \]
we may choose to analyze a first order system of ordinary differential equations in $\mathbb{R}^3$ as follows: set $z = y'(x)$, then $w = z'(x) = y''(x)$, and so $w'(x) = y'''(x) = x - \sin(y)w - y$, and we have the equivalent first order system in $\mathbb{R}^3$, namely
\[ y'(x) = z, \qquad z'(x) = w, \qquad w'(x) = x - y - w\sin(y). \]
3. The point about 1 and 2 is that, for all differential equations in normal form, we might as well just consider first order autonomous systems of ordinary differential equations (perhaps not always the most practical way to solve differential equations). Almost all differential equations can be put in normal form by the implicit function theorem, which can be proved as another example of the application of the contraction mapping theorem.

4. Not all differential equations can be solved in terms of elementary functions, integrals of elementary functions, or even the "special functions" of mathematical physics.

5. We can use differential equations to define functions. For example, the differential equation $y'(x) = y(x)$, $y(0) = 1$, defines the function $y(x) = e^x$, while $y''(x) - y(x) = 0$, $y(0) = 1$, $y'(0) = 0$, defines the function $y(x) = \cosh(x)$.
Historical Note: Around 1830, Galois studied the solution of polynomial equations in terms of radicals (roots). By associating groups to polynomial equations, he proved, for example, that general quintic polynomial equations cannot be solved by radicals. (Nowadays we know that such equations can be handled using elliptic functions.) Later in the nineteenth century, analogous group-theoretic ideas were applied to differential equations, showing that not all differential equations can be solved using only elementary functions and their integrals, e.g. $x'(t) = x^2 - t$. Nowadays we know that we can prove existence and then proceed to obtain approximate numerical solutions to all differential equations whose solutions exist.
The last remark points to the fact that we must find ways of approximating solutions of differential equations, especially when we have established that a unique solution exists. Numerical methods become very handy here. To be able to follow the theory of approximation of solutions of differential equations, we need the related concepts of difference equations and differences.
The forward differences of a data set $(x_i, y_i)$, $i = 0, 1, \cdots, n$, are laid out in a staggered table, each column holding the differences of the column before it:

    x     y      Δy      Δ²y      Δ³y      Δ⁴y   · · ·   Δⁿy
    x0    y0
                 Δy0
    x1    y1             Δ²y0
                 Δy1              Δ³y0
    x2    y2             Δ²y1              Δ⁴y0
                 Δy2              Δ³y1
    x3    y3             Δ²y2
                 Δy3
    x4    y4
    ...
    xn    yn

The last entries of the successive columns are $\Delta y_{n-1}$, $\Delta^2y_{n-2}$, $\Delta^3y_{n-3}$, $\Delta^4y_{n-4}$, $\cdots$, $\Delta^ny_0$.
Example 19 Construct a difference table for the set of points (−3, −25), (−1, 1), (1, 3), (3, 29), (5, 205) and hence find Δ⁴y0, given that the data set starts at (x0, y0) = (−3, −25) and ends at (x4, y4) = (5, 205).
 x      y      Δy     Δ²y     Δ³y     Δ⁴y
−3    −25
               26
−1      1             −24
                2              48
 1      3              24               78
               26             126
 3     29             150
              176
 5    205

$$ \Delta^4 y_0 = \sum_{i=0}^{4} (-1)^i \binom{4}{i} y_{4-i} = y_4 - 4y_3 + 6y_2 - 4y_1 + y_0 = 78, $$

an answer which could have been read off from the table as constructed.
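As a quick check, a difference table is easy to generate by repeated differencing; the following short Python sketch (an illustration, not part of the original notes) reproduces the columns of the table above:

    def difference_table(ys):
        # Return the columns [y, Δy, Δ²y, ...] obtained by repeated differencing.
        columns = [list(ys)]
        while len(columns[-1]) > 1:
            prev = columns[-1]
            columns.append([b - a for a, b in zip(prev, prev[1:])])
        return columns

    for k, col in enumerate(difference_table([-25, 1, 3, 29, 205])):
        print("order", k, ":", col)
    # The final column printed is [78], i.e. Δ⁴y0 = 78, as in Example 19.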
Now, for an arbitrary function f and a positive step size h we define the following difference formulae for the continuous function f.

Definition 48
1. Forward difference: Δ_{+,h} f(x) = f(x + h) − f(x).
2. Backward difference: Δ_{−,h} f(x) = f(x) − f(x − h).
3. Central difference: Δ_h f(x) = f(x + h) − f(x − h).
Proof: Exercise.
It is important to note that these rules only concern difference relationships but do not
enable us to actually find differences of functions. However, we can prove some interesting
results as shown below:
Theorem 36 Consider the forward difference Δ_{+,h} f(x) = f(x + h) − f(x). Then

1. If f is a constant function, then Δ_{+,1} f(x) = 0.

2. If f(x) = xⁿ, then

$$ \Delta_{+,1} f(x) = \sum_{i=1}^{n} \binom{n}{i} x^{n-i}, $$

a polynomial of degree (n − 1) with leading term n x^{n−1}.
Proof: Exercise
9.2.1 Exercises
1. Construct difference tables for the following data set
3. Find the next term of each of the following sequences by extending their difference tables:
(a) 1,1,2,3,5,8,13,21,34
(b) 1,4,10,20,35,56
(c) -1,0,1,8,27,64
(d) 1,,8,17,32,57,100,177,320
(e) 2,2,14,74,64
(f) 14,23,34,42,59
4. Given the points (2, 1), (3, 6), (4, 13), (5, 22) and (6, 33), find the point (0, y). State the assumption under which this value is determined.

6. State and prove corresponding results for the backward and central differences defined above.

7. Show how you can use the difference tables for Δ_{+,h} to locate errors in data values.

8. Show that for any polynomial p_n of degree n, Δⁿ_{+,1} p_n(x0) is constant, and find the value of the constant.
For an arbitrary step size h it is easy for us to see from calculus that, if the function under consideration is differentiable, then

$$ f'(x) \approx \frac{\Delta_{+,h} f(x)}{h}, \qquad (9.18) $$

$$ f'(x) \approx \frac{\Delta_{h} f(x)}{2h}, \qquad (9.19) $$

$$ f'(x) \approx \frac{\Delta_{-,h} f(x)}{h}, \qquad (9.20) $$
and the question that readily comes to mind is which of these is most suitable for approximating the derivative of the function at the point x. The answer is given to us by Taylor's formula for functions of a real variable: we calculate the error incurred by approximating the derivative with finite difference quotients as follows. Let E_f, E_c and E_b be the errors incurred in approximating f'(x) by (9.18), (9.19) and (9.20) respectively. Then, from Taylor's mean value theorem, and for a mean value point η lying between the extreme points used in each quotient, we have

$$ |E_f| = \left|f'(x) - \frac{\Delta_{+,h} f(x)}{h}\right| = \frac{1}{2}\, h\, |f''(\eta)| \qquad (9.21) $$

$$ |E_c| = \left|f'(x) - \frac{\Delta_{h} f(x)}{2h}\right| = \frac{1}{6}\, h^2\, |f'''(\eta)| \qquad (9.22) $$

$$ |E_b| = \left|f'(x) - \frac{\Delta_{-,h} f(x)}{h}\right| = \frac{1}{2}\, h\, |f''(\eta)| \qquad (9.23) $$
Thus it becomes clear that, if the relevant derivatives are bounded, as is often the case, then for small h the central difference gives a better approximation than the forward or backward difference.
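A quick numerical check of these error estimates (a sketch, with f(x) = sin x as an assumed test function):

    import math

    f, x = math.sin, 1.0
    exact = math.cos(x)                            # f'(x) for f = sin

    for h in (0.1, 0.01, 0.001):
        fwd = (f(x + h) - f(x)) / h                # forward quotient (9.18)
        cen = (f(x + h) - f(x - h)) / (2 * h)      # central quotient (9.19)
        bwd = (f(x) - f(x - h)) / h                # backward quotient (9.20)
        print(h, abs(fwd - exact), abs(cen - exact), abs(bwd - exact))
    # The central error shrinks like h², the one-sided errors only like h.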
Formulas for obtaining higher derivatives can be obtained by applying higher order difference relations. For example, we may find f''(x) as follows:

$$ f''(x) \approx \frac{\Delta_{+,h} f'(x)}{h} = \frac{1}{h}\left(\frac{\Delta_{+,h} f(x+h)}{h} - \frac{\Delta_{+,h} f(x)}{h}\right) = \frac{1}{h}\left(\frac{f(x+2h) - f(x+h)}{h} - \frac{f(x+h) - f(x)}{h}\right) = \frac{f(x+2h) - 2f(x+h) + f(x)}{h^2} \qquad (9.24) $$
Similarly, the central difference gives

$$ f''(x) \approx \frac{f(x+2h) - 2f(x) + f(x-2h)}{4h^2} \qquad (9.25) $$

and the backward difference formula yields

$$ f''(x) \approx \frac{f(x) - 2f(x-h) + f(x-2h)}{h^2} \qquad (9.26) $$
The forward and central second difference formulas require that we evaluate the function at the points x + 2h and x − 2h respectively. For moderate values of h these function values are probably not known, and the domain of validity of the finite difference approximations is reduced. Based on the error estimates for the first derivative, it is better to use the central difference to approximate the derivative, and also to reduce the error at evaluation by applying the difference formula with step ½h instead of h, as follows:
$$ f''(x) \approx \frac{\Delta_{\frac{1}{2}h} f'(x)}{h} = \frac{1}{h}\left(\frac{\Delta_{\frac{1}{2}h} f(x + \tfrac{1}{2}h)}{h} - \frac{\Delta_{\frac{1}{2}h} f(x - \tfrac{1}{2}h)}{h}\right) = \frac{f(x+h) - 2f(x) + f(x-h)}{h^2} \qquad (9.27) $$
9.2.2 Application
Example 20 Suppose we wish to solve the differential equation

$$ y''(x) + y(x) = x, \qquad y(0) = \tfrac{1}{2} = y(1). \qquad (9.28) $$

Find the exact solution and then use finite differences to approximate the solution. Compare.

2. Prior to working out the formula (9.29), the only information we have about y is the differential equation and the values that y takes at the boundary points x = 0 and x = 1.

Let us approximate the second derivative with finite differences so that, using the convention that y_i ≈ y(x_i), equation (9.30) transforms to the discrete set of equations

$$ \frac{y_{i+1} - 2y_i + y_{i-1}}{h^2} + y_i = x_i, \quad i = 1, 2, 3, \cdots, n-1, \qquad y_0 = y_n = \tfrac{1}{2}. \qquad (9.31) $$
This can be rearranged to give

$$ y_{i+1} - (2 - h^2)\, y_i + y_{i-1} = h^2 x_i, \quad i = 1, 2, 3, \cdots, n-1, \qquad y_0 = y_n = \tfrac{1}{2}, \qquad (9.32) $$

an example of a difference equation in y_i. The process of transforming the differential equation (using finite differences or other methods) into a system of discrete difference equations is called discretisation. The system of equations in y_i, i = 1, 2, 3, ⋯, n − 1, defined by (9.32) is linear and can be solved by any method for solving linear systems of the form Ax = b, where A is an (n−1) × (n−1) matrix and b is a known vector, for the solution x. Here the matrix A and the vector b are given by

$$ A = \begin{pmatrix} -\tilde h & 1 & 0 & 0 & \cdots & 0 \\ 1 & -\tilde h & 1 & 0 & \cdots & 0 \\ 0 & 1 & -\tilde h & 1 & \cdots & 0 \\ \vdots & \ddots & \ddots & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & 1 & -\tilde h & 1 \\ 0 & \cdots & 0 & 0 & 1 & -\tilde h \end{pmatrix}, \qquad b = \begin{pmatrix} \tilde x_1 - \frac{1}{2} \\ \tilde x_2 \\ \tilde x_3 \\ \vdots \\ \tilde x_{n-2} \\ \tilde x_{n-1} - \frac{1}{2} \end{pmatrix}, \qquad (9.33) $$

with h̃ = 2 − h² and x̃_i = h² x_i.
Figure 9.1: Graphs for Example 20. The smooth curve is the exact solution, the dashed line is the approximate solution for n = 5, and the dotted line is the approximate solution for n = 10. The approximation is least accurate at the points where the slope of the solution changes rapidly.

Evidently, the more points we take within the interval [0, 1], the better the approximation will be.
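For concreteness, the following Python sketch (assuming numpy is available; the code is an illustration of (9.32)–(9.33), not part of the original notes) assembles and solves the tridiagonal system for Example 20:

    import numpy as np

    def solve_bvp(n):
        # Solve y'' + y = x, y(0) = y(1) = 1/2, on n subintervals via (9.32).
        h = 1.0 / n
        x = np.linspace(0.0, 1.0, n + 1)
        ht = 2.0 - h**2                      # the quantity h-tilde in (9.33)
        A = (np.diag(-ht * np.ones(n - 1))
             + np.diag(np.ones(n - 2), 1)
             + np.diag(np.ones(n - 2), -1))
        b = h**2 * x[1:n]
        b[0] -= 0.5                          # boundary value y0 = 1/2
        b[-1] -= 0.5                         # boundary value yn = 1/2
        y = np.full(n + 1, 0.5)
        y[1:n] = np.linalg.solve(A, b)
        return x, y

    x, y = solve_bvp(10)
    print(np.column_stack([x, y]))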
General methods for handling solutions of systems of linear equations are treated in a different course. We have included this here as an illustration of the method of differences, which incidentally also introduces the notion of difference equations.

It is possible to use higher order differences to approximate derivatives of lower order. For example, suppose we wish to solve the problem y'(x) = f(x, y), y(x0) = y0. Then it is easy to verify (using Taylor series expansions) that the first derivative can be approximated, at the point x_n, by a third order difference relation
$$ y_0 = 1 + \varepsilon, \quad y_1 = 1 - \varepsilon, \quad y_2 = 1 + \varepsilon, \qquad (9.36) $$

where ε ≪ 1 may be seen as the small error introduced in estimating the initial values. With these starting values we can then obtain the values y_n for n = 3, 4, ⋯ easily. For example, y3 = 1 − 5ε, y4 = 1 + 11ε, y5 = 1 − 32ε, ⋯. Clearly y_n oscillates around the value 1 and |y_n − 1| is increasing, even though the equation to be solved (9.35) has the constant solution y = 1. The equation 2y_{n+1} + 3y_n − 6y_{n−1} + y_{n−2} = 0 is an example of a linear difference equation, that is, an equation defined on some interval of integers relating the terms of a sequence and their differences. We must learn the properties of difference equations and understand when a difference equation can be useful for solving a given problem. Evidently the scheme provided in this example is not suitable for use as a solver for the indicated equation, or for any first order differential equation for that matter, since small errors in the solution propagate and are magnified. And there will always be errors in approximations.
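The growth of the error component is easy to observe numerically; the sketch below (an illustration, with ε = 10⁻⁶ as an assumed perturbation) iterates the difference equation directly:

    eps = 1e-6
    y = [1 + eps, 1 - eps, 1 + eps]        # the perturbed starting values (9.36)
    for n in range(2, 12):
        # 2*y[n+1] + 3*y[n] - 6*y[n-1] + y[n-2] = 0, solved for y[n+1]:
        y.append((6 * y[n - 1] - 3 * y[n] - y[n - 2]) / 2)
    print(y)   # the deviation from the constant solution y = 1 grows rapidly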
9.2.3 Exercises
Use finite difference approximations of the derivative to solve the following problems
6. y 00 (x) − xy 0 (x) = 0, y(−1) = 1, y(1) = 0.
7. y 00 (x) + xy 0 (x) − xy = 0, y(0) = 0, y(1) = e.
8. Consider the boundary value problems
(a) y 00(x) + y(x) = sin(πx), x ∈ (0, 1), y(0) = 1, y(1) = 1
(b) y 00(x) + π 2 y(x) = sin(πx), x ∈ (0, 1), y(0) = 1, y(1) = 1.
Suppose in each case that an approximate solution to this problem is computed using the schemes:

(a) $$ \frac{\Delta^2_{\frac{1}{2}h}\, y(x_j)}{h^2} + y(x_j) = \sin(\pi x_j), \quad j = 1, 2, \cdots, N-1, $$

(b) $$ \frac{\Delta^2_{\frac{1}{2}h}\, y(x_j)}{h^2} + \pi^2 y(x_j) = \sin(\pi x_j), \quad j = 1, 2, \cdots, N-1, $$

where y(x0) = y(x_N) = 1, h = 1/N, x_j = jh, j = 0, 1, 2, ⋯, N; N an integer greater than 1.
(a) Identify the amount of error committed in approximating the second derivative
with the central differences and estimate its value.
(b) Let N = 5, and yj ≈ y(xj ). Find the values of the approximations yj , j =
1, 2, 3, 4 and compare the approximations with the exact solution which you
should first find in each case.
Example 21 The following are examples of difference equations and their solutions:

1. y_{n+1} − y_n = 1, all n. Solution: y_n = n + c.

2. y_{n+1} − y_n = n, all n. Solution: y_n = n(n − 1)/2 + c.

3. y_{n+1} − (n + 1) y_n = 0, all n > 0. Solution: y_n = c · n!.

4. y_{n+2} − 2 cos(γ) y_{n+1} + y_n = 0, all n. Solution: y_n = c cos(γn).
Consider the linear difference equation of order N with constant coefficients and zero right hand side (such an equation is also called homogeneous; a non-homogeneous linear difference equation is accordingly one that has a non-zero right hand side):

$$ y_{n+N} + a_{N-1}\, y_{n+N-1} + a_{N-2}\, y_{n+N-2} + \cdots + a_0\, y_n = 0. \qquad (9.39) $$

Seek a solution of the form

$$ y_n \propto \beta^n, \quad \text{all } n. \qquad (9.40) $$

Substituting in (9.39) gives

$$ \beta^{n+N} + a_{N-1}\,\beta^{n+N-1} + a_{N-2}\,\beta^{n+N-2} + \cdots + a_0\,\beta^n = 0 \;\Rightarrow\; \rho(\beta) = \beta^N + a_{N-1}\,\beta^{N-1} + a_{N-2}\,\beta^{N-2} + \cdots + a_0 = 0, \qquad (9.41) $$
a polynomial of degree N, called the characteristic polynomial. We have the following
possibilities:
The zeros of ρ(β) are distinct: In this case, (9.39) has the general solution

$$ y_n = \sum_{i=1}^{N} c_i\, \beta_i^n, \quad \text{all } n. $$
β_i is a zero of ρ with multiplicity m: In this case, (9.41) takes the form

$$ \rho(\beta) = (\beta - \beta_i)^m\, q(\beta), \qquad q(\beta_i) \neq 0. $$

Then the corresponding solutions are β_iⁿ, nβ_iⁿ, n²β_iⁿ, ⋯, n^{m−1}β_iⁿ, and hence a linear combination of these is also a solution of the linear difference equation.
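The recipe is mechanical enough to automate; the following sketch (an illustration, assuming numpy) finds the zeros of a characteristic polynomial numerically and checks a resulting solution of the difference equation of Example 22(i) below:

    import numpy as np

    # y_{n+3} - 2y_{n+2} - y_{n+1} + 2y_n = 0 has ρ(β) = β³ - 2β² - β + 2.
    print(np.roots([1, -2, -1, 2]))        # ≈ [2, -1, 1]

    # General solution y_n = c1 + c2(-1)^n + c3·2^n; check one instance.
    c1, c2, c3 = 0.0, -1/3, 1/3
    y = [c1 + c2 * (-1)**n + c3 * 2**n for n in range(8)]
    print([y[n+3] - 2*y[n+2] - y[n+1] + 2*y[n] for n in range(5)])   # all ≈ 0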
Example 22 (i) For the third order linear difference equation y_{n+3} − 2y_{n+2} − y_{n+1} + 2y_n = 0, the characteristic polynomial is ρ(β) = β³ − 2β² − β + 2, and ρ(β) = 0 yields the three roots β = 1, −1, 2. Hence the general solution is y_n = c1(1)ⁿ + c2(−1)ⁿ + c3(2ⁿ) = c1 + c2(−1)ⁿ + c3 2ⁿ, containing three arbitrary constants which can be settled if three initial values of y_n are known, e.g. if y0 = 0, y1 = 1, y2 = 1. Then we have the three equations c1 + c2 + c3 = 0, c1 − c2 + 2c3 = 1, and c1 + c2 + 4c3 = 1, which yield c1 = 0, c2 = −1/3, and c3 = 1/3. Then y_n = (1/3)((−1)^{n+1} + 2ⁿ), ∀n.
(ii) For the third order linear difference equation y_{n+3} − 6y_{n+2} + 12y_{n+1} − 8y_n = 0, the characteristic polynomial is ρ(β) = β³ − 6β² + 12β − 8, which factorizes to (β − 2)³ = 0. Hence, since β = 2 is a repeated zero, the general solution is y_n = c1(2)ⁿ + c2 n(2)ⁿ + c3 n²(2ⁿ), n = 0, 1, 2, ⋯; containing three arbitrary constants which can be settled if three initial values of y_n are known.
(iii) For the second order linear difference equation y_{n+2} − 2y_{n+1} + 2y_n = 0, the characteristic polynomial is ρ(β) = β² − 2β + 2, which yields the complex zeros β = 1 ± i. Hence, since β is complex, the general solution is y_n = c1(1 + i)ⁿ + c2(1 − i)ⁿ, n = 0, 1, 2, ⋯, which may be simplified to y_n = (√2)ⁿ(C1 cos(nθ) + C2 sin(nθ)), θ = π/4.
(iv) For the third order linear difference equation y_{n+3} − 5y_{n+2} + 8y_{n+1} − 4y_n = 0, the characteristic polynomial is ρ(β) = β³ − 5β² + 8β − 4, which factorizes to (β − 1)(β − 2)² = 0. Hence, since ρ has a repeated zero as well, the general solution is y_n = 2ⁿ(c1 + nc2) + c3, n = 0, 1, 2, ⋯.
Example 23 It is required to solve the second order equation y''(x) − y(x) = 1. Show that if x_n = x_{n−1} + h for some step size h, then the general solution of the difference equation resulting from the finite difference approximation can be expressed in the form

$$ y_n = C_1\left(1 + h + \frac{h^2}{2} + O(h^3)\right)^n + C_2\left(1 - h + \frac{h^2}{2} + O(h^3)\right)^n - 1, $$

where C1 and C2 are arbitrary constants that can be determined once data is provided for the differential equation.
Before solving this example we say something about the notation O(h3 ).
(b) The sequences

$$ \frac{1}{n^2} = o(1/n) \quad \text{and} \quad \frac{1}{n \log(n)} = o(1/n) \quad \text{as } n \to \infty. $$
The big-oh and small-oh notations customarily appear only on the right side of an equation and serve the purpose of describing the essential feature of an error term without bothering about multiplying constants or other details. They are used when we want to describe limiting processes as a particular variable approaches its limiting state. We could hear a mathematician say, for example, that f(x) = o(g(x)) as x → ∞ or f(x) = O(g(x)) as x → ∞. In general, the direction of the limit will be known from the context of the particular instance; otherwise it is given the interpretation that we have given here. Exercise: Read Section 1.6 of Reference [3] and answer the problems at the end of that section.
Solution to Example 23:

4. The solution of (9.42) is therefore y_n = y_nʰ + y_nᵖ (the homogeneous plus the particular solution), which establishes the sought-after solution.
9.3.1 Exercises
1. Find the general solutions of the difference equations
(a) y_{n+1} − 3y_n = 5,
(b) y_{n+2} − 4y_{n+1} + 4y_n = n,
(c) y_{n+2} + 2y_{n+1} + 2y_n = 0.

2. Find the solution of the initial-value difference equations

(a) y_{n+2} − 4y_{n+1} + 3y_n = 2ⁿ, y0 = 0, y1 = 1,
(b) y_{n+2} − y_{n+1} − y_n = 0.
3. Show that the general solution of the difference equation y_{n+2} + 4h y_{n+1} − y_n = 2h, where h is a constant, can be expressed in the form

$$ y_n = C_1\left(1 - 2h + O(h^2)\right)^n + C_2\,(-1)^n\left(1 + 2h + O(h^2)\right)^n + \frac{1}{2}. $$
4. Show that if y0 = 0 and y1 = x, then the n-th term y_n = y_n(x) of the solution of the difference equation y_{n+2} − 2x y_{n+1} + y_n = 0 is a polynomial³ of degree n in x with leading coefficient 2^{n−1}.
9.3.2 Discretisation:
We consider the first order scalar equation

$$ y'(x) = \frac{dy}{dx} = f(x, y), \qquad y(x_0) = y_0, \qquad (9.43) $$

where the function f : R² → R is continuous and Lipschitz continuous in y. There are many ways of generating discrete approximations to the initial value problem (9.43). We discuss a few here.
1. On a uniform mesh of points, with interval h, such that xn = x0 + nh, n =
0, 1, 2, 3, · · · , we can approximate the derivative using finite differences as we have
done above. For example introducing yn as the approximation to y(xn ), we can
choose
$$ (y'(x_n))_{\text{approx}} = \frac{y_{n+1} - y_n}{h} \quad \text{or} \quad \frac{y_{n+1} - y_{n-1}}{2h} \quad \text{or} \quad \frac{3y_n - 4y_{n-1} + y_{n-2}}{2h}. $$

Then, setting (y'(x_n))_approx = f(x_n, y_n), we obtain a recurrence relation for (y_n)_{n≥1}.
2. On any mesh of points xi , i = 0, 1, 2, · · · , n with x0 < x1 < x2 < · · · < xn , we can
use the integral form
$$ \int_{x_n}^{x_{n+1}} y'(x)\,dx = \int_{x_n}^{x_{n+1}} f(x, y(x))\,dx \;\Rightarrow\; y_{n+1} - y_n = \int_{x_n}^{x_{n+1}} f(x, y(x))\,dx. $$
Then by approximating the integral on the right by some suitable quadrature rule,
several schemes can be generated. For example, if xn+1 = xn + h,
³ Note: The polynomial y_n of this difference equation is precisely the Chebyshev polynomial of degree n defined by T_n(cos(θ)) = cos(nθ), which has the property that T0(x) = 1, T1(x) = x.
(a) The trapezoidal rule gives $\int_{x_n}^{x_{n+1}} f(x, y(x))\,dx \approx \frac{1}{2}h\left(f(x_n, y_n) + f(x_{n+1}, y_{n+1})\right)$.

(b) Simpson's rule gives $\int_{x_n}^{x_{n+1}} f(x, y(x))\,dx \approx \frac{1}{6}h\left(f(x_n, y_n) + 4f(x_{n+\frac{1}{2}}, y_{n+\frac{1}{2}}) + f(x_{n+1}, y_{n+1})\right)$.

(c) The rectangle rule, with f fixed at the left end point, gives $\int_{x_n}^{x_{n+1}} f(x, y(x))\,dx \approx h f(x_n, y_n)$; in fact any point in the interval [x_n, x_n + h] can be used to evaluate the integrand.
where ε > 0 may be seen as the error incurred in determining the initial data at time t = 0 for both problems. Discuss the behaviour of the solutions.
9.4 One step Methods for First order scalar equation
In this section we consider the numerical approximation of the initial value problem
$$ y'(x) = \frac{dy}{dx} = f(x, y(x)), \qquad y(x_0) = y_0, \qquad x \in [x_0, x_M] \qquad (9.47) $$

for some sufficiently large real number x_M. The interval [x0, x_M] is partitioned into points x_i ∈ P, i = 1, 2, ⋯, where
1. A general explicit one step method for solving the initial value problem (9.47) is an
iteration of the form
2. A general implicit one step method for solving the initial value problem (9.47) is an
iteration of the form
which gives a relationship between the estimates at x_n and those at x_{n+1}. Implicit methods do not give an explicit formula for how to proceed from the estimate y_n to the estimate y_{n+1}. However, for some special forms of the function f an inversion is possible that allows us to rewrite an implicit scheme in explicit form.
2. Taylor Series method (of Order k): From the differential equation, we not
only have y 0 = f (x, y), but if f (·, ·) is sufficiently smooth, we can, by repeated
differentiation, generate higher order derivatives. Consider f : R2 → R and let fx
and fy denote the first partial derivatives of f with respect to x and y respectively,
and fxx , fxy , fyx , fyy denote the second order partial derivatives, etc, then we can
calculate the total derivative of f when x and y are viewed as independent variables,
in the usual manner to have
$$ df = f_x\,dx + f_y\,dy \;\Rightarrow\; \frac{df}{dx} = \frac{d^2y}{dx^2} = f_x + f_y\,\frac{dy}{dx} = f_x + f_y f. \qquad (9.52) $$

Similarly,

$$ \frac{d^2f}{dx^2} = \frac{d^3y}{dx^3} = f_{xx} + 2f_{xy} f + f_{yy} f^2 + f_x f_y + f_y^2 f. \qquad (9.53) $$
It is clear that we can continue with the differentiation to any desired order. Now
suppose we define
$$ T^{(k)}(x, y) = \sum_{i=1}^{k} \frac{h^{i-1}}{i!}\,\frac{d^{i-1}f}{dx^{i-1}}, \qquad (9.54) $$

where d^i f/dx^i is the i-th total derivative of f. Also, by Taylor expansion,

$$ y(x_n + h) \approx \sum_{i=0}^{k} \frac{h^i}{i!}\,\frac{d^iy}{dx^i} = y(x_n) + h\left[f + \frac{h}{2}\,\frac{df}{dx} + \cdots + \frac{h^{k-1}}{k!}\,\frac{d^{k-1}f}{dx^{k-1}}\right] = y(x_n) + h\,T^{(k)}(x_n, y(x_n)), \qquad (9.55) $$

from which we derive the Taylor's method scheme

$$ y_{n+1} = y_n + h\,T^{(k)}(x_n, y_n) = y_n + h\,\varphi(x_n, y_n). \qquad (9.56) $$

Clearly φ(x_n, y_n; h) = T^{(k)}(x_n, y_n) in Taylor's method of order k. Note: Euler's method is Taylor's method of order one! (A code sketch of the order two case follows this list.)
3. Trapezoidal rule: Recall from item 2(a) above that from the differential equation we have

$$ y(x_{n+1}) - y(x_n) = \int_{x_n}^{x_{n+1}} f(x, y(x))\,dx \approx \frac{h}{2}\left(f(x_n, y(x_n)) + f(x_{n+1}, y(x_{n+1}))\right). $$

This leads to the method

$$ y_{n+1} = y_n + \frac{h}{2}\left(f(x_n, y_n) + f(x_{n+1}, y_{n+1})\right). \qquad (9.57) $$

Notice that this method is implicit, as y_{n+1}, the next estimate to be derived, appears on both sides of the equation.
4. Other schemes are possible from numerical quadrature, as explained above.

The point about one step methods is that the value of the estimate at x_{n+1} depends only on the estimate at x_n; hence the name one step.
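A general explicit one step method is completely specified by its increment function φ; the following minimal Python sketch (an illustration, not from the notes: the driver, the test problem y' = x + y with y(0) = 1, and the hand-coded partial derivatives f_x = f_y = 1 are all assumptions) implements the loop y_{n+1} = y_n + hφ(x_n, y_n; h) with Euler's increment and the order two Taylor increment:

    import math

    def one_step_solve(phi, x0, y0, h, n):
        # Generic explicit one step method: y_{n+1} = y_n + h*phi(x_n, y_n, h).
        x, y = x0, y0
        for _ in range(n):
            y += h * phi(x, y, h)
            x += h
        return y

    def f(x, y):
        return x + y                        # assumed test problem

    euler_phi = lambda x, y, h: f(x, y)
    # For f = x + y: df/dx = f_x + f_y*f = 1 + f, so T^(2) = f + (h/2)(1 + f).
    taylor2_phi = lambda x, y, h: f(x, y) + 0.5 * h * (1 + f(x, y))

    exact = 2 * math.e - 2                  # y(1) for y' = x + y, y(0) = 1
    for phi in (euler_phi, taylor2_phi):
        print(abs(one_step_solve(phi, 0.0, 1.0, 0.1, 10) - exact))
    # The Taylor order two increment gives a markedly smaller error at x = 1.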
9.4.2 Analysis of Explicit One Step Methods
What should be the choice of φ in the general explicit (implicit) one step method? How can we be certain that our one step method actually represents the differential equation that we are trying to solve? To answer these questions, and others, we must analyse the error generated by the one step approximation and see how it depends on φ and on its relation to the ordinary differential equation under consideration. There are several sources of error in the solution process.
Definition 51 (Local Error) The local error or local discretization error is that error
committed in a single step from xn to xn+1 , assuming that y is known exactly for x ≤ xn .
Thus we have
Local error = y(xn+1 ) − [y(xn ) + hϕ(xn , y(xn ); h)] (9.58)
defined in terms of the solution where h = xn+1 − xn .
The second source of error is the truncation error
Definition 52 (The local Truncation Error) The truncation error is defined by considering the one step method y_{n+1} = y_n + hφ_n as an approximation to the differential equation: all terms are placed on one side of the equation and scaled so that, as h → 0, the resulting expression tends to the equation y' = f(x, y); the remainder when the exact solution is substituted into the resulting expression is then the truncation error, which we shall denote by T_n. Thus

$$ T_n = \frac{y(x_{n+1}) - y(x_n)}{h} - \varphi(x_n, y(x_n); h). \qquad (9.59) $$
On comparing Definitions 52 and 51, we immediately see the following relationship:

Local discretization error = h × (Local truncation error). (9.60)

The local error and the local truncation error, when they accumulate, can lead to more error in the solution. There could be other sources of error, such as local machine calculation errors, that have to be accounted for. So we may wish to look at the global error in the computation, which may be calculated under the assumption that calculations are done in exact arithmetic.
Definition 53 (Global error) The global error en at xn is defined simply as
en = y(xn ) − yn (9.61)
The following theorem establishes the relationship between the error at xn , the error in
the initial data and the truncation error.
Theorem 37 (Error estimates) Let the differential equation (9.47) be approximated
at xn+1 by the one step method yn+1 = yn + hϕ(xn , yn ; h). Assume ϕ is Lipschitz con-
tinuous with Lipschitz constant Lϕ . If en = y(xn ) − yn is the global error at xn and
T = max0≤k≤n Tk , then
$$ |e_n| \leq e^{L_\varphi (x_n - x_0)}\, |e_0| + \frac{e^{L_\varphi (x_n - x_0)} - 1}{L_\varphi}\, T. \qquad (9.62) $$

Proof: From Definition 52,
Thus subtracting (9.64) from (9.63), using the fact that en = y(xn ) − yn , we have
Suppose |Tn | ≤ T for all n ≥ 0 (a uniform bound for the truncation error) and xn =
x0 + nh, n = 0, 1, 2, 3 · · · (equally spaced points) then we have
as required.
Example 25 Apply Theorem 37 to Euler’s method, and estimate the global error in Eu-
ler’s method.
$$ T_n = \frac{y(x_n + h) - y(x_n)}{h} - f(x_n, y(x_n)) = \frac{h}{2}\, y''(\eta), \quad x_n < \eta < x_{n+1}, $$

by the mean value form of Taylor's theorem. So if |y''(x)| ≤ M, ∀x ∈ [x0, x_n], then |T_n| ≤ ½hM. Hence

$$ |e_n| \leq e^{L(x_n - x_0)}\, |e_0| + \frac{1}{2} M h\, \frac{e^{L(x_n - x_0)} - 1}{L}. \qquad (9.65) $$
Exercise. Show that Euler’s method for the y 0 = sin(y), y(0) = y0 converges.
Remark 11 We would like to make the truncation error T as small as possible. From
Theorem 37, we see that we can establish convergence only if T → 0 as h → 0. In
particular, if we have exact initial conditions, the error tends to zero as the step size
tends to zero.
In any case, we must be certain that where there is convergence, the computed solution
is indeed the solution to the differential equation whose solution we seek to approximate.
This leads us to the idea of consistency.
Definition 54 (Consistent scheme) The one step method y_{n+1} = y_n + hφ(x_n, y_n; h) is said to be consistent with the differential equation (9.47) if the truncation error is such that

$$ \forall \varepsilon > 0,\; \exists\, h_0(\varepsilon) > 0 \text{ for which } |T_n| < \varepsilon \text{ for } 0 < h < h_0(\varepsilon). \qquad (9.66) $$
The term rn is referred to as the round-off error, and is hardware (and possibly software)
dependent.
Example 26 What happens to the accuracy with which

$$ \frac{y_k(h) - y_{k-1}(h)}{h} $$

is calculated as h → 0?
where h and n are selected so that xn = x∗ , and where A is a constant parameter that:
1. Does not depend on the step size
Example 27 Use Euler's method with step sizes h = 0.05, 0.1, 0.2 to solve the initial value problem y'(x) = 3 − x + y, y(0) = 1 over the interval [0, 1.6]. Find the exact solution and show that y_n(h) = φ(x_n) + Ah, where A is approximately independent of h but depends on x_n.

The idea is then to extrapolate the error to zero, by finding A and then subtracting off the error.
Outline of Method
Exercise: Read Section 6.4 of Reference [1] and follow the examples given there.
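A minimal sketch of the extrapolation idea for Example 27 (an illustration, not from the notes; the combination 2y(h/2) − y(h), which cancels the Ah term, is the standard first order Richardson step, and the exact solution y = 3eˣ + x − 2 is quoted only for checking):

    import math

    def euler(f, x0, y0, h, x_end):
        x, y = x0, y0
        while x < x_end - 1e-12:
            y += h * f(x, y)
            x += h
        return y

    f = lambda x, y: 3 - x + y
    exact = 3 * math.exp(1.6) + 1.6 - 2     # exact value of y(1.6)

    y_h = euler(f, 0.0, 1.0, 0.10, 1.6)
    y_h2 = euler(f, 0.0, 1.0, 0.05, 1.6)
    extrap = 2 * y_h2 - y_h                 # eliminates the Ah error term
    print(abs(y_h - exact), abs(extrap - exact))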
Consistency condition for a general one step method
In a general one step method, we assume that φ is a continuous function of its arguments. Also, y' is continuous. Now,

$$ \lim_{h \to 0} T_n = \lim_{h \to 0} \left[\frac{y(x_n + h) - y(x_n)}{h} - \varphi(x_n, y(x_n); h)\right] = y'(x_n) - \varphi(x_n, y(x_n); 0). \qquad (9.68) $$

Comparing (9.68) with the differential equation (9.47), we deduce the following consistency condition:
Proof: Follows immediately from Theorem 37, on noting that |y_n − y(x_n)| = |e_n|. Hence, if y(x0) = y0 (exact initial condition), then

$$ |y_n - y(x_n)| \leq \frac{e^{L_\varphi (x_n - x_0)} - 1}{L_\varphi}\, \max_{0 \leq k \leq n} |T_k|. $$
From the continuity of ϕ with respect to its arguments and the consistency condition, we
have
$$ T_n = \frac{y(x_{n+1}) - y(x_n)}{h} - \varphi(x_n, y(x_n); h) = \underbrace{\frac{y(x_{n+1}) - y(x_n)}{h} - f(x_n, y(x_n))}_{y' \text{ at two points}} + \underbrace{\varphi(x_n, y(x_n); 0) - \varphi(x_n, y(x_n); h)}_{\text{difference in } \varphi} $$
We will be interested in knowing how accurate the solution is. This is made precise through the definition of the order of accuracy.

Definition 55 (Order of a scheme) The numerical scheme (9.49) is said to have order of accuracy p if p is the largest positive integer such that, for a sufficiently smooth solution y, there exist a constant κ and a step size h0 for which |T_n| ≤ κhᵖ for 0 < h < h0.
9.4.5 Exercises
1. For the differential equation y' = −xy + 1/y², y(1) = 1, derive the difference equation corresponding to Taylor's method of order 3. Carry out by hand one step of the integration with h = 0.01. Write a computer programme to solve this problem and carry out the integration from x = 1 to x = 3, using step sizes h = 1/64 and h = 1/128.

2. For the equation y' = 2y, y(0) = 1, obtain the exact solution of the difference equation from Euler's method. Estimate a value of h small enough to guarantee four decimal places of accuracy in the solution over the interval [0, 1]. Carry out the solution with the appropriate value of h for 10 steps.
3. Find the Taylor series expansion of the function y that satisfies the differential
equation y 0 = xy + 1, y(0) = 1
4. For the equation y 0 = 1 − y/x, y(x0) = y0 , find the general formula for the Taylor
series method of Order k ≥ 1.
5. It is required to solve the problem (9.47). If the trapezoidal rule is used to approximate the integral, show that the truncation error is −(1/12)h² y⁽³⁾(ξ), where h = x_{n+1} − x_n and ξ ∈ (x_n, x_{n+1}) is a mean value point, provided y⁽³⁾ is continuous.

If e_n = y(x_n) − y_n, show that

$$ |e_{n+1}| \leq |e_n| + \frac{1}{2}\, hL\left(|e_{n+1}| + |e_n|\right) + \frac{1}{12}\, h^3 M, $$

where y is the solution to problem (9.47), L is the Lipschitz constant of f with respect to y, and |y⁽³⁾| ≤ M. Taking y0 = y(x0) and a uniform step size h, deduce that

$$ |e_n| \leq \frac{h^2 M}{12 L}\left[\left(\frac{1 + \frac{1}{2} hL}{1 - \frac{1}{2} hL}\right)^n - 1\right], \quad \text{provided } hL < 2. $$

6. In the previous question, show that if we define g(x) = f_y(x, y(x)), then asymptotically e_n ≈ h² e(x_n), where e(x) satisfies the equation

$$ e'(x) = g(x)\, e(x) - \frac{1}{12}\, y^{(3)}(x), \qquad e(0) = 0. $$

Refine your arguments using the results of the previous question to show that e_n − h² e(x_n) = O(h³). Find e(x) for problem (9.47) when f(x, y) = y, x0 = 0, y0 = 1.
(a) Show that to generate estimates for the problem (9.47), the initial value prob-
lem (9.71) can be used as a test problem for a numerical method.
(b) Verify that if a k-th order Taylor's series method is applied to the test problem (9.71), it generates the estimates

$$ y_{n+1} = \rho(\bar h)\, y_n, \quad n = 0, 1, 2, 3, \cdots, \qquad \rho(\bar h) = \sum_{j=0}^{k} \frac{\bar h^j}{j!}. $$
We can easily verify that these schemes are second order schemes. It is remarkable that we can get higher order accuracy simply by evaluating the function at more intermediate values between (x_n, y_n) and (x_{n+1}, y_{n+1}). The method that really popularized Runge–Kutta methods is the following fourth order scheme:

$$ y_{n+1} = y_n + \frac{h}{6}\left(K_1 + 2K_2 + 2K_3 + K_4\right) \qquad (9.75) $$

where

$$ K_1 = f(x_n, y_n), \quad K_2 = f\left(x_n + \tfrac{1}{2}h,\; y_n + \tfrac{1}{2}hK_1\right), \quad K_3 = f\left(x_n + \tfrac{1}{2}h,\; y_n + \tfrac{1}{2}hK_2\right), \quad K_4 = f(x_n + h,\; y_n + hK_3). $$
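A direct transcription of (9.75) into Python (a minimal sketch; the test problem y' = λy with λ = −1 is chosen to match Example 28 below):

    import math

    def rk4_step(f, x, y, h):
        # One step of the classical fourth order Runge-Kutta scheme (9.75).
        k1 = f(x, y)
        k2 = f(x + h / 2, y + h * k1 / 2)
        k3 = f(x + h / 2, y + h * k2 / 2)
        k4 = f(x + h, y + h * k3)
        return y + h * (k1 + 2 * k2 + 2 * k3 + k4) / 6

    lam, h = -1.0, 0.1
    x, y = 0.0, 1.0
    for _ in range(10):
        y = rk4_step(lambda s, v: lam * v, x, y, h)
        x += h
    print(y, math.exp(lam * x))             # RK4 estimate vs exact e^{λx}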
Example 28 By applying the scheme (9.75) to the problem y' = λy, λ a constant, show that the order of accuracy is indeed four.

Solution:

$$ K_1 = \lambda y_n $$
$$ K_2 = \lambda\left(y_n + \tfrac{1}{2}h\lambda y_n\right) = \lambda\left(1 + \tfrac{1}{2}\lambda h\right) y_n $$
$$ K_3 = \lambda\left(y_n + \tfrac{1}{2}\lambda h\left(1 + \tfrac{1}{2}\lambda h\right) y_n\right) = \lambda\left(1 + \tfrac{1}{2}\lambda h + \tfrac{1}{4}(\lambda h)^2\right) y_n $$
$$ K_4 = \lambda\left(y_n + \lambda h\left(1 + \tfrac{1}{2}\lambda h + \tfrac{1}{4}(\lambda h)^2\right) y_n\right) = \lambda\left(1 + \lambda h + \tfrac{1}{2}(\lambda h)^2 + \tfrac{1}{4}(\lambda h)^3\right) y_n. $$
Hence,

$$ y_{n+1} = \left[1 + \lambda h + \tfrac{1}{2}(\lambda h)^2 + \tfrac{1}{6}(\lambda h)^3 + \tfrac{1}{24}(\lambda h)^4\right] y_n, $$

which corresponds to the first five terms in the Taylor expansion of y(x_{n+1}) = e^{λh} y(x_n), so the local error is O(h⁵) and the scheme is indeed fourth order accurate.
9.5.1 Exercises
1. For the problem (9.47) with f(x, y) = x + y, x0 = 0, y0 = 1, calculate the local truncation error for the method
Compare this error with the error of Taylor's algorithm of order 2. Which would you expect to give a better result over the interval [0, 1]?

2. It employs more evaluations of f than would seem necessary, and yet it is not clear how these values of f can be used to further advantage.
Now consider the following direct application of Simpson’s rule at xn−1 , xn = xn−1 + h
and xn+1 = xn−1 + 2h.
$$ \int_{x_{n-1}}^{x_{n+1}} y'(x)\,dx = \int_{x_{n-1}}^{x_{n+1}} f(s, y(s))\,ds. \qquad (9.76) $$

Then

$$ y_{n+1} - y_{n-1} = \frac{1}{3}\, h\left[f(x_{n-1}, y_{n-1}) + 4 f(x_n, y_n) + f(x_{n+1}, y_{n+1})\right]. \qquad (9.77) $$
Observe that the value of y_{n+1} appears on both sides, and that the formula involves values at the three points x_{n−1}, x_n and x_{n+1}.
If the mid point rule is used to approximate the integral on the right hand side of
(9.76), we have:
In this case we have a two step method, but this time it is almost explicit. We still need starting values y0 and y₋₁ to start off the method.

Still, other schemes can be derived. Let us define
We note that the formulae based on difference approximations to the derivative and those based on difference approximations to the integral are very different in form. They could also be combined in a variety of ways. The question is: how do we choose between methods?

A complete understanding of the solutions of difference equations is needed in order to pursue the study of linear multi-step methods.
containing the arbitrary constants c1, c2 and c3, which can be determined if three initial values for y are specified. The original ordinary differential equation specifies only one initial value, y(0) = 1. It is then clear that, if the three initial values are in error with magnitude |ε|, then, since y_n = c1 + c2(0.186)ⁿ + c3(2.686)ⁿ, an error in the initial data that makes c3 ≠ 0 produces a component of the solution that grows unboundedly. This is an example of a numerical instability. Since the scheme is obtained with a zero right hand side, or equivalently as h → 0, a scheme which avoids the scenario explained above is said to be zero stable.
Now suppose we are considering schemes which use the values y_n, y_{n+1}, y_{n+2}, ..., y_{n+k−1} (and also f(x_n, y_n), f(x_{n+1}, y_{n+1}), ..., f(x_{n+k−1}, y_{n+k−1})) to obtain the next value y_{n+k}. We call these k-step methods.
Definition 56 (Zero stability) A k-step method (for approximating an ordinary differential equation of the form (9.47)) is zero stable if there exists a constant K such that, whenever (y_n)_{n≥0} and (z_n)_{n≥0} are two sequences generated by the same formulae with the same grid points (x_n)_{n≥0} but with different initial data, we have

$$ |y_n - z_n| \leq K \max\left\{|y_0 - z_0|, |y_1 - z_1|, \ldots, |y_{k-1} - z_{k-1}|\right\} $$

for x_n ≤ X_M, as h := max(x_{j+1} − x_j) → 0.
Definition 57 A general linear k-step method (for approximating the ordinary differen-
tial equation of the form (9.47)) may be written in the form
$$ \sum_{j=0}^{k} \alpha_j\, y_{n+j} = h \sum_{j=0}^{k} \beta_j\, f(x_{n+j}, y_{n+j}), \qquad (9.83) $$

where α_j, j = 0, 1, 2, ..., k and β_j, j = 0, 1, 2, ..., k are real constants. The method is called linear because it involves only linear combinations of the y_{n+j} and the f(x_{n+j}, y_{n+j}). (We shall simply write f_n to denote f(x_n, y_n).)
We note that a linear multi-step scheme as given by (9.83) is explicit if β_k = 0, and implicit otherwise. Such schemes are described in terms of their first and second characteristic polynomials, ρ(ξ) and σ(ξ) respectively, where

$$ \rho(\xi) = \sum_{j=0}^{k} \alpha_j\, \xi^j, \qquad \sigma(\xi) = \sum_{j=0}^{k} \beta_j\, \xi^j. \qquad (9.84) $$
so that we can solve for y_{n+k} when h is sufficiently small and the scheme does not degenerate into a (k − 1)-step method. We also need

$$ \rho(1) = \sum_{j=0}^{k} \alpha_j = 0 \qquad (9.86) $$

so that the left hand side of (9.83) gives a difference approximation to y'(x_n). This is part of the consistency condition.
Theorem 39 (The root condition) An explicit linear multi-step method is zero stable for any ordinary differential equation satisfying a Lipschitz condition if and only if all the zeros of its first characteristic polynomial lie in the closed unit disc, with any which lie on the unit circle being simple.
Proof: To prove necessity, consider the scheme applied to y' = 0. Then

$$ y_{n+k} = -\frac{1}{\alpha_k} \sum_{j=0}^{k-1} \alpha_j\, y_{n+j}. $$
where ξ_s is a zero of the polynomial ρ(ξ) and the corresponding polynomial p_s(·) has degree one less than the multiplicity of this zero. If |ξ_s| > 1, then there is initial data for which the solution will grow like |ξ_s|ⁿ, and if |ξ_s| = 1 and its multiplicity is m_s > 1, there is initial data for which the solution will grow like n^{m_s−1}|ξ_s|ⁿ = n^{m_s−1}. In either case this implies unbounded growth with n. Since the problem is linear, any difference of a pair of solutions is also a solution, and Definition 56 cannot be satisfied.
Remark 12 The root condition is so crucial that it is used by some authors as the actual
definition of zero stability.
Lemma 9 If the root condition is satisfied, the recurrence relation

$$ y_{n+k} = -\frac{1}{\alpha_k} \sum_{j=0}^{k-1} \alpha_j\, y_{n+j} \quad \text{satisfies} \quad |y_n| \leq K \max\left\{|y_0|, |y_1|, \cdots, |y_{k-1}|\right\}, \;\; \forall n \geq 0, $$

for some constant K.
Proof: Suppose α0 ≠ 0 and ρ(ξ) has k non-zero zeros, counting multiplicity. Then the general solution y_n has k arbitrary constants c1, c2, ⋯, c_k, determinable by equating the initial data with the expansion to obtain a linear system of equations of the form

$$ \begin{pmatrix} y_0 \\ y_1 \\ \vdots \\ y_{k-1} \end{pmatrix} = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ \xi_1 & \xi_2 & \cdots & \xi_k \\ \vdots & \vdots & & \vdots \\ \xi_1^{k-1} & \xi_2^{k-1} & \cdots & \xi_k^{k-1} \end{pmatrix} \begin{pmatrix} B_1 \\ B_2 \\ \vdots \\ B_k \end{pmatrix}, \quad \text{i.e. } y^0 = ZB. \qquad (9.88) $$
The matrix Z will be Vandermonde when the zeros of ρ(ξ) are distinct and non-zero, so there exists K′ such that

$$ \| B \|_\infty \leq K' \| y^0 \|_\infty. $$

Now, the solution (9.87) consists of a linear combination of the B_i's multiplied by linearly independent solutions of the form n^q ξ_sⁿ, and each of these is bounded for all n. (Because of the root condition: |ξ_s| ≤ 1 if ξ_s is a simple zero of ρ(ξ), i.e. q = 0; |ξ_s| < 1 if ξ_s is a multiple zero of ρ(ξ), i.e. q ≠ 0.) To see this, suppose |ξ_s| ≤ α < 1 with ξ_s a zero for which q ≤ k − 1. We easily verify that n^{k−1}αⁿ has an absolute maximum at n = (k − 1)/(−ln α). Call this maximum K″. Hence we have

$$ |y_n| \leq K'' \| B \|_\infty \leq K' K'' \| y^0 \|_\infty. $$

Hence the result.
Hence the result.
Though the results have been stated for explicit methods, the extension to implicit methods is direct.
Example 29 (i) All one step methods, such as Euler's method and the trapezoidal rule, are zero stable, because ρ(ξ) = ξ − 1, whose only zero, ξ = 1, is simple.

(ii) All the Adams–Moulton and Adams–Bashforth methods [3] are zero stable, because ρ(ξ) = ξᵖ(ξ − 1) for some value of p: its zeros are 0 (repeated) and the simple zero 1.
(a) The two step second order accurate mid point rule (9.78) and
(b) the following three-step sixth order accurate method:
$$ y(x_n + jh) = y(x_n) + (jh)\, y'(x_n) + \frac{(jh)^2}{2!}\, y''(x_n) + \cdots $$

$$ y'(x_n + jh) = y'(x_n) + (jh)\, y''(x_n) + \frac{(jh)^2}{2!}\, y'''(x_n) + \cdots $$

where

$$ C_0 = \sum_{j=0}^{k} \alpha_j, \qquad C_q = \sum_{j=0}^{k} \frac{j^q}{q!}\, \alpha_j - \sum_{j=0}^{k} \frac{j^{q-1}}{(q-1)!}\, \beta_j, \quad q = 1, 2, \cdots, \qquad (9.92) $$
For consistency, we require T_n → 0 as h → 0; that is, we require C0 = 0 and C1 = 0. In terms of the characteristic polynomials, this consistency requirement can be stated as:

Applying the Lipschitz condition on f, we have

$$ |e_{n+2}| \leq |e_n| + \frac{1}{3}\, Lh\left[|e_n| + 4|e_{n+1}| + |e_{n+2}|\right] + h|T_n|, $$

where L is the Lipschitz constant for f. Hence

$$ \left(1 - \frac{1}{3}\, Lh\right)|e_{n+2}| \leq \left(1 + \frac{1}{3}\, Lh\right)|e_n| + \frac{4}{3}\, Lh\, |e_{n+1}| + h|T_n|. $$
We can continue this argument provided h < h0 = 3L⁻¹. In that case, take
Then we have,

$$ |E_n| \leq \left(\frac{1 + \frac{5}{3}\, Lh}{1 - \frac{1}{3}\, Lh}\right)^n |E_0| + \frac{T}{L}\left[\left(\frac{1 + \frac{5}{3}\, Lh}{1 - \frac{1}{3}\, Lh}\right)^n - 1\right], \qquad h < h_0 = \frac{3}{L}. $$
We easily verify that the first root approximates the exact solution while the second root
is parasitic. In fact:
Similarly,

$$ \beta_2(\bar h) \sim -\frac{1 + \frac{1}{3}\bar h}{1 - \frac{1}{3}\bar h}\, \exp(-\bar h) \sim -\left(1 - \frac{1}{3}\bar h\right) \quad \text{as } \bar h \to 0. $$

Notice therefore that if y_n = a1 β1ⁿ + a2 β2ⁿ for any constants a1, a2, then the solution component corresponding to the parasitic root oscillates in sign from step to step. When λ is real and negative, so that the true solution decays, the parasitic solution grows. This is a severe disadvantage of this method for such problems.
Definition 59 (Absolute Stability) For a given value of h̄, a linear multi-step method is said to be absolutely stable if all the zeros of the stability polynomial ρ(ξ) − h̄σ(ξ) lie strictly inside the unit disc. An open interval (α, β) is an interval of absolute stability for the method if the method is absolutely stable for all h̄ ∈ (α, β).
Lemma 10 No consistent zero stable linear multi-step method is absolutely stable for
small positive values of h̄. In fact, if such a method has order of accuracy p, we have
Proof: From (9.94) with y(x) = exp(λx), we see that hT_n = O(h̄^{p+1}), and substituting the exact solution into (9.90) with x_n = 0 gives

Since the method is zero stable, β1(0) = 1 is a simple zero of ρ(ξ). Thus, for sufficiently small h̄, the left hand side of (9.96) gives

$$ \beta_1(\bar h) - \exp(\bar h) = h\,\sigma(1)\, T_n \left[\prod_{j=2}^{k}\left(\beta_j(\bar h) - \exp(\bar h)\right)\right]^{-1}. $$

Here the right hand side is bounded, because β1(h̄) ≈ exp(h̄) is a simple zero for small h̄. Therefore, for small h̄, (9.95) is satisfied.
That is, this method has no interval of absolute stability, and it is thus a very poor method to use on problems with decaying solutions.

Euler's Method: For Euler's method we have y_{n+1} − y_n = h̄ y_n, so that β(h̄) = 1 + h̄ and |β(h̄)| < 1 ⇔ h̄ ∈ (−2, 0). The interval of absolute stability here is all of (−2, 0).
The left end points α of the intervals of absolute stability (α, 0) for the k-step Adams–Bashforth (A–B) and Adams–Moulton (A–M) methods are:

k      α(A–B)      α(A–M)
1      −2          −∞
2      −1          −6
3      −6/11       −3
4      −3/10       −90/49
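These end points can be verified numerically: a method is absolutely stable at h̄ exactly when all zeros of ρ(ξ) − h̄σ(ξ) lie strictly inside the unit disc. A small sketch (an illustration, assuming numpy, for the two-step Adams–Bashforth method, whose ρ(ξ) = ξ² − ξ and σ(ξ) = (3ξ − 1)/2):

    import numpy as np

    def ab2_stable(hbar):
        # Roots of ρ(ξ) - h̄σ(ξ) = ξ² - (1 + 3h̄/2)ξ + h̄/2 for 2-step A-B.
        roots = np.roots([1.0, -1.0 - 1.5 * hbar, 0.5 * hbar])
        return bool(np.all(np.abs(roots) < 1.0))

    hs = np.linspace(-1.5, -0.001, 3000)
    stable = [hbar for hbar in hs if ab2_stable(hbar)]
    print(min(stable))                      # ≈ -1, matching the table entry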
and this will converge if h|β_k|L < |α_k|, where L is the Lipschitz constant for f(·, ·). If such an iteration is necessary, it may be expensive to recompute f(·, ·) when many iterations of (9.97) have to be carried out. An accurate explicit method can be used to start off the iteration scheme. The explicit method is called the predictor and the implicit method is called the corrector. For example, we can have the following predictor-corrector pair:
Predictor: $$ y_{n+4} = y_{n+3} + \frac{h}{24}\left(55 f_{n+3} - 59 f_{n+2} + 37 f_{n+1} - 9 f_n\right) $$

Corrector: $$ y_{n+4} = y_{n+3} + \frac{h}{24}\left(9 f_{n+4} + 19 f_{n+3} - 5 f_{n+2} + f_{n+1}\right) $$
Application of the predictor-corrector method:

where f̃_{n+4} = f(x_{n+4}, ỹ_{n+4}), and in this case the local truncation error is given by y_{n+4} − ỹ_{n+4} ≈ h⁵ y⁽⁵⁾(x_n).
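A minimal sketch of this predictor-corrector pair in code (an illustration, not from the notes: the RK4 start-up, the single correction sweep, and the test problem y' = −y are all assumptions):

    def rk4_step(f, x, y, h):
        # Classical RK4, repeated from the sketch after (9.75), for start-up.
        k1 = f(x, y); k2 = f(x + h/2, y + h*k1/2)
        k3 = f(x + h/2, y + h*k2/2); k4 = f(x + h, y + h*k3)
        return y + h * (k1 + 2*k2 + 2*k3 + k4) / 6

    def abm4_solve(f, x0, y0, h, n_steps):
        # Adams-Bashforth predictor with one Adams-Moulton correction (PECE).
        xs, ys = [x0], [y0]
        for _ in range(3):                  # three RK4 steps to start up
            xs.append(xs[-1] + h)
            ys.append(rk4_step(f, xs[-2], ys[-1], h))
        fs = [f(x, y) for x, y in zip(xs, ys)]
        for _ in range(n_steps - 3):
            # Predict with the explicit formula:
            yp = ys[-1] + h/24 * (55*fs[-1] - 59*fs[-2] + 37*fs[-3] - 9*fs[-4])
            xn = xs[-1] + h
            # Correct once with the implicit formula, using f at the prediction:
            yc = ys[-1] + h/24 * (9*f(xn, yp) + 19*fs[-1] - 5*fs[-2] + fs[-3])
            xs.append(xn); ys.append(yc); fs.append(f(xn, yc))
        return xs, ys

    xs, ys = abm4_solve(lambda x, y: -y, 0.0, 1.0, 0.1, 20)
    print(ys[-1])                           # compare with e^{-2} ≈ 0.13534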
9.11.1 Exercises
Solve the problems of Section 7.6.3 of [3].
1. Asymptotes may exist in the solution domain where the solution becomes unbounded: for example, y'(x) = 2xy², y(0) = 1.

2. The solution of the original problem may be sensitive to initial conditions. For example: y'(x) = −ay(1 − y), y(0) = 1 + ε.

3. Nearly all numerical solutions to initial value problems degenerate at long times: we cannot integrate to t = ∞.

4. Stability of the methods for different choices of step size, and the effect of the algorithm on small errors in data. For example, y'(x) = ry, y(0) = y0 + δ0, where δ0 is the initial error.
More Methods, Numerical Software & Options
1. Other methods exist.
(a) Good points: quality assurance, a friendly interface, a simple language in which to pose a problem, in-built graphics.
(b) Bad points: not free; connecting the solution with applications.

(a) Good points: quality assurance, can be included in other programs, high accuracy and control.
(b) Bad points: not free; relatively complex to use (one needs to be able to program competently); often need
Bibliography
[2] G. H. Golub and C. Van Loan, Matrix Computations, Johns Hopkins University Press, 1986.

[3] Lee W. Johnson and R. Dean Riess, Numerical Analysis, Addison-Wesley, 1982.

[5] C. Hastings, Jr., Approximations for Digital Computers, Princeton University Press, 1955.