
Lecture Notes in Numerical Linear Algebra

Gideon A. Ngwa1

© Copyright May 2007. Last updated December 11, 2017

1 Associate Professor of Applied Mathematics, Faculty of Science, University of Buea, P. O. Box 63, Buea.
Contents

1 Syllabus and Course Outline for MAT637 4


1.1 Syllabus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.1 Objectives of the Course . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Course Outline for MAT637 . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.1 Outline of weekly schedule of lectures . . . . . . . . . . . . . . . . . 5
1.2.2 Outcome . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2 Introduction 7
2.1 Introduction to Numerical Analysis . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Errors in Calculations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.1 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 Properties of Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4 Some Fundamental Results from Calculus . . . . . . . . . . . . . . . . . . 17

3 Numerical Linear Algebra 18


3.1 Numerical solution of linear systems . . . . . . . . . . . . . . . . . . . . . . 18
3.1.1 Some objectives for the design of a Method . . . . . . . . . . . . . . 18
3.2 Linear simultaneous equations . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2.1 Gauss Elimination . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2.2 General form of Gauss Elimination . . . . . . . . . . . . . . . . . . 21
3.2.3 LU Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.4 Elementary lower triangular matrices . . . . . . . . . . . . . . . . . 23
3.2.5 Pivoting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2.6 Pivoting Strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.2.7 Operation Counts . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2.8 Alternative forms of decomposition: LDU and QR . . . . . . . . . . 34
3.2.9 Gram-Schmidt orthogonalization . . . . . . . . . . . . . . . . . . . 37
3.2.10 Solving Tridiagonal systems: The Thomas Algorithm . . . . . . . . . 39
3.2.11 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.3 Error Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.3.1 Vector and Matrix Norms . . . . . . . . . . . . . . . . . . . . . . . 43
3.3.2 Condition of system of equations Ax = b . . . . . . . . . . . . . . . 45
3.3.3 Iterative improvement . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.3.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

4 Iterative Methods for solving Ax = b 51
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.2 Some special iteration schemes . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.3 Methods of accelerating convergence of an iterative process . . . . . . . . . 55
4.4 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

5 The Algebraic Eigenvalue Problem 59


5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.2 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

6 Polynomial Interpolation 62
6.1 Polynomial Forms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
6.2 Lagrange Form of the Interpolating Polynomial . . . . . . . . . . . . . . . 65
6.2.1 Divided Differences . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
6.3 Newton’s Form of the Interpolating polynomial . . . . . . . . . . . . . . . 68
6.3.1 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

7 Numerical Differentiation and Numerical Quadrature 71


7.1 Numerical Differentiation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
7.2 Numerical integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
7.2.1 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

8 Numerical Solution of Nonlinear Equations 75


8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
8.2 Bisection Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
8.2.1 High points for the bisection method . . . . . . . . . . . . . . . . . 76
8.2.2 Low points for the bisection method . . . . . . . . . . . . . . . . . 76
8.3 Iterative methods (Successive approximation) . . . . . . . . . . . . . . . . 76
8.4 Newton’s Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
8.5 Other methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
8.5.1 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

9 Numerical Approximation of Solutions of Ordinary Differential Equa-


tions 84
9.1 Introduction and Analytic considerations . . . . . . . . . . . . . . . . . . 84
9.2 Finite differences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
9.2.1 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
9.2.2 Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
9.2.3 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
9.3 Difference Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
9.3.1 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
9.3.2 Discretisation: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
9.4 One step Methods for First order scalar equation . . . . . . . . . . . . . . 105
9.4.1 Examples of one step methods . . . . . . . . . . . . . . . . . . . . . 105
9.4.2 Analysis of Explicit One Step Methods . . . . . . . . . . . . . . . . 107
9.4.3 Round-off Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

9.4.4 Richardson Extrapolation . . . . . . . . . . . . . . . . . . . . . . . 109
9.4.5 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
9.5 Runge-Kutta Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
9.5.1 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
9.6 Linear Multi-step Methods for solving First Order Scalar Ordinary Differ-
ential Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
9.7 Zero stability for linear multi-step schemes . . . . . . . . . . . . . . . . . . 116
9.8 Accuracy of linear multi-step methods . . . . . . . . . . . . . . . . . . . . 119
9.9 Example of error analysis for an implicit method . . . . . . . . . . . . . . . 120
9.10 Absolute Stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
9.11 Predictor-Corrector Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 123
9.11.1 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
9.12 Some concluding comments . . . . . . . . . . . . . . . . . . . . . . . . . . 124
9.12.1 Some Limitations of Numerical Methods . . . . . . . . . . . . . . . 124

Chapter 1

Syllabus and Course Outline for


MAT637

1.1 Syllabus
1.1.1 Objectives of the Course
The main objective of this course is to introduce the basic concepts of numerical analysis and numerical linear algebra, and to present methods for solving linear and nonlinear equations.
These objectives will be attained through the following syllabus:

1. Introduction

(a) Well-posed problems,


(b) Sources of errors in numerical algorithms,
(c) Computer representation of numbers

2. Mathematical Preliminaries

(a) Vector spaces, matrices, matrix norms, special matrices, eigenvalues and eigen-
vectors
(b) Similarity transformations, nonnegative matrices, M-matrices

3. Direct methods for computing solutions of linear systems

(a) Triangular systems,


(b) Gauss elimination,
(c) LU-factorization,
(d) Cholesky-factorization,
(e) QR-factorization, block systems
(f) Error analysis

4. Iterative methods for approximating solutions of linear systems

(a) Convergence analysis of iterative methods,
(b) Jacobi, Gauss-Seidel and relaxation methods,
(c) Preconditioning,
(d) Gradient and conjugate gradient methods

5. Approximation of eigenvalues and eigenvectors

(a) Vector iteration,


(b) QR-iteration,
(c) Transformation methods

6. Numerical differentiation

(a) Finite differences


(b) Discretization errors
(c) Applications

7. Polynomial interpolation

(a) Lagrange form of the interpolating polynomial and error estimates


(b) Divided differences
(c) Newton’s form of the interpolating polynomial and error estimates
(d) Weierstrass approximation Theorem
(e) Numerical integration using interpolating polynomials

8. Determination of zeros of nonlinear systems

(a) Bisection method, secant method, regula falsi,


(b) Newton’s methods;
(c) Fixed point iterations,
(d) Horner scheme and Newton-Horner scheme

1.2 Course Outline for MAT637


The following course outline will help in the study and reading of MAT 637. This outline
is meant to help you study for the course and to enable you to prepare in advance.

1.2.1 Outline of weekly schedule of lectures


Lectures are scheduled for two slots¹, that is, twice a week, as shown in Table 1.1.
¹ See the official published university teaching time-table for the venue and time of slots I and II.

Week  | Slot  | Topics to be treated                                                                          | Activity
1     | I, II | Administration, generalities, preliminaries, introduction                                     | Lecture
2     | I, II | Gauss Elimination and LU Factorization                                                        | Lecture
2     | III   | Exercises on introductory material                                                            | Tutorial
3-5   | I, II | LU and QR Factorizations                                                                      | Lecture
3     | III   | Exercises on Gauss elimination                                                                | Tutorial
4     | III   | Exercises on LU Factorization                                                                 | Tutorial
5     | III   | Practical class                                                                               | Tutorial
6     | TBA   | First Continuous Assessment test (venue, which may differ from class slots, to be announced)  | CA Test
6     | I, II | Matrix norms, error analysis, condition numbers and iterative improvement                     | Lecture
6     | III   | Exercises on norms                                                                            | Tutorial
7     | I, II | Iterative methods for solving large systems                                                   | Lecture
7     | III   | Exercises on iterative methods                                                                | Tutorial
8     | I, II | Least squares, the algebraic eigenvalue problem                                               | Lecture
8     | III   | Exercises on least squares                                                                    | Tutorial
9     | I, II | Numerical differentiation and applications, polynomial interpolation                          | Lecture
9     | III   | Exercises on numerical differentiation + practical class                                      | Tutorial
10    | TBA   | Second Continuous Assessment test (venue, which may differ from class slots, to be announced) | CA Test
10    | I, II | Polynomial interpolation, solutions of non-linear equations                                   | Lecture
10    | III   | Exercises on polynomial interpolation                                                         | Tutorial
11-13 | I, II | Solutions of non-linear equations                                                             | Lecture
11    | III   | Exercises on solutions of nonlinear systems                                                   | Tutorial
13    | II    | Overview of course and examination syllabus                                                   | Discussion

Table 1.1: Course outline for MAT 637 Numerical Linear Algebra

1.2.2 Outcome
At the end of the study period, the student is expected to have the ability to use numerical
methods to analyse problems in linear algebra, polynomial interpolation, numerical integration
and differentiation, and to be able to differentiate between qualitative and numerical
methods of analysis. He or she will be able to design iterative methods for solving linear
and nonlinear equations.

Chapter 2

Introduction

2.1 Introduction to Numerical Analysis


Numerical Analysis is an interdisciplinary subject that involves engineering and physics in
converting a physical phenomenon into a mathematical model, mathematics in developing
techniques of solution (or approximate solution) of the mathematical equations, and
computer science for the implementation of the techniques in an optimal fashion on an
available high speed computer. The task of the Numerical Analyst is to synthesize
procedures (such as restrictions placed on the physical system during modelling) so as to
obtain approximate solutions. Essentially, the problem of the Numerical Analyst is to
construct "approximate" solutions that are "acceptable".
The problem of approximation can be posed in the following form:
Problem 1 Given a set A and an element u ∈ A, if we choose a subset B of A, how can
we find an element U ∈ B so that U approximates u ∈ A in some way?
What do we mean by "approximate in some way"? Intuitively, we could expect the
distance between U and u to be small, or at least as small as possible. So let d(·, ·) be a
distance function defined on the set A. Then we can restate our question as follows:
Problem 1 Given a set A and an element u ∈ A, if we choose a subset B of A, how can
we find an element U ∈ B so that d(u, U) ≤ d(u, V) ∀V ∈ B?
We can use a suitable norm ‖·‖ and define the distance d as d(u, v) = ‖u − v‖, ∀u, v ∈ A.
Then the question can be rephrased as follows:
Problem 1 Given a set A and an element u ∈ A, if we choose a subset B of A, how can
we find an element U ∈ B so that ‖u − U‖ ≤ ‖u − V‖ ∀V ∈ B?
If U exists, it is called the best approximation to u, with respect to the given norm.
Notes 1 1. U may not exist if B is not compact. Hence we shall restrict ourselves to
compact sets.
2. Regarding the element of best approximation, the basic questions are (i) existence, (ii)
uniqueness, (iii) characterization (how one can recognize U other than by comparing
it with all other elements of B); this is important for the construction of a method of
getting the best approximation.

3. Approximation theory is well developed and the reader is encouraged to read more
on the subject to appreciate its scope. See for example [5, 6]

There are many ways to define a suitable norm, but if A is a linear space, there is a class
of norms, known as L^p-norms, that are most useful. If, for example, A = C([a, b]; R),
the set of real valued continuous functions defined on the compact interval [a, b], then we
define

    ‖u‖_p = ( ∫_a^b ω(x) |u(x)|^p dx )^(1/p),    ∀u ∈ C([a, b]; R).                      (2.1)

In this definition, the function ω : [a, b] → R is a fixed weighting function that provides
flexibility in measuring the norm of the function. In all cases ω ∈ C([a, b]; R) and is non-
negative on (a, b), so that ∫_a^b ω(x) dx exists and is positive. Three particular values of p
are important: p = 1, p = 2 and p = ∞, so that if we assume ω(x) ≡ 1 on [a, b], we
have
    ‖u‖_1 = ∫_a^b |u(x)| dx,                                                              (2.2)
    ‖u‖_2 = ( ∫_a^b |u(x)|² dx )^(1/2),                                                   (2.3)
    ‖u‖_∞ = sup {|u(x)| : x ∈ [a, b]}.                                                    (2.4)

It is clear that these norms can be defined on different sets, in which case we must
take the specific nature of the set under consideration into account. For example, if A = Rⁿ, then
u = (u₁, u₂, ..., u_n) ∈ Rⁿ is an n-tuple and we have

    ‖u‖_1 = Σ_{i=1}^{n} |u_i|,                                                            (2.5)
    ‖u‖_2 = ( Σ_{i=1}^{n} |u_i|² )^(1/2),                                                 (2.6)
    ‖u‖_∞ = sup {|u_i| : i = 1, 2, 3, ..., n}.                                            (2.7)
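As a quick illustration, the three vector norms (2.5)-(2.7) can be evaluated directly, for example in Python with NumPy (a minimal sketch on an arbitrary example vector; np.linalg.norm returns the same values):

    import numpy as np

    u = np.array([3.0, -4.0, 1.0])            # an example vector in R^3

    norm1 = np.sum(np.abs(u))                 # ||u||_1  : sum of absolute values
    norm2 = np.sqrt(np.sum(np.abs(u) ** 2))   # ||u||_2  : Euclidean norm
    norminf = np.max(np.abs(u))               # ||u||_inf: largest absolute entry

    print(norm1, norm2, norminf)              # 8.0  5.0990...  4.0
    # Theorem 1 below can be checked here: norm1 <= sqrt(3)*norm2 <= 3*norminf.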

Theorem 1 There exist positive numbers α, β and γ such that

    α‖u‖_1 ≤ β‖u‖_2 ≤ γ‖u‖_∞.                                                             (2.8)

For example, if A = C([a, b]; R), then we can choose α = 1, β = √(b − a) and γ = b − a,
and if A = Rⁿ, then we can choose α = 1, β = √n and γ = n, etc.
Proofs: See MAT402 and MAT302 (exercise).
In the context of these norms, we can decide whether a given approximation U is good
for a given function u by calculating the norms given above for the difference u − U. We
must be quick to point out that two functions can be very "close" with respect to one
norm but not with respect to another. Consider the following:

8
Example 1 Let [a, b] = [0, 3]. Let U(x) ≡ 0 and ω(x) ≡ 1, ∀x ∈ [0, 3]. For any positive
integer k, let u_k : x ↦ u_k(x), x ∈ [0, 3], be given, for each k, by

    u_k(x) = {  k(k²x − 1),   for 1/k² ≤ x ≤ 2/k²,
             { −k(k²x − 3),   for 2/k² ≤ x ≤ 3/k²,        k = 1, 2, 3, ...
             {  0,            otherwise.
Using the above norms, formulas (2.2)-(2.4), we obtain

    ‖U − u_k‖_1 = 1/k,    ‖U − u_k‖_2 = 2/√3,    ‖U − u_k‖_∞ = k.

Thus in the sense of (2.2), the distance between U and u_k becomes small for large values
of k; for (2.3) the distance is a constant, the same for any k; and for (2.4), the distance
is large for large values of k, so that if we consider (‖U − u_k‖_p)_{k≥1} as a sequence of real
numbers, we see that

    lim_{k→∞} ( ‖U − u_k‖_1, ‖U − u_k‖_2, ‖U − u_k‖_∞ ) = (0, 2/√3, ∞).
In this context we therefore say that the functions (u_k)_{k≥1} converge to the function U
only in terms of the ‖·‖_1 distance measurement.
Definition 1 A sequence of functions (q_k)_{k≥1} is said to converge to a function g with
respect to a given norm ‖·‖ if and only if

    lim_{k→∞} ‖q_k − g‖ = 0.

The preceding example illustrates that the quality of an approximation depends completely
on how we choose to measure distances between functions. Where appropriate,
we shall define the norm that we use much more clearly. However, the last theorem and
discussion tell us that if we were interested in measuring distances, we should concentrate
effort on the L^∞-norm, since convergence in ‖·‖_∞ is equivalent to uniform convergence.
After each approximation, we are bound to have errors. We briefly outline the possible
sources of error in a computation.

2.2 Errors in Calculations


Initial data errors: These types of errors arise as a result of idealistic assumptions made
to simplify a model, inaccurate measurements of data, or inaccurate representation
of mathematical constants. For example, if the constant π, whose value correct
to 30 decimal places is
    π = 3.1415926535897932384626433832795 · · ·,
were to appear in a calculation, we would be forced to replace it by π = 3.14, or
π = 3.142, or π = 3.14159, etc. Each representation will introduce an error into
the calculation and will give rise to a different answer in the final computation. The
objective of the numerical analyst would be to develop schemes that minimize errors
from these types of sources.

Truncation Errors: These arise when we are forced to give an approximate, rather
than an exact, answer. For example, suppose that we use the Maclaurin expansion

    e^β = 1 + β + β²/2! + β³/3! + · · · + βⁿ/n! + · · ·

to represent e^β. We must terminate the expansion at some point and write

    e^β = 1 + β + β²/2! + β³/3! + · · · + β^k/k! + E_k,

where E_k is the error (called the truncation error) introduced by truncating the
series at the given point. What is the effect of this truncation error on the value of
the final calculation? When do we feel that we have a "good" approximation? (A
small numerical sketch follows this list.)
Round off errors: Also known as rounding errors. These arise because every computer
has a finite word length, so that most numbers and the results of arithmetic operations
cannot be expressed exactly on a computer. For example, a computer with an
8-character word length can represent the number √2 = 1.4142135623730950488016887242097 · · ·
only as √2 = 1.414214, where the last digit is rounded.

Initial data errors are problem dependent and can be dealt with. Truncation errors can
also be problem dependent, but rounding errors are machine dependent errors that must
be controlled in a computation. One primary concern of the Numerical Analyst is to
reduce the effects of errors from all sources.
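As a small illustrative sketch in Python, one can watch the truncation error E_k of the Maclaurin expansion of e^β shrink as more terms are kept (β = 1 is an arbitrary test value):

    import math

    beta = 1.0                                  # arbitrary test value
    exact = math.exp(beta)

    partial = 0.0
    for k in range(11):                         # keep terms up to beta^k / k!
        partial += beta ** k / math.factorial(k)
        E_k = exact - partial                   # truncation error after truncating at term k
        print(k, partial, E_k)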
There are two ways of measuring the error in a calculation. Suppose x̂ represents an
approximation to x. We can measure

    (i) the absolute error = ‖x − x̂‖,    (ii) the relative error = ‖x − x̂‖ / ‖x‖,        (2.9)

where ‖·‖ is a suitable norm. For example, suppose that in computation A we have
x = 0.5 × 10⁻⁴ and x̂ = 0.4 × 10⁻⁴, and in computation B we have x = 5,000 and
x̂ = 4950. Then, using the standard Euclidean distance measurement in R, the absolute
errors are 0.1 × 10⁻⁴ and 50 respectively, while the relative errors are 0.2 and 0.01. So
computation A has a 20% error while computation B has only a 1% error. The foregoing
shows that we have to be careful in deciding what we call an error. The point here is that
we need error bounds which will measure how large the error can be. Error bounds must
cover all possible cases.
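A minimal Python sketch of these two error measures, reproducing the numbers of computations A and B above:

    def errors(x, x_hat):
        """Absolute and relative error of the approximation x_hat to x (x assumed nonzero)."""
        abs_err = abs(x - x_hat)
        return abs_err, abs_err / abs(x)

    print(errors(0.5e-4, 0.4e-4))   # computation A: (1e-05, 0.2)  -> 20% relative error
    print(errors(5000.0, 4950.0))   # computation B: (50.0, 0.01)  ->  1% relative error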

2.2.1 Exercises
1. Read about floating point arithmetic, round off errors, machine representation of
numbers, conversion of numbers between different bases, and representation of frac-
tions: Then solve the following simple problems:
(a) convert the following binary numbers to decimal form (1010)2, (100101)2,
(10000001)2, (1101)2

(b) convert the following decimal numbers to binary form (82)10 , (100101)10 ,
(34301)10, (1101)10
(c) convert the following decimal numbers to hexadecimal form (123)10 , (1025)10 ,
(34301)10, (187)10
(d) convert the following hexadecimal numbers to decimal form (82)16 , (100)16 ,
(1A4C)16 , (6B.1C)16 , (FFF.118)16

2. Write an algorithm that converts integers from a base β to base 10.

3. Write an algorithm that converts a decimal fraction to a number in a base β and
vice-versa. Test your algorithm with the following: (0.101)2 = (0.625)10

4. Machine representation of numbers

(a) Suppose x ∈ R and x̃ is its chopped machine representation on a computer with base
10 and mantissa length m. Show that the relative error |x − x̃|/|x| is bounded
by 10^(−m+1).
(b) If δ = 10^(−m+1) (in the case of chopping) or δ = 0.5 × 10^(−m+1) (in the case of
rounding), show that x̃ = x(1 + ε) where |ε| ≤ δ.
(c) Given that the functions F and F′ are continuous, then (F(x̃) − F(x))/F(x) ≈
εxF′(x)/F(x) and F(x̃) − F(x) ≈ εxF′(x), by the Mean Value Theorem. Use
these to estimate the relative and absolute errors for the following:
(i) F(x) = x^k, for various values of k and x.
(ii) F(x) = e^x for various large values of x.
(iii) F(x) = sin(x) for small x.
(iv) F(x) = x² − 1 for x near 1.
(v) F(x) = cos(x) for x near π/2.

5. For any positive integer N and any fixed r ≠ 1, recall the formula for the geometric
sum

    G_N = 1 + r + r² + r³ + · · · + r^N = (1 − r^(N+1))/(1 − r) ≡ Q_N.

Write a computer programme (or use a hand calculator) to evaluate G_N and Q_N for
arbitrary values of r and N. Let r be chosen close to 1 and note the discrepancy. In
which of the two calculations do you have faith? Calculate the relative and absolute
errors in these computations.

6. The following numbers are given in a decimal computer with a four digit normalized
mantissa: a = 0.4523 × 10⁴, b = 0.2115 × 10⁻³, c = 0.2583 × 10¹. Perform the following
operations and indicate the error in the results, assuming symmetric rounding:
a + b + c, a − b − c, ab/c, a − b, b/(ca).

2.3 Properties of Matrices
A system of m linear equations in n unknowns has the general form
a11 x1 + a12 x2 + · · · + a1n xn = b1
a21 x1 + a22 x2 + · · · + a2n xn = b2
··············· (2.10)
am1 x1 + am2 x2 + · · · + amn xn = bm
The coefficients aij and the right hand sides are given numbers. The problem is to find,
if possible, numbers xj , j = 1, 2, · · · , n such that the m equations (2.10) are satisfied
simultaneously. The discussion is greatly facilitated if we understand the concepts of
vectors and matrices.
Definition 2 (A Matrix) A matrix is a rectangular array of (usually real) numbers
arranged in rows and columns. It is customary to display an m × n matrix as a rectangular
array of m rows and n columns as follows:

         ( a11  a12  · · ·  a1n )
    A =  ( a21  a22  · · ·  a2n )                                                        (2.11)
         ( ···  ···  · · ·  ··· )
         ( am1  am2  · · ·  amn )
At times, it is briefly written as
A = (aij ) (2.12)
The matrix in (2.11) has m rows and n columns; we simply say that A is an m × n
matrix. If m = n, A is called a square matrix of order n. If A has only one row, we
call it a row vector, and if A has only one column, it is called a column vector,
or simply a vector for short. So both the right-side constants bi, i = 1, 2, · · · , m, and the
unknowns xi, i = 1, 2, · · · , n, form vectors

        ( b1 )           ( x1 )
        ( b2 )           ( x2 )
    b = ( b3 ),     x =  ( x3 )                                                          (2.13)
        ( .. )           ( .. )
        ( bm )           ( xn )

We say that b is an m-vector and x is an n-vector.
Definition 3 (Equal Matrices) Two matrices A = (aij) and B = (bij) are equal if and
only if they are of the same order and aij = bij for all i and j.
Definition 4 (Matrix Multiplication) Let A = (aij) be an m × n matrix and B = (bij)
be an n × p matrix. Then the matrix C = (cij) is the (matrix) product of A with B (in
that order), or C = AB, provided C is of order m × p and

    cij = Σ_{k=1}^{n} aik bkj,    for i = 1, · · · , m;  j = 1, · · · , p.               (2.14)

With this definition we can write our system (2.10) simply as

Ax = b (2.15)
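As an illustration in Python/NumPy, the product rule (2.14) is what the @ operator computes, so a system Ax = b can be checked by forming Ax directly (the matrix and vector below are arbitrary example data):

    import numpy as np

    A = np.array([[5.0, 7.0, 6.0],
                  [7.0, 10.0, 8.0],
                  [6.0, 8.0, 10.0]])            # an arbitrary 3 x 3 example
    x = np.array([1.0, 1.0, 1.0])

    # Entry (i, j) of C = AB is sum_k a_ik * b_kj, exactly formula (2.14);
    # for a matrix-vector product this becomes (Ax)_i = sum_j a_ij * x_j.
    b = A @ x
    print(b)                                    # [18. 25. 24.]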

Definition 5 (Diagonal and Triangular Matrices) If A = (aij) is a square matrix
of order n, then we call its entries a11, a22, · · · , ann the diagonal elements, and call all
other entries off-diagonal. All entries aij of A with i < j are called superdiagonal, and all
entries aij with i > j are called subdiagonal. If all off-diagonal entries of a square matrix
are zero, we call A a diagonal matrix. If all subdiagonal entries of a square matrix are
zero, we call A an upper (or right) triangular matrix, while if all superdiagonal entries
of a square matrix are zero, we call A a lower (or left) triangular matrix.

Definition 6 (The identity matrix ) If a diagonal matrix of order n has all its diag-
onal entries equal to 1, then we call it an identity matrix of order n and denote it by I or
In if the order is important

The name identity is chosen for In because In A = A for all n × p matrices A, and BIn = B
for all m × n matrices B, so that if A is an n × n matrix, In A = AIn = A.
Definition 7 (Matrix inverse) Division by a matrix is in general not defined. However,
for square matrices we define a related concept, matrix inversion. We say the
square matrix A of order n is invertible provided there exists a square matrix B of order n such
that

    AB = In = BA.                                                                        (2.16)

The matrix B is called the inverse of A and is denoted by A⁻¹. We have

    (A⁻¹)⁻¹ = A,    (AB)⁻¹ = B⁻¹A⁻¹.                                                     (2.17)

Definition 8 (Matrix addition and scalar multiplication) For any two matrices A =
(aij ) and B = (bij ) of the same order and a scalar β, we have

βA = (βaij ), A + B = (aij + bij ), ∀i, j.

Thus for given matrices A, B and C and constants a and b, we have


(i) A + B = B + A

(ii) (A + B) + C = A + (B + C)

(iii) a(A + B) = aA + aB

(iv) A(a + b) = (a + b)A = aA + bA

(v) A(B + C) = AB + AC

(vi) (A + B)C = AC + BC

(vii) If a 6= 0 and A is invertible, then aA is invertible and (aA)−1 = (1/a)A−1

(viii) If all the entries of A are zero, then A is called the null matrix. If A is a null matrix,
then A + B = B for all matrices B of the same order.

Definition 9 (Linear Combinations) If x(1) , x(2) , · · · , x(k) are k n-vectors and b1 , b2 , · · · , bk


are k numbers, then the weighted sum

b1 x(1) + b2 x(2) + · · · + bk x(k)

is called a linear combination of x(1) , x(2) , · · · , x(k) with weights, or coefficients, b1 , b2 , · · · , bk

Using this definition we can rewrite the linear system of equations in several forms. We
have the following results concerning the existence and uniqueness of solutions of linear
equations.

Lemma 1 (Existence and Uniqueness of solutions to (2.10)) If x = x1 is a solu-


tion of the system Ax = b, then any solution x = x2 of this system is of the form

x2 = x1 + y

where x = y is a solution of the homogeneous system Ax = 0.

Theorem 2 The linear system Ax = b has at most one solution (i.e., the solution is
unique if it exists) if and only if the corresponding homogeneous system Ax = 0 has only
the trivial solution x = 0.

Theorem 3 Any homogeneous linear system of equations that has fewer equations than
unknowns has non-trivial (i.e., nonzero) solutions.

In fact, we can prove that we cannot expect to get a solution to our linear system for all
possible choices of the right-hand side unless we have no more equations than unknowns.

Lemma 2 (i) If A is an m × n matrix and the linear system Ax = b has a solution


for every m-vector b, then there exists an n × m matrix C such that AC = Im .

(ii) If B and C are matrices such that BC = I then the homogeneous system Cx = 0
has only the trivial solution x = 0.

Theorem 4 If A is an m × n matrix and the linear system Ax = b has a solution for
every m-vector b, then m ≤ n.

We now know that we cannot expect exactly one solution to our system unless there
are as many equations as unknowns. We can therefore restrict our attention to square
matrices and state the following theorem:

Theorem 5 Let A be an n × n matrix. Then the following are equivalent:

(i) The homogeneous system Ax = 0 has only the trivial solution x = 0.

(ii) For every right-side b, the system Ax = b has a solution.

(iii) A is invertible.

The proofs of these results can be found in any elementary text in numerical analysis.

Definition 10 (Linear Independence and Bases) Let e1 , e2 , · · · , em be m n-vectors.


We say that the n-vectors are linearly independent if

β1 e1 + · · · + βm em = 0 ⇒ β1 = β2 = · · · = βm = 0.

Now, let e1 , e2 , · · · , em be m linearly independent n-vectors. If every n-vector can be


written as a linear combination of these m n-vectors, then we call e1 , e2 , · · · , em a basis
(for all n-vectors).

Definition 11 (Transpose and Symmetric matrices) Let A = (aij ) and B = (bij )


be two matrices. We say B is the transpose of A and write B = AT provided B has as
many rows as A has columns and as many columns as A has rows and bij = aji , ∀i, j. If
AT = A, then A is said to be symmetric. In particular, the transpose of a column vector
is a row vector.

Notes 2 (Rules of Transpose) (i) If A and B are matrices such that AB is defined,
then B^T A^T is defined and (AB)^T = B^T A^T. Note the interchange of order!

(ii) For any matrix A, (A^T)^T = A.

(iii) If the matrix A is invertible, then so is A^T, and (A^T)⁻¹ = (A⁻¹)^T.

If a and b are n-vectors, then b^T a is the 1 × 1 matrix called the scalar or inner product
of a and b. For matrices with complex entries, there is the related concept of the Hermitian
or conjugate transpose of the matrix A. Thus when A is a complex matrix we write

    A^H = (āji)    ∀i, j,

where the bar denotes complex conjugation.

Definition 12 (Permutation) A permutation of degree n is any arrangement of
the first n integers. A permutation matrix of order n is an n × n matrix P whose
columns (rows) are a rearrangement or permutation of the columns (rows) of the identity
matrix of order n. Precisely, the n × n matrix P is a permutation matrix if

    P i_j = i_{p_j},    j = 1, · · · , n,                                                (2.18)

where i_j is the j-th column of the identity matrix I, for some permutation p = (p_i) of degree n.

Consequently we can state the following Theorem:

Theorem 6 Let P be a permutation matrix satisfying (2.18). Then

(i) P^T is a permutation matrix satisfying P^T i_{p_j} = i_j, j = 1, · · · , n. Hence P^T P = I;
therefore P is invertible and P⁻¹ = P^T.

(ii) If A is an m × n matrix, then AP is an m × n matrix whose j-th column equals the
p_j-th column of A, j = 1, · · · , n.

(iii) If A is an n × m matrix, then P^T A is the n × m matrix whose i-th row equals the
p_i-th row of A, i = 1, · · · , n.
An example will illustrate the issues of permutations.
Example 2 Consider the identity matrix of order 3. Then the matrix P defined by
 
0 0 1
P = 1 0 0 
0 1 0
is a permutation matrix corresponding to the permutation pT = (231) since P i1 = i2 ,
P i2 = i3 and P i3 = i1 . One has
 
0 1 0
T
P =  0 0 1 
1 0 0
Hence P T i1 = i3 , P T i2 = i1 and P T i3 = i2 illustrating (i) of the theorem. Further, one
calculates, for example, that
    
1 2 3 0 0 1 2 3 1
AP =  4 5 6  1 0 0  =  5 6 4 
7 8 9 0 1 0 8 9 7
hence column 2 of AP is column 3 of A, illustrating (ii) of the Theorem. Also,
    
0 0 1 1 2 3 7 8 9
PA =  1 0 0  4 5 6  =  1 2 3 
0 1 0 7 8 9 4 5 6
and the rows of A have been interchanged by the permutation matrix P illustrating (iii)
of the Theorem.
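The calculations of Example 2 can be repeated in a few lines of Python/NumPy (an illustrative sketch):

    import numpy as np

    A = np.array([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9]])
    P = np.array([[0, 0, 1],
                  [1, 0, 0],
                  [0, 1, 0]])                   # the permutation matrix of Example 2

    print(A @ P)        # columns of A permuted, as in part (ii) of Theorem 6
    print(P @ A)        # rows of A permuted, as in part (iii)
    print(P.T @ P)      # the identity, so P^{-1} = P^T, as in part (i)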
Definition 13 (Eigenvalues and eigenvectors of a matrix) Let A be an n × n real
matrix. If there exist a nonzero vector x and a scalar λ ∈ C such that Ax = λx, then
x is called an eigenvector of A and λ is called an eigenvalue.
The pair (λ, x), when it exists, is determined by the matrix A: λ satisfies the equation
Det(A − λI) = 0, and for each such λ, x is found as a nonzero solution of the equation (A − λI)x = 0.
For an n × n matrix, Det(A − λI) is a polynomial of degree n in λ, so that by the fundamental
theorem of algebra we have at most n eigenvalues and at most n linearly independent
eigenvectors. Eigenvectors of a real matrix can be complex. The eigenvectors associated with an
eigenvalue λ span the null space of A − λI, while the set of all eigenvalues
of the matrix is called the spectrum of that matrix. From a matrix we can define a
quantity called the spectral radius of A, often denoted by ρ(A), as follows:

    ρ(A) = max{|λ| : Ax = λx for some x ≠ 0}
         = max{|λ| : λ is an eigenvalue of A}
         = the maximum of the absolute values of all the eigenvalues of A.               (2.19)
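For illustration, eigenvalues and the spectral radius can be obtained numerically, e.g. with NumPy (a minimal sketch on an arbitrary 2 × 2 example):

    import numpy as np

    A = np.array([[2.0, 1.0],
                  [1.0, 2.0]])                  # an arbitrary symmetric example

    eigvals, eigvecs = np.linalg.eig(A)         # eigenvalues and eigenvectors of A
    rho = np.max(np.abs(eigvals))               # spectral radius rho(A) as in (2.19)
    print(eigvals)                              # [3. 1.]  (order may vary)
    print(rho)                                  # 3.0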

2.4 Some Fundamental Results from Calculus
Rolle's Theorem. If f : [a, b] → R is continuous on [a, b] and differentiable on (a, b),
and if f(a) = f(b), then there exists ξ ∈ (a, b) such that f′(ξ) = 0.

Mean Value Theorem for Derivatives. If f : [a, b] → R is continuous on [a, b] and
differentiable on (a, b), then there exists ξ ∈ (a, b) such that f(b) − f(a) = f′(ξ)(b − a).

Mean Value Theorem for Integrals. If f : [a, b] → R is continuous on [a, b], then
there exists ξ ∈ (a, b) such that ∫_a^b f(x) dx = f(ξ)(b − a).

Second Mean Value Theorem for Integrals. If g : [a, b] → R is integrable and does
not change sign on [a, b], and if f : [a, b] → R is continuous on [a, b], then there exists
ξ ∈ (a, b) such that ∫_a^b f(x)g(x) dx = f(ξ) ∫_a^b g(x) dx.

Boundedness Theorem. If f : [a, b] → R is continuous on [a, b], then there exist x_M
and x_m in [a, b] such that f(x_m) ≤ f(x) ≤ f(x_M) for all x ∈ [a, b], and furthermore
|∫_a^b f(x) dx| ≤ (b − a) max_{a≤x≤b} |f(x)|. That is, a continuous function on a closed and bounded
subset (of R) attains its bounds.

Intermediate Value Theorem. If f : [a, b] → R is continuous on [a, b], and if x_m and
x_M are given as above and m = f(x_m) and M = f(x_M), then for every y* such
that m ≤ y* ≤ M, there exists x* ∈ [a, b] such that f(x*) = y*.

Taylor's Theorem. If f : [a, b] → R is n + 1 times continuously differentiable on [a, b],
then if x and x* are any two points in (a, b),

    f(x) = f(x*) + Σ_{k=1}^{n} f^(k)(x*) (x − x*)^k / k! + R_{n+1}(x, x*),

where f^(k)(x*) = d^k f / dx^k evaluated at x = x*, and there exists ξ between x and x* such that

    R_{n+1}(x, x*) = f^(n+1)(ξ) (x − x*)^(n+1) / (n + 1)!.

Observe that the mean value theorem is a special case of Taylor's theorem when n = 0.

The fundamental theorem of algebra. Every polynomial of degree n ≥ 1 has exactly n
zeroes, counted with multiplicity. That is, if P_n(x) = Σ_{k=0}^{n} a_k x^k where n ≥ 1 and a_n ≠ 0, then there exist n
constants {r_j}_{j=1}^{n} (the zeroes of P_n(x)) such that

    P_n(x) = a_n (x − r_1)(x − r_2) · · · (x − r_n),

a product of n linear factors. Here, the r_j's need not be distinct or all real.

Exercise: Prove all the above results.

Chapter 3

Numerical Linear Algebra

3.1 Numerical solution of linear systems


We shall consider methods for numerically solving systems of equations of the form

    Ax = b,

where A is an n × n matrix, for a given right-hand side b, to obtain the unknown x.
A frequently quoted test for invertibility of a matrix is based on the concept of determinants.
The relevant theorem states that the matrix A is invertible if and only if
det(A) ≠ 0. If det(A) ≠ 0, then it is even possible to express the solution of Ax = b in
terms of determinants, by the so-called Cramer's rule. Calculating determinants is not
a very practical thing to do, given the number of operations involved. We shall explore
several methods of solving systems of linear equations, both direct methods and iterative
methods.
Direct methods of solution are those methods which, in the absence of round-off
errors, will yield the exact solution in a finite number of elementary arithmetic operations.
In practice, since computers work with a finite word length, direct methods do not lead to
exact solutions because errors creep in from round-off, instability, loss of significant digits,
etc. A large part of numerical analysis is devoted to controlling these errors.
Iterative methods are those which start with an initial approximation and which, by
applying a suitably chosen algorithm, lead to successively better approximations. Matrices
associated with linear systems are also classified as dense or sparse. Dense matrices have
very few zero elements, and the order of such matrices tends to be small. Direct methods
fare well when used to solve problems where the matrices are dense. Sparse matrices,
on the other hand, have very few non-zero elements, and the order of such a system can
be very large. Iterative methods are useful here. In what follows we shall explore
methods of solution for both sparse and dense systems. We keep the presentation as brief
as possible. Appropriate books are shown in the bibliography.

3.1.1 Some objectives for the design of a Method


1. Simplicity: The method should be sufficiently free of special cases demanding
modified procedures that it can be programmed on a computer without undue effort
2. Economy: The method should not demand vastly more operations than alternative
methods which could solve the same problem

3. Robustness: The method should not break down for special cases that cannot be
specified in advance

4. Accuracy and Stability: The effect of limited accuracy on the method should
be to give an exact answer to a perturbed problem for which the perturbations can
be bounded. It should be feasible to make this bound smaller than the bound on
perturbations which could arise from uncertainties in the data.

5. Analysis: It should be possible to establish 3 and 4 by a rigorous analysis.

We may then say that numerical analysis deals with the design of numerical methods
and the analysis of their performance on individual problems or class of problems.

3.2 Linear simultaneous equations


Here we examine numerical methods for solving the system Ax = b for the vector x, for
a given b ∈ Rⁿ and A ∈ R^{n×n}. The problem can be solved by either direct or indirect
methods. Some direct methods include:

1. Inversion: If it is known that A is non-singular, the solution of Ax = b is
obtainable by calculating A⁻¹ to get x = A⁻¹b. For a small A this can be acceptable,
but for a large A it becomes cumbersome and grossly inefficient.

2. Cramer's Rule: Each unknown is found as a ratio of two determinants. The
problem here is that the operation count can be very large and some of the objectives
outlined above will be violated.

3. Gauss elimination: Systematic elimination of some of the unknowns is carried out on
the problem until an equivalent system which is easier to solve than the original is
obtained.

4. LU Decomposition: The same as Gauss elimination, except that we use matrices
all the way and decompose the system into a lower triangular matrix L and an upper
triangular matrix U. The solution of the problem is then achieved in a two step
process.

Indirect methods also exist. In its simplest form, an indirect method involves designing an
iterative procedure which, it is hoped, will converge to the sought-after solution. We examine
each of these methods in turn.

3.2.1 Gauss Elimination
The elimination comes first: at each stage, we remove one unknown from all but one
of the equations to produce a set of equations of order one smaller than the last. Back-
substitution follows: the last equation contains just the last unknown, which is found
directly. All unknowns found so far can then be substituted into the equation with one
extra unknown, which can then be found in the next step.

Example 3 Consider the system

(5)x1 + 7x2 + 6x3 = 18 (3.1)


7x1 + 10x2 + 8x3 = 25 (3.2)
6x1 + 8x2 + 10x3 = 24 (3.3)

and find the solution x1 , x2 , x3 .

Solution: The matrix element 5, which is bracketed, is used as the pivot. It is divided
into the elements in the column below it to find the multiples of the pivotal equation which
are subtracted from the others. The new set of equations is given as

    (3.1)′ = (3.1)              →  (5)x1 + 7x2 + 6x3 = 18.0                              (3.4)
    (3.2)′ = (3.2) − (7/5)(3.1) →      (0.2)x2 − 0.4x3 = −0.2                            (3.5)
    (3.3)′ = (3.3) − (6/5)(3.1) →     − 0.4x2 + 2.8x3 = 2.4                              (3.6)
Now we repeat with the element 0.2, which is bracketed, to get

    (3.1)″ = (3.4)              →  (5)x1 + 7x2 + 6x3 = 18.0                              (3.7)
    (3.2)″ = (3.5)              →      (0.2)x2 − 0.4x3 = −0.2                            (3.8)
    (3.3)″ = (3.6) + 2(3.5)     →                 2x3 = 2.00                             (3.9)

Back substitution:

    (3.9) ⇒ x3 = 1                                                                       (3.10)
    (3.8) ⇒ x2 = (−0.2 + 0.4x3)/0.2 = 1                                                  (3.11)
    (3.7) ⇒ x1 = (18 − 6x3 − 7x2)/5 = 1                                                  (3.12)

For this small set the result is exact. For sets in which the numbers are less simple, or for
large sets, the calculations would in general be affected by rounding errors.
We note that this is the method of systematic elimination that is commonly taught
in school. The actual computed result usually depends on the sequence of the equations,
because of rounding errors. Moreover the result depends, in general, on the type of
arithmetic used: work or calculations which would be exact in decimal arithmetic will
not be exact in binary arithmetic. The process also relies on the pivots; they must be
non-zero because we have to divide by them. Their choice depends on the sequence of
equations and the number of variables.

3.2.2 General form of Gauss Elimination
For a system of equations of order n we have:

    ( a11  a12  · · ·  a1n ) ( x1 )   ( b1 )
    ( a21  a22  · · ·  a2n ) ( x2 )   ( b2 )
    (  ..   ..  · · ·   .. ) ( .. ) = ( .. )                                             (3.13)
    ( an1  an2  · · ·  ann ) ( xn )   ( bn )

Assume a11 ≠ 0 and define

    m_i1 = a_i1 / a_11,    i = 2, 3, · · · , n.

Then we calculate

    a_ij^(2) = a_ij − m_i1 a_1j,    i, j = 2, 3, · · · , n,
    b_i^(2)  = b_i − m_i1 b_1,      i = 2, 3, · · · , n,

to obtain an (n−1) × (n−1) subsystem in x2, x3, · · · , xn. We can adjoin the first equation
to this subsystem to obtain the n × n system A^(2) x = b^(2). We continue recursively, and
the next step assumes a_22^(2) ≠ 0 and defines

    m_i2 = a_i2^(2) / a_22^(2),    i = 3, 4, · · · , n.

Then we calculate

    a_ij^(3) = a_ij^(2) − m_i2 a_2j^(2),    i, j = 3, 4, · · · , n,
    b_i^(3)  = b_i^(2) − m_i2 b_2^(2),      i = 3, 4, · · · , n,

to obtain an (n − 2) × (n − 2) subsystem in x3, x4, · · · , xn. We can adjoin the first two
equations to this subsystem to obtain the n × n system A^(3) x = b^(3). This we continue until
we find a_nn^(n) x_n = b_n^(n), assuming that a_nn^(n) ≠ 0, as the first step of the back-substitution.
The whole back-substitution process is applied to the system of equations of the form

              ( a11  a12      a13      · · ·  a1n      ) ( x1 )   ( b1     )
              (      a22^(2)  a23^(2)  · · ·  a2n^(2)  ) ( x2 )   ( b2^(2) )
    A^(n) x ≡ (               a33^(3)  · · ·  a3n^(3)  ) ( x3 ) = ( b3^(3) )             (3.14)
              (                         ..      ..     ) ( .. )   (  ..    )
              (                               ann^(n)  ) ( xn )   ( bn^(n) )

where a_ij^(n) = 0 for all i > j, so that we have reduced the system to an upper triangular system
of equations. The last equation contains only one unknown, xn, which is solved for, and the
value is used in the last-but-one equation, which contains the two unknowns xn and xn−1; xn−1
is found because we already have xn. The process continues and gives xn, xn−1, · · · , x1 in
sequence.
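A minimal Python sketch of the elimination and back-substitution just described (no pivoting, so the pivots a_kk^(k) are assumed nonzero; the test data are those of Example 3):

    import numpy as np

    def gauss_solve(A, b):
        """Gauss elimination (no pivoting) followed by back substitution."""
        A = A.astype(float).copy()
        b = b.astype(float).copy()
        n = len(b)
        for k in range(n - 1):                      # elimination step k
            for i in range(k + 1, n):
                m = A[i, k] / A[k, k]               # multiplier m_ik (pivot assumed nonzero)
                A[i, k:] -= m * A[k, k:]
                b[i] -= m * b[k]
        x = np.zeros(n)
        for i in range(n - 1, -1, -1):              # back substitution
            x[i] = (b[i] - A[i, i + 1:] @ x[i + 1:]) / A[i, i]
        return x

    A = np.array([[5.0, 7.0, 6.0], [7.0, 10.0, 8.0], [6.0, 8.0, 10.0]])
    b = np.array([18.0, 25.0, 24.0])
    print(gauss_solve(A, b))                        # [1. 1. 1.], as in Example 3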

3.2.3 LU Decomposition
We can use the mij of the Gauss elimination process to define

    M1 = (   1     0   0  · · ·  0 )        M2 = ( 1    0     0  · · ·  0 )
         ( −m21    1   0  · · ·  0 )             ( 0    1     0  · · ·  0 )
         ( −m31    0   1  · · ·  0 )             ( 0  −m32    1  · · ·  0 )              (3.15)
         (  ..     ..  ..   ..   .. )            ( ..   ..    ..   ..   .. )
         ( −mn1    0   0  · · ·  1 )             ( 0  −mn2    0  · · ·  1 )

    M3 = ( 1  0    0    0  · · ·  0 )            Mk = ( 1  · · ·    0    0  · · ·  0 )
         ( 0  1    0    0  · · ·  0 )                 ( ..          ..   ..        .. )
         ( 0  0    1    0  · · ·  0 )                 ( 0  · · ·    1    0  · · ·  0 )
         ( 0  0  −m43   1  · · ·  0 )    · · ·        ( 0  · · ·  −mik   1  · · ·  0 )   (3.16)
         ( ..  ..  ..   ..   ..   .. )                ( ..   ..     ..   ..   ..   .. )
         ( 0  0  −mn3   0  · · ·  1 )                 ( 0  · · ·  −mnk   0  · · ·  1 )

i = 2, 3, · · · , n and k = 1, 2, · · · , n − 1. Then clearly A(2) = M1 A, b(2) = M1 b, A(3) =


M2 A(2) , b(3) = M2 b(2) , and so on. Thus each Mk is lower unit triangular. That is
(Mk )ij = 0 unless i ≥ j. Then

A(n) = Mn−1 Mn−2 · · · M2 M1 A, b(n) = Mn−1 Mn−2 · · · M2 M1 b. (3.17)

Now we have the following result concerning lower triangular matrices:

Lemma 3 If M and N are lower triangular matrices, then so is MN.

Proof: We have (MN)_ik = Σ_{j=1}^{n} m_ij n_jk, where m_ij n_jk = 0 unless i ≥ j and j ≥ k, that
is, unless i ≥ k; so MN is lower triangular.
So M = Mn−1 Mn−2 · · · M2 M1 is a lower triangular matrix. Write A(n) = U an upper
triangular matrix. Then

U = MA, b(n) = Mb. (3.18)

We also have the following result concerning triangular matrices

Lemma 4 If T is a lower (an upper) triangular matrix, then it is non-singular if and
only if t_ii ≠ 0 ∀i, and then T⁻¹ is also a lower (an upper) triangular matrix.

Proof: By induction. Suppose T = Tn is an upper triangular matrix of order n and that
the result is true for order n − 1 or less. Set

    Tn = ( T_{n−1}   t_n  )
         (    0     τ_nn  )

where t_n is a column vector of length n − 1, 0 is a row vector of zeros of length n − 1 and
τ_nn is a constant. Obviously Tn is singular if T_{n−1} is singular or τ_nn is zero. Hence Tn
is singular if it has a zero diagonal element. If Tn is not singular, check that

    Tn⁻¹ = ( T_{n−1}⁻¹   −T_{n−1}⁻¹ t_n τ_nn⁻¹ )
           (     0               τ_nn⁻¹        )

by forming the product Tn Tn⁻¹. So Tn⁻¹ is upper triangular. The results are true trivially
when n = 1, and so they are true ∀n ≥ 1. □
So now we can write L = M −1 , a lower triangular matrix, and A = LU. This is the
LU decomposition of A, and we can regard the Gauss elimination methods as follows:
The system
LUx = b
is equivalent to the pair of systems
Ly = b, Ux = y.
So we solve for x by first finding
(i) y = L−1 b = Mb ≡ b(n) (the elimination part) (3.19)
and then obtaining
(ii) x = U −1 y i.e. (A(n) )−1 b(n) (the back substitution part). (3.20)
Thus Gauss elimination and LU decomposition are formally equivalent. However, we may
find L and U by other means and then use (i) and (ii) above to find x, that is (3.19) and
(3.20)
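The two-stage solve (3.19)-(3.20) can be sketched in Python as follows (again without pivoting, and reusing the data of Example 3 as an arbitrary test case):

    import numpy as np

    def lu_decompose(A):
        """A = LU with L unit lower triangular and U upper triangular (no pivoting)."""
        n = A.shape[0]
        L = np.eye(n)
        U = A.astype(float).copy()
        for k in range(n - 1):
            for i in range(k + 1, n):
                L[i, k] = U[i, k] / U[k, k]          # multiplier m_ik stored in L
                U[i, k:] -= L[i, k] * U[k, k:]
        return L, U

    def lu_solve(L, U, b):
        """Solve LUx = b via Ly = b (forward) and then Ux = y (backward)."""
        n = len(b)
        y = np.zeros(n)
        for i in range(n):
            y[i] = b[i] - L[i, :i] @ y[:i]
        x = np.zeros(n)
        for i in range(n - 1, -1, -1):
            x[i] = (y[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
        return x

    A = np.array([[5.0, 7.0, 6.0], [7.0, 10.0, 8.0], [6.0, 8.0, 10.0]])
    L, U = lu_decompose(A)
    print(lu_solve(L, U, np.array([18.0, 25.0, 24.0])))   # [1. 1. 1.]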

3.2.4 Elementary lower triangular matrices


We can regard the row operations of Gauss elimination as pre-multiplication by elementary
lower triangular matrices. This is necessary for the analysis of this process and of its
relationship to LU decomposition.
Definition 14 (Elementary lower triangular matrix) Let

    I ≡ the identity matrix (In is I of order n),
    e_k ≡ the k-th basis vector, e_k^T = (0, 0, · · · , 0, 1, 0, · · · , 0), with the 1 in position k.

Then M is an elementary lower triangular matrix of index k if M = I − m e_k^T for some
column vector m, where e_i^T m = 0, i = 1, 2, · · · , k, so that (m)_i = 0 if i ≤ k. Hence we
can write M in the form

    M = ( 1                                )
        (   ..                             )
        (        1                         )
        (      −m_{k+1}   1                )                                             (3.21)
        (        ..             ..         )
        (      −m_n                   1    )

with the entries −m_{k+1}, · · · , −m_n standing in column k.

Lemma 5 Given an elementary lower triangular matrix of index k, M = I −meTk . Then

1. M −1 = I + meTk

2. If also N is such a matrix of index l, N = I − n e_l^T, l ≥ k, then MN = I − m e_k^T − n e_l^T.

Proof: We take each in turn

1. M(I + meTk ) = (I + meTk )M = I − meTk + meTk − meTk meTk , but eTk m = 0. So


M(I + meTk ) = I.

2. MN = (I − meTk )(I − neTl ) = I − meTk − neTl + meTk neTl . But eTk n = 0 because
k ≤ l, so MN = I − meTk − neTl . 

Now, in the Gauss elimination process we have introduced M1 , M2 , · · · , Mn−2 , Mn−1 el-
ementary lower triangular matrices of index 1, 2, 3, · · · , n − 1, respectively, and we can
write

    M_k = I − m_k e_k^T,  where  m_k^T = (0, · · · , 0, m_{k+1,k}, m_{k+2,k}, · · · , m_{n,k})  ⇒  M_k⁻¹ = I + m_k e_k^T.

Hence in the LU decomposition of A,

    L = (M_{n−1} M_{n−2} · · · M_1)⁻¹ = M_1⁻¹ M_2⁻¹ · · · M_{n−1}⁻¹
      = I + m_1 e_1^T + m_2 e_2^T + · · · + m_{n−1} e_{n−1}^T.

That is,

    L ≡ ( 1                      0 )
        ( m21   1                  )
        ( m31   m32   1            )                                                     (3.22)
        (  ..    ..        ..      )
        ( mn1   mn2   · · ·    1   )

a unit lower triangular matrix, where the mik are precisely the multipliers in the Gauss
elimination process.

Remark 1 (Bookkeeping) When we write a programme for Gauss elimination, or for
LU decomposition, we can store L and U in the same storage locations used for the original
A. If we need A itself for other purposes, we can keep a copy of it somewhere else. The
parts of the columns of A below the diagonal are replaced by the multipliers m_ij, the rows on or above the
diagonal hold u_ij, and the unit elements on the diagonal of L need not be stored at all.

3.2.5 Pivoting
Gauss elimination or LU decomposition breaks down at step k if a_kk^(k) = 0. It also gives
inaccurate answers if a_kk^(k) becomes very small. So it may be necessary or desirable to
change the order of the elimination process. Instead of using a_kk^(k) as our pivot at stage
k, we may, in Gauss elimination, interchange row k with a row with a larger row number
before we do this step.

Example 4 Consider the system −1.41x1 + 2x2 = 1, x1 −1.41x2 + x3 = 1, 2x2 −1.41x3 =
1, and study the effect of pivoting on the solution
Solution: The exact solution is x1 = x2 = x3 = 1.69492. Solving the problem using floating
point decimal arithmetic and carrying three significant figures gives
 
−1.41 2 0
A =  1 −1.41 1  , bT = (1, 1, 1)
0 2 −1.41
   
1 0 0 −1.41 2 0
L =  −0.709 1 0  , U =  0 0.010 1 
0 200 1 0 0 −201
y T = (1, 1.71, −341), giving xT = (0.71, 1.00, 1.70)
Clearly far fewer than two significant figures are correct. Now repeat the process, but at stage
two we find

    A^(2) = ( −1.41    2       0    )        b^(2) = (1, 1.71, 1)^T.
            (   0      0.010   1    )
            (   0      2      −1.41 )

We interchange rows 2 and 3 of both A^(2) and b^(2), by pre-multiplying A^(2) and b^(2) with
the matrix

    P = ( 1  0  0 )
        ( 0  0  1 )
        ( 0  1  0 )

to have

    P A^(2) = A^(2)′ = ( −1.41    2       0    )        P b^(2) = b^(2)′ = (1, 1, 1.71)^T.
                       (   0      2      −1.41 )
                       (   0      0.010   1    )
from which we have
   
1 0 0 −1.41 2 0
L =  0 1 0 ,U =  0 2 −1.41 
−0.709 0.005 1 0 0 1.01
y T = (1, 1, 1.71), giving xT = (1.69, 1.69, 1.69)
Examination of the process shows that a small a_22^(2) leads to a large a_33^(3), a poor x3, an inaccurate
x2 and eventually a very bad x1.
More generally, if u_kk is small, it is also inaccurate, because it is calculated by cancellation
of much larger numbers with associated rounding errors of relatively large size.
When we divide by u_kk in the back-substitution for x_k we get an inaccurate x_k. The
numerator in this division is also small, and so inaccurate because of cancellation. This
further spoils x_k, and the error in x_k propagates as errors in x_{k−1}, x_{k−2}, · · · , x_1.
Example 5 (Effect of pivoting) Study the effect of pivoting on the LU decomposition
process if  
0.001 2.000 3.000
A=  −1.000 3.712 4.623 
−2.000 1.072 5.643

and carry out the operation on a four digit computer (i.e. four significant figures).
Solution: We work with four significant figures.
No Pivoting:

    L = (    1       0      0 )        A^(2) = ( 0.001  2.000  3.000 )
        ( −1000      1      0 )                (   0    2004.  3005. )
        ( −2000    1.997    1 )                (   0    4001.  6006. )

    A^(3) = ( 0.001  2.000  3.000 )
            (   0    2004   3005  )
            (   0      0    5.000 )

With Pivoting: Interchange rows 1 and 3 at stage 1


   
1 0 0 −2.000 1.072 5.643
L =  0.5 1 0  , A(2) =  0 3.176 1.802  ,
−0.0005 0.6297 1 0 2.000 3.003
 
−2.000 1.072 5.643
(3)
A =  0 3.176 1.802 
0 0 1.868

If b = (1, 2, 3)T , then


No Pivoting: xT = (−0.339, −0.100, +0.400)
With Pivoting: xT = (−0.4902, −0.0511, +0.3676)
Exact: xT = (−0.4904, −0.0504, +0.3675)
The lower 2 × 2 matrix on the lower right corner of A(2) , without pivoting is very nearly
singular. For its determinant we get the cancellation
2004 × 6006 − 3005 × 4001 = 12, 036, 024 − 12, 023, 005 = 13019
and so a loss of 3 figures. This time we do not even get x3 right to better than ∼ 10
percent. When we apply pivoting, we get some remedy as the second solution shows.

3.2.6 Pivoting Strategies


Two available strategies to counter the kind of behaviour observed in the above two
examples are partial and total pivoting.
(a) Partial Pivoting: At step k, choose r ≥ k so that

    |a_rk^(k)| ≥ |a_ik^(k)|,    k ≤ i ≤ n.                                               (3.23)

Then interchange rows k and r. In other words, find the largest element in the
column below a_kk^(k) and interchange its row with row k.
(b) Total Pivoting: This is seen as complete, or row and column, pivoting. At step k,
choose r ≥ k and p ≥ k so that

    |a_rp^(k)| ≥ |a_ij^(k)|,    k ≤ i, j ≤ n.                                            (3.24)

Then interchange rows k and r and also columns k and p. In other words, find the
largest element in the whole sub-matrix between a_kk^(k) and a_nn^(k), and interchange to
put this element in position (k, k).

We note that strategies (a) and (b) demand corresponding interchanges in the vector of
right hand sides (the data). In addition, strategy (b) demands an interchange of the vector
of unknowns. Very interesting results can be proved for strategy (a), which lead to
efficient and robust algorithms. Strategy (b) is considerably more expensive and may
not provide additional robustness in the solution process. We can show that making the
interchange is equivalent to pre-multiplying the matrix A by a permutation matrix of the form

    P_k = ( 1                              )
          (    ..                          )
          (       1                        )
          (          0  · · ·  1           )
          (          ..   ..   ..          )                                             (3.25)
          (          1  · · ·  0           )
          (                       1        )
          (                          ..    )
          (                             1  )

where the off-diagonal 1's lie in row k and in the row to be interchanged. We
have the following result.

Theorem 7 In Partial or row pivoting,

1. |m_ik| ≤ 1, ∀i, k.

2. The result is exactly the same if we had done all the row interchanges before starting
the elimination process

Proof: We take each in turn:

1. That |m_ik| ≤ 1, ∀i, k, is trivial and follows directly from the construction.

2. At step k we get A(k+1) from A(k) by:

(a) Pre-multiplying by a permutation matrix Pk of the form shown in (3.25), does


the interchange.
(b) Pre-multiplying by a lower triangular matrix Mk does the elimination of the
elements in column k.

Hence

    A^(k+1) = M_k P_k A^(k),

and

    A^(k) = M_{k−1} P_{k−1} · · · M_2 P_2 M_1 P_1 A^(1),    where A^(1) ≡ A.

So, since P_{n−1} P_{n−1} = I,

    A^(n) = M_{n−1} P_{n−1} M_{n−2} P_{n−1} P_{n−1} P_{n−2} M_{n−3} · · · M_1 P_1 A^(1),

and therefore

    A^(n) = M_{n−1} M*_{n−2} P_{n−1} P_{n−2} M_{n−3} · · · M_1 P_1 A^(1),                (3.26)

where M*_{n−2} = P_{n−1} M_{n−2} P_{n−1} is derived from M_{n−2} by an interchange of two sub-
diagonal elements, but has the same form. Proceeding in this way, we can move
all the permutation matrices to the right of the lower triangular matrices and find

    A^(n) = M_{n−1} M*_{n−2} · · · M*_1 P_{n−1} P_{n−2} · · · P_1 A^(1)
          = M_{n−1} M*_{n−2} · · · M*_1 A^{*(1)},                                        (3.27)

where A^{*(1)} = P_{n−1} P_{n−2} · · · P_1 A^(1) is A with all its rows permuted.
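A minimal Python sketch of Gauss elimination with partial (row) pivoting, strategy (a), applied to the matrix of Example 5 (in double precision, so the answer is close to the exact values quoted there):

    import numpy as np

    def gauss_partial_pivot(A, b):
        """Gauss elimination with partial (row) pivoting, plus back substitution."""
        A = A.astype(float).copy()
        b = b.astype(float).copy()
        n = len(b)
        for k in range(n - 1):
            r = k + np.argmax(np.abs(A[k:, k]))      # row r >= k with largest |a_rk|
            A[[k, r]] = A[[r, k]]                    # interchange rows k and r
            b[[k, r]] = b[[r, k]]
            for i in range(k + 1, n):
                m = A[i, k] / A[k, k]
                A[i, k:] -= m * A[k, k:]
                b[i] -= m * b[k]
        x = np.zeros(n)
        for i in range(n - 1, -1, -1):
            x[i] = (b[i] - A[i, i + 1:] @ x[i + 1:]) / A[i, i]
        return x

    A = np.array([[0.001, 2.000, 3.000],
                  [-1.000, 3.712, 4.623],
                  [-2.000, 1.072, 5.643]])
    print(gauss_partial_pivot(A, np.array([1.0, 2.0, 3.0])))
    # approximately [-0.4904 -0.0504  0.3675], the "Exact" values quoted in Example 5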

Definition 15 (Regular Matrices) A square matrix A is called strictly regular if all
its leading submatrices A_k (and A itself) are nonsingular. The k-th leading submatrix
A_k, k = 1, 2, · · · , n, of A is the k × k matrix formed from the first k rows and
columns of A, so that A_1 = (a_11) and A_n = A.

The next theorem establishes the existence of an LU decomposition for all strictly regular
matrices.

Theorem 8 (On the Existence of LU Decomposition) Suppose A is a given n × n
matrix and let A_k, k = 1, 2, 3, · · · , n − 1, be its k-th leading (principal) sub-matrix. Then there exists a
unique LU decomposition of A without interchanges, where L is a unit lower triangular
matrix and U an upper triangular matrix, if Det(A_k) ≠ 0, k = 1, 2, 3, · · · , n − 1; that is,
if A is a strictly regular matrix.

Proof: (By induction). The result is obviously true for n = 1. Suppose it is true for
k − 1 and partition A_k, L_k, U_k (where L_k, U_k are the corresponding leading sub-matrices of L and
U) as

    A_k = ( A_{k−1}   b   ),   L_k = ( L_{k−1}   0 ),   U_k = ( U_{k−1}    u   )         (3.28)
          (   c^T    a_kk )          (  m^T      1 )          (    0     u_kk  )

where b, c, m, u are (k − 1)-vectors. Suppose we try to set L_k U_k = A_k, that is,

    L_{k−1} U_{k−1} = A_{k−1},   L_{k−1} u = b,   m^T U_{k−1} = c^T,   m^T u + u_kk = a_kk.   (3.29)

By induction, L_{k−1} and U_{k−1} are uniquely determined. Also, by assumption,

    Det(A_{k−1}) = Det(L_{k−1}) Det(U_{k−1}) = Det(U_{k−1}) ≠ 0.                         (3.30)

Hence L_{k−1}, U_{k−1} are non-singular, and u, m can be uniquely determined from the
equations (3.29). Thence u_kk is also uniquely determined. So we have L_k, U_k uniquely.
Further,

    Det(A_k) ≠ 0,  so  Det(L_k) Det(U_k) = Det(U_k) ≠ 0.                                 (3.31)

Thus

    Det(U_k) = u_11 u_22 · · · u_kk ≠ 0  and  u_kk ≠ 0.                                  (3.32)

Hence we satisfy the same conditions for k as for k − 1, and can continue the induction
up to step n. This shows that L and U are uniquely determined. □
Remark 2 We make the following remarks about the existence and uniqueness of LU
factorizations.
1. The existence part of Theorem 8 is often quoted by saying that "the matrix A admits
an LU decomposition if and only if A is strictly regular".
2. The uniqueness part of Theorem 8 can also be proved by assuming that A admits
two LU decompositions A = L_1 U_1 and A = L_2 U_2. Then L_1 U_1 = L_2 U_2 ⇒ L_2⁻¹ L_1 =
U_2 U_1⁻¹, but by Lemmas 3 and 4, L_2⁻¹ L_1 is unit lower triangular and U_2 U_1⁻¹ is upper
triangular. A lower triangular matrix cannot be equal to an upper triangular matrix
unless both are diagonal; since L_2⁻¹ L_1 has unit diagonal, both products must equal
the identity matrix, so the only possibility is L_1 = L_2 and U_1 = U_2.

Corollary 1 A strictly regular matrix A has a unique factorization A = LDU, where D is
diagonal and both L and U have unit diagonal.

Corollary 2 A strictly regular symmetric matrix A admits a unique representation
A = LDL^T, where L (and hence U = L^T) has unit diagonal.

Proof: By strict regularity we have A = LDU, and by symmetry A = A^T, so
LDU = U^T D L^T. Since the LDU factorization is unique, U = L^T. □
When the matrix A is positive definite or (strictly) diagonally dominant by rows, we
have further interesting results.
Definition 16 (Positive definite matrix) An n × n matrix A is said to be positive
definite if x^T A x > 0, ∀x ≠ 0, x ∈ Rⁿ, where x is a column vector.

Definition 17 (Symmetric positive definite matrix) An n × n matrix A is said to
be symmetric positive definite if A is positive definite and A^T = A.

Theorem 9 Let A be a real symmetric matrix of order n. Then A is positive definite if


and only if it has the LDLT decomposition in which the diagonal elements of D are all
positive.

Proof: Suppose A = LDL^T with D > 0, and let x ∈ R^n \ {0}. Since L is nonsingular, y := L^T x ≠ 0. Then x^T A x = y^T D y = Σ_{k=1}^n d_{kk} y_k^2 > 0, hence A is positive definite. Conversely, if A is (symmetric) positive definite, then it is strictly regular (for if A_k x = 0 for some k ≤ n and some nonzero x ∈ R^k, then with x_* = (x_1, x_2, · · · , x_k, 0, · · · , 0)^T ∈ R^n we obtain x_*^T A x_* = x^T A_k x = 0, a contradiction to positive definiteness). Thus by Corollary 2 it admits an LDL^T factorization. Take x such that L^T x = e_k, the k-th coordinate unit column vector. Then 0 < x^T A x = e_k^T D e_k = d_{kk}. □
Notes 3 The last result tells us that we can check to see if a matrix is positive definite
by trying to compute its LDLT decomposition. But this is not a standard method to use
except possibly for demonstration purposes.

Corollary 3 (Cholesky factorization) A positive definite matrix A admits a Cholesky


factorization A = L̃L̃T where L̃ is a lower triangular matrix.

Proof: Since A = LDL^T with positive diagonal, we can write

    A = LDL^T = (L D^{1/2})(D^{1/2} L^T) =: L̃ L̃^T,

where D^{1/2} is the diagonal matrix of the positive square roots of the entries of D. □
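The construction above can also be carried out directly, column by column. The following is a minimal sketch (assuming Python with NumPy is available; the function name cholesky is ours) of such a direct Cholesky factorization:

    import numpy as np

    def cholesky(A):
        """Return lower triangular Ltilde with A = Ltilde @ Ltilde.T.
        Assumes A is symmetric positive definite."""
        n = A.shape[0]
        L = np.zeros((n, n))
        for j in range(n):
            s = A[j, j] - np.dot(L[j, :j], L[j, :j])
            L[j, j] = np.sqrt(s)          # s > 0 precisely when A is positive definite
            for i in range(j + 1, n):
                L[i, j] = (A[i, j] - np.dot(L[i, :j], L[j, :j])) / L[j, j]
        return L

If A is not positive definite, the quantity s becomes non-positive at some step, which is one (non-competitive) way of detecting the failure mentioned in Notes 3.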


Definition 18 (Diagonally dominant matrix) An n × n matrix A is said to be strictly diagonally dominant (by rows) if |a_{ii}| > Σ_{j≠i} |a_{ij}| for all i.

Theorem 10 If A is strictly diagonally dominant (by rows), then it is strictly regular,


hence the LU factorization exists.

Proof: Take any non-zero vector x ∈ R^n, and let x_k be a component of largest absolute value, that is, |x_k| ≥ |x_i| for all i. Then the strict diagonal dominance of A gives, for the k-th component of Ax,

    |(Ax)_k| = |a_{kk} x_k + Σ_{j≠k} a_{kj} x_j| ≥ |a_{kk}| |x_k| − Σ_{j≠k} |a_{kj}| |x_j| ≥ ( |a_{kk}| − Σ_{j≠k} |a_{kj}| ) |x_k| > 0,

since |x_j| ≤ |x_k| for all j and Σ_{j≠k} |a_{kj}| < |a_{kk}|. Hence Ax ≠ 0 for any nonzero x; that is, A is nonsingular. The leading submatrices of a strictly diagonally dominant matrix are themselves strictly diagonally dominant, hence the strict regularity of A. □
Corollary 4 No pivoting is necessary in the decomposition process if A is symmetric and
positive definite.
Proof: In this case x^T A x > 0 for all x ∈ R^n \ {0}. Let A_k be the leading principal sub-matrix of A of order k and let v'_k = (v_1, · · · , v_k)^T ∈ R^k be any nonzero vector. Choose

    x^T = v^T ≡ (v_1, v_2, · · · , v_k, 0, 0, · · · , 0).   (3.33)

Then v'_k^T A_k v'_k = v^T A v > 0, so all such leading sub-matrices are positive definite (in particular nonsingular), and Theorem 8 guarantees the decomposition without interchanges. □


Theorem 11 If A is non-singular, there is always a set of row interchanges for which
an LU decomposition exists.

Proof: Suppose otherwise. Then at some point, say step k, there is no non-zero a^{(k)}_{ik}, i = k, k + 1, · · · , n, that can be chosen as a pivot, so we can partition

    A^{(k)} = ( A_k^{(k)}   X )
              ( 0           Y )

where A_k^{(k)} is a k × k upper triangular matrix whose (k, k) element is zero. Consequently Det(A^{(k)}) = 0 and A^{(k)} is singular. Since A^{(k)} is obtained from A by non-singular elementary row operations, A must then be singular too, contradicting the hypothesis that A is non-singular. □

Definition 19 (Sparse matrices) An n × n matrix is called sparse if nearly all the


elements of the matrix A are zero. Most useful examples of sparse matrices include banded
matrices

Definition 20 (Banded matrices.) An n × n matrix A is called banded if there exists an integer r < n such that a_{ij} = 0 for all |i − j| > r.

For a banded matrix, all non-zero elements lie in a band of width 2r + 1 along the main
diagonal.

For example, r = 1 gives a tridiagonal matrix (at most 3 non-zero diagonals), r = 2 a pentadiagonal matrix (5 diagonals) and r = 3 a septadiagonal matrix (7 diagonals); in each case the possibly non-zero entries (marked ∗ in the original display) are confined to the band |i − j| ≤ r about the main diagonal.

It is frequently required to solve very large systems of the form Ax = b with sparse matrices (n = 10^6 is considered small in this context!). The LU factorization of such a sparse matrix will make sense only if L and U inherit the sparsity of the matrix A (so that the cost of computing Ux, say, is comparable with that of computing Ax). To this end we have the following result:

Theorem 12 Let A = LU be the LU factorization of A(without pivoting). Then (1) all


leading zeroes in the rows of A to the left of the main diagonal are inherited by L and (2)
all leading zeroes in the columns of A above the main diagonal are inherited by U.
Proof: Write the sparse A = LU and solve the system of n^2 equations a_{ij} = Σ_{k=1}^n l_{ik} u_{kj}, i, j = 1, 2, · · · , n, for the l_{ij} and u_{ij} to obtain the results. □
A corollary to this result is the following
Corollary 5 If A is a banded matrix and A = LU, then L and U are also banded matrices
with the same band as A.

Remark 3 Theorem 12 suggests that for a factorization of a sparse but not nicely struc-
tured matrix, one might try to reorder its rows and columns by a preliminary calculation
so that many of the zero elements become leading zero elements in rows and columns, thus
reducing the fill in the L and U.
 
Example 6 Find the LU decomposition of the matrix

    A = [ 5  1  1  1  1 ]
        [ 1  1  0  0  0 ]
        [ 1  0  1  0  0 ]
        [ 1  0  0  1  0 ]
        [ 1  0  0  0  1 ]
Solution: Blind application of the theory, without any observation that the matrix is sparse but has no structure, gives A = LU with

    L = [ 1     0     0     0     0 ]        U = [ 5   1     1     1     1   ]
        [ 1/5   1     0     0     0 ]            [ 0   4/5  −1/5  −1/5  −1/5 ]
        [ 1/5  −1/4   1     0     0 ]            [ 0   0     3/4  −1/4  −1/4 ]
        [ 1/5  −1/4  −1/3   1     0 ]            [ 0   0     0     2/3  −1/3 ]
        [ 1/5  −1/4  −1/3  −1/2   1 ]            [ 0   0     0     0     1/2 ]

The decomposition has gone through, but there is significant fill-in in both L and U: most zeros have been replaced by non-zero numbers, which will complicate the forward and backward substitution processes. However, if we pre-multiply A by the permutation matrix

    P = [ 0  0  0  0  1 ]
        [ 0  1  0  0  0 ]
        [ 0  0  1  0  0 ]
        [ 0  0  0  1  0 ]
        [ 1  0  0  0  0 ]

which interchanges rows 1 and 5, we obtain

    PA = [ 1  0  0  0  1 ]   [ 1  0  0  0  0 ] [ 1  0  0  0   1 ]
         [ 1  1  0  0  0 ]   [ 1  1  0  0  0 ] [ 0  1  0  0  −1 ]
         [ 1  0  1  0  0 ] = [ 1  0  1  0  0 ] [ 0  0  1  0  −1 ]
         [ 1  0  0  1  0 ]   [ 1  0  0  1  0 ] [ 0  0  0  1  −1 ]
         [ 5  1  1  1  1 ]   [ 5  1  1  1  1 ] [ 0  0  0  0  −1 ]

and we have much less fill-in in the decomposition. We can do even better and interchange the first and last rows and columns, pre- and post-multiplying A by a permutation matrix P̃ (that you should find), to obtain

    P̃ A P̃ = [ 1  0  0  0  1 ]   [ 1  0  0  0  0 ] [ 1  0  0  0  1 ]
             [ 0  1  0  0  1 ]   [ 0  1  0  0  0 ] [ 0  1  0  0  1 ]
             [ 0  0  1  0  1 ] = [ 0  0  1  0  0 ] [ 0  0  1  0  1 ]
             [ 0  0  0  1  1 ]   [ 0  0  0  1  0 ] [ 0  0  0  1  1 ]
             [ 1  1  1  1  5 ]   [ 1  1  1  1  1 ] [ 0  0  0  0  1 ]

There is a distinct advantage in doing such a preliminary reordering of the system.
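The effect of the reordering on fill-in is easy to check numerically. Below is a minimal sketch (assuming Python with NumPy; the function name lu_nopivot is ours) that factorizes A and PA without pivoting and counts the non-zero entries of the factors:

    import numpy as np

    def lu_nopivot(A):
        """Doolittle LU factorization without pivoting (assumes it exists)."""
        A = A.astype(float)
        n = A.shape[0]
        L = np.eye(n)
        U = A.copy()
        for k in range(n - 1):
            for i in range(k + 1, n):
                L[i, k] = U[i, k] / U[k, k]
                U[i, k:] -= L[i, k] * U[k, k:]
        return L, U

    A = np.array([[5, 1, 1, 1, 1], [1, 1, 0, 0, 0], [1, 0, 1, 0, 0],
                  [1, 0, 0, 1, 0], [1, 0, 0, 0, 1]])
    P = np.eye(5)[[4, 1, 2, 3, 0]]        # swap rows 1 and 5
    for name, M in [("A", A), ("PA", P @ A)]:
        L, U = lu_nopivot(M)
        print(name, "nonzeros in L and U:",
              np.count_nonzero(L) + np.count_nonzero(U))

Running this shows markedly fewer non-zero entries for PA than for A, in agreement with the factors displayed above.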

3.2.7 Operation Counts
Gauss Elimination At the k-th stage, we need

    (n − k) divisions to form the m_{ik},
    (n − k)^2 additions and multiplications to form the a_{ij}^{(k+1)}.

    Usually, addition takes less time than multiplication and division, so we compare different processes fairly simply by counting the latter two operations. For the whole elimination process on the matrix, the number of these is

        Σ_{k=1}^{n−1} (n − k)(n − k + 1) = (1/3)(n^3 − n).   (3.34)

    For each right hand side vector b, the number of multiplications associated with the elimination process is

        Σ_{k=1}^{n−1} (n − k) = (1/2) n(n − 1).   (3.35)

    In the back substitution, the number of multiplications and divisions is Σ_{k=1}^{n} k = (1/2) n(n + 1). So the total number of operations for each vector b is approximately n^2. The count for the complete process with one right hand side is then

        (1/3)(n^3 − n) + (1/2) n(n − 1) + (1/2) n(n + 1) = (1/3)(n^3 + 3n^2 − n) ∼ (1/3) n^3.   (3.36)

Cramer's Rule In Cramer's rule, we obtain x_k as the ratio of two determinants, each of order n. That is,

        x_k = |B_k| / |A|,  k = 1, 2, . . . , n,   (3.37)

    where B_k is the matrix A with column k replaced by the right hand side column vector b. The total number of operations is (n + 1) times the number of operations required to evaluate a determinant of order n. If this is N_n, then N_n = n · N_{n−1} ⇒ N_n = n!, and thus the total number of operations is (n + 1)!.

Notes 4 We make the following remarks


1. By LU decomposition or factorization of (a nonsingular) matrix A, we understand a
representation of A as a product of two matrices L and U where L is lower triangular
with unit diagonal and U is upper triangular.

2. We note that non-singularity of U is a natural requirement in all practical applications. In fact, if we insist that L be a unit lower triangular matrix, then Det(A) = Det(L)Det(U) = Det(U), and we see that non-singularity of U in this case follows from non-singularity of A.

3. We can use the matrix A itself to store the elements of U and those of L.

4. One can use the LU factorization to

   (a) Solve a system of linear equations of the form Ax = b in the two steps outlined above: solve (i) Ly = b and then (ii) Ux = y to obtain the solution x (a sketch of this is given after these notes).
   (b) Calculate the determinant of the matrix A as Det(A) = Det(L)Det(U) = Det(U) = Π_{i=1}^n u_{ii}.
   (c) Find the inverse of the matrix A as A^{-1} = U^{-1} L^{-1}.

5. The cost: can we estimate the operation count? If we insist that it is only multiplications and divisions that matter (though this is a restriction), then:

   (a) At the k-th step of LU decomposition, we perform (n − k) divisions and (n − k)^2 multiplications. Hence if N_{LU} is the number of operations, we have N_{LU} = Σ_{k=1}^{n−1} [(n − k)^2 + (n − k)] = (1/3) n^3 + O(n^2).
   (b) In the forward substitution we use about k multiplications/divisions to determine y_k. Thus if N_F is the operation count to do this, then we can establish that N_F = Σ_{k=1}^{n} k ∼ (1/2) n^2.

6. The only important difference between LU decomposition and Gauss elimination is


that we do not consider the right hand side in LU decomposition until the factoriza-
tion is complete.

7. We can use a direct computation method to get the LU decomposition simply by solving the system of equations A = LU ⇔ Σ_{k=1}^n l_{ik} u_{kj} = a_{ij}, where the l_{ik} and u_{kj} are the entries of the matrices L and U.

8. It is frequent in applications to solve very large systems of the form Ax = b where A is a sparse matrix (n = 10^6 is small in this context). The LU factorization of sparse matrices should be handled with care, since it will make sense only if there is little fill-in and L and U inherit the sparse structure of the matrix A.

9. Theorem 12 suggests that if we are to carry out an LU factorization on a sparse


matrix which is not structured, one might first try to reorder the equations to intro-
duce a structure so that many zero elements become leading zero elements in rows
and columns, to reduce fill in.
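As a concrete illustration of point 4(a) above, here is a minimal sketch (assuming Python with NumPy, and an L with unit diagonal as produced by the factorization described in these notes) of the forward and backward substitution steps:

    import numpy as np

    def solve_lu(L, U, b):
        """Solve Ax = b given A = LU: forward solve Ly = b, then back solve Ux = y."""
        n = len(b)
        y = np.zeros(n)
        for i in range(n):                               # forward substitution
            y[i] = b[i] - np.dot(L[i, :i], y[:i])        # L assumed to have unit diagonal
        x = np.zeros(n)
        for i in range(n - 1, -1, -1):                   # back substitution
            x[i] = (y[i] - np.dot(U[i, i+1:], x[i+1:])) / U[i, i]
        return x

Once L and U are available, each additional right hand side costs only the roughly n^2 operations counted in point 5(b).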

3.2.8 Alternative forms of decomposition: LDU and QR


LDU Decomposition
Some authorities approach the factorization problem by first demanding that the matrix
A be strictly regular

Definition 21 (Strictly Regular Matrices) An n × n matrix A is called strictly reg-


ular if all its leading sub matrices, together with A itself, are non-singular

The consequence of this definition is the content of Theorem 8. An alternative proof of the uniqueness would go like this: suppose A is strictly regular and has two LU decompositions A = L_1 U_1 and A = L_2 U_2. Then L_i and U_i are non-singular by the definition of the decomposition, and hence L_2^{-1} L_1 = U_2 U_1^{-1}. But by Lemma 4, L_2^{-1} and U_1^{-1} are respectively lower and upper triangular, and so the products L_2^{-1} L_1 and U_2 U_1^{-1} are also respectively lower and upper triangular by Lemma 3. Thus the matrix L_2^{-1} L_1 = U_2 U_1^{-1} is simultaneously lower and upper triangular and hence must be diagonal. Furthermore, since L_2^{-1} L_1 is a unit lower triangular matrix, we have L_2^{-1} L_1 = I ⇒ L_1 = L_2. Also L_2^{-1} L_1 = I = U_2 U_1^{-1} ⇒ U_1 = U_2, and thus the LU decomposition is unique.

Instead of the LU decomposition of a matrix A, we can write A = LDU, where L is unit lower triangular, D is diagonal and U is a unit upper triangular matrix. We easily establish that if A is symmetric and strictly regular, then such a decomposition exists, and it is unique: in that case we have LDU = A = A^T = U^T D L^T, and since the LDU factorization is unique, U = L^T. We can prove the following result about positive definite matrices:

Theorem 13 If A is an n × n symmetric matrix, then it is positive definite if and only


if it has an LDLT factorization in which the diagonal elements of D are all positive.

Proof: Exercise1 . 

Theorem 14 If A is strictly diagonally dominant (by rows), then it is strictly regular


and therefore admits an LU decomposition.

Some decompositions have special names owing to the special characters of the matrices
involved:

1. A = L̃ L̃^T decomposition – Cholesky decomposition, where L̃ is lower triangular with positive diagonal (Corollary 3).

2. A = (LD)U– Crout Decomposition

3. A = L(DU)– Doolittle Decomposition

QR Factorization
We need the related concept of inner product spaces.

Definition 22 (Inner product space.) An inner product on a vector space X is a func-


tion (·, ·) : X × X → R with the following properties

(I1 ) ∀x ∈ X, (x, x) ≥ 0 with equality only if x = 0,

(I2 ) ∀x, y ∈ X, (x, y) = (y, x),

(I3 ) ∀x, y ∈ X, ∀α ∈ R, (αx, y) = α(x, y),


(I4 ) ∀x, y, z ∈ X, (x + y, z) = (x, z) + (y, z).

A vector space endowed with an inner product is called an inner product space. If (x, y) = 0, the vectors x and y are called orthogonal. A set of vectors x_i ∈ X is called orthonormal if

    (x_i, x_j) = δ_{ij} := 1 if i = j, 0 if i ≠ j.

For x ∈ X, the function ‖x‖ := (x, x)^{1/2} is called the norm of x (induced by the given inner product), and we can prove the Cauchy-Schwartz inequality

    |(x, y)| ≤ ‖x‖ ‖y‖, ∀x, y ∈ X.

(¹ Thus we can determine whether or not a matrix is positive definite by attempting to form its LDL^T decomposition; but this is not a very competitive method!)
Example 7 For X = Rm , the following rule defines the so-called Euclidean inner product
Xm
(u, v) = uT v = ui vi , ∀u, v ∈ Rm
i=1

Example 8 If X = C([a, b]; R) the associated norm for any two functions f, g ∈ X may
be defined as in (2.1).
Next we define what it means for a matrix to be orthogonal.

Definition 23 (Orthogonal Matrices) An m × n matrix Q (m ≥ n) is said to be orthogonal if Q^T Q = I. For example the matrix

    Q = (1/3) [ 2   2   1 ]
              [ 1  −2   2 ]
              [ 2  −1  −2 ]

is orthogonal. Thus, the columns q_1, q_2, · · · , q_n of an orthogonal Q satisfy q_i^T q_j = δ_{ij}, that is, they are orthonormal with respect to the Euclidean inner product in R^m. For a square Q, we get

    Q^{-1} = Q^T,   I = Q^T Q = Q Q^T = (Q^T)^T Q^T,

therefore Q^T is also an orthogonal matrix, so the rows of Q are orthonormal as well. In particular, the columns and rows of a square orthogonal Q ∈ R^{n×n} satisfy Σ_{i=1}^n q_{ij}^2 = Σ_{j=1}^n q_{ij}^2 = 1. It follows also that a square orthogonal Q is non-singular. Moreover,

1 = Det(I) = Det(QQT ) = Det(Q)Det(QT ) = (Det(Q))2 ⇒ Det(Q) = ±1.


Definition 24 (QR factorization.) The QR factorization of an m × n matrix A is the
representation A = QR where Q is an m × m orthogonal matrix and R is m × n upper
triangular matrix.


 
Schematically, for m ≥ n,

    A (m × n) = Q (m × m) R (m × n),     R = ( R̂ )
                                             ( 0 )

where R̂ is an n × n upper triangular matrix with entries r_{11}, r_{12}, · · · , r_{nn} and the last m − n rows of R are zero.
Due to the bottom zero elements of R, the columns q_{n+1}, · · · , q_m are not essential for the representation itself. Hence we can safely write

    A = Q̂ R̂,   (3.38)

where Q̂ is the m × n matrix consisting of the first n columns of Q (so Q̂^T Q̂ = I_n) and R̂ is the n × n upper triangular matrix formed by the first n rows of R. This form of the QR representation is called the skinny QR factorization.
Theorem 15 Every matrix A has a QR factorization. Furthermore, if A is non-singular, the QR factorization is unique (once the diagonal of R is taken positive).

Proof: Let A be non-singular and suppose A = QR. Then A^T A = R^T R is positive definite, and so it has a unique Cholesky factorization A^T A = L̃ L̃^T with L̃ having a positive main diagonal. Thus R^T = L̃ is uniquely determined, and then Q = A R^{-1}. □
Remark 4 (Solving by QR factorization) If A = QR is a square non-singular matrix, we solve Ax = QRx = b by first forming y = Q^T b (which solves Qy = b, since Q^{-1} = Q^T) and then solving the upper triangular system Rx = y by back substitution.

3.2.9 Gram-Schmidt orthogonalization


Let A = QR be sought. If we denote by q_j and a_j the columns of Q and A respectively, then it follows from (3.38) that, given a_k ∈ R^m, we want to find orthonormal q_j ∈ R^m that satisfy a_k = Σ_{j=1}^k r_{jk} q_j, k = 1, 2, · · · , n. This problem is solved for an arbitrary inner product space X.
Theorem 16 (Gram-Schmidt) Let X be an inner product space, and let the elements
a1 , a2 , · · · , an ∈ Xbe linearly independent. There exist elements q 1 , q 2 , · · · , q n ∈ X such
that
k
X
ak = rjk q j , k = 1, 2, · · · , n, and (q i , q j ) = δij (3.39)
j=1

Proof: From the first of (3.39), since a_1 ≠ 0, we have

    r_{11} q_1 = a_1, ‖q_1‖ = 1 ⇒ r_{11}^2 = (a_1, a_1) ⇒ r_{11} = ‖a_1‖,  q_1 = a_1/‖a_1‖.
Suppose that the elements q j that satisfy (3.39) have been constructed for j < k. The
next element q k should be part of the orthonormal set q 1 , · · · , q k−1 , q k and should satisfy
k−1
X
rkk q k + rjk q j = ak (3.40)
j=1

Taking the scalar product of both sides of (3.40) with q_j and using orthogonality, we find the first (k − 1) coefficients as r_{jk} = (q_j, a_k) for j = 1, 2, · · · , k − 1. Substituting these back into (3.40) we obtain

    r_{kk} q_k = a_k − Σ_{j=1}^{k−1} (q_j, a_k) q_j =: b_k  ⇒  r_{kk} = ‖b_k‖,  q_k = b_k/‖b_k‖.

By construction, each q_j is a linear combination of the a_i, i ≤ j, so that b_k = a_k − Σ_{j=1}^{k−1} c_j a_j for some scalars c_j. By the linear independence of a_1, · · · , a_k, we have b_k ≠ 0, and so q_k is well defined. □
Remark 5 (Algorithm for QR decomposition) For k = 1 to n do
    b(k) := a(k);
    for j = 1 to k − 1 do (void if k = 1)
        r(j, k) := < q(j), a(k) >;
        b(k) := b(k) − r(j, k) q(j)
    end
    r(k, k) := sqrt(< b(k), b(k) >);
    q(k) := b(k)/r(k, k)
end
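A minimal runnable version of this algorithm (assuming Python with NumPy; the function name gram_schmidt_qr is ours) is given below; applied to the matrix of Example 9 it reproduces the factors found there.

    import numpy as np

    def gram_schmidt_qr(A):
        """Skinny QR by classical Gram-Schmidt; the columns of A are assumed
        to be linearly independent."""
        A = np.asarray(A, dtype=float)
        m, n = A.shape
        Q = np.zeros((m, n))
        R = np.zeros((n, n))
        for k in range(n):
            b = A[:, k].copy()
            for j in range(k):
                R[j, k] = np.dot(Q[:, j], A[:, k])   # r_jk = (q_j, a_k)
                b -= R[j, k] * Q[:, j]
            R[k, k] = np.linalg.norm(b)
            Q[:, k] = b / R[k, k]
        return Q, R

    A = np.array([[2., 4., 5.], [1., -1., 1.], [2., 1., -1.]])
    Q, R = gram_schmidt_qr(A)      # reproduces Q and R of Example 9 below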
 
Example 9 Find the QR factorization by Gram-Schmidt of

    A = [ 2   4   5 ]
        [ 1  −1   1 ]
        [ 2   1  −1 ]

Solution:

    r_{11} = ‖a_1‖ = 3,  q_1 = a_1/r_{11} = (1/3)(2, 1, 2)^T;
    r_{12} = (q_1, a_2) = 3,  b_2 = a_2 − r_{12} q_1 = (4, −1, 1)^T − (2, 1, 2)^T = (2, −2, −1)^T,
    r_{22} = ‖b_2‖ = 3,  q_2 = b_2/r_{22} = (1/3)(2, −2, −1)^T;
    r_{13} = (q_1, a_3) = 3,  r_{23} = (q_2, a_3) = 3,
    b_3 = a_3 − r_{13} q_1 − r_{23} q_2 = (5, 1, −1)^T − (2, 1, 2)^T − (2, −2, −1)^T = (1, 2, −2)^T,
    r_{33} = ‖b_3‖ = 3,  q_3 = b_3/r_{33} = (1/3)(1, 2, −2)^T.

So we obtain the decomposition

    A = [ 2   4   5 ]          [ 2   2   1 ] [ 3  3  3 ]
        [ 1  −1   1 ] = (1/3)  [ 1  −2   2 ] [ 0  3  3 ]  = Q R.
        [ 2   1  −1 ]          [ 2  −1  −2 ] [ 0  0  3 ]

3.2.10 Solving Tridiagonal systems:The Thomas Algorithm
The general Tridiagonal system may be written in the form

ai ui−1 + bi ui + ci ui+1 = di , i = 0, 1, 2, · · · , n, (3.41)

where a_i, b_i, c_i and d_i are coefficients of the system, usually arising from the data and parameters of the problem under consideration, and u_i is the approximate value of u(x_i) at the point x_i ∈ [a, b]. Sometimes it is advisable to assemble the unknowns u_{−1}, u_0, u_1, · · · , u_n, u_{n+1} into a single vector x, so that in matrix form we
have
    
    [ a_0  b_0  c_0   0    0    0    · · ·   0        0      ] [ u_{−1} ]   [ d_0     ]
    [ 0    a_1  b_1   c_1  0    0    · · ·   0        0      ] [ u_0    ]   [ d_1     ]
    [ 0    0    a_2   b_2  c_2  0    · · ·   0        0      ] [ u_1    ]   [ d_2     ]
    [ 0    0    0     a_3  b_3  c_3  · · ·   0        0      ] [ u_2    ] = [ d_3     ]      (3.42)
    [ .    .    .     .    .    .    .       .        .      ] [ .      ]   [ .       ]
    [ 0    0    0    · · · 0   a_{n−1} b_{n−1} c_{n−1} 0     ] [ u_n    ]   [ d_{n−1} ]
    [ 0    0    0    · · · 0    0     a_n     b_n     c_n    ] [ u_{n+1}]   [ d_n     ]

In most applications, u_{−1} and u_{n+1} are often zero because their values would fall outside the domain of definition of the system; in any event such end cases are treated separately. Tridiagonal systems typically arise when finite difference formulas are used to discretise a second order ordinary differential equation on an interval [a, b].

If the differential equation is an initial value problem, then the two initial conditions will be prescribed for u at the initial point x = a. For example, u and its derivative u' may be prescribed at x = a. If u(a) = u_a and u'(a) = u'_a, then we divide the interval [a, b], wherein a unique solution is known to exist, into points x_i = a + ih, h = (b − a)/n, so that x_0 = a and x_n = b. Then we assume the existence of a fictitious grid line at i = −1 and apply the given initial data to obtain

    u_0 = u_a,   u'(a) = u'_a ⇒ (u_1 − u_{−1})/(2h) = u'_a ⇔ u_{−1} = u_1 − 2h u'_a.   (3.43)

Substituting i = 0 in (3.41) gives

    a_0 u_{−1} + b_0 u_0 + c_0 u_1 = d_0,   (3.44)

and eliminating u_{−1} between (3.43) and (3.44) we obtain

    u_0 = u_a,   u_1 = (d_0 − b_0 u_a + 2 a_0 h u'_a)/(a_0 + c_0),   u_{i+1} = (d_i − b_i u_i − a_i u_{i−1})/c_i,  i = 1, 2, · · · , n,   (3.45)

and the solution scheme is essentially an explicit scheme.

The Thomas Algorithm


If the differential equation is a boundary value problem, the problem is no longer explicit
and we must solve the system (3.42) as a matrix system. The Thomas algorithm is an efficient way of solving tridiagonal matrix systems. It is based on LU decomposition, in
which the matrix system Ax = b is rewritten as LUx = b where L is a lower triangular
matrix and U is an upper triangular matrix. The system can be efficiently solved by
setting Ux = y and then solving first Ly = b for y and then Ux = y for x. The Thomas
algorithm consists of two steps. In Step 1 decomposing the matrix into A = LU and
solving Ly = b are accomplished in a single downwards sweep, taking us straight from
Ax = b to Ux = y. In step 2 the equation Ux = y is solved for x in an upwards sweep.
We start by seeking a solution of the form

ui = ei ui+1 + fi , i = n − 1, n − 2, · · · , 0 (3.46)

and then select the coefficients ei , fi so that the resulting solution is the sought after
solution. If we assume a solution of the form indicated then we must have that

ui−1 = ei−1 ui + fi−1 , i = 1, 2, · · · , n (3.47)

Substitute this into the (3.41) and rearrange to have

−ci fi − ai fi−1
ui = ui+1 + (3.48)
bi + ai ei−1 bi + ai ei−1

We compare (3.46) and (3.48) to see that the two are of the same form so that equality
will be possible if for i = 1, 2, · · · , n, we have
−ci fi − ai fi−1
ei = , fi = (3.49)
bi + ai ei−1 bi + ai ei−1
The starting values for e0 and f0 are obtained by applying the boundary conditions that
the system must satisfy at x = a. For example, if u(a) = ua and u(b) = ub . Then we
apply the backward recursion (3.46) from i = n − 1, n − 2, n − 3, · · · , 1 with un = ub ,
e0 = 0, f0 = ua .
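The two sweeps translate directly into code. Below is a minimal sketch (assuming Python with NumPy, and assuming the boundary data have already been folded into the coefficient arrays so that a[0] = c[n−1] = 0); the function name thomas is ours:

    import numpy as np

    def thomas(a, b, c, d):
        """Solve a_i u_{i-1} + b_i u_i + c_i u_{i+1} = d_i, i = 0..n-1,
        with a[0] = c[n-1] = 0, using the two-sweep Thomas algorithm."""
        n = len(d)
        e = np.zeros(n)
        f = np.zeros(n)
        # downward sweep, equations (3.49)
        e[0] = -c[0] / b[0]
        f[0] = d[0] / b[0]
        for i in range(1, n):
            denom = b[i] + a[i] * e[i - 1]
            e[i] = -c[i] / denom
            f[i] = (d[i] - a[i] * f[i - 1]) / denom
        # upward sweep, equation (3.46)
        u = np.zeros(n)
        u[-1] = f[-1]
        for i in range(n - 2, -1, -1):
            u[i] = e[i] * u[i + 1] + f[i]
        return u

The index convention here (i = 0, · · · , n − 1) differs slightly from the one above, but the recursions are exactly (3.46) and (3.49).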

3.2.11 Exercises
1. Let A be an n × n matrix and xT be a vector in Rn . For a given column vector
b with bT ∈ Rn , it is required to solve the system of equations Ax = b for the
unknowns x1 , x2 , · · · , xn .

(i) Explain briefly how you would use (a) Gauss elimination and (b) Cramer’s
rule, to find xT ∈ Rn
(ii) Show that the number of operation counts in Gauss elimination is (1/3)(n^3 + 3n^2 − n) ≈ (1/3) n^3 for large n, and for Cramer's rule is (n + 1)!. [Note: An operation count is the number of arithmetic operations (additions, multiplications and divisions) required to carry out the direct method and obtain the solution. For example, to solve 2x + 3 = 4 we need two operations: one addition and one division.]

(iii) Show how you will use the LU decomposition of A to find the inverse of the matrix, and also show that the number of operation counts to obtain the inverse of a matrix through this route is approximately (1/3) n^3; compare with the number of operation counts you would get by finding the inverse via the transpose of the matrix of cofactors.

2. For some classes of matrices, it is possible to perform an LU decomposition in which U = L^T and L L^T = A. Show that such an L L^T decomposition of A is possible in real arithmetic if and only if l_{i,1}^2 + l_{i,2}^2 + · · · + l_{i,i−1}^2 < a_{i,i}, i = 1, 2, 3, · · · , n. Deduce that if A is positive definite and symmetric, this restriction never arises. (Note: L is a lower triangular matrix.)

3. An n × n matrix A is said to be triple-diagonal or tri-diagonal if a_{i,j} = 0 unless |i − j| ≤ 1. Show that if such a triple-diagonal matrix is symmetric and positive definite, then its L L^T decomposition is simple and has the form

    l_{i,i} = sqrt( a_{i,i} − l_{i,i−1}^2 ),   l_{i+1,i} = a_{i+1,i} / l_{i,i},   ∀i.

Use this algorithm to find the solution of the system

2x1 − x2 = 1
−xi−1 + 2xi − xi+1 = 0, i = 2, · · · , n − 1,
−xn−1 + 2xn = 0

for n = 5.

4. Write a program in your favourite language to implement the Thomas algorithm in


the general form and test your algorithm with the preceding problem when n = 100.

5. Let
   
4 2 1 1 1 1 1 1
 24 12 3 4   2 −1 3 −4 
A = 
 8 4 8 16 
, B =  ,
 4 1 5 −2 
4 2 1 −6 −1 2 −2 5
   
3 −10
 20 
 , c =  −7 
 
b = 
 −8   −27 
10 −3

Show that the system Ax = b has an infinite number of solutions in which one of
the variables (which one?) can be chosen arbitrarily. Show also that the system
Bx = c has an infinite number of solutions in which two of the variables can be
chosen arbitrarily. Show further that if, for example, the numbers −3 or 27 are
changed, the system Bx = c has no solution.

6. Scaling during the solution of systems Ax = b:
Pivoting is not always the remedy for computational difficulties. Consider the fol-
lowing example
 −6    
10 −1 x1 1
= . (3.50)
10−6 10−12 x2 10−6

During Gauss elimination with partial pivoting, no pivoting will be necessary (That
is, it will not be necessary to interchange rows). Suppose we have a six digit com-
puter. Triangularization will give
 −6    
10 −1 x1 1
= . (3.51)
0 1 + 10−12 x2 −1 + 10−6

Now, with a six digit computer, 1 + 10−6 and 1 + 10−12 are both equal to 1. so
x2 = −1, x1 = 0. Using, say a higher accuracy computer we have x2 = −1 + 10−6 ,
x1 = 1, so to six significant digits gives x1 = 1, x2 = −1. Suppose we scale the
system: Multiply second equation in (3.50) by 106 to have the system
 −6    
10 −1 x1 1
−6 = . (3.52)
1 10 x2 1

Triangularization then gives


    
1 10−6 x1 1
= . (3.53)
0 −1 − 10−12 x2 1 − 10−6

After interchanging rows 1 and 2 of the scaled system (3.52) (pivoting). So scaling
was important here. Now consider the system
    
0.1 −1 1 x1 −1
 20 40 2000   x2  =  1840  . (3.54)
8 15 200 x3 135

Triangularize system (3.54) ( by hand) using Gauss elimination with pivoting and
record the stages at which pivoting has taken place. Next reconsider the system,
this time with scaling. Triangularize again and report on the effects of scaling on
the solution process and pivoting strategy.

7. Let I_2 be the identity matrix of order 2. Show by direct computation that the non-singular matrix A = [ 0  1 ; 1  0 ] has no LU decomposition, while the singular matrix A + I_2 has one.

8. Practical project: It is required to solve the problem Ax = b for a give vector b


and an n×n matrix A. Write computer programmes in your favourite language that
will accept an arbitrary n, the matrix A, and right hand side b, to do the following

(a) Solve the problem by calculating the inverse of A and then computing the
solution x = A−1 b
(b) Solve the problem by Cramers rule by calculating determinants and returning
the solution x
(c) Solve the problem by Gauss elimination with partial pivoting to return the
solution x
(d) Solve the problem by LU decomposition and return the solution x
(e) Find the inverse and determinant of the matrix by the route of LU decompo-
sition.
(f) Consider the 4×4 system 5x1 +7x2 +6x3 +5x4 = 23, 7x1 +10x2 +8x3 +7x4 = 32,
6x1 + 8x2 + 10x3 + 9x4 = 33, 5x1 + 7x2 + 9x3 + 10x4 = 31.
i. Use your programme that you would have written above to find the inverse
of the coefficient matrix for this problem.
ii. In theory, A−1 A = AA−1 , but in practice, because of round of errors we
cannot always expect equality. Calculate and print the A−1 A and AA−1
from your programme and report any discrepancies.
iii. Small changes in data (right hand side) can sometimes lead to relatively
large changes in the solution. Use your programme to solve the linear
equations by Gauss elimination to solve the problem given in this exam-
ple when the right hand sides is altered to contain, in turn the following
vectors: b1 = (23.01, 31.99, 32.99, 31.01)T , b2 = (23.1, 31.9, 32.9, 31.1)T .
iv. Small changes in data (entries of matrix A) can sometimes lead to relatively
large changes in the solution. Now let the system be modified so that
the (1, 1) entry respectively takes the value 5.01 and 4.99, instead of the
original value 5. Use your programme (LU Factorization) above to find
A−1 in both cases and compare the results with the exact value of A−1 .
(g) The entries of an n × n Hilbert matrix H = (h_{ij}) are given by h_{ij} = 1/(i + j − 1). Hilbert matrices are considered badly behaved or ill-conditioned, and are often used to test routines. Use your programme for LU factorization to find the inverse of a 4 × 4 Hilbert matrix. Test your answer by calculating H H^{-1} and H^{-1} H. For full effect, you should compute the entries of H rather than reading them as data.

3.3 Error Analysis


3.3.1 Vector and Matrix Norms
It is desirable to define vector and matrix norms that can be used to determine errors in
numerically generated solutions of (linear) equations.
Definition 25 (Norm) A norm ‖x‖ of a vector x ∈ R^n is a real number for which
(i) ∀x ∈ Rn , kxk ≥ 0; positivity

(ii) kxk = 0 ⇔ x = 0; definiteness

(iii) ∀x ∈ Rn , κ ∈ R, kκxk = |κ|kxk; scaling.

(iv) ∀x, y ∈ Rn , kx + yk ≤ kxk + kyk; Triangle inequality.


Exercise: Following this definition of a norm, show that kx − yk ≥ |kxk − kyk|
Definition 26 (p-norm of a vector in R^n) The p-norm of a vector x ∈ R^n is given as

    ‖x‖_p = ( Σ_{i=1}^n |x_i|^p )^{1/p}.   (3.55)

Exercise: verify that the p-norms are indeed norms in the sense of Definition 25. The triangle inequality for the p-norm, for any p, is known as the Minkowski inequality; to prove it for general p one makes use of the Cauchy-Schwartz inequality. Commonly used values of p are p = 1, 2 and p = ∞, as given in (2.5)-(2.7). Starting with the vector norms, we can define a matrix or subordinate norm with the same properties as the vector norm as follows:
Definition 27 (Matrix Norm.) For each vector p-norm in R^n, there is a subordinate or consistent matrix p-norm ‖A‖_p, where A ∈ R^{n×n} is an n × n matrix, defined such that

    ‖A‖_p = max_{‖x‖_p = 1} ‖Ax‖_p = max_{x ≠ 0} ‖Ax‖_p / ‖x‖_p.   (3.56)

From this definition we see at once the compatibility condition

kAxkp ≤ kAkp kxkp . (3.57)

We can establish the following commonly used norms:

    ‖A‖_1 = max_{1≤j≤n} Σ_{i=1}^n |a_{ij}|   (maximum absolute column sum)   (3.58)

    ‖A‖_F = sqrt( Σ_{i=1}^n Σ_{j=1}^n a_{ij}^2 )   (the Frobenius norm)   (3.59)

    ‖A‖_2 = ρ(A) = max{|λ| : Ax = λx}   (the spectral radius of the matrix A; this formula gives the true 2-norm when A is symmetric, while for a general matrix ‖A‖_2 = sqrt(ρ(A^T A)))   (3.60)

    ‖A‖_∞ = max_{1≤i≤n} Σ_{j=1}^n |a_{ij}|   (maximum absolute row sum)   (3.61)

which can be calculated for any matrix.


 
Example 10 Let

    A = [ 0  0  10  0 ]
        [ 1  1   5  1 ]
        [ 0  1   5  1 ]
        [ 0  0   5  1 ]

Calculate the norms (3.58)-(3.61).

Solution: We apply the definitions directly, recalling that λ in (3.60) is called an eigenvalue of the matrix A and satisfies the equation Det(A − λI) = 0, which leads to a polynomial of degree n in λ whenever A is an n × n matrix and therefore has n values by the fundamental theorem of algebra. For the matrix A of the present example, we have

    ‖A‖_1 = max{1, 2, 25, 3} = 25,   ‖A‖_∞ = max{10, 8, 7, 6} = 10,   ‖A‖_F = sqrt(181) ≈ 13.454,
    ‖A‖_2 = max{|7.03128|, |−0.401914 + 1.29592i|, |−0.401914 − 1.29592i|, |0.772548|}
          = max{7.03128, 1.35681, 1.35681, 0.772548} ≈ 7.031.
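These numbers are easy to reproduce numerically; a minimal sketch (assuming Python with NumPy) is:

    import numpy as np

    A = np.array([[0., 0., 10., 0.],
                  [1., 1.,  5., 1.],
                  [0., 1.,  5., 1.],
                  [0., 0.,  5., 1.]])

    norm_1   = np.linalg.norm(A, 1)          # maximum absolute column sum -> 25
    norm_inf = np.linalg.norm(A, np.inf)     # maximum absolute row sum    -> 10
    norm_F   = np.linalg.norm(A, 'fro')      # Frobenius norm              -> sqrt(181)
    rho      = max(abs(np.linalg.eigvals(A)))  # spectral radius, the quantity used in (3.60)
    print(norm_1, norm_inf, norm_F, rho)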

The norms defined in (3.58)-(3.61) are stressed because they are compatible with the vector norms with p = 1, 2, ∞ in the sense of (3.57). So, for example, we have that for all vectors x ∈ R^n and for all matrices A ∈ R^{n×n},

    ‖Ax‖_1 ≤ ‖A‖_1 ‖x‖_1,   ‖Ax‖_∞ ≤ ‖A‖_∞ ‖x‖_∞,   ‖Ax‖_2 ≤ ‖A‖_F ‖x‖_2,   ‖Ax‖_2 ≤ ‖A‖_2 ‖x‖_2.   (3.62)

It is important to distinguish between vector and matrix norms even though they bear the same notation in most cases. Notice that Ax is a vector, so that ‖Ax‖_1 denotes the 1-norm of a vector while ‖A‖_1 denotes a matrix norm. It is also important to note that the pairs of matrix and vector norms are compatible only in the sense of (3.57) and cannot be mixed. For example, if x = (0, 0, 1, 0)^T and A is the matrix of Example 10, then we have ‖x‖_1 = 1, but ‖Ax‖_1 = 25 and ‖A‖_∞ ‖x‖_1 = 10. That is, we cannot expect to have the inequality ‖Ax‖_1 ≤ ‖A‖_∞ ‖x‖_∞ or ‖Ax‖_1 ≤ ‖A‖_∞ ‖x‖_1. Compatibility is a property that connects vector and matrix norms (of the same index). In general, a compatible matrix norm for any matrix A and any index p is the number

    K_p = inf{K ∈ R : ‖Ax‖_p ≤ K ‖x‖_p, ∀x ∈ R^n}.   (3.63)

From here it is then possible to show that, for example, K_1 = ‖A‖_1, K_2 = ‖A‖_2 and K_∞ = ‖A‖_∞. We note that K_2 ≠ ‖A‖_F. Thus, although we have compatibility in the sense that ‖Ax‖_2 ≤ ‖A‖_F ‖x‖_2, there is indeed a matrix norm K_2 = ‖A‖_2 (the largest singular value of A, which coincides with the largest absolute eigenvalue when A is symmetric), smaller for most matrices than ‖A‖_F, and such that ‖Ax‖_2 ≤ ‖A‖_2 ‖x‖_2 for every vector x ∈ R^n.

3.3.2 Condition of system of equations Ax = b


Consider

    A = [ 1.2969  0.8648 ],   b = [ 0.8642 ]
        [ 0.2161  0.1441 ]        [ 0.1440 ]

The exact solution of Ax = b is x = (2, −2)^T. Consider x̃ = (0.9911, −0.4870)^T, and calculate the residual r = b − Ax̃ = (10^{−8}, −10^{−8})^T. Observe that the residual r appears small even though x̃ is nowhere near the exact solution x. So approximating x by x̃ may be interpreted as making an appropriate perturbation in the data of the problem as prescribed by the right hand side b. That is, x → x̃ corresponds to b → b + δb, where ‖δb‖_∞ = 10^{−8}. If we compute A^{-1}, we find that

    A^{-1} = 10^8 [ 0.1441  −0.8648 ]
                  [ −0.2161  1.2969 ]

and so

‖A‖_∞ ‖A^{-1}‖_∞ ≈ 3.3 × 10^8, a very large number. Now suppose we alter one element of b slightly and write

    A = [ 1.2969  0.8648 ],   b = [ 0.8642  ]
        [ 0.2161  0.1441 ]        [ 0.14399 ]

Then the exact solution to the problem becomes x = (866.8, −1298.9)^T: a very large deviation from the previous solution for a very small disturbance in the data. Thus we observe that small changes in the data lead to very large changes in the solution. This kind of system is said to be ill-conditioned. We must find a way of measuring the condition of a system.
We have

A(x + δx) = b + δb ⇒ Aδx = δb

so that using a compatible matrix norm throughout the argument we have

kδxk = kA−1 δbk ≤ kA−1 kkδbk.

This bound for δx is sharp in the sense that for all matrices A and right hand side vector
b, there exists δb such that we have equality. Also,
kbk
kbk = kAxk ≤ kAkkxk ⇒ kxk ≥ .
kAk

So we have if we define κ(A) = kAkkA−1 k, then we can write

kδxk kδbk
≤ κ(A) . (3.64)
kxk kbk

Definition 28 (Condition Number.) The quantity κ(A) = kAkkA−1 k in equation (3.64)


is called the condition number of the matrix A, and is sometimes abbreviated cond(A).
The condition number is different for different norms and the inequality in (3.64) can be
calculated for each choice of norm. It gives the bound for the relative error in x due to
a perturbation in b.
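For instance, the condition number of the 2 × 2 matrix at the start of this subsection can be computed directly; a minimal sketch (assuming Python with NumPy) is:

    import numpy as np

    A = np.array([[1.2969, 0.8648],
                  [0.2161, 0.1441]])
    kappa = np.linalg.norm(A, np.inf) * np.linalg.norm(np.linalg.inv(A), np.inf)
    print(kappa)    # roughly 3.3e8, so the system is badly ill-conditioned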
Example 11 Consider the 3 × 3 Hilbert-type matrix H with h_{ij} = 1/(i + j). Find its condition number and establish the inequality (3.64) if δb = ±10^{−3}(1, −1, 1)^T and (i) b = (1, −1, 1)^T, (ii) b = (1.0833, 0.7833, 0.6167)^T.

Solution: For this matrix A = H, ‖A‖_∞ = 13/12 and

    A^{-1} = [  72   −240   180 ]
             [ −240   900  −720 ]
             [  180  −720   600 ]

so ‖A^{-1}‖_∞ = 1860 and κ(A) = (13/12)(1860) = 2015. Now if b = (1, −1, 1)^T, then the exact solution is x = (492, −1860, 1500)^T, so that if b is perturbed to b + δb, the corresponding solution is x + δx = (492 ± 0.492, −1860 ∓ 1.860, 1500 ± 1.500)^T, hence δx = ±(0.492, −1.860, 1.500)^T and

    ‖δx‖_∞ / ‖x‖_∞ = 10^{−3} = ‖δb‖_∞ / ‖b‖_∞.

However, inequality (3.64) only gives a bound, and here the actual relative error falls short of that bound by about three orders of magnitude. Complete the computation for case (ii) and see for yourself that there (3.64) does indeed give an essentially exact bound.
The point is that we normally would not know the solution x. But since we normally
would know the matrix A, we can estimate its condition number and as such should be
in a position to say something about the bound in the relative error in the computation.
So far we have considered perturbations in the data provided by b. Let us also consider perturbations in A: (A + δA)(x + δx) = b ⇒ A δx = −δA(x + δx), and so

    ‖δx‖ ≤ ‖A^{-1}‖ ‖δA‖ ‖x + δx‖,

from which we deduce that

    ‖δx‖ / ‖x + δx‖ ≤ κ(A) ‖δA‖ / ‖A‖.

We must complete the analysis by considering perturbations both in A and b. We shall


need the following results

Lemma 6 If C is an n × n matrix with kCk < 1 for any compatible matrix norm, then
1
k(I − C)−1 k ≤ 1−kCk .

Proof: ‖C‖ < 1 ⇒ ‖C^n‖ ≤ ‖C‖^n → 0 as n → ∞. Now (I + C + C^2 + C^3 + · · · + C^n)(I − C) ≡ I − C^{n+1}, so letting n → ∞ gives (I − C)^{-1} = I + C + C^2 + · · · , and hence

    ‖(I − C)^{-1}‖ ≤ 1 + ‖C‖ + ‖C‖^2 + · · · = 1/(1 − ‖C‖). □

Lemma 7 Suppose (A + δA)(x + δx) = b + δb, Ax = b and ε := ‖A^{-1} δA‖ < 1. Then

    ‖δx‖ / ‖x‖ ≤ [1/(1 − ε)] ( ε + κ(A) ‖δb‖ / ‖b‖ ).   (3.65)

Proof: (A + δA)δx = δb − δA x, so that (I + A^{-1}δA)δx = A^{-1}δb − (A^{-1}δA)x. Apply Lemma 6 to get ‖(I + A^{-1}δA)^{-1}‖ ≤ 1/(1 − ‖A^{-1}δA‖) = 1/(1 − ε), so that ‖δx‖ ≤ [1/(1 − ε)](‖A^{-1}‖ ‖δb‖ + ε ‖x‖). But ‖x‖ ≥ ‖b‖/‖A‖, so that

    ‖δx‖ / ‖x‖ ≤ [1/(1 − ε)] ( ε + κ(A) ‖δb‖ / ‖b‖ )

as required. □
The equations Ax = b give an ill-conditioned system if κ(A) is large. If κ(A) is so large that it becomes comparable with the dominant relative errors in the data (that is, errors such as round-off errors or experimental errors in the numbers provided as data), then all accuracy in the solution may be lost. The bounds on the relative error may be pessimistic; however, they are rigorous and simple to express. This error
estimate is also independent of the scaling of A by scalar multiplication.
All the bounds on the relative error are given in terms of the kA−1 k and A−1 is costly
to find. However, error bounds on A−1 itself are more satisfactory: Suppose we have
found B ≈ A−1 and we form R = AB − I, with kRk small. Then B − A−1 = A−1 R so
that
kB − A−1 k
≤ kRk. (3.66)
kA−1 k
Thus, while a small r where r = b − Ax does not imply small relative errors in x, a small
kRk does imply small relative errors in B. Also,
A−1 (I + R) = B ⇒ A−1 = B(I + R)−1 .
Hence,
kBk
kA−1 k ≤ kBkk(I + R)−1 k ≤ , if kRk < 1. (3.67)
1 − kRk
Example 12 Consider the system
    
0.24 0.36 0.12 x1 0.84
 0.12 0.16 0.24   x2  =  0.52 
0.15 0.21 0.25 x3 0.64
estimate the condition number and also determine the condition.
Solution: Solving the problem using Gauss elimination using 2 digit floating point arith-
metic all the way, and scaled partial pivoting, the final working array is
 
0.24 0.36 0.12 | 0.84
 0.50 −0.02 0.18 | 0.10 
0.63 01.0 −0.01 | 0.01
Continuing the solution we have by back-substitution x̃ = (25, −14, −1)T . The residual
r = (0.0, 0.0, 0.08)T . In fact the exact solution is x = (−3, 4, 1)T . So that the solution
is in error in the first significant digit. The matrix of Coefficients in the system has
kAk∞ = 0.72. Furthermore, the matrix
 
0.252 0.36 0.12
B =  0.112 0.16 0.24 
0.147 0.21 0.25

is singular (why?) while ‖A − B‖_∞ = 0.012; hence we get 1/κ(A) ≤ ‖A − B‖_∞ / ‖B‖_∞ = 0.012/0.732, or κ(A) ≥ 61, a large number. Therefore the system is ill-conditioned. If we solve the
problem with 3 digits of accuracy, we may produce the exact solution even though
system is ill-conditioned. This becomes evident if we change the right hand side to
bT = (0.852, 0.620, 0.740), x̃ = (−3.30, 4.05, 1.13)T using three digit floating point arith-
metic. The residual now is r = (0.0024, 0.0008, 0.0020)T and the exact solution is
x = (−3.6, 4.25, 1.55)T , and our solution has about 10% error.

3.3.3 Iterative improvement
The example above shows that when κ(A) is large relative to the precision used, the
solution process may lead to relatively large errors in the computations. But is not always
guaranteed to do so. Whether or not the system is ill-conditioned can be ascertained
(without even knowledge of the condition number) during iterative improvement.
Suppose we wish to solve Ax = b. We know already that r = b − Ax̃^{(1)} is the residual, where x̃^{(1)} is the first approximation to the solution x. Consequently, r = Ae, where e = x − x̃^{(1)} is the error committed at this stage of the solution process. Thus we have a linear system in e with the same coefficient matrix as the original system but with a different right hand side. Let ê^{(1)} be the approximate solution of Ae = r. ê^{(1)} will in general not agree with e but should give an indication of the size of e. If ‖ê^{(1)}‖/‖x̃^{(1)}‖ ≃ 10^{−s}, we can conclude that the first s decimal places of x̃^{(1)} probably agree with those of x. We would then expect ê^{(1)} to be that close an approximation to e, and hence we expect x̃^{(2)} = x̃^{(1)} + ê^{(1)} to be a better approximation to x than x̃^{(1)}. If necessary, we can calculate r^{(2)} = b − Ax̃^{(2)}, and again solve Ae = r^{(2)} to obtain a new correction x̃^{(3)} = x̃^{(2)} + ê^{(2)} to x. The number of places in agreement in successive approximations x̃^{(1)}, x̃^{(2)}, x̃^{(3)}, · · · , as well as the size of the residual, should give an indication of the accuracy of the solutions. One normally carries out the iteration until

    ‖ê^{(k)}‖ / ‖x̃^{(k)}‖ ≃ 10^{−t}
if t decimal places of accuracy are carried during the solution. The number of iterations
steps required to achieve this can be shown to increase with κ(A). If κ(A) is very large,
ê(1) , ê(2) , · · · may never decrease in size, thus signalling extreme ill-conditioning.
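A minimal sketch of this procedure (assuming Python with NumPy and SciPy; the function name iterative_refinement is ours, and in practice the residual should be accumulated in higher precision than the rest of the computation) is:

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    def iterative_refinement(A, b, tol=1e-12, max_iter=10):
        """One LU factorization, then repeated corrections x <- x + e with Ae = r."""
        lu, piv = lu_factor(A)
        x = lu_solve((lu, piv), b)
        for _ in range(max_iter):
            r = b - A @ x                    # residual
            e = lu_solve((lu, piv), r)       # approximate error, reusing the factorization
            x = x + e
            if np.linalg.norm(e) <= tol * np.linalg.norm(x):
                break
        return x

Note that each correction step reuses the factors of A, so its cost is only of order n^2.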

3.3.4 Exercises
1. Discuss how you would use backward error analysis to measure the effect of rounding
errors in computations.
2. By choosing a suitable matrix norm, show that

    (i) ‖AB‖ ≤ ‖A‖ ‖B‖,   (ii) ‖A^r‖ ≤ ‖A‖^r,
    (iii) ‖A‖_∞ = max_{1≤i≤n} Σ_j |a_{ij}|, the maximum absolute row sum.

3. If ~y is the approximate solution of the system A~x = ~b, where ~b is a non-zero vector, and if we define ~r = ~b − A~y, ~e = ~y − A^{-1}~b, and, for any compatible matrix norm, κ(A) = ‖A‖ ‖A^{-1}‖, then show that

    ‖~r‖ / (κ(A) ‖~b‖) ≤ ‖~e‖ / ‖A^{-1}~b‖ ≤ κ(A) ‖~r‖ / ‖~b‖.   (3.68)
Interpret the terms in these inequalities and explain the relevance of these inequal-
ities to the numerical solution of systems of linear equations in fixed length floating
point arithmetic.

4. Construct a 7 × 7 matrix A whose entries satisfy a_{ii} = 0.1, i = 1, 2, · · · , 7, a_{i,i+1} = −1, and a_{ij} = 0 otherwise. Now consider the system A~x = ~b where b_i = 0.1 if i is odd and b_i = −1 if i is even. Find the exact solution of the system. Next change b_7 from 0.1 to 0.101, leaving all the other entries the same. What is the effect of this small change on the solution? Can you explain it?

5. Consider again the last question, but this time construct a 5 × 5 system. Assume
that due to round off errors, the solution was computed to be x1 = 101, x2 = 10,
x3 = 2, x4 = .1 and x5 = 1.01. Compute the residual ~r and the error ~e. Find a
bound for the relative error in terms of the relative residual.

6. It can be shown that

    1/κ(A) ≤ inf { ‖A − B‖ / ‖A‖ : B non-invertible }.

Use this result to estimate the condition number κ(A) for the following matrices,

    (i) A = [ 1  −1  0 ]        (ii) A = [ 7   8   9 ]
            [ 2  −1  1 ]                 [ 8   9  10 ]
            [ 2  −2  1 ]                 [ 9  10   8 ]

using ‖ · ‖_∞. Now for the same norm explicitly calculate κ(A). Comment on how good your estimate is.

7. Further Reading: Complete this section by revising and reading about the fol-
lowing:

(a) Methods of producing orthogonal matrices: (Gram-Schmidt Orthogonaliza-


tions, Givens rotations, Householder transformations)
(b) Implementation of the QR decomposition to larger systems.

Chapter 4

Iterative Methods for solving Ax = b

4.1 Introduction
There are instances in which direct methods such as LU decomposition may not be the
best method to apply to a given linear system. Direct methods are attractive when

1. We have several equations with the same coefficient matrix but with different right
hand side

2. The matrix A is nearly singular. In this case, small changes in the residual do not
imply small errors in the solution.

When direct methods are no longer competitive, iterative methods can be introduced.

Definition 29 (Iterative Method) An iterative method is one in which a sequence of approximations, called iterates, is generated which, when the method is properly designed, converges to the sought-after solution.

Iterative methods become necessary when

1. The system to solve is so large that it is impossible or uneconomical to assemble the coefficient matrix in a single place.

2. The system to be solved is large and sparse, that is, the number of zero elements in the system is much larger than the number of non-zero elements.

With iterative methods,

1. there is no arithmetic associated with zero elements (coefficients) so that fewer


elements have to be stored.

2. programming and data handling is/are much easier and simpler

3. methods can even be used to solve non-linear problems

The basic idea behind an iterative method for solving the linear system Ax = b could be
defined as follows:

1. Split matrix A in the form A = N − P where the matrix N is simple and easy to
invert.

2. Initialize solution vector with a guess x(0)

3. Generate the sequence x^{(1)}, x^{(2)}, · · · , x^{(k)}, · · · of estimates of x through the recursion N x^{(k+1)} = P x^{(k)} + b, k = 0, 1, 2, · · · .
   
Example 13 Consider

    A = [ 4   1  0 ]        b = [ 1 ]
        [ 2   5  1 ],           [ 0 ]
        [ −1  2  4 ]            [ 3 ]

and solve Ax = b by iteration.
   
Solution: Set

    N = [ 4  0  0 ]        P = [ 0   −1   0 ]
        [ 0  5  0 ],           [ −2   0  −1 ]
        [ 0  0  4 ]            [ 1   −2   0 ]

so that A = N − P, and take x^{(0)} = (1, 1, 1)^T. Then we have

    x^{(1)} = (0, −3/5, 1/2)^T,   x^{(2)} = (2/5, −1/10, 21/20)^T,   · · · ,
    x^{(7)} = (10601/32000, −13347/40000, 12751/12800)^T,   · · · .

The true solution is x = (1/3, −1/3, 1)^T. Is the iteration converging? When does the solution converge?
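The iterates above are easy to reproduce; a minimal sketch (assuming Python with NumPy, and using the diagonal splitting N = D chosen in the example) is:

    import numpy as np

    A = np.array([[4., 1., 0.], [2., 5., 1.], [-1., 2., 4.]])
    b = np.array([1., 0., 3.])
    N = np.diag(np.diag(A))        # the easy-to-invert part of the splitting
    P = N - A                      # so that A = N - P
    x = np.array([1., 1., 1.])     # initial guess x^(0)
    for k in range(7):
        x = np.linalg.solve(N, P @ x + b)
    print(x)                        # approaches (1/3, -1/3, 1)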
Suppose that at step k we have that e(k) = x − x(k) , where e is the error committed
at step k. Now we have

    N x = P x + b                   (equation satisfied by the exact solution)   (4.1)
    N x^{(k+1)} = P x^{(k)} + b     (approximation at stage k)                   (4.2)

So if N is non-singular, we must have, on subtracting the two equations, that

    e^{(k+1)} = N^{-1} P e^{(k)} = M e^{(k)},   M = N^{-1} P.   (4.3)

So if M is in some sense a small matrix, then the errors e(k) must be diminishing and
we will hope that the iteration will converge to the true solution after a certain number
of steps. We have the following theorem

Theorem 17 Suppose A = N − P and suppose kN −1 P k∞ = λ < 1. Then

1. A is non-singular
2. kx(j) − xk∞ ≤ λj kx(0) − xk∞ or ke(j) k∞ ≤ λj ke(0) k∞ , and
3. if x is a solution of the linear system Ax = b, and (x(j) )j≥1 is a sequence of points
generated by the relation Nx(k+1) = P x(k) + b, then limj→∞ x(j) = x.
Proof:
1. By contradiction. If A is singular, then there exists y ≠ 0 such that Ay = 0. Therefore (N − P)y = 0, i.e. y = N^{-1} P y, and so ‖y‖_∞ = ‖N^{-1} P y‖_∞ ≤ ‖N^{-1} P‖_∞ ‖y‖_∞. Since y ≠ 0, we must have λ ≥ 1, a contradiction. So A is non-singular.
2. Set M = N −1 P , e(j) = x(j) − x, then
e(j) = Me(j−1) = M 2 e(j−2) = · · · = M j e(0)
and thus ke(j) k∞ ≤ kM j k∞ ke(0) k∞ = λj ke(0) k∞ .
3. If λ < 1 then limj→∞ λj = 0 and the iteration converges. 
Now recall Definition 2.3, equation (2.19) on page 16 of the spectral radius of a matrix.
We have the following theorem
Theorem 18 ρ(A) ≤ kAk for any compatible matrix norm
Proof: Let λ_i, i = 1, 2, · · · , n, be the eigenvalues of the matrix A, and let x_i be a corresponding eigenvector, so that A x_i = λ_i x_i for each i. Then ‖λ_i x_i‖ = |λ_i| ‖x_i‖ = ‖A x_i‖ ≤ ‖A‖ ‖x_i‖ for each i. Thus |λ_i| ≤ ‖A‖ for each i, and hence ρ(A) ≤ ‖A‖ as required. □
Remark 6 The first point here is that Theorems 17 and 18 show us that for an iterative process to converge, the spectral radius of the matrix M = N^{-1}P must be smaller than unity. The matrix N^{-1}P is called the iteration matrix. The second point is that the spectral radius is difficult to find, so we may sometimes be content with the requirement that ‖M‖ < 1 for some compatible matrix norm. The only snag here is that ‖M‖ can be larger than unity for the chosen norm while the spectral radius, which does not depend on any norm, is less than unity, and the iteration will still converge.

4.2 Some special iteration schemes


A general splitting of the matrix A can be achieved as follows: write A = D − L − U, where D is diagonal, −L is the strictly lower triangular part of A, and −U is the strictly upper triangular part of A:

    D = [ a_{11}  0      · · ·  0      ]   −L = [ 0       0       · · ·  0 ]   −U = [ 0  a_{12}  · · ·  a_{1n} ]
        [ 0       a_{22} · · ·  0      ]        [ a_{21}  0       · · ·  0 ]        [ 0  0       · · ·  a_{2n} ]
        [ .       .      .      .      ]        [ .       .       .      . ]        [ .  .       .      .      ]
        [ 0       0      · · ·  a_{nn} ]        [ a_{n1}  a_{n2}  · · ·  0 ]        [ 0  0       · · ·  0      ]

Then several combinations of L, U and D give different iteration schemes. To get A = N − P we can have:

1. N = D and P = L + U, so M = N^{-1}P = D^{-1}(L + U): the Jacobi iteration.

2. N = D − L and P = U, so M = N^{-1}P = (D − L)^{-1} U: the Gauss-Seidel iteration.

3. N = (1/ω)(D − ωL) and P = (1/ω)[(1 − ω)D + ωU], so that N − P = A and N^{-1}P = H(ω), where

    H(ω) = (I − ω D^{-1} L)^{-1} [ (1 − ω) I + ω D^{-1} U ].

   This is the successive over-relaxation (SOR) method, where ω is a relaxation parameter which can be used to speed up convergence and typically satisfies 1 < ω < 2. Observe that ω = 1 gives the Gauss-Seidel iteration. The interesting part for any problem is normally to find the optimum ω that will speed up convergence.
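To decide which scheme to use for a particular matrix, one can form these iteration matrices and compare their spectral radii (see Theorem 18 and Remark 6). A minimal sketch (assuming Python with NumPy; the function name iteration_matrices is ours) is:

    import numpy as np

    def iteration_matrices(A, omega=1.5):
        """Return the Jacobi, Gauss-Seidel and SOR iteration matrices of A."""
        n = A.shape[0]
        D = np.diag(np.diag(A))
        L = -np.tril(A, -1)
        U = -np.triu(A, 1)
        M_jacobi = np.linalg.solve(D, L + U)                 # D^{-1}(L + U)
        M_gs     = np.linalg.solve(D - L, U)                 # (D - L)^{-1} U
        Dinv = np.linalg.inv(D)
        M_sor = np.linalg.solve(np.eye(n) - omega * Dinv @ L,
                                (1 - omega) * np.eye(n) + omega * Dinv @ U)
        return M_jacobi, M_gs, M_sor

    rho = lambda M: max(abs(np.linalg.eigvals(M)))   # spectral radius: must be < 1 to converge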

Definition 30 (Rate of convergence) The number of decimal digits by which the error is eventually decreased at each step of a convergent iteration gives a measure of the rate of convergence.

Theorem 19 The rate of convergence can be measured just by calculating − log10 (ρ)
where ρ is the spectral radius of the iteration matrix

Proof: Let the iteration matrix M have m linearly independent eigenvectors v_s corresponding to the eigenvalues λ_s, s = 1, 2, · · · , m, ordered so that |λ_1| > |λ_2| > · · · > |λ_m|. Then the initial error e^{(0)}, having m components, may be written as e^{(0)} = Σ_{s=1}^m c_s v_s, where each c_s is a scalar. But v_s is an eigenvector of M, so M v_s = λ_s v_s. Thus we have

    e^{(1)} = M e^{(0)} = Σ_{s=1}^m c_s M v_s = Σ_{s=1}^m c_s λ_s v_s.

Therefore

    e^{(n)} = Σ_{s=1}^m c_s λ_s^n v_s = λ_1^n Σ_{s=1}^m c_s (λ_s/λ_1)^n v_s.

Thus for large n,

    e^{(n)} ≈ λ_1^n c_1 v_1,   and similarly   e^{(n+1)} ≈ λ_1^{n+1} c_1 v_1,

and therefore e^{(n+1)} ≈ λ_1 e^{(n)}. Thus if the i-th component of e^{(n)} is e_i^{(n)}, i ∈ {1, 2, 3, · · · , m}, then we have

    |e_i^{(n)}| / |e_i^{(n+1)}| ≈ 1/|λ_1| = 1/ρ(M),   i ∈ {1, 2, 3, · · · , m}.

Taking logs on both sides of this expression, we have

    log_{10}( |e_i^{(n)}| / |e_i^{(n+1)}| ) ≈ − log_{10} ρ(M),   i ∈ {1, 2, 3, · · · , m},

a positive quantity, since in a convergent iteration 0 < ρ < 1. More generally, we have that

    e^{(n+p)} ≈ λ_1 e^{(n+p−1)} ≈ · · · ≈ λ_1^p e^{(n)},   n = 1, 2, 3, · · · ,

so if we wish to reduce the error by a factor 10^{−q}, say, then the number of iterations needed to do this will be the least value of p for which |λ_1|^p = ρ^p ≤ 10^{−q}, that is, p ≥ q/(− log_{10} ρ). Thus p decreases with decreasing ρ. The number − log_{10} ρ is called the rate of convergence. Similarly, − log_e ρ(M) is defined to be the asymptotic rate of convergence, denoted by R_∞(M). The average rate of convergence after n iterations is R_n(M) = −(1/n) log_e ‖M^n‖, for a compatible matrix norm. □

4.3 Methods of accelerating convergence of an itera-


tive process
The following two methods are applicable if λ1 is real.

Lyusternik’s Method
Write e^{(n+1)} = λ_1 e^{(n)} + δ^{(n)}, where δ^{(n)} is a correction term with small components
and |λ1 | < 1. By definition of the error,
x = x(n) + e(n) = x(n+1) + e(n+1) (4.4)
= x(n) + λ1 e(n) + δ (n) (4.5)
Eliminating e^{(n)} between (4.4) and (4.5) gives

    x = (x^{(n+1)} − λ_1 x^{(n)}) / (1 − λ_1) + δ^{(n)} / (1 − λ_1).   (4.6)

So, if ‖δ^{(n)}‖ is small compared with 1 − λ_1, a good approximation to x is

    x ≈ (x^{(n+1)} − λ_1 x^{(n)}) / (1 − λ_1) = x^{(n)} + (x^{(n+1)} − x^{(n)}) / (1 − λ_1),
so small differences in successive iterates do not necessarily imply a close approximation to the solution. As an illustration, suppose for example that max_i |x_i^{(n+1)} − x_i^{(n)}| = ε and λ_1 = 0.99. Then max_i |x_i^{(n+1)} − x_i^{(n)}| / |1 − λ_1| = 100ε, where x_i^{(n)} is the i-th component of x^{(n)}. To obtain a solution with maximum error of ε we must therefore iterate until max_i |x_i^{(n+1)} − x_i^{(n)}| ≃ 0.01ε.
For most problems the largest eigenvalue λ1 will not be known in advance and must
be estimated. One way of doing this is the following: For large n,we have
e(n) ' λ1 e(n−1) ⇒ e(n+1) − e(n) = λ1 (e(n) − e(n−1) )
That is x(n+1) − x(n) = λ1 (x(n) − x(n−1) )
kx(n+1) − x(n) k
and so kx(n+1) − x(n) k = |λ1 |k(x(n) − x(n−1) )k ⇒ |λ1 | ' .
kx(n) − x(n−1) k

for any suitably defined vector norm.

Aitken’s Method

e(n) = λ1 e(n−1) ⇒ x − x(n) = λ1 (x − x(n−1) )

and

e(n+1) = λ1 e(n) ⇒ x − x(n+1) = λ1 (x − x(n) )

Taking the i-th component and eliminating λ_1 by simple division gives

    x_i = ( x_i^{(n+1)} x_i^{(n−1)} − (x_i^{(n)})^2 ) / ( x_i^{(n+1)} − 2 x_i^{(n)} + x_i^{(n−1)} )
        = x_i^{(n+1)} − ( x_i^{(n+1)} − x_i^{(n)} )^2 / ( x_i^{(n+1)} − 2 x_i^{(n)} + x_i^{(n−1)} ),   (4.7)
and Aitken’s method completely avoids explicit calculation of λ1 . It is of interest to note
however that the two methods are identical if, in Lyusternick’s method, λ1 is approximated
by

    λ_1 ≈ ( x_i^{(n+1)} − x_i^{(n)} ) / ( x_i^{(n)} − x_i^{(n−1)} ),  for each i,   (4.8)

and a better value of λ1 could be obtained by taking an average for all the values of i.
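Formula (4.7) is applied component by component; a minimal sketch (assuming Python; the function name aitken is ours) is:

    def aitken(x_prev, x_curr, x_next):
        """Componentwise Aitken extrapolation from three successive iterates, eq. (4.7)."""
        denom = x_next - 2.0 * x_curr + x_prev
        return x_next - (x_next - x_curr) ** 2 / denom

    # illustration: iterates of x_{n+1} = 0.5*x_n + 1, which converge to 2
    print(aitken(1.0, 1.5, 1.75))    # gives 2.0 exactly, since the error decays geometrically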

4.4 Exercises
1. Let A = D − L − U be a non-singular matrix. For the system Ax = b, where A is
an n × n matrix and b is an n × 1 column vector, show that

(a) The eigenvalues µ of the Jacobi iteration are found by evaluating a determinant, i.e. by solving the equation Det(µD − L − U) = 0.
(b) The eigenvalues µ of the Gauss-Seidel iteration are found by evaluating a determinant, i.e. by solving the equation Det(µD − µL − U) = 0.
(c) If ω ≥ 1 is a parameter, then the eigenvalues µ of the SOR Iteration are found
by evaluating a determinant to solve the equation Det(µI − H(ω)) = 0, where
H(ω) = (I − ωD−1L)−1 {(1 − ω)I + ωD−1U}.
 
−4 1 0
Use the matrix A =  1 −4 1  to illustrate that the different iterations
0 1 −4
shown here will converge and in the case of the SOR iteration scheme, find the best
ω.

2. Let A be an n × n matrix. Show that if A is strictly diagonally dominant, then the Jacobi iteration will converge for any initial guess.

   
3. Consider the system A~x = ~b where A = [ 2  −1 ; −1  2 ] and ~b = (1, 1)^T.

(a) Show that the Jacobi iteration will converge for any initial guess.
(b) Show that the Gauss-Seidel iteration will converge for any initial guess.
(c) Show that 0 < ρ(H_G) < ρ(H_J), where H_G and H_J are the Gauss-Seidel and Jacobi iteration matrices. What does this inequality say about the relative rates of convergence of the Jacobi and Gauss-Seidel iteration schemes?
(d) Show that there exists an optimum relaxation factor ω_b for which an SOR iteration scheme will converge fastest, and find ω_b.
(e) Carry out a few iterations for each method and compare the estimates with the true solution.

4. This question involves symmetric and positive definite matrices

(a) Let A = (a_{ij}) be a symmetric positive definite matrix of size n × n. Show that a_{ii} > 0 for all i. [Hint: consider ~e_i^T A ~e_i.] Show also that the eigenvalues of A are all positive.
(b) Let D = (d_{ij}) be an n × n diagonal matrix such that d_{ii} > 0 for all i. Show that D is positive definite and symmetric.
(c) Show that the Gauss-Seidel iteration for any symmetric positive definite matrix will converge for any initial guess, in the following steps:
    (i) Set A = L + D + U = L + D + L^T, since A is symmetric. Construct the matrix P = A − H_G^T A H_G, where H_G = −(D + L)^{-1} L^T, and show that P is symmetric.
    (ii) Show that H_G = I − Q, where Q = (D + L)^{-1} A.
    (iii) Now show that P = Q^T ( A Q^{-1} + (Q^T)^{-1} A − A ) Q (using (ii)). Then show that P = Q^T D Q (using the above and the definition of Q).
    (iv) Use parts (a) and (b) of this problem to show that P is positive definite and symmetric.
    (v) Suppose λ is an eigenvalue of H_G and ~u a corresponding eigenvector. Show that ~u^T P ~u > 0 ⇒ |λ| < 1. Does this complete the proof?

5. The following numbers, given to five decimal places, are the third components of the fifth, sixth and seventh iteration vectors, respectively, of a Jacobi iteration scheme: 0.41504, 0.45878, 0.49500. Use Aitken's method to calculate an improved value for this component. Verify whether Lyusternik's method gives the same value to four decimal places when λ_1 is approximated as in (4.8).

6. It is required to solve the system ui = ρ(ui−1 + ui+1) + ci, i = 1, 2, · · · , n, where
the ci's are known and ρ = r/(2 + 2r), with r fixed. Prove that the Gauss-Seidel iteration
procedure for solving this problem converges for all positive values of r, it being
assumed that the (n + 1)th iterative values are calculated systematically from i = 1
to i = n − 1.

7. Practical Problem: Let Ω = {(x, y) ∈ R² : 0 ≤ x ≤ 1, 0 ≤ y ≤ 1} be the unit square.
It is required to solve the following second order partial differential equation on Ω:

    ∂²φ/∂x² + ∂²φ/∂y² + 8π²φ = 0,   ∀(x, y) ∈ Ω,
    φ(x, y) = 0,   ∀(x, y) ∈ ∂Ω.

(a) Show that the exact solution to the given problem is φ(x, y) = sin(2πx) sin(2πy).
(b) By using finite central differences to approximate the derivatives, and employ-
ing a grid of 1000 points in each coordinate direction, write computer
programmes to solve the problem by iteration using
i. the Jacobi iteration;
ii. the Gauss-Seidel iteration;
iii. the SOR iteration.
Record how many iterations are required to obtain a solution correct to 6
decimal places for each iteration scheme. For the SOR scheme, what was
the best value of the relaxation parameter? Show that if ω is the relaxation
parameter, then ω_best lies between 1 and 2.
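Exercise 7 asks for iterative solvers on a two-dimensional grid. The Python sketch below (our own illustration, not part of the notes) only shows how an SOR sweep over such a grid can be organised; for definiteness it is written for the simpler model problem −(u_xx + u_yy) = f with zero boundary values, and adapting it to the Helmholtz-type equation of the exercise changes only the diagonal coefficient. Setting ω = 1 gives Gauss-Seidel; a Jacobi sweep would write its updates into a separate copy of the array.

import numpy as np

def sor(f, n, omega, tol=1e-6, maxit=100000):
    # SOR sweeps for the 5-point discretisation of -(u_xx + u_yy) = f on the
    # unit square with u = 0 on the boundary; n interior points per direction,
    # f an (n+2) x (n+2) array of source values on the same grid.
    h = 1.0 / (n + 1)
    u = np.zeros((n + 2, n + 2))
    for it in range(1, maxit + 1):
        change = 0.0
        for i in range(1, n + 1):
            for j in range(1, n + 1):
                gs = 0.25 * (u[i-1, j] + u[i+1, j] + u[i, j-1] + u[i, j+1]
                             + h * h * f[i, j])
                new = (1.0 - omega) * u[i, j] + omega * gs
                change = max(change, abs(new - u[i, j]))
                u[i, j] = new
        if change < tol:
            return u, it            # solution and iteration count
    return u, maxit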

Chapter 5

The Algebraic Eigenvalue Problem

5.1 Introduction
We seek to find numbers λi and vectors xi such that

Axi = λi xi or (A − λi I)xi = 0, i = 1, 2, 3, · · · , n. (5.1)

This process has many practical applications. For example, vibration analysis leads to a
symmetric A because of a vibrational energy principle. The x will represent the shape of
the vibrational mode and λ is related to the frequency.
Definition 31 We say λ ∈ C is an eigenvalue of the n × n matrix A if there exist y 6= 0
such that

Ay = λy, y ∈ Rn (5.2)

The vector y is called the eigenvector associated with the eigenvalue λ.


From (5.2), we have

(A − λI)y = 0, y ∈ Rn . (5.3)

Since y 6= 0 is an eigenvector, λ is an eigenvalue of A if and only if the homogeneous


system (A − λI)y = 0 has non-trivial solutions.
Lemma 8 The number λ is an eigenvalue of the matrix A if and only if the matrix
(A − λI) is singular
To illustrate the importance of eigenvalues, consider the sequence of vectors

z, Az, A2 z, · · · , An z (5.4)
where A is an n × n matrix. Suppose that the vector z = Σ_{k=1}^{n} c_k y_k, where each y_k is an
eigenvector of A with associated eigenvalue λ_k; then the sequence may be written so that

    A^m z = Σ_{k=1}^{n} c_k λ_k^m y_k.        (5.5)

Hence the behaviour of the sequence (5.4) may be viewed in terms of the behaviour of
the sequences (λ_k^m)_{m≥0}. In particular lim_{m→∞} A^m z = 0 if |λ_k| < 1 for all k. We can order the
eigenvalues so that

    |λ1| > |λ2| ≥ · · · ≥ |λn|,        (5.6)

so that, multiplying both sides of (5.5) by (1/λ1)^m, we have

    λ1^{−m} A^m z = c1 y1 + Σ_{k=2}^{n} c_k (λ_k/λ1)^m y_k,   m = 0, 1, 2, · · ·

Hence, provided c1 ≠ 0,

    lim_{m→∞} λ1^{−m} A^m z = c1 y1,        (5.7)

and we have a way of calculating the dominant eigenvector.
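A minimal Python sketch of the resulting power iteration is given below (our own illustration, not part of the notes): the iterate is renormalised at each step so it aligns with the dominant eigenvector y1 as suggested by (5.7), and the eigenvalue is estimated by the Rayleigh quotient (5.10) introduced below.

import numpy as np

def power_method(A, z0, tol=1e-10, maxit=1000):
    # Repeatedly apply A and normalise; the normalised iterates align with the
    # dominant eigenvector, and the Rayleigh quotient estimates lambda_1.
    z = z0 / np.linalg.norm(z0)
    lam = 0.0
    for _ in range(maxit):
        w = A @ z
        z_new = w / np.linalg.norm(w)
        lam_new = z_new @ A @ z_new        # Rayleigh quotient (z_new is a unit vector)
        if abs(lam_new - lam) < tol:
            return lam_new, z_new
        lam, z = lam_new, z_new
    return lam, z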


For Hermitian matrices, as well as for real symmetric matrices (A^T = A), the existence of
an eigenvalue λ follows from the determinantal equation

    |λI − A| = 0.        (5.8)

When this is multiplied out, we get an nth degree polynomial equation for λ, so there
are n roots. These may, of course, not all be distinct. For each there will be a non-zero
eigenvector x (by the Fredholm alternative theorem). It is well known that for Hermitian
matrices all the eigenvalues are real. We write ∗ to denote the complex conjugate transpose
and use A∗ = A. Then

Ax = λx ⇒ x∗ Ax = λx∗ x = λkxk22 (5.9)

Also, (x∗Ax)∗ = x∗A∗x = x∗Ax, so x∗Ax is real; hence λ is real, being the ratio of two real
quantities. For a real symmetric matrix, x is then real because A − λI is real.
This also shows that we can find λ, given x, using the Rayleigh Quotient
xT Ax
λ= . (5.10)
xT x
For

    x^T A x / x^T x = λ‖x‖²₂ / ‖x‖²₂ = λ.        (5.11)
In the same way, if (λ1, x1) and (λ2, x2) are eigensolutions with λ2 ≠ λ1, we have

    x2^T (λ1 x1) = x2^T A x1 = (λ2 x2)^T x1.

So

    (λ1 − λ2)(x2^T x1) = 0.        (5.12)

Since λ1 ≠ λ2, we must have x2^T x1 = 0, and so x2 and x1 are orthogonal. If they are normalised
so that x2^T x2 = ‖x2‖²₂ = 1 and x1^T x1 = ‖x1‖²₂ = 1, they are called orthonormal.

Definition 32 (Orthogonal matrix) A matrix whose rows (columns) are orthonormal
is said to be an orthogonal matrix (unitary in the Hermitian case). Such a matrix X
satisfies
    X^T X = X X^T = I,  so that  X^{−1} = X^T.
Clearly,
    ‖X‖²₂ = max_{‖x‖₂=1} |x^T X^T X x| = 1,        (5.13)
i.e.
    ‖X‖₂ = ‖X^{−1}‖₂ = 1.        (5.14)


Definition 33 (Orthogonal Transformation) A similarity transform of a matrix A
by an orthogonal matrix X is called an orthogonal transformation.
Given a matrix A and an orthogonal matrix X, we have
    B = X^{−1} A X = X^T A X;   A = X B X^{−1} = X B X^T.
So
    ‖B‖₂ ≤ ‖X^T‖₂ ‖A‖₂ ‖X‖₂ = ‖A‖₂,
and similarly
    ‖A‖₂ ≤ ‖B‖₂  ⇒  ‖A‖₂ = ‖B‖₂.

5.2 Exercises
Research and present a write up about the topics indicated and also answer the questions
indicated
1. Numerical methods for estimating eigenvalues and eigenvectors
(a) The power method
(b) The inverse power method
(c) Similarity transform methods
(d) Householder reflections
(e) Givens method
2. Localization of eigenvalues: State Gershgorin's disc theorems about the local-
ization of the eigenvalues of an n × n matrix A = (aij). Let D be another n × n matrix
and ‖ · ‖ a matrix norm. Prove that if λ is an eigenvalue of A but not of D, then
‖(λIn − D)^{−1}(A − D)‖ ≥ 1, where In is the identity matrix of size n. Employ this
inequality with D = diag(a11, a22, · · · , ann) and the matrix norm ‖ · ‖∞ to prove Gersh-
gorin's first theorem. Let q be a real number, q > 2, and A = (aij) a 3 × 3 matrix
with aij = q^{−|i−j|}, i, j = 1, 2, 3. Determine the Gershgorin disks of the matrix A and
show that 0 is not an eigenvalue of A. (A computational sketch of these disks follows
this list.)
3. State and prove Cauchy's interlace theorem for the localization of eigenvalues.
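The following short Python sketch (our own illustration; the matrix and the value q = 3 are taken from Exercise 2 above, any q > 2 would do) computes the Gershgorin disks |z − a_ii| ≤ Σ_{j≠i}|a_ij| for the matrix A = (q^{−|i−j|}).

import numpy as np

def gershgorin_disks(A):
    # Every eigenvalue of A lies in at least one disk |z - a_ii| <= sum_{j != i} |a_ij|.
    centres = np.diag(A)
    radii = np.sum(np.abs(A), axis=1) - np.abs(centres)
    return list(zip(centres, radii))

q = 3.0                                    # any q > 2, as in Exercise 2
A = np.array([[q ** (-abs(i - j)) for j in range(3)] for i in range(3)])
for c, r in gershgorin_disks(A):
    print(f"disk: centre {c:.4f}, radius {r:.4f}")
# Each centre is 1 and each radius is q^(-1) + q^(-2) < 1 when q > 2,
# so 0 lies outside every disk and cannot be an eigenvalue of A.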

Chapter 6

Polynomial Interpolation

6.1 Polynomial Forms


We demonstrate that the customary way to describe a polynomial may not be the best
way in calculations.
Definition 34 (Polynomial) A polynomial of degree ≤ n is a function Pn : R → R,
which is of the form
Pn (x) = a0 + a1 x + a2 x2 + a3 x3 + · · · + an xn (6.1)
with coefficients a0 , a1 , · · · , an . This polynomial has (exact) degree n in case its leading
coefficient an 6= 0.
Definition 35 (Power Form of the Polynomial) When the polynomial is written in
the form (6.1), we say it is written in the power form.
The power form of the polynomial is the standard way to specify a polynomial in mathe-
matical discussions. This form is a very convenient form for differentiating or integrating
polynomials. But, in specific contexts, this form may not fare very well and other forms
are more convenient. The following example shows that the power form of the polynomial
may lead to loss of significance in some limited calculations.
Example 14 If we construct the power form of the straight line P1(x) = a0 + a1x which
takes values P1(6000) = 1/3, P1(6001) = −2/3, then, in five-decimal-digit floating-point arith-
metic¹, we will obtain P1(x) = 6000.3 − x. Evaluating this polynomial in the same arith-
metic, we find P1(6000) = 0.3 and P1(6001) = −0.7, which recovers only the first digit of
the given function values, a loss of four decimal digits.
1
In computing, floating point describes a method of representing real numbers in a way that can
support a wide range of values. Numbers are, in general, represented approximately to a fixed number
of significant digits and scaled using an exponent. The base for the scaling is normally 2, 10 or 16. The
typical number that can be represented exactly is of the form: number = Significant digits×baseexponent .
The term floating point refers to the fact that the radix point (decimal point, or, more commonly in
computers, binary point) can ”float”; that is, it can be placed anywhere relative to the significant digits
of the number. This position is indicated separately in the internal representation, and floating-point
representation can thus be thought of as a computer realization of scientific notation. Significant figures
(also called significant digits) of a number are those digits that carry meaning contributing to its precision.

A remedy for loss of significance is to use the shifted power form of the polynomial.
Definition 36 (Shifted Power Form of the Polynomial) The shifted power form of
the polynomial,Pn , takes the form
Pn (x) = a0 + a1 (x − c) + a2 (x − c)2 + a3 (x − c)3 + · · · + an (x − c)n (6.2)
where c, called the centre of the power form, is a fixed constant.
Clearly using the binomial theorem to expand all the powers out and rearranging we can
rewrite the shifted power form of the polynomial (6.2) in the power form (6.1). More
specifically, if c = 0, we have the power form of the polynomial. Thus the shifted power
form is a generalization of the power form. It is good practice to employ the shifted
power form with centre c chosen somewhere in the interval [a, b] when interested in a
polynomial in that interval. How does the form improve on our calculations using the
five-decimal-digit floating-point arithmetic of Example 14?
Example 15 If we choose the centre c = 6000 and rework the linear polynomial of Ex-
ample 14, then P1 (x) = a0 + a1 (x − 6000) and using five-decimal-digit floating-point
arithmetic demanding that P1 (6000) = 1/3 = 0.33333 and P1 (6001) = −2/3 = −0.66667,
we have P1 (x) = 0.33333 − (x − 6000), so that evaluating in the same arithmetic gives
P1 (6000) = 0.33333 and P1 (6001) = −0.66667, and the values are as correct as five digits
can make them. We shall see below that the coefficients in the shifted power form provide
derivative values, ai = Pn^{(i)}(c)/i!, i = 0, 1, 2, · · · , n, when Pn is of the form (6.2).
A further generalization of the shifted power form is the Newton form of the Polynomial.
Definition 37 (Newton Form of the Polynomial) The Newton form of the polyno-
mial, Pn , takes the form
Pn (x) = a0 + a1 (x − c1 ) + a2 (x − c1 )(x − c2 ) + a3 (x − c1 )(x − c2 )(x − c3 ) + · · ·
+an (x − c1 )(x − c2 )(x − c3 ) · · · (x − cn−1 )(x − cn ) (6.3)
where we now have n centres c1 , c2 , · · · , cn , instead of just one centre, as in the shifted
power form.
It is good practice to choose the centres c1 , c2 , · · · , cn somewhere in the interval [a, b] if we
are interested in a polynomial in this interval. Again, the Newton form of the polynomial
reduces to the shifted power form if c1 = c2 = · · · = cn = c, and to the power form if the
centres c1 , c2 , · · · , cn all equal to zero.
It is inefficient to evaluate each of the n + 1 terms in (6.3) separately and then sum.
This would take n + n(n + 1)/2 additions and n(n + 1)/2 multiplications. Instead, one
notices that the factor x − c1 occurs in all terms but the first; that is,
Pn (x) = a0 + (x − c1 ) {a1 + a2 (x − c2 ) + a3 (x − c2 )(x − c3 )
+ · · · + an (x − c2 )(x − c3 ) · · · (x − cn )} .
Again, each term between the braces but the first contains the factor (x − c2 ); that is,
Pn (x) = a0 + (x − c1 ) {a1 + (x − c2 ) [a2 + a3 (x − c3 ) + · · · + an (x − c3 ) · · · (x − cn )]}
Continuing we get the nested form of the Newton Polynomial.

Definition 38 (Nested Form of the Newton Polynomial) The nested form of the
Newton polynomial takes the form

Pn (x) = a0 + (x − c1 ) {a1 + (x − c2 ) [a2 + (x − c3 ) (a3 + · · ·


(x − cn−1 ) (an−1 + (x − cn )an ) · · · )]} , (6.4)

whose evaluation for any particular value of x takes 2n additions and n multiplications.

The efficiency of the evaluation in the nested form of the Newton polynomial comes from
the fact that we start with the innermost bracket and evaluate outward with very little
loss of significance.
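A minimal Python sketch of this nested (Horner-type) evaluation is given below; it is our own illustration, with the coefficient and centre lists as in (6.3)-(6.4).

def newton_eval(a, c, x):
    # Evaluate the Newton-form polynomial (6.3) at x by the nested form (6.4):
    # start from the innermost bracket and work outwards
    # (n multiplications and 2n additions).
    # a = [a0, a1, ..., an], c = [c1, c2, ..., cn].
    p = a[-1]
    for ai, ci in zip(reversed(a[:-1]), reversed(c)):
        p = ai + (x - ci) * p
    return p

# The power form is the special case c1 = ... = cn = 0, e.g.
# newton_eval([a0, a1, a2], [0.0, 0.0], x) evaluates a0 + a1*x + a2*x**2.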
More sophisticated remedies against loss of significance are offered by the presentation
in Chebyshev polynomials and other orthogonal polynomials. The problem of polynomial
representation using orthogonal polynomials is also very appealing for scientific work. We
note that we can transform polynomials from one form to the other using the following
nested multiplication algorithm.

Theorem 20 (Nested multiplication) Given the n + 1 coefficients a0, a1, · · · , an for
the Newton form (6.3) or the nested form (6.4) of the polynomial Pn(x), together with
the centres c1, c2, · · · , cn, and given also a number z, we simply calculate: a'_n = an, a'_i =
ai + (z − ci+1)a'_{i+1}, for i = n − 1, n − 2, · · · , 0. Then a'_0 = Pn(z). Moreover, the auxiliary
quantities a'_1, · · · , a'_n are of independent interest and we then have

Pn (x) = a00 + a01 (x − z) + a02 (x − z)(x − c1 ) + a03 (x − z)(x − c1 )(x − c2 ) + · · ·


+a0n (x − z)(x − c1 )(x − c2 ) · · · (x − cn−2 )(x − cn−1 ) (6.5)
0 0 0 0
= a0 + (x − z) {a1 + (x − c1 ) [a2 + (x − c2 ) (a3 + · · ·
 
(x − cn−2 ) a0n−1 + (x − cn−1 )a0n · · · , (6.6)

Proof: See Reference [1], Chapter 2. 


Given a discrete set of data points, we wish to estimate values of functions between
points (interpolation), or we may wish to approximate values of a given function outside
these data points (extrapolation).

Definition 39 (Interpolating Polynomial) Given a set of (n + 1) distinct data points
{x0, x1, · · · , xn}. If we construct a polynomial pn of degree at most n such that pn(xi) = f(xi)
for a given function f, we say the polynomial pn interpolates the function f at the points xi,
i = 0, 1, 2, · · · , n.

For example the polynomial pn (x) = 1+2x interpolates the points (0, 1), (2, 5), (4, 9), (5, 11)
on the interval [0, 5].
Two important reasons for approximating functions with polynomials are that (i)
polynomials are easy to use and (ii) polynomials can be used to provide very good ap-
proximations for functions in C([a, b]; R). The first reason is obvious; eg if we wish to
integrate or differentiate a complicated function, then we may wish to differentiate or
integrate its corresponding approximating polynomial. The second reason is provided by
the following theorem of Weierstrass.

Theorem 21 (Weierstrass Approximation Theorem) Let f ∈ C([a, b]; R). For ev-
ery ε > 0, there exists a polynomial pn of order n(ε) ∈ N such that kf (x) − pn (x)k∞ < ε.
Proof: Exercise. The Weierstrass approximation theorem states that any continuous
function on a finite interval [a, b] may be uniformly approximated by some polynomial.

6.2 Lagrange Form of the Interpolating Polynomial


Definition 40 (Lagrange interpolation) Given n + 1 distinct points {x0, x1, · · · , xn},
define

    πn(x) = ∏_{r=0}^{n} (x − xr),        (6.7)

    πn,k(x) = ∏_{r≠k} (x − xr) = πn(x)/(x − xk),   k = 0, 1, 2, · · · , n,        (6.8)

    Ln,k(x) = πn,k(x)/πn,k(xk),   k = 0, 1, 2, · · · , n.        (6.9)

Each Ln,k(x) is a polynomial of degree n, called a Lagrange polynomial, and has the prop-
erty that

    Ln,k(xr) = δr,k = { 1,  r = k;   0,  r ≠ k. }
Theorem 22 (Lagrange) Given distinct points xk, k = 0, 1, 2, · · · , n, and numbers
yk, k = 0, 1, 2, · · · , n, there exists a unique polynomial pn(x) of degree at most n which
satisfies pn(xr) = yr, r = 0, 1, 2, · · · , n.
Proof: Existence: The polynomial exists since, by construction,

    pn(x) = Σ_{r=0}^{n} yr Ln,r(x)

is a polynomial of degree at most n with

    pn(xs) = Σ_{r=0}^{n} yr Ln,r(xs) = Σ_{r=0}^{n} yr δr,s = ys.

Uniqueness: Now assume qn is a different polynomial of degree at most n which satisfies qn(xs) =
ys. Then pn − qn is again a polynomial of degree less than or equal to n which vanishes at
n + 1 points. This is a contradiction unless pn − qn ≡ 0, that is, pn − qn vanishes identically.
Hence pn = qn. 
Theorem 23 (Polynomial interpolation) Suppose f ∈ C([a, b]; R) and that x0 , x1 , · · · , xn
are n + 1 distinct points in [a, b]. Then there exists a unique polynomial of degree n, pn ,
such that pn (xr ) = f (xr ) and ∀x ∈ [a, b], ∃η ∈ [a, b] such that
πn (x)f (n+1) (η)
e(x) = f (x) − pn (x) = .
(n + 1)!

Proof: Existence and uniqueness follow from Theorem 22. Since πn(xr) = 0, the relation
is satisfied trivially for x = xr, r = 0, 1, 2, · · · , n. Let x be distinct from any xr and define
φ(t) = f(t) − pn(t) − K(x)πn(t), where K(x) is independent of t. Then φ(xr) = 0, r =
0, 1, 2, · · · , n. Let K(x) = (f(x) − pn(x))/πn(x), so that φ(x) = 0. Hence φ vanishes at the n + 2
points x0, x1, · · · , xn, x in the interval [a, b]. Hence by Rolle's theorem, φ' must vanish
at n + 1 distinct points in [a, b]. Hence, by induction, there exists η ∈ [a, b] such that
φ^{(n+1)}(η) = 0. But pn is a polynomial of degree at most n by construction. So

    φ^{(n+1)}(η) = f^{(n+1)}(η) − K(x) πn^{(n+1)}(η)
                 = f^{(n+1)}(η) − K(x)(n + 1)! = 0. 


Here en(x) is the error in the approximating polynomial. Suppose |f^{(n+1)}(x)| ≤ M for all x ∈ [a, b];
then
    ‖en‖_p ≤ (M/(n + 1)!) ‖πn‖_p,
where ‖f‖_p is defined by (2.1).

Remark 7 The interval [a, b] can be mapped into the interval [−1, 1] by a continuous
mapping. Hence, without loss of generality we can consider the interval [−1, 1]. If the
points xr are equally spaced in the interval, then πn (x) has n + 1 roots in the interval and
so oscillates n + 2 times (counting local extrema at end points).

So we may pose the following question: Can a judicious choice of interpolation points
improve the behaviour of pn ? Note that cos((n+1)θ), θ ∈ [0, π] has n+1 roots in [0, π] and
oscillates n + 2 times between ±1. This is a similar property of πn in [−1, 1] with equally
spaced points. So if we can find a mapping that relates x and θ, and has the property
that cos((n + 1)θ) is a finite polynomial in x, then we could use this function. This is
Chebyshev’s idea. The mapping is x = cos(θ). So we define the Chebyshev polynomials
on [−1, 1] by

Tn (x) = cos(n cos−1 (x)), where x = cos(θ). (6.10)

Clearly, by reduction formulae,

cos((n + 1)θ) = cos(nθ) cos(θ) − sin(nθ) sin(θ)


cos((n − 1)θ) = cos(nθ) cos(θ) + sin(nθ) sin(θ)

so that

Tn+1 (x) = 2xTn (x) − Tn−1 (x), T0 (x) = 1, T1 (x) = x (6.11)

So the first few members of the Chebyshev polynomials are

    T0(x) = 1,  T1(x) = x,  T2(x) = 2x² − 1,  T3(x) = 4x³ − 3x,  T4(x) = 8x⁴ − 8x² + 1,

and clearly Tn is a polynomial of degree n.
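The recurrence (6.11) is also a convenient way to generate these polynomials numerically; the Python sketch below (our own illustration) builds the coefficient lists of T0, ..., Tn, lowest degree first.

def chebyshev(n):
    # Generate T_0, ..., T_n as coefficient lists using (6.11):
    # T_{k+1}(x) = 2x T_k(x) - T_{k-1}(x), with T_0 = 1, T_1 = x.
    T = [[1.0], [0.0, 1.0]]
    for k in range(1, n):
        prev, cur = T[k - 1], T[k]
        nxt = [0.0] + [2.0 * c for c in cur]      # multiply T_k by 2x
        for i, c in enumerate(prev):              # subtract T_{k-1}
            nxt[i] -= c
        T.append(nxt)
    return T[: n + 1]

print(chebyshev(4)[4])    # [1.0, 0.0, -8.0, 0.0, 8.0], i.e. 8x^4 - 8x^2 + 1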

Notes 5 We note the following properties of the Chebyshev polynomials:
1. |Tn(x)| ≤ 1 ∀x ∈ [−1, 1].
2. Tn(x) = 2^{n−1} x^n + · · · , n > 0.
3. Tn+1(x) = 0 at θr = (r + 1/2)π/(n + 1), r = 0, 1, 2, · · · , n, that is at xr = cos((r + 1/2)π/(n + 1)),
r = 0, 1, 2, · · · , n; so if we choose these values as interpolation points, πn(x) = 2^{−n} Tn+1(x).
4. Since |Tn+1| ≤ 1, πn must oscillate between ±2^{−n}, and hence we have
    ‖en‖∞ ≤ M / (2^n (n + 1)!).
6.2.1 Divided Differences
One drawback with the Lagrange interpolating polynomial is that should an extra point
be added to the set {xr , yr }, it is necessary to recompute all the Lagrange polynomials.
Definition 41 (Divided differences) Let pn be the polynomial which interpolates the
points xi, i = 0, 1, 2, · · · , n, and define the n-th divided difference f[x0, x1, · · · , xn] to be
the coefficient of x^n in pn. Define f[xi] = f(xi), i = 0, 1, 2, · · · , n. From the definition of
the Lagrange form of the interpolating polynomial, for n ≥ 1, we have

    f[x0, x1, · · · , xn] = Σ_{k=0}^{n} f(xk) / ∏_{r≠k} (xk − xr).        (6.12)

Theorem 24 Let f ∈ C n ([a, b]; R) and let {xk , k = 0, 1, 2, · · · , n} be a set of distinct


points in [a, b]. Then there exists η ∈ [a, b] such that
f (n) (η)
f [x0 , x1 , · · · , xn ] = (6.13)
n!
Proof: Apply Rolle's theorem n times (as in the proof of the Lagrange theorem) to deduce
that e^{(n)} has a root η ∈ (a, b), so that e^{(n)}(η) = f^{(n)}(η) − pn^{(n)}(η) = f^{(n)}(η) − n! f[x0, x1, · · · , xn] = 0,
which gives (6.13). 
Theorem 25 Let pn be the polynomial of degree at most n that interpolates the points {xr, f(xr)},
r = 0, 1, 2, · · · , n. Then the polynomial

    pn+1(x) = pn(x) + f[x0, x1, · · · , xn, xn+1] ∏_{r=0}^{n} (x − xr)

satisfies pn+1 (xr ) = fr , r = 0, 1, 2, · · · , n + 1, and is a polynomial which interpolates


{(xr , f (xr )), r = 0, 1, 2 · · · , n + 1}.
Proof: Clearly ∏_{r=0}^{n}(x − xr) vanishes at x0, x1, x2, · · · , xn; since pn(xr) = fr for r =
0, 1, 2, · · · , n, the polynomial pn+1 interpolates the points {(xr, fr), r = 0, 1, 2, · · · , n}. At
x = xn+1, the definition of the divided difference and of the Lagrange polynomial give

    pn+1(xn+1) = Σ_{r=0}^{n} fr Ln,r(xn+1) + Σ_{r=0}^{n+1} fr ∏_{s=0}^{n}(xn+1 − xs) / ∏_{j≠r, j≤n+1}(xr − xj) = fn+1.        (6.14)  


6.3 Newton’s Form of the Interpolating polynomial
Corollary 6 (Newton Interpolation) The Lagrange interpolation polynomial can be
represented in terms of divided differences by

    pn(x) = Σ_{j=0}^{n} f[x0, x1, · · · , xj] ( ∏_{r=0}^{j−1} (x − xr) ).        (6.15)

Furthermore,

    f[x0, x1, · · · , xn] = ( f[x1, · · · , xn] − f[x0, x1, · · · , xn−1] ) / (xn − x0).        (6.16)
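The coefficients in (6.15) are easy to compute from the recursion (6.16); the Python sketch below (our own illustration) builds them in place and then evaluates the Newton form in nested fashion.

import numpy as np

def divided_differences(x, f):
    # Coefficients f[x0], f[x0,x1], ..., f[x0,...,xn] of (6.15), via (6.16).
    a = np.array(f, dtype=float)
    n = len(x)
    for k in range(1, n):
        # after step k, a[i] holds f[x_{i-k}, ..., x_i]
        a[k:n] = (a[k:n] - a[k-1:n-1]) / (np.array(x[k:n]) - np.array(x[:n-k]))
    return a            # a[j] = f[x0, ..., xj]

def newton_interp(x, a, t):
    # Evaluate (6.15) at t in nested (Horner-like) form.
    p = a[-1]
    for xj, aj in zip(reversed(x[:-1]), reversed(a[:-1])):
        p = aj + (t - xj) * p
    return p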
If, in addition to the requirement that a polynomial take given values at the points of a
discrete set, we ask for its derivative values to be specified as well, then the method is called
Hermite polynomial approximation.
Theorem 26 (Hermite Interpolation) Suppose f ∈ C^{2n+2}([a, b]; R) and that
(x0, f(x0)), (x1, f(x1)), · · · , (xn, f(xn)) are n + 1 distinct points with xi ∈ [a, b],
i = 0, 1, 2, · · · , n. Then there exists a unique polynomial p2n+1, of degree at most 2n + 1,
that satisfies p2n+1(xr) = f(xr), p'2n+1(xr) = f'(xr), r = 0, 1, 2, · · · , n, and ∀x ∈ [a, b],
∃η ∈ (a, b) such that

    f(x) = p2n+1(x) + [πn(x)]² f^{(2n+2)}(η) / (2n + 2)!.
Proof: Define
 
Hr (x) = 1 − 2(x − xr )L0n,r (xr ) L2n,r (x), Kr (x) = (x − xr )L2n,r (x), (6.17)

then write
r=n
X
p2n+1 (x) = [f (xr )Hr (x) + f 0 (xr )Kr (x)] (6.18)
r=0

It is then easy to verify that p2n+1 so constructed is a polynomial of degree at most
2n + 1, that p2n+1(xr) = f(xr), p'2n+1(xr) = f'(xr), and finally that the polynomial so
constructed is unique. Full details are left as an exercise. 

6.3.1 Exercises
1. We are faced with the issue of polynomial approximation in the L∞ norm. In partic-
ular, we wish to handle the following problem:
Given u ∈ C[a, b], if Pn is the set of all polynomials of degree less than or equal to
n, how do we choose pn ∈ Pn such that

    ‖pn − u‖ ≤ ‖qn − u‖,   ∀qn ∈ Pn,

and, if we can find pn, how accurately does pn represent u?

(i) Discuss the problem of existence, uniqueness and accuracy of the polynomial
approximation in the L∞ norm, paying particular attention to the Weierstrass
theorem and the associated proof using Bernstein polynomials, and to uniform or min-
imax polynomial approximation (the oscillation theorem, the de la Vallée Poussin
theorem, etc.).
(ii) Also discuss the problem of existence, uniqueness and accuracy of the polyno-
mial approximation in the L2 norm, paying particular attention to approx-
imation by orthogonal polynomials and least squares approximation.
(iii) Now, find the zero and first order polynomials P0(x) and P1(x) which are
orthonormal in the sense that ∫_0^1 Pi(x)Pj(x)dx = δi,j, where δi,j is the Kronecker
delta. Use P0 and P1 to find the approximation f(x) = αP0(x) + βP1(x) to x²
which minimizes ∫_0^1 (x² − f(x))² dx.

2. Divided differences and divided difference tables: The Newton form of the
interpolating polynomial may be written in the form

Pn (x) = f [x0 ] + f [x0 , x1 ](x − x0 ) + f [x0 , x1 , x2 ](x − x0 )(x − x1 )


+ · · · + f [x0 , x1 , x2 , · · · , xn ](x − x0 )(x − x1 ) · · · (x − xn−1 )
Xn i−1
Y
= f [x0 , x1 , · · · , xi ] (x − xj ) (6.19)
i=0 j=0

where
    f[x0] = f(x0),   f[x0, x1] = (f(x1) − f(x0)) / (x1 − x0),
    f[x0, x1, · · · , xk] = ( f[x1, x2, · · · , xk] − f[x0, x1, · · · , xk−1] ) / (xk − x0).
The coefficients f [x0 , x1 , · · · , xk ] may be calculated from the following algorithm,
which displays the results in the form of a table, as shown in Table 6.1, called the
divided difference table.
Algorithm: The divided difference table. Given the first two columns of the
table, containing x0, x1, · · · , xn and the corresponding f[x0], f[x1], · · · , f[xn]:
For k = 1, 2, · · · , n do
    For i = 0, 1, · · · , n − k do
        f[xi, xi+1, · · · , xi+k] := ( f[xi+1, xi+2, · · · , xi+k] − f[xi, xi+1, · · · , xi+k−1] ) / (xi+k − xi)
    end do i
end do k

The results would then be displayed on a table like the following, from which the
coefficients can be read off. Now answer the following questions:

(i) Find the polynomial of degree ≤ 2 which satisfies P2 (1) = 1.5709, P2(4) =
1.5724, P2(6) = 1.5751 Hint: Calculate f [1, 4], f [4, 6], f [1, 4, 6].

Table 6.1: The divided difference table

xi    f[·]      f[·,·]        f[·,·,·]         f[·,·,·,·]          f[·,·,·,·,·]

x0    f[x0]
                f[x0,x1]
x1    f[x1]                   f[x0,x1,x2]
                f[x1,x2]                       f[x0,x1,x2,x3]
x2    f[x2]                   f[x1,x2,x3]                          f[x0,x1,x2,x3,x4]
                f[x2,x3]                       f[x1,x2,x3,x4]
x3    f[x3]                   f[x2,x3,x4]
                f[x3,x4]
x4    f[x4]

(ii) Values of a function, f , where f = {(0, 1), (1, 5), (2, 31), (3, 121), (4, 341)},
are those of a certain polynomial of degree less than or equal to 4. Find the
polynomial from a difference table, or otherwise, and then estimate f (5).
(iii) An integral related to the complete elliptic integral is defined by

    K(κ) = ∫_0^{π/2} dx / √(1 − sin²κ sin²x).

From tables we find that K(1) = 1.5709, K(4) = 1.5727, K(6) = 1.5751. Find
K(3.5) using a second degree Lagrange polynomial.

3. It can be shown that the Hermite polynomials Hk(x) satisfy the relation

    Hk+1(x) = 2xHk(x) − 2kHk−1(x),   k = 0, 1, 2, · · · ,   H0 = 1.        (6.20)

Find the first four Hermite polynomials and find the zeroes of H2, H3 and H4.

Chapter 7

Numerical Differentiation and


Numerical Quadrature

7.1 Numerical Differentiation


Using the interpolating polynomial, we have

f (x) = pn (x) + en (x)

where en is the error incurred in approximating the function f by pn . Thus on a set of


mesh points xi , i = 0, 1, 2, · · · , n, we may approximate f 0 (xi ), with p0n (xi ). For example,
suppose we have represented f by a second degree polynomial and write it in the Newton
form as

p2 (x) = f [xl ] + f [xl , xm ](x − xl ) + f [xl , xm , xr ](x − xl )(x − xm )

It is clear that the triple set (xl , xm , xr ) will vary over the set of partition points (x0 , x1 , x2 , · · · , xn )
which for the sake of the second degree polynomial interpolation will be taken three points
at a time. For example for the triple (x5 , x6 , x7 ), xl = x5 , xm = x6 and xr = x7 . Thus
to find the approximate first derivative to a function y at xm , we may choose to use the
second degree polynomial p2 and differentiate to have

    y'(xm) ≈ p'2(xm) = y[xl, xm] + y[xl, xm, xr](xm − xl)
           = ( −y(xr)(xl − xm)² + (xm − xr)² y(xl) + (xl − xr)(xl − 2xm + xr) y(xm) )
             / ( (xl − xm)(xl − xr)(xm − xr) ),

so if we are on an equally spaced set of mesh points with xl = xm − h, xr = xm + h, we
then have the approximation

    y'(xm) ≈ ( y(xm + h) − y(xm − h) ) / (2h),        (7.1)
which is the central difference approximation to the derivative of y at xm . The added
advantage here is that we can then approximate the derivative at any point in the interval
[xl , xr ] simply by evaluating the derivative of the approximating polynomial at the desired

point. For example by evaluating the derivative of the approximating polynomial at xr
on the equally spaced mesh points with stepsize h we have

    y'(xr) ≈ ( y(xm − h) − 4y(xm) + 3y(xm + h) ) / (2h).        (7.2)
With higher degree polynomials we can construct different approximations.
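The two formulas above are easy to try out numerically; the Python sketch below (our own illustration) applies them to a smooth test function and compares with the exact derivative.

import numpy as np

def dydx_central(y, m, h):
    # Central difference (7.1): first derivative at the middle point x_m.
    return (y(m + h) - y(m - h)) / (2.0 * h)

def dydx_right(y, m, h):
    # One-sided formula (7.2): derivative at the right point x_r = x_m + h.
    return (y(m - h) - 4.0 * y(m) + 3.0 * y(m + h)) / (2.0 * h)

# Quick check on y = sin(x): both estimates should be close to cos(1).
h = 1e-3
print(dydx_central(np.sin, 1.0, h), dydx_right(np.sin, 1.0, h), np.cos(1.0))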

7.2 Numerical integration


Using the interpolating polynomial, we can write
    I = ∫_a^b f(x)dx = ∫_a^b pn(x)dx + ∫_a^b (f(x) − pn(x))dx,        (7.3)

where the first term on the right is the approximation of I and the second term is the error,

and thus obtain the approximation to the integral I by integrating the corresponding
approximating polynomial. If for example, the Lagrange form of the interpolating poly-
nomial are used, then we have
    ∫_a^b f(x)dx = Σ_{r=0}^{n} Wr f(xr) + R,   where   Wr = ∫_a^b Ln,r(x)dx.        (7.4)

Clearly Wr depends on the interpolation points but is independent of the interpolation


values, and

    R = (1/(n + 1)!) ∫_a^b πn(x) f^{(n+1)}(η(x)) dx.
If equally spaced interpolation points are used, then the formulae which result are called
Newton-Cotes quadrature formulas. For example:

1. Trapezoidal Rule. If we represent the function f on the interval [a, b] by a


polynomial of degree 1, then p1 (x) = f [a] + f [a, b](x − a) where f [a] = f (a) and
f [a, b] = (f (b) − f (a))/(b − a), using the Newton Form of the interpolating polyno-
mial. Then
    ∫_a^b f(x)dx ≈ ∫_a^b p1(x)dx = ∫_a^b ( f[a] + f[a, b](x − a) ) dx = (1/2)(b − a)(f(a) + f(b)).
Thus first degree polynomial interpolation leads to the Trapezoidal rule for integra-
tion.

2. Simpson’s Rule. If we represent the function f on the interval [a, b] by a polyno-


mial of degree 2, such that p2(a) = f(a), p2((a + b)/2) = f((a + b)/2) and p2(b) = f(b), then
p2(x) = f[a] + f[a, (a + b)/2](x − a) + f[a, (a + b)/2, b](x − a)(x − (a + b)/2), where f[a], f[a, (a + b)/2]
and f[a, (a + b)/2, b] are the divided difference coefficients in the Newton form of the
interpolating polynomial. Then

    ∫_a^b f(x)dx ≈ ∫_a^b p2(x)dx
                 = ∫_a^b { f[a] + f[a, (a + b)/2](x − a) + f[a, (a + b)/2, b](x − a)(x − (a + b)/2) } dx
                 = (1/6)(b − a){ f(a) + 4f((a + b)/2) + f(b) }.

Thus the second degree Newton interpolating polynomial leads to Simp-
son's rule for integration.
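Both single-interval rules are one line of Python each; the sketch below (our own illustration) also indicates how the composite versions are assembled.

def trapezoid(f, a, b):
    # Single-interval trapezoidal rule, from first degree interpolation.
    return 0.5 * (b - a) * (f(a) + f(b))

def simpson(f, a, b):
    # Single-interval Simpson rule, from second degree interpolation.
    m = 0.5 * (a + b)
    return (b - a) / 6.0 * (f(a) + 4.0 * f(m) + f(b))

# Composite versions apply the rule on each of N sub-intervals and sum, e.g.
# h = (b - a) / N
# sum(simpson(f, a + k*h, a + (k + 1)*h) for k in range(N))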

We can use polynomial interpolation for other purposes. See Chapter 7 of reference [1]
and Chapter 6 of reference [3] for more.

7.2.1 Exercises
1. If ψ(x) = (x − a)(x − (a + b)/2)(x − b), verify by direct integration that ∫_a^b ψ(x)dx = 0.
How does this result affect the interpretation of the definite integral as the area
under a graph of f ? Can you comment?
2. Five rules for approximating the definite integral I(f) = ∫_a^b f(x)dx are the
following:

(i) Rectangular rule: I(f) ≈ (b − a)f(a)
(ii) Midpoint rule: I(f) ≈ (b − a)f((a + b)/2)
(iii) Trapezoidal rule: I(f) ≈ (1/2)(b − a)(f(a) + f(b))
(iv) Simpson's rule: I(f) ≈ ((b − a)/6)(f(a) + 4f((a + b)/2) + f(b))
(v) Corrected trapezoidal rule: I(f) ≈ ((b − a)/2)[f(a) + f(b)] + ((b − a)²/12)(f'(a) − f'(b))

Now answer the following questions.

(i) Determine the error in the approximation in each case.


(ii) Apply each of the five rules to find an approximate value for the integral
I(f) = ∫_0^1 x sin(x)dx. Compare your results with the correct answer I(f) =
sin(1) − cos(1).
(iii) Determine N (the number of partition points) such that the composite
trapezoidal rule gives the value of I(f) = ∫_0^1 e^{−x²} dx correct to six digits after the
decimal point, assuming that e^{−x²} can be calculated accurately, and compute
the approximation. Do the same for Simpson's rule.

3. Numerical differentiation by central differences:

(i) For a given step size h and a given differentiable function f, define

    (i)   f'(ξ) ≈ ( f(ξ + h) − f(ξ) ) / h,         (ii)  f'(ξ) ≈ ( f(ξ + h) − f(ξ − h) ) / (2h),
    (iii) f'(ξ) ≈ ( 2f(ξ) − 4f(ξ + h) + 2f(ξ + 2h) ) / h,
    (iv)  f''(ξ) ≈ ( f(ξ − h) − 2f(ξ) + f(ξ + h) ) / h².
For f (x) = sinh(x), calculate f 0 (1.4) and f 00 (1.4) and compare your results
with the exact answers. Estimate the errors analytically.
(ii) Let f(x) = ln(x), x > 0. On your computer, calculate the sequence of numbers
an = f[2 − 2^{−n}, 2 + 2^{−n}] (where f[·, ·] is the divided difference). Without the
effect of rounding errors, lim_{n→∞} an = f'(2) = 0.5. What really happens on
the machine? Please include a segment of the code which you have used to do
your calculations, and also give a printout of your results for large n.

Chapter 8

Numerical Solution of Nonlinear


Equations

8.1 Introduction
The aim of this section is to determine methods of approximating solutions of nonlinear
single equations. The problem can be posed in many guises: For a given nonlinear function
f : R → R,

1. Find a root of f (x) = 0.

2. Find a zero of f(x).

3. Find a fixed point of the map x ↦ x − f(x).

Here we illustrate the common methods and address the issues of convergence of schemes.

8.2 Bisection Method


Suppose we know the values of a and b such that f (a)f (b) < 0. If f is continuous, the
intermediate value theorem assures us that there exists at least one root of f in [a, b]. If
in addition we have determined that f is monotone in [a, b], then the IVT assures us that
there exists exactly one zero of f in [a, b] whenever f(a)f(b) < 0. We proceed as follows:

1. Locate an interval I ⊂ R wherein f is monotonic.

2. Locate points a, b ∈ I such that f(a)f(b) < 0.

3. Calculate c = (a + b)/2.

4. Verify: if f(c)f(a) < 0, take [a, c] ⊂ I as the new and better estimate of the
interval containing the zero of f; else, if f(c)f(b) < 0, take [c, b] ⊂ I as the new
approximating interval.

Interval halving is continued as long as desired.
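A minimal Python sketch of this interval halving follows (our own illustration; the tolerance and iteration cap are arbitrary choices).

def bisection(f, a, b, tol=1e-10, maxit=200):
    # Interval halving for a continuous f with f(a)*f(b) < 0.
    fa, fb = f(a), f(b)
    if fa * fb > 0:
        raise ValueError("f(a) and f(b) must have opposite signs")
    for _ in range(maxit):
        c = 0.5 * (a + b)
        fc = f(c)
        if fc == 0.0 or 0.5 * (b - a) < tol:
            return c
        if fa * fc < 0:
            b, fb = c, fc
        else:
            a, fa = c, fc
    return 0.5 * (a + b)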

8.2.1 High points for the bisection method
1. Method is guaranteed to converge

2. Sought after root is known to lie in an interval of length 2−n (b − a) after n steps of
the method

3. Only requires evaluation of the function f .

8.2.2 Low points for the bisection method


1. It is not clear how to generalize this method to systems of nonlinear equations

2. Though convergence is guaranteed, it is sometimes painfully slow

3. After several evaluations of the function, we know and have a lot of information
about f near the sought after zero, but still only use the last calculated value.

8.3 Iterative methods (Successive approximation)


For this method, we write f(x) = x − g(x), where g is some continuous function; thus
the equation f(x) = 0 becomes the equation x − g(x) = 0, and the problem is changed to
the fixed point problem: find x such that x = g(x).
Definition 42 (Fixed point) Let f : R → R be a function. A fixed point of f is that
point x ∈ R with the property that f (x) = x.

Theorem 27 (Fixed point theorem) Let f ∈ C([a, b]; [a, b]); that is, f is continuous and
f(x) ∈ [a, b] for all x ∈ [a, b]. Then there exists s in [a, b] such that f(s) = s.

Proof: Given that f is a continuous injection map, the image of [a, b] under f is a subset
of [a, b]. Thus f (a) ≥ a and f (b) ≤ b. Now, if f (a) = a or f (b) = b, the result holds, as
we can take s = a or s = b. Suppose, otherwise, that a < f (x) < b for all x in [a, b], that
is, f (x) 6= a and f (x) 6= b so that f (a) > a and f (b) < b. Consider g : x 7→ f (x) − x.
Clearly, g is continuous on [a, b] with g(a) = f (a) − a > 0, g(b) = f (b) − b < 0 and so
g(a)g(b) < 0 and so by Intermediate Value Theorem , ∃s ∈]a, b[ such that g(s) = 0 or
f (s) = s. 

Theorem 28 (Divergence theorem) Suppose g(s) = s and that g' is continuous in an
interval I containing s. If |g'(s)| > 1, then s is a point of repulsion for the iteration
xn+1 = g(xn), x0 ∈ I, and the sequence (xn)n≥1 does not converge to s.

Proof: Geometrically, if the slope of g is greater than unity near s, then the sequence will
diverge. Since |g'(s)| > 1, let M = (1/2)(1 + |g'(s)|), so that 1 < M < |g'(s)|. Then
∃ρ > 0 such that |g'(x)| > M for |x − s| ≤ ρ. Then we can apply the mean value theorem
by noting that if |xn − s| ≤ ρ, there exists η between xn and s such that

    |xn+1 − s| = |g(xn) − g(s)| = |g'(η)||xn − s|.

Hence,
|xn+1 − s| ≥ M|xn − s| ≥ |xn − s|
and sequence is diverging away from s. .

Definition 43 (Contraction mapping) The mapping f : R → R is a contracting map-


ping on the closed interval I ⊂ R, if

(i) f : I → I; that is, f maps the interval I into itself.

(ii) ∃L with 0 < L < 1 such that ∀x, y ∈ I, |f (x) − f (y)| ≤ L|x − y|.

Remark 8 (a) Condition (ii) in the definition is called a Lipschitz condition and L is
called a Lipschitz constant.

(b) A function f : I → R that satisfies the Lipschitz condition on I (for any L > 0) is
said to be Lipschitz continuous with Lipschitz constant L.

Theorem 29 (Contraction mapping theorem) Suppose f is a contracting mapping


on an interval I ⊆ R. Then for every x0 ∈ I, the sequence (xn )n≥0 generated by xn+1 =
f (xn ), x0 ∈ I given converges to a unique fixed point in I.

Proof: Let s be a fixed point of f in I (for a closed interval I its existence follows from
Theorem 27; it also follows from the Cauchy-sequence argument used in the proof of Corollary 8).
Since f maps I into itself, xn ∈ I for all n ≥ 1, so that
    |xn+1 − s| = |f(xn) − f(s)| ≤ L|xn − s|,
and we deduce that
    |xn − s| ≤ L^n |x0 − s| → 0 as n → ∞.
Therefore xn → s as n → ∞. To prove that s is unique, let r and s be two fixed points
of the map f and f is a contraction in I. Then

|r − s| = |f (r) − f (s)| ≤ L|r − s| ⇒ L ≥ 1,

contradicting the fact that 0 < L < 1. So the fixed point must be unique. .

Corollary 7 Suppose that f satisfies the Lipschitz condition on the interval [x0 −ρ, x0 +ρ],
with Lipschitz constant L < 1 and that |x0 − f (x0 )| < (1 − L)ρ, then the sequence (xn )n≥0
generated by xn+1 = f (xn ), converges to a fixed point s ∈ [x0 − ρ, x0 + ρ].

Proof: For x ∈ [x0 − ρ, x0 + ρ],

|f (x) − x0 | = |f (x) − f (x0 ) + f (x0 ) − x0 | ≤ |f (x) − f (x0 )| + |f (x0 ) − x0 |


≤ L|x − x0 | + (1 − L)ρ
≤ Lρ + (1 − L)ρ = ρ.

Therefore, f maps the interval [x0 − ρ, x0 + ρ] into itself and the result follows from
Theorem 29. 

Corollary 8 (Error bounds) Suppose f satisfies the Lipschitz condition on the interval
[x0 − ρ, x0 + ρ] with Lipschitz constant L < 1 and that |x0 − f(x0)| ≤ (1 − L)ρ. Then the
sequence (xn)n≥1 defined by xn+1 = f(xn) satisfies

(i)  |x0 − s| ≤ (1/(1 − L)) |x1 − x0|,

(ii) |xn − s| ≤ (L^n/(1 − L)) |x1 − x0|.
Proof:
(i) Since s is a fixed point of f , we have
|x0 − s| = |x0 − f (x0 ) + f (x0 ) − s| ≤ |x0 − f (x0 )| + |f (x0 ) − f (s)|
≤ |x0 − x1 | + L|x0 − s|
1
≤ |x1 − x0 |.
1−L
(ii) The Lipschitz condition shows that
|xn+1 − xn | ≤ L|xn − xn−1 | ≤ L2 |xn−1 − xn−2 | ≤ · · · ≤ Ln |x1 − x0 |.
Therefore for m > n,
|xm − xn | ≤ |xm − xm−1 | + |xm−1 − xm−2 | + · · · + |xn+2 − xn+1 | + |xn+1 − xn |
≤ Ln (Lm−n−1 + Lm−n−2 + · · · + L2 + L + 1)|x1 − x0 |
Ln (1 − Lm−n )
= |x1 − x0 |.
1−L
Now let m → ∞ so that xm → s to establish (ii). 
Example 16 Find the real root of the equation x3 − 2x2 − x − 2 = 0.
(a) By rewriting the equation in the form x = g(x) where g(x) = 2 + x−1 + 2x−2 , show
that the contraction mapping theorem works for g in the interval [2, z], z ≥ 3, and
calculate the Lipschitz constant. Start from a value of x in this interval and calculate
x4 − s. Compare with the exact value of the error.
(b) Repeat with g(x) = (2x² + x + 2)^{1/3}.

Solution: The real solution is x = (1/3)( 2 + ∛(44 − 3√177) + ∛(44 + 3√177) ) = 2.65897,
correct to five decimal places.
(a) From the equation x³ − 2x² − x − 2 = 0, we have x³ = 2x² + x + 2 ⇒ x = 2 + x^{−1} + 2x^{−2},
yielding the equation x = g(x) where g(x) = 2 + x^{−1} + 2x^{−2}. Observe that g is
continuous and differentiable on the interval [2, z], z ≥ 3. Thus ∀p, q ∈ [2, z], ∃η ∈
(2, z) such that g(p) − g(q) = g'(η)(p − q), by the mean value theorem. In particular
|g(p) − g(q)| = |g'(η)||p − q|. Here |g'(x)| = |−1/x² − 4/x³| ≤ 3/4 for all x ∈ [2, z], z ≥ 3.
Thus L, the Lipschitz constant, is 3/4. Following the error estimates from Corollary
8, we have |x4 − s| ≤ (L⁴/(1 − L))|x1 − x0|. Let x0 = 3; then x1 = g(x0) = 23/9 = 2.55556.
Therefore |x1 − x0| = 0.44444, and (L⁴/(1 − L))|x1 − x0| = 0.5625. The exact absolute error
after 4 steps is |2.65897 − g(g(g(g(3))))| = 0.00478166, so the bound holds and the
iteration actually does much better than the estimate suggests.

(b) Similar
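The iteration of part (a) is a two-line computation; the Python sketch below (our own illustration) runs the successive approximation x_{n+1} = g(x_n) from x0 = 3, which can be compared with the root 2.65897 quoted above.

def fixed_point(g, x0, nsteps):
    # Successive approximation x_{n+1} = g(x_n).
    x = x0
    for _ in range(nsteps):
        x = g(x)
    return x

g = lambda x: 2.0 + 1.0 / x + 2.0 / x ** 2      # part (a) of Example 16
print(fixed_point(g, 3.0, 4))                   # compare with the root 2.65897...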

Definition 44 (Rate of convergence) When a sequence converges, its rate of conver-


gence can be measured in several ways:

1. Linear convergence: The sequence (xn)n≥1 converges linearly to s with rate R,
if xn → s as n → ∞ and
(a) (xn+1 − s)/(xn − s) → r as n → ∞,
(b) R = − log_e |r|.

2. Quadratic convergence: The sequence (xn)n≥1 converges quadratically to s with
rate R, if xn → s as n → ∞ and
(a) (xn+1 − s)/(xn − s)² → r as n → ∞,
(b) R = − log_e |r|.

Clearly, for convergence we require r < 1 and the larger R, (smaller r), the faster the
convergence.

Definition 45 (Relaxation Process) Suppose there exists an interval [a, b] such that
sgn(f (a)) 6= sgn(f (b)) and f 0 (x) does not vanish on the interval [a, b]. Then we define the
relaxation process by

xn+1 = xn − θf (xn ), θ a constant called relaxation constant. (8.1)

If the sequence (xn )n≥1 converges to a limit s, then f (s) = 0. So for the relaxation process,
the mapping g is g(x) = x − θf (x), for some suitably chosen relaxation constant θ.

Theorem 30 (Relaxation Process) . Suppose f ∈ C 1 ([a, b]; R) and that f (a)f (b) < 0.
For definiteness, we suppose f (a) < 0, and f (b) > 0. If 0 < f 0 (x) < c for all x ∈ [a, b],
the relaxation process converges ∀x ∈ [a, b] if 0 < θ < 1c .

Proof: To use the contraction mapping theorem, we require:

1. |g'(x)| = |1 − θf'(x)| < 1 for all x ∈ [a, b], so that −1 < 1 − θc < 1 ⇒ 0 < θ < 2/c.

2. g must map [a, b] into itself. By the mean value theorem,

    g(x) = x − θf(x) = a − θf(a) + (x − a)(1 − θf'(η1))
                     = b − θf(b) + (x − b)(1 − θf'(η2)),

where η1, η2 ∈ [a, b]. Since f(a) < 0 and f(b) > 0, to guarantee a < g(x) < b we need 0 < θ < 1/c. 

8.4 Newton’s Method
The relaxation process can be generalized to the mapping of the form g(x) = x−h(x)f (x)
where the function h can be chosen to improve the convergence properties of the iteration.
Then g 0 (x) = 1 − h0 (x)f (x) − h(x)f 0 (x) so that if f (s) = 0, then the asymptotic rate
of convergence is determined by g 0(s) = 1 − h(s)f 0 (s) and this will become infinite if
1 − h(s)f 0 (s) = 0. A function h that satisfies this condition is h(x) = 1/f 0(x), and this
gives Newton’s method:
f (xn )
xn+1 = xn − (8.2)
f 0 (xn )
and we call the sequence (xn )n≥1 generated this way a Newton sequence.
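A minimal Python sketch of the Newton iteration (8.2) follows (our own illustration; the test equation is the cubic of Example 16, whose root 2.65897 was computed above).

def newton(f, fprime, x0, tol=1e-12, maxit=50):
    # Newton iteration (8.2): x_{n+1} = x_n - f(x_n)/f'(x_n).
    x = x0
    for _ in range(maxit):
        step = f(x) / fprime(x)
        x -= step
        if abs(step) < tol:
            return x
    return x

print(newton(lambda x: x**3 - 2*x**2 - x - 2,
             lambda x: 3*x**2 - 4*x - 1, 3.0))   # converges to 2.65897...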
Theorem 31 (On the quadratic convergence of Newton’s Iteration) Suppose the
Newton sequence (xn )n≥1 converges to s so that f (s) = 0. If f 0 (s) 6= 0 and f 00 (x) is con-
tinuous in a neighbourhood of s. The sequence converges quadratically to s.
Proof: Using Taylor's theorem, there exists ηn between xn and s such that

    0 = f(s) = f(xn) + (s − xn)f'(xn) + (1/2)(s − xn)² f''(ηn).

Thus

    s − xn+1 = s − xn + f(xn)/f'(xn) = −(1/2)(s − xn)² f''(ηn)/f'(xn),

so that, since f'' is continuous,

    (s − xn+1)/(s − xn)² → −(1/2) f''(s)/f'(s)   as n → ∞. 
The next result shows that the results of Newton’s method are local.
Theorem 32 (Local convergence of Newton's method) If f(s) = 0, f'(s) ≠ 0, and
f'' is continuous in a neighbourhood of the point s, then the Newton sequence converges
to s if the initial guess x0 is sufficiently close to s.

Proof: Since g(x) = x − f(x)/f'(x), we have g'(x) = f(x)f''(x)/f'(x)², and then g'(s) = 0, since f(s) = 0
and f' is continuous and non-zero in a neighbourhood of s. Therefore there is a neighbourhood of s
where g satisfies a Lipschitz condition with constant less than 1, and the result follows. 

8.5 Other methods


1. The Secant Method: Replace the derivative in Newton's method by the finite differ-
ence approximation
    f'(xn) ≈ ( f(xn) − f(xn−1) ) / (xn − xn−1)
(a short computational sketch is given after this list).

Therefore,

    xn+1 = xn − f(xn)(xn − xn−1)/( f(xn) − f(xn−1) ) = ( xn−1 f(xn) − xn f(xn−1) ) / ( f(xn) − f(xn−1) ).        (8.3)

2. Regula Falsi Method: The secant method can fail if the initial values x0 and
x1 are not sufficiently close to s. A method which is guaranteed to converge is
the Regula Falsi method, or method of false position. The method resembles
bisection but uses a little more common sense as well. Instead of halving the interval,
the method calculates the root of the line joining the two points and then uses the
sub-interval that contains the root as the next interval. Thus suppose f(a)f(b) < 0.
Find the equation of the straight line that passes through the points (a, f(a)) and
(b, f(b)); if l0(x) is that equation, find x1 such that l0(x1) = 0 and then verify:
if f(a)f(x1) < 0, then the interval [a, x1] is the new interval containing the sought-
after zero, else it is found in the interval [x1, b]. Thus, at each general step, if
f(xn)f(xn−1) < 0, we find the equation of the straight line that passes through the
points (xn−1, f(xn−1)) and (xn, f(xn)); if ln(x) is that equation, we find xn+1 such
that ln(xn+1) = 0 and then verify: if f(xn)f(xn+1) < 0, then the interval [xn, xn+1]
is the new interval containing the sought-after zero, else it is found in the interval
[xn−1, xn+1]. Thus xn+1 is given by

    xn+1 = ( xn f(xn−1) − xn−1 f(xn) ) / ( f(xn−1) − f(xn) ),   n = 1, 2, · · · ,   x0, x1 given.        (8.4)
The next interval containing the zero is chosen once the sign of f(xn+1) is known.
Convergence is linear.
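The secant iteration (8.3) is sketched below in Python (our own illustration); Regula Falsi differs only in how the next bracketing pair is chosen once the sign of f at the new point is known.

def secant(f, x0, x1, tol=1e-12, maxit=100):
    # Secant iteration (8.3): f'(x_n) replaced by a difference quotient.
    f0, f1 = f(x0), f(x1)
    for _ in range(maxit):
        x2 = x1 - f1 * (x1 - x0) / (f1 - f0)
        if abs(x2 - x1) < tol:
            return x2
        x0, f0, x1, f1 = x1, f1, x2, f(x2)
    return x1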

The methods presented can be generalized to finding zeros in Rn .

8.5.1 Exercises
1. Show that the function f : x ↦ f(x) defined by f(x) = (1/10)(10x² − 2x − 5), x ∈ R,
has exactly one zero between x = 0.5 and x = 1. Approximate the zero correct
to six decimal places using each of the following methods¹: the bisection method,
the Regula Falsi method, the modified Regula Falsi method, the secant method, Newton's
method and the fixed point iteration method. Tabulate the results, showing how many
steps of the iteration are needed for each method to produce the required
approximation.

2. Find the interval containing the real positive zero of the function f (x) = x2 −2x−2.
Use each of the five methods introduced above to compute the zero correct to two
significant figures.2 In each case, estimate how many steps each method would
require to produce 6 significant figures.
1
All of these methods will be found in any elementary text on numerical analysis
² Significant figures (also called significant digits) of a number are those digits that carry meaning
contributing to its accuracy. Generally this includes all digits except: (i) leading and trailing zeros where
they serve merely as placeholders to indicate the scale of the number and (ii) spurious digits introduced, for
example, by calculations carried out to greater accuracy than that of the original data, or measurements
taken to a greater precision than the equipment supports. The concept of significant figures is often used
in connection with rounding. Rounding to n significant figures is a more general-purpose technique than
rounding to n decimal places, since it handles numbers of different scales in a uniform way. The term
"significant figures" can also refer to a crude form of error representation based around significant figure
rounding.
3. In the Bisection method, let L denote the length of the initial interval [a0 , b0 ].
Let {ξ0, ξ1 , ξ2 , · · · } represent the successive mid points generated by the bisection
method. Show that
|ξn+1 − ξn | = 2−(n+2) L.
Also show that the number N of iterations required to guarantee the approximation
to a root to an accuracy ε is given by

    N > −2 − ln(ε/L)/ln 2.

4. Given a continuous real valued function with a simple zero at x0 in its domain.
Suppose, by bisection method or otherwise, we have approximated the zero x0 with
error ε. That is f (x0 + ε) = 0. Use Taylor’s Theorem to derive an improvement for
the root x0 . Hence, derive Newton’s Method for successive approximations. Hint:
If x0 is the zero with error ε, then x1 = x0 + ε1 is the next approximation where ε1
is the approximation for ε obtained by expanding f (x0 + ε) in a Taylor series and
truncating the series after two terms and solving the equation f (x0 + ε) = 0.

5. Find an iterative formula for the r-th root of N, where N is any positive number. Hint: if
x is the r-th root of N, then x^r = N or x^r − N = 0. Now set f(x) = x^r − N and
derive the Newton iteration formula. Use your iterative formula to find the cube
root of 100 to the nearest thousandth of a unit.

6. Descartes’ Rule of signs: Descartes’ rule of signs, first described by René Descartes
in his work La Geometrie, is a technique for determining the number of positive or
negative real roots of a polynomial. The rule states that if the terms of a single-
variable polynomial with real coefficients are ordered by descending variable expo-
nent, then the number of positive real roots of the polynomial is either equal to the
number of sign differences between consecutive nonzero coefficients, or less than it
by a multiple of 2. More precisely“the multiple of 2 part” is exactly the number
of imaginary roots of equation (which is always an even number). Multiple roots
of the same value are counted separately. As a corollary of the rule, the number
of negative real roots is the number of sign changes after negating the coefficients
of odd-power terms (otherwise seen as substituting the negation of the variable for
the variable itself), or less than it by a multiple of 2. For example, the polynomial
x3 + x2 − x − 1 has one sign change between the second and third terms. Therefore
it has exactly 1 positive real root. Replace x by −x gives −x3 + x2 + x − 1. This
polynomial has two sign changes, so the original polynomial has 2 or 0 negative
roots. This polynomial factors into (x + 1)2 (x − 1) so the roots are −1 (twice) and
1. Now use Descartes' rule of signs to determine the number of positive roots for
the function f (x) = 5x3 − 3x2 − 3x − 8, x ∈ R. Use Newton’s iteration to determine
the roots.

7. Calculate the roots of the following equations to the nearest hundredth of a unit.

(a) x3 = 5, (b) x2 = 2;
(c) x3 + 3x − 5 = 0, (d) ex + x − 3 = 0;
(e) x + ln x = 2, (f) x cos x = x2 ;
(g) x2 − 30x − 110 = 0, (h) x4 + 8x − 12 = 0;
(i) x = 2 + sin(x), (j) x2 + 4 sin x = 0;

8. Apply Newton’s Method to the equation x2 − a = 0, for any positive number a and
derive a well known formula for extracting square roots of numbers.

9. Show that Newton’s iteration method will fail if either f 0 or f 00 vanishes near a zero
of f .

10. Show that we can derive Newton's method for finding the zeros of a nonlinear
function f by expanding f(xn+1) = f(xn + (xn+1 − xn)) ≈ 0 in a Taylor series
and neglecting quadratic and higher order terms when xn+1 − xn is small.

Chapter 9

Numerical Approximation of
Solutions of Ordinary Differential
Equations

9.1 Introduction and Analytic considerations


An ordinary differential equation of Order n is an equation of the form:

Ly(x) = g(x) (9.1)

where L is the nth order differential operator

    L ≡ Σ_{i=0}^{n} a_i(x) dⁱ/dxⁱ        (9.2)

and y(x) an n times continuously differentiable function of its argument. In this notation,

    Ly ≡ ( Σ_{i=0}^{n} a_i(x) dⁱ/dxⁱ ) y
       = a0(x) y(x) + a1(x) dy/dx + a2(x) d²y/dx² + · · · + an(x) dⁿy/dxⁿ.        (9.3)
So that L indeed defines a differential operator of order n that acts on the continuous
function y and g : R → R is a continuous function. The Coefficients ai can, in general
depend on y and its derivatives. The equation (9.1) can be solved in explicit form for the
derivative of highest order. So an nth order ordinary differential equation will have the form

    y^{(n)}(x) = f( x, y(x), y^{(1)}(x), y^{(2)}(x), · · · , y^{(n−1)}(x) ).        (9.4)

By a solution of (9.4) or (9.1), we mean a function ϕ which is n times continuously


differentiable and which satisfies

    ϕ^{(n)}(x) = f( x, ϕ(x), ϕ^{(1)}(x), ϕ^{(2)}(x), · · · , ϕ^{(n−1)}(x) ).        (9.5)

The general solution of (9.5) will normally contain n arbitrary constants and hence there
exists an n-parameter family of solutions. If y(x0 ), y (1) (x0 ), y (2) (x0 ), · · · , y (n−1) (x0 ) are
prescribed at the point x = x0 , we have an initial value problem. We always assume
that y satisfies enough conditions to ensure that a unique solution exists. As a simple
example, we consider the problem y 0 (x) = y which has the general solution y(x) = Cex
where C is an arbitrary constant. If we prescribe the value of y at x0 say, by y(x0 ) = y0 ,
the differential equation then has the particular solution y(x) = y0 ex−x0 .
Differential equations are further classified as linear and nonlinear. An equation is
linear if the function f involves y and its derivatives linearly. Linear homogeneous equations have the
important property that if {yi, i = 1, 2, 3, · · · , n} is a set of solutions of Ly = 0, then so is
Σ_{i=1}^{n} Ci yi(x) for arbitrary constants Ci, i = 1, 2, · · · , n. The simple second order equation
y''(x) = y is easily verified to have the solutions y1 = e^x, y2 = e^{−x}, and hence, by linearity,
y(x) = C1 e^x + C2 e^{−x} is also a solution. The solutions y1 and y2 of a second order
equation are said to be linearly independent if the Wronskian, W(y1, y2), of the solutions
does not vanish. Here

    W(y1, y2) = det( y1  y1' ; y2  y2' ) = y1 y2' − y2 y1'.        (9.6)

Amongst linear equations, those with constant coefficients are particularly easy to solve.
For example, let

    Σ_{i=0}^{n} a_i y^{(i)}(x) = 0        (9.7)

be the differential equation of order n, where the a_i's are constants. If we seek a solution of
the form y(x) = Ce^{βx}, for some constant C, then direct substitution shows that

    Σ_{i=0}^{n} a_i βⁱ = 0.        (9.8)

(9.8) is called the characteristic equation, which is a polynomial of degree n and several
possibilities arise:

1. Distinct real roots: If the nth order equation (9.8) has n distinct roots, βi , i =
1, 2, 3 · · · , n, then linearity requires that
    y(x) = Σ_{i=1}^{n} Ci e^{βi x}        (9.9)

is a general solution of (9.7). If any of the distinct roots are complex, say βj =
µj +iωj , then β¯j = µj −iωj is also a root and these two roots will together contribute
the two linearly independent solutions yj1 = eµj x cos(ωj x), yj2 = eµj x sin(ωj x), cor-
responding to the complex conjugate pair of roots.

2. Multiple roots: When (9.8) has multiple roots, special techniques are available
for obtaining the linearly independent solutions. In particular, if β is a root of
(9.8) with multiplicity k ≤ n, then yi = xi−1 eβx , i = 1, 2, 3, · · · , k are k linearly
independent solutions corresponding to the multiple root.

Finally, if (9.1) is linear but not homogeneous, that is if g(x) ≠ 0, and if ξ(x) is a particular
solution, that is a function ξ : R → R such that Lξ(x) = g(x), then the general solution
of the linear nth order differential equation will be given (in the constant coefficient case
with distinct roots βi) by y(x) = ξ(x) + Σ_{i=1}^{n} Ci e^{βi x}.

Example 17 Find the solution of the differential equation

    d²y/dx² − 4 dy/dx + 3y = x,   y(0) = 4/9,   y'(0) = 7/3.
Solution: To find the particular integral ξ(x), try the solution ξ(x) = ax + b since the
right hand side is a polynomial of degree ≤ 1. Substituting gives ξ(x) = (1/3)x + 4/9. To find
the homogeneous solution, solve y''(x) − 4y'(x) + 3y = 0. The characteristic equation gives

β 2 − 4β + 3 = 0 ⇒ β = 3, β = 1

Therefore the solution y(x) is given by


    y(x) = C1 e^{3x} + C2 e^x + (1/3)x + 4/9.
To find a solution satisfying the given data, we solve

    y(0) = 4/9,  y'(0) = 7/3   ⇒   C1 = 1,  C2 = −1.

Hence the desired solution is


    y(x) = e^{3x} − e^x + (1/3)x + 4/9.
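It is a useful habit to check such closed-form answers numerically; the Python sketch below (our own illustration) evaluates the residual of the differential equation and the initial data for the solution just obtained, both of which should vanish up to rounding error.

import numpy as np

x   = np.linspace(0.0, 1.0, 11)
y   = np.exp(3*x) - np.exp(x) + x/3.0 + 4.0/9.0
yp  = 3*np.exp(3*x) - np.exp(x) + 1.0/3.0          # exact first derivative
ypp = 9*np.exp(3*x) - np.exp(x)                    # exact second derivative

print(np.max(np.abs(ypp - 4*yp + 3*y - x)))        # residual of y'' - 4y' + 3y = x
print(y[0] - 4.0/9.0, yp[0] - 7.0/3.0)             # initial conditions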
Let us examine another example in R.

Example 18 Let a and b be arbitrary real numbers with a ≥ b. Let χ : [0, ∞) → R be a


given function. Find the general solution of the first order nonlinear ordinary differential
equation

    dχ/dT − a sin(χ) + b = 0        (9.10)

and determine whether any of the solutions is physically realizable.

Solution: From the theory of ordinary differential equations, we find that χ has equilib-
rium points at
    sin(χ) = b/a,        (9.11)
which exist when |b| ≤ |a|. It is straightforward to establish that (9.10) has a general solution,
defined for any constant κ, given by

    tan(χ/2) =
      (i)   a/b + c tan( (bc/2)(κ − T) ),    b² > a²,  where c² = 1 − a²/b², c > 0;
      (ii)  ±1 + 2/( ±a(T − κ) ),            b = ±a;                                        (9.12)
      (iii) a/b + c tanh( (bc/2)(T − κ) ),   b² < a²,  where c² = a²/b² − 1, c > 0.
The solution (9.12) shows that when |b| ≤ |a|, χ will tend to one of the equilibrium points
given by (9.11) as T → ∞. In this case χ(∞) is a constant, hence cos(χ(∞)) is
determined, and the solution is realizable.
Now, despite the examples that have been given above, it is important to note that not
all differential equations have a solution, that some have infinitely many solutions, and still
others have unique solutions but cannot be solved explicitly to obtain the solution in terms
of elementary functions. For example, the differential equation dy/dx = x² + y², y(0) = y0,
cannot be solved to obtain a solution in closed form, even though we can show that this
equation has a unique solution on an interval containing the point x = 0. On the other
hand we find that

    dy/dx = √|y(x)|,   y(0) = 0   ⇒   y(x) = { 0,              0 ≤ x ≤ b;
                                               (1/4)(x − b)²,  x > b, }
for every b > 0, leading to an infinite number of solutions satisfying the initial condition,
while the differential equation

    dy/dx = { −y ln(|y(x)|),  y ≠ 0;   0,  y = 0 },   y(0) = 0   ⇒   y(x) = 0  ∀x.
In these last two examples, the functions on the right hand side are both continuous, yet
continuity alone does not settle the question of uniqueness. It is again very easy for us to establish that
the differential equation dy/dx = y^{2/3}, y(0) = 0, has a solution, for every a ≤ 0 ≤ b, defined by

    y_{a,b}(x) = { (1/27)(x − a)³,  x ≤ a;   0,  a ≤ x ≤ b;   (1/27)(x − b)³,  x ≥ b. }

This solution satisfies the initial condition; in particular, y(x) ≡ 0 and y_{0,0}(x) =
(1/27)x³ are two solutions that satisfy the initial condition but do not agree on any open
interval containing the origin 0. So though the function on the right hand side is con-
tinuous, the solution is not unique. These are Lipschitz ideas. We state the following
definition.
Definition 46 (Lipschitz Continuity) A function f : Rⁿ → Rⁿ is said to be Lipschitz
continuous if there exists a positive constant L, the Lipschitz constant, such that for every
x, y ∈ Rⁿ, ‖f(x) − f(y)‖ ≤ L‖x − y‖. If L = 1, such a map is called an isometric
embedding.

It is easy for us to prove the following theorem:

Theorem 33
(a) If f is Lipschitz continuous, then f is uniformly continuous.
(b) f is an isometric embedding ⇒ (c) f is Lipschitz continuous ⇒ (d) f is uniformly
continuous ⇒ (e) f is continuous.

Exercise: Show that the converses of the implications in (a)-(e) are in general false.
The theory assures us that if the function on the right is continuous and Lipschitz con-
tinuous in y, then we can construct a unique solution near the initial conditions according
to the following theorem
Theorem 34 (Existence theorem for ODEs.) Suppose that f : Rn × [0, ∞) → Rn is
Lipschitz continuous on the n−rectangle

R = { (x, y) ∈ R^{n+1} : x_0 ≤ x ≤ x_m,  ‖y − y_0‖ ≤ y_m }

with Lipschitz constant L; that is, for every y, z ∈ R^n, ‖f(x, y) − f(x, z)‖ ≤ L‖y − z‖. Then
there exists a continuously differentiable path (a solution to y 0 (x) = f (x, y(x)), y(x0 ) =
y 0 ), (x, y(x)) giving a unique value for each x ≥ x0 , until the curve leaves R.
Proof: See MAT409.
Thus we can only determine unique solutions locally. For every differential equation,
we can work out the rectangle R in which the unique solution exist. It is also possible
to extend the domain of validity of the theorem, by a creeping argument, so that the
solution is valid for all x on the real line (−∞, ∞). But this is not always the case. For
example, the differential equation dx/dt = 1 + x^2, x(0) = 0 has the solution x(t) = tan(t)
whose domain of validity, relative to the prescribed initial data, is {t ∈ R : −π/2 < t < π/2}
and cannot be further extended1 . We can state and prove the following result:
Corollary 9 Suppose that the conditions of Theorem 34 hold, and let x_0 ∈ R. Then
there exists a unique maximal solution y_max : (a, b) → R, where x_0 ∈ (a, b) (a may be −∞
and b may be +∞), with y_max(x_0) = y_0. Any other solution is a restriction of y_max to a
sub-interval.
It will appear that we must be able to locate the region of space R within which
a unique solution exists before starting our search for solutions. We note that from the
continuity of f and properties of the mean value theorem, for a mean value point η
between y and z, and for (x, y), (x, z) ∈ R with R ⊂ R^2, we have

|f(x, y) − f(x, z)| = | (∂f/∂y)(x, η) (y − z) |   ⇒   L = max{ |f_y(x, y)| : (x, y) ∈ R },
where fy is the partial derivative of f with respect to y. Here are some examples:
1 Such an occurrence is often called an "explosion": the solution gets to infinity in finite time.
1. For y'(x) = λy(x), λ ∈ R, L = |λ|, with x_m, y_m arbitrarily large. We can choose R such
that y_m ≥ |y_0 exp(λ(x_m − x_0)) − y_0|.

2. For y 0(x) = sin(y), L = 1 and xm , ym arbitrarily large. Since |f (x, y)| ≤ 1, we have
|y − y0 | ≤ |x − x0 |, so take ym ≥ |xm − x0 | ensures existence for all x.

3. For y 0(x) = sin(xy), L = xm , situation similar to 2 with ym ≥ xm − x0 .


4. For y'(x) = sin(y/x), x ≠ 0, L = 1/x_0, and we obtain the same situation as in 2.

5. For y'(x) = 1/(1 + y^2), f(x, y) = 1/(1 + y^2) and |∂f/∂y| = |2y/(1 + y^2)^2| ≤ 1. Thus L = 1. Again,
we have arbitrarily large x_m and y_m, with y_m ≥ x_m − x_0.

6. For y 0(x) = y 2, L = 2(|y0| + ym ), xm arbitrarily large.

In what follows we shall assume that the method of finding the rectangle wherein the
unique solution exists is known, and address the issue of constructing solutions for the
ordinary differential equations. The theory we advance should be extendable to systems
of differential equations, taking into consideration the following:

Remark 9 1. A non-autonomous first order differential equation in an open subset U
of R^n may be viewed as an autonomous first order system of equations on an open
subset V of R^{n+1} as follows: Suppose g : R × U → R^n is a given function and that

dx/dt = g(t, x),   x ∈ U,  t ∈ R.                                   (9.13)

Define y = (x, t) and h : V = U × R → R^{n+1} by h(x, t) = (g(t, x), 1), and then
consider the differential equation

dy/dt = h(y),   y ∈ V.                                              (9.14)

Then α is a solution of (9.13) if and only if β given by β(t) = (α(t), t) is a solution
of (9.14). By this remark we see that to solve the equation dx/dt = f(x, t), x(t_0) = x_0,
a non-autonomous ordinary differential equation in R, we could instead solve the
first order ordinary differential equation system in R^2 defined as follows: set y = t
so that y'(t) = 1 and construct the equivalent system

dx/dt = f(x, y),   dy/dt = 1,   x(t_0) = x_0,   y(t_0) = t_0,

an autonomous first order system in R^2.

2. A k th order autonomous ordinary differential equation on an open subset U of Rn


may be viewed as a first order autonomous differential equation on an open subset
V of Rnk . That is suppose

x(k) = g(x, x(1) , x(2) , · · · , x(k−1) ), x ∈ U (9.15)

where the k-th and highest order derivative has been expressed as a function of the
remaining (k − 1) derivatives and g : U × R^n × ··· × R^n ((k − 1) times) → R^n. Define h : V → R^{nk} by
h(y_1, y_2, ..., y_k) = (y_2, y_3, ..., y_k, g(y_1, y_2, ..., y_k)) and then consider the
first order ode

dy/dt = h(y),   y ∈ V.                                              (9.16)

Then α : I → U is a solution of (9.15) if and only if β : I → V defined by
β(t) = (α(t), α'(t), ..., α^{(k−1)}(t)) is a solution of (9.16). So for example, to solve
the equation

d^3y/dx^3 + sin(y) d^2y/dx^2 + y = x

we may choose to analyze a first order system of ordinary differential equations in
R^3 as follows: set z = y'(x), then w = z'(x) = y''(x), so that w'(x) = y'''(x) =
x − sin(y)w − y, and we have the equivalent first order system in R^3, namely

d^3y/dx^3 + sin(y) d^2y/dx^2 + y = x   ⇔   y'(x) = z,   z'(x) = w,   w'(x) = x − y − w sin(y),

a non-autonomous ordinary differential equation in R3 which may be viewed as an


autonomous ordinary differential equation in R4 as explained above.

3. The point about 1 and 2 is that for all differential equations in normal form, we
might as well just consider first order autonomous systems of ordinary differential
equations (though this is perhaps not always the most practical way to solve differential equa-
tions). Almost all differential equations can be put in normal form by the implicit
function theorem, which can be proved as another example of the application of the
contraction mapping theorem.

4. Not all differential equations can be solved in terms of elementary functions, integrals
of elementary functions or even ”special functions” of mathematical physics2

5. We can use differential equations to define functions. For example the differential
equation y 0 (x) = y(x), y(0) = 1 defines the function y(x) = ex , while y 00 (x) − y(x) =
0, y(0) = 1, y 0(0) = 0 defines the function y(x) = cosh(x).
2 Historical Note: In 1830, Galois was interested in solving polynomial equations in terms of rad-
icals (roots). By associating groups to polynomial equations, Galois proved, for example, that
one cannot solve general quintic polynomial equations by radicals. Nowadays, we know that this can be
done using elliptic functions. Later in the nineteenth century, similar ideas were applied to differential
equations (differential Galois theory): by associating a group to a differential equation, one can show that
not all differential equations can be solved using only elementary functions and their integrals,
e.g. x'(t) = x^2 − t. Nowadays we know that we can prove existence and then proceed to obtain approximate
numerical solutions to all differential equations whose solutions exist.

The last remark points to the fact that we must find ways of approximating solutions of
differential equations, especially once we have established that a unique solution
exists. Numerical methods become very handy here. To be able to follow the theory
of approximation of solutions of differential equations we need the related concepts of
difference equations and differences.

9.2 Finite differences


Definition 47 Given the set of points {(x0 , y0), (x1 , y1 ), · · · , (xn , yn )}, determined by the
relationship y = f (x) such that xi − xi−1 = h, a constant for all i, we define the following
quantities
1. First differences: ∆yi−1 = yi − yi−1 , for each i = 1, 2, · · · , n. Notice that we have
defined the first difference so that the subscript i − 1 should coincide with the second
member of the difference so that ∆yi = yi+1 − yi, for each i = 0, 1, 2, · · · , n − 1.
2. Second differences: Difference of the first difference are called second differences
and denoted ∆2 yi−1 = ∆yi − ∆yi−1 . Again the second difference is defined so that
the subscript i − 1 should coincide with the second member of the difference and so
∆2 yi = ∆yi+1 − ∆yi , i = 0, 1, 2, 3, · · · , n.
3. Higher Order differences: In like manner as above we can define higher order
differences. In general the nth order differences of a function are defined by the
formula ∆n yi−1 = ∆n−1 yi − ∆n−1 yi−1
It is possible to express the differences of any order in terms of the given values of the
function yi , i = 0, 1, 2, · · · , n by successive substitutions. For example
∆2 y0 = ∆y1 − ∆y0 = (y2 − y1 ) − (y1 − y0 ) = y2 − 2y1 + y0 .
In a similar manner,
∆3 y0 = ∆2 y1 − ∆2 y0 = y3 − 3y2 + 3y1 − y0 .
In general,

Δ^n y_0 = y_n − C(n,1) y_{n−1} + C(n,2) y_{n−2} − ··· + (−1)^n y_0 = Σ_{i=0}^{n} (−1)^i C(n,i) y_{n−i},

where C(n,i) is the binomial coefficient defined by

C(n,i) = C(n−1, i) + C(n−1, i−1) = n! / ( i!(n − i)! ).
Schematically, we may represent the successive differences of the set of values of a
function by means of a diagonal difference table as shown in Table 9.1. Each entry in
the body of the diagonal difference table is the difference of the adjacent entries above
and below in the column to the left. The entry y0 is called the leading term, and the first
terms in each column ∆i y0 , i = 1, 2, · · · , n are called the leading differences.

x      y      Δy       Δ²y       Δ³y       Δ⁴y      ···      Δⁿy
x0     y0
              Δy0
x1     y1               Δ²y0
              Δy1                 Δ³y0
x2     y2               Δ²y1                Δ⁴y0
              Δy2                 Δ³y1       ⋱
x3     y3               Δ²y2                Δ⁴y1              Δⁿy0
              Δy3                 Δ³y2       ⋮
x4     y4               Δ²y3       ⋮
              Δy4        ⋮                  Δ⁴y_{n−4}
x5     y5      ⋮                  Δ³y_{n−3}
               ⋮        Δ²y_{n−2}
              Δy_{n−1}
xn     yn

Table 9.1: Diagonal Difference Table

Example 19 Construct a difference table for the set of points (−3, −25), (−1, 1), (1, 3),
(3, 29), (5, 205) and hence find Δ⁴y_0 if the data set starts at (x_0, y_0) = (−3, −25) and ends
at (x_4, y_4) = (5, 205).

Solution: Labeling the points (x_i, y_i), i = 0, 1, 2, 3, 4 as suggested, we construct a differ-
ence table following Table 9.1 and the definition of differences. The result is Table 9.2.

x      y      Δy     Δ²y     Δ³y     Δ⁴y
−3    −25
              26
−1     1             −24
               2              48
 1     3              24              78
              26             126
 3    29             150
             176
 5   205

Table 9.2: Difference table for Example 19

Δ⁴y_0 = Σ_{i=0}^{4} (−1)^i C(4,i) y_{4−i} = y_4 − 4y_3 + 6y_2 − 4y_1 + y_0 = 78,

which answer could have been read off from the table as constructed.
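The construction of the table is mechanical, and it may help to see it spelled out in code. The following is a small computational sketch (not part of the original notes; plain Python assumed) that builds the forward differences of the data of Example 19 column by column; the final column reproduces Δ⁴y_0 = 78.

```python
def difference_table(ys):
    """Return the columns [y, first differences, second differences, ...]."""
    table = [list(ys)]
    while len(table[-1]) > 1:
        prev = table[-1]
        table.append([prev[i + 1] - prev[i] for i in range(len(prev) - 1)])
    return table

ys = [-25, 1, 3, 29, 205]        # the y-values of Example 19 at x = -3, -1, 1, 3, 5
for k, col in enumerate(difference_table(ys)):
    print(f"order {k} differences: {col}")
# The last column printed is [78], i.e. the leading difference Delta^4 y_0 = 78.
```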

Now, for an arbitrary function f and a positive step size h we define the following
difference formulae for the continuous function f .
Definition 48 1. Forward difference: ∆+,h f (x) = f (x + h) − f (x)

2. Central difference: ∆h f (x) = f (x + h) − f (x − h)

3. backward difference: ∆−,h f (x) = f (x) − f (x − h)


The difference operators shown above are linear and have the following properties:
Theorem 35
Δ_{+,h}( Σ_{i=1}^{n} f_i(x) ) = Σ_{i=1}^{n} Δ_{+,h} f_i(x),   distributive property
Δ_{+,h}( c f(x) ) = c Δ_{+,h} f(x),   for every constant c
Δ^m_{+,h}( Δ^n_{+,h} f(x) ) = Δ^{m+n}_{+,h} f(x),   exponentiation rule

Proof: Exercise.
It is important to note that these rules only concern difference relationships but do not
enable us to actually find differences of functions. However, we can prove some interesting
results as shown below:
Theorem 36 Consider the forward difference Δ_{+,h} f(x) = f(x + h) − f(x). Then

1. If f is a constant function, then Δ_{+,1} f(x) = 0.

2. If f(x) = x^n, then Δ_{+,1} f(x) = Σ_{i=1}^{n} C(n,i) x^{n−i}, a polynomial of degree (n − 1) with
leading term n x^{n−1}.

3. If f(x) = x^n, then Δ^n_{+,1} (c f(x)) = c n! for any constant c.

4. If f(x) = x^n, then Δ^{n+1}_{+,1} (c f(x)) = 0 for any constant c.

Proof: Exercise

9.2.1 Exercises
1. Construct difference tables for the following data set

(a) (2,0), (3,1), (4,8), (5,21)


(b) (0,5), (1,-3), (2,1), (3,8), (4,14)
(c) (-3,2), (-1,12), (1,21), (3,33)
(d) (2.0,-9), (2.5,0), (3.0,3), (3.5,12),(4.0,29),(4.5,96)
(e) (0,3), (3,9), (6,15), (9,21)

2. Construct a difference table for the function y = x3 − 2x2 + 7 with x0 = 0, h = 0.5


and consider 10 consecutive points.

3. Find the next term in each of the following sequences by extending their difference tables
(a) 1,1,2,3,5,8,13,21,34
(b) 1,4,10,20,35,56
(c) -1,0,1,8,27,64
(d) 1,,8,17,32,57,100,177,320
(e) 2,2,14,74,64
(f) 14,23,34,42,59

what is the assumption made in each case?

4. Given the points (2,1), (3,6), (4,13), (5,22) and (6,33). Find the point (0,y). State
the assumption under which the value is determined.

5. Prove the formula for ∆n y0 by mathematical induction.

6. State and prove corresponding results for the forward and central differences defined
above.

7. Show how you can use the difference tables for ∆+,h to locate errors in data values

8. Show that for any polynomial of degree n, ∆n+,1 pn (x0 ) is constant and find the value
of the constant.

For an arbitrary step size h it is easy for us to see from calculus that if the function under
consideration is differentiable, then

lim_{h→0} Δ_{+,h} f(x) / h  =  lim_{h→0} Δ_h f(x) / (2h)  =  lim_{h→0} Δ_{−,h} f(x) / h  =  f'(x).        (9.17)
It will appear therefore that we can approximate the value of the derivative of a differ-
entiable function at the point x by any one of the differences provided we consider the
correct quotient. Thus we have

f'(x) ≈ Δ_{+,h} f(x) / h                                            (9.18)
f'(x) ≈ Δ_h f(x) / (2h)                                             (9.19)
f'(x) ≈ Δ_{−,h} f(x) / h                                            (9.20)
and the question that readily comes to mind is to decide which of these is most suitable
for use in approximating the derivative of the function at the point x. The answer is
given to us from Taylor’s Formula for functions of a real variable: We calculate the error
incurred by approximating the derivative with finite difference quotients as follows: Let
Ef , Ec and Eb be the error incurred in approximating f 0 (x) by one of (9.18)-(9.20). Then

from Taylor's mean value theorem, and for a mean value point η between x − h and x + h,
we have

|E_f| = | f'(x) − Δ_{+,h} f(x)/h |   = (1/2) h |f''(η)|              (9.21)
|E_c| = | f'(x) − Δ_h f(x)/(2h) |    = (1/6) h^2 |f'''(η)|           (9.22)
|E_b| = | f'(x) − Δ_{−,h} f(x)/h |   = (1/2) h |f''(η)|              (9.23)
Thus it becomes clear that if the second derivatives are bounded as is often the case,
then the central difference gives a better approximation than the forward or backward
difference.
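These error estimates can also be checked numerically. The following sketch (illustration only; the test function f(x) = sin x is an assumption, not part of the notes) compares the forward and central difference quotients of (9.18) and (9.19) at x = 1, where the exact derivative is cos 1.

```python
import math

def forward(f, x, h):
    return (f(x + h) - f(x)) / h            # the quotient of (9.18)

def central(f, x, h):
    return (f(x + h) - f(x - h)) / (2 * h)  # the quotient of (9.19)

x, exact = 1.0, math.cos(1.0)               # f(x) = sin x, so f'(1) = cos 1
for h in (0.1, 0.05, 0.025):
    e_f = abs(forward(math.sin, x, h) - exact)
    e_c = abs(central(math.sin, x, h) - exact)
    print(f"h = {h:6.3f}   |E_f| = {e_f:.2e}   |E_c| = {e_c:.2e}")
# Halving h roughly halves |E_f| but quarters |E_c|, in line with (9.21) and (9.22).
```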
Formulas for obtaining higher derivatives can be obtained by applying higher order
difference relations. For example, we may find f''(x) as follows:

f''(x) ≈ Δ_{+,h} f'(x) / h
       = (1/h) [ Δ_{+,h} f(x + h)/h − Δ_{+,h} f(x)/h ]
       = (1/h) [ ( f(x + 2h) − f(x + h) )/h − ( f(x + h) − f(x) )/h ]
       = ( f(x + 2h) − 2f(x + h) + f(x) ) / h^2.                    (9.24)

Similarly, the central difference gives

f''(x) ≈ ( f(x + 2h) − 2f(x) + f(x − 2h) ) / (4h^2)                 (9.25)

and the backward difference formula yields

f''(x) ≈ ( f(x) − 2f(x − h) + f(x − 2h) ) / h^2,                    (9.26)

and the derivatives soon require that we evaluate values of the function at the points x + 2h
and x − 2h respectively. For moderate values of h these function values are probably
not known and the domain of validity of the finite difference approximations is reduced.
Based on the error estimates for the first derivative, it would be better to use the central
difference to approximate the derivative and also aim at reducing the error margin by
applying the difference formula with step (1/2)h instead of h, as follows:

f''(x) ≈ Δ_{h/2} f'(x) / h
       = (1/h) [ Δ_{h/2} f(x + h/2)/h − Δ_{h/2} f(x − h/2)/h ]
       = ( f(x + h) − 2f(x) + f(x − h) ) / h^2.                     (9.27)
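A similar numerical check (again with the assumed test function f(x) = sin x, so that the exact value is f''(1) = −sin 1) compares the second-derivative formulas (9.24), (9.25) and (9.27).

```python
import math

f, x, exact = math.sin, 1.0, -math.sin(1.0)       # exact f''(1) = -sin 1
for h in (0.1, 0.05):
    fwd  = (f(x + 2 * h) - 2 * f(x + h) + f(x)) / h ** 2            # (9.24)
    ctr  = (f(x + 2 * h) - 2 * f(x) + f(x - 2 * h)) / (4 * h ** 2)  # (9.25)
    half = (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2                # (9.27)
    print(f"h = {h}:  (9.24) error {abs(fwd - exact):.2e}, "
          f"(9.25) error {abs(ctr - exact):.2e}, (9.27) error {abs(half - exact):.2e}")
# (9.24) is only first order accurate; (9.25) and (9.27) are second order, and
# (9.27) has the advantage of using only the points x - h, x and x + h.
```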

9.2.2 Application
Example 20 Suppose we wish to solve the differential equation
y''(x) + y(x) = x,   y(0) = 1/2 = y(1).                             (9.28)
Find the exact solution and then use finite differences to approximate the solution. Com-
pare.

This is an example of a simple non-homogeneous second order ordinary differential equation
with constant coefficients and it is easy to verify, using methods of solution of odes of
this type, that the exact solution is

y(x) = x + (1/2) cos(x) − (1/2) cot(1/2) sin(x),                    (9.29)
where we have used the given boundary conditions in the solution. In this case, it is easy
to work out the exact solution so that if there was any need to use the functional values
of y, we will simply use formula (9.29). We attempt to approximate the solution using
finite differences. Before we do this we make a few remarks

1. The function y given in (9.29) is uniquely determined by the specifications of the


differential equation (9.28).

2. Prior to the working out of the formula (9.29), the only information we have about
y is the differential equation and the values that y has at the boundary points x = 0
and x = 1.

3. According to the problem the data at x = 1 and at x = 0 is linked continuously in


the open interval (0, 1), through the prescriptions of the derivatives of y as indicated.

4. A numerical approximation objective would be to approximate this continuous func-


tion at every point in the interval (0, 1). However, since the number of points in the
interval (0, 1) is infinite, any such numerical approximation can only be discrete and
finite to be realizable.

Now, consider a partition of the interval [0, 1] into n + 1 distinct points x_0, x_1, x_2, ..., x_n,
where x_i = i/n, i = 0, 1, 2, 3, ..., n. Then consider the set {(x_i, y(x_i)), i = 0, 1, 2, ..., n},
where each y(x_i) is the value of y at the point x_i for each i, and satisfies the given
differential equation at the point x = x_i. Let y_i ≈ y(x_i); that is, we take y_i to denote
the approximate value of y(x_i). It is clear that each y_i will be in error by a certain
amount because of the different sources of error that could arise, including rounding errors,
machine representation errors, etc. If we already knew the values of y at each x_i, we could
produce a difference table and use it to investigate sources of error and predict other
values in the vicinity. But as it stands we do not know the values of y at any of the
interior points, beyond the relationship between y and its derivatives as prescribed by the
differential equation. We also know that y(0) = y(1) = 1/2, and that for each i

y 00 (xi ) + y(xi) = xi , i = 1, 2, 3, · · · , n − 1, y(x0) = y(xn ) = 1/2. (9.30)

Let us approximate the second derivative with finite differences so that, using the con-
vention that yi ≈ y(xi ), equation (9.30) transforms to the discrete set of equations
( y_{i+1} − 2y_i + y_{i−1} ) / h^2 + y_i = x_i,   i = 1, 2, 3, ..., n − 1,   y_0 = y_n = 1/2.        (9.31)
This can be rearranged to give
y_{i+1} − (2 − h^2) y_i + y_{i−1} = h^2 x_i,   i = 1, 2, 3, ..., n − 1,   y_0 = y_n = 1/2,           (9.32)
an example of a difference equation in y_i. The process of transforming the differential
equation (using finite differences or other methods) into a system of discrete difference
equations is called discretisation. The system of equations in y_i, i = 1, 2, 3, ..., n − 1 as
defined by (9.32) is linear and can be solved by any method for solving linear systems of
the form Ax = b, where A is an (n − 1) × (n − 1) matrix and b is a known vector, for the solution
x. Here the matrix A and vector b are given by
   
A =
[ −h̃    1    0    0   ···   0 ]
[   1  −h̃    1    0   ···   0 ]
[   0    1  −h̃    1   ···   0 ]
[   ⋮         ⋱    ⋱    ⋱   ⋮ ]
[   0   ···   1  −h̃    1      ]
[   0   ···   0    1  −h̃      ],        b = ( x̃_1 − 1/2,  x̃_2,  x̃_3,  ...,  x̃_{n−2},  x̃_{n−1} − 1/2 )^T,        (9.33)

where h̃ = 2 − h^2, x̃_i = h^2 x_i, h = 1/n, and x_i = ih. The system of equations is tri-diagonal
and this special structure can be exploited in the solution method. It is easy to show
that the discrete system has a solution since the coefficient matrix A is nonsingular. We
solve the system for n = 5 and n = 10. The results are shown in Figure 9.1. Evidently

[Figure 9.1 here]

Figure 9.1: Graphs for Example 20. The smooth curve is the exact solution, the dashed
line is the curve for the approximate solution when n = 5, and the dotted line is the curve
for the approximate solution when n = 10. The approximation handles the points where
the slope of the solution changes rapidly differently.

the more points we take within the interval [0, 1], the better the approximation will be.

General methods for handling solutions of systems of linear equations are treated in a
different course. We have included this example here as an illustration of the method of
differences, which incidentally also introduces the notion of difference equations.
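As an illustration of the discretisation just described, the following sketch (assuming numpy is available; not part of the original notes) assembles and solves the linear system (9.32)-(9.33) and compares the result with the exact solution (9.29).

```python
import numpy as np

def solve_bvp(n):
    h = 1.0 / n
    x = np.linspace(0.0, 1.0, n + 1)
    m = n - 1                                # number of interior unknowns
    A = np.zeros((m, m))
    b = h ** 2 * x[1:n]                      # right hand side h^2 * x_i
    for i in range(m):
        A[i, i] = -(2.0 - h ** 2)            # the diagonal entry -h~ of (9.33)
        if i > 0:
            A[i, i - 1] = 1.0
        if i < m - 1:
            A[i, i + 1] = 1.0
    b[0] -= 0.5                              # boundary value y_0 = 1/2
    b[-1] -= 0.5                             # boundary value y_n = 1/2
    y = np.empty(n + 1)
    y[0] = y[-1] = 0.5
    y[1:n] = np.linalg.solve(A, b)           # a dedicated tridiagonal solver would also do
    return x, y

exact = lambda x: x + 0.5 * np.cos(x) - 0.5 * np.sin(x) / np.tan(0.5)
for n in (5, 10):
    x, y = solve_bvp(n)
    print(f"n = {n:2d}   max error = {np.max(np.abs(y - exact(x))):.2e}")
```

As expected, doubling the number of interior points reduces the maximum error, consistent with the second order accuracy of the central difference (9.27).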
It is possible to use higher order differences to approximate derivatives of lower order.
For example, suppose we wish to solve the problem y'(x) = f(x, y), y(x_0) = y_0. Then it is
easy to verify (using Taylor series expansions) that the first derivative can be approxi-
mated, at the point x_n, by a third order difference relation

y'(x_n) ≈ ( 2y(x_n + h) + 3y(x_n) − 6y(x_n − h) + y(x_n − 2h) ) / (6h),                  (9.34)

so that, using the convention that x_n = x_{n−1} + h and that y_n ≈ y(x_n), we obtain
the scheme 2y_{n+1} + 3y_n − 6y_{n−1} + y_{n−2} = 6h f(x_n, y_n). The data we have has only one
starting value, namely y(x_0) = y_0, which is exact as the given information. In this case,
we need more starting data than is specified by the given ordinary differential equation.
If we apply this scheme to the equation

y 0 (x) = 0, y(0) = 1, (9.35)

and suppose we are given as starting data

y0 = 1 + ε, y1 = 1 − ε, y2 = 1 + ε. (9.36)

where ε  1 may be seen as the small error introduced in estimating the initial values.
With the starting values, we can then obtain values of yn for n = 3, 4, · · · , easily. For
example, y3 = 1 − 5ε, y4 = 1 + 11ε, y5 = 1 − 32ε, · · · . Clearly yn is oscillating around
the value 1 and |yn | is increasing, even though the equation to be solved (9.35) has the
constant solution y = 1. The equation 2y_{n+1} + 3y_n − 6y_{n−1} + y_{n−2} = 0 is an example
of a linear difference equation. That is an equation defined on some interval of integers
and their differences. We must learn properties of difference equations and understand
when a difference equation can be useful for solving given problems. Evidently the scheme
provided in this example is not suitable to be used as solver for the indicated equation,
or any first order differential equation for that matter, since small errors in the solution
propagate and magnify. And there will always be errors in approximations.
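The growth of the perturbation can be reproduced directly. The sketch below (illustration only) iterates the scheme for y' = 0 with the perturbed starting values (9.36).

```python
eps = 1.0e-6
y = [1 + eps, 1 - eps, 1 + eps]          # the perturbed starting values (9.36)
for n in range(2, 12):
    # 2 y_{n+1} + 3 y_n - 6 y_{n-1} + y_{n-2} = 6 h f = 0 here, so
    y.append((-3 * y[n] + 6 * y[n - 1] - y[n - 2]) / 2)
for n, yn in enumerate(y):
    print(f"y_{n:<2d} = {yn:.10f}")
# y_3 = 1 - 5*eps, y_4 = 1 + 11*eps, y_5 = 1 - 32*eps, ...: the oscillation grows
# even though the exact solution of (9.35) is the constant y = 1.
```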

9.2.3 Exercises
Use finite difference approximations of the derivative to solve the following problems

1. y 00 (x) + 2y 0(x) + y = 0, y(0) = y(1) = 1

2. y 00 (x) + 2y 0(x) + y = 0, y(0) = 1/2, y(1) = 1

3. y''(x) + xy = x, y(0) = 0, y'(1) = 1

4. y 00 (x) + sin(y(x)) = 0, y(0) = 0, y 0(0) = 0.2

5. y 00 (x) + xy 0 (x) = 0, y(−1) = 1, y(1) = 0.

6. y 00 (x) − xy 0 (x) = 0, y(−1) = 1, y(1) = 0.
7. y 00 (x) + xy 0 (x) − xy = 0, y(0) = 0, y(1) = e.
8. Consider the boundary value problems
(a) y 00(x) + y(x) = sin(πx), x ∈ (0, 1), y(0) = 1, y(1) = 1
(b) y 00(x) + π 2 y(x) = sin(πx), x ∈ (0, 1), y(0) = 1, y(1) = 1.
Suppose in each case that an approximate solution to this problem is computed
using schemes:
(a) Δ²_{h/2} y(x_j) / h² + y(x_j) = sin(πx_j),   j = 1, 2, ..., N − 1,

(b) Δ²_{h/2} y(x_j) / h² + π² y(x_j) = sin(πx_j),   j = 1, 2, ..., N − 1,

where y(x_0) = y(x_N) = 1, h = 1/N, x_j = jh, j = 0, 1, 2, ..., N, and N is an integer
greater than 1.
(a) Identify the amount of error committed in approximating the second derivative
with the central differences and estimate its value.
(b) Let N = 5, and yj ≈ y(xj ). Find the values of the approximations yj , j =
1, 2, 3, 4 and compare the approximations with the exact solution which you
should first find in each case.

9.3 Difference Equations


Definition 49 1. A difference equation of order N is a relation between the differ-
ences
y_n = Δ⁰y_n, Δ¹y_n, Δ²y_n, ..., Δᴺy_n
of the members of a sequence. That is,

Δᴺ y_n = f( n, y_n, Δy_n, ..., Δ^{N−1} y_n ).                        (9.37)

2. A solution of such a difference equation is a sequence of numbers ym+k , k = 1, 2, · · ·


such that (9.37) is satisfied for n = m + k, k = 1, 2, 3, · · · .
Hence, whereas a differential equation involves functions defined on some interval of real
numbers and their derivatives, a difference equation involves functions defined in some
”interval” of integers and their differences.
If (9.37) is linear so that the right hand side depends linearly on yn , · · · , ∆N −1 yn , then
it is possible and customary to write (9.37) explicitly in terms of the yj ’s. A general
difference equation of order N may be written as
y_{n+N} + a_{n,N−1} y_{n+N−1} + a_{n,N−2} y_{n+N−2} + ··· + a_{n,0} y_n = b_n,           (9.38)

and may be viewed as a (finite or infinite) system of linear equations whose coefficient
matrix is a banded matrix of band width N + 1.
Example 21 The following are examples of difference equations and their solutions:

1. y_{n+1} − y_n = 1, all n :   solution y_n = n + c.

2. y_{n+1} − y_n = n, all n :   solution y_n = n(n − 1)/2 + c.

3. y_{n+1} − (n + 1) y_n = 0, all n > 0 :   solution y_n = c n!.

4. y_{n+2} − 2 cos γ y_{n+1} + y_n = 0, all n :   solution y_n = c cos(γn).

Consider the linear difference equation of order N with constant coefficients and
zero right hand side (such an equation is also called homogeneous; a non-
homogeneous linear difference equation is one that has a non-zero right hand side):

y_{n+N} + a_{n,N−1} y_{n+N−1} + a_{n,N−2} y_{n+N−2} + ··· + a_{n,0} y_n = 0.             (9.39)

Seek a solution of the form

y_n ∝ β^n,  all n.                                                  (9.40)

Substituting in (9.39) we have

β^{n+N} + a_{n,N−1} β^{n+N−1} + a_{n,N−2} β^{n+N−2} + ··· + a_{n,0} β^n = 0
   ⇒   ρ(β) = β^N + a_{n,N−1} β^{N−1} + a_{n,N−2} β^{N−2} + ··· + a_{n,0} = 0,           (9.41)
a polynomial of degree N, called the characteristic polynomial. We have the following
possibilities:
The zeros of ρ(β) are distinct: In this case, (9.39) has the general solution
y_n = Σ_{i=1}^{N} c_i β_i^n,  all n.
β_i is a zero of ρ with multiplicity m: In this case, (9.41) takes the form
ρ(β) = (β − β_i)^m q(β),  q(β_i) ≠ 0.
Then the corresponding solutions are β_i^n, n β_i^n, n² β_i^n, ..., n^{m−1} β_i^n, and hence a linear
combination of these is also a solution of the linear difference equation.
Example 22 (i) For the third order linear difference equation y_{n+3} − 2y_{n+2} − y_{n+1} + 2y_n =
0, the characteristic polynomial is ρ(β) = β³ − 2β² − β + 2 = 0, which yields three
roots β = 1, −1, 2. Hence the general solution is y_n = c_1(1)^n + c_2(−1)^n + c_3(2^n) =
c_1 + c_2(−1)^n + c_3 2^n, containing three arbitrary constants which can be settled if three
initial values of y_n are known, e.g. if y_0 = 0, y_1 = 1, y_2 = 1. Then we have the three
equations c_1 + c_2 + c_3 = 0, c_1 − c_2 + 2c_3 = 1, and c_1 + c_2 + 4c_3 = 1, which then yield
c_1 = 0, c_2 = −1/3, and c_3 = 1/3. Then y_n = (1/3)((−1)^{n+1} + 2^n), for all n.
(ii) For the third order linear difference equation yn+3 − 6yn+2 + 12yn+1 − 8yn = 0,
The characteristic polynomial is ρ(β) = β 3 − 6β 2 + 12β − 8 = 0 which factorizes
to (β − 2)3 = 0. Hence, since β = 2 is a repeated zero, the general solution is
yn = c1 (2)n +c2 n(2)n +c3 n2 (2n ), n = 0, 1, 2, · · · ; containing three arbitrary constants
which can be settled if three initial values of yn are known.

(iii) For the second order linear difference equation y_{n+2} − 2y_{n+1} + 2y_n = 0, the charac-
teristic polynomial is ρ(β) = β² − 2β + 2 = 0, which yields the complex zeros β = 1 ± i.
Hence, since β is complex, the general solution is y_n = c_1(1 + i)^n + c_2(1 − i)^n, n =
0, 1, 2, ..., which may be simplified to (√2)^n ( C_1 cos(nθ) + C_2 sin(nθ) ), θ = π/4.

(iv) For the third order linear difference equation y_{n+3} − 5y_{n+2} + 8y_{n+1} − 4y_n = 0,
the characteristic polynomial is ρ(β) = β³ − 5β² + 8β − 4 = 0, which factorizes to
(β − 1)(β − 2)² = 0. Hence, since ρ has a repeated zero as well, the general solution
is y_n = 2^n(c_1 + n c_2) + c_3, n = 0, 1, 2, ...
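The closed form obtained in (i) is easily verified numerically; the following sketch (illustration only) iterates the recurrence and compares it with the formula y_n = ((−1)^{n+1} + 2^n)/3.

```python
y = [0, 1, 1]                                         # y_0, y_1, y_2
for n in range(3, 11):
    y.append(2 * y[n - 1] + y[n - 2] - 2 * y[n - 3])  # y_{n+3} = 2y_{n+2} + y_{n+1} - 2y_n
closed = [((-1) ** (n + 1) + 2 ** n) // 3 for n in range(11)]
print("recurrence :", y)
print("closed form:", closed)                         # the two lists agree
```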

Example 23 It is required to solve the second order equation y''(x) − y(x) = 1. Show
that if x_n = x_{n−1} + h for some step size h, then the general solution of the difference
equation resulting from the finite difference approximation can be expressed in the form

y_n = C_1 ( 1 + h + h²/2 + O(h³) )^n + C_2 ( 1 − h + h²/2 + O(h³) )^n − 1,

where C_1 and C_2 are arbitrary constants that can be determined once data is provided for
the differential equation.

Before solving this example we say something about the notation O(h3 ).

The big-oh and small-oh notation


The o and O notations are used frequently in the theory of asymptotic approximation
and have the following meanings:

1. A parameter f(x, h) is said to be of order big-oh of g(h) as x approaches its limit
(finite or infinite), and we write f(x, h) = O(g(h)), in case |f(x, h)| ≤ K(x)|g(h)| as
x approaches its limit. In terms of limits: where x → 0, if lim_{x→0} f(x, h)/g(h) = K(x)
with K independent of h, or, where x → ∞, if
lim_{x→∞} f(x, h)/g(h) = K(x) with K independent of h. More often we say
that f(x, h) = p(x, h) + O(g(h)) as x approaches its limit in case |f(x, h) − p(x, h)| ≤
K(x)|g(h)| as x approaches its limit. In other words, f behaves like p when x is
near its limit.

2. The parameter f(x, h) is said to be of order small-oh of g(x, h) as x approaches its
limit, and we write f(x, h) = o(g(x, h)), if lim_{x→ξ} f(x, h)/g(x, h) = 0, where |ξ| can be finite
or infinite. In this case we say that f is of higher order than g as x → ξ. Some
examples:

(a) The sequences 1/n, 10000/n, 10/n − 40/n² − e^{−n}, and 1/n² are all
O(1/n) as n → ∞.

(b) The sequences 1/n² and 1/(n log(n)) are o(1/n) as n → ∞.

(c) If f(x) = α + O(1) as x → ∞, then lim_{x→∞} ( f(x) − α )/1 = constant.
(d) If f(x) = α + o(1) as x → ∞, then lim_{x→∞} ( f(x) − α )/1 = 0.
(e) The function f(x) = x + 1/x satisfies f(x) = x + O(1/x) as x → ∞, since lim_{x→∞} ( f(x) − x )/(1/x) = 1.
(f) The function f(x) = x + 1/x satisfies f(x) = x + o(x) as x → ∞, since lim_{x→∞} ( f(x) − x )/x = 0, and so
saying that f(x) − x is o(x) is just a fancy way of saying that f grows like x as x → ∞.

The big-oh and small-oh notations appear customarily only on the right side of an equa-
tion and serve the purpose of describing the essential feature of an error term without
bothering about multiplying constants or other details. They are used when we want to
describe limiting processes as a particular variable approaches its limiting state. We could
hear a mathematician say for example that f (x) = o(g(x)) as x → ∞ or f (x) = O(g(x))
as x → ∞. In general, the context of the direction of the limit will be known in the
particular instance else it is given the interpretation that we have given here. Exercise:
Read Section 1.6 of Reference [3] and answer the problems at the end of that section.
Solution to example 23:

1. We use the central difference approximation to approximate the second derivative


and the convention to write yn ≈ y(xn ) to have
( y_{n+1} − 2y_n + y_{n−1} ) / h² − y_n = 1   ⇒   y_{n+2} − (2 + h²) y_{n+1} + y_n = h²,        (9.42)
a difference equation of order 2.

2. A particular solution of (9.42), obtained by setting y_n^p = C in (9.42), is found to be
y_n^p = −1.
3. The characteristic equation of the homogeneous equation of (9.42) is β² − (2 + h²)β + 1 = 0.
By the quadratic formula we have β_{1,2} = 1 + h²/2 ± h sqrt(1 + h²/4). On
expanding sqrt(1 + t) in a Taylor series around t = 0 and substituting h²/4 for t, we
obtain β_{1,2} = 1 ± h + h²/2 + O(h³). Hence the general solution of the homogeneous
equation is y_n^h = C_1 β_1^n + C_2 β_2^n.

4. The solution of (9.42) is therefore yn = ynh + ynp which establishes the sought after
solution.

9.3.1 Exercises
1. Find the general solutions of the difference equations

(a) yn+1 − 3yn = 5,
(b) yn+2 − 4yn+1 + 4yn = n
(c) yn+2 + 2yn+1 + 2yn = 0
2. Find the solution of the initial-value difference equations
(a) yn+2 − 4yn+1 + 3yn = 2n , y0 = 0, y1 = 1
(b) y_{n+2} − y_{n+1} − y_n = 0
3. show that the general solution of the difference equation yn+2 + 4hyn+1 − yn = 2h
where h is a constant, can be expressed in the form
y_n = C_1 ( 1 − 2h + O(h²) )^n + C_2 (−1)^n ( 1 + 2h + O(h²) )^n + 1/2.

4. Show that if y0 = 0 and y1 = x, then the nth term yn = yn (x) of the solution of the
difference equation yn+2 − 2xyn+1 + yn = 0 is a polynomial3 of degree n in x with
leading coefficient 2n−1 .

9.3.2 Discretisation:
We consider the first order scalar equation
y'(x) = dy/dx = f(x, y),   y(x_0) = y_0,                            (9.43)

where the function f : R² → R is continuous and Lipschitz continuous in y. There are
many ways of generating discrete approximations to the initial value problem (9.43). We
discuss a few here.
1. On a uniform mesh of points, with interval h, such that xn = x0 + nh, n =
0, 1, 2, 3, · · · , we can approximate the derivative using finite differences as we have
done above. For example introducing yn as the approximation to y(xn ), we can
choose
( y'(x_n) )_approx = ( y_{n+1} − y_n ) / h   or   ( y_{n+1} − y_{n−1} ) / (2h)   or   ( 3y_n − 4y_{n−1} + y_{n−2} ) / (2h).
Then setting (y 0(xn ))approx = f (xn , yn ), we obtain a recurrence relation for (yn )n≥1 .
2. On any mesh of points xi , i = 0, 1, 2, · · · , n with x0 < x1 < x2 < · · · < xn , we can
use the integral form
∫_{x_n}^{x_{n+1}} y'(x) dx = ∫_{x_n}^{x_{n+1}} f(x, y(x)) dx   ⇒   y_{n+1} − y_n = ∫_{x_n}^{x_{n+1}} f(x, y(x)) dx.

Then by approximating the integral on the right by some suitable quadrature rule,
several schemes can be generated. For example, if xn+1 = xn + h,
3
Note: The polynomial yn of this difference equations is precisely the Chebyshev polynomial of degree
k defined by Tk (cos(θ)) = cos(kθ) which has the property that T0 (x) = 1,T1 (x) = x.

(a) Trapezoidal rule: gives ∫_{x_n}^{x_{n+1}} f(x, y(x)) dx ≈ (1/2) h ( f(x_n, y_n) + f(x_{n+1}, y_{n+1}) ).

(b) Simpson's rule: gives ∫_{x_n}^{x_{n+1}} f(x, y(x)) dx ≈ (1/6) h ( f(x_n, y_n) + 4 f(x_{n+1/2}, y_{n+1/2}) + f(x_{n+1}, y_{n+1}) ).

(c) Rectangle rule: with the value fixed at the left end point, gives ∫_{x_n}^{x_{n+1}} f(x, y(x)) dx ≈ h f(x_n, y_n);
any other point in the interval [x_n, x_n + h] could also be used to evaluate the right hand side.

3. We can approximate y by a polynomial, over some short interval of x, (eg Tay-


lor’s expansion) and use the differential equation (and its derivatives perhaps) to
determine the polynomial.

Irrespective of the manner of derivation of a method, the minimum we require of any
method is that it be well defined and converge to the correct solution as h (the mesh
size) tends to zero. The net effect is the generation of a sequence (y_n)_{n≥1} of values which
are approximations to the exact values (y(x_n))_{n≥1}. So we shall often demand that

max_{1≤n≤N} | y(x_n) − y_n | → 0   as h → 0,                         (9.44)

a minimal requirement that will not distinguish between methods.


It is often necessary to know how fast each method converges and choose the fastest
or most efficient. The choice of method may depend on the problem being solved.

Example 24 (Stiff problem) Consider the initial value problem


y'(x) = ±100y − 1/(1 + x)² ∓ 100/(1 + x),   y(0) = 1 + ε,            (9.45)

where ε > 0 may be seen as the error incurred in determining the initial data at time
t = 0 for both problems. Discuss the behaviours of its solutions.

Solution: The solutions are


y(x) = 1/(1 + x) + ε e^{±100x}.                                      (9.46)
We note that the solution with the negative sign in the exponent damps out all errors
introduced in the data, and the solution is well behaved as x increases. On the other hand,
for x in the positive direction, the solution with the positive exponential sign grows
exponentially and very fast. The only instance in which the solution will not grow
exponentially is if ε = 0. So small errors in the data will lead to large deviations in the
solution, and eventually destroy the true solution, which in this case is 1/(1 + x). Such a
problem is called ill-conditioned. It is clear that any method will have difficulty dealing
with problems like this. The difficulty lies in the problem and not in the method;
problems of this nature are called stiff problems.
Hence the choice of the method is also, sometimes, determined by the problem at
hand.

9.4 One step Methods for First order scalar equation
In this section we consider the numerical approximation of the initial value problem
y'(x) = dy/dx = f(x, y(x)),   y(x_0) = y_0,   x ∈ [x_0, x_m],        (9.47)
for some sufficiently large real number xm . The interval [x0 , xm ] is partitioned into points
xi ∈ P , i = 1, 2, · · · where

P = {x0 , x1 , x2 , · · · , xm }, with ||P || = max{xi − xi−1 , i = 1, 2, · · · , m} = h (9.48)

is a partition of [x0 , xm ]. In practice, xm is chosen to lie in the interval wherein the


differential equation is known to have a unique solution. We shall write yi to represent
the approximation to y(xi), in which case the error in the approximation may be written
as ei = y(xi) − yi , which may be regarded as the error incurred at the point xi by this
approximation. Let xn = xn−1 + h, xn+1 = xn + h, etc, where it is possible to consider a
situation where xn+1 − xn is not necessarily constant for each n. Then yn+1 ≈ y(xn + h)
is the approximation to y at xn+1 , given that y(xn ) is known.

Definition 50 (One step Method) Methods can be explicit or implicit:

1. A general explicit one step method for solving the initial value problem (9.47) is an
iteration of the form

yn+1 = yn + hϕ(xn , yn ; h), n = 0, 1, 2, 3 · · · (9.49)

where ϕ : R3 → R is continuous. In this formulation, ϕ(·, ·; ·), is called the iteration


function and tells us how to proceed from an estimate yn to an estimate yn+1 .

2. A general implicit one step method for solving the initial value problem (9.47) is an
iteration of the form

yn+1 = yn + hϕ(xn , xn+1 , yn , yn+1 ; h), n = 0, 1, 2, 3 · · · (9.50)

which gives a relationship between the estimates at xn and those at xn+1 . Implicit
methods do not give an explicit formula on how to proceed from the estimate y_n to
the estimate y_{n+1}. However, for some special forms of the function f an inversion
is possible that allows us to rewrite an implicit scheme in explicit form.

9.4.1 Examples of one step methods


1. Euler’s Method: In Euler’s method, ϕ(xn , yn ; h) = f (xn , yn ) so that Euler’s
method takes the form

yn+1 = yn + hf (xn , yn ) (9.51)

2. Taylor Series method (of Order k): From the differential equation, we not
only have y 0 = f (x, y), but if f (·, ·) is sufficiently smooth, we can, by repeated
differentiation, generate higher order derivatives. Consider f : R2 → R and let fx
and fy denote the first partial derivatives of f with respect to x and y respectively,
and fxx , fxy , fyx , fyy denote the second order partial derivatives, etc, then we can
calculate the total derivative of f when x and y are viewed as independent variables,
in the usual manner to have
df = f_x dx + f_y dy   ⇒   df/dx = d²y/dx² = f_x + f_y (dy/dx) = f_x + f_y f.            (9.52)
Similarly,
d²f/dx² = d³y/dx³ = f_xx + 2 f_xy f + f_yy f² + f_x f_y + f_y² f.                        (9.53)
It is clear that we can continue with the differentiation to any desired order. Now
suppose we define

T^{(k)}(x, y) = Σ_{i=1}^{k} ( h^{i−1} / i! ) ( d^{i−1} f / dx^{i−1} ),                   (9.54)

where d^i f / dx^i is the i-th total derivative of f. Also, by Taylor expansion,

y(x_n + h) ≈ Σ_{i=0}^{k} ( h^i / i! ) ( d^i y / dx^i )
           = y(x_n) + h [ y'(x_n) + (h/2) y''(x_n) + ··· + (h^{k−1}/k!) d^k y / dx^k ]
           = y(x_n) + h [ f + (h/2) f' + ··· + (h^{k−1}/k!) d^{k−1} f / dx^{k−1} ]
           = y(x_n) + h T^{(k)}(x_n, y(x_n)),                                            (9.55)
from which we derive the Taylor’s method scheme
yn+1 = yn + hT k (xn , yn ) = yn + hϕ(xn , yn ). (9.56)
Clearly ϕ(xn , yn ; h) = T k (xn , yn ) in Taylor’s method of order k. Note: Euler’s
method is Taylor’s method of order one!
3. Trapezoidal rule: Recall from 2a on page 104 that from the differential equation
we have
y(x_{n+1}) − y(x_n) = ∫_{x_n}^{x_{n+1}} f(x, y(x)) dx ≈ (h/2) ( f(x_n, y(x_n)) + f(x_{n+1}, y(x_{n+1})) ).

This leads to the method

y_{n+1} = y_n + (h/2) ( f(x_n, y_n) + f(x_{n+1}, y_{n+1}) ).                             (9.57)
Notice that this method is implicit as yn+1 , the next estimate to be derived appears
on both sides of the equation.
4. Other schemes are possible from numerical quadrature as explained above
The point about one step methods is that the value of the estimate at x_{n+1} depends only
on the estimate at x_n; hence the name one step. A short implementation sketch of Euler's method (9.51) is given below.
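The following sketch (illustration only; the test problem y' = y, y(0) = 1 is an assumption used only for checking) implements Euler's method (9.51) and shows the first order behaviour of the error.

```python
import math

def euler(f, x0, y0, h, n):
    """n steps of Euler's method (9.51) starting from (x0, y0)."""
    xs, ys = [x0], [y0]
    for _ in range(n):
        ys.append(ys[-1] + h * f(xs[-1], ys[-1]))   # y_{n+1} = y_n + h f(x_n, y_n)
        xs.append(xs[-1] + h)
    return xs, ys

for n in (10, 20, 40):
    xs, ys = euler(lambda x, y: y, 0.0, 1.0, 1.0 / n, n)
    print(f"h = {1.0 / n:.4f}   error at x = 1: {abs(ys[-1] - math.e):.4e}")
# Halving h roughly halves the error, as expected of a first order method.
```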

9.4.2 Analysis of Explicit One Step Methods
What should be the choice of ϕ in the general explicit (implicit) one step method? How
can we be certain that our one step method actually represents the differential equation
that we are trying to solve? To answer these questions, and others, we must analyse
the error generated by the one step approximation and see how it depends on ϕ and its
relation to the ordinary differential equation under observation. There are several sources
of error in the solution process.
Definition 51 (Local Error) The local error or local discretization error is that error
committed in a single step from xn to xn+1 , assuming that y is known exactly for x ≤ xn .
Thus we have
Local error = y(xn+1 ) − [y(xn ) + hϕ(xn , y(xn ); h)] (9.58)
defined in terms of the solution where h = xn+1 − xn .
The second source of error is the truncation error
Definition 52 (The local Truncation Error) The truncation error is defined by con-
sidering the one step method yn+1 = yn + hϕn , as an approximation to the differential
equation: All terms are placed on one side of the equation and scaled so that as h → 0,
the resulting expression tends to y 0 = f (x, y); the remainder when that exact solution is
substituted into the resultant expression is then the truncation error, which we shall denote
as T_n. Thus

T_n = ( y(x_{n+1}) − y(x_n) ) / h − ϕ(x_n, y(x_n); h).               (9.59)
On comparing definition 52 and 51, we see immediately the following relationship
Local discretization error = h × (Local truncation error) (9.60)
The local error and the local truncation error, when they accumulate, can lead to more error
in the solution. There could be other sources of error, such as local machine calculation
errors, that have to be accounted for. So we may wish to look at the global error in the
computation, which may be calculated under the assumption that calculations are done in
exact arithmetic.
Definition 53 (Global error) The global error en at xn is defined simply as
en = y(xn ) − yn (9.61)
The following theorem establishes the relationship between the error at xn , the error in
the initial data and the truncation error.
Theorem 37 (Error estimates) Let the differential equation (9.47) be approximated
at xn+1 by the one step method yn+1 = yn + hϕ(xn , yn ; h). Assume ϕ is Lipschitz con-
tinuous with Lipschitz constant Lϕ . If en = y(xn ) − yn is the global error at xn and
T = max0≤k≤n Tk , then
|e_n| ≤ e^{L_ϕ(x_n − x_0)} |e_0| + [ ( e^{L_ϕ(x_n − x_0)} − 1 ) / L_ϕ ] T.               (9.62)

Proof: From Definition 52,

y(xn+1) = y(xn ) + hϕ(xn , y(xn ); h) + hTn (9.63)

For a general explicit one step method, we have

yn+1 = yn + hϕ(xn , yn ; h) (9.64)

Thus subtracting (9.64) from (9.63), using the fact that en = y(xn ) − yn , we have

|en+1 | = |en + h[ϕ(xn , y(xn ); h) − ϕ(xn , yn ; h)] + hTn |


≤ (1 + hL_ϕ)|e_n| + h|T_n|   (by the triangle inequality and the Lipschitz continuity of ϕ).

Suppose |Tn | ≤ T for all n ≥ 0 (a uniform bound for the truncation error) and xn =
x0 + nh, n = 0, 1, 2, 3 · · · (equally spaced points) then we have

|e1 | ≤ (1 + hLϕ )|e0 | + hT


|e2 | ≤ (1 + hLϕ )|e1 | + hT ≤ (1 + hLϕ )2 |e0 | + h[1 + (1 + hLϕ )]T
|e3 | ≤ (1 + hLϕ )3 |e0 | + h[1 + (1 + hLϕ ) + (1 + hLϕ )2 ]T
   ⋮

|e_n| ≤ (1 + hL_ϕ)^n |e_0| + [ ( (1 + hL_ϕ)^n − 1 ) / L_ϕ ] T.

In terms of exponentials, we have 1 + hL_ϕ ≤ e^{hL_ϕ}, therefore

|e_n| ≤ ( e^{hL_ϕ} )^n |e_0| + [ ( (e^{hL_ϕ})^n − 1 ) / L_ϕ ] T = e^{(x_n − x_0)L_ϕ} |e_0| + [ ( e^{(x_n − x_0)L_ϕ} − 1 ) / L_ϕ ] T,

as required.

Example 25 Apply Theorem 37 to Euler’s method, and estimate the global error in Eu-
ler’s method.

Solution: When Theorem 37 is applied to Euler's method, we have ϕ(x_n, y_n; h) = f(x_n, y_n),
so L_ϕ = L, the Lipschitz constant for f. The truncation error is obtained from a Taylor
series expansion as follows:

T_n = ( y(x_n + h) − y(x_n) ) / h − f(x_n, y(x_n)) = (h/2) y''(η),   x_n < η < x_{n+1},
by the mean value form of Taylor's theorem.
So if |y''(x)| ≤ M for all x ∈ [x_0, x_n], then |T_n| ≤ (1/2)hM for all x ∈ [x_0, x_n]. Hence

|e_n| ≤ e^{L(x_n − x_0)} |e_0| + (1/2) [ ( e^{L(x_n − x_0)} − 1 ) / L ] M h.             (9.65)

Remark 10 In the last example, (9.65) shows that if |e0 | → 0 as h → 0, then en → 0 as


h → 0 and Euler’s method converges.

Exercise: Show that Euler's method for the equation y' = sin(y), y(0) = y_0, converges.
Remark 11 We would like to make the truncation error T as small as possible. From
Theorem 37, we see that we can establish convergence only if T → 0 as h → 0. In
particular, if we have exact initial conditions, the error tends to zero as the step size
tends to zero.
In any case, we must be certain that where there is convergence, the computed solution
is indeed the solution to the differential equation whose solution we seek to approximate.
This leads us to the idea of consistency
Definition 54 (Consistent scheme) The one step method y_{n+1} = y_n + hϕ(x_n, y_n; h) is
said to be consistent with the differential equation (9.47) if the truncation error
is such that

∀ε > 0, ∃ h_0(ε) > 0 for which |T_n| < ε for 0 < h < h_0(ε)          (9.66)

and any pair of points (x_n, y_n), (x_{n+1}, y_{n+1}).

9.4.3 Round-off Errors


As well as global and local errors in an algorithm, we accumulate errors due to the
maximum precision at which the computer stores data. For example, suppose in Euler's
method we calculate y_n(h) exactly. If we perform the algorithm on a computer, it
will store and use a value ỹ_n(h), where

yn (h) = ỹn (h) + rn .

The term rn is referred to as the round-off error, and is hardware (and possibly software)
dependent.
Example 26 What happens to the accuracy with which

( y_k(h) − y_{k−1}(h) ) / h

is calculated as h → 0?

9.4.4 Richardson Extrapolation


This method is sometimes referred to as extrapolation to the limit. It is a general method
for improving the accuracy of a numerical solution at a fixed point as the numerical
process proceeds. To introduce the technique, suppose we use a numerical method that
gives a global truncation error proportional to hp , that is T ∝ hP so that we have a pth
order method. Implicitly this means that:

yn (h) = φ(x∗ ) + Ahp

where h and n are selected so that xn = x∗ , and where A is a constant parameter that:

1. Does not depend on the step size

2. Does depend on the position x∗ = x

3. Does depend on the algorithm used

Example 27 Use Euler's method with step sizes h = 0.05, 0.1, 0.2 to solve the
initial value problem y'(x) = 3 − x + y, y(0) = 1, over the interval [0, 1.6]. Find the exact
solution and show that y_n(h) = φ(x_n) + Ah, where A is approximately independent of h
but depends on x_n.

Solution: Exact solution is y(x) = x − 2 + 3ex .


The idea in Richardson’s method is to use

yn (h) = φ(x∗ ) + Ahp (9.67)

to extrapolate the error to zero, by finding A and then subtracting off the error.
Outline of Method

1. for h = h1 we compute the approximation yn (h1 ) to φ(x∗ ) at x = x∗ .

2. We divide the step size by 2: h = h2 = 0.5 ∗ h1 .

3. We compute the value of y_{2n}(h_2) using h = h_2, which is also an approximation to
φ(x) at x = x*.

4. we assume (9.67) that the global error at x = x∗ is always proportional to hp and


with the same constant of proportionality:

(a) From the first computation, y_n(h_1) = φ(x*) + A h_1^p.

(b) From the second computation, y_{2n}(h_2) = φ(x*) + A h_2^p.

5. Subtracting, we eliminate φ(x*) to obtain A = ( y_n(h_1) − y_{2n}(h_2) ) / ( h_1^p − h_2^p ).

6. A better approximation to φ(x*) should then be given by

φ(x*) ≈ y_n(h_1) − A h_1^p
      = y_n(h_1) − [ ( y_n(h_1) − y_{2n}(h_2) ) / ( h_1^p − h_2^p ) ] h_1^p
      = ( 1 − 1/(1 − (0.5)^p) ) y_n(h_1) + ( 1/(1 − (0.5)^p) ) y_{2n}(h_2).

This effectively extrapolates the global error to zero.
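A sketch of the procedure (illustration only, using Euler's method with p = 1 on the problem of Example 27; the step sizes and evaluation point x* = 1.6 are assumptions taken from that example) is given below.

```python
import math

def euler(f, x0, y0, h, n):
    x, y = x0, y0
    for _ in range(n):
        y += h * f(x, y)
        x += h
    return y

f = lambda x, y: 3.0 - x + y
exact = lambda x: x - 2.0 + 3.0 * math.exp(x)   # exact solution of Example 27
p, h1, n = 1, 0.2, 8                            # Euler is first order; n*h1 = 1.6
y1 = euler(f, 0.0, 1.0, h1, n)                  # y_n(h1)
y2 = euler(f, 0.0, 1.0, h1 / 2, 2 * n)          # y_2n(h2) with h2 = h1/2
A = (y1 - y2) / (h1 ** p - (h1 / 2) ** p)       # estimate of the error constant
improved = y1 - A * h1 ** p                     # extrapolated value at x* = 1.6
print("Euler, h = 0.2  error:", abs(y1 - exact(1.6)))
print("Euler, h = 0.1  error:", abs(y2 - exact(1.6)))
print("extrapolated    error:", abs(improved - exact(1.6)))
```

The extrapolated value is noticeably more accurate than either Euler estimate, even though it costs only one extra run at the halved step size.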

Exercise: Read Section 6.4 of Reference [1] and follow the examples given there.
Consistency condition for a general one step method
In a general one step method, we assume that ϕ is a continuous function of its arguments.
Also, y 0 is continuous. Now,
 
lim_{h→0} T_n = lim_{h→0} [ ( y(x_n + h) − y(x_n) ) / h − ϕ(x_n, y(x_n); h) ] = y'(x_n) − ϕ(x_n, y(x_n); 0).        (9.68)

Comparing (9.68) with the differential equation (9.47), we deduce the following consis-
tency condition:

Scheme is consistent with the ode ⇔ ϕ(x, y(x); 0) ≡ f (x, y) (9.69)

We have the following convergence theorem


Theorem 38 Suppose we are to solve the first order differential equation (9.47) for x ≥
x0 . Suppose approximations are generated by the one step method (9.49), where ϕ is
uniformly Lipschitz on R×[0, h0 ] and satisfies the consistency condition (9.69) and (9.66).
Then we have convergence in the sense that

lim |yn − y(xn )| = 0 (9.70)


h→0

Proof: Follows immediately from Theorem 37 on noting that |yn − y(xn )| = |en |. Hence
if y(x0 ) = y0 , exact initial condition, then
|y_n − y(x_n)| ≤ [ ( e^{L_ϕ(x_n − x_0)} − 1 ) / L_ϕ ] max_{0≤k≤n} |T_k|.

From the continuity of ϕ with respect to its arguments and the consistency condition, we
have

T_n = ( y(x_{n+1}) − y(x_n) ) / h − ϕ(x_n, y(x_n); h)
    = [ ( y(x_{n+1}) − y(x_n) ) / h − f(x_n, y(x_n)) ] + [ ϕ(x_n, y(x_n); 0) − ϕ(x_n, y(x_n); h) ],

where the first bracket compares the difference quotient with the derivative and the second
is a difference in ϕ.

We thus have that

|T_n| ≤ | ( y(x_{n+1}) − y(x_n) ) / h − f(x_n, y(x_n)) | + | ϕ(x_n, y(x_n); 0) − ϕ(x_n, y(x_n); h) |
      < ε   for h < h_0(ε),  n = 0, 1, 2, ....

We will be interested to know how accurate the solution is. This is done through the
definition of order of accuracy.
Definition 55 (Order of a scheme) The numerical scheme (9.49) is said to have order
of accuracy p if p is the largest positive integer such that for sufficiently smooth solution
y, there exist κ, a constant, and a step size h0 for which |Tn | ≤ κhp for 0 < h < h0

9.4.5 Exercises
1. For the differential equation y 0 = −xy + 1/y 2, y(1) = 1, derive the difference
equation corresponding to Taylor’s method of order 3. Carry out by hand one step
of the integration with h = 0.01. Write a computer programme to solve this problem
and carry out the integration from x = 1 to x = 3, using step sizes h = 1/64 and h = 1/128.

2. For the equation y 0 = 2y, y(0) = 1, obtain the exact solution of the difference
equation from Euler’s method. estimate the value of h small enough to guarantee
four decimal places of accuracy in the solution over the interval [0,1]. Carry out the
solution with the appropriate value of h for 10 steps.

3. Find the Taylor series expansion of the function y that satisfies the differential
equation y 0 = xy + 1, y(0) = 1

4. For the equation y 0 = 1 − y/x, y(x0) = y0 , find the general formula for the Taylor
series method of Order k ≥ 1.

5. It is required to solve the problem (9.47). If the Trapezoidal rule is used to approxi-
mate the integral, show that the truncation error is −(1/12) h² y^{(3)}(ξ), where h = x_{n+1} − x_n
and ξ ∈ (x_n, x_{n+1}) is a mean value point, provided y^{(3)} is continuous.
If e_n = y(x_n) − y_n, show that

|e_{n+1}| ≤ |e_n| + (1/2) hL ( |e_{n+1}| + |e_n| ) + (1/12) h³ M,

where y is the solution to problem (9.47), L is the Lipschitz constant of f with
respect to y, and |y^{(3)}| ≤ M. Taking y_0 = y(x_0) and uniform step size h, deduce
that

|e_n| ≤ ( h² M / (12L) ) [ ( (1 + (1/2)hL) / (1 − (1/2)hL) )^n − 1 ],   provided hL < 2.

6. In the previous question, show that if we define g(x) = f_y(x, y(x)), then asymptot-
ically e_n ≈ h² e(x_n), where e(x) satisfies the equation

e'(x) = g(x) e(x) − (1/12) y^{(3)}(x),   e(0) = 0.

Refine your arguments using the results of the previous question to show that e_n −
h² e(x_n) = O(h³). Find e(x) for problem (9.47) when f(x, y) = y, x_0 = 0, y_0 = 1.

7. Here we consider the problem

y 0(x) = λy(x), y(x0 ) = y0 , (9.71)

where λ is appropriately chosen.

(a) Show that to generate estimates for the problem (9.47), the initial value prob-
lem (9.71) can be used as a test problem for a numerical method.

(b) Verify that if a k-th order Taylor's series method is applied to the test problem
(9.71), it generates the estimates

y_{n+1} = ρ(h̄) y_n,   n = 0, 1, 2, 3, ...,   with   ρ(h̄) = Σ_{j=0}^{k} h̄^j / j!,

and hence the solution is y_n = ρ(h̄)^n y_0, where h̄ = λh.


(c) Suppose we use a one step method to solve problem (9.47), employing a fixed
step size h. If y_n(h) is the estimate of y(x*), where x* = x_0 + nh, and if the
method is a p-th order method, then we expect e_n(h) ≈ C h^p for some constant
C, where e_n(h) = y(x*) − y_n(h). Use this expected equation to establish the
approximation

( y_{4n}(h/4) − y_{2n}(h/2) ) / ( y_{2n}(h/2) − y_n(h) ) ≈ 2^{−p}.

9.5 Runge-Kutta Methods


Taylor series methods could, in principle, yield any desired order of accuracy, but the
computations quickly become very cumbersome and tedious. In particular, it will not
be feasible to carry out the manipulations required for some types of f . An alternative
is to re-evaluate f (·, ·) at points intermediate between (xn , yn ) and (xn+1 , yn+1) so as to
approximate the required partial derivatives by differences.
Suppose we take just one intermediate point and set up the following equations:

y_{n+1} = y_n + h( a K_1 + b K_2 ),   K_1 = f(x_n, y_n),   K_2 = f(x_n + αh, y_n + βh K_1),        (9.72)

where the parameters a, b, α and β are to be determined. Then clearly (9.72) is of the form
of a general one step method if we set ϕ = a K_1 + b K_2. The consistency condition
ϕ(x_n, y_n; 0) = f(x_n, y_n) yields

a + b = 1.                                                          (9.73)
Now expand the terms of equation (9.72) in a Taylor series about x_n to have

y(x_{n+1}) = y(x_n) + a h f(x_n, y(x_n)) + b h f(x_n + αh, y(x_n) + βh K_1)
           = y(x_n) + a h f + b h [ f + αh f_x + βh K_1 f_y
                    + ( (αh)²/2! ) f_xx + ( 2(αh)(βh K_1)/2! ) f_xy + ( (βh K_1)²/2! ) f_yy + ··· ]
           = y(x_n) + h f(x_n, y(x_n)) + b h² [ α f_x + β f f_y ](x_n, y(x_n))
                    + b h³ [ (1/2) α² f_xx + αβ f f_xy + (1/2) β² f² f_yy ](x_n, y(x_n)) + O(h⁴),

where we have used a + b = 1 and K_1 = f(x_n, y(x_n)).
Comparing terms with the corresponding terms in the Taylor series expansion of y(x_n + h),
we see that we can match order h² terms if

α b = β b = 1/2.                                                    (9.74)
It is then easy to see that no other choice of α, β, a, b can be found to match O(h3 ) terms
hence looking at (9.72), (9.73) and (9.74), we may choose β = α, and still have one free
parameter. Some choices of:

1. a = 0, b = 1, α = β = 1/2:  y_{n+1} = y_n + h f( x_n + (1/2)h, y_n + (1/2)h f(x_n, y_n) ). This is the
modified Euler's method.

2. a = b = 1/2, α = β = 1:  y_{n+1} = y_n + (1/2)h [ f(x_n, y_n) + f( x_n + h, y_n + h f(x_n, y_n) ) ]. This is the
improved Euler scheme.

We can easily verify that these schemes are second order schemes. It is remarkable that we
can get higher order accuracy simply by evaluating the function at more intermediate values
between (x_n, y_n) and (x_{n+1}, y_{n+1}). The method that really popularized Runge-Kutta
methods is the following fourth order scheme:
h
yn+1 = yn + (K1 + 2K2 + 2K3 + K4 ) (9.75)
6
where

K_1 = f(x_n, y_n)
K_2 = f( x_n + (1/2)h, y_n + (1/2)h K_1 )
K_3 = f( x_n + (1/2)h, y_n + (1/2)h K_2 )
K_4 = f( x_n + h, y_n + h K_3 ).

Example 28 By applying the scheme (9.75) to the problem y 0 = λy, λ a constant, show
that the order of accuracy is indeed fourth.

Solution:

K_1 = λ y_n
K_2 = λ( y_n + (1/2)hλ y_n ) = λ( 1 + (1/2)λh ) y_n
K_3 = λ( y_n + (1/2)λh( 1 + (1/2)λh ) y_n ) = λ( 1 + (1/2)λh + (1/4)(λh)² ) y_n
K_4 = λ( y_n + λh( 1 + (1/2)λh + (1/4)(λh)² ) y_n ) = λ( 1 + λh + (1/2)(λh)² + (1/4)(λh)³ ) y_n.

Hence,

y_{n+1} = [ 1 + λh + (1/2)(λh)² + (1/6)(λh)³ + (1/24)(λh)⁴ ] y_n,

which corresponds to the first five terms in the Taylor expansion of y(x_{n+1}) = e^{λh} y_n.
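The fourth order behaviour can also be observed numerically. The following sketch (illustration only) implements the scheme (9.75) and applies it to y' = y, y(0) = 1, the case λ = 1 of Example 28.

```python
import math

def rk4_step(f, x, y, h):
    k1 = f(x, y)
    k2 = f(x + h / 2, y + h * k1 / 2)
    k3 = f(x + h / 2, y + h * k2 / 2)
    k4 = f(x + h, y + h * k3)
    return y + h * (k1 + 2 * k2 + 2 * k3 + k4) / 6   # the scheme (9.75)

f = lambda x, y: y                                   # lambda = 1 in Example 28
for n in (10, 20):
    h, x, y = 1.0 / n, 0.0, 1.0
    for _ in range(n):
        y = rk4_step(f, x, y, h)
        x += h
    print(f"h = {1.0 / n:.3f}   error at x = 1: {abs(y - math.e):.3e}")
# Halving h reduces the error by roughly 2^4 = 16, confirming fourth order accuracy.
```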

9.5.1 Exercises
1. For the problem (9.47) with f(x, y) = x + y, x_0 = 0, y_0 = 1, calculate the local
truncation error for the method

y_{n+1} = y_n + (1/2)( K_1 + K_2 ),   K_1 = h f(x_n, y_n),   K_2 = h f(x_n + h, y_n + K_1).

Compare this error with the error of Taylor's algorithm of order 2. Which would
you expect to give a better result over the interval [0, 1]?

2. Consider the method


y_{n+1} = y_n + (h/9)( 2k_1 + 3k_2 + 4k_3 ),

k_1 = f(x_n, y_n),   k_2 = f( x_n + (1/2)h, y_n + (1/2)h k_1 ),   k_3 = f( x_n + (3/4)h, y_n + (3/4)h k_2 ).
(i) Show that this is a third order Runge-Kutta method. (ii) Show that when applied
to the test problem (9.71), it generates the estimates yn+1 = ρ(h̄)yn and specify the
form of ρ.

9.6 Linear Multi-step Methods for solving First Order Scalar Ordinary Differential Equations
Runge-Kutta methods, though having very high accuracy, have the following disadvan-
tages:

1. It is always very difficult to estimate the truncation error in Runge-Kutta methods.

2. They employ more evaluations of f than would seem necessary, and yet it is not
clear how these values of f can be used to advantage.

Now consider the following direct application of Simpson’s rule at xn−1 , xn = xn−1 + h
and xn+1 = xn−1 + 2h.
∫_{x_{n−1}}^{x_{n+1}} y'(x) dx = ∫_{x_{n−1}}^{x_{n+1}} f(s, y(s)) ds.                    (9.76)

Then

y_{n+1} − y_{n−1} = (1/3) h [ f(x_{n−1}, y_{n−1}) + 4 f(x_n, y_n) + f(x_{n+1}, y_{n+1}) ].        (9.77)
Observe that the value of y_{n+1} appears on both sides and involves values at the three
points x_{n−1}, x_n and x_{n+1}.
If the mid point rule is used to approximate the integral on the right hand side of
(9.76), we have:

yn+1 − yn−1 = 2hf (xn , yn ). (9.78)

In this case we have a two step method, but this time it is almost explicit. We still need
starting values y0 and y−1 to start off the method.
Still other schemes can be derived. Let us define

Δ_− y_n := y_n − y_{n−1}   ⇒   Δ_−² y_n = y_n − 2y_{n−1} + y_{n−2},
                               Δ_−³ y_n = y_n − 3y_{n−1} + 3y_{n−2} − y_{n−3},  etc.     (9.79)

We can then derive the following schemes:

Adams-Moulton Implicit Methods:

y_{n+1} − y_n = h [ I − (1/2)Δ_− − (1/12)Δ_−² − (1/24)Δ_−³ − (19/720)Δ_−⁴ − ··· ] f(x_{n+1}, y_{n+1})        (9.80)

Adams-Bashforth Explicit Methods:

y_{n+1} − y_n = h [ I + (1/2)Δ_− + (5/12)Δ_−² + (3/8)Δ_−³ + (251/720)Δ_−⁴ + ··· ] f(x_n, y_n)                (9.81)

We note that the formulae based on difference approximations to the derivative and those
based on difference approximation to the integral are very different in form. They could
also be combined in a variety of ways. The question is how do we choose between methods?
A complete understanding of the solutions of difference equations is needed to be able
to pursue studies on linear multi-step methods.
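As a concrete illustration, truncating (9.81) after the first backward difference gives the two-step Adams-Bashforth scheme y_{n+1} = y_n + (h/2)(3 f_n − f_{n−1}). The sketch below (illustration only; the single Euler starting step and the test problem y' = y are assumptions) shows its second order behaviour.

```python
import math

def adams_bashforth2(f, x0, y0, h, n):
    xs, ys = [x0, x0 + h], [y0, y0 + h * f(x0, y0)]   # y_1 supplied by one Euler step
    for k in range(1, n):
        ys.append(ys[k] + h * (3 * f(xs[k], ys[k]) - f(xs[k - 1], ys[k - 1])) / 2)
        xs.append(xs[k] + h)
    return xs, ys

for n in (10, 20, 40):
    xs, ys = adams_bashforth2(lambda x, y: y, 0.0, 1.0, 1.0 / n, n)
    print(f"h = {1.0 / n:.4f}   error at x = 1: {abs(ys[-1] - math.e):.4e}")
# Halving h divides the error by roughly four: the scheme is second order.
```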

9.7 Zero stability for linear multi-step schemes


Recall that the third order linear difference equation 2yn+1 + 3yn − 6yn−1 + yn−2 = 6hf(xn, yn), when applied to the problem (9.35) with the initial starting values (9.36), gives a divergent solution. We easily verify that the characteristic polynomial is ρ(β) = 2β³ + 3β² − 6β + 1 = 0, which factorizes as 2(β − 1)(β + 5/4 − √33/4)(β + 5/4 + √33/4) = 0, so that the general solution is

yn = c1(1)^n + c2((−5 + √33)/4)^n + c3((−5 − √33)/4)^n,   n = 0, 1, 2, · · · ,

containing the arbitrary constants c1, c2 and c3, which can be determined if three initial values for y are specified. The original ordinary differential equation specifies only one initial value, y(0) = 1. Since yn = c1 + c2(0.186)^n + c3(−2.686)^n, it is clear that if the error incurred in specifying the three initial values (say of magnitude |ε|) results in c3 ≠ 0, then this component of the solution will grow without bound. This is an example of a numerical instability. Since this behaviour is present even when the right hand side is zero, or equivalently in the limit h → 0, a scheme which avoids the scenario explained above is said to be zero stable.
Now suppose we are considering schemes which use the values yn, yn+1, yn+2, ..., yn+k−1 (and also f(xn, yn), f(xn+1, yn+1), ..., f(xn+k−1, yn+k−1)) to obtain the next value yn+k. We call these k-step methods.
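The divergence caused by a scheme of this kind that is not zero stable is easy to reproduce. The sketch below applies the three-step scheme 2yn+1 + 3yn − 6yn−1 + yn−2 = 6hfn recalled above to the test problem y′ = −y, y(0) = 1 (an illustrative substitute for (9.35)-(9.36), which are not restated here), starting from essentially exact values; the parasitic root of magnitude 2.686 eventually dominates all the same:

    import math

    f = lambda x, y: -y                       # illustrative test problem (an assumption)
    h = 0.01
    y = [math.exp(-j*h) for j in range(3)]    # y_0, y_1, y_2 taken from the exact solution
    for n in range(2, 200):
        fn = f(n*h, y[n])
        # 2 y_{n+1} + 3 y_n - 6 y_{n-1} + y_{n-2} = 6 h f_n, solved for y_{n+1}
        y.append((-3*y[n] + 6*y[n-1] - y[n-2] + 6*h*fn) / 2)
    print(y[50], y[100], y[-1])               # grows without bound although exp(-x) decays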

Definition 56 (Zero stability) A k-step method (for approximating an ordinary differential equation of the form (9.47)) is zero stable if there exists a constant K such that, if (yn)n≥0 and (zn)n≥0 are two sequences generated by the formulae with the same grid points (xn)n≥0 but with different initial data, we have

|yn − zn| ≤ K max{|y0 − z0|, |y1 − z1|, · · · , |yk−1 − zk−1|}          (9.82)

for xn ≤ XM as h := max(xj+1 − xj) → 0.

Definition 57 A general linear k-step method (for approximating an ordinary differential equation of the form (9.47)) may be written in the form

Σ_{j=0}^{k} αj yn+j = h Σ_{j=0}^{k} βj f(xn+j, yn+j)          (9.83)

where αj, j = 0, 1, 2, .., k and βj, j = 0, 1, 2, .., k are real constants. The method is called linear because it involves only linear combinations of the yn+j and f(xn+j, yn+j). (We shall simply write fn to denote f(xn, yn).)

We note that linear multi-step schemes of the form (9.83) are explicit if βk = 0 and implicit otherwise. Such schemes are characterized by their first and second characteristic polynomials, ρ(ξ) and σ(ξ) respectively, where

ρ(ξ) = Σ_{j=0}^{k} αj ξ^j,   σ(ξ) = Σ_{j=0}^{k} βj ξ^j.          (9.84)

We must assume in (9.84) that

αk ≠ 0,   α0² + β0² ≠ 0,          (9.85)

so that we can solve for yn+k when h is sufficiently small and the scheme does not degenerate into a (k − 1)-step method. We also need

ρ(1) = Σ_{j=0}^{k} αj = 0          (9.86)

so that the left hand side of (9.83) gives a difference approximation to y′(xn). This is part of the consistency condition.

Theorem 39 (The root condition) An explicit linear multi-step method is zero stable for any ordinary differential equation satisfying a Lipschitz condition if and only if all the zeros of its first characteristic polynomial lie in the closed unit disc, with any which lie on the unit circle being simple.

Proof: To prove necessity, consider the scheme applied to y′ = 0. Then

yn+k = −(1/αk) Σ_{j=0}^{k−1} αj yn+j.

The general solution is thus

yn = Σ_s ps(n) ξs^n          (9.87)

where ξs is a zero of the polynomial ρ(ξ) = 0 and the corresponding polynomial ps(·) has degree one less than the multiplicity of this zero. If |ξs| > 1, then there is initial data for which the solution will grow like |ξs|^n, and if |ξs| = 1 and its multiplicity is ms > 1, there is initial data for which the solution will grow like |n^(ms−1) ξs^n| = n^(ms−1). In either case this implies unbounded growth with n. Since the problem is linear, any difference of a pair of solutions is also a solution, and Definition 56 cannot be satisfied.
Remark 12 The root condition is so crucial that it is used by some authors as the actual
definition of zero stability.
Lemma 9 If the root condition is satisfied, the recurrence relation

yn+k = −(1/αk) Σ_{j=0}^{k−1} αj yn+j   satisfies   |yn| ≤ K max{|y0|, |y1|, · · · , |yk−1|},   ∀n ≥ 0.

Proof: Suppose α0 ≠ 0 and ρ(ξ) has k non-zero zeros, counting multiplicity. Then the general solution yn has k arbitrary constants c1, c2, · · · , ck, determinable by equating the initial data with the expansion to obtain a linear system of equations of the form

(y0, y1, · · · , yk−1)ᵀ = Z (B1, B2, · · · , Bk)ᵀ,   where Z = (ξj^i), i = 0, · · · , k−1, j = 1, · · · , k,   i.e. y⁰ = ZB.          (9.88)

The matrix Z is a Vandermonde matrix when the zeros of ρ(ξ) are distinct and non-zero, so there exists K′ such that

‖B‖∞ ≤ K′ ‖y⁰‖∞.

Now, the solution (9.87) consists of a linear combination of the Bi's multiplied by linearly independent solutions of the form n^q ξs^n, and each of these is bounded for all n. (Because of the root condition: |ξs| ≤ 1 if ξs is a simple zero of ρ(ξ), i.e. q = 0; |ξs| < 1 if ξs is a multiple zero of ρ(ξ), i.e. q ≠ 0.) To see this, suppose |ξs| ≤ α < 1 with ξs a zero for which q ≤ k − 1. We easily verify that n^(k−1) α^n has an absolute maximum at n = (k − 1)/(− ln α). Call this maximum K′′. Hence we have

|yn| ≤ kK′′ ‖B‖∞ ≤ kK′K′′ ‖y⁰‖∞.

Hence the result, with K = kK′K′′.
Though the results have been stated for explicit methods, extension to implicit methods
is direct.

Example 29 (i) All one step methods, such as Euler's method and the trapezoidal rule, are zero stable because consistency requires the single zero of ρ(·) to be 1.

(ii) All the Adams-Moulton and Adams-Bashforth methods [3] are zero stable because ρ(ξ) = ξ^p(ξ − 1) for some value of p.

(iii) Consider the approximation of y′ = y, y(0) = 1 by

(a) the two step second order accurate mid point rule (9.78), and
(b) the following three-step sixth order accurate method:

11yn+3 + 27yn+2 − 27yn+1 − 11yn = 3h[fn+3 + 9fn+2 + 9fn+1 + fn]          (9.89)

For (b), ρ(ξ) = 11ξ³ + 27ξ² − 27ξ − 11 = 11(ξ − 1)(ξ + 0.3189...)(ξ + 3.1356...). Notice that one of the zeros of ρ is bigger than unity in magnitude, so the root condition fails even though the method is consistent. The method is therefore not zero stable and we do not expect it to converge to the sought-after solution. For (a), ρ(ξ) = ξ² − 1 = (ξ − 1)(ξ + 1): two simple zeros, each of magnitude one, on the unit circle.

In (b) the zeros other than unity correspond to parasitic solutions of the difference scheme, which must not be allowed to dominate the solution process.
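The root condition for these two schemes can also be checked numerically. The sketch below (using numpy.roots; the coefficient lists are ordered from the highest power of ξ downwards) prints the moduli of the zeros of ρ for the mid point rule and for the three-step method (9.89):

    import numpy as np

    rho_midpoint = [1.0, 0.0, -1.0]             # rho(xi) = xi^2 - 1
    rho_sixth    = [11.0, 27.0, -27.0, -11.0]   # rho(xi) = 11 xi^3 + 27 xi^2 - 27 xi - 11

    for name, rho in (("mid point rule", rho_midpoint),
                      ("three-step method (9.89)", rho_sixth)):
        print(name, np.abs(np.roots(rho)))      # any modulus > 1 violates the root condition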

9.8 Accuracy of linear multi-step methods

Definition 58 Suppose y is a smooth solution of y′(x) = f(x, y(x)). The truncation error for the linear multi-step method (9.83) is

Tn = (1/h) Σ_{j=0}^{k} [αj y(xn+j) − hβj y′(xn+j)].          (9.90)

Noting that xn+j = xn + jh, we may write

y(xn + jh) = y(xn) + (jh)y′(xn) + ((jh)²/2!)y′′(xn) + · · ·

y′(xn + jh) = y′(xn) + (jh)y′′(xn) + ((jh)²/2!)y′′′(xn) + · · ·

Substituting these into (9.90) and rearranging, we have

hTn = C0 y(xn) + C1 h y′(xn) + C2 h² y′′(xn) + · · ·          (9.91)

where

C0 = Σ_{j=0}^{k} αj,   Cq = Σ_{j=0}^{k} αj (j^q / q!) − Σ_{j=0}^{k} βj (j^(q−1) / (q−1)!),   q = 1, 2, · · · .          (9.92)

For consistency, we require Tn −→ 0 as h → 0; that is, we require C0 = 0 and C1 = 0. In terms of the characteristic polynomials, this consistency requirement can be stated as:

ρ(1) = 0,   and   ρ′(1) = σ(1) ≠ 0.          (9.93)

More generally, if C0 = C1 = · · · = Cp = 0 and Cp+1 ≠ 0, the method is of order of accuracy p. Hence for a pth order method, the local truncation error is

Tn = Cp+1 h^p y^(p+1)(xn) + O(h^(p+1))          (9.94)

and Cp+1 is called the error constant.
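The constants Cq in (9.92) are easy to generate by machine, which gives a quick way of finding the order and error constant of any proposed scheme. The sketch below (a helper of our own, written with exact rational arithmetic) does this for Simpson's rule, for which we expect order 4 and error constant −1/90, the value used in Section 9.9 below:

    from fractions import Fraction as F
    from math import factorial

    def order_and_error_constant(alpha, beta, qmax=8):
        # compute C_0, C_1, ... of (9.92) for a linear k-step method with
        # coefficients alpha_j, beta_j (j = 0..k); return the order p and C_{p+1}
        C = [sum(alpha)]
        for q in range(1, qmax + 1):
            Cq = sum(a*F(j**q, factorial(q)) for j, a in enumerate(alpha)) \
                 - sum(b*F(j**(q - 1), factorial(q - 1)) for j, b in enumerate(beta))
            C.append(Cq)
        p = next(q for q, c in enumerate(C) if c != 0) - 1
        return p, C[p + 1]

    # Simpson's rule: y_{n+2} - y_n = (h/3)(f_n + 4 f_{n+1} + f_{n+2})
    alpha = [F(-1), F(0), F(1)]
    beta  = [F(1, 3), F(4, 3), F(1, 3)]
    print(order_and_error_constant(alpha, beta))   # expected output: (4, Fraction(-1, 90))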


Our key theorem says that zero stability plus consistency is equivalent to convergence. We can also show that convergence of a scheme implies consistency and zero stability. We can prove the following theorem of Dahlquist.

Theorem 40 (Dahlquist) For a linear multi-step method which is consistent with an ordinary differential equation satisfying a Lipschitz condition, starting with consistent initial data, zero stability is necessary and sufficient for convergence. Moreover, if the method has truncation error O(h^p) and the solution y(x) ∈ C^(p+1), then the global error is O(h^p).
How high an order can we expect? We have the following Dahlquist stability limit theorem.

Theorem 41 (Dahlquist's stability limit) No zero stable linear k-step method can have order greater than k + 1 when k is odd, or greater than k + 2 when k is even.

For example: for k = 1 we have the trapezoidal rule, of order 2. For k = 2 we have Simpson's rule, of order 4. The Adams-Bashforth k-step method has order k, and the Adams-Moulton k-step method has order k + 1.

9.9 Example of error analysis for an implicit method

Consider Simpson's rule applied to y′ = f(x, y), namely

yn+1 − yn−1 = (h/3)[f(xn−1, yn−1) + 4f(xn, yn) + f(xn+1, yn+1)],

and write it in the form

yn+2 − yn = (h/3)[f(xn, yn) + 4f(xn+1, yn+1) + f(xn+2, yn+2)],

which is implicit. The characteristic polynomials are

ρ(ξ) = (ξ − 1)(ξ + 1),   σ(ξ) = (1/3)ξ² + (4/3)ξ + (1/3).

Hence the method is zero stable. Comparing with (9.92), we have:

C0 = C1 = C2 = C3 = C4 = 0,   C5 = −1/90.
Hence,

Tn = −(1/90) h⁴ y^(5)(ξ),   ξ ∈ (xn, xn+2).

Now, for the global error, set en = y(xn) − yn. Then

en+2 − en = (h/3)[(f(xn, y(xn)) − f(xn, yn)) + 4(f(xn+1, y(xn+1)) − f(xn+1, yn+1)) + (f(xn+2, y(xn+2)) − f(xn+2, yn+2))] + hTn.

Applying the Lipschitz condition on f gives |en+2| ≤ |en| + (1/3)Lh[|en| + 4|en+1| + |en+2|] + h|Tn|, where L is the Lipschitz constant for f. Hence

(1 − (1/3)Lh)|en+2| ≤ (1 + (1/3)Lh)|en| + (4/3)Lh|en+1| + h|Tn|.

We can only continue this argument if h < h0 = 3L⁻¹. In that case, take

En = max{|en|, |en+1|},   T = max_n |Tn|.

Then we have

En ≤ [(1 + (5/3)Lh)/(1 − (1/3)Lh)]^n E0 + {[(1 + (5/3)Lh)/(1 − (1/3)Lh)]^n − 1}(T/L),   h < h0 = 3/L.

So, if we assume that E0 = 0, then En → 0 as h → 0, since T → 0 as h → 0. This proves convergence.
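The fourth order convergence just established can be observed numerically. In the sketch below (our own illustration) Simpson's rule is applied to y′ = y, y(0) = 1, a growing solution, so that the parasitic root discussed next causes no difficulty, and the error at x = 1 is seen to fall by roughly a factor of 16 each time h is halved:

    import math

    def simpson_solve(h):
        # integrate y' = y, y(0) = 1 up to x = 1 with Simpson's rule,
        # using the exact value exp(h) as the second starting value
        n = round(1.0/h)
        y = [1.0, math.exp(h)]
        for j in range(n - 1):
            y.append(((4*h/3)*y[-1] + (1 + h/3)*y[-2]) / (1 - h/3))
        return y[n]

    for h in (0.1, 0.05, 0.025):
        print(h, abs(simpson_solve(h) - math.e))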
Suppose we apply the above to the test problem y′ = λy. Then

yn+2 − yn = (1/3)λh[yn+2 + 4yn+1 + yn],

which, on setting h̄ = λh, gives

(1 − (1/3)h̄)yn+2 − (4/3)h̄yn+1 − (1 + (1/3)h̄)yn = 0;

a linear recurrence relation which can be solved. We have:

yn ∝ β^n   ⇒   (1 − (1/3)h̄)β² − (4/3)h̄β − (1 + (1/3)h̄) = 0,

⇒   β1 = [(4/3)h̄ + √((12/9)h̄² + 4)] / [2(1 − (1/3)h̄)],   β2 = [(4/3)h̄ − √((12/9)h̄² + 4)] / [2(1 − (1/3)h̄)].

We easily verify that the first root approximates the exact solution while the second root is parasitic. In fact:

β1(h̄) = 1 + h̄ + h̄²/2 + h̄³/6 + h̄⁴/24 + h̄⁵/72 + h̄⁶/144 + · · ·
       = exp(h̄) + O(h̄⁵)   as h̄ → 0.

Similarly,

β2(h̄) ∼ −[(1 + (1/3)h̄)/(1 − (1/3)h̄)] exp(−h̄) ∼ −(1 − (1/3)h̄)   as h̄ → 0.

Notice therefore that if yn = a1 β1^n + a2 β2^n for constants a1, a2, then the component of the solution corresponding to the parasitic root oscillates in sign from step to step. When λ is real and negative, so that the true solution decays, the parasitic solution grows. This is a severe disadvantage of this method for such problems.
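This weak instability is easy to observe. The sketch below (an illustration of ours, with λ = −1 and h = 0.1) solves the Simpson recurrence for y′ = λy starting from y0 = 1 and the accurate value y1 = exp(λh); the exact solution decays, yet after enough steps the computed values oscillate in sign and grow, because the parasitic root has modulus greater than one for h̄ < 0:

    import math

    lam, h = -1.0, 0.1            # illustrative values (an assumption)
    hb = lam*h
    y = [1.0, math.exp(lam*h)]    # y_0 and an accurate starting value y_1
    for n in range(400):
        # (1 - hb/3) y_{n+2} = (4 hb/3) y_{n+1} + (1 + hb/3) y_n
        y.append(((4*hb/3)*y[-1] + (1 + hb/3)*y[-2]) / (1 - hb/3))
    print(y[100], y[200], y[-1])  # compare with exp(-10), exp(-20), exp(-40): the computed
                                  # solution is eventually dominated by the oscillating mode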

9.10 Absolute Stability


The above example illustrates a phenomenon called weak instability or partial instability.
The instability is weak because:

- The global error does not show unbounded growth.

- The instability occurs only for certain problems (it is problem dependent).

Definition 59 (Absolute Stability) For a given value of h̄, a linear multi-step method is said to be absolutely stable if all the zeros of the stability polynomial ρ(ξ) − h̄σ(ξ) lie strictly inside the unit disc. An open interval (α, β) is an interval of absolute stability for the method if the method is absolutely stable for all h̄ ∈ (α, β).

Lemma 10 No consistent zero stable linear multi-step method is absolutely stable for small positive values of h̄. In fact, if such a method has order of accuracy p, we have

β1(h̄) = exp(h̄) + O(h̄^(p+1))   as h̄ → 0.          (9.95)

Proof: From (9.94) with y(x) = exp(λx), we see that hTn = O(h̄^(p+1)), and substituting the exact solution into (9.90) with xn = 0 gives

ρ(exp(h̄)) − h̄σ(exp(h̄)) = hσ(1)Tn.          (9.96)

By zero stability, β1(0) = 1 is a simple zero of ρ(ξ) = 0. Thus for sufficiently small h̄ the left hand side of (9.96) gives

β1(h̄) − exp(h̄) = hσ(1)Tn [ Π_{j=2}^{k} (βj(h̄) − exp(h̄)) ]⁻¹.

Here the right hand side is bounded because β1(h̄) ≈ exp(h̄) is a simple zero for small h̄, so the remaining factors βj(h̄) − exp(h̄), j ≥ 2, are bounded away from zero. Therefore (9.95) is satisfied for small h̄.

Example 30 (Examples) Simpson's Method: For Simpson's method applied to the test problem,

yn+2 − yn = (1/3)λh(yn+2 + 4yn+1 + yn).
The second root β2(h̄) of ρ(ξ) − h̄σ(ξ) satisfies

|β2(h̄)| = [√(1 + (1/3)h̄²) − (2/3)h̄] / (1 − (1/3)h̄) ≥ (1 − (2/3)h̄) / (1 − (1/3)h̄) > 1   ∀ h̄ < 0.

That is, this method has no interval of absolute stability and is thus a very poor method to use on problems with decaying solutions.

Euler's Method: For Euler's method we have yn+1 − yn = h̄yn, so β(h̄) = 1 + h̄ and |β(h̄)| < 1 ⇒ h̄ ∈ (−2, 0). The interval of absolute stability here is all of (−2, 0).

Trapezoidal Rule: For the trapezoidal rule we have

yn+1 − yn = (1/2)λh(yn+1 + yn).

This yields

β(h̄) = (1 + (1/2)h̄) / (1 − (1/2)h̄).

Clearly, for this scheme, |β(h̄)| < 1 ⇒ h̄ < 0. The interval of absolute stability here is all of R−.

Adams-Moulton and Adams-Bashforth methods: For the Adams-Moulton (A-M) and Adams-Bashforth (A-B) methods, the intervals of absolute stability are all of the form (α, 0), where the value of α depends on the value of k in the method.

k     α (A-B)     α (A-M)
1     −2          −∞
2     −1          −6
3     −6/11       −3
4     −3/10       −90/49
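Such intervals can be estimated numerically by locating, for each trial value of h̄ < 0, the zeros of ρ(ξ) − h̄σ(ξ). The sketch below (our own helper, using numpy.roots) scans downwards from zero for the two-step Adams-Bashforth method and recovers the left end point α ≈ −1 of its stability interval, in agreement with the table above:

    import numpy as np

    def absolutely_stable(alpha, beta, hbar):
        # True if every zero of rho(xi) - hbar*sigma(xi) lies strictly inside the unit disc;
        # alpha and beta list the coefficients from the highest power of xi downwards
        coeffs = np.array(alpha, float) - hbar*np.array(beta, float)
        return bool(np.all(np.abs(np.roots(coeffs)) < 1.0))

    # two-step Adams-Bashforth: y_{n+2} - y_{n+1} = (h/2)(3 f_{n+1} - f_n)
    alpha = [1.0, -1.0, 0.0]
    beta  = [0.0, 1.5, -0.5]

    hb = -0.001
    while absolutely_stable(alpha, beta, hb):
        hb -= 0.001
    print(hb)    # approximately -1.0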

9.11 Predictor-Corrector Methods


When solving systems of differential equations using linear multi-step methods (9.83), it is sometimes advantageous to use implicit schemes; but then there is the disadvantage of having to solve an algebraic equation to obtain yn+k. For sufficiently small h, this can be done by the iteration

αk yn+k^(s+1) − hβk f(xn+k, yn+k^(s)) = Σ_{j=0}^{k−1} [−αj yn+j + hβj f(xn+j, yn+j)]          (9.97)

and this will converge if h|βk|L < |αk|, where L is the Lipschitz constant for f(·, ·). If such an iteration is necessary, it may be expensive to recompute f(·, ·) should many iterations of (9.97) have to be carried out. An accurate explicit method can therefore be used to start off the iteration scheme. The explicit method is called the predictor and the implicit method is called the corrector. For example, we can have the following predictor-corrector pair:
Predictor:   yn+4 = yn+3 + (h/24)(55fn+3 − 59fn+2 + 37fn+1 − 9fn)

Corrector:   yn+4 = yn+3 + (h/24)(9fn+4 + 19fn+3 − 5fn+2 + fn+1)
Application of the predictor-corrector method:

Predict with:

ỹn+4 = yn+3 + (h/24)(55fn+3 − 59fn+2 + 37fn+1 − 9fn)          (9.98)

Correct the prediction by using the predicted value:

yn+4 = yn+3 + (h/24)(9f̃n+4 + 19fn+3 − 5fn+2 + fn+1)          (9.99)

where f̃n+4 = f(xn+4, ỹn+4). In this case the difference yn+4 − ỹn+4, which is proportional to h⁵y^(5)(xn), provides an estimate of the local truncation error.
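A minimal sketch of this predictor-corrector pair (written by us for illustration; the test problem y′ = −y and the use of exact values for the four starting points are assumptions, not part of the method) is given below:

    import math

    f = lambda x, y: -y                      # illustrative test problem (an assumption)
    h = 0.1
    x = [j*h for j in range(4)]
    y = [math.exp(-xj) for xj in x]          # four starting values from the exact solution
    for n in range(40):
        fv = [f(x[n + j], y[n + j]) for j in range(4)]                        # f_n, ..., f_{n+3}
        ypred = y[n + 3] + h/24*(55*fv[3] - 59*fv[2] + 37*fv[1] - 9*fv[0])    # predictor (9.98)
        fpred = f(x[n + 3] + h, ypred)                                        # f at predicted value
        ycorr = y[n + 3] + h/24*(9*fpred + 19*fv[3] - 5*fv[2] + fv[1])        # corrector (9.99)
        x.append(x[n + 3] + h)
        y.append(ycorr)
    print(y[-1], math.exp(-x[-1]))           # computed and exact values at the final point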

9.11.1 Exercises

Solve the problems of Section 7.6.3 of [3].

9.12 Some concluding comments


9.12.1 Some Limitations of Numerical Methods

Situations in which numerical solutions may run into difficulty:

1. Asymptotes may exist in the solution domain where the solution becomes unbounded. For example, y′(x) = 2xy², y(0) = 1 (see the sketch after this list).

2. The solution may be sensitive to initial conditions in the original problem. For example, y′(x) = −ay(1 − y), y(0) = 1 + ε.

3. Nearly all numerical solutions to initial value problems degenerate at long times: we cannot integrate to t = ∞.

4. Stability of the methods for different choices of stepsize, and the effect of the algorithm on small errors in the data. For example, y′(x) = ry, y(0) = y0 + δ0, where δ0 is the initial error.
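For instance, the first example above has exact solution y(x) = 1/(1 − x²), which is unbounded as x → 1. The sketch below (Euler's method with a fixed step, our own illustration) marches straight past the asymptote and returns finite but meaningless numbers there:

    f = lambda x, y: 2*x*y*y       # y' = 2 x y^2, y(0) = 1; exact solution 1/(1 - x^2)
    h, y = 0.02, 1.0
    for n in range(60):            # integrate from x = 0 towards x = 1.2
        y += h*f(n*h, y)
        if (n + 1) % 10 == 0:      # report every 0.2
            xr = (n + 1)*h
            # the exact solution has blown up by x = 1
            exact = 1.0/(1.0 - xr*xr) if xr < 1.0 else float("inf")
            print(round(xr, 2), y, exact)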

More Methods, Numerical Software & Options
1. Other methods exist.

(a) Extension of these methods to systems of equations


(b) Method of Collocation. See sections 7.7 and 7.8.3 of [3]
(c) Shooting method for solving higher order equations. See section 7.8.2 of [3]

2. Programmable calculators, spreadsheets etc.

(a) Good points: Cheap, quick to experiment with


(b) Bad points: accuracy, connecting solution with other programs

3. Mathematical software packages: Examples: Matlab, Mathematica, etc

(a) Good points: quality assurance, interface, simple language to solve a problem,
in built graphics
(b) Bad points: not free, connecting solution with applications

4. Subroutine libraries: Examples: NAG, IMSL, etc.

   (a) Good points: quality assurance, can include in other programs, high accuracy and control
   (b) Bad points: not free, relatively complex to use (need to be able to program competently), often need other graphics subroutines

5. Published code, available in your favourite programming language, etc.

   (a) Numerical Recipes in C, Fortran, · · ·
   (b) Free downloadable shareware
       i. http://www.netlib.org/
       ii. http://gams.nist.gov/

6. Write your own code in your favourite language.

Bibliography

[1] S. D. Conte, C. de Boor, Elementary Numerical Analysis, McGraw-Hill, New York, 1980.

[2] G. H. Golub, C. Van Loan, Matrix Computations, Johns Hopkins University Press, 1986.

[3] Lee W. Johnson, R. Dean Riess, Numerical Analysis, Addison-Wesley, 1982.

[4] M. J. D. Powell, Approximation Theory and Methods, Cambridge University Press, 1981.

[5] C. Hastings, Jr., Approximations for Digital Computers, Princeton University Press, 1955.

[6] N. I. Achiezer (Akhiezer), Theory of Approximation, translated by Charles J. Hyman, Frederick Ungar Publishing Co., New York, 1956.

[7] William Edmund Milne, Numerical Solution of Differential Equations, Chapman and Hall, London, 1953.

