
Numerical Analysis - An Introduction

Claus Führer, Achim Schroll


Numerical Analysis
Center for Mathematical Sciences
Lund University

3rd Edition, 2001


Contents

1 Interpolation and Curve Design 1


1.1 Some Definitions and Notations . . . . . . . . . . . . . . . . . . . 1
1.2 Polynomial Spaces and Interpolation . . . . . . . . . . . . . . . . 3
1.2.1 Lagrange Polynomials . . . . . . . . . . . . . . . . . . . . 5
1.2.2 Newton Interpolation Polynomials. . . . . . . . . . . . . . 6
1.2.3 Interpolation Error. . . . . . . . . . . . . . . . . . . . . . . 8
1.2.4 Polynomial Interpolation in MATLAB . . . . . . . . . . . 10
1.2.5 Bernstein Polynomials . . . . . . . . . . . . . . . . . . . . 11
1.3 Bézier Curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.3.1 Some notations and definitions . . . . . . . . . . . . . . . 13
1.3.2 de Casteljau Algorithm . . . . . . . . . . . . . . . . . . . . 15
1.3.3 Bézier curves and Bernstein polynomials . . . . . . . . . . 17
1.3.4 Chebyshev Polynomials . . . . . . . . . . . . . . . . . . . 19
1.3.5 Three-term recursion and orthogonal polynomials . . . . . 23
1.4 Quadrature formulas . . . . . . . . . . . . . . . . . . . . . . . . . 26
1.4.1 Quadrature in MATLAB . . . . . . . . . . . . . . . . . . . 29
1.4.2 Gauss Quadrature . . . . . . . . . . . . . . . . . . . . . . 29
1.5 Piecewise Polynomials and Splines . . . . . . . . . . . . . . . . . . 32
1.5.1 Minimal Property of Cubic Splines . . . . . . . . . . . . . 35
1.5.2 B-Splines . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

2 Linear Systems 41
2.1 Regular Linear Systems . . . . . . . . . . . . . . . . . . . . . . . 42
2.1.1 LU Decomposition . . . . . . . . . . . . . . . . . . . . . . 42
2.1.2 Matrix Norms, Inner Products and Condition Numbers . . 48
2.2 Nonsquare Linear Systems . . . . . . . . . . . . . . . . . . . . . . 52
2.2.1 Projections . . . . . . . . . . . . . . . . . . . . . . . . . . 54
2.2.2 Condition of Least Squares Problems . . . . . . . . . . . . 55
2.2.3 Orthogonal factorizations . . . . . . . . . . . . . . . . . . 56
2.2.4 Householder Reflections and Givens Rotations . . . . . . . 58
2.2.5 Rank Deficient Least Squares Problems . . . . . . . . . . . 60


3 Signal Processing 63
3.1 Discrete Fourier Transformation . . . . . . . . . . . . . . . . . . . 63

4 Iterative Methods 73
4.1 Computation of Eigenvalues . . . . . . . . . . . . . . . . . . . . . 74
4.1.1 Power iteration . . . . . . . . . . . . . . . . . . . . . . . . 75
4.2 Fixed Point Iteration . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.3 Newton’s Method . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.3.1 Numerical Computation of Jacobians . . . . . . . . . . . . 84
4.3.2 Simplified Newton Method . . . . . . . . . . . . . . . . . . 85
4.4 Continuation Methods in Equilibrium Computation . . . . . . . . 87
4.5 Gauß-Newton method . . . . . . . . . . . . . . . . . . . . . . . . 91
4.6 Iterative Methods for Linear Systems . . . . . . . . . . . . . . . . 92

5 Ordinary Differential Equations 95


5.1 Differential Equations of Higher Order . . . . . . . . . . . . . . . 96
5.2 The Explicit Euler Method . . . . . . . . . . . . . . . . . . . . . . 96
5.2.1 Derivation of the Explicit Euler Method . . . . . . . . . . 97
5.2.2 Graphical Illustration of the Explicit Euler Method . . . . 97
5.2.3 Two Alternatives to Derive Euler’s Method . . . . . . . . . 98
5.2.4 Testing Euler’s Method . . . . . . . . . . . . . . . . . . . . 99
5.3 Stability Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.4 Local, Global Errors and Convergence . . . . . . . . . . . . . . . 101
5.5 Stiffness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.6 The Implicit Euler Method . . . . . . . . . . . . . . . . . . . . . . 104
5.6.1 Graphical Illustration of the Implicit Euler Scheme . . . . 106
5.6.2 Stability Analysis . . . . . . . . . . . . . . . . . . . . . . . 106
5.6.3 Testing the Implicit Euler Method . . . . . . . . . . . . . . 107
5.7 Multistep Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.7.1 Adams Methods . . . . . . . . . . . . . . . . . . . . . . . . 109
5.7.2 Backward Differentiation Formulas (BDF) . . . . . . . . . 112
5.7.3 Solving the Corrector Equations . . . . . . . . . . . . . . 113
5.7.4 Order Selection and Starting a Multistep Method . . . . . 114
5.8 Explicit Runge–Kutta Methods . . . . . . . . . . . . . . . . . . . 115
5.8.1 The Order of a Runge–Kutta Method . . . . . . . . . . . . 117
5.8.2 Embedded Methods for Error Estimation . . . . . . . . . . 118
5.8.3 Stability of Runge–Kutta Methods . . . . . . . . . . . . . 120
Preface
The basic education in mathematics at LTH ends with an introductory course in Numerical Analysis. Many mathematical problems have been introduced, and methods to solve some of them exactly (or analytically) have been studied. For centuries mathematics focused on finding exact solutions to mathematically formulated problems in science and engineering. With the appearance of first mechanical and then electronic computing devices, the interest in approximate solutions to problems which cannot be solved exactly increased drastically, and so-called numerical methods were developed. "Numerical" means in this context that already at an early stage the manipulation of algebraic expressions is replaced by computations with numbers. A function is replaced by an algorithm, which evaluates the function for a given numeric argument.

In this introductory course we will present some of the most important computational methods and analyze them to see how accurate and how robust the results they generate may be. That is why the course is called Numerical Analysis.

The numerical part will demand some programming skills from you, which we try to keep to a minimum by using MATLAB as a toolbox for performing numerical experiments. In this part you will miss the classical "paper-and-pencil" way of working in mathematics. The analysis part demands good knowledge of calculus and linear algebra. Finally, the course has a strong engineering science part, where we link the field to important problems in applications.

These lecture notes (this booklet) cover the analytical part. We have tried to use a mathematical language and to avoid presenting the material as just a collection of "recipes". The collection of computer assignments and the final, project-related exam cover the algorithmic and engineering parts. All three parts are important and interact permanently.

Though we try to motivate all methods by practical examples, the importance of some techniques may become clear to you only at later stages of your engineering education.

This course is the starting point for a series of courses in numerical analysis on a more advanced level. It should also be complemented by courses in applied mathematics, signal processing and control theory.

Be aware that lecture notes are not a textbook. You will therefore find references to other literature in the text. It will help you a lot to look at some of these references and also at classical textbooks.
Claus Führer & Achim Schroll, Lund, October 2001

Notational conventions

In this manuscript we apply (hopefully consistently) the following conventions:

• indices, integer numbers: small Latin letters in the range i, . . . , n (cf. the classical Fortran convention)

• scalar numbers: Greek letters, e.g. α, β, . . .

• vectors: small Latin letters, mainly from the end of the alphabet, e.g. u, v, w

• matrices: capital Latin letters, e.g. A, B, . . .

• identity and zero matrix: I and 0 (sometimes with the dimension as a subscript)

• matrix elements: Aij or ai,j

• linear spaces: calligraphic letters, e.g. C

• norms: ‖ · ‖ with the type of norm as a subscript (if necessary)

• absolute value (modulus): | · |

Topics

Topic                                                      Course   Hours
Polynomial Interpolation                                   all      4
Bézier Curves                                              D        2
Spline Interpolation                                       all      4
Bézier Splines                                             D        2
Norms, Stability, Condition                                all      2
Linear Systems of Equations                                all      2
Least Squares (Data Fitting)                               all      2
Orthogonal Factorization                                   I        2
Numerical Signal Processing: FFT                           all      4
Nonlinear Systems: Fixed Point Problems                    all      2
Nonlinear Systems: Newton Iteration                        all      2
Nonlinear Data Fitting: Gauss-Newton                       I        2
Ordinary Differential Equations: Initial Value Problems    all      8
Ordinary Differential Equations: Boundary Value Problems   F,I,K    4
Basics of Parameter Estimation                             I        2

Chapter 1

Interpolation and Curve Design

Before reading this chapter, make sure that you are familiar with the basics of Linear Algebra. Reread Chapter 6.2 in [Spa94].

1.1 Some Definitions and Notations


Interpolation Interpolating data is one of the most important topics in NA.
It is used to obtain a “handy” functional description instead of just the raw data
obtained from measurements, observations etc.
A functional description has several purposes:
• Compressing the amount of data

• Giving information about values not covered by measurements, i.e. interpolation and extrapolation

• Speeding up evaluation: function evaluation is often faster than table look-ups


On the other hand interpolation is also the basis of many other more advanced
methods in NA.
The situation is the following:

Definition 1
Given data points (ti, yi), i = 1, . . . , n, a function f is said to interpolate these data if

    f(ti) = yi,   i = 1, . . . , n.

You might think of the independent variable ti as a time point, while yi denotes the measurement at that point.
Other examples are pressure versus temperature, current versus voltage, etc.
There can be several measurements at a given time ti; thus yi might be a vector with several components.


An interpolating function (or interpolant) is often sought in the set of polynomials, piecewise polynomials, trigonometric polynomials, rational functions or piecewise rational functions. In this course we will not treat interpolation by rational and piecewise rational functions.
Sometimes the interpolation requirements put more restrictions on the function than desired. If we relax the interpolation conditions by only asking for the residuals

    f(ti) − yi = ri

to be small or even minimal in some sense, then we speak about approximation instead of interpolation. To this end we have to discuss what we mean by "small" or "minimal", and we have to introduce norms and inner products. This will be the topic of a later chapter.
Another important topic in this context is curve design. Curve design is the task performed by many modern drawing programs like FREEHAND, COREL DRAW, etc. It is also the basis of font generation, for example in METAFONT and in the POSTSCRIPT language.

Parametric and Non Parametric Curves, Graphs  Functions relate independent variables t in a unique way to dependent variables y ∈ R^n.
The graph of a function is the set

    Γ_f := { (t, f(t))^T | t ∈ [t0, te] },

and (ti, f(ti))^T is a particular point of that graph. Note that the independent variable t is always the first component of the points if we consider the graph of a function; t parameterizes the graph, and the resulting curve is called a non parametric curve.

General curves are parametric, i.e. the describing parameter has to be given separately and is not a component of the points. The graph of a parametric 2D-curve has the form

    Γ := { (f1(t), f2(t))^T | t ∈ [t0, te] },

where f1 and f2 are functions over the given interval. A given graph can have different parameterizations, i.e.

    Γ := { (f1(t), f2(t))^T | t ∈ [t0, te] } = { (φ1(τ), φ2(τ))^T | τ ∈ [τ0, τe] }.

An example of the graph of a non parametric curve is the plot of the sine function, while the plot of the letter "S" in a given font is an example of a parametric curve.
In this course we will consider only 2D-graphs, together with interpolation, approximation and curve design.

1.2 Polynomial Spaces and Interpolation


We call a function

    p(t) = an t^n + an−1 t^{n−1} + · · · + a1 t + a0                    (1.1)

a polynomial of degree n. The ai ∈ R^k are its coefficients. Mostly we will consider scalar polynomials, i.e. k = 1.
The set of all polynomials of degree at most n is denoted by Pn.
We note that Pn is a linear space over R, which has dimension n + 1.
A basis can easily be given:

Theorem 2  The monomials 1, t, t^2, . . . , t^n form a basis of Pn.

While a general function is uniquely defined only by infinitely many values (all its function values), a polynomial is uniquely given by n + 1 coefficients. Thus a common task in Numerical Analysis is to approximate a given function by a polynomial and then to describe it by its coefficients.
We turn now to the interpolation task, which asks us to determine the coefficients ai of a polynomial p(t) which interpolates the points (tj, yj), j = 0, . . . , k. Just by comparing the amount of given information with the number of unknowns, we expect that we have to seek a polynomial of degree n = k.
Writing down the interpolation conditions

    p(ti) = yi                                                        (1.2)

and rearranging the equations in matrix-vector form, we obtain a square linear system of equations

    \underbrace{\begin{pmatrix}
    t_0^n & t_0^{n-1} & \cdots & t_0 & 1 \\
    t_1^n & t_1^{n-1} & \cdots & t_1 & 1 \\
          &           & \vdots &     &   \\
    t_n^n & t_n^{n-1} & \cdots & t_n & 1
    \end{pmatrix}}_{=:A}
    \underbrace{\begin{pmatrix} a_n \\ a_{n-1} \\ \vdots \\ a_0 \end{pmatrix}}_{=:x}
    =
    \underbrace{\begin{pmatrix} y_0 \\ y_1 \\ \vdots \\ y_n \end{pmatrix}}_{=:b}        (1.3)

which has to be solved for the unknown coefficients ai.


We will learn later in this course (cf. Ch. 2) how algorithms for solving these kinds of systems are constructed. Here we just use the corresponding MATLAB command:

x = A \ b
and find the ai as components of the solution vector x.
The interesting question in this context is the solvability of the system, which is
answered by the following

Theorem 3 (Unisolvence Theorem)
Given n + 1 data points (ti, yi) with mutually distinct ti, there is a unique polynomial p ∈ Pn of (maximal) degree n which solves the interpolation task

    p(ti) = yi,   i = 0, . . . , n.

(The proof is based on contradiction.)


Thus we have to require measurements at distinct time points ti in order to obtain a unique solution. In that case the matrix A is regular and the linear system has a unique solution.

Definition 4  The coefficient matrix

    A := \begin{pmatrix}
    t_0^n & t_0^{n-1} & \cdots & t_0 & 1 \\
    t_1^n & t_1^{n-1} & \cdots & t_1 & 1 \\
          &           & \vdots &     &   \\
    t_n^n & t_n^{n-1} & \cdots & t_n & 1
    \end{pmatrix}

of the linear system is called a Vandermonde matrix.¹

This matrix can be generated in MATLAB by the command vander.
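As a minimal sketch of this approach (the data values below are hypothetical, chosen only for illustration), the monomial coefficients can be obtained directly from the Vandermonde system and evaluated with polyval:

ti = [0 1 2 3];                 % interpolation abscissae (example values)
yi = [1 2 -1 -2];               % corresponding data values
A  = vander(ti);                % Vandermonde matrix with columns t.^n, ..., t.^0
x  = A \ yi';                   % coefficients a_n, ..., a_0 of the interpolant
t  = linspace(0, 3, 100);
plot(t, polyval(x, t), ti, yi, 'o')   % polyval expects this coefficient ordering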


Once the coefficients ai are determined, we can evaluate the polynomial at different points. The cheapest way of evaluating it is to apply Horner's rule, which is sometimes also called the method of nested multiplications. To this end we rewrite the polynomial in the following form:

    p(t) = (· · · ((an t + an−1)t + an−2)t + · · · + a1)t + a0.

Evaluating p by this formula requires n multiplications, while evaluating the polynomial in its standard form (computing each power separately) would require n(n + 1)/2 multiplications.
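A minimal sketch of Horner's rule (the coefficient ordering an, . . . , a0 matches the Vandermonde solution above; MATLAB's built-in polyval provides the same functionality):

function p = horner(a, t)
% HORNER  evaluate a(1)*t^n + ... + a(n)*t + a(n+1) by nested multiplications
p = a(1);
for k = 2:length(a)
    p = p.*t + a(k);     % one multiplication and one addition per coefficient
end
end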
While determining an interpolation polynomial using monomials and solving the Vandermonde system is quite easy, it is numerically not the best thing to do. The entries of that matrix may differ widely in magnitude already for rather small n, which leads to a badly scaled matrix, and a lot of input information might be lost during the process of solving this system. Furthermore, adding (or removing) measurements requires that the entire solution process be redone, which in some applications might be computationally too expensive even for small n.

¹For biographical notes on most of the mathematicians named in this manuscript check http://www-groups.dcs.st-and.ac.uk/history/Mathematicians

So we now have to answer two questions:

• Are there alternative basis representations of Pn which allow us to compute the coefficients in a cheaper and more robust way?

• What is the typical degree of the interpolation polynomials we have to deal with?

Answering the first question leads us to the Lagrange and Newton bases of Pn.

1.2.1 Lagrange Polynomials


Let us consider a generic interpolation task and, for a given value j ∈ {0, . . . , n}, seek a polynomial interpolating the data

    (ti, δij)   with   δij = 1 for i = j,  δij = 0 for i ≠ j   (Kronecker symbol)

and i = 0, . . . , n.
It can easily be checked that the polynomial

    L_j^n(t) = ∏_{i=0, i≠j}^{n} (t − ti)/(tj − ti)                    (1.4)

performs this task.
Example 5  For n = 2 the Lagrange polynomials are

    L_0^2(t) = (t − t1)(t − t2) / ((t0 − t1)(t0 − t2)),
    L_1^2(t) = (t − t0)(t − t2) / ((t1 − t0)(t1 − t2)),
    L_2^2(t) = (t − t0)(t − t1) / ((t2 − t0)(t2 − t1)).

For n = 4 and n = 10 some Lagrange polynomials are depicted in Fig. 1.1.
Definition 6  The n + 1 polynomials L_j^n ∈ Pn, j = 0, . . . , n, are called Lagrange polynomials.

Theorem 7  The n + 1 Lagrange polynomials form a basis of Pn.

These polynomials are constructed in such a way that the interpolation task can be solved without solving any linear system. The interpolation polynomial is just a linear combination of the Lagrange polynomials with the measurements yi as coefficients:

    p(t) = Σ_{i=0}^{n} yi L_i^n(t).

(Check that p indeed interpolates the data!)
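A minimal sketch that evaluates this Lagrange form directly (ti, yi are the data vectors and t the evaluation points; for large n a barycentric variant would be preferable, but the code below follows (1.4) literally):

function p = lagrange_eval(ti, yi, t)
% LAGRANGE_EVAL  evaluate the interpolation polynomial in Lagrange form
n = length(ti);
p = zeros(size(t));
for j = 1:n
    Lj = ones(size(t));                       % build L_j^n(t) as a product
    for i = [1:j-1, j+1:n]
        Lj = Lj .* (t - ti(i)) / (ti(j) - ti(i));
    end
    p = p + yi(j) * Lj;                       % p(t) = sum_j y_j L_j^n(t)
end
end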

Figure 1.1: Lagrange polynomials (examples for n = 4 and n = 10)

1.2.2 Newton Interpolation Polynomials.

When adding or removing measurement points, the Lagrange polynomials have to be recomputed. The idea behind Newton's interpolation formula is to build up the interpolation polynomial by successively introducing the measurements and by constructing polynomials of increasing degree from previously computed lower degree polynomials.
To this end we write a general nth degree polynomial in the following way:

    p(t) = c0 + c1(t − t0) + c2(t − t0)(t − t1) + · · · + cn(t − t0)(t − t1) · · · (t − tn−1),   (1.5)

where the ci are the coefficients which have to be determined according to the interpolation task.
Thus, the polynomials are represented in the basis

    { ω_j(t) = ∏_{k=0}^{j−1} (t − tk) | j = 0, . . . , n },               (1.6)

with ω_0(t) := 1.
When comparing this representation to the monomial formulation (1.1), one recognizes that the coefficients in front of the highest degree basis functions are the same, i.e. cn = an. This fact will be used later, when designing an algorithm for computing the ci.
Note that ω_j(ti) = 0 for all i < j. The coefficients ci of the interpolation polynomial are then given as the solution of the following lower triangular system:

    \begin{pmatrix}
    \omega_0(t_0) & 0 & 0 & \cdots & 0 & 0 \\
    \omega_0(t_1) & \omega_1(t_1) & 0 & \cdots & 0 & 0 \\
    \omega_0(t_2) & \omega_1(t_2) & \omega_2(t_2) & \cdots & 0 & 0 \\
    \vdots & & & & & \vdots \\
    \omega_0(t_{n-1}) & \omega_1(t_{n-1}) & \omega_2(t_{n-1}) & \cdots & \omega_{n-1}(t_{n-1}) & 0 \\
    \omega_0(t_n) & \omega_1(t_n) & \omega_2(t_n) & \cdots & \omega_{n-1}(t_n) & \omega_n(t_n)
    \end{pmatrix}
    \begin{pmatrix} c_0 \\ c_1 \\ c_2 \\ \vdots \\ c_{n-1} \\ c_n \end{pmatrix}
    =
    \begin{pmatrix} y_0 \\ y_1 \\ y_2 \\ \vdots \\ y_{n-1} \\ y_n \end{pmatrix}        (1.7)
Such a triangular system can be solved by an easy recursion formula:

• Solve the first equation for c0

• Use this value and solve the next equation for c1

• and so on.

This recursive procedure can also be expressed as a recursion of successive interpolation polynomials:
Let us assume that pj−1 ∈ Pj−1 interpolates the first j data points. Then the polynomial which interpolates the first j + 1 data points takes the form

    pj(t) = pj−1(t) + cj ω_j(t).                                       (1.8)

From the jth row in Eq. (1.7) one concludes

    cj = (yj − pj−1(tj)) / ω_j(tj).

This can be generalized to the composition of two interpolation polynomials of degree j − 1 into one of degree j:

Theorem 8 (Lemma of Aitken)
Let us denote by p(f |t1, . . . , tj) ∈ Pj−1 the polynomial which interpolates (ti, yi := f(ti)), i = 1, . . . , j.
Then the polynomial which interpolates (ti, yi := f(ti)), i = 0, . . . , j, is given by

    p(f |t0, . . . , tj)(t) = [ (t0 − t) p(f |t1, . . . , tj)(t) − (tj − t) p(f |t0, . . . , tj−1)(t) ] / (t0 − tj).

Definition 9  We denote by p(f |t0, . . . , tj−1) ∈ Pj−1 the polynomial which interpolates (ti, yi := f(ti)), i = 0, . . . , j − 1.
Its leading coefficient, i.e. the coefficient in front of t^{j−1} in its monomial representation, is correspondingly denoted by

    f[t0, . . . , tj−1].

These coefficients are called divided differences.

Note that by this definition f[ti] = f(ti).


An ultimate consequence of Aitken’s Lemma (Th. 8) is the recursion formula for
the divided differences, which is also the reason for their name:

f [t1 , . . . , tj ] − f [t0 , . . . , tj−1 ]


f [t0 , . . . , tj ] = . (1.9)
tj − t0

Furthermore we conclude from Eq. (1.8)

p(f |t0 , . . . , tj )(t) = f [t0 ]ω 0 (t) + f [t0 , t1 ]ω 1 (t) + . . . + f [t0 , . . . , tj ]ω j (t). (1.10)

Thus the coefficients cj of the interpolation polynomial are given by the divided differences, which can easily be computed recursively, column by column:

    k = 0:   f[t0],  f[t1],  . . . ,  f[tn]
    k = 1:   f[t0, t1],  f[t1, t2],  . . . ,  f[tn−1, tn]
    k = 2:   f[t0, t1, t2],  . . . ,  f[tn−2, tn−1, tn]
      ...
    k = n:   f[t0, . . . , tn]

The first entry of each column, f[t0], f[t0, t1], . . . , f[t0, . . . , tn], gives the coefficients c0, . . . , cn.

Note that the MATLAB command diff is a useful tool for forming differences of vector components.
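A minimal sketch of this scheme (ti, yi are the data vectors; the column of the table is overwritten in place, so that c finally contains f[t0], f[t0, t1], . . . , f[t0, . . . , tn]):

function c = divdiff(ti, yi)
% DIVDIFF  Newton coefficients c_j = f[t_0,...,t_j] via divided differences
ti = ti(:);  c = yi(:);                    % work with column vectors
n  = length(ti);
for k = 1:n-1
    % k-th order differences, computed in place; the top k entries stay untouched
    c(k+1:n) = diff(c(k:n)) ./ (ti(k+1:n) - ti(1:n-k));
end
end

The Newton form (1.5) can then be evaluated with a Horner-like recursion: p = c(n); for k = n-1:-1:1, p = p.*(t - ti(k)) + c(k); end.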

1.2.3 Interpolation Error.


When interpolating a function we are often interested in estimating the interpolation error, i.e. the quantity

    r(t) := f(t) − p(f |t0, . . . , tn)(t).



Theorem 10  Let f ∈ C^{n+1}(a, b), let p(f |t0, . . . , tn) be the polynomial interpolating the points (ti, f(ti)), i = 0, . . . , n, and denote by I(t0, . . . , tn, t) the smallest interval containing t0, . . . , tn and t.
Then for every t ∈ (a, b) there exists a ξ ∈ I(t0, . . . , tn, t) such that

    r(t) = f^{(n+1)}(ξ) ω_{n+1}(t) / (n + 1)!                            (1.11)

holds, with ω_{n+1}(t) = (t − t0)(t − t1) · · · (t − tn) as in (1.6).

Before we prove this theorem we discuss its consequences. The error is essentially composed of two factors, one depending on the function f and another depending on ω_{n+1}(t) and consequently on the location of the ti and t. If the function f and the polynomial degree are given, the only parameter which can be influenced to decrease the error is the location of the ti. An equidistant grid of ti's is not optimal. An optimal placement of the interpolation points will be discussed in Section 1.3.4, where Chebyshev polynomials are introduced.
If t lies outside I(t0, . . . , tn) we speak about extrapolation. Extrapolation is often used for predicting the behavior of a process, e.g. the development of an investment fund. As can be seen from the exercises, the Newton interpolation polynomials, and with them the interpolation error, grow rapidly outside the data interval. Extrapolation is therefore a numerically dangerous process.
Proof (of Theorem 10):
We fix a t̄ ≠ ti, set F(t) := r(t) − K ω_{n+1}(t) and determine K so that F(t̄) = 0.
Then F has at least n + 2 zeros in I(t0, t1, . . . , tn, t̄). Thus, by Rolle's theorem, F' has at least n + 1 zeros, F'' has at least n zeros, and finally F^{(n+1)} has at least one zero, say ξ ∈ I(t0, t1, . . . , tn, t̄).
As p^{(n+1)} ≡ 0 it follows that

    F^{(n+1)}(ξ) = f^{(n+1)}(ξ) − K (n + 1)! = 0.

Thus

    K = f^{(n+1)}(ξ) / (n + 1)!,

from which we obtain the expression for the error. ✷
So far we were interested in the error committed when interpolating a function f by a polynomial of fixed degree. Can we decrease the error by increasing the degree of the polynomial, i.e. by adding more and more interpolation points?
This question is discussed in the exercises (see Runge's phenomenon), from which we conclude that interpolation with high degree polynomials can lead to a highly oscillatory error behavior and large errors.

Example 11  Assume the sine function on [0, π/2] is interpolated by a fifth degree polynomial p on an equidistant grid. What are upper bounds for

• the error at t = 0.1,

• the error at t = (3/4)π (extrapolation!),

• the maximal error in the interval [0, π/2]?

The answers can be given directly by evaluating (1.11). Using the fact that |sin^{(6)}(t)| ≤ 1 we obtain

• |r(0.1)| ≤ (1/720) |ω_6(0.1)| ≤ (1/720) · 0.017 < 2.3 · 10^{−5}

• |r((3/4)π)| ≤ (1/720) |ω_6((3/4)π)| ≤ (1/720) · 10.15 < 1.5 · 10^{−2}

• max_{t∈[0,π/2]} |r(t)| ≤ (1/720) max_{t∈[0,π/2]} |ω_6(t)| ≤ 2.3 · 10^{−5}.

The corresponding function r(t) is plotted in Fig. 1.2.

Figure 1.2: Interpolation error: sine function interpolated by a fifth degree polynomial (the plot shows |r(t)| together with the interpolation points)
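A small sketch reproducing this experiment numerically (six equidistant interpolation points; the observed maximal error should stay below the bound 2.3 · 10^{−5} derived above):

ti = linspace(0, pi/2, 6);              % six equidistant interpolation points
yi = sin(ti);
coeff = polyfit(ti, yi, 5);             % interpolating polynomial of degree 5
t = linspace(0, pi/2, 1000);
r = sin(t) - polyval(coeff, t);         % interpolation error r(t)
max(abs(r))                             % compare with the bound 2.3e-5
plot(t, abs(r), ti, zeros(size(ti)), 'o')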

1.2.4 Polynomial Interpolation in MATLAB


MATLAB provides a pair of powerful commands for polynomial interpolation: polyfit and polyval.
polyfit takes the data points (ti, yi) as input and returns the polynomial coefficients ai. polyval takes the coefficients and the points where the polynomial should be evaluated as input and returns the values of the polynomial.

Example 12  To interpolate the data points (0, 1), (1, 2), (2, −1), (3, −2) by a third degree polynomial and to plot the result at 100 points in [0, 3], these two commands are used as follows:

ti=[0,1,2,3]; yi=[1,2,-1,-2];
coeff=polyfit(ti,yi,3);
p=polyval(coeff,linspace(0,3,100));
plot(linspace(0,3,100),p)

Note that the last parameter of polyfit is the degree of the desired polynomial. If you provide a number k < n − 1, where n is the number of data points, then polyfit returns the polynomial of degree k which fits the data points in the least squares sense. This polynomial will in general no longer interpolate the data. Polynomial data fitting is the topic of Sec. 2.2.
polyfit uses the Vandermonde approach.

1.2.5 Bernstein Polynomials


Another interesting way to represent polynomials is based on Bernstein polynomials. This representation is the basis for curve and surface design tools and leads to Bézier curves and splines. Many important algorithms in computer aided geometry and computer graphics are built on these concepts. First we will study the properties of Bernstein polynomials and their use for interpolation.
Consider the binomial formula

    1 = ((1 − t) + t)^n = Σ_{i=0}^{n} \binom{n}{i} (1 − t)^{n−i} t^i

with the binomial coefficients defined as

    \binom{n}{i} := n! / (i!(n − i)!)   and   0! := 1.

Note that every summand is in Pn([0, 1]).

Definition 13  The polynomials

    B_i^n(t) := \binom{n}{i} (1 − t)^{n−i} t^i

are called Bernstein polynomials.

From

    B_i^n(t) = \binom{n}{i} (1 − t)^{n−i} t^i
             = (1 − t) \binom{n−1}{i} (1 − t)^{n−1−i} t^i + t \binom{n−1}{i−1} (1 − t)^{n−i} t^{i−1}

we obtain a recursion formula for Bernstein polynomials,

    B_i^n(t) = (1 − t) B_i^{n−1}(t) + t B_{i−1}^{n−1}(t),

with B_0^0(t) = 1, and we set B_j^n(t) = 0 for j > n and j < 0.

Figure 1.3: Bernstein polynomials (the cubic case B_0^3, . . . , B_3^3)
We briefly summarize some properties of Bernstein polynomials:

1. Σ_{i=0}^{n} B_i^n(t) = 1

2. t = 0 is a root of B_i^n of multiplicity i

3. t = 1 is a root of B_i^n of multiplicity n − i

4. B_i^n(t) = B_{n−i}^n(1 − t)

5. B_i^n(t) ≥ 0 for t ∈ [0, 1]

6. B_i^n has exactly one maximum in [0, 1]

7. {B_i^n, i = 0, . . . , n} is a basis of Pn([0, 1])

Due to the last property we can write every polynomial in Pn([tmin, tmax]) as a linear combination of Bernstein polynomials:

    p(t) = Σ_{i=0}^{n} bi B̃_i^n(t),

where the bi are called Bézier points and

    B̃_i^n(t) := B_i^n( (t − tmin)/(tmax − tmin) ),

with tmin := min_i ti and tmax := max_i ti.


The Bézier points for the interpolating polynomial are given as the solution of
the linear system
    
0n (t0 ) B
B 1n (t0 ) · · · B
nn (t0 ) b0 y0
 .. .. ..   ..  =  ..  .
 . . .  .   . 
 n  n 
B0 (tn ) B1 (tn ) · · · Bn (tn )
n bn yn

Note, the entries of the governing matrix are all in [0, 1] by construction.
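A minimal sketch of this construction (ti, yi are hypothetical example data as in Example 12; the function bernstein, to be put in a file bernstein.m, evaluates B_i^n via the recursion above, and the Bézier points solve the collocation system just stated):

function B = bernstein(n, i, t)
% BERNSTEIN  evaluate B_i^n(t) via B_i^n = (1-t) B_i^{n-1} + t B_{i-1}^{n-1}
if i < 0 || i > n
    B = zeros(size(t));
elseif n == 0
    B = ones(size(t));
else
    B = (1-t).*bernstein(n-1, i, t) + t.*bernstein(n-1, i-1, t);
end
end

ti = [0 1 2 3];  yi = [1 2 -1 -2];  n = 3;        % example data
s  = (ti - min(ti)) / (max(ti) - min(ti));        % map the abscissae to [0,1]
M  = zeros(n+1);
for i = 0:n
    M(:, i+1) = bernstein(n, i, s(:));            % column i+1 holds B_i^n at the data sites
end
b  = M \ yi';                                     % Bezier points b_0, ..., b_n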

1.3 Bézier Curves


We now leave the interpolation topic (for a while) and study the principal ideas of how to use polynomials and splines for curve design². Recall the definition of a parametric and non parametric curve, cf. p. 2.
We construct curves by iterated linear interpolation. To this end, we first need some definitions and conventions.

1.3.1 Some notations and definitions


Definition 14
We consider barycentric combinations of points:

    b = Σ_{i=0}^{n} αi bi   with   bi ∈ E², αi ∈ R   and   Σ_{i=0}^{n} αi = 1,

where E² is the space of all points in R² (to be exact: E² is an affine linear space over R²).

Special cases of barycentric combinations are convex combinations, where αi ≥ 0.
All points c ∈ E² which can be written as a convex combination of a given set of points bi ∈ E² form the convex hull of the set {bi, i = 0, . . . , n}.

Definition 15
A map Φ : E² → E² is called an affine map if it leaves barycentric combinations invariant, i.e.

    b = Σ_{i=0}^{n} αi bi   ⇒   Φ(b) = Σ_{i=0}^{n} αi Φ(bi).

²Much more detail on this topic can be found in [Far88].

Properties like "being the midpoint" of a line are kept invariant.
In coordinates, an affine map can be written as

    Φ(b) = Ab + v,

where b denotes the coordinate vector of a point in E², A is a 2 × 2 matrix and v the coordinate vector of a translation.
Here are some examples of affine maps:

Example 16

• The identity: A = I, v = 0

• Scaling: v = 0, A diagonal

• Rotation: v = 0 and

    A = \begin{pmatrix} \cos α & −\sin α \\ \sin α & \cos α \end{pmatrix}          (1.12)

• Translation: A = I, v arbitrary

• Shearing: v = 0 and

    A = \begin{pmatrix} 1 & α \\ 0 & 1 \end{pmatrix}

Maps which leave angles and lengths unchanged are called orthogonal maps (or rigid body motions). They are characterized by A^T A = I.
The set

    {c | c = (1 − t)a + tb, t ∈ R} ⊂ E²

is called the straight line through a and b. All its points are obtained by barycentric combinations of a and b.
Note that a straight line can be viewed as the result of an affine map applied to the real axis. In particular, the interval [0, 1] is mapped to the line segment [a, b].
We call α = (1 − t) and β = t the barycentric coordinates of the point c = αa + βb, and c(t) a linear interpolation of a and b.
Linear interpolation is affine invariant, i.e.

    Φ(αa + βb) = Φ(c) = αΦ(a) + βΦ(b).

The points a, b, c are called collinear if they are related by linear interpolation.
Given three collinear points we note

    α = vol1(c, b)/vol1(a, b),   β = vol1(a, c)/vol1(a, b),

where vol1(a, c) denotes the signed distance between a and c.


The ratios
vol1 (c, b) β
ratio(a, c, b) := =
vol1 (a, c) α
are evidently affine invariant, i.e. proportions are kept invariant.

Definition 17
A sequence of straight lines, where each segment interpolates two given points
bi , bi+1 is called a polygon or a piecewise linear interpolant of b0 , b1 , . . . , bN .

1.3.2 de Casteljau Algorithm


After having defined linear interpolation, we now apply repeated linear interpolation to obtain higher degree polynomials. This is the basis of the de Casteljau algorithm (1959):
Let b0, b1, b2 ∈ E², t ∈ R.
We define

    b_0^1(t) := (1 − t) b0 + t b1,
    b_1^1(t) := (1 − t) b1 + t b2,

which gives a polygon through b0, b1, b2.


We continue linear interpolation by defining

b20 (t) = (1 − t)b10 (t) + tb11 (t)

which is a parabola

b20 (t) = (1 − t)2 b0 + 2t(1 − t)b1 + t2 b2

Note, the special representation of the parabola in terms of the (basis) functions

(1 − t)2 , 2t(1 − t), t2

which are the summands of the binomial expansions of

1 = ((1 − t) + t)2 .

This parabola is constructed via barycentric combinations, thus

ratio(b0 , b10 (t), b1 ) = ratio(b1 , b11 (t), b2 ) = ratio(b10 (t), b20 (t), b11 (t))
t
= ratio(0, t, 1) =
1−t
Figure 1.4: de Casteljau algorithm (control points b0, b1, b2, intermediate points b_0^1(t), b_1^1(t) and the curve point b_0^2(t))

We can generalize this construction principle to generate higher degree polynomials:

Given b0, b1, . . . , bn ∈ E².
Start with b_i^0(t) := bi, i = 0, . . . , n.
Recurse:

    b_i^r(t) := (1 − t) b_i^{r−1}(t) + t b_{i+1}^{r−1}(t)   for r = 1, . . . , n and i = 0, . . . , n − r.   (1.13)

Definition 18
The b_i^r(t) are called partial Bézier curves of degree r. They are controlled by the points bi, . . . , bi+r.
The final curve b^n(t) := b_0^n(t) ∈ Pn[0, 1] is called a Bézier curve, and the polygon defined by b0, b1, . . . , bn is called its control polygon, with control points bi.
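A minimal sketch of the recursion (1.13) (the control points are assumed to be stored as the columns of a 2 × (n+1) matrix b; the columns are overwritten in place, and the first column finally holds b_0^n(t)):

function p = decasteljau(b, t)
% DECASTELJAU  evaluate the Bezier curve with control points b (2 x (n+1)) at scalar t
n = size(b, 2) - 1;
for r = 1:n
    for i = 1:n-r+1
        b(:, i) = (1-t)*b(:, i) + t*b(:, i+1);   % b_i^r = (1-t) b_i^{r-1} + t b_{i+1}^{r-1}
    end
end
p = b(:, 1);                                     % b_0^n(t)
end

Calling decasteljau for, say, t = 0, 0.01, . . . , 1 and plotting the resulting points together with the control polygon reproduces pictures like Fig. 1.5.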

Properties of Bézier curves and polygons

• Bézier curves are affine invariant, i.e. applying Φ to the control points yields the same result as applying it to the complete Bézier curve.

• t ∈ [0, 1] lies in the convex hull of 0 and 1. Consequently, by the properties of barycentric combinations, b^n(t) lies in the convex hull of b0, b1, . . . , bn. An ideal property for curve design!

• Endpoint property: b^n(0) = b0 and b^n(1) = bn.

• Slopes and derivatives: (d^k/dt^k) b^n(0) is determined by bi, i = 0, . . . , k, and (d^k/dt^k) b^n(1) is determined by bn−i, i = 0, . . . , k.

1.3.3 Bézier curves and Bernstein polynomials


From page 15 we could already suspect a strong relationship between Bézier curves and Bernstein polynomials. Here is what can be said about that relationship:

Theorem 19
The partial Bézier curve b_i^r(t) can be written as

    b_i^r(t) = Σ_{j=0}^{r} b_{i+j} B_j^r(t),   r = 0, . . . , n,   i = 0, . . . , n − r.

In particular (set i = 0, r = n)

    b^n(t) = Σ_{j=0}^{n} bj B_j^n(t).

Proof: (Induction)

    b_i^r(t) = (1 − t) b_i^{r−1}(t) + t b_{i+1}^{r−1}(t)
             = (1 − t) Σ_{j=0}^{r−1} b_{i+j} B_j^{r−1}(t) + t Σ_{j=0}^{r−1} b_{i+1+j} B_j^{r−1}(t)
             = (1 − t) Σ_{j=i}^{i+r−1} b_j B_{j−i}^{r−1}(t) + t Σ_{j=i+1}^{i+r} b_j B_{j−i−1}^{r−1}(t)
             = (1 − t) Σ_{j=i}^{i+r} b_j B_{j−i}^{r−1}(t) + t Σ_{j=i}^{i+r} b_j B_{j−i−1}^{r−1}(t),

note that B_r^{r−1} = B_{−1}^{r−1} = 0 by construction.
Thus,

    b_i^r(t) = Σ_{j=i}^{i+r} b_j [ (1 − t) B_{j−i}^{r−1}(t) + t B_{j−i−1}^{r−1}(t) ]
             = Σ_{j=i}^{i+r} b_j B_{j−i}^r(t)
             = Σ_{j=0}^{r} b_{j+i} B_j^r(t).   ✷

Properties:

1. Affine invariance: Due to Σ_i B_i^n(t) = 1 the values of the Bernstein polynomials can be viewed as barycentric coordinates of b^n(t).

2. Convex hull property: From B_i^n(t) ≥ 0 we see again that b^n(t) lies in the convex hull of the control polygon defined by the bi. (Review the definition of a convex combination!)

3. Linear precision:

    Σ_{i=0}^{n} (i/n) B_i^n(t) = t.

   Note: (1 − i/n)a + (i/n)b, i = 0, . . . , n, are uniformly spaced points on the straight line between a and b.

4. Invariance under parameter transformations:

    Σ_{i=0}^{n} bi B_i^n(t) = Σ_{i=0}^{n} bi B_i^n((τ − a)/(b − a))   with   t = (τ − a)/(b − a).

Definition 20  We introduce the notation

    bezier(b0, b1, . . . , br)(t) := Σ_{j=0}^{r} bj B_j^r(t)

for the Bézier curve generated by the Bézier points b0, b1, . . . , br.

We end this section with some examples. In Fig. 1.5 a Bézier curve with its five control points

    b0 := (0, 1)^T,  b1 := (0.25, 2)^T,  b2 := (0.5, 2)^T,  b3 := (0.75, 1)^T,  b4 := (1, 1.5)^T,

and the corresponding control polygon are displayed. Note that the abscissae of the Bézier points are equally spaced, which corresponds to a non parametric curve; see also the linear precision property above.
In Fig. 1.6 the corresponding partial polynomials are plotted along with the Bézier curve. These are b_0^1(t), b_1^1(t), b_2^1(t), b_3^1(t), b_0^2(t), b_1^2(t), b_2^2(t) and finally b_0^3(t), b_1^3(t).
For example,

    b_0^3(t) := bezier(b0, b1, b2, b3)(t) = bezier(b_0^2(t), b_1^2(t))(t).

We now modify the curve and replace b4 by

    b4 := (0.4, 1.5)^T.

Figure 1.5: A Bézier Curve and its Control Polygon

Figure 1.6: Partial Polynomials of a Bézier Curve

Note, the abscissae of the Bézier points are no longer equidistant. The effect of
this change is seen in Fig. 1.7. We can visualize this figure as a so-called crossplot,
cf. Fig. 1.8

1.3.4 Chebyshev Polynomials


We consider Theorem 10 again and try to state the interpolation task in such a way that the interpolation error is minimized.
The only way to minimize the error in the error expression above is to minimize max |ω_{n+1}(t)| by optimally placing the knots. In many direct applications of interpolation there is no freedom in choosing the interpolation points (e.g. the times when measurements are made), but when designing more complex numerical methods which include interpolation as a subtask, one often considers an optimal placement of the interpolation points to optimize the method's accuracy.
The classical example for this is Gauss' quadrature formula, which will be taken up in a later section.

Figure 1.7: Bézier curve representing a parametric graph and the corresponding
convex hull

Definition 21  The polynomials

    Tn(t) = cos(n arccos t),   t ∈ [−1, 1],

are called Chebyshev polynomials.

To see that these are indeed polynomials, set t := cos α and consider

    cos nα = 2 cos α cos((n − 1)α) − cos((n − 2)α),

which gives the three-term recursion

    Tn(t) = 2t Tn−1(t) − Tn−2(t),   T0(t) = 1,  T1(t) = t.

Consequently the Ti are polynomials of degree i; the first few are

    T0(t) = 1,  T1(t) = t,  T2(t) = 2t² − 1,  T3(t) = 4t³ − 3t,

see Fig. 1.9.

Chebyshev polynomials have special properties which make them useful for our purposes:

• The Ti have integer coefficients.

• The leading coefficient of Tn is an = 2^{n−1} (for n ≥ 1).

• T2n is even, T2n+1 is odd.

• |Tn(t)| ≤ 1 for t ∈ [−1, 1], and |Tn(tk)| = 1 for tk := cos(kπ/n), k = 0, . . . , n.

• Tn(1) = 1, Tn(−1) = (−1)^n.

• Tn(t̄k) = 0 for t̄k = cos((2k − 1)π/(2n)), k = 1, . . . , n.

Furthermore, Chebyshev polynomials have an important minimal property, which we now want to prove:

Figure 1.8: Crossplot of a parametric Bézier curve

Theorem 22

1. Let P ∈ Pn([−1, 1]) have leading coefficient an ≠ 0. Then there exists a ξ ∈ [−1, 1] with

    |P(ξ)| ≥ |an| / 2^{n−1}.

2. Let ω ∈ Pn([−1, 1]) have leading coefficient an = 1. Then the scaled Chebyshev polynomial Tn/2^{n−1} has the minimal property

    ‖Tn/2^{n−1}‖_∞ ≤ min_ω ‖ω‖_∞.

Proof: ([DH95])
The first part is proven by contradiction:
Let P ∈ Pn be a polynomial with leading coefficient an = 2^{n−1} and |P(t)| < 1 for all t ∈ [−1, 1]. Then P − Tn ∈ Pn−1, as both polynomials have the same leading coefficient. We now consider this difference at the points tk := cos(kπ/n):

    Tn(t2k) = 1  ∧  P(t2k) < 1    ⇒  P(t2k) − Tn(t2k) < 0,
    Tn(t2k+1) = −1  ∧  P(t2k+1) > −1  ⇒  P(t2k+1) − Tn(t2k+1) > 0.

Figure 1.9: Chebyshev polynomials T1, T2, T3

Thus the difference polynomial changes its sign at least n times in the interval [−1, 1] and consequently has n roots in that interval. This contradicts the fact P − Tn ∈ Pn−1. Hence we have shown that for each polynomial P ∈ Pn with leading coefficient an = 2^{n−1} there exists a ξ ∈ [−1, 1] with |P(ξ)| ≥ 1. By scaling we finally see that for a general polynomial with an ≠ 0 there exists a ξ ∈ [−1, 1] with |P(ξ)| ≥ |an|/2^{n−1}.
The second part of the theorem then follows directly. ✷

We apply this theorem to the result on the interpolation error (cf. Th. 10) of polynomial interpolation and conclude for [a, b] = [−1, 1]: the approximation error

    f(t) − p(f |t0, . . . , tn)(t) = f^{(n+1)}(ξ) ω_{n+1}(t) / (n + 1)!

is minimal if ω_{n+1} = Tn+1/2^n, i.e. if the ti are the roots of the (n + 1)st Chebyshev polynomial, the so-called Chebyshev points.
In the case [a, b] ≠ [−1, 1] we have to consider the maps

    [a, b] → [−1, 1]:   t ↦ τ = 2(t − a)/(b − a) − 1

and

    [−1, 1] → [a, b]:   τ ↦ t = ((1 − τ)/2) a + ((1 + τ)/2) b.
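A minimal sketch computing Chebyshev points on a general interval [a, b] and interpolating at them (the function, interval and degree below are placeholder choices; Runge's example is used as the test function):

n = 10;  a = -5;  b = 5;  f = @(t) 1./(1 + t.^2);   % Runge's example
k   = 1:n+1;
tau = cos((2*k - 1)*pi/(2*(n+1)));                  % roots of T_{n+1} on [-1,1]
ti  = (1 - tau)/2 * a + (1 + tau)/2 * b;            % Chebyshev points mapped to [a,b]
coeff = polyfit(ti, f(ti), n);                      % (for larger n polyfit may warn about
t = linspace(a, b, 1000);                           %  conditioning -- the Vandermonde issue)
max(abs(f(t) - polyval(coeff, t)))                  % maximal interpolation error

Repeating the experiment with ti = linspace(a, b, n+1) instead shows the much larger error of equidistant points (Runge's phenomenon).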

1.3.5 Three-term recursion and orthogonal polynomials


We saw that Chebyshev polynomials can be generated by a three-term recursion. In this section we want to characterize in more detail the class of polynomials generated by three-term recursions.
First we introduce an inner product (scalar product) on function spaces:

    <f, g>_w := ∫_a^b w(t) f(t) g(t) dt                                (1.14)

with a weight function w(t) : (a, b) → R⁺.
By using inner products we can define orthogonality:
Definition 23

• Two functions f, g are called orthogonal with respect to the inner product <·, ·>_w if

    <f, g>_w = 0.                                                      (1.15)

• A sequence of polynomials pk ∈ Pk is called orthogonal if for all k:

    <pk, g>_w = 0   ∀ g ∈ Pk−1.

Orthogonal polynomials and three-term recursions are related to each other by the following theorem:

Theorem 24
There exists a unique sequence of normalized orthogonal polynomials

    pk(t) = t^k + πk−1(t),   πk−1 ∈ Pk−1.

It obeys the three-term recursion

    pk+1(t) = (t − βk+1) pk(t) − γk+1² pk−1(t)

with p−1(t) := 0, p0(t) := 1 and

    βk+1 := <t pk, pk>_w / <pk, pk>_w,   γk+1² := <pk, pk>_w / <pk−1, pk−1>_w.
Proof:
The proof is by induction. Let us assume that p0, . . . , pk−1 have already been constructed. They form an orthogonal basis of Pk−1. If pk is a polynomial with leading coefficient 1, then pk − t pk−1 ∈ Pk−1. Thus there exist coefficients cj such that

    pk(t) − t pk−1(t) = Σ_{j=0}^{k−1} cj pj(t)                          (1.16)

with

    cj = <pk − t pk−1, pj>_w / <pj, pj>_w

(why?).
As <pk − t pk−1, pj>_w = <pk, pj>_w − <t pk−1, pj>_w, we obtain, when requiring that pk is orthogonal to all lower degree polynomials,

    cj = − <t pk−1, pj>_w / <pj, pj>_w = − <pk−1, t pj>_w / <pj, pj>_w,

which results in c0 = . . . = ck−3 = 0 and

    ck−1 = − <t pk−1, pk−1>_w / <pk−1, pk−1>_w

and

    ck−2 = − <pk−1, t pk−2>_w / <pk−2, pk−2>_w.

As t pk−2 = pk−1 + (lower degree polynomial) we get

    ck−2 = − <pk−1, pk−1>_w / <pk−2, pk−2>_w.

From (1.16) we then obtain

    pk(t) = (t + ck−1) pk−1(t) + ck−2 pk−2(t)

with ck−1 = −βk and ck−2 = −γk², which completes the proof. ✷

Example 25
The Chebyshev polynomials are orthogonal polynomials on [−1, 1] with respect to the weight function w(t) = (1 − t²)^{−1/2}.

Example 26
For a = −1, b = 1 and w(t) = 1 we obtain the Legendre polynomials Pk, which can be constructed e.g. by the following MAPLE code:
p_m:=0;
p_0:=1;p_1:=t;
beta_2:=int(t*p_1*p_1,t=-1..1)/int(p_1*p_1,t=-1..1);
gamma2_2:=int(p_1*p_1,t=-1..1)/int(p_0*p_0,t=-1..1);
p_2:=(t-beta_2)*p_1-gamma2_2*p_0;
beta_3:=int(t*p_2*p_2,t=-1..1)/int(p_2*p_2,t=-1..1);
gamma2_3:=int(p_2*p_2,t=-1..1)/int(p_1*p_1,t=-1..1);
p_3:=(t-beta_3)*p_2-gamma2_3*p_1;

Figure 1.10: Legendre polynomials P1, P2, P3

Chebyshev polynomials have their importance in approximation theory; we saw their importance for optimally placing interpolation points. Legendre polynomials give optimal integration (quadrature) formulas, as will be seen in Section 1.4. In order to show this we need some more properties of orthogonal polynomials.

Theorem 27
Let pk ∈ Pk be orthogonal to all p ∈ Pk−1 .
Then pk has k simple real roots in the open interval (a, b).

Proof:
Let t0, . . . , tm−1 be the distinct points in (a, b) where pk changes sign.
Then Qm(t) := (t − t0)(t − t1) · · · (t − tm−1) changes sign at the same points. Thus w Qm pk does not change sign in (a, b), and we get

    <Qm, pk>_w = ∫_a^b w(t) Qm(t) pk(t) dt ≠ 0.

Since pk is orthogonal to all polynomials of lower degree, the degree of Qm has to be k. Thus pk has exactly k simple real roots in (a, b). ✷

1.4 Application of Polynomial Interpolation: Quadrature
In this section we apply interpolation to construct quadrature formulas for numerically integrating a given function. Numerical integration formulas are basic to any method for numerically solving ODEs

    ẏ = f(t, y),   y(t0) = y0,

or equivalently

    y(t) = ∫_{t0}^{t} f(τ, y(τ)) dτ + y0.

In the special case f(t) := f(t, y) this results in a quadrature task:

    y(t) = ∫_{t0}^{t} f(τ) dτ + y0.

Furthermore, numerical integration is important in its own right, e.g. when computing element matrices in FEM (finite element method) applications.
We introduce the following short notation:

    I_a^b(f) := ∫_a^b f(τ) dτ,

and denote by Î_a^b(f) an appropriate numerical approximation.

Example 28
The approximation

    Î_{t0}^{te}(f) := Σ_{i=1}^{n} Î_{ti−1}^{ti}(f)

with

    Î_{ti−1}^{ti}(f) := hi ( (1/2) f(ti−1) + (1/2) f(ti) )

and step size hi := ti − ti−1 is called the trapezoidal rule.

A general scheme for numerical integration can be written as follows:

    Î_{ti−1}^{ti}(f) := hi Σ_{j=1}^{s} bj f(ti−1 + cj hi),              (1.17)

where s is the number of stages, the bj are the weights and the cj the knots of the quadrature formula.

Figure 1.11: Trapezoidal rule

Example 29 (Simpson's rule)

    Î_{ti−1}^{ti}(f) := hi ( (1/6) f(ti−1) + (4/6) f(ti−1 + hi/2) + (1/6) f(ti) )   (1.18)

is a method with 3 stages.
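A minimal sketch of the composite trapezoidal and Simpson rules on a given grid (f is a function handle, t the vector of grid points t0 < t1 < · · · < tn; the weights follow (1.17) with the stages given above):

function [T, S] = composite_rules(f, t)
% COMPOSITE_RULES  composite trapezoidal (T) and Simpson (S) approximations
% of the integral of f over [t(1), t(end)] on the subintervals given by t
h = diff(t);                                           % step sizes h_i
T = 0;  S = 0;
for i = 1:length(h)
    a = t(i);  b = t(i+1);
    T = T + h(i)*(f(a)/2 + f(b)/2);                    % trapezoidal rule, 2 stages
    S = S + h(i)*(f(a)/6 + 4*f((a+b)/2)/6 + f(b)/6);   % Simpson's rule, 3 stages
end
end

For example, [T, S] = composite_rules(@sin, linspace(0, pi, 11)) gives approximately 1.9835 and 2.0000, compared with the exact value 2.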

The approximation error of such a scheme is defined by

    I_{t0}^{te}(f) − Î_{t0}^{te}(f),

and we are interested in minimizing it for a fixed number of function evaluations (stages) by optimally selecting the weights and the knots.
When constructing a quadrature formula we require two basic properties:

• consistency of the method:

    f(t) = const.  ⇒  Î(f) = I(f).

  Thus Σ_j bj = 1.

• positivity of the method:

    f(t) ≥ 0, t ∈ [a, b]  ⇒  Î_a^b(f) ≥ 0.

  Consequently, bj ≥ 0.

To construct a quadrature formula we replace f by a "simpler" function f̃ and define

    Î(f) := I(f̃).

Let f̃ be the polynomial interpolating f at the knots τj := ti−1 + cj hi:

    f̃(t) := p(f |τ1, . . . , τs)(t) = Σ_{j=1}^{s} f(τj) L_j^{s−1}(t).

Here L_j^{s−1}(t) is the jth Lagrange polynomial defined by the knot points (cf. Section 1.2.1):

    L_j^{s−1}(t) = ∏_{i=1, i≠j}^{s} (t − τi)/(τj − τi).

Thus,

    Î_{ti−1}^{ti}(f) = Σ_{j=1}^{s} f(τj) I_{ti−1}^{ti}(L_j^{s−1}).

Set

    b_j^s := (1/hi) ∫_{ti−1}^{ti} L_j^{s−1}(t) dt = ∫_0^1 L_j^{s−1}(ti−1 + σ hi) dσ;

then

    Î_{ti−1}^{ti}(f) = hi Σ_{j=1}^{s} b_j^s f(τj).

Thus, given the cj, the weights bj are fixed. The methods are consistent due to Σ_{j=1}^{s} L_j^{s−1}(t) = 1. They are exact for polynomials at least up to degree s − 1, i.e.

    p ∈ Ps−1  ⇒  Î(p) = I(p).

Lemma 30
Given s distinct points τ1, . . . , τs ∈ [0, 1], there is a unique functional

    Î_0^1(f) = Σ_{j=1}^{s} bj f(τj)

with the property

    Î_0^1(p) = I_0^1(p)   ∀ p ∈ Ps−1.

Proof:
By construction and the uniqueness of the interpolating polynomial. ✷

By a coordinate transformation this result applies analogously to any finite interval.
We now investigate the approximation error and define:

Definition 31
If Î(p) = I(p) ∀ p ∈ Pk−1 and if there is a p0 ∈ Pk with Î(p0) ≠ I(p0), then the method has order k.
Note that consistent methods have at least order 1.
A criterion for the order of a scheme is given by the following theorem:

Theorem 32
If Σ_{i=1}^{s} bi ci^{q−1} = 1/q for q = 1, . . . , k, then the method Î has order k.

Proof:
Taylor expansion of f about ti. ✷
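A quick numerical check of these order conditions for Simpson's rule, c = (0, 1/2, 1) and b = (1/6, 4/6, 1/6): the sums equal 1/q for q = 1, . . . , 4 but not for q = 5, which is consistent with order 4.

b = [1/6 4/6 1/6];  c = [0 1/2 1];
for q = 1:5
    lhs = sum(b .* c.^(q-1));                        % order condition sum b_i c_i^(q-1)
    fprintf('q = %d:  %.6f   (1/q = %.6f)\n', q, lhs, 1/q)
end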

Example 33
It can easily be checked by Taylor expansion that for the trapezoidal rule the local error is

    Î_{ti−1}^{ti}(f) − I_{ti−1}^{ti}(f) = (1/12) f''(ti−1) hi³ + O(hi⁴)

with hi = ti − ti−1. The global error is bounded by

    |Î_{t0}^{te}(f) − I_{t0}^{te}(f)| = |Σ_{i=1}^{n} ( Î_{ti−1}^{ti}(f) − I_{ti−1}^{ti}(f) )| ≤ ((te − t0)/12) max_{ξ∈[t0,te]} |f''(ξ)| h² + O(h³)

with h := max_i hi.
The power of h in this expression corresponds to the order of the method.

1.4.1 Quadrature in MATLAB


MATLAB provides two commands for numerically computing the integral of a given function: quad and quadl. Both methods are adaptive, i.e. the step size is adjusted automatically in such a way that the error can be kept within a user-given tolerance bound. quad is based on Simpson's rule, while quadl uses a more sophisticated method based on Lobatto quadrature.
In this course we will not discuss adaptive quadrature methods. Adaptivity will be a topic in the chapter concerning ordinary differential equations, see Ch. 5.

1.4.2 Gauss Quadrature


An interesting question in this context is how to place the knots cj so that the
method gets an order k > s. What is the optimal (maximal) order? To an-
swer this question some knowledge from the theory of orthogonal polynomials
is required. This will be a topic in one of the advanced courses in Numerical
Analysis, where it can be seen, that the so-called Gauss-methods are optimal.
We investigate now the following questions:

• Can we place the knots cj so that the method gets an order k > s ?
• What is the optimal (maximal) order ?
Theorem 34
Let Î_0^1 be defined by (ci, bi), i = 1, . . . , s, with order k ≥ s, and set

    M(t) := (t − c1)(t − c2) · · · (t − cs) ∈ Ps[0, 1].

The order of Î_0^1 is at least s + m iff

    ∫_0^1 M(t) p(t) dt = 0   ∀ p ∈ Pm−1[0, 1],                          (1.19)

i.e. M ⊥ Pm−1[0, 1] in L².

Proof:
Let f ∈ Ps+m−1. Then we can write it as

    f(t) = M(t) g(t) + r(t)

with two polynomials g ∈ Pm−1 and r ∈ Ps−1. Consider

    I_0^1(f) = I_0^1(M g) + I_0^1(r).

Due to condition (1.19) the first term vanishes, and due to the order of the method we have I_0^1(r) = Î_0^1(r).
On the other hand,

    Î_0^1(f) = Σ_{i=1}^{s} bi f(ci) = Σ_{i=1}^{s} bi M(ci) g(ci) + Î_0^1(r),

where the first term vanishes due to M(ci) = 0. Thus Î_0^1(f) = I_0^1(f). ✷

Example 35
Consider m = 1, s = 3:

    0 = ∫_0^1 (t − c1)(t − c2)(t − c3) · 1 dt
      = 1/4 − (1/3)(c1 + c2 + c3) + (1/2)(c1 c2 + c1 c3 + c2 c3) − c1 c2 c3,

hence

    c3 = ( 1/4 − (c1 + c2)/3 + c1 c2/2 ) / ( 1/3 − (c1 + c2)/2 + c1 c2 ).

Thus, there are two degrees of freedom in designing such a method.

Theorem 36
A method with s stages has maximal order 2s.

Proof:
Assume order k ≥ 2s + 1. Then, by the preceding theorem,

    0 = ∫_0^1 M(t) p(t) dt   ∀ p ∈ Ps[0, 1],                            (1.20)

in particular also for p(t) = M(t). Thus

    0 = ∫_0^1 M(t) M(t) dt = ∫_0^1 (t − c1)² · · · (t − cs)² dt > 0,     (1.21)

which is a contradiction. ✷

Note that the existence of a method of order k = 2s is not stated by this theorem.
For constructing such a method we set M(t) = c · Ps(2t − 1), where Ps is the Legendre polynomial of degree s and c a constant such that c Ps(2t − 1) has leading coefficient 1.
Then, by construction,

    ∫_0^1 M(t) g(t) dt = 0   ∀ g ∈ Ps−1.                                (1.22)

Thus a method based on knots cj with Ps(2cj − 1) = 0 has order k = 2s.

Theorem 37
There is a method of order 2s. It is uniquely defined by taking cj as the roots of
the sth Legendre polynomial Ps (2t − 1).

Example 38

• s = 1 gives the midpoint rule

    Î_0^1(f) = f(1/2).                                                  (1.23)

• s = 2: Exercise.

• s = 3 gives a 6th order method:

    Î_0^1(f) = (5/18) f(1/2 − √15/10) + (8/18) f(1/2) + (5/18) f(1/2 + √15/10).   (1.24)

These methods are called Gauß methods.
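A minimal sketch applying the 3-stage Gauss rule (1.24) to an arbitrary interval [a, b] via the substitution t = a + (b − a)σ, σ ∈ [0, 1] (the function and interval below are placeholder choices):

c  = [1/2 - sqrt(15)/10, 1/2, 1/2 + sqrt(15)/10];   % Gauss knots on [0,1]
w  = [5/18, 8/18, 5/18];                            % corresponding weights
gauss3 = @(f, a, b) (b - a) * sum(w .* f(a + (b - a)*c));
gauss3(@exp, 0, 1)    % approx. e - 1 = 1.71828..., error of the order 1e-6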



1.5 Piecewise Polynomials and Splines


To interpolate a larger amount of data and to avoid effects like Runge's phenomenon, as demonstrated in the exercises, one applies piecewise polynomial interpolation, i.e. one constructs a function s which interpolates the data points and which is a polynomial between these points.

Definition 39  We now consider functions s ∈ C^{r−1}[t0, tn] of the following type:

    s : [t0, tn] → R   with   s|_{[ti, ti+1]} = si ∈ Pr[ti, ti+1].       (1.25)

Functions with these properties are called splines (r = 1: linear splines, r = 3: cubic splines). We call the points ti the knots or breakpoints of the spline s.

The continuity requirements imply

    (d^k/dt^k) si(ti+1) = (d^k/dt^k) si+1(ti+1),   k = 0, . . . , r − 1.

Again we consider the interpolation task, i.e. we look for a spline function s satisfying the interpolation conditions

    s(ti) = yi,   i = 0, . . . , n.

A linear spline is easy to construct: one simply draws straight lines between the interpolation points.
We leave linear and quadratic splines for the exercises and turn directly to cubic splines, the family of splines which is most used in applications.
For a cubic spline we require that the functions si are cubic polynomials which join at the knots with C²-continuity. Thus,

    si(t) = ai(t − ti)³ + bi(t − ti)² + ci(t − ti) + di,                 (1.26)

and we have the following conditions to determine the coefficients ai, bi, ci, di:

    si(ti) = yi,  i = 0, . . . , n − 1,   sn−1(tn) = yn                  (1.27a)
    si(ti+1) = si+1(ti+1),   i = 0, . . . , n − 2                        (1.27b)
    si'(ti+1) = si+1'(ti+1),   i = 0, . . . , n − 2                      (1.27c)
    si''(ti+1) = si+1''(ti+1),   i = 0, . . . , n − 2                    (1.27d)

We have n+1 knots and consequently n intervals and 4n unknowns. To determine


these unknowns we have 4(n − 1) + 2 = 4n − 2 conditions. There are two degrees
of freedom left. We will fix them later by setting up two additional boundary
conditions.

From (1.27a) we get

    di = yi,   i = 0, . . . , n − 1.                                      (1.28)

We set hi := ti+1 − ti and obtain from (1.27b)

    yi+1 = ai hi³ + bi hi² + ci hi + yi.                                  (1.29)

The first and second derivatives are

    si'(ti+1) = 3 ai hi² + 2 bi hi + ci,                                  (1.30a)
    si''(ti+1) = 6 ai hi + 2 bi.                                          (1.30b)

We introduce new variables for the second derivatives at ti, i.e.

    Si := si''(ti) = 6 ai(ti − ti) + 2 bi = 2 bi.                         (1.31)

From (1.27d) we then obtain

    Si+1 = 6 ai hi + 2 bi.                                                (1.32)

Hence,

    bi = Si/2,   ai = (Si+1 − Si)/(6 hi).                                 (1.33)

Inserting these relations into (1.29) gives

    yi+1 = ((Si+1 − Si)/(6 hi)) hi³ + (Si/2) hi² + ci hi + yi.

From that we get ci:

    ci = (yi+1 − yi)/hi − ((2 Si + Si+1)/6) hi.

Now we use condition (1.27c) and get

    ci = 3 ai−1 hi−1² + 2 bi−1 hi−1 + ci−1.

Inserting the expressions for ai, bi and ci gives

    (yi+1 − yi)/hi − ((2 Si + Si+1)/6) hi
        = 3 ((Si − Si−1)/(6 hi−1)) hi−1² + 2 (Si−1/2) hi−1 + (yi − yi−1)/hi−1 − ((2 Si−1 + Si)/6) hi−1

and finally

    hi−1 Si−1 + 2(hi−1 + hi) Si + hi Si+1 = 6 ( (yi+1 − yi)/hi − (yi − yi−1)/hi−1 )   (1.34)

with i = 1, . . . , n − 1.
These are n−1 equations for the n+1 unknown second derivatives Si . We have to
ask for two more conditions, which are boundary conditions if we put conditions
on S0 and Sn .
The easiest is to ask for
S0 = Sn = 0. (1.35)
A cubic spline fulfilling this condition is called a natural spline. We will first
consider this possibility and then discuss other common choices of boundary
conditions.
Equations (1.34) and (1.35) give us a square linear system of equations which can be solved to determine the Si:

    \begin{pmatrix}
    2(h_0+h_1) & h_1 & & & \\
    h_1 & 2(h_1+h_2) & h_2 & & \\
    & h_2 & 2(h_2+h_3) & \ddots & \\
    & & \ddots & \ddots & h_{n-2} \\
    & & & h_{n-2} & 2(h_{n-2}+h_{n-1})
    \end{pmatrix}
    \begin{pmatrix} S_1 \\ S_2 \\ S_3 \\ \vdots \\ S_{n-1} \end{pmatrix}
    = 6 \begin{pmatrix}
    \frac{y_2-y_1}{h_1} - \frac{y_1-y_0}{h_0} \\
    \frac{y_3-y_2}{h_2} - \frac{y_2-y_1}{h_1} \\
    \vdots \\
    \frac{y_n-y_{n-1}}{h_{n-1}} - \frac{y_{n-1}-y_{n-2}}{h_{n-2}}
    \end{pmatrix}                                                        (1.36)

Note that the "empty" entries in the coefficient matrix are zeros. The matrix has a banded structure: it is a tridiagonal matrix, and furthermore it is symmetric. How this structure can be exploited when solving the system will be discussed in Chapter 2. Here we just use the corresponding MATLAB command

S=A\b

for solving the system. For defining the coefficient matrix we can use the fact that the matrix is banded and apply MATLAB's command diag, cf. "help diag" and the exercises.
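A minimal sketch assembling and solving the system (1.36) with diag (ti, yi are the data vectors of length n + 1; S then contains S0, . . . , Sn with the natural boundary conditions):

h  = diff(ti);  dy = diff(yi);                       % h_i and y_{i+1} - y_i
n  = length(ti) - 1;
d0 = 2*(h(1:n-1) + h(2:n));                          % main diagonal 2(h_{i-1} + h_i)
d1 = h(2:n-1);                                       % off-diagonals h_1, ..., h_{n-2}
A  = diag(d0) + diag(d1, 1) + diag(d1, -1);
b  = 6*(dy(2:n)./h(2:n) - dy(1:n-1)./h(1:n-1));      % right hand side of (1.36)
S  = [0; A\b(:); 0];                                 % natural conditions S_0 = S_n = 0

The coefficients ai, bi, ci, di of the spline pieces then follow from (1.28), (1.33) and the formula for ci above.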
As pointed out before, the definition of a cubic spline leaves two degrees of freedom. These are normally fixed in terms of boundary conditions. There are several common choices:

• natural spline: We take S0 = Sn = 0. This choice is often taken if we have no other specific information available.

• end slope condition: We might have knowledge about the slopes at the boundary points, i.e. s'(t0) and s'(tn) are known. From that, conditions involving S0 and Sn can be derived and the linear system corresponding to (1.36) can be set up. We leave this as an exercise.

• periodic spline: We assume that the function we want to interpolate is periodic with period tn − t0. From that we can conclude s'(t0) = s'(tn) and S0 = s''(t0) = s''(tn) = Sn, which gives enough conditions to define the spline uniquely.

• not-a-knot condition: If the physical context gives no additional information about the spline at the boundary, one may fix the boundary conditions by the additional requirements

    s0'''(t1) = s1'''(t1),   sn−2'''(tn−1) = sn−1'''(tn−1).               (1.37)

By this, s0 and s1 become parts of one and the same cubic polynomial, and the point t1 is no longer a knot. The same holds for sn−2, sn−1 and tn−1. This motivates the name of this type of boundary condition. The MATLAB function spline uses this type of boundary condition.

In MATLAB's spline toolbox there are many additional tools for computing and evaluating splines. A command that computes spline coefficients for various end conditions is csape.

1.5.1 Minimal Property of Cubic Splines


Webster's dictionary gives the following historical description of the word spline: "a thin wood or metal strip used in building construction" (1756). When bending a straight piece of metal along some nails (the interpolation points), its deformation is determined by minimizing the deformation energy. Let the curve be a function s with the property s(ti) = yi, where (ti, yi) are the coordinates of the nails; then this curve has the property

    ∫_{t0}^{te} (s'')²(t) dt = min ‖f''‖_2²

(up to physical constants, like the elasticity coefficient), where the minimum is taken over all C² functions f satisfying the interpolation conditions.
In this subsection we will show that cubic spline functions indeed have this property.
We denote by V the set of all C² functions which interpolate the points (ti, yi), i = 0, . . . , n.

Theorem 40
Let s* ∈ V be a cubic spline satisfying the natural boundary conditions. Then

    ‖s*''‖_2 ≤ ‖s''‖_2   ∀ s ∈ V.

Proof:
Let s ∈ V; then there is an h ∈ C^2 with h(t_i) = 0 such that s(t) = s*(t) + h(t). We
then obtain
\[
\|s''\|_2^2 = \|s^{*\prime\prime} + h''\|_2^2 = \|s^{*\prime\prime}\|_2^2 + 2\langle s^{*\prime\prime}, h''\rangle + \|h''\|_2^2
\]
with
\[
\langle s^{*\prime\prime}, h''\rangle := \int_{t_0}^{t_e} s^{*\prime\prime}(t)\,h''(t)\,dt .
\]
We have to show that \langle s^{*\prime\prime}, h''\rangle = 0.
Integration by parts gives
\[
\langle s^{*\prime\prime}, h''\rangle = s^{*\prime\prime}(t)\,h'(t)\Big|_{t_0}^{t_e} - \int_{t_0}^{t_e} s^{*\prime\prime\prime}(t)\,h'(t)\,dt .
\]
From the natural boundary conditions it follows that
\[
s^{*\prime\prime}(t)\,h'(t)\Big|_{t_0}^{t_e} = 0 .
\]
As s* is a piecewise cubic polynomial, s^{*\prime\prime\prime} is piecewise constant, and we get for the last term
\[
\int_{t_0}^{t_e} s^{*\prime\prime\prime}(t)\,h'(t)\,dt = \sum_i \alpha_i \int_{t_{i-1}}^{t_i} h'(t)\,dt = \sum_i \alpha_i \bigl(h(t_i) - h(t_{i-1})\bigr) = 0
\]
with some constants \alpha_i. ✷

1.5.2 B-Splines
In this subsection we study the linear space of splines as we did before for
polynomial spaces and look for a basis of this space which gives us spline repre-
sentations with "nice" coefficients. By "nice" we mean, in the context of graphics,
coefficients which have a direct geometrical interpretation. By changing the coef-
ficients we want to influence the shape of the spline only locally. We saw this task
already when discussing the Bernstein basis for polynomials. For the interpola-
tion task the interpretation of the coefficients does not play a particular role, but
when using splines for design purposes the coefficients can serve as "handles" to
influence the shape by positioning them through mouse clicks or other computer
input devices.
Let ∆ := {a = t_0, t_1, . . . , t_{l+1} = b} with t_i < t_{i+1} denote a partitioning (or a grid)
of the interval [a, b].
The space of all splines of degree k − 1 with respect to ∆ is denoted by S_{k,∆}. It
is easily checked that S_{k,∆} is a linear space and evidently P_{k−1} ⊂ S_{k,∆} holds.
Thus a basis of Sk,∆ consists of a basis of the polynomial space plus some addi-
tional functions. Let us consider first the monomial basis of the polynomial space
and extend it to a basis of the spline space.
For this end we define
Definition 41
\[
(t - t_i)_+^{k-1} :=
\begin{cases}
(t - t_i)^{k-1} & \text{if } t \ge t_i \\
0 & \text{else.}
\end{cases}
\]

Theorem 42
B := \{1, t, \ldots, t^{k-1}, (t - t_1)_+^{k-1}, \ldots, (t - t_l)_+^{k-1}\} is a basis of S_{k,\Delta} and \dim S_{k,\Delta} = k + l.

Note the numbering: why are the functions (t - t_0)_+^{k-1} and (t - t_{l+1})_+^{k-1} corresponding to the first and the last grid point not taken as basis functions?
Example 43 If we consider cubic splines (k = 4) and l = n − 1 we obtain
dim S_{k,∆} = 4 + (n − 1) = (n + 1) + 2. So, to uniquely define a spline we have
to give as many conditions for the coefficients. In the interpolation task we fixed
them by n + 1 interpolation conditions plus two boundary conditions.
With this theorem we got an easy way to determine the dimension of a spline space
but for computational purposes there are better ways to choose basis functions,
which lead us to B-splines.
We formally extend the grid to
\[
\bar{\Delta}: \quad \tau_1 = \ldots = \tau_k < \tau_{k+1} < \ldots < \tau_{k+l} < \tau_{k+l+1} = \ldots = \tau_{k+l+k}
\]
with \tau_{k+i} = t_i for i = 0, \ldots, l+1 and define

Definition 44
The functions N_{ik} defined recursively as follows are called B-splines:
\[
N_{i1}(t) :=
\begin{cases}
0 & \text{if } \tau_i = \tau_{i+1} \\
1 & \text{if } t \in [\tau_i, \tau_{i+1}) \\
0 & \text{else}
\end{cases}
\]
and
\[
N_{ik} := \frac{t - \tau_i}{\tau_{i+k-1} - \tau_i}\,N_{i,k-1} + \frac{\tau_{i+k} - t}{\tau_{i+k} - \tau_{i+1}}\,N_{i+1,k-1},
\]
where we use the convention 0/0 = 0 if nodes coincide.
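A direct, unoptimized MATLAB transcription of this recursion (the function name bspline_eval and its calling convention are our own choice; tau is the extended knot vector, and the convention 0/0 = 0 is realized by skipping terms with coinciding knots):

function N = bspline_eval(i, k, t, tau)
% Evaluate the B-spline N_{ik}(t) by the recursion of Definition 44.
% t may be a vector of evaluation points.
if k == 1
    N = double(tau(i) <= t & t < tau(i+1));   % zero anyway if tau(i) = tau(i+1)
    return
end
N = zeros(size(t));
if tau(i+k-1) > tau(i)
    N = N + (t - tau(i))./(tau(i+k-1) - tau(i)).*bspline_eval(i, k-1, t, tau);
end
if tau(i+k) > tau(i+1)
    N = N + (tau(i+k) - t)./(tau(i+k) - tau(i+1)).*bspline_eval(i+1, k-1, t, tau);
end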
Examples of these functions are depicted in Fig. 1.12. There one observes the in-
creasing degree of smoothness when raising the order of these functions. Without
proof we collect some important properties of these functions:
1. N_{ik}(t) ≠ 0 only for t ∈ [τ_i, τ_{i+k}]: local support
[Figure 1.12 shows the B-splines N_{i1}, N_{i2}, N_{i3}, N_{i4} on the knots τ_i, . . . , τ_{i+4}.]

Figure 1.12: B-splines of order up to 4

2. N_{ik}(t) ≥ 0: non-negative
3. N_{ik} ∈ S_{k,∆} if τ_i ≠ τ_{i+k}: B-splines are splines
4. N_{ik} ∈ C^{k−1−m} if there are m-fold knots τ_j.
The last property may be used for modeling corners, see Fig. 1.13.
[Figure 1.13 shows a quadratic B-spline with a corner produced by a double knot.]

Figure 1.13: N_{13} generated with a double knot at τ = 3

Theorem 45 The B-splines N_{ik}, i = 1, . . . , l + k form a basis of S_{k,∆}.

Therefore any function s ∈ S_{k,∆} has a unique representation
\[
s = \sum_{i=1}^{l+k} d_i N_{ik}
\]
and in particular
\[
1 = \sum_{i=1}^{l+k} N_{ik} .
\]

The coefficients di are called de Boor points. See also their role in the context of
Bézier splines.
Changing the di ’s influences only a local part of the total spline due to the local
support property of the B-splines. The degree of the B-spline determines the
number of intervals influenced by this change.
More on this subject can be found in [de 78, Far88].
Chapter 2

Linear Systems

We saw in the preceding sections the need for solving linear systems. They oc-
curred when we wanted to solve the Vandermonde or Bézier system for polynomial
interpolation and in a special form (tridiagonal system) when computing cubic
interpolation splines.
Solving linear systems occurs in nearly all algorithms in numerical analysis as a
subproblem. It has the following form
Ax = b with A ∈ Rn×m
In this course we study the following cases
• n = m square systems
• n > m overdetermined systems
As n and m can be very large (up to 10^4 unknowns), computing time for solving
these systems can become crucial.
So far in this course we just solved these systems by using the MATLAB command¹
x = A \ b
Now we go into detail and look at what this command actually does.
We first review some facts on solvability of linear systems.
Definition 46
The linear space
N (A) = {x ∈ Rm |Ax = 0}
is called the nullspace or kernel of A and
R(A) = {z ∈ Rn |∃x ∈ Rm Ax = z}
is called the range space or image space of A.
1
For MATLAB help on the ”\” command, type help mldivide


(see also [Spa94, p. 143,144]).


We note that
\[
\det(A) \ne 0 \;\Longleftrightarrow\; N(A) = \{0\}.
\]
If det(A) ≠ 0 the matrix is called non-singular. If n = m, a non-singular matrix
is also called regular.

Theorem 47 The linear system Ax = b has a solution, if b ∈ R(A).


The solution is unique, if N (A) = {0}.

We will first consider the case n = m and only regular matrices.

2.1 Regular Linear Systems


2.1.1 LU Decomposition
Definition 48 A matrix L is called a lower triangular matrix if all elements
over its diagonal are zero. Furthermore, if its diagonal elements are one, then it
is called unit lower triangular.

Here is an example of a unit lower triangular matrix:
\[
L = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 \\
l_{21} & 1 & 0 & 0 & 0 & 0 \\
l_{31} & l_{32} & 1 & 0 & 0 & 0 \\
l_{41} & l_{42} & l_{43} & 1 & 0 & 0 \\
l_{51} & l_{52} & l_{53} & l_{54} & 1 & 0 \\
l_{61} & l_{62} & l_{63} & l_{64} & l_{65} & 1
\end{pmatrix}.
\]
An example for a lower triangular matrix is given in Eq.(1.7).
Correspondingly we speak of an upper triangular matrix U if its entries below
the diagonal are zero. Here is an example:
\[
U = \begin{pmatrix}
u_{11} & u_{12} & u_{13} & u_{14} & u_{15} & u_{16} \\
0 & u_{22} & u_{23} & u_{24} & u_{25} & u_{26} \\
0 & 0 & u_{33} & u_{34} & u_{35} & u_{36} \\
0 & 0 & 0 & u_{44} & u_{45} & u_{46} \\
0 & 0 & 0 & 0 & u_{55} & u_{56} \\
0 & 0 & 0 & 0 & 0 & u_{66}
\end{pmatrix}.
\]
We assume, that we can factorize A into a product of a lower and upper triangular
matrix: A = LU . Then the linear system can be written as
Ax = b (2.1a)
LU x = b (2.1b)
Ly = b (2.1c)

with
y := U x (2.1d)
This suggests the following algorithm:

• LU factorization: Decompose A into a product of a lower triangular matrix L and an upper triangular matrix U.

• Forward substitution: Solve Ly = b for y by exploiting the triangular structure of L.

• Backward substitution: Solve Ux = y for x by exploiting the triangular structure of U.

Before considering a method for performing the decomposition step, we look at the two substitution steps.

Forward and backward substitution

Consider the example of a lower triangular system of the type Ly = b:

b1 = l11 y1
b2 = l21 y1 + l22 y2
..
.
b5 = l51 y1 + l52 y2 + l53 y3 + l54 y4 + l55 y5

From the first equation you immediately get y1 . Using this value, you easily
obtain y2 from the next equation and so on.
We describe the procedure for a general lower triangular matrix by the following
piece of MATLAB code:

for i=1:n
    for j=1:i-1
        b(i)=b(i)-l(i,j)*b(j);   % subtract the already known contributions
    end
    b(i)=b(i)/l(i,i);            % solve for the i-th unknown
end
y=b;

Here is a similar example of an upper triangular system Ux = y:

y_1 = u_{11}x_1 + u_{12}x_2 + u_{13}x_3 + u_{14}x_4 + u_{15}x_5
y_2 = u_{22}x_2 + u_{23}x_3 + u_{24}x_4 + u_{25}x_5
...
y_5 = u_{55}x_5

For solving this system, we start with the last equation, solve it for x_5, and
proceed in a similar way but backwards.
In MATLAB code this reads

for i=n:-1:1
    for j=i+1:n
        y(i)=y(i)-u(i,j)*y(j);   % subtract the already known contributions
    end
    y(i)=y(i)/u(i,i);            % solve for the i-th unknown
end
x=y;
Counting operations gives n²/2 + O(n) multiplications and as many additions for
the backward or forward substitution methods².

Elementary transformations We turn now to the decomposition step and
show the principal idea by transforming A stepwise into an upper triangular ma-
trix by multiplication with so-called elementary transformation matrices. Again,
we explain things first by looking at the first elimination step for an n × n matrix:
\[
\underbrace{\begin{pmatrix}
1 & & & \\
-l_{21} & 1 & & \\
\vdots & & \ddots & \\
-l_{n1} & & & 1
\end{pmatrix}}_{=:M_1}
\underbrace{\begin{pmatrix}
a_{11} & a_{12} & \ldots & a_{1n} \\
a_{21} & a_{22} & & \vdots \\
\vdots & & & \vdots \\
a_{n1} & & & a_{nn}
\end{pmatrix}}_{=:A}
=
\underbrace{\begin{pmatrix}
a_{11} & a_{12} & \ldots & a_{1n} \\
0 & a_{22}^{(1)} & & a_{2n}^{(1)} \\
\vdots & \vdots & & \vdots \\
0 & a_{n2}^{(1)} & \ldots & a_{nn}^{(1)}
\end{pmatrix}}_{=:A^{(1)}}
\]
with
\[
a_{ij}^{(1)} := a_{ij} - l_{i1} a_{1j}, \qquad i, j = 2, \ldots, n,
\]
and l_{i1} := a_{i1}/a_{11}.
We observe, that premultiplying A by M1 annihilates all but the first element in
the first column of A and changes all other elements but those in the first row.
In general an elementary transformation matrix has the following form:
\[
M_k := \begin{pmatrix}
1 & & & & \\
 & \ddots & & & \\
 & & 1 & & \\
 & & -l_{k+1,k} & \ddots & \\
 & & \vdots & & \\
 & & -l_{n,k} & & 1
\end{pmatrix}
\qquad (2.2)
\]
2
Operations are counted often in a ”unit” called flop, which stands for floating point oper-
ation and corresponds to an addition or multiplication. See the MATLAB command flops.

with l_{ik} := a_{ik}^{(k-1)} / a_{kk}^{(k-1)}, A^{(k-1)} := M_{k-1} \cdots M_1 A and A^{(0)} := A.
Elementary transformations are regular matrices and their inverses have a similar
structure:
\[
M_k^{-1} = \begin{pmatrix}
1 & & & & \\
 & \ddots & & & \\
 & & 1 & & \\
 & & l_{k+1,k} & \ddots & \\
 & & \vdots & & \\
 & & l_{n,k} & & 1
\end{pmatrix}
\qquad (2.3)
\]
We note two important facts in this context (which can be checked easily):
• Products of triangular matrices are triangular.
• Inverses of upper (lower) triangular matrices are upper (lower) triangular (if they exist).
We set
\[
U := \underbrace{M_{n-1} M_{n-2} \cdots M_2 M_1}_{=:L^{-1}} A
\]
with an upper triangular matrix U and a lower triangular matrix L.
Thus we got the LU-factorization of A,
\[
A = LU \qquad (2.4)
\]
with L = M_1^{-1} \cdots M_{n-2}^{-1} M_{n-1}^{-1}.
We call a_{kk}^{(k-1)} the pivot element at stage k. The matrix A has an LU factorization
as long as all pivot elements are different from zero.
The derivation above is not a description of an algorithm. Setting up all ele-
mentary transformations explicitly and performing multiplications with matrices
which have very few non zero entries would be an enormous waste of computing
resources.
We give an algorithm for the LU factorization as a short piece of MATLAB code:
function [L,U]=lu_np(A)
% Factorizes A into a lower and an
% upper triangular part without pivoting.
% This code does not correspond to MATLAB's command lu.
N=size(A,1);
if N~=size(A,2)
    error('Matrix has to be square')
end
for i=1:N
    pivot=A(i,i);
    if pivot==0
        error('Matrix has zero pivot elements')
    end
    for j=i:N
        L(j,i)=A(j,i)/pivot;        % column i of L
    end
    for k=i+1:N                     % eliminate below the pivot
        for j=i+1:N
            A(k,j)=A(k,j)-L(k,i)*A(i,j);
        end
    end
end
U=zeros(N,N);
for i=1:N                           % copy the upper triangular part into U
    for j=i:N
        U(i,j)=A(i,j);
    end
end
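A short usage sketch (our own example data; the matrix is diagonally dominant, so no pivoting is needed):

A = [4 1 0; 1 4 1; 0 1 4];
b = [1; 2; 3];
[L,U] = lu_np(A);
y = L\b;          % forward substitution (cf. the loops above)
x = U\y;          % backward substitution
norm(A*x - b)     % of the order of the machine epsilon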
Note that not every regular matrix can be LU factorized. A pivot element might be
zero, which leads to a breakdown of the algorithm. This can be seen from the
following (regular) example:
\[
\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
=
\begin{pmatrix} b_1 \\ b_2 \end{pmatrix}
\]
We will give a criterion for matrices which are LU factorizable.
Definition 49
A is called diagonally (row) dominant iff
\[
|a_{ii}| > \sum_{\substack{j=1 \\ j \ne i}}^{n} |a_{ij}| \qquad \forall i = 1, \ldots, n.
\]

Theorem 50
Every diagonally dominant matrix A has an LU decomposition.
If A is diagonally dominant, then so are all A^{(k)}, and
\[
\max_{ij} |a_{ij}^{(k+1)}| \le \max_{ij} |a_{ij}^{(k)}| \le \max_{ij} |a_{ij}|.
\]

From the example above we see that a matrix which has no LU factorization
can be transformed into a matrix which has an LU factorization by permuting
the rows:
\[
\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
=
\begin{pmatrix} b_2 \\ b_1 \end{pmatrix}
\]

Interchanging rows of a matrix can mathematically be expressed by premultiplication
by a permutation matrix P. Permutation matrices are row-permuted identity ma-
trices. Here is an example of a matrix which permutes the second with the fourth
row when premultiplying a 4 × 4 matrix:
\[
\begin{pmatrix}
1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 \\
0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0
\end{pmatrix}
\]
We modify now the LU factorization above by introducing row permutations
after each step: in general the rows have to be interchanged,
\[
A^{(k+1)} = M_k P_k A^{(k)},
\]
with a permutation matrix P_k which interchanges the rows in such a way that
the pivot element becomes the largest element in modulus in the column segment A(k:n, k)
(MATLAB notation). Consequently,
\[
|l_{ik}| \le 1, \qquad i = k+1, \ldots, n.
\]
Looking for the largest element in a column and then interchanging rows is called
partial pivoting, in contrast to complete pivoting, which is a less frequently applied
strategy. There one attempts to interchange both rows and columns to obtain
a pivot element which is the largest element of the current submatrix in the k-th
step.

Theorem 51 If A is a regular matrix, there is always a permutation matrix P
such that P A has an LU factorization.

In MATLAB, LU factorization with pivoting is performed by the command lu.
It returns the triangular factors and the permutation matrix.
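A minimal usage sketch (with the 2 × 2 example from above):

A = [0 1; 1 0];          % has no LU factorization without pivoting
[L,U,P] = lu(A);         % P*A = L*U
b = [1; 2];
x = U\(L\(P*b));         % solve Ax = b via the permuted factorization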
We conclude this section by counting the operations which are required for LU
factorization. It can be read from the MATLAB code lu_np above that
\[
\sum_{i=1}^{N-1} (N-i)^2 = \sum_{k=1}^{N-1} k^2 = \sum_{k=1}^{N} k^2 - N^2
\]
multiplications and as many additions are needed.
By noting
\[
k^2 = \int_{k-1}^{k} x^2\,dx + k - \frac13
\]
and
\[
\sum_{k=1}^{N} k^2 = \sum_{k=1}^{N} \left( \int_{k-1}^{k} x^2\,dx + k - \frac13 \right)
\]
we finally get
\[
\sum_{k=1}^{N} k^2 - N^2 = \frac{N^3}{3} - \frac{N^2}{2} + \frac{N}{6}
\]
multiplications for the LU decomposition. Additionally we have to perform N(N − 1)/2 divisions.
Often we have to solve the same linear system for different right hand sides b.
In that case the factorization step needs only to be performed once and only the
(cheaper) forward and backward substitution steps have to be performed for the
different right hand sides. A particular example is the numerical evaluation of
the mathematical expression
A−1 B
This can be rewritten as
AX = B
where X is a matrix with columns x^{(i)}. Every column is then the solution of a
linear system
Ax^{(i)} = b^{(i)}.
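A sketch in MATLAB (assuming A and B are given; X = A\B does the same in one call):

[L,U,P] = lu(A);          % factorize once
X = U\(L\(P*B));          % two triangular solves per column of B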

2.1.2 Matrix Norms, Inner Products and Condition Numbers
Numerical computations are always influenced by errors. The errors in the results
of our algorithms we considered so far have mainly two sources

• Round-off errors

• errors in the input data .

Later we will meet a third error source, the truncation error when solving things
iteratively.
In order to study the effects of errors we have to be able to measure the size of
errors. Errors are often described as relative quantities, i.e.

relative error = absolute error / exact solution

and as the exact solution often is not available we consider instead

relative error = absolute error / obtained solution.
An error in the result of a linear system is a vector, thus we have to be able to
measure sizes of vectors.
For this end we introduce norms

Definition 52
A vector norm is a mapping ‖·‖ : R^n → R with
• ‖x‖ ≥ 0
• ‖x‖ = 0 ⇔ x = 0
• ‖x + y‖ ≤ ‖x‖ + ‖y‖
• ‖αx‖ = |α| ‖x‖, α ∈ R

(see also [Spa94, p. 111]).

Norms we use in this course:
\[
\|x\|_p := (|x_1|^p + \ldots + |x_n|^p)^{1/p},
\]
the so-called p-norm or Hölder norm.

Example 53
‖x‖_1 = |x_1| + . . . + |x_n|
‖x‖_2 = (|x_1|² + . . . + |x_n|²)^{1/2} (Euclidean norm)
‖x‖_∞ = max_i |x_i|

Theorem 54
All norms on R^n are equivalent in the following sense:
there are constants c_1, c_2 > 0 such that for all x
\[
c_1 \|x\|_\alpha \le \|x\|_\beta \le c_2 \|x\|_\alpha
\]
holds.

Example 55
\[
\|x\|_\infty \le \|x\|_2 \le \sqrt{n}\,\|x\|_\infty, \qquad
\|x\|_2 \le \|x\|_1 \le \sqrt{n}\,\|x\|_2, \qquad
\|x\|_\infty \le \|x\|_1 \le n\,\|x\|_\infty
\]

Recall from your calculus course, that the definition of convergence is based on
norms. The ultimate consequence of this theorem is that an iteration process
in a finite dimensional space converging in one norm is also converging in any
other norm. For proving convergence we just can select a norm which is the most
convenient for the particular proof. Note, that in infinite dimensional spaces
(function spaces) this nice property is lost.
We relate now vector norms to matrices. The concept is highly based on viewing
matrices as linear maps
A : Rn −→ Rn .

Definition 56
\[
\|A\|_p = \max_{x \ne 0} \frac{\|Ax\|_p}{\|x\|_p} = \max_{\|x\|_p = 1} \|Ax\|_p
\]
defines a matrix norm, which is called subordinate to the vector norm ‖x‖_p.
Some matrix norms:
\[
\|A\|_1 = \max_j \sum_i |a_{ij}|
\]
\[
\|A\|_2 = \bigl(\max_i \lambda_i(A^T A)\bigr)^{1/2}, \quad \text{where } \lambda_i(A) \text{ denotes the } i\text{-th eigenvalue of } A
\]
\[
\|A\|_\infty = \max_i \sum_j |a_{ij}|
\]
\[
\|A\|_F = \Bigl(\sum_i \sum_j |a_{ij}|^2\Bigr)^{1/2} \quad \text{(Frobenius norm)}
\]

MATLAB Vector and matrix norms can be computed in MATLAB by the command norm,
which takes additional argument to define the type of norm, i.e. ’inf’ stands
for infinity norm.
We consider now the sensitivity of the linear system Ax = b with respect to
perturbations ∆b of the input data b.

Ax̂ = A(x + ∆x) = b + ∆b (2.5)

How is the relative input error ‖∆b‖/‖b‖ related to the relative output error ‖∆x‖/‖x‖?
From
\[
A\,\Delta x = \Delta b \;\Rightarrow\; \Delta x = A^{-1}\Delta b
\]
we obtain by taking norms
\[
\|\Delta x\| = \|A^{-1}\Delta b\| \;\Rightarrow\; \|\Delta x\| \le \|A^{-1}\|\,\|\Delta b\|.
\]
Note that the right inequality is a direct consequence of Def. 56. Analogously we get
\[
b = Ax \;\Rightarrow\; \|b\| \le \|A\|\,\|x\|.
\]
This leads to
\[
\frac{\|\Delta x\|}{\|x\|} \le \frac{\|A^{-1}\|\,\|\Delta b\|}{\|x\|} = \frac{\|A\|\,\|A^{-1}\|\,\|\Delta b\|}{\|A\|\,\|x\|}.
\]
Thus,
\[
\frac{\|\Delta x\|}{\|x\|} \le \|A\|\,\|A^{-1}\|\,\frac{\|\Delta b\|}{\|b\|}.
\]

Definition 57 κ(A) := ‖A‖ ‖A^{-1}‖ is called the condition number of A.

Condition numbers can be obtained in MATLAB by using the commands cond and
rcond. The first command computes the condition number exactly and takes as
argument a specification of the type of norm used for computing this number.
rcond estimates the inverse of the condition number κ(A)^{-1} with respect to the
1-norm.
We consider also perturbations of the matrix A:
\[
(A + \Delta A)(x + \Delta x) = b
\]
and define for this end A(t) := A + t\Delta A and x(t) := x + t\Delta x, with t ∈ R.
Consider
\[
A(t)\,x(t) = b
\]
and take the derivative w.r.t. t:
\[
A'(t)\,x(t) + A(t)\,x'(t) = 0, \qquad x'(t) = -A(t)^{-1} A'(t)\,x(t).
\]
Thus,
\[
\frac{\|x'\|}{\|x\|} \le \|A^{-1}\|\,\|A'\| = \|A^{-1}\|\,\|A\|\,\frac{\|A'\|}{\|A\|}.
\]
Note that A'(t) = \Delta A and x'(t) = \Delta x. Thus
\[
\frac{\|\Delta x\|}{\|x\|} \le \kappa(A)\,\frac{\|\Delta A\|}{\|A\|}.
\]

Example 58 The relative error due to round-off is called the machine
epsilon. Usually we have ε ≈ 10^{-16} when using double precision arithmetic.
If there is no other error in the input data we obtain
\[
\frac{\|\Delta x\|}{\|x\|} \le \kappa(A)\,\varepsilon .
\]
If κ(A) = 1, no amplification of the relative input error occurs. If κ(A) = 10^k, we
lose in the worst case k digits of accuracy.

The number κ(A)^{-1} can be viewed as the (relative) distance from A to the nearest singular
matrix.
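A small illustration of this loss of accuracy (our own example, using the built-in and notoriously ill-conditioned Hilbert matrix):

n = 10;
A = hilb(n);                      % cond(A) is about 1.6e13
xexact = ones(n,1);
b = A*xexact;
x = A\b;
cond(A)
norm(x - xexact)/norm(xexact)     % roughly cond(A)*eps: only a few digits are correct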

2.2 Nonsquare Linear Systems

In this section we consider linear systems of the form
\[
Ax = b \quad \text{with } A \in R^{m \times n}
\]
and m > n. This kind of problem often occurs when performing data fitting.

Example 59
We would like to fit the data
i    | 0    | 1    | 2   | 3   | 4
t_i  | -1.0 | -0.5 | 0.0 | 0.5 | 1.0
y_i  |  1.0 |  0.5 | 0.0 | 0.5 | 2.0

by a quadratic polynomial of the form

p(t) = a2 t2 + a1 t + a0 .

We set up interpolation conditions p(t_i) = y_i:
\[
\begin{pmatrix}
1 & t_0 & t_0^2 \\
1 & t_1 & t_1^2 \\
1 & t_2 & t_2^2 \\
1 & t_3 & t_3^2 \\
1 & t_4 & t_4^2
\end{pmatrix}
\begin{pmatrix} a_0 \\ a_1 \\ a_2 \end{pmatrix}
=
\begin{pmatrix} y_0 \\ y_1 \\ y_2 \\ y_3 \\ y_4 \end{pmatrix}.
\]

This leads to a nonsquare linear system. It is overdetermined in the sense that
already three data points would have been enough to uniquely define a quadratic
polynomial.
In general, this linear system has no solution at all, because the measurements
might not fit a quadratic polynomial.
While often physical laws determine the degree of the polynomial, this effect is
due to measurement errors. Just reducing the amount of information to make
the system solvable would be the wrong way to attack the problem, because a
single erroneous measurement would get too strong an influence on the result.
Therefore we formulate the problem in a different way:
Find an x̂ with

\[
\|A\hat{x} - b\|_2 = \min_x \|Ax - b\|_2 = \min_x \|r(x)\|_2 \qquad (2.6)
\]
with the residual vector r(x) := b − Ax.
A necessary condition for \hat{x} to be a minimizer is
\[
\frac{d}{dx}\,\|r(x)\|_2^2 \Big|_{x=\hat{x}} = 0. \qquad (2.7)
\]
From
\[
\|r(x)\|_2^2 = r^T r = (b - Ax)^T (b - Ax) = b^T b - 2x^T A^T b + x^T A^T A x
\]
we take the first derivative w.r.t. x. This gives the condition for \hat{x}:
\[
A^T A \hat{x} - A^T b = 0 \qquad (2.8)
\]

These equations are called normal equations and their solution a least squares
solution of the overdetermined linear system.
Normal equations have a geometric interpretation: Consider the range space
R(A). It is spanned by the columns of A.
By writing the normal equations as

AT (b − Ax) = AT r(x) = 0

we see that the residual corresponding to the least squares solution has to be
normal (orthogonal) to the columns of A or, in other words, to the range space
of A. This justifies the name ”normal” equations. This result can be generalized
as follows
Theorem 60
Let V be a finite dimensional linear space with an inner product ⟨·,·⟩. Let
U ⊂ V be a subspace and
\[
U^\perp := \{ v \in V \mid \langle v, u\rangle = 0 \;\; \forall u \in U \}
\]
be its orthogonal complement in V.
Then, for all v ∈ V,
\[
\|v - u^*\| = \min_{u \in U} \|v - u\| \;\Longleftrightarrow\; v - u^* \in U^\perp \qquad (2.9)
\]
with the norm ‖v‖ = ⟨v,v⟩^{1/2} induced by the inner product ⟨·,·⟩.

[Figure 2.1 shows the vector b, its projection Ax onto Im(A), and the residual r.]

Figure 2.1: Geometric interpretation of the normal equations

Proof:
Let u* ∈ U be the unique point with v − u* ∈ U^⊥. (Why is this point unique?)
Then for all u ∈ U we have
\[
\|v - u\|^2 = \|v - u^*\|^2 + 2\langle v - u^*, u^* - u\rangle + \|u^* - u\|^2
= \|v - u^*\|^2 + \|u^* - u\|^2 \ge \|v - u^*\|^2
\]
(the middle term vanishes since u* − u ∈ U and v − u* ∈ U^⊥), where equality holds only for u = u*. ✷

To compute the least squares solution from the normal equations requires first
forming the matrix A^T A. It can be shown that the condition number is squared by
this process, which results in an unnecessarily high sensitivity with respect to
perturbations. This can be avoided by using special techniques like orthogonal
factorization of A or the singular value decomposition [Hea97].
Overdetermined systems are solved in MATLAB with the same command as
square systems, i.e. by using "\". Note that totally different algorithms are
performed by one and the same command. In the overdetermined case
MATLAB does not solve the least squares problem by directly setting up and
solving the normal equations. For stability reasons, which will be
explained later, MATLAB uses orthogonal factorizations (see Sec. 2.2.3) instead.
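As an illustration, a sketch using the data of Example 59; both routes give essentially the same coefficients here, but the backslash route is the numerically preferable one:

t = [-1 -0.5 0 0.5 1]';
y = [ 1  0.5 0 0.5 2]';
A = [ones(size(t)) t t.^2];     % columns 1, t, t^2
a = A\y;                        % least squares solution via orthogonal factorization
a_ne = (A'*A)\(A'*y);           % normal equations give (up to round-off) the same result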

2.2.1 Projections
From Fig.2.1 it is intuitively clear that the normal equations and the least squares
approach are related to projections. In linear algebra projections are defined by
Definition 61
An n × n matrix P is called an orthogonal projection if it satisfies
• P^2 = P
• P^T = P

As P v ∈ Range(P) we also say that P projects onto Range(P).
Orthogonal projections have the property that the vector v − P v is orthogonal
to any vector in Range(P):
\[
\langle v - P v, P x\rangle = v^T P x - v^T P^T P x = v^T P x - v^T P^2 x = 0.
\]

Example 62

• The least squares fit A\hat{x} = A(A^T A)^{-1}A^T b is the projection of b onto Range(A),
because A(A^T A)^{-1}A^T is a projector onto Range(A).

• If P is an orthogonal projector, so is I − P; it projects onto Range(P)^⊥.

• For ‖w‖_2 = 1, P = I − ww^T is a projector onto the hyperplane w^T x = 0.

• Let Q_1 = (q_1, . . . , q_r) be an m × r matrix with Q_1^T Q_1 = I. Then
Q_1 Q_1^T is a projector onto R(Q_1) and
I − Q_1 Q_1^T is a projector onto R(Q_1)^⊥ = N(Q_1^T).

• A projector is either singular or the identity.
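A quick numerical check of the first item (a sketch with an arbitrary full rank matrix of our own choosing):

A = [ones(5,1) (1:5)'];          % any full rank 5-by-2 matrix
P = A*((A'*A)\A');               % projector onto Range(A)
norm(P*P - P)                    % approximately 0 (P^2 = P)
norm(P' - P)                     % approximately 0 (P^T = P)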

2.2.2 Condition of Least Squares Problems

We investigate now the sensitivity of least squares solutions with respect to per-
turbations of the right hand side b (measurements) and with respect to pertur-
bations of the matrix A (time points).
For this end we first define

Definition 63
The angle δ between b ∈ X and a subspace U ⊂ X is defined by
\[
\sin\delta = \frac{\|b - P b\|_2}{\|b\|_2},
\]
where P is an orthogonal projection onto U (see Fig. 2.1).

The following theorem relates this angle to the condition of the least squares
problem with respect to perturbations of the data (perturbations in measurements
= perturbations of b, perturbations in measurement time = perturbation in A):

Theorem 64 (Condition of a least squares problem)
Let A ∈ R^{m×n} be a full rank matrix, b ∈ R^m, and consider the least squares problem
min_x ‖Ax − b‖_2.
The condition number of this problem

• with respect to perturbations in b is given by
\[
\kappa \le \frac{\kappa_2(A)}{\cos\delta}
\]

• and with respect to perturbations in A by
\[
\kappa \le \kappa_2(A) + \kappa_2(A)^2 \tan\delta .
\]

Here
\[
\kappa_2(A) := \left( \frac{\max_i \lambda_i(A^T A)}{\min_i \lambda_i(A^T A)} \right)^{1/2}
\]
is the condition number of A with respect to ‖·‖_2 and δ is the angle between b and R(A) as defined above.

In the extreme case δ = π/2 the condition number becomes infinite, which expresses
the fact that the data has no relation to the problem. This indicates a wrong
model of the physical problem.
On the other hand, if δ is small, then the condition number is of the size of κ(A).
It is worthwhile to compare this fact to the condition of the normal equations,
which is
\[
\kappa_2(A^T A) = \kappa_2(A)^2 .
\]
Thus, often the condition of the least squares problem is significantly smaller than
the condition of the normal equations, and one would introduce an "artificial"
sensitivity with respect to perturbations if one attempted to solve the least
squares problem via the normal equations. So we seek an alternative charac-
terization of the least squares solution which avoids forming the matrix A^T A.
This alternative way may be computationally more expensive but it will be more
stable, i.e. less sensitive to perturbations.
For this end we will discuss orthogonal factorizations of A in the next subsection.

2.2.3 Orthogonal factorizations

First, we recall the definition of an orthogonal matrix:

Definition 65 An n × n matrix Q is called orthogonal if
\[
Q^T Q = I .
\]

First, note that an orthogonal projection is described by an orthogonal matrix only
in the trivial case P = I. So do not confuse the two terms.
A direct consequence of the definition is det(Q) = ±1 (see the determinant multi-
plication theorem). Furthermore we see from the definition of the 2-norm ‖·‖_2
that for orthogonal matrices ‖Q‖_2 = 1 holds. This makes orthogonal matrices so
important in numerical analysis: transformations by orthogonal matrices do not
change the condition of linear systems.
Example 66

• Rotations (det(Q) = +1) are described by orthogonal matrices. A 2 × 2
rotation matrix is given by
\[
Q = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix}
\]

• Reflections (det(Q) = −1) are described by orthogonal matrices. A 2 × 2
reflection matrix has the form
\[
Q = \begin{pmatrix} \cos\theta & \sin\theta \\ \sin\theta & -\cos\theta \end{pmatrix}
\]

We assume now that we can write the m × n matrix A as a product of an m × m
orthogonal matrix Q and an m × n upper triangular matrix R,
\[
A = QR,
\]
or schematically
\[
A = \begin{pmatrix} Q_1 & Q_2 \end{pmatrix} \begin{pmatrix} R_1 \\ 0 \end{pmatrix},
\]
where Q_1 consists of the first n columns of Q and R_1 is an n × n upper triangular matrix.
We can reformulate the normal equations Eq. (2.8) by using this factorization:
\[
A^T A x - A^T b = R^T Q^T Q R x - R^T Q^T b
\]
\[
R_1^T R_1 x = R_1^T Q_1^T b
\]
\[
R_1 x = Q_1^T b .
\]
So, instead of solving the normal equations we can solve
\[
R_1 x = Q_1^T b \qquad (2.10)
\]

and we avoid forming the product A^T A.
By the relation
\[
\|Ax - b\|_2^2 = \|Q^T(Ax - b)\|_2^2 = \|R_1 x - Q_1^T b\|_2^2 + \|Q_2^T b\|_2^2
\]
we even obtain an expression for the norm of the residual of the least squares
solution:
\[
\|r\|_2 := \min_x \|Ax - b\|_2 = \|Q_2^T b\|_2 .
\]
In MATLAB there is a command qr performing the QR factorization. The numerical
algorithm is based on either successive rotations of the coordinate system or
successive reflections, corresponding to the geometric interpretation of orthogonal
matrices given above.
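A sketch of how the qr command can be used to solve the least squares problem along the lines of (2.10) (assuming A and b are the data of an overdetermined system as above):

[Q,R] = qr(A,0);        % "economy size": Q has n columns (Q_1), R is n-by-n (R_1)
x = R\(Q'*b);           % backward substitution for R_1 x = Q_1^T b
res = norm(A*x - b);    % equals ||Q_2^T b||_2 up to round-off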

2.2.4 Householder Reflections and Givens Rotations

• Householder reflections
• Givens rotations

In this course we will discuss only Householder reflections and refer to the literature,
e.g. [Gv96], for Givens rotations.
The principal idea of Householder reflections can be described geometrically:
given a vector v, a reflection across span(v)^⊥ is given by the (orthogonal) matrix
\[
H = I - 2\,\frac{v v^T}{v^T v} .
\]
(Check that Hx = x for all x ∈ span(v)^⊥.)

[Figure 2.2 shows a vector a, the hyperplane span(v)^⊥, and its mirror image Ha.]

Figure 2.2: Reflection across span(v)^⊥

We select the vector v in such a way that it reflects a given vector a onto a multiple
of the first unit vector:
\[
Ha = H\begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_m \end{pmatrix} = \begin{pmatrix} \tilde{a}_1 \\ 0 \\ \vdots \\ 0 \end{pmatrix}.
\]
This can be achieved by setting v = a ∓ σe_1 with σ := ‖a‖_2. It can easily be checked
that with this choice Ha = ±σe_1.
This special choice of v is illustrated in Fig. 2.3.

[Figure 2.3 shows the vectors a, v, and the reflected vector Ha along the first coordinate axis.]

Figure 2.3: Householder transformation for annihilating entries in a vector

The method is best illustrated by the following MATLAB sequence:

function [v]=house(a1,m)
% v=house(a1,m)
% computes the vector v of a Householder matrix
% that transforms the m-vector a1 into
% sigma e_1, where e_1 is the
% first unit vector and sigma is
% up to a sign norm(a1)
%
sigma=norm(a1);
e1=zeros(m,1);
e1(1)=1;
alpha=a1'*e1;                     % first component of a1
v=a1+sign(alpha)*sigma*e1;
gamma=sigma*(sigma+abs(alpha));   % gamma = v'*v/2 (cf. below)
sigma=-sign(alpha)*sigma;         % Ha = sigma*e_1
This code applied to the vector a := (1, 2, 3)^T gives the following result: Ha = (−3.7417, 0, 0)^T.
It is important to note that multiplications with Householder matrices can be
done with n + 1 multiplications and additions, as
\[
Ha = \Bigl(I - \frac{v v^T}{\gamma}\Bigr) a = a - \frac{1}{\gamma}\,(v^T a)\, v
\]
with γ := ½ v^T v. (A standard matrix-vector multiplication requires n² multiplications and additions.)
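A two-line sketch of this (assuming v from house above and a vector x of the same length):

gamma = (v'*v)/2;
Hx = x - ((v'*x)/gamma)*v;    % H*x without ever forming the matrix H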

With these elementary Householder transformations a matrix can be transformed
into a triangular matrix. For this end we apply n − 1 Householder transformations
H_1, . . . , H_{n-1} to A, where the i-th transformation introduces zeros in the i-th column,
while leaving the columns 1, . . . , i − 1 unaffected.
To demonstrate the process we assume that the first two columns are already
transformed, i.e.
\[
H_2 H_1 A = \begin{pmatrix}
x & x & x & x & x \\
0 & x & x & x & x \\
0 & 0 & \tilde{a}_{33} & x & x \\
0 & 0 & ? & x & x \\
0 & 0 & ? & x & x
\end{pmatrix}.
\]
The goal of the third transformation is then to construct a Householder matrix
\tilde{H}_3 with
\[
\tilde{H}_3 \begin{pmatrix} \tilde{a}_{33} \\ ? \\ ? \end{pmatrix} = \begin{pmatrix} \bar{a}_{33} \\ 0 \\ 0 \end{pmatrix}.
\]
Then we set
\[
H_3 = \begin{pmatrix} I & 0 \\ 0 & \tilde{H}_3 \end{pmatrix},
\]
where we augment \tilde{H}_3 by the identity matrix to keep the earlier columns of A
unaffected.
If we then set Q^T = H_{n-1} \cdot \ldots \cdot H_1 we obtain Q^T A = R, where R is the desired
triangular matrix, and we thus got the decomposition A = QR. For implemen-
tational details see [Gv96].

2.2.5 Rank Deficient Least Squares Problems

So far we assumed A ∈ R^{m×n} to be a full-rank matrix (Rank(A) = min(n, m)).
Now we consider the general case
\[
\mathrm{Rank}(A) \le \min(n, m).
\]
The least squares problem (and the normal equations) still has a solution, but it
is no longer unique.
Among those solutions we are interested in computing the one with the least
Euclidean norm. Thus we have to solve the problem
\[
\min_{x \in L(b)} \|x\|_2 \quad \text{with} \quad L(b) := \{x : \|Ax - b\| = \min\}.
\]
It is called the minimum norm least squares solution.


In order to characterize this solution we make first some definitions.

Definition 67
An n × m matrix A+ is called the More-Penrose pseudoinverse of the m × n
matrix A if the following properties hold
2.2. NONSQUARE LINEAR SYSTEMS 61

1. (A+ A)T = (A+ A)


2. (AA+ )T = (AA+ )
3. A+ AA+ = A+
4. AA+ A = A.

It can be shown that A^+ is uniquely defined by these conditions.
Let us look first at some examples:

Example 68
• If m = n and A is regular, then A^+ = A^{-1}.
• If m ≥ n and A has full rank, then
\[
A^+ = (A^T A)^{-1} A^T .
\]

These examples show that, if the linear system has a unique solution in the
"classical" or in the "least squares" sense, then it can be expressed by x* = A^+ b.
Furthermore we note that A A^+ is an orthogonal projector onto R(A). Thus it
follows by Theorem 60 that x* = A^+ b is a solution of min_x ‖Ax − b‖_2, i.e.
\[
\|A A^+ b - b\|_2 = \min_x \|Ax - b\|_2
\]
(set in Th. 60 U := R(A) and u := Ax).


All other solutions have the form

x = x∗ + v = A+ b + v with v ∈ N (A)

Furthermore we note that if b = 0 then x∗ ∈| N (A) (see property (3) in Def. 67.
Thus,
min x2 = min +
x2 = min x∗ + v.
x∈L(b) x=A b+v v∈N (A)

Again by Theorem 60, the solution of this problem satisfies x∗ + v ∈ N (A)⊥ ,


consequently v = 0.
This proves the following theorem, which characterizes the minimum norm least
squares solution by the pseudo inverse
Theorem 69
The solution of

min x2 with L(b) := {x : Ax − b = min}


x∈L(b)

is xa st = A+ b.
The pseudoinverse A+ can be computed via the singular value decomposition of
A, which is a generalization of the diagonalization of a symmetric matrix by a
similarity transformation with orthogonal matrices:

Theorem 70
Any matrix A ∈ Rm×n can be factorized in

A = U ΣV T

with U ∈ Rm×m and V ∈ Rn×n being orthogonal matrices and Σ ∈ Rm×n with

Σ = diag(σ1 , . . . , σmin(m,n) )

and σi ≥ 0.

This factorization is called singular value decomposition and the σi are called
singular values.
In this course we will not present an algorithm for numerically performing the
singular value decomposition. It is very much related to algorithms for computing
eigenvalues of a general real matrix. We refer to standard text books like [Gv96].
We note some properties of the singular values, which can easily be checked:

• If A = A^T, then the singular values are the absolute values of the eigenvalues of A.

• In general, the σi2 are just the eigenvalues of AT A.

• If Rank(A) = k < min(m, n) then σi = 0 for i > k.

• If Σ = diag(σ_1, . . . , σ_k, 0, . . . , 0) then Σ^+ = diag(σ_1^{-1}, . . . , σ_k^{-1}, 0, . . . , 0).

• A+ = V Σ+ U T .

From the last property we see how the pseudoinverse can be constructed via the
singular value decomposition (SVD). In MATLAB the singular value decomposi-
tion is obtained by running the command

[U,S,V] = svd(A)

Compute from U, S and V the pseudoinverse of a singular m × n matrix and
compare the result to the output of MATLAB's command

Aplus = pinv(A)
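A sketch of this construction (the tolerance below mirrors the default rank tolerance of pinv; treat the exact choice as an assumption):

[U,S,V] = svd(A);
s = diag(S);
tol = max(size(A))*eps(max(s));       % treat tiny singular values as zero
r = sum(s > tol);                     % numerical rank
Splus = zeros(size(A'));
Splus(1:r,1:r) = diag(1./s(1:r));
Aplus = V*Splus*U';                   % compare with pinv(A)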
Chapter 3

Signal Processing

3.1 Discrete Fourier Transformation


We return for a short while to the interpolation task. In Chapter 1.2 we in-
terpolated data by polynomials and computed for this purpose the polynomial
coefficients. The basis was chosen in such a way, that the computational process
becomes as efficient as possible. The coefficients themselves played an important
role in the case of a Bernstein basis. There, they have a simple geometrical in-
terpretation and they could be used as control parameters to influence the shape
of the resulting polynomial in an easy way.
In this chapter we interpolate large data sets by trigonometric polynomials. Also
in that case, the coefficients have an important interpretation. They can be
related to frequencies and are therefore control parameters to influence the spec-
trum of the function.1

Definition 71 The complex valued functions
\[
\omega^j(t) = \bigl(e^{i2\pi t}\bigr)^j = e^{i2\pi j t}
\]
are called basic complex trigonometric polynomials (i := \sqrt{-1}). These are
complex-valued periodic functions with period 1.
The space of all complex trigonometric polynomials of maximal degree N is
\[
T_C^N := \Bigl\{ \varphi(t) \;\Big|\; \varphi(t) = \sum_{j=0}^{N-1} c_j e^{i2\pi j t},\; c_j \in C \Bigr\}.
\]

Let again yi , i = 0, . . . , N − 1 denote given measurements at equidistant time


points
0 = t0 < t1 < . . . < tN −1 .
1
Parts of the material in this section follows the lines of [DH95]. For additional reading we
suggest [Jam95].


We assume in this chapter that the measurements are samples of a periodic
function with period T. Thus y_N = y_0 and t_N = T, and t_i − t_{i-1} = h = T/N.
The quotient N/T = r is called the sampling rate; it gives the number of samples
per second.

Example 72 In MATLAB you can generate the samples from a sound file. You
will find on the homepage of the course a sound file kaktus.au. By applying the
MATLAB commands

NM1=auread(’kaktus’,’size’)
[y,rate]=auread(’kaktus’);

you get the number of samples, here N = 480720, the sampling rate r = 8012 and
finally the samples y ∈ R. We can complete the data vector by yN = y0 . Thus
playing the sound file will take T = N/r = 60 sec.

In the following we will assume, that the time scale is normalized in such a way
that T = 1.
The interpolation task requires determining complex coefficients c_j ∈ C such that
\[
\varphi(t_k) = \sum_{j=0}^{N-1} c_j e^{i2\pi j t_k} = y_k
\]
holds. Writing these conditions as a linear system results again in a Vandermonde-
like system
\[
\underbrace{\begin{pmatrix}
\omega_0^{N-1} & \cdots & \omega_0 & 1 \\
\omega_1^{N-1} & \cdots & \omega_1 & 1 \\
 & \cdots & & \\
\omega_{N-1}^{N-1} & \cdots & \omega_{N-1} & 1
\end{pmatrix}}_{=:A}
\underbrace{\begin{pmatrix} c_{N-1} \\ \vdots \\ c_1 \\ c_0 \end{pmatrix}}_{=:x}
=
\underbrace{\begin{pmatrix} y_0 \\ y_1 \\ \vdots \\ y_{N-1} \end{pmatrix}}_{=:b}
\qquad (3.1)
\]
with \omega_k := e^{i2\pi t_k}.
For the amount of data we consider now, this system can no longer be solved
in "finite" time even with fast computers. For the example above it would require
about 3.7 · 10^16 complex multiplications and as many additions. However, the
nature of the problem allows us to reduce this work drastically. But before we
demonstrate this, we will give an interpretation of the resulting coefficients.
As the y_i are real numbers, we get
\[
y_k = \varphi(t_k) = \sum_{j=0}^{N-1} c_j e^{i2\pi j t_k}
= \overline{\varphi(t_k)} = \sum_{j=0}^{N-1} \bar{c}_j e^{-i2\pi j t_k}
= \sum_{j=0}^{N-1} \bar{c}_{N-j} e^{i2\pi j t_k}.
\]
Thus, when y_i ∈ R,
\[
c_j = \bar{c}_{N-j}. \qquad (3.2)
\]

For odd N, i.e. N − 1 = 2n, we get
\[
\varphi(t_k) = c_0 + \sum_{j=1}^{2n} c_j e^{i2\pi j t_k}
= c_0 + \sum_{j=1}^{n} \bigl( c_j e^{i2\pi j t_k} + \bar{c}_j e^{-i2\pi j t_k} \bigr).
\]
By using Euler's formulas
\[
\cos t = \frac{e^{it} + e^{-it}}{2} \quad \text{and} \quad \sin t = \frac{e^{it} - e^{-it}}{2i} \qquad (3.3)
\]
we obtain
\[
\varphi(t_k) = c_0 + 2\sum_{j=1}^{n} \bigl( \mathrm{Re}(c_j)\cos(2\pi j t_k) - \mathrm{Im}(c_j)\sin(2\pi j t_k) \bigr)
= \frac{a_0}{2} + \sum_{j=1}^{n} \bigl( a_j \cos(2\pi j t_k) + b_j \sin(2\pi j t_k) \bigr)
\]
with a_j := 2Re(c_j) = c_j + \bar{c}_j = c_j + c_{N-j} and b_j := −2Im(c_j) = i(c_j − \bar{c}_j) = i(c_j − c_{N-j}).
We got two representations of the trigonometric interpolation polynomial, a com-
plex and a real one:
\[
\varphi(t) = c_0 + \sum_{j=1}^{2n} c_j e^{i2\pi j t}
= \frac{a_0}{2} + \sum_{j=1}^{n} \bigl( a_j \cos(2\pi j t) + b_j \sin(2\pi j t) \bigr) \qquad (3.4)
\]
For even N, i.e. N = 2n, we get similarly
\[
\varphi(t) = c_0 + \sum_{j=1}^{2n-1} c_j e^{i2\pi j t}
= \frac{a_0}{2} + \sum_{j=1}^{n-1} \bigl( a_j \cos(2\pi j t) + b_j \sin(2\pi j t) \bigr) + \frac{a_n}{2}\cos(2\pi n t). \qquad (3.5)
\]
This now gives us the interpretation of the coefficients: the measurements are
signals composed of trigonometric functions. If t ∈ [0, 1], then (a_j² + b_j²)^{1/2} gives
the amplitude at the frequency j Hz and arctan(−b_j/a_j) the corresponding phase.

Definition 73 The transformation
\[
\begin{pmatrix} y_0 \\ \vdots \\ y_{N-1} \end{pmatrix}
\;\longmapsto\;
\begin{pmatrix} c_0 \\ \vdots \\ c_{N-1} \end{pmatrix}
\]
with c_j given by Eq. (3.1) is called the Discrete Fourier Transformation (DFT).
[Figure 3.1 shows two stem plots over frequency (Hz): the amplitude spectrum (left) and the phase in radians (right).]

Figure 3.1: Amplitude and Phase
Example 74 Consider the function f (t) = sin(44 · 2πt + 1) + 0.2 sin(10 · 2πt)
and assume that 100 samples are taken equidistantly in [0, 1]. From the Fourier
coefficients cj we obtain the amplitude and phase depicted in Fig. 3.1.
The corresponding MATLAB code to generate this picture is:
N=100;
t=linspace(0,1,N+1);
t=t(1:N); %Erase the last point
signal=sin(44*2*pi*t+1)+0.2*sin(10*2*pi*t);% generate the signal
c=fft(signal)/N; % Compute the Fourier coefficients
amplitude=sqrt((2*real(c)).^2+(-2*imag(c)).^2);
phase=atan((2*imag(c))./(2*real(c)));
% erase phase values caused by round-off errors
phase(find(amplitude<1.e-5))=0; %check: help find
figure(1)
stem([0:N-1],amplitude)
figure(2)
stem([0:N-1],phase)
The figure clearly reflects the two frequencies, 44 Hz and 10 Hz, contained in the
signal. Additionally one observes that the picture is symmetric and the frequencies
are reflected at 50 Hz. This is a consequence of the property (3.2) of the Fourier
coefficients. The phase plot reflects the phase shifts, −π/2 at 10 Hz and 1 − π/2
at 44 Hz. Note that the phase shift is related to the phase of the cosine function.
In the MATLAB code the Fourier coefficients are computed via the command
fft, which stands for Fast Fourier Transformation, an algorithm which will be
explained in the rest of this chapter. Note the division by N in the MATLAB
code: MATLAB uses a slightly different definition of the Fourier transformation
than we use in this course; the definitions differ by this factor.

For solving Eq. (3.1) we note first an important property of the complex trigono-
metric base polynomials ω^j(t):

Theorem 75 Let t_k = k/N and ω_k := e^{i2\pi t_k} = e^{i2\pi k/N}. Then
\[
\sum_{j=0}^{N-1} \omega_j^{k}\,\omega_j^{-l} = N\,\delta_{kl} \qquad (3.6)
\]
with δ_{kl} being the Kronecker symbol (cf. p. 5).

Thus the ω_j^k, k = 0, . . . , N − 1, form an orthogonal system with respect to the
inner product (scalar product)
\[
\langle \xi, \psi \rangle := \frac{1}{N}\sum_{j=0}^{N-1} \xi_j \bar{\psi}_j
\]
of the sequence space \{ (\xi_i)_{i=0,\ldots,N-1} \mid \xi_i \in C \}.
Using this fact, we can directly write down the solution of Eq. (3.1):
\[
\begin{pmatrix} c_{N-1} \\ \vdots \\ c_1 \\ c_0 \end{pmatrix}
= \frac{1}{N}
\begin{pmatrix}
\omega_0^{-(N-1)} & \cdots & \omega_{N-2}^{-(N-1)} & \omega_{N-1}^{-(N-1)} \\
\omega_0^{-(N-2)} & \cdots & \omega_{N-2}^{-(N-2)} & \omega_{N-1}^{-(N-2)} \\
 & \cdots & & \\
1 & \cdots & 1 & 1
\end{pmatrix}
\begin{pmatrix} y_0 \\ y_1 \\ \vdots \\ y_{N-1} \end{pmatrix}
\qquad (3.7)
\]
as, due to Th. 75 and the fact \omega_j^k = \omega_k^j,
\[
\frac{1}{N}
\begin{pmatrix}
\omega_0^{N-1} & \cdots & \omega_0 & 1 \\
\omega_1^{N-1} & \cdots & \omega_1 & 1 \\
 & \cdots & & \\
\omega_{N-1}^{N-1} & \cdots & \omega_{N-1} & 1
\end{pmatrix}
\begin{pmatrix}
\omega_0^{-(N-1)} & \cdots & \omega_{N-2}^{-(N-1)} & \omega_{N-1}^{-(N-1)} \\
\omega_0^{-(N-2)} & \cdots & \omega_{N-2}^{-(N-2)} & \omega_{N-1}^{-(N-2)} \\
 & \cdots & & \\
1 & \cdots & 1 & 1
\end{pmatrix}
= I
\]
holds.
Thus the Fourier coefficients c_j can be obtained by a matrix-vector multiplication,
which reduces the amount of work from 2n^3 complex operations to 2n^2.
Let us look at the matrix-vector multiplication in some more detail. We get
\[
c_j = \frac{1}{N}\sum_{k=0}^{N-1} y_k\,\omega_k^{-j} = \frac{1}{N}\sum_{k=0}^{N-1} y_k\,\omega_1^{-kj}. \qquad (3.8)
\]
We assume that N is even, i.e. N = 2M, and consider first even indices (j = 2l):
\[
c_{2l} = \frac{1}{N}\sum_{k=0}^{M-1} y_k\,\omega_1^{-2kl} + \frac{1}{N}\sum_{k=M}^{N-1} y_k\,\omega_1^{-2kl}
= \frac{1}{N}\sum_{k=0}^{M-1} y_k\,\omega_1^{-2kl} + \frac{1}{N}\sum_{k=0}^{M-1} y_{k+M}\,\omega_1^{-2kl}
\]
[Figure 3.2 shows the 8th unit roots ω^j, j = 0, . . . , 7, on the unit circle in the complex plane.]

Figure 3.2: Unit roots

We note that \omega_1^{-2kl} = \omega_1^{-2(k+M)l} and \omega_1^2 = \omega_2 (cf. Fig. 3.2). Consequently,
\[
c_{2l} = \frac{1}{N}\sum_{k=0}^{M-1} (y_k + y_{k+M})\,\omega_2^{-kl}. \qquad (3.9)
\]
Correspondingly, we get for odd indices (j = 2l + 1)
\[
c_{2l+1} = \frac{1}{N}\sum_{k=0}^{M-1} (y_k - y_{k+M})\,\omega_1^{-k}\,\omega_2^{-kl}. \qquad (3.10)
\]

Thus, by rearranging the sums, we could halve the computational effort.
We define
\[
\alpha_k^{[0]} := y_k + y_{k+M}
\]
for the data in the "even step" (that the step is "even" is indicated by the
superscript "0") and
\[
\alpha_k^{[1]} := (y_k - y_{k+M})\,\omega_1^{-k}
\]
for the data in the "odd step" (that the step is "odd" is indicated by the super-
script "1").
With these definitions Eqs. (3.9) and (3.10) read
\[
c_{2l} = \frac{1}{N}\sum_{k=0}^{M-1} \alpha_k^{[0]}\,\omega_2^{-kl}
\quad \text{and} \quad
c_{2l+1} = \frac{1}{N}\sum_{k=0}^{M-1} \alpha_k^{[1]}\,\omega_2^{-kl}.
\]

We obtained the same type of formulas as (3.8); we only replaced N by M = N/2,
y by α and ω_1 by ω_2. Note that the powers of ω_2 run through the unit circle
in Fig. 3.2 twice as fast as the powers of ω_1. If M is even, the procedure can be
repeated and the number of sums can be halved another time. Now we have to
distinguish the cases l even and l odd. We add another superscript to α to mark
which case we considered (see the example below). The optimal situation occurs if
N = 2^p. Then this transformation can be iterated until we have only a single
term.
We will first describe the procedure by an example:

Example 76 Let N = 8 and j = 5. During the process we have to successively divide j
by 2 and to apply the "even" formula if we obtain an even number
or the "odd" formula if we get a remainder 1. This can be read off the binary
representation of j. In this case j = (101)_2, i.e. we have first to apply the "odd"
formula, then the "even" and finally the "odd" one again:
\[
c_5 = \frac{1}{N}\sum_{k=0}^{3} \underbrace{(y_k - y_{k+4})\,\omega_1^{-k}}_{=:\alpha_k^{[1]}}\;\omega_1^{-4k}
    = \frac{1}{N}\sum_{k=0}^{3} \alpha_k^{[1]}\,\omega_2^{-2k} \qquad \text{(odd)}
\]
\[
\phantom{c_5} = \frac{1}{N}\sum_{k=0}^{1} \underbrace{\bigl(\alpha_k^{[1]} + \alpha_{k+2}^{[1]}\bigr)}_{=:\alpha_k^{[10]}}\,\omega_2^{-2k}
    = \frac{1}{N}\sum_{k=0}^{1} \alpha_k^{[10]}\,\omega_4^{-k} \qquad \text{(even)}
\]
\[
\phantom{c_5} = \frac{1}{N}\sum_{k=0}^{0} \underbrace{\bigl(\alpha_k^{[10]} - \alpha_{k+1}^{[10]}\bigr)\,\omega_4^{-k}}_{=:\alpha_k^{[101]}}
    = \frac{1}{N}\,\alpha_0^{[101]} \qquad \text{(odd)}
\]
Similarly, we get c_j = \frac{1}{N}\,\alpha_0^{[\mathrm{mirror}_2(j)]}, where \mathrm{mirror}_2 just reverses the binary rep-
resentation of j, e.g.
\[
j = 3, \quad (j)_2 = 011, \quad \mathrm{mirror}_2(j) = 110.
\]

The scheme for computing all coefficients is depicted in Fig. 3.3.


[Figure 3.3 shows the butterfly scheme: the samples y_0, . . . , y_7 are combined stepwise into the quantities α^{[0]}, α^{[1]}, then α^{[00]}, . . . , and finally α^{[000]}, . . . , α^{[111]}, which after bit reversal give the coefficients c_0, . . . , c_7.]

Figure 3.3: Schematic representation of the FFT for N = 8

The general idea of the FFT algorithm (FFT=fast Fourier transformation) can
best be described by the following MATLAB code:
function c=dfft(y)
% c=dfft(y)
% discrete fourier transformation of y
%
N=length(y);
omega_N=exp(-i*2*pi/N);
c=zeros(1,N);
%
p=log2(N);
if round(p) ~= p
    error('N is not a power of 2')
end
NRED=N;
for ind=1:p
    NRED_old=NRED;
    NRED=NRED/2;
    NSEG=2^(ind-1);              % number of even/odd segments
    for ISEG=0:NSEG-1
        fac=1;
        for kk=1:NRED
            k=kk+ISEG*NRED_old;
            alpha_even=y(k)+y(k+NRED);
            alpha_odd =(y(k)-y(k+NRED))*fac;
            fac=fac*omega_N;
            y(k)=alpha_even;
            y(k+NRED)=alpha_odd;
        end
    end
    omega_N=omega_N^2;
end
% Sorting the indices and normalizing by N
% (this could be done by a simple bit-handling instead)
for j=0:N-1
    jbin=dec2bin(j,p+1);
    jbininv=jbin(p+1:-1:2);
    c(j+1)=y(bin2dec(jbininv)+1)/N;
end
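A quick comparison with MATLAB's built-in fft (a sketch; note again the factor 1/N in our definition). If the routine above works as intended, the difference is at round-off level:

N = 8;
y = rand(1,N);
c1 = dfft(y);
c2 = fft(y)/N;
norm(c1 - c2)        % approximately 0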
[Figure 3.4 plots the number of flops used by the FFT for N = 4090, . . . , 4099 on a logarithmic scale; the effort is smallest for N = 4096 = 2^12 and largest for the prime values of N, with the prime factors of the remaining values indicated in the plot.]

Figure 3.4: Computational effort for FFT depending on the prime factors of N

The basic idea of this algorithm is due to Cooley and Tucker. It’s success is based
on the fact that it requires only O(N log2 N ) multiplications if N is a power of
two.
If the number of samples N is no power of two, the iteration follows no longer
a binary tree and the sums (3.8) are split corresponding to the prime factors of
N . The computational works increases with the size of the prime factors and in
the extrem, when N is prime the computational effort becomes O(n2 ) which is
just the work which has to be done for the matrix-vector multiplication in (3.7),
cf. Fig. 3.4.
Chapter 4

Iterative Methods

All methods we discussed so far were finite in nature, i.e. the result was obtained
in a finite number of computational steps. The computational effort for obtaining
a numerical solution can be predicted, as the number of operation depends only
on the problem type, not on the particular data. For example, to perform LU fac-
torization of a non-sparse matrix requires always the same number of operations.
If we assume that computations can be carried out without any round-off error, these
methods would give the exact answer to the given problem in a finite number
of arithmetic operations (+, −, ·, /). However, that is an exceptional situation.
In most cases solutions of mathematical problems cannot be computed exactly.
This is particularly the case if the solution is an irrational number like √2, for
example. Therefore, solutions of nonlinear equations, e.g.
\[
x^2 - 2 = 0,
\]
can not be computed exactly. As eigenvalues are defined as solutions of the char-
acteristic polynomial, they are not computable in a finite number of operations.
The methods we will consider now are based on iterative processes. The nu-
merical solution is the limit of a convergent sequence {x_n}. Iteration means that
one computes x_i based on previous elements x_{i−1}, x_{i−2}, . . . and estimates the dis-
tance ‖x_i − x^*‖. If this quantity is small enough, x_i is taken as the numerical
approximation to the problem at hand. How many iterates it needs to reach this
tolerance limit depends highly on the data. Consequently, the computational
effort depends on the data and not only on the type of problem. Even assuming
that computations can be carried out at infinite precision (no round-off), the
result will be in general not the exact solution due to the truncation of the lim-
iting process. We obtain only approximative solutions and we need a good error
estimation along with the method.
We discuss in this course iterative method to compute eigenvalues and zeros of
nonlinear functions.


4.1 Computation of Eigenvalues


Recall the definition

Definition 77 Let A be a real n × n matrix, then λ ∈ C is called an eigenvalue


of A if
det(A − λI) = 0.
If λ is an eigenvalue of A, then a nonzero vector x ∈ C^n is called a corresponding eigen-
vector of A if
Ax = λx.

Furthermore we want to recall some basic properties of eigenvalues and eigenvec-


tors

• An n × n matrix has n (not necessarily distinct) eigenvalues λ1 , λ2 , . . . , λn .

• If there are n linearly independent eigenvectors X = [x_1, . . . , x_n], then
X^{-1}AX = diag(λ_1, . . . , λ_n).

• If λ is an eigenvalue, then its complex conjugate λ̄ is also an eigenvalue of A.

• If A = A^T, then all eigenvalues are real. The eigenvectors are linearly inde-
pendent and can be chosen to form an orthonormal system, i.e. XX^T = I.

• The eigenvalues of A are the zeros of its characteristic polynomial

p(λ) = det(A − λI)

It is a well-known result from algebra that only for polynomials up to
degree 4 can the zeros be computed in a finite number of steps. So eigenvalues of matrices
with n ≥ 5 can only be computed iteratively and we will get only approximate
results.
Often, it is possible to estimate the location of the eigenvalues. For the methods
we discuss later in this section it is often sufficient to know in advance how the
eigenvalues are clustered.
A standard tool to get some a-priori information about the location of the eigen-
values is the following theorem by Gerschgorin:

Theorem 78 Every eigenvalue λ_j ∈ C of the n × n matrix A lies in at least one
of the circles
\[
B_i := \Bigl\{ z \in C : |z - a_{ii}| \le \sum_{\substack{j=1\\ j \ne i}}^{n} |a_{ij}| =: r_i \Bigr\} \qquad (4.1)
\]

Proof:
Let λ be an eigenvalue of A and x a corresponding eigenvector. Choose the
index i such that |x_i| = ‖x‖_∞. We write the i-th component of the relation
Ax − λx = 0 as
\[
(Ax)_i = \lambda x_i .
\]
Subtracting a_{ii}x_i gives
\[
(Ax)_i - a_{ii}x_i = (\lambda - a_{ii})x_i .
\]
Consequently,
\[
|\lambda - a_{ii}|\,|x_i| = |(Ax)_i - a_{ii}x_i|
\le \sum_{\substack{j=1\\ j\ne i}}^{n} |a_{ij}x_j|
= \sum_{\substack{j=1\\ j\ne i}}^{n} |a_{ij}|\,|x_j|
\le \sum_{\substack{j=1\\ j\ne i}}^{n} |a_{ij}|\,|x_i| .
\]
Thus λ ∈ B_i. ✷
Furthermore it can be shown that if the union of r circles Bi is not intersecting
the remaining Bj , then the union contains exactly r eigenvalues.
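A small sketch that computes the centers and radii of the Gerschgorin circles for a matrix of our own choosing and compares them with the exact eigenvalues:

A = [4 1 0; 1 3 -1; 0 -1 2];
centers = diag(A);
radii = sum(abs(A),2) - abs(centers);   % r_i = sum over j ~= i of |a_ij|
[centers radii]                          % every eigenvalue lies in one of these circles
eig(A)                                   % compare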

4.1.1 Power iteration

We consider a symmetric N × N matrix A. Let 0 ≠ x^{(0)} ∈ R^N be a given vector
and set
\[
x^{(n)} = A x^{(n-1)}. \qquad (4.2)
\]

Definition 79 The quantity
\[
\mu^{(n)} := \frac{{x^{(n)}}^T A\, x^{(n)}}{\|x^{(n)}\|_2^2} \qquad (4.3)
\]
is called the Rayleigh quotient.

We will now relate the Rayleigh quotient to the largest eigenvalue (in modulus) of
A:

Theorem 80 Let λ_1, λ_2, . . . , λ_N be the eigenvalues of A with |λ_1| > |λ_i| for i ≥ 2 and let
x_1 be the eigenvector corresponding to λ_1. The iterates x^{(n)} generated by the
recursion (4.2) have the property
\[
\lim_{n\to\infty} \mu^{(n)} = \lambda_1
\]
if x_1^T x^{(0)} ≠ 0.

Proof:
Let x_1, . . . , x_N be an orthonormal system of eigenvectors of A. Then there are
coefficients α_i such that
\[
x^{(0)} = \sum_{i=1}^{N} \alpha_i x_i .
\]
Due to the orthogonality of the eigenvectors we get x_i^T x^{(0)} = α_i and especially, by the
assumption on x^{(0)}, α_1 ≠ 0.
Furthermore,
\[
x^{(1)} = A x^{(0)} = \sum_{i=1}^{N} \alpha_i A x_i = \sum_{i=1}^{N} \alpha_i \lambda_i x_i
\]
and by iterating
\[
x^{(n)} = A x^{(n-1)} = A^n x^{(0)} = \sum_{i=1}^{N} \alpha_i \lambda_i^n x_i .
\]
Due to the orthonormality of the x_i:
\[
{x^{(n)}}^T x^{(n)} = \|x^{(n)}\|_2^2 = \sum_{i=1}^{N} \alpha_i^2 \lambda_i^{2n}
\quad \text{and} \quad
{x^{(n)}}^T A\, x^{(n)} = \sum_{i=1}^{N} \alpha_i^2 \lambda_i^{2n+1} .
\]
Thus,
\[
\mu^{(n)} = \frac{\sum_{i=1}^{N} \alpha_i^2 \lambda_i^{2n+1}}{\sum_{i=1}^{N} \alpha_i^2 \lambda_i^{2n}}
= \lambda_1\,\frac{\sum_{i=1}^{N} \alpha_i^2 \bigl(\tfrac{\lambda_i}{\lambda_1}\bigr)^{2n+1}}{\sum_{i=1}^{N} \alpha_i^2 \bigl(\tfrac{\lambda_i}{\lambda_1}\bigr)^{2n}} .
\]
As |λ_1| > |λ_i| for i ≥ 2 and α_1 ≠ 0 we get
\[
\lim_{n\to\infty} \mu^{(n)} = \lambda_1 .
\]
With essentially the same technique it can be shown that
\[
\lim_{n\to\infty} \frac{x_k^T x^{(n)}}{\|x^{(n)}\|} = 0, \qquad k = 2, 3, \ldots, N, \qquad (4.4)
\]
which implies that x^{(n)} converges towards the direction of the eigenvector x_1.


The iteration (4.2) is called power iteration. It is the simplest method to compute
the largest eigenvalue and the corresponding eigenvector.
We collect some properties of that iteration:

• The better the eigenvalues are separated, the faster the convergence (see
exercises).

• If λ_1 = λ_2, then
\[
\lim_{n\to\infty} \frac{x_k^T x^{(n)}}{\|x^{(n)}\|} = 0, \qquad k = 3, 4, \ldots, N, \qquad (4.5)
\]
and x^{(n)} no longer converges to an eigenvector, but
\[
\lim_{n\to\infty} x^{(n)} = x^* \in \mathrm{span}\{x_1, x_2\}.
\]
Nevertheless, lim µ^{(n)} = λ_1 = λ_2.

By power iteration we obtain the largest eigenvalue. The smallest eigenvalue


can be obtained by the inverse power iteration, which is based on the following
theorem:

Theorem 81 Let λ be an eigenvalue of a regular n × n matrix A and x the


corresponding eigenvector. Then

• λ−1 is an eigenvalue of A−1 with eigenvector x

• λ − s is an eigenvalue of A − sI with eigenvector x.

So, if we apply the power iteration to A−1 we would obtain the largest eigenvalue
of A−1 which is just the inverse of the smallest eigenvalue of A. This way of
applying power iteration is called the inverse power iteration method.
The second statement of Theorem 81 enables us to compute also
other eigenvalues, not only the largest or smallest. If we assume that s_i is a
good guess of the i-th eigenvalue of A, then A − s_iI has an eigenvalue which is
near zero and consequently we can expect that (A − s_iI)^{-1} has (λ_i − s_i)^{-1} as
its largest eigenvalue in modulus. Often Gerschgorin's theorem can be applied to
obtain a good guess for a certain eigenvalue. This technique to compute the i-th
eigenvalue of A is called the eigenvalue shift technique.
The inverse power iteration is a good example for a problem, where the same
LU factorization is applied to many different right hand side vectors as we can
rewrite the iteration in the following way in order to avoid direct inversion of A:

x(n) = A−1 x(n−1) ⇔ Ax(n) = x(n−1) .

Note, eigenvectors are unique up to scaling. The scaling factor tends to grow
during the inverse iteration process. That is the reason, why the iterates are
normalized after each step.
We summarize the algorithm for the inverse power iteration:

• Let x(0) be a given vector.



• LU-factorize A (normally with pivoting).

• Solve Ax^{(1)} = x^{(0)} for x^{(1)}.

• Set x^{(1)} := x^{(1)}/‖x^{(1)}‖.

• Compute µ^{(1)}.

• Iterate these steps, i.e. solve Ax^{(k)} = x^{(k-1)} for x^{(k)}, set x^{(k)} := x^{(k)}/‖x^{(k)}‖
and compute the Rayleigh quotient µ^{(k)}.

• If for some n the difference |µ^{(n)} − µ^{(n-1)}| is sufficiently small, then set
λ = (µ^{(n)})^{-1}.

• Apply the eigenvalue shift to repeat the process for the next eigenvalue.
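The algorithm above, with an eigenvalue shift s, can be sketched in MATLAB as follows (our own illustrative code, no safeguards; here the eigenvalue is recovered directly from a Rayleigh quotient of the shifted matrix, a slight variation of the bookkeeping in the list above):

function [lambda, x] = invpower(A, s, x, tol)
% Inverse power iteration with shift s: converges to the eigenvalue of A
% closest to s and a corresponding eigenvector.
[L,U,P] = lu(A - s*eye(size(A,1)));    % factorize (A - sI) once
lambda_old = inf; lambda = 0;
x = x/norm(x);
while abs(lambda - lambda_old) > tol
    lambda_old = lambda;
    z = U\(L\(P*x));                   % solve (A - sI) z = x
    lambda = s + (x'*z)/(z'*z);        % Rayleigh quotient of A at the new iterate
    x = z/norm(z);                     % normalize to keep the iterates bounded
end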

Recall that this algorithm is applied to symmetric matrices and that its con-
vergence depends on how good the eigenvalues are separated from each other.
The main application of this technique can be found in eigenvalue problems for
boundary value problems, where depending on the particular discretization meth-
ods symmetric matrices of the type (1.36) occur.
There are also iteration methods for general, non symmetric matrices. These
are based on an iterative similarity transformation of A to block triangular form
(Schurform), cf. [Gv96].

4.2 Fixed Point Iteration


Fixed point iteration is the basis of nearly all iteration methods in Numerical
Analysis. The principle can be illustrated for a control circuit which has the
properties:

[Figure 4.1 shows a control circuit: the system state x^{(i)} is measured, fed into the controller ϕ, and the control signal produces the new state x^{(i+1)}.]

Figure 4.1: Control circuit

1. There is exactly one desired state x^*. If the system is in this state, the
controller does not change anything:
\[
x^* = \varphi(x^*).
\]
In that sense, the desired state x^* is called a fixed point of the controller ϕ.

2. The controller tries to reduce deviations from the desired state x^*,
\[
\|\varphi(x) - x^*\| \le L\,\|x - x^*\|, \qquad L < 1, \qquad (4.6)
\]
in order to steer the system towards x^*.

Formally speaking, such a control system is described by a fixed point iteration
or functional iteration
\[
x^{(i)} := \varphi(x^{(i-1)}) \qquad (4.7)
\]
with a given starting vector x^{(0)}.
When applying an iteration scheme like (4.7) we have to ask ourselves the fol-
lowing questions:

1. Are the iterates well defined ?

2. Do they converge ?

3. How fast do they converge ?

To answer these questions, more precise terminology is needed.

Definition 82 Let ϕ : D ⊂ Rn → Rn . We call a point x∗ ∈ D a fixed point of


ϕ if
ϕ(x∗ ) = x∗ (4.8)

Fixed points and zeros of nonlinear functions are closely connected, as

F (x∗ ) = 0 ⇔ x∗ = x∗ − F (x∗ ) =: ϕ(x∗ )

A fundamental property of the function ϕ is c.f. (4.6):

Definition 83 A function ϕ : D ⊆ R^n → R^n is called a contraction on a set
D_0 ⊆ D if there is an L(ϕ) < 1 and a norm such that
\[
\|\varphi(x) - \varphi(y)\| \le L(\varphi)\,\|x - y\| \qquad (4.9)
\]
for all x, y ∈ D_0.
Otherwise the function is called dissipative.

Recall that condition (4.9) implies that ϕ is Lipschitz continuous on D_0.
For differentiable functions, contractivity can be checked by applying the mean
value theorem:
\[
\varphi(x) - \varphi(y) = \left( \int_0^1 \varphi'(x + t(y - x))\,dt \right)(x - y). \qquad (4.10)
\]
In order to be able to apply this theorem we have to require
\[
x, y \in D \;\Rightarrow\; x + t(y - x) \in D \quad \forall t \in [0, 1],
\]
which implies that D has to be a convex set.
Thus on convex sets D we then have
\[
L(\varphi) = \sup_{x\in D} \|\varphi'(x)\|. \qquad (4.11)
\]

In this context it is important to relate norms to eigenvalues. Let us denote by
ρ(A) the absolute value of the largest eigenvalue of A; ρ(A) is sometimes called
the spectral radius of A.

Theorem 84 For every n × n matrix A and every ε > 0 there exists a norm ‖·‖ such that
\[
\|A\| \le \rho(A) + \varepsilon .
\]

By this theorem and Eq. (4.11) we can, for continuously differentiable functions ϕ,
check contractivity by checking eigenvalues: the condition
\[
\rho(\varphi'(x^*)) \le \delta < 1
\]
is sufficient for ϕ being a contraction in a neighborhood of a point x^*.
We are ready now for one of the most central theorems in Numerical Analysis.

Theorem 85 (Fixed Point Theorem by Banach)
Let ϕ : D ⊆ Rn → Rn be contractive on a closed set D0 ⊆ D and suppose ϕ(D0) ⊂ D0.
Then ϕ has a unique fixed point x∗ ∈ D0. Moreover, for any starting point x^(0) ∈ D0 the iteration x^(i+1) = ϕ(x^(i)) converges to x∗.

Proof:
For all x^(0) ∈ D0 we have

‖x^(i+1) − x^(i)‖ = ‖ϕ(x^(i)) − ϕ(x^(i−1))‖ ≤ L‖x^(i) − x^(i−1)‖   (4.12)

and consequently

‖x^(i+1) − x^(i)‖ ≤ L^i ‖x^(1) − x^(0)‖.   (4.13)

We show first that {x^(i)} is a Cauchy sequence:

‖x^(i+m) − x^(i)‖ ≤ ‖x^(i+m) − x^(i+m−1)‖ + · · · + ‖x^(i+1) − x^(i)‖
               ≤ (L^{i+m−1} + L^{i+m−2} + · · · + L^i) ‖x^(1) − x^(0)‖
               = L^i (1 + L + L² + · · · + L^{m−1}) ‖x^(1) − x^(0)‖
               ≤ L^i/(1 − L) ‖x^(1) − x^(0)‖.

Thus {x^(i)} is a Cauchy sequence and, as Rn is complete, there exists an x∗ ∈ Rn with

x∗ = lim_{i→∞} x^(i).

Furthermore, as D0 is closed, x∗ ∈ D0. Now we have to show that x∗ is a fixed point of ϕ:

‖x∗ − ϕ(x∗)‖ = ‖x∗ − x^(i+1) + x^(i+1) − ϕ(x∗)‖
            = ‖x∗ − x^(i+1) + ϕ(x^(i)) − ϕ(x∗)‖
            ≤ ‖x∗ − x^(i+1)‖ + ‖ϕ(x^(i)) − ϕ(x∗)‖
            ≤ ‖x∗ − x^(i+1)‖ + L ‖x^(i) − x∗‖,

where both terms on the right hand side tend to zero as i → ∞. Thus, x∗ = ϕ(x∗). ✷

We turn now to the question of the speed (rate) of convergence and of the error we make when we stop iterating after a finite number of iterations.
Let x∗ be the fixed point. Then we get for contractive functions ϕ:

‖x^(i) − x∗‖ = ‖ϕ(x^(i−1)) − ϕ(x∗)‖
            ≤ L(ϕ) ‖x^(i−1) − x∗‖   (4.14)
            ≤ L(ϕ) ( ‖x^(i−1) − x^(i)‖ + ‖x^(i) − x∗‖ )

and consequently:

‖x^(i) − x∗‖ ≤ L(ϕ)/(1 − L(ϕ)) ‖x^(i−1) − x^(i)‖   (4.15)
This inequality is called an a posteriori error bound. With it we can decide on the quality of the i-th iterate after having computed it.
If we want to know in advance how many iterates are needed to achieve a certain accuracy, we apply an a priori error bound: inserting

‖x^(i−1) − x^(i)‖ = ‖ϕ(x^(i−2)) − ϕ(x^(i−1))‖ ≤ L(ϕ) ‖x^(i−2) − x^(i−1)‖ ≤ · · · ≤ L(ϕ)^{i−1} ‖x^(0) − x^(1)‖

into (4.15) gives

‖x^(i) − x∗‖ ≤ L(ϕ)^i/(1 − L(ϕ)) ‖x^(0) − x^(1)‖   (4.16)

which is the desired a priori bound.
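As an illustration, a fixed point iteration with a stopping criterion based on the a posteriori bound (4.15) might be sketched in MATLAB as follows; the contraction estimate L and the tolerance tol are assumed to be supplied by the user, and all names are chosen for illustration only:

function x = fixpoint(phi, x0, L, tol, maxit)
% Fixed point iteration x^(i) = phi(x^(i-1)) with a posteriori stopping test.
x = x0;
for i = 1:maxit
    xnew = phi(x);
    % a posteriori error bound (4.15): L/(1-L)*||x^(i-1) - x^(i)||
    if L/(1 - L) * norm(x - xnew) <= tol
        x = xnew;
        return
    end
    x = xnew;
end
warning('fixpoint: maximum number of iterations reached');
end

For example, x = fixpoint(@cos, 1, 0.85, 1e-8, 100) converges to the fixed point of cos; on [0, 1] the Lipschitz constant of cos is sin(1) ≈ 0.84 < 0.85.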

Definition 86 (Order or rate of convergence)
An iteration is called convergent of order p if

‖x^(i) − x∗‖ ≤ C ‖x^(i−1) − x∗‖^p.

p is sometimes also called the rate of convergence.

Two special cases are of particular importance:

• p = 2: quadratic convergence. We will see later that this is the ideal order of convergence for Newton's method.

• p = 1: linear convergence. As we see from (4.14), this is the order of convergence of fixed point iterations.

Furthermore, one often considers superlinear convergence, which is achieved if there are constants C_i > 0 with

‖x^(i) − x∗‖ ≤ C_i ‖x^(i−1) − x∗‖

that form a null sequence.

4.3 Newton’s Method


We consider now the problem:
Find x ∈ Rn such that
F (x) = 0 (4.17)
with F : Rn → Rn .
These kinds of problems occur frequently in many technical applications, e.g. the determination of equilibrium points of dynamic processes in chemical, mechanical or electrical engineering. Furthermore, this problem occurs as a subtask in optimization (the gradient has to be zero) and in other numerical methods, e.g. implicit discretization methods for ordinary differential equations.
As usual we first recall what we know from calculus about the solvability of the
problem:

Theorem 87 (Inverse Function Theorem)
Let F be continuously differentiable in an open set D ⊆ Rn with 0 ∈ F(D). Assume furthermore that F′(x) is regular for all x ∈ D. Then

1. there is a locally unique x∗ ∈ D with F(x∗) = 0;

2. there is in a neighborhood V(0) of 0 a continuously differentiable function G with F(G(y)) = y and G(0) = x∗;

3. for the derivative the following relation holds in V(0):

G′(y) = ( F′(G(y)) )^{−1}.

For the proof we refer to e.g. [OR70].


We saw in the beginning of the preceding section how fixed point problems and
the problems of finding zeros of nonlinear functions are related. Fixed point
iteration applied to (4.17) might result in a slowly convergent or even divergent
sequence x(i) depending on the contractivity of ϕ = I − F .
We now modify ϕ so that (4.17) is equivalent to a fixed point problem with optimal contractivity properties:

F(x) = 0 ⇐⇒ x = x − F′(x)^{−1} F(x) =: ϕ_N(x).   (4.18)

Assume that F′(x) is nonsingular in a neighborhood of x∗. Then

ϕ′_N(x) = I − ( d/dx F′(x)^{−1} ) F(x) − F′(x)^{−1} F′(x) = 0   at x = x∗.

Consequently, ϕN is contractive in a neighborhood of x∗ .


A fixed point iteration applied to ϕ_N(x) is Newton's method for (4.17):

x^(i+1) = x^(i) − F′(x^(i))^{−1} F(x^(i)) =: ϕ_N(x^(i)).

In numerical computations, however, the inverse of the Jacobian is never computed. Instead one solves a linear system. Therefore Newton's method is better described by the following algorithmic notation:
Newton’s method:
Iterate the following two steps:

1. Solve the linear system F′(x^(i)) ∆x^(i) = −F(x^(i)) for ∆x^(i).

2. Let x(i+1) := x(i) + ∆x(i) .
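A minimal MATLAB sketch of this two-step iteration could look as follows; the Jacobian is assumed to be supplied as a function handle Fprime, and the names and the stopping criterion based on the increment are illustrative choices, not part of the original text:

function x = newton(F, Fprime, x0, tol, maxit)
% Newton's method: solve F'(x)*dx = -F(x) and update x := x + dx.
x = x0;
for i = 1:maxit
    dx = -Fprime(x) \ F(x);   % solve the linear system for the Newton increment
    x  = x + dx;
    if norm(dx) <= tol
        return
    end
end
warning('newton: no convergence within maxit iterations');
end

For instance, newton(@(x) x.^2 - 2, @(x) 2*x, 1, 1e-12, 20) converges rapidly to sqrt(2).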

Obviously, every iteration step requires

• the computation of a Jacobian

• the solution of a linear system



These high costs can only be compensated by fast convergence. The convergence properties of Newton's method are stated by the Newton–Kantorovitch Theorem. It says that Newton's method is locally quadratically convergent under some conditions on the smoothness of F and on the topological properties of D, i.e. for every x^(0) sufficiently near the solution x∗ a sequence x^(i) is generated with

‖x^(i) − x∗‖ ≤ C_N ‖x^(i−1) − x∗‖².

In many applications no information about the Jacobian F′(x^(i)) is available. Often F is known only as a set of complex subroutines, which are generated automatically by special purpose programs in optimization or engineering. There are modern techniques, called automatic differentiation, which generate a subroutine for the corresponding Jacobian. These techniques can be viewed as a pre-compiler (see http://www.mcs.anl.gov/adifor).

4.3.1 Numerical Computation of Jacobians


Alternatively, the Jacobian can be approximated by finite differences. The k-th column of the Jacobian is approximated by

∆_{ηe_k} F = ( F(x^(i) + η e_k) − F(x^(i)) ) / η

with e_k being the k-th unit vector and η ∈ R a sufficiently small number. The increment η has to be chosen such that the influence of the approximation error ε(η) can be neglected. It consists of truncation errors and roundoff errors in the evaluation of F. Let ε_F be an upper bound for the error in the numerical computation of F; then

|ε_{ij}(η)| = | ∆_{ηe_j} F_i − ∂F_i/∂x_j | ≤ ( 2 ε_F + ½ |∂²F_i/∂x_j²| η² + O(η³) ) / η.   (4.19)

In Fig. 4.2 the overall error for the numerical differentiation of sin at x = 1 is shown. In the left part of the figure the roundoff error dominates, in the right part the truncation error. The slopes in the double logarithmic representation are −1 and +1 for the roundoff and approximation errors, respectively, as expected from (4.19).

Figure 4.2: Error |ε_{ij}(η)| in the numerical differentiation of sin at x = 1, plotted against the perturbation η.

When neglecting η² and higher order terms, this bound is minimized if η is selected according to the rule of thumb

η = 2 ( ε_F |∂²F_i/∂x_j²|^{−1} )^{1/2},

which in practice is often replaced by

η = 2 √ε_F.

The effort for the computation of F′(x) by numerical differentiation consists of n additional evaluations of F. This high effort motivates the use of simplified Newton methods, even when the convergence is no longer quadratic. Note that by numerically approximating F′(x), quadratic convergence is already lost (κ ≠ 0).
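A finite difference approximation of the Jacobian along these lines might be sketched in MATLAB as follows; the increment follows the simplified rule of thumb above, with the machine precision eps used as a crude stand-in for ε_F (an illustrative assumption):

function J = numjac(F, x)
% Forward difference approximation of the Jacobian of F at x.
eta = 2 * sqrt(eps);     % rule of thumb eta = 2*sqrt(eps_F), eps_F ~ machine precision
Fx  = F(x);
n   = length(x);
m   = length(Fx);
J   = zeros(m, n);
for k = 1:n
    e      = zeros(n, 1);
    e(k)   = 1;                               % k-th unit vector
    J(:,k) = (F(x + eta*e) - Fx) / eta;       % k-th column of the Jacobian
end
end

As stated in the text, this costs exactly n additional evaluations of F per Jacobian.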

4.3.2 Simplified Newton Method


In order to save evaluations of the Jacobian and to make full use of the first LU
decomposition, one may freeze the Jacobian in the first iteration step and iterate
the following two steps:

1. Solve the linear system F′(x^(0)) ∆x^(i) = −F(x^(i)) for ∆x^(i).

2. Let x(i+1) := x(i) + ∆x(i) .

This method is called the simplified Newton method. Every single step is much cheaper to compute, because the Jacobian and its LU factorization are already known. However, this method is only linearly convergent.

Newton Convergence Theorem

The main theorem for the convergence of Newton's method is the Newton Convergence Theorem, cf. [OR70, DH95]. We consider it here in a form which is applicable also to the simplified Newton method (set B(x) = F′(x^(0))^{−1} in the theorem) or to the Gauß-Newton method, which we will consider later (set B(x) = F′(x)⁺).
Theorem 88
Let D ⊂ Rn be open and convex and x^(0) ∈ D.
Let F ∈ C¹(D, R^m) and B ∈ C⁰(D, R^{n×m}).
Assume that there exist constants r, ω, κ, δ0 such that for all x, y ∈ D the following properties hold:

1. Curvature condition

‖ B(y) ( F′(x + τ(y − x)) − F′(x) ) (y − x) ‖ ≤ τ ω ‖y − x‖²   (4.20)

with τ ∈ [0, 1].

2. Compatibility condition

‖B(y) R(x)‖ ≤ κ ‖y − x‖   (4.21)

with the compatibility residual R(x) := F(x) − F′(x)B(x)F(x) and κ < 1.

3. Contraction condition

δ0 := κ + (ω/2) ‖x^(1) − x^(0)‖ < 1

with x^(1) = x^(0) − B(x^(0)) F(x^(0)).

4. Condition on the initial guess

D0 := { x : ‖x − x^(0)‖ ≤ r } ⊂ D

with r := ‖x^(1) − x^(0)‖ / (1 − δ0).

Then the iteration

x^(k+1) := x^(k) − B(x^(k)) F(x^(k))

is well-defined with x^(k) ∈ D0 and converges to a solution x∗ ∈ D0 of B(x)F(x) = 0.
The speed of convergence can be estimated by

‖x^(k+j) − x∗‖ ≤ δ_k^j/(1 − δ_k) ‖∆x^(k)‖   (4.22)

with δ_k := κ + (ω/2)‖∆x^(k)‖, and the increments decay according to

‖∆x^(k+1)‖ ≤ κ ‖∆x^(k)‖ + (ω/2) ‖∆x^(k)‖².   (4.23)

By setting k = 0 the a priori error estimate

‖x^(j) − x∗‖ ≤ δ0^j/(1 − δ0) ‖x^(1) − x^(0)‖   (4.24)

can be obtained.
This theorem with its constants needs some interpretation:

1. Curvature condition (1). This is a weighted Lipschitz condition for F′, written in a way that is invariant with respect to scaling. It can be fulfilled with ω = L sup ‖B(y)‖ if F′ is Lipschitz continuous with constant L and B is bounded on D. For linear problems F′ is constant and ω = 0; then the method converges in one step. Hence ω is a measure of the nonlinearity of the problem. For large ω the radius of convergence r may be very small, because ‖x^(1) − x^(0)‖ then has to be very small.

2. Compatibility condition (2). In the case of Newton's method, i.e. B(x) = F′(x)^{−1}, we have R(x) = 0 and thus κ = 0. Then the method converges quadratically.
In the other cases this is a condition on the quality of the iteration matrix B(x). It says how tolerant we can be when replacing F′(x^(k))^{−1} by an approximation B(x). Due to κ ≠ 0 we can then expect only linear convergence. The better the Newton iteration matrix is approximated, the smaller κ becomes.

3. "δ"-conditions (3), (4). These conditions are restrictions on the initial guess x^(0). Newton's method is only locally convergent and the size of the convergence region depends on ω and κ.

4.4 Continuation Methods in Equilibrium Computation
One of the main problems with Newton's method is the choice of a starting value x^(0). In the case of highly nonlinear problems, or due to a poor approximation B(x) of F′(x)^{−1}, the method might converge only for starting values x^(0) in a very small neighborhood of the unknown solution x∗. So, special techniques for generating good starting values have to be applied. One way is to embed the given problem F(x) = 0 into a parameter-dependent family of problems H(x, s) = 0 with

H(x, s) := F(x) − (1 − s)F(x0) = 0   (4.25)
and a given value x0 ∈ D. This family contains two limiting problems. On one
hand we have
H(x, 0) = F (x) − F (x0 ) = 0

which is a system of nonlinear equations with x0 as a known solution. On the


other hand we have
H(x, 1) = F (x) = 0
which is the problem we want to solve.
The basic idea of continuation methods (often also called homotopy or path-following methods) is to choose a partition 0 = s0 < s1 < s2 < · · · < sm = 1 of [0, 1] and to solve a sequence of problems

H(x, si ) = 0, i = 1, . . . , m

by Newton’s method where the solution xi of the ith problem is taken as starting
value for the iteration in the next problem. The key point is that if ∆si = si+1 −si
is sufficiently small, then the iteration process will converge since the starting
value xi for x will be hopefully in the region of convergence of the next subproblem
H(x, si+1 ) = 0.
From the mechanical point of view −F (x0 ) is a force which has to be added to
F (x) in order to keep the system in a non-equilibrium position. The goal of
the homotopy is then to successively reduce this force to zero by incrementally
changing s.
The embedding chosen in (4.25) is called a global homotopy. It is a special case
of a more general class of embeddings, the so-called convex homotopy

H(x, s) := (1 − s)G(x) + sF (x), s ∈ [0, 1] (4.26)

where G ∈ C 1 (D, Rn ) is a function with a known zero, G(x0 ) = 0. By taking


G(x) := F (x) − F (x0 ) the global homotopy is obtained again.
We have
H(x, 0) = G(x), H(x, 1) = F (x),
i.e. the parameter s leads from a problem with a known solution to a problem
with an unknown solution. It describes a path x(s) with x(0) = x0 and x(1) = x∗ .
In general a homotopy function for a continuation method is defined as

H : Rn × R → Rn

with H(x0 , 0) = 0 and H(x, 1) = F (x), where x0 is a given point x0 ∈ D.


The method sketched so far is based on the assumption that there exists a smooth
solution path x(s) without bifurcation and turning points. Before describing con-
tinuation methods in more algorithmic details we look for criteria for an existence
of such a solution path.
To this end we differentiate H(x(s), s) = 0 with respect to the parameter s and obtain

H_x(x, s) x′(s) = −H_s(x, s)

with H_x := dH/dx and H_s := dH/ds.
If H_x(x, s) is regular we arrive at the so-called Davidenko differential equation:

x′(s) = −H_x(x, s)^{−1} H_s(x, s),  x(0) = x0.   (4.27)

The existence of a solution path x(s) at least in a neighborhood of (x(0), 0) can


be ensured by standard existence theorems for ordinary differential equations as
long as Hx has a bounded inverse in that neighborhood. For the global homotopy
(4.25) this requirement is met if F satisfies the conditions of the Inverse Function
Theorem 87.
For the global homotopy the Davidenko differential equation reads

x′(s) = −F′(x)^{−1} F(x0).

We summarize these observations in the following Lemma [AG90].

Lemma 89 Let H : D × I ⊂ R^{n+1} → Rn be a sufficiently smooth map and let x0 ∈ Rn be such that H(x0, 0) = 0 and H_x(x0, 0) is regular.
Then there exists for some open interval I0 ⊂ I with 0 ∈ I0 a smooth curve I0 ∋ s ↦ x(s) ∈ Rn with

• x(0) = x0;

• H(x(s), s) = 0;

• rank(H_x(x(s), s)) = n

for all s ∈ I0.

The direction x′(s) given by the right hand side of the Davidenko differential equation is just the tangent to the solution curve x at s.
Fig. 4.3 motivates the following definition.

Definition 90 A point (x(si), si) is called a turning point if rank(H′(x(si), si)) = n and rank(H_x(x(si), si)) < n.

In the neighborhood of turning points the curve cannot be parameterized by the parameter s, and locally another parameterization must be chosen. We do not cover the topic of turning points in this course and refer to [AG90] instead. We will assume that there are no turning points in [0, 1].
Having just seen that, if no turning points are present, the path x(s) is simply the solution of an ordinary differential equation, the Davidenko differential equation, we could conclude this section by referring to numerical methods for ODEs. Unfortunately, this differential equation cannot be solved in a stable way: small errors are not damped out, and the numerical solution will drift off from the exact solution.

Figure 4.3: Turning points

Most numerical path following methods are predictor-corrector methods. The predictor at step i provides a starting value x_i^(0) for the Newton iteration, which attempts to "correct" this value to x(s_i) =: x∗_i.
We have already seen the predictor of the classical continuation method, which just takes

x_i^(0) := x∗_{i−1}.

There, the predictor is just a constant function in each step.
A more sophisticated predictor is used by the tangential continuation method, where the predictor is defined as

x_i^(0) := x∗_{i−1} + (s_i − s_{i−1}) x′(s_{i−1}).

This is just a step of the explicit Euler method for ODEs applied to the Davidenko differential equation, cf. Sec. 5.7.1.

The number of corrector iteration steps depends on the quality of the prediction. First, we have to require that the predicted value is within the convergence region of the corrector problem. This region depends on the constants given by Theorem 88, mainly on the nonlinearity of F. Even if the predicted value is within the domain of convergence, it should for reasons of efficiency be such that not too many corrector iteration steps are needed. Both requirements demand an elaborate strategy for the step size control, i.e. for the location of the points s_i. For such strategies we refer to [AG90].
A central example for an application of a homotopy method will be given in the project homework of this course.
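Combining the global homotopy (4.25) with a Newton corrector and the simplest (classical) predictor, a continuation method might be sketched as follows in MATLAB. This is only an illustrative sketch with a uniform partition of [0, 1]; the helpers newton and numjac refer to the earlier sketches in this chapter and are not part of the original text:

function x = continuation(F, x0, m, tol)
% Classical continuation for F(x) = 0 using the global homotopy
% H(x,s) = F(x) - (1-s)*F(x0) on a uniform partition 0 < s_1 < ... < s_m = 1.
Fx0 = F(x0);
x   = x0;                                   % known solution of H(x,0) = 0
for i = 1:m
    s = i / m;
    H = @(y) F(y) - (1 - s) * Fx0;          % i-th subproblem H(x, s_i) = 0
    % previous solution serves as predictor (starting value) for Newton
    x = newton(H, @(y) numjac(H, y), x, tol, 25);
end
end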

4.5 Gauß-Newton method


Consider the following example
Example 91 The radiation from a radioactive source is measured behind a protective wall of thickness x as the number of counts per second. This number N depends on x according to theory in the following way:

N(x) = N0 e^{−αx}.

Let us assume we want to determine the parameters α and N0 from m measurements N(x_i), i = 1, . . . , m. If m > 2, then we get a nonlinear least squares problem. (Note that often logarithms are taken to transform it into a linear least squares problem, which might give a different solution!)
This problem is of the following general form:

‖F(x)‖₂² = min

with F : Rn → Rm and m ≥ n.
A necessary criterion for g(x) := F(x)ᵀF(x) having a minimum is

g′(x) = 2 G(x) = 2 F′(x)ᵀ F(x) = 0.

The straightforward approach is to apply Newton's method to G(x) = 0, which gives the iteration

G′(x^(k)) ∆x = −G(x^(k))

with x^(k+1) := x^(k) + ∆x. Note that G′(x) = (d/dx F′(x)ᵀ) F(x) + F′(x)ᵀ F′(x) is, up to the factor 2, the second derivative (the Hessian) of g. The numerical computation of this function is time consuming and often not very reliable. This is why this approach is not very useful. We will consider a second approach instead, which is based on successively linearizing F and then minimizing.
This way we obtain a sequence of quadratic problems:

‖F(x^(k)) + F′(x^(k)) ∆x‖₂² = min

and set x^(k+1) := x^(k) + ∆x. This method is called the Gauß-Newton method.
Note that we obtain in every iteration step a linear least squares problem with the corresponding normal equations:

F′(x^(k))ᵀ F′(x^(k)) ∆x = −F′(x^(k))ᵀ F(x^(k)),

or, expressed by the pseudo-inverse,

∆x = −F′(x^(k))⁺ F(x^(k)),

which shows a formal similarity with Newton's method for nonlinear equations.
Comparing with the first approach, we note that we have now neglected the second-derivative term (d/dx F′(x)ᵀ)F(x) in the Hessian. This can be justified if F(x∗) is sufficiently small, which corresponds to the requirement that the measurement errors are small and unbiased. In that case Newton's convergence theorem assures locally linear convergence, and if F(x∗) = 0 even locally quadratic convergence. For a detailed practical example see the exercises.
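For the radiation measurement of Example 91 a Gauß-Newton iteration might be sketched as follows in MATLAB. The measurement data xi, Ni (column vectors) and the starting guess are assumptions made purely for illustration; the Jacobian of the residual is written out by hand for this model:

% Gauss-Newton sketch for fitting N(x) = N0*exp(-alpha*x) to data (xi, Ni).
% p = [N0; alpha] collects the two parameters; F(p) is the residual in R^m.
F  = @(p) p(1) * exp(-p(2) * xi) - Ni;                  % residual vector
Fp = @(p) [exp(-p(2)*xi), -p(1) * xi .* exp(-p(2)*xi)]; % m x 2 Jacobian
p  = [Ni(1); 0.1];                                      % crude starting guess
for k = 1:20
    dp = Fp(p) \ (-F(p));   % linear least squares step (backslash uses QR)
    p  = p + dp;
    if norm(dp) <= 1e-10, break, end
end

Note that the backslash operator applied to the tall Jacobian solves the linearized least squares problem directly, so the normal equations never have to be formed explicitly.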

4.6 Iterative Methods for Linear Systems


We return in this section to the problem of solving linear systems of the form

Ax = b

where we assume that A ∈ Rn×n is regular. In Chapter 2 we considered so-called


direct methods, which computed the solution x in a finite number of computation
steps by some sort of factorization method. Here, we will consider iterative
methods for this problem, which approximate the solution within some given
accuracy bound. In particular when the problem occurs as a subproblem from
solving partial differential equations (PDEs)
• the dimension of A may be so large that direct methods become too expensive,

• A often has a certain structure (sparsity pattern), which should be exploited


for saving computation time and memory. This sparsity pattern often gets
lost when methods based on matrix factorization are applied. Zeros in the
matrix are often replaced by non-zero elements by this process and the
structure of the matrix is destroyed. This effect is called fill-in.

• In these applications A is often symmetric and even positive definite. We will try to exploit this fact when considering iterative methods.

• PDEs are solved within some accuracy bound. The error depends on the density of the discretization mesh. It would be an unnecessary waste of computation time to solve a subproblem, like the solution of a linear system, to a higher accuracy than the one required for the overall process.
All these points motivate the construction of iterative methods.
We first reformulate the problem into a fixed point problem, following the same idea which we applied when considering nonlinear problems, see p. 79:

Ax = b ⇔ Q^{−1}(b − Ax) + x = x ⇔ (I − Q^{−1}A) x + Q^{−1} b = x

with G := I − Q^{−1}A and c := Q^{−1}b,

where Q can be any regular n × n matrix.


So we have to study the fixed point iteration for the problem

ϕ(x) = Gx + c = x.

From the fixed point theorem (Th. 85) we conclude that the iteration

x(i+1) := Gx(i) + c

converges if
ρ(G) < 1,
where ρ(G) denotes the spectral radius of G, see p. 80.
We consider three important choices of Q:
• Q = I: Richardson iteration

• Q = D, the diagonal part of A = L + D + U: Jacobi iteration

• Q = D + L, with A = L + D + U: Gauss-Seidel iteration

Richardson Iteration
If we just choose Q = I we obtain the iteration

x^(i+1) := x^(i) − A x^(i) + b,

which converges if

ρ(G) = ρ(I − A) = max{ |1 − λ_max(A)|, |1 − λ_min(A)| } < 1.

For real eigenvalues this is equivalent to 0 < λ(A) < 2 for all eigenvalues of A, which severely restricts the class of problems for which this method is applicable.

Jacobi Iteration
We write A as the sum of a diagonal matrix D and the strictly lower and upper triangular parts L and U,

A = L + D + U,

and choose Q = D. This gives the iteration

x^(i+1) := −D^{−1}(L + U) x^(i) + D^{−1} b.

Again, by checking the spectral radius we get a condition for convergence:

ρ(D^{−1}(L + U)) ≤ ‖D^{−1}(L + U)‖_∞ = max_i Σ_{j≠i} |a_{ij}/a_{ii}|.

Thus, Jacobi iteration converges for (strictly) diagonally dominant matrices, for which this quantity is less than one.



Gauss-Seidel Iteration
Here we choose Q = D + L and obtain the iteration

x^(i+1) := −(D + L)^{−1} U x^(i) + (D + L)^{−1} b.

As D + L is a lower triangular matrix, one only has to perform a forward substitution in every iteration step.
Gauss-Seidel iteration converges for all symmetric positive definite matrices A, see [DH95].
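Both splitting iterations can be sketched compactly in MATLAB. The sketch below forms the iteration matrix explicitly for clarity, which is only sensible for small illustrative examples; function and variable names are chosen for illustration:

function x = splitting_iter(A, b, x0, nit, method)
% Jacobi or Gauss-Seidel iteration for A*x = b based on A = L + D + U.
D = diag(diag(A));
L = tril(A, -1);
U = triu(A,  1);
switch method
    case 'jacobi'
        Q = D;          % diagonal part
    case 'gauss-seidel'
        Q = D + L;      % lower triangular part, solved by forward substitution
end
G = eye(size(A)) - Q \ A;   % iteration matrix; convergence requires rho(G) < 1
c = Q \ b;
x = x0;
for i = 1:nit
    x = G * x + c;          % fixed point iteration x := G*x + c
end
end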
Chapter 5

Ordinary Differential Equations

In this chapter we want to compute numerically a function y ∈ C 1 [t0 , te ], which


is the solution of the following initial value problem
ẏ = f (t, y) with y(t0 ) = y0 . (5.1)
Initial value problems occur frequently in applications. The numerical solution of these kinds of problems is a central task in all simulation environments for mechanical, electrical and chemical systems. There are special purpose simulation
programs for applications in these fields, which often require from their users a
deep understanding of the basic properties of the underlying numerical methods.
For constructing methods we might write this ordinary differential equation in integral form:

y(t) = y0 + ∫_{t0}^{t} f(τ, y(τ)) dτ.

Before we study numerical methods for computing a solution of (5.1) we have to ask whether there exists a solution to that problem and whether it is unique. Without existence and uniqueness guaranteed, asking for a numerical solution is pointless. To this end let us review a central and basic result from ODE theory: the existence and uniqueness of a solution of (5.1) requires Lipschitz continuity of f, i.e.

Definition 92 A function f : D ⊂ Rn → Rn is called Lipschitz continuous if there exists a constant L > 0 such that for all x, y ∈ D

‖f(x) − f(y)‖ ≤ L ‖x − y‖.
For functions having this property we get the main theorem for initial value
problems:
Theorem 93 (Picard and Lindelöf )
Let S := {(t, y)|t0 ≤ t ≤ te , y ∈ Rn } and let f : S → Rn be Lipschitz continuous
on S with respect to y, then there exists for every initial value (t0 , y0 ) a unique
solution y(t) of (5.1).


5.1 Differential Equations of Higher Order


It is not necessarily always the first order derivative that appears in the differential
equation. The most prominent example of a second order differential equation is
Newton’s law of mechanics

m · a(t) = f (x, t).

Here m is the mass of a body, a(t) = ẍ(t) is its acceleration and f(x, t) is a force acting on the body. If one is interested in the position as a function of time, the
governing equation is the second order differential equation

ẍ(t) = f (x, t)/m.

Introducing y1 (t) = x(t) and y2 (t) = ẋ(t), the second order ODE can be rewritten
as a first order system of two equations:
ẏ1 (t) = y2 (t),
ẏ2 (t) = ẍ(t) = f (y1 (t), t)/m.
In vector notation this system reads

Ẏ (t) = F (t, Y (t)),


   
where Y = (y1, y2)ᵀ and F(t, Y) = (y2(t), f(y1(t), t)/m)ᵀ. Formally the solution is given by "integration":

Y(t) = Y(t0) + ∫_{t0}^{t} F(τ, Y(τ)) dτ.

Note that now the "initial value"

Y(t0) = (y1(t0), y2(t0))ᵀ = (x(t0), ẋ(t0))ᵀ
is a two component vector. In other words: To solve a second order ODE, two
initial values need to be specified. Correspondingly, to solve a system of n second order ODEs, 2n initial values are needed.

5.2 The Explicit Euler Method


The construction of numerical methods for initial value problems as well as basic
properties of such methods shall first be explained for the simplest method: The
explicit Euler method. Be aware that this method is not the most efficient one
from the computational point of view. In later sections, when a basic under-
standing has been achieved, computationally efficient methods will be presented.

5.2.1 Derivation of the Explicit Euler Method


A general principle for deriving numerical methods is to "discretize" constructions like derivatives, integrals, etc.
Given an initial value problem

ẏ(t) = f (t, y(t)), y(t0 ) = y0

the operation which cannot be evaluated numerically is obviously the limit h → 0 that defines the derivative

ẏ(t) = lim_{h→0} ( y(t + h) − y(t) ) / h.

However, for any positive (small) h, the finite difference

( y(t + h) − y(t) ) / h

can easily be evaluated. By definition, it is an approximation of the derivative ẏ(t). Let us therefore approximate the differential equation ẏ(t) = f(t, y(t)) by the difference equation

( u(t + h) − u(t) ) / h = f(t, u(t)).
Given u at time t, one can compute u at the later time t + h, by solving the
difference equation
u(t + h) = u(t) + hf (t, u(t)).
This is exactly one step of the explicit Euler method. Introducing the notation t_{n+1} = t_n + h and u_n = u(t_n), it reads

un+1 = un + hf (tn , un ), u0 = y 0 . (5.2)

We shall see in Sect. 5.4 that u_n really is a first order approximation to the exact solution y(t_n):

u_n − y(t_n) = O(h),  h → 0.
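A minimal MATLAB sketch of the explicit Euler method (5.2) with constant step size might look as follows; the function name and interface are illustrative choices:

function [t, u] = expeuler(f, t0, te, y0, h)
% Explicit Euler method u_{n+1} = u_n + h*f(t_n, u_n) on [t0, te].
t = t0:h:te;
u = zeros(length(y0), length(t));
u(:,1) = y0;
for n = 1:length(t)-1
    u(:,n+1) = u(:,n) + h * f(t(n), u(:,n));
end
end

Applied for instance to ẏ = −100y via expeuler(@(t,y) -100*y, 0, 1, 1, 0.001), it reproduces the smoothly decaying behaviour discussed in Sect. 5.2.4, whereas h = 0.1 leads to the oscillating blow-up described there.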

5.2.2 Graphical Illustration of the Explicit Euler Method


Given the solution y(tn ) at some time tn , the differential equation ẏ = f (t, y)
tells us “in which direction to continue”. At time tn the explicit Euler method
computes this direction f (tn , un ) and follows it for a small time step tn → tn + h.
This is expressed in the formula (5.2) and illustrated in Fig. 5.1. Obviously each
step introduces an error and ends up on a different trajectory. A natural question,
that will be answered below is: How do these errors accumulate?

Figure 5.1: Explicit Euler Method

5.2.3 Two Alternatives to Derive Euler’s Method


In the first derivation, the derivative was discretized using a finite difference
quotient. This is not the only way to construct numerical methods for initial value
problems. An alternative view on Euler’s method is based on the reformulation
of the problem as an integral equation

y(t) = y0 + ∫_{t0}^{t} f(τ, y(τ)) dτ.

Then any of the quadrature rules of Chapter 1.4 can be applied to approximate
the integral. Choosing the rectangular rule,

∫_{t_n}^{t_{n+1}} f(τ, y(τ)) dτ ≈ h f(t_n, y(t_n)),

we find again Euler’s method

un+1 = un + hf (tn , un ).

Clearly other quadrature rules will lead to other methods.


Finally, a construction principle based on Taylor expansion shall be explained. To
this end, one assumes that the solution of the initial value problem (5.1) can be
expanded in a Taylor series

y(tn + h) = y(tn ) + hẏ(tn ) + O(h2 ).



Ignoring the second order term and using the differential equation to express the
derivative ẏ(tn ) leads also to Euler’s method

un+1 = un + hf (tn , un ).

5.2.4 Testing Euler’s Method


From the three derivations it is clear, that Euler’s method does not compute
the exact solution of an initial value problem. All one can ask for is a reason-
ably good approximation. The following experiment illustrates the quality of the
approximation. Consider the differential equation

ẏ = −100y.

The exact solution is y(t) = y0 e−100t . With a positive initial value y0 it is positive
and rapidly decreasing as t → ∞. The explicit Euler method applied to this
differential equation reads

un+1 = (1 − 100h)un .

With a step size h = 0.1 the numerical “approximation” un+1 = −9un

un = (−9)n u0

oscillates with an exponentially growing amplitude. It does not approximate


the true solution at all. Reducing the step size to h = 0.001 however, yields
un+1 = 0.9un and the numerical solution

un = (0.9)n u0

is smoothly decaying.
Another test example is the initial value problem

ẏ = λ(y − sin t) + cos t,  y(π/4) = 1/√2,

where λ is a parameter. First we set λ = −0.2 and compare the results for
where λ is a parameter. First we set λ = −0.2 and compare the results for
Euler’s method with two different step sizes h = π/10 and h = π/20, see Fig. 5.2.
Obviously, the errors decrease with the step size. Setting now λ = −10, the numerical results for h = π/10 oscillate around the true solution and the errors grow rapidly in every single step. For the reduced step size h = π/20, however, Euler's method gives quite a good approximation to the true solution, see Fig. 5.3.

Figure 5.2: Explicit Euler Method, λ = −0.2, h = π/10 (left), h = π/20 (right)

Figure 5.3: Explicit Euler Method, λ = −10, h = π/10 (left), h = π/20 (right)

5.3 Stability Analysis


The previous section showed that in order to obtain reasonable approximations
the step size in Euler's method has to be chosen small enough; how small
depends on the differential equation. The goal of the present section is to quantify
the condition on the step size. To this end consider the test equation

ẏ = λy, (5.3)

where λ now is a complex parameter. The solution

y(t) = y0 eλt

remains bounded
|y(t)| = |y0 | · |e(α+iβ)t | = |y0 | · |eαt |

if α = Reλ is non positive. In this case it is reasonable to ask that the numerical
solution remains bounded too. For the explicit Euler method,
un+1 = (1 + hλ)un
this demand requires that the amplification factor is bounded by one
|1 + hλ| ≤ 1. (5.4)
The explicit Euler method is called stable for the test equation (5.3) if the step
size h satisfies the condition (5.4). In the case of real and negative λ, this means
h ≤ −2/λ, cf. the experiments in the previous section.
The set
S = {hλ ∈ C : |1 + hλ| ≤ 1}
is called the stability region of the Euler method. It is a disc of radius 1 centered
at (−1, 0), see Fig. 5.4.
Figure 5.4: Explicit Euler Method, stability region

5.4 Local, Global Errors and Convergence


So far qualitative properties of the approximation (boundedness, monotonicity) have been studied. In this section the actual errors will be analyzed. To this end we have to distinguish local and global errors.
Definition 94 Given an initial value problem ẏ = f (t, y) with y(0) = y0 and
numerical approximations un ≈ y(tn ). The difference
en = un − y(tn )

is called the global error. The difference

εn+1 = un+1 − ŷ(tn+1 )

is called the local error, where ŷ is the solution to ŷ˙ = f (t, ŷ) with initial condition
ŷ(tn ) = un .
Note that the local error is an error in the numerical approximation that is introduced in one single time step; at time t_n the values u_n and ŷ(t_n) are identical. The global error, however, is the error at time t_n that has accumulated during n steps of integration. That is the error one naturally observes when performing a numerical calculation. In order to estimate the global error, we first analyze the local error and then study how local errors accumulate during many integration steps.
Local errors can be analyzed by Taylor expansion. We demonstrate this for the
explicit Euler method
un+1 = un + hf (tn , un ).
Inserting the initial condition for ŷ yields

un+1 = ŷ(tn ) + hf (tn , ŷ(tn )). (5.5)

Taylor expansion of ŷ reads

ŷ(t_{n+1}) = ŷ(t_n) + h ŷ˙(t_n) + (h²/2) ŷ¨(t_n) + . . .

Using the differential equation for ŷ we find

ŷ(t_{n+1}) = ŷ(t_n) + h f(t_n, ŷ(t_n)) + (h²/2) ŷ¨(t_n) + . . .   (5.6)

Subtracting (5.6) from (5.5) gives the local error of the explicit Euler method:

ε_{n+1} = −(h²/2) ŷ¨(t_n) + . . .

The accumulation of all local errors during the time stepping procedure determines the global error which is observed after many integration steps. To investigate the global error, we subtract the Taylor expansion of the true solution

y(t_{n+1}) = y(t_n) + h ẏ(t_n) + (h²/2) ÿ(t_n) + . . .
           = y(t_n) + h f(t_n, y(t_n)) + (h²/2) ÿ(t_n) + . . .
from the explicit Euler method

un+1 = un + hf (tn , un ).

This gives the global error recursion

en+1 = en + hf (tn , y(tn ) + en ) − hf (tn , y(tn )) − εn+1 (5.7)

Taking norms and using Lipschitz continuity of f yields

‖e_{n+1}‖ ≤ ‖e_n‖ + h L_f ‖e_n‖ + ‖ε_{n+1}‖.

To get an explicit bound for ‖e_n‖, we apply the following result.

Lemma 95 (discrete Gronwall lemma) Let a_{n+1} ≤ (1 + hµ) a_n + b with h > 0, µ > 0, b > 0 and a0 = 0. Then

a_n ≤ (b/h) (e^{t_n µ} − 1)/µ,   t_n = nh.

For the global error we find the bound

‖e_n‖ ≤ (h max ‖ÿ‖ / 2) · (e^{t_n L_f} − 1)/L_f.

For any fixed time level t_n = nh, the global error decreases linearly with h:

‖e_n‖ = O(h)  as  h → 0.

We say that Euler's method is convergent of order 1. More precisely:

Definition 96 The order of convergence of a method is p if the global error satisfies

‖e_n‖ = O(h^p),  h → 0.

Note that the local error of Euler's method is of second order, ‖ε_n‖ = O(h²). The accumulation of local errors over n = O(h^{−1}) steps causes the order of the global error to decrease by one. The same effect was also observed for quadrature errors in Section 1.4, cf. Example 33. It is also interesting to compare the error accumulation in quadrature formulas, which consists in simply summing up local errors, with the nonlinear recursion (5.7).

5.5 Stiffness
The explicit Euler method is stable for the test equation ẏ = λy, λ < 0, provided the step size h is small enough,

h < −2/λ.

However, for strongly negative λ ≪ −1 this leads to extremely small step sizes. Small step sizes may be reasonable and hence acceptable if the right hand side

of the differential equation is large and the solution has a large gradient. But a strongly negative λ does not necessarily imply large gradients (the right hand side depends on y(t) also). Consider the example

ẏ = λ(y − sin t) + cos t,  λ = −50.   (5.8)

A particular solution is

y_p(t) = sin t.

The general solution of the homogeneous equation is

y_h(t) = c e^{λt},

thus the general solution of the nonhomogeneous differential equation (5.8) is

y(t) = (y0 − sin t0) e^{λ(t−t0)} + sin t.   (5.9)

This solution consists of a slowly varying part, sin t, and an exponentially fast decaying initial layer

(y0 − sin t0) e^{λ(t−t0)}.

Generally, differential equations which admit very fast initial layers as well as slow solution components are called stiff problems.
When in (5.9) y0 ≠ sin(t0) and λ ≪ −1, the layer has a large derivative and it is plausible that a numerical method requires small step sizes. However, after only a short time the initial layer has decayed to zero and is no longer visible in the solution (5.9). Then y(t) ≈ sin t and it would be reasonable to use much larger time steps. Unfortunately, the explicit Euler method does not allow time steps larger than the stability bound h < −2/λ. Even if we start the initial value problem at t0 = π/4 exactly on the slow component y0 = sin(π/4) (that means there is no initial layer present), the explicit Euler approximation diverges with h = 0.3 > −2/λ, see Fig. 5.5. The reason for this effect is as follows. In the first step, the method introduces a local error, so u1 ≠ sin(t1). In the second step the initial layer is activated, the step size is too large, and the error gets amplified.
As it is impossible to avoid local errors, the only way out of this problem is to construct methods with better stability properties. This leads to implicit methods.

5.6 The Implicit Euler Method


Similarly to its explicit counterpart, the implicit Euler method can be derived in several ways. We begin by discretizing the derivative in ẏ(t) = f(t, y(t)). For small h it holds that

( y(t) − y(t − h) ) / h ≈ ẏ(t) = lim_{h→0} ( y(t) − y(t − h) ) / h,

Figure 5.5: Explicit Euler Method

thus

( y(t) − y(t − h) ) / h ≈ f(t, y(t)),
leading to the scheme
un+1 = un + hf (tn+1 , un+1 ). (5.10)
This method is known as the implicit Euler method. Given un ≈ y(tn ), a new
approximation un+1 ≈ y(tn+1 ) is defined by formula (5.10). However, this is an
implicit definition for un+1 and one has to solve the nonlinear equation (5.10) to
compute u_{n+1}. Clearly, the methods of Ch. 4 can be applied for that task. For example, a fixed point iteration applied to (5.10),

u^{(j+1)}_{n+1} = u_n + h f(t_{n+1}, u^{(j)}_{n+1}),  j = 0, 1, 2, . . . ,

is easy to compute once an initial guess u^{(0)}_{n+1} is known. However, to find this guess, which may be a rough approximation to u_{n+1}, an explicit Euler step is good enough:

u^{(0)}_{n+1} = u_n + h f(t_n, u_n).
In this context the explicit Euler step is called a predictor to the fixed point iter-
ation which is then used as a corrector of the approximation. So called predictor–
corrector algorithms will be discussed in more detail in Sect. 5.7.1.
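One step of the implicit Euler method with this explicit Euler predictor and a fixed number of corrector iterations might be sketched as follows in MATLAB; the number of corrections m and the function name are illustrative choices, and for stiff problems one would replace the fixed point corrector by Newton's method as discussed later:

function unew = impeuler_step(f, tn, un, h, m)
% One implicit Euler step u_{n+1} = u_n + h*f(t_{n+1}, u_{n+1}),
% solved approximately by fixed point iteration.
unew = un + h * f(tn, un);          % predictor: explicit Euler step
for j = 1:m
    unew = un + h * f(tn + h, unew);   % corrector: fixed point iteration for (5.10)
end
end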
Before analyzing the implicit Euler method let us first give a second explanation.
We have seen in Sec. 5.2.3 that numerical methods can also be derived from the
integral formulation of the initial value problem

y(t) = y0 + ∫_{t0}^{t} f(τ, y(τ)) dτ.

Figure 5.6: Implicit Euler Method

Approximating the integral by a rectangular rule where the integrand is evaluated at the right end point of the integration interval (instead of the left one),

∫_{t_n}^{t_{n+1}} f(τ, y(τ)) dτ ≈ h · f(t_{n+1}, y(t_{n+1})),

we obtain again the implicit Euler method

u_{n+1} = u_n + h f(t_{n+1}, u_{n+1}).

5.6.1 Graphical Illustration of the Implicit Euler Scheme


The implicit version of Euler's method uses, in the (n + 1)-st step from t_n to t_{n+1}, a numerical slope which equals the slope of the true solution at the new approximation u_{n+1}:

( u_{n+1} − u_n ) / h = f(t_{n+1}, u_{n+1}).
This is illustrated in Fig. 5.6.

5.6.2 Stability Analysis


The main motivation to search for implicit methods rather than explicit ones is
to construct a more stable algorithm. Let us therefore check the stability of the
implicit version. Consider the test equation
ẏ = λy

Figure 5.7: Implicit Euler Method, stability region

and apply the implicit Euler scheme

un+1 = un + hλun+1 .

As the test equation is linear, this is easily solved for u_{n+1}:

u_{n+1} = 1/(1 − hλ) · u_n.

The stability condition requires the amplification factor to be bounded:

| 1/(1 − hλ) | ≤ 1,  Re λ ≤ 0.

This condition is satisfied for any positive step size h. Hence the implicit Euler
method is unconditionally stable and the stability region

S = {hλ ∈ C : |1 − hλ| ≥ 1}

includes the entire left half plane.

5.6.3 Testing the Implicit Euler Method


Does the unconditional stability of the implicit method effect practical computa-
tions? We return to the initial value problem

ẏ = λ(y − sin(t)), y(π/4) = 1/ 2.

which could not be approximated by the explicit Euler method in the case λ =
−10 and h = π/10, see Sect. 5.2.4.
Figs. 5.8 and 5.9 show a stable behaviour of the implicit method independent
of the parameters λ and h. Also the errors obviously decrease as h → 0 (with
λ fixed). In fact, the unconditionally stable implicit Euler method produces
qualitatively correct approximations for all (reasonable) step sizes. Of course the
robustness of the method has its price: solving a nonlinear equation in every
single step.

Figure 5.8: Implicit Euler Method, λ = −0.2, h = π/10 (left), h = π/20 (right)

Figure 5.9: Implicit Euler Method, λ = −10, h = π/10 (left), h = π/20 (right)

5.7 Multistep Methods


5.7.1 Adams Methods
The idea leading to Adams methods is quite simple. It is based on transforming
the initial value problem

ẏ = f (t, y) with y(t0 ) = y0 (5.11)

into its integral form

y(t) = y0 + ∫_{t0}^{t} f(τ, y(τ)) dτ   (5.12)

and then approximating the integrand by an adequate polynomial. We will as-


sume that the time interval under consideration is partitioned into

t0 < t1 < · · · < ti < ti+1 = ti + hi < · · · < te

with the step size hi at step i + 1. Let us assume for the moment that k solution
points at successive time points are given

un+1−i := y(tn+1−i ), i = 1, . . . , k.

Then, by evaluating the function to be integrated (right hand side function in


(5.11) or simply rhs-function) the corresponding derivatives

f (tn+1−i , y(tn+1−i )), i = 1, . . . , k

are known and can be used to define an interpolation polynomial πkp of degree
k − 1 with the property

πkp (tn+1−i ) = f (tn+1−i , un+1−i ), i = 1, . . . , k.

By this requirement the polynomial is uniquely defined, though there are many different ways to represent it. For theoretical purposes the Lagrange formulation is convenient. There, π_k^p is a combination of the Lagrange basis polynomials L_i^{k−1}(t) (cf. Sec. 1.2.1 and Sec. 1.4):

π_k^p(t) = Σ_{i=1}^{k} L_i^{k−1}(t) f(t_{n+1−i}, u_{n+1−i})   (5.13)

with

L_i^{k−1}(t) := Π_{j=1, j≠i}^{k} ( t − t_{n+1−j} ) / ( t_{n+1−i} − t_{n+1−j} ).

They fulfill

L_i^{k−1}(t_{n+1−j}) = δ_{ij}  (Kronecker symbol).

By integrating (5.13) from t_n to t_{n+1}, the Adams–Bashforth scheme in Lagrange formulation for approximating y(t_{n+1}) is obtained:

u^p_{n+1} = u_n + h_n Σ_{i=1}^{k} β^p_{k−i} f(t_{n+1−i}, u_{n+1−i})   (5.14)

with β^p_{k−i} = (1/h_n) ∫_{t_n}^{t_{n+1}} L_i^{k−1}(t) dt.

The number of previous values needed to approximate y(t_{n+1}) is called the number of steps of the method, and all previous values and their derivatives are sometimes called the trail of the method. In the sequel we will denote by u_n the numerical approximation to y(t_n) and set f_n := f(t_n, u_n).
Example 97 For equal (constant) step sizes the Adams–Bashforth methods are given by the following formulas:

k = 1 :  u_{n+1} = u_n + h f_n   (explicit Euler method)
k = 2 :  u_{n+1} = u_n + h ( (3/2) f_n − (1/2) f_{n−1} )
k = 3 :  u_{n+1} = u_n + h ( (23/12) f_n − (16/12) f_{n−1} + (5/12) f_{n−2} ).
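The constant step size two-step method (k = 2) might be sketched in MATLAB as follows; as is usual for multistep methods, the missing starting value is generated here with one explicit Euler step (an illustrative choice, consistent with the starting procedure described in Sect. 5.7.4):

function [t, u] = ab2(f, t0, te, y0, h)
% Two-step Adams-Bashforth method with constant step size h.
t = t0:h:te;
N = length(t);
u = zeros(length(y0), N);
u(:,1) = y0;
fold   = f(t(1), u(:,1));
u(:,2) = u(:,1) + h * fold;                        % starting step: explicit Euler
for n = 2:N-1
    fn = f(t(n), u(:,n));
    u(:,n+1) = u(:,n) + h * (1.5*fn - 0.5*fold);   % AB2: u_{n+1} = u_n + h*(3/2 f_n - 1/2 f_{n-1})
    fold = fn;
end
end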

As a consequence of the construction of the basis polynomials L_i^{k−1} the coefficients β depend on the spacings h_n, . . . , h_{n−k}. In practical codes the polynomials are normally not represented by Lagrange polynomials. For computational efficiency a modified Newton representation is taken instead, cf. Eq. (1.5). To improve the stability of the method an implicit multistep scheme is taken into consideration:
Let us assume for the moment that u_{n+1} and previous values are known. Then a polynomial π^c_{k+1} of degree k can be constructed by requiring

π^c_{k+1}(t_{n+1−i}) = f(t_{n+1−i}, u_{n+1−i}),  i = 0, 1, . . . , k.

Similar to Eq. (5.14) this leads to the so-called Adams–Moulton method:

u_{n+1} = u_n + h_n Σ_{i=0}^{k} β^c_{k−i} f(t_{n+1−i}, u_{n+1−i})   (5.15)

with

β^c_{k−i} := (1/h_n) ∫_{t_n}^{t_{n+1}} L_i^k(t) dt,
L_i^k(t) := Π_{j=0, j≠i}^{k} ( t − t_{n+1−j} ) / ( t_{n+1−i} − t_{n+1−j} ).

Example 98 For equal (constant) step sizes the Adams–Moulton methods are given by the following formulas:

k = 0 :  u_{n+1} = u_n + h f_{n+1}   (implicit Euler method)
k = 1 :  u_{n+1} = u_n + h ( (1/2) f_{n+1} + (1/2) f_n )   (trapezoidal rule)
k = 2 :  u_{n+1} = u_n + h ( (5/12) f_{n+1} + (8/12) f_n − (1/12) f_{n−1} )
k = 3 :  u_{n+1} = u_n + h ( (9/24) f_{n+1} + (19/24) f_n − (5/24) f_{n−1} + (1/24) f_{n−2} ).

In contrast to Eq. (5.14) Adams–Moulton methods are defined by implicit equa-


tions, which must be solved iteratively for un+1 . The iteration process is started
with the “predicted” value upn+1 from the Adams–Bashforth scheme (5.14).
This results in the Adams predictor-corrector scheme:

Predict (P):    u^p_{n+1} = u_n + h_n Σ_{i=1}^{k} β^p_{k−i} f(t_{n+1−i}, u_{n+1−i})

Evaluate (E):   f(t_{n+1}, u^p_{n+1})                                                        (5.16)

Correct (C):    u_{n+1} = u_n + h_n ( β^c_k f(t_{n+1}, u^p_{n+1}) + Σ_{i=1}^{k} β^c_{k−i} f(t_{n+1−i}, u_{n+1−i}) )

Evaluate (E):   f(t_{n+1}, u_{n+1}).

This scheme is symbolized by the abbreviation PECE. Frequently, the corrector is iterated in the following way:

u^{(i+1)}_{n+1} = u_n + h_n ( β^c_k f(t_{n+1}, u^{(i)}_{n+1}) + Σ_{j=1}^{k} β^c_{k−j} f(t_{n+1−j}, u_{n+1−j}) ),  i = 0, . . . , m − 1   (5.17)

with u^{(0)}_{n+1} := u^p_{n+1}. This implementation is symbolized by P(EC)^m E. It consists of m steps of a fixed point iteration. The integration step is completed by assigning u_{n+1} := u^{(m)}_{n+1}. Though the scheme (5.15) is an implicit one, the overall method is explicit if only a fixed number of corrector iteration steps is taken, e.g. m = 1. Alternatively, this iteration can be carried out "until convergence", i.e. the iteration is controlled and m is kept variable. These versions differ with respect to their stability properties. In the sequel we will always assume corrector values obtained by iterating until convergence unless otherwise stated.

5.7.2 Backward Differentiation Formulas (BDF)


There is another important class of multistep methods, which is based on in-
terpolating the solution points un+1−i rather than the derivatives. Let πkp be a
polynomial of degree (k − 1), which interpolates the k points

un+1−i , i = 1, . . . , k.

Again, using the Lagrange formulation it can be expressed as

π_k^p(t) = Σ_{i=1}^{k} L_i^{k−1}(t) u_{n+1−i}.

By just extrapolating this polynomial, a new solution point can be predicted as

u^p_{n+1} = π_k^p(t_{n+1}) = Σ_{i=1}^{k} L_i^{k−1}(t_{n+1}) u_{n+1−i}.

Introducing the coefficients α^p_{k−i} := −L_i^{k−1}(t_{n+1}), the predictor equation

u^p_{n+1} = − Σ_{i=1}^{k} α^p_{k−i} u_{n+1−i}

is obtained. In this formula no information about the function f is incorporated. It is only useful as a predictor in a predictor–corrector scheme. The BDF corrector formula is obtained by considering the k-th degree polynomial π^c_{k+1} which satisfies the conditions

π^c_{k+1}(t_{n+1−i}) = u_{n+1−i},  i = 0, . . . , k,   (5.18a)
π̇^c_{k+1}(t_{n+1}) = f(t_{n+1}, u_{n+1}).   (5.18b)

The first conditions are interpolation conditions using the unknown value u_{n+1}, which is defined implicitly by (5.18b).
With the coefficients

α^c_{k−i} := L̇_i^k(t_{n+1}) / L̇_0^k(t_{n+1}),   β^c_k := 1 / ( h_n L̇_0^k(t_{n+1}) )

equation (5.18b) can be expressed as

u_{n+1} = − Σ_{i=1}^{k} α^c_{k−i} u_{n+1−i} + h_n β^c_k f(t_{n+1}, u_{n+1}),   (5.19)

where the L_i^k now correspond to the interpolation points u_{n+1−i}, i = 0, . . . , k.
This is the corrector scheme of the backward differentiation formula. We will see

later that this method is of particular interest for stiff problems. The predictor-
corrector scheme for a BDF method has the form:
Predict (P):    u^p_{n+1} = − Σ_{i=1}^{k} α^p_{k−i} u_{n+1−i}

Evaluate (E):   f(t_{n+1}, u^p_{n+1})                                                        (5.20)

Correct (C):    u_{n+1} = − Σ_{i=1}^{k} α^c_{k−i} u_{n+1−i} + h_n β^c_k f(t_{n+1}, u^p_{n+1})

Again, the implicit formula can be solved iteratively by applying the scheme
P (EC)m with m ≥ 1, though, in practice, BDF methods are mainly implemented
together with Newton’s method. We will see the reason for this later, when
discussing stiff ODEs.

5.7.3 Solving the Corrector Equations


In this section we discuss the corrector iteration. The corrector iteration described so far has the general form

u^{(i+1)}_{n+1} = ξ + h_n β^c_k f(t_{n+1}, u^{(i)}_{n+1})   (5.21)

with ξ collecting the contributions based on "old" data, which is

ξ = u_n + h_n Σ_{i=1}^{k} β^c_{k−i} f(t_{n+1−i}, u_{n+1−i})

in the case of Adams–Moulton methods, see Eq. (5.17), and

ξ = − Σ_{i=1}^{k} α^c_{k−i} u_{n+1−i}

in the case of implicit BDF methods, cf. (5.20). Eq. (5.21) describes a fixed point iteration. By the fixed point theorem a sufficient condition for the convergence of this iteration is that the corresponding mapping

ϕ(u) = ξ + h_n β^c_k f(t, u)

is a contraction, cf. (4.9). As the Lipschitz constants of ϕ and f are related by

L(ϕ) = h_n |β^c_k| L(f),

it is obvious that ϕ is contractive and the iteration convergent if the step size is sufficiently small. In many cases L(f) is of moderate size, so that the step size dictated by the required local accuracy is small enough to ensure fast convergence of the fixed point iteration. On the other hand, if the Jacobian (d/dx) f(t, x) has eigenvalues of large modulus, the step size might be restricted much more by the demand for contractivity than by the required tolerance and stability.
This situation is to be expected when dealing with stiff systems. In this case it is appropriate to switch from fixed point iteration to Newton iteration for solving the implicit corrector equation. When applying Newton's method, the nonlinear equation

F(u_{n+1}) = u_{n+1} − ( ξ + h_n β^c_k f(t_{n+1}, u_{n+1}) ) = 0   (5.22)

is considered. Newton's method then defines the iteration

J(t_{n+1}, u^{(i)}_{n+1}) ∆u^{(i)} = − ( u^{(i)}_{n+1} − ( ξ + h_n β^c_k f(t_{n+1}, u^{(i)}_{n+1}) ) )   (5.23)

with u^{(i+1)}_{n+1} := u^{(i)}_{n+1} + ∆u^{(i)} and

J(t, u) := I − h β^c_k (d/du) f(t, u).

As in the case of fixed point iteration the predictor solution is taken as starting value: u^{(0)}_{n+1} := u^p_{n+1}. The method demands a high computational effort, which is mostly spent on computing the Jacobian J and solving the linear system.

5.7.4 Order Selection and Starting a Multistep Method


The local residual of a p-th order method depends on the (p + 1)-st derivative of the solution in the actual time interval [t_n, t_{n+1}]. If

c_p h^p ‖y^{(p)}‖ < c_{p+1} h^{p+1} ‖y^{(p+1)}‖,

it might be advantageous to take a method of order p − 1 instead. Similarly, one might consider raising the order if the higher derivative is smaller in that sense. Adams and BDF methods allow the order to be varied in an easy way. They are defined in terms of interpolation polynomials based on a certain number of points. Raising or lowering the number of interpolation points raises or lowers the order of the interpolation polynomial and the order of the method. After every successful step it is decided how many points of the past should be taken to perform the next step. As mentioned earlier in this chapter, interpolation polynomials can be defined in different ways. Especially the definition based on finite differences, as in Newton's interpolation method, permits varying the order of an interpolation polynomial efficiently by adding or removing interpolation points. In order to get an idea of the size of y^{(p+1)}, the p-th resp. (p + 1)-st derivative of the k-th order interpolation polynomial is taken.

The automatic order variation is also used for starting multistep methods: The
starting points are successively obtained by starting with a one step method, then
proceeding with a two step method and so on until an order is reached which is
appropriate for the given problem.

5.8 Explicit Runge–Kutta Methods


Runge–Kutta methods are one-step methods, i.e. they have the generic form
un+1 := un + hφh (tn , un ) (5.24)
with a method dependent increment function φh . In contrast to multistep meth-
ods, the transition from one step to the next is based on data of the most recent
step only.
The basic construction scheme is

U_1 = u_n   (5.25a)
U_i = u_n + h Σ_{j=1}^{i−1} a_{ij} f(t_n + c_j h, U_j),  i = 2, . . . , s   (5.25b)
u_{n+1} = u_n + h Σ_{i=1}^{s} b_i f(t_n + c_i h, U_i).   (5.25c)

s is called the number of stages.


Example 99 By taking s = 2, a21 = 1/2, b1 = 0, b2 = 1, c1 = 0, and c2 = 1/2 the following scheme is obtained:

U_1 = u_n   (5.26a)
U_2 = u_n + (h/2) f(t_n, U_1)   (5.26b)
u_{n+1} = u_n + h f(t_n + h/2, U_2)   (5.26c)

For this method the increment function reads

φ_h(t, u) := f( t + h/2, u + (h/2) f(t, u) ).
Normally, Runge–Kutta methods are written in an equivalent form by substituting k_i := f(t_n + c_i h, U_i):

k_1 = f(t_n, u_n)
k_i = f(t_n + c_i h, u_n + h Σ_{j=1}^{i−1} a_{ij} k_j),  i = 2, . . . , s
u_{n+1} = u_n + h Σ_{i=1}^{s} b_i k_i.

The coefficients characterizing a Runge–Kutta method are written in a compact form using a so-called Butcher tableau:

c1
c2   a21
c3   a31   a32                            c | A
..   ..    ..    ..             or       ---+----
cs   as1   as2   ...  as,s−1                | bᵀ
     b1    b2    ...  bs−1    bs

with A = (a_{ij}) and a_{ij} = 0 for j ≥ i.
The classical 4-stage Runge–Kutta method reads in this notation

0    |
1/2  | 1/2
1/2  | 0     1/2
1    | 0     0     1                         (5.27)
     | 1/6   2/6   2/6   1/6
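One step of this classical 4-stage method might be sketched in MATLAB as follows (the function name is an illustrative choice):

function unew = rk4_step(f, tn, un, h)
% One step of the classical 4-stage Runge-Kutta method (5.27).
k1 = f(tn,       un);
k2 = f(tn + h/2, un + h/2 * k1);
k3 = f(tn + h/2, un + h/2 * k2);
k4 = f(tn + h,   un + h   * k3);
unew = un + h/6 * (k1 + 2*k2 + 2*k3 + k4);   % weights b = [1/6 2/6 2/6 1/6]
end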

An s-stage Runge–Kutta method usually requires s function evaluations. If

cs = 1, asj = bj and bs = 0 (5.28)

the function evaluation for the last stage at tn can be used for the first stage at
tn+1 .
The higher amount of function evaluations per step in Runge–Kutta methods
compared to multistep methods is often compensated by the fact that Runge–
Kutta methods may be able to use larger step sizes.
A non-autonomous differential equation ẏ = f(t, y) can be written in autonomous form, where the right hand side of the differential equation does not depend explicitly on time, by augmenting the system by the trivial equation ṫ = 1:

ẏ = ( ṫ, ẋ )ᵀ = ( 1, f(t, x) )ᵀ = F(y),   y = (t, x)ᵀ.

Applying a Runge–Kutta method to the original and to the reformulated system


should lead to the same equations. This requirement relates the coefficients c_i to the a_{ij}:

c_i = Σ_{j=1}^{s} a_{ij}.   (5.29)

We will only consider methods fulfilling this condition and, for ease of notation, assume autonomous differential equations for the rest of this chapter.

5.8.1 The Order of a Runge–Kutta Method


The global error of a Runge–Kutta method at t_n is defined in the same way as for Euler's method,

e_n := y(t_n) − u_n

with n = t_n/h. A Runge–Kutta method has order p if e_n = O(h^p).
Using (5.24) we get

e_{n+1} = y(t_{n+1}) − u_n − h φ_h(t_n, u_n)
        = y(t_{n+1}) − y(t_n) + e_n − h φ_h(t_n, y(t_n)) + h ( φ_h(t_n, y(t_n)) − φ_h(t_n, u_n) ).

Setting ε(t, y, h) := y(t + h) − y(t) − h φ_h(t, y(t)) and applying the mean value theorem¹ to φ_h gives

e_{n+1} = Φ_n(h) e_n + ε(t_n, y, h) + O(e_n²)   (5.30)

with Φ_n(h) := 1 + h (d/dy) φ_h(t_n, y)|_{y = y(t_n)}. Thus,

Φ_n(h) = 1 + O(h).
ε is called the local error or, due to its role in (5.30), the global error increment
of the Runge–Kutta method.
Viewing the error propagation formula (5.30) we have to require

ε(tn , y, h) = O(hp+1 ) (5.31)

to get a method of order p.

Example 100 For the Runge–Kutta method (5.26) we get by Taylor expansion

ε(t_{n−1}, y, h) = (h³/24) ( f_{yy} f² + 4 f_y² f ) + O(h⁴),   (5.32)

using the notation f_y := (d/dy) f(y)|_{y(t_{n−1})} for the elementary differentials.
Thus (5.26) is a second order method.

The goal when constructing a Runge–Kutta method is to choose the coefficients in


such a way that all elementary differentials up to a certain order cancel in a Taylor
series expansion of ε. This requires a special symbolic calculus with elementary
differentials, which is based on relating these differentials to labeled trees. The
order conditions are much more complicated than in the multistep case. They
consist of an underdetermined system of nonlinear equations for the coefficients
A, b, c, which is solved by considering additional simplifying assumptions.
¹ For notational simplicity we restrict the presentation to the scalar case.

5.8.2 Embedded Methods for Error Estimation
Variable step size codes adjust the step size in such a way that the global error increment is kept below a certain tolerance threshold TOL. This requires a good estimate of this quantity. The error can be estimated by comparing two different methods. Here, a Runge–Kutta method of order p and another method of order p + 1 are taken to perform a step from $t_n$ to $t_{n+1}$, say.
The local error of the pth order method is
\[
\varepsilon^{(p)}(t, y) = C^{(p)}(t, y)\, h^{p+1} + O(h^{p+2}),
\]
while the global error increment of the (p+1)st order method is
\[
\varepsilon^{(p+1)}(t, y) = C^{(p+1)}(t, y)\, h^{p+2} + O(h^{p+3}),
\]
where the coefficients $C^{(i)}$ depend on error coefficients and elementary differentials, cf. (5.32).
The difference of both quantities is
\[
u^{(p)}_{n+1} - u^{(p+1)}_{n+1} = C^{(p)}(t, y)\, h^{p+1} + O(h^{p+2}) = \varepsilon^{(p)}(t_n, y(t_n)) + O(h^{p+2}),
\]
where the superscripts indicate the respective method.


Directly evaluating this formula for estimating the error would require additional function evaluations, one for each stage of the higher order method. This would be much too expensive. The extra work can be avoided by using embedded methods. These are pairs of Runge–Kutta methods using the same coefficients A and c. Thus, the stage values $k_i$ of both methods coincide. The only difference is in the b coefficients, which are determined in such a way that one method has order p + 1 while the other has order p. The two methods are described by the tableau
\[
\begin{array}{c|ccccc}
c_1 & & & & & \\
c_2 & a_{21} & & & & \\
c_3 & a_{31} & a_{32} & & & \\
\vdots & \vdots & \vdots & \ddots & & \\
c_s & a_{s1} & a_{s2} & \cdots & a_{s,s-1} & \\
\hline
 & b^{p}_1 & b^{p}_2 & \cdots & b^{p}_{s-1} & b^{p}_s \\
 & b^{p+1}_1 & b^{p+1}_2 & \cdots & b^{p+1}_{s-1} & b^{p+1}_s
\end{array}
\]

where p indicates the order of the method and where
\[
u^{(p)}_{n+1} = u_n + h \sum_{i=1}^{s} b^{p}_i k_i , \qquad
u^{(p+1)}_{n+1} = u_n + h \sum_{i=1}^{s} b^{p+1}_i k_i .
\]

Example 101 One method of low order in that class is the RKF2(3) method
\[
\begin{array}{c|ccc}
0 & & & \\
1 & 1 & & \\
\tfrac{1}{2} & \tfrac{1}{4} & \tfrac{1}{4} & \\
\hline
 & \tfrac{1}{2} & \tfrac{1}{2} & 0 \\
 & \tfrac{1}{6} & \tfrac{1}{6} & \tfrac{4}{6}
\end{array}
\tag{5.33}
\]

It uses 3 stages and is due to Fehlberg.
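A single embedded step with (5.33), including the resulting error estimate, could look as in the following sketch (the script and variable names are ours; the step size update at the end uses the common heuristic $h_{new} = h\,(TOL/\mathrm{err})^{1/(p+1)}$, which is not derived in the text):

    f = @(t, y) -y;                 % test problem
    t = 0;  u = 1;  h = 0.1;  TOL = 1e-4;

    % Stages of RKF2(3), tableau (5.33)
    k1 = f(t, u);
    k2 = f(t + h, u + h*k1);
    k3 = f(t + h/2, u + h*(k1 + k2)/4);

    u2 = u + h*(k1 + k2)/2;                % second order result
    u3 = u + h*(k1 + k2 + 4*k3)/6;         % third order result

    err  = abs(u3 - u2);                   % estimate of the local error of u2
    hnew = h*(TOL/err)^(1/3);              % common step size heuristic for p = 2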

Local Extrapolation

Though it is always the error of the lower order method which is estimated, one often uses the higher order and more accurate method for continuing the integration process. This procedure is called local extrapolation. It is also reflected in the naming of the method, i.e.

• in RK2(3) the integration process is carried out with the second order method, while

• in RK3(2) local extrapolation is used, i.e. the third order method is taken for the integration.
When designing a Runge–Kutta method one is interested in minimizing the weights in the principal error term. In the context of embedded methods, however, the difference of both methods must be as large as possible to give a good estimate, so it might not be a good idea to seek a lower order method with minimal weights in the principal error term when applying local extrapolation. Coefficients optimal with respect to this and some other design criteria have been found by Dormand and Prince, leading to one of the most effective explicit Runge–Kutta formulas. We give here the coefficients of the 5(4)-pair:
Example 102
\[
\begin{array}{c|ccccccc}
0 & & & & & & & \\
\tfrac{1}{5} & \tfrac{1}{5} & & & & & & \\
\tfrac{3}{10} & \tfrac{3}{40} & \tfrac{9}{40} & & & & & \\
\tfrac{4}{5} & \tfrac{44}{45} & -\tfrac{56}{15} & \tfrac{32}{9} & & & & \\
\tfrac{8}{9} & \tfrac{19372}{6561} & -\tfrac{25360}{2187} & \tfrac{64448}{6561} & -\tfrac{212}{729} & & & \\
1 & \tfrac{9017}{3168} & -\tfrac{355}{33} & \tfrac{46732}{5247} & \tfrac{49}{176} & -\tfrac{5103}{18656} & & \\
1 & \tfrac{35}{384} & 0 & \tfrac{500}{1113} & \tfrac{125}{192} & -\tfrac{2187}{6784} & \tfrac{11}{84} & \\
\hline
 & \tfrac{35}{384} & 0 & \tfrac{500}{1113} & \tfrac{125}{192} & -\tfrac{2187}{6784} & \tfrac{11}{84} & 0 \\
 & \tfrac{5179}{57600} & 0 & \tfrac{7571}{16695} & \tfrac{393}{640} & -\tfrac{92097}{339200} & \tfrac{187}{2100} & \tfrac{1}{40}
\end{array}
\tag{5.34}
\]

This method uses six stages for the 5th order result and one more for obtaining the 4th order result. One clearly sees that the method saves one function evaluation by meeting the requirement (5.28) for the 5th order method, which is used for local extrapolation.
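MATLAB's ode45 is based on this Dormand–Prince pair, so the coefficients above can be exercised without typing in the tableau. A minimal usage sketch (the tolerances and the test problem are our own choices):

    f = @(t, y) -y;                                  % test problem y' = -y, y(0) = 1
    opts = odeset('RelTol', 1e-6, 'AbsTol', 1e-9);   % controls the embedded error estimate
    [t, y] = ode45(f, [0 1], 1, opts);
    max(abs(y - exp(-t)))                            % compare with the exact solution exp(-t)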

5.8.3 Stability of Runge–Kutta Methods
Similar to the discussion for Euler's method, we investigate the stability of Runge–Kutta methods for finite step sizes h by considering the linear test equation $\dot y = \lambda y$, cf. Sect. 5.3.
Applying the Runge–Kutta method (5.25) to that equation results in
\[
\begin{aligned}
U_1 &= u_n \\
U_i &= u_n + h \sum_{j=1}^{i-1} a_{ij}\, \lambda U_j , \qquad i = 2, \dots, s \\
u_{n+1} &= u_n + h \sum_{i=1}^{s} b_i\, \lambda U_i .
\end{aligned}
\]

By inserting the stage values into the final equation we get
\[
\begin{aligned}
u_{n+1} &= \Bigl( 1 + h\lambda \sum_{i=1}^{s} b_i + (h\lambda)^2 \sum_{i=2}^{s} b_i a_{i1} + (h\lambda)^3 \sum_{i=3}^{s} b_i a_{i2} a_{i-1,1} + \dots \\
        &\qquad \dots + (h\lambda)^{s}\, b_s a_{s,s-1}\, a_{s-1,s-2} \cdots a_{2,1} \Bigr)\, u_n =: R(h\lambda)\, u_n .
\end{aligned}
\tag{5.35}
\]

The function $R : \mathbb{C} \to \mathbb{C}$ is called the stability function of the Runge–Kutta method; see also the amplification factor (5.4) for Euler's method. A Runge–Kutta method is stable for a given step size h and a given complex parameter λ if $|R(h\lambda)| \le 1$. Then, an error in $y_0$ is not increased by applying (5.35).
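The stability function can be evaluated numerically by applying one Runge–Kutta step with step size one to the test equation with $u_n = 1$, which returns exactly $R(z)$ for $z = h\lambda$; contouring $|R| = 1$ over a grid in the complex plane then reproduces pictures like Fig. 5.10. A sketch for the classical method (5.27), reusing the hypothetical rk_step from above:

    A = [0 0 0 0; 1/2 0 0 0; 0 1/2 0 0; 0 0 1 0];    % tableau (5.27)
    b = [1/6; 2/6; 2/6; 1/6];
    c = [0; 1/2; 1/2; 1];

    [X, Y] = meshgrid(-4:0.05:2, -4:0.05:4);
    Z = X + 1i*Y;                                    % grid of values z = h*lambda
    R = arrayfun(@(z) rk_step(@(t, u) z*u, 0, 1, 1, A, b, c), Z);
    contour(X, Y, abs(R), [1 1], 'k')                % boundary |R(z)| = 1 of the stability region
    axis equal, grid on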
In Fig. 5.10 the stability regions of the methods DOPRI4 and DOPRI5 are displayed. One sees that the lower order method tends to have the larger stability region. When applying embedded methods with local extrapolation, this fact must be kept in mind.

Figure 5.10: Stability regions for the Runge–Kutta pair DOPRI4 and DOPRI5. The methods are stable inside the gray areas.