Lecture Notes, Course on Numerical Analysis

Guillaume Bal

October 20, 2008
Contents
1 Ordinary Differential Equations 2
2 Finite Differences for Parabolic Equations 9
2.1 One dimensional Heat equation . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Fourier transforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3 Stability and convergence using Fourier transforms . . . . . . . . . . . . 18
2.4 Application to the θ schemes . . . . . . . . . . . . . . . . . . . . . . . . . 25
3 Finite Differences for Elliptic Equations 28
4 Finite Differences for Hyperbolic Equations 32
5 Spectral methods 36
5.1 Unbounded grids and semidiscrete FT . . . . . . . . . . . . . . . . . . . 36
5.2 Periodic grids and Discrete FT . . . . . . . . . . . . . . . . . . . . . . . 38
5.3 Fast Fourier Transform (FFT) . . . . . . . . . . . . . . . . . . . . . . . . 39
5.4 Spectral approximation for unbounded grids . . . . . . . . . . . . . . . . 40
5.5 Application to PDE’s . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
6 Introduction to finite element methods 49
6.1 Elliptic equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
6.2 Galerkin approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
6.3 An example of finite element method . . . . . . . . . . . . . . . . . . . . 54
7 Introduction to Monte Carlo methods 62
7.1 Method of characteristics . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
7.2 Probabilistic representation . . . . . . . . . . . . . . . . . . . . . . . . . 63
7.3 Random Walk and Finite Differences . . . . . . . . . . . . . . . . . . . . 64
7.4 Monte Carlo method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

Department of Applied Physics & Applied Mathematics, Columbia University, New York NY,
10027, gb2030@columbia.edu, http://www.columbia.edu/~gb2030
1 Ordinary Differential Equations
An Ordinary Differential Equation is an equation of the form
\[
\frac{dX(t)}{dt} = f(t, X(t)), \quad t \in (0, T), \qquad X(0) = X_0. \tag{1.1}
\]
We have existence and uniqueness of the solution when $f(t, x)$ is continuous and Lipschitz with respect to its second variable, i.e. there exists a constant $C$ independent of $t, x, y$ such that
\[
|f(t, x) - f(t, y)| \le C |x - y|. \tag{1.2}
\]
If the constant is independent of time for $t > 0$, we can even take $T = +\infty$. Here $X$ and $f(t, X(t))$ are vectors in $\mathbb{R}^n$ for $n \in \mathbb{N}$ and $|\cdot|$ is any norm in $\mathbb{R}^n$.
Remark 1.1 Throughout the text, we use the symbol “C” to denote an arbitrary
constant. The “value” of C may change at every instance. For example, if u(t) is a
function of t bounded by 2, we will write
\[
C|u(t)| \le C, \tag{1.3}
\]
even if the two constants “C” on the left- and right-hand sides of the inequality may be
different. When the value “2” is important, we will replace the above right-hand side
by 2C. In our convention however, 2C is just another constant “C”.
Higher-order differential equations can always be put in the form (1.1), which is quite general. For instance the famous harmonic oscillator is the solution to
\[
\ddot{x} + \omega^2 x = 0, \qquad \omega^2 = \frac{k}{m}, \tag{1.4}
\]
with given initial conditions $x(0)$ and $\dot{x}(0)$, and can be recast as
\[
\frac{d}{dt} \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}(t) =
\begin{pmatrix} 0 & 1 \\ -\omega^2 & 0 \end{pmatrix}
\begin{pmatrix} X_1 \\ X_2 \end{pmatrix}(t), \qquad
\begin{pmatrix} X_1 \\ X_2 \end{pmatrix}(0) =
\begin{pmatrix} x(0) \\ \dot{x}(0) \end{pmatrix}. \tag{1.5}
\]
Exercise 1.1 Show that the implied function f in (1.5) is Lipschitz.
The Lipschitz condition is quite important to obtain uniqueness of the solution. Take for instance the case $n = 1$ and the function $x(t) = t^2$. We easily obtain that
\[
\dot{x}(t) = 2\sqrt{x(t)}, \quad t \in (0, +\infty), \qquad x(0) = 0.
\]
However, the solution $\tilde{x}(t) \equiv 0$ satisfies the same equation, which implies non-uniqueness
of the solution. This remark is important in practice: when an equation admits several
solutions, any sound numerical discretization is likely to pick one of them, but not nec-
essarily the solution one is interested in.
Exercise 1.2 Show that the function $f(x) = 2\sqrt{x}$ above is not Lipschitz.
How do we discretize (1.1)? There are two key ingredients in the derivation of a
numerical scheme: first to remember the definition of a derivative
\[
\frac{df}{dt}(t) = \lim_{h \to 0} \frac{f(t+h) - f(t)}{h}
= \lim_{h \to 0} \frac{f(t) - f(t-h)}{h}
= \lim_{h \to 0} \frac{f(t+h) - f(t-h)}{2h},
\]
and second (which is quite related) to remember Taylor expansions to approximate a
function in the vicinity of a given point:
\[
f(t+h) = f(t) + h f'(t) + \frac{h^2}{2} f''(t) + \ldots + \frac{h^n}{n!} f^{(n)}(t) + \frac{h^{n+1}}{(n+1)!} f^{(n+1)}(t+s), \tag{1.6}
\]
for some $0 \le s \le h$, assuming the function $f$ is sufficiently regular.
With this in mind, let us fix some ∆t > 0, which corresponds to the size of the
interval between successive points where we try to approximate the solution of (1.1).
We then approximate the derivative by
\[
\frac{dX}{dt}(t) \approx \frac{X(t + \Delta t) - X(t)}{\Delta t}. \tag{1.7}
\]
It remains to approximate the right-hand side in (1.1). Since $X(t)$ is supposed to be known by the time we are interested in calculating $X(t + \Delta t)$, we can choose $f(t, X(t))$ for the right-hand side. This gives us the so-called explicit Euler scheme
\[
\frac{X(t + \Delta t) - X(t)}{\Delta t} = f(t, X(t)). \tag{1.8}
\]
Let us denote by $T_n = n\Delta t$ for $0 \le n \le N$ such that $N \Delta t = T$. We can recast (1.8) as
\[
X^{n+1} = X^n + \Delta t\, f(n \Delta t, X^n), \qquad X^0 = X(0). \tag{1.9}
\]
This is a fully discretized equation that can be solved on the computer since f(t, x) is
known.
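In code, the update (1.9) is a one-line loop. Here is a minimal sketch in Python (the notes use Matlab for the exercises); the function name `explicit_euler` and the test problem $dX/dt = -X$ are our own illustrative choices:

```python
import math

def explicit_euler(f, x0, T, N):
    """Approximate dX/dt = f(t, X), X(0) = x0, on (0, T) with N steps.

    Implements the update X^{n+1} = X^n + dt * f(n*dt, X^n) from (1.9).
    Returns the list [X^0, X^1, ..., X^N].
    """
    dt = T / N
    xs = [x0]
    for n in range(N):
        xs.append(xs[-1] + dt * f(n * dt, xs[-1]))
    return xs

# Example: dX/dt = -X, exact solution X(t) = exp(-t).
approx = explicit_euler(lambda t, x: -x, 1.0, 1.0, 1000)
print(abs(approx[-1] - math.exp(-1.0)))  # small error, of order dt
```
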
Another choice for the right-hand side in (1.1) is to choose $f(t + \Delta t, X(t + \Delta t))$. This gives us the implicit Euler scheme
\[
X^{n+1} = X^n + \Delta t\, f((n+1)\Delta t, X^{n+1}), \qquad X^0 = X(0). \tag{1.10}
\]
Notice that this scheme requires solving an equation at every step, namely inverting the map $x \mapsto x - \Delta t\, f((n+1)\Delta t, x)$. However, we will see that this scheme is more stable than its explicit relative in many practical situations and is therefore often a better choice.
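Each implicit Euler step requires solving the equation above for $X^{n+1}$. A rough sketch of one way to do this, using plain fixed-point iteration (adequate only when $\Delta t$ times the Lipschitz constant of $f$ is below 1, an assumption made here; a Newton solve would be more robust):

```python
def implicit_euler(f, x0, T, N, iters=50):
    """Approximate dX/dt = f(t, X), X(0) = x0, with the implicit Euler scheme.

    Each step solves x = X^n + dt * f((n+1)*dt, x) by fixed-point
    iteration, which converges when dt * (Lipschitz constant of f) < 1.
    """
    dt = T / N
    xs = [x0]
    for n in range(N):
        t_next = (n + 1) * dt
        x = xs[-1]  # initial guess: previous value
        for _ in range(iters):
            x = xs[-1] + dt * f(t_next, x)
        xs.append(x)
    return xs

# Example: dX/dt = -10 X; the implicit scheme stays bounded even for
# comparatively large time steps.
xs = implicit_euler(lambda t, x: -10.0 * x, 1.0, 1.0, 20)
```
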
Exercise 1.3 Work out the explicit and implicit schemes in the case $f(t, x) = \lambda x$ for some real number $\lambda \in \mathbb{R}$. Write explicitly $X^n$ for both schemes and compare with the exact solution $X(n\Delta t)$. Show that in the stable case ($\lambda < 0$), the explicit scheme can become unstable and that the implicit scheme cannot. Comment.
We now need to justify that the schemes we have considered are good approximations of the exact solution and understand how good (or bad) our approximation is. This requires analyzing the behavior of $X^n$ as $\Delta t \to 0$, or equivalently as the number of points $N$ used to discretize a fixed interval of time $(0, T)$ tends to infinity. In many problems, the strategy to do so is the following. First we need to assume that the exact problem (1.1) is properly posed, i.e. roughly speaking has good stability properties: when one changes the initial conditions a little, one expects the solution not to change too much in the future. A second ingredient is to show that the scheme is stable, i.e. that our solution $X^n$ remains bounded independently of $n$ and $\Delta t$. The third ingredient is to show that our scheme is consistent with the equation (1.1), i.e. that locally it is a good approximation of the real equation.
Let us introduce the solution function $g(t, X) = g(t, X; \Delta t)$ defined by
\[
g(t, X) = X(t + \Delta t), \tag{1.11}
\]
where
\[
\frac{dX}{dt}(t + \tau) = f(t + \tau, X(t + \tau)), \quad 0 \le \tau \le \Delta t, \qquad X(t) = X. \tag{1.12}
\]
This is nothing but the function that maps the solution of the ODE at time $t$ to the solution at time $t + \Delta t$.
Similarly we introduce the approximate solution function $g_{\Delta t}(t, X)$ for $t = n\Delta t$ by
\[
X^{n+1} = g_{\Delta t}(n\Delta t, X^n) \quad \text{for } 0 \le n \le N - 1. \tag{1.13}
\]
For instance, for the explicit Euler scheme, we have
\[
g_{\Delta t}(n\Delta t, X^n) = X^n + \Delta t\, f(n\Delta t, X^n). \tag{1.14}
\]
For the implicit Euler scheme, we have
\[
g_{\Delta t}(n\Delta t, X^n) = \bigl(x \mapsto x - \Delta t\, f((n+1)\Delta t, x)\bigr)^{-1}(X^n),
\]
assuming that this inverse function is well-defined. Because the explicit scheme is much simpler to analyze, we assume that (1.14) holds from now on.
That the equation is properly posed (a.k.a. well-posed) in the sense of Hadamard means that
\[
|g(t, X) - g(t, Y)| \le (1 + C\Delta t)|X - Y|, \tag{1.15}
\]
where the constant $C$ is independent of $t$, $X$ and $Y$. The meaning of this inequality is the following: we do not want the difference $X(t + \Delta t) - Y(t + \Delta t)$ to be more than $(1 + C\Delta t)$ times what it was at time $t$ (i.e. $|X - Y| = |X(t) - Y(t)|$) during the time interval $\Delta t$.
We then have to prove that our scheme is stable, that is to say
\[
|X^n| \le C(1 + |X^0|), \tag{1.16}
\]
where $C$ is independent of $X^0$ and $0 \le n \le N$. Notice from the Lipschitz property of $f$ (applied with $y = 0$) and from (1.14) that
\[
|X^{n+1}| = |g_{\Delta t}(n\Delta t, X^n)| \le (1 + C\Delta t)|X^n| + C\Delta t,
\]
where $C$ is a constant. This implies by induction that
\[
|X^n| \le (1 + C\Delta t)^n |X^0| + \sum_{k=0}^{n} (1 + C\Delta t)^k C\Delta t \le (1 + C\Delta t)^N (N\Delta t\, C + |X^0|),
\]
for $1 \le n \le N$. Now $N\Delta t = T$ and
\[
\Bigl(1 + \frac{CT}{N}\Bigr)^N \le e^{CT},
\]
which is bounded independently of $\Delta t$. This implies the stability of the explicit Euler scheme.
We then prove that the scheme is consistent and obtain the local error of discretization. In our case, it consists of showing that
\[
|g(n\Delta t, X) - g_{\Delta t}(n\Delta t, X)| \le C\Delta t^{m+1}(1 + |X|), \tag{1.17}
\]
for some positive $m$, the order of the scheme. Notice that this is a local estimate. Assuming that we know the initial condition at time $n\Delta t$, we want to make sure that the exact and approximate solutions at time $(n+1)\Delta t$ are separated by an amount at most of order $\Delta t^2$. Such a result is usually obtained by using Taylor expansions.
Let us consider the explicit Euler scheme and assume that the solution $X(t)$ of the ODE is sufficiently smooth so that the solution with initial condition at $n\Delta t$ given by $X = X(n\Delta t)$ satisfies
\[
X((n+1)\Delta t) = X(n\Delta t) + \Delta t\, \dot{X}(n\Delta t) + \frac{\Delta t^2}{2} \ddot{X}(n\Delta t + s), \tag{1.18}
\]
for some $0 \le s \le \Delta t$, where $|\ddot{X}(n\Delta t + s)| \le C(1 + |X|)$. Notice that the above property is a regularity property of the equation, not of the discretization. We prove convergence of the explicit Euler scheme only when the above regularity conditions are met. These are sufficient conditions (though not necessary ones) to obtain convergence and can be shown rigorously when the function $f(t, x)$ is of class $C^1$ for instance. This implies that
\[
\begin{aligned}
X((n+1)\Delta t) &= g(n\Delta t, X(n\Delta t)) \\
&= X(n\Delta t) + \Delta t\, f(n\Delta t, X(n\Delta t)) + \frac{\Delta t^2}{2} \ddot{X}(n\Delta t + s) \\
&= g_{\Delta t}(n\Delta t, X(n\Delta t)) + \frac{\Delta t^2}{2} \ddot{X}(n\Delta t + s).
\end{aligned}
\]
This implies (1.17) and the consistency of the scheme with $m = 1$ (the explicit Euler scheme is first-order). Another way of interpreting the above result is as follows: the exact solution $X((n+1)\Delta t)$ locally solves the discretized equation (whose solution is $g_{\Delta t}(n\Delta t, X(n\Delta t))$ at time $(n+1)\Delta t$) up to a small term, here of order $\Delta t^2$.
Exercise 1.4 Consider the harmonic oscillator (1.5). Work out the functions $g$ and $g_{\Delta t}$ for the explicit and implicit Euler schemes. Show that (1.15) is satisfied, that both schemes are stable (i.e. (1.16) is satisfied) and that both schemes are consistent and of order $m = 1$ (i.e. (1.17) is satisfied with $m = 1$).
Once we have well-posedness of the exact equation as well as stability and consistency of the scheme (for a scheme of order $m$), we obtain convergence as follows. We have
\[
\begin{aligned}
|X((n+1)\Delta t) - X^{n+1}| &= |g(n\Delta t, X(n\Delta t)) - g_{\Delta t}(n\Delta t, X^n)| \\
&\le |g(n\Delta t, X(n\Delta t)) - g(n\Delta t, X^n)| + |g(n\Delta t, X^n) - g_{\Delta t}(n\Delta t, X^n)| \\
&\le (1 + C\Delta t)|X(n\Delta t) - X^n| + C\Delta t^{m+1}(1 + |X^n|) \\
&\le (1 + C\Delta t)|X(n\Delta t) - X^n| + C\Delta t^{m+1}.
\end{aligned}
\]
The first line is the definition of the operators. The second line uses the triangle inequality
\[
|X + Y| \le |X| + |Y|, \tag{1.19}
\]
which holds for every norm by definition. The third line uses the well-posedness of the ODE and the consistency of the scheme. The last line uses the stability of the scheme. Recall that $C$ is here the notation for a constant that may change every time the symbol is used.
Let us denote the error between the exact solution and the discretized solution by
\[
\varepsilon^n = |X(n\Delta t) - X^n|. \tag{1.20}
\]
The previous calculations have shown that for a scheme of order $m$, we have
\[
\varepsilon^{n+1} \le C\Delta t^{m+1} + (1 + C\Delta t)\varepsilon^n.
\]
Since $\varepsilon^0 = 0$, we easily deduce from the above relation (for instance by induction) that
\[
\varepsilon^n \le C\Delta t^{m+1} \sum_{k=0}^{n} (1 + C\Delta t)^k \le C\Delta t^{m+1}\, n,
\]
since $(1 + C\Delta t)^k \le C$ uniformly for $0 \le k \le N$. We deduce from the above relation that
\[
\varepsilon^n \le C\Delta t^m,
\]
since $n \le N = T\Delta t^{-1}$. This shows that the explicit Euler scheme is of order $m = 1$: for all intermediate time steps $n\Delta t$, the error between $X(n\Delta t)$ and $X^n$ is bounded by a constant times $\Delta t$. Obviously, as $\Delta t$ goes to $0$, the discretized solution converges to the exact one.
Exercise 1.5 Program in Matlab the explicit and implicit Euler schemes for the har-
monic oscillator (1.5). Find numerically the order of convergence of both schemes.
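A standard numerical check of the order $m$ is to halve $\Delta t$ and watch the error at a fixed final time shrink by a factor $2^m$. A sketch of such a check (in Python; the exercises suggest Matlab) on the scalar test problem $dX/dt = -X$, which is our own choice, for the explicit Euler scheme:

```python
import math

def explicit_euler_final(f, x0, T, N):
    """Return the explicit Euler approximation of X(T) with N steps."""
    dt = T / N
    x = x0
    for n in range(N):
        x = x + dt * f(n * dt, x)
    return x

# Error at T = 1 for dX/dt = -X against the exact solution exp(-1);
# the ratio of successive errors approaches 2, confirming order m = 1.
errors = [abs(explicit_euler_final(lambda t, x: -x, 1.0, 1.0, N) - math.exp(-1.0))
          for N in (100, 200, 400)]
ratios = [errors[i] / errors[i + 1] for i in range(2)]
```
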
Higher-Order schemes. We have seen that the Euler scheme is first order accurate.
For problems that need to be solved over long intervals of time, this might not be
accurate enough. The obvious solution is to create a more accurate discretization. Here
is one way of constructing a second-order scheme. To simplify the presentation we
assume that X ∈ R, i.e. we consider a scalar equation.
We first realize that $m$ in (1.17) has to be $2$ instead of $1$. We also realize that the local approximation to the exact solution must be compatible to the right order with the Taylor expansion (1.18), which we push one step further to get
\[
X((n+1)\Delta t) = X(n\Delta t) + \Delta t\, \dot{X}(n\Delta t) + \frac{\Delta t^2}{2} \ddot{X}(n\Delta t) + \frac{\Delta t^3}{6} \dddot{X}(n\Delta t + s) \tag{1.21}
\]
for some $0 \le s \le \Delta t$. Now from the derivation of the first-order scheme, we know that what we need is an approximation such that
\[
X((n+1)\Delta t) = g^{(2)}_{\Delta t}(n\Delta t, X) + O(\Delta t^3).
\]
An obvious choice is then to choose
\[
g^{(2)}_{\Delta t}(n\Delta t, X) = X(n\Delta t) + \Delta t\, \dot{X}(n\Delta t) + \frac{\Delta t^2}{2} \ddot{X}(n\Delta t).
\]
This is however not explicit since $\dot{X}$ and $\ddot{X}$ are not known yet (only the initial condition $X = X(n\Delta t)$ is known when we want to construct $g^{(2)}_{\Delta t}(n\Delta t, X)$). However, as previously, we can use the equation that $X(t)$ satisfies and get that
\[
\begin{aligned}
\dot{X}(n\Delta t) &= f(n\Delta t, X) \\
\ddot{X}(n\Delta t) &= \frac{\partial f}{\partial t}(n\Delta t, X) + \frac{\partial f}{\partial x}(n\Delta t, X) f(n\Delta t, X),
\end{aligned}
\]
by applying the chain rule to (1.1) at $t = n\Delta t$. Assuming that we know how to differentiate the function $f$, we can then obtain the second-order scheme by defining
\[
g^{(2)}_{\Delta t}(n\Delta t, X) = X + \Delta t\, f(n\Delta t, X) + \frac{\Delta t^2}{2} \Bigl[ \frac{\partial f}{\partial t}(n\Delta t, X) + \frac{\partial f}{\partial x}(n\Delta t, X) f(n\Delta t, X) \Bigr]. \tag{1.22}
\]
Clearly by construction, this scheme is consistent and second-order accurate. The regularity that we now need from the exact equation (1.1) is that $|\dddot{X}(n\Delta t + s)| \le C(1 + |X|)$. We shall again assume that our equation is nice enough so that the above constraint holds (this can be shown when the function $f(t, x)$ is of class $C^2$ for instance).
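The scheme (1.22) can be sketched in a few lines, supplying the partial derivatives of $f$ by hand; the test problem $dX/dt = -X$ (with $\partial f/\partial t = 0$ and $\partial f/\partial x = -1$) is our own illustration, not from the text:

```python
import math

def taylor2_step(f, df_dt, df_dx, t, x, dt):
    """One step of the second-order Taylor scheme (1.22).

    Requires the partial derivatives of f, which is the scheme's
    practical drawback, addressed later by Runge-Kutta methods.
    """
    fx = f(t, x)
    return x + dt * fx + 0.5 * dt**2 * (df_dt(t, x) + df_dx(t, x) * fx)

# Example: dX/dt = -X over (0, 1) with 100 steps.
x, dt = 1.0, 0.01
for n in range(100):
    x = taylor2_step(lambda t, x: -x, lambda t, x: 0.0, lambda t, x: -1.0,
                     n * dt, x, dt)
err = abs(x - math.exp(-1.0))  # much smaller than the first-order error
```
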
Exercise 1.6 It remains to show that our scheme $g^{(2)}_{\Delta t}(n\Delta t, X)$ is stable. This is left as an exercise. Hint: show that $|g^{(2)}_{\Delta t}(n\Delta t, X)| \le (1 + C\Delta t)|X| + C\Delta t$ for some constant $C$ independent of $1 \le n \le N$ and $X$.
Exercise 1.7 Work out a second-order scheme for the harmonic oscillator (1.5) and
show that all the above constraints are satisfied.
Exercise 1.8 Implement in Matlab the second-order scheme derived in the previous
exercise. Show the order of convergence numerically and compare the solutions with
the first-order scheme. Comment.
Runge-Kutta methods. The main drawback of the previous second-order scheme is that it requires knowing derivatives of $f$. This is undesirable in practice. However, the derivatives of $f$ can also be approximated by finite differences of the form (1.7). This is the basis for the Runge-Kutta methods.
Exercise 1.9 Show that the following schemes are second-order:
\[
\begin{aligned}
g^{(2)}_{\Delta t}(n\Delta t, X) &= X + \Delta t\, f\Bigl(n\Delta t + \frac{\Delta t}{2},\; X + \frac{\Delta t}{2} f(n\Delta t, X)\Bigr) \\
g^{(2)}_{\Delta t}(n\Delta t, X) &= X + \frac{\Delta t}{2} \Bigl[ f(n\Delta t, X) + f\bigl((n+1)\Delta t,\; X + \Delta t\, f(n\Delta t, X)\bigr) \Bigr] \\
g^{(2)}_{\Delta t}(n\Delta t, X) &= X + \frac{\Delta t}{4} \Bigl[ f(n\Delta t, X) + 3 f\Bigl(\bigl(n + \tfrac{2}{3}\bigr)\Delta t,\; X + \frac{2\Delta t}{3} f(n\Delta t, X)\Bigr) \Bigr].
\end{aligned}
\]
These schemes are called the Midpoint method, the Modified Euler method, and Heun's method, respectively.
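As a sketch, here is the Midpoint method from the first formula above, applied to the same scalar test problem $dX/dt = -X$ (our own choice of example); note that no derivatives of $f$ are needed anymore:

```python
import math

def midpoint_step(f, t, x, dt):
    """One step of the midpoint (second-order Runge-Kutta) method:
    evaluate f at the half step built from an explicit Euler predictor."""
    return x + dt * f(t + dt / 2, x + (dt / 2) * f(t, x))

# Example: dX/dt = -X over (0, 1) with 100 steps.
x, dt = 1.0, 0.01
for n in range(100):
    x = midpoint_step(lambda t, x: -x, n * dt, x, dt)
err = abs(x - math.exp(-1.0))  # second-order accurate in dt
```
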
Exercise 1.10 Implement in Matlab the Midpoint method of the preceding exercise to
solve the harmonic oscillator problem (1.5). Compare with previous discretizations.
All the methods we have seen so far are one-step methods, in the sense that $X^{n+1}$ only depends on $X^n$. Multi-step methods are generalizations where $X^{n+1}$ depends on $X^{n-k}$ for $k = 0, \ldots, m$ with $m \in \mathbb{N}$. Classical multi-step methods are the Adams-Bashforth and Adams-Moulton methods. The second- and fourth-order Adams-Bashforth methods are given by
\[
\begin{aligned}
X^{n+1} &= X^n + \frac{\Delta t}{2} \bigl( 3 f(n\Delta t, X^n) - f((n-1)\Delta t, X^{n-1}) \bigr) \\
X^{n+1} &= X^n + \frac{\Delta t}{24} \bigl( 55 f(n\Delta t, X^n) - 59 f((n-1)\Delta t, X^{n-1}) + 37 f((n-2)\Delta t, X^{n-2}) - 9 f((n-3)\Delta t, X^{n-3}) \bigr),
\end{aligned}
\tag{1.23}
\]
respectively.
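The two-step scheme in (1.23) needs two starting values; a common choice, assumed in this sketch, is to bootstrap with one explicit Euler step. The test problem $dX/dt = -X$ is again our own illustration:

```python
import math

def adams_bashforth2(f, x0, T, N):
    """Two-step Adams-Bashforth scheme from (1.23); the first step is
    bootstrapped with one explicit Euler step."""
    dt = T / N
    xs = [x0, x0 + dt * f(0.0, x0)]
    for n in range(1, N):
        xs.append(xs[n] + (dt / 2) * (3 * f(n * dt, xs[n])
                                      - f((n - 1) * dt, xs[n - 1])))
    return xs

xs = adams_bashforth2(lambda t, x: -x, 1.0, 1.0, 100)
err = abs(xs[-1] - math.exp(-1.0))  # second-order accurate in dt
```
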
Exercise 1.11 Implement in Matlab the Adams-Bashforth methods in (1.23) to solve
the harmonic oscillator problem (1.5). Compare with previous discretizations.
2 Finite Differences for Parabolic Equations
2.1 One dimensional Heat equation
Equation. Let us consider the simplest example of a parabolic equation
\[
\frac{\partial u}{\partial t}(t, x) - \frac{\partial^2 u}{\partial x^2}(t, x) = 0, \quad x \in \mathbb{R},\; t \in (0, T), \qquad u(0, x) = u_0(x), \quad x \in \mathbb{R}. \tag{2.1}
\]
This is the one-dimensional (in space) heat equation. Note that the equation is also linear, in the sense that the map $u_0(x) \to u(t, x)$ at a given time $t > 0$ is linear. In the rest of the course, we will mostly be concerned with linear equations. This equation actually admits an exact solution, given by
\[
u(t, x) = \frac{1}{\sqrt{4\pi t}} \int_{-\infty}^{\infty} \exp\Bigl( -\frac{|x - y|^2}{4t} \Bigr) u_0(y)\, dy. \tag{2.2}
\]
Exercise 2.1 Show that (2.2) solves (2.1).
Discretization. We now want to discretize (2.1). This time, we have two variables to discretize, time and space. The finite difference method consists of (i) introducing discretization points $X_j = j\Delta x$ for $j \in \mathbb{Z}$ and $\Delta x > 0$, and $T_n = n\Delta t$ for $0 \le n \le N = T/\Delta t$ and $\Delta t > 0$; (ii) approximating the solution $u(t, x)$ by
\[
U^n_j \approx u(T_n, X_j), \qquad 1 \le n \le N,\; j \in \mathbb{Z}. \tag{2.3}
\]
Notice that in practice, $j$ is finite. However, to simplify the presentation, we assume here that we discretize the whole line $x \in \mathbb{R}$.
The time variable is very similar to what we had for ODEs: knowing what happens at $T_n = n\Delta t$, we would like to get what happens at $T_{n+1} = (n+1)\Delta t$. We therefore introduce the notation
\[
\partial_t U^n_j = \frac{U^{n+1}_j - U^n_j}{\Delta t}. \tag{2.4}
\]
The operator $\partial_t$ plays the same role for the sequence $U^n_j$ as $\frac{\partial}{\partial t}$ does for the function $u(t, x)$.
The spatial variable is quite different from the time variable. There is no privileged direction of propagation (we do not know a priori if information comes from the left or the right) as there is for the time variable (we know that information at time $t$ will allow us to get information at time $t + \Delta t$). This intuitively explains why we introduce two types of discrete approximations for the spatial derivative:
\[
\partial_x^+ U^n_j = \frac{U^n_{j+1} - U^n_j}{\Delta x}, \quad \text{and} \quad \partial_x^- U^n_j = \frac{U^n_j - U^n_{j-1}}{\Delta x}. \tag{2.5}
\]
These are the forward and backward finite difference quotients, respectively. Now in (2.1) we have a second-order spatial derivative. Since there is a priori no reason to assume that information comes from the left or the right, we choose
\[
\frac{\partial^2 u}{\partial x^2}(T_n, X_j) \approx \partial_x^+ \partial_x^- U^n_j = \partial_x^- \partial_x^+ U^n_j = \frac{U^n_{j+1} - 2U^n_j + U^n_{j-1}}{\Delta x^2}. \tag{2.6}
\]
Exercise 2.2 Check the equalities in (2.6).
Notice that the discrete differentiation in (2.6) is centered: it is symmetric with respect to what comes from the left and what comes from the right. The finite difference scheme then reads
\[
\frac{U^{n+1}_j - U^n_j}{\Delta t} - \frac{U^n_{j+1} - 2U^n_j + U^n_{j-1}}{\Delta x^2} = 0. \tag{2.7}
\]
This is clearly an approximation of (2.1). This is also the simplest such scheme, called the explicit Euler scheme. Notice that this can be recast as
\[
U^{n+1}_j = \lambda (U^n_{j+1} + U^n_{j-1}) + (1 - 2\lambda) U^n_j, \qquad \lambda = \frac{\Delta t}{\Delta x^2}. \tag{2.8}
\]
Exercise 2.3 Check this.
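The update (2.8) is a few lines of code. Below is a rough sketch in Python (the exercises use Matlab); we truncate the infinite grid $j \in \mathbb{Z}$ to a finite interval and pin the boundary values at zero, which is an assumption of this sketch, not of the text:

```python
def heat_step(U, lam):
    """One explicit Euler time step (2.8) on a finite grid, keeping the
    boundary values fixed (a crude stand-in for the infinite grid)."""
    V = U[:]
    for j in range(1, len(U) - 1):
        V[j] = lam * (U[j + 1] + U[j - 1]) + (1 - 2 * lam) * U[j]
    return V

# Box initial data; with lam <= 1/2 the values stay within [0, 1],
# consistent with the maximum principle discussed below.
U = [0.0] * 20 + [1.0] * 10 + [0.0] * 20
for n in range(50):
    U = heat_step(U, 0.5)
```
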
The numerical procedure is therefore straightforward: knowing $U^n_j$ for $j \in \mathbb{Z}$, we calculate $U^{n+1}_j$ for $j \in \mathbb{Z}$, for $0 \le n \le N - 1$. There are then several questions that come to mind: is the scheme stable (i.e. does $U^n_j$ remain bounded for $n = N$ independently of the choices of $\Delta t$ and $\Delta x$)? Does it converge (is $U^n_j$ really a good approximation of $u(n\Delta t, j\Delta x)$ as advertised)? And how fast does it converge (what is the error between $U^n_j$ and $u(n\Delta t, j\Delta x)$)?
Exercise 2.4 Discretizing the interval $(-10, 10)$ with $N$ points, use a numerical approximation of (2.1) to compute $u(t, x)$ for $t = 0.01$, $t = 0.1$, $t = 1$, and $t = 10$ assuming that the initial condition is $u_0(x) = 1$ on $(-1, 1)$ and $u_0(x) = 0$ elsewhere. Implement the algorithm in Matlab.
Stability. Our first concern will be to ensure that the scheme is stable. The norm we choose to measure stability is the supremum norm. For a sequence $U = \{U_j\}_{j \in \mathbb{Z}}$ defined on the grid $j \in \mathbb{Z}$, we define
\[
\|U\|_\infty = \sup_{j \in \mathbb{Z}} |U_j|. \tag{2.9}
\]
Stability means that there exists a constant $C$ independent of $\Delta x$, $\Delta t$ and $1 \le n \le N = T/\Delta t$ such that
\[
\|U^n\|_\infty = \sup_{j \in \mathbb{Z}} |U^n_j| \le C \|U^0\|_\infty = C \sup_{j \in \mathbb{Z}} |U^0_j|. \tag{2.10}
\]
Since $u(t, x)$ remains bounded for all times (see (2.2)), it is clear that if the scheme is unstable, it has no chance to converge. We actually have to consider two cases according to whether $\lambda \le 1/2$ or $\lambda > 1/2$. When $\lambda > 1/2$, the scheme will be unstable! To show this we choose some initial conditions that oscillate very rapidly:
\[
U^0_j = (-1)^j \varepsilon, \qquad \varepsilon > 0.
\]
Here $\varepsilon$ is a constant that can be chosen as small as one wants. We easily verify that $\|U^0\|_\infty = \varepsilon$. Now straightforward calculations show that
\[
U^1_j = \Bigl[ \lambda \bigl( (-1)^{j+1} + (-1)^{j-1} \bigr) + (1 - 2\lambda)(-1)^j \Bigr] \varepsilon = (1 - 4\lambda)(-1)^j \varepsilon.
\]
By induction this implies that
\[
U^n_j = (1 - 4\lambda)^n (-1)^j \varepsilon. \tag{2.11}
\]
Now assume that $\lambda > 1/2$. This implies that $\rho = -(1 - 4\lambda) > 1$, whence
\[
\|U^n\|_\infty = \rho^n \varepsilon \to \infty \quad \text{as } n \to \infty.
\]
For instance, for $n = N = T/\Delta t$, we have that $\|U^N\|_\infty$, which was supposed to be an approximation of $\sup_{x \in \mathbb{R}} |u(T, x)|$, tends to infinity as $\Delta t \to 0$. Obviously (2.10) is violated: there is no constant $C$ independent of $\Delta t$ such that (2.10) holds. In practice, what this means is that any small fluctuation in the initial data (for instance caused by computer round-off) will be amplified by the scheme exponentially fast as in (2.11) and will eventually be bigger than any meaningful information.
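The instability computation above is easy to reproduce. A sketch on a periodic grid (our own assumption, so that every point has two neighbors, mimicking the infinite grid) with $\lambda = 0.52 > 1/2$; the oscillating data $(-1)^j \varepsilon$ is multiplied exactly by $(1 - 4\lambda)$ at each step, so its amplitude grows like $|1 - 4\lambda|^n$:

```python
def heat_step_periodic(U, lam):
    """One step of (2.8) on a periodic grid (so every point has neighbors)."""
    J = len(U)
    return [lam * (U[(j + 1) % J] + U[(j - 1) % J]) + (1 - 2 * lam) * U[j]
            for j in range(J)]

eps, lam = 1e-6, 0.52
U = [eps * (-1) ** j for j in range(16)]
for n in range(100):
    U = heat_step_periodic(U, lam)
amplitude = max(abs(v) for v in U)
# amplitude is roughly eps * 1.08**100, i.e. about 2000 times larger
```
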
This behavior has to be contrasted with the case $\lambda \le 1/2$. There, stability is obvious. Since both $\lambda$ and $(1 - 2\lambda)$ are nonnegative, we deduce from (2.8) that
\[
|U^{n+1}_j| \le \sup\{ |U^n_{j-1}|,\; |U^n_j|,\; |U^n_{j+1}| \}. \tag{2.12}
\]
Exercise 2.5 Check it.
So obviously,
\[
\|U^{n+1}\|_\infty \le \|U^n\|_\infty \le \|U^0\|_\infty.
\]
So (2.10) is clearly satisfied with C = 1. This property is referred to as a maximum
principle: the discrete solution at later times (n ≥ 1) is bounded by the maximum of the
initial solution. Notice that the continuous solution of (2.1) also satisfies this property.
Exercise 2.6 Check this using (2.2). Recall that $\int_{-\infty}^{\infty} e^{-x^2}\, dx = \sqrt{\pi}$.
Maximum principles are very important in the analysis of partial differential equations.
Although we are not going to use them further here, let us still mention that it is
important in general to enforce that the numerical schemes satisfy as many properties
of the continuous equation as possible.
We have seen that $\lambda \le 1/2$ is important to ensure stability. In terms of $\Delta t$ and $\Delta x$, this reads
\[
\Delta t \le \frac{1}{2} \Delta x^2. \tag{2.13}
\]
This is the simplest example of the famous CFL condition (after Courant, Friedrichs, and Lewy, who formulated it first). We shall see that explicit schemes are always conditionally stable for parabolic equations, whereas implicit schemes will be unconditionally stable, whence their practical interest. Notice that $\Delta t$ must be quite small (of order $\Delta x^2$) in order to obtain stability.
Convergence. Now that we have stability for the explicit Euler scheme when $\lambda \le 1/2$, we have to show that the scheme converges. This is done by assuming that we have local consistency of the discrete scheme with the equation, and that the solution of the equation is sufficiently regular. We therefore use the same three main ingredients as for ODEs: stability, consistency, regularity of the exact solution.
More specifically, let us introduce the error sequence $z^n = \{z^n_j\}_{j \in \mathbb{Z}}$ defined by
\[
z^n_j = U^n_j - u^n_j, \qquad \text{where } u^n_j = u(n\Delta t, j\Delta x).
\]
Convergence means that $\|z^n\|_\infty$ converges to $0$ as $\Delta t \to 0$ and $\Delta x \to 0$ (with the constraint that $\lambda \le 1/2$) for all $1 \le n \le N = T/\Delta t$. The order of convergence is how fast it converges. To simplify, we assume that $\lambda \le 1/2$ is fixed and send $\Delta t \to 0$ (in which case $\Delta x = \sqrt{\lambda^{-1} \Delta t}$ also converges to $0$). Our main ingredient is now the Taylor expansion (1.6). The consistency of the scheme consists of showing that the real dynamics are well approximated. More precisely, assuming that the solution at time $T_n = n\Delta t$ is given by $u^n_j$ on the grid, and that $U^n_j = u^n_j$, we have to show that the real solution $u^{n+1}_j$ at time $T_{n+1}$ is well approximated by $U^{n+1}_j$. To answer this, we calculate
\[
\tau^n_j = \partial_t u^n_j - \partial_x^+ \partial_x^- u^n_j. \tag{2.14}
\]
The expression $\tau^n_j$ is the truncation or local discretization error. Using Taylor expansions, we find that
\[
\begin{aligned}
\frac{u(t + \Delta t) - u(t)}{\Delta t} &= \frac{\partial u}{\partial t}(t) + \frac{\Delta t}{2} \frac{\partial^2 u}{\partial t^2}(t + s) \\
\frac{u(x + \Delta x) - 2u(x) + u(x - \Delta x)}{\Delta x^2} &= \frac{\partial^2 u}{\partial x^2}(x) + \frac{\Delta x^2}{12} \frac{\partial^4 u}{\partial x^4}(x + h),
\end{aligned}
\]
for some $0 \le s \le \Delta t$ and $-\Delta x \le h \le \Delta x$.
Exercise 2.7 Check this.
Since (2.1) is satisfied by $u(t, x)$ we deduce that
\[
\begin{aligned}
\tau^n_j &= \frac{\Delta t}{2} \frac{\partial^2 u}{\partial t^2}(n\Delta t + s) - \frac{\Delta x^2}{12} \frac{\partial^4 u}{\partial x^4}(j\Delta x + h) \\
&= \Delta x^2 \Bigl[ \frac{\lambda}{2} \frac{\partial^2 u}{\partial t^2}(n\Delta t + s) - \frac{1}{12} \frac{\partial^4 u}{\partial x^4}(j\Delta x + h) \Bigr].
\end{aligned}
\]
The truncation error is therefore of order $m = 2$ since it is proportional to $\Delta x^2$. Of course, this accuracy holds if we can bound the term in the square brackets independently of the discretization parameters $\Delta t$ and $\Delta x$. Notice however that this term only involves the exact solution $u(t, x)$. All we need is that the exact solution is sufficiently regular. Again, this is independent of the discretization scheme we choose.
Let us assume that the solution is sufficiently regular (what we need is that it is of class $C^{2,4}$, i.e. that it is continuously differentiable 2 times in time and 4 times in space). We then deduce that
\[
\|\tau^n\|_\infty \le C \Delta x^2 \sim C \Delta t, \tag{2.15}
\]
where $C$ is a constant independent of $\Delta x$ (and $\Delta t$ since $\lambda$ is fixed). Since $U^n$ satisfies the discretized equation, we deduce that
\[
\partial_t z^n_j - \partial_x^+ \partial_x^- z^n_j = -\tau^n_j, \tag{2.16}
\]
or equivalently that
\[
z^{n+1}_j = \lambda (z^n_{j+1} + z^n_{j-1}) + (1 - 2\lambda) z^n_j - \Delta t\, \tau^n_j. \tag{2.17}
\]
The above equation allows us to analyze local consistency. Indeed, let us assume that the exact solution is known on the spatial grid at time step $n$. This implies that $z^n = 0$, since the latter is by definition the difference between the exact solution and the approximate solution on the spatial grid. Equation (2.17) thus implies that
\[
z^{n+1}_j = -\Delta t\, \tau^n_j,
\]
so that
\[
\|z^{n+1}\|_\infty \le C \Delta t^2 \sim C \Delta t \Delta x^2, \tag{2.18}
\]
thanks to (2.15). Comparing with the section on ODEs, this implies that the scheme is of order $m = 1$ in time since $\Delta t^2 = \Delta t^{m+1}$ for $m = 1$. Being of order 1 in time implies that the scheme is of order 2 in space since $\Delta t = \lambda \Delta x^2$. This will be confirmed in Theorem 2.1 below.
Let us now conclude the analysis of the error estimate and come back to (2.17). Using the stability estimate (2.12) for the homogeneous problem (because $\lambda \le 1/2$!) and the bound (2.15), we deduce that
\[
\|z^{n+1}\|_\infty \le \|z^n\|_\infty + C \Delta t \Delta x^2.
\]
This in turn implies that
\[
\|z^n\|_\infty \le n C \Delta t \Delta x^2 \le C T \Delta x^2.
\]
We summarize what we have obtained as
Theorem 2.1 Let us assume that $u(t, x)$ is of class $C^{2,4}$ and that $\lambda \le 1/2$. Then we have that
\[
\sup_{j \in \mathbb{Z};\, 0 \le n \le N} |u(n\Delta t, j\Delta x) - U^n_j| \le C \Delta x^2, \tag{2.19}
\]
where $C$ is a constant independent of $\Delta x$ and $\lambda \le 1/2$.
The explicit Euler scheme is of order 2 in space. But this corresponds to a scheme of order 1 in time since $\Delta t = \lambda \Delta x^2 \le \Delta x^2 / 2$. Therefore it does not converge very fast.
Exercise 2.8 Implement the explicit Euler scheme in Matlab by discretizing the parabolic equation on $x \in (-10, 10)$ and assuming that $u(t, -10) = u(t, 10) = 0$. You may choose as an initial condition $u_0(x) = 1$ on $(-1, 1)$ and $u_0(x) = 0$ elsewhere. Choose the discretization parameters $\Delta t$ and $\Delta x$ such that $\lambda = .48$, $\lambda = .5$, and $\lambda = .52$. Compare the stable solutions you obtain with the solution given by (2.2).
Exercise 2.9 Following the techniques described in Chapter 1, derive a scheme of order 4 in space (still with $\Delta t = \lambda \Delta x^2$ and $\lambda$ sufficiently small that the scheme is stable). Implement this new scheme in Matlab and compare it with the explicit Euler scheme on the numerical simulation described in Exercise 2.8.
2.2 Fourier transforms
We now consider a fundamental tool in the analysis of partial differential equations with constant coefficients and their discretization by finite differences: Fourier transforms.
There are many ways of defining the Fourier transform. We shall use the following definition
\[
[\mathcal{F}_{x \to \xi} f](\xi) \equiv \hat{f}(\xi) = \int_{-\infty}^{\infty} e^{-ix\xi} f(x)\, dx. \tag{2.20}
\]
The inverse Fourier transform is defined by
\[
[\mathcal{F}^{-1}_{\xi \to x} \hat{f}](x) \equiv f(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{ix\xi} \hat{f}(\xi)\, d\xi. \tag{2.21}
\]
In the above definition, the operator $\mathcal{F}$ is applied to the function $f(x)$, which it maps to the function $\hat{f}(\xi)$. The notation can become quite useful when there are several variables and one wants to emphasize with respect to which variable the Fourier transform is taken. We call the space of $x$'s the physical domain and the space of $\xi$'s the wavenumber domain, or Fourier domain. Wavenumbers are to positions what frequencies are to times. If you are more familiar with Fourier transforms of signals in time, you may want to think of "frequencies" each time you see "wavenumbers". A crucial property, which explains the terminology of "forward" and "inverse" Fourier transforms, is that
\[
\mathcal{F}^{-1}_{\xi \to x} \mathcal{F}_{x \to \xi} f = f, \tag{2.22}
\]
which can be recast as
\[
f(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{ix\xi} \int_{-\infty}^{\infty} e^{-iy\xi} f(y)\, dy\, d\xi = \int_{-\infty}^{\infty} \Bigl( \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{i(x-y)\xi}\, d\xi \Bigr) f(y)\, dy.
\]
This implies that
\[
\frac{1}{2\pi} \int_{-\infty}^{\infty} e^{iz\xi}\, d\xi = \delta(z),
\]
the Dirac delta function, since $f(x) = \int_{-\infty}^{\infty} \delta(x - y) f(y)\, dy$ by definition. Another way of understanding this is to realize that
\[
[\mathcal{F}\delta](\xi) = 1, \qquad [\mathcal{F}^{-1} 1](x) = \delta(x). \tag{2.23}
\]
The first equality can also trivially be deduced from (2.20).
Another interesting calculation for us is that
\[
\Bigl[\mathcal{F} \exp\Bigl(-\frac{\alpha}{2} x^2\Bigr)\Bigr](\xi) = \sqrt{\frac{2\pi}{\alpha}} \exp\Bigl(-\frac{1}{2\alpha} \xi^2\Bigr). \tag{2.24}
\]
In other words, the Fourier transform of a Gaussian is also a Gaussian. Notice this important fact: a very narrow Gaussian ($\alpha$ very large) is transformed into a very wide one ($\alpha^{-1}$ very small), and vice-versa. This is related to Heisenberg's uncertainty principle in quantum mechanics, and it says that you cannot be localized in the Fourier domain when you are localized in the spatial domain and vice-versa.
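Formula (2.24) can be checked by brute-force numerical integration of (2.20). A sketch in Python, assuming a simple midpoint Riemann sum on a truncated interval is accurate enough (it is here, since the Gaussian decays very fast):

```python
import math

def fourier_transform(f, xi, L=20.0, n=4000):
    """Numerically approximate (2.20): the integral of exp(-i*x*xi) f(x) dx,
    using a midpoint Riemann sum on the truncated interval (-L, L)."""
    dx = 2 * L / n
    re = im = 0.0
    for k in range(n):
        x = -L + (k + 0.5) * dx
        re += math.cos(x * xi) * f(x) * dx
        im += -math.sin(x * xi) * f(x) * dx
    return re, im

# Check (2.24) for alpha = 1 at xi = 1:
# F[exp(-x^2/2)](xi) = sqrt(2*pi) * exp(-xi^2/2).
alpha = 1.0
re, im = fourier_transform(lambda x: math.exp(-alpha * x**2 / 2), 1.0)
expected = math.sqrt(2 * math.pi / alpha) * math.exp(-1.0 / (2 * alpha))
```
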
Exercise 2.10 Check (2.24) using integration in the complex plane (which essentially says that $\int_{-\infty}^{\infty} e^{-\beta(x - i\rho)^2}\, dx = \int_{-\infty}^{\infty} e^{-\beta x^2}\, dx$ for $\rho \in \mathbb{R}$ by contour change, since $z \mapsto e^{-\beta z^2}$ has no pole in $\mathbb{C}$).
The simplest property of the Fourier transform is its linearity:
\[
[\mathcal{F}(\alpha f + \beta g)](\xi) = \alpha [\mathcal{F}f](\xi) + \beta [\mathcal{F}g](\xi). \tag{2.25}
\]
Other very important properties of the Fourier transform are as follows:
\[
[\mathcal{F}f(x + y)](\xi) = e^{iy\xi} \hat{f}(\xi), \qquad [\mathcal{F}f(\nu x)](\xi) = \frac{1}{|\nu|} \hat{f}\Bigl(\frac{\xi}{\nu}\Bigr), \tag{2.26}
\]
for all $y \in \mathbb{R}$ and $\nu \in \mathbb{R} \setminus \{0\}$.
Exercise 2.11 Check these formulas.
The first equality shows how the Fourier transform acts on translations by a factor $y$, the second how it acts on dilations. The former is extremely important in the analysis of PDEs and their finite difference discretization:
Fourier transforms replace translations in the physical domain by multiplications in the Fourier domain.
More precisely, translation by $y$ is replaced by multiplication by $e^{iy\xi}$.
The reason this is important is that you can't beat multiplications in simplicity. To further convince you of the importance of the above assertion, consider derivatives of functions. By linearity of the Fourier transform, we get that
\[
\Bigl[\mathcal{F}\frac{f(x + h) - f(x)}{h}\Bigr](\xi) = \frac{[\mathcal{F}f(x + h)](\xi) - [\mathcal{F}f(x)](\xi)}{h} = \frac{e^{ih\xi} - 1}{h} \hat{f}(\xi).
\]
Now let us just pass to the limit $h \to 0$ on both sides. We get
\[
\widehat{f'}(\xi) = [\mathcal{F}f'](\xi) = i\xi \hat{f}(\xi). \tag{2.27}
\]
Again,
Fourier transforms replace differentiation in the physical domain by multiplication in the Fourier domain.
More precisely, $\frac{d}{dx}$ is replaced by multiplication by $i\xi$.
This assertion can easily be verified directly from the definition (2.20). However, it is useful to see differentiation as a limit of differences of translations. Moreover, finite differences are based on translations, so we see how Fourier transforms will be useful in their analysis.
One last crucial and quite related property of the Fourier transform is that it replaces convolutions by multiplications. The convolution of two functions is given by
\[
(f * g)(x) = \int_{-\infty}^{\infty} f(x - y) g(y)\, dy = \int_{-\infty}^{\infty} f(y) g(x - y)\, dy. \tag{2.28}
\]
The Fourier transform is then given by
\[
[\mathcal{F}(f * g)](\xi) = \hat{f}(\xi)\, \hat{g}(\xi). \tag{2.29}
\]
Exercise 2.12 Check this.
This important relation can be used to prove the equally important (everything is important in this section!) Parseval equality
\[
\int_{-\infty}^{\infty} |\hat{f}(\xi)|^2\, d\xi = 2\pi \int_{-\infty}^{\infty} |f(x)|^2\, dx, \tag{2.30}
\]
for all complex-valued functions $f(x)$ such that the above integrals make sense. We show this as follows. We first recall that $\overline{\hat{f}(\xi)} = \widehat{\bar{f}}(-\xi)$, where $\bar{f}$ denotes the complex conjugate of $f$. We also define $g(x) = \bar{f}(-x)$, so that $\hat{g}(\xi) = \widehat{\bar{f}}(-\xi)$ using (2.26) with $\nu = -1$. We then have
\[
\begin{aligned}
\int_{-\infty}^{\infty} |\hat{f}(\xi)|^2\, d\xi
&= \int_{-\infty}^{\infty} \hat{f}(\xi)\, \overline{\hat{f}(\xi)}\, d\xi
= \int_{-\infty}^{\infty} \hat{f}(\xi)\, \widehat{\bar{f}}(-\xi)\, d\xi
= \int_{-\infty}^{\infty} \hat{f}(\xi)\, \hat{g}(\xi)\, d\xi \\
&= \int_{-\infty}^{\infty} \widehat{(f * g)}(\xi)\, d\xi && \text{thanks to (2.29)} \\
&= 2\pi (f * g)(0) && \text{using the definition (2.21) with } x = 0 \\
&= 2\pi \int_{-\infty}^{\infty} f(y) g(-y)\, dy && \text{by definition of the convolution} \\
&= 2\pi \int_{-\infty}^{\infty} f(y) \overline{f(y)}\, dy = 2\pi \int_{-\infty}^{\infty} |f(y)|^2\, dy.
\end{aligned}
\]
The Parseval equality (2.30) essentially says that there is as much energy (up to a factor
2π) in the physical domain (left-hand side in (2.30)) as in the Fourier domain (right-hand
side).
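The Parseval equality has an exact discrete counterpart that is easy to test: in numpy's DFT convention the factor 2π is replaced by the number of samples N. A minimal sketch, where the random signal is an arbitrary illustrative choice:

```python
import numpy as np

# Discrete analogue of the Parseval equality (2.30):
# sum_k |F_k|^2 = N * sum_j |f_j|^2, the factor N playing the role of 2*pi.
rng = np.random.default_rng(0)
f = rng.standard_normal(128) + 1j * rng.standard_normal(128)
F = np.fft.fft(f)

energy_fourier = np.sum(np.abs(F) ** 2)
energy_physical = len(f) * np.sum(np.abs(f) ** 2)
assert np.allclose(energy_fourier, energy_physical)
print("Parseval holds up to the factor N =", len(f))
```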
Application to the heat equation. Before considering finite differences, here is an
important exercise that shows how powerful Fourier transforms are to solve PDEs with
constant coefficients.
Exercise 2.13 Consider the heat equation (2.1). Using the Fourier transform, show
that
∂û(t, ξ)/∂t = −ξ² û(t, ξ).
Here we mean û(t, ξ) = [T_{x→ξ} u(t, x)](t, ξ). Solve this ODE. Then using (2.24) and
(2.28), recover (2.2).
Application to Finite Differences. We now use the Fourier transform to analyze
finite difference discretizations. To simplify we still assume that we want to solve the
heat equation on the whole line x ∈ R.
Let u_0(x) be the initial condition for the heat equation. If we use the notation
U^n(j∆x) = U^n_j ≈ u(n∆t, j∆x),
the explicit Euler scheme (2.7) can be recast as
(U^{n+1}(x) − U^n(x))/∆t = (U^n(x + ∆x) + U^n(x − ∆x) − 2U^n(x))/∆x²,   (2.31)
where x = j∆x. As before, this scheme says that the approximation U at time T_{n+1} =
(n + 1)∆t and position X_j = j∆x depends on the values of U at time T_n and positions
X_{j−1}, X_j, and X_{j+1}. It is therefore straightforward to realize that U^n(X_j) depends on
u_0 at all the points X_k for j − n ≤ k ≤ j + n. It is however not easy at all to get this
dependence explicitly from (2.31).
The key to analyzing this dependence is to go to the Fourier domain. There, trans-
lations are replaced by multiplications, which are much simpler to deal with. Here is how
we do it. We first define U^0(x) for all points x ∈ R, not only those points x = X_j = j∆x.
One way to do so is to assume that
u_0(x) = U^0_j on ((j − 1/2)∆x, (j + 1/2)∆x).   (2.32)
We then define U^n(x) using (2.31) initialized with U^0(x) = u_0(x). Clearly, we have
U^{n+1}(x) = (1 − 2λ)U^n(x) + λ(U^n(x + ∆x) + U^n(x − ∆x)).   (2.33)
We thus define U^n(x) for all n ≥ 0 and all x ∈ R by induction. If we choose as initial
condition (2.32), we then easily observe that
U^n(x) = U^n_j on ((j − 1/2)∆x, (j + 1/2)∆x),   (2.34)
where U^n_j is given by (2.7). So analyzing U^n(x) is in some sense sufficient to get infor-
mation on U^n_j.
Once U^n(x) is defined for all n, we can certainly introduce the Fourier transform
Û^n(ξ) defined according to (2.20). Moreover, using the translation property of the Fourier
transform (2.26), we obtain that
Û^{n+1}(ξ) = (1 + λ(e^{iξ∆x} + e^{−iξ∆x} − 2)) Û^n(ξ).   (2.35)
In the Fourier domain, marching in time (going from step n to step n + 1) is much
simpler than in the physical domain: each wavenumber ξ is multiplied by a coefficient
R_λ(ξ) = 1 + λ(e^{iξ∆x} + e^{−iξ∆x} − 2) = 1 − 2λ(1 − cos(ξ∆x)).   (2.36)
We then have that
Û^n(ξ) = (R_λ(ξ))^n û_0(ξ).   (2.37)
The question of the stability of a discretization now becomes the following: are all
wavenumbers ξ stable, or equivalently is (R_λ(ξ))^n bounded for all 1 ≤ n ≤ N = T/∆t?
It is easy to observe that when λ ≤ 1/2, |R_λ(ξ)| ≤ 1 for all ξ. We then clearly have
that all wavenumbers are stable since |Û^n(ξ)| ≤ |û_0(ξ)|. However, when λ > 1/2, we
can choose values of ξ such that cos(ξ∆x) is close to −1. For such wavenumbers, the
value of R_λ(ξ) is less than −1, so that Û^n changes sign at every new time step n with
growing amplitude (see numerical simulations). Such wavenumbers are unstable.
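The stability threshold λ = 1/2 can also be seen numerically by scanning the amplification factor R_λ(ξ) of (2.36) over all wavenumbers; below is a minimal sketch, where the sample values of λ are illustrative choices:

```python
import numpy as np

# Amplification factor (2.36) of the explicit Euler scheme:
# R_lambda(xi) = 1 - 2*lambda*(1 - cos(xi*dx)).  Scanning xi*dx over one
# period shows |R| <= 1 for every wavenumber exactly when lambda <= 1/2.
def R(lam, xi_dx):
    return 1.0 - 2.0 * lam * (1.0 - np.cos(xi_dx))

xi_dx = np.linspace(-np.pi, np.pi, 2001)
assert np.max(np.abs(R(0.5, xi_dx))) <= 1.0 + 1e-12   # stable
assert np.max(np.abs(R(0.6, xi_dx))) > 1.0            # unstable modes exist
# the worst wavenumber satisfies cos(xi*dx) = -1:
print("R at xi*dx = pi, lambda = 0.6:", R(0.6, np.pi))
```

For λ = 0.6 the wavenumber with cos(ξ∆x) = −1 gives R = 1 − 4λ = −1.4, the sign-flipping growing mode described above.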
2.3 Stability and convergence using Fourier transforms
In this section, we apply the theory of Fourier transform to the analysis of the finite
difference of parabolic equations with constant coefficients.
As we already mentioned, an important property of Fourier transforms is that
∂/∂x → iξ,
that is, differentiation is replaced by multiplication by iξ in the Fourier domain. We use
this to define equations in the physical domain as follows:
∂u/∂t (t, x) + P(D)u(t, x) = 0,   t > 0, x ∈ R,
u(0, x) = u_0(x),   x ∈ R.   (2.38)
The operator P(D) is a linear operator defined as
P(D)f(x) = T^{−1}_{ξ→x} [P(iξ) T_{x→ξ} f(x)](x),   (2.39)
where P(iξ) is a function of iξ. Here D stands for ∂/∂x. What the operator P(D) does
is: (i) take the Fourier transform of f(x), (ii) multiply f̂(ξ) by the function P(iξ), (iii)
take the inverse Fourier transform of the product.
The function P(iξ) is called the symbol of the operator P(D). Such operators are
called pseudo-differential operators.
Why do we introduce all this? The reason, as mentioned in the previous
section, is that the analysis of differential operators simplifies in the Fourier domain.
The above definition gives us a mathematical framework to build on this idea. Indeed,
let us consider (2.38) in the Fourier domain. Upon taking the Fourier transform T_{x→ξ}
term by term in (2.38), we get that
∂û/∂t (t, ξ) + P(iξ) û(t, ξ) = 0,   t > 0, ξ ∈ R,
û(0, ξ) = û_0(ξ),   ξ ∈ R.   (2.40)
As promised, we have replaced the "complicated" pseudo-differential operator P(D) by
a mere multiplication by P(iξ) in the Fourier domain. For every ξ ∈ R, (2.40) consists
of a simple linear first-order ordinary differential equation. Notice the parallel between
(2.38) and (2.40): by going to the Fourier domain, we have replaced the differentiation
operator D = ∂/∂x by iξ. The solution of (2.40) is then obviously given by
û(t, ξ) = e^{−tP(iξ)} û_0(ξ).   (2.41)
The solution of (2.38) is then given by
u(t, x) = [T^{−1}_{ξ→x}(e^{−tP(iξ)})] ∗ u_0(x).   (2.42)
Exercise 2.14 (i) Show that the parabolic equation (2.1) takes the form (2.38) with
symbol P(iξ) = ξ².
(ii) Find the equation of the form (2.38) with symbol P(iξ) = ξ⁴.
(iii) Show that any arbitrary partial differential equation of order m ≥ 1 in x with con-
stant coefficients can be written in the form (2.38) with the symbol P(iξ) a polynomial
of order m.
Parabolic Symbol. In this section, we only consider real-valued symbols such that,
for some constants C > 0 and M > 0,
P(iξ) ≥ C|ξ|^M,   for all ξ ∈ R.   (2.43)
With a somewhat loose terminology, we refer to equations (2.38) with real-valued symbol
P(iξ) satisfying the above constraint as parabolic equations. The heat equation clearly
satisfies the above requirement with M = 2. The real-valuedness of the symbol is not
necessary, but we shall impose it to simplify.
What about numerical schemes? Our definition so far provides a great tool to
analyze partial differential equations. It turns out it is also very well adapted to the
analysis of finite difference approximations to these equations. The reason is again that
finite difference operators are transformed into multiplications in the Fourier domain.
A general single-step finite difference scheme is given by
B_∆ U^{n+1}(x) = A_∆ U^n(x),   n ≥ 0,   (2.44)
with U^0(x) = u_0(x) a given initial condition. Here, the finite difference operators are
defined by
A_∆ U(x) = Σ_{α∈I_α} a_α U(x − α∆x),   B_∆ U(x) = Σ_{β∈I_β} b_β U(x − β∆x),   (2.45)
where I_α ⊂ Z and I_β ⊂ Z are finite sets of integers. Although this definition may look
a little complicated at first, we need such complexity in practice. For instance, the
explicit Euler scheme is given by
U^{n+1}(x) = (1 − 2λ)U^n(x) + λ(U^n(x − ∆x) + U^n(x + ∆x)).
This corresponds to the coefficients
b_0 = 1,   a_{−1} = λ,   a_0 = 1 − 2λ,   a_1 = λ,
and all the other coefficients a_α and b_β vanish. Explicit schemes are characterized by
the fact that only b_0 does not vanish. We shall see that implicit schemes, where other
coefficients b_β ≠ 0, are important in practice.
Notice that we are slightly changing the problem of interest. Finite difference
schemes are supposed to be defined on the grid ∆xZ (i.e. the points x_m = m∆x
for m ∈ Z), not on the whole axis R. Let us notice however that from the definition
(2.44), the values of U^{n+1} on the grid ∆xZ only depend on the values of u_0 on the same
grid, and that U^{n+1}(j∆x) = U^{n+1}_j, using the notation of the preceding section. In other
words, U^{n+1}(j∆x) does not depend on how we extend the initial condition u_0 originally
defined on the grid ∆xZ to the whole line R. This is why we replace U^{n+1}_j by U^{n+1}(x)
in the sequel and only consider properties of U^{n+1}(x). In the end, we should see how
the properties of U^{n+1}(x) translate into properties of U^{n+1}_j. This is easy to do when
U^{n+1}(x) is chosen as a piecewise constant function and more difficult when higher-order
polynomials are used to construct it. It can be done but requires careful analysis of the
error made by interpolating smooth functions by polynomials. We do not consider this
problem in these notes.
We now come back to (2.44). From now on, this is our definition of a finite difference
scheme and we want to analyze its properties. As we already mentioned, this equation
is simpler in the Fourier domain. Let us thus take the Fourier transform of each term
in (2.44). What we get is
(Σ_{β∈I_β} b_β e^{−iβ∆xξ}) Û^{n+1}(ξ) = (Σ_{α∈I_α} a_α e^{−iα∆xξ}) Û^n(ξ).   (2.46)
Let us introduce the symbols
A_∆(ξ) = Σ_{α∈I_α} a_α e^{−iα∆xξ},   B_∆(ξ) = Σ_{β∈I_β} b_β e^{−iβ∆xξ}.   (2.47)
In the implicit case, we assume that
B_∆(ξ) ≠ 0.   (2.48)
In the explicit case, B_∆(ξ) ≡ 1 and the above relation is obvious. We then obtain that
Û^{n+1}(ξ) = R_∆(ξ) Û^n(ξ) = B_∆(ξ)^{−1} A_∆(ξ) Û^n(ξ).   (2.49)
This implies
Û^n(ξ) = R^n_∆(ξ) û_0(ξ).   (2.50)
The above relation completely characterizes the solution Û^n(ξ). To obtain the solution
U^n(x), we simply have to take the inverse Fourier transform of Û^n(ξ).
Exercise 2.15 Verify that for the explicit Euler scheme, we have
R_∆(ξ) = 1 − (2∆t/(∆x)²)(1 − cos(ξ∆x)).
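The symbol can also be assembled directly from the stencil coefficients a_α via (2.47) and compared with the closed form of the exercise; below is a minimal sketch, where the values of λ and ∆x are illustrative choices:

```python
import numpy as np

# Symbol (2.47) of the explicit Euler scheme built from its stencil
# coefficients a_{-1} = a_1 = lambda, a_0 = 1 - 2*lambda, compared with
# the closed form of Exercise 2.15 (with dt = lambda * dx**2).
lam, dx = 0.4, 0.1
dt = lam * dx ** 2
stencil = {-1: lam, 0: 1 - 2 * lam, 1: lam}   # alpha -> a_alpha

xi = np.linspace(-np.pi / dx, np.pi / dx, 501)
symbol = sum(a * np.exp(-1j * alpha * dx * xi) for alpha, a in stencil.items())
closed_form = 1 - 2 * dt / dx ** 2 * (1 - np.cos(xi * dx))

assert np.allclose(symbol, closed_form)
print("max deviation:", np.max(np.abs(symbol - closed_form)))
```

The imaginary parts of the assembled symbol cancel because the stencil is symmetric, which is why the closed form is real.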
Notice that the difference between the exact solution û(n∆t, ξ) at time n∆t and its
approximation Û^n(ξ) is given by
û(n∆t, ξ) − Û^n(ξ) = (e^{−n∆tP(iξ)} − R^n_∆(ξ)) û_0(ξ).   (2.51)
This is what we want to analyze.
To measure convergence, we first need to define a norm, which will tell us how far
or how close two functions are from one another. In finite dimensional spaces (such as
R^n, which we used for ordinary differential equations), all norms are equivalent (this is a
theorem). In infinite dimensional spaces, not all norms are equivalent (this is another
theorem; the equivalence of all norms on a space implies that it is finite dimensional...)
So the choice of the norm matters!
Here we only consider the L² norm, defined on R for sufficiently regular complex-
valued functions f by
‖f‖ = (∫_{−∞}^{∞} |f(x)|² dx)^{1/2}.   (2.52)
The reason why this norm is nice is that we have the Parseval equality
‖f̂(ξ)‖ = √(2π) ‖f(x)‖.   (2.53)
This means that the L² norms in the physical and the Fourier domains are equal (up
to the factor √(2π)). Most norms do not have such a nice interpretation in the Fourier
domain. This is one of the reasons why the L² norm is so popular.
Our analysis of the finite difference scheme is therefore concerned with characterizing
‖u(n∆t, x) − U^n(x)‖.
We just saw that this is equivalent to characterizing
‖ε^n(ξ)‖ = ‖û(n∆t, ξ) − Û^n(ξ)‖ = ‖(e^{−n∆tP(iξ)} − R^n_∆(ξ)) û_0(ξ)‖.
The control that we will obtain on ε^n(ξ) depends on the three classical ingredients:
regularity of the solution, stability of the scheme, and consistency of the approximation.
The convergence arises as two parameters, ∆t and ∆x, converge to 0. To simplify
the analysis, we assume that the two parameters are related by
λ = ∆t/∆x^M,   (2.54)
where M is the order of the parabolic equation (M = 2 for the heat equation), and
where λ > 0 is fixed.
Regularity. The regularity of the exact equation is first seen in the decay of |e^{−∆tP(iξ)}|.
Since our operator is assumed to be parabolic, we have
|e^{−∆tP(iξ)}| ≤ e^{−C∆t|ξ|^M}.   (2.55)
Stability. We impose that there exists a constant C independent of ξ and 1 ≤ n ≤
N = T/∆t such that
|R^n_∆(ξ)| ≤ C,   ξ ∈ R, 1 ≤ n ≤ N.   (2.56)
The stability constraint is clear: no wavenumber ξ can grow out of control.
Exercise 2.16 Check again that the above constraint is satisfied for the explicit Euler
scheme if and only if λ ≤ 1/2.
What makes parabolic equations a little special is that they have better stability prop-
erties than just having bounded symbols. The exact symbol satisfies
|e^{−n∆tP(iξ)}| ≤ e^{−Cn∆t|ξ|^M},   (2.57)
for some constant C. This relation is important because we deduce that high wavenum-
bers (large ξ's) are heavily damped by the operator. Parabolic equations have a very
efficient smoothing effect. Even if high wavenumbers are present in the initial solution,
they quickly disappear as time increases.
Not surprisingly, the finite difference approximation partially retains this effect. We
have however to be careful. The reason is that the scheme has a given scale, ∆x, at
which it approximates the exact operator. For length scales much larger than ∆x, we
expect the scheme to be a good approximation of the exact operator. For lengths of order
(or smaller than) ∆x however, we cannot expect the scheme to be accurate: we do not
have sufficient resolution. Remember that small scales in the spatial world mean
high wavenumbers in the Fourier world (functions that vary on small scales oscillate
fast, hence are composed of large wavenumbers). Nevertheless, for wavenumbers that
are smaller than ∆x^{−1}, we expect the scheme to retain some properties of the exact
equation. For this reason, we impose that the operator R_∆ satisfy
|R_∆(ξ)| ≤ 1 − δ|∆xξ|^M   for |∆xξ| ≤ γ,
|R_∆(ξ)| ≤ 1   for |∆xξ| > γ,   (2.58)
where γ and δ are positive constants independent of ξ and ∆x. We then have
Lemma 2.2 We have that
|R^n_∆(ξ)| ≤ e^{−Cn|∆xξ|^M}   for |∆xξ| ≤ γ.   (2.59)
Proof. We deduce from (2.58) that
|R^n_∆(ξ)| ≤ e^{n ln(1 − δ|∆xξ|^M)} ≤ e^{−n(δ/2)|∆xξ|^M},
since ln(1 − u) ≤ −u for u ∈ [0, 1).
Consistency. Let us now turn our attention to the consistency and accuracy of the
finite difference scheme. Here again, we want to make sure that the finite difference
scheme is locally a good approximation of the exact equation. In the Fourier world, this
means that
e^{−∆tP(iξ)} − R_∆(ξ)   (2.60)
has to converge to 0 as ∆x and ∆t converge to 0. Indeed, assuming that we are given
the exact solution û(n∆t, ξ) at time n∆t, the error made by applying the approximate
equation instead of the exact equation between n∆t and (n + 1)∆t is precisely given by
(e^{−∆tP(iξ)} − R_∆(ξ)) û(n∆t, ξ).
Again, we have to be a bit careful. We cannot expect the above quantity to be small
for all values of ξ. This is again related to the fact that wavenumbers of order of or
greater than ∆x^{−1} are not captured by the finite difference scheme.
Exercise 2.17 Show that in the case of the heat equation and the explicit Euler scheme,
the error e^{−∆tP(iξ)} − R_∆(ξ) is not small for all wavenumbers of the form ξ = kπ∆x^{−1}
for k = 1, 2, . . ..
Quantitatively, all we can expect is that the error is small when ξ∆x → 0 and ∆t → 0.
We say that the scheme is consistent and accurate of order m if
|e^{−∆tP(iξ)} − R_∆(ξ)| ≤ C∆t∆x^m (1 + |ξ|^{M+m}),   ∆x|ξ| < γ,   (2.61)
where C is a constant independent of ξ, ∆x, and ∆t.
Proof of convergence. Let us come back to the analysis of ε^n(ξ). We have that
ε^n(ξ) = (e^{−∆tP(iξ)} − R_∆(ξ)) Σ_{k=0}^{n−1} e^{−(n−1−k)∆tP(iξ)} R^k_∆(ξ) û_0(ξ),
using a^n − b^n = (a − b) Σ_{k=0}^{n−1} a^{n−1−k} b^k. Let us first consider the case |ξ| ≤ γ∆x^{−1}. We
deduce from (2.57), (2.59), and (2.61) that
|ε^n(ξ)| ≤ C∆t∆x^m (1 + |ξ|^{M+m}) n e^{−Cn∆t|ξ|^M} |û_0(ξ)|.   (2.62)
Now remark that
n∆t|ξ|^M e^{−Cn∆t|ξ|^M} ≤ C′.
Since n∆t ≤ T, this implies that
|ε^n(ξ)| ≤ C∆x^m (1 + |ξ|^m) |û_0(ξ)|.   (2.63)
So far, the above inequality holds for |ξ| ≤ γ∆x^{−1}. However, for |ξ| ≥ γ∆x^{−1}, we have
thanks to the stability property (2.56) that
|ε^n(ξ)| ≤ 2|û_0(ξ)| ≤ C∆x^m (1 + |ξ|^m) |û_0(ξ)|.   (2.64)
So (2.63) actually holds for every ξ. This implies that
‖ε^n‖ = (∫_R |ε^n(ξ)|² dξ)^{1/2} ≤ C∆x^m (∫_R (1 + |ξ|^{2m}) |û_0(ξ)|² dξ)^{1/2}.   (2.65)
This shows that ‖ε^n‖ is of order ∆x^m provided that
(∫_R (1 + |ξ|^{2m}) |û_0(ξ)|² dξ)^{1/2} < ∞.   (2.66)
The above inequality implies that û_0(ξ) decays sufficiently fast as ξ → ∞. This is
actually equivalent to imposing that u_0(x) is sufficiently regular.
The Hilbert spaces H^m(R). For a function v(x), we denote by v^{(k)}(x) its kth deriva-
tive. We introduce the sequence of functional spaces H^m = H^m(R) of functions whose
derivatives up to order m are square-integrable. The norm of H^m is defined by
‖v‖_{H^m} = (Σ_{k=0}^{m} ∫_R |v^{(k)}(x)|² dx)^{1/2}.   (2.67)
Notice that H^0 = L², the space of square-integrable functions. It turns out that a
function v(x) belongs to H^m if and only if its Fourier transform v̂(ξ) satisfies
(∫_R (1 + |ξ|^{2m}) |v̂(ξ)|² dξ)^{1/2} < ∞.   (2.68)
Exercise 2.18 (difficult) Prove it. The proof is based on the fact that differentiation
in the physical domain is replaced by multiplication by iξ in the Fourier domain.
Notice that in the above inequality, we can choose m ∈ R and not only m ∈ N.
Thus (2.68) can be used as a definition for the space H^m with m ∈ R. The space H^m
is still the space of functions with m square-integrable derivatives. Only now the number
of derivatives can be "fractional", or even negative!
Exercise 2.19 Calculate the Fourier transform of the function v defined by v(x) = 1 on
(−1, 1) and v(x) = 0 elsewhere. Using (2.68), show that the function v belongs to every
space H^m with m < 1/2. This means that functions that are piecewise smooth (with
discontinuities) have a little less than half of a derivative in L² (you can differentiate
such functions almost 1/2 times)!
The relationship between (2.67) and (2.68) can be explained as follows:
v(x) is smooth (in the physical domain) ⇐⇒ v̂(ξ) decays fast (in the Fourier domain).
This is another very general and very important property of Fourier transforms. The
equivalence between (2.67) and (2.68) is one manifestation of this "rule".
Main result of the section. Summarizing all of the above, we have obtained the
following result.
Theorem 2.3 Let us assume that P(iξ) and R_∆(ξ) are parabolic, in the sense that
(2.43) and (2.58) are satisfied, and that R_∆(ξ) is of order m, i.e. (2.61) holds. Assuming
that the initial condition u_0 ∈ H^m, we obtain that
‖u(n∆t, x) − U^n(x)‖ ≤ C∆x^m ‖u_0‖_{H^m},   for all 0 ≤ n ≤ N = T/∆t.   (2.69)
We recall that ∆t = λ∆x^M, so that in the above theorem the scheme is of order m in
space and of order m/M in time.
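The rate predicted by the theorem (m = 2 for the explicit Euler scheme applied to the heat equation) can be observed numerically. The sketch below uses a Gaussian initial condition, whose exact evolution is known in closed form, on a truncated domain; the domain size, grid spacings, final time, and choice λ = 0.4 are all illustrative choices:

```python
import numpy as np

def heat_explicit_euler(dx, T, lam=0.4, L=10.0):
    """Explicit Euler for u_t = u_xx with Gaussian data; returns L2 error."""
    x = np.arange(-L, L + dx / 2, dx)
    dt = lam * dx ** 2
    n_steps = int(round(T / dt))
    u = np.exp(-x ** 2 / 2)
    for _ in range(n_steps):
        u[1:-1] = u[1:-1] + lam * (u[2:] - 2 * u[1:-1] + u[:-2])
    t = n_steps * dt
    # exact solution: the Gaussian spreads, variance 1 -> 1 + 2t
    exact = np.exp(-x ** 2 / (2 * (1 + 2 * t))) / np.sqrt(1 + 2 * t)
    return np.sqrt(dx * np.sum((u - exact) ** 2))

e1 = heat_explicit_euler(0.1, T=0.5)
e2 = heat_explicit_euler(0.05, T=0.5)
print("errors:", e1, e2, "ratio:", e1 / e2)  # ratio near 4: order 2 in space
assert 3.0 < e1 / e2 < 5.0
```

Halving ∆x (and so quartering ∆t, since λ is fixed) divides the error by roughly 4, consistent with order m = 2 in space and m/M = 1 in time.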
Exercise 2.20 (difficult) Follow the proof of Theorem 2.3 assuming that (2.43) and
(2.58) are replaced by P(iξ) ≥ 0 and |R_∆(ξ)| ≤ 1. Show that (2.69) should be replaced
by
‖u(n∆t, x) − U^n(x)‖ ≤ C∆x^m ‖u_0‖_{H^{m+M}},   for all 0 ≤ n ≤ N = T/∆t.   (2.70)
Loosely speaking, what this means is that we need the initial condition to be more
regular if the equation is not regularizing. You can compare this with Theorem 2.1: to
obtain a convergence of order 2 in space, we have to assume that the solution u(t, x) is
of class C⁴ in space. This is consistent with the above result with M = 2 and m = 2.
Exercise 2.21 (very difficult) Assuming that (2.58) is replaced by |R_∆(ξ)| ≤ 1, show
that (2.69) should be replaced by
‖u(n∆t, x) − U^n(x)‖ ≤ C∆x^m ‖u_0‖_{H^{m+ε}},   for all 0 ≤ n ≤ N = T/∆t.   (2.71)
Here ε is an arbitrary positive constant. Of course, the constant C depends on ε. What
this result means is that if we lose the parabolic property of the discrete scheme, we
still have the same order of convergence provided that the initial condition is slightly
more regular than being in H^m.
Exercise 2.22 Show that any stable scheme that is convergent of order m is also con-
vergent of order αm for all 0 ≤ α ≤ 1. Deduce that (2.69) in Theorem 2.3 can then be
replaced by
‖u(n∆t, x) − U^n(x)‖ ≤ C∆x^{αm} ‖u_0‖_{H^{αm}},   for all 0 ≤ n ≤ N = T/∆t.   (2.72)
The interest of this result is the following: when the initial data u_0 is not sufficiently
regular to be in H^m but sufficiently regular to be in H^{αm} with 0 < α < 1, we still have
convergence of the finite difference scheme as ∆x → 0; however, the rate of convergence
is slower.
Exercise 2.23 Show that everything we have said in this section holds if we replace
(2.43) and (2.58) by
Re P(iξ) ≥ C(|ξ|^M − 1)   and   |R_∆(ξ)| ≤ 1 − δ|∆xξ|^M + C∆t,   |∆xξ| ≤ γ,
respectively. Here, Re stands for "real part". The above hypotheses are sufficiently
general to cover most practical examples of parabolic equations.
Exercise 2.24 Show that schemes accurate of order m > 0, so that (2.61) holds, and
verifying |R_∆(ξ)| ≤ 1 are stable in the sense that
|R_∆(ξ)| ≤ (1 + C∆x^m ∆t) − δ|∆xξ|^M,   |∆xξ| ≤ γ.
The last exercises show that schemes that are stable in the "classical" sense, i.e., that
verify |R_∆(ξ)| ≤ 1, and are consistent with the smoothing parabolic symbol, are them-
selves smoothing.
2.4 Application to the θ schemes
We are now ready to analyze the family of schemes for the heat equation (i.e. P(iξ) = ξ²)
called the θ schemes.
The θ scheme is defined by
∂_t U^n(x) = θ ∂_x ∂̄_x U^{n+1}(x) + (1 − θ) ∂_x ∂̄_x U^n(x).   (2.73)
Recall that the finite difference operators are defined by (2.4) and (2.5). When θ = 0,
we recover the explicit Euler scheme. For θ = 1, the scheme is called the fully implicit
Euler scheme. For θ = 1/2, the scheme is called the Crank-Nicolson scheme. We recall
that ∆t = λ∆x².
Exercise 2.25 Show that the symbol of the θ scheme is given by
R_∆(ξ) = (1 − 2(1 − θ)λ(1 − cos(∆xξ))) / (1 + 2θλ(1 − cos(∆xξ))).   (2.74)
Let us first consider the stability of the θ method. Let us assume that 0 ≤ θ ≤ 1. We
observe that R_∆(ξ) ≤ 1. The stability requirement is therefore that
min_ξ R_∆(ξ) ≥ −1.
Exercise 2.26 Show that the above inequality holds if and only if
(1 − 2θ)λ ≤ 1/2.
This shows that the method is unconditionally L²-stable when θ ≥ 1/2, i.e. that the
method is stable independently of the choice of λ. When θ < 1/2, stability holds
if and only if
λ ≤ 1/(2(1 − 2θ)).
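These stability conditions can be checked by scanning the symbol (2.74) over wavenumbers; below is a minimal sketch, where the sample values of θ and λ are illustrative choices:

```python
import numpy as np

# Symbol (2.74) of the theta scheme.  Scanning xi*dx over [-pi, pi]
# illustrates Exercise 2.26: min R >= -1 holds iff (1 - 2*theta)*lam <= 1/2.
def R_theta(theta, lam, xi_dx):
    c = 1 - np.cos(xi_dx)
    return (1 - 2 * (1 - theta) * lam * c) / (1 + 2 * theta * lam * c)

xi_dx = np.linspace(-np.pi, np.pi, 4001)
# Crank-Nicolson (theta = 1/2) is stable even for a large lambda:
assert np.min(R_theta(0.5, 50.0, xi_dx)) >= -1.0
# explicit Euler (theta = 0) with lambda = 0.6 violates the bound:
assert np.min(R_theta(0.0, 0.6, xi_dx)) < -1.0
# theta = 0.25: the threshold is lambda = 1/(2*(1 - 2*theta)) = 1
assert np.min(R_theta(0.25, 0.99, xi_dx)) >= -1.0
assert np.min(R_theta(0.25, 1.10, xi_dx)) < -1.0
```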
Let us now consider the accuracy of the θ method. Obviously we need to show that
(2.61) holds for some m > 0.
Exercise 2.27 Show that
e^{−∆tξ²} = 1 − ∆tξ² + (1/2)∆t²ξ⁴ − (1/6)∆t³ξ⁶ + O(∆t⁴ξ⁸),   (2.75)
and that
R_∆(ξ) = 1 − ∆tξ² + ∆t∆x²(1/12 + λθ)ξ⁴ − (∆t∆x⁴/360)(1 + 60λθ + 360λ²θ²)ξ⁶ + O(∆t⁴ξ⁸),   (2.76)
as ∆t → 0 and ∆xξ → 0.
Exercise 2.28 Show that the θ scheme is always at least second-order in space (first-
order in time).
Show that the scheme is of order 4 in space if and only if
θ = 1/2 − 1/(12λ).
Show that the scheme is of order 6 in space if, in addition,
1 + 60λθ + 360λ²θ² = 60λ²,   i.e.,   λ = √5/10.
Implementation of the θ scheme. The above analysis shows that the θ scheme is
unconditionally stable when θ ≥ 1/2. This very nice stability property comes however
at a price: the calculation of U^{n+1} from U^n is no longer explicit.
Exercise 2.29 Show that the θ scheme can be put in the form (2.44) with
b_{−1} = b_1 = −θλ,   b_0 = 1 + 2θλ,   a_{−1} = a_1 = (1 − θ)λ,   a_0 = 1 − 2(1 − θ)λ.
Since the coefficients b_1 and b_{−1} do not vanish, the scheme is no longer explicit. Let us
come back to the solution U^n_j defined on the grid ∆xZ. The operators A_∆ and B_∆ in
(2.45) are now infinite matrices such that
B_∆ U^{n+1} = A_∆ U^n,
where the infinite vector U^n = (U^n_j)_{j∈Z}. In matrix form, we obtain that the operator
B_∆ for the θ scheme is given by the infinite matrix

B_∆ =
[ 1 + 2θλ   −θλ       0         0         0       ]
[ −θλ       1 + 2θλ   −θλ       0         0       ]
[ 0         −θλ       1 + 2θλ   −θλ       0       ]
[ 0         0         −θλ       1 + 2θλ   −θλ     ]
[ 0         0         0         −θλ       1 + 2θλ ]   (2.77)

Similarly, the matrix A_∆ is given by

A_∆ =
[ 1 − 2(1−θ)λ   (1−θ)λ        0             0             0           ]
[ (1−θ)λ        1 − 2(1−θ)λ   (1−θ)λ        0             0           ]
[ 0             (1−θ)λ        1 − 2(1−θ)λ   (1−θ)λ        0           ]
[ 0             0             (1−θ)λ        1 − 2(1−θ)λ   (1−θ)λ      ]
[ 0             0             0             (1−θ)λ        1 − 2(1−θ)λ ]   (2.78)
The vector U^{n+1} is thus given by
U^{n+1} = (B_∆)^{−1} A_∆ U^n.   (2.79)
This implies that in any numerical implementation of the θ scheme where θ > 0, we need
to invert the system B_∆ U^{n+1} = A_∆ U^n (which is usually much faster than constructing
the matrix (B_∆)^{−1} directly). In practice, we obviously cannot use infinite matrices.
However, we can use the N × N matrices above on a given interval of size N∆x. The
above matrices correspond to imposing that the solution vanishes at the boundary of
the interval (Dirichlet boundary conditions).
Exercise 2.30 Implement the θ scheme in Matlab for all possible values of 0 ≤ θ ≤ 1
and λ > 0.
Verify on a few examples with different values of θ < 1/2 that the scheme is unstable
when λ is too large.
Verify numerically (by dividing an initial spatial mesh by a factor 2, then 4) that the
θ scheme is of order 2, 4, and 6 in space for well-chosen values of θ and λ. To obtain
these orders of convergence, make sure that the initial condition is sufficiently smooth.
For non-smooth initial conditions, show that the order of convergence decreases and
compare with the theoretical predictions of Theorem 2.3.
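As one possible realization of the implementation above (in Python rather than Matlab), the sketch below assembles the truncated matrices (2.77)-(2.78) with Dirichlet boundary conditions and solves B_∆ U^{n+1} = A_∆ U^n at each step. The grid and the sine initial condition, an eigenfunction of the discrete Laplacian, are illustrative choices:

```python
import numpy as np

def theta_scheme(u0, lam, theta, n_steps):
    """March the theta scheme by solving B u_{n+1} = A u_n as in (2.79)."""
    N = len(u0)
    main = np.eye(N)
    off = np.eye(N, k=1) + np.eye(N, k=-1)
    B = (1 + 2 * theta * lam) * main - theta * lam * off          # (2.77)
    A = (1 - 2 * (1 - theta) * lam) * main + (1 - theta) * lam * off  # (2.78)
    u = u0.copy()
    for _ in range(n_steps):
        u = np.linalg.solve(B, A @ u)   # invert the system, not the matrix
    return u

x = np.linspace(0, 1, 101)[1:-1]          # interior grid points
u0 = np.sin(np.pi * x)                    # vanishes at the boundary
u = theta_scheme(u0, lam=1.0, theta=0.5, n_steps=50)   # Crank-Nicolson
# exact solution of the heat equation: exp(-pi^2 t) sin(pi x)
dx = x[1] - x[0]
t = 50 * 1.0 * dx ** 2                    # t = n * dt with dt = lam * dx^2
err = np.max(np.abs(u - np.exp(-np.pi ** 2 * t) * np.sin(np.pi * x)))
print("max error:", err)
```

Note that λ = 1 here exceeds the explicit-scheme threshold 1/2; the run stays stable because θ = 1/2.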
3 Finite Differences for Elliptic Equations
The prototype of elliptic equation we consider is the following two-dimensional diffusion
equation with absorption coefficient
−∆u(x) + σu(x) = f(x),   x ∈ R².   (3.1)
Here, x = (x, y) ∈ R², f(x) is a given source term, σ is a given positive absorption
coefficient, and
∆ = ∇² = ∂²/∂x² + ∂²/∂y²
is the Laplace operator. We assume that our domain has no boundaries to simplify. The
above partial differential equation can be analyzed by Fourier transform. In two space
dimensions, the Fourier transform is defined by
[T_{x→ξ} f](ξ) ≡ f̂(ξ) = ∫_{R²} e^{−ix·ξ} f(x) dx.   (3.2)
The inverse Fourier transform is then defined by
[T^{−1}_{ξ→x} f̂](x) ≡ f(x) = (1/(2π)²) ∫_{R²} e^{ix·ξ} f̂(ξ) dξ.   (3.3)
We still verify that the Fourier transform replaces translations and differentiation by
multiplications. More precisely, we have
[Tf(x + y)](ξ) = e^{iy·ξ} f̂(ξ),   [T∇f(x)](ξ) = iξ f̂(ξ).   (3.4)
Notice here that both ∇ and ξ are 2-vectors. This implies that the operator −∆ is
replaced by a multiplication by |ξ|² (check it!). Upon taking the Fourier transform of
both terms of the above equation, we obtain that
(σ + |ξ|²) û(ξ) = f̂(ξ),   ξ ∈ R².   (3.5)
The solution to the diffusion equation is then given in the Fourier domain by
û(ξ) = f̂(ξ) / (σ + |ξ|²).   (3.6)
It then remains to apply the inverse Fourier transform to the above equality to obtain
u(x). Since the inverse Fourier transform of a product is the convolution of the inverse
Fourier transforms (as in 1D), we deduce that
u(x) = [T^{−1}(1/(σ + |ξ|²))](x) ∗ f(x) = C K_0(√σ |x|) ∗ f(x),
where K_0 is the modified Bessel function of order 0 of the second kind and C is a
normalization constant.
This shows again how powerful Fourier analysis is to solve partial differential equa-
tions with constant coefficients. We could define more general equations of the form
P(D)u(x) = f(x),   x ∈ R²,
where P(D) is an operator with symbol P(iξ). In the case of the diffusion equation,
we would have P(iξ) = σ + |ξ|², i.e. P(iξ) is a parabolic symbol of order 2 using
the terminology of the preceding section. All we will see in this section applies to quite
general symbols P(iξ), although we shall consider only P(iξ) = σ + |ξ|² to simplify.
Discretization. How do we solve (3.1) by finite differences? We can certainly replace
the differentials by approximate differentials as we did in 1D. This is the finite difference
approach. Let us consider the grid of points x_{ij} = (i∆x, j∆x) for i, j ∈ Z and ∆x > 0
given. We then define
U_{ij} ≈ u(x_{ij}),   f_{ij} = f(x_{ij}).   (3.7)
Using a second-order scheme, we can then replace (3.1) by
−(∂_x ∂̄_x + ∂_y ∂̄_y) U_{ij} + σU_{ij} = f_{ij},   (i, j) ∈ Z²,   (3.8)
where the operator ∂_x is defined by
∂_x U_{ij} = (U_{i+1,j} − U_{ij})/∆x,   (3.9)
and ∂̄_x, ∂_y, and ∂̄_y are defined similarly.
Assuming as in the preceding section that the finite difference operator is defined in
the whole space R² and not only on the grid {x_{ij}}, we have more generally that
BU(x) = f(x),   x ∈ R²,   (3.10)
where the operator B is a sum of translations defined by
BU(x) = Σ_{β∈I_β} b_β U(x − ∆xβ).   (3.11)
Here, I_β is a finite subset of Z², i.e. β = (β_1, β_2), where β_1 and β_2 are integers. Upon
taking the Fourier transform of (3.11), we obtain
(Σ_{β∈I_β} b_β e^{−i∆xβ·ξ}) Û(ξ) = f̂(ξ).   (3.12)
Defining
R_∆(ξ) = Σ_{β∈I_β} b_β e^{−i∆xβ·ξ},   (3.13)
we then obtain that the finite difference solution is given in the Fourier domain by
Û(ξ) = f̂(ξ) / R_∆(ξ),   (3.14)
and that the difference between the exact and the discrete solutions is given by
û(ξ) − Û(ξ) = (1/P(iξ) − 1/R_∆(ξ)) f̂(ξ).   (3.15)
For the second-order scheme introduced above, we obtain
R_∆(ξ) = σ + (2/∆x²)(2 − cos(∆xξ_x) − cos(∆xξ_y)).   (3.16)
For ∆x|ξ| ≤ π, we deduce from Taylor expansions that
|R_∆(ξ) − P(iξ)| ≤ C∆x²|ξ|⁴.
For ∆x|ξ| ≥ π, we also obtain that
|R_∆(ξ)| ≤ C/∆x² ≤ C∆x²|ξ|⁴,
and
|P(iξ)| ≤ C∆x²|ξ|⁴.
This implies that
|R_∆(ξ) − P(iξ)| ≤ C∆x²|ξ|⁴
for all ξ ∈ R². Since R_∆(ξ) ≥ σ and P(iξ) ≥ C(1 + |ξ|²), we then easily deduce that
|(R_∆(ξ) − P(iξ)) / (R_∆(ξ)P(iξ))| ≤ C∆x²(1 + |ξ|²).   (3.17)
From this, we deduce that
|û(ξ) − Û(ξ)| ≤ C∆x²(1 + |ξ|²)|f̂(ξ)|.
Let us now square the above inequality and integrate. Upon taking the square root of
the integrals, we obtain that
‖û(ξ) − Û(ξ)‖ ≤ C∆x² (∫_{R²} (1 + |ξ|²)² |f̂(ξ)|² dξ)^{1/2}.   (3.18)
The above integral is bounded provided that f(x) ∈ H²(R²), using the definitions (2.67)-
(2.68), which also hold in two space dimensions. From the Parseval equality (equation
(2.53) with √(2π) replaced by 2π in two space dimensions), we deduce that
‖u(x) − U(x)‖ ≤ C∆x² ‖f(x)‖_{H²(R²)}.   (3.19)
This is the final result of this section: provided that the source term is sufficiently
regular (its second-order derivatives are square-integrable), the error made by the finite
difference approximation is of order ∆x². The scheme is indeed second-order.
Exercise 3.1 Notice that the bound in (3.17) would be replaced by C∆x²(1 + |ξ|⁴) if
only the Taylor expansion were used. The reason why (3.17) holds is that the exact
symbol P(iξ) damps high frequencies (i.e. is bounded from below by C(1 + |ξ|²)). Show
that if this damping is not used in the proof, the final result (3.19) should be replaced
by
‖u(x) − U(x)‖ ≤ C∆x² ‖f(x)‖_{H⁴(R²)}.
Exercise 3.2 (moderately difficult) Show that
|(R_∆(ξ) − P(iξ)) / (R_∆(ξ)P(iξ))| ≤ C∆x^m (1 + |ξ|^m),
for all 0 ≤ m ≤ 2. [Hint: Show that (R_∆(ξ))^{−1} and (P(iξ))^{−1} are bounded and separate
|ξ|∆x ≥ 1 and |ξ|∆x ≤ 1.]
Deduce that
‖u(x) − U(x)‖ ≤ C∆x^m ‖f(x)‖_{H^m(R²)},
for 0 ≤ m ≤ 2. We deduce from this result that the error u(x) − U(x) still converges to
0 when f is less regular than H²(R²). However, the convergence is slower and no longer
of order ∆x².
Exercise 3.3 Implement the finite difference algorithm in Matlab on a square of finite
dimension (impose that the solution vanishes at the boundary of the domain). This
implementation is not trivial. It requires constructing and inverting a matrix as was
done in (2.79).
Solve the diffusion problem with a smooth function f and show that the order of con-
vergence is 2. Show that the order of convergence decreases when f is less regular.
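As one possible realization of the exercise above (in Python rather than Matlab), the sketch below assembles the 2D operator with Kronecker products, manufactures a source f from a known exact solution, and checks second-order convergence; the domain (0, 1)², σ = 1, and grid sizes are illustrative choices:

```python
import numpy as np

def solve_elliptic(n, sigma=1.0):
    """-Laplace(u) + sigma*u = f on (0,1)^2, zero Dirichlet, n interior pts.
    Returns the max error against a manufactured exact solution."""
    dx = 1.0 / (n + 1)
    x = np.linspace(dx, 1 - dx, n)
    # 1D second-difference matrix (Dirichlet), 2D operator via Kronecker
    T = (2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / dx ** 2
    I = np.eye(n)
    L = np.kron(I, T) + np.kron(T, I) + sigma * np.eye(n * n)
    X, Y = np.meshgrid(x, x, indexing="ij")
    u_exact = np.sin(np.pi * X) * np.sin(np.pi * Y)
    f = (2 * np.pi ** 2 + sigma) * u_exact     # source matching u_exact
    u = np.linalg.solve(L, f.ravel()).reshape(n, n)
    return np.max(np.abs(u - u_exact))

e1, e2 = solve_elliptic(20), solve_elliptic(40)
print("errors:", e1, e2, "ratio:", e1 / e2)   # ratio near 4: second order
assert 3.0 < e1 / e2 < 5.0
```

Dense matrices keep the sketch short; for fine grids a sparse format and solver would be the practical choice.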
4 Finite Differences for Hyperbolic Equations
After parabolic and elliptic equations, we turn to the third class of important linear equa-
tions: hyperbolic equations. The simplest hyperbolic equation is the one-dimensional
transport equation
∂u/∂t + a ∂u/∂x = 0,   t > 0, x ∈ R,
u(0, x) = u_0(x),   x ∈ R.   (4.1)
Here u_0(x) is the initial condition. When a is constant, the solution to the above
equation is easily found to be
u(t, x) = u_0(x − at).   (4.2)
This means that the initial data is simply translated at speed a. The solution can
also be obtained by Fourier transforms. Indeed we have
∂û/∂t (t, ξ) + iaξ û(t, ξ) = 0,   (4.3)
which gives
û(t, ξ) = û_0(ξ) e^{−iatξ}.
Now since multiplication by a phase in the Fourier domain corresponds to a translation
in the physical domain, we obtain (4.2).
Here is an important remark. Notice that (4.3) can be written as (2.40) with P(iξ) = iaξ. Notice that this symbol does not have the properties imposed in section 2. In particular, the symbol P(iξ) is purely imaginary (i.e. its real part vanishes). This corresponds to a completely different behavior of the PDE solutions: instead of regularizing initial discontinuities, as parabolic equations do, hyperbolic equations translate and possibly modify the shape of discontinuities, but do not regularize them. This behavior will inevitably show up when we discretize hyperbolic equations.
Let us now consider finite difference discretizations of the above transport equation. The framework introduced in section 2 still holds. The single-step finite difference scheme is given by (2.44). After Fourier transforms, it is defined by (2.49). The L² norm of the difference between the exact and approximate solutions is then given by

    ‖u(n∆t, x) − U^n(x)‖ = (1/√(2π)) ‖ε̂^n(ξ)‖ = (1/√(2π)) ‖(e^{−n∆tP(iξ)} − R_∆^n(ξ)) û_0(ξ)‖.
Here, we recall that P(iξ) = iaξ. The mathematical analysis of finite difference schemes for hyperbolic equations is then not very different from that for parabolic equations. Only the results are very different: schemes that may be natural for parabolic equations will no longer be so for hyperbolic equations.
Stability. One of the main steps in controlling the error term ε^n is to show that the scheme is stable. By stability, we mean here that

    |R_∆^n(ξ)| = |R_∆(ξ)|^n ≤ C,   n∆t ≤ T.
Let us consider the stability of classical schemes. The first idea that comes to mind to discretize (4.1) is to replace the time derivative by the forward difference ∂_t^+ and the spatial derivative by the forward difference ∂_x^+. Spelling out the details, this gives

    U_j^{n+1} = U_j^n − (a∆t/∆x)(U_{j+1}^n − U_j^n).
After Fourier transform (and extension of the scheme to the whole line x ∈ R), we deduce that

    R_∆(ξ) = 1 − ν(e^{iξ∆x} − 1) = [1 − ν(cos(ξ∆x) − 1)] − iν sin(ξ∆x),

where we have defined

    ν = a∆t/∆x.   (4.4)
We deduce that

    |R_∆(ξ)|² = 1 + 2ν(1 + ν)(1 − cos(ξ∆x)).

When a > 0, which implies that ν > 0, we deduce that the scheme is always unstable since |R_∆(ξ)|² > 1 for cos(ξ∆x) ≤ 0, for instance. When ν < −1, we again deduce that |R_∆(ξ)|² > 1 since ν(1 + ν) > 0. It remains to consider −1 ≤ ν ≤ 0. There, ν(1 + ν) ≤ 0 and we then deduce that |R_∆(ξ)|² ≤ 1 for all wavenumbers ξ. This implies stability.
So the scheme is stable if and only if −1 ≤ ν ≤ 0. This implies that |a|∆t ≤ ∆x. Again, the time step cannot be chosen too large. This is referred to as a CFL condition.
Notice however, that this also implies that a < 0. This can be understood as follows. In the transport equation, information propagates from the right to the left when a < 0. Since the scheme approximates the spatial derivative by the forward difference ∂_x^+, it is asymmetrical and uses some information to the right (at the point X_{j+1} = X_j + ∆x) of the current point X_j. For a < 0, it gets the correct information where it comes from. When a > 0, the scheme still “looks” for information coming from the right, whereas physically it comes from the left.
This leads to the definition of the upwind scheme, defined by

    U_j^{n+1} = U_j^n − ν(U_j^n − U_{j−1}^n)   when a > 0,
    U_j^{n+1} = U_j^n − ν(U_{j+1}^n − U_j^n)   when a < 0.
    (4.5)
Exercise 4.1 Check that the upwind scheme is stable provided that |a|∆t ≤ ∆x. Implement the scheme in Matlab.
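A minimal Python/NumPy sketch of the upwind scheme (4.5) for a > 0 (the periodic boundary conditions, Gaussian initial bump and parameter values are assumptions made here for the test):

```python
import numpy as np

def upwind_step(U, nu):
    """One step of the upwind scheme (4.5) for a > 0 on a periodic grid:
    U_j^{n+1} = U_j^n - nu*(U_j^n - U_{j-1}^n), with nu = a*dt/dx."""
    return U - nu * (U - np.roll(U, 1))

# Transport a bump at speed a = 1 on [0, 1) with periodic boundary conditions.
N, a, T = 200, 1.0, 0.5
dx = 1.0 / N
dt = 0.8 * dx / a          # CFL satisfied: |a| dt <= dx, so nu = 0.8
nu = a * dt / dx
x = dx * np.arange(N)
U = np.exp(-100 * (x - 0.3)**2)
for _ in range(int(round(T / dt))):
    U = upwind_step(U, nu)
# The exact solution is the bump translated by a*T = 0.5; the scheme is
# stable (no growth of the maximum) but diffusive (first-order).
print(np.max(np.abs(U)))
```

Setting dt above dx/a in the same code makes the solution blow up, illustrating the CFL condition.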
The above result shows that we have to “know” which direction the information comes from to construct our scheme. For one-dimensional problems, this is relatively easy since only the sign of a is involved. In higher dimensions, knowing where the information comes from is much more complicated. A tempting solution to this difficulty is to average over the operators ∂_x^+ and ∂_x^−:

    U_j^{n+1} = U_j^n − (ν/2)(U_{j+1}^n − U_{j−1}^n).
Exercise 4.2 Show that the above scheme is always unstable. Verify the instability in
Matlab.
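The instability can already be seen on the amplification factor; a short numeric check (ν = 0.5 is an arbitrary choice):

```python
import numpy as np

# Amplification factor of the centered scheme
# U^{n+1}_j = U^n_j - (nu/2)(U^n_{j+1} - U^n_{j-1}):
# R(xi) = 1 - i*nu*sin(xi*dx), so |R|^2 = 1 + nu^2 sin^2(xi*dx) > 1
# for most xi, whatever nu != 0: the scheme is unconditionally unstable.
nu = 0.5
theta = np.linspace(-np.pi, np.pi, 401)     # theta = xi*dx
R = 1.0 - 1j * nu * np.sin(theta)
assert np.allclose(np.abs(R)**2, 1 + nu**2 * np.sin(theta)**2)
print(np.max(np.abs(R)**2))                  # > 1: some modes are amplified
```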
A more satisfactory answer to this question was provided by Lax and Wendroff. The Lax-Wendroff scheme is defined by

    U_j^{n+1} = (1/2)ν(1 + ν)U_{j−1}^n + (1 − ν²)U_j^n − (1/2)ν(1 − ν)U_{j+1}^n.   (4.6)
Exercise 4.3 Check that

    |R_∆(ξ)|² = 1 − 4ν²(1 − ν²) sin⁴(ξ∆x/2).

Deduce that the Lax-Wendroff scheme is stable when |ν| ≤ 1.
Implement the Lax-Wendroff scheme in Matlab.
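A numeric sanity check of the formula in Exercise 4.3, obtained by evaluating the amplification factor of (4.6) directly (ν = 0.7 is an arbitrary choice):

```python
import numpy as np

# The Fourier mode e^{i j theta}, theta = xi*dx, is multiplied at each
# Lax-Wendroff step (4.6) by
#   R(theta) = (1/2)nu(1+nu) e^{-i theta} + (1 - nu^2)
#              - (1/2)nu(1-nu) e^{i theta}.
# We check numerically that |R|^2 = 1 - 4 nu^2 (1 - nu^2) sin^4(theta/2).
nu = 0.7
theta = np.linspace(-np.pi, np.pi, 501)
R = (0.5 * nu * (1 + nu) * np.exp(-1j * theta)
     + (1 - nu**2)
     - 0.5 * nu * (1 - nu) * np.exp(1j * theta))
formula = 1 - 4 * nu**2 * (1 - nu**2) * np.sin(theta / 2)**4
print(np.max(np.abs(np.abs(R)**2 - formula)))   # machine-precision agreement
```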
Consistency. We have introduced two stable schemes, the upwind scheme and the Lax-Wendroff scheme. It remains to analyze their convergence properties.
Consistency consists of analyzing the local properties of the discrete scheme and making sure that the scheme captures the main trends of the continuous equation. So assuming that u(n∆t, X_j) is known at time n∆t, we want to see what it becomes under the discrete scheme. For the upwind scheme with a > 0 (to simplify), we want to analyze

    (∂_t^+ + a ∂_x^−) u(T_n, X_j).

The closer the above quantity is to 0 (which it would be if the discrete scheme were replaced by the continuous equation), the higher the order of convergence of the scheme. This error is obtained assuming that the exact solution is sufficiently regular and by using Taylor expansions.
Exercise 4.4 Show that

    (∂_t^+ + a ∂_x^−) u(T_n, X_j) = −(1/2)(1 − ν) a∆x ∂²u/∂x² (T_n, X_j) + O(∆x²).
This shows that the upwind scheme is first-order when ν < 1. Notice that first-order approximations involve second-order derivatives of the exact solution. This is similar to the parabolic case and justifies the following definition in the Fourier domain (remember that a second-order derivative in the physical domain means multiplication by −ξ² in the Fourier domain).
We say that the scheme is convergent of order m if

    |e^{−∆tP(iξ)} − R_∆(ξ)| ≤ C ∆t ∆x^m (1 + |ξ|^{m+1}),   (4.7)

for ∆t and ∆x|ξ| sufficiently small (say ∆x|ξ| ≤ π). Notice that ∆t and ∆x are again related through (4.4).
Notice the parallel with (2.61): we have simply replaced M by 1. The difference between parabolic and hyperbolic equations and schemes does not come from the consistency conditions, but from the stability conditions. For hyperbolic schemes, we do not have (2.57) or (2.58).
Exercise 4.5 Show that the upwind scheme is first-order and the Lax-Wendroff scheme
is second-order.
Proof of convergence. As usual, stability and consistency of the scheme, with sufficient regularity of the exact solution, are the ingredients of the convergence of the scheme. As in the parabolic case, we treat low and high wavenumbers separately. For low wavenumbers (∆x|ξ| ≤ π), we have

    ε̂^n(ξ) = (e^{−∆tP(iξ)} − R_∆(ξ)) Σ_{k=0}^{n−1} e^{−(n−1−k)∆tP(iξ)} R_∆^k(ξ) û_0(ξ).
Here is the main difference with the parabolic case: e^{−(n−1−k)∆tP(iξ)} and R_∆^k(ξ) are bounded but not small for high frequencies (notice that |e^{−n∆tP(iξ)}| = 1 for P(iξ) = iaξ). So using the stability of the scheme, which states that |R_∆^k(ξ)| ≤ 1, we deduce that

    |ε̂^n(ξ)| ≤ n |e^{−∆tP(iξ)} − R_∆(ξ)| |û_0(ξ)| ≤ C ∆x^m (1 + |ξ|^{m+1}) |û_0(ξ)|,

using the order of convergence of the scheme.
For high wavenumbers ∆x|ξ| ≥ π, we deduce that

    |ε̂^n(ξ)| ≤ 2|û_0(ξ)| ≤ C ∆x^m (1 + |ξ|^{m+1}) |û_0(ξ)|.

So the above inequality holds for all frequencies ξ ∈ R. This implies that

    ‖ε̂^n(ξ)‖ ≤ C ∆x^m ( ∫_R (1 + |ξ|^{2m+2}) |û_0(ξ)|² dξ )^{1/2}.
We have then proved the

Theorem 4.1 Let us assume that P(iξ) = iaξ and that R_∆(ξ) is of order m, i.e. (4.7) holds. Assuming that the initial condition u_0 ∈ H^{m+1}, we obtain that

    ‖u(n∆t, x) − U^n(x)‖ ≤ C ∆x^m ‖u_0‖_{H^{m+1}},   for all 0 ≤ n ≤ N = T/∆t.   (4.8)
The main difference with the parabolic case is that u_0 now needs to be in H^{m+1} (and not in H^m) to obtain an accuracy of order ∆x^m. What if u_0 is not in H^{m+1}?
Exercise 4.6 Show that for any scheme of order m, we have

    |e^{−∆tP(iξ)} − R_∆(ξ)| ≤ C ∆t ∆x^{αm} (1 + |ξ|^{α(m+1)}),

for all 0 ≤ α ≤ 1 and |ξ|∆x ≤ π. Show that (4.8) is now replaced by

    ‖u(n∆t, x) − U^n(x)‖ ≤ C ∆x^{αm} ‖u_0‖_{H^{α(m+1)}},   for all 0 ≤ n ≤ N = T/∆t.   (4.9)

Deduce that if u_0 is in H^s for 0 ≤ s ≤ m + 1, the error of convergence of the scheme will be of order sm/(m + 1).
We recall that piecewise constant functions are in the space H^s for all s < 1/2 (take s = 1/2 in the sequel to simplify). Deduce the order of the error of convergence for the upwind scheme and the Lax-Wendroff scheme. Verify this numerically.
5 Spectral methods

5.1 Unbounded grids and semidiscrete FT

The material of this section is borrowed from chapters 1 & 2 in Trefethen [2].
Finite differences were based on approximating ∂/∂x by finite differences of the form ∂_x^+ or ∂_x^− for first-order approximations, or

    (1/2)(∂_x^+ + ∂_x^−)

for a second-order approximation, and so on for higher-order approximations.
Spectral methods are based on using an approximation of very high order for the partial derivative. In matrix form, the second-order approximation given by (1.2) and the fourth-order approximation given by (1.3) in [2] are replaced in the “spectral” approximation by the “infinite-order” matrix in (1.4).
How is this done? The main ingredient is “interpolation”. We interpolate a set of values of a function at the points of a grid by a (smooth) function defined everywhere. We can then obviously differentiate this function and take the values of the differentiated function at the grid points. This is how derivatives are approximated in spectral methods and how (1.4) in [2] is obtained.
Let us be more specific and consider first the infinite grid hZ with grid points x_j = jh for j ∈ Z and h > 0 given.
Let us assume that we are given the values v_j of a certain function v(x) at the grid points, i.e. v(x_j) = v_j. What we want is to define a function p(x) such that p(x_j) = v_j. This function p(x) will then be an interpolant of the series {v_j}. To do so we define the semidiscrete Fourier transform as

    v̂(k) = h Σ_{j=−∞}^{∞} e^{−ikx_j} v_j,   k ∈ [−π/h, π/h].   (5.1)
It turns out that this transform can be inverted. We define the inverse semidiscrete Fourier transform as

    v_j = (1/(2π)) ∫_{−π/h}^{π/h} e^{ikx_j} v̂(k) dk.   (5.2)

The above definitions are nothing but the familiar Fourier series. Here, the v_j are the Fourier coefficients of the periodic function v̂(k) on [−π/h, π/h].
Exercise 5.1 Check that v̂(k) in (5.1) is indeed periodic of period 2π/h.
It is then well known that the function v̂(k) can be reconstructed from its Fourier coefficients; this is (5.1).
Notice now that the inverse semidiscrete Fourier transform (a.k.a. Fourier series) in (5.2) is defined for every x_j of the grid hZ, but could be defined more generally for every point x ∈ R. This is how we define our interpolant, by simply replacing x_j in (5.2) by x:

    p(x) = (1/(2π)) ∫_{−π/h}^{π/h} e^{ikx} v̂(k) dk.   (5.3)
The function p(x) is a very nice function. It can be shown that it is analytic (which means that the infinite Taylor expansion of p(z) in the vicinity of every complex number z converges to p; so the function p is infinitely many times differentiable), and by definition it satisfies p(x_j) = v_j.
Now assume that we want to define w_j, an approximation of the derivative of v given by the v_j at the grid points x_j. The spectral approximation simply consists of defining

    w_j = p′(x_j),   (5.4)

where p(x) is defined by (5.3). Notice that w_j a priori depends on all the values v_j. This is why the matrix (1.4) in [2] is infinite and full.
How do we calculate this matrix? Here we use the linearity of differentiation. If

    v_j = v_j^{(1)} + v_j^{(2)},

then we clearly have that

    w_j = w_j^{(1)} + w_j^{(2)}

for the spectral derivatives. So all we need to do is to consider the derivative of the sequence v_j such that v_n = 1 for some n ∈ Z and v_m = 0 for m ≠ n. We then obtain that

    v̂_n(k) = h e^{−inkh},

and that

    p_n(x) = (h/(2π)) ∫_{−π/h}^{π/h} e^{ik(x−nh)} dk = S_h(x − nh),

where S_h(x) is the sinc function

    S_h(x) = sin(πx/h) / (πx/h).   (5.5)
More generally, we obtain that the interpolant p(x) is given by

    p(x) = Σ_{j=−∞}^{∞} v_j S_h(x − jh).   (5.6)

An easy calculation shows that

    S_h′(jh) = 0 for j = 0,   S_h′(jh) = (−1)^j/(jh) for j ≠ 0.   (5.7)
Exercise 5.2 Check this.
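Formula (5.7) can be checked numerically; the sketch below (in Python, using numpy's normalized sinc) compares a centered finite-difference derivative of S_h at the grid points with the closed-form values:

```python
import numpy as np

def S(x, h):
    """Sinc basis function S_h(x) = sin(pi x / h)/(pi x / h)."""
    return np.sinc(x / h)   # numpy's sinc(t) is sin(pi t)/(pi t)

# Compare a centered finite-difference derivative of S_h at x = jh
# with the closed-form values of (5.7): 0 at j = 0, (-1)^j/(jh) otherwise.
h, eps = 0.5, 1e-6
for j in range(-4, 5):
    num = (S(j * h + eps, h) - S(j * h - eps, h)) / (2 * eps)
    exact = 0.0 if j == 0 else (-1.0)**j / (j * h)
    assert abs(num - exact) < 1e-6, (j, num, exact)
print("S_h'(jh) matches (5.7)")
```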
This implies that

    p_n′(x_j) = 0 for j = n,   p_n′(x_j) = (−1)^{j−n}/((j − n)h) for j ≠ n.

These are precisely the values of the entries (n, j) of the matrix in (1.4) in [2].
Notice that the same technique can be used to define approximations to derivatives
of arbitrary order. Once we have the interpolant p(x), we can differentiate it as many
times as necessary. For instance the matrix corresponding to second-order differentiation
is given in (2.14) of [2].
5.2 Periodic grids and Discrete FT

The simplest way to replace the infinite-dimensional matrices encountered in the preceding section is to assume that the initial function v is periodic. To simplify, we take it 2π-periodic. The function is then represented by its values at the points x_j = 2πj/N for some N ∈ N, so that h = 2π/N. We also assume that N is even. Odd values of N are treated similarly, but the formulas are slightly different.
We know that periodic functions are represented by Fourier series. Here, the function is not only periodic, but also only given on the grid (x_1, . . . , x_N). So it turns out that its “Fourier series” is finite. This is how we define the Discrete Fourier transform

    v̂_k = h Σ_{j=1}^{N} e^{−ikx_j} v_j,   k = −N/2 + 1, . . . , N/2.   (5.8)
The Inverse Discrete Fourier Transform is then given by

    v_j = (1/(2π)) Σ_{k=−N/2+1}^{N/2} e^{ikx_j} v̂_k,   j = 1, . . . , N.   (5.9)

The latter is recast as

    v_j = (1/(2π)) Σ′_{k=−N/2}^{N/2} e^{ikx_j} v̂_k,   j = 1, . . . , N.   (5.10)

Here we define v̂_{−N/2} = v̂_{N/2}, and Σ′ means that the terms k = ±N/2 are multiplied by 1/2. The latter definition is more symmetrical than the former, although both are obviously equivalent. The reason we introduce (5.10) is that the two definitions do not yield the same interpolant, and the interpolant based on (5.10) is more symmetrical.
As we did for infinite grids, we can extend (5.10) to arbitrary values of x and not only the values on the grid. The discrete interpolant is now defined by

    p(x) = (1/(2π)) Σ′_{k=−N/2}^{N/2} e^{ikx} v̂_k,   x ∈ [0, 2π].   (5.11)
Notice that this interpolant is obviously 2π-periodic, and is a trigonometric polynomial of order (at most) N/2.
We can now easily obtain a spectral approximation of v′(x) at the grid points: we simply differentiate p(x) and take the values of the derivative at the grid points. Again, this corresponds to

    w_j = p′(x_j),   j = 1, . . . , N.
Exercise 5.3 Calculate the N × N matrix D_N which maps the values v_j of the function v(x) to the values w_j = p′(x_j). The result of the calculation is given on p. 21 in [2].
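For reference, a Python sketch of the matrix the exercise asks for, using the closed-form entries stated on p. 21 of [2] ((1/2)(−1)^{i−j} cot((i−j)h/2) off the diagonal, 0 on it); applying it to sin(x) should return cos(x) at the grid points:

```python
import numpy as np

# Spectral differentiation matrix D_N for the 2*pi-periodic grid x_j = jh,
# h = 2*pi/N (N even), from the closed-form entries on p. 21 of [2].
N = 16
h = 2 * np.pi / N
x = h * np.arange(1, N + 1)
D = np.zeros((N, N))
for i in range(N):
    for j in range(N):
        if i != j:
            D[i, j] = 0.5 * (-1.0)**(i - j) / np.tan((i - j) * h / 2)

# D_N differentiates low-degree trigonometric polynomials exactly:
w = D @ np.sin(x)
print(np.max(np.abs(w - np.cos(x))))   # near machine precision
```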
5.3 Fast Fourier Transform (FFT)

In the preceding subsections, we have seen how to calculate the spectral derivatives w_j at the grid points of a function defined by its values v_j at the same grid points. We have also introduced matrices that map v_j to w_j. Our derivation was done in the physical domain, in the sense that we have constructed the interpolant p(x), taken its derivative p′, and evaluated the derivative p′(x_j) at the grid points. Another way to consider the derivation is to see what it means to “take a derivative” in the Fourier domain. This is what the FFT is based on.
Remember that (classical) Fourier transforms replace derivation by multiplication by iξ; see (2.27). The interpolations of the semidiscrete and discrete Fourier transforms have very similar properties. For instance we deduce from (5.3) that

    p′(x) = (1/(2π)) ∫_{−π/h}^{π/h} e^{ikx} ik v̂(k) dk.

So again, in the Fourier domain, differentiation really means replacing v̂(k) by ik v̂(k). For the discrete Fourier transform, we deduce from (5.11) that

    p′(x) = (1/(2π)) Σ′_{k=−N/2}^{N/2} e^{ikx} ik v̂_k.
Notice that by definition, we have v̂_{−N/2} = v̂_{N/2}, so that i(−N/2)v̂_{−N/2} + i(N/2)v̂_{N/2} = 0. In other words, we have

    p′(x) = (1/(2π)) Σ_{k=−N/2+1}^{N/2−1} e^{ikx} ik v̂_k.

This means that the Fourier coefficients v̂_k are indeed replaced by ik v̂_k when we differentiate p(x), except for the value k = N/2, for which the Fourier coefficient vanishes.
So, to calculate w_j from v_j in the periodic case, we have the following procedure:
1. Calculate v̂_k from v_j for −N/2 + 1 ≤ k ≤ N/2.
2. Define ŵ_k = ik v̂_k for −N/2 + 1 ≤ k ≤ N/2 − 1 and ŵ_{N/2} = 0.
3. Calculate w_j from ŵ_k by Inverse Discrete Fourier transform.
Said in other words, the above procedure is the discrete analog of the formula

    f′(x) = F⁻¹_{ξ→x} [ iξ F_{x→ξ}[f(x)] ].

On a computer, the multiplication by ik (step 2) is fast. The main computational difficulties come from the implementation of the discrete Fourier transform and its inversion (steps 1 and 3). A very efficient implementation of steps 1 and 3 is called the FFT, the Fast Fourier Transform. It is an algorithm that performs steps 1 and 3 in O(N log N) operations. This has to be compared with a number of operations of order O(N²) if the formulas (5.8) and (5.9) are used directly.
Exercise 5.4 Check the order O(N²).
We are not going to describe how the FFT works; let us merely state that it is a very convenient alternative to constructing the matrix D_N to calculate the spectral derivative. We refer to [2], Programs 4 and 5, for details.
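Steps 1-3 above can be sketched with numpy's FFT on a 2π-periodic grid (a Python stand-in for the Matlab programs of [2]; the test function e^{sin x} is a choice made here):

```python
import numpy as np

def spectral_derivative(v):
    """Steps 1-3: transform, multiply the k-th coefficient by ik (zeroing
    the k = N/2 mode), transform back. N = v.size must be even."""
    N = v.size
    k = np.fft.fftfreq(N, d=1.0 / N)   # wavenumbers 0,...,N/2-1,-N/2,...,-1
    k[N // 2] = 0                       # step 2: zero the Nyquist coefficient
    return np.real(np.fft.ifft(1j * k * np.fft.fft(v)))

N = 64
x = 2 * np.pi * np.arange(N) / N
w = spectral_derivative(np.exp(np.sin(x)))
print(np.max(np.abs(w - np.cos(x) * np.exp(np.sin(x)))))   # spectrally small
```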
5.4 Spectral approximation for unbounded grids

We have introduced several interpolants to obtain approximations of the derivatives of functions defined by their values on a (discrete) grid. It remains to know what sort of error one makes by introducing such interpolants. Chapter 4 in [2] gives some answers to this question. Here we shall give a slightly different answer based on the Sobolev scale introduced in the analysis of finite differences.
Let u(x) be a function defined for x ∈ R and define v_j = u(x_j). The Fourier transform of u(x) is denoted by û(k). We then define the semidiscrete Fourier transform v̂(k) by (5.1) and the interpolant p(x) by (5.3).
The question is then: what is the error between u(x) and p(x)? The answer relies on one thing: how regular u(x) is. The analysis of the error is based on estimates in the Fourier domain. Smooth functions u(x) correspond to functions û(k) that decay fast in k.
The first theoretical ingredient, which is important in its own right, is the aliasing formula, also called the Poisson summation formula.

Theorem 5.1 Let u ∈ L²(R) be sufficiently smooth. Then for all k ∈ [−π/h, π/h],

    v̂(k) = Σ_{j=−∞}^{∞} û(k + 2πj/h).   (5.12)

In other words the aliasing formula relates the discrete Fourier transform v̂(k) to the continuous Fourier transform û(k).
Derivation of the aliasing formula. We write

    v̂(k) = h Σ_{j=−∞}^{∞} e^{−ikhj} v_j = Σ_{j=−∞}^{∞} ∫_R û(k′) e^{ihj(k′−k)} (h/(2π)) dk′ = ∫_R û(k + 2πl/h) Σ_{j=−∞}^{∞} e^{i2πlj} dl,

using the change of variables h(k′ − k) = 2πl, so that h dk′ = 2π dl. We will show that

    Σ_{j=−∞}^{∞} e^{i2πlj} = Σ_{j=−∞}^{∞} δ(l − j).   (5.13)

Assuming that this equality holds, we then obtain that

    v̂(k) = ∫_R û(k + 2πl/h) Σ_{j=−∞}^{∞} δ(l − j) dl = Σ_{j=−∞}^{∞} û(k + 2πj/h),

which is what was to be proved. It thus remains to show (5.13). Here is a simple explanation. Take the Fourier series of a 1-periodic function f(x):

    c_n = ∫_{−1/2}^{1/2} e^{−i2πnx} f(x) dx.

Then the inversion is given by

    f(x) = Σ_{n=−∞}^{∞} e^{i2πnx} c_n.
Now take f(x) to be the delta function f(x) = δ(x). We then deduce that c_n = 1 and, for x ∈ (−1/2, 1/2),

    Σ_{n=−∞}^{∞} e^{i2πnx} = δ(x).

Let us now extend both sides of the above relation by periodicity to the whole line R. The left-hand side is 1-periodic in x and does not change. The right-hand side now takes the form

    Σ_{j=−∞}^{∞} δ(x − j).

This is nothing but the periodic delta function. This proves (5.13).
Now that we have the aliasing formula, we can analyze the difference u(x) − p(x). We’ll do it in the L² sense, since

    ‖u(x) − p(x)‖_{L²(R)} = (1/√(2π)) ‖û(k) − p̂(k)‖_{L²(R)}.   (5.14)

We now show the following result:
Theorem 5.2 Let us assume that u(x) ∈ H^m(R) for m > 1/2. Then we have

    ‖u(x) − p(x)‖_{L²(R)} ≤ C h^m ‖u‖_{H^m(R)},   (5.15)

where the constant C is independent of h and u.
Proof. We first realize that

    p̂(k) = χ(k) v̂(k)

by construction, where χ(k), equal to 1 if k ∈ [−π/h, π/h] and to 0 otherwise, is the characteristic function of the interval [−π/h, π/h]. In other words, p(x) is band-limited, i.e. has no high frequencies. We thus deduce that

    ∫_R |û(k) − p̂(k)|² dk = ∫_R (1 − χ(k)) |û(k)|² dk + ∫_{−π/h}^{π/h} | Σ_{j∈Z*} û(k + 2πj/h) |² dk,

using (5.12). We have denoted by Z* = Z \ {0}. Looking carefully at the above terms, one realizes that the above error only involves û(k) for values of k outside the interval [−π/h, π/h]. In other words, if û(k) has compact support inside [−π/h, π/h], then p̂(k) = û(k), hence p(x) = u(x) and the interpolant is exact. When û(k) does not vanish outside [−π/h, π/h], i.e. when high frequency content is present, we have to use the regularity of the function u(x) to control the above term. This is what we now do. The first term is easy to deal with. Indeed we have

    ∫_R (1 − χ(k)) |û(k)|² dk ≤ h^{2m} ∫_R (1 + k^{2m}) |û(k)|² dk ≤ h^{2m} ‖u‖²_{H^m(R)}.
This is simply based on realizing that |hk| ≥ π when χ(k) = 0. The second term is a little more painful. The Cauchy-Schwarz inequality tells us that

    | Σ_j a_j b_j |² ≤ Σ_j |a_j|² Σ_j |b_j|².
This is a mere generalization of |x · y| ≤ |x| |y|. Let us choose

    a_j = |û(k + 2πj/h)| (1 + |k + 2πj/h|)^m,   b_j = (1 + |k + 2πj/h|)^{−m}.
We thus obtain that

    | Σ_{j∈Z*} û(k + 2πj/h) |² ≤ Σ_{j∈Z*} |û(k + 2πj/h)|² (1 + |k + 2πj/h|)^{2m} · Σ_{j∈Z*} (1 + |k + 2πj/h|)^{−2m}.
Notice that

    Σ_{j∈Z*} (1 + |k + 2πj/h|)^{−2m} ≤ h^{2m} Σ_{j∈Z*} 1/(2π(1 + |j|))^{2m},   for |k| ≤ π/h.

For m > 1/2, the above sum is convergent, so that

    Σ_{j∈Z*} (1 + |k + 2πj/h|)^{−2m} ≤ C h^{2m},
where C depends on m but not on u or h. This implies that

    ∫_{−π/h}^{π/h} | Σ_{j∈Z*} û(k + 2πj/h) |² dk ≤ C h^{2m} ∫_{−π/h}^{π/h} Σ_{j∈Z*} |û(k + 2πj/h)|² (1 + |k + 2πj/h|)^{2m} dk ≤ C h^{2m} ∫_R |û(k)|² (1 + |k|)^{2m} dk = C h^{2m} ‖u‖²_{H^m}.

This concludes the proof of the theorem.
Let us emphasize something we saw in the course of the above proof. When u(x) is band-limited, so that û(k) = 0 outside [−π/h, π/h], the interpolant is exact; namely p(x) = u(x). Moreover, we have seen that the interpolant is determined by the values v_j = u(x_j).
More precisely, we saw that

    p(x) = Σ_{n=−∞}^{∞} v_n S_h(x − nh),

where S_h is the sinc function defined in (5.5). This proves the extremely important Shannon-Whittaker sampling theorem:
Theorem 5.3 Let us assume that the function u(x) is such that û(k) = 0 for k outside [−π/h, π/h]. Then u(x) is completely determined by its values v_j = u(hj) for j ∈ Z. More precisely, we have

    u(x) = Σ_{j=−∞}^{∞} u(hj) S_h(x − jh),   S_h(x) = sin(πx/h)/(πx/h).   (5.16)
This result has quite important applications in communication theory. Assuming that a signal has no high frequencies (such as in speech, since the human ear cannot detect frequencies above roughly 20 kHz), there is an adapted sampling value h such that the signal can be reconstructed from the values it takes on the grid hZ. This means that a continuous signal can be compressed onto a discrete grid with no error. When h is chosen too large, so that frequencies above π/h are present in the signal, the interpolant is not exact, and the compression of the signal has some error. This error is precisely called aliasing error and is quantified by Theorem 5.1.
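A small Python illustration of Theorem 5.3: a function built from finitely many shifted sinc functions is band-limited to [−π/h, π/h], and the sampling series recovers it exactly at arbitrary points (the coefficients and the value of h are arbitrary choices made here):

```python
import numpy as np

# u is a finite combination of shifted S_h's, hence band-limited; its
# samples u(jh) pick out the coefficients, and the sampling series (5.16)
# reconstructs it exactly at every x.
h = 0.25
S = lambda x: np.sinc(x / h)            # numpy sinc(t) = sin(pi t)/(pi t)
u = lambda x: 2.0 * S(x) - 1.5 * S(x - h) + 0.5 * S(x - 3 * h)

x = np.linspace(-2, 2, 1001)            # arbitrary evaluation points
js = np.arange(-40, 41)                 # enough samples for the terms used
recon = sum(u(j * h) * S(x - j * h) for j in js)
print(np.max(np.abs(recon - u(x))))     # near machine precision
```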
Let us finish this section by an analysis of the error between the derivatives of u(x) and those of p(x). Remember that our initial objective is to obtain an approximation of u′(x_j) by using p′(x_j). Minor modifications in the proof of Theorem 5.2 yield the following result:
Theorem 5.4 Let us assume that u(x) ∈ H^m(R) for m > 1/2 and let n ≤ m. Then we have

    ‖u^{(n)}(x) − p^{(n)}(x)‖_{L²(R)} ≤ C h^{m−n} ‖u‖_{H^m(R)},   (5.17)

where the constant C is independent of h and u.
The proof of this theorem consists of realizing that

    ‖u^{(n)}(x) − p^{(n)}(x)‖_{L²(R)} = (1/√(2π)) ‖(ik)^n (û(k) − p̂(k))‖_{L²(R)}.
Moreover we have

    ∫_R |k|^{2n} |û(k) − p̂(k)|² dk = ∫_R |k|^{2n} (1 − χ(k)) |û(k)|² dk + ∫_{−π/h}^{π/h} |k|^{2n} | Σ_{j∈Z*} û(k + 2πj/h) |² dk.
The first term on the right hand side is bounded by

    h^{2(m−n)} ∫ (1 + k^{2m}) |û(k)|² dk = ( h^{m−n} ‖u‖_{H^m} )².

The second term is bounded by

    h^{−2n} ∫_{−π/h}^{π/h} | Σ_{j∈Z*} û(k + 2πj/h) |² dk,

which is bounded, as we saw in the proof of Theorem 5.2, by h^{−2n} h^{2m} ‖u‖²_{H^m}. This proves the theorem.
This result is important for us: it shows that the error one makes by approximating an n-th order derivative of u(x) by p^{(n)}(x) is of order h^{m−n}, where m is the H^m-regularity of u. So when u is very smooth, p^{(n)}(x) converges extremely fast to u^{(n)}(x) as h → 0. This is the greatest advantage of spectral methods.
5.5 Application to PDE’s

An elliptic equation. Consider the simplest of elliptic equations

    −u′′(x) + σu(x) = f(x),   x ∈ R,   (5.18)

where the absorption σ > 0 is a constant positive parameter. Define α = √σ. We verify that

    G(x) = (1/(2α)) e^{−α|x|}   (5.19)

is the Green function of the above problem, whose solution is thus

    u(x) = ∫_R G(x − y) f(y) dy = (1/(2α)) ∫_R e^{−α|x−y|} f(y) dy.   (5.20)
Note that this may also be obtained in the Fourier domain, where

    û(ξ) = f̂(ξ) / (σ + ξ²).   (5.21)

Discretizing the above equation using the second-order finite difference method:

    −(U(x + h) + U(x − h) − 2U(x))/h² + σU(x) = f(x),   (5.22)

we obtain as in (3.19) in section 3 an error estimate of the form

    ‖u − U‖_{L²} ≤ C h² ‖f‖_{H²}.   (5.23)
Let us now compare this result to the discretization based on the spectral method. Spectral methods are based on replacing functions by their spectral interpolants. Let p(x) be the spectral interpolant of u(x) and F(x) the spectral interpolant of the source term f(x). Now p(x) and F(x) only depend on the values they take at hj for j ∈ Z. We may now calculate the second-order derivative of p(x) (see the following paragraph for an explicit expression). The equation we now use to deduce p(x) from F(x) is the equation (5.18):

    −p′′(x) + σp(x) = F(x),   x ∈ R.

Note that this problem may be solved by inverting a matrix of the form −D² + σI, where D is the differentiation matrix. The error u − p now solves the equation

    −(u − p)′′(x) + σ(u − p)(x) = (f − F)(x).
Looking at this in the Fourier domain (and coming back) we easily deduce that

    ‖u − p‖_{H²(R)} ≤ C ‖f − F‖_{L²(R)} ≤ C h^m ‖f‖_{H^m(R)},   (5.24)

where the latter inequality comes from Theorem 5.2. Here m is arbitrary provided that f is arbitrarily smooth. We thus obtain that the spectral method has an “infinite” order of accuracy, unlike finite difference methods. When the source terms and the solutions of the PDEs we are interested in are smooth (spatially), the convergence properties of spectral methods are much better than those of finite difference methods.
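A Python sketch of this spectral solve in the simplest setting where it is easy to test, namely a 2π-periodic grid (an assumption made here; the text works on the whole line): in the discrete Fourier domain, −p″ + σp = F becomes (σ + k²)p̂_k = F̂_k, a division mode by mode.

```python
import numpy as np

def spectral_elliptic_solve(F, sigma):
    """Solve -p'' + sigma*p = F on a 2*pi-periodic grid of N points by
    dividing each discrete Fourier coefficient by sigma + k^2."""
    N = F.size
    k = np.fft.fftfreq(N, d=1.0 / N)
    return np.real(np.fft.ifft(np.fft.fft(F) / (sigma + k**2)))

# Manufactured test: with u(x) = exp(sin x), one has u'' = (cos^2 x - sin x) u,
# so f = -u'' + sigma*u is known analytically; the spectral solution should
# match u to near machine precision for smooth data.
N, sigma = 64, 1.0
x = 2 * np.pi * np.arange(N) / N
u = np.exp(np.sin(x))
f = -(np.cos(x)**2 - np.sin(x)) * u + sigma * u
p = spectral_elliptic_solve(f, sigma)
print(np.max(np.abs(p - u)))
```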
A parabolic equation. Let us now apply the above theory to the computation of solutions of the heat equation

    ∂u/∂t − ∂²u/∂x² = 0,

on the real line R. We introduce U_j^n, the approximate solution at time T_n = n∆t, 1 ≤ n ≤ N, and point x_j = hj for j ∈ Z.
Let us consider the Euler explicit scheme based on a spectral approximation of the spatial derivatives. The interpolant at time T_n is given by

    p^n(x) = Σ_{j=−∞}^{∞} U_j^n S_h(x − jh),

so that the approximation to the second derivative is given by

    (p^n)′′(kh) = Σ_{j=−∞}^{∞} S_h′′((k − j)h) U_j^n = Σ_{j=−∞}^{∞} S_h′′(jh) U_{k−j}^n.
So the Euler explicit scheme is given by

    (U_j^{n+1} − U_j^n)/∆t = Σ_{k=−∞}^{∞} S_h′′(kh) U_{j−k}^n.   (5.25)
It now remains to see how this scheme behaves. We shall only consider stability here (i.e. see how ∆t and h have to be chosen so that the discrete solution does not blow up as n → T/∆t). As far as convergence is concerned, the scheme will be first order in time (accuracy of order ∆t) and will be of spectral accuracy (infinite order) provided that the initial solution is sufficiently smooth. This is plausible from the analysis of the error made by replacing a function by its interpolant that we made earlier.
Let us come back to the stability issue. As we did for finite differences, we realize that we can look at a scheme defined for all x ∈ R and not only x = x_j:

    (U^{n+1}(x) − U^n(x))/∆t = Σ_{k=−∞}^{∞} S_h′′(kh) U^n(x − hk).   (5.26)
Once we have realized this, we can pass to the Fourier domain and obtain that

    Û^{n+1}(ξ) = ( 1 + ∆t Σ_{k=−∞}^{∞} S_h′′(kh) e^{−ikhξ} ) Û^n(ξ).   (5.27)

So the stability of the scheme boils down to making sure that

    R(ξ) = 1 + ∆t Σ_{k=−∞}^{∞} S_h′′(kh) e^{−ikhξ}   (5.28)

stays between −1 and 1 for all frequencies ξ.
The analysis is more complicated than for finite differences because of the infinite sum that appears in the definition of R. Here is a way to analyze R(ξ). We first realize that

    S_h′′(kh) = −π²/(3h²) for k = 0,   S_h′′(kh) = 2(−1)^{k+1}/(h²k²) for k ≠ 0.
Exercise 5.5 Check this.
So we can recast R(ξ) as

    R(ξ) = 1 − ∆t π²/(3h²) − (2∆t/h²) Σ_{k≠0} e^{i(hξ−π)k}/k².
Exercise 5.6 Check this.
We now need to understand the remaining infinite sum. It turns out that we can have access to an exact formula. It is based on using the Poisson formula

    Σ_j e^{i2πlj} = Σ_j δ(l − j),

which we recast for l ∈ [−1/2, 1/2] as

    Σ_{j≠0} e^{i2πlj} = δ(l) − 1.
Notice that both terms have zero mean on [−1/2, 1/2] (i.e. there is no 0 frequency). We now integrate these terms in l. We obtain that

    Σ_{j≠0} e^{i2πlj}/(i2πj) = H(l) − l − d_0,

where H(l) is the Heaviside function, such that H(l) = 1 for l > 0 and H(l) = 0 for l < 0. The constant d_0 is now fixed by making sure that the above right-hand side remains 1-periodic (as is the left-hand side). Another way of looking at it is to make sure that both sides have zero mean on (−1/2, 1/2). We thus find that d_0 = 1/2.
Exercise 5.7 Check this.
Let us integrate one more time. We now obtain that

    Σ_{j≠0} e^{i2πlj}/(−4π²j²) = lH(l) − l²/2 − l/2 − d_1,

where d_1 is again chosen so that the right-hand side has zero mean. We obtain that d_1 = 1/12.
Exercise 5.8 Check this.
To sum up our calculation, we have obtained that

    Σ_{j≠0} e^{i2πlj}/j² = 4π² ( l²/2 + 1/12 − l(H(l) − 1/2) ).
This shows that

    R(ξ) = 1 − (∆t/h²) [ π²/3 + 8π² ( l²/2 + 1/12 − l(H(l) − 1/2) ) ],   l = hξ/(2π) − 1/2.   (5.29)
A plot of the function l → l²/2 + 1/12 − l(H(l) − 1/2) is given in Fig. 5.1. Notice that this function is 1-periodic and continuous. This is justified as follows: the delta function is concentrated at one point, its integral is a piecewise linear function (H(l) − l − 1/2) and is discontinuous, and its second integral is continuous (and piecewise quadratic).
Exercise 5.9 Check that l(H(l) − 1/2) = |l|/2. Deduce that

    R(ξ) = 1 − (∆t/h²) π² (1 − 4|l|(1 − |l|)) = 1 − (∆t/h²) π² (1 − 2|l|)²,

so that

    R(ξ) = 1 − (∆t/h²) h²ξ² = 1 − ∆t ξ²,   0 ≤ |ξ| ≤ π/h.

[Check the above for hξ ≤ π and use (5.28) to show that R(ξ) = R(−ξ) and that R(ξ) is (2π/h)-periodic.]
[Figure 5.1 here: a plot titled “Second periodic integral of the periodic delta function”, showing f(l) on [−1/2, 1/2].]
Figure 5.1: Plot of the function l → f(l) = l²/2 + 1/12 − l(H(l) − 1/2). The maximal value is 1/12 and the minimal value is −1/24.
Since R(ξ) is a periodized version of the symbol 1 − ∆tξ², we deduce that

    1 − π² ∆t/h² ≤ R(ξ) ≤ 1,

and that the minimal value is reached for l = 0, i.e. for frequencies

    ξ = (2π/h)(m + 1/2),   m ∈ Z,

and that the maximum is reached for l = ±1/2, i.e. for frequencies

    ξ = 2πm/h,   m ∈ Z.
Notice that R(0) = 1 as it should be, and that R(π/h) = 1 − π²∆t/h², which is a quite bad approximation of e^{−∆tπ²/h²} for the exact solution. But as usual, we do not expect frequencies of order h^{−1} to be well approximated on a grid of size h.
Since we need |R(ξ)| ≤ 1 for all frequencies to obtain a stable scheme, we see that

    λ = ∆t/h² ≤ 2/π²   (5.30)

is necessary to obtain a convergent scheme as h → 0 with λ fixed.
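A periodic Python analogue of this stability analysis (an assumption made here: we use an FFT-based second derivative on a 2π-periodic grid, for which the most dangerous frequency is also |ξ| = π/h): running the explicit Euler iteration with λ slightly below and slightly above 2/π² shows bounded versus exploding solutions.

```python
import numpy as np

def run(lam, N=64, steps=400):
    """Explicit Euler for u_t = u_xx on a 2*pi-periodic grid with an
    FFT-based second derivative; lam = dt/h^2. The most unstable mode has
    amplification factor 1 - dt*(pi/h)^2, so the iteration stays bounded
    for lam <= 2/pi^2 and blows up above that threshold."""
    h = 2 * np.pi / N
    dt = lam * h**2
    k = np.fft.fftfreq(N, d=1.0 / N)
    U = np.random.RandomState(0).randn(N)   # rough data excites all modes
    for _ in range(steps):
        U = U + dt * np.real(np.fft.ifft(-(k**2) * np.fft.fft(U)))
    return np.max(np.abs(U))

print(run(0.9 * 2 / np.pi**2), run(1.1 * 2 / np.pi**2))  # bounded vs. blown up
```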
6 Introduction to finite element methods
We now turn to one of the most useful methods in numerical simulations: the finite
element method. This will be very introductory. The material mostly follows the
presentation in [1].
6.1 Elliptic equations

Let Ω be a convex open domain in R² with boundary ∂Ω. We will assume that ∂Ω is either sufficiently smooth or a polygon. On Ω we consider the boundary value problem

    −∆u(x) + σ(x)u(x) = f(x),   x = (x_1, x_2) ∈ Ω,
    u(x) = 0,   x ∈ ∂Ω.
    (6.1)
The Laplace operator ∆ is defined as

    ∆ = ∂²/∂x_1² + ∂²/∂x_2².
We assume that the absorption coefficient σ(x) ∈ L^∞(Ω) satisfies the constraint

    σ(x) ≥ σ_0 > 0,   x ∈ Ω̄.   (6.2)

We also assume that the source term f ∈ L²(Ω) (i.e., f is square integrable).
This subsection is devoted to showing the existence of a solution to the above problem and recalling some elementary properties that will be useful in the analysis of discretized problems.
Let us first assume that there exists a sufficiently smooth solution u(x) to (6.1). Let v(x) be a sufficiently smooth test function defined on Ω and such that v(x) = 0 for x ∈ ∂Ω. Upon multiplying (6.1) by v(x) and integrating over Ω, we obtain by integrations by parts using Green’s formula that

    a(u, v) := ∫_Ω (∇u · ∇v + σuv) dx = ∫_Ω f v dx := L(v).   (6.3)
Exercise 6.1 Prove the above formula.
Let us introduce a few Hilbert spaces. For k ∈ N, we define

    H^k(Ω) ≡ H^k = { v ∈ L^2(Ω); D^α v ∈ L^2(Ω) for all |α| ≤ k }.     (6.4)
We recall that for a multi-index α = (α_1, α_2) (in two space dimensions) with
α_k ∈ N, 1 ≤ k ≤ 2, we denote |α| = α_1 + α_2 and

    D^α = ∂^{α_1}/∂x_1^{α_1} ∂^{α_2}/∂x_2^{α_2}.
Thus v ∈ H^k ≡ H^k(Ω) if all its partial derivatives of order up to k are square
integrable in Ω. The Hilbert space is equipped with the norm

    ‖u‖_{H^k} = ( Σ_{|α| ≤ k} ‖D^α u‖^2_{L^2} )^{1/2}.           (6.5)
49
We also define the semi-norm on H^k:

    |u|_{H^k} = ( Σ_{|α| = k} ‖D^α u‖^2_{L^2} )^{1/2}.           (6.6)

Note that the above semi-norm is not a norm, for all polynomials p_{k−1} of order at
most k − 1 are such that |p_{k−1}|_{H^k} = 0.
Because of the choice of boundary conditions in (6.1), we also need to introduce the
Hilbert space

    H^1_0(Ω) = { v ∈ H^1(Ω); v|_{∂Ω} = 0 }.              (6.7)
We can show that the above space is a closed subspace of H^1(Ω), where both spaces
are equipped with the norm ‖·‖_{H^1}. Showing this is not completely trivial and requires
understanding the operator that restricts a function defined on Ω to the values it takes
at the boundary ∂Ω. The restriction is defined for sufficiently smooth functions only
(being an element of H^1 is sufficient, but not necessary). For instance, "the restriction
to ∂Ω of a function in L^2(Ω)" means nothing. This is because L^2 functions are defined
up to modifications on sets of measure zero (for the Lebesgue measure) and ∂Ω is
precisely a set of measure zero.
We can show that (6.3) holds for any function v ∈ H^1_0(Ω). This allows us to "replace"
the problem (6.1) by the new problem in variational form:

    Find u ∈ H^1_0(Ω) such that for all v ∈ H^1_0(Ω), we have
    a(u, v) = L(v).                                      (6.8)
Theorem 6.1 There is a unique solution to the above problem (6.8).

Proof. The theorem is a consequence of the Riesz representation theorem. The main
ingredients of the proof are the following. First, note that a(u, v) is a symmetric bilinear
form on the Hilbert space V = H^1_0(Ω). It is clearly linear in u and v and one verifies
that a(u, v) = a(v, u).
The bilinear form a(u, v) is also positive definite, in the sense that a(v, v) > 0 for all
v ∈ V such that v ≠ 0. Indeed, a(u, v) is even better than this: it is coercive in V, in
the sense that there exists α > 0 such that

    a(v, v) ≥ α‖v‖^2_V,   for all v ∈ V.                 (6.9)

Indeed, from the definition of a(u, v) and the constraint on σ, we can choose α =
min(1, σ_0).
Since a(u, v) is a symmetric, positive definite, bilinear form on V, it defines an inner
product on V (by definition of an inner product). The norm associated to this inner
product is given by

    ‖u‖_a = a^{1/2}(u, u),   u ∈ V.                      (6.10)

The Riesz representation theorem then states precisely the following: Let V be a Hilbert
space with an inner product a(·, ·). Then for each bounded linear form L on V, there is
a unique u ∈ V such that L(v) = a(u, v) for all v ∈ V. Theorem 6.1 is thus equivalent
to showing that L(v) is bounded on V, which means the existence of a constant ‖L‖_{V*}
such that |L(v)| ≤ ‖L‖_{V*} a^{1/2}(v, v) for all v ∈ V. However we have

    |L(v)| ≤ ‖f‖_{L^2}‖v‖_{L^2} = (‖f‖_{L^2}/√σ_0)(√σ_0 ‖v‖_{L^2}) ≤ (‖f‖_{L^2}/√σ_0) a^{1/2}(v, v).
This yields the bound for L and the existence of a solution to (6.8). Now assume that u
and w are solutions. Then by taking differences in (6.8), we deduce that a(u − w, v) = 0
for all v ∈ V. Choose now v = u − w to get that a(u − w, u − w) = 0. However, since a
is positive definite, this implies that u − w = 0 and the uniqueness of the solution.
Remark 6.2 The solution to the above problem admits another interpretation. Indeed,
let us define the functional

    J(u) = (1/2) a(u, u) − L(u).                         (6.11)

We can then show that u being a solution to (6.8) is equivalent to u being the minimizer
of J over V = H^1_0.
Remark 6.3 Theorem 6.1 admits a very simple proof based on the Riesz representation
theorem. However it requires a(u, v) to be symmetric, which may not be the case in
practice. The Lax-Milgram theory is then the tool to use. What it says is that provided
that a(u, v) is coercive and bounded on a Hilbert space V (equipped with norm ‖·‖_V),
in the sense that there are two constants C_1 and α such that

    α‖v‖^2_V ≤ a(v, v),   |a(v, w)| ≤ C_1 ‖v‖_V ‖w‖_V,   for all v, w ∈ V,     (6.12)

and provided that the functional L(v) is bounded on V, in the sense that there exists a
constant ‖L‖_{V*} such that

    |L(v)| ≤ ‖L‖_{V*} ‖v‖_V,   for all v ∈ V,

then the abstract problem

    a(u, v) = L(v),   for all v ∈ V,                     (6.13)

admits a unique solution u ∈ V. This abstract theory is extremely useful in many
practical problems.
Regularity issues. The above theory gives us a weak solution u ∈ H^1_0 to the problem
(6.1). The solution is "weak" because the elliptic problem should be considered in its
variational form (6.3) rather than its PDE form (6.1). The Laplacian of a function in
H^1_0 is not necessarily a function. So u is not a strong solution (which is defined as a C^2
solution of (6.1)).
It thus remains to understand whether the regularity u ∈ H^1_0 is optimal. The
answer is no in most cases. What we can show is that for smooth boundaries ∂Ω and
for convex polygons ∂Ω, we have

    ‖u‖_{H^2} ≤ C‖f‖_{L^2}.                              (6.14)
When f ∈ L^2, we thus deduce that u ∈ H^2 ∩ H^1_0. Deriving such results is beyond the
scope of these notes and we will simply admit them. The main aspect of this regularity
result is that u has two more degrees of regularity than f: when f ∈ L^2, the second-order
partial derivatives of u are also in L^2. This may be generalized as follows. For
sufficiently smooth ∂Ω, or for very special polygonal boundaries ∂Ω (such as rectangles),
we have the following result:

    ‖u‖_{H^{k+2}} ≤ C‖f‖_{H^k},   k ≥ 0.                 (6.15)

This means that the solution u may be quite regular provided that the source term
f is sufficiently regular. Moreover, a classical Sobolev embedding theorem says that
a function u ∈ H^k for k > d/2, where d is the space dimension, is also of class C^0. In
dimension d = 2, this means that as soon as f ∈ H^k for k > 1, then u is of class C^2,
i.e., is a strong solution of (6.1).
As we saw for finite difference and spectral methods, the regularity of the solution
dictates the accuracy of the discretized solution. The above regularity results will thus
be of crucial importance in the analysis of finite element methods, which we take up
now.
6.2 Galerkin approximation
The above theory of existence is fundamentally different from what we used in the finite
difference and spectral methods: we no longer have (6.1) to discretize but rather (6.3).
The main difficulty in solving (6.3) is that V = H^1_0 is infinite dimensional. This however
gives us the right idea to discretize (6.3): simply replace V by a discrete approximation
V_h and solve (6.3) in V_h rather than V. We then have to choose the family of spaces V_h
for some parameter h → 0 (think of h as a mesh size) such that the solution converges
to u in a reasonable sense.
The Galerkin method is based on choosing finite dimensional subspaces V_h ⊂ V. In
our problem (6.1), this means choosing V_h as subsets of H^1_0. This requires some care:
functions in H^1_0 have derivatives in L^2. So piecewise constant functions, for instance,
are not elements of H^1_0 (because they can jump and the derivatives of jumps are too
singular to be elements of L^2), and V_h can therefore not be constructed using piecewise
constant functions.
Let us assume that we have a sequence of finite dimensional subspaces V_h ⊂ V. Then
the spaces V_h are Hilbert spaces. Assuming that the bilinear form a(u, v) is coercive
and bounded on V (see Rem. 6.3), then clearly it remains coercive and bounded on V_h
(because V_h is specifically chosen as a subset of V!). Theorem 6.1 and its generalization
in Rem. 6.3 then hold with V replaced by V_h. The same theory as for the continuous
problem thus directly gives us existence and uniqueness of a solution u_h to the discrete
problem:

    Find u_h ∈ V_h such that a(u_h, v) = L(v) for all v ∈ V_h.     (6.16)
It remains to obtain an error estimate for the problem on V_h. Let us assume that
a(u, v) is symmetric and as such defines a norm (6.10) on V and V_h. Then we have

Lemma 6.4 The solution u_h to (6.16) is the best approximation to the solution u of
(6.3) for the ‖·‖_a norm:

    ‖u − u_h‖_a = min_{v ∈ V_h} ‖u − v‖_a.               (6.17)
Proof. There are two main ideas here. The first idea is that since V_h ⊂ V, we can
subtract (6.16) from (6.8) to get the following orthogonality condition

    a(u − u_h, v) = 0,   for all v ∈ V_h.                (6.18)

Note that this property only depends on the linearity of u → a(u, v) and not on a being
symmetric. Now because a(u, u) is an inner product (so that a(u, v) ≤ ‖u‖_a ‖v‖_a) and
v → a(u, v) is linear, we verify that for all v ∈ V_h,

    ‖u − u_h‖^2_a = a(u − u_h, u − u_h) = a(u − u_h, u) = a(u − u_h, u − v) ≤ ‖u − u_h‖_a ‖u − v‖_a.

If ‖u − u_h‖_a = 0, then the lemma is obviously true. If not, then we can divide both sides
in the above inequality by ‖u − u_h‖_a to get that ‖u − u_h‖_a ≤ ‖u − v‖_a for all v ∈ V_h.
This yields the lemma.
Remark 6.5 When a(u, v) satisfies the hypotheses given in Rem. 6.3, the above result
does not hold. However, it may be replaced by

    ‖u − u_h‖_V ≤ (C_1/α) min_{v ∈ V_h} ‖u − v‖_V.       (6.19)
Exercise 6.2 Prove (6.19).
So, although u_h is no longer the best approximation of u in any norm (because a(u, v) no
longer generates a norm), we still observe that up to a multiplicative constant (greater
than 1 obviously), u_h is still very close to being an optimal approximation of u (for the
norm ‖·‖_V now).
Note that the above result also holds when a(u, v) is symmetric. Then u_h is the best
approximation of u for the norm ‖·‖_a, but not for the norm ‖·‖_V. We have equipped
V (and V_h) with two different, albeit equivalent, norms. We recall that two norms ‖·‖_a
and ‖·‖_V are equivalent if there exists a constant C such that

    C^{−1}‖v‖_V ≤ ‖v‖_a ≤ C‖v‖_V,   for all v ∈ V.       (6.20)

Exercise 6.3 When a(u, v) is symmetric and coercive on V, show that the two norms
‖·‖_a and ‖·‖_V are equivalent.
Let us summarize our results. We have replaced the variational formulation for
u ∈ V by a discrete variational formulation for u_h ∈ V_h. We have shown that the discrete
problem indeed admits a unique solution (and thus requires solving a problem of the
form Ax = b, since the problem is linear and finite dimensional), and that the solution
u_h is the best approximation to u when a(u, v) generates a norm, or at least is not
too far from the best approximation, this time for the norm ‖·‖_V, when a(u, v) is coercive
(whether it is symmetric or not).
In our problem (6.1), V = H^1_0. So we have obtained that ‖u − u_h‖_{H^1} is close to the
best possible approximation error of u by functions in V_h. Note that the "closeness" is all
relative. For instance, when α is very small, an error bound of the form (6.19) is not
necessarily very accurate. When a is symmetric, (6.17) is however always optimal. In
any event, we need additional information on the spaces V_h if one wants to go further.
We can go further in two directions: see how small ‖u − u_h‖_{H^1} is for specific sequences of
spaces V_h, but also see how small u − u_h is in possibly other norms one may be interested
in, such as ‖u − u_h‖_{L^2}. We now consider a specific example of Galerkin approximation:
the finite element method (although we should warn the reader that many finite element
methods are not of the Galerkin form because they are based on discretization spaces
V_h that are not subspaces of V).
6.3 An example of finite element method
We are now ready to introduce a finite element method. To simplify, we assume that Ω is
a rectangle. We then divide Ω into a family T_h of non-overlapping closed triangles. Each
triangle T ∈ T_h is represented by three vertices and three edges. We impose that two
neighboring triangles have either one edge (and thus two vertices) in common, or one
vertex in common. This precludes the existence of vertices in the interior of an edge. The
set of all vertices in the triangulation is called the nodes of T_h. The index h > 0 measures
the "size" of the triangles. A triangulation T_h of size h is thus characterized by

    Ω̄ = ∪_{T ∈ T_h} T,   h_T = diam(T),   h = max_{T ∈ T_h} h_T.     (6.21)

As the triangulation gets finer and h → 0, we wish to be able to construct more accurate
approximations u_h of the solution u of (6.8). Note that finite element methods could also
be based on different tilings of Ω, for instance based on quadrilaterals or higher order
polygons. We only consider triangles here.
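As a concrete illustration of (6.21), the sketch below builds a simple structured triangulation of the unit square (each cell of an n × n grid is cut into two triangles; this is one convenient choice among many) and computes the diameters h_T and the mesh size h.

```python
import numpy as np

# Structured triangulation of the unit square: each grid cell is split into
# two triangles.  diam(T) is the length of the longest edge of T, and h is
# the largest diameter over the triangulation, as in (6.21).
def triangulate_square(n):
    pts = np.array([[i / n, j / n] for j in range(n + 1) for i in range(n + 1)])
    idx = lambda i, j: j * (n + 1) + i
    tris = []
    for j in range(n):
        for i in range(n):
            v00, v10, v01, v11 = idx(i, j), idx(i+1, j), idx(i, j+1), idx(i+1, j+1)
            tris += [(v00, v10, v11), (v00, v11, v01)]
    return pts, tris

pts, tris = triangulate_square(8)
h_T = [max(np.linalg.norm(pts[t[k]] - pts[t[(k + 1) % 3]]) for k in range(3))
       for t in tris]
h = max(h_T)
print(len(tris), h)   # 128 triangles, h = sqrt(2)/8 (the cell diagonal)
```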
The reason we introduce the triangulation is that we want to approximate arbitrary
functions in V = H^1_0 by finite dimensional functions on each triangle T ∈ T_h. As we have
already mentioned, the functions cannot be chosen on each triangle independently of
what happens on the other triangles. The reason is that such functions may not match
across edges shared by two triangles. These jumps however would violate the fact that
our discrete functions need to be in H^1_0. For instance, functions that are constant on
each triangle T ∈ T_h are not in H^1_0 and are thus not admissible. Here we consider
the simplest finite element method. The simplest functions after piecewise constant
functions are piecewise linear functions. Moreover, we want the functions not to jump
across edges of the triangulation. We thus define

    V_h = { v ∈ H^1_0; v is linear on each T ∈ T_h }.    (6.22)

Recall that this imposes that v = 0 on ∂Ω. The space V_h is a finite dimensional subspace
of H^1_0 (see below for a "proof") so that the theory given in the previous section applies.
This gives us a solution u_h to (6.16).
A linear system. We now recast the above problem for u_h as a linear system (of the
form Ax = b). To do so, we need a basis for V_h. Because the functions in V_h are in
H^1_0, they cannot jump across edges. Because they are piecewise linear, they must be
continuous on Ω. So they are characterized (uniquely determined) by the values they
take at each node P_i ∈ Ω for 1 ≤ i ≤ M_h of the triangulation T_h (note that since
Ω is an open domain, the notation means that M_h is the number of internal nodes of
the triangulation; the nodes P_i ∈ ∂Ω should be labeled with M_h + 1 ≤ i ≤ M_h + b_h,
where b_h is the number of nodes P_i ∈ ∂Ω). Consider the family of pyramid functions
φ_i(x) ∈ V_h for 1 ≤ i ≤ M_h defined by

    φ_i(P_j) = δ_{ij},   1 ≤ i, j ≤ M_h.                 (6.23)

We recall that the Kronecker delta satisfies δ_{ij} = 1 if i = j and δ_{ij} = 0 otherwise.
Since φ_i(x) is piecewise affine on T_h, it is uniquely defined by its values at the nodes
P_j. Moreover, we easily deduce that the family is linearly independent, whence it is a
basis for V_h. This proves that V_h is finite dimensional. Note that by assuming that
φ_i ∈ V_h, we have also implicitly implied that φ_i(P_j) = 0 for P_j ∈ ∂Ω. Note also that
for P_j, P_k ∈ ∂Ω, the function φ_i (for P_i ∉ ∂Ω) vanishes on the whole edge (P_j, P_k),
so that indeed φ_i = 0 on ∂Ω. We can thus write any function v ∈ V_h as

    v(x) = Σ_{i=1}^{M_h} v_i φ_i(x),   v_i = v(P_i).     (6.24)
Let us decompose the solution u_h of (6.16) as u_h(x) = Σ_{i=1}^{M_h} u_i φ_i(x). The
equation (6.16) may then be recast as

    (f, φ_i) = a(u_h, φ_i) = Σ_{j=1}^{M_h} a(φ_j, φ_i) u_j.

This is nothing but the linear system

    Ax = b,   A_{ij} = a(φ_j, φ_i),   x_i = u_i,   b_i = (f, φ_i).     (6.25)

In the specific problem of interest here, where a(u, v) is defined by (6.3), we have

    A_{ij} = ∫_Ω ( ∇φ_j(x)·∇φ_i(x) + σ(x)φ_j(x)φ_i(x) ) dx.

There is therefore no need to discretize the absorption parameter σ(x): the variational
formulation takes care of this. Solving for u_h thus becomes a linear algebra problem,
which we do not describe further here. Let us just mention that the matrix A is quite
sparse, because a(φ_i, φ_j) = 0 unless P_i and P_j belong to the same edge in the
triangulation. This is a crucial property to obtain efficient numerical inversions of (6.25).
Let us also mention that the ordering of the points P_i is important. As a rough rule of
thumb, one would like the indices of points P_i and P_j belonging to the same edge to be
such that |i − j| is as small as possible, to render the matrix A as close to a diagonal
matrix as possible. The optimal ordering thus very much depends on the triangulation
and is by no means a trivial problem.
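As an illustration (one way of organizing the computation, not the only one), the self-contained sketch below assembles and solves the system (6.25) for (6.1) on a structured triangulation of the unit square, with σ ≡ 1 and a manufactured right-hand side so that the exact solution u(x, y) = sin(πx) sin(πy) is known; the mesh, the quadrature, and all names are arbitrary sample choices.

```python
import numpy as np

def p1_solve(n=16, sigma=1.0):
    """P1 finite elements for -Laplace(u) + sigma*u = f on the unit square,
    u = 0 on the boundary; f is chosen so that u(x,y) = sin(pi x) sin(pi y)."""
    pts = np.array([[i / n, j / n] for j in range(n + 1) for i in range(n + 1)])
    idx = lambda i, j: j * (n + 1) + i
    tris = []
    for j in range(n):
        for i in range(n):
            v00, v10, v01, v11 = idx(i, j), idx(i+1, j), idx(i, j+1), idx(i+1, j+1)
            tris += [(v00, v10, v11), (v00, v11, v01)]
    N = len(pts)
    A, rhs = np.zeros((N, N)), np.zeros(N)
    f = lambda x, y: (2 * np.pi**2 + sigma) * np.sin(np.pi * x) * np.sin(np.pi * y)
    for t in tris:
        p = pts[list(t)]
        M = np.column_stack([np.ones(3), p])   # row k of M is (1, x_k, y_k)
        C = np.linalg.inv(M)                   # column i: coefficients of phi_i
        area = abs(np.linalg.det(M)) / 2
        G = C[1:, :]                           # 2x3 matrix of gradients of phi_i
        Ke = area * (G.T @ G)                  # element stiffness contributions
        Me = area / 12 * (np.ones((3, 3)) + np.eye(3))   # exact P1 mass matrix
        for a in range(3):
            rhs[t[a]] += area / 3 * f(*pts[t[a]])        # lumped load (f, phi_i)
            for b in range(3):
                A[t[a], t[b]] += Ke[a, b] + sigma * Me[a, b]
    interior = [k for k in range(N) if 0 < pts[k, 0] < 1 and 0 < pts[k, 1] < 1]
    u = np.zeros(N)
    u[interior] = np.linalg.solve(A[np.ix_(interior, interior)], rhs[interior])
    return A, u, pts

A, u, pts = p1_solve(16)
exact = np.sin(np.pi * pts[:, 0]) * np.sin(np.pi * pts[:, 1])
print(np.max(np.abs(u - exact)))   # nodal error, O(h^2)
```

The Dirichlet condition is imposed by solving only for the interior nodes, exactly as in the labeling convention of the text; a dense matrix is used here for brevity, whereas a practical code would exploit the sparsity of A.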
Approximation theory. So far, we have constructed a family of discrete solutions u_h
of (6.16) for various parameters h > 0. Let now T_h be a sequence of triangulations such
that h → 0 (recall that h is the maximal diameter of the triangles in the triangulation).
It remains to understand how small the error u − u_h is. Moreover, this error will depend
on the chosen norm (we have at least three reasonable choices: ‖·‖_a, ‖·‖_{H^1}, and ‖·‖_{L^2}).
Since the problem (6.16) is a Galerkin discretization, we can use (6.17) and (6.19)
and thus obtain, in the H^1_0 and a-norms, that u − u_h is bounded by a constant times
u − v for arbitrary v ∈ V_h. Finding the best approximation v of u in a finite dimensional
space for a given norm would be optimal. However this is not an easy task in general.
We now consider a method which, up to a multiplicative constant (independent of h),
indeed provides the best approximation.
Our approximation is based on interpolation theory. Let us assume that f ∈ L^2(Ω),
so that u ∈ H^2(Ω) thanks to (6.14). This implies that u is a continuous function as
2 > d/2 = 1. We define the interpolation operator π_h : C(Ω̄) → V_h by

    (π_h v)(x) = Σ_{i=1}^{M_h} v(P_i) φ_i(x).            (6.26)

Note that the function v needs to be continuous since we evaluate it at the nodes P_i.
An arbitrary function v ∈ H^1_0 need not be well-defined at the points P_i (although it
almost is...).
We now want to show that π_h v − v gets smaller as h → 0. This is done by looking at
π_h v − v on each triangle T ∈ T_h and then summing the approximations over all triangles
of the triangulation. We thus need to understand the approximation at the level of one
triangle. We denote by π_T v the affine interpolant of v on a triangle T. For a triangle T
with vertices P_i, 1 ≤ i ≤ 3, it is defined as

    π_T v(x) = Σ_{i=1}^{3} v(P_i) φ_i(x),   x ∈ T,

where the φ_i(x) are the affine functions defined by φ_i(P_j) = δ_{ij}, 1 ≤ i, j ≤ 3. We recall
that functions in H^2(T) are continuous, so the above interpolant is well-defined for each
v ∈ H^2(T). The various triangles in the triangulation may be very different, so we need
a method that analyzes π_T v − v on arbitrary triangles. This is done in two steps. First
we pick a reference triangle T̂ and analyze π_T̂ v − v on it in various norms of interest.
Then we define a map that transforms T̂ into any triangle of interest T. We then push
forward any properties we have found on T̂ to the triangle T through this map. The
first step goes as follows.
Lemma 6.6 Let T̂ be the triangle with vertices (0, 0), (1, 0), and (0, 1) in an orthonormal
system of coordinates of R^2. Let v̂ be a function in H^2(T̂) and π_T̂ v̂ its affine
interpolant. Then there exists a constant C independent of v̂ such that

    ‖v̂ − π_T̂ v̂‖_{L^2(T̂)} ≤ C |v̂|_{H^2(T̂)},
    ‖∇(v̂ − π_T̂ v̂)‖_{L^2(T̂)} ≤ C |v̂|_{H^2(T̂)}.        (6.27)

We recall that |·|_{H^2(T̂)} is the semi-norm defined in (6.6).
Proof. The proof is based on judicious use of Taylor expansions. First, we want to
replace π_T̂ by a more convenient interpolant Π_T̂. By definition of π_T̂, we deduce that

    ‖π_T̂ v‖_{H^1} ≤ C‖v‖_{H^2},

since |v(P_i)| ≤ C‖v‖_{H^2} in two space dimensions. Also we verify that for all affine
interpolants Π_T̂, we have π_T̂ Π_T̂ v = Π_T̂ v, so that

    ‖v − π_T̂ v‖_{H^1} ≤ ‖v − Π_T̂ v‖_{H^1} + ‖π_T̂(Π_T̂ v − v)‖_{H^1} ≤ C‖v − Π_T̂ v‖_{H^1} + C|v|_{H^2}.

This is because the second-order derivatives of Π_T̂ v vanish by construction. So we see
that it is enough to prove the results stated in the lemma for the interpolant Π_T̂. This
is based on the following Taylor expansion (with integral remainder):

    v(x) = v(x_0) + (x − x_0)·∇v(x_0) + |x − x_0|^2 ∫_0^1 (x̂·∇)^2 v((1−t)x_0 + tx)(1−t) dt,   (6.28)

where x̂ = (x − x_0)/|x − x_0|. Since we will need it later, the same type of expansion
yields for the gradient:

    ∇v(x) = ∇v(x_0) + |x − x_0| ∫_0^1 (x̂·∇)∇v((1−t)x_0 + tx) dt.    (6.29)
Both integral terms in the above expressions involve second-order derivatives of v(x).
However they may not quite be bounded by C|v|_{H^2(T̂)}. The reason is that the affine
interpolant v(x_0) + (x − x_0)·∇v(x_0) and its derivative ∇v(x_0) both involve the gradient
of v at x_0, and there is no reason for |∇v(x_0)| to be bounded by C|v|_{H^2(T̂)}. This
causes some technical problems. The way around is to realize that x_0 may be seen as a
variable that we can vary freely in T̂. We therefore introduce a function φ(x) ∈ C_0^∞(T̂)
such that φ(x) ≥ 0 and ∫_T̂ φ(x) dx = 1. It is a classical real-analysis result that such a
function exists. Upon integrating (6.28) and (6.29) over T̂ in x_0, we obtain that
    v(x) = Π_T̂ v(x) + ∫_T̂ φ(x_0)|x − x_0|^2 ∫_0^1 (x̂·∇)^2 v((1−t)x_0 + tx)(1−t) dt dx_0,
    ∇v(x) = ∇Π_T̂ v(x) + ∫_T̂ φ(x_0)|x − x_0| ∫_0^1 (x̂·∇)∇v((1−t)x_0 + tx) dt dx_0,
    Π_T̂ v(x) = ∫_T̂ ( v(x_0) − x_0·∇v(x_0) ) φ(x_0) dx_0 + ( ∫_T̂ φ(x_0)∇v(x_0) dx_0 )·x.
                                                         (6.30)
We verify that Π_T̂ v is indeed an affine interpolant for v ∈ H^1(T̂) in the sense (thanks
to the normalization of φ(x)) that Π_T̂ v = v when v is a polynomial of degree at most 1
(in x_1 and x_2) and that ‖Π_T̂ v‖_{H^1} ≤ C‖v‖_{H^1}.
It remains to show that the remainders v(x) − Π_T̂ v(x) and ∇(v(x) − Π_T̂ v(x)) are
indeed bounded in L^2(T̂) by C|v|_{H^2}. We realize that both remainders involve bounded
functions (such as |x − x_0| or 1 − t) multiplied by second-order derivatives of v. It is
therefore sufficient to show a result of the form

    ∫_T̂ ( ∫_T̂ φ(x_0) ∫_0^1 |f(x_0 + t(x − x_0))| dt dx_0 )^2 dx ≤ C‖f‖^2_{L^2(T̂)},     (6.31)
for all f ∈ L^2(T̂). By the Cauchy-Schwarz inequality (stating that (f, g)^2 ≤ ‖f‖^2 ‖g‖^2,
where we use g = 1), the left-hand side in the above equation is bounded (because T̂ is
a bounded domain) by

    C ∫_T̂ ∫_T̂ φ^2(x_0) ∫_0^1 f^2((1−t)x_0 + tx) dt dx_0 dx.
We extend f(x) by 0 outside T̂. Let us now consider the part 1/2 < t < 1 in the above
integral and perform the change of variables tx ← x. This yields

    ∫_{1/2}^1 dt ∫_T̂ dx_0 φ^2(x_0) ∫_{t^{−1}T̂} (dx/t^2) f^2((1−t)x_0 + x) ≤ ( ∫_{1/2}^1 (dt/t^2) ∫_T̂ dx_0 φ^2(x_0) ) ‖f‖^2_{L^2(T̂)}.
This is certainly bounded by C‖f‖^2_{L^2(T̂)}. Now the part 0 < t < 1/2 with the change of
variables (1−t)x_0 ← x_0 similarly gives a contribution of the form

    ( ∫_0^{1/2} (dt/(1−t)^2) ‖φ‖^2_{L^∞(T̂)} |T̂| ) ‖f‖^2_{L^2(T̂)},

where |T̂| is the surface of T̂. These estimates show that

    ‖v − Π_T̂ v‖_{H^1(T̂)} ≤ C|v|_{H^2(T̂)}.               (6.32)

This was all we needed to conclude the proof of the lemma.
Once we have the estimate on T̂, we need to obtain it for arbitrary triangles. This is
done as follows:
Lemma 6.7 Let T be a proper (not flat) triangle in the plane R^2. There is an
orthonormal system of coordinates such that the vertices of T are the points (0, 0), (0, a),
and (b, c), where a, b, and c are real numbers such that a > 0 and c > 0. Let us define
‖A‖_∞ = max(a, |b|, c).
Let v ∈ H^2(T) and π_T v its affine interpolant. Then there exists a constant C
independent of the function v and of the triangle T such that

    ‖v − π_T v‖_{L^2(T)} ≤ C‖A‖_∞^2 |v|_{H^2(T)},
    ‖∇(v − π_T v)‖_{L^2(T)} ≤ C (‖A‖_∞^3/(ac)) |v|_{H^2(T)}.       (6.33)

Let us now assume that h_T is the diameter of T. Then clearly, ‖A‖_∞ ≤ h_T.
Furthermore, let us assume the existence of ρ_T such that

    ρ_T h_T ≤ a ≤ h_T,   ρ_T h_T ≤ c ≤ h_T.              (6.34)

Then we have the estimates

    ‖v − π_T v‖_{L^2(T)} ≤ C h_T^2 |v|_{H^2(T)},
    ‖∇(v − π_T v)‖_{L^2(T)} ≤ C ρ_T^{−2} h_T |v|_{H^2(T)}.         (6.35)

We recall that C is independent of v and of the triangle T.
Proof. We want to map the estimate on T̂ in Lemma 6.6 to the triangle T. This
is achieved as follows. Let us define the linear map A : R^2 → R^2 given in Cartesian
coordinates by

    A = ( a  0
          b  c ).                                        (6.36)

We verify that A(T̂) = T. Let us denote by x̂ coordinates on T̂ and by x coordinates
on T. For a function v̂ on T̂, we can define its push-forward v = A_* v̂ on T, which is
equivalent to

    v̂(x̂) = v(x) with x = Ax̂, or v̂(x̂) = v(Ax̂).

By the chain rule, we deduce that

    ∂v̂/∂x̂_i = A_{ik} ∂v/∂x_k,

where we use the convention of summation over repeated indices. Differentiating one
more time, we deduce that

    Σ_{|α|=2} |D^α v̂|^2(x̂) ≤ C‖A‖_∞^4 Σ_{|α|=2} |D^α v|^2(x),   x = Ax̂.     (6.37)

Here the constant C is independent of v and v̂. Note that the change of measures
induced by A is dx̂ = |det(A^{−1})| dx, so that

    ∫_T̂ Σ_{|α|=2} |D^α v̂|^2(x̂) dx̂ ≤ ‖A‖_∞^4 |det(A^{−1})| ∫_T Σ_{|α|=2} |D^α v|^2(x) dx.

This is equivalent to the statement:

    |v̂|_{H^2(T̂)} ≤ C|det(A)|^{−1/2} ‖A‖_∞^2 |v|_{H^2(T)}.        (6.38)

This tells us how the right-hand side in (6.27) may be used in estimates on T. Let us
now consider the left-hand sides. We first notice that

    (π_T̂ v̂)(x̂) = (π_T v)(x).                            (6.39)

The reason is that v̂_i = v_i at the nodes and that for any polynomial p of order 1 in the
variables x̂_1, x̂_2, p(Ax̂) is also a polynomial of order 1 in the variables x_1 and x_2. This
implies that

    (v̂ − π_T̂ v̂)(x̂) = (v − π_T v)(x),   A^{−1}∇(v̂ − π_T̂ v̂)(x̂) = ∇(v − π_T v)(x);   x = Ax̂.

Using the same change of variables as above, this implies the relations

    ‖v̂ − π_T̂ v̂‖^2_{L^2(T̂)} = |det A^{−1}| ‖v − π_T v‖^2_{L^2(T)},
    ‖A^{−1}‖_∞^2 ‖∇(v̂ − π_T̂ v̂)‖^2_{L^2(T̂)} ≥ C|det A^{−1}| ‖∇(v − π_T v)‖^2_{L^2(T)}.     (6.40)

Here, C is a universal constant, and ‖A^{−1}‖_∞ = (ac)^{−1}‖A‖_∞ (i.e., it is the maximal
element of the 2 × 2 matrix A^{−1}). The above inequalities combined with (6.38) directly
give (6.33). The estimates (6.35) are then an easy corollary.
Remark 6.8 (Form of the triangles) The first inequality in (6.35) only requires that
the diameter of T be smaller than h_T. The second estimate however requires more. It
requires that the triangles not be too flat and that all components a, b, and c be of
the same order h_T. We may verify that the constraint (6.34) is equivalent to imposing
that all the angles of the triangle be greater than a constant independent of h_T, and is
equivalent to imposing that there be a ball of radius of order h_T inscribed in T. From
now on, we impose that ρ_T ≥ ρ_0 > 0 for all T ∈ T_h, independent of the parameter h.
This concludes this section on approximation theory. Now that we have local error
estimates, we can sum them to obtain global error estimates.
A global error estimate in H^1_0(Ω). Let us come back to the general framework of
Galerkin approximations. We know from (6.19) that the error u − u_h in H^1_0 (the space
in which estimates are the easiest!) is bounded by a constant times the error u − v in
the same norm, where v is the best approximation of u (for this norm). So we obviously
have that

    ‖u − u_h‖_{H^1} ≤ C‖u − π_h u‖_{H^1}.                (6.41)

Here the constant C depends on the form a(·, ·) but not on u or u_h. The above
right-hand side is by definition

    ( Σ_{T ∈ T_h} ∫_T ( |u − π_T u|^2 + |∇(u − π_T u)|^2 ) dx )^{1/2} ≤ C ( Σ_{T ∈ T_h} ρ_0^{−4} h_T^2 |u|^2_{H^2(T)} )^{1/2},

thanks to (6.35). This, however, implies the following result.
Theorem 6.9 Let u be the solution to (6.8), which we assume is in H^2(Ω). Let u_h be
the solution of the discrete problem (6.16), where V_h is defined in (6.22) based on the
triangulation defined in (6.21). Then we have the error estimate

    ‖u − u_h‖_{H^1(Ω)} ≤ C ρ_0^{−2} h |u|_{H^2(Ω)}.      (6.42)

Here, C is a constant independent of T_h, u, and u_h. However it may depend on the
bilinear form a(u, v). We recall that ρ_0 was defined in Rem. 6.8.
The finite element method is thus first-order for the approximation of u in the H^1
norm. We may show that the order of approximation O(h) obtained in the theorem
is optimal (in the sense that h cannot be replaced by h^α with α > 1 in general).
A global error estimate in L^2(Ω). The norm ‖·‖_a is equivalent to the H^1 norm.
So the above estimate also holds for the ‖·‖_a norm (which is defined only when a(u, v)
is symmetric). The third norm we have used is the L^2 norm. It is reasonable to expect
a faster convergence rate in the L^2 norm than in the H^1 norm. The reason is this.
In the linear approximation framework, the gradients are approximated by piecewise
constant functions, whereas the functions themselves are approximated by polynomials
of order 1 on each triangle. We should therefore have better accuracy for the functions
than for their gradients. This is indeed the case. However the demonstration is not
straightforward and requires us to use the regularity of the elliptic problem one more
time, in a curious way.
Theorem 6.10 Under the hypotheses of Theorem 6.9 and the additional hypothesis
that (6.14) holds, we have the following error estimate:

    ‖u − u_h‖_{L^2(Ω)} ≤ C ρ_0^{−4} h^2 ‖u‖_{H^2(Ω)}.    (6.43)
Proof. Let w be the solution to the problem

    a(v, w) = (v, u − u_h),   for all v ∈ V.             (6.44)

Note that since a(u, v) is symmetric, this problem is equivalent to (6.8). However, when
a(u, v) is not symmetric, (6.44) is equivalent to (6.8) only after replacing a(u, v) by the
adjoint bilinear form a*(u, v) = a(v, u). The equation (6.44) is therefore called a dual
problem to (6.8), and as a consequence the proof of Theorem 6.10 we now present is
referred to as a duality argument in the literature.
In any event, the above problem (6.44) admits a unique solution thanks to Theorem
6.1. Moreover our hypothesis that (6.14) holds implies that

    ‖w‖_{H^2} ≤ C‖u − u_h‖_{L^2}.                        (6.45)

Using, in that order, the choice v = u − u_h in (6.44), the orthogonality condition (6.18),
the bound in (6.12) (with V = H^1_0), the estimate (6.45), and finally the error estimate
(6.42), we get:

    ‖u − u_h‖^2_{L^2} = a(u − u_h, w) = a(u − u_h, w − π_h w) ≤ C‖u − u_h‖_{H^1} ‖w − π_h w‖_{H^1}
        ≤ C‖u − u_h‖_{H^1} ρ_0^{−2} h ‖w‖_{H^2} ≤ C ρ_0^{−2} h ‖u − u_h‖_{H^1} ‖u − u_h‖_{L^2}
        ≤ C ρ_0^{−4} h^2 ‖u‖_{H^2} ‖u − u_h‖_{L^2}.

Dividing both sides by ‖u − u_h‖_{L^2} concludes our result.
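The two rates (6.42) and (6.43) can be observed numerically. The sketch below uses a one-dimensional analogue (P1 elements for −u'' + σu = f on (0, 1), which is not the two-dimensional method of the text but displays the same orders of convergence) with a manufactured exact solution, and halves h once; the sample choices σ = 1 and u = sin(πx) are arbitrary.

```python
import numpy as np

def fem_1d(n, sigma=1.0):
    """P1 finite elements for -u'' + sigma*u = f on (0,1), u(0) = u(1) = 0,
    with exact solution u(x) = sin(pi x), i.e. f = (pi^2 + sigma) sin(pi x)."""
    h = 1.0 / n
    x = np.linspace(0.0, 1.0, n + 1)
    m = n - 1                                  # number of interior nodes
    main = 2.0 / h + sigma * 2.0 * h / 3.0     # stiffness + mass, hat functions
    off = -1.0 / h + sigma * h / 6.0
    A = (np.diag(main * np.ones(m)) + np.diag(off * np.ones(m - 1), 1)
         + np.diag(off * np.ones(m - 1), -1))
    f = lambda t: (np.pi ** 2 + sigma) * np.sin(np.pi * t)
    u = np.zeros(n + 1)
    u[1:-1] = np.linalg.solve(A, h * f(x[1:-1]))   # lumped load vector
    # L2 and H1(-seminorm) errors by 2-point Gauss quadrature per element
    gp = np.array([-1.0, 1.0]) / np.sqrt(3.0)
    l2sq = h1sq = 0.0
    for i in range(n):
        sl = (u[i + 1] - u[i]) / h             # broken gradient of u_h
        for g in gp:
            t = (x[i] + x[i + 1]) / 2 + g * h / 2
            uh = u[i] + sl * (t - x[i])
            l2sq += h / 2 * (uh - np.sin(np.pi * t)) ** 2
            h1sq += h / 2 * (sl - np.pi * np.cos(np.pi * t)) ** 2
    return np.sqrt(l2sq), np.sqrt(h1sq)

l2a, h1a = fem_1d(32)
l2b, h1b = fem_1d(64)
print(l2a / l2b, h1a / h1b)   # ratios close to 4 (order h^2) and 2 (order h)
```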
7 Introduction to Monte Carlo methods
7.1 Method of characteristics
Let us first make a small detour via the method of characteristics. Consider a hyperbolic
equation of the form

    ∂v/∂t − c ∂v/∂x + αv = β,   v(0, x) = P(x),          (7.1)
for t > 0 and x ∈ R. We want to solve (7.1) by the method of characteristics. The
main idea is to look for solutions of the form

    v(t, x(t)),                                          (7.2)

where x(t) is a characteristic, solving an ordinary differential equation, and where the
rate of change of ω(t) = v(t, x(t)) is prescribed along the characteristic x(t).
We observe that

    (d/dt) v(t, x(t)) = ∂v/∂t + (dx/dt) ∂v/∂x,

so that the characteristic equations for (7.1) are

    dx/dt = −c,   dv/dt + αv = β.                        (7.3)
The solutions are found to be

    x(t) = x_0 − ct,   v(t, x(t)) = e^{−αt} v(0, x_0) + (β/α)(1 − e^{−αt}).     (7.4)

For each (t, x), we can find a unique x_0(t, x) = x + ct, so that

    v(t, x) = e^{−αt} P(x + ct) + (β/α)(1 − e^{−αt}).    (7.5)
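Formula (7.5) can be checked numerically: evaluate it for a sample initial condition and verify that the residual of (7.1) vanishes up to finite-difference error. The constants and the Gaussian profile P below are arbitrary sample choices.

```python
import numpy as np

# Sanity check of the closed-form solution (7.5): build
# v(t,x) = exp(-a t) P(x + c t) + (b/a)(1 - exp(-a t)) and verify that
# v_t - c v_x + a v - b vanishes (up to the central-difference error).
c, a, b = 2.0, 0.5, 1.0          # sample values of c, alpha, beta
P = lambda x: np.exp(-x ** 2)    # sample initial condition

def v(t, x):
    return np.exp(-a * t) * P(x + c * t) + (b / a) * (1 - np.exp(-a * t))

t0, x0, eps = 0.7, 0.3, 1e-5
v_t = (v(t0 + eps, x0) - v(t0 - eps, x0)) / (2 * eps)
v_x = (v(t0, x0 + eps) - v(t0, x0 - eps)) / (2 * eps)
residual = v_t - c * v_x + a * v(t0, x0) - b
print(abs(residual))             # essentially zero
```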
Many first-order linear and non-linear equations may be solved by the method of
characteristics. The main idea, as we have seen, is to replace a partial differential
equation (PDE) by a system of ordinary differential equations (ODEs). The latter
may then be discretized if necessary and solved numerically. Since ODEs are solved,
we avoid the difficulties in PDE-based discretizations caused by the presence of high
frequencies that are not well captured by the schemes. See the section on finite difference
discretizations of hyperbolic equations.
There are many other difficulties in the numerical implementation of the method of
characteristics. For instance, we need to solve the characteristic backwards from x(t) to
x(0) and evaluate the initial condition at x(0), which may not be a grid point and thus
may require that we use an interpolation technique. Yet the method is very powerful as
it is much easier to solve ODEs than PDEs. In the literature, grid-based (PDE-based)
methods are referred to as Eulerian, whereas methods acting in the reference frame of
the particles, following their characteristics (which requires solving ODEs), are referred
to as Lagrangian methods. The combination of grid-based and particle-based methods
gives rise to the so-called semi-Lagrangian methods.
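The construction above is easy to sketch numerically. The following fragment (the notes use Matlab for exercises; Python is used here purely for illustration) solves (7.1) along a characteristic by integrating the ODE (7.3) with an explicit Euler step and compares the result with the closed-form solution (7.5). The parameter values c, α, β and the initial profile P are illustrative choices, not taken from the notes.

```python
import math

# Sketch of the method of characteristics for (7.1): along x(t) = x0 - c*t,
# the PDE reduces to the ODE dv/dt + a*v = b of (7.3).  The parameters
# c, a (alpha), b (beta) and the profile P are illustrative choices.
c, a, b = 2.0, 0.5, 1.0
P = lambda x: math.exp(-x * x)

def v_exact(t, x):
    # Closed-form solution (7.5): trace the characteristic back to x0 = x + c*t.
    return math.exp(-a * t) * P(x + c * t) + (b / a) * (1.0 - math.exp(-a * t))

def v_characteristics(t, x, n_steps=10000):
    # Solve the characteristic ODE dv/dt = b - a*v with explicit Euler,
    # starting from v(0) = P(x0) with x0 = x + c*t.
    dt = t / n_steps
    v = P(x + c * t)
    for _ in range(n_steps):
        v += dt * (b - a * v)
    return v

t, x = 1.0, 0.3
print(abs(v_characteristics(t, x) - v_exact(t, x)))  # O(dt) discretization error
```

Any ODE integrator from Chapter 1 (explicit or implicit Euler, a Runge-Kutta method) could replace the Euler loop; the point is that only an ODE is solved along each characteristic.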
7.2 Probabilistic representation
One of the main drawbacks of the method of characteristics is that it applies (only)
to scalar first-order equations. In one dimension of space, we can apply it to the wave
equation
    0 = ∂²u/∂t² − ∂²u/∂x² = (∂/∂t − ∂/∂x)(∂/∂t + ∂/∂x) u,
because of the above decomposition. However, the method applies neither to multi-
dimensional wave equations nor to elliptic or parabolic equations such as the Laplace
and heat equations. The reason is that information no longer propagates along
characteristic curves.
For some equations, a quasi-method of characteristics may still be developed. However,
instead of following a single characteristic, it is based on following an infinite number
of characteristics generated from an appropriate probability measure and averaging
over them. More precisely, instead of looking for solutions of the form w(t, x(t)), we
look for solutions of the form

    u(t, x) = E{u_0(X(t)) | X(0) = x},    (7.6)
where X(t) = X(t, ω) is a trajectory starting at x for realizations ω in a state space
Ω, and E denotes ensemble averaging with respect to a measure P defined on Ω. It is
not the purpose of these notes to dwell on (7.6). Let us give the most classical of
examples. Let W(t) be a standard one-dimensional Brownian motion with W(0) = 0.
This is an object defined on a space Ω (the space of continuous trajectories on [0, ∞))
with an appropriate measure P. All trajectories are therefore continuous (at least
almost surely) but are also almost surely nowhere differentiable. The "characteristics"
have thus become rather un-smooth objects. In any event, we find that

    u(t, x) = E{u_0(x + W(2t))},    (7.7)
solves the heat equation

    ∂u/∂t − ∂²u/∂x² = 0,    u(0, x) = u_0(x).    (7.8)
This may be shown by e.g. Itô calculus. Very briefly, let g(W(2t)) = u_0(x + W(2t)).
Then, we find that

    dg = (∂g/∂x) dW_{2t} + (1/2)(∂²g/∂x²) dW_{2t} dW_{2t}
       = (∂g/∂x) dW_{2t} + (∂²g/∂x²) dt,

since dW_{2t} dW_{2t} = d(2t). Using E{dW_{2t}} = 0, we obtain after ensemble averaging that
(7.7) solves (7.8). We thus see that u(t, x) is an infinite superposition (E is an inte-
gration) of pieces of information carried along the trajectories. We also see how this
representation gives rise to the Monte Carlo method. We cannot simulate an infinite
number of trajectories. However, we can certainly simulate a large but finite number of
them and perform the averaging. This is the basis for the Monte Carlo method. Note
that in (7.7), simulating the characteristic is easy as we know the law of W(t) (a
centered Gaussian variable with variance equal to t). In more general situations, where
the diffusion coefficient depends on position, W(2t) is replaced by some process X(t; x)
that depends in a more complicated way on the starting location x. Solving for the
trajectories then becomes more difficult.
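As a concrete illustration of (7.7), the short sketch below (in Python, for illustration; the indicator initial condition u_0 is an illustrative choice) averages u_0(x + W(2t)) over Gaussian samples and compares with the exact solution, which for this u_0 can be written with the error function.

```python
import math, random

# Monte Carlo sketch of representation (7.7): u(t, x) = E{u0(x + W(2t))},
# where W(2t) is Gaussian with mean 0 and variance 2t.  The initial
# condition u0 (an indicator function) is an illustrative choice.
random.seed(0)

def u0(x):
    return 1.0 if abs(x) < 1.0 else 0.0

def u_monte_carlo(t, x, M=200000):
    # Average u0 over M samples of x + W(2t); W(2t) has standard deviation sqrt(2t).
    s = math.sqrt(2.0 * t)
    return sum(u0(x + random.gauss(0.0, s)) for _ in range(M)) / M

def u_exact(t, x):
    # Exact solution of u_t = u_xx with this u0, obtained from the heat
    # kernel: u(t, x) = (erf((x+1)/sqrt(4t)) - erf((x-1)/sqrt(4t)))/2.
    s = math.sqrt(4.0 * t)
    return 0.5 * (math.erf((x + 1.0) / s) - math.erf((x - 1.0) / s))

t, x = 0.5, 0.2
print(abs(u_monte_carlo(t, x) - u_exact(t, x)))  # O(M^{-1/2}) statistical error
```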
7.3 Random Walk and Finite Differences
Brownian motion has a relatively simple discrete version, namely the random walk.
Consider a process S_k such that S_0 = 0 and S_{k+1} is equal to S_k + h with
probability 1/2 and equal to S_k − h also with probability 1/2.
We may decompose this random variable as follows. Let us assume that we have
N (time) steps and define N independent random variables x_k, 1 ≤ k ≤ N, such that
x_k = ±1 with probability 1/2. Then,

    S_n = h Σ_{k=1}^{n} x_k,
is our random walk.
Let us relate all this to Bernoulli random variables. Define y_k for 1 ≤ k ≤ N, N
independent random variables equal to 0 or 1 with probability 1/2, and let
Σ_n = Σ_{k=1}^{n} y_k. Each sequence of n 0's and 1's for the y_k has probability
2^{−n}. The number of sequences such that Σ_n = m is the binomial coefficient C(n, m),
by definition (the number of ways of choosing m balls among n ≥ m balls). As a
consequence, using x_k = 2y_k − 1, we see that

    p_m = P(S_n = hm) = 2^{−n} C(n, (n+m)/2).    (7.9)
Exercise 7.1 Check this.
Note that m (where S_n = hm) has the same parity as n, so that (n+m)/2 is always an
integer. We thus have an explicit expression for the law of S_n. Moreover, we easily
find that
    E{S_n} = 0,    Var S_n = E{S_n²} = nh².    (7.10)
The central limit theorem shows that S_n converges to W(2t) as n → ∞ and h → 0
with nh² = 2t. Moreover, S_{αn} also converges to W(2αt) in an appropriate sense,
so that there is a solid mathematical foundation for viewing S_n as a bona fide
discretization of Brownian motion. This is Donsker's invariance principle.
How does one use all of this to solve a discretized version of the heat equation?
Following (7.7), we can define

    U_j^n = E{U^0_{j + h^{−1} S_n}},    (7.11)
where U_j^0 is an approximation of u_0(jh) as before, and where U_j^n is an
approximation of u(n∆t, jh). What is it that we have defined? For this, we need to use
the idea of conditional expectations, which roughly says that averaging over two random
variables may be performed by first averaging over one of them (the result is then still
a random variable) and then averaging over the other. In our context:
    U_j^n = E{U^0_{j+x_1+...+x_n}} = E{ E{U^0_{j+x_1+...+x_n} | x_n} }
          = (1/2) E{U^0_{j+1+x_1+...+x_{n−1}}} + (1/2) E{U^0_{j−1+x_1+...+x_{n−1}}}
          = (1/2) ( U^{n−1}_{j+1} + U^{n−1}_{j−1} ).
In other words, U_j^n solves the finite difference equation with λ = 1/2 (and ONLY
this value of λ, given the way the procedure is constructed). Recall that

    λ = D∆t/h² = 1/2,    (7.12)
where D is the diffusion coefficient. In other words, the procedure (7.11) gives us an
approximation of

    ∂u/∂t − D ∂²u/∂x² = 0,    u(0, x) = u_0,

when ∆t and h are chosen so that (7.12) holds.
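The equivalence can be checked numerically. Below is a small sketch (in Python for illustration, with illustrative grid sizes and initial data) that evolves the explicit scheme with λ = 1/2 and compares it, at one interior grid point, with a random-walk Monte Carlo estimate of (7.11).

```python
import random

# Sketch of the random-walk representation (7.11) versus the explicit
# finite-difference scheme at lambda = 1/2, i.e.
# U_j^n = (U_{j+1}^{n-1} + U_{j-1}^{n-1})/2.
# Grid size, number of steps and initial data are illustrative choices.
random.seed(1)
J, n = 40, 10                       # number of grid points and time steps
U0 = [1.0 if 15 <= j <= 25 else 0.0 for j in range(J)]

# Explicit scheme with lambda = 1/2 (indices clipped at the boundary; the
# point j = 20 below is far enough from the boundary that clipping never acts).
U = U0[:]
for _ in range(n):
    U = [0.5 * (U[min(j + 1, J - 1)] + U[max(j - 1, 0)]) for j in range(J)]

def mc_estimate(j, M=100000):
    # Monte Carlo version of (7.11): average U0 over M simulated walks of
    # n steps of +-1, started at grid point j.
    total = 0.0
    for _ in range(M):
        pos = j + sum(random.choice((-1, 1)) for _ in range(n))
        total += U0[min(max(pos, 0), J - 1)]
    return total / M

j = 20
print(U[j], mc_estimate(j))  # the two values agree up to O(M^{-1/2}) noise
```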
7.4 Monte Carlo method
Solving for U_j^n thus requires that we construct S_n (by flipping n ±1 coins) and
evaluate X := U^0_{j + h^{−1} S_n}. If we do it once, how accurate are we? The answer
is: not very. Let us look at the random variable X. We know that E{X} = U_j^n, so
that X is an unbiased estimator of U_j^n. However, its variance is often quite large.
Using (7.9), we find that P(X = U^0_{j+m}) = p_m. As a consequence, we find that:
    σ² = Var X = Σ_{k=1}^{n} p_k ( U^0_{j−k} − Σ_{l=1}^{n} p_l U^0_{j−l} )².    (7.13)
Assume that U^0_j = U independent of j. Then σ² = 0 and X gives the right answer
every time. However, when U^0 is highly oscillatory, say U^0_{j−l} = (−1)^l, then
Σ_l p_l U^0_{j−l} will be close to 0 by cancellations. This shows that σ² will be close
to Σ_k p_k (U^0_{j−k})², which is comparable to, or even much larger than, E{X}. So X
has a very large variance and is therefore not a very good estimator of U_j^n.
The (only) way to make the estimator better is to use the law of large numbers by
repeating the experiment M times and averaging over the M results. More precisely,
let X_k for 1 ≤ k ≤ M be M independent random variables with the same law as X.
Define

    S_M = (1/M) Σ_{k=1}^{M} X_k.    (7.14)
Then we verify that E{S_M} = U_j^n as before. The variance of S_M is however
significantly smaller:

    Var S_M = E{ ( (1/M) Σ_{k=1}^{M} (X_k − U_j^n) )² }
            = (1/M²) Σ_{k=1}^{M} E{(X_k − U_j^n)²} = (Var X)/M,
since the variables are independent. As a consequence, S_M is an unbiased estimator of
U_j^n with a standard deviation of order O(M^{−1/2}). This is the speed of convergence
of Monte Carlo: in order to obtain an accuracy of ε on average, we need to compute on
the order of ε^{−2} realizations.
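The O(M^{−1/2}) rate is easy to observe numerically. The sketch below (in Python for illustration; the uniform random variable is a simple stand-in for the estimator X of the text) measures the root-mean-square error of the sample mean S_M for two sample sizes: quadrupling M should roughly halve the error.

```python
import math, random

# Numerical check of the O(M^{-1/2}) convergence rate of the sample mean.
# The random variable (uniform on [0, 1], mean 1/2) is an illustrative
# stand-in for the estimator X of the text.
random.seed(2)

def rms_error(M, reps=2000):
    # Root-mean-square error of S_M = (1/M) sum X_k around E{X} = 1/2,
    # measured over many independent repetitions.
    err2 = 0.0
    for _ in range(reps):
        s = sum(random.random() for _ in range(M)) / M
        err2 += (s - 0.5) ** 2
    return math.sqrt(err2 / reps)

e1, e2 = rms_error(100), rms_error(400)
print(e1 / e2)  # close to 2 = sqrt(400/100), as predicted by O(M^{-1/2})
```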
How do the costs of the Monte Carlo and deterministic methods compare for computing
U_j^n (at a final time n∆t and a given position j)? Let us assume that we solve a
d-dimensional heat equation, which involves a d-dimensional random walk, which is
nothing but d one-dimensional random walks, one in each direction. The cost of each
trajectory is of order n ≈ (∆t)^{−1} since we need to flip n coins. To obtain an
O(h²) = O(∆t) accuracy, we thus need M = (∆t)^{−2} realizations, so that the final
cost is O((∆t)^{−3}).
Deterministic methods require that we solve n time steps using a grid of mesh size
O(h) in each direction, which means O(h^{−d}) discretization points. The total cost of
an explicit method is therefore O((∆t h^d)^{−1}) = O((∆t)^{−1−d/2}). We thus see that
Monte Carlo is more efficient than finite differences (or equally efficient) when

    1/(∆t)³ ≤ 1/(∆t)^{1+d/2},    i.e., d ≥ 4.    (7.15)
For sufficiently large dimensions, Monte Carlo methods do not suffer from the curse of
dimensionality characterizing deterministic methods and thus become more efficient.
Monte Carlo methods are also much more versatile as they are based on solving "random
ODEs" (stochastic ODEs), with which it is easier to account for e.g. complicated
geometries. Finite differences however provide the solution everywhere, unlike Monte
Carlo methods, which need to use different trajectories to evaluate the solution at
different points.
Moreover, the Monte Carlo method is much easier to parallelize. For the finite
difference model, it is difficult to use a parallel architecture (though this is a
well-studied problem) and obtain algorithms whose running time is inversely proportional
to the number of available processors. In contrast, Monte Carlo methods are easily
parallelized since the M random realizations of X are computed independently. So with
a (very large) number of processors equal to M, the running time of the algorithm is of
order (∆t)^{−1}, which corresponds to the running time of the finite difference
algorithm on a grid with a small finite number of grid points. Even with a large number
of processors, it is extremely difficult to attain this level of running speed using the
deterministic finite difference algorithm.
Let us conclude with the following remark. The convergence rate M^{−1/2} of the Monte
Carlo algorithm is rather slow. Many algorithms have been developed to accelerate
convergence. These algorithms are called variance-reduction algorithms and are based
on estimating the solution U_j^n by using non-physical random processes with smaller
variance than the classical Monte Carlo method described above.
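As one illustration of the variance-reduction idea (the notes do not single out a specific method; antithetic variates is one classical example, sketched here in Python), each Gaussian sample W in the representation (7.7) can be paired with −W. For monotone initial data the two evaluations are negatively correlated, which lowers the variance at equal cost.

```python
import math, random

# Antithetic variates, one classical variance-reduction technique (used here
# only as an illustration).  We estimate u(t, x) = E{u0(x + W(2t))} as in
# (7.7); the monotone initial condition u0 is an illustrative choice.
random.seed(3)

def u0(x):
    return math.exp(-x)

def plain_mc(t, x, M):
    # Classical estimator: average over M independent Gaussian samples.
    s = math.sqrt(2.0 * t)
    return sum(u0(x + random.gauss(0.0, s)) for _ in range(M)) / M

def antithetic_mc(t, x, M):
    # M/2 pairs (W, -W): same number of u0 evaluations, smaller variance,
    # since u0(x + W) and u0(x - W) are negatively correlated for monotone u0.
    s = math.sqrt(2.0 * t)
    total = 0.0
    for _ in range(M // 2):
        w = random.gauss(0.0, s)
        total += 0.5 * (u0(x + w) + u0(x - w))
    return total / (M // 2)

t, x = 0.5, 0.0
# For this u0, the exact value is E{exp(-(x + W(2t)))} = exp(t - x).
print(plain_mc(t, x, 100000), antithetic_mc(t, x, 100000), math.exp(t - x))
```

Both estimators are unbiased; the antithetic one typically fluctuates less around the exact value for the same number of samples.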

1

Ordinary Differential Equations
dX(t) = f (t, X(t)), dt X(0) = X0 .

An Ordinary Differential Equation is an equation of the form t ∈ (0, T ), (1.1)

We have existence and uniqueness of the solution when f (t, x) is continuous and Lipschitz with respect to its second variable, i.e. there exists a constant C independent of t, x, y such that |f (t, x) − f (t, y)| ≤ C|x − y|. (1.2) If the constant is independent of time for t > 0, we can even take T = +∞. Here X and f (t, X(t)) are vectors in Rn for n ∈ N and | · | is any norm in Rn . Remark 1.1 Throughout the text, we use the symbol “C” to denote an arbitrary constant. The “value” of C may change at every instance. For example, if u(t) is a function of t bounded by 2, we will write C|u(t)| ≤ C, (1.3)

even if the two constants “C” on the left- and right-hand sides of the inequality may be different. When the value “2” is important, we will replace the above right-hand side by 2C. In our convention however, 2C is just another constant “C”. Higher-order differential equations can always been put in the form (1.1), which is quite general. For instance the famous harmonic oscillator is the solution to x + ω 2 x = 0, ω2 = k , m (1.4)

with given initial conditions x(0) and x (0) and can be recast as d dt X1 (t) = X2 0 1 −ω 2 0 X1 (t), X2 X1 (0) = X2 x(0) . x (0) (1.5)

Exercise 1.1 Show that the implied function f in (1.5) is Lipschitz. The Lipschitz condition is quite important to obtain uniqueness of the solution. Take for instance the case n = 1 and the function x(t) = t2 . We easily obtain that x (t) = 2 x(t) t ∈ (0, +∞), x(0) = 0.

However, the solution x(t) ≡ 0 satisfies the same equation, which implies non-uniqueness ˜ of the solution. This remark is important in practice: when an equation admits several solutions, any sound numerical discretization is likely to pick one of them, but not necessarily the solution one is interested in. Exercise 1.2 Show that the function f (t) above is not Lipschitz. 2

How do we discretize (1.1)? There are two key ingredients in the derivation of a numerical scheme: first to remember the definition of a derivative df f (t + h) − f (t) f (t) − f (t − h) f (t + h) − f (t − h) (t) = lim = lim = lim , h→0 h→0 h→0 dt h h 2h and second (which is quite related) to remember Taylor expansions to approximate a function in the vicinity of a given point: f (t + h) = f (t) + hf (t) + hn hn+1 (n+1) h2 f (t) + . . . + f (n) (t) + f (t + s), 2 n! (n + 1)! (1.6)

for 0 ≤ s ≤ t assuming the function f is sufficiently regular. With this in mind, let us fix some ∆t > 0, which corresponds to the size of the interval between successive points where we try to approximate the solution of (1.1). We then approximate the derivative by X(t + ∆t) − X(t) dX (t) ≈ . dt ∆t (1.7)

It remains to approximate the right-hand side in (1.1). Since X(t) is supposed to be known by the time we are interested in calculating X(t + ∆t), we can choose f (t, X(t)) for the right-hand side. This gives us the so-called explicit Euler scheme X(t + ∆t) − X(t) = f (t, X(t)). ∆t (1.8)

Let us denote by Tn = n∆t for 0 ≤ n ≤ N such that N ∆t = T . We can recast (1.8) as Xn+1 = Xn + ∆tf (n∆t, Xn ), X0 = X(0). (1.9)

This is a fully discretized equation that can be solved on the computer since f (t, x) is known. Another choice for the right-hand side in (1.1) is to choose f (t + ∆t, X(t + ∆t)). This gives us the implicit Euler scheme Xn+1 = Xn + ∆tf ((n + 1)∆t, Xn+1 ), X0 = X(0). (1.10)

Notice that this scheme necessitates to solve an equation at every step, namely to find the inverse of the function x → x − f (n∆t, x). However we will see that this scheme is more stable than its explicit relative in many practical situations and is therefore often a better choice. Exercise 1.3 Work out the explicit and implicit schemes in the case f (t, x) = λx for some real number λ ∈ R. Write explicitly Xn for both schemes and compare with the exact solution X(n∆t). Show that in the stable case (λ < 0), the explicit scheme can become unstable and that the implicit scheme cannot. Comment. We now need to justify that the schemes we have considered are good approximations of the exact solution and understand how good (or bad) our approximation is. This requires to analyze the behavior of Xn as ∆t → 0, or equivalently the number of points 3

X and Y . that is to say |Xn | ≤ C(1 + |X0 |). Let us introduce the solution function g(t. Similarly we introduce the approximate solution function g∆t (t. Notice from the Lipschitz property of f (with x = x and y = 0) and from (1. The meaning of this inequality is the following: we do not want the difference X(t + ∆t) − Y (t + ∆t) to be more than (1 + C∆t) times what it was at time t (i. (1. We then have to prove that our scheme is stable. 4 .N to discretize a fixed interval of time (0. Xn ). where dX (t + τ ) = f (t + τ. (1.16) (1.e.a. That the equation is properly posed (a. Xn )| ≤ (1 + C∆t)|Xn | + C∆t.15) where the constant C is independent of t.e.14) where C is independent of X0 and 0 ≤ n ≤ N . Y )| ≤ (1 + C∆t)|X − Y |.11) 0 ≤ τ ≤ ∆t (1. X) − g(t. roughly speaking has good stability properties: when one changes the initial conditions a little. we have g∆t (n∆t.14) holds from now on. X) = g(t. that our solution Xn remains bounded independently of n and ∆t.e. i.1). one expects the solution not to change too much in the future. X. x))−1 (Xn ). X(t + τ )). X) = X(t + ∆T ). assuming that this inverse function is well-defined. Xn ) = (x − ∆tf ((n + 1)∆t. First we need to assume that the exact problem (1.14) that |Xn+1 | = |g∆t (n∆t.12) This is nothing but the function that maps a solution of the ODE at time t to the solution at time t + ∆t. well-posed) in the sense of Hadamard means that |g(t.1) is properly posed. i. i. ∆T ) defined by g(t. |X − Y | = |X(t) − Y (t)|) during the time interval ∆t. In many problems. For the implicit Euler scheme. dt X(t) = X. (1. the strategy to do so is the following.13) For instance. A second ingredient is to show that the scheme is stable. Because the explicit scheme is much simpler to analyze. X) for t = n∆t by Xn+1 = g∆t (n∆t. that locally it is a good approximation of the real equation. we assume that (1.e. Xn ) for 0 ≤ n ≤ N − 1. Xn ) = Xn + ∆tf (n∆t. 
The third ingredient is to show that our scheme is consistent with the equation (1. for the explicit Euler scheme. (1. we have g∆t (n∆t.k. T ) tends to infinity.

16) is satisfied) and that both schemes are consistent and of order m = 1 (i. Such a result is usually obtained by using Taylor expansions. Notice that this is a local estimate.e. it consists of showing that |g(n∆t. Notice that the above property is a regularity property of the equation.18) ¨ for some 0 ≤ s ≤ ∆t. This implies by induction that n |Xn | ≤ (1 + C∆t) |X0 | + k=0 n (1 + C∆t)k C∆t ≤ (1 + C∆t)N (N ∆tC + |X0 |). Let us consider the explicit Euler scheme and assume that the solution X(t) of the ODE is sufficiently smooth so that the solution with initial condition at n∆t given by X = X(n∆t) satisfies ∆t2 ¨ ˙ X((n + 1)∆t) = X(n∆t) + ∆tX(n∆t) + X(n∆t + s). the order of the scheme. (1.where C is a constant. Assuming that we know the initial condition at time n∆t.4 Consider the harmonic oscillator (1. 2 This implies (1.e.5). This implies the stability of the explicit Euler scheme. here of order ∆t2 . X(n∆t)) + 2 2 ∆t ¨ = g∆t (n∆t. (1. for 1 ≤ n ≤ N . 2 (1.17) for some positive m.15) is satisfied. X(n∆t)) + X(n∆t + s). We then prove that the scheme is consistent and obtain the local error of discretization. where |X(n∆t + s)| ≤ C(1 + |X|). x) is of class C 1 for instance.17) and the consistency of the scheme with m = 1 (the Euler explicit scheme is first-order). X(n∆t)) at time (n + 1)∆t) up to a small term. Show that (1. X) − g∆t (n∆t. that both schemes are stable (i. Work out the functions g and g∆ for the explicit and implicit Euler schemes. N which is bounded independently of ∆t.17) is satisfied with m = 1). This implies that X((n + 1)∆t) = g(n∆t. X(n∆t)) ∆t2 ¨ X(n∆t + s) = X(N ∆t) + ∆tf (n∆t. In our case. 5 . X)| ≤ C∆tm+1 (1 + |X|). Now N ∆t = T and (1 + CT N ) ≤ eCT . Another way of interpreting the above result is as follows: the exact solution X((n + 1)∆t) locally solves the discretized equation (whose solution is g∆t (n∆t. 
These are sufficient conditions (though not necessary ones) to obtain convergence and can be shown rigorously when the function f (t. we want to make sure that the exact and approximate solutions at time (n + 1)∆t are separated by an amount at most of order ∆t2 . Exercise 1. We prove convergence of the explicit Euler scheme only when the above regularity conditions are met. (1. not of the discretization.

as ∆t goes to 0. X(n∆t)) − g(n∆t. To simplify the presentation we assume that X ∈ R. this might not be accurate enough.e. The previous calculations have shown that for a scheme of order m. Since ε0 = 0.19) which holds for every norm by definition.20) εn ≤ C∆t m+1 k=0 (1 + C∆t)k ≤ C∆tm+1 n.5 Program in Matlab the explicit and implicit Euler schemes for the harmonic oscillator (1. Let us denote the error between the exact solution and the discretized solution by εn = |X(n∆t) − Xn |. The last line uses the stability of the scheme. Xn ) − g∆t (n∆t. For problems that need to be solved over long intervals of time. since n ≤ N = T ∆t−1 This shows that the explicit Euler scheme is of order m = 1: for all intermediate time steps n∆t. we easily deduce from the above relation (for instance by induction) that n (1. Higher-Order schemes.5). the discretized solution converges to the exact one. Xn )| + |g(n∆t. The fist line is the definition of the operators. The third line uses the well-posedness of the ODE and the consistency of the scheme.Once we have well-posedness of the exact equation as well as stability and consistency of the scheme (for a scheme of order m). Recall that C is here the notation for a constant that may change every time the symbol is used. we have εn+1 ≤ C∆tm+1 + (1 + C∆t)εn . We have seen that the Euler scheme is first order accurate. (1. X(n∆t)) − g∆ (n∆t. Xn )| ≤ |g(n∆t. Exercise 1. The second line uses the triangle inequality |X + Y | ≤ |X| + |Y |. Find numerically the order of convergence of both schemes. 6 . we consider a scalar equation. i. Here is one way of constructing a second-order scheme. the error between X(n∆t) and Xn is bounded by a constant time ∆t. We deduce from the above relation that εn ≤ C∆tm . we obtain convergence as follows. The obvious solution is to create a more accurate discretization. Obviously. We have |X((n + 1)∆t) − Xn+1 | = |g(n∆t. since (1 + C∆t)k ≤ C uniformly for 0 ≤ k ≤ N . 
Xn )| ≤ (1 + C∆t)|X(n∆t) − Xn | + C∆tm+1 (1 + |Xn |) ≤ (1 + C∆t)|X(n∆t) − Xn | + C∆tm+1 .

We first realize that m in (1.17) has to be 2 instead of 1. We also realize that the local approximation to the exact solution must be compatible to the right order with the Taylor expansion (1.18), which we push one step further to get ∆t2 ¨ ∆t3 ... ˙ X((n + 1)∆t) = X(n∆t) + ∆tX(n∆t) + X(n∆t) + X (n∆t + s) 2 6 (1.21)

for some 0 ≤ s ≤ ∆t. Now from the derivation of the first-order scheme, we know that what we need is an approximation such that X((n + 1)∆t) = g∆t (n∆t, X) + O(∆t3 ). An obvious choice is then to choose ˙ g∆t (n∆t, X) = X(n∆t) + ∆tX(n∆t) +
(2) (2)

∆t2 ¨ X(n∆t). 2

˙ ¨ This is however not explicit since X and X are not known yet (only the initial condition (2) X = X(n∆t) is known when we want to construct g∆t (n∆t, X)). However, as previously, we can use the equation that X(t) satisfies and get that ˙ X(n∆t) = f (n∆t, X) ∂f ∂f ¨ (n∆t, X) + (n∆t, X)f (n∆t, X), X(n∆t) = ∂t ∂x by applying the chain rule to (1.1) at t = n∆t. Assuming that we know how to differentiate the function f , we can then obtain the second-order scheme by defining g∆t (n∆t, X) = X + ∆tf (n∆t, X) ∆t2 ∂f ∂f + (n∆t, X) + (n∆t, X)f (n∆t, X) . 2 ∂t ∂x
(2)

(1.22)

Clearly by construction, this scheme is consistent and second-order accurate. The regu... larity that we now need from the exact equation (1.1) is that |X (n∆t+s)| ≤ C(1+|X|). We shall again assume that our equation is nice enough so that the above constraint holds (this can be shown when the function f (t, x) is of class C 2 for instance). Exercise 1.6 It remains to show that our scheme g∆t (n∆t, X) is stable. This is left as (2) an exercise. Hint: show that |g∆t (n∆t, X)| ≤ (1 + C∆t))|X| + C∆t for some constant C independent of 1 ≤ n ≤ N and X. Exercise 1.7 Work out a second-order scheme for the harmonic oscillator (1.5) and show that all the above constraints are satisfied. Exercise 1.8 Implement in Matlab the second-order scheme derived in the previous exercise. Show the order of convergence numerically and compare the solutions with the first-order scheme. Comment.
(2)

7

Runge-Kutta methods. The main drawback of the previous second-order scheme is that it requires to know derivatives of f . This is undesirable is practice. However, the derivatives of f can also be approximated by finite differences of the form (1.7). This is the basis for the Runge-Kutta methods. Exercise 1.9 Show that the following schemes are second-order: ∆t ∆t (2) ,X + f (n∆t, X) g∆t (n∆t, X) = X + ∆tf n∆t + 2 2 ∆t (2) g∆t (n∆t, X) = X + f (n∆t, X) + f (n + 1)∆t, X + ∆tf (n∆t, X) 2 2 2∆t ∆t (2) f (n∆t, X) + 3f (n + )∆t, X + f (n∆t, X) g∆t (n∆t, X) = X + 4 3 3

.

These schemes are called Midpoint method, Modified Euler method, and Heun’s Method, respectively. Exercise 1.10 Implement in Matlab the Midpoint method of the preceding exercise to solve the harmonic oscillator problem (1.5). Compare with previous discretizations. All the methods we have seen so far are one-step methods, in the sense that Xn+1 only depends on Xn . Multi-step methods are generalizations where Xn+1 depends on Xn−k for k = 0, · · · , m with m ∈ N. Classical multi-step methods are the Adams-Bashforth and Adams-Moulton methods. The second- and fourth-order Adams Bashforth methods are given by Xn+1 = Xn + Xn+1 ∆t (3f (n∆t, Xn ) − f ((n − 1)∆t, Xn−1 )) 2 ∆t 55f (n∆t, Xn ) − 59f ((n − 1)∆t, Xn−1 ) = Xn + 24 +37f ((n − 2)∆t, Xn−2 ) − 9f ((n − 2)∆t, Xn−2 ) ,

(1.23)

respectively. Exercise 1.11 Implement in Matlab the Adams-Bashforth methods in (1.23) to solve the harmonic oscillator problem (1.5). Compare with previous discretizations.

8

2
2.1

Finite Differences for Parabolic Equations
One dimensional Heat equation
∂2u ∂u (t, x) − 2 (t, x) = 0, ∂t ∂x u(0, x) = u0 (x), x ∈ R.

Equation. Let us consider the simplest example of a parabolic equation x ∈ R, t ∈ (0, T ) (2.1)

This is the one-dimensional (in space) heat equation. Note that the equation is also linear, in the sense that the map u0 (x) → u(x, t) at a given time t > 0 is linear. In the rest of the course, we will mostly be concerned with linear equations. This equation actually admits an exact solution, given by 1 u(t, x) = √ 4πt

exp −
−∞

|x − y|2 u0 (y)dy. 4t

(2.2)

Exercise 2.1 Show that (2.2) solves (2.1). Discretization. We now want to discretize (2.1). This time, we have two variables to discretize, time and space. The finite difference method consists of (i) introducing discretization points Xn = n∆x for n ∈ Z and ∆x > 0, and Tn = n∆t for 0 ≤ n ≤ N = T /∆t and Dt > 0; (ii) approximating the solution u(t, x) by Ujn ≈ u(Tn , Xj ), 1 ≤ n ≤ N, j ∈ Z. (2.3)

Notice that in practice, j is finite. However, to simplify the presentation, we assume here that we discretize the whole line x ∈ R. The time variable is very similar to what we had for ODEs: knowing what happens at Tn = n∆t, we would like to get what happens at Tn+1 = (n + 1)∆t. We therefore introduce the notation Ujn+1 − Ujn . (2.4) ∂t Ujn = ∆t ∂ The operator ∂t plays the same role for the series Ujn as does for the function u(t, x). ∂t The spatial variable is quite different from the time variable. There is no privileged direction of propagation (we do not know a priori if information comes from the left or the right) as there is for the time variable (we know that information at time t will allow us to get information at time t + ∆t). This intuitively explains why we introduce two types of discrete approximations for the spatial derivative: ∂x Ujn =
n Uj+1 − Ujn , ∆x

and

∂ x Ujn =

n Ujn − Uj−1 ∆x

(2.5)

These are the forward and backward finite difference quotients, respectively. Now in (2.1) we have a second-order spatial derivative. Since there is a priori no reason to assume that information comes from the left or the right, we choose
n n Uj+1 − 2Ujn + Uj−1 ∂2u (Tn , Xj ) ≈ ∂ x ∂x Ujn = ∂x ∂ x Ujn = . ∂x2 ∆x2

(2.6)

9

Exercise 2. j∆x) as advertised)?.8) Exercise 2. j∈Z (2.10) j∈Z j∈Z Since u(t.1).9) Stability means that there exists a constant C independent of ∆x. Implement the algorithm in Matlab. Now straightforward calculations show that Uj1 = λ (−1)j+1 + (−1)j−1 + (1 − 2λ)(−1)j ε = (1 − 4λ)(−1)j ε.2)). (2. When λ > 1/2.6) is centered: it is symmetric with respect to what comes from the left and what comes from the right. it has no chance to converge.3 Check this. and t = 10 assuming that the initial condition is u0 (x) = 1 on (−1. j∆x))? Exercise 2.7) ∆t ∆x2 This is clearly an approximation of (2.2 Check the equalities in (2. x) remains bounded for all times (see (2. it is clear that if the scheme is unstable.1. 1) and u0 (x) = 0 elsewhere.01.1) to compute u(t. the scheme will be unstable! To show this we choose some initial conditions that oscillate very rapidly: Uj0 = (−1)j ε. ε > 0.4 Discretizing the interval (−10. Our first concern will be to ensure that the scheme is stable.e. use a numerical approximation of (2. x) for t = 0. t = 1. There are then several questions that come to mind: is the scheme stable (i. Stability. we define U ∞ = sup |Uj |. Here ε is a constant that can be chosen as small as one wants. The finite difference scheme then reads n n Ujn+1 − Ujn Uj+1 − 2Ujn + Uj−1 − = 0. For series U = {Uj }j∈Z defined on the grid j ∈ Z by Uj .6). t = 0. does it converge (is Ujn really a good approximation of u(n∆t. ∆x2 (2. we calculate Ujn+1 for j ∈ Z for 0 ≤ n ≤ N − 1. (2. and how fast does it converge (what is the error between Ujn and u(n∆t. This is also the simplest scheme called the explicit Euler scheme. ∆t and 1 ≤ n ≤ N = T /∆t such that U n ∞ = sup |Ujn | ≤ C U 0 ∞ = C sup |Uj0 |. 10 . 10) with N points. λ= ∆t . We easily verify that Uj0 ∞ = ε. The norm we choose to measure stability is the supremum norm. Notice that this can be recast as n n Ujn+1 = λ(Uj+1 + Uj−1 ) + (1 − 2λ)Ujn . does Ujn remains bounded for n = N independently of the choices of ∆t and ∆x)?. 
The numerical procedure is therefore straightforward: knowing Ujn for j ∈ Z. We actually have to consider two cases according as λ ≤ 1/2 or λ > 1/2. Notice that the discrete differentiation in (2.

Recall that −∞ e−x dx = π. This is done by assuming that we have a local consistency of the discrete scheme with the equation. This behavior has to be contrasted with the case λ ≤ 1/2. what this means is that any small fluctuation in the initial data (for instance caused by computer round-off) will be amplified by the scheme exponentially fast as in (2. For instance. we deduce from (2. We therefore use the three same main ingredients as for ODEs: stability. we have to show that the scheme converges. Notice that ∆t must be quite small (of order ∆x2 ) in order to obtain stability. whence their practical interest. So obviously.By induction this implies that Ujn = (1 − 4λ)n (−1)j ε. 11 . √ 2 ∞ Exercise 2. for n = N = T /∆t.1) also satisfies this property. whereas implicit schemes will be unconditionally stable. we have that U N ∞ .13) ∆t ≤ ∆x2 . and Lewy who formulated it first). this reads 1 (2. There.11) and will eventually be bigger than any meaningful information. tends to infinity as ∆t → 0. whence Un ∞ (2.5 Check it. This property is referred to as a maximum principle: the discrete solution at later times (n ≥ 1) is bounded by the maximum of the initial solution. Notice that the continuous solution of (2. We have seen that λ ≤ 1/2 is important to ensure stability. This implies that ρ = −(1 − 4λ) > 1. |Uj+1 |}. Although we are not going to use them further here. Since both λ and (1 − 2λ) are positive.11) = ρn ε → ∞ as n → ∞. Maximum principles are very important in the analysis of partial differential equations. and that the solution of the equation is sufficiently regular. stability is obvious. |Ujn |.12) Exercise 2. regularity of the exact solution.6 Check this using (2. Now assume that λ > 1/2. We shall see that explicit schemes are always conditionally stable for parabolic equations. which was supposed to be an approximation of sup |u(T. Convergence.8) that n n |Ujn+1 | ≤ sup{|Uj−1 |. consistency.10) holds. So (2. 
Friedrich.10) is x∈R violated: there is no constant C independent of ∆t such that (2. Now that we have stability for the Euler explicit scheme when λ ≤ 1/2. Obviously (2. (2. In terms of ∆t and ∆x. x)|. let us still mention that it is important in general to enforce that the numerical schemes satisfy as many properties of the continuous equation as possible.10) is clearly satisfied with C = 1. In practice. U n+1 ∞ ≤ Un ∞ ≤ U0 ∞.2). 2 This is the simplest example of the famous CFL condition (after Courant.

To analyze convergence, we have to show that the real solution u^{n+1}_j is well approximated by U^{n+1}_j. Let us introduce the error z^n = {z^n_j}_{j∈Z} defined by

    z^n_j = U^n_j − u^n_j,  where u^n_j = u(n∆t, j∆x).

Convergence means that ||z^n||_∞ converges to 0 as ∆t → 0 and ∆x → 0 (with the constraint that λ ≤ 1/2) for all 1 ≤ n ≤ N = T/∆t. The order of convergence is how fast it converges. To simplify, we assume that λ ≤ 1/2 is fixed and send ∆t → 0 (in which case ∆x = (λ^{-1}∆t)^{1/2} also converges to 0).

The consistency of the scheme consists of showing that the real dynamics are well approximated, i.e., that, assuming the solution at time T_n = n∆t is given by u^n_j on the grid, the scheme locally reproduces it. More precisely, assuming that U^n_j = u^n_j, we calculate the truncation (or local discretization) error

    τ^n_j = ∂t u^n_j − ∂x ∂̄x u^n_j.   (2.14)

Notice this important fact: the expression τ^n_j only involves the exact solution u(t, x); it is independent of the discretization scheme we choose. We claim that

    ||τ^n||_∞ ≤ C∆x² ∼ C∆t,   (2.15)

where C is a constant independent of ∆x (and of ∆t since λ is fixed). Our main ingredient is the Taylor expansion of Chapter 1. Let us assume that the solution is sufficiently regular (what we need is that it is of class C^{2,4}, i.e., continuously differentiable 2 times in time and 4 times in space). Using Taylor expansions, we find that

    (u(t + ∆t) − u(t))/∆t = ∂u/∂t (t) + (∆t/2) ∂²u/∂t² (t + s),
    (u(x + ∆x) − 2u(x) + u(x − ∆x))/∆x² = ∂²u/∂x² (x) + (∆x²/12) ∂⁴u/∂x⁴ (x + h),

for some 0 ≤ s ≤ ∆t and −∆x ≤ h ≤ ∆x. Since (2.1) is satisfied by u(t, x), we deduce that

    τ^n_j = (∆t/2) ∂²u/∂t² (n∆t + s) − (∆x²/12) ∂⁴u/∂x⁴ (j∆x + h)
          = ∆x² [ (λ/2) ∂²u/∂t² (n∆t + s) − (1/12) ∂⁴u/∂x⁴ (j∆x + h) ].   (2.16)

The truncation error is therefore of order m = 2, since it is proportional to ∆x²; this accuracy holds if we can bound the term in the square brackets independently of the discretization parameters ∆t and ∆x. All we need is that the exact solution is sufficiently regular.

Exercise 2.7 Check this.

Being of order 2 in space corresponds to a scheme of order 1 in time, since ∆t = λ∆x² ≤ ∆x²/2; comparing with the section on ODEs, ∆t² = ∆t^{m+1} for m = 1. Since U^n satisfies the discretized equation and u is the exact solution, we deduce that

    ∂t z^n_j − ∂x ∂̄x z^n_j = −τ^n_j.   (2.17)

The above equation allows us to analyze local consistency.

Let us now conclude the analysis of the error estimate. Equation (2.17) can be recast as

    z^{n+1}_j = λ(z^n_{j+1} + z^n_{j−1}) + (1 − 2λ) z^n_j − ∆t τ^n_j.

More specifically, let us first assume that the exact solution is known on the spatial grid at time step n, i.e., that z^n = 0. Then (2.17) implies that z^{n+1}_j = −∆t τ^n_j, so that ||z^{n+1}||_∞ ≤ C∆t² ∼ C∆t∆x² thanks to (2.15). Using the stability estimate (2.12) for the homogeneous problem (because λ ≤ 1/2!) and the bound (2.15), we deduce in the general case that

    ||z^{n+1}||_∞ ≤ ||z^n||_∞ + C∆t∆x².

This in turn implies that

    ||z^n||_∞ ≤ nC∆t∆x² ≤ CT∆x²,   (2.18)

since n∆t ≤ T. We summarize what we have obtained as

Theorem 2.1 Let us assume that u(t, x) is of class C^{2,4} and that λ ≤ 1/2. Then we have that

    sup_{j∈Z, 0≤n≤N} |u(n∆t, j∆x) − U^n_j| ≤ C∆x²,   (2.19)

where C is a constant independent of ∆x and λ ≤ 1/2.

Notice that (2.19) directly bounds the error, since the latter is by definition the difference between the exact solution and the approximate solution on the spatial grid. The explicit Euler scheme is of order 2 in space; it therefore does not converge very fast.

Exercise 2.8 Implement the explicit Euler scheme in Matlab by discretizing the parabolic equation on x ∈ (−10, 10) and assuming that u(t, −10) = u(t, 10) = 0. You may choose as an initial condition u0(x) = 1 on (−1, 1) and u0(x) = 0 elsewhere. Choose the discretization parameters ∆t and ∆x such that λ = .48, λ = .5, and λ = .52. Compare the stable solutions you obtain with the solution given by (2.2).

Exercise 2.9 Following the techniques described in Chapter 1, derive a scheme of order 4 in space (still with ∆t = λ∆x² and λ sufficiently small that the scheme is stable). Implement this new scheme in Matlab and compare it with the explicit Euler scheme on the numerical simulation described in Exercise 2.8.
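The order-2 convergence of Theorem 2.1 can be observed directly. A minimal Python sketch (Python rather than the Matlab of Exercise 2.8; the test problem u_t = u_xx on (0, π) with exact solution e^{−t} sin x, and the values λ = 0.4, T = 0.1, are my illustrative choices):

```python
import numpy as np

def sup_error(n, lam=0.4, T=0.1):
    # explicit Euler for u_t = u_xx on (0, pi), u = 0 at both ends;
    # u0 = sin(x) has the exact solution exp(-t) sin(x)
    x = np.linspace(0.0, np.pi, n + 1)
    dx = x[1] - x[0]
    dt = lam * dx**2
    steps = int(round(T / dt))
    U = np.sin(x)
    for _ in range(steps):
        V = U.copy()
        V[1:-1] = lam * (U[2:] + U[:-2]) + (1 - 2 * lam) * U[1:-1]
        U = V
    return np.max(np.abs(U - np.exp(-steps * dt) * np.sin(x)))

e_coarse, e_fine = sup_error(20), sup_error(40)
rate = np.log2(e_coarse / e_fine)   # should be close to 2
```

Halving ∆x divides the sup-norm error by about 4, consistent with the C∆x² bound of (2.19). (Avoid λ = 1/6, for which the leading truncation terms cancel and the scheme becomes fourth order.)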

2.2 Fourier transforms

We now consider a fundamental tool in the analysis of partial differential equations with homogeneous coefficients and their discretization by finite differences: Fourier transforms. There are many ways of defining the Fourier transform. We shall use the following definition:

    [F_{x→ξ} f](ξ) ≡ f̂(ξ) = ∫_{−∞}^{∞} e^{−ixξ} f(x) dx.   (2.20)

The inverse Fourier transform is defined by

    [F^{−1}_{ξ→x} f̂](x) ≡ f(x) = (1/2π) ∫_{−∞}^{∞} e^{ixξ} f̂(ξ) dξ.   (2.21)

In the above definition, the operator F is applied to the function f(x), which it maps to the function f̂(ξ). We call the space of x's the physical domain and the space of ξ's the wavenumber domain, or Fourier domain. Wavenumbers are to positions what frequencies are to times. If you are more familiar with Fourier transforms of signals in time, you may want to think of "frequencies" each time you see "wavenumbers". The notation can become quite useful when there are several variables and one wants to emphasize with respect to which variable the Fourier transform is taken.

A crucial property, which explains the terminology of "forward" and "inverse" Fourier transforms, is that

    F^{−1}_{ξ→x} F_{x→ξ} f = f,   (2.22)

which can be recast as

    f(x) = (1/2π) ∫ e^{ixξ} ∫ e^{−iyξ} f(y) dy dξ = ∫ [ (1/2π) ∫ e^{i(x−y)ξ} dξ ] f(y) dy.

This implies that

    (1/2π) ∫_{−∞}^{∞} e^{izξ} dξ = δ(z),   (2.23)

the Dirac delta function, since f(x) = ∫ δ(x − y) f(y) dy by definition. Another way of understanding this is to realize that [Fδ](ξ) = 1 and [F^{−1} 1](x) = δ(x). The first equality can also trivially be deduced from (2.20).

Another interesting calculation for us is that

    [F exp(−(α/2) x²)](ξ) = (2π/α)^{1/2} exp(−ξ²/(2α)).   (2.24)

In other words, the Fourier transform of a Gaussian is also a Gaussian. Notice this important fact: a very narrow Gaussian (α very large) is transformed into a very wide one (α^{−1} very small) and vice-versa. This is related to Heisenberg's uncertainty principle in quantum mechanics, and it says that you cannot be localized in the Fourier domain when you are localized in the spatial domain and vice-versa.
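The Gaussian identity (2.24) can be checked by direct quadrature of the defining integral (2.20). A quick Python check (the values α = 3 and ξ = 1.7, and the truncation of the integral to (−30, 30), are arbitrary choices for the illustration):

```python
import numpy as np

# check of (2.24) by direct quadrature of the defining integral (2.20)
alpha = 3.0
x = np.linspace(-30.0, 30.0, 200001)
dx = x[1] - x[0]
f = np.exp(-alpha * x**2 / 2)

xi = 1.7
numeric = np.sum(np.exp(-1j * x * xi) * f) * dx  # ~ int e^{-i x xi} f(x) dx
exact = np.sqrt(2 * np.pi / alpha) * np.exp(-xi**2 / (2 * alpha))
err = abs(numeric - exact)
```

Because the integrand is smooth and decays extremely fast, the simple Riemann sum is accurate to near machine precision here.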

The simplest property of the Fourier transform is its linearity:

    [F(αf + βg)](ξ) = α[Ff](ξ) + β[Fg](ξ).   (2.25)

Other very important properties of the Fourier transform are as follows:

    [Ff(x + y)](ξ) = e^{iyξ} f̂(ξ),    [Ff(νx)](ξ) = (1/|ν|) f̂(ξ/ν),   (2.26)

for all y ∈ R and ν ∈ R\{0}. The first equality shows how the Fourier transform acts on translations by a factor y; the second, how it acts on dilations.

Exercise 2.10 Check (2.24) using integration in the complex plane (which essentially says that ∫_{−∞}^{∞} e^{−β(x−iρ)²} dx = ∫_{−∞}^{∞} e^{−βx²} dx for ρ ∈ R by contour change, since z → e^{−βz²} has no pole in C).

Exercise 2.11 Check these formulas.

The former property is extremely important in the analysis of PDEs and their finite difference discretization: Fourier transforms replace translations in the physical domain by multiplications in the Fourier domain. More precisely, translation by y is replaced by multiplication by e^{iyξ}. The reason this is important is that you can't beat multiplications in simplicity. Indeed, finite differences are based on translations, so we see how Fourier transforms will be useful in their analysis.

Moreover, Fourier transforms replace differentiation in the physical domain by multiplication in the Fourier domain. To further convince you of the importance of the above assertion, consider derivatives of functions. It is useful to see differentiation as a limit of differences of translations. By linearity of the Fourier transform, we get that

    [F ((f(x + h) − f(x))/h)](ξ) = ([Ff(x + h)](ξ) − [Ff(x)](ξ))/h = ((e^{ihξ} − 1)/h) f̂(ξ).

Now let us just pass to the limit h → 0 on both sides. We get

    [Ff'](ξ) = iξ f̂(ξ).   (2.27)

This assertion can also easily be verified directly from the definition (2.20). In other words, d/dx is replaced by multiplication by iξ.

One last crucial and quite related property of the Fourier transform is that it replaces convolutions by multiplications. The convolution of two functions is given by

    (f ∗ g)(x) = ∫_{−∞}^{∞} f(x − y) g(y) dy = ∫_{−∞}^{∞} f(y) g(x − y) dy.   (2.28)

The Fourier transform is then given by

    [F(f ∗ g)](ξ) = f̂(ξ) ĝ(ξ).   (2.29)
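The translation property in (2.26) has an exact discrete analogue for the DFT, which is what we will actually use when analyzing schemes numerically. A small Python check (grid size and shift are arbitrary; `np.fft.fftfreq` supplies the DFT wavenumbers):

```python
import numpy as np

n, L = 256, 20.0
d = L / n
x = np.arange(n) * d - L / 2
f = np.exp(-x**2)                        # smooth and well localized
s = 3                                    # shift by s grid points, y = s*d
f_shift = np.roll(f, -s)                 # samples of f(x + y) on the grid

xi = 2 * np.pi * np.fft.fftfreq(n, d=d)  # wavenumbers of the DFT
lhs = np.fft.fft(f_shift)
rhs = np.exp(1j * xi * s * d) * np.fft.fft(f)
err = np.max(np.abs(lhs - rhs))
```

For grid-aligned shifts the identity is exact up to roundoff: shifting in the physical domain is exactly multiplication by e^{iyξ} in the Fourier domain.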

Exercise 2.12 Check this.

This important relation can be used to prove the equally important (everything is important in this section!) Parseval equality

    ∫_{−∞}^{∞} |f̂(ξ)|² dξ = 2π ∫_{−∞}^{∞} |f(x)|² dx,   (2.30)

for all complex-valued functions f(x) such that the above integrals make sense. We show this as follows. We first recall that F[f̄](ξ) = conj(f̂(−ξ)), where f̄ denotes the complex conjugate of f. We also define g(x) = f̄(−x), so that ĝ(ξ) = conj(f̂(ξ)), using (2.26) with ν = −1. We then have

    ∫ |f̂(ξ)|² dξ = ∫ f̂(ξ) conj(f̂(ξ)) dξ = ∫ f̂(ξ) ĝ(ξ) dξ = ∫ [F(f ∗ g)](ξ) dξ

thanks to (2.29),

    = 2π (f ∗ g)(0)

by (2.21) with x = 0, and, by definition of the convolution,

    = 2π ∫ f(y) g(−y) dy = 2π ∫ f(y) f̄(y) dy = 2π ∫ |f(y)|² dy.

The Parseval equality (2.30) essentially says that there is as much energy (up to a factor 2π) in the physical domain as in the Fourier domain.

Application to the heat equation. Before considering finite differences, here is an important exercise that shows how powerful Fourier transforms are to solve PDEs with constant coefficients.

Exercise 2.13 Consider the heat equation (2.1). To simplify, we still assume that we want to solve it on the whole line x ∈ R. Let u0(x) be the initial condition for the heat equation. Using the Fourier transform, show that

    ∂û/∂t (t, ξ) = −ξ² û(t, ξ).

Here we mean û(t, ξ) = [F_{x→ξ} u(t, x)](t, ξ). Solve this ODE and, taking inverse Fourier transforms with the help of (2.24), recover (2.2).

Application to Finite Differences. We now use the Fourier transform to analyze finite difference discretizations. If we use the notation U^n(j∆x) = U^n_j ≈ u(n∆t, j∆x), the explicit Euler scheme (2.7) can be recast as

    (U^{n+1}(x) − U^n(x))/∆t = (U^n(x + ∆x) + U^n(x − ∆x) − 2U^n(x))/∆x²,   (2.31)
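Exercise 2.13 can also be realized numerically: in the Fourier domain the heat equation is just the multiplier e^{−tξ²}. A Python sketch on a wide periodic grid (the Gaussian initial condition is chosen because its exact heat evolution is known in closed form; grid parameters are arbitrary):

```python
import numpy as np

n, L, t = 1024, 60.0, 1.0
x = np.arange(n) * (L / n) - L / 2
xi = 2 * np.pi * np.fft.fftfreq(n, d=L / n)

u0 = np.exp(-x**2 / 2)
# u^hat(t, xi) = exp(-t xi^2) u0^hat(xi), realized with the FFT
u = np.real(np.fft.ifft(np.exp(-t * xi**2) * np.fft.fft(u0)))

# exact evolution of a Gaussian under u_t = u_xx: variance grows by 2t
exact = np.exp(-x**2 / (2 * (1 + 2 * t))) / np.sqrt(1 + 2 * t)
err = np.max(np.abs(u - exact))
```

Because the solution is smooth and well localized inside the box, the periodic FFT realization agrees with the whole-line solution to near machine precision.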

where x = j∆x. As before, this scheme says that the approximation U at time T_{n+1} = (n + 1)∆t and position X_j = j∆x depends on the values of U at time T_n and positions X_{j−1}, X_j, and X_{j+1}. It is therefore straightforward to realize that U^n(X_j) depends on u0 at all the points X_k for j − n ≤ k ≤ j + n. It is however not easy at all to get this dependence explicitly from (2.7). The key to analyzing this dependence is to go to the Fourier domain, which is much simpler to deal with. Here is how we do it.

We first define U^0(x) for all points x ∈ R, not only those points x = X_j = j∆x. One way to do so is to assume that

    u0(x) = U^0_j  on ((j − 1/2)∆x, (j + 1/2)∆x).   (2.32)

We then define U^n(x) for all n ≥ 0 and all x ∈ R by induction, using (2.31) initialized with U^0(x) = u0(x):

    U^{n+1}(x) = (1 − 2λ)U^n(x) + λ(U^n(x + ∆x) + U^n(x − ∆x)).   (2.33)

If we choose as initial condition (2.32), we then easily observe that

    U^n(x) = U^n_j  on ((j − 1/2)∆x, (j + 1/2)∆x),   (2.34)

where U^n_j is given by (2.7). So analyzing U^n(x) is in some sense sufficient to get information on U^n_j.

Once U^n(x) is defined for all n, we can certainly introduce the Fourier transform Û^n(ξ) defined according to (2.20). In the Fourier domain, translations are replaced by multiplications. Using the translation property of the Fourier transform (2.26), we obtain that

    Û^{n+1}(ξ) = [1 + λ(e^{iξ∆x} + e^{−iξ∆x} − 2)] Û^n(ξ).   (2.35)

There, marching in time (going from step n to step n + 1) is much simpler than in the physical domain: each wavenumber ξ is multiplied by a coefficient

    R_λ(ξ) = 1 + λ(e^{iξ∆x} + e^{−iξ∆x} − 2) = 1 − 2λ(1 − cos(ξ∆x)).   (2.36)

We then have that

    Û^n(ξ) = (R_λ(ξ))^n û0(ξ).   (2.37)

The question of the stability of a discretization now becomes the following: are all wavenumbers ξ stable, or equivalently, is (R_λ(ξ))^n bounded for all 1 ≤ n ≤ N = T/∆t? It is easy to observe that when λ ≤ 1/2, |R_λ(ξ)| ≤ 1 for all ξ. We then clearly have that all wavenumbers are stable, since |Û^n(ξ)| ≤ |û0(ξ)|. However, when λ > 1/2, we can choose values of ξ such that cos(ξ∆x) is close to −1. For such wavenumbers, the value of R_λ(ξ) is less than −1, so that Û^n changes sign at every new time step n with growing amplitude (see numerical simulations). Such wavenumbers are unstable.
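The dichotomy in (2.36)-(2.37) is a one-line computation: the extreme value of R_λ over all wavenumbers is 1 − 4λ, attained when cos(ξ∆x) = −1. A small Python check (∆x and the two λ values are arbitrary illustrative choices):

```python
import numpy as np

def worst_amplification(lam, dx=0.1):
    # R_lam(xi) = 1 - 2*lam*(1 - cos(xi*dx)), sampled over one period
    xi = np.linspace(-np.pi / dx, np.pi / dx, 4001)
    return np.max(np.abs(1 - 2 * lam * (1 - np.cos(xi * dx))))

g_stable = worst_amplification(0.5)    # extreme value 1 - 4*lam = -1
g_unstable = worst_amplification(0.6)  # extreme value 1 - 4*lam = -1.4
```

So |R_λ(ξ)| ≤ 1 for all ξ exactly when λ ≤ 1/2; beyond that, the sawtooth mode ξ∆x = π is amplified by |1 − 4λ| > 1 at every step.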

2.3 Stability and convergence using Fourier transforms

In this section, we apply the theory of Fourier transforms to the analysis of the finite difference discretization of parabolic equations with constant coefficients. As we already mentioned, an important property of Fourier transforms is that

    ∂/∂x → iξ,

that is, differentiation is replaced by multiplication by iξ in the Fourier domain. The above definition gives us a mathematical framework to build on this idea. We use it to define equations in the physical domain as follows:

    ∂u/∂t (t, x) + P(D) u(t, x) = 0,  t > 0, x ∈ R,
    u(0, x) = u0(x),  x ∈ R,   (2.38)

where P(iξ) is a function of iξ. Here D stands for ∂/∂x. The operator P(D) is a linear operator defined as

    P(D) f(x) = F^{−1}_{ξ→x} [ P(iξ) [F_{x→ξ} f](ξ) ](x).   (2.39)

What the operator P(D) does is: (i) take the Fourier transform of f(x); (ii) multiply f̂(ξ) by the function P(iξ); (iii) take the inverse Fourier transform of the product. The function P(iξ) is called the symbol of the operator P(D). Such operators are called pseudo-differential operators. Although this definition may look a little complicated at first, we need such a complexity in practice.

Why do we introduce all this? The reason, as was mentioned in the previous section, is that the analysis of differential operators simplifies in the Fourier domain. Upon taking the Fourier transform F_{x→ξ} term by term in (2.38), we get that

    ∂û/∂t (t, ξ) + P(iξ) û(t, ξ) = 0,  t > 0, ξ ∈ R,
    û(0, ξ) = û0(ξ),  ξ ∈ R.   (2.40)

Notice the parallel between (2.38) and (2.40): by going to the Fourier domain, we have replaced the "complicated" pseudo-differential operator P(D) by a mere multiplication by P(iξ). For every ξ ∈ R, (2.40) consists of a simple linear first-order ordinary differential equation. Its solution is obviously given by

    û(t, ξ) = e^{−tP(iξ)} û0(ξ).   (2.41)

The solution of (2.38) is then given by

    u(t, x) = [F^{−1}_{ξ→x}(e^{−tP(iξ)})] ∗ u0(x).   (2.42)

Exercise 2.14 (i) Show that the heat equation (2.1) takes the form (2.38) with symbol P(iξ) = ξ². (ii) Find the equation of the form (2.38) with symbol P(iξ) = ξ⁴. (iii) Show that any arbitrary partial differential equation of order m ≥ 1 in x with constant coefficients can be written in the form (2.38) with the symbol P(iξ) a polynomial of order m.
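The three-step recipe (i)-(iii) in (2.39) is exactly how one applies P(D) in practice with the FFT. A Python sketch for the heat symbol P(iξ) = ξ², checked against a hand-computed second derivative of a Gaussian (grid parameters are arbitrary choices):

```python
import numpy as np

n, L = 1024, 40.0
x = np.arange(n) * (L / n) - L / 2
xi = 2 * np.pi * np.fft.fftfreq(n, d=L / n)

f = np.exp(-x**2)
# steps (i)-(iii): FFT, multiply by the symbol xi^2, inverse FFT
Pf = np.real(np.fft.ifft(xi**2 * np.fft.fft(f)))

minus_f_xx = -(4 * x**2 - 2) * np.exp(-x**2)   # -f'' computed by hand
err = np.max(np.abs(Pf - minus_f_xx))
```

For P(iξ) = ξ² this reproduces −f'', as it must, to spectral accuracy.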

Parabolic Symbol. In this section, we only consider real-valued symbols such that, for some M > 0 and C > 0,

    P(iξ) ≥ C|ξ|^M,  for all ξ ∈ R.   (2.43)

With a somewhat loose terminology, we refer to equations (2.38) with real-valued symbol P(iξ) satisfying the above constraint as parabolic equations. The heat equation clearly satisfies the above requirement with M = 2. The real-valuedness of the symbol is not necessary, but we shall impose it to simplify.

What about numerical schemes? Our definition so far provides a great tool to analyze partial differential equations. It turns out it is also very well adapted to the analysis of finite difference approximations to these equations. The reason is again that finite difference operators are transformed into multiplications in the Fourier domain.

A general single-step finite difference scheme is given by

    B∆ U^{n+1}(x) = A∆ U^n(x),  n ≥ 0,   (2.44)

with U^0(x) = u0(x) a given initial condition. Here, using the notation of the preceding section, the finite difference operators are defined by

    A∆ U(x) = Σ_{α∈Iα} aα U(x − α∆x),    B∆ U(x) = Σ_{β∈Iβ} bβ U(x − β∆x),   (2.45)

where Iα ⊂ Z and Iβ ⊂ Z are finite sets of integers. Explicit schemes are characterized by the fact that only b0 does not vanish. For instance, the explicit Euler scheme is given by

    U^{n+1}(x) = (1 − 2λ)U^n(x) + λ(U^n(x − ∆x) + U^n(x + ∆x)).

This corresponds to the coefficients b0 = 1, a0 = 1 − 2λ, a_{−1} = a_1 = λ, and all the other coefficients aα and bβ vanish. We shall see that implicit schemes, where other coefficients bβ ≠ 0, are important in practice.

Notice that we are slightly changing the problem of interest. Finite difference schemes are supposed to be defined on the grid ∆xZ (i.e., the points x_m = m∆x for m ∈ Z), not on the whole axis R. Let us notice however that, from the definition (2.44), the values of U^{n+1} on the grid ∆xZ only depend on the values of u0 on the same grid, and that U^{n+1}(j∆x) = U^{n+1}_j. In other words, U^{n+1}(j∆x) does not depend on how we extend the initial condition u0, originally defined on the grid ∆xZ, to the whole line R. This is why we replace U^{n+1}_j by U^{n+1}(x) in the sequel and only consider properties of U^{n+1}(x). In the end, we should see how the properties of U^{n+1}(x) translate into properties of U^{n+1}_j, since the latter is what we actually compute. This is easy to do when U^{n+1}(x) is chosen as a piecewise constant function, and more difficult when higher-order polynomials are used to construct it. It can be done, but requires careful analysis of the error made by interpolating smooth functions by polynomials. We do not consider this problem in these notes.

We now come back to (2.44). This is our definition of a finite difference scheme, and we want to analyze its properties. As we already mentioned, this equation is simpler in the Fourier domain. Let us thus take the Fourier transform of each term in (2.44). What we get is

    Σ_{β∈Iβ} bβ e^{−iβ∆xξ} Û^{n+1}(ξ) = Σ_{α∈Iα} aα e^{−iα∆xξ} Û^n(ξ).   (2.46)

Let us introduce the symbols

    A∆(ξ) = Σ_{α∈Iα} aα e^{−iα∆xξ},    B∆(ξ) = Σ_{β∈Iβ} bβ e^{−iβ∆xξ}.   (2.47)

In the explicit case, B∆(ξ) ≡ 1 and the above relation is obvious. In the implicit case, we assume from now on that B∆(ξ) ≠ 0. We then obtain that

    Û^{n+1}(ξ) = R∆(ξ) Û^n(ξ) = B∆(ξ)^{−1} A∆(ξ) Û^n(ξ).   (2.48)

Exercise 2.15 Verify that for the explicit Euler scheme, we have

    R∆(ξ) = 1 − (2∆t/∆x²)(1 − cos(ξ∆x)).   (2.49)

This implies

    Û^n(ξ) = R∆^n(ξ) û0(ξ).   (2.50)

The above relation completely characterizes the solution Û^n(ξ). To obtain the solution U^n(x), we simply have to take the inverse Fourier transform of Û^n(ξ). Notice that the difference between the exact solution û(n∆t, ξ) at time n∆t and its approximation Û^n(ξ) is given by

    û(n∆t, ξ) − Û^n(ξ) = [e^{−n∆tP(iξ)} − R∆^n(ξ)] û0(ξ).   (2.51)

This is what we want to analyze. To measure convergence, we first need to define a norm, which will tell us how far or how close two functions are from one another. In finite dimensional spaces (such as R^n, which we used for ordinary differential equations), all norms are equivalent (this is a theorem). In infinite dimensional spaces, all norms are not equivalent (this is another theorem; the equivalence of all norms on a space implies that it is finite dimensional). So the choice of the norm matters! Here we only consider the L² norm, defined on R for sufficiently regular complex-valued functions f by

    ||f|| = ( ∫_{−∞}^{∞} |f(x)|² dx )^{1/2}.   (2.52)

The reason why this norm is nice is that we have the Parseval equality

    ||f̂(ξ)|| = √(2π) ||f(x)||.   (2.53)
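The symbols (2.47)-(2.48) are straightforward to build directly from the scheme's coefficients. A Python sketch, checked against the closed form (2.49) for the explicit Euler scheme (the values of λ and ∆x are arbitrary):

```python
import numpy as np

def symbol(coeffs, xi, dx):
    # sum_alpha a_alpha * exp(-i * alpha * dx * xi), as in (2.47)
    return sum(a * np.exp(-1j * alpha * dx * xi) for alpha, a in coeffs.items())

lam, dx = 0.4, 0.1
xi = np.linspace(-np.pi / dx, np.pi / dx, 101)

A = symbol({-1: lam, 0: 1 - 2 * lam, 1: lam}, xi, dx)  # explicit Euler a's
B = symbol({0: 1.0}, xi, dx)                           # b_0 = 1
R = A / B

R_closed = 1 - 2 * lam * (1 - np.cos(xi * dx))         # closed form (2.49)
err = np.max(np.abs(R - R_closed))
```

The same two-line `symbol` helper handles implicit schemes as well: only the dictionary of coefficients changes.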

This means that the L² norms in the physical and the Fourier domains are equal (up to the factor √(2π)). Most norms do not have such a nice interpretation in the Fourier domain. This is one of the reasons why the L² norm is so popular.

Our analysis of the finite difference scheme is therefore concerned with characterizing ||u(n∆t, x) − U^n(x)||. The convergence arises as two parameters, ∆t and ∆x, converge to 0. To simplify the analysis, we assume that the two parameters are related by

    λ = ∆t / ∆x^M,   (2.54)

where M is the order of the parabolic equation (M = 2 for the heat equation), and where λ > 0 is fixed. We just saw that characterizing the error is equivalent to characterizing

    ε̂^n(ξ) = û(n∆t, ξ) − Û^n(ξ) = [e^{−n∆tP(iξ)} − R∆^n(ξ)] û0(ξ).   (2.55)

The control that we will obtain on ε̂^n(ξ) depends on the three classical ingredients: regularity of the solution, stability of the scheme, and consistency of the approximation.

Stability. We impose that there exists a constant C independent of ξ and 1 ≤ n ≤ N = T/∆t such that

    |R∆^n(ξ)| ≤ C,  ξ ∈ R, 1 ≤ n ≤ N.   (2.56)

The stability constraint is clear: no wavenumber ξ can grow out of control.

Regularity. The regularity of the exact equation is first seen in the growth of |e^{−∆tP(iξ)}|. Since our operator is assumed to be parabolic, the exact symbol satisfies

    |e^{−n∆tP(iξ)}| ≤ e^{−Cn∆t|ξ|^M},   (2.57)

for some constant C. This relation is important because we deduce that high wavenumbers (large ξ's) are heavily damped by the operator. Parabolic equations have a very efficient smoothing effect: even if high wavenumbers are present in the initial solution, they quickly disappear as time increases. What makes parabolic equations a little special is that they have better stability properties than just having bounded symbols. Not surprisingly, the finite difference approximation partially retains this effect. We have however to be careful. The reason is that the scheme has a given scale, ∆x, at which it approximates the exact operator. For length scales much larger than ∆x, i.e., for wavenumbers much smaller than ∆x⁻¹, we

(2. For length of order (or smaller than) ∆x however. 2. In the Fourier world. ξ) at time n∆t. . for wavenumbers that are smaller than ∆x−1 . this means that e−∆tP (iξ) − R∆ (ξ) (2. .58) that n |R∆ (ξ)| ≤ en ln(1−δ|∆xξ| M) ≤ e−n(δ/2)|∆xξ| . ξ).61) .expect the scheme to be a good approximation of the exact operator. This is again related to the fact that wavenumbers of order of or greater than ∆x−1 are not captured by the finite difference scheme. (2. ˆ Again. the error e−∆tP (iξ) − R∆ (ξ) is not small for all wavenumbers of the form ξ = kπ∆x−1 for k = 1.60) has to converge to 0 as ∆x and ∆t converge to 0. .2 We have that n |R∆ (ξ)| ≤ e−Cn|∆xξ| M for |∆xξ| ≤ γ. and ∆t.17 Show that in the case of the heat equation and the explicit Euler scheme.59) Proof. Let us now turn our attention to the consistency and accuracy of the finite difference scheme. we cannot expect the scheme to be accurate: we do not have a sufficient resolution. where C is a constant independent of ξ. we expect the scheme to retain some properties of the exact equation. all we can expect is that the error is small when ξ∆x → 0 and ∆t → 0.. We deduce from (2. Indeed. For this reason. Remember that small scales in the spatial worlds mean high wavenumbers in the Fourier world (functions that vary on small scales oscillate fast. Here again. we have to be a bit careful. 22 ∆x|ξ| < γ. hence are composed of large wavenumbers). we impose that the operator R∆ satisfy |R∆ (ξ)| ≤ 1 − δ|∆xξ|M |R∆ (ξ)| ≤ 1 for for |∆xξ| ≤ γ |∆xξ| > γ. (2. We say that the scheme is consistent and accurate of order m if |e−∆tP (iξ) − R∆ (ξ)| ≤ C∆t∆xm (1 + |ξ|M +m ).58) where γ and δ are positive constants independent of ξ and ∆x. Nevertheless. we want to make sure that the finite difference scheme is locally a good approximation of the exact equation. assuming that we are given the exact solution u(n∆t. We then have Lemma 2. ∆x. Quantitatively. We cannot expect the above quantity to be small for all values of ξ. M Consistency. 
Exercise 2. the error made by applying the approximate ˆ equation instead of the exact equation between n∆t and (n + 1)∆t is precisely given by e−∆tP (iξ) − R∆ (ξ) u(n∆t.
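The accuracy bound (2.61) can be probed numerically: for the explicit Euler scheme with P(iξ) = ξ² and m = M = 2, the ratio |e^{−∆tξ²} − R∆(ξ)| / (∆t∆x²(1 + ξ⁴)) should stay bounded as ∆x → 0. A Python sketch (λ = 0.4 and the ∆x sequence are my choices for the illustration):

```python
import numpy as np

def consistency_ratio(dx, lam=0.4):
    # sup over xi of |exp(-dt*xi^2) - R(xi)| / (dt * dx^2 * (1 + xi^4)),
    # the quantity bounded in (2.61) with M = m = 2
    dt = lam * dx**2
    xi = np.linspace(0.0, np.pi / dx, 5001)
    R = 1 - 2 * lam * (1 - np.cos(xi * dx))
    return np.max(np.abs(np.exp(-dt * xi**2) - R) / (dt * dx**2 * (1 + xi**4)))

ratios = [consistency_ratio(dx) for dx in (0.2, 0.1, 0.05)]
```

The computed suprema settle near the Taylor-expansion constant (λ/2 − 1/12) as ∆x shrinks, consistent with Exercise 2.17.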

Proof of convergence. Let us come back to the analysis of ε̂^n(ξ). Using a^n − b^n = (a − b) Σ_{k=0}^{n−1} a^{n−1−k} b^k, we have that

    ε̂^n(ξ) = (e^{−∆tP(iξ)} − R∆(ξ)) Σ_{k=0}^{n−1} e^{−(n−1−k)∆tP(iξ)} R∆^k(ξ) û0(ξ).   (2.62)

Let us first consider the case |ξ| ≤ γ∆x⁻¹. Thanks to the stability property (2.59), the regularity (2.57), and the consistency (2.61), we have that

    |ε̂^n(ξ)| ≤ C∆t∆x^m (1 + |ξ|^{M+m}) n e^{−Cn∆t|ξ|^M} |û0(ξ)|.   (2.63)

Now remark that n∆t|ξ|^M e^{−Cn∆t|ξ|^M} ≤ C. Since n∆t ≤ T, this implies that

    |ε̂^n(ξ)| ≤ C∆x^m (1 + |ξ|^m) |û0(ξ)|.   (2.64)

So far, the above inequality holds for |ξ| ≤ γ∆x⁻¹. However, for |ξ| ≥ γ∆x⁻¹, we deduce from (2.56) that

    |ε̂^n(ξ)| ≤ 2|û0(ξ)| ≤ C∆x^m (1 + |ξ|^m) |û0(ξ)|,

so (2.64) actually holds for every ξ. This implies that

    ||ε̂^n|| = ( ∫_R |ε̂^n(ξ)|² dξ )^{1/2} ≤ C∆x^m ( ∫_R (1 + |ξ|^{2m}) |û0(ξ)|² dξ )^{1/2}.   (2.65)

This shows that ε^n is of order ∆x^m provided that

    ( ∫_R (1 + |ξ|^{2m}) |û0(ξ)|² dξ )^{1/2} < ∞.   (2.66)

The above inequality implies that û0(ξ) decays sufficiently fast as ξ → ∞. This is actually equivalent to imposing that u0(x) is sufficiently regular.

The Hilbert spaces H^m(R). We introduce the sequence of functional spaces H^m = H^m(R) of functions whose derivatives up to order m are square-integrable. For a function v(x), we denote by v^{(n)}(x) its nth derivative. The norm of H^m is defined by

    ||v||_{H^m} = ( Σ_{k=0}^{m} ∫_R |v^{(k)}(x)|² dx )^{1/2}.   (2.67)

Notice that H⁰ = L², the space of square-integrable functions. It turns out that a function v(x) belongs to H^m if and only if its Fourier transform v̂(ξ) satisfies

    ( ∫_R (1 + |ξ|^{2m}) |v̂(ξ)|² dξ )^{1/2} < ∞.   (2.68)

The proof is based on the fact that differentiation in the physical domain is replaced by multiplication by iξ in the Fourier domain.

Exercise 2.18 (difficult) Prove it.

Main result of the section. Summarizing all of the above, we have obtained the following result.

Theorem 2.3 Let us assume that P(iξ) and R∆(ξ) are parabolic, in the sense that (2.43) and (2.58) are satisfied, and that R∆(ξ) is accurate of order m, i.e., that (2.61) holds. Assuming that the initial condition u0 ∈ H^m, we obtain that

    ||u(n∆t, x) − U^n(x)|| ≤ C∆x^m ||u0||_{H^m},  for all 0 ≤ n ≤ N = T/∆t.   (2.69)

We recall that ∆t = λ∆x^M, so that in the above theorem, the scheme is of order m in space and of order m/M in time. You can compare this with Theorem 2.1: there, to obtain a convergence of order 2 in space, we had to assume that the solution u(t, x) is of class C⁴ in space. This is consistent with the above result with M = 2 and m = 2.

Notice that in the above inequality, we can choose m ∈ R, and not only m ∈ N*. The equivalence between (2.67) and (2.68) can be explained as follows:

    v(x) is smooth in the physical domain  ⟺  v̂(ξ) decays fast in the Fourier domain.

This is another very general and very important property of Fourier transforms, and (2.68) is one manifestation of this "rule". In particular, (2.68) can be used as a definition of the space H^m for m ∈ R — possibly fractional, or even negative! The space H^m is still the space of functions with m square-integrable derivatives; only the number of derivatives can now be "fractional".

Exercise 2.19 Calculate the Fourier transform of the function v defined by v(x) = 1 on (−1, 1) and v(x) = 0 elsewhere. Using (2.68), show that the function v belongs to every space H^m with m < 1/2. This means that functions that are piecewise smooth (with discontinuities) have a little less than half of a derivative in L² (you can differentiate such functions almost 1/2 times)!

Exercise 2.20 (difficult) Follow the proof of Theorem 2.3 assuming that (2.43) and (2.58) are replaced by P(iξ) ≥ 0 and |R∆(ξ)| ≤ 1, respectively. Show that (2.69) should then be replaced by

    ||u(n∆t, x) − U^n(x)|| ≤ C∆x^m ||u0||_{H^{m+M}},  for all 0 ≤ n ≤ N = T/∆t.   (2.70)

Loosely speaking, what this means is that we need the initial condition to be more regular if the equation is not regularizing.

Exercise 2.21 (very difficult) Assuming that (2.58) is replaced by |R∆(ξ)| ≤ 1, show that (2.69) should be replaced by

    ||u(n∆t, x) − U^n(x)|| ≤ C∆x^m ||u0||_{H^{m+ε}},  for all 0 ≤ n ≤ N = T/∆t.   (2.71)

Here ε is an arbitrary positive constant; of course, the constant C depends on ε.
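The "almost half a derivative" claim of Exercise 2.19 can be seen numerically: with v̂(ξ) = 2 sin(ξ)/ξ, the H^m integral (2.68) converges for m < 1/2 and diverges for m > 1/2. A rough Python check comparing tail integrals as the cutoff grows (the cutoffs and the sample exponents 0.4 and 0.6 are my choices):

```python
import numpy as np

def tail(m, X, n=400000):
    # int_1^X (1 + |xi|^{2m}) |v^hat(xi)|^2 dxi with v^hat = 2 sin(xi)/xi
    xi = np.linspace(1.0, X, n)
    integrand = (1 + xi ** (2 * m)) * (2 * np.sin(xi) / xi) ** 2
    return np.sum(integrand) * (xi[1] - xi[0])

# m = 0.4 < 1/2: the tail integral converges, so doubling X adds little;
# m = 0.6 > 1/2: it diverges like X^{2m-1} and keeps growing
conv_growth = tail(0.4, 2000.0) - tail(0.4, 1000.0)
div_growth = tail(0.6, 2000.0) - tail(0.6, 1000.0)
```

On average |v̂(ξ)|² ∼ 2/ξ², so the integrand behaves like ξ^{2m−2}, which is integrable exactly when m < 1/2.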

Exercise 2.22 Show that any stable scheme that is convergent of order m is also convergent of order αm for all 0 ≤ α ≤ 1. Deduce that

    ||u(n∆t, x) − U^n(x)|| ≤ C∆x^{αm} ||u0||_{H^{αm}},  for all 0 ≤ n ≤ N = T/∆t.   (2.72)

The interest of this result is the following: when the initial data u0 is not sufficiently regular to be in H^m, but sufficiently regular to be in H^{αm} with 0 < α < 1, we still have convergence of the finite difference scheme as ∆x → 0; however, the rate of convergence is slower.

Exercise 2.23 Show that everything we have said in this section holds if we replace (2.43) and (2.58) by

    Re P(iξ) ≥ C(|ξ|^M − 1)  and  |R∆(ξ)| ≤ 1 − δ|∆xξ|^M + C∆t,  |∆xξ| ≤ γ,

respectively. Here, Re stands for "real part". The above hypotheses are sufficiently general to cover most practical examples of parabolic equations.

Exercise 2.24 Show that schemes accurate of order m > 0, so that (2.61) holds, and verifying |R∆(ξ)| ≤ 1 are stable in the sense that

    |R∆(ξ)| ≤ (1 + C∆x^m ∆t) − δ|∆xξ|^M,  |∆xξ| ≤ γ.

The last exercises show that schemes that are stable in the "classical" sense, i.e., that verify |R∆(ξ)| < 1, and are consistent with the smoothing parabolic symbol, are themselves smoothing.

2.4 Application to the θ schemes

We are now ready to analyze the family of schemes for the heat equation (i.e., P(iξ) = ξ²) called the θ schemes. Let us assume that 0 ≤ θ ≤ 1. The θ scheme is defined by

    ∂t U^n(x) = θ ∂x ∂̄x U^{n+1}(x) + (1 − θ) ∂x ∂̄x U^n(x).   (2.73)

Recall that the finite difference operators are defined by (2.4) and (2.5). When θ = 0, we recover the explicit Euler scheme. For θ = 1, the scheme is called the fully implicit Euler scheme. For θ = 1/2, the scheme is called the Crank-Nicolson scheme. We recall that ∆t = λ∆x².

Exercise 2.25 Show that the symbol of the θ scheme is given by

    R∆(ξ) = [1 − 2(1 − θ)λ(1 − cos(∆xξ))] / [1 + 2θλ(1 − cos(∆xξ))].   (2.74)

Let us first consider the stability of the θ method. We observe that R∆(ξ) ≤ 1. The stability requirement is therefore that

    min_ξ R∆(ξ) ≥ −1.

Exercise 2.26 Show that the above inequality holds if and only if

    (1 − 2θ)λ ≤ 1/2.

This shows that the method is unconditionally L²-stable when θ ≥ 1/2, i.e., stable independently of the choice of λ. When θ < 1/2, stability holds if and only if

    λ ≤ 1 / (2(1 − 2θ)).

This very nice stability property comes however at a price: the calculation of U^{n+1} from U^n is no longer explicit.

Let us now consider the accuracy of the θ method. Obviously we need to show that (2.61) holds for some m > 0.

Exercise 2.27 Show that

    e^{−∆tξ²} = 1 − ∆tξ² + (1/2)∆t²ξ⁴ − (1/6)∆t³ξ⁶ + O(∆t⁴ξ⁸),

and that

    R∆(ξ) = 1 − ∆tξ² + ∆t∆x² (1/12 + λθ) ξ⁴ − (∆t∆x⁴/360)(1 + 60λθ + 360λ²θ²) ξ⁶ + O(∆t⁴ξ⁸),   (2.75)

as ∆t → 0 and ∆xξ → 0.

Exercise 2.28 Show that the θ scheme is always at least second-order in space (first-order in time). Show that the scheme is of order 4 in space if and only if

    θ = 1/2 − 1/(12λ).   (2.76)

Show that the scheme is of order 6 in space if, in addition,

    1 + 60λθ + 360λ²θ² = 60λ²,

i.e., λ = √5/10.

Exercise 2.29 Show that the θ scheme can be put in the form (2.44) with

    b_{−1} = b_1 = −θλ,  b_0 = 1 + 2θλ,  a_{−1} = a_1 = (1 − θ)λ,  a_0 = 1 − 2(1 − θ)λ.

Implementation of the θ scheme. Let us come back to the solution U^n_j defined on the grid ∆xZ. Since the coefficients b_1 and b_{−1} do not vanish, the scheme is no longer explicit. The operators A∆ and B∆ in (2.45) are now infinite matrices such that

    B∆ U^{n+1} = A∆ U^n,
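The stability dichotomy of Exercise 2.26 is visible directly in the symbol (2.74). A short Python check (the values of λ and ∆x below are arbitrary illustrative choices):

```python
import numpy as np

def min_R(theta, lam, dx=0.1):
    # symbol (2.74) of the theta scheme, minimized over one period
    xi = np.linspace(0.0, np.pi / dx, 2001)
    c = 1 - np.cos(dx * xi)
    return np.min((1 - 2 * (1 - theta) * lam * c) / (1 + 2 * theta * lam * c))

cn = min_R(0.5, 10.0)   # Crank-Nicolson: stays >= -1 even for large lam
ee = min_R(0.0, 0.6)    # explicit Euler: drops below -1 once lam > 1/2
```

The denominator 1 + 2θλ(1 − cos(∆xξ)) is what tames the high wavenumbers when θ ≥ 1/2; setting θ = 0 removes it and restores the CFL restriction.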

where the infinite vector U^n = (U^n_j)_{j∈Z}. In matrix form, we obtain that the operator B∆ for the θ scheme is given by the infinite tridiagonal matrix

    B∆ = [ 1+2θλ   −θλ      0       0       0
           −θλ     1+2θλ   −θλ      0       0
           0       −θλ     1+2θλ   −θλ      0       (2.77)
           0        0      −θλ     1+2θλ   −θλ
           0        0       0      −θλ     1+2θλ ].

Similarly, the matrix A∆ is given by

    A∆ = [ 1−2(1−θ)λ   (1−θ)λ      0           0           0
           (1−θ)λ      1−2(1−θ)λ   (1−θ)λ      0           0
           0           (1−θ)λ      1−2(1−θ)λ   (1−θ)λ      0       (2.78)
           0            0          (1−θ)λ      1−2(1−θ)λ   (1−θ)λ
           0            0           0          (1−θ)λ      1−2(1−θ)λ ].

The vector U^{n+1} is thus given by

    U^{n+1} = (B∆)^{−1} A∆ U^n.   (2.79)

This implies that in any numerical implementation of the θ scheme where θ > 0, we need to invert the system B∆ U^{n+1} = A∆ U^n (which is usually much faster than constructing the matrix (B∆)^{−1} directly). In practice, we cannot use infinite matrices, obviously. However, we can use the N × N matrices above on an interval of size N∆x. The above matrices correspond to imposing that the solution vanishes at the boundary of the interval (Dirichlet boundary conditions).

Exercise 2.30 Implement the θ scheme in Matlab for all possible values of 0 ≤ θ ≤ 1 and λ > 0. Verify numerically (by dividing an initial spatial mesh by a factor 2, then 4) that the θ scheme is of order 2, 4, and 6 in space for well-chosen values of θ and λ. To obtain these orders of convergence, make sure that the initial condition is sufficiently smooth. For non-smooth initial conditions, show that the order of convergence decreases, and compare with the theoretical predictions of Theorem 2.3. Verify on a few examples with different values of θ < 1/2 that the scheme is unstable when λ is too large.
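As a rough Python counterpart of Exercise 2.30 (the exercise asks for Matlab; the grid size, λ = 2, and the sin(x) test problem are my illustrative choices, and a dense solve stands in for a proper tridiagonal solver):

```python
import numpy as np

def theta_matrices(n, theta, lam):
    # the N x N versions of (2.77)-(2.78), Dirichlet boundary conditions
    off = np.eye(n, k=1) + np.eye(n, k=-1)
    B = (1 + 2 * theta * lam) * np.eye(n) - theta * lam * off
    A = (1 - 2 * (1 - theta) * lam) * np.eye(n) + (1 - theta) * lam * off
    return B, A

n, theta, lam, steps = 50, 0.5, 2.0, 100
x = np.linspace(0.0, np.pi, n + 2)[1:-1]      # interior grid points
dx = np.pi / (n + 1)
B, A = theta_matrices(n, theta, lam)

U = np.sin(x)                                 # exact solution: exp(-t) sin x
for _ in range(steps):
    U = np.linalg.solve(B, A @ U)             # solve B U^{n+1} = A U^n

t = steps * lam * dx**2
err = np.max(np.abs(U - np.exp(-t) * np.sin(x)))
```

Note that λ = 2 is far beyond the explicit CFL limit, yet Crank-Nicolson (θ = 1/2) remains stable and accurate, at the cost of one linear solve per step; in production one would exploit the tridiagonal structure (e.g., a banded solver) rather than a dense solve.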

3 Finite Differences for Elliptic Equations

The prototype of an elliptic equation we consider is the following two-dimensional diffusion equation with absorption coefficient:

    −∆u(x) + σu(x) = f(x),  x ∈ R².   (3.1)

Here, σ is a given positive absorption coefficient, f(x) is a given source term, x = (x, y) ∈ R², and

    ∆ = ∂²/∂x² + ∂²/∂y²

is the Laplace operator. We assume that our domain has no boundaries to simplify. The above partial differential equation can be analyzed by Fourier transform. In two space dimensions, the Fourier transform is defined by

    [F_{x→ξ} f](ξ) ≡ f̂(ξ) = ∫_{R²} e^{−ix·ξ} f(x) dx.   (3.2)

The inverse Fourier transform is then defined by

    [F^{−1}_{ξ→x} f̂](x) ≡ f(x) = (1/(2π)²) ∫_{R²} e^{ix·ξ} f̂(ξ) dξ.   (3.3)

Notice here that both x and ξ are 2-vectors. We still verify that the Fourier transform replaces translations and differentiations by multiplications. More precisely, we have

    [Ff(x + y)](ξ) = e^{iy·ξ} f̂(ξ),    [F∇f(x)](ξ) = iξ f̂(ξ).   (3.4)

This implies that the operator −∆ is replaced by a multiplication by |ξ|² (Check it!). Upon taking the Fourier transform in both terms of the above equation, we obtain that

    (σ + |ξ|²) û(ξ) = f̂(ξ),  ξ ∈ R².   (3.5)

The solution to the diffusion equation is then given in the Fourier domain by

    û(ξ) = f̂(ξ) / (σ + |ξ|²).   (3.6)

It then remains to apply the inverse Fourier transform to the above equality to obtain u(x). Since the inverse Fourier transform of a product is the convolution of the inverse Fourier transforms (as in 1D), we deduce that

    u(x) = [F^{−1}(1/(σ + |ξ|²))](x) ∗ f(x) = C K₀(√σ |x|) ∗ f(x),

where K₀ is the modified Bessel function of order 0 of the second kind and C is a normalization constant.

How do we solve (3.1) by finite differences? We can certainly replace the differentials by approximate differences as we did in 1D. This is the finite difference approach. We could also define more general equations of the form

    P(D) u(x) = f(x),  x ∈ R²,   (3.7)

where P(D) is an operator with symbol P(iξ). In the case of the diffusion equation, we would have P(iξ) = σ + |ξ|², i.e., P(iξ) is a parabolic symbol of order 2 using the terminology of the preceding section. All we will see in this section applies to quite general symbols P(iξ), although we shall consider only P(iξ) = σ + |ξ|² to simplify.

Discretization. Let us consider the grid of points x_{ij} = (i∆x, j∆x) for (i, j) ∈ Z² and ∆x > 0 given. We then define U_{ij} ≈ u(x_{ij}) and f_{ij} = f(x_{ij}). The operator ∂x is defined by

    ∂x U_{ij} = (U_{i+1,j} − U_{ij}) / ∆x,   (3.8)

and ∂̄x, ∂y, and ∂̄y are defined similarly. Using a second-order scheme, we can then replace (3.1) by

    −(∂x ∂̄x + ∂y ∂̄y) U_{ij} + σ U_{ij} = f_{ij}.   (3.9)

Assuming, as in the preceding section, that the finite difference operator is defined in the whole space R² and not only on the grid {x_{ij}}, we have more generally that

    B U(x) = f(x),  x ∈ R²,   (3.10)

where the operator B is a sum of translations defined by

    B U(x) = Σ_{β∈Iβ} bβ U(x − ∆xβ).   (3.11)

Here, Iβ is a finite subset of Z², and β = (β₁, β₂), where β₁ and β₂ are integers. Upon taking the Fourier transform of (3.11), we obtain

    Σ_{β∈Iβ} bβ e^{−i∆xβ·ξ} Û(ξ) = f̂(ξ).   (3.12)

Defining

    R∆(ξ) = Σ_{β∈Iβ} bβ e^{−i∆xβ·ξ},   (3.13)

we then obtain that the finite difference solution is given in the Fourier domain by

    Û(ξ) = f̂(ξ) / R∆(ξ),   (3.14)
This shows again how powerful Fourier analysis is to solve partial differential equations with constant coefficients. We could define more general equations of the form

    P(D)u(x) = f(x),   x ∈ R²,                                             (3.7)

where P(D) is an operator with symbol P(iξ). In the case of the diffusion equation, we would have P(iξ) = σ + |ξ|²; that is, P(iξ) is a parabolic symbol of order 2 using the terminology of the preceding section. All we will see in this section applies to quite general symbols P(iξ), although we shall consider only P(iξ) = σ + |ξ|² to simplify.

Discretization. How do we solve (3.1) by finite differences? We can certainly replace the differentials by approximate differentials as we did in 1D. This is the finite difference approach. Let us consider the grid of points x_ij = (i∆x, j∆x) for i, j ∈ Z and ∆x > 0 given. We then define

    U_ij ≈ u(x_ij),   (i, j) ∈ Z²,   f_ij = f(x_ij).

Using a second-order scheme, we can then replace (3.1) by

    -(∂x ∂x̄ + ∂y ∂ȳ) U_ij + σ U_ij = f_ij,                                 (3.8)

where the operator ∂x is defined by

    ∂x U_ij = (U_{i+1,j} - U_ij) / ∆x,                                     (3.9)

and ∂x̄, ∂y, and ∂ȳ are defined similarly. Assuming as in the preceding section that the finite difference operator is defined in the whole space R² and not only on the grid {x_ij}, we have more generally that

    B U(x) = f(x),   x ∈ R²,                                               (3.10)

where the operator B is a sum of translations defined by

    B U(x) = Σ_{β∈I_β} b_β U(x - ∆x β).                                    (3.11)

Here, I_β is a finite subset of Z² of indices β = (β₁, β₂), where β₁ and β₂ are integer. Upon taking the Fourier transform of (3.11), we obtain

    Σ_{β∈I_β} b_β e^{-i∆x β·ξ} Û(ξ) = f̂(ξ).                                (3.12)

Defining

    R_∆(ξ) = Σ_{β∈I_β} b_β e^{-i∆x β·ξ},                                   (3.13)

we then obtain that the finite difference solution is given in the Fourier domain by

    Û(ξ) = f̂(ξ) / R_∆(ξ).                                                  (3.14)
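A small check of the five-point scheme (3.8) on a smooth function, in Python (our stand-in for Matlab); the periodic wrap via np.roll is an assumption replacing the infinite grid, and apply_B and stencil_error are names of ours:

```python
import numpy as np

# Five-point stencil version of (3.8): -(d_x dbar_x + d_y dbar_y)U + sigma*U.
# Applied to a smooth test function it should match -Laplacian(u) + sigma*u
# up to O(dx^2): halving dx divides the error by about 4.
def apply_B(U, dx, sigma):
    lap = (np.roll(U, 1, 0) + np.roll(U, -1, 0)
           + np.roll(U, 1, 1) + np.roll(U, -1, 1) - 4*U) / dx**2
    return -lap + sigma*U

def stencil_error(n):
    x = np.linspace(0, 2*np.pi, n, endpoint=False)
    X, Y = np.meshgrid(x, x, indexing="ij")
    u = np.sin(X) * np.cos(Y)     # -Laplacian(u) = 2u, so exact B u = (2 + sigma)u
    dx = 2*np.pi / n
    return np.max(np.abs(apply_B(u, dx, 1.0) - 3.0*u))
```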

and that the difference between the exact and the discrete solutions is given by

    û(ξ) - Û(ξ) = (1/P(iξ) - 1/R_∆(ξ)) f̂(ξ) = ((R_∆(ξ) - P(iξ)) / (R_∆(ξ)P(iξ))) f̂(ξ).   (3.15)

For the second-order scheme introduced above, using the definitions (2.67)-(2.68), which also hold in two space dimensions, we obtain

    R_∆(ξ) = σ + (2/∆x²)(2 - cos(∆x ξ_x) - cos(∆x ξ_y)).                   (3.16)

For ∆x|ξ| ≤ π, we deduce from Taylor expansions that

    |R_∆(ξ) - P(iξ)| ≤ C∆x²|ξ|⁴.

The scheme is indeed second-order. For ∆x|ξ| ≥ π, we also obtain that |R_∆(ξ)| ≤ C/∆x² ≤ C∆x²|ξ|⁴ and |P(iξ)| ≤ C∆x²|ξ|⁴. This implies that

    |R_∆(ξ) - P(iξ)| ≤ C∆x²|ξ|⁴  for all ξ ∈ R².

Since R_∆(ξ) ≥ σ and P(iξ) ≥ C(1 + |ξ|²), we then easily deduce that

    |R_∆(ξ) - P(iξ)| / (R_∆(ξ)P(iξ)) ≤ C∆x²(1 + |ξ|²).                     (3.17)

From this, we deduce that

    |û(ξ) - Û(ξ)| ≤ C∆x²(1 + |ξ|²)|f̂(ξ)|.

Let us now square the above inequality and integrate. From the Parseval equality (equation (2.53) with √(2π) replaced by 2π in two space dimensions), we obtain that

    ‖u(x) - U(x)‖ ≤ C∆x² ( ∫_{R²} (1 + |ξ|²)² |f̂(ξ)|² dξ )^{1/2}.          (3.18)

The above integral is bounded provided that f(x) ∈ H²(R²). Upon taking the square root of the integrals, we deduce that

    ‖u(x) - U(x)‖ ≤ C∆x² ‖f(x)‖_{H²(R²)}.                                  (3.19)

This is the final result of this section: provided that the source term is sufficiently regular (its second-order derivatives are square-integrable), the error made by the finite difference approximation is of order ∆x².

Exercise 3.1 Notice that the bound in (3.17) would be replaced by C∆x²(1 + |ξ|⁴) if only the Taylor expansion were used. The reason why (3.17) holds is because the exact symbol P(iξ) damps high frequencies (i.e., is bounded from below by C(1 + |ξ|²)). Show that if this damping is not used in the proof, the final result (3.19) should be replaced by

    ‖u(x) - U(x)‖ ≤ C∆x² ‖f(x)‖_{H⁴(R²)}.
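The symbol estimate |R_∆(ξ) - P(iξ)| ≤ C∆x²|ξ|⁴ can be checked directly at a fixed frequency; a short Python sketch (symbol_gap is a name of ours):

```python
import numpy as np

# Numerical check of |R_Delta(xi) - P(i xi)| <= C dx^2 |xi|^4 for the
# five-point symbol R_Delta = sigma + (2/dx^2)(2 - cos(dx xi_x) - cos(dx xi_y)).
sigma = 1.0

def symbol_gap(dx, xi_x, xi_y):
    R = sigma + 2.0/dx**2 * (2 - np.cos(dx*xi_x) - np.cos(dx*xi_y))
    P = sigma + xi_x**2 + xi_y**2
    return abs(R - P)

# Halving dx at a fixed frequency divides the gap by ~4: second order.
```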

Exercise 3.2 (moderately difficult) Show that

    |R_∆(ξ) - P(iξ)| / (R_∆(ξ)P(iξ)) ≤ C∆x^m (1 + |ξ|^m),  for 0 ≤ m ≤ 2.

[Hint: Show that (R_∆(ξ))^{-1} and (P(iξ))^{-1} are bounded and separate |ξ|∆x ≥ 1 and |ξ|∆x ≤ 1.] Deduce that

    ‖u(x) - U(x)‖ ≤ C∆x^m ‖f(x)‖_{H^m(R²)},  for all 0 ≤ m ≤ 2.

We deduce from this result that the error u(x) - U(x) still converges to 0 when f is less regular than H²(R²). However, the convergence is slower and no longer of order ∆x².

Exercise 3.3 Implement the Finite Difference algorithm in Matlab on a square of finite dimension (impose that the solution vanishes at the boundary of the domain). This implementation is not trivial. It requires constructing and inverting a matrix as was done in (2.79). Solve the diffusion problem with a smooth function f and show that the order of convergence is 2. Show that the order of convergence decreases when f is less regular.
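One possible sketch of the matrix construction Exercise 3.3 asks for, in Python/SciPy instead of Matlab (solve_fd is a name of ours; the Kronecker-product assembly is one standard way to build the five-point matrix, not necessarily the one the notes intend):

```python
import numpy as np
from scipy.sparse import identity, kron, diags
from scipy.sparse.linalg import spsolve

# Five-point scheme for -Laplacian(u) + sigma*u = f on the unit square,
# with u = 0 on the boundary (Dirichlet), assembled as a sparse matrix.
def solve_fd(n, sigma, f):
    h = 1.0 / (n + 1)
    x = np.linspace(h, 1 - h, n)               # interior nodes only
    X, Y = np.meshgrid(x, x, indexing="ij")
    T = diags([-1, 2, -1], [-1, 0, 1], shape=(n, n)) / h**2
    I = identity(n)
    A = kron(T, I) + kron(I, T) + sigma * identity(n * n)
    U = spsolve(A.tocsc(), f(X, Y).ravel())
    return X, Y, U.reshape(n, n)
```

With f chosen so that the exact solution is u = sin(πx)sin(πy), i.e. f = (2π² + σ)u, doubling n cuts the maximum error by roughly 4, as the order-2 estimate (3.19) predicts.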

4 Finite Differences for Hyperbolic Equations

After parabolic and elliptic equations, we turn to the third class of important linear equations: hyperbolic equations. The simplest hyperbolic equation is the one-dimensional transport equation

    ∂u/∂t + a ∂u/∂x = 0,   t > 0, x ∈ R,
    u(0, x) = u₀(x),       x ∈ R.                                          (4.1)

Here u₀(x) is the initial condition. When a is constant, the solution to the above equation is easily found to be

    u(t, x) = u₀(x - at).                                                  (4.2)

This means that the initial data is simply translated with speed a. The solution can also be obtained by Fourier transforms. Indeed we have

    ∂û/∂t (t, ξ) + iaξ û(t, ξ) = 0,

which gives

    û(t, ξ) = û₀(ξ) e^{-iatξ}.                                             (4.3)

Now since multiplication by a phase in the Fourier domain corresponds to a translation in the physical domain, we obtain (4.2). Notice that (4.3) can be written as (2.40) with P(iξ) = iaξ. Here, the symbol P(iξ) is purely imaginary (i.e. its real part vanishes). Notice that this symbol does not have the properties imposed in section 2. This corresponds to a completely different behavior of the PDE solutions: instead of regularizing initial discontinuities, as parabolic equations do, hyperbolic equations translate and possibly modify the shape of discontinuities, but do not regularize them. This behavior will inevitably show up when we discretize hyperbolic equations.

Let us now consider finite difference discretizations of the above transport equation. The framework introduced in section 2 still holds. The mathematical analysis of finite difference schemes for hyperbolic equations is then not very different from that for parabolic equations. Only the results are very different: schemes that may be natural for parabolic equations will no longer be so for hyperbolic equations.

The single-step finite difference scheme is given by (2.44), where we recall that P(iξ) = iaξ. The L² norm of the difference between the exact and approximate solutions is then given by

    εⁿ(ξ) = (1/√(2π)) ‖u(n∆t, x) - Uⁿ(x)‖ = (1/√(2π)) ‖(e^{-n∆tP(iξ)} - R_∆ⁿ(ξ)) û₀(ξ)‖.

Stability. One of the main steps in controlling the error term εⁿ is to show that the scheme is stable. Here is an important remark: by stability, we mean here that

    |R_∆ⁿ(ξ)| = |R_∆(ξ)|ⁿ ≤ C,   n∆t ≤ T;

it is defined by (2.49).

When a > 0. Again.1 Check that the upwind scheme is stable provided that |a|∆t ≤ ∆x. we deduce that the scheme is always unstable since |R∆ (ξ)|2 > 1 for cos(ξ∆x) ≤ 0 for instance. 2 Exercise 4. When a > 0. information propagates from the right to the left when a < 0.Let us consider the stability of classical schemes. This leads to the definition of the upwind scheme. the time step cannot be chosen too large. the scheme still “looks” for information coming from the right.2 Show that the above scheme is always unstable. Since the scheme approximates the spatial derivative by ∂x . So the scheme is stable if and only if −1 ≤ ν ≤ 0. 33 . Verify the instability in Matlab. When ν < −1. it is asymmetrical and uses some information to the right (at the point Xj+1 = Xj + ∆x) of the current point Xj . A tempting solution to this difficulty is to average over the operators ∂x and ∂ x : ν n n Ujn+1 = Ujn − (Uj+1 − Uj−1 ). (4. Notice however. This is referred to as a CFL condition.4) when when a>0 a < 0. whereas physically it comes from the left. It remains to consider −1 ≤ ν ≤ 0. ∆x (4. There. This can be understood as follows. Spelling out the details. Implement the scheme in Matlab. This implies that |a|∆t ≤ ∆x. we deduce that R∆ (ξ) = 1 − ν(eiξ∆x − 1) = [1 − ν(cos(ξ∆x) − 1)] − iν sin(ξ∆x). For a < 0.1) is to replace the time derivative by ∂t and the spatial derivative by ∂x . this is relatively easy since only the sign of a is involved. it gets the correct information where it comes from. we again deduce that |R∆ (ξ)|2 > 1 since ν(1 + ν) > 0. which implies that ν > 0.5) Exercise 4. The first idea that comes to mind to discretize (4. knowing where the information comes from is much more complicated. The above result shows that we have to “know” which direction the information comes from to construct our scheme. this gives Ujn+1 = Ujn − a∆t n (U − Ujn ). In higher dimensions. ν(1 + ν) ≤ 0 and we then deduce that |R∆ (ξ)|2 ≤ 1 for all wavenumbers ξ. 
∆x j+1 After Fourier transform (and extension of the scheme to the whole line x ∈ R). This implies stability. In the transport equation. defined by Ujn+1 = n Ujn − ν(Ujn − Uj−1 ) n Ujn − ν(Uj+1 − Ujn ) a∆t . where we have defined ν= We deduce that |R∆ (ξ)|2 = 1 + 2ν(1 + ν)(1 − cos(ξ∆x)). that this also implies that a < 0. For one-dimensional problems.
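A minimal Python sketch of the upwind scheme (4.4) for a > 0, on a periodic grid as a computable stand-in for the whole line (the function name upwind is ours; the notes ask for a Matlab version):

```python
import numpy as np

# Upwind scheme (4.4) for u_t + a u_x = 0 with a > 0 on a periodic grid.
# Stable under the CFL condition |a|*dt <= dx, i.e. nu = a*dt/dx <= 1.
def upwind(u0, a, dx, dt, steps):
    nu = a * dt / dx
    u = u0.copy()
    for _ in range(steps):
        u = u - nu * (u - np.roll(u, 1))   # a > 0: difference to the left
    return u
```

With ν = 1 the update reduces to U_j^{n+1} = U_{j-1}^n, an exact translation by one cell per step; with 0 < ν < 1 the scheme is a convex combination of neighboring values, so the maximum never grows.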

A more satisfactory answer to the question was given by Lax and Wendroff. The Lax-Wendroff scheme is defined by

    U_j^{n+1} = ½ν(1+ν) U_{j-1}^n + (1-ν²) U_j^n - ½ν(1-ν) U_{j+1}^n.

Exercise 4.3 Check that

    |R_∆(ξ)|² = 1 - 4ν²(1-ν²) sin⁴(ξ∆x/2).                                 (4.6)

Deduce that the Lax-Wendroff scheme is stable when |ν| ≤ 1. Implement the Lax-Wendroff scheme in Matlab. Notice that ∆t and ∆x are again related through (4.4).

Consistency. We have introduced two stable schemes, the upwind scheme and the Lax-Wendroff scheme. It remains to analyze their convergence properties. Consistency consists of analyzing the local properties of the discrete scheme and making sure that the scheme captures the main trends of the continuous equation. So assuming that u(n∆t, X_j) is known at time n∆t, we want to see what it becomes under the discrete scheme. For the upwind scheme with a > 0 to simplify, we want to analyze

    (∂t + a ∂x̄) u(Tn, Xj).

This error is obtained assuming that the exact solution is sufficiently regular and by using Taylor expansions. The closer the above quantity is to 0 (which it would be if the discrete scheme were replaced by the continuous equation), the higher the order of convergence of the scheme.

Exercise 4.4 Show that

    (∂t + a ∂x̄) u(Tn, Xj) = -½(1-ν) a ∆x (∂²u/∂x²)(Tn, Xj) + O(∆x²).

This shows that the upwind scheme is first-order when ν < 1. Notice that first-order approximations involve second-order derivatives of the exact solution.

Exercise 4.5 Show that the upwind scheme is first-order and the Lax-Wendroff scheme is second-order.

The difference between parabolic and hyperbolic equations and schemes does not come from the consistency conditions, but from the stability conditions. For hyperbolic schemes, we do not have (2.57) or (2.58). This is similar to the parabolic case and justifies the following definition in the Fourier domain (remember that a second-order derivative in the physical domain means multiplication by -ξ² in the Fourier domain). We say that the scheme is convergent of order m if

    |e^{-i∆tP(iξ)} - R_∆(ξ)| ≤ C ∆t ∆x^m (1 + |ξ|^{m+1}),                  (4.7)

for ∆t and ∆x|ξ| sufficiently small (say ∆x|ξ| ≤ π). Notice the parallel with (2.61): we have simply replaced M by 1.
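A Python sketch of the Lax-Wendroff update on a periodic grid (lax_wendroff is a name of ours; the exercise asks for Matlab):

```python
import numpy as np

# Lax-Wendroff scheme for u_t + a u_x = 0 on a periodic grid:
# U_j^{n+1} = nu(1+nu)/2 U_{j-1} + (1-nu^2) U_j - nu(1-nu)/2 U_{j+1}.
def lax_wendroff(u0, nu, steps):
    u = u0.copy()
    for _ in range(steps):
        u = (0.5*nu*(1 + nu)*np.roll(u, 1)
             + (1 - nu**2)*u
             - 0.5*nu*(1 - nu)*np.roll(u, -1))
    return u
```

For ν = 1 the stencil collapses to the exact shift U_j^{n+1} = U_{j-1}^n, and for |ν| ≤ 1 the amplification factor (4.6) satisfies |R_∆| ≤ 1, so single-mode data keeps amplitude at most 1.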

Proof of convergence. As in the parabolic case, stability and consistency of the scheme, together with sufficient regularity of the exact solution, are the ingredients of the convergence of the scheme. As usual, we treat low and high wavenumbers separately. We have

    εⁿ(ξ) = (e^{-∆tP(iξ)} - R_∆(ξ)) Σ_{k=0}^{n-1} e^{-(n-1-k)∆tP(iξ)} R_∆^k(ξ) û₀(ξ).

Here is the main difference with the parabolic case: e^{-(n-1-k)∆tP(iξ)} and R_∆^k(ξ) are bounded but not small for high frequencies (notice that |e^{-n∆tP(iξ)}| = 1 for P(iξ) = iaξ). For low wavenumbers (∆x|ξ| ≤ π), using the stability of the scheme, which states that |R_∆(ξ)| ≤ 1, we deduce that

    |εⁿ(ξ)| ≤ n |e^{-∆tP(iξ)} - R_∆(ξ)| |û₀(ξ)| ≤ C∆x^m (1 + |ξ|^{m+1}) |û₀(ξ)|,

using the order of convergence of the scheme. For high wavenumbers, ∆x|ξ| ≥ π, we deduce that

    |εⁿ(ξ)| ≤ 2|û₀(ξ)| ≤ C∆x^m (1 + |ξ|^{m+1}) |û₀(ξ)|.

So the above inequality holds for all frequencies ξ ∈ R. This implies that

    ‖εⁿ(ξ)‖ ≤ C∆x^m ( ∫_R (1 + |ξ|^{2m+2}) |û₀(ξ)|² dξ )^{1/2}.

Assuming that the initial condition u₀ ∈ H^{m+1}, we obtain that

    ‖u(n∆t, x) - Uⁿ(x)‖ ≤ C∆x^m ‖u₀‖_{H^{m+1}},  for all 0 ≤ n ≤ N = T/∆t.  (4.8)

We have then proved the

Theorem 4.1 Let us assume that P(iξ) = iaξ and that R_∆(ξ) is of order m, i.e. (4.7) holds. Then the error estimate (4.8) holds.

The main difference with the parabolic case is that u₀ now needs to be in H^{m+1} (and not in H^m) to obtain an accuracy of order ∆x^m. What if u₀ is not in H^{m+1}?

Exercise 4.6 Show that for any scheme of order m, we have

    |e^{-i∆tP(iξ)} - R_∆(ξ)| ≤ C ∆t ∆x^{αm} (1 + |ξ|^{α(m+1)})  for all 0 ≤ α ≤ 1 and |ξ|∆x ≤ π.

Show that (4.8) is then replaced by

    ‖u(n∆t, x) - Uⁿ(x)‖ ≤ C∆x^{αm} ‖u₀‖_{H^{α(m+1)}},  for all 0 ≤ n ≤ N = T/∆t.   (4.9)

Deduce that if u₀ is in H^s for 0 ≤ s ≤ m+1, the error of convergence of the scheme will be of order sm/(m+1). We recall that piecewise constant functions are in the space H^s for all s < 1/2 (take s = 1/2 in the sequel to simplify). Deduce the order of the error of convergence for the upwind scheme and the Lax-Wendroff scheme. Verify this numerically.

5 Spectral methods

5.1 Unbounded grids and semidiscrete FT

The material of this section is borrowed from chapters 1 & 2 in Trefethen [2]. Finite differences were based on approximating ∂/∂x by finite differences of the form ∂x for first-order approximations, or ½(∂x + ∂x̄) for a second-order approximation, and so on for higher-order approximations. Spectral methods are based on using an approximation of very high order for the partial derivative. In matrix form, the second-order approximation given by (1.2) and the fourth-order approximation given by (1.3) in [2] are replaced in the "spectral" approximation by the "infinite-order" matrix in (1.4) in [2].

How is this done? The main ingredient is "interpolation". We interpolate a set of values of a function at the points of a grid by a (smooth) function defined everywhere. We can then obviously differentiate this function and take the values of the differentiated function at the grid points. This is how derivatives are approximated in spectral methods and how (1.4) in [2] is obtained.

Let us be more specific and consider first the infinite grid hZ with grid points x_j = jh for j ∈ Z and h > 0 given. Let us assume that we are given the values v_j of a certain function v(x) at the grid points, i.e. v(x_j) = v_j. What we want is to define a function p(x) such that p(x_j) = v_j. This function p(x) will then be an interpolant of the series {v_j}. To do so, we define the semidiscrete Fourier transform as

    v̂(k) = h Σ_{j=-∞}^{∞} e^{-ikx_j} v_j,   k ∈ [-π/h, π/h].               (5.1)

It turns out that this transform can be inverted. We define the inverse semidiscrete Fourier transform as

    v_j = (1/2π) ∫_{-π/h}^{π/h} e^{ikx_j} v̂(k) dk.                         (5.2)

The above definitions are nothing but the familiar Fourier series: the v_j are the Fourier coefficients of the periodic function v̂(k) on [-π/h, π/h]. It is then well known that the function v̂(k) can be reconstructed from its Fourier coefficients.

Exercise 5.1 Check that v̂(k) in (5.1) is indeed periodic of period 2π/h.

Notice now that the inverse semidiscrete Fourier transform (a.k.a. Fourier series) in (5.2) is defined for every x_j of the grid hZ, but could be defined more generally for every point x ∈ R, by simply replacing x_j in (5.2) by x:

    p(x) = (1/2π) ∫_{-π/h}^{π/h} e^{ikx} v̂(k) dk.                          (5.3)

This is how we define our interpolant. The function p(x) is a very nice function. It can be shown that it is analytic (which means that the infinite Taylor expansion of p(z) in the vicinity of every complex number z converges to p, so the function p is infinitely many times differentiable), and by definition it satisfies p(x_j) = v_j.

Now assuming that we want to define w_j, an approximation of the derivative of v given by the values v_j at the grid points x_j, the spectral approximation simply consists of defining

    w_j = p′(x_j),                                                         (5.4)

where p(x) is defined by (5.3). Notice that w_j a priori depends on all the values v_j. This is why the matrix (1.4) in [2] is infinite and full. How do we calculate this matrix? Here we use the linearity of the differentiation. If v_j = v_j^{(1)} + v_j^{(2)}, then we clearly have that w_j = w_j^{(1)} + w_j^{(2)}. So all we need to do is to consider the derivative of the sequence v_j such that v_n = 1 for some n ∈ Z and v_m = 0 for m ≠ n. We then obtain that v̂_n(k) = h e^{-inkh}, and that

    p_n(x) = (h/2π) ∫_{-π/h}^{π/h} e^{ik(x-nh)} dk = S_h(x - nh),          (5.5)

where S_h(x) is the sinc function

    S_h(x) = sin(πx/h) / (πx/h).                                           (5.6)

More generally, we obtain that the interpolant p(x) is given by

    p(x) = Σ_{j=-∞}^{∞} v_j S_h(x - jh).                                   (5.7)

An easy calculation shows that

    S_h′(jh) = 0  for j = 0,    S_h′(jh) = (-1)^j / (jh)  for j ≠ 0.
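The band-limited interpolant (5.7) can be sketched in Python by truncating the infinite sum to a finite window of samples (sinc_interp and the window size are our choices; NumPy's np.sinc is the normalized sinc, so np.sinc(x/h) equals S_h(x)):

```python
import numpy as np

# Sinc interpolation (5.7): p(x) = sum_j v_j * S_h(x - j*h),
# truncated to the finitely many samples we actually hold.
def sinc_interp(x, v, h, jmin):
    # v[m] is the sample at grid point (jmin + m) * h
    j = jmin + np.arange(len(v))
    return np.sum(v * np.sinc((x - j*h) / h))

h = 0.5
j = np.arange(-60, 61)
v = 1.0 / (1.0 + (j*h)**2)      # samples of the smooth function 1/(1+x^2)
```

At grid points the sinc weights collapse to a Kronecker delta, so p(x_j) = v_j exactly; between grid points the truncated sum reproduces 1/(1+x²) up to the aliasing and truncation error discussed below.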

This implies that

    p_n′(x_j) = 0  for j = n,    p_n′(x_j) = (-1)^{j-n} / ((j-n)h)  for j ≠ n.

These are precisely the values of the entries (n, j) of the matrix in (1.4) in [2].

Exercise 5.2 Check this.

Notice that the same technique can be used to define approximations to derivatives of arbitrary order. For instance, the matrix corresponding to second-order differentiation is given in (2.14) of [2].

5.2 Periodic grids and Discrete FT

The simplest way to replace the infinite-dimensional matrices encountered in the preceding section is to assume that the initial function v is periodic. To simplify, we assume that the function is not only periodic, but 2π-periodic, and only given on the grid (x₁, ..., x_N): the function is represented by its values at the points x_j = 2πj/N for some N ∈ N, so that h = 2π/N. We also assume that N is even. Odd values of N are treated similarly, but the formulas are slightly different.

As we did for infinite grids, we know that periodic functions are represented by Fourier series; since the function is only given on the grid, it turns out that its "Fourier series" is finite. This is how we define the Discrete Fourier transform

    v̂_k = h Σ_{j=1}^{N} e^{-ikx_j} v_j,   k = -N/2 + 1, ..., N/2.          (5.8)

The Inverse Discrete Fourier Transform is then given by

    v_j = (1/2π) Σ_{k=-N/2+1}^{N/2} e^{ikx_j} v̂_k,   j = 1, ..., N.        (5.9)

The latter is recast as

    v_j = (1/2π) Σ′_{k=-N/2}^{N/2} e^{ikx_j} v̂_k,   j = 1, ..., N.         (5.10)

Here we define v̂_{-N/2} = v̂_{N/2}, and Σ′ means that the terms k = ±N/2 are multiplied by ½. The latter definition is more symmetrical than the former, although both are obviously equivalent. The reason we introduce (5.10) is that the two definitions do not yield the same interpolant once we extend (5.9) and (5.10) to arbitrary values of x and not only the values on the grid; the interpolant based on (5.10) is more symmetrical. The discrete interpolant is now defined by

    p(x) = (1/2π) Σ′_{k=-N/2}^{N/2} e^{ikx} v̂_k,   x ∈ [0, 2π].            (5.11)
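A direct (non-FFT) Python sketch of the symmetrized interpolant (5.10)/(5.11), evaluating the DFT (5.8) by brute force; interp is a name of ours and the loop is deliberately naive, for illustration only:

```python
import numpy as np

# Evaluate the trigonometric interpolant (5.11) at an arbitrary point x:
# modes k = -N/2..N/2 with the k = +-N/2 terms weighted by 1/2.
def interp(x, v):
    N = len(v)                         # N even; v_j given at x_j = 2*pi*j/N
    h = 2*np.pi/N
    xj = h*np.arange(1, N + 1)
    p = 0.0
    for k in range(-N//2, N//2 + 1):
        vhat_k = h*np.sum(np.exp(-1j*k*xj)*v)   # DFT (5.8); k=-N/2 equals k=N/2
        w = 0.5 if abs(k) == N//2 else 1.0      # the prime on the sum in (5.10)
        p += w*np.exp(1j*k*x)*vhat_k
    return np.real(p)/(2*np.pi)
```

For data sampled from a low-order trigonometric polynomial such as sin x, the interpolant reproduces the function exactly between grid points.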

Notice that this interpolant is obviously 2π-periodic, and is a trigonometric polynomial of order (at most) N/2. The interpolants of the semidiscrete and discrete Fourier transforms have very similar properties.

5.3 Fast Fourier Transform (FFT)

In the preceding subsections, we have seen how to calculate spectral derivatives w_j at grid points of a function defined by v_j at the same grid points. We have also introduced matrices that map v_j to w_j. Our derivation was done in the physical domain, in the sense that we have constructed the interpolant p(x), taken its derivative, and evaluated the derivative p′(x_j) at the grid points. Another way to consider the derivation is to see what it means to "take a derivative" in the Fourier domain. Remember that (classical) Fourier transforms replace derivation by multiplication by iξ, see (2.27). For the semidiscrete Fourier transform, we deduce from (5.3) that

    p′(x) = (1/2π) ∫_{-π/h}^{π/h} e^{ikx} ik v̂(k) dk.

So again, in the Fourier domain, differentiation really means replacing v̂(k) by ik v̂(k). For the discrete Fourier transform, we deduce from (5.11) that

    p′(x) = (1/2π) Σ_{k=-N/2+1}^{N/2-1} e^{ikx} ik v̂_k.

Notice that by definition, we have v̂_{-N/2} = v̂_{N/2}, so that

    i(-N/2) v̂_{-N/2} + i(N/2) v̂_{N/2} = 0.

This means that the Fourier coefficients v̂_k are indeed replaced by ik v̂_k when we differentiate p(x), except for the value k = N/2, for which the Fourier coefficient of the derivative vanishes. So, to calculate w_j from v_j in the periodic case, we have the following procedure:

1. Calculate v̂_k from v_j by Discrete Fourier transform, for -N/2 + 1 ≤ k ≤ N/2.
2. Define ŵ_k = ik v̂_k for -N/2 + 1 ≤ k ≤ N/2 - 1, and ŵ_{N/2} = 0.
3. Calculate w_j from ŵ_k by Inverse Discrete Fourier transform.

As before, this corresponds to w_j = p′(x_j). Said in other words, the above procedure is the discrete analog of the formula

    f′(x) = F^{-1}_{ξ→x} [iξ F_{x→ξ}[f(x)]].

Exercise 5.3 Calculate the N × N matrix D_N which maps the values v_j of the function v(x) to the values w_j = p′(x_j). The result of the calculation is given on p.21 in [2].

The main computational difficulties come from the implementation of the discrete Fourier transform and its inversion (steps 1 and 3); the multiplication by ik (step 2) is fast. This is what the FFT is based on: a very efficient implementation of steps 1 and 3 is called the FFT, the Fast Fourier Transform.
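A Python sketch of both routes on the periodic grid: the dense matrix D_N of Exercise 5.3, with the entries ½(-1)^{j-n} cot((j-n)h/2) reported on p.21 of [2], checked against the three-step FFT procedure above (diff_matrix and fft_derivative are names of ours):

```python
import numpy as np

# Periodic spectral differentiation two ways: the matrix D_N of Exercise 5.3,
# and steps 1-3 above implemented with the FFT (Nyquist coefficient zeroed).
def diff_matrix(N):
    h = 2*np.pi/N
    D = np.zeros((N, N))
    for j in range(N):
        for n in range(N):
            if j != n:
                D[j, n] = 0.5*(-1)**(j - n) / np.tan((j - n)*h/2)
    return D

def fft_derivative(v):
    N = len(v)
    k = np.fft.fftfreq(N, d=1.0/N)      # integer wavenumbers 0..N/2-1, -N/2..-1
    ik = 1j*k
    ik[N//2] = 0.0                      # step 2: w_hat_{N/2} = 0
    return np.real(np.fft.ifft(ik*np.fft.fft(v)))
```

Both give the same w_j; for a smooth periodic function such as e^{sin x} the derivative is spectrally accurate.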

It is an algorithm that performs steps 1 and 3 in O(N log N) operations. This has to be compared with a number of operations of order O(N²) if the formulas (5.8) and (5.9) are used directly.

Exercise 5.4 Check the order O(N²) for the direct implementation.

We are not going to describe how the FFT works; let us merely state that it is a very convenient alternative to constructing the matrix D_N to calculate the spectral derivative. We refer to [2], Programs 4 and 5, for details.

5.4 Spectral approximation for unbounded grids

We have introduced several interpolants to obtain approximations of the derivatives of functions defined by their values on a (discrete) grid. It remains to know what sort of error one makes by introducing such interpolants. Chapter 4 in [2] gives some answer to this question. Here we shall give a slightly different answer based on the Sobolev scale introduced in the analysis of finite differences.

Let u(x) be a function defined for x ∈ R and define v_j = u(x_j). The Fourier transform of u(x) is denoted by û(k). We then define the semidiscrete Fourier transform v̂(k) by (5.1) and the interpolant p(x) by (5.3). The question is then: what is the error between u(x) and p(x)? The answer relies on one thing: how regular u(x) is. Smooth functions u(x) correspond to functions û(k) that decay fast in k. The analysis of the error is based on estimates in the Fourier domain.

The first theoretical ingredient, which is important in its own right, is the aliasing formula, also called the Poisson summation formula.

Theorem 5.1 Let u ∈ L²(R) be sufficiently smooth. Then for all k ∈ [-π/h, π/h],

    v̂(k) = Σ_{j=-∞}^{∞} û(k + 2πj/h).                                      (5.12)

In other words, the aliasing formula relates the semidiscrete Fourier transform v̂(k) to the continuous Fourier transform û(k).

Derivation of the aliasing formula. We write

    v̂(k) = h Σ_{j=-∞}^{∞} e^{-ikhj} v_j
         = Σ_{j=-∞}^{∞} ∫_R û(k′) e^{ihj(k′-k)} (h dk′ / 2π)
         = ∫_R Σ_{j=-∞}^{∞} û(k + 2πl/h) e^{i2πlj} dl,

using the change of variables h(k′ - k) = 2πl, so that h dk′ = 2π dl. We will show that

    Σ_{j=-∞}^{∞} e^{i2πlj} = Σ_{j=-∞}^{∞} δ(l - j).                        (5.13)

This is nothing but the periodic delta function. Assuming that this equality holds, we then obtain that

    v̂(k) = ∫_R Σ_j û(k + 2πl/h) δ(l - j) dl = Σ_j û(k + 2πj/h),

which is what was to be proved. It thus remains to show (5.13). Here is a simple explanation. Take the Fourier series of a 1-periodic function f(x):

    c_n = ∫_{-1/2}^{1/2} e^{-i2πnx} f(x) dx.

Then the inversion is given by

    f(x) = Σ_{n=-∞}^{∞} e^{i2πnx} c_n.

Now take f(x) to be the delta function f(x) = δ(x). We then deduce that c_n = 1, and for x ∈ (0, 1),

    Σ_{n=-∞}^{∞} e^{i2πnx} = δ(x).

Let us now extend both sides of the above relation by periodicity to the whole line R. The left-hand side is 1-periodic in x and does not change. The right-hand side now takes the form

    Σ_{j=-∞}^{∞} δ(x - j).

This proves (5.13).

Now that we have the aliasing formula, we can analyze the difference u(x) - p(x). We will do it in the L² sense, since

    ‖u(x) - p(x)‖_{L²(R)} = (1/√(2π)) ‖û(k) - p̂(k)‖_{L²(R)}.               (5.14)

We now show the following result:

Theorem 5.2 Let us assume that u(x) ∈ H^m(R) for m > 1/2. Then we have

    ‖u(x) - p(x)‖_{L²(R)} ≤ C h^m ‖u‖_{H^m(R)},                            (5.15)

where the constant C is independent of h and u.

Proof. We first realize that p̂(k) = χ(k) v̂(k) by construction, where χ(k) = 1 if k ∈ [-π/h, π/h] and χ(k) = 0 otherwise is the characteristic function of the interval [-π/h, π/h]. In other words, p(x) is band-limited: it has no high frequencies. Note in particular that if û(k) has compact support inside [-π/h, π/h], then p̂(k) = û(k) by (5.12), hence p(x) = u(x) and the interpolant is exact. When û(k) does not vanish outside [-π/h, π/h], i.e. when high-frequency content is present, we have to use the regularity of the function u(x) to control the error. This is what we now do. Using (5.12), we deduce that

    ∫_R |û(k) - p̂(k)|² dk = ∫_R (1 - χ(k)) |û(k)|² dk + ∫_{-π/h}^{π/h} | Σ_{j∈Z*} û(k + 2πj/h) |² dk.

We have denoted by Z* = Z\{0}. Looking carefully at the above terms, one realizes that the error only involves û(k) for values of k outside the interval [-π/h, π/h]. The first term is easy to deal with. This is simply based on realizing that |hk| ≥ π on the support of 1 - χ(k). Indeed we have

    ∫_R (1 - χ(k)) |û(k)|² dk ≤ C h^{2m} ∫_R (1 + k^{2m}) |û(k)|² dk ≤ C h^{2m} ‖u‖²_{H^m(R)}.

The second term is a little more painful. The Cauchy-Schwarz inequality tells us that

    | Σ_j a_j b_j |² ≤ ( Σ_j |a_j|² ) ( Σ_j |b_j|² ).

This is a mere generalization of |x·y| ≤ |x||y|. Let us choose

    a_j = û(k + 2πj/h) (1 + |k + 2πj/h|)^m,    b_j = (1 + |k + 2πj/h|)^{-m}.

We thus obtain that

    | Σ_{j∈Z*} û(k + 2πj/h) |² ≤ ( Σ_{j∈Z*} |û(k + 2πj/h)|² (1 + |k + 2πj/h|)^{2m} ) ( Σ_{j∈Z*} (1 + |k + 2πj/h|)^{-2m} ).

Notice that, for |k| ≤ π/h,

    Σ_{j∈Z*} (1 + |k + 2πj/h|)^{-2m} ≤ C h^{2m} Σ_{j∈Z*} 1/(1 + |j|)^{2m}.

For m > 1/2, the above sum is convergent, so that

    Σ_{j∈Z*} (1 + |k + 2πj/h|)^{-2m} ≤ C h^{2m},

where C depends on m but not on u or h.

This implies that

    ∫_{-π/h}^{π/h} | Σ_{j∈Z*} û(k + 2πj/h) |² dk
        ≤ C h^{2m} ∫_{-π/h}^{π/h} Σ_{j∈Z*} |û(k + 2πj/h)|² (1 + |k + 2πj/h|)^{2m} dk
        ≤ C h^{2m} ∫_R |û(k)|² (1 + |k|)^{2m} dk = C h^{2m} ‖u‖²_{H^m}.

This concludes the proof of the theorem.

Let us emphasize something we saw in the course of the above proof. When u(x) is band-limited, so that û(k) = 0 outside [-π/h, π/h], we have that the interpolant is exact, namely p(x) = u(x). Moreover, we have seen that the interpolant is determined by the values v_j = u(x_j). This proves the extremely important Shannon-Whittaker sampling theorem:

Theorem 5.3 Let us assume that the function u(x) is such that û(k) = 0 for k outside [-π/h, π/h]. Then u(x) is completely determined by its values v_j = u(hj) for j ∈ Z. More precisely, we have

    u(x) = Σ_{j=-∞}^{∞} u(hj) S_h(x - jh),                                 (5.16)

where S_h is the sinc function defined in (5.6), S_h(x) = sin(πx/h)/(πx/h).

This result has quite important applications in communication theory. Assuming that a signal has no high frequencies (such as in speech, since the human ear cannot detect frequencies above 25kHz), there is an adapted sampling value h such that the signal can be reconstructed from the values it takes on the grid hZ. This means that a continuous signal can be compressed onto a discrete grid with no error. When h is chosen too large, so that frequencies above π/h are present in the signal, then the interpolant is not exact; this is the aliasing error, and the compression of the signal has some error, quantified by Theorem 5.2.

Let us finish this section with an analysis of the error between the derivatives of u(x) and those of p(x). Remember that our initial objective is to obtain an approximation of u′(x_j) by using p′(x_j). Minor modifications in the proof of Theorem 5.2 yield the following result:

Theorem 5.4 Let us assume that u(x) ∈ H^m(R) for m > 1/2 and let n ≤ m. Then we have

    ‖u^{(n)}(x) - p^{(n)}(x)‖_{L²(R)} ≤ C h^{m-n} ‖u‖_{H^m(R)},            (5.17)

where the constant C is independent of h and u.

The proof of this theorem consists of realizing that

    ‖u^{(n)}(x) - p^{(n)}(x)‖_{L²(R)} = (1/√(2π)) ‖(ik)^n (û(k) - p̂(k))‖_{L²(R)}.

Moreover, we have

    ∫_R |k|^{2n} |û(k) - p̂(k)|² dk = ∫_R |k|^{2n} (1 - χ(k)) |û(k)|² dk + ∫_{-π/h}^{π/h} |k|^{2n} | Σ_{j∈Z*} û(k + 2πj/h) |² dk.

The first term on the right-hand side is bounded by

    h^{2(m-n)} ∫_R (1 + k^{2m}) |û(k)|² dk = h^{2(m-n)} ‖u‖²_{H^m},

The second term is bounded by

    h^{-2n} ∫_{-π/h}^{π/h} | Σ_{j∈Z*} û(k + 2πj/h) |² dk,

which is bounded, as we saw in the proof of Theorem 5.2, by h^{-2n} h^{2m} ‖u‖²_{H^m}. This proves the theorem.

This result is important for us: it shows that the error one makes by approximating the n-th order derivative of u(x) by using p^{(n)}(x) is of order h^{m-n}, where m is the H^m regularity of u. So when u is very smooth, p^{(n)}(x) converges extremely fast to u^{(n)}(x) as h → 0. This is the greatest advantage of spectral methods.

5.5 Application to PDE's

An elliptic equation. Consider the simplest of elliptic equations,

    -u″(x) + σu(x) = f(x),   x ∈ R,                                        (5.18)

where the absorption σ > 0 is a constant positive parameter. Define α = √σ. We verify that

    G(x) = (1/2α) e^{-α|x|}                                                (5.19)

is the Green function of the above problem, whose solution is thus

    u(x) = ∫_R G(x - y) f(y) dy = (1/2α) ∫_R e^{-α|x-y|} f(y) dy.          (5.20)

Note that this may also be obtained in the Fourier domain, where

    û(ξ) = f̂(ξ) / (σ + ξ²).                                                (5.21)

Discretizing the above equation using the second-order finite difference method,

    -(U(x + h) + U(x - h) - 2U(x)) / h² + σ U(x) = f(x),                   (5.22)

we obtain, as in (3.19) in section 3, an error estimate of the form

    ‖u - U‖_{L²} ≤ C h² ‖f‖_{H²}.                                          (5.23)
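A numerical check in Python of the Green function claim (5.19)-(5.21): sampling G(x) = e^{-α|x|}/(2α) on a large grid and taking its discrete Fourier transform should reproduce 1/(σ + ξ²). The box size, grid, and tolerance below are our assumptions:

```python
import numpy as np

# Check that the Fourier transform of G(x) = exp(-alpha*|x|)/(2*alpha),
# alpha = sqrt(sigma), is 1/(sigma + xi^2), matching (5.21).
sigma = 4.0
alpha = np.sqrt(sigma)
L, n = 80.0, 4096                        # large box standing in for R
x = (np.arange(n) - n//2) * (L/n)
xi = 2*np.pi*np.fft.fftfreq(n, d=L/n)
G = np.exp(-alpha*np.abs(x)) / (2*alpha)
# ifftshift moves x = 0 to index 0 so the DFT approximates the continuous FT.
G_hat = np.real(np.fft.fft(np.fft.ifftshift(G))) * (L/n)
```

The zero-frequency value G_hat[0] approximates ∫G = 1/σ, and the whole curve tracks 1/(σ + ξ²) to within the quadrature and periodization error.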

Let us now compare this result to the discretization based on the spectral method. Spectral methods are based on replacing functions by their spectral interpolants. Let p(x) be the spectral interpolant of u(x) and F(x) the spectral interpolant of the source term f(x). The equation we now use to deduce p(x) from F(x) is equation (5.18):

    -p″(x) + σ p(x) = F(x),   x ∈ R.                                       (5.24)

Now p(x) and F(x) only depend on the values they take at hj for j ∈ Z. Note that this problem may be solved by inverting a matrix of the form -N² + σI, where N is the matrix of differentiation. The error u - p solves the equation

    -(u - p)″(x) + σ(u - p) = (f - F)(x).

Looking at this in the Fourier domain (and coming back), we easily deduce that

    ‖u - p‖_{H²(R)} ≤ C ‖f - F‖_{L²(R)} ≤ C h^m ‖f‖_{H^m(R)},

where the latter inequality comes from Theorem 5.2. Here m is arbitrary provided that f is arbitrarily smooth. We thus obtain that the spectral method has an "infinite" order of accuracy, unlike finite difference methods. When the source terms and the solutions of the PDEs we are interested in are smooth (spatially), the convergence properties of spectral methods are much better than those of finite difference methods.

A parabolic equation. Let us now apply the above theory to the computation of solutions of the heat equation

    ∂u/∂t - ∂²u/∂x² = 0,   x ∈ R,

on the real line R. Let us consider the Euler explicit scheme based on a spectral approximation of the spatial derivatives. We introduce U_j^n, the approximate solution at time T_n = n∆t, 1 ≤ n ≤ N, and point x_j = hj for j ∈ Z. The interpolant at time T_n is given by

    pⁿ(x) = Σ_{j=-∞}^{∞} U_j^n S_h(x - jh).

We may now calculate the second-order derivative of pⁿ(x) (see the following paragraph for an explicit expression), so that the approximation to the second derivative is given by

    (pⁿ)″(kh) = Σ_{j=-∞}^{∞} S_h″((k - j)h) U_j^n = Σ_{j=-∞}^{∞} S_h″(jh) U_{k-j}^n.

So the Euler explicit scheme is given by

    (U_j^{n+1} - U_j^n) / ∆t = Σ_{k=-∞}^{∞} S_h″(kh) U_{j-k}^n.            (5.25)

It now remains to see how this scheme behaves. We shall only consider stability here (i.e., see how ∆t and h have to be chosen so that the discrete solution does not blow up as n → T/∆t). As far as convergence is concerned, the scheme will be first order in time (accuracy of order ∆t) and of spectral accuracy (infinite order) in space, provided that the initial solution is sufficiently smooth.
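A Python sketch of the Euler explicit scheme with spectral spatial derivatives, on a periodic grid as a computable stand-in for hZ (run_euler and the parameter choices are ours). In Fourier space one step multiplies each mode by 1 - ∆t ξ², so boundedness requires ∆t ≤ 2/ξ_max² with ξ_max = π/h, i.e. ∆t = O(h²):

```python
import numpy as np

# Euler explicit in time, spectral in space, for u_t = u_xx on a periodic grid.
def run_euler(dt_factor, n=64, steps=200):
    h = 2*np.pi / n
    x = h * np.arange(n)
    xi = np.fft.fftfreq(n, d=h) * 2*np.pi
    dt = dt_factor * 2 / np.max(xi**2)           # dt_factor < 1: stable
    u = np.exp(np.cos(x)) + 1e-6*np.cos(x*n/2)   # seed the highest mode too
    for _ in range(steps):
        u = u + dt * np.real(np.fft.ifft(-xi**2 * np.fft.fft(u)))
    return np.max(np.abs(u))
```

With dt_factor below 1 the solution stays bounded; above 1 the highest mode grows like |1 - ∆t ξ_max²|ⁿ and the run blows up, in line with the stability analysis that follows.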

This is plausible from the analysis of the error made by replacing a function by its interpolant that we made earlier.

Let us come back to the stability issue. As we did for finite differences, we realize that we can look at a scheme defined for all x ∈ R and not only x = xⱼ:

(Uⁿ⁺¹(x) − Uⁿ(x))/Δt = Σ_{k=−∞}^{∞} S″_h(kh) Uⁿ(x − hk).   (5.26)

Once we have realized this, we can pass to the Fourier domain and obtain that

Ûⁿ⁺¹(ξ) = (1 + Δt Σ_{k=−∞}^{∞} S″_h(kh) e^{−ikhξ}) Ûⁿ(ξ).   (5.27)

So the stability of the scheme boils down to making sure that

R(ξ) = 1 + Δt Σ_{k=−∞}^{∞} S″_h(kh) e^{−ikhξ}   (5.28)

stays between −1 and 1 for all frequencies ξ. The analysis is more complicated than for finite differences because of the infinite sum that appears in the definition of R. It turns out that we can have access to an exact formula. We first realize that

S″_h(kh) = −π²/(3h²) for k = 0,  S″_h(kh) = (−1)^{k+1} 2/(h²k²) for k ≠ 0.

Exercise 5.5 Check this.

So we can recast R(ξ) as

R(ξ) = 1 − Δt π²/(3h²) − (2Δt/h²) Σ_{k≠0} e^{i(hξ−π)k}/k².

Exercise 5.6 Check this.

We now need to understand the remaining infinite sum. Here is a way to analyze R(ξ). It is based on using the Poisson formula

Σ_j e^{i2πlj} = Σ_j δ(l − j),

which we recast for l ∈ [−1/2, 1/2] as

Σ_{j≠0} e^{i2πlj} = δ(l) − 1.

Notice that both terms on the right-hand side have zero mean on [−1/2, 1/2] (i.e. there is no 0 frequency). We now integrate these terms in l. This is justified as the delta function is concentrated at one point: its integral is a piecewise linear function (namely H(l) − l − 1/2) and is discontinuous, and its second integral is continuous (and is piecewise quadratic). We obtain that

Σ_{j≠0} e^{i2πlj}/(i2πj) = H(l) − l − d₀,

where H(l) is the Heaviside function, such that H(l) = 1 for l > 0 and H(l) = 0 for l < 0. The constant d₀ is now fixed by making sure that the above right-hand side remains 1-periodic (as is the left-hand side). Another way of looking at it is to make sure that both sides have zero mean on (−1/2, 1/2). We thus find that d₀ = 1/2.

Exercise 5.7 Check this.

Let us integrate one more time. We now obtain that

Σ_{j≠0} e^{i2πlj}/(−4π²j²) = lH(l) − l²/2 − l/2 − d₁,

where d₁ is again chosen so that the right-hand side has zero mean. We obtain that d₁ = 1/12.

Exercise 5.8 Check this.

To sum up our calculation, we have obtained that

Σ_{j≠0} e^{i2πlj}/j² = 4π² (l²/2 + 1/12 − l(H(l) − 1/2)).

Notice that this function is 1-periodic and continuous. This shows that

R(ξ) = 1 − (Δt/h²) [π²/3 + 8π² (l²/2 + 1/12 − l(H(l) − 1/2))],  l = hξ/(2π) − 1/2.   (5.29)

A plot of the function l → l²/2 + 1/12 − l(H(l) − 1/2) is given in Fig. 5.1.

Exercise 5.9 Check that l(H(l) − 1/2) = |l|/2 on [−1/2, 1/2]. Deduce that

R(ξ) = 1 − (Δt/h²) π² (1 + 4l(1 − |l|)) = 1 − Δt ((π/h)(1 + 2l))² =: 1 − Δt ξ̄²,

so that R(ξ) = 1 − Δt ξ² for 0 ≤ |ξ| ≤ π/h. [Check the above for hξ ≤ π and use (5.28) to show that R(ξ) = R(−ξ) and that R(ξ) is 2π/h-periodic.]
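The summation formula obtained above can be checked numerically by truncating the series. A minimal sketch (the truncation level and the sample points l are arbitrary choices):

```python
import numpy as np

# Numerical check of the identity
#   sum_{j != 0} e^{i 2 pi l j} / j^2 = 4 pi^2 (l^2/2 + 1/12 - l (H(l) - 1/2)),
# obtained by integrating the Poisson summation formula twice.
def lhs(l, n_terms=200000):
    j = np.arange(1, n_terms + 1)
    # the sum is real: pair the terms j and -j into a cosine series
    return 2 * np.sum(np.cos(2 * np.pi * l * j) / j**2)

def rhs(l):
    H = 1.0 if l > 0 else 0.0
    return 4 * np.pi**2 * (l**2 / 2 + 1.0 / 12 - l * (H - 0.5))

for l in [-0.4, -0.25, 0.1, 0.3, 0.5]:
    print(l, lhs(l), rhs(l))  # the two columns agree up to the truncation error
```

At l = 1/2 the identity reduces to the classical value Σ (−1)ʲ/j² = −π²/12.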

Figure 5.1: Plot of the function l → f(l) = l²/2 + 1/12 − l(H(l) − 1/2), the second periodic integral of the periodic delta function. The maximal value is 1/12, reached at l = 0, and the minimal value is −1/24, reached at l = ±1/2.

Since R(ξ) is a periodized version of the symbol 1 − Δtξ², we deduce that

1 − π²Δt/h² ≤ R(ξ) ≤ 1,

and that the minimal value is reached for l = 0, i.e. for frequencies

ξ = (2π/h)(m + 1/2),  m ∈ Z,

while the maximal value is reached for l = ±1/2, i.e. for frequencies

ξ = 2πm/h,  m ∈ Z.

Notice that R(0) = 1, as it should be, and that R(π/h) = 1 − π²Δt/h², which is a quite bad approximation of e^{−Δtπ²/h²} for the exact solution. But as usual, we do not expect frequencies of order h⁻¹ to be well approximated on a grid of size h.

Since we need |R(ξ)| ≤ 1 for all frequencies to obtain a stable scheme, we see that

λ = Δt/h² ≤ 2/π²   (5.30)

is necessary to obtain a convergent scheme as h → 0 with λ fixed.
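The stability threshold (5.30) can be observed experimentally. The sketch below uses a periodic grid as a stand-in for the real line, so that Euler's explicit step multiplies each Fourier mode by its amplification factor 1 − Δtξ̄²; the grid size, step count, and random initial data are illustrative choices.

```python
import numpy as np

# Experimental check of the stability bound lambda = dt/h^2 <= 2/pi^2 (5.30):
# Euler's explicit scheme with the spectral Laplacian on a periodic grid.
def run(lam, n=64, steps=400):
    L = 2 * np.pi
    h = L / n
    dt = lam * h**2
    k = 2 * np.pi * np.fft.fftfreq(n, d=h)
    growth = 1 - dt * k**2            # amplification factor per mode
    u_hat = np.fft.fft(np.random.default_rng(0).standard_normal(n))
    for _ in range(steps):
        u_hat = growth * u_hat
    return np.max(np.abs(u_hat))

lam_c = 2 / np.pi**2
print(run(0.9 * lam_c))   # below the threshold: stays bounded
print(run(1.1 * lam_c))   # above the threshold: blows up
```

Just below λ = 2/π² every factor satisfies |1 − Δtk²| ≤ 1; just above it, the modes near |ξ| = π/h are amplified at every step.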

6 Introduction to finite element methods

We now turn to one of the most useful methods in numerical simulations: the finite element method. This will be very introductory. The material mostly follows the presentation in [1].

6.1 Elliptic equations

Let Ω be a convex open domain in R² with boundary ∂Ω. We will assume that ∂Ω is either sufficiently smooth or a polygon. On Ω we consider the boundary value problem

−Δu(x) + σ(x)u(x) = f(x),  x = (x₁, x₂) ∈ Ω,
u(x) = 0,  x ∈ ∂Ω.   (6.1)

The Laplace operator Δ is defined as

Δ = ∂²/∂x₁² + ∂²/∂x₂².

We assume that the absorption coefficient σ(x) ∈ L^∞(Ω) satisfies the constraint

σ(x) ≥ σ₀ > 0,  x ∈ Ω̄.   (6.2)

We also assume that the source term f ∈ L²(Ω) (i.e., is square integrable). This subsection is devoted to showing the existence of a solution to the above problem and recalling some elementary properties that will be useful in the analysis of discretized problems.

Let us first assume that there exists a sufficiently smooth solution u(x) to (6.1). Let v(x) be a sufficiently smooth test function defined on Ω and such that v(x) = 0 for x ∈ ∂Ω. Upon multiplying (6.1) by v(x) and integrating over Ω, we obtain by integrations by parts, using Green's formula, that

a(u, v) := ∫_Ω (∇u·∇v + σuv) dx = ∫_Ω f v dx := L(v).   (6.3)

Exercise 6.1 Prove the above formula.

Let us introduce a few Hilbert spaces. For k ∈ N, we define

H^k(Ω) ≡ H^k = { v ∈ L²(Ω); D^α v ∈ L²(Ω) for all |α| ≤ k }.   (6.4)

We recall that for a multi-index α = (α₁, α₂) (in two space dimensions), with αₖ ∈ N for 1 ≤ k ≤ 2, we denote |α| = α₁ + α₂ and

D^α = ∂^{α₁}/∂x₁^{α₁} ∂^{α₂}/∂x₂^{α₂}.

Thus v ∈ H^k ≡ H^k(Ω) if all its partial derivatives of order up to k are square integrable in Ω. The Hilbert space is equipped with the norm

‖u‖_{H^k} = ( Σ_{|α|≤k} ‖D^α u‖²_{L²} )^{1/2}.   (6.5)

We also define the semi-norm on H^k:

|u|_{H^k} = ( Σ_{|α|=k} ‖D^α u‖²_{L²} )^{1/2}.   (6.6)

Note that the above semi-norm is not a norm: all polynomials p_{k−1} of order at most k − 1 are such that |p_{k−1}|_{H^k} = 0.

Because of the choice of boundary conditions in (6.1), we also need to introduce the Hilbert space

H¹₀(Ω) = { v ∈ H¹(Ω); v|_{∂Ω} = 0 }.   (6.7)

We can show that the above space is a closed subspace of H¹(Ω), where both spaces are equipped with the norm ‖·‖_{H¹}. Showing this is not completely trivial and requires understanding the operator that restricts a function defined on Ω to the values it takes at the boundary ∂Ω. The restriction is defined for sufficiently smooth functions only (being an element of H¹ is sufficient, though not necessary). For instance, “the restriction to ∂Ω of a function in L²(Ω)” means nothing. This is because L² functions are defined up to modifications on sets of measure zero (for the Lebesgue measure), and ∂Ω is precisely a set of measure zero.

We can show that (6.3) holds for any function v ∈ H¹₀(Ω). This allows us to “replace” the problem (6.1) by the new problem in variational form:

Find u ∈ H¹₀(Ω) such that for all v ∈ H¹₀(Ω),  a(u, v) = L(v).   (6.8)

Theorem 6.1 There is a unique solution to the above problem (6.8).

Proof. The theorem is a consequence of the Riesz representation theorem. The main ingredients of the proof are the following. First, note that a(u, v) is a symmetric bilinear form on the Hilbert space V = H¹₀(Ω), in the sense that a(u, v) = a(v, u) for u, v ∈ V. It is clearly linear in u and v, and one verifies that a(u, v) is also positive definite, in the sense that a(v, v) > 0 for all v ∈ V such that v ≠ 0. Since a(u, v) is a symmetric, positive definite bilinear form on V, it defines an inner product on V (by definition of an inner product). The bilinear form a(u, v) is even better than this: it is coercive in V, in the sense that there exists α > 0 such that

a(u, u) ≥ α ‖u‖²_V,  for all u ∈ V.   (6.9)

Indeed, from the definition of a(u, v) and the constraint on σ, we can choose α = min(1, σ₀). The norm associated to this inner product is given by

‖u‖_a = a^{1/2}(u, u).   (6.10)

The Riesz representation theorem then precisely states the following: Let V be a Hilbert space with an inner product a(·, ·). Then for each bounded linear form L on V, there is a unique u ∈ V such that L(v) = a(u, v) for all v ∈ V. Theorem 6.1 is thus equivalent

to showing that L(v) is bounded on V, which means the existence of a constant ‖L‖_{V*} such that |L(v)| ≤ ‖L‖_{V*} a^{1/2}(v, v). However, we have

|L(v)| ≤ ‖f‖_{L²} ‖v‖_{L²} = (‖f‖_{L²}/√σ₀)(√σ₀ ‖v‖_{L²}) ≤ (‖f‖_{L²}/√σ₀) a^{1/2}(v, v).

This yields the bound for L and the existence of a solution to (6.8).

Now assume that u and w are solutions. Then by taking differences in (6.8), we deduce that a(u − w, v) = 0 for all v ∈ V. Choose now v = u − w to get that a(u − w, u − w) = 0. However, since a is positive definite, this implies that u − w = 0, and the uniqueness of the solution.

Remark 6.2 The solution to the above problem admits another interpretation. Indeed, let us define the functional

J(u) = (1/2) a(u, u) − L(u).   (6.11)

We can then show that u is a solution to (6.8) if and only if u is the minimum of J(u) on V = H¹₀.

Remark 6.3 Theorem 6.1 admits a very simple proof based on the Riesz representation theorem. However it requires a(u, v) to be symmetric, which may not be the case in practice. The Lax-Milgram theory is then the tool to use. What it says is that, provided that a(u, v) is coercive and bounded on a Hilbert space V (equipped with norm ‖·‖_V), in the sense that there are two constants C₁ and α such that

α ‖v‖²_V ≤ a(v, v),  |a(v, w)| ≤ C₁ ‖v‖_V ‖w‖_V,  for all v, w ∈ V,   (6.12)

and provided that the functional L(v) is bounded on V, in the sense that there exists a constant ‖L‖_{V*} such that

|L(v)| ≤ ‖L‖_{V*} ‖v‖_V,  for all v ∈ V,   (6.13)

then the abstract problem a(u, v) = L(v) for all v ∈ V admits a unique solution u ∈ V. This abstract theory is extremely useful in many practical problems.

Regularity issues. The above theory gives us a weak solution u ∈ H¹₀ to the problem (6.1). The solution is “weak” because the elliptic problem should be considered in its variational form (6.3) rather than its PDE form (6.1). The Laplacian of a function in H¹₀ is not necessarily a function, so u is not a strong solution (which is defined as a C² solution of (6.1)). It thus remains to understand whether the regularity of u ∈ H¹₀ is optimal. The answer is no in most cases. What we can show is that for smooth boundaries ∂Ω, and for convex polygons ∂Ω, when f ∈ L² we have

‖u‖_{H²} ≤ C ‖f‖_{L²}.   (6.14)

That is, for sufficiently smooth ∂Ω, or for very special polygons ∂Ω (such as rectangles), the second-order partial derivatives of u are also in L² when f ∈ L²; from (6.14) we thus deduce that u ∈ H² ∩ H¹₀. This may be generalized as follows. For sufficiently smooth ∂Ω, we have the following result:

‖u‖_{H^{k+2}} ≤ C ‖f‖_{H^k},  k ≥ 0.   (6.15)

The main aspect of this regularity result is that u has two more degrees of regularity than f. This means that the solution u may be quite regular provided that the source term f is sufficiently regular. Deriving such results is beyond the scope of these notes and we will simply admit them. In dimension d = 2, a classical Sobolev inequality theorem says that a function u ∈ H^k for k > d/2, where d is space dimension, is also of class C⁰. Applied to the second derivatives of u, this means that as soon as f ∈ H^k for k > 1, then u is of class C², i.e., is a strong solution of (6.1).

As we saw for finite difference and spectral methods, the regularity of the solution dictates the accuracy of the discretized solution. The above regularity results will thus be of crucial importance in the analysis of finite element methods, which we take up now.

6.2 Galerkin approximation

The above theory of existence is fundamentally different from what we used in the finite difference and spectral methods: we no longer have (6.1) to discretize but rather (6.3). The main difficulty in solving (6.3) is that V = H¹₀ is infinite dimensional. This however gives us the right idea to discretize (6.3): simply replace V by a discrete approximation Vₕ and solve (6.3) in Vₕ rather than V. The Galerkin method is based on choosing finite dimensional subspaces Vₕ ⊂ V. In our problem (6.1), this means choosing Vₕ as subsets of H¹₀. This requires some care: functions in H¹₀ have derivatives in L². So piecewise constant functions, for instance, are not elements of H¹₀ (because they can jump, and the derivatives of jumps are too singular to be elements of L²), and Vₕ can therefore not be constructed using piecewise constant functions. We then have to choose the family of spaces Vₕ, for some parameter h → 0 (think of h as a mesh size), such that the discrete solution converges to u in a reasonable sense.

Let us assume that we have a sequence of finite dimensional subspaces Vₕ ⊂ V. Then the spaces Vₕ are Hilbert spaces. Assuming that the bilinear form a(u, v) is coercive and bounded on V (see Rem. 6.3), then clearly it remains coercive and bounded on Vₕ (because Vₕ is specifically chosen as a subset of V!). Moreover, when a(u, v) is symmetric, it defines a norm (6.10) on V and Vₕ. Theorem 6.1 and its generalization in Rem. 6.3 then hold with V replaced by Vₕ. The same theory as for the continuous problem thus directly gives us existence and uniqueness of a solution uₕ to the discrete problem:

Find uₕ ∈ Vₕ such that a(uₕ, v) = L(v) for all v ∈ Vₕ.   (6.16)

It remains to obtain an error estimate for the problem on Vₕ.

Lemma 6.4 Assume that a(u, v) is symmetric and coercive on V. Then uₕ, the solution to (6.16), is the best approximation to the solution u of (6.3) for the ‖·‖_a norm:

‖u − uₕ‖_a = min_{v ∈ Vₕ} ‖u − v‖_a.   (6.17)

Proof. There are two main ideas here. The first idea is that since Vₕ ⊂ V, we can subtract (6.16) from (6.8) to get the following orthogonality condition:

a(u − uₕ, v) = 0,  for all v ∈ Vₕ.   (6.18)

Note that this property only depends on the linearity of u → a(u, v) and v → a(u, v), and not on a being symmetric. Now, because a(u, u) is an inner product (so that a(u, v) ≤ ‖u‖_a ‖v‖_a), we verify that for all v ∈ Vₕ,

‖u − uₕ‖²_a = a(u − uₕ, u − uₕ) = a(u − uₕ, u − v) ≤ ‖u − uₕ‖_a ‖u − v‖_a.

If ‖u − uₕ‖_a = 0, then the lemma is obviously true. If not, we can divide both sides in the above inequality by ‖u − uₕ‖_a to get that ‖u − uₕ‖_a ≤ ‖u − v‖_a for all v ∈ Vₕ. This yields the lemma.

We have equipped V (and Vₕ) with two different, albeit equivalent, norms. We recall that two norms ‖·‖_a and ‖·‖_V are equivalent if there exists a constant C such that

C⁻¹ ‖v‖_V ≤ ‖v‖_a ≤ C ‖v‖_V,  for all v ∈ V.   (6.19)

Exercise 6.2 In our problem (6.1), where V = H¹₀, show that the two norms ‖·‖_a and ‖·‖_V are equivalent.

So we have obtained that uₕ is the best approximation of u for the norm ‖·‖_a, but not for the norm ‖·‖_V. However, up to a multiplicative constant (greater than 1, obviously), uₕ is still very close to being an optimal approximation of u, this time for the norm ‖·‖_V: ‖u − uₕ‖_{H¹} is close to the best possible approximation of u by functions in Vₕ.

Remark 6.5 When a(u, v) satisfies the hypotheses given in Rem. 6.3 but is not symmetric, the above result does not hold (a(u, v) no longer generates a norm). However, the orthogonality condition (6.18) still holds when a(u, v) is coercive (whether it is symmetric or not), and although uₕ is no longer the best approximation of u in any norm, the estimate (6.17) may be replaced by an estimate of the form (6.20) below for the norm ‖·‖_V.

Exercise 6.3 Prove this.

Let us summarize our results. We have replaced the variational formulation for u ∈ V by a discrete variational formulation for uₕ ∈ Vₕ. We have shown that the discrete problem indeed admits a unique solution (which requires solving a problem of the form Ax = b, since the problem is linear and finite dimensional), and that the solution uₕ is the best approximation to u in the ‖·‖_a norm when a(u, v) is symmetric. Note that the “closeness” is all

relative: when α is very small, an error bound of the form

‖u − uₕ‖_V ≤ (C₁/α) min_{v ∈ Vₕ} ‖u − v‖_V   (6.20)

is not necessarily very accurate. An error bound of the form (6.17) is, however, always optimal. In any event, we need additional information on the spaces Vₕ if one wants to go further. We can go further in two directions: see how small ‖u − uₕ‖_{H¹} is for specific sequences of spaces Vₕ, but also see how small u − uₕ is in possibly other norms one may be interested in, such as ‖u − uₕ‖_{L²}.

6.3 An example of finite element method

We are now ready to introduce a finite element method. We now consider a specific example of Galerkin approximation: the finite element method (although we should warn the reader that many finite element methods are not of the Galerkin form, because they are based on discretization spaces Vₕ that are not subspaces of V). Here we consider the simplest finite element method. Note that finite element methods could also be based on different tilings of Ω, for instance based on quadrilaterals, or higher order polygons. We only consider triangles here.

To simplify, we assume that Ω is a rectangle. We then divide Ω into a family Tₕ of non-overlapping closed triangles. Each triangle T ∈ Tₕ is represented by three vertices and three edges. We impose that two neighboring triangles have one edge (and thus two vertices) in common, or one vertex in common. This precludes the existence of vertices in the interior of an edge. The set of all vertices in the triangulation are called the nodes of Tₕ. The index h > 0 measures the “size” of the triangles. A triangulation Tₕ of size h is thus characterized by

Ω̄ = ∪_{T ∈ Tₕ} T,  h = max_{T ∈ Tₕ} h_T,  h_T = diam(T).   (6.21)

As the triangulation gets finer and h → 0, we wish to be able to construct more accurate solutions uₕ to u, solution of (6.8).

The reason we introduce the triangulation is that we want to approximate arbitrary functions in V = H¹₀ by finite dimensional functions on each triangle T ∈ Tₕ. Because the functions in Vₕ are in H¹₀, the functions cannot be chosen on each triangle independently of what happens on the other triangles. For instance, functions that are constant on each triangle T ∈ Tₕ are not in H¹₀ and are thus not admissible. The reason is that such functions may not match across edges shared by two triangles. These jumps, however, would violate the fact that our discrete functions need to be in H¹₀. The simplest functions after piecewise constant functions are piecewise linear functions. We thus define

Vₕ = { v ∈ H¹₀; v is linear on each T ∈ Tₕ }.   (6.22)

Recall that this imposes that v = 0 on ∂Ω. The space Vₕ is a finite dimensional subspace of H¹₀ (see below for a “proof”), so that the theory given in the previous section applies. This gives us a solution uₕ to (6.16). Because the functions in Vₕ are piecewise linear and cannot jump across edges, they must be

continuous on Ω̄. So they are characterized (uniquely determined) by the values they take at each node Pᵢ ∈ Ω, 1 ≤ i ≤ Mₕ, of the triangulation Tₕ (note that since Ω is an open domain, the notation means that Mₕ is the number of internal nodes of the triangulation: the nodes Pᵢ ∈ ∂Ω should be labeled with Mₕ + 1 ≤ i ≤ Mₕ + bₕ, where bₕ is the number of nodes Pᵢ ∈ ∂Ω).

Consider the family of pyramid functions φᵢ(x) ∈ Vₕ for 1 ≤ i ≤ Mₕ defined by

φᵢ(Pⱼ) = δᵢⱼ,  1 ≤ i, j ≤ Mₕ.   (6.23)

We recall that the Kronecker delta δᵢⱼ = 1 if i = j and δᵢⱼ = 0 otherwise. Note that by assuming that φᵢ ∈ Vₕ, we have also implicitly implied that φᵢ(Pⱼ) = 0 for Pⱼ ∈ ∂Ω. Since φᵢ(x) is piecewise affine on Tₕ, it is uniquely defined by its values at the nodes Pⱼ. Note also that for Pⱼ, Pₖ ∈ ∂Ω, the function φᵢ (for Pᵢ ∉ ∂Ω) vanishes on the whole edge (Pⱼ, Pₖ), so that indeed φᵢ = 0 on ∂Ω. We can thus write any function v ∈ Vₕ as

v(x) = Σ_{i=1}^{Mₕ} vᵢ φᵢ(x),  vᵢ = v(Pᵢ).   (6.24)

Since each φᵢ is characterized by its nodal values, we easily deduce that the family is linearly independent, whence it is a basis for Vₕ. This proves that Vₕ is finite dimensional.

A linear system. We now recast the above problem for uₕ as a linear system (of the form Ax = b). Let us decompose the solution uₕ of (6.16) as

uₕ(x) = Σ_{i=1}^{Mₕ} uᵢ φᵢ(x).

The equation (6.16) may then be recast as

Σ_{j=1}^{Mₕ} a(φⱼ, φᵢ) uⱼ = a(uₕ, φᵢ) = (f, φᵢ),  1 ≤ i ≤ Mₕ.

This is nothing but the linear system Ax = b, with

Aᵢⱼ = a(φⱼ, φᵢ),  xᵢ = uᵢ,  bᵢ = (f, φᵢ),  1 ≤ i, j ≤ Mₕ.   (6.25)

In the specific problem of interest here, where a(u, v) is defined by (6.3), we have

Aᵢⱼ = ∫_Ω (∇φⱼ(x)·∇φᵢ(x) + σ(x)φⱼ(x)φᵢ(x)) dx.

There is therefore no need to discretize the absorption parameter σ(x): the variational formulation takes care of this. Solving for uₕ thus becomes a linear algebra problem, which we do not describe further here. Let us just mention that the matrix A is quite sparse, because a(φᵢ, φⱼ) = 0 unless Pᵢ and Pⱼ belong to the same edge in the triangulation. This is a crucial property to obtain efficient numerical inversions of (6.25). Let us also mention that the ordering of the points Pᵢ is important. As a rough rule of thumb, one would like the indices of points Pᵢ and Pⱼ belonging to the same edge to be such that |i − j| is as small as possible, to render the matrix A as close to a diagonal matrix as possible. The optimal ordering thus very much depends on the triangulation and is by no means a trivial problem.
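In one space dimension (an analogue of, not the triangular element described here), the system (6.25) can be assembled explicitly: A is tridiagonal, since a(φᵢ, φⱼ) ≠ 0 only for neighboring nodes. A hedged sketch, with σ constant and a manufactured solution, both illustrative assumptions:

```python
import numpy as np

# A 1D analogue of the linear system (6.25): P1 finite elements for
#   -u'' + sigma*u = f  on (0,1),  u(0) = u(1) = 0.
# sigma = 1 and the exact solution u(x) = sin(pi x) are illustrative choices.
def solve_fem(m, sigma=1.0):
    h = 1.0 / (m + 1)                     # m interior nodes P_1..P_m
    x = np.linspace(h, 1 - h, m)
    # A_ij = a(phi_j, phi_i): tridiagonal stiffness + mass contributions
    main = 2.0 / h + sigma * 2 * h / 3
    off = -1.0 / h + sigma * h / 6
    A = (np.diag(main * np.ones(m)) + np.diag(off * np.ones(m - 1), 1)
         + np.diag(off * np.ones(m - 1), -1))
    f = (np.pi**2 + sigma) * np.sin(np.pi * x)
    b = h * f                             # trapezoid quadrature for (f, phi_i)
    return x, np.linalg.solve(A, b)

x, uh = solve_fem(100)
print(np.max(np.abs(uh - np.sin(np.pi * x))))  # small nodal error
```

Note how σ enters through the “mass” contributions 2h/3 and h/6, exactly as the variational formulation dictates, with no separate discretization of σ.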

Approximation theory. So far, we have constructed a family of discrete solutions uₕ of (6.16) for various parameters h > 0. It remains to understand how small the error u − uₕ is. This error will depend on the chosen norm (we have at least three reasonable choices: ‖·‖_a, ‖·‖_{H¹}, and ‖·‖_{L²}). Since the problem (6.16) is a Galerkin discretization, we can use (6.17) and (6.19), and thus obtain, in the H¹₀ and a-norms, that u − uₕ is bounded by a constant times u − v for arbitrary v ∈ Vₕ. Finding the best approximation v of u in a finite dimensional space for a given norm would be optimal. However, this is not an easy task in general. We now consider a method which, up to a multiplicative constant (independent of h), indeed provides the best approximation. Our approximation is based on interpolation theory.

Let us assume that f ∈ L²(Ω), so that u ∈ H²(Ω) thanks to (6.14). This implies that u is a continuous function, as 2 > d/2 = 1 (by the Sobolev inequality recalled earlier). We define the interpolation operator πₕ : C(Ω̄) → Vₕ by

(πₕ v)(x) = Σ_{i=1}^{Mₕ} v(Pᵢ) φᵢ(x).   (6.26)

Note that the function v needs to be continuous, since we evaluate it at the nodes Pᵢ. (An arbitrary function v ∈ H¹₀ need not be well-defined at the points Pᵢ, although it almost is.)

Let now Tₕ be a sequence of triangulations such that h → 0 (recall that h is the maximal diameter of the triangles in the triangulation). We now want to show that πₕv − v gets smaller as h → 0. This is done by looking at πₕv − v on each triangle T ∈ Tₕ and then summing the approximations over all triangles of the triangulation. We thus need to understand the approximation at the level of one triangle. We denote by π_T v the affine interpolant of v on a triangle T. For a triangle T with vertices Pᵢ, 1 ≤ i ≤ 3, it is defined as

π_T v(x) = Σ_{i=1}^{3} v(Pᵢ) φᵢ(x),  x ∈ T,

where the φᵢ(x) are affine functions defined by φᵢ(Pⱼ) = δᵢⱼ, 1 ≤ i, j ≤ 3.

The various triangles in the triangulation may be very different, so we need a method that analyzes π_T v − v on arbitrary triangles. This is done in two steps. The first step goes as follows: we pick a triangle of reference T̂ and analyze π_T̂ v̂ − v̂ on it in various norms of interest. Then we define a map that transforms T̂ to any triangle of interest T, and push forward any properties we have found on T̂ to the triangle T through this map.

Lemma 6.6 Let T̂ be the triangle with vertices (0, 0), (1, 0), and (0, 1) in an orthonormal system of coordinates of R². Let v̂ be a function in H²(T̂) and π_T̂ v̂ its affine interpolant. Then there exists a constant C independent of v̂ such that

‖v̂ − π_T̂ v̂‖_{L²(T̂)} ≤ C |v̂|_{H²(T̂)},
‖∇(v̂ − π_T̂ v̂)‖_{L²(T̂)} ≤ C |v̂|_{H²(T̂)}.   (6.27)

We recall that |·|_{H²(T̂)} is the semi-norm defined in (6.6). We also recall that functions in H²(T̂) are continuous, so the above interpolant is well-defined for each v̂ ∈ H²(T̂).

Proof. The proof is based on judicious use of Taylor expansions. First, by definition of π_T̂, and since |v(Pᵢ)| ≤ C‖v‖_{H²} in two space dimensions, we deduce that

‖π_T̂ v‖_{H¹} ≤ C ‖v‖_{H²}.

Since we will need it later, we want to replace π_T̂ by a more convenient interpolant Π_T̂. We therefore introduce a function φ(x) ∈ C₀^∞(T̂) such that φ(x) ≥ 0 and ∫_T̂ φ(x) dx = 1. It is a classical real-analysis result that such a function exists. We define

Π_T̂ v(x) = ∫_T̂ (v(x₀) − x₀·∇v(x₀)) φ(x₀) dx₀ + ( ∫_T̂ φ(x₀) ∇v(x₀) dx₀ )·x.   (6.30)

We verify that Π_T̂ v is indeed an affine interpolant for v ∈ H¹(T̂), in the sense (thanks to the normalization of φ(x)) that Π_T̂ v = v when v is a polynomial of degree at most 1 (in x₁ and x₂), and that ‖Π_T̂ v‖_{H¹} ≤ C ‖v‖_{H¹}. Also, we verify that π_T̂ Π_T̂ v = Π_T̂ v, so that

‖v − π_T̂ v‖_{H¹} ≤ ‖v − Π_T̂ v‖_{H¹} + ‖π_T̂(Π_T̂ v − v)‖_{H¹} ≤ C ‖v − Π_T̂ v‖_{H¹} + C |v|_{H²}.

So we see that it is enough to prove the results stated in the lemma for the interpolant Π_T̂. This is based on the following Taylor expansion (with Lagrange remainder):

v(x) = v(x₀) + (x − x₀)·∇v(x₀) + |x − x₀|² ∫₀¹ (x̂·∇)² v((1 − t)x₀ + tx)(1 − t) dt,   (6.28)

where x̂ denotes the unit vector (x − x₀)/|x − x₀|. The same type of expansion yields for the gradient:

∇v(x) = ∇v(x₀) + |x − x₀| ∫₀¹ (x̂·∇)∇v((1 − t)x₀ + tx) dt.   (6.29)

The affine interpolant v(x₀) + (x − x₀)·∇v(x₀) and its derivative ∇v(x₀) both involve the gradient of v at the single point x₀, and there is no reason for |∇v(x₀)| to be bounded by C|v|_{H²(T̂)}. This causes some technical problems. The way around is to realize that x₀ may be seen as a variable that we can vary freely in T̂. Upon multiplying (6.28) and (6.29) by φ(x₀) and integrating over T̂ in x₀, we obtain that

v(x) = Π_T̂ v(x) + ∫_T̂ φ(x₀)|x − x₀|² ∫₀¹ (x̂·∇)² v((1 − t)x₀ + tx)(1 − t) dt dx₀,
∇v(x) = ∇Π_T̂ v(x) + ∫_T̂ φ(x₀)|x − x₀| ∫₀¹ (x̂·∇)∇v((1 − t)x₀ + tx) dt dx₀.   (6.31)

It remains to show that the remainders v(x) − Π_T̂ v(x) and ∇(v(x) − Π_T̂ v(x)) are indeed bounded in L²(T̂) by C|v|_{H²}. (Note that the second-order derivatives of Π_T̂ v vanish by construction.) We realize that both remainders involve bounded functions (such as |x − x₀| or 1 − t) multiplied by second-order derivatives of v. It is therefore sufficient to show a result of the form

∫_T̂ ( ∫_T̂ φ(x₀) ∫₀¹ |f(x₀ + t(x − x₀))| dt dx₀ )² dx ≤ C ‖f‖²_{L²(T̂)}

for all f ∈ L²(T̂). We extend f(x) by 0 outside T̂. By the Cauchy-Schwarz inequality (stating that (f, g)² ≤ ‖f‖² ‖g‖², where we use g = 1), the left-hand side in the above equation is bounded (because T̂ is a bounded domain) by

C ∫_T̂ ∫_T̂ φ²(x₀) ∫₀¹ f²((1 − t)x₀ + tx) dt dx₀ dx.

Let us now consider the part 1/2 < t < 1 in the above integral and perform the change of variables tx ← x. This yields

∫_{1/2}^{1} dt ∫_T̂ dx₀ φ²(x₀) ∫ dx f²((1 − t)x₀ + x)/t² ≤ ∫_{1/2}^{1} (dt/t²) ∫_T̂ dx₀ φ²(x₀) ‖f‖²_{L²(T̂)}.

This is certainly bounded by C‖f‖²_{L²(T̂)}. Now the part 0 < t < 1/2, with the change of variables (1 − t)x₀ ← x₀, similarly gives a contribution of the form

∫₀^{1/2} (dt/(1 − t)²) ‖φ²‖_{L^∞(T̂)} |T̂| ‖f‖²_{L²(T̂)},   (6.32)

where |T̂| is the surface of T̂. This is again bounded by C‖f‖²_{L²(T̂)}. These estimates show that ‖v − Π_T̂ v‖_{H¹(T̂)} ≤ C|v|_{H²(T̂)}, which was all we needed to conclude the proof of the lemma.

Once we have the estimate on T̂, we need to obtain it for arbitrary triangles. This is done as follows:

Lemma 6.7 Let T be a proper (not flat) triangle in the plane R². There is an orthonormal system of coordinates such that the vertices of T are the points (0, 0), (a, 0), and (b, c), where a, b, and c are real numbers such that a > 0 and c > 0. Let us define ‖A‖_∞ = max(a, |b|, c). Let v ∈ H²(T) and π_T v its affine interpolant. Then there exists a constant C independent of the function v and of the triangle T such that

‖v − π_T v‖_{L²(T)} ≤ C ‖A‖²_∞ |v|_{H²(T)},
‖∇(v − π_T v)‖_{L²(T)} ≤ C (‖A‖³_∞/(ac)) |v|_{H²(T)}.   (6.33)

Furthermore, let h_T be the diameter of T, and let us assume the existence of ρ_T such that

ρ_T h_T ≤ a ≤ h_T,  ρ_T h_T ≤ c ≤ h_T.   (6.34)

Then we have the estimates

‖v − π_T v‖_{L²(T)} ≤ C h²_T |v|_{H²(T)},
‖∇(v − π_T v)‖_{L²(T)} ≤ C ρ_T⁻² h_T |v|_{H²(T)}.   (6.35)

We recall that C is independent of v and of the triangle T.
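A one-dimensional analogue of the estimates (6.35) can be observed numerically: for piecewise-linear interpolation on a uniform grid of size h, the L² error decays like h² and the error on the derivative like h. The test function and quadrature below are arbitrary choices for illustration:

```python
import numpy as np

# 1D analogue of (6.35): errors of the piecewise-linear interpolant of a
# smooth v on a uniform mesh of m cells, computed with a midpoint rule.
def interp_errors(v, dv, m):
    h = 1.0 / m
    xs = np.linspace(0.0, 1.0, m + 1)
    e0sq = e1sq = 0.0
    for i in range(m):
        a = xs[i]
        slope = (v(xs[i + 1]) - v(xs[i])) / h
        t = a + h * (np.arange(20) + 0.5) / 20   # midpoints in the cell
        p = v(a) + slope * (t - a)               # the affine interpolant
        e0sq += h * np.mean((v(t) - p)**2)
        e1sq += h * np.mean((dv(t) - slope)**2)  # its derivative is constant
    return np.sqrt(e0sq), np.sqrt(e1sq)

v = lambda x: np.sin(2 * np.pi * x)
dv = lambda x: 2 * np.pi * np.cos(2 * np.pi * x)
for m in [8, 16, 32]:
    e0, e1 = interp_errors(v, dv, m)
    print(m, e0, e1)   # first column ~ h^2, second column ~ h
```

Doubling m divides the first error column by about 4 and the second by about 2, mirroring the powers h² and h in (6.35).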

Proof. We want to map the estimate on T̂ in Lemma 6.6 to the triangle T. This is achieved as follows. Let us define the linear map A : R² → R², defined in Cartesian coordinates as

A = ( a 0 ; b c ).   (6.36)

We verify that A(T̂) = T. Let us denote by x coordinates on T and by x̂ coordinates on T̂. For a function v on T, we can define its push-forward v̂ = A*v on T̂, which is equivalent to v̂(x̂) = v(x) with x = Ax̂, or v̂(x̂) = v(Ax̂). By the chain rule,

∂v̂/∂x̂ᵢ = A_{ki} ∂v/∂x_k,

where we use the convention of summation over repeated indices. Differentiating one more time, we deduce that

Σ_{|α|=2} |D^α v̂|²(x̂) ≤ C ‖A‖⁴_∞ Σ_{|α|=2} |D^α v|²(x),  x = Ax̂.   (6.37)

Here, C is a universal constant. Note that the change of measures induced by A is dx̂ = |det(A⁻¹)| dx, so that

∫_T̂ Σ_{|α|=2} |D^α v̂|²(x̂) dx̂ ≤ ‖A‖⁴_∞ |det(A⁻¹)| ∫_T Σ_{|α|=2} |D^α v|²(x) dx.   (6.38)

This is equivalent to the statement:

|v̂|_{H²(T̂)} ≤ C |det(A)|^{−1/2} ‖A‖²_∞ |v|_{H²(T)}.

This tells us how the right-hand side in (6.27) may be used in estimates on T. Let us now consider the left-hand sides. We first notice that

(π_T̂ v̂)(x̂) = (π_T v)(x),  x = Ax̂.   (6.39)

The reason is that v̂ᵢ = vᵢ at the nodes, and that for any polynomial p of order 1 in the variables x̂₁, x̂₂, p(A⁻¹x) is also a polynomial of order 1 in the variables x₁ and x₂. This implies that

(v̂ − π_T̂ v̂)(x̂) = (v − π_T v)(x),  ∇(v − π_T v)(x) = A⁻ᵀ ∇̂(v̂ − π_T̂ v̂)(x̂).

Using the same change of variables as above, this implies the relations

‖v̂ − π_T̂ v̂‖²_{L²(T̂)} = |det A⁻¹| ‖v − π_T v‖²_{L²(T)},
‖∇̂(v̂ − π_T̂ v̂)‖²_{L²(T̂)} ≥ C |det A⁻¹| ‖A⁻¹‖⁻²_∞ ‖∇(v − π_T v)‖²_{L²(T)},   (6.40)

where ‖A⁻¹‖_∞ = (ac)⁻¹ ‖A‖_∞ (i.e., the maximal element of the 2×2 matrix A⁻¹). The above inequalities, combined with (6.27) and (6.38), directly give (6.33). The estimates (6.35) are then an easy corollary.

Remark 6.8 (Form of the triangles) The first inequality in (6.35) only requires that the diameter of T be smaller than h_T. The second estimate, however, requires more. It requires that the triangles not be too flat, i.e., that all components a, b, and c be of the same order h_T. We may verify that the constraint (6.34) is equivalent to imposing that all the angles of the triangle be greater than a constant independent of h_T, and is equivalent to imposing that there be a ball of radius of order h_T inscribed in T. From now on, we impose that ρ_T ≥ ρ₀ > 0 for all T ∈ Tₕ, independent of the parameter h. This concludes this section on approximation theory.

A global error estimate in H¹₀(Ω). Now that we have local error estimates, we can sum them to obtain global error estimates. Let us come back to the general framework of Galerkin approximations. We know from (6.19) that the error u − uₕ in H¹₀ (the space in which estimates are the easiest!) is bounded by a constant times the error u − v in the same norm, where v is the best approximation of u (for this norm). So we obviously have that

‖u − uₕ‖_{H¹} ≤ C ‖u − πₕu‖_{H¹}.   (6.41)

The above right-hand side is by definition

( Σ_{T ∈ Tₕ} ∫_T (|u − π_T u|² + |∇(u − π_T u)|²) dx )^{1/2} ≤ ( Σ_{T ∈ Tₕ} ρ₀⁻⁴ h² |u|²_{H²(T)} )^{1/2},

thanks to (6.35). This, however, implies the following result.

Theorem 6.9 Let u be the solution to (6.8), which we assume is in H²(Ω). Let uₕ be the solution of the discrete problem (6.16), where Vₕ is defined in (6.22), based on the triangulation defined in (6.21). Then we have the error estimate

‖u − uₕ‖_{H¹(Ω)} ≤ C ρ₀⁻² h |u|_{H²(Ω)}.   (6.42)

Here, C is a constant independent of Tₕ, u, and uₕ. However, it may depend on the bilinear form a(u, v). We recall that ρ₀ was defined in Rem. 6.8. We may show that the order of approximation in O(h) obtained in the theorem is optimal (in the sense that h cannot be replaced by h^α with α > 1 in general). The finite element method is thus first-order for the approximation of u in the H¹ norm. The norm ‖·‖_a is equivalent to the H¹ norm, so the above estimate also holds for the ‖·‖_a norm (which is defined only when a(u, v) is symmetric).

A global error estimate in L²(Ω). The third norm we have used is the L² norm. It is reasonable to expect a faster convergence rate in the L² norm than in the H¹ norm. The reason is this: in the linear approximation framework, the gradients are approximated by piecewise constant functions, whereas the functions themselves are approximated by polynomials of order 1 on each triangle. We should therefore have a better accuracy for the functions than for their gradients. This is indeed the case. However, the demonstration is not straightforward and requires us to use the regularity of the elliptic problem one more time, in a curious way.

Theorem 6.10 Under the hypotheses of Theorem 6.9 and the additional hypothesis that (6.14) holds, we have the following error estimate

    ‖u − u_h‖_{L²(Ω)} ≤ C ρ₀⁻⁴ h² ‖u‖_{H²(Ω)}.                           (6.43)

Proof. The proof we now present is referred to as a duality argument in the literature. Let w be the solution to the problem

    a(v, w) = (v, u − u_h),   for all v ∈ V.                             (6.44)

Note that since a(u, v) is not symmetric, this problem is equivalent to (6.8) only after replacing a(u, v) by the adjoint bilinear form a*(u, v) = a(v, u). The problem (6.44) is therefore called a dual problem to (6.8). However, when a(u, v) is symmetric, (6.44) is equivalent to (6.8). In any event, the above problem (6.44) admits a unique solution thanks to Theorem 6.1 (with V = H¹₀). Moreover, our hypothesis that (6.14) holds implies that

    ‖w‖_{H²} ≤ C ‖u − u_h‖_{L²}.                                         (6.45)

Using in that order the choice v = u − u_h in (6.44), the orthogonality condition (6.18), the continuity bound in (6.12), the interpolation estimate (6.42) applied to w, the bound (6.45), and finally the error estimate (6.42) for u − u_h, we get:

    ‖u − u_h‖²_{L²} = a(u − u_h, w) = a(u − u_h, w − π_h w)
                    ≤ C ‖u − u_h‖_{H¹} ‖w − π_h w‖_{H¹}
                    ≤ C ‖u − u_h‖_{H¹} ρ₀⁻² h ‖w‖_{H²}
                    ≤ C ρ₀⁻² h ‖u − u_h‖_{H¹} ‖u − u_h‖_{L²}
                    ≤ C ρ₀⁻⁴ h² ‖u‖_{H²} ‖u − u_h‖_{L²}.

Dividing both sides by ‖u − u_h‖_{L²} concludes our result.
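Both rates, O(h) in H¹ (Theorem 6.9) and O(h²) in L² (Theorem 6.10), can be observed on the simplest model problem. The sketch below is a minimal one-dimensional P1 finite element solver for −u″ = π² sin(πx) on (0, 1) with u(0) = u(1) = 0 and exact solution u(x) = sin(πx); the load vector uses simple nodal quadrature, which is accurate enough not to pollute the rates.

```python
import numpy as np

def solve_p1(n):
    """P1 finite elements for -u'' = pi^2 sin(pi x) on (0,1), u(0)=u(1)=0."""
    h = 1.0 / n
    x = np.linspace(0.0, 1.0, n + 1)
    # Tridiagonal stiffness matrix for the n-1 interior hat functions.
    K = (2.0 * np.eye(n - 1) - np.eye(n - 1, k=1) - np.eye(n - 1, k=-1)) / h
    f = np.pi**2 * np.sin(np.pi * x[1:-1]) * h     # nodal quadrature of the load
    u = np.zeros(n + 1)
    u[1:-1] = np.linalg.solve(K, f)
    return x, u

def errors(n):
    x, u = solve_p1(n)
    h = 1.0 / n
    e = u - np.sin(np.pi * x)
    eL2 = np.sqrt(h * np.sum(e**2))                # discrete L2 norm of the error
    du = np.diff(u) / h                            # piecewise-constant gradient u_h'
    uex = lambda y: np.pi * np.cos(np.pi * y)      # exact gradient u'
    xl, xr = x[:-1], x[1:]
    xm = 0.5 * (xl + xr)
    sq = lambda y: (du - uex(y))**2
    # Simpson's rule per element for ||u_h' - u'||_{L2}^2.
    eH1 = np.sqrt(np.sum(h / 6.0 * (sq(xl) + 4.0 * sq(xm) + sq(xr))))
    return eL2, eH1

eL2_16, eH1_16 = errors(16)
eL2_32, eH1_32 = errors(32)
print("H1 ratio:", eH1_16 / eH1_32)   # roughly 2: first order in H1
print("L2 ratio:", eL2_16 / eL2_32)   # roughly 4: second order in L2
```

Halving h roughly halves the H¹-seminorm error and divides the L² error by four, in agreement with (6.42) and (6.43).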

7 Introduction to Monte Carlo methods

7.1 Method of characteristics

Let us first make a small detour by the method of characteristics. The main idea, as we have seen, is to replace a partial differential equation (PDE) by a system of ordinary differential equations (ODEs). The latter may then be discretized if necessary and solved numerically; each characteristic is obtained by solving an ordinary differential equation. Consider a hyperbolic equation of the form

    ∂v/∂t − c ∂v/∂x + αv = β,   t > 0, x ∈ R,
    v(0, x) = P(x).                                                      (7.1)

We want to solve (7.1) by the method of characteristics. The main idea is to look for solutions of the form

    v(t, x(t)),                                                          (7.2)

where x(t) is a characteristic, and where the rate of change of ω(t) = v(t, x(t)) is prescribed along the characteristic x(t). We observe that

    d/dt v(t, x(t)) = ∂v/∂t + (dx/dt) ∂v/∂x,                             (7.3)

so that the characteristic equations for (7.1) are

    dx/dt = −c,   dv/dt + αv = β.                                        (7.4)

The solutions are found to be x(t) = x₀ − ct and

    v(t, x(t)) = e^{−αt} v(0, x₀) + (β/α)(1 − e^{−αt}).

For each (t, x), we can find a unique x₀(t, x) = x + ct, so that

    v(t, x) = e^{−αt} P(x + ct) + (β/α)(1 − e^{−αt}).                    (7.5)

Many first-order linear and non-linear equations may be solved by the method of characteristics. There are many difficulties in the numerical implementation of the method. For instance, to evaluate the solution at a point x(t), we need to solve the characteristic backwards from x(t) to x(0) and evaluate the initial condition at x(0), which may not be a grid point and thus may require that we use an interpolation technique. Yet the method is very powerful as it is much easier to solve ODEs than PDEs. Since ODEs are solved, we avoid the difficulties in PDE-based discretizations caused by the presence of high frequencies that are not well captured by the schemes; see the section on finite difference discretizations of hyperbolic equations.

In the literature, grid-based PDE methods are referred to as Eulerian, whereas methods acting in the reference frame of the particles following their characteristics (which requires solving ODEs) are referred to as Lagrangian methods. The combination of grid-based and particle-based methods gives rise to the so-called semi-Lagrangian methods.
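As a sanity check, the closed-form solution (7.5) can be verified against the PDE directly. The sketch below (the choices of c, α, β and of the initial condition P are arbitrary illustrative values) evaluates the residual ∂v/∂t − c ∂v/∂x + αv − β by centered finite differences.

```python
import numpy as np

c, alpha, beta = 1.0, 0.5, 2.0          # illustrative parameters
P = lambda x: np.exp(-x**2)             # illustrative initial condition

def v(t, x):
    # Explicit solution (7.5): ride the characteristic x(t) = x0 - c t backwards.
    return np.exp(-alpha * t) * P(x + c * t) + beta / alpha * (1.0 - np.exp(-alpha * t))

# Check the PDE residual v_t - c v_x + alpha v - beta ~ 0 by centered differences.
t0, x0, d = 0.7, 0.3, 1e-5
vt = (v(t0 + d, x0) - v(t0 - d, x0)) / (2 * d)
vx = (v(t0, x0 + d) - v(t0, x0 - d)) / (2 * d)
residual = vt - c * vx + alpha * v(t0, x0) - beta
assert abs(residual) < 1e-6             # O(d^2) discretization error only
assert abs(v(0.0, x0) - P(x0)) < 1e-14  # initial condition is matched exactly
```

The residual is only limited by the O(d²) accuracy of the centered differences, confirming that (7.5) solves (7.1).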

7.2 Probabilistic representation

One of the main drawbacks of the method of characteristics is that it applies (only) to scalar first-order equations. For some equations, a quasi-method of characteristics may still be developed. For instance, we can apply it to the one-dimensional wave equation

    0 = ∂²u/∂t² − ∂²u/∂x² = (∂/∂t − ∂/∂x)(∂/∂t + ∂/∂x) u,

because of the above decomposition. However, the method does not apply to multidimensional wave equations, nor to elliptic or parabolic equations such as the Laplace and heat equations. The reason is that information does not propagate along characteristics (curves). However, a probabilistic representation still exists: instead of being based on following one characteristic, it is based on following an infinite number of characteristics generated from an appropriate probability measure and averaging over them. This is the basis for the Monte Carlo method.

More precisely, instead of looking for solutions of the form w(t, x(t)), we look for solutions of the form

    u(t, x) = E{u₀(X(t)) | X(0) = x},                                    (7.6)

where X(t) = X(t, ω) is a trajectory starting at x for realizations ω in a state space Ω, and E is ensemble averaging with respect to a measure P defined on Ω.

Let us give the most classical of examples. Let W(t) be a standard one-dimensional Brownian motion with W(0) = 0. This is an object defined on a space Ω (the space of continuous trajectories on [0, ∞)) with an appropriate measure P. All trajectories are therefore continuous (at least almost surely) but are also almost surely nowhere differentiable. The "characteristics" have thus become rather un-smooth objects. In one dimension of space, we find that

    u(t, x) = E{u₀(x + W(2t))}                                           (7.7)

solves the heat equation

    ∂u/∂t − ∂²u/∂x² = 0,   u(0, x) = u₀(x).                              (7.8)

This may be shown by e.g. Itô calculus. Very briefly, let g(W(2t)) = u₀(x + W(2t)). Then

    dg = (∂g/∂x) dW₂ₜ + (1/2)(∂²g/∂x²) dW₂ₜ dW₂ₜ = (∂g/∂x) dW₂ₜ + (∂²g/∂x²) dt,

since dW₂ₜ dW₂ₜ = d(2t). Using E{dW₂ₜ} = 0, we obtain after ensemble averaging that (7.7) solves (7.8). It is not the purpose of these notes to dwell on Itô calculus.

We thus see that u(t, x) is an infinite superposition (E is an integration) of pieces of information carried along the trajectories. We also see how this representation gives rise to the Monte Carlo method: we cannot simulate an infinite number of trajectories; however, we can certainly simulate a finite, large, number of them and perform the averaging. Here, simulating the characteristic is easy as we know the law of W(t) (a centered Gaussian variable with variance equal to t). In more general situations, where the diffusion coefficient depends on position, W(2t) is replaced by some process X(t, x) which depends in a more complicated way on the location of the origin x. Solving for the trajectories then becomes more difficult.
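The representation (7.7) can be tested by direct simulation. The sketch below takes Gaussian initial data u₀(x) = exp(−x²/(2s²)) (an illustrative choice for which the expectation E{u₀(x + W(2t))} is also available in closed form, as a Gaussian convolution) and compares the Monte Carlo average over M draws of W(2t) to that exact value.

```python
import numpy as np

# u_t - u_xx = 0, u(0,x) = exp(-x^2/(2 s^2)); then (7.7) gives, in closed form,
# u(t,x) = E{u0(x + W(2t))} = s/sqrt(s^2 + 2t) * exp(-x^2 / (2 (s^2 + 2t))).
s, t, x = 1.0, 0.25, 0.5
u0 = lambda y: np.exp(-y**2 / (2 * s**2))
exact = s / np.sqrt(s**2 + 2 * t) * np.exp(-x**2 / (2 * (s**2 + 2 * t)))

rng = np.random.default_rng(0)
M = 200_000
W2t = rng.normal(0.0, np.sqrt(2 * t), size=M)   # W(2t) ~ N(0, 2t)
estimate = np.mean(u0(x + W2t))

# Monte Carlo error is O(M^{-1/2}); with M = 2e5 this is on the order of 1e-3.
assert abs(estimate - exact) < 5e-3
```

The residual discrepancy is purely statistical and shrinks like M^{−1/2}, which is exactly the convergence rate discussed in Section 7.4.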

7.3 Random Walk and Finite Differences

Brownian motion has a relatively simple discrete version, namely the random walk. Consider a process S_k such that S₀ = 0 and S_{k+1} is equal to S_k + h with probability 1/2 and equal to S_k − h also with probability 1/2. Let us assume that we have N (time) steps and define N independent random variables x_k, 1 ≤ k ≤ N, such that x_k = ±1 with probability 1/2. Then

    S_n = h Σ_{k=1}^{n} x_k                                              (7.9)

is our random walk, and we thus have an explicit expression for S_n. Moreover, we easily find that

    E{S_n} = 0,   Var S_n = E{S_n²} = n h².                              (7.10)

The central limit theorem shows that S_n converges to W(2t) if n → ∞ and h → 0 such that nh² = 2t. Moreover, S_{αn} also converges to W(2αt) in an appropriate sense, so that there is solid mathematical background to obtain that S_n is indeed a bona fide discretization of Brownian motion. This is Donsker's invariance principle.

Let us relate all this to Bernoulli random variables. Define y_k for 1 ≤ k ≤ N, N independent random variables equal to 0 or 1 with probability 1/2, and let Σ_n = Σ_{k=1}^{n} y_k. Each sequence of n 0's and 1's for the y_k has probability 2⁻ⁿ. The number of sequences such that Σ_n = m is C(n, m) by definition (the number of possibilities of choosing m balls among n ≥ m balls). Note that S_n is even when n is even and odd when n is odd, so that (n+m)/2 is always an integer when P(S_n = hm) is non-zero. As a consequence, using x_k = 2y_k − 1, we see that

    p_m = P(S_n = hm) = 2⁻ⁿ C(n, (n+m)/2).

Exercise 7.1 Check this.

How does one use all of this to solve a discretized version of the heat equation? Following (7.7), we can define

    U_jⁿ = E{U⁰_{j + h⁻¹ S_n}},                                          (7.11)

where U_j⁰ is an approximation of u₀(jh) as before, and where U_jⁿ is an approximation of u(n∆t, jh). What is it that we have defined? For this, we need to use the idea of conditional expectations, which roughly says that averaging over two random variables may be written as an averaging over the first random variable (this is thus still a random variable) followed by an averaging over the second random variable. In our context:

    U_jⁿ = E{U⁰_{j + x₁ + … + x_n}} = E{E{U⁰_{j + x₁ + … + x_n} | x_n}}
         = (1/2) E{U⁰_{j+1 + x₁ + … + x_{n−1}}} + (1/2) E{U⁰_{j−1 + x₁ + … + x_{n−1}}}
         = (1/2) U^{n−1}_{j+1} + (1/2) U^{n−1}_{j−1}.
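The formula p_m = 2⁻ⁿ C(n, (n+m)/2) asked for in Exercise 7.1 can be checked by brute-force enumeration for a small n. A minimal sketch:

```python
from itertools import product
from math import comb

n = 6
# Enumerate all 2^n sign sequences: S_n = h (x_1 + ... + x_n) with x_k = +-1,
# so the possible values of S_n / h are the sums m below.
counts = {}
for signs in product((-1, 1), repeat=n):
    m = sum(signs)
    counts[m] = counts.get(m, 0) + 1

for m, cnt in counts.items():
    assert (n + m) % 2 == 0                          # S_n has the parity of n
    assert cnt / 2**n == comb(n, (n + m) // 2) / 2**n  # p_m = 2^{-n} C(n,(n+m)/2)
```

Every sequence with (n+m)/2 steps equal to +1 (and the rest −1) ends at m, which is exactly what the binomial coefficient counts.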

In other words, U_jⁿ solves the finite difference equation with λ = 1/2 (and ONLY this value of λ, the way the procedure is constructed). Recall that

    λ = D∆t/h² = 1/2.                                                    (7.12)

As a consequence, the procedure (7.11) gives us an approximation of

    ∂u/∂t − D ∂²u/∂x² = 0,   u(0, x) = u₀,

where D is the diffusion coefficient, when ∆t and h are chosen so that (7.12) holds.

7.4 Monte Carlo method

Solving for U_jⁿ thus requires that we construct S_n (by flipping n ±1 coins) and evaluate X := U⁰_{j + h⁻¹ S_n}. If we do it once, how accurate are we? The answer is: not very much. Let us look at the random variable X. We know that E{X} = U_jⁿ, so that X is an unbiased estimator of U_jⁿ. However, its variance is often quite large. Using (7.9), we find that P(X = U⁰_{j+m}) = p_m. As a consequence, we find that

    σ² = Var X = Σ_k p_k (U⁰_{j−k})² − ( Σ_l p_l U⁰_{j−l} )².            (7.13)

Assume that U_j⁰ = U, independent of j. Then σ² = 0 and X gives the right answer all the time. However, when U⁰ is highly oscillatory, say U⁰_l = (−1)ˡ, then Σ_l p_l U⁰_{j−l} will be close to 0 by cancellations. This shows that σ² will be close to Σ_k p_k (U⁰_{j−k})², which is comparable to, or even much larger than, E{X}. So X has a very large variance and is therefore not a very good estimator of U_jⁿ.

The (only) way to make the estimator better is to use the law of large numbers by repeating the experiment M times and averaging over the M results. More precisely, let X_k for 1 ≤ k ≤ M be M independent random variables with the same law as X. Define

    S_M = (1/M) Σ_{k=1}^{M} X_k.                                         (7.14)

Then we verify that E{S_M} = U_jⁿ as before. The variance of S_M is however significantly smaller:

    Var S_M = E{ ( (1/M) Σ_{k=1}^{M} (X_k − U_jⁿ) )² } = (1/M²) Σ_{k=1}^{M} E{(X_k − U_jⁿ)²} = (Var X)/M,

since the variables are independent. As a consequence, S_M is an unbiased estimator of U_jⁿ with a standard deviation of order O(M^{−1/2}). In order to obtain an accuracy of ε on average, we need to calculate an order of ε⁻² realizations. This is the speed of convergence of Monte Carlo.

How do the Monte Carlo and deterministic calculations compare in computing U_jⁿ (at a final time n∆t and a given position j)? Let us assume that we solve a d-dimensional heat equation.
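The equivalence between the expectation (7.11) and the λ = 1/2 explicit scheme, as well as the O(M^{−1/2}) accuracy of the estimator S_M, can both be seen in a few lines. The sketch below (the oscillatory initial data U⁰_j = cos(0.3 j) is an arbitrary illustrative choice) computes U₀ⁿ once by the deterministic recursion and once by averaging M random walks.

```python
import numpy as np

U0 = lambda j: np.cos(0.3 * j)   # illustrative initial data U^0_j

n = 10                           # number of time steps = number of coin flips
# Deterministic reference: apply U_j^n = (U_{j+1}^{n-1} + U_{j-1}^{n-1}) / 2
# n times, shrinking the stencil from j in {-n,...,n} to the single value U_0^n.
U = U0(np.arange(-n, n + 1)).astype(float)
for _ in range(n):
    U = 0.5 * (U[2:] + U[:-2])
ref = U[0]                       # exact U_0^n of the lambda = 1/2 scheme

# Monte Carlo: average X = U^0_{S_n} over M random walks started at j = 0.
rng = np.random.default_rng(1)
M = 100_000
S = rng.choice((-1, 1), size=(M, n)).sum(axis=1)   # n coin flips per walk
estimate = U0(S).mean()

# Statistical error is O(M^{-1/2}), here on the order of 1e-3.
assert abs(estimate - ref) < 1e-2
```

Increasing M by a factor of 100 only gains one decimal digit, which is the slow M^{−1/2} convergence discussed below.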

Deterministic methods require that we solve n time steps using a grid of mesh size O(h) in each direction, which means O(h⁻ᵈ) discretization points. To obtain an O(h²) = O(∆t) accuracy, the total cost of an explicit method is therefore O((∆t hᵈ)⁻¹) = O((∆t)^{−1−d/2}).

Consider now the Monte Carlo method, which involves a d-dimensional random walk (nothing but d one-dimensional random walks, one in each direction). The cost of each trajectory is of order n ≈ (∆t)⁻¹ since we need to flip n coins. To obtain an O(∆t) accuracy, we thus need M = (∆t)⁻² realizations, so that the final cost is O((∆t)⁻³). We thus see that Monte Carlo is more efficient than finite differences (or equally efficient) when

    (∆t)⁻³ ≤ (∆t)^{−1−d/2},   i.e.,   d ≥ 4.                             (7.15)

For sufficiently large dimensions, Monte Carlo methods do not suffer from the curse of dimensionality characterizing deterministic methods and thus become more efficient.

Moreover, the Monte Carlo method is much easier to parallelize: the M random realizations of X are done independently. So with a (very large) number of processors equal to M, the running time of the algorithm is of order ∆t⁻¹, which corresponds to the running time of the finite difference algorithm on a grid with a small finite number of grid points. For the finite difference algorithm, it is difficult to use parallel architectures (though this is a well-studied problem) and obtain algorithms whose running time is inversely proportional to the number of available processors. Even with a large number of processors, it is extremely difficult to attain this level of running speed using the deterministic finite difference algorithm.

Monte Carlo methods are also much more versatile, as they are based on solving "random ODEs" (stochastic ODEs), with which it is easier to account for e.g. complicated geometries. Finite differences however provide the solution everywhere, unlike Monte Carlo methods, which need to use different trajectories to evaluate solutions at different points.

Let us conclude with the following remark. The convergence of the Monte Carlo algorithm in M^{−1/2} is rather slow. Many algorithms have been developed to accelerate convergence. These algorithms are called variance-reduction algorithms and are based on estimating the solution U_jⁿ by using non-physical random processes with smaller variance than the classical Monte Carlo method described above.
