Guillaume Bal∗

October 20, 2008
Contents

1 Ordinary Differential Equations
2 Finite Differences for Parabolic Equations
  2.1 One dimensional Heat equation
  2.2 Fourier transforms
  2.3 Stability and convergence using Fourier transforms
  2.4 Application to the θ schemes
3 Finite Differences for Elliptic Equations
4 Finite Differences for Hyperbolic Equations
5 Spectral methods
  5.1 Unbounded grids and semidiscrete FT
  5.2 Periodic grids and Discrete FT
  5.3 Fast Fourier Transform (FFT)
  5.4 Spectral approximation for unbounded grids
  5.5 Application to PDE's
6 Introduction to finite element methods
  6.1 Elliptic equations
  6.2 Galerkin approximation
  6.3 An example of finite element method
7 Introduction to Monte Carlo methods
  7.1 Method of characteristics
  7.2 Probabilistic representation
  7.3 Random Walk and Finite Differences
  7.4 Monte Carlo method
∗ Department of Applied Physics & Applied Mathematics, Columbia University, New York NY, 10027, gb2030@columbia.edu, http://www.columbia.edu/~gb2030
1 Ordinary Differential Equations

An Ordinary Differential Equation is an equation of the form

    dX(t)/dt = f(t, X(t)),   t ∈ (0, T),
    X(0) = X_0.                                                (1.1)

We have existence and uniqueness of the solution when f(t, x) is continuous and Lipschitz with respect to its second variable, i.e., there exists a constant C independent of t, x, y such that

    |f(t, x) − f(t, y)| ≤ C|x − y|.                            (1.2)

If the constant is independent of time for t > 0, we can even take T = +∞. Here X and f(t, X(t)) are vectors in R^n for n ∈ N, and | · | is any norm in R^n.
Remark 1.1 Throughout the text, we use the symbol "C" to denote an arbitrary constant. The "value" of C may change at every instance. For example, if u(t) is a function of t bounded by 2, we will write

    C|u(t)| ≤ C,                                               (1.3)

even if the two constants "C" on the left- and right-hand sides of the inequality may be different. When the value "2" is important, we will replace the above right-hand side by 2C. In our convention, however, 2C is just another constant "C".
Higher-order differential equations can always be put in the form (1.1), which is quite general. For instance, the famous harmonic oscillator is the solution to

    x'' + ω² x = 0,   ω² = k/m,                                (1.4)

with given initial conditions x(0) and x'(0), and can be recast as

    d/dt [X_1; X_2](t) = [0, 1; −ω², 0] [X_1; X_2](t),
    [X_1; X_2](0) = [x(0); x'(0)].                             (1.5)
Exercise 1.1 Show that the implied function f in (1.5) is Lipschitz.
The Lipschitz condition is quite important to obtain uniqueness of the solution. Take for instance the case n = 1 and the function x(t) = t². We easily obtain that

    x'(t) = 2 √(x(t)),   t ∈ (0, +∞),   x(0) = 0.

However, the solution x̃(t) ≡ 0 satisfies the same equation, which implies non-uniqueness of the solution. This remark is important in practice: when an equation admits several solutions, any sound numerical discretization is likely to pick one of them, but not necessarily the solution one is interested in.

Exercise 1.2 Show that the function f(t, x) = 2√x above is not Lipschitz.
How do we discretize (1.1)? There are two key ingredients in the derivation of a numerical scheme: first, to remember the definition of a derivative

    df/dt(t) = lim_{h→0} (f(t + h) − f(t))/h
             = lim_{h→0} (f(t) − f(t − h))/h
             = lim_{h→0} (f(t + h) − f(t − h))/(2h),

and second (which is quite related), to remember Taylor expansions to approximate a function in the vicinity of a given point:

    f(t + h) = f(t) + h f'(t) + (h²/2) f''(t) + . . . + (h^n/n!) f^(n)(t)
               + (h^{n+1}/(n + 1)!) f^(n+1)(t + s),            (1.6)

for some 0 ≤ s ≤ h, assuming the function f is sufficiently regular.
With this in mind, let us fix some ∆t > 0, which corresponds to the size of the interval between successive points where we try to approximate the solution of (1.1). We then approximate the derivative by

    dX/dt(t) ≈ (X(t + ∆t) − X(t))/∆t.                          (1.7)

It remains to approximate the right-hand side in (1.1). Since X(t) is supposed to be known by the time we are interested in calculating X(t + ∆t), we can choose f(t, X(t)) for the right-hand side. This gives us the so-called explicit Euler scheme

    (X(t + ∆t) − X(t))/∆t = f(t, X(t)).                        (1.8)
Let us denote by T_n = n∆t for 0 ≤ n ≤ N such that N∆t = T. We can recast (1.8) as

    X_{n+1} = X_n + ∆t f(n∆t, X_n),   X_0 = X(0).              (1.9)

This is a fully discretized equation that can be solved on the computer since f(t, x) is known.
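The update (1.9) is a one-line loop. Here is a minimal sketch in Python (the course's exercises use Matlab); the test problem f(t, x) = −x, with exact solution X(T) = X_0 e^{−T}, is an illustrative choice not taken from the text.

```python
def explicit_euler(f, X0, T, N):
    """Integrate dX/dt = f(t, X), X(0) = X0, over (0, T) with N steps of (1.9)."""
    dt = T / N
    X = X0
    for n in range(N):
        X = X + dt * f(n * dt, X)   # X_{n+1} = X_n + dt * f(n*dt, X_n)
    return X

# Sanity check on f(t, x) = -x, whose exact solution is X(T) = X0 * exp(-T).
approx = explicit_euler(lambda t, x: -x, 1.0, 1.0, 10000)
```

With N = 10000 steps the result should agree with e^{−1} to a few decimal places, consistent with the first-order accuracy established below.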
Another choice for the right-hand side in (1.1) is to choose f(t + ∆t, X(t + ∆t)). This gives us the implicit Euler scheme

    X_{n+1} = X_n + ∆t f((n + 1)∆t, X_{n+1}),   X_0 = X(0).    (1.10)

Notice that this scheme requires solving an equation at every step, namely inverting the map x → x − ∆t f((n + 1)∆t, x). However, we will see that this scheme is more stable than its explicit relative in many practical situations and is therefore often a better choice.
Exercise 1.3 Work out the explicit and implicit schemes in the case f(t, x) = λx for some real number λ ∈ R. Write X_n explicitly for both schemes and compare with the exact solution X(n∆t). Show that in the stable case (λ < 0), the explicit scheme can become unstable and that the implicit scheme cannot. Comment.
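For f(t, x) = λx both schemes have closed-form updates, which makes the stability contrast of Exercise 1.3 easy to observe numerically. A Python sketch, with λ and ∆t deliberately chosen so that the explicit amplification factor exceeds 1 in modulus:

```python
# Explicit vs implicit Euler for f(t, x) = lam * x with lam < 0 (Exercise 1.3).
# Explicit: X_{n+1} = (1 + lam*dt) X_n;  implicit: X_{n+1} = X_n / (1 - lam*dt).
lam, dt, N = -10.0, 0.3, 50       # dt too large: |1 + lam*dt| = 2 > 1
Xe = Xi = 1.0
for n in range(N):
    Xe = (1.0 + lam * dt) * Xe    # amplification factor -2: blows up
    Xi = Xi / (1.0 - lam * dt)    # amplification factor 1/4: decays for all dt > 0
```

The explicit iterate grows like 2^n even though the exact solution decays, while the implicit iterate decays for every ∆t > 0.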
We now need to justify that the schemes we have considered are good approximations of the exact solution and understand how good (or bad) our approximation is. This requires analyzing the behavior of X_n as ∆t → 0, or equivalently as the number of points N used to discretize a fixed interval of time (0, T) tends to infinity. In many problems, the strategy to do so is the following. First, we need to assume that the exact problem (1.1) is properly posed, i.e., roughly speaking, has good stability properties: when one changes the initial conditions a little, one expects the solution not to change too much in the future. A second ingredient is to show that the scheme is stable, i.e., that our solution X_n remains bounded independently of n and ∆t. The third ingredient is to show that our scheme is consistent with the equation (1.1), i.e., that locally it is a good approximation of the real equation.
Let us introduce the solution function g(t, X) = g(t, X; ∆t) defined by

    g(t, X) = X(t + ∆t),                                       (1.11)

where

    dX/dt(t + τ) = f(t + τ, X(t + τ)),   0 ≤ τ ≤ ∆t,
    X(t) = X.                                                  (1.12)

This is nothing but the function that maps a solution of the ODE at time t to the solution at time t + ∆t.
Similarly, we introduce the approximate solution function g_∆t(t, X) for t = n∆t by

    X_{n+1} = g_∆t(n∆t, X_n)   for 0 ≤ n ≤ N − 1.              (1.13)

For instance, for the explicit Euler scheme, we have

    g_∆t(n∆t, X_n) = X_n + ∆t f(n∆t, X_n).                     (1.14)

For the implicit Euler scheme, we have

    g_∆t(n∆t, X_n) = (x → x − ∆t f((n + 1)∆t, x))^{−1}(X_n),

assuming that this inverse function is well-defined. Because the explicit scheme is much simpler to analyze, we assume that (1.14) holds from now on.
That the equation is properly posed (a.k.a. well-posed) in the sense of Hadamard means that

    |g(t, X) − g(t, Y)| ≤ (1 + C∆t)|X − Y|,                    (1.15)

where the constant C is independent of t, X and Y. The meaning of this inequality is the following: we do not want the difference X(t + ∆t) − Y(t + ∆t) to be more than (1 + C∆t) times what it was at time t (i.e., |X − Y| = |X(t) − Y(t)|) during the time interval ∆t.
We then have to prove that our scheme is stable, that is to say

    |X_n| ≤ C(1 + |X_0|),                                      (1.16)

where C is independent of X_0 and 0 ≤ n ≤ N. Notice from the Lipschitz property of f (with x = x and y = 0) and from (1.14) that

    |X_{n+1}| = |g_∆t(n∆t, X_n)| ≤ (1 + C∆t)|X_n| + C∆t,

where C is a constant. This implies by induction that

    |X_n| ≤ (1 + C∆t)^n |X_0| + Σ_{k=0}^{n} (1 + C∆t)^k C∆t ≤ (1 + C∆t)^N (N∆t C + |X_0|),
for 1 ≤ n ≤ N. Now N∆t = T and

    (1 + CT/N)^N ≤ e^{CT},

which is bounded independently of ∆t. This implies the stability of the explicit Euler scheme.
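The elementary bound (1 + CT/N)^N ≤ e^{CT} used above can be checked numerically in a couple of lines; C and T below are arbitrary illustrative values.

```python
# Check that (1 + CT/N)^N increases toward, and stays below, exp(CT).
C, T = 2.0, 1.0
bounds = [(1.0 + C * T / N) ** N for N in (1, 10, 100, 1000)]
```

The sequence increases monotonically with N and stays below e² ≈ 7.389.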
We then prove that the scheme is consistent and obtain the local error of discretization. In our case, it consists of showing that

    |g(n∆t, X) − g_∆t(n∆t, X)| ≤ C∆t^{m+1}(1 + |X|),           (1.17)

for some positive m, the order of the scheme. Notice that this is a local estimate. Assuming that we know the initial condition at time n∆t, we want to make sure that the exact and approximate solutions at time (n + 1)∆t are separated by an amount at most of order ∆t². Such a result is usually obtained by using Taylor expansions.
Let us consider the explicit Euler scheme and assume that the solution X(t) of the ODE is sufficiently smooth so that the solution with initial condition at n∆t given by X = X(n∆t) satisfies

    X((n + 1)∆t) = X(n∆t) + ∆t Ẋ(n∆t) + (∆t²/2) Ẍ(n∆t + s),   (1.18)

for some 0 ≤ s ≤ ∆t, where |Ẍ(n∆t + s)| ≤ C(1 + |X|). Notice that the above property is a regularity property of the equation, not of the discretization. We prove convergence of the explicit Euler scheme only when the above regularity conditions are met. These are sufficient conditions (though not necessary ones) to obtain convergence and can be shown rigorously when the function f(t, x) is of class C¹, for instance. This implies that

    X((n + 1)∆t) = g(n∆t, X(n∆t))
                 = X(n∆t) + ∆t f(n∆t, X(n∆t)) + (∆t²/2) Ẍ(n∆t + s)
                 = g_∆t(n∆t, X(n∆t)) + (∆t²/2) Ẍ(n∆t + s).
This implies (1.17) and the consistency of the scheme with m = 1 (the explicit Euler scheme is first-order). Another way of interpreting the above result is as follows: the exact solution X((n + 1)∆t) locally solves the discretized equation (whose solution is g_∆t(n∆t, X(n∆t)) at time (n + 1)∆t) up to a small term, here of order ∆t².
Exercise 1.4 Consider the harmonic oscillator (1.5). Work out the functions g and g_∆t for the explicit and implicit Euler schemes. Show that (1.15) is satisfied, that both schemes are stable (i.e., (1.16) is satisfied) and that both schemes are consistent and of order m = 1 (i.e., (1.17) is satisfied with m = 1).
Once we have well-posedness of the exact equation as well as stability and consistency of the scheme (for a scheme of order m), we obtain convergence as follows. We have

    |X((n + 1)∆t) − X_{n+1}|
      = |g(n∆t, X(n∆t)) − g_∆t(n∆t, X_n)|
      ≤ |g(n∆t, X(n∆t)) − g(n∆t, X_n)| + |g(n∆t, X_n) − g_∆t(n∆t, X_n)|
      ≤ (1 + C∆t)|X(n∆t) − X_n| + C∆t^{m+1}(1 + |X_n|)
      ≤ (1 + C∆t)|X(n∆t) − X_n| + C∆t^{m+1}.

The first line is the definition of the operators. The second line uses the triangle inequality

    |X + Y| ≤ |X| + |Y|,                                       (1.19)

which holds for every norm by definition. The third line uses the well-posedness of the ODE and the consistency of the scheme. The last line uses the stability of the scheme. Recall that C is here the notation for a constant that may change every time the symbol is used.
Let us denote the error between the exact solution and the discretized solution by

    ε_n = |X(n∆t) − X_n|.                                      (1.20)

The previous calculations have shown that for a scheme of order m, we have

    ε_{n+1} ≤ C∆t^{m+1} + (1 + C∆t) ε_n.

Since ε_0 = 0, we easily deduce from the above relation (for instance by induction) that

    ε_n ≤ C∆t^{m+1} Σ_{k=0}^{n} (1 + C∆t)^k ≤ C∆t^{m+1} n,

since (1 + C∆t)^k ≤ C uniformly for 0 ≤ k ≤ N. We deduce from the above relation that

    ε_n ≤ C∆t^m,

since n ≤ N = T∆t^{−1}. This shows that the explicit Euler scheme is of order m = 1: for all intermediate time steps n∆t, the error between X(n∆t) and X_n is bounded by a constant times ∆t. Obviously, as ∆t goes to 0, the discretized solution converges to the exact one.
Exercise 1.5 Program in Matlab the explicit and implicit Euler schemes for the harmonic oscillator (1.5). Find numerically the order of convergence of both schemes.
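One way to find the order numerically, sketched here in Python rather than Matlab: halve ∆t, measure how the error shrinks, and read off the exponent. Initial data x(0) = 1, x'(0) = 0 and ω = 1 are illustrative choices.

```python
import math

def euler_oscillator(omega, T, N):
    """Explicit Euler (1.9) for the system (1.5) with x(0)=1, x'(0)=0."""
    dt = T / N
    x, v = 1.0, 0.0
    for n in range(N):
        x, v = x + dt * v, v - dt * omega**2 * x   # simultaneous update
    return x, v

def error(N):
    x, v = euler_oscillator(1.0, 1.0, N)
    # exact solution at T = 1: x = cos(1), x' = -sin(1)
    return math.hypot(x - math.cos(1.0), v + math.sin(1.0))

order = math.log2(error(100) / error(200))   # should be close to 1
```

Doubling N roughly halves the error, so the measured order is close to m = 1, in agreement with the analysis above.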
Higher-Order schemes. We have seen that the Euler scheme is first-order accurate. For problems that need to be solved over long intervals of time, this might not be accurate enough. The obvious solution is to create a more accurate discretization. Here is one way of constructing a second-order scheme. To simplify the presentation we assume that X ∈ R, i.e., we consider a scalar equation.
We first realize that m in (1.17) has to be 2 instead of 1. We also realize that the local approximation to the exact solution must be compatible to the right order with the Taylor expansion (1.18), which we push one step further to get

    X((n + 1)∆t) = X(n∆t) + ∆t Ẋ(n∆t) + (∆t²/2) Ẍ(n∆t) + (∆t³/6) X^(3)(n∆t + s)   (1.21)

for some 0 ≤ s ≤ ∆t. Now from the derivation of the first-order scheme, we know that what we need is an approximation such that

    X((n + 1)∆t) = g^(2)_∆t(n∆t, X) + O(∆t³).

An obvious choice is then to choose

    g^(2)_∆t(n∆t, X) = X(n∆t) + ∆t Ẋ(n∆t) + (∆t²/2) Ẍ(n∆t).
This is, however, not explicit since Ẋ and Ẍ are not known yet (only the initial condition X = X(n∆t) is known when we want to construct g^(2)_∆t(n∆t, X)). However, as previously, we can use the equation that X(t) satisfies and get that

    Ẋ(n∆t) = f(n∆t, X)
    Ẍ(n∆t) = ∂f/∂t(n∆t, X) + ∂f/∂x(n∆t, X) f(n∆t, X),

by applying the chain rule to (1.1) at t = n∆t. Assuming that we know how to differentiate the function f, we can then obtain the second-order scheme by defining

    g^(2)_∆t(n∆t, X) = X + ∆t f(n∆t, X)
                       + (∆t²/2) [∂f/∂t(n∆t, X) + ∂f/∂x(n∆t, X) f(n∆t, X)].   (1.22)
Clearly by construction, this scheme is consistent and second-order accurate. The regularity that we now need from the exact equation (1.1) is that |X^(3)(n∆t + s)| ≤ C(1 + |X|). We shall again assume that our equation is nice enough so that the above constraint holds (this can be shown when the function f(t, x) is of class C², for instance).
Exercise 1.6 It remains to show that our scheme g^(2)_∆t(n∆t, X) is stable. This is left as an exercise. Hint: show that |g^(2)_∆t(n∆t, X)| ≤ (1 + C∆t)|X| + C∆t for some constant C independent of 1 ≤ n ≤ N and X.
Exercise 1.7 Work out a second-order scheme for the harmonic oscillator (1.5) and show that all the above constraints are satisfied.

Exercise 1.8 Implement in Matlab the second-order scheme derived in the previous exercise. Show the order of convergence numerically and compare the solutions with the first-order scheme. Comment.
Runge-Kutta methods. The main drawback of the previous second-order scheme is that it requires knowing derivatives of f. This is undesirable in practice. However, the derivatives of f can also be approximated by finite differences of the form (1.7). This is the basis for the Runge-Kutta methods.
Exercise 1.9 Show that the following schemes are second-order:

    g^(2)_∆t(n∆t, X) = X + ∆t f(n∆t + ∆t/2, X + (∆t/2) f(n∆t, X))

    g^(2)_∆t(n∆t, X) = X + (∆t/2) [f(n∆t, X) + f((n + 1)∆t, X + ∆t f(n∆t, X))]

    g^(2)_∆t(n∆t, X) = X + (∆t/4) [f(n∆t, X) + 3 f((n + 2/3)∆t, X + (2∆t/3) f(n∆t, X))].

These schemes are called the Midpoint method, the Modified Euler method, and Heun's method, respectively.
Exercise 1.10 Implement in Matlab the Midpoint method of the preceding exercise to
solve the harmonic oscillator problem (1.5). Compare with previous discretizations.
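The Midpoint method needs only evaluations of f. A Python sketch on the harmonic oscillator (1.5) with ω = 1 (Exercise 1.10 itself asks for Matlab); the state is kept as a pair (x, x').

```python
def midpoint_step(t, X, dt, f):
    """Midpoint (RK2) step of Exercise 1.9 for a 2-vector X = (x, v)."""
    k = f(t, X)
    Xh = (X[0] + dt / 2 * k[0], X[1] + dt / 2 * k[1])   # half step with f(n dt, X)
    kh = f(t + dt / 2, Xh)                              # slope at the midpoint
    return (X[0] + dt * kh[0], X[1] + dt * kh[1])

# Harmonic oscillator (1.5) with omega = 1, x(0) = 1, x'(0) = 0, up to T = 1.
f = lambda t, X: (X[1], -X[0])
dt, N = 0.01, 100
X = (1.0, 0.0)
for n in range(N):
    X = midpoint_step(n * dt, X, dt, f)
```

The result should match (cos 1, −sin 1) to roughly O(∆t²) accuracy.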
All the methods we have seen so far are one-step methods, in the sense that X_{n+1} only depends on X_n. Multistep methods are generalizations where X_{n+1} depends on X_{n−k} for k = 0, . . . , m with m ∈ N. Classical multistep methods are the Adams-Bashforth and Adams-Moulton methods. The second- and fourth-order Adams-Bashforth methods are given by

    X_{n+1} = X_n + (∆t/2) (3 f(n∆t, X_n) − f((n − 1)∆t, X_{n−1}))

    X_{n+1} = X_n + (∆t/24) (55 f(n∆t, X_n) − 59 f((n − 1)∆t, X_{n−1})
              + 37 f((n − 2)∆t, X_{n−2}) − 9 f((n − 3)∆t, X_{n−3})),      (1.23)

respectively.
Exercise 1.11 Implement in Matlab the Adams-Bashforth methods in (1.23) to solve the harmonic oscillator problem (1.5). Compare with previous discretizations.
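A Python sketch of the second-order Adams-Bashforth scheme in (1.23), on the illustrative problem f(t, x) = −x. A multistep method needs X_1 as well as X_0; here the first step is bootstrapped with one explicit Euler step, a common start-up choice that the text leaves unspecified.

```python
import math

f = lambda t, x: -x
dt, N = 0.01, 100
X = [1.0]
X.append(X[0] + dt * f(0.0, X[0]))    # X_1 by one explicit Euler step
for n in range(1, N):
    # second-order Adams-Bashforth update from (1.23)
    X.append(X[n] + dt / 2 * (3 * f(n * dt, X[n]) - f((n - 1) * dt, X[n - 1])))
err = abs(X[N] - math.exp(-1.0))      # global error, expected O(dt^2)
```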
2 Finite Differences for Parabolic Equations

2.1 One dimensional Heat equation

Equation. Let us consider the simplest example of a parabolic equation

    ∂u/∂t(t, x) − ∂²u/∂x²(t, x) = 0,   x ∈ R, t ∈ (0, T),
    u(0, x) = u_0(x),                  x ∈ R.                  (2.1)
This is the one-dimensional (in space) heat equation. Note that the equation is also linear, in the sense that the map u_0(x) → u(t, x) at a given time t > 0 is linear. In the rest of the course, we will mostly be concerned with linear equations. This equation actually admits an exact solution, given by

    u(t, x) = (1/√(4πt)) ∫_{−∞}^{∞} exp(−|x − y|²/(4t)) u_0(y) dy.   (2.2)
Exercise 2.1 Show that (2.2) solves (2.1).
Discretization. We now want to discretize (2.1). This time, we have two variables to discretize, time and space. The finite difference method consists of (i) introducing discretization points X_j = j∆x for j ∈ Z and ∆x > 0, and T_n = n∆t for 0 ≤ n ≤ N = T/∆t and ∆t > 0; and (ii) approximating the solution u(t, x) by

    U^n_j ≈ u(T_n, X_j),   1 ≤ n ≤ N, j ∈ Z.                   (2.3)

Notice that in practice, j is finite. However, to simplify the presentation, we assume here that we discretize the whole line x ∈ R.
The time variable is very similar to what we had for ODEs: knowing what happens at T_n = n∆t, we would like to get what happens at T_{n+1} = (n + 1)∆t. We therefore introduce the notation

    ∂_t U^n_j = (U^{n+1}_j − U^n_j)/∆t.                        (2.4)

The operator ∂_t plays the same role for the series U^n_j as ∂/∂t does for the function u(t, x).
The spatial variable is quite different from the time variable. There is no privileged direction of propagation (we do not know a priori if information comes from the left or the right) as there is for the time variable (we know that information at time t will allow us to get information at time t + ∆t). This intuitively explains why we introduce two types of discrete approximations for the spatial derivative:

    ∂⁺_x U^n_j = (U^n_{j+1} − U^n_j)/∆x,   and   ∂⁻_x U^n_j = (U^n_j − U^n_{j−1})/∆x.   (2.5)
These are the forward and backward finite difference quotients, respectively. Now in (2.1) we have a second-order spatial derivative. Since there is a priori no reason to assume that information comes from the left or the right, we choose

    ∂²u/∂x²(T_n, X_j) ≈ ∂⁺_x ∂⁻_x U^n_j = ∂⁻_x ∂⁺_x U^n_j = (U^n_{j+1} − 2U^n_j + U^n_{j−1})/∆x².   (2.6)
Exercise 2.2 Check the equalities in (2.6).
Notice that the discrete differentiation in (2.6) is centered: it is symmetric with respect to what comes from the left and what comes from the right. The finite difference scheme then reads

    (U^{n+1}_j − U^n_j)/∆t − (U^n_{j+1} − 2U^n_j + U^n_{j−1})/∆x² = 0.   (2.7)

This is clearly an approximation of (2.1). It is also the simplest such scheme, called the explicit Euler scheme. Notice that it can be recast as

    U^{n+1}_j = λ(U^n_{j+1} + U^n_{j−1}) + (1 − 2λ)U^n_j,   λ = ∆t/∆x².   (2.8)
Exercise 2.3 Check this.
The numerical procedure is therefore straightforward: knowing U^n_j for j ∈ Z, we calculate U^{n+1}_j for j ∈ Z, for 0 ≤ n ≤ N − 1. There are then several questions that come to mind: is the scheme stable (i.e., does U^n_j remain bounded for n = N independently of the choices of ∆t and ∆x)? Does it converge (is U^n_j really a good approximation of u(n∆t, j∆x) as advertised)? And how fast does it converge (what is the error between U^n_j and u(n∆t, j∆x))?
Exercise 2.4 Discretizing the interval (−10, 10) with N points, use a numerical approximation of (2.1) to compute u(t, x) for t = 0.01, t = 0.1, t = 1, and t = 10, assuming that the initial condition is u_0(x) = 1 on (−1, 1) and u_0(x) = 0 elsewhere. Implement the algorithm in Matlab.
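A Python sketch of this setup using the scheme (2.8) (the exercise itself asks for Matlab), with the solution held at 0 at the two endpoints; the grid size and final time are arbitrary illustrative choices.

```python
import numpy as np

J = 200
x = np.linspace(-10.0, 10.0, J)
dx = x[1] - x[0]
lam = 0.5                                   # satisfies the stability bound below
dt = lam * dx**2
U = np.where(np.abs(x) < 1.0, 1.0, 0.0)     # u_0 = 1 on (-1, 1), 0 elsewhere
nsteps = int(round(0.1 / dt))               # march to t = 0.1
for n in range(nsteps):
    # update (2.8) on interior points; endpoints stay at 0
    U[1:-1] = lam * (U[2:] + U[:-2]) + (1 - 2 * lam) * U[1:-1]
```

After diffusing, the discrete maximum stays below the initial maximum 1 while the total "mass" (the Riemann sum of U) is essentially conserved, as the boundary is far from the support of the solution at this time.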
Stability. Our first concern will be to ensure that the scheme is stable. The norm we choose to measure stability is the supremum norm. For a series U = {U_j}_{j∈Z} defined on the grid j ∈ Z, we define

    ‖U‖_∞ = sup_{j∈Z} |U_j|.                                   (2.9)

Stability means that there exists a constant C independent of ∆x, ∆t and 1 ≤ n ≤ N = T/∆t such that

    ‖U^n‖_∞ = sup_{j∈Z} |U^n_j| ≤ C ‖U^0‖_∞ = C sup_{j∈Z} |U^0_j|.   (2.10)

Since u(t, x) remains bounded for all times (see (2.2)), it is clear that if the scheme is unstable, it has no chance to converge. We actually have to consider two cases according to whether λ ≤ 1/2 or λ > 1/2. When λ > 1/2, the scheme will be unstable! To show this we choose some initial conditions that oscillate very rapidly:

    U^0_j = (−1)^j ε,   ε > 0.

Here ε is a constant that can be chosen as small as one wants. We easily verify that ‖U^0‖_∞ = ε. Now straightforward calculations show that

    U^1_j = [λ((−1)^{j+1} + (−1)^{j−1}) + (1 − 2λ)(−1)^j] ε = (1 − 4λ)(−1)^j ε.
By induction this implies that

    U^n_j = (1 − 4λ)^n (−1)^j ε.                               (2.11)

Now assume that λ > 1/2. This implies that ρ = −(1 − 4λ) > 1, whence

    ‖U^n‖_∞ = ρ^n ε → ∞   as n → ∞.

For instance, for n = N = T/∆t, we have that ‖U^N‖_∞, which was supposed to be an approximation of sup_{x∈R} |u(T, x)|, tends to infinity as ∆t → 0. Obviously (2.10) is violated: there is no constant C independent of ∆t such that (2.10) holds. In practice, what this means is that any small fluctuation in the initial data (for instance caused by computer round-off) will be amplified by the scheme exponentially fast as in (2.11) and will eventually be bigger than any meaningful information.
This behavior has to be contrasted with the case λ ≤ 1/2. There, stability is obvious. Since both λ and (1 − 2λ) are nonnegative, we deduce from (2.8) that

    |U^{n+1}_j| ≤ sup{|U^n_{j−1}|, |U^n_j|, |U^n_{j+1}|}.      (2.12)
Exercise 2.5 Check it.
So obviously,

    ‖U^{n+1}‖_∞ ≤ ‖U^n‖_∞ ≤ ‖U^0‖_∞.

So (2.10) is clearly satisfied with C = 1. This property is referred to as a maximum principle: the discrete solution at later times (n ≥ 1) is bounded by the maximum of the initial solution. Notice that the continuous solution of (2.1) also satisfies this property.
Exercise 2.6 Check this using (2.2). Recall that ∫_{−∞}^{∞} e^{−x²} dx = √π.
Maximum principles are very important in the analysis of partial differential equations. Although we are not going to use them further here, let us still mention that it is important in general to enforce that numerical schemes satisfy as many properties of the continuous equation as possible.
We have seen that λ ≤ 1/2 is important to ensure stability. In terms of ∆t and ∆x, this reads

    ∆t ≤ (1/2) ∆x².                                            (2.13)

This is the simplest example of the famous CFL condition (after Courant, Friedrichs, and Lewy, who formulated it first). We shall see that explicit schemes are always conditionally stable for parabolic equations, whereas implicit schemes will be unconditionally stable, whence their practical interest. Notice that ∆t must be quite small (of order ∆x²) in order to obtain stability.
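The instability mechanism is easy to reproduce: feed the oscillating data U^0_j = (−1)^j ε to the scheme (2.8) with λ on either side of 1/2. A Python sketch (periodic in j via np.roll, so the update formula holds exactly at every site; grid size and step count are arbitrary):

```python
import numpy as np

J, eps, nsteps = 100, 1e-6, 200
j = np.arange(J)

def run(lam):
    U = eps * (-1.0) ** j                   # oscillating data U^0_j = (-1)^j eps
    for n in range(nsteps):
        U = lam * (np.roll(U, 1) + np.roll(U, -1)) + (1 - 2 * lam) * U
    return np.abs(U).max()

grow = run(0.52)    # violates (2.13): amplified by |1 - 4*0.52| = 1.08 each step
decay = run(0.48)   # satisfies (2.13): damped by |1 - 4*0.48| = 0.92 each step
```

A round-off-sized perturbation of 10⁻⁶ grows past O(1) in a couple hundred steps when λ = 0.52, while it is damped far below its initial size when λ = 0.48.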
Convergence. Now that we have stability for the explicit Euler scheme when λ ≤ 1/2, we have to show that the scheme converges. This is done by assuming that we have local consistency of the discrete scheme with the equation, and that the solution of the equation is sufficiently regular. We therefore use the same three main ingredients as for ODEs: stability, consistency, and regularity of the exact solution.
More specifically, let us introduce the error series z^n = {z^n_j}_{j∈Z} defined by

    z^n_j = U^n_j − u^n_j,   where u^n_j = u(n∆t, j∆x).
Convergence means that ‖z^n‖_∞ converges to 0 as ∆t → 0 and ∆x → 0 (with the constraint that λ ≤ 1/2) for all 1 ≤ n ≤ N = T/∆t. The order of convergence is how fast it converges. To simplify, we assume that λ ≤ 1/2 is fixed and send ∆t → 0 (in which case ∆x = √(λ^{−1}∆t) also converges to 0). Our main ingredient is now the Taylor expansion (1.6). The consistency of the scheme consists of showing that the real dynamics are well approximated. More precisely, assuming that the solution at time T_n = n∆t is given by u^n_j on the grid, and that U^n_j = u^n_j, we have to show that the real solution u^{n+1}_j at time T_{n+1} is well approximated by U^{n+1}_j. To answer this, we calculate

    τ^n_j = ∂_t u^n_j − ∂⁺_x ∂⁻_x u^n_j.                       (2.14)
The expression τ^n_j is the truncation or local discretization error. Using Taylor expansions, we find that

    (u(t + ∆t) − u(t))/∆t = ∂u/∂t(t) + (∆t/2) ∂²u/∂t²(t + s),

    (u(x + ∆x) − 2u(x) + u(x − ∆x))/∆x² = ∂²u/∂x²(x) + (∆x²/12) ∂⁴u/∂x⁴(x + h),

for some 0 ≤ s ≤ ∆t and −∆x ≤ h ≤ ∆x.
Exercise 2.7 Check this.
Since (2.1) is satisfied by u(t, x), we deduce that

    τ^n_j = (∆t/2) ∂²u/∂t²(n∆t + s) − (∆x²/12) ∂⁴u/∂x⁴(j∆x + h)
          = ∆x² [(λ/2) ∂²u/∂t²(n∆t + s) − (1/12) ∂⁴u/∂x⁴(j∆x + h)].

The truncation error is therefore of order m = 2 since it is proportional to ∆x². Of course, this accuracy holds if we can bound the term in the square brackets independently of the discretization parameters ∆t and ∆x. Notice however that this term only involves the exact solution u(t, x). All we need is that the exact solution is sufficiently regular. Again, this is independent of the discretization scheme we choose.

Let us assume that the solution is sufficiently regular (what we need is that it is of class C^{2,4}, i.e., that it is continuously differentiable 2 times in time and 4 times in space). We then deduce that

    ‖τ^n‖_∞ ≤ C∆x² ∼ C∆t,                                      (2.15)

where C is a constant independent of ∆x (and ∆t, since λ is fixed). Since U^n satisfies the discretized equation, we deduce that

    ∂_t z^n_j − ∂⁺_x ∂⁻_x z^n_j = −τ^n_j,                      (2.16)
or equivalently that

    z^{n+1}_j = λ(z^n_{j+1} + z^n_{j−1}) + (1 − 2λ)z^n_j − ∆t τ^n_j.   (2.17)
The above equation allows us to analyze local consistency. Indeed, let us assume that the exact solution is known on the spatial grid at time step n. This implies that z^n = 0, since the latter is by definition the difference between the exact solution and the approximate solution on the spatial grid. Equation (2.17) thus implies that

    z^{n+1}_j = −∆t τ^n_j,

so that

    ‖z^{n+1}‖_∞ ≤ C∆t² ∼ C∆t∆x²,                               (2.18)

thanks to (2.15). Comparing with the section on ODEs, this implies that the scheme is of order m = 1 in time since ∆t² = ∆t^{m+1} for m = 1. Being of order 1 in time implies that the scheme is of order 2 in space since ∆t = λ∆x². This will be confirmed in Theorem 2.1 below.
Let us now conclude the analysis of the error estimate and come back to (2.17). Using the stability estimate (2.12) for the homogeneous problem (because λ ≤ 1/2!) and the bound (2.15), we deduce that

    ‖z^{n+1}‖_∞ ≤ ‖z^n‖_∞ + C∆t∆x².

This in turn implies that

    ‖z^n‖_∞ ≤ nC∆t∆x² ≤ CT∆x².
We summarize what we have obtained as

Theorem 2.1 Let us assume that u(t, x) is of class C^{2,4} and that λ ≤ 1/2. Then we have that

    sup_{j∈Z; 0≤n≤N} |u(n∆t, j∆x) − U^n_j| ≤ C∆x²,             (2.19)

where C is a constant independent of ∆x and λ ≤ 1/2.
The explicit Euler scheme is of order 2 in space. But this corresponds to a scheme of order 1 in time since ∆t = λ∆x² ≤ ∆x²/2. Therefore it does not converge very fast.
Exercise 2.8 Implement the explicit Euler scheme in Matlab by discretizing the parabolic equation on x ∈ (−10, 10) and assuming that u(t, −10) = u(t, 10) = 0. You may choose as an initial condition u_0(x) = 1 on (−1, 1) and u_0(x) = 0 elsewhere. Choose the discretization parameters ∆t and ∆x such that λ = .48, λ = .5, and λ = .52. Compare the stable solutions you obtain with the solution given by (2.2).
Exercise 2.9 Following the techniques described in Chapter 1, derive a scheme of order 4 in space (still with ∆t = λ∆x² and λ sufficiently small that the scheme is stable). Implement this new scheme in Matlab and compare it with the explicit Euler scheme on the numerical simulation described in Exercise 2.8.
2.2 Fourier transforms

We now consider a fundamental tool in the analysis of partial differential equations with homogeneous coefficients and their discretization by finite differences: Fourier transforms.

There are many ways of defining the Fourier transform. We shall use the following definition

    [F_{x→ξ} f](ξ) ≡ f̂(ξ) = ∫_{−∞}^{∞} e^{−ixξ} f(x) dx.       (2.20)
The inverse Fourier transform is defined by

    [F^{−1}_{ξ→x} f̂](x) ≡ f(x) = (1/2π) ∫_{−∞}^{∞} e^{ixξ} f̂(ξ) dξ.   (2.21)
In the above definition, the operator F is applied to the function f(x), which it maps to the function f̂(ξ). The notation can become quite useful when there are several variables and one wants to emphasize with respect to which variable the Fourier transform is taken. We call the space of x's the physical domain and the space of ξ's the wavenumber domain, or Fourier domain. Wavenumbers are to positions what frequencies are to times. If you are more familiar with Fourier transforms of signals in time, you may want to think of "frequencies" each time you see "wavenumbers". A crucial property, which explains the terminology of "forward" and "inverse" Fourier transforms, is that

    F^{−1}_{ξ→x} F_{x→ξ} f = f,                                (2.22)
which can be recast as

    f(x) = (1/2π) ∫_{−∞}^{∞} e^{ixξ} ∫_{−∞}^{∞} e^{−iyξ} f(y) dy dξ
         = ∫_{−∞}^{∞} ( ∫_{−∞}^{∞} e^{i(x−y)ξ}/(2π) dξ ) f(y) dy.

This implies that

    ∫_{−∞}^{∞} e^{izξ}/(2π) dξ = δ(z),

the Dirac delta function, since f(x) = ∫_{−∞}^{∞} δ(x − y) f(y) dy by definition. Another way of understanding this is to realize that

    [Fδ](ξ) = 1,   [F^{−1} 1](x) = δ(x).                       (2.23)
The first equality can also trivially be deduced from (2.20).
Another interesting calculation for us is that

    [F exp(−(α/2) x²)](ξ) = √(2π/α) exp(−(1/2α) ξ²).           (2.24)
In other words, the Fourier transform of a Gaussian is also a Gaussian. Notice this important fact: a very narrow Gaussian (α very large) is transformed into a very wide one (α^{−1} is very small) and vice versa. This is related to Heisenberg's uncertainty principle in quantum mechanics, and it says that you cannot be localized in the Fourier domain when you are localized in the spatial domain and vice versa.
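Formula (2.24) is easy to check by quadrature: approximate the integral (2.20) by a Riemann sum on a large truncated grid. A Python sketch, with α and ξ arbitrary illustrative values:

```python
import numpy as np

a, xi = 2.0, 1.5
x = np.linspace(-20.0, 20.0, 4001)
dx = x[1] - x[0]
# Riemann sum for the transform (2.20) of exp(-(a/2) x^2) at wavenumber xi
fhat = np.sum(np.exp(-1j * x * xi) * np.exp(-a * x**2 / 2)) * dx
# closed form predicted by (2.24)
exact = np.sqrt(2 * np.pi / a) * np.exp(-xi**2 / (2 * a))
err = abs(fhat - exact)
```

Because the Gaussian decays so fast, the truncated Riemann sum is accurate far beyond the tolerance tested here.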
Exercise 2.10 Check (2.24) using integration in the complex plane (which essentially says that ∫_{−∞}^{∞} e^{−β(x−iρ)²} dx = ∫_{−∞}^{∞} e^{−βx²} dx for ρ ∈ R by contour change, since z → e^{−βz²} has no pole in C).
The simplest property of the Fourier transform is its linearity:

    [F(αf + βg)](ξ) = α[Ff](ξ) + β[Fg](ξ).                     (2.25)

Other very important properties of the Fourier transform are as follows:

    [Ff(x + y)](ξ) = e^{iyξ} f̂(ξ),   [Ff(νx)](ξ) = (1/|ν|) f̂(ξ/ν),   (2.26)

for all y ∈ R and ν ∈ R\{0}.
Exercise 2.11 Check these formulas.
The first equality shows how the Fourier transform acts on translations by a factor y, the second how it acts on dilations. The former is extremely important in the analysis of PDEs and their finite difference discretization:

    Fourier transforms replace translations in the physical domain by multiplications in the Fourier domain.

More precisely, translation by y is replaced by multiplication by e^{iyξ}. The reason this is important is that you can't beat multiplications in simplicity. To further convince you of the importance of the above assertion, consider derivatives of functions. By linearity of the Fourier transform, we get that

    [F (f(x + h) − f(x))/h](ξ) = ([Ff(x + h)](ξ) − [Ff(x)](ξ))/h = ((e^{ihξ} − 1)/h) f̂(ξ).

Now let us just pass to the limit h → 0 on both sides. We get

    [Ff'](ξ) = iξ f̂(ξ).                                        (2.27)

Again,

    Fourier transforms replace differentiation in the physical domain by multiplication in the Fourier domain.

More precisely, d/dx is replaced by multiplication by iξ.
This assertion can easily be verified directly from the definition (2.20). However, it is useful to see differentiation as a limit of differences of translations. Moreover, finite differences are based on translations, so we see how Fourier transforms will be useful in their analysis.
One last crucial and quite related property of the Fourier transform is that it replaces convolutions by multiplications. The convolution of two functions is given by

    (f ∗ g)(x) = ∫_{−∞}^{∞} f(x − y) g(y) dy = ∫_{−∞}^{∞} f(y) g(x − y) dy.   (2.28)

The Fourier transform is then given by

    [F(f ∗ g)](ξ) = f̂(ξ) ĝ(ξ).                                 (2.29)
Exercise 2.12 Check this.
This important relation can be used to prove the equally important (everything is important in this section!) Parseval equality

    ∫_{−∞}^{∞} |f̂(ξ)|² dξ = 2π ∫_{−∞}^{∞} |f(x)|² dx,          (2.30)

for all complex-valued functions f(x) such that the above integrals make sense. We show this as follows. Let f̄ denote the complex conjugate of f. We first recall that the complex conjugate of f̂(ξ) equals [F f̄](−ξ). We then define g(x) = f̄(−x), so that ĝ(ξ) = [F f̄](−ξ) using (2.26) with ν = −1; in other words, ĝ(ξ) is the complex conjugate of f̂(ξ). We then have

    ∫_{−∞}^{∞} |f̂(ξ)|² dξ = ∫_{−∞}^{∞} f̂(ξ) ĝ(ξ) dξ
        = ∫_{−∞}^{∞} [F(f ∗ g)](ξ) dξ       thanks to (2.29)
        = 2π (f ∗ g)(0)                      using the definition (2.21) with x = 0
        = 2π ∫_{−∞}^{∞} f(y) g(−y) dy        by definition of the convolution
        = 2π ∫_{−∞}^{∞} f(y) f̄(y) dy = 2π ∫_{−∞}^{∞} |f(y)|² dy,

since g(−y) = f̄(y).
The Parseval equality (2.30) essentially says that there is as much energy (up to a factor 2π) in the Fourier domain (left-hand side in (2.30)) as in the physical domain (right-hand side).
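A numerical illustration (Python/NumPy sketch; the Gaussian f(x) = e^{−x²}, whose Fourier transform under the present convention is √π e^{−ξ²/4}, is an arbitrary test case): both sides of (2.30) can be approximated by Riemann sums.

```python
import numpy as np

# Parseval check for f(x) = exp(-x^2): with the transform convention of these
# notes, f-hat(xi) = sqrt(pi) exp(-xi^2 / 4), and (2.30) reads
#   integral |f-hat|^2 dxi  =  2 pi integral |f|^2 dx.
x = np.linspace(-10, 10, 4001)
f = np.exp(-x**2)
lhs = 2 * np.pi * np.sum(f**2) * (x[1] - x[0])        # 2π ∫ |f|² dx

xi = np.linspace(-40, 40, 16001)
fhat = np.sqrt(np.pi) * np.exp(-xi**2 / 4)
rhs = np.sum(np.abs(fhat) ** 2) * (xi[1] - xi[0])     # ∫ |f-hat|² dξ
err = abs(lhs - rhs) / rhs
```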
Application to the heat equation. Before considering ﬁnite diﬀerences, here is an
important exercise that shows how powerful Fourier transforms are to solve PDEs with
constant coeﬃcients.
Exercise 2.13 Consider the heat equation (2.1). Using the Fourier transform, show that
∂û(t, ξ)/∂t = −ξ² û(t, ξ).
Here we mean û(t, ξ) = [T_{x→ξ} u(t, x)](t, ξ). Solve this ODE. Then using (2.24) and (2.28), recover (2.2).
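Exercise 2.13 can be previewed numerically. In the sketch below (Python/NumPy; the grid parameters and Gaussian initial condition are arbitrary choices), each Fourier mode is simply damped by e^{−tξ²}; for u₀(x) = e^{−x²} the exact solution is (1 + 4t)^{−1/2} e^{−x²/(1+4t)}, obtained by convolving Gaussians.

```python
import numpy as np

# Heat equation solved spectrally: each mode obeys d/dt u-hat = -xi^2 u-hat,
# so u-hat(t, xi) = exp(-t xi^2) u-hat(0, xi).
N, L, t = 512, 40.0, 0.5
x = -L / 2 + (L / N) * np.arange(N)
xi = 2 * np.pi * np.fft.fftfreq(N, d=L / N)

u0 = np.exp(-x**2)
u = np.fft.ifft(np.exp(-t * xi**2) * np.fft.fft(u0)).real

# Exact solution for Gaussian initial data
u_exact = np.exp(-x**2 / (1 + 4 * t)) / np.sqrt(1 + 4 * t)
err = np.max(np.abs(u - u_exact))
```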
Application to Finite Diﬀerences. We now use the Fourier transform to analyze
ﬁnite diﬀerence discretizations. To simplify we still assume that we want to solve the
heat equation on the whole line x ∈ R.
Let u_0(x) be the initial condition for the heat equation. If we use the notation
U^n(j∆x) = U^n_j ≈ u(n∆t, j∆x),
the explicit Euler scheme (2.7) can be recast as
(U^{n+1}(x) − U^n(x))/∆t = (U^n(x + ∆x) + U^n(x − ∆x) − 2U^n(x))/∆x², (2.31)
where x = j∆x. As before, this scheme says that the approximation U at time T_{n+1} = (n + 1)∆t and position X_j = j∆x depends on the values of U at time T_n and positions X_{j−1}, X_j, and X_{j+1}. It is therefore straightforward to realize that U^n(X_j) depends on u_0 at all the points X_k for j − n ≤ k ≤ j + n. It is however not easy at all to get this dependence explicitly from (2.31).
The key to analyzing this dependence is to go to the Fourier domain. There, translations are replaced by multiplications, which is much simpler to deal with. Here is how we do it. We first define U^0(x) for all points x ∈ R, not only those points x = X_j = j∆x. One way to do so is to assume that
u_0(x) = U^0_j on ((j − 1/2)∆x, (j + 1/2)∆x). (2.32)
We then define U^n(x) using (2.31) initialized with U^0(x) = u_0(x). Clearly, we have
U^{n+1}(x) = (1 − 2λ)U^n(x) + λ(U^n(x + ∆x) + U^n(x − ∆x)). (2.33)
We thus define U^n(x) for all n ≥ 0 and all x ∈ R by induction. If we choose as initial condition (2.32), we then easily observe that
U^n(x) = U^n_j on ((j − 1/2)∆x, (j + 1/2)∆x), (2.34)
where U^n_j is given by (2.7). So analyzing U^n(x) is in some sense sufficient to get information on U^n_j.
Once U^n(x) is defined for all n, we can certainly introduce the Fourier transform Û^n(ξ) defined according to (2.20). However, using the translation property of the Fourier transform (2.26), we obtain that
Û^{n+1}(ξ) = (1 + λ(e^{iξ∆x} + e^{−iξ∆x} − 2)) Û^n(ξ). (2.35)
In the Fourier domain, marching in time (going from step n to step n + 1) is much simpler than in the physical domain: each wavenumber ξ is multiplied by a coefficient
R_λ(ξ) = 1 + λ(e^{iξ∆x} + e^{−iξ∆x} − 2) = 1 − 2λ(1 − cos(ξ∆x)). (2.36)
We then have that
Û^n(ξ) = (R_λ(ξ))^n û_0(ξ). (2.37)
The question of the stability of a discretization now becomes the following: are all wavenumbers ξ stable, or equivalently, is (R_λ(ξ))^n bounded for all 1 ≤ n ≤ N = T/∆t? It is easy to observe that when λ ≤ 1/2, |R_λ(ξ)| ≤ 1 for all ξ. We then clearly have that all wavenumbers are stable since |Û^n(ξ)| ≤ |û_0(ξ)|. However, when λ > 1/2, we can choose values of ξ such that cos(ξ∆x) is close to −1. For such wavenumbers, the value of R_λ(ξ) is less than −1, so that Û^n changes sign at every new time step n with growing amplitude (see numerical simulations). Such wavenumbers are unstable.
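A short sketch (Python/NumPy; the values λ = 0.4, 0.6 and n = 200 are arbitrary choices) makes the dichotomy concrete by evaluating (R_λ(ξ))^n over one period of ξ∆x:

```python
import numpy as np

# Amplification factor R_lambda(xi) = 1 - 2*lam*(1 - cos(xi*dx)) over n steps.
# The worst wavenumbers have cos(xi*dx) = -1, where R_lambda = 1 - 4*lam:
# bounded for lam <= 1/2, below -1 (hence unstable) for lam > 1/2.
theta = np.linspace(-np.pi, np.pi, 1001)        # theta = xi*dx, one period

def growth(lam, n):
    R = 1 - 2 * lam * (1 - np.cos(theta))
    return np.max(np.abs(R) ** n)

stable = growth(0.4, 200)      # lam <= 1/2: every wavenumber stays bounded
unstable = growth(0.6, 200)    # lam > 1/2: some wavenumbers blow up
```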
2.3 Stability and convergence using Fourier transforms
In this section, we apply the theory of Fourier transforms to the analysis of finite difference discretizations of parabolic equations with constant coefficients.
As we already mentioned, an important property of Fourier transforms is that
∂/∂x → iξ,
that is, differentiation is replaced by multiplication by iξ in the Fourier domain. We use this to define equations in the physical domain as follows:
∂u/∂t(t, x) + P(D)u(t, x) = 0,  t > 0, x ∈ R,
u(0, x) = u_0(x),  x ∈ R.  (2.38)
The operator P(D) is a linear operator defined as
P(D)f(x) = T^{−1}_{ξ→x}[ P(iξ) [T_{x→ξ} f](ξ) ](x), (2.39)
where P(iξ) is a function of iξ. Here D stands for ∂/∂x. What the operator P(D) does is: (i) take the Fourier transform of f(x); (ii) multiply f̂(ξ) by the function P(iξ); (iii) take the inverse Fourier transform of the product.
The function P(iξ) is called the symbol of the operator P(D). Such operators are called pseudo-differential operators.
Why do we introduce all this? The reason, as was mentioned in the previous section, is that the analysis of differential operators simplifies in the Fourier domain. The above definition gives us a mathematical framework to build on this idea. Indeed, let us consider (2.38) in the Fourier domain. Upon taking the Fourier transform T_{x→ξ} term by term in (2.38), we get that
∂û/∂t(t, ξ) + P(iξ)û(t, ξ) = 0,  t > 0, ξ ∈ R,
û(0, ξ) = û_0(ξ),  ξ ∈ R.  (2.40)
As promised, we have replaced the "complicated" pseudo-differential operator P(D) by a mere multiplication by P(iξ) in the Fourier domain. For every ξ ∈ R, (2.40) consists of a simple linear first-order ordinary differential equation. Notice the parallel between (2.38) and (2.40): by going to the Fourier domain, we have replaced the differentiation operator D = ∂/∂x by iξ. The solution of (2.40) is then obviously given by
û(t, ξ) = e^{−tP(iξ)} û_0(ξ). (2.41)
The solution of (2.38) is then given by
u(t, x) = [T^{−1}_{ξ→x}(e^{−tP(iξ)})] ∗ u_0(x). (2.42)
Exercise 2.14 (i) Show that the parabolic equation (2.1) takes the form (2.38) with symbol P(iξ) = ξ².
(ii) Find the equation of the form (2.38) with symbol P(iξ) = ξ⁴.
(iii) Show that any partial differential equation of order m ≥ 1 in x with constant coefficients can be written in the form (2.38) with a symbol P(iξ) that is a polynomial of order m.
Parabolic Symbol. In this section, we only consider real-valued symbols such that for some constants C > 0 and M > 0,
P(iξ) ≥ C|ξ|^M,  for all ξ ∈ R. (2.43)
With a somewhat loose terminology, we refer to equations (2.38) with real-valued symbol P(iξ) satisfying the above constraint as parabolic equations. The heat equation clearly satisfies the above requirement with M = 2. The real-valuedness of the symbol is not necessary, but we shall impose it to simplify.
What about numerical schemes? Our definition so far provides a great tool to analyze partial differential equations. It turns out it is also very well adapted to the analysis of finite difference approximations of these equations. The reason is again that finite difference operators are transformed into multiplications in the Fourier domain.
A general single-step finite difference scheme is given by
B_∆ U^{n+1}(x) = A_∆ U^n(x),  n ≥ 0, (2.44)
with U^0(x) = u_0(x) a given initial condition. Here, the finite difference operators are defined by
A_∆ U(x) = Σ_{α∈I_α} a_α U(x − α∆x),  B_∆ U(x) = Σ_{β∈I_β} b_β U(x − β∆x), (2.45)
where I_α ⊂ Z and I_β ⊂ Z are finite sets of integers. Although this definition may look a little complicated at first, we need such generality in practice. For instance, the explicit Euler scheme is given by
U^{n+1}(x) = (1 − 2λ)U^n(x) + λ(U^n(x − ∆x) + U^n(x + ∆x)).
This corresponds to the coefficients
b_0 = 1,  a_{−1} = λ,  a_0 = 1 − 2λ,  a_1 = λ,
and all the other coefficients a_α and b_β vanish. Explicit schemes are characterized by the fact that only b_0 does not vanish. We shall see that implicit schemes, where other coefficients b_β ≠ 0, are important in practice.
Notice that we are slightly changing the problem of interest. Finite difference schemes are supposed to be defined on the grid ∆xZ (i.e., the points x_m = m∆x for m ∈ Z), not on the whole axis R. Let us notice however that, from the definition (2.44), the values of U^{n+1} on the grid ∆xZ only depend on the values of u_0 on the same grid, and that U^{n+1}(j∆x) = U^{n+1}_j, using the notation of the preceding section. In other words, U^{n+1}(j∆x) does not depend on how we extend the initial condition u_0, originally defined on the grid ∆xZ, to the whole line R. This is why we replace U^{n+1}_j by U^{n+1}(x) in the sequel and only consider properties of U^{n+1}(x). In the end, we should see how the properties of U^{n+1}(x) translate into properties of U^{n+1}_j. This is easy to do when U^{n+1}(x) is chosen as a piecewise constant function and more difficult when higher-order polynomials are used to construct it. It can be done but requires careful analysis of the error made by interpolating smooth functions by polynomials. We do not consider this problem in these notes.
We now come back to (2.44). From now on, this is our definition of a finite difference scheme and we want to analyze its properties. As we already mentioned, this equation is simpler in the Fourier domain. Let us thus take the Fourier transform of each term in (2.44). What we get is
Σ_{β∈I_β} b_β e^{−iβ∆xξ} Û^{n+1}(ξ) = Σ_{α∈I_α} a_α e^{−iα∆xξ} Û^n(ξ). (2.46)
Let us introduce the symbols
A_∆(ξ) = Σ_{α∈I_α} a_α e^{−iα∆xξ},  B_∆(ξ) = Σ_{β∈I_β} b_β e^{−iβ∆xξ}. (2.47)
In the implicit case, we assume that
B_∆(ξ) ≠ 0. (2.48)
In the explicit case, B_∆(ξ) ≡ 1 and the above relation is obvious. We then obtain that
Û^{n+1}(ξ) = R_∆(ξ) Û^n(ξ) = B_∆(ξ)^{−1} A_∆(ξ) Û^n(ξ). (2.49)
This implies
Û^n(ξ) = R^n_∆(ξ) û_0(ξ). (2.50)
The above relation completely characterizes the solution Û^n(ξ). To obtain the solution U^n(x), we simply have to take the inverse Fourier transform of Û^n(ξ).
Exercise 2.15 Verify that for the explicit Euler scheme, we have
R_∆(ξ) = 1 − (2∆t/∆x²)(1 − cos(ξ∆x)).
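The symbols (2.47) are easy to assemble in code. The sketch below (Python/NumPy; the values of λ and ∆x are arbitrary) builds A_∆ and B_∆ from the explicit Euler coefficients and checks that R_∆ = A_∆/B_∆ reduces to the closed form of Exercise 2.15 (with λ = ∆t/∆x²):

```python
import numpy as np

# Symbols of the difference operators (2.47):
#   A(xi) = sum_alpha a_alpha exp(-i alpha dx xi),  similarly for B.
def symbol(coeffs, dx, xi):
    return sum(c * np.exp(-1j * k * dx * xi) for k, c in coeffs.items())

lam, dx = 0.4, 0.1
xi = np.linspace(-np.pi / dx, np.pi / dx, 401)

# Explicit Euler: a_{-1} = a_1 = lam, a_0 = 1 - 2*lam, b_0 = 1
A = symbol({-1: lam, 0: 1 - 2 * lam, 1: lam}, dx, xi)
B = symbol({0: 1.0}, dx, xi)
R = A / B

err = np.max(np.abs(R - (1 - 2 * lam * (1 - np.cos(xi * dx)))))
```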
Notice that the difference between the exact solution û(n∆t, ξ) at time n∆t and its approximation Û^n(ξ) is given by
û(n∆t, ξ) − Û^n(ξ) = (e^{−n∆tP(iξ)} − R^n_∆(ξ)) û_0(ξ). (2.51)
This is what we want to analyze.
To measure convergence, we first need to define a norm, which will tell us how far or how close two functions are from one another. In finite dimensional spaces (such as R^n, which we used for ordinary differential equations), all norms are equivalent (this is a theorem). In infinite dimensional spaces, not all norms are equivalent (this is another theorem; the equivalence of all norms on a space implies that it is finite dimensional...). So the choice of the norm matters!
Here we only consider the L² norm, defined on R for sufficiently regular complex-valued functions f by
‖f‖ = ( ∫_{−∞}^{∞} |f(x)|² dx )^{1/2}. (2.52)
The reason why this norm is nice is that we have the Parseval equality
‖f̂(ξ)‖ = √(2π) ‖f(x)‖. (2.53)
This means that the L² norms in the physical and the Fourier domains are equal (up to the factor √(2π)). Most norms do not have such a nice interpretation in the Fourier domain. This is one of the reasons why the L² norm is so popular.
Our analysis of the finite difference scheme is therefore concerned with characterizing ‖u(n∆t, x) − U^n(x)‖.
We just saw that this was equivalent to characterizing
ε^n(ξ) = û(n∆t, ξ) − Û^n(ξ) = (e^{−n∆tP(iξ)} − R^n_∆(ξ)) û_0(ξ).
The control that we will obtain on ε^n(ξ) depends on the three classical ingredients: regularity of the solution, stability of the scheme, and consistency of the approximation.
The convergence arises as two parameters, ∆t and ∆x, converge to 0. To simplify the analysis, we assume that the two parameters are related by
λ = ∆t/∆x^M, (2.54)
where M is the order of the parabolic equation (M = 2 for the heat equation), and where λ > 0 is fixed.
Regularity. The regularity of the exact equation is first seen in the growth of |e^{−∆tP(iξ)}|. Since our operator is assumed to be parabolic, we have
|e^{−∆tP(iξ)}| ≤ e^{−C∆t|ξ|^M}. (2.55)
Stability. We impose that there exists a constant C independent of ξ and 1 ≤ n ≤ N = T/∆t such that
|R^n_∆(ξ)| ≤ C,  ξ ∈ R,  1 ≤ n ≤ N. (2.56)
The stability constraint is clear: no wavenumber ξ can grow out of control.
Exercise 2.16 Check again that the above constraint is satisfied for the explicit Euler scheme if and only if λ ≤ 1/2.
What makes parabolic equations a little special is that they have better stability properties than just having bounded symbols. The exact symbol satisfies
|e^{−n∆tP(iξ)}| ≤ e^{−Cn∆t|ξ|^M}, (2.57)
for some constant C. This relation is important because we deduce that high wavenumbers (large ξ's) are heavily damped by the operator. Parabolic equations have a very efficient smoothing effect. Even if high wavenumbers are present in the initial solution, they quickly disappear as time increases.
Not surprisingly, the finite difference approximation partially retains this effect. We have however to be careful. The reason is that the scheme has a given scale, ∆x, at which it approximates the exact operator. For length scales much larger than ∆x, we expect the scheme to be a good approximation of the exact operator. For lengths of order (or smaller than) ∆x however, we cannot expect the scheme to be accurate: we do not have sufficient resolution. Remember that small scales in the spatial world mean high wavenumbers in the Fourier world (functions that vary on small scales oscillate fast, hence are composed of large wavenumbers). Nevertheless, for wavenumbers that are smaller than ∆x^{−1}, we expect the scheme to retain some properties of the exact equation. For this reason, we impose that the operator R_∆ satisfy
|R_∆(ξ)| ≤ 1 − δ|∆xξ|^M  for |∆xξ| ≤ γ,
|R_∆(ξ)| ≤ 1  for |∆xξ| > γ,  (2.58)
where γ and δ are positive constants independent of ξ and ∆x. We then have
Lemma 2.2 We have that
|R^n_∆(ξ)| ≤ e^{−Cn|∆xξ|^M}  for |∆xξ| ≤ γ. (2.59)
Proof. We deduce from (2.58) that
|R^n_∆(ξ)| ≤ e^{n ln(1 − δ|∆xξ|^M)} ≤ e^{−n(δ/2)|∆xξ|^M}.
Consistency. Let us now turn our attention to the consistency and accuracy of the finite difference scheme. Here again, we want to make sure that the finite difference scheme is locally a good approximation of the exact equation. In the Fourier world, this means that
e^{−∆tP(iξ)} − R_∆(ξ) (2.60)
has to converge to 0 as ∆x and ∆t converge to 0. Indeed, assuming that we are given the exact solution û(n∆t, ξ) at time n∆t, the error made by applying the approximate equation instead of the exact equation between n∆t and (n+1)∆t is precisely given by
(e^{−∆tP(iξ)} − R_∆(ξ)) û(n∆t, ξ).
Again, we have to be a bit careful. We cannot expect the above quantity to be small for all values of ξ. This is again related to the fact that wavenumbers of order of or greater than ∆x^{−1} are not captured by the finite difference scheme.
Exercise 2.17 Show that in the case of the heat equation and the explicit Euler scheme, the error e^{−∆tP(iξ)} − R_∆(ξ) is not small for all wavenumbers of the form ξ = kπ∆x^{−1} for k = 1, 2, . . ..
Quantitatively, all we can expect is that the error is small when ξ∆x → 0 and ∆t → 0. We say that the scheme is consistent and accurate of order m if
|e^{−∆tP(iξ)} − R_∆(ξ)| ≤ C∆t∆x^m (1 + |ξ|^{M+m}),  ∆x|ξ| < γ, (2.61)
where C is a constant independent of ξ, ∆x, and ∆t.
Proof of convergence. Let us come back to the analysis of ε^n(ξ). We have that
ε^n(ξ) = (e^{−∆tP(iξ)} − R_∆(ξ)) Σ_{k=0}^{n−1} e^{−(n−1−k)∆tP(iξ)} R^k_∆(ξ) û_0(ξ),
using a^n − b^n = (a − b) Σ_{k=0}^{n−1} a^{n−1−k} b^k. Let us first consider the case |ξ| ≤ γ∆x^{−1}. We deduce from (2.57), (2.59), and (2.61) that
|ε^n(ξ)| ≤ C∆t∆x^m (1 + |ξ|^{M+m}) n e^{−Cn∆t|ξ|^M} |û_0(ξ)|. (2.62)
Now remark that
n∆t|ξ|^M e^{−Cn∆t|ξ|^M} ≤ C′.
Since n∆t ≤ T, this implies that
|ε^n(ξ)| ≤ C∆x^m (1 + |ξ|^m) |û_0(ξ)|. (2.63)
So far, the above inequality holds for |ξ| ≤ γ∆x^{−1}. However, for |ξ| ≥ γ∆x^{−1}, we have thanks to the stability property (2.56) that
|ε^n(ξ)| ≤ 2|û_0(ξ)| ≤ C∆x^m (1 + |ξ|^m) |û_0(ξ)|. (2.64)
So (2.63) actually holds for every ξ. This implies that
‖ε^n‖ = ( ∫_R |ε^n(ξ)|² dξ )^{1/2} ≤ C∆x^m ( ∫_R (1 + |ξ|^{2m}) |û_0(ξ)|² dξ )^{1/2}. (2.65)
This shows that ‖ε^n‖ is of order ∆x^m provided that
( ∫_R (1 + |ξ|^{2m}) |û_0(ξ)|² dξ )^{1/2} < ∞. (2.66)
The above inequality implies that û_0(ξ) decays sufficiently fast as ξ → ∞. This is actually equivalent to imposing that u_0(x) is sufficiently regular.
The Hilbert spaces H^m(R). For a function v(x), we denote by v^{(k)}(x) its kth derivative. We introduce the sequence of functional spaces H^m = H^m(R) of functions whose derivatives up to order m are square-integrable. The norm of H^m is defined by
‖v‖_{H^m} = ( Σ_{k=0}^{m} ∫_R |v^{(k)}(x)|² dx )^{1/2}. (2.67)
Notice that H^0 = L², the space of square-integrable functions. It turns out that a function v(x) belongs to H^m if and only if its Fourier transform v̂(ξ) satisfies
( ∫_R (1 + |ξ|^{2m}) |v̂(ξ)|² dξ )^{1/2} < ∞. (2.68)
Exercise 2.18 (diﬃcult) Prove it. The proof is based on the fact that diﬀerentiation
in the physical domain is replaced by multiplication by iξ in the Fourier domain.
Notice that in the above inequality, we can choose m ∈ R and not only m ∈ N. Thus (2.68) can be used as a definition of the space H^m with m ∈ R. The space H^m is still the space of functions with m square-integrable derivatives. Only now the number of derivatives can be "fractional", or even negative!
Exercise 2.19 Calculate the Fourier transform of the function v defined by v(x) = 1 on (−1, 1) and v(x) = 0 elsewhere. Using (2.68), show that the function v belongs to every space H^m with m < 1/2. This means that functions that are piecewise smooth (with discontinuities) have a little less than half of a derivative in L² (you can differentiate such functions almost 1/2 times)!
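The exercise can be previewed numerically (Python/NumPy sketch; the cutoffs 100 and 1000 and the exponents m = 1/4, 3/4 are arbitrary choices). Here v̂(ξ) = 2 sin(ξ)/ξ, so the integrand in (2.68) behaves like |ξ|^{2m−2} at large ξ: truncating the integral at growing cutoffs, it nearly saturates for m < 1/2 and keeps growing for m > 1/2.

```python
import numpy as np

# H^m integral (2.68) for v = indicator of (-1, 1), truncated at |xi| <= cutoff.
def hm_integral(m, cutoff, n=200001):
    xi = np.linspace(-cutoff, cutoff, n)
    vhat = 2 * np.sinc(xi / np.pi)               # np.sinc(t) = sin(pi t)/(pi t)
    integrand = (1 + np.abs(xi) ** (2 * m)) * vhat**2
    return np.sum(integrand) * (xi[1] - xi[0])   # crude Riemann sum

i1, i2 = hm_integral(0.25, 100.0), hm_integral(0.25, 1000.0)   # m < 1/2
j1, j2 = hm_integral(0.75, 100.0), hm_integral(0.75, 1000.0)   # m > 1/2
```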
The relationship between (2.67) and (2.68) can be explained as follows:
v(x) is smooth (in the physical domain)  ⟺  v̂(ξ) decays fast (in the Fourier domain).
This is another very general and very important property of Fourier transforms. The equivalence between (2.67) and (2.68) is one manifestation of this "rule".
Main result of the section. Summarizing all of the above, we have obtained the following result.
Theorem 2.3 Let us assume that P(iξ) and R_∆(ξ) are parabolic, in the sense that (2.43) and (2.58) are satisfied, and that R_∆(ξ) is of order m, i.e., (2.61) holds. Assuming that the initial condition u_0 ∈ H^m, we obtain that
‖u(n∆t, x) − U^n(x)‖ ≤ C∆x^m ‖u_0‖_{H^m},  for all 0 ≤ n ≤ N = T/∆t. (2.69)
We recall that ∆t = λ∆x^M, so that in the above theorem, the scheme is of order m in space and of order m/M in time.
Exercise 2.20 (difficult) Follow the proof of Theorem 2.3 assuming that (2.43) and (2.58) are replaced by P(iξ) ≥ 0 and |R_∆(ξ)| ≤ 1. Show that (2.69) should be replaced by
‖u(n∆t, x) − U^n(x)‖ ≤ C∆x^m ‖u_0‖_{H^{m+M}},  for all 0 ≤ n ≤ N = T/∆t. (2.70)
Loosely speaking, what this means is that we need the initial condition to be more regular if the equation is not regularizing. You can compare this with Theorem 2.1: to obtain convergence of order 2 in space, we have to assume that the solution u(t, x) is of class C⁴ in space. This is consistent with the above result with M = 2 and m = 2.
Exercise 2.21 (very difficult) Assuming that (2.58) is replaced by |R_∆(ξ)| ≤ 1, show that (2.69) should be replaced by
‖u(n∆t, x) − U^n(x)‖ ≤ C∆x^m ‖u_0‖_{H^{m+ε}},  for all 0 ≤ n ≤ N = T/∆t. (2.71)
Here ε is an arbitrary positive constant. Of course, the constant C depends on ε. What this result means is that if we lose the parabolic property of the discrete scheme, we still have the same order of convergence provided that the initial condition is slightly more regular than being in H^m.
Exercise 2.22 Show that any stable scheme that is convergent of order m is also convergent of order αm for all 0 ≤ α ≤ 1. Deduce that (2.69) in Theorem 2.3 can then be replaced by
‖u(n∆t, x) − U^n(x)‖ ≤ C∆x^{αm} ‖u_0‖_{H^{αm}},  for all 0 ≤ n ≤ N = T/∆t. (2.72)
The interest of this result is the following: when the initial data u_0 is not sufficiently regular to be in H^m but sufficiently regular to be in H^{αm} with 0 < α < 1, we still have convergence of the finite difference scheme as ∆x → 0; however, the rate of convergence is slower.
Exercise 2.23 Show that everything we have said in this section holds if we replace (2.43) and (2.58) by
Re P(iξ) ≥ C(|ξ|^M − 1)  and  |R_∆(ξ)| ≤ 1 − δ|∆xξ|^M + C∆t,  |∆xξ| ≤ γ,
respectively. Here, Re stands for "real part". The above hypotheses are sufficiently general to cover most practical examples of parabolic equations.
Exercise 2.24 Show that schemes accurate of order m > 0, so that (2.61) holds, and verifying |R_∆(ξ)| ≤ 1 are stable in the sense that
|R_∆(ξ)| ≤ (1 + C∆x^m ∆t) − δ|∆xξ|^M,  |∆xξ| ≤ γ.
The last exercises show that schemes that are stable in the "classical" sense, i.e., that verify |R_∆(ξ)| < 1, and are consistent with the smoothing parabolic symbol, are themselves smoothing.
2.4 Application to the θ schemes
We are now ready to analyze the family of schemes for the heat equation (i.e., P(iξ) = ξ²) called the θ schemes.
The θ scheme is defined by
∂_t U^n(x) = θ ∂_x ∂_x̄ U^{n+1}(x) + (1 − θ) ∂_x ∂_x̄ U^n(x). (2.73)
Recall that the finite difference operators are defined by (2.4) and (2.5). When θ = 0, we recover the explicit Euler scheme. For θ = 1, the scheme is called the fully implicit Euler scheme. For θ = 1/2, the scheme is called the Crank-Nicolson scheme. We recall that ∆t = λ∆x².
Exercise 2.25 Show that the symbol of the θ scheme is given by
R_∆(ξ) = (1 − 2(1 − θ)λ(1 − cos(∆xξ))) / (1 + 2θλ(1 − cos(∆xξ))). (2.74)
Let us first consider the stability of the θ method. Let us assume that 0 ≤ θ ≤ 1. We observe that R_∆(ξ) ≤ 1. The stability requirement is therefore that
min_ξ R_∆(ξ) ≥ −1.
Exercise 2.26 Show that the above inequality holds if and only if
(1 − 2θ)λ ≤ 1/2.
This shows that the method is unconditionally L²-stable when θ ≥ 1/2, i.e., the method is stable independently of the choice of λ. When θ < 1/2, stability holds if and only if
λ ≤ 1/(2(1 − 2θ)).
Let us now consider the accuracy of the θ method. Obviously we need to show that
(2.61) holds for some m > 0.
Exercise 2.27 Show that
e^{−∆tξ²} = 1 − ∆tξ² + (1/2)∆t²ξ⁴ − (1/6)∆t³ξ⁶ + O(∆t⁴ξ⁸), (2.75)
and that
R_∆(ξ) = 1 − ∆tξ² + ∆t∆x²(1/12 + λθ)ξ⁴ − (∆t∆x⁴/360)(1 + 60λθ + 360λ²θ²)ξ⁶ + O(∆t⁴ξ⁸), (2.76)
as ∆t → 0 and ∆xξ → 0.
Exercise 2.28 Show that the θ scheme is always at least second-order in space (first-order in time).
Show that the scheme is of order 4 in space if and only if
θ = 1/2 − 1/(12λ).
Show that the scheme is of order 6 in space if in addition
1 + 60λθ + 360λ²θ² = 60λ², i.e., λ = √5/10.
Implementation of the θ scheme. The above analysis shows that the θ scheme is unconditionally stable when θ ≥ 1/2. This very nice stability property comes however at a price: the calculation of U^{n+1} from U^n is no longer explicit.
Exercise 2.29 Show that the θ scheme can be put in the form (2.44) with
b_{−1} = b_1 = −θλ,  b_0 = 1 + 2θλ,  a_{−1} = a_1 = (1 − θ)λ,  a_0 = 1 − 2(1 − θ)λ.
Since the coefficients b_1 and b_{−1} do not vanish, the scheme is no longer explicit. Let us come back to the solution U^n_j defined on the grid ∆xZ. The operators A_∆ and B_∆ in (2.45) are now infinite matrices such that
B_∆ U^{n+1} = A_∆ U^n,
where U^n denotes the infinite vector (U^n_j)_{j∈Z}. In matrix form, we obtain that the operator B_∆ for the θ scheme is given by the infinite matrix
B_∆ =
[ 1+2θλ   −θλ      0       0       0
  −θλ     1+2θλ   −θλ      0       0
  0       −θλ     1+2θλ   −θλ      0
  0       0       −θλ     1+2θλ   −θλ
  0       0       0       −θλ     1+2θλ ].  (2.77)
Similarly, the matrix A_∆ is given by
A_∆ =
[ 1−2(1−θ)λ  (1−θ)λ      0           0           0
  (1−θ)λ     1−2(1−θ)λ   (1−θ)λ      0           0
  0          (1−θ)λ      1−2(1−θ)λ   (1−θ)λ      0
  0          0           (1−θ)λ      1−2(1−θ)λ   (1−θ)λ
  0          0           0           (1−θ)λ      1−2(1−θ)λ ].  (2.78)
The vector U^{n+1} is thus given by
U^{n+1} = (B_∆)^{−1} A_∆ U^n. (2.79)
This implies that in any numerical implementation of the θ scheme where θ > 0, we need to invert the system B_∆ U^{n+1} = A_∆ U^n (which is usually much faster than constructing the matrix (B_∆)^{−1} directly). In practice, we obviously cannot use infinite matrices. However, we can use the N × N matrices above on a given interval of size N∆x. The above matrices correspond to imposing that the solution vanishes at the boundary of the interval (Dirichlet boundary conditions).
Exercise 2.30 Implement the θ scheme in Matlab for all possible values of 0 ≤ θ ≤ 1 and λ > 0.
Verify on a few examples with different values of θ < 1/2 that the scheme is unstable when λ is too large.
Verify numerically (by dividing an initial spatial mesh by a factor 2, then 4) that the θ scheme is of order 2, 4, and 6 in space for well-chosen values of θ and λ. To obtain these orders of convergence, make sure that the initial condition is sufficiently smooth. For non-smooth initial conditions, show that the order of convergence decreases and compare with the theoretical predictions of Theorem 2.3.
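A minimal sketch of such an implementation, written in Python/NumPy rather than Matlab (dense matrices for simplicity; the grid size, λ, and the sine initial condition are arbitrary choices), assembles the matrices (2.77)-(2.78) and solves B_∆U^{n+1} = A_∆U^n at each step:

```python
import numpy as np

# Theta scheme on (0, 1) with homogeneous Dirichlet boundary conditions.
def theta_scheme(u0, theta, lam, nsteps):
    n = u0.size
    off = np.eye(n, k=1) + np.eye(n, k=-1)
    B = (1 + 2 * theta * lam) * np.eye(n) - theta * lam * off      # (2.77)
    A = (1 - 2 * (1 - theta) * lam) * np.eye(n) + (1 - theta) * lam * off  # (2.78)
    u = u0.copy()
    for _ in range(nsteps):
        u = np.linalg.solve(B, A @ u)      # solve B U^{n+1} = A U^n
    return u

# Heat equation, u0(x) = sin(pi x): exact solution exp(-pi^2 t) sin(pi x).
N = 100
dx = 1.0 / N
x = dx * np.arange(1, N)                   # interior grid points
lam, nsteps = 0.5, 200                     # dt = lam * dx^2
t = nsteps * lam * dx**2
u = theta_scheme(np.sin(np.pi * x), 0.5, lam, nsteps)   # Crank-Nicolson
err = np.max(np.abs(u - np.exp(-np.pi**2 * t) * np.sin(np.pi * x)))
```

In a serious implementation the tridiagonal system would be solved with a sparse or banded solver rather than a dense one.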
3 Finite Differences for Elliptic Equations
The prototype of elliptic equation we consider is the following two-dimensional diffusion equation with absorption coefficient:
−∆u(x) + σu(x) = f(x),  x ∈ R². (3.1)
Here, x = (x, y) ∈ R², f(x) is a given source term, σ is a given positive absorption coefficient, and
∆ = ∇² = ∂²/∂x² + ∂²/∂y²
is the Laplace operator. We assume that our domain has no boundaries to simplify. The above partial differential equation can be analyzed by Fourier transform. In two space dimensions, the Fourier transform is defined by
[T_{x→ξ} f](ξ) ≡ f̂(ξ) = ∫_{R²} e^{−ix·ξ} f(x) dx. (3.2)
The inverse Fourier transform is then defined by
[T^{−1}_{ξ→x} f̂](x) ≡ f(x) = (1/(2π)²) ∫_{R²} e^{ix·ξ} f̂(ξ) dξ. (3.3)
We still verify that the Fourier transform replaces translations and differentiation by multiplications. More precisely, we have
[Tf(x + y)](ξ) = e^{iy·ξ} f̂(ξ),  [T∇f(x)](ξ) = iξ f̂(ξ). (3.4)
Notice here that both ∇ and ξ are 2-vectors. This implies that the operator −∆ is replaced by multiplication by |ξ|² (check it!). Upon taking the Fourier transform of both terms of the above equation, we obtain that
(σ + |ξ|²) û(ξ) = f̂(ξ),  ξ ∈ R². (3.5)
The solution to the diffusion equation is then given in the Fourier domain by
û(ξ) = f̂(ξ)/(σ + |ξ|²). (3.6)
It then remains to apply the inverse Fourier transform to the above equality to obtain u(x). Since the inverse Fourier transform of a product is the convolution of the inverse Fourier transforms (as in 1D), we deduce that
u(x) = [T^{−1}(1/(σ + |ξ|²))](x) ∗ f(x) = C K_0(√σ |x|) ∗ f(x),
where K_0 is the modified Bessel function of order 0 of the second kind and C is a normalization constant.
This shows again how powerful Fourier analysis is to solve partial differential equations with constant coefficients. We could define more general equations of the form
P(D)u(x) = f(x),  x ∈ R²,
where P(D) is an operator with symbol P(iξ). In the case of the diffusion equation, we would have P(iξ) = σ + |ξ|², i.e., P(iξ) is a parabolic symbol of order 2 in the terminology of the preceding section. All we will see in this section applies to quite general symbols P(iξ), although we shall consider only P(iξ) = σ + |ξ|² to simplify.
Discretization. How do we solve (3.1) by finite differences? We can certainly replace the differentials by approximated differentials as we did in 1D. This is the finite difference approach. Let us consider the grid of points x_ij = (i∆x, j∆x) for i, j ∈ Z and ∆x > 0 given. We then define
U_ij ≈ u(x_ij),  f_ij = f(x_ij). (3.7)
Using a second-order scheme, we can then replace (3.1) by
−(∂_x ∂_x̄ + ∂_y ∂_ȳ) U_ij + σU_ij = f_ij,  (i, j) ∈ Z², (3.8)
where the operator ∂_x is defined by
∂_x U_ij = (U_{i+1,j} − U_ij)/∆x, (3.9)
and ∂_x̄, ∂_y, and ∂_ȳ are defined similarly.
Assuming as in the preceding section that the finite difference operator is defined on the whole space R² and not only on the grid {x_ij}, we have more generally that
BU(x) = f(x),  x ∈ R², (3.10)
where the operator B is a sum of translations defined by
BU(x) = Σ_{β∈I_β} b_β U(x − ∆xβ). (3.11)
Here, I_β is a finite subset of Z², i.e., β = (β_1, β_2), where β_1 and β_2 are integers. Upon taking the Fourier transform of (3.11), we obtain
Σ_{β∈I_β} b_β e^{−i∆xβ·ξ} Û(ξ) = f̂(ξ). (3.12)
Defining
R_∆(ξ) = Σ_{β∈I_β} b_β e^{−i∆xβ·ξ}, (3.13)
we then obtain that the finite difference solution is given in the Fourier domain by
Û(ξ) = f̂(ξ)/R_∆(ξ), (3.14)
and that the difference between the exact and the discrete solutions is given by
û(ξ) − Û(ξ) = (1/P(iξ) − 1/R_∆(ξ)) f̂(ξ). (3.15)
For the second-order scheme introduced above, we obtain
R_∆(ξ) = σ + (2/∆x²)(2 − cos(∆xξ_x) − cos(∆xξ_y)). (3.16)
For ∆x|ξ| ≤ π, we deduce from Taylor expansions that
|R_∆(ξ) − P(iξ)| ≤ C∆x²|ξ|⁴.
For ∆x|ξ| ≥ π, we also obtain that
|R_∆(ξ)| ≤ C/∆x² ≤ C∆x²|ξ|⁴,
and
|P(iξ)| ≤ C∆x²|ξ|⁴.
This implies that
|R_∆(ξ) − P(iξ)| ≤ C∆x²|ξ|⁴
for all ξ ∈ R². Since R_∆(ξ) ≥ σ and P(iξ) ≥ C(1 + |ξ|²), we then easily deduce that
|(R_∆(ξ) − P(iξ))/(R_∆(ξ)P(iξ))| ≤ C∆x²(1 + |ξ|²). (3.17)
From this, we deduce that
|û(ξ) − Û(ξ)| ≤ C∆x²(1 + |ξ|²)|f̂(ξ)|.
Let us now square the above inequality and integrate. Upon taking the square root of the integrals, we obtain that
‖û(ξ) − Û(ξ)‖ ≤ C∆x² ( ∫_{R²} (1 + |ξ|²)² |f̂(ξ)|² dξ )^{1/2}. (3.18)
The above integral is bounded provided that f(x) ∈ H²(R²), using the definitions (2.67)-(2.68), which also hold in two space dimensions. From the Parseval equality (equation (2.53) with √(2π) replaced by 2π in two space dimensions), we deduce that
‖u(x) − U(x)‖ ≤ C∆x² ‖f(x)‖_{H²(R²)}. (3.19)
This is the final result of this section: provided that the source term is sufficiently regular (its second-order derivatives are square-integrable), the error made by the finite difference approximation is of order ∆x². The scheme is indeed second-order.
Exercise 3.1 Notice that the bound in (3.17) would be replaced by C∆x²(1 + |ξ|⁴) if only the Taylor expansion were used. The reason why (3.17) holds is that the exact symbol P(iξ) damps high frequencies (i.e., it is bounded from below by C(1 + |ξ|²)). Show that if this damping is not used in the proof, the final result (3.19) should be replaced by
‖u(x) − U(x)‖ ≤ C∆x² ‖f(x)‖_{H⁴(R²)}.
Exercise 3.2 (moderately difficult) Show that
|(R_∆(ξ) − P(iξ))/(R_∆(ξ)P(iξ))| ≤ C∆x^m (1 + |ξ|^m),
for all 0 ≤ m ≤ 2. [Hint: Show that (R_∆(ξ))^{−1} and (P(iξ))^{−1} are bounded and separate the cases |ξ|∆x ≥ 1 and |ξ|∆x ≤ 1.]
Deduce that
‖u(x) − U(x)‖ ≤ C∆x^m ‖f(x)‖_{H^m(R²)},
for 0 ≤ m ≤ 2. We deduce from this result that the error u(x) − U(x) still converges to 0 when f is less regular than H²(R²). However, the convergence is slower and no longer of order ∆x².
Exercise 3.3 Implement the finite difference algorithm in Matlab on a square of finite dimension (impose that the solution vanishes at the boundary of the domain). This implementation is not trivial. It requires constructing and inverting a matrix as was done in (2.79).
Solve the diffusion problem with a smooth function f and show that the order of convergence is 2. Show that the order of convergence decreases when f is less regular.
4 Finite Differences for Hyperbolic Equations
After parabolic and elliptic equations, we turn to the third class of important linear equations: hyperbolic equations. The simplest hyperbolic equation is the one-dimensional transport equation
∂u/∂t + a ∂u/∂x = 0,  t > 0, x ∈ R,
u(0, x) = u_0(x),  x ∈ R.  (4.1)
Here u_0(x) is the initial condition. When a is constant, the solution to the above equation is easily found to be
u(t, x) = u_0(x − at). (4.2)
This means that the initial data is simply translated at speed a. The solution can also be obtained by Fourier transforms. Indeed we have
∂û/∂t(t, ξ) + iaξ û(t, ξ) = 0, (4.3)
which gives
û(t, ξ) = û_0(ξ) e^{−iatξ}.
Now since multiplication by a phase in the Fourier domain corresponds to a translation in the physical domain, we obtain (4.2).
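This phase-shift solution is easy to verify on a periodic grid (Python/NumPy sketch; the grid, speed, and initial data are arbitrary choices). Choosing a·t equal to a whole number of grid cells makes the spectral solution an exact cyclic shift of the data:

```python
import numpy as np

# Transport equation solved spectrally: u-hat(t, xi) = exp(-i a t xi) u-hat0(xi).
N, Lx, a = 256, 2 * np.pi, 1.0
dx = Lx / N
x = dx * np.arange(N)
xi = 2 * np.pi * np.fft.fftfreq(N, d=dx)

u0 = np.exp(np.cos(x))                    # smooth periodic initial data
t = 10 * dx / a                           # translate by exactly 10 grid cells
u = np.fft.ifft(np.exp(-1j * a * t * xi) * np.fft.fft(u0)).real

err = np.max(np.abs(u - np.roll(u0, 10))) # u(t, x) = u0(x - a t)
```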
Here is an important remark. Notice that (4.3) can be written as (2.40) with P(iξ) = iaξ. Notice that this symbol does not have the properties imposed in Section 2. In particular, the symbol P(iξ) is purely imaginary (i.e., its real part vanishes). This corresponds to a completely different behavior of the PDE solutions: instead of regularizing initial discontinuities, as parabolic equations do, hyperbolic equations translate and possibly modify the shape of discontinuities, but do not regularize them. This behavior will inevitably show up when we discretize hyperbolic equations.
Let us now consider finite difference discretizations of the above transport equation. The framework introduced in section 2 still holds. The single-step finite difference scheme is given by (2.44). After Fourier transforms, it is defined by (2.49). The L² norm of the difference between the exact and approximate solutions is then given by

    ‖u(n∆t, x) − U^n(x)‖ = (1/√(2π)) ‖ε^n(ξ)‖ = (1/√(2π)) ‖(e^{−n∆tP(iξ)} − R^n_∆(ξ)) û_0(ξ)‖.
Here, we recall that P(iξ) = iaξ. The mathematical analysis of finite difference schemes for hyperbolic equations is then not very different from that for parabolic equations. Only the results are very different: schemes that may be natural for parabolic equations will no longer be so for hyperbolic equations.
Stability. One of the main steps in controlling the error term ε^n is to show that the scheme is stable. By stability, we mean here that

    |R^n_∆(ξ)| = |R_∆(ξ)|^n ≤ C,   n∆t ≤ T.
Let us consider the stability of classical schemes. The first idea that comes to mind to discretize (4.1) is to replace the time derivative by the forward difference ∂⁺_t and the spatial derivative by the forward difference ∂⁺_x. Spelling out the details, this gives

    U^{n+1}_j = U^n_j − (a∆t/∆x)(U^n_{j+1} − U^n_j).

After Fourier transform (and extension of the scheme to the whole line x ∈ R), we deduce that

    R_∆(ξ) = 1 − ν(e^{iξ∆x} − 1) = [1 − ν(cos(ξ∆x) − 1)] − iν sin(ξ∆x),

where we have defined

    ν = a∆t/∆x.      (4.4)
We deduce that

    |R_∆(ξ)|² = 1 + 2ν(1 + ν)(1 − cos(ξ∆x)).

When a > 0, which implies that ν > 0, we deduce that the scheme is always unstable since |R_∆(ξ)|² > 1 for cos(ξ∆x) ≤ 0, for instance. When ν < −1, we again deduce that |R_∆(ξ)|² > 1 since ν(1 + ν) > 0. It remains to consider −1 ≤ ν ≤ 0. There, ν(1 + ν) ≤ 0, and we then deduce that |R_∆(ξ)|² ≤ 1 for all wavenumbers ξ. This implies stability.
So the scheme is stable if and only if −1 ≤ ν ≤ 0. This implies that |a|∆t ≤ ∆x. Again, the time step cannot be chosen too large. This is referred to as a CFL condition. Notice however that this also implies that a < 0. This can be understood as follows. In the transport equation, information propagates from the right to the left when a < 0. Since the scheme approximates the spatial derivative by ∂⁺_x, it is asymmetrical and uses some information to the right (at the point X_{j+1} = X_j + ∆x) of the current point X_j. For a < 0, it gets the correct information where it comes from. When a > 0, the scheme still “looks” for information coming from the right, whereas physically it comes from the left.
This leads to the definition of the upwind scheme, defined by

    U^{n+1}_j = U^n_j − ν(U^n_j − U^n_{j−1})   when a > 0,
    U^{n+1}_j = U^n_j − ν(U^n_{j+1} − U^n_j)   when a < 0.      (4.5)
Exercise 4.1 Check that the upwind scheme is stable provided that |a|∆t ≤ ∆x. Implement the scheme in Matlab.
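The exercise asks for Matlab; an equivalent NumPy sketch for the case a > 0 on a periodic grid (our own naming) is:

```python
import numpy as np

def upwind(u0, nu, nsteps):
    """Upwind scheme (4.5) for a > 0 on a periodic grid; nu = a dt / dx."""
    U = u0.copy()
    for _ in range(nsteps):
        U = U - nu * (U - np.roll(U, 1))    # np.roll(U, 1) gives U_{j-1}, periodically
    return U

# transport sin(x) once around the 2 pi periodic domain with a = 1
N = 200
x = 2 * np.pi * np.arange(N) / N
dx = 2 * np.pi / N
dt = 0.5 * dx                         # nu = 0.5, so the CFL condition |a| dt <= dx holds
nsteps = int(round(2 * np.pi / dt))   # final time is exactly one period
U = upwind(np.sin(x), 0.5, nsteps)
err = np.max(np.abs(U - np.sin(x)))   # stays small: stable but mildly diffusive
```

After one full period the profile returns close to the initial data; the residual error is the numerical diffusion discussed below in the consistency analysis.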
The above result shows that we have to “know” which direction the information comes from to construct our scheme. For one-dimensional problems, this is relatively easy since only the sign of a is involved. In higher dimensions, knowing where the information comes from is much more complicated. A tempting solution to this difficulty is to average over the operators ∂⁺_x and ∂⁻_x:

    U^{n+1}_j = U^n_j − (ν/2)(U^n_{j+1} − U^n_{j−1}).
Exercise 4.2 Show that the above scheme is always unstable. Verify the instability in
Matlab.
A more satisfactory answer to the question was given by Lax and Wendroff. The Lax–Wendroff scheme is defined by

    U^{n+1}_j = ½ν(1 + ν) U^n_{j−1} + (1 − ν²) U^n_j − ½ν(1 − ν) U^n_{j+1}.      (4.6)
Exercise 4.3 Check that

    |R_∆(ξ)|² = 1 − 4ν²(1 − ν²) sin⁴(ξ∆x/2).

Deduce that the Lax–Wendroff scheme is stable when |ν| ≤ 1.
Implement the Lax–Wendroff scheme in Matlab.
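The amplification factor identity in the exercise can be spot-checked numerically; the following sketch (not from the notes) applies (4.6) to the mode e^{iξx} and compares |R_∆(ξ)|² with the closed form:

```python
import numpy as np

def R_lw(theta, nu):
    """Amplification factor of the Lax-Wendroff scheme (4.6) for the Fourier
    mode e^{i xi x}; theta = xi * dx."""
    return (0.5 * nu * (1 + nu) * np.exp(-1j * theta)
            + (1 - nu**2)
            - 0.5 * nu * (1 - nu) * np.exp(1j * theta))

nu = 0.7
theta = np.linspace(-np.pi, np.pi, 101)
lhs = np.abs(R_lw(theta, nu))**2
rhs = 1 - 4 * nu**2 * (1 - nu**2) * np.sin(theta / 2)**4
max_dev = np.max(np.abs(lhs - rhs))   # agreement up to roundoff
```

For |ν| ≤ 1 the right-hand side stays in [0, 1], which is the stability statement of the exercise.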
Consistency. We have introduced two stable schemes, the upwind scheme and the Lax–Wendroff scheme. It remains to analyze their convergence properties.
Consistency consists of analyzing the local properties of the discrete scheme and making sure that the scheme captures the main trends of the continuous equation. So assuming that u(n∆t, X_j) is known at time n∆t, we want to see what it becomes under the discrete scheme. For the upwind scheme with a > 0 (to simplify), we want to analyze

    (∂⁺_t + a ∂⁻_x) u(T_n, X_j).

The closer the above quantity is to 0 (which it would be if the discrete scheme were replaced by the continuous equation), the higher the order of convergence of the scheme. This error is obtained assuming that the exact solution is sufficiently regular and by using Taylor expansions.
Exercise 4.4 Show that

    (∂⁺_t + a ∂⁻_x) u(T_n, X_j) = −½(1 − ν) a ∆x (∂²u/∂x²)(T_n, X_j) + O(∆x²).
This shows that the upwind scheme is first-order when ν < 1. Notice that first-order approximations involve second-order derivatives of the exact solution. This is similar to the parabolic case and justifies the following definition in the Fourier domain (remember that a second-order derivative in the physical domain means multiplication by −ξ² in the Fourier domain).
We say that the scheme is convergent of order m if

    |e^{−∆tP(iξ)} − R_∆(ξ)| ≤ C ∆t ∆x^m (1 + |ξ|^{m+1}),      (4.7)

for ∆t and ∆x|ξ| sufficiently small (say ∆x|ξ| ≤ π). Notice that ∆t and ∆x are again related through (4.4).
Notice the parallel with (2.61): we have simply replaced M by 1. The difference between parabolic and hyperbolic equations and schemes does not come from the consistency conditions, but from the stability conditions. For hyperbolic schemes, we do not have (2.57) or (2.58).

Exercise 4.5 Show that the upwind scheme is first-order and the Lax–Wendroff scheme is second-order.
Proof of convergence. As usual, stability and consistency of the scheme, with sufficient regularity of the exact solution, are the ingredients of the convergence of the scheme. As in the parabolic case, we treat low and high wavenumbers separately. For low wavenumbers (∆x|ξ| ≤ π), we have

    ε^n(ξ) = (e^{−∆tP(iξ)} − R_∆(ξ)) Σ_{k=0}^{n−1} e^{−(n−1−k)∆tP(iξ)} R^k_∆(ξ) û_0(ξ).

Here is the main difference with the parabolic case: e^{−(n−1−k)∆tP(iξ)} and R^k_∆(ξ) are bounded but not small for high frequencies (notice that |e^{−n∆tP(iξ)}| = 1 for P(iξ) = iaξ). So using the stability of the scheme, which states that |R^k_∆(ξ)| ≤ 1, we deduce that

    |ε^n(ξ)| ≤ n |e^{−∆tP(iξ)} − R_∆(ξ)| |û_0(ξ)| ≤ C ∆x^m (1 + |ξ|^{m+1}) |û_0(ξ)|,

using the order of convergence of the scheme.
For high wavenumbers ∆x|ξ| ≥ π, we deduce that

    |ε^n(ξ)| ≤ 2 |û_0(ξ)| ≤ C ∆x^m (1 + |ξ|^{m+1}) |û_0(ξ)|.

So the above inequality holds for all frequencies ξ ∈ R. This implies that

    ‖ε^n(ξ)‖ ≤ C ∆x^m ( ∫_R (1 + |ξ|^{2m+2}) |û_0(ξ)|² dξ )^{1/2}.

We have then proved the
Theorem 4.1 Let us assume that P(iξ) = iaξ and that R_∆(ξ) is of order m, i.e., (4.7) holds. Assuming that the initial condition u_0 ∈ H^{m+1}, we obtain that

    ‖u(n∆t, x) − U^n(x)‖ ≤ C ∆x^m ‖u_0‖_{H^{m+1}},   for all 0 ≤ n ≤ N = T/∆t.      (4.8)

The main difference with the parabolic case is that u_0 now needs to be in H^{m+1} (and not in H^m) to obtain an accuracy of order ∆x^m. What if u_0 is not in H^m?
Exercise 4.6 Show that for any scheme of order m, we have

    |e^{−∆tP(iξ)} − R_∆(ξ)| ≤ C ∆t ∆x^{αm} (1 + |ξ|^{α(m+1)}),

for all 0 ≤ α ≤ 1 and |ξ|∆x ≤ π. Show that (4.8) is now replaced by

    ‖u(n∆t, x) − U^n(x)‖ ≤ C ∆x^{αm} ‖u_0‖_{H^{α(m+1)}},   for all 0 ≤ n ≤ N = T/∆t.      (4.9)

Deduce that if u_0 is in H^s for 0 ≤ s ≤ m + 1, the error of convergence of the scheme will be of order sm/(m+1).
We recall that piecewise constant functions are in the space H^s for all s < 1/2 (take s = 1/2 in the sequel to simplify). Deduce the order of the error of convergence for the upwind scheme and the Lax–Wendroff scheme. Verify this numerically.
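A numerical convergence study of the kind the exercise asks for can be sketched as follows (NumPy, our own naming; smooth data here, for which the upwind scheme shows its full first order — the same loop with discontinuous data exhibits the reduced order):

```python
import numpy as np

def upwind_L2_error(N, nu=0.8, T=1.0):
    """L2 error of the upwind scheme for u_t + u_x = 0 with u_0 = sin
    on the periodic grid of N points over [0, 2 pi)."""
    x = 2 * np.pi * np.arange(N) / N
    dx = 2 * np.pi / N
    dt = nu * dx
    nsteps = int(round(T / dt))
    U = np.sin(x)
    for _ in range(nsteps):
        U = U - nu * (U - np.roll(U, 1))
    exact = np.sin(x - nsteps * dt)          # compare at the time actually reached
    return np.sqrt(dx * np.sum((U - exact)**2))

e1, e2 = upwind_L2_error(100), upwind_L2_error(200)
ratio = e1 / e2                              # about 2 for a first-order scheme
```

Halving ∆x roughly halves the error, consistent with Exercise 4.5.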
5 Spectral methods
5.1 Unbounded grids and semidiscrete FT
The material of this section is borrowed from chapters 1 & 2 in Trefethen [2].
Finite differences were based on approximating ∂/∂x by finite differences of the form ∂⁺_x, ∂⁻_x for first-order approximations, or ½(∂⁺_x + ∂⁻_x) for a second-order approximation, and so on for higher-order approximations.
Spectral methods are based on using an approximation of very high order for the partial derivative. In matrix form, the second-order approximation given by (1.2) and the fourth-order approximation given by (1.3) in [2] are replaced in the “spectral” approximation by the “infinite-order” matrix in (1.4).
How is this done? The main ingredient is “interpolation”. We interpolate a set of
values of a function at the points of a grid by a (smooth) function deﬁned everywhere.
We can then obviously diﬀerentiate this function and take the values of the diﬀerentiated
function at the grid points. This is how derivatives are approximated in spectral methods
and how (1.4) in [2] is obtained.
Let us be more specific and consider first the infinite grid hZ with grid points x_j = jh for j ∈ Z and h > 0 given.
Let us assume that we are given the values v_j of a certain function v(x) at the grid points, i.e., v(x_j) = v_j. What we want is to define a function p(x) such that p(x_j) = v_j. This function p(x) will then be an interpolant of the series {v_j}. To do so we define the semidiscrete Fourier transform as

    v̂(k) = h Σ_{j=−∞}^{∞} e^{−ikx_j} v_j,   k ∈ [−π/h, π/h].      (5.1)

It turns out that this transform can be inverted. We define the inverse semidiscrete Fourier transform as

    v_j = (1/2π) ∫_{−π/h}^{π/h} e^{ikx_j} v̂(k) dk.      (5.2)

The above definitions are nothing but the familiar Fourier series. Here, the v_j are the Fourier coefficients of the periodic function v̂(k) on [−π/h, π/h].

Exercise 5.1 Check that v̂(k) in (5.1) is indeed periodic of period 2π/h.

It is then well known that the function v̂(k) can be reconstructed from its Fourier coefficients; this is (5.1).
Notice now that the inverse semidiscrete Fourier transform (a.k.a. Fourier series) in (5.2) is defined for every x_j of the grid hZ, but could be defined more generally for every point x ∈ R. This is how we define our interpolant, by simply replacing x_j in (5.2) by x:

    p(x) = (1/2π) ∫_{−π/h}^{π/h} e^{ikx} v̂(k) dk.      (5.3)

The function p(x) is a very nice function. It can be shown that it is analytic (which means that the infinite Taylor expansion of p(z) in the vicinity of every complex number z converges to p; so the function p is infinitely many times differentiable), and by definition it satisfies p(x_j) = v_j.
Now assume that we want to define w_j, an approximation of the derivative of v given by the v_j at the grid points x_j. The spectral approximation simply consists of defining

    w_j = p′(x_j),      (5.4)

where p(x) is defined by (5.3). Notice that w_j a priori depends on all the values v_j. This is why the matrix (1.4) in [2] is infinite and full.
How do we calculate this matrix? Here we use the linearity of differentiation. If

    v_j = v^{(1)}_j + v^{(2)}_j,

then we clearly have that

    w_j = w^{(1)}_j + w^{(2)}_j,

for the spectral derivatives. So all we need to do is to consider the derivative of the function v_j such that v_n = 1 for some n ∈ Z and v_m = 0 for m ≠ n. We then obtain that

    v̂_n(k) = h e^{−inkh},

and that

    p_n(x) = (h/2π) ∫_{−π/h}^{π/h} e^{ik(x−nh)} dk = S_h(x − nh),

where S_h(x) is the sinc function

    S_h(x) = sin(πx/h) / (πx/h).      (5.5)

More generally, we obtain that the interpolant p(x) is given by

    p(x) = Σ_{j=−∞}^{∞} v_j S_h(x − jh).      (5.6)

An easy calculation shows that

    S_h′(jh) = 0             for j = 0,
    S_h′(jh) = (−1)^j/(jh)   for j ≠ 0.      (5.7)
Exercise 5.2 Check this.
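A quick numerical spot-check of these derivative values of the sinc function at the grid points (a NumPy sketch; `np.sinc` is the normalized sinc, so S_h(x) = np.sinc(x/h)):

```python
import numpy as np

h = 0.5
S = lambda x: np.sinc(x / h)          # S_h(x) = sin(pi x / h) / (pi x / h)

def dS(x, eps=1e-6):
    """Central-difference approximation of S_h'(x)."""
    return (S(x + eps) - S(x - eps)) / (2 * eps)

j = np.array([1, 2, 3, -1, -4])
numeric = dS(j * h)
exact = (-1.0) ** j / (j * h)         # the claimed values for j != 0
max_dev = np.max(np.abs(numeric - exact))
center = dS(np.array([0.0]))[0]       # should be ~0, matching the j = 0 case
```

The agreement confirms both the j = 0 and the j ≠ 0 cases.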
This implies that

    p_n′(x_j) = 0                       for j = n,
    p_n′(x_j) = (−1)^{j−n}/((j − n)h)   for j ≠ n.

These are precisely the values of the entries (n, j) of the matrix in (1.4) in [2].
Notice that the same technique can be used to define approximations to derivatives of arbitrary order. Once we have the interpolant p(x), we can differentiate it as many times as necessary. For instance, the matrix corresponding to second-order differentiation is given in (2.14) of [2].
5.2 Periodic grids and Discrete FT
The simplest way to replace the infinite-dimensional matrices encountered in the preceding section is to assume that the initial function v is periodic. To simplify, it will be 2π-periodic. The function is then represented by its values at the points x_j = 2πj/N for some N ∈ N, so that h = 2π/N. We also assume that N is even. Odd values of N are treated similarly, but the formulas are slightly different.
We know that periodic functions are represented by Fourier series. Here, the function is not only periodic, but also only given on the grid (x_1, …, x_N). So it turns out that its “Fourier series” is finite. This is how we define the Discrete Fourier transform:

    v̂_k = h Σ_{j=1}^{N} e^{−ikx_j} v_j,   k = −N/2 + 1, …, N/2.      (5.8)
The Inverse Discrete Fourier Transform is then given by

    v_j = (1/2π) Σ_{k=−N/2+1}^{N/2} e^{ikx_j} v̂_k,   j = 1, …, N.      (5.9)

The latter is recast as

    v_j = (1/2π) Σ′_{k=−N/2}^{N/2} e^{ikx_j} v̂_k,   j = 1, …, N.      (5.10)

Here we define v̂_{−N/2} = v̂_{N/2}, and Σ′ means that the terms k = ±N/2 are multiplied by ½. The latter definition is more symmetrical than the former, although both are obviously equivalent. The reason we introduce (5.10) is that the two definitions do not yield the same interpolant, and the interpolant based on (5.10) is more symmetrical.
As we did for infinite grids, we can extend (5.10) to arbitrary values of x and not only the values on the grid. The discrete interpolant is now defined by

    p(x) = (1/2π) Σ′_{k=−N/2}^{N/2} e^{ikx} v̂_k,   x ∈ [0, 2π].      (5.11)
Notice that this interpolant is obviously 2π-periodic, and is a trigonometric polynomial of order (at most) N/2.
We can now easily obtain a spectral approximation of v′(x) at the grid points: we simply differentiate p(x) and take the values of the derivative at the grid points. Again, this corresponds to

    w_j = p′(x_j),   j = 1, …, N.
Exercise 5.3 Calculate the N×N matrix D_N which maps the values v_j of the function v(x) to the values w_j of the function p′(x_j). The result of the calculation is given on p. 21 in [2].
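For reference, the matrix of the exercise is circulant with zero diagonal and off-diagonal entries ½(−1)^{i−j} cot((i−j)h/2) (this is the formula on p. 21 of [2]); the following NumPy sketch builds it and checks it against a mode it must differentiate exactly:

```python
import numpy as np

def spectral_diff_matrix(N):
    """Periodic spectral differentiation matrix D_N for even N: zero diagonal,
    off-diagonal entries 0.5 * (-1)^(i-j) * cot((i-j) h / 2), h = 2 pi / N."""
    h = 2 * np.pi / N
    i, j = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
    k = i - j
    D = np.zeros((N, N))
    off = k != 0
    D[off] = 0.5 * (-1.0) ** k[off] / np.tan(k[off] * h / 2)
    return D

# D_N differentiates low-degree trigonometric polynomials exactly
N = 16
x = 2 * np.pi * np.arange(1, N + 1) / N
D = spectral_diff_matrix(N)
err = np.max(np.abs(D @ np.sin(x) - np.cos(x)))
```

The error is at the level of roundoff, illustrating the “infinite order” of the spectral derivative.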
5.3 Fast Fourier Transform (FFT)
In the preceding subsections, we have seen how to calculate the spectral derivatives w_j at the grid points of a function defined by its values v_j at the same grid points. We have also introduced matrices that map the v_j to the w_j. Our derivation was done in the physical domain, in the sense that we have constructed the interpolant p(x), taken its derivative p′, and evaluated the derivative p′(x_j) at the grid points. Another way to consider the derivation is to see what it means to “take a derivative” in the Fourier domain. This is what the FFT is based on.
Remember that (classical) Fourier transforms replace differentiation by multiplication by iξ; see (2.27). The interpolants of the semidiscrete and discrete Fourier transforms have very similar properties. For instance we deduce from (5.3) that
    p′(x) = (1/2π) ∫_{−π/h}^{π/h} e^{ikx} ik v̂(k) dk.

So again, in the Fourier domain, differentiation really means replacing v̂(k) by ik v̂(k). For the discrete Fourier transform, we deduce from (5.11) that

    p′(x) = (1/2π) Σ′_{k=−N/2}^{N/2} e^{ikx} ik v̂_k.

Notice that by definition, we have v̂_{−N/2} = v̂_{N/2}, so that i(−N/2)v̂_{−N/2} + i(N/2)v̂_{N/2} = 0. In other words, we have

    p′(x) = (1/2π) Σ_{k=−N/2+1}^{N/2−1} e^{ikx} ik v̂_k.

This means that the Fourier coefficients v̂_k are indeed replaced by ik v̂_k when we differentiate p(x), except for the value k = N/2, for which the Fourier coefficient vanishes. So, to calculate the w_j from the v_j in the periodic case, we have the following procedure:

1. Calculate v̂_k from v_j for −N/2 + 1 ≤ k ≤ N/2.
2. Define ŵ_k = ik v̂_k for −N/2 + 1 ≤ k ≤ N/2 − 1 and ŵ_{N/2} = 0.
3. Calculate w_j from ŵ_k by the Inverse Discrete Fourier Transform.
Said in other words, the above procedure is the discrete analog of the formula

    f′(x) = T⁻¹_{ξ→x} [ iξ T_{x→ξ}[f(x)] ].

On a computer, the multiplication by ik (step 2) is fast. The main computational difficulties come from the implementation of the discrete Fourier transform and its inversion (steps 1 and 3). A very efficient implementation of steps 1 and 3 is called the FFT, the Fast Fourier Transform. It is an algorithm that performs steps 1 and 3 in O(N log N) operations. This has to be compared with a number of operations of order O(N²) if the formulas (5.8) and (5.9) are used directly.

Exercise 5.4 Check the order O(N²).

We are not going to describe how the FFT works; let us merely state that it is a very convenient alternative to constructing the matrix D_N to calculate the spectral derivative. We refer to [2], Programs 4 and 5, for details.
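The three-step procedure maps directly onto NumPy’s FFT (whose unnormalized transform matches (5.8) up to the constant h, which cancels in the round trip); a sketch, with our own function name:

```python
import numpy as np

def spectral_derivative(v):
    """FFT implementation of the three steps above; N = len(v) assumed even."""
    N = len(v)
    v_hat = np.fft.fft(v)                 # step 1: discrete Fourier transform
    k = np.fft.fftfreq(N, d=1.0 / N)      # integer wavenumbers 0..N/2-1, -N/2..-1
    k[N // 2] = 0.0                       # step 2: set hat w_{N/2} = 0
    return np.real(np.fft.ifft(1j * k * v_hat))   # steps 2-3: multiply by ik, invert

N = 64
x = 2 * np.pi * np.arange(N) / N
err = np.max(np.abs(spectral_derivative(np.sin(3 * x)) - 3 * np.cos(3 * x)))
```

For a trigonometric polynomial the result is exact up to roundoff, at O(N log N) cost.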
5.4 Spectral approximation for unbounded grids
We have introduced several interpolants to obtain approximations of the derivatives of
functions deﬁned by their values on a (discrete) grid. It remains to know what sort of
error one makes by introducing such interpolants. Chapter 4 in [2] gives some answer to
this question. Here we shall give a slightly diﬀerent answer based on the Sobolev scale
introduced in the analysis of ﬁnite diﬀerences.
Let u(x) be a function defined for x ∈ R and define v_j = u(x_j). The Fourier transform of u(x) is denoted by û(k). We then define the semidiscrete Fourier transform v̂(k) by (5.1) and the interpolant p(x) by (5.3).
The question is then: what is the error between u(x) and p(x)? The answer relies on one thing: how regular u(x) is. The analysis of the error is based on estimates in the Fourier domain. Smooth functions u(x) correspond to functions û(k) that decay fast in k.
The ﬁrst theoretical ingredient, which is important in its own right, is the aliasing
formula, also called the Poisson summation formula.
Theorem 5.1 Let u ∈ L²(R) be sufficiently smooth. Then for all k ∈ [−π/h, π/h],

    v̂(k) = Σ_{j=−∞}^{∞} û(k + 2πj/h).      (5.12)

In other words, the aliasing formula relates the discrete Fourier transform v̂(k) to the continuous Fourier transform û(k).
Derivation of the aliasing formula. We write

    v̂(k) = h Σ_{j=−∞}^{∞} e^{−ikhj} v_j = Σ_{j=−∞}^{∞} ∫_R û(k′) e^{ihj(k′−k)} (h dk′/2π)
         = ∫_R û(k + 2πl/h) Σ_{j=−∞}^{∞} e^{i2πlj} dl,

using the change of variables h(k′ − k) = 2πl, so that h dk′ = 2π dl. We will show that

    Σ_{j=−∞}^{∞} e^{i2πlj} = Σ_{j=−∞}^{∞} δ(l − j).      (5.13)
Assuming that this equality holds, we then obtain that

    v̂(k) = ∫_R û(k + 2πl/h) Σ_{j=−∞}^{∞} δ(l − j) dl = Σ_{j=−∞}^{∞} û(k + 2πj/h),

which is what was to be proved. It thus remains to show (5.13). Here is a simple explanation. Take the Fourier series of a 1-periodic function f(x):

    c_n = ∫_{−1/2}^{1/2} e^{−i2πnx} f(x) dx.

Then the inversion is given by

    f(x) = Σ_{n=−∞}^{∞} e^{i2πnx} c_n.

Now take f(x) to be the delta function f(x) = δ(x). We then deduce that c_n = 1 and, for x ∈ (0, 1),

    Σ_{n=−∞}^{∞} e^{i2πnx} = δ(x).

Let us now extend both sides of the above relation by periodicity to the whole line R. The left-hand side is 1-periodic in x and does not change. The right-hand side now takes the form

    Σ_{j=−∞}^{∞} δ(x − j).

This is nothing but the periodic delta function. This proves (5.13).
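The aliasing formula can be checked numerically for a function whose Fourier transform is known in closed form; a sketch with a Gaussian (using the convention û(k) = ∫ e^{−ikx} u(x) dx implicit in the derivation above):

```python
import numpy as np

h = 1.0
u = lambda x: np.exp(-x**2 / 2)                           # smooth, rapidly decaying
u_hat = lambda k: np.sqrt(2 * np.pi) * np.exp(-k**2 / 2)  # its Fourier transform

j = np.arange(-200, 201)       # truncation; both sums converge extremely fast
k = 0.7                        # any k in [-pi/h, pi/h]
v_hat = h * np.sum(np.exp(-1j * k * j * h) * u(j * h))    # semidiscrete FT (5.1)
alias = np.sum(u_hat(k + 2 * np.pi * j / h))              # right-hand side of (5.12)
dev = abs(v_hat - alias)
```

The two sides agree to machine precision, as (5.12) predicts.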
Now that we have the aliasing formula, we can analyze the difference u(x) − p(x). We’ll do it in the L² sense, since

    ‖u(x) − p(x)‖_{L²(R)} = (1/√(2π)) ‖û(k) − p̂(k)‖_{L²(R)}.      (5.14)
We now show the following result:
Theorem 5.2 Let us assume that u(x) ∈ H^m(R) for m > 1/2. Then we have

    ‖u(x) − p(x)‖_{L²(R)} ≤ C h^m ‖u‖_{H^m(R)},      (5.15)

where the constant C is independent of h and u.
Proof. We first realize that

    p̂(k) = χ(k) v̂(k)

by construction, where χ(k) = 1 if k ∈ [−π/h, π/h] and χ(k) = 0 otherwise is the characteristic function of the interval [−π/h, π/h]. In other words, p(x) is band-limited, i.e., has no high frequencies. We thus deduce that

    ∫_R |û(k) − p̂(k)|² dk = ∫_R (1 − χ(k)) |û(k)|² dk + ∫_{−π/h}^{π/h} |Σ_{j∈Z*} û(k + 2πj/h)|² dk,

using (5.12). We have denoted Z* = Z\{0}. Looking carefully at the above terms, one realizes that the above error only involves û(k) for values of k outside the interval [−π/h, π/h]. In other words, if û(k) has compact support inside [−π/h, π/h], then p̂(k) = û(k), hence p(x) = u(x), and the interpolant is exact. When û(k) does not vanish outside [−π/h, π/h], i.e., when high-frequency content is present, we have to use the regularity of the function u(x) to control the above term. This is what we now do. The first term is easy to deal with. Indeed we have

    ∫_R (1 − χ(k)) |û(k)|² dk ≤ h^{2m} ∫_R (1 + k^{2m}) |û(k)|² dk ≤ h^{2m} ‖u‖²_{H^m(R)}.

This is simply based on realizing that |hk| ≥ π when χ(k) = 0. The second term is a little more painful. The Cauchy–Schwarz inequality tells us that

    |Σ_j a_j b_j|² ≤ (Σ_j |a_j|²)(Σ_j |b_j|²).
This is a mere generalization of |x · y| ≤ |x||y|. Let us choose

    a_j = û(k + 2πj/h) (1 + |k + 2πj/h|)^m,   b_j = (1 + |k + 2πj/h|)^{−m}.

We thus obtain that

    |Σ_{j∈Z*} û(k + 2πj/h)|² ≤ ( Σ_{j∈Z*} |û(k + 2πj/h)|² (1 + |k + 2πj/h|)^{2m} ) ( Σ_{j∈Z*} (1 + |k + 2πj/h|)^{−2m} ).
Notice that

    Σ_{j∈Z*} (1 + |k + 2πj/h|)^{−2m} ≤ h^{2m} Σ_{j∈Z*} 1/(2π(1 + |j|))^{2m},   for |k| ≤ π/h.

For m > 1/2, the above sum is convergent, so that

    Σ_{j∈Z*} (1 + |k + 2πj/h|)^{−2m} ≤ C h^{2m},
where C depends on m but not on u or h. This implies that

    ∫_{−π/h}^{π/h} |Σ_{j∈Z*} û(k + 2πj/h)|² dk
        ≤ C h^{2m} ∫_{−π/h}^{π/h} Σ_{j∈Z*} |û(k + 2πj/h)|² (1 + |k + 2πj/h|)^{2m} dk
        ≤ C h^{2m} ∫_R |û(k)|² (1 + |k|)^{2m} dk = C h^{2m} ‖u‖²_{H^m}.

This concludes the proof of the theorem.
Let us emphasize something we saw in the course of the above proof. When u(x) is band-limited, so that û(k) = 0 outside [−π/h, π/h], the interpolant is exact; namely p(x) = u(x). However, we have seen that the interpolant is determined by the values v_j = u(x_j). More precisely, we saw that

    p(x) = Σ_{n=−∞}^{∞} v_n S_h(x − nh),

where S_h is the sinc function defined in (5.5). This proves the extremely important Shannon–Whittaker sampling theorem:
Theorem 5.3 Let us assume that the function u(x) is such that û(k) = 0 for k outside [−π/h, π/h]. Then u(x) is completely determined by its values v_j = u(hj) for j ∈ Z. More precisely, we have

    u(x) = Σ_{j=−∞}^{∞} u(hj) S_h(x − jh),   S_h(x) = sin(πx/h)/(πx/h).      (5.16)

This result has quite important applications in communication theory. Assuming that a signal has no high frequencies (as in speech, since the human ear cannot detect frequencies above roughly 20 kHz), there is an adapted sampling rate h such that the signal can be reconstructed from the values it takes on the grid hZ. This means that a continuous signal can be compressed onto a discrete grid with no error. When h is chosen too large, so that frequencies above π/h are present in the signal, the interpolant is no longer exact, and the compression of the signal has some error. This error is precisely called aliasing error and is quantified by Theorem 5.1.
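The sampling theorem can be demonstrated numerically with a band-limited test function; the sketch below (truncating the infinite sum, which accounts for the small residual) reconstructs an off-grid value from grid samples:

```python
import numpy as np

# np.sinc(x) = sin(pi x) / (pi x): band-limited, with hat u supported in [-pi, pi]
u = np.sinc

h = 0.5                        # pi / h = 2 pi exceeds the bandwidth, so (5.16) is exact
j = np.arange(-10000, 10001)   # truncation of the infinite sum over j
x0 = 0.3712                    # an arbitrary off-grid point

recon = np.sum(u(j * h) * np.sinc((x0 - j * h) / h))   # right-hand side of (5.16)
dev = abs(recon - u(x0))       # small, limited only by the truncation
```

With h chosen too large (undersampling), the same sum would no longer reproduce u(x0): that discrepancy is the aliasing error.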
Let us finish this section with an analysis of the error between the derivatives of u(x) and those of p(x). Remember that our initial objective is to obtain an approximation of u′(x_j) by using p′(x_j). Minor modifications in the proof of Theorem 5.2 yield the following result:

Theorem 5.4 Let us assume that u(x) ∈ H^m(R) for m > 1/2 and let n ≤ m. Then we have

    ‖u^{(n)}(x) − p^{(n)}(x)‖_{L²(R)} ≤ C h^{m−n} ‖u‖_{H^m(R)},      (5.17)

where the constant C is independent of h and u.
The proof of this theorem consists of realizing that

    ‖u^{(n)}(x) − p^{(n)}(x)‖_{L²(R)} = (1/√(2π)) ‖(ik)^n (û(k) − p̂(k))‖_{L²(R)}.
Moreover we have

    ∫_R |k|^{2n} |û(k) − p̂(k)|² dk = ∫_R |k|^{2n} (1 − χ(k)) |û(k)|² dk
        + ∫_{−π/h}^{π/h} |k|^{2n} |Σ_{j∈Z*} û(k + 2πj/h)|² dk.

The first term on the right-hand side is bounded by

    h^{2(m−n)} ∫_R (1 + k^{2m}) |û(k)|² dk = ( h^{m−n} ‖u‖_{H^m} )².

The second term is bounded by

    h^{−2n} ∫_{−π/h}^{π/h} |Σ_{j∈Z*} û(k + 2πj/h)|² dk,

which is bounded, as we saw in the proof of Theorem 5.2, by h^{−2n} h^{2m} ‖u‖²_{H^m}. This proves the theorem.
This result is important for us: it shows that the error one makes by approximating an n-th order derivative of u(x) by using p^{(n)}(x) is of order h^{m−n}, where m is the H^m-regularity of u. So when u is very smooth, p^{(n)}(x) converges extremely fast to u^{(n)}(x) as h → 0. This is the greatest advantage of spectral methods.
5.5 Application to PDE’s
An elliptic equation. Consider the simplest of elliptic equations:

    −u″(x) + σ u(x) = f(x),   x ∈ R,      (5.18)

where the absorption σ > 0 is a constant positive parameter. Define α = √σ. We verify that

    G(x) = (1/2α) e^{−α|x|}      (5.19)

is the Green function of the above problem, whose solution is thus

    u(x) = ∫_R G(x − y) f(y) dy = (1/2α) ∫_R e^{−α|x−y|} f(y) dy.      (5.20)

Note that this may also be obtained in the Fourier domain, where

    û(ξ) = f̂(ξ)/(σ + ξ²).      (5.21)

Discretizing the above equation with the second-order finite difference method:

    −(U(x + h) + U(x − h) − 2U(x))/h² + σ U(x) = f(x),      (5.22)

we obtain as in (3.19) in section 3 an error estimate of the form

    ‖u − U‖_{L²} ≤ C h² ‖f‖_{H²}.      (5.23)
Let us now compare this result to the discretization based on the spectral method. Spectral methods are based on replacing functions by their spectral interpolants. Let p(x) be the spectral interpolant of u(x) and F(x) the spectral interpolant of the source term f(x). Now p(x) and F(x) only depend on the values they take at the points hj for j ∈ Z. We may now calculate the second-order derivative of p(x) (see the following paragraph for an explicit expression). The equation we now use to deduce p(x) from F(x) is the equation (5.18):

    −p″(x) + σ p(x) = F(x),   x ∈ R.

Note that this problem may be solved by inverting a matrix of the form −N² + σI, where N is the matrix of differentiation. The error u − p now solves the equation

    −(u − p)″(x) + σ(u − p)(x) = (f − F)(x).

Looking at this in the Fourier domain (and coming back), we easily deduce that

    ‖u − p‖_{H²(R)} ≤ C ‖f − F‖_{L²(R)} ≤ C h^m ‖f‖_{H^m(R)},      (5.24)

where the latter inequality comes from Theorem 5.2. Here m is arbitrary provided that f is arbitrarily smooth. We thus obtain that the spectral method has an “infinite” order of accuracy, unlike finite difference methods. When the source terms and the solutions of the PDEs we are interested in are smooth (spatially), the convergence properties of spectral methods are much better than those of finite difference methods.
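On a periodic grid (a convenient surrogate for the whole-line problem), the spectral solve amounts to dividing each Fourier coefficient by the symbol σ + k², exactly as in (5.21); a NumPy sketch with a manufactured solution:

```python
import numpy as np

def solve_spectral(f_vals, sigma):
    """Spectral solve of -u'' + sigma u = f on a 2 pi periodic grid: divide each
    discrete Fourier coefficient by sigma + k^2, the symbol of the operator."""
    N = len(f_vals)
    k = np.fft.fftfreq(N, d=1.0 / N)      # integer wavenumbers
    return np.real(np.fft.ifft(np.fft.fft(f_vals) / (sigma + k**2)))

N, sigma = 64, 2.0
x = 2 * np.pi * np.arange(N) / N
f = (9 + sigma) * np.sin(3 * x)           # -u'' + sigma u for u = sin(3x)
err = np.max(np.abs(solve_spectral(f, sigma) - np.sin(3 * x)))
```

For this smooth (indeed band-limited) source the discrete solution matches the exact one to roundoff, illustrating the “infinite” order of accuracy.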
A parabolic equation. Let us now apply the above theory to the computation of solutions of the heat equation

    ∂u/∂t − ∂²u/∂x² = 0

on the real line R. We introduce U^n_j, the approximate solution at time T_n = n∆t, 1 ≤ n ≤ N, and point x_j = hj for j ∈ Z.
Let us consider the Euler explicit scheme based on a spectral approximation of the spatial derivatives. The interpolant at time T_n is given by

    p_n(x) = Σ_{j=−∞}^{∞} U^n_j S_h(x − jh),

so that the approximation to the second derivative is given by

    p_n″(kh) = Σ_{j=−∞}^{∞} S_h″((k − j)h) U^n_j = Σ_{j=−∞}^{∞} S_h″(jh) U^n_{k−j}.

So the Euler explicit scheme is given by

    (U^{n+1}_j − U^n_j)/∆t = Σ_{k=−∞}^{∞} S_h″(kh) U^n_{j−k}.      (5.25)
It now remains to see how this scheme behaves. We shall only consider stability here (i.e., see how ∆t and h have to be chosen so that the discrete solution does not blow up as n → T/∆t). As far as convergence is concerned, the scheme will be first order in time (accuracy of order ∆t) and will be of spectral accuracy (infinite order) in space provided that the initial solution is sufficiently smooth. This is plausible from the analysis of the error made by replacing a function by its interpolant that we made earlier.
Let us come back to the stability issue. As we did for finite differences, we realize that we can look at a scheme defined for all x ∈ R and not only at x = x_j:

    (U^{n+1}(x) − U^n(x))/∆t = Σ_{k=−∞}^{∞} S_h″(kh) U^n(x − hk).      (5.26)

Once we have realized this, we can pass to the Fourier domain and obtain that

    Û^{n+1}(ξ) = ( 1 + ∆t Σ_{k=−∞}^{∞} S_h″(kh) e^{−ikhξ} ) Û^n(ξ).      (5.27)

So the stability of the scheme boils down to making sure that

    R(ξ) = 1 + ∆t Σ_{k=−∞}^{∞} S_h″(kh) e^{−ikhξ}      (5.28)

stays between −1 and 1 for all frequencies ξ.
The analysis is more complicated than for finite differences because of the infinite sum that appears in the definition of R. Here is a way to analyze R(ξ). We first realize that

    S_h″(kh) = −π²/(3h²)             for k = 0,
    S_h″(kh) = 2(−1)^{k+1}/(h²k²)    for k ≠ 0.

Exercise 5.5 Check this.

So we can recast R(ξ) as

    R(ξ) = 1 − ∆t π²/(3h²) − (2∆t/h²) Σ_{k≠0} e^{i(hξ−π)k}/k².

Exercise 5.6 Check this.
We now need to understand the remaining infinite sum. It turns out that we have access to an exact formula. It is based on using the Poisson formula

    Σ_j e^{i2πlj} = Σ_j δ(l − j),

which we recast for l ∈ [−1/2, 1/2] as

    Σ_{j≠0} e^{i2πlj} = δ(l) − 1.
Notice that both terms have zero mean on [−1/2, 1/2] (i.e., there is no 0 frequency). We now integrate these terms in l. We obtain that

    Σ_{j≠0} e^{i2πlj}/(i2πj) = H(l) − l − d_0,

where H(l) is the Heaviside function, such that H(l) = 1 for l > 0 and H(l) = 0 for l < 0. The constant d_0 is now fixed by making sure that the above right-hand side remains 1-periodic (as is the left-hand side). Another way of looking at it is to make sure that both sides have zero mean on (−1/2, 1/2). We thus find that d_0 = 1/2.

Exercise 5.7 Check this.

Let us integrate one more time. We now obtain that

    Σ_{j≠0} e^{i2πlj}/(−4π²j²) = lH(l) − l²/2 − l/2 − d_1,

where d_1 is again chosen so that the right-hand side has zero mean. We obtain that d_1 = 1/12.

Exercise 5.8 Check this.
To sum up our calculation, we have obtained that

    Σ_{j≠0} e^{i2πlj}/j² = 4π² [ l²/2 + 1/12 − l(H(l) − 1/2) ].
This shows that

    R(ξ) = 1 − (∆t/h²) [ π²/3 + 8π² ( l²/2 + 1/12 − l(H(l) − 1/2) ) ],   l = hξ/(2π) − 1/2.      (5.29)
A plot of the function l → l²/2 + 1/12 − l(H(l) − 1/2) is given in Fig. 5.1. Notice that this function is 1-periodic and continuous. This is justified as follows: the delta function is concentrated at one point; its first integral is a piecewise linear function (H(l) − 1/2 − l) and is discontinuous; and its second integral is continuous (and piecewise quadratic).
Exercise 5.9 Check that l(H(l) − 1/2) = |l|/2. Deduce that

    R(ξ) = 1 − (∆t/h²) π² (1 + 4l(1 − |l|)),

so that

    R(ξ) = 1 − (∆t/h²) h²ξ² := 1 − ∆t ξ²,   0 ≤ |ξ| ≤ π/h.

[Check the above for hξ ≤ π and use (5.28) to show that R(ξ) = R(−ξ) and that R(ξ) is 2π/h-periodic.]
[Figure 5.1 here: “Second periodic integral of the periodic delta function”, the graph of f(l) on [−1/2, 1/2].]

Figure 5.1: Plot of the function l → f(l) = l²/2 + 1/12 − l(H(l) − 1/2). The maximal value is 1/12 and the minimal value is −1/24.
Since R(ξ) is a periodized version of the symbol 1 − ∆tξ², we deduce that

    1 − π² ∆t/h² ≤ R(ξ) ≤ 1,

that the minimal value is reached for l = 0, i.e., for the frequencies

    ξ = (2π/h)(m + 1/2),   m ∈ Z,

and that the maximum is reached for l = ±1/2, i.e., for the frequencies

    ξ = 2πm/h,   m ∈ Z.

Notice that R(0) = 1 as it should be, and that R(π/h) = 1 − π²∆t/h², which is a quite bad approximation of e^{−∆tπ²/h²} for the exact solution. But as usual, we do not expect frequencies of order h⁻¹ to be well approximated on a grid of size h.
Since we need |R(ξ)| ≤ 1 for all frequencies to obtain a stable scheme, we see that

    λ = ∆t/h² ≤ 2/π²      (5.30)

is necessary to obtain a convergent scheme as h → 0 with λ fixed.
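The constraint (5.30) can be observed numerically. On a 2π-periodic grid the spectral Euler scheme multiplies each Fourier mode by 1 − ∆t k² per step, with |k| up to π/h, so the same bound λ ≤ 2/π² applies; a sketch (our own setup, rough random initial data so that all modes are excited):

```python
import numpy as np

def run_spectral_euler(lam, N=64, nsteps=400):
    """Explicit Euler for u_t = u_xx with the spectral Laplacian on a 2 pi
    periodic grid of N points; lam = dt / h^2. Each Fourier mode is multiplied
    by 1 - dt k^2 per step. Returns the final maximum amplitude."""
    h = 2 * np.pi / N
    dt = lam * h**2
    k = np.fft.fftfreq(N, d=1.0 / N)
    rng = np.random.default_rng(0)
    U_hat = np.fft.fft(rng.standard_normal(N))    # rough data: all modes present
    for _ in range(nsteps):
        U_hat = (1 - dt * k**2) * U_hat
    return np.max(np.abs(np.fft.ifft(U_hat)))

stable = run_spectral_euler(0.9 * 2 / np.pi**2)     # lambda just below 2 / pi^2
unstable = run_spectral_euler(1.1 * 2 / np.pi**2)   # lambda just above 2 / pi^2
```

Below the threshold the solution stays bounded; above it, the highest modes (|1 − ∆t k²| > 1) grow geometrically and the amplitude blows up.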
6 Introduction to ﬁnite element methods
We now turn to one of the most useful methods in numerical simulations: the ﬁnite
element method. This will be very introductory. The material mostly follows the
presentation in [1].
6.1 Elliptic equations
Let Ω be a convex open domain in R² with boundary ∂Ω. We will assume that ∂Ω is either sufficiently smooth or a polygon. On Ω we consider the boundary value problem

    −∆u(x) + σ(x)u(x) = f(x),   x = (x_1, x_2) ∈ Ω,
    u(x) = 0,                   x ∈ ∂Ω.      (6.1)

The Laplace operator ∆ is defined as

    ∆ = ∂²/∂x_1² + ∂²/∂x_2².

We assume that the absorption coefficient σ(x) ∈ L^∞(Ω) satisfies the constraint

    σ(x) ≥ σ_0 > 0,   x ∈ Ω̄.      (6.2)
We also assume that the source term f ∈ L²(Ω) (i.e., it is square integrable).
This subsection is devoted to showing the existence of a solution to the above problem and to recalling some elementary properties that will be useful in the analysis of discretized problems.
Let us first assume that there exists a sufficiently smooth solution u(x) to (6.1). Let v(x) be a sufficiently smooth test function defined on Ω and such that v(x) = 0 for x ∈ ∂Ω. Upon multiplying (6.1) by v(x) and integrating over Ω, we obtain by integration by parts, using Green’s formula, that

    a(u, v) := ∫_Ω (∇u · ∇v + σuv) dx = ∫_Ω f v dx := L(v).      (6.3)
Exercise 6.1 Prove the above formula.
Let us introduce a few Hilbert spaces. For k ∈ N, we define

    H^k(Ω) ≡ H^k = { v ∈ L²(Ω); D^α v ∈ L²(Ω) for all |α| ≤ k }.      (6.4)

We recall that for a multi-index α = (α_1, α_2) (in two space dimensions) with α_k ∈ N, 1 ≤ k ≤ 2, we denote |α| = α_1 + α_2 and

    D^α = ∂^{α_1}/∂x_1^{α_1} · ∂^{α_2}/∂x_2^{α_2}.

Thus v ∈ H^k ≡ H^k(Ω) if all its partial derivatives of order up to k are square integrable in Ω. The Hilbert space is equipped with the norm

    ‖u‖_{H^k} = ( Σ_{|α|≤k} ‖D^α u‖²_{L²} )^{1/2}.      (6.5)
We also define the seminorm on H^k:

    |u|_{H^k} = ( Σ_{|α|=k} ‖D^α u‖²_{L²} )^{1/2}.    (6.6)

Note that the above seminorm is not a norm, for all polynomials p_{k−1} of order at most
k − 1 are such that |p_{k−1}|_{H^k} = 0.
Because of the choice of boundary conditions in (6.1), we also need to introduce the
Hilbert space

    H¹₀(Ω) = {v ∈ H¹(Ω); v|_{∂Ω} = 0}.    (6.7)
We can show that the above space is a closed subspace of H¹(Ω), where both spaces
are equipped with the norm ‖·‖_{H¹}. Showing this is not completely trivial and requires
understanding the operator that restricts a function defined on Ω to the values it takes
at the boundary ∂Ω. The restriction is defined for sufficiently smooth functions only
(being an element in H¹ is sufficient, not necessary). For instance, “the restriction to
∂Ω of a function in L²(Ω)” means nothing. This is because L² functions are defined
up to modifications on sets of measure zero (for the Lebesgue measure) and ∂Ω is
precisely a set of measure zero.
We can show that (6.3) holds for any function v ∈ H¹₀(Ω). This allows us to “replace”
the problem (6.1) by the new problem in variational form:

    Find u ∈ H¹₀(Ω) such that for all v ∈ H¹₀(Ω), we have
    a(u, v) = L(v).    (6.8)
Theorem 6.1 There is a unique solution to the above problem (6.8).
Proof. The theorem is a consequence of the Riesz representation theorem. The main
ingredients of the proof are the following. First, note that a(u, v) is a symmetric bilinear
form on the Hilbert space V = H¹₀(Ω). It is clearly linear in u and v, and one verifies
that a(u, v) = a(v, u).
The bilinear form a(u, v) is also positive definite, in the sense that a(v, v) > 0 for all
v ∈ V such that v ≠ 0. Indeed, a(u, v) is even better than this: it is coercive in V, in
the sense that there exists α > 0 such that

    a(v, v) ≥ α‖v‖²_V,   for all v ∈ V.    (6.9)

Indeed, from the definition of a(u, v) and the constraint on σ, we can choose α =
min(1, σ₀).
Since a(u, v) is a symmetric, positive definite, bilinear form on V, it defines an inner
product on V (by definition of an inner product). The norm associated to this inner
product is given by

    ‖u‖_a = a^{1/2}(u, u),   u ∈ V.    (6.10)

The Riesz representation theorem then precisely states the following: let V be a Hilbert
space with an inner product a(·, ·). Then for each bounded linear form L on V, there is
a unique u ∈ V such that L(v) = a(u, v) for all v ∈ V. Theorem 6.1 is thus equivalent
to showing that L(v) is bounded on V, which means the existence of a constant ‖L‖_{V*}
such that |L(v)| ≤ ‖L‖_{V*} a^{1/2}(v, v) for all v ∈ V. However we have

    |L(v)| ≤ ‖f‖_{L²} ‖v‖_{L²} = (‖f‖_{L²}/√σ₀) (√σ₀ ‖v‖_{L²}) ≤ (‖f‖_{L²}/√σ₀) a^{1/2}(v, v).

This yields the bound for L and the existence of a solution to (6.8). Now assume that u
and w are solutions. Then by taking differences in (6.8), we deduce that a(u − w, v) = 0
for all v ∈ V. Choose now v = u − w to get that a(u − w, u − w) = 0. However, since a
is positive definite, this implies that u − w = 0 and the uniqueness of the solution.
Remark 6.2 The solution to the above problem admits another interpretation. Indeed,
let us define the functional

    J(u) = ½ a(u, u) − L(u).    (6.11)

We can then show that u is a solution to (6.8) if and only if u is the minimizer of J
over V = H¹₀.
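This equivalence is transparent in finite dimensions, where a(u, v) = uᵀAv for a symmetric positive definite matrix A and L(u) = bᵀu: minimizing J(u) = ½uᵀAu − bᵀu is the same as solving Au = b. A small numerical illustration (the matrix and vector below are arbitrary test data):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5 * np.eye(5)        # symmetric positive definite
b = rng.standard_normal(5)

def J(u):
    """The quadratic functional (6.11) in matrix form."""
    return 0.5 * u @ A @ u - b @ u

u_star = np.linalg.solve(A, b)     # solution of the "variational problem" Au = b

# J is strictly larger at every perturbed point
perturbed = [u_star + 0.1 * rng.standard_normal(5) for _ in range(100)]
assert all(J(v) > J(u_star) for v in perturbed)
```

Since J(v) − J(u*) = ½(v − u*)ᵀA(v − u*) > 0 for v ≠ u*, the check above can only fail if A is not positive definite.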
Remark 6.3 Theorem 6.1 admits a very simple proof based on the Riesz representation
theorem. However, it requires a(u, v) to be symmetric, which may not be the case in
practice. The Lax–Milgram theory is then the tool to use. What it says is that, provided
a(u, v) is coercive and bounded on a Hilbert space V (equipped with norm ‖·‖_V),
in the sense that there are two constants C₁ and α such that

    α‖v‖²_V ≤ a(v, v),   |a(v, w)| ≤ C₁ ‖v‖_V ‖w‖_V,   for all v, w ∈ V,    (6.12)

and provided that the functional L(v) is bounded on V, in the sense that there exists a
constant ‖L‖_{V*} such that

    |L(v)| ≤ ‖L‖_{V*} ‖v‖_V,   for all v ∈ V,

then the abstract problem

    a(u, v) = L(v),   for all v ∈ V,    (6.13)

admits a unique solution u ∈ V. This abstract theory is extremely useful in many
practical problems.
Regularity issues. The above theory gives us a weak solution u ∈ H¹₀ to the problem
(6.1). The solution is “weak” because the elliptic problem should be considered in its
variational form (6.3) rather than its PDE form (6.1). The Laplacian of a function in
H¹₀ is not necessarily a function. So u is not a strong solution (which is defined as a C²
solution of (6.1)).
It thus remains to understand whether the regularity u ∈ H¹₀ is optimal. The
answer is no in most cases. What we can show is that for smooth boundaries ∂Ω and
for convex polygons ∂Ω, we have

    ‖u‖_{H²} ≤ C‖f‖_{L²}.    (6.14)
When f ∈ L², we thus deduce that u ∈ H² ∩ H¹₀. Deriving such results is beyond the
scope of these notes and we will simply admit them. The main aspect of this regularity
result is that u has two more degrees of regularity than f; when f ∈ L², the second-order
partial derivatives of u are also in L². This may be generalized as follows. For
sufficiently smooth ∂Ω, or for very special polygonal domains (such as rectangles), we have the
following result:

    ‖u‖_{H^{k+2}} ≤ C‖f‖_{H^k},   k ≥ 0.    (6.15)

This means that the solution u may be quite regular provided that the source term
f is sufficiently regular. Moreover, a classical Sobolev embedding theorem says that
a function u ∈ H^k for k > d/2, where d is the space dimension, is also of class C⁰. In
dimension d = 2, this means that as soon as f ∈ H^k for k > 1, then u is of class C²,
i.e., is a strong solution of (6.1).
As we saw for finite difference and spectral methods, the regularity of the solution
dictates the accuracy of the discretized solution. The above regularity results will thus
be of crucial importance in the analysis of finite element methods, which we take up
now.
6.2 Galerkin approximation
The above theory of existence is fundamentally different from what we used in the finite
difference and spectral methods: we no longer have (6.1) to discretize but rather (6.3).
The main difficulty in solving (6.3) is that V = H¹₀ is infinite dimensional. This however
gives us the right idea to discretize (6.3): simply replace V by a discrete approximation
V_h and solve (6.3) in V_h rather than V. We then have to choose the family of spaces V_h
for some parameter h → 0 (think of h as a mesh size) such that the solution converges
to u in a reasonable sense.
The Galerkin method is based on choosing finite dimensional subspaces V_h ⊂ V. In
our problem (6.1), this means choosing V_h as subsets of H¹₀. This requires some care:
functions in H¹₀ have derivatives in L². So piecewise constant functions, for instance,
are not elements of H¹₀ (because they can jump, and the derivatives of jumps are too
singular to be elements of L²), and V_h can therefore not be constructed using piecewise
constant functions.
Let us assume that we have a sequence of finite dimensional subspaces V_h ⊂ V. Then
the spaces V_h are Hilbert spaces. Assuming that the bilinear form a(u, v) is coercive
and bounded on V (see Rem. 6.3), then clearly it remains coercive and bounded on V_h
(because it is specifically chosen as a subset of V!). Theorem 6.1 and its generalization
in Rem. 6.3 then hold with V replaced by V_h. The same theory as for the continuous
problem thus directly gives us existence and uniqueness of a solution u_h to the discrete
problem:

    Find u_h ∈ V_h such that a(u_h, v) = L(v) for all v ∈ V_h.    (6.16)
It remains to obtain an error estimate for the problem on V_h. Let us assume that
a(u, v) is symmetric and as such defines a norm (6.10) on V and V_h. Then we have

Lemma 6.4 The solution u_h to (6.16) is the best approximation to the solution u of
(6.3) for the ‖·‖_a norm:

    ‖u − u_h‖_a = min_{v ∈ V_h} ‖u − v‖_a.    (6.17)
Proof. There are two main ideas here. The first idea is that since V_h ⊂ V, we can
subtract (6.16) from (6.8) to get the following orthogonality condition

    a(u − u_h, v) = 0,   for all v ∈ V_h.    (6.18)

Note that this property only depends on the linearity of u ↦ a(u, v) and not on a being
symmetric. Now because a(u, u) is an inner product (so that a(u, v) ≤ ‖u‖_a ‖v‖_a) and
v ↦ a(u, v) is linear, we verify that for all v ∈ V_h

    ‖u − u_h‖²_a = a(u − u_h, u − u_h) = a(u − u_h, u) = a(u − u_h, u − v) ≤ ‖u − u_h‖_a ‖u − v‖_a.

If ‖u − u_h‖_a = 0, then the lemma is obviously true. If not, then we can divide both sides
in the above inequality by ‖u − u_h‖_a to get that ‖u − u_h‖_a ≤ ‖u − v‖_a for all v ∈ V_h.
This yields the lemma.
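In finite dimensions the lemma can be observed directly: take a(u, v) = uᵀAv with A symmetric positive definite on Rⁿ, let V_h be the span of a few vectors, and compute the Galerkin solution; it is the a-orthogonal projection of u onto V_h. A sketch (the matrices are random test data):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 8, 3
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)          # SPD: defines the inner product a(u, v) = u^T A v
b = rng.standard_normal(n)
u = np.linalg.solve(A, b)            # "exact" solution of a(u, v) = L(v) for all v

P = rng.standard_normal((n, m))      # columns span the discrete space V_h
uh = P @ np.linalg.solve(P.T @ A @ P, P.T @ b)   # Galerkin solution in V_h

def a_norm(w):
    return np.sqrt(w @ A @ w)

# Galerkin orthogonality (6.18): a(u - uh, v) = 0 for all v in V_h
assert np.allclose(P.T @ A @ (u - uh), 0)

# best approximation (6.17): no element of V_h is closer in the a-norm
for _ in range(200):
    v = P @ rng.standard_normal(m)
    assert a_norm(u - uh) <= a_norm(u - v) + 1e-9
```

The second assertion never fails: by (6.18) and Pythagoras in the a-inner product, ‖u − v‖²_a = ‖u − u_h‖²_a + ‖u_h − v‖²_a for every v ∈ V_h.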
Remark 6.5 When a(u, v) only satisfies the hypotheses given in Rem. 6.3, the above result
does not hold. However, it may be replaced by

    ‖u − u_h‖_V ≤ (C₁/α) min_{v ∈ V_h} ‖u − v‖_V.    (6.19)

Exercise 6.2 Prove (6.19).

So, although u_h is no longer the best approximation of u in any norm (because a(u, v) no
longer generates a norm), we still observe that up to a multiplicative constant (greater
than 1 obviously), u_h is still very close to being an optimal approximation of u (for the
norm ‖·‖_V now).
Note that the above result also holds when a(u, v) is symmetric. Then u_h is the best
approximation of u for the norm ‖·‖_a, but not for the norm ‖·‖_V. We have equipped
V (and V_h) with two different, albeit equivalent, norms. We recall that two norms ‖·‖_a
and ‖·‖_V are equivalent if there exists a constant C such that

    C^{−1} ‖v‖_V ≤ ‖v‖_a ≤ C ‖v‖_V,   for all v ∈ V.    (6.20)

Exercise 6.3 When a(u, v) is symmetric and coercive on V, show that the two norms
‖·‖_a and ‖·‖_V are equivalent.
Let us summarize our results. We have replaced the variational formulation for
u ∈ V by a discrete variational formulation for u_h ∈ V_h. We have shown that the discrete
problem indeed admits a unique solution (obtaining it thus requires solving a problem of the
form Ax = b, since the problem is linear and finite dimensional), and that the solution
u_h is the best approximation to u when a(u, v) generates a norm, or at least is not
too far from the best approximation, this time for the norm ‖·‖_V, when a(u, v) is coercive
(whether it is symmetric or not).
In our problem (6.1), V = H¹₀. So we have obtained that ‖u − u_h‖_{H¹} is close to the
best possible approximation of u by functions in V_h ⊂ H¹₀. Note that the “closeness” is all
relative. For instance, when α is very small, an error bound of the form (6.19) is not
necessarily very accurate. When a is symmetric, (6.17) is however always optimal. In
any event, we need additional information on the spaces V_h if we want to go further.
We can go further in two directions: see how small ‖u − u_h‖_{H¹} is for specific sequences of
spaces V_h, but also see how small u − u_h is in possibly other norms one may be interested
in, such as ‖u − u_h‖_{L²}. We now consider a specific example of Galerkin approximation:
the finite element method (although we should warn the reader that many finite element
methods are not of the Galerkin form because they are based on discretization spaces
V_h that are not subspaces of V).
6.3 An example of finite element method
We are now ready to introduce a finite element method. To simplify, we assume that Ω is
a rectangle. We then divide Ω into a family T_h of non-overlapping closed triangles. Each
triangle T ∈ T_h is represented by three vertices and three edges. We impose that two
neighboring triangles have either one edge (and thus two vertices) in common, or one vertex
in common. This precludes the existence of vertices in the interior of an edge. The set
of all vertices in the triangulation is called the set of nodes of T_h. The index h > 0 measures
the “size” of the triangles. A triangulation T_h of size h is thus characterized by

    Ω̄ = ∪_{T ∈ T_h} T,   h_T = diam(T),   h = max_{T ∈ T_h} h_T.    (6.21)

As the triangulation gets finer and h → 0, we wish to be able to construct more accurate
solutions u_h to the solution u of (6.8). Note that finite element methods could also be based
on different tilings of Ω, for instance based on quadrilaterals or higher-order polygons.
We only consider triangles here.
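A triangulation in the sense of (6.21) is easy to produce on a square domain by splitting each cell of a uniform grid into two triangles; neighboring triangles then share either a full edge or a single vertex. A sketch that also reports h = max_T h_T (grid size chosen for illustration):

```python
import numpy as np

n = 5                                     # grid points per side (illustrative)
xs = np.linspace(0.0, 1.0, n)
X, Y = np.meshgrid(xs, xs, indexing="ij")
pts = np.column_stack([X.ravel(), Y.ravel()])
idx = lambda i, j: i * n + j

tris = []                                 # each triangle = 3 vertex indices
for i in range(n - 1):
    for j in range(n - 1):
        tris.append((idx(i, j), idx(i + 1, j), idx(i, j + 1)))
        tris.append((idx(i + 1, j), idx(i + 1, j + 1), idx(i, j + 1)))

def diam(t):                              # h_T = diam(T)
    p = pts[list(t)]
    return max(np.linalg.norm(p[a] - p[b]) for a in range(3) for b in range(a))

def area(t):
    v1, v2 = pts[t[1]] - pts[t[0]], pts[t[2]] - pts[t[0]]
    return 0.5 * abs(v1[0] * v2[1] - v1[1] * v2[0])

h = max(diam(t) for t in tris)            # h = max_T h_T
assert np.isclose(sum(area(t) for t in tris), 1.0)   # the triangles tile the square
```

All triangles here are congruent right triangles, so h_T = h for every T; unstructured mesh generators produce triangles of varying size but the characterization (6.21) is the same.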
The reason we introduce the triangulation is that we want to approximate arbitrary
functions in V = H¹₀ by finite dimensional functions on each triangle T ∈ T_h. As we have
already mentioned, the functions cannot be chosen on each triangle independently of
what happens on the other triangles. The reason is that such functions may not match
across edges shared by two triangles. These jumps however would violate the fact that
our discrete functions need to be in H¹₀. For instance, functions that are constant on
each triangle T ∈ T_h are not in H¹₀ and are thus not admissible. Here we consider
the simplest finite element method. The simplest functions after piecewise constant
functions are piecewise linear functions. Moreover, we want the functions not to jump
across edges of the triangulation. We thus define

    V_h = {v ∈ H¹₀; v is linear on each T ∈ T_h}.    (6.22)

Recall that this imposes that v = 0 on ∂Ω. The space V_h is a finite dimensional subspace
of H¹₀ (see below for a “proof”), so that the theory given in the previous section applies.
This gives us a solution u_h to (6.16).
A linear system. We now recast the above problem for u_h as a linear system (of the
form Ax = b). To do so, we need a basis for V_h. Because the functions in V_h are in
H¹₀, they cannot jump across edges. Because they are piecewise linear, they must be
continuous on Ω. So they are characterized (uniquely determined) by the values they
take at each node P_i ∈ Ω for 1 ≤ i ≤ M_h of the triangulation T_h (note that since
Ω is an open domain, the notation means that M_h is the number of internal nodes of
the triangulation: the nodes P_i ∈ ∂Ω should be labeled with M_h + 1 ≤ i ≤ M_h + b_h,
where b_h is the number of nodes P_i ∈ ∂Ω). Consider the family of pyramid functions
φ_i(x) ∈ V_h for 1 ≤ i ≤ M_h defined by

    φ_i(P_j) = δ_ij,   1 ≤ i, j ≤ M_h.    (6.23)

We recall that the Kronecker delta δ_ij = 1 if i = j and δ_ij = 0 otherwise. Since φ_i(x)
is piecewise affine on T_h, it is uniquely defined by its values at the nodes P_j. Moreover,
we easily deduce that the family is linearly independent, whence it is a basis for V_h. This
proves that V_h is finite dimensional. Note that by assuming that φ_i ∈ V_h, we have also
implicitly imposed that φ_i(P_j) = 0 for P_j ∈ ∂Ω. Note also that for P_j, P_k ∈ ∂Ω, the
function φ_i (for P_i ∉ ∂Ω) vanishes on the whole edge (P_j, P_k), so that indeed φ_i = 0 on
∂Ω. We can thus write any function v ∈ V_h as

    v(x) = Σ_{i=1}^{M_h} v_i φ_i(x),   v_i = v(P_i).    (6.24)
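The pyramid functions have an exact one-dimensional analogue, the "hat" functions on a grid, which make (6.23) and (6.24) easy to visualize and test; a sketch (the grid and sample function are illustrative):

```python
import numpy as np

h, M = 0.1, 9                         # interior nodes P_i = i*h, i = 1..M, on (0, 1)
nodes = h * np.arange(1, M + 1)

def phi(i, x):
    """Hat function: phi_i(P_j) = delta_ij, piecewise linear, zero at 0 and 1."""
    return np.maximum(0.0, 1.0 - np.abs(x - nodes[i]) / h)

# Kronecker property (6.23)
for i in range(M):
    for j in range(M):
        assert np.isclose(phi(i, nodes[j]), 1.0 if i == j else 0.0)

# nodal representation (6.24): values at the nodes determine the interpolant
v = lambda x: np.sin(np.pi * x)
x = np.linspace(0.0, 1.0, 101)
vh = sum(v(nodes[i]) * phi(i, x) for i in range(M))
assert np.isclose(vh[50], v(0.5))     # vh agrees with v at the node x = 0.5
```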
Let us decompose the solution u_h to (6.16) as u_h(x) = Σ_{i=1}^{M_h} u_i φ_i(x). The equation
(6.16) may then be recast as

    (f, φ_i) = a(u_h, φ_i) = Σ_{j=1}^{M_h} a(φ_j, φ_i) u_j.

This is nothing but the linear system

    Ax = b,   A_ij = a(φ_j, φ_i),   x_i = u_i,   b_i = (f, φ_i).    (6.25)
In the specific problem of interest here, where a(u, v) is defined by (6.3), we have

    A_ij = ∫_Ω (∇φ_j(x) · ∇φ_i(x) + σ(x) φ_j(x) φ_i(x)) dx.

There is therefore no need to discretize the absorption parameter σ(x): the variational
formulation takes care of this. Solving for u_h thus becomes a linear algebra problem,
which we do not describe further here. Let us just mention that the matrix A is quite
sparse because a(φ_i, φ_j) = 0 unless P_i and P_j belong to the same edge in the
triangulation. This is a crucial property to obtain efficient numerical inversions of (6.25). Let us
also mention that the ordering of the points P_i is important. As a rough rule of thumb,
one would like the indices of points P_i and P_j belonging to the same edge to be such
that |i − j| is as small as possible, to render the matrix A as close to a diagonal matrix
as possible. The optimal ordering thus very much depends on the triangulation and is
by no means a trivial problem.
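The assembly of A and b in (6.25) is mechanical once the triangulation and the hat basis are in place: each triangle contributes a 3 × 3 stiffness block area·(∇φᵢ·∇φⱼ) and, for constant σ, the exact P1 mass block σ·(area/12)(1 + δᵢⱼ). A minimal dense sketch on a structured triangulation of the unit square (σ = 1 and f = 1 are illustrative choices; real codes use sparse storage):

```python
import numpy as np

n = 16
xs = np.linspace(0.0, 1.0, n)
X, Y = np.meshgrid(xs, xs, indexing="ij")
pts = np.column_stack([X.ravel(), Y.ravel()])
idx = lambda i, j: i * n + j

tris = []                                  # two triangles per grid cell
for i in range(n - 1):
    for j in range(n - 1):
        tris.append([idx(i, j), idx(i + 1, j), idx(i, j + 1)])
        tris.append([idx(i + 1, j), idx(i + 1, j + 1), idx(i, j + 1)])

sigma, f = 1.0, 1.0
N = n * n
A = np.zeros((N, N))
b = np.zeros(N)
for t in tris:
    p = pts[t]
    B = np.column_stack([p[1] - p[0], p[2] - p[0]])
    area = 0.5 * abs(np.linalg.det(B))
    # gradients of the three hat (barycentric) functions on this triangle
    G = np.linalg.inv(B).T @ np.array([[-1.0, 1.0, 0.0], [-1.0, 0.0, 1.0]])
    K = area * G.T @ G                                          # local stiffness
    Mloc = sigma * area / 12.0 * (np.ones((3, 3)) + np.eye(3))  # exact P1 mass
    for r in range(3):
        b[t[r]] += f * area / 3.0                               # load: int f*phi_i
        for s in range(3):
            A[t[r], t[s]] += K[r, s] + Mloc[r, s]

# Dirichlet condition: solve on interior nodes only (boundary values are zero)
interior = [idx(i, j) for i in range(1, n - 1) for j in range(1, n - 1)]
u = np.zeros(N)
u[interior] = np.linalg.solve(A[np.ix_(interior, interior)], b[interior])
```

Note that A is symmetric and that most of its entries are zero, as discussed above; only pairs of nodes sharing a triangle interact.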
Approximation theory. So far, we have constructed a family of discrete solutions u_h
of (6.16) for various parameters h > 0. Let now T_h be a sequence of triangulations such
that h → 0 (recall that h is the maximal diameter of the triangles in the triangulation).
It remains to understand how small the error u − u_h is. Moreover, this error will depend
on the chosen norm (we have at least three reasonable choices: ‖·‖_a, ‖·‖_{H¹}, and ‖·‖_{L²}).
Since the problem (6.16) is a Galerkin discretization, we can use (6.17) and (6.19)
and thus obtain, in the H¹₀- and a-norms, that u − u_h is bounded by a constant times
u − v for arbitrary v ∈ V_h. Finding the best approximation v of u in a finite dimensional
space for a given norm would be optimal. However, this is not an easy task in general.
We now consider a method which, up to a multiplicative constant (independent of h),
indeed provides the best approximation.
Our approximation is based on interpolation theory. Let us assume that f ∈ L²(Ω)
so that u ∈ H²(Ω) thanks to (6.14). This implies that u is a continuous function as
2 > d/2 = 1. We define the interpolation operator π_h : C(Ω̄) → V_h by

    (π_h v)(x) = Σ_{i=1}^{M_h} v(P_i) φ_i(x).    (6.26)

Note that the function v needs to be continuous since we evaluate it at the nodes P_i.
An arbitrary function v ∈ H¹₀ need not be well-defined at the points P_i (although it
almost is...).
We now want to show that π_h v − v gets smaller as h → 0. This is done by looking at
π_h v − v on each triangle T ∈ T_h and then summing the approximations over all triangles
of the triangulation. We thus need to understand the approximation at the level of one
triangle. We denote by π_T v the affine interpolant of v on a triangle T. For a triangle T
with vertices P_i, 1 ≤ i ≤ 3, it is defined as

    π_T v(x) = Σ_{i=1}^{3} v(P_i) φ_i(x),   x ∈ T,

where the φ_i(x) are affine functions defined by φ_i(P_j) = δ_ij, 1 ≤ i, j ≤ 3. We recall
that functions in H²(T) are continuous, so the above interpolant is well-defined for each
v ∈ H²(T). The various triangles in the triangulation may be very different, so we need
a method that analyzes π_T v − v on arbitrary triangles. This is done in two steps. First
we pick a reference triangle T̂ and analyze π_T̂ v − v on it in various norms of interest.
Then we define a map that transforms T̂ into any triangle of interest T. We then push
forward any properties we have found on T̂ to the triangle T through this map. The
first step goes as follows.
Lemma 6.6 Let T̂ be the triangle with vertices (0, 0), (1, 0), and (0, 1) in an orthonormal
system of coordinates of R². Let v̂ be a function in H²(T̂) and π_T̂ v̂ its affine
interpolant. Then there exists a constant C independent of v̂ such that

    ‖v̂ − π_T̂ v̂‖_{L²(T̂)} ≤ C |v̂|_{H²(T̂)},
    ‖∇(v̂ − π_T̂ v̂)‖_{L²(T̂)} ≤ C |v̂|_{H²(T̂)}.    (6.27)

We recall that |·|_{H²(T̂)} is the seminorm defined in (6.6).
Proof. The proof is based on judicious use of Taylor expansions. First, we want to
replace π_T̂ by a more convenient interpolant Π_T̂. By definition of π_T̂, we deduce that

    ‖π_T̂ v‖_{H¹} ≤ C‖v‖_{H²},

since |v(P_i)| ≤ C‖v‖_{H²} in two space dimensions. Also we verify that for all affine
interpolants Π_T̂, we have π_T̂ Π_T̂ v = Π_T̂ v, so that

    ‖v − π_T̂ v‖_{H¹} ≤ ‖v − Π_T̂ v‖_{H¹} + ‖π_T̂(Π_T̂ v − v)‖_{H¹} ≤ C‖v − Π_T̂ v‖_{H¹} + C|v|_{H²}.

This is because the second-order derivatives of Π_T̂ v vanish by construction. So we see
that it is enough to prove the results stated in the theorem for the interpolant Π_T̂. This
is based on the following Taylor expansion (with integral remainder):

    v(x) = v(x₀) + (x − x₀)·∇v(x₀) + |x − x₀|² ∫₀¹ (x̂·∇)² v((1−t)x₀ + tx)(1−t) dt,    (6.28)

where x̂ = (x − x₀)/|x − x₀|. Since we will need it later, the same type of expansion
yields for the gradient:

    ∇v(x) = ∇v(x₀) + |x − x₀| ∫₀¹ (x̂·∇)∇v((1−t)x₀ + tx) dt.    (6.29)
Both integral terms in the above expressions involve second-order derivatives of v(x).
However, they may not quite be bounded by C|v|_{H²(T̂)}. The reason is that the affine
interpolant v(x₀) + (x − x₀)·∇v(x₀) and its derivative ∇v(x₀) both involve the gradient
of v at x₀, and there is no reason for |∇v(x₀)| to be bounded by C|v|_{H²(T̂)}. This causes
some technical problems. The way around is to realize that x₀ may be seen as a variable
that we can vary freely in T̂. We therefore introduce a function φ(x) ∈ C^∞₀(T̂) such
that φ(x) ≥ 0 and ∫_T̂ φ(x)dx = 1. It is a classical real-analysis result that such a
function exists. Upon integrating (6.28) and (6.29) over T̂ in x₀, we obtain that

    v(x) = Π_T̂ v(x) + ∫_T̂ φ(x₀)|x − x₀|² ∫₀¹ (x̂·∇)² v((1−t)x₀ + tx)(1−t) dt dx₀,
    ∇v(x) = ∇Π_T̂ v(x) + ∫_T̂ φ(x₀)|x − x₀| ∫₀¹ (x̂·∇)∇v((1−t)x₀ + tx) dt dx₀,
    Π_T̂ v(x) = ∫_T̂ (v(x₀) − x₀·∇v(x₀)) φ(x₀) dx₀ + (∫_T̂ φ(x₀)∇v(x₀) dx₀) · x.    (6.30)
We verify that Π_T̂ v is indeed an affine interpolant for v ∈ H¹(T̂), in the sense (thanks
to the normalization of φ(x)) that Π_T̂ v = v when v is a polynomial of degree at most 1
(in x₁ and x₂), and that ‖Π_T̂ v‖_{H¹} ≤ C‖v‖_{H¹}.
It remains to show that the remainders v(x) − Π_T̂ v(x) and ∇(v(x) − Π_T̂ v(x)) are
indeed bounded in L²(T̂) by C|v|_{H²}. We realize that both remainders involve bounded
functions (such as |x − x₀| or 1 − t) multiplied by second-order derivatives of v. It is
therefore sufficient to show a result of the form

    ∫_T̂ ( ∫_T̂ φ(x₀) ∫₀¹ |f(x₀ + t(x − x₀))| dt dx₀ )² dx ≤ C‖f‖²_{L²(T̂)},    (6.31)
for all f ∈ L²(T̂). By the Cauchy–Schwarz inequality (stating that (f, g)² ≤ ‖f‖₂² ‖g‖₂²,
where we use g = 1), the left-hand side in the above equation is bounded (because T̂ is
a bounded domain) by

    C ∫_T̂ ∫_T̂ φ²(x₀) ∫₀¹ f²((1−t)x₀ + tx) dt dx₀ dx.
We extend f(x) by 0 outside T̂. Let us now consider the part 1/2 < t < 1 in the above
integral and perform the change of variables tx ← x. This yields

    ∫_{1/2}^{1} dt ∫_T̂ dx₀ φ²(x₀) ∫_{tT̂} (dx/t²) f²((1−t)x₀ + x) ≤ ∫_{1/2}^{1} (dt/t²) ∫_T̂ dx₀ φ²(x₀) ‖f‖²_{L²(T̂)}.
. Now the part 0 < t < 1/2 with the change of
variables (1 −t)x
0
←x
0
similarly gives a contribution of the form
1/2
0
dt
(1 −t)
2
φ
2

L
∞
(
ˆ
T)
[
ˆ
T[
f
2
L
2
(
ˆ
T)
,
where [
ˆ
T[ is the surface of
ˆ
T. These estimates show that
v −Π
ˆ
T
v
H
1
(
ˆ
T)
≤ C[v[
H
2
(
ˆ
T)
. (6.32)
This was all we needed to conclude the proof of the lemma.
Once we have the estimate on T̂, we need to obtain it for arbitrary triangles. This is
done as follows:

Lemma 6.7 Let T be a proper (not flat) triangle in the plane R². There is an
orthonormal system of coordinates such that the vertices of T are the points (0, 0), (0, a),
and (b, c), where a, b, and c are real numbers such that a > 0 and c > 0. Let us define
A_∞ = max(a, |b|, c).
Let v ∈ H²(T) and π_T v its affine interpolant. Then there exists a constant C
independent of the function v and of the triangle T such that

    ‖v − π_T v‖_{L²(T)} ≤ C A²_∞ |v|_{H²(T)},
    ‖∇(v − π_T v)‖_{L²(T)} ≤ C (A³_∞/(ac)) |v|_{H²(T)}.    (6.33)

Let us now assume that h_T is the diameter of T. Then clearly, A_∞ ≤ h_T.
Furthermore, let us assume the existence of ρ_T such that

    ρ_T h_T ≤ a ≤ h_T,   ρ_T h_T ≤ c ≤ h_T.    (6.34)

Then we have the estimates

    ‖v − π_T v‖_{L²(T)} ≤ C h²_T |v|_{H²(T)},
    ‖∇(v − π_T v)‖_{L²(T)} ≤ C ρ_T^{−2} h_T |v|_{H²(T)}.    (6.35)

We recall that C is independent of v and of the triangle T.
Proof. We want to map the estimate on T̂ in Lemma 6.6 to the triangle T. This
is achieved as follows. Let us define the linear map A : R² → R² given in Cartesian
coordinates by

    A = [ a  0
          b  c ].    (6.36)

We verify that A(T̂) = T. Let us denote by x̂ coordinates on T̂ and by x coordinates
on T. For a function v̂ on T̂, we can define its push-forward v = A_* v̂ on T, which is
equivalent to

    v̂(x̂) = v(x) with x = Ax̂, or v̂(x̂) = v(Ax̂).

By the chain rule, we deduce that

    ∂v̂/∂x̂_i = A_ik ∂v/∂x_k,

where we use the convention of summation over repeated indices. Differentiating one
more time, we deduce that

    Σ_{|α|=2} |D^α v̂|²(x̂) ≤ C A⁴_∞ Σ_{|α|=2} |D^α v|²(x),   x = Ax̂.    (6.37)

Here the constant C is independent of v and v̂. Note that the change of measures
induced by A is dx̂ = |det(A^{−1})| dx, so that

    ∫_T̂ Σ_{|α|=2} |D^α v̂|²(x̂) dx̂ ≤ A⁴_∞ |det(A^{−1})| ∫_T Σ_{|α|=2} |D^α v|²(x) dx.

This is equivalent to the statement:

    |v̂|_{H²(T̂)} ≤ C |det(A)|^{−1/2} A²_∞ |v|_{H²(T)}.    (6.38)

This tells us how the right-hand side in (6.27) may be used in estimates on T. Let us
now consider the left-hand sides. We first notice that

    (π_T̂ v̂)(x̂) = (π_T v)(x).    (6.39)

The reason is that v̂_i = v_i at the nodes and that for any polynomial p of order 1 in the
variables x̂₁, x̂₂, p(Ax̂) is also a polynomial of order 1 in the variables x₁ and x₂. This
implies that

    (v̂ − π_T̂ v̂)(x̂) = (v − π_T v)(x),   A^{−1}∇(v̂ − π_T̂ v̂)(x̂) = ∇(v − π_T v)(x);   x = Ax̂.

Using the same change of variables as above, this implies the relations

    ‖v̂ − π_T̂ v̂‖²_{L²(T̂)} = |det A^{−1}| ‖v − π_T v‖²_{L²},
    ‖A^{−1}‖²_∞ ‖∇(v̂ − π_T̂ v̂)‖²_{L²(T̂)} ≥ C |det A^{−1}| ‖∇(v − π_T v)‖²_{L²}.    (6.40)

Here, C is a universal constant, and ‖A^{−1}‖_∞ = (ac)^{−1} A_∞ (i.e., it is the maximal element
of the 2 × 2 matrix A^{−1}). The above inequalities combined with (6.38) directly give
(6.33). The estimates (6.35) are then an easy corollary.
Remark 6.8 (Form of the triangles) The first inequality in (6.35) only requires that
the diameter of T be smaller than h_T. The second estimate however requires more. It
requires that the triangles not be too flat and that all components a, b, and c be of
the same order h_T. We may verify that the constraint (6.34) is equivalent to imposing
that all the angles of the triangle be greater than a constant independent of h_T, and is
equivalent to imposing that there be an inscribed ball in T of radius of order h_T. From
now on, we impose that ρ_T ≥ ρ₀ > 0 for all T ∈ T_h, independent of the parameter h.
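The two scalings in (6.35) are easy to observe numerically. In one dimension (segments instead of triangles, so the shape constraint of this remark is vacuous), piecewise linear interpolation of a smooth function loses one power of h when one differentiates; a sketch:

```python
import numpy as np

def interp_errors(h):
    """L2 errors of v - pi_h(v) and of its derivative for v = sin on [0, 1]."""
    nodes = np.linspace(0.0, 1.0, round(1.0 / h) + 1)
    x = np.linspace(0.0, 1.0, 20001)
    vh = np.interp(x, nodes, np.sin(nodes))          # piecewise linear interpolant
    eL2 = np.sqrt(np.mean((np.sin(x) - vh) ** 2))
    xm = 0.5 * (x[:-1] + x[1:])
    dvh = np.diff(vh) / (x[1] - x[0])                # derivative of the interpolant
    eH1 = np.sqrt(np.mean((np.cos(xm) - dvh) ** 2))
    return eL2, eH1

e1 = interp_errors(0.1)
e2 = interp_errors(0.05)
print(e1[0] / e2[0], e1[1] / e2[1])   # ratios near 4 (order h^2) and near 2 (order h)
```

Halving h divides the L² error by about 4 and the error on the derivative by about 2, in agreement with the powers h² and h in (6.35).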
This concludes this section on approximation theory. Now that we have local error
estimates, we can sum them to obtain global error estimates.

A global error estimate in H¹₀(Ω). Let us come back to the general framework of
Galerkin approximations. We know from (6.19) that the error u − u_h in H¹₀ (the space
in which estimates are the easiest!) is bounded by a constant times the error u − v in
the same norm, where v is the best approximation of u (for this norm). So we obviously
have that

    ‖u − u_h‖_{H¹} ≤ C‖u − π_h u‖_{H¹}.    (6.41)

Here the constant C depends on the form a(·, ·) but not on u nor u_h. The above
right-hand side is by definition

    ( Σ_{T ∈ T_h} ∫_T (|u − π_T u|² + |∇(u − π_T u)|²) dx )^{1/2} ≤ ( Σ_{T ∈ T_h} ρ₀^{−4} h²_T |u|²_{H²(T)} )^{1/2},

thanks to (6.35). This, however, implies the following result
Theorem 6.9 Let u be the solution to (6.8), which we assume is in H²(Ω). Let u_h be
the solution of the discrete problem (6.16), where V_h is defined in (6.22) based on the
triangulation defined in (6.21). Then we have the error estimate

    ‖u − u_h‖_{H¹(Ω)} ≤ C ρ₀^{−2} h |u|_{H²(Ω)}.    (6.42)

Here, C is a constant independent of T_h, u, and u_h. However, it may depend on the
bilinear form a(u, v). We recall that ρ₀ was defined in Rem. 6.8.
The finite element method is thus first-order for the approximation of u in the H¹
norm. We may show that the order of approximation O(h) obtained in the theorem
is optimal (in the sense that h cannot be replaced by h^α with α > 1 in general).
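In one space dimension the analogue of (6.42) can be checked directly for −u″ + u = f on (0, 1) with u(0) = u(1) = 0: with f = (π² + 1) sin(πx) the exact solution is u = sin(πx). A sketch with hat functions on a uniform grid (the simple quadrature used for the load is an illustrative shortcut):

```python
import numpy as np

def fem_solve(n):
    """P1 FEM for -u'' + u = f, u(0) = u(1) = 0, exact solution sin(pi x)."""
    h = 1.0 / n
    x = np.linspace(0.0, 1.0, n + 1)
    m = n - 1                                   # number of interior nodes
    A = np.zeros((m, m))
    for i in range(m):
        A[i, i] = 2.0 / h + 4.0 * h / 6.0       # stiffness + mass, diagonal
        if i + 1 < m:
            A[i, i + 1] = A[i + 1, i] = -1.0 / h + h / 6.0
    f = (np.pi ** 2 + 1.0) * np.sin(np.pi * x[1:-1])
    u = np.zeros(n + 1)
    u[1:-1] = np.linalg.solve(A, h * f)         # simple quadrature for the load

    xf = np.linspace(0.0, 1.0, 20001)           # fine grid to measure errors
    uh = np.interp(xf, x, u)
    eL2 = np.sqrt(np.mean((uh - np.sin(np.pi * xf)) ** 2))
    xm = 0.5 * (xf[:-1] + xf[1:])
    duh = np.diff(uh) / (xf[1] - xf[0])
    eH1 = np.sqrt(np.mean((duh - np.pi * np.cos(np.pi * xm)) ** 2))
    return eL2, eH1

eL2a, eH1a = fem_solve(16)
eL2b, eH1b = fem_solve(32)
print(eL2a / eL2b, eH1a / eH1b)
```

Halving h halves the H¹ error, the first-order behavior of (6.42); the L² error is divided by about 4, a faster rate discussed next.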
A global error estimate in L²(Ω). The norm ‖·‖_a is equivalent to the H¹ norm.
So the above estimate also holds for the ‖·‖_a norm (which is defined only when a(u, v)
is symmetric). The third norm we have used is the L² norm. It is reasonable to expect
a faster convergence rate in the L² norm than in the H¹ norm. The reason is this.
In the linear approximation framework, the gradients are approximated by piecewise
constant functions, whereas the functions themselves are approximated by polynomials
of order 1 on each triangle. We should therefore have a better accuracy for the functions
than for their gradients. This is indeed the case. However, the demonstration is not
straightforward and requires using the regularity of the elliptic problem one more time,
in a somewhat curious way.
Theorem 6.10 Under the hypotheses of Theorem 6.9 and the additional hypothesis
that (6.14) holds, we have the following error estimate

    ‖u − u_h‖_{L²(Ω)} ≤ C ρ₀^{−4} h² ‖u‖_{H²(Ω)}.    (6.43)

Proof. Let w be the solution to the problem

    a(v, w) = (v, u − u_h),   for all v ∈ V.    (6.44)

Note that since a(u, v) is symmetric, this problem is equivalent to (6.8). However, when
a(u, v) is not symmetric, (6.44) is equivalent to (6.8) only after replacing a(u, v) by the
adjoint bilinear form a*(u, v) = a(v, u). The equation (6.44) is therefore called a dual
problem to (6.8), and as a consequence the proof of Theorem 6.10 we now present is
referred to as a duality argument in the literature.
In any event, the above problem (6.44) admits a unique solution thanks to Theorem
6.1. Moreover, our hypothesis that (6.14) holds implies that

    ‖w‖_{H²} ≤ C‖u − u_h‖_{L²}.    (6.45)

Using, in that order, the choice v = u − u_h in (6.44), the orthogonality condition (6.18),
the bound in (6.12) (with V = H¹₀), the estimate (6.45), and finally the error estimate
(6.42), we get:

    ‖u − u_h‖²_{L²} = a(u − u_h, w) = a(u − u_h, w − π_h w) ≤ C ‖u − u_h‖_{H¹} ‖w − π_h w‖_{H¹}
    ≤ C ‖u − u_h‖_{H¹} ρ₀^{−2} h ‖w‖_{H²} ≤ C ρ₀^{−2} h ‖u − u_h‖_{H¹} ‖u − u_h‖_{L²}
    ≤ C ρ₀^{−4} h² ‖u‖_{H²} ‖u − u_h‖_{L²}.

Dividing both sides by ‖u − u_h‖_{L²} yields (6.43). This concludes our result.
7 Introduction to Monte Carlo methods
7.1 Method of characteristics
Let us first make a small detour by the method of characteristics. Consider a hyperbolic
equation of the form

    ∂v/∂t − c ∂v/∂x + αv = β,   v(0, x) = P(x),    (7.1)

for t > 0 and x ∈ R. We want to solve (7.1) by the method of characteristics. The
main idea is to look for solutions of the form

    v(t, x(t)),    (7.2)

where x(t) is a characteristic, solving an ordinary differential equation, and where the
rate of change of ω(t) = v(t, x(t)) is prescribed along the characteristic x(t).
We observe that

    d/dt [v(t, x(t))] = ∂v/∂t + (dx/dt) ∂v/∂x,

so that the characteristic equations for (7.1) are

    dx/dt = −c,   dv/dt + αv = β.    (7.3)
The solutions are found to be

    x(t) = x₀ − ct,   v(t, x(t)) = e^{−αt} v(0, x₀) + (β/α)(1 − e^{−αt}).    (7.4)

For each (t, x), we can find a unique x₀(t, x) = x + ct, so that

    v(t, x) = e^{−αt} P(x + ct) + (β/α)(1 − e^{−αt}).    (7.5)
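A sketch that integrates the characteristic system (7.3) numerically with explicit Euler and compares the result with the closed form (7.5) (the parameter values and the Gaussian initial condition are illustrative):

```python
import numpy as np

c, alpha, beta = 1.0, 0.5, 2.0
P = lambda x: np.exp(-x ** 2)                     # initial condition

def v_exact(t, x):                                # closed form (7.5)
    return np.exp(-alpha * t) * P(x + c * t) + (beta / alpha) * (1 - np.exp(-alpha * t))

def v_charact(t, x, n=10000):
    """Integrate dv/dt = beta - alpha*v along the characteristic through (t, x)."""
    x0 = x + c * t                                # foot of the characteristic
    v, dt = P(x0), t / n
    for _ in range(n):                            # explicit Euler in time
        v += dt * (beta - alpha * v)
    return v

print(v_charact(1.0, 0.3), v_exact(1.0, 0.3))     # the two values agree to O(1/n)
```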
Many first-order linear and nonlinear equations may be solved by the method of
characteristics. The main idea, as we have seen, is to replace a partial differential
equation (PDE) by a system of ordinary differential equations (ODEs). The latter
may then be discretized if necessary and solved numerically. Since ODEs are solved,
we avoid the difficulties in PDE-based discretizations caused by the presence of high
frequencies that are not well captured by the schemes. See the section on finite difference
discretizations of hyperbolic equations.
There are many other difficulties in the numerical implementation of the method of
characteristics. For instance, we need to solve the characteristic backwards from x(t) to
x(0) and evaluate the initial condition at x(0), which may not be a grid point and thus
may require that we use an interpolation technique. Yet the method is very powerful, as
it is much easier to solve ODEs than PDEs. In the literature, PDE-based, grid-based
methods are referred to as Eulerian, whereas methods acting in the reference frame of
the particles following their characteristics (which requires solving ODEs) are referred to
as Lagrangian methods. The combination of grid-based and particle-based methods
gives rise to the so-called semi-Lagrangian methods.
7.2 Probabilistic representation
One of the main drawbacks of the method of characteristics is that it applies (only)
to scalar first-order equations. In one dimension of space, we can apply it to the wave
equation

    0 = ∂²u/∂t² − ∂²u/∂x² = (∂/∂t − ∂/∂x)(∂/∂t + ∂/∂x) u,

because of the above decomposition. However, the method does not apply to multi-dimensional
wave equations, nor to elliptic or parabolic equations such as the Laplace and
heat equations. The reason is that information does not propagate along characteristics
(curves).
For some equations, a quasi-method of characteristics may still be developed.
However, instead of being based on following one characteristic, it is based on following
an infinite number of characteristics generated from an appropriate probability measure,
and averaging over them. More precisely, instead of looking for solutions of the form
w(t, x(t)), we look for solutions of the form

    u(t, x) = E{u₀(X(t)) | X(0) = x},    (7.6)

where X(t) = X(t, ω) is a trajectory starting at x for realizations ω in a state space
Ω, and E is ensemble averaging with respect to a measure P defined on Ω. It is not the
purpose of these notes to dwell on (7.6). Let us give the most classical of examples.
Let W(t) be a standard one-dimensional Brownian motion with W(0) = 0. This is an
object defined on a space Ω (the space of continuous trajectories on [0, ∞)) with an
appropriate measure P. All trajectories are therefore continuous (at least almost surely)
but are also almost surely nowhere differentiable. The “characteristics” have thus
become rather unsmooth objects. In any event, we find that
u(t, x) = E¦u
0
(x + W(2t))¦, (7.7)
solves the heat equation
∂u
∂t
−
∂
2
u
∂x
2
= 0, u(0, x) = u
0
(x). (7.8)
This may be shown by e.g. Itô calculus. Very briefly, let g(W(2t)) = u0(x + W(2t)).
Then, we find that

    dg = (∂g/∂x) dW(2t) + (1/2)(∂²g/∂x²) dW(2t) dW(2t)
       = (∂g/∂x) dW(2t) + (∂²g/∂x²) dt,

since dW(2t) dW(2t) = d(2t). Using E{W(2t)} = 0, we obtain after ensemble averaging
that (7.7) solves (7.8). We thus see that u(t, x) is an infinite superposition (E is an
integration) of pieces of information carried along the trajectories. We also see how this
representation gives rise to the Monte Carlo method. We cannot simulate an infinite
number of trajectories. However, we can certainly simulate a finite, large, number of
them and perform the averaging. This is the basis for the Monte Carlo method. Note
that in (7.7), simulating the characteristic is easy as we know the law of W(t) (a centered
Gaussian variable with variance equal to t). In more general situations, where the
diffusion coefficient depends on position, W(2t) is replaced by some process X(t; x)
which depends in a more complicated way on the location of the origin x. Solving for
the trajectories then becomes more difficult.
7.3 Random Walk and Finite Diﬀerences
Brownian motion has a relatively simple discrete version, namely the random walk.
Consider a process S_k such that S_0 = 0 and S_{k+1} is equal to S_k + h with probability
1/2 and equal to S_k − h also with probability 1/2.
We may decompose this random variable as follows. Let us assume that we have
N (time) steps and define N independent random variables x_k, 1 ≤ k ≤ N, such that
x_k = ±1 with probability 1/2. Then,

    S_n = h Σ_{k=1}^n x_k,

is our random walk.
Let us relate all this to Bernoulli random variables. Define y_k for 1 ≤ k ≤ N, N
independent random variables equal to 0 or 1 with probability 1/2, and let
Σ_n = Σ_{k=1}^n y_k. Each sequence of n 0's and 1's for the y_k has probability 2^{−n}.
The number of sequences such that Σ_n = m is (n choose m) by definition (number of
possibilities of choosing m balls among n ≥ m balls). As a consequence, using
x_k = 2y_k − 1, we see that

    p_m = P(S_n = hm) = 2^{−n} (n choose (n+m)/2).                           (7.9)
Exercise 7.1 Check this.

Note that S_n/h is even when n is even and odd when n is odd, so that (n+m)/2 is
always an integer. We thus have an explicit expression for the law of S_n. Moreover, we
easily find that

    E{S_n} = 0,    Var S_n = E{S_n²} = nh².                                   (7.10)

The central limit theorem shows that S_n converges to W(2t) if n → ∞ and h → 0
such that nh² = 2t. Moreover, S_{αn} also converges to W(2αt) in an appropriate sense,
so that there is solid mathematical background to obtain that S_n is indeed a bona fide
discretization of Brownian motion. This is Donsker's invariance principle.
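Formula (7.9) and the moments (7.10) are easy to verify by brute force for small n. A sketch (h = 1 and n = 8 are my choices) that enumerates all 2^n sign sequences:

```python
import math
from itertools import product

# Brute-force check of the law (7.9) and moments (7.10) of S_n, with h = 1.
def p_m(n, m):
    if (n + m) % 2 != 0 or abs(m) > n:
        return 0.0
    return math.comb(n, (n + m) // 2) / 2**n

n = 8
counts = {}
for flips in product((-1, 1), repeat=n):  # all 2**n sequences of x_k = +-1
    s = sum(flips)
    counts[s] = counts.get(s, 0) + 1
# the empirical law of S_n agrees with (7.9)
assert all(abs(c / 2**n - p_m(n, m)) < 1e-12 for m, c in counts.items())

mean = sum(m * p_m(n, m) for m in range(-n, n + 1))
var = sum(m**2 * p_m(n, m) for m in range(-n, n + 1))
print(mean, var)  # E{S_n} = 0 and Var S_n = n h**2 = 8
```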
How does one use all of this to solve a discretized version of the heat equation?
Following (7.7), we can define

    U_j^n = E{U^0_{j + h⁻¹ S_n}},                                             (7.11)

where U_j^0 is an approximation of u0(jh) as before, and where U_j^n is an approximation
of u(n∆t, jh). What is it that we have defined? For this, we need to use the idea of
conditional expectations, which roughly says that averaging over two random variables
may be written as an averaging over the first random variable (this is thus still a random
variable) followed by an averaging over the second random variable. In our context:

    U_j^n = E{U^0_{j + x_1 + ... + x_n}} = E{ E{U^0_{j + x_1 + ... + x_n} | x_n} }
          = (1/2) E{U^0_{j+1+x_1+...+x_{n−1}}} + (1/2) E{U^0_{j−1+x_1+...+x_{n−1}}}
          = (1/2) (U^{n−1}_{j+1} + U^{n−1}_{j−1}).
In other words, U_j^n solves the finite difference equation with λ = 1/2 (and ONLY this
value of λ, the way the procedure is constructed). Recall that

    λ = D∆t/h² = 1/2,                                                         (7.12)

where D is the diffusion coefficient. In other words, the procedure (7.11) gives us an
approximation of

    ∂u/∂t − D ∂²u/∂x² = 0,    u(0, x) = u0,

when ∆t and h are chosen so that (7.12) holds.
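The equivalence between the walk representation (7.11) and the λ = 1/2 scheme can be checked numerically. A sketch (my simplifying assumptions: h = 1 and a periodic grid of 32 points, so the example stays finite; by translation invariance the walks are started at j = 0):

```python
import numpy as np

# One step of the lambda = 1/2 scheme: U^{n+1}_j = (U^n_{j+1} + U^n_{j-1})/2.
def fd_step(u):
    return 0.5 * (np.roll(u, -1) + np.roll(u, 1))

# Estimate U^n at j = 0 by averaging u0 over endpoints of random walks.
def walk_estimate(u0, n_steps, n_walks=200_000, seed=1):
    rng = np.random.default_rng(seed)
    steps = rng.choice((-1, 1), size=(n_walks, n_steps))
    endpoints = steps.sum(axis=1) % len(u0)  # periodic wrap of the walk
    return u0[endpoints].mean()

J = 32
u0 = np.cos(2 * np.pi * np.arange(J) / J)
u_fd = u0.copy()
for _ in range(4):
    u_fd = fd_step(u_fd)
print(u_fd[0], walk_estimate(u0, 4))  # the two values should be close
```

The walk average is a noisy estimate of the deterministic value; the gap shrinks like the inverse square root of the number of walks, as discussed in the next section.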
7.4 Monte Carlo method
Solving for U_j^n thus requires that we construct S_n (by flipping n ±1 coins) and
evaluate X := U^0_{j + h⁻¹ S_n}. If we do it once, how accurate are we? The answer is:
not very. Let us look at the random variable X. We know that E{X} = U_j^n, so that X
is an unbiased estimator of U_j^n. However, its variance is often quite large. Using (7.9),
we find that P(X = U^0_{j+m}) = p_m. As a consequence, we find that:

    σ² = Var X = Σ_{|k|≤n} p_k (U^0_{j−k})² − ( Σ_{|l|≤n} p_l U^0_{j−l} )².   (7.13)
Assume that U^0_j = U independent of j. Then σ² = 0 and X gives the right answer
all the time. However, when U^0 is highly oscillatory, say U^0_l = (−1)^l, then
Σ_l p_l U^0_{j−l} will be close to 0 by cancellations. This shows that σ² will be close to
Σ_k p_k (U^0_{j−k})², which is comparable to, or even much larger than, E{X}. So X has
a very large variance and is therefore not a very good estimator of U_j^n.
The (only) way to make the estimator better is to use the law of large numbers by
repeating the experiment M times and averaging over the M results. More precisely,
let X_k for 1 ≤ k ≤ M be M independent random variables with the same law as X.
Define

    S_M = (1/M) Σ_{k=1}^M X_k.                                                (7.14)
Then we verify that E{S_M} = U_j^n as before. The variance of S_M is however
significantly smaller:

    Var S_M = E{ ( (1/M) Σ_{k=1}^M (X_k − U_j^n) )² }
            = (1/M²) E{ Σ_{k=1}^M (X_k − U_j^n)² } = (Var X)/M,

since the variables are independent. As a consequence, S_M is an unbiased estimator of
U_j^n with a standard deviation of order O(M^{−1/2}). This is the speed of convergence
of Monte Carlo. In order to obtain an accuracy of ε on average, we need to calculate on
the order of ε^{−2} realizations.
How do Monte Carlo and deterministic calculations compare when computing U_j^n (at
a final time n∆t and a given position j)? Let us assume that we solve a d-dimensional
heat equation, which involves a d-dimensional random walk, which is nothing but d
one-dimensional random walks, one in each direction. The cost of each trajectory is of
order n ≈ (∆t)^{−1} since we need to flip n coins. To obtain an O(h²) = O(∆t) accuracy,
we thus need M = (∆t)^{−2} realizations, so that the final cost is O((∆t)^{−3}).
Deterministic methods require that we solve n time steps using a grid of mesh size
O(h) in each direction, which means O(h^{−d}) discretization points. The total cost of
an explicit method is therefore O((∆t h^d)^{−1}) = O((∆t)^{−1−d/2}). We thus see that
Monte Carlo is more efficient than finite differences (or equally efficient) when

    (∆t)^{−3} ≤ (∆t)^{−1−d/2},    i.e.,  d ≥ 4.                               (7.15)
For sufficiently large dimensions, Monte Carlo methods do not suffer the curse of
dimensionality characterizing deterministic methods and thus become more efficient.
Monte Carlo methods are also much more versatile, as they are based on solving "random
ODEs" (stochastic ODEs), with which it is easier to account for e.g. complicated
geometries. Finite differences however provide the solution everywhere, unlike Monte
Carlo methods, which need to use different trajectories to evaluate solutions at different
points.
Moreover, the Monte Carlo method is much easier to parallelize. For the finite
difference method, it is difficult to use a parallel architecture (though this is a well-studied
problem) and obtain algorithms whose running time is inversely proportional to the
number of available processors. In contrast, Monte Carlo methods are easily parallelized
since the M random realizations of X are computed independently. So with a (very
large) number of processors equal to M, the running time of the algorithm is of order
(∆t)^{−1}, which corresponds to the running time of the finite difference algorithm on a
grid with a small finite number of grid points. Even with a large number of processors,
it is extremely difficult to attain this level of running speed using the deterministic finite
difference algorithm.
Let us conclude with the following remark. The convergence of the Monte Carlo
algorithm in M^{−1/2} is rather slow. Many algorithms have been developed to accelerate
convergence. These algorithms are called variance-reduction algorithms and are based
on estimating the solution U_j^n by using non-physical random processes with smaller
variance than the classical Monte Carlo method described above.
1 Ordinary Differential Equations

An Ordinary Differential Equation is an equation of the form

    dX/dt(t) = f(t, X(t)),    t ∈ (0, T),    X(0) = X0.                       (1.1)
We have existence and uniqueness of the solution when f(t, x) is continuous and
Lipschitz with respect to its second variable, i.e., there exists a constant C independent
of t, x, y such that

    ‖f(t, x) − f(t, y)‖ ≤ C‖x − y‖.                                           (1.2)

If the constant is independent of time for t > 0, we can even take T = +∞. Here X and
f(t, X(t)) are vectors in Rⁿ for n ∈ N and ‖·‖ is any norm in Rⁿ.

Remark 1.1 Throughout the text, we use the symbol "C" to denote an arbitrary
constant. The "value" of C may change at every instance. For example, if u(t) is a
function of t bounded by 2, we will write

    Cu(t) ≤ C,                                                                (1.3)
even if the two constants "C" on the left- and right-hand sides of the inequality may be
different. When the value "2" is important, we will replace the above right-hand side by
2C. In our convention however, 2C is just another constant "C".

Higher-order differential equations can always be put in the form (1.1), which is quite
general. For instance the famous harmonic oscillator is the solution to

    x'' + ω² x = 0,    ω² = k/m,                                              (1.4)
with given initial conditions x(0) and x'(0), and can be recast as

    d/dt [X1; X2](t) = [0, 1; −ω², 0] [X1; X2](t),
    [X1; X2](0) = [x(0); x'(0)].                                              (1.5)
Exercise 1.1 Show that the implied function f in (1.5) is Lipschitz.

The Lipschitz condition is quite important to obtain uniqueness of the solution. Take
for instance the case n = 1 and the function x(t) = t². We easily obtain that

    x'(t) = 2√(x(t)),    t ∈ (0, +∞),    x(0) = 0.
However, the solution x̃(t) ≡ 0 satisfies the same equation, which implies non-uniqueness
of the solution. This remark is important in practice: when an equation admits several
solutions, any sound numerical discretization is likely to pick one of them, but not
necessarily the solution one is interested in.

Exercise 1.2 Show that the function f above is not Lipschitz.
How do we discretize (1.1)? There are two key ingredients in the derivation of a
numerical scheme: first to remember the definition of a derivative,

    df/dt(t) = lim_{h→0} (f(t+h) − f(t))/h = lim_{h→0} (f(t) − f(t−h))/h
             = lim_{h→0} (f(t+h) − f(t−h))/(2h),

and second (which is quite related) to remember Taylor expansions to approximate a
function in the vicinity of a given point:

    f(t+h) = f(t) + h f'(t) + (h²/2) f''(t) + ... + (hⁿ/n!) f⁽ⁿ⁾(t)
             + (hⁿ⁺¹/(n+1)!) f⁽ⁿ⁺¹⁾(t + s),                                   (1.6)
for 0 ≤ s ≤ h, assuming the function f is sufficiently regular. With this in mind, let us
fix some ∆t > 0, which corresponds to the size of the interval between successive points
where we try to approximate the solution of (1.1). We then approximate the derivative
by

    dX/dt(t) ≈ (X(t + ∆t) − X(t))/∆t.                                         (1.7)
It remains to approximate the right-hand side in (1.1). Since X(t) is supposed to be
known by the time we are interested in calculating X(t + ∆t), we can choose f(t, X(t))
for the right-hand side. This gives us the so-called explicit Euler scheme

    (X(t + ∆t) − X(t))/∆t = f(t, X(t)).                                       (1.8)
Let us denote by Tn = n∆t for 0 ≤ n ≤ N such that N∆t = T. We can recast (1.8) as

    X_{n+1} = X_n + ∆t f(n∆t, X_n),    X_0 = X(0).                            (1.9)
This is a fully discretized equation that can be solved on the computer since f(t, x) is
known. Another choice for the right-hand side in (1.1) is to choose f(t + ∆t, X(t + ∆t)).
This gives us the implicit Euler scheme

    X_{n+1} = X_n + ∆t f((n+1)∆t, X_{n+1}),    X_0 = X(0).                    (1.10)
Notice that this scheme necessitates solving an equation at every step, namely finding
the inverse of the function x → x − ∆t f((n+1)∆t, x). However, we will see that this
scheme is more stable than its explicit relative in many practical situations and is
therefore often a better choice.

Exercise 1.3 Work out the explicit and implicit schemes in the case f(t, x) = λx for
some real number λ ∈ R. Write explicitly X_n for both schemes and compare with the
exact solution X(n∆t). Show that in the stable case (λ < 0), the explicit scheme can
become unstable and that the implicit scheme cannot. Comment.

We now need to justify that the schemes we have considered are good approximations
of the exact solution and understand how good (or bad) our approximation is. This
requires analyzing the behavior of X_n as ∆t → 0, or equivalently as the number of points
N used to discretize a fixed interval of time (0, T) tends to infinity. In many problems,
the strategy to do so is the following.

First we need to assume that the exact problem (1.1) is properly posed, i.e., roughly
speaking, has good stability properties: when one changes the initial conditions a little,
one expects the solution not to change too much in the future. Let us introduce the
solution function g(t, X; ∆T) defined by

    g(t, X) = X(t + ∆T),                                                      (1.11)

where

    dX/dt(t + τ) = f(t + τ, X(t + τ)),    0 ≤ τ ≤ ∆t,    X(t) = X.            (1.12)

This is nothing but the function that maps a solution of the ODE at time t to the
solution at time t + ∆t. That the equation is properly posed (a.k.a. well-posed) in the
sense of Hadamard means that

    ‖g(t, X) − g(t, Y)‖ ≤ (1 + C∆t)‖X − Y‖,                                   (1.13)

where the constant C is independent of t, X and Y. The meaning of this inequality is
the following: we do not want the difference ‖X(t + ∆t) − Y(t + ∆t)‖ to be more than
(1 + C∆t) times what it was at time t (i.e., ‖X − Y‖ = ‖X(t) − Y(t)‖) during the time
interval ∆t.

Similarly, we introduce the approximate solution function g_∆t(t, X) for t = n∆t by

    X_{n+1} = g_∆t(n∆t, X_n)    for 0 ≤ n ≤ N − 1.

For instance, for the explicit Euler scheme, we have

    g_∆t(n∆t, X_n) = X_n + ∆t f(n∆t, X_n).

For the implicit Euler scheme, we have

    g_∆t(n∆t, X_n) = (x → x − ∆t f((n+1)∆t, x))⁻¹ (X_n),

assuming that this inverse function is well-defined.

A second ingredient is to show that the scheme is stable, i.e., that our solution X_n
remains bounded independently of n and ∆t, that is to say

    ‖X_n‖ ≤ C(1 + ‖X_0‖),                                                     (1.14)

where C is independent of X_0 and 0 ≤ n ≤ N. Because the explicit scheme is much
simpler to analyze, we work with it from now on. Notice from the Lipschitz property of
f (with x = x and y = 0) that the explicit Euler scheme satisfies

    ‖g_∆t(n∆t, X)‖ ≤ (1 + C∆t)‖X‖ + C∆t,                                      (1.15)

whence

    ‖X_{n+1}‖ = ‖g_∆t(n∆t, X_n)‖ ≤ (1 + C∆t)‖X_n‖ + C∆t.                      (1.16)

This implies by induction that

    ‖X_n‖ ≤ (1 + C∆t)ⁿ ‖X_0‖ + Σ_{k=0}^{n} (1 + C∆t)ᵏ C∆t
          ≤ (1 + C∆t)ᴺ (N∆t C + ‖X_0‖),

for 1 ≤ n ≤ N. Now N∆t = T and (1 + CT/N)ᴺ ≤ e^{CT}, which is bounded
independently of ∆t. This implies the stability of the explicit Euler scheme.

The third ingredient is to show that our scheme is consistent with the equation (1.1),
i.e., that locally it is a good approximation of the real equation. This consists of showing
that

    ‖g(n∆t, X) − g_∆t(n∆t, X)‖ ≤ C∆t^{m+1}(1 + ‖X‖),                          (1.17)

for some positive m, the order of the scheme. Notice that this is a local estimate:
assuming that we know the initial condition at time n∆t, we want to make sure that the
exact and approximate solutions at time (n+1)∆t are separated by an amount at most
of order ∆t^{m+1}. Such a result is usually obtained by using Taylor expansions.

Let us consider the explicit Euler scheme and assume that the solution X(t) of the
ODE is sufficiently smooth so that the solution with initial condition at n∆t given by
X = X(n∆t) satisfies

    X((n+1)∆t) = X(n∆t) + ∆t X'(n∆t) + (∆t²/2) X''(n∆t + s),                  (1.18)

for some 0 ≤ s ≤ ∆t, where ‖X''(n∆t + s)‖ ≤ C(1 + ‖X‖). Notice that the above
property is a regularity property of the equation, not of the discretization. This implies
that

    X((n+1)∆t) = g(n∆t, X(n∆t))
               = X(n∆t) + ∆t f(n∆t, X(n∆t)) + (∆t²/2) X''(n∆t + s)
               = g_∆t(n∆t, X(n∆t)) + (∆t²/2) X''(n∆t + s).

This implies (1.17) and the consistency of the scheme with m = 1 (the explicit Euler
scheme is first-order). Another way of interpreting the above result is as follows: the
exact solution X((n+1)∆t) locally solves the discretized equation (whose solution is
g_∆t(n∆t, X(n∆t)) at time (n+1)∆t) up to a small term, here of order ∆t². These are
sufficient conditions (though not necessary ones) to obtain convergence, and they can be
shown rigorously when the function f(t, x) is of class C¹ for instance. We prove
convergence of the explicit Euler scheme only when the above regularity conditions are
met.

Exercise 1.4 Consider the harmonic oscillator (1.5). Work out the functions g and g_∆t
for the explicit and implicit Euler schemes. Show that both schemes are stable (i.e.,
(1.16) is satisfied) and that both schemes are consistent and of order m = 1 (i.e., (1.17)
is satisfied with m = 1).
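The stability contrast raised in Exercise 1.3 is visible in a few lines. A sketch (in Python rather than the notes' Matlab; λ = −10 and the step sizes are my choices):

```python
# Explicit and implicit Euler for dX/dt = lam*X with lam < 0. The explicit
# scheme blows up once |1 + lam*dt| > 1; the implicit scheme is stable for
# every dt > 0, since |1/(1 - lam*dt)| < 1 whenever lam < 0.
def explicit_euler(lam, dt, n):
    x = 1.0
    for _ in range(n):
        x += dt * lam * x        # X_{n+1} = (1 + lam*dt) X_n
    return x

def implicit_euler(lam, dt, n):
    x = 1.0
    for _ in range(n):
        x /= 1.0 - dt * lam      # X_{n+1} = X_n / (1 - lam*dt)
    return x

lam = -10.0
for dt, n in ((0.3, 16), (0.01, 500)):  # 1 + lam*dt = -2 (unstable) or 0.9
    print(dt, explicit_euler(lam, dt, n), implicit_euler(lam, dt, n))
```

The exact solution e^{λt} decays, yet the explicit iterates grow geometrically when ∆t is too large; the implicit iterates decay for any ∆t.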
Once we have well-posedness of the exact equation as well as stability and consistency
of the scheme (for a scheme of order m), we obtain convergence as follows. Let us denote
the error between the exact solution and the discretized solution by

    ε_n = ‖X(n∆t) − X_n‖.                                                     (1.19)

We have

    ‖X((n+1)∆t) − X_{n+1}‖ = ‖g(n∆t, X(n∆t)) − g_∆t(n∆t, X_n)‖
        ≤ ‖g(n∆t, X(n∆t)) − g(n∆t, X_n)‖ + ‖g(n∆t, X_n) − g_∆t(n∆t, X_n)‖
        ≤ (1 + C∆t)‖X(n∆t) − X_n‖ + C∆t^{m+1}(1 + ‖X_n‖)
        ≤ (1 + C∆t)‖X(n∆t) − X_n‖ + C∆t^{m+1}.

The first line is the definition of the operators. The second line uses the triangle
inequality ‖X + Y‖ ≤ ‖X‖ + ‖Y‖, which holds for every norm by definition. The third
line uses the well-posedness of the ODE and the consistency of the scheme. The last line
uses the stability of the scheme. Recall that C is here the notation for a constant that
may change every time the symbol is used. In other words, we have

    ε_{n+1} ≤ C∆t^{m+1} + (1 + C∆t)ε_n.

Since ε_0 = 0, we easily deduce from the above relation (for instance by induction) that

    ε_n ≤ C∆t^{m+1} Σ_{k=0}^{n} (1 + C∆t)ᵏ ≤ C∆t^{m+1} n,                     (1.20)

since (1 + C∆t)ᵏ ≤ C uniformly for 0 ≤ k ≤ N. We deduce from the above relation that
ε_n ≤ C∆tᵐ, since n ≤ N = T∆t⁻¹. This shows that the explicit Euler scheme is of
order m = 1: for all intermediate time steps n∆t, the error between X(n∆t) and X_n is
bounded by a constant times ∆t.

Exercise 1.5 Program in Matlab the explicit and implicit Euler schemes for the
harmonic oscillator (1.5). Find numerically the order of convergence of both schemes.

Higher-Order schemes. We have seen that the Euler scheme is first-order accurate,
i.e., as ∆t goes to 0, the discretized solution converges to the exact one. For problems
that need to be solved over long intervals of time, this might not be accurate enough.
The obvious solution is to create a more accurate discretization. Here is one way of
constructing a second-order scheme. To simplify the presentation, we assume that
X ∈ R, i.e., we consider a scalar equation.
We first realize that m in (1.17) has to be 2 instead of 1. We also realize that the local
approximation to the exact solution must be compatible to the right order with the
Taylor expansion (1.18), which we push one step further to get

    X((n+1)∆t) = X(n∆t) + ∆t X'(n∆t) + (∆t²/2) X''(n∆t)
                 + (∆t³/6) X'''(n∆t + s),                                     (1.21)

for some 0 ≤ s ≤ ∆t. Now from the derivation of the first-order scheme, we know that
what we need is an approximation such that

    X((n+1)∆t) = g_∆t⁽²⁾(n∆t, X) + O(∆t³).

An obvious choice is then to choose

    g_∆t⁽²⁾(n∆t, X) = X(n∆t) + ∆t X'(n∆t) + (∆t²/2) X''(n∆t).

This is however not explicit since X' and X'' are not known yet (only the initial
condition X = X(n∆t) is known when we want to construct g_∆t⁽²⁾(n∆t, X)). However,
as previously, we can use the equation that X(t) satisfies and get that

    X'(n∆t) = f(n∆t, X),
    X''(n∆t) = ∂f/∂t(n∆t, X) + ∂f/∂x(n∆t, X) f(n∆t, X),

by applying the chain rule to (1.1) at t = n∆t. Assuming that we know how to
differentiate the function f, we can then obtain the second-order scheme by defining

    g_∆t⁽²⁾(n∆t, X) = X + ∆t f(n∆t, X)
        + (∆t²/2) [ ∂f/∂t(n∆t, X) + ∂f/∂x(n∆t, X) f(n∆t, X) ].                (1.22)

Clearly by construction, this scheme is consistent and second-order accurate. The
regularity that we now need from the exact equation (1.1) is that
‖X'''(n∆t + s)‖ ≤ C(1 + ‖X‖). We shall again assume that our equation is nice enough
so that the above constraint holds (this can be shown when the function f(t, x) is of
class C² for instance).

Exercise 1.6 It remains to show that our scheme g_∆t⁽²⁾(n∆t, X) is stable. This is left
as an exercise. Hint: show that ‖g_∆t⁽²⁾(n∆t, X)‖ ≤ (1 + C∆t)‖X‖ + C∆t for some
constant C independent of 1 ≤ n ≤ N and X.

Exercise 1.7 Work out a second-order scheme for the harmonic oscillator (1.5) and
show that all the above constraints are satisfied.

Exercise 1.8 Implement in Matlab the second-order scheme derived in the previous
exercise. Show the order of convergence numerically and compare the solutions with the
first-order scheme. Comment.
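The Taylor scheme (1.22) is easy to try on a test problem where the derivatives of f are known in closed form. A sketch (f(t, x) = −x is my choice, so that ∂f/∂t = 0 and ∂f/∂x = −1, and the exact solution is e^{−t}):

```python
import math

# Second-order Taylor scheme (1.22) applied to dX/dt = -X, X(0) = 1.
def solve(dt, T):
    x, n = 1.0, round(T / dt)
    for _ in range(n):
        fx = -x
        # g^(2): X + dt*f + (dt**2/2)*(f_t + f_x*f), with f_t = 0, f_x = -1
        x = x + dt * fx + 0.5 * dt**2 * (0.0 + (-1.0) * fx)
    return x

exact = math.exp(-1.0)
e1 = abs(solve(0.1, 1.0) - exact)
e2 = abs(solve(0.05, 1.0) - exact)
print(e1 / e2)  # a ratio near 4 confirms second-order convergence
```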
Runge-Kutta methods. The main drawback of the previous second-order scheme is
that it requires knowing derivatives of f. This is undesirable in practice. However, the
derivatives of f can also be approximated by finite differences of the form (1.7). This is
the basis for the Runge-Kutta methods.

Exercise 1.9 Show that the following schemes are second-order:

    g_∆t⁽²⁾(n∆t, X) = X + ∆t f(n∆t + ∆t/2, X + (∆t/2) f(n∆t, X)),
    g_∆t⁽²⁾(n∆t, X) = X + (∆t/2) [ f(n∆t, X) + f((n+1)∆t, X + ∆t f(n∆t, X)) ],
    g_∆t⁽²⁾(n∆t, X) = X + (∆t/4) [ f(n∆t, X)
                          + 3 f((n + 2/3)∆t, X + (2∆t/3) f(n∆t, X)) ].

These schemes are called the Midpoint method, the Modified Euler method, and Heun's
method, respectively.

Exercise 1.10 Implement in Matlab the Midpoint method of the preceding exercise to
solve the harmonic oscillator problem (1.5). Compare with previous discretizations.

All the methods we have seen so far are one-step methods, in the sense that X_{n+1}
only depends on X_n. Multistep methods are generalizations where X_{n+1} depends on
X_{n−k} for k = 0, ..., m with m ∈ N. Classical multistep methods are the
Adams-Bashforth and Adams-Moulton methods. The second- and fourth-order
Adams-Bashforth methods are given by

    X_{n+1} = X_n + (∆t/2) ( 3 f(n∆t, X_n) − f((n−1)∆t, X_{n−1}) ),
    X_{n+1} = X_n + (∆t/24) ( 55 f(n∆t, X_n) − 59 f((n−1)∆t, X_{n−1})
                              + 37 f((n−2)∆t, X_{n−2}) − 9 f((n−3)∆t, X_{n−3}) ),   (1.23)

respectively.

Exercise 1.11 Implement in Matlab the Adams-Bashforth methods in (1.23) to solve
the harmonic oscillator problem (1.5). Compare with previous discretizations.
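The second-order accuracy of the Midpoint method is easy to observe on the harmonic oscillator. A sketch (in Python rather than Matlab; ω = 1, x(0) = 1, x'(0) = 0 are my choices, so the exact solution is x(t) = cos t):

```python
import math

# Midpoint method from Exercise 1.9 for the system (1.5), state X = (x, x').
def f(t, X):
    x, v = X
    return (v, -x)

def midpoint_step(t, X, dt):
    k = f(t, X)                                   # slope at (t, X)
    mid = (X[0] + 0.5 * dt * k[0], X[1] + 0.5 * dt * k[1])
    km = f(t + 0.5 * dt, mid)                     # slope at the midpoint
    return (X[0] + dt * km[0], X[1] + dt * km[1])

def solve(dt, T):
    X, t = (1.0, 0.0), 0.0
    for _ in range(round(T / dt)):
        X = midpoint_step(t, X, dt)
        t += dt
    return X[0]

err = lambda dt: abs(solve(dt, 1.0) - math.cos(1.0))
print(err(0.01) / err(0.005))  # a ratio near 4 confirms second order
```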
2 Finite Differences for Parabolic Equations

2.1 One dimensional Heat equation
Equation. Let us consider the simplest example of a parabolic equation,

    ∂u/∂t(t, x) − ∂²u/∂x²(t, x) = 0,    x ∈ R,  t ∈ (0, T),
    u(0, x) = u0(x),                    x ∈ R.                                (2.1)

This is the one-dimensional (in space) heat equation. Note that the equation is also
linear, in the sense that the map u0(x) → u(t, x) at a given time t > 0 is linear. In the
rest of the course, we will mostly be concerned with linear equations. This equation
actually admits an exact solution, given by

    u(t, x) = (4πt)^{−1/2} ∫_{−∞}^{∞} exp( −(x − y)²/(4t) ) u0(y) dy.         (2.2)
Exercise 2.1 Show that (2.2) solves (2.1). Discretization. We now want to discretize (2.1). This time, we have two variables to discretize, time and space. The ﬁnite diﬀerence method consists of (i) introducing discretization points Xn = n∆x for n ∈ Z and ∆x > 0, and Tn = n∆t for 0 ≤ n ≤ N = T /∆t and Dt > 0; (ii) approximating the solution u(t, x) by Ujn ≈ u(Tn , Xj ), 1 ≤ n ≤ N, j ∈ Z. (2.3)
Notice that in practice, j is ﬁnite. However, to simplify the presentation, we assume here that we discretize the whole line x ∈ R. The time variable is very similar to what we had for ODEs: knowing what happens at Tn = n∆t, we would like to get what happens at Tn+1 = (n + 1)∆t. We therefore introduce the notation Ujn+1 − Ujn . (2.4) ∂t Ujn = ∆t ∂ The operator ∂t plays the same role for the series Ujn as does for the function u(t, x). ∂t The spatial variable is quite diﬀerent from the time variable. There is no privileged direction of propagation (we do not know a priori if information comes from the left or the right) as there is for the time variable (we know that information at time t will allow us to get information at time t + ∆t). This intuitively explains why we introduce two types of discrete approximations for the spatial derivative: ∂x Ujn =
n Uj+1 − Ujn , ∆x
and
∂ x Ujn =
n Ujn − Uj−1 ∆x
(2.5)
These are the forward and backward ﬁnite diﬀerence quotients, respectively. Now in (2.1) we have a secondorder spatial derivative. Since there is a priori no reason to assume that information comes from the left or the right, we choose
n n Uj+1 − 2Ujn + Uj−1 ∂2u (Tn , Xj ) ≈ ∂ x ∂x Ujn = ∂x ∂ x Ujn = . ∂x2 ∆x2
(2.6)
9
Exercise 2. j∆x) as advertised)?.8) Exercise 2. j∈Z (2.10) j∈Z j∈Z Since u(t.1).9) Stability means that there exists a constant C independent of ∆x. Implement the algorithm in Matlab. Now straightforward calculations show that Uj1 = λ (−1)j+1 + (−1)j−1 + (1 − 2λ)(−1)j ε = (1 − 4λ)(−1)j ε.2)). (2. When λ > 1/2.6) is centered: it is symmetric with respect to what comes from the left and what comes from the right. it has no chance to converge.3 Check this. and t = 10 assuming that the initial condition is u0 (x) = 1 on (−1. j∆x))? Exercise 2.7) ∆t ∆x2 This is clearly an approximation of (2.2 Check the equalities in (2. x) remains bounded for all times (see (2. it is clear that if the scheme is unstable.1. 1) and u0 (x) = 0 elsewhere.01.1) to compute u(t. the scheme will be unstable! To show this we choose some initial conditions that oscillate very rapidly: Uj0 = (−1)j ε. ε > 0.4 Discretizing the interval (−10. Our ﬁrst concern will be to ensure that the scheme is stable.e. use a numerical approximation of (2. x) for t = 0. t = 1. There are then several questions that come to mind: is the scheme stable (i. Stability. we deﬁne U ∞ = sup Uj . Here ε is a constant that can be chosen as small as one wants. The ﬁnite diﬀerence scheme then reads n n Ujn+1 − Ujn Uj+1 − 2Ujn + Uj−1 − = 0. For series U = {Uj }j∈Z deﬁned on the grid j ∈ Z by Uj .6). t = 0. does it converge (is Ujn really a good approximation of u(n∆t. ∆x2 (2. we calculate Ujn+1 for j ∈ Z for 0 ≤ n ≤ N − 1. (2. and how fast does it converge (what is the error between Ujn and u(n∆t. This is also the simplest scheme called the explicit Euler scheme. ∆t and 1 ≤ n ≤ N = T /∆t such that U n ∞ = sup Ujn  ≤ C U 0 ∞ = C sup Uj0 . 10 . 10) with N points. λ= ∆t . We easily verify that Uj0 ∞ = ε. The norm we choose to measure stability is the supremum norm. Notice that this can be recast as n n Ujn+1 = λ(Uj+1 + Uj−1 ) + (1 − 2λ)Ujn . does Ujn remains bounded for n = N independently of the choices of ∆t and ∆x)?. 
The numerical procedure is therefore straightforward: knowing Ujn for j ∈ Z. We actually have to consider two cases according as λ ≤ 1/2 or λ > 1/2. Notice that the discrete diﬀerentiation in (2.
Recall that −∞ e−x dx = π. This is done by assuming that we have a local consistency of the discrete scheme with the equation. This behavior has to be contrasted with the case λ ≤ 1/2. what this means is that any small ﬂuctuation in the initial data (for instance caused by computer roundoﬀ) will be ampliﬁed by the scheme exponentially fast as in (2. For instance. we deduce from (2. We therefore use the three same main ingredients as for ODEs: stability. we have to show that the scheme converges. Notice that ∆t must be quite small (of order ∆x2 ) in order to obtain stability. whence their practical interest. So obviously.By induction this implies that Ujn = (1 − 4λ)n (−1)j ε. 11 . √ 2 ∞ Exercise 2. for n = N = T /∆t.1) also satisﬁes this property. whereas implicit schemes will be unconditionally stable. we have that U N ∞ .13) ∆t ≤ ∆x2 . and Lewy who formulated it ﬁrst). this reads 1 (2. There.11) and will eventually be bigger than any meaningful information. tends to inﬁnity as ∆t → 0. whence Un ∞ (2.5 Check it. This property is referred to as a maximum principle: the discrete solution at later times (n ≥ 1) is bounded by the maximum of the initial solution. Notice that the continuous solution of (2. We have seen that λ ≤ 1/2 is important to ensure stability. This implies that ρ = −(1 − 4λ) > 1. Uj+1 }. Although we are not going to use them further here. Since both λ and (1 − 2λ) are positive.11) = ρn ε → ∞ as n → ∞. Maximum principles are very important in the analysis of partial diﬀerential equations. and that the solution of the equation is suﬃciently regular. stability is obvious. Ujn .12) Exercise 2. regularity of the exact solution.6 Check this using (2. Now assume that λ > 1/2. We shall see that explicit schemes are always conditionally stable for parabolic equations. which was supposed to be an approximation of sup u(T. Convergence.8) that n n Ujn+1  ≤ sup{Uj−1 . consistency.10) holds. So (2. 
Friedrich.10) is x∈R violated: there is no constant C independent of ∆t such that (2. Now that we have stability for the Euler explicit scheme when λ ≤ 1/2. Obviously (2. (2. In terms of ∆t and ∆x. x). let us still mention that it is important in general to enforce that the numerical schemes satisfy as many properties of the continuous equation as possible.10) is clearly satisﬁed with C = 1. In practice. U n+1 ∞ ≤ Un ∞ ≤ U0 ∞.2). 2 This is the simplest example of the famous CFL condition (after Courant.
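The instability mechanism for λ > 1/2 is easy to reproduce. A sketch (a periodic grid of 64 points and ε = 10⁻³ are my choices) applies the scheme (2.8) to the oscillating data U_j^0 = (−1)^j ε on both sides of the threshold:

```python
import numpy as np

# One step of the explicit scheme (2.8) on a periodic grid.
def step(u, lam):
    return lam * (np.roll(u, -1) + np.roll(u, 1)) + (1 - 2 * lam) * u

# Maximum amplitude after n_steps, starting from U^0_j = (-1)**j * eps.
def amplitude(lam, n_steps=50, J=64, eps=1e-3):
    u = eps * (-1.0) ** np.arange(J)
    for _ in range(n_steps):
        u = step(u, lam)
    return np.abs(u).max()

print(amplitude(0.4), amplitude(0.6))  # bounded for lam <= 1/2, blows up above
```

For λ = 0.4 the mode is damped by the factor |1 − 4λ| < 1 at each step; for λ = 0.6 it is amplified by 4λ − 1 > 1, exactly as in (2.11).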
let us introduce the error series z n = {zj }j∈Z deﬁned by n zj = Ujn − un . j Convergence means that z n ∞ converges to 0 as ∆t → 0 and ∆x → 0 (with the constraint that λ ≤ 1/2) for all 1 ≤ n ≤ N = T /∆t. (2.4 . j where un = u(n∆t. Since (2. we have to show that the real j j solution at time un+1 is well approximated by Ujn+1 . We then deduce that τn ∞ ≤ C∆x2 ∼ C∆t (2. we assume that λ ≤ 1/2 is ﬁxed and send ∆t → 0 To (in which case ∆x = λ−1 ∆t also converges to 0). j j (2. this is independent of the discretization scheme we choose. x) we deduce that τjn = ∆t ∂ 2 u ∆x2 ∂ 4 u (n∆t + s) − (j∆x + h) 2 ∂t2 2 12 ∂x44 λ∂ u 1 ∂ u = ∆x2 (n∆t + s) − (j∆x + h) .15) where C is a constant independent of ∆x (and ∆t since λ is ﬁxed). and that Ujn = un .7 Check this. Since U n satisﬁes the discretized equation. i. To answer this.16) 12 . Exercise 2. 2 2 ∂t 12 ∂x4 The truncation error is therefore of order m = 2 since it is proportional to ∆x2 . All we need is that the exact solution is suﬃciently regular. that it is continuously diﬀerentiable 2 times in time and 4 times in space). assuming that the solution at time Tn = n∆t is given by un on the grid. More precisely.e. we calculate j τjn = ∂t un − ∂ x ∂x un .6). Of course. Let us assume that the solution is suﬃciently regular (what we need is that it is of class C 2. The consistency of the scheme consists of showing that the real dynamics are well approximated.14) The expression τjn is the truncation or local discretization error. Notice however that this term only involves the exact solution u(t. √ simplify. Using Taylor expansions. x). we ﬁnd that ∂u ∆t ∂ 2 u u(t + ∆t) − u(t) = (t) + (t + s) ∆t ∂t 2 ∂t2 4 u(x + ∆x) − 2u(x) + u(x − ∆x) ∂2u ∆x2 ∂ u = (x) + (x + h). j∆x). ∆x2 ∂x2 12 ∂x4 for some 0 ≤ s ≤ ∆t and −∆x ≤ h ≤ ∆x. we deduce that n n ∂t zj − ∂ x ∂x zj = −τjn . The order of convergence is how fast it converges.1) is satisﬁed by u(t. 
This accuracy holds as long as we can bound the term in the square brackets in (2.16) independently of the discretization parameters ∆t and ∆x — that is, as long as the exact solution is sufficiently smooth.
Let us now conclude the analysis of the error estimate and come back to (2.17), which is nothing but the explicit Euler scheme for z^n with the additional source term −∆tτ_j^n:

z_j^{n+1} = λ(z_{j+1}^n + z_{j−1}^n) + (1 − 2λ)z_j^n − ∆tτ_j^n.    (2.17)

The above equation allows us to analyze how the local errors accumulate. For local consistency, assume first that z^n = 0 (the exact solution is known on the grid at time step n). Equation (2.17) then implies that z_j^{n+1} = −∆tτ_j^n, so that ‖z^{n+1}‖_∞ ≤ C∆t² ∼ C∆t∆x² thanks to (2.15). Comparing with the section on ODEs, this implies that the scheme is of order m = 1 in time, since ∆t² = ∆t^{m+1} for m = 1. For the global error, using the stability estimate (2.12) for the homogeneous problem (because λ ≤ 1/2!) and the bound (2.15), we deduce that

‖z^{n+1}‖_∞ ≤ ‖z^n‖_∞ + C∆t∆x².    (2.18)

Since z^0 = 0, this in turn implies that

‖z^n‖_∞ ≤ nC∆t∆x² ≤ CT∆x².

The explicit Euler scheme is thus of order 2 in space. Being of order 2 in space implies that the scheme is of order 1 in time, since ∆t = λ∆x² ≤ ∆x²/2. Therefore it does not converge very fast. We summarize what we have obtained as

Theorem 2.1 Let us assume that u(t, x) is of class C^{2,4} and that λ ≤ 1/2. Then we have that

sup_{j∈Z, 0≤n≤N} |u(n∆t, j∆x) − U_j^n| ≤ C∆x²,    (2.19)

where C is a constant independent of ∆x and of λ ≤ 1/2.

Exercise 2.8 Implement the explicit Euler scheme in Matlab by discretizing the parabolic equation on x ∈ (−10, 10) and assuming that u(t, −10) = u(t, 10) = 0. You may choose as an initial condition u0(x) = 1 on (−1, 1) and u0(x) = 0 elsewhere. Choose the discretization parameters ∆t and ∆x such that λ = .48 and λ = .52. Compare the stable solutions you obtain with the solution given by (2.2).

Exercise 2.9 Following the techniques described in Chapter 1, derive a scheme of order 4 in space (still with ∆t = λ∆x² and λ sufficiently small that the scheme is stable). Implement this new scheme in Matlab and compare it with the explicit Euler scheme on the numerical simulation described in Exercise 2.8.
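Exercise 2.8 asks for a Matlab implementation; the following is an equivalent sketch in Python/NumPy (the final time T and the mesh size are illustrative choices, not prescribed by the notes). With λ = 0.48 all coefficients of the scheme are nonnegative, so the discrete solution obeys the maximum principle:

```python
import numpy as np

def heat_explicit(lam=0.48, dx=0.1, T=1.0):
    """Explicit Euler for u_t = u_xx on (-10, 10) with u(t, -10) = u(t, 10) = 0.
    Time step dt = lam * dx**2; initial condition u0 = 1 on (-1, 1), 0 elsewhere."""
    x = np.arange(-10, 10 + dx / 2, dx)
    U = np.where(np.abs(x) < 1, 1.0, 0.0)
    dt = lam * dx ** 2
    for _ in range(int(T / dt)):
        interior = lam * (U[2:] + U[:-2]) + (1 - 2 * lam) * U[1:-1]
        U = np.concatenate(([0.0], interior, [0.0]))  # Dirichlet boundary values
    return x, U

x, U = heat_explicit(lam=0.48)  # stable: 0 <= U <= 1 at all times
```

Rerunning with `lam=0.52` makes the same loop blow up, in line with Theorem 2.1's hypothesis λ ≤ 1/2.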
2.2 Fourier transforms

We now consider a fundamental tool in the analysis of partial differential equations with homogeneous coefficients and their discretization by finite differences: Fourier transforms. There are many ways of defining the Fourier transform. We shall use the following definition:

[F_{x→ξ} f](ξ) ≡ f̂(ξ) = ∫_{−∞}^{∞} e^{−ixξ} f(x) dx.    (2.20)

The inverse Fourier transform is defined by

[F^{−1}_{ξ→x} f̂](x) ≡ f(x) = (1/2π) ∫_{−∞}^{∞} e^{ixξ} f̂(ξ) dξ.    (2.21)

In the above definition, the operator F is applied to the function f(x), which it maps to the function f̂(ξ). We call the space of x's the physical domain and the space of ξ's the wavenumber domain, or Fourier domain. Wavenumbers are to positions what frequencies are to times. If you are more familiar with Fourier transforms of signals in time, you may want to think of "frequencies" each time you see "wavenumbers". The notation can become quite useful when there are several variables and one wants to emphasize with respect to which variable the Fourier transform is taken.

A crucial property, which explains the terminology of "forward" and "inverse" Fourier transforms, is that

F^{−1}_{ξ→x} F_{x→ξ} f = f,    (2.22)

which can be recast as

f(x) = (1/2π) ∫_{−∞}^{∞} e^{ixξ} ( ∫_{−∞}^{∞} e^{−iyξ} f(y) dy ) dξ = ∫_{−∞}^{∞} ( (1/2π) ∫_{−∞}^{∞} e^{i(x−y)ξ} dξ ) f(y) dy.

This implies that

(1/2π) ∫_{−∞}^{∞} e^{izξ} dξ = δ(z),    (2.23)

the Dirac delta function, since f(x) = ∫ δ(x − y) f(y) dy by definition. This equality can also trivially be deduced from (2.20), since [Fδ](ξ) = 1. Another way of understanding it is to realize that [F^{−1} 1](x) = δ(x).

Another interesting calculation for us is that

[F exp(−(α/2) x²)](ξ) = √(2π/α) exp(−ξ²/(2α)).    (2.24)

In other words, the Fourier transform of a Gaussian is also a Gaussian. Notice this important fact: a very narrow Gaussian (α very large) is transformed into a very wide one (α^{−1} very small), and vice versa. This is related to Heisenberg's uncertainty principle in quantum mechanics, and it says that you cannot be localized in the Fourier domain when you are localized in the spatial domain, and vice versa.
Other very important properties of the Fourier transform are as follows. The simplest property of the Fourier transform is its linearity:

[F(αf + βg)](ξ) = α[Ff](ξ) + β[Fg](ξ).    (2.25)

Moreover,

[Ff(x + y)](ξ) = e^{iyξ} f̂(ξ),   [Ff(νx)](ξ) = (1/|ν|) f̂(ξ/ν),    (2.26)

for all y ∈ R and ν ∈ R\{0}. The first equality shows how the Fourier transform acts on translations by a factor y; the second, how it acts on dilations. The former is extremely important in the analysis of PDEs and their finite difference discretization: Fourier transforms replace translations in the physical domain by multiplications in the Fourier domain. Translation by y is replaced by multiplication by e^{iyξ}. Finite differences are based on translations, so we see how Fourier transforms will be useful in their analysis.

Exercise 2.10 Check (2.24) using integration in the complex plane (which essentially says that ∫_{−∞}^{∞} e^{−β(x−iρ)²} dx = ∫_{−∞}^{∞} e^{−βx²} dx for ρ ∈ R, by a contour change, since z → e^{−βz²} has no pole in C).

Exercise 2.11 Check these formulas.

Fourier transforms also replace differentiation in the physical domain by multiplication in the Fourier domain. To further convince you of the importance of this assertion, it is useful to see differentiation as a limit of differences of translations. More precisely, consider derivatives of functions. By linearity of the Fourier transform and the translation property, we get that

[F ((f(x + h) − f(x))/h)](ξ) = ([Ff(x + h)](ξ) − [Ff(x)](ξ))/h = ((e^{ihξ} − 1)/h) f̂(ξ).

Now let us just pass to the limit h → 0 on both sides. We get

[F f′](ξ) = iξ f̂(ξ).    (2.27)

In other words, differentiation d/dx is replaced by multiplication by iξ. This assertion can also easily be verified directly from the definition (2.20). The reason this is important is that you can't beat multiplications in simplicity.

One last crucial and quite related property of the Fourier transform is that it replaces convolutions by multiplications. The convolution of two functions is given by

(f ∗ g)(x) = ∫_{−∞}^{∞} f(x − y)g(y) dy = ∫_{−∞}^{∞} f(y)g(x − y) dy.    (2.28)

Its Fourier transform is then given by

[F(f ∗ g)](ξ) = f̂(ξ) ĝ(ξ).    (2.29)

Exercise 2.12 Check this.

This important relation can be used to prove the equally important (everything is important in this section!) Parseval equality

∫_{−∞}^{∞} |f̂(ξ)|² dξ = 2π ∫_{−∞}^{∞} |f(x)|² dx,    (2.30)

for all complex-valued functions f(x) such that the above integrals make sense. We show this as follows. We first recall that the Fourier transform of the complex conjugate f̄ satisfies [F f̄](ξ) = conj(f̂(−ξ)). We also define g(x) = f̄(−x), so that ĝ(ξ) = conj(f̂(ξ)) using (2.26) with ν = −1. We then have

∫_{−∞}^{∞} |f̂(ξ)|² dξ = ∫_{−∞}^{∞} f̂(ξ) conj(f̂(ξ)) dξ = ∫_{−∞}^{∞} f̂(ξ) ĝ(ξ) dξ = ∫_{−∞}^{∞} [F(f ∗ g)](ξ) dξ
   = 2π (f ∗ g)(0) = 2π ∫_{−∞}^{∞} f(y) g(−y) dy = 2π ∫_{−∞}^{∞} f(y) f̄(y) dy = 2π ∫_{−∞}^{∞} |f(y)|² dy,

thanks to (2.29), then (2.21) with x = 0, and the definition of the convolution. The Parseval equality (2.30) essentially says that there is as much energy (up to a factor 2π) in the physical domain (right-hand side of (2.30)) as in the Fourier domain (left-hand side).

Application to the heat equation. Before considering finite differences, here is an important exercise that shows how powerful Fourier transforms are to solve PDEs with constant coefficients. To simplify, we still assume that we want to solve the heat equation on the whole line x ∈ R.

Exercise 2.13 Consider the heat equation (2.1). Let u0(x) be the initial condition for the heat equation. Using the Fourier transform, show that

∂û/∂t (t, ξ) = −ξ² û(t, ξ).

Here we mean û(t, ξ) = [F_{x→ξ} u(t, ·)](ξ). Solve this ODE. Then, using (2.24) and (2.28), recover (2.2).

Application to Finite Differences. We now use the Fourier transform to analyze finite difference discretizations. If we use the notation U^n(j∆x) = U_j^n ≈ u(n∆t, j∆x), the explicit Euler scheme (2.7) can be recast as

(U^{n+1}(x) − U^n(x))/∆t = (U^n(x + ∆x) + U^n(x − ∆x) − 2U^n(x))/∆x²,    (2.31)
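The Parseval equality has an exact discrete counterpart that is easy to check numerically. With NumPy's DFT convention the factor 2π of (2.30) is replaced by the number of samples N (this is a standard property of the discrete Fourier transform, not stated in the notes):

```python
import numpy as np

# Discrete analogue of the Parseval equality (2.30):
# sum_k |fhat_k|^2 = N * sum_n |f_n|^2 in numpy's unnormalized DFT convention.
rng = np.random.default_rng(0)
f = rng.standard_normal(256) + 1j * rng.standard_normal(256)
fhat = np.fft.fft(f)

lhs = np.sum(np.abs(fhat) ** 2)       # "energy" in the Fourier domain
rhs = f.size * np.sum(np.abs(f) ** 2) # "energy" in the physical domain
```

The two sides agree to machine precision, mirroring the continuous identity up to the convention-dependent constant.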
where x = j∆x. As before, this scheme says that the approximation U at time T_{n+1} = (n + 1)∆t and position X_j = j∆x depends on the values of U at time T_n and positions X_{j−1}, X_j, and X_{j+1}. It is therefore straightforward to realize that U^n(X_j) depends on u0 at all the points X_k for j − n ≤ k ≤ j + n. It is however not easy at all to get this dependence explicitly from (2.31). The key to analyzing this dependence is to go to the Fourier domain, which is much simpler to deal with: there, translations are replaced by multiplications.

Here is how we do it. We first define U^0(x) for all points x ∈ R, not only those points x = X_j = j∆x. One way to do so is to assume that

u0(x) = U_j^0 on ((j − 1/2)∆x, (j + 1/2)∆x).    (2.32)

We then define U^n(x) for all n ≥ 0 and all x ∈ R by induction, using (2.31) initialized with U^0(x) = u0(x):

U^{n+1}(x) = (1 − 2λ)U^n(x) + λ(U^n(x + ∆x) + U^n(x − ∆x)).    (2.33)

Once U^n(x) is defined for all x, we then easily observe that

U^n(x) = U_j^n on ((j − 1/2)∆x, (j + 1/2)∆x),    (2.34)

where U_j^n is given by (2.7). So analyzing U^n(x) is in some sense sufficient to get information on U_j^n. Since U^n(x) is now defined on the whole line, we can certainly introduce its Fourier transform Û^n(ξ), defined according to (2.20). In the Fourier domain, marching in time (going from step n to step n + 1) is much simpler than in the physical domain. Using the translation property of the Fourier transform (2.26), we obtain that

Û^{n+1}(ξ) = (1 + λ(e^{iξ∆x} + e^{−iξ∆x} − 2)) Û^n(ξ),    (2.36)

i.e., each wavenumber ξ is multiplied by a coefficient

R_λ(ξ) = 1 + λ(e^{iξ∆x} + e^{−iξ∆x} − 2) = 1 − 2λ(1 − cos(ξ∆x)).

We then have that

Û^n(ξ) = (R_λ(ξ))^n û0(ξ).    (2.37)

The question of the stability of the discretization now becomes the following: are all wavenumbers ξ stable, or equivalently, is (R_λ(ξ))^n bounded for all 1 ≤ n ≤ N = T/∆t? It is easy to observe that when λ ≤ 1/2, |R_λ(ξ)| ≤ 1 for all ξ. We then clearly have that all wavenumbers are stable, since |Û^n(ξ)| ≤ |û0(ξ)|. However, when λ > 1/2, we can choose values of ξ such that cos(ξ∆x) is close to −1. For such wavenumbers, the value of R_λ(ξ) is less than −1, so that Û^n changes sign at every new time step n with growing amplitude (see numerical simulations). Such wavenumbers are unstable.
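The amplification factor can be examined directly; a short NumPy sketch (the sampling of ξ∆x over one period is an illustrative choice):

```python
import numpy as np

def R(lam, xi_dx):
    """Amplification factor R_lambda(xi) = 1 - 2*lam*(1 - cos(xi * dx))."""
    return 1 - 2 * lam * (1 - np.cos(xi_dx))

xi_dx = np.linspace(0, 2 * np.pi, 1001)  # one full period of xi * dx
max_stable = np.abs(R(0.5, xi_dx)).max()  # lambda = 1/2: |R| <= 1 everywhere
min_unstable = R(0.7, xi_dx).min()        # lambda = 0.7: R < -1 near xi*dx = pi
```

The worst wavenumber is ξ∆x = π (cos = −1), where R_λ = 1 − 4λ; this is exactly where the instability for λ > 1/2 appears.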
2.3 Stability and convergence using Fourier transforms

In this section, we apply the theory of Fourier transforms to the analysis of the finite difference discretization of parabolic equations with constant coefficients. As we already mentioned, an important property of Fourier transforms is that differentiation is replaced by multiplication by iξ in the Fourier domain:

∂/∂x → iξ.

Here D stands for ∂/∂x. For a function P(iξ) of iξ, the operator P(D) is a linear operator defined as

P(D)f(x) = [F^{−1}_{ξ→x} ( P(iξ) [F_{x→ξ} f](ξ) )](x).    (2.39)

What the operator P(D) does is: (i) take the Fourier transform of f(x); (ii) multiply f̂(ξ) by the function P(iξ); (iii) take the inverse Fourier transform of the product. The function P(iξ) is called the symbol of the operator P(D). Such operators are called pseudodifferential operators. The above definition gives us a mathematical framework to build on this idea. We use this to define equations in the physical domain as follows:

∂u/∂t (t, x) + P(D)u(t, x) = 0,   t > 0, x ∈ R,
u(0, x) = u0(x),   x ∈ R.    (2.38)

Why do we introduce all this? The reason, as was mentioned in the previous section, is that the analysis of differential operators simplifies in the Fourier domain. Upon taking the Fourier transform F_{x→ξ} term by term in (2.38), we get that

∂û/∂t (t, ξ) + P(iξ) û(t, ξ) = 0,   t > 0, ξ ∈ R,
û(0, ξ) = û0(ξ),   ξ ∈ R.    (2.40)

Notice the parallel between (2.38) and (2.40): by going to the Fourier domain, we have replaced the "complicated" pseudodifferential operator P(D) by a mere multiplication by P(iξ). For every ξ ∈ R, (2.40) consists of a simple linear first-order ordinary differential equation, whose solution is obviously given by

û(t, ξ) = e^{−tP(iξ)} û0(ξ).    (2.41)

The solution of (2.38) is then given by

u(t, x) = [F^{−1}_{ξ→x} (e^{−tP(iξ)})] ∗ u0(x).    (2.42)

Exercise 2.14 (i) Show that the parabolic equation (2.1) takes the form (2.38) with symbol P(iξ) = ξ². (ii) Find the equation of the form (2.38) with symbol P(iξ) = ξ⁴. (iii) Show that any arbitrary partial differential equation of order m ≥ 1 in x with constant coefficients can be written in the form (2.38) with the symbol P(iξ) a polynomial of order m.
Parabolic Symbol. In this section, we only consider real-valued symbols such that for some M > 0,

P(iξ) ≥ C|ξ|^M   for all ξ ∈ R.    (2.43)

The heat equation clearly satisfies the above requirement with M = 2. The real-valuedness of the symbol is not necessary, but we shall impose it to simplify. With a somewhat loose terminology, we refer to equations (2.38) with real-valued symbol P(iξ) satisfying the above constraint as parabolic equations.

What about numerical schemes? Our definition so far provides a great tool to analyze partial differential equations. It turns out it is also very well adapted to the analysis of finite difference approximations to these equations. The reason is again that finite difference operators are transformed into multiplications in the Fourier domain.

Finite difference schemes are supposed to be defined on the grid ∆xZ (i.e., the points x_m = m∆x for m ∈ Z), not on the whole axis R. Notice that we are slightly changing the problem of interest: in the end, we should see how the properties of U^{n+1}(x) translate into properties of U_j^{n+1}. This is easy to do when U^{n+1}(x) is chosen as a piecewise constant function, and more difficult when higher-order polynomials are used to construct it. It can be done, but requires careful analysis of the error made by interpolating smooth functions by polynomials. We do not consider this problem in these notes. This is why we replace U_j^{n+1} by U^{n+1}(x) in the sequel and only consider properties of U^{n+1}(x).

A general single-step finite difference scheme is given by

B∆ U^{n+1}(x) = A∆ U^n(x),   n ≥ 0,    (2.44)

with U^0(x) = u0(x) a given initial condition. Here, using the notation of the preceding section, the finite difference operators are defined by

A∆ U(x) = Σ_{α∈Iα} a_α U(x − α∆x),   B∆ U(x) = Σ_{β∈Iβ} b_β U(x − β∆x),    (2.45)

where Iα ⊂ Z and Iβ ⊂ Z are finite sets of integers. Explicit schemes are characterized by the fact that only b0 does not vanish. We shall see that implicit schemes, where other coefficients b_β ≠ 0, are important in practice. For instance, the explicit Euler scheme is given by

U^{n+1}(x) = (1 − 2λ)U^n(x) + λ(U^n(x − ∆x) + U^n(x + ∆x)).

This corresponds to the coefficients b0 = 1, a0 = 1 − 2λ, a1 = λ, a_{−1} = λ, and all the other coefficients a_α and b_β vanish. Although this definition may look a little complicated at first, we need such a complexity in practice. Let us notice however that, from the definition (2.44), the values of U^{n+1} on the grid ∆xZ only depend on the values of u0 on the same grid.
In other words, U^{n+1}(j∆x) does not depend on how we extend the initial condition u0, originally defined on the grid ∆xZ, to the whole line R, and we have U^{n+1}(j∆x) = U_j^{n+1}.
We now come back to (2.44); this is our definition of a finite difference scheme, and we want to analyze its properties. As we already mentioned, this equation is simpler in the Fourier domain. Let us thus take the Fourier transform of each term in (2.44). What we get is

Σ_{β∈Iβ} b_β e^{−iβ∆xξ} Û^{n+1}(ξ) = Σ_{α∈Iα} a_α e^{−iα∆xξ} Û^n(ξ).    (2.46)

Let us introduce the symbols

A∆(ξ) = Σ_{α∈Iα} a_α e^{−iα∆xξ},   B∆(ξ) = Σ_{β∈Iβ} b_β e^{−iβ∆xξ}.    (2.47)

From now on, we assume that B∆(ξ) ≠ 0. In the explicit case, B∆(ξ) ≡ 1 and this assumption is obvious; in the implicit case, it has to be checked. We then obtain that

Û^{n+1}(ξ) = R∆(ξ) Û^n(ξ) = B∆(ξ)^{−1} A∆(ξ) Û^n(ξ).    (2.48)

This implies

Û^n(ξ) = R∆^n(ξ) û0(ξ).    (2.50)

The above relation completely characterizes the solution Û^n(ξ). To obtain the solution U^n(x), we simply have to take the inverse Fourier transform of Û^n(ξ).

Exercise 2.15 Verify that for the explicit Euler scheme, we have

R∆(ξ) = 1 − (2∆t/∆x²)(1 − cos(ξ∆x)).

Notice that the difference between the exact solution û(n∆t, ξ) at time n∆t and its approximation Û^n(ξ) is given by

û(n∆t, ξ) − Û^n(ξ) = (e^{−n∆tP(iξ)} − R∆^n(ξ)) û0(ξ).    (2.51)

This is what we want to analyze. To measure convergence, we first need to define a norm, which will tell us how far or how close two functions are from one another. In finite dimensional spaces (such as R^n, which we used for ordinary differential equations), all norms are equivalent (this is a theorem). In infinite dimensional spaces, all norms are not equivalent (this is another theorem: the equivalence of all norms on a space implies that it is finite dimensional). So the choice of the norm matters! Here we only consider the L² norm, defined on R for sufficiently regular complex-valued functions f by

‖f‖ = ( ∫_{−∞}^{∞} |f(x)|² dx )^{1/2}.    (2.52)

The reason why this norm is nice is that we have the Parseval equality

‖f̂‖ = √(2π) ‖f‖.    (2.53)
This means that the L² norms in the physical and the Fourier domains are equal (up to the factor √(2π)). Most norms do not have such a nice interpretation in the Fourier domain. This is one of the reasons why the L² norm is so popular.

Our analysis of the finite difference scheme is therefore concerned with characterizing ‖u(n∆t, x) − U^n(x)‖. We just saw that this is equivalent to characterizing

ε̂^n(ξ) = û(n∆t, ξ) − Û^n(ξ) = (e^{−n∆tP(iξ)} − R∆^n(ξ)) û0(ξ).    (2.55)

The control that we will obtain on ε̂^n(ξ) depends on the three classical ingredients: regularity of the solution, stability of the scheme, and consistency of the approximation. The convergence arises as the two parameters ∆t and ∆x converge to 0. To simplify the analysis, we assume that the two parameters are related by

λ = ∆t / ∆x^M,    (2.54)

where M is the order of the parabolic equation (M = 2 for the heat equation), and where λ > 0 is fixed.

Stability. The stability constraint is clear: no wavenumber ξ can grow out of control. We impose that there exists a constant C independent of ξ and 1 ≤ n ≤ N = T/∆t such that

|R∆^n(ξ)| ≤ C,   ξ ∈ R, 1 ≤ n ≤ N.    (2.56)

Exercise 2.16 Check again that the above constraint is satisfied for the explicit Euler scheme if and only if λ ≤ 1/2.

Regularity. The regularity of the exact equation is first seen in the growth of e^{−∆tP(iξ)}. Since our operator is assumed to be parabolic, the exact symbol satisfies

|e^{−n∆tP(iξ)}| ≤ e^{−Cn∆t|ξ|^M},    (2.57)

for some constant C. This relation is important because we deduce that high wavenumbers (large |ξ|) are heavily damped by the operator. Parabolic equations have a very efficient smoothing effect: even if high wavenumbers are present in the initial solution, they quickly disappear as time increases. What makes parabolic equations a little special is that they have better stability properties than just having bounded symbols. Not surprisingly, the finite difference approximation partially retains this effect. We have however to be careful: the scheme has a given scale, ∆x, at which it approximates the exact operator. For length scales much larger than ∆x, that is, for wavenumbers smaller than ∆x^{−1}, we expect the scheme to be a good approximation of the exact operator.
For lengths of order (or smaller than) ∆x, however, we cannot expect the scheme to be accurate: we do not have sufficient resolution. Remember that small scales in the spatial world mean high wavenumbers in the Fourier world (functions that vary on small scales oscillate fast, hence are composed of large wavenumbers). Wavenumbers of order of or greater than ∆x^{−1} are simply not captured by the finite difference scheme. For this reason, we impose that the operator R∆ satisfy

|R∆(ξ)| ≤ 1 − δ|∆xξ|^M   for |∆xξ| ≤ γ,
|R∆(ξ)| ≤ 1              for |∆xξ| > γ,    (2.58)

where γ and δ are positive constants independent of ξ and ∆x. We then have

Lemma 2.2 We have that

|R∆^n(ξ)| ≤ e^{−Cn∆x^M|ξ|^M}   for |∆xξ| ≤ γ,    (2.59)

where C is a constant independent of ξ, ∆x, and ∆t.

Proof. We deduce from (2.58) that

|R∆^n(ξ)| ≤ e^{n ln(1 − δ|∆xξ|^M)} ≤ e^{−n(δ/2)|∆xξ|^M},   |∆xξ| < γ,

since ln(1 − x) ≤ −x for 0 ≤ x < 1.

Exercise 2.17 Show that in the case of the heat equation and the explicit Euler scheme, (2.58) is satisfied when λ ≤ 1/2.

Consistency. Let us now turn our attention to the consistency and accuracy of the finite difference scheme. Here again, we want to make sure that the finite difference scheme is locally a good approximation of the exact equation. Quantitatively, assuming that we are given the exact solution û(n∆t, ξ) at time n∆t, the error made by applying the approximate equation instead of the exact equation between n∆t and (n + 1)∆t is precisely given by

(e^{−∆tP(iξ)} − R∆(ξ)) û(n∆t, ξ).    (2.60)

This error has to converge to 0 as ∆x and ∆t converge to 0. We cannot expect the above quantity to be small for all values of ξ. Indeed, the error |e^{−∆tP(iξ)} − R∆(ξ)| is not small for wavenumbers of the form ξ = kπ∆x^{−1}, k = 1, 2, . . .; all we can expect is that the error is small when ξ∆x → 0 and ∆t → 0. We say that the scheme is consistent and accurate of order m if

|e^{−∆tP(iξ)} − R∆(ξ)| ≤ C∆t∆x^m (1 + |ξ|^{M+m}).    (2.61)
Proof of convergence. Let us come back to the analysis of ε̂^n(ξ). Using a^n − b^n = (a − b) Σ_{k=0}^{n−1} a^{n−1−k} b^k, we have that

ε̂^n(ξ) = (e^{−∆tP(iξ)} − R∆(ξ)) Σ_{k=0}^{n−1} e^{−(n−1−k)∆tP(iξ)} R∆^k(ξ) û0(ξ).    (2.62)

Let us first consider the case |ξ| ≤ γ∆x^{−1}. There, we have, thanks to the regularity (2.57), the stability (2.59), and the consistency (2.61), that

|ε̂^n(ξ)| ≤ C∆t∆x^m (1 + |ξ|^{M+m}) n e^{−Cn∆t|ξ|^M} |û0(ξ)|.    (2.63)

Now remark that n∆t|ξ|^M e^{−Cn∆t|ξ|^M} ≤ C. Since n∆t ≤ T, this implies that

|ε̂^n(ξ)| ≤ C∆x^m (1 + |ξ|^m) |û0(ξ)|.    (2.64)

So far, the above inequality holds for |ξ| ≤ γ∆x^{−1}. However, for |ξ| ≥ γ∆x^{−1}, we deduce from the stability property (2.56) that

|ε̂^n(ξ)| ≤ 2|û0(ξ)| ≤ C∆x^m (1 + |ξ|^m) |û0(ξ)|,

so that (2.64) actually holds for every ξ. This implies that

‖ε^n‖ = ( ∫_R |ε̂^n(ξ)|² dξ )^{1/2} ≤ C∆x^m ( ∫_R (1 + |ξ|^{2m}) |û0(ξ)|² dξ )^{1/2}.    (2.65)

This shows that ‖ε^n‖ is of order ∆x^m provided that

( ∫_R (1 + |ξ|^{2m}) |û0(ξ)|² dξ )^{1/2} < ∞.    (2.66)

The above inequality imposes that û0(ξ) decay sufficiently fast as |ξ| → ∞. This is actually equivalent to imposing that u0(x) is sufficiently regular.

The Hilbert spaces H^m(R). We introduce the sequence of functional spaces H^m = H^m(R) of functions whose derivatives up to order m are square-integrable. For a function v(x), we denote by v^{(k)}(x) its kth derivative. The norm of H^m is defined by

‖v‖_{H^m} = ( Σ_{k=0}^{m} ∫_R |v^{(k)}(x)|² dx )^{1/2}.    (2.67)

Notice that H^0 = L², the space of square-integrable functions. It turns out that a function v(x) belongs to H^m if and only if its Fourier transform v̂(ξ) satisfies

( ∫_R (1 + |ξ|^{2m}) |v̂(ξ)|² dξ )^{1/2} < ∞.    (2.68)

The proof is based on the fact that differentiation in the physical domain is replaced by multiplication by iξ in the Fourier domain.

Exercise 2.18 (difficult) Prove it.
Notice that in (2.68), we can choose m ∈ R, and not only m ∈ N*. Only the number of derivatives can be "fractional", or even negative! Thus (2.68) can be used as a definition of the space H^m with m ∈ R: the space H^m is still the space of functions with m square-integrable derivatives. The equivalence between (2.67) and (2.68) can be explained as follows: v(x) is smooth in the physical domain ⟺ v̂(ξ) decays fast in the Fourier domain. This is another very general and very important property of Fourier transforms, and (2.68) is one manifestation of this "rule".

Exercise 2.19 Calculate the Fourier transform of the function v defined by v(x) = 1 on (−1, 1) and v(x) = 0 elsewhere. Using (2.68), show that the function v belongs to every space H^m with m < 1/2. This means that functions that are piecewise smooth (with discontinuities) have a little less than half of a derivative in L² (you can differentiate such functions almost 1/2 times)!

Main result of the section. Summarizing all of the above, we have obtained the following result.

Theorem 2.3 Let us assume that P(iξ) and R∆(ξ) are parabolic, in the sense that (2.43) and (2.58) are satisfied, and that R∆(ξ) is accurate of order m, i.e. (2.61) holds. Assuming that the initial condition u0 ∈ H^m, we obtain that

‖u(n∆t, x) − U^n(x)‖ ≤ C∆x^m ‖u0‖_{H^m},   for all 0 ≤ n ≤ N = T/∆t.    (2.69)

We recall that ∆t = λ∆x^M, so that in the above theorem, the scheme is of order m in space and of order m/M in time. You can compare this with Theorem 2.1: to obtain a convergence of order 2 in space there, we had to assume that the solution u(t, x) is of class C⁴ in space. This is consistent with the above result with M = 2 and m = 2.

Exercise 2.20 (difficult) Follow the proof of Theorem 2.3 assuming that (2.43) and (2.58) are replaced by P(iξ) ≥ 0 and |R∆(ξ)| ≤ 1. Show that (2.69) should be replaced by

‖u(n∆t, x) − U^n(x)‖ ≤ C∆x^m ‖u0‖_{H^{m+M}},   for all 0 ≤ n ≤ N = T/∆t.    (2.70)

Loosely speaking, what this means is that we need the initial condition to be more regular if the equation is not regularizing.

Exercise 2.21 (very difficult). Assuming that (2.58) is replaced by |R∆(ξ)| ≤ 1, show that (2.69) should be replaced by

‖u(n∆t, x) − U^n(x)‖ ≤ C∆x^m ‖u0‖_{H^{m+ε}},   for all 0 ≤ n ≤ N = T/∆t.    (2.71)

Here ε is an arbitrary positive constant; of course, the constant C depends on ε. What this result means is that if we lose the parabolic property of the discrete scheme, we still have the same order of convergence provided that the initial condition is slightly more regular than being in H^m.
Exercise 2.22 Show that any stable scheme that is convergent of order m is also convergent of order αm for all 0 ≤ α ≤ 1. Deduce that (2.69) in Theorem 2.3 can then be replaced by

‖u(n∆t, x) − U^n(x)‖ ≤ C∆x^{αm} ‖u0‖_{H^{αm}},   for all 0 ≤ n ≤ N = T/∆t.    (2.72)

The interest of this result is the following: when the initial data u0 is not sufficiently regular to be in H^m, but sufficiently regular to be in H^{αm} with 0 < α < 1, we still have convergence of the finite difference scheme as ∆x → 0; however, the rate of convergence is slower.

Exercise 2.23 Show that everything we have said in this section holds if we replace (2.43) and (2.58) by

Re P(iξ) ≥ C(|ξ|^M − 1)   and   |R∆(ξ)| ≤ 1 − δ|∆xξ|^M + C∆t,   |∆xξ| ≤ γ.

Here, Re stands for "real part". The above hypotheses are sufficiently general to cover most practical examples of parabolic equations.

Exercise 2.24 Show that schemes accurate of order m > 0, so that (2.61) holds, and verifying |R∆(ξ)| ≤ 1, are stable in the sense that

|R∆(ξ)| ≤ (1 + C∆x^m ∆t) − δ|∆xξ|^M,   |∆xξ| ≤ γ.

The last exercises show that schemes that are stable in the "classical" sense, i.e. that verify |R∆(ξ)| < 1, and are consistent with the smoothing parabolic symbol, are themselves smoothing.

2.4 Application to the θ schemes

We are now ready to analyze the family of schemes for the heat equation (i.e., P(iξ) = ξ²) called the θ schemes. Let us assume that 0 ≤ θ ≤ 1. The θ scheme is defined by

∂̄_t U^n(x) = θ ∂_x ∂̄_x U^{n+1}(x) + (1 − θ) ∂_x ∂̄_x U^n(x).    (2.73)

Recall that the finite difference operators are defined by (2.4) and (2.5), and that ∆t = λ∆x². When θ = 0, we recover the explicit Euler scheme. For θ = 1, the scheme is called the fully implicit Euler scheme. For θ = 1/2, the scheme is called the Crank-Nicolson scheme.

Exercise 2.25 Show that the symbol of the θ scheme is given by

R∆(ξ) = (1 − 2(1 − θ)λ(1 − cos(∆xξ))) / (1 + 2θλ(1 − cos(∆xξ))).    (2.74)

Let us first consider the stability of the θ method. We observe that R∆(ξ) ≤ 1 for all ξ. The stability requirement is therefore that

min_ξ R∆(ξ) ≥ −1.
Exercise 2.26 Show that the above inequality holds if and only if

(1 − 2θ)λ ≤ 1/2.

The above analysis shows that the θ scheme is unconditionally stable when θ ≥ 1/2, i.e., the method is unconditionally L² stable independently of the choice of λ. When θ < 1/2, stability holds if and only if

λ ≤ 1 / (2(1 − 2θ)).

Let us now consider the accuracy of the θ method. Obviously, we need to show that (2.61) holds for some m > 0.

Exercise 2.27 Show that

e^{−∆tξ²} = 1 − ∆tξ² + (1/2)∆t²ξ⁴ − (1/6)∆t³ξ⁶ + O(∆t⁴ξ⁸),    (2.75)

and that

R∆(ξ) = 1 − ∆tξ² + ∆t∆x² (1/12 + λθ) ξ⁴ − (∆t∆x⁴/360)(1 + 60λθ + 360λ²θ²) ξ⁶ + O(∆t⁴ξ⁸),    (2.76)

as ∆t → 0 and ∆xξ → 0.

Exercise 2.28 Show that the θ scheme is always at least second-order in space (first-order in time). Show that the scheme is of order 4 in space if and only if

θ = 1/2 − 1/(12λ).

Show that the scheme is of order 6 in space if in addition

1 + 60λθ + 360λ²θ² = 60λ²,   i.e.   λ = √5/10.

Implementation of the θ scheme. Let us come back to the solution U_j^n defined on the grid ∆xZ.

Exercise 2.29 Show that the θ scheme can be put in the form (2.44) with

b_{−1} = b_1 = −θλ,   b_0 = 1 + 2θλ,   a_{−1} = a_1 = (1 − θ)λ,   a_0 = 1 − 2(1 − θ)λ.

Since the coefficients b_1 and b_{−1} do not vanish when θ > 0, the scheme is no longer explicit. The very nice stability property of the implicit schemes thus comes at a price: the calculation of U^{n+1} from U^n is no longer explicit. The operators A∆ and B∆ in (2.45) are now infinite matrices such that B∆ U^{n+1} = A∆ U^n,
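The stability claims above are easy to verify numerically from the symbol (2.74). A small NumPy sketch (the sampled values of λ are illustrative choices):

```python
import numpy as np

def R_theta(theta, lam, xi_dx):
    """Symbol of the theta scheme, equation (2.74)."""
    c = 2 * lam * (1 - np.cos(xi_dx))
    return (1 - (1 - theta) * c) / (1 + theta * c)

xi_dx = np.linspace(0, 2 * np.pi, 1001)

# theta = 1/2 (Crank-Nicolson): |R| <= 1 even for a huge lambda.
m_cn = np.abs(R_theta(0.5, 100.0, xi_dx)).max()
# theta = 0 (explicit Euler): stable exactly up to lambda = 1/2 ...
m_ok = np.abs(R_theta(0.0, 0.5, xi_dx)).max()
# ... and unstable just above it.
m_bad = np.abs(R_theta(0.0, 0.6, xi_dx)).max()
```

This matches the condition (1 − 2θ)λ ≤ 1/2: for θ ≥ 1/2 no value of λ violates it, while for θ = 0 it reduces to the CFL bound λ ≤ 1/2.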
where the infinite vector U^n = (U_j^n)_{j∈Z}. In matrix form, the operator B∆ for the θ scheme is given by the infinite tridiagonal matrix, shown here truncated to five rows:

       [ 1 + 2θλ   −θλ       0         0         0       ]
       [ −θλ       1 + 2θλ   −θλ       0         0       ]
B∆ =   [ 0         −θλ       1 + 2θλ   −θλ       0       ]    (2.77)
       [ 0         0         −θλ       1 + 2θλ   −θλ     ]
       [ 0         0         0         −θλ       1 + 2θλ ]

Similarly, the matrix A∆ is given by

       [ 1 − 2(1−θ)λ   (1−θ)λ         0              0              0            ]
       [ (1−θ)λ        1 − 2(1−θ)λ    (1−θ)λ         0              0            ]
A∆ =   [ 0             (1−θ)λ         1 − 2(1−θ)λ    (1−θ)λ         0            ]    (2.78)
       [ 0             0              (1−θ)λ         1 − 2(1−θ)λ    (1−θ)λ       ]
       [ 0             0              0              (1−θ)λ         1 − 2(1−θ)λ  ]

The vector U^{n+1} is thus given by

U^{n+1} = (B∆)^{−1} A∆ U^n.    (2.79)

In practice, we cannot use infinite matrices, obviously. However, we can use the N × N matrices above on an interval of size N∆x. The above matrices correspond to imposing that the solution vanishes at the boundary of the interval (Dirichlet boundary conditions). This implies that in any numerical implementation of the θ scheme where θ > 0, we need to invert the system B∆ U^{n+1} = A∆ U^n (which is usually much faster than constructing the matrix (B∆)^{−1} directly).

Exercise 2.30 Implement the θ scheme in Matlab for all possible values of 0 ≤ θ ≤ 1 and λ > 0. Verify numerically (by dividing an initial spatial mesh by a factor 2, then 4) that the θ scheme is of order 2, 4, and 6 in space for well chosen values of θ and λ. To obtain these orders of convergence, make sure that the initial condition is sufficiently smooth. For non-smooth initial conditions, show that the order of convergence decreases, and compare with the theoretical predictions of Theorem 2.3. Verify on a few examples with different values of θ < 1/2 that the scheme is unstable when λ is too large.
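As a sketch of this implementation (the notes use Matlab; here is a Python/NumPy equivalent, with dense matrices for clarity where a production code would use a tridiagonal solver; the grid, final time, and λ are illustrative choices):

```python
import numpy as np

def theta_step(U, theta, lam):
    """One step of the theta scheme: solve B U^{n+1} = A U^n with the
    tridiagonal matrices (2.77)-(2.78) (Dirichlet boundary conditions)."""
    n = U.size
    I = np.eye(n)
    T = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)  # shift operators
    B = (1 + 2 * theta * lam) * I - theta * lam * T
    A = (1 - 2 * (1 - theta) * lam) * I + (1 - theta) * lam * T
    return np.linalg.solve(B, A @ U)  # never form B^{-1} explicitly

# Crank-Nicolson (theta = 1/2) with lambda = 5, far beyond the explicit
# CFL bound, remains stable.
x = np.linspace(-10, 10, 201)[1:-1]          # interior Dirichlet grid
U = np.where(np.abs(x) < 1, 1.0, 0.0)
for _ in range(100):
    U = theta_step(U, theta=0.5, lam=5.0)
```

Solving the linear system at each step is the price of implicitness mentioned above; for tridiagonal B∆ this costs only O(N) per step with a banded solver.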
3 Finite Differences for Elliptic Equations

The prototype of elliptic equation we consider is the following two-dimensional diffusion equation with absorption coefficient:

−∆u(x) + σu(x) = f(x),   x ∈ R².    (3.1)

Here, σ is a given positive absorption coefficient, f(x) is a given source term, x = (x, y) ∈ R², and

∆ = ∂²/∂x² + ∂²/∂y²

is the Laplace operator. We assume that our domain has no boundaries, to simplify. The above partial differential equation can be analyzed by Fourier transform. In two space dimensions, the Fourier transform is defined by

[F_{x→ξ} f](ξ) ≡ f̂(ξ) = ∫_{R²} e^{−ix·ξ} f(x) dx.    (3.2)

The inverse Fourier transform is then defined by

[F^{−1}_{ξ→x} f̂](x) ≡ f(x) = (1/(2π)²) ∫_{R²} e^{ix·ξ} f̂(ξ) dξ.    (3.3)

Notice here that both x and ξ are 2-vectors. We still verify that the Fourier transform replaces translations and differentiation by multiplications. More precisely, we have

[Ff(x + y)](ξ) = e^{iy·ξ} f̂(ξ),   [F∇f(x)](ξ) = iξ f̂(ξ).    (3.4)

This implies that the operator −∆ is replaced by a multiplication by |ξ|² (check it!). Upon taking the Fourier transform in both terms of the above equation, we obtain that

(σ + |ξ|²) û(ξ) = f̂(ξ),   ξ ∈ R².    (3.5)

The solution to the diffusion equation is then given in the Fourier domain by

û(ξ) = f̂(ξ) / (σ + |ξ|²).    (3.6)

It then remains to apply the inverse Fourier transform to the above equality to obtain u(x). Since the inverse Fourier transform of a product is the convolution of the inverse Fourier transforms (as in 1D), we deduce that

u(x) = [F^{−1} (1/(σ + |ξ|²))](x) ∗ f(x) = C K₀(√σ |x|) ∗ f(x),

where K₀ is the modified Bessel function of order 0 of the second kind and C is a normalization constant.
This shows again how powerful Fourier analysis is to solve partial differential equations with constant coefficients. We could define more general equations of the form

P(D)u(x) = f(x),   x ∈ R²,        (3.7)

where P(D) is an operator with symbol P(iξ). In the case of the diffusion equation, we would have P(iξ) = σ + |ξ|²; P(iξ) is a parabolic symbol of order 2, using the terminology of the preceding section. All we will see in this section applies to quite general symbols P(iξ), although we shall consider only P(iξ) = σ + |ξ|² to simplify.

Discretization. How do we solve (3.1) by finite differences? We can certainly replace the differentials by approximated differentials, as we did in 1D. This is the finite difference approach. Let us consider the grid of points x_{ij} = (i∆x, j∆x) for i, j ∈ Z and ∆x > 0 given. We then define

U_{ij} ≈ u(x_{ij}),   f_{ij} = f(x_{ij}).

Using a second-order scheme, we can then replace (3.1) by

−(∂x ∂x̄ + ∂y ∂ȳ) U_{ij} + σ U_{ij} = f_{ij},   (i, j) ∈ Z²,        (3.8)

where the operator ∂x is defined by

∂x U_{ij} = (U_{i+1,j} − U_{ij})/∆x,        (3.9)

and ∂x̄, ∂y, and ∂ȳ are defined similarly. Assuming, as in the preceding section, that the finite difference operator is defined in the whole space R² and not only on the grid {x_{ij}}, we have more generally that

BU(x) = f(x),   x ∈ R²,        (3.10)

where the operator B is a sum of translations defined by

BU(x) = Σ_{β∈I_β} b_β U(x − ∆x β).        (3.11)

Here, I_β is a finite subset of Z², i.e. β = (β₁, β₂), where β₁ and β₂ are integers. Upon taking the Fourier transform of (3.11), we obtain

Σ_{β∈I_β} b_β e^{−i∆x β·ξ} Û(ξ) = f̂(ξ).        (3.12)

Defining

R∆(ξ) = Σ_{β∈I_β} b_β e^{−i∆x β·ξ},        (3.13)

we then obtain that the finite difference solution is given in the Fourier domain by

Û(ξ) = f̂(ξ)/R∆(ξ).        (3.14)
The difference between the exact and the discrete solutions is then given by

û(ξ) − Û(ξ) = (1/P(iξ) − 1/R∆(ξ)) f̂(ξ) = ((R∆(ξ) − P(iξ))/(R∆(ξ) P(iξ))) f̂(ξ).        (3.15)

For the second-order scheme introduced above, using the definitions (2.67)-(2.68), which also hold in two space dimensions, we obtain

R∆(ξ) = σ + (2/∆x²)(2 − cos(∆x ξ_x) − cos(∆x ξ_y)).        (3.16)

For |∆x ξ| ≤ π, we deduce from Taylor expansions that

|R∆(ξ) − P(iξ)| ≤ C∆x² |ξ|⁴;

the scheme is indeed second-order. For |∆x ξ| ≥ π, we also obtain that

|R∆(ξ)| ≤ C∆x² |ξ|⁴   and   |P(iξ)| ≤ C∆x² |ξ|⁴.

This implies that

|R∆(ξ) − P(iξ)| ≤ C∆x² |ξ|⁴   for all ξ ∈ R².

Since R∆(ξ) ≥ σ and P(iξ) ≥ C(1 + |ξ|²), we then easily deduce that

|û(ξ) − Û(ξ)| ≤ C∆x² (1 + |ξ|²) |f̂(ξ)|.        (3.17)

The reason why (3.17) holds is that the exact symbol P(iξ) damps high frequencies, i.e., is bounded from below by C(1 + |ξ|²). Let us now square the above inequality and integrate. Upon taking the square root of the integrals and using the Parseval equality (equation (2.53), with √(2π) replaced by 2π in two space dimensions), we obtain that

‖u(x) − U(x)‖ ≤ C∆x² ( ∫_{R²} (1 + |ξ|²)² |f̂(ξ)|² dξ )^{1/2}.        (3.18)

The above integral is bounded provided that f(x) ∈ H²(R²). From this, we deduce that

‖u(x) − U(x)‖ ≤ C∆x² ‖f‖_{H²(R²)}.        (3.19)

This is the final result of this section: provided that the source term is sufficiently regular (its second-order derivatives are square-integrable), the error made by the finite difference approximation is of order ∆x².

Exercise 3.1 Notice that the bound in (3.17) would be replaced by C∆x²(1 + |ξ|⁴) if only the Taylor expansion were used. Show that if the damping of high frequencies is not used in the proof, the final result (3.19) has to be replaced by

‖u(x) − U(x)‖ ≤ C∆x² ‖f‖_{H⁴(R²)}.
Exercise 3.2 (moderately difficult) Show that

|R∆(ξ) − P(iξ)| / (R∆(ξ) P(iξ)) ≤ C∆x^m (1 + |ξ|^m),   for all 0 ≤ m ≤ 2.

[Hint: show that (R∆(ξ))⁻¹ and (P(iξ))⁻¹ are bounded, and separate |ξ|∆x ≥ 1 and |ξ|∆x ≤ 1.] Deduce that

‖u(x) − U(x)‖ ≤ C∆x^m ‖f‖_{H^m(R²)},   for 0 ≤ m ≤ 2.

We deduce from this result that the error u(x) − U(x) still converges to 0 when f is less regular than H²(R²). However, the convergence is slower and no longer of order ∆x².

Exercise 3.3 Implement the finite difference algorithm in Matlab on a square of finite dimension (impose that the solution vanishes at the boundary of the domain). This implementation is not trivial: it requires constructing and inverting a matrix, as was done in (2.79). Solve the diffusion problem with a smooth function f and show that the order of convergence is 2. Show that the order of convergence decreases when f is less regular.
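The exercise above asks for a Matlab implementation; the following is a minimal sketch of the same construction in Python/NumPy. The manufactured solution u = sin(πx)sin(πy) on the unit square (with f = (2π² + σ)u) and the particular grid sizes are assumptions made here so that the second-order convergence can be checked by halving the mesh.

```python
import numpy as np

# Five-point finite differences for -Laplacian(u) + sigma*u = f on (0,1)^2
# with homogeneous Dirichlet boundary conditions.
def solve_fd(M, sigma=1.0):
    h = 1.0 / (M + 1)
    x = h * np.arange(1, M + 1)
    X, Y = np.meshgrid(x, x, indexing="ij")
    u_exact = np.sin(np.pi * X) * np.sin(np.pi * Y)
    f = (2 * np.pi**2 + sigma) * u_exact
    # 1D second-difference matrix; the 2D Laplacian is its Kronecker sum
    D2 = (np.diag(-2 * np.ones(M)) + np.diag(np.ones(M - 1), 1)
          + np.diag(np.ones(M - 1), -1)) / h**2
    I = np.eye(M)
    A = -(np.kron(I, D2) + np.kron(D2, I)) + sigma * np.eye(M * M)
    U = np.linalg.solve(A, f.ravel()).reshape(M, M)
    return np.max(np.abs(U - u_exact))

e1, e2 = solve_fd(10), solve_fd(20)
print(e1, e2, e1 / e2)   # error ratio close to 4: second-order convergence
```

Dense matrices keep the sketch short; a serious implementation would store A in a sparse format, since A has only five nonzero diagonals.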
4 Finite Differences for Hyperbolic Equations

After parabolic and elliptic equations, we turn to the third class of important linear equations: hyperbolic equations. The framework introduced in section 2 still holds, and the mathematical analysis of finite difference schemes for hyperbolic equations is then not very different from that for parabolic equations. Only the results are very different: schemes that may be natural for parabolic equations will no longer be so for hyperbolic equations.

The simplest hyperbolic equation is the one-dimensional transport equation

∂u/∂t + a ∂u/∂x = 0,   t > 0, x ∈ R,
u(0, x) = u₀(x),   x ∈ R.        (4.1)

Here u₀(x) is the initial condition. When a is constant, the solution to the above equation is easily found to be

u(t, x) = u₀(x − at).        (4.2)

This means that the initial data is simply translated with speed a. The solution can also be obtained by Fourier transforms. Indeed we have

∂û/∂t (t, ξ) + iaξ û(t, ξ) = 0,

which gives

û(t, ξ) = û₀(ξ) e^{−iatξ}.        (4.3)

Now, since multiplication by a phase in the Fourier domain corresponds to a translation in the physical domain, we recover (4.2). Notice that (4.3) can be written as (2.40) with P(iξ) = iaξ. Here is an important remark: the symbol P(iξ) is purely imaginary (i.e., its real part vanishes), so it does not have the properties imposed in section 2. This corresponds to a completely different behavior of the PDE solutions: instead of regularizing initial discontinuities, as parabolic equations do, hyperbolic equations translate and possibly modify the shape of discontinuities, but do not regularize them. This behavior will inevitably show up when we discretize hyperbolic equations.

Let us now consider finite difference discretizations of the above transport equation.
The single-step finite difference scheme is given by (2.44); here, we recall that P(iξ) = iaξ, and R∆(ξ) is defined by (2.49). The L² norm of the difference between the exact and approximate solutions is then given by

εⁿ = (1/√(2π)) ‖u(n∆t, x) − Uⁿ(x)‖ = (1/√(2π)) ‖(e^{−n∆t P(iξ)} − R∆ⁿ(ξ)) û₀(ξ)‖.

Stability. One of the main steps in controlling the error term εⁿ is to show that the scheme is stable. By stability, we mean here that

|R∆ⁿ(ξ)| = |R∆(ξ)|ⁿ ≤ C,   for all n∆t ≤ T.
Let us consider the stability of classical schemes. The first idea that comes to mind to discretize (4.1) is to replace the time derivative by ∂t and the spatial derivative by ∂x. Spelling out the details, this gives

U_j^{n+1} = U_j^n − (a∆t/∆x)(U_{j+1}^n − U_j^n).

After Fourier transform (and extension of the scheme to the whole line x ∈ R), we deduce that

R∆(ξ) = 1 − ν(e^{iξ∆x} − 1) = [1 − ν(cos(ξ∆x) − 1)] − iν sin(ξ∆x),

where we have defined

ν = a∆t/∆x.

We deduce that

|R∆(ξ)|² = 1 + 2ν(1 + ν)(1 − cos(ξ∆x)).

When a > 0, which implies that ν > 0, we deduce that the scheme is always unstable, since |R∆(ξ)|² > 1 for cos(ξ∆x) ≤ 0, for instance. When ν < −1, we again deduce that |R∆(ξ)|² > 1, since ν(1 + ν) > 0. It remains to consider −1 ≤ ν ≤ 0. There, ν(1 + ν) ≤ 0, and we then deduce that |R∆(ξ)|² ≤ 1 for all wavenumbers ξ. So the scheme is stable if and only if −1 ≤ ν ≤ 0. Notice, however, that this also implies that a < 0, and that |a|∆t ≤ ∆x: the time step cannot be chosen too large. This is referred to as a CFL condition.

This can be understood as follows. Since the scheme approximates the spatial derivative by ∂x, it is asymmetrical and uses some information to the right (at the point X_{j+1} = X_j + ∆x) of the current point X_j. Information propagates from the right to the left when a < 0, so the scheme then gets the correct information from where it comes. When a > 0, the scheme still "looks" for information coming from the right, whereas physically it comes from the left.

The above result shows that we have to "know" which direction the information comes from to construct our scheme. For one-dimensional problems, this is relatively easy, since only the sign of a is involved. In higher dimensions, knowing where the information comes from is much more complicated. This leads to the definition of the upwind scheme, defined by

U_j^{n+1} = U_j^n − ν(U_j^n − U_{j−1}^n)   when a > 0,
U_j^{n+1} = U_j^n − ν(U_{j+1}^n − U_j^n)   when a < 0.        (4.4)

Exercise 4.1 Check that the upwind scheme is stable provided that |a|∆t ≤ ∆x. Implement the scheme in Matlab.

A tempting solution to the directionality difficulty is to average over the operators ∂x and ∂x̄:

U_j^{n+1} = U_j^n − (ν/2)(U_{j+1}^n − U_{j−1}^n).        (4.5)

Exercise 4.2 Show that the above scheme is always unstable. Verify the instability in Matlab.
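Exercise 4.1 asks for a Matlab implementation; here is a minimal Python/NumPy sketch of the upwind scheme instead. Periodicity of the domain, a = 1, and the smooth initial condition sin(x) are assumptions made here so that the run is self-contained (the notes work on the whole line).

```python
import numpy as np

# Upwind scheme (4.4) for u_t + a u_x = 0 with a > 0, on a periodic grid.
a, N, T = 1.0, 200, 1.0
dx = 2 * np.pi / N
nu = 0.5                        # CFL number a*dt/dx, stable since 0 < nu <= 1
dt = nu * dx / a
x = dx * np.arange(N)
U = np.sin(x)
steps = int(round(T / dt))
for _ in range(steps):
    U = U - nu * (U - np.roll(U, 1))   # U_j^{n+1} = U_j^n - nu (U_j^n - U_{j-1}^n)
exact = np.sin(x - a * steps * dt)
print(np.max(np.abs(U - exact)))       # small but only first-order accurate
```

Running the same loop with nu > 1 (i.e. a∆t > ∆x) shows the blow-up predicted by the CFL condition.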
A more satisfactory answer to this question was given by Lax and Wendroff. The Lax-Wendroff scheme is defined by

U_j^{n+1} = (1/2)ν(1 + ν)U_{j−1}^n + (1 − ν²)U_j^n − (1/2)ν(1 − ν)U_{j+1}^n.        (4.6)

Exercise 4.3 Check that

|R∆(ξ)|² = 1 − 4ν²(1 − ν²) sin⁴(ξ∆x/2).

Deduce that the Lax-Wendroff scheme is stable when |ν| ≤ 1. Implement the Lax-Wendroff scheme in Matlab. Notice that ∆t and ∆x are again related through (4.4).

Consistency. We have introduced two stable schemes: the upwind scheme and the Lax-Wendroff scheme. It remains to analyze their convergence properties. Consistency consists of analyzing the local properties of the discrete scheme and making sure that the scheme captures the main trends of the continuous equation. So, assuming that u(n∆t, X_j) is known at time n∆t, we want to see what it becomes under the discrete scheme. For the upwind scheme (with a > 0 to simplify), we want to analyze (∂t + a∂x̄)u(T_n, X_j). The closer the above quantity is to 0 (which it would be if the discrete scheme were replaced by the continuous equation), the higher the order of convergence of the scheme. This error is obtained by assuming that the exact solution is sufficiently regular and by using Taylor expansions.

Exercise 4.4 Show that

(∂t + a∂x̄)u(T_n, X_j) = −(1/2)(1 − ν) a∆x (∂²u/∂x²)(T_n, X_j) + O(∆x²).

This shows that the upwind scheme is first-order when ν < 1. Notice that first-order approximations involve second-order derivatives of the exact solution. This is similar to the parabolic case and justifies the following definition in the Fourier domain (remember that a second-order derivative in the physical domain means multiplication by −ξ² in the Fourier domain). We say that the scheme is convergent of order m if

|e^{−i∆t P(iξ)} − R∆(ξ)| ≤ C∆t∆x^m (1 + |ξ|^{m+1}),        (4.7)

for ∆t and |∆xξ| sufficiently small (say |∆xξ| ≤ π). Notice the parallel with (2.61): we have simply replaced M by 1. The difference between parabolic and hyperbolic equations and schemes does not come from the consistency conditions, but from the stability conditions: for hyperbolic schemes, we do not have (2.57) or (2.58).

Exercise 4.5 Show that the upwind scheme is first-order and the Lax-Wendroff scheme is second-order.
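As with the upwind scheme, the Matlab implementation requested above can be sketched in Python/NumPy. The periodic domain and the test data are again assumptions of this sketch; comparing the final error with the upwind run on the same grid illustrates the jump from first to second order.

```python
import numpy as np

# Lax-Wendroff scheme (4.6) for u_t + a u_x = 0 on a periodic grid.
a, N, T = 1.0, 200, 1.0
dx = 2 * np.pi / N
nu = 0.5                                  # CFL number, stable since |nu| <= 1
dt = nu * dx / a
x = dx * np.arange(N)
U = np.sin(x)
steps = int(round(T / dt))
for _ in range(steps):
    Um, Up = np.roll(U, 1), np.roll(U, -1)           # U_{j-1}, U_{j+1}
    U = 0.5 * nu * (1 + nu) * Um + (1 - nu**2) * U - 0.5 * nu * (1 - nu) * Up
exact = np.sin(x - a * steps * dt)
err = np.max(np.abs(U - exact))
print(err)   # much smaller than the upwind error on the same grid
```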
Proof of convergence. As usual, stability and consistency of the scheme are the ingredients of its convergence. As in the parabolic case, we treat low and high wavenumbers separately. For low wavenumbers (|∆xξ| ≤ π), we have

ε̂ⁿ(ξ) = (e^{−∆t P(iξ)} − R∆(ξ)) Σ_{k=0}^{n−1} e^{−(n−1−k)∆t P(iξ)} R∆ᵏ(ξ) û₀(ξ).

Here is the main difference with the parabolic case: e^{−(n−1−k)∆t P(iξ)} and R∆ᵏ(ξ) are bounded but not small for high frequencies (notice that |e^{−n∆t P(iξ)}| = 1 for P(iξ) = iaξ). So, using the stability of the scheme, which states that |R∆(ξ)| ≤ 1, we deduce that

|ε̂ⁿ(ξ)| ≤ n |e^{−∆t P(iξ)} − R∆(ξ)| |û₀(ξ)| ≤ C∆x^m (1 + |ξ|^{m+1}) |û₀(ξ)|,

using the order of convergence of the scheme and n∆t ≤ T. For high wavenumbers (|∆xξ| ≥ π), we deduce that

|ε̂ⁿ(ξ)| ≤ 2|û₀(ξ)| ≤ C∆x^m (1 + |ξ|^{m+1}) |û₀(ξ)|.

So the above inequality holds for all frequencies ξ ∈ R. This implies that

εⁿ ≤ C∆x^m ( ∫_R (1 + |ξ|^{2m+2}) |û₀(ξ)|² dξ )^{1/2}.

Assuming that the initial condition u₀ ∈ H^{m+1}, we obtain that εⁿ ≤ C∆x^m ‖u₀‖_{H^{m+1}}. We have then proved the following theorem.

Theorem 4.1 Let us assume that P(iξ) = iaξ and that R∆(ξ) is of order m, i.e. that (4.7) holds. Then, with sufficient regularity of the exact solution, we have

‖u(n∆t, x) − Uⁿ(x)‖ ≤ C∆x^m ‖u₀‖_{H^{m+1}},   for all 0 ≤ n ≤ N = T/∆t.        (4.8)

The main difference with the parabolic case is that u₀ now needs to be in H^{m+1} (and not in H^m) to obtain an accuracy of order ∆x^m. What if u₀ is not in H^{m+1}?

Exercise 4.6 Show that for any scheme of order m, we have

|e^{−i∆t P(iξ)} − R∆(ξ)| ≤ C∆t∆x^{αm} (1 + |ξ|^{α(m+1)}),   for all 0 ≤ α ≤ 1 and |ξ|∆x ≤ π.        (4.9)

Deduce that (4.8) is now replaced by

‖u(n∆t, x) − Uⁿ(x)‖ ≤ C∆x^{αm} ‖u₀‖_{H^{α(m+1)}}.

Deduce that if u₀ is in H^s for 0 ≤ s ≤ m + 1, the error of convergence of the scheme will be of order sm/(m+1). We recall that piecewise constant functions are in the space H^s for all s < 1/2 (take s = 1/2 in the sequel to simplify). Deduce the order of the error of convergence for the upwind scheme and the Lax-Wendroff scheme. Verify this numerically.
5 Spectral methods

5.1 Unbounded grids and semidiscrete FT

The material of this section is borrowed from chapters 1 & 2 in Trefethen [2]. Finite differences were based on approximating ∂/∂x by finite differences of the form ∂x for first-order approximations, (1/2)(∂x + ∂x̄) for a second-order approximation, and so on for higher-order approximations. Spectral methods are based on using an approximation of very high order for the partial derivative. In matrix form, the second-order approximation given by (1.2) and the fourth-order approximation given by (1.3) in [2] are replaced in the "spectral" approximation by the "infinite-order" matrix in (1.4). How is this done? The main ingredient is "interpolation": we interpolate a set of values of a function at the points of a grid by a (smooth) function defined everywhere. We can then obviously differentiate this function and take the values of the differentiated function at the grid points. This is how derivatives are approximated in spectral methods and how (1.4) in [2] is obtained.

Let us be more specific and consider first the infinite grid hZ, with grid points x_j = jh for j ∈ Z and h > 0 given. Let us assume that we are given the values v_j of a certain function v(x) at the grid points, i.e. v(x_j) = v_j. What we want is to define a function p(x) such that p(x_j) = v_j; this function p(x) will then be an interpolant of the series {v_j}. To do so, we define the semidiscrete Fourier transform as

v̂(k) = h Σ_{j=−∞}^{∞} e^{−ikx_j} v_j,   k ∈ [−π/h, π/h].        (5.1)

The above definition is nothing but the familiar Fourier series: the v_j are the Fourier coefficients of the periodic function v̂(k) on [−π/h, π/h]. It is then well known that the function v̂(k) can be reconstructed from its Fourier coefficients, i.e., the transform can be inverted. We define the inverse semidiscrete Fourier transform as

v_j = (1/2π) ∫_{−π/h}^{π/h} e^{ikx_j} v̂(k) dk.        (5.2)

Exercise 5.1 Check that v̂(k) in (5.1) is indeed periodic of period 2π/h.
Notice now that the inverse semidiscrete Fourier transform (a.k.a. Fourier series) in (5.2) is defined for every x_j of the grid hZ, but could be defined more generally for every point x ∈ R, by simply replacing x_j in (5.2) by x:

p(x) = (1/2π) ∫_{−π/h}^{π/h} e^{ikx} v̂(k) dk.        (5.3)

The function p(x) is a very nice function. It can be shown that it is analytic (which means that the infinite Taylor expansion of p(z) in the vicinity of every complex number z converges to p, so the function p is infinitely many times differentiable), and by definition it satisfies p(x_j) = v_j. This is how we define our interpolant.

The spectral approximation simply consists of defining

w_j = p′(x_j),        (5.4)

where p(x) is defined by (5.3), as an approximation of the derivative of v given by the v_j at the grid points x_j. Notice that w_j a priori depends on all the values v_j. This is why the matrix (1.4) in [2] is infinite and full. How do we calculate this matrix? Here we use the linearity of the differentiation: if v_j = v_j⁽¹⁾ + v_j⁽²⁾, then we clearly have that w_j = w_j⁽¹⁾ + w_j⁽²⁾. So all we need to do is to consider the derivative of the function v_j such that v_n = 1 for some n ∈ Z and v_m = 0 for m ≠ n. We then obtain that

v̂_n(k) = h e^{−inkh},

and that

p_n(x) = (h/2π) ∫_{−π/h}^{π/h} e^{ik(x−nh)} dk = S_h(x − nh),

where S_h(x) is the sinc function

S_h(x) = sin(πx/h)/(πx/h).        (5.5)

More generally, we obtain that the interpolant p(x) is given by

p(x) = Σ_{j=−∞}^{∞} v_j S_h(x − jh).        (5.6)

An easy calculation shows that

S_h′(jh) = 0 for j = 0,   S_h′(jh) = (−1)^j/(jh) for j ≠ 0.        (5.7)

This implies that

p_n′(x_j) = 0 for j = n,   p_n′(x_j) = (−1)^{j−n}/((j − n)h) for j ≠ n.

These are precisely the values of the entries (n, j) of the matrix in (1.4) in [2].

Exercise 5.2 Check this.

Once we have the interpolant p(x), we can differentiate it as many times as necessary. Notice that the same technique can be used to define approximations to derivatives of arbitrary order. For instance, the matrix corresponding to second-order differentiation is given in (2.14) of [2].

5.2 Periodic grids and Discrete FT

The simplest way to replace the infinite-dimensional matrices encountered in the preceding section is to assume that the initial function v is periodic. The function is then represented by its values at the points x_j = 2πj/N for some N ∈ N, so that h = 2π/N. To simplify, we assume that N is even; odd values of N are treated similarly. Here, the function is not only periodic, but also only given on the grid (x_1, ..., x_N). We know that periodic functions are represented by Fourier series; since the function is only given on the grid, it turns out that its "Fourier series" is finite, but the formulas are slightly different. This is how we define the Discrete Fourier transform

v̂_k = h Σ_{j=1}^{N} e^{−ikx_j} v_j,   k = −N/2 + 1, ..., N/2.        (5.8)

The Inverse Discrete Fourier Transform is then given by

v_j = (1/2π) Σ_{k=−N/2+1}^{N/2} e^{ikx_j} v̂_k,   j = 1, ..., N.        (5.9)

The latter is recast as

v_j = (1/2π) Σ″_{k=−N/2}^{N/2} e^{ikx_j} v̂_k,   j = 1, ..., N.        (5.10)

Here we define v̂_{−N/2} = v̂_{N/2}, and Σ″ means that the terms k = ±N/2 are multiplied by 1/2. The latter definition is more symmetrical than the former, although both are obviously equivalent. As we did for infinite grids, we can extend (5.10) to arbitrary values of x and not only the values on the grid. The reason we introduce (5.10) is that the two definitions do not then yield the same interpolant; the interpolant based on (5.10) is more symmetrical. The discrete interpolant is now defined by

p(x) = (1/2π) Σ″_{k=−N/2}^{N/2} e^{ikx} v̂_k,   x ∈ [0, 2π].        (5.11)
Notice that this interpolant is obviously 2π-periodic, and that it is a trigonometric polynomial of order (at most) N/2. The interpolations of the semidiscrete and discrete Fourier transforms have very similar properties. We can now easily obtain a spectral approximation of v′(x) at the grid points: we simply differentiate p(x) and take the values of the derivative at the grid points. Again, this corresponds to w_j = p′(x_j).

5.3 Fast Fourier Transform (FFT)

In the preceding subsections, we have seen how to calculate spectral derivatives w_j at the grid points of a function defined by the values v_j at the same grid points. We have also introduced matrices that map v_j to w_j. Our derivation was done in the physical domain, in the sense that we have constructed the interpolant p(x), taken its derivative, and evaluated the derivative p′(x_j) at the grid points. Another way to consider the derivation is to see what it means to "take a derivative" in the Fourier domain. Remember that (classical) Fourier transforms replace derivation by multiplication by iξ. For the semidiscrete Fourier transform, we deduce from (5.3) that

p′(x) = (1/2π) ∫_{−π/h}^{π/h} e^{ikx} ik v̂(k) dk.

So again, in the Fourier domain, differentiation really means replacing v̂(k) by ik v̂(k). For the discrete Fourier transform, we deduce from (5.11) that

p′(x) = (1/2π) Σ_{k=−N/2+1}^{N/2−1} e^{ikx} ik v̂_k.

Notice that by definition we have v̂_{−N/2} = v̂_{N/2}, so that i(−N/2)v̂_{−N/2} + i(N/2)v̂_{N/2} = 0: the contributions of the modes k = ±N/2 cancel. This means that the Fourier coefficients v̂_k are indeed replaced by ik v̂_k when we differentiate p(x), except for the value k = N/2, for which the Fourier coefficient vanishes. So, to calculate w_j from v_j in the periodic case, we have the following procedure:

1. Calculate v̂_k from v_j for −N/2 + 1 ≤ k ≤ N/2 (Discrete Fourier transform).
2. Define ŵ_k = ik v̂_k for −N/2 + 1 ≤ k ≤ N/2 − 1 and ŵ_{N/2} = 0.
3. Calculate w_j from ŵ_k by Inverse Discrete Fourier transform.

Said in other words, the above procedure is the discrete analog of the formula

f′(x) = F⁻¹_{ξ→x}[ iξ F_{x→ξ}[f(x)] ].

This is what the FFT implementation is based on.

Exercise 5.3 Calculate the N×N matrix D_N which maps the values v_j of the function v(x) to the values w_j = p′(x_j).

The result of the calculation is given on p. 21 in [2]. On a computer, the multiplication by ik (step 2) is fast. The main computational difficulties come from the implementation of the discrete Fourier transform and its inversion (steps 1 and 3). A very efficient implementation of steps 1 and 3 is called the FFT, the Fast Fourier Transform. We are not going to describe how the FFT works; let us merely state that it is a very convenient alternative to constructing the matrix D_N to calculate the spectral derivative. It is an algorithm that performs steps 1 and 3 in O(N log N) operations. This has to be compared with a number of operations of order O(N²) if the formulas (5.8) and (5.9) are used directly. We refer to [2], Programs 4 and 5, for details.

Exercise 5.4 Check the order O(N²).

5.4 Spectral approximation for unbounded grids

We have introduced several interpolants to obtain approximations of the derivatives of functions defined by their values on a (discrete) grid. It remains to know what sort of error one makes by introducing such interpolants. Chapter 4 in [2] gives some answer to this question. Here we shall give a slightly different answer, based on the Sobolev scale introduced in the analysis of finite differences.

Let u(x) be a function defined for x ∈ R and define v_j = u(x_j). The Fourier transform of u(x) is denoted by û(k). We then define the semidiscrete Fourier transform v̂(k) by (5.1) and the interpolant p(x) by (5.3). The question is then: what is the error between u(x) and p(x)? The answer relies on one thing: how regular u(x) is. Smooth functions u(x) correspond to functions û(k) that decay fast in k. The analysis of the error is based on estimates in the Fourier domain.

The first theoretical ingredient, which is important in its own right, is the aliasing formula, also called the Poisson summation formula.

Theorem 5.1 Let u ∈ L²(R) be sufficiently smooth. Then for all k ∈ [−π/h, π/h],

v̂(k) = Σ_{j=−∞}^{∞} û(k + 2πj/h).        (5.12)

In other words, the aliasing formula relates the semidiscrete Fourier transform v̂(k) to the continuous Fourier transform û(k).
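Before proving the theorem, formula (5.12) can be checked numerically. The Gaussian test function u(x) = exp(−x²/2), whose Fourier transform (with the convention of the text) is û(k) = √(2π) exp(−k²/2), and the truncation ranges are assumptions made for this sketch; both infinite sums converge so fast for a Gaussian that the truncation error is negligible.

```python
import numpy as np

# Numerical check of the aliasing / Poisson summation formula (5.12).
def uhat(k):
    return np.sqrt(2 * np.pi) * np.exp(-k**2 / 2)

h = 0.5
j = np.arange(-40, 41)                   # truncation of the infinite sums

k = 1.3                                  # any k in [-pi/h, pi/h]
# semidiscrete Fourier transform (5.1) of the samples u(jh)
vhat = h * np.sum(np.exp(-1j * k * j * h) * np.exp(-(j * h) ** 2 / 2))
# right-hand side of (5.12): periodized continuous Fourier transform
alias = np.sum(uhat(k + 2 * np.pi * j / h))
print(abs(vhat - alias))                 # agreement to machine precision
```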
Derivation of the aliasing formula. We write

v̂(k) = h Σ_{j=−∞}^{∞} e^{−ikhj} v_j = Σ_{j=−∞}^{∞} ∫_R û(k′) e^{ihj(k′−k)} (h dk′/2π) = ∫_R û(k + 2πl/h) Σ_{j=−∞}^{∞} e^{i2πlj} dl,

using the change of variables h(k′ − k) = 2πl, so that h dk′ = 2π dl. We will show that

Σ_{j=−∞}^{∞} e^{i2πlj} = Σ_{j=−∞}^{∞} δ(l − j).        (5.13)

Assuming that this equality holds, we then obtain that

v̂(k) = Σ_{j=−∞}^{∞} ∫_R û(k + 2πl/h) δ(l − j) dl = Σ_{j=−∞}^{∞} û(k + 2πj/h),

which is what was to be proved. It thus remains to show (5.13); we'll do it in the L² sense. Here is a simple explanation. Take the Fourier series of a 1-periodic function f(x),

c_n = ∫_{−1/2}^{1/2} e^{−i2πnx} f(x) dx.

Then the inversion is given by

f(x) = Σ_{n=−∞}^{∞} e^{i2πnx} c_n.

Now take f(x) to be the delta function f(x) = δ(x). This is justified as the delta function is concentrated at one point. We then deduce that c_n = 1 and that, for x ∈ (0, 1),

Σ_{n=−∞}^{∞} e^{i2πnx} = δ(x).

Let us now extend both sides of the above relation by periodicity to the whole line R. The left-hand side is 1-periodic in x and does not change. The right-hand side now takes the form

Σ_{j=−∞}^{∞} δ(x − j).

This is nothing but the periodic delta function. This proves (5.13).

Now that we have the aliasing formula, we can analyze the difference u(x) − p(x), since

‖u(x) − p(x)‖_{L²(R)} = (1/√(2π)) ‖û(k) − p̂(k)‖_{L²(R)}.        (5.14)

We now show the following result:

Theorem 5.2 Let us assume that u(x) ∈ H^m(R) for m > 1/2. Then we have

‖u(x) − p(x)‖_{L²(R)} ≤ C h^m ‖u‖_{H^m(R)},        (5.15)

where the constant C is independent of h and u.

Proof. We first realize that

p̂(k) = χ(k) v̂(k)

by construction, where χ(k) = 1 if k ∈ [−π/h, π/h] and χ(k) = 0 otherwise is the characteristic function of the interval [−π/h, π/h]. In other words, p(x) is band-limited: p̂(k) has compact support inside [−π/h, π/h], i.e., p has no high frequencies. If û(k) has compact support inside [−π/h, π/h], then p̂(k) = û(k) by (5.12), hence p(x) = u(x) and the interpolant is exact. When û(k) does not vanish outside [−π/h, π/h], i.e. when high-frequency content is present, we have to use the regularity of the function u(x) to control the error. This is what we now do. Using (5.12), we deduce that

∫_R |û(k) − p̂(k)|² dk = ∫_R (1 − χ(k)) |û(k)|² dk + ∫_{−π/h}^{π/h} | Σ_{j∈Z*} û(k + 2πj/h) |² dk.

We have denoted by Z* = Z \ {0}. The first term is easy to deal with. This is simply based on realizing that |hk| ≥ π when 1 − χ(k) ≠ 0. Indeed, we have

∫_R (1 − χ(k)) |û(k)|² dk ≤ C h^{2m} ∫_R (1 + |k|^{2m}) |û(k)|² dk ≤ C h^{2m} ‖u‖²_{H^m(R)}.

The second term is a little more painful. Looking carefully at the above terms, one realizes that this error only involves û(k) for values of k outside the interval [−π/h, π/h]. The Cauchy-Schwarz inequality tells us that

| Σ_j a_j b_j |² ≤ (Σ_j |a_j|²)(Σ_j |b_j|²).

This is a mere generalization of |x·y| ≤ |x||y|. Let us choose

a_j = û(k + 2πj/h)(1 + |k + 2πj/h|)^m,   b_j = (1 + |k + 2πj/h|)^{−m}.

We thus obtain that

| Σ_{j∈Z*} û(k + 2πj/h) |² ≤ ( Σ_{j∈Z*} |û(k + 2πj/h)|² (1 + |k + 2πj/h|)^{2m} ) ( Σ_{j∈Z*} (1 + |k + 2πj/h|)^{−2m} ).

Notice that for |k| ≤ π/h we have |k + 2πj/h| ≥ (2|j| − 1)π/h, so that

Σ_{j∈Z*} (1 + |k + 2πj/h|)^{−2m} ≤ h^{2m} Σ_{j∈Z*} (π(2|j| − 1))^{−2m}.

For m > 1/2, the above sum is convergent, so that

Σ_{j∈Z*} (1 + |k + 2πj/h|)^{−2m} ≤ C h^{2m}   for |k| ≤ π/h,

where C depends on m but not on u or h.
This implies that

∫_{−π/h}^{π/h} | Σ_{j∈Z*} û(k + 2πj/h) |² dk ≤ C h^{2m} ∫_{−π/h}^{π/h} Σ_{j∈Z*} |û(k + 2πj/h)|² (1 + |k + 2πj/h|)^{2m} dk ≤ C h^{2m} ∫_R |û(k)|² (1 + |k|)^{2m} dk = C h^{2m} ‖u‖²_{H^m(R)}.

This concludes the proof of the theorem.

Let us emphasize something we saw in the course of the above proof. When u(x) is band-limited, so that û(k) = 0 outside [−π/h, π/h], we have seen that the interpolant is exact, namely p(x) = u(x). Moreover, we have seen that the interpolant is determined by the values v_j = u(x_j):

p(x) = Σ_{n=−∞}^{∞} v_n S_h(x − nh),

where S_h is the sinc function defined in (5.5). This proves the extremely important Shannon-Whittaker sampling theorem:

Theorem 5.3 Let us assume that the function u(x) is such that û(k) = 0 for k outside [−π/h, π/h]. Then u(x) is completely determined by its values v_j = u(hj) for j ∈ Z. More precisely, we have

u(x) = Σ_{j=−∞}^{∞} u(hj) S_h(x − jh),   S_h(x) = sin(πx/h)/(πx/h).        (5.16)

This result has quite important applications in communication theory: it means that a continuous signal can be compressed onto a discrete grid with no error. Assuming that a signal has no high frequencies (such as in speech, since the human ear cannot detect frequencies above 25 kHz), there is an adapted sampling value h such that the signal can be reconstructed exactly from the values it takes on the grid hZ. When h is chosen too large, so that frequencies above π/h are present in the signal, then the interpolant is not exact, and the compression of the signal has some error. This error is precisely called the aliasing error and is quantified by Theorem 5.2.

Let us finish this section with an analysis of the error between the derivatives of u(x) and those of p(x). Remember that our initial objective is to obtain an approximation of u′(x_j) by using p′(x_j). Minor modifications in the proof of Theorem 5.2 yield the following result:

Theorem 5.4 Let us assume that u(x) ∈ H^m(R) for m > 1/2 and let n ≤ m. Then we have

‖u⁽ⁿ⁾(x) − p⁽ⁿ⁾(x)‖_{L²(R)} ≤ C h^{m−n} ‖u‖_{H^m(R)},        (5.17)

where the constant C is independent of h and u.

The proof of this theorem consists of realizing that

‖u⁽ⁿ⁾(x) − p⁽ⁿ⁾(x)‖_{L²(R)} = (1/√(2π)) ‖(ik)ⁿ (û(k) − p̂(k))‖_{L²(R)}.

Moreover, we have

∫_R |k|^{2n} |û(k) − p̂(k)|² dk = ∫_R |k|^{2n} (1 − χ(k)) |û(k)|² dk + ∫_{−π/h}^{π/h} |k|^{2n} | Σ_{j∈Z*} û(k + 2πj/h) |² dk.

The first term on the right-hand side is bounded by

h^{2(m−n)} ∫_R (1 + |k|^{2m}) |û(k)|² dk = h^{2(m−n)} ‖u‖²_{H^m(R)},

which is bounded. The second term is bounded by

C h^{−2n} ∫_{−π/h}^{π/h} | Σ_{j∈Z*} û(k + 2πj/h) |² dk ≤ C h^{−2n} h^{2m} ‖u‖²_{H^m(R)},

as we saw in the proof of Theorem 5.2, since |k| ≤ π/h in this integral. This proves the theorem.
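The Shannon-Whittaker formula (5.16) can be checked numerically. The test function u(x) = (sin x / x)², whose Fourier transform is supported in [−2, 2] (the triangle function), the sampling step h = π/2 (so that π/h equals the band limit), and the truncation of the infinite sum are all assumptions made for this sketch.

```python
import numpy as np

# Numerical check of the Shannon-Whittaker reconstruction (5.16) for a
# band-limited function.  np.sinc(t) = sin(pi t)/(pi t), so that
# S_h(x) = np.sinc(x / h) and u(x) = np.sinc(x / np.pi)**2 = (sin x / x)^2.
def u(x):
    return np.sinc(x / np.pi) ** 2

h = np.pi / 2                      # pi/h = 2 covers the band of u
j = np.arange(-500, 501)           # truncation of the infinite sum
x0 = 0.3                           # an off-grid evaluation point
p = np.sum(u(j * h) * np.sinc((x0 - j * h) / h))   # sum of u(jh) S_h(x0 - jh)
print(abs(p - u(x0)))              # tiny: the interpolant is exact up to truncation
```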
This result is important for us: it shows that the error one makes by approximating the n-th order derivative of u(x) by p⁽ⁿ⁾(x) is of order h^{m−n}, where m is the H^m regularity of u. So when u is very smooth, p⁽ⁿ⁾(x) converges extremely fast to u⁽ⁿ⁾(x) as h → 0. This is the greatest advantage of spectral methods.

5.5 Application to PDE's

An elliptic equation. Consider the simplest of elliptic equations,

−u″(x) + σu(x) = f(x),   x ∈ R,        (5.18)

where the absorption σ > 0 is a constant positive parameter. Define α = √σ. We verify that

G(x) = (1/(2α)) e^{−α|x|}

is the Green function of the above problem, whose solution is thus

u(x) = ∫_R G(x − y) f(y) dy = (1/(2α)) ∫_R e^{−α|x−y|} f(y) dy.        (5.19)

Note that this may also be obtained in the Fourier domain, where

û(ξ) = f̂(ξ)/(σ + ξ²).        (5.20)

Discretizing the above equation using the second-order finite difference method,

−(U(x + h) + U(x − h) − 2U(x))/h² + σU(x) = f(x),        (5.21)

we obtain, as in section 3, an error estimate of the form

‖u − U‖_{L²} ≤ C h² ‖f‖_{H²}.        (5.22)
e.2. (5. see how ∆t and h have to be chosen so that the discrete solution does not blow 45 . Here m is arbitrary provided that f is arbitrarily smooth.Let us now compare this result to the discretization based on the spectral method. We may now calculate the secondorder derivative of p(x) (see the following paragraph for an explicit expression). where N is the matrix of diﬀerentiation. Now p(x) and F (x) only depend on the values they take at hj for j ∈ Z. A parabolic equation. So the Euler explicit scheme is given by ∞ Ujn+1 − Ujn n = Sh (kh)Uj−k . Looking at this in the Fourier domain (and coming back) we easily deduce that u−p H 2 (R) ≤C f −F L2 (R) ≤ Chm f H m (R) .25) It now remains to see how this scheme behaves. The equation we now use to deduce p(x) from F (x) is the equation (5. Let us consider the Euler explicit scheme based on a spectral approximation of the spatial derivatives. Let us now apply the above theory to the computation of solutions of the heat equation ∂u ∂ 2 u − 2 = 0. We introduce Uj the approximate solution at time Tn = n∆t. Note that this problem may be solved by inverting a matrix of the form −N 2 + σI.18): −p (x) + σp(x) = F (x). Let p(x) be the spectral interpolant of u(x) and F (x) the spectral interpolant of the source term f (x). x ∈ R. We shall only consider stability here (i. ∂t ∂x n on the real line R. Spectral methods are based on replacing functions by their spectral interpolants. 1 ≤ n ≤ N and point xj = hj for j ∈ Z. When the source terms and the solutions of the PDEs we are interested in are smooth (spatially). so that the approximation to the second derivative is given by ∞ ∞ (p ) (kh) = j=−∞ n Sh ((k − j)h)Ujn = j=−∞ n Sh (jh)Uk−j . The error u − p now solves the equation −(u − p) (x) + σ(u − p) = (f − F )(x). The interpolant at time Tn is given by ∞ p (x) = j=−∞ n Ujn Sh (x − jh).24) where the latter equality comes from Theorem 5. We thus obtain that the spectral method has an “inﬁnite” order of accuracy. 
the convergence properties of spectral methods are much better than those of ﬁnite diﬀerence methods. ∆t k=−∞ (5. unlike ﬁnite diﬀerence methods.
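As a minimal numerical illustration of this comparison, the following sketch solves the elliptic problem (5.18) spectrally. A periodic grid (so that the FFT applies) and the manufactured solution u = sin(3x), for which f = (σ + 9)u, are assumptions made here; the text itself works on the whole line.

```python
import numpy as np

# Spectral solution of -u'' + sigma*u = f on a 2*pi-periodic grid:
# in the Fourier domain the equation becomes (sigma + k^2) u_hat = f_hat,
# so the "inversion" is a mode-by-mode division.
N, sigma = 64, 1.0
x = 2 * np.pi * np.arange(N) / N
k = np.fft.fftfreq(N, d=1.0 / N)          # integer wavenumbers
u_exact = np.sin(3 * x)                   # assumed smooth test solution
f = (sigma + 9.0) * u_exact               # f = -u'' + sigma*u for u = sin(3x)
u = np.real(np.fft.ifft(np.fft.fft(f) / (sigma + k**2)))
print(np.max(np.abs(u - u_exact)))        # machine precision: spectral accuracy
```

The finite difference solve (5.21) on the same grid would leave an O(h²) error; here the only error is roundoff, because the data is band-limited on the grid.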
This is plausible from the analysis of the error made by replacing a function by its interpolant that we made earlier.28) stays between −1 and 1 for all frequencies ξ. h2 k 2 Exercise 5.27) So the stability of the scheme boils down to making sure that ∞ R(ξ) = 1 + ∆t k=−∞ Sh (kh)e−ikhξ . 1/2] as ei2πlj = δ(l) − 1. k2 δ(l − j). The analysis is more complicated than for ﬁnite diﬀerences because of the inﬁnite sum that appears in the deﬁnition of R.5 Check this. It turns out that we can have access to an exact formula. So we can recast R(ξ) as R(ξ) = 1 − ∆t Exercise 5. As we did for ﬁnite diﬀerences. We now need to understand the remaining inﬁnite sum. which we recast for l ∈ [−1/2. We ﬁrst realize that −π 2 k=0 3h2k+1 Sh (kh) = (−1) 2 k = 0. It is based on using the Poisson formula ei2πlj = j j π2 2∆t − 2 2 3h h k=0 ei(hξ−π)k . (5. we can pass to the Fourier domain and obtain that ∞ ˆ U n+1 (ξ) = 1 + ∆t k=−∞ ˆ Sh (kh)e−ikhξ U n (ξ). Let us come back to the stability issue.6 Check this.26) Once we have realized this. we realize that we can look at a scheme deﬁned for all x ∈ R and not only x = xj : U n+1 (x) − U n (x) = Sh (kh)U n (x − hk). As far as convergence is concerned. ∆t k=−∞ ∞ (5. Here is a way to analyze R(ξ). (5. the scheme will be ﬁrst order in time (accuracy of order ∆t) and will be of spectral accuracy (inﬁnite order) provided that the initial solution is suﬃciently smooth. j=0 46 .up as n → T /∆t).
Notice that both terms on the right-hand side have zero mean on [−1/2, 1/2] (i.e. there is no 0 frequency). We now integrate these terms in l. This is justified as the delta function is concentrated at one point: its integral is a piecewise linear function (involving H(l) − 1/2 − l) and is discontinuous, and its second integral is continuous (and piecewise quadratic). We obtain that

Σ_{j≠0} e^{i2πlj}/(i2πj) = H(l) − l − d₀,

where H(l) is the Heaviside function, such that H(l) = 1 for l > 0 and H(l) = 0 for l < 0. The constant d₀ is fixed by making sure that the above right-hand side remains 1-periodic (as is the left-hand side). Another way of looking at it is to make sure that both sides have zero mean on (−1/2, 1/2). We thus find that d₀ = 1/2.

Exercise 5.7 Check this.

Let us integrate one more time. We now obtain that

Σ_{j≠0} e^{i2πlj}/(−4π²j²) = lH(l) − l²/2 − l/2 − d₁,

where d₁ is again chosen so that the right-hand side has zero mean. We obtain that d₁ = 1/12.

Exercise 5.8 Check this.

To sum up our calculation, we have obtained that

Σ_{j≠0} e^{i2πlj}/j² = 4π² (l²/2 + 1/12 − l(H(l) − 1/2)).

This shows that

R(ξ) = 1 − (Δt/h²) [π²/3 + 8π² (l²/2 + 1/12 − l(H(l) − 1/2))],   l = hξ/(2π) − 1/2.   (5.29)

A plot of the function l → l²/2 + 1/12 − l(H(l) − 1/2) is given in Fig. 5.1. Notice that this function is 1-periodic and continuous.

Exercise 5.9 Check that l(H(l) − 1/2) = |l|/2. Use (5.28) to show that R(ξ) = R(−ξ) and that R(ξ) is 2π/h-periodic. Deduce that, for 0 ≤ ξ ≤ π/h (so that −1/2 ≤ l ≤ 0),

R(ξ) = 1 − (Δt π²/h²)(1 + 2l)² = 1 − Δt ξ².

[Check the above for hξ ≤ π and use the symmetry and periodicity in ξ.]
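The closed form of the last exercise, R(ξ) = 1 − Δt ξ² for |ξ| ≤ π/h, can be checked against a direct (truncated) evaluation of the infinite sum in (5.28). The sketch below is mine, not part of the notes; the parameter values are arbitrary, and the truncation level K only needs to be large enough that the 1/k² tail is negligible.

```python
import numpy as np

# Truncated evaluation of R(xi) = 1 + dt * sum_k S_h''(kh) e^{-i k h xi},
# compared with the closed form 1 - dt*xi^2 valid for |xi| <= pi/h.
h, dt = 0.1, 1e-3
K = 20000
k = np.arange(1, K + 1)
c_k = 2.0 * (-1.0) ** (k + 1) / (h**2 * k**2)    # S_h''(kh) = S_h''(-kh)

def R(xi):
    # Pair the +k and -k terms: e^{-ikhxi} + e^{+ikhxi} = 2 cos(k h xi).
    return 1.0 + dt * (-np.pi**2 / (3 * h**2)
                       + np.sum(2.0 * c_k * np.cos(k * h * xi)))

xi = 10.0                                        # below the cutoff pi/h ~ 31.4
closed_form = 1.0 - dt * xi**2
```

A useful side check is R(0): the coefficients S_h''(kh) sum to zero (the second derivative of a constant vanishes), so R(0) = 1.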
[Figure 5.1: Plot of the function l → f(l) = l²/2 + 1/12 − l(H(l) − 1/2), the second periodic integral of the periodic delta function.]

The function f(l) takes its maximal value 1/12 at l = 0 and its minimal value −1/24 at l = ±1/2. From (5.29) we thus deduce that

1 − π² Δt/h² ≤ R(ξ) ≤ 1,

and that the minimal value is reached for l = 0, i.e. for frequencies

ξ = (2π/h)(m + 1/2),   m ∈ Z,

while the maximal value is reached for l = ±1/2, i.e. for frequencies

ξ = 2πm/h,   m ∈ Z.

Notice that R(0) = 1 as it should, and that R(π/h) = 1 − π²Δt/h², which is a quite bad approximation of e^{−Δtπ²/h²} for the exact solution. But as usual, we do not expect frequencies of order h⁻¹ to be well approximated on a grid of size h: R(ξ) is a periodized version of the symbol 1 − Δtξ².

Since we need |R(ξ)| ≤ 1 for all frequencies to obtain a stable scheme, we see that

λ = Δt/h² ≤ 2/π²   (5.30)

is necessary to obtain a convergent scheme as h → 0 with λ fixed.
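The extreme values of f(l) quoted above, and the resulting stability threshold λ ≤ 2/π², can be verified numerically. This short sketch (mine, not from the notes) samples f on a grid containing l = 0 and l = ±1/2 and checks that the worst amplification 1 − π²λ hits −1 exactly at λ = 2/π².

```python
import numpy as np

# The function plotted in Fig. 5.1 and the resulting stability bound (5.30).
l = np.linspace(-0.5, 0.5, 1001)          # grid contains 0 and +-1/2 exactly
H = np.where(l > 0, 1.0, 0.0)             # Heaviside function
f = l**2 / 2 + 1.0 / 12 - l * (H - 0.5)

f_max, f_min = f.max(), f.min()           # expected: 1/12 at l=0, -1/24 at l=+-1/2

# R ranges over [1 - pi^2*lambda, 1] with lambda = dt/h^2, so |R| <= 1
# forces lambda <= 2/pi^2; at the limit the lowest amplification is -1.
lam_max = 2.0 / np.pi**2
worst_R = 1.0 - np.pi**2 * lam_max
```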
6 Introduction to finite element methods

We now turn to one of the most useful methods in numerical simulations: the finite element method. This will be very introductory. The material mostly follows the presentation in [1].

6.1 Elliptic equations

Let Ω be a convex open domain in R² with boundary ∂Ω. We will assume that ∂Ω is either sufficiently smooth or a polygon. On Ω we consider the boundary value problem

−Δu(x) + σ(x)u(x) = f(x),   x = (x₁, x₂) ∈ Ω,
u(x) = 0,   x ∈ ∂Ω.   (6.1)

The Laplace operator Δ is defined as

Δ = ∂²/∂x₁² + ∂²/∂x₂².

We assume that the absorption coefficient σ(x) ∈ L^∞(Ω) satisfies the constraint

σ(x) ≥ σ₀ > 0,   x ∈ Ω̄.   (6.2)

We also assume that the source term f ∈ L²(Ω) (i.e., is square integrable). This subsection is devoted to showing the existence of a solution to the above problem and recalling some elementary properties that will be useful in the analysis of discretized problems.

Let us first assume that there exists a sufficiently smooth solution u(x) to (6.1). Let v(x) be a sufficiently smooth test function defined on Ω and such that v(x) = 0 for x ∈ ∂Ω. Upon multiplying (6.1) by v(x) and integrating over Ω, we obtain by integrations by parts using Green's formula that

a(u, v) := ∫_Ω (∇u·∇v + σuv) dx = ∫_Ω f v dx := L(v).   (6.3)

Exercise 6.1 Prove the above formula.

Let us introduce a few Hilbert spaces. For k ∈ N, we define

H^k(Ω) ≡ H^k = {v ∈ L²(Ω), D^α v ∈ L²(Ω) for all |α| ≤ k}.   (6.4)

We recall that for a multi-index α = (α₁, α₂) (in two space dimensions) with α_k ∈ N, 1 ≤ k ≤ 2, we denote |α| = α₁ + α₂ and

D^α = ∂^{|α|}/(∂x₁^{α₁} ∂x₂^{α₂}).

Thus v ∈ H^k ≡ H^k(Ω) if all its partial derivatives of order up to k are square integrable in Ω. The Hilbert space is equipped with the norm

‖u‖_{H^k} = (Σ_{|α|≤k} ‖D^α u‖²_{L²})^{1/2}.   (6.5)
We also define the semi-norm on H^k:

|u|_{H^k} = (Σ_{|α|=k} ‖D^α u‖²_{L²})^{1/2}.   (6.6)

Note that the above semi-norm is not a norm: all polynomials p_{k−1} of order at most k − 1 are such that |p_{k−1}|_{H^k} = 0. Because of the choice of boundary conditions in (6.1), we also need to introduce the Hilbert space

H¹₀(Ω) = {v ∈ H¹(Ω), v|_{∂Ω} = 0},   (6.7)

equipped with the norm ‖·‖_{H¹}. We can show that the above space is a closed subspace of H¹(Ω). Showing this is not completely trivial and requires understanding the operator that restricts a function defined on Ω to the values it takes at the boundary ∂Ω. This requires some care: for instance, "the restriction to ∂Ω of a function in L²(Ω)" means nothing. This is because L² functions are defined up to modifications on sets of measure zero (for the Lebesgue measure), and ∂Ω is precisely a set of measure zero. The restriction is defined for sufficiently smooth functions only (being an element of H¹ is sufficient, though not necessary).

We can show that (6.3) holds for any function v ∈ H¹₀(Ω). This allows us to "replace" the problem (6.1) by the new problem in variational form:

Find u ∈ H¹₀(Ω) such that for all v ∈ H¹₀(Ω),   a(u, v) = L(v).   (6.8)

Theorem 6.1 There is a unique solution to the above problem (6.8).

The main ingredients of the proof are the following. The bilinear form a(u, v) is a symmetric bilinear form on the Hilbert space V = H¹₀(Ω): it is clearly linear in u and v, and one verifies that a(u, v) = a(v, u). The bilinear form a(u, v) is also positive definite, in the sense that a(v, v) > 0 for all v ∈ V such that v ≠ 0. Indeed, a(u, v) is even better than this: it is coercive in V, in the sense that there exists α > 0 such that

a(u, u) ≥ α ‖u‖²_V,   u ∈ V.   (6.9)

Indeed, from the definition of a(u, v) and the constraint on σ, we can choose α = min(1, σ₀). Since a(u, v) is a symmetric, positive definite, bilinear form on V, it defines an inner product on V (by definition of an inner product). The norm associated to this inner product is given by

‖u‖_a = a^{1/2}(u, u).   (6.10)

The theorem is a consequence of the Riesz representation theorem, which precisely states the following: let V be a Hilbert space with an inner product a(·, ·); then for each bounded linear form L on V, there is a unique u ∈ V such that L(v) = a(u, v) for all v ∈ V. Theorem 6.1 is thus equivalent to showing that L(v) is bounded on V for the norm ‖·‖_a.
Boundedness of L means the existence of a constant ‖L‖_{V*} such that L(v) ≤ ‖L‖_{V*} a^{1/2}(v, v) for all v ∈ V. However, we do have

L(v) ≤ ‖f‖_{L²} ‖v‖_{L²} = (‖f‖_{L²}/√σ₀)(√σ₀ ‖v‖_{L²}) ≤ (‖f‖_{L²}/√σ₀) a^{1/2}(v, v).   (6.11)

This yields the bound for L and the existence of a solution to (6.8).

Now assume that u and w are solutions. Then by taking differences in (6.8), we deduce that a(u − w, v) = 0 for all v ∈ V. Choose now v = u − w to get that a(u − w, u − w) = 0. Since a is positive definite, this implies that u − w = 0, whence the uniqueness of the solution.

Remark 6.2 The solution to the above problem admits another interpretation. Indeed let us define the functional

J(u) = (1/2) a(u, u) − L(u).   (6.12)

We can then show that u is a solution to (6.8) if and only if u is the minimum of J(u) on V = H¹₀.

Remark 6.3 Theorem 6.1 admits a very simple proof based on the Riesz representation theorem. However it requires a(u, v) to be symmetric, which may not be the case in practice. The Lax-Milgram theory is then the tool to use. What it says is the following. Provided that a(u, v) is coercive and bounded on a Hilbert space V (equipped with the norm ‖·‖_V), in the sense that there are two constants α and C₁ such that

α ‖v‖²_V ≤ a(v, v),   a(v, w) ≤ C₁ ‖v‖_V ‖w‖_V,   for all v, w ∈ V,   (6.13)

and provided that the functional L(v) is bounded on V, in the sense that there exists a constant ‖L‖_{V*} such that L(v) ≤ ‖L‖_{V*} ‖v‖_V for all v ∈ V, then the abstract problem

a(u, v) = L(v),   for all v ∈ V,

admits a unique solution u ∈ V. This abstract theory is extremely useful in many practical problems.

Regularity issues. The above theory gives us a weak solution u ∈ H¹₀ to the problem (6.1). The solution is "weak" because the elliptic problem should be considered in its variational form (6.3) rather than its PDE form (6.1). The Laplacian of a function in H¹₀ is not necessarily a function, so u need not be a strong solution (which is defined as a C² solution of (6.1)). It thus remains to understand whether the regularity u ∈ H¹₀ is optimal. The answer is no in most cases. What we can show is that for smooth boundaries ∂Ω and for convex polygons, when f ∈ L², the second-order partial derivatives of u are also in L², and we have

‖u‖_{H²} ≤ C ‖f‖_{L²}.   (6.14)
When f ∈ L², we thus deduce that u ∈ H² ∩ H¹₀. This may be generalized as follows. For sufficiently smooth ∂Ω, or for very special polygons (such as rectangles), we have the following result:

‖u‖_{H^{k+2}} ≤ C ‖f‖_{H^k},   k ≥ 0.   (6.15)

The main aspect of this regularity result is that u has two more degrees of regularity than f: the solution u may be quite regular provided that the source term f is sufficiently regular. Moreover, a classical Sobolev inequality theorem says that a function u ∈ H^k for k > d/2, where d is the space dimension, is also of class C⁰. In dimension d = 2, this means that as soon as f ∈ H^k for k > 1, then u is of class C², i.e. is a strong solution of (6.1). Deriving such results is beyond the scope of these notes and we will simply admit them. As we saw for finite difference and spectral methods, the regularity of the solution dictates the accuracy of the discretized solution. The above regularity results will thus be of crucial importance in the analysis of finite element methods, which we take up now.

6.2 Galerkin approximation

The above theory of existence is fundamentally different from what we used in the finite difference and spectral methods: we no longer have (6.1) to discretize but rather (6.3). The main difficulty in solving (6.3) is that V = H¹₀ is infinite dimensional. This however gives us the right idea to discretize (6.3): simply replace V by a discrete approximation V_h and solve (6.3) in V_h rather than V. The Galerkin method is based on choosing finite dimensional subspaces V_h ⊂ V. We then have to choose the family of spaces V_h for some parameter h → 0 (think of h as a mesh size) such that the discrete solution converges to u in a reasonable sense.

In our problem (6.3), this means choosing V_h as subsets of H¹₀. This requires some care: functions in H¹₀ have derivatives in L². So piecewise constant functions, for instance, are not elements of H¹₀ (because they can jump and the derivatives of jumps are too singular to be elements of L²), and V_h can therefore not be constructed using piecewise constant functions.

Let us thus assume that we have a sequence of finite dimensional subspaces V_h ⊂ V. Then the spaces V_h are Hilbert spaces. Assuming that the bilinear form a(u, v) is coercive and bounded on V (see Rem. 6.3), it clearly remains coercive and bounded on V_h (because V_h is specifically chosen as a subset of V!). Theorem 6.1 and its generalization in Rem. 6.3 then hold with V replaced by V_h. The same theory as for the continuous problem thus directly gives us existence and uniqueness of a solution u_h to the discrete problem:

Find u_h ∈ V_h such that a(u_h, v) = L(v) for all v ∈ V_h.   (6.16)

It remains to obtain an error estimate for the problem on V_h. Assume first that a(u, v) is symmetric, so that it defines the norm (6.10) on V and V_h.
Lemma 6.4 The solution u_h to (6.16) is the best approximation to the solution u of (6.3) for the ‖·‖_a norm:

‖u − u_h‖_a = min_{v ∈ V_h} ‖u − v‖_a.   (6.17)

Proof. The first idea is that since V_h ⊂ V, we can subtract (6.16) from (6.8) to get the following orthogonality condition:

a(u − u_h, v) = 0,   for all v ∈ V_h.   (6.18)

Note that this property only depends on the linearity of u → a(u, v) and v → a(u, v), and not on a being symmetric. Now because a(u, u) is an inner product (so that a(u, v) ≤ ‖u‖_a ‖v‖_a), we verify that for all v ∈ V_h,

‖u − u_h‖²_a = a(u − u_h, u − u_h) = a(u − u_h, u − v) ≤ ‖u − u_h‖_a ‖u − v‖_a.

If ‖u − u_h‖_a = 0, then the lemma is obviously true. If not, we can divide both sides in the above inequality by ‖u − u_h‖_a to get that ‖u − u_h‖_a ≤ ‖u − v‖_a for all v ∈ V_h. This yields the lemma.

So u_h is the best approximation of u for the norm ‖·‖_a, but not for the norm ‖·‖_V. We have equipped V (and V_h) with two different, albeit equivalent, norms. We recall that two norms ‖·‖_a and ‖·‖_V are equivalent if there exists a constant C such that

C⁻¹ ‖v‖_V ≤ ‖v‖_a ≤ C ‖v‖_V,   for all v ∈ V.   (6.19)

Exercise 6.2 In our problem (6.1), where V = H¹₀, show that the two norms ‖·‖_a and ‖·‖_V are equivalent, i.e. prove (6.19).

Remark 6.5 When a(u, v) is not symmetric, the above result does not hold, since a(u, v) no longer generates a norm. However, when a(u, v) is coercive (whether it is symmetric or not), the estimate (6.17) may be replaced by

‖u − u_h‖_V ≤ C₁ α⁻¹ min_{v ∈ V_h} ‖u − v‖_V.   (6.20)

Exercise 6.3 Prove (6.20) when a(u, v) satisfies the hypotheses given in Rem. 6.3.

So, although u_h is no longer the best approximation of u in any norm, we still observe that, up to a multiplicative constant (greater than 1 obviously), u_h is very close to being an optimal approximation of u (this time for the norm ‖·‖_V).

Let us summarize our results. We have replaced the variational formulation for u ∈ V by a discrete variational formulation for u_h ∈ V_h. We have shown that the discrete problem indeed admits a unique solution (and thus requires solving a problem of the form Ax = b, since the problem is linear and finite dimensional), and that the solution u_h is the best approximation to u when a(u, v) is symmetric and coercive on V, or at least is not too far from the best approximation. In particular, we have obtained that ‖u − u_h‖_{H¹} is close to the best possible approximation of u by functions in V_h. Note that this "closeness" is all relative.
When a is symmetric, an error bound of the form (6.17) is always optimal. When the coercivity constant α is very small, on the other hand, an error bound of the form (6.20) is not necessarily very accurate. In any event, we need additional information on the spaces V_h if we want to go further. We can go further in two directions: see how small ‖u − u_h‖_{H¹} is for specific sequences of spaces V_h, but also see how small u − u_h is in possibly other norms one may be interested in, such as ‖u − u_h‖_{L²}.

6.3 An example of finite element method

We are now ready to introduce a finite element method, a specific example of Galerkin approximation (although we should warn the reader that many finite element methods are not of the Galerkin form, because they are based on discretization spaces V_h that are not subspaces of V). Here we consider the simplest finite element method. To simplify, we assume that Ω is a rectangle.

We divide Ω into a family T_h of non-overlapping closed triangles. Each triangle T ∈ T_h is represented by three vertices and three edges. We impose that two neighboring triangles have either one edge (and thus two vertices) in common, or one vertex in common. This precludes the existence of vertices in the interior of an edge. The set of all vertices in the triangulation are called the nodes of T_h. Note that finite element methods could also be based on different tilings of Ω, for instance on quadrilaterals; we only consider triangles here. The index h > 0 measures the "size" of the triangles. A triangulation T_h of size h is thus characterized by

Ω̄ = ∪_{T ∈ T_h} T,   h = max_{T ∈ T_h} h_T,   h_T = diam(T).   (6.21)

As the triangulation gets finer and h → 0, we wish to be able to construct more and more accurate solutions u_h of (6.16) approximating the solution u of (6.8). The reason we introduce the triangulation is that we want to approximate arbitrary functions in V = H¹₀ by finite dimensional functions on each triangle T ∈ T_h. Because the functions in V_h must be in H¹₀, they cannot be chosen on each triangle independently of what happens on the other triangles. For instance, functions that are constant on each triangle T ∈ T_h are not in H¹₀ and are thus not admissible. The reason is that such functions may jump across edges shared by two triangles, and these jumps violate the fact that our discrete functions need to be in H¹₀. The simplest functions after piecewise constant functions are piecewise linear functions. We thus define

V_h = {v ∈ H¹₀, v is linear on each T ∈ T_h}.   (6.22)

Recall that v ∈ H¹₀ imposes that v = 0 on ∂Ω. The space V_h is a finite dimensional subspace of H¹₀ (see below for a "proof"), so that the theory given in the previous section applies. Because they are piecewise linear and cannot jump across edges, the functions in V_h are continuous on Ω̄.
Functions in V_h are thus characterized (uniquely determined) by the values they take at the nodes of the triangulation T_h. We label the internal nodes P_i ∈ Ω with 1 ≤ i ≤ M_h; since Ω is an open domain, the nodes P_i ∈ ∂Ω are labeled with M_h + 1 ≤ i ≤ M_h + b_h, where b_h is the number of nodes on ∂Ω.

A basis for V_h. Consider the family of "pyramid" functions φ_i(x) ∈ V_h, 1 ≤ i ≤ M_h, defined by

φ_i(P_j) = δ_ij,   1 ≤ i, j ≤ M_h.   (6.23)

We recall that the Kronecker delta satisfies δ_ij = 1 if i = j and δ_ij = 0 otherwise. Since φ_i(x) is piecewise affine on T_h, it is uniquely defined by its values at the nodes P_j. Note that by assuming that φ_i ∈ V_h, we have also implicitly imposed that φ_i(P_j) = 0 for P_j ∈ ∂Ω; moreover, for P_j, P_k ∈ ∂Ω, the function φ_i vanishes on the whole edge (P_j, P_k), so that indeed φ_i = 0 on ∂Ω. We can thus write any function v ∈ V_h as

v(x) = Σ_{i=1}^{M_h} v_i φ_i(x),   v_i = v(P_i).   (6.24)

Since the family (φ_i) generates V_h and, as we easily deduce, is linearly independent, it is a basis for V_h. This proves that V_h is finite dimensional.

A linear system. We now recast the above problem for u_h as a linear system (of the form Ax = b). Let us decompose the solution u_h of (6.16) as

u_h(x) = Σ_{i=1}^{M_h} u_i φ_i(x).

The equation (6.16) may then be recast as

Σ_{j=1}^{M_h} a(φ_j, φ_i) u_j = (f, φ_i),   1 ≤ i ≤ M_h.

This is nothing but the linear system Ax = b, with

A_ij = a(φ_j, φ_i),   x_i = u_i,   b_i = (f, φ_i),   1 ≤ i, j ≤ M_h.   (6.25)

In the specific problem of interest here, where a(u, v) is defined by (6.3), we have

A_ij = ∫_Ω (∇φ_j(x)·∇φ_i(x) + σ(x)φ_j(x)φ_i(x)) dx.

There is therefore no need to discretize the absorption parameter σ(x): the variational formulation takes care of this. Solving for u_h thus becomes a linear algebra problem. Let us just mention that the matrix A is quite sparse, because a(φ_i, φ_j) = 0 unless P_i and P_j belong to the same edge in the triangulation. This is a crucial property for obtaining efficient numerical inversions of (6.25). Let us also mention that the ordering of the points P_i is important. As a rough rule of thumb, one would like the indices of points P_i and P_j belonging to the same edge to be such that |i − j| is as small as possible, to render the matrix A as close to a diagonal matrix as possible. The optimal ordering thus very much depends on the triangulation and is by no means a trivial problem, which we do not describe further here.
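The assembly of the linear system Ax = b can be illustrated compactly on a one-dimensional analogue of (6.1), namely −u'' + σu = f on (0, 1) with u(0) = u(1) = 0, discretized with hat functions. This sketch is mine (the notes treat the 2D triangular case); the entries of A come from the exact integrals of products of hat functions and their derivatives, which makes A tridiagonal, the 1D counterpart of the sparsity discussed above.

```python
import numpy as np

n = 99                          # number of interior nodes P_i
h = 1.0 / (n + 1)
x = np.linspace(h, 1.0 - h, n)  # interior nodes
sigma = 1.0                     # constant absorption for simplicity

# A_ij = int (phi_j' phi_i' + sigma phi_j phi_i) dx for 1D hat functions:
# exact element integrals give a tridiagonal (hence very sparse) matrix.
main = 2.0 / h + sigma * 2.0 * h / 3.0
off = -1.0 / h + sigma * h / 6.0
A = (np.diag(np.full(n, main))
     + np.diag(np.full(n - 1, off), 1)
     + np.diag(np.full(n - 1, off), -1))

# Manufactured solution u(x) = sin(pi x), so f = (pi^2 + sigma) sin(pi x);
# b_i = (f, phi_i) approximated by the one-point (lumped) quadrature h f(P_i).
b = h * (np.pi**2 + sigma) * np.sin(np.pi * x)

u_h = np.linalg.solve(A, b)
err = np.max(np.abs(u_h - np.sin(np.pi * x)))
```

Note how σ enters only through the integrals defining A, exactly as stated above: the variational formulation, not an ad hoc pointwise discretization, decides how the absorption is sampled.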
Approximation theory. So far, we have constructed a family of discrete solutions u_h of (6.16) for various parameters h > 0. It remains to understand how small the error u − u_h is. Since the problem (6.16) is a Galerkin discretization, we can use (6.17) and (6.19) and thus obtain, in the H¹₀ and a-norms, that u − u_h is bounded by a constant times u − v for arbitrary v ∈ V_h. Finding the best approximation v of u in a finite dimensional space for a given norm would be optimal. However this is not an easy task in general. Moreover, this error will depend on the chosen norm (we have at least three reasonable choices: ‖·‖_a, ‖·‖_{H¹}, and ‖·‖_{L²}). We now consider a method, based on interpolation theory, which up to a multiplicative constant (independent of h) indeed provides the best approximation.

Let now T_h be a sequence of triangulations such that h → 0 (recall that h is the maximal diameter of the triangles in the triangulation). Let us assume that f ∈ L²(Ω), so that u ∈ H²(Ω) thanks to (6.14). This implies that u is a continuous function on Ω̄, as 2 > d/2 = 1. We define the interpolation operator π_h : C(Ω̄) → V_h by

(π_h v)(x) = Σ_{i=1}^{M_h} v(P_i) φ_i(x).   (6.26)

Note that the function v needs to be continuous since we evaluate it at the nodes P_i: an arbitrary function v ∈ H¹₀ need not be well-defined at the points P_i (although it almost is). We now want to show that π_h v − v gets smaller as h → 0. This is done by looking at π_h v − v on each triangle T ∈ T_h and then summing the approximations over all triangles of the triangulation.

We thus need to understand the approximation at the level of one triangle. We denote by π_T v the affine interpolant of v on a triangle T. For a triangle T with vertices P_i, 1 ≤ i ≤ 3, it is defined as

π_T v(x) = Σ_{i=1}^{3} v(P_i) φ_i(x),   x ∈ T,

where the φ_i(x) are affine functions defined by φ_i(P_j) = δ_ij, 1 ≤ i, j ≤ 3. We recall that functions in H²(T) are continuous, so the above interpolant is well-defined for each v ∈ H²(T). The various triangles in the triangulation may be very different, so we need a method that analyzes π_T v − v on arbitrary triangles. This is done in two steps. First we pick a triangle of reference T̂ and analyze π_T̂ v̂ − v̂ on it in the various norms of interest. Then we define a map that transforms T̂ into any triangle of interest T, and push forward the properties we have found on T̂ to the triangle T through this map.

The first step goes as follows.

Lemma 6.6 Let T̂ be the triangle with vertices (0, 0), (1, 0), and (0, 1) in an orthonormal system of coordinates of R². Let v̂ be a function in H²(T̂) and π_T̂ v̂ its affine interpolant. Then there exists a constant C independent of v̂ such that

‖v̂ − π_T̂ v̂‖_{L²(T̂)} ≤ C |v̂|_{H²(T̂)},
‖∇(v̂ − π_T̂ v̂)‖_{L²(T̂)} ≤ C |v̂|_{H²(T̂)}.   (6.27)

We recall that |·|_{H²(T̂)} is the semi-norm defined in (6.6).
Proof. The proof is based on judicious use of Taylor expansions. First, since |v(P_i)| ≤ C‖v‖_{H²} in two space dimensions, we deduce by definition of π_T̂ that

‖π_T̂ v‖_{H¹} ≤ C ‖v‖_{H²}.   (6.28)

Since we will need it later, we want to replace π_T̂ by a more convenient interpolant Π_T̂. This is based on the following Taylor expansion (with integral remainder):

v(x) = v(x₀) + (x − x₀)·∇v(x₀) + |x − x₀|² ∫₀¹ ((e·∇)²v)((1 − t)x₀ + tx)(1 − t) dt,

where e = (x − x₀)/|x − x₀|. The same type of expansion yields for the gradient:

∇v(x) = ∇v(x₀) + |x − x₀| ∫₀¹ ((e·∇)∇v)((1 − t)x₀ + tx) dt.   (6.29)

Both integral terms in the above expressions involve second-order derivatives of v(x). However, the affine interpolant v(x₀) + (x − x₀)·∇v(x₀) and its derivative ∇v(x₀) both involve the gradient of v at the single point x₀, and there is no reason for |∇v(x₀)| to be bounded by C|v|_{H²(T̂)}. This causes some technical problems. The way around is to realize that x₀ may be seen as a variable that we can vary freely in T̂. We therefore introduce a function φ(x) ∈ C₀^∞(T̂) such that φ(x) ≥ 0 and ∫_T̂ φ(x) dx = 1. It is a classical real-analysis result that such a function exists. Upon multiplying the above expansions by φ(x₀) and integrating over T̂ in x₀, we obtain that

v(x) = Π_T̂ v(x) + ∫_T̂ φ(x₀)|x − x₀|² ∫₀¹ ((e·∇)²v)((1 − t)x₀ + tx)(1 − t) dt dx₀,
∇v(x) = ∇Π_T̂ v(x) + ∫_T̂ φ(x₀)|x − x₀| ∫₀¹ ((e·∇)∇v)((1 − t)x₀ + tx) dt dx₀,   (6.30)

where

Π_T̂ v(x) = ∫_T̂ (v(x₀) − x₀·∇v(x₀)) φ(x₀) dx₀ + (∫_T̂ φ(x₀)∇v(x₀) dx₀)·x.

We verify that Π_T̂ v is indeed an affine interpolant for v ∈ H¹(T̂), in the sense (thanks to the normalization of φ(x)) that Π_T̂ v = v when v is a polynomial of degree at most 1 (in x₁ and x₂), and that ‖Π_T̂ v‖_{H¹} ≤ C‖v‖_{H¹}. We also verify that π_T̂ Π_T̂ v = Π_T̂ v, so that

‖v − π_T̂ v‖_{H¹} ≤ ‖v − Π_T̂ v‖_{H¹} + ‖π_T̂(Π_T̂ v − v)‖_{H¹} ≤ C ‖v − Π_T̂ v‖_{H¹} + C |v|_{H²}.

So we see that it is enough to prove the results stated in the lemma for the interpolant Π_T̂. It remains to show that the remainders v(x) − Π_T̂ v(x) and ∇(v(x) − Π_T̂ v(x)) are indeed bounded in L²(T̂) by C|v|_{H²} (recall that the second-order derivatives of Π_T̂ v vanish by construction). We realize that both remainders involve bounded functions (such as |x − x₀| or 1 − t) multiplied by second-order derivatives of v. It is therefore sufficient to show a result of the form

∫_T̂ ( ∫_T̂ φ(x₀) ∫₀¹ f((1 − t)x₀ + tx) dt dx₀ )² dx ≤ C ‖f‖²_{L²(T̂)}   (6.31)

for all f ∈ L²(T̂), where we extend f(x) by 0 outside T̂. By the Cauchy-Schwarz inequality (stating that (f, g)² ≤ ‖f‖² ‖g‖², here used with g = 1), the left-hand side in the above equation is bounded (because T̂ is a bounded domain) by

C ∫_T̂ ∫_T̂ φ²(x₀) ∫₀¹ f²((1 − t)x₀ + tx) dt dx₀ dx.

Let us consider the part 1/2 < t < 1 in the above integral and perform the change of variables tx ← x. This yields

∫_{1/2}^1 dt ∫_T̂ dx₀ φ²(x₀) ∫_{tT̂} dx (1/t²) f²((1 − t)x₀ + x) ≤ ∫_{1/2}^1 (dt/t²) ∫_T̂ dx₀ φ²(x₀) ‖f‖²_{L²(T̂)},

which is certainly bounded by C‖f‖²_{L²(T̂)}. The part 0 < t < 1/2, with the change of variables (1 − t)x₀ ← x₀, similarly gives a contribution of the form

∫₀^{1/2} dt/(1 − t)² ‖φ²‖_{L^∞(T̂)} |T̂| ‖f‖²_{L²(T̂)},

where |T̂| is the surface of T̂. These estimates show that

‖v − Π_T̂ v‖_{H¹(T̂)} ≤ C |v|_{H²(T̂)},   (6.32)

which was all we needed to conclude the proof of the lemma.

Once we have the estimate on T̂, we need to obtain it for arbitrary triangles. This is done as follows.

Lemma 6.7 Let T be a proper (not flat) triangle in the plane R². There is an orthonormal system of coordinates such that the vertices of T are the points (0, 0), (a, 0), and (b, c), where a, b, and c are real numbers such that a > 0 and c > 0. Let us define ‖A‖_∞ = max(a, b, c). Let v ∈ H²(T) and π_T v its affine interpolant. Then there exists a constant C independent of the function v and of the triangle T such that

‖v − π_T v‖_{L²(T)} ≤ C ‖A‖²_∞ |v|_{H²(T)},
‖∇(v − π_T v)‖_{L²(T)} ≤ C (‖A‖³_∞/(ac)) |v|_{H²(T)}.   (6.33)

Furthermore, let h_T be the diameter of T and let us assume the existence of ρ_T such that

ρ_T h_T ≤ a ≤ h_T,   ρ_T h_T ≤ c ≤ h_T.   (6.34)

Then we have the estimates

‖v − π_T v‖_{L²(T)} ≤ C h_T² |v|_{H²(T)},
‖∇(v − π_T v)‖_{L²(T)} ≤ C ρ_T⁻² h_T |v|_{H²(T)}.   (6.35)

We recall that C is independent of v and of the triangle T.
Proof. We want to map the estimate on T̂ in Lemma 6.6 to the triangle T. This is achieved as follows. Let us define the linear map A : R² → R², given in Cartesian coordinates by

A = ( a  b ; 0  c ).   (6.36)

We verify that A(T̂) = T. Let us denote by x coordinates on T and by x̂ coordinates on T̂, with x = Ax̂. For a function v on T, we can define the corresponding function v̂ on T̂, which is equivalent to v̂(x̂) = v(x) with x = Ax̂, i.e. v̂(x̂) = v(Ax̂). By the chain rule,

∂v̂/∂x̂_i = A_{ki} ∂v/∂x_k,

where we use the convention of summation over repeated indices. Differentiating one more time, we deduce that

Σ_{|α|=2} |D^α v̂|²(x̂) ≤ C ‖A‖⁴_∞ Σ_{|α|=2} |D^α v|²(x),   x = Ax̂.

Note that the change of measures induced by A is dx̂ = det(A⁻¹) dx, so that

∫_T̂ Σ_{|α|=2} |D^α v̂|²(x̂) dx̂ ≤ C ‖A‖⁴_∞ det(A⁻¹) ∫_T Σ_{|α|=2} |D^α v|²(x) dx.   (6.37)

This is equivalent to the statement:

|v̂|_{H²(T̂)} ≤ C det(A)^{−1/2} ‖A‖²_∞ |v|_{H²(T)}.   (6.38)

Here C is a universal constant, independent of v and v̂. This tells us how the right-hand side in (6.27) may be used in estimates on T.

Let us now consider the left-hand sides. We first notice that

(π_T̂ v̂)(x̂) = (π_T v)(x),   x = Ax̂.   (6.39)

The reason is that v̂ and v take the same values at the corresponding nodes, and that for any polynomial p(x) of order 1 in the variables x₁ and x₂, p(Ax̂) is also a polynomial of order 1 in the variables x̂₁ and x̂₂. This implies that (v̂ − π_T̂ v̂)(x̂) = (v − π_T v)(x) for x = Ax̂. Using the same change of variables as above, this implies the relations

‖v̂ − π_T̂ v̂‖²_{L²(T̂)} = |det A⁻¹| ‖v − π_T v‖²_{L²(T)},
‖∇(v̂ − π_T̂ v̂)‖²_{L²(T̂)} ≥ C |det A⁻¹| ‖A⁻¹‖⁻²_∞ ‖∇(v − π_T v)‖²_{L²(T)},   (6.40)

where ‖A⁻¹‖_∞ = (ac)⁻¹ ‖A‖_∞ (i.e., the maximal element of the 2 × 2 matrix A⁻¹). The above inequalities combined with (6.38) directly give (6.33). The estimates (6.35) are then an easy corollary.
The norm · a is equivalent to the H 1 norm.22) based on the triangulation deﬁne in (6.42) Here. From now on. where v is the best approximation of u (for this norm). This is indeed the case.Remark 6. In the linear approximation framework. and uh .8 (Form of the triangles) The ﬁrst inequality in (6. The second estimate however requires more. which we assume is in H 2 (Ω). The reason is this. The third norm we have used is the L2 norm. ·) but not on u nor uh . we impose that ρT ≥ ρ0 > 0 for all T ∈ Th independent of the parameter h.8). 1 A Global error estimate in H0 (Ω). b.35) only requires that the diameter of T be smaller than hT .41) Here the constant C depends on the form a(·. thanks to (6.19) that the error of u − uh in H0 (the space in which estimates are the easiest!) is bounded by a constant times the error u − v in the same norm. 60 . implies the following result Theorem 6. u. v).34) is equivalent to imposing that all the angles of the triangle be greater than a constant independent of hT and is equivalent to imposing that there be a ball of radius of order hT inscribed in T . we can sum them to obtain global error estimates.9 Let u be the solution to (6.16) where Vh is deﬁned in (6. and c be of the same order hT . Let uh be the solution of the discrete problem (6.8.21). 0 (6. This concludes this section on approximation theory. 6. the gradients are approximated by piecewise constant functions whereas the functions themselves are approximated by polynomials of order 1 on each triangle. We may verify that the constraint (6. (6. Let us come back to the general framework of 1 Galerkin approximations. We should therefore have a better accuracy for the functions than for their gradients. v) is symmetric). However the demonstration is not straightforward and requires to use the regularity of the elliptic problem one more time in a curious way. Then we have the error estimate u − uh H 1 (Ω) ≤ Cρ−2 huH 2 (Ω) . However it may depend on the bilinear form a(u. 
So the above estimate also holds for the · a norm (which is deﬁned only when a(u. C is a constant independent of Th . So we obviously have that u − uh H 1 ≤ C u − πh u H 1 . We may show that the order of approximation in O(h) obtained in the theorem is optimal (in the sense that h cannot be replaced by hα with α > 1 in general). A global error estimate in L2 (Ω). This however.35). The above righthand side is by deﬁnition u − πT u2 +  (u − πT u)2 dx T ∈Th T 1/2 ≤ T ∈Th ρ−4 h2 u2 2 (T ) 0 T H 1/2 . It is reasonable to expect a faster convergence rate in the L2 norm than in the H 1 norm. The ﬁnite element method is thus ﬁrstorder for the approximation of u in the H 1 norm. We know from (6. Now that we have local error estimates. It requires that the triangles not be too ﬂat and that all components a. We recall that ρ0 was deﬁned in Rem.
45). (6.42). we get: u − uh 2 L2 = a(u − uh . w) = (v.Theorem 6. v) by the adjoint bilinear form a∗ (u. v) is not symmetric.44) Note that since a(u.44) is equivalent to (6.43) Proof. 61 .9) and the additional hypothesis that (6. the orthogonality condition (6. 0 H1 This concludes our result.1.10 we now present is referred to as a duality argument in the literature. when a(u.14) holds. v) = a(v.10 Under the hypotheses of Theorem (6. w − πh w) ≤ C u − uh H 1 w − πh w ≤ C u − uh H 1 ρ−2 h w 2 ≤ Cρ−2 h u − uh H 1 u − uh L2 0 0 ≤ Cρ−4 h2 u H 2 u − uh L2 . for all v ∈ V. In any event. However. and as a consequence the proof of Theorem 6.44).8).14) holds implies that w H2 ≤ C u − uh L2 .12) (with V = H0 ).44) admits a unique solution thanks to Theorem 6. the above problem (6. Let w be the solution to the problem a(v. 1 the bound in (6. and ﬁnally the error estimate (6. u).8) only by replacing a(u. (6.18).8). this problem is equivalent to (6.44) is therefore called a dual problem to (6. (6. u − uh ). The equation (6. w) = a(u − uh .45) Using in that order the choice v = u − uh in (6. (6. we have the following error estimate u − uh L2 (Ω) ≤ Cρ−4 h2 u 0 H 2 (Ω) . v) is symmetric. Moreover our hyp othesis that (6. the estimate (6.
7 Introduction to Monte Carlo methods

7.1 Method of characteristics

Let us first make a small detour by the method of characteristics. Consider a hyperbolic equation of the form

∂v/∂t − c ∂v/∂x + αv = β,   t > 0, x ∈ R,
v(0, x) = P(x).    (7.1)

We want to solve (7.1) by the method of characteristics. The main idea is to look for solutions of the form v(t, x(t)), where x(t) is a characteristic and where the rate of change of ω(t) = v(t, x(t)) is prescribed along the characteristic x(t). We observe that

d/dt v(t, x(t)) = ∂v/∂t + (dx/dt) ∂v/∂x,    (7.2)

so that the characteristic equations for (7.1) are

dx/dt = −c,   dv/dt + αv = β.    (7.3)

The solutions are found to be

x(t) = x₀ − ct,   v(t, x(t)) = e^{−αt} v(0, x₀) + (β/α)(1 − e^{−αt}).    (7.4)

For each (t, x), we can find a unique x₀(t, x) = x + ct, so that

v(t, x) = e^{−αt} P(x + ct) + (β/α)(1 − e^{−αt}).    (7.5)

Many first-order linear and nonlinear equations may be solved by the method of characteristics. The main idea, as we have seen, is to replace a partial differential equation (PDE) by a system of ordinary differential equations (ODEs). The latter may then be discretized if necessary and solved numerically. Since ODEs are solved, we avoid the difficulties in PDE-based discretizations caused by the presence of high frequencies that are not well captured by the schemes; see the section on finite difference discretizations of hyperbolic equations. There are many other difficulties in the numerical implementation of the method of characteristics. For instance, to evaluate the solution at a grid point, we need to solve the characteristic backwards from x(t) to x(0) and evaluate the initial condition at x(0), which may not be a grid point and thus may require that we use an interpolation technique. Yet the method is very powerful, as it is much easier to solve ODEs than PDEs.

In the literature, PDE-based (grid-based) methods are referred to as Eulerian, whereas methods acting in the reference frame of the particles following their characteristics (which requires solving ODEs) are referred to as Lagrangian methods. The combination of grid-based and particle-based methods gives rise to the so-called semi-Lagrangian methods.
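The closed-form solution (7.5) can be checked directly against the PDE. In the sketch below, the constants c, α, β and the Gaussian profile P are illustrative choices of mine; the time and space derivatives are approximated by centered differences, so the residual of the PDE should be near zero.

```python
import numpy as np

# Check of the characteristics formula (7.5),
#   v(t, x) = e^{-alpha t} P(x + c t) + (beta/alpha) (1 - e^{-alpha t}),
# against the PDE v_t - c v_x + alpha v = beta, using centered differences.
# The constants and the Gaussian profile P are illustrative choices.

c, alpha, beta = 1.0, 0.5, 2.0
P = lambda x: np.exp(-x ** 2)            # initial condition P(x)

def v(t, x):
    return np.exp(-alpha * t) * P(x + c * t) \
        + (beta / alpha) * (1.0 - np.exp(-alpha * t))

t0, x0, eps = 0.7, 0.3, 1e-5
v_t = (v(t0 + eps, x0) - v(t0 - eps, x0)) / (2 * eps)
v_x = (v(t0, x0 + eps) - v(t0, x0 - eps)) / (2 * eps)
residual = v_t - c * v_x + alpha * v(t0, x0) - beta   # should be ~ 0
```

At t = 0 the formula reduces to v(0, x) = P(x), so the initial condition is satisfied exactly.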
7.2 Probabilistic representation

One of the main drawbacks of the method of characteristics is that it applies (only) to scalar first-order equations. For some equations, a quasi-method of characteristics may still be developed. For instance, we can apply it to the wave equation, since

0 = ∂²u/∂t² − ∂²u/∂x² = (∂/∂t − ∂/∂x)(∂/∂t + ∂/∂x)u,

because of the above decomposition. However, the method does not apply to multidimensional wave equations, nor to elliptic or parabolic equations such as the Laplace and heat equations. The reason is that information does not propagate along characteristics (curves).

However, instead of looking for solutions of the form w(t, x(t)) carried along a single curve, we look for solutions of the form

u(t, x) = E{u₀(X(t)) | X(0) = x},    (7.6)

where X(t) = X(t, ω) is a trajectory starting at x for realizations ω in a state space Ω, and E is ensemble averaging with respect to a measure P defined on Ω. We thus see that u(t, x) is an infinite superposition (E is an integration) of pieces of information carried along the trajectories. We also see how this representation gives rise to the Monte Carlo method: instead of being based on following one characteristic, it is based on following an infinite number of characteristics generated from an appropriate probability measure and averaging over them. We cannot simulate an infinite number of trajectories. However, we can certainly simulate a finite, large, number of them and perform the averaging. This is the basis for the Monte Carlo method.

Let us give the most classical of examples. Let W(t) be a standard one-dimensional Brownian motion with W(0) = 0. This is an object defined on a space Ω (the space of continuous trajectories on [0, ∞)) with an appropriate measure P. All trajectories are continuous (at least almost surely) but are also almost surely nowhere differentiable. The “characteristics” have thus become rather unsmooth objects. Then

u(t, x) = E{u₀(x + W(2t))}    (7.7)

solves the heat equation

∂u/∂t − ∂²u/∂x² = 0,   u(0, x) = u₀(x).    (7.8)

This may be shown by, e.g., Itô calculus. It is not the purpose of these notes to dwell on (7.7). Very briefly, let g(W(2t)) = u₀(x + W(2t)). Using Itô calculus, we find that

dg = (∂g/∂x) dW_{2t} + (1/2)(∂²g/∂x²) dW_{2t} dW_{2t} = (∂g/∂x) dW_{2t} + (∂²g/∂x²) dt,

since dW_{2t} dW_{2t} = d(2t). Using E{W_{2t}} = 0, we obtain after ensemble averaging that (7.7) solves (7.8). In one dimension of space, simulating the characteristic is easy, as we know the law of W(t) (a centered Gaussian variable with variance equal to t). In more general situations, for instance where the diffusion coefficient depends on position, W(2t) is replaced by some process X(t, ω), as in (7.6), which depends in a more complicated way on the location of the origin x. Solving for the trajectories then becomes more difficult.
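The representation (7.7) can be sampled directly, since W(2t) is a centered Gaussian with variance 2t. In the sketch below, my choice u₀(x) = cos(x) is an illustrative initial condition for which the heat equation has the exact solution e^{−t} cos(x), so the sampled average can be compared to the truth; the point (t, x) and the sample size M are also illustrative.

```python
import numpy as np

# Monte Carlo evaluation of the representation (7.7),
#   u(t, x) = E{u0(x + W(2t))},  with W(2t) ~ N(0, 2t),
# for the heat equation u_t - u_xx = 0.  With u0(x) = cos(x) the exact
# solution is e^{-t} cos(x), so the sampled average can be checked.
# The point (t, x) and the sample size M are illustrative choices.

rng = np.random.default_rng(0)
u0 = np.cos
t, x, M = 0.5, 0.3, 200_000

W2t = rng.normal(0.0, np.sqrt(2.0 * t), size=M)   # samples of W(2t)
u_mc = u0(x + W2t).mean()                         # ensemble average E{.}
u_exact = np.exp(-t) * np.cos(x)
```

The sampling error behaves like O(M^{−1/2}), consistent with the convergence rate discussed below.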
7.3 Random Walk and Finite Differences

Brownian motion has a relatively simple discrete version, namely the random walk. Consider a process S_k such that S₀ = 0 and S_{k+1} is equal to S_k + h with probability 1/2 and equal to S_k − h also with probability 1/2. Let us assume that we have N (time) steps and define N independent random variables x_k, 1 ≤ k ≤ N, such that x_k = ±1 with probability 1/2. We thus have an explicit expression for S_n:

S_n = h Σ_{k=1}^n x_k.    (7.9)

Moreover, we easily find that

E{S_n} = 0,   Var S_n = E{S_n²} = n h².    (7.10)

The central limit theorem shows that S_n converges to W(2t) if n → ∞ and h → 0 such that nh² = 2t. Moreover, S_{αn} also converges to W(2αt) in an appropriate sense, so that there is solid mathematical background to obtain that S_n is indeed a bona fide discretization of Brownian motion. This is Donsker's invariance principle.

Let us relate all this to Bernoulli random variables. Define y_k for 1 ≤ k ≤ N, N independent random variables equal to 0 or 1 with probability 1/2, and let Σ_n = Σ_{k=1}^n y_k. Then, using x_k = 2y_k − 1, we may decompose S_n = h(2Σ_n − n), so that S_n = hm if and only if Σ_n = (n+m)/2. Each sequence of n 0's and 1's for the y_k has probability 2^{−n}, and the number of sequences such that Σ_n = m is the binomial coefficient C(n, m) by definition (the number of possibilities of choosing m balls among n ≥ m balls). Note that S_n/h is even when n is even and odd when n is odd, so that (n+m)/2 is always an integer whenever P(S_n = hm) does not vanish. As a consequence, we see that

p_m = P(S_n = hm) = 2^{−n} C(n, (n+m)/2).

Exercise 7.1 Check this.

How does one use all of this to solve a discretized version of the heat equation? Following (7.7), we can define

U_j^n = E{U⁰_{j + h^{−1} S_n}},    (7.11)

where U_j⁰ is an approximation of u₀(jh) as before, and where U_j^n is an approximation of u(n∆t, jh). What is it that we have defined? To see this, we need to use the idea of conditional expectations, which roughly says that averaging over two random variables may be written as an averaging over the first random variable (this is thus still a random variable), followed by an averaging over the second random variable. In our context:

U_j^n = E{U⁰_{j+x₁+···+x_n}} = E{E{U⁰_{j+x₁+···+x_n} | x₁, …, x_{n−1}}}
      = E{(1/2) U⁰_{j+1+x₁+···+x_{n−1}} + (1/2) U⁰_{j−1+x₁+···+x_{n−1}}}
      = (1/2) U^{n−1}_{j+1} + (1/2) U^{n−1}_{j−1}.
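The identity above can be verified numerically: the expectation (7.11), computed exactly with the binomial weights p_m, must coincide with n iterations of the averaging scheme. In the sketch below, the grid size and the smooth initial sample are illustrative choices of mine.

```python
import numpy as np
from math import comb

# Two ways to compute U_j^n = E{U^0_{j + h^{-1} S_n}} (formula (7.11)):
# (a) directly, weighting the initial data by
#     p_m = P(S_n = hm) = 2^{-n} C(n, (n+m)/2), and
# (b) iterating the explicit scheme U_j^n = (U_{j+1}^{n-1} + U_{j-1}^{n-1})/2.
# Both must agree to machine precision.  The grid size and the smooth
# initial sample below are illustrative choices.

n, J = 6, 30
U0 = np.sin(2 * np.pi * np.arange(-J, J + 1) / 11.0)   # U^0_j on j = -J..J

def U_direct(j):
    # sum over the offsets m = -n, -n+2, ..., n reachable in n steps
    return sum(comb(n, (n + m) // 2) / 2**n * U0[J + j + m]
               for m in range(-n, n + 1, 2))

def U_scheme():
    U = U0.copy()
    for _ in range(n):                    # boundary values stay frozen; they
        U[1:-1] = 0.5 * (U[2:] + U[:-2])  # cannot reach j = 0 in n < J steps
    return U

Ua = U_direct(0)
Ub = U_scheme()[J]
```

The agreement of Ua and Ub is exactly the conditional-expectation computation above, unrolled n times.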
The computation above shows that U_j^n solves the finite difference equation

U_j^n = (1/2) U^{n−1}_{j+1} + (1/2) U^{n−1}_{j−1},

i.e., the explicit scheme with λ = 1/2 (and ONLY this value of λ, the way the procedure is constructed). As a consequence, the procedure (7.11) gives us an approximation of

∂u/∂t − D ∂²u/∂x² = 0,   u(0, x) = u₀,    (7.12)

where D is the diffusion coefficient, when ∆t and h are chosen so that (7.13) holds. Recall that

λ = D∆t/h² = 1/2.    (7.13)

7.4 Monte Carlo method

Solving for U_j^n thus requires that we construct S_n (by flipping n ±1 coins) and evaluate

X := U⁰_{j + h^{−1} S_n}.

Let us look at the random variable X. Using (7.9), we find that P(X = U⁰_{j+m}) = p_m, and we know that E{X} = U_j^n, so that X is an unbiased estimator of U_j^n. However, if we do it once, how accurate are we? The answer is: not much, because the variance of X is often quite large. More precisely, we find that

σ² = Var X = Σ_k p_k (U⁰_{j−k})² − (Σ_l p_l U⁰_{j−l})².

Assume first that U_j⁰ = U, independent of j. Then σ² = 0 and X gives the right answer all the time. However, when U⁰ is highly oscillatory, say (−1)^l, then Σ_l p_l U⁰_{j−l} will be close to 0 by cancellations. This shows that σ² will be close to Σ_k p_k (U⁰_{j−k})², which is comparable to, or even much larger than, E{X}. So X may have a very large variance and is therefore not a very good estimator of U_j^n.

The (only) way to make the estimator better is to use the law of large numbers by repeating the experiment M times and averaging over the M results. More precisely, let X_k for 1 ≤ k ≤ M be M independent random variables with the same law as X, and define

S_M = (1/M) Σ_{k=1}^M X_k.    (7.14)

Then we verify that E{S_M} = U_j^n as before. The variance of S_M is however significantly smaller:

Var S_M = E{((1/M) Σ_{k=1}^M (X_k − U_j^n))²} = (1/M²) Σ_{k=1}^M E{(X_k − U_j^n)²} = Var X / M,

since the variables are independent. As a consequence, S_M is an unbiased estimator of U_j^n with a standard deviation of order O(M^{−1/2}). In order to obtain an accuracy of ε on average, we need to calculate an order of ε^{−2} realizations. This is the speed of convergence of the Monte Carlo method.

How do the calculations, Monte Carlo versus deterministic, compare to calculate U_j^n (at a final time n∆t and a given position j)? Let us assume that we solve a d-dimensional heat equation.
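The contrast between constant and oscillatory data can be computed exactly from the weights p_m. One caveat, which is my own observation rather than the text's: the n-step walk only visits offsets of one parity, on which the profile (−1)^l is constant, so the sketch below substitutes cos(2l) as an oscillatory profile that does vary on the visited sites; n and j are illustrative choices.

```python
from math import comb, cos

# Exact variance of the one-shot estimator X = U^0_{j + h^{-1} S_n}:
#   sigma^2 = sum_m p_m (U^0_{j+m})^2 - (sum_m p_m U^0_{j+m})^2,
# with p_m = 2^{-n} C(n, (n+m)/2).  Constant data gives sigma^2 = 0,
# while an oscillatory profile keeps an O(1) variance: its mean nearly
# cancels but its second moment does not.

n, j = 8, 0
ms = list(range(-n, n + 1, 2))
p = {m: comb(n, (n + m) // 2) / 2**n for m in ms}

def var_X(U0):
    mean = sum(p[m] * U0(j + m) for m in ms)
    second = sum(p[m] * U0(j + m) ** 2 for m in ms)
    return second - mean ** 2

var_const = var_X(lambda l: 3.0)         # constant data: zero variance
var_osc = var_X(lambda l: cos(2.0 * l))  # oscillatory data: O(1) variance
```

This makes concrete why a single sample X is a poor estimator for rough initial data, and why the averaging (7.14) is needed.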
A d-dimensional heat equation involves a d-dimensional random walk, which is nothing but d one-dimensional random walks, one in each direction. The cost of each trajectory is of order n ≈ (∆t)^{−1}, since we need to flip n coins. To obtain an O(h²) = O(∆t) accuracy, we thus need M = (∆t)^{−2} realizations, so that the final cost is O((∆t)^{−3}).

Deterministic methods require that we solve n time steps using a grid of mesh size O(h) in each direction, which means O(h^{−d}) discretization points. The total cost of an explicit method is therefore O((∆t h^d)^{−1}) = O((∆t)^{−1−d/2}). We thus see that Monte Carlo is more efficient than finite differences (or equally efficient) when

1/∆t³ ≤ 1/∆t^{1+d/2},    (7.15)

i.e., when d ≥ 4. For sufficiently large dimensions, Monte Carlo methods do not suffer from the curse of dimensionality characterizing deterministic methods and thus become more efficient. Finite differences, however, provide the solution everywhere, unlike Monte Carlo methods, which need to use different trajectories to evaluate the solution at different points.

Let us conclude with the following remarks. First, the Monte Carlo method is much easier to parallelize, since the M random realizations of X are computed independently. So with a (very large) number of processors equal to M, the running time of the algorithm is of order ∆t^{−1}, which corresponds to the running time of the finite difference algorithm on a grid with a small finite number of grid points. Even with a large number of processors, it is extremely difficult to attain this level of running speed using the deterministic finite difference algorithm: it is difficult to use parallel architectures (though this is a well-studied problem) and obtain an algorithm whose running time is inversely proportional to the number of available processors.

Second, Monte Carlo methods are much more versatile, as they are based on solving “random ODEs” (stochastic ODEs), with which it is easier to account for, e.g., complicated geometries. Finally, the convergence of the Monte Carlo algorithm in M^{−1/2} is rather slow. Many algorithms have been developed to accelerate convergence. These algorithms are called variance-reduction algorithms and are based on estimating the solution U_j^n by using non-physical random processes with smaller variance than the classical Monte Carlo method described above.
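The cost comparison above reduces to comparing exponents of ∆t^{−1}. A minimal check of the crossover (the range of dimensions tested is an arbitrary choice):

```python
# Cost exponents from the text, for time step dt with h^2 = O(dt):
#   Monte Carlo: dt^{-1} per trajectory times dt^{-2} trajectories
#                -> exponent 3
#   explicit finite differences in d dimensions: dt^{-1} time steps on
#                dt^{-d/2} grid points -> exponent 1 + d/2
# Monte Carlo wins (or ties) exactly when 3 <= 1 + d/2, i.e. d >= 4.

def mc_exponent():
    return 3.0                    # cost O(dt^{-3})

def fd_exponent(d):
    return 1.0 + d / 2.0          # cost O(dt^{-(1 + d/2)})

cheaper_mc = [d for d in range(1, 9) if mc_exponent() <= fd_exponent(d)]
```

This reproduces the conclusion of (7.15): Monte Carlo is competitive from dimension 4 upward.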