Professional Documents
Culture Documents
Lecture Notes On Numerical Analysis of Partial Di Erential Equations
Lecture Notes On Numerical Analysis of Partial Di Erential Equations
Lecture notes on
Numerical Analysis of
Partial Differential Equations
version of 2011-09-05
Douglas N. Arnold
c
2009
by Douglas N. Arnold. These notes may not be duplicated without explicit permission from the author.
Contents
Chapter 1. Introduction
1. Basic examples of PDEs
1.1. Heat flow and the heat equation
1.2. Elastic membranes
1.3. Elastic plates
2. Some motivations for studying the numerical analysis of PDE
1
1
1
3
3
4
7
7
10
11
13
15
17
17
20
21
21
23
23
29
29
31
38
40
49
49
50
51
53
54
55
55
56
57
62
CONTENTS
7.1. Estimate in H 1
7.2. Estimate in L2
8. A posteriori error estimates and adaptivity
8.1. The Clement interpolant
8.2. The residual and the error
8.3. Estimating the residual
8.4. A posteriori error indicators
8.5. Examples of adaptive finite element computations
62
63
64
64
67
68
69
70
75
75
76
78
79
79
80
81
83
CHAPTER 1
Introduction
Galileo wrote that the great book of nature is written in the language of mathematics. The most precise and concise description of many physical systems is through partial
differential equations.
1. Basic examples of PDEs
1.1. Heat flow and the heat equation. We start with a typical physical application
of partial differential equations, the modeling of heat flow. Suppose we have a solid body
occupying a region R3 . The temperature distribution in the body can be given by a
function u : J R where J is an interval of time we are interested in and u(x, t) is
the temperature at a point x at time t J. The heat content (the amount of thermal
energy) in a subbody D is given by
Z
heat content of D =
cu dx
D
where c is the product of the specific heat of the material and the density of the material.
Since the temperature may vary with time, so can the heat content of D. The change of
heat energy in D from a time t1 to a time t2 is given by
Z
Z
change of heat in D =
cu(x, t2 ) dx
cu(x, t1 ) dx
D
D
Z t2
Z
Z t2 Z
(cu)
=
cu dx dt =
(x, t) dx dt.
t
t1 t D
t1
D
Now, by conservation of energy, any change of heat in D must be accounted for by heat
flowing in or out of D through its boundary or by heat entering from external sources (e.g.,
if the body were in a microwave oven). The heat flow is measured by a vector field (x, t)
called the heat flux, which points in the direction in which heat is flowing with magnitude
the rate energy flowing across a unit area per unit time. If we have a surface S embedded
in D with
R normal n, then the heat flowing across S in the direction pointed to by n in unit
time is S n ds. Therefore the heat that flows out of D, i.e., across its boundary, in the
time interval [t1 , t2 ], is given by
Z t2 Z
Z t2 Z
heat flow out of D
n ds dt =
div dx dt,
t1
t1
where we have used the divergence theorem. We denote the heat entering
R t Rfrom external
sources by f (x, t), given as energy per unit volume per unit time, so that t12 D f (x, t) dx dt
1
1. INTRODUCTION
gives amount external heat added to D during [t1 , t2 ], and so conservation of energy is
expressed by the equation
Z t2 Z
Z t2 Z
Z t2 Z
(cu)
f (x, t) dx dt,
(1.1)
div ds dt +
(x, t) dx dt =
t
D
D
t1
t1
t1
D
for all subbodies D and times t1 , t2 . Thus the quantity
(cu)
+ div f
t
must vanish identically, and so we have established the differential equation
(cu)
= div + f, x , t.
t
To complete the description of heat flow, we need a constitutive equation, which tells
us how the heat flux depends on the temperature. The simplest is Fouriers law of heat
conduction, which says that heat flows in the direction opposite the temperature gradient
with a rate proportional to the magnitude of the gradient:
= grad u,
where the positive quantity is called the conductivity of the material. (Usually is just
a scalar, but if the material is thermally anisotropic, i.e., it has preferred directions of heat
flow, as might be a fibrous or laminated material, can be a 3 3 positive-definite matrix.)
Therefore we have obtained the equation
(cu)
= div( grad u) + f in J.
t
The source function f , the material coefficients c and and the solution u can all be functions
of x and t. If the material is homogeneous (the same everywhere) and not changing with
time, then c and are constants and the equation simplifies to the heat equation,
u
= u + f,
t
where = c/ and we have f = f /. If the material coefficients depend on the temperature
u, as may well happen, we get a nonlinear PDE generalizing the heat equation.
The heat equation not only governs heat flow, but all sorts of diffusion processes where
some quantity flows from regions of higher to lower concentration. The heat equation is the
prototypical parabolic differential equation.
Now suppose our body reaches a steady state: the temperature is unchanging. Then the
time derivative term drops and we get
(1.2)
div( grad u) = f in ,
where now u and f are functions of f alone. For a homogeneous material, this becomes the
Poisson equation
u = f,
the prototypical elliptic differential equation. For an inhomogeneous material we can leave
the steady state heat equation in divergence form as in (1.2), or differentiate out to obtain
u + grad grad u = f.
To determine the steady state temperature distribution in a body we need to know not
only the sources and sinks within the body (given by f ), but also what is happening at the
boundary := . For example a common situation is that the boundary is held at a given
temperature
(1.3)
u = g on .
The PDE (1.2) together with the Dirichlet boundary condition (1.3) form an elliptic boundary value problem. Under a wide variety of circumstances this problem can be shown to
have a unique solution. The following theorem is one example (although the smoothness
requirements can be greatly relaxed).
R+ , f :
2
R, g : R be C functions. Then there exists a unique function u C () satisfying
the differential equation (1.2) and the boundary condition (1.3). Moreover u is C .
Instead of the Dirichlet boundary condition of imposed temperature, we often see the
Neumann boundary condition of imposed heat flux (flow across the boundary):
u
= g on .
n
For example if g = 0, this says that the boundary is insulated. We may also have a Dirichlet
condition on part of the boundary and a Neumann condition on another.
1.2. Elastic membranes. Consider a taut (homogeneous isotropic) elastic membrane
affixed to a flat or nearly flat frame and possibly subject to a transverse force distribution,
e.g., a drum head hit by a mallet. We model this with a bounded domain R2 which
represents the undisturbed position of the membrane if the frame is flat and no force is
applied. At any point x of the domain and any time t, the transverse displacement is
given by u(x, t). As long as the displacements are small, then u approximately satisfies the
membrane equation
2u
2 = ku + f,
t
where is the density of the membrane (mass per unit area), k is the tension (force per
unit distance), and f is the imposed transverse force density (force per unit area). This is
a second order hyperbolic equation, the wave equation. If the membrane is in steady state,
the displacement satisfies the Poisson equation
u = f,
f = f /k.
1.3. Elastic plates. The derivation of the membrane equation depends upon the assumption that the membrane resists stretching (it is under tension), but does not resist
bending. If we consider a plate, i.e., a thin elastic body made of a material which resists
bending as well as stretching, we obtain instead the plate equation
2u
= D2 u + f,
t2
1. INTRODUCTION
where D is the bending modulus, a constant which takes into account the elasticity of the
material and the thickness of the plate (D = Et3 /[12(1 2 )] where E is Youngs modulus
and is Poissons ratio). Now the steady state equation is the biharmonic equation
2 u = f.
Later in this course we will study other partial differential equations, including the equations of elasticity, the Stokes and NavierStokes equations of fluid flow, and Maxwells equations of electromagnetics.
2. Some motivations for studying the numerical analysis of PDE
In this course we will study algorithms for obtaining approximate solutions to PDE
problems, for example, using the finite element method. Such algorithms are a hugely
developed technology (we will, in fact, only skim the surface of what is known in this course),
and there are thousands of computer codes implementing them. As an example of the sort of
work that is done routinely, here is the result of a simulation using a finite element method
to find a certain kind of force distribution, the so-called von Mises stress, engendered in a
connecting rod of a Porsche race car when a certain load is applied. The von Mises stress
predicts when and where the metal of the rod will deform, and was used to design the shape
of the rod.
Figure 1.1. Connector rods designed by LN Engineering for Porsche race
cars, and the stress distribution in a rod computed with finite elements.
But one should not get the idea that it is straightforward to solve any reasonable PDE
problem with finite elements. Not only do challenges constantly arise as practitioners seek
to model new systems and solve new equations, but when used with insufficient knowledge
and care, even advance numerical software can give disastrous results. A striking example
is the sinking of the Sleipner A offshore oil platform in the North Sea in 1991. This occured
when the Norwegian oil company, Statoil, was slowly lowering to the sea floor an array
of 24 massive concrete tanks, which would support the 57,000 ton platform (which was to
accomodate about 200 people and 40,000 tons of drilling equipment). By flooding the tanks
in a so-called controlled ballasting operation, they were lowered at the rate of about 5 cm
per minute. When they reached a depth of about 65m the tanks imploded and crashed to
the sea floor, leaving nothing but a pile of debris at 220 meters of depth. The crash did not
result in loss of life, but did cause a seismic event registering 3.0 on the Richter scale, and
an economic loss of about $700 million.
6m
The post accident investigation traced the error to inaccurate finite element approximation of one of the most basic PDEs used in engineering, the equations of linear elasticity,
which were used to model the tricell (using the popular finite element program NASTRAN).
The shear stresses were underestimated by 47%, leading to insufficient design. In particular,
certain concrete walls were not thick enough. More careful finite element analysis, made
after the accident, predicted that failure would occur with this design at a depth of 62m,
which matches well with the actual occurrence at 65m.
CHAPTER 2
u = f in ,
u = g on .
For simplicity we will consider first a very simple domain = (0, 1) (0, 1), the unit square
in R2 . Now this problem is so simplified that we can attack it analytically, e.g., by separation
of variables, but it is a very useful model problem for studying numerical methods.
1. The 5-point difference operator
Let N be a positive integer and set h = 1/N . Consider the mesh in R2
R2h := { (mh, nh) : m, n Z }.
Note that each mesh point x R2h has four nearest neighbors in R2h , one each to the left,
right, above, and below. We let h = R2h , the set of interior mesh points, and we regard
this a discretization of the domain . We also define h as the set of mesh points in R2h
which dont belong to h , but which have a nearest neighbor in h . We regard h as a
h := h h
discretization of . We also let
h R satisfying
To discretize (2.1) we shall seek a function uh :
(2.2)
h uh = f on h ,
uh = g on h .
h (mesh functions) to
Here h is an operator, to be defined, which takes functions on
functions on h . It should approximate the true Laplacian in the sense that if v is a smooth
and vh = v| is the associated mesh function, then we want
function on
h
h vh v|h
for h small.
Before defining h , let us turn to the one-dimensional case. That is, given a function vh
defined at the mesh points nh, n Z, we want to define a function Dh2 vh on the mesh points,
so that Dh2 vh v 00 |Zh if vh = v|Zh . One natural procedure is to construct the quadratic
polynomial p interpolating vh at three consecutive mesh points (n 1)h, nh, (n + 1)h, and
let Dh2 vh (nh) be the constant value of p00 . This gives the formula
vh (n + 1)h 2vh (nh) + vh (n 1)h
2
Dh vh (nh) = 2vh [(n 1)h, nh, (n + 1)h] =
.
h2
Dh2 is known as the 3-point difference approximation to d2 /dx2 . We know that if v is C 2 in
a neighborhood of nh, then limh0 v[x h, x, x + h] = v 00 (x)/2. In fact, it is easy to show
7
h2 (4)
v (), for some x h, x + h ,
12
h v(mh, nh) =
From the error estimate in the one-dimensional case we easily get that for v C 4 (),
h2 4 v
4v
h v(mh, nh) v(mh, nh) =
(, nh) + 4 (mh, ) ,
12 x4
y
for some , . Thus:
then
Theorem 2.1. If v C 2 (),
lim kh v vkL (h ) = 0.
h0
then
If v C 4 (),
kh v vkL (h )
h2
M4 ,
6
4
4
where M4 = max(k 4 v/x4 kL ()
, k v/y kL ()
).
4 1
0
1
0
0
0
0
0
u1,1
h f1,1 u1,0 u0,1
1 4 1
0
1
0
0
0
0 u2,1 h2 f2,1 u2,0
0
1 4 0
0
1
0
0
0 u3,1 h2 f3,1 u3,0 u4,1
0
0 4 1
0
1
0
0 u1,2 h2 f1,2 u0,2
1
1
0
1 4 1
0
1
0 u2,2 =
h2 f2,2
0
2
0
1
0
1 4 0
0
1 u3,2 h f3,2 u4,2
0
0
1
0
0 4 1
0
2
0
0
0
1
0
1 4 1
u2,3
h f2,3 u2,4
0
0
0
0
0
1
0
1 4
u3,3
h2 f3,3 u4,3 u3,4
The matrix may be rewritten as
A I O
I A I
O I A
1
A=
0
For general N the matrix can be
R(N 1)(N 1) :
A
I
O
.
..
O
1
0
4 1 .
1 4
O
I
A
..
.
O
O O
O O
O O ,
. . . .. ..
. .
I A
where I and O are the identity and zero matrix in R(N 1)(N 1) , respectively, and A
R(N 1)(N 1) is the tridiagonal matrix with 4 on the diagonal and 1 above and below the
diagonal. This assumes the unknowns are ordered
u1,1 , u2,1 , . . . , uN 1,1 , u1,2 , . . . , uN 1,N 1 ,
and the equations are ordered similarly.
The matrix can be created as in Matlab with the following code.
10
I = speye(n-1);
e = ones(n-1,1);
A = spdiags([e,-4*e,e],[-1,0,1],n-1,n-1);
J = spdiags([e,e],[-1,1],n-1,n-1);
Lh = kron(I,A) + kron(J,I)
Notice that the matrix has many special properties:
it is sparse with at most 5 elements per row nonzero
it is block tridiagonal, with tridiagonal and diagonal blocks
it is symmetric
it is diagonally dominant
its diagonal elements are negative, all others nonnegative
it is negative definite
2. Analysis via a maximum principle
We will now prove that the problem (2.2) has a unique solution and prove an error
estimate. The key will be a discrete maximum principle.
h satisfying
Theorem 2.2 (Discrete Maximum Principle). Let v be a function on
h v 0 on h .
Then maxh v maxh v. Equality holds if and only if v is constant.
Proof. Suppose maxh v maxh v. Take x0 h where the maximum is achieved.
Let x1 , x2 , x3 , and x4 be the nearest neighbors. Then
4v(x0 ) =
4
X
i=1
v(xi ) h h v(x0 )
4
X
v(xi ) 4v(x0 ),
i=1
since v(xi ) v(x0 ). Thus equality holds throughout and v achieves its maximum at all the
nearest neighbors of x0 as well. Applying the same argument to the neighbors in the interior,
and then to their neighbors, etc., we conclude that v is constant.
Remarks. 1. The analogous discrete minimum principle, obtained by reversing the inequalities and replacing max by min, holds. 2. This is a discrete analogue of the maximum
principle for the Laplace operator.
Theorem 2.3. There is a unique solution to the discrete boundary value problem (2.2).
Proof. Since we are dealing with a square linear system, it suffices to show nonsingularity, i.e., that if h uh = 0 on h and uh = 0 on h , then uh 0. Using the discrete
maximum and the discrete minimum principles, we see that in this case uh is everywhere
0.
The next result is a statement of maximum norm stability.
Theorem 2.4. The solution uh to (2.2) satisfies
1
(2.3)
kuh kL ( h ) kf kL (h ) + kgkL (h ) .
8
11
This is a stability result in the sense that it states that the mapping (f, g) 7 uh is
bounded uniformly with respect to h.
Proof. We introduce the comparison function (x) = [(x1 1/2)2 + (x2 1/2)2 ]/4,
h . Set M = kf kL ( ) . Then
which satisfies h = 1 on h , and 0 1/8 on
h
h (uh + M ) = h uh + M 0,
so
1
max uh max(uh + M ) max(uh + M ) max g + M.
h
h
h
h
8
Thus uh is bounded above by the right-hand side of (2.3). A similar argument applies to
uh giving the theorem.
By applying the stability result to the error u uh we can bound the error in terms of
the consistency error h u u.
Theorem 2.5. Let u be the solution of the Dirichlet problem (1.2) and uh the solution
of the discrete problem (2.2). Then
1
ku uh kL ( h ) k u h ukL ( h ) .
8
Proof. Since h uh = f = u on h , h (u uh ) = h u u. Also, u uh = 0 on
h . Apply Theorem 2.4 (with uh replaced by u uh ), we obtain the theorem.
Combining with Theorem 2.1, we obtain error estimates.
then
Corollary 2.6. If u C 2 (),
lim ku uh kL ( h ) = 0.
h0
then
If u C 4 (),
h2
M4 ,
48
4
4
where M4 = max(k 4 u/x41 kL ()
, k u/x2 kL ()
).
ku uh kL ( h )
uh = 0 on h .
12
should be taken if we wanted to get a good theory for the Poisson equation, but that wont
concern us now.) We shall assume that there is a solution u of the original problem.
Now let Xh and Yh be finite dimensional normed vector spaces and Lh : Xh Yh a linear
operator. Our numerical method, or discretization, is:
Given fh Yh find uh Xh such that Lh uh = fh .
Of course, this is a very minimalistic framework so far. Without some more hypotheses, we
do not know if this finite dimensional problem has a solution, or if the solution is unique.
And we certainly dont know that uh is in any sense an approximation of u.
In fact, up until now, there is no way to compare u to uh , since they belong to different
spaces. For this reason, we introduce a representative of u, rh u Xh . We can then talk
about the error rh u uh and its norm krh u uh kXh . If this error norm is small, that means
that uh is close to u, or at least close to our representative rh u of u, in the sense of the norm.
In short, we would like the error to be small in norm. To make this precise we do what
is always done in numerical analysis: we consider not a single discretization, but a sequence
of discretizations. To keep the notation simple, we will now think of h > 0 as a parameter
tending to 0, and suppose that we have the normed spaces Xh and Yh and the linear operator
Lh : Xh Yh and the element fh Yh for each h. This family of discretizations is called
convergent if the norm krh u uh kXh tends to 0 as h 0.
h ) which vanish on h , and
In our example, we take Xh to be the grid functions in L (
Yh to be the grid functions in L (), and equip both with the maximum norm. We also
simply define rh u = u|h . Thus a small error means that uh is close to the true solution u
at all the grid points, which is a desireable result.
Up until this point there is not enough substance to our abstract framework for us to be
able to prove a convergence result, because the only connection between the original problem
Lu = f and the discrete problems Lh uh = fh is that the notations are similar. We surely
need some hypotheses. The first of two key hypotheses is consistency, which say that, in
some sense, the discrete problem is reasonable, in that the solution of the original problem
almost satisfies the discrete problem. More precisely, we define the consistency error as
Lh rh u fh Yh , a quantity which we can measure using our norm in Yh . The family of
discretizations is called consistent if the norm kLh rh u fh kYh tends to 0 as h 0.
Not every consistent family of discretizations is convergent (as you can easily convince
yourself, since consistency involves the norm in Yh but not the norm in Xh and for convergence it is the opposite). There is a second key hypothesis, uniform well-posedness of
the discrete problems. More precisely, we assume that each discrete problem is uniquely
solvable (nonsingular): for every gh Yh there is a unique vh Xh with Lh vh = gh . Thus
1
the operator L1
h : Yh Xh is defined and we call its norm ch = kLh kL(Yh ,Xh ) the stability
constant of the discretization. The family of discretizations is called stable if the stability
constants are bounded uniformly in h: suph ch < . Note that stability is a property of the
discrete problems and depends on the particular choice of norms, but it does not depend on
the true solution u in any way.
With these definition we get a theorem which is trivial to prove, but which captures the
underlying structure of many convergence results in numerical PDE.
Theorem 2.7. Let there be given normed vector spaces Xh and Yh , an invertible linear
operator Lh : Xh Yh , an element fh Yh , and a representative rh u Xh . Define uh Xh
4. FOURIER ANALYSIS
13
by Lh uh = fh . Then the norm of the error is bounded by the stability constant times the
norm of the consistency error. If a family of such discretizations is consistent and stable,
then it is convergent.
Proof. Since Lh uh = fh ,
Lh (rh u uh ) = Lh rh u fh .
Applying L1
h we obtain
rh u uh = L1
h (Lh rh u fh ),
N
1
X
k=1
u(kh)v(kh),
14
Thus
Dh2 m = m m ,
m =
4
2
2 mh
.
(1
cos
mh)
=
sin
h2
h2
2
Note that
4
.
h2
Note also that for small m << N , m 2 m2 . In particular 1 2 . To get a strict lower
bound we note that 1 = 8 for N = 2 and 1 increases with N .
Since the operator Dh2 is symmetric with respect to the inner product on L(Ih ), and the
eigenvalues m are distinct, it follows that the eigenvectors m are mutually orthogonal.
(This can also be obtained using trigonometric identities, or by expressing the sin functions
in terms of complex exponentials and using the discrete Fourier transform.) Since there are
N 1 of them, they form a basis of L(Ih ).
0 < 1 < 2 < < N 1 <
N
1
X
am m ,
kvk2h
N
1
X
m=1
a2m km k2h .
m=1
Then
f =
N
1
X
m am m ,
kf k2h
m=1
N
1
X
m=1
hu, vih = h
N
1 N
1
X
X
m=1 n=1
15
k1
h k
Since the kvkh kvkL (h ) we also have consistency with respect to the discrete 2-norm.
We leave it to the reader to complete the analysis with a convergence result.
5. Analysis via summation by parts
Fourier analysis is not the only approach to get an L2 stability result. Another uses
summation by parts, the discrete analogue of integration by parts.
Let v be a mesh function. Define the backward difference operator
v(mh, nh) v((m 1)h, nh)
, 1 m N,
h
In this section we denote
N X
N
X
2
hv, wih = h
v(mh, nh)w(mh, nh),
x v(mh, nh) =
0 n N.
m=1 n=1
with the corresponding norm k kh (this agrees with the notation in the last section for mesh
functions which vanish on h ).
Lemma 2.10. If v L(h ) (the set of mesh functions vanishing on h ), then
1
kvkh (kx vkh + ky vkh ).
2
Proof. It is enough to show that kvkh kx vkh . The same will similarly hold for y as
well, and we can average the two results.
For 1 m N , 0 n N ,
!2
N
X
2
|v(mh, nh)|
|v(ih, nh) v((i 1)h, nh)|
i=1
!2
N
X
h
|x v(ih, nh)|
i=1
N
X
h
|x v(ih, nh)|2
i=1
=h
N
X
!
h
N
X
i=1
|x v(ih, nh)|2 .
i=1
Therefore
h
N
X
|v(mh, nh)| h
m=1
N
X
i=1
|x v(ih, nh)|2
!
2
16
and
h2
N X
N
X
|v(mh, nh)|2 h2
N X
N
X
|x v(ih, nh)|2 ,
i=1 n=1
m=1 n=1
i.e.,
kvk2h
kx vk2h ,
as desired.
i=1
N
X
vi wi +
i=1
=2
N
1
X
N
X
vi wi
i=1
vi1 wi1
i=1
N
X
vi1 wi
i=1
N
1
X
vi1 wi
i=1
N
1
X
N
X
vi wi1
i=1
vi+1 wi
i=1
N
1
X
i=1
Hence,
h
N
1
X
i=1
N
X
i=1
and thus
hDx2 v, wih = hx v, x wih .
Similarly, hDy2 v, wih = hy v, y wih , so the lemma follows.
Combining the discrete Poincare inequality with the discrete Greens theorem, we immediately get a stability result. If v L(h ), then
1
1
1
kvk2h (kx vk2h + ky vk2h ) = hh v, vih kh vkh kvkh .
2
2
2
Thus
kvkh kh vkh ,
which is a stability result.
v L(h ),
6. EXTENSIONS
17
6. Extensions
6.1. Curved boundaries. Thus far we have studied as a model problem the discretization of Poissons problem on the square. In this subsection we consider a variant which can
be used to discretize Poissons problem on a fairly general domain.
Let be a smoothly bounded open set in R2 with boundary . We again consider the
Dirichlet problem for Poissons equation, (2.1), and again set h = R2h . If (x, y) h
and the segment (x + sh, y), 0 s 1 belongs to , then the point (x + h, y), which belongs
to h , is a neighbor of (x, y) to the right. If this segment doesnt belong to we define
another sort of neighbor to the right, which belongs to . Namely we define the neighbor to
be the point (x + sh, y) where 0 < s 1 is the largest value for which (x + th, y) for
all 0 t < s. The points of so constructed (as neighbors to the right or left or above or
below points in h ) constitute h . Thus every point in h has four nearest neighbors all of
h := h h . We also define
which belong to
h as those points in h all four of whose
neighbor belong to h and h as those points in h with at least one neighbor in h . See
Figure 2.2.
In order to discretize the Poisson equation we need to construct a discrete analogue of
h . Of course on
the Laplacian h v for mesh functions v on
h , h v is defined as the usual
5-point Laplacian. For (x, y) h , let (x+h1 , y), (x, y +h2 ), (xh3 , y), and (x, y h4 ) be the
nearest neighbors (with 0 < hi h), and let v1 , v2 , v3 , and v4 denote the value of v at these
four points. Setting v0 = v(x, y) as well, we will define h v(x, y) as a linear combination of
the five values vi . In order to derive the formula, we first consider approximating d2 v/dx2 (0)
by a linear combination of v(h ), v(0), and v(h+ ), for a function v of one variable. By
Taylors theorem
v(h ) + 0 v(0) + + v(h+ ) = ( + 0 + + )v(0) + (+ h+ h )v 0 (0)
1
1
+ (+ h2+ + h2 )v 00 (0) + (+ h3+ h3 )v 000 (0) + .
2
6
18
+ h+ h = 0,
1
(+ h2+ + h2 ) = 1,
2
which give
=
2
,
h (h + h+ )
+ =
2
,
h+ (h + h+ )
0 =
2
.
h h+
Note that we have simply recovered the usual divided difference approximation to d2 v/dx2 :
v(h ) + 0 v(0) + + v(h+ ) =
Returning to the 2-dimensional case, and applying the above considerations to both
2 v/x2 and 2 v/y 2 we arrive at the ShortleyWeller formula for h v:
h v(x, y)
2
2
2
2
=
v1 +
v2 +
v3 +
v4
h1 (h1 + h3 )
h2 (h2 + h4 )
h3 (h1 + h3 )
h4 (h2 + h4 )
2
2
+
h1 h3 h2 h4
v0 .
Using Taylors theorem with remainder we easily calculate that for v C 3 (),
2M3
h,
3
where M3 is the maximum of the L norms of the third derivatives of v. Of course at the
mesh points in
h , the truncation error is bounded by M4 h2 /6 = O(h2 ), as before, but for
mesh points neighboring the boundary, it is reduced to O(h).
h R determined again by 2.2. This is a
The approximate solution to (2.1) is uh :
system of linear equations with one unknown for each point of h . In general the matrix
wont be symmetric, but it maintains other good properties from the case of the square:
k v h vkL (h )
xh
6. EXTENSIONS
19
Thus we use the maximum norm except with a weight which decreases the emphasis on the
points with a neighbor on the boundary. The advantage of this norm is that, measured in
this norm, the consistency error is still O(h2 ):
M4 2 2M3
kh u ukYh max
h ,h
h = O(h2 ).
6
3
We will now show that the Shortley-Weller discrete Laplacian is stable from Xh to Yh . For the
argument we will use the maximum principle with a slightly more sophisticated comparison
function.
h R defined by (x1 , x2 ) = [(x1 1/2)2 +
Before we used as a comparison function :
2
(x2 1/2) ]/4, where (1/2, 1/2) was chosen as the vertex because it was the center of the
square (making kkL as small as possible while satisfying h 1). Now, suppose that
is contained in the disk of some radius r about some point p. Then we define
(
[(x1 p1 )2 + (x2 p2 )2 ]/4,
x h ,
(2.5)
(x) =
2
2
[(x1 p2 ) + (x2 p2 ) ]/4 + h, x h
Thus we perturb the quadratic comparison function by adding h on the boundary. Then
is bounded independent of h (kkL r2 /4 + h r2 /4 + 2r). Moreover h (x) = 1,
if x
h , since then is just the simple quadratic at x and all its neighbors. However, if
x h , then there is an additional term in h (x) for each neighbor of x on the boundary
(typically one or two). For example, if (x1 h1 , x2 ) h is a neighbor of x and the other
neighbors are in
h , then
2
h (x) = 1 +
h h1 ,
h1 (h1 + h)
since h1 h. Thus we have
(
1,
x
h ,
h (x)
1
h , x h .
(2.6)
20
v Xh ,
which is the desired stability result. As usual, we apply the stability estimate to v = u uh ,
and so get the error estimate
ku uh kL ( h ) ckh u ukYh = O(h2 ).
Remark. The perturbation of h on the boundary in the definition (2.5) of the comparison
function , allowed us to place a factor of h in front of the h terms in the Yh norm (2.4) and
still obtain stability. For this we needed (2.6) and the fact that the perturbed comparison
function remained bounded independent of h. In fact, we could take a larger perturbation
by replacing h with 1 in (2.5). This would lead to a strengthening of (2.6), namely we could
replace h1 by h2 , and still have bounded independently of h. In this way we can prove
stability with the same L norm for Xh and an even weaker norm for Yh :
2
kf kYh := max max |f (x)|, h max |f (x)| .
x
h
xh
We thus get an even stronger error bound, with Th = h u u denoting the truncation
error, we get
o
n
2
2
3
ku uh kL ( h ) c max kTh kL (
,
h
kT
k
= O(h2 ).
h
L (h ) c max M4 h , M3 h
h )
This estimate shows that the points with neighbors on the boundary, despite having the
largest truncation error (O(h) rather than O(h2 ) for the other grid points), contribute only
a small portion of the error (O(h3 ) rather than O(h2 )).
This example should be another warning to placing too much trust in a naive analysis
of a numerical method by just using Taylors theorem to expand the truncation error. Not
only can a method perform worse than this might suggest, because of instability, it can also
perform better, because of additional stability properties, as in this example.
6.2. More general PDEs. It is not difficult to extend the method and analysis to
more general PDEs. For example, instead of the Poisson equation, we may take
u a
u
u
b
cu = f,
x1
x2
u(x1 + h, x2 ) u(x1 h, x2 )
u(x1 , x2 + h) u(x1 , x2 h)
b(x)
h
h
c(x)u(x) = f (x), x h .
It is easy to show that the truncation error is O(h2 ). As long as the coefficient c 0, a version
of the discrete maximum principle holds, and one thus obtains stability and convergence.
6. EXTENSIONS
21
6.3. More general boundary conditions. It is also fairly easy to extend the method
to more general boundary conditions, e.g., the Neumann condition u/n = g on all or
part of the boundary, although some cleverness is needed to obtain a stable method with
truncation error O(h2 ) especially on a domain with curved boundary. We will not go into
this topic here, but will treat Neumann problems when we consider finite elements.
6.4. Nonlinear problems. Consider, for example, the quasilinear equation
u = F (u, u/x1 , u/x2 ),
with Dirichlet boundary conditions on the square. Whether this problem has a solution,
and whether that solution is unique, or at least locally unique, depends on the nature of the
nonlinearity F , and is beyond the scope of these notes. Supposing the problem does have a
(locally) unique solution, we may try to compute it with finite differences. A simple scheme
is
h uh = F (uh , x1 uh , x2 uh ), x h ,
where we use, e.g., centered differences like
u(x1 + h, x2 ) u(x1 h, x2 )
x1 uh (x) =
, x h .
2h
Viewing the values of uh at the M interior mesh points as unknowns, this is a system of M
equations in M unknowns. The equations are not linear, but they have the same sparsity
pattern as the linear systems we considered earlier: the equation associated to a certain
grid point involves at most 5 unknowns, those associated to the grid point and its nearest
neighbors.
The nonlinear system is typically solved by an iterative method, very often Newtons
method or a variant of it. Issues like solvability, consistency, stability, and convergence can
be studied for a variety of particular nonlinear problems. As for nonlinear PDE themselves,
many issues arise which vary with the problem under consideration.
CHAPTER 3
1. Classical iterations
Gaussian elimination and its variants are called direct methods, meaning that they produce the exact solution of the linear system in finite number of steps. (This ignores the effects
of round-off error, which is, in fact, a significant issue for some problems.) More efficient
methods are iterative methods, which start from an initial guess u0 of the solution of the
linear system, and produce a sequence u1 , u2 , . . . , of iterates whichhopefullyconverge
to the solution of the linear system. Stopping after a finite number of iterates, gives us an
approximate solution to the linear system. This is very reasonable. Since the solution of
the linear system is only an approximation for the solution of the PDE problem, there is
little point in computing it exactly or nearly exactly. If the numerical discretization provides
23
24
Figure 3.1. Algorithmic speedup from early 1960s through late 1990s for
solving the discrete Laplacian on a cubic mesh of size 64 64 64. The
comparison line labelled Moores Law is based on a speedup by a factor of
two every 18 months.
about 4 significant digits, we would be happy if the linear solver provides 4 or maybe 5 digits.
Further accuracy in the linear solver serves no purpose.
For an iterative method the goal is, of course, to design an iteration for which
(1) the iteration is efficient, i.e., the amount of work to compute an iteration should not
be too large: typically we want it to be proportional to the number n of unknowns;
(2) the rate of convergence of the iterative method is fast, so that not too many iterations
are needed.
First we consider some classical iterative methods to solve Au = f . One way to motivate
such methods is to note that if u0 is some approximate solution, then the exact solution u
may be written u = u0 +e and the error e = uu0 is related to the residual r = f Au0 by the
equation Ae = r. That is, we can express u as a residual correction to u0 : u = u0 + A1 (f
Au0 ). Of course, this merely rephrases the problem, since computing e = A1 (f Au0 )
means solving Ae = r for e, which is as difficult as the original problem of solving Au = f
for u. But suppose we can find some nonsingular matrix B which approximates A1 but is
less costly to apply. We are then led to the iteration u1 = u0 + B(f Au0 ), which can be
repeated to give
(3.1)
i = 0, 1, 2, . . . .
1. CLASSICAL ITERATIONS
25
Of course the effectiveness of such a method will depend on the choice of B. For speed of
convergence, we want B to be close to A1 . For efficiency, we want B to be easy to apply.
Some typical choices of B are:
B = I for some > 0. As we shall see, this method will converge for symmetric
positive definite A if is a sufficiently small positive number. This iteration is often
called Richardson iteration.
B = D1 where D is the diagonal matrix with the same diagonal elements as A.
This is called the Jacobi method.
B = E 1 where E is the lower triangular matrix with the same diagonal and subdiagonal elements of A. This is the GaussSeidel method.
Another way to derive the classical iterative methods, instead of residual correction, is
to give a splitting of A as P + Q for two matrices P and Q where P is in some sense close to
A but much easier to invert. We then write the equations as P u = f Qu, which suggests
the iteration
ui+1 = P 1 (f Qui ).
Since Q = A P , this iteration may also be written
ui+1 = ui + P 1 (f Aui ).
Thus this iteration coincides with (3.1) when B = P 1 .
Sometimes the iteration (3.1) is modified to
ui+1 = (1 )ui + [ui + B(f Aui )],
i = 0, 1, 2, . . . ,
for a real parameter . If = 1, this is the unmodified iteration. For 0 < < 1 the iteration
has been damped, while for > 1 the iteration is amplified. The damped Jacobi method will
come up below when we study multigrid. The amplified GaussSeidel method is known as
SOR (successive over-relaxation). This terminology is explained in the next two paragraphs.
Before investigating their convergence, let us particularize the classical iterations to the
discrete Laplacian 2h in one or two dimensions. In one dimension, the equations are
um+1 + 2um um1
= f m , m = 1, . . . , N 1,
h2
where h = 1/N and u0 = uN = 0. The Jacobi iteration is then simply
um
i+1
um1
+ um+1
h2
i
i
=
+ f m,
2
2
m = 1, . . . , N 1,
em1
+ em+1
i
i
,
2
so at each iteration the error at a point is set equal to the average of the errors at the
neighboring points at the previous iteration. The same holds true for the 5-point Laplacian
in two dimensions, except that now there are four neighboring points. In an old terminology,
updating the value at a point based on the values at the neighboring points is called relaxing
the value at the point.
For the GaussSeidel method, the corresponding equations are
em
i+1 =
um
i+1 =
m+1
um1
h2
i+1 + ui
+ f m,
2
2
m = 1, . . . , N 1,
26
and
m+1
em1
i+1 + ei
, m = 1, . . . , N 1.
2
We can think of the Jacobi method as updating the value of u at all the mesh points
simultaneously based on the old values, while the GaussSeidel method updates the values
of one point after another always using the previously updated values. For this reason the
Jacobi method is sometimes referred to as simultaneous relaxation and the GaussSeidel
method as successive relaxation (and amplified GaussSeidel as successive overrelaxation).
Note that the GaussSeidel iteration gives different results if the unknowns are reordered. (In
fact, from the point of view of convergence of GaussSeidel, there are better orderings than
just the naive orderings we have taken so far.) By contrast, the Jacobi iteration is unaffected
by reordering of the unknowns. The Jacobi iteration is very naturally a parallel algorithm:
if we have many processors, each can independently update one or several variables.
Our next goal is to investigate the convergence of (3.1). Before doing so we make some
preliminary definition and observations. First we recall that a sequence of vectors or matrices
Xi converges linearly to a vector or matrix X if there exists a positive number r < 1 and a
number C such that
em
i+1 =
(3.2)
kX Xi k Cri ,
i = 1, 2, . . . .
06=xRn
xT (GT G)x
(Gx)T Gx
=
sup
= (GT G),
Tx
n
xT x
x
06=xR
1. CLASSICAL ITERATIONS
27
Theorem 3.1. Let G Rnn and > 0. Then there exists a norm on Rn such that the
corresponding matrix norm satisfies kGk (G) + .
Proof. We may use the Jordan canonical form to write SGS 1 = J where S is an
invertible matrix and J has the eigenvalues of G on the diagonal, 0s and s on the first
superdiagonal, and 0s everywhere else. (The usual Jordan canonical form is the case = 1,
but if we conjugate a Jordan block by the matrix diag(1, , 2 , . . .) the 1s above the diagonal
are changed to .) We select as the vector norm kxk := kSxk . This leads to kGk =
kSGS 1 k = kJk (A) + (the infinity matrix norm, is the maximum of the row
sums).
An important corollary of this result is a criterion for when the powers of a matrix tend
to zero.
Theorem 3.2. For G Rnn , limi Gi = 0 if and only if (G) < 1, and in this case
the convergence is linear with rate (G).
Proof. For any choice of vector norm kGn k (Gn ) = (G)n , so if (G) 1, then Gn
does not converge to 0.
Conversely, if (G) < 1, then for any ((G), 1) we can find an operator norm so that
kGk , and then kGn k kGkn = n 0.
We now apply this result to the question of convergence of the iteration (3.1), which we
write as
ui+1 = (I BA)ui + Bf = Gui + Bf,
where the iteration matrix G = I BA. The equation u = Gu + Bf is certainly satisfied
(where u is the exact solution), and so we have another way to view a classical iteration:
it is a one-point iteration for this fixed point equation. The error then satisfies ei+1 = Gei ,
and the method converges for all starting values e0 = u u0 if and only if limi Gi = 0,
which, as we have just seen, holds if and only if (G) < 1, in which case the convergence
is linear with rate of linear convergence (G). Now the condition that the (G) < 1 means
that all the eigenvalues of G = I BA lie strictly inside the unit circle in the complex
plane, or equivalently that all the eigenvalues of BA lie strictly inside the circle of radius 1
in the complex plane centered at the point 1. If BA has real eigenvalues, then the condition
becomes that all the eigenvalues of BA belong to the interval (0, 2). Note that, if A is
symmetric positive definite (SPD) and B is symmetric, then BA is symmetric with respect
to the inner product hu, viA = uT Av, so BA does indeed have real eigenvalues in that case.
As a first example, we consider the convergence of the Richardson method for an SPD
matrix A. Since the matrix is SPD, it has a basis of eigenvectors with positive real eigenvalues
0 < min (A) = 1 2 n = max (A) = (A).
The eigenvalues of BA = A are then i , i = 1, . . . , n, and the iteration converges if and
only if 0 < < 2/max .
Theorem 3.3. Let A be an SPD matrix. Then the Richardson iteration um+1 = um +
(f Aum ) is convergent for all choices of u0 if and only if 0 < < 2/max (A). In this
case the rate of convergence is
max(|1 max (A)|, |1 min (A)|).
28
Note that the optimal choice is given by max (A) 1 = 1 min (A), i.e., opt =
2/[max (A) + min (A)], and, with this choice of , the rate of convergence is
max (A) min (A)
1
=
,
max (A) + min (A)
+1
where = max (A)/min (A) = kAk2 kA1 k2 is the spectral condition number of A. Of course,
in practice we do not know the eigenvalues, so we cannot make the optimal choice. But even
if we could, we would find very slow convergence when is large, as it typically is for
discretizations of PDE.
For example, if we consider A = Dh2 , then min 2 , max 4/h2 , so = O(h2 ), and
the rate of convergence is like 1 ch2 for some c. Thus the converge is indeed very slow (we
will need O(h2 ) iterations).
Note that for A = Dh2 the Jacobi method coincides with the Richardson method with
= h2 /2. Since max (A) < 4/h2 , we have < 2/max (A) and the Jacobi method is
convergent. But again convergence is very slow, with a rate of 1 O(h2 ). In fact for any
0 < 1, the damped Jacobi method is convergent, since it coincides with the Richardson
method with = h2 /2.
For the Richardson, Jacobi, and damped Jacobi iterations, the approximate inverse B is
symmetric, but this is not the case for GaussSeidel, in which B is the inverse of the lower
triangle of A. Of course we get a similar method if we use B T , the upper triangle of A. If we
take two steps of GaussSeidel, one with the lower triangle and one with the upper triangle,
the iteration matrix is
(I B T A)(I BA) = I (B T + B B T AB)A,
so this double iteration is itself a classical iteration with the approximate inverse
(3.3)
:= B T + B B T AB.
B
we get the
This iteration is called symmetric GaussSeidel. Now, from the definition of B,
identity
(3.4)
GaussSeidel iteration is convergent if and only if min (BA) > 0, i.e., if and only if BA
is SPD with respect to the A inner product. This is easily checked to be equivalent to B
being SPD with respect to the usual inner product. When this is the case (3.4) implies that
k(I BA)vkA < kvkA for all nonzero v, and hence the original iteration is convergent as
well.
In fact the above argument didnt use any properties of the original approximate inverse
B. So what we have really proved this more general theorem.
Theorem 3.4. Let ui+1 = ui + B(f Aui ) be an iterative method in residual correction
Aui ) with B
given by
form, and consider the symmetrized iteration, i.e., ui+1 = ui + B(f
(3.3). Then the symmetrized iteration is convergent if and only if B is SPD, and, in that
case, the original iteration is convergent as well.
29
30
At each step the search direction si and step length i are chosen to, hopefully, get us nearer
to the desired solution vector. If the goal is to minimize a function F : Rn R (quadratic
or not), a reasonable choice (but certainly not the only reasonable choice) of search direction
is the direction of steepest descent of F at ui , i.e., si = F (ui ). In our quadratic case, the
steepest descent direction is si = f Aui = ri , the residual. Thus the Richardson iteration
can be viewed as a line search method with steepest descent as search direction, and a fixed
step size.
Also for a general minimization problem, for any choice of search direction, there is an
obvious choice of stepsize, namely we can do an exact line search by minimizing the function
of one variable 7 F (ui + si ). Thus we must solve sTi F (ui + si ) = 0, which, in the
quadratic case, gives
(3.5)
i =
sTi ri
.
sTi Asi
If we choose the steepest descent direction with exact line search, we get si = ri , i =
riT ri /riT Ari , giving the method of steepest descents:
choose initial iterate u0
for i = 0, 1, . . .
set ri = f Aui
rT r
set ui+1 = ui + rTiAri i ri
i
end
Thus the method of steepest descents is a variant of the Richardson iteration ui+1 =
ui + (f Aui ) in which the parameter depends on i. It does not fit in the category of
simple iterations ui+1 = Gui + Bf with a fixed iteration matrix G which we analyzed in the
previous section, so we shall need to analyze it by other means.
Let us consider the work per iteration of the method of steepest descents. As written
above, it appears to require two matrix-vector multiplications per iteration, one to compute
31
Ari used in defining the step length, and one to compute Aui used to compute the residual,
and one to compute Ari used in defining the step length. However, once we have computed
pi := Ari and the step length i we can compute the next residual without an additional
matrix-vector multiplication, since ui+1 = ui + i ri implies that ri+1 = ri i pi . Thus we
can write the algorithm as
choose u0
set r0 = f Au0
for i = 0, 1, . . .
set pi = Ari
rT r
set i = riT pii
i
set ui+1 = ui + i ri
set ri+1 = ri i pi
end
Thus, for each iteration we need to compute one matrix-vector multiplication, two Euclidean inner products, and two operations which consist of a scalar-vector multiplication and
a vector-vector additions (referred to as a SAXPY operation). The matrix-vector multiplication involves roughly one addition and multiplication for each nonzero in the matrix, while
the inner products and SAXPY operations each involve n multiplications and additions. If
A is sparse with O(n) nonzero elements, the entire per iteration cost is O(n) operations.
We shall show below that if the matrix A is SPD, the method of steepest descents
converges to the solution of Au = f for any initial iterate u0 , and that the convergence is
linear with the same rate of convergence as we found for Richardson extrapolation with the
optimal parameter, namely ( 1)/( + 1) where is the spectral condition number of A.
This means, again, that the convergence is slow if the condition number is large. This is
quite easy to visualize already for 2 2 matrices. See Figure 3.2.
2.2. The conjugate gradient method. The slow convergence of the method of steepest descents motivates a far superior line search method, the conjugate gradient method. CG
also uses exact line search to choose the step length, but uses a more sophisticated choice of
search direction than steepest descents.
For any line search method with exact line search, u1 = u0 + 0 s0 minimizes F over the
1-dimensional affine space u0 + span[s0 ], and then u2 = u0 + 0 s0 + 1 s1 minimizes F over
the 1-dimensional affine space u0 + 0 s0 + span[s1 ]. However u2 does not minimize F over
the 2-dimensional affine space u0 + span[s0 , s1 ]. If that were the case, then for 2-dimensional
problems we would have u2 = u and we saw that that was far from the case for steepest
descents.
However, it turns out that there is a simple condition on the search directions si that
ensures that u2 is the minimizer of F over u0 + span[s0 , s1 ], and more generally that ui
is the minimizer of F over u0 + span[s0 , . . . , si1 ]. Such a choice of search directions is
very favorable. While we only need do 1-dimensional minimizations, after k steps we end
up finding the minimizer in an k-dimensional space. In particular, as long as the search
directions are linearly independent, this implies that un = u.
32
2.5
2.5
1.5
1.5
0.5
0.5
0.5
0.5
1
1
0.5
0.5
1.5
2.5
1
1
0.5
0.5
1.5
2.5
Theorem 3.6. Suppose that ui are defined by exact line search using search directions
which are A-orthogonal: sTi Asj = 0 for i 6= j. Then
F (ui+1 ) = min{ F (u) | u u0 + span[s0 , . . . , si ] }.
Proof. Write Wi+1 for span[s0 , . . . , si ]. Now
min F =
u0 +Wi+1
min min F (y + si ).
yu0 +Wi R
The key point is that the function (y, ) 7 F (y + si ) decouples into the sum of a function
of y which does not depend on plus a function of which does not depend on y. This is
because ui u0 + Wi , so sTi Aui = sTi Au0 = sTi Ay for for any y u0 + Wi , thanks to the
A-orthogonality of the search directions. Thus
1
2
F (y + si ) = y T Ay + sTi Ay + sTi Asi y T f sTi f
2
2
1 T
2 T
T
T
=
y Ay y f +
s Asi si (f Aui ) .
2
2 i
Since only the term in brackets involves , the minimum occurs when minimizes that term,
which is when = sTi (f Aui )/sTi Asi , which is the formula for exact line search.
Any method which uses A-orthogonal (also called conjugate) search directions has the
nice property of the theorem. However it is not so easy to construct such directions. By
far the most useful method is the method of conjugate gradients, or the CG method, which
defines the search directions by A-orthogonalizing the residuals ri = f Aui :
s 0 = r0
s i = ri
i1 T
X
sj Ari
sj .
T
s
As
j
j
j=0
33
This sequence of search directions, together with the exact line search choice of step length
(3.5) defines the conjugate gradient. The last formula (which is just the Gram-Schmidt
procedure) appears to be quite expensive to implement and to involve a lot of storage, but
fortunately we shall see that it may be greatly simplified.
Lemma 3.7.
(1) Wi = span[s0 , . . . , si1 ] = span[r0 , . . . , ri1 ].
(2) The residuals are l2 -orthogonal: riT rj = 0 for i 6= j.
(3) There exists m n such that W1 ( W2 ( ( Wm = Wm+1 = and u0 6= u1 6=
=
6 um = um+1 = = u.
(4) For i m, { s0 , . . . , si1 } is an A-orthogonal basis for Wi and { r0 , . . . , ri1 } is an
l2 -orthogonal basis for Wi .
(5) sTi rj = riT ri for 0 j i.
Proof. The first statement comes directly from the definitions. To verify the second
statement, note that, for 0 j < i, F (ui + trj ) is minimal when t = 0, which gives
rjT (Aui f ) = 0, which is the desired orthogonality. For the third statement, certainly there
is a least integer m [1, n] so that Wm = Wm+1 . Then rm = 0 since it both belongs to
Wm and is orthogonal to Wm . This implies that um = u and that sm = 0. Since sm = 0
um+1 = um = u. Therefore rm+1 = 0, which implies that sm+1 = 0, um+2 = u, etc.
The fourth statement is an immediate consequence of the preceding ones. For the last
statement, we use the orthogonality of the residuals to see that sTi ri = riT ri . But, if 0 j
i,then
sTi rj sTi r0 = sTi A(u0 uj ) = 0,
since u0 uj Wi .
Since si Wi+1 and the rj , j i are an orthogonal basis for that space for i < m, we
have
si =
i
X
s T rj
i
j=0
rjT rj
rj .
riT ri
i
i1
X
X
rj
rj
T
= ri + ri ri
,
T
T
r r
r r
j=0 j j
j=0 j j
whence
si = ri +
riT ri
si1 .
T
ri1
ri1
This is the formula which is used to compute the search direction. In implementing this
formula it is useful to compute the residual from the formula ri+1 = ri i Asi (since
ui+1 = ui + i si ). Putting things together we obtain the following implementation of CG:
34
(matrix multiplication)
(dot product)
(SAXPY)
(SAXPY)
(dot product)
(SAXPY)
The conjugate gradient method gives the exact solution in n iterations, but it is most
commonly used as an iterative method and terminated with far fewer operations. A typical
stopping criterion would be to test if r2 is below a given tolerance. To justify this, we shall
show that the method is linearly convergence and we shall establish the rate of convergence.
For analytical purposes, it is most convenient to use the vector norm kukA := (uT Au)1/2 ,
and its associated matrix norm.
We start with a third characterization of Wi = span[s0 , . . . , si1 ] = span[r0 , . . . , ri1 ].
Lemma 3.8. Wi = span[r0 , Ar0 , . . . , Ai1 r0 ] for i = 1, 2, . . . , m.
Proof. Since dim Wi = i, it is enough to show that Wi span[r0 , Ar0 , . . . , Ai1 r0 ],
which we do by induction. This is certainly true for i = 1. Assume it holds for some i.
Then, since ui u0 + Wi , ri = f Aui r0 + AWi span[r0 , Ar0 , . . . , Ai r0 ], and therefore
Wi+1 , which is spanned by Wi and ri belongs to span[r0 , Ar0 , . . . , Ai r0 ], which completes the
induction.
35
The space span[r0 , Ar0 , . . . , Ai1 r0 ] is called the Krylov space generated by the matrix A
and the vector r0 . Note that we have as well
Wi = span[r0 , Ar0 , . . . , Ai1 r0 ] = { p(A)r0 | p Pi1 } = { q(A)(u u0 ) | q Pi , q(0) = 0 }.
Here Pi denotes the space of polynomials of degree at most i. Since ri is l2 -orthogonal to
Wi , u ui is A-orthogonal to Wi , so
ku ui kA = inf ku ui + wkA .
wWi
Since ui u0 Wi ,
inf ku ui + wkA = inf ku u0 + wkA .
wWi
wWi
pPi
p(0)=1
Applying the obvious bound kp(A)(u u0 )kA kp(A)kA ku u0 kA we see that we can obtain
an error estimate for the conjugate gradient method by estimating
K = inf kp(A)kA .
pPi
p(0)=1
Now if 0 < 1 < < n are the eigenvalues of A, then the eigenvalues of p(A) are p(j ),
j = 1, . . . , n, and kp(A)kA = maxj |p(j )|. Thus1
K = inf max |p(j )| inf
pPi
p(0)=1
max |p()|.
pPi 1 n
p(0)=1
The final infimum can be calculated explicitly, as will be explained below. Namely, for any
0 < a < b, and integer n > 0,
2
n
n .
(3.6)
min max |p(x)| =
pPn x[a,b]
b/a+1
b/a1
p(0)=1
+
b/a1
This gives
K
2
+1
1
i
b/a+1
i
1
,
i 2
+1
1
+1
where = n /1 is the condition number of A. (To get the right-hand side, we suppressed
the second term in the denominator of the left-hand side, which is less than 1 and tends to
zero with i, and kept only the first term, which is greater than 1 and tends to infinity with
i.) We have thus proven that
i
1
ku ui kA 2
ku u0 kA ,
+1
1Here
we bound maxj |p(j )| by max1 n |p()| simply because we can minimize the latter quantity
explicitly. However this does not necessarily lead to the best possible estimate, and the conjugate gradient
method is often observed to converge faster than the result derived here. Better bounds can sometimes be
obtained by taking into account the distribution of the spectrum of A, rather than just its minimum and
maximum.
36
1
r=
.
+1
Note that r 1 2/ for large . So the convergence deteriorates when the condition
number is large. However, this is still a notable improvement over the classical iterations.
For the discrete Laplacian, where = O(h2 ), the convergence rate is bounded by 1 ch,
not 1 ch2 .
The above analysis yields a convergence estimate for the method of steepest descent as
well. Indeed, the first step of conjugate gradients coincides with steepest descents, and so,
for steepest descents,
2
1
ku u1 kA +1 1 ku u0 kA =
ku u0 kA .
+1
+
1
+1
Of course, the same result holds if we replace u0 by ui and u1 by ui+1 . Thus steepest
descents converges linearly, with rate ( 1)/( + 1) (just like Richardson iteration with the
optimal parameter). Notice that the estimates indicate that a large value of will slow the
convergence of both steepest
descents and conjugate gradients, but, since the dependence
for conjugate gradients is on rather than , the convergence of conjugate gradients will
usually be much faster.
The figure shows a plot of the norm of the residual versus the number of iterations for
the conjugate gradient method and the method of steepest descents applied to a matrix
of size 233 arising from a finite element simulation. The matrix is irregular, but sparse
(averaging about 6 nonzero elements per row), and has a condition number of about 1, 400.
A logarithmic scale is used on the y-axis so the near linearity of the graph reflects linear
convergence behavior. For conjugate gradients, the observed rate of linear convergence is
between .7 and .8, and it takes 80 iterations to reduce the initial residual by a factor of
about 106 . The convergence of steepest descents is too slow to be useful: in 400 iterations
the residual is not even reduced by a factor of 2.
Remark. There are a variety of conjugate-gradient-like iterative methods that apply to
matrix problems Au = f where A is either indefinite, non-symmetric, or both. Many share
the idea of approximation of the solution in a Krylov space.
Our analysis of conjugate gradients and steepest descents depended on the explicit solution of the minimization problem given in (3.6). Here we outline the proof of this result,
leaving the details as an exercise.
The Chebyshev polynomials are defined by the recursion
T0 (x) = 1,
T1 (x) = x,
1
Tn (x) = cos(n arccos x), Tn (x) = [(x + x2 1)n + (x x2 1)n ],
2
with the first equation valid for |x| 1 and the second valid for |x| 1.
The polynomial Tn satisfies |Tn (x)| 1 on [1, 1] with equality holding for n + 1 distinct
numbers in [1, 1]. This can be used to establish the following: for any < 1, there does not
37
10
10
10
SD
0
1
10
norm of residual
norm of residual
10
10
10
20
10
10
10
30
10
CG
2
40
10
50
100
150
iterations
200
250
10
10
20
30
iterations
40
50
0.1
0.8
0.075
0.6
0.05
0.4
0.025
0.2
0.025
0.2
0
10
12
0.05
10
exist any polynomial q Pn with q() = Tn () and |q(x)| < 1 on [1, 1]. In other words,
Tn minimizes of maxx[1,1] |p(x)| over all polynomials in Pn which take the value Tn () at
.
Scaling this result we find that
1
b+a
2x b a
p(x) = Tn
Tn
ba
ba
solves the minimization problem (3.6) and gives the minimum value claimed. This polynomial is plotted for n = 5, a = 2, b = 10 in Figure 3.4.
38
2.3. Preconditioning. The idea is we choose a matrix M A such that the system
M z = c is relatively easy to solve. We then consider the preconditioned system M 1 Ax =
M 1 b. The new matrix M 1 A is SPD with respect to the M inner product, and we solve
the preconditioned system using conjugate gradients but using the M -inner product in place
of the l2 -inner product. Thus to obtain the preconditioned conjugate gradient algorithm, or
PCG, we substitute M 1 A for A everywhere and change expressions of the form xT y into
xT M y. Note that the A-inner product xT Ay remains invariant under these two changes.
Thus we obtain the algorithm:
choose initial iterate u0 , set s0 = r0 = M 1 f M 1 Au0
for i = 0, 1, . . .
rT M ri
i = iT
si Asi
ui+1 = ui + i si
ri+1 = ri i M 1 Asi
rT M ri+1
si+1 = ri+1 + i+1T
si
ri M ri
end
Note that term sTi Asi arises as the M -inner product of si with M 1 Asi . The quantity
ri is the residual in the preconditioned equation, which is related to the regular residual,
ri = f Aui by ri = M ri . Writing PCG in terms of ri rather than ri we get
choose initial iterate u0 , set r0 = f Au0 , s0 = M 1 r0
for i = 0, 1, . . .
rT M 1 ri
i = i T
si Asi
ui+1 = ui + i si
ri+1 = ri i Asi
rT M 1 ri+1
si+1 = M 1 ri+1 + i+1T 1
si
ri M ri
end
Thus we need to compute M 1 ri at each iteration. Otherwise the work is essentially the
same as for ordinary conjugate gradients. Since the algorithm is just conjugate gradients for
the preconditioned equation we immediately have an error estimate:
i
1
ku0 ukA ,
kui ukA 2
+1
where now is the ratio of the largest to the least eigenvalue of M 1 A. To the extent that
M approximates A, this ratio will be close to 1 and so the algorithm will converge quickly.
The matrix M is called the preconditioner. A good preconditioner should have two properties. First, it must be substantially easier to solve systems with the matrix M than with
39
the original matrix A, since we will have to solve such a system at each step of the preconditioned conjugate gradient algorithm. Second, the matrix M 1 A should be substantially
better conditioned than A, so that PCG converges faster than ordinary CG. In short, M
should be near A, but much easier to invert. Note that these conditions are similar to those
we look for in defining a classical iteration via residual correction. If ui+1 = ui + B(f Aui )
is an iterative method for which B is SPD, then we might use M = B 1 as a preconditioner.
For example, the Jacobi method suggests taking M to be the diagonal matrix with the same
diagonal entries as A. When we compute M 1 ri in the preconditioned conjugate gradient
algorithm, we are simply applying one Jacobi iteration. Similarly we could use symmetric
Gauss-Seidel to get a preconditioner.
In fact, we can show that conjugate gradients preconditioned by some SPD approximate
inverse always converges faster than the corresponding classical iterative method. For if is
an eigenvalue of BA, then 1 where is the spectral radius of I BA, and so
1+
min (BA) 1 , max (BA) 1 + , (BA)
.
1
Thus the rate of convergence for the PCG method is at most
q
p
p
1+
1
1
(BA) 1
1 1 2
p
=
.
q
1+
(BA) + 1
+
1
1
The last quantity is strictly less than for all (0, 1). (For small it is about
/2, while
for the important case of 1 with small, it is approximately 1 2.) Thus the
rate of convergence of PCG with B as a preconditioner is better than that of the classical
iteration with B as approximate inverse.
Diagonal (Jacobi) preconditioning is often inadequate (in the case of the 5-point Laplacian it accomplishes nothing, since the diagonal is constant). Symmetric Gauss-Seidel is
somewhat better, but often insufficient as well. A third possibility which is often applied
when A is sparse is to determine M via the incomplete Cholesky factorization. This means
that a triangular matrix L is computed by the Cholesky algorithm applied to A, except that
no fill-in is allowed: only the non-zero elements of A are altered, and the zero elements left
untouched. One then takes M = LLT , and, so M 1 is easy to apply. Yet, other preconditioners take into account the source of the matrix problem. For example, if a matrix arises
from the discretization of a complex partial differential equation, we might precondition it
by the discretization matrix for a simpler related differential equation (if that lead to a linear
systems which is easier to solve). In fact the derivation of good preconditioners for important
classes of linear systems remain a very active research area.
We close with numerical results for preconditioned conjugate gradients with both the diagonal preconditioner and incomplete Cholesky factorization as preconditioner. In Figure 3.5
we reproduce the results shown in Figure 3.3, together with these preconditioned iterations.
By fitting the log of the norm of the residual to a linear polynomial, we can compute the
observed rates of linear convergence. They are The preconditioned methods are much more
steepest descents
0.997 PCG (diag.) 0.529
conjugate gradients 0.725 PCG (IC)
0.228
40
50
10
10
SD
SD
10
norm of residual
norm of residual
10
CG
50
10
PCG (diag)
100
10
PCG (IC)
CG
PCG (diag)
10
10
PCG (IC)
20
10
150
10
30
200
10
50
100
150
iterations
200
250
10
10
20
30
iterations
40
50
3. Multigrid methods
Figure 3.6 shows the result of solving a discrete system of the form h uh = f using the
GaussSeidel iteration. We have take h = 64, and chosen a smooth right-hand side vector
f which results in the vector uh which is shown in the first plot. The initial iterate u0 ,
which is shown in the second plot, was chosen at random, and then the iterates u1 , u2 , u10 ,
u50 , and u500 are shown in the subsequent plots. In Figure 3.7, the maximum norm error
kuh ui k/kuh k is plotted for i = 0, 1, . . . , 50.
These numerical experiments illustrate the following qualitative properties, which are
typical of the GaussSeidel iteration applied to matrices arising from the discretization of
elliptic PDEs.
If we start with a random error, the norm of the error will be reduced fairly quickly
for the first few iterations, but the error reduction occurs much more slowly after
that.
After several iterations the error is much smoother, but not much smaller, than
initially. Otherwise put, the highly oscillatory modes of the error are suppressed
much more quickly by the iteration than the low frequency modes.
The first observation is valid for all the methods we have studied: Richardson, Jacobi,
damped Jacobi, and GaussSeidel. The second obervationthat GaussSeidel iteration
smooths the erroris shared damped Jacobi with < 1, but not by Jacobi itself.
If we take the Richardson method with = 1/max (A) for the operator A = Dh2 ,
it is very easy to see how the smoothing property comes
about. The initial error can be
P
expanded in terms of the eigenfunctions of A: e0 = nm=1 ci sin mx. The mth component
3. MULTIGRID METHODS
41
exact solution
initial iterate
iterate 1
iterate 2
iterate 10
iterate 50
42
3. MULTIGRID METHODS
43
reducing the low frequency components. On the coarser mesh, it is of course less expensive
to solve. For simplicity, we assume for now that an exact solver is used on the coarse mesh.
Finally this coarse mesh solution to the residual problem is somehow transferred back to the
fine mesh where it can be added back to our smoothed approximation.
Thus we have motivated the following rough outline of an algorithm:
(1) Starting from an initial guess u0 apply a fine mesh smoothing iteration to get an
improved approximation u.
(2) Transfer the residual in u to a coarser mesh, solve a coarse mesh version of the
problem there, transfer the solution back to the fine mesh, and add it back to u to
get u.
Taking u for u1 and thus have described an iteration to get from u0 to u1 (which we can
then apply again to get from u1 to u2 , and so on). In fact it is much more common to also
apply a fine mesh smoothing at the end of the iteration, i.e., to add a third step:
3. Starting from u apply the smoothing iteration to get an improved approximation u.
The point of including the third step is that it leads to a multigrid iteration which is symmetric, which is often advantageous (e.g., the iteration can be used as a preconditioner for
conjugate gradients). If the approximation inverse B used for the first smoothing step is not
symmetric, we need to apply B T (which is also an approximate inverse, since A is symmetric)
to obtain a symmetric iteration.
We have just described a two-grid iteration. The true multigrid method will involve not
just the original mesh and one coarser mesh, but a whole sequence of meshes. However, once
we understand the two-grid iteration, the multigrid iteration will follow easily.
To make the two-grid method more precise we need to explain step 2 more fully, namely
(a) how do we transfer the residual from the fine mesh to the coarse mesh?; (b) what problem
do we solve on the coarse mesh?; and (c) how do we transfer the solution of that problem
from the coarse mesh to the fine mesh? For simplicity, we suppose that N = 1/h is even
and that we are interested in solving Ah u = f where A = Dh2 . Let H = 2h = (N/2)1 .
We will use the mesh of size H as our coarse mesh. The first step of our multigrid iteration
is then just
u = u0 + Bh (f Ah u0 ),
where Bh is just the approximate inverse of Ah from GaussSeidel or some other smoothing
iteration. The resulting residual is f Ah u. This is a function on the fine mesh points
h, 2h, . . . , (N 1)h, and a natural way to transfer it to the coarse mesh is restrict it to the
even grid points 2h, 4h, . . . , (N 2)h = H, 2H, . . . , (N/2 1)H, which are exactly the coarse
mesh grid points. Denoting this restriction operator from fine grid to coarse grid functions
(i.e., from RN 1 RN/21 ) by PH , we then solve AH eH = PH (f Ah uh ) where, of course,
2
AH = DH
is the 3-point difference operator on the coarse mesh. To transfer the solution eH ,
a coarse grid function, to the fine grid, we need a prolongation operator QH : RN/21 RN 1 .
It is natural to set QH eH (jh) = eH (jh) if j is even. But what about when j is odd: how
should we define QH eH at the midpoint of two adjacent coarse mesh points? A natural
choice, which is simple to implement, is QH eH (jh) = [eH ((j 1)h) + e((j + 1)h)]/2. With
these two operators second step is
u = u + QH A1
).
H PH (f Ah u
44
(3.7)
1/2 0
0 0
1
0
0 0
1/2 1/2 0 0
0
1
0 0
QH =
0 1/2 1/2 0
.
..
..
..
..
.
.
.
0
0
0 0
..
.
0
0
0
0
0
..
.
1/2
but PH as we described it, consists only of 0s and 1s. Therefore one commonly takes a
different choice for PH , namely PH = (1/2)QTH . This means that the transferred coarse grid
function doesnt just take the value of the corresponding fine grid function at the coarse grid
point, but rather uses a weighted average of the fine grid functions values at the point in
question and the fine grid points to the left and right (with weights 1/4, 1/2, 1/4). With
this choice, QH Ah PH is symmetric; in fact, QH Ah PH = AH . This is a useful formula. For
operators other than the Ah = Dh2 , we can use the same intergrid transfer operators,
namely QH given by (3.7) and PH = (1/2)QTH , and then define the coarse grid operator by
AH = QH Ah PH .
Remark. In a finite element context, the situation is simpler. If the fine mesh is a
refinement of the coarse mesh, then a coarse mesh function is already a fine mesh function.
Therefore, the operator QH can be taken simply to be the inclusion operator of the coarse
mesh space into the fine mesh space. The residual in u0 Sh is most naturally viewed as
a functional on Sh : v 7 (f, v) B(u0 , v). It is then natural to transfer the residual to the
T
coarse mesh simply by restricting the test function v to SH . This operation ShT SH
is
exactly the adjoint of the inclusion operator SH Sh . Thus the second step, solving the
coarse mesh problem for the restricted residual is obvious in the finite element case: we find
eH SH such that
B(eH , v) = (f, v) B(
u, v),
v SH ,
3. MULTIGRID METHODS
45
uh = twogrid(h, Ah , fh , u0 )
input: h, mesh size (h = 1/n with n even)
Ah , operator on mesh functions
fh , mesh function (right-hand side)
u0 , mesh function (initial iterate)
output: uh , mesh function (approximate solution)
for i = 0, 1, . . . until satisfied
1. presmoothing: u = ui + Bh (fh Ah ui )
2. coarse grid correction:
2.1. residual computation: rh = fh Ah u
2.2. restriction: H = 2h, rH = PH rh , AH = PH Ah QH
2.3. coarse mesh solve: solve AH eH = rH
2.4. prolongation: eh = QH eH
2.5. correction: u = u + eh
3. postsmoothing: uh ui+1 = u + BhT (fh Ah u)
end
Algorithm 3.1: Two-grid iteration for approximately solving Ah uh = fh .
In the smoothing steps, the matrix Bh could be, for example, (D L)1 where D is
diagonal, L strictly lower triangular, and Ah = D L LT . This would be a GaussSeidel
smoother, but there are other possibilities as well. Besides these steps, the major work is in
the coarse mesh solve. To obtain a more efficient algorithm, we may also solve on the coarse
mesh using a two-grid iteration, and so involving an even coarser grid. In the following
multigrid algorithm, we apply this idea recursively, using multigrid to solve at each mesh
level, until we get to a sufficiently coarse mesh, h = 1/2, at which point we do an exact solve
(with a 1 1 matrix!).
46
uh = multigrid(h, Ah , fh , u0 )
input: h, mesh size (h = 1/n with n a power of 2)
Ah , operator on mesh functions
fh , mesh function (right-hand side)
u0 , mesh function (initial iterate)
output: uh , mesh function (approximate solution)
if h = 1/2 then
uh = A1
h fh
else
for i = 0, 1, . . . until satisfied
1. presmoothing: u = ui + Bh (f Ah ui )
2. coarse grid correction:
2.1. residual computation: rh = fh Ah u
2.2. restriction: H = 2h, rH = PH rh , AH = PH Ah QH
2.3. coarse mesh solve: eH = multigrid(H, AH , rH , 0)
2.4. prolongation: eh = QH eH
2.5. correction: u = u + eh
3. postsmoothing: uh ui+1 = u + BhT (f Ah u)
end
end if
Algorithm 3.2: Multigrid iteration for approximately solving Ah uh = f .
Figure 3.8 shows 5 iterations of this multigrid algorithm for solving the system h uh =
f , h = 1/64, considered at the beginning of this section, starting from a random initial
guess (we would get even better results starting from a zero initial guess). Compare with
Figure 3.6. The fast convergence of the multigrid algorithm is remarkable. Indeed, for
the multigrid method discussed here it is possible to show that the iteration is linearly
convergent with a rate independent of the mesh size (in this example, it is roughly 0.2).
This means that the number of iterations needed to obtain a desired accuracy remains
bounded independent of h. It is also easy to count the number of operations per iteration.
Each iteration involves two applications of the smoothing iteration, plus computation of
the residual, restriction, prolongation, and correction on the finest mesh level. All those
procedures cost O(n) operations. But then, during the coarse grid solve, the same procedures
are applied on the grid of size 2h, incurring an additional cost of O(n/2). Via the recursion
the work will be incurred for each mesh size h, 2h, 4h, . . .. Thus the total work per iteration
will be O(n + n/2 + n/4 + . . . + 1) = O(n) (since the geometric series sums to 2n). Thus
the total work to obtain the solution of the discrete system to any desired accuracy is itself
O(n), i.e., optimal.
3. MULTIGRID METHODS
47
initial iterate
iterate 1
iterate 2
iterate 3
iterate 4
iterate 5
CHAPTER 4
1 (),
vH
b(u, v) = F (v),
v V,
1 ), b : V V R is a bilinear form, F : V R is
where V is a Hilbert space (H 1 or H
a linear form. (The inhomogeneous Dirichlet problem takes this form if we solve for u ug
where ug is a function satisfying the inhomogeneous Dirichlet BC ug = g on .) For
symmetric problems (no first order term), the bilinear form b is symmetric, and the weak
form is equivalent to the variational form:
1
u = argmin[ b(v, v) F (v)].
2
vV
49
50
b(uh , v) = F (v),
v Vh .
This is called the Galerkin method. For symmetric problems it is equivalent to the Rayleigh
Ritz method, which replaces V by Vh in the variational formulation:
1
uh = argmin[ b(v, v) F (v)].
2
vVh
The Galerkin solution can be reduced to a set of n linear equations in n unknowns where
n = dim Vh by choosing a basis. Adopting terminology from elasticity, the matrix is called
the stiffness matrix and the right hand side is the load vector.
Comparing (4.1) and (4.2), we find that the error in the Galerkin method u uh satisfies
(4.3)
b(u uh , v) = 0,
v Vh .
This relation, known as Galerkin orthogonality, is key to the analysis of Galerkin methods.
To define a simple finite element method, we suppose that is a polygon in R2 and let Th
by closed triangles so that the intersection
be a simplicial decomposition of (covering of
of any two distinct elements of Th is either empty or a common edge or vertex. Let
M01 (Th ) = { v C() | V |T P1 (T )T Th } = { v H 1 () | V |T P1 (T )T Th },
01 (Th ) = H
1 () M 1 (Th ). The P1 finite element method for the Dirichlet problem is
and M
0
1 (Th ).
the Galerkin method with Vh = M
0
We can use the Lagrange (hat function) basis for Vh to ensure that (1) the matrix is
sparse, and (2) the integrals entering into the stiffness matrix and load vector are easy to
compute.
Figure 4.1. A hat function basis element for M01 (Th ).
In the special case where is the unit square, and Th is obtained from a uniform m m
partition into subsquares, each bissected by its SW-NE diagonal (so n = (m 1)2 ), the
resulting stiffness matrix is exactly the same as the matrix of the 5-point Laplacian.
51
52
continuous piecewise linears. The global degrees of freedom are the vertex values, and the
corresponding local basis consists of the hat functions.
For the Lagrange element of degree 2, the shape functions are V (T ) = P2 (T ). There is
one DOF associated to each vertex (evaluation at the vertex), and one associated to each
edge (evaluation at the midpoint of the edge). Let us check unisolvence. Since dim V (T ) = 6
and there are 6 DOFs on T , we must show that if all 6 DOFs vanish for some v V (T ),
then v 0. For this choose a numbering of the vertices of T , and let i P1 (R2 ), i = 1, 2, 3,
denote the linear function that is 1 on the ith vertex and which vanishes on the opposite
edge, e of T . Thus the equation i = 0 is the equation of the line through the edge e. Since
v vanishes at the two endpoints and the midpoint of e, v|e vanishes (a quadratic in 1D which
vanishes at 3 points is zero). Therefore v is divisible by the linear polynomial i , for each
i. Thus v is a multiple of 1 2 3 . But v is quadratic and this product is cubic, so the only
possibility is that v 0. It is easy to check that the assembled space is exactly the space
M02 (Th ) of continuous piecewise quadratic functions. There is one basis function associated
to each vertex and one to each edge.
Figure 4.2. Basis functions for M02 (Th ).
Note: the linear functions i are the barycentric coordinate functions on T . They satisfy
1 + 2 + 3 1.
Higher degree Lagrange elements are defined similarly. V (T ) = Pr (T ). dim V (T ) =
(r + 1)(r + 2)/2. There is 1 DOF at each vertex, r 1 on each edge, and (r 2)(r 1)/2 in
each triangle. Note that 3 1 + 3 (r 1) + (r 2)(r 1)/2 = (r + 1)(r + 2)/2. The DOFs
53
are the evaluation functionals at the points with barycentric coordinates all of the form i/r
with 0 i r and integer.
Additional material covered here, notes not available: Data structures for triangulations,
the finite element assembly process (loops over elements).
Other finite element spaces. Cubic Hermite finite elements. Shape functions are P3 (T )
for a triangle T . Three DOFs for each vertexRv: u 7 u(v), u 7 (u/x)(v), u 7 (u/y)(v);
and one DOF associated to the interior u 7 T v (alternatively, evaluation at the barycenter).
Proof of unisolvence. Note that the assembled finite element space belongs to C 0 and H 1 ,
but not to C 1 and H 2 .
Figure 4.3. Cubic Hermite element.
Quintic Hermite finite elements (Argyris elements). Shape functions are P5 (T ) for a
triangle T . Six DOFs for each vertex v: evaluation of u, both 1st partials, and all three 2nd
partials of u at v. One DOF for each edge: evaluation of u/n at midpoint. Unisolvent,
H 2.
Figure 4.4. Quintic Hermite element.
b(u, v) = F (v),
v V.
We will analyze the Galerkin method using the ideas of stability and consistency we introduced in Chapter 2, 1.2. Recall that there we studied a problem of the form find u X
such that Lu = f where L mapped the vector space X where the solution was sought to
54
the vector space Y where the data f belonged. For the problem (4.4), the solution space
X = V , and the data space Y = V , the dual space of V . The linear operator L is simply
given by
Lw(v) = b(w, v) w, v V.
So the problem (4.4) is simply: given F V find u V such that Lu = F .
First of all, before turning to discretization, we consider the well-posedness of this problem. We have already assumed that the bilinear form b is bounded, i.e., there exists a
constant M such that
|b(w, v)| M kwkkvk, w, v V,
where, of course, the norm is the V norm. This implies that the operator L : V V is a
bounded linear operator (with the same bound).
We will now consider hypotheses on the bilinear form b that ensure that the problem
(4.4) is well-posed. We will consider three cases, in increasing generality.
4.1. The symmetric coercive case. First we assume that the bilinear form b is symmetric (b(w, v) = b(v, w)) and coercive, which means that there exists a constant > 0, such
that
b(v, v) kvk2 , v V.
In this case b(w, v) is an inner product on V , and the associated norm is equivalent to the
V norm:
kvk2V b(v, v) M kvk2V .
Thus V is a Hilbert space when endowed with the b inner product, and the Riesz Representation Theorem gives that for all F V there exists a unique u V such that
b(u, v) = F (v),
v V,
i.e., (4.4) has a unique solution for any data, and L1 : V V is well-defined. Taking
v = u and using the coercivity condition we immediately have kuk2V F (u) kF kV kukV ,
so kukV 1 kF kV , i.e., kL1 kL(V ,V ) 1 .
In short, well-posedness is an immediate consequence of the Riesz Representation Theorem in the symmetric coercive case.
As a simple example of the utility of this result, let us consider the Neumann problem
u
= 0 on
n
where 0 < a a(x) a
, 0 < c c(x) c. The weak formulation is: Find u V such that
div a grad u + cu = f in ,
b(u, v) = F (v),
v V,
where V = H 1 ,
Z
Z
(a grad w grad v + cuv),
b(w, v) =
F (v) =
f v.
55
4.2. The coercive case. Even if we dispense with the assumption of symmetry, coercivity implies well-posedness. From coercivity we have kwk2 b(w, w) = Lw(w)
kLwkV kwk, so
(4.5)
kwk 1 kLwkV ,
w V.
56
Stability just means that Lh is invertible with the stability constant given by kL1
h kL(Vh ,Vh ) .
Since b is coercive over all of V it is certainly coercive over Vh , and so the last section implies
stability with stability constant 1 . In short, if the bilinear form is coercive, then for any
choice of subspace the Galerkin method is stable with the stability constant bounded by the
inverse of the coercivity constant.
To talk about consistency, as in Chapter 2, we need to define a representative rh u of
the solution u in Vh . A natural choice, which we shall make, is that rh : V Vh is the
orthogonal projection so that
ku rh uk = inf ku vk.
vVh
06=vVh
|(Lh rh u Fh )(v)|
.
kvk
But (Lh rh uFh )(v) = b(rh u, v)F (v) = b(rh uu, v), so |(Lh rh uFh )(v)| M kurh ukkvk.
Therefore the consistency error is bounded by
M inf ku vk.
vVh
This is the fundamental estimate for finite elements. It shows that finite elements are quasioptimal, i.e., that the error in the finite element solution is no more than a constant multiple
of the error in the best possible approximation from the subspace. The constant can be taken
to be 1 plus the bound of the bilinear form times the stability constant.
Remark. We obtained stability using coercivity. From the last section we know that
we could obtain stability as well if we had instead of coercivity, a discrete inf-sup condition:
There exists > 0 such that for all 0 6= w Vh there exists 0 6= v Vh such that
b(w, v) kwkkvk.
57
(In the finite dimensional case the dense range condition follows from the inf-sup condition
since an operator from Vh to Vh which is 1-to-1 is automatically onto) The big difficulty
however is that the fact that b satisfies the inf-sup condition over V does not by any means
imply that it satisfies the inf-sup condition over a finite dimensional subspace Vh . In short, for
coercive problems stability is automatic, but for more general well-posed problems Galerkin
methods may or may not be stable (depending on the choice of subspace), and proving
stability can be difficult.
6. Finite element approximation theory
In this section we turn to the question of finite element approximation theory, that is of
estimating
inf ku vk1
vVh
where Vh is a finite element space. For simplicity, we first consider the case where Vh =
M01 (Th ), the Lagrange space of continuous piecewise linear functions on a given mesh Th
where Th is a simplicial decomposition of with mesh size h = maxT Th diam T . (Note: we
are overloading the symbol h. If we were being more careful we would consider a sequence
of meshes Ti with mesh size hi tending to zero. But the common practice of using h as both
the index and the mesh size saves writing subscripts and does not lead to confusion.)
First we need some preliminary results on Sobolev spaces: density of smooth functions,
if s > n/2.
Poincare inequality, Sobolev embedding H s () C()
Theorem 4.1 (Poincare inequality). Let be a bounded domain in Rn with Lipschitz
boundary (e.g., smooth boundary or a polygon). Then there exists a constant c, depending
only on , such that
Z
1
u = 0.
kukL2 () ck grad ukL2 () , u H () such that
I
0
58
constant is sharp (achieved by u(x) = cos x/L). In fact the result can be proved with
the constant d/, d =diameter of for any convex domain in n-dimensions (Payne and
Weinberger, Arch. Rat. Mech. Anal. 5 1960, pp. 286292). The dependence of the constant
on the domain is more complicated for non-convex domains.
Multi-index notation: In n-dimensions a multi-index is an n-tuple (1 , . . . , n ) with
the i non-negative integers. We write || = 1 + + n for the degree of the multi-index,
! = 1 ! n !,
|| u
.
x = x1 1 xnn , D u =
x1 1 xnn
P
Thus a general element of Pr (Rn ) is p(x) = ||r a x , and a general constant-coefficient
P
linear partial differential operator of degree r is Lu = ||r a D u. Taylors theorem for a
smooth function defined in a neighborhood of a point x0 Rn is
X 1
u(x) =
D u(x0 )(x x0 ) + O(|x x0 |m+1 )
!
|| = 1 + + n ,
! = 1 ! n !,
||m
We write i i , i = 1, . . . , n. We have
(
!
x , ,
()!
D x =
0,
otherwise.
In particular D x = !.
Let be a bounded domain in Rn with Lipschitz boundary (for our applications, it will
be a triangle). It is easy to see that the DOFs
Z
D u, || r,
u 7
D Pr u(x) dx =
D u(x) dx, || r.
|| r.
||=1
for some constant c1 depending only on (where we use the L2 () norm). Applying the
same reasoning to D u Pr1 (D u), we have kD u Pr1 (D u)k is bounded by the sum
of the norms of second partial derivatives, so
X
ku Pr uk c2
kD u Pr2 (D u)k.
||=2
59
kD u P0 (D u)k C
||=r
kD uk.
||=r+1
so
kD (u Pr u)k C
kD uk.
||=r+1
u H r+1 ().
pPr
u H r+1 ().
Remark. This proof of the Bramble-Hilbert lemma, based on the Poincare inequality,
is due to Verf
urth (A note on polynomial approximation in Sobolev spaces, M2AN 33, 1999).
The method is constructive in that it exhibits a specific polynomial p satisfying the estimate
(namely Pr u). Based on classical work of Payne and Weinberger on the dependence of the
Poincare constant on the domain mentioned above, it leads to good explicit bounds on the
constant in the Bramble-Hilbert lemma. A much older constructive proof is due to Dupont
and Scott and taught in the textbook of Brenner and Scott. However that method is both
more complicated and it leads a worse bound on the contant. Many texts (e.g., Braess)
give a non-constructive proof of the Bramble-Hilbert lemma based on Rellichs compactness
theorem.
Now we derive an important corollary of the Bramble-Hilbert lemma.
Corollary 4.3. Let be a Lipschitz domain and r 0, and : H r+1 () Pr () be
a bounded linear projection onto Pr (). Then there exists a constant c which only depends
on the domain , r, and the norm of such that
ku ukH r c|u|H r+1 .
Note: the hypothesis means that is a bounded linear operator mapping H r+1 () into
Pr () such that u = u if u Pr (). Bounded means that kkL(H r+1 ,H r ) < . It doesnt
matter what norm we choose on Pr , since it is a finite dimensional space.
Proof.
ku ukH r = inf k(u p) (u p)kH r
pPr
60
Figure 4.5. Mapping between the reference triangle and an arbitrary triangle.
v2
v1
v2
A J
1
F
v0
J
@
@
J f
@
J
XXX
@
XXX
J
@
XXX
J
@
X
^
J
XXX
f
@
X
z
X
@
A
v0
v1
As an example of the corollary, let = T be the unit triangle with vertices v0 = (0, 0),
v1 = (1, 0), and v2 = (0, 1), r = 1, and let = IT the linear interpolant: IT u P1 (T) and
vi ) = u(
vi ), i = 0, 1, 2. The IT u is defined for all u C(T) and kIT ukL kukL
IT u(
ckukH 2 , where we use the Sobolev embedding theorem in the last step (and c is some absolute
constant). From the corollary we get
ku IT ukH 1 (T) c|u|H 2 (T) .
(4.6)
This result will turn out to be a key step in analyzing piecewise linear interpolation.
The next step is to scale this result from the unit triangle to an arbitrary triangle T .
Suppose the vertices of T are v0 , v1 , and v2 . There exists a unique affine map F taking vi
to vi , i = 0, 1, 2. Indeed,
B = (v1 v0 |v2 v0 ),
F x = v0 + B x,
where the last notation means that B is the 2 2 matrix whose columns are v1 v0 and
v2 v0 . The map F takes T 1-to-1 onto T . Since the columns of B are both vectors of length
at most hT , certainly the four components bij of B are bounded by hT .
Now to any function f on T we may associate the pulled-back function f on T where
f(
x) = f (x) with x = F x.
I.e., f = f F . See Figure 4.5.
Next we relate derivatives and norms of a function f with its pull-back f. For the
derivative we simply use the chain rule:
2
X f
X f
f
xi
(
x) =
(x)
=
(x).
bij
xj
xi
xj
xi
i=1
i=1
Similarly,
2
XX
2 f
2f
(
x) =
bij bkl
(x),
xj xl
xi xk
i=1 k=1
etc. Thus we have
X
||=r
|D f(
x)| ckBkr
X
||=r
|D f (x)|.
61
|D f (x)| ckB 1 kr
||=r
x)|.
|D f(
||=r
1
Now kBk = O(hT ). We can bound kB k by introducing another geometric quantity, namely
the diameter T of the inscribed disk in T . Then any vector of length T is the difference
of two points in T (two opposite points on the
inscribed circle), and these are mapped by
1
B to two
points in T ,1which are1at most 2 apart. Thus, using the Euclidean norm
1
kB k 2/T , i.e., kB k = O(T ). We have thus shown
X
X
X
X
|D f(
x)| chrT
|D f (x)|,
|D f (x)| cr
|D f(
x)|.
T
||=r
||=r
||=r
||=r
Now let us consider how norms map under pull-back. First we consider the L2 norm.
Let |T | denote the area of T . Changing variables from x to x = F x, we have
Z
Z
2
2
x)|2 d
x = 2|T |kfk2L2 (T) .
|f (x)| dx = 2|T | |f(
kf kL2 (T ) =
T
T
p
That is, kf kL2 (T ) = 2|T |kfkL2 (T) . Next consider the H r seminorm:
Z X
Z X
Z X
2r
2r
2
2
|f |H r (T ) =
|D f (x)| dx cT
|D f (
x)| dx = 2c|T |T
|D f(
x)|2 d
x,
T ||=r
T ||=r
T ||=r
so
p
|f |H r (T ) c |T |r
T |f |H r (T) .
Similarly,
1
|f|H r (T) c p hrT |f |H r (T ) .
|T |
Now let u H 2 (T ), and let u H 2 (T) be the corresponding function. We saw in (4.6)
that
u|H 2 (T) .
k
u IT ukH 1 (T) c|
Now it is easy to see that the pull-back of IT u is IT u (both are linear functions which equal
IT u. We then have
u(vi ) at the vertex vi ). Therefore u IT u = u\
p
p
ku IT ukL2 (T ) c |T |k
u IT ukL2 (T) c |T ||
u|H 2 (T) ch2T |u|H 2 (T ) ,
and
p
p
u IT u|H 1 (T) c |T |1
u|H 2 (T) ch2T /T |u|H 2 (T ) .
|u IT u|H 1 (T ) c |T |1
T |
T |
If the triangle T is not too distorted, then T is not much smaller than hT . Let us define
T = hT /T , the shape constant of T . We have proved:
Theorem 4.4. Let T be a triangle with diameter hT , and let IT be the linear interpolant
at the vertices of T . Then there exists an absolute constant c such that
ku IT ukL2 (T ) ch2T |u|H 2 (T ) ,
u H 2 (T ).
Moreover there exists a constant c0 depending only on the shape constant for T such that
|u IT u|H 1 (T ) c0 hT |u|H 2 (T ) ,
u H 2 (T ).
62
Now we have analyzed linear interpolation on a single but arbitrary triangle, we can just
add over the triangles to analyze piecewise linear interpolation.
Theorem 4.5. Suppose we have a sequence of triangulations Th with mesh size h tending
let Ih u denote the continuous piecewise linear
to 0. For u a continuous function on ,
interpolant of u on the mesh Th . Then there exists an absolute constant c such that
ku Ih ukL2 () ch2 |u|H 2 () ,
u H 2 ().
If the mesh sequence is shape regular (i.e., the shape constant is uniformly bounded), then
there exists a constant c0 depending only on a bound for the shape constant such that
|u Ih u|H 1 () c0 h|u|H 2 () ,
u H 2 ().
In a similar fashion, for the space of Lagrange finite elements of degree r we can analyze
the interpolant defined via the degrees of freedom.
Theorem 4.6. Suppose we have a sequence of triangulations Th with mesh size h tending
to 0. Let Vh be the space of Lagrange finite elements of degree r with respect to the mesh,
let Ih u denote the interpolant of u into Vh defined
and for u a continuous function on ,
through the degrees of freedom. Then there exists an absolute constant c such that
ku Ih ukL2 () chs |u|H s () ,
u H s (),
2 s r + 1.
If the mesh sequence is shape regular (i.e., the shape constant is uniformly bounded), then
there exists a constant c0 depending only on a bound for the shape constant such that
|u Ih u|H 1 () c0 hs1 |u|H s () ,
u H s (),
2 s r + 1.
Thus for smooth u (more precisely, u H r+1 ), we obtain the rate of convergence O(hr+1 )
in L2 and O(hr ) in H 1 when we approximation with Lagrange elements of degree r.
The proof of this result is just the Bramble-Hilbert lemma and scaling. Note that we
must assume s 2 so that u H s is continuous and the interpolant is defined. On the
other hand we are limited to a rate of O(hr+1 ) in L2 and O(hr ) in H 1 , since the interpolant
is exact on polynomials of degree r, but not higher degree polynomials.
7. Error estimates for finite elements
7.1. Estimate in H 1 . To be concrete, consider the Dirichlet problem
div a grad u = f in ,
u = 0 on ,
with a coefficient a bounded above and below by positive constants of and f L2 . The
1 () such that
weak formulation is: find u V = H
b(u, v) = F (v), v V,
R
where b(u, v) = grad u grad v dx, F (v) = f v dx. Clearly b is bounded: |b(w, v)|
M kwk1 kvk1 , (with M = sup a). By Poincares inequality, Theorem 4.1, b is coercive:
b(v, v) kvk21 .
Now suppose that is a polygon and let Vh be the space of Lagrange finite elements of
degree r vanishing on the boundary with respect to a mesh of of mesh size h, and define
uh to be the finite element solution: uh Vh ,
(4.7)
b(uh , v) = F (v),
v Vh .
63
(where c = 1 + M ). Then we may apply the finite element approximation theory summarized in Theorem 4.6, and conclude that
ku uh k1 chr kukr+1
(4.8)
as long as the solution u belongs to H r+1 . If u is less smooth, the rate of convergence will
be decreased accordingly.
In short, one proves the error estimate in H 1 by using quasi-optimality in H 1 (which
comes from coercivity), and then finite element approximation theory.
7.2. EstimateR in L2 . Now let g L2 () be a given function, and consider the computation of G(u) := ug dx, which is a functional of the solution u of our Dirichlet problem.
We ask how accurately G(uh ) approximates G(u). To answer this, we define an auxilliary
function V by
b(w, ) = G(w), w V.
This is simply a weak formulation of the Dirichlet problem
div a grad = g in ,
= 0 on .
We will assume that this problem satisfies H 2 regularity, i.e., the solution H 2 () and
satisfies
kk2 ckgk0 .
This is true, for example, if is either a convex Lipschitz domain or a smooth domain and
a is a smooth coefficient.
Remark. Note that we write b(w, ) with the trial function second and the test
function w first, the opposite as for the original problem (4.7). Since the bilinear form
we are considering is symmetric, this is not a true distinction. But if we started with an
nonsymmetric bilinear form, we would still define the auxilliary function in this way. In
short satisfies a boundary value problem for the adjoint differential equation.
Now consider the error in G(u):
Z
G(u) G(uh ) = (u uh )g dx = b(u uh , ) = b(u uh , v)
for any v Vh , where the second equality comes from the definition of the auxilliary function
and the third from Galerkin orthogonality (4.3). Therefore
|G(u) G(uh )| M ku uh k1 inf k vk1 .
vVh
vVh
Thus
|G(u) G(uh )| chku uh k1 kgk0 chr+1 kukr+1 kgk0 .
R
In short, if g L2 (), the error in G(u) = ug dx is O(hr+1 ), one power of h higher than
the H 1 error.
64
u H s (),
valid for integers 0 t 1, 2 s r + 1. See Theorem 4.6 (the constant c here depends
only on r and the shape regularity of the mesh). We proved this result element-by-element,
using the Bramble-Hilbert lemma and scaling. Of course this result implies that
inf ku vkt chst kuks ,
vVh
u H s (),
for the same ranges of t and s. The restriction t 1 is needed, since otherwise the functions
in Vh , being continuous but not generally C 1 , do not belong to H t (). Here we are concerned
with weakening the restriction s 2, so we can say something about the approximation by
piecewise polynomials of a function u that does not belong to H 2 (). We might hope for
example that
inf ku vk0 chkuk1 , u H 1 ().
vVh
65
In fact, this estimate is true, and is important to the development of a posteriori error
estimates and adaptivity. However it can not be proven using the usual interpolant Ih u,
because Ih u is not defined unless the function u has well-defined point values at the node
points of the mesh, and this is not true for a general function u H 1 . (In 2- and 3-dimensions
the Sobolev embedding theorem implies the existence of point values for function in H 2 , but
not in H 1 .)
The way around this is through a different operator than Ih , called the Clement interpolant, or quasi-interpolant. For each polynomial degree r 1 and each mesh, the Clement
interpolant h : L2 () Vh is a bounded linear operator. Its approximation properties are
summarized in the following theorem.
Theorem 4.7 (Clement interpolant). Let be a domain in Rn furnished with a simplicial triangulation with shape constant and maximum element diameter h, let r be a
positive integer, and let Vh denote the Lagrange finite element space of continuous piecewise
polynomials of degree r. Then there exists a bounded linear operator h : L2 () Vh and a
constant c depending only on and r such that
ku h ukt chst kuks ,
u H s (),
for all 0 t s r + 1, t 1.
R, i = 1, . . . , N , be the usual
Now we define the Clement interpolant. Let i : C()
DOFs for Vh and i the corresponding basis functions. Thus the usual interpolant is
X
Ih u =
i (u)i , u C().
i
To define the Clement interpolant we let Si denote the support of i , i.e., the union of
triangles where i is not identically zero (if i is a vertex degree of freedom this is the union
of the elements with that vertex, if an edge degree of freedom, the union of the triangles
with that edge, etc.). Denote by Pi : L2 (Si ) Pr (Si ) the L2 -projection. Then we set
X
h u =
i (Pi u)i , u L2 ().
i
The usual interpolant Ih u is completely local in the sense that if u vanishes on a particular
triangle T , then Ih u also vanishes on T . The Clement interpolation operator is not quite
so local, but is nearly local in the following sense. If u vanishes on the set T, defined to be
the union of the triangles that share at least one vertex with T (see Figure 4.6), then h u
vanishes on T . In fact, for any 0 t s r + 1,
(4.9)
ku h ukH t (T ) chst kuk s , u H s (T),
T
H (T )
where the constant depends only on the shape regularity of the mesh and r. From (4.9),
using the shape regularity, the estimate of Theorem 3.7 easily follows.
To avoid too much technicality, we shall prove (4.9) in the case of linear elements, r = 1.
Thus we are interested in the case t = 0 or 1 and t s 2. Let T be a particular triangle
and let zi , i , i , Si denote its vertices and corresponding DOFs, basis functions, and their
supports, for i = 1, 2, 3. Note that it is easy to see that
(4.10)
1/2
c,
k grad i kL2 (T ) ch1
T |T |
66
Figure 4.6. Shown in brown Sz for a vertex z and in blue T for a triangle T .
where the constants may depend on the shape constant for T . Next, using the Bramble
Hilbert lemma and scaling on Si , we have for 0 t s 2,
(4.11)
In performing the scaling, we map the whole set Si to the corresponding set Si using the
affine map F which takes one of the triangles T in Si to the unit triangle T. The Bramble
Hilbert lemma is applied on the scaled domain Si . Although there is not just a single domain
Si it depends on the number of triangles meeting at the vertex zi and their shapesthe
constant which arises when applying the BrambleHilbert lemma on the scaled domain can
bounded on the scaled domain in terms only of the shape constant for the triangulation (this
can be established using compactness).
We also need one other bound. Let i and j denote the indices of two different vertices
of T . For u L2 (T), both Pi u and Pj u are defined on T . If u denotes the corresponding
function on the scaled domain T, then we have
kPi u Pj ukL (T ) = kPi u Pj ukL (T) ckPi u Pj ukL2 (T)
c(kPi u ukL2 (T) + kPj u ukL2 (T) )
c(kPi u ukL2 (Si ) + kPj u ukL2 (Sj ) )
where the first inequality comes from equivalence of norms on the finite dimensional space
P1 (T) and the second from the triangle inequality. Both of the terms on the right-hand side
can be bounded using the Bramble-Hilbert lemma and scaled back to Si , just as for (4.11).
In this way we obtain the estimate
(4.12)
Therefore also
(4.13)
3
X
i=1
i (Pi u)i .
67
3
X
i (Pi u P1 u)i .
i=2
The first term on the right hand side is bounded using (4.11):
ku P1 ukH t (T ) chst
T kukH s (T) .
For the second term we have
ki (Pi u P1 u)i kH t (T ) kPi u P1 ukL (T ) ki kH t (T ) ,
which satisfies the desired bound by (4.10) and (4.13).
For the next section an important special case of (4.9) is
ku h ukL2 (T ) chT kukH 1 (T) .
(4.14)
(4.15)
We draw one more conclusion, which we will need below. Let T denote the unit triangle,
and e one edge of it. The trace theorem then tells us that
k
uk2 2 c(k
uk2 2 + k grad uk2 2 ), u H 1 (T).
L (
e)
L (T )
L (T )
u H 1 (T ),
where the constant depends only on the shape constant of T . If we now apply this with u
replaced by uh u and use (4.14) and (4.15), we get this bound for the Clement interpolant:
(4.16)
where he is the length of the edge e, T is a triangle containing e, and c depends only on the
shape constant for the mesh.
8.2. The residual and the error. Consider our usual model problem
div a grad u = f in , u = 0 on ,
and f L2 . The weak formulation is to find
with a continuous positive coefficient a on
u V satisfying
b(u, v) = F (v), v V,
1 () and
where V = H
Z
Z
b(w, v) = a grad w grad v dx, F (v) = f v dx, w, v V.
1 :
The bilinear form is bounded and coercive on H
|b(w, v)| M kwk1 kvk1 ,
b(v, v) kvk21 ,
w, v V.
Now let U V be any function (we will be interested in the case where U = uh , a finite
element approximation of u). The residual in U is the linear functional R(U ) V given by
R(U )w = F (w) b(U, w),
w V.
68
Clearly R(U )w = b(u U, w). It follows immediately that |R(U )w| M ku U k1 kwk1 for all
w V , or, equivalently, that kR(U )kV M ku U k1 . On the other hand, taking w = u U
and using the coercivity, we get ku U k21 R(U )(u U ) kR(U )kV ku U k1 . Thus
M 1 kR(U )kV ku U k1 1 kR(U )kV .
In short, the V norm of the residual R(U ) is equivalent to the H 1 norm of the error u U .
8.3. Estimating the residual. Now let Vh be the Lagrange finite element subspace
of V corresponding to some mesh Th and some polynomial degree r, and let uh be the
corresponding finite element solution. We have just seen that we may estimate the error
ku uh k1 be estimating the V error in the residual
R(uh )w = F (w) b(uh , w),
w V.
This quantity does not involve the unknown solution u, so we may hope to compute it a
posteriori, i.e., after we have computed uh .
Next we use integration by parts on each element T of the mesh to rewrite R(uh )w:
XZ
R(uh )w =
(f w a grad uh grad w) dx
T
T Th
XZ
T
(f + div a grad uh )w dx
XZ
uh
w ds.
nT
Consider the final sum. We can split each integral over T into the sum of the integrals of
the three edges of T . Each edge e which is not contained in the boundary comes in twice.
For such an edge, let T and T+ be the two triangles which contain e and set
uh |T uh |T+
Re (uh ) = a
+
L2 (e)
nT
nT+
on e. Since nT = nT+ the term in parenthesis is the jump in the normal derivative of uh
across the edge e. Also, for T Th , we set RT (uh ) = f + div a grad uh L2 (T ). Then
XZ
XZ
(4.17)
R(uh )w =
RT (uh )w dx +
Re (uh )w ds.
T
eE 0
Next we use Galerkin orthogonality: since uh is the finite element solution, we have
b(u uh , v) = 0,
v Vh .
v Vh .
eE 0
69
Therefore
|
!1/2
XZ
T
!1/2
X
XZ
T
kwk2H 1 (T)
kwk1 .
XZ
eE 0
!1/2
Re (uh )(w h w) ds| c
kwk1 .
kwk1 ,
or
!1/2
kR(uh )kV c
X
T
In view of the equivalence of the V norm of the residual and the H 1 norm of the error, this
gives us the a posteriori error estimate
!1/2
X
X
he kRe (uh )k2L2 (e)
,
(4.21)
ku uh k1 c
h2T kRT (uh )k2L2 (T ) +
T
70
The factor of 1/2 is usually used, to account for the fact that each edge belongs to two
triangles. In terms of the error indicators, we can rewrite the a posteriori estimate as
!1/2
X
ku uh k1 c
T2
.
T
Our basic adaptive strategy then proceed via the following SOLVE-ESTIMATE-MARKREFINE loop:
SOLVE: Given a mesh, compute uh
P
ESTIMATE: For each triangle T compute T . If ( T T2 )1/2 tol, quit.
MARK: Mark those elements T for which T is too large for refinement.
REFINE: Create a new mesh with the marked elements refined.
We have already seen how to carry out the SOLVE and ESTIMATE steps. There are a
number of possible strategies for choosing which elements to mark. One of the simplest is
maximal marking. We pick a number between 0 and 1, compute max = maxT T , and refine
those elements T for T max . Another approach, which isPusually preferred
orfler
P is D
2
2
2
marking, in which a collection of elements S is marked so that T S T
T Th T , i.e.,
we mark enough elements that they account for a given portion of the total error. The
program on the next page shows one way to implement this.
Once we have marked the elements, there is the question of how to carry out the refinement to be sure that all the marked elements are refined and there is not too much additional
refinement. In 2-dimensions this is quite easy. Most schemes are based either on dividing
each triangle in two, or dividing the marked triangles into 4 congruent triangles. Generally,
after refining the marked elements, additional elements have to be refined to avoid hanging
nodes in which a vertex of an element fall in the interior of the edge of a neighboring element. In 3-dimensions things are more complicated, but good refinement schemes (which
retain shape regularity and avoid hanging nodes) are known.
8.5. Examples of adaptive finite element computations. On the next page we
present a bare-bones adaptive Poisson solver written in FEniCS, displayed on the next page.
This code uses Lagrange piecewise linear finite elements to solve the Dirichlet problem
u = 1 in ,
u = 0 on ,
with an L-shaped domain and f 1, with the error indicators and marking strategy
described above. The solution behaves like r2/3 sin(2/3) in a neighborhood of the reentrant
corner, and so is not in H 2 . The results can be seen in Figure 4.7. The final mesh has 6,410
elements, all right isoceles triangles, with hypotenuse length h ranging from 0.044 to 0.002.
If we used a uniform mesh with the smallest element size, this would require over 3 million
elements. Figure 4.8 displays an adaptive mesh in 3D.
71
72
plot(mesh)
interactive()
73
CHAPTER 5
Time-dependent problems
So far we have considered the numerical solution of elliptic PDEs. In this chapter we will
consider some parabolic and hyperbolic PDEs.
1. Finite difference methods for the heat equation
In this section we consider the Dirichlet problem for the heat equation: find u :
[0, T ] R such that
(5.1)
(5.2)
u
(x, t) u(x, t) = f (x, t), x , 0 t T,
t
u(x, t) = 0, x , 0 t T.
x .
For simplicity, we will assume = (0, 1) (0, 1) is the unit square in R2 (or the unit interval
in R). Let us consider first discretization in space only, which we have already studied.
Following the notations we used in Chapter 2, we use a mesh with spacing h = 1/N , and let
h = h h .
h be the set of interior mesh points, h the set of boundary mesh points, and
h [0, T ] R such at
The semidiscrete finite difference method is: find uh :
uh
(x, t) h uh (x, t) = f (x, t), x h , 0 t T,
t
uh (x, t) = 0, x h , 0 t T.
uh (x, 0) = u0 (x),
x h .
If we let Umn (t) = uh ((mh, nh), t), then we may write the first equation as
0
Umn
(t)
Um+1,n (t) + Um1,n (t) + Um,n+1 (t) + Um,n1 (t) 4Umn (t)
= fmn (t),
h2
0 < m, n < N, 0 t T.
Thus we have a system of (n1)2 ordinary differential equations, with given initial conditions.
One could feed this system of ODEs to an ODE solver. But we shall consider simple ODE
solution schemes, which are, after all, themselves finite difference schemes, and analyze them
directly. We shall focus on three simple schemes, although there are much more sophisticated
possibilities.
For a system of ODEs, find u : [0, T ] Rm such that
u0 (t) = f (t, u(t)),
0 t T,
75
u(0) = u0 ,
76
5. TIME-DEPENDENT PROBLEMS
j = 0, 1, . . . .
(5.3)
i.e.,
j+1
j
j
Umn
= Umn
+ k[(h U )jmn + fmn
], 0 < m, n < N, j = 0, 1 . . . .
This is called the forward-centered difference method for the heat equation, because it uses
forward differences in time and centered differences in space.
We shall analyze the forward-centered scheme (5.3) as usual, by establishing consistency
and stability. Let ujmn = u((mh, nh), tj ) denote the restriction of the exact solution, the
consistency error is just
(5.4)
j
uj+1
mn umn
j
(h u)jmn fmn
k
j+1
umn ujmn
u
j
=
(h u)mn
u ((mh, nh), jk) .
k
t
j
Emn
:=
(5.5)
]) + k u/y kL ([0,T
]) )/12.
Even easier is
j+1
umn ujmn u
77
with c = max(c1 , c2 ).
j
Next we establish a stability result. Suppose that a mesh function Umn
satisfies (5.3).
We want to bound an appropriate norm of the mesh function in terms of an appropriate
j
norm of the function fmn
on the right hand side. For the norm, we use the max norm:
kU kL = max
j
|.
max |Umn
0jM 0m,nN
j
j
|. From (5.3), we have
|, and F j = max0m,nN |fmn
Write K j = max0m,nN |Umn
j+1
Umn
= (1
4k j
k j
j
j
j
j
)Umn + 2 (Um1,n
+ Um+1,n
+ Um,n1
+ Um,n+1
) + kfmn
.
2
h
h
Now we make the assumption that 4k/h2 1. Then the 5 coefficients of U on the right
hand side are all nonnegative numbers which add to 1, so it easily follows that
K j+1 K j + kF j .
Therefore K 1 K 0 + kF 0 , K 2 K 0 + k(F 0 + F 1 ), etc. Thus max0jM K j K 0 +
T max0j<M F j , where we have used that kM = T . Thus, if U satisfies (5.3), then
kU kL kU 0 kL (h ) + T kf kL ,
(5.6)
which is a stability result. We have thus shown stability under the condition that k h2 /4.
We say that the forward-centered difference method for the heat equation is conditionally
stable.
j
Now we apply this stability result to the error ejmn = ujmn Umn
, which satisfies
j
ej+1
mn emn
j
(h e)jmn = Emn
,
k
0
= 0, so the stability
with E the consistency error (so kEkL c(k + h2 )). Note that Emn
result gives
kekL T kEkL .
Using our estimate for the consistency error, we have proven the following theorem.
j
Theorem 5.1. Let u solve the heat equation (5.1), and let Umn
be determined by the
forward-centered finite difference method with mesh size h = 1/N and timestep k = T /M .
Suppose that k h2 /4. Then
max
j
max |u((mh, nh), jk) Umn
| C(k + h2 ),
0jM 0m,nN
78
5. TIME-DEPENDENT PROBLEMS
(5.7)
j+1
j+1
j
j+1
Umn
kh Umn
= Umn
+ kfmn
,
0 < m, n < N,
j+1
j+1
j+1
j+1
j+1
j
j+1
(1 + 4)Umn
(Um+1,n
+ Um1,n
+ Um,n+1
+ Um,n1
) = Umn
+ kfmn
,
with = k/h2 . This is a sparse system of (N 1)2 equations in (N 1)2 unknowns with
the same sparsity pattern as h . Since h is symmetric and positive definite, this system,
whose matrix is I kh , is as well.
Remark. The computational difference between explicit and implicit methods was very
significant before the advent of fast solvers, like multigrid. Since such solvers reduce the
computational work to the order of the number of unknowns, they are not so much slower
than explicit methods.
j
Now suppose that (5.8) holds. Let K j again denote the maximum of |Umn
|. Then there
j+1
j+1
j+1
j+1
exists m, n such that K
= sUmn where s = 1. Therefore sUmn sUm+1,n and similarly
for each of the other three neighbors. For this particular m, n, we multiply (5.8) by s and
obtain
j+1
j
j+1
K j+1 = sUmn
sUmn
+ skfmn
K j + kkf j+1 kL .
From this we get the stability result (5.6) as before, but now the stability is unconditional:
it holds for any h, k > 0. We immediately obtain the analogue of the convergence theorem
Theorem 5.1 for the backward-centered method.
79
j
Theorem 5.2. Let u solve the heat equation (5.1), and let Umn
be determined by the
backward-centered finite difference method with mesh size h = 1/N and timestep k = T /M .
Then
max
j
max |u((mh, nh), jk) Umn
| C(k + h2 ),
0jM 0m,nN
and the corresponding inner product. We then showed that h had an orthogonal basis
of eigenfunctions mn with corresponding eigenvalues satisfying
2 2 1,1 mn N 1,N 1 < 8/h2 .
Consider now the forward-centered difference equations for the heat equation u/t = u,
with homogeneous Dirichlet boundary data, and given initial data u0 (for simplicity we
assume f 0). We have
j+1
j
Umn
= GUmn
,
80
5. TIME-DEPENDENT PROBLEMS
0 t T,
x .
We have allowed the thermal conductivity a to be variable, assuming only that it is bounded
above and below by positive constants, since this will cause no additional complications. We
could easily generalize further, allowing a variable specific heat (coefficient of u/t), lower
order terms, and different boundary conditions.
Now we consider finite elements for spatial discretization. To derive a weak formulation,
1 () and integrate over .
we multiply the heat equation (5.1) by a test function v(x) H
This gives
Z
Z
Z
u
f (x, t)v(x) dx, 0 t T.
(x, t)v(x) dx + a(x) grad u(x, t) grad v(x) dx =
t
R
Writing h , i for the L2 () inner product and b(w, v) = a grad w grad v dx, we may
write the weak formulation as
u
1 (), 0 t T.
h , vi + b(u, v) = hf, vi, v H
t
For the finite element method it is often useful to think of u(x, t) as a function of t taking
1 ().
values in functions of x. Specifically, we may think of u mapping t [0, T ] to u(, t) H
1 ()), which means that u(, t)
Specifically, we may seek the solution u C 1 ([0, T ], H
1 () for each t, that u is differentiable with respect to t, and that u(, t)/t H
1 () for
H
1 ().
all t. Thus when we write u(t), we mean the function u(, t) H
1 () denote the usual space of Lagrange finite elements of degree r with
Now let Vh H
respect to a triangulation of . For a semidiscrete finite element approximation we seek uh
mapping [0, T ] into Vh , i.e., uh C 1 ([0, T ], Vh ), satisfying
uh
, vi + b(uh , v) = hf, vi, v Vh , 0 t T.
t
We also need to specify an initial condition for uh . For this we choose some u0h Vh which
approximates u0 (common choices are the L2 projection or the interpolant).
We now show that this problem may be viewed as a system of ODEs. For this let i ,
1 i D be a basis for Vh (for efficiency we will choose a local basis). We may then write
(5.9)
uh (x, t) =
D
X
j (t)j (x).
j=1
Plugging this into (5.9) and taking the test function v = i , we get
X
X
hj , i i j0 (t) +
b(j , i ) j (t) = hf, i i.
j
81
Aij = hj , i i,
Fi (t) = hf, i i,
j+1 j
+ Aj = F j ,
k
or
M j+1 = M j + k(Aj + F j ).
Notice that for finite elements this method is not truly explicit, since we have to solve an
equation involving the mass matrix at each time step.
The backward Eulers method
j+1 j
M
+ Aj+1 = F j+1 ,
k
leads to a different linear system at each time step:
(M + kA)j+1 = M j + kF j+1 ,
while CrankNicolson would give
k
k
k
(M + A)j+1 = (M A)j + (F j + F j+1 ).
2
2
2
2.1. Analysis of the semidiscrete finite element method. Before analyzing a fully
discrete finite element scheme, we analyze the convergence of the semidiscrete scheme, since
it is less involved. The key to the analysis of the semidiscrete finite element method is to
compare uh not directly to u, but rather to an appropriate representative wh C 1 ([0, T ], Vh ).
For wh we choose the elliptic projection of u, defined by
(5.10)
v Vh ,
0 t T.
From our study of the finite element method for elliptic problems, we have the L2 estimate
ku(t) wh (t)k chr+1 ku(t)kr+1 ,
(5.11)
0 t T.
wh
u
u
(t)
(t)k chr+1 k (t)kr+1 ,
t
t
t
0 t T.
Now
h
(5.12)
wh
wh
, vi + b(wh , v) = h
, vi + b(u, v)
t
t
(wh u)
=h
, vi + hf, vi,
t
v Vh ,
0 t T.
82
5. TIME-DEPENDENT PROBLEMS
kyk
d
1d
y
kyk =
kyk2 = h , yi.
dt
2 dt
t
Thus we get
(5.13)
kyh k
d
(wh u)
(wh u)
kyh k + b(yh , yh ) = h
, yh i k
kkyh k,
dt
t
t
so
d
(wh u)
u
kyh k k
k chr+1 k (t)kr+1 .
dt
t
t
This holds for each t. Integrating over [0, t], we get
kyh (t)k kyh (0)k + chr+1 k
u
kL1 ([0,T ];H r+1 ()) .
t
E(v) = h
v Vh ,
(wh u)
, vi,
t
0 t T.
83
0tT
Remark. In the case of finite elements for elliptic problems, we first got an estimate
in H 1 , then an estimate in L2 , and I mentioned that there are others possible. In the
case of the parabolic problem, there are many other estimates we could derive in different
norms in space or time or both. For example, by integrating (5.13) in time we get that
kyh kL2 ([0,T ];H 1 ()) = O(hr+1 ). For the elliptic projection we have ku wh kH 1 () = O(hr ) for
each t, so the triangle inequality gives ku uh kL2 ([0,T ];H 1 ()) = O(hr ).
2.2. Analysis of a fully discrete finite element method. Now we turn to the
analysis of a fully discrete scheme: finite elements in space and backward Euler in time.
Writing ujh for uh (, jk) (with k the time step), the scheme is
uj+1
ujh
h
, vi + b(uj+1 , v) = hf j+1 , vi, v Vh , j = 0, 1, . . . .
k
We initialize the iteration by choosing u0h Vh to be, e.g., the interpolant, L2 projection, or
elliptic projection. Notice that, at each time step, we have to solve the linear system
(5.14)
(M + kA)j+1 = M j + kF j+1 ,
where j is the vector of coefficients of ujh with respect to a basis, and M , A, and F , are the
mass matrix, stiffness matrix, and load vector respectively.
To analyze this scheme, we proceed as we did for the semidiscrete scheme, with some
extra complications coming from the time discretization. In particular, we continue to use
the elliptic projection wh as a representative of u. Thus the consistency error is given by
whj+1 whj
,vi + b(whj+1 , v) hf j+1 , vi
k
uj+1 uj
(wj+1 uj+1 ) (whj uj )
=h
, vi + b(uj+1 , v) hf j+1 , vi + h h
, vi
k
k
uj+1 uj
uj+1
(wj+1 uj+1 ) (whj uj )
=h
, vi + h h
, vi = hz j , vi,
k
t
k
where the last line gives the definition of z j . Next we estimate the two terms that comprise
z j , in L2 . First we have
h
uj+1 uj
uj+1
k 2u
k
k k 2 kL (L2 ) ,
k
t
2 t
by Taylors theorem. Next,
Z
(whj+1 uj+1 ) (whj uj )
1 (j+1)k
=
[wh (s) u(s)] ds,
k
k jk
t
84
5. TIME-DEPENDENT PROBLEMS
so
whj+1 whj
, vi + b(whj+1 , v) hf j+1 , vi = hz j , vi,
h
k
v Vh ,
j = 0, 1, . . . .
with
2u
u
kL ([0,T ];L2 ()) + hr+1 k kL ([0,T ];H r+1 () ) =: E,
2
t
t
Combining with the scheme (5.14), we get (for yh = wh uh )
kz j k c(kk
j = 0, 1, . . . .
yhj+1 yhj
, vi + b(yhj+1 , v) = hz j , vi, v Vh .
k
We conclude the argument with a stability argument. Choose v = yhj+1 Vh . This becomes:
h
0jM
0jM
Exercise for the reader: analyze the convergence of the fully discrete finite element method
using CrankNicolson for time discretization. In the stability argument, you will want to
use the test function v = (yhj+1 + yhj )/2.