Elliptic boundary value problems (chapter 1):
Poisson equation: scalar, symmetric, elliptic.
Contents
2 Weak formulation 17
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2 Sobolev spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2.1 The spaces $W^m(\Omega)$ based on weak derivatives . . . . . . . . . . . . 23
2.2.2 The spaces $H^m(\Omega)$ based on completion . . . . . . . . . . . . . . . 25
2.2.3 Properties of Sobolev spaces . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.3 General results on variational formulations . . . . . . . . . . . . . . . . . . . . . . 34
2.4 Minimization of functionals and saddle-point problems . . . . . . . . . . . . . . . 43
2.5 Variational formulation of scalar elliptic problems . . . . . . . . . . . . . . . . . . 45
2.5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.5.2 Elliptic BVP with homogeneous Dirichlet boundary conditions . . . . . . 46
2.5.3 Other boundary conditions . . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.5.4 Regularity results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
2.5.5 Riesz-Schauder theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
2.6 Weak formulation of the Stokes problem . . . . . . . . . . . . . . . . . . . . . . . 56
2.6.1 Proof of the inf-sup property . . . . . . . . . . . . . . . . . . . . . . . . . 58
2.6.2 Regularity of the Stokes problem . . . . . . . . . . . . . . . . . . . . . . . 60
2.6.3 Other boundary conditions . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.6 Isoparametric finite elements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3.7 Nonconforming finite elements . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
9 Multigrid methods 197
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
9.2 Multigrid for a one-dimensional model problem . . . . . . . . . . . . . . . . . . . 198
9.3 Multigrid for scalar elliptic problems . . . . . . . . . . . . . . . . . . . . . . . . . 203
9.4 Convergence analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
9.4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
9.4.2 Approximation property . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
9.4.3 Smoothing property . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
9.4.4 Multigrid contraction number . . . . . . . . . . . . . . . . . . . . . . . . . 216
9.4.5 Convergence analysis for symmetric positive definite problems . . . . . . . 218
9.5 Multigrid for convection-dominated problems . . . . . . . . . . . . . . . . . . . . 223
9.6 Nested Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
9.7 Numerical experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
9.8 Algebraic multigrid methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
9.9 Nonlinear multigrid . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
Chapter 1
In this chapter we introduce the classical formulation of scalar elliptic problems and of the Stokes
equations. Some results known from the literature on existence and uniqueness of a classical
solution will be presented. Furthermore, we briefly discuss the issue of regularity.
Moreover, the boundary $\partial\Omega$ of $\Omega$ should satisfy certain smoothness conditions that will be introduced in this section. For this we need so-called Hölder spaces.
By $C^k(\Omega)$, $k \in \mathbb{N}$, we denote the space of functions $f : \Omega \to \mathbb{R}$ for which all (partial) derivatives
$$D^\alpha f := \frac{\partial^{|\alpha|} f}{\partial x_1^{\alpha_1} \cdots \partial x_n^{\alpha_n}}, \qquad \alpha = (\alpha_1, \ldots, \alpha_n), \quad |\alpha| = \alpha_1 + \ldots + \alpha_n,$$
of order $|\alpha| \le k$ are continuous functions on $\Omega$. The space $C^k(\bar\Omega)$, $k \in \mathbb{N}$, consists of all functions in $C^k(\Omega) \cap C(\bar\Omega)$ for which all derivatives of order $\le k$ have continuous extensions to $\bar\Omega$. Since $\bar\Omega$ is compact, the functional
$$f \mapsto \|f\|_{C^k(\bar\Omega)} := \max_{|\alpha| \le k} \|D^\alpha f\|_{\infty,\bar\Omega}$$
defines a norm on $C^k(\bar\Omega)$. The space $(C^k(\bar\Omega), \|\cdot\|_{C^k(\bar\Omega)})$ is a Banach space (cf. Appendix A.1). Note that $f \mapsto \max_{|\alpha| \le k} \|D^\alpha f\|_{\infty,\Omega}$ does not define a norm on $C^k(\Omega)$.
For $f : \Omega \to \mathbb{R}$ we define its support by
$$\operatorname{supp}(f) := \overline{\{\, x \in \Omega \mid f(x) \ne 0 \,\}}.$$
The space $C_0^k(\Omega)$, $k \in \mathbb{N}$, consists of all functions in $C^k(\Omega)$ which have a compact support in $\Omega$, i.e., $\operatorname{supp}(f) \subset \Omega$. The functional $f \mapsto \max_{|\alpha| \le k} \|D^\alpha f\|_{\infty,\Omega}$ defines a norm on $C_0^k(\Omega)$, but $(C_0^k(\Omega), \|\cdot\|_{C^k(\Omega)})$ is not a Banach space.
For $\lambda \in (0,1]$, $D \subset \mathbb{R}^n$ and $f : D \to \mathbb{R}$ we define the Hölder seminorm
$$[f]_{\lambda,D} := \sup\Bigl\{\, \frac{|f(x) - f(y)|}{\|x - y\|^{\lambda}} \;\Big|\; x, y \in D,\ x \ne y \,\Bigr\}.$$
The space $C^{0,\lambda}(\bar\Omega)$ consists of all $f \in C(\bar\Omega)$ with $[f]_{\lambda,\bar\Omega} < \infty$; on it,
$$f \mapsto \|f\|_{C(\bar\Omega)} + [f]_{\lambda,\bar\Omega}$$
defines a norm. We write $f \in C^{0,\lambda}(\Omega)$ and say that $f$ is Hölder continuous in $\Omega$ with exponent $\lambda$ if for arbitrary compact subsets $D \subset \Omega$ the property $[f]_{\lambda,D} < \infty$ holds. An important special case is $\lambda = 1$: the space $C^{0,1}(\Omega)$ [or $C^{0,1}(\bar\Omega)$] consists of all Lipschitz continuous functions on $\Omega$ [$\bar\Omega$].
The space $C^{k,\lambda}(\Omega)$ [$C^{k,\lambda}(\bar\Omega)$], $k \in \mathbb{N}$, $\lambda \in (0,1]$, consists of those functions in $C^k(\Omega)$ [$C^k(\bar\Omega)$] for which all derivatives $D^\alpha f$ of order $|\alpha| = k$ are elements of $C^{0,\lambda}(\Omega)$ [$C^{0,\lambda}(\bar\Omega)$]. On $C^{k,\lambda}(\bar\Omega)$ we define a norm by
$$f \mapsto \|f\|_{C^k(\bar\Omega)} + \sum_{|\alpha| = k} [D^\alpha f]_{\lambda,\bar\Omega}.$$
Note that
$$C^{k,\lambda}(\Omega) \subset C^{k,\mu}(\Omega) \quad \text{for } 0 < \mu \le \lambda \le 1,$$
and similarly with $\Omega$ replaced by $\bar\Omega$. We use the notation $C^{k,0}(\Omega) := C^k(\Omega)$ [$C^{k,0}(\bar\Omega) := C^k(\bar\Omega)$].
Remark 1.1.1 The inclusion $C^{k+1}(\bar\Omega) \subset C^{k,\lambda}(\bar\Omega)$, $\lambda \in (0,1]$, is in general not true. Consider $n = 2$ and $\Omega = \{\, (x,y) \mid -1 < x < 1,\ -1 < y < |x|^{1/2} \,\}$. The function
$$f(x,y) = \begin{cases} (\operatorname{sign} x)\, y^{3/2} & \text{if } y > 0, \\ 0 & \text{otherwise,} \end{cases}$$
belongs to $C^1(\bar\Omega)$, but $f \notin C^{0,\lambda}(\bar\Omega)$ if $\lambda \in (\tfrac34, 1]$.
Based on these Hölder spaces we can now characterize the smoothness of the boundary $\partial\Omega$.
[Figure 1]
A very important special case is $\partial\Omega \in C^{0,1}$. In this case all transformations (and their inverses) must be Lipschitz continuous functions and we then call $\Omega$ a Lipschitz domain. This holds, for example, if $\partial\Omega$ consists of different patches which are graphs of smooth functions (e.g., polynomials) and at the interface between different patches the interior angles are bounded away from zero. In Figure ?? we give an illustration for the two-dimensional case.
[Figure 2]
In almost all theoretical analyses presented in this book it suffices to have $\partial\Omega \in C^{0,1}$. Moreover, the domains used in practice usually satisfy this condition. Therefore, in the remainder of this book we always consider such domains, unless explicitly stated otherwise.

Assumption 1.1.3 In this book we assume that the domain $\Omega \subset \mathbb{R}^n$ is such that $\partial\Omega \in C^{0,1}$, i.e., $\Omega$ is a Lipschitz domain.

One can show that if this assumption holds then $C^{k+1}(\bar\Omega) \subset C^{k,1}(\bar\Omega)$ (cf. remark 1.1.1).
We consider linear second order differential operators of the form
$$Lu := -\sum_{i,j=1}^{n} a_{ij} \frac{\partial^2 u}{\partial x_i \partial x_j} + \sum_{i=1}^{n} b_i \frac{\partial u}{\partial x_i} + cu, \qquad (1.2)$$
with $a_{ij}$, $b_i$ and $c$ given functions on $\bar\Omega$. Because $\frac{\partial^2 u}{\partial x_i \partial x_j} = \frac{\partial^2 u}{\partial x_j \partial x_i}$ we may assume, without loss of generality, that
$$a_{ij}(x) = a_{ji}(x)$$
holds for all $x \in \Omega$. Corresponding to the differential operator $L$ we can define a partial differential equation
$$Lu = f, \qquad (1.3)$$
with $f$ a given function on $\Omega$. In (1.2) the part containing the second derivatives only, i.e.
$$-\sum_{i,j=1}^{n} a_{ij} \frac{\partial^2 u}{\partial x_i \partial x_j},$$
is called the principal part of $L$. Related to this principal part we have the $n \times n$ symmetric matrix
$$A(x) = \bigl(a_{ij}(x)\bigr)_{1 \le i,j \le n}. \qquad (1.4)$$
Note that due to the symmetry of $A(x)$ the eigenvalues are real. These eigenvalues, which may depend on $x \in \Omega$, are denoted by
$$\lambda_1(x) \le \lambda_2(x) \le \ldots \le \lambda_n(x).$$
The operator $L$ is called elliptic if for every $x \in \Omega$ all eigenvalues of $A(x)$ have the same sign.

Remark 1.2.1 If the operator $L$ is elliptic, then we may assume that all eigenvalues of the matrix $A(x)$ in (1.4) are positive:
$$0 < \lambda_1(x) \le \lambda_2(x) \le \ldots \le \lambda_n(x) \quad \text{for all } x \in \Omega$$
(if necessary, multiply the equation $Lu = f$ by $-1$).

The operator $L$ (and the corresponding boundary value problem) is called uniformly elliptic if $\inf\{\, \lambda_1(x) \mid x \in \Omega \,\} > 0$ holds. Note that if the operator $L$ is elliptic with coefficients $a_{ij} \in C(\bar\Omega)$ then the function $x \mapsto \lambda_1(x)$ is continuous on the compact set $\bar\Omega$ and hence $L$ is uniformly elliptic.
Using
$$\sum_{i,j=1}^{n} a_{ij}(x)\xi_i\xi_j = \xi^T A(x)\xi \ge \lambda_1(x)\,\xi^T\xi,$$
we obtain that the operator $L$ is uniformly elliptic if and only if there exists a constant $\varepsilon_0 > 0$ such that
$$\sum_{i,j=1}^{n} a_{ij}(x)\xi_i\xi_j \ge \varepsilon_0\,\xi^T\xi \quad \text{for all } x \in \Omega \text{ and all } \xi \in \mathbb{R}^n.$$
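This equivalence is easy to check numerically for a concrete example. The following sketch uses a hypothetical $2 \times 2$ coefficient matrix $A(x)$ (chosen here purely for illustration, not taken from the text): it samples $\lambda_1(x)$ on a grid, takes $\varepsilon_0$ as the sampled infimum, and verifies $\xi^T A(x)\xi \ge \varepsilon_0\,\xi^T\xi$ for random $\xi$.

```python
import math
import random

def A(x, y):
    # hypothetical symmetric coefficient matrix [[a11, a12], [a12, a22]]
    return [[2.0 + x * x, 0.5],
            [0.5, 1.5 + y * y]]

def lambda_min(M):
    # smallest eigenvalue of a symmetric 2x2 matrix
    a11, a12, a22 = M[0][0], M[0][1], M[1][1]
    tr, det = a11 + a22, a11 * a22 - a12 * a12
    return 0.5 * (tr - math.sqrt(tr * tr - 4.0 * det))

# eps0 := infimum of lambda_1(x) over a sample of points in the unit square
pts = [(i / 20.0, j / 20.0) for i in range(21) for j in range(21)]
eps0 = min(lambda_min(A(x, y)) for x, y in pts)
assert eps0 > 0.0  # uniform ellipticity (on the sampled set)

# xi^T A(x) xi >= lambda_1(x) |xi|^2 >= eps0 |xi|^2 at every sampled point
random.seed(1)
for x, y in pts[::37]:
    M = A(x, y)
    for _ in range(5):
        xi = [random.uniform(-1.0, 1.0), random.uniform(-1.0, 1.0)]
        quad = sum(M[i][j] * xi[i] * xi[j] for i in range(2) for j in range(2))
        assert quad >= eps0 * (xi[0] ** 2 + xi[1] ** 2) - 1e-12
```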
We obtain a boundary value problem when we combine the partial differential equation in (1.3) with certain boundary conditions for the unknown function $u$. For ease we restrict ourselves to problems with Dirichlet boundary conditions, i.e., we impose
$$u = g \quad \text{on } \partial\Omega,$$
with $g$ a given function on $\partial\Omega$. Other types of boundary conditions are the so-called Neumann boundary condition, i.e., a condition on the normal derivative $\frac{\partial u}{\partial n}$ on $\partial\Omega$, and the mixed boundary condition, which is a linear combination of a Dirichlet and a Neumann boundary condition.
Summarizing, we consider a linear second order Dirichlet boundary value problem in $\Omega \subset \mathbb{R}^n$:
$$Lu = -\sum_{i,j=1}^{n} a_{ij} \frac{\partial^2 u}{\partial x_i \partial x_j} + \sum_{i=1}^{n} b_i \frac{\partial u}{\partial x_i} + cu = f \quad \text{in } \Omega, \qquad (1.5a)$$
$$u = g \quad \text{on } \partial\Omega, \qquad (1.5b)$$
where $(a_{ij}(x))_{1 \le i,j \le n}$ is such that the problem is elliptic. A solution $u$ of (1.5) is called a classical solution if $u \in C^2(\Omega) \cap C(\bar\Omega)$. The functions $(a_{ij}(x))_{1 \le i,j \le n}$, $(b_i(x))_{1 \le i \le n}$ and $c(x)$ are called the coefficients of the operator $L$.
1.2.2 Examples
We assume $n = 2$, i.e. a problem with two independent variables, say $x_1 = x$ and $x_2 = y$. Then the differential operator is given by
$$Lu = -a_{11}\frac{\partial^2 u}{\partial x^2} - 2a_{12}\frac{\partial^2 u}{\partial x \partial y} - a_{22}\frac{\partial^2 u}{\partial y^2} + b_1\frac{\partial u}{\partial x} + b_2\frac{\partial u}{\partial y} + cu.$$
In this case we have $\lambda_1(x)\lambda_2(x) = \det(A(x))$ and the ellipticity condition can be formulated as
$$\det(A(x)) = a_{11}(x)a_{22}(x) - a_{12}(x)^2 > 0 \quad \text{for all } x \in \Omega.$$
An important example is the Laplace equation
$$\Delta u := \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = 0 \quad \text{in } \Omega,$$
and the closely related Poisson equation (cf. Poisson [72])
$$-\Delta u = f \quad \text{in } \Omega. \qquad (1.6)$$
For a detailed analysis of singularly perturbed convection-diffusion equations we refer to Roos et al. [76]. An illustration of the two phenomena described above is given in Section ??. Finally we note that for the numerical solution of a problem with a singularly perturbed equation special tools are needed, both with respect to the discretization of the problem and the iterative solver for the discrete problem.
Example 1.2.3 The reaction-diffusion equation can be used to show that a solution of an elliptic Dirichlet boundary value problem as in (1.5) need not be unique. Consider the problem in (1.7) on $\Omega = (0,1)^2$, with $f = 0$ and $c(x,y) = -(k\pi)^2 - (m\pi)^2$, $k, m \in \mathbb{N}$, combined with zero Dirichlet boundary conditions. Then both $u(x,y) \equiv 0$ and $u(x,y) = \sin(k\pi x)\sin(m\pi y)$ are solutions of this boundary value problem.
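Assuming (1.7) has the reaction-diffusion form $-\Delta u + cu = f$ suggested by the computation above, the nontrivial solution is easy to check numerically via central differences; the choice $k = 2$, $m = 3$ below is arbitrary.

```python
import math

k, m = 2, 3  # arbitrary natural numbers
c = -(k * math.pi) ** 2 - (m * math.pi) ** 2

def u(x, y):
    return math.sin(k * math.pi * x) * math.sin(m * math.pi * y)

def residual(x, y, h=1e-4):
    # central-difference approximation of -Delta u + c*u at (x, y)
    lap = (u(x + h, y) + u(x - h, y) + u(x, y + h) + u(x, y - h)
           - 4.0 * u(x, y)) / h ** 2
    return -lap + c * u(x, y)

# the PDE residual vanishes (up to discretization error) at interior points
for x, y in [(0.3, 0.7), (0.51, 0.13), (0.25, 0.5)]:
    assert abs(residual(x, y)) < 1e-4

# and u vanishes on the boundary of the unit square
assert u(0.0, 0.4) == 0.0 and abs(u(0.7, 1.0)) < 1e-12
```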
Example 1.2.4 Even for very simple elliptic problems a classical solution may not exist. Consider
$$-a(x)\frac{\partial^2 u}{\partial x^2} = 1 \quad \text{in } (0,1), \qquad u(0) = u(1) = 0,$$
with $a(x) = 1$ for $0 \le x \le 0.5$ and $a(x) = 2$ for $0.5 < x \le 1$. Clearly the second derivative of a solution $u$ of this problem cannot be continuous at $x = 0.5$.
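One way to see this concretely is to solve the equation separately on the two subintervals and match $u$ and $u'$ at $x = \tfrac12$; the resulting function solves the equation pointwise away from $\tfrac12$ but its second derivative jumps there. The constants below were computed by hand for this illustration.

```python
# Piecewise solution of -a(x) u'' = 1, u(0) = u(1) = 0, with u and u'
# matched at x = 1/2 (constants computed by hand for this illustration)
def u(x):
    if x <= 0.5:
        return -0.5 * x * x + (7.0 / 16.0) * x
    return -0.25 * x * x + (3.0 / 16.0) * x + 1.0 / 16.0

def upp(x, h=1e-5):
    # second difference staying inside one subinterval
    return (u(x + h) - 2.0 * u(x) + u(x - h)) / h ** 2

assert abs(u(0.0)) < 1e-15 and abs(u(1.0)) < 1e-12   # boundary conditions
assert abs(upp(0.25) + 1.0) < 1e-4                   # u'' = -1   where a = 1
assert abs(upp(0.75) + 0.5) < 1e-4                   # u'' = -1/2 where a = 2
# so u'' jumps at x = 1/2: u cannot be in C^2
```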
We present a typical result from the literature on existence and uniqueness of a classical solution. For this we need a certain condition on $\Omega$. The domain $\Omega$ is said to satisfy the exterior sphere condition if for every $x_0 \in \partial\Omega$ there exists a ball $B$ such that $\bar B \cap \bar\Omega = \{x_0\}$. Note that this condition is fulfilled, for example, if $\Omega$ is convex or if $\partial\Omega \in C^2$.
Theorem 1.2.5 ([39], Theorem 6.13) Consider the boundary value problem (1.5) and assume that $L$ is uniformly elliptic with $c \le 0$, that the coefficients of $L$ and the function $f$ are bounded and Hölder continuous in $\Omega$, that the bounded domain $\Omega$ satisfies the exterior sphere condition at every boundary point, and that $g \in C(\partial\Omega)$. Then the problem (1.5) has a unique classical solution $u \in C^2(\Omega) \cap C(\bar\Omega)$.
Theorem 1.2.6 ([39], Theorem 6.17) Let $u \in C^2(\Omega)$ be a solution of (1.5). Suppose that $L$ is elliptic and that there are $k \in \mathbb{N}$, $\lambda \in (0,1)$ such that the coefficients of $L$ and the function $f$ are in $C^{k,\lambda}(\Omega)$. Then $u \in C^{k+2,\lambda}(\Omega)$ holds. If the coefficients and $f$ are in $C^\infty(\Omega)$, then $u \in C^\infty(\Omega)$.
This result shows that the interior regularity depends on the smoothness of the coefficients and of the right-hand side $f$, but does not depend on the smoothness of the boundary (data). A result on global regularity is given in:
Theorem 1.2.7 ([39], Theorem 6.19) Let $u \in C^2(\Omega) \cap C(\bar\Omega)$ be a classical solution of (1.5). Suppose that $L$ is uniformly elliptic and that there are $k \in \mathbb{N}$, $\lambda \in (0,1)$ such that the coefficients of $L$ and the function $f$ are in $C^{k,\lambda}(\bar\Omega)$, and $\partial\Omega \in C^{k+2,\lambda}$. Assume that $g$ can be extended on $\bar\Omega$ such that $g \in C^{k+2,\lambda}(\bar\Omega)$. Then $u \in C^{k+2,\lambda}(\bar\Omega)$ holds.
For a global regularity result as in the previous theorem to hold, the smoothness of the boundary
(data) is important. In practice one often has a domain with a boundary consisting of the union
of straight lines (in 2D) or planes (3D). Then the previous theorem does not apply and the
global regularity of the solution can be rather low as is shown in the next example.
Example 1.2.8 (from [47], p. 13) We consider (1.9) with $\Omega = (0,1)\times(0,1)$, $f \equiv 0$, $g(x,y) = x^2$ (so $g \in C(\partial\Omega)$, $g \in C^\infty(\partial\Omega)$). Then theorem 1.2.5 guarantees the existence of a unique classical solution $u \in C^2(\Omega) \cap C(\bar\Omega)$. However, $u$ is not an element of $C^2(\bar\Omega)$.

Proof. Assume that $u \in C^2(\bar\Omega)$ holds. From this and $\Delta u = 0$ in $\Omega$ it follows that $\Delta u = 0$ in $\bar\Omega$ holds. From $u = g = x^2$ on $\partial\Omega$ we get $u_{xx}(x,0) = 2$ for $x \in [0,1]$ and $u_{yy}(0,y) = 0$ for $y \in [0,1]$. It follows that $\Delta u(0,0) = 2$ must hold, which yields a contradiction.
considers a steady-state situation then these Navier-Stokes equations, in dimensionless quantities, are as follows:
$$-\nu \Delta u_i + \sum_{j=1}^{n} u_j \frac{\partial u_i}{\partial x_j} + \frac{\partial p}{\partial x_i} = f_i \quad \text{in } \Omega, \quad 1 \le i \le n, \qquad (1.10a)$$
$$\sum_{j=1}^{n} \frac{\partial u_j}{\partial x_j} = 0 \quad \text{in } \Omega, \qquad (1.10b)$$
with $\nu > 0$ a parameter that is related to the viscosity of the medium. Using the notation $u := (u_1, \ldots, u_n)^T$, $\operatorname{div} u := \sum_{j=1}^{n} \frac{\partial u_j}{\partial x_j}$, $f = (f_1, \ldots, f_n)^T$, we obtain the more compact formulation
$$-\nu \Delta u + (u \cdot \nabla)u + \nabla p = f \quad \text{in } \Omega, \qquad (1.11a)$$
$$\operatorname{div} u = 0 \quad \text{in } \Omega. \qquad (1.11b)$$
Note that the pressure $p$ is determined only up to a constant by these Navier-Stokes equations. The problem has to be completed with suitable boundary conditions. One simple possibility is to take homogeneous Dirichlet boundary conditions for $u$, i.e., $u = 0$ on $\partial\Omega$. If in the Navier-Stokes equations the nonlinear convection term $(u \cdot \nabla)u$ is neglected, which can be justified in situations where the viscosity parameter $\nu$ is large, one obtains the Stokes equations. From a simple rescaling argument (replace $u$ by $\nu^{-1}u$) it follows that without loss of generality in the Stokes equations we can assume $\nu = 1$. Summarizing, we obtain the following Stokes problem:
$$-\Delta u + \nabla p = f \quad \text{in } \Omega, \qquad (1.12a)$$
$$\operatorname{div} u = 0 \quad \text{in } \Omega, \qquad (1.12b)$$
$$u = 0 \quad \text{on } \partial\Omega. \qquad (1.12c)$$
Chapter 2
Weak formulation
2.1 Introduction
For solving a boundary value problem it can be (very) advantageous to consider a generalization
of the classical problem formulation, in which larger function spaces are used and a weaker
solution (explained below) is allowed. This results in the variational formulation (also called
weak formulation) of a boundary value problem. In this section we consider an introductory
example which illustrates that even for a very simple boundary value problem the choice of an
appropriate solution space is an important issue. This example also serves as a motivation for
the introduction of the Sobolev spaces in section 2.2.
As an introductory example we consider the two-point boundary value problem
$$-(a(x)u'(x))' = 1 \quad \text{for } x \in (0,1), \qquad (2.1a)$$
$$u(0) = u(1) = 0. \qquad (2.1b)$$
We assume that the coefficient $a$ is an element of $C^1([0,1])$ and that $a(x) > 0$ holds for all $x \in [0,1]$. This problem then has a unique solution in the space
$$V_1 := \{\, v \in C^2([0,1]) \mid v(0) = v(1) = 0 \,\},$$
which may be checked by substitution in (2.1). If one multiplies the equation (2.1a) by an arbitrary function $v \in V_1$, integrates both the left- and right-hand side and then applies partial integration, one can show that $u \in V_1$ is the solution of (2.1) if and only if
$$\int_0^1 a(x)u'(x)v'(x)\,dx = \int_0^1 v(x)\,dx \quad \text{for all } v \in V_1. \qquad (2.2)$$
This variational problem can be reformulated as a minimization problem. For this we introduce
the notion of a bilinear form.
Definition 2.1.1 Let $X$ be a vector space. A mapping $k : X \times X \to \mathbb{R}$ is called a bilinear form if for arbitrary $\alpha, \beta \in \mathbb{R}$ and $u, v, w \in X$ the following holds:
$$k(\alpha u + \beta v, w) = \alpha\, k(u,w) + \beta\, k(v,w), \qquad k(u, \alpha v + \beta w) = \alpha\, k(u,v) + \beta\, k(u,w).$$
The bilinear form is called symmetric if $k(u,v) = k(v,u)$ for all $u, v \in X$.

Lemma 2.1.2 Let $X$ be a vector space and $k : X \times X \to \mathbb{R}$ a symmetric bilinear form which is positive, i.e., $k(v,v) > 0$ for all $v \in X$, $v \ne 0$. Let $f : X \to \mathbb{R}$ be a linear functional. Define $J : X \to \mathbb{R}$ by
$$J(v) = \tfrac12 k(v,v) - f(v).$$
Then $J(u) < J(v)$ for all $v \in X$, $v \ne u$, holds if and only if
$$k(u,v) = f(v) \quad \text{for all } v \in X, \qquad (2.3)$$
and in that case
$$J(v) - J(u) = \tfrac12 k(v - u, v - u) \quad \text{for all } v \in X \qquad (2.4)$$
holds.
Note that all assumptions of lemma 2.1.2 are fulfilled. It then follows that the unique solution of (2.1) (or, equivalently, of (2.2)) is also the unique minimizer of the functional
$$J(v) = \int_0^1 \Bigl[\tfrac12 a(x)v'(x)^2 - v(x)\Bigr]\,dx. \qquad (2.5)$$
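The content of lemma 2.1.2 is already visible in finite dimensions. The following sketch (my own illustration, not part of the text) takes $X = \mathbb{R}^2$, $k(u,v) = u^T K v$ with $K$ symmetric positive definite and $f(v) = b^T v$, computes the minimizer from $Ku = b$, and checks (2.3), (2.4) and minimality for random $v$.

```python
import random

# finite-dimensional illustration: X = R^2, k(u,v) = u^T K v, f(v) = b^T v
K = [[2.0, 1.0], [1.0, 3.0]]   # symmetric positive definite (hypothetical)
b = [1.0, -2.0]

def k(u, v):
    return sum(K[i][j] * u[i] * v[j] for i in range(2) for j in range(2))

def f(v):
    return b[0] * v[0] + b[1] * v[1]

def J(v):
    return 0.5 * k(v, v) - f(v)

# minimizer solves K u = b (2x2 Cramer's rule)
det = K[0][0] * K[1][1] - K[0][1] * K[1][0]
u = [(b[0] * K[1][1] - b[1] * K[0][1]) / det,
     (b[1] * K[0][0] - b[0] * K[1][0]) / det]

random.seed(0)
for _ in range(100):
    v = [random.uniform(-5.0, 5.0), random.uniform(-5.0, 5.0)]
    assert abs(k(u, v) - f(v)) < 1e-9                # (2.3)
    w = [v[0] - u[0], v[1] - u[1]]
    assert abs(J(v) - J(u) - 0.5 * k(w, w)) < 1e-9   # (2.4)
    assert J(v) >= J(u) - 1e-9                       # u minimizes J
```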
We consider a case in which the coefficient $a$ is only piecewise continuous (and not differentiable at all $x \in (0,1)$). Then the problem in (2.1) is not well-defined. However, the definitions of the bilinear form $k(\cdot,\cdot)$ and of the functional $J(\cdot)$ still make sense. We now analyze a minimization problem with a functional as in (2.5) in which the coefficient $a$ is piecewise constant:
$$a(x) = \begin{cases} 1 & \text{if } x \in [0, \tfrac12], \\ 2 & \text{if } x \in (\tfrac12, 1]. \end{cases}$$
We show that for this problem the choice of an appropriate solution space is a delicate issue. Note that due to lemma 2.1.2 the minimization problem in $X = V_1$ has a corresponding equivalent variational formulation as in (2.3). With our choice of the coefficient $a$ the functional $J(\cdot)$ takes the form
$$J(v) := \int_0^{1/2} \Bigl[\tfrac12 v'(x)^2 - v(x)\Bigr]\,dx + \int_{1/2}^{1} \Bigl[v'(x)^2 - v(x)\Bigr]\,dx. \qquad (2.6)$$
This functional is well-defined on the space $V_1$. The functional $J$, however, is also well-defined if $v$ is only piecewise differentiable, and even if we allow $v$ to be discontinuous at $x = \tfrac12$. We introduce the spaces
$$V_2 := \{\, v \in C^1([0,1]) \mid v(0) = v(1) = 0 \,\},$$
$$V_3 := \{\, v \in C([0,1]) \mid v(0) = v(1) = 0,\ v|_{[0,1/2]} \in C^1([0,\tfrac12]),\ v|_{[1/2,1]} \in C^1([\tfrac12,1]) \,\},$$
$$V_4 := \{\, v : [0,1] \to \mathbb{R} \mid v(0) = v(1) = 0,\ v|_{[0,1/2]} \in C^1([0,\tfrac12]),\ v|_{(1/2,1]} \text{ has a } C^1 \text{ extension to } [\tfrac12,1] \,\}.$$
Note that $V_1 \subset V_2 \subset V_3 \subset V_4$ and that on all these spaces the functional $J(\cdot)$ is well-defined.
Moreover, with $X = V_i$, $i = 1, \ldots, 4$, and
$$k(u,v) = \int_0^{1/2} u'(x)v'(x)\,dx + \int_{1/2}^{1} 2u'(x)v'(x)\,dx, \qquad f(v) = \int_0^1 v(x)\,dx, \qquad (2.7)$$
all assumptions of lemma 2.1.2 are fulfilled. We define a (natural) norm on these spaces:
$$\|w\|^2 := \int_0^{1/2} w'(x)^2\,dx + \int_{1/2}^{1} w'(x)^2\,dx. \qquad (2.8)$$
One easily checks that this indeed defines a norm on the space $V_4$ and thus also on the subspaces $V_i$, $i = 1, 2, 3$. Furthermore, this norm is induced by the scalar product
$$(w,v)_1 := \int_0^{1/2} w'(x)v'(x)\,dx + \int_{1/2}^{1} w'(x)v'(x)\,dx \qquad (2.9)$$
on $V_4$, and
$$\|w\|^2 \le k(w,w) \le 2\|w\|^2 \quad \text{for all } w \in V_4 \qquad (2.10)$$
holds. We show that in the space $V_3$ the minimization problem has a unique solution.
Lemma 2.1.3 The problem $\min_{v \in V_3} J(v)$ has a unique solution $u$, given by
$$u(x) = \begin{cases} -\tfrac12 x^2 + \tfrac{5}{12}x & \text{if } 0 \le x \le \tfrac12, \\[2pt] -\tfrac14 x^2 + \tfrac{5}{24}x + \tfrac{1}{24} & \text{if } \tfrac12 \le x \le 1. \end{cases} \qquad (2.11)$$
Proof. Note that $u \in V_3$, and even $u|_{[0,1/2]} \in C^\infty([0,\tfrac12])$ and $u|_{[1/2,1]} \in C^\infty([\tfrac12,1])$. We use the notation $u_L'(\tfrac12) = \lim_{x \uparrow 1/2} u'(x)$ and similarly for $u_R'(\tfrac12)$. We apply lemma 2.1.2 with $X = V_3$. For arbitrary $v \in V_3$ we have
$$\begin{aligned}
k(u,v) - f(v) &= \int_0^{1/2} \bigl(u'(x)v'(x) - v(x)\bigr)\,dx + \int_{1/2}^{1} \bigl(2u'(x)v'(x) - v(x)\bigr)\,dx \\
&= u_L'(\tfrac12)v(\tfrac12) - \int_0^{1/2} \bigl(u''(x) + 1\bigr)v(x)\,dx \\
&\quad - 2u_R'(\tfrac12)v(\tfrac12) - \int_{1/2}^{1} \bigl(2u''(x) + 1\bigr)v(x)\,dx. 
\end{aligned} \qquad (2.12)$$
Due to $u''(x) = -1$ on $[0,\tfrac12]$, $u''(x) = -\tfrac12$ on $[\tfrac12,1]$ and $u_L'(\tfrac12) - 2u_R'(\tfrac12) = 0$ we obtain $k(u,v) = f(v)$ for all $v \in V_3$. From lemma 2.1.2 we conclude that $u$ is the unique minimizer in $V_3$.
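The identity $k(u,v) = f(v)$ for the explicit $u$ from (2.11) can also be checked by quadrature. The sketch below uses two sample test functions $v \in V_3$ (my own choices) and composite Simpson integration split at $x = \tfrac12$.

```python
import math

# u'(x) for the minimizer u from (2.11)
def du(x):
    return -x + 5.0 / 12.0 if x <= 0.5 else -0.5 * x + 5.0 / 24.0

def a(x):
    return 1.0 if x <= 0.5 else 2.0

def simpson(g, lo, hi, n=2000):
    h = (hi - lo) / n
    return (g(lo) + g(hi)
            + 4.0 * sum(g(lo + (2 * i - 1) * h) for i in range(1, n // 2 + 1))
            + 2.0 * sum(g(lo + 2 * i * h) for i in range(1, n // 2))) * h / 3.0

# sample test functions v with v(0) = v(1) = 0, together with v'
tests = [
    (lambda x: math.sin(math.pi * x), lambda x: math.pi * math.cos(math.pi * x)),
    (lambda x: x * (1.0 - x),         lambda x: 1.0 - 2.0 * x),
]
for v, dv in tests:
    integrand = lambda x, dv=dv: a(x) * du(x) * dv(x)
    # k(u, v), split at x = 1/2 so each piece is smooth, and f(v)
    kuv = simpson(integrand, 0.0, 0.5) + simpson(integrand, 0.5, 1.0)
    fv = simpson(v, 0.0, 1.0)
    assert abs(kuv - fv) < 1e-8
```

Note that the "flux" $a(x)u'(x)$ is continuous at $x = \tfrac12$ even though $u'$ itself jumps there, which is exactly the matching condition $u_L'(\tfrac12) = 2u_R'(\tfrac12)$ used in the proof.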
Thus with $X = V_3$ a minimizer $u$ exists and the relation (2.4) takes the form
$$J(v) - J(u) = \tfrac12 k(v - u, v - u),$$
with $k(\cdot,\cdot)$ as in (2.7). Due to (2.10) the norm $\|\cdot\|$ can be used as a measure for the distance from the minimum (i.e. $J(v) - J(u)$):
$$\tfrac12 \|v - u\|^2 \le J(v) - J(u) \le \|v - u\|^2. \qquad (2.13)$$
Before we turn to the minimization problems in the spaces $V_1$ and $V_2$ we first present a useful lemma.

Lemma 2.1.4 Define $W := \{\, v \in C^\infty([0,1]) \mid v(0) = v(1) = 0 \,\}$. For every $u \in V_3$ there is a sequence $(u_n)_{n \ge 1}$ in $W$ such that
$$\lim_{n \to \infty} \|u_n - u\| = 0. \qquad (2.14)$$
Proof. Take $u \in V_3$ and define $\bar u(x) := u'(x)$ for all $x \in [0,1]$, $x \ne \tfrac12$ ($\bar u(\tfrac12)$ is assigned a fixed arbitrary value), and $\bar u(-x) := \bar u(x)$ for all $x \in [0,1]$. Then $\bar u$ is even and $\bar u \in L^2(-1,1)$. From Fourier analysis it follows that there is a sequence
$$\bar u_n(x) = \sum_{k=0}^{n} a_k \cos(k\pi x), \quad n \in \mathbb{N},$$
such that
$$\lim_{n \to \infty} \|\bar u - \bar u_n\|_{L^2}^2 = \lim_{n \to \infty} \int_{-1}^{1} \bigl(\bar u(x) - \bar u_n(x)\bigr)^2\,dx = 0.$$
Note that due to the fact that $u$ is continuous and $u(0) = u(1) = 0$ we get $a_0 = \tfrac12 \int_{-1}^{1} \bar u(x)\,dx = \int_0^{1/2} u'(x)\,dx + \int_{1/2}^{1} u'(x)\,dx = 0$. Define $u_n(x) = \sum_{k=1}^{n} \frac{a_k}{k\pi} \sin(k\pi x)$ for $n \ge 1$. Then $u_n \in W$, $u_n' = \bar u_n$ and
$$\|u - u_n\|^2 \le \|\bar u - \bar u_n\|_{L^2}^2$$
holds. Thus it follows that $\lim_{n \to \infty} \|u_n - u\| = 0$.
Lemma 2.1.5 Let $u \in V_3$ be given by (2.11). For $i = 1, 2$ the following holds:
$$\inf_{v \in V_i} J(v) = \min_{v \in V_3} J(v) = J(u). \qquad (2.15)$$
Proof. Take $i = 1$ or $i = 2$. $I := \inf_{v \in V_i} J(v)$ is defined as the greatest lower bound of $J(v)$ for $v \in V_i$. From $V_3 \supset V_i$ it follows that $J(u) \le I$ holds. Suppose that $J(u) < I$ holds, i.e. we have $\varepsilon := I - J(u) > 0$. Due to $W \subset V_i$ and lemma 2.1.4 there is a sequence $(u_n)_{n \ge 1}$ in $V_i$ such that $\lim_{n \to \infty} \|u - u_n\| = 0$ holds. Using (2.13) we obtain
$$J(u_n) = J(u) + \bigl(J(u_n) - J(u)\bigr) \le I - \varepsilon + \|u_n - u\|^2.$$
So for $n$ sufficiently large we have $J(u_n) < I$, which on the other hand is not possible because $I$ is a lower bound of $J(v)$ for $v \in V_i$. We conclude that $J(u) = I$ holds.
The result in this lemma shows that the infimum of $J(v)$ for $v \in V_2$ is equal to $J(u)$ and thus, using (2.13), it follows that the minimizer $u \in V_3$ can be approximated to any accuracy, measured in the norm $\|\cdot\|$, by elements from the smaller space $V_2$. The question arises why in the minimization problem the space $V_3$ is used and not the seemingly more natural space $V_2$. The answer to this question is formulated in the following lemma.
Lemma 2.1.6 There does not exist $w \in V_2$ such that $J(w) \le J(v)$ for all $v \in V_2$.

Proof. Suppose that such a minimizer, say $w \in V_2$, exists. From lemma 2.1.5 we then obtain
$$J(w) = \min_{v \in V_2} J(v) = \inf_{v \in V_2} J(v) = J(u),$$
with $u$ as in (2.11) the minimizer in $V_3$. Note that $w \in V_3$. From lemma 2.1.2 it follows that the minimizer in $V_3$ is unique and thus $w = u$ must hold. But then $w = u \notin V_2$, which yields a contradiction.
The same arguments as in the proof of this lemma can be used to show that in the smaller
space V1 there also does not exist a minimizer. Based on these results the function u V3 is
called the weak solution of the minimization problem in V2 . From (2.15) we see that for solving
the minimization problem in the space V2 , in the sense that one wants to compute inf vV2 J(v),
it is natural to consider the minimization problem in the larger space V3 .
We now consider the even larger space V4 and show that the minimization problem still makes
sense (i.e. has a unique solution). However, the minimum value does not equal inf vV2 J(v).
Lemma 2.1.7 The problem $\min_{v \in V_4} J(v)$ has a unique solution $\tilde u$, given by
$$\tilde u(x) = \begin{cases} -\tfrac12 x(x-1) & \text{if } 0 \le x \le \tfrac12, \\[2pt] -\tfrac14 x(x-1) & \text{if } \tfrac12 < x \le 1. \end{cases}$$
Proof. Note that $\tilde u \in V_4$ holds. We apply lemma 2.1.2. As in (2.12) one obtains, now with one-sided values $v_L(\tfrac12)$, $v_R(\tfrac12)$ since $v \in V_4$ may be discontinuous at $\tfrac12$,
$$k(\tilde u,v) - f(v) = \tilde u_L'(\tfrac12)v_L(\tfrac12) - \int_0^{1/2} \bigl(\tilde u''(x) + 1\bigr)v(x)\,dx - 2\tilde u_R'(\tfrac12)v_R(\tfrac12) - \int_{1/2}^{1} \bigl(2\tilde u''(x) + 1\bigr)v(x)\,dx.$$
From $\tilde u''(x) = -1$ on $[0,\tfrac12]$, $\tilde u''(x) = -\tfrac12$ on $[\tfrac12,1]$ and $\tilde u_L'(\tfrac12) = \tilde u_R'(\tfrac12) = 0$ it follows that $k(\tilde u, v) = f(v)$ for all $v \in V_4$. We conclude that $\tilde u$ is the unique minimizer in $V_4$.
A straightforward calculation yields the following values for the minima of the functional $J(\cdot)$ in $V_3$ and in $V_4$, respectively:
$$J(u) = -\frac{11}{384}, \qquad J(\tilde u) = -\frac{1}{32}.$$
From this we see that, opposite to $u$, we should not call $\tilde u$ a weak solution of the minimization problem in $V_2$, because for $\tilde u$ we have $J(\tilde u) < \inf_{v \in V_2} J(v)$.
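These two values are easy to confirm by quadrature (a sketch; Simpson's rule is exact here because all integrands are polynomials on each subinterval):

```python
def simpson(g, lo, hi, n=1000):
    h = (hi - lo) / n
    return (g(lo) + g(hi)
            + 4.0 * sum(g(lo + (2 * i - 1) * h) for i in range(1, n // 2 + 1))
            + 2.0 * sum(g(lo + 2 * i * h) for i in range(1, n // 2))) * h / 3.0

# J(u) for u from (2.11): integrand (1/2) a (u')^2 - u on each subinterval
JL = simpson(lambda x: 0.5 * (-x + 5.0 / 12.0) ** 2
             - (-0.5 * x * x + 5.0 * x / 12.0), 0.0, 0.5)
JR = simpson(lambda x: (-0.5 * x + 5.0 / 24.0) ** 2
             - (-0.25 * x * x + 5.0 * x / 24.0 + 1.0 / 24.0), 0.5, 1.0)
assert abs(JL + JR - (-11.0 / 384.0)) < 1e-10

# the same for the V4-minimizer u~ from lemma 2.1.7
JLt = simpson(lambda x: 0.5 * (-x + 0.5) ** 2 + 0.5 * x * (x - 1.0), 0.0, 0.5)
JRt = simpson(lambda x: (-0.5 * x + 0.25) ** 2 + 0.25 * x * (x - 1.0), 0.5, 1.0)
assert abs(JLt + JRt - (-1.0 / 32.0)) < 1e-10
```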
We conclude the discussion of this example with a few remarks on important issues that play a role in the remainder of this book.
1) Both the theoretical analysis and the numerical solution methods treated in this book heavily rely on the variational formulation of the elliptic boundary value problem (as, for example, in (2.2)). In section 2.3 general results on existence, uniqueness and stability of variational problems are presented. In the sections 2.5 and 2.6 these are applied to variational formulations of elliptic boundary value problems. The finite element discretization method treated in chapter 3 is based on the variational formulation of the boundary value problem. The derivation of the conjugate gradient (CG) iterative method, discussed in chapter 7, is based on the assumption that the given (discrete) problem can be formulated as a minimization problem with a functional J very similar to the one in lemma 2.1.2.
2) The bilinear form used in the weak formulation often has properties similar to those of an inner product, cf. (2.7), (2.9), (2.10). To take advantage of this one should formulate the problem in an inner product space. Then the structure of the space (inner product) fits nicely with the variational problem.
3) To guarantee that a weak solution actually lies in the space one should use a space that
is large enough but not too large. This can be realized by completion of the space in
which the problem is formulated. The concept of completion is explained in section 2.2.2.
The conditions discussed in the remarks 2) and 3) lead to Hilbert spaces, i.e. inner product
spaces that are complete. The Hilbert spaces that are appropriate for elliptic boundary value
problems are the Sobolev spaces. These are treated in section 2.2.
2.2 Sobolev spaces
2.2.1 The spaces $W^m(\Omega)$ based on weak derivatives
We take $u \in C^1(\Omega)$ and $\varphi \in C_0^\infty(\Omega)$. Since $\varphi$ vanishes identically outside some compact subset of $\Omega$, one obtains by partial integration in the variable $x_j$:
$$\int_\Omega \frac{\partial u(x)}{\partial x_j}\,\varphi(x)\,dx = -\int_\Omega u(x)\,\frac{\partial \varphi(x)}{\partial x_j}\,dx,$$
and thus
$$\int_\Omega D^\alpha u(x)\,\varphi(x)\,dx = -\int_\Omega u(x)\,D^\alpha \varphi(x)\,dx, \quad |\alpha| = 1,$$
holds. Repeated application of this result yields the fundamental Green's formula
$$\int_\Omega D^\alpha u(x)\,\varphi(x)\,dx = (-1)^{|\alpha|} \int_\Omega u(x)\,D^\alpha \varphi(x)\,dx \qquad (2.16)$$
for all $\varphi \in C_0^\infty(\Omega)$, $u \in C^k(\Omega)$, $k = 1, 2, \ldots$, and $|\alpha| \le k$.
Definition 2.2.1 (weak derivative) Let $u \in L^2(\Omega)$ and let $\alpha$ be a multi-index. If there exists $v \in L^2(\Omega)$ such that
$$\int_\Omega v(x)\,\varphi(x)\,dx = (-1)^{|\alpha|} \int_\Omega u(x)\,D^\alpha \varphi(x)\,dx \quad \text{for all } \varphi \in C_0^\infty(\Omega), \qquad (2.17)$$
then $v$ is called the $\alpha$-th weak derivative of $u$, and we write $D^\alpha u := v$.
Lemma 2.2.2 If for $u \in L^2(\Omega)$ the $\alpha$-th weak derivative exists then it is unique (in the usual Lebesgue sense). If $u \in C^k(\Omega)$ then for $0 < |\alpha| \le k$ the $\alpha$-th weak derivative and the classical $\alpha$-th derivative coincide.

Proof. The second statement follows from the first one and Green's formula (2.16). We now prove the uniqueness. Assume that $v_i \in L^2(\Omega)$, $i = 1, 2$, both satisfy (2.17). Then it follows that
$$\int_\Omega \bigl(v_1(x) - v_2(x)\bigr)\varphi(x)\,dx = \langle v_1 - v_2, \varphi \rangle_{L^2} = 0 \quad \text{for all } \varphi \in C_0^\infty(\Omega).$$
Since $C_0^\infty(\Omega)$ is dense in $L^2(\Omega)$ this implies that $\langle v_1 - v_2, w \rangle_{L^2} = 0$ for all $w \in L^2(\Omega)$ and thus $v_1 - v_2 = 0$ (a.e.).
Remark 2.2.3 As a warning we note that if the classical derivative of $u$, say $D^\alpha u$, exists almost everywhere in $\Omega$ and $D^\alpha u \in L^2(\Omega)$, this does not guarantee the existence of the $\alpha$-th weak derivative. This is shown by the following example: take $\Omega = (-1,1)$ and $u(x) = 0$ for $x < 0$, $u(x) = 1$ for $x \ge 0$. The classical first derivative of $u$ on $\Omega \setminus \{0\}$ is $u'(x) = 0$. However, the weak first derivative of $u$ as defined in 2.2.1 does not exist.
A further noticeable observation is the following: if $u \in C^\infty(\Omega) \cap C(\bar\Omega)$ then $u$ does not always have a first weak derivative. This is shown by the example $\Omega = (0,1)$, $u(x) = \sqrt{x}$. The only candidate for the first weak derivative of $u$ is $v(x) = \frac{1}{2\sqrt{x}}$. However, $v \notin L^2(\Omega)$.
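The failure $v \notin L^2(0,1)$ can be made concrete numerically: $v(x)^2 = \frac{1}{4x}$ contributes the same amount $(\ln 2)/4$ on every dyadic shell $[2^{-(k+1)}, 2^{-k}]$, so the integral over $(0,1)$ diverges. A sketch (my own illustration):

```python
import math

def shell(k, n=1000):
    # midpoint rule for the integral of v(x)^2 = 1/(4x) over [2^-(k+1), 2^-k]
    lo, hi = 2.0 ** -(k + 1), 2.0 ** -k
    h = (hi - lo) / n
    return sum(h / (4.0 * (lo + (i + 0.5) * h)) for i in range(n))

vals = [shell(k) for k in range(40)]
# every dyadic shell contributes the same amount (ln 2)/4 ...
for v in vals:
    assert abs(v - math.log(2.0) / 4.0) < 1e-6
# ... so the integral of v^2 over (0, 1) diverges: v is not in L^2(0, 1)
```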
The Sobolev space $W^m(\Omega)$, $m = 1, 2, \ldots$, consists of all functions in $L^2(\Omega)$ for which all $\alpha$-th weak derivatives with $|\alpha| \le m$ exist:
$$W^m(\Omega) := \{\, u \in L^2(\Omega) \mid D^\alpha u \text{ exists for all } |\alpha| \le m \,\} \qquad (2.18)$$
(for $m = 0$ we set $W^0(\Omega) := L^2(\Omega)$). On $W^m(\Omega)$ we define
$$\langle u, v \rangle_m := \sum_{|\alpha| \le m} \langle D^\alpha u, D^\alpha v \rangle_{L^2}, \qquad \|u\|_m := \langle u, u \rangle_m^{1/2}. \qquad (2.19)$$
It is easy to verify that $\langle \cdot, \cdot \rangle_m$ defines an inner product on $W^m(\Omega)$. We now formulate a main result: the space $(W^m(\Omega), \langle \cdot, \cdot \rangle_m)$ is a Hilbert space.
Proof. We must show that the space $W^m(\Omega)$ with the norm $\|\cdot\|_m$ is complete. For $m = 0$ this is trivial. We consider $m \ge 1$. First note that for $v \in W^m(\Omega)$:
$$\|v\|_m^2 = \sum_{|\alpha| \le m} \|D^\alpha v\|_{L^2}^2. \qquad (2.20)$$
Let $(u_k)_{k \ge 1}$ be a Cauchy sequence in $W^m(\Omega)$. From (2.20) it follows that if $\|u_k - u_\ell\|_m \le \varepsilon$ then $\|D^\alpha u_k - D^\alpha u_\ell\|_{L^2} \le \varepsilon$ for all $0 \le |\alpha| \le m$. Hence, $(D^\alpha u_k)_{k \ge 1}$ is a Cauchy sequence in $L^2(\Omega)$ for all $|\alpha| \le m$. Since $L^2(\Omega)$ is complete it follows that there exists a unique $u^{(\alpha)} \in L^2(\Omega)$ with $\lim_{k \to \infty} D^\alpha u_k = u^{(\alpha)}$ in $L^2(\Omega)$. For $|\alpha| = 0$ this yields $\lim_{k \to \infty} u_k = u^{(0)}$ in $L^2(\Omega)$. For $0 < |\alpha| \le m$ and arbitrary $\varphi \in C_0^\infty(\Omega)$ we obtain
$$\int_\Omega u^{(\alpha)} \varphi\,dx = \lim_{k \to \infty} \int_\Omega D^\alpha u_k\, \varphi\,dx = \lim_{k \to \infty} (-1)^{|\alpha|} \int_\Omega u_k\, D^\alpha \varphi\,dx = (-1)^{|\alpha|} \int_\Omega u^{(0)} D^\alpha \varphi\,dx. \qquad (2.21)$$
From this it follows that $u^{(\alpha)} \in L^2(\Omega)$ is the $\alpha$-th weak derivative of $u^{(0)}$. We conclude that $u^{(0)} \in W^m(\Omega)$ and
$$\lim_{k \to \infty} \|u_k - u^{(0)}\|_m^2 = \lim_{k \to \infty} \sum_{|\alpha| \le m} \|D^\alpha u_k - D^\alpha u^{(0)}\|_{L^2}^2 = \lim_{k \to \infty} \sum_{|\alpha| \le m} \|D^\alpha u_k - u^{(\alpha)}\|_{L^2}^2 = 0.$$
This shows that the Cauchy sequence $(u_k)_{k \ge 1}$ in $W^m(\Omega)$ has its limit point in $W^m(\Omega)$ and thus this space is complete.
Similar constructions can be applied if we replace the Hilbert space $L^2(\Omega)$ by the Banach space $L^p(\Omega)$, $1 \le p < \infty$, of measurable functions for which $\|u\|_p := \bigl(\int_\Omega |u(x)|^p\,dx\bigr)^{1/p}$ is bounded. This results in Sobolev spaces which are usually denoted by $W_p^m(\Omega)$. For notational simplicity we deleted the index $p = 2$ in our presentation. For $p \ne 2$ the Sobolev space $W_p^m(\Omega)$ is a Banach space but not a Hilbert space. In this book we only need the Sobolev space with $p = 2$ as defined in (2.18). For $p \ne 2$ we refer to the literature, e.g. [1].
Lemma 2.2.6 Let $(Z, \|\cdot\|)$ be a normed space. Then there exists a Banach space $(X, \|\cdot\|_X)$ such that $Z \subset X$, $\|x\|_X = \|x\|$ for all $x \in Z$, and $Z$ is dense in $X$. The space $X$ is called the completion of $Z$. This space is unique, except for isometric (i.e., norm preserving) isomorphisms.
We apply this construction to the space $Z_m := \{\, u \in C^\infty(\Omega) \mid \|u\|_m < \infty \,\}$, endowed with the norm $\|\cdot\|_m$ as defined in (2.19); i.e., we want to construct the completion of $(Z_m, \|\cdot\|_m)$. For $m = 0$ this results in $L^2(\Omega)$, since $C^\infty(\Omega) \cap L^2(\Omega)$ is dense in $L^2(\Omega)$. Hence, we only consider $m \ge 1$. Note that $Z_m \subset W^m(\Omega)$ and that this embedding is continuous. One can apply the general result of lemma 2.2.6, which then defines the completion of the space $Z_m$. However, here we want to present a more constructive approach which reveals some interesting relations between this completion and the space $W^m(\Omega)$.
First, note that due to (2.20) a Cauchy sequence $(u_k)_{k \ge 1}$ in $Z_m$ is also a Cauchy sequence in $L^2(\Omega)$, and thus to every such sequence there corresponds a unique $u \in L^2(\Omega)$ with $\lim_{k \to \infty} \|u_k - u\|_{L^2} = 0$. The space $V_m \supset Z_m$ is defined as follows:
$$V_m := \{\, u \in L^2(\Omega) \mid \text{there is a Cauchy sequence } (u_k)_{k \ge 1} \text{ in } (Z_m, \|\cdot\|_m) \text{ with } \lim_{k \to \infty} \|u_k - u\|_{L^2} = 0 \,\}.$$

Lemma 2.2.7 $V_m$ is the closure of $Z_m$ in $W^m(\Omega)$; in particular $V_m \subset W^m(\Omega)$.

Proof. Take $u \in V_m$ and let $(u_k)_{k \ge 1}$ be a Cauchy sequence in $Z_m$ with $\lim_{k \to \infty} \|u_k - u\|_{L^2} = 0$. From (2.20) it follows that $(D^\alpha u_k)_{k \ge 1}$, $0 < |\alpha| \le m$, are Cauchy sequences in $L^2(\Omega)$. Let $u^{(\alpha)} := \lim_{k \to \infty} D^\alpha u_k$ in $L^2(\Omega)$. As in (2.21) one shows that $u^{(\alpha)}$ is the $\alpha$-th weak derivative $D^\alpha u$ of $u$. Using $D^\alpha u = \lim_{k \to \infty} D^\alpha u_k$ in $L^2(\Omega)$, for $0 \le |\alpha| \le m$, we get
$$\lim_{k \to \infty} \|u_k - u\|_m^2 = \lim_{k \to \infty} \sum_{|\alpha| \le m} \|D^\alpha u_k - D^\alpha u\|_{L^2}^2 = 0.$$
Since $(u_k)_{k \ge 1}$ is a sequence in $Z_m$ we have shown that $V_m$ is the closure of $Z_m$ in $W^m(\Omega)$.

On the space $V_m$ we can take the same inner product (and corresponding norm) as used in the space $W^m(\Omega)$ (cf. (2.19)). From lemma 2.2.7 and the fact that in the space $Z_m$ the norm is the same as the norm of $W^m(\Omega)$ it follows that $(V_m, \|\cdot\|_m)$ is the completion of $(Z_m, \|\cdot\|_m)$.
Since the norm $\|\cdot\|_m$ is induced by an inner product we have that $(V_m, \langle \cdot, \cdot \rangle_m)$ is a Hilbert space. This defines the Sobolev space
$$H^m(\Omega) := (V_m, \langle \cdot, \cdot \rangle_m) = \text{completion of } (Z_m, \langle \cdot, \cdot \rangle_m).$$
It is clear from lemma 2.2.7 that
$$H^m(\Omega) \subset W^m(\Omega)$$
holds. A fundamental result is the following:

Theorem 2.2.8 For every $m \in \mathbb{N}$ the equality $H^m(\Omega) = W^m(\Omega)$ holds.

Proof. The first proof of this result was presented in [63]. A proof can also be found in [1, 65].

We see that the construction using weak derivatives (space $W^m(\Omega)$) and the one based on completion (space $H^m(\Omega)$) result in the same Sobolev space. In the remainder we will only use the notation $H^m(\Omega)$.
The result in theorem 2.2.8 holds for arbitrary open sets $\Omega$ in $\mathbb{R}^n$. If the domain $\Omega$ satisfies certain very mild smoothness conditions (it suffices to have assumption 1.1.3) one can prove a somewhat stronger result, that we will need further on:

Theorem 2.2.9 Let $\tilde H^m(\Omega)$ be the completion of the space $(C^\infty(\bar\Omega), \langle \cdot, \cdot \rangle_m)$. Then
$$\tilde H^m(\Omega) = H^m(\Omega) = W^m(\Omega)$$
holds.

Proof. We refer to [1].

Note that $C^\infty(\bar\Omega) \subsetneq Z_m$ and thus $\tilde H^m(\Omega)$ results from the completion of a smaller space than $H^m(\Omega)$.
Remark 2.2.10 If assumption 1.1.3 is not satisfied then it may happen that $\tilde H^m(\Omega) \subsetneq W^m(\Omega)$ holds. Consider the example
$$\Omega = \{\, (x,y) \in \mathbb{R}^2 \mid x \in (-1,0) \cup (0,1),\ y \in (0,1) \,\}$$
and take $u(x,y) = 1$ if $x > 0$, $u(x,y) = 0$ if $x < 0$. Then $D^{(1,0)}u = D^{(0,1)}u = 0$ on $\Omega$ and thus $u \in W^1(\Omega)$. However, one can verify that there does not exist a sequence $(\varphi_k)_{k \ge 1}$ in $C^1(\bar\Omega)$ such that
$$\|u - \varphi_k\|_1^2 = \|u - \varphi_k\|_{L^2}^2 + \sum_{|\alpha| = 1} \|D^\alpha \varphi_k\|_{L^2}^2 \to 0 \quad \text{for } k \to \infty.$$
Remark 2.2.11 In general we have $H_0^1(\Omega) \subsetneq H^1(\Omega)$. Consider, as a simple example, $\Omega = (0,1)$, $u(x) = x$. Then $u \in H^1(\Omega)$, but for arbitrary $\varphi \in C_0^\infty(\Omega)$ we have
$$\|u - \varphi\|_1 \ge \|u' - \varphi'\|_{L^2} \ge \int_0^1 \bigl(1 - \varphi'(x)\bigr)\,dx = 1.$$
Hence $u \notin H_0^1(\Omega) = \overline{C_0^\infty(\Omega)}^{\,\|\cdot\|_1}$.

The technique of completion can also be applied if instead of $\|\cdot\|_m$ one uses the norm $\|u\|_{m,p} = \bigl(\sum_{|\alpha| \le m} \|D^\alpha u\|_{L^p}^p\bigr)^{1/p}$, $1 \le p < \infty$. This results in Sobolev spaces denoted by $H_p^m(\Omega)$. For $p = 2$ we have $H_2^m(\Omega) = H^m(\Omega)$. For $p \ne 2$ these spaces are Banach spaces but not Hilbert spaces. A result as in theorem 2.2.8 also holds for $p \ne 2$: $H_p^m(\Omega) = W_p^m(\Omega)$.
We now formulate a result on a certain class of piecewise smooth functions which form a subset of the Sobolev space $H^m(\Omega)$. This subset plays an important role in the finite element method that will be presented in chapter 3.

Theorem 2.2.12 Assume that $\bar\Omega$ can be partitioned as $\bar\Omega = \cup_{i=1}^N \bar\Omega_i$, with $\Omega_i \cap \Omega_j = \emptyset$ for all $i \ne j$, and that for all $\Omega_i$ the assumption 1.1.3 is fulfilled. For $m \in \mathbb{N}$, $m \ge 1$, define
$$V_m := \{\, u : \bar\Omega \to \mathbb{R} \mid u|_{\bar\Omega_i} \in C^m(\bar\Omega_i) \text{ for } i = 1, \ldots, N \,\}.$$
Then for $u \in V_m$:
$$u \in H^m(\Omega) \iff u \in C^{m-1}(\bar\Omega).$$
Proof. First we need some notation. Let $\Gamma_i := \partial\Omega_i$. The outward unit normal on $\Gamma_i$ is denoted by $n^{(i)}$. Let $\Gamma_{ij} := \Gamma_i \cap \Gamma_j$ ($= \partial\Omega_i \cap \partial\Omega_j$) and let $\Gamma_{\mathrm{int}}$ denote the set of all those intersections $\Gamma_{ij}$ with $\operatorname{meas}_{n-1}(\Gamma_{ij}) > 0$ (in 2D with triangles: intersections by sides are taken into account but intersections by vertices are not). Similarly, $\Gamma_{i0} := \Gamma_i \cap \partial\Omega$ and $\Gamma_b$ the set of all $\Gamma_{i0}$ with $\operatorname{meas}_{n-1}(\Gamma_{i0}) > 0$. For $\Gamma_{ij} \in \Gamma_{\mathrm{int}}$ let $n^{(ij)}$ be the unit normal pointing outward from $\Omega_i$ (thus $n^{(ij)} = -n^{(ji)}$). Finally, for $u \in V_1$ let $[u]_{ij} := \bigl(u|_{\bar\Omega_i} - u|_{\bar\Omega_j}\bigr)\big|_{\Gamma_{ij}}$ denote the jump of $u$ across $\Gamma_{ij}$.
For the last term in this expression we have
$$\sum_{i=1}^N \int_{\Gamma_i} u(x)\varphi(x)\,n_k^{(i)}\,ds = \sum_{\Gamma_{ij} \in \Gamma_{\mathrm{int}}} \int_{\Gamma_{ij}} [u]_{ij}(x)\,\varphi(x)\,n_k^{(ij)}\,ds + \sum_{\Gamma_{i0} \in \Gamma_b} \int_{\Gamma_{i0}} u(x)\varphi(x)\,n_k^{(i)}\,ds =: R_{\mathrm{int}} + R_b,$$
and $R_b = 0$ since $\varphi$ vanishes in a neighbourhood of $\partial\Omega$. If $u$ has a weak derivative $v = \frac{\partial u}{\partial x_k}$, i.e.
$$\int_\Omega u(x)\,\frac{\partial \varphi(x)}{\partial x_k}\,dx = -\int_\Omega v(x)\varphi(x)\,dx, \quad \varphi \in C_0^\infty(\Omega),$$
it follows that $R_{\mathrm{int}} = 0$ must hold for all $\varphi \in C_0^\infty(\Omega)$. This implies that the jump of $u$ across $\Gamma_{ij}$ is zero and thus $u \in C(\bar\Omega)$ holds. Conversely, if $u \in C(\bar\Omega)$ then $R_{\mathrm{int}} = 0$ and from the relation (2.24) it follows that the weak derivative $\frac{\partial u}{\partial x_k}$ exists. Since $k$ is arbitrary we conclude $u \in H^1(\Omega)$. This completes the proof for the case $m = 1$.
For $m > 1$ we use an induction argument. Assume that the statement holds for $m$. We consider $m + 1$. Take $u \in V_{m+1}$ and assume that $u \in H^{m+1}(\Omega)$ holds. Take an arbitrary multi-index $\beta$ with $|\beta| \le m - 1$. Classical derivatives will be denoted by $\bar D^\beta$ and weak ones by $D^\beta$ (with a multi-index $\gamma$ as $D^\gamma$). From the induction hypothesis we obtain $w := D^\beta u \in C(\bar\Omega)$. From $u \in H^{m+1}(\Omega)$ it follows that $D^\gamma w \in H^1(\Omega)$ for $|\gamma| \le 1$. Furthermore, for these $\gamma$ values we also have, due to $u \in V_{m+1}$, that $D^\gamma w = \bar D^\gamma w \in C^1(\bar\Omega_i)$ for $i = 1, \ldots, N$. From the result for $m = 1$ it now follows that $D^\gamma w \in C(\bar\Omega)$ for $|\gamma| \le 1$. Hence, $D^\gamma w$ is continuous across the internal interfaces $\Gamma_{ij}$ and thus $D^\gamma w \in C(\bar\Omega)$ holds. We conclude that $D^\gamma D^\beta u \in C(\bar\Omega)$ for all $|\beta| \le m - 1$, $|\gamma| \le 1$, i.e., $u \in C^m(\bar\Omega)$. Conversely, if $u \in V_{m+1}$ and $u \in C^m(\bar\Omega)$ then $D^\alpha u \in C(\bar\Omega)$ for $|\alpha| \le m$. From the result for $m = 1$ it follows that $D^\alpha u \in H^1(\Omega)$ for all $|\alpha| \le m$ and thus $u \in H^{m+1}(\Omega)$ holds.
A first important question concerns the smoothness of functions from H^m(Ω) in the classical (i.e., C^k(Ω̄)) sense. For example, one can show that if Ω ⊂ ℝ then all functions from H^1(Ω) must be continuous on Ω̄. This, however, is not true in the two-dimensional case, as the following example shows:
Example 2.2.13 In this example we show that functions in H^1(Ω), with Ω ⊂ ℝ², are not necessarily continuous on Ω̄. With r := (x_1² + x_2²)^{1/2} let B(0, ε) := { (x_1, x_2) ∈ ℝ² | r < ε } for ε > 0. We take Ω = B(0, ½). Below we also use Ω_ε := Ω \ B(0, ε) with 0 < ε < ½. On Ω we define the function u by u(0, 0) := 0, u(x_1, x_2) := ln(ln(1/r)) otherwise. Using polar coordinates one obtains

    ∫_Ω u(x)² dx = lim_{ε↓0} ∫_{Ω_ε} u(x)² dx = 2π lim_{ε↓0} ∫_ε^{1/2} [ln(ln(1/r))]² r dr < ∞
so u ∈ L²(Ω) holds. Note that u ∈ C^∞(Ω̄ \ {0}). For the first derivatives we have

    ∫_Ω Σ_{i=1,2} (∂u(x)/∂x_i)² dx = 2π lim_{ε↓0} ∫_ε^{1/2} (1/(r²(ln r)²)) r dr = 2π/ln 2

It follows that the classical first derivatives ∂u/∂x_i exist a.e. on Ω and are elements of L²(Ω). Note, however, remark 2.2.3. For arbitrary φ ∈ C_0^∞(Ω) we have, using Green's formula on Ω_ε:

    ∫_{Ω_ε} u(x) ∂φ(x)/∂x_1 dx = ∫_{∂B(0,ε)} u(x)φ(x) n_{x_1} ds − ∫_{Ω_ε} ∂u(x)/∂x_1 φ(x) dx.

Note that

    lim_{ε↓0} | ∫_{∂B(0,ε)} u(x)φ(x) n_{x_1} ds | ≤ lim_{ε↓0} 2πε ‖φ‖_∞ |ln(ln(1/ε))| = 0.

So we have

    ∫_Ω u(x) ∂φ(x)/∂x_1 dx = −∫_Ω ∂u(x)/∂x_1 φ(x) dx.

We conclude that ∂u/∂x_1 is the generalized partial derivative with respect to the variable x_1. The same argument yields an analogous result for the derivative w.r.t. x_2. We conclude that u ∈ H^1(Ω). It is clear that u is not continuous on Ω̄.
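The two integrals in this example can also be checked numerically. The following sketch (radial trapezoidal quadrature; all function names are illustrative) truncates the integrals at r = ε and lets ε decrease: the H^1 quantities stay bounded while max |u| grows without bound.

```python
import numpy as np

# Example 2.2.13 checked numerically: u = ln(ln(1/r)) on the disk r < 1/2.
# In polar coordinates int u^2 dx = 2*pi*int u(r)^2 r dr and, since
# |grad u| = 1/(r*ln(1/r)), int |grad u|^2 dx = 2*pi*int r dr/(r*ln(1/r))^2.
def radial_integral(f, eps, n=200_000):
    r = np.logspace(np.log10(eps), np.log10(0.5), n)   # log-spaced grid
    y = f(r) * r
    return 2 * np.pi * np.sum((y[1:] + y[:-1]) / 2 * np.diff(r))

for eps in [1e-3, 1e-6, 1e-9]:
    int_u2 = radial_integral(lambda r: np.log(np.log(1/r))**2, eps)
    int_g2 = radial_integral(lambda r: 1/(r*np.log(1/r))**2, eps)
    u_max = abs(np.log(np.log(1/eps)))    # |u| on the circle r = eps
    print(f"eps={eps:.0e}: ||u||^2 ~ {int_u2:.4f}, ||grad u||^2 ~ {int_g2:.4f} "
          f"(limit 2*pi/ln 2 = {2*np.pi/np.log(2):.4f}), max|u| = {u_max:.2f}")
```

The gradient integral approaches its exact value 2π/ln 2 from below (slowly, like 1/ln(1/ε)), while max |u| = |ln(ln(1/ε))| diverges.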
We now formulate an important general result which relates smoothness in the Sobolev sense (weak derivatives) to classical smoothness properties.
For normed spaces X and Y a linear operator I : X → Y is called a continuous embedding if I is continuous and injective. Such a continuous embedding is denoted by X ↪ Y. Furthermore, usually for x ∈ X the corresponding element Ix ∈ Y is denoted by x, too (X ↪ Y is formally replaced by X ⊂ Y).
Theorem 2.2.14 If m − n/2 > k (recall: Ω ⊂ ℝⁿ) then there exist continuous embeddings

    H^m(Ω) ↪ C^k(Ω̄)     (2.25a)
    H_0^m(Ω) ↪ C^k(Ω̄).     (2.25b)
It is trivial that for m ≥ 0 there are continuous embeddings H^{m+1}(Ω) ↪ H^m(Ω) and H_0^{m+1}(Ω) ↪ H_0^m(Ω). In the next theorem a result on compactness of embeddings (cf. Appendix A.1) is formulated. We recall that if X, Y are Banach spaces then a continuous embedding X ↪ Y is compact if and only if each bounded sequence in X has a subsequence which is convergent in Y.
Theorem 2.2.15 The continuous embeddings

    H^{m+1}(Ω) ↪ H^m(Ω),  m ≥ 0,     (2.26a)
    H_0^{m+1}(Ω) ↪ H_0^m(Ω),  m ≥ 0,     (2.26b)
    H^m(Ω) ↪ C^k(Ω̄),  m − n/2 > k,     (2.26c)
    H_0^m(Ω) ↪ C^k(Ω̄),  m − n/2 > k,     (2.26d)

are compact.
Proof. We sketch the idea of the proof. In [1] it is shown that the embeddings

    H^1(Ω) ↪ L²(Ω),  H_0^1(Ω) ↪ L²(Ω)

are compact. This proves the results in (2.26a) and (2.26b) for m = 0. The results in (2.26a) and (2.26b) for m ≥ 1 can easily be derived from this as follows. Let (u_k)_{k≥1} be a bounded sequence in H^{m+1}(Ω) (m ≥ 1). Then (D^α u_k)_{k≥1} is a bounded sequence in H^1(Ω) for |α| ≤ m. Thus this sequence has a subsequence (D^α u_{k'})_{k'≥1} that converges in L²(Ω). Hence, the subsequence (u_{k'})_{k'≥1} converges in H^m(Ω). This proves the compactness of the embedding H^{m+1}(Ω) ↪ H^m(Ω). The result in (2.26b) for m ≥ 1 can be shown in the same way.
With a similar shift argument one can easily show that it suffices to prove the results in (2.26c) and (2.26d) for the case k = 0. The analysis for the case k = 0 is based on the following general result (which is easy to prove). If X, Y, Z are normed spaces with continuous embeddings I_1 : X → Y, I_2 : Y → Z and if at least one of these embeddings is compact then the continuous embedding I_2 ∘ I_1 : X → Z is compact. For m − n/2 > 0 there exist λ, μ ∈ (0, 1) with 0 < μ < λ < m − n/2. The following continuous embeddings exist:

    H^m(Ω) ↪ C^{0,λ}(Ω̄) ↪ C^{0,μ}(Ω̄) ↪ C(Ω̄).

In this sequence only the first embedding is nontrivial. This one is proved in [1], theorem 5.4. Furthermore, from [1] theorem 1.31 it follows that the second embedding is compact. We conclude that for m − n/2 > 0 the embedding H^m(Ω) ↪ C(Ω̄) is compact. This then yields the result in (2.26c) for k = 0. The same line of reasoning can be used to show that (2.26d) holds.
The result in the following theorem is a basic inequality that will be used frequently.
Since u(−a, x_2, …, x_n) = 0 we obtain, using the Cauchy–Schwarz inequality,

    u(x)² = ( ∫_{−a}^{x_1} 1 · ∂u(t, x_2, …, x_n)/∂t dt )²
          ≤ ∫_{−a}^{x_1} 1² dt ∫_{−a}^{x_1} ( ∂u(t, x_2, …, x_n)/∂t )² dt
          ≤ 2a ∫_{−a}^{a} ( ∂u(t, x_2, …, x_n)/∂t )² dt   for x ∈ E.

Note that the latter term does not depend on x_1. Integration with respect to the variable x_1 results in

    ∫_{−a}^{a} u(x_1, …, x_n)² dx_1 ≤ 4a² ∫_{−a}^{a} ( D^{(1,0,…,0)} u(x) )² dx_1   for x ∈ E

and integration with respect to the other variables gives

    ∫_E u(x)² dx ≤ 4a² ∫_E ( D^{(1,0,…,0)} u(x) )² dx ≤ 4a² Σ_{|α|=1} ‖D^α u‖²_{L²}.

Thus ‖u‖²_{L²} ≤ C |u|²_1 holds. For m ≥ 2 the same reasoning is applicable.
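The bound ‖u‖²_{L²} ≤ 4a² ‖u′‖²_{L²} can be checked numerically in the one-dimensional case; the sketch below (the choice of test functions is illustrative only) samples random smooth functions vanishing at the endpoints.

```python
import numpy as np

# numerical check of the Poincare-Friedrichs-type bound
# ||u||^2 <= 4a^2 ||u'||^2 on E = (-a, a) for u vanishing at x = -a, a
def trapz(y, x):
    return np.sum((y[1:] + y[:-1]) / 2 * np.diff(x))

a = 1.5
x = np.linspace(-a, a, 4001)
rng = np.random.default_rng(0)
worst = 0.0
for _ in range(50):
    # random finite sine sum: every term vanishes at both endpoints
    c = rng.standard_normal(5)
    u = sum(ck * np.sin((k + 1) * np.pi * (x + a) / (2 * a))
            for k, ck in enumerate(c))
    du = np.gradient(u, x)
    worst = max(worst, trapz(u**2, x) / trapz(du**2, x))

print(f"largest observed ||u||^2/||u'||^2 = {worst:.3f}, bound 4a^2 = {4*a**2}")
```

The observed ratios stay below (2a/π)², well within the (non-sharp) constant 4a² from the proof.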
In the weak formulation of the elliptic boundary value problems one must treat boundary con
ditions. For this the next result will be needed.
Theorem 2.2.18 (Trace operator) There exists a unique bounded linear operator γ : H^1(Ω) → L²(∂Ω) with the property that for all u ∈ C^1(Ω̄) the equality γ(u) = u|_{∂Ω} holds.
Proof. We define γ̃ : C^1(Ω̄) → L²(∂Ω) by γ̃(u) = u|_{∂Ω} and will show that

    ‖γ̃(u)‖_{L²(∂Ω)} ≤ c ‖u‖_1  for all u ∈ C^1(Ω̄)     (2.29)

holds. The desired result then follows from the extension theorem A.2.3. We give a proof of (2.29) for the two-dimensional case. The general case can be treated in the same way. In a neighbourhood of x ∈ ∂Ω we take a local coordinate system (ξ, η) such that locally the boundary can be represented as the graph η = φ(ξ).
In this last expression only the first term on the right-hand side depends on t. Integration over t ∈ [φ(ξ) − b, φ(ξ)] results in

    b u(ξ, φ(ξ))² ≤ 2 ∫_{φ(ξ)−b}^{φ(ξ)} u(ξ, t)² dt + 2b² ∫_{φ(ξ)−b}^{φ(ξ)} ( ∂u(ξ, η)/∂η )² dη
                 = 2 ∫_{φ(ξ)−b}^{φ(ξ)} ( u(ξ, η)² + b² ( ∂u(ξ, η)/∂η )² ) dη

If φ ∈ C^1([−a, a]) then for the local arc length variable s on Γ_loc we have

    ds = √(1 + φ′(ξ)²) dξ
If φ is only Lipschitz continuous on [−a, a] then φ′ exists almost everywhere on [−a, a] and |φ′(ξ)| is bounded (Rademacher's theorem). Hence, the same argument can be applied. Finally note that ∂Ω can be covered by a finite number of local parts Γ_loc. Addition of the local inequalities in (2.30) then yields the result in (2.29).
The operator γ defined in theorem 2.2.18 is called the trace operator. For u ∈ H^1(Ω) the function γ(u) ∈ L²(∂Ω) represents the boundary values of u and is called the trace of u. For γ(u) one often uses the notation u|_{∂Ω}. For example, for u ∈ H^1(Ω), the identity u|_{∂Ω} = 0 means that γ(u) = 0 in the L²(∂Ω) sense.
The space range(γ) can be shown to be dense in L²(∂Ω) but is strictly smaller than L²(∂Ω). For a characterization of this subspace one can use a Sobolev space with a broken index:

    H^{1/2}(∂Ω) := range(γ) = { v ∈ L²(∂Ω) | ∃u ∈ H^1(Ω) : v = γ(u) }     (2.31)

The space H^{1/2}(∂Ω) is a Hilbert space which has similar properties as the usual Sobolev spaces. We do not discuss this topic here, since we will not need this space in the remainder.
Using the trace operator one can give another natural characterization of the space H_0^1(Ω):

    H_0^1(Ω) = { u ∈ H^1(Ω) | u|_{∂Ω} = 0 }

holds.
Proof. We only prove "⊂". For a proof of "⊃" we refer to [47] theorem 6.2.42 or [1] remark 7.54. First note that { u ∈ H^1(Ω) | u|_{∂Ω} = 0 } = ker(γ). Furthermore, C_0^∞(Ω) ⊂ ker(γ) and the trace operator γ : H^1(Ω) → L²(∂Ω) is continuous. From the latter it follows that ker(γ) is closed in H^1(Ω). This yields:

    H_0^1(Ω) = clos_{‖·‖_1}( C_0^∞(Ω) ) ⊂ clos_{‖·‖_1}( ker(γ) ) = ker(γ)
Finally, we collect a few results on Green's formulas that hold in Sobolev spaces. For notational simplicity the function arguments x are deleted in the integrals, and in boundary integrals like, for example, ∫_{∂Ω} γ(u)γ(v) ds we delete the trace operator γ.
Theorem 2.2.20 The following identities hold, with n = (n_1, …, n_n) the outward unit normal on ∂Ω and H^m := H^m(Ω):

    ∫_Ω (∂u/∂x_i) v dx = −∫_Ω u (∂v/∂x_i) dx + ∫_{∂Ω} u v n_i ds,  u, v ∈ H^1, 1 ≤ i ≤ n,     (2.32a)
    ∫_Ω Δu v dx = −∫_Ω ∇u · ∇v dx + ∫_{∂Ω} (∇u · n) v ds,  u ∈ H^2, v ∈ H^1,     (2.32b)
    ∫_Ω u div v dx = −∫_Ω ∇u · v dx + ∫_{∂Ω} u v · n ds,  u ∈ H^1, v ∈ (H^1)^n.     (2.32c)

Proof. These results immediately follow from the corresponding formulas in C^∞(Ω̄), the continuity of the trace operator and a density argument based on theorem 2.2.9.
    H^{−m}(Ω) := H_0^m(Ω)′     (2.33)

with the norm

    ‖φ‖_{−m} := sup_{v ∈ H_0^m(Ω)} φ(v)/‖v‖_m,  φ ∈ H^{−m}(Ω).
In section 2.1 we already gave an example of a variational problem. In the previous section we introduced Hilbert spaces that will be used for the variational formulation of elliptic boundary value problems in the sections 2.5 and 2.6. In this section we present some general existence and uniqueness results for variational problems. These results will play a key role in the analysis of well-posedness of the weak formulations of elliptic boundary value problems. They will also be used in the discretization error analysis for the finite element method in chapter 3.
A remark on notation: For elements from a Hilbert space we use boldface notation (e.g., u),
elements from the dual space (i.e., bounded linear functionals) are denoted by f, g, etc., and for
linear operators between spaces we use capitals (e.g., L).
For a continuous bilinear form k : H_1 × H_2 → ℝ we define its norm by ‖k‖ := sup{ |k(x, y)| : ‖x‖_{H_1} = 1, ‖y‖_{H_2} = 1 }. A fundamental result is given in the following theorem:
Theorem 2.3.1 Let H_1, H_2 be Hilbert spaces and k : H_1 × H_2 → ℝ be a continuous bilinear form. For f ∈ H_2′ consider the variational problem:

    find u ∈ H_1 such that k(u, v) = f(v) for all v ∈ H_2.     (2.35)

The following two statements are equivalent:
1. For arbitrary f ∈ H_2′ the problem (2.35) has a unique solution u ∈ H_1 and ‖u‖_{H_1} ≤ c‖f‖_{H_2′} holds with a constant c independent of f.
2. The conditions (2.36) and (2.37) are satisfied:

    ∃ε > 0 :  sup_{v ∈ H_2} k(u, v)/‖v‖_{H_2} ≥ ε‖u‖_{H_1}  for all u ∈ H_1,     (2.36)
    ∀v ∈ H_2, v ≠ 0, ∃u ∈ H_1 :  k(u, v) ≠ 0.     (2.37)
Remark 2.3.2 The condition (2.37) can also be formulated as follows:

    v ∈ H_2 such that k(u, v) = 0 for all u ∈ H_1  ⇒  v = 0.

The condition (2.36) can be rewritten as

    inf_{u ∈ H_1, u ≠ 0} sup_{v ∈ H_2} k(u, v)/(‖u‖_{H_1} ‖v‖_{H_2}) ≥ ε > 0     (2.41)

and is often called the inf-sup condition. In the finite-dimensional case with dim(H_1) = dim(H_2) < ∞ this condition implies the result in (2.37) and thus is necessary and sufficient for existence and uniqueness, as can be seen from the following. Let L : H_1 → H_2 be as in (2.38). If dim(H_1) = dim(H_2) < ∞ we have

    L is bijective  ⇔  L is injective  ⇔  inf_{u≠0} ‖Lu‖_{H_2}/‖u‖_{H_1} > 0  ⇔  inf_{u≠0} sup_{v} k(u, v)/(‖u‖_{H_1} ‖v‖_{H_2}) > 0.     (2.42)

The latter condition seems to be weaker than the inf-sup condition in (2.41), since there ε > 0 is required. However, in the finite-dimensional case it is easy to show, using a compactness argument, that these two conditions are equivalent. In infinite-dimensional Hilbert spaces the inf-sup condition (2.41) is in general really stronger than the one in (2.42).
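In the finite-dimensional case the inf-sup constant is just the smallest singular value of the matrix representing the bilinear form, which can be checked numerically (an illustrative sketch; the matrix is randomly generated):

```python
import numpy as np

# Finite-dimensional inf-sup condition: for k(u, v) = v^T K u on R^n x R^n
# we have sup_v k(u, v)/||v|| = ||K u||, so the inf-sup constant
# inf_{u != 0} sup_v k(u, v)/(||u|| ||v||) equals the smallest singular
# value of K, and it is positive iff K is invertible.
rng = np.random.default_rng(1)
n = 4
K = rng.standard_normal((n, n))
inf_sup = np.linalg.svd(K, compute_uv=False).min()

# brute-force check: for each u the supremum over v is attained at v = K u
vals = []
for _ in range(2000):
    u = rng.standard_normal(n)
    vals.append(np.linalg.norm(K @ u) / np.linalg.norm(u))

print(f"inf-sup constant = {inf_sup:.4f}, sampled minimum = {min(vals):.4f}")
```

The sampled values can never fall below the smallest singular value, illustrating the equivalence in (2.42).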
As we saw in the analysis above, it is natural to identify the bounded bilinear form k : H_1 × H_2 → ℝ with a bounded linear operator L : H_1 → H_2′ via (Lu)(v) = k(u, v). The result in theorem 2.3.1 is a reformulation of the following result that can be found in functional analysis textbooks. Let L : H_1 → H_2′ be a bounded linear operator. Then L is an isomorphism if and only if the following two conditions hold:
Furthermore, if (2.36) is satisfied, then ‖w‖_{H_2} ≥ ε‖u‖_{H_1}, with ε > 0 as in (2.36), holds.
Proof. Take u ∈ H_1. Then g : v ↦ k(u, v) defines a continuous linear functional on H_2. From the Riesz representation theorem it follows that there exists a unique w ∈ H_2 such that ⟨w, v⟩_{H_2} = g(v) = k(u, v) for all v ∈ H_2, and ‖g‖_{H_2′} = ‖w‖_{H_2}. Using (2.36) we obtain

    ‖w‖_{H_2} = ‖g‖_{H_2′} = sup_{v ∈ H_2} g(v)/‖v‖_{H_2} = sup_{v ∈ H_2} k(u, v)/‖v‖_{H_2} ≥ ε‖u‖_{H_1}
Definition 2.3.4 Let H be a Hilbert space. A bilinear form k : H × H → ℝ is called H-elliptic if there exists a constant γ > 0 such that k(u, u) ≥ γ‖u‖²_H for all u ∈ H.
This theorem will play an important role in the analysis of well-posedness of the weak formulation of scalar elliptic problems in section 2.5.
In the remainder of this section we analyze well-posedness for a variational problem which has a special saddle point structure. This structure is such that the analysis applies to the Stokes problem. This application is treated in section 2.6.
Let V and M be Hilbert spaces and let

    a : V × V → ℝ,  b : V × M → ℝ

be continuous bilinear forms.
Below we will analyze the well-posedness of the problem (2.45) and thus of (2.43). In particular, we will derive conditions on the bilinear forms a(·,·) and b(·,·) that are necessary and sufficient for existence and uniqueness of a solution. The main result is presented in theorem 2.3.10.
We start with a few useful results. We need the following null space:

    V_0 := { φ ∈ V | b(φ, λ) = 0 for all λ ∈ M }.
corresponding dual spaces are denoted by V_0′ and (V_0^⊥)′.

    ∃β > 0 :  sup_{φ ∈ V} b(φ, λ)/‖φ‖_V ≥ β‖λ‖_M  for all λ ∈ M.     (2.47)
Lemma 2.3.7 Assume that (2.47) holds. For g ∈ (V_0^⊥)′ the variational problem

    find λ ∈ M such that b(φ, λ) = g(φ) for all φ ∈ V_0^⊥     (2.49)

has a unique solution λ, and ‖λ‖_M ≤ β^{−1}‖g‖_{(V_0^⊥)′} holds.
Proof. We apply theorem 2.3.1 with H_1 = M, H_2 = V_0^⊥ and k(λ, φ) = b(φ, λ) (note the interchange of arguments). From (2.48) it follows that condition (2.36) is fulfilled. Take φ ∈ V_0^⊥, φ ≠ 0. Then φ ∉ V_0 and thus there exists λ ∈ M with b(φ, λ) ≠ 0. Hence, the second condition (2.37) is also satisfied. Application of theorem 2.3.1 yields that the problem (2.49) has a unique solution λ ∈ M and ‖λ‖_M ≤ β^{−1}‖g‖_{(V_0^⊥)′} holds.
Note that, opposite to (2.36), in (2.47) we take the supremum over the first argument of the bilinear form. In the following lemma we formulate a result in which the supremum is taken over the second argument:

Lemma 2.3.8 The condition (2.47) (or (2.48)) implies:

    sup_{λ ∈ M} b(φ, λ)/‖λ‖_M ≥ β‖φ‖_V  for all φ ∈ V_0^⊥.     (2.50)

In general, (2.50) does not imply (2.47). Take, for example, the bilinear form that is identically zero, i.e., b(φ, λ) = 0 for all φ ∈ V, λ ∈ M. Then (2.47) does not hold. However, since V_0 = V and thus V_0^⊥ = {0} it follows that (2.50) does hold.
Application of lemma 2.3.3 in combination with the infsup properties in (2.47) and (2.50)
yields the following corollary.
Corollary 2.3.9 Assume that (2.47) holds. For every λ ∈ M there exists a unique ẑ ∈ V_0^⊥ such that

    ⟨ẑ, φ⟩_V = b(φ, λ)  for all φ ∈ V_0^⊥,

and ‖ẑ‖_V ≥ β‖λ‖_M holds. For every φ ∈ V_0^⊥ there exists a unique μ̂ ∈ M such that

    b(ψ, μ̂) = ⟨φ, ψ⟩_V  for all ψ ∈ V_0^⊥,

and ‖μ̂‖_M ≤ β^{−1}‖φ‖_V holds.
In the following main theorem we present necessary and sufficient conditions on the bilinear forms a(·,·) and b(·,·) such that the saddle point problem (2.43) has a unique solution which depends continuously on the data.
1. For arbitrary f ∈ H′ the problem (2.51) has a unique solution u ∈ H and ‖u‖_H ≤ c‖f‖_{H′} holds with a constant c independent of f.
2. The inf-sup condition (2.47) holds and the conditions (2.52a), (2.52b) are satisfied:

    ∃γ > 0 :  sup_{ψ ∈ V_0} a(φ, ψ)/‖ψ‖_V ≥ γ‖φ‖_V  for all φ ∈ V_0,     (2.52a)
    ∀ψ ∈ V_0, ψ ≠ 0, ∃φ ∈ V_0 :  a(φ, ψ) ≠ 0.     (2.52b)

Moreover, if the second statement holds, then for c in the first statement one can take c = (β + 2‖a‖)² β^{−2} γ^{−1}.
We now prove that the statements 1 and 2 are equivalent. We recall the condition (2.36), which here takes the form

    sup_{(ψ,μ) ∈ H} ( a(φ, ψ) + b(ψ, λ) + b(φ, μ) ) / ( ‖ψ‖²_V + ‖μ‖²_M )^{1/2} ≥ ε ( ‖φ‖²_V + ‖λ‖²_M )^{1/2}  for all (φ, λ) ∈ H.     (2.53)

The proof consists of 5 steps: a) (2.53) ⇒ (2.47), b) {(2.53), (2.47)} ⇒ (2.52a), c) {(2.47), (2.37)} ⇒ (2.52b), d) {(2.47), (2.52a)} ⇒ (2.53), e) {(2.52a), (2.52b)} ⇒ (2.37).
a). If in (2.53) we take φ = 0 we obtain

    sup_{ψ ∈ V} b(ψ, λ)/‖ψ‖_V = sup_{(ψ,μ) ∈ H} b(ψ, λ)/( ‖ψ‖²_V + ‖μ‖²_M )^{1/2} ≥ ε‖λ‖_M  for all λ ∈ M
b). Take φ_0 ∈ V_0 and λ ∈ M. Decomposing the test function ψ = ψ_0 + ψ^⊥ (ψ_0 ∈ V_0, ψ^⊥ ∈ V_0^⊥), (2.53) yields

    sup_{(ψ,μ) ∈ H} ( a(φ_0, ψ_0) + a(φ_0, ψ^⊥) + b(ψ^⊥, λ) ) / ( ‖ψ‖²_V + ‖μ‖²_M )^{1/2} ≥ ε ( ‖φ_0‖²_V + ‖λ‖²_M )^{1/2}.

Choosing λ (using lemma 2.3.7) such that b(ψ^⊥, λ) = −a(φ_0, ψ^⊥) for all ψ^⊥ ∈ V_0^⊥ we obtain

    sup_{ψ_0 ∈ V_0} a(φ_0, ψ_0)/‖ψ_0‖_V = sup_{(ψ_0,μ) ∈ H} a(φ_0, ψ_0)/( ‖ψ_0‖²_V + ‖μ‖²_M )^{1/2} ≥ ε ( ‖φ_0‖²_V + ‖λ‖²_M )^{1/2} ≥ ε‖φ_0‖_V,  φ_0 ∈ V_0,

which is (2.52a).
From assumption (2.52a) it follows that there exist γ > 0 and ψ_0 ∈ V_0 with ‖ψ_0‖_V = 1 such that a(φ_0, ψ_0) ≥ γ‖φ_0‖_V holds. We now introduce

    ψ := η_1 ψ_0 + ẑ/‖ẑ‖_V,  η_1 ∈ ℝ,
    μ := η_2 μ̂/‖μ̂‖_M,  η_2 ∈ ℝ,

with ẑ ∈ V_0^⊥ and μ̂ ∈ M as in corollary 2.3.9 (for the given λ and φ^⊥, respectively).
Note that ‖ψ‖²_V + ‖μ‖²_M = η_1² + 1 + η_2². We obtain:

    sup_{v ∈ H} k(u, v)/‖v‖_H ≥ sup_{η_1,η_2} ( a(φ, ψ) + b(ψ, λ) + b(φ, μ) ) / ( 1 + η_1² + η_2² )^{1/2}
      = sup_{η_1,η_2} ( a(φ_0, ψ) + a(φ^⊥, ψ) + b(ẑ, λ)/‖ẑ‖_V + b(φ^⊥, μ) ) / ( 1 + η_1² + η_2² )^{1/2}
      = sup_{η_1,η_2} ( η_1 a(φ_0, ψ_0) + a(φ_0, ẑ)/‖ẑ‖_V + a(φ^⊥, ψ) + η_2 ‖φ^⊥‖²_V/‖μ̂‖_M + ‖ẑ‖_V ) / ( 1 + η_1² + η_2² )^{1/2}
      ≥ sup_{η_1,η_2} ( (η_1 γ − ‖a‖)‖φ_0‖_V + (η_2 β − ‖a‖(η_1 + 1))‖φ^⊥‖_V + β‖λ‖_M ) / ( 1 + η_1² + η_2² )^{1/2}

We take η_1, η_2 such that η_1 γ − ‖a‖ = β and η_2 β − ‖a‖(η_1 + 1) = β. This results in

    (η_1, η_2) = ( (‖a‖ + β)/γ , 1 + ‖a‖(η_1 + 1)/β ).

Note that γ ≤ ‖a‖, η_1 ≥ 1, η_2 ≥ 1, and thus (1 + η_1² + η_2²)^{1/2} ≤ 1 + η_1 + η_2 ≤ (β + 2‖a‖)²/(βγ). We conclude that

    sup_{v ∈ H} k(u, v)/‖v‖_H ≥ ( β²γ/(β + 2‖a‖)² ) ( ‖φ_0‖_V + ‖φ^⊥‖_V + ‖λ‖_M ) ≥ ( β²γ/(β + 2‖a‖)² ) ‖u‖_H     (2.56)

holds. Using a continuity argument the same result holds if λ = 0 or φ^⊥ = 0. Hence condition (2.53) holds with ε = β²γ/(β + 2‖a‖)².
e). Take ψ_0 ∈ V_0, ψ_0 ≠ 0. From (2.52) and theorem 2.3.1 with H_1 = H_2 = V_0, k(u, v) = a(u, v) it follows that there exists a unique φ_0 ∈ V_0 such that a(φ_0, ψ) = ⟨ψ_0, ψ⟩_V for all ψ ∈ V_0 and ‖φ_0‖_V ≤ γ^{−1}‖ψ_0‖_{V_0′} = γ^{−1}‖ψ_0‖_V. If we take ψ = ψ_0 we obtain

    a(φ_0, ψ_0) = ‖ψ_0‖²_V ≥ γ‖φ_0‖_V ‖ψ_0‖_V.

Using this one can prove with exactly the same arguments as in part d) that

    sup_{v ∈ H} k̃(u, v)/‖v‖_H ≥ ( β²γ/(β + 2‖a‖)² ) ‖u‖_H  for all u ∈ H

holds, where k̃(u, v) := k(v, u). Thus for every u ∈ H, u ≠ 0 there exists v ∈ H such that k̃(u, v) ≠ 0 and thus k(v, u) ≠ 0, too. This shows that (2.37) holds and completes the proof of a)–e). The final statement in the theorem follows from the final result in theorem 2.3.1 and the choice of ε in part d).
Remark 2.3.11 The final result in theorem 2.3.10 predicts that if we scale such that ‖a‖ = 1 then the stability constant c = (β + 2‖a‖)² β^{−2} γ^{−1} is large when the values of the constants β and γ are much smaller than one. We now give an example with ‖a‖ = 1 in which the stability deteriorates like β^{−2}γ^{−1} for β ↓ 0, γ ↓ 0. This shows that the behaviour c ~ β^{−2}γ^{−1} for the stability constant is sharp.
Take V = ℝ² with the Euclidean norm ‖·‖_2, M = ℝ and let e_1 = (1 0)^T, e_2 = (0 1)^T be the standard basis vectors in ℝ². For fixed β > 0, γ > 0 we introduce the bilinear forms

    b(φ, λ) = β e_1^T φ λ,  φ ∈ ℝ², λ ∈ ℝ,
    a(φ, ψ) = φ^T A ψ,  A := [ 0  1 ; −1  γ ],  φ, ψ ∈ ℝ².

One checks that V_0 = span(e_2) and

    sup_{φ ∈ V} b(φ, λ)/‖φ‖_2 = β|λ|  for all λ,
    sup_{ψ ∈ V_0} a(φ, ψ)/‖ψ‖_2 = γ‖φ‖_2  for all φ ∈ V_0.
With u = (φ, λ), v = (ψ, μ) ∈ ℝ³ we have

    k(u, v) = a(φ, ψ) + b(ψ, λ) + b(φ, μ) = u^T C v,  C := [ 0  1  β ; −1  γ  0 ; β  0  0 ].
For f = (0, 0, 1)^T the solution of Cu = f is

    u = ( 1/β , 1/(βγ) , −1/(β²γ) )^T,

and hence, for β, γ ≤ 1,

    β^{−2}γ^{−1} ‖f‖_H ≤ ‖u‖_H ≤ 2 β^{−2}γ^{−1} ‖f‖_H.
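This growth of the stability constant can be observed numerically. The sketch below (names illustrative) assembles the 3×3 matrix C from this example and computes sup_f ‖u‖/‖f‖ = ‖C^{−1}‖_2 = 1/σ_min(C):

```python
import numpy as np

def stability_constant(beta, gamma):
    # k(u, v) = u^T C v for the example above: a(phi, psi) = phi^T A psi
    # with A = [[0, 1], [-1, gamma]] and b(phi, lam) = beta*phi[0]*lam
    C = np.array([[0.0, 1.0, beta],
                  [-1.0, gamma, 0.0],
                  [beta, 0.0, 0.0]])
    # sup over f of ||u||/||f|| for Cu = f is 1/sigma_min(C)
    return 1.0 / np.linalg.svd(C, compute_uv=False).min()

for beta, gamma in [(0.1, 0.1), (0.01, 0.1), (0.01, 0.01)]:
    c = stability_constant(beta, gamma)
    print(f"beta={beta:5}, gamma={gamma:5}: c = {c:12.1f}, "
          f"beta^-2 * gamma^-1 = {1/(beta**2 * gamma):12.1f}")
```

The computed constants track β^{−2}γ^{−1} up to a moderate factor, confirming the sharpness claimed above.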
Important sufficient conditions for well-posedness of the problem (2.43) are formulated in the following corollary.
Corollary 2.3.12 For arbitrary f_1 ∈ V′, f_2 ∈ M′ consider the variational problem (2.43): find (φ, λ) ∈ V × M such that

    a(φ, ψ) + b(ψ, λ) = f_1(ψ)  for all ψ ∈ V,
    b(φ, μ) = f_2(μ)  for all μ ∈ M.     (2.58)

Assume that the bilinear forms a(·,·) and b(·,·) are continuous and satisfy the following two conditions:

    ∃β > 0 :  sup_{φ ∈ V} b(φ, λ)/‖φ‖_V ≥ β‖λ‖_M  for all λ ∈ M  (inf-sup condition),     (2.59a)
    ∃α > 0 :  a(φ, φ) ≥ α‖φ‖²_V  for all φ ∈ V  (V-ellipticity).     (2.59b)

Then the conditions (2.47) and (2.52) (with γ = α) are satisfied and the problem (2.58) has a unique solution (φ, λ). Moreover, the stability bound ‖(φ, λ)‖_H ≤ (β + 2‖a‖)² β^{−2} α^{−1} ‖(f_1, f_2)‖_H holds.
In the variational problems treated in the previous section we did not assume symmetry of the
bilinear forms. In this section we introduce certain symmetry properties and show that in that
case equivalent alternative problem formulations can be derived.
Proof. From the Lax–Milgram theorem it follows that the variational problem (2.60) has a unique solution u ∈ H. For arbitrary z ∈ H, z ≠ 0 we have, with ellipticity constant γ > 0:

    J(u + z) = ½ k(u + z, u + z) − f(u + z)
             = ½ k(u, u) − f(u) + k(u, z) − f(z) + ½ k(z, z)
             = J(u) + ½ k(z, z) ≥ J(u) + ½ γ ‖z‖²_H > J(u).

This proves the desired result.
We now reconsider the variational problem (2.43) and the result formulated in corollary 2.3.12. Assume that the bilinear forms a(·,·) and b(·,·) are continuous and satisfy the conditions (2.47) and (2.52). In addition we assume that a(·,·) is symmetric. Define the functional L : V × M → ℝ by

    L(φ, μ) := ½ a(φ, φ) + b(φ, μ) − f_1(φ) − f_2(μ)

Then the unique solution (φ, λ) of (2.62) is also the unique element in V × M for which

    L(φ, μ) ≤ L(φ, λ) ≤ L(ψ, λ)  for all ψ ∈ V, μ ∈ M     (2.63)

holds.
Proof. From theorem 2.3.10 it follows that the problem (2.62) has a unique solution. Take a fixed element (φ, λ) ∈ V × M. We will prove:

    L(φ, λ) ≥ L(φ, μ) ∀μ ∈ M  ⇔  b(φ, μ) = f_2(μ) ∀μ ∈ M
    L(φ, λ) ≤ L(ψ, λ) ∀ψ ∈ V  ⇔  a(φ, ψ) + b(ψ, λ) = f_1(ψ) ∀ψ ∈ V.     (2.64)

From this it follows that (φ, λ) satisfies (2.63) if and only if (φ, λ) is a solution of (2.62). This then proves the statement of the theorem. We now prove (2.64). Note that

    L(φ, λ) ≥ L(φ, μ) ∀μ ∈ M
     ⇔ b(φ, λ) − f_2(λ) ≥ b(φ, μ) − f_2(μ) ∀μ ∈ M
     ⇔ b(φ, μ) − f_2(μ) ≥ 0 ∀μ ∈ M
     ⇔ b(φ, μ) = f_2(μ) ∀μ ∈ M.

From this the first result in (2.64) follows. For the second result we first note (substituting ψ → φ + ψ and using the symmetry of a(·,·)):

    L(φ, λ) ≤ L(ψ, λ) ∀ψ ∈ V
     ⇔ ½ a(φ, φ) + b(φ, λ) − f_1(φ) ≤ ½ a(ψ, ψ) + b(ψ, λ) − f_1(ψ) ∀ψ ∈ V
     ⇔ ½ a(ψ, ψ) + a(φ, ψ) + b(ψ, λ) − f_1(ψ) ≥ 0 ∀ψ ∈ V.
Now note that a(ψ, ψ) is a quadratic term and a(φ, ψ) + b(ψ, λ) − f_1(ψ) is linear in ψ. A scaling argument now yields

    L(φ, λ) ≤ L(ψ, λ) ∀ψ ∈ V
     ⇔ 0 ≤ a(φ, ψ) + b(ψ, λ) − f_1(ψ) ∀ψ ∈ V
     ⇔ 0 = a(φ, ψ) + b(ψ, λ) − f_1(ψ) ∀ψ ∈ V,

which proves the second result in (2.64).
Note that if a(·,·) is symmetric then (2.52a) implies (2.52b). Due to the property (2.63) the problem (2.62) with a symmetric bilinear form a(·,·) is called a saddle-point problem.
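In finite dimensions this saddle-point structure is exactly the KKT system of an equality-constrained quadratic minimization problem. The following sketch (all matrices randomly generated, purely illustrative) verifies the saddle-point property (2.63) numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 2
# a(phi, psi) = phi^T A psi with A symmetric positive definite (V-elliptic),
# b(phi, mu) = mu^T (B phi) with B of full row rank (discrete inf-sup)
G = rng.standard_normal((n, n))
A = G @ G.T + n * np.eye(n)
B = rng.standard_normal((m, n))
f1 = rng.standard_normal(n)
f2 = rng.standard_normal(m)

# saddle point (KKT) system: A x + B^T lam = f1, B x = f2
K = np.block([[A, B.T], [B, np.zeros((m, m))]])
sol = np.linalg.solve(K, np.concatenate([f1, f2]))
x, lam = sol[:n], sol[n:]

def L(phi, mu):
    # Lagrangian L(phi, mu) = 1/2 a(phi, phi) + b(phi, mu) - f1(phi) - f2(mu)
    return 0.5 * phi @ A @ phi + mu @ (B @ phi) - f1 @ phi - f2 @ mu

# saddle point property (2.63): L(x, mu) <= L(x, lam) <= L(psi, lam)
for _ in range(200):
    psi = x + rng.standard_normal(n)
    mu = lam + rng.standard_normal(m)
    assert L(x, mu) <= L(x, lam) + 1e-9 <= L(psi, lam) + 1e-9
print("saddle point property verified; constraint residual:",
      np.linalg.norm(B @ x - f2))
```

Since A is positive definite and Bx = f2 holds exactly at the solution, perturbing x can only increase the Lagrangian and perturbing the multiplier cannot increase it.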
with a(x) > 0 for all x ∈ [0, 1]. Let V_1, k(·,·) and f(·) be as defined in section 2.1:
One easily checks that u ∈ V_1 solves (2.65) iff u is a solution of (2.66). Hence, if the problem (2.65) has a solution u ∈ C²([0, 1]) this must also be the unique (due to lemma 2.1.2) solution of (2.66). As in section 2.1 we now consider this problem with a discontinuous piecewise constant function a. Then the classical formulation (2.65) is not well-defined, whereas the variational problem does make sense. However, in section 2.1 it is shown that the problem (2.66) has no solution (the space V_1 is too small). Since in the bilinear form k(·,·) only first derivatives occur, the larger space V_2 := { v ∈ C¹([0, 1]) | v(0) = v(1) = 0 } seems to be more appropriate. This leads to the weaker variational formulation:

    find u ∈ V_2 such that k(u, v) = f(v) for all v ∈ V_2.     (2.67)
However, it is shown in section 2.1 that the problem (2.67) still has no solution. The key step is to take the completion of the space V_2 (or, equivalently, of V_1):

    H_0^1(0, 1) = clos_{‖·‖_1}( C_0^∞((0, 1)) ) = clos_{‖·‖_1}( V_1 ) = clos_{‖·‖_1}( V_2 ).
Thus we consider

    find u ∈ H_0^1(0, 1) such that k(u, v) = f(v) for all v ∈ H_0^1(0, 1).     (2.68)

Both the bilinear form k(·,·) and f(·) are continuous on H_0^1(0, 1), and thus this problem is well-defined. From the Lax–Milgram lemma 2.3.5 it follows that there exists a unique solution (which is usually called the weak solution) of the variational problem (2.68). For this existence and uniqueness result it is essential that we used the Sobolev space H_0^1(0, 1), which is a Hilbert space. In section 2.1 we considered a space V_3 with V_2 ⊂ V_3 ⊂ H_0^1(0, 1) and showed that the function u given in (2.11) solves the variational problem in the space V_3. Due to clos_{‖·‖_1}(V_3) = H_0^1(0, 1) this u is also the weak solution of (2.68).
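A weak problem of the type (2.68) with a discontinuous coefficient can be approximated with piecewise linear finite elements (the method treated in chapter 3). The following sketch (coefficient values and right-hand side chosen for illustration only) compares the discrete solution with the exact weak solution:

```python
import numpy as np

# P1 finite elements for -(a(x)u')' = 1 on (0,1), u(0) = u(1) = 0, with the
# discontinuous coefficient a = 1 on (0,1/2), a = 2 on (1/2,1). The classical
# formulation is not well defined at x = 1/2, but the weak problem has a
# unique solution by the Lax-Milgram lemma.
n = 200                                    # even, so x = 1/2 is a grid node
x = np.linspace(0.0, 1.0, n + 1)
h = 1.0 / n
a_el = np.where((x[:-1] + x[1:]) / 2 < 0.5, 1.0, 2.0)   # a on each element

# stiffness matrix K_ij = int a phi_i' phi_j' and load F_i = int phi_i
K = np.zeros((n + 1, n + 1))
F = np.full(n + 1, h); F[0] = F[-1] = h / 2
for e in range(n):
    K[e:e+2, e:e+2] += a_el[e] / h * np.array([[1.0, -1.0], [-1.0, 1.0]])

# homogeneous Dirichlet conditions: solve on interior nodes only
u = np.zeros(n + 1)
u[1:-1] = np.linalg.solve(K[1:-1, 1:-1], F[1:-1])

# exact weak solution: a u' = c - x with c = 5/12 from u(1) = 0
c = 5.0 / 12.0
u_exact = np.where(x <= 0.5,
                   c*x - x**2/2,
                   (c/2 - 1/8) + (c*(x - 0.5) - (x**2 - 0.25)/2) / 2)
print("max nodal error:", np.abs(u - u_exact).max())
```

The flux a u′ = c − x of the exact solution is continuous across x = 1/2 although u′ jumps there, which is exactly the behaviour captured by the weak formulation.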
We summarize the fundamental steps discussed in this section in the following diagram:
Remark 2.5.1 For the weak formulation in (2.68) to have a unique solution it is important that the bilinear form is elliptic. The following example illustrates this. Consider (2.65) with a(x) = √x. Then the solution of this problem is given by u(x) = (2/3)√x (1 − x) (as can be checked by substitution). Note that u ∉ V_2. Since u ∈ C²((0, 1)) ∩ C([0, 1]), this is the classical solution of (2.65), cf. section 1.2. However, due to ∫_0^1 u′(x)² dx = ∞ it follows that u ∉ H^1(0, 1).
Note that this form differs from the one in (1.2). If the coefficients a_ij are differentiable, then due to

    Σ_{i,j=1}^n ∂/∂x_i ( a_ij ∂u/∂x_j ) = Σ_{i,j=1}^n a_ij ∂²u/(∂x_i ∂x_j) + Σ_{i,j=1}^n (∂a_ij/∂x_i) ∂u/∂x_j

the operator L can be written in the form as in (1.2) with the same c as in (2.69) but with a_ij and b_i in (1.2) replaced by a_ij and b_i − Σ_{j=1}^n ∂a_ji/∂x_j, respectively.
As in section 1.2.1 the coefficients that determine the principal part of the operator are collected in a matrix

    A(x) = ( a_ij(x) )_{1≤i,j≤n}.
    Lu = f  in Ω     (2.71a)
    u = 0  on ∂Ω.     (2.71b)
We now derive a (weaker) variational formulation of this problem along the same lines as in the previous section. For this derivation we assume that the equation (2.71a) has a solution u in the space V := { u ∈ C²(Ω̄) | u = 0 on ∂Ω }. Multiplication of (2.71a) with v ∈ C_0^∞(Ω) and using partial integration implies that u also satisfies

    ∫_Ω ( ∇u^T A ∇v + (b · ∇u) v + c u v ) dx = ∫_Ω f v dx.

Note that in the bilinear form no derivatives of order higher than one occur. This motivates to use spaces obtained by completion w.r.t. the norm ‖·‖_1 and leads to the Sobolev space H_0^1(Ω) = clos_{‖·‖_1}( C_0^∞(Ω) ). One may check that C_0^∞(Ω) ⊂ V ⊂ H_0^1(Ω) and thus clos_{‖·‖_1}(V) = H_0^1(Ω). We thus obtain the following.
It is easy to verify that if the problem (2.73) has a smooth solution u ∈ C²(Ω̄) and if the coefficients are sufficiently smooth then u is also a solution of (2.71). In this sense this problem is the correct weak formulation.
Remark 2.5.2 There is a subtle reason why in the derivation of the weak formulation we used the test space C_0^∞(Ω) and not C^∞(Ω̄). The reason for this choice is closely related to the type of boundary condition. In the situation here we have prescribed boundary values which are automatically fulfilled in the space V and also (after completion) in clos_{‖·‖_1}(V) = H_0^1(Ω). Therefore, the differential equation should be tested in the form ∫_Ω (Lu − f) v dx = 0 only in the interior of Ω, i.e., with functions v that are zero on the boundary. Hence we take v ∈ C_0^∞(Ω). In problems with other types of boundary conditions it may be necessary to take test functions v ∈ C^∞(Ω̄). This will be further explained in section 2.5.3.
We now analyze existence and uniqueness of the variational problem (2.73) by means of the Lax–Milgram lemma 2.3.5. We use the following mild smoothness assumptions concerning the coefficients in the differential operator:

Theorem 2.5.3 Let (2.70) and (2.74) hold and assume that the condition

    −½ div b + c ≥ 0  a.e. in Ω

is fulfilled. Then for every f ∈ L²(Ω) the variational problem (2.73) with f(v) := ∫_Ω f v dx has a unique solution u. Moreover, the inequality ‖u‖_1 ≤ C‖f‖_{L²} holds with a constant C independent of f.

Proof. We use the Lax–Milgram lemma and the fact that ‖·‖_1 and |·|_1, defined by |u|²_1 = Σ_{|α|=1} ‖D^α u‖²_{L²}, are equivalent norms on H_0^1(Ω).
From

    |f(v)| = | ∫_Ω f v dx | ≤ ‖f‖_{L²} ‖v‖_{L²} ≤ ‖f‖_{L²} ‖v‖_1

it follows that f(·) defines a bounded linear functional on H_0^1(Ω). We now check the boundedness of the bilinear form k(·,·) for u, v ∈ H_0^1(Ω):

    |k(u, v)| ≤ Σ_{i,j=1}^n | ∫_Ω a_ij (∂u/∂x_j)(∂v/∂x_i) dx | + Σ_{i=1}^n | ∫_Ω b_i (∂u/∂x_i) v dx | + | ∫_Ω c u v dx |
     ≤ Σ_{i,j=1}^n ‖a_ij‖_{L^∞} ‖∂u/∂x_j‖_{L²} ‖∂v/∂x_i‖_{L²} + Σ_{i=1}^n ‖b_i‖_{L^∞} ‖∂u/∂x_i‖_{L²} ‖v‖_{L²} + ‖c‖_{L^∞} ‖u‖_{L²} ‖v‖_{L²}
     ≤ C ‖u‖_1 ‖v‖_1.
Note that C_0^∞(Ω) is dense in H_0^1(Ω), the bilinear form is continuous and ‖·‖_1 and |·|_1 are equivalent norms. Hence, for the ellipticity, k(u, u) ≥ γ‖u‖²_1 (with γ > 0), to hold it suffices to show k(u, u) ≥ γ̃ |u|²_1 for all u ∈ C_0^∞(Ω).
Take u ∈ C_0^∞(Ω). From the uniform ellipticity assumption (2.70) it follows (with ξ = ∇u) that

    Σ_{i,j=1}^n ∫_Ω a_ij (∂u/∂x_j)(∂u/∂x_i) dx ≥ α_0 Σ_{j=1}^n ∫_Ω (∂u/∂x_j)² dx = α_0 |u|²_1,
has a unique solution. Moreover, ‖u‖_1 ≤ C‖f‖_{L²} holds with a constant C independent of f. If div b ≤ 0 holds (a.e.), then this problem has a unique solution, and ‖u‖_1 ≤ C‖f‖_{L²} holds with a constant C independent of f.
We note that the condition div b ≤ 0 holds, for example, if all b_i, i = 1, …, n, are constants. In the singular perturbation case it may happen that the stability constant deteriorates: C = C(ε) → ∞ if ε ↓ 0.
Inhomogeneous Dirichlet boundary conditions. First we treat the Poisson equation with Dirichlet boundary data that are not identically zero:

    −Δu = f  in Ω
    u = φ  on ∂Ω.
Assume that this problem has a solution in the space V := { u ∈ C²(Ω̄) | u = φ on ∂Ω }. After completion (w.r.t. ‖·‖_1) this will yield the space { u ∈ H^1(Ω) | γ(u) = φ } where γ is the trace operator. As in the previous section the boundary conditions are automatically fulfilled in this space and thus we take test functions v ∈ C_0^∞(Ω) (cf. remark 2.5.2). Multiplication of the differential equation with such a function v, partial integration and using completion with respect to ‖·‖_1 results in the following variational problem:

    find u ∈ { u ∈ H^1(Ω) | u|_{∂Ω} = φ } such that
    ∫_Ω ∇u · ∇v dx = ∫_Ω f v dx  for all v ∈ H_0^1(Ω).     (2.75)
If u and ũ are two solutions of (2.75), their difference satisfies u − ũ ∈ H_0^1(Ω) and ∫_Ω ∇(u − ũ) · ∇v dx = 0 for all v ∈ H_0^1(Ω). Taking v = u − ũ it follows that u = ũ and thus we have at most one solution. To prove existence we introduce a transformed problem. For the identity u|_{∂Ω} = φ (i.e. γ(u) = φ) to make sense, we assume that the boundary data are such that φ ∈ range(γ). Then there exists u_0 ∈ H^1(Ω) such that u_0|_{∂Ω} = φ.
Proof. Trivial.
The LaxMilgram lemma yields existence of a solution of (2.76) and thus of (2.75).
Natural boundary conditions. We now consider a problem in which also (normal) derivatives of u occur in the boundary condition:

    −Δu = f  in Ω     (2.77a)
    ∂u/∂n + αu = g  on ∂Ω     (2.77b)

with α ∈ ℝ a constant and ∂u/∂n = (n · ∇)u = ∇u · n the normal derivative at the boundary. For this problem the following difficulty arises related to the (normal) derivative in the boundary condition. For u ∈ H^1(Ω) the weak derivatives D^β u, |β| = 1, are elements of L²(Ω). It can be shown that it is not possible to define unambiguously v|_{∂Ω} for v ∈ L²(Ω). In other words, there is no trace operator which in a satisfactory way defines ∂u/∂n|_{∂Ω} for u ∈ H^1(Ω). This is the reason why for the solution u we search in the space H^1(Ω) which does not take the boundary conditions into account. Due to this, for the derivation of an appropriate weak formulation, we multiply (2.77a) with test functions v ∈ C^∞(Ω̄) (and not C_0^∞(Ω), cf. remark 2.5.2). This results in

    ∫_Ω ∇u · ∇v dx − ∫_{∂Ω} (∇u · n) v ds = ∫_Ω f v dx  for all v ∈ C^∞(Ω̄).
This results in the following variational problem:

    find u ∈ H^1(Ω) such that
    ∫_Ω ∇u · ∇v dx + α ∫_{∂Ω} u v ds = ∫_Ω f v dx + ∫_{∂Ω} g v ds  for all v ∈ H^1(Ω).     (2.78)
It is easy to verify that if the problem (2.78) has a smooth solution u ∈ C²(Ω̄) and if ∂Ω is sufficiently smooth then u is also a solution of (2.77). In this sense this problem is the correct weak formulation.
Note that now the space H^1(Ω) is used and not H_0^1(Ω). The space H^1(Ω) does not contain any information concerning the boundary condition (2.77b). The boundary condition is taken into account via the bilinear form and the right-hand side functional used in (2.78). In the case of Dirichlet boundary conditions, as in (2.73) and (2.75), this is the other way around: The solution space is such that the boundary conditions are automatically fulfilled and the boundary data do not influence the bilinear form. The latter class of boundary conditions are called essential boundary conditions (these are a priori fulfilled by the choice of the solution space). Boundary conditions as in (2.77b) are called natural boundary conditions (these are automatically fulfilled if the variational problem is solved).
We now analyze existence and uniqueness of the variational formulation. For this we need two variants of the Poincaré–Friedrichs inequality:
[Note that for u ∈ H_0^1(Ω) the first result reduces to the Poincaré–Friedrichs inequality.]
Then q_i is continuous on H^1(Ω) (for i = 1 this follows from the continuity of the trace operator), q_i(λu) = λ² q_i(u) for all λ ∈ ℝ, and if u is equal to a constant, say u ≡ c, then q_i(u) = q_i(c) = 0 iff c = 0.
Assume that there does not exist a constant C such that ‖u‖²_1 ≤ C ( |u|²_1 + q_i(u) ) for all u ∈ H^1(Ω).
From this we conclude that

    lim_{ℓ→∞} q_i(u_{k(ℓ)}) = 0.

Using the continuity of q_i it follows that q_i(c) = q_i(u) = 0 and thus c = 0. This yields a contradiction:

    1 = lim_{ℓ→∞} ‖u_{k(ℓ)}‖²_1 = ‖u‖²_1 = ‖c‖²_1 = 0.

Thus the results are proved.
Using this lemma we can prove the following result for the variational problem in (2.78):

Theorem 2.5.8 Consider the variational problem (2.78) with α > 0, f ∈ L²(Ω) and g ∈ L²(∂Ω). This problem has a unique solution u and the inequality

    ‖u‖_1 ≤ C ( ‖f‖_{L²} + ‖g‖_{L²(∂Ω)} )

holds with a constant C independent of f and g.
Proof. The ellipticity estimate k(u, u) ≥ C‖u‖²_1 for all u ∈ H^1(Ω), with C > 0, follows from (2.79a). Application of the Lax–Milgram lemma completes the proof.
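A one-dimensional instance of (2.77)/(2.78) can be checked numerically with piecewise linear finite elements. In the sketch below (all parameter choices illustrative) the exact solution u = cos(πx) is recovered; note how the Robin data enter the system only through the matrix and right-hand side, not through the solution space.

```python
import numpy as np

# P1 FEM for -u'' = f on (0,1) with Robin conditions du/dn + alpha*u = g,
# i.e. -u'(0) + alpha*u(0) = g0 and u'(1) + alpha*u(1) = g1.
# Weak form (cf. (2.78)):
#   int u'v' dx + alpha*(u(0)v(0)+u(1)v(1)) = int f v dx + g0*v(0) + g1*v(1)
# Manufactured solution u = cos(pi*x): f = pi^2*cos(pi*x), u'(0) = u'(1) = 0.
alpha = 1.0
n = 200
x = np.linspace(0.0, 1.0, n + 1)
h = 1.0 / n
f = np.pi**2 * np.cos(np.pi * x)
g0 = alpha * np.cos(0.0)        # = alpha * u(0)
g1 = alpha * np.cos(np.pi)      # = alpha * u(1)

K = np.zeros((n + 1, n + 1))
for e in range(n):
    K[e:e+2, e:e+2] += np.array([[1.0, -1.0], [-1.0, 1.0]]) / h
K[0, 0] += alpha                # boundary terms of the bilinear form
K[n, n] += alpha

F = h * f.copy(); F[0] /= 2; F[-1] /= 2     # trapezoidal load vector
F[0] += g0; F[-1] += g1                     # natural boundary data

u = np.linalg.solve(K, F)
print("max nodal error:", np.abs(u - np.cos(np.pi * x)).max())
```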
We now analyze the problem with pure Neumann boundary conditions, i.e., α = 0 in (2.77b). Clearly, for this problem we cannot have uniqueness: if u is a solution then for any constant c the function u + c is also a solution. Moreover, for existence of a solution the data f and g must satisfy a certain condition. Assume that u ∈ H²(Ω) is a solution of (2.77) for the case α = 0, then

    ∫_Ω f dx = −∫_Ω Δu dx = −∫_Ω Δu · 1 dx = −∫_{∂Ω} (∂u/∂n) ds = −∫_{∂Ω} g ds

must hold. This motivates the introduction of the compatibility relation:

    ∫_Ω f dx + ∫_{∂Ω} g ds = 0.     (2.81)

To obtain uniqueness, for the solution space we take a subspace of H^1(Ω) consisting of functions u with ⟨u, 1⟩_{L²} = ⟨u, 1⟩_1 = 0:

    H_∗^1(Ω) := { u ∈ H^1(Ω) | ∫_Ω u dx = 0 }.
Since this is a closed subspace of H^1(Ω) it is a Hilbert space. Instead of (2.78) we now consider:

    find u ∈ H_∗^1(Ω) such that
    ∫_Ω ∇u · ∇v dx = ∫_Ω f v dx + ∫_{∂Ω} g v ds  for all v ∈ H_∗^1(Ω).     (2.82)
Theorem 2.5.9 Consider the variational problem (2.82) with f ∈ L²(Ω), g ∈ L²(∂Ω) and assume that the compatibility relation (2.81) holds. Then this problem has a unique solution u and the inequality

    ‖u‖_1 ≤ C ( ‖f‖_{L²} + ‖g‖_{L²(∂Ω)} )

holds with a constant C independent of f and g.
Proof. Define the functional ℓ(v) := ∫_Ω f v dx + ∫_{∂Ω} g v ds. The continuity of this functional is shown in the proof of theorem 2.5.8. Define k(u, v) = ∫_Ω ∇u · ∇v dx. The continuity of this bilinear form is trivial. For u ∈ H_∗^1(Ω) we have ∫_Ω u dx = 0 and thus, using the result in (2.79b), we get

    k(u, u) = |u|²_1 = |u|²_1 + | ∫_Ω u dx |² ≥ C‖u‖²_1  for all u ∈ H_∗^1(Ω)

with a constant C > 0. Hence, the bilinear form is H_∗^1(Ω)-elliptic. From the Lax–Milgram lemma it follows that there exists a unique solution u ∈ H_∗^1(Ω) such that k(u, v) = ℓ(v) for all v ∈ H_∗^1(Ω). Note that k(u, 1) = 0 and, due to the compatibility relation, ℓ(1) = 0. It follows that for the solution u we have k(u, v) = ℓ(v) for all v ∈ H^1(Ω).
Remark 2.5.10 For the case $\lambda < 0$ it may happen that the problem (2.77) has a nontrivial kernel (and thus we do not have uniqueness). Moreover, in general this kernel is not as simple as for the case $\lambda = 0$. As an example, consider the problem
Mixed boundary conditions

It may happen that in a boundary value problem both natural and essential boundary conditions occur. We discuss a typical example. Let $\Gamma_e$ and $\Gamma_n$ be parts of the boundary $\partial\Omega$ such that $\mathrm{meas}_{n-1}(\Gamma_e) > 0$, $\mathrm{meas}_{n-1}(\Gamma_n) > 0$, $\Gamma_e \cap \Gamma_n = \emptyset$, $\bar\Gamma_e \cup \bar\Gamma_n = \partial\Omega$. Now consider the following boundary value problem:
$$-\Delta u = f \quad \text{in } \Omega, \qquad (2.83a)$$
$$u = 0 \quad \text{on } \Gamma_e, \qquad (2.83b)$$
$$\frac{\partial u}{\partial n} = \varphi \quad \text{on } \Gamma_n. \qquad (2.83c)$$
The Dirichlet (= essential) boundary conditions are fulfilled by the choice of the solution space:
$$H^1_{\Gamma_e}(\Omega) := \{\, u \in H^1(\Omega) \mid (\gamma u) = 0 \text{ on } \Gamma_e \,\}$$
($\gamma$ denoting the trace operator).
The Neumann (= natural) boundary conditions will be part of the linear functional used in the variational problem. A similar derivation as for the previous examples results in the following variational problem:
$$\text{find } u \in H^1_{\Gamma_e}(\Omega) \text{ such that } \int_\Omega \nabla u\cdot\nabla v\,dx = \int_\Omega f v\,dx + \int_{\Gamma_n} \varphi v\,ds \quad \text{for all } v \in H^1_{\Gamma_e}(\Omega). \qquad (2.84)$$
One easily verifies that if this problem has a smooth solution $u \in C^2(\bar\Omega)$ then $u$ is a solution of the problem (2.83).
Remark 2.5.11 For the proof of existence and uniqueness we need the following Poincaré-Friedrichs inequality in the space $H^1_{\Gamma_e}(\Omega)$: there exists a constant $C$ such that
$$\|u\|_{L^2} \le C\,|u|_1 \quad \text{for all } u \in H^1_{\Gamma_e}(\Omega). \qquad (2.85)$$
For a proof of this result we refer to the literature, e.g. [3], Remark 5.16.

Theorem 2.5.12 The variational problem (2.84) with $f \in L^2(\Omega)$ and $\varphi \in L^2(\Gamma_n)$ has a unique solution $u$ and the inequality
$$\|u\|_1 \le C\big(\|f\|_{L^2} + \|\varphi\|_{L^2(\Gamma_n)}\big)$$
holds with a constant $C$ independent of $f$ and $\varphi$.

Proof. The right-hand side of (2.84) defines a linear functional; its continuity can be shown as in the proof of theorem 2.5.8. Define the bilinear form $k(u,v) = \int_\Omega \nabla u\cdot\nabla v\,dx$. The continuity of $k(\cdot,\cdot)$ is trivial. From (2.85) it follows that $k(u,u) \ge C\|u\|_1^2$ for all $u \in H^1_{\Gamma_e}(\Omega)$, with $C > 0$. Hence the bilinear form is $H^1_{\Gamma_e}(\Omega)$-elliptic. Application of the Lax-Milgram lemma completes the proof.
2.5.4 Regularity results

In this section we present a few results from the literature on global smoothness of the solution. First the notion of $H^m$-regularity is introduced. For ease of presentation we restrict ourselves to elliptic boundary value problems with homogeneous Dirichlet boundary conditions.

For every $f \in H^{-1}(\Omega)$ the variational problem (2.73) has a unique solution $u \in H^1_0(\Omega)$, and the inequality
$$\|u\|_1 \le C\,\|f\|_{-1}$$
holds with a constant $C$ independent of $f$. This property is called $H^1$-regularity of the variational problem. If for some $m > 1$ and $f \in H^{m-2}(\Omega)$ the unique solution $u$ of
$$k(u,v) = \int_\Omega f v\,dx \quad \text{for all } v \in H^1_0(\Omega)$$
satisfies $u \in H^m(\Omega)$ and
$$\|u\|_m \le C\,\|f\|_{m-2}$$
with a constant $C$ independent of $f$, then the variational problem is said to be $H^m$-regular.

The result in the next theorem is an analogue of the result in theorem 1.2.7, but now the smoothness is measured using Sobolev norms instead of Hölder norms.
Theorem 2.5.14 ([39], Theorem 8.13) Assume that $u \in H^1_0(\Omega)$ is a solution of (2.73) (for existence of $u$: see theorem 2.5.3). For some $m \in \mathbb{N}$ assume that $\partial\Omega \in C^{m+2}$, that the coefficients of the bilinear form are sufficiently smooth, and that $f \in H^m(\Omega)$. Then $u \in H^{m+2}(\Omega)$ and
$$\|u\|_{m+2} \le C\big(\|u\|_{L^2} + \|f\|_m\big) \qquad (2.86)$$
holds with a constant $C$ independent of $u$ and $f$.

Corollary 2.5.15 Assume that the assumptions of theorem 2.5.3 and of theorem 2.5.14 are fulfilled. Then the variational problem (2.73) is $H^{m+2}$-regular.

Proof. Due to theorem 2.5.3 the problem has a unique solution $u$ and $\|u\|_1 \le C\|f\|_{L^2}$ holds. Now combine this with the result in (2.86):
$$\|u\|_{m+2} \le C\big(\|u\|_{L^2} + \|f\|_m\big) \le C_1\big(\|f\|_{L^2} + \|f\|_m\big) \le 2C_1\,\|f\|_m.$$
Note that in these regularity results there is a severe condition on the smoothness of the boundary. For example, for $m = 0$, i.e., $H^2$-regularity, we have the condition $\partial\Omega \in C^2$. In practice, this assumption often does not hold. For convex domains one can prove $H^2$-regularity without assuming such a strong smoothness condition on $\partial\Omega$. The following result is due to [53]:

Theorem 2.5.16 Let $\Omega$ be convex. Suppose that the assumptions of theorem 2.5.3 hold and in addition $a_{ij} \in C^{0,1}(\bar\Omega)$ for all $i, j$. Then the unique solution of (2.73) satisfies $u \in H^2(\Omega)$ and $\|u\|_2 \le C\,\|f\|_{L^2}$ with a constant $C$ independent of $f$.
We note that very similar regularity results hold for elliptic problems with natural boundary conditions (as in (2.77b)). In problems with mixed boundary conditions, however, one in general has less regularity.

2.6 Weak formulation of the Stokes problem

We consider the Stokes problem: find a velocity $u$ and a pressure $p$ such that
$$-\Delta u + \nabla p = f \ \text{ in } \Omega, \qquad \mathrm{div}\,u = 0 \ \text{ in } \Omega, \qquad u = 0 \ \text{ on } \partial\Omega. \qquad (2.87)$$
From the formulation of the Stokes problem it is clear that the pressure $p$ is determined only up to a constant. In order to eliminate this degree of freedom we introduce the additional requirement
$$\langle p, 1\rangle_{L^2} = \int_\Omega p\,dx = 0.$$
Assume that the Stokes problem has a solution $u \in V := \{\, u \in C^2(\bar\Omega)^n \mid u = 0 \text{ on } \partial\Omega \,\}$, $p \in M := \{\, p \in C^1(\bar\Omega) \mid \int_\Omega p\,dx = 0 \,\}$. Then $(u,p)$ also solves the following variational problem: find $(u,p) \in V \times M$ such that
$$\int_\Omega \nabla u\cdot\nabla v\,dx - \int_\Omega p\,\mathrm{div}\,v\,dx = \int_\Omega f\cdot v\,dx \quad \text{for all } v \in C_0^\infty(\Omega)^n,$$
$$\int_\Omega q\,\mathrm{div}\,u\,dx = 0 \quad \text{for all } q \in M, \qquad (2.88)$$
with $\int_\Omega \nabla u\cdot\nabla v\,dx = \langle\nabla u, \nabla v\rangle_{L^2} := \sum_{i=1}^n \langle\nabla u_i, \nabla v_i\rangle_{L^2}$. We introduce the bilinear forms
$$a(u,v) := \int_\Omega \nabla u\cdot\nabla v\,dx, \qquad b(v,q) := -\int_\Omega q\,\mathrm{div}\,v\,dx. \qquad (2.89)$$
Note that no derivatives of the pressure occur in (2.88). To obtain a weak formulation in appropriate Hilbert spaces we apply the completion principle. For the velocity we use completion w.r.t. $\|\cdot\|_1$ and for the pressure we use the norm $\|\cdot\|_{L^2}$:
$$V = \overline{C_0^\infty(\Omega)^n}^{\|\cdot\|_1} = H^1_0(\Omega)^n, \qquad M = \overline{M}^{\|\cdot\|_{L^2}} = L^2_0(\Omega) := \{\, p \in L^2(\Omega) \mid \int_\Omega p\,dx = 0 \,\}.$$
This results in the following weak formulation of the Stokes problem, with $V := H^1_0(\Omega)^n$, $M := L^2_0(\Omega)$:
$$\text{find } (u,p) \in V \times M \text{ such that } \begin{cases} a(u,v) + b(v,p) = \int_\Omega f\cdot v\,dx & \text{for all } v \in V,\\ b(u,q) = 0 & \text{for all } q \in M. \end{cases} \qquad (2.90)$$

Lemma 2.6.1 Suppose that $u \in C^2(\bar\Omega)^n$ and $p \in C^1(\bar\Omega)$ satisfy (2.90). Then $(u,p)$ is a solution of (2.87); in particular, the second equation in (2.90) yields $\mathrm{div}\,u = 0$ in $\Omega$.
To show the well-posedness of the variational formulation of the Stokes problem we apply corollary 2.3.12. For this we need the following inf-sup condition, which will be proved in section 2.6.1:
$$\exists\,\beta > 0:\quad \sup_{v \in H^1_0(\Omega)^n} \frac{\int_\Omega q\,\mathrm{div}\,v\,dx}{\|v\|_1} \ge \beta\,\|q\|_{L^2} \quad \text{for all } q \in L^2_0(\Omega). \qquad (2.91)$$
Using this property we obtain a fundamental result on well-posedness of the variational Stokes problem:
Theorem 2.6.2 For every $f \in L^2(\Omega)^n$ the Stokes problem (2.90) has a unique solution $(u,p) \in V \times M$. Moreover, the inequality
$$\|u\|_1 + \|p\|_{L^2} \le C\,\|f\|_{L^2}$$
holds with a constant $C$ independent of $f$.

Proof. We can apply corollary 2.3.12 with $V, M, a(\cdot,\cdot), b(\cdot,\cdot)$ as defined above and $f_1(v) = \int_\Omega f\cdot v\,dx$, $f_2 = 0$. The continuity of $b(\cdot,\cdot)$ on $V \times M$ follows from
$$|b(v,q)| = \Big|\int_\Omega q\,\mathrm{div}\,v\,dx\Big| \le \|q\|_{L^2}\,\|\mathrm{div}\,v\|_{L^2} \le \sqrt{n}\,\|q\|_{L^2}\,\|v\|_1 \quad \text{for all } (v,q) \in V \times M.$$
The inf-sup condition is given in (2.91). Note that the minus sign in (2.89b) does not play a role for the inf-sup condition in (2.91). The continuity of $a(\cdot,\cdot)$ on $V \times V$ is clear. The $V$-ellipticity follows from the Poincaré-Friedrichs inequality:
$$a(u,u) = \sum_{i=1}^n |u_i|_1^2 \ge c \sum_{i=1}^n \|u_i\|_1^2 = c\,\|u\|_1^2 \quad \text{for all } u \in V.$$
Remark 2.6.3 Note that the bilinear form $a(\cdot,\cdot)$ is symmetric and thus we can apply theorem 2.4.2. This shows that the variational formulation of the Stokes problem is equivalent to a saddle-point problem.
2.6.1 Proof of the inf-sup property

Every $f \in L^2(\Omega)$ defines a bounded linear functional on $H^1_0(\Omega)$:
$$u \mapsto \int_\Omega f(x)u(x)\,dx, \quad u \in H^1_0(\Omega), \qquad \|f\|_{-1} = \sup_{u \in H^1_0(\Omega)} \frac{\langle f, u\rangle_{L^2}}{\|u\|_1}.$$
We now define the first (partial) derivatives of an $L^2$ function (in the sense of distributions). For $f \in L^2(\Omega)$ the mapping
$$F_\alpha: u \mapsto -\int_\Omega f(x)\,D^\alpha u(x)\,dx, \quad |\alpha| = 1, \ u \in H^1_0(\Omega),$$
defines a bounded linear functional on $H^1_0(\Omega)$, i.e., an element of $H^{-1}(\Omega)$, which we denote by $D^\alpha f$. Based on these partial derivatives we define
$$\nabla f = \Big(\frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n}\Big), \qquad \|\nabla f\|_{-1} = \sqrt{\sum_{i=1}^n \Big\|\frac{\partial f}{\partial x_i}\Big\|_{-1}^2} = \sqrt{\sum_{|\alpha|=1} \|D^\alpha f\|_{-1}^2}.$$
In the next theorem we present a rather deep result from analysis. Its proof (for the case $\partial\Omega \in C^{0,1}$) is long and technical.

Theorem 2.6.4 There exists a constant $C$ such that for all $p \in L^2(\Omega)$:
$$\|p\|_{L^2} \le C\big(\|p\|_{-1} + \|\nabla p\|_{-1}\big).$$

Remark 2.6.5 From the definitions of $\|p\|_{-1}$, $\|\nabla p\|_{-1}$ it immediately follows that $\|p\|_{-1} \le \|p\|_{L^2}$ and $\|\nabla p\|_{-1} \le \sqrt{n}\,\|p\|_{L^2}$ for all $p \in L^2(\Omega)$. Hence, using theorem 2.6.4 it follows that $\|\cdot\|_{L^2}$ and $\|\cdot\|_{-1} + \|\nabla\cdot\|_{-1}$ are equivalent norms on $L^2(\Omega)$. This can be seen as a (nontrivial) extension of the (trivial) result that on $H^m(\Omega)$, $m \ge 1$, the norms $\|\cdot\|_m$ and $\|\cdot\|_{m-1} + \|\nabla\cdot\|_{m-1}$ are equivalent.
Lemma 2.6.6 There exists a constant $c > 0$ such that
$$\|\nabla q\|_{-1} \ge c\,\|q\|_{L^2} \quad \text{for all } q \in L^2_0(\Omega).$$

Proof. Suppose that this result does not hold. Then there exists a sequence $(p_k)_{k \ge 1}$ in $L^2_0(\Omega)$ such that
$$1 = \|p_k\|_{L^2} \ge k\,\|\nabla p_k\|_{-1} \quad \text{for all } k. \qquad (2.92)$$
From the fact that the continuous embedding $H^1_0(\Omega) \hookrightarrow L^2(\Omega)$ is compact, it follows that $L^2(\Omega) = L^2(\Omega)' \hookrightarrow H^1_0(\Omega)' = H^{-1}(\Omega)$ is a compact embedding. Hence there exists a subsequence $(p_{k(\ell)})_{\ell \ge 1}$ that is a Cauchy sequence in $H^{-1}(\Omega)$. From (2.92) and theorem 2.6.4 it follows that $(p_{k(\ell)})_{\ell \ge 1}$ is a Cauchy sequence in $L^2(\Omega)$ and thus there exists $p \in L^2(\Omega)$ such that
$$\lim_{\ell\to\infty} \|p_{k(\ell)} - p\|_{L^2} = 0. \qquad (2.93)$$
From (2.92) it follows that
$$\lim_{\ell\to\infty} \frac{\partial p_{k(\ell)}}{\partial x_i}(\phi) = 0 \quad \text{for all } \phi \in C_0^\infty(\Omega) \text{ and all } i = 1, \ldots, n,$$
and hence
$$0 = \lim_{\ell\to\infty} \frac{\partial p_{k(\ell)}}{\partial x_i}(\phi) = -\lim_{\ell\to\infty} \Big\langle p_{k(\ell)}, \frac{\partial\phi}{\partial x_i}\Big\rangle_{L^2} = -\Big\langle p, \frac{\partial\phi}{\partial x_i}\Big\rangle_{L^2} \quad \text{for } i = 1, \ldots, n.$$
Hence $p \in H^1(\Omega)$ and $\nabla p = 0$. It follows that $p$ is equal to a constant (a.e.), say $p = c$. From (2.93) and $p_{k(\ell)} \in L^2_0(\Omega)$ it follows that $\int_\Omega p\,dx = \int_\Omega c\,dx = 0$ and thus $c = 0$. This results in a contradiction:
$$1 = \lim_{\ell\to\infty} \|p_{k(\ell)}\|_{L^2} = \|p\|_{L^2} = \|c\|_{L^2} = 0,$$
and thus the proof is complete.
Proof (of the inf-sup condition (2.91)). From lemma 2.6.6 it follows that there exists $c > 0$ such that $\|\nabla q\|_{-1} \ge c\,\|q\|_{L^2}$ for all $q \in L^2_0(\Omega)$. Hence, for suitable $k$ with $1 \le k \le n$ we have
$$\Big\|\frac{\partial q}{\partial x_k}\Big\|_{-1} \ge \frac{c}{\sqrt{n}}\,\|q\|_{L^2} \quad \text{for all } q \in L^2_0(\Omega).$$
Thus there exists $v \in H^1_0(\Omega)$ with $\|v\|_1 = 1$ and
$$\Big|\frac{\partial q}{\partial x_k}(v)\Big| = \Big|\int_\Omega q\,\frac{\partial v}{\partial x_k}\,dx\Big| \ge \frac{1}{2}\,\frac{c}{\sqrt{n}}\,\|q\|_{L^2} \quad \text{for all } q \in L^2_0(\Omega).$$
For $\mathbf{v} = (v_1, \ldots, v_n) \in H^1_0(\Omega)^n$ defined by $v_k = v$, $v_i = 0$ for $i \ne k$ we have $\mathrm{div}\,\mathbf{v} = \frac{\partial v}{\partial x_k}$, $\|\mathbf{v}\|_1 = \|v\|_1$, and thus
$$\sup_{w \in H^1_0(\Omega)^n} \frac{\int_\Omega q\,\mathrm{div}\,w\,dx}{\|w\|_1} = \sup_{w \in H^1_0(\Omega)^n} \frac{\big|\int_\Omega q\,\mathrm{div}\,w\,dx\big|}{\|w\|_1} \ge \frac{\big|\int_\Omega q\,\frac{\partial v}{\partial x_k}\,dx\big|}{\|v\|_1} \ge \frac{c}{2\sqrt{n}}\,\|q\|_{L^2}$$
for all $q \in L^2_0(\Omega)$ (the first equality holds because $w$ may be replaced by $-w$). This completes the proof.
2.6.3 Other boundary conditions

For a Stokes problem with nonhomogeneous Dirichlet boundary conditions, say $u = g$ on $\partial\Omega$, a compatibility condition is needed:
$$\int_{\partial\Omega} g\cdot n\,ds = 0. \qquad (2.95)$$
...other boundary conditions ...
...in preparation ....
Chapter 3

Let $H_1, H_2$ be Hilbert spaces, $k: H_1 \times H_2 \to \mathbb{R}$ a bilinear form and $f \in H_2'$. We consider the variational problem:
$$\text{find } u \in H_1 \text{ such that } k(u,v) = f(v) \quad \text{for all } v \in H_2. \qquad (3.1)$$
We assume that the bilinear form is continuous,
$$\exists\, M:\quad |k(u,v)| \le M\,\|u\|_{H_1}\|v\|_{H_2} \quad \text{for all } u \in H_1,\ v \in H_2, \qquad (3.2)$$
and that the conditions (2.36) and (2.37) from theorem 2.3.1 hold:
$$\exists\,\varepsilon > 0:\quad \sup_{v \in H_2} \frac{k(u,v)}{\|v\|_{H_2}} \ge \varepsilon\,\|u\|_{H_1} \quad \text{for all } u \in H_1, \qquad (3.3)$$
$$\forall\, v \in H_2,\ v \ne 0,\ \exists\, u \in H_1: \quad k(u,v) \ne 0. \qquad (3.4)$$
From theorem 2.3.1 we know that for a continuous bilinear form the conditions (3.3) and (3.4) are necessary and sufficient for well-posedness of the variational problem in (3.1).
The Galerkin discretization of the problem (3.1) is based on the following simple idea. We choose finite dimensional subspaces $H_{1,h} \subset H_1$, $H_{2,h} \subset H_2$ (note: in concrete cases the index $h$ will correspond to some mesh size parameter) and consider the finite dimensional variational problem:
$$\text{find } u_h \in H_{1,h} \text{ such that } k(u_h, v_h) = f(v_h) \quad \text{for all } v_h \in H_{2,h}. \qquad (3.5)$$
This problem is called the Galerkin discretization of (3.1) (in $H_{1,h} \times H_{2,h}$). We now discuss the well-posedness of this Galerkin discretization. First note that the continuity of $k: H_{1,h} \times H_{2,h} \to \mathbb{R}$ follows from (3.2). From theorem 2.3.1 it follows that we need the conditions (3.3) and (3.4) with $H_i$ replaced by $H_{i,h}$, $i = 1, 2$. However, because $H_{i,h}$ is finite dimensional we only need (3.3), since this implies (3.4) (see remark 2.3.2). Thus we formulate the following (discrete) inf-sup condition in the space $H_{1,h} \times H_{2,h}$:
$$\exists\,\varepsilon_h > 0:\quad \sup_{v_h \in H_{2,h}} \frac{k(u_h, v_h)}{\|v_h\|_{H_2}} \ge \varepsilon_h\,\|u_h\|_{H_1} \quad \text{for all } u_h \in H_{1,h}. \qquad (3.6)$$
We now prove two fundamental results:

Theorem 3.1.1 (Céa lemma) Let (3.2), (3.3), (3.4), (3.6) hold. Then the variational problem (3.1) and its Galerkin discretization (3.5) have unique solutions $u$ and $u_h$, respectively. Furthermore, the inequality
$$\|u - u_h\|_{H_1} \le \Big(1 + \frac{M}{\varepsilon_h}\Big) \inf_{v_h \in H_{1,h}} \|u - v_h\|_{H_1} \qquad (3.7)$$
holds.

Proof. The result on existence and uniqueness follows from theorem 2.3.1 and the fact that in the finite dimensional case (3.3) implies (3.4). From (3.1) and (3.5) it follows that
$$k(u - u_h, v_h) = 0 \quad \text{for all } v_h \in H_{2,h}. \qquad (3.8)$$
For arbitrary $v_h \in H_{1,h}$ we have, due to (3.6), (3.8), (3.2):
$$\|v_h - u_h\|_{H_1} \le \frac{1}{\varepsilon_h} \sup_{w_h \in H_{2,h}} \frac{k(v_h - u_h, w_h)}{\|w_h\|_{H_2}} = \frac{1}{\varepsilon_h} \sup_{w_h \in H_{2,h}} \frac{k(v_h - u, w_h)}{\|w_h\|_{H_2}} \le \frac{M}{\varepsilon_h}\,\|v_h - u\|_{H_1}.$$
From this and the triangle inequality
$$\|u - u_h\|_{H_1} \le \|u - v_h\|_{H_1} + \|v_h - u_h\|_{H_1} \quad \text{for all } v_h \in H_{1,h}$$
the result follows.
The result in this theorem simplifies if we consider the important special case $H_1 = H_2 =: H$, $H_{1,h} = H_{2,h} =: H_h$ and assume that the bilinear form $k(\cdot,\cdot)$ is elliptic on $H$.
Corollary 3.1.2 Consider the case $H_1 = H_2 =: H$ and $H_{1,h} = H_{2,h} =: H_h$. Assume that (3.2) holds and that the bilinear form $k(\cdot,\cdot)$ is $H$-elliptic with ellipticity constant $\gamma$. Then the variational problem (3.1) and its Galerkin discretization (3.5) have unique solutions $u$ and $u_h$, respectively. Furthermore, the inequality
$$\|u - u_h\|_H \le \frac{M}{\gamma} \inf_{v_h \in H_h} \|u - v_h\|_H \qquad (3.9)$$
holds.

Proof. Because $k(\cdot,\cdot)$ is $H$-elliptic the conditions (3.3) (with $\varepsilon = \gamma$), (3.4) and (3.6) (with $\varepsilon_h = \gamma$) are satisfied. From theorem 3.1.1 we conclude that unique solutions $u$ and $u_h$ exist. Using $k(u - u_h, v_h) = 0$ for all $v_h \in H_h$ and the ellipticity and continuity we get, for arbitrary $v_h \in H_h$:
$$\|u - u_h\|_H^2 \le \frac{1}{\gamma}\,k(u - u_h, u - u_h) = \frac{1}{\gamma}\,k(u - u_h, u - v_h) \le \frac{M}{\gamma}\,\|u - u_h\|_H\,\|u - v_h\|_H.$$
Hence the inequality in (3.9) holds.
In chapter 4 and chapter 5 we will use theorem 3.1.1 in the discretization error analysis. In the remainder of this chapter we only consider cases with $H_1 = H_2 =: H$, $H_{1,h} = H_{2,h} =: H_h$ and $H$-elliptic bilinear forms, such that corollary 3.1.2 can be applied.
An improvement of the bound in (3.9) can be obtained if $k(\cdot,\cdot)$ is symmetric:

Corollary 3.1.3 Assume that the conditions as in corollary 3.1.2 are satisfied. If in addition the bilinear form $k(\cdot,\cdot)$ is symmetric, the inequality
$$\|u - u_h\|_H \le \sqrt{\frac{M}{\gamma}} \inf_{v_h \in H_h} \|u - v_h\|_H \qquad (3.10)$$
holds.

Proof. Introduce the norm $|||v||| := k(v,v)^{1/2}$ on $H$. Note that
$$\sqrt{\gamma}\,\|v\|_H \le |||v||| \le \sqrt{M}\,\|v\|_H \quad \text{for all } v \in H.$$
The space $(H, |||\cdot|||)$ is a Hilbert space and, due to $|||v|||^2 = k(v,v)$ and $k(u,v) \le |||u|||\,|||v|||$, the bilinear form has ellipticity constant and continuity constant w.r.t. the norm $|||\cdot|||$ both equal to 1. Application of corollary 3.1.2 in the space $(H, |||\cdot|||)$ yields
$$\|u - u_h\|_H \le \frac{1}{\sqrt{\gamma}}\,|||u - u_h||| \le \frac{1}{\sqrt{\gamma}} \inf_{v_h \in H_h} |||u - v_h||| \le \sqrt{\frac{M}{\gamma}} \inf_{v_h \in H_h} \|u - v_h\|_H.$$
Assume $H_1 = H_2 =: H$ and $H_{1,h} = H_{2,h} =: H_h$. For the actual computation of the solution $u_h$ of the Galerkin discretization we need a basis of the space $H_h$. Let $\{\phi_i\}_{1 \le i \le N}$ be a basis of $H_h$, i.e., every $v_h \in H_h$ has a unique representation
$$v_h = \sum_{j=1}^N v_j \phi_j \quad \text{with } \mathbf{v} := (v_1, \ldots, v_N)^T \in \mathbb{R}^N. \qquad (3.11)$$
With this basis the discrete problem (3.5) is equivalent to the linear system of equations
$$K_h \mathbf{u} = \mathbf{f}_h, \qquad (K_h)_{ij} = k(\phi_j, \phi_i), \quad (\mathbf{f}_h)_i = f(\phi_i), \quad 1 \le i, j \le N, \qquad (3.12)$$
where $\mathbf{u}$ contains the coefficients of $u_h$ with respect to the basis $\{\phi_i\}$.
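The passage from the basis representation to the linear system can be illustrated with a minimal 1D sketch: the model problem $-u'' = f$ on $(0,1)$, $u(0) = u(1) = 0$, discretized with piecewise linear hat functions on a uniform mesh. The function names, the nodal quadrature for the load vector, and the use of the standard Thomas algorithm for the tridiagonal solve are our choices, not from the text:

```python
# Galerkin discretization of -u'' = f on (0,1), u(0) = u(1) = 0, using the
# piecewise linear nodal basis on a uniform mesh (illustrative sketch).

def galerkin_1d_poisson(n, f):
    """Return nodal values of u_h at the n interior nodes, h = 1/(n+1)."""
    h = 1.0 / (n + 1)
    # Entries k(phi_j, phi_i) = int phi_j' phi_i' dx of the tridiagonal
    # stiffness matrix: 2/h on the diagonal, -1/h next to it.
    sub = [-1.0 / h] * n
    diag = [2.0 / h] * n
    sup = [-1.0 / h] * n
    # Load vector f_i = int f phi_i dx approximated by h * f(x_i)
    # (exact for constant f, since int phi_i dx = h).
    rhs = [h * f((i + 1) * h) for i in range(n)]
    # Thomas algorithm: forward elimination, then back substitution.
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = sup[0] / diag[0], rhs[0] / diag[0]
    for i in range(1, n):
        m = diag[i] - sub[i] * cp[i - 1]
        cp[i] = sup[i] / m
        dp[i] = (rhs[i] - sub[i] * dp[i - 1]) / m
    u = [0.0] * n
    u[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        u[i] = dp[i] - cp[i] * u[i + 1]
    return u

# For f = 1 the exact solution is u(x) = x(1 - x)/2; in this 1D model the
# Galerkin solution is even exact at the nodes.
u = galerkin_1d_poisson(15, lambda x: 1.0)
```

The tridiagonal solve costs only O(N) here; for 2D and 3D problems the stiffness matrix is still sparse but much larger, which motivates the iterative methods discussed in Part III.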
In the remainder of this chapter we discuss concrete choices for the space $H_h$, namely the so-called finite element spaces. These spaces turn out to be very suitable for the Galerkin discretization of scalar elliptic boundary value problems. Finite element spaces can also be used for the Galerkin discretization of the Stokes problem. This topic is treated in chapter 5. Once a space $H_h$ is known one can investigate approximation properties of this space and derive bounds for $\inf_{v_h \in H_h} \|u - v_h\|_H$ (with $u$ the weak solution of the elliptic boundary value problem), cf. section 3.3. Due to the Céa lemma we then have a bound for the discretization error $\|u - u_h\|_H$ (see section 3.4).
In Part III of this book (iterative methods) we discuss techniques that can be used for solving
the linear system in (3.12).
To simplify the presentation we only consider finite element methods for elliptic boundary value problems in $\Omega \subset \mathbb{R}^n$ with $n \le 3$.

The starting point for the finite element approach is a subdivision of the domain $\Omega$ into a finite number of subsets $T$. Such a subdivision is called a triangulation and is denoted by $\mathcal{T}_h = \{T\}$. For the subsets $T$ we only allow:
$$T \text{ is an } n\text{-simplex (i.e., interval, triangle, tetrahedron), or } T \text{ is an } n\text{-rectangle.} \qquad (3.13)$$
Furthermore, the triangulation $\mathcal{T}_h = \{T\}$ should be such that
$$\bar\Omega = \cup_{T \in \mathcal{T}_h} T, \qquad (3.14a)$$
$$\mathrm{int}\,T_1 \cap \mathrm{int}\,T_2 = \emptyset \quad \text{for all } T_1, T_2 \in \mathcal{T}_h,\ T_1 \ne T_2, \qquad (3.14b)$$
$$\text{any edge [face] of any } T_1 \in \mathcal{T}_h \text{ is either a subset of } \partial\Omega \text{ or an edge [face] of another } T_2 \in \mathcal{T}_h. \qquad (3.14c)$$
Definition 3.2.1 A triangulation that satisfies (3.13) and (3.14) is called admissible.

Note that a triangulation can be admissible only if the domain $\Omega$ is polygonal (i.e., $\partial\Omega$ consists of lines and/or planes). If the domain is not polygonal we can approximate it by a polygonal domain $\Omega_h$ and construct an admissible triangulation of $\Omega_h$ (see ...) or use isoparametric finite elements (section 3.6).
Definition 3.2.2 A family of admissible triangulations $\{\mathcal{T}_h\}$ is called regular if

1. the parameter $h$ approaches zero: $\inf\{\, h \mid \mathcal{T}_h \in \{\mathcal{T}_h\} \,\} = 0$, where $h := \max_{T \in \mathcal{T}_h} h_T$ and $h_T := \mathrm{diam}(T)$;

2. $\exists\,\sigma > 0: \ \frac{h_T}{\rho_T} \le \sigma$ for all $T \in \mathcal{T}_h$ and all $\mathcal{T}_h \in \{\mathcal{T}_h\}$, where $\rho_T$ denotes the diameter of the largest ball contained in $T$.

A family of admissible triangulations $\{\mathcal{T}_h\}$ is called quasi-uniform if
$$\exists\,\tau > 0: \quad \frac{h}{h_T} \le \tau \quad \text{for all } T \in \mathcal{T}_h \text{ and all } \mathcal{T}_h \in \{\mathcal{T}_h\}.$$
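The ratio $h_T/\rho_T$ in this definition is easy to evaluate for triangles: $h_T$ is the longest edge and $\rho_T$ the diameter of the inscribed circle, which equals $4\cdot\text{area}/\text{perimeter}$. A small sketch (the helper names are ours):

```python
import math

def shape_ratio(p1, p2, p3):
    """h_T / rho_T for a triangle: longest edge over inscribed-circle diameter."""
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])
    e = [dist(p1, p2), dist(p2, p3), dist(p3, p1)]
    h_T = max(e)
    s = sum(e) / 2.0                                            # semi-perimeter
    area = math.sqrt(s * (s - e[0]) * (s - e[1]) * (s - e[2]))  # Heron's formula
    rho_T = 2.0 * area / s                  # = 4 * area / perimeter
    return h_T / rho_T

# Equilateral triangle: h_T / rho_T = sqrt(3).
ratio_eq = shape_ratio((0, 0), (1, 0), (0.5, math.sqrt(3) / 2))
# A nearly flat triangle has a much larger ratio; a regular family of
# triangulations must keep such degenerating elements out.
ratio_flat = shape_ratio((0, 0), (1, 0), (0.5, 0.01))
```

Bounding this ratio uniformly is exactly what condition 2 of Definition 3.2.2 demands.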
The dimension of $P_k$ is
$$\dim P_k = \binom{n+k}{k}.$$
The spaces of simplicial finite elements are given by
$$X_h^0 := \{\, v \in L^2(\Omega) \mid v|_T \in P_0 \text{ for all } T \in \mathcal{T}_h \,\}, \qquad (3.15a)$$
$$X_h^k := \{\, v \in C(\bar\Omega) \mid v|_T \in P_k \text{ for all } T \in \mathcal{T}_h \,\}, \quad k \ge 1. \qquad (3.15b)$$
Thus these spaces consist of piecewise polynomials which, for $k \ge 1$, are continuous on $\bar\Omega$.

Remark 3.2.3 From theorem 2.2.12 it follows that $X_h^k \subset H^1(\Omega)$ for all $k \ge 1$.

We will also need simplicial finite element spaces with functions that are zero on $\partial\Omega$:
$$X_{h,0}^k := X_h^k \cap H^1_0(\Omega), \quad k \ge 1. \qquad (3.16)$$
The dimension of $Q_k$ is
$$\dim Q_k = (k+1)^n.$$
The spaces of rectangular finite elements are given by
$$Q_h^0 := \{\, v \in L^2(\Omega) \mid v|_T \in Q_0 \text{ for all } T \in \mathcal{T}_h \,\}, \qquad (3.17a)$$
$$Q_h^k := \{\, v \in C(\bar\Omega) \mid v|_T \in Q_k \text{ for all } T \in \mathcal{T}_h \,\}, \quad k \ge 1, \qquad (3.17b)$$
$$Q_{h,0}^k := Q_h^k \cap H^1_0(\Omega), \quad k \ge 1. \qquad (3.17c)$$
3.3 Approximation properties of finite element spaces

In this section, for $u \in H^2(\Omega)$ we derive bounds for the approximation error $\inf_{v_h \in H_h} \|u - v_h\|_1$ with $H_h = X_h^k$ or $H_h = Q_h^k$ (note that $H_h$ depends on the parameter $k$).

The main idea of the analysis is as follows. First we will introduce an interpolation operator $I_h^k: C(\bar\Omega) \to H_h$. Recall that we assumed $n \le 3$. The Sobolev embedding theorem 2.2.14 yields
$$H^m(\Omega) \hookrightarrow C(\bar\Omega) \quad \text{for } m \ge 2,$$
and thus the interpolation operator is well-defined for $u \in H^m(\Omega)$, $m \ge 2$. We will prove interpolation error bounds of the form (cf. theorem 3.3.9)
$$\|u - I_h^k u\|_t \le C\,h^{m-t}\,|u|_m, \quad t \in \{0,1\}.$$
We first introduce the interpolation operators $I_X^k: C(\bar\Omega) \to X_h^k$ and $I_Q^k: C(\bar\Omega) \to Q_h^k$. Then we formulate some useful results that will be applied to prove the main result in theorem 3.3.9.

We start with the definition of an interpolation operator $I_X^k: C(\bar\Omega) \to X_h^k$. For the description of this operator the so-called barycentric coordinates are useful. Let $T$ be an $n$-simplex with vertices $a_1, \ldots, a_{n+1}$. The set
$$L_k(T) := \Big\{\, \sum_{j=1}^{n+1} \lambda_j a_j \ \Big|\ \lambda_j \in \{0, \tfrac{1}{k}, \ldots, \tfrac{k-1}{k}, 1\} \ \forall j, \ \sum_{j=1}^{n+1} \lambda_j = 1 \,\Big\}$$
is called the principal lattice of order $k$ (in $T$). Examples for $n = 2$ and $n = 3$ are given in the figure.

This principal lattice can be used to determine a unique polynomial $p \in P_k$ by interpolation in the lattice points.
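The principal lattice is easy to enumerate: run over all integer barycentric multi-indices $(m_1, \ldots, m_{n+1})$ with $\sum_j m_j = k$ and form the points $\sum_j (m_j/k)\,a_j$. A sketch (our names), which also confirms that the number of lattice points equals $\dim P_k = \binom{n+k}{k}$:

```python
from itertools import product
from math import comb

def principal_lattice(vertices, k):
    """Points sum_j (m_j / k) a_j with integers m_j >= 0 and sum_j m_j = k."""
    n1 = len(vertices)          # n + 1 vertices of an n-simplex
    dim = len(vertices[0])
    pts = []
    for m in product(range(k + 1), repeat=n1):
        if sum(m) == k:
            pts.append(tuple(sum(m[j] * vertices[j][d] for j in range(n1)) / k
                             for d in range(dim)))
    return pts

# Unit triangle (n = 2) with k = 2: the 3 vertices and 3 edge midpoints,
# matching dim P_2 = C(2 + 2, 2) = 6.
pts = principal_lattice([(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)], 2)
```

The count matching $\dim P_k$ is precisely what makes interpolation in the lattice points determine a unique polynomial of degree $k$.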
For $u \in C(\bar\Omega)$ we define a corresponding function $I_X^k u \in L^2(\Omega)$ by piecewise polynomial interpolation on each simplex $T \in \mathcal{T}_h$: $(I_X^k u)|_T$ is the unique polynomial in $P_k$ that interpolates $u$ in the points of $L_k(T)$.

To check continuity, consider $n = 2$ and two triangles $T_1, T_2$ with common edge $\Gamma$, and set $p_i := (I_X^k u)|_{T_i}$. The $k+1$ points $x_1, \ldots, x_{k+1}$ of the principal lattices that lie on $\Gamma$ are the same for both triangles. Since these $x_j$ are interpolation points we have that $p_1(x_j) = p_2(x_j) = u(x_j)$ for $j = 1, \ldots, k+1$. The functions $(p_i)|_\Gamma$ are one-dimensional polynomials of degree $\le k$. We conclude that $(p_1)|_\Gamma = (p_2)|_\Gamma$ holds, and thus $I_X^k u$ is continuous across the interface $\Gamma$. The case $n = 3$ (or even $n \ge 3$) can be treated similarly.
For the space of rectangular finite elements, $Q_h^k$, an interpolation operator $I_Q^k: C(\bar\Omega) \to Q_h^k$ can be defined in a very similar way. For this we introduce a uniform grid on a rectangle in $\mathbb{R}^n$. For a given interval $[a,b]$ a uniform grid with mesh size $\frac{b-a}{k}$ is given by
$$G^k_{[a,b]} := \{\, a + j\,\tfrac{b-a}{k} \mid 0 \le j \le k \,\}.$$
On an $n$-rectangle $T = \prod_{i=1}^n [a_i, b_i]$ we define a uniform lattice by
$$L_k(T) := \prod_{i=1}^n G^k_{[a_i,b_i]}.$$
With arguments similar to those used in the proof of lemma 3.3.3 one can show the following:

Lemma 3.3.4 For $k \ge 1$ and $u \in C(\bar\Omega)$ we have $I_Q^k u \in Q_h^k$.
For the analysis of the interpolation error we begin with two elementary lemmas.
Lemma 3.3.5 Let $\hat T, T \subset \mathbb{R}^n$ be two sets as in (3.13) and $F(\hat x) = B\hat x + c$ an affine mapping such that $F(\hat T) = T$. Then the following inequalities hold:
$$\|B\|_2 \le \frac{h_T}{\rho_{\hat T}}, \qquad \|B^{-1}\|_2 \le \frac{h_{\hat T}}{\rho_T}.$$

Proof. We will prove the first inequality. The second one then follows from the first one by using $F^{-1}(T) = \hat T$ with $F^{-1}(x) = B^{-1}x - B^{-1}c$.

Note that
$$\|B\|_2 = \frac{1}{\rho_{\hat T}} \max\{\, \|B\hat x\|_2 \mid \hat x \in \mathbb{R}^n,\ \|\hat x\|_2 = \rho_{\hat T} \,\}. \qquad (3.21)$$
Let $B(\hat a; \rho_{\hat T})$ be a ball with centre $\hat a$ and diameter $\rho_{\hat T}$ that is contained in $\hat T$. Take $\hat x \in \mathbb{R}^n$ with $\|\hat x\|_2 = \rho_{\hat T}$. For $\hat y_1 = \hat a + \frac12 \hat x \in \hat T$ and $\hat y_2 = \hat a - \frac12 \hat x \in \hat T$ we have
$$\hat x = \hat y_1 - \hat y_2, \qquad F(\hat y_i) \in T, \ i = 1, 2,$$
and thus
$$\|B\hat x\|_2 = \|B(\hat y_1 - \hat y_2)\|_2 = \|F(\hat y_1) - F(\hat y_2)\|_2 \le h_T. \qquad (3.22)$$
From (3.21) and (3.22) we obtain $\|B\|_2 \le \frac{h_T}{\rho_{\hat T}}$.
Lemma 3.3.6 Let $K$ and $\hat K$ be Lipschitz domains in $\mathbb{R}^n$ that are affine equivalent: $K = F(\hat K)$ with $F(\hat x) = B\hat x + c$, $B$ nonsingular. Then there exists a constant $C$ such that, with $\hat v := v \circ F$,
$$|\hat v|_{m,\hat K} \le C\,\|B\|_2^m\,|\det B|^{-1/2}\,|v|_{m,K} \quad \text{for all } v \in H^m(K), \qquad (3.23a)$$
$$|v|_{m,K} \le C\,\|B^{-1}\|_2^m\,|\det B|^{1/2}\,|\hat v|_{m,\hat K} \quad \text{for all } v \in H^m(K). \qquad (3.23b)$$

Proof. For $m = 0$ the result follows directly from the transformation rule for integrals. For the case $m \ge 1$ we need some basic results on Fréchet derivatives. For $v \in C^\infty(K)$ the Fréchet derivative $D^m v(x): \mathbb{R}^n \times \cdots \times \mathbb{R}^n \to \mathbb{R}$ is an $m$-linear form. Let $e_j$ be the $j$-th basis vector in $\mathbb{R}^n$. For $|\alpha| = m$ and suitable $i_1, \ldots, i_m \in \{1, \ldots, n\}$ we have
$$D^\alpha v(x) = \frac{\partial^{|\alpha|} v(x)}{\partial x_1^{\alpha_1} \cdots \partial x_n^{\alpha_n}} = \frac{\partial^m v(x)}{\partial x_{i_1} \cdots \partial x_{i_m}} = D^m v(x)(e_{i_1}, \ldots, e_{i_m}) \qquad (3.24)$$
(note the subtle difference in notation between $D^\alpha$ and $D^m$). Let $E$ be an $m$-linear form on $\mathbb{R}^n$. Then both
$$\|E\|_2 := \max_{y^i \in \mathbb{R}^n} \frac{|E(y^1, \ldots, y^m)|}{\|y^1\|_2 \cdots \|y^m\|_2} \quad \text{and} \quad \|E\|_\infty := \max_{1 \le i_j \le n} |E(e_{i_1}, \ldots, e_{i_m})|$$
define norms on the space of $m$-linear forms on $\mathbb{R}^n$. Using the norm equivalence property it follows that there exists a constant $c$, independent of $E$, such that
$$\|E\|_2 \le c\,\|E\|_\infty. \qquad (3.25)$$
If we take $E = D^m \hat v(\hat x)$ and use (3.24) together with the chain rule we get
$$D^m \hat v(\hat x)(y^1, \ldots, y^m) = D^m v(F(\hat x))(By^1, \ldots, By^m)$$
and thus
$$\|D^m \hat v(\hat x)\|_2 \le \|B\|_2^m\,\|D^m v(F(\hat x))\|_2. \qquad (3.26)$$
Combination of (3.25) and (3.26), followed by integration over $\hat K$ and transformation of the integral to $K$, proves the result in (3.23a). The result in (3.23b) follows from (3.23a) and $F^{-1}(K) = \hat K$ with $F^{-1}(x) = B^{-1}x - B^{-1}c$.
Lemma 3.3.7 Let $K$ be a Lipschitz domain in $\mathbb{R}^n$. There exists a constant $C$ such that
$$\|u\|_m^2 \le C\Big(|u|_m^2 + \sum_{|\alpha| \le m-1} \Big(\int_K D^\alpha u\,dx\Big)^2\Big) \quad \text{for all } u \in H^m(K).$$

Proof. For $m = 1$ this result is given in (2.79b). From the result in (2.79b) it also follows that
$$\|u\|_{L^2}^2 \le C\Big(|u|_1^2 + \Big(\int_K u\,dx\Big)^2\Big) \quad \text{for all } u \in H^1(K). \qquad (3.27)$$
Note that for $0 \le \ell \le m-1$ the derivatives of $u$ of order $\ell+1$ are exactly the first derivatives of the functions $D^\alpha u$ with $|\alpha| = \ell$. Applying (3.27) to $D^\alpha u$ yields
$$|u|_\ell^2 = \sum_{|\alpha|=\ell} \|D^\alpha u\|_{L^2}^2 \le C \sum_{|\alpha|=\ell} \Big(|D^\alpha u|_1^2 + \Big(\int_K D^\alpha u\,dx\Big)^2\Big) \le C\Big(|u|_{\ell+1}^2 + \sum_{|\alpha|=\ell}\Big(\int_K D^\alpha u\,dx\Big)^2\Big).$$
Applying this estimate successively for $\ell = m-1, m-2, \ldots, 0$ bounds every seminorm $|u|_\ell^2$ by $C\big(|u|_m^2 + \sum_{|\alpha| \le m-1}(\int_K D^\alpha u\,dx)^2\big)$, and summation over $\ell$ in $\|u\|_m^2 = \sum_{\ell=0}^m |u|_\ell^2$ completes the proof.

Lemma 3.3.8 Let $K$ be a Lipschitz domain in $\mathbb{R}^n$. There exists a constant $C$ such that for every $u \in H^m(K)$ there is a polynomial $p \in P_{m-1}$ with $\|u - p\|_m \le C\,|u|_m$.

Proof. Let $p(x) = \sum_{|\alpha| \le m-1} \gamma_\alpha\, x_1^{\alpha_1} \cdots x_n^{\alpha_n} \in P_{m-1}$. For any given $u \in H^m(K)$ one can show that the coefficients $\gamma_\alpha$ can be taken such that
$$\int_K D^\alpha p\,dx = \int_K D^\alpha u\,dx \quad \text{for } |\alpha| \le m-1$$
holds (hint: the ordering $|\alpha| = m-1$, $|\alpha| = m-2$, $\ldots$, yields a linear system for the coefficients with a nonsingular lower triangular matrix). Using the result in lemma 3.3.7 we obtain
$$\|u - p\|_m^2 \le C\Big(|u - p|_m^2 + \sum_{|\alpha| \le m-1}\Big(\int_K D^\alpha (u-p)\,dx\Big)^2\Big) = C\,|u - p|_m^2 = C\,|u|_m^2, \qquad (3.30)$$
where in the last step we used that the $m$-th derivatives of $p \in P_{m-1}$ vanish.
Theorem 3.3.9 Let $\{\mathcal{T}_h\}$ be a regular family of triangulations of $\Omega$ consisting of $n$-simplices and let $X_h^k$ be the corresponding finite element space as in (3.15b). For $2 \le m \le k+1$ and $t \in \{0,1\}$ the following holds:
$$\|u - I_X^k u\|_t \le C\,h^{m-t}\,|u|_m \quad \text{for all } u \in H^m(\Omega). \qquad (3.31)$$
Let $\{\mathcal{T}_h\}$ be a regular family of triangulations of $\Omega$ consisting of $n$-rectangles and let $Q_h^k$ be the corresponding finite element space as in (3.17b). For $2 \le m \le k+1$ and $t \in \{0,1\}$ the following holds:
$$\|u - I_Q^k u\|_t \le C\,h^{m-t}\,|u|_m \quad \text{for all } u \in H^m(\Omega). \qquad (3.32)$$
The constants $C$ in (3.31) and (3.32) are independent of $u$ and of $\mathcal{T}_h \in \{\mathcal{T}_h\}$.

Proof. We will prove the result in (3.31). Very similar arguments can be used to show that the result in (3.32) holds. Take $2 \le m \le k+1$. The constants $C$ used below are all uniform with respect to $u \in H^m(\Omega)$ and $\mathcal{T}_h \in \{\mathcal{T}_h\}$. We will show that for all $t \in \{0,1\}$
$$|u - I_X^k u|_t \le C\,h^{m-t}\,|u|_m$$
holds, with $|\cdot|_0 := \|\cdot\|_{L^2}$. The result in (3.31) then follows from this and from $\|v\|_1^2 = |v|_0^2 + |v|_1^2$.
Due to
$$|u - I_X^k u|_t^2 = \sum_{T \in \mathcal{T}_h} |u - I_X^k u|_{t,T}^2$$
it suffices to bound the interpolation error elementwise. Let $\hat T$ be the unit $n$-simplex and $F: \hat T \to T$ an affine transformation $F(\hat x) = B\hat x + c$ such that $F(\hat T) = T$. Due to the fact that the family $\{\mathcal{T}_h\}$ is regular, there exists a constant $C$ such that
$$\|B\|_2\,\|B^{-1}\|_2 \le c\,\frac{h_T}{\rho_T} \le C. \qquad (3.34)$$
Note that $\|p\| := \sum_{\hat x \in L_k(\hat T)} |p(\hat x)|$ defines a norm on $P_k$. Since all norms on $P_k$ are equivalent there exists a constant $C$ such that
$$\|p\|_{m,\hat T} \le C \sum_{\hat x \in L_k(\hat T)} |p(\hat x)| \quad \text{for all } p \in P_k.$$
Let $\hat I_X^k: C(\hat T) \to P_k$ be the interpolation operator on the unit $n$-simplex as defined in (3.19) (with $T = \hat T$).
Using the boundedness of $\hat I_X^k$, the fact that $\hat I_X^k p = p$ for $p \in P_{m-1} \subset P_k$, and lemma 3.3.8, we get
$$\|\hat u - \hat I_X^k \hat u\|_{m,\hat T} \le C\,|\hat u|_{m,\hat T} \quad \text{for all } \hat u \in H^m(\hat T).$$
For $u \in H^m(\Omega)$ we obtain, using lemma 3.3.5, lemma 3.3.6 and the results in (3.34), (3.37), (3.38), with $\hat u := u \circ F$:
$$|u - I_X^k u|_{t,T} \le C\,\|B^{-1}\|_2^t\,|\det B|^{1/2}\,|\hat u - \hat I_X^k \hat u|_{t,\hat T} \le C\,\|B^{-1}\|_2^t\,|\det B|^{1/2}\,\|\hat u - \hat I_X^k \hat u\|_{m,\hat T}$$
$$\le C\,\|B^{-1}\|_2^t\,|\det B|^{1/2}\,|\hat u|_{m,\hat T} \le C\,\|B^{-1}\|_2^t\,\|B\|_2^m\,|u|_{m,T} = C\,\big(\|B\|_2\|B^{-1}\|_2\big)^t\,\|B\|_2^{m-t}\,|u|_{m,T}$$
$$\le C\,\|B\|_2^{m-t}\,|u|_{m,T} \le C\,h_T^{m-t}\,|u|_{m,T} \le C\,h^{m-t}\,|u|_{m,T}.$$
Summation over $T \in \mathcal{T}_h$ completes the proof.
Corollary 3.3.10 Under the assumptions of theorem 3.3.9 we have, for $2 \le m \le k+1$ and $t \in \{0,1\}$,
$$\inf_{v_h \in X_h^k} \|u - v_h\|_t \le C\,h^{m-t}\,|u|_m, \qquad (3.39)$$
$$\inf_{v_h \in Q_h^k} \|u - v_h\|_t \le C\,h^{m-t}\,|u|_m \qquad (3.40)$$
for all $u \in H^m(\Omega)$. Furthermore, the results in (3.39) and (3.40) hold for $u \in H^m(\Omega) \cap H^1_0(\Omega)$ with $X_h^k$, $Q_h^k$ replaced by $X_{h,0}^k$ and $Q_{h,0}^k$, respectively.

Proof. The first part is clear from theorem 3.3.9. The second part follows from the fact that for $u \in H^m(\Omega) \cap H^1_0(\Omega)$ we have $I_X^k u \in X_{h,0}^k$ and $I_Q^k u \in Q_{h,0}^k$.
We now prove so-called local and global inverse inequalities. These results can be used to bound the $H^1$-norm of a finite element function in terms of its $L^2$-norm.

Lemma (inverse inequalities) Let $\{\mathcal{T}_h\}$ be a regular family of triangulations and let $V_h$ be the corresponding space of simplicial or rectangular finite elements. There exists a constant $c$ independent of $T \in \mathcal{T}_h$ and of $\mathcal{T}_h \in \{\mathcal{T}_h\}$ such that the local inverse inequality
$$|v_h|_{m+1,T} \le c\,h_T^{-1}\,|v_h|_{m,T} \quad \text{for all } v_h \in V_h,\ T \in \mathcal{T}_h$$
holds. If in addition the family of triangulations is quasi-uniform, then there exists a constant $c$ independent of $h$ such that
$$|v_h|_1 \le c\,h^{-1}\,\|v_h\|_{L^2} \quad \text{for all } v_h \in V_h.$$
Proof. We consider the case of simplices; the other case can be treated very similarly. For $T \in \mathcal{T}_h$ let $F(\hat x) = B_T \hat x + c$ be an affine transformation such that $F(\hat T) = T$, where $\hat T$ is the unit simplex. Note that on the finite dimensional space $P_k(\hat T)$ all norms are equivalent. Using lemma 3.3.6 we get, with $\hat v_h = v_h \circ F$,
$$|v_h|_{m+1,T} \le c\,\|B_T^{-1}\|_2^{m+1}\,|\det B_T|^{1/2}\,|\hat v_h|_{m+1,\hat T} \le c\,h_T^{-m-1}\,|\det B_T|^{1/2}\,|\hat v_h|_{m+1,\hat T}$$
$$\le c\,h_T^{-m-1}\,|\det B_T|^{1/2}\,|\hat v_h|_{m,\hat T} \le c\,h_T^{-1}\,|v_h|_{m,T},$$
which proves the local inverse inequality. For $m = 0$ we can sum up these local results and, using the quasi-uniformity assumption (i.e., $h_T^{-1} \le c\,h^{-1}$), we obtain
$$|v_h|_1^2 = \sum_{T \in \mathcal{T}_h} |v_h|_{1,T}^2 \le c \sum_{T \in \mathcal{T}_h} h_T^{-2}\,|v_h|_{0,T}^2 \le c\,h^{-2} \sum_{T \in \mathcal{T}_h} \|v_h\|_{0,T}^2 = c\,h^{-2}\,\|v_h\|_{L^2}^2.$$
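The inverse inequality is sharp: it is saturated by the most oscillatory finite element function on the mesh. A 1D sketch (our names): for the piecewise linear function with alternating nodal values $\pm 1$, the ratio $|v_h|_1 / \|v_h\|_{L^2}$ roughly doubles when $h$ is halved:

```python
import math

def rayleigh_ratio(n):
    """|v_h|_1 / ||v_h||_L2 for the oscillating P1 function with nodal values
    (-1)^i at the interior nodes, zero at the ends, on n elements of (0, 1)."""
    h = 1.0 / n
    vals = [0.0] + [(-1.0) ** i for i in range(1, n)] + [0.0]
    grad_sq = sum((vals[j + 1] - vals[j]) ** 2 / h for j in range(n))
    # Exact L2 norm of a linear segment with endpoint values a, b:
    # int_0^h (linear)^2 dx = h (a^2 + a b + b^2) / 3.
    l2_sq = sum(h * (vals[j] ** 2 + vals[j] * vals[j + 1] + vals[j + 1] ** 2) / 3.0
                for j in range(n))
    return math.sqrt(grad_sq / l2_sq)

r16, r32 = rayleigh_ratio(16), rayleigh_ratio(32)
growth = r32 / r16   # roughly 2: |v_h|_1 <= c h^{-1} ||v_h||_L2 is sharp
```

This growth like $h^{-1}$ is exactly what the global inverse inequality allows, and no better bound is possible.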
We consider the variational problem: find $u \in H^1_0(\Omega)$ such that $k(u,v) = f(v)$ for all $v \in H^1_0(\Omega)$ (3.41), with
$$k(u,v) = \int_\Omega \big(\nabla u^T A\,\nabla v + (b\cdot\nabla u)\,v + c\,u v\big)\,dx, \qquad f(v) = \int_\Omega f v\,dx, \qquad (3.42a)$$
$$\text{with } -\tfrac12\,\mathrm{div}\,b + c \ge 0 \ \text{ a.e. in } \Omega, \qquad (3.42b)$$
$$\exists\,\alpha_0 > 0: \quad \xi^T A(x)\,\xi \ge \alpha_0\,\xi^T\xi \quad \text{for all } \xi \in \mathbb{R}^n,\ x \in \Omega, \qquad (3.42c)$$
$$\text{and } a_{ij} \in L^\infty(\Omega)\ \forall i,j, \quad b_i \in H^1(\Omega) \cap L^\infty(\Omega)\ \forall i, \quad c \in L^\infty(\Omega). \qquad (3.42d)$$
For the Galerkin discretization we use finite element subspaces $H_h = X_{h,0}^k$ or $H_h = Q_{h,0}^k$. The discrete problem with simplicial finite elements is given by
$$\text{find } u_h \in X_{h,0}^k \text{ such that } k(u_h, v_h) = f(v_h) \quad \text{for all } v_h \in X_{h,0}^k. \qquad (3.43)$$
We prove bounds for the discretization error $\|u - u_h\|_1$ (section 3.4.1) and $\|u - u_h\|_{L^2}$ (section 3.4.2).
We have the following result concerning the discretization error:
Theorem 3.4.1 Assume that the conditions (3.42b)-(3.42d) are fulfilled and that the solution $u \in H^1_0(\Omega)$ of (3.41) lies in $H^m(\Omega)$ with $m \ge 2$. Let $u_h$ be the solution of (3.43). For $2 \le m \le k+1$ the following holds:
$$\|u - u_h\|_1 \le C\,h^{m-1}\,|u|_m,$$
with a constant $C$ independent of $u$ and of $\mathcal{T}_h \in \{\mathcal{T}_h\}$.

Proof. From the proof of theorem 2.5.3 it follows that the bilinear form $k(\cdot,\cdot)$ is continuous and $H^1_0(\Omega)$-elliptic. From corollary 3.1.2 it follows that the continuous and discrete problems have unique solutions and that
$$\|u - u_h\|_1 \le C \inf_{v_h \in X_{h,0}^k} \|u - v_h\|_1$$
holds. Combining this with corollary 3.3.10 (with $t = 1$) completes the proof.
A very similar result holds for the Galerkin discretization with rectangular finite elements. Let $\{\mathcal{T}_h\}$ be a regular family of triangulations of $\Omega$ consisting of $n$-rectangles and $Q_{h,0}^k$, $k \ge 1$, the corresponding finite element space as in (3.17c). The discrete problem is given by
$$\text{find } u_h \in Q_{h,0}^k \text{ such that } k(u_h, v_h) = f(v_h) \quad \text{for all } v_h \in Q_{h,0}^k. \qquad (3.44)$$

Theorem 3.4.2 Assume that the conditions in (3.42b)-(3.42d) are fulfilled and that the solution $u \in H^1_0(\Omega)$ of (3.41) lies in $H^m(\Omega)$ with $m \ge 2$. Let $u_h$ be the solution of (3.44). For $2 \le m \le k+1$ the following holds:
$$\|u - u_h\|_1 \le C\,h^{m-1}\,|u|_m,$$
with a constant $C$ independent of $u$ and of $\mathcal{T}_h \in \{\mathcal{T}_h\}$.

Proof. The same arguments as in the proof of theorem 3.4.1 can be used.
Note that in the preceding two theorems we used the smoothness assumption $u \in H^1_0(\Omega) \cap H^m(\Omega)$ with $m \ge 2$. Sufficient conditions for this to hold are given in section 2.5.4, theorem 2.5.14 and theorem 2.5.16. In the literature one can find discretization error bounds for the case when $u$ is less regular, i.e., $u \in H^1_0(\Omega)$ but $u \notin H^2(\Omega)$ (cf., for example, [?]). One simple result for the case of minimal smoothness ($u \in H^1_0(\Omega)$ only) is given in:

Theorem 3.4.3 Assume that the conditions of theorem 2.5.3 are fulfilled. Let $u_h$ be the solution of (3.43). Then we have:
$$\lim_{h \to 0} \|u - u_h\|_1 = 0.$$

Proof. Define $V := H^1_0(\Omega) \cap H^2(\Omega)$. Note that $\overline{V}^{\,\|\cdot\|_1} = H^1_0(\Omega)$. Take $\varepsilon > 0$. From corollary 3.1.2 we obtain
$$\|u - u_h\|_1 \le C \inf_{v_h \in X_{h,0}^k} \|u - v_h\|_1. \qquad (3.45)$$
Remark 3.4.4 Comment on results for cases with other boundary conditions ....

For the analysis of the error in the $L^2$-norm we use a duality argument. For $g \in L^2(\Omega)$ consider the dual problem:
$$\text{find } \tilde u_g \in H^1_0(\Omega) \text{ such that } k(v, \tilde u_g) = \int_\Omega g v\,dx \quad \text{for all } v \in H^1_0(\Omega). \qquad (3.48)$$
Note that if $k(\cdot,\cdot)$ is continuous and $H^1_0(\Omega)$-elliptic then this dual problem has a unique solution. The dual problem is said to be $H^2$-regular (cf. section 2.5.4) if $\tilde u_g \in H^2(\Omega)$ and $\|\tilde u_g\|_2 \le C\,\|g\|_{L^2}$ holds with a constant $C$ independent of $g$.
The following result concerning the finite element discretization error holds:

Theorem 3.4.5 Suppose that the assumptions of theorem 3.4.1 [theorem 3.4.2] are fulfilled and that the dual problem (3.48) is $H^2$-regular. For $2 \le m \le k+1$ the inequality
$$\|u - u_h\|_{L^2} \le C\,h^m\,|u|_m$$
holds with a constant $C$ independent of $u$ and of $\mathcal{T}_h \in \{\mathcal{T}_h\}$.

Proof. We give the proof for the case of simplicial finite elements. Exactly the same arguments can be used for rectangular finite elements. The bilinear form $k(\cdot,\cdot)$ is continuous and $H^1_0(\Omega)$-elliptic and thus the problem (3.41), its Galerkin discretization and the dual problem (3.48) are uniquely solvable. Define $e_h = u - u_h$ and note that $e_h \in H^1_0(\Omega)$. Let $\tilde u \in H^1_0(\Omega) \cap H^2(\Omega)$ be the solution of the dual problem
$$k(v, \tilde u) = \int_\Omega e_h v\,dx \quad \text{for all } v \in H^1_0(\Omega).$$
Using the Galerkin orthogonality $k(e_h, v_h) = 0$ for all $v_h \in X_{h,0}^k$ we get, for arbitrary $v_h \in X_{h,0}^k$,
$$\|e_h\|_{L^2}^2 = k(e_h, \tilde u) = k(e_h, \tilde u - v_h) \le M\,\|e_h\|_1\,\|\tilde u - v_h\|_1.$$
From corollary 3.3.10 and the $H^2$-regularity of the dual problem we obtain
$$\|e_h\|_{L^2}^2 \le C\,h\,\|e_h\|_1\,\|\tilde u\|_2 \le C\,h\,\|e_h\|_1\,\|e_h\|_{L^2},$$
hence $\|e_h\|_{L^2} \le C\,h\,\|e_h\|_1 \le C\,h^m\,|u|_m$, using theorem 3.4.1 in the last step.

Remark 3.4.6 Comment on sufficient conditions for $H^2$-regularity of the dual problem ...
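The extra power of $h$ gained by the duality argument can be observed numerically. A 1D sketch (our construction, not from the text): for $u = \sin(\pi x)$ the $L^2$ interpolation error of piecewise linears behaves like $O(h^2)$ ($m = 2$, $t = 0$), one power of $h$ better than the $H^1$ bound; we approximate the per-element integrals with Simpson's rule:

```python
import math

def l2_interp_error(n):
    """||u - I_h u||_L2 for u = sin(pi x), piecewise linear interpolation on
    n uniform elements, with Simpson quadrature per element."""
    h = 1.0 / n
    total = 0.0
    for j in range(n):
        a, b = j * h, (j + 1) * h
        ua, ub = math.sin(math.pi * a), math.sin(math.pi * b)
        def sq_err(x):
            lin = ua + (ub - ua) * (x - a) / h      # I_h u on [a, b]
            return (math.sin(math.pi * x) - lin) ** 2
        # Simpson's rule on [a, b] (endpoint errors vanish at the nodes).
        total += h / 6.0 * (sq_err(a) + 4.0 * sq_err((a + b) / 2) + sq_err(b))
    return math.sqrt(total)

e16, e32 = l2_interp_error(16), l2_interp_error(32)
rate = e16 / e32   # about 4: O(h^2) in L2, versus O(h) in H1
```

Halving $h$ divides the $L^2$ error by about four, matching the $h^m$ rate of theorem 3.4.5 for $m = 2$.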
and thus all coefficients are zero. This yields that the functionals $\psi_i$, $i = 1, \ldots, N$, are independent. Hence, $N \le k := \dim(H) = \dim(H')$ holds. We now show that $N \ge k$ holds, too. Let $v_1, \ldots, v_k$ be a basis of $H$. Define the matrix $L \in \mathbb{R}^{N \times k}$ by $L_{ij} = \psi_i(v_j)$. Let $x \in \mathbb{R}^k$ be such that $Lx = 0$. We then have
$$\sum_{j=1}^k \psi_i(v_j)\,x_j = \psi_i\Big(\sum_{j=1}^k x_j v_j\Big) = 0 \quad \text{for all } i = 1, \ldots, N.$$
Using (3.51b) this yields $\sum_{j=1}^k x_j v_j = 0$ and thus $x = 0$. Hence, $L$ has full column rank and thus $N \ge k$ holds.

The set $(\phi_i)_{1 \le i \le N}$ as in (3.51) forms a basis of $H$ and is called the dual basis of $(\psi_i)_{1 \le i \le N}$.
We now construct a so-called nodal basis of the space of simplicial finite elements $X_{h,0}^k$. We will associate a basis function to each interpolation point in the principal lattice of $T \in \mathcal{T}_h$ that does not lie on $\partial\Omega$. To make this more precise, for an admissible triangulation $\mathcal{T}_h$ consisting of $n$-simplices, we introduce the grid
$$\bigcup_{T \in \mathcal{T}_h} \{\, x_j \in L_k(T) \mid x_j \notin \partial\Omega \,\} =: \{x_1, \ldots, x_N\} =: \mathcal{V}.$$
The function $\phi_i$ is defined by the conditions, for all $T \in \mathcal{T}_h$:
$$(\phi_i)|_T \in P_k \quad \text{and} \quad \forall\, x_j \in L_k(T): \ \phi_i(x_j) = \begin{cases} 0 & \text{if } x_j \ne x_i, \\ 1 & \text{if } x_j = x_i. \end{cases} \qquad (3.52)$$
From lemma 3.3.3 it follows that for all $k \ge 1$ we have $\phi_i \in X_{h,0}^k$. Thus we have a collection of functions $(\phi_i)_{1 \le i \le N}$ with the properties $\phi_i \in X_{h,0}^k$ and $\phi_i(x_j) = \delta_{ij}$. One easily verifies that for $\phi_i$ and $\psi_i: v \mapsto v(x_i)$, $i = 1, \ldots, N$, the conditions of lemma 3.5.1 are satisfied. Due to the property $\phi_i(x_j) = \delta_{ij}$ the functions $\phi_i$ are called nodal basis functions. In exactly the same way one can construct nodal basis functions for other finite element spaces like, for example, $X_h^k$, $Q_{h,0}^k$, $Q_h^k$.
We consider the discrete problem (3.43) and use the nodal basis $(\phi_i)_{1 \le i \le N}$ of $X_{h,0}^k$ to reformulate this problem as explained in (3.11)-(3.12). This results in the linear system of equations
$$K_h \mathbf{u} = \mathbf{f}_h, \qquad (K_h)_{ij} = k(\phi_j, \phi_i), \quad (\mathbf{f}_h)_i = f(\phi_i). \qquad (3.54)$$
The matrix $K_h$ is called the stiffness matrix. In the remainder of this section we derive some important properties of this matrix that will play an important role in chapters 6-9. In these chapters we discuss iterative solution methods for the linear system in (3.54). Below we assume that for the bilinear form $k(\cdot,\cdot)$ the conditions (3.42b)-(3.42d) are satisfied.
Lemma 3.5.3 Let $\{\mathcal{T}_h\}$ be a regular family of triangulations consisting of $n$-simplices and for each $\mathcal{T}_h$ let $K_h$ be the stiffness matrix defined in (3.54). There exists a constant $q$ independent of $\mathcal{T}_h \in \{\mathcal{T}_h\}$ such that
$$\max\{\, q_{\mathrm{row}}(K_h),\ q_{\mathrm{col}}(K_h) \,\} \le q,$$
where $q_{\mathrm{row}}$ and $q_{\mathrm{col}}$ denote the maximal number of nonzero entries per row and per column, respectively.

Proof. Take a fixed $i$ with $1 \le i \le N$. Define a neighbourhood of $x_i$ by
$$N_{x_i} := \bigcup \{\, T \in \mathcal{T}_h \mid x_i \in L_k(T) \,\} = \mathrm{supp}(\phi_i).$$
From the assumption that we have a regular family of triangulations it follows that the number of simplices contained in $N_{x_i}$ is bounded,
$$\#\{\, T \in \mathcal{T}_h \mid T \subset N_{x_i} \,\} \le M, \qquad (3.55)$$
with a constant $M$ independent of $i$ and of $\mathcal{T}_h \in \{\mathcal{T}_h\}$. Assume that $(K_h)_{ij} \ne 0$. Using the fact that we have a nodal basis it follows that
$$x_j \in N_{x_i},$$
i.e., $x_j$ is a lattice point in $N_{x_i}$. Using (3.55) we get that the number of lattice points in $N_{x_i}$ can be bounded by a constant, say $q$, independent of $i$ and of $\mathcal{T}_h \in \{\mathcal{T}_h\}$. Hence $q_{\mathrm{row}}(K_h) \le q$ holds. The same arguments apply if one interchanges $i$ and $j$.
Note that the constant $q$ depends on the degree $k$ used in the finite element space $X_{h,0}^k$. The result of this lemma shows that the number of nonzero entries in the $N \times N$ matrix $K_h$ is bounded by $qN$. If $h \to 0$ then $N \to \infty$ and the number of nonzero entries in $K_h$ remains proportional to $N$. Therefore the stiffness matrix is said to be sparse.
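In 1D the sparsity is immediate: a piecewise linear nodal basis function overlaps only its two neighbours, so $q = 3$ independent of $N$. A sketch (the dict-of-keys storage is our ad hoc choice):

```python
def stiffness_1d(n):
    """Nonzero entries of the stiffness matrix for -u'' on (0,1) with n
    interior nodes (h = 1/(n+1)), stored as a dict keyed by (i, j)."""
    h = 1.0 / (n + 1)
    K = {}
    for i in range(n):
        K[(i, i)] = 2.0 / h          # k(phi_i, phi_i)
        if i + 1 < n:
            K[(i, i + 1)] = -1.0 / h  # neighbouring basis functions overlap
            K[(i + 1, i)] = -1.0 / h
    return K

def q_row(K, n):
    """Maximal number of nonzero entries per row."""
    counts = [0] * n
    for (i, _) in K:
        counts[i] += 1
    return max(counts)

q100, q1000 = q_row(stiffness_1d(100), 100), q_row(stiffness_1d(1000), 1000)
# q stays 3 while the total number of nonzeros grows only like q * N.
```

Storing only the $O(N)$ nonzeros, rather than the full $N \times N$ array, is what makes the iterative methods of chapters 6-9 efficient.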
Lemma 3.5.6 Let $A \in \mathbb{R}^{N \times N}$ be a symmetric positive definite matrix and let $q$ be such that $q_{\mathrm{row}}(A) \le q$. For any nonsingular diagonal matrix $D \in \mathbb{R}^{N \times N}$ we have, with $D_A := \mathrm{diag}(A)$,
$$\kappa\big(D_A^{-1/2} A\, D_A^{-1/2}\big) \le q\,\kappa(DAD).$$

Proof. Define $\hat A = D_A^{-1/2} A D_A^{-1/2}$ and note that this matrix is symmetric positive definite and $\mathrm{diag}(\hat A) = I$. Let $\hat A = LL^T$ be the Cholesky factorization of $\hat A$. Let $e_i$ be the $i$-th standard basis vector in $\mathbb{R}^N$. Then we have
$$1 = \hat a_{ii} = \|L^T e_i\|_2^2,$$
hence $|\hat a_{ij}| = |\langle L^T e_i, L^T e_j\rangle| \le 1$, and thus $\|\hat A\|_2 \le \|\hat A\|_\infty \le q$ holds. For an arbitrary nonsingular diagonal matrix $D \in \mathbb{R}^{N \times N}$ we have:

The result in this lemma shows that for the sparse symmetric positive definite stiffness matrix $K_h$ the symmetric scaling with the diagonal matrix $D_{K_h} := \mathrm{diag}(K_h)$ is in a certain sense optimal. Hence, we investigate the condition number of the scaled matrix
$$\tilde K_h := D_{K_h}^{-1/2}\,K_h\,D_{K_h}^{-1/2}. \qquad (3.56)$$
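The effect of the symmetric diagonal scaling can be seen already for a $2 \times 2$ example (the matrix entries below are ad hoc choices of ours): a badly scaled SPD matrix has a huge condition number, while $D_A^{-1/2} A D_A^{-1/2}$ has unit diagonal and a moderate one:

```python
import math

def eig_sym2(a, b, d):
    """Eigenvalues of the symmetric 2x2 matrix [[a, b], [b, d]]."""
    tr, det = a + d, a * d - b * b
    s = math.sqrt(tr * tr - 4.0 * det)
    return (tr - s) / 2.0, (tr + s) / 2.0

def kappa2(a, b, d):
    lo, hi = eig_sym2(a, b, d)
    return hi / lo

# Badly scaled SPD matrix A = [[1, 5e-4], [5e-4, 1e-6]]:
a, b, d = 1.0, 5e-4, 1e-6
kappa_orig = kappa2(a, b, d)
# Symmetric scaling with D_A = diag(A): the diagonal becomes 1 and the
# off-diagonal entry becomes b / sqrt(a * d) = 0.5, so kappa = 1.5 / 0.5 = 3.
kappa_scaled = kappa2(1.0, b / math.sqrt(a * d), 1.0)
```

Here the scaling reduces the condition number from roughly $10^6$ to $3$, which is why (3.56) is the natural matrix to study for non quasi-uniform meshes.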
For the analysis we use the Sobolev embeddings
$$H^1(\Omega) \hookrightarrow L^6(\Omega) \quad \text{for } n = 3, \qquad (3.58)$$
$$H^1(\Omega) \hookrightarrow L^q(\Omega) \quad \text{for } n = 2,\ q \ge 1. \qquad (3.59)$$
For the embedding in (3.59) one can analyze the dependence of the norm of the embedding operator on $q$. This results in (cf. [9]):
$$\|u\|_{L^q} \le C\,\sqrt{q}\,\|u\|_1 \quad \text{for all } u \in H^1(\Omega),\ q \ge 1, \qquad (3.60)$$
with $C$ independent of $q$. We also use the following observation: if there are constants $c_1 > 0$ and $c_2$ such that
$$c_1\,\langle D_{K_h} v, v\rangle \le \langle K_h v, v\rangle \le c_2\,\langle D_{K_h} v, v\rangle \quad \text{for all } v \in \mathbb{R}^N, \qquad (3.61)$$
then $\kappa(\tilde K_h) \le \frac{c_2}{c_1}$ holds.

Theorem 3.5.7 Let $\{\mathcal{T}_h\}$ be a regular family of triangulations of $\Omega$ consisting of $n$-simplices and let $K_h$ be the stiffness matrix in (3.54). There exists a constant $C$, independent of $\mathcal{T}_h \in \{\mathcal{T}_h\}$, such that
$$\kappa(\tilde K_h) \le C\,N^{2/3} \ \text{ for } n = 3, \qquad \kappa(\tilde K_h) \le C\Big(1 + \log\frac{h}{h_{\min}}\Big)N \ \text{ for } n = 2, \qquad (3.57)$$
where $h_{\min} := \min_{T \in \mathcal{T}_h} h_T$.
For v RN we define u := N
P
i=1 vi i . Note that each nodal basis function i is associated to
a grid point xi such that i (xi ) = 1, i (xj ) = 0 for j 6= i. The set of grid points (xi )1iN is
denoted by V. We have
X X
DKh ii u(xi )2 = k(i , i )u(xi )2
hDKh v, vi = (3.62)
xi V xi V
There are constants $d_1 > 0$ and $d_2$ such that $d_1|\varphi_i|_1^2 \le k(\varphi_i,\varphi_i) \le d_2|\varphi_i|_1^2$ for all $i$. Using the lemmas 3.3.5 and 3.3.6 one can show that there are constants $d_1 > 0$ and $d_2$ independent of $\mathcal T_h \in \{\mathcal T_h\}$ such that
\[ d_1 h_T^{n-2} \le k(\varphi_i,\varphi_i) \le d_2 h_T^{n-2} \quad \text{for all } x_i \in T,\ T \in \mathcal T_h. \tag{3.63} \]
Combining this with (3.62) and the norm equivalence on each element $T$ one obtains
\[ d_1\sum_{T\in\mathcal T_h} h_T^{-2}\|u\|_{0,T}^2 \le \langle D_{K_h} v, v\rangle \le d_2\sum_{T\in\mathcal T_h} h_T^{-2}\|u\|_{0,T}^2. \tag{3.64} \]
it follows that the second inequality in (3.61) holds with a constant $c_2$ independent of $v$ and of $\mathcal T_h \in \{\mathcal T_h\}$. We now consider the first inequality in (3.61). First note that for arbitrary $\beta > 2$, $\alpha \ge 0$ and $w \in L^\beta(\Omega)$ we have, using the discrete Hölder inequality:
\[ \sum_{T\in\mathcal T_h} h_T^{-\alpha}\int_T w^2\,dx \le \Big(\sum_{T\in\mathcal T_h} h_T^{-\frac{\alpha\beta}{\beta-2}}\Big)^{\frac{\beta-2}{\beta}}\Big(\sum_{T\in\mathcal T_h}\Big(\int_T w^2\,dx\Big)^{\frac\beta2}\Big)^{\frac2\beta} \le h_{\min}^{-\alpha}\,N^{1-\frac2\beta}\Big(\sum_{T\in\mathcal T_h}\Big(\int_T w^2\,dx\Big)^{\frac\beta2}\Big)^{\frac2\beta} \le C\,h_{\min}^{-\alpha}\,N^{1-\frac2\beta}\,\|w\|_{L^\beta(\Omega)}^2. \tag{3.66} \]
We now distinguish $n = 3$ and $n = 2$. First we treat $n = 3$. We use the Hölder inequality to get
\[ \langle D_{K_h} v, v\rangle \le C\sum_{T\in\mathcal T_h} h_T^{-2}\|u\|_{0,T}^2 = C\sum_{T\in\mathcal T_h} h_T^{-2}\int_T u^2\,dx \le C\sum_{T\in\mathcal T_h}\Big(\int_T h_T^{-2p}\,dx\Big)^{\frac1p}\Big(\int_T u^{2q}\,dx\Big)^{\frac1q} \quad \Big(\frac1p + \frac1q = 1\Big) \le C\sum_{T\in\mathcal T_h} h_T^{\frac{3-2p}{p}}\Big(\int_T u^{2q}\,dx\Big)^{\frac1q}. \]
Now take $p = \frac32$, $q = 3$ and apply (3.66) with $\alpha = 0$, $\beta = 6$. This yields
\[ \langle D_{K_h} v, v\rangle \le C\sum_{T\in\mathcal T_h}\Big(\int_T u^6\,dx\Big)^{\frac13} \le C\,N^{\frac23}\,\|u\|_{L^6(\Omega)}^2. \tag{3.67} \]
Combination of the results in (3.65) and (3.67) proves the result in (3.57) for $n = 3$.
We consider $n = 2$. Using the Hölder inequality it follows that for $p > 1$:
\[ \sum_{T\in\mathcal T_h} h_T^{-2}\|u\|_{0,T}^2 \le \sum_{T\in\mathcal T_h} h_T^{-2}\Big(\int_T u^{2p}\,dx\Big)^{\frac1p}\Big(\int_T 1\,dx\Big)^{1-\frac1p} \le C\sum_{T\in\mathcal T_h} h_T^{-\frac2p}\Big(\int_T u^{2p}\,dx\Big)^{\frac1p}. \]
We apply (3.66) with $\alpha = \frac2p$, $\beta = 2p > 2$ and use the result in (3.60). This yields
\[ \langle D_{K_h} v, v\rangle \le C\,h_{\min}^{-\frac2p}N^{1-\frac1p}\|u\|_{L^{2p}(\Omega)}^2 \le C\,p\,h_{\min}^{-\frac2p}N^{1-\frac1p}\|u\|_1^2. \]
Using $N^{-1} \le C h^2$ (which holds for $n = 2$) and $\langle K_h v, v\rangle = k(u,u) \ge c\,\|u\|_1^2$ this yields
\[ \langle D_{K_h} v, v\rangle \le C\,p\Big(\frac{h}{h_{\min}}\Big)^{\frac2p}N\,\|u\|_1^2 \le C\,p\Big(\frac{h}{h_{\min}}\Big)^{\frac2p}N\,\langle K_h v, v\rangle. \]
The constant $C$ can be chosen independent of $p$. For $p = \max\big\{2, \log\frac{h}{h_{\min}}\big\}$ we have $p\big(\frac{h}{h_{\min}}\big)^{\frac2p} \le C\big(1 + \log\frac{h}{h_{\min}}\big)$ and thus
\[ \langle D_{K_h} v, v\rangle \le C\Big(1 + \log\frac{h}{h_{\min}}\Big)N\,\langle K_h v, v\rangle. \tag{3.68} \]
Combination of the results in (3.65) and (3.68) proves the result (3.57) for n = 2.
Remark 3.5.8 In [9] one can find an example which shows that for $n = 2$ the logarithmic term cannot be avoided. If the family of triangulations is quasi-uniform then $\frac{h}{h_{\min}} \le \sigma$ for a constant $\sigma$ independent of $\mathcal T_h \in \{\mathcal T_h\}$ and furthermore, $N = O(h^{-2})$ for $n = 2$, $N^{\frac23} = O(h^{-2})$ for $n = 3$. Hence, for the quasi-uniform case we have $\kappa(\check K_h) \le C h^{-2}$ for $n = 2$ and $n = 3$. Moreover, in this case the diagonal of $K_h$ is well-conditioned, $\kappa(D_{K_h}) = O(1)$, and thus the scaling in (3.56) is not essential. We emphasize that for the general case of a regular (possibly non quasi-uniform) family of triangulations the scaling is essential: a result as in (3.57) does in general not hold for the matrix $K_h$. Finally we note that for the quasi-uniform case it is not difficult to prove that there exists a constant $C > 0$ independent of $\mathcal T_h \in \{\mathcal T_h\}$ such that $\kappa(K_h) \ge C h^{-2}$ holds, both for $n = 2$ and $n = 3$.
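The $O(h^{-2})$ growth of the condition number in the quasi-uniform case is easy to observe in one dimension (a sketch of my own; for $n = 1$ the same $h^{-2}$ behaviour holds). Halving $h$ should roughly quadruple $\kappa(K_h)$:

```python
import numpy as np

def cond_uniform_stiffness(N):
    """Condition number of the P1 stiffness matrix of -u'' on a
    uniform grid with N interior nodes, h = 1/(N+1)."""
    h = 1.0 / (N + 1)
    K = (np.diag(np.full(N, 2.0)) + np.diag(np.full(N - 1, -1.0), 1)
         + np.diag(np.full(N - 1, -1.0), -1)) / h
    return np.linalg.cond(K)

c1, c2 = cond_uniform_stiffness(50), cond_uniform_stiffness(100)
ratio = c2 / c1
# halving the mesh width should multiply kappa by roughly 4
assert 3.0 < ratio < 5.5
print(c1, c2, ratio)
```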
3.5.1 Mass matrix
Apart from the stiffness matrix the so-called mass matrix also plays an important role in finite element methods. This matrix depends on the choice of the basis in the finite element space but not on the bilinear form $k(\cdot,\cdot)$.
Let $(\varphi_i)_{1\le i\le N}$ be the nodal basis of the finite element space $X_{h,0}^k$ as defined in (3.52). The mass matrix $M_h \in \mathbb{R}^{N\times N}$ is given by
\[ (M_h)_{ij} = \int_\Omega \varphi_i\varphi_j\,dx = \langle \varphi_i, \varphi_j\rangle_{L^2}. \tag{3.69} \]
Note that this matrix is symmetric positive definite. As for the stiffness matrix we use a diagonal scaling with the diagonal matrix $D_{M_h} := \mathrm{diag}(M_h)$. The next result shows that the scaled mass matrix is uniformly well-conditioned:
Theorem 3.5.9 Let $\{\mathcal T_h\}$ be a regular family of triangulations consisting of $n$-simplices and for each $\mathcal T_h$ let $M_h$ be the mass matrix defined in (3.69). Then there exists a constant $C$ independent of $\mathcal T_h \in \{\mathcal T_h\}$ such that
\[ \kappa\big(D_{M_h}^{-\frac12} M_h D_{M_h}^{-\frac12}\big) \le C. \]
Proof. The nodal point associated to $\varphi_i$ is denoted by $x_i$ ($1 \le i \le N$). Using lemma 3.3.6 and the norm equivalence property in the space $\mathcal P_k(T)$ it follows that there exist constants $c_1 > 0$ and $c_2$ such that
\[ c_1\|u\|_{0,T}^2 \le |T|\sum_{z_i\in L_k(T)} u(z_i)^2 \le c_2\|u\|_{0,T}^2 \quad \text{for all } u \in \mathcal P_k(T). \tag{3.70} \]
Define $d_i := |\mathrm{supp}(\varphi_i)|$. For $i \in I_T$ the quantity $|T|\,d_i^{-1}$ is uniformly (w.r.t. $T$) bounded both from below by a strictly positive constant and from above (by 1). If we combine this with the result in (3.70) we get (with different constants $c_1 > 0$, $c_2$):
\[ c_1\langle M_h v, v\rangle \le \sum_{T\in\mathcal T_h}\sum_{i\in I_T} d_i v_i^2 \le c_2\langle M_h v, v\rangle. \]
Hence
\[ c_1\langle M_h v, v\rangle \le \sum_{i=1}^N d_i v_i^2 \le c_2\langle M_h v, v\rangle \]
with $c_1 > 0$. Note that
\[ (D_{M_h})_{ii} = \langle M_h e_i, e_i\rangle = \int_{\mathrm{supp}(\varphi_i)} \varphi_i^2\,dx, \]
thus there are constants $c_1 > 0$, $c_2$ independent of $i$ such that $c_1 d_i \le (D_{M_h})_{ii} \le c_2 d_i$. We then obtain
\[ c_1\langle M_h v, v\rangle \le \langle D_{M_h} v, v\rangle \le c_2\langle M_h v, v\rangle \]
with $c_1 > 0$. Thus the result is proved.
Using this in combination with the quasi-uniformity of $\{\mathcal T_h\}$ it follows that the spectral condition number of $D_{M_h} = \mathrm{diag}(M_h)$ is uniformly bounded. Now apply theorem 3.5.9.
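The uniform bound of theorem 3.5.9 can be checked numerically. The sketch below (my own illustration, 1D case, graded mesh as an example of a non quasi-uniform family) computes the condition number of the diagonally scaled mass matrix for two refinement levels; it stays $O(1)$:

```python
import numpy as np

def mass_graded(N, grading=3.0):
    """P1 mass matrix on (0,1) for the graded mesh x_i = (i/(N+1))**grading."""
    x = (np.arange(N + 2) / (N + 1.0)) ** grading
    h = np.diff(x)
    M = np.zeros((N, N))
    for i in range(N):
        M[i, i] = (h[i] + h[i + 1]) / 3.0
        if i + 1 < N:
            M[i, i + 1] = M[i + 1, i] = h[i + 1] / 6.0
    return M

for N in (100, 400):
    M = mass_graded(N)
    d = np.sqrt(np.diag(M))
    c = np.linalg.cond(M / np.outer(d, d))
    # theorem 3.5.9: the scaled condition number stays O(1) under refinement
    assert c < 10.0
    print(N, c)
```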
Chapter 4

4.1 Introduction

In this chapter we consider the convection-diffusion boundary value problem
\[ -\varepsilon\Delta u + b\cdot\nabla u + cu = f \quad \text{in } \Omega, \tag{4.1} \]
\[ u = 0 \quad \text{on } \partial\Omega. \]
In the remainder of this section we briefly discuss the topic of regularity of the variational problem (4.2). In section 2.5.4 regularity results of the form $\|u\|_m \le C\|f\|_{m-2}$, $m = 1, 2, \ldots$, with a constant $C$ independent of $f$, are presented (under smoothness assumptions on the coefficients and on the domain). In the convection-dominated case it is of interest to analyze the dependence of the (stability) constant $C$ on $\varepsilon$. An important result of this analysis is given in the following theorem.
Proof. Using partial integration, (4.4) and the Poincaré–Friedrichs inequality we get
\[ k(u,u) = \varepsilon|u|_1^2 + \int_\Omega\Big(-\frac12\,\mathrm{div}\,b + c\Big)u^2\,dx \ge \varepsilon|u|_1^2 + \gamma_0\|u\|_{L^2}^2 \ge c\big(\varepsilon\|u\|_1^2 + \|u\|_{L^2}^2\big). \]
Since $k(u,u) = \langle f, u\rangle_{L^2} \le \|f\|_{L^2}\|u\|_{L^2}$, this yields
\[ \sqrt\varepsilon\,\|u\|_1 + \|u\|_{L^2} \le 2\big(\varepsilon\|u\|_1^2 + \|u\|_{L^2}^2\big)^{\frac12} \le 2\sqrt2\,c^{-1}\|f\|_{L^2}, \]
and thus the result in (4.5) holds. If $u \in H^2(\Omega)$ then the equality $-\varepsilon\Delta u + b\cdot\nabla u + cu = f$ holds (where all derivatives are weak ones). Hence, using (4.5) and $\varepsilon \le 1$, we obtain
\[ \varepsilon\|\Delta u\|_{L^2} \le \|f\|_{L^2} + \|b\|_{L^\infty}\|u\|_1 + \|c\|_{L^\infty}\|u\|_{L^2} \le c\,\varepsilon^{-\frac12}\|f\|_{L^2} \tag{4.7} \]
with a constant $c$ independent of $f$ and $\varepsilon$. We use the following result (lemma 8.1 in [57])
Remark 4.1.2 The constants in (4.5) and (4.6) depend on $\gamma_0$ in (4.4). For the analysis the assumption $\gamma_0 > 0$ is essential. For the case $\gamma_0 \ge 0$ a slight modification of the analysis results in a stability bound
\[ \sqrt\varepsilon\,\|u\|_1 + \sqrt{\gamma_0}\,\|u\|_{L^2} \le C\min\big\{\varepsilon^{-\frac12},\,\gamma_0^{-\frac12}\big\}\|f\|_{L^2}, \]
The results in theorem 4.1.1 indicate that derivatives of the solution $u$ (e.g., $\|u\|_1$) grow if $\varepsilon \downarrow 0$. This is due to the fact that in general in such a convection-diffusion problem there are boundary and internal layers in which the solution (or some of its derivatives) can vary exponentially. For an analysis of these boundary layers we refer to the literature, e.g. [76]. In certain special cases it is possible to obtain bounds on the derivative in streamline direction which are significantly better than the general bound $\|u\|_1 \le C\varepsilon^{-\frac12}\|f\|_{L^2}$ in (4.5). We now present two such results. The first one is for a relatively simple one-dimensional problem, whereas the second one is related to a two-dimensional convection-diffusion problem with Neumann boundary conditions on the outflow boundary.
Theorem 4.1.3 For $f \in L^2([0,1])$ consider the following problem (with weak derivatives):
We use $C_\varepsilon := (1 - e^{-1/\varepsilon})^{-1}$. Note that $C_\varepsilon \le (1 - e^{-1})^{-1}$ for $\varepsilon \in (0,1]$. Using $g(t) = (1 - e^{-t/\varepsilon})f(t)$ we get (for $x \in [0,1]$)
\[ u(x) = C_\varepsilon u_2(x)\int_0^x (1 - e^{-t/\varepsilon})f(t)\,dt - C_\varepsilon u_1(x)\int_x^1 e^{(x-t)/\varepsilon}\big(1 - e^{(t-1)/\varepsilon}\big)f(t)\,dt. \]
From
\[ u'(x) = C_\varepsilon u_2'(x)\int_0^x (1 - e^{-t/\varepsilon})f(t)\,dt - C_\varepsilon u_1'(x)\int_x^1 e^{-t/\varepsilon}\big(1 - e^{(t-1)/\varepsilon}\big)f(t)\,dt = \frac{C_\varepsilon}{\varepsilon}\,e^{(x-1)/\varepsilon}\int_0^x (1 - e^{-t/\varepsilon})f(t)\,dt + \frac{C_\varepsilon}{\varepsilon}\int_x^1 e^{(x-t)/\varepsilon}\big(1 - e^{(t-1)/\varepsilon}\big)f(t)\,dt \]
we get
\[ |u'(x)| \le \frac C\varepsilon\int_0^x |f(t)|\,dt + \frac C\varepsilon\int_x^1 |f(t)|\,dt = \frac C\varepsilon\,\|f\|_{L^1}. \tag{4.11} \]
Combination of (4.10) and (4.11) proves the result in (4.8). We also have
\[ \int_0^1 |u'(x)|\,dx \le \frac{C_\varepsilon}{\varepsilon}\int_0^1 e^{(x-1)/\varepsilon}\int_0^x (1 - e^{-t/\varepsilon})|f(t)|\,dt\,dx + \frac{C_\varepsilon}{\varepsilon}\int_0^1 e^{x/\varepsilon}\int_x^1 e^{-t/\varepsilon}\big(1 - e^{(t-1)/\varepsilon}\big)|f(t)|\,dt\,dx. \]
Interchanging the order of integration in the second term and using $\int_0^1 e^{(x-1)/\varepsilon}\,dx \le \varepsilon$, $\int_0^t e^{(x-t)/\varepsilon}\,dx \le \varepsilon$, this yields $\int_0^1 |u'(x)|\,dx \le C\,\|f\|_{L^1}$, i.e., the bound in (4.9).
Note that in (4.9) we have a bound on the derivative measured in the $L^1$ norm (which is weaker than the $L^2$ norm) that is independent of $\varepsilon$. Similar results for a more general one-dimensional convection-diffusion problem are given in [76] (section 1.1.2) and [38].
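This behaviour is easy to see numerically. The check below is my own illustration with a concrete choice of data not taken from the theorem: for $-\varepsilon u'' + u' = 1$, $u(0) = u(1) = 0$, the exact solution is $u(x) = x - (e^{x/\varepsilon}-1)/(e^{1/\varepsilon}-1)$, whose derivative can be written in the overflow-safe form used below. The $L^1$ norm of $u'$ stays bounded (about 2) as $\varepsilon \downarrow 0$, while the $L^2$ norm grows like $\varepsilon^{-1/2}$:

```python
import numpy as np

def du_exact(x, eps):
    """Derivative of the solution of -eps*u'' + u' = 1, u(0)=u(1)=0."""
    return 1.0 - np.exp((x - 1.0) / eps) / (eps * (1.0 - np.exp(-1.0 / eps)))

def trap(y, x):
    # simple trapezoidal rule (kept explicit for portability)
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

x = np.linspace(0.0, 1.0, 1_000_001)   # fine grid resolving the layer at x = 1
for eps in (1e-2, 1e-3, 1e-4):
    du = du_exact(x, eps)
    l1 = trap(np.abs(du), x)
    l2 = trap(du**2, x) ** 0.5
    assert l1 < 2.5                    # epsilon-uniform L1 bound
    print(eps, l1, l2)                 # l2 grows as eps shrinks
```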
holds.
Proof. First note that the weak formulation of this problem has a unique solution $u \in H^1(\Omega)$. Using the fact that $\Omega$ is convex it follows that $u \in H^2(\Omega)$ holds and thus the problem (with weak derivatives) in (4.12) has a unique solution $u \in H^2(\Omega)$. From the differential equation we get
\[ \|u_x\|_{L^2}^2 = \langle f, u_x\rangle_{L^2} + \varepsilon\langle u_{yy}, u_x\rangle_{L^2} + \varepsilon\langle u_{xx}, u_x\rangle_{L^2} - c\,\langle u, u_x\rangle_{L^2}. \]
Using Green's formulas and the boundary conditions for the solution $u$ we obtain, with $W := \{(x,y)\in\partial\Omega \mid x = 0\}$ and $E := \{(x,y)\in\partial\Omega \mid x = 1\}$:
\[ \langle u_{yy}, u_x\rangle_{L^2} = -\frac12\Big\langle \frac{\partial}{\partial x}(u_y)^2, 1\Big\rangle_{L^2} = -\frac12\int_E u_y^2\,dy \le 0, \]
\[ \langle u_{xx}, u_x\rangle_{L^2} = \frac12\Big\langle \frac{\partial}{\partial x}(u_x)^2, 1\Big\rangle_{L^2} = -\frac12\int_W u_x^2\,dy \le 0, \]
\[ \langle u, u_x\rangle_{L^2} = \int_E u^2\,dy - \langle u_x, u\rangle_{L^2} \quad\text{and thus}\quad \langle u, u_x\rangle_{L^2} \ge 0. \]
Hence, we have
\[ \|u_x\|_{L^2}^2 \le \langle f, u_x\rangle_{L^2} \le \|f\|_{L^2}\|u_x\|_{L^2}. \tag{4.14} \]
The same arguments, with test function $u$ instead of $u_x$, yield
\[ c\,\|u\|_{L^2}^2 \le \|f\|_{L^2}\|u\|_{L^2}. \tag{4.15} \]
We note that for this problem a similar $\varepsilon$-independent bound for the derivative $u_y$ does not hold. A sharp inequality of the form $\|u_y\|_{L^2} \le C\varepsilon^{-\frac12}\|f\|_{L^2}$ can be shown. Furthermore, for the uniform bound on $\|u_x\|_{L^2}$ in (4.13) to hold it is essential that we consider the convection-diffusion problem with Neumann boundary conditions at the outflow boundary. Due to this there is no exponential boundary layer at the outflow boundary.
Lemma 4.2.1 Let $U$ be a normed linear space with norm $\|\cdot\|$ and $V$ a subspace of $U$. Let $s(\cdot,\cdot)$ be a continuous bilinear form on $U \times V$ and $t(\cdot,\cdot)$ a bilinear form on $U \times V$ such that for all $u \in U$ the functional $v \mapsto t(u,v)$ is bounded on $V$. Define $r := s + t$ and assume that $r$ is $V$-elliptic. Let $c_0 > 0$ and $c_1$ be such that
\[ r(v,v) \ge c_0\|v\|^2 \quad \text{for all } v \in V, \tag{4.16} \]
\[ s(u,v) \le c_1\|u\|\,\|v\| \quad \text{for all } u \in U,\ v \in V. \tag{4.17} \]
On $U$ we define the seminorm $\|u\|_* := \sup_{v\in V}\frac{t(u,v)}{\|v\|}$. Then the following holds:
\[ r(u,v) \le \max\{c_1, 1\}\big(\|u\| + \|u\|_*\big)\|v\| \quad \text{for all } u \in U,\ v \in V, \tag{4.18} \]
\[ \sup_{v\in V}\frac{r(u,v)}{\|v\|} \ge \frac{c_0}{1 + c_0 + c_1}\big(\|u\| + \|u\|_*\big) \quad \text{for all } u \in V. \tag{4.19} \]
Proof. For $u \in U$, $v \in V$ we have
\[ r(u,v) \le s(u,v) + t(u,v) \le c_1\|u\|\,\|v\| + \|u\|_*\|v\| \le \max\{c_1, 1\}\big(\|u\| + \|u\|_*\big)\|v\|, \]
and thus (4.18) holds. We now consider (4.19). Take a fixed $u \in V$ and $\delta \in (0,1)$. Then there exists $v_\delta \in V$ such that $\|v_\delta\| = 1$ and $\delta\|u\|_* \le t(u, v_\delta)$. Note that
\[ r(u, v_\delta) = s(u, v_\delta) + t(u, v_\delta) \ge \delta\|u\|_* - c_1\|u\|, \]
and thus for $w_\delta := u + \frac{c_0\|u\|}{1+c_1}\,v_\delta \in V$ we obtain
\[ r(u, w_\delta) = r(u,u) + \frac{c_0\|u\|}{1+c_1}\,r(u, v_\delta) \ge c_0\|u\|^2 + \frac{c_0\|u\|}{1+c_1}\big(\delta\|u\|_* - c_1\|u\|\big) = \frac{c_0}{1+c_1}\big(\|u\| + \delta\|u\|_*\big)\|u\|. \tag{4.20} \]
Furthermore,
\[ \|w_\delta\| \le \|u\| + \frac{c_0\|u\|}{1+c_1} = \frac{1+c_0+c_1}{1+c_1}\,\|u\| \tag{4.21} \]
holds. Combination of (4.20) and (4.21) yields
\[ \frac{r(u, w_\delta)}{\|w_\delta\|} \ge \frac{c_0}{1+c_0+c_1}\big(\|u\| + \delta\|u\|_*\big). \]
Because $w_\delta \in V$ and $\delta \in (0,1)$ is arbitrary this proves the result in (4.19).
We emphasize that the seminorm $\|\cdot\|_*$ on $U$ depends on the bilinear form $t(\cdot,\cdot)$ and on the subspace $V$. Also note that in (4.18) we have a boundedness result on $U \times V$, whereas in (4.19) we have an inf-sup bound on $V \times V$.
Using this lemma we can derive the following variant of the Céa lemma (theorem 3.1.1).
Theorem 4.2.2 Let the conditions as in lemma 4.2.1 be satisfied. Take $f$ and assume that there exist $u \in U$, $v \in V$ such that
Proof. Let $w \in V$ be arbitrary. Using (4.18), (4.19) and the Galerkin property $r(u - v, z) = 0$ for all $z \in V$ we get
\[ \|v - w\| + \|v - w\|_* \le \frac{1+c_0+c_1}{c_0}\sup_{z\in V}\frac{r(v - w, z)}{\|z\|} = \frac{1+c_0+c_1}{c_0}\sup_{z\in V}\frac{r(u - w, z)}{\|z\|} \le \frac{1+c_0+c_1}{c_0}\max\{c_1, 1\}\big(\|u - w\| + \|u - w\|_*\big). \]
Using this and the triangle inequality
\[ \|u - v\| + \|u - v\|_* \le \|v - w\| + \|v - w\|_* + \|u - w\| + \|u - w\|_* \]
the result follows.
In this theorem there are significant differences compared to the Céa lemma. For example, in theorem 4.2.2 we do not assume that $U$ (or $V$) is a Hilbert space and we do not assume an inf-sup property for the bilinear form $r(\cdot,\cdot)$ on $U \times V$ (only on $V \times V$, cf. (4.19)). On the other hand, in theorem 4.2.2 we assume existence of solutions in $U$ and $V$, cf. (4.22), whereas in the Céa lemma existence and uniqueness of solutions follows from assumptions on continuity and inf-sup properties of the bilinear form.
For the weak formulation we introduce the Hilbert spaces $H_1 = \{v \in H^1(I) \mid v(0) = 0\}$, $H_2 = L^2(I)$. The norm on $H_1$ is $\|v\|_1^2 = \|v'\|_{L^2}^2 + \|v\|_{L^2}^2$. We define the bilinear form
\[ k(u,v) = \int_0^1 (bu' + u)\,v\,dx \]
on $H_1 \times H_2$.
Proof. We apply theorem 2.3.1. The bilinear form $k(\cdot,\cdot)$ is continuous on $H_1 \times H_2$:
\[ k(u,v) \le b\|u'\|_{L^2}\|v\|_{L^2} + \|u\|_{L^2}\|v\|_{L^2} \le \sqrt2\max\{1,b\}\,\|u\|_1\|v\|_{L^2}, \quad u \in H_1,\ v \in H_2. \]
For $u \in H_1$ we have
\[ \sup_{v\in H_2}\frac{k(u,v)}{\|v\|_{L^2}} = \sup_{v\in H_2}\frac{\langle bu' + u, v\rangle_{L^2}}{\|v\|_{L^2}} = \|bu' + u\|_{L^2} = \big(b^2\|u'\|_{L^2}^2 + \|u\|_{L^2}^2 + 2b\langle u', u\rangle_{L^2}\big)^{\frac12}. \]
Using $u(0) = 0$ we get $\langle u', u\rangle_{L^2} = u(1)^2 - \langle u, u'\rangle_{L^2}$ and thus $\langle u', u\rangle_{L^2} \ge 0$. Hence we get
\[ \sup_{v\in H_2}\frac{k(u,v)}{\|v\|_{L^2}} \ge \min\{1,b\}\,\|u\|_1 \quad \text{for all } u \in H_1, \]
i.e., the inf-sup condition (2.36) in theorem 2.3.1 is satisfied. We now consider the condition (2.37) in this theorem. Let $v \in H_2$ be such that $k(u,v) = 0$ for all $u \in H_1$. This implies $b\int_0^1 u'v\,dx = -\int_0^1 uv\,dx$ for all $u \in C_0^\infty(I)$ and thus $v \in H^1(I)$ with $v' = \frac1b v$ (weak derivative). Using this we obtain
\[ \int_0^1 uv\,dx = b\int_0^1 uv'\,dx = b\,u(1)v(1) - b\int_0^1 u'v\,dx = b\,u(1)v(1) + \int_0^1 uv\,dx \quad \text{for all } u \in H_1, \]
and thus $u(1)v(1) = 0$ for all $u \in H_1$. This implies $v(1) = 0$. Using this and $bv' - v = 0$ yields
\[ \|v\|_{L^2}^2 = \langle v, v\rangle_{L^2} + \langle bv' - v, v\rangle_{L^2} = b\langle v', v\rangle_{L^2} = \frac b2 v(1)^2 - \frac b2 v(0)^2 = -\frac b2 v(0)^2 \le 0. \]
This implies $v = 0$ and thus condition (2.37) is satisfied. Application of theorem 2.3.1 now yields existence and uniqueness of a solution $u \in H_1$ and
\[ \|u\|_1 \le c\sup_{v\in H_2}\frac{\langle f, v\rangle_{L^2}}{\|v\|_{L^2}} = c\,\|f\|_{L^2}, \]
For the discretization of this problem we use a Galerkin method with a standard finite element space. To simplify the notation we use a uniform grid and consider only linear finite elements. Let $h = \frac1N$, $x_i = ih$, $0 \le i \le N$, and
For the error analysis of this method we apply the Céa lemma (theorem 3.1.1). The conditions (3.2), (3.3), (3.4) in theorem 3.1.1 have been shown to hold in the proof of theorem 4.3.1. It remains to verify the discrete inf-sup condition:
\[ \exists\,\gamma_h > 0: \quad \sup_{v_h\in X_h}\frac{k(u_h, v_h)}{\|v_h\|_{L^2}} \ge \gamma_h\,\|u_h\|_1 \quad \text{for all } u_h \in X_h. \tag{4.28} \]
Lemma 4.3.2 The inf-sup property (4.28) holds with $\gamma_h = c\,h$, $c > 0$ independent of $h$.
Now apply an inverse inequality, cf. lemma 3.3.11, $\|v_h'\|_{L^2} \le c\,h^{-1}\|v_h\|_{L^2}$ for all $v_h \in X_h$, resulting in $\sup_{v_h\in X_h}\frac{k(u_h, v_h)}{\|v_h\|_{L^2}} \ge \frac12\|u_h\|_{L^2} \ge c\,h\,\|u_h\|_1$ with a constant $c > 0$ independent of $h$.
Remark 4.3.3 The result in the previous lemma is sharp in the sense that the best (i.e. largest) inf-sup constant $\gamma_h$ in (4.28) in general satisfies $\gamma_h \le c\,h$. This can be deduced from a numerical experiment or a technical analytical derivation. Here we present results of a numerical experiment. We consider the continuous and discrete problems as in (4.26), (4.27) with $b = 1$. Discretization of the bilinear forms $(u,v) \mapsto \langle u', v\rangle_{L^2}$, $(u,v) \mapsto \langle u, v\rangle_{L^2}$ and $(u,v) \mapsto \langle u', v'\rangle_{L^2}$ in the finite element space $X_h$ (with respect to the nodal basis) results in $N \times N$ matrices
\[ C_h = \frac12\begin{pmatrix} 0 & 1 & & & \\ -1 & 0 & 1 & & \\ & \ddots & \ddots & \ddots & \\ & & -1 & 0 & 1 \\ & & & -1 & 1 \end{pmatrix}, \qquad M_h = h\begin{pmatrix} \frac23 & \frac16 & & & \\ \frac16 & \frac23 & \frac16 & & \\ & \ddots & \ddots & \ddots & \\ & & \frac16 & \frac23 & \frac16 \\ & & & \frac16 & \frac13 \end{pmatrix}, \tag{4.29} \]
\[ A_h = \frac1h\begin{pmatrix} 2 & -1 & & & \\ -1 & 2 & -1 & & \\ & \ddots & \ddots & \ddots & \\ & & -1 & 2 & -1 \\ & & & -1 & 1 \end{pmatrix}. \]
Note that
\[ \inf_{u_h\in X_h}\sup_{v_h\in X_h}\frac{k(u_h, v_h)}{\|u_h\|_1\|v_h\|_{L^2}} = \inf_{x\in\mathbb R^N}\sup_{y\in\mathbb R^N}\frac{\langle C_h x + M_h x, y\rangle_2}{\langle (A_h + M_h)x, x\rangle_2^{\frac12}\,\langle M_h y, y\rangle_2^{\frac12}} = \inf_{x\in\mathbb R^N}\frac{\|M_h^{-\frac12}(C_h + M_h)x\|_2}{\|(A_h + M_h)^{\frac12}x\|_2} = \inf_{z\in\mathbb R^N}\frac{\|M_h^{-\frac12}(C_h + M_h)(A_h + M_h)^{-\frac12}z\|_2}{\|z\|_2} = \big\|(A_h + M_h)^{\frac12}(C_h + M_h)^{-1}M_h^{\frac12}\big\|_2^{-1} =: \gamma_h. \]
A (MATLAB) computation of the quantity $q(h) := \gamma_h/h$ yields: $q(\frac1{10}) = 1.3944$, $q(\frac1{50}) = 1.3987$, $q(\frac1{250}) = 1.3988$. Hence, in this case $\gamma_h$ is proportional to $h$.
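The same experiment can be reproduced in Python. The sketch below builds the matrices as reconstructed in (4.29) and evaluates $\gamma_h$ via the matrix norm above; this is my own re-implementation, not the original MATLAB code, so treat the exact numbers as indicative.

```python
import numpy as np

def infsup_constant(N):
    """gamma_h = ||(A+M)^(1/2) (C+M)^(-1) M^(1/2)||_2^(-1) for the
    matrices in (4.29), b = 1, uniform mesh width h = 1/N."""
    h = 1.0 / N
    C = np.zeros((N, N)); M = np.zeros((N, N)); A = np.zeros((N, N))
    for i in range(N):
        last = (i == N - 1)
        C[i, i] = 0.5 if last else 0.0
        M[i, i] = h / 3.0 if last else 2.0 * h / 3.0
        A[i, i] = 1.0 / h if last else 2.0 / h
        if i + 1 < N:
            C[i, i + 1], C[i + 1, i] = 0.5, -0.5
            M[i, i + 1] = M[i + 1, i] = h / 6.0
            A[i, i + 1] = A[i + 1, i] = -1.0 / h

    def sqrtm(S):                       # square root of a symmetric PSD matrix
        w, V = np.linalg.eigh(S)
        return (V * np.sqrt(np.maximum(w, 0.0))) @ V.T

    B = sqrtm(A + M) @ np.linalg.inv(C + M) @ sqrtm(M)
    return 1.0 / np.linalg.norm(B, 2)

q10 = infsup_constant(10) / (1.0 / 10)
q50 = infsup_constant(50) / (1.0 / 50)
print(q10, q50)       # compare with the values q(h) ~ 1.39 reported above
assert abs(q10 - q50) < 0.2            # gamma_h proportional to h
assert 0.3 < q10 < 4.0
```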
Using the inf-sup result of the previous lemma we obtain the following corollaries.
Corollary 4.3.4 The discrete problem (4.27) has a unique solution $u_h \in X_h$ and the following stability bound holds:
\[ \|u_h\|_{L^2} \le \|f\|_{L^2}. \tag{4.30} \]
Proof. Existence and uniqueness of a solution follows from continuity and ellipticity of the bilinear form $k(\cdot,\cdot)$ on $X_h \times X_h$. From $\|u_h\|_{L^2}^2 \le k(u_h, u_h) = \langle f, u_h\rangle_{L^2} \le \|f\|_{L^2}\|u_h\|_{L^2}$ the bound in (4.30) follows.
Note that this stability result for the discrete problem is weaker than the one for the continuous problem in theorem 4.3.1.
Corollary 4.3.5 Let $u \in H_1$ and $u_h \in X_h$ be the solutions of (4.26) and (4.27), respectively. From theorem 3.1.1 we obtain the error bound
\[ \|u - u_h\|_1 \le c\,h^{-1}\inf_{v_h\in X_h}\|u - v_h\|_1, \]
and thus, for $u \in H^2(I)$,
\[ \|u - u_h\|_1 \le c\,\|u''\|_{L^2}, \qquad \|u - u_h\|_{L^2} \le c\,h\,\|u''\|_{L^2}. \]
These results show that, due to the deterioration of the inf-sup stability constant $\gamma_h$ for $h \downarrow 0$, the discretization with standard linear finite elements is not satisfactory. A heuristic explanation for this instability phenomenon can be given via the matrix $C_h$ in (4.29) that represents the finite element discretization of $u \mapsto u'$. The differential equation (in strong form) is $u' = -\frac1b u + \frac1b f$ on $(0,1)$, which is a first order ordinary differential equation. The initial condition is given by $u(0) = 0$. For the discretization of $u'(x_i)$ we use (cf. $C_h$) the central difference $\frac{1}{2h}\big(u(x_{i+1}) - u(x_{i-1})\big)$. Thus for the approximation of $u'$ at time $x = x_i$ we use $u$ at the future time $x = x_{i+1}$, which is an unnatural approach.
We now turn to the question how a better finite element discretization for this very simple problem can be developed. One possibility is to use suitable different finite element spaces $H_{1,h} \subset H_1$ and $H_{2,h} \subset H_2$. This leads to a so-called Petrov–Galerkin method. We do not treat such methods here, but refer to the literature, for example [76]. From an implementation point of view it is convenient to use only one finite element space instead of two different ones. We will show how a satisfactory discretization with (only) the space $X_h$ of linear finite elements can be obtained by using the concept of stabilization.
A first stabilized method is based on the following observation. If $u \in H_1$ satisfies (4.26), then
\[ \int_0^1 (bu' + u)\,bv'\,dx = \langle f, bv'\rangle_{L^2} \quad \text{for all } v \in H_1 \tag{4.32} \]
also holds. By adding this equation to the one in (4.26) it follows that the solution $u \in H_1$ satisfies
\[ \langle bu' + u,\ bv' + v\rangle_{L^2} = \langle f,\ bv' + v\rangle_{L^2} \quad \text{for all } v \in H_1. \tag{4.33} \]
Based on this we introduce the notation
\[ k_1(u,v) := \langle bu' + u,\ bv' + v\rangle_{L^2}, \quad u, v \in H_1, \]
\[ f_1(v) := \langle f,\ bv' + v\rangle_{L^2}, \quad v \in H_1. \]
The bilinear form $k_1(\cdot,\cdot)$ is continuous on $H_1 \times H_1$ and $f_1$ is continuous on $H_1$. Moreover, $k_1(\cdot,\cdot)$ is symmetric and using $\langle v', v\rangle_{L^2} = \frac12 v(1)^2 \ge 0$ for $v \in H_1$, we get
\[ k_1(v,v) = b^2\|v'\|_{L^2}^2 + \|v\|_{L^2}^2 + 2b\langle v', v\rangle_{L^2} \ge \min\{b^2, 1\}\,\|v\|_1^2 \quad \text{for } v \in H_1, \tag{4.34} \]
and thus $k_1(\cdot,\cdot)$ is elliptic on $H_1$. The discrete problem is as follows:
\[ \text{determine } u_h \in X_h \text{ such that } k_1(u_h, v_h) = f_1(v_h) \text{ for all } v_h \in X_h. \tag{4.35} \]
Due to $X_h \subset H_1$ and the $H_1$-ellipticity of $k_1(\cdot,\cdot)$ this problem has a unique solution $u_h \in X_h$. For the discretization error we obtain the following result.
Lemma 4.3.7 Let $u \in H_1$ be the solution of (4.26) (or (4.33)) and $u_h$ the solution of (4.35). The following holds:
\[ \|u - u_h\|_1 \le c\inf_{v_h\in X_h}\|u - v_h\|_1. \tag{4.36} \]
constant 1. Substitution shows that the solution is given by u(x) = 1 (ex 1). Note that
u, f C (I). Further elementary computations yield
1p
kf kL2 2, ku kL2 .
4
Hence a bound $\|u''\|_{L^2} \le c\,\|f\|_{L^2}$ with a constant $c$ independent of $f \in L^2(I)$ cannot hold, i.e., this problem is not $H^2$-regular.
We now generalize the stabilized finite element method presented above and show that using this generalization we can derive a method with an $H^1$ error bound of the order $h$ (as in (4.36)) and an improved $L^2$ error bound of the order $h^{\frac32}$. This generalization is obtained by adding $\delta$ times, with a parameter $\delta \in [0,1]$, the equation in (4.32) to the one in (4.26). This shows that the solution $u \in H_1$ of (4.26) also satisfies
Note that for $\delta = 0$ we have the original variational formulation and that $\delta = 1$ results in the problem (4.33). For $\delta \ne 1$ the bilinear form $k_\delta(\cdot,\cdot)$ is not symmetric. For all $\delta \in [0,1]$ the functional $f_\delta$ is continuous on $H_1$. The discrete problem is as follows:
The discrete solution $u_h^\delta$ (if it exists) depends on $\delta$. We investigate how $\delta$ can be chosen such that the discretization error (bound) is minimized. For this analysis we use the abstract results in section 4.2. We write
The bilinear form $s_\delta(\cdot,\cdot)$ defines a scalar product on $H_1$. We introduce the norm and the seminorm (cf. lemma 4.2.1)
\[ |||u|||_\delta := s_\delta(u,u)^{\frac12}, \qquad \|u\|_{*,h,\delta} := \sup_{v_h\in X_h}\frac{t_\delta(u, v_h)}{|||v_h|||_\delta}, \quad \text{for } u \in H_1. \]
Note that
\[ t_\delta(u,u) = b(\delta + 1)\langle u', u\rangle_{L^2} = \frac12\,b(\delta + 1)\,u(1)^2 \ge 0 \quad \text{for all } u \in H_1, \tag{4.40} \]
and
\[ \frac{1}{\sqrt2}\big(b\sqrt\delta\,|u|_1 + \|u\|_{L^2}\big) \le |||u|||_\delta \le b\sqrt\delta\,|u|_1 + \|u\|_{L^2} \quad \text{for all } u \in H_1. \tag{4.41} \]
2
Lemma 4.3.9 For all [0, 1] the continuous problem (4.38) and the discrete problem (4.39)
have unique solutions u and uh , respectively. The discrete solution satisfies the stability bound
b uh 1 + kuh kL2 2kf kL2 .
Proof. For = 0 the existence and uniqueness of solutions is given in theorem 4.3.1 and
corollary 4.3.4. The stability result for = 0 also follows from corollary 4.3.4. For > 0 we
obtain, using (4.40),
and
k (u, v) (bku kL2 + kukL2 )(bkv kL2 + kvkL2 ) c kuk1 kvk1 for u, v H1 .
Hence $k_\delta(\cdot,\cdot)$ is elliptic and continuous on $H_1$. The Lax–Milgram lemma implies that both the continuous and the discrete problem have a unique solution. For $v \in H_1$ we have, cf. (4.40),
Lemma 4.3.10 Let $u$ and $u_h^\delta$ be the solutions of (4.38) and (4.39), respectively. The following error bound holds:
\[ |||u - u_h^\delta|||_\delta + \|u - u_h^\delta\|_{*,h,\delta} \le 4\inf_{v_h\in X_h}\big(|||u - v_h|||_\delta + \|u - v_h\|_{*,h,\delta}\big). \tag{4.43} \]
Proof. To derive this error bound we use theorem 4.2.2 with $U = H_1$, $V = X_h$, $r(\cdot,\cdot) = k_\delta(\cdot,\cdot)$, $\|\cdot\| = |||\cdot|||_\delta$, $\|\cdot\|_* = \|\cdot\|_{*,h,\delta}$. We verify the corresponding conditions in lemma 4.2.1. The bilinear form $s_\delta(\cdot,\cdot)$ is continuous on $U = H_1$: $s_\delta(u,v) \le |||u|||_\delta\,|||v|||_\delta$. Hence (4.17) is satisfied with $c_1 = 1$. For $u \in H_1$ the functional $v \mapsto t_\delta(u,v)$ is clearly continuous on $V = X_h$. From (4.42) it follows that condition (4.16) is satisfied with $c_0 = 1$. Application of theorem 4.2.2 yields
\[ |||u - u_h^\delta|||_\delta + \|u - u_h^\delta\|_{*,h,\delta} \le C\inf_{v_h\in X_h}\big(|||u - v_h|||_\delta + \|u - v_h\|_{*,h,\delta}\big), \]
For the Sobolev space $H_1$ we have $H_1 \subset C(\bar I)$ and thus the nodal interpolation is well-defined.
Theorem 4.3.11 Let $u \in H_1$ and $u_h^\delta \in X_h$ be the solutions of (4.38) and (4.39), respectively. For all $\delta \in [0,1]$ the error bound
\[ b\sqrt\delta\,|u - u_h^\delta|_1 + \|u - u_h^\delta\|_{L^2} \le C\Big(b\sqrt\delta\,|u - I_X u|_1 + \Big(1 + \min\Big\{\frac{1}{\sqrt\delta},\,\frac bh\Big\}\Big)\|u - I_X u\|_{L^2}\Big) \]
holds with a constant $C$ independent of $h$, $\delta$, $b$ and $u$.
Using this and the inverse inequality $|v_h|_1 \le c\,h^{-1}\|v_h\|_{L^2}$ for all $v_h \in X_h$ we obtain
\[ \|e_h\|_{*,h,\delta} = \sup_{v_h\in X_h}\frac{t_\delta(e_h, v_h)}{|||v_h|||_\delta} = \sup_{v_h\in X_h}\frac{b(1-\delta)\langle e_h, v_h'\rangle_{L^2}}{\big(b^2\delta|v_h|_1^2 + \|v_h\|_{L^2}^2\big)^{\frac12}} \le \sup_{v_h\in X_h}\frac{b\,\|e_h\|_{L^2}\,|v_h|_1}{\big(\delta + c\,h^2 b^{-2}\big)^{\frac12}\,b\,|v_h|_1} \le c\min\Big\{\frac{1}{\sqrt\delta},\,\frac bh\Big\}\|e_h\|_{L^2}, \tag{4.45} \]
with $c$ independent of $\delta$, $h$ and $b$. The result follows from combination of (4.44) and (4.45).
Corollary 4.3.12 Let $u$ and $u_h^\delta$ be as in theorem 4.3.11 and assume that $u \in H^2(I)$. Then the following error bound holds for $\delta \in [0,1]$:
\[ b\sqrt\delta\,|u - u_h^\delta|_1 + \|u - u_h^\delta\|_{L^2} \le C\,h\Big(h + b\sqrt\delta + b\min\Big\{1,\,\frac{h}{b\sqrt\delta}\Big\}\Big)\|u''\|_{L^2} \tag{4.46} \]
with a constant $C$ independent of $h$, $\delta$, $b$ and $u$.
Hence, the bound for the norm $|\cdot|_1$ is the same as for $\delta = 1$, but we have an improvement in the $L^2$ error bound.
From these discretization error results and from the stability result in lemma 4.3.9 we see that $\delta = 0$ leads to poor accuracy and poor stability properties. The best stability property is obtained for the case $\delta = 1$. A somewhat weaker stability property but a better approximation property is obtained for $\delta = \delta_{\mathrm{opt}}$. For $\delta = \delta_{\mathrm{opt}}$ we have a good compromise between sufficient stability and high approximation quality. Finding such a compromise is a topic that is important in all stabilized finite element methods.
Remark 4.3.13 Experiments to show the dependence of the errors on $\delta$. Is the bound in (4.46) sharp?
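An experiment of the kind this remark asks for can be sketched as follows (my own setup, not from the text: $b = 1$ and $f \equiv 1$, so the exact solution of (4.26) is $u(x) = 1 - e^{-x}$; the matrices are those of (4.29)). The discrete system of (4.39) is $\big((1+\delta)(C_h + M_h) + \delta(A_h + C_h^T)\big)x = r$ with $r_i = (1+\delta)h$ for interior nodes and $r_N = (1+\delta)\frac h2 + \delta$, since $\int_0^1 \varphi_N'\,dx = 1$.

```python
import numpy as np

def solve_hyperbolic_1d(N, delta):
    """delta-stabilized P1 Galerkin discretization (4.39) of
    u' + u = 1 on (0,1), u(0) = 0 (b = 1, uniform grid)."""
    h = 1.0 / N
    C = np.zeros((N, N)); M = np.zeros((N, N)); A = np.zeros((N, N))
    for i in range(N):
        last = (i == N - 1)
        C[i, i] = 0.5 if last else 0.0
        M[i, i] = h / 3.0 if last else 2.0 * h / 3.0
        A[i, i] = 1.0 / h if last else 2.0 / h
        if i + 1 < N:
            C[i, i + 1], C[i + 1, i] = 0.5, -0.5
            M[i, i + 1] = M[i + 1, i] = h / 6.0
            A[i, i + 1] = A[i + 1, i] = -1.0 / h
    K = (1 + delta) * (C + M) + delta * (A + C.T)
    rhs = np.full(N, (1 + delta) * h)
    rhs[-1] = (1 + delta) * h / 2.0 + delta    # integral of phi_N' equals 1
    return np.linalg.solve(K, rhs)

N = 100
x = np.linspace(1.0 / N, 1.0, N)
u_exact = 1.0 - np.exp(-x)                     # exact solution for f = 1
for delta in (0.0, 1.0 / N, 0.5, 1.0):
    err = np.max(np.abs(solve_hyperbolic_1d(N, delta) - u_exact))
    print(delta, err)
    if delta > 0:
        assert err < 0.05
```

Since the exact solution here is smooth (no layer), all choices of $\delta$ converge; the differences between them become pronounced for rough data.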
with a constant $M$ independent of $\varepsilon$. Now recall the standard Galerkin discretization with linear finite elements, i.e.: $u_h \in X_{h,0}^1$ such that
From the analysis in chapter 3 (corollary 3.1.2 and corollary 3.3.10) we obtain the discretization error bound
\[ \|u - u_h\|_1 \le \frac M\varepsilon\inf_{v_h\in X_{h,0}^1}\|u - v_h\|_1 \le C\,\frac M\varepsilon\,h\,\|u\|_2, \tag{4.51} \]
provided $u \in H^2(\Omega)$. The constant $C$ is independent of $h$ and $\varepsilon$. We can apply the duality argument used in section 3.4.2 to derive a bound for the error in the $L^2$ norm. The dual problem has the same form as in (4.1)–(4.2) but with $b$ replaced by $-b$. Assume that (4.4) also holds with $b$ replaced by $-b$ and that the solution of the dual problem lies in $H^2(\Omega)$ (the latter is true if $\Omega$ is convex). Then a regularity result as in (4.6) holds for the solution of the dual problem. Using this we obtain
\[ \|u - u_h\|_{L^2} \le C\,\varepsilon^{-2}\,h^2\,\|u\|_2, \tag{4.52} \]
Below we present a refined analysis based on the same approach as in section 4.3 which results in better bounds for the discretization error. These bounds reflect important properties of the standard Galerkin finite element discretization applied to the convection-diffusion problem in (4.2) and show the effect of introducing a stabilization. In section 4.4.1 we consider well-posedness of the problem in (4.2). In section 4.4.2 we analyze a stabilized finite element method for this problem.
\[ \langle u, v\rangle_{H_1} = \langle u, v\rangle_{L^2} + \langle b\cdot\nabla u,\ b\cdot\nabla v\rangle_{L^2} = \langle u, v\rangle_{L^2} + \langle u_x, v_x\rangle_{L^2} \tag{4.54} \]
is a Hilbert space (this follows from the same arguments as for the Sobolev space $H^1(\Omega)$). Take $H_2 := L^2(\Omega)$. The bilinear form corresponding to the problem in (4.53) is
Using the same arguments as in the proof of theorem 4.3.1 one can show that there exists a unique $u \in H_1$ such that
\[ k(u,v) = \langle f, v\rangle_{L^2} \quad \text{for all } v \in H_2, \]
and that $\|u\|_{H_1} \le c\,\|f\|_{L^2}$ holds with a constant $c$ independent of $f \in L^2(\Omega)$. Thus this hyperbolic problem is well-posed in the space $H_1 \times H_2$. Note that the stability result is similar to the one for the convection-diffusion problem in theorem 4.1.4.
\[ \text{There are constants } \gamma_0,\ c_b \text{ such that } -\frac12\,\mathrm{div}\,b + c \ge \gamma_0 \ge 0, \quad \|c\|_{L^\infty} \le c_b\,\gamma_0. \tag{4.55} \]
We take $c_b := 0$ if $\gamma_0 = 0$. Note that this assumption is somewhat stronger than the one in (4.3) but still covers the important special case $\mathrm{div}\,b = 0$, $c$ constant and $c \ge 0$.
Theorem 4.4.2 Consider the variational problem in (4.2) and assume that (4.55) holds. For $u \in H_0^1(\Omega)$ define the (semi)norms
\[ |||u||| := \big(\varepsilon|u|_1^2 + \gamma_0\|u\|_{L^2}^2\big)^{\frac12}, \tag{4.56a} \]
\[ \|b\cdot\nabla u\|_* = \|u\|_* := \sup_{v\in H_0^1(\Omega)}\frac{\int_\Omega b\cdot\nabla u\;v\,dx}{|||v|||}. \tag{4.56b} \]
Then
\[ \sup_{v\in H_0^1(\Omega)}\frac{k(u,v)}{|||v|||} \ge \frac{1}{2 + \max\{c_b, 1\}}\big(|||u||| + \|b\cdot\nabla u\|_*\big) \quad \text{for all } u \in H_0^1(\Omega). \tag{4.58} \]
Proof. We apply lemma 4.2.1 with $U = V = H_0^1(\Omega)$, norm $\|\cdot\| = |||\cdot|||$ and
\[ s(u,v) = \int_\Omega \varepsilon\nabla u\cdot\nabla v + c\,uv\,dx, \qquad t(u,v) = \int_\Omega b\cdot\nabla u\;v\,dx. \]
For fixed $u \in H_0^1(\Omega)$ we have
\[ t(u,v) \le c\,\|v\|_{L^2} \le c\,|v|_1 \le c\,|||v||| \]
(with a constant $c$ depending on $u$ and $\varepsilon$)
and thus $v \mapsto t(u,v)$ is bounded on $V$. Note that $k(u,v) = s(u,v) + t(u,v)$ holds. For $u \in H_0^1(\Omega)$ we have
\[ k(u,u) = \int_\Omega \varepsilon\nabla u\cdot\nabla u + b\cdot\nabla u\;u + cu^2\,dx = \int_\Omega \varepsilon\nabla u\cdot\nabla u + \Big(-\frac12\,\mathrm{div}\,b + c\Big)u^2\,dx \ge \int_\Omega \varepsilon\nabla u\cdot\nabla u + \gamma_0 u^2\,dx = |||u|||^2, \]
and thus (4.16) is satisfied with $c_0 = 1$. Furthermore, for all $u, v \in H_0^1(\Omega)$ we have
\[ s(u,v) \le \varepsilon|u|_1|v|_1 + c_b\gamma_0\|u\|_{L^2}\|v\|_{L^2} \le \max\{c_b, 1\}\,|||u|||\,|||v|||. \]
Hence (4.17) holds with $c_1 = \max\{c_b, 1\}$. The results in (4.18) and (4.19) then yield (4.57) and (4.58), respectively.
The result in this theorem can be interpreted as follows. Let $\mathcal H_1$ be the space $H_0^1(\Omega)$ endowed with the norm $|||\cdot||| + \|b\cdot\nabla(\cdot)\|_*$ and $\mathcal H_2$ the space $H_0^1(\Omega)$ with the norm $|||\cdot|||$. Note that these norms are problem dependent. The spaces $\mathcal H_1$ and $\mathcal H_2$ are Hilbert spaces. Using the linear operator $\mathcal L : \mathcal H_1 \to \mathcal H_2'$, $(\mathcal L u)(v) := k(u,v)$, the variational problem (4.2) can be reformulated as follows: find $u \in \mathcal H_1$ such that $\mathcal L u = f$. From the results in theorem 2.3.1 and theorem 4.4.2 it follows that $\mathcal L$ is an isomorphism and that the inequalities
The inf-sup bound in the previous theorem implies the following stability result for the variational problem (4.2).
Corollary 4.4.3 Consider the variational problem in (4.2) and assume that (4.55) holds with $\gamma_0 > 0$. Then the inequality
\[ \sqrt\varepsilon\,|u|_1 + \sqrt{\gamma_0}\,\|u\|_{L^2} + \sqrt2\,\|b\cdot\nabla u\|_* \le \sqrt2\,\big(2 + \max\{c_b, 1\}\big)\gamma_0^{-\frac12}\|f\|_{L^2} \tag{4.59} \]
holds.
Proof. From $k(u,v) = \langle f, v\rangle_{L^2}$ for all $v \in H_0^1(\Omega)$ and (4.58) we obtain
\[ |||u||| + \|b\cdot\nabla u\|_* \le \big(2 + \max\{c_b, 1\}\big)\sup_{v\in H_0^1(\Omega)}\frac{\langle f, v\rangle_{L^2}}{|||v|||} \le \big(2 + \max\{c_b, 1\}\big)\sup_{v\in H_0^1(\Omega)}\frac{\|f\|_{L^2}\|v\|_{L^2}}{\big(\varepsilon|v|_1^2 + \gamma_0\|v\|_{L^2}^2\big)^{\frac12}} \le \big(2 + \max\{c_b, 1\}\big)\gamma_0^{-\frac12}\|f\|_{L^2}. \]
Furthermore, note that
\[ |||u||| \ge \frac{1}{\sqrt2}\big(\sqrt\varepsilon\,|u|_1 + \sqrt{\gamma_0}\,\|u\|_{L^2}\big) \]
holds.
From this corollary it follows that for the case $\gamma_0 > 0$ the inequality $\varepsilon^{\frac12}\|u\|_1 + \|u\|_{L^2} \le C\|f\|_{L^2}$ holds with a constant $C$ independent of $f$ and $\varepsilon$. This result is proved in theorem 4.1.1, too. However, from corollary 4.4.3 we also obtain
\[ \|b\cdot\nabla u\|_* \le C\,\|f\|_{L^2} \tag{4.60} \]
with $C$ independent of $f$ and $\varepsilon$. Hence, we have a bound on the derivative in streamline direction. Taking (formally) $\varepsilon = 0$ we obtain a bound $\|u\|_{L^2} + \|b\cdot\nabla u\|_{L^2} \le C\|f\|_{L^2}$, which is (for the example $b = (1\ 0)^T$) the same as the stability bound $\|u\|_{H_1} \le c\,\|f\|_{L^2}$ derived in remark 4.4.1. For $\varepsilon = 0$ and $\gamma_0 > 0$ the norm $\|\cdot\|_*$ is equivalent to the $L^2$ norm. A result that relates the norm $\|\cdot\|_*$ to the more tractable $L^2$ norm also for $\varepsilon > 0$ is given in the following lemma.
Proof. For the first inequality we need the global inverse inequality from lemma 3.3.11: $|v_h|_1 \le c_{\mathrm{inv}}\,h^{-1}\|v_h\|_{L^2}$ for all $v_h \in V_h$. Using this inequality we get
\[ \|w\|_* = \sup_{v\in H_0^1(\Omega)}\frac{\langle w, v\rangle_{L^2}}{|||v|||} \ge \frac{\langle w, P_h w\rangle_{L^2}}{|||P_h w|||} = \frac{\|P_h w\|_{L^2}^2}{\big(\varepsilon|P_h w|_1^2 + \gamma_0\|P_h w\|_{L^2}^2\big)^{\frac12}} \ge \frac{\|P_h w\|_{L^2}^2}{\big(\varepsilon\,c_{\mathrm{inv}}^2 h^{-2} + \gamma_0\big)^{\frac12}\|P_h w\|_{L^2}} \ge C\,\|P_h w\|_{L^2}, \]
4.4.2 Finite element discretization
We now analyze the Galerkin finite element discretization of the convection-diffusion problem. For ease of presentation we only consider simplicial finite elements. The case with rectangular finite elements can be treated analogously. Let $\{\mathcal T_h\}$ be a regular family of triangulations of $\Omega$ consisting of $n$-simplices and let
We now use the same stabilization approach as in section 4.3. Assume that for the solution $u \in H_0^1(\Omega)$ of the convection-diffusion problem we have $u \in H^2(\Omega)$. Then
\[ \int_\Omega\big(-\varepsilon\Delta u + b\cdot\nabla u + cu\big)v\,dx = \langle f, v\rangle_{L^2} \quad \text{for all } v \in H_0^1(\Omega), \]
or, equivalently,
\[ k(u,v) + \langle -\varepsilon\Delta u + b\cdot\nabla u + cu,\ b\cdot\nabla v\rangle_{L^2} = \langle f, v\rangle_{L^2} + \langle f,\ b\cdot\nabla v\rangle_{L^2} \quad \text{for all } v \in H_0^1(\Omega). \]
provided $u \in H^2(T)$ for all $T \in \mathcal T_h$. In the remainder we assume that $u$ has this regularity property. If $\delta_T = 0$ for all $T$ we have the standard (unstabilized) method as in (4.62). In the remainder of this section we present an error analysis of the discretization (4.63) along the same lines as in section 4.3. We use the abstract analysis in section 4.2 with the spaces
\[ U := \{ v \in H_0^1(\Omega) \mid v|_T \in H^2(T) \text{ for all } T \in \mathcal T_h \}, \qquad V = V_h. \]
Note that $U$ depends on $\mathcal T_h$. We split $k_\delta(\cdot,\cdot)$ as follows:
\[ k_\delta(u,v) = s_\delta(u,v) + t_\delta(u,v), \quad u, v \in U, \]
\[ s_\delta(u,v) := \varepsilon\langle \nabla u, \nabla v\rangle_{L^2} + \langle cu, v\rangle_{L^2} + \sum_{T\in\mathcal T_h}\delta_T\langle b\cdot\nabla u,\ b\cdot\nabla v\rangle_T, \]
\[ t_\delta(u,v) := \langle b\cdot\nabla u, v\rangle_{L^2} + \sum_{T\in\mathcal T_h}\delta_T\langle -\varepsilon\Delta u + cu,\ b\cdot\nabla v\rangle_T. \]
Proof. Using $\langle b\cdot\nabla v_h, v_h\rangle_{L^2} = -\frac12\langle \mathrm{div}\,b\;v_h, v_h\rangle_{L^2}$ and (4.55) we obtain
\[ k_\delta(v_h, v_h) \ge \varepsilon|v_h|_1^2 + \gamma_0\|v_h\|_{L^2}^2 + \sum_{T\in\mathcal T_h}\delta_T\|b\cdot\nabla v_h\|_T^2 + \sum_{T\in\mathcal T_h}\delta_T\langle -\varepsilon\Delta v_h + cv_h,\ b\cdot\nabla v_h\rangle_T = |||v_h|||_\delta^2 + \sum_{T\in\mathcal T_h}\delta_T\langle -\varepsilon\Delta v_h + cv_h,\ b\cdot\nabla v_h\rangle_T. \tag{4.68} \]
For a bound on the last term in (4.68) we use $\|\Delta v_h\|_T \le \mu_{\mathrm{inv}} h_T^{-1}|v_h|_{1,T}$, $\delta_T^{\frac12} \le \frac{1}{\sqrt2}\,\varepsilon^{-\frac12}\mu_{\mathrm{inv}}^{-1}h_T$ and $\delta_T^{\frac12} \le \frac{1}{\sqrt2}\,\gamma_0^{-\frac12}c_b^{-1}$:
\[ \Big|\sum_{T\in\mathcal T_h}\delta_T\langle -\varepsilon\Delta v_h + cv_h,\ b\cdot\nabla v_h\rangle_T\Big| \le \sum_{T\in\mathcal T_h}\delta_T\big(\varepsilon\,\mu_{\mathrm{inv}} h_T^{-1}|v_h|_{1,T} + \gamma_0 c_b\|v_h\|_T\big)\|b\cdot\nabla v_h\|_T \le \sum_{T\in\mathcal T_h}\frac{1}{\sqrt2}\big(\sqrt\varepsilon\,|v_h|_{1,T} + \sqrt{\gamma_0}\,\|v_h\|_T\big)\sqrt{\delta_T}\,\|b\cdot\nabla v_h\|_T \le \frac12\sum_{T\in\mathcal T_h}\big(\varepsilon|v_h|_{1,T}^2 + \gamma_0\|v_h\|_T^2 + \delta_T\|b\cdot\nabla v_h\|_T^2\big) = \frac12\,|||v_h|||_\delta^2. \]
Hence $k_\delta(v_h, v_h) \ge \frac12\,|||v_h|||_\delta^2$.
Remark 4.4.7 If $V_h$ is the space of piecewise linear finite elements then $(\Delta v_h)|_T = 0$. Inspection of the proof shows that the result of the lemma holds with the condition (4.67) replaced by the weaker condition $0 \le \delta_T \le \frac12\,\gamma_0^{-1}c_b^{-2}$.
Corollary 4.4.8 If (4.67) is satisfied then the discrete problem (4.63a) has a unique solution $u_h \in V_h$. For $\gamma_0 > 0$ we have the stability bound
\[ \sqrt\varepsilon\,|u_h|_1 + \sqrt{\gamma_0}\,\|u_h\|_{L^2} + \Big(\sum_{T\in\mathcal T_h}\delta_T\|b\cdot\nabla u_h\|_T^2\Big)^{\frac12} \le 2\sqrt3\Big(\frac{1}{\sqrt{\gamma_0}} + \sqrt{\bar\delta_h}\Big)\|f\|_{L^2}, \quad \text{with } \bar\delta_h := \max_{T\in\mathcal T_h}\delta_T. \]
Proof. The bilinear form $k_\delta(\cdot,\cdot)$ is trivially bounded on the finite dimensional space $V_h$. Lemma 4.4.6 yields $V_h$-ellipticity of the bilinear form. Existence of a unique solution follows from the Lax–Milgram lemma. For the left-hand side of the stability inequality we have
\[ \sqrt\varepsilon\,|u_h|_1 + \sqrt{\gamma_0}\,\|u_h\|_{L^2} + \Big(\sum_{T\in\mathcal T_h}\delta_T\|b\cdot\nabla u_h\|_T^2\Big)^{\frac12} \le \sqrt3\,|||u_h|||_\delta. \]
Lemma 4.4.6 yields
\[ |||u_h|||_\delta^2 \le 2k_\delta(u_h, u_h) = 2f_\delta(u_h). \]
Combining this with
\[ f_\delta(u_h) = \langle f, u_h\rangle_{L^2} + \sum_{T\in\mathcal T_h}\delta_T\langle f,\ b\cdot\nabla u_h\rangle_T \le \|f\|_{L^2}\|u_h\|_{L^2} + \Big(\sum_{T\in\mathcal T_h}\delta_T\|f\|_T^2\Big)^{\frac12}\Big(\sum_{T\in\mathcal T_h}\delta_T\|b\cdot\nabla u_h\|_T^2\Big)^{\frac12} \le \frac{1}{\sqrt{\gamma_0}}\|f\|_{L^2}\,|||u_h|||_\delta + \sqrt{\bar\delta_h}\,\|f\|_{L^2}\,|||u_h|||_\delta = \Big(\frac{1}{\sqrt{\gamma_0}} + \sqrt{\bar\delta_h}\Big)\|f\|_{L^2}\,|||u_h|||_\delta \]
completes the proof.
Remark 4.4.9 As an example consider the case with linear finite elements, $\delta_T = \delta$ for all $T$ and $\gamma_0 = 1$. Then the stability result of this corollary takes the form
\[ \sqrt\varepsilon\,|u_h|_1 + \|u_h\|_{L^2} + \sqrt\delta\,\|b\cdot\nabla u_h\|_{L^2} \le 2\sqrt3\,\big(1 + \sqrt\delta\big)\|f\|_{L^2}, \quad \text{for } \delta \in \big[0, \tfrac12 c_b^{-2}\big]. \tag{4.69} \]
Note the similarity of this result with the one in corollary 4.4.3 (for the continuous problem) and in lemma 4.3.9 (for the stabilized finite element method applied to the 1D hyperbolic problem).
From the results in corollary 4.4.3 and (4.69) we see that one obtains the strongest stability result if $\delta_T$ is chosen as large as possible (maximal stabilization). In section 4.3 it is shown that smaller values for the stabilization parameter may lead to smaller discretization errors. Below we give an analysis of how to choose the parameter $\delta_T$ such that the (bound for the) discretization error is minimized.
Application of theorem 4.2.2 yields:
Theorem 4.4.10 Assume that (4.67) is satisfied. For the discrete solution $u_h$ of (4.63a) we have the error bound
\[ |||u - u_h|||_\delta + \|u - u_h\|_{*,h,\delta} \le C\inf_{v_h\in V_h}\big(|||u - v_h|||_\delta + \|u - v_h\|_{*,h,\delta}\big) \tag{4.70} \]
with $C = 1 + \max\{c_b, 1\}\big(3 + 2\max\{c_b, 1\}\big)$.
Proof. From lemma 4.4.5 and lemma 4.4.6 it follows that the conditions (4.16) and (4.17) are satisfied with $c_0 = \frac12$, $c_1 = \max\{c_b, 1\}$. Now we use theorem 4.2.2.
The norm $|||\cdot|||_\delta$ is given in terms of the usual $L^2$- and $H^1$-norms. For the right-hand side in (4.70) we need a bound on $\|u - v_h\|_{*,h,\delta}$. Such a result is given in the following lemma. We will need the assumption
\[ \|\mathrm{div}\,b\|_{L^\infty} \le \tilde\gamma_0\,\gamma_0 \tag{4.71} \]
($\gamma_0$ as in (4.55)). This can always be satisfied (for suitable $\tilde\gamma_0$) if $\gamma_0 > 0$. For the case $\gamma_0 = 0$ this implies that $\mathrm{div}\,b = 0$ must hold.
Lemma 4.4.11 Let $\delta_T$ be such that (4.67) holds and assume that (4.55) and (4.71) are satisfied. For $u \in U$ the following estimate holds:
\[ \|u\|_{*,h,\delta} \le \sqrt\varepsilon\,|u|_1 + \sqrt{\gamma_0}\,(1 + \tilde\gamma_0)\|u\|_{L^2} + \Big(\sum_{T\in\mathcal T_h}\min\Big\{\delta_T^{-1},\ \frac{\|b\|_{\infty,T}^2}{\varepsilon + \mu_{\mathrm{inv}}^{-2}h_T^2\gamma_0}\Big\}\|u\|_T^2\Big)^{\frac12}. \]
Proof. By definition we have
\[ \|u\|_{*,h,\delta} = \sup_{v_h\in V_h}\frac{\langle b\cdot\nabla u, v_h\rangle_{L^2} + \sum_{T\in\mathcal T_h}\delta_T\langle -\varepsilon\Delta u + cu,\ b\cdot\nabla v_h\rangle_T}{|||v_h|||_\delta}. \tag{4.72} \]
We first treat the second term in the numerator. Using the inverse inequality (4.66) and $\delta_T^{\frac12} \le \frac{1}{\sqrt2}\,\varepsilon^{-\frac12}\mu_{\mathrm{inv}}^{-1}h_T$, $\delta_T^{\frac12} \le \frac{1}{\sqrt2}\,\gamma_0^{-\frac12}c_b^{-1}$ we get
\[ \Big|\sum_{T\in\mathcal T_h}\delta_T\langle -\varepsilon\Delta u + cu,\ b\cdot\nabla v_h\rangle_T\Big| \le \sum_{T\in\mathcal T_h}\delta_T\big(\varepsilon|u|_{2,T} + c_b\gamma_0\|u\|_T\big)\|b\cdot\nabla v_h\|_T \le \sum_{T\in\mathcal T_h}\frac{1}{\sqrt2}\big(\sqrt\varepsilon\,|u|_{1,T} + \sqrt{\gamma_0}\,\|u\|_T\big)\sqrt{\delta_T}\,\|b\cdot\nabla v_h\|_T \le \big(\varepsilon|u|_1^2 + \gamma_0\|u\|_{L^2}^2\big)^{\frac12}\,|||v_h|||_\delta \le \big(\sqrt\varepsilon\,|u|_1 + \sqrt{\gamma_0}\,\|u\|_{L^2}\big)\,|||v_h|||_\delta. \tag{4.73} \]
For the first term in the numerator in (4.72) we obtain, using partial integration,
\[ \langle b\cdot\nabla u, v_h\rangle_{L^2} = -\langle u,\ b\cdot\nabla v_h\rangle_{L^2} - \langle \mathrm{div}\,b\;u, v_h\rangle_{L^2}. \tag{4.74} \]
We write $|||v_h|||_\delta^2 = \sum_{T\in\mathcal T_h}\big(\varepsilon|v_h|_{1,T}^2 + \gamma_0\|v_h\|_T^2 + \delta_T\|b\cdot\nabla v_h\|_T^2\big) =: \sum_{T\in\mathcal T_h}\rho_T^2$. For the first term on the right-hand side in (4.74) we have
\[ |\langle u,\ b\cdot\nabla v_h\rangle_{L^2}| = \Big|\sum_{T\in\mathcal T_h}\langle u,\ b\cdot\nabla v_h\rangle_T\Big| \le \sum_{T\in\mathcal T_h}\|u\|_T\,\|b\cdot\nabla v_h\|_T. \]
From $\|b\cdot\nabla v_h\|_T \le \delta_T^{-\frac12}\rho_T$ and
\[ \|b\cdot\nabla v_h\|_T \le \|b\|_{\infty,T}|v_h|_{1,T} = \frac{\|b\|_{\infty,T}}{\big(\varepsilon + \mu_{\mathrm{inv}}^{-2}h_T^2\gamma_0\big)^{\frac12}}\big(\varepsilon|v_h|_{1,T}^2 + \mu_{\mathrm{inv}}^{-2}h_T^2\gamma_0\,|v_h|_{1,T}^2\big)^{\frac12} \le \frac{\|b\|_{\infty,T}}{\big(\varepsilon + \mu_{\mathrm{inv}}^{-2}h_T^2\gamma_0\big)^{\frac12}}\big(\varepsilon|v_h|_{1,T}^2 + \gamma_0\|v_h\|_T^2\big)^{\frac12} \le \frac{\|b\|_{\infty,T}}{\big(\varepsilon + \mu_{\mathrm{inv}}^{-2}h_T^2\gamma_0\big)^{\frac12}}\,\rho_T, \]
we get
\[ |\langle u,\ b\cdot\nabla v_h\rangle_{L^2}| \le \sum_{T\in\mathcal T_h}\|u\|_T\min\Big\{\delta_T^{-\frac12},\ \frac{\|b\|_{\infty,T}}{\big(\varepsilon + \mu_{\mathrm{inv}}^{-2}h_T^2\gamma_0\big)^{\frac12}}\Big\}\rho_T \le \Big(\sum_{T\in\mathcal T_h}\min\Big\{\delta_T^{-1},\ \frac{\|b\|_{\infty,T}^2}{\varepsilon + \mu_{\mathrm{inv}}^{-2}h_T^2\gamma_0}\Big\}\|u\|_T^2\Big)^{\frac12}\,|||v_h|||_\delta. \]
109
Combining this with the results in (4.72) and (4.73) completes the proof.
For the estimation of the approximation error in (4.70) we use an interpolation operator (e.g.,
nodal interpolation)
$$I_{V_h}: H \to V_h = X_{h,0}^k \quad (k\ge1),$$
that satisfies
$$\|u - I_{V_h}u\|_T \le c\,h_T^m\,|u|_{m,T}, \tag{4.75a}$$
$$|u - I_{V_h}u|_{1,T} \le c\,h_T^{m-1}\,|u|_{m,T}, \tag{4.75b}$$
for $u \in H^m(\Omega)$, with $2 \le m \le k+1$. A main discretization error result is given in the following
theorem.
Theorem 4.4.12 Assume that (4.55), (4.71) hold and that $\delta_T$ is such that (4.67) is satisfied.
Let $u$ be the solution of (4.2) and assume that $u \in H^m(\Omega)$. Let $u_h \in V_h = X^k_{h,0}$ be the solution
of the discrete problem (4.63). For $2 \le m \le k+1$ the discretization error bound
$$\sqrt\varepsilon\,|u-u_h|_1 + \sqrt{\gamma_0}\,\|u-u_h\|_{L^2} + \Big(\sum_{T\in\mathcal T_h}\delta_T\|b\cdot\nabla(u-u_h)\|_T^2\Big)^{\frac12} \le C\big(\sqrt\varepsilon\,h^{m-1} + \sqrt{\gamma_0(1+\eta_0)}\,h^m\big)|u|_m \tag{4.76}$$
$$\qquad + C\Big\{\sum_{T\in\mathcal T_h}\Big[\delta_T\|b\|_{\infty,T}^2 h_T^{2m-2} + \min\Big\{\delta_T^{-1},\,\frac{\|b\|_{\infty,T}^2}{\varepsilon+\mu_{\mathrm{inv}}^{-2}h_T^2\gamma_0}\Big\}h_T^{2m}\Big]\,|u|_{m,T}^2\Big\}^{\frac12} \tag{4.77}$$
holds.
Proof. We apply theorem 4.4.10. For the left-hand side in (4.70) we have
$$|||u-u_h||| + \|u-u_h\|_{*,h,\delta} \ge \frac{1}{\sqrt3}\Big[\sqrt\varepsilon\,|u-u_h|_1 + \sqrt{\gamma_0}\,\|u-u_h\|_{L^2} + \Big(\sum_{T\in\mathcal T_h}\delta_T\|b\cdot\nabla(u-u_h)\|_T^2\Big)^{\frac12}\Big].$$
For the right-hand side in (4.70) we obtain, using $\|b\cdot\nabla(u-v_h)\|_T \le \|b\|_{\infty,T}|u-v_h|_{1,T}$ and
lemma 4.4.11:
$$\begin{aligned}
\inf_{v_h\in V_h}\big(|||u-v_h||| + \|u-v_h\|_{*,h,\delta}\big) &\le |||u-I_{V_h}u||| + \|u-I_{V_h}u\|_{*,h,\delta}\\
&\le C\Big(\sqrt\varepsilon\,|u-I_{V_h}u|_1 + \sqrt{\gamma_0(1+\eta_0)}\,\|u-I_{V_h}u\|_{L^2}\Big)\\
&\quad + C\Big(\sum_{T\in\mathcal T_h}\Big[\delta_T\|b\|_{\infty,T}^2\,|u-I_{V_h}u|_{1,T}^2 + \min\Big\{\delta_T^{-1},\,\frac{\|b\|_{\infty,T}^2}{\varepsilon+\mu_{\mathrm{inv}}^{-2}h_T^2\gamma_0}\Big\}\|u-I_{V_h}u\|_T^2\Big]\Big)^{\frac12}.
\end{aligned}$$
Note that this theorem covers the cases $\delta_T = 0$ for all $T$ (i.e., no stabilization) and $\gamma_0 = 0$.
To gain more insight we consider a special case:

Remark 4.4.13 We take $b = (1\;0)^T$, $c \equiv 1$ (hence $\gamma_0 = 1$, $\eta_0 = 0$), $\delta_T = \delta$ for all $T$ and $m = 2$.
Then the estimate in the previous theorem takes the form
$$\sqrt\varepsilon\,|u-u_h|_1 + \|u-u_h\|_{L^2} + \sqrt\delta\,\Big\|\frac{\partial}{\partial x}(u-u_h)\Big\|_{L^2} \le C\Big(\sqrt\varepsilon\,h + h^2 + \sqrt\delta\,h + \min\Big\{\frac{h^2}{\sqrt\delta},\,\frac{h}{\sqrt{\varepsilon/h^2+1}}\Big\}\Big)|u|_2,$$
with $C$ independent of $u$, $h$, $\delta$ and $\varepsilon$. For $\varepsilon \downarrow 0$ this result is very similar to the one in
corollary 4.3.12 for the one-dimensional hyperbolic problem.
For the case without stabilization we obtain the following.

Corollary 4.4.14 If the assumptions of theorem 4.4.12 are fulfilled we have the following dis-
cretization error bounds for the case $\delta_T = 0$ for all $T$:
$$|u-u_h|_1 \le C\Big(1+\frac{h}{\varepsilon}\Big)h^{m-1}|u|_m, \tag{4.78}$$
$$\|u-u_h\|_{L^2} \le C\,h^{m-1}|u|_m \quad\text{if } \gamma_0 > 0, \tag{4.79}$$
with a constant $C$ independent of $u$, $h$ and $\varepsilon$.
For $\varepsilon \ll 1$ these bounds are much better than the ones in (4.51) and (4.52) which resulted from the
standard finite element error analysis. Moreover, the results in (4.78), (4.79) reflect important
properties of the standard Galerkin finite element discretization that are observed in practice:
- If $h \lesssim \varepsilon$ holds then (4.78) yields $|u-u_h|_1 \le Ch^{m-1}|u|_m$ with $C$ independent of $h$ and
$\varepsilon$, which is an optimal discretization error bound. This explains the fact that for $h \lesssim \varepsilon$
the standard Galerkin finite element method applied to the convection-diffusion problem
usually yields an accurate discretization. Note, however, that for $\varepsilon \ll 1$ the condition
$h \lesssim \varepsilon$ is very unfavourable in view of computational costs.
- For fixed $h$, even if $u$ is smooth (i.e., $|u|_m$ bounded for $\varepsilon \downarrow 0$) the $H^1$ error bound in (4.78)
tends to infinity for $\varepsilon \downarrow 0$. Thus, if the analysis is sharp, we expect that an instability
phenomenon can occur for $\varepsilon \downarrow 0$.
- For the case $\gamma_0 > 0$ we have the suboptimal bound $h^{m-1}|u|_m$ for the $L^2$ norm of the
discretization error (for an optimal bound we should have $h^m|u|_m$). If $u$ is smooth ($|u|_m$
bounded for $\varepsilon \downarrow 0$) the discretization error in $\|\cdot\|_{L^2}$ will be arbitrarily small if $h$ is sufficiently
small, even if $h \gg \varepsilon$. Hence, for the case $\gamma_0 > 0$ the $L^2$ norm of the discretization error can-
not show an instability phenomenon as described in the previous remark for the $H^1$ norm.
Note, however, that the $L^2$ norm is weaker than the $H^1$ norm and in particular allows a
more oscillatory behaviour of the error.
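The instability for $h \gg \varepsilon$ described above is easy to observe numerically. The following sketch (our own illustrative setup, not taken from the text) discretizes $-\varepsilon u'' + u' = 1$ on $(0,1)$ with homogeneous Dirichlet conditions by linear finite elements on a uniform mesh and counts sign changes in the increments of the discrete solution; many sign changes indicate the oscillatory boundary-layer behaviour:

```python
import numpy as np

def fem_conv_diff(eps, n):
    """Linear FEM for -eps*u'' + u' = 1 on (0,1), u(0)=u(1)=0, n interior nodes."""
    h = 1.0 / (n + 1)
    # element assembly gives a tridiagonal system: diffusion eps/h*(-1,2,-1),
    # convection (central) (-1/2, 0, 1/2), load h for f = 1
    A = (np.diag(np.full(n, 2.0 * eps / h))
         + np.diag(np.full(n - 1, -eps / h - 0.5), -1)
         + np.diag(np.full(n - 1, -eps / h + 0.5), 1))
    u = np.linalg.solve(A, np.full(n, h))
    return np.concatenate(([0.0], u, [0.0]))   # append boundary values

def sign_changes(u):
    d = np.diff(u)
    return int(np.sum(d[:-1] * d[1:] < 0))

eps = 5e-3
osc_coarse = sign_changes(fem_conv_diff(eps, 15))   # h = 1/16  >> eps: oscillations
osc_fine = sign_changes(fem_conv_diff(eps, 255))    # h = 1/256 <  2*eps: monotone + layer
print(osc_coarse, osc_fine)
```

On the coarse mesh (mesh Péclet number above one) the increments alternate in sign near the outflow layer; on the fine mesh the profile increases and then decreases once through the layer.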
We now turn to the question whether the results can be improved by choosing a suitable value
for the stabilization parameter $\delta_T$. For the term between square brackets in (4.77) we have
$$\delta_T\|b\|_{\infty,T}^2 h_T^{2m-2} + \min\Big\{\delta_T^{-1},\,\frac{\|b\|_{\infty,T}^2}{\varepsilon+\mu_{\mathrm{inv}}^{-2}h_T^2\gamma_0}\Big\}h_T^{2m} \le g_T(\delta_T)\,\|b\|_{\infty,T}\,h_T^{2m-2}, \tag{4.80}$$
with $g_T(\delta) := \delta\|b\|_{\infty,T} + \min\Big\{\frac{1}{\delta\|b\|_{\infty,T}},\,\frac{\|b\|_{\infty,T}}{\varepsilon+\mu_{\mathrm{inv}}^{-2}h_T^2\gamma_0}\Big\}h_T^2$. For $\delta \le h_T\|b\|_{\infty,T}^{-1}$ the function $g_T$ attains
its minimum at $\delta = h_T\|b\|_{\infty,T}^{-1}$. Remember the condition on $\delta_T$ in (4.67). This leads to the
parameter choice
$$\delta_{T,\mathrm{opt}} := \beta_T\,\frac{h_T}{\|b\|_{\infty,T}},\qquad\text{with } \beta_T := \min\Big\{1,\,\frac12\,\mu_{\mathrm{inv}}^{-2}\,\frac{\|b\|_{\infty,T}\,h_T}{\varepsilon}\Big\}. \tag{4.81}$$
If $\|b\|_{\infty,T} = 0$ we take $\delta_{T,\mathrm{opt}} = 0$. Note that $\delta_{T,\mathrm{opt}} \le h_T\|b\|_{\infty,T}^{-1}$ and thus for $h_T$ sufficiently
small the condition $\delta_{T,\mathrm{opt}} \le \frac12\gamma_0^{-1}c_b^{-2}$ in (4.67) is satisfied. The second condition in (4.67) is
satisfied due to the definition of $\delta_{T,\mathrm{opt}}$ in (4.81). If $\beta_T = 1$ we have
$$g_T(\delta_{T,\mathrm{opt}}) \le \delta_{T,\mathrm{opt}}\|b\|_{\infty,T} + \frac{h_T^2}{\delta_{T,\mathrm{opt}}\|b\|_{\infty,T}} = 2h_T,$$
and if $\beta_T < 1$ this implies $\frac{\|b\|_{\infty,T}\,h_T}{\varepsilon} \le 2\mu_{\mathrm{inv}}^2$ and thus
$$g_T(\delta_{T,\mathrm{opt}}) \le \delta_{T,\mathrm{opt}}\|b\|_{\infty,T} + \frac{\|b\|_{\infty,T}}{\varepsilon}\,h_T^2 \le \big(1+2\mu_{\mathrm{inv}}^2\big)\,h_T.$$
Hence, for $\delta_T = \delta_{T,\mathrm{opt}}$ we obtain the following bound for the $\delta_T$-dependent term in (4.77):
$$\Big\{\sum_{T\in\mathcal T_h}\Big[\delta_T\|b\|_{\infty,T}^2 h_T^{2m-2} + \min\Big\{\delta_T^{-1},\,\frac{\|b\|_{\infty,T}^2}{\varepsilon+\mu_{\mathrm{inv}}^{-2}h_T^2\gamma_0}\Big\}h_T^{2m}\Big]\,|u|_{m,T}^2\Big\}^{\frac12} \le C\,\|b\|_{L^\infty}^{\frac12}\,h^{m-\frac12}\,|u|_m.$$
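The choice (4.81) is straightforward to evaluate elementwise. A minimal sketch (the function and argument names are ours; `mu_inv` stands for the constant in the inverse inequality (4.66)) that also exhibits the two regimes of the parameter:

```python
def delta_opt(h_T, b_inf_T, eps, mu_inv=1.0):
    """Stabilization parameter delta_{T,opt} as in (4.81).

    beta_T = min(1, 0.5 * mu_inv**(-2) * b_inf_T * h_T / eps) switches between the
    convection-dominated regime (beta_T = 1, delta ~ h_T / |b|) and the
    diffusion-dominated regime (delta ~ h_T^2 / (2 * mu_inv^2 * eps)).
    """
    if b_inf_T == 0.0:
        return 0.0
    beta_T = min(1.0, 0.5 * b_inf_T * h_T / (mu_inv ** 2 * eps))
    return beta_T * h_T / b_inf_T

d_conv = delta_opt(h_T=0.1, b_inf_T=1.0, eps=1e-6)  # h*|b| >> eps: delta = h/|b| = 0.1
d_diff = delta_opt(h_T=0.1, b_inf_T=1.0, eps=10.0)  # h*|b| << eps: delta = h^2/(2 eps)
print(d_conv, d_diff)
```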
Corollary 4.4.15 Let the assumptions be as in theorem 4.4.12. For $\delta_T = \delta_{T,\mathrm{opt}}$ we get the
estimate
$$\sqrt\varepsilon\,|u-u_h|_1 + \sqrt{\gamma_0}\,\|u-u_h\|_{L^2} + \Big(\sum_{T\in\mathcal T_h}\delta_{T,\mathrm{opt}}\|b\cdot\nabla(u-u_h)\|_T^2\Big)^{\frac12} \le C\Big(\sqrt\varepsilon + \sqrt{\gamma_0(1+\eta_0)}\,h + \|b\|_{L^\infty}^{\frac12}\sqrt h\Big)h^{m-1}|u|_m. \tag{4.82}$$
This implies
$$|u-u_h|_1 \le C\Big(1 + \sqrt{\frac{\gamma_0(1+\eta_0)}{\varepsilon}}\,h + \sqrt{\frac{\|b\|_{L^\infty}h}{\varepsilon}}\Big)h^{m-1}|u|_m, \tag{4.83a}$$
$$\|u-u_h\|_{L^2} \le C\Big(\sqrt{\frac{\varepsilon}{\gamma_0}} + \sqrt{1+\eta_0}\,h + \sqrt{\frac{\|b\|_{L^\infty}h}{\gamma_0}}\Big)h^{m-1}|u|_m \quad\text{if } \gamma_0>0, \tag{4.83b}$$
$$\Big(\sum_{T\in\mathcal T_h}\delta_{T,\mathrm{opt}}\|b\cdot\nabla(u-u_h)\|_T^2\Big)^{\frac12} \le C\Big(\sqrt\varepsilon + \sqrt{\gamma_0(1+\eta_0)}\,h + \|b\|_{L^\infty}^{\frac12}\sqrt h\Big)h^{m-1}|u|_m. \tag{4.83c}$$
The result in (4.83c) shows a control on the streamline derivative of the discretization
error. Such a control is not present in the case $\delta_T = 0$ (no stabilization). If $\beta_T = 1$ for all
$T$ (i.e., $\varepsilon \le \frac12\mu_{\mathrm{inv}}^{-2}\|b\|_{\infty,T}h_T$) and $h_T \ge c_0 h$ with $c_0 > 0$ independent of $T$ and $h$ we obtain
$$\|b\cdot\nabla(u-u_h)\|_{L^2} \le c\,\sqrt{\frac{\varepsilon}{h}+1}\;h^{m-1}|u|_m.$$
In (4.82) there is a correct scaling of $\varepsilon$, $\gamma_0$ and $b$. Note that $\delta_{T,\mathrm{opt}} = \beta_T\frac{h_T}{\|b\|_{\infty,T}}$ has a scaling
w.r.t. $\|b\|_{\infty,T}$ that is the same as in the one-dimensional hyperbolic problem in (4.47).
Chapter 5
For the discretization error we have a variant of the Céa lemma 3.1.1:

Theorem 5.1.1 Consider the variational problem (5.1) with continuous bilinear forms $a(\cdot,\cdot)$
and $b(\cdot,\cdot)$ that satisfy:
$$\exists\,\beta>0:\quad \sup_{\varphi\in V}\frac{b(\varphi,\lambda)}{\|\varphi\|_V} \ge \beta\,\|\lambda\|_M\quad\text{for all }\lambda\in M, \tag{5.6a}$$
$$\exists\,\alpha>0:\quad a(\varphi,\varphi) \ge \alpha\,\|\varphi\|_V^2\quad\text{for all }\varphi\in V, \tag{5.6b}$$
$$\exists\,\beta_h>0:\quad \sup_{\varphi_h\in V_h}\frac{b(\varphi_h,\lambda_h)}{\|\varphi_h\|_V} \ge \beta_h\,\|\lambda_h\|_M\quad\text{for all }\lambda_h\in M_h. \tag{5.6c}$$
Then the problem (5.1) and its Galerkin discretization (5.5) have unique solutions $(\varphi,\lambda)$ and
$(\varphi_h,\lambda_h)$, respectively. Furthermore the inequality
$$\|\varphi-\varphi_h\|_V + \|\lambda-\lambda_h\|_M \le C\Big(\inf_{\psi_h\in V_h}\|\varphi-\psi_h\|_V + \inf_{\mu_h\in M_h}\|\lambda-\mu_h\|_M\Big)$$
holds, with $C = \sqrt2\,\Big(1+\frac{1}{\alpha\beta_h^2}\Big)\big(2\|a\|+\|b\|\big)^3$.
Proof. We shall apply lemma 3.1.1 to the variational problem (5.1) and its Galerkin dis-
cretization (5.5). Hence, we have to verify the conditions (3.2), (3.3), (3.4), (3.6). First note
that
$$|k(U,V)| \le \big(\|a\|+\|b\|\big)\,\|U\|_H\,\|V\|_H$$
holds and thus the condition (3.2) is satisfied with $M = \|a\|+\|b\|$. Due to the assumptions
(5.6a) and (5.6b) it follows from corollary 2.3.12 and theorem 2.3.10 that the conditions (3.3),
(3.4) are satisfied. Due to (5.6b) and (5.6c) it follows from corollary 2.3.12 and theorem 2.3.10,
with $V$ and $M$ replaced by $V_h$ and $M_h$, respectively, that the condition (3.6) is fulfilled, too.
Moreover, from the final statement in theorem 2.3.10 we obtain that (3.6) holds with
$$\gamma_h = \Big(\frac{\beta_h}{\beta_h+2\|a\|}\Big)^2\Big(\frac{\beta_h}{\|b\|+2\|a\|}\Big)^2. \qquad\Box$$
Remark 5.1.2 The condition (5.6c) implies $\dim(V_h) \ge \dim(M_h)$. This can be shown by the
following argument. Let $(\varphi_j)_{1\le j\le m}$ be a basis of $V_h$ and $(\mu_i)_{1\le i\le k}$ a basis of $M_h$. Define the
matrix $B\in\mathbb R^{k\times m}$ by
$$B_{ij} = b(\varphi_j, \mu_i).$$
From (5.6c) it follows that for every $\lambda_h\in M_h$, $\lambda_h\ne0$, there exists $\varphi_h\in V_h$ such that $b(\varphi_h,\lambda_h)\ne0$.
Thus for every $y\in\mathbb R^k$, $y\ne0$, there exists $x\in\mathbb R^m$ such that $y^TBx\ne0$, i.e., $x^TB^Ty\ne0$.
This implies that all columns of $B^T$, and thus all rows of $B$, are independent. A necessary
condition for this is $k \le m$.
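The dimension argument of remark 5.1.2 can be made concrete with a few lines of linear algebra (the matrix below is random, purely for illustration): whenever $\dim(M_h) > \dim(V_h)$, the matrix $B$ necessarily admits a vector $y \ne 0$ with $B^Ty = 0$, i.e. a pressure mode that the bilinear form $b(\cdot,\cdot)$ cannot detect, so (5.6c) fails:

```python
import numpy as np

rng = np.random.default_rng(1)
k, m = 5, 3                        # dim(M_h) = 5 > dim(V_h) = 3
B = rng.standard_normal((k, m))    # B_ij = b(phi_j, mu_i) (random placeholder values)

# B^T is m x k with rank <= m < k, so its null space is nontrivial:
_, _, Vt = np.linalg.svd(B.T)      # rows of Vt beyond the rank span the null space
y = Vt[-1]                         # coefficient vector of a "spurious" pressure mode
residual = np.linalg.norm(B.T @ y)
print(residual)                    # ~ 0: y^T B x = 0 for every x
```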
5.2 Finite element discretization of the Stokes problem
We recall the variational formulation of the Stokes problem (with homogeneous Dirichlet bound-
ary conditions) given in section 2.6: find $(u,p)\in V\times M$ such that
$$a(u,v) + b(v,p) = f(v)\quad\text{for all } v\in V,\qquad b(u,q) = 0\quad\text{for all } q\in M,$$
with
$$b(v,q) := \int_\Omega q\,\operatorname{div} v\,dx,\qquad f(v) := \int_\Omega f\cdot v\,dx.$$
For the Galerkin discretization of this problem we use the simplicial finite element spaces defined
in section 3.2.1, i.e., for a given family $\{\mathcal T_h\}$ of admissible triangulations of $\Omega$ we define the pair
of spaces
$$(V_h, M_h) := \big((X^k_{h,0})^n,\; X^{k-1}_h\cap L^2_0(\Omega)\big),\qquad k\ge1. \tag{5.8}$$
A short discussion concerning other finite element spaces that can be used for the Stokes problem
is given in section 5.2.2. For $k\ge2$, the spaces in (5.8) are called Hood-Taylor finite elements [52].
Note that for $k=1$ the pressure space $X^0_h\cap L^2_0(\Omega)$ contains discontinuous functions, whereas for
$k\ge2$ all functions in the pressure space are continuous. The Stokes problem fits in the general
setting presented in section 5.1. The discrete problem is as in (5.5): find $(u_h,p_h)\in V_h\times M_h$ such that
$$a(u_h,v_h) + b(v_h,p_h) = f(v_h)\quad\text{for all } v_h\in V_h,\qquad b(u_h,q_h) = 0\quad\text{for all } q_h\in M_h. \tag{5.9}$$
From the analysis in section 2.6 it follows that the conditions (5.6a) and (5.6b) in theorem 5.1.1
are satisfied. The following remark shows that the condition in (5.6c), which is often called the
discrete inf-sup condition, needs a careful analysis:

Remark 5.2.1 Take $n=2$, $\Omega=(0,1)^2$ and a uniform triangulation $\mathcal T_h$ of $\Omega$ that is defined as
follows. For $N\in\mathbb N$ and $h := \frac{1}{N+1}$ the domain $\Omega$ is subdivided into squares with sides of length $h$
and vertices in the set $\{\,(ih, jh) \mid 0\le i,j\le N+1\,\}$. The triangulation $\mathcal T_h$ is obtained by subdi-
viding every square into two triangles by inserting a diagonal from $(ih, jh)$ to $((i+1)h, (j+1)h)$.
The spaces $(V_h, M_h)$ are defined as in (5.8) with $k=1$. The space $V_h$ has dimension $2N^2$ and
$\dim(M_h) = 2(N+1)^2 - 1$. From $\dim(V_h) < \dim(M_h)$ and remark 5.1.2 it follows that the
condition (5.6c) does not hold.
The same argument applies to the three-dimensional case with a uniform triangulation of
$(0,1)^3$ consisting of tetrahedra (every cube is subdivided into 6 tetrahedra). In this case we
have $\dim(V_h) = 3N^3$ and $\dim(M_h) = 6(N+1)^3 - 1$.
We now show that also the lowest order rectangular finite elements in general do not satisfy
(5.6c). For this we consider $n=2$, $\Omega=(0,1)^2$ and use a triangulation consisting of squares
$$T_{ij} := [ih, (i+1)h]\times[jh, (j+1)h],\qquad 0\le i,j\le N,\quad h := \frac{1}{N+1}.$$
We assume that $N$ is odd. The corresponding lowest order rectangular finite element spaces
$(V_h, M_h) = (Q^1_{h,0}, Q^0_h\cap L^2_0(\Omega))$ are defined in (3.17). We define $p_h\in M_h$ by $(p_h)_{|T_{ij}} = (-1)^{i+j}$.
For $u_h\in V_h$ we use the notation $u_h = (u,v)$, $u(ih,jh) =: u_{i,j}$, $v(ih,jh) =: v_{i,j}$. Then we have:
$$\begin{aligned}
\int_{T_{ij}} p_h\operatorname{div} u_h\,dx &= (-1)^{i+j}\int_{\partial T_{ij}} u_h\cdot n\,ds\\
&= (-1)^{i+j}\,\frac h2\Big[(u_{i+1,j+1}+u_{i+1,j}) - (u_{i,j+1}+u_{i,j}) + (v_{i+1,j+1}+v_{i,j+1}) - (v_{i+1,j}+v_{i,j})\Big]
\end{aligned}$$
and thus
$$b(u_h, p_h) = \int_\Omega p_h\operatorname{div} u_h\,dx = \sum_{i,j=0}^{N}\int_{T_{ij}} p_h\operatorname{div} u_h\,dx = 0$$
for arbitrary $u_h\in V_h$ (each interior nodal value occurs in four adjacent cells with alternating
signs). We conclude that there exists $p_h\in M_h$, $p_h\ne0$, such that $\sup_{v_h\in V_h}|b(v_h,p_h)| = 0$
and thus the discrete inf-sup condition does not hold for the pair $(V_h, M_h)$.
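The cancellation claimed above can be checked directly: using the formula for $\int_{T_{ij}} p_h\operatorname{div}u_h\,dx$, the checkerboard-weighted sum over all cells vanishes for arbitrary nodal velocity values. A small sketch (the mesh size $N$ and the random nodal values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5                              # N odd, as in the text; (N+1)^2 cells, h = 1/(N+1)
h = 1.0 / (N + 1)

# nodal velocity values on the (N+2)x(N+2) grid, zero on the boundary
u = np.zeros((N + 2, N + 2)); v = np.zeros((N + 2, N + 2))
u[1:-1, 1:-1] = rng.standard_normal((N, N))
v[1:-1, 1:-1] = rng.standard_normal((N, N))

b_val = 0.0
for i in range(N + 1):
    for j in range(N + 1):
        cell = ((u[i + 1, j + 1] + u[i + 1, j]) - (u[i, j + 1] + u[i, j])
                + (v[i + 1, j + 1] + v[i, j + 1]) - (v[i + 1, j] + v[i, j]))
        b_val += (-1) ** (i + j) * (h / 2.0) * cell

print(abs(b_val))   # ~ 1e-16: the checkerboard pressure is invisible to div V_h
```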
Definition 5.2.2 Let $\{\mathcal T_h\}$ be a family of admissible triangulations of $\Omega$. Suppose that to every
$\mathcal T_h\in\{\mathcal T_h\}$ there correspond finite element spaces $V_h\subset V$ and $M_h\subset M$. The pair $(V_h, M_h)$ is
called stable if there exists a constant $\hat\beta>0$ independent of $\mathcal T_h\in\{\mathcal T_h\}$ such that
$$\sup_{v_h\in V_h}\frac{b(v_h,q_h)}{\|v_h\|_1} \ge \hat\beta\,\|q_h\|_{L^2}\qquad\text{for all } q_h\in M_h. \tag{5.10}$$
For the nodal basis functions $\phi_i$ we define a neighbourhood of the vertex $x_i$ by
$$\omega_{x_i} := \operatorname{supp}(\phi_i) = \bigcup\{\,T\in\mathcal T_h \mid x_i\in T\,\}$$
and a neighbourhood of $T\in\mathcal T_h$ by
$$\omega_T := \bigcup\{\,\omega_{x_i} \mid x_i\in T\,\}.$$
With suitable local projections $P_i$ the Clément operator is defined by
$$I^C_X u = \sum_{i=1}^{N}(P_i u)\,\phi_i. \tag{5.11}$$
Variants of this operator are discussed in [13, 81, 14]. Results as in theorem 5.2.3 also hold
if $H^1_0(\Omega)$ and $X^1_{h,0}$ are replaced by $H^1(\Omega)$ and $X^1_h$, respectively.
Using the Clément operator one can reformulate the stability condition (5.10) in another form
that turns out to be easier to handle. This reformulation is given in [93] and applies to a large
class of finite element spaces. Here we only present a simplified variant that applies to the
Hood-Taylor finite element spaces. We will need the mesh-dependent norm
$$\|q_h\|_{1,h} := \Big(\sum_{T\in\mathcal T_h} h_T^2\,\|\nabla q_h\|_{0,T}^2\Big)^{\frac12},\qquad q_h\in X^1_h\cap L^2_0(\Omega).$$
Theorem 5.2.4 Let $\{\mathcal T_h\}$ be a regular family of triangulations. The Hood-Taylor pair of finite
element spaces $(V_h, M_h)$ as in (5.8), $k\ge2$, is stable iff there exists a constant $\tilde\beta>0$ independent
of $\mathcal T_h\in\{\mathcal T_h\}$ such that
$$\sup_{v_h\in V_h}\frac{b(v_h,q_h)}{\|v_h\|_1} \ge \tilde\beta\,\|q_h\|_{1,h}\qquad\text{for all } q_h\in M_h. \tag{5.13}$$
Proof. For $q_h\in M_h$ an inverse inequality yields $h_T\|\nabla q_h\|_{0,T} \le C\,\|q_h\|_{0,T}$
with a constant $C$ independent of $q_h$ and of $T$. This yields $\|q_h\|_{1,h} \le C\,\|q_h\|_{L^2}$ and thus the
stability property implies (5.13).
Assume that (5.13) holds. Take an arbitrary $q_h\in M_h$ with $\|q_h\|_{L^2} = 1$. The constants $C$ below
are independent of $q_h$ and of $\mathcal T_h\in\{\mathcal T_h\}$. From the inf-sup property of the continuous problem
it follows that there exists $\beta>0$, independent of $q_h$, and $v\in H^1_0(\Omega)^n$ such that
$$\|v\|_1 = 1,\qquad b(v, q_h) \ge \beta.$$
Let $w_h\in V_h$ be an approximation of $v$ as in theorem 5.2.3, i.e.,
$$\|w_h\|_1 \le C\,\|v\|_1 = C,\qquad \sum_{T\in\mathcal T_h} h_T^{-2}\,\|v-w_h\|_{0,T}^2 \le C\,\|v\|_1^2 = C.$$
Then
$$|b(w_h - v, q_h)| \le \Big(\sum_{T\in\mathcal T_h} h_T^{-2}\|v-w_h\|_{0,T}^2\Big)^{\frac12}\Big(\sum_{T\in\mathcal T_h} h_T^{2}\|\nabla q_h\|_{0,T}^2\Big)^{\frac12} \le C\,\|q_h\|_{1,h},$$
and thus
$$\sup_{v_h\in V_h}\frac{b(v_h,q_h)}{\|v_h\|_1} \ge \frac{b(w_h,q_h)}{\|w_h\|_1} \ge C\,b(w_h,q_h) = C\,b(v,q_h) + C\,b(w_h - v, q_h) \ge C\beta - C\,\|q_h\|_{1,h}.$$
Combining this with (5.13) we obtain the stability estimate (5.10). $\Box$
Theorem 5.2.5 Let $\{\mathcal T_h\}$ be a regular family of triangulations such that every $T\in\mathcal T_h$ has at
least one vertex in the interior of $\Omega$. Then the Hood-Taylor pair of finite element spaces with
$k=2$ is stable.

Proof. We consider only $n\in\{2,3\}$. Take $q_h\in M_h$, $q_h\ne0$. The constants used in the
proof are independent of $\mathcal T_h\in\{\mathcal T_h\}$ and of $q_h$. The set of edges in $\mathcal T_h$ is denoted by $\mathcal E$. This
set is partitioned into edges which are in the interior of $\Omega$ and edges which are part of $\partial\Omega$:
$\mathcal E = \mathcal E_{int}\cup\mathcal E_{bnd}$. For every $E\in\mathcal E$, $m_E$ denotes the midpoint of the edge $E$. Every $E\in\mathcal E_{int}$ with
endpoints $a_1, a_2\in\mathbb R^n$ is assigned a vector $t_E := a_1 - a_2$. For $E\in\mathcal E_{bnd}$ we define $t_E := 0$. Since
$q_h\in X^1_h$ the function $x\mapsto t_E\cdot\nabla q_h(x)$ is continuous across $E$, for $E\in\mathcal E_{int}$. We define
$$w_E := -\big(t_E\cdot\nabla q_h(m_E)\big)\,t_E,\qquad E\in\mathcal E.$$
Due to lemma 3.3.2 a unique $w_h\in (X^2_{h,0})^n$ is defined by
$$w_h(x_i) = \begin{cases}0 & \text{if } x_i \text{ is a vertex of } T\in\mathcal T_h,\\ w_E & \text{if } x_i = m_E \text{ for } E\in\mathcal E.\end{cases}$$
For each $T\in\mathcal T_h$ the set of edges of $T$ is denoted by $\mathcal E_T$. By using quadrature we see that for
any $p\in\mathcal P_2$ which is zero at the vertices of $T$ we have
$$\int_T p(x)\,dx = \frac{|T|}{2n-1}\sum_{E\in\mathcal E_T} p(m_E).$$
We obtain:
$$\begin{aligned}
\int_\Omega q_h\operatorname{div} w_h\,dx &= -\int_\Omega \nabla q_h\cdot w_h\,dx = -\sum_{T\in\mathcal T_h}\int_T(\nabla q_h)^T w_h\,dx\\
&= -\sum_{T\in\mathcal T_h}\frac{|T|}{2n-1}\sum_{E\in\mathcal E_T}(\nabla q_h)^T w_h(m_E) \qquad\qquad(5.15)\\
&= \sum_{T\in\mathcal T_h}\frac{|T|}{2n-1}\sum_{E\in\mathcal E_T}\big(t_E\cdot\nabla q_h(m_E)\big)^2.
\end{aligned}$$
Using the fact that $(\nabla q_h)_{|T}$ is constant one easily checks that
$$c\,h_T^2\,|T|^{-1}\,\|\nabla q_h\|_{0,T}^2 \le \sum_{E\in\mathcal E_T}\big(t_E\cdot\nabla q_h(m_E)\big)^2 \le C\,h_T^2\,|T|^{-1}\,\|\nabla q_h\|_{0,T}^2, \tag{5.16}$$
and thus
$$\int_\Omega q_h\operatorname{div} w_h\,dx \ge c\sum_{T\in\mathcal T_h}\frac{|T|}{2n-1}\,h_T^2\,|T|^{-1}\,\|\nabla q_h\|_{0,T}^2 \ge C\sum_{T\in\mathcal T_h} h_T^2\,\|\nabla q_h\|_{0,T}^2 = C\,\|q_h\|_{1,h}^2. \tag{5.17}$$
Let $\hat{\mathcal E}_T$ be the set of all edges of the unit $n$-simplex $\hat T$. In the space $\{\,v\in\mathcal P_2 \mid v \text{ is zero at the vertices of } \hat T\,\}$
the norms $\|v\|_{1,\hat T}$ and $\big(\sum_{E\in\hat{\mathcal E}_T} v(m_E)^2\big)^{\frac12}$ are equivalent. Using this componentwise for the
vector function $\hat w_h := w_h\circ F$ we get:
Summation over all simplices $T$ yields, using (5.16),
$$\begin{aligned}
\|w_h\|_1^2 \le C\,|w_h|_1^2 \le C\sum_{T\in\mathcal T_h}|w_h|_{1,T}^2 &\le C\sum_{T\in\mathcal T_h}h_T^{-2}\,|T|\sum_{E\in\mathcal E_T}\|w_E\|_2^2\\
&= C\sum_{T\in\mathcal T_h}h_T^{-2}\,|T|\sum_{E\in\mathcal E_T}\big(t_E\cdot\nabla q_h(m_E)\big)^2\,\|t_E\|_2^2 \qquad(5.18)\\
&\le C\sum_{T\in\mathcal T_h}h_T^2\,\|\nabla q_h\|_{0,T}^2 = C\,\|q_h\|_{1,h}^2.
\end{aligned}$$
Combining this with (5.17) we obtain
$$\frac{b(w_h,q_h)}{\|w_h\|_1} \ge C\,\|q_h\|_{1,h}$$
with a constant $C>0$ independent of $q_h$ and of $\mathcal T_h\in\{\mathcal T_h\}$. Now apply theorem 5.2.4. $\Box$
One can also prove stability for higher order Hood-Taylor finite elements:

Theorem 5.2.6 Let $\{\mathcal T_h\}$ be a regular family of triangulations as in theorem 5.2.5. Then the
Hood-Taylor pair of finite element spaces with $k\ge3$ is stable.

Remark 5.2.7 The condition that every $T\in\mathcal T_h$ has at least one vertex in the interior of $\Omega$ is a
mild one. Let $S := \{\,T\in\mathcal T_h \mid T \text{ has no vertex in the interior of } \Omega\,\}$. If $S\ne\emptyset$ then a suitable
bisection of each $T\in S$ (and of one of the neighbours of $T$) results in a modified admissible
triangulation for which the condition is satisfied. In certain cases the condition can be avoided
(for example, for $n=2$, $k=2,3$, cf. [87]) or replaced by another similar assumption on the
geometry of the triangulation (cf. remark 3.2 in [16]). An example which shows that the stability
result does in general not hold without an assumption on the geometry of the triangulation is
given in [16], remark 3.3.
For the discretization of the Stokes problem with Hood-Taylor finite elements we have the fol-
lowing bound for the discretization error:

Theorem 5.2.8 Let $\{\mathcal T_h\}$ be a regular family of triangulations as in theorem 5.2.5. Consider the
discrete Stokes problem (5.9) with Hood-Taylor finite element spaces as in (5.8), $k\ge2$. Suppose
that the continuous solution $(u,p)$ lies in $H^m(\Omega)^n\times H^{m-1}(\Omega)$ with $m\ge2$. For $m\le k+1$ the
following holds:
$$\|u-u_h\|_1 + \|p-p_h\|_{L^2} \le C\,h^{m-1}\big(|u|_m + |p|_{m-1}\big).$$
Proof. We apply theorem 5.1.1 with $V = H^1_0(\Omega)^n$, $M = L^2_0(\Omega)$ and $(V_h, M_h)$ the pair of Hood-
Taylor finite element spaces. From the analysis in section 2.6 it follows that the conditions (5.6a)
and (5.6b) are satisfied. From theorem 5.2.5 or theorem 5.2.6 it follows that the discrete inf-sup
property (5.6c) holds with a constant $\beta_h$ independent of $\mathcal T_h$. Hence we have
$$\|u-u_h\|_1 + \|p-p_h\|_{L^2} \le C\Big(\inf_{v_h\in V_h}\|u-v_h\|_1 + \inf_{q_h\in M_h}\|p-q_h\|_{L^2}\Big). \tag{5.19}$$
For the first term on the right-hand side we use (componentwise) the result of corollary 3.3.10.
This yields
$$\inf_{v_h\in V_h}\|u-v_h\|_1 \le C\,h^{m-1}|u|_m \tag{5.20}$$
with a constant $C$ independent of $u$ and of $\mathcal T_h\in\{\mathcal T_h\}$.
Using $p\in L^2_0(\Omega)$ it follows that
$$\inf_{q_h\in M_h}\|p-q_h\|_{L^2} = \inf_{q_h\in X^{k-1}_h}\|p-q_h\|_{L^2}.$$
For $m=2$ we can use the Clément operator of theorem 5.2.3 and for $m\ge3$ the result in
corollary 3.3.10 to bound the approximation error for the pressure:
$$\inf_{q_h\in M_h}\|p-q_h\|_{L^2} \le C\,h^{m-1}|p|_{m-1}. \tag{5.21}$$
Sufficient conditions for $(u,p)\in H^2(\Omega)^n\times H^1(\Omega)$ to hold are given in section 2.6.2. $\Box$
As in section 3.4.2 one can derive an $L^2$ error bound for the velocity using a duality argument.
For this we have to assume $H^2$-regularity of the Stokes problem (cf. section 2.6.2):

Theorem 5.2.9 Consider the Stokes problem and its Hood-Taylor discretization as described
in theorem 5.2.8. In addition we assume that the Stokes problem is $H^2$-regular. Then for
$2\le m\le k+1$ the inequality
$$\|u-u_h\|_{L^2} \le C\,h^m\big(|u|_m + |p|_{m-1}\big)$$
holds.
Proof. Let $(\hat u,\hat p)$ denote the solution of the dual Stokes problem with right-hand side $u-u_h$.
For the second term on the right-hand side we can use approximation results as in (5.20), (5.21).
In combination with the $H^2$-regularity this yields
$$\inf_{v_h\in V_h}\|\hat u-v_h\|_1 + \inf_{q_h\in M_h}\|\hat p-q_h\|_{L^2} \le C\,h\big(|\hat u|_2 + |\hat p|_1\big) \le C\,h\,\|u-u_h\|_{L^2}.$$
For $\|e_h\|_H \le \|u-u_h\|_1 + \|p-p_h\|_{L^2}$ we can use the result in theorem 5.2.8. Thus we conclude
that
$$\|u-u_h\|_{L^2}^2 \le C\,h^{m-1}\big(|u|_m + |p|_{m-1}\big)\,h\,\|u-u_h\|_{L^2}$$
holds. $\Box$
5.2.2 Other finite element spaces
In this section we briefly discuss some other pairs of finite element spaces that are used in prac-
tice for solving Stokes (and Navier-Stokes) problems.

Rectangular Hood-Taylor elements.
On rectangular (or hexahedral) triangulations one can use the pair $(V_h, M_h) = \big((Q^k_{h,0})^n,\; Q^{k-1}_h\cap L^2_0(\Omega)\big)$,
$k\ge1$, cf. (3.17). In remark 5.2.1 it is shown that for $k=1$ this pair in general will not be stable. In [11] it
is proved that the pair $(V_h, M_h)$ with $k=2$ is stable both for $n=2$ and $n=3$. In [87] it is
proved that for all $k\ge2$ the pair $(V_h, M_h)$ is stable if $n=2$. In these stable cases one can prove
discretization error bounds as in theorem 5.2.8 and theorem 5.2.9. The analysis is very similar
to the one presented for the case of simplicial finite elements in section 5.2.1.
Mini element.
Let $\{\mathcal T_h\}$ be a regular family of triangulations consisting of simplices. For every $T\in\mathcal T_h$ we can
define a so-called bubble function
$$b_T(x) = \begin{cases}\prod_{i=1}^{n+1}\lambda_i(x) & \text{for } x\in T,\\ 0 & \text{otherwise},\end{cases}$$
where $\lambda_1,\dots,\lambda_{n+1}$ are the barycentric coordinates of $T$. The mini element consists of the pair
$$(V_h, M_h) := \big((X^1_{h,0}\oplus B_h)^n,\; X^1_h\cap L^2_0(\Omega)\big),\qquad B_h := \operatorname{span}\{\,b_T \mid T\in\mathcal T_h\,\}.$$
This element is stable, cf. [40, 4]. An advantage of this element compared to, for example, the
Hood-Taylor element with $k=2$ is that the implementation of the former is relatively easy. This
is due to the following. The unknowns associated to the bubble basis functions can be elimi-
nated by a simple local technique (so-called static condensation) and the remaining unknowns
for the velocity and pressure basis functions are associated to the same set of points, namely the
vertices of the simplices. In case of Hood-Taylor elements ($k=2$) one also needs the midpoints
of edges for some of the velocity unknowns. Hence, the data structures for the mini element
are relatively simple. A disadvantage of the mini element is its low accuracy (only $P_1$ for the
velocity).
IsoP2-P1 element.
This element is a variant of the Hood-Taylor element with $k=2$. Let $\{\mathcal T_h\}$ be a regular family
of triangulations consisting of simplices. Given $\mathcal T_h$ we construct a refinement $\mathcal T_{\frac h2}$ by dividing
each $n$-simplex $T\in\mathcal T_h$, $n=2$ or $n=3$, into $2^n$ subsimplices by connecting the midpoints of
the edges of $T$. Note that for $n=3$ this construction is not unique. The space of continuous
functions which are piecewise linear on the simplices in $\mathcal T_{\frac h2}$ and zero on $\partial\Omega$ is denoted by $X^1_{\frac h2,0}$.
The isoP2-P1 element consists of the pair of spaces
$$(V_h, M_h) := \big((X^1_{\frac h2,0})^n,\; X^1_h\cap L^2_0(\Omega)\big).$$
Both for $n=2$ and $n=3$ this pair is stable. This can be shown using the analysis of sec-
tion 5.2.1. The proofs of theorem 5.2.4 and of theorem 5.2.5 apply, with minor modifications,
to the isoP2-P1 pair, too. In the discrete velocity space $V_h$ the degrees of freedom (unknowns)
are associated to the vertices and the midpoints of edges of $T\in\mathcal T_h$. This is the same as for the
discrete velocity space in the Hood-Taylor pair with $k=2$. This explains the name isoP2-P1.
Note, however, that the accuracy for the velocity for the isoP2-P1 element is only $O(h)$ in
the norm $\|\cdot\|_1$, whereas for the Hood-Taylor pair with $k=2$ one has $O(h^2)$ in the norm $\|\cdot\|_1$
(provided the solution is sufficiently smooth).
In certain situations, if the pair $(V_h, M_h)$ of finite element spaces is not stable, one can still
successfully apply these spaces for the discretization of the Stokes problem, provided one uses an
appropriate stabilization technique. We do not discuss this topic here. An overview of some
useful stabilization methods is given in [73], section 9.4.
Chapter 6

The discretization of elliptic boundary value problems like the Poisson equation or the Stokes
equations results in a large sparse linear system of equations. For the numerical solution of such
a system iterative methods are applied. Important classes of iterative methods are treated in
chapters 7-10. In this chapter we present some basic results on linear iterative methods and
discuss some classical iterative methods such as, for example, the Jacobi and Gauss-Seidel method.
In our applications these methods turn out to be very inefficient and thus not very suitable for
practical use. However, these methods play a role in the more advanced (and more efficient)
methods treated in chapters 7-10. Furthermore, these basic iterative methods can be used
to explain important notions such as convergence rate and efficiency. Standard references for
a detailed analysis of basic iterative methods are Varga [92] and Young [100]. We also refer to
Hackbusch [46], [48] and Axelsson [6] for an extensive analysis of these methods.
In the remainder of this chapter we consider a (large sparse) linear system of equations
Ax = b (6.1)
6.1 Introduction
We consider a given iterative method, denoted by $x^{k+1} = \Phi(x^k)$, $k\ge0$, for solving the system
in (6.1) with solution $x^* := A^{-1}b$. We define the error as
$$e^k = x^k - x^*,\qquad k\ge0.$$
The iterative method is called a linear iterative method if there exists a matrix $C$ (depending
on the particular method but independent of $k$) such that for the errors we have the recursion
$$e^{k+1} = C\,e^k,\qquad k\ge0. \tag{6.2}$$
The matrix $C$ is called the iteration matrix of the method. In the next section we will see that
basic iterative methods are linear. Also the multigrid methods discussed in chapter 9 are linear.
The Conjugate Gradient method, however, is a nonlinear iterative method (cf. chapter 7).
From (6.2) it follows that $e^k = C^k e^0$ for all $k$, and thus $\lim_{k\to\infty} e^k = 0$ for arbitrary $e^0$ if and
only if $\lim_{k\to\infty} C^k = 0$. Based on this, the linear iterative method with iteration matrix $C$ is
called convergent if
$$\lim_{k\to\infty} C^k = 0. \tag{6.3}$$
An important characterization for convergence is related to the spectral radius of the iteration
matrix. To derive this characterization we first need two lemmas.
Lemma 6.1.1 For all $B\in\mathbb R^{n\times n}$ and all $\varepsilon>0$ there exists a matrix norm $\|\cdot\|$ on $\mathbb R^{n\times n}$ such
that
$$\|B\| \le \rho(B) + \varepsilon.$$
Proof. For the given matrix $B$ there exists a nonsingular matrix $T\in\mathbb C^{n\times n}$ which transforms
$B$ to its Jordan normal form:
$$T^{-1}BT = J,\qquad J = \operatorname{blockdiag}(J_\nu)_{1\le\nu\le m},$$
with
$$J_\nu = (\lambda_\nu)\quad\text{or}\quad J_\nu = \begin{pmatrix}\lambda_\nu & 1 & & \\ & \ddots & \ddots & \\ & & \ddots & 1\\ & & & \lambda_\nu\end{pmatrix},\qquad \lambda_\nu\in\sigma(B),\ 1\le\nu\le m.$$
Scaling the columns of $T$ with $D_\varepsilon := \operatorname{diag}(1,\varepsilon,\varepsilon^2,\dots,\varepsilon^{n-1})$ replaces the off-diagonal entries $1$
in $J$ by $\varepsilon$. Hence, for the matrix norm $\|C\| := \|(TD_\varepsilon)^{-1}C(TD_\varepsilon)\|_\infty$ we obtain
$$\|B\| = \|(TD_\varepsilon)^{-1}B(TD_\varepsilon)\|_\infty = \|D_\varepsilon^{-1}JD_\varepsilon\|_\infty \le \max_\nu|\lambda_\nu| + \varepsilon \le \rho(B)+\varepsilon. \qquad\Box$$

Lemma 6.1.2 For $B\in\mathbb R^{n\times n}$ we have $\lim_{k\to\infty}B^k = 0$ if and only if $\rho(B) < 1$.
Proof. "$\Leftarrow$": Take $\varepsilon>0$ such that $\rho(B)+\varepsilon<1$ holds and let $\|\cdot\|$ be the matrix norm
defined in lemma 6.1.1. Then $\|B^k\| \le \|B\|^k \le (\rho(B)+\varepsilon)^k \to 0$, hence $B^k\to0$.
"$\Rightarrow$": If $\rho(B)\ge1$ there exist $\lambda\in\sigma(B)$ with $|\lambda|\ge1$ and $v\ne0$ with $Bv=\lambda v$; then
$B^kv = \lambda^k v$ does not converge to $0$. $\Box$
Corollary 6.1.3 For any $B\in\mathbb R^{n\times n}$ and any matrix norm $\|\cdot\|$ on $\mathbb R^{n\times n}$ we have
$$\rho(B) \le \|B\|.$$
Proof. If $\rho(B)=0$ the result trivially holds. For $\rho(B)\ne0$ define $\tilde B := \rho(B)^{-1}B$.
Assume that $\rho(B) > \|B\|$. Then $1 = \rho(\tilde B)$ and $\|\tilde B\| < 1$, and thus $\lim_{k\to\infty}\|\tilde B\|^k = 0$. Using
$\|\tilde B^k\| \le \|\tilde B\|^k$ this yields $\lim_{k\to\infty}\tilde B^k = 0$. From lemma 6.1.2 we conclude $\rho(\tilde B) < 1$, which gives
a contradiction. $\Box$

Theorem 6.1.4 A linear iterative method with iteration matrix $C$ is convergent if and only if
$\rho(C) < 1$. Moreover, for any $B\in\mathbb R^{n\times n}$ and any matrix norm $\|\cdot\|$ on $\mathbb R^{n\times n}$,
$$\lim_{k\to\infty}\|B^k\|^{\frac1k} = \rho(B).$$
Proof. The first statement follows from lemma 6.1.2 and the definition (6.3). For the second
statement, from corollary 6.1.3 we get $\rho(B)^k = \rho(B^k) \le \|B^k\|$ and thus
$$\rho(B) \le \|B^k\|^{\frac1k}\qquad\text{for all } k\ge1. \tag{6.4}$$
Take arbitrary $\varepsilon>0$ and define $\tilde B := (\rho(B)+\varepsilon)^{-1}B$. Then $\rho(\tilde B)<1$ and thus $\lim_{k\to\infty}\tilde B^k = 0$.
Hence there exists $k_0$ such that for all $k\ge k_0$, $\|\tilde B^k\|\le1$, i.e., $(\rho(B)+\varepsilon)^{-k}\|B^k\|\le1$. We get
$$\|B^k\|^{\frac1k} \le \rho(B)+\varepsilon\qquad\text{for all } k\ge k_0. \tag{6.5}$$
From (6.4) and (6.5) it follows that $\lim_{k\to\infty}\|B^k\|^{1/k} = \rho(B)$. $\Box$
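The limit $\|B^k\|^{1/k}\to\rho(B)$ can be observed numerically; for a strongly nonnormal matrix (our own illustrative example) the quantity starts far above $\rho(B)$ and decreases slowly towards it, always staying above it, in agreement with (6.4):

```python
import numpy as np

B = np.array([[0.5, 10.0],
              [0.0, 0.4]])             # nonnormal: rho(B) = 0.5, but ||B||_2 ~ 10
rho = max(abs(np.linalg.eigvals(B)))

vals = []
Bk = np.eye(2)
for k in range(1, 201):
    Bk = Bk @ B
    vals.append(np.linalg.norm(Bk, 2) ** (1.0 / k))

print(vals[0], vals[-1], rho)          # first value ~ 10, last value close to 0.5
```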
Hence, to reduce the norm of an arbitrary starting error $\|e^0\|$ by a factor $1/e$ we need asymp-
totically (i.e. for $k$ large enough) approximately $\big(-\ln\rho(C)\big)^{-1}$ iterations. Based on this we
call $-\ln\rho(C)$ the asymptotic convergence rate of the iterative method (in the literature, e.g.
Hackbusch [48], sometimes $\rho(C)$ itself is called the asymptotic convergence rate).
The quantity $\|C\|$ is the contraction number of the iterative method. Note that
$$\|e^{k+1}\| \le \|C\|\,\|e^k\|,\qquad k\ge0,$$
holds, and
$$\rho(C) \le \|C\|.$$
From these results we conclude that $\rho(C)$ is a reasonable measure for the rate of convergence,
provided $k$ is large enough. For $k$ small it may be better to use the contraction number as a
measure for the rate of convergence. Note that the asymptotic convergence rate does not depend
on the norm $\|\cdot\|$. In some situations, measuring the rate of convergence using the contraction
number or using the asymptotic rate of convergence is the same. For example, if we use the
Euclidean norm and if $C$ is symmetric then
$$\rho(C) = \|C\|_2$$
holds. However, in other situations, for example if $C$ is strongly nonsymmetric, one can have
$\rho(C) \ll \|C\|$.
To measure the quality (efficiency) of an iterative method one has to consider the following
two aspects:
- The arithmetic costs per iteration. This can be quantified in flops needed for one iteration.
- The rate of convergence. This can be quantified using $-\ln\rho(C)$ (asymptotic convergence
rate) or $\|C\|$ (contraction number).
In addition one prescribes a desired error reduction factor $R$, i.e. we wish to reduce the norm of
an arbitrary starting error by a factor $R$. The complexity of an iterative method is then defined
as the order of magnitude of the number of flops needed to obtain an error reduction by a
factor $R$ for the given problem. In this notion the arithmetic costs per iteration and the rate
of convergence are combined. The quality of different methods for a given problem (class) can
be compared by means of this complexity concept. Examples of this are given in section 6.6.
We first show how a splitting of the matrix $A$ in a natural way results in a linear iterative
method. We assume a splitting of the matrix $A$,
$$A = M - N, \tag{6.7a}$$
$$M \text{ nonsingular}, \tag{6.7b}$$
$$\text{systems } My = c \text{ are ``easy'' to solve}, \tag{6.7c}$$
so that for the solution $x^*$ of (6.1) we have
$$Mx^* = Nx^* + b.$$
The splitting of $A$ results in the following matrix splitting iterative method. For a given starting
vector $x^0\in\mathbb R^n$ we define
$$Mx^{k+1} = Nx^k + b,\qquad k\ge0. \tag{6.8}$$
This can also be written as
$$x^{k+1} = x^k - M^{-1}(Ax^k - b). \tag{6.9}$$
From (6.9) it follows that for the error $e^k = x^k - x^*$ we have
$$e^{k+1} = (I - M^{-1}A)\,e^k.$$
Hence the iteration in (6.8), (6.9) is a linear iterative method with iteration matrix
$$C = I - M^{-1}A = M^{-1}N. \tag{6.10}$$
The condition in (6.7c) is introduced to obtain a method in (6.8) for which the arithmetic costs
per iteration are acceptable. Below we will see that the above-mentioned classical iterative
methods can be derived using a suitable matrix splitting. These methods satisfy the conditions
in (6.7b), (6.7c), but unfortunately, when applied to discrete elliptic boundary value problems,
the convergence rates of these methods are in general very low. This is illustrated in section 6.6.
Richardson method
The simplest linear iterative method is the Richardson method:
$$x^0 \text{ a given starting vector},\qquad x^{k+1} = x^k - \omega\,(Ax^k - b),\quad k\ge0, \tag{6.11}$$
with a parameter $\omega\ne0$. The corresponding iteration matrix is $C_\omega = I - \omega A$.
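A minimal sketch of the Richardson iteration (6.11) for a small symmetric positive definite system chosen by us for illustration; $\omega$ is taken inside the convergence interval $(0, 2/\lambda_{\max}(A))$ discussed in section 6.3:

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])
x_star = np.linalg.solve(A, b)

lam_max = max(np.linalg.eigvalsh(A))
omega = 1.0 / lam_max                  # any omega in (0, 2/lam_max) converges

x = np.zeros(2)
for _ in range(200):
    x = x - omega * (A @ x - b)        # Richardson step (6.11)

err = np.linalg.norm(x - x_star)
print(err)                             # essentially zero after 200 steps
```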
Jacobi method
A second classical and very simple method is due to Jacobi. We introduce the notation
$$A = D - L - U, \tag{6.12}$$
with $D$ the diagonal part of $A$, $L$ a lower triangular matrix with zero entries on the diagonal and $U$ an upper triangular
matrix with zero entries on the diagonal. We assume that $A$ has only nonzero entries on the
diagonal, so $D$ is nonsingular. The method of Jacobi is the iterative method as in (6.8) based
on the matrix splitting
$$M := D,\qquad N := L + U.$$
The method of Jacobi is as follows:
$$x^0 \text{ a given starting vector},\qquad Dx^{k+1} = (L+U)x^k + b,\quad k\ge0. \tag{6.13}$$
From this we see that in the method of Jacobi we solve the $i$-th equation ($\sum_{j=1}^n a_{ij}x_j = b_i$) for
the $i$-th unknown ($x_i$) using values for the other unknowns ($x_j$, $j\ne i$) computed in the previous
iteration.
The iteration can also be represented as
$$x^{k+1} = (I - D^{-1}A)\,x^k + D^{-1}b,\qquad k\ge0.$$
In the Jacobi method the computational costs per iteration are low, namely comparable to
one matrix-vector multiplication $Ax$, i.e. $cn$ flops (due to the sparsity of $A$).
We introduce a variant of the Jacobi method in which a parameter is used. This method is given
by
$$x^{k+1} = (I - \omega D^{-1}A)\,x^k + \omega D^{-1}b,\qquad k\ge0, \tag{6.14}$$
corresponding to the matrix splitting
$$M = \frac1\omega D,\qquad N = \Big(\frac1\omega - 1\Big)D + L + U. \tag{6.15}$$
For $\omega=1$ we obtain the Jacobi method. For $\omega\ne1$ this method is called the damped Jacobi
method (damped due to the fact that in practice one usually takes $\omega\in(0,1)$).
Gauss-Seidel method
This method is based on the matrix splitting
$$M := D - L,\qquad N := U,$$
resulting in
$$x^0 \text{ a given starting vector},\qquad (D-L)\,x^{k+1} = Ux^k + b,\quad k\ge0. \tag{6.16}$$
For the Gauss-Seidel method to be feasible we assume that $D$ is nonsingular. In the Jacobi
method (6.13) for the computation of $x_i^{k+1}$ (i.e. for solving the $i$-th equation for the $i$-th unknown
$x_i$) we use the values $x_j^k$, $j\ne i$, whereas in the Gauss-Seidel method (6.16) for the computation
of $x_i^{k+1}$ we use $x_j^{k+1}$, $j<i$, and $x_j^k$, $j>i$.
The iteration matrix of the Gauss-Seidel method is given by
$$C = (D-L)^{-1}U = I - (D-L)^{-1}A.$$
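For the 1D Poisson model matrix $\operatorname{tridiag}(-1,2,-1)$ (our illustrative choice) one can compare the iteration matrices of the Jacobi and Gauss-Seidel methods numerically; the classical relation $\rho(C_{GS}) = \rho(C_J)^2$, valid for this consistently ordered matrix, means one Gauss-Seidel step reduces the error asymptotically as much as two Jacobi steps:

```python
import numpy as np

n = 20
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # 1D Poisson matrix
D = np.diag(np.diag(A))
L = -np.tril(A, -1)
U = -np.triu(A, 1)

def rho(M):
    return max(abs(np.linalg.eigvals(M)))

C_jac = np.linalg.solve(D, L + U)        # I - D^{-1} A
C_gs = np.linalg.solve(D - L, U)         # (D - L)^{-1} U
print(rho(C_jac), rho(C_gs))             # cos(pi*h) and cos(pi*h)^2, h = 1/(n+1)
```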
SOR method
The Gauss-Seidel method in (6.16) can be rewritten as
$$x_i^{k+1} = x_i^k - \Big(\sum_{j<i}a_{ij}x_j^{k+1} + \sum_{j\ge i}a_{ij}x_j^k - b_i\Big)\Big/a_{ii},\qquad i=1,\dots,n.$$
From this representation it is clear how $x_i^{k+1}$ can be obtained by adding a certain correction term
to $x_i^k$. We now introduce a method in which this correction term is multiplied by a parameter
$\omega>0$:
$$x^0 \text{ a given starting vector},\qquad x_i^{k+1} = x_i^k - \omega\Big(\sum_{j<i}a_{ij}x_j^{k+1} + \sum_{j\ge i}a_{ij}x_j^k - b_i\Big)\Big/a_{ii}, \tag{6.17}$$
$$i=1,\dots,n,\quad k\ge0.$$
This method is the Successive Overrelaxation method (SOR). The terminology "over" is used
because in general one should take $\omega>1$ (cf. theorem 6.4.3 below). For $\omega=1$ the SOR method
results in the Gauss-Seidel method. In matrix-vector notation the SOR method is as follows:
$$(D-\omega L)\,x^{k+1} = \big[(1-\omega)D + \omega U\big]x^k + \omega b,$$
or, equivalently,
$$\Big(\frac1\omega D - L\Big)x^{k+1} = \Big[\Big(\frac1\omega - 1\Big)D + U\Big]x^k + b.$$
From this it is clear that the SOR method is also a matrix splitting iterative method, corre-
sponding to the splitting (cf. (6.15))
$$M_\omega := \frac1\omega D - L,\qquad N_\omega := \Big(\frac1\omega - 1\Big)D + U.$$
The iteration matrix is given by
$$C_\omega = I - M_\omega^{-1}A = I - \Big(\frac1\omega D - L\Big)^{-1}A.$$
For the SOR method the arithmetic costs per iteration are comparable to those of the Gauss-
Seidel method.
The Symmetric Successive Overrelaxation method (SSOR) is a variant of the SOR method.
One SSOR iteration consists of two SOR steps. In the first step we apply an SOR iteration as
in (6.17) and in the second step we again apply an SOR iteration but now with the reversed
ordering of the unknowns. In formulas we thus have:
$$x_i^{k+\frac12} = x_i^k - \omega\Big(\sum_{j<i}a_{ij}x_j^{k+\frac12} + \sum_{j\ge i}a_{ij}x_j^k - b_i\Big)\Big/a_{ii},\qquad i=1,2,\dots,n,$$
$$x_i^{k+1} = x_i^{k+\frac12} - \omega\Big(\sum_{j>i}a_{ij}x_j^{k+1} + \sum_{j\le i}a_{ij}x_j^{k+\frac12} - b_i\Big)\Big/a_{ii},\qquad i=n,\dots,1.$$
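The effect of overrelaxation can be quantified for the 1D Poisson model matrix used above (again our illustrative choice): the spectral radius of $C_\omega = I - (\frac1\omega D - L)^{-1}A$ drops sharply at the value $\omega^* = 2/(1+\sin(\pi h))$, which is known to be optimal for this model matrix, where $\rho(C_{\omega^*}) = \omega^* - 1$:

```python
import numpy as np

n = 20
h = 1.0 / (n + 1)
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
D = np.diag(np.diag(A))
L = -np.tril(A, -1)

def rho_sor(omega):
    M = D / omega - L
    C = np.eye(n) - np.linalg.solve(M, A)
    return max(abs(np.linalg.eigvals(C)))

rho_gs = rho_sor(1.0)                          # omega = 1: Gauss-Seidel
omega_star = 2.0 / (1.0 + np.sin(np.pi * h))   # optimal omega for this model matrix
rho_opt = rho_sor(omega_star)
print(rho_gs, omega_star, rho_opt)             # rho_opt well below rho_gs
```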
6.3 Convergence analysis in the symmetric positive definite case
For the classical linear iterative methods we derive convergence results for the case that $A$ is
symmetric positive definite. Recall that for square symmetric matrices $B$ and $C$ we use the
notation $B < C$ ($B \le C$) if $C - B$ is positive definite (semidefinite).
We start with an elementary lemma:

Lemma 6.3.1 Let $B\in\mathbb R^{n\times n}$ be a symmetric positive definite matrix. The smallest and largest
eigenvalues of $B$ are denoted by $\lambda_{\min}(B)$ and $\lambda_{\max}(B)$, respectively. The following holds:
$$\rho(I-\omega B) < 1 \iff 0 < \omega < \frac{2}{\lambda_{\max}(B)}, \tag{6.19}$$
$$\min_\omega\,\rho(I-\omega B) = \rho(I-\omega_{\mathrm{opt}}B) = 1 - \frac{2}{\kappa(B)+1}\qquad\text{for } \omega_{\mathrm{opt}} = \frac{2}{\lambda_{\min}(B)+\lambda_{\max}(B)}, \tag{6.20}$$
where $\kappa(B) := \lambda_{\max}(B)/\lambda_{\min}(B)$ is the spectral condition number of $B$.
Proof. The eigenvalues of $I-\omega B$ are $1-\omega\lambda$, $\lambda\in\sigma(B)$, so
$$\rho(I-\omega B) = \max\big\{|1-\omega\lambda_{\min}(B)|,\,|1-\omega\lambda_{\max}(B)|\big\}.$$
Hence $\rho(I-\omega B) < 1$ iff $\omega>0$ and $\omega\lambda_{\max}(B)-1 < 1$. This proves the result in (6.19). The
result in (6.20) follows from the fact that the maximum is minimal for the $\omega$ with
$1-\omega\lambda_{\min}(B) = -(1-\omega\lambda_{\max}(B))$, i.e., $\omega_{\mathrm{opt}} = \frac{2}{\lambda_{\min}(B)+\lambda_{\max}(B)}$. $\Box$
As an immediate consequence of this lemma we get a convergence result for the Richardson
method.

Corollary 6.3.2 Let $A$ be symmetric positive definite. For the iteration matrix of the Richard-
son method, $C_\omega = I - \omega A$, we have
$$\rho(C_\omega) < 1 \iff 0 < \omega < \frac{2}{\lambda_{\max}(A)}, \tag{6.21}$$
$$\min_\omega\,\rho(C_\omega) = \rho(C_{\omega_{\mathrm{opt}}}) = 1 - \frac{2}{\kappa(A)+1}\qquad\text{for } \omega_{\mathrm{opt}} = \frac{2}{\lambda_{\min}(A)+\lambda_{\max}(A)}. \tag{6.22}$$
We now consider the Jacobi method. From theorem 6.1.4 we obtain that this method is conver-
gent if and only if $\rho(I - D^{-1}A) < 1$ holds. A simple example shows that the method of Jacobi
does not converge for every symmetric positive definite matrix $A$: consider
$$A = \begin{pmatrix}\frac32 & 1 & 1\\ 1 & \frac32 & 1\\ 1 & 1 & \frac32\end{pmatrix},$$
with spectrum $\sigma(A) = \{\frac12, \frac72\}$. Then $\rho(I - D^{-1}A) = \big|1 - \tfrac23\cdot\tfrac72\big| = \tfrac43 > 1$.
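The counterexample can be verified in a few lines:

```python
import numpy as np

A = np.array([[1.5, 1.0, 1.0],
              [1.0, 1.5, 1.0],
              [1.0, 1.0, 1.5]])
spd = bool(np.all(np.linalg.eigvalsh(A) > 0))   # A is symmetric positive definite

D = np.diag(np.diag(A))
C = np.eye(3) - np.linalg.solve(D, A)           # Jacobi iteration matrix I - D^{-1}A
rho = max(abs(np.linalg.eigvals(C)))
print(spd, rho)                                  # True, 4/3 > 1: Jacobi diverges
```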
From the analysis in section 6.5.2 (theorems 6.5.12 and 6.5.13) it follows that if $A$ is symmetric
positive definite and $a_{ij} \le 0$ for all $i\ne j$, then the Jacobi method is convergent.
A convergence result for the damped Jacobi method can be derived in which the assumption a_ij ≤ 0 is avoided:
Theorem 6.3.3 Let A be a symmetric positive definite matrix. For the iteration matrix of the damped Jacobi method C_ω = I − ωD^{-1}A we have

$$\rho(C_\omega) < 1 \quad \text{iff} \quad 0 < \omega < \frac{2}{\lambda_{\max}(D^{-1}A)} \tag{6.23}$$

$$\min_{\omega}\, \rho(C_\omega) = \rho(C_{\omega_{\mathrm{opt}}}) = 1 - \frac{2}{\kappa(D^{-1}A) + 1} \qquad \text{for } \omega_{\mathrm{opt}} = \frac{2}{\lambda_{\min}(D^{-1}A) + \lambda_{\max}(D^{-1}A)} \tag{6.24}$$

Proof. Note that D^{-1/2}AD^{-1/2} is symmetric positive definite. Apply lemma 6.3.1 with B = D^{-1/2}AD^{-1/2} and note that σ(D^{-1/2}AD^{-1/2}) = σ(D^{-1}A).
If the matrix A is the stiffness matrix resulting from a finite element discretization of a scalar elliptic boundary value problem then in general the condition number κ(D^{-1}A) is very large (namely ∼ h^{-2}, with h a mesh size parameter). The result in the previous theorem shows that for such problems the rate of convergence of a (damped) Jacobi method is very low. This is illustrated in section 6.6.
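The h^{-2} growth can already be observed for the one-dimensional model problem. The sketch below (assuming the standard 1D Poisson stiffness matrix, which is our illustrative choice) computes κ(D^{-1}A) for two mesh sizes:

```python
import numpy as np

def cond_jacobi(n):
    """kappa(D^{-1}A) for the 1D Poisson matrix of size n (mesh size h = 1/(n+1))."""
    A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    ev = np.linalg.eigvalsh(A / 2.0)   # D = 2I, so D^{-1}A = A/2
    return ev[-1] / ev[0]
```

Halving h (here n = 20 versus n = 41) roughly quadruples the condition number, consistent with κ(D^{-1}A) ∼ h^{-2}.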
For the convergence analysis of both the Gauss–Seidel and the SOR method the following lemma is useful.

Lemma 6.3.4 Let A be symmetric positive definite and assume that M is such that

$$M + M^T > A \tag{6.25}$$

Then M is nonsingular and

$$\rho(I - M^{-1}A) \le \|I - M^{-1}A\|_A < 1$$

holds.

Proof. Assume that Mx = 0. Then ⟨(M + M^T)x, x⟩ = 0 and using assumption (6.25) this implies x = 0. Hence M is nonsingular. We introduce C := I − A^{1/2}M^{-1}A^{1/2} and note that

$$\|I - M^{-1}A\|_A = \|C\|_2 = \rho(C^T C)^{1/2}.$$

Using this we get

$$0 \le C^T C = \big(I - A^{1/2}M^{-T}A^{1/2}\big)\big(I - A^{1/2}M^{-1}A^{1/2}\big) = I - A^{1/2}M^{-T}\big(M + M^T - A\big)M^{-1}A^{1/2} < I,$$

where the last inequality follows from (6.25). Hence ρ(C^T C) < 1, which proves the result.
Using this lemma we immediately get a main convergence result for the Gauss–Seidel method.

Theorem 6.3.5 Let A be symmetric positive definite. Then we have

$$\rho(C) \le \|C\|_A < 1 \qquad \text{with } C := I - (D - L)^{-1}A$$

and thus the Gauss–Seidel method is convergent.

Proof. The Gauss–Seidel method corresponds to the matrix splitting A = M − N with M = D − L. Note that M + M^T = D + (D − L − L^T) = D + A > A holds. Application of lemma 6.3.4 proves the result.
We now consider the SOR method. Recall that this method corresponds to the matrix splitting A = M_ω − N_ω with M_ω = ω^{-1}D − L.

Theorem 6.3.6 Let A be symmetric positive definite. Then for ω ∈ (0, 2) we have

$$\rho(C_\omega) \le \|C_\omega\|_A < 1 \qquad \text{with } C_\omega := I - (\omega^{-1}D - L)^{-1}A$$

and thus the SOR method is convergent.

Proof. For M_ω = ω^{-1}D − L we have

$$M_\omega + M_\omega^T = \frac{2}{\omega}D - L - L^T = \Big(\frac{2}{\omega} - 1\Big)D + A > A \qquad \text{if } \omega \in (0, 2).$$

Application of lemma 6.3.4 proves the result.
In the following lemma we show that for every matrix A (i.e., not necessarily symmetric) with a nonsingular diagonal the SOR method with ω ∉ (0, 2) is not convergent.

Lemma 6.3.7 Let A ∈ R^{n×n} be a matrix with a_ii ≠ 0 for all i. For the iteration matrix C_ω = I − (ω^{-1}D − L)^{-1}A of the SOR method we have

$$\rho(C_\omega) \ge |1 - \omega| \qquad \text{for all } \omega \ne 0.$$

Proof. Define L̃ := D^{-1}L and Ũ := D^{-1}U. Then we have

$$C_\omega = I - \omega(I - \omega\tilde L)^{-1}(I - \tilde L - \tilde U) = (I - \omega\tilde L)^{-1}\big((1 - \omega)I + \omega\tilde U\big).$$

Hence, det(C_ω) = det((I − ωL̃)^{-1}) det((1 − ω)I + ωŨ) = (1 − ω)^n. Let {λ_i | 1 ≤ i ≤ n} = σ(C_ω) be the spectrum of the iteration matrix. Then, due to the fact that the determinant of a matrix equals the product of its eigenvalues, we get ∏_{i=1}^n |λ_i| = |1 − ω|^n. Thus there must be an eigenvalue with modulus at least |1 − ω|.
6.4 Rate of convergence of the SOR method
The result in theorem 6.3.6 shows that in the symmetric positive definite case the SOR method is convergent if we take ω ∈ (0, 2). This result, however, does not quantify the rate of convergence. Moreover, it is not clear how the rate of convergence depends on the choice of the parameter ω. It is known that for certain problems a suitable choice of the parameter ω can result in an SOR method which has a much higher rate of convergence than the Jacobi and Gauss–Seidel methods. This is illustrated in example 6.6.4. However, the relation between the rate of convergence and the parameter ω is strongly problem dependent and for most problems it is not known how a good (i.e. close to optimal) value for the parameter ω can be determined.

In this section we present an analysis which, for a relatively small class of block tridiagonal matrices, shows the dependence of the spectral radius of the SOR iteration matrix on the parameter ω. For related (more general) results we refer to the literature, e.g., Young [100], Hageman and Young [50], Varga [92]. For a more recent treatment we refer to Hackbusch [48].
We start with a technical lemma. Recall the decomposition A = D − L − U, with D = diag(A), and L and U strictly lower and upper triangular matrices, respectively.
Lemma 6.4.1 Consider A = D − L − U with det(A) ≠ 0. Assume that A has the block tridiagonal structure

$$A = \begin{pmatrix} D_{11} & A_{12} & & \\ A_{21} & D_{22} & \ddots & \\ & \ddots & \ddots & A_{k-1,k} \\ & & A_{k,k-1} & D_{kk} \end{pmatrix}, \qquad D_{ii} \in \mathbb{R}^{n_i \times n_i}, \ 1 \le i \le k. \tag{6.26}$$

Then for every z ≠ 0 the spectrum of

$$G_z := zD^{-1}L + z^{-1}D^{-1}U = -\begin{pmatrix} 0 & z^{-1}D_{11}^{-1}A_{12} & & \\ zD_{22}^{-1}A_{21} & 0 & \ddots & \\ & \ddots & \ddots & z^{-1}D_{k-1,k-1}^{-1}A_{k-1,k} \\ & & zD_{kk}^{-1}A_{k,k-1} & 0 \end{pmatrix}$$

is independent of z, i.e., σ(G_z) = σ(D^{-1}L + D^{-1}U) for all z ≠ 0.

Proof. Define T_z := blockdiag(z^i I_{n_i})_{1 ≤ i ≤ k}. A direct computation shows T_z(D^{-1}L + D^{-1}U)T_z^{-1} = G_z. This similarity transformation with T_z does not change the spectrum and thus σ(G_z) = σ(D^{-1}L + D^{-1}U) holds for all z ≠ 0.
Lemma 6.4.2 Let A be as in lemma 6.4.1. Let C_J = I − D^{-1}A and C_ω = I − (ω^{-1}D − L)^{-1}A be the iteration matrices of the Jacobi and SOR method, respectively. The following holds:

(a) μ ∈ σ(C_J) if and only if −μ ∈ σ(C_J);

(b) if λ ∈ σ(C_ω) with λ ≠ 0 and ω ≠ 0, then (λ + ω − 1)/(ωλ^{1/2}) ∈ σ(C_J).

Proof. Define L̃ := D^{-1}L and Ũ := D^{-1}U, so C_J = L̃ + Ũ. Applying lemma 6.4.1 with z = −1 yields σ(L̃ + Ũ) = σ(−L̃ − Ũ) = −σ(L̃ + Ũ), and thus μ ∈ σ(C_J) iff −μ ∈ σ(C_J), i.e., the result in (a) holds. For λ ∈ σ(C_ω), λ ≠ 0 and ω ≠ 0 we have

$$0 = \det(C_\omega - \lambda I) = \omega^n \lambda^{n/2}\, \det\Big(\big(\lambda^{1/2}\tilde L + \lambda^{-1/2}\tilde U\big) - \frac{\lambda + \omega - 1}{\omega\lambda^{1/2}}\, I\Big).$$

Hence, using lemma 6.4.1 with z = λ^{1/2},

$$\frac{\lambda + \omega - 1}{\omega\lambda^{1/2}} \in \sigma\big(\lambda^{1/2}\tilde L + \lambda^{-1/2}\tilde U\big) = \sigma(\tilde L + \tilde U) = \sigma(C_J),$$

which proves (b).
Now we can prove a main result on the rate of convergence of the SOR method.

Theorem 6.4.3 Let A be as in lemma 6.4.1 and C_J, C_ω the iteration matrices of the Jacobi and SOR method, respectively. Assume that all eigenvalues of C_J are real and that μ̄ := ρ(C_J) < 1 (i.e. the method of Jacobi is convergent). Define

$$\omega_{\mathrm{opt}} := \frac{2}{1 + \sqrt{1 - \bar\mu^2}} = 1 + \Big(\frac{\bar\mu}{1 + \sqrt{1 - \bar\mu^2}}\Big)^2. \tag{6.27}$$

Then

$$\rho(C_\omega) = \begin{cases} \dfrac14\Big(\omega\bar\mu + \sqrt{\omega^2\bar\mu^2 - 4(\omega - 1)}\Big)^2 & \text{for } \omega \in (0, \omega_{\mathrm{opt}}], \\[2mm] \omega - 1 & \text{for } \omega \in [\omega_{\mathrm{opt}}, 2), \end{cases} \tag{6.28}$$

and

$$\omega_{\mathrm{opt}} - 1 = \rho(C_{\omega_{\mathrm{opt}}}) < \rho(C_\omega) < 1 \qquad \text{for all } \omega \in (0, 2),\ \omega \ne \omega_{\mathrm{opt}}, \tag{6.29}$$

holds.
Proof. We only consider ω ∈ (0, 2). Introduce L̃ := D^{-1}L and Ũ := D^{-1}U. First we treat the case where there exists ω ∈ (0, 2) such that ρ(C_ω) = 0. This implies ω = 1, Ũ = 0, μ̄ = 0 and ω_opt = 1. From Ũ = 0 we get C_ω = (1 − ω)(I − ωL̃)^{-1}, which yields ρ(C_ω) = |1 − ω|. One now easily verifies that for this case the results in (6.28) and (6.29) hold.

We now consider the case with ρ(C_ω) > 0 for all ω ∈ (0, 2). Take λ ∈ σ(C_ω), λ ≠ 0. From lemma 6.4.2 it follows that

$$\mu := \frac{\lambda + \omega - 1}{\omega\lambda^{1/2}} \in \sigma(C_J) \subset [-\bar\mu, \bar\mu].$$

A simple computation yields

$$\lambda = \frac14\Big(\omega\mu \pm \sqrt{\omega^2\mu^2 - 4(\omega - 1)}\Big)^2. \tag{6.30}$$

In the case ω²μ² − 4(ω − 1) ≥ 0 we obtain

$$|\lambda| = \frac14\Big|\omega|\mu| \pm \sqrt{\omega^2\mu^2 - 4(\omega - 1)}\Big|^2.$$

The maximum value is attained for the + sign and with the value |μ| = μ̄, resulting in

$$|\lambda| = \frac14\Big(\omega\bar\mu + \sqrt{\omega^2\bar\mu^2 - 4(\omega - 1)}\Big)^2. \tag{6.31}$$

In the case ω²μ² − 4(ω − 1) < 0 the two roots in (6.30) are complex conjugate and |λ| = ω − 1. From

$$\frac14\Big(\omega\bar\mu + \sqrt{\omega^2\bar\mu^2 - 4(\omega - 1)}\Big)^2 \ge \frac14\,\omega^2\bar\mu^2 \ge \omega - 1 \qquad \text{if } \omega^2\bar\mu^2 - 4(\omega - 1) \ge 0,$$

we conclude that the maximum value for |λ| is attained for the case (6.31), and thus

$$\rho(C_\omega) = \frac14\Big(\omega\bar\mu + \sqrt{\omega^2\bar\mu^2 - 4(\omega - 1)}\Big)^2 \qquad \text{if } \omega^2\bar\mu^2 - 4(\omega - 1) \ge 0,$$

which proves the first part in (6.28). For ω²μ̄² − 4(ω − 1) < 0, i.e., ω ∈ (ω_opt, 2), all eigenvalues have modulus ω − 1, which proves the second part. An elementary computation shows that for ω ∈ (0, 2) the function ρ(C_ω) as defined in (6.28) is continuous, monotonically decreasing on (0, ω_opt] and monotonically increasing on [ω_opt, 2). Moreover, both for ω ↓ 0 and ω ↑ 2 we have the function value 1. From this the result in (6.29) follows.
In (6.27) we see that ω_opt > 1 holds, which motivates the name overrelaxation. Note that we do not require symmetry of the matrix A. However, we do assume that the eigenvalues of C_J are real. A sufficient condition for the latter to hold is that A is symmetric. For different values of μ̄ the function ρ(C_ω) defined in (6.28) is shown in figure 6.1.
Figure 6.1: The function ρ(C_ω) from (6.28) for ω ∈ (0, 2) and μ̄ = 0.6, 0.9, 0.95.
Corollary 6.4.4 If we take ω = 1 then the SOR method is the same as the Gauss–Seidel method. Hence, if A satisfies the assumptions in theorem 6.4.3 we obtain from (6.28)

$$\rho(C_1) = \bar\mu^2 = \rho(C_J)^2.$$

Thus ln(ρ(C_1)) = 2 ln(ρ(C_J)), i.e., the asymptotic convergence rate of the Gauss–Seidel method is twice the one of the Jacobi method.
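Both theorem 6.4.3 and corollary 6.4.4 can be checked numerically. The sketch below uses the 1D Poisson matrix (a block tridiagonal matrix with 1 × 1 blocks whose Jacobi iteration matrix has real eigenvalues; this concrete test matrix is our choice):

```python
import numpy as np

n = 20
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # 1D Poisson matrix
D = np.diag(np.diag(A))
L = -np.tril(A, -1)

def rho(C):
    return max(abs(np.linalg.eigvals(C)))

mu = rho(np.eye(n) - np.linalg.solve(D, A))            # rho(C_J)
rho_gs = rho(np.eye(n) - np.linalg.solve(D - L, A))    # Gauss-Seidel = SOR with omega = 1
w = 2 / (1 + np.sqrt(1 - mu**2))                       # omega_opt from (6.27)
rho_sor = rho(np.eye(n) - np.linalg.solve(D / w - L, A))
```

One observes rho_gs = mu² (corollary 6.4.4) and rho_sor = w − 1 (theorem 6.4.3), with rho_sor much smaller than rho_gs.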
6.5.1 Perron theory for positive matrices

For a matrix A ∈ R^{n×n} an eigenvalue λ ∈ σ(A) for which |λ| = ρ(A) holds is not necessarily real. If we assume A > 0 (elementwise) then it can be shown that ρ(A) ∈ σ(A) holds and, moreover, that the corresponding eigenvector is strictly positive. These and other related results, due to Perron [71], are given in lemma 6.5.2, theorem 6.5.3 and theorem 6.5.5.
We start the analysis with an elementary lemma.

Lemma 6.5.1 Let B, C ∈ R^{n×n} with 0 ≤ B ≤ C (elementwise). Then ρ(B) ≤ ρ(C).

Proof. From 0 ≤ B ≤ C we get 0 ≤ B^k ≤ C^k for all k. Hence, ‖B^k‖_∞ ≤ ‖C^k‖_∞ for all k. Recall that for arbitrary A ∈ R^{n×n} we have ρ(A) = lim_{k→∞} ‖A^k‖^{1/k} (cf. lemma 6.1.5). Using this we get ρ(B) ≤ ρ(C).
Lemma 6.5.2 Take A ∈ R^{n×n} with A > 0. For λ ∈ σ(A) with |λ| = ρ(A) and w ∈ C^n, w ≠ 0, with Aw = λw the relation

$$A|w| = \rho(A)|w|$$

holds, i.e., ρ(A) is an eigenvalue of A with eigenvector |w|.

Proof. From ρ(A)|w| = |λ||w| = |Aw| ≤ A|w| we obtain

$$A|w| \ge \rho(A)|w|. \tag{6.32}$$

Assume that we have "<" in one of the components in (6.32). Since A > 0 this implies A(A|w|) > ρ(A)(A|w|) componentwise, and with v := A|w| > 0 there exists γ > ρ(A) such that Av ≥ γv, and thus A^k v ≥ γ^k v for all k ∈ N. This yields ‖A^k‖_∞ ≥ ‖A^k v‖_∞/‖v‖_∞ ≥ γ^k and thus ρ(A) = lim_{k→∞} ‖A^k‖_∞^{1/k} ≥ γ, which is a contradiction with γ > ρ(A). We conclude that in (6.32) equality must hold, i.e., A|w| = ρ(A)|w|.
Theorem 6.5.3 (Perron) For A ∈ R^{n×n} with A > 0 the following holds:

$$\rho(A) > 0 \ \text{and}\ \rho(A) \in \sigma(A), \tag{6.33a}$$
$$\text{there exists } v > 0 \ \text{with}\ Av = \rho(A)v, \tag{6.33b}$$
$$\dim\big(\ker(A - \rho(A)I)\big) = 1. \tag{6.33c}$$

Proof. From lemma 6.5.2 we obtain that there exists w ≠ 0 such that

$$A|w| = \rho(A)|w| \tag{6.34}$$

holds. Thus ρ(A) is an eigenvalue of A. The eigenvector |w| from (6.34) contains at least one entry that is strictly positive. Due to this and A > 0 we have that A|w| > 0, which due to (6.34) implies ρ(A) > 0 and |w| > 0. From this the results in (6.33a) and (6.33b) follow (take v := |w|).

Assume that there exists x ≠ 0 independent of v such that Ax = ρ(A)x. For arbitrary 1 ≤ k ≤ n define θ = x_k/v_k and z = x − θv. Note that z_k = 0 and due to the assumption that x and v are independent we have z ≠ 0. We also have Az = ρ(A)z. From lemma 6.5.2 we get A|z| = ρ(A)|z|, which results in a contradiction, because (A|z|)_k > 0 and ρ(A)(|z|)_k = 0. Thus the result in (6.33c) is proved.
The eigenvalue ρ(A) and the corresponding eigenvector v > 0 (which is unique up to scaling) are called the Perron root and Perron vector of A.
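The Perron root and vector of a positive matrix can be approximated by power iteration. The following is a minimal sketch (the function name and stopping rule are ours, not from the text):

```python
import numpy as np

def perron(A, tol=1e-13, maxit=10000):
    """Power iteration for the Perron root/vector of an elementwise positive A."""
    v = np.ones(A.shape[0])       # positive start vector
    lam = 0.0
    for _ in range(maxit):
        w = A @ v                 # A > 0 and v > 0 keep the iterates positive
        lam_new = np.max(w)       # converges to rho(A) since max(v) = 1
        v = w / lam_new
        if abs(lam_new - lam) < tol:
            break
        lam = lam_new
    return lam, v
```

Since ρ(A) is a simple dominant eigenvalue (theorem 6.5.5), the iteration converges for any positive start vector.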
If instead of A > 0 we only assume A ≥ 0 then the results (6.33a) and (6.33b) hold with ">" replaced by "≥", as is shown in the following corollary. Clearly, for A ≥ 0 the result (6.33c) does not always hold (take A = 0).

Corollary 6.5.4 For A ∈ R^{n×n} with A ≥ 0 we have ρ(A) ∈ σ(A) and there exists a vector v ≥ 0, v ≠ 0, with Av = ρ(A)v.
In the next theorem we present a few further results for the Perron root of a positive matrix.

Theorem 6.5.5 (Perron) For A ∈ R^{n×n} with A > 0 the following holds:

$$\rho(A) \ \text{is a simple eigenvalue of } A, \tag{6.37a}$$
$$\lambda \in \sigma(A),\ |\lambda| = \rho(A) \ \Rightarrow\ \lambda = \rho(A), \tag{6.37b}$$
$$Aw = \lambda w \ \text{with}\ w \ge 0,\ w \ne 0 \ \Rightarrow\ \lambda = \rho(A). \tag{6.37c}$$

Proof. We use the Jordan form A = TΛT^{-1} (cf. Appendix B) with a matrix Λ of the form Λ = blockdiag(Λ_i)_{1 ≤ i ≤ s} and

$$\Lambda_i = \begin{pmatrix} \lambda_i & 1 & & \\ & \ddots & \ddots & \\ & & \ddots & 1 \\ & & & \lambda_i \end{pmatrix} \in \mathbb{R}^{k_i \times k_i}, \qquad 1 \le i \le s,$$

with all λ_i ∈ σ(A). Due to (6.33c) we know that the eigenspace corresponding to the eigenvalue ρ(A) is one dimensional. Thus there is only one block Λ_i with λ_i = ρ(A). Let the ordering of the blocks in Λ be such that the first block Λ_1 corresponds to the eigenvalue λ_1 = ρ(A). We will now show that its dimension must be k_1 = 1. Let e_j be the j-th basis vector in R^n and define t := Te_1, t̂ := T^{-T}e_{k_1}. From ATe_1 = ρ(A)Te_1 we get At = ρ(A)t and thus t is the Perron vector of A. This implies t > 0. Note that A^T T^{-T}e_{k_1} = T^{-T}Λ^T e_{k_1} = ρ(A)T^{-T}e_{k_1} and thus A^T t̂ = ρ(A)t̂.
We now consider (6.37b). Let w Cn , w 6= 0, = ei (A) (i.e.,  = (A)) be such that
Aw = w. From lemma 6.5.2 we get that Aw = (A)w and from (6.33c) it follows that
w > 0 holds. We introduce k , rk R, with rk > 0, such that wk = rk eik , 1 k n, and
D := diag(eik )1kn . Then Dw = w holds and thus
This yields
ei D1 AD A w = 0
j=1
Due to akj wj  > 0 for all j this can only be true if ei(+k j ) 1 = 0 for all j = 1, . . . , n. We
take j = k and thus obtain ei = 1, hence = ei (A) = (A). This shows that (6.37b) holds.
We finally prove (6.37c). Assume Aw = λw with a nonzero vector w ≥ 0 and λ ≠ ρ(A). Application of theorem 6.5.3 to A^T implies that there exists a vector x > 0 such that A^T x = ρ(A^T)x = ρ(A)x. Note that x^T Aw = λ x^T w and x^T Aw = (A^T x)^T w = ρ(A)x^T w. This implies (λ − ρ(A))x^T w = 0 and thus, because x^T w > 0, we obtain λ = ρ(A), which contradicts λ ≠ ρ(A). This completes the proof of the theorem.
From corollary 6.5.4 we know that for A ≥ 0 there exists an eigenvector v ≥ 0 corresponding to the eigenvalue ρ(A). Under the stronger assumption that A ≥ 0 is irreducible (cf. Appendix B) this vector must be strictly positive (as for the case A > 0). This and other related results for nonnegative irreducible matrices are due to Frobenius [37].

Theorem 6.5.6 (Frobenius) Let A ∈ R^{n×n} be irreducible and A ≥ 0. Then the following holds: ρ(A) > 0 is a simple eigenvalue of A and the corresponding eigenvector (unique up to scaling) can be chosen strictly positive.
Definition 6.5.7 A matrix splitting A = M − N is called a regular splitting if M is nonsingular, M^{-1} ≥ 0 and N ≥ 0 hold.

Recall that the iteration matrix of a matrix splitting method (based on the splitting A = M − N) is given by C = I − M^{-1}A = M^{-1}N.
Theorem 6.5.8 Assume that A^{-1} ≥ 0 holds and that A = M − N is a regular splitting. Then

$$\rho(C) = \rho(M^{-1}N) = \frac{\rho(A^{-1}N)}{1 + \rho(A^{-1}N)} < 1$$

holds.
Proof. The matrices I − C = M^{-1}A and I + A^{-1}N = A^{-1}M are nonsingular. We use the identities

$$A^{-1}N = (I - C)^{-1}C \tag{6.40}$$
$$C = (I + A^{-1}N)^{-1}A^{-1}N \tag{6.41}$$

Because C ≥ 0 we can apply corollary 6.5.4. Hence there exists a nonzero vector v ≥ 0 such that Cv = ρ(C)v. Due to the fact that I − C is nonsingular we have ρ(C) ≠ 1. From (6.40) we get

$$A^{-1}Nv = (I - C)^{-1}Cv = \frac{\rho(C)}{1 - \rho(C)}\, v. \tag{6.42}$$

From this and A^{-1} ≥ 0, N ≥ 0, v ≥ 0 we conclude A^{-1}Nv ≥ 0 and ρ(C) < 1. From (6.42) it also follows that ρ(C)/(1 − ρ(C)) is a nonnegative eigenvalue of A^{-1}N. This implies ρ(C)/(1 − ρ(C)) ≤ ρ(A^{-1}N), which can be reformulated as

$$\rho(C) \le \frac{\rho(A^{-1}N)}{1 + \rho(A^{-1}N)}. \tag{6.43}$$

From A^{-1}N ≥ 0 and corollary 6.5.4 it follows that there exists a nonzero vector w ≥ 0 such that A^{-1}Nw = ρ(A^{-1}N)w. Using (6.41) we get

$$Cw = (I + A^{-1}N)^{-1}A^{-1}Nw = \frac{\rho(A^{-1}N)}{1 + \rho(A^{-1}N)}\, w.$$

Thus ρ(A^{-1}N)/(1 + ρ(A^{-1}N)) is a nonnegative eigenvalue of C. This yields

$$\rho(C) \ge \frac{\rho(A^{-1}N)}{1 + \rho(A^{-1}N)}. \tag{6.44}$$

Combination of (6.43) and (6.44) completes the proof.
From the fact that the function x ↦ x/(1 + x) is increasing and using lemma 6.5.1 one immediately obtains the following result.
Corollary 6.5.9 Assume that A^{-1} ≥ 0 holds and that A = M_1 − N_1 = M_2 − N_2 are two regular splittings with N_1 ≤ N_2. Then

$$\rho(I - M_1^{-1}A) \le \rho(I - M_2^{-1}A) < 1$$

holds.
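Theorem 6.5.8 can be checked numerically, e.g. for the Jacobi splitting of the 1D Poisson matrix (our illustrative choice of an A with A^{-1} ≥ 0; here M = D and N = D − A ≥ 0, so the splitting is regular):

```python
import numpy as np

n = 10
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # A^{-1} >= 0 (an M-matrix)
M = np.diag(np.diag(A))                                # Jacobi splitting: M = D
N = M - A                                              # N >= 0

def rho(B):
    return max(abs(np.linalg.eigvals(B)))

lhs = rho(np.linalg.solve(M, N))                       # rho(C) = rho(M^{-1} N)
t = rho(np.linalg.inv(A) @ N)
rhs = t / (1 + t)                                      # right-hand side in theorem 6.5.8
```

The two quantities lhs and rhs agree to machine precision, and both are below 1.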
For the application of these general results to concrete matrix splitting methods it is convenient to introduce the following class of matrices.

Definition 6.5.10 A nonsingular matrix A ∈ R^{n×n} is called an M-matrix if

$$A^{-1} \ge 0 \tag{6.45a}$$
$$a_{ij} \le 0 \quad \text{for all } i \ne j \tag{6.45b}$$

Consider an M-matrix A and let s_k ≥ 0 be the k-th column of A^{-1}. From the identity As_k = e_k (e_k the k-th basis vector) it follows that a_kk(s_k)_k = 1 − ∑_{j≠k} a_kj(s_k)_j ≥ 1 holds, and thus a_kk > 0. Hence, in an M-matrix all diagonal entries are strictly positive. Another property that we will need further on is given in the following lemma.
Lemma 6.5.11 Let A be an M-matrix. Assume that the matrix B has the properties b_ij ≤ 0 for all i ≠ j and B ≥ A. Then B is an M-matrix, too. Furthermore, the inequalities

$$0 \le B^{-1} \le A^{-1}$$

hold.
There is an extensive literature on properties of M-matrices, cf. [12], [34]. A few results are given in the following theorem.

Theorem 6.5.12 The following holds:

(a) If A is irreducibly diagonally dominant and a_ii > 0 for all i, a_ij ≤ 0 for all i ≠ j, then A is an M-matrix.

(b) Assume that a_ij ≤ 0 for all i ≠ j. Then A is an M-matrix if and only if all eigenvalues of A have positive real part.

(c) Assume that a_ij ≤ 0 for all i ≠ j. Then A is an M-matrix if A + A^T is positive definite (this follows from (b)).

(d) If A is symmetric positive definite and a_ij ≤ 0 for all i ≠ j, then A is an M-matrix (this follows from (b)).

(e) If A is a symmetric M-matrix then A is symmetric positive definite (this follows from (b)).

(f) If A is an M-matrix and B results from A after a Gaussian elimination step without pivoting, then B is an M-matrix, too (i.e. Gaussian elimination without pivoting preserves the M-matrix property).
Proof. A proof can be found in [12].
We now show that for M-matrices the Jacobi and Gauss–Seidel methods correspond to regular splittings. Recall the decomposition A = D − L − U.

Theorem 6.5.13 Let A be an M-matrix. For the iteration matrices C_J = I − D^{-1}A and C_GS = I − (D − L)^{-1}A of the Jacobi and Gauss–Seidel method, respectively,

$$\rho(C_{GS}) \le \rho(C_J) < 1 \tag{6.46}$$

holds.

Proof. In the proof of lemma 6.5.11 it is shown that the method of Jacobi corresponds to a regular splitting. For the Gauss–Seidel method note that M_GS = D − L has only nonpositive offdiagonal entries and M_GS − A = U ≥ 0. From lemma 6.5.11 it follows that M_GS is an M-matrix, hence M_GS^{-1} ≥ 0 holds. Thus the Gauss–Seidel method corresponds to a regular splitting, too. Now note that N_GS := U ≤ L + U =: N_J holds and thus corollary 6.5.9 yields the result in (6.46).
This result shows that for an M-matrix both the Jacobi and Gauss–Seidel method are convergent. Moreover, the asymptotic convergence rate of the Gauss–Seidel method is at least as high as for the Jacobi method. If A is the result of the discretization of an elliptic boundary value problem then often the arithmetic costs per iteration are comparable for both methods. In such cases the Gauss–Seidel method is usually more efficient than the method of Jacobi.
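A small experiment with a nonsymmetric M-matrix (an upwind-type 1D convection-diffusion stencil, chosen here purely for illustration) confirms ρ(C_GS) ≤ ρ(C_J) < 1:

```python
import numpy as np

n = 15
# Irreducibly diagonally dominant, a_ii > 0, a_ij <= 0: an M-matrix by theorem 6.5.12(a)
A = 3 * np.eye(n) - 2 * np.eye(n, k=-1) - np.eye(n, k=1)
D = np.diag(np.diag(A))
L = -np.tril(A, -1)

def rho(B):
    return max(abs(np.linalg.eigvals(B)))

rho_j = rho(np.eye(n) - np.linalg.solve(D, A))        # Jacobi
rho_gs = rho(np.eye(n) - np.linalg.solve(D - L, A))   # Gauss-Seidel
```

Both spectral radii are below 1, with the Gauss–Seidel one the smaller of the two.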
A similar comparison result holds for the SOR method: for an M-matrix A the splitting A = M_ω − N_ω with M_ω = ω^{-1}D − L is regular for ω ∈ (0, 1], and

$$\rho(I - M_{\omega_1}^{-1}A) \le \rho(I - M_{\omega_2}^{-1}A) < 1 \qquad \text{for all } 0 < \omega_2 \le \omega_1 \le 1$$

holds.
the convection-dominated case). The resulting discrete problems are denoted by (P) (Poisson problem) and (CD) (convection-diffusion problem).
Example 6.6.1 (Model problem (P)) For the Poisson equation we obtain a stiffness matrix A that is symmetric positive definite and for which κ(D^{-1}A) = O(h^{-2}) holds. In Table 6.1 we show the results for the method of Jacobi applied to this problem with different values of h. For the starting vector we take x⁰ = 0. We use the Euclidean norm ‖·‖₂. By # we denote the number of iterations needed to reduce the norm of the starting error by a factor R = 10³. We observe that when we halve the mesh size h we need approximately four times as many iterations. This is in agreement with κ(D^{-1}A) = O(h^{-2}) and the result in theorem 6.3.3.
We take a reduction factor R = 10³ and consider model problem (P). Then the complexity of the method of Jacobi is cn² flops (c depends on R but is independent of n). For model problem (P) there are methods that have complexity cn^α with α < 2. In particular α = 1½ for the SOR method, α = 1¼ for preconditioned Conjugate Gradient (chapter 7) and α = 1 for the multigrid method (chapter 9). It is clear that if n is large a reduction of the exponent α will result in a significant gain in efficiency; for example, for h = 1/320 we have n ≈ h^{-2} ≈ 10⁵ and n² ≈ 10^{10}. Also note that α = 1 is a lower bound, because for one matrix–vector multiplication Ax we already need cn flops.
Example 6.6.2 (Model problem (P)) In Table 6.2 we show results for the situation as described in example 6.6.1 but now for the Gauss–Seidel method instead of the method of Jacobi. For this model problem with R = 10³ the Gauss–Seidel method has a complexity cn², which is of the same order of magnitude as for the method of Jacobi, but with a constant that is approximately halved (cf. corollary 6.4.4).
Example 6.6.3 (Model problem (CD)) It is important to note that in the Gauss–Seidel method the results depend on the ordering of the unknowns, whereas in the method of Jacobi the resulting iterates are independent of the ordering. We consider model problem (CD) with b₁ = cos(π/6), b₂ = sin(π/6). We take R = 10³ and h = 1/160. Using an ordering of the grid points (and corresponding unknowns) from left to right in the domain (0, 1)² we obtain the results as in Table 6.3. When we use the reversed node ordering we get the results shown in Table 6.4. These results illustrate a rather general phenomenon: if a problem is convection-dominated then for the Gauss–Seidel method it is advantageous to use a node ordering corresponding (as much as possible) to the direction in which information is transported.
ε    10⁰      10^{-2}    10^{-4}
#    17197    856        14
Example 6.6.4 We consider the model problem (P) as in example 6.6.1, with h = 1/160. In Figure 6.2 for different values of the parameter ω we show the corresponding number of SOR iterations (#), needed for an error reduction with a factor R = 10³. The same experiment is performed for the model problem (CD) as in example 6.6.3 with h = 1/160, ε = 10^{-2}. The results are shown in Figure 6.3. Note that with a suitable value for ω an enormous reduction in the number of iterations needed can be achieved. Also note the rapid change in the number of iterations (#) close to the optimal value of ω.

Figure 6.2: Number of SOR iterations (#, logarithmic scale) as a function of ω ∈ [1, 2] for model problem (P).
Figure 6.3: Number of SOR iterations (#, logarithmic scale) as a function of ω ∈ [0.9, 1.9] for model problem (CD).
Chapter 7
7.1 Introduction
In this chapter we discuss the Conjugate Gradient method (CG) for the iterative solution of
sparse linear systems with a symmetric positive definite matrix.
In section 7.2 we introduce and analyze the CG method. This method is based on the formulation of the discrete problem as a minimization problem. The CG method is nonlinear and of a different type than the basic iterative methods discussed in chapter 6. The CG method is not suitable for solving strongly nonsymmetric problems, as for example a discretized convection-diffusion problem with a dominating convection. Many variants of CG have been developed which are applicable to linear systems with a nonsymmetric matrix. A few of these methods are treated in chapter 8. In the CG method and in the variants for nonsymmetric problems the resulting iterates are contained in a so-called Krylov subspace, which explains the terminology Krylov subspace methods. A detailed treatment of these Krylov subspace methods is given in Saad [78]. An important concept related to all these Krylov subspace methods is the so-called preconditioning technique. This will be explained in section 7.3.
For this F we have

$$r^k := Ax^k - b, \tag{7.6}$$

$$\alpha_{\mathrm{opt}}(x^k, p^k) := -\frac{\langle p^k, r^k\rangle}{\langle p^k, Ap^k\rangle}. \tag{7.7}$$

For φ'(0) (with φ(α) := F(x^k + αp^k)), i.e. the derivative of F at x^k in the direction p^k, we have φ'(0) = ⟨p^k, r^k⟩. The direction p^k with ‖p^k‖₂ = 1 for which the modulus of this derivative is maximal is given by p^k = r^k/‖r^k‖₂. This follows from |φ'(0)| = |⟨p^k, r^k⟩| ≤ ‖p^k‖₂‖r^k‖₂, in which we have equality only if p^k = βr^k (β ∈ R). The sign and length of p^k are irrelevant because the right sign and the optimal length are determined by the steplength parameter α_opt. With the choice p^k = r^k we obtain the Steepest Descent method:

$$\begin{cases} x^0 \ \text{a given starting vector,} \\[1mm] x^{k+1} = x^k - \dfrac{\langle r^k, r^k\rangle}{\langle r^k, Ar^k\rangle}\, r^k. \end{cases}$$
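The iteration above can be written down directly. A minimal sketch for dense symmetric positive definite A (the function name and stopping criterion are our choices):

```python
import numpy as np

def steepest_descent(A, b, x0, tol=1e-10, maxit=100000):
    """Steepest Descent: x^{k+1} = x^k - (<r,r>/<r,Ar>) r with r = Ax - b."""
    x = x0.copy()
    for _ in range(maxit):
        r = A @ x - b
        if np.linalg.norm(r) < tol:
            break
        x = x - (r @ r) / (r @ (A @ r)) * r
    return x
```

For well-conditioned A this converges quickly; for ill-conditioned A it is slow, as the example below shows.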
In general the Steepest Descent method converges only slowly. The reason for this is already clear from a simple example with n = 2. We take

$$A = \begin{pmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{pmatrix}, \qquad 0 < \lambda_1 < \lambda_2, \qquad b = (0, 0)^T \ (\text{hence, } x^* = (0, 0)^T).$$

Figure 7.1: Steepest Descent iterates x⁰, x¹, x², x³ for an ill-conditioned two-dimensional problem.

The function F(x₁, x₂) = ½⟨x, Ax⟩ − ⟨x, b⟩ = ½(λ₁x₁² + λ₂x₂²) has level lines N_c = {(x₁, x₂) ∈ R² | F(x₁, x₂) = c} which are ellipsoids. Assume that λ₂ ≫ λ₁ holds (so κ(A) ≫ 1). Then the ellipsoids are stretched in the x₁ direction as is shown in Figure 7.1, and convergence is very slow.
We now introduce the CG method along similar lines as in Hackbusch [48]. To be able to formulate the weakness of the Steepest Descent method we introduce the following notion of optimality. Let V be a subspace of R^n. A vector y is called optimal for the subspace V if

$$F(y) = \min_{z \in V} F(y + z). \tag{7.9}$$

So y is optimal for V if on the hyperplane y + V the functional F is minimal at y. Assume a given y and subspace V. Let d¹, …, d^s be a basis of V and for c ∈ R^s define g(c) = F(y + ∑_{i=1}^s c_i dⁱ). Then y is optimal for V iff ∇g(0) = 0 holds. Note that

$$\frac{\partial g}{\partial c_i}(0) = \langle \nabla F(y), d^i\rangle = \langle Ay - b, d^i\rangle.$$

Hence we obtain the following:

$$y \ \text{is optimal for } V \quad \Longleftrightarrow \quad \langle Ay - b, z\rangle = 0 \ \text{for all } z \in V. \tag{7.10}$$
In the Steepest Descent method we have p^k = r^k. From (7.7) and (7.8) we obtain

$$\langle r^k, r^{k+1}\rangle = \langle r^k, r^k\rangle - \frac{\langle r^k, r^k\rangle}{\langle r^k, Ar^k\rangle}\,\langle r^k, Ar^k\rangle = 0.$$

Using (7.10) we conclude that in the Steepest Descent method x^{k+1} is optimal for the subspace span{p^k}. This is also clear from Fig. 7.1: for example, x³ is optimal for the subspace spanned by the search direction x³ − x². From Fig. 7.1 it is also clear that x^k is not optimal for the subspace spanned by all previous search directions. For example x³ can be improved in the search direction p¹ = x² − x¹: for α = α_opt(x³, p¹) we have F(x⁴) = F(x³ + αp¹) < F(x³).
Now consider a start with p⁰ = r⁰ and thus x¹ = x⁰ + α_opt(x⁰, p⁰)p⁰ (as in Steepest Descent). We assume that the second search direction p¹ is chosen such that ⟨p¹, Ap⁰⟩ = 0 holds. Due to the fact that A is symmetric positive definite we have that p¹ and p⁰ are independent. Define x² = x¹ + α_opt(x¹, p¹)p¹. Note that now ⟨p⁰, b − Ax²⟩ = 0 and also ⟨p¹, b − Ax²⟩ = 0 and thus (cf. (7.10)) x² is optimal for span{p⁰, p¹}. For the special case n = 2 as in the example shown in figure 7.1 we have span{p⁰, p¹} = R². Hence x² is optimal for R², which implies x² = x*! This is illustrated in Fig. 7.2.

Figure 7.2: With the second search direction p¹ chosen A-orthogonal to p⁰, the iterand x² is optimal for span{p⁰, p¹}.

We have constructed search directions p⁰, p¹ and an iterand x² such that x² is optimal for the
two-dimensional subspace span{p⁰, p¹}. This leads to the basic idea behind the Conjugate Gradient (CG) method: we shall use search directions such that x^k is optimal for the k-dimensional subspace span{p⁰, p¹, …, p^{k−1}}. In the Steepest Descent method the iterand x^k is optimal only for the one-dimensional subspace span{p^{k−1}}. This difference results in much faster convergence of the CG iterands as compared to the iterands in the Steepest Descent method.
We will now show how to construct appropriate search directions such that this optimality
property holds. Moreover, we derive a method for the construction of these search directions
with low computational costs.
As in the Steepest Descent method, we start with p⁰ = r⁰ and x¹ as in (7.5). Recall that x¹ is optimal for span{p⁰}. Assume that for a given k with 1 ≤ k < n, linearly independent search directions p⁰, …, p^{k−1} are given such that x^k as in (7.5) is optimal for span{p⁰, …, p^{k−1}}. We introduce the notation

$$V_k = \mathrm{span}\{p^0, \ldots, p^{k-1}\}$$

and assume that x^k ≠ x*, i.e., r^k ≠ 0 (if x^k = x* we do not need a new search direction). We will show how p^k can be taken such that x^{k+1}, defined as in (7.5), is optimal for span{p⁰, p¹, …, p^k} =: V_{k+1}. We choose p^k such that

$$p^k \perp_A V_k, \quad \text{i.e.} \quad p^k \in V_k^{\perp_A} \tag{7.11}$$

holds. This A-orthogonality condition does not determine a unique search direction p^k. The Steepest Descent method above was based on the observation that r^k = ∇F(x^k) gives the direction of steepest descent at x^k. Therefore we use this direction to determine the new search direction. A unique new search direction p^k is given by the following:

$$p^k \in V_k^{\perp_A} \quad \text{such that} \quad \|p^k - r^k\|_A = \min_{p \in V_k^{\perp_A}} \|p - r^k\|_A \tag{7.12}$$
The definition of p¹ is illustrated in Fig. 7.3.

Figure 7.3: The definition of p¹ as the A-orthogonal projection of r¹ on V₁^{⊥_A}; here V₁ = span{p⁰}.
Note that p^k is the A-orthogonal projection of r^k on V_k^{⊥_A}. This yields the following formula for the search direction p^k:

$$p^k = r^k - \sum_{j=0}^{k-1} \frac{\langle p^j, r^k\rangle_A}{\langle p^j, p^j\rangle_A}\, p^j = r^k - \sum_{j=0}^{k-1} \frac{\langle p^j, Ar^k\rangle}{\langle p^j, Ap^j\rangle}\, p^j. \tag{7.13}$$

We assumed that x^k is optimal for V_k and that r^k ≠ 0. From the former we get that ⟨p^j, r^k⟩ = 0 for j = 0, …, k−1, i.e., r^k ⊥ V_k (note that here we have ⊥ and not ⊥_A). Using r^k ≠ 0 we conclude that r^k ∉ V_k and thus from (7.13) it follows that p^k ∉ V_k. Hence, p^k is linearly independent of p⁰, …, p^{k−1} and

$$\dim(V_{k+1}) = k + 1. \tag{7.14}$$
Due to (7.11) and the optimality of x^k for the subspace V_k (cf. also (7.10)) we have for j < k

$$\langle p^j, b - Ax^{k+1}\rangle = \langle p^j, b - Ax^k\rangle - \alpha_{\mathrm{opt}}(x^k, p^k)\,\langle p^j, Ap^k\rangle = 0,$$

and by the definition of α_opt also ⟨p^k, b − Ax^{k+1}⟩ = 0. Using (7.10) we conclude that x^{k+1} is optimal for the subspace V_{k+1}! The search directions p^k defined as in (7.13) (p⁰ := r⁰) and the iterands as in (7.15) define the Conjugate Gradient method. This method is introduced in Hestenes and Stiefel [51].
Theorem 7.2.1 Let x⁰ ∈ R^n be given and m < n be such that for k = 0, 1, …, m we have x^k ≠ x* and p^k, x^{k+1} as in (7.13), (7.15). Define V_k = span{p⁰, …, p^{k−1}} (0 ≤ k ≤ m + 1).
Then the following holds for all k = 1, …, m + 1:

$$\dim(V_k) = k \tag{7.16a}$$
$$x^k \in x^0 + V_k \tag{7.16b}$$
$$F(x^k) = \min\{F(x) \mid x \in x^0 + V_k\} \tag{7.16c}$$
$$V_k = \mathrm{span}\{r^0, \ldots, r^{k-1}\} = \mathrm{span}\{r^0, Ar^0, \ldots, A^{k-1}r^0\} \tag{7.16d}$$
$$\langle p^j, r^k\rangle = 0 \quad \text{for all } j = 0, 1, \ldots, k-1 \tag{7.16e}$$
$$\langle r^j, r^k\rangle = 0 \quad \text{for all } j = 0, 1, \ldots, k-1 \tag{7.16f}$$
$$p^k \in \mathrm{span}\{r^k, p^{k-1}\} \quad (\text{for } k \le m) \tag{7.16g}$$
Proof. The result in (7.16a) is shown in the derivation of the method, cf. (7.14). The result in (7.16b) can be shown by induction using x^k = x^{k−1} + α_opt(x^{k−1}, p^{k−1})p^{k−1}. The construction of the search directions and new iterands in the CG method is such that x^k is optimal for V_k, i.e., F(x^k) = min{F(x^k + w) | w ∈ V_k}. Using x^k ∈ x⁰ + V_k this can be rewritten as F(x^k) = min{F(x⁰ + w) | w ∈ V_k}, which proves the result in (7.16c).

We introduce the notation R_k = span{r⁰, …, r^{k−1}} and prove V_k ⊂ R_k by induction. For k = 1 this holds due to p⁰ = r⁰. Assume that it holds for some k ≤ m. Since V_{k+1} = span{V_k, p^k} and V_k ⊂ R_k ⊂ R_{k+1}, we only have to show p^k ∈ R_{k+1}. From (7.13) it follows that p^k ∈ span{p⁰, …, p^{k−1}, r^k} = span{V_k, r^k} ⊂ R_{k+1}, which completes the induction argument. Using dim(V_k) = k it follows that V_k = R_k must hold. Hence the first equality in (7.16d) is proved. We introduce the notation W_k = span{r⁰, Ar⁰, …, A^{k−1}r⁰} and prove R_k ⊂ W_k by induction. For k = 1 this is trivial. Assume that for some k ≤ m, R_k ⊂ W_k holds. Due to R_{k+1} = span{R_k, r^k} and R_k ⊂ W_k ⊂ W_{k+1} we only have to show r^k ∈ W_{k+1}. Note that r^k = r^{k−1} + α_opt(x^{k−1}, p^{k−1})Ap^{k−1} and r^{k−1} ∈ R_k ⊂ W_k ⊂ W_{k+1}, Ap^{k−1} ∈ AV_k = AR_k ⊂ AW_k ⊂ W_{k+1}. Thus r^k ∈ W_{k+1} holds, which completes the induction. Due to dim(R_k) = k it follows that R_k = W_k must hold. Hence the second equality in (7.16d) is proved.

The search directions and iterands are such that x^k is optimal for V_k = span{p⁰, …, p^{k−1}}. From (7.10) we get ⟨p^j, r^k⟩ = 0 for j = 0, …, k−1 and thus (7.16e) holds. Due to V_k = span{r⁰, …, r^{k−1}} this immediately yields (7.16f), too. To prove (7.16g) we use the formula (7.13). Note that r^{j+1} = r^j + α_opt(x^j, p^j)Ap^j and thus Ap^j ∈ span{r^{j+1}, r^j}. From this and (7.16f) it follows that for j ≤ k − 2 we have ⟨p^j, r^k⟩_A = ⟨Ap^j, r^k⟩ = 0. Thus in the sum in (7.13) all terms with j ≤ k − 2 are zero.
The result in (7.16g) is very important for an efficient implementation of the CG method. Combining this result with the formula given in (7.13) we immediately obtain that in the summation in (7.13) there is only one nonzero term, i.e. for p^k we have the formula

$$p^k = r^k - \frac{\langle p^{k-1}, Ar^k\rangle}{\langle p^{k-1}, Ap^{k-1}\rangle}\, p^{k-1}. \tag{7.17}$$
From (7.17) we see that we have a simple and cheap two-term recursion for the search directions in the CG method. Combination of (7.5), (7.7), (7.8) and (7.17) results in the following CG algorithm:

$$\begin{cases} x^0 \ \text{a given starting vector;} \quad r^0 = Ax^0 - b, \\[1mm] \text{for } k \ge 0 \ (\text{if } r^k \ne 0): \\[1mm] \quad p^k = r^k - \dfrac{\langle p^{k-1}, Ar^k\rangle}{\langle p^{k-1}, Ap^{k-1}\rangle}\, p^{k-1} \qquad (\text{if } k = 0 \text{ then } p^0 := r^0), \\[2mm] \quad x^{k+1} = x^k + \alpha_{\mathrm{opt}}(x^k, p^k)\,p^k \qquad \text{with } \alpha_{\mathrm{opt}}(x^k, p^k) = -\dfrac{\langle p^k, r^k\rangle}{\langle p^k, Ap^k\rangle}, \\[2mm] \quad r^{k+1} = r^k + \alpha_{\mathrm{opt}}(x^k, p^k)\,Ap^k. \end{cases} \tag{7.18}$$
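A direct transcription of this algorithm into code (a sketch; variable names are ours, and Ap^{k+1} is updated from Ar^{k+1} so that only one matrix–vector product per iteration is needed):

```python
import numpy as np

def cg(A, b, x0, tol=1e-12, maxit=1000):
    """Conjugate Gradient following (7.18), with the two-term recursion (7.17)."""
    x = x0.copy()
    r = A @ x - b                       # r^0 = A x^0 - b
    p = r.copy()                        # p^0 = r^0
    Ap = A @ p
    for _ in range(maxit):
        if np.linalg.norm(r) <= tol:
            break
        alpha = -(p @ r) / (p @ Ap)     # alpha_opt(x^k, p^k)
        x = x + alpha * p
        r = r + alpha * Ap              # r^{k+1} = r^k + alpha A p^k
        Ar = A @ r
        beta = (p @ Ar) / (p @ Ap)
        p = r - beta * p                # two-term recursion (7.17)
        Ap = Ar - beta * Ap             # A p^{k+1} without an extra product
    return x
```

In exact arithmetic the iteration terminates after at most n steps (cf. theorem 7.2.1).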
We now discuss the arithmetic costs per iteration and the rate of convergence of the CG method. If we use the CG algorithm with the formulas in (7.19) then in one iteration we have to compute one matrix–vector multiplication, two inner products and a few vector updates, i.e. (if A is a sparse matrix) we need cn flops. The costs per iteration are of the same order of magnitude as for the Jacobi, Gauss–Seidel and SOR method.
With respect to the rate of convergence of the CG method we formulate the following theorem.

Theorem 7.2.2 Define P̄_k := {p ∈ P_k | p(0) = 1}. Let x^k, k ≥ 0, be the iterands of the CG method and e^k = x^k − x*. The following holds:

$$\|e^k\|_A = \min_{p_k \in \bar P_k} \|p_k(A)e^0\|_A \tag{7.20}$$

$$\|e^k\|_A \le \min_{p_k \in \bar P_k}\ \max_{\lambda \in \sigma(A)} |p_k(\lambda)|\ \|e^0\|_A \tag{7.21}$$
Proof. From (7.16b) we get e^k ∈ e⁰ + V_k. And due to V_k = span{r⁰, …, r^{k−1}} and (7.16f) we have Ae^k = r^k ⊥ V_k and thus e^k ⊥_A V_k. This implies ‖e^k‖_A = min_{v_k ∈ V_k} ‖e⁰ − v_k‖_A. Note that v_k ∈ V_k can be represented as

$$v_k = \sum_{j=0}^{k-1} \gamma_j A^j r^0 = \sum_{j=0}^{k-1} \gamma_j A^{j+1} e^0.$$

Hence,

$$\|e^k\|_A = \min_{\gamma \in \mathbb{R}^k} \Big\|e^0 - \sum_{j=0}^{k-1} \gamma_j A^{j+1}e^0\Big\|_A = \min_{p_k \in \bar P_k} \|p_k(A)e^0\|_A.$$

This proves the result in (7.20). The result in (7.21) follows from the fact that A is symmetric positive definite, whence ‖p_k(A)e⁰‖_A ≤ max_{λ∈σ(A)} |p_k(λ)| ‖e⁰‖_A.
Let I = [λ_min, λ_max] with λ_min and λ_max the extreme eigenvalues of A. From the results above we have

$$\|e^k\|_A \le \min_{p_k \in \bar P_k}\ \max_{\lambda \in I} |p_k(\lambda)|\ \|e^0\|_A.$$

The min–max quantity in this upper bound can be analyzed using Chebyshev polynomials, defined by T₀(x) = 1, T₁(x) = x, T_{m+1}(x) = 2xT_m(x) − T_{m−1}(x) for m ≥ 1. These polynomials have the representation

$$T_k(x) = \frac12\Big[\big(x + \sqrt{x^2 - 1}\big)^k + \big(x - \sqrt{x^2 - 1}\big)^k\Big] \tag{7.23}$$

and for any interval [a, b] with b < 1 they have the following property:

$$\min_{p_k \in P_k,\, p_k(1) = 1}\ \max_{x \in [a, b]} |p_k(x)| = 1\Big/\Big|T_k\Big(\frac{2 - a - b}{b - a}\Big)\Big|.$$
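The three-term recursion and the closed form (7.23) can be cross-checked numerically (for |x| ≥ 1, where the square root in (7.23) is real):

```python
import numpy as np

def cheb_rec(k, x):
    """T_k(x) via the recursion T_{m+1} = 2 x T_m - T_{m-1}, T_0 = 1, T_1 = x."""
    t_prev, t = 1.0, x
    if k == 0:
        return t_prev
    for _ in range(k - 1):
        t_prev, t = t, 2 * x * t - t_prev
    return t

def cheb_closed(k, x):
    """Closed form (7.23); real-valued for |x| >= 1."""
    s = np.sqrt(x * x - 1)
    return 0.5 * ((x + s) ** k + (x - s) ** k)
```

The rapid growth of T_k outside [−1, 1] is what makes the bound (7.22) effective.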
the contraction numbers of the Richardson and (damped) Jacobi methods, which are of the form 1 − c/κ(A). For the case κ(A) ∼ ch^{-2} the latter takes the form 1 − ch², whereas for CG we have an (average) reduction factor 1 − ch.

Often the bound in (7.22) is rather pessimistic because the phenomenon of superlinear convergence is not expressed in this bound.

For a further theoretical analysis of the CG method we refer to Axelsson and Barker [7], Golub and Van Loan [41] and Hackbusch [48].
Example 7.2.3 (Poisson model problem) We apply the CG method to the discrete Poisson equation from section 6.6. First we discuss the complexity of the CG method for this model problem. In this case we have κ(A) ∼ ch^{-2}. Using (7.22) it follows that (in the A-norm) the error is reduced with approximately a factor

$$\frac{\sqrt{\kappa(A)} - 1}{\sqrt{\kappa(A)} + 1} \approx 1 - 2c_1 h \tag{7.24}$$

per iteration. The arithmetic costs are cn flops per iteration. So for a reduction of the error with a factor R we need approximately (ln R / ln(1 − 2c₁h)) cn ≈ ch^{-1}n ≈ cn^{1½} flops. We conclude that the complexity is of the same order of magnitude as for the SOR method with the optimal value for the relaxation parameter. However, note that, opposite to the SOR method, in the CG method we do not have the problem of choosing a suitable parameter value. In Table 7.1 we show results which can be compared with the results in section 6.6. We use the Euclidean norm and # denotes the number of iterations needed to reduce the starting error with a factor R = 10³.
In figure 7.4 we illustrate the phenomenon of superlinear convergence in the CG method. For the case $h = 1/160$ we show the actual error reduction in the A-norm, i.e.
$$\kappa_k := \frac{\|x^k - x^*\|_A}{\|x^{k-1} - x^*\|_A}$$
in the first 250 iterations. The factor $(\sqrt{\kappa(A)} - 1)/(\sqrt{\kappa(A)} + 1)$ has the value $\sigma = 0.96$ (horizontal line in figure 7.4). There is a clear decreasing tendency of $\kappa_k$ during the iteration process. For large values of $k$, $\kappa_k$ is significantly smaller than $\sigma$. Finally, we note that an irregular convergence behaviour as in figure 7.4 is typical for the CG method.
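The qualitative behaviour described above can be reproduced with a plain CG implementation. The sketch below uses the 1D tridiagonal Poisson matrix as a stand-in (it is not the 2D finite element matrix of section 6.6) and checks the bound (7.22), $\|e^k\|_A \le 2\sigma^k\|e^0\|_A$:

```python
import numpy as np

# 1D discrete Poisson matrix, tridiag(-1, 2, -1); illustrative stand-in
n = 100
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
x_star = np.linalg.solve(A, b)

def a_norm(v):
    return np.sqrt(v @ (A @ v))

x = np.zeros(n)
r = A @ x - b                 # residual convention r^k = A x^k - b
p = -r
errs = [a_norm(x - x_star)]
for _ in range(50):
    Ap = A @ p
    alpha = (r @ r) / (p @ Ap)
    x = x + alpha * p
    r_new = r + alpha * Ap
    p = -r_new + ((r_new @ r_new) / (r @ r)) * p
    r = r_new
    errs.append(a_norm(x - x_star))

kappa = np.linalg.cond(A)
sigma = (np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)
# check the bound ||e^k||_A <= 2 sigma^k ||e^0||_A for every k
assert all(errs[k] <= 2.0 * sigma**k * errs[0] * (1.0 + 1e-8)
           for k in range(len(errs)))
```

Plotting the ratios `errs[k] / errs[k-1]` against `sigma` would reproduce the decreasing tendency seen in figure 7.4.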
[Figure 7.4: error reduction factor $\kappa_k$ per iteration of CG for the Poisson model problem, $h = 1/160$; the horizontal line indicates $\sigma = 0.96$.]
Applying the Richardson method with parameter $\omega$ to the preconditioned system
$$\tilde A x = \tilde b \,, \qquad \tilde A := W^{-1}A \,, \quad \tilde b := W^{-1}b \,,$$
we obtain
$$x^{k+1} = x^k - \omega W^{-1}(Ax^k - b) \,. \qquad (7.28)$$
This method is called the preconditioned Richardson method. Note that if we assume that (an estimate of) the spectrum of $\tilde A$ is known, then we do not need the preconditioned matrix $\tilde A$ in this method. In (7.28) we have to compute $z := W^{-1}(Ax^k - b)$, i.e., $Wz = Ax^k - b$. Due to the condition in (7.25a), $z$ can be computed with acceptable arithmetic costs. For the spectral radius of the iteration matrix $\tilde C$ of the preconditioned method with the optimal parameter value we obtain, using $\sigma(\tilde A) = \sigma(W^{-1}A) = \sigma(W^{-\frac12}AW^{-\frac12}) \subset (0, \infty)$,
$$\rho(\tilde C) = \frac{\kappa(\tilde A) - 1}{\kappa(\tilde A) + 1} \,. \qquad (7.29)$$
From (7.27) and (7.29) we conclude that if $\kappa(W^{-1}A) \ll \kappa(A)$ (cf. (7.25b)), then $\rho(\tilde C) \ll \rho(C)$ and the convergence of the preconditioned method will be much faster than for the original one. Note that for $W = \mathrm{diag}(A)$ the preconditioned Richardson method coincides with the damped Jacobi method.
Consider a linear iterative method of the form
$$x^{k+1} = x^k - M^{-1}(Ax^k - b) \,. \qquad (7.30)$$
If one uses this iterative method for preconditioning then $W := M$ is taken as the preconditioner for $A$. If the method (7.30) converges then $W$ is a reasonable approximation of $A$ in the sense that $\rho(I - W^{-1}A) < 1$.
The iteration in (7.30) corresponds to an iterative method, and thus $M^{-1}y$ ($y \in \mathbb{R}^n$) can be computed with acceptable arithmetic costs. Hence the condition in (7.25a), with $W = M$, is satisfied.
Related to the implementation of such a preconditioner we note the following. In an iterative method the matrix M is usually not formed explicitly (cf. Gauss-Seidel or SOR), i.e. the iteration (7.30) can be implemented without explicitly computing M. The solution of $Wx = y$, i.e. of $Mx = y$, is the result of (7.30) with $k = 0$, $x^0 = 0$, $b = y$. From this it follows that the computation of the solution of $Wx = y$ can be implemented by performing one iteration of the iterative method applied to $Az = y$ with starting vector 0.
For A and M symmetric positive definite with $\rho(I - M^{-1}A) < 1$ the following bound holds:
$$\kappa(M^{-1}A) \le \frac{1 + \rho(C)}{1 - \rho(C)} \,. \qquad (7.31)$$
Proof. Because A and M are symmetric positive definite it follows that
$$\sigma(M^{-1}A) = \sigma(M^{-\frac12}AM^{-\frac12}) \subset (0, \infty) \,.$$
Using $\rho(I - M^{-1}A) < 1$ we obtain that $\sigma(M^{-1}A) \subset (0, 2)$. The eigenvalues of $M^{-1}A$ are denoted by $\lambda_i$:
$$0 < \lambda_1 \le \lambda_2 \le \dots \le \lambda_n < 2 \,.$$
Hence $\rho(C) = \max\{1 - \lambda_1,\, \lambda_n - 1\}$ holds and
$$\kappa(M^{-1}A) = \frac{\lambda_n}{\lambda_1} = \frac{1 + (\lambda_n - 1)}{1 - (1 - \lambda_1)} \,.$$
So
$$\kappa(M^{-1}A) \le \frac{1 + \rho(C)}{1 - \rho(C)}$$
holds.
With respect to the bound in (7.31) we note that the function $x \mapsto \frac{1+x}{1-x}$ increases monotonically on $[0, 1)$.
In the introductory example above we have seen that it is favourable to have a small value for $\kappa(M^{-1}A)$. In (7.31) we have a bound on $\kappa(M^{-1}A)$ that decreases if $\rho(C)$ decreases. This indicates that the higher the convergence rate of the iterative method in (7.30), the better the quality of M as a preconditioner for A.
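The bound (7.31) can be checked numerically. The data below are synthetic (a random symmetric positive definite matrix with a rescaled diagonal preconditioner), not the model problem of the text:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
Q = np.linalg.qr(rng.standard_normal((n, n)))[0]
A = Q @ np.diag(rng.uniform(1.0, 50.0, n)) @ Q.T   # random SPD matrix
M = np.diag(np.diag(A))                            # diagonal (Jacobi-type) M

# rescale M so that sigma(M^{-1}A) lies in (0, 2), i.e. rho(C) < 1
lam = np.sort(np.linalg.eigvals(np.linalg.solve(M, A)).real)
M = M * (lam[-1] / 1.9)
lam = np.sort(np.linalg.eigvals(np.linalg.solve(M, A)).real)

rho = max(1.0 - lam[0], lam[-1] - 1.0)   # rho(C), C = I - M^{-1}A
kappa = lam[-1] / lam[0]                 # kappa(M^{-1}A)
assert 0.0 < rho < 1.0
assert kappa <= (1.0 + rho) / (1.0 - rho) + 1e-8   # bound (7.31)
```

The eigenvalues of $M^{-1}A$ are real and positive because $M^{-1}A$ is similar to $M^{-1/2}AM^{-1/2}$, exactly as used in the proof above.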
Example 7.4.2 (Discrete Poisson equation) We consider the matrix A resulting from the finite element discretization of the Poisson equation as described in section 6.6. If we use the method of Jacobi, i.e. $M = D$, then $\rho(C) \approx 1 - ch^2$ holds and (7.31) results in
$$\kappa(D^{-1}A) \lesssim 2c^{-1}h^{-2} \,. \qquad (7.32)$$
In this model problem the eigenvalues are known and it can be shown that the exponent $-2$ in (7.32) is sharp. When we use the SSOR basic iterative method in (7.30), with M as in (6.18) and with an appropriate value for $\omega$, we have $\rho(C) \approx 1 - ch$ and thus
$$\kappa(M^{-1}A) \lesssim 2c^{-1}h^{-1} \,. \qquad (7.33)$$
Hence, in this example, for the SSOR preconditioner the quantity $\kappa(M^{-1}A)$ is significantly smaller than for the Jacobi preconditioner. So in a preconditioned Richardson method or in a Preconditioned Conjugate Gradient method (cf. Example 7.7.1) the SSOR preconditioner results in a method with a higher rate of convergence than the Jacobi preconditioner.
7.5.1 LU factorization
The direct method of Gaussian elimination for solving a system Ax = b is closely related to
the LU factorization of A. We recall the following: for every square matrix A there exists a
permutation matrix P, a lower triangular matrix L, with diag(L) = I and an upper triangular
matrix U such that the factorization
PA = LU (7.34)
holds. If A is nonsingular, then for given P these L and U are unique. To simplify the discussion we only consider the case P = I, i.e. we do not use pivoting. It is known that a factorization as in (7.34) with P = I exists if A is symmetric positive definite or if A is an M-matrix. Many different algorithms for the computation of an LU factorization exist (cf. Golub and Van Loan [41]). A standard technique is presented in the following algorithm, in which $a_{ij}$ is overwritten by $l_{ij}$ if $i > j$ and by $u_{ij}$ otherwise.
LU factorization.
For k = 1, ..., n-1:
  If $a_{kk} = 0$ then quit, else
  For i = k+1, ..., n:                                (7.35)
    $\ell := a_{ik}/a_{kk}$ ;  $a_{ik} := \ell$ ;
    For j = k+1, ..., n:
      $a_{ij} := a_{ij} - \ell\, a_{kj}$ .
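A direct transcription of algorithm (7.35) into Python (no pivoting; the multiplier name `ell` is ours, standing in for the lost symbol in the extracted text):

```python
import numpy as np

def lu_inplace(a):
    """Algorithm (7.35): overwrite a with L (strictly lower part,
    unit diagonal implicit) and U (upper part). No pivoting."""
    a = np.array(a, dtype=float)
    n = a.shape[0]
    for k in range(n - 1):
        if a[k, k] == 0.0:
            raise ZeroDivisionError("zero pivot in step %d" % k)
        for i in range(k + 1, n):
            ell = a[i, k] / a[k, k]
            a[i, k] = ell
            for j in range(k + 1, n):
                a[i, j] -= ell * a[k, j]
    return a

A = np.array([[4.0, 2.0, 1.0],
              [2.0, 5.0, 2.0],
              [1.0, 2.0, 6.0]])
F = lu_inplace(A)
L = np.tril(F, -1) + np.eye(3)
U = np.triu(F)
assert np.allclose(L @ U, A)
```

The assertion at the end verifies $A = LU$ for a small symmetric positive definite test matrix, for which the factorization with $P = I$ is guaranteed to exist.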
Clearly, the Gaussian elimination process fails if we encounter a zero pivot. In the if condition in (7.35) it is checked whether the pivot in the kth elimination step is equal to zero. If this condition is never true, the Gaussian elimination algorithm (7.35) yields an LU decomposition as in (7.34) with P = I. In the kth step of the Gaussian elimination process we eliminate the nonzero entries below the diagonal in the kth column. Due to this the entries in the $(n-k) \times (n-k)$ lower right block of the matrix change. This corresponds to the assignment $a_{ij} := a_{ij} - \ell\, a_{kj}$, with a loop over i and j. In the assignment $a_{ik} := \ell$, with a loop over i, the values of $l_{ik}$, $i > k$, are computed. Finally note that in the kth step of the elimination process the entries $a_{mj}$, with $1 \le m \le k$ and $j \ge m$, do not change; these are the entries $u_{mj}$ ($1 \le m \le k$, $j \ge m$) of the matrix U.
From $A = LU$ it follows that $a_{ik} = \sum_{j=1}^{\min(i,k)} l_{ij}u_{jk}$ for $1 \le i, k \le n$ (7.36). This yields the following explicit formulas for $l_{ik}$ and $u_{ik}$:
$$l_{ik} = \Big(a_{ik} - \sum_{j=1}^{k-1} l_{ij}u_{jk}\Big)\Big/u_{kk} \,, \qquad 1 \le k < i \le n \,, \qquad (7.37)$$
$$u_{ik} = a_{ik} - \sum_{j=1}^{i-1} l_{ij}u_{jk} \,, \qquad 1 \le i \le k \le n \,. \qquad (7.38)$$
Thus we can compute L and U row by row, i.e. we take i = 1, 2, ..., n and for i fixed we compute $l_{ik}$ by (7.37) with k = 1, ..., i-1 and then $u_{ik}$ by (7.38) with k = i, ..., n. We discuss a simple implementation of this row-wise Gaussian elimination process. We take i fixed and introduce the notation
$$a_{ik}^{(m)} := a_{ik} - \sum_{j=1}^{m-1} l_{ij}u_{jk} \,, \qquad 1 \le k, m \le n \,. \qquad (7.39)$$
Note that $a_{ik}^{(1)} = a_{ik}$ and
$$u_{ik} = a_{ik}^{(i)} \quad \text{for } k \ge i \,,$$
$$l_{ik} = a_{ik}^{(k)}/u_{kk} \quad \text{for } k < i \,, \qquad (7.40)$$
$$a_{ik}^{(m+1)} = a_{ik}^{(m)} - l_{im}u_{mk} \,.$$
Using these formulas the entries $l_{ik}$ and $u_{ik}$ can be computed as follows. Note that $u_{1k} = a_{1k}$ for k = 1, ..., n. Assume that the rows 1, ..., i-1 of L and U have been computed; then $l_{ik}$, $1 \le k < i$, and $u_{ik}$, $i \le k \le n$, are determined by
For k = 1, ..., i-1:
  $l_{ik} = a_{ik}^{(k)}/u_{kk}$
  For j = k+1, ..., n:
    $a_{ij}^{(k+1)} = a_{ij}^{(k)} - l_{ik}u_{kj}$
For k = i, ..., n:
  $u_{ik} = a_{ik}^{(i)}$
As in (7.35) we can overwrite the matrix A, and we then obtain the following algorithm, which is commonly used for a row-contiguous data structure:
For i = 2, ..., n:
  For k = 1, ..., i-1:
    $a_{ik} := a_{ik}/a_{kk}$                         (7.41)
    For j = k+1, ..., n:
      $a_{ij} := a_{ij} - a_{ik}a_{kj}$
For ease of presentation we deleted the statement If akk = 0 then quit. If in both algorithms,
(7.35) and (7.41), a zero pivot does not occur (i.e. akk = 0 is never true), then both algorithms
yield identical LU factorizations.
For certain classes of matrices, for example symmetric matrices or matrices having a band structure, there exist Gaussian elimination algorithms which take advantage of special properties of the matrix. Such specialized algorithms enhance efficiency. A well-known example is the Cholesky decomposition method, in which for a symmetric positive definite matrix A a factorization $A = LL^T$ is computed (here L is lower triangular, but diag(L) is not necessarily equal to I). Based on the formula in (7.36) the following algorithm is obtained:
Cholesky factorization.
For k = 1, ..., n:
  $a_{kk} := \sqrt{a_{kk} - \sum_{j=1}^{k-1} a_{kj}^2}$                      (7.42)
  For i = k+1, ..., n:
    $a_{ik} := \big(a_{ik} - \sum_{j=1}^{k-1} a_{ij}a_{kj}\big)/a_{kk}$
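Algorithm (7.42) translates directly into code; the sketch below overwrites the lower triangle of a small symmetric positive definite test matrix of our own choosing:

```python
import numpy as np

def cholesky_inplace(a):
    """Algorithm (7.42): overwrite the lower triangle of a with L
    such that A = L L^T (A symmetric positive definite assumed)."""
    a = np.array(a, dtype=float)
    n = a.shape[0]
    for k in range(n):
        a[k, k] = np.sqrt(a[k, k] - np.sum(a[k, :k] ** 2))
        for i in range(k + 1, n):
            a[i, k] = (a[i, k] - np.sum(a[i, :k] * a[k, :k])) / a[k, k]
    return np.tril(a)

A = np.array([[4.0, 2.0, 1.0],
              [2.0, 5.0, 2.0],
              [1.0, 2.0, 6.0]])
L = cholesky_inplace(A)
assert np.allclose(L @ L.T, A)
```

Note that, in contrast to (7.35), only the lower triangle is computed and stored, roughly halving the work.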
To obtain a stable Gaussian elimination algorithm it is important to use a (partial) pivoting
strategy. We do not discuss this topic here but refer to the literature, e.g. Golub and Van
Loan [41]. We note that if the matrix A is symmetric positive definite or weakly diagonally
dominant, then a straightforward implementation of Gaussian elimination is stable even without
using pivoting. For example, the Cholesky algorithm as in (7.42), applied to a symmetric positive
definite matrix, is stable.
Let S be a subset of the index set $\{\, (i, j) \mid 1 \le i, j \le n \,\}$. We call this subset the sparsity pattern and we assume:
$$\{\, (i, i) \mid 1 \le i \le n \,\} \subset S \,, \qquad G(A) \subset S \,. \qquad (7.43)$$
In our applications the matrices A are such that all diagonal entries are nonzero and thus (7.43) reduces to the condition $G(A) \subset S$. We now simply enforce sparsity of L and U by setting every entry in L and U to zero if the corresponding index is outside the sparsity pattern, i.e. we introduce the condition:
$$l_{ij} = u_{ij} = 0 \quad \text{if } (i, j) \notin S \,. \qquad (7.44)$$
We apply Gaussian elimination and we require sparsity of L and U as formulated in (7.44). This then yields an incomplete LU factorization. As for the (complete) LU factorization method discussed in Section 7.5.1, several different implementations of an incomplete factorization method exist. We present a few well-known algorithms. We assume that no zero pivot occurs in the algorithms below. Theorem 7.5.3 gives sufficient conditions on the matrix A such that this assumption is fulfilled.
We start with the incomplete Cholesky factorization $A = LL^T - R$ based on algorithm (7.42). For this algorithm to make sense, the matrix A should be symmetric positive definite. To preserve symmetry we assume that the sparsity pattern is symmetric, i.e. if $(i, j) \in S$ then $(j, i) \in S$, too. We use a formulation in which $a_{ij}$ is overwritten by $l_{ij}$ if $i \ge j$:
Incomplete Cholesky factorization.
For k = 1, ..., n:
  $a_{kk} := \sqrt{a_{kk} - \sum_{j=1}^{k-1} a_{kj}^2}$                      (7.45)
  For i = k+1, ..., n:
    If $(i, k) \in S$ then $a_{ik} := \big(a_{ik} - \sum_{j=1}^{k-1} a_{ij}a_{kj}\big)/a_{kk}$
The sums in this algorithm should be taken only over those j for which the corresponding indexes (k, j) and (i, j) are in S.
The first thorough analysis of incomplete factorization techniques is given in Meijerink and
Van der Vorst [62]. In that paper a modified (i.e. incomplete) version of algorithm (7.35) is
considered:
Incomplete LU factorization.
For k = 1, ..., n-1:
  For j = k+1, ..., n: If $(k, j) \notin S$ then $a_{kj} := 0$      (*)
  For i = k+1, ..., n: If $(i, k) \notin S$ then $a_{ik} := 0$      (*)
  For i = k+1, ..., n:                                              (7.46)
    $\ell := a_{ik}/a_{kk}$ ;  $a_{ik} := \ell$ ;
    For j = k+1, ..., n:
      $a_{ij} := a_{ij} - \ell\, a_{kj}$ .
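A sketch of algorithm (7.46) in code, applied to a small 2D Poisson matrix with $S = G(A)$ (our own choice of test problem). The assertions check the conclusions (7.48) and (7.49) of theorem 7.5.1: L and U vanish outside S and the defect $R = LU - A$ vanishes on S.

```python
import numpy as np

def ilu_746(A, S):
    """Incomplete LU as in algorithm (7.46): before the k-th
    elimination step, entries of row/column k outside S are zeroed."""
    a = np.array(A, dtype=float)
    n = a.shape[0]
    for k in range(n - 1):
        for j in range(k + 1, n):
            if (k, j) not in S:
                a[k, j] = 0.0          # line (*)
            if (j, k) not in S:
                a[j, k] = 0.0          # line (*)
        for i in range(k + 1, n):
            ell = a[i, k] / a[k, k]
            a[i, k] = ell
            for j in range(k + 1, n):
                a[i, j] -= ell * a[k, j]
    return np.tril(a, -1) + np.eye(n), np.triu(a)

T = 2.0 * np.eye(3) - np.eye(3, k=1) - np.eye(3, k=-1)
A = np.kron(np.eye(3), T) + np.kron(T, np.eye(3))   # 2D Poisson, 3x3 grid
n = A.shape[0]
S = {(i, j) for i in range(n) for j in range(n) if A[i, j] != 0.0}
L, U = ilu_746(A, S)
R = L @ U - A
assert all(L[i, j] == 0.0 and U[i, j] == 0.0
           for i in range(n) for j in range(n)
           if (i, j) not in S and i != j)             # (7.48)
assert all(abs(R[i, j]) < 1e-12 for (i, j) in S)      # (7.49)
assert np.abs(R).max() > 1e-8   # genuine fill-in was dropped
```

For a tridiagonal matrix this factorization would be exact ($R = 0$); the 2D Poisson matrix produces fill outside $G(A)$, so here $R \ne 0$.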
Compared to algorithm (7.35) only the lines (*) have been added. In these lines certain entries in the kth row of U and in the kth column of L are set to zero, according to the condition (7.44). Algorithm (7.46) has a simple structure and is easy to analyze (cf. theorem 7.5.1). However, this algorithm is a rather inefficient implementation of incomplete factorization. Below we reformulate the algorithm, resulting in a significantly more efficient implementation, given in algorithm (7.56).
Theorem 7.5.1 Assume that algorithm (7.46) does not break down (i.e. $a_{kk} = 0$ is never true). Then this algorithm results in an incomplete factorization $A = LU - R$ with
L lower triangular, diag(L) = I, and U upper triangular,    (7.47)
$l_{ij} = u_{ij} = 0$ if $(i, j) \notin S$,    (7.48)
$r_{ij} = 0$ if $(i, j) \in S$.    (7.49)
Proof. The result in (7.47) holds because L (U) is lower (upper) triangular. By construction (cf. lines (*) in the algorithm) the result in (7.48) holds. It remains to prove the result in (7.49).
The standard basis vector with value 1 in the mth entry is denoted by $e_m$. By $v_m = (v_m^1, v_m^2, \dots, v_m^n)^T$ we denote a generic n-vector with $v_m^i = 0$ for $i \le m$. Note that a standard (complete) Gaussian elimination as in algorithm (7.35) can be represented in matrix formulation as:
$$A_1 = A$$
For k = 1, ..., n-1:                                        (7.50)
$$A_{k+1} = L_k A_k \,, \quad \text{with } L_k \text{ of the form } L_k = I + v_k e_k^T \,.$$
The matrices $A_{k+1}$ have the property $(A_{k+1})_{ij} = 0$ if $i > j$ and $j \le k$. Then $U := A_n = L_{n-1}L_{n-2}\cdots L_1 A$ holds. Using $L_k^{-1} = I - v_k e_k^T$ we obtain the LU factorization
$$A = L_1^{-1}L_2^{-1}\cdots L_{n-1}^{-1}U = \Big(I - \sum_{k=1}^{n-1} v_k e_k^T\Big)U =: LU \,. \qquad (7.51)$$
The kth stage of algorithm (7.46) consists of two parts. First the kth row and kth column are modified by setting certain entries to zero (lines (*) in (7.46)), and then a standard Gaussian elimination step as in (7.50) is applied to the modified matrix. In matrix formulation this yields:
$$A_1 = A \qquad (7.52a)$$
For k = 1, ..., n-1:
$$\tilde A_k = A_k + R_k \,, \quad \text{with} \qquad (7.52b)$$
$$R_k \text{ of the form } R_k = v_k e_k^T + e_k v_k^T \,, \text{ and} \qquad (7.52c)$$
$$(R_k)_{ij} = 0 \ \text{ for all } (i, j) \in S \,, \qquad (7.52d)$$
$$A_{k+1} = L_k \tilde A_k \,, \quad \text{with} \qquad (7.52e)$$
$$L_k \text{ of the form } L_k = I + v_k e_k^T \,. \qquad (7.52f)$$
Again, the matrix $A_{k+1}$ has the property $(A_{k+1})_{ij} = 0$ if $i > j$ and $j \le k$. The three vectors $v_k$ that occur in (7.52c) and (7.52f) may all be different. From the form of $R_k$ and $L_m$ (cf. (7.52c), (7.52f)) we obtain
$$L_m R_k = R_k \quad \text{for } m < k \,. \qquad (7.53)$$
Now note that for the resulting upper triangular matrix $U := A_n$ we get, using (7.53) and the notation $L := (L_{n-1}L_{n-2}\cdots L_1)^{-1}$:
$$LU = A + \sum_{j=1}^{n-1} R_j \,.$$
Hence $A = LU - R$ with $R := \sum_{j=1}^{n-1} R_j$, and the result in (7.49) follows from (7.52d).
We can use the results in theorem 7.5.1 to derive a much more efficient implementation of algorithm (7.46). Using the condition in (7.44) (or (7.48)) for the incomplete LU factorization, we obtain for $L = (l_{ij})_{1 \le i,j \le n}$, $U = (u_{ij})_{1 \le i,j \le n}$:
$$l_{ij} = u_{ij} = 0 \quad \text{for } (i, j) \notin S \,. \qquad (7.54)$$
By $\sharp S$ we denote the number of elements in the sparsity pattern S. After using (7.54) there are still $\sharp S$ entries in L and U which have to be determined. From (7.49) we deduce that
$$(LU)_{ij} = a_{ij} \quad \text{for all } (i, j) \in S \,. \qquad (7.55)$$
This yields $\sharp S$ (nonlinear) equations for these unknown entries of L and U. We now follow the line of reasoning as in (7.36)-(7.41) for the complete LU factorization. From (7.55) we obtain (cf. (7.36)):
$$a_{ik} = \sum_{j=1}^{\min(i,k)} l_{ij}u_{jk} \quad \text{if } 1 \le i, k \le n \text{ and } (i, k) \in S \,.$$
This yields explicit formulas for $l_{ik}$ and $u_{ik}$ (cf. (7.37)):
$$l_{ik} = \Big(a_{ik} - \sum_{j=1}^{k-1} l_{ij}u_{jk}\Big)\Big/u_{kk} \quad \text{if } 1 \le k < i \le n \text{ and } (i, k) \in S \,,$$
$$u_{ik} = a_{ik} - \sum_{j=1}^{i-1} l_{ij}u_{jk} \quad \text{if } 1 \le i \le k \le n \text{ and } (i, k) \in S \,.$$
Thus we can compute L and U row by row. We take i fixed and use the notation as in (7.39). This yields (cf. (7.40)):
$$u_{ik} = a_{ik}^{(i)} \quad \text{if } k \ge i \text{ and } (i, k) \in S \,,$$
$$l_{ik} = a_{ik}^{(k)}/u_{kk} \quad \text{if } k < i \text{ and } (i, k) \in S \,,$$
$$a_{ik}^{(m+1)} = a_{ik}^{(m)} - l_{im}u_{mk} \,.$$
Using these formulas the entries $l_{ik}$ and $u_{ik}$ can be computed as follows. Note that for k = 1, ..., n, $u_{1k} = a_{1k}$ if $(1, k) \in S$ and $u_{1k} = 0$ otherwise. Assume that the rows 1, ..., i-1 of L and U have been computed; then $l_{ik}$ and $u_{ik}$ are determined by
For k = 1, ..., i-1:
  If $(i, k) \in S$ then $l_{ik} = a_{ik}^{(k)}/u_{kk}$
  For j = k+1, ..., n:
    $a_{ij}^{(k+1)} = a_{ij}^{(k)} - l_{ik}u_{kj}$      (*)
For k = i, ..., n:
  If $(i, k) \in S$ then $u_{ik} = a_{ik}^{(i)}$
In the computation of $l_{ik}$ and $u_{ik}$ we use $a_{ik}^{(m)}$ only for $(i, k) \in S$. Hence the update in line (*) is needed only if $(i, j) \in S$. Again we use a formulation in which we overwrite the matrix A, and we obtain the following incomplete version of algorithm (7.41):
For i = 2, ..., n:
  For k = 1, ..., i-1:
    If $(i, k) \in S$ then                               (7.56)
      $a_{ik} := a_{ik}/a_{kk}$
      For j = k+1, ..., n:
        If $(i, j) \in S$ then $a_{ij} := a_{ij} - a_{ik}a_{kj}$
Remark 7.5.2 The results in theorem 7.5.1 and the construction (7.54)-(7.56) following that theorem show that if algorithm (7.46) does not break down, then algorithm (7.56) and algorithm (7.46) are equivalent, in the sense that these two algorithms yield the same incomplete LU factorization. Moreover, the derivation in (7.54)-(7.56) implies that if an incomplete LU factorization exists which satisfies (7.47)-(7.49), then these L (with diag(L) = I) and U are unique.
Note that the implementation in (7.56) is more efficient than the implementation in (7.46). In the latter algorithm it may well happen that certain assignments $a_{ij} := a_{ij} - \ell\, a_{kj}$ in the j-loop are superfluous, since for a higher value of k (in the k-loop) these previously computed values are set to zero.
As stated in theorem 7.5.1 and remark 7.5.2, a unique incomplete LU factorization which satisfies (7.47)-(7.49) exists if algorithm (7.46) does not break down. The following result is proved in [62], theorem 2.3.
Theorem 7.5.3 If A is an M-matrix then algorithm (7.46) does not break down. If A is in addition symmetric positive definite, and the pattern S is symmetric, algorithm (7.45) does not break down.
If A is an M-matrix, then the unique incomplete LU factorization can be computed using, for example, algorithm (7.46) or algorithm (7.56). If A is in addition symmetric positive definite, and the pattern S is symmetric, then we can use algorithm (7.45), too. With respect to the stability of an incomplete LU factorization we refer to the result proved in Meijerink and Van der Vorst [62], theorem 3.2.
We can use the incomplete LU decomposition to construct a basic iterative method. Such a method is obtained by taking $M := LU$, with L and U as in theorem 7.5.1, and applying the iteration $x^{k+1} = x^k - M^{-1}(Ax^k - b)$. For this method we have to compute an incomplete LU factorization of the given matrix A, and per iteration we have to solve a system of the form $LUx = y$. The latter can be done with low computational costs by a forward and backward substitution process. The iteration matrix of this method is given by $I - (LU)^{-1}A$. With respect to the convergence of this iterative method we note that for an M-matrix A the splitting $A = LU - R$ is a regular splitting, so that the method converges (cf. [62]).
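The full loop described above (incomplete factorization once, then forward/backward substitution per iteration) can be sketched as follows. The ILU routine below restricts elimination to the nonzero pattern of A, a variant in the spirit of algorithms (7.46)/(7.56) with $S = G(A)$; the test matrix is our own small 2D Poisson example.

```python
import numpy as np

def forward(L, y):
    # solve L z = y with L unit lower triangular
    z = np.zeros_like(y)
    for i in range(len(y)):
        z[i] = y[i] - L[i, :i] @ z[:i]
    return z

def backward(U, y):
    # solve U x = y with U upper triangular
    x = np.zeros_like(y)
    for i in range(len(y) - 1, -1, -1):
        x[i] = (y[i] - U[i, i+1:] @ x[i+1:]) / U[i, i]
    return x

def ilu0(A):
    # pattern-restricted elimination with S = G(A)
    a = A.copy()
    n = a.shape[0]
    nz = A != 0.0
    for k in range(n - 1):
        for i in range(k + 1, n):
            if not nz[i, k]:
                continue
            a[i, k] /= a[k, k]
            for j in range(k + 1, n):
                if nz[i, j]:
                    a[i, j] -= a[i, k] * a[k, j]
    return np.tril(a, -1) + np.eye(n), np.triu(a)

# basic iteration x^{k+1} = x^k - (LU)^{-1}(A x^k - b) with M = LU
T = 2.0 * np.eye(3) - np.eye(3, k=1) - np.eye(3, k=-1)
A = np.kron(np.eye(3), T) + np.kron(T, np.eye(3))   # 2D Poisson, 3x3 grid
b = np.ones(9)
L, U = ilu0(A)
x = np.zeros(9)
for _ in range(60):
    x = x - backward(U, forward(L, A @ x - b))
assert np.linalg.norm(A @ x - b) < 1e-10 * np.linalg.norm(b)
```

Each iteration costs one matrix-vector product plus two triangular solves, both proportional to the number of nonzeros kept.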
In the modified incomplete Cholesky factorization the entries that are discarded in a row are instead added to the diagonal. Moving these entries to the corresponding diagonal elements does not cause any additional fill-in. The algorithm is obtained from (7.45) by adding, in each row, the discarded entries to the diagonal entry.
Again the sums in this algorithm should only be taken over those j for which the corresponding indexes are in S. One can prove that this algorithm, if it does not break down, yields an incomplete factorization
$$A = \tilde L\tilde L^T + R$$
with
$$\tilde l_{ij} = 0 \quad \text{if } (i, j) \notin S \,,$$
$$(\tilde L\tilde L^T)_{ij} = a_{ij} \quad \text{if } (i, j) \in S,\ i \ne j \,,$$
$$\sum_{j=1}^{n} (\tilde L\tilde L^T)_{ij} = \sum_{j=1}^{n} a_{ij} \quad \text{for all } i \,.$$
It is known that for certain problems this lumping strategy (moving in each row certain
entries to the diagonal) can improve the quality of the incomplete Cholesky preconditioner
significantly. This is illustrated in numerical experiments for the Poisson equation in section 7.7.
Let B be a given regular $n \times n$ matrix. Consider the following transformation of the original problem given in (7.1):
$$\tilde A\tilde x = \tilde b \,, \qquad \tilde A := B^{-1}AB^{-T} \,, \quad \tilde x := B^T x \,, \quad \tilde b := B^{-1}b \,. \qquad (7.58)$$
The matrix $\tilde A$ is symmetric positive definite, so we can apply the CG method from (7.18) to the system in (7.58). This results in the following algorithm (which is not used in practice, because in general the computation of $\tilde A = B^{-1}AB^{-T}$ will be too expensive):
$\tilde x^0$ a given starting vector;  $\tilde r^0 = \tilde A\tilde x^0 - \tilde b$
for $k \ge 0$ (if $\tilde r^k \ne 0$):
  $\tilde p^k = -\tilde r^k + \dfrac{\langle \tilde r^k, \tilde r^k\rangle}{\langle \tilde r^{k-1}, \tilde r^{k-1}\rangle}\,\tilde p^{k-1}$  (if $k = 0$: $\tilde p^0 := -\tilde r^0$)      (7.59)
  $\tilde x^{k+1} = \tilde x^k + \alpha_{\mathrm{opt}}(\tilde x^k, \tilde p^k)\,\tilde p^k$  with  $\alpha_{\mathrm{opt}}(\tilde x^k, \tilde p^k) = \dfrac{\langle \tilde r^k, \tilde r^k\rangle}{\langle \tilde p^k, \tilde A\tilde p^k\rangle}$
  $\tilde r^{k+1} = \tilde r^k + \alpha_{\mathrm{opt}}(\tilde x^k, \tilde p^k)\,\tilde A\tilde p^k$ .
This algorithm yields approximations $\tilde x^k$ for the solution $\tilde x^* = B^T x^*$ of the transformed system. To obtain an algorithm in the original variables we introduce the notation
$$p^k := B^{-T}\tilde p^k \,, \quad x^k := B^{-T}\tilde x^k \,, \quad r^k := B\tilde r^k \,, \qquad (7.60)$$
$$z^k := B^{-T}\tilde r^k = B^{-T}B^{-1}r^k = W^{-1}r^k \,, \quad \text{with } W := BB^T \,. \qquad (7.61)$$
Rewriting (7.59) in these variables results in the following algorithm:
$x^0$ a given starting vector;  $r^0 = Ax^0 - b$;  $z^0 = W^{-1}r^0$
for $k \ge 0$ (if $r^k \ne 0$):
  $p^k = -z^k + \dfrac{\langle z^k, r^k\rangle}{\langle z^{k-1}, r^{k-1}\rangle}\,p^{k-1}$  (if $k = 0$: $p^0 := -z^0$)      (7.62)
  $x^{k+1} = x^k + \alpha_k p^k$  with  $\alpha_k = \dfrac{\langle z^k, r^k\rangle}{\langle p^k, Ap^k\rangle}$
  $r^{k+1} = r^k + \alpha_k Ap^k$ ;  $z^{k+1} = W^{-1}r^{k+1}$ .
This algorithm, which yields approximations $x^k$ for the solution $x^*$ of the original system, is called the Preconditioned Conjugate Gradient method (PCG) with preconditioner W. For W = I we obtain the algorithm as in (7.18).
Note that in the algorithm in (7.62) the matrix B is involved only through $W = BB^T$. Hence this algorithm is applicable whenever a symmetric positive definite matrix W is available. Such a matrix has a corresponding Cholesky decomposition $W = BB^T$. This decomposition, however, plays a role only in the theoretical derivation of the method (cf. (7.58)) and is not needed in the algorithm.
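A compact PCG sketch consistent with the transformation (7.60)-(7.61): only a routine that applies $W^{-1}$ is needed, never B itself. The Jacobi choice $W = \mathrm{diag}(A)$ and the 1D test matrix are illustrative assumptions of ours.

```python
import numpy as np

def pcg(A, b, w_solve, x0, tol=1e-12, maxit=500):
    """Preconditioned CG; residual convention r^k = A x^k - b,
    z^k = W^{-1} r^k (w_solve applies W^{-1})."""
    x = x0.copy()
    r = A @ x - b
    z = w_solve(r)
    p = -z
    rz = r @ z
    for _ in range(maxit):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x = x + alpha * p
        r = r + alpha * Ap
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        z = w_solve(r)
        rz_new = r @ z
        p = -z + (rz_new / rz) * p
        rz = rz_new
    return x

n = 50
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
d = np.diag(A)
x = pcg(A, b, lambda r: r / d, np.zeros(n))   # W = diag(A), Jacobi
assert np.linalg.norm(A @ x - b) <= 1e-10 * np.linalg.norm(b)
```

Swapping the `w_solve` callback for an SSOR sweep or the triangular solves of an incomplete Cholesky factorization yields the SSOR-PCG, ICCG and MICCG variants discussed below.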
Using the identity $\|\tilde x^k - \tilde x^*\|_{\tilde A} = \|x^k - x^*\|_A$ we obtain that the error reduction in (7.62), measured in $\|\cdot\|_A$, is the same as the error reduction in (7.59), measured in $\|\cdot\|_{\tilde A}$. Based on the result in (7.22) we have that the rate of convergence of the algorithm in (7.59), and thus of the PCG algorithm, too, is determined by the factor
$$\frac{\sqrt{\kappa(W^{-1}A)} - 1}{\sqrt{\kappa(W^{-1}A)} + 1} \,, \qquad \text{since } \kappa(\tilde A) = \kappa(B^{-1}AB^{-T}) = \kappa(W^{-1}A) \,.$$
So for a significant increase of the rate of convergence due to preconditioning we should have a preconditioner W with $\kappa(W^{-1}A) \ll \kappa(A)$. In the PCG algorithm in (7.62) we have to solve a system with matrix W in every iteration. So the matrix W should be such that the solution of this system can be computed with low computational costs (not much more than one matrix-vector multiplication). Note that these requirements for the preconditioner are as in section 7.3.
For the PCG method we need a symmetric positive definite preconditioner W. In section 7.3 we discussed the symmetric positive definite preconditioners $W = M_{SSOR}$ (SSOR preconditioning), $W = LL^T$ (incomplete Cholesky) and $W = \tilde L\tilde L^T$ (modified incomplete Cholesky). In the example below we apply the PCG method with these three preconditioners to the discrete Poisson equation.
Example 7.7.1 (Poisson model problem) We consider the discrete Poisson equation as in section 6.6. We apply the PCG method with the SSOR preconditioner. For the parameter $\omega$ in the SSOR preconditioning we use the value $\omega_{opt}$ as in theorem 6.4.3, i.e. $\omega$ is such that the spectral radius of the iteration matrix of the SOR method is minimal. In Table 7.2 we show results that can be compared with the results in Table 7.1. We measure the error reduction in the Euclidean norm $\|\cdot\|_2$. By # we denote the number of iterations needed to reduce the norm of the starting error by a factor $R = 10^{-3}$.
In Axelsson and Barker [7] it is shown that for this model problem the SSOR preconditioner, with an appropriate value for the parameter $\omega$, results in $\kappa(W^{-1}A) \le c h^{-1}$, and thus (cf. (7.22)) we expect an error reduction per iteration (measured in the A-norm) with at least a factor $1 - c\sqrt{h}$. If this is the case, then for this problem the PCG method has a complexity $O(n^{5/4})$. The results in Table 7.2 are consistent with such a reduction factor of the form $1 - c\sqrt{h}$. Apparently, the choice $\omega = \omega_{opt}$, as explained above, is appropriate. Related to this we note that in Axelsson and Barker [7] it is shown that often the rate of convergence of the PCG method with SSOR preconditioning is not very sensitive with respect to perturbations in the value of the parameter $\omega$. This phenomenon is illustrated in Table 7.3, where we show the results of PCG with SSOR preconditioning for $h = \frac{1}{160}$ and for several values of $\omega$.
In Table 7.4 we show the results obtained with the incomplete Cholesky preconditioner, i.e. $W = LL^T$, and with the modified incomplete Cholesky preconditioner, i.e. $W = \tilde L\tilde L^T$ (cf. Section 7.5). In both cases, for the sparsity pattern we used $S = G(A)$. In the literature these algorithms are denoted by ICCG and MICCG, respectively. The results for ICCG indicate that for the preconditioned system we have $\kappa(W^{-1}A) \approx c h^{-2}$, where the constant c is better than for the unpreconditioned system with W = I. The results for MICCG indicate that for the preconditioned system we have $\kappa(W^{-1}A) \approx c h^{-1}$, which is comparable to the result with SSOR preconditioning.
Table 7.4: Number of iterations # for ICCG and MICCG.
    h          1/40    1/80    1/160    1/320
    ICCG, #     20      40      79       157
    MICCG, #     8      11      14        20
Remark 7.7.2 (in preparation) IC preconditioning is often more robust than MIC preconditioning, for example for problems with discontinuous coefficients.
Chapter 8
8.1 Introduction
In Section 7.2 the CG method for solving Ax = b has been derived as a minimization method
for the functional
$$F(x) = \frac{1}{2}\langle x, Ax\rangle - \langle x, b\rangle \,.$$
If A is symmetric positive definite then this F is a quadratic functional with a unique minimizer
and minimization of F is equivalent to solving Ax = b.
If A is not symmetric positive definite then the nice minimization properties of CG do not hold and it is not clear whether CG is still useful. If A is not symmetric positive definite we can still try the CG algorithm. In practice we often observe that for nonsymmetric problems in which the symmetric part (i.e. $\frac12(A + A^T)$) is positive definite, the CG algorithm is still a fairly efficient solver if the skew-symmetric part (i.e. $\frac12(A - A^T)$) is small compared to the symmetric part. In other words, the CG algorithm can be used for solving nonsymmetric problems in which the nonsymmetric part is a perturbation of a symmetric positive definite part. In problems with moderate nonsymmetry ($\|A - A^T\| \sim \|A + A^T\|$) or with strong nonsymmetry ($\|A - A^T\| \gg \|A + A^T\|$) the CG algorithm generally diverges. For such nonsymmetric problems other Krylov subspace methods have been developed.
Example 8.1.1 We consider the discrete convection-diffusion problem as in section 6.6 with $b_1 = \cos(\pi/6)$, $b_2 = \sin(\pi/6)$ and $h = 1/160$. We take $x^0 = 0$ and an error reduction factor $R = 10^{-3}$. The CG algorithm is applied to this problem for different values of the parameter $\varepsilon$. The results are shown in Table 8.1.
Note that for large values of $\varepsilon$ the problem is nearly symmetric and the convergence behaviour of the CG method is reasonable. For smaller values of $\varepsilon$ the nonsymmetry of the problem is increasing and the CG method fails.
In section 8.2 below we show that, for A symmetric positive definite, the CG method can be
seen as a projection method. Using this point of view we can develop variants of the CG method
which can be used for problems in which A is symmetric but indefinite or A is nonsymmetric. In recent years many such variants have been introduced. For an overview of these methods we refer to Saad [78], Freund et al. [36], Greenbaum [42], Sleijpen and Van der Vorst [85]. We will discuss a few important methods and explain the main approaches in this field of nonsymmetric Krylov subspace methods.
In the remainder of this chapter the Krylov subspace $K_k(A; r^0)$, with $r^0 = Ax^0 - b$ the starting residual, will play an important role. To avoid certain technical details, we make the following assumption concerning the starting vector $x^0$:
Assumption 8.2.1 In the remainder of this chapter we assume that $x^0$ is chosen such that $\dim(K_k(A; r^0)) = k$ for $k = 1, 2, \dots, n$.
We note that in the generic case this assumption is fulfilled. Only for special choices of $x^0$ one has $\dim(K_k(A; r^0)) < k$ for some $k$. We emphasize that the formulations of the algorithms which are discussed in the remainder of this chapter do not depend on this assumption.
We first reconsider the CG method applied to the problem Ax = b with A symmetric positive definite. Using the results of theorem 7.2.1 we obtain that
$$x^k \in x^0 + K_k(A; r^0)$$
holds and
$$A(x^k - x^*) = Ax^k - b = r^k \perp K_k(A; r^0) \,,$$
or, equivalently,
$$\langle x^k - x^*, z\rangle_A = 0 \quad \text{for all } z \in K_k(A; r^0) \,.$$
We conclude that $x^k - x^0$ is the A-orthogonal projection (i.e. with respect to the A-inner product $\langle\cdot,\cdot\rangle_A$) of the starting error $x^* - x^0$ on $K_k(A; r^0)$. This is illustrated in figure 8.1.
[Figure 8.1: $x^k - x^0$ is the A-orthogonal projection of $x^* - x^0$ on $K_k(A; r^0)$; the right angle is w.r.t. the inner product $\langle\cdot,\cdot\rangle_A$.]
From the observations above it follows that the CG iterate $x^k$ can be characterized as the unique solution of the following problem: determine $x^k \in x^0 + K_k(A; r^0)$ such that
$$\|x^k - x^*\|_A = \min_{x \in x^0 + K_k(A; r^0)} \|x - x^*\|_A \,, \qquad (8.2)$$
or, equivalently, such that
$$\langle Ax^k - b, z\rangle = 0 \quad \text{for all } z \in K_k(A; r^0) \,. \qquad (8.3)$$
We will now derive an algorithm ((8.16) below), different from the CG algorithm, that can be used to solve this problem. For this algorithm and the CG algorithm the computational costs per iteration are comparable and, in exact arithmetic, the two algorithms yield the same iterands. The ideas underlying this alternative algorithm will play an important role in the derivation of algorithms for the case that A is not symmetric positive definite.
We start with a simple method for computing an orthogonal basis of the Krylov subspace, the so-called Lanczos method:
$q^0 := 0$;  $q^1 := r^0/\|r^0\|$;  $\beta_0 := 0$;
for $j \ge 1$:
  $\tilde q^{j+1} := Aq^j - \beta_{j-1}q^{j-1}$,
  $\alpha_j := \langle \tilde q^{j+1}, q^j\rangle$,                       (8.4)
  $\hat q^{j+1} := \tilde q^{j+1} - \alpha_j q^j$,  $\beta_j := \|\hat q^{j+1}\|$,
  $q^{j+1} := \hat q^{j+1}/\beta_j$.
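A sketch of the recurrence (8.4) in code, with numerical checks that the generated columns are orthonormal and that $Q_k^T A Q_k$ is tridiagonal. The test matrix is a random, well-conditioned symmetric positive definite matrix of our own choosing.

```python
import numpy as np

def lanczos(A, r0, k):
    """Lanczos recurrence (8.4): returns Q with k+1 orthonormal
    columns and the tridiagonal coefficients alpha, beta."""
    n = len(r0)
    Q = np.zeros((n, k + 1))
    alpha = np.zeros(k)
    beta = np.zeros(k)
    Q[:, 0] = r0 / np.linalg.norm(r0)
    q_prev = np.zeros(n)
    b_prev = 0.0
    for j in range(k):
        w = A @ Q[:, j] - b_prev * q_prev        # A q^j - beta_{j-1} q^{j-1}
        alpha[j] = w @ Q[:, j]
        w = w - alpha[j] * Q[:, j]
        beta[j] = np.linalg.norm(w)
        Q[:, j + 1] = w / beta[j]
        q_prev = Q[:, j]
        b_prev = beta[j]
    return Q, alpha, beta

rng = np.random.default_rng(1)
n, k = 40, 8
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)          # symmetric positive definite
r0 = rng.standard_normal(n)
Q, alpha, beta = lanczos(A, r0, k)
assert np.allclose(Q.T @ Q, np.eye(k + 1), atol=1e-8)   # orthonormality
T = Q[:, :k].T @ A @ Q[:, :k]
T_expected = np.diag(alpha) + np.diag(beta[:k-1], 1) + np.diag(beta[:k-1], -1)
assert np.allclose(T, T_expected, atol=1e-8)            # tridiagonal T_k
```

In floating point arithmetic the orthogonality slowly deteriorates for large k; for the small, well-conditioned example above it is preserved to high accuracy.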
and thus
$$AQ_k = Q_k \begin{pmatrix} \alpha_1 & \beta_1 & & \\ \beta_1 & \alpha_2 & \ddots & \\ & \ddots & \ddots & \beta_{k-1} \\ & & \beta_{k-1} & \alpha_k \end{pmatrix} + \beta_k q^{k+1}(0, 0, \dots, 0, 1)$$
holds. Due to the orthogonality of the basis we have
$$Q_k^T A Q_k = T_k \,, \qquad (8.6)$$
with $T_k$ the symmetric tridiagonal matrix above.
For solving the problem (8.3) we have to compute $x^k \in x^0 + K_k(A; r^0)$ which satisfies the orthogonality property
$$\langle Ax^k - b, z\rangle = 0 \quad \text{for all } z \in K_k(A; r^0) \,.$$
This yields the condition
$$Q_k^T(Ax^k - b) = 0 \,. \qquad (8.8)$$
Note that $q^1 = r^0/\|r^0\|$ and $Q_k^T r^0 = \|r^0\|(1, 0, \dots, 0)^T =: \|r^0\|e_1$. Since the vector $x^k - x^0$ must be an element of $K_k(A; r^0)$ it can be represented using the basis $q^1, q^2, \dots, q^k$, i.e. there exists a $y^k \in \mathbb{R}^k$ such that $x^k - x^0 = Q_k y^k$. Using this, the condition (8.8) can be formulated as
$$T_k y^k = -\|r^0\|e_1 \,. \qquad (8.9)$$
With the results in (8.6) and (8.9) we obtain that the solution of the problem (8.3) (or (8.2)) is given by
$$T_k y^k = -\|r^0\|e_1 \,, \qquad (8.10a)$$
$$x^k = x^0 + Q_k y^k \,. \qquad (8.10b)$$
Note that $T_k = Q_k^T A Q_k$ is a symmetric positive definite tridiagonal $k \times k$ matrix. So the vector $x^k$ can be obtained by first solving the tridiagonal system in (8.10a) and then computing $x^k$ as in (8.10b). This, however, would result in an algorithm with high computational costs per iteration. We now show that based on (8.10a), (8.10b) an algorithm can be derived in which the iterand $x^k$ can be updated from the previous iterand $x^{k-1}$ in a simple and cheap way (as in the CG algorithm). To derive this algorithm we represent $T_k$ using its LU factorization (which exists, because $T_k$ is symmetric positive definite):
$$T_k = L_k U_k = \begin{pmatrix} 1 & & & \\ l_2 & 1 & & \\ & \ddots & \ddots & \\ & & l_k & 1 \end{pmatrix} \begin{pmatrix} u_1 & \beta_1 & & \\ & u_2 & \beta_2 & \\ & & \ddots & \ddots \\ & & & u_k \end{pmatrix} \,,$$
for $k = 1, 2, \dots$, where the $\beta_i$ are the same as in the matrix $T_k$. We also introduce the notation $P_k = [p^1\ p^2\ \dots\ p^k] := Q_k U_k^{-1}$, $z^k := -L_k^{-1}\|r^0\|e_1$. From (8.10a) and (8.10b) we then obtain
$$x^k = x^0 - Q_k T_k^{-1}\|r^0\|e_1 = x^0 - Q_k U_k^{-1}L_k^{-1}\|r^0\|e_1 = x^0 + P_k z^k \,. \qquad (8.11)$$
From the kth column in the identity $P_k U_k = Q_k$ one obtains $p^{k-1}\beta_{k-1} + p^k u_k = q^k$ and thus the simple update formula
$$p^k = \frac{1}{u_k}\big(q^k - \beta_{k-1}p^{k-1}\big) \,. \qquad (8.12)$$
From the last row in the identity $T_k = L_k U_k$ we obtain $l_k u_{k-1} = \beta_{k-1}$ and $l_k\beta_{k-1} + u_k = \alpha_k$, i.e.
$$l_k = \frac{\beta_{k-1}}{u_{k-1}} \,, \qquad u_k = \alpha_k - l_k\beta_{k-1} \,. \qquad (8.13)$$
If we represent $z^k$ as $z^k = \begin{pmatrix} z^{k-1} \\ \zeta_k \end{pmatrix}$ with $\zeta_k \in \mathbb{R}$ ($k = 1, 2, \dots$), it follows from (8.11) that
$$x^k = x^0 + P_k z^k = x^{k-1} + \zeta_k p^k \,. \qquad (8.14)$$
Finally, from the last equation in $L_k z^k = -\|r^0\|e_1$ it follows that $l_k\zeta_{k-1} + \zeta_k = 0$ and thus
$$\zeta_k = -l_k\zeta_{k-1} \,. \qquad (8.15)$$
In (8.12), (8.13), (8.14) and (8.15) we have recursion formulas which allow a simple update $k - 1 \to k$. Combining these formulas with the Lanczos algorithm (8.4) for computing $q^k$ results in the following Lanczos iterative solution method:
in the following Lanczos iterative solution method:
q0 := 0; q1 := r0 /kr0 k; 0 = p0 = l1 = 0; 1 = kr0 k;
for j 1 :
qj+1 := Aqj j1 qj1 , j := hqj+1 , qj i
if j > 1 then lj = uj1 and j = lj j1 ,
j1
uj = j lj j1 , (8.16)
pj = u1j (qj j1 pj1 ),
xj = xj1 + j pj ,
qj+1 := qj+1 j qj , j := kqj+1 k,
qj+1 := qj+1 / .
j
This algorithm for computing the solution $x^k$ of (8.2) (or (8.3)) has about the same computational costs as the CG algorithm presented in section 7.2.
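The recursions of method (8.16) can be sketched in a few lines. The sign conventions below follow the residual definition $r^0 = Ax^0 - b$ used in this chapter; the random, well-conditioned test matrix is our own assumption, and the result is checked against a direct solve.

```python
import numpy as np

def lanczos_solve(A, b, x0, kmax):
    """Lanczos iterative solution method in the spirit of (8.16);
    in exact arithmetic it yields the same iterands as CG."""
    x = x0.copy()
    r0 = A @ x0 - b
    q_old = np.zeros_like(b)
    q = r0 / np.linalg.norm(r0)
    beta_old = 0.0
    p = np.zeros_like(b)
    u = 0.0
    zeta = -np.linalg.norm(r0)          # zeta_1 = -||r^0||
    for j in range(1, kmax + 1):
        w = A @ q - beta_old * q_old    # A q^j - beta_{j-1} q^{j-1}
        alpha = w @ q
        if j > 1:
            l = beta_old / u            # l_j = beta_{j-1}/u_{j-1}
            zeta = -l * zeta            # zeta_j = -l_j zeta_{j-1}
        else:
            l = 0.0
        u = alpha - l * beta_old        # u_j = alpha_j - l_j beta_{j-1}
        p = (q - beta_old * p) / u      # p^j update (8.12)
        x = x + zeta * p                # x^j update (8.14)
        w = w - alpha * q
        beta = np.linalg.norm(w)
        q_old, q = q, w / beta
        beta_old = beta
    return x

rng = np.random.default_rng(2)
n = 30
Qm = np.linalg.qr(rng.standard_normal((n, n)))[0]
A = Qm @ np.diag(rng.uniform(1.0, 4.0, n)) @ Qm.T   # SPD, kappa <= 4
b = rng.standard_normal(n)
x = lanczos_solve(A, b, np.zeros(n), 20)
x_star = np.linalg.solve(A, b)
assert np.linalg.norm(x - x_star) < 1e-6 * np.linalg.norm(x_star)
```

With $\kappa(A) \le 4$ the bound (7.22) predicts an error reduction of roughly $(1/3)^{20}$ after 20 iterations, which the assertion confirms comfortably.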
In the derivation of the Lanczos iterative solution method (8.16) the following ingredients are important: the Lanczos method yields an orthogonal basis of the Krylov subspace by means of cheap (three-term) recurrences, and the projected system (8.10a) has a tridiagonal matrix $T_k$ whose LU factorization can be updated cheaply from one iteration to the next. In the derivation of the projected system (8.10a) the fact that we have an orthogonal basis plays a crucial role.
The approach discussed above is a starting point for the development of methods which can be used in cases where A is not symmetric positive definite. In generalizing this approach to systems in which A is not symmetric positive definite one encounters the following two major difficulties:
if A is not symmetric positive definite, then $\langle\cdot,\cdot\rangle_A$ is not an inner product and the error minimization problem (8.2) is no longer well-defined,    (8.20)
and
if A is not symmetric, then the short recurrences of the Lanczos method (8.4) no longer yield an orthogonal basis of the Krylov subspace.    (8.21)
In section 8.3 we consider the case that the matrix A is not positive definite, but still symmetric (i.e. symmetric indefinite). Then we can still use the Lanczos method to compute, in a cheap way, an orthogonal basis of the Krylov subspace. To deal with the problem formulated in (8.20) one can replace the error minimization in the A-norm in (8.2) by a residual minimization in the Euclidean norm, i.e. minimize $\|Ax - b\|$ over the space $x^0 + K_k(A; r^0)$. For every nonsingular matrix A this residual minimization problem has a unique solution. Furthermore, as will be shown in section 8.3, this residual minimization problem can be solved with low computational costs if an orthogonal basis of the Krylov subspace is available. A well-known method for solving symmetric indefinite problems, which is based on using the Lanczos method (8.4) for computing the solution of the residual minimization problem, is the MINRES method.
In section 8.4 and section 8.5 we assume that the matrix A is not even symmetric. Then both the problem formulated in (8.20) and the problem formulated in (8.21) arise. We can deal with the problem in (8.20) as in the MINRES method, i.e. we can use residual minimization in the Euclidean norm instead of error minimization in the A-norm. It will turn out that, just as for the symmetric indefinite case, this residual minimization problem can be solved with low costs if an orthogonal basis of the Krylov subspace is available. However, due to the nonsymmetry (cf. (8.21)), for computing such an orthogonal basis we now have to use a method which is computationally much more expensive than the Lanczos method. An important method which is based on the idea of computing an orthogonal basis of the Krylov subspace and using this basis to solve the residual minimization problem is the GMRES method. We discuss this method in section 8.4.
Another important class of methods for solving nonsymmetric problems is treated in section 8.5.
In these methods one does not compute the solution of an error or residual minimization prob
lem (as is done in CG, MINRES, GMRES). Instead one tries to determine xk x0 + Kk (A; r0 )
which satisfies an orthogonality condition similar to the one in (8.3). It turns out that using
this approach one can avoid the expensive computation of an orthogonal basis of the Krylov
subspace. The main example from this class is the BiCG method. The BiCG method has lead
to many variants. A few popular variants are considered in section 8.5, too.
and thus

    A Q_k = Q_{k+1} \begin{pmatrix}
        α_1 & β_1    &        &         \\
        β_1 & α_2    & \ddots &         \\
            & \ddots & \ddots & β_{k-1} \\
            &        & β_{k-1} & α_k    \\
            &        &         & β_k
    \end{pmatrix} =: Q_{k+1} T̃_k .                                  (8.23)

Note that T̃_k is a (k+1) × k matrix. Due to the orthogonality of the basis we have Q_k^T A Q_k = T_k, where T_k ∈ R^{k×k} consists of the first k rows of T̃_k.
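The relation (8.23) is easy to verify numerically. Below is a sketch (numpy assumed; the helper name `lanczos` is ours, not from the text) of the three-term recursion together with a check of A Q_k = Q_{k+1} T̃_k and of the orthonormality of the basis:

```python
import numpy as np

def lanczos(A, r0, k):
    """Three-term Lanczos recursion for symmetric A: returns an orthonormal
    basis Q = (q_1 ... q_{k+1}) of K_{k+1}(A; r0) and the (k+1) x k
    tridiagonal matrix T_tilde of (8.23)."""
    n = A.shape[0]
    Q = np.zeros((n, k + 1))
    T = np.zeros((k + 1, k))
    Q[:, 0] = r0 / np.linalg.norm(r0)
    q_prev, beta_prev = np.zeros(n), 0.0
    for j in range(k):
        t = A @ Q[:, j] - beta_prev * q_prev
        alpha = t @ Q[:, j]
        t -= alpha * Q[:, j]
        beta = np.linalg.norm(t)
        q_prev, beta_prev = Q[:, j], beta
        Q[:, j + 1] = t / beta
        T[j, j] = alpha          # diagonal entry alpha_j
        T[j + 1, j] = beta       # subdiagonal entry beta_j
        if j + 1 < k:
            T[j, j + 1] = beta   # superdiagonal entry beta_j
    return Q, T

rng = np.random.default_rng(0)
n, k = 30, 8
B = rng.standard_normal((n, n))
A = B + B.T                      # symmetric, in general indefinite
r0 = rng.standard_normal(n)
Q, T = lanczos(A, r0, k)
err_rel = np.linalg.norm(A @ Q[:, :k] - Q @ T)       # relation (8.23)
err_orth = np.linalg.norm(Q.T @ Q - np.eye(k + 1))   # orthonormal basis
```

In exact arithmetic both errors vanish; in floating point they are of the order of machine precision for such small k.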
The MINRES method, introduced in Paige and Saunders [70], is based on the following residual minimization problem: given x^0 ∈ R^n, compute x^k ∈ x^0 + K_k(A; r^0) such that

    ‖A x^k − b‖ = min{ ‖A x − b‖ : x ∈ x^0 + K_k(A; r^0) } ,        (8.25)

where r^0 := A x^0 − b. Note that the Euclidean norm is used and that for any regular A this minimization problem has a unique solution x^k, as illustrated in figure 8.2.
Clearly, we have a projection: r^k = A x^k − b is the projection (with respect to ⟨·, ·⟩) of r^0 = A x^0 − b onto the orthogonal complement of the subspace R := A(K_k(A; r^0)).

[Figure 8.2: the residual r^k = A x^k − b as a projection of r^0; R = A(K_k(A; r^0)).]
From (8.27) we see that the residual minimization problem in (8.25) leads to a least squares problem with the (k+1) × k tridiagonal matrix T̃_k. Due to the structure of this matrix, Givens rotations are very suitable for solving the least squares problem in (8.27). Combination of the Lanczos algorithm (for computing an orthogonal basis) with a least squares solver based on Givens rotations results in the MINRES algorithm. We will now derive this algorithm.
First we recall that for (x, y) ≠ (0, 0) a unique orthogonal Givens rotation is given by

    G = \begin{pmatrix} c & s \\ -s & c \end{pmatrix} with c^2 + s^2 = 1,
    such that G \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} w \\ 0 \end{pmatrix} with w > 0.   (8.28)
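In code, the rotation (8.28) can be formed in an overflow-safe way with `hypot`; a minimal sketch:

```python
import numpy as np

def givens(x, y):
    """Compute c, s, w of (8.28): [[c, s], [-s, c]] maps (x, y) to (w, 0), w > 0.
    np.hypot evaluates sqrt(x^2 + y^2) without intermediate overflow."""
    w = np.hypot(x, y)
    return x / w, y / w, w

c, s, w = givens(3.0, -4.0)
G = np.array([[c, s], [-s, c]])
out = G @ np.array([3.0, -4.0])   # maps (3, -4) to (5, 0)
```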
The least squares problem in (8.27a) is solved using an orthogonal transformation V_k ∈ R^{(k+1)×(k+1)} such that

    V_k T̃_k = \begin{pmatrix} R_k \\ 0 \end{pmatrix} ,  R_k ∈ R^{k×k} upper triangular.   (8.29)

Define b̃_k := V_k e_1 =: \begin{pmatrix} b_k \\ b̃_{k,k+1} \end{pmatrix} with b_k ∈ R^k. Then the solution of the least squares problem is given by y_k = R_k^{-1} b_k. We show how the matrices R_k and vectors b_k, k = 1, 2, . . ., can be
computed using short (and thus cheap) recursions. We introduce the notation
    G̃_j = \begin{pmatrix} I_{j-1} & & \\ & c_j & s_j \\ & -s_j & c_j \end{pmatrix} ∈ R^{(j+1)×(j+1)}  with  c_j^2 + s_j^2 = 1 .
For k ≥ 3 and for given T̃_k, G̃_{k-1}, G̃_{k-2} one can compute c_k, s_k, r_k such that

    G̃_k \Big( G̃_{k-1} G̃_{k-2} (0, …, 0, β_{k-1}, α_k, β_k)^T \Big) = \begin{pmatrix} r_k \\ 0 \end{pmatrix} ,  r_k ∈ R^k .   (8.32)
Note that r_k has at most three nonzero entries:

    r_k = (0, . . . , 0, r_{k,k-2}, r_{k,k-1}, r_{k,k})^T .          (8.33)
For b̃_k = V_k e_1 =: (b_k, b̃_{k,k+1})^T we have the recursion

    b̃_1 = G̃_1 e_1 ,  b̃_j = G̃_j \begin{pmatrix} b̃_{j-1} \\ 0 \end{pmatrix} = \begin{pmatrix} b_{j-1} \\ c_j b̃_{j-1,j} \\ -s_j b̃_{j-1,j} \end{pmatrix} ,  j ≥ 2 .   (8.34)
(Notation: b̃_{j-1,j} is the j-th entry of b̃_{j-1}.) We now derive a simple recursion for the vector x^k in (8.27b). Define the matrix P_k := Q_k R_k^{-1} = (p_1 . . . p_k) with columns p_j (1 ≤ j ≤ k). From P_k R_k = Q_k and the nonzero structure of the columns of R_k (cf. (8.33)) it follows that

    p_k = (q_k − r_{k,k-1} p_{k-1} − r_{k,k-2} p_{k-2}) / r_{k,k}    (8.35)

and

    x^k = x^0 − ‖r^0‖ Q_k R_k^{-1} b_k = x^0 − ‖r^0‖ P_k b_k
        = x^0 − ‖r^0‖ P_{k-1} b_{k-1} − ‖r^0‖ b_{k,k} p_k = x^{k-1} − ‖r^0‖ b_{k,k} p_k .   (8.36)
MINRES algorithm.
Given x^0, compute r^0 = A x^0 − b. For k = 1, 2, . . . :
- Compute q_k, α_k, β_k using the Lanczos method.
- Compute r_k using (8.30), (8.31) or (8.32) (note (8.33)).
- Compute p_k using (8.35).
- Compute b_{k,k} using (8.34).
- Compute update: x^k = x^{k-1} − ‖r^0‖ b_{k,k} p_k.
Note that in each iteration of this method we need only one matrix-vector multiplication and a few relatively cheap operations, like scalar products and vector additions.
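To illustrate the minimal residual property (8.25) without the Givens bookkeeping, the following sketch computes the same iterates the "expensive" way: it builds the Lanczos basis and solves the small least squares problem with T̃_k by a dense solver. This is for illustration only; the MINRES algorithm above produces the same x^k with short recursions:

```python
import numpy as np

def minres_iterates(A, b, x0, kmax):
    """Residual norms of the iterates defined by (8.25) for symmetric A.
    The least squares problem (8.27) with T_tilde is solved by lstsq here,
    instead of by the Givens-rotation recursions of the MINRES algorithm."""
    n = A.shape[0]
    r0 = A @ x0 - b
    beta0 = np.linalg.norm(r0)
    Q = np.zeros((n, kmax + 1)); T = np.zeros((kmax + 1, kmax))
    Q[:, 0] = r0 / beta0
    qp, bp = np.zeros(n), 0.0
    for j in range(kmax):                     # Lanczos recursion
        t = A @ Q[:, j] - bp * qp
        T[j, j] = t @ Q[:, j]
        t -= T[j, j] * Q[:, j]
        bp = np.linalg.norm(t)
        qp = Q[:, j]
        Q[:, j + 1] = t / bp
        T[j + 1, j] = bp
        if j + 1 < kmax:
            T[j, j + 1] = bp
    res = []
    for k in range(1, kmax + 1):
        # residual of x0 + Q_k y equals Q_{k+1} (beta0 e1 + T_tilde_k y)
        e1 = np.zeros(k + 1); e1[0] = beta0
        y, *_ = np.linalg.lstsq(T[:k + 1, :k], -e1, rcond=None)
        res.append(np.linalg.norm(A @ (x0 + Q[:, :k] @ y) - b))
    return res

rng = np.random.default_rng(1)
n = 40
B = rng.standard_normal((n, n))
A = B + B.T                                   # symmetric indefinite
b = rng.standard_normal(n)
res = minres_iterates(A, b, np.zeros(n), 25)
# minimal residual property: ||A x^k - b|| is monotonically nonincreasing
```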
Remark 8.3.1 If for a given symmetric regular matrix A and given starting residual r^0 assumption 8.2.1 does not hold, then there exists a minimal k_0 ≤ n such that A K_{k_0}(A; r^0) = K_{k_0}(A; r^0). In the Lanczos method we then obtain (using exact arithmetic) β_{k_0} = 0 and thus the iteration stops for k = k_0. It can be shown that x^{k_0} computed in the MINRES algorithm satisfies A x^{k_0} = b, and thus we have solved the linear system.
We now derive the preconditioned MINRES algorithm. For this we assume a given symmetric positive definite matrix M. Let L be such that M = L L^T. We consider the preconditioned system

    L^{-1} A L^{-T} z = L^{-1} b ,  z = L^T x .

Note that Ã := L^{-1} A L^{-T} is symmetric. For given x^0 ∈ R^n we have z^0 = L^T x^0, and the starting residual of the preconditioned problem satisfies Ã z^0 − L^{-1} b = L^{-1} r^0. We apply the Lanczos method to construct an orthogonal basis q_1, . . . , q_k of the space K_k(Ã; L^{-1} r^0). We want to avoid computations with the matrices L and L^T. This can be achieved if we reformulate the algorithm using the transformations

    t_j := L q_j ,  t̃_j := L q̃_j ,  w_j := L^{-T} q_j = M^{-1} t_j .
Using these definitions we obtain an equivalent formulation of the algorithm (8.4) applied to Ã with r = L^{-1} r^0, which is called the preconditioned Lanczos method:

    t_0 := 0;  w_0 := M^{-1} r^0;  ‖r‖ := ⟨w_0, r^0⟩^{1/2};
    t_1 := r^0/‖r‖;  w_1 := w_0/‖r‖;  β_0 := 0;
    for j ≥ 1 :
        t̃_{j+1} := A w_j − β_{j-1} t_{j-1} ,
        α_j := ⟨t̃_{j+1}, w_j⟩ ,                                     (8.37)
        t̃_{j+1} := t̃_{j+1} − α_j t_j ,
        w̃_{j+1} := M^{-1} t̃_{j+1} ;  β_j := ⟨w̃_{j+1}, t̃_{j+1}⟩^{1/2} ,
        t_{j+1} := t̃_{j+1}/β_j ,
        w_{j+1} := w̃_{j+1}/β_j .
Note that for M = I we obtain the algorithm (8.4) and that in each iteration a system with the matrix M must be solved. As a consequence of theorem 8.2.2 we get:

Theorem 8.3.2 The set w_1, w_2, . . . , w_k defined in algorithm (8.37) is orthogonal with respect to ⟨·, ·⟩_M and forms a basis of the Krylov subspace K_k(M^{-1}A; M^{-1}r^0) (k ≤ n).
Proof. From theorem 8.2.2 and the definition of w_j it follows that (L^T w_j)_{1≤j≤k} forms an orthogonal basis of K_k(L^{-1}AL^{-T}; L^{-1}r^0) with respect to the Euclidean scalar product. Note that ⟨L^T w_j, L^T w_i⟩ = 0 iff ⟨w_j, w_i⟩_M = 0, and L^T w_j ∈ K_k(L^{-1}AL^{-T}; L^{-1}r^0) iff w_j ∈ K_k(M^{-1}A; M^{-1}r^0).
From (8.23) we obtain, using L^T W_k = Q_k, that L^{-1}AL^{-T} L^T W_k = L^T W_{k+1} T̃_k holds, and thus the preconditioned residual minimization problem takes the following form: given x^0 ∈ R^n, compute x^k ∈ x^0 + K_k(M^{-1}A; M^{-1}r^0) such that

    ‖M^{-1}A x^k − M^{-1} b‖_M
      = min{ ‖M^{-1}A x − M^{-1} b‖_M : x ∈ x^0 + K_k(M^{-1}A; M^{-1}r^0) }   (8.39)

with r^0 = A x^0 − b. Using arguments as in (8.26) it follows that the solution of the minimization problem (8.39) can be obtained from the following algorithm.
Preconditioned MINRES algorithm.
Given x^0, compute r^0 = A x^0 − b, w_0 = M^{-1} r^0, ‖r‖ = ⟨w_0, r^0⟩^{1/2}.
For k = 1, 2, . . . :
- Compute w_k, α_k, β_k using the preconditioned Lanczos method (8.37).
- Compute r_k using (8.30), (8.31) or (8.32) (note (8.33)).
- Compute p_k using (8.35) with q_k replaced by w_k.
- Compute b_{k,k} using (8.34).
- Compute update: x^k = x^{k-1} − ‖r‖ b_{k,k} p_k.
The minimization property (8.39) yields a convergence result for the preconditioned MINRES method:

Theorem 8.3.3 Let A ∈ R^{n×n} be symmetric and M ∈ R^{n×n} symmetric positive definite. For x^k, k ≥ 0, computed in the preconditioned MINRES algorithm we define r^k = M^{-1}(A x^k − b). Then for every polynomial p_k of degree k with p_k(0) = 1

    ‖r^k‖_M ≤ ‖p_k(M^{-1}A) r^0‖_M ≤ ‖p_k(M^{-1}A)‖_M ‖r^0‖_M
            = max_{λ ∈ σ(M^{-1}A)} |p_k(λ)| ‖r^0‖_M

holds.
From this result it follows that bounds on the reduction of the (preconditioned) residual can be obtained if one assumes information on the spectrum of M^{-1}A. We present two results that are well-known in the literature. Proofs, which are based on approximation properties of Chebyshev polynomials, are given in, for example, [42].
Theorem 8.3.4 Let A, M and r^k be as in theorem 8.3.3. Assume that all eigenvalues of M^{-1}A are positive. Then

    ‖r^k‖_M / ‖r^0‖_M ≤ 2 ( 1 − 2/(√κ(M^{-1}A) + 1) )^k ,  k = 0, 1, . . .

holds.
We note that in this bound the dependence on the condition number κ(M^{-1}A) is the same as in well-known bounds for the preconditioned CG method.
Theorem 8.3.5 Let A, M and r^k be as in theorem 8.3.3. Assume that σ(M^{-1}A) ⊂ [a, b] ∪ [c, d] with a < b < 0 < c < d and b − a = d − c. Then

    ‖r^k‖_M / ‖r^0‖_M ≤ 2 ( 1 − 2/(√(ad/(bc)) + 1) )^{[k/2]} ,  k = 0, 1, . . .   (8.42)

holds.

In the special case a = −d, b = −c the reduction factor in (8.42) takes the form 1 − 2/(κ(M^{-1}A) + 1). Note that here the dependence on κ(M^{-1}A) is different from the positive definite case in theorem 8.3.4.
Using this notation, the Arnoldi algorithm results in

    A Q_k = Q_{k+1} H̃_k                                             (8.45)

with an upper Hessenberg matrix H̃_k ∈ R^{(k+1)×k}. The result in (8.45) is similar to the result in (8.23). However, note that the matrix H̃_k in (8.45) contains significantly more nonzero elements than the tridiagonal matrix T̃_k in (8.23).
Using induction it can be shown that q1 , q2 , ..., qk forms an orthogonal basis of the Krylov
subspace Kk (A; r0 ). As in the derivation of (8.27a),(8.27b) for the MINRES method, using the
fact that we have an orthogonal basis, we obtain that the xk that satisfies the minimal residual
criterion (8.25) can be characterized by the least squares problem:
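A minimal Arnoldi sketch (numpy assumed) makes the contrast with the Lanczos method concrete: each new basis vector must be orthogonalized against all previous ones, which yields the upper Hessenberg matrix H̃_k of (8.45):

```python
import numpy as np

def arnoldi(A, r0, k):
    """Arnoldi: orthonormal basis of K_{k+1}(A; r0) for general nonsymmetric A.
    Returns Q (n x (k+1)) and the (k+1) x k upper Hessenberg matrix H_tilde
    with A Q_k = Q_{k+1} H_tilde_k, cf. (8.45)."""
    n = A.shape[0]
    Q = np.zeros((n, k + 1))
    H = np.zeros((k + 1, k))
    Q[:, 0] = r0 / np.linalg.norm(r0)
    for j in range(k):
        t = A @ Q[:, j]
        for i in range(j + 1):               # orthogonalize against ALL q_i
            H[i, j] = t @ Q[:, i]
            t -= H[i, j] * Q[:, i]
        H[j + 1, j] = np.linalg.norm(t)
        Q[:, j + 1] = t / H[j + 1, j]
    return Q, H

rng = np.random.default_rng(1)
A = rng.standard_normal((25, 25))            # nonsymmetric test matrix
r0 = rng.standard_normal(25)
Q, H = arnoldi(A, r0, 6)
err_rel = np.linalg.norm(A @ Q[:, :6] - Q @ H)       # relation (8.45)
err_orth = np.linalg.norm(Q.T @ Q - np.eye(7))       # orthonormality
```

The inner loop over all previous basis vectors is exactly the extra cost (memory and work growing with k) that motivates the restarted variant GMRES(m) discussed below.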
The GMRES method was introduced in Saad and Schultz [80]. For a detailed discussion of implementation aspects of the GMRES method we refer to that paper. In [80] it is shown that, using similar techniques as in the derivation of the MINRES method, the least squares problem in step 3 in (8.48) can be solved with low computational costs. However, step 2 in (8.48) is expensive, both with respect to memory and arithmetic work. This is due to the fact that in the k-th iteration we need computations involving q_1, q_2, . . . , q_{k-1} to determine q_k. To avoid computations involving all the previous basis vectors, the GMRES method with restart is often used in practice. In GMRES(m) we apply m iterations of the GMRES method as in (8.48), then we define x^0 := x^m and again apply m iterations of the GMRES method with this new starting vector, etc. Note that for k > m the iterands x^k do not fulfill the minimal residual criterion (8.25). In Saad and Schultz [80] it is shown that (in exact arithmetic) the GMRES method cannot break down and that (as in CG) the exact solution is obtained in at most n iterations. The minimal residual criterion implies that in GMRES the residual is reduced in every iteration. These nice properties of GMRES do not hold for the GMRES(m) algorithm; a well-known difficulty with the GMRES(m) method is that it can stagnate.
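A compact sketch of GMRES(m) along these lines (numpy assumed; the Arnoldi step plus a dense least squares solve per cycle — in a production code the least squares problem is updated with Givens rotations instead):

```python
import numpy as np

def gmres_restarted(A, b, x0, m, cycles):
    """GMRES(m) sketch: m Arnoldi steps, minimize the residual over
    x0 + K_m(A; r0) via a small least squares problem, restart with x0 := x_m.
    No breakdown handling; for illustration only."""
    x = x0.copy()
    n = len(b)
    for _ in range(cycles):
        r = b - A @ x
        beta = np.linalg.norm(r)
        Q = np.zeros((n, m + 1)); H = np.zeros((m + 1, m))
        Q[:, 0] = r / beta
        for j in range(m):                    # Arnoldi recursion, cf. (8.45)
            t = A @ Q[:, j]
            for i in range(j + 1):
                H[i, j] = t @ Q[:, i]
                t -= H[i, j] * Q[:, i]
            H[j + 1, j] = np.linalg.norm(t)
            Q[:, j + 1] = t / H[j + 1, j]
        # residual of x + Q_m y equals Q_{m+1} (beta e1 - H_tilde y)
        e1 = np.zeros(m + 1); e1[0] = beta
        y, *_ = np.linalg.lstsq(H, e1, rcond=None)
        x = x + Q[:, :m] @ y
    return x

rng = np.random.default_rng(2)
n = 40
A = 3 * np.eye(n) + 0.1 * rng.standard_normal((n, n))   # well-conditioned
b = rng.standard_normal(n)
x = gmres_restarted(A, b, np.zeros(n), m=10, cycles=5)
res_final = np.linalg.norm(A @ x - b)
```

For this well-conditioned test matrix a few restart cycles suffice; on hard (e.g. strongly nonnormal) problems GMRES(m) can stagnate, as noted above.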
    m :                      10    20    40    80
    ε = 10^{-1}, h = 1/32:   97    72    61    56
    ε = 10^{-2}, h = 1/32:   68    75    80    59
    ε = 10^{-4}, h = 1/32:   61    59    60    59
    ε = 10^{-1}, h = 1/64:  270   191   146   147
    ε = 10^{-2}, h = 1/64:  127   134   150   160
    ε = 10^{-4}, h = 1/64:  119   114   114   114
There are other methods which are of GMRES type, in the sense that these methods (in exact arithmetic) yield iterands defined by the minimal residual criterion (8.25). These methods differ in the approach that is used for computing the minimal residual iterand. Examples of GMRES-type methods are the Generalized Conjugate Residual method (GCR) and Orthodir. These variants of GMRES seem to be less popular because for many problems they are at least as expensive as GMRES and numerically less stable. For a further discussion and comparison of GMRES-type methods we refer to Saad and Schultz [79], Barrett et al. [10] and Freund et al. [36].
The BiCG method which we discuss below is based on a generalized Lanczos method that
is used for computing a reasonable basis of the Krylov subspace. This generalized Lanczos
method uses short recursions (as in the Lanczos method), but the resulting basis will in general
not be orthogonal. The implementation of the BiCG method is as simple as the implementation
of the CG method.
The BiCG method is based on the bi-Lanczos (also called nonsymmetric Lanczos) method:

    v^0 := w^0 := 0;  v^1 := w^1 := r^0/‖r^0‖_2;  β_0 = γ_0 = 0;
    for j ≥ 1 :
        α_j := ⟨A v^j, w^j⟩ ,
        ṽ^{j+1} := A v^j − α_j v^j − γ_{j-1} v^{j-1} ,               (8.49)
        w̃^{j+1} := A^T w^j − α_j w^j − β_{j-1} w^{j-1} ,
        β_j := ‖ṽ^{j+1}‖ ,  v^{j+1} := ṽ^{j+1}/β_j ,
        γ_j := ⟨v^{j+1}, w̃^{j+1}⟩ ,  w^{j+1} := w̃^{j+1}/γ_j .
If A = A^T holds, then the two recursions in (8.49) are the same and the bi-Lanczos method reduces to the Lanczos method in (8.4). In the bi-Lanczos method it can happen that ⟨v^{j+1}, w̃^{j+1}⟩ = 0, even if v^{j+1} ≠ 0 and w̃^{j+1} ≠ 0. In that case the algorithm is not executable anymore; this is called a (serious) breakdown. Using induction we obtain that for the two sequences of vectors generated by the bi-Lanczos method the following properties hold:
Based on (8.52) we call v^i and w^j (i ≠ j) biorthogonal. In general the v^j (j = 1, 2, . . .) will not be orthogonal. Using the notation

    V_j := [v^1 . . . v^j] ,  W_j := [w^1 . . . w^j]

we have, as in (8.23),

    A V_k = V_{k+1} T̃_k                                             (8.53)

and

    W_k^T V_k = I_k ,  W_k^T v^{k+1} = 0 .                           (8.54)
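The biorthogonality (8.54) can be checked numerically. The sketch below (numpy assumed; helper name ours) runs the bi-Lanczos recursions (8.49) and verifies W_k^T V_k = I_k, while the v^j themselves are not orthogonal:

```python
import numpy as np

def bi_lanczos(A, r0, k):
    """Bi-Lanczos sketch following (8.49): two coupled three-term recursions,
    one with A and one with A^T.  No look-ahead: a serious breakdown
    (<v, w~> = 0) would stop the recursion."""
    n = A.shape[0]
    V = np.zeros((n, k)); W = np.zeros((n, k))
    V[:, 0] = W[:, 0] = r0 / np.linalg.norm(r0)
    v_p = w_p = np.zeros(n)
    beta = gamma = 0.0
    for j in range(k - 1):
        Av = A @ V[:, j]
        alpha = Av @ W[:, j]
        vt = Av - alpha * V[:, j] - gamma * v_p       # uses old gamma
        wt = A.T @ W[:, j] - alpha * W[:, j] - beta * w_p  # uses old beta
        v_p, w_p = V[:, j], W[:, j]
        beta = np.linalg.norm(vt)
        V[:, j + 1] = vt / beta
        gamma = V[:, j + 1] @ wt
        W[:, j + 1] = wt / gamma
    return V, W

rng = np.random.default_rng(3)
A = rng.standard_normal((20, 20))                     # nonsymmetric
r0 = rng.standard_normal(20)
V, W = bi_lanczos(A, r0, 6)
bio = np.linalg.norm(W.T @ V - np.eye(6))             # biorthogonality (8.54)
nonorth = np.linalg.norm(V.T @ V - np.eye(6))         # v^j NOT orthogonal
```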
In BiCG we do not use a minimal residual criterion as in (8.25), but the following criterion based on an orthogonality condition:

    determine x^k ∈ x^0 + K_k(A; r^0) such that b − A x^k ⊥ K_k(A^T; r^0) .   (8.55)

The existence of an x^k satisfying the criterion in (8.55) is not guaranteed! If the criterion (8.55) cannot be fulfilled, the BiCG algorithm in (8.58) below will break down. For the case that A is symmetric positive definite, the criteria in (8.55) and in (8.3) are equivalent and (in exact arithmetic) the BiCG algorithm will yield the same iterands as the CG algorithm.
Using (8.50)–(8.52) we see that the BiCG iterand x^k, characterized in (8.55), satisfies W_k^T (b − A x^k) = 0. Due to the relations in (8.53), (8.54) this yields the following characterization:

    T_k y^k = ‖r^0‖_2 e_1 ,                                          (8.56)
    x^k = x^0 + V_k y^k .                                            (8.57)

Note that this is very similar to the characterization of the CG iterand in (8.10a), (8.10b). However, in (8.56) the tridiagonal matrix T_k need not be symmetric positive definite and V_k in general will not be orthogonal. Using an LU-decomposition of the tridiagonal matrix T_k we can compute y^k, provided T_k is nonsingular, and then determine x^k. An efficient implementation of this approach can be derived along the same lines as for the Lanczos iterative method in section 8.2. This then results in the BiCG algorithm, introduced in Lanczos [59] (cf. also Fletcher [35]):
Note: here and in the remainder of this section the residual is defined by r^k = b − A x^k (instead of A x^k − b).
The BiCG algorithm is simple and has low computational costs per iteration (compared to GMRES-type methods). A disadvantage is that a breakdown can occur (ρ_k = 0 or σ_k = 0). A near-breakdown will result in numerical instabilities. To avoid these (near) breakdowns, variants of BiCG have been developed that use so-called look-ahead Lanczos algorithms for computing a basis of the Krylov subspace. Also the criterion in (8.55) can be replaced by another criterion, to avoid a breakdown caused by the fact that the BiCG iterand as in (8.55) does not exist. The combination of a look-ahead Lanczos approach and a criterion based on minimization of a quasi-residual is the basis of the QMR (Quasi Minimal Residual) method. For a discussion of the look-ahead Lanczos approach and QMR we refer to Freund et al. [36].
For the BiCG method there are only very few theoretical convergence results. A variant of BiCG is analyzed in Bank and Chan [8]. A disadvantage of the BiCG method is that we need a multiplication by A^T, which is often not easily available. Below we discuss variants of the BiCG method which only use multiplications with the matrix A (two per iteration). For many problems these methods have a higher rate of convergence than the BiCG method.
We introduce the BiCGSTAB method (from Van der Vorst [91]) and the CGS (Conjugate Gradients Squared) method (from Sonneveld [86]). These methods are derived from the BiCG method. We assume that the BiCG method does not break down.
We first reformulate the BiCG method using a notation based on matrix polynomials. With T_k, P_k ∈ P_k defined by

    T_0(x) = 1 ,  P_0(x) = 1 ,
    P_k(x) = P_{k-1}(x) − α_{k-1} x T_{k-1}(x) ,  k ≥ 1 ,
    T_k(x) = P_k(x) + β_k T_{k-1}(x) ,  k ≥ 1 ,

with α_k, β_k as in (8.58), we have for the search directions p^k and the residuals r^k resulting from the BiCG method:

    r^k = P_k(A) r^0 ,                                               (8.59)
    p^k = T_k(A) r^0 .                                               (8.60)
Results as in (8.59), (8.60) also hold for the second sequences r̄^k and p̄^k generated by BiCG, with A replaced by A^T and r^0 replaced by r̄^0. For the sequences of residuals and search directions generated by the BiCG method we define, for a polynomial Q_k ∈ P_k, related transformed sequences

    r̃^k := Q_k(A) r^k ,  p̃^k := Q_k(A) p^k ,  x̃^k := A^{-1}(b − r̃^k) .   (8.63)

In the BiCGSTAB method and the CGS method we compute the iterands x̃^k corresponding to a suitable polynomial Q_k. These polynomials are chosen in such a way that the x̃^k can be computed with simple (i.e. short) recursions involving r̃^k, p̃^k and A. The costs per iteration of these algorithms will be roughly the same as the costs per iteration in BiCG. An important advantage is that we do not need A^T. Clearly, from an efficiency point of view it is favourable to have a polynomial Q_k such that ‖r̃^k‖ = ‖Q_k(A) r^k‖ ≪ ‖r^k‖ holds. For obtaining a hybrid BiCG method one can try to find a polynomial Q_k such that for the corresponding transformed quantities we have low costs per iteration (short recursions) and a (much) smaller transformed residual. The first example of such a polynomial is due to Sonneveld [86]. He proposes:
    Q_k(x) = P_k(x)                                                  (8.64)

with P_k the BiCG polynomial. The iterands x̃^k corresponding to this Q_k are computed in the CGS method (cf. (8.75)). Another choice is proposed in Van der Vorst [91]:

    Q_k(x) = (1 − ω_1 x)(1 − ω_2 x) · · · (1 − ω_k x) .              (8.65)

The choice of the parameters ω_j is discussed below (cf. (8.73)). The iterands x̃^k corresponding to this Q_k are computed in the BiCGSTAB algorithm.
We now show how the BiCGSTAB algorithm can be derived. First note that for the BiCGSTAB polynomial we have Q_k(x) = (1 − ω_k x) Q_{k-1}(x). Note that in (8.66), (8.67) and (8.68) we have simple recursions in which the scalars α_k and β_k defined in the BiCG algorithm are used. We now show that for these scalars one can derive other, more feasible, formulas. We consider

    ρ_k = ⟨r^k, r̄^k⟩ .
The coefficient of the highest order term of the BiCG polynomial P_k is equal to (−1)^k α_0 α_1 · · · α_{k-1}. So we have

    r̄^k = P_k(A^T) r̄^0 = (−1)^k α_0 α_1 · · · α_{k-1} (A^T)^k r̄^0 + w^k ,

with w^k ∈ K_k(A^T; r̄^0). Using this in the definition of ρ_k and the orthogonality condition for r^k in (8.55) we obtain that, up to a nonzero scalar factor which cancels in the ratios used in the algorithm below, ρ_k equals

    ρ̃_k := ⟨Q_k(A) r^k, r̄^0⟩ = ⟨r̃^k, r̄^0⟩ ,                      (8.71)

which involves only the transformed residual and the fixed vector r̄^0. Similarly, for the scalar α_k defined in the BiCG algorithm, the formula

    α_k = ρ̃_k / ⟨A p̃^k, r̄^0⟩                                      (8.72)

can be derived. We finally discuss the choice of the parameters ω_j in the BiCGSTAB polynomial.
We use the notation

    r̃^{k+1/2} := r̃^k − α_k A p̃^k .

The recursion for the transformed residuals can be rewritten as

    r̃^{k+1} = r̃^{k+1/2} − ω_k A r̃^{k+1/2} .

The parameter ω_k is chosen such that ‖r̃^{k+1}‖_2 is minimal. This results in

    ω_k = ⟨A r̃^{k+1/2}, r̃^{k+1/2}⟩ / ⟨A r̃^{k+1/2}, A r̃^{k+1/2}⟩ .   (8.73)
Using the recursions in (8.66), (8.67), (8.68), the formulas for the scalars in (8.71), (8.72) and the choice for ω_k as in (8.73), we obtain the following BiCGSTAB algorithm, where for ease of notation we dropped the tilde notation for the transformed variables.
    starting vector x^0;  r^0 = b − A x^0;  choose r̄^0 (e.g. r̄^0 = r^0);
    p^{-1} = c^{-1} = 0;  ρ_{-1} = α_{-1} = ω_{-1} = 1;
    for k ≥ 0 :
        ρ_k = ⟨r^k, r̄^0⟩ ,  β_k = (α_{k-1}/ω_{k-1})(ρ_k/ρ_{k-1}) ,
        p^k = β_k p^{k-1} + r^k − β_k ω_{k-1} c^{k-1} ,              (8.74)
        c^k = A p^k ,
        σ_k = ⟨c^k, r̄^0⟩ ,  α_k = ρ_k/σ_k ,
        r^{k+1/2} = r^k − α_k c^k ,  c^{k+1/2} = A r^{k+1/2} ,
        ω_k = ⟨c^{k+1/2}, r^{k+1/2}⟩ / ⟨c^{k+1/2}, c^{k+1/2}⟩ ,
        x^{k+1} = x^k + α_k p^k + ω_k r^{k+1/2} ,  r^{k+1} = r^{k+1/2} − ω_k c^{k+1/2} .
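A direct transcription of (8.74) into numpy (a sketch without any breakdown safeguards; `rt0` plays the role of the shadow vector r̄^0):

```python
import numpy as np

def bicgstab(A, b, x0, tol=1e-10, kmax=200):
    """BiCGSTAB sketch following (8.74): two matrix-vector products and a few
    inner products per iteration, no A^T needed, no breakdown handling."""
    x = x0.copy()
    r = b - A @ x
    rt0 = r.copy()                       # shadow vector r-bar^0 := r^0
    p = np.zeros_like(b); c = np.zeros_like(b)
    rho_old = alpha = omega = 1.0
    for _ in range(kmax):
        rho = r @ rt0
        beta = (alpha / omega) * (rho / rho_old)
        p = beta * p + r - beta * omega * c
        c = A @ p
        alpha = rho / (c @ rt0)
        s = r - alpha * c                # r^{k+1/2}
        cs = A @ s                       # c^{k+1/2}
        omega = (cs @ s) / (cs @ cs)     # residual-minimizing omega, cf. (8.73)
        x = x + alpha * p + omega * s
        r = s - omega * cs
        rho_old = rho
        if np.linalg.norm(r) < tol:
            break
    return x

rng = np.random.default_rng(4)
n = 50
A = 4 * np.eye(n) + 0.3 * rng.standard_normal((n, n))   # nonsymmetric test
b = rng.standard_normal(n)
x = bicgstab(A, b, np.zeros(n))
res_final = np.linalg.norm(A @ x - b)
```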
This BiCGSTAB method is introduced in Van der Vorst [91] as a variant of BiCG and of the CGS method. Variants of the BiCGSTAB method, denoted by BiCGSTAB(ℓ), are discussed in Sleijpen and Fokkema [84].
In the CGS method we take Q_k(x) = P_k(x), with P_k the BiCG polynomial. This results in transformed residuals satisfying the relation

    r̃^k = (P_k(A))^2 r^0 .

This explains the name Conjugate Gradients Squared. Along the same lines as above for the BiCGSTAB method one can derive the following CGS algorithm from the BiCG algorithm (cf. Sonneveld [86]):
    starting vector x^0;  r^0 := b − A x^0;  r̄^0 := r^0;  q^0 := q̂^{-1} := 0;  ρ_{-1} := 1;
    for k ≥ 0 :
        ρ_k := ⟨r̄^0, r^k⟩ ;  β_k := ρ_k/ρ_{k-1} ;
        w^k := r^k + β_k q^k ,
        q̂^k := w^k + β_k (q^k + β_k q̂^{k-1}) ,
        v^k := A q̂^k ,                                               (8.75)
        σ_k := ⟨r̄^0, v^k⟩ ;  α_k := ρ_k/σ_k ;
        q^{k+1} := w^k − α_k v^k ,
        r^{k+1} := r^k − α_k A(w^k + q^{k+1}) ,
        x^{k+1} := x^k + α_k (w^k + q^{k+1}) .
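A transcription of (8.75) into numpy (again a sketch without breakdown safeguards; the auxiliary vector `qh` corresponds to the second q-sequence in (8.75)):

```python
import numpy as np

def cgs(A, b, x0, tol=1e-10, kmax=200):
    """CGS sketch following (8.75): two matrix-vector products per iteration,
    no A^T needed.  Convergence can be quite irregular."""
    x = x0.copy()
    r = b - A @ x
    rt0 = r.copy()                       # shadow vector r-bar^0 := r^0
    q = np.zeros_like(b)                 # q^k
    qh = np.zeros_like(b)                # auxiliary q-hat sequence
    rho_old = 1.0
    for _ in range(kmax):
        rho = rt0 @ r
        beta = rho / rho_old
        w = r + beta * q
        qh = w + beta * (q + beta * qh)
        v = A @ qh
        alpha = rho / (rt0 @ v)
        q = w - alpha * v
        r = r - alpha * (A @ (w + q))
        x = x + alpha * (w + q)
        rho_old = rho
        if np.linalg.norm(r) < tol:
            break
    return x

rng = np.random.default_rng(5)
n = 50
A = 4 * np.eye(n) + 0.3 * rng.standard_normal((n, n))
b = rng.standard_normal(n)
x = cgs(A, b, np.zeros(n))
res_final = np.linalg.norm(A @ x - b)
```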
Note that both in the BiCGSTAB method and in the CGS method we have relatively low costs per iteration (two matrix-vector products and a few inner products) and that we do not need A^T. The fact that in the CGS polynomial we use the square of the BiCG polynomial often results in a rather irregular convergence behaviour (cf. Van der Vorst [91]). The BiCGSTAB polynomial (8.65), (8.73) is chosen such that the resulting method in general has a less irregular convergence behaviour than the CGS method.
Example 8.5.1 (Convection-diffusion problem) In figure 8.3 we show the results of the CGS method and the BiCGSTAB method applied to the convection-diffusion equation as in example 8.4.1, with ε = 10^{-2}, h = 1/32. As a measure for the error reduction we use

    log_{10}(k_d) := log_{10}( ‖A x^k − b‖_2 / ‖A x^0 − b‖_2 ) = log_{10}( ‖A x^k − b‖_2 / ‖b‖_2 ) .

[Figure 8.3: defect reduction of CGS and BiCGSTAB.]

Note that for both methods we first have a growth of the norm of the defect (in CGS even up to 10^{12}!). In this example we observe that indeed the BiCGSTAB method has a smoother convergence behaviour than the CGS method.
We finally note that for all these nonsymmetric Krylov subspace methods the use of a suitable preconditioner is of great importance for the efficiency of the methods. There is only very little analysis in this field, and in general the choice of the preconditioner is based on trial and error. Often variants of the ILU factorization are used as a preconditioner. For the preconditioned BiCGSTAB algorithm, with preconditioner W, we refer to Sleijpen and Van der Vorst [85].
Chapter 9
Multigrid methods
9.1 Introduction
In this chapter we treat multigrid methods (MGM) for solving discrete scalar elliptic boundary value problems. We first briefly discuss a few important differences between multigrid methods and the iterative methods treated in the preceding chapters.
The basic iterative methods and the Krylov subspace methods use the matrix A and the right-hand side b which are the result of a discretization method. The fact that these data correspond
to a certain underlying continuous boundary value problem is not used in the iterative method.
However, the relation between the data (A and b) and the underlying problem can be useful for
the development of a fast iterative solver. Due to the fact that A results from a discretization
procedure we know, for example, that there are other matrices which, in a certain natural sense,
are similar to the matrix A. These matrices result from the discretization of the underlying
continuous boundary value problem on other grids than the grid corresponding to the given
discrete problem Ax = b. The use of discretizations of the given continuous problem on several
grids with different mesh sizes plays an important role in multigrid methods.
We will see that for a large class of discrete elliptic boundary value problems multigrid methods have a significantly higher rate of convergence than the methods treated in the preceding chapters. Often multigrid methods even have optimal complexity.
Due to the fact that in multigrid methods discrete problems on different grids are needed, the implementation of multigrid methods is in general (much) more involved than the implementation of, for example, Krylov subspace methods. We also note that for multigrid methods it is relatively hard to develop black-box solvers which are applicable to a wide class of problems.
In section 9.2 we explain the main ideas of the MGM using a simple one-dimensional problem. In section 9.3 we introduce multigrid methods for discrete scalar elliptic boundary value problems. In section 9.4 we present a convergence analysis of these multigrid methods. In contrast to the basic iterative and Krylov subspace methods, in the convergence analysis we will need the underlying continuous problem. The standard multigrid method discussed in the sections 9.2–9.4 is efficient only for diffusion-dominated elliptic problems. In section 9.5 we consider modifications of standard multigrid methods which are used for convection-dominated problems. In section 9.6 we discuss the principle of nested iteration. In this approach we use computations on relatively coarse grids to obtain a good starting vector for an iterative method (not necessarily a multigrid method). In section 9.7 we show some results of numerical experiments. In section 9.8 we discuss so-called algebraic multigrid methods. In these methods, as in basic iterative methods and Krylov subspace methods, we only use the given matrix and right-hand side, but no information on an underlying grid structure. Finally, in section 9.9 we consider multigrid techniques which can be applied directly to nonlinear elliptic boundary value problems without using a linearization technique.
For a thorough treatment of multigrid methods we refer to the monograph of Hackbusch [44].
For an introduction to multigrid methods requiring less knowledge of mathematics, we refer to
Wesseling [96], Briggs [23], Trottenberg et al. [69]. A theoretical analysis of multigrid methods
is presented in [19].
The solution of this discrete problem is denoted by x_ℓ^*. The solution of the Galerkin discretization in the space X^{h_ℓ}_{1,0} is given by u_ℓ = P_ℓ x_ℓ^*. A simple computation shows that

    A_ℓ = h_ℓ^{-1} tridiag(−1, 2, −1) ∈ R^{n_ℓ × n_ℓ} .

Note that, apart from a scaling factor, the same matrix results from a standard discretization with finite differences of the problem (9.1).
Clearly, in practice one should not solve the problem in (9.8) using an iterative method (a Cholesky factorization A_ℓ = L L^T is stable and efficient). However, we do apply a basic iterative method here, to illustrate a certain "smoothing property" which plays an important role in multigrid methods. We consider the damped Jacobi method

    x^{k+1} = x^k − (ω h_ℓ / 2)(A_ℓ x^k − b_ℓ)  with  ω ∈ (0, 1] .   (9.9)

The iteration matrix of this method is given by

    C_ℓ = C_ℓ(ω) = I − (ω h_ℓ / 2) A_ℓ .
In this simple model problem an orthogonal eigenvector basis of A_ℓ, and thus of C_ℓ, too, is known. This basis is closely related to the Fourier modes

    w^ν(x) := sin(νπx) ,  ν = 1, 2, . . .

Note that w^ν satisfies the boundary conditions in (9.1) and that (w^ν)''(x) = −(νπ)^2 w^ν(x) holds, and thus w^ν is an eigenfunction of the problem in (9.1). We introduce vectors z^ν ∈ R^{n_ℓ}, 1 ≤ ν ≤ n_ℓ, which correspond to the Fourier modes w^ν restricted to the interior grid points:

    (z^ν)_j = w^ν(j h_ℓ) = sin(νπ j h_ℓ) ,  j = 1, . . . , n_ℓ .
These vectors form an orthogonal basis of R^{n_ℓ}. For ℓ = 2 we give an illustration in figure 9.1. To a vector z^ν there corresponds a frequency ν. If ν < ½ n_ℓ holds, then the vector z^ν, or the corresponding finite element function P_ℓ z^ν, is called a low frequency mode, and if ν ≥ ½ n_ℓ holds, then this vector [finite element function] is called a high frequency mode.

[Figure 9.1: a low frequency mode and a high frequency mode on the grid for ℓ = 2.]

These vectors z^ν are eigenvectors of the matrix A_ℓ:
    A_ℓ z^ν = (4/h_ℓ) sin^2(½ νπ h_ℓ) z^ν ,

and thus we have

    C_ℓ z^ν = (1 − 2ω sin^2(½ νπ h_ℓ)) z^ν .                         (9.10)

From this we obtain

    ‖C_ℓ‖_2 = max_{1≤ν≤n_ℓ} |1 − 2ω sin^2(½ νπ h_ℓ)|
            = 1 − 2ω sin^2(½ π h_ℓ) = 1 − ½ ω π^2 h_ℓ^2 + O(h_ℓ^4) .   (9.11)
From this we see that the damped Jacobi method is convergent, but that the rate of convergence will be very low for h_ℓ small (cf. section 6.3).
Note that the eigenvalues and the eigenvectors of C_ℓ are functions of νh_ℓ ∈ [0, 1]:

    λ_{ν,ℓ} := 1 − 2ω sin^2(½ νπ h_ℓ) =: g_ω(ν h_ℓ) ,  with          (9.12a)
    g_ω(y) = 1 − 2ω sin^2(½ π y)  (y ∈ [0, 1]) .                     (9.12b)
Hence, the size of the eigenvalues λ_{ν,ℓ} can directly be obtained from the graph of the function g_ω. In figure 9.2 we show the graph of the function g_ω for a few values of ω.

[Figure 9.2: graph of g_ω for ω = 1/3, 1/2, 2/3, 1.]

From the graphs in this figure we conclude that for a suitable choice of ω we have |g_ω(y)| ≪ 1 if y ∈ [½, 1]. We choose ω = 2/3 (then g_{2/3}(½) = −g_{2/3}(1) holds). Then we have |g_{2/3}(y)| ≤ 1/3 for y ∈ [½, 1]. Using this and the result in (9.12a) we obtain

    |λ_{ν,ℓ}| ≤ 1/3  for  ν ≥ ½ n_ℓ .

Hence:
the high frequency modes are strongly damped by the iteration matrix C_ℓ. From figure 9.2 it is also clear that the low rate of convergence of the damped Jacobi method is caused by the low frequency modes (ν h_ℓ ≪ 1).
Summarizing, we draw the conclusion that in this example the damped Jacobi method will smooth the error. This elementary observation is of great importance for the two-grid method introduced below. In the setting of multigrid methods the damped Jacobi method is called a smoother. The smoothing property of damped Jacobi is illustrated in figure 9.3. It is important to note that the discussion above concerning smoothing is related to the iteration matrix C_ℓ, which means that the error will be made smoother by the damped Jacobi method, but not (necessarily) the new iterand x^{k+1}.
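The smoothing property can be checked directly from (9.12a). The sketch below evaluates the damping factors |λ_{ν,ℓ}| = |g_{2/3}(νh_ℓ)| and confirms that all high frequency modes are damped by at least a factor 1/3 per iteration, while the smoothest mode is barely reduced:

```python
import numpy as np

# Eigenvalues of the damped Jacobi iteration matrix C for the 1D model
# problem, cf. (9.10)-(9.12): lambda_nu = 1 - 2*omega*sin^2(nu*pi*h/2).
n = 63                       # interior grid points, h = 1/64
h = 1.0 / (n + 1)
omega = 2.0 / 3.0
nu = np.arange(1, n + 1)
lam = 1 - 2 * omega * np.sin(nu * np.pi * h / 2) ** 2
high = np.abs(lam[nu >= (n + 1) // 2])   # high frequency modes, nu >= n/2
low = abs(lam[0])                        # nu = 1, the smoothest mode
# high.max() is 1/3 (attained at nu = n/2); low is close to 1
```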
In multigrid methods we have to transform information from one grid to another. For that purpose we introduce so-called prolongations and restrictions. In a setting with nested finite element spaces these operators can be defined in a very natural way. Due to the nestedness the identity operator

    I : X^{h_{ℓ-1}}_{1,0} → X^{h_ℓ}_{1,0} ,  I v = v

is well-defined. This identity operator represents linear interpolation, as is illustrated for ℓ = 2 in figure 9.4.
[Figure 9.4: linear interpolation from X^{h_1}_{1,0} to X^{h_2}_{1,0}.]

The matrix representation of this interpolation operator is given by

    p : R^{n_{ℓ-1}} → R^{n_ℓ} ,  p := P_ℓ^{-1} P_{ℓ-1} .             (9.13)
A simple computation yields

    p = \begin{pmatrix}
          \frac{1}{2} &             &             \\
          1           &             &             \\
          \frac{1}{2} & \frac{1}{2} &             \\
                      & 1           &             \\
                      & \frac{1}{2} & \ddots      \\
                      &             & \ddots      \\
                      &             & \frac{1}{2} \\
                      &             & 1           \\
                      &             & \frac{1}{2}
        \end{pmatrix} ∈ R^{n_ℓ × n_{ℓ-1}} .                          (9.14)
A restriction r : R^{n_ℓ} → R^{n_{ℓ-1}} could simply be defined by injection, i.e. by taking the values at the coarse grid points. When used in a multigrid method, however, this restriction based on injection is often not satisfactory (cf. Hackbusch [44], section 3.5). A better method is obtained if a natural Galerkin property is satisfied. It can easily be verified (cf. also lemma 9.3.2) that with A_ℓ, A_{ℓ-1} and p as defined in (9.8), (9.13) we have

    r A_ℓ p = A_{ℓ-1}  iff  r = p^T .                                (9.15)

Thus the natural Galerkin condition r A_ℓ p = A_{ℓ-1} implies the choice

    r = p^T                                                          (9.16)

for the restriction operator.
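The Galerkin property (9.15) can be verified numerically for the 1D model problem. In the sketch below (numpy assumed; helper names ours) the choice r = p^T indeed reproduces the coarse grid matrix exactly:

```python
import numpy as np

def stiffness(n):
    """A = h^{-1} tridiag(-1, 2, -1) with h = 1/(n+1), cf. the 1D model problem."""
    h = 1.0 / (n + 1)
    return (2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h

def prolongation(nc):
    """p from (9.14): linear interpolation from n_c coarse to 2*n_c + 1 fine nodes."""
    p = np.zeros((2 * nc + 1, nc))
    for i in range(nc):
        p[2 * i, i] = 0.5       # fine node left of coarse node i
        p[2 * i + 1, i] = 1.0   # fine node coinciding with coarse node i
        p[2 * i + 2, i] = 0.5   # fine node right of coarse node i
    return p

nc = 7                           # coarse level: n_{l-1} = 7, n_l = 15
Af = stiffness(2 * nc + 1)
Ac = stiffness(nc)
p = prolongation(nc)
r = p.T                          # Galerkin restriction (9.16)
err = np.linalg.norm(r @ Af @ p - Ac)   # Galerkin property (9.15)
```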
The two-grid method is based on the idea that a smooth error, which results from the application of one or a few damped Jacobi iterations, can be approximated fairly well on a coarser grid. We now introduce this two-grid method.
Consider A_ℓ x_ℓ = b_ℓ and let x_ℓ be the result of one or a few damped Jacobi iterations applied to a given starting vector x_ℓ^0. For the error e_ℓ := x_ℓ^* − x_ℓ we have

    A_ℓ e_ℓ = b_ℓ − A_ℓ x_ℓ =: d_ℓ  ("residual" or "defect") .       (9.17)

Based on the assumption that e_ℓ is smooth it seems reasonable to make the approximation e_ℓ ≈ p e_{ℓ-1} with an appropriate vector (grid function) e_{ℓ-1} ∈ R^{n_{ℓ-1}}. To determine the vector e_{ℓ-1} we use the equation (9.17) and the Galerkin property (9.15). This results in the equation

    A_{ℓ-1} e_{ℓ-1} = r d_ℓ

for the vector e_{ℓ-1}. Note that x_ℓ^* = x_ℓ + e_ℓ ≈ x_ℓ + p e_{ℓ-1}. Thus for the new iterand we take x_ℓ := x_ℓ + p e_{ℓ-1}. In a more compact formulation this two-grid method is as follows:
    procedure TGM_ℓ(x_ℓ, b_ℓ)
    if ℓ = 0 then x_0 := A_0^{-1} b_0 else
    begin
        x_ℓ := J_ℓ^ν(x_ℓ, b_ℓ)           (* smoothing iterations, e.g. damped Jacobi *)
        d_{ℓ-1} := r (b_ℓ − A_ℓ x_ℓ)     (* restriction of defect *)      (9.18)
        e_{ℓ-1} := A_{ℓ-1}^{-1} d_{ℓ-1}  (* solve coarse grid problem *)
        x_ℓ := x_ℓ + p e_{ℓ-1}           (* add correction *)
        TGM_ℓ := x_ℓ
    end;
Often, after the coarse grid correction x_ℓ := x_ℓ + p e_{ℓ-1}, one or a few smoothing iterations are applied again. Smoothing before/after the coarse grid correction is called pre-/postsmoothing.
Besides the smoothing property, a second property which is of great importance for a multigrid method is the following: the coarse grid problem A_{ℓ-1} e_{ℓ-1} = d_{ℓ-1} has the same form as the original problem A_ℓ x_ℓ = b_ℓ on level ℓ.
Thus for solving the problem A_{ℓ-1} e_{ℓ-1} = d_{ℓ-1} approximately we can apply the two-grid algorithm in (9.18) recursively. This results in the following multigrid method for solving A_ℓ x_ℓ = b_ℓ:

    procedure MGM_ℓ(x_ℓ, b_ℓ)
    if ℓ = 0 then x_0 := A_0^{-1} b_0 else
    begin
        x_ℓ := J_ℓ^{ν_1}(x_ℓ, b_ℓ)       (* presmoothing *)
        d_{ℓ-1} := r (b_ℓ − A_ℓ x_ℓ)
        e^0_{ℓ-1} := 0;  for i = 1 to τ do e^i_{ℓ-1} := MGM_{ℓ-1}(e^{i-1}_{ℓ-1}, d_{ℓ-1});   (9.19)
        x_ℓ := x_ℓ + p e^τ_{ℓ-1}
        x_ℓ := J_ℓ^{ν_2}(x_ℓ, b_ℓ)       (* postsmoothing *)
        MGM_ℓ := x_ℓ
    end;
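For the 1D model problem the complete method fits in a few lines. The following sketch (numpy assumed; ν_1 = ν_2 = ν damped Jacobi steps with ω = 2/3; τ = 1 gives the V-cycle) typically reduces the residual by orders of magnitude within a handful of iterations:

```python
import numpy as np

def stiffness(n):
    """A_l = h^{-1} tridiag(-1, 2, -1) with h = 1/(n+1), the 1D model problem (9.8)."""
    h = 1.0 / (n + 1)
    return (2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / h

def prolongation(nc):
    """p from (9.14): linear interpolation, n_f = 2*n_c + 1 fine nodes."""
    p = np.zeros((2 * nc + 1, nc))
    for i in range(nc):
        p[2 * i, i] = 0.5
        p[2 * i + 1, i] = 1.0
        p[2 * i + 2, i] = 0.5
    return p

def mgm(x, b, nu=2, tau=1):
    """One iteration of MGM (9.19): damped Jacobi smoothing (omega = 2/3),
    Galerkin restriction r = p^T, recursive coarse grid correction."""
    n = len(b)
    A = stiffness(n)
    if n <= 1:                                # coarsest level: solve exactly
        return np.linalg.solve(A, b)
    h, omega = 1.0 / (n + 1), 2.0 / 3.0
    for _ in range(nu):                       # presmoothing, cf. (9.9)
        x = x - 0.5 * omega * h * (A @ x - b)
    nc = (n - 1) // 2
    p = prolongation(nc)
    d = p.T @ (b - A @ x)                     # restrict the defect, r = p^T
    e = np.zeros(nc)
    for _ in range(tau):                      # tau recursive coarse solves
        e = mgm(e, d, nu, tau)
    x = x + p @ e                             # coarse grid correction
    for _ in range(nu):                       # postsmoothing
        x = x - 0.5 * omega * h * (A @ x - b)
    return x

n = 63                                        # h = 1/64; coarsens 63->31->...->1
rng = np.random.default_rng(2)
b = rng.standard_normal(n)
A = stiffness(n)
x = np.zeros(n)
r_start = np.linalg.norm(A @ x - b)
for _ in range(8):
    x = mgm(x, b)                             # V-cycle iterations
r_end = np.linalg.norm(A @ x - b)
```

The observed contraction factor per V-cycle is small and, in accordance with the mesh-independence of multigrid convergence, essentially independent of n.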
If one wants to solve the system on a given finest grid, say with level number ℓ̄, i.e. A_ℓ̄ x_ℓ̄ = b_ℓ̄, then we apply some iterations of MGM_ℓ̄(x_ℓ̄, b_ℓ̄).
[Diagram: recursive call structure of MGM_3 over the levels ℓ = 3, 2, 1, 0, for τ = 1 (V-cycle, left) and τ = 2 (W-cycle, right); filled markers: smoothing, open markers: solve exactly on level ℓ = 0.]
Figure 9.5: Structure of one multigrid iteration
The multigrid approach is not restricted to (nearly) symmetric problems. Multigrid methods can also be used for solving problems which are strongly nonsymmetric (convection-dominated). However, for such problems one usually has to modify the standard multigrid approach. These modifications are discussed in section 9.5.
We will introduce the two-grid and multigrid method by generalizing the approach of section 9.2
to the higher (i.e., two- and three-) dimensional case. We consider the finite element discretization
of scalar elliptic boundary value problems as discussed in section 3.4. Thus the continuous
variational problem is of the form
find u ∈ H^1_0(Ω) such that
k(u, v) = f(v)   for all v ∈ H^1_0(Ω)                                          (9.20)
The coefficients A, b, c are assumed to satisfy the conditions in (3.42). For the discretization of
this problem we use simplicial finite elements. The case with rectangular finite elements can be
treated in a very similar way. Let {T_h} be a regular family of triangulations of Ω consisting of
n-simplices and X^k_{h,0}, k ≥ 1, the corresponding finite element space as in (3.16). The presentation
and implementation of the multigrid method are greatly simplified if we assume a given sequence
of nested finite element spaces.
Assumption 9.3.1 In the remainder of this chapter we always assume that we have a sequence
V_ℓ := X^k_{h_ℓ,0}, ℓ = 0, 1, …, of simplicial finite element spaces which are nested: V_{ℓ-1} ⊂ V_ℓ for all ℓ ≥ 1.
We note that this assumption is not necessary for a successful application of multigrid methods.
For a treatment of multigrid methods in case of non-nestedness we refer to [69] (?). The construction
of a hierarchy of triangulations such that the corresponding finite element spaces are
nested is discussed in chapter ??.
In V_ℓ we use the standard nodal basis (φ_i)_{1≤i≤n_ℓ} as explained in section 3.5. This basis
induces an isomorphism
P_ℓ : R^{n_ℓ} → V_ℓ,   P_ℓ x = Σ_{i=1}^{n_ℓ} x_i φ_i
The Galerkin discretization reads: find u_ℓ ∈ V_ℓ such that k(u_ℓ, v_ℓ) = f(v_ℓ) for all v_ℓ ∈ V_ℓ.
Along the same lines as in the one-dimensional case we introduce a multigrid method for solving
this system of equations on an arbitrary level ℓ ≥ 0.
For the smoother we use one of the basic iterative methods discussed in section 6.2. For this
method we use the notation
x^{k+1} = S_ℓ(x^k, b_ℓ) = x^k − M_ℓ^{-1}(A_ℓ x^k − b_ℓ),   k = 0, 1, …
The corresponding iteration matrix is denoted by
S_ℓ = I − M_ℓ^{-1} A_ℓ
For the prolongation we use the matrix representation of the identity I : V_{ℓ-1} → V_ℓ, i.e.,
p_ℓ := P_ℓ^{-1} P_{ℓ-1}                                                        (9.23)
For the restriction we take
r_ℓ := p_ℓ^T                                                                   (9.24)
This choice is natural in the sense that the Galerkin property r_ℓ A_ℓ p_ℓ = A_{ℓ-1} holds if and only if r_ℓ = p_ℓ^T. Indeed, r_ℓ A_ℓ p_ℓ = A_{ℓ-1} is equivalent to
⟨A_ℓ p_ℓ x, r_ℓ^T y⟩ = ⟨A_{ℓ-1} x, y⟩   for all x, y ∈ R^{n_{ℓ-1}},
i.e., to
k(P_{ℓ-1} x, P_ℓ r_ℓ^T y) = k(P_{ℓ-1} x, P_{ℓ-1} y)   for all x, y ∈ R^{n_{ℓ-1}},
and thus to
P_ℓ r_ℓ^T y = P_{ℓ-1} y   for all y ∈ R^{n_{ℓ-1}},
i.e., r_ℓ^T y = P_ℓ^{-1} P_{ℓ-1} y = p_ℓ y for all y ∈ R^{n_{ℓ-1}}, which means r_ℓ^T = p_ℓ.
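The Galerkin property r_ℓ A_ℓ p_ℓ = A_{ℓ-1} with r_ℓ = p_ℓ^T can be checked numerically for one-dimensional linear finite elements on nested uniform grids, where p_ℓ is nodal interpolation. This is a sketch; the helper names and the small grid sizes are illustrative choices, not the text's notation.

```python
# Check p^T A_l p = A_{l-1} for 1D linear FE: stiffness (1/h) tridiag(-1,2,-1).

def stiffness(n, h):
    # 1D FE stiffness matrix with n interior nodes and mesh size h
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        A[i][i] = 2.0 / h
        if i > 0:
            A[i][i - 1] = -1.0 / h
        if i < n - 1:
            A[i][i + 1] = -1.0 / h
    return A

def prolongation(nc):
    # matrix of the identity V_{l-1} -> V_l in the nodal bases:
    # coarse node j coincides with fine node 2j+1, weights (1/2, 1, 1/2)
    p = [[0.0] * nc for _ in range(2 * nc + 1)]
    for j in range(nc):
        p[2 * j][j] += 0.5
        p[2 * j + 1][j] = 1.0
        p[2 * j + 2][j] += 0.5
    return p

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(c) for c in zip(*X)]

nc, h = 3, 1.0 / 8                      # fine grid: 7 nodes (h), coarse: 3 nodes (2h)
G = matmul(transpose(prolongation(nc)),
           matmul(stiffness(2 * nc + 1, h), prolongation(nc)))   # p^T A_l p
A_coarse = stiffness(nc, 2 * h)                                  # A_{l-1}
```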
procedure MGM_ℓ(x_ℓ, b_ℓ)
if ℓ = 0 then x_0 := A_0^{-1} b_0 else
begin
    x_ℓ := S_ℓ^{ν1}(x_ℓ, b_ℓ)              ( presmoothing )
    d_{ℓ-1} := r_ℓ(b_ℓ − A_ℓ x_ℓ)                                              (9.25)
    e^0_{ℓ-1} := 0;  for i = 1 to τ do e^i_{ℓ-1} := MGM_{ℓ-1}(e^{i-1}_{ℓ-1}, d_{ℓ-1});
    x_ℓ := x_ℓ + p_ℓ e^τ_{ℓ-1}
    x_ℓ := S_ℓ^{ν2}(x_ℓ, b_ℓ)              ( postsmoothing )
    MGM_ℓ := x_ℓ
end;
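The recursive structure of this procedure can be sketched in runnable form for the 1D model problem −u'' = f with A_ℓ = h_ℓ^{-2} tridiag(−1, 2, −1), n_ℓ = 2^{ℓ+1} − 1 and h_ℓ = 2^{-(ℓ+1)}. Damped Jacobi stands in for S_ℓ, linear interpolation for p_ℓ and full weighting for r_ℓ; all names and parameter choices are illustrative, not prescribed by the text.

```python
# Recursive multigrid cycle in the spirit of (9.25), matrix-free 1D sketch.

def apply_A(x, h):
    # matrix-vector product with h^{-2} tridiag(-1, 2, -1)
    n = len(x)
    return [(2.0 * x[i]
             - (x[i - 1] if i > 0 else 0.0)
             - (x[i + 1] if i < n - 1 else 0.0)) / h**2 for i in range(n)]

def jacobi(x, b, h, nu, omega=2.0 / 3.0):
    # nu damped Jacobi steps; diagonal of A is 2/h^2
    for _ in range(nu):
        Ax = apply_A(x, h)
        x = [x[i] + omega * h**2 / 2.0 * (b[i] - Ax[i]) for i in range(len(x))]
    return x

def restrict(d):
    # full weighting; coarse node i sits at fine node 2i+1
    return [0.25 * d[2 * i] + 0.5 * d[2 * i + 1] + 0.25 * d[2 * i + 2]
            for i in range((len(d) - 1) // 2)]

def prolong(e):
    # linear interpolation back to the fine grid
    x = [0.0] * (2 * len(e) + 1)
    for i, v in enumerate(e):
        x[2 * i + 1] = v
        x[2 * i] += 0.5 * v
        x[2 * i + 2] += 0.5 * v
    return x

def mgm(level, x, b, nu1=1, nu2=1, tau=1):
    # tau = 1 gives the V-cycle, tau = 2 the W-cycle
    h = 2.0 ** (-(level + 1))
    if level == 0:
        return [b[0] * h * h / 2.0]        # exact solve: A_0 = 2/h^2, 1 unknown
    x = jacobi(x, b, h, nu1)               # presmoothing
    Ax = apply_A(x, h)
    d = restrict([b[i] - Ax[i] for i in range(len(b))])
    e = [0.0] * len(d)
    for _ in range(tau):                   # tau recursive coarse-grid calls
        e = mgm(level - 1, e, d, nu1, nu2, tau)
    x = [x[i] + v for i, v in enumerate(prolong(e))]
    return jacobi(x, b, h, nu2)            # postsmoothing
```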
We briefly comment on some important issues related to this multigrid method.
Smoothers
For many problems basic iterative methods provide good smoothers. In particular the Gauss
Seidel method is often a very effective smoother. Other smoothers used in practice are the
damped Jacobi method and the ILU method.
VMG_ℓ ≲ ((ν + 2)/(1 − g)) WU_ℓ                                                 (9.28)
In the last inequality we assumed that the costs for computing x_0 = A_0^{-1} b_0 (i.e., VMG_0) are
negligible compared to WU_ℓ. The result in (9.28) shows that the arithmetic costs for one V-cycle
are proportional (for fixed ν) to the costs of a residual computation. For example, for g = 1/8
(uniform refinement in 3D) the arithmetic costs of a V-cycle with ν1 = ν2 = 1 on level ℓ are
comparable to 4½ times the costs of a residual computation on level ℓ.
For the W-cycle (τ = 2) the arithmetic costs on level ℓ are denoted by WMG_ℓ. We have:
WMG_ℓ ≲ ν WU_ℓ + 2 WU_ℓ + 2 WMG_{ℓ-1} = (ν + 2) WU_ℓ + 2 WMG_{ℓ-1}
      ≲ (ν + 2)(WU_ℓ + 2 WU_{ℓ-1} + 2² WU_{ℓ-2} + … + 2^{ℓ-1} WU_1) + 2^ℓ WMG_0
      ≲ (ν + 2)(1 + 2g + (2g)² + … + (2g)^{ℓ-1}) WU_ℓ + 2^ℓ WMG_0
From this we see that to obtain a bound proportional to WU_ℓ we have to assume
g < 1/2
Under this assumption we get for the W-cycle
WMG_ℓ ≲ ((ν + 2)/(1 − 2g)) WU_ℓ
(again we neglected 2^ℓ WMG_0). Similar bounds can be obtained for τ ≥ 3, provided τ g < 1 holds.
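The arithmetic behind these two bounds can be evaluated for concrete numbers, e.g. uniform refinement in 3D (g = 1/8) with ν1 = ν2 = 1, i.e. ν = 2 (a sketch; the function names are illustrative):

```python
# V-cycle and W-cycle work bounds, in units of one residual computation WU_l.

def v_cycle_bound(nu, g):
    # cf. the V-cycle bound (nu + 2)/(1 - g)
    return (nu + 2) / (1.0 - g)

def w_cycle_bound(nu, g):
    # W-cycle bound (nu + 2)/(1 - 2g); needs g < 1/2 for the geometric series
    assert 2 * g < 1
    return (nu + 2) / (1.0 - 2 * g)

v = v_cycle_bound(2, 1.0 / 8)   # about 4.57, the "4 1/2" of the text
w = w_cycle_bound(2, 1.0 / 8)   # about 5.33
```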
9.4.1 Introduction
One easily verifies that the two-grid method is a linear iterative method. The iteration matrix
of this method with ν1 presmoothing and ν2 postsmoothing iterations on level ℓ is given by
C_{TG,ℓ} = C_{TG,ℓ}(ν2, ν1) = S_ℓ^{ν2}(I − p_ℓ A_{ℓ-1}^{-1} r_ℓ A_ℓ) S_ℓ^{ν1}          (9.29)
with S_ℓ = I − M_ℓ^{-1} A_ℓ the iteration matrix of the smoother.
Theorem 9.4.1 The multigrid method (9.25) is a linear iterative method with iteration matrix
C_{MG,ℓ} given by
C_{MG,0} = 0                                                                   (9.30a)
C_{MG,ℓ} = S_ℓ^{ν2}(I − p_ℓ(I − C_{MG,ℓ-1}^τ) A_{ℓ-1}^{-1} r_ℓ A_ℓ) S_ℓ^{ν1}           (9.30b)
         = C_{TG,ℓ} + S_ℓ^{ν2} p_ℓ C_{MG,ℓ-1}^τ A_{ℓ-1}^{-1} r_ℓ A_ℓ S_ℓ^{ν1},  ℓ = 1, 2, …   (9.30c)
Proof. The result in (9.30a) is trivial. The result in (9.30c) follows from (9.30b) and the
definition of C_{TG,ℓ}. We now prove the result in (9.30b) by induction. For ℓ = 1 it follows from
(9.30a) and (9.29). Assume that the result is correct for ℓ − 1. Then MGM_{ℓ-1}(y_{ℓ-1}, z_{ℓ-1}) defines
a linear iterative method and for arbitrary y_{ℓ-1}, z_{ℓ-1} ∈ R^{n_{ℓ-1}} we have
MGM_{ℓ-1}(y_{ℓ-1}, z_{ℓ-1}) − A_{ℓ-1}^{-1} z_{ℓ-1} = C_{MG,ℓ-1}(y_{ℓ-1} − A_{ℓ-1}^{-1} z_{ℓ-1})   (9.31)
One iteration of MGM_ℓ applied to x_old can be written as
x¹ := S_ℓ^{ν1}(x_old, b_ℓ)
x² := x¹ + p_ℓ MGM_{ℓ-1}^τ(0, r_ℓ(b_ℓ − A_ℓ x¹))
x_new := S_ℓ^{ν2}(x², b_ℓ)
From this we get, with x*_ℓ := A_ℓ^{-1} b_ℓ and z_{ℓ-1} := r_ℓ(b_ℓ − A_ℓ x¹) = r_ℓ A_ℓ(x*_ℓ − x¹):
x_new − x*_ℓ = S_ℓ^{ν2}(x² − x*_ℓ)
  = S_ℓ^{ν2}( x¹ − x*_ℓ + p_ℓ MGM_{ℓ-1}^τ(0, z_{ℓ-1}) )
  = S_ℓ^{ν2}( x¹ − x*_ℓ + p_ℓ(A_{ℓ-1}^{-1} z_{ℓ-1} − C_{MG,ℓ-1}^τ A_{ℓ-1}^{-1} z_{ℓ-1}) )
  = S_ℓ^{ν2}( I − p_ℓ(I − C_{MG,ℓ-1}^τ) A_{ℓ-1}^{-1} r_ℓ A_ℓ )(x¹ − x*_ℓ)
  = S_ℓ^{ν2}( I − p_ℓ(I − C_{MG,ℓ-1}^τ) A_{ℓ-1}^{-1} r_ℓ A_ℓ ) S_ℓ^{ν1}(x_old − x*_ℓ)
The convergence analysis will be based on the following splitting of the two-grid iteration matrix,
with ν2 = 0, i.e. no postsmoothing:
‖C_{TG,ℓ}(0, ν1)‖_2 = ‖(I − p_ℓ A_{ℓ-1}^{-1} r_ℓ A_ℓ) S_ℓ^{ν1}‖_2
  ≤ ‖A_ℓ^{-1} − p_ℓ A_{ℓ-1}^{-1} r_ℓ‖_2 ‖A_ℓ S_ℓ^{ν1}‖_2                              (9.32)
We formulate three results that will be used in the analysis further on. First we recall the global
inverse inequality that is proved in lemma 3.3.11:
‖v_ℓ‖_1 ≤ c h_ℓ^{-1} ‖v_ℓ‖_{L2}   for all v_ℓ ∈ V_ℓ
with a constant c independent of ℓ. Note that for this result we need assumption 9.4.2.
We now show that, apart from a scaling factor, the isomorphism P_ℓ : (R^{n_ℓ}, ⟨·,·⟩) → (V_ℓ, ⟨·,·⟩_{L2})
and its inverse are uniformly (w.r.t. ℓ) bounded:
Lemma 9.4.3 There exist constants c1 > 0 and c2 independent of ℓ such that
c1 ‖P_ℓ x‖_{L2} ≤ h_ℓ^{n/2} ‖x‖_2 ≤ c2 ‖P_ℓ x‖_{L2}   for all x ∈ R^{n_ℓ}           (9.33)
Proof. Let M_ℓ be the mass matrix, i.e., (M_ℓ)_{ij} = ⟨φ_i, φ_j⟩_{L2}. Note the basic equality ‖P_ℓ x‖²_{L2} = ⟨M_ℓ x, x⟩ for all x ∈ R^{n_ℓ}.
From this and from the sparsity of M_ℓ we obtain
d2 h_ℓ^n ≤ (M_ℓ)_{ii} ≤ ‖M_ℓ‖_2 ≤ ‖M_ℓ‖_∞ ≤ d1 h_ℓ^n                           (9.35)
which proves the first inequality in (9.33). We now use corollary 3.5.10. This yields λ_min(M_ℓ) ≥
c λ_max(M_ℓ) with a strictly positive constant c independent of ℓ. Thus we have
‖P_ℓ x‖²_{L2} = ⟨M_ℓ x, x⟩ ≥ λ_min(M_ℓ) ‖x‖_2² ≥ c h_ℓ^n ‖x‖_2²,
which proves the second inequality in (9.33).
The third preliminary result concerns the scaling of the stiffness matrix:
Lemma 9.4.4 Let A_ℓ be the stiffness matrix as in (9.22). Assume that the bilinear form is
such that the usual conditions (3.42) are satisfied. Then there exist constants c1 > 0 and c2
independent of ℓ such that
c1 h_ℓ^{n-2} ≤ ‖A_ℓ‖_2 ≤ c2 h_ℓ^{n-2}
Proof. First note that
‖A_ℓ‖_2 = max_{x,y∈R^{n_ℓ}} ⟨A_ℓ x, y⟩ / (‖x‖_2 ‖y‖_2)
Using the result in lemma 9.4.3, the continuity of the bilinear form and the inverse inequality
we get
max_{x,y∈R^{n_ℓ}} ⟨A_ℓ x, y⟩/(‖x‖_2 ‖y‖_2) ≤ c h_ℓ^n max_{v_ℓ,w_ℓ∈V_ℓ} k(v_ℓ, w_ℓ)/(‖v_ℓ‖_{L2} ‖w_ℓ‖_{L2})
  ≤ c h_ℓ^n max_{v_ℓ,w_ℓ∈V_ℓ} ‖v_ℓ‖_1 ‖w_ℓ‖_1/(‖v_ℓ‖_{L2} ‖w_ℓ‖_{L2}) ≤ c h_ℓ^{n-2}
and thus the upper bound is proved. The lower bound follows from
max_{x,y∈R^{n_ℓ}} ⟨A_ℓ x, y⟩/(‖x‖_2 ‖y‖_2) ≥ max_{1≤i≤n_ℓ} ⟨A_ℓ e_i, e_i⟩ = max_{1≤i≤n_ℓ} k(φ_i, φ_i) ≥ c ‖φ_i‖_1² ≥ c h_ℓ^{n-2}
The last inequality can be shown by using for T ⊂ supp(φ_i) the affine transformation from the
unit simplex to T.
with constants c1 > 0 and c2 independent of ℓ. We now formulate a main result for the convergence
analysis of multigrid methods:
‖A_ℓ^{-1} − p_ℓ A_{ℓ-1}^{-1} r_ℓ‖_2 ≤ C_A ‖A_ℓ‖_2^{-1}   for ℓ = 1, 2, …             (9.37)
Proof. Let b_ℓ ∈ R^{n_ℓ} be given. The constants in the proof are independent of b_ℓ and of ℓ.
Consider the variational problems:
Then
A_ℓ^{-1} b_ℓ = P_ℓ^{-1} u_ℓ   and   A_{ℓ-1}^{-1} r_ℓ b_ℓ = P_{ℓ-1}^{-1} u_{ℓ-1}
Now we apply theorem 3.4.5 and use the H²-regularity of the problem. This yields
Now we combine (9.38) with (9.39) and use (9.36). Then we get
‖(A_ℓ^{-1} − p_ℓ A_{ℓ-1}^{-1} r_ℓ) b_ℓ‖_2 ≤ c h_ℓ^{2-n} ‖b_ℓ‖_2
Note that in the proof of the approximation property we use the underlying continuous problem.
‖A_ℓ S_ℓ^ν‖_2 ≤ g(ν) ‖A_ℓ‖_2
where g(ν) is a monotonically decreasing function with lim_{ν→∞} g(ν) = 0. In the first part of
this section we derive results for the case that A_ℓ is symmetric positive definite. In the second
part we discuss the general case.
Lemma 9.4.6 Let B ∈ R^{m×m} be a symmetric positive definite matrix with σ(B) ⊂ (0, 1]. Then
we have
‖B(I − B)^ν‖_2 ≤ 1/(2(ν + 1))   for ν = 1, 2, …
Proof. Note that
‖B(I − B)^ν‖_2 = max_{x∈(0,1]} x(1 − x)^ν = ν^ν/(ν + 1)^{ν+1} = (1/(ν + 1)) (ν/(ν + 1))^ν
A simple computation shows that ν ↦ (ν/(ν + 1))^ν is decreasing on [1, ∞), so (ν/(ν + 1))^ν ≤ 1/2 for ν ≥ 1.
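The scalar maximum in this proof is easy to verify numerically. The brute-force grid search below is an illustrative check only, not part of the text's argument:

```python
# max_{x in (0,1]} x(1-x)^nu = nu^nu/(nu+1)^(nu+1) <= 1/(2(nu+1)) for nu >= 1.

def grid_max(nu, steps=100000):
    # evaluate x(1-x)^nu on a fine grid of (0,1]
    return max((k / steps) * (1.0 - k / steps) ** nu
               for k in range(1, steps + 1))

def exact_max(nu):
    return nu**nu / float((nu + 1) ** (nu + 1))
```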
Below for a few basic iterative methods we derive the smoothing property for the symmetric
case, i.e., b = 0 in the bilinear form k(·, ·). We first consider the Richardson method:
Theorem 9.4.7 Assume that in the bilinear form we have b = 0 and that the usual conditions
(3.42) are satisfied. Let A_ℓ be the stiffness matrix in (9.22). For c0 ∈ (0, 1] the smoothing
property
‖A_ℓ (I − (c0/ρ(A_ℓ)) A_ℓ)^ν‖_2 ≤ (1/(2 c0 (ν + 1))) ‖A_ℓ‖_2,   ν = 1, 2, …
holds.
Proof. Note that A_ℓ is symmetric positive definite. Apply lemma 9.4.6 with B := ω_ℓ A_ℓ,
ω_ℓ := c0 ρ(A_ℓ)^{-1}. This yields
‖A_ℓ (I − ω_ℓ A_ℓ)^ν‖_2 ≤ ω_ℓ^{-1} · 1/(2(ν + 1)) = ρ(A_ℓ)/(2 c0 (ν + 1)) = (1/(2 c0 (ν + 1))) ‖A_ℓ‖_2
and thus the result is proved.
‖A_ℓ (I − ω D_ℓ^{-1} A_ℓ)^ν‖_2 ≤ ‖D_ℓ‖_2/(2ω(ν + 1)) ≤ (1/(2ω(ν + 1))) ‖A_ℓ‖_2
and thus the result is proved.
Remark 9.4.9 The value of the parameter ω used in theorem 9.4.8 is such that ω ρ(D_ℓ^{-1} A_ℓ) =
ω ρ(D_ℓ^{-1/2} A_ℓ D_ℓ^{-1/2}) ≤ 1 holds. Note that
ρ(D_ℓ^{-1/2} A_ℓ D_ℓ^{-1/2}) = max_{x∈R^{n_ℓ}} ⟨A_ℓ x, x⟩/⟨D_ℓ x, x⟩ ≥ max_{1≤i≤n_ℓ} ⟨A_ℓ e_i, e_i⟩/⟨D_ℓ e_i, e_i⟩ = 1
and thus we have ω ≤ 1. This is in agreement with the fact that in multigrid methods one
usually uses a damped Jacobi method as a smoother.
We finally consider the symmetric Gauss–Seidel method. This method is the same as the SSOR
method with parameter value ω = 1. Thus it follows from (6.18) that this method has an
iteration matrix
S_ℓ = I − M_ℓ^{-1} A_ℓ,   M_ℓ = (D_ℓ − L_ℓ) D_ℓ^{-1} (D_ℓ − L_ℓ^T),                   (9.41)
where we use the decomposition A_ℓ = D_ℓ − L_ℓ − L_ℓ^T with D_ℓ a diagonal matrix and L_ℓ a strictly
lower triangular matrix.
Theorem 9.4.10 Assume that in the bilinear form we have b = 0 and that the usual conditions
(3.42) are satisfied. Let A_ℓ be the stiffness matrix in (9.22) and M_ℓ as in (9.41). The smoothing
property
‖A_ℓ (I − M_ℓ^{-1} A_ℓ)^ν‖_2 ≤ (c/(ν + 1)) ‖A_ℓ‖_2,   ν = 1, 2, …
holds.
For the symmetric positive definite case smoothing properties have also been proved for other
iterative methods. For example, in [98, 97] a smoothing property is proved for a variant of the
ILU method and in [24] it is shown that the SPAI (sparse approximate inverse) preconditioner
satisfies a smoothing property.
Lemma 9.4.11 Let ‖·‖ be any induced matrix norm and assume that for B ∈ R^{m×m} the
inequality ‖B‖ ≤ 1 holds. Then we have
‖(I − B)(I + B)^ν‖ ≤ 2^{ν+1} √(2/(πν)),   for ν = 1, 2, …
Proof. From the binomial expansion we obtain, with C(ν, k) the binomial coefficient,
(I − B)(I + B)^ν = I − B^{ν+1} + Σ_{k=1}^{ν} ( C(ν,k) − C(ν,k−1) ) B^k
This yields
‖(I − B)(I + B)^ν‖ ≤ 2 + Σ_{k=1}^{ν} | C(ν,k) − C(ν,k−1) |
Using C(ν,k) ≥ C(ν,k−1) ⇔ k ≤ ½(ν + 1) we get (with [·] the round-down operator):
Σ_{k=1}^{ν} | C(ν,k) − C(ν,k−1) |
  = Σ_{k=1}^{[½(ν+1)]} ( C(ν,k) − C(ν,k−1) ) + Σ_{k=[½(ν+1)]+1}^{ν} ( C(ν,k−1) − C(ν,k) )
  = ( C(ν,[½(ν+1)]) − 1 ) + ( C(ν,[½(ν+1)]) − 1 ) = 2 C(ν,[½ν]) − 2
Combining this with the inequality C(ν,[½ν]) ≤ 2^ν √(2/(πν)) completes the proof.
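Both combinatorial facts used here can be checked numerically (an illustrative sketch; the range of ν is arbitrary):

```python
# sum_{k=1}^{nu} |C(nu,k) - C(nu,k-1)| = 2 C(nu, [nu/2]) - 2, and the central
# binomial bound C(nu, [nu/2]) <= 2^nu sqrt(2/(pi nu)).
from math import comb, sqrt, pi

def abs_diff_sum(nu):
    # left-hand side of the identity above
    return sum(abs(comb(nu, k) - comb(nu, k - 1)) for k in range(1, nu + 1))
```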
Corollary 9.4.12 Let ‖·‖ be any induced matrix norm. Assume that for a linear iterative
method with iteration matrix I − M_ℓ^{-1} A_ℓ we have
‖I − M_ℓ^{-1} A_ℓ‖ ≤ 1                                                         (9.43)
Then for S_ℓ := I − ½ M_ℓ^{-1} A_ℓ the following smoothing property holds:
‖A_ℓ S_ℓ^ν‖ ≤ 2 √(2/(πν)) ‖M_ℓ‖,   ν = 1, 2, …
Proof. Define B_ℓ := I − M_ℓ^{-1} A_ℓ and apply lemma 9.4.11:
‖A_ℓ S_ℓ^ν‖ = 2^{-ν} ‖M_ℓ (I − B_ℓ)(I + B_ℓ)^ν‖ ≤ 2^{-ν} ‖M_ℓ‖ · 2^{ν+1} √(2/(πν)) = 2 √(2/(πν)) ‖M_ℓ‖
Remark 9.4.13 Note that in the smoother in corollary 9.4.12 we use damping with a factor ½.
Generalizations of the results in lemma 9.4.11 and corollary 9.4.12 are given in [66, 49, 32]. In
[66, 32] it is shown that the damping factor ½ can be replaced by an arbitrary damping factor
ω ∈ (0, 1). Also note that in the smoothing property in corollary 9.4.12 we have a dependence
of the form ν^{-1/2}, whereas in the symmetric case this is of the form ν^{-1}. In [49] it is noted that
this loss of a factor ν^{1/2} when going to the nonsymmetric case is due to the fact that complex
eigenvalues may occur. Assume that M_ℓ^{-1} A_ℓ is a normal matrix. The assumption (9.43) implies
that σ(M_ℓ^{-1} A_ℓ) ⊂ K := { z ∈ C : |1 − z| ≤ 1 }. We have:
‖M_ℓ^{-1} A_ℓ (I − ½ M_ℓ^{-1} A_ℓ)^ν‖_2² = max_{z∈σ(M_ℓ^{-1}A_ℓ)} |z(1 − ½z)^ν|² ≤ max_{z∈K} |z(1 − ½z)^ν|²
  = max_{θ∈[0,2π]} |1 + e^{iθ}|² |½ − ½ e^{iθ}|^{2ν}
  = max_{θ∈[0,2π]} 4 (½ + ½ cos θ)(½ − ½ cos θ)^ν
  = max_{ξ∈[0,1]} 4 (1 − ξ) ξ^ν = (4/(ν + 1)) (ν/(ν + 1))^ν
Note that the latter function of ν also occurs in the proof of lemma 9.4.6. We conclude that for
the class of normal matrices M_ℓ^{-1} A_ℓ an estimate of the form
‖M_ℓ^{-1} A_ℓ (I − ½ M_ℓ^{-1} A_ℓ)^ν‖_2 ≤ c/√(ν + 1),   ν = 1, 2, …
holds.
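The scalar maximum appearing in this remark can be checked numerically; the grid search below is an illustrative sketch only:

```python
# max_{xi in [0,1]} 4(1-xi) xi^nu = (4/(nu+1)) (nu/(nu+1))^nu; its square root
# decays like c/sqrt(nu+1), matching the estimate in the remark.
from math import sqrt

def grid_max_normal(nu, steps=100000):
    return max(4.0 * (1.0 - k / steps) * (k / steps) ** nu
               for k in range(steps + 1))

def exact_max_normal(nu):
    return 4.0 / (nu + 1) * (nu / (nu + 1.0)) ** nu
```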
To verify the condition in (9.43) we will use the following elementary result:
We now use these results to derive a smoothing property for the Richardson method.
Theorem 9.4.15 Assume that the bilinear form satisfies the usual conditions (3.42). Let A_ℓ
be the stiffness matrix in (9.22). There exist constants ω > 0 and c independent of ℓ such that
the following smoothing property holds:
‖A_ℓ (I − ω h_ℓ^{2-n} A_ℓ)^ν‖_2 ≤ (c/√ν) ‖A_ℓ‖_2,   ν = 1, 2, …
Proof. Using lemma 9.4.3, the inverse inequality and the ellipticity of the bilinear form we
get, for arbitrary x ∈ R^{n_ℓ}:
‖A_ℓ x‖_2 = max_{y∈R^{n_ℓ}} ⟨A_ℓ x, y⟩/‖y‖_2 ≤ c h_ℓ^{n/2} max_{v_ℓ∈V_ℓ} k(P_ℓ x, v_ℓ)/‖v_ℓ‖_{L2}
  ≤ c h_ℓ^{n/2} max_{v_ℓ∈V_ℓ} ‖P_ℓ x‖_1 ‖v_ℓ‖_1/‖v_ℓ‖_{L2} ≤ c h_ℓ^{n/2-1} ‖P_ℓ x‖_1
  ≤ c h_ℓ^{n/2-1} k(P_ℓ x, P_ℓ x)^{1/2} = c h_ℓ^{n/2-1} ⟨A_ℓ x, x⟩^{1/2}
From this and lemma 9.4.14 it follows that there exists a constant ω > 0 such that
‖I − 2ω h_ℓ^{2-n} A_ℓ‖_2 ≤ 1   for all ℓ                                       (9.44)
Define M_ℓ := (1/(2ω)) h_ℓ^{n-2} I. From lemma 9.4.4 it follows that there exists a constant c_M independent
of ℓ such that ‖M_ℓ‖_2 ≤ c_M ‖A_ℓ‖_2. Application of corollary 9.4.12 proves the result of the theorem.
Theorem 9.4.16 Assume that the bilinear form satisfies the usual conditions (3.42). Let A_ℓ be
the stiffness matrix in (9.22) and D_ℓ = diag(A_ℓ). There exist constants ω > 0 and c independent
of ℓ such that the following smoothing property holds:
‖A_ℓ (I − ω D_ℓ^{-1} A_ℓ)^ν‖_2 ≤ (c/√ν) ‖A_ℓ‖_2,   ν = 1, 2, …
Proof. We use the matrix norm induced by the vector norm ‖y‖_D := ‖D_ℓ^{1/2} y‖_2 for y ∈ R^{n_ℓ}.
Note that for B ∈ R^{n_ℓ×n_ℓ} we have ‖B‖_D = ‖D_ℓ^{1/2} B D_ℓ^{-1/2}‖_2. The inequalities
‖D_ℓ^{-1}‖_2 ≤ c1 h_ℓ^{2-n},   κ(D_ℓ) ≤ c2                                     (9.45)
hold with constants c1, c2 independent of ℓ. Using this in combination with lemma 9.4.3, the
inverse inequality and the ellipticity of the bilinear form we get, for arbitrary x ∈ R^{n_ℓ}:
‖D_ℓ^{-1/2} A_ℓ D_ℓ^{-1/2} x‖_2 = max_{y∈R^{n_ℓ}} ⟨A_ℓ D_ℓ^{-1/2} x, D_ℓ^{-1/2} y⟩/‖y‖_2 = max_{y∈R^{n_ℓ}} k(P_ℓ D_ℓ^{-1/2} x, P_ℓ D_ℓ^{-1/2} y)/‖y‖_2
  ≤ c h_ℓ^{-1} max_{y∈R^{n_ℓ}} ‖P_ℓ D_ℓ^{-1/2} x‖_1 ‖P_ℓ D_ℓ^{-1/2} y‖_{L2}/‖y‖_2
  ≤ c h_ℓ^{n/2-1} ‖P_ℓ D_ℓ^{-1/2} x‖_1 ‖D_ℓ^{-1/2}‖_2 ≤ c ‖P_ℓ D_ℓ^{-1/2} x‖_1
  ≤ c k(P_ℓ D_ℓ^{-1/2} x, P_ℓ D_ℓ^{-1/2} x)^{1/2} = c ⟨D_ℓ^{-1/2} A_ℓ D_ℓ^{-1/2} x, x⟩^{1/2}
From this and lemma 9.4.14 it follows that there exists a constant ω > 0 such that
‖I − 2ω D_ℓ^{-1} A_ℓ‖_D = ‖I − 2ω D_ℓ^{-1/2} A_ℓ D_ℓ^{-1/2}‖_2 ≤ 1   for all ℓ
Define M_ℓ := (1/(2ω)) D_ℓ. Application of corollary 9.4.12 with ‖·‖ = ‖·‖_D in combination with (9.45)
yields
‖A_ℓ (I − ω D_ℓ^{-1} A_ℓ)^ν‖_2 ≤ κ(D_ℓ^{1/2}) ‖A_ℓ (I − ½ M_ℓ^{-1} A_ℓ)^ν‖_D
  ≤ κ(D_ℓ^{1/2}) (c/√ν) ‖M_ℓ‖_D = (c/√ν) ‖D_ℓ‖_2 ≤ (c/√ν) ‖A_ℓ‖_2
and thus the result is proved.
Lemma 9.4.17 Consider the Richardson method as in theorem 9.4.7 or theorem 9.4.15. In
both cases (9.46) holds with CS = 1.
Lemma 9.4.18 Consider the damped Jacobi method as in theorem 9.4.8 or theorem 9.4.16. In
both cases (9.46) holds.
and thus
‖S_ℓ^ν‖_2 = ‖D_ℓ^{-1/2} (D_ℓ^{1/2} S_ℓ D_ℓ^{-1/2})^ν D_ℓ^{1/2}‖_2 ≤ κ(D_ℓ^{1/2}) ‖S_ℓ‖_D^ν ≤ κ(D_ℓ^{1/2})
Now note that D_ℓ is uniformly (w.r.t. ℓ) well-conditioned.
Using lemma 9.4.3 it follows that for p_ℓ = P_ℓ^{-1} P_{ℓ-1} we have
C_{p,1} ‖x‖_2 ≤ ‖p_ℓ x‖_2 ≤ C_{p,2} ‖x‖_2   for all x ∈ R^{n_{ℓ-1}}            (9.47)
Theorem 9.4.19 Consider the multigrid method with iteration matrix given in (9.30) and parameter
values ν2 = 0, ν1 = ν > 0, τ ≥ 2. Assume that there are constants C_A, C_S and a
monotonically decreasing function g(ν) with g(ν) → 0 for ν → ∞ such that for all ℓ:
‖A_ℓ^{-1} − p_ℓ A_{ℓ-1}^{-1} r_ℓ‖_2 ≤ C_A ‖A_ℓ‖_2^{-1}                         (9.48a)
‖A_ℓ S_ℓ^ν‖_2 ≤ g(ν) ‖A_ℓ‖_2,   ν ≥ 1                                          (9.48b)
‖S_ℓ^ν‖_2 ≤ C_S,   ν ≥ 1                                                       (9.48c)
Then for ν sufficiently large there exists a constant ξ* < 1, independent of ℓ, such that
‖C_{MG,ℓ}‖_2 ≤ ξ*,   ℓ = 0, 1, …
holds.
For the two-grid iteration matrix we have
‖C_{TG,ℓ}‖_2 ≤ ‖A_ℓ^{-1} − p_ℓ A_{ℓ-1}^{-1} r_ℓ‖_2 ‖A_ℓ S_ℓ^ν‖_2 ≤ C_A g(ν)
Remark 9.4.20 Consider A_ℓ, p_ℓ, r_ℓ as defined in (9.22), (9.23), (9.24). Assume that the variational
problem (9.20) is such that the usual conditions (3.42) are satisfied. Moreover, the problem
(9.20) and the corresponding dual problem are assumed to be H²-regular. In the multigrid
method we use the Richardson or the damped Jacobi method described in section 9.4.3. Then
the assumptions (9.48) are fulfilled and thus for ν2 = 0 and ν1 sufficiently large the multigrid
W-cycle has a contraction number smaller than one independent of ℓ.
Remark 9.4.21 Let C_{MG,ℓ}(ν2, ν1) be the iteration matrix of the multigrid method with ν1 pre-
and ν2 postsmoothing iterations. With ν := ν1 + ν2 we have
ρ(C_{MG,ℓ}(ν2, ν1)) = ρ(C_{MG,ℓ}(0, ν)) ≤ ‖C_{MG,ℓ}(0, ν)‖_2
Using theorem 9.4.19 we thus get, for τ ≥ 2, a bound for the spectral radius of the iteration
matrix C_{MG,ℓ}(ν2, ν1).
Remark 9.4.22 Note on other convergence analyses. Xu, Yserentant (quasiuniformity not
needed in BPX). Comment on regularity. Book Bramble.
Richardson method:     M_ℓ = c0^{-1} ρ(A_ℓ) I,   c0 ∈ (0, 1]                   (9.49a)
Damped Jacobi:         M_ℓ = ω^{-1} D_ℓ,   ω as in thm. 9.4.8                  (9.49b)
Symm. Gauss–Seidel:    M_ℓ = (D_ℓ − L_ℓ) D_ℓ^{-1} (D_ℓ − L_ℓ^T)                (9.49c)
For symmetric matrices B, C ∈ R^{m×m} we use the notation B ≤ C iff ⟨Bx, x⟩ ≤ ⟨Cx, x⟩ for all
x ∈ R^m.
Lemma 9.4.24 For M_ℓ as in (9.49) the following properties hold:
A_ℓ ≤ M_ℓ                                                                      (9.50a)
‖M_ℓ‖_2 ≤ C_M ‖A_ℓ‖_2                                                          (9.50b)
Proof. For the Richardson method the result is trivial. For the damped Jacobi method we
have ω ∈ (0, ρ(D_ℓ^{-1} A_ℓ)^{-1}] and thus ρ(ω D_ℓ^{-1/2} A_ℓ D_ℓ^{-1/2}) ≤ 1. This yields A_ℓ ≤ ω^{-1} D_ℓ = M_ℓ.
The result in (9.50b) follows from ‖D_ℓ‖_2 ≤ ‖A_ℓ‖_2. For the symmetric Gauss–Seidel method the
result (9.50a) follows from M_ℓ = A_ℓ + L_ℓ D_ℓ^{-1} L_ℓ^T and the result in (9.50b) is proved in (9.42).
We note that the standard approximation property (9.37) implies the result (9.51) if we consider
the smoothers in (9.49):
Lemma 9.4.25 Consider M_ℓ as in (9.49) and assume that the approximation property (9.37)
holds. Then (9.51) holds with C̃_A = C_M C_A.
Proof. Trivial.
One easily verifies that for the smoothers in (9.49) the modified approximation property (9.51)
implies the standard approximation property (9.37) if κ(M_ℓ) is uniformly (w.r.t. ℓ) bounded.
The latter property holds for the Richardson and the damped Jacobi method.
We will analyze the convergence of the two-grid and multigrid method using the energy scalar
product ⟨x, y⟩_A := ⟨A_ℓ x, y⟩. For matrices B, C ∈ R^{n_ℓ×n_ℓ} that are symmetric w.r.t. ⟨·,·⟩_A we use the notation
B ≤_A C iff ⟨Bx, x⟩_A ≤ ⟨Cx, x⟩_A for all x ∈ R^{n_ℓ}. Note that B ∈ R^{n_ℓ×n_ℓ} is symmetric w.r.t.
⟨·,·⟩_A iff (A_ℓ B)^T = A_ℓ B holds. We also note the following elementary property for symmetric
matrices B, C ∈ R^{n_ℓ×n_ℓ}:
B ≤ C ⇔ B A_ℓ ≤_A C A_ℓ                                                        (9.52)
We now turn to the two-grid method. For the coarse grid correction we introduce the notation
Q_ℓ := I − p_ℓ A_{ℓ-1}^{-1} r_ℓ A_ℓ. For symmetry reasons we only consider ν1 = ν2 = ν/2 with ν > 0 even.
The iteration matrix of the two-grid method is given by
C_{TG,ℓ} = C_{TG,ℓ}(ν) = S_ℓ^{ν/2} Q_ℓ S_ℓ^{ν/2}
Due to the symmetric positive definite setting we have the following fundamental property:
The next lemma gives another characterization of the modified approximation property:
0 ≤_A Q_ℓ ≤_A C̃_A M_ℓ^{-1} A_ℓ   for ℓ = 1, 2, …                               (9.54)
Theorem 9.4.28 Assume that (9.50a) and (9.51) hold. Then we have
‖C_{TG,ℓ}(ν)‖_A ≤ max_{y∈[0,1]} min{C̃_A y, 1} (1 − y)^ν
  = (1 − C̃_A^{-1})^ν                 if C̃_A ≥ ν + 1
  = C̃_A ν^ν/(ν + 1)^{ν+1}            if C̃_A ≤ ν + 1                           (9.55)
Proof. Define X_ℓ := M_ℓ^{-1} A_ℓ. This matrix is symmetric w.r.t. the energy scalar product and
from (9.50a) it follows that
0 ≤_A X_ℓ ≤_A I                                                                (9.56)
holds. From lemma 9.4.27 we obtain 0 ≤_A Q_ℓ ≤_A C̃_A X_ℓ. Note that due to this, (9.56) and the
fact that Q_ℓ is an A-orthogonal projection which is not identically zero we get
C̃_A ≥ 1                                                                       (9.57)
Since also Q_ℓ ≤_A I, we have, for every λ ∈ [0, 1],
C_{TG,ℓ}(ν) = S_ℓ^{ν/2} Q_ℓ S_ℓ^{ν/2} ≤_A S_ℓ^{ν/2} ( λ C̃_A X_ℓ + (1 − λ) I ) S_ℓ^{ν/2}
and thus
‖C_{TG,ℓ}(ν)‖_A ≤ min_{λ∈[0,1]} max_{x∈[0,1]} ( λ C̃_A x + (1 − λ) ) (1 − x)^ν
A minimax result (cf., for example, [83]) shows that in the previous expression the min and max
operations can be interchanged. A simple computation yields
max_{x∈[0,1]} min_{λ∈[0,1]} ( λ C̃_A x + (1 − λ) ) (1 − x)^ν = max_{x∈[0,1]} min{C̃_A x, 1} (1 − x)^ν
This proves the inequality in (9.55). An elementary computation shows that the equality in
(9.55) holds.
We now show that the approach used in the convergence analysis of the two-grid method in
theorem 9.4.28 can also be used for the multigrid method.
We start with an elementary result concerning a fixed point iteration that will be used in theorem 9.4.30.
Lemma 9.4.29 For given constants c > 1, ν ≥ 1 define g : [0, 1) → R by
g(η) = (1 − 1/c)^ν                                           if 0 ≤ η ≤ 1 − ν(c − 1)^{-1}
g(η) = c ν^ν/(ν + 1)^{ν+1} (1 − η) ( 1 + η (c(1 − η))^{-1} )^{ν+1}   if 1 − ν(c − 1)^{-1} < η < 1    (9.59)
For τ ∈ N, τ ≥ 1, define the sequence ξ_0 = 0, ξ_{i+1} = g(ξ_i^τ) for i ≥ 0. The following
holds:
– g is continuous and increasing on [0, 1)
– For c = C̃_A, g(0) coincides with the upper bound in (9.55)
– g(η) = η iff η = c/(c + ν)
– The sequence (ξ_i)_{i≥0} is monotonically increasing, with ξ_1 = g(0) ≤ ξ_2 ≤ …, and
  ξ* := lim_{i→∞} ξ_i < 1 satisfies ξ* ≤ c/(c + ν); ξ* is the first intersection point of the
  graphs of g(ξ^τ) and ξ.
Proof. Elementary calculus.
As an illustration, for two pairs (c, ν) we show the graph of the function g in figure 9.6.
Theorem 9.4.30 We take ν1 = ν2 = ν/2 and consider the multigrid algorithm with iteration
matrix C_{MG,ℓ} = C_{MG,ℓ}(ν/2, ν/2) as in (9.30). Assume that (9.50a) and (9.51) hold. For c = C̃_A,
τ ≥ 2 and τ as in (9.30) let ξ* ≤ c/(c + ν) be the fixed point defined in lemma 9.4.29. Then
‖C_{MG,ℓ}‖_A ≤ ξ*
holds.
Proof. From (9.30b) we have C_{MG,ℓ} = S_ℓ^{ν/2}(Q_ℓ + R_ℓ)S_ℓ^{ν/2} with R_ℓ := p_ℓ C_{MG,ℓ-1}^τ A_{ℓ-1}^{-1} r_ℓ A_ℓ.
The matrices S_ℓ and Q_ℓ are symmetric w.r.t. ⟨·,·⟩_A. If C_{MG,ℓ-1} is symmetric w.r.t. ⟨·,·⟩_{A_{ℓ-1}}
then from
(A_ℓ R_ℓ)^T = ( (A_ℓ p_ℓ A_{ℓ-1}^{-1})(A_{ℓ-1} C_{MG,ℓ-1}^τ)(A_{ℓ-1}^{-1} r_ℓ A_ℓ) )^T = A_ℓ R_ℓ
it follows that R_ℓ is symmetric w.r.t. ⟨·,·⟩_A, too. By induction we conclude that for all ℓ the
matrices R_ℓ and C_{MG,ℓ} are symmetric w.r.t. ⟨·,·⟩_A. Note that
0 ≤_{A_{ℓ-1}} C_{MG,ℓ-1}^τ ⇒ 0 ≤ C_{MG,ℓ-1}^τ A_{ℓ-1}^{-1} ⇒ 0 ≤ p_ℓ C_{MG,ℓ-1}^τ A_{ℓ-1}^{-1} r_ℓ ⇒ 0 ≤_A R_ℓ
and thus, using ‖C_{MG,ℓ-1}‖_{A_{ℓ-1}} ≤ ξ_{ℓ-1},
R_ℓ ≤_A ξ_{ℓ-1}^τ (I − Q_ℓ)                                                    (9.61)
holds. Define X_ℓ := M_ℓ^{-1} A_ℓ. Using (9.58), (9.60) and (9.61) we get
0 ≤_A Q_ℓ + R_ℓ ≤_A (1 − ξ_{ℓ-1}^τ) Q_ℓ + ξ_{ℓ-1}^τ I
  ≤_A (1 − ξ_{ℓ-1}^τ)( λ C̃_A X_ℓ + (1 − λ) I ) + ξ_{ℓ-1}^τ I   for all λ ∈ [0, 1]
This yields
‖C_{MG,ℓ}‖_A ≤ min_{λ∈[0,1]} max_{x∈[0,1]} ( (1 − ξ_{ℓ-1}^τ)(λ C̃_A x + 1 − λ) + ξ_{ℓ-1}^τ ) (1 − x)^ν
As in the proof of theorem 9.4.28 we can interchange the min and max operations in the previous
expression. A simple computation shows that for η = ξ_{ℓ-1}^τ ∈ [0, 1] we have
max_{x∈[0,1]} min_{λ∈[0,1]} ( (1 − η)(λ C̃_A x + 1 − λ) + η ) (1 − x)^ν = g(η)
where g is the function defined in lemma 9.4.29 with c = C̃_A. Thus ξ_ℓ := ‖C_{MG,ℓ}‖_A satisfies ξ_0 = 0 and
ξ_ℓ ≤ g(ξ_{ℓ-1}^τ) for ℓ ≥ 1. Application of the results in lemma 9.4.29 completes the proof.
The bound for the multigrid contraction number in theorem 9.4.30 decreases if ν increases.
Moreover, for τ → ∞ the bound converges to the bound for the two-grid contraction number in
theorem 9.4.28.
9.5 Multigrid for convection-dominated problems
We assume that for a certain ℓ = ℓ̄ we want to compute the solution x_ℓ̄ of the problem A_ℓ̄ x_ℓ̄ = b_ℓ̄
using an iterative method (not necessarily a multigrid method). In the nested iteration method
we use the systems on coarse grids to obtain a good starting vector x_ℓ̄^0 for this iterative method
with relatively low computational costs. The nested iteration method for the computation of
this starting vector x_ℓ̄^0 is as follows:
compute the solution x_0 of A_0 x_0 = b_0
x_1^0 := p̃_1 x_0   (prolongation of x_0)
x_1^k := result of k iterations of an iterative method
         applied to A_1 x_1 = b_1 with starting vector x_1^0
x_2^0 := p̃_2 x_1^k   (prolongation of x_1^k)                                   (9.62)
x_2^k := result of k iterations of an iterative method
         applied to A_2 x_2 = b_2 with starting vector x_2^0
  …
x_ℓ̄^0 := p̃_ℓ̄ x_{ℓ̄-1}^k.
In this nested iteration method we use a prolongation p̃_ℓ : R^{n_{ℓ-1}} → R^{n_ℓ}. The nested iteration
principle is based on the idea that p̃_ℓ x_{ℓ-1} should be a reasonable approximation of x_ℓ, because
A_{ℓ-1} x_{ℓ-1} = b_{ℓ-1} and A_ℓ x_ℓ = b_ℓ are discretizations of the same continuous problem. With
respect to the computational costs of this approach we note the following (cf. Hackbusch [44],
section 5.3). For the nested iteration to be a feasible approach, the number of iterations applied
on the coarse grids (i.e. k in (9.62)) should not be too large and the number of grid points
in the union of all coarse grids (i.e. levels 0, 1, 2, …, ℓ̄ − 1) should be at most of the same order
of magnitude as the number of grid points in the level ℓ̄ grid. Often, if one uses a multigrid
solver these two conditions are satisfied. Usually in multigrid we use coarse grids such that the
number of grid points decreases in a geometric fashion, and for k in (9.62) we can often take
k = 1 or k = 2 due to the fact that on the coarse grids we use the multigrid method, which has
a high rate of convergence.
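The idea that the prolonged coarse solution is already a good approximation can be checked numerically for −u'' = 1 on (0, 1) with u(0) = u(1) = 0 (so u(x) = x(1 − x)/2). In the sketch below, exact tridiagonal solves stand in for the level solvers and all helper names are illustrative:

```python
# On level l: n = 2^(l+1)-1 unknowns, h = 2^-(l+1), A = h^{-2} tridiag(-1,2,-1).

def solve_level(level):
    # exact solve of A x = (1,...,1) by the Thomas algorithm
    n = 2 ** (level + 1) - 1
    h = 2.0 ** (-(level + 1))
    sub, diag = -1.0 / h**2, 2.0 / h**2
    cp, bp = [0.0] * n, [0.0] * n
    cp[0], bp[0] = sub / diag, 1.0 / diag
    for i in range(1, n):
        m = diag - sub * cp[i - 1]
        cp[i] = sub / m
        bp[i] = (1.0 - sub * bp[i - 1]) / m
    x = [0.0] * n
    x[-1] = bp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = bp[i] - cp[i] * x[i + 1]
    return x

def prolong(e):
    # linear interpolation from level l-1 to level l
    x = [0.0] * (2 * len(e) + 1)
    for i, v in enumerate(e):
        x[2 * i + 1] = v
        x[2 * i] += 0.5 * v
        x[2 * i + 2] += 0.5 * v
    return x

def start_error(level):
    # max norm of the starting error p x_{l-1} - x_l
    px = prolong(solve_level(level - 1))
    xf = solve_level(level)
    return max(abs(px[i] - xf[i]) for i in range(len(xf)))
```

For this quadratic solution the starting error equals h_ℓ²/2 exactly, i.e. it already has the size of a typical O(h²) discretization error and decreases by a factor 4 per level.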
Note that if one uses the algorithm MGM_ℓ from (9.25) as the solver on level ℓ then the
implementation of the nested iteration method can be done with only little additional effort
because the coarse grid data structure and coarse grid operators (e.g. A_ℓ, ℓ < ℓ̄) needed in the
nested iteration method are already available.
If in the nested iteration method we use a multigrid iterative solver on all levels we obtain
[Figure 9.7: nested iteration with multigrid: on each level ℓ the starting vector x_ℓ^0 = p̃_ℓ x_{ℓ-1}^k is obtained by prolongation and then improved by iterations of MGM_ℓ.]
x_0 := A_0^{-1} b_0;  x_0^k := x_0
for ℓ = 1 to ℓ̄ do
begin
    x_ℓ^0 := p̃_ℓ x_{ℓ-1}^k                                                     (9.63)
    for i = 1 to k do x_ℓ^i := MGM_ℓ(x_ℓ^{i-1}, b_ℓ)
end;
Remark 9.6.1 The prolongation p̃_ℓ used in the nested iteration may be the same as the
prolongation p_ℓ used in the multigrid method. However, from the point of view of efficiency it
is sometimes better to use in the nested iteration a prolongation p̃_ℓ that has higher order of
accuracy than the prolongation used in the multigrid method.
h 1 3 5 7 9
3 1/16 0.080 0.055 0.061 0.067 0.070
4 1/32 0.056 0.053 0.058 0.062 0.066
5 1/64 0.044 0.055 0.059 0.062 0.065
6 1/128 0.043 0.054 0.058 0.061 0.063
Complexity. Consider the situation described in example 9.7.1. Then the arithmetic costs per
multigrid iteration are c n_ℓ flops and the error reduction per iteration is bounded by ξ̄ < 1 with ξ̄
independent of ℓ (as proved in section 9.4). To obtain a reduction of a starting error by a fixed
factor R we then need at most ln R/|ln ξ̄| iterations, i.e. the arithmetic costs are approximately
(ln R/|ln ξ̄|) c n_ℓ flops. We conclude that the multigrid method has complexity c n_ℓ. Note that this
is optimal in the sense that for one matrix-vector multiplication A_ℓ x_ℓ we already need O(n_ℓ)
flops. A nice feature of multigrid methods is that such an optimal complexity property holds
for a large class of interesting problems.
With respect to the efficiency of multigrid methods we note the following. The rate of convergence
will increase if ν1 + ν2 or τ is increased. However, in that case also the arithmetic
costs per iteration will grow. Analysis of the dependence of the multigrid contraction number
on ν1, ν2, τ and numerical experiments have shown that for many problems we obtain an efficient
method if we take ν1 + ν2 ∈ {1, 2, 3, 4} and τ ∈ {1, 2}. In other words, in general many (> 4)
smoothing iterations or more than two recursive calls in (9.25) will make a multigrid method
less efficient.
Stopping criterion. In general, for the discrete solution x_ℓ (with corresponding finite element
function u_ℓ = P_ℓ x_ℓ) we have a discretization error, so it does not make sense to solve the
discrete problem to machine accuracy. For a large class of elliptic boundary value problems the following
estimate for the discretization error holds: ‖u − u_ℓ‖ ≤ c h_ℓ². If in the multigrid iteration
one has an arbitrary starting vector (e.g., 0) then the error reduction factor R should be
taken proportional to h_ℓ^{-2}. Using the multigrid iteration MGM_ℓ one then needs approximately
ln R/|ln ξ̄| ≈ ln(c h_ℓ^{-2})/|ln ξ̄| ≈ c ln(n_ℓ)/|ln ξ̄| iterations to obtain an approximation with the desired
accuracy. Per iteration we need O(n_ℓ) flops. Hence we conclude: when we use a multigrid
method for computing an approximation of u_ℓ with accuracy comparable to the discretization
error in u_ℓ, the arithmetic costs are of the order
c n_ℓ ln n_ℓ flops.                                                             (9.64)
Multigrid and nested iteration. For an analysis of the multigrid method used in a nested
iteration we refer to Hackbusch [44]. From this analysis it follows that a small fixed number of
MGM_ℓ iterations (i.e. k in (9.62)) on each level ℓ in the nested iteration method is sufficient
to obtain an approximation of x_ℓ with accuracy comparable to the discretization error in x_ℓ.
The arithmetic costs of this combined multigrid and nested iteration method are of the order
(c/|ln ξ̄|) n_ℓ̄ flops.                                                          (9.65)
When we compare the costs in (9.64) with the costs in (9.65) we see that using the nested
iteration approach results in a more efficient algorithm. From the work estimate in (9.65) we
conclude: using multigrid in combination with nested iteration we can compute an approximation
of x_ℓ̄ with accuracy comparable to the discretization error in x_ℓ̄ and with arithmetic costs
C n_ℓ̄ flops (C independent of ℓ̄).
Example 9.7.2 To illustrate the behaviour of multigrid in combination with nested iteration
we show numerical results for an example from Hackbusch [44]. In the Poisson problem as in
example 9.7.1 we take boundary conditions and a right-hand side such that the solution is given
by u(x, y) = ½ y³/(x + 1), so we consider:
u(x, y) = ½ y³/(x + 1) on ∂Ω.
For the discretization we apply linear finite elements on a family of nested uniform triangulations
with mesh size h_ℓ = 2^{-ℓ-1}. The discrete solution on level ℓ is denoted by u_ℓ. The discretization
error, measured in a weighted Euclidean norm, is given in table 9.3. From these results one can
observe a c h_ℓ² behaviour of the discretization error. We apply the nested iteration approach of
section 9.6 in combination with the multigrid method. We start with a coarsest triangulation
T_{h_0} with mesh size h_0 = ½ (this contains only one interior grid point). For the prolongation
p̃_ℓ used in the nested iteration we take the prolongation p_ℓ as in the multigrid method (linear
interpolation). When we apply only one multigrid iteration on each level (i.e. k = 1 in
(9.63)) we obtain approximations x_ℓ^0 (= p̃_ℓ x_{ℓ-1}^1) and x_ℓ^1 of x_ℓ (= P_ℓ^{-1} u_ℓ) (cf. figure 9.7). The
errors in these approximations are given in table 9.4. In that table we also give the errors for the
case with two multigrid iterations on each level (i.e., k = 2 in (9.63)). Comparing the results in
table 9.4 with the discretization errors given in table 9.3 we see that we only need two multigrid
iterations on each grid to compute an approximation of x_ℓ with accuracy comparable
to the discretization error in x_ℓ.
h_ℓ      ‖x_ℓ^i − x_ℓ‖_2, k = 1          ‖x_ℓ^i − x_ℓ‖_2, k = 2
1/8      x_2^0: 7.24·10^{-3}             x_2^0: 6.47·10^{-3}
         x_2^1: 5.98·10^{-4}             x_2^1: 4.92·10^{-4}
                                         x_2^2: 2.86·10^{-5}
1/16     x_3^0: 2.09·10^{-3}             x_3^0: 1.73·10^{-3}
         x_3^1: 1.30·10^{-4}             x_3^1: 9.91·10^{-5}
                                         x_3^2: 4.91·10^{-6}
1/32     x_4^0: 5.17·10^{-4}             x_4^0: 4.43·10^{-4}
         x_4^1: 2.54·10^{-5}             x_4^1: 1.82·10^{-5}
                                         x_4^2: 8.52·10^{-7}
1/64     x_5^0: 1.23·10^{-4}             x_5^0: 1.12·10^{-4}
         x_5^1: 4.76·10^{-6}             x_5^1: 3.25·10^{-6}
                                         x_5^2: 1.47·10^{-7}
Chapter 10
In this chapter we discuss a class of iterative methods for solving a linear system with a matrix
of the form
K = [ A   B^T ]
    [ B   0  ]                                                                 (10.1)
with A ∈ R^{m×m} symmetric positive definite and B ∈ R^{n×m}, rank(B) = n < m.
The so-called Schur complement matrix is given by S := B A^{-1} B^T. Note that S is symmetric
positive definite. The symmetric matrix K is (strongly) indefinite:
Lemma 10.0.1 The matrix K has m strictly positive and n strictly negative eigenvalues.
Proof. From the factorization
K = [ I        0 ] [ A   0  ] [ I   A^{-1}B^T ]
    [ BA^{-1}  I ] [ 0  −S  ] [ 0   I         ]
it follows that K is congruent to the matrix blockdiag(A, −S), which has m strictly positive
and n strictly negative eigenvalues. Now apply Sylvester's inertia theorem.
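The inertia statement can be illustrated numerically for a tiny example with m = 2, n = 1, say A = [[2,1],[1,2]] (SPD) and B = [1 0]: the resulting K must have 2 positive and 1 negative eigenvalue. The cyclic Jacobi eigenvalue iteration below is a self-contained sketch; the example matrices and all names are illustrative choices, not part of the text.

```python
# Count positive/negative eigenvalues of the symmetric saddle point matrix K.
from math import atan2, cos, sin

def symmetric_eigenvalues(M, sweeps=100):
    # cyclic Jacobi rotations for a symmetric matrix given as list of lists
    n = len(M)
    A = [row[:] for row in M]
    for _ in range(sweeps):
        if sum(A[i][j] ** 2 for i in range(n) for j in range(n) if i != j) < 1e-24:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(A[p][q]) < 1e-15:
                    continue
                th = 0.5 * atan2(2.0 * A[p][q], A[q][q] - A[p][p])
                c, s = cos(th), sin(th)
                for k in range(n):          # A <- A G
                    akp, akq = A[k][p], A[k][q]
                    A[k][p] = c * akp - s * akq
                    A[k][q] = s * akp + c * akq
                for k in range(n):          # A <- G^T A
                    apk, aqk = A[p][k], A[q][k]
                    A[p][k] = c * apk - s * aqk
                    A[q][k] = s * apk + c * aqk
    return [A[i][i] for i in range(n)]

K = [[2.0, 1.0, 1.0],
     [1.0, 2.0, 0.0],
     [1.0, 0.0, 0.0]]
eigs = symmetric_eigenvalues(K)
```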
Remark 10.0.2 Consider a linear system of the form
K [ v ]  =  [ f1 ]
  [ w ]     [ f2 ]                                                             (10.2)
with K as in (10.1). Define the functional L : R^m × R^n → R by L(v, w) = ½⟨Av, v⟩ + ⟨Bv, w⟩ −
⟨f1, v⟩ − ⟨f2, w⟩. Using the same arguments as in the proof of theorem 2.4.2 one can easily show
that (v*, w*) is a solution of the problem (10.2) iff
L(v*, w) ≤ L(v*, w*) ≤ L(v, w*)   for all v ∈ R^m, w ∈ R^n
Due to this property the linear system (10.2) is called a saddle-point problem.
In section 8.3 we discussed the preconditioned MINRES method for solving a linear system with
a symmetric indefinite matrix. This method can be applied to the system in (10.2). Recall that
the preconditioner must be symmetric positive definite. In section 10.1 we analyze a particular
preconditioning technique for the matrix K. In section 10.2 we apply these methods to the
discrete Stokes problem.
10.1 Block diagonal preconditioning
In this section we analyze the effect of symmetric preconditioning of the matrix K in (10.1) with
a block diagonal matrix
M := [ M_A  0   ]
     [ 0    M_S ]
with M_A ∈ R^{m×m}, M_A = M_A^T > 0 and M_S ∈ R^{n×n}, M_S = M_S^T > 0.
The preconditioned matrix is given by
K̃ = M^{-1/2} K M^{-1/2} = [ Ã   B̃^T ]
                           [ B̃   0   ]
Ã := M_A^{-1/2} A M_A^{-1/2},   B̃ := M_S^{-1/2} B M_A^{-1/2}
We first consider a very special preconditioner, which in a certain sense is optimal:
Lemma 10.1.1 For M_A = A and M_S = S we have
σ(K̃) = { ½(1 − √5), 1, ½(1 + √5) }
Proof. Note that
K̃ = [ I   B̃^T ]
     [ B̃   0  ],   B̃ = S^{-1/2} B A^{-1/2},   B̃ B̃^T = S^{-1/2} S S^{-1/2} = I
The matrix B̃ has a nontrivial kernel. For v ∈ ker(B̃), v ≠ 0, we have K̃ (v, 0)^T = (v, 0)^T and thus
1 ∈ σ(K̃). For λ ∈ σ(K̃), λ ≠ 1, we get
[ I   B̃^T ] [ v ]      [ v ]
[ B̃   0  ] [ w ]  = λ [ w ],   w ≠ 0
i.e. v + B̃^T w = λ v and B̃ v = λ w, hence B̃ B̃^T w = λ(λ − 1) w.
This holds iff λ(λ − 1) ∈ σ(B̃ B̃^T) = σ(I) = {1} and thus λ = ½(1 ± √5).
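The three eigenvalues of the ideally preconditioned matrix can be checked numerically for a tiny example. Since σ(M^{-1/2} K M^{-1/2}) = σ(M^{-1} K), each of them must be a root of det(K − λM) with M = blockdiag(A, S); the example matrices below (m = 2, n = 1, S computed by hand) are illustrative choices, not part of the text.

```python
# For A = [[2,1],[1,2]], B = [1 0]: A^{-1} = (1/3)[[2,-1],[-1,2]], so
# S = B A^{-1} B^T = [2/3].  Check det(K - lambda*M) = 0 at the three values.
from math import sqrt

K = [[2.0, 1.0, 1.0],
     [1.0, 2.0, 0.0],
     [1.0, 0.0, 0.0]]
M = [[2.0, 1.0, 0.0],
     [1.0, 2.0, 0.0],
     [0.0, 0.0, 2.0 / 3.0]]

def det3(X):
    # determinant of a 3x3 matrix by cofactor expansion
    return (X[0][0] * (X[1][1] * X[2][2] - X[1][2] * X[2][1])
          - X[0][1] * (X[1][0] * X[2][2] - X[1][2] * X[2][0])
          + X[0][2] * (X[1][0] * X[2][1] - X[1][1] * X[2][0]))

def char(lam):
    # characteristic function det(K - lam*M) of the generalized problem
    return det3([[K[i][j] - lam * M[i][j] for j in range(3)] for i in range(3)])

roots = [0.5 * (1 - sqrt(5.0)), 1.0, 0.5 * (1 + sqrt(5.0))]
```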
Note that from the result in (8.41) it follows that the preconditioned MINRES method with
the preconditioner as in lemma 10.1.1 yields (in exact arithmetic) the exact solution in at most
three iterations. In most applications (e.g., the Stokes problem) it is very costly to solve linear
systems with the matrices A and S. Hence this preconditioner is not feasible. Instead we will
use approximations M_A of A and M_S of S. The quality of these approximations is measured
by the following spectral inequalities, with γ_A, Γ_A, γ_S, Γ_S > 0:
γ_A M_A ≤ A ≤ Γ_A M_A
γ_S M_S ≤ S ≤ Γ_S M_S                                                          (10.3)
Using an analysis as in [77, 82] we obtain a result for the eigenvalues of the preconditioned
matrix:
Theorem 10.1.2 For the matrix $\tilde{K}$ with preconditioners that satisfy (10.3) we have:
\[
\sigma(\tilde{K}) \subset
\Big[\tfrac12\big(\gamma_A - \sqrt{\gamma_A^2 + 4\,\Gamma_S\Gamma_A}\big),\;
\tfrac12\big(\gamma_A - \sqrt{\gamma_A^2 + 4\,\gamma_S\gamma_A}\big)\Big]
\;\cup\;
\Big[\gamma_A,\; \tfrac12\big(\Gamma_A + \sqrt{\Gamma_A^2 + 4\,\Gamma_S\Gamma_A}\big)\Big].
\]
Proof. We use the following inequalities, which follow from (10.3):
\[
\gamma_A I \le \tilde{A} \le \Gamma_A I, \tag{10.4a}
\]
\[
\gamma_A A^{-1} \le M_A^{-1} \le \Gamma_A A^{-1}, \tag{10.4b}
\]
\[
\gamma_S I \le M_S^{-\frac12} S M_S^{-\frac12} \le \Gamma_S I. \tag{10.4c}
\]
Note that $\tilde{B}\tilde{B}^T = M_S^{-\frac12} B M_A^{-1} B^T M_S^{-\frac12}$. Using (10.4b) and (10.4c) we get
\[
\gamma_A \gamma_S I \le \tilde{B}\tilde{B}^T \le \Gamma_A \Gamma_S I. \tag{10.5}
\]
Take $\lambda \in \sigma(\tilde{K})$. Then $\lambda \neq 0$ and there exists $(v, w) \neq (0, 0)$ such that
\[
\tilde{A} v + \tilde{B}^T w = \lambda v, \qquad \tilde{B} v = \lambda w. \tag{10.6}
\]
If $v = 0$, then the second equation in (10.6) yields $w = 0$ (recall $\lambda \neq 0$), contradicting $(v, w) \neq (0, 0)$; hence $v \neq 0$. Eliminating $w$ from (10.6) gives
\[
\big(\tilde{A} + \tfrac{1}{\lambda}\tilde{B}^T\tilde{B}\big) v = \lambda v.
\]
First consider the case $\lambda > 0$. Taking the inner product with $v$ and using (10.4a) together with $\|\tilde{B}v\|^2 \ge 0$ yields $\lambda \ge \gamma_A$; using the upper bounds in (10.4a) and (10.5) yields $\lambda^2 \le \Gamma_A \lambda + \Gamma_S\Gamma_A$. Hence
\[
\lambda \in \Big[\gamma_A,\; \tfrac12\big(\Gamma_A + \sqrt{\Gamma_A^2 + 4\,\Gamma_S\Gamma_A}\big)\Big].
\]
We now consider the case $\lambda < 0$. From (10.5) and (10.4a) it follows that
\[
\tilde{A} + \tfrac{1}{\lambda}\tilde{B}^T\tilde{B} \ge \big(\gamma_A + \tfrac{1}{\lambda}\Gamma_S\Gamma_A\big) I
\]
and thus $\gamma_A + \frac{1}{\lambda}\Gamma_S\Gamma_A \le \lambda$. This yields $\lambda \ge \frac12\big(\gamma_A - \sqrt{\gamma_A^2 + 4\,\Gamma_S\Gamma_A}\big)$. Finally we derive an upper bound for $\lambda < 0$. We introduce $\mu := -\lambda > 0$. From (10.6) it follows that for $\lambda < 0$, $w \neq 0$ must hold. Furthermore, we have
\[
\tilde{B}(\tilde{A} + \mu I)^{-1}\tilde{B}^T w = \mu w
\]
and thus $\mu \in \sigma\big(\tilde{B}(\tilde{A} + \mu I)^{-1}\tilde{B}^T\big)$. From $I + \mu\tilde{A}^{-1} \le (1 + \frac{\mu}{\gamma_A})I$ and (10.4c) we obtain
\[
\tilde{B}(\tilde{A} + \mu I)^{-1}\tilde{B}^T
= \tilde{B}\tilde{A}^{-\frac12}(I + \mu\tilde{A}^{-1})^{-1}\tilde{A}^{-\frac12}\tilde{B}^T
\ge \big(1 + \tfrac{\mu}{\gamma_A}\big)^{-1}\tilde{B}\tilde{A}^{-1}\tilde{B}^T
= \big(1 + \tfrac{\mu}{\gamma_A}\big)^{-1} M_S^{-\frac12} S M_S^{-\frac12}
\ge \big(1 + \tfrac{\mu}{\gamma_A}\big)^{-1}\gamma_S I.
\]
We conclude that $(1 + \frac{\mu}{\gamma_A})^{-1}\gamma_S \le \mu$ holds. Hence, for $\lambda = -\mu$ we get $\lambda \le \frac12\big(\gamma_A - \sqrt{\gamma_A^2 + 4\,\gamma_S\gamma_A}\big)$.
10.2 Application to the Stokes problem
In this section the results of the previous sections are applied to the discretized Stokes problem
that is treated in section 5.2. We consider the Galerkin discretization of the Stokes problem
with Hood–Taylor finite element spaces
\[
V_h = (X_{h,0}^k)^d, \qquad M_h = X_h^{k-1} \cap L_0^2(\Omega) \qquad (k \ge 2).
\]
Here we use the notation $d$ for the dimension of the velocity vector ($u \in \mathbb{R}^d$). For the bases in these spaces we use standard nodal basis functions. In the velocity space $V_h$ the set of basis functions is denoted by $(\psi_i)_{1 \le i \le m}$. Each $\psi_i$ is a vector function in $\mathbb{R}^d$ with $d - 1$ components identically zero. The basis in the pressure space $M_h$ is denoted by $(\phi_i)_{1 \le i \le n}$. The corresponding isomorphisms are given by
\[
P_{h,1} : \mathbb{R}^m \to V_h, \quad P_{h,1} v = \sum_{i=1}^m v_i \psi_i, \qquad
P_{h,2} : \mathbb{R}^n \to M_h, \quad P_{h,2} w = \sum_{i=1}^n w_i \phi_i.
\]
The stiffness matrix has the form
\[
K = \begin{pmatrix} A & B^T \\ B & 0 \end{pmatrix} \in \mathbb{R}^{(m+n)\times(m+n)}, \quad \text{with}
\]
\[
\langle A v, \tilde{v} \rangle = a(P_{h,1} v, P_{h,1} \tilde{v}) = \int_\Omega \nabla(P_{h,1} v) \cdot \nabla(P_{h,1} \tilde{v}) \, dx, \qquad v, \tilde{v} \in \mathbb{R}^m,
\]
\[
\langle B v, w \rangle = b(P_{h,1} v, P_{h,2} w), \qquad v \in \mathbb{R}^m,\ w \in \mathbb{R}^n.
\]
For the preconditioner $M_A$ we take
\[
M_A := \mathrm{blockdiag}(M_{MG}, \ldots, M_{MG}) \quad (d \text{ blocks}),
\]
where $M_{MG}$ denotes a symmetric positive definite (multigrid) preconditioner of the discrete scalar Laplacian.
For the preconditioner MS of the Schur complement S we use the mass matrix in the pressure
space, which is defined by
hMS w, zi = hPh,2 w, Ph,2 ziL2 for all w, z Rn (10.8)
This mass matrix is symmetric positive definite and (after diagonal scaling, cf. section 3.5.1)
in general wellconditioned. In practice the linear systems of the form MS w = q are solved
approximately by applying a few iterations of an iterative solver (for example, CG). We recall
the stability property of the Hood–Taylor finite element pair $(V_h, M_h)$ (cf. section 5.2.1):
\[
\hat\beta > 0: \qquad \sup_{u_h \in V_h} \frac{b(u_h, q_h)}{\|u_h\|_1} \ge \hat\beta\, \|q_h\|_{L^2} \quad \text{for all } q_h \in M_h, \tag{10.9}
\]
with $\hat\beta$ independent of $h$. Using this stability property we get the following spectral inequalities for the preconditioner $M_S$:
Theorem 10.2.1 Let $M_S$ be the pressure mass matrix defined in (10.8). Assume that the stability property (10.9) holds. Then
\[
\hat\beta^2 M_S \le S \le d\, M_S \tag{10.10}
\]
holds.
Proof. For $w \in \mathbb{R}^n$ we have:
\[
\max_{v \in \mathbb{R}^m} \frac{\langle B v, w\rangle}{\langle A v, v\rangle^{1/2}}
= \max_{v \in \mathbb{R}^m} \frac{\langle B A^{-\frac12} v, w\rangle}{\|v\|}
= \max_{v \in \mathbb{R}^m} \frac{\langle v, A^{-\frac12} B^T w\rangle}{\|v\|}
= \|A^{-\frac12} B^T w\| = \langle S w, w\rangle^{1/2}.
\]
Hence, for arbitrary $w \in \mathbb{R}^n$:
\[
\langle S w, w\rangle^{1/2} = \max_{u_h \in V_h} \frac{b(u_h, P_{h,2} w)}{\|u_h\|_1}. \tag{10.11}
\]
Using this and the stability bound (10.9) we get
\[
\langle S w, w\rangle^{1/2} \ge \hat\beta\, \|P_{h,2} w\|_{L^2} = \hat\beta\, \langle M_S w, w\rangle^{1/2},
\]
and thus the first inequality in (10.10) holds. Note that
\[
b(u_h, P_{h,2} w) \le \|\mathrm{div}\, u_h\|_{L^2}\, \|P_{h,2} w\|_{L^2}
\le \sqrt{d}\, \|u_h\|_1\, \|P_{h,2} w\|_{L^2} = \sqrt{d}\, \|u_h\|_1\, \langle M_S w, w\rangle^{1/2}
\]
holds. Combining this with (10.11) proves the second inequality in (10.10).
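The remark above, that the (diagonally scaled) pressure mass matrix is well-conditioned and systems $M_S w = q$ are solved by a few CG iterations, can be illustrated in a simplified setting. The sketch below uses the one-dimensional P1 mass matrix on a uniform mesh (a stand-in of our own choosing, not the actual Stokes pressure mass matrix) and a plain CG iteration; its condition number stays close to 3 under mesh refinement, so the CG iteration count is independent of the mesh size:

```python
import numpy as np

def mass_matrix_1d(N):
    # P1 mass matrix on a uniform mesh of [0, 1] with N interior nodes:
    # tridiag(h/6, 4h/6, h/6).
    h = 1.0 / (N + 1)
    return (np.diag(np.full(N, 4 * h / 6))
            + np.diag(np.full(N - 1, h / 6), 1)
            + np.diag(np.full(N - 1, h / 6), -1))

def cg(M, b, tol=1e-10, maxiter=1000):
    # Plain conjugate gradient; returns solution and iteration count.
    x = np.zeros_like(b); r = b.copy(); p = r.copy()
    rs = r @ r
    for k in range(1, maxiter + 1):
        Mp = M @ p
        alpha = rs / (p @ Mp)
        x += alpha * p
        r -= alpha * Mp
        rs_new = r @ r
        if np.sqrt(rs_new) <= tol * np.linalg.norm(b):
            return x, k
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x, maxiter

for N in (50, 100, 200):
    M = mass_matrix_1d(N)
    ev = np.linalg.eigvalsh(M)
    # Spectral condition number stays bounded (close to 3) as h -> 0.
    assert ev.max() / ev.min() < 3.1
    b = np.ones(N)
    x, iters = cg(M, b)
    assert iters < 30  # a few CG iterations suffice, independent of N
```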
Corollary 10.2.2 Suppose that for solving a discrete Stokes problem with stiffness matrix K we use a preconditioned MINRES method with preconditioners $M_A$ (for $A$) and $M_S$ (for $S$) as defined above. Then the inequalities (10.3) hold with constants $\gamma_A, \Gamma_A, \gamma_S, \Gamma_S$ that are independent of $h$. From theorem 10.1.2 it follows that the spectrum of the preconditioned matrix $\tilde{K}$ is contained in a set $[a, b] \cup [c, d]$ with $a < b < 0 < c < d$, all independent of $h$, and with $b - a = d - c$. From theorem 8.42 we then conclude that the residual reduction factor can be bounded by a constant smaller than one independent of $h$.
Appendix A
Functional Analysis
Real vector space. A real vector space is a set X of elements, called vectors, together with
the algebraic operations vector addition and multiplication of vectors by real scalars. Vector
addition should be commutative and associative. Multiplication by scalars should be associative
and distributive.
Example A.1.1 Examples of real vector spaces are $\mathbb{R}^n$ and $C([a, b])$, the space of continuous real-valued functions on $[a, b]$.
Normed space. A normed space is a vector space $X$ with a norm defined on it. Here a norm on a vector space $X$ is a real-valued function on $X$ whose value at $x \in X$ is denoted by $\|x\|$ and which has the properties
\[
\|x\| \ge 0, \qquad
\|x\| = 0 \iff x = 0, \qquad
\|\alpha x\| = |\alpha|\, \|x\|, \qquad
\|x + y\| \le \|x\| + \|y\| \tag{A.1}
\]
for arbitrary $x, y \in X$, $\alpha \in \mathbb{R}$.
Example A.1.2 Examples of normed spaces are
\[
(\mathbb{R}^n, \|\cdot\|_2) \ \text{with} \ \|x\|_2^2 = \sum_{i=1}^n x_i^2, \qquad
(C([a, b]), \|\cdot\|_\infty) \ \text{with} \ \|f\|_\infty = \max_{t \in [a, b]} |f(t)|.
\]
Banach space. A Banach space is a complete normed space. This means that in X every
Cauchy sequence, in the metric defined by the norm, has a limit which is an element of X.
Example A.1.3 Examples of Banach spaces are:
\[
(\mathbb{R}^n, \|\cdot\|_2), \qquad
(\mathbb{R}^n, \|\cdot\|) \ \text{with any norm} \ \|\cdot\| \ \text{on} \ \mathbb{R}^n, \qquad
(C([a, b]), \|\cdot\|_\infty).
\]
The completeness of the space in the second example follows from the fact that on a finite dimensional space all norms are equivalent. The completeness of the space in the third example is a consequence of the following theorem: the limit of a uniformly convergent sequence of continuous functions is a continuous function.
Remark A.1.4 The space $(C([a, b]), \|\cdot\|_{L^2})$ is not complete. Consider for example the sequence $f_n \in C([0, 1])$, $n \ge 1$, defined by
\[
f_n(t) = \begin{cases}
0 & \text{if } 0 \le t \le \tfrac12, \\
n\,(t - \tfrac12) & \text{if } \tfrac12 \le t \le \tfrac12 + \tfrac1n, \\
1 & \text{if } \tfrac12 + \tfrac1n \le t \le 1.
\end{cases}
\]
A direct computation shows $\|f_n - f_m\|_{L^2} \to 0$ for $n, m \to \infty$, so $(f_n)_{n \ge 1}$ is a Cauchy sequence. For the limit function $f$ we would have
\[
f(t) = \begin{cases}
0 & \text{if } 0 \le t \le \tfrac12, \\
1 & \text{if } \tfrac12 < t \le 1,
\end{cases}
\]
which is not continuous. Hence the Cauchy sequence $(f_n)_{n \ge 1}$ has no limit in $(C([0, 1]), \|\cdot\|_{L^2})$.
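The Cauchy property of the sequence $(f_n)$ in Remark A.1.4 can also be checked numerically; the grid-based $L^2$ distance below is a simple Riemann-sum approximation of our own construction:

```python
import numpy as np

def f(n, t):
    # The functions f_n from Remark A.1.4 on [0, 1]:
    # 0 for t <= 1/2, linear ramp on [1/2, 1/2 + 1/n], then 1.
    return np.clip(n * (t - 0.5), 0.0, 1.0)

t = np.linspace(0.0, 1.0, 200001)
dt = t[1] - t[0]

def l2_dist(n, m):
    d = f(n, t) - f(m, t)
    return np.sqrt(np.sum(d * d) * dt)  # Riemann-sum L2 norm

# ||f_n - f_{2n}||_{L2} -> 0 as n grows: the sequence is Cauchy in the
# L2 norm, although its pointwise limit is discontinuous.
dists = [l2_dist(n, 2 * n) for n in (10, 100, 1000)]
assert dists[0] > dists[1] > dists[2]
assert dists[2] < 2e-2
```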
Inner product space. An inner product space is a (real) vector space $X$ with an inner product defined on $X$. For such an inner product we need a mapping of $X \times X$ into $\mathbb{R}$, i.e. with every pair of vectors $x$ and $y$ from $X$ there is associated a scalar denoted by $\langle x, y\rangle$. This mapping is called an inner product on $X$ if for arbitrary $x, y, z \in X$ and $\alpha \in \mathbb{R}$ the following holds:
\begin{align}
&\langle x, x\rangle \ge 0 \tag{A.2}\\
&\langle x, x\rangle = 0 \iff x = 0 \tag{A.3}\\
&\langle x, y\rangle = \langle y, x\rangle \tag{A.4}\\
&\langle \alpha x, y\rangle = \alpha \langle x, y\rangle \tag{A.5}\\
&\langle x + y, z\rangle = \langle x, z\rangle + \langle y, z\rangle. \tag{A.6}
\end{align}
An inner product and the corresponding norm $\|x\| := \langle x, x\rangle^{1/2}$ satisfy the Cauchy–Schwarz inequality:
\[
|\langle x, y\rangle| \le \|x\|\, \|y\|.
\]
Example A.1.5 Examples of inner product spaces are:
\[
\mathbb{R}^n \ \text{with} \ \langle x, y\rangle = \sum_{i=1}^n x_i y_i, \qquad
C([a, b]) \ \text{with} \ \langle f, g\rangle = \int_a^b f(t) g(t)\, dt.
\]
Hilbert space. An inner product space which is complete is called a Hilbert space. An example is
\[
L^2([a, b]) \ \text{with} \ \langle f, g\rangle = \int_a^b f(t) g(t)\, dt.
\]
We note that the space $C([a, b])$ with the inner product (and corresponding norm) as in Example A.1.5 results in the normed space $(C([a, b]), \|\cdot\|_{L^2})$. In Remark A.1.4 it is shown that this space is not complete. Thus the inner product space $C([a, b])$ as in Example A.1.5 is not a Hilbert space.
The space $L^2(\Omega)$. Let $\Omega$ be a domain in $\mathbb{R}^n$. We denote by $L^2(\Omega)$ the space of all Lebesgue measurable functions $f : \Omega \to \mathbb{R}$ for which
\[
\|f\|_0 := \|f\|_{L^2} := \Big(\int_\Omega f(x)^2\, dx\Big)^{1/2} < \infty.
\]
In this space functions are identified that are equal almost everywhere (a.e.) on $\Omega$. The elements of $L^2(\Omega)$ are thus actually equivalence classes of functions. One writes $f = 0$ [$f = g$] if $f(x) = 0$ [$f(x) = g(x)$] a.e. in $\Omega$. The space $L^2(\Omega)$ with
\[
\langle f, g\rangle = \int_\Omega f(x) g(x)\, dx
\]
is a Hilbert space. The space $C_0^\infty(\Omega)$ (all functions in $C^\infty(\Omega)$ which have a compact support in $\Omega$) is dense in $L^2(\Omega)$:
\[
\overline{C_0^\infty(\Omega)}^{\,\|\cdot\|_0} = L^2(\Omega).
\]
In other words, the completion of the normed space $(C_0^\infty(\Omega), \|\cdot\|_0)$ results in the space $L^2(\Omega)$.
Dual space. Let $(X, \|\cdot\|)$ be a normed space. The set of all bounded linear functionals $f : X \to \mathbb{R}$ forms a real vector space. On this space we can define the norm
\[
\|f\| := \sup\Big\{\, \frac{f(x)}{\|x\|} \ \Big|\ x \in X,\ x \neq 0 \,\Big\}.
\]
This results in a normed space called the dual space of $X$ and denoted by $X'$.
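For $X = (\mathbb{R}^n, \|\cdot\|_2)$ and $f(x) = \langle x, y\rangle$ the dual norm equals $\|y\|_2$, by the Cauchy–Schwarz inequality (compare also the Riesz representation theorem below). A small numpy check of our own (sampling illustrates the supremum, it does not prove it):

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.standard_normal(5)

def f(x):
    # Bounded linear functional f(x) = <x, y> on (R^5, ||.||_2).
    return y @ x

# Sampled values of f(x)/||x|| over random directions never exceed ||y||_2 ...
samples = rng.standard_normal((10000, 5))
vals = samples @ y / np.linalg.norm(samples, axis=1)
assert vals.max() <= np.linalg.norm(y) + 1e-12

# ... and the bound is attained at x = y, so ||f|| = ||y||_2.
assert np.isclose(f(y) / np.linalg.norm(y), np.linalg.norm(y))
```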
Bounded linear operators. Let $(X, \|\cdot\|_X)$ and $(Y, \|\cdot\|_Y)$ be normed spaces and $T : X \to Y$ be a linear operator. The (operator) norm of $T$ is defined by
\[
\|T\|_{Y \leftarrow X} := \sup\Big\{\, \frac{\|T x\|_Y}{\|x\|_X} \ \Big|\ x \in X,\ x \neq 0 \,\Big\}.
\]
Theorem A.2.1 (Arzelà–Ascoli.) Let $\Omega$ be a bounded domain. A subset $K$ of $C(\overline\Omega)$ is precompact (i.e., every sequence has a convergent subsequence) if and only if the following two conditions hold: (i) $K$ is bounded, i.e., $\sup_{f \in K} \|f\|_\infty < \infty$; (ii) $K$ is equicontinuous.
Theorem A.2.3 (Extension of operators.) Let $X$ be a normed space and $Y$ a Banach space. Suppose $X_0$ is a dense subspace of $X$ and $T : X_0 \to Y$ a bounded linear operator. Then there exists a unique extension $\tilde T : X \to Y$ with the properties $\tilde T x = T x$ for all $x \in X_0$ and $\|\tilde T\| = \|T\|$.
Theorem A.2.4 (Banach fixed point theorem.) Let $(X, \|\cdot\|)$ be a Banach space. Let $F : X \to X$ be a (possibly nonlinear) contraction, i.e. there is a constant $\theta < 1$ such that for all $x, y \in X$:
\[
\|F(x) - F(y)\| \le \theta\, \|x - y\|.
\]
Then there exists a unique $x \in X$ (called a fixed point) such that $F(x) = x$ holds.
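A concrete illustration: $F(x) = \cos(x)/2$ is a contraction on $X = \mathbb{R}$ with $\theta = 1/2$ (since $|F'(x)| \le 1/2$), and the fixed point iteration $x_{k+1} = F(x_k)$ converges to the unique fixed point, with the a priori error bound $|x_k - x^*| \le \frac{\theta^k}{1 - \theta}|x_1 - x_0|$ that follows from the theorem's proof. The example function is our own choice:

```python
import numpy as np

# Contraction on R: F(x) = cos(x)/2 has |F'(x)| <= 1/2 =: theta < 1.
theta = 0.5
F = lambda x: np.cos(x) / 2

# Fixed point iteration x_{k+1} = F(x_k), starting from x_0 = 0.
x = 0.0
xs = [x]
for _ in range(50):
    x = F(x)
    xs.append(x)

xstar = x  # converged to machine precision
assert abs(F(xstar) - xstar) < 1e-14

# A priori bound: |x_k - x*| <= theta^k / (1 - theta) * |x_1 - x_0|.
for k in (1, 5, 10):
    assert abs(xs[k] - xstar) <= theta**k / (1 - theta) * abs(xs[1] - xs[0]) + 1e-15
```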
Theorem A.2.5 (Corollary of open mapping theorem.) Let $X$ and $Y$ be Banach spaces and $T : X \to Y$ a linear bounded operator which is bijective. Then $T^{-1}$ is bounded, i.e., $T : X \to Y$ is an isomorphism.

Corollary A.2.6 Let $X$ and $Y$ be Banach spaces and $T : X \to Y$ a linear bounded operator which is injective. Let $R(T) = \{\, T x \mid x \in X \,\}$ be the range of $T$. Then $T^{-1} : R(T) \to X$ is bounded if and only if $R(T)$ is closed (in $Y$).
Theorem A.2.8 (Riesz representation theorem.) Let $H$ be a Hilbert space with inner product denoted by $\langle \cdot, \cdot\rangle$ and corresponding norm $\|\cdot\|_H$. Let $f$ be an element of the dual space $H'$, with norm $\|f\|_{H'}$. Then there exists a unique $w \in H$ such that
\[
f(x) = \langle x, w\rangle \quad \text{for all } x \in H, \qquad \text{and} \qquad \|f\|_{H'} = \|w\|_H.
\]
The mapping $J_H : H' \to H$, $J_H f = w$, is called the Riesz isomorphism.

Corollary A.2.9 Let $H$ be a Hilbert space with inner product denoted by $\langle \cdot, \cdot\rangle$. The bilinear form
\[
\langle f, g\rangle_{H'} := \langle J_H f, J_H g\rangle, \qquad f, g \in H',
\]
defines a scalar product on $H'$. The space $H'$ with this scalar product is a Hilbert space.
Appendix B
Linear Algebra
Spectrum, spectral radius. By $\sigma(A)$ we denote the spectrum of $A$, i.e. the collection of all eigenvalues of the matrix $A$. Note that in general $\sigma(A)$ contains complex numbers (even for $A$ real). We use the notation $\rho(A)$ for the spectral radius of $A \in \mathbb{R}^{n \times n}$:
\[
\rho(A) := \max\{\, |\lambda| \mid \lambda \in \sigma(A) \,\}.
\]
Matrix norms. A matrix norm on $\mathbb{R}^{n \times n}$ is a real valued function whose value at $A \in \mathbb{R}^{n \times n}$ is denoted by $\|A\|$ and which has the properties
\[
\|A\| \ge 0, \quad
\|A\| = 0 \iff A = 0, \quad
\|\alpha A\| = |\alpha|\, \|A\|, \quad
\|A + B\| \le \|A\| + \|B\|, \quad
\|A B\| \le \|A\|\, \|B\|
\]
for all $A, B \in \mathbb{R}^{n \times n}$ and all $\alpha \in \mathbb{R}$. A special class of matrix norms are those induced by a vector norm. For a given vector norm $\|\cdot\|$ on $\mathbb{R}^n$ we define an induced matrix norm by
\[
\|A\| := \sup\Big\{\, \frac{\|A x\|}{\|x\|} \ \Big|\ x \in \mathbb{R}^n,\ x \neq 0 \,\Big\} \quad \text{for } A \in \mathbb{R}^{n \times n}. \tag{B.6}
\]
Induced by the vector norms in (B.1), (B.2), (B.3) we obtain the matrix norms $\|A\|_1$, $\|A\|_2$, $\|A\|_\infty$. From the definition of the induced matrix norm it follows that $\|A x\| \le \|A\|\, \|x\|$ for all $x \in \mathbb{R}^n$.
Condition number. For a nonsingular matrix $A$ the spectral condition number is defined by
\[
\kappa(A) := \|A\|_2\, \|A^{-1}\|_2.
\]
We note that condition numbers can be defined with respect to other matrix norms, too.
Below we introduce some notions related to special properties which matrices A Rnn may
have.
The matrix A is symmetric if A = AT holds. The matrix A is normal if the equality
AT A = AAT holds. Note that every symmetric matrix is normal.
A symmetric matrix A is positive definite if xT Ax > 0 holds for all x 6= 0. In that case,
A is said to be symmetric positive definite.
The matrix $A$ is called irreducible if there does not exist a permutation matrix $P$ such that $P^T A P$ is a two by two block matrix in which the $(2, 1)$ block is a zero block. The matrix $A$ is called irreducibly diagonally dominant if $A$ is irreducible and weakly diagonally dominant.
Theorem B.2.1 (Results on eigenvalues and eigenvectors). For $A, B \in \mathbb{R}^{n \times n}$ the following results hold:
E1. $\sigma(A) = \sigma(A^T)$.

E3. $A \in \mathbb{R}^{n \times n}$ is normal if and only if $A$ has an orthogonal basis of eigenvectors. In general these eigenvectors and corresponding eigenvalues are complex, and orthogonality is meant with respect to the complex Euclidean inner product.

E4. If $A$ is symmetric then $A$ has an orthogonal basis of real eigenvectors. Furthermore, all eigenvalues of $A$ are real.

E5. $A$ is symmetric positive definite if and only if $A$ is symmetric and $\sigma(A) \subset (0, \infty)$.
Theorem B.2.2 (Results on matrix norms). For $A \in \mathbb{R}^{n \times n}$ the following results hold:

N3. $\|A\|_2 = \sqrt{\rho(A^T A)}$.

N4. If $A$ is symmetric, then $\|A\|_2 = \rho(A)$.

Using (N4) and (E5) we obtain the following result for the spectral condition number of a symmetric positive definite matrix $A$:
\[
\kappa(A) = \frac{\lambda_{\max}}{\lambda_{\min}},
\]
with $\lambda_{\max}$ and $\lambda_{\min}$ the largest and smallest eigenvalue of $A$, respectively.
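Properties (N3), (N4) and the formula $\kappa(A) = \lambda_{\max}/\lambda_{\min}$ can be checked with numpy; the random test matrices below are our own choosing:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6
X = rng.standard_normal((n, n))

# N3: the induced 2-norm equals sqrt(rho(A^T A)).
A = X
rho = np.abs(np.linalg.eigvals(A.T @ A)).max()
assert np.isclose(np.linalg.norm(A, 2), np.sqrt(rho))

# For symmetric positive definite A: kappa(A) = lambda_max / lambda_min.
A = X @ X.T + n * np.eye(n)
lam = np.linalg.eigvalsh(A)
kappa = np.linalg.norm(A, 2) * np.linalg.norm(np.linalg.inv(A), 2)
assert np.isclose(kappa, lam.max() / lam.min())

# The induced-norm definition (B.6) as a sampled supremum: never exceeded.
samples = rng.standard_normal((10000, n))
ratios = np.linalg.norm(samples @ A.T, axis=1) / np.linalg.norm(samples, axis=1)
assert ratios.max() <= np.linalg.norm(A, 2) + 1e-10
```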
Theorem B.2.3 (Jordan normal form). For every $A \in \mathbb{R}^{n \times n}$ there exists a nonsingular matrix $T$ such that $A = T \Lambda T^{-1}$ with a matrix $\Lambda$ of the form $\Lambda = \mathrm{blockdiag}(\Lambda_i)_{1 \le i \le s}$,
\[
\Lambda_i := \begin{pmatrix}
\lambda_i & 1 & & \\
& \ddots & \ddots & \\
& & \ddots & 1 \\
& & & \lambda_i
\end{pmatrix} \in \mathbb{C}^{k_i \times k_i}, \qquad 1 \le i \le s,
\]
and $\{\lambda_1, \ldots, \lambda_s\} = \sigma(A)$.
Bibliography
[2] S. Agmon, A. Douglis, and L. Nirenberg. Estimates near the boundary of solutions of elliptic partial differential equations satisfying general boundary conditions II. Comm. on Pure and Appl. Math., 17:35–92, 1964.
[4] D. N. Arnold, F. Brezzi, and M. Fortin. A stable finite element for the Stokes equation. Calcolo, 21:337–344, 1984.
[5] W. E. Arnoldi. The principle of minimized iterations in the solution of the matrix eigenvalue problem. Quart. Appl. Math., 9:17–29, 1951.
[6] O. Axelsson. Iterative Solution Methods. Cambridge University Press, New York, 1994.
[7] O. Axelsson and V. A. Barker. Finite Element Solution of Boundary Value Problems. Theory and Computation. Academic Press, Orlando, 1984.
[8] R. E. Bank and T. F. Chan. A composite step biconjugate gradient method. Numer. Math., 66:295–319, 1994.
[9] R. E. Bank and L. R. Scott. On the conditioning of finite element equations with highly refined meshes. SIAM J. Numer. Anal., 26:1383–1394, 1989.
[11] M. Bercovier and O. Pironneau. Error estimates for finite element solution of the Stokes problem in primitive variables. Numer. Math., 33:211–224, 1979.
[13] C. Bernardi. Optimal finite element interpolation on curved domains. SIAM J. Numer. Anal., 26:1212–1240, 1989.
[14] C. Bernardi and V. Girault. A local regularization operator for triangular and quadrilateral finite elements. SIAM J. Numer. Anal., 35:1893–1915, 1998.
[15] D. Boffi. Stability of higher-order triangular Hood–Taylor methods for the stationary Stokes equations. Math. Models Methods Appl. Sci., 4:223–235, 1994.
[16] D. Boffi. Three-dimensional finite element methods for the Stokes problem. SIAM J. Numer. Anal., 34:664–670, 1997.
[17] D. Braess, M. Dryja, and W. Hackbusch. A multigrid method for nonconforming FE-discretisations with application to non-matching grids. Computing, 63:1–25, 1999.
[18] D. Braess and W. Hackbusch. A new convergence proof for the multigrid method including the V-cycle. SIAM J. Numer. Anal., 20:967–975, 1983.
[20] J. H. Bramble and S. R. Hilbert. Estimation of linear functionals on Sobolev spaces with applications to Fourier transforms and spline interpolation. SIAM J. Numer. Anal., 7:113–124, 1970.
[21] S. C. Brenner and L. R. Scott. The Mathematical Theory of Finite Element Methods. Springer, New York, 1994.
[22] F. Brezzi and R. S. Falk. Stability of higher-order Hood–Taylor methods. SIAM J. Numer. Anal., 28:581–590, 1991.
[23] W. L. Briggs, V. E. Henson, and S. F. McCormick. A Multigrid Tutorial (2nd ed.). SIAM, Philadelphia, 2000.
[24] O. Bröker, M. Grote, C. Mayer, and A. Reusken. Robust parallel smoothing for multigrid via sparse approximate inverses. SIAM J. Sci. Comput., 32:1395–1416, 2001.
[27] P. G. Ciarlet. The Finite Element Method for Elliptic Problems. North-Holland, Amsterdam, 1978.
[28] P. G. Ciarlet. Basic error estimates for elliptic problems. In P. G. Ciarlet and J. L. Lions, editors, Handbook of Numerical Analysis, Volume II: Finite Element Methods (Part 1). North-Holland, Amsterdam, 1991.
[29] P. Clément. Approximation by finite element functions using local regularization. RAIRO Anal. Numér., 9(R-2):77–84, 1975.
[31] G. Duvaut and J. L. Lions. Les Inéquations en Mécanique et en Physique. Dunod, Paris, 1972.
[32] E. Ecker and W. Zulehner. On the smoothing property of multigrid methods in the nonsymmetric case. Numerical Linear Algebra with Applications, 3:161–172, 1996.
[33] V. Faber and T. Manteuffel. Orthogonal error methods. SIAM J. Numer. Anal., 20:352–362, 1984.
[34] M. Fiedler. Special Matrices and their Applications in Numerical Mathematics. Nijhoff, Dordrecht, 1986.
[35] R. Fletcher. Conjugate gradient methods for indefinite systems. In G. A. Watson, editor, Numerical Analysis Dundee 1975, Lecture Notes in Mathematics, Vol. 506, pages 73–89, Berlin, 1976. Springer.
[37] G. Frobenius. Über Matrizen aus nicht negativen Elementen. Preuss. Akad. Wiss., pages 456–477, 1912.
[38] E. Gartland. Strong uniform stability and exact discretizations of a model singular perturbation problem and its finite difference approximations. Appl. Math. Comput., 31:473–485, 1989.
[39] D. Gilbarg and N. S. Trudinger. Elliptic Partial Differential Equations of Second Order. Springer, Berlin, Heidelberg, 1977.
[40] V. Girault and P.-A. Raviart. Finite Element Methods for Navier–Stokes Equations, volume 5 of Springer Series in Computational Mathematics. Springer, Berlin, Heidelberg, 1986.
[41] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, 2nd edition, 1989.
[42] A. Greenbaum. Iterative Methods for Solving Linear Systems. SIAM, Philadelphia, 1997.
[44] W. Hackbusch. Multigrid Methods and Applications, volume 4 of Springer Series in Computational Mathematics. Springer, Berlin, Heidelberg, 1985.
[47] W. Hackbusch. Elliptic Differential Equations: Theory and Numerical Treatment, volume 18 of Springer Series in Computational Mathematics. Springer, Berlin, 1992.
[50] L. A. Hageman and D. M. Young. Applied Iterative Methods. Academic Press, New York, 1981.
[51] M. R. Hestenes and E. Stiefel. Methods of conjugate gradients for solving linear systems. J. Res. Nat. Bur. Stand., 49:409–436, 1952.
[52] P. Hood and C. Taylor. A numerical solution of the Navier–Stokes equations using the finite element technique. Comp. and Fluids, 1:73–100, 1973.
[53] J. Kadlec. On the regularity of the solution of the Poisson problem on a domain with boundary locally similar to the boundary of a convex open set. Czechoslovak Math. J., 14(89):386–393, 1964. (Russian).
[54] R. B. Kellogg and J. E. Osborn. A regularity result for the Stokes problem in a convex polygon. J. Funct. Anal., 21:397–431, 1976.
[55] E. Kreyszig. Introductory Functional Analysis with Applications. Wiley, New York, 1978.
[57] O. A. Ladyzhenskaya and N. A. Uraltseva. Linear and Quasilinear Elliptic Equations, volume 46 of Mathematics in Science and Engineering. Academic Press, New York, London, 1968.
[58] P. Lancaster and M. Tismenetsky. The Theory of Matrices. Academic Press, Orlando, 2nd edition, 1985.
[59] C. Lanczos. Solution of systems of linear equations by minimized iterations. J. Res. Nat. Bur. Stand., 49:33–53, 1952.
[60] J. L. Lions and E. Magenes. Non-Homogeneous Boundary Value Problems and Applications, Vol. I. Springer, Berlin, 1972.
[61] J. T. Marti. Introduction to Sobolev Spaces and Finite Element Solution of Elliptic Boundary Value Problems. Academic Press, London, 1986.
[62] J. A. Meijerink and H. A. van der Vorst. An iterative solution method for linear systems of which the coefficient matrix is a symmetric M-matrix. Math. Comp., 31:148–162, 1977.
[63] N. Meyers and J. Serrin. H = W. Proc. Nat. Acad. Sci. USA, 51:1055–1056, 1964.
[64] C. Miranda. Partial Differential Equations of Elliptic Type. Springer, Berlin, 1970.
[65] J. Nečas. Les Méthodes Directes en Théorie des Équations Elliptiques. Masson, Paris, 1967.
[66] O. Nevanlinna. Convergence of Iterations for Linear Equations. Birkhäuser, Basel, 1993.
[68] W. Niethammer. The SOR method on parallel computers. Numer. Math., 56:247–254, 1989.
[69] U. Trottenberg, C. W. Oosterlee, and A. Schüller. Multigrid. Academic Press, London, 2001.
[70] C. C. Paige and M. Saunders. Solution of sparse indefinite systems of linear equations. SIAM J. Numer. Anal., 12:617–629, 1975.
[71] O. Perron. Zur Theorie der Matrizen. Math. Ann., 64:248–263, 1907.
[72] S. D. Poisson. Remarques sur une équation qui se présente dans la théorie des attractions des sphéroïdes. Nouveau Bull. Soc. Philomathique de Paris, 3:388–392, 1813.
[73] A. Quarteroni and A. Valli. Numerical Approximation of Partial Differential Equations, volume 23 of Springer Series in Computational Mathematics. Springer, Berlin, Heidelberg, 1994.
[74] A. Reusken. On maximum norm convergence of multigrid methods for two-point boundary value problems. SIAM J. Numer. Anal., 29:1569–1578, 1992.
[75] A. Reusken. The smoothing property for regular splittings. In W. Hackbusch and G. Wittum, editors, Incomplete Decompositions: (ILU) Algorithms, Theory and Applications, volume 41 of Notes on Numerical Fluid Mechanics, pages 130–138, Braunschweig, 1993. Vieweg.
[76] H.-G. Roos, M. Stynes, and L. Tobiska. Numerical Methods for Singularly Perturbed Differential Equations, volume 24 of Springer Series in Computational Mathematics. Springer, Berlin, Heidelberg, 1996.
[77] T. Rusten and R. Winther. A preconditioned iterative method for saddlepoint problems. SIAM J. Matrix Anal. Appl., 13:887–904, 1992.
[78] Y. Saad. Iterative Methods for Sparse Linear Systems. PWS Publishing Company, London, 1996.
[79] Y. Saad and M. H. Schultz. Conjugate gradient-like algorithms for solving nonsymmetric linear systems. Math. Comp., 44:417–424, 1985.
[80] Y. Saad and M. H. Schultz. GMRES: a generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM J. Sci. Statist. Comput., 7:856–869, 1986.
[81] L. R. Scott and S. Zhang. Finite element interpolation of nonsmooth functions satisfying boundary conditions. Math. Comp., 54:483–493, 1990.
[82] D. Silvester and A. Wathen. Fast iterative solution of stabilised Stokes systems. Part II: Using general block preconditioners. SIAM J. Numer. Anal., 31:1352–1367, 1994.
[84] G. L. G. Sleijpen and D. R. Fokkema. BiCGstab(l) for linear equations involving matrices with complex spectrum. ETNA, 1:11–32, 1993.
[85] G. L. G. Sleijpen and H. van der Vorst. Optimal iteration methods for large linear systems of equations. In C. B. Vreugdenhil and B. Koren, editors, Numerical Methods for Advection-Diffusion Problems, volume 45 of Notes on Numerical Fluid Mechanics, pages 291–320. Vieweg, Braunschweig, 1993.
[86] P. Sonneveld. CGS: a fast Lanczos-type solver for nonsymmetric linear systems. SIAM J. Sci. Statist. Comput., 10:36–52, 1989.
[87] R. Stenberg. Error analysis of some finite element methods for the Stokes problem. Math. Comp., 54:495–508, 1990.
[88] G. Strang. Linear Algebra and its Applications. Harcourt Brace Jovanovich, San Diego, 3rd edition, 1988.
[89] A. van der Sluis. Condition numbers and equilibration of matrices. Numer. Math., 14:14–23, 1969.
[90] A. van der Sluis and H. van der Vorst. The rate of convergence of conjugate gradients. Numer. Math., 48:543–560, 1986.
[91] H. A. van der Vorst. Bi-CGSTAB: a fast and smoothly converging variant of Bi-CG for the solution of nonsymmetric linear systems. SIAM J. Sci. Statist. Comput., 13:631–644, 1992.
[92] R. S. Varga. Matrix Iterative Analysis. Prentice-Hall, Englewood Cliffs, New Jersey, 1962.
[93] R. Verfürth. Error estimates for a mixed finite element approximation of the Stokes problem. RAIRO Anal. Numér., 18:175–182, 1984.
[94] R. Verfürth. Robust a posteriori error estimates for stationary convection-diffusion equations. SIAM J. Numer. Anal., 43:1766–1782, 2005.
[97] G. Wittum. Linear iterations as smoothers in multigrid methods: theory with applications to incomplete decompositions. Impact Comput. Sci. Eng., 1:180–215, 1989.
[98] G. Wittum. On the robustness of ILU smoothing. SIAM J. Sci. Stat. Comp., 10:699–717, 1989.
[99] J. Wloka. Partial Differential Equations. Cambridge University Press, Cambridge, 1987.
[100] D. M. Young. Iterative Solution of Large Linear Systems. Academic Press, New York, 1971.