
Inverse Problems

Martin Benning
Updated on: June 6, 2016

Lecture Notes
Michaelmas Term 2015
Contents

1 Introduction to inverse problems
  1.1 Examples
    1.1.1 Matrix inversion
    1.1.2 Differentiation
    1.1.3 Deconvolution
    1.1.4 Tomography
    1.1.5 Magnetic Resonance Imaging (MRI)

2 Linear inverse problems
  2.1 Generalised inverse
  2.2 Compact operators
  2.3 Singular value decomposition of compact operators

3 Regularisation
  3.1 Truncated singular value decomposition
  3.2 Tikhonov regularisation
  3.3 Source-conditions
  3.4 Asymptotic regularisation
  3.5 Landweber iteration
  3.6 Variational regularisation
    3.6.1 Tikhonov-type regularisation
  3.7 Convex Variational Calculus

4 Computational realisation
  4.1 A primal dual hybrid gradient method
    4.1.1 Convergence of PDHGM
    4.1.2 Deconvolution with total variation regularisation


These lecture notes are based on the following books and lecture notes:

1. Engl, Heinz Werner, Martin Hanke, and Andreas Neubauer. Regularization of Inverse Problems. Vol. 375. Springer Science & Business Media, 1996.

2. Mueller, Jennifer L., and Samuli Siltanen. Linear and nonlinear inverse problems with
practical applications. Vol. 10. SIAM, 2012.

3. Martin Burger, Inverse Problems. Lecture notes winter 2007/2008.


http://wwwmath.uni-muenster.de/num/Vorlesungen/IP_WS07/skript.pdf

4. Christian Clason, Inverse Probleme (in German), Lecture notes winter 2014/2015
https://www.uni-due.de/~adf040p/teaching/inverse_14/InverseSkript.pdf

The lecture notes are under constant redevelopment and will likely contain errors and mistakes. I
very much appreciate the finding and reporting of those (to mb941@cam.ac.uk). Thanks!
Chapter 1

Introduction to inverse problems

Solving an inverse problem is the task of computing an unknown physical quantity that is related
to given, indirect measurements via a forward model. Inverse problems appear in a vast range
of applications, including imaging (Computed Tomography (CT), Positron Emission Tomography
(PET), Magnetic Resonance Imaging (MRI), Electron Tomography (ET), microscopic imaging,
geophysical imaging), signal- and image-processing, computer vision, machine learning and (big)
data analysis in general, and many more.
Mathematically, an inverse problem can be described as the solution of the operator equation

Ku = f (1.1)

with given measurement data f for the unknown quantity u. Here, K : U → V denotes an operator
mapping from the Banach space U to the Banach space V. Throughout this lecture we are going
to restrict ourselves to linear and bounded operators though.
Inverting a forward model however is not straightforward in most relevant applications, for two
basic reasons: either a (unique) inverse model simply does not exist, or existing inverse models
heavily amplify small measurement errors. In the sense of Hadamard the problem (1.1) is called
well-posed if

• for all input data there exists a solution of the problem, i.e. for all f ∈ V there exists a
u ∈ U with Ku = f .

• for all input data this solution is unique, i.e. u ≠ v implies Kv ≠ f .

• the solution of the problem depends continuously on the input datum, i.e. for all sequences {u_k}_{k∈ℕ} ⊂ U with Ku_k → f we have u_k → u.

If any of these conditions is violated, problem (1.1) is called ill-posed. In the following we are going
to see that most practically relevant inverse problems are ill-posed or approximately ill-posed.¹

1.1 Examples
In the following we are going to present various examples of inverse problems and highlight the
challenges of dealing with them.
¹ In fact, the name ill-posed problems may be a more suitable name for this lecture, as the real challenge is to deal with the ill-posedness of the inverse problems. However, the name inverse problems has become more widely accepted for this field of mathematics.


1.1.1 Matrix inversion


One of the simplest (classes of) inverse problems that arises from (numerical) linear algebra is
the solution of linear systems. These can be written in the form of (1.1) with u ∈ Rn and f ∈ Rn
being n-dimensional vectors with real entries and K ∈ Rn×n being a matrix with real entries. We
further assume K to be a symmetric, positive definite matrix. In that case we know from the
spectral theory of symmetric matrices that there exist Eigenvalues 0 ≤ λ1 ≤ λ2 ≤ . . . ≤ λn and
corresponding Eigenvectors kj ∈ Rn for j ∈ {1, . . . , n} such that K can be written as

$$K = \sum_{j=1}^{n} \lambda_j k_j k_j^T . \qquad (1.2)$$

Note that by dividing both sides of (1.1) by λ_n, (1.2) can be rescaled so that λ_n = 1 and λ_1 = κ^{-1} hold. Here, κ = λ_n/λ_1 (in the case λ_1 ≠ 0) denotes the so-called condition number of the matrix K. It is well known from numerical linear algebra that the condition number is a measure of how stably (1.1) can be solved.
If we assume that we observe $f^\delta$ instead of $f$, with $\|f - f^\delta\|_2 \le \delta$ (here $\|\cdot\|_2$ denotes the Euclidean norm), and further denote by $u^\delta$ the solution of $Ku^\delta = f^\delta$, the difference between $u^\delta$ and the solution $u$ of (1.1) reads as
$$u - u^\delta = \sum_{j=1}^{n} \lambda_j^{-1} k_j k_j^T \left( f - f^\delta \right) .$$

Therefore we can estimate
$$\left\| u - u^\delta \right\|_2^2 = \sum_{j=1}^{n} \lambda_j^{-2} \underbrace{\| k_j \|_2^2}_{=1} \left| k_j^T \left( f - f^\delta \right) \right|^2 \le \lambda_1^{-2} \left\| f - f^\delta \right\|_2^2 ,$$
due to λ_1 ≤ λ_i for i ≠ 1 and the orthogonality of the Eigenvectors. Thus, taking the square root yields the estimate
$$\left\| u - u^\delta \right\|_2 \le \kappa \left\| f - f^\delta \right\|_2 \le \kappa \delta .$$

Hence, we observe that in the worst case an error δ in the data f is amplified by the condition
number κ of the matrix K. A matrix with large κ is therefore called ill-conditioned. We want to
demonstrate the effect of this error amplification with a small example.

Example 1.1. Let us consider the matrix
$$K = \begin{pmatrix} 1 & 1 \\ 1 & \frac{1001}{1000} \end{pmatrix}$$
and the vector f = (1, 1)^T. Then the solution of Ku = f is simply given via u = (1, 0)^T. If we, however, consider the perturbed data f^δ = (99/100, 101/100)^T instead of f, the solution u^δ of Ku^δ = f^δ is u^δ = (−19.01, 20)^T.
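
To make the amplification concrete, the following is a minimal numerical sketch of Example 1.1; it assumes NumPy is available, and the variable names are chosen for illustration only.

```python
import numpy as np

# ill-conditioned matrix from Example 1.1
K = np.array([[1.0, 1.0],
              [1.0, 1001.0 / 1000.0]])

f = np.array([1.0, 1.0])          # exact data
f_delta = np.array([0.99, 1.01])  # perturbed data

u = np.linalg.solve(K, f)              # exact solution (1, 0)
u_delta = np.linalg.solve(K, f_delta)  # perturbed solution (-19.01, 20)

print("condition number kappa:", np.linalg.cond(K))   # large (roughly 4e3)
print("data error            :", np.linalg.norm(f - f_delta))
# much larger than the data error (bounded by kappa * delta)
print("solution error        :", np.linalg.norm(u - u_delta))
```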

1.1.2 Differentiation
Another classic inverse problem is differentiation. Assume we are given a function f ∈ L^∞([0, 1]) with f(0) = 0 for which we want to compute u = f′. These conditions are satisfied if and only if u and f satisfy the operator equation
$$f(y) = \int_0^y u(x) \, dx ,$$
which can be written as the operator equation Ku = f with the linear operator $(Ku)(y) := \int_0^y u(x) \, dx$. As in the previous section, we assume that instead of f we observe a noisy version f^δ, for which we further assume that the perturbation is additive, i.e. f^δ = f + n^δ with f ∈ C¹([0, 1]) and n^δ ∈ L^∞([0, 1]). It is obvious that the derivative of f^δ only exists if the noise n^δ is differentiable. However, even in case n^δ is differentiable, the error in the derivative can become arbitrarily large.
Consider for instance the sequence (δ_j)_{j∈ℕ} with δ_j → 0 and the sequence of corresponding noise functions n^{δ_j} ∈ C¹([0, 1]) ↪ L^∞([0, 1]) with
$$n^{\delta_j}(x) := \delta_j \sin\left( \frac{kx}{\delta_j} \right) ,$$
for a fixed but arbitrary number k. On the one hand we observe $\| n^{\delta_j} \|_{L^\infty([0,1])} = \delta_j \to 0$, but on the other hand we have
$$u^{\delta_j}(x) = f'(x) + k \cos\left( \frac{kx}{\delta_j} \right) ,$$
and therefore obtain the estimate
$$\left\| u^{\delta_j} - u \right\|_{L^\infty([0,1])} = \left\| \left( n^{\delta_j} \right)' \right\|_{L^\infty([0,1])} = k .$$

Thus, despite the noise in the data becoming arbitrarily small, the error in the derivative can
become arbitrarily big (depending on k).
Note that considering a decreasing error in the norm of the Banach space C¹([0, 1]) will yield a different result. If we have a sequence of noise functions with $\| n^{\delta_j} \|_{C^1([0,1])} = \delta_j \to 0$ instead, we can conclude
$$\left\| u^{\delta_j} - u \right\|_{L^\infty([0,1])} \le \left\| n^{\delta_j} \right\|_{C^1([0,1])} \to 0 ,$$
due to C¹([0, 1]) being embedded in L^∞([0, 1]). In contrast to the previous example, the sequence of functions $n^{\delta_j}(x) := \delta_j \sin(kx)$ for instance satisfies
$$\left\| n^{\delta_j} \right\|_{C^1([0,1])} = \sup_{x \in [0,1]} \left| n^{\delta_j}(x) \right| + \sup_{x \in [0,1]} \left| \left( n^{\delta_j} \right)'(x) \right| = (1 + k)\delta_j \to 0 .$$
However, for a fixed δ the bound on $\| u - u^\delta \|_{L^\infty([0,1])}$ can obviously still become fairly large compared to δ, depending on how large k is.
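
The following minimal NumPy sketch mimics this effect with finite differences; the grid, noise level and value of k are illustrative assumptions. The oscillatory perturbation δ sin(kx/δ) is tiny in the supremum norm, yet its derivative is of size k.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 100001)
h = x[1] - x[0]

f = x**2 / 2.0                                # exact data, so u = f' = x
delta, k = 1e-3, 50.0
f_delta = f + delta * np.sin(k * x / delta)   # perturbation, sup-norm <= delta

u = x                                         # exact derivative
u_delta = np.gradient(f_delta, h)             # numerical derivative of noisy data

print("data error   (sup):", np.max(np.abs(f_delta - f)))   # ~ delta
print("deriv. error (sup):", np.max(np.abs(u_delta - u)))   # ~ k
```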



1.1.3 Deconvolution
An interesting problem that occurs in many imaging, image- and signal processing applications is
the deblurring or deconvolution of signals from a known, linear degradation. Deconvolution of a
signal can be modelled as solving the inverse problem of the convolution, which reads as
$$f(y) = (Ku)(y) := \int_{\mathbb{R}^n} u(x) \, g(y - x) \, dx . \qquad (1.3)$$

Here f denotes the blurry image, u is the (unknown) true image and g is the function that models
the degradation. Due to the Fourier convolution theorem we can rewrite (1.3) as
$$f = (2\pi)^{\frac{n}{2}} \mathcal{F}^{-1}\left( \mathcal{F}(u) \mathcal{F}(g) \right) , \qquad (1.4)$$

with F denoting the Fourier transform


$$\mathcal{F}(u)(\xi) := (2\pi)^{-\frac{n}{2}} \int_{\mathbb{R}^n} u(x) \exp(-i x \cdot \xi) \, dx \qquad (1.5)$$

and F −1 being the inverse Fourier transform


$$\mathcal{F}^{-1}(f)(x) := (2\pi)^{-\frac{n}{2}} \int_{\mathbb{R}^n} f(\xi) \exp(i x \cdot \xi) \, d\xi . \qquad (1.6)$$

It is important to note that the inverse Fourier transform is indeed the unique, inverse operator of
the Fourier transform in the Hilbert-space L2 due to the theorem of Plancherel. If we rearrange
(1.4) to solve it for u, we obtain
$$u = (2\pi)^{-\frac{n}{2}} \mathcal{F}^{-1}\left( \frac{\mathcal{F}(f)}{\mathcal{F}(g)} \right) , \qquad (1.7)$$

and hence, we allegedly can recover u by simple division in the Fourier domain. However, we are
quickly going to discover that this inverse problem is ill-posed and the division will lead to heavy
amplification of small errors.
Let u denote the image that satisfies (1.3). Further, we assume that instead of the blurry image f we observe f^δ = f + n^δ, and that u^δ is the solution of (1.7) with input datum f^δ. Hence, we observe
$$(2\pi)^{\frac{n}{2}} \left( u^\delta - u \right) = \mathcal{F}^{-1}\left( \frac{\mathcal{F}(f^\delta - f)}{\mathcal{F}(g)} \right) = \mathcal{F}^{-1}\left( \frac{\mathcal{F}(n^\delta)}{\mathcal{F}(g)} \right) . \qquad (1.8)$$

As the convolution kernel g usually has compact support, F(g) will tend to zero for high frequencies. Hence, the denominator in (1.8) becomes fairly small, whereas the numerator remains non-zero, as the noise contains high frequencies. Thus, in the limit the solution does not depend continuously on the data, and the deconvolution problem is therefore ill-posed.
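
The following NumPy sketch illustrates the effect; the signal, kernel width and noise level are illustrative assumptions. A 1D signal is blurred with a Gaussian kernel, a tiny amount of noise is added, and the naive Fourier division of (1.7) is applied; the reconstruction is dominated by amplified noise.

```python
import numpy as np

n = 512
x = np.linspace(0.0, 1.0, n, endpoint=False)
u = (np.abs(x - 0.5) < 0.1).astype(float)        # true signal: a box

# Gaussian convolution kernel (periodic convolution via FFT)
g = np.exp(-0.5 * ((x - 0.5) / 0.005)**2)
g /= g.sum()
g_hat = np.fft.fft(np.fft.ifftshift(g))

f = np.real(np.fft.ifft(np.fft.fft(u) * g_hat))   # blurred signal
np.random.seed(0)
f_delta = f + 1e-4 * np.random.randn(n)           # tiny additive noise

# naive deconvolution: division in the Fourier domain
u_delta = np.real(np.fft.ifft(np.fft.fft(f_delta) / g_hat))

print("noise level        :", np.linalg.norm(f - f_delta))
print("reconstruction error:", np.linalg.norm(u - u_delta))   # many orders larger
```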

1.1.4 Tomography
In almost any tomography application the underlying inverse problem is either the inversion of
the Radon transform or of the X-ray transform in dimensions higher than two. For u ∈ C0∞ (Rn ),

s ∈ ℝ and θ ∈ S^{n−1}, the Radon transform² R : C_0^∞(ℝ^n) → C^∞(S^{n−1} × ℝ) can be defined as the line integral operator
$$f(\theta, s) = (Ru)(\theta, s) = \int_{x \cdot \theta = s} u(x) \, dx \qquad (1.9)$$
$$\phantom{f(\theta, s) = (Ru)(\theta, s)} = \int_{\mathbb{R}} u(s\theta + t\theta^\perp) \, dt . \qquad (1.10)$$

Here θ⊥ denotes a vector orthogonal to θ.


Example 1.2. Let n = 2. Then S^{n−1} is simply the unit sphere S¹ = {θ ∈ ℝ² | ‖θ‖₂ = 1}. We can choose for instance θ = (cos(ϕ), sin(ϕ))^T, ϕ ∈ [0, 2π[, and parametrise the Radon transform in terms of ϕ and s, i.e.
$$f(\varphi, s) = (Ru)(\varphi, s) = \int_{\mathbb{R}} u\big( s\cos(\varphi) - t\sin(\varphi), \; s\sin(\varphi) + t\cos(\varphi) \big) \, dt .$$

Note that, with respect to the origin of the reference coordinate system, ϕ determines the angle of the line along which one wants to integrate, while s is the offset of that line from the centre of the coordinate system.
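
As a rough numerical illustration, a discrete Radon transform can be approximated by rotating the image and summing along one axis; the sketch below assumes NumPy and SciPy are available, and the phantom and sampling are chosen for illustration only.

```python
import numpy as np
from scipy.ndimage import rotate

def radon(image, angles_deg):
    """Simple discrete Radon transform: for each angle, rotate the image
    and integrate (sum) along the vertical axis."""
    sinogram = np.zeros((image.shape[1], len(angles_deg)))
    for i, phi in enumerate(angles_deg):
        rotated = rotate(image, phi, reshape=False, order=1)
        sinogram[:, i] = rotated.sum(axis=0)   # line integrals for each offset s
    return sinogram

# toy phantom: a square of density 1 inside a 128x128 image
u = np.zeros((128, 128))
u[48:80, 48:80] = 1.0

f = radon(u, angles_deg=np.linspace(0.0, 180.0, 60, endpoint=False))
print(f.shape)   # (128, 60): offsets s times angles phi
```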

X-ray Computed Tomography (CT)


In X-ray computed tomography (CT), the unknown quantity u represents a spatially varying den-
sity that is exposed to X-radiation from different angles, and that absorbs the radiation according
to its material or biological properties.
The basic modelling assumption for the intensity decay of an X-ray beam is that on a small distance ∆t it is proportional to the intensity itself, the density and the distance, i.e.
$$\frac{I(s\theta + (t + \Delta t)\theta^\perp) - I(s\theta + t\theta^\perp)}{\Delta t} = -I(s\theta + t\theta^\perp) \, u(s\theta + t\theta^\perp) .$$
By taking the limit ∆t → 0 we end up with the ordinary differential equation
$$\frac{d}{dt} I(s\theta + t\theta^\perp) = -I(s\theta + t\theta^\perp) \, u(s\theta + t\theta^\perp) . \qquad (1.11)$$
We now integrate (1.11) from t = −R/2, the position of the emitter, to t = R/2, the position of the detector, to obtain
$$\int_{-\frac{R}{2}}^{\frac{R}{2}} \frac{\frac{d}{dt} I(s\theta + t\theta^\perp)}{I(s\theta + t\theta^\perp)} \, dt = -\int_{-\frac{R}{2}}^{\frac{R}{2}} u(s\theta + t\theta^\perp) \, dt .$$

Note that due to d/dx log(f(x)) = f′(x)/f(x) the left hand side in the above equation simplifies to
$$\int_{-\frac{R}{2}}^{\frac{R}{2}} \frac{\frac{d}{dt} I(s\theta + t\theta^\perp)}{I(s\theta + t\theta^\perp)} \, dt = \log I\left( s\theta + \frac{R}{2}\theta^\perp \right) - \log I\left( s\theta - \frac{R}{2}\theta^\perp \right) .$$

As we know the radiation intensity at both the emitter and the detector, we therefore know
f (θ, s) := log(I(sθ − (Rθ⊥ )/2)) − log(I(sθ + (Rθ⊥ )/2)) and we can write the estimation of the
unknown density u as the inverse problem of the Radon transform (1.10) (if we further assume
that u can be continuously extended to zero outside of the circle of radius R).
² Named after the Austrian mathematician Johann Karl August Radon (16 December 1887 – 25 May 1956).

Positron Emission Tomography (PET)

In Positron Emission Tomography (PET) a so-called radioactive tracer (a positron emitting ra-
dionuclide on a biologically active molecule) is injected into a patient (or subject). The emitted
positrons of the tracer will interact with the subjects’ electrons after travelling a short distance
(usually less than 1mm), causing the annihilation of both the positron and the electron, which
results in a pair of gamma photons moving into (approximately) opposite directions. This pair
of photons is detected by the scanner detectors, and an intensity f (ϕ, s) can be associated with
the number of annihilations detected at the detector pair that forms the line with offset s and
angle ϕ (with respect to the reference coordinate system). Thus, we can consider the problem of
recovering the unknown tracer density u as a solution of the inverse problem (1.10) again. The line
of integration is determined by the position of the detector pairs and the geometry of the scanner.

1.1.5 Magnetic Resonance Imaging (MRI)


Magnetic resonance imaging (MRI) is an imaging technique that allows one to visualise the chemical
composition of patients or materials. MRI scanners use strong magnetic fields and radio waves
to excite subatomic particles (like protons) that subsequently emit radio frequency signals which
can be measured by the radio frequency coils. In the following we want to briefly outline the
mathematics of the acquisition process. Subsequently we are going to see that finding the unknown
spin proton density basically leads to solving the inverse problem of the Fourier transform (1.5).
The magnetisation of a so-called spin isochromat can be described by the Bloch equations³
$$\frac{d}{dt} \begin{pmatrix} M_x(t) \\ M_y(t) \\ M_z(t) \end{pmatrix} = \begin{pmatrix} -\frac{1}{T_2} & \gamma B_z(t) & -\gamma B_y(t) \\ -\gamma B_z(t) & -\frac{1}{T_2} & \gamma B_x(t) \\ \gamma B_y(t) & -\gamma B_x(t) & -\frac{1}{T_1} \end{pmatrix} \begin{pmatrix} M_x(t) \\ M_y(t) \\ M_z(t) \end{pmatrix} + \begin{pmatrix} 0 \\ 0 \\ \frac{M_0}{T_1} \end{pmatrix} . \qquad (1.12)$$

Here M (t) = (Mx (t), My (t), Mz (t)) is the nuclear magnetisation (of the spin isochromat), γ is the
gyromagnetic ratio, B(t) = (Bx (t), By (t), Bz (t)) denotes the magnetic field experienced by the
nuclei, T1 is the longitudinal and T2 the transverse relaxation time and M0 the magnetisation in
thermal equilibrium. If we define Mxy (t) = Mx (t) + iMy (t) and Bxy (t) = Bx (t) + iBy (t), we can
rewrite (1.12) as
$$\frac{d}{dt} M_{xy}(t) = -i\gamma \left( M_{xy}(t) B_z(t) - M_z(t) B_{xy}(t) \right) - \frac{M_{xy}(t)}{T_2} \qquad (1.13a)$$
$$\frac{d}{dt} M_z(t) = i\frac{\gamma}{2} \left( M_{xy}(t) \overline{B_{xy}(t)} - \overline{M_{xy}(t)} B_{xy}(t) \right) - \frac{M_z(t) - M_0}{T_1} \qquad (1.13b)$$
with $\overline{\,\cdot\,}$ denoting the complex conjugate.


If we assume for instance that B = (0, 0, B_0) is just a constant magnetic field in z-direction, (1.13) reduces to the decoupled equations
$$\frac{d}{dt} M_{xy}(t) = -i\gamma B_0 M_{xy}(t) - \frac{M_{xy}(t)}{T_2} , \qquad (1.14a)$$
$$\frac{d}{dt} M_z(t) = -\frac{M_z(t) - M_0}{T_1} . \qquad (1.14b)$$
³ Named after the Swiss-born American physicist Felix Bloch (23 October 1905 – 10 September 1983).

It is easy to see that this system of equations (1.14) has the unique solution
$$M_{xy}(t) = e^{-t(i\omega_0 + 1/T_2)} M_{xy}(0) \qquad (1.15a)$$
$$M_z(t) = M_z(0) e^{-\frac{t}{T_1}} + M_0 \left( 1 - e^{-\frac{t}{T_1}} \right) \qquad (1.15b)$$
for ω_0 := γB_0 denoting the Larmor frequency, and M_{xy}(0), M_z(0) being the initial magnetisations at time t = 0.

Rotating frame
Thus, for a constant magnetic background field B_0 in z-direction, M_{xy} basically rotates around the z-axis in clockwise direction with frequency ω_0 (if we ignore the T_2 decay for a moment). Rotating the x- and y-coordinate axes with the same frequency yields the representation of the Bloch equations in the so-called rotating frame. If we substitute $M^r_{xy}(t) := e^{i\omega_0 t} M_{xy}(t)$, $B^r_{xy}(t) := e^{i\omega_0 t} B_{xy}(t)$, $M^r_z(t) := M_z(t)$ and $B^r_z(t) := B_z(t)$, we obtain
$$\frac{d}{dt} M^r_{xy}(t) = -i\gamma \left( M^r_{xy}(t) \left( B^r_z(t) - B_0 \right) - M^r_z(t) B^r_{xy}(t) \right) - \frac{M^r_{xy}(t)}{T_2} \qquad (1.16a)$$
$$\frac{d}{dt} M^r_z(t) = i\frac{\gamma}{2} \left( M^r_{xy}(t) \overline{B^r_{xy}(t)} - \overline{M^r_{xy}(t)} B^r_{xy}(t) \right) - \frac{M^r_z(t) - M_0}{T_1} \qquad (1.16b)$$
instead of (1.13).
Thus, if we assume the magnetic field to be constant with magnitude B_0 in z-direction within the rotating frame, i.e. $B^r(t) = (B^r_x(t), B^r_y(t), B_0)$, (1.16a) simplifies to
$$\frac{d}{dt} M^r_{xy}(t) = i\gamma M^r_z(t) B^r_{xy}(t) - \frac{M^r_{xy}(t)}{T_2} . \qquad (1.17)$$

90◦ pulse
Now we assume that $B^r_x(t) = c$, c constant, and $B^r_y(t) = 0$ for t ∈ [0, τ], with τ ≪ T_1 and τ ≪ T_2. Then we can basically ignore the effects of $M^r_{xy}(t)/T_2$ and $(M^r_z(t) - M_0)/T_1$, and the Bloch equations in the rotating frame simplify to
$$\frac{d}{dt} \begin{pmatrix} M^r_x(t) \\ M^r_y(t) \\ M^r_z(t) \end{pmatrix} = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & \omega \\ 0 & -\omega & 0 \end{pmatrix} \begin{pmatrix} M^r_x(t) \\ M^r_y(t) \\ M^r_z(t) \end{pmatrix} \qquad (1.18)$$
with ω := γc, written in matrix form for the individual components. Assuming the initial magnetisation in the rotating frame to be zero in the x-y plane, i.e. $M^r_x(0) = 0$ and $M^r_y(0) = 0$, and equal to $M^r_z(0)$ along the z-axis, the solution of (1.18) can be written as
$$\begin{pmatrix} M^r_x(t) \\ M^r_y(t) \\ M^r_z(t) \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos(\omega t) & \sin(\omega t) \\ 0 & -\sin(\omega t) & \cos(\omega t) \end{pmatrix} \begin{pmatrix} 0 \\ 0 \\ M^r_z(0) \end{pmatrix} . \qquad (1.19)$$
Thus, equation (1.19) rotates the initial z-magnetisation around the x-axis by the angle θ := ωt. Note that if c and τ are chosen such that θ = π/2, all magnetisation is rotated from the z-axis to the y-axis, i.e. $M^r_y(\tau) = M^r_z(0)$. In analogy, choosing $B^r_x(t) = 0$ and $B^r_y(t) = c$, all magnetisation can be shifted from the z- to the x-axis.

Signal acquisition
If the radio-frequency (RF) pulse is turned off and thus $B^r_x(t) = 0$ and $B^r_y(t) = 0$ for t > τ, the same coils that have been used to induce the RF pulse can be used to measure the x-y magnetisation. Since we measure a volume of the whole x-y net-magnetisation, the acquired signal equals
$$y(t) = \int_{\mathbb{R}^3} M(x, t) \, dx = \int_{\mathbb{R}^3} e^{-i\omega_0(x) t} M^r(x, t) \, dx \qquad (1.20)$$
with M(x, t) denoting $M_{xy}(t)$ for a specific spatial coordinate x ∈ ℝ³ ($M^r(x, t)$ respectively). Using (1.15a) and assuming τ < t ≪ T_2, this yields
$$y(t) = \int_{\mathbb{R}^3} M_\tau(x) e^{-i\omega_0(x) t} \, dx , \qquad (1.21)$$
with $M_\tau(x)$ denoting the x-y magnetisation at spatial location x ∈ ℝ³ and time t = τ. Note that $M_\tau = 0$ without any RF pulse applied in advance.

Signal recovery
The basic clue to allow for spatially resolved nuclear magnetic resonance spectrometry is to add a magnetic field B̂(t) to the constant magnetic field B_0 in z-direction that varies spatially over time. Then, (1.14a) changes to
$$\frac{d}{dt} M_{xy}(t) = -i\gamma \left( B_0 + \hat{B}(t) \right) M_{xy}(t) - \frac{M_{xy}(t)}{T_2} ,$$
which, for initial value $M_{xy}(0)$, has the unique solution
$$M_{xy}(t) = e^{-i\gamma \left( B_0 t + \int_0^t \hat{B}(\tau) \, d\tau \right)} e^{-\frac{t}{T_2}} M_{xy}(0) \qquad (1.22)$$
if we ensure B̂(0) = 0. If now x(t) denotes the spatial location of a considered spin isochromat at time t, we can write B̂(t) as B̂(t) = x(t) · G(t), with a function G that describes the influence of the magnetic field gradient over time.
Based on the considerations that lead to (1.21) we therefore measure
$$y(t) = \int_{\mathbb{R}^3} M_\tau(x) \, e^{-i\gamma \left( B_0(x) t + \int_0^t x(\tau) \cdot G(\tau) \, d\tau \right)} \, dx$$
in an NMR experiment. Further assuming that B_0 is also constant in space, we can eliminate this term and write the signal acquisition as
$$e^{i\gamma B_0 t} y(t) = \int_{\mathbb{R}^3} M_\tau(x) \, e^{-i\gamma \int_0^t x(\tau) \cdot G(\tau) \, d\tau} \, dx . \qquad (1.23)$$
In the following we assume that x(t) can be approximated reasonably well via its zero-order Taylor approximation around t_0 = 0, i.e.
$$\int_0^t x(\tau) \cdot G(\tau) \, d\tau \approx x(0) \cdot \int_0^t G(\tau) \, d\tau , \qquad (1.24)$$
and hence, the inverse problem of finding the unknown spin-proton density $M_\tau$ for given measurements y is equivalent to solving the inverse problem of the Fourier transform
$$f(t) = (K M_\tau)(t) = \int_{\mathbb{R}^3} M_\tau(x) \, e^{-i x \cdot \xi(t)} \, dx , \qquad (1.25)$$
with $f(t) := e^{i\gamma B_0 t} y(t)$ and $\xi(t) := \gamma \int_0^t G(\tau) \, d\tau$.
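
To connect (1.25) with discrete data, the following NumPy sketch samples the forward model for a 1D density and a constant gradient, so that ξ(t) = γGt and the measurements are samples of the Fourier transform of M_τ; the density, gradient strength and sampling times are illustrative assumptions.

```python
import numpy as np

# 1D spin-proton density M_tau on a spatial grid (toy example)
x = np.linspace(-0.5, 0.5, 256)
dx = x[1] - x[0]
M_tau = np.exp(-0.5 * (x / 0.1)**2)       # a Gaussian density

gamma, G = 2.675e8, 1e-3                  # gyromagnetic ratio, constant gradient
t = np.linspace(0.0, 1e-4, 128)           # sampling times
xi = gamma * G * t                        # k-space trajectory xi(t) = gamma * G * t

# discretisation of f(t) = int M_tau(x) exp(-i x xi(t)) dx
f = (M_tau[None, :] * np.exp(-1j * np.outer(xi, x))).sum(axis=1) * dx
print(f.shape)   # one complex Fourier sample per measurement time
```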
Chapter 2

Linear inverse problems

Throughout this lecture we deal with functional analytic operators. For the sake of brevity, we
cannot recall all basic concepts of functional analysis but refer to popular textbooks that deal with
this subject, like [10]. Nevertheless, we want to recall a few important properties that are going to
be important for the further course of this lecture. In particular, we are going to focus on inverse
problems with bounded, linear operators K only, i.e. K ∈ L(U, V) with

$$\| K \|_{\mathcal{L}(\mathcal{U},\mathcal{V})} := \sup_{u \in \mathcal{U} \setminus \{0\}} \frac{\| K u \|_{\mathcal{V}}}{\| u \|_{\mathcal{U}}} = \sup_{\| u \|_{\mathcal{U}} \le 1} \| K u \|_{\mathcal{V}} < \infty .$$

For X ⊂ U and K : X → V we further want to denote with

1. D(K) := U the domain

2. N (K) := {u ∈ U | Ku = 0} the kernel

3. R(K) := {Ku ∈ V | u ∈ U} the range

of K. We say that K is continuous in u ∈ U if for all ε > 0 there exists a δ > 0 with
‖Ku − Kv‖_V ≤ ε for all v ∈ U with ‖u − v‖_U ≤ δ.

For linear K (as assumed throughout this lecture) it can be shown that continuity is equivalent
to the existence of a positive constant C such that

‖Ku‖_V ≤ C ‖u‖_U

for all u ∈ X . Note that this constant C actually equals the operator norm ‖K‖_{L(U,V)} .
For the first part of the lecture we only consider K ∈ L(U, V) with U and V being Hilbert
spaces. From functional analysis we know that every Hilbert space is equipped with a scalar product, which we are going to denote by ⟨·, ·⟩_U (if U denotes the corresponding Hilbert space). In
analogy to the transpose of a matrix, this scalar product structure together with the theorem of
Fréchet-Riesz [10, Section 2.10, Theorem 2.E] allows us to define the (unique) adjoint operator of
K, denoted with K ∗ , as follows:

⟨Ku, v⟩_V = ⟨u, K∗v⟩_U .


In addition to that, a scalar product allows us to have a notion of orthogonality. Two functions u, v ∈ U are said to be orthogonal if ⟨u, v⟩_U = 0. For a subset X ⊂ U the orthogonal complement of X in U is defined as
$$\mathcal{X}^\perp := \{ u \in \mathcal{U} \mid \langle u, x \rangle_{\mathcal{U}} = 0 \text{ for all } x \in \mathcal{X} \} .$$
From this definition we immediately observe that X⊥ is a closed subspace. Further we have U⊥ = {0} and {0}⊥ = U. Moreover, we have X ⊂ (X⊥)⊥. If X is a closed subspace we even have X = (X⊥)⊥. In this case there exists the orthogonal decomposition
$$\mathcal{U} = \mathcal{X} \oplus \mathcal{X}^\perp ,$$
which means that every element u ∈ U can uniquely be represented as
$$u = x + x^\perp \quad \text{with } x \in \mathcal{X}, \; x^\perp \in \mathcal{X}^\perp ,$$
see for instance [10, Section 2.9, Corollary 1]. The mapping u 7→ x defines a linear operator
PX ∈ L(U, X ) that is called orthogonal projection on X . The orthogonal projection satisfies the
following conditions (cf. [6, Section 5.16]):
1. PX is self-adjoint, i.e. PX = PX∗ ;
2. ‖P_X‖_{L(U,X)} = 1 (if X ≠ {0});
3. I − P_X = P_{X⊥} ;
4. ‖v − P_X v‖_U = min_{u∈X} ‖v − u‖_U ;
5. x = PX u if and only if x ∈ X and u − x ∈ X ⊥ .
Note that for a non-closed subspace X we only have $(\mathcal{X}^\perp)^\perp = \overline{\mathcal{X}}$ (the closure of X ). For K ∈ L(U, V) we therefore
have
• R(K)⊥ = N(K∗) and thus $\mathcal{N}(K^*)^\perp = \overline{\mathcal{R}(K)}$;
• R(K∗)⊥ = N(K) and thus $\mathcal{N}(K)^\perp = \overline{\mathcal{R}(K^*)}$.
Hence, we can conclude the orthogonal decompositions
$$\mathcal{U} = \mathcal{N}(K) \oplus \overline{\mathcal{R}(K^*)} \quad \text{and} \quad \mathcal{V} = \mathcal{N}(K^*) \oplus \overline{\mathcal{R}(K)} .$$
In the following we want to investigate the concept of generalised inverses of bounded linear
operators, before we will identify compactness of operators as the major source of ill-posedness.
Subsequently we are going to discuss this in more detail by analysing compact operators in terms
of their singular value decomposition.

2.1 Generalised inverse


In order to overcome the issues of non-existence or non-uniqueness we want to generalise the
concept of least-squares solutions to linear operators in Hilbert spaces.
If we consider the generic inverse problem (1.1) again, we know that there does not exist a solution of the inverse problem if f ∉ R(K). In that case it seems reasonable to find an element u ∈ U for which ‖Ku − f‖_V becomes minimal instead.
However, for N(K) ≠ {0} there are infinitely many solutions that minimise ‖Ku − f‖_V, of which we have to pick one. Picking the one with minimal norm ‖u‖_U brings us to the definition of the minimal norm solution.

Definition 2.1. An element u ∈ U is called
• least-squares solution of (1.1), if
$$\| K u - f \|_{\mathcal{V}} = \min_{w \in \mathcal{U}} \| K w - f \|_{\mathcal{V}} ; \qquad (2.1)$$
• minimal norm solution of (1.1), if
$$\| u \|_{\mathcal{U}} = \min \{ \| w \|_{\mathcal{U}} \mid w \text{ satisfies (2.1)} \} . \qquad (2.2)$$

Note that for arbitrary f a minimum norm solution does not need to exist if R(K) is not
closed. If, however, a minimum norm solution exists, it is unique and can (in theory) be computed
via the Moore-Penrose generalised inverse.

Definition 2.2. Let K̃ := K|N (K)⊥ : N (K)⊥ → R(K) denote the restriction of K ∈ L(U, V).
Then the Moore-Penrose inverse K † is defined as the unique linear extension of K̃ −1 with

D(K † ) := R(K) ⊕ R(K)⊥ ,


N (K † ) := R(K)⊥ .

Note that K̃ is injective due to the restriction to N (K)⊥ , and surjective due to the restriction
to R(K). Hence, K̃ −1 exists, and - as a consequence - K † is well-defined on R(K). Due to the
orthogonal decomposition D(K † ) = R(K) ⊕ R(K)⊥ there exist f1 ∈ R(K) and f2 ∈ R(K)⊥ with
f = f1 + f2 , for arbitrary f ∈ D(K † ). Hence, we have

K † f = K † f1 + K † f2 = K † f1 = K̃ −1 f1 , (2.3)

due to R(K)⊥ = N (K † ), and thus, K † is well-defined on the whole of D(K † ). It can be shown
that K † can be characterized by the Moore-Penrose equations.

Lemma 2.1. The Moore-Penrose inverse K † satisfies R(K † ) = N (K)⊥ and the Moore-Penrose
equations

1. KK † K = K,

2. K † KK † = K † ,

3. K † K = I − PN (K) ,

4. KK † = PR(K) ,


D(K )

where PN (K) : U → N (K) and PR(K) : V → R(K) denote the orthogonal projections on N (K)
and R(K), respectively.

Proof. First of all we are going to prove R(K†) = N(K)⊥. According to (2.3) and the definition of the Moore-Penrose inverse we observe for f ∈ D(K†) that
$$K^\dagger f = \tilde{K}^{-1} P_{\overline{\mathcal{R}(K)}} f = K^\dagger P_{\overline{\mathcal{R}(K)}} f , \qquad (2.4)$$
as we not only have $P_{\overline{\mathcal{R}(K)}} f \in \overline{\mathcal{R}(K)}$ but in fact $P_{\overline{\mathcal{R}(K)}} f \in \mathcal{R}(K)$. Hence, $K^\dagger f \in \mathcal{R}(\tilde{K}^{-1}) = \mathcal{N}(K)^\perp$ and therefore $\mathcal{R}(K^\dagger) \subset \mathcal{N}(K)^\perp$. However, we also observe $\mathcal{N}(K)^\perp \subset \mathcal{R}(K^\dagger)$, as $K^\dagger K u = \tilde{K}^{-1} \tilde{K} u = u$ for $u \in \mathcal{N}(K)^\perp$ already implies $u \in \mathcal{R}(K^\dagger)$.
It remains to prove the Moore-Penrose equations. We begin with 4: For f ∈ D(K†) it follows from R(K†) = N(K)⊥ and (2.4) that
$$K K^\dagger f = K \tilde{K}^{-1} P_{\overline{\mathcal{R}(K)}} f = \tilde{K} \tilde{K}^{-1} P_{\overline{\mathcal{R}(K)}} f = P_{\overline{\mathcal{R}(K)}} f ,$$
due to $\tilde{K}^{-1} P_{\overline{\mathcal{R}(K)}} f \in \mathcal{N}(K)^\perp$ and $K = \tilde{K}$ on $\mathcal{N}(K)^\perp$.


3: According to the definition of K† we have K†Ku = K̃^{-1}Ku for all u ∈ U and thus
$$K^\dagger K u = \underbrace{\tilde{K}^{-1} K P_{\mathcal{N}(K)} u}_{=0} + \tilde{K}^{-1} K \underbrace{(I - P_{\mathcal{N}(K)})}_{= P_{\mathcal{N}(K)^\perp}} u = \tilde{K}^{-1} \tilde{K} (I - P_{\mathcal{N}(K)}) u = (I - P_{\mathcal{N}(K)}) u .$$

2: Inserting 4 into (2.4) yields
$$K^\dagger f = K^\dagger P_{\overline{\mathcal{R}(K)}} f = K^\dagger K K^\dagger f$$
for all f ∈ D(K†).


1: With 3 we have

KK † K = K(I − PN (K) ) = K − KPN (K) = K .

The following theorem states that minimum norm solutions can be computed via the generalised
inverse.

Theorem 2.1. For each f ∈ D(K † ) the minimal norm solution u† of (1.1) is given via

u† = K † f .

The set of all least squares solutions is given via {u† } + N (K).

Proof. See [2, Theorem 2.5].

In numerical linear algebra it is a well known fact that the normal equations can be considered
to compute least squares solutions. The same is true in the continuous case.

Theorem 2.2. For given f ∈ D(K†) the element u ∈ U is a least squares solution if and only if it satisfies the normal equations

K ∗ Ku = K ∗ f . (2.5)

We further have u = u† in case of u ∈ N (K)⊥ .

Proof. An element u ∈ U is least squares solution if and only if Ku is the projection of f onto R(K),
i.e. Ku = PR(K) (f ). The latter is equivalent to Ku ∈ R(K) and Ku − f ∈ R(K)⊥ = N (K ∗ ).
Thus, we conclude K ∗ (Ku − f ) = 0. Further, a least squares solution has minimal norm if and
only if u ∈ N (K)⊥ .

As a direct consequence from Theorem 2.1 and Theorem 2.2 we obtain

K † f = (K ∗ K)† K ∗ f ,

and hence, in order to approximate K † f we may also compute an approximation via (2.5) instead.
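
As a small finite-dimensional illustration (the matrix and data are made up for this purpose), the minimal norm least squares solution of a rank-deficient system can be computed via NumPy's pseudoinverse, which agrees with the minimum-norm solution of the normal equations (2.5).

```python
import numpy as np

# rank-deficient forward matrix: second column equals the first
K = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 2.0]])
f = np.array([1.0, 3.0, 4.0])   # f is not in the range of K

u_dagger = np.linalg.pinv(K) @ f             # minimal norm least squares solution
u_lstsq = np.linalg.lstsq(K, f, rcond=None)[0]

print(u_dagger)                              # [1.0, 1.0, 2.0] in this example
print(np.allclose(u_dagger, u_lstsq))        # lstsq also returns the minimum-norm solution
```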
At the end of this section we further want to analyse the domain of the generalised inverse in more detail. Due to the construction of the Moore-Penrose inverse we have D(K†) = R(K) ⊕ R(K)⊥. As orthogonal complements are always closed, we can conclude
$$\overline{\mathcal{D}(K^\dagger)} = \overline{\mathcal{R}(K) \oplus \mathcal{R}(K)^\perp} = \overline{\mathcal{R}(K)} \oplus \mathcal{N}(K^*) = \mathcal{V} ,$$
and hence, D(K†) is dense in V. Thus, if R(K) is closed it follows that D(K†) = V, and vice versa, D(K†) = V implies that R(K) is closed. Moreover, for f ∈ R(K)⊥ = N(K†) the minimum-norm solution is u† = 0. Therefore, for given $f \in \overline{\mathcal{R}(K)}$ the important question to address is whether f also satisfies f ∈ R(K). If this is always the case, R(K) is closed and K† is continuous. However, the existence of a single element $f \in \overline{\mathcal{R}(K)} \setminus \mathcal{R}(K)$ is already enough to show that K† is discontinuous.
Theorem 2.3. Let K ∈ L(U, V). Then K † ∈ L(D(K † ), U) if and only if R(K) is closed.

Proof. See [2, Proposition 2.4].

In the next section we are going to discover that the class of compact operators is a class for
which the Moore-Penrose inverses are discontinuous.

2.2 Compact operators


Compact operators are very common in inverse problems; in fact, almost all (linear) inverse prob-
lems involve the inversion of compact operators. Compact operators are defined as follows.
Definition 2.3. Let K ∈ L(U, V). Then K is said to be compact if the image of a bounded
sequence {un }n∈N ⊂ U contains a convergent subsequence {Kunk }k∈N ⊂ V. We denote the space
of compact operators with K(U, V).
Compact operators can be seen as the infinite dimensional analogue to ill-conditioned matrices.
Indeed it can be seen that compactness is a main source of ill-posedness, confirmed by the following
result.
Theorem 2.4. Let K ∈ K(U, V) be a compact operator between infinite dimensional Banach spaces U and V, such that the dimension of R(K) is infinite. Then problem (1.1) is ill-posed, i.e. the corresponding
Moore-Penrose inverse K † is discontinuous.

Proof. Since U and R(K) are infinite dimensional, we can conclude that N(K)⊥ is also infinite dimensional. We can therefore find an orthonormal sequence {u_n}_{n∈ℕ} with u_n ∈ N(K)⊥, ‖u_n‖ = 1 and ⟨u_n, u_k⟩ = 0 for n ≠ k. Since K is a compact operator, the sequence f_n = Ku_n contains a convergent subsequence. Hence, for each δ > 0 we can find indices k ≠ l such that ‖Ku_l − Ku_k‖ < δ. However, we also obtain
$$\| K^\dagger K u_l - K^\dagger K u_k \|^2 = \| u_l - u_k \|^2 = \| u_l \|^2 - 2 \langle u_l, u_k \rangle + \| u_k \|^2 = 2 .$$
Hence, K† is discontinuous.

To have a better understanding of when we have $f \in \overline{\mathcal{R}(K)} \setminus \mathcal{R}(K)$ for compact operators K, we want to consider the singular value decomposition of compact operators.

2.3 Singular value decomposition of compact operators


We want to characterise the Moore-Penrose inverse of compact operators in terms of a spectral
decomposition. Like in the finite dimensional case of matrices, we can only expect a spectral
decomposition to exist for self-adjoint operators. Due to Theorem 2.2 we can consider K ∗ K
instead of K, which brings us to the singular value decomposition of linear, compact operators.
Theorem 2.5. Let K ∈ K(U, V). Then there exist
• a null sequence {σ_j}_{j∈ℕ} with σ_1 ≥ σ_2 ≥ . . . > 0,
• an orthonormal basis {u_j}_{j∈ℕ} ⊂ U of $\overline{\mathcal{R}(K^*)}$,
• an orthonormal basis {v_j}_{j∈ℕ} ⊂ V of $\overline{\mathcal{R}(K)}$,
with
$$K u_j = \sigma_j v_j , \qquad K^* v_j = \sigma_j u_j \qquad \text{for all } j \in \mathbb{N} \qquad (2.6)$$
and
$$K w = \sum_{j=1}^{\infty} \sigma_j \langle w, u_j \rangle_{\mathcal{U}} \, v_j \qquad \text{for all } w \in \mathcal{U} . \qquad (2.7)$$
A sequence {(σ_j, v_j, u_j)} for which (2.7) is satisfied is called a singular system.

Proof. Exercise 3 on Exercise Sheet 2.

Remark 2.1. Since Eigenvalues of K∗K with Eigenvectors u_j are also Eigenvalues of KK∗ with Eigenvectors v_j, we further obtain a singular value decomposition of K∗ due to (2.6), i.e.
$$K^* z = \sum_{j=1}^{\infty} \sigma_j \langle z, v_j \rangle_{\mathcal{V}} \, u_j \qquad \text{for all } z \in \mathcal{V} .$$

We now want to derive a representation of the Moore-Penrose inverse in terms of the singular value decomposition. Remember that we have concluded K†f = (K∗K)†K∗f from Theorem 2.1 and Theorem 2.2. Hence, for u† = K†f we obtain
$$\sum_{j=1}^{\infty} \sigma_j^2 \langle u^\dagger, u_j \rangle_{\mathcal{U}} \, u_j = K^* K u^\dagger = K^* f = \sum_{j=1}^{\infty} \sigma_j \langle f, v_j \rangle_{\mathcal{V}} \, u_j ,$$
and by comparison of the respective coefficients
$$\langle u^\dagger, u_j \rangle_{\mathcal{U}} = \frac{1}{\sigma_j} \langle f, v_j \rangle_{\mathcal{V}} .$$
As a direct consequence, the singular value decomposition of the generalised inverse reads as
$$u^\dagger = K^\dagger f = \sum_{j=1}^{\infty} \frac{1}{\sigma_j} \langle f, v_j \rangle_{\mathcal{V}} \, u_j . \qquad (2.8)$$

As for the singular value decomposition of K and K ∗ we have to verify if the sum converges. In
contrast to the singular value decompositions of K and K ∗ this is not always true. Precisely, the
so-called Picard criterion has to be met for (2.8) to converge.

Theorem 2.6. Let K ∈ K(U, V) with singular system {(σ_j, v_j, u_j)}, and $f \in \overline{\mathcal{R}(K)}$. Then f ∈ R(K) and (2.8) converges if and only if the Picard criterion
$$\| K^\dagger f \|_{\mathcal{U}}^2 = \sum_{j=1}^{\infty} \frac{| \langle f, v_j \rangle_{\mathcal{V}} |^2}{\sigma_j^2} < \infty \qquad (2.9)$$
is met.

Proof. See [2, Page 38, Theorem 2.8]

Hence, the Picard criterion tells us how quickly the coefficients ⟨f, v_j⟩_V have to decay compared to the singular values σ_j. From the representation (2.8) we can further deduce what happens in the case of noisy measurements. Assume we have f^δ = f + δv_j instead of f. We then observe
$$\| K^\dagger f - K^\dagger f^\delta \|_{\mathcal{U}} = \delta \, \| K^\dagger v_j \|_{\mathcal{U}} = \frac{\delta}{\sigma_j} \to \infty \qquad \text{for } j \to \infty .$$
For fixed j we see that the amplification of the error δ depends on how small σ_j is. Hence, the faster the singular values decay, the stronger the amplification of errors. For that reason, one distinguishes between three classes of ill-posed problems:

• Mildly ill-posed inverse problems are problems of the form (1.1) for which c > 0 and r > 0
exist with σj ≥ cj −r for all j ∈ N.

• Severely ill-posed inverse problems of the form (1.1) for which a sufficiently large j̃ ∈ ℕ exists, such that σ_j ≤ c j^{−r} is true for all j ≥ j̃ and c, r > 0.

• Exponentially ill-posed problems of the form (1.1) for which a sufficiently large j̃ ∈ ℕ exists, such that $\sigma_j \le c e^{-j^r}$ is true for all j ≥ j̃ and c, r > 0.

Example 2.1. Let us consider the example of differentiation again, as introduced in Section 1.1.2. Let U = V = L²([0, 1]), then the operator K of the inverse problem (1.1) of differentiation is given as
$$(K u)(y) = \int_0^y u(x) \, dx = \int_0^1 k(x, y) u(x) \, dx ,$$
for
$$k(x, y) = \begin{cases} 1 & x \le y \\ 0 & \text{else} \end{cases} .$$

From the theory of integral operators it follows that K is compact, due to its kernel k(x, y) being square integrable (cf. [1]). In order to compute the singular value decomposition we have to compute K∗ first, which is characterised via
$$\langle K u, v \rangle_{L^2([0,1])} = \langle u, K^* v \rangle_{L^2([0,1])} .$$
Hence, we obtain
$$\langle K u, v \rangle_{L^2([0,1])} = \int_0^1 \left( \int_0^1 k(x, y) u(x) \, dx \right) v(y) \, dy = \int_0^1 u(x) \left( \int_0^1 k(x, y) v(y) \, dy \right) dx .$$

Hence, the adjoint operator K∗ is given via
$$(K^* v)(x) = \int_0^1 k(x, y) v(y) \, dy = \int_x^1 v(y) \, dy .$$

Now we want to compute the Eigenvalues and Eigenvectors of K∗K, i.e. we look for λ > 0 and u ∈ L²([0, 1]) with
$$\lambda u(x) = (K^* K u)(x) = \int_x^1 \int_0^y u(z) \, dz \, dy .$$
We immediately observe u(1) = 0 and further
$$\lambda u'(x) = \frac{d}{dx} \int_x^1 \int_0^y u(z) \, dz \, dy = -\int_0^x u(z) \, dz ,$$

from which we conclude u′(0) = 0. Taking the derivative another time thus yields the ordinary differential equation
$$\lambda u''(x) + u(x) = 0 ,$$
for which solutions are of the form
$$u(x) = c_1 \sin(\sigma^{-1} x) + c_2 \cos(\sigma^{-1} x) ,$$
with σ := √λ and constants c_1, c_2. In order to satisfy the boundary conditions u(1) = 0 and u′(0) = 0, the constants have to be chosen such that c_1 = 0 and c_2 cos(σ^{-1}) = 0. Choosing c_2 = 0 would result in the trivial zero solution u ≡ 0, which is why we choose σ such that cos(σ^{-1}) = 0 instead. Hence, we have
$$\sigma_j = \frac{2}{(2j - 1)\pi} \qquad \text{for } j \in \mathbb{N} ,$$
and by choosing $c_2 = \sqrt{2}$ we obtain the following normalised representation of $u_j$:
$$u_j(x) = \sqrt{2} \cos\left( \left( j - \frac{1}{2} \right) \pi x \right) .$$

According to (2.6) we further obtain
$$v_j(x) = \sigma_j^{-1} (K u_j)(x) = \left( j - \frac{1}{2} \right) \pi \sqrt{2} \int_0^x \cos\left( \left( j - \frac{1}{2} \right) \pi y \right) dy = \sqrt{2} \sin\left( \left( j - \frac{1}{2} \right) \pi x \right) ,$$

and hence, for f ∈ L²([0, 1]) the Picard criterion becomes
$$\sum_{j=1}^{\infty} \frac{(2j - 1)^2 \pi^2}{2} \left( \int_0^1 f(x) \sin\left( \left( j - \frac{1}{2} \right) \pi x \right) dx \right)^2 < \infty .$$
Thus, the Picard criterion in this case is equivalent to the condition that the Fourier series of f is differentiable with respect to every element of the series (and hence, that f is differentiable).
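
As a numerical check (an illustration; the grid size is an arbitrary choice), one can discretise the integration operator on a uniform grid and compare its singular values with the analytic values σ_j = 2/((2j−1)π).

```python
import numpy as np

n = 200
h = 1.0 / n
# discretised integration operator: (Ku)(y_i) ~ sum over x_j <= y_i of u(x_j) * h
K = h * np.tril(np.ones((n, n)))

sigma = np.linalg.svd(K, compute_uv=False)
j = np.arange(1, 6)
print(sigma[:5])                      # numerical singular values
print(2.0 / ((2 * j - 1) * np.pi))    # analytic prediction 2 / ((2j - 1) pi)
```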
Chapter 3

Regularisation

We have seen in the previous section that the major source of ill-posedness of inverse problems
of the type (1.1) is a fast decay of the singular values of K. An idea to overcome this issue is to
define approximations of K† in the following fashion. Consider the family of operators
$$R_\alpha f := \sum_{j=1}^{\infty} g_\alpha(\sigma_j) \langle f, v_j \rangle_{\mathcal{V}} \, u_j , \qquad (3.1)$$
with functions $g_\alpha : \mathbb{R}_{>0} \to \mathbb{R}_{\ge 0}$ that converge pointwise to 1/σ as α converges to zero. We are going to see that such an operator R_α is what is called a regularisation (of K†) if g_α is bounded, i.e.
$$g_\alpha(\sigma) \le C_\alpha \qquad \text{for all } \sigma \in \mathbb{R}_{>0} . \qquad (3.2)$$

In case (3.2) holds true, we immediately observe
$$\| R_\alpha f \|_{\mathcal{U}}^2 = \sum_{j=1}^{\infty} g_\alpha(\sigma_j)^2 | \langle f, v_j \rangle_{\mathcal{V}} |^2 \le C_\alpha^2 \sum_{j=1}^{\infty} | \langle f, v_j \rangle_{\mathcal{V}} |^2 \le C_\alpha^2 \| f \|_{\mathcal{V}}^2 ,$$
which means that C_α is a bound for the norm of R_α.


Before we are going to examine concrete examples of regularisations of the form (3.1), we want
to define what a regularisation actually is, and what properties come along with it.

Definition 3.1. Let K ∈ L(U, V) be a bounded operator. A family {Rα }α>0 of linear operators is
called regularisation (or regularisation operator) of K † if

• Rα ∈ L(V, U) for all α > 0;

• Rα f → K † f for all f ∈ D(K † ).

Hence, a regularisation is a pointwise approximation of the Moore-Penrose inverse with continuous operators. Note, however, that we usually cannot expect f ∈ D(K†) in most applications, due to measurement and modelling errors. If we instead assume that there exists f ∈ D(K†) such that we have
$$\left\| f - f^\delta \right\|_{\mathcal{V}} \le \delta$$


for measured data f^δ ∈ V, we can expect that α has to be chosen in accordance with this error δ. Let us therefore consider the overall error $\| R_\alpha f^\delta - K^\dagger f \|_{\mathcal{U}}$, which we can split up as follows:
$$\left\| R_\alpha f^\delta - K^\dagger f \right\|_{\mathcal{U}} \le \left\| R_\alpha f^\delta - R_\alpha f \right\|_{\mathcal{U}} + \left\| R_\alpha f - K^\dagger f \right\|_{\mathcal{U}} \le \delta \left\| R_\alpha \right\|_{\mathcal{L}(\mathcal{V},\mathcal{U})} + \left\| R_\alpha f - K^\dagger f \right\|_{\mathcal{U}} . \qquad (3.3)$$

The first term of (3.3) is the data error; this term unfortunately does not stay bounded for α → 0,
which can be concluded from the uniform boundedness principle (also known as the Banach-
Steinhaus theorem), cf. [5, Section 2.2, Theorem 2.2]. The second term however vanishes for
α → 0, due to the pointwise convergence of Rα to K † . Hence it becomes evident from (3.3) that
the choice of α depends on δ.
Definition 3.2. A function α : ℝ_{>0} × V → ℝ_{>0}, (δ, f^δ) ↦ α(δ, f^δ), is called a parameter choice rule. We distinguish between
1. a-priori parameter choice rules, if they depend on δ only;
2. a-posteriori parameter choice rules, if they depend on δ and f^δ;
3. heuristic parameter choice rules, if they depend on f^δ only.
If {R_α}_{α>0} is a regularisation of K† and α is a parameter choice rule, then the pair (R_α, α) is called a convergent regularisation, if for all f ∈ D(K†) there exists a parameter choice rule α : ℝ_{>0} × V → ℝ_{>0} such that
$$\lim_{\delta \to 0} \; \sup \left\{ \left\| R_{\alpha(\delta, f^\delta)} f^\delta - K^\dagger f \right\|_{\mathcal{U}} \;\Big|\; f^\delta \in \mathcal{V}, \; \left\| f - f^\delta \right\|_{\mathcal{V}} \le \delta \right\} = 0 \qquad (3.4)$$
and
$$\lim_{\delta \to 0} \; \sup \left\{ \alpha(\delta, f^\delta) \;\Big|\; f^\delta \in \mathcal{V}, \; \left\| f - f^\delta \right\|_{\mathcal{V}} \le \delta \right\} = 0$$
are guaranteed.
In case of 1 or 3 we simply write α(δ), respectively α(f^δ), instead of α(δ, f^δ).
For the sake of brevity we are only discussing a-priori parameter choices throughout this lecture.
In fact, it can be shown that for every regularisation an a-priori parameter choice rule, and thus,
a convergent regularisation, exists.
Theorem 3.1. Let {Rα }α>0 be a regularisation of K † , for K ∈ L(U, V). Then there exists an
a-priori parameter choice rule, such that (Rα , α) is a convergent regularisation.

Proof. Let f ∈ D(K†) be arbitrary but fixed. We can find a monotone increasing function σ : ℝ_{>0} → ℝ_{>0} such that for every ε > 0 we have
$$\left\| R_{\sigma(\varepsilon)} f - K^\dagger f \right\|_{\mathcal{U}} \le \frac{\varepsilon}{2} ,$$
due to the pointwise convergence R_α → K†.
As the operator R_{σ(ε)} is continuous for fixed ε, there exists ρ(ε) > 0 with
$$\left\| R_{\sigma(\varepsilon)} g - R_{\sigma(\varepsilon)} f \right\|_{\mathcal{U}} \le \frac{\varepsilon}{2} \qquad \text{for all } g \in \mathcal{V} \text{ with } \| g - f \|_{\mathcal{V}} \le \rho(\varepsilon) .$$

Without loss of generality we can assume ρ to be a continuous, strictly monotone increasing function with lim_{ε→0} ρ(ε) = 0. Then, due to the inverse function theorem there exists a strictly monotone and continuous function ρ^{-1} on R(ρ) with lim_{δ→0} ρ^{-1}(δ) = 0. We continuously extend ρ^{-1} to ℝ_{>0} and define our a-priori strategy as
$$\alpha : \mathbb{R}_{>0} \to \mathbb{R}_{>0} , \qquad \delta \mapsto \sigma(\rho^{-1}(\delta)) .$$
Then lim_{δ→0} α(δ) = 0 follows. Furthermore, for every ε > 0 there exists δ := ρ(ε) such that, with α(δ) = σ(ε),
$$\left\| R_{\alpha(\delta)} f^\delta - K^\dagger f \right\|_{\mathcal{U}} \le \left\| R_{\sigma(\varepsilon)} f^\delta - R_{\sigma(\varepsilon)} f \right\|_{\mathcal{U}} + \left\| R_{\sigma(\varepsilon)} f - K^\dagger f \right\|_{\mathcal{U}} \le \varepsilon$$
follows for all f^δ ∈ V with ‖f − f^δ‖_V ≤ δ. Thus, (R_α, α) is a convergent regularisation method.

Now we want to turn our attention to the case f ∉ D(K†). As the generalised inverse is not defined for such f, we cannot expect R_α f to stay bounded as α → 0. This is confirmed by the following result.

Theorem 3.2. Let K ∈ L(U, V) and {R_α}_{α>0} be a regularisation of K†. Then, u_α := R_α f converges to K†f as α → 0 for f ∈ D(K†). If
$$\sup_{\alpha > 0} \| K R_\alpha \|_{\mathcal{L}(\mathcal{V},\mathcal{V})} < \infty ,$$
then ‖u_α‖_U → ∞ for f ∉ D(K†).

Proof. The convergence in the case f ∈ D(K†) follows from Theorem 3.1. We therefore only need to consider the case f ∉ D(K†). Assume that there exists a sequence α_k → 0 such that ‖u_{α_k}‖_U is uniformly bounded. Then there exists a weakly convergent subsequence $u_{\alpha_{k_l}}$ with some limit u ∈ U, cf. [3, Section 2.2, Theorem 2.1]. As continuous linear operators are also weakly continuous, we further have $K u_{\alpha_{k_l}} \rightharpoonup K u$. However, as the KR_α are uniformly bounded operators, we also conclude $K u_{\alpha_{k_l}} = K R_{\alpha_{k_l}} f \rightharpoonup P_{\overline{\mathcal{R}(K)}} f$ (according to Lemma 2.1). Hence, $P_{\overline{\mathcal{R}(K)}} f = K u \in \mathcal{R}(K)$ and thus f ∈ D(K†), in contradiction to the assumption f ∉ D(K†).

We can finally characterise a-priori parameter choice strategies that lead to convergent regu-
larisation methods via the following theorem.

Theorem 3.3. Let {Rα }α>0 be a regularisation, and α : R>0 → R>0 an a-priori parameter choice
rule. Then (Rα , α) is a convergent regularisation method if and only if

1. limδ→0 α(δ) = 0

2. lim_{δ→0} δ ‖R_{α(δ)}‖_{L(V,U)} = 0

Proof. ⇒: Let conditions 1 and 2 be fulfilled. From (3.3) we then observe
$$\left\| R_{\alpha(\delta)} f^\delta - K^\dagger f \right\|_{\mathcal{U}} \to 0 \qquad \text{for } \delta \to 0 .$$
Hence, (R_α, α) is a convergent regularisation method.


⇐: Now let (R_α, α) be a convergent regularisation method. We prove that conditions 1 and 2 have to follow from this by showing that violation of either one of them leads to a contradiction to (R_α, α) being a convergent regularisation method. If condition 1 is violated, R_{α(δ)} f does not converge pointwise to K†f and hence, (3.4) is violated for f^δ = f and δ = 0. Hence, (R_α, α) is not a convergent regularisation method. If condition 1 is fulfilled but condition 2 is violated, there exists a null sequence {δ_k}_{k∈ℕ} with $\delta_k \| R_{\alpha(\delta_k)} \|_{\mathcal{L}(\mathcal{V},\mathcal{U})} \ge C > 0$, and hence, we can find a sequence {g_k}_{k∈ℕ} ⊂ V with ‖g_k‖_V = 1 and $\delta_k \| R_{\alpha(\delta_k)} g_k \|_{\mathcal{U}} \ge C$. Let f ∈ D(K†) be arbitrary and define f_k := f + δ_k g_k. Then we have on the one hand ‖f − f_k‖_V ≤ δ_k, but on the other hand the norm of
$$R_{\alpha(\delta_k)} f_k - K^\dagger f = R_{\alpha(\delta_k)} f - K^\dagger f + \delta_k R_{\alpha(\delta_k)} g_k$$
cannot converge to zero, as the second term $\delta_k R_{\alpha(\delta_k)} g_k$ is bounded from below by construction. Hence, (3.4) is violated for f^δ = f_k and thus, (R_α, α) is not a convergent regularisation method.

In the following we want to revisit (3.1) and prove that these methods are indeed regularisation
methods for piecewise continuous functions gα satisfying (3.2).
Theorem 3.4. Let $g_\alpha : \mathbb{R}_{>0} \to \mathbb{R}$ be a piecewise continuous function satisfying (3.2), $\lim_{\alpha \to 0} g_\alpha(\sigma) = \frac{1}{\sigma}$ and
$$\sup_{\alpha, \sigma} \; \sigma g_\alpha(\sigma) \le \gamma , \qquad (3.5)$$
for some constant γ > 0. If R_α is defined as in (3.1), then we have
$$R_\alpha f \to K^\dagger f \qquad \text{as } \alpha \to 0 ,$$
for all f ∈ D(K†).

Proof. From the singular value decomposition of K† and the definition of R_α we obtain
$$R_\alpha f - K^\dagger f = \sum_{j=1}^{\infty} \left( g_\alpha(\sigma_j) - \frac{1}{\sigma_j} \right) \langle f, v_j \rangle_{\mathcal{V}} \, u_j = \sum_{j=1}^{\infty} \left( \sigma_j g_\alpha(\sigma_j) - 1 \right) \langle u^\dagger, u_j \rangle_{\mathcal{U}} \, u_j .$$
From (3.5) we can conclude
$$\left| \left( \sigma_j g_\alpha(\sigma_j) - 1 \right) \langle u^\dagger, u_j \rangle_{\mathcal{U}} \right| \le (\gamma + 1) \| u^\dagger \|_{\mathcal{U}} ,$$
and hence, each element of the sum stays bounded. Thus, we can estimate
$$\limsup_{\alpha \to 0} \left\| R_\alpha f - K^\dagger f \right\|_{\mathcal{U}}^2 \le \limsup_{\alpha \to 0} \sum_{j=1}^{\infty} \left| \sigma_j g_\alpha(\sigma_j) - 1 \right|^2 \left| \langle u^\dagger, u_j \rangle_{\mathcal{U}} \right|^2 \le \sum_{j=1}^{\infty} \lim_{\alpha \to 0} \left| \sigma_j g_\alpha(\sigma_j) - 1 \right|^2 \left| \langle u^\dagger, u_j \rangle_{\mathcal{U}} \right|^2 ,$$
and due to the pointwise convergence of $g_\alpha(\sigma_j)$ to $1/\sigma_j$ we deduce $\lim_{\alpha \to 0} \left( \sigma_j g_\alpha(\sigma_j) - 1 \right) = 0$. Hence, we have $\| R_\alpha f - K^\dagger f \|_{\mathcal{U}} \to 0$ for all f ∈ D(K†).

Proposition 3.1. Let the same assumptions hold as in Theorem 3.4. Further, let (3.2) be satisfied. Then (R_α, α) is a convergent regularisation method if
$$\lim_{\delta \to 0} \delta \, C_{\alpha(\delta)} = 0$$
is guaranteed.

Finally we want to consider the propagation of the data error for the regularisation method (3.1).

Theorem 3.5. Let the same assumptions hold for g_α as in Proposition 3.1. If we define u_α := R_α f and $u_\alpha^\delta := R_\alpha f^\delta$, with f ∈ D(K†), f^δ ∈ V and ‖f − f^δ‖_V ≤ δ, then
$$\| K u_\alpha - K u_\alpha^\delta \|_{\mathcal{V}} \le \gamma \delta \qquad (3.6)$$
and
$$\| u_\alpha - u_\alpha^\delta \|_{\mathcal{U}} \le C_\alpha \delta \qquad (3.7)$$
hold true.

Proof. From the singular value decomposition we can estimate
$$\| K u_\alpha - K u_\alpha^\delta \|_{\mathcal{V}}^2 \le \sum_{j=1}^{\infty} \sigma_j^2 g_\alpha(\sigma_j)^2 | \langle f - f^\delta, v_j \rangle_{\mathcal{V}} |^2 \le \gamma^2 \sum_{j=1}^{\infty} | \langle f - f^\delta, v_j \rangle_{\mathcal{V}} |^2 \le \gamma^2 \| f - f^\delta \|_{\mathcal{V}}^2 \le \gamma^2 \delta^2 ,$$
which yields (3.6). In the same fashion we can estimate
$$\| u_\alpha - u_\alpha^\delta \|_{\mathcal{U}}^2 \le \sum_{j=1}^{\infty} g_\alpha(\sigma_j)^2 | \langle f - f^\delta, v_j \rangle_{\mathcal{V}} |^2 \le C_\alpha^2 \sum_{j=1}^{\infty} | \langle f - f^\delta, v_j \rangle_{\mathcal{V}} |^2 \le C_\alpha^2 \| f - f^\delta \|_{\mathcal{V}}^2 \le C_\alpha^2 \delta^2 ,$$
to obtain (3.7).

Combining the assertions of Theorem 3.4, Proposition 3.1 and Theorem 3.5, we obtain the
following convergence results of the regularised solutions.
Proposition 3.2. Let the assumptions of Theorem 3.4, Proposition 3.1 and Theorem 3.5 hold true. Then
$$u^\delta_{\alpha(\delta, f^\delta)} \to u^\dagger$$
is guaranteed as δ → 0.

3.1 Truncated singular value decomposition


As a first example for a spectral regularisation of the form (3.1) we want to consider the so-called
truncated singular value decomposition. As the name suggests, the idea is to discard all singular
values below a certain threshold. If we identify this threshold with the regularisation parameter
α, we can realise this truncation via
$$g_\alpha(\sigma) = \begin{cases} \frac{1}{\sigma} & \sigma \ge \alpha \\ 0 & \sigma < \alpha \end{cases} . \qquad (3.8)$$

Note that we naturally obtain limα→0 gα (σ) = 1/σ for σ > 0. Equation (3.1) then reads as
$$u_\alpha = R_\alpha f = \sum_{\sigma_j \ge \alpha} \frac{1}{\sigma_j} \langle f, v_j \rangle_{\mathcal{V}} \, u_j , \qquad (3.9)$$

for all f ∈ V. Note that the sum in (3.9) is always finite for α > 0, as zero is the only accumulation point of the singular values of compact operators.
From (3.8) we immediately observe g_α(σ) ≤ C_α = 1/α. Thus, according to Proposition 3.1 the truncated singular value decomposition, together with an a-priori parameter choice strategy satisfying lim_{δ→0} α(δ) = 0, is a convergent regularisation method if lim_{δ→0} δ/α(δ) = 0. Moreover, we observe sup_{σ,α} σ g_α(σ) = γ = 1 and hence, we obtain the error estimates $\| K u_\alpha - K u_\alpha^\delta \|_{\mathcal{V}} \le \delta$ and $\| u_\alpha - u_\alpha^\delta \|_{\mathcal{U}} \le \delta/\alpha$ as a consequence of Theorem 3.5.
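
The following NumPy sketch applies the truncated SVD (3.9) to the discretised integration operator used earlier; the test signal, noise level and truncation thresholds are illustrative assumptions. A suitable truncation level stabilises the reconstruction of the derivative from noisy data.

```python
import numpy as np

n = 200
h = 1.0 / n
x = (np.arange(n) + 0.5) * h
K = h * np.tril(np.ones((n, n)))       # discretised integration operator

u_true = np.cos(2 * np.pi * x)         # unknown derivative
f = K @ u_true
np.random.seed(1)
f_delta = f + 1e-3 * np.random.randn(n)

U, s, Vt = np.linalg.svd(K)

def tsvd(f_data, alpha):
    """Truncated SVD reconstruction: keep only singular values >= alpha."""
    keep = s >= alpha
    coeff = (U.T @ f_data)[keep] / s[keep]
    return Vt[keep].T @ coeff

for alpha in [1e-4, 1e-2, 1e-1]:
    err = np.linalg.norm(tsvd(f_delta, alpha) - u_true) * np.sqrt(h)
    print(f"alpha = {alpha:.0e}, reconstruction error = {err:.3f}")
```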

3.2 Tikhonov regularisation


The main idea behind Tikhonov regularisation¹ is to shift the singular values of K∗K by a constant factor, which will be associated with the regularisation parameter α. This shift can be realised via the function
$$g_\alpha(\sigma) = \frac{\sigma}{\sigma^2 + \alpha} .$$
Again, we immediately observe lim_{α→0} g_α(σ) = 1/σ for σ > 0. Further, we can estimate $g_\alpha(\sigma) \le 1/(2\sqrt{\alpha})$ due to $\sigma^2 + \alpha \ge 2\sqrt{\alpha}\,\sigma$. Moreover, we observe $\sigma g_\alpha(\sigma) = \sigma^2/(\sigma^2 + \alpha) < 1 =: \gamma$ for α > 0.
The corresponding Tikhonov regularisation (3.1) reads as
$$u_\alpha = R_\alpha f = \sum_{j=1}^{\infty} \frac{\sigma_j}{\sigma_j^2 + \alpha} \langle f, v_j \rangle_{\mathcal{V}} \, u_j . \qquad (3.10)$$

Note that Tikhonov regularisation can be computed without knowledge of the singular system. Considering the expression (K∗K + αI)u_α in terms of the singular value decomposition, we observe
$$(K^*K + \alpha I) u_\alpha = \sum_{j=1}^{\infty} \frac{\sigma_j}{\sigma_j^2 + \alpha} \langle f, v_j \rangle_{\mathcal{V}} \underbrace{K^* K u_j}_{= \sigma_j^2 u_j} + \sum_{j=1}^{\infty} \frac{\alpha \sigma_j}{\sigma_j^2 + \alpha} \langle f, v_j \rangle_{\mathcal{V}} \, u_j = \sum_{j=1}^{\infty} \frac{\sigma_j (\sigma_j^2 + \alpha)}{\sigma_j^2 + \alpha} \langle f, v_j \rangle_{\mathcal{V}} \, u_j = \sum_{j=1}^{\infty} \sigma_j \langle f, v_j \rangle_{\mathcal{V}} \, u_j = K^* f .$$

Hence, the Tikhonov-regularised solution uα can be obtained by solving

(K ∗ K + αI)uα = K ∗ f (3.11)

for uα . The advantage in computing uα via (3.11) is that its computation does not require the
singular value decomposition of K, but only involves the inversion of a linear, well-posed operator
equation with a symmetric, positive definite operator.
¹ Named after the Russian mathematician Andrey Nikolayevich Tikhonov (30 October 1906 – 7 October 1993).
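
A minimal NumPy sketch of (3.11) for the discretised integration operator; the noise level and the value of α are illustrative choices only.

```python
import numpy as np

n = 200
h = 1.0 / n
x = (np.arange(n) + 0.5) * h
K = h * np.tril(np.ones((n, n)))       # discretised integration operator

u_true = np.cos(2 * np.pi * x)
np.random.seed(2)
f_delta = K @ u_true + 1e-3 * np.random.randn(n)

# Tikhonov regularisation: solve (K^T K + alpha I) u = K^T f
alpha = 1e-4
u_alpha = np.linalg.solve(K.T @ K + alpha * np.eye(n), K.T @ f_delta)

print("reconstruction error:", np.linalg.norm(u_alpha - u_true) * np.sqrt(h))
```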

3.3 Source-conditions
Before we continue to investigate other examples of regularisation we want to briefly address
the question of the convergence speed of a regularisation method. From Theorem 3.5 we have
already obtained a convergence rate result; however, with additional regularity assumptions on
the (unknown) minimal norm solution we are able to improve those. The regularity assumptions
that we want to consider are known as source conditions, and are of the form

∃w ∈ U : u† = (K ∗ K)µ w . (3.12)

The power µ > 0 of the operator is understood in the sense of taking the µ-th power of the singular values of the operator K∗K, i.e.
$$(K^*K)^\mu w = \sum_{j=1}^{\infty} \sigma_j^{2\mu} \langle w, u_j \rangle_{\mathcal{U}} \, u_j .$$

The rate of convergence of a regularisation scheme to the minimal norm solution now depends on the specific choice of g_α. We assume that g_α satisfies
$$\sigma^{2\mu} \left| \sigma g_\alpha(\sigma) - 1 \right| \le \omega_\mu(\alpha) ,$$
for all σ > 0. In case of the truncated singular value decomposition we would for instance have $\omega_\mu(\alpha) = \alpha^{2\mu}$. With this additional assumption, we can improve the estimate in Theorem 3.4 as follows:
$$\left\| R_\alpha f - K^\dagger f \right\|_{\mathcal{U}}^2 = \sum_{j=1}^{\infty} \left| \sigma_j g_\alpha(\sigma_j) - 1 \right|^2 \left| \langle u^\dagger, u_j \rangle_{\mathcal{U}} \right|^2 = \sum_{j=1}^{\infty} \left| \sigma_j g_\alpha(\sigma_j) - 1 \right|^2 \sigma_j^{4\mu} \left| \langle w, u_j \rangle_{\mathcal{U}} \right|^2 \le \omega_\mu(\alpha)^2 \| w \|_{\mathcal{U}}^2 .$$
Hence, we have obtained the estimate
$$\| u_\alpha - u^\dagger \|_{\mathcal{U}} \le \omega_\mu(\alpha) \| w \|_{\mathcal{U}} .$$

Together with (3.3) we can further estimate
$$\| u_\alpha^\delta - u^\dagger \|_{\mathcal{U}} \le \omega_\mu(\alpha) \| w \|_{\mathcal{U}} + C_\alpha \delta . \qquad (3.13)$$

Example 3.1. In case of the truncated singular value decomposition we know from Section 3.1 that C_α = 1/α, and we have ω_µ(α) = α^{2µ}. Hence, (3.13) simplifies to
$$\| u_\alpha^\delta - u^\dagger \|_{\mathcal{U}} \le \alpha^{2\mu} \| w \|_{\mathcal{U}} + \delta \alpha^{-1} \qquad (3.14)$$
in this case. In order to make the right-hand side of (3.14) as small as possible, we have to choose α such that
$$\alpha = \left( \frac{\delta}{2\mu \| w \|_{\mathcal{U}}} \right)^{\frac{1}{2\mu+1}} .$$

With this choice of α we can estimate
$$\| u_{\alpha(\delta)}^\delta - u^\dagger \|_{\mathcal{U}} \le \underbrace{(2\mu + 1)(2\mu)^{-\frac{2\mu}{2\mu+1}}}_{\le 2} \, \delta^{\frac{2\mu}{2\mu+1}} \, \| w \|_{\mathcal{U}}^{\frac{1}{2\mu+1}} \le 2 \, \delta^{\frac{2\mu}{2\mu+1}} \, \| w \|_{\mathcal{U}}^{\frac{1}{2\mu+1}} .$$
It is important to note that no matter how large µ is, the rate of convergence $\delta^{\frac{2\mu}{2\mu+1}}$ will always be slower than δ, due to the ill-posedness of the inversion of K.

3.4 Asymptotic regularisation


Another form of regularisation is asymptotic regularisation of the form
$$\partial_t u(t) = K^* \left( f - K u(t) \right) , \qquad u(0) = 0 . \qquad (3.15)$$
As the linear operator K does not change with respect to the time t, we can make the Ansatz of writing u(t) in terms of the singular value decomposition of K as
$$u(t) = \sum_{j=1}^{\infty} \gamma_j(t) \, u_j , \qquad (3.16)$$

for functions γ_j : ℝ → ℝ. From the initial condition we immediately observe γ_j(0) = 0. From the singular value decomposition and (3.15) we further see
$$\sum_{j=1}^{\infty} \gamma_j'(t) \, u_j = \sum_{j=1}^{\infty} \left( \sigma_j \langle f, v_j \rangle_{\mathcal{V}} - \sigma_j^2 \gamma_j(t) \right) u_j .$$

Hence, by equating the coefficients we get
$$\gamma_j'(t) = \sigma_j \langle f, v_j \rangle_{\mathcal{V}} - \sigma_j^2 \gamma_j(t) ,$$
and together with γ_j(0) = 0 we obtain
$$\gamma_j(t) = \left( 1 - e^{-\sigma_j^2 t} \right) \frac{1}{\sigma_j} \langle f, v_j \rangle_{\mathcal{V}}$$
as a solution for all j, and hence, (3.16) reads as
$$u(t) = \sum_{j=1}^{\infty} \left( 1 - e^{-\sigma_j^2 t} \right) \frac{1}{\sigma_j} \langle f, v_j \rangle_{\mathcal{V}} \, u_j .$$

If we substitute t = 1/α, we obtain the regularisation
$$u_\alpha = \sum_{j=1}^{\infty} \left( 1 - e^{-\frac{\sigma_j^2}{\alpha}} \right) \frac{1}{\sigma_j} \langle f, v_j \rangle_{\mathcal{V}} \, u_j$$
with $g_\alpha(\sigma) = \left( 1 - e^{-\frac{\sigma^2}{\alpha}} \right)/\sigma$. We immediately see that $g_\alpha(\sigma)\,\sigma \le 1 =: \gamma$, and due to $e^x \ge 1 + x$ we further observe $1 - e^{-\frac{\sigma^2}{\alpha}} \le \sigma^2/\alpha$ and therefore $\left( 1 - e^{-\frac{\sigma^2}{\alpha}} \right)/\sigma \le \sigma/\alpha \le \sigma_1/\alpha = \|K\|_{\mathcal{L}(\mathcal{U},\mathcal{V})}/\alpha =: C_\alpha$.

3.5 Landweber iteration


If we approximate (3.15) via a forward finite-difference discretisation, we end up with the iterative procedure
$$\frac{u^{k+1} - u^k}{\tau} = K^* \left( f - K u^k \right) , \qquad (3.17)$$
$$\Leftrightarrow \quad u^{k+1} = u^k + \tau K^* \left( f - K u^k \right) ,$$
$$\Leftrightarrow \quad u^{k+1} = (I - \tau K^* K) u^k + \tau K^* f ,$$
for some τ > 0 and u⁰ ≡ 0. Iteration (3.17) is known as the so-called Landweber iteration. We
assume f ∈ D(K † ) first, and with the singular value decomposition of K and K ∗ we obtain

$$\sum_{j=1}^{\infty} \langle u^{k+1}, u_j \rangle_{\mathcal{U}} \, u_j = \sum_{j=1}^{\infty} \left( \left( 1 - \tau \sigma_j^2 \right) \langle u^k, u_j \rangle_{\mathcal{U}} + \tau \sigma_j \langle f, v_j \rangle_{\mathcal{V}} \right) u_j , \qquad (3.18)$$

and hence, by equating the individual summands
$$\langle u^{k+1}, u_j \rangle_{\mathcal{U}} = \left( 1 - \tau \sigma_j^2 \right) \langle u^k, u_j \rangle_{\mathcal{U}} + \tau \sigma_j \langle f, v_j \rangle_{\mathcal{V}} . \qquad (3.19)$$
Assuming u⁰ ≡ 0, recursively inserting (3.19) yields
$$\langle u^k, u_j \rangle_{\mathcal{U}} = \tau \sigma_j \langle f, v_j \rangle_{\mathcal{V}} \sum_{i=1}^{k} \left( 1 - \tau \sigma_j^2 \right)^{k-i} . \qquad (3.20)$$

The following Lemma will help us simplify (3.20).


Lemma 3.1. For k ∈ ℕ \ {1} we have
$$\sum_{i=1}^{k} \left( 1 - \tau \sigma^2 \right)^{k-i} = \frac{1 - \left( 1 - \tau \sigma^2 \right)^k}{\tau \sigma^2} . \qquad (3.21)$$

Proof. Equation (3.21) can simply be verified via induction. We immediately see that
$$\sum_{i=1}^{2} \left( 1 - \tau \sigma^2 \right)^{2-i} = 1 + \left( 1 - \tau \sigma^2 \right) = \frac{1 - \left( 1 - 2\tau\sigma^2 + \tau^2\sigma^4 \right)}{\tau \sigma^2} = \frac{1 - \left( 1 - \tau \sigma^2 \right)^2}{\tau \sigma^2}$$
serves as our induction base. Considering k → k + 1, we observe
$$\sum_{i=1}^{k+1} \left( 1 - \tau \sigma^2 \right)^{k+1-i} = 1 + \left( 1 - \tau \sigma^2 \right) \sum_{i=1}^{k} \left( 1 - \tau \sigma^2 \right)^{k-i} = 1 + \left( 1 - \tau \sigma^2 \right) \frac{1 - \left( 1 - \tau \sigma^2 \right)^k}{\tau \sigma^2} = \frac{1 - \left( 1 - \tau \sigma^2 \right)^{k+1}}{\tau \sigma^2} ,$$
and we are done.

If we now insert (3.21) into (3.20) we therefore obtain
$$\langle u^k, u_j \rangle_{\mathcal{U}} = \left( 1 - \left( 1 - \tau \sigma_j^2 \right)^k \right) \frac{1}{\sigma_j} \langle f, v_j \rangle_{\mathcal{V}} . \qquad (3.22)$$
The important consequence of Equation (3.22) is that we now immediately see that ⟨u^k, u_j⟩_U → ⟨u†, u_j⟩_U if we ensure (1 − τσ_j²)^k → 0. In other words, we need to choose τ such that |1 − τσ_j²| < 1 (respectively 0 < τσ_j² < 2) for all j. As in the case of asymptotic regularisation we exploit that σ_1 = ‖K‖_{L(U,V)} ≥ σ_j for all j and select τ such that
$$0 < \tau < \frac{2}{\| K \|_{\mathcal{L}(\mathcal{U},\mathcal{V})}^2}$$
is satisfied. If we interpret the reciprocal of the iteration number as the regularisation parameter, α := 1/k, we obtain the regularisation method
$$u_\alpha = R_\alpha f = \sum_{j=1}^{\infty} \left( 1 - \left( 1 - \tau \sigma_j^2 \right)^{\frac{1}{\alpha}} \right) \frac{1}{\sigma_j} \langle f, v_j \rangle_{\mathcal{V}} \, u_j$$
with $g_\alpha(\sigma) = \left( 1 - \left( 1 - \tau \sigma^2 \right)^{\frac{1}{\alpha}} \right)/\sigma$.
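
A minimal NumPy sketch of the Landweber iteration (3.17) for the discretised integration operator; the step size, noise level and iteration count are illustrative assumptions.

```python
import numpy as np

n = 200
h = 1.0 / n
x = (np.arange(n) + 0.5) * h
K = h * np.tril(np.ones((n, n)))       # discretised integration operator

u_true = np.cos(2 * np.pi * x)
np.random.seed(3)
f_delta = K @ u_true + 1e-3 * np.random.randn(n)

tau = 1.0 / np.linalg.norm(K, 2)**2    # satisfies 0 < tau < 2 / ||K||^2
u = np.zeros(n)
for k in range(5000):                  # the iteration number plays the role of 1/alpha
    u = u + tau * K.T @ (f_delta - K @ u)

print("reconstruction error:", np.linalg.norm(u - u_true) * np.sqrt(h))
```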

3.6 Variational regularisation


Variational regularisation aims at finding approximations to (1.1) by minimising appropriate func-
tionals of the form
Eα (u) := H(Ku, f ) + αJ(u) . (3.23)
Here H : V × V → ℝ ∪ {+∞} and J : U → ℝ ∪ {+∞} represent two functionals over the Banach spaces V and U, K ∈ L(U, V) is a linear and continuous operator, and α > 0 is a real, positive
constant. The term H is usually named fidelity or data term, as it measures the deviation between
the measured data f and the forward model Ku. The functional J is the regularisation term as
it will impose certain regularity conditions on the unknown u. The regularisation parameter will
balance between both terms. It is beyond the scope of this lecture to discuss all the functional
analytic concepts to study the minimisation of (3.23) for general functionals H and J. However,
we briefly want to discuss three properties that will guarantee existence of a minimiser of (3.23).
The first of them is a rather trivial one, assuring that functionals have some effective domain.
Definition 3.3. A functional E : X → R ∪ {+∞} is called proper, if the effective domain
dom(E) := {x ∈ X | E(x) < ∞}
is not empty.
More importantly we are going to need that functionals are lower semi-continuous. Roughly
speaking this means that the functional values for arguments near an argument x are either close
to E(x) or greater than E(x).
Definition 3.4. Let X be a Banach space with topology τ . The functional E : X → R ∪ {+∞} is
said to be lower semi-continuous at x ∈ X if
E(x) ≤ lim inf E(xk )
k→∞

for all xk → x in the topology τ of X .



Together with compactness of the sub level sets this leads to the fundamental theorem of
optimisation.
Theorem 3.6. Let X be a Banach space with topology τ and let E : X → R ∪ {+∞} be lower
semi-continuous. Furthermore, let the level set
{x ∈ X | E(x) ≤ c}
be non-empty, and compact in the topology τ for some c ∈ R. Then there exists a global minimiser
x̂ of E, i.e. E(x̂) ≤ E(x) for all x ∈ X .

Proof. Let Ê = inf x∈X E(x). Then a sequence {xk }k∈N exists with E(xk ) → Ê for k → ∞. For
k sufficiently large, E(xk ) ≤ c holds and hence, {xk }k∈N is contained in a compact set. As a
consequence, a subsequence {xkl }l∈N exists with xkl → x̂, for l → ∞, for some x̂ ∈ X . From the
lower semi-continuity of E we obtain
Ê ≤ E(x̂) ≤ lim inf E(xkl ) ≤ Ê ,
l→∞

consequently x̂ is a global minimiser.

For the sake of brevity, we are not going to prove lower semi-continuity and compactness of
the sub-level sets for any of the functionals that we are going to consider. Nevertheless we want
to emphasise (without proof) that all functionals considered throughout this lecture are lower
semi-continuous and have compact sub level sets in the weak-* topology of the corresponding
Banach space. According to the Theorem of Banach-Alaoglu (cf. [9, Theorem 1.9.13]), the set
{v ∈ X ∗ | kvkX ∗ ≤ c}, for c > 0, is compact in the weak-* topology. Hence, we could conclude
existence of a global minimum for a given infinite dimensional optimisation problem, if we were
able to prove lower semi-continuity in the weak-* topology. Unfortunately this is not as easy as it
sounds and certainly beyond the scope of this lecture.
Before we proceed to our first example of a variational regularisation model, we want to recall the concepts of directional and Fréchet-derivatives of functionals.
Definition 3.5. Let E : X → R ∪ {+∞} be a functional or operator. The directional derivative
(also called first variation) at position x ∈ X in direction y ∈ X is defined as
\[
d_y E(x) := \lim_{t\downarrow 0} \frac{E(x + ty) - E(x)}{t} \, , \tag{3.24}
\]
if that limit exists.
Note that if we define the function $\varphi_y(\tau) := E(x + \tau y)$, the directional derivative of $E$ is equivalent to $\varphi_y'(\tau)|_{\tau = 0}$.
Definition 3.6. Let $E : \mathcal{X} \to \mathbb{R} \cup \{+\infty\}$ be a functional on the Banach space $\mathcal{X}$, and let $d_y E(x)$ exist for all $y \in \mathcal{X}$. If a bounded linear functional $E'(x) \in \mathcal{X}^*$ exists such that
\[
(E'(x))(y) = d_y E(x) \qquad \forall y \in \mathcal{X} \, ,
\]
and
\[
\frac{|E(x + y) - E(x) - (E'(x))(y)|}{\|y\|_{\mathcal{X}}} \to 0 \qquad \text{for } \|y\|_{\mathcal{X}} \to 0 \, ,
\]
holds true, then $E$ is called Fréchet-differentiable at $x$ and $E'(x)$ the Fréchet-derivative at $x$. If the Fréchet-derivative exists for all $x \in \mathcal{X}$, the operator $E' : \mathcal{X} \to \mathcal{X}^*$ is called the Fréchet-derivative of $E$.
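As a concrete illustration, the data term $H(Ku, f) = \frac{1}{2}\|Ku - f\|^2$ used repeatedly below is Fréchet-differentiable with respect to $u$, with derivative $K^*(Ku - f)$. The following sketch (Python/NumPy, with purely illustrative matrices and names) compares the difference quotient (3.24) with this expression.

\begin{verbatim}
import numpy as np

# Numerical check that the directional derivative of E(u) = 0.5*||K u - f||^2
# equals <K^T (K u - f), y>, i.e. that E'(u) = K^*(K u - f).
rng = np.random.default_rng(1)
K = rng.standard_normal((30, 20))
f = rng.standard_normal(30)
u = rng.standard_normal(20)
y = rng.standard_normal(20)

E = lambda w: 0.5 * np.linalg.norm(K @ w - f) ** 2

t = 1e-6
difference_quotient = (E(u + t * y) - E(u)) / t     # d_y E(u), cf. (3.24)
analytic = (K.T @ (K @ u - f)) @ y                  # (E'(u))(y)
print(abs(difference_quotient - analytic))          # small, of order t
\end{verbatim}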

3.6.1 Tikhonov-type regularisation


First of all we want to show that the solution of the Tikhonov regularisation (3.11) can be inter-
preted as the minimiser of a functional of the form (3.23).
Theorem 3.7. For f ∈ V the Tikhonov-regularised solution uα = Rα f with Rα as defined in
(3.10) is uniquely determined as the global minimiser of the Tikhonov-functional
\[
T_\alpha(u) := \frac{1}{2}\|Ku - f\|_{\mathcal{V}}^2 + \frac{\alpha}{2}\|u\|_{\mathcal{U}}^2 \, . \tag{3.25}
\]
Proof. $\Rightarrow$: A global minimiser $\hat{u} \in \mathcal{U}$ of $T_\alpha$ is characterised via $T_\alpha(\hat{u}) \leq T_\alpha(u)$ for all $u \in \mathcal{U}$. Hence, it follows from
\begin{align*}
T_\alpha(u) - T_\alpha(u_\alpha) &= \frac{1}{2}\|Ku - f\|_{\mathcal{V}}^2 + \frac{\alpha}{2}\|u\|_{\mathcal{U}}^2 - \frac{1}{2}\|Ku_\alpha - f\|_{\mathcal{V}}^2 - \frac{\alpha}{2}\|u_\alpha\|_{\mathcal{U}}^2 \\
&= \frac{1}{2}\|Ku\|_{\mathcal{V}}^2 - \langle Ku, f\rangle + \frac{\alpha}{2}\|u\|_{\mathcal{U}}^2 - \frac{1}{2}\|Ku_\alpha\|_{\mathcal{V}}^2 + \langle Ku_\alpha, f\rangle - \frac{\alpha}{2}\|u_\alpha\|_{\mathcal{U}}^2 \\
&\qquad + \underbrace{\langle (K^*K + \alpha I)u_\alpha - K^*f, u_\alpha - u\rangle}_{=0} \\
&= \frac{1}{2}\|Ku - Ku_\alpha\|_{\mathcal{V}}^2 + \frac{\alpha}{2}\|u - u_\alpha\|_{\mathcal{U}}^2 \\
&\geq 0
\end{align*}
that $u_\alpha$ is a global minimiser of $T_\alpha$.


$\Leftarrow$: If we have $T_\alpha(\hat{u}) \leq T_\alpha(u)$ (for all $u \in \mathcal{U}$), it follows with $u = \hat{u} + \tau v$ for arbitrary $\tau > 0$ and fixed $v \in \mathcal{U}$ that
\[
0 \leq T_\alpha(u) - T_\alpha(\hat{u}) = \frac{\tau^2}{2}\|Kv\|_{\mathcal{V}}^2 + \frac{\tau^2\alpha}{2}\|v\|_{\mathcal{U}}^2 + \tau \langle (K^*K + \alpha I)\hat{u} - K^*f, v\rangle_{\mathcal{U}}
\]
holds true. Dividing by $\tau$ and subsequently considering the limit $\tau \downarrow 0$ thus yields
\[
\langle (K^*K + \alpha I)\hat{u} - K^*f, v\rangle_{\mathcal{U}} \geq 0 \, , \qquad \text{for all } v \in \mathcal{U} .
\]
Since $v$ is arbitrary (we may in particular replace $v$ by $-v$), this inner product has to vanish for all $v \in \mathcal{U}$, i.e. $(K^*K + \alpha I)\hat{u} = K^*f$. Hence, we can conclude $\hat{u} = u_\alpha$, and $u_\alpha$ is the unique minimiser of (3.25).

We therefore see that $u_\alpha$ satisfying (3.10) (resp. (3.11)) is also the solution of a variational regularisation problem of the form (3.23), with $H(Ku, f) = \frac{1}{2}\|Ku - f\|_{\mathcal{V}}^2$ and $J(u) = \frac{1}{2}\|u\|_{\mathcal{U}}^2$.
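In a finite-dimensional setting this minimiser can be computed directly by solving the normal equations $(K^*K + \alpha I)u_\alpha = K^*f$ that appeared in the proof above. A minimal sketch in Python/NumPy (data and names are illustrative):

\begin{verbatim}
import numpy as np

def tikhonov(K, f, alpha):
    """Minimiser of 0.5*||K u - f||^2 + 0.5*alpha*||u||^2,
    computed by solving (K^T K + alpha I) u = K^T f."""
    n = K.shape[1]
    return np.linalg.solve(K.T @ K + alpha * np.eye(n), K.T @ f)

# illustrative usage on noisy data
rng = np.random.default_rng(2)
K = rng.standard_normal((40, 40)) @ np.diag(0.8 ** np.arange(40))   # ill-conditioned operator
u_true = rng.standard_normal(40)
f_delta = K @ u_true + 1e-3 * rng.standard_normal(40)
u_alpha = tikhonov(K, f_delta, alpha=1e-2)
\end{verbatim}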
A nice property of this representation as a minimiser of a functional is that we can generalise
Tikhonov regularisation by choosing different regularisation functionals for J, i.e.
\[
\tilde{T}_\alpha(u) := \frac{1}{2}\|Ku - f\|_{\mathcal{V}}^2 + \alpha J(u) \, , \tag{3.26}
\]
for general J : U → R ∪ {+∞}. We want to denote this regularisation as Tikhonov-type regular-
isation and present a few examples in the following. Note that Tikhonov-type regularisation for
general J is not necessarily a regularisation in the sense of Definition 3.1, as we cannot expect
uα := arg minu∈U T̃α (u) → u† for α → 0. However, we can generalise Definition 2.1 to justify
calling the minimisation of (3.26) a regularisation.
Definition 3.7. An element $u \in \mathcal{U}$ is called a $J$-minimising solution of (1.1), if
\[
J(u) = \min\{ J(w) \mid w \text{ satisfies } (2.1) \} \, . \tag{3.27}
\]



Clearly, for f δ with kf − f δ kV ≤ δ and u† satisfying (3.27) we can expect uα → u† for


α → 0. However, it is beyond the scope of this lecture to verify this or any other generalised
concepts of regularisation and we leave it to the interested reader to explore more about possible
generalisations of regularisation.
Before we restrict ourselves to convex functionals and address key properties of them and their minimisation, we want to give a few examples of regularisation functionals used in the context of Tikhonov-type regularisation.
Example 3.2 (Squared norm with linear operator). The easiest way to extend classical Tikhonov
regularisation to a more general regularisation method is to replace kuk2U /2 with
\[
J(u) := \frac{1}{2}\|Du\|_{\mathcal{W}}^2 \, ,
\]
for an operator D ∈ L(U, W). An example for an operator D that is commonly used in this
framework is the gradient operator ∇ ∈ L(L2 (Ω), L2 (Ω; Rn )). In this context it is also common to
use the H 1 -norm as a regulariser, i.e.
\[
J(u) = \frac{1}{2}\|u\|_{H^1(\Omega)}^2 := \frac{1}{2}\|u\|_{L^2(\Omega)}^2 + \frac{1}{2}\|\nabla u\|_{L^2(\Omega;\mathbb{R}^n)}^2 \, .
\]
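In finite dimensions, minimising (3.26) with $J(u) = \frac{1}{2}\|Du\|^2$ amounts to solving $(K^*K + \alpha D^*D)u = K^*f$. A sketch of this generalised Tikhonov regularisation in one dimension, with $D$ chosen as a forward-difference approximation of the derivative (Python/NumPy; all operators and data are illustrative assumptions of this example):

\begin{verbatim}
import numpy as np

def generalised_tikhonov(K, D, f, alpha):
    """Minimiser of 0.5*||K u - f||^2 + 0.5*alpha*||D u||^2,
    obtained by solving (K^T K + alpha D^T D) u = K^T f."""
    return np.linalg.solve(K.T @ K + alpha * D.T @ D, K.T @ f)

# illustrative 1-D setup: K a convolution-type (blurring) matrix, D forward differences
n = 100
x = np.linspace(0, 1, n)
K = np.exp(-((x[:, None] - x[None, :]) ** 2) / (2 * 0.02 ** 2))
K /= K.sum(axis=1, keepdims=True)
D = (np.eye(n, k=1) - np.eye(n))[:-1] * (n - 1)        # scaled forward differences
u_true = (x > 0.3).astype(float) - (x > 0.7).astype(float)
f_delta = K @ u_true + 1e-3 * np.random.default_rng(3).standard_normal(n)
u_alpha = generalised_tikhonov(K, D, f_delta, alpha=1e-4)
\end{verbatim}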
Example 3.3 (Maximum-entropy regularisation). Maximum-entropy regularisation is of particu-
lar interest if solutions of the inverse problem are assumed to be probability density functions, i.e.
functions in the space
\[
\mathrm{PDF}(\Omega) := \left\{ u \in L^1(\Omega) \ \middle|\ \int_\Omega u(x)\, dx = 1, \ u \geq 0 \right\} .
\]


The (negative) entropy used in physics and information theory is defined as the functional J :
PDF(Ω) → R≥−1 with
\[
J(u) := \int_\Omega u(x)\log(u(x)) - u(x) \, dx \, ,
\]
and the convention $0\log(0) := 0$. The corresponding Tikhonov-type regularisation reads as
\[
u_\alpha = \arg\min_{u \in \mathrm{PDF}(\Omega)} \left\{ \frac{1}{2}\|Ku - f\|_{\mathcal{V}}^2 + \alpha \int_\Omega u(x)\log(u(x)) - u(x)\, dx \right\} \, , \tag{3.28}
\]

for operators K ∈ L(PDF(Ω), V).


Example 3.4 (`1 regularisation). When it comes to non-injective operators K ∈ L(`2 , `2 ) between
sequence spaces, the $\ell^1$-norm
\[
J(u) = \|u\|_{\ell^1} := \sum_{j=1}^{\infty} |u_j|
\]

is often used as a regulariser, in order to enforce sparse solutions. The corresponding Tikhonov-
type regularisation reads as
\[
u_\alpha \in \arg\min_{u \in \ell^1} \left\{ \frac{1}{2}\|Ku - f\|_{\ell^2}^2 + \alpha \sum_{j=1}^{\infty} |u_j| \right\} .
\]

Note that `p spaces are increasing in p, which implies kuk`2 ≤ kuk`1 and hence, u ∈ `2 if u ∈ `1 .
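One standard algorithm for this minimisation problem (beyond the scope of this lecture) is the iterative soft-thresholding algorithm, which alternates a Landweber-type gradient step on the data term with componentwise soft-thresholding. A finite-dimensional sketch, assuming $K$ is a matrix (Python/NumPy, illustrative data):

\begin{verbatim}
import numpy as np

def soft_threshold(x, gamma):
    """Componentwise soft-thresholding, the proximal map of gamma*||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

def ista(K, f, alpha, num_iter=500):
    """Iterative soft-thresholding for 0.5*||K u - f||_2^2 + alpha*||u||_1.
    Step size tau <= 1/||K||^2 (spectral norm squared)."""
    tau = 1.0 / np.linalg.norm(K, 2) ** 2
    u = np.zeros(K.shape[1])
    for _ in range(num_iter):
        u = soft_threshold(u - tau * K.T @ (K @ u - f), tau * alpha)
    return u

# illustrative usage: recover a sparse vector from few noisy measurements
rng = np.random.default_rng(4)
K = rng.standard_normal((50, 200))
u_true = np.zeros(200)
u_true[rng.choice(200, 5, replace=False)] = 1.0
f = K @ u_true + 1e-3 * rng.standard_normal(50)
u_alpha = ista(K, f, alpha=0.1)
\end{verbatim}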

Example 3.5 (Total variation regularisation). Total variation as a regulariser was originally introduced for image-denoising and -restoration applications with the goal of preserving edges in images, respectively discontinuities in signals [8]. For smooth signals $u \in W^{1,1}(\Omega)$ the total
variation is simply defined as
\[
J(u) := \int_\Omega \|(\nabla u)(x)\|_2 \, dx \, ;
\]
a more rigorous definition that also includes functions with discontinuities is given via the dual formulation
\[
J(u) = \mathrm{TV}(u) := \sup_{\substack{\varphi \in C_0^\infty(\Omega;\mathbb{R}^n) \\ \| \, \|\varphi\|_2 \, \|_\infty \leq 1}} \int_\Omega u(x)\, (\mathrm{div}\varphi)(x) \, dx \, .
\]

Note that the choice of the vector norm is somewhat arbitrary and could be replaced by other
$p$-vector norms. The corresponding Tikhonov-type regularisation reads as
\[
u_\alpha \in \arg\min_{u \in \mathrm{BV}(\Omega)} \left\{ \frac{1}{2}\|Ku - f\|_{\mathcal{V}}^2 + \alpha \mathrm{TV}(u) \right\} \, , \tag{3.29}
\]

for K ∈ L(BV(Ω) , V), and where BV(Ω) denotes the space of functions of bounded variation
defined as
\[
\mathrm{BV}(\Omega) := \left\{ u \in L^1(\Omega) \ \middle|\ \mathrm{TV}(u) < \infty \right\} .
\]
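In a discrete setting, the (isotropic) total variation of an image is typically approximated by summing the Euclidean norms of finite-difference gradients. A sketch for a 2-D array (Python/NumPy; forward differences with grid spacing one and Neumann boundary are assumptions of this illustration):

\begin{verbatim}
import numpy as np

def discrete_tv(u):
    """Isotropic discrete total variation of a 2-D array u: sum over pixels
    of the Euclidean norm of the forward-difference gradient."""
    dx = np.diff(u, axis=0, append=u[-1:, :])   # forward differences, Neumann boundary
    dy = np.diff(u, axis=1, append=u[:, -1:])
    return np.sum(np.sqrt(dx ** 2 + dy ** 2))

# a piecewise constant image has small TV compared to a noisy one
u = np.zeros((64, 64))
u[16:48, 16:48] = 1.0
noisy = u + 0.1 * np.random.default_rng(5).standard_normal(u.shape)
print(discrete_tv(u), discrete_tv(noisy))
\end{verbatim}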




3.7 Convex Variational Calculus


Definition 3.8 (Convex Set). Let X be a Banach space. A subset C ⊆ X is called convex, if

λx + (1 − λ)y ∈ C ,

for all λ ∈ [0, 1] and all x, y ∈ C.


In analogy we can define (strictly) convex functionals on convex sets.
Definition 3.9 (Convex Functional). Let C be a convex set. A functional E : C → R ∪ {+∞} is
called convex, if

E(λx + (1 − λ)y) ≤ λE(x) + (1 − λ)E(y) (3.30)

for all $\lambda \in [0, 1]$ and all $x, y \in C$. The functional $E$ is called strictly convex, if equality in (3.30) holds only for $x = y$ or $\lambda \in \{0, 1\}$.
Example 3.6 (Absolute Value Function). For $C = \mathbb{R}$ and $E : \mathbb{R} \to \mathbb{R}_{\geq 0}$ the absolute value function $E(x) = |x|$ is convex, since the absolute value is a norm and the triangle inequality yields $|\lambda x + (1 - \lambda)y| \leq \lambda|x| + (1 - \lambda)|y|$. Obviously, the absolute value function is not strictly convex.
Example 3.7 (Characteristic Function). The characteristic function χC : X → R ∪ {+∞} with
\[
\chi_C(x) := \begin{cases} 0 & x \in C \\ +\infty & x \notin C \end{cases} \tag{3.31}
\]
is convex if C ⊆ X is a convex subset.

We want to generalise the notion of differentiability to non-differentiable but convex functionals.


Definition 3.10 (Subdifferential). Let X be a Banach space with dual space X ∗ , and let the
functional E : X → R ∪ {+∞} be convex. Then, E is called subdifferentiable at x ∈ X , if there
exists an element p ∈ X ∗ such that
E(y) − E(x) − hp, y − xiX ≥ 0
holds, for all y ∈ X . Furthermore, we call p a subgradient at position x. The collection of all
subgradients at position x, i.e.
\[
\partial E(x) := \{ p \in \mathcal{X}^* \mid E(y) - E(x) - \langle p, y - x\rangle_{\mathcal{X}} \geq 0 \, , \ \forall y \in \mathcal{X} \} \subset \mathcal{X}^* \, ,
\]
is called subdifferential of E at x.
For non-differentiable functionals the subdifferential is multivalued; we want to consider the
subdifferential of the absolute value function as an illustrative example.
Example 3.8. Let $\mathcal{X} = \mathbb{R}$, and let $E : \mathbb{R} \to \mathbb{R}_{\geq 0}$ be the absolute value function $E(x) = |x|$. Then, the subdifferential of $E$ at $x$ is given by
\[
\partial E(x) = \mathrm{sign}(x) := \begin{cases} \{1\} & \text{for } x > 0 \\ [-1, 1] & \text{for } x = 0 \\ \{-1\} & \text{for } x < 0 \end{cases} \, .
\]
This can easily be verified by case distinction. For $x = 0$ the inequality $|y| \geq py$ is obviously fulfilled for all $y \in \mathbb{R}$ iff $p \in [-1, 1]$. For $x > 0$ the inequality $|y| \geq py + (1 - p)x$ is fulfilled for every $y \in \mathbb{R}$ iff $p = 1$. In analogy, $p = -1$ has to hold for $x < 0$ in order to guarantee $|y| \geq py - (1 + p)x$ for each $y \in \mathbb{R}$.
Throughout the remainder of this lecture we are particularly interested in convex, non-differentiable
and absolutely one-homogeneous functionals (like the `1 -norm or the total variation), i.e. we want
to consider functionals E for which E(cx) = |c|E(x) is satisfied, for every c ∈ R. We therefore
characterise the subdifferential of (absolutely) one-homogeneous functionals with the following
lemma.
Lemma 3.2 (Subdifferential for One-Homogeneous Functionals). Let E : X → R ∪ {+∞} be a
convex and one-homogeneous functional. Then, the subdifferential of E at x can equivalently be
written as
∂E(x) = {p ∈ X ∗ | hp, xi = E(x), hp, yiX ≤ E(y), ∀y ∈ X } .
Proof. If we consider $y = 0 \in \mathcal{X}$ in the definition of the subdifferential we immediately see
\[
\langle p, x\rangle \geq E(x) \, ,
\]
using $E(0) = 0$, which follows from one-homogeneity, while for $y = 2x \in \mathcal{X}$ we obtain
\[
\langle p, x\rangle \leq E(2x) - E(x) = 2E(x) - E(x) = E(x) \, ,
\]
due to the one-homogeneity of $E$. Thus, $\langle p, x\rangle = E(x)$ follows. Moreover, if we insert $\langle p, x\rangle = E(x)$ in the definition of the subdifferential we get
\[
\langle p, y\rangle \leq E(y) \, ,
\]
for all $y \in \mathcal{X}$. Conversely, if $p$ satisfies $\langle p, x\rangle = E(x)$ and $\langle p, y\rangle \leq E(y)$ for all $y \in \mathcal{X}$, then $E(y) - E(x) - \langle p, y - x\rangle = E(y) - \langle p, y\rangle \geq 0$, so $p \in \partial E(x)$.

Example 3.9. If we consider the one-homogeneous absolute value function E : R → R≥0 with
E(x) = |x| again, we see that xp = |x| implies p = sign(x). Moreover, we discover that for
arguments y with the same sign as x we obtain yp = |y|, while for arguments y having the
opposite sign as x, we get yp = −|y| < |y|.

Considering (3.23) again, we now and for the rest of the lecture assume H and J to be convex
(in addition to the assumption of being proper and lower semi-continuous). Further, we assume
H to be Fréchet-differentiable with respect to u. For this setup we can state the following useful
characterisation of the subdifferential (without proof).

Theorem 3.8. Let H : V × V → R ∪ {+∞} and J : U → R ∪ {+∞} be proper, convex and lower
semi-continuous functionals, for Banach spaces U and V. Let furthermore K ∈ L(U, V), and H be
Fréchet-differentiable with respect to u. Then, for α > 0, the subdifferential of the functional Eα
as defined in (3.23) is given via

\[
\partial(H(K\cdot, f) + \alpha J)(u) = K^* H'(Ku, f) + \alpha\, \partial J(u) \, .
\]

Proof. See [7] or [4, Proposition 5.6].

Finally we want to characterise global minimisers of convex functionals.

Theorem 3.9 (Generalised Fermat). Let E : X → R ∪ {+∞} be a proper, convex functional. An


element x ∈ X is a (global) minimiser of E iff 0 ∈ ∂E(x).

Proof. For 0 ∈ ∂E(x) we have

0 = h0, y − xi ≤ E(y) − E(x)

for all $y \in \mathcal{X}$ by definition of the subdifferential. Thus, $x$ is a global minimiser of $E$.


If $0 \notin \partial E(x)$ holds, then there exists at least one $y \in \mathcal{X}$ such that
\[
E(y) - E(x) < \langle 0, y - x\rangle = 0 \, ,
\]
and hence, $x$ cannot be a minimiser of $E$.

It is worth mentioning that all the definitions and theorems derived in this chapter can be
transferred to (strictly) concave functions and (global) maximisation problems. In particular, if E
is a (strictly) concave functional then −E is (strictly) convex, and the characterisation of a global
maximum of E is identical to the characterisation of a global minimum of −E. We will make use
of this fact throughout the next chapter when it comes to saddle-point problems.
To conclude this chapter, we want to introduce the useful notion of the generalised Bregman
distance.

Definition 3.11. For a proper, lower semi-continuous and convex functional E : X → R ∪ {+∞}
with non-empty subdifferential ∂E, the generalised Bregman distance is defined as
\[
D_E^p(x, y) := E(x) - E(y) - \langle p, x - y\rangle \, , \qquad p \in \partial E(y) \, . \tag{3.32}
\]
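For instance, for the squared norm $E(x) = \frac{1}{2}\|x\|^2$ on a Hilbert space we have $\partial E(x) = \{x\}$, and the Bregman distance reduces to the usual squared distance, $D_E^y(x, y) = \frac{1}{2}\|x\|^2 - \frac{1}{2}\|y\|^2 - \langle y, x - y\rangle = \frac{1}{2}\|x - y\|^2$.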
The generalised Bregman distance is not a distance in the classical sense. It does satisfy $D_E^p(x, y) \geq 0$ for all $x, y \in \mathcal{X}$; however, it is neither symmetric nor does it satisfy a triangle inequality in general. Nevertheless, we can symmetrise the Bregman distance by simply adding the two Bregman distances with interchanged arguments.

Definition 3.12. Let the same assumptions hold as in Definition 3.11. The generalised symmetric
Bregman distance is defined as
\[
D_E^{\mathrm{symm}}(x_1, x_2) := D_E^{p_1}(x_2, x_1) + D_E^{p_2}(x_1, x_2) = \langle x_1 - x_2, p_1 - p_2 \rangle \, ,
\]
for p1 ∈ ∂E(x1 ) and p2 ∈ ∂E(x2 ).


Note that we have omitted $p_1$ and $p_2$ in the notation of $D_E^{\mathrm{symm}}$ only for the sake of brevity.
The generalised symmetric Bregman distance will be a helpful tool throughout the next chapter
for the convergence analysis of the so-called primal-dual hybrid gradient method for the iterative
solution of convex variational regularisation problems.
Chapter 4

Computational realisation

In this final chapter of the lecture we want to focus on how to implement variational regularisation methods numerically, as they appear to be the most general class of regularisation methods that we have studied throughout this lecture. However, we will only consider convex variational regularisation methods, as we want to guarantee that the iterative methods we are going to consider converge to global minimisers.

4.1 A primal dual hybrid gradient method


Many Tikhonov-type regularisation problems of the form (3.26) can be formulated as a member
of the following class of saddle-point problems:

\[
\inf_u \sup_v \ F(u) - G(v) + \langle Du, v \rangle \, . \tag{4.1}
\]

Here both F and G are proper, l.s.c. and convex functionals, and D ∈ L(U, W). Before we
continue to study (4.1) and potential numerical solutions, we want to give a quick example.

Example 4.1 (Total variation regularisation). Note that we can reformulate (3.29) as
\[
u_\alpha \in \arg\min_{u \in \mathrm{BV}(\Omega)} \left\{ \frac{1}{2}\|Ku - f\|_{\mathcal{V}}^2 + \sup_v \ \langle u, \alpha\, \mathrm{div} v \rangle - \chi_C(v) \right\}
\]

with $\chi_C$ being a characteristic function as defined in (3.31) and
\[
C := \left\{ v \in C_0^\infty(\Omega; \mathbb{R}^n) \ \middle|\ \big\| \, \|v\|_2 \, \big\|_{L^\infty(\Omega)} \leq 1 \right\} \, .
\]




In order to discuss how to solve (4.1) numerically, we want to consider the optimality conditions
of the saddle-point problem first. For any subgradients p̂ ∈ ∂F (û) and q̂ ∈ ∂G(v̂) that satisfy

\[
\hat{p} + D^* \hat{v} = 0 \, , \qquad \hat{q} - D\hat{u} = 0 \, , \tag{4.2}
\]

the point (û, v̂) is a saddle-point of (4.1), due to the convexity of F and G. We therefore want to
approach finding a saddle point of (4.1) via the iterative approach

\[
\begin{pmatrix} p^{k+1} + D^* v^{k+1} \\ q^{k+1} - D u^{k+1} \end{pmatrix} + M \begin{pmatrix} u^{k+1} - u^k \\ v^{k+1} - v^k \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix} \, , \tag{4.3}
\]


for pk+1 ∈ ∂F (uk+1 ) and q k+1 ∈ ∂G(v k+1 ), and a 2 × 2 operator-matrix M . We immediately
observe that (4.3) should approach (4.2) in the limit (for properly chosen $M$). Hence, the important question is: how do we choose $M$ such that (4.3) results in a convergent algorithm with relatively simple update steps for $u^{k+1}$ and $v^{k+1}$? From our intuitive knowledge as numerical analysts we would
guess that M should be positive definite and maybe even self-adjoint, to guarantee convergence
of (4.3). But beyond that? A naïve choice for M could simply be the identity; however, in that
case each update step is implicit in both uk+1 and v k+1 , which may not simplify solving our
saddle-point problem at all. Alternatively, we propose to use

\[
M := \begin{pmatrix} \frac{1}{\tau} I & -D^* \\ -D & \frac{1}{\sigma} I \end{pmatrix} \tag{4.4}
\]

instead, with τ and σ being positive scalars. For this choice, (4.3) simplifies to the equations

\begin{align}
u^{k+1} - (u^k - \tau D^* v^k) + \tau p^{k+1} &= 0 \, , \tag{4.5} \\
v^{k+1} - (v^k + \sigma D(2u^{k+1} - u^k)) + \sigma q^{k+1} &= 0 \, . \tag{4.6}
\end{align}

Due to $p^{k+1} \in \partial F(u^{k+1})$ and $q^{k+1} \in \partial G(v^{k+1})$, Equations (4.5) and (4.6) can be rewritten as

\begin{align}
u^{k+1} &= (I + \tau \partial F)^{-1}(u^k - \tau D^* v^k) \, , \tag{4.7} \\
v^{k+1} &= (I + \sigma \partial G)^{-1}(v^k + \sigma D(2u^{k+1} - u^k)) \, , \tag{4.8}
\end{align}

where $(I + \alpha\partial E)^{-1}$ is short for
\[
(I + \alpha\partial E)^{-1}(f) := \arg\min_u \left\{ \frac{1}{2}\|u - f\|_2^2 + \alpha E(u) \right\} \, . \tag{4.9}
\]
Operations of the form (4.9) are known as the proximal or resolvent operator with respect to the
convex functional E.
The iteration defined by (4.7) and (4.8) is known as the primal-dual hybrid gradient method (PDHGM). Before we prove actual convergence of those iterates to a saddle point of (4.1), we want to highlight what makes PDHGM so useful. First of all, the particular choice of $M$ in (4.4) decouples $u^{k+1}$ and $v^{k+1}$ in the update for $u^{k+1}$, which makes it possible to update $u^{k+1}$ and $v^{k+1}$ in an alternating fashion. Secondly, PDHGM only requires basic arithmetic operations, operator-adjoints and -multiplications, and the evaluation of the resolvent operations with respect to $F$ and $G$. If these are simple, the overall PDHGM is simple. We want to take a look at the simple total variation problem to see that in that case the resolvent operations are particularly easy.
Example 4.2. Note that the resolvent operators in case of Example 4.1 are given as
\[
\hat{u} = (I + \tau\partial F)^{-1}(w) = \arg\min_u \left\{ \frac{1}{2}\|u - w\|_2^2 + \frac{\tau}{2}\|Ku - f\|_2^2 \right\} \tag{4.10}
\]
and
\[
\hat{v} = (I + \sigma\partial G)^{-1}(z) = \arg\min_v \left\{ \frac{1}{2}\|v - z\|_F^2 + \sigma\chi_C(v) \right\} \, . \tag{4.11}
\]
The resolvent operator (4.10) simply resembles the variational form of (shifted) Tikhonov regular-
isation. Hence, we can solve (4.10) via

\[
\hat{u} = (I + \tau K^* K)^{-1}(w + \tau K^* f) \, ,
\]

according to Equation (3.11).


Further, we can figure out via case distinction (left as an exercise) that the resolvent operator (4.11) can be computed via
\[
\hat{v}_j = \frac{z_j}{\max(1, \|z\|_2)} \, ,
\]
where $z_j$ denotes the $j$-th vectorial component of the vector field $z$, for $j \in \{1, \dots, n\}$. Hence,
PDHGM reads as

\begin{align}
u^{k+1} &= (I + \tau K^* K)^{-1}\big(u^k + \tau(\mathrm{div} v^k + K^* f)\big) \tag{4.12} \\
v_j^{k+1} &= \frac{\big(v^k + \sigma\nabla(2u^{k+1} - u^k)\big)_j}{\max\big(1, \|v^k + \sigma\nabla(2u^{k+1} - u^k)\|_2\big)} \qquad \text{for all } j \in \{1, \dots, n\} \tag{4.13}
\end{align}

in case of total variation regularisation, due to ∇∗ = −div. Note that in case of ROF-denoising as
proposed in [8], Equation (4.12) simplifies to

\[
u^{k+1} = \frac{u^k + \tau(\mathrm{div} v^k + f)}{1 + \tau} \, .
\]
Note that for real world applications one obviously has to find appropriate discretisations of K
and ∇.
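The following is a sketch of the iterates (4.12)–(4.13) for ROF denoising ($K = I$) on a discrete 2-D image in Python/NumPy. The discretisation choices are assumptions of this illustration: forward differences for $\nabla$ with the matching discrete divergence satisfying $\nabla^* = -\mathrm{div}$, step sizes with $\tau\sigma\|\nabla\|^2 < 1$ (using the standard bound $\|\nabla\|^2 \leq 8$ for this discretisation), and a dual variable that absorbs $\alpha$, so that the pointwise constraint reads $\|v\|_2 \leq \alpha$.

\begin{verbatim}
import numpy as np

def grad(u):
    """Forward-difference gradient with Neumann boundary (2-D)."""
    gx = np.zeros_like(u); gx[:-1, :] = u[1:, :] - u[:-1, :]
    gy = np.zeros_like(u); gy[:, :-1] = u[:, 1:] - u[:, :-1]
    return np.stack((gx, gy))

def div(p):
    """Discrete divergence, the negative adjoint of grad above."""
    px, py = p
    dx = np.zeros_like(px)
    dx[0, :] = px[0, :]; dx[1:-1, :] = px[1:-1, :] - px[:-2, :]; dx[-1, :] = -px[-2, :]
    dy = np.zeros_like(py)
    dy[:, 0] = py[:, 0]; dy[:, 1:-1] = py[:, 1:-1] - py[:, :-2]; dy[:, -1] = -py[:, -2]
    return dx + dy

def pdhgm_rof(f, alpha, num_iter=200):
    """PDHGM iterates (4.12)-(4.13) for ROF denoising (K = I); the dual variable
    absorbs alpha, i.e. it is projected onto the pointwise ball of radius alpha."""
    tau = sigma = 1.0 / 3.0                   # tau*sigma*||grad||^2 <= 8/9 < 1
    u = f.copy()
    v = np.zeros((2,) + f.shape)
    for _ in range(num_iter):
        u_old = u
        u = (u + tau * (div(v) + f)) / (1.0 + tau)     # cf. the ROF update above
        w = v + sigma * grad(2.0 * u - u_old)
        scale = np.maximum(1.0, np.sqrt(w[0] ** 2 + w[1] ** 2) / alpha)
        v = w / scale                                   # projection onto {||v||_2 <= alpha}
    return u

# illustrative usage: denoise a piecewise constant image
rng = np.random.default_rng(6)
clean = np.zeros((64, 64)); clean[16:48, 16:48] = 1.0
noisy = clean + 0.1 * rng.standard_normal(clean.shape)
denoised = pdhgm_rof(noisy, alpha=0.2)
\end{verbatim}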

4.1.1 Convergence of PDHGM


Before we prove convergence of the iterates under suitable conditions on τ and σ, we want to prove
that M as defined in (4.4) is positive definite (under suitable conditions on τ and σ).

Lemma 4.1. Consider the self-adjoint matrix $M$ as defined in (4.4). Then, $M$ is positive definite if $0 < \tau\sigma\|D\|^2 < 1$.

Proof. In order to prove positive definiteness of $M$ we need to show that
\[
\left\langle M \begin{pmatrix} u \\ v \end{pmatrix}, \begin{pmatrix} u \\ v \end{pmatrix} \right\rangle > 0 \tag{4.14}
\]

holds true for all $(u, v)^T \neq (0, 0)^T$. Evaluating the dual product in (4.14) yields the condition
\[
\frac{1}{\tau}\|u\|^2 + \frac{1}{\sigma}\|v\|^2 - 2\langle Du, v\rangle > 0 \, . \tag{4.15}
\]
With the help of Young’s inequality, we can estimate −2hDu, vi from below with
 
2c 1
−2hDu, vi ≥ − τ σkDk kuk + kvk2 2
,
τ cσ

for some constant $c > 0$. Due to the condition $\tau\sigma\|D\|^2 < 1$ it is clear that we can find an $\varepsilon > 0$ with $(1 + \varepsilon)^2 \tau\sigma\|D\|^2 = 1$. This further implies $1/((1 + \varepsilon)\tau\sigma\|D\|^2) = 1 + \varepsilon$ and hence, we can choose $c = 1 + \varepsilon = 1/((1 + \varepsilon)\tau\sigma\|D\|^2)$ to obtain
\[
-2\langle Du, v\rangle \geq -\frac{1}{1 + \varepsilon}\left( \frac{1}{\tau}\|u\|^2 + \frac{1}{\sigma}\|v\|^2 \right) \, .
\]

Inserting this estimate into (4.15) therefore yields
\[
\frac{\varepsilon}{1 + \varepsilon}\left( \frac{1}{\tau}\|u\|^2 + \frac{1}{\sigma}\|v\|^2 \right) > 0 \, ,
\]

for $(u, v)^T \neq (0, 0)^T$, since $\varepsilon > 0$.
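As a quick numerical sanity check (purely illustrative), one can assemble $M$ for a random matrix $D$ and step sizes with $\tau\sigma\|D\|^2 < 1$ and verify that its smallest eigenvalue is positive:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(7)
D = rng.standard_normal((8, 5))
norm_D_sq = np.linalg.norm(D, 2) ** 2
tau = sigma = 0.99 / np.sqrt(norm_D_sq)          # tau*sigma*||D||^2 = 0.99^2 < 1
M = np.block([[np.eye(5) / tau, -D.T],
              [-D, np.eye(8) / sigma]])
print(np.linalg.eigvalsh(M).min() > 0)           # True: M is positive definite
\end{verbatim}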

With the help of Lemma 4.1 we are able to prove convergence of the iterates (4.7) and (4.8)
to a saddle point of (4.1).
Theorem 4.1. Let $0 < \tau\sigma\|D\|^2 < 1$. Then the sequence $(u^k, v^k)$ defined via (4.7) and (4.8) satisfies the following properties (for $\|u^0\| < \infty$ and $\|v^0\| < \infty$):
• $\lim_{k\to\infty} D_F^{\mathrm{symm}}(u^k, \hat{u}) = 0$
• $\lim_{k\to\infty} D_G^{\mathrm{symm}}(v^k, \hat{v}) = 0$
• All limit points of $(u^k, v^k)$ are saddle points of (4.1).
Here $(\hat{u}, \hat{v})$ denotes a saddle point satisfying (4.2).

Proof. Note that we can combine (4.2) and (4.3) with $M$ as defined in (4.4) to
\[
\begin{pmatrix} p^{k+1} - \hat{p} + D^*(v^{k+1} - \hat{v}) \\ q^{k+1} - \hat{q} - D(u^{k+1} - \hat{u}) \end{pmatrix} + \begin{pmatrix} \frac{1}{\tau} I & -D^* \\ -D & \frac{1}{\sigma} I \end{pmatrix}\begin{pmatrix} u^{k+1} - u^k \\ v^{k+1} - v^k \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix} \, . \tag{4.16}
\]

Taking a dual product of
\[
\begin{pmatrix} p^{k+1} - \hat{p} + D^*(v^{k+1} - \hat{v}) \\ q^{k+1} - \hat{q} - D(u^{k+1} - \hat{u}) \end{pmatrix}
\]
with $(u^{k+1} - \hat{u}, v^{k+1} - \hat{v})^T$ yields
\[
\left\langle \begin{pmatrix} p^{k+1} - \hat{p} + D^*(v^{k+1} - \hat{v}) \\ q^{k+1} - \hat{q} - D(u^{k+1} - \hat{u}) \end{pmatrix}, \begin{pmatrix} u^{k+1} - \hat{u} \\ v^{k+1} - \hat{v} \end{pmatrix} \right\rangle = D_F^{\mathrm{symm}}(u^{k+1}, \hat{u}) + D_G^{\mathrm{symm}}(v^{k+1}, \hat{v}) \geq 0 \, .
\]

Hence, we can conclude
\[
0 \geq \left\langle M \begin{pmatrix} u^{k+1} - u^k \\ v^{k+1} - v^k \end{pmatrix}, \begin{pmatrix} u^{k+1} - \hat{u} \\ v^{k+1} - \hat{v} \end{pmatrix} \right\rangle = \left\langle \begin{pmatrix} u^{k+1} - u^k \\ v^{k+1} - v^k \end{pmatrix}, M \begin{pmatrix} u^{k+1} - \hat{u} \\ v^{k+1} - \hat{v} \end{pmatrix} \right\rangle
\]

due to $M$ being self-adjoint. By using the notation $w^k := (u^k, v^k)^T$ and $\hat{w} := (\hat{u}, \hat{v})^T$, and $\|w\|_M := \sqrt{\langle Mw, w\rangle}$, we observe

\begin{align*}
0 &\geq \|w^{k+1} - w^k\|_M^2 + \langle w^k - \hat{w}, M(w^{k+1} - w^k)\rangle \\
&= \|w^{k+1} - w^k\|_M^2 - \|w^k - \hat{w}\|_M^2 - \langle \hat{w} - w^k, M(w^{k+1} - \hat{w})\rangle \\
&= \|w^{k+1} - w^k\|_M^2 - \|w^k - \hat{w}\|_M^2 + \frac{1}{2}\left( \|w^{k+1} - \hat{w}\|_M^2 + \|w^k - \hat{w}\|_M^2 - \|w^{k+1} - w^k\|_M^2 \right) \\
&= \frac{1}{2}\left( \|w^{k+1} - w^k\|_M^2 - \|w^k - \hat{w}\|_M^2 + \|w^{k+1} - \hat{w}\|_M^2 \right) \, ,
\end{align*}
and thus, we conclude

\[
\|w^k - \hat{w}\|_M^2 \geq \|w^{k+1} - \hat{w}\|_M^2 + \|w^{k+1} - w^k\|_M^2 \, .
\]



As $\|\cdot\|_M$ is a norm, due to $M$ being self-adjoint and positive definite, this implies boundedness of the sequence $w^k$. Summing up the inner product of (4.16) with $w^{k+1} - \hat{w}$ over all $k$ therefore
yields
\[
2\sum_{k=0}^{\infty}\left( D_F^{\mathrm{symm}}(u^{k+1}, \hat{u}) + D_G^{\mathrm{symm}}(v^{k+1}, \hat{v}) \right) + \sum_{k=0}^{\infty}\|w^{k+1} - w^k\|_M^2 = \sum_{k=0}^{\infty}\left( \|w^k - \hat{w}\|_M^2 - \|w^{k+1} - \hat{w}\|_M^2 \right) \leq \|w^0 - \hat{w}\|_M^2 < \infty \, .
\]

This implies $\|w^{k+1} - w^k\|_M \to 0$, $D_F^{\mathrm{symm}}(u^{k+1}, \hat{u}) \to 0$ and $D_G^{\mathrm{symm}}(v^{k+1}, \hat{v}) \to 0$ for $k \to +\infty$. Due to the boundedness of $w^k$ there exists a subsequence $w^{k_l}$ that converges to a limit point $w^\infty := \lim_{l\to+\infty} w^{k_l}$. We have to verify that $w^\infty$ is a saddle point, which for $s^\infty := \lim_{l\to+\infty} s^{k_l}$, $s^k := (p^k, q^k)^T$, follows immediately from (4.16).

4.1.2 Deconvolution with total variation regularisation


An example of total variation regularisation for the inverse problem of image deconvolution is given as Exercise 2 on Exercise Sheet 3.
Bibliography

[1] Heinz Werner Engl. Integralgleichungen. Springer-Verlag, 1997.

[2] Heinz Werner Engl, Martin Hanke, and Andreas Neubauer. Regularization of inverse problems, volume 375. Springer Science & Business Media, 1996.

[3] Charles W Groetsch. Stable approximate evaluation of unbounded operators. Springer, 2006.

[4] Jan Lellmann. Convex optimization with applications to image processing, 2013. Lecture notes.

[5] Terry J Morrison. Functional analysis: An introduction to Banach space theory, volume 43. John Wiley & Sons, 2011.

[6] Arch W Naylor and George R Sell. Linear operator theory in engineering and science. Springer Science & Business Media, 2000.

[7] R Tyrrell Rockafellar. Convex Analysis. Princeton University Press, 1970.

[8] Leonid I Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1):259–268, 1992.

[9] Terence Tao. Epsilon of Room, One, volume 1. American Mathematical Soc., 2010.

[10] Eberhard Zeidler. Applied functional analysis, applications to mathematical physics. Applied Mathematical Sciences, 108, 1995.

