
Parameter Estimation and Inverse Problems

Rick Aster, Brian Borchers, and Cliff Thurber

June 12, 2003

© Aster, Borchers, and Thurber
Preface for Draft Versions

This textbook grew out of a course in geophysical inverse methods that has
been taught for over 10 years at New Mexico Tech, first by Rick Aster and for
the last four years by Rick Aster and Brian Borchers. The lecture notes were
first assembled into book format and used as the official text for the course in
the fall of 2001. The draft of the textbook was then used in Cliff Thurber’s
course at the University of Wisconsin during the spring semester of 2002. In
the fall of 2002, Aster, Borchers, and Thurber signed a contract with Academic Press
to produce a text by the end of 2003 for publication in 2004.
We expect that readers of this book will have some familiarity with calculus,
differential equations, linear algebra, probability and statistics at the undergrad-
uate level. Appendices A and B review the required linear algebra, probability,
and statistics. In our experience teaching this course, many students have had
weaknesses in their preparation in linear algebra or in probability and statistics.
For that reason, we typically spend the first two to three weeks of the course
reviewing this material.
Chapters one through five form the heart of the book and should be read
in sequence. Chapters six, seven, and eight are independent of each other, but
depend strongly on the material in Chapters one through five. They can be
covered in any order. Chapters nine and ten are independent of Chapters six,
seven, and eight, but should be covered in sequence. Chapter 11 is independent
of Chapters six through ten.
If significant time for review of linear algebra, probability, and statistics is
taken, then there will not be time to cover the entire book in one semester.
However, it should be possible to cover the majority of the material by skipping
one or two of the chapters after Chapter five.
The text is a work in progress, but already contains much usable material
that should be of interest to a large audience. We and Academic Press feel that
the final content, presentation, examples, index terms, references, and other
features will benefit substantially from having a large number of instructors,
students, and researchers use, and provide feedback on, draft versions. Such
comments and corrections are greatly appreciated, and significant contributions
will be acknowledged in the book.


Contents

Preface for Draft Versions

1 Introduction
  1.1 Classification of Inverse Problems
  1.2 Examples of Parameter Estimation Problems
  1.3 Examples of Inverse Problems
  1.4 Why Inverse Problems Are Hard
  1.5 Exercises
  1.6 Notes and Further Reading

2 Linear Regression
  2.1 Introduction to Linear Regression
  2.2 Statistical Aspects of Least Squares
  2.3 L1 Regression
  2.4 Monte Carlo Error Propagation
  2.5 Exercises
  2.6 Notes and Further Reading

3 Discretizing Continuous Inverse Problems
  3.1 Integral Equations
  3.2 Quadrature Methods
  3.3 The Gram Matrix Technique
  3.4 Expansion in Terms of Basis Functions
  3.5 Exercises
  3.6 Notes and Further Reading

4 Rank Deficiency and Ill-Conditioning
  4.1 The SVD and the Generalized Inverse
  4.2 Covariance and Resolution of the Generalized Inverse Solution
  4.3 Instability of the Generalized Inverse Solution
  4.4 Rank Deficient Problems
  4.5 Discrete Ill-Posed Problems
  4.6 Exercises

5 Tikhonov Regularization
  5.1 Selecting a Good Solution
  5.2 SVD Implementation of Tikhonov Regularization
  5.3 Resolution, Bias, and Uncertainty in the Tikhonov Solution
  5.4 Higher Order Tikhonov Regularization
  5.5 Resolution in Higher Order Regularization
  5.6 The TGSVD Method
  5.7 Generalized Cross Validation
  5.8 Error Bounds
  5.9 Exercises
  5.10 Notes and Further Reading

6 Iterative Methods
  6.1 Iterative Methods for Tomography Problems
  6.2 The Conjugate Gradient Method
  6.3 The CGLS Method
  6.4 Exercises
  6.5 Notes and Further Reading

7 Other Regularization Techniques
  7.1 Using Bounds Constraints
  7.2 Maximum Entropy Regularization
  7.3 Total Variation
  7.4 Exercises
  7.5 Notes and Further Reading

8 Fourier Techniques
  8.1 Linear Systems in the Time and Frequency Domains
  8.2 Deconvolution from a Fourier Perspective
  8.3 Linear Systems in Discrete Time
  8.4 Water Level Regularization
  8.5 Exercises
  8.6 Notes and Further Reading

9 Nonlinear Regression
  9.1 Newton's Method
  9.2 The Gauss–Newton and Levenberg–Marquardt Methods
  9.3 Statistical Aspects
  9.4 Implementation Issues
  9.5 An Example
  9.6 Exercises
  9.7 Notes and Further Reading

10 Nonlinear Inverse Problems
  10.1 Regularizing Nonlinear Least Squares Problems
  10.2 Occam's Inversion
  10.3 Exercises

11 Bayesian Methods
  11.1 Review of the Classical Approach
  11.2 The Bayesian Approach
  11.3 The Multivariate Normal Case
  11.4 Maximum Entropy Methods
  11.5 Exercises
  11.6 Notes and Further Reading

A Review of Linear Algebra
  A.1 Systems of Linear Equations
  A.2 Matrix and Vector Algebra
  A.3 Linear Independence
  A.4 Subspaces of Rn
  A.5 Orthogonality and the Dot Product
  A.6 Eigenvalues and Eigenvectors
  A.7 Lagrange Multipliers
  A.8 Taylor's Theorem
  A.9 Vector and Matrix Norms
  A.10 The Condition Number of a Linear System
  A.11 The QR Factorization
  A.12 Linear Algebra in More General Spaces
  A.13 Exercises
  A.14 Notes and Further Reading

B Review of Probability and Statistics
  B.1 Probability and Random Variables
  B.2 Expected Value and Variance
  B.3 Joint Distributions
  B.4 Conditional Probability
  B.5 The Multivariate Normal Distribution
  B.6 The Central Limit Theorem
  B.7 Testing for Normality
  B.8 Estimating Means and Confidence Intervals
  B.9 Hypothesis Tests
  B.10 Exercises
  B.11 Notes and Further Reading

C Glossary of Mathematical Terminology

Chapter 1

Introduction

1.1 Classification of Inverse Problems

Scientists and engineers frequently need to relate the physical parameters of a
system to observations of the system. The observations or data will be denoted
by d, while the physical parameters or model will be denoted by m. We assume
that the physics of the system are understood, so that we can specify a function
G that relates the model m to the data d.

G(m) = d . (1.1)

The data d may be a continuous function of time and space, or it might

be a collection of discrete observations. Similarly, the model m might be a
continuous function of time and/or space, or it might simply be a collection
of discrete parameters. When m and d are themselves functions, we typically
refer to G as an operator, while G may be called a function when m and d
are vectors. The operator G can take on many forms. In some cases G is an
ordinary differential equation (ODE) or partial differential equation (PDE). In
other cases, G might be a linear or nonlinear system of algebraic equations.
Note that there is some confusion between mathematicians and other sci-
entists in the terminology that we have used. Applied mathematicians usually
refer to G(m) = d as the “mathematical model”, and m as the “parameters”,
while scientists often refer to m as the “model” and G as the forward opera-
tor. We will adopt the scientist’s convention of calling m the model and G the
forward operator.
The forward problem or direct problem is to find d given m. Computing
G(m) might involve solving an ODE, solving a PDE, or evaluating an integral.
Our focus will be on the inverse problem of obtaining an estimate of m
from d. A third problem that we will not consider in this book is the model
identification problem of determining G from several pairs of m and d.
In many cases, we will want to determine a finite number, n, of parameters
that define a model. The parameters may define a physical entity directly, or


may be coefficients or other constants in a functional relationship that describes
a physical process. We can express the parameters as a vector m. Similarly,
if there are only a finite number of data collected, then we can express the
data as a vector d. Such problems are called discrete inverse problems or
parameter estimation problems. A parameter estimation problem can be
written as a nonlinear system of equations

G(m) = d . (1.2)

In other cases, where m and d are functions of time and space, we cannot
simply express them as vectors. Such problems are called continuous in-
verse problems. For convenience, we will refer to discrete inverse problems as
parameter estimation problems, and reserve the phrase “inverse problem” for
continuous inverse problems.
A central theme of this book is that continuous inverse problems can of-
ten be approximated by parameter estimation problems with a large number of
parameters. This process of discretizing a continuous inverse problem is dis-
cussed in Chapter 3. By solving the discretized parameter estimation problem,
we can avoid some of the mathematical difficulties associated with continuous
inverse problems. However, the parameter estimation problems that arise
from discretization of continuous inverse problems still require sophisticated
solution techniques.
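As a concrete sketch of this discretization idea (the quadrature details are covered in Chapter 3), a continuous IFK can be turned into a matrix equation d = Gm by sampling the model on a grid. The kernel, grid size, and test model below are all invented for illustration:

```python
import numpy as np

# Discretize d(s) = ∫ g(s, x) m(x) dx on [0, 1] with the midpoint rule,
# giving the linear system d = G m.  Kernel and model are invented.
n = 50
dx = 1.0 / n
x = dx * (np.arange(n) + 0.5)        # midpoints of the model cells
s = x.copy()                         # collocation points for the data

def g(si, xj):
    """A smooth Gaussian kernel, chosen purely for illustration."""
    return np.exp(-(si - xj) ** 2 / (2.0 * 0.05 ** 2))

G = g(s[:, None], x[None, :]) * dx   # quadrature weight folded into G
m_true = np.sin(np.pi * x)           # a test model
d = G @ m_true                       # discrete forward problem
```

Each row of G holds the kernel sampled along the model grid, scaled by the quadrature weight dx, so a matrix-vector product approximates the integral.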
A class of mathematical models for which many useful results exist is that of
linear systems, which obey superposition

G(m1 + m2 ) = G(m1 ) + G(m2 ) (1.3)

and scaling
G(αm) = αG(m) . (1.4)
In the case of a discrete linear inverse problem, (1.2) can always be written
in the form of a linear system of algebraic equations.

G(m) = Gm = d . (1.5)

In a continuous linear system it turns out that G can always be expressed as a

linear integral operator. Equation (1.1) becomes
d(s) = ∫_a^b g(s, x) m(x) dx .    (1.6)

Here the function g(s, x) is called the kernel.

The linearity [(1.3), (1.4)] of (1.6) is easily seen because

∫_a^b g(s, x) (m1(x) + m2(x)) dx = ∫_a^b g(s, x) m1(x) dx + ∫_a^b g(s, x) m2(x) dx

and

∫_a^b g(s, x) αm(x) dx = α ∫_a^b g(s, x) m(x) dx .

Equations of the form of (1.6) where m(x) is the unknown are called Fredholm
integral equations of the first kind (IFK). IFKs arise in a surprisingly
large number of inverse problems. Unfortunately, they can also have
mathematical properties that make it difficult to obtain useful solutions.
In some cases, g(s, x) depends only on the difference s − x, and we have a
convolution equation

d(s) = ∫_{−∞}^{∞} g(s − x) m(x) dx .    (1.7)

The forward problem of determining d(s) from m(x) is called convolution.

The inverse problem of determining m(x) from d(s) is called deconvolution.
Methods for deconvolution are discussed in Chapter 8.
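A discrete sketch of the forward problem (1.7) shows why deconvolution is hard: convolving a spiky model with a smooth kernel smears the spike out. The grid, kernel width, and model below are invented for illustration:

```python
import numpy as np

# Forward convolution (1.7) in discrete form: a smooth kernel g applied
# to a spike model m.  Grid, kernel width, and model are invented.
dt = 0.01
t = np.arange(-1.0, 1.0, dt)
g = np.exp(-t ** 2 / (2.0 * 0.1 ** 2))   # smooth convolution kernel
m = np.zeros_like(t)
m[t.size // 2] = 1.0                     # a single spike in the model

d = np.convolve(m, g, mode="same") * dt  # data d = g * m

# d is a smoothed version of m: the spike is spread over many samples.
spread = int(np.sum(d > 0.5 * d.max()))
```

Recovering the narrow spike in m from the broad pulse in d is exactly the deconvolution problem taken up in Chapter 8.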
Another example of an IFK is the problem of inverting a Fourier transform
Φ(f) = ∫_{−∞}^{∞} e^{−ı2πfx} φ(x) dx

to obtain φ(x). Although there are many tables and analytic methods of ob-
taining Fourier transforms and their inverses, it is sometimes desirable to obtain
numerical estimates of φ(x), such as where there is no analytic inverse, or where
we wish to find φ(x) from spectral data at a discrete collection of frequencies.
It is an intriguing question why linearity appears in many interesting geo-
physical problems. Many physical systems encountered in practice are linear to
a very high degree because only small departures from equilibrium occur. An
important example from geophysics is that of seismic wave propagation, where
the stresses caused by the propagation of seismic disturbances are usually very
small relative to the magnitudes of the elastic moduli. This situation leads to
small strains and a very nearly linear stress-strain relationship. Because of this,
many seismic wave field problems obey superposition and scaling. Other physi-
cal systems such as gravity and magnetic fields at the strengths encountered in
geophysics have inherently linear physics.
Because many important inverse problems are linear, Chapters 2 through
8 of this book deal with methods for the solution of linear inverse problems.
We discuss methods for nonlinear parameter estimation and inverse problems
in Chapters 9 and 10.

1.2 Examples of Parameter Estimation Problems

Example 1.1
An archetypical example of a linear parameter estimation problem
is the fitting of a function to a data set via linear regression. In
an ancient example, the observed data, y, are altitudes of a ballistic
body observed at a set of times t (Figure 1.1), and we
wish to solve for a model, m, giving the initial altitude (m1 ), initial

Figure 1.1: A linear regression example.

vertical velocity (m2 ), and the effective gravitational acceleration

(m3 ) experienced by the body during its trajectory. This problem,
much studied over the centuries, is naturally of practical interest in
warfare, but is also of fundamental interest, for example, in abso-
lute gravity meters capable of estimating g from the acceleration of
a falling object in a vacuum to accuracies approaching 10−9 (e.g.,
[KPR+ 91]).
The forward model is a quadratic function in the t − y plane

y(t) = m1 + m2 t − (1/2)m3 t2 . (1.8)

The data will consist of a set of m observations of the height of the

body yi at time ti . Applying (1.8) to each of the observations, we
obtain a matrix equation which relates the observations, yi , to the
model parameters.

    [ 1   t1   −(1/2)t1² ]            [ y1 ]
    [ 1   t2   −(1/2)t2² ]  [ m1 ]    [ y2 ]
    [ 1   t3   −(1/2)t3² ]  [ m2 ] =  [ y3 ]  .    (1.9)
    [ .   .        .     ]  [ m3 ]    [ .  ]
    [ 1   tm   −(1/2)tm² ]            [ ym ]
Even though the functional relationship (1.8) is quadratic, the equa-
tions for the three parameters are linear, so this is a linear parameter
estimation problem.

Because there are more (known) data than (unknown) model parameters in
(1.9), and because our forward modeling of the physics may be approximate, the
m constraint equations in (1.9) will likely be inconsistent, so that it is impossible
to find a model m that satisfies every observation in y. In introductory
linear algebra, where an exact solution is expected, we might simply state that
no solution exists. However, useful solutions to such systems may be found by
finding model parameters which best satisfy the data set in some approximate
sense.
A traditional strategy for finding approximate solutions to an inconsistent
system of linear equations is to find the model that produces the smallest squared
misfit, or residual, between the observed data and the predictions of the forward
model, i.e., find the m that minimizes the Euclidean length (or the 2-norm)
of the residual

‖y − Gm‖₂ = ( Σ_{i=1}^{m} (y_i − (Gm)_i)² )^{1/2} .    (1.10)

However, (1.10) is not always the best misfit measure to use in solving parametric
systems. Another misfit measure that we may wish to optimize instead,
and which is decidedly better in many situations, is the 1-norm

‖y − Gm‖₁ = Σ_{i=1}^{m} |y_i − (Gm)_i| .    (1.11)

Methods for linear parameter estimation using both of these misfit measures
are discussed in Chapter 2.
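To make this concrete, here is a minimal sketch of the ballistic regression problem using synthetic data; the observation times, noise level, and "true" parameter values are invented, and the least squares machinery itself is developed in Chapter 2:

```python
import numpy as np

# Synthetic data for the quadratic ballistic model (1.8):
#   y(t) = m1 + m2*t - (1/2)*m3*t^2
# Observation times, noise level, and "true" parameters are invented.
rng = np.random.default_rng(0)
t = np.linspace(0.1, 2.0, 10)
m_true = np.array([10.0, 100.0, 9.8])                    # [y0, v0, g]
G = np.column_stack([np.ones_like(t), t, -0.5 * t**2])   # system matrix of (1.9)
y = G @ m_true + rng.normal(0.0, 0.5, t.size)            # noisy observations

# Least squares estimate minimizing the 2-norm misfit (1.10)
m_est, *_ = np.linalg.lstsq(G, y, rcond=None)

r = y - G @ m_est
misfit_2 = np.linalg.norm(r)        # 2-norm misfit (1.10)
misfit_1 = np.sum(np.abs(r))        # 1-norm misfit (1.11)
```

Note that the columns of G are built from nonlinear functions of t, yet the problem remains linear in the parameters m, which is what makes the one-line least squares solution possible.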

Example 1.2

A classic geophysical example of a nonlinear parameter estimation

problem is the determination of an earthquake hypocenter in space
and time (Figure 1.2), specified by the 4-vector

m = [x, τ ]T , (1.12)

where x is the three dimensional source location and τ is the time

of occurrence. The desired model is that which best fits a set of
observed seismic phase arrival times

d = [t1 , t2 , . . . , tm ]T (1.13)

from a network of m seismometers.

The forward operator is a nonlinear function that maps a trial hypocenter
(x, τ)0 into a vector of predicted arrival times, taking into account the network
geometry and the physics of seismic wave propagation:
Figure 1.2: The Earthquake location problem.

 

[ t1 ]
[ t2 ]
[ t3 ]
[ .  ]  =  G((x, τ)0; W) ,
[ .  ]
[ tm ]

where W is a matrix whose ith row, Wi,· , gives the spatial coordinates
of the ith station. G returns arrival times for the network
of seismic stations for a given source location and time by forward
modeling arrival times through some seismic velocity structure, v(x).
This problem is nonlinear even if the seismic velocity is a constant, v.
In this case, the ray paths in Figure 1.2 are straight. Furthermore,
the phase arrival time at station i located at position Wi is

ti = ‖Wi,· − x‖₂ / v + τ .    (1.14)
Because the forward model (1.14) is nonlinear there is no general way
to write the relationship between the data, the hypocenter model m,
and G as a linear system of equations.
Figure 1.3: The vertical seismic profiling problem.

Nonlinear parameter estimation problems are frequently solved by guessing
a starting solution (which is hopefully close to the true solution) and then
iteratively improving it until a good solution is (hopefully) obtained. We will
discuss methods for solving nonlinear parameter estimation problems in Chapter 9.
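A minimal sketch of the forward model (1.14) for the constant-velocity case, with an invented station geometry, also makes the nonlinearity explicit: scaling the model does not scale the predicted data, so no fixed matrix G can reproduce this forward operator:

```python
import numpy as np

# Forward model (1.14) for a constant velocity v: straight ray paths from
# the hypocenter to each station.  Station coordinates, v, and the source
# are all invented values.
def arrival_times(x, tau, W, v):
    """Predicted arrival time at each station for hypocenter (x, tau)."""
    return np.linalg.norm(W - x, axis=1) / v + tau

W = np.array([[0.0, 0.0, 0.0],
              [10.0, 0.0, 0.0],
              [0.0, 10.0, 0.0],
              [10.0, 10.0, 0.0]])      # 4 stations on the surface (km)
v = 5.0                                # constant velocity (km/s)
x_true = np.array([4.0, 6.0, 8.0])     # source location (km)
tau_true = 0.7                         # origin time (s)

d = arrival_times(x_true, tau_true, W, v)

# Scaling (1.4) fails: doubling the model does not double the data.
d_scaled = arrival_times(2.0 * x_true, 2.0 * tau_true, W, v)
nonlinear = not np.allclose(d_scaled, 2.0 * d)
```

An iterative solver of the kind described above would repeatedly evaluate `arrival_times` at trial hypocenters and adjust (x, τ) to reduce the misfit with the observed times.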

1.3 Examples of Inverse Problems

Example 1.3
In vertical seismic profiling we wish to know the vertical seis-
mic velocity of the Earth surrounding a borehole. A downward-
propagating seismic disturbance is generated at the surface by a
source and seismic waves are sensed by a string of seismometers in
the borehole (Figure 1.3).
The arrival time of the wavefront at each instrument is measured
from the recorded seismograms. As in the tomography problem (Example 1.6),

we can linearize this problem by parameterizing it using a slowness

(reciprocal velocity) model. The observed travel time at depth y is
then the definite integral of the vertical slowness, s, from the surface
to y

t(y) = ∫_0^y s(z) dz = ∫_0^∞ s(z) H(y − z) dz    (1.15)

where the kernel function H is a step function equal to one when

its argument is nonnegative and zero when its argument is negative.
The form (y − z) of the kernel argument shows that (1.15) is a convolution.
In theory, we can solve this problem quite easily. By the fundamental
theorem of calculus,
t′(y) = s(y) .    (1.16)
However, in practice we only have noisy observations of t(y). Differ-
entiating these observations is an extremely unstable process.
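This instability is easy to demonstrate numerically. In the sketch below (depths, slowness, and noise level all invented), timing noise that is tiny relative to the travel times produces slowness errors comparable to the signal itself once we difference the data:

```python
import numpy as np

# Recovering slowness s(y) = t'(y) by differencing noisy travel times.
# Depths, slowness, and noise level are invented for illustration.
rng = np.random.default_rng(1)
dy = 1.0
y = np.arange(0.0, 100.0, dy)               # depths (m)
s_true = 0.001 * np.ones_like(y)            # uniform slowness (s/m)
t_clean = np.cumsum(s_true) * dy            # travel times via (1.15)
t_noisy = t_clean + rng.normal(0.0, 1e-4, y.size)   # tiny timing noise

s_clean = np.diff(t_clean) / dy             # finite-difference derivative
s_noisy = np.diff(t_noisy) / dy

# Relative error in the data versus relative error in the derived model:
rel_data = np.std(t_noisy - t_clean) / np.mean(t_clean)
rel_model = np.std(s_noisy - s_clean) / np.mean(s_true)
```

The relative error in the differenced slowness estimate is orders of magnitude larger than the relative error in the travel-time data, which is the instability the text describes.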

Example 1.4

Another instructive example of a linear continuous inverse problem

is to invert a vertical gravitational anomaly, d(x) observed at some
height, h, to find an unknown line mass density distribution, m(x) =
∆ρ(x) (Figure 1.4). This problem can be written as an IFK, as the
data are just a superposition of the vertical gravity contributions
from each differential element of the line mass
d(s) = ∫_{−∞}^{∞} [ Γ / ((x − s)² + h²)^{3/2} ] m(x) dx ,

where Γ is the gravitational constant. Note that the kernel has the
form g(x−s). Because the kernel depends only on x−s, this forward
problem is also an example of convolution. Because g(x) is a smooth
function, d(s) will be a smoothed version of m(x), and inverse solu-
tions for m(x) will consequently be some roughened transformation
of d(x).

Example 1.5

Suppose we were to reinterpret the observations in the example

above in the form of a model where the gravity anomaly is at-
tributable to a variation in depth, m(x) = h(x), of a fixed line
density perturbation, ∆ρ, rather than its density contrast (Figure

Figure 1.4: A simple linear inverse problem; find ∆ρ(x) from gravity anomaly
observations, d(x).

Figure 1.5: A simple nonlinear inverse problem; find h(x) from gravity anomaly
observations, d(x).

1.5). In this case, the data are still, of course, just given by the su-
perposition of the contributions to the gravitational anomaly field,
but the forward problem is now
d(s) = ∫_{−∞}^{∞} [ Γ ∆ρ / ((x − s)² + m²(x))^{3/2} ] dx .

This integral equation is nonlinear, as the relationship between model

and data does not follow the superposition and scaling rules (1.3)
and (1.4).
Nonlinear inverse problems are generally considerably more diffi-
cult to solve than linear ones. In some cases, they may be solvable
by coordinate transformations which linearize the problem or other
clever special-case methods. In other cases the problem cannot be
linearized and computer intensive optimization techniques must be
used to solve the problem. The essential differences arise because all
linear problems can be generalized to be the “same” in some sense.
Specifically, the data can be expressed as an inner product between
the kernel function and the model [Par94], but each nonlinear prob-
lem is nonlinear in its own way.

Example 1.6

An important and instructive inverse problem is tomography, from
the Greek root tomos, or cut. Tomography is the general technique
of determining a model from ray-path integrated properties such as
attenuation (X-ray, radar, seismic), travel time (seismic or acoustic),
and/or source intensity (positron emission). Tomography has many
applications in medicine, acoustics and earth science. An important
example in earth science is cross–well seismic or radar tomography,
where the source excitation is in one bore hole, and the signals are
received by sensors in another bore hole. Another important exam-
ple is joint earthquake/velocity inversion on scales ranging from a
few cubic kilometers to global, where the locations of earthquake
sources and the velocity parameters of the volume sampled by the
associated seismic ray paths are estimated [Thu92].

The physical model for tomography in its most basic form assumes
that geometric ray theory (essentially a high frequency limit to the
wave equation) is valid for the problem under consideration, so that
we can consider electromagnetic or elastic energy to be propagating
along ray paths. The density of ray path coverage in a tomographic
problem may vary significantly throughout the model and thus pro-
vide much better information on the physical properties of interest
in regions of denser ray coverage while providing little or no infor-
mation in other regions.

Rather than using velocity, it is often simpler in tomography to

work with slowness, which is simply the reciprocal of velocity. If
the slowness at a point x is s(x), and we know a ray path l, then
the travel time along that ray path is given by a line integral

t = ∫_l s(x) dl .

In general, the ray paths l can bend due to refraction effects. How-
ever, if refraction effects are negligible, then it may be feasible to
approximate the ray paths with straight lines.

A common (but not exclusive) way of discretizing the unknown

structure that we wish to image with tomography is as uniform
blocks. If we assume straight rays, then we obtain a linear model.
The elements of the matrix G are just the lengths of the intersections
of the ray paths with the individual blocks. Consider the simple
example shown in Figure 1.6, where 9 homogeneous blocks with sides
of unit length and unknown slowness are constrained by 8 travel-time


Figure 1.6: A simple tomography example.

measurements, ti . The linear system is

 
        [ 1   0   0   1   0   0   1   0   0  ] [ s11 ]   [ t1 ]
        [ 0   1   0   0   1   0   0   1   0  ] [ s12 ]   [ t2 ]
        [ 0   0   1   0   0   1   0   0   1  ] [ s13 ]   [ t3 ]
        [ 1   1   1   0   0   0   0   0   0  ] [ s21 ]   [ t4 ]
Gm  =   [ 0   0   0   1   1   1   0   0   0  ] [ s22 ] = [ t5 ] .
        [ 0   0   0   0   0   0   1   1   1  ] [ s23 ]   [ t6 ]
        [ √2  0   0   0   √2  0   0   0   √2 ] [ s31 ]   [ t7 ]
        [ 0   0   0   0   0   0   0   0   √2 ] [ s32 ]   [ t8 ]
                                               [ s33 ]
Because there are 9 unknown parameters, sij , that we would ideally
like to estimate, but only 8 equations, this system is clearly rank
deficient. In fact, rank(G) is only 7. Also, we can easily see that
there is redundant information in this system; the slowness s33 is
completely determined by t8 , but is also measured by observations
t3 , t6 , and t7 . We shall see that, despite these complicating factors,
discretized versions of tomography and other inverse problems can
nonetheless be usefully solved.
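The claimed rank can be verified numerically. The sketch below assembles the 8 × 9 system matrix of this example and checks that rank(G) = 7:

```python
import numpy as np

# The 8 x 9 system matrix of Figure 1.6: rows are ray paths (t1..t8),
# columns are the block slownesses ordered s11, s12, s13, s21, ..., s33.
r2 = np.sqrt(2.0)
G = np.array([
    [1,  0,  0,  1,  0,  0,  1,  0,  0],   # t1: ray through s11, s21, s31
    [0,  1,  0,  0,  1,  0,  0,  1,  0],   # t2: ray through s12, s22, s32
    [0,  0,  1,  0,  0,  1,  0,  0,  1],   # t3: ray through s13, s23, s33
    [1,  1,  1,  0,  0,  0,  0,  0,  0],   # t4: ray through s11, s12, s13
    [0,  0,  0,  1,  1,  1,  0,  0,  0],   # t5: ray through s21, s22, s23
    [0,  0,  0,  0,  0,  0,  1,  1,  1],   # t6: ray through s31, s32, s33
    [r2, 0,  0,  0,  r2, 0,  0,  0,  r2],  # t7: diagonal ray, length √2 per block
    [0,  0,  0,  0,  0,  0,  0,  0,  r2],  # t8: corner ray through s33 only
])

rank = np.linalg.matrix_rank(G)   # 7, as stated in the text
```

With rank 7 and 9 unknowns, the matrix has a two-dimensional null space: any null space model can be added to a solution without changing the predicted travel times.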

1.4 Why Inverse Problems Are Hard

Scientists and engineers must be interested in far more than simply finding
mathematically acceptable answers to inverse problems. There may be many
models that adequately fit the data. It is essential to characterize just what
solution has been obtained, how "good" it is in terms of physical plausibility and
fit to the data, and perhaps how consistent it is with other constraints. Some
important issues that must be considered include solution existence, solution
uniqueness, and instability of the solution process.

1. Existence. There may be no model that exactly fits a given data set if our
mathematical model of the system's physics is approximate and/or the
data contains noise.

2. Uniqueness. If exact solutions do exist, they may not be unique, even for
an infinite number of exact data. This can easily occur in potential field
problems. A trivial example is the case of the external gravitational field
from a spherically-symmetric mass distribution of unknown density, where
we can only determine the total mass, not the dimensions and density of
the body. Non-uniqueness is also a characteristic of rank deficient linear
problems because the matrix G can have a nontrivial null space.
One manifestation of this is that a model may fit a finite number of data
points acceptably, but predict wild and physically unrealistic values where
we have no observations. In linear parameter estimation problems, models
that lie in the null space of G, m0 , are solutions to Gm0 = 0, and hence fit
d = 0 in the forward problem exactly. These null space models can be
added to any particular model that satisfies (1.5) without changing the fit
to the data. The result is an infinite number of mathematically acceptable models.
In practical terms, suppose that there exists a nonzero model m which
results in an instrument reading of zero. One cannot discriminate this
situation from the situation where m is really zero.
An important and thorny issue with problems that have nonunique solu-
tions is that an estimated model may be significantly smoothed relative to
the true situation. Characterizing this smoothing is essential to interpret-
ing models in terms of their likely correspondence to reality. This analysis
falls under the general topic of model resolution analysis.

3. Instability. The process of computing an inverse solution is often ex-

tremely unstable. That is, a small change in measurement can produce
an enormous change in the implied model. Systems where small features
of the data can drive large changes in inferred models are commonly re-
ferred to as ill-posed (in the case of continuous systems), or discrete
ill-posed or ill-conditioned (in the case of discrete linear systems).
It is frequently possible to stabilize the inversion process by adding addi-
tional constraints. This process is generally referred to as regularization.
Regularization is frequently necessary to produce a useful solution to an
otherwise intractable ill-posed or ill-conditioned inverse problem at the
cost of limiting us to a smoothed model.
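The need for regularization can be previewed numerically. In this sketch a smooth (invented) Gaussian kernel is discretized, and a naive solve amplifies a tiny data perturbation into a large model perturbation:

```python
import numpy as np

# Discretize a smooth (invented) Gaussian kernel on [0, 1] and examine
# the conditioning of the resulting matrix.
n = 20
x = (np.arange(n) + 0.5) / n
G = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2.0 * 0.2 ** 2)) / n

cond = np.linalg.cond(G)               # enormous: a discrete ill-posed problem

# A tiny data perturbation becomes a huge model perturbation.
m_true = np.ones(n)
d = G @ m_true
d_noisy = d + 1e-6 * np.random.default_rng(2).standard_normal(n)
m_naive = np.linalg.solve(G, d_noisy)  # naive, unregularized inversion
amplification = np.linalg.norm(m_naive - m_true) / np.linalg.norm(m_true)
```

Even though the data were perturbed at the one-part-per-million level, the naive solution bears little resemblance to the true model; regularization methods such as those of Chapter 5 trade this wild amplification for a controlled amount of smoothing.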

To obtain some mathematical intuition about these issues, let us consider

some simple examples where each of them arises in systems characterized by an
IFK

∫_0^1 g(s, x) m(x) dx = y(s) .

First, consider the trivial case of

g(s, x) = 1

which gives the simple integral equation

∫_0^1 m(x) dx = y(s) .    (1.17)

Clearly, since the left hand side of (1.17) is independent of s, this system has
no solution unless y(s) is a constant. Thus, there are an infinite number of
mathematically conceivable data sets y(s), which are not constant and where
no exact solution exists. This is an existence issue; there are data sets for which
there are no exact solutions.
Conversely, where a solution to (1.17) does exist, the solution is nonunique,
since there are an infinite number of functions which integrate over the unit
interval to produce the same constant. This is a uniqueness issue; there are an
infinite number of models that satisfy the IFK exactly.
A more subtle example of nonuniqueness is shown by the equation
\[
\int_0^1 g(s, x)\, m(x)\, dx = \int_0^1 s \sin(\pi x)\, m(x)\, dx = y(s) . \qquad (1.18)
\]
A general relation illustrating the orthogonality of harmonic functions is
\[
\int_0^1 \sin(n\pi x)\, \sin(m\pi x)\, dx = -\frac{1}{2} \int_0^1 \left[ \cos(\pi(n+m)x) - \cos(\pi(n-m)x) \right] dx \qquad (1.19)
\]
\[
= -\frac{1}{2\pi} \left[ \frac{\sin(\pi(n+m))}{n+m} - \frac{\sin(\pi(n-m))}{n-m} \right] = 0 \qquad (n \neq \pm m;\; n, m \neq 0) .
\]
Thus, in (1.18) we have
\[
\int_0^1 g(s, x)\, \sin(n\pi x)\, dx = 0
\]
for an infinite number of models of the form sin(nπx), where n = ±2, ±3, . . . .

Furthermore, because (1.18) is a linear system, we can add any function of the form
\[
m_0(x) = \alpha_n \sin(n\pi x)
\]
to any model solution, m(x), and obtain a new model that fits the data just as well because
\[
\int_0^1 s \sin(\pi x)\, (m(x) + m_0(x))\, dx = \int_0^1 s \sin(\pi x)\, m(x)\, dx + \int_0^1 s \sin(\pi x)\, m_0(x)\, dx
\]
\[
= \int_0^1 s \sin(\pi x)\, m(x)\, dx + 0 .
\]

Equation (1.18) thus allows an infinite range of very different solutions which
fit the data equally well.
Even if we do not encounter existence or uniqueness issues, it can be shown that instability is a fundamental feature of IFK's, because
\[
\lim_{n \to \infty} \int_{-\infty}^{\infty} g(s, t)\, \sin(n\pi t)\, dt = 0 \qquad (1.20)
\]
for all square-integrable functions g(s, t), where
\[
\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(s, t)^2\, ds\, dt < \infty .
\]
This result is known as the Riemann–Lebesgue lemma [Rud87].

Without proving (1.20) rigorously, we can understand why this damping
out of oscillatory functions occurs. Model “wiggliness” is smoothed by the
integration with the kernel, g(s, t). Thus a significant perturbation to the
model of the form sin(nπt) leads, for large n, to an insignificant change in y(s).
The inverse problem has the situation reversed; an inferred model can be very
sensitive to changes in the data, including random noise components that have
nothing to do with the physical system that we are trying to learn about.
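This damping can be checked numerically. The sketch below (assuming an arbitrary smooth kernel g(t) = e^(−t²) on [0, 1] and a simple trapezoidal quadrature; both choices are illustrative, not from the text) shows the integral of g against sin(nπt) shrinking as n grows:

```python
import numpy as np

# Numerical sketch of the Riemann-Lebesgue damping: the integral of a smooth
# kernel against sin(n*pi*t) shrinks as n grows. The kernel g(t) = exp(-t^2)
# on [0, 1] is an arbitrary illustrative choice.
t = np.linspace(0.0, 1.0, 20001)
g = np.exp(-t**2)
dt = t[1] - t[0]

vals = []
for n in (1, 4, 16, 64):
    integrand = g * np.sin(n * np.pi * t)
    # trapezoidal rule approximation of the integral over [0, 1]
    vals.append(abs(float(np.sum((integrand[:-1] + integrand[1:]) / 2.0) * dt)))

print(vals)   # magnitudes decrease toward zero as n increases
```

Integration by parts shows the leading term falls off like 1/(nπ), consistent with the printed magnitudes.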
This property of IFK’s is similar to the problem of solving linear systems of
equations where the condition number of the matrix is very large. That is, the
matrix is nearly singular. In both cases, the difficulty lies in the nature of the
problem, not in the algorithm used to solve the problem. Ill-posed behavior is a
fundamental feature of many inverse problems because of the general smoothing
that occurs in most forward problems (and the corresponding roughening that
must occur in the solution of the inverse problem); understanding and dealing
with it are fundamental issues in solving many problems in the physical sciences.

1.5 Exercises
1. Consider a mathematical model (1.1) of the form G(m) = d, where m is
a vector of size n, and d is a vector of size m. Suppose that the model
obeys the superposition and scaling laws and is thus linear. Show that
G(m) can be written in the form

G(m) = Γm

where Γ is an m by n matrix. What are the elements of Γ? Hint: Consider

the standard basis, and write m as a linear combination of the vectors in
the standard basis. Apply the superposition and scaling laws. Finally,
recall the definition of matrix–vector multiplication.
2. Can (1.9) be inconsistent, even with only m = 3 data points? How about just m = 2 data points? What if the ti are distinct? If the system can be inconsistent, give an example; if not, explain why not.
3. Find a journal article which discusses the solution of an inverse problem in your discipline. What are the data? Are the data discrete or continuous? Have the authors discussed possible sources of noise in the data? What is the model? Is the model continuous or discrete? What physical laws determine the forward operator G? Is G linear or nonlinear? Discuss any issues associated with existence, uniqueness, or instability of solutions.

1.6 Notes and Further Reading

Some important references on inverse problems in geophysics and remote sens-
ing include [Car87, Lin88, Par94, Two96]. Many examples of ill–posed prob-
lems and their solution can be found in the book edited by Tikhonov and
Goncharsky [TG87]. More mathematically oriented references include [Bau87,
Gro93, Han98, Kir96, LH95, Mor84, Men89, TA77, Tar87].
Tomography, and in particular medical imaging applications of tomography,
is a very large field. Some references on tomography include [Her80, KS01,
LL00, Nat01, Nol87].

Chapter 2

Linear Regression

2.1 Introduction to Linear Regression

Consider a discrete linear inverse problem. We begin with a data vector, d,
of m observations, and a vector m of n model parameters that we wish to
determine. As we have seen, the inverse problem can be written as a linear
system of equations
Gm = d . (2.1)

If rank(G) = n, the system of equations is of full rank. In this chapter we

will assume that there are more observations than model parameters, and thus
more linear equations than unknowns. We will also assume that the system of
equations is of full rank. In Chapter 4 we will consider rank deficient problems.
For a full rank system with more equations than unknowns, it is frequently
the case that no solution m satisfies all of the equations in (2.1) exactly. This
happens because the dimension of the range of G is smaller than m and a noisy
data vector can easily lie outside of the range of G.
However, a useful approximate solution may still be found by finding a particular model m that minimizes some measure of the misfit between the actual data and the forward model predictions using m. The residual is given by
\[
r = d - Gm . \qquad (2.2)
\]

One commonly used measure of the misfit is the 2–norm of the residual. The model that minimizes the 2–norm of the residual is called the least squares solution. The least squares or 2–norm solution is of special interest both because it is very amenable to analysis and geometric intuition, and because it turns out to be statistically the most likely solution if data errors are normally distributed.
A typical example of a linear parameter estimation problem is linear regression to a line, where we seek a model of the form y = m1 + m2 x which best fits a set of m > 2 data points. We seek a minimum 2–norm solution for
   
\[
Gm = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_m \end{bmatrix}
\begin{bmatrix} m_1 \\ m_2 \end{bmatrix}
= \begin{bmatrix} d_1 \\ d_2 \\ \vdots \\ d_m \end{bmatrix} = d . \qquad (2.3)
\]

The 2–norm solution for m is, from the normal equations (A.7),

mL2 = (GT G)−1 GT d . (2.4)

It can be shown that if G is of full rank then (GT G)−1 always exists.
Equation (2.4) is a general 2–norm minimizing solution for any full-rank linear parameter estimation problem. Returning to our example of fitting a straight line to m data points, we can now write a closed–form solution for this particular problem as
\[
m_{L2} = (G^T G)^{-1} G^T d \qquad (2.5)
\]
\[
= \left( \begin{bmatrix} 1 & \ldots & 1 \\ x_1 & \ldots & x_m \end{bmatrix}
\begin{bmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_m \end{bmatrix} \right)^{-1}
\begin{bmatrix} 1 & \ldots & 1 \\ x_1 & \ldots & x_m \end{bmatrix}
\begin{bmatrix} d_1 \\ \vdots \\ d_m \end{bmatrix}
\]
\[
= \begin{bmatrix} m & \sum_{i=1}^m x_i \\ \sum_{i=1}^m x_i & \sum_{i=1}^m x_i^2 \end{bmatrix}^{-1}
\begin{bmatrix} \sum_{i=1}^m d_i \\ \sum_{i=1}^m x_i d_i \end{bmatrix}
\]
\[
= \frac{1}{m \sum_{i=1}^m x_i^2 - \left( \sum_{i=1}^m x_i \right)^2}
\begin{bmatrix} \sum_{i=1}^m x_i^2 & -\sum_{i=1}^m x_i \\ -\sum_{i=1}^m x_i & m \end{bmatrix}
\begin{bmatrix} \sum_{i=1}^m d_i \\ \sum_{i=1}^m x_i d_i \end{bmatrix} .
\]
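As a sanity check on this closed form, the following sketch (with made-up data points) evaluates the summation formula directly and compares it against the general normal-equations solution (2.4):

```python
import numpy as np

# Sketch checking the closed-form line-fit solution (2.5) against the general
# normal-equations solution (2.4); the data points are made up.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
d = np.array([2.1, 3.9, 6.2, 8.0, 9.8])   # roughly d = 2x plus noise

m_obs = len(x)
sx, sx2 = x.sum(), (x**2).sum()
sd, sxd = d.sum(), (x * d).sum()

# Explicit 2x2 inverse from the last line of (2.5).
det = m_obs * sx2 - sx**2
m1 = (sx2 * sd - sx * sxd) / det     # intercept
m2 = (-sx * sd + m_obs * sxd) / det  # slope

# General solution (2.4) via the normal equations.
G = np.column_stack([np.ones(m_obs), x])
mL2 = np.linalg.solve(G.T @ G, G.T @ d)
print(m1, m2, mL2)
```

The two routes agree to machine precision, as they must for any full-rank G.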

2.2 Statistical Aspects of Least Squares

If we consider our data points not to be absolutely perfect measurements, but rather to possess random errors, then we have the problem of finding the solution which is best from a statistical point of view. One statistical approach, maximum likelihood estimation, asks the question: given that we observed a certain data set, that we know the statistical characteristics of our observations, and that we have a forward model for the problem, what is the model from which these observations would most likely arise?
Maximum likelihood estimation is a general method which can be used for
any estimation problem where we can write down a joint probability density
function for the observations. The essential problem is to find the most likely model, as parameterized by the n elements in the vector m, for a set of m observations in the vector d. We will assume that the observations are statistically independent so that we can use a product form joint density function.
Given a model m, we have a probability density function fi(di, m) for the ith observation. In general, this probability distribution will vary, depending on m. The joint probability density for a vector of observations d will then be
\[
f(d, m) = f_1(d_1, m) \cdot f_2(d_2, m) \cdots f_m(d_m, m) = \prod_{i=1}^{m} f_i(d_i, m) . \qquad (2.6)
\]

Note that the values of f are probability densities not probabilities. We can
only compute the probability of getting data in some range given a model m
by computing an integral of f (d, m) over that range. In fact, the probability of
getting any particular data set is precisely 0!
We can get around this problem by considering the probability of getting
a data set in a small box around a particular data set d. This probability
is approximately proportional to the probability density f (d, m). For many
possible models m, the probability density will be quite small. Such models
would be relatively unlikely to result in the data d. For other models, the
probability density might be much larger. These models would be relatively
likely to result in the data d.
For this reason, (2.6) is called the likelihood function. According to the
maximum likelihood principle we should select the model m that maximizes
the likelihood function. Estimates of the model obtained by following the max-
imum likelihood principle have many desirable properties [CB02, DS98]. It is
particularly interesting that when we have a discrete linear inverse problem and
the data errors are independent and normally distributed, then the maximum
likelihood principle leads us to the least squares solution.
Suppose that our data have independent random errors which are normally distributed with expected value zero. Let the standard deviation of the ith observation be σi. The likelihood for the ith observation is
\[
f_i(d_i, m) = \frac{1}{(2\pi)^{1/2} \sigma_i} e^{-(d_i - (Gm)_i)^2 / 2\sigma_i^2} . \qquad (2.7)
\]
Considering the entire data set, the likelihood function is the product of the individual likelihoods
\[
f(d, m) = \frac{1}{(2\pi)^{m/2} \sigma_1 \cdots \sigma_m} \prod_{i=1}^{m} e^{-(d_i - (Gm)_i)^2 / 2\sigma_i^2} . \qquad (2.8)
\]
The constant factor does not affect the maximization of f, so we can solve the maximization problem
\[
\max \prod_{i=1}^{m} e^{-(d_i - (Gm)_i)^2 / 2\sigma_i^2} . \qquad (2.9)
\]

Since the logarithm is a monotonically increasing function, we can instead solve the maximization problem
\[
\max \log \prod_{i=1}^{m} e^{-(d_i - (Gm)_i)^2 / 2\sigma_i^2} . \qquad (2.10)
\]

Taking the logarithm, this simplifies to
\[
\max \; -\sum_{i=1}^{m} \frac{(d_i - (Gm)_i)^2}{2\sigma_i^2} . \qquad (2.11)
\]
We can turn maximization into minimization by changing − to + and max to min. Also, it will be convenient to eliminate a constant factor of 1/2. Our problem becomes
\[
\min \sum_{i=1}^{m} \frac{(d_i - (Gm)_i)^2}{\sigma_i^2} . \qquad (2.12)
\]

Aside from the constant factors of 1/σi2 , this is the least squares problem for
Gm = d.
To incorporate the standard deviations, we scale the system of equations. Let
\[
W = \mathrm{diag}(1/\sigma_1, 1/\sigma_2, \ldots, 1/\sigma_m) . \qquad (2.13)
\]
Then let
\[
G_w = W G \qquad (2.14)
\]
\[
d_w = W d . \qquad (2.15)
\]
The weighted system of equations is

Gw m = dw . (2.16)

The least squares solution to the weighted problem is
\[
m_w = (G_w^T G_w)^{-1} G_w^T d_w , \qquad (2.17)
\]
with
\[
\| d_w - G_w m_w \|_2^2 = \sum_{i=1}^{m} \frac{(d_i - (G m_w)_i)^2}{\sigma_i^2} . \qquad (2.18)
\]
Thus the least squares solution to Gw m = dw is the maximum likelihood solution.
The sum of squares also provides important statistical information about the quality of our estimate of the model. Let
\[
\chi^2_{obs} = \sum_{i=1}^{m} \frac{(d_i - (G m_w)_i)^2}{\sigma_i^2} . \qquad (2.19)
\]

Since χ2obs depends on the random measurement errors, it is a random variable.

It can be shown that under our assumptions, χ2obs has a χ2 distribution with
ν = m − n degrees of freedom [CB02, DS98].
Recall that the pdf of the χ2 distribution is given by
\[
f_{\chi^2}(x) = \frac{1}{2^{\nu/2}\, \Gamma(\nu/2)}\, x^{\frac{\nu}{2} - 1} e^{-x/2} . \qquad (2.20)
\]

Figure 2.1: χ2 distributions for different degrees of freedom.

See Figure 2.1.

The χ2 test is a statistical test of the assumptions that we used in finding
the least squares solution. In this test, we compute χ2obs and compare it to the
theoretical χ2 distribution with ν = m − n degrees of freedom. The probability
of obtaining a χ2 value as large or larger than the observed value is
\[
p = \int_{\chi^2_{obs}}^{\infty} f_{\chi^2}(x)\, dx . \qquad (2.21)
\]
This value is called the p–value of the test.
There are three general possibilities for p:

1. The p–value is not too small and not too large. Our least squares solution produces an acceptable data fit and our assumptions are consistent with the data. In general use, p does not actually have to be very large to be deemed marginally “acceptable” in many cases (e.g., p ≈ 10−2 ), as truly “wrong” models (see below) will typically produce extremely small p–values (e.g., 10−12 ) because of the short-tailed nature of the normal distribution.
2. The p–value is very small. We are faced with three nonexclusive possibil-
ities, but something is clearly wrong.

(a) The data truly represent an extremely unlikely realization. This is easy to rule out for p–values very close to zero. For example, if an experiment produced a data realization where the probability of a worse fit was 10−9, then, if the model and forward problem are correct, we would have to perform on the order of a billion experiments to expect a worse fit to the data. It is far more likely that something else is wrong.
(b) The mathematical model Gm = d is simply incorrect. Most often
this happens because we have left some important physics out of our
mathematical model.
(c) The data errors are underestimated (e.g., the values of σi are too small), or the errors are appreciably non-normally distributed and are better characterized by a longer-tailed distribution than the normal distribution.

3. The p–value is very large (very close to one). The fit of the forward model
to the data is almost exact. We should investigate the possibility that we
have overestimated the data errors. A more sinister possibility is that a
very high p–value is indicative of data fraud, such as might happen if data
were cooked-up ahead of time to fit a particular model.
A rule of thumb for problems with a large number of degrees of freedom is that the expected value of χ2 approaches ν. This arises because, as the central limit theorem predicts, the χ2 random variable itself becomes normally distributed with mean ν and standard deviation (2ν)1/2 for large ν.
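The tail integral (2.21) can be evaluated numerically. The sketch below does so with a simple trapezoidal rule; the values ν = 7 and χ²obs = 4.2 anticipate the ballistics regression example later in this chapter:

```python
import math
import numpy as np

# Numerical sketch of the p-value integral (2.21) for a chi^2 distribution,
# using a simple trapezoidal rule.
nu = 7
chi2_obs = 4.2

# chi^2 pdf (2.20) evaluated on a grid from chi2_obs out to where the tail
# is negligible.
x = np.linspace(chi2_obs, 120.0, 200001)
pdf = x**(nu / 2 - 1) * np.exp(-x / 2) / (2**(nu / 2) * math.gamma(nu / 2))
dx = x[1] - x[0]
p = float(np.sum((pdf[:-1] + pdf[1:]) / 2.0) * dx)
print(p)   # approximately 0.76
```

In practice a library routine for the χ² survival function would be used; the direct quadrature here simply makes the definition concrete.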
In addition to examining the χ2obs statistic, it is always a good idea to examine the residuals rw = dw − Gw mw . The residuals should be roughly normally distributed with standard deviation one and no obvious patterns. In
some cases where an incorrect model has been fitted to the data, the residuals
will reveal the nature of the modeling error. For example, in a linear regression
to a straight line, it might be that all of the residuals are negative for small
values of the independent variable t and then positive for larger values of t.
This would indicate that perhaps an additional quadratic term is needed in the
regression model.
If we have fit two alternative models to the data, it can be important to compare the models to determine whether or not one model fits the data significantly better than the other model. The F distribution (Appendix B) can be used to compare the fit of two models to assess whether or not a statistically significant residual reduction has occurred [DS98].
The respective χ2obs values of the two models have χ2 distributions with their respective degrees of freedom, ν1 and ν2. It can be shown that under our statistical assumptions, the ratio
\[
R = \frac{\chi_1^2 / \nu_1}{\chi_2^2 / \nu_2} = \frac{\chi_1^2\, \nu_2}{\chi_2^2\, \nu_1} \qquad (2.22)
\]
will have an F distribution with parameters ν1 and ν2 [DS98].
If the observed value of the ratio, Robs, is extremely large or small, this indicates that one of the models fits the data significantly better than the other model. If Robs is greater than the 95th percentile of the F distribution with parameters ν1 and ν2, then we can conclude that the second model has a statistically significant improvement in data fit over the first model.
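A minimal sketch of this comparison follows. The χ² values and degrees of freedom are hypothetical, and the 95th percentile of the F distribution is approximated by Monte Carlo simulation rather than taken from tables:

```python
import numpy as np

# Sketch of the F test (2.22) for comparing two fits. The chi^2 values and
# degrees of freedom below are hypothetical.
rng = np.random.default_rng(0)
nu1, nu2 = 7, 5
chi2_1, chi2_2 = 21.0, 4.0          # hypothetical observed misfits

R_obs = (chi2_1 / nu1) / (chi2_2 / nu2)

# An F(nu1, nu2) variable is a ratio of independent chi^2 variables, each
# divided by its degrees of freedom; sample it to estimate the percentile.
samples = (rng.chisquare(nu1, 200000) / nu1) / (rng.chisquare(nu2, 200000) / nu2)
F95 = float(np.quantile(samples, 0.95))

# R_obs exceeding F95 would indicate a statistically significant difference.
print(R_obs, F95)
```

With these made-up numbers the test is inconclusive at the 95% level; a library F-distribution routine would replace the sampling step in real use.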
Parameter estimates in linear regression are constructed of linear combina-
tions of independent data with normal errors. Because a linear combination of
normally distributed random variables is normally distributed (Appendix B),
the model parameters will also be normally distributed. Note that, even though
the data errors are uncorrelated, the covariance matrix of the model parameters
will generally have nonzero off-diagonal terms.
To derive the mapping between data and model covariances, consider the covariance of an m-length data vector, d, of normally distributed, independent random variables operated on by a general linear transformation specified by an n by m matrix, A. From Appendix B, we know that
\[
\mathrm{Cov}(A d) = A\, \mathrm{Cov}(d)\, A^T . \qquad (2.23)
\]

The least squares solution (2.17) has A = (GTw Gw )−1 GTw . Since the weighted data have an identity covariance matrix, the covariance matrix for the model parameters is
\[
C = (G_w^T G_w)^{-1} G_w^T I_m G_w (G_w^T G_w)^{-1} = (G_w^T G_w)^{-1} . \qquad (2.24)
\]
In the case of equal and uncorrelated normal data errors, so that the data
covariance matrix Cov(d) is simply the variance σ 2 times the m by m identity
matrix, Im , (2.24) simplifies to

C = σ 2 (GT G)−1 . (2.25)

The covariance matrix of the model parameters is typically not a diago-

nal matrix, indicating that the model parameters are correlated. Because the
components of mL2 are each constructed from linear combinations of the com-
ponents in the data vector, statistical dependence between the elements of m
should not be surprising.
The expected value of the least squares solution is
\[
E[m_{L2}] = (G^T G)^{-1} G^T E[d] .
\]
Because E[d] = dtrue, and Gmtrue = dtrue,
\[
G^T G m_{true} = G^T d_{true} ,
\]
so
\[
E[m_{L2}] = m_{true} .
\]
In statistical terms, the least squares solution is said to be unbiased.
We can compute 95% confidence intervals for the individual model parameters using the fact that each model parameter mi has a normal distribution with mean mtrue,i and variance Cov(mL2)i,i. The 95% confidence intervals are given by
\[
m_{L2} \pm 1.96 \cdot \mathrm{diag}(\mathrm{Cov}(m_{L2}))^{1/2} , \qquad (2.26)
\]
where the 1.96 factor arises from the fact that
\[
\frac{1}{\sigma \sqrt{2\pi}} \int_{-1.96\sigma}^{1.96\sigma} e^{-\frac{x^2}{2\sigma^2}}\, dx \approx 0.95 .
\]

Example 2.1

Recall our example from Chapter 1 of linear regression of ballistic

observations to a quadratic model

y(t) = m1 + m2 t − (1/2)m3 t2 . (2.27)

Consider a synthetic data set with m = 10 observations and inde-

pendent normal data errors (σ = 8 m), generated using a known
true model of

mtrue = [10 m, 100 m/s, 9.8 m/s2 ]T (2.28)

The data are

t y
1 109.3827
2 187.5385
3 267.5319
4 331.8753
5 386.0535
6 428.4271
7 452.1644
8 498.1461
9 512.3499
10 512.9753

To obtain the least squares solution for the model parameters, we

construct the G matrix. The rows of G are given by

Gi = [1, ti , − (1/2)t2i ]

so that
\[
G = \begin{bmatrix}
1 & 1 & -0.5 \\
1 & 2 & -2.0 \\
1 & 3 & -4.5 \\
1 & 4 & -8.0 \\
1 & 5 & -12.5 \\
1 & 6 & -18.0 \\
1 & 7 & -24.5 \\
1 & 8 & -32.0 \\
1 & 9 & -40.5 \\
1 & 10 & -50.0
\end{bmatrix} .
\]




Figure 2.2: Data and model predictions for the ballistics example.

We solve for the parameters using the weighted normal equations

(2.17) to obtain a model estimate

mL2 = [16.4 m, 97.0 m/s, 9.4 m/s2 ]T . (2.29)

Figure 2.2 shows the observed data and the fitted curve.
mL2 has a corresponding model covariance matrix, from (2.25), of
\[
C = \begin{bmatrix} 88.53 & -33.60 & -5.33 \\ -33.60 & 15.44 & 2.67 \\ -5.33 & 2.67 & 0.48 \end{bmatrix} . \qquad (2.30)
\]

In our example, m = 10 and n = 3, so (2.26) gives

mL2 = [16.42 ± 18.44 m, 96.97 ± 7.70 m/s, 9.41 ± 1.36 m/s2 ]T .

The χ2 value for this regression is about 4.2, and the number of degrees of freedom is ν = m − n = 10 − 3 = 7, so the p–value (2.21) is
\[
p = \int_{4.2}^{\infty} \frac{1}{2^{7/2}\, \Gamma(7/2)}\, x^{\frac{5}{2}}\, e^{-\frac{x}{2}}\, dx \approx 0.76 ,
\]
which is in the realm of acceptability, meaning that modeling as-
sumptions and the 2–norm error for our estimated model are con-
sistent with both the data and the data uncertainty assumptions.
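The numbers in this example can be reproduced with a few lines of code; the following sketch recomputes the model estimate, the confidence interval half-widths, and χ²obs directly from the tabulated data:

```python
import numpy as np

# Reproducing Example 2.1 numerically from the tabulated data. Because all
# sigma_i = 8 m are equal, the weighted and unweighted solutions coincide.
t = np.arange(1.0, 11.0)
G = np.column_stack([np.ones(10), t, -0.5 * t**2])
d = np.array([109.3827, 187.5385, 267.5319, 331.8753, 386.0535,
              428.4271, 452.1644, 498.1461, 512.3499, 512.9753])
sigma = 8.0

mL2 = np.linalg.solve(G.T @ G, G.T @ d)      # model estimate (2.4)
C = sigma**2 * np.linalg.inv(G.T @ G)        # covariance matrix (2.25)
half = 1.96 * np.sqrt(np.diag(C))            # 95% half-widths (2.26)

r = d - G @ mL2
chi2_obs = float(r @ r) / sigma**2           # observed chi^2 (2.19)

print(np.round(mL2, 2), np.round(half, 2), round(chi2_obs, 1))
```

The printed values match the quoted estimate [16.42, 96.97, 9.41], half-widths [18.44, 7.70, 1.36], and χ²obs ≈ 4.2.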

If we consider combinations of model parameters, the situation becomes more complex. To characterize model uncertainty more effectively, we can examine 95% confidence regions for pairs or larger collections of parameters. When confidence regions for parameters considered jointly are projected onto model coordinate axes, we obtain intervals for individual parameters which may be significantly larger than parameter confidence intervals considered individually.
For a vector of estimated model parameters that have an n-dimensional
MVN distribution with mean mtrue and covariance matrix C, the quantity

(m − mL2 )T C−1 (m − mL2 )

can be shown to have a χ2 distribution with n degrees of freedom. Thus, if

∆2 is chosen to be the 95th percentile of the χ2 distribution with n degrees of
freedom, the 95% confidence region is defined by the inequality

(m − mL2 )T C−1 (m − mL2 ) ≤ ∆2 . (2.32)

The confidence region defined by this inequality is an n-dimensional ellipsoid.

If we wish to find an error ellipsoid for a lower dimensional subset of the
model parameters, we can project the n dimensional error ellipsoid down to the
lower dimensional space [Ast88] by taking only those rows and columns of C
and elements of m which correspond to the dimensions that we want to keep.
In this case, if the unprojected parameters are not of interest, the number of
degrees of freedom in the associated χ2 calculation, n, should also be reduced
to match the number of model parameters in the projected error ellipsoid.
Since the covariance matrix and its inverse are symmetric and positive definite, we can diagonalize C−1 as
\[
C^{-1} = P \Lambda P^T , \qquad (2.33)
\]
where Λ is a diagonal matrix of positive eigenvalues and the columns of P are orthonormal eigenvectors. The semiaxes defined by the columns of P are referred to as error ellipsoid principal axes, where the ith semiaxis is in the direction P·,i and has length ∆/√(Λi,i).
Because the model covariance matrix is typically not diagonal, the principal
axes are typically not aligned in the mi directions. However, we can project
the appropriate confidence ellipsoid onto the mi axes to obtain a “box” which
includes the entire 95% error ellipsoid, along with some points outside of the
error ellipsoid. This box provides a conservative confidence interval for the joint
collection of model parameters.
The correlations for the pairs of parameters (mi, mj) are measures of the ellipticity of the projection of the error ellipsoid onto the (mi, mj) plane. A correlation approaching +1 means the projection is “needle-like” with a positive slope, a near-zero correlation means that the projection is approximately spherical, and a correlation approaching −1 means that the projection is “needle-like” with a negative slope.

Example 2.2

The parameter correlations for our example are

\[
\rho_{m_i, m_j} = \frac{\mathrm{Cov}(m_i, m_j)}{\sqrt{\mathrm{Var}(m_i)\, \mathrm{Var}(m_j)}} ,
\]

which give
ρm1 ,m2 = −0.91
ρm1 ,m3 = −0.81
ρm2 ,m3 = 0.97 .
This shows that the three model parameters are highly statistically
dependent, and that the error ellipsoid is thus very elliptical.
Application of (2.33) for this example shows that the directions of
the error ellipsoid principal axes are
 
\[
P = [P_{\cdot,1}, P_{\cdot,2}, P_{\cdot,3}] \approx \begin{bmatrix} -0.03 & -0.93 & 0.36 \\ -0.23 & 0.36 & 0.90 \\ 0.97 & 0.06 & 0.23 \end{bmatrix} ,
\]

with corresponding eigenvalues

[λ1 , λ2 , λ3 ] ≈ [104.7, 0.0098, 0.4046] .

The resulting 95% confidence ellipsoid semiaxis lengths in the principal axes directions are
\[
\sqrt{F_{\chi^2}^{-1}(0.95)}\, \left[ 1/\sqrt{\lambda_1},\; 1/\sqrt{\lambda_2},\; 1/\sqrt{\lambda_3} \right] \approx [0.24,\; 24.72,\; 3.85] ,
\]
where the square root of Fχ2−1(0.95), the 95th percentile of the χ2 distribution with three degrees of freedom, is approximately 2.80.
Projecting the 95% confidence ellipsoid axes back into the (m1, m2, m3) coordinate system (Figure 2.3), we obtain 95% confidence intervals for the parameters considered jointly,

mL2 = [16.42 ± 23.03 m, 96.97 ± 9.62 m/s, 9.41 ± 1.70 m/s2 ]T (2.34)

which are about 40% broader than the single parameter confidence
estimates obtained using only the diagonal covariance matrix terms
(2.31). Note that there is actually a greater than 95% probability
that the box in (2.34) will include the true values of the param-
eters. The reason is that these intervals, considered together as a rectangular region, include many points which lie outside of the 95% confidence ellipsoid.

Figure 2.3: Projections of the 95% error ellipsoid onto model axes.
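The correlations and the eigendecomposition (2.33) for this example can be sketched as follows, rebuilding the covariance matrix from G and σ = 8 m rather than from the rounded entries of (2.30):

```python
import numpy as np

# Sketch of the correlation and principal-axis computations for Example 2.2,
# rebuilding the covariance matrix (2.25) from G and sigma = 8 m.
t = np.arange(1.0, 11.0)
G = np.column_stack([np.ones(10), t, -0.5 * t**2])
C = 8.0**2 * np.linalg.inv(G.T @ G)

# Parameter correlations rho_{mi,mj} = Cov(mi, mj) / sqrt(Var(mi) Var(mj)).
s = np.sqrt(np.diag(C))
rho = C / np.outer(s, s)

# Eigendecomposition of C^{-1} gives the error ellipsoid principal axes;
# eigh returns eigenvalues in ascending order.
lam, P = np.linalg.eigh(np.linalg.inv(C))
print(np.round(rho, 2))
print(np.round(lam, 4))
```

The off-diagonal correlations reproduce the quoted −0.91, −0.81, and 0.97, and the largest eigenvalue of C⁻¹ matches the quoted 104.7.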

It is important to note that the covariance matrix (2.25) contains information

only about where and how often we made measurements, and on what the
standard deviations of those measurements were. Covariance is thus totally a
characteristic of experimental design reflective of how much influence a general
data set will have on a model estimate. It does not depend upon particular data
values from some individual realization. This is why it is essential to evaluate
the p–value, or some other “goodness–of–fit” measure, for an estimated model. Examining the solution parameters and the covariance matrix alone will not reveal whether we are fitting the data adequately!
Suppose that we do not know the standard deviations of the measurement
errors a priori. Can we still perform linear regression? If we assume that the measurement errors are independent and normally distributed with expected value 0 and common standard deviation σ, then we can perform the linear regression and estimate σ from the residuals.
First, we find the least squares solution to the unweighted problem Gm = d.
We let
r = d − GmL2 .
To estimate the standard deviation from the residual vector, let
\[
s = \sqrt{ \frac{1}{m - n} \sum_{i=1}^{m} r_i^2 } . \qquad (2.35)
\]

As you might expect, there is a statistical cost associated with our not knowing the true standard deviation. If the data standard deviations are known ahead of time, then the model errors
\[
m'_i = \frac{m_i - m_{true,i}}{\sqrt{C_{i,i}}} \qquad (2.36)
\]
have the standard normal distribution, where C = σ2 (GT G)−1. However, if instead of a known σ we have the estimate s obtained from (2.35), then the model errors
\[
m'_i = \frac{m_i - m_{true,i}}{s \sqrt{ \left[ (G^T G)^{-1} \right]_{i,i} }} \qquad (2.37)
\]
have a t distribution with ν = m − n degrees of freedom. For smaller degrees of freedom this produces appreciably broader confidence intervals. As ν becomes large, s becomes an increasingly better estimate of σ.
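This broadening can be illustrated by simulation. The sketch below (a hypothetical straight-line setup with m = 8 and n = 2; all values are illustrative) estimates the two-sided 95% critical value of the standardized model error empirically, which should land near the t(6) value of about 2.45 rather than the normal value 1.96:

```python
import numpy as np

# Monte Carlo sketch: when sigma is estimated from the residuals via (2.35),
# standardized model errors follow a t distribution with nu = m - n degrees
# of freedom, so confidence intervals are wider than normal-based ones.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 8)               # m = 8 observations
G = np.column_stack([np.ones(8), x])       # line fit: n = 2, so nu = 6
m_true = np.array([1.0, 2.0])
sigma = 0.5

Ginv = np.linalg.pinv(G)                   # (G^T G)^{-1} G^T
GtGinv = np.linalg.inv(G.T @ G)

zs = []
for _ in range(10000):
    d = G @ m_true + rng.normal(0.0, sigma, 8)
    mL2 = Ginv @ d
    r = d - G @ mL2
    s2 = (r @ r) / (8 - 2)                 # s^2 from equation (2.35)
    zs.append((mL2[1] - m_true[1]) / np.sqrt(s2 * GtGinv[1, 1]))

q = float(np.quantile(np.abs(zs), 0.95))
print(q)   # near the t(6) critical value 2.45, well above 1.96
```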
A second major problem is that we can no longer use the χ2 test of goodness–
of–fit. The χ2 test was based on the assumption that the data errors were
normally distributed with known standard deviation σ. If the actual residuals
were too large relative to σ, then χ2 would be large, and we would reject the
linear regression fit. If we substitute the estimate (2.35) into (2.19), we find
that χ2obs = ν, so that the fit always passes the χ2 test.

Example 2.3

Consider the analysis of a linear regression problem in which the

measurement errors are assumed to be normally distributed, but
with unknown σ. We are given a set of x and y data that appear to
follow a linear relationship.
In this case,
\[
G = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} ,
\]
and we want to find the least squares solution to

Gm = y . (2.38)

The 2–norm solution has

y = −1.03 + 10.09x .

Figure 2.4 shows the data and the linear regression line. Our es-
timate of the standard deviation of the measurement errors is s =
30.74. Thus the estimated covariance matrix for the fit parameters is
\[
C = s^2 (G^T G)^{-1} = \begin{bmatrix} 338.24 & -4.93 \\ -4.93 & 0.08 \end{bmatrix}
\]

and the parameter confidence intervals are
\[
m_1 = -1.03 \pm t_{n-2,\,0.975} \sqrt{338.24} = -1.03 \pm 38.05
\]
and
\[
m_2 = 10.09 \pm t_{n-2,\,0.975} \sqrt{0.08} = 10.09 \pm 0.59 .
\]

Since the standard deviation of the measurement errors is unknown,

we cannot perform a χ2 test of goodness–of–fit. However, we can still
examine the residuals. Figure 2.5 shows the residuals. It is clear that
although the residuals seem to be random, the standard deviation seems to increase as x and y increase. This is a common phenomenon in linear regression, called a proportional effect. One possible reason for this is that in some instruments the measurement errors are of a size proportional to the magnitude of the measurement.
We will address the proportional effect by assuming that the stan-
dard deviation is proportional to y. We then rescale the system of
equations (2.38) by dividing each equation by yi , so that

Gw mw = yw .

With this weighted system, we obtain the least squares estimate

y = −12.24 + 10.25x .

with 95% confidence intervals for the parameters

m1 = −12.24 ± 22.39

m2 = 10.25 ± 0.47 .
Figure 2.6 shows the data and least squares fit. Figure 2.7 shows
the scaled residuals. Notice that this time there is no trend in the
magnitude of the residuals as x and y increase. The estimated standard deviation is 0.045, or 4.5%. In fact, these data were generated according to y = 10x, with standard deviations for the measurement errors at 5% of the y value.
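A sketch of this weighting strategy on synthetic data generated the same way (y = 10x with 5% proportional errors; all values here are illustrative, not the data set used in the example):

```python
import numpy as np

# Sketch of handling a proportional effect by weighting: each equation is
# divided by its observed y value, so that relative errors are equalized.
rng = np.random.default_rng(2)
x = np.linspace(1.0, 100.0, 50)
y = 10.0 * x * (1.0 + 0.05 * rng.normal(size=50))

G = np.column_stack([np.ones(50), x])

# Unweighted least squares fit.
m_unw, *_ = np.linalg.lstsq(G, y, rcond=None)

# Weighted fit: scale each row of G and each datum by 1/y_i.
W = np.diag(1.0 / y)
m_w, *_ = np.linalg.lstsq(W @ G, W @ y, rcond=None)

print(m_unw, m_w)
```

Both fits recover a slope near 10, but the weighted residuals, like those in Figure 2.7, show no trend in magnitude.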











Figure 2.4: Data and Linear Regression Line.










Figure 2.5: Residuals.












Figure 2.6: Data and Linear Regression Line, scaled.






Figure 2.7: Residuals for the scaled problem.


2.3 L1 Regression

Linear regression solutions are very susceptible to even small numbers of outliers, extreme observations which are sufficiently different from other observations to affect the fitted model. In practice, such observations are often the result of a procedural error in making a measurement.
We can readily appreciate this from a maximum likelihood perspective by
noting the very rapid fall-off of the tails of the normal distribution. For example,
the probability of a data point occurring more than 5 standard deviations to one
side or the other of the expected value for a single data point with a normally
distributed error is less than one in a million:
\[
P(|X - E[X]| \ge 5\sigma) = \frac{2}{\sqrt{2\pi}} \int_{5}^{\infty} e^{-\frac{1}{2} x^2}\, dx \approx 6 \times 10^{-7} . \qquad (2.39)
\]
Thus, if an outlier occurs in the data set due to a non-normal error process, the least squares solution will go to great lengths to accommodate it so as to make the relative contribution of that point to the total likelihood as large as possible.
Consider the solution which minimizes the 1–norm of the weighted residual vector,
\[
\sum_{i=1}^{m} \frac{|d_i - (Gm)_i|}{\sigma_i} = \| d_w - G_w m \|_1 , \qquad (2.40)
\]

rather than minimizing the 2–norm. The 1–norm solution, mL1 , will be more
outlier resistant, or robust, than the least squares solution, mL2 , because
(2.40) does not square each of the terms in the misfit measure, as (2.12) does.
The 1–norm solution mL1 also has a maximum likelihood interpretation; it
is the maximum likelihood estimator for data with errors distributed according
to a 2-sided exponential distribution,
\[
f(x) = \frac{1}{2\sigma} e^{-|x - \mu|/\sigma} . \qquad (2.41)
\]

Data sets distributed as (2.41) are unusual. Nevertheless, it is still often worthwhile to find a solution where (2.40) is minimized rather than (2.12), even if measurement errors are normally distributed, because of the all-too-frequent occurrence of incorrect measurements that do not follow the normal distribution.

Example 2.4
It is easy to demonstrate the advantages of 1–norm minimization
using the quadratic regression example discussed earlier. Figure 2.8
shows the original sequence of independent data points with unit
standard deviations, where one of the points (number 4) is clearly
an outlier if a model of the form (2.27) is appropriate (it is the
original data with 200 m subtracted from it). The data prediction
using the 2–norm minimizing solution for this data set,
mL2 = [26.4 m, 75.6 m/s, 4.9 m/s2 ]T , (2.42)





Figure 2.8: L1 (upper) and L2 (lower) residual minimizing solutions to a

parabolic data set with an outlier at t = 4 s.

(the lower of the two curves) is clearly skewed away from the major-
ity of the data points due to its efforts to accommodate the outlier
data point and thus minimize χ2 , and it is a poor estimate of the
true model (2.28). Even without a graphical view of the data fit, we
can note immediately that (2.42) fails to fit the data acceptably be-
cause of the huge χ2 value of ≈ 1109. This is clearly astronomically
out of bounds for a problem with 7 degrees of freedom, where the value of χ2 should not be far from 7. The corresponding p–value for this huge χ2 value is effectively zero.
The upper data prediction is obtained using the 1–norm solution,

mL1 = [17.6 m, 96.4 m/s, 9.3 m/s2 ]T (2.43)

which minimizes (2.40). The data prediction from (2.43) faithfully

fits the quadratic trend for the majority of the data points and ignores the outlier at t = 4. It is also much closer to the true model (2.28), and to the 2–norm model for the data set without the outlier.
It is instructive to consider the almost trivial regression problem of estimating the value of a single observable from n repeated measurements. The system of equations is
\[
Gm = \begin{bmatrix} 1 \\ 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix} m
   = \begin{bmatrix} d_1 \\ d_2 \\ d_3 \\ \vdots \\ d_n \end{bmatrix} = d . \tag{2.44}
\]
The least squares solution to (2.44) can be found using the normal equations:
\[
m_{L_2} = (G^T G)^{-1} G^T d = \frac{1}{n} \sum_{i=1}^{n} d_i . \tag{2.45}
\]
Thus the 2–norm solution is simply the average of the observations.
The 1–norm solution is somewhat more complicated. The problem is that
\[
f(m) = \|d - Gm\|_1 = \sum_{i=1}^{n} |d_i - m| \tag{2.46}
\]
is a nondifferentiable function of m at each point where m = di . The good news

is that f (m) is a convex function of m. Thus any local minimum point is also
a global minimum point.
We can proceed by finding f 0 (m) at those points where it is defined, and
then separately consider the points at which the derivative is not defined. Every
minimum point must either have f 0 (m) undefined or f 0 (m) = 0.
At those points where f 0 (m) is defined,
\[
f'(m) = -\sum_{i=1}^{n} \operatorname{sgn}(d_i - m) . \tag{2.47}
\]

The derivative is 0 when exactly half of the data are less than m and half of
the data are greater than m. Of course, this can only happen when there are
an even number of observations. In this case, any value of m between the two
middle observations is a 1–norm solution. When there are an odd number of
data, the median data point is the unique 1–norm solution. Even an extreme
outlier will not have a large effect on the median of the data. This is the reason
for the robustness of the 1–norm solution.
The general problem of minimizing kd − Gmk1 is somewhat more compli-
cated. One practical way to find 1–norm solutions is iteratively reweighted
least squares, or IRLS [SGT88]. The IRLS algorithm solves a sequence of
weighted least squares problems whose solutions converge to a 1–norm minimiz-
ing solution.
Let
\[
r = Gm - d . \tag{2.48}
\]
We want to minimize
\[
f(m) = \|r\|_1 = \sum_{i=1}^{m} |r_i| . \tag{2.49}
\]

This function is nondifferentiable at any point where one of the elements of r

is zero. Ignoring this issue for a moment, we can go ahead and compute the
derivatives of f at other points.
\[
\frac{\partial f(m)}{\partial m_k} = \sum_{i=1}^{m} \frac{\partial |r_i|}{\partial m_k} \tag{2.50}
\]
\[
\frac{\partial f(m)}{\partial m_k} = \sum_{i=1}^{m} G_{i,k}\, \operatorname{sgn}(r_i) . \tag{2.51}
\]

Writing sgn(ri ) as ri /|ri |, we get

\[
\frac{\partial f(m)}{\partial m_k} = \sum_{i=1}^{m} G_{i,k}\, \frac{1}{|r_i|}\, r_i . \tag{2.52}
\]
In matrix form,
\[
\nabla f(m) = G^T R\, r = G^T R\, (Gm - d) , \tag{2.53}
\]
where R is a diagonal weighting matrix with Ri,i = 1/|ri |. To find the 1–norm
minimizing solution, we solve ∇f (m) = 0.

\[
G^T R\, (Gm - d) = 0 . \tag{2.54}
\]

\[
G^T R\, G\, m = G^T R\, d . \tag{2.55}
\]
Since R depends on m, this is a nonlinear system of equations which we
cannot solve directly. Fortunately, we can use a simple iterative scheme to find
the appropriate weights. The IRLS algorithm begins with the least squares
solution m0 = mL2 . We calculate the initial residual vector r0 = d − Gm0 .
We then solve (2.55) to obtain a new model m1 and residual vector r1 . The
process is repeated until the model and residual vector stabilize. A typical rule
is to stop the iteration when

\[
\frac{\|m^{k+1} - m^{k}\|_2}{1 + \|m^{k+1}\|_2} < \tau \tag{2.56}
\]
for some tolerance τ .
The procedure will fail if any element of the residual vector becomes zero.
A simple modification to the algorithm deals with this problem. We select a tolerance ε below which we consider the residuals to be effectively zero. If |ri| < ε, then we set Ri,i = 1/ε. With this modification it can be shown that this procedure will always converge to an approximate 1–norm minimizing solution.
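The IRLS loop described above can be sketched compactly. The following Python function is a sketch, not the text's own code (the function name and default tolerances are our choices); it implements the reweighting iteration with the stopping rule (2.56) and the residual tolerance ε safeguard:

```python
import numpy as np

def irls(G, d, tol=1e-6, eps=1e-8, maxiter=100):
    """Iteratively reweighted least squares for min ||d - G m||_1.

    eps is the tolerance below which |r_i| is treated as zero, so the
    weight R_ii = 1 / max(|r_i|, eps) never divides by zero.
    """
    # Begin with the least squares solution m0 = mL2.
    m = np.linalg.lstsq(G, d, rcond=None)[0]
    for _ in range(maxiter):
        r = d - G @ m
        w = 1.0 / np.maximum(np.abs(r), eps)   # diagonal of R
        GTR = G.T * w                          # G^T R without forming R
        # Solve the weighted normal equations (2.55): G^T R G m = G^T R d.
        m_new = np.linalg.solve(GTR @ G, GTR @ d)
        # Stopping rule (2.56).
        if np.linalg.norm(m_new - m) / (1 + np.linalg.norm(m_new)) < tol:
            return m_new
        m = m_new
    return m
```

Applied to the repeated-measurement system (2.44), this iteration converges to (approximately) the median of the data, as the analysis above predicts.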
As with the χ2 misfit measure, there is a corresponding p-value that can be
used under the assumption of normal data errors for 1–norm solutions [PM80].
For a 1–norm misfit measure, y, with ν degrees of freedom given by (2.40), the probability that a misfit worse than that observed, µobs, could have occurred given independent normally-distributed data and ν degrees of freedom is approximately given by
\[
p^{(1)}(y, \nu) = \mathrm{Prob}(\mu^{(1)} > \mu_{obs}) \approx S(x) - \gamma Z^{(2)}(x) , \tag{2.57}
\]
where
\[
S(x) = \frac{1}{\sigma_1 \sqrt{2\pi}} \int_{-\infty}^{x} e^{-\xi^2 / (2\sigma_1^2)}\, d\xi ,
\]
\[
\sigma_1 = \sqrt{(1 - 2/\pi)\,\nu} ,
\]
\[
\gamma = \frac{2 - \pi/2}{(\pi/2 - 1)^{3/2}}\, \nu^{-1/2} ,
\]
\[
Z^{(2)}(x) = \frac{x^2 - 1}{\sqrt{2\pi}}\, e^{-x^2/2} ,
\]
\[
x = \mu^{(1)} - \bar{\mu} ,
\]
and
\[
\bar{\mu} = \sqrt{2/\pi}\; \nu .
\]

2.4 Monte Carlo Error Propagation

For solution techniques that are nonlinear and/or algorithmic, such as the IRLS,
there is typically no simple way to propagate uncertainties in the data to uncer-
tainties in the estimated model parameters. In such cases, one can apply Monte
Carlo error propagation techniques, in which we simulate a collection of noisy data vectors and then measure the statistics of the resulting models. We can obtain covariance matrix information for this solution by first forward propagating (2.43) into an assumed noise-free data vector
\[
G m_{L_1} = \hat{d} .
\]

We next re-solve the IRLS problem many (q) times for 1–norm models corresponding to independent data realizations
\[
G m_{L_1, i} \approx \hat{d} + \eta_i ,
\]
where ηi is the ith noise vector. If A is the q by n matrix whose ith row contains the difference between the ith model estimate and the average model,
\[
A_{i,\cdot} = m_{L_1, i} - \bar{m}_{L_1} ,
\]
then an empirical estimate of the covariance matrix is
\[
\mathrm{Cov}(m_{L_1}) = \frac{A^T A}{q} .
\]
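The Monte Carlo procedure can be sketched in Python as follows. For brevity this sketch uses an ordinary least squares solver as a stand-in for IRLS, and the design matrix and baseline model are hypothetical; substituting an IRLS routine for `solve` gives the Cov(mL1) estimate described above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical quadratic design matrix standing in for the text's example.
t = np.arange(1.0, 11.0)
G = np.column_stack([np.ones_like(t), t, 0.5 * t**2])

def solve(G, d):
    # Stand-in solver; in the text this step would be IRLS.
    return np.linalg.lstsq(G, d, rcond=None)[0]

m_est = np.array([10.0, 100.0, 9.8])   # assumed baseline model (hypothetical)
d_hat = G @ m_est                      # forward-propagated noise-free data

# Re-solve for q independent noisy realizations of the data.
q = 2000
models = np.array([solve(G, d_hat + rng.standard_normal(len(t)))
                   for _ in range(q)])

# Rows of A are deviations from the average model; Cov is roughly A^T A / q.
A = models - models.mean(axis=0)
C = A.T @ A / q
```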

Example 2.5
Recall Example 2.3. An estimate of Cov(mL1 ) for this example using
10,000 iterations of the Monte Carlo procedure is
 
\[
\mathrm{Cov}(m_{L_1}) = \begin{bmatrix} 123.92 & -46.96 & -7.43 \\ -46.96 & 15.44 & 3.72 \\ -7.43 & 3.72 & 0.48 \end{bmatrix} \approx 1.4 \cdot \mathrm{Cov}(m_{L_2}) .
\]
Although we have no reason to believe that the model parameters
will be normally distributed, we can still compute approximate 95%
confidence intervals for the parameters by simply acting as though
they were normally distributed. We obtain

mL1 = [17.6 ± 21.8 m, 96.4 ± 7.70 m/s, 9.3 ± 1.4 m/s2 ]T . (2.58)

2.5 Exercises
1. A seismic profiling experiment is performed where the first arrival times
of seismic energy from a mid-crustal refractor are made at distances (km)
of
\[
x = \begin{bmatrix} 10.1333 \\ 14.2667 \\ 18.4000 \\ 22.5333 \end{bmatrix}
\]
from the source, and are found to be (in seconds after the source origin time)
\[
t = \begin{bmatrix} 4.2853 \\ 5.1374 \\ 5.8181 \\ 6.8632 \end{bmatrix} .
\]
The model for this arrival time pattern is a simple two-layer flat lying
structure which predicts that
\[
t_i = t_0 + s_2 x_i ,
\]
where the intercept time, t0 depends on the thickness and slowness of the
upper layer, and s2 is the slowness of the lower layer. The estimated noise
is believed to be uncorrelated and normally distributed, with expected value 0 and standard deviation 0.1 s; i.e., σ = 0.1 s.

(a) Find the least squares solution for the two model parameters t0 and s2. Plot the data predictions from your model relative to the true data.
(b) Calculate and comment on the parameter correlation matrix. How will the correlation entries be reflected in the appearance of the error ellipsoid?
(c) Plot the error ellipsoid in the (t0 , s2 ) plane and calculate conservative
95% confidence intervals for t0 and s2 .
Hint: The following MATLAB code will plot a 2-dimensional co-
variance ellipse, where covm is the covariance matrix and m is the
2-vector of model parameters.
%diagonalize the covariance matrix
%generate a vector of angles from 0 to 2*pi
%calculate the x component of the ellipsoid for all angles
%calculate the y component of the ellipsoid for all angles
%plot(x,y), adding in the model parameters
(d) Evaluate the p–value for this model (you may find the MATLAB
Statistics Toolbox function chi2cdf to be useful here).
(e) Evaluate the value of χ2 for 1000 Monte Carlo simulations using the
data prediction from your model perturbed by noise that is consistent
with the data assumptions. Compare a histogram of these χ2 values
with the theoretical χ2 distribution for the correct number of degrees
of freedom (you may find the MATLAB statistical toolbox function
chi2pdf to be useful here).
(f) Are your p–value and Monte Carlo χ2 distribution consistent with the theoretical modeling and the data set? If not, explain what is wrong.
(g) Use IRLS to evaluate 1–norm estimates for t0 and s2 . Plot the data
predictions from your model relative to the true data and compare
with (a).
(h) Use Monte Carlo error propagation and IRLS to estimate symmetric
95% confidence intervals on the 1–norm solution for t0 and s2 .
(i) Examining the contributions from each of the data points to the 1–
norm misfit measure, can you make a case that any of the data points
are statistical outliers?

2. Use MATLAB to generate 10,000 realizations of a data set of m = 5

points d = a + bx + n, where x = [1, 2, 3, 4, 5]T , the n = 2 true model
parameters are a = b = 1, and n is a m-element vector of uncorrelated
N (0, 1) noise.

(a) Assuming that the noise standard deviation is known a priori to be 1,

solve for the 2–norm parameters for your realizations and histogram
them in 100 bins.
(b) Calculate the parameter covariance matrix for uncorrelated N (0, 1)
data errors, C = (GT G)−1 , and give standard deviations, σa and σb ,
for your estimates of a and b,
(c) Calculate standardized parameter estimates
\[
a' = \frac{a - \bar{a}}{\sigma\, C_{1,1}^{1/2}} , \qquad
b' = \frac{b - \bar{b}}{\sigma\, C_{2,2}^{1/2}} ,
\]
and demonstrate using a Q–Q plot that your estimates for a′ and b′ are distributed as N(0, 1).
(d) Show using a Q − Q plot that the squared residual lengths

\[
\|r\|_2^2 = \|d - Gm\|_2^2
\]

for your solutions in (a) are distributed as χ2 with m − n = ν = 3

degrees of freedom.
(e) Assume that the noise standard deviation for the synthetic data set
is not known, and estimate it for each realization as
\[
s = \sqrt{\frac{1}{m - n} \sum_{i=1}^{m} r_i^2} .
\]

Histogram your standardized solutions

\[
a' = \frac{a - \bar{a}}{s\, C_{1,1}^{1/2}} , \qquad
b' = \frac{b - \bar{b}}{s\, C_{2,2}^{1/2}} ,
\]
where each solution is normalized by its respective standard deviation estimate.
(f) Demonstrate using a Q–Q plot that your estimates for a′ and b′ are distributed as the t PDF with ν = 3 degrees of freedom.

2.6 Notes and Further Reading
Linear regression is a major subfield within statistics, and there are hundreds of
associated textbooks. However, many of these textbooks focus on applications
of linear regression in the social sciences. In such applications, the primary focus
is often on determining which variables have an effect on the response variable of
interest, rather than on estimating parameter values for a predetermined model.
In this context it is important to test the hypothesis that a predictor variable has
a 0 coefficient in the regression model. Since we normally know which predictor
variables are important in the physical sciences, this is not typically important
to us. Useful books on linear regression from the standpoint of estimating
parameters in the context considered here include [DS98, Mye90].
Statistical methods that are robust to outliers are an important topic. Huber
discusses a variety of robust statistical procedures [Hub96]. The computational
problem of computing a 1–norm solution has been extensively researched. Wat-
son reviews the history of methods for finding p–norm solutions including the
1–norm case [Wat00].
We have assumed that G is known exactly. In some cases entries in this
matrix might be subject to measurement error. This problem has been studied
as the total least squares problem [HV91]. An alternative approach to least
squares problems with uncertainties in G that has recently received considerable
attention is called robust least squares [BTN01, EL97].

Chapter 3

Discretizing Continuous
Inverse Problems

In this chapter, we will discuss inverse problems involving functions rather than
vectors of parameters. There is a large body of mathematical theory that can
be applied to such problems. Some of these problems can be solved analytically.
However, in practice such problems are often approximated by linear systems
of equations. Thus we will focus on techniques for discretizing continuous in-
verse problems here then consider techniques for solving discretized problems in
subsequent chapters.

3.1 Integral Equations

We begin by considering problems of the form
\[
d(s) = \int_a^b g(s, t)\, m(t)\, dt . \tag{3.1}
\]

Here d(s) is a known function, typically representing the observed data. The
function g(s, t) is also known. It encodes the physics that relates the unknown
model m(t) to the observed d(s). The interval [a, b] may be finite, in which case
the analysis is somewhat simpler, or the interval may be infinite. The function
d(s) might in theory be known over an entire interval [c, d], but in practice we
will only have measurements of d(s) at a finite set of points.
We wish to solve for the unknown function m(t). This type of linear equation,
which we previously saw in Chapter 1, is called a Fredholm integral equation
of the first kind or IFK. A surprisingly large number of inverse problems
can be written as Fredholm integral equations of the first kind. Unfortunately,
some IFK’s have properties that can make them very difficult to solve.


Example 3.1

One common type of inverse problem that we have seen previously and which can be written in the form of an IFK is the deconvolution problem
\[
y(t) = \int_0^t g(t - u)\, f(u)\, du .
\]
Here the kernel function g depends only on the difference between
t and u (1.7). Our goal is to obtain f(t). Convolution integrals of
this type arise whenever the response of an instrument depends on
some sort of weighted time average of the value of the parameter
of interest. The instrument response is y(t), while f (t) is the value
that we wish to measure.
This problem is not quite in the form of (3.1), since the upper limit
of integration is not constant. However, if we fix some time b beyond
which we are not interested in the solution, and let
\[
g(t, u) = \begin{cases} g(t - u) & u \le t \\ 0 & u > t \end{cases}
\]

then we can rewrite the equation as

\[
y(t) = \int_0^b g(t, u)\, f(u)\, du .
\]

3.2 Quadrature Methods

There are a number of techniques for approximating IFK’s by discrete systems
of linear equations. We will first assume that d is known only at a finite number
of points s1 , s2 , . . ., sm . This is not an important restriction, since all practical
data consist of a discrete number of measurements. We can write the IFK as
\[
d_i = d(s_i) = \int_a^b g(s_i, t)\, m(t)\, dt \quad (i = 1, 2, \ldots, m) ,
\]

or as
\[
d_i = \int_a^b g_i(t)\, m(t)\, dt \quad (i = 1, 2, \ldots, m) , \tag{3.2}
\]
where gi (t) = g(si , t). The functions gi (t) are referred to as representers or
data kernels.
In the quadrature approach to discretizing an IFK, we use a quadrature
rule (an approximate numerical integration scheme) to approximate
\[
\int_a^b g_i(t)\, m(t)\, dt .
\]

Figure 3.1: Grid for the midpoint rule.

The simplest quadrature rule is the midpoint rule. We divide the interval
[a, b] into n subintervals, and pick points t1 , t2 , . . ., tn in the middle of each
interval. Thus
\[
t_i = a + \frac{\Delta t}{2} + (i - 1)\,\Delta t , \qquad \Delta t = \frac{b - a}{n} .
\]
The integral is then approximated by
\[
\int_a^b g_i(t)\, m(t)\, dt \approx \sum_{j=1}^{n} g_i(t_j)\, m(t_j)\, \Delta t \tag{3.3}
\]

(Figure 3.1). With this approximation, our equation for di becomes

\[
d_i = \sum_{j=1}^{n} g_i(t_j)\, m(t_j)\, \Delta t \quad (i = 1, 2, \ldots, m) . \tag{3.4}
\]

If we let
\[
G_{i,j} = g_i(t_j)\,\Delta t \quad (i = 1, \ldots, m;\ j = 1, \ldots, n) \tag{3.5}
\]
and
\[
m_j = m(t_j) \quad (j = 1, 2, \ldots, n) , \tag{3.6}
\]
then we obtain a linear system of equations Gm = d.
The approach of using the midpoint rule to approximate the integral is known
as simple collocation. Of course, there are also more sophisticated quadrature
rules such as the trapezoidal rule and Simpson’s rule. In each case, we end up
with a similar linear system of equations.
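A midpoint-rule discretization takes only a few lines. The following Python sketch (the text's examples use MATLAB; the test kernel and limits here are our own choices for illustration) builds G per (3.5) and checks it on a kernel with a known integral:

```python
import numpy as np

def midpoint_G(g, a, b, s, n):
    """Build G_ij = g(s_i, t_j) * dt, with t_j at the midpoints of n equal
    subintervals of [a, b] (cf. eqs. 3.5-3.6)."""
    dt = (b - a) / n
    t = a + dt / 2 + np.arange(n) * dt
    return g(s[:, None], t[None, :]) * dt, t

# Check on a kernel with a known integral: with m(t) = 1 and
# g(s, t) = exp(-s t) on [0, 1], d(s) = (1 - exp(-s)) / s.
s = np.array([0.5, 1.0, 2.0])
G, t = midpoint_G(lambda s, t: np.exp(-s * t), 0.0, 1.0, s, 1000)
d_approx = G @ np.ones(1000)
d_exact = (1 - np.exp(-s)) / s
```

With n = 1000 subintervals the midpoint rule reproduces the exact integrals to better than about 10⁻⁶ here, reflecting its O(Δt²) accuracy.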

Example 3.2
Consider the vertical seismic profiling example (Example 1.3) of
Chapter 1, where we wish to estimate vertical seismic slowness using
arrival time measurements of downward propagating seismic waves
(Figure 3.2). As noted in Chapter 1, the data in this case are simply

Figure 3.2: Discretization of the vertical seismic profiling problem (n/m = 2).

integrated values of the model parameters. Discretizing the forward

problem (1.15) for m observations, ti at yi (assumed equally spaced
at distances ∆y; Figure 3.2) and for n model depths zi (assumed
equally spaced at distances ∆z) gives n/m = ∆y/∆z. The corre-
sponding discretized system of equations is
\[
t_i = \sum_{j=1}^{n} H(y_i - z_j)\, s_j\, \Delta z , \tag{3.7}
\]
where H is the Heaviside step function.

The rows of the matrix Gi,· each consist of i · n/m entries ∆z on the
left and n − (i · n/m) zeros on the right. For n = m, G is simply a
lower triangular matrix with each nonzero entry equal to ∆z.
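The structure of this G is simple to generate; here is a Python sketch (the function name is ours) for the case where n/m is an integer:

```python
import numpy as np

def vsp_G(m, n, dz):
    """Discretized VSP matrix: with 0-based row index i, row i has
    (i + 1) * n/m entries equal to dz on the left and zeros on the right
    (assumes n/m is an integer)."""
    ratio = n // m
    G = np.zeros((m, n))
    for i in range(m):
        G[i, : (i + 1) * ratio] = dz
    return G
```

For n = m this reproduces the lower triangular matrix with nonzero entries Δz described above.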

Example 3.3

Another instructive example of an inverse problem formulated as

an IFK and then discretized by the method of simple collocation
concerns an optics experiment in which light passes through a slit
[Sha72]. We measure the intensity of diffracted light d(s), as a func-
tion of outgoing angle −π/2 ≤ s ≤ π/2. Our goal is to find the
intensity of the light incident on the slit m(t), as a function of the
incoming angle −π/2 ≤ t ≤ π/2 (Figure 3.3).

Figure 3.3: The Shaw problem.

The relationship between d and m is

\[
d(s) = \int_{-\pi/2}^{\pi/2} \left( \cos(s) + \cos(t) \right)^2 \left( \frac{\sin\!\left( \pi (\sin(s) + \sin(t)) \right)}{\pi (\sin(s) + \sin(t))} \right)^2 m(t)\, dt .
\]
We use the method of simple collocation with n equal model intervals
for both the model and data functions, where n is a multiple of 2.
Although it is not generally necessary, we will assume for convenience
that data is collected at the same n equally-spaced angles
\[
s_i = t_i = \frac{(i - 0.5)\,\pi}{n} - \frac{\pi}{2} \quad (i = 1, 2, \ldots, n) . \tag{3.9}
\]
The corresponding matrix entries are
\[
G_{i,j} = \Delta s \left( \cos(s_i) + \cos(t_j) \right)^2 \left( \frac{\sin\!\left( \pi (\sin(s_i) + \sin(t_j)) \right)}{\pi (\sin(s_i) + \sin(t_j))} \right)^2 \tag{3.10}
\]
with
\[
\Delta s = \frac{\pi}{n} . \tag{3.11}
\]
If we let
di = d(si ) (i = 1, 2, . . . , n) (3.12)
mj = m(tj ) (j = 1, 2, . . . , n) (3.13)
our discretized linear system for this problem is then Gm = d.
The MATLAB Regularization Toolbox [Han94] contains a routine
shaw that computes the G matrix for this problem.
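Equation (3.10) can be sketched in a few lines of Python (as an alternative to the MATLAB shaw routine). Note that numpy's `sinc(u)` is sin(πu)/(πu), which is exactly the bracketed factor and gracefully handles the removable singularity where sin(s) + sin(t) = 0:

```python
import numpy as np

def shaw_G(n):
    """Simple-collocation matrix (3.10) for the Shaw problem, with the
    collocation angles of (3.9) and delta_s = pi / n from (3.11)."""
    s = (np.arange(1, n + 1) - 0.5) * np.pi / n - np.pi / 2   # (3.9)
    S, T = np.meshgrid(s, s, indexing="ij")
    ds = np.pi / n                                            # (3.11)
    return ds * (np.cos(S) + np.cos(T)) ** 2 * np.sinc(np.sin(S) + np.sin(T)) ** 2
```

Because the kernel is symmetric in s and t and the same angles are used for data and model, the resulting G is a symmetric matrix with nonnegative entries.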

Example 3.4



Figure 3.4: The source history reconstruction problem.

In this example we consider the problem of recovering the history

of groundwater pollution at a site from later measurements of the
groundwater contamination at other points “downstream” from the
initial contamination (see Figure 3.4). This “source history reconstruction problem” has been considered by a number of authors
[NBW00, SK94, SK95, WU96].
The transport of the contaminant is governed by an advection–
diffusion equation:

\[
\frac{\partial C}{\partial t} = D \frac{\partial^2 C}{\partial x^2} - v \frac{\partial C}{\partial x} \tag{3.14}
\]
\[
C(0, t) = C_{in}(t)
\]
\[
C(x, t) \to 0 \ \text{as} \ x \to \infty
\]
\[
C(x, 0) = C_0(x) .
\]

Here D is the diffusion coefficient, and v is the velocity of the ground-

water flow.
The solution to this PDE is given by the convolution
\[
C(x, T) = \int_0^T C_{in}(t)\, f(x, T - t)\, dt , \tag{3.15}
\]

where Cin (t) is the time history of contaminant injection at x = 0,

and the advection-diffusion kernel is
\[
f(x, T - t) = \frac{x}{2 \sqrt{\pi D (T - t)^3}}\, e^{-\frac{[x - v(T - t)]^2}{4 D (T - t)}} .
\]

In this problem, we assume that the parameters of the advection–

diffusion equation are known and that we want to estimate Cin (t)
from observations at some later time T . The convolution formula
(3.15) for C(x, T ) is discretized as

d = Gm

where d is a vector of sampled concentrations at different locations,

x, at a time T , m is a vector of Cin values, and
\[
G_{i,j} = f(x_i, T - t_j)\,\Delta t = \frac{x_i}{2 \sqrt{\pi D (T - t_j)^3}}\, e^{-\frac{[x_i - v(T - t_j)]^2}{4 D (T - t_j)}}\, \Delta t .
\]
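Assembling this G takes one vectorized expression. In the Python sketch below, the parameter values and grids (D, v, T, the sampling of t and x) are hypothetical choices for illustration only:

```python
import numpy as np

# Hypothetical parameter values and grids, chosen purely for illustration.
D, v, T = 1.0, 1.0, 10.0
dt = 0.25
t = np.arange(dt / 2, T, dt)            # injection times t_j < T
x = np.linspace(1.0, 20.0, 25)          # observation locations x_i

def adv_diff_kernel(x, tau):
    # Advection-diffusion kernel f(x, tau) with tau = T - t > 0.
    return x / (2 * np.sqrt(np.pi * D * tau**3)) * \
        np.exp(-((x - v * tau) ** 2) / (4 * D * tau))

# G_ij = f(x_i, T - t_j) * dt, built by broadcasting.
G = adv_diff_kernel(x[:, None], T - t[None, :]) * dt
```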

3.3 The Gram Matrix Technique

In the Gram matrix technique, we write the model m(t) as a linear combi-
nation of the m representers (3.2)
\[
m(t) = \sum_{j=1}^{m} \alpha_j\, g_j(t) , \tag{3.16}
\]

where the αj are coefficients that we must determine. Our approximation to

the IFK is then
\[
d(s_i) = \int_a^b g_i(t) \sum_{j=1}^{m} \alpha_j\, g_j(t)\, dt \tag{3.17}
\]
\[
= \sum_{j=1}^{m} \alpha_j \int_a^b g_i(t)\, g_j(t)\, dt \quad (i = 1, 2, \ldots, m) . \tag{3.18}
\]


We construct an m by m matrix Γ with elements
\[
\Gamma_{i,j} = \int_a^b g_i(t)\, g_j(t)\, dt \quad (i, j = 1, 2, \ldots, m) , \tag{3.20}
\]
and an m element vector d with elements
\[
d_i = d(s_i) \quad (i = 1, 2, \ldots, m) . \tag{3.21}
\]
Our IFK is then discretized as an m by m linear system of equations
\[
\Gamma \alpha = d .
\]
Once this linear system has been solved, the corresponding model is
\[
m(t) = \sum_{j=1}^{m} \alpha_j\, g_j(t) . \tag{3.22}
\]

If the representers and Gram matrix are analytically expressible, then the
Gram matrix formulation facilitates finding continuous solutions for m(t) [Par94].

Where only numerical representations for the representers exist, the method can still be used, although the elements of Γ must then be obtained by numerical integration. It can be shown (see the Exercises) that if the representers gi(t) are linearly independent, then the matrix Γ will be nonsingular.
However, as we will see in the next chapter, the Γ matrix tends to become
very badly conditioned as m increases. On the other hand, we want to use as
large as possible a value of m so as to increase the accuracy of the discretization.
There is thus a tradeoff between the discretization error due to using too small
a value of m and the ill-conditioning of Γ due to using too large a value of m.
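Both the numerical construction of Γ and its ill-conditioning can be illustrated with a short Python sketch. The representers used here, gi(t) = t e^{−t yi}, are borrowed from the kernel of Exercise 1 of this chapter; the particular yi values and grid sizes are arbitrary choices:

```python
import numpy as np

# Representers g_i(t) = t * exp(-t * y_i) on [0, 1] for a few sample y_i.
y = np.array([0.1, 0.3, 0.5, 0.7, 0.9])

n = 4000
dt = 1.0 / n
t = dt / 2 + np.arange(n) * dt                 # midpoint integration grid
g = t[None, :] * np.exp(-np.outer(y, t))       # row i holds g_i(t)

# Gamma_ij = int_0^1 g_i(t) g_j(t) dt, approximated by the midpoint rule.
Gamma = g @ g.T * dt

# The representers are linearly independent, so Gamma is nonsingular, but
# because they are so similar its condition number is already large for
# only five representers.
cond = np.linalg.cond(Gamma)
```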

3.4 Expansion in Terms of Basis Functions

In the Gram matrix technique, we approximated m(t) as a linear combination
of m representers, which form a natural basis. More generally, given suitable
functions h1 (t), h2 (t), . . ., hn (t), which form a basis for a space H of functions,
we could approximate m(t) by
\[
m(t) = \sum_{j=1}^{n} \alpha_j\, h_j(t) , \tag{3.23}
\]

where n is typically smaller than the number of observations m. We substitute

this approximation into the integral equation to obtain
\[
d(s_i) = \int_a^b g_i(t) \sum_{j=1}^{n} \alpha_j\, h_j(t)\, dt \tag{3.24}
\]
\[
= \sum_{j=1}^{n} \alpha_j \int_a^b g_i(t)\, h_j(t)\, dt \quad (i = 1, 2, \ldots, m) . \tag{3.25}
\]

This leads to the m by n linear system
\[
G \alpha = d ,
\]
where
\[
G_{i,j} = \int_a^b g_i(t)\, h_j(t)\, dt .
\]
If we define the dot product or inner product of two functions to be
\[
f \cdot g = \int_a^b f(x)\, g(x)\, dx , \tag{3.27}
\]
then the corresponding norm is
\[
\|f\|_2 = \left( \int_a^b f(x)^2\, dx \right)^{1/2} . \tag{3.28}
\]
This norm can be very useful in approximating functions. The corresponding measure of the difference between functions f and g is
\[
\|f - g\|_2 = \left( \int_a^b \left( f(x) - g(x) \right)^2 dx \right)^{1/2} . \tag{3.29}
\]

If it happens that our basis functions hj (x) are orthonormal with respect to
this inner product, then the projection of gi (t) onto the space H spanned by
the basis is

\[
\mathrm{proj}_H\, g_i(t) = (g_i \cdot h_1)\, h_1(t) + (g_i \cdot h_2)\, h_2(t) + \cdots + (g_i \cdot h_n)\, h_n(t) .
\]

The elements in the G matrix are given by the same dot products

Gi,j = gi · hj (3.30)

Thus we have effectively projected the original representers onto our function
space H.
A variety of basis functions have been used to discretize integral equations
including sines and cosines, spherical harmonics, B-splines, and wavelets. In
selecting the basis functions, it is important to select a basis that can reasonably
represent likely models. The basis functions should be linearly independent, so
that a function can be written in terms of the basis functions in exactly one
way, and (3.23) is thus unique. It is also desirable to use an orthonormal basis.
The selection of an appropriate basis for a particular problem is a fine art
that requires detailed knowledge of the problem as well as of the behavior of
the basis functions. Beware that a poorly selected basis may not adequately approximate the solution, resulting in an estimated model m(t) which is very inaccurate.

Example 3.5
Consider the discretization of a Fredholm integral equation of the first kind in which the interval of integration is [0, ∞):
\[
d(s) = \int_0^\infty g(s, t)\, m(t)\, dt .
\]

One simple approach to this problem would be to use a change of

variables from t to x so that the x interval becomes finite. We could
then use simple collocation to discretize the problem. Unfortunately,
this method often introduces singularities in the integrand that are
difficult to handle numerically.
Instead, we will use a more sophisticated approach involving the
Laguerre polynomials. The Laguerre basis consists of polynomials
Li(t), where
\[
L_{-1}(t) = 0 , \qquad L_0(t) = 1 ,
\]
\[
L_{n+1}(t) = \frac{2n + 1 - t}{n + 1}\, L_n(t) - \frac{n}{n + 1}\, L_{n-1}(t) \quad (n = 0, 1, \ldots) .
\]
The first few polynomials include L1 (t) = 1−t, L2 (t) = 1−2t+t2 /2,
and L3 (t) = 1 − 3t + 3t2 /2 − t3 /6.
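The three-term recurrence is straightforward to implement; a Python sketch (the function name is ours) that reproduces the polynomials listed above:

```python
import numpy as np

def laguerre(n, t):
    """Evaluate L_0, ..., L_n at the points t via the three-term recurrence."""
    Lprev, L = np.zeros_like(t), np.ones_like(t)   # L_{-1} = 0, L_0 = 1
    out = [L]
    for k in range(n):
        # L_{k+1} = ((2k + 1 - t) L_k - k L_{k-1}) / (k + 1)
        Lprev, L = L, ((2 * k + 1 - t) * L - k * Lprev) / (k + 1)
        out.append(L)
    return out
```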
We will use a slightly modified inner product on this interval:
\[
f \cdot g = \int_0^\infty f(x)\, g(x)\, e^{-x}\, dx .
\]
The weighting factor of e^{-x} helps to ensure that the integral will converge, but has the effect of down-weighting the importance of values of the function for large x.
The Laguerre polynomial basis functions are orthonormal in the sense that
\[
\int_0^\infty L_n(t)\, L_m(t)\, e^{-t}\, dt = \begin{cases} 0 & m \ne n \\ 1 & m = n \end{cases} .
\]

Having selected the Laguerre polynomials (starting with L0 ) as our

basis, we can expand m(t) in terms of them. In order to avoid an
infinite sum, we will truncate the expansion after n terms.
\[
m(t) = \sum_{j=1}^{n} \alpha_j\, L_{j-1}(t)\, e^{-t} ,
\]
and then express the integral as
\[
d(s_i) = \int_0^\infty g_i(t)\, m(t)\, dt = \sum_{j=1}^{n} \alpha_j \int_0^\infty g_i(t)\, L_{j-1}(t)\, e^{-t}\, dt .
\]

Depending on gi (t), the integrals inside the sum can be evaluated

exactly using analytical methods or approximately using numerical
methods. In any case, we obtain a linear system of equations

Gα = d ,

where
\[
G_{i,j} = \int_0^\infty g_i(t)\, L_{j-1}(t)\, e^{-t}\, dt \quad (i = 1, \ldots, m;\ j = 1, \ldots, n) .
\]

In terms of our inner product, these are

Gi,j = gi · Lj−1 ,

which are exactly the coefficients of the Lj−1 in the orthogonal ex-
pansion of gi (t).
In this example, we selected an inner product and orthogonal basis
that gives less weight to values of m(t) for large t. We are effectively
measuring how close two functions f and g are by
\[
\|f - g\| = \left( \int_0^\infty \left( f(x) - g(x) \right)^2 e^{-x}\, dx \right)^{1/2} .
\]

As a result, we should expect to obtain a solution which becomes

less accurate as t increases. This might be appropriate if we are
uninterested in what happens for large values of t. However, if the
solution at large values of t is important, then this would not be an
appropriate norm.

3.5 Exercises
1. Consider the following data set:

y d(y)
0.025 0.2388
0.075 0.2319
0.125 0.2252
0.175 0.2188
0.225 0.2126
0.275 0.2066
0.325 0.2008
0.375 0.1952
0.425 0.1898
0.475 0.1846
0.525 0.1795
0.575 0.1746
0.625 0.1699
0.675 0.1654
0.725 0.1610
0.775 0.1567
0.825 0.1526
0.875 0.1486
0.925 0.1447
0.975 0.1410
This data can be found in the file ifk.mat.
The function d(y), 0 ≤ y ≤ 1, is related to an unknown function m(x), 0 ≤ x ≤ 1, through the IFK
\[
d(y) = \int_0^1 x\, e^{-xy}\, m(x)\, dx . \tag{3.31}
\]

(a) Using the data provided, discretize the integral equation using simple
collocation and solve the resulting system of equations.
(b) What is the condition number for this system of equations? Given
that the data d(y) are only accurate to about 4 digits, what does this
tell you about the accuracy of your solution?

2. Use the Gram matrix technique to discretize the integral equation from
problem 3.1.

(a) Solve the resulting linear system of equations, and plot the resulting model.
(b) What was the condition number of Γ? What does this tell you about
the accuracy of your solution?

3. Show that if the representers gi (t) are linearly independent, then the Gram
matrix Γ is nonsingular.

3.6 Notes and Further Reading

Techniques for discretizing integral equations are discussed in [Par94, Two96].

Chapter 4

Rank Deficiency and Ill-Conditioning


4.1 The SVD and the Generalized Inverse

We next describe the general solution to linear least squares problems, especially
those that are ill-conditioned and/or rank deficient.
One method of analyzing and solving such systems is the singular value
decomposition, or SVD. The SVD is a decomposition whereby a general m
by n system matrix, G, relating model and data

Gm = d (4.1)

is factored into
\[
G = U S V^T , \tag{4.2}
\]
where:
• U is an m by m orthogonal matrix with columns that are unit basis vectors

spanning the data space, Rm .

• V is an n by n orthogonal matrix with columns that are basis vectors

spanning the model space, Rn .

• S is an m by n diagonal matrix.

The SVD matrices can be computed in MATLAB with the svd command.
The singular values along the diagonal of S are customarily arranged in
decreasing size, s1 ≥ s2 ≥ . . . ≥ smin(m,n) ≥ 0. Some of the singular values may
be zero. If only the first p singular values are nonzero, we can partition S as
\[
S = \begin{bmatrix} S_p & 0 \\ 0 & 0 \end{bmatrix} , \tag{4.3}
\]


where Sp is a p by p diagonal matrix composed of the positive singular values. Now, expand the SVD representation of G:
\[
G = U S V^T \tag{4.4}
\]
\[
= [U_{\cdot,1}, U_{\cdot,2}, \ldots, U_{\cdot,m}] \begin{bmatrix} S_p & 0 \\ 0 & 0 \end{bmatrix} [V_{\cdot,1}, V_{\cdot,2}, \ldots, V_{\cdot,n}]^T \tag{4.5}
\]
\[
= [U_p, U_0] \begin{bmatrix} S_p & 0 \\ 0 & 0 \end{bmatrix} [V_p, V_0]^T , \tag{4.6}
\]
where Up denotes the first p columns of U, U0 denotes the last m − p columns of U, Vp denotes the first p columns of V, and V0 denotes the last n − p columns of V. Because the last m − p columns of U and the last n − p columns of V in (4.6) are multiplied by zeros in S, we can simplify the SVD of G into its compact form
\[
G = U_p S_p V_p^T . \tag{4.8}
\]
Take any vector y in R(G). Using (4.8), we can write y as

y = Gx = Up Sp VpT x .


Thus every vector y in R(G) can be written as y = Up z where z = Sp VpT x.

Writing out this matrix vector multiplication, we see that any vector y in R(G)
can be written as a linear combination of the columns of Up .
\[
y = \sum_{i=1}^{p} z_i\, U_{\cdot,i} . \tag{4.10}
\]

Since the columns of Up span R(G), are linearly independent, and are orthonor-
mal, they form an orthonormal basis for R(G). Because this basis has p vectors,
rank(G) = p.
Since U is an orthogonal matrix, the columns of U form an orthonormal basis
for Rm . We have already seen that the p columns of Up form an orthonormal
basis for R(G). By theorem (A.5), N (GT ) + R(G) = Rm , so the remaining
m − p columns of U0 form an orthonormal basis for N (GT ). Similarly, because
GT = Vp Sp UpT, the columns of Vp form an orthonormal basis for R(GT ) and the columns of V0 form an orthonormal basis for N (G).
Two other important properties of the SVD are similar to properties of
eigenvalues and eigenvectors. Since the columns of V are orthogonal,

VT V·,i = ei .

GV·,i = USVT V·,i = USei = si U·,i (4.11)
GT U·,i = VST UT U·,i = VST ei = si V·,i . (4.12)

There is an important connection between the singular values of G and the eigenvalues of the matrices GGT and GT G:
\[
G G^T U_{\cdot,i} = G s_i V_{\cdot,i} \tag{4.13}
\]
\[
= s_i\, G V_{\cdot,i} \tag{4.14}
\]
\[
= s_i^2\, U_{\cdot,i} , \tag{4.15}
\]
and
\[
G^T G V_{\cdot,i} = s_i^2\, V_{\cdot,i} . \tag{4.16}
\]
These relations show that we could, in theory, compute the SVD by finding
the eigenvalues and eigenvectors of GT G and GGT . In practice, more efficient
specialized algorithms are used [Dem97, GL96, TB97].
The SVD can be used to compute a generalized inverse of G, which is
called the Moore-Penrose Pseudoinverse because it has desirable properties
originally identified by Moore and Penrose [Moo20, Pen55]. The generalized
inverse is
\[
G^{\dagger} = V_p S_p^{-1} U_p^T . \tag{4.17}
\]
Using (4.17), let
\[
m^{\dagger} = V_p S_p^{-1} U_p^T d = G^{\dagger} d . \tag{4.18}
\]
We will show that m† is a least squares solution. Among the desirable properties
of (4.18) is that G† , and hence m† , always exist, unlike, for example, the inverse
of GT G in the normal equations (2.4) which does not exist when G is not of
full rank.
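The compact-form pseudoinverse (4.17) can be computed directly from the SVD. In the Python sketch below, the relative tolerance used to decide which singular values count as nonzero is a common practical choice, not something specified in the text:

```python
import numpy as np

def generalized_inverse(G, tol=1e-12):
    """Moore-Penrose pseudoinverse via the compact SVD: G+ = Vp Sp^-1 Up^T."""
    U, s, VT = np.linalg.svd(G, full_matrices=False)
    # Treat singular values below a relative tolerance as zero (rank p).
    p = int(np.sum(s > tol * s[0]))
    # Keep only the first p columns of U and V and the p positive
    # singular values: G+ = Vp diag(1/s_1, ..., 1/s_p) Up^T.
    return VT[:p].T @ np.diag(1.0 / s[:p]) @ U[:, :p].T
```

Unlike (GTG)⁻¹, this construction works whether or not G has full rank, and it agrees with library pseudoinverse routines.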
To encapsulate what the SVD tells us about our linear system, G, and the
corresponding generalized inverse system G† , consider four cases:

1. Both the model and data null spaces, N (G) and N (GT ) are trivial. Up
and Vp are both square orthogonal matrices, so that UT = U−1 , and
VT = V−1 . (4.18) gives

\[
G^{\dagger} = V_p S_p^{-1} U_p^T = (U_p S_p V_p^T)^{-1} = G^{-1} , \tag{4.19}
\]
which is the matrix inverse for a square full-rank matrix (m = n = rank(G)). The solution is unique, and the data are fit exactly.

2. N (G) is nontrivial, but N (GT ) is trivial. In this case UpT = Up−1 and VpT Vp = Ip

(the p by p identity matrix). G applied to the generalized inverse solution

\[
G m^{\dagger} = G G^{\dagger} d \tag{4.20}
\]
\[
= U_p S_p V_p^T V_p S_p^{-1} U_p^T d \tag{4.21}
\]
\[
= U_p S_p I_p S_p^{-1} U_p^T d \tag{4.22}
\]
\[
= d , \tag{4.23}
\]

so the data are fit exactly. However, the solution is nonunique because of the existence of the nontrivial model null space N (G). The general solution is the sum of m† and an arbitrary model component in N (G):
\[
m = m^{\dagger} + m_0 = m^{\dagger} + \sum_{i=p+1}^{n} \alpha_i\, V_{\cdot,i} . \tag{4.24}
\]

Since the columns of V are orthonormal, the square of the 2-norm of a
general solution m is

‖m‖2^2 = ‖m†‖2^2 + Σ_{i=p+1}^{n} αi^2 ≥ ‖m†‖2^2 , (4.25)

where we have equality only if all of the model null space coefficients αi
are zero. The generalized inverse solution, which lies in R(GT ) is thus
a minimum 2–norm or minimum length solution for a rank deficient
system with m = rank(G) < n.
We can also write this solution in terms of G and G^T:

m† = Vp Sp^-1 Up^T d (4.26)
= Vp Sp Up^T Up Sp^-2 Up^T d (4.27)
= G^T (Up Sp^-2 Up^T) d (4.28)
= G^T (GG^T)^-1 d . (4.29)

In practice it is better to compute a solution using the SVD than to use
the above formula involving (GG^T)^-1 because of round-off issues.
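A small numerical check of case 2 (underdetermined, full row rank), with made-up values: it confirms that m† fits the data exactly, agrees with (4.29), and is shorter than any other exact solution.

```python
import numpy as np

# 2 equations, 4 unknowns, full row rank: nontrivial N(G), trivial N(G^T).
G = np.array([[1.0, 2.0, 0.0, 1.0],
              [0.0, 1.0, 1.0, 3.0]])
d = np.array([1.0, 2.0])

m_dagger = np.linalg.pinv(G) @ d
assert np.allclose(G @ m_dagger, d)                              # data fit exactly
assert np.allclose(m_dagger, G.T @ np.linalg.solve(G @ G.T, d))  # (4.29)

# Adding any null-space component m0 gives another exact solution,
# but a longer one, since m0 is orthogonal to m_dagger.
_, _, Vt = np.linalg.svd(G)
m0 = Vt[2:, :].T @ np.array([1.0, -2.0])   # arbitrary element of N(G)
assert np.allclose(G @ m0, 0)
assert np.linalg.norm(m_dagger + m0) > np.linalg.norm(m_dagger)
```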

3. N(G) is trivial but N(G^T) is nontrivial, and R(G) is a strict subset of
R^m. Here

Gm† = Up Sp Vp^T Vp Sp^-1 Up^T d (4.30)
= Up Up^T d . (4.31)

The product Up UTp d gives the projection of d onto R(G). That is, Gm†
is equal to the projection of d onto the range of G. This shows that the
generalized inverse gives a least squares solution to Gm = d.
This solution is exactly the same as that obtained from the normal equa-
tions, because

(G^T G)^-1 = (Vp Sp Up^T Up Sp Vp^T)^-1 (4.32)
= (Vp Sp^2 Vp^T)^-1 (4.33)
= Vp Sp^-2 Vp^T , (4.34)

so that

m† = G† d (4.36)
= Vp Sp^-1 Up^T d (4.37)
= Vp Sp^-2 Vp^T Vp Sp Up^T d (4.38)
= (G^T G)^-1 G^T d . (4.39)

Again, it is better in practice to use the generalized inverse solution than
to calculate (G^T G)^-1 and use the normal equation solution, because
of round-off issues.
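The corresponding check for case 3 (overdetermined, full column rank), again with illustrative numbers: the generalized inverse solution matches the normal equations and the library least squares routine, and the residual is orthogonal to R(G).

```python
import numpy as np

# 4 equations, 2 unknowns, full column rank: trivial N(G), nontrivial N(G^T).
G = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
d = np.array([1.1, 1.9, 3.2, 3.9])

m_dagger = np.linalg.pinv(G) @ d
m_normal = np.linalg.solve(G.T @ G, G.T @ d)   # normal equations, (4.39)
m_lstsq, *_ = np.linalg.lstsq(G, d, rcond=None)
assert np.allclose(m_dagger, m_normal)
assert np.allclose(m_dagger, m_lstsq)

# G m_dagger = Up Up^T d is the projection of d onto R(G), so the residual
# d - G m_dagger lies in N(G^T) and is orthogonal to every column of G.
r = d - G @ m_dagger
assert np.allclose(G.T @ r, 0)
```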

4. Both N(G^T) and N(G) are nontrivial, and rank(G) is less than both
m and n. In this case, the generalized inverse solution encapsulates the
behavior of both of the two previous cases and minimizes both ‖Gm − d‖2
and ‖m‖2.
As in case 3,

Gm† = Up Sp Vp^T Vp Sp^-1 Up^T d (4.40)
= Up Up^T d (4.41)
= proj_R(G) d . (4.42)

Thus m† is a least squares solution to Gm = d.

As in case 2, any least squares model m can be written as

m = m† + m0 = m† + Σ_{i=p+1}^{n} αi V·,i , (4.43)

with

‖m‖2^2 = ‖m†‖2^2 + Σ_{i=p+1}^{n} αi^2 ≥ ‖m†‖2^2 , (4.44)

so m† is the least squares solution of minimal length.

The generalized inverse thus produces an inverse solution that always exists,
and that is both least-squares and minimum length, regardless of the relative
sizes of rank(G), m and n. Relationships between the subspaces R(G), N (GT ),
R(GT ), N (G), and the operators G and G† , are shown schematically in Figure
4.1. Table 4.1 summarizes the SVD and its properties.
To examine the significance of the N(G) subspace, which is spanned by the
columns of V0, consider an arbitrary model m0 lying in N(G):

m0 = Σ_{i=p+1}^{n} αi V·,i . (4.45)

Figure 4.1: SVD model and data space mappings, where G is the forward
operator and G† is the generalized inverse. N (GT ) and N (G) are the data and
model null spaces, respectively.

Matrix  Size        Properties
G       m by n      G = USV^T = Up Sp Vp^T .
U       m by m      Orthogonal matrix; U = [Up, U0].
S       m by n      Diagonal matrix of singular values; Si,i = si .
V       n by n      Orthogonal matrix; V = [Vp, V0].
Up      m by p      Columns of Up form a basis for R(G).
Sp      p by p      Diagonal matrix of nonzero singular values.
Vp      n by p      Columns of Vp form a basis for R(G^T).
U0      m by m − p  Columns of U0 form a basis for N(G^T).
V0      n by n − p  Columns of V0 form a basis for N(G).
U·,i    m by 1      Column i of U; eigenvector of GG^T with eigenvalue si^2.
V·,i    n by 1      Column i of V; eigenvector of G^T G with eigenvalue si^2.
G†      n by m      Pseudoinverse of G; G† = Vp Sp^-1 Up^T .
m†      n by 1      Generalized inverse solution m† = G† d = Σ_{i=1}^{p} ((U·,i^T d)/si) V·,i .

Table 4.1: Summary of the SVD and its properties.


G operating on such a model gives

Gm0 = Up Sp Vp^T m0 = Up Sp Σ_{i=p+1}^{n} αi Vp^T V·,i = 0 , (4.46)

because the columns of V0 and Vp are orthogonal. Since the system Gm = d

is linear, we can add any m0 to a general model, m, and not change the fit of
the model to the data, because

G(m + m0 ) = Gm + Gm0 = d + 0 . (4.47)

For this reason, the null space of G is commonly called the model null space.

The existence of a nontrivial model null space (one that includes more than
just the zero vector) is at the heart of solution nonuniqueness. General models
in Rn include an infinite number of solutions that will fit the data equally
well, because model components in N(G) have no effect on the data fit. To select a
particular preferred solution from this infinite set thus requires more constraints
(such as minimum length or smoothing constraints) than are encoded in the
matrix G.
To see the significance of the N(G^T) subspace, consider an arbitrary data
vector, d0, lying in N(G^T):

d0 = Σ_{i=p+1}^{m} βi U·,i . (4.48)

The generalized inverse operating on such a data vector gives

m† = Vp Sp^-1 Up^T d0 = Vp Sp^-1 Σ_{i=p+1}^{m} βi Up^T U·,i = 0 , (4.49)

because the columns of U are orthogonal. N (GT ) is a subspace of Rm consisting

of all vectors d0 which have no influence on the generalized inverse model, m† .
If p < m there are an infinite number of potential data sets that will produce
the same model when (4.18) is applied. N (GT ) is called the data null space.
MATLAB has a pinv command that generates the G† matrix, rejecting
singular values that are less than a default or specified tolerance.
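NumPy's analogue (used in the numerical sketches in this chapter) is np.linalg.pinv, whose rcond argument plays the same role as pinv's tolerance: singular values smaller than rcond times the largest one are treated as zero. For example:

```python
import numpy as np

G = np.array([[1.0, 1.0],
              [1.0, 1.0 + 1e-12]])   # nearly rank 1

# A tight tolerance keeps the tiny second singular value; a looser tolerance
# rejects it, just as MATLAB's pinv(G, tol) would.
G_dag_tight = np.linalg.pinv(G, rcond=1e-15)
G_dag_loose = np.linalg.pinv(G, rcond=1e-8)

# Keeping the tiny singular value produces an enormous pseudoinverse,
# a first hint of the instability discussed in the next section.
assert np.linalg.norm(G_dag_tight) > 1e6 * np.linalg.norm(G_dag_loose)
```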

4.2 Covariance and Resolution of the Generalized Inverse Solution
The generalized inverse always gives us a solution, m† , with well-determined
properties, but it is essential to investigate how faithful a representation any
model is likely to be of the true situation.

In Chapter 2, we found that under the assumption of independent normally
distributed measurement errors, the least squares solution was an unbiased es-
timator of the true model, and that the estimated model parameters had a
multivariate normal distribution with covariance

Cov(mL2 ) = σ 2 (GT G)−1 .

We could attempt the same analysis for the generalized inverse solution m† .
The covariance matrix would be given by

Cov(m† ) = G† Cov(d)(G† )T .

However, unless the least squares problem is of full rank, the generalized inverse
solution is not an unbiased estimator of the true solution. This is because
the true model may have nonzero projections in the model null space which
are not represented in the generalized inverse solution. In practice, the bias
introduced by this can be far larger than the uncertainty due to measurement
error. Estimating this bias is a hard problem, which we will address in a later chapter.
The concept of resolution is another way to characterize the quality
of the generalized inverse solution. In this approach we see how closely the
generalized inverse solution matches a given model, assuming that there are no
errors in the data. We begin with any model m. By multiplying G times m,
we can find a corresponding data vector d. If we then multiply G† times d, we
get back a generalized inverse solution m†

m† = G† Gm . (4.50)

We would obviously like to get back our original model (m† = m). However,

since the original model may have had a nonzero projection onto the model
null space N (G), m† will not in general be equal to m. The model space
resolution matrix is given by

Rm = G† G. (4.51)

In terms of the SVD,

Rm = Vp Sp^-1 Up^T Up Sp Vp^T (4.52)
= Vp Vp^T . (4.53)

If N (G) is trivial, then rank(G) = p = n, and Rm is the n by n identity

matrix. In this case the original model is recovered exactly and we say that
the resolution is perfect. If N (G) is a nontrivial subspace of Rn , then p =
rank(G) < n, so that Rm is not the identity matrix. The model space resolution
matrix is instead a symmetric matrix describing how the generalized inverse
solution smears out the original model, m, into a recovered model, m† .

In practice, the model space resolution matrix is used in two different ways. If
the diagonal entries of Rm are close to one, then we should have good resolution.
If any of the diagonal entries are close to zero, then the corresponding model
parameters will be poorly resolved. We can also multiply Rm times a particular
model m to see how that model would be resolved by the inverse solution.
We can perform the operations of multiplication by G† and G in the opposite
order to obtain a data space resolution matrix, Rd .

d† = Gm† (4.55)
= GG† d (4.56)
= Rd d . (4.57)

The data space resolution matrix can be written as

Rd = Up Up^T . (4.59)

If N (GT ) contains only the zero vector, then p = rank(G) = m, and Rd = I.

In this case, d† = d, and the generalized inverse solution m† fits the data exactly.
However, if N (GT ) is nontrivial, then p = rank(G) < m, and Rd is not the
identity matrix. In this case, d† is not equal to d, and m† does not exactly fit
the data d.
Note that the model and data space resolution matrices (4.54) and (4.59) do not
depend on specific data values, but are exclusively properties of the matrix G.
They reflect the physics and geometry of the experiment, and can thus be calculated
during the design phase of an experiment to assess expected model resolution
and ability to fit the data, respectively.
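Both matrices follow directly from a compact SVD. A sketch (function name and example values are ours):

```python
import numpy as np

def resolution_matrices(G, tol=1e-10):
    """Model and data resolution matrices Rm = Vp Vp^T and Rd = Up Up^T."""
    U, s, Vt = np.linalg.svd(G)
    p = int(np.sum(s > tol * s[0]))
    Rm = Vt[:p, :].T @ Vt[:p, :]
    Rd = U[:, :p] @ U[:, :p].T
    return Rm, Rd

# Rank-deficient example: row 3 and column 3 are sums of the first two.
G = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 2.0]])
Rm, Rd = resolution_matrices(G)
Gdag = np.linalg.pinv(G)
assert np.allclose(Rm, Gdag @ G)    # Rm = G-dagger G, (4.51)
assert np.allclose(Rd, G @ Gdag)    # Rd = G G-dagger
# Neither is the identity here, since both null spaces are nontrivial.
assert not np.allclose(Rm, np.eye(3)) and not np.allclose(Rd, np.eye(3))
```

Note that both Rm and Rd can be formed before any data are collected, since they depend only on G.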

4.3 Instability of the Generalized Inverse Solution
The generalized inverse solution (4.18) is selected so that its projection onto
N(G) is 0. We have effectively multiplied each of the vectors in V0 by 0 in
computing m†. However, m† does include terms involving the singular vectors in Vp
with very small singular values. As a practical matter, it can be very difficult to
distinguish between zero singular values and extremely small singular values. In
analyzing the generalized inverse solution we will consider the singular value
spectrum, which is simply the set of singular values of G.
It turns out that small singular values cause the generalized inverse solution
to be extremely sensitive to small amounts of noise in the data vector d. We
can quantify the instabilities that are brought about by small singular values by
rewriting the generalized inverse solution (4.18) in a form that makes the effect
of small singular values explicit. We start with the formula for the generalized
inverse solution
m† = Vp Sp^-1 Up^T d . (4.60)

The elements of the vector Up^T d are simply the dot products of the first p
columns of U with d. This vector can be written as

          [ (U·,1)^T d ]
Up^T d =  [ (U·,2)^T d ]     (4.61)
          [    ...     ]
          [ (U·,p)^T d ]

When we multiply Sp^-1 times this vector, we obtain

               [ (U·,1)^T d / s1 ]
Sp^-1 Up^T d = [ (U·,2)^T d / s2 ]     (4.62)
               [       ...       ]
               [ (U·,p)^T d / sp ]

Finally, when we multiply Vp times this vector, we obtain a linear combination
of the columns of Vp that can be written as

m† = Vp Sp^-1 Up^T d = Σ_{i=1}^{p} ((U·,i)^T d / si) V·,i . (4.63)

In the presence of random noise, d will generally have a nonzero projection

onto each of the directions specified by the columns of U (the rows of UT ). The
presence of a very small si in the denominator of (4.63) can thus give us a very
large coefficient for the corresponding model space basis vector V·,i , and these
basis vectors can thus dominate the solution. In the worst case, the generalized
inverse solution is just a noise amplifier, and the answer is nonphysical and
practically useless.
A simple measure of the instability of the solution is the condition number
(see Appendix A). Suppose that we have a data vector d and an associated
generalized inverse solution m† = G† d. If we consider a slightly perturbed
data vector d′ and its associated generalized inverse solution m′† = G† d′, then

m† − m′† = G† (d − d′)

and

‖m† − m′†‖2 ≤ ‖G†‖2 ‖d − d′‖2 .
From (4.63), it is clear that the largest difference in the inverse models will
occur when d − d′ is in the direction U·,p. If

d − d′ = α U·,p ,

then

‖d − d′‖2 = α .

We can then compute the effect on the generalized inverse solution as

m† − m′† = (α/sp) V·,p ,

so that

‖m† − m′†‖2 = α/sp .

Thus we have a bound on the instability of the generalized inverse solution:

‖m† − m′†‖2 ≤ (1/sp) ‖d − d′‖2 .
Similarly, we can see that the generalized inverse model is smallest in norm,
relative to the data, when d points in a direction parallel to U·,1. Thus

‖m†‖2 ≥ ‖d‖2 / s1 .

Combining these inequalities, we obtain

‖m† − m′†‖2 / ‖m†‖2 ≤ (s1/sp) ‖d − d′‖2 / ‖d‖2 . (4.64)

The bound (4.64) is applicable to pseudoinverse solutions, regardless of what

value of p we use. As we decrease p, the solution becomes more stable. However,
this stability comes at the expense of reducing the dimension of the subspace
of Rn from which we can find a solution. As we reduce p, the model resolution
matrix becomes less like the identity matrix. Thus there is a tradeoff between
sensitivity to noise and model resolution.
The condition number of G is defined as

cond(G) = s1 / sk ,
where k = min(m, n). The MATLAB command cond can be used to compute
the condition number of G. If G is of full rank, and we use all of the singular
values in the pseudoinverse solution (p = k), then the condition number is
exactly the coefficient in (4.64). If G is of less than full rank, then the condition
number is effectively infinite.
Note that the condition number depends only on the matrix G and does not
take into account the actual data. Just as with the resolution matrix, or the
covariance matrix in linear regression, we can compute the condition number
before we gather any data.
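A numerical illustration of (4.64), with a synthetic G whose singular values we choose ourselves: perturbing d along U·,p attains the worst-case amplification.

```python
import numpy as np

# Build G = U diag(s) V^T with known singular values.
rng = np.random.default_rng(1)
U, _ = np.linalg.qr(rng.standard_normal((4, 4)))
V, _ = np.linalg.qr(rng.standard_normal((4, 4)))
s = np.array([10.0, 1.0, 0.1, 1e-4])
G = U @ np.diag(s) @ V.T

assert np.isclose(np.linalg.cond(G), s[0] / s[-1])   # cond(G) = s1/sk

# Take d along U[:, 0] and perturb it along U[:, -1], the direction
# associated with the smallest singular value.
d = U[:, 0]
dd = 1e-3 * U[:, -1]
Gdag = np.linalg.pinv(G)
m, m_pert = Gdag @ d, Gdag @ (d + dd)

rel_model = np.linalg.norm(m - m_pert) / np.linalg.norm(m)
rel_data = np.linalg.norm(dd) / np.linalg.norm(d)
# The bound (4.64) holds with equality for this choice of d and d'.
assert np.isclose(rel_model, np.linalg.cond(G) * rel_data)
```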

A condition that ensures solution stability is the discrete Picard condition
[Han98]. The discrete Picard condition is satisfied when the dot products of
the columns of Up and the data vector decay to zero more quickly than the
singular values, si. Under this condition, we should not see instability due to
contributions from the smallest singular values.
If the discrete Picard condition is not satisfied, we may still be able to recover
a usable solution by truncating (4.63) at some highest term p0 < p, to produce
a truncated SVD, or TSVD, solution. One way to decide when to truncate
(4.63) is to apply the discrepancy principle, where we consider all solutions
which minimize ‖m‖, subject to fitting the data to some tolerance

‖Gw m − dw‖2 ≤ δ , (4.65)

where Gw and dw are the weighted system matrix and data vector (2.16). A
simple way to choose δ for independent normal data errors is to note that, in this
case, the square of the weighting-normalized 2-norm misfit measure (4.65) will
be distributed as χ2 with ν = m − n degrees of freedom. In this case we can
choose a δ corresponding to some confidence interval for that χ2 distribution
(e.g., 95%). The truncated SVD solution is then straightforward: we sum up
terms in (4.63) until the discrepancy principle is just met. This produces a
stable solution that can be statistically justified as being consistent with both
modeling assumptions and data errors.
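A sketch of this procedure (function name ours; here applied to an unweighted toy system, with δ supplied by the caller):

```python
import numpy as np

def tsvd(G, d, delta):
    """Truncated SVD solution: accumulate terms of (4.63), largest singular
    values first, stopping as soon as ||G m - d||_2 <= delta."""
    U, s, Vt = np.linalg.svd(G)
    m = np.zeros(G.shape[1])
    for i in range(len(s)):
        if np.linalg.norm(G @ m - d) <= delta:
            break
        m = m + (U[:, i] @ d / s[i]) * Vt[i, :]
    return m

# Toy problem: the third singular value is tiny, and the corresponding
# component of d is pure "noise" at the 1e-3 level.
G = np.diag([1.0, 0.5, 1e-6])
d = np.array([1.0, 0.5, 1e-3])
m_full = np.linalg.solve(G, d)        # noise amplified to roughly 1000 in m_full[2]
m_trunc = tsvd(G, d, delta=2e-3)      # discrepancy principle stops after two terms
assert np.allclose(m_trunc, [1.0, 1.0, 0.0])
```

The truncated solution deliberately misfits the data by about the noise level, rather than chasing the noisy third component.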
This solution will not fit the data as well as solutions that include the small
singular value model space basis vectors. Perhaps surprisingly, this is what we
should generally do when we solve ill-posed problems with noise. If we fit the
data vector exactly or nearly exactly, we are in fact overfitting the data and
perhaps letting the noise control major features of the model. This corresponds
to using the available freedom of a large model space to produce a χ2 value that
is too good (Chapter 2). Truncating (4.63) also decreases the used dimension
of the model space by p − p0. We should thus be careful to acknowledge that
true model projections in the directions of the newly omitted columns of V
cannot appear in the truncated SVD solution, even if they are present in the
true model.
The truncated SVD solution is but one example of regularization, whereby
solutions are selected to sacrifice fit to the data for solution stability. Much of
the utility of inverse solutions in many situations, as we shall subsequently see,
hinges on the application of good regularization strategies.

4.4 Rank Deficient Problems

A linear least squares problem is said to be rank deficient if there is a clear
distinction between the non-zero and zero singular values, and the number of
non-zero singular values, p, is less than the minimum of m and n. In practice,
the computed singular values will often include values that are extremely small
but not quite zero, because of round-off errors in the computation of the SVD.
If there is a substantial gap between the largest of these tiny singular values
and the smallest truly non-zero singular value, then it can be easy to distinguish
between the non-zero and zero singular values. Rank deficient problems can be
solved in a straightforward manner by applying the generalized inverse solution.
Instability of the resulting solution due to small singular values is seldom an
issue.

Figure 4.2: A simple tomography example (revisited)

Example 4.1
With the tools of the SVD at our disposal, let us reconsider the
straight-line tomography example (example 1.3; Figure 4.2), where
we had a rank deficient system in which we were constraining a
9-parameter model with 8 observations:

       [ 1   0   0   1   0   0   1   0   0  ] [ s11 ]   [ t1 ]
       [ 0   1   0   0   1   0   0   1   0  ] [ s12 ]   [ t2 ]
       [ 0   0   1   0   0   1   0   0   1  ] [ s13 ]   [ t3 ]
       [ 1   1   1   0   0   0   0   0   0  ] [ s21 ]   [ t4 ]
Gm =   [ 0   0   0   1   1   1   0   0   0  ] [ s22 ] = [ t5 ] .
       [ 0   0   0   0   0   0   1   1   1  ] [ s23 ]   [ t6 ]
       [ √2  0   0   0   √2  0   0   0   √2 ] [ s31 ]   [ t7 ]
       [ 0   0   0   0   0   0   0   0   √2 ] [ s32 ]   [ t8 ]
                                              [ s33 ]

The singular values of G are

>> svd(G)

ans =
The smallest of the computed singular values is nonzero only because of
round-off error in the computation of the singular values. In this case it is easy
to see that this number is so small that it is effectively 0. The ratio of
the largest nonzero singular value to the smallest nonzero singular
value is about 6, and so the generalized inverse solution will not be
particularly unstable in the presence of noise.
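This example is easy to reproduce numerically; here is a NumPy sketch that builds the G above and confirms the rank and null space dimensions discussed below:

```python
import numpy as np

r2 = np.sqrt(2.0)
G = np.array([[1, 0, 0, 1, 0, 0, 1, 0, 0],
              [0, 1, 0, 0, 1, 0, 0, 1, 0],
              [0, 0, 1, 0, 0, 1, 0, 0, 1],
              [1, 1, 1, 0, 0, 0, 0, 0, 0],
              [0, 0, 0, 1, 1, 1, 0, 0, 0],
              [0, 0, 0, 0, 0, 0, 1, 1, 1],
              [r2, 0, 0, 0, r2, 0, 0, 0, r2],
              [0, 0, 0, 0, 0, 0, 0, 0, r2]])

U, s, Vt = np.linalg.svd(G)
p = np.linalg.matrix_rank(G)
assert p == 7                  # rank deficient: 8 rows, 9 columns, rank 7

V0 = Vt[p:, :].T               # 9 by 2 model null space basis
U0 = U[:, p:]                  # 8 by 1 data null space basis
assert np.allclose(G @ V0, 0)
assert np.allclose(G.T @ U0, 0)

# Model resolution matrix: the s33 parameter (last diagonal entry) is
# perfectly resolved; the other parameters are not.
Rm = Vt[:p, :].T @ Vt[:p, :]
assert np.isclose(Rm[8, 8], 1.0)
```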
Since rank(G) = 7, the problem is rank-deficient and the model null
space, N (G) is spanned by the two orthonormal vectors specified by
the 8th and 9th columns of V:

     [ −0.136  −0.385 ]
     [  0.385  −0.136 ]
     [ −0.249   0.521 ]
     [ −0.385   0.136 ]
V0 = [  0.136   0.385 ] .
     [  0.249  −0.521 ]
     [  0.521   0.249 ]
     [ −0.521  −0.249 ]
     [  0.000   0.000 ]
To obtain a geometric appreciation for the two model null space
vectors, we can reshape them into 3 by 3 matrices corresponding to
the geometry of the blocks (e.g., by using the MATLAB reshape
command):

reshape(m0,1, 3, 3) = [ −0.136  −0.385   0.521 ]
                      [  0.385   0.136  −0.521 ]
                      [ −0.249   0.249   0.000 ]

reshape(m0,2, 3, 3) = [ −0.385   0.136   0.249 ]
                      [ −0.136   0.385  −0.249 ] .
                      [  0.521  −0.521   0.000 ]

Recall that we can add any multiple of m0 to any solution and not
change our fit to the data. Three common features of the two model
null space basis vectors stand out:

1. The sums along all rows and columns of m0 are zero.
2. The upper left to lower right diagonal sum is zero.
3. There is no projection of m0 onto the m9 = s33 model space direction.

The zero sum conditions (1) and (2) arise because the average slownesses
along any of the rows or columns in those directions are unconstrained by
the ray path geometry. Paths passing through three blocks can only constrain
the average block value. The zero value for s33 (3) occurs because s33 is
uniquely constrained by the observation t8.
The lone data space null basis vector, specified by the 8th column of
U, is

     [ −0.408 ]
     [ −0.408 ]
     [ −0.408 ]
U0 = [  0.408 ] .
     [  0.408 ]
     [  0.408 ]
     [  0.000 ]
     [  0.000 ]
This is a basis vector in the data space that we can never fit with
any model, because it is inconsistent. Measurements of the form cU0,
where c is a constant, would require the average values of the si,· along
rows to decrease or increase while the average values of the s·,i along
columns simultaneously increase or decrease, and the upper left
to lower right diagonal average and the value of s33 cannot change
at all. This is impossible for any model to satisfy.
The n by n model space resolution matrix, Rm tells us about the
uniqueness of the generalized inverse solution parameters. The di-
agonal elements are indicative of how recoverable each parameter
value will be. The reshaped diagonal of Rm for this system is

reshape(diag(Rm), 3, 3) = [ 0.833  0.833  0.667 ]
                          [ 0.833  0.833  0.667 ] ,
                          [ 0.667  0.667  1.000 ]
which tells us that m9 = s33 is perfectly resolved, but that we can
expect loss of resolution (and hence smearing of the true model into
other blocks) for all of the other solution parameters. The full Rm
gives the specifics of how this smearing occurs
Rm =

[  0.833    0      0.167    0      0.167  −0.167   0.167  −0.167    0    ]
[    0    0.833    0.167  0.167     0    −0.167  −0.167   0.167    0    ]
[  0.167  0.167    0.667 −0.167  −0.167   0.333     0       0      0    ]
[    0    0.167   −0.167  0.833     0     0.167   0.167  −0.167    0    ]
[  0.167    0     −0.167    0     0.833   0.167  −0.167   0.167    0    ] .
[ −0.167 −0.167    0.333  0.167   0.167   0.667     0       0      0    ]
[  0.167 −0.167      0    0.167  −0.167     0     0.667   0.333    0    ]
[ −0.167  0.167      0   −0.167   0.167     0     0.333   0.667    0    ]
[    0      0        0      0       0       0       0       0    1.000  ]
As you can probably surmise, examining the entire model resolution
matrix becomes cumbersome in large problems. In such cases it is
instead common to perform a resolution test by generating syn-
thetic data for some known model of interest and then assessing the
recovery of that model in the inverse solution. One such synthetic
model commonly used in resolution tests is uniform or zero except
for a single perturbed model element. Examining the inverse recov-
ery of this model is commonly referred to as a spike or impulse
resolution test. For this example, consider the model perturbation

        [ 0 ]
        [ 0 ]
        [ 0 ]
        [ 0 ]
mtest = [ 1 ] .
        [ 0 ]
        [ 0 ]
        [ 0 ]
        [ 0 ]
The predicted data set for this mtest is

                  [ 0  ]
                  [ 1  ]
                  [ 0  ]
                  [ 0  ]
dtest = Gmtest =  [ 1  ] ,
                  [ 0  ]
                  [ √2 ]
                  [ 0  ]
and the reshaped generalized inverse model is

reshape(m†, 3, 3) = [  0.167    0    −0.167 ]
                    [    0    0.833   0.167 ] ,
                    [ −0.167  0.167   0.000 ]

which is just the 5th column of Rm, showing that information about
the central block slowness will bleed into some, but not all, of the
adjacent blocks, even for noise-free data. A related resolution test
that is commonly performed in tomography work is a checkerboard
test, which, as you would surmise, consists of generating synthetic
data from a model of alternating positive and negative perturbations
and then inverting to attempt to recover it. For example, the true
checkerboard slowness model

reshape(mtrue, 3, 3) = [ 2  1  2 ]
                       [ 1  2  1 ]
                       [ 2  1  2 ]

produces the noise-free synthetic data vector

[ 5.000 ]
[ 4.000 ]
[ 5.000 ]
[ 5.000 ]
[ 4.000 ] ,
[ 5.000 ]
[ 8.485 ]
[ 2.828 ]

and is recovered as

reshape(m†, 3, 3) = [ 2.333  1.000  1.667 ]
                    [ 1.000  1.667  1.333 ] .
                    [ 1.667  1.333  2.000 ]

It is important to note that the ability to recover the true model in

practice is affected not only by resolution concerns, which are a char-
acteristic purely of the system G, and hence apply even for noise-free
data, but also by the mapping of any data noise into the model pa-
rameters. For example, if we add independent noise distributed as
N (0, 0.2) to each data point in the checkerboard resolution test, to
obtain a noisy data vector, dn, the recovered model degrades further:

reshape(m†, 3, 3) = [ 2.365  1.302  1.732 ]
                    [ 1.215  1.476  1.245 ] .
                    [ 1.672  1.272  2.042 ]

4.5 Discrete Ill-Posed Problems

In this section we will consider problems in which the singular values decay to-
wards zero without any obvious gap between non–zero and zero singular values.

This happens frequently when we discretize Fredholm integral equations of the
first kind, as we did in Chapter 3. In particular, as we increase the number of
points n in the discretization, we typically find that the matrix G becomes more
and more badly conditioned. Discrete inverse problems such as these cannot
formally be called ill-posed, because the condition number remains finite,
although very large. However, we will refer to these problems as discrete
ill-posed problems.
The rate at which the singular values decay can be used to characterize
a discrete ill-posed problem as weakly, moderately, or severely ill-posed. If
si = O(i^−α) for α ≤ 1, then we call the problem weakly ill-posed. If si = O(i^−α)
for α > 1, then the problem is moderately ill-posed. If si = O(e^−αi), then the
problem is severely ill-posed.
In addition to the general pattern of singular values which decay to 0, dis-
crete ill–posed problems are typically characterized by differences in the singular
vectors V·,j [Han98]. For large singular values, the singular vectors are typically
smooth, while the singular vectors corresponding to the smaller singular values
typically have lots of oscillations. These oscillations become apparent in the
generalized inverse solution.
When we attempt to solve such a problem with the SVD, it becomes difficult
to decide where to truncate the sum (4.63). If we truncate the sum too early,
then we effectively lose details in the model solution corresponding to the
basis vectors V·,k that are left out of the sum. If we include all of the singular
values, then the generalized inverse solution becomes extremely unstable in the
presence of noise. In particular, we can expect that high frequencies in the
generalized inverse solution will be strongly affected by the noise. Regularization
techniques are required to address this problem.

Example 4.2

Consider an inverse problem where we have a physical process (ground

motion) recorded by a linear instrument of limited bandwidth (e.g.,
a vertical seismometer). The response of a linear instrument may be
characterized by an instrument impulse response. Consider the
impulse response

g(t) = { g0 t e^(−t/T0)   (t ≥ 0)
       { 0                (t < 0) .    (4.68)

(4.68) (Figure 4.3) is the displacement response of a critically damped
seismometer with a characteristic time constant T0 to a unit area (1
m/s^2 · s) impulsive ground acceleration input, and g0 is a gain constant.
Assuming that the displacement of the seismometer is electronically
converted to output volts, we choose g0 = e/T0 to produce a 1 V maximum
output value for the impulse response, and T0 = 10 s. The response of this
instrument to an acceleration impulse is shown in Figure 4.3.

Figure 4.3: Example instrument response: seismometer output voltage in
response to a unit area ground acceleration impulse.

The seismometer output, v(t), is a voltage record given by the convolution
(1.7) of the true ground acceleration, mtrue(t), with (4.68):

v(t) = ∫_{−∞}^{∞} g(t − τ) mtrue(τ) dτ . (4.69)

We are interested in the inverse deconvolution operation, which will
remove the smoothing effect of g(t) in (4.69) to recover the true
ground acceleration mtrue.
Discretizing (4.69) using the midpoint rule (3.3) with a time interval
∆t, we obtain

d = Gm (4.70)

with

Gi,j = { g0 (ti − tj) e^(−(ti − tj)/T0) ∆t   (ti ≥ tj)
       { 0                                   (ti < tj) .

Note that the rows of G are time-reversed, and the columns of G are
non-time-reversed, sampled versions of the impulse response g(t),
lagged by i and j, respectively. Using a time interval of [−5, 100] s,
outside of which (4.68) and (we assume) any model, m, of interest
will be very small or zero, and a discretization interval of ∆t = 0.5 s,
we obtain a discretized m by n system matrix G with m = n = 210
(Figure 4.4).


Figure 4.4: Grayscale image of the system matrix, G. Values range from 0
(black) to 0.5 (white). The discretization interval is ∆t = 0.5 s and the time
interval is [−5, 100] s.

The singular values of G are all nonzero and range from about 25.3
to 0.017 (Figure 4.5), giving a condition number of approximately
1480, which is not an especially large range relative to what can arise
in extremely ill-posed problems. However, adding noise at the level
of 1 part in 1000 will be sufficient to make the generalized inverse
solution unstable. The reason for the moderately large condition
number can be seen by examining successive rows of G, which are
nearly (but not quite) identical:

(Gi,· Gi+1,·^T) / (‖Gi,·‖ ‖Gi+1,·‖) ≈ 0.999 .
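These numbers can be checked with a short script. The construction below is our best reading of the discretization (data and model times offset by half a sample so that the diagonal of G is nonzero); exact values depend on the details of the time sampling, so only the orders of magnitude should be compared with the text.

```python
import numpy as np

T0, dt, n = 10.0, 0.5, 210
t_model = -5.0 + dt * (np.arange(n) + 0.5)   # midpoints of the model cells
t_data = -5.0 + dt * np.arange(1, n + 1)     # assumed data sample times
g0 = np.e / T0                                # gain for a 1 V peak response

dT = t_data[:, None] - t_model[None, :]       # t_i - tau_j
G = np.where(dT >= 0, g0 * dT * np.exp(-dT / T0) * dt, 0.0)

s = np.linalg.svd(G, compute_uv=False)
print(s[0], s[-1], s[0] / s[-1])              # on the order of 25, 0.02, 1500

# Successive rows of G are nearly parallel, which is why G is ill-conditioned.
i = 100
cos = G[i] @ G[i + 1] / (np.linalg.norm(G[i]) * np.linalg.norm(G[i + 1]))
print(cos)                                     # close to 0.999
```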

Now, consider a true ground acceleration signal consisting of two
acceleration pulses with temporal widths of σ = 2 s, centered at
t = 8 s and t = 25 s:

mtrue(t) = e^(−(t−8)^2/(2σ^2)) + 0.5 e^(−(t−25)^2/(2σ^2)) . (4.71)

We sample mtrue(t) to obtain a 210 element vector mtrue on the
time interval [−5, 100] s and generate the synthetic data set

dsyn = Gmtrue (4.72)

shown in Figure 4.7.





Figure 4.5: Singular values for the discretized convolution matrix.

Figure 4.6: The true model.



Figure 4.7: Noise-free predicted data from the true model.

The recovered 2-norm model from the full generalized inverse solution,

m = V S^-1 U^T dsyn , (4.73)

is shown in Figure 4.8. (4.73) fits its noiseless data vector, dsyn,
perfectly, and is essentially identical to the true model.
Next consider what happens when we add a modest amount of noise
to (4.72). Figure 4.9 shows the data after normally distributed noise
with standard deviation 0.05 has been added.
The 2-norm solution for a noisy data vector, dsyn + n, where the
elements of n are uncorrelated N(0, (0.05)^2) random variables,

m = V S^-1 U^T (dsyn + n) , (4.74)

is shown in Figure 4.10.

The solution in Figure 4.10, although it fits its particular data vec-
tor, dsyn + n, exactly, is worthless in divining information about the
true ground motion ((4.71); Figure 4.6). Information about mtrue
is overwhelmed by the small amount of added noise, amplified enor-
mously by the inverse operator.
Can a useful model be recovered via the truncated SVD? Using
the discrepancy principle (4.65) as our guide and selecting a range
of solutions with varying p0, we can in fact obtain an approximate
agreement between the number of degrees of freedom (m − p0, the
approximate expected value for the χ2 distribution for large numbers
of degrees of freedom) and the square of the data misfit normalized
by the noise standard deviation when we keep p0 = 26 columns in V
(Figure 4.11).

Figure 4.8: Generalized inverse solution using all 210 singular values and the
noise-free data of Figure 4.7.

Figure 4.9: Predicted data from the true model plus uncorrelated N(0, (0.05)^2)
noise.

Figure 4.10: Generalized inverse solution using all 210 singular values and the
noisy data of Figure 4.9.
Essential features of the true model are resolved in the solution of
Figure 4.11, but the solution technique introduces oscillations and
loss of resolution. Specifically, we see that the widths of the in-
ferred pulses are somewhat wider, and the inferred amplitudes some-
what less, than those of the true ground acceleration. These effects
are both hallmarks of limited resolution, as characterized by a non-
identity model resolution matrix. An image of the model resolution
matrix (4.54) in Figure 4.12, shows a finite-width central band and
oscillatory side lobes.
A typical column of the model resolution matrix (Figure 4.13) quantifies
the smearing of the true model into the recovered model for the
choice of the p = 26 inverse operator. The smoothing is over a
characteristic width of about 5 seconds, which is why our recovered
model, although it does a decent job of rejecting noise, underesti-
mates the amplitude and narrowness of the true model. The oscilla-
tory behavior of the resolution matrix is attributable to our abrupt
truncation of the model space. Each of the n columns of V, is an
oscillatory model basis function, with i − 1 zero crossings, where i is
the column number.
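The smearing described by a column of the resolution matrix can be reproduced numerically. The following NumPy sketch (the operator, truncation level, and spike position are hypothetical stand-ins, not the values of this example) forms the truncated-SVD model resolution matrix Rm = Vp Vp^T and applies it to a spike model:

```python
import numpy as np

# Hypothetical stand-in operator with decaying singular values; not the
# deconvolution G of this example.
rng = np.random.default_rng(0)
n = 50
G = rng.standard_normal((n, n)) @ np.diag(0.9 ** np.arange(n))

U, s, Vt = np.linalg.svd(G)
p = 20                  # assumed number of retained singular values
Vp = Vt[:p, :].T        # first p columns of V

# Truncated-SVD model resolution matrix: Rm = Vp Vp^T
Rm = Vp @ Vp.T

# Applying Rm to a spike model extracts the corresponding column of Rm,
# showing how a point feature is smeared by the limited model basis.
spike = np.zeros(n)
spike[25] = 1.0
smeared = Rm @ spike
```

Because Rm is a projection onto the span of the retained basis vectors, its trace equals p, a quick check on how much total resolution the truncation preserves.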


Figure 4.11: Solution using the 26 largest singular values and the N(0, (0.05)²) noisy data.



Figure 4.12: Grayscale image of the model resolution matrix, Rm, for the truncated
SVD solution including the 26 largest singular values. Values range from
0.35 (white) to −0.07 (black).



Figure 4.13: A column from the model resolution matrix, Rm, for the truncated
SVD solution including the 26 largest singular values.

When we truncate the series after 26 terms to stabilize the inverse
solution, we place a limit on the most oscillatory model space basis
vectors that we will allow in our solution. This truncation gives us
a model, and a model resolution matrix, that contain oscillatory
structure with around p − 1 = 25 zero crossings. The oscillatory,
orthogonal columns of V form a sort of pseudo-Fourier basis for the
model. We will examine this perspective further in Chapter 8.
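The truncated SVD expansion itself is only a few lines of code. Here is a small NumPy sketch (the matrix, spike location, and truncation level are illustrative assumptions, not the values from the text); with noise-free data and all terms retained, the generalized inverse recovers the model exactly, while truncation yields a projection of the true model onto the retained basis vectors:

```python
import numpy as np

# Build a synthetic ill-conditioned G with known, decaying singular values.
rng = np.random.default_rng(1)
n = 20
U0, _ = np.linalg.qr(rng.standard_normal((n, n)))
V0, _ = np.linalg.qr(rng.standard_normal((n, n)))
G = U0 @ np.diag(0.7 ** np.arange(n)) @ V0.T

m_true = np.zeros(n)
m_true[10] = 1.0
d = G @ m_true               # noise-free data

U, s, Vt = np.linalg.svd(G)

def tsvd_solution(p):
    """Keep only the p largest singular values in the generalized inverse."""
    return Vt[:p, :].T @ ((U[:, :p].T @ d) / s[:p])

m_full = tsvd_solution(n)    # all terms: exact recovery for noise-free data
m_trunc = tsvd_solution(10)  # truncated: a smoothed projection of m_true
```

The truncated solution equals the projection of the true model onto the first 10 right singular vectors, which is exactly the limited-resolution behavior described above.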

Example 4.3

Recall the Shaw example problem from Example 3.2. The MATLAB
Regularization Toolbox contains a routine shaw that computes the
G matrix for this problem. Using this routine, we computed the
G matrix for n = 20. We then computed the singular values of
this matrix. Figure 4.14 shows the singular values of the G matrix,
which decay very rapidly to zero. Because the singular values decay
to 0 in an exponential fashion, this is a strongly ill–posed problem.
There is no obvious break point above which the singular values are
nonzero and below which the singular values are 0. The MATLAB
rank command gives a value of p = 18, suggesting that the last
two singular values are effectively 0. The condition number of this
problem is enormous (> 10¹⁴).
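The notion of an effective numerical rank can be made concrete by counting singular values above a tolerance tied to machine precision, which is the spirit of what MATLAB's rank command does. A Python sketch (the default tolerance formula below follows the common max(m, n)·eps·s₁ convention, and the test matrix is a hypothetical example):

```python
import numpy as np

def numerical_rank(A, tol=None):
    """Count singular values above a tolerance, in the spirit of MATLAB's
    rank command; tol defaults to max(m, n) * eps * (largest singular value)."""
    s = np.linalg.svd(A, compute_uv=False)
    if tol is None:
        tol = max(A.shape) * np.finfo(A.dtype).eps * s[0]
    return int(np.sum(s > tol))

# A 4 by 4 matrix whose last two rows are linear combinations of the first
# two (row3 = row1 + row2, row4 = 2*row1), so its exact rank is 2.
A = np.array([[1.0, 2.0, 3.0, 4.0],
              [0.0, 1.0, 1.0, 2.0],
              [1.0, 3.0, 4.0, 6.0],
              [2.0, 4.0, 6.0, 8.0]])
r = numerical_rank(A)
```

For a strongly ill-posed problem like Shaw's, where the singular values decay smoothly to zero, the value returned by such a routine depends entirely on the tolerance, which is why there is no unambiguous break point.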







Figure 4.14: Singular values of G, with n = 20.

Column 18 of V, which corresponds to the smallest nonzero singular

value, is shown in Figure 4.15. Notice that this vector is very rough.
Column 1 of V, which corresponds to the largest singular value, is
shown in Figure 4.16. Notice that this column represents a relatively
smooth function.
Next, we will consider a simple resolution test. Suppose that the
input to the system is given by

$$m_i = \begin{cases} 1 & i = 10 \\ 0 & \text{otherwise.} \end{cases}$$

We can run the model forward and then use the generalized inverse
with various values of p to obtain inverse solutions. The spike
input is shown in Figure 4.17. The corresponding data (with no noise
added) are shown in Figure 4.18. If we compute the generalized
inverse with p = 18 and apply it to the data in Figure 4.18, we get
essentially perfect recovery of the original spike (Figure 4.19).
However, if we add a very small amount of noise to the data in Fig-
ure 4.18, things change dramatically. Adding normally distributed
random noise with mean 0 and standard deviation 10⁻⁶ to the data
in Figure 4.18 and computing the inverse solution using p = 18
produces the wild solution shown in Figure 4.20, which bears little
resemblance to the input (notice that the vertical scale is multiplied
by 10⁵!). This inverse operator is even more unstable than that of









Figure 4.15: Column 18 of V.










Figure 4.16: Column 1 of V.





Figure 4.17: The spike model.




Figure 4.18: Spike model data, no noise added.





Figure 4.19: The recovered spike model, no noise, p = 18.

the previous deconvolution example, such that even noise on the

order of 1 part in a million will dominate the inverse solution!
Next, we consider what happens when we use only the 10 largest
singular values and their corresponding model space vectors to con-
struct a solution. Figure 4.21 shows this solution with no data noise.
Because we have cut off a number of singular values, we have lost
resolution. The inverse solution is smeared out, but it is still possi-
ble to tell that there is some significant spike-like feature near t = 0.
Figure 4.22 shows the solution using 10 singular values with the same
noise as Figure 4.20. This time the recovery is not strongly affected
by the noise, although we still have imperfect resolution.
What happens if we discretize the problem with a larger number of
intervals? Figure 4.23 shows the singular values for the G matrix
with n = 100 intervals. The first 20 or so singular values are appar-
ently nonzero, while the last 80 or so singular values are effectively
zero. Figure 4.24 shows the inverse solution for the spike model with
n = 100 and p = 20. This is much worse than the solution that we
obtained with n = 20. In general, discretizing over more intervals
does not help and can often make things much worse.
What about a smaller number of intervals? Figure 4.25 shows the
singular values of the G matrix with n = 6. In this case there are no
terribly small singular values. However, with only 6 elements in the
model vector, we cannot hope to resolve the details of an arbitrary
source intensity distribution.


Figure 4.20: Recovery of the spike model with noise, p = 18.








Figure 4.21: Recovery of the spike model with no noise, p = 10.









Figure 4.22: Recovery of the spike model with noise, p = 10.







Figure 4.23: Singular values of G, with n = 100.



Figure 4.24: Recovery of the spike model with noise, n = 100, p = 20.





Figure 4.25: Singular values of G, with n = 6.


This example demonstrates the dilemma posed by small singular
values. If we include the small singular values, then our inverse
solution becomes extremely sensitive to noise in the data. If we do
not include them, then our solution is less sensitive to noise, but we
lose resolution. This tradeoff between sensitivity to noise and
resolution is the central issue in solving discrete ill–posed problems.
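The dilemma is easy to reproduce numerically: data noise entering along u_i is multiplied by 1/s_i in the inverse solution, so the smallest retained singular value sets the noise amplification. A sketch with a synthetic, rapidly decaying spectrum (all particular values here are illustrative assumptions, not those of the Shaw problem):

```python
import numpy as np

# Synthetic G whose singular values are constructed to decay from 1 to 1e-19.
rng = np.random.default_rng(2)
n = 20
U0, _ = np.linalg.qr(rng.standard_normal((n, n)))
V0, _ = np.linalg.qr(rng.standard_normal((n, n)))
G = U0 @ np.diag(10.0 ** -np.arange(n, dtype=float)) @ V0.T

m_true = np.ones(n)
d = G @ m_true + 1e-6 * rng.standard_normal(n)  # small data noise

U, s, Vt = np.linalg.svd(G)

def tsvd(p):
    # Generalized inverse solution keeping the p largest singular values.
    return Vt[:p, :].T @ ((U[:, :p].T @ d) / s[:p])

m_all = tsvd(n)  # keeps every term, down to numerically tiny s_i: unstable
m_few = tsvd(6)  # keeps only s_i >= 1e-5: stable, but resolution is lost
```

Even 1-part-in-a-million noise overwhelms the full-rank solution, while the truncated solution stays near the true model at the cost of smearing, exactly the tradeoff described above.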

4.6 Exercises
1. A large NS-EW oriented, nearly square plan view, sandstone quarry block
(16 m by 16 m) with a bulk P-wave seismic velocity of approximately
3000 m/s is suspected of harboring higher-velocity dinosaur remains. An
ultrasonic P-wave travel-time tomography scan is performed in a hori-
zontal plane bisecting the boulder, producing a data set consisting of 16
E→W, 16 N→S, 31 SW→NE, and 31 NW→SE travel times. Each travel-
time measurement has statistically independent errors with an estimated
standard deviation of 15 µs.
The data files that you will need to load from your working directory into
your MATLAB program are: rowscan.m, colscan.m, diag1scan.m,
diag2scan.m (these contain the travel-time data), and std.m (which
contains the standard deviation of the data measurements). The travel
time contribution from a uniform background model (velocity of 3000 m/s)
has been subtracted from each travel-time measurement for you, so you
will be solving for perturbations from a uniform slowness model of 3000
m/s. The row format of each data file is (x1, y1, x2, y2, t), where the
starting point coordinate of each shot is (x1, y1), the end point coordinate
is (x2, y2), and the travel time along the ray path between the start and
end points is the path integral (in seconds)

$$t = \int_{\ell} s(\mathbf{x})\, dl$$

where s is the slowness (reciprocal velocity) along the path, ℓ, between the
source and receiving points.
Parameterize your tomographic model of the slowness perturbation in the
plane of the survey by dividing the boulder into a 16 by 16 grid of 256
1-m-square, N by E blocks to construct a linear system for the problem.
Assume that the ray paths through each homogeneous block can be well-
approximated by straight lines, so that the travel time expression is
$$t = \int_{\ell} s(\mathbf{x})\, dl = \sum_{\text{blocks}} s_{\text{block}} \cdot \Delta l_{\text{block}}$$

where Δl_block is 1 m for the row and column scans and √2 m for the
diagonal scans.
Utilize the SVD to find a minimum-length/least-squares solution, m† ,
for the 256 block slowness perturbations which fit the data as exactly as
possible. Perform two inversions:
(A) Using the row and column scans only, and
(B) Using the complete data set.
For each inversion:

(a) State and discuss the significance of the elements and dimensions of
the data and model null spaces.
(b) Note if there are any model parameters that have perfect resolution.
(c) Note the condition number of your raw system matrix relating the
data and model.
(d) Note the condition number of your generalized inverse matrix.
(e) Produce a 16 by 16 element contour or other plot of your slowness
perturbation model, displaying the maximum and minimum slowness
perturbations in the title of each plot. Anything in there? If so, how
fast or slow is it (in m/s)?
(f) Show the model resolution by contouring or otherwise displaying the
256 diagonal elements of the model resolution matrix, reshaped into
an appropriate 16 by 16 grid.
(g) Construct, and contour or otherwise display a nonzero model which
fits the trivial data set d = 0 exactly.
(h) Describe how one could use solutions of the type discussed in (g)
to demonstrate that very rough models exist which will fit any data
set just as well as a generalized inverse model. Show one such wild
model.
2. Find the singular value decomposition of the G matrix from problem 3.1.
Taking into account the fact that the measured data are only accurate to
about four digits, use the truncated SVD to compute a solution to this
problem.

Chapter 5

Tikhonov Regularization

We saw in Chapter 4 that, given the SVD of G (4.2), we can express a generalized
inverse solution by (4.63)
$$m^{\dagger} = V_p S_p^{-1} U_p^T d = \sum_{i=1}^{p} \frac{(U_{\cdot,i})^T d}{s_i} V_{\cdot,i},$$

and that this expression can become extremely unstable when one or more of the
singular values s_i become small. We considered the option of dropping terms
in the sum that correspond to the smaller singular values. This regularized,
or stabilized, the solution in the sense that it made the solution less
sensitive to data noise. However, we paid a price for this stability in that
the regularized solution had reduced resolution.
In this chapter we will discuss Tikhonov regularization, which is perhaps
the most widely used technique for regularizing discrete ill–posed problems. It
turns out that the Tikhonov solution can be expressed quite easily in terms of
the SVD of G. We will derive a formula for the Tikhonov solution and see how
it is a variant on the generalized inverse solution that effectively gives larger
weight to larger singular values in the SVD solution and gives lower weight to
small singular values.

5.1 Selecting a Good Solution

For a general linear least squares problem, there may be infinitely many least
squares solutions. If we consider that the data contain noise, and that there
is no point in fitting such noise exactly, it becomes evident that there can be
many solutions which adequately fit the data in the sense that ‖Gm − d‖₂
is small enough. The discrepancy principle (4.65) can be used to regularize
the solution of a discrete ill–posed problem based on the assumption that a
reasonable level for ‖Gm − d‖₂ is known.
What do we mean by small enough? For data, d, with measurement errors
that are independent and normal with mean 0 and standard deviation σ, the


Figure 5.1: Minimum values of the model norm, ‖m‖₂, for varying values of δ.

misfit measure (1/σ²)‖Gm_true − d‖₂² will have a χ² distribution with m degrees
of freedom. If the standard deviations are not all equal, but the errors are
still normal and independent, we can divide each equation by the standard
deviation associated with its right hand side (2.14) and obtain a weighted system
of equations (2.17) in which ‖G_w m_true − d_w‖₂² will follow a χ² distribution.
We learned in Chapter 2 that when we estimate the solution to a full rank
least squares problem, ‖G_w m_L2 − d_w‖₂² has a χ² distribution with m − n degrees
of freedom. Unfortunately, when the number of model parameters n is greater
than or equal to the number of data m, this simply does not work, because there
is no χ² distribution with fewer than one degree of freedom.
In practice, a common heuristic is to require ‖G_w m − d_w‖₂ to be smaller
than √m, since m is the approximate median of a χ² distribution with m degrees
of freedom.
Under the discrepancy principle, we consider all solutions with ‖Gm − d‖₂ ≤
δ, and select from these solutions the one that minimizes the norm of m:

$$\min \|m\|_2 \quad \text{subject to} \quad \|Gm - d\|_2 \leq \delta. \qquad (5.1)$$

Why select the minimum norm solution from among those solutions that
adequately fit the data? One interpretation is that any significant nonzero
feature that appears in the regularized solution is there because it was necessary
to fit the data. Model features that are unnecessary to fit the data will be
removed by the regularization.
Notice that as δ increases, the set of feasible models expands, and the min-
imum value of ‖m‖₂ decreases. We can thus trace out a curve of minimum
values of ‖m‖₂ versus δ (Figure 5.1). It is also possible to trace out this curve
by considering problems of the form

$$\min \|Gm - d\|_2 \quad \text{subject to} \quad \|m\|_2 \leq \epsilon. \qquad (5.2)$$



Figure 5.2: Optimal values of the model norm, ‖m‖₂, and the misfit norm,
‖Gm − d‖₂, as ε varies.

As ε decreases, the set of feasible solutions becomes smaller, and the minimum
value of ‖Gm − d‖₂ increases. Again, as we adjust ε we trace out the curve of
optimal values of ‖m‖₂ and ‖Gm − d‖₂. See Figure 5.2.
A third option is to consider the damped least squares problem

$$\min \|Gm - d\|_2^2 + \alpha^2 \|m\|_2^2, \qquad (5.3)$$

which arises when we apply the method of Lagrange multipliers to (5.1). It
can be shown [Han98] that, for appropriate choices of δ, ε, and α, all three
problems yield the same solution. We will concentrate on solving the damped
least squares form of the problem (5.3). Solutions to (5.1) and (5.2) can then be
obtained by adjusting α until the corresponding constraints are satisfied.
An important observation is that, when plotted on a log–log scale, the curve
of optimal values of ‖m‖₂ and ‖Gm − d‖₂ often takes on an L shape. For this
reason, the curve is called an L–curve [Han92]. In addition to the discrepancy
principle, another popular criterion for picking the value of α is the L–curve
criterion, in which we pick the value of α that gives the solution closest to the
corner of the L–curve.
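Tracing the L–curve numerically just requires solving the damped problem on a grid of α values and recording both norms; plotting them on log–log axes then reveals the corner. A NumPy sketch with a synthetic G and noise level (all specifics here are illustrative assumptions):

```python
import numpy as np

# Synthetic ill-conditioned problem: smooth true model plus small noise.
rng = np.random.default_rng(3)
n = 20
U0, _ = np.linalg.qr(rng.standard_normal((n, n)))
V0, _ = np.linalg.qr(rng.standard_normal((n, n)))
G = U0 @ np.diag(0.5 ** np.arange(n)) @ V0.T
d = G @ np.ones(n) + 1e-4 * rng.standard_normal(n)

alphas = 10.0 ** np.linspace(-6, 0, 40)
model_norms = np.empty_like(alphas)
misfit_norms = np.empty_like(alphas)
for j, alpha in enumerate(alphas):
    # Damped least squares solution: (G^T G + alpha^2 I) m = G^T d
    m_alpha = np.linalg.solve(G.T @ G + alpha**2 * np.eye(n), G.T @ d)
    model_norms[j] = np.linalg.norm(m_alpha)
    misfit_norms[j] = np.linalg.norm(G @ m_alpha - d)
# A log-log plot of misfit_norms against model_norms typically shows the L;
# increasing alpha moves along the curve toward smaller model norm.
```

As α grows, the model norm shrinks while the misfit grows, which is exactly the tradeoff the two axes of the L–curve display.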

5.2 SVD Implementation of Tikhonov Regularization
The damped least squares problem (5.3) is equivalent to the ordinary least
squares problem
$$\min \left\| \begin{bmatrix} G \\ \alpha I \end{bmatrix} m - \begin{bmatrix} d \\ 0 \end{bmatrix} \right\|_2. \qquad (5.4)$$
As long as α is nonzero, the last n rows of the matrix are linearly independent,
so we have a full rank least squares problem that could be solved by the method

of normal equations.
$$\begin{bmatrix} G \\ \alpha I \end{bmatrix}^T \begin{bmatrix} G \\ \alpha I \end{bmatrix} m = \begin{bmatrix} G \\ \alpha I \end{bmatrix}^T \begin{bmatrix} d \\ 0 \end{bmatrix}. \qquad (5.5)$$

This system of equations simplifies to

$$(G^T G + \alpha^2 I) m = G^T d. \qquad (5.6)$$

Using the SVD of G, (5.6) can be written as

$$(V S^T U^T U S V^T + \alpha^2 I) m = V S^T U^T d, \qquad (5.7)$$

which simplifies to

$$(V S^T S V^T + \alpha^2 I) m = V S^T U^T d. \qquad (5.8)$$

Since this system of equations is nonsingular, it has a unique solution. We will

show that the solution is
$$m_\alpha = \sum_{i=1}^{k} \frac{s_i^2}{s_i^2 + \alpha^2} \frac{(U_{\cdot,i})^T d}{s_i} V_{\cdot,i}. \qquad (5.9)$$

In this formula, we pick k = min(m, n) so that all singular values are included,
no matter how small they might be. To show that this solution is correct, we
substitute (5.9) into the left hand side of (5.8)
$$(V S^T S V^T + \alpha^2 I) \sum_{i=1}^{k} \frac{s_i^2}{s_i^2 + \alpha^2} \frac{(U_{\cdot,i})^T d}{s_i} V_{\cdot,i} = \sum_{i=1}^{k} \frac{s_i^2}{s_i^2 + \alpha^2} \frac{(U_{\cdot,i})^T d}{s_i} (V S^T S V^T + \alpha^2 I) V_{\cdot,i}. \qquad (5.10)$$

$(V S^T S V^T + \alpha^2 I)V_{\cdot,i}$ can be simplified by noting that $V^T V_{\cdot,i}$ is a standard
basis vector, $e_i$. When we multiply $S^T S$ times a standard basis vector, we get
a vector with $s_i^2$ in position i and zeros elsewhere. When we multiply V times
this vector, we get $s_i^2 V_{\cdot,i}$. Thus

$$(V S^T S V^T + \alpha^2 I) \sum_{i=1}^{k} \frac{s_i^2}{s_i^2 + \alpha^2} \frac{(U_{\cdot,i})^T d}{s_i} V_{\cdot,i} = \sum_{i=1}^{k} \frac{s_i^2}{s_i^2 + \alpha^2} \frac{(U_{\cdot,i})^T d}{s_i} (s_i^2 + \alpha^2) V_{\cdot,i}$$

$$= \sum_{i=1}^{k} s_i (U_{\cdot,i})^T d\, V_{\cdot,i} = V S^T U^T d. \qquad (5.11)$$

The quantities

$$f_i = \frac{s_i^2}{s_i^2 + \alpha^2} \qquad (5.12)$$

are called filter factors. For s_i ≫ α, f_i ≈ 1, and for s_i ≪ α, f_i ≈ 0. In between,
the filter factors serve to reduce the relative contribution of the smaller singular
values to the solution. A similar alternative method, called the damped SVD
method, uses the filter factors

$$\hat{f}_i = \frac{s_i}{s_i + \alpha}. \qquad (5.13)$$

This has a similar effect to using (5.12), but transitions more slowly with the
index i between including large and rejecting small singular values.
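Formula (5.9) translates directly into code: scale each SVD coefficient by its filter factor. The sketch below (synthetic G, d, and α, not the text's example) also checks the filtered expansion against a direct solve of the normal equations (5.6):

```python
import numpy as np

# Synthetic test problem; alpha is an arbitrary illustrative choice.
rng = np.random.default_rng(4)
n = 15
G = rng.standard_normal((n, n)) @ np.diag(0.6 ** np.arange(n))
d = rng.standard_normal(n)
alpha = 1e-2

U, s, Vt = np.linalg.svd(G)
f = s**2 / (s**2 + alpha**2)        # Tikhonov filter factors (5.12)
m_svd = Vt.T @ (f * (U.T @ d) / s)  # filtered SVD expansion (5.9)

# Direct solution of (G^T G + alpha^2 I) m = G^T d for comparison.
m_direct = np.linalg.solve(G.T @ G + alpha**2 * np.eye(n), G.T @ d)
```

Both routes give the same answer; the SVD form has the advantage that many values of α can be tried cheaply once the decomposition is computed.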
The MATLAB regularization toolbox [Han94] contains a number of useful
commands for performing Tikhonov regularization. These commands include
l_curve for plotting the L–curve and finding its corner, tikhonov for computing
the solution for a particular value of α, lsqi for solving (5.2), discrep for solving
(5.1), picard for plotting singular values, and fil_fac for computing filter factors.
The package also includes a routine regudemo which leads the user through a
tour of the features of the toolbox. Note that the regularization toolbox uses
notation which is somewhat different from the notation used in this book. Read
the documentation to understand the notation used by the toolbox.

Example 5.1
We will now use the regularization toolbox to analyze the Shaw
problem (with noise), as introduced in Example 3.2. We begin by
computing the L–curve and finding its corner.

>> [reg_corner,rho,eta,reg_param]=l_curve(U,diag(S),dspiken,’Tikh’);
>> reg_corner

reg_corner =


Figure 5.3 shows the L–curve. The corner of the curve is at α =
3.12 × 10⁻⁷.
Next, we compute the Tikhonov regularization solution correspond-
ing to this value of α.

>> [mtik,rho,eta]=tikhonov(U,diag(S),V,dspiken,3.12e-7)

mtik =



Figure 5.3: L–curve for the Shaw problem.


rho =


eta =





Figure 5.4: Tikhonov solution for the Shaw problem, α = 3.12 × 10⁻⁷.


Here mtik is the solution, rho is the misfit norm, and eta is the
solution norm. The solution is shown in Figure 5.4. Notice that it
is much better behaved than the wild solution obtained by the
truncated SVD with p = 18 (Figure 4.20).
We can also use the discrep command to find the appropriate α for
a Tikhonov regularized solution. Because normally distributed
random noise with standard deviation 1 × 10⁻⁶ was added to
this data, we search for a solution where the square of the norm of the
noise vector (and of the residuals) is roughly n = 20 times 10⁻¹², for a
corresponding misfit norm ‖Gm − d‖₂ of roughly √20 × 10⁻⁶ ≈ 4.47 × 10⁻⁶.

>> [mdisc,alpha]=discrep(U,diag(S),V,dspiken,1.0e-6*sqrt(20));
>> alpha

alpha =


The discrepancy principle estimate of δ provides a somewhat larger
value of α than that obtained using the L–curve technique. The
corresponding solution (Figure 5.5) thus has a smaller norm. This
solution is perhaps somewhat better than the L–curve solution in
the sense of having less pronounced side lobes.
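Because the Tikhonov misfit ‖Gm_α − d‖₂ is a monotonically increasing function of α, a discrepancy-principle α can be found by simple bisection, which is essentially what discrep automates. A NumPy sketch on a synthetic problem (the matrix, noise level, and bracketing interval are all illustrative assumptions):

```python
import numpy as np

# Synthetic square problem with noise of known standard deviation sigma.
rng = np.random.default_rng(5)
n = 20
U0, _ = np.linalg.qr(rng.standard_normal((n, n)))
V0, _ = np.linalg.qr(rng.standard_normal((n, n)))
G = U0 @ np.diag(0.7 ** np.arange(n)) @ V0.T
sigma = 1e-3
d = G @ np.ones(n) + sigma * rng.standard_normal(n)
delta = np.sqrt(n) * sigma       # target misfit from the chi-square heuristic

U, s, Vt = np.linalg.svd(G)

def misfit(alpha):
    f = s**2 / (s**2 + alpha**2)         # Tikhonov filter factors
    m = Vt.T @ (f * (U.T @ d) / s)       # Tikhonov solution via the SVD
    return np.linalg.norm(G @ m - d)

lo, hi = 1e-10, 1e2              # bracket: misfit(lo) < delta < misfit(hi)
for _ in range(200):
    mid = np.sqrt(lo * hi)       # bisect in log(alpha)
    if misfit(mid) < delta:
        lo = mid
    else:
        hi = mid
alpha_disc = np.sqrt(lo * hi)
```

Bisecting in log α is convenient because useful regularization parameters commonly span many orders of magnitude.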







Figure 5.5: Tikhonov solution for the Shaw problem, α = 5.36 × 10⁻⁵.

It is interesting to note that the misfit of the original spike model,
approximately 4.42 × 10⁻⁷, is actually smaller than the tolerance
that we specified in finding a solution by the discrepancy principle.
Why didn't discrep recover the original spike model? One quick
observation is that the spike model has a norm of 1, while the solu-
tion obtained by discrep has a norm of only 0.66. Since Tikhonov
regularization prefers solutions with smaller norms, we ended up
with the solution in Figure 5.5. More generally, this is due to the
presence of noise and to limited resolution (see below).

The regularization toolbox command picard can be used to produce
a plot of the singular values s_i, the values of |(U_{·,i})^T d|, and the
ratios |(U_{·,i})^T d|/s_i. Figure 5.6 shows the values for our problem.
Notice that |(U_{·,i})^T d| reaches a noise floor of about 1 × 10⁻⁷ after
i = 13, while the singular values continue to decay. As a consequence,
the ratios increase rapidly. It is clear from this plot that we cannot
expect to obtain useful information from the singular values beyond
p = 13. Notice that the 13th singular value is ≈ 2.6 × 10⁻⁷, which
is comparable to the values of α that we have been using.


Figure 5.6: Picard plot for the Shaw problem.

5.3 Resolution, Bias, and Uncertainty in the Tikhonov Solution
Just as with our earlier truncated singular value decomposition approach, we
can compute a model resolution matrix for the Tikhonov regularization method.
From equation (5.6) we know that

$$m_\alpha = (G^T G + \alpha^2 I)^{-1} G^T d,$$

so the corresponding Tikhonov inverse operator is

$$G^\sharp \equiv (G^T G + \alpha^2 I)^{-1} G^T.$$

This can be rewritten in terms of the SVD as

$$G^\sharp = V F S^{\dagger} U^T$$

where F is an n by n diagonal matrix with diagonal elements given by the filter
factors f_i, and S† is the generalized inverse of S. G♯ is the appropriate matrix
for constructing the model resolution matrix, which is

$$R_{m,\alpha} = G^\sharp G = V F V^T. \qquad (5.14)$$

Similarly, we can compute a data resolution matrix,

$$R_{d,\alpha} = G G^\sharp = U F U^T. \qquad (5.15)$$
Notice that Rm,α and Rd,α depend on the particular value of α used.
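Equations (5.14) and (5.15) are straightforward to form explicitly once the SVD is in hand. A sketch with a small synthetic G (the matrix and α are illustrative assumptions):

```python
import numpy as np

# Small synthetic operator and an arbitrary regularization parameter.
rng = np.random.default_rng(6)
n = 12
G = rng.standard_normal((n, n)) @ np.diag(0.5 ** np.arange(n))
alpha = 1e-2

U, s, Vt = np.linalg.svd(G)
F = np.diag(s**2 / (s**2 + alpha**2))       # filter factor matrix
Gsharp = Vt.T @ F @ np.diag(1.0 / s) @ U.T  # G# = V F S^dagger U^T

Rm = Gsharp @ G   # model resolution matrix; equals V F V^T
Rd = G @ Gsharp   # data resolution matrix; equals U F U^T
# Diagonal entries of Rm near 1 mark well-resolved parameters; values
# well below 1 mark parameters whose estimates are smeared by regularization.
```

Note that both matrices change with α: smaller α pushes the filter factors toward 1 and both resolution matrices toward the identity.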

Example 5.2

In our Shaw example, with α = 5.36 × 10⁻⁵, the model resolution
matrix has the following diagonal entries

>> diag(R)

ans =


indicating that most entries in the solution vector are rather
poorly resolved. Figure 5.7 displays this poor resolution by apply-
ing R_{m,α} to the (true) spike model. Recall that this is the model
recovered when the true model is a spike and there is no noise added
to the data vector; in practice, additional features of a recovered
model will also be influenced by noise.

As in Chapter 2, we can compute a covariance matrix for the estimated
model parameters. Since

$$m_\alpha = G^\sharp d,$$

the covariance is

$$\mathrm{Cov}(m_\alpha) = G^\sharp\, \mathrm{Cov}(d)\, (G^\sharp)^T.$$

As with the SVD solution of Chapter 4, the Tikhonov regularized solution will
generally be biased, and the differences between the regularized solution values
and the true model may actually be much larger than the confidence intervals
obtained from the covariance matrix of the model parameters.
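A sketch of this covariance computation follows (synthetic G, noise level, and α are illustrative assumptions; the 1.96 factor gives the usual 95% normal-theory intervals):

```python
import numpy as np

# Synthetic operator; sigma is an assumed independent-noise standard deviation.
rng = np.random.default_rng(7)
n = 10
G = rng.standard_normal((n, n)) @ np.diag(0.6 ** np.arange(n))
alpha, sigma = 1e-1, 0.05

# Tikhonov inverse operator G# = (G^T G + alpha^2 I)^(-1) G^T
Gsharp = np.linalg.solve(G.T @ G + alpha**2 * np.eye(n), G.T)
# Cov(m_alpha) = G# Cov(d) (G#)^T, with Cov(d) = sigma^2 I here
cov_m = sigma**2 * (Gsharp @ Gsharp.T)

# 95% confidence half-widths for each model parameter; remember that these
# do not account for the regularization bias discussed above.
half_widths = 1.96 * np.sqrt(np.diag(cov_m))
```

Because the bias is not captured, such intervals can exclude the true parameter values, which is precisely what Example 5.3 below illustrates.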









Figure 5.7: Resolution of the spike model, α = 5.36 × 10⁻⁵.

Example 5.3
Recall our earlier example of the Shaw problem with the spike model
input. Figure 5.8 shows the true model, the solution obtained using
α = 5.36 × 10⁻⁵, and 95% confidence intervals for the estimated
parameters. Notice that very few of the actual model parameters
are included within the confidence intervals.

Ultimately, our goal is to obtain a model m which is as close to mtrue as

possible. Of course, because of noise in the data and the bias introduced by
regularization, we cannot hope to determine mtrue or even the exact distance
from our model m to mtrue . However, it might be possible to estimate the
magnitude of the difference between our inverse solution and the true model. If
we have such an estimate as a function of the regularization parameter α, then
we can obviously try to minimize the estimated error.
Several estimates of this sort have been developed by different authors. The
quasi-optimality criterion of Tikhonov and Glasko [TG65] is based on the esti-
mate

$$\|m_{\text{true}} - G^\sharp d\|_2 \approx \alpha^2 \left( d^T \left( G G^T + \alpha^2 I \right)^{-4} G G^T d \right)^{1/2}.$$

Hanke and Raus [HR96] use the estimate

$$\|m_{\text{true}} - G^\sharp d\|_2 \approx \alpha^2 \left( d^T \left( G G^T + \alpha^2 I \right)^{-3} d \right)^{1/2}.$$

These estimates of the error are rough, order–of–magnitude estimates. They
should not be counted on to give an accurate estimate of the error in the





Figure 5.8: Confidence intervals for the Tikhonov solution to the Shaw problem
and a spike model.

regularized solution. They may still be useful in finding an appropriate value of

the regularization parameter. Unfortunately, these estimates of the regulariza-
tion error are non–convex functions of α, and because of local minima, it can
be hard to find the value of α which actually minimizes the error estimate.

Example 5.4

Figure 5.9 shows a plot of the Hanke and Raus estimates of the
error in the Shaw problem with noisy data from the spike model.
The graph has a number of local minimum points, and it is not
obvious which value of α should be used. One minimum lies near α =
5.36 × 10⁻⁵, which was the value of α determined by the discrepancy
principle. At this value of α, the estimated error is about 0.058,
which is considerably smaller than the actual error of about 0.74.

5.4 Higher Order Tikhonov Regularization

So far in our discussions of Tikhonov regularization we have minimized an
objective function involving ‖m‖₂. In many situations, we would prefer to obtain
a solution which minimizes some other measure of m, such as the norm of the
first or second derivative.


Figure 5.9: Hanke and Raus estimates of the solution error for the Shaw problem
with spike model.

For example, if we have discretized our problem using simple collocation,

then we can approximate, to a multiplicative constant, the first derivative of
the model by Lm, where
 
$$L = \begin{bmatrix} -1 & 1 & & & \\ & -1 & 1 & & \\ & & \ddots & \ddots & \\ & & & -1 & 1 \end{bmatrix}. \qquad (5.16)$$

Here Lm is a finite difference approximation to the first derivative of m. By
minimizing ‖Lm‖₂, we will favor solutions which are relatively flat. In first
order Tikhonov regularization, we solve the damped least squares problem

$$\min \|Gm - d\|_2^2 + \alpha^2 \|Lm\|_2^2. \qquad (5.17)$$
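Roughening matrices of this kind, and the second-difference analog used later in this chapter, take only a few lines to build. A sketch (the shape convention, (n−1) by n for first differences and (n−2) by n for second differences, is one common choice):

```python
import numpy as np

def first_difference(n):
    """(n-1) by n first-difference roughening matrix, one common convention
    for the L of first order Tikhonov regularization."""
    L = np.zeros((n - 1, n))
    for i in range(n - 1):
        L[i, i], L[i, i + 1] = -1.0, 1.0
    return L

def second_difference(n):
    """(n-2) by n second-difference roughening matrix."""
    L = np.zeros((n - 2, n))
    for i in range(n - 2):
        L[i, i], L[i, i + 1], L[i, i + 2] = 1.0, -2.0, 1.0
    return L

L1 = first_difference(6)
L2 = second_difference(6)
# L1 annihilates constant models; L2 annihilates constant and linear models,
# so flat or linearly trending structure is not penalized at all.
```

The null spaces of these operators matter: any model component that L annihilates is left entirely unregularized, a point that returns in the GSVD analysis below.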

Another commonly used operator is the second difference, where
 
$$L = \begin{bmatrix} 1 & -2 & 1 & & & \\ & 1 & -2 & 1 & & \\ & & \ddots & \ddots & \ddots & \\ & & & 1 & -2 & 1 \end{bmatrix}. \qquad (5.18)$$

Here Lm is a finite difference approximation proportional to the second deriva-
tive of m, and minimizing ‖Lm‖₂ penalizes solutions which are rough in a
second derivative sense. In second order Tikhonov regularization, we use
the L from (5.18).
We have already seen (5.9) how to solve (5.17) when L = I. This is called
zeroth order Tikhonov regularization. It is possible to extend this idea to
higher orders, and to problems in which the models are higher (e.g. 2- or 3-)
dimensional.
To solve higher order Tikhonov regularization problems, we employ the gen-
eralized singular value decomposition, or GSVD. Using the GSVD, the
solution to (5.17) can be expressed as a sum of filter factors times generalized
singular vectors.
Unfortunately, the definition of the GSVD and associated notation are not
standardized. In the following, we will generally follow the conventions used by
the MATLAB regularization toolbox and its cgsvd command. One important
difference is that we will use γ_i for the generalized singular values, while the
toolbox uses σ_i. Note that MATLAB also has a built in command, gsvd,
which uses a different definition of the GSVD.
We will assume that G is an m by n matrix, and that L is a p by n matrix,
with m ≥ n ≥ p, that rank(L) = p, and that the null spaces of G and L intersect
only at the zero vector. It is possible to define the GSVD in a more general
context, but these assumptions are necessary for this development of the GSVD
and are reasonable for most problems.
Under the above assumptions there exist matrices U, V, Λ, M and X with
the following properties and relationships:
• U is m by n and orthogonal.
• V is p by p and orthogonal.
• X is n by n and nonsingular.
• Λ is a p by p diagonal matrix, with

0 ≤ λ 1 ≤ λ2 ≤ . . . ≤ λ p ≤ 1 (5.19)

• M is a p by p diagonal matrix with

1 ≥ µ1 ≥ µ2 ≥ . . . ≥ µp > 0 (5.20)

• The λi and µi are normalized so that

λ2i + µ2i = 1 , i = 1, 2, . . . , p (5.21)

• The generalized singular values are

$$\gamma_i \equiv \frac{\lambda_i}{\mu_i}, \qquad (5.22)$$

with

$$0 \leq \gamma_1 \leq \gamma_2 \leq \ldots \leq \gamma_p. \qquad (5.23)$$

• G and L can be factored as

$$G = U \begin{bmatrix} \Lambda & 0 \\ 0 & I_{n-p} \end{bmatrix} X^{-1} \qquad (5.24)$$

$$L = V \begin{bmatrix} M & 0 \end{bmatrix} X^{-1}. \qquad (5.25)$$

• Consequently,

$$X^T G^T G X = \begin{bmatrix} \Lambda^2 & 0 \\ 0 & I_{n-p} \end{bmatrix} \qquad (5.26)$$

$$X^T L^T L X = \begin{bmatrix} M^2 & 0 \\ 0 & 0 \end{bmatrix}. \qquad (5.27)$$

• When p < n, the matrix L will have a nontrivial null space. It can be
shown that the vectors X·,p+1 , X·,p+2 , . . ., X·,n form a basis for the null
space of L.
When G comes from an IFK, the GSVD typically has two properties that
were also characteristic of the SVD. First, the generalized singular values γ_i
(5.22) tend to zero without any obvious break in the sequence. Second, the vec-
tors U_{·,i}, V_{·,i}, and X_{·,i} tend to become rougher as γ_i decreases.
In applying the GSVD to solve (5.17), we note that we can reformulate the
problem as the least-squares system

$$\min \left\| \begin{bmatrix} G \\ \alpha L \end{bmatrix} m - \begin{bmatrix} d \\ 0 \end{bmatrix} \right\|_2. \qquad (5.28)$$

Assuming that (5.28) is of full rank, the corresponding normal equations are

$$\begin{bmatrix} G \\ \alpha L \end{bmatrix}^T \begin{bmatrix} G \\ \alpha L \end{bmatrix} m = \begin{bmatrix} G \\ \alpha L \end{bmatrix}^T \begin{bmatrix} d \\ 0 \end{bmatrix}. \qquad (5.29)$$
We could solve these normal equations directly, but the GSVD provides a short
cut. Rather than deriving the GSVD solution from scratch, we will simply show
that the solution is
$$m_{\alpha,L} = \sum_{i=1}^{p} \frac{\gamma_i^2}{\gamma_i^2 + \alpha^2} \frac{(U_{\cdot,i})^T d}{\lambda_i}\, X_{\cdot,i} + \sum_{i=p+1}^{n} (U_{\cdot,i})^T d\, X_{\cdot,i} \qquad (5.30)$$

by substituting it into the normal equations.

$$(G^T G + \alpha^2 L^T L)\, m_{\alpha,L} = \left( X^{-T} \begin{bmatrix} \Lambda^2 & 0 \\ 0 & I \end{bmatrix} X^{-1} + \alpha^2 X^{-T} \begin{bmatrix} M^2 & 0 \\ 0 & 0 \end{bmatrix} X^{-1} \right) \left( \sum_{i=1}^{p} \frac{\gamma_i^2}{\gamma_i^2 + \alpha^2} \frac{(U_{\cdot,i})^T d}{\lambda_i}\, X_{\cdot,i} + \sum_{i=p+1}^{n} (U_{\cdot,i})^T d\, X_{\cdot,i} \right), \qquad (5.31)$$

where we have used (5.24) and (5.25); because U and V are orthogonal matrices, the products $U^T U$ and $V^T V$ produce identity matrices, and Λ and M are diagonal and thus symmetric. Combining the two matrix terms,

$$(G^T G + \alpha^2 L^T L)\, m_{\alpha,L} = X^{-T} \begin{bmatrix} \Lambda^2 + \alpha^2 M^2 & 0 \\ 0 & I \end{bmatrix} X^{-1} \left( \sum_{i=1}^{p} \frac{\gamma_i^2}{\gamma_i^2 + \alpha^2} \frac{(U_{\cdot,i})^T d}{\lambda_i}\, X_{\cdot,i} + \sum_{i=p+1}^{n} (U_{\cdot,i})^T d\, X_{\cdot,i} \right).$$

Since $X^{-1} X_{\cdot,i} = e_i$, the diagonal matrix multiplies each term by $\lambda_i^2 + \alpha^2 \mu_i^2$ for $i \leq p$ and by 1 for $i > p$:

$$= \sum_{i=1}^{p} \frac{\gamma_i^2}{\gamma_i^2 + \alpha^2} \frac{(U_{\cdot,i})^T d}{\lambda_i} \left( \lambda_i^2 + \alpha^2 \mu_i^2 \right) X^{-T} e_i + \sum_{i=p+1}^{n} (U_{\cdot,i})^T d\, X^{-T} e_i.$$

Because $\lambda_i^2 + \alpha^2 \mu_i^2 = \mu_i^2 (\gamma_i^2 + \alpha^2)$ and $\gamma_i^2 \mu_i^2 = \lambda_i^2$, this simplifies to

$$= X^{-T} \left( \sum_{i=1}^{p} \lambda_i (U_{\cdot,i})^T d\, e_i + \sum_{i=p+1}^{n} (U_{\cdot,i})^T d\, e_i \right).$$

We can also simplify the right hand side of the original system of equations:

$$G^T d = X^{-T} \begin{bmatrix} \Lambda & 0 \\ 0 & I \end{bmatrix} U^T d = X^{-T} \left( \sum_{i=1}^{p} \lambda_i (U_{\cdot,i})^T d\, e_i + \sum_{i=p+1}^{n} (U_{\cdot,i})^T d\, e_i \right). \qquad (5.32)$$



Thus

$$(G^T G + \alpha^2 L^T L)\, m_{\alpha,L} = G^T d,$$

and (5.30) does indeed solve the normal equations (5.29).

Figure 5.10: A smooth test model for the vsp problem.

Example 5.5

A rough model, such as a spike, has very large derivative norms,
and thus does not provide a suitable example for model estimation
via higher order Tikhonov regularization. We will thus switch to a
smoother alternative example using the vertical seismic profiling
problem of Example 1.3, where the true slowness model is a smooth
function (Figure 5.10). In this case, for a 1-km deep borehole
experiment, we discretized the system using m = n = 50 data and
model points, corresponding to sensors every 20 m and 20–m–thick
model slowness intervals. Figure 5.11 shows the corresponding
arrival time data with N(0, (2 × 10⁻⁴)² s²) noise added.
Solving the problem using zeroth order Tikhonov regularization pro-
duces a model (Figure 5.13) with many fine-scale features that are
not present in the true model (Figure 5.10). The failure to identify a
corner in the L–curve (Figure 5.12) further suggests that this model
is not a good tradeoff between data fit and regularization. If we have
reason to believe that the true model is a smooth function, it is
appropriate to apply higher–order regularization.



Figure 5.11: Data for the smooth model vsp problem, noise added.

Figure 5.12: L–curve for the vsp problem, regularization order 0.



Figure 5.13: Tikhonov solution, regularization order 0, α = 10.0, shown in

comparison with the true model.

Figure 5.14 shows the first-order L-curve for this problem, obtained
using the regularization toolbox commands get_l, cgsvd, l_curve, and
tikhonov. The L-curve now has a well-defined corner near α ≈ 137 for
this particular data realization, and Figure 5.15 shows the
corresponding solution. The first-order regularized solution is much
smoother than the zeroth order regularized one, and is much closer to
the true solution.
Figure 5.16 shows the L-curve for second-order Tikhonov
regularization, which has a corner at α ≈ 2325, and Figure 5.17 shows
the corresponding solution. This solution is smoother still compared
to the first-order regularized solution. Both the first- and
second-order solutions depart most from the true solution at shallow
depths, where the true slowness has the greatest slope and curvature.

5.5 Resolution in Higher Order Regularization

As with zeroth order Tikhonov regularization, we can compute a resolution
matrix for higher order Tikhonov regularization. For particular values of L and
α, the Tikhonov regularization solution can be written as
\[
m_{\alpha,L} = G^{\sharp} d \tag{5.33}
\]

[Figure: L–curve, residual norm ||Gm − d||_2 vs. solution seminorm ||Lm||_2]

Figure 5.14: L–curve for the vsp problem, regularization order 1.

[Figure: slowness (s/km) vs. depth (m)]

Figure 5.15: Tikhonov solution, regularization order 1, α = 137, shown in
comparison with the true model.

[Figure: L–curve, residual norm ||Gm − d||_2 vs. solution seminorm ||Lm||_2]

Figure 5.16: L–curve for the vsp problem, regularization order 2.

[Figure: slowness (s/km) vs. depth (m)]

Figure 5.17: Tikhonov solution, regularization order 2, α = 2325, shown in
comparison with the true model.

where
\[
G^{\sharp} = (G^T G + \alpha^2 L^T L)^{-1} G^T . \tag{5.34}
\]
Using properties of the GSVD we can simplify this expression to
\[
G^{\sharp}
= X \begin{bmatrix} (\Lambda^2 + \alpha^2 M^2)^{-1} & 0 \\ 0 & I \end{bmatrix}
  X^T X^{-T} \begin{bmatrix} \Lambda & 0 \\ 0 & I \end{bmatrix} U^T
= X \begin{bmatrix} (\Lambda^2 + \alpha^2 M^2)^{-1}\Lambda & 0 \\ 0 & I \end{bmatrix} U^T
= X \begin{bmatrix} F\Lambda^{-1} & 0 \\ 0 & I \end{bmatrix} U^T \tag{5.35}
\]
where F is a diagonal matrix of filter factors, with diagonal elements
\[
F_{i,i} = \frac{\gamma_i^2}{\gamma_i^2 + \alpha^2} . \tag{5.36}
\]
The resolution matrix is then
\[
R_{\alpha,L} = G^{\sharp} G
= X \begin{bmatrix} F\Lambda^{-1} & 0 \\ 0 & I \end{bmatrix} U^T
  U \begin{bmatrix} \Lambda & 0 \\ 0 & I \end{bmatrix} X^{-1}
= X \begin{bmatrix} F & 0 \\ 0 & I \end{bmatrix} X^{-1} . \tag{5.37}
\]

Example 5.6
To examine the resolution of the Tikhonov-regularized inversions
of Example 5.5, we perform a spike test using (5.37). Figure 5.18
shows the effect of multiplying R_{α,L} with a unit amplitude spike
model (spike depth 500 m) under zeroth-, first-, and second-order
Tikhonov regularization, using the corresponding α values of 10.0,
137, and 2325, respectively. The shapes of these curves (Figure
5.18) are indicative of the ability of each inversion to recover abrupt
changes in slowness halfway through the model. Progressively in-
creasing widths with increasing regularization order demonstrate de-
creasing resolution as more misfit is allowed in the interest of obtain-
ing a progressively smoother solution. Under first- or second-order
regularization, the resolution of various model features will depend
critically on how smooth or rough these features are in the true
model. In this example, the higher-order solutions recover the true
model better because the true model is smooth.
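For a problem of this size, the spike test can also be carried out without the GSVD by forming the resolution matrix directly from the definition of the regularized inverse. The sketch below uses a stand-in forward operator and an arbitrary α, not the ones from the example:

```python
import numpy as np

# Spike-test sketch: form R = G#G directly from
# G# = (G^T G + alpha^2 L^T L)^{-1} G^T rather than via the GSVD.
# The operator and alpha here are stand-ins, not the example's values.
n = 50
G = 20.0 * np.tril(np.ones((n, n)))    # simple VSP-style forward operator
L = np.diff(np.eye(n), n=2, axis=0)    # second-order roughening matrix
alpha = 50.0

R = np.linalg.solve(G.T @ G + alpha**2 * (L.T @ L), G.T @ G)

spike = np.zeros(n)
spike[n // 2] = 1.0                    # unit spike halfway down the model
recovered = R @ spike                  # what the regularized inverse resolves
```

The width of the recovered pulse about the spike position measures the resolving width; broader pulses mean poorer resolution.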







[Figure]

Figure 5.18: R_{α,L} times the spike model for each of the regularized
solutions of Example 5.5.

[Figure: L–curve, residual norm vs. solution seminorm, labeled by k]

Figure 5.19: L–curve for the TGSVD solution of the Shaw problem, smooth model.

5.6 The TGSVD Method

When we first started working with the singular value decomposition, we
examined the truncated singular value decomposition (TSVD) method of
regularization. We can conceptualize the TSVD as simply tossing out
small singular values and their associated model space basis vectors,
or as a damped SVD solution in which filter factors of one are used for
larger singular values and filter factors of zero are used for smaller
singular values. This approach can also be extended to work with the
generalized singular value decomposition to obtain a truncated
generalized singular value decomposition or TGSVD solution. In this
case we include the k largest generalized singular values with filter
factors of one to obtain the solution
\[
m_{k,L} = \sum_{i=p-k+1}^{p} \frac{(U_{\cdot,i})^T d}{\lambda_i}\, X_{\cdot,i}
        + \sum_{i=p+1}^{n} \left((U_{\cdot,i})^T d\right) X_{\cdot,i} . \tag{5.38}
\]

Figure 5.19 shows the L-curve for the TGSVD solution of the Shaw problem
with the smooth true model. The corner occurs at k = 10. Figure 5.20 shows
the corresponding solution. The model recovery is reasonably good, but it is
not as close to the original model as the solution obtained by second order
Tikhonov regularization.

[Figure]

Figure 5.20: TGSVD solution of the Shaw problem, smooth model.

5.7 Generalized Cross Validation

Generalized cross validation (GCV), developed by Wahba [Wah90], is a method
for selecting the regularization parameter which has a number of desirable sta-
tistical properties. Let
\[
G(\alpha) = \frac{\|G m_{\alpha,L} - d\|_2^2}{\operatorname{Trace}\!\left(I - G G^{\sharp}_{\alpha,L}\right)^2}
          = \frac{V(\alpha)}{T(\alpha)} . \tag{5.39}
\]

Here V (α) measures the misfit. As α increases, V (α) will also increase.
Below the corner of the L–curve, V (α) should increase very slowly. At the corner
of the L–curve, V (α) should begin to increase fairly rapidly. The denominator,
T (α) measures the closeness of the data resolution matrix to the identity matrix.
T (α) is a slowly increasing function of α. Intuitively, we would expect G(α) to
have a minimum near the corner of the L–curve. For values of α below the
corner of the curve, V (α) will be roughly constant and T (α) will be increasing,
so G(α) should be decreasing. For values of α above the corner of the L–curve,
V (α) should be increasing rapidly, while T (α) should be increasing slowly. Thus
G(α) should be increasing. In the GCV method, we pick the value α∗ which
minimizes G(α).
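A brute-force way to carry out this selection is to evaluate (5.39) on a grid of α values and take the minimizer; the regularization toolbox does this far more efficiently using the SVD. In the sketch below the operator, data, and α grid are stand-ins, not the Shaw problem:

```python
import numpy as np

# Sketch: evaluate the GCV function on a grid of alpha values for
# zeroth order regularization (L = I).  The operator and data below
# are stand-ins, not the book's test problem.
rng = np.random.default_rng(1)
n = 30
G = 20.0 * np.tril(np.ones((n, n)))
d = G @ np.sin(np.linspace(0.0, np.pi, n)) + rng.normal(0.0, 0.5, n)

def gcv(alpha):
    Gsharp = np.linalg.solve(G.T @ G + alpha**2 * np.eye(n), G.T)
    V = np.linalg.norm(G @ (Gsharp @ d) - d) ** 2    # misfit
    T = np.trace(np.eye(n) - G @ Gsharp) ** 2        # (data-resolution deficit)^2
    return V / T

alphas = np.logspace(-2, 3, 51)
g_of_alpha = np.array([gcv(a) for a in alphas])
alpha_star = alphas[np.argmin(g_of_alpha)]           # GCV parameter choice
```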
Wahba was able to show that, under reasonable assumptions about the
smoothness of m_true and under the assumption that the noise is white, the
value of α that minimizes G(α) approaches the value that minimizes
E[‖Gm_{α,L} − d_true‖_2] as the number of data points m goes to infinity
[Wah90]. Wahba was also able to show that under the same assumptions,
E[‖m_true − m_{α,L}‖_2] goes to 0 as m goes to infinity [Wah90]. These
results are too complicated to prove

[Figure: G(α) vs. α, minimum at α ≈ 7.04 × 10^{-6}]

Figure 5.21: GCV curve for the Shaw problem, second order regularization.

here, but they do provide an important theoretical justification for the GCV
method of selecting the regularization parameter.

Example 5.7
Figure 5.21 shows a graph of G(α) for our Shaw test problem, us-
ing second order Tikhonov regularization. The minimum occurs at
α = 7.04 × 10−6 . Figure 5.22 shows the corresponding solution.
This solution is actually the best of all the solutions that we have
obtained for this test problem. The norm of the difference between
this solution and the true solution is 0.028.









[Figure]

Figure 5.22: GCV solution for the Shaw problem, second order, α = 7.04 × 10^{-6}.

5.8 Error Bounds

In this section we present two theoretical results which help to answer some
questions about the accuracy of Tikhonov regularization solutions. We will
present these results in a very simplified form, covering only zeroth order
Tikhonov regularization.
The first question is whether, for a particular value of the regularization
parameter α, we can establish a bound on the sensitivity of the regularized
solution to noise in the observed data d and/or errors in the system matrix
G. This would provide a sort of condition number for the inverse problem.
Note that this does not tell us how far the regularized solution is from the
true model, since Tikhonov regularization has introduced a bias in the solution.
Under Tikhonov regularization with a nonzero α, we would not obtain the true
model even if the noise were zero.
Hansen [Han98] gives such a bound, and the following theorem states it for
zeroth order Tikhonov regularization. A slightly more complicated formula is
available for higher order Tikhonov regularization.

Theorem 5.1

Suppose that the problems
\[
\min \|G m - d\|_2^2 + \alpha^2 \|m\|_2^2 \tag{5.40}
\]
\[
\min \|\bar{G} m - \bar{d}\|_2^2 + \alpha^2 \|m\|_2^2 \tag{5.41}
\]
are solved to obtain m_α and m̄_α. Then
\[
\frac{\|m_\alpha - \bar{m}_\alpha\|_2}{\|m_\alpha\|_2}
\le \frac{\bar{\kappa}_\alpha}{1 - \epsilon\,\bar{\kappa}_\alpha}
\left( 2\epsilon + \frac{\|e\|_2}{\|d_\alpha\|_2}
 + \epsilon\,\bar{\kappa}_\alpha \frac{\|r_\alpha\|_2}{\|d_\alpha\|_2} \right) \tag{5.42}
\]
where
\[
\bar{\kappa}_\alpha = \frac{\|G\|_2}{\alpha} \tag{5.43}
\]
\[
E = G - \bar{G} \tag{5.44}
\]
\[
e = d - \bar{d} \tag{5.45}
\]
\[
\epsilon = \frac{\|E\|_2}{\|G\|_2} \tag{5.46}
\]
\[
d_\alpha = G m_\alpha \tag{5.47}
\]
\[
r_\alpha = d - d_\alpha . \tag{5.48}
\]



[Figure]

Figure 5.23: The spike model for the Shaw problem.

In the particular case when G = Ḡ, and the only difference between the two
problems is e = d − d̄, the inequality becomes even simpler:
\[
\frac{\|m_\alpha - \bar{m}_\alpha\|_2}{\|m_\alpha\|_2}
\le \bar{\kappa}_\alpha \frac{\|e\|_2}{\|d_\alpha\|_2} . \tag{5.49}
\]

The condition number κ̄α is inversely proportional to α. Thus increasing α will

decrease the sensitivity of the solution to perturbations in the data. Of course,
increasing α also increases the error in the solution due to regularization bias.
The second question is whether we can establish any sort of bound on the
norm of the difference between the regularized solution and the true model. This
bound would incorporate both sensitivity to noise and the bias introduced by
Tikhonov regularization. Such a bound must of course depend on the magnitude
of the noise on the data. It must also depend on the particular regularization
parameter chosen. Tikhonov [TA77] developed a beautiful theorem that ad-
dresses this question in the context of inverse problems involving IFK’s. More
recently, Neumaier [Neu98] has developed a version of Tikhonov’s theorem that
can be applied directly to discretized problems.
Recall that in a discrete ill-posed problem, the matrix G has a smoothing
effect: when we multiply G times m, the result is smoother than m. Similarly,
if we multiply G^T times Gm, the result is smoother than Gm. For example,
Figures 5.23 through 5.25 show m (our original spike model for the Shaw
problem), G^T m, and G^T Gm.
It should be clear that models in the range of GT form a relatively smooth
subspace of all possible models. Models in this subspace of Rn can be written
as m = GT w, for some weights w. Furthermore, models in the range of GT G








[Figure]

Figure 5.24: G^T times the spike model for the Shaw problem.









[Figure]

Figure 5.25: G^T G times the spike model for the Shaw problem.

form a subspace of R(GT ), since any model in R(GT G) can be written as

m = GT (Gw) which is a linear combination of columns of GT . Because of
the smoothing effect of G and GT , we would expect these models to be even
smoother than the models in R(GT ). We could construct smaller subspaces of
Rn that contain even smoother models, but it turns out that with zeroth order
Tikhonov regularization these are the only subspaces of interest.
There is another way to see that models in R(G^T) will be relatively smooth.
Recall that the vectors V_{·,1}, V_{·,2}, ..., V_{·,p} from the SVD of G form
an orthogonal basis for R(G^T). For discrete ill-posed problems, we know from
Chapter 4 that these basis vectors will be relatively smooth, so linear
combinations of these vectors in R(G^T) should be smooth.
The following theorem gives a bound on the total error (bias due to regular-
ization plus error due to noise in the data) for zeroth order Tikhonov regular-
ization. Unfortunately, the theorem only applies in situations where we begin
with a true model mtrue in one of the aforementioned subspaces of Rn . This
seldom happens in practice, but theorem does help us understand how the error
in the Tikhonov regularization solution behaves in the best possible case.

Theorem 5.2
Suppose that we use zeroth order Tikhonov regularization to solve
Gm = d, that m_true can be expressed as
\[
m_{true} = \begin{cases} G^T w & p = 1 \\ G^T G w & p = 2 \end{cases} \tag{5.50}
\]
and that
\[
\|G m_{true} - d\|_2 \le \Delta \|w\|_2 \tag{5.51}
\]
for some ∆ > 0. Then
\[
\|m_{true} - G^{\sharp} d\|_2 \le \left( \frac{\Delta}{2\alpha} + \gamma \alpha^p \right) \|w\|_2 \tag{5.52}
\]
where
\[
\gamma = \begin{cases} 1/2 & p = 1 \\ 1 & p = 2 \end{cases} . \tag{5.53}
\]
Furthermore, if we begin with the bound
\[
\|G m_{true} - d\|_2 \le \delta \tag{5.54}
\]
we can let
\[
\Delta = \frac{\delta}{\|w\|_2} . \tag{5.55}
\]
Under this condition, it can be shown that the optimal value of α is
\[
\hat{\alpha} = \left( \frac{\Delta}{2 \gamma p} \right)^{\frac{1}{p+1}} = O\!\left(\Delta^{\frac{1}{p+1}}\right) . \tag{5.56}
\]
With this choice of α,
\[
\Delta = 2 \gamma p \, \hat{\alpha}^{p+1} \tag{5.57}
\]
and the error bound simplifies to
\[
\|m_{true} - G^{\sharp}_{\hat{\alpha}} d\|_2 \le \gamma (p+1) \hat{\alpha}^p \|w\|_2 = O\!\left(\Delta^{\frac{p}{p+1}}\right) . \tag{5.58}
\]

This theorem tells us that the error in the Tikhonov regularization solution
depends on both the noise level ∆ and on the regularization parameter α. For
very large values of α, the error due to regularization will be dominant. For very
small values of α, the error due to noise in the data will be dominant. There
is an optimal value of α which balances these effects. Using the optimal α, we
can obtain an error bound of O(∆^{2/3}) if p = 2, and an error bound of O(∆^{1/2})
if p = 1.
Of course, the above result can only be applied when our true model lives
in the very restricted subspace of models in R(GT ). In practice this seldom
happens. The result also depends on ∆, which we can at best approximate by
∆ ≈ δ. As a rule of thumb, if the model mtrue is reasonably smooth, then
we can hope for an error in the Tikhonov regularized solution which is O(δ^{1/2}).
Another way of saying this is that we can hope for an answer with about half
as many accurate digits as the data.

5.9 Exercises
1. Use the method of Lagrange multipliers to derive the damped least squares
problem (5.2) from the discrepancy principle problem (5.1), and demon-
strate that (5.2) can be written as (5.4).

2. Consider the integral equation and data set from Problem 3.1. You can
find a copy of this data set in ifk.mat.

(a) Discretize the problem using simple collocation.

(b) Using the data supplied, and assuming that the numbers are accurate
to four significant figures, determine a reasonable bound δ for the
misfit ‖Gm − d‖₂.
(c) Use zeroth order Tikhonov regularization to solve the problem. Use
GCV, the discrepancy principle and the L–curve criterion to pick
the regularization parameter. Estimate the norm of the difference
between your solutions and mtrue .
(d) Use first order Tikhonov regularization to solve the problem. Use
GCV, the discrepancy principle and the L–curve criterion to pick the
regularization parameter.

(e) Use second order Tikhonov regularization to solve the problem. Use
GCV, the discrepancy principle and the L–curve criterion to pick the
regularization parameter.
(f) Analyze the resolution of your solutions. Are the features you see
in your inverse solutions unambiguously real? Interpret your results.
Describe the size and location of any significant features in the
solutions.

3. Consider the following problem in cross well tomography. Two vertical
wells are located 1600 meters apart. A seismic source is inserted in one
well at depths of 50, 150, ..., 1550 m. A string of hydrophones is inserted
in the other well at depths of 50, 150, ..., 1550 m. For each source-
receiver pair, a travel time is recorded, with a measurement standard
deviation of 0.5 msec. We wish to determine the velocity structure in the
two dimensional plane between the two wells.
Discretizing the problem into a 16 by 16 grid of 100 meter by 100 meter
blocks gives 256 ray paths and 256 data points. The G matrix and noisy
data vector d for this problem, assuming straight-line ray paths, are in the
file crosswell.mat.

(a) Use the truncated SVD to solve this inverse problem. Plot the result
using imagesc and colorbar.
(b) Use zeroth order Tikhonov regularization to solve this problem and
plot your solution. Explain why it is hard to use the discrepancy
principle to select the regularization parameter. Use the L–curve
criterion to select your regularization parameter. Plot the L–curve
as well as your solution.
(c) Use second order Tikhonov regularization to solve this problem and
plot your solution. Because this is a two dimensional problem, you
will need to implement a finite difference approximation to the Lapla-
cian (second derivative in the horizontal direction plus the second
derivative in the vertical direction). You can generate the appropri-
ate L roughening matrix using the following MATLAB code:
L = zeros(14*14, 256);
k = 1;
for i=2:15,
  for j=2:15,
    % Five-point Laplacian stencil for interior cell (i,j); the model
    % vector is assumed to use MATLAB's column-major ordering.
    M = zeros(16,16);
    M(i,j) = -4;
    M(i-1,j) = 1;
    M(i+1,j) = 1;
    M(i,j-1) = 1;
    M(i,j+1) = 1;
    L(k,:) = reshape(M, 1, 256);
    k = k+1;
  end
end

What, if any, problems did you have in using the L–curve criterion on
this problem? Plot the L–curve as well as your solution.
(d) Discuss your results. Can you explain the characteristic vertical
bands that appeared in some of your solutions?

5.10 Notes and Further Reading

Per Christian Hansen’s book [Han98] is a very complete reference on the linear
algebra of Tikhonov regularization. Arnold Neumaier’s tutorial [Neu98] is also
a very useful reference. Two other useful surveys of Tikhonov regularization are
[Eng93, EHN96]. Vogel [Vog02] includes an extensive discussion of methods for
selecting the regularization parameter. Hansen’s MATLAB regularization tool-
box [Han94] is a very useful collection of software for performing regularization
within MATLAB.
Within statistics, badly conditioned linear regression problems are said to
suffer from "multicollinearity." A method called "ridge regression," which is
exactly Tikhonov regularization, is often used to deal with such problems.
Statisticians also use a method called "principal components regression
(PCR)," which is simply the TSVD method.
Kaczmarz’s algorithm

Chapter 6

Iterative Methods

SVD based pseudo inverse and Tikhonov regularization solutions become im-
practical when we consider larger problems in which G has tens of thousands
of rows and columns. The problem is that while G may be a sparse matrix, the
U and V matrices in the SVD will typically be dense matrices. For example,
consider a tomography problem in which the model is of size 256 by 256 (65536
model elements), and there are 100,000 ray paths. Most of the ray paths miss
most of the model cells, so the majority of the entries in G are zero. Thus the
G matrix might have a density of less than 1%. If we stored G as a regular
dense matrix, it would require about 50 gigabytes of storage. Furthermore, the
U matrix would require 80 gigabytes of storage, and the V matrix would require
about 35 gigabytes of storage. However, the sparse matrix G can be stored in
less than one gigabyte of storage.
Because iterative methods can take advantage of the sparsity commonly
found in the G matrix, iterative methods are often the only applicable method
for large problems.

6.1 Iterative Methods for Tomography Problems

We will concentrate in this section on Kaczmarz’s algorithm and its ART and
SIRT variants. These algorithms were originally developed for tomographic
problems and are particularly effective for such problems.
Kaczmarz’s algorithm is an easy to implement algorithm for solving a linear
system of equations Gm = d. The algorithm starts with an initial solution
m(0) , and then moves to a solution m(1) by projecting the initial solution onto
the hyperplane defined by the first equation in the system of equations. Next,
we project the solution m(1) onto the hyperplane defined by the second equa-
tion. The process is repeated until the solution has been projected onto the
hyperplanes defined by all of the equations. At that point, we can begin a new
cycle through the hyperplanes.




Figure 6.1: Kaczmarz’s algorithm on a system of two equations.

Figure 6.1 shows an example in which Kaczmarz's algorithm is used to solve
the system of equations
\[
\begin{aligned}
y &= 1 \\
-x + y &= -1 .
\end{aligned}
\]
In order to implement the algorithm, we need a formula to compute the
projection of a vector onto the hyperplane defined by equation i. Let
G_{i,·} be the ith row of G. Consider the hyperplane defined by
G_{i+1,·} m = d_{i+1}. The vector G_{i+1,·}^T is perpendicular to this
hyperplane. Thus the update to m^{(i)} will be of the form
\[
m^{(i+1)} = m^{(i)} + \alpha G_{i+1,\cdot}^T . \tag{6.1}
\]
We use the fact that G_{i+1,\cdot} m^{(i+1)} = d_{i+1} to solve for α:
\[
G_{i+1,\cdot}\left(m^{(i)} + \alpha G_{i+1,\cdot}^T\right) = d_{i+1} \tag{6.2}
\]
\[
G_{i+1,\cdot}\, m^{(i)} - d_{i+1} = -\alpha\, G_{i+1,\cdot} G_{i+1,\cdot}^T \tag{6.3}
\]
\[
\alpha = -\frac{G_{i+1,\cdot}\, m^{(i)} - d_{i+1}}{G_{i+1,\cdot} G_{i+1,\cdot}^T} . \tag{6.4}
\]
Thus the update formula is
\[
m^{(i+1)} = m^{(i)} - \frac{G_{i+1,\cdot}\, m^{(i)} - d_{i+1}}{G_{i+1,\cdot} G_{i+1,\cdot}^T}\, G_{i+1,\cdot}^T . \tag{6.5}
\]
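The update formula (6.5) translates almost line for line into code. Here is a sketch (an illustrative implementation, not the book's code), applied to the two-equation system of Figure 6.1:

```python
import numpy as np

def kaczmarz(G, d, n_sweeps=100):
    # Kaczmarz's algorithm: project the current solution onto the
    # hyperplane of each row in turn, using the update formula (6.5).
    m = np.zeros(G.shape[1])        # m(0) = 0 leads to a minimum length
    for _ in range(n_sweeps):       # solution for consistent systems
        for i in range(G.shape[0]):
            gi = G[i, :]
            m = m - ((gi @ m - d[i]) / (gi @ gi)) * gi
    return m

# The two-equation system of Figure 6.1: y = 1 and -x + y = -1.
G = np.array([[0.0, 1.0],
              [-1.0, 1.0]])
d = np.array([1.0, -1.0])
m = kaczmarz(G, d)                  # converges to the intersection (2, 1)
```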



Figure 6.2: Slow convergence occurs when hyperplanes are nearly parallel.

It can be shown that if the system of equations Gm = d has a unique

solution, then Kaczmarz’s algorithm will converge to this solution. If the system
of equations has many solutions, then the algorithm will converge to the solution
which is closest to the point m(0) . In particular, if we start with m(0) = 0, we
will obtain a minimum length solution. If there is no exact solution to the
system of equations, then the algorithm will fail to converge, but will typically
bounce around near an approximate solution.
A second important question is how fast Kaczmarz’s algorithm will converge
to a solution. If the hyperplanes described by the system of equations are nearly
orthogonal, then the algorithm will converge very quickly. However, if two or
more hyperplanes are nearly parallel to each other, convergence can be extremely
slow. Figure 6.2 shows a typical situation in which the algorithm zigzags back
and forth without making much progress towards a solution. As the two lines
become more nearly parallel, the problem becomes worse. This problem can be
alleviated by picking an ordering of the equations such that adjacent equations
describe hyperplanes which are nearly orthogonal to each other. In the context
of tomography, this can be done by ordering the equations so that successive
equations do not share common cells in the 2–dimensional model.

Example 6.1

In this example, we will consider a tomographic reconstruction prob-

lem with the same geometry used in homework exercise 4.1. The true
model is shown in Figure 6.3.

[Figure]
Figure 6.3: True Model.

Synthetic data was generated, with

normally distributed random noise added. The random noise had
standard deviation 0.01. Figure 6.4 shows the truncated SVD solu-
tion. The two blobs are apparent, but it is not possible to distinguish
the small hole within the larger of the two blobs.
Figure 6.5 shows the solution obtained after 200 iterations of Kacz-
marz’s algorithm. This solution is very similar to the truncated SVD
solution and both solutions are about the same distance from the
true model.

The ART algorithm is a version of Kaczmarz’s algorithm which has been

modified especially for the tomographic reconstruction problem. Notice that in
(6.5), the updates to the solution always consist of adding a multiple of a row
of G to the current solution. The numerator in the fraction is the difference
between the right hand side of equation i+1 and the value of the i+1 component
of Gm. The denominator is simply the square of the norm of row i + 1 of G.
Effectively, Kaczmarz’s algorithm is determining the error in equation i + 1,
and then adjusting the solution by spreading the required correction over the
elements of m which appear in equation i + 1.
A very crude approximation to Kaczmarz's update is to replace all of
the nonzero entries in row i + 1 of G with ones. Using this approximation,
\[
q_{i+1} = \sum_{\text{cell } j \text{ in ray path } i+1} m_j^{(i)}
\]




[Figure]
Figure 6.4: Truncated SVD Solution.

[Figure]
Figure 6.5: Kaczmarz's Algorithm Solution.


is an approximation to the sum of the slownesses along ray path i + 1. The
difference between q_{i+1} and d_{i+1} is roughly the error in our predicted
travel time for ray i + 1. The denominator of the fraction becomes N_{i+1},
the number of cells along ray path i + 1, or equivalently the number of
nonzero entries in row i + 1 of G.
Thus we are taking the total error in the travel time for ray i + 1 and
dividing it by the number of cells in ray path i + 1. This correction factor is
then multiplied by a vector which has ones in cells along the ray path i + 1.
This has the effect of smearing the needed correction in travel time over all of
the cells in ray path i + 1.
The new approximate update formula can be written as
\[
m_j^{(i+1)} = \begin{cases}
m_j^{(i)} - \dfrac{q_{i+1} - d_{i+1}}{N_{i+1}} & \text{cell } j \text{ in ray path } i+1 \\[1ex]
m_j^{(i)} & \text{cell } j \text{ not in ray path } i+1 .
\end{cases} \tag{6.6}
\]
The approximation can be improved by taking into account that the ray
paths through some cells are longer than the ray paths through other cells.
Let L_{i+1} be the length of ray path i + 1. We can improve the
approximation by using the following update formula:
\[
m_j^{(i+1)} = \begin{cases}
m_j^{(i)} + \dfrac{d_{i+1}}{L_{i+1}} - \dfrac{q_{i+1}}{N_{i+1}} & \text{cell } j \text{ in ray path } i+1 \\[1ex]
m_j^{(i)} & \text{cell } j \text{ not in ray path } i+1 .
\end{cases} \tag{6.7}
\]
The main advantage of ART is that it saves storage: we need only store
information about which rays pass through which cells, and we do not need to
record the length of each ray in each cell. A second advantage of the method is
that it reduces the number of floating point multiplications required by Kacz-
marz’s algorithm. Although in current computers floating point multiplications
and additions require roughly the same amount of time, during the 1970’s when
ART was first developed, multiplication was slower than addition.
One problem with ART is that the resulting tomographic images tend to
be somewhat more noisy than the images resulting from Kaczmarz’s algorithm.
The Simultaneous Iterative Reconstruction Technique (SIRT) is a variation on
ART which gives slightly better images in practice, at the expense of a slightly
slower algorithm. In the SIRT algorithm, we compute updates to cell j of the
model for each ray i that passes through cell j. The updates for cell j are added
together and then divided by Nj before being added to mj .
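The simplified update (6.6) only needs to know which cells each ray crosses. The sketch below applies it to a made-up four-cell model with four rays; both the geometry and the data are hypothetical, chosen only to illustrate the bookkeeping:

```python
import numpy as np

# Sketch of the simplified ART update (6.6) on a made-up model: four
# cells, four rays, every cell crossing given unit weight.  `rays[i]`
# lists the cells crossed by ray i.
rays = [[0, 1], [2, 3], [0, 2], [1, 3]]   # cells touched by each ray
d = np.array([2.0, 4.0, 2.5, 3.5])        # observed travel times
m = np.zeros(4)                           # starting model

for _ in range(200):                      # sweeps over all of the rays
    for ray, t in zip(rays, d):
        q = m[ray].sum()                  # predicted time, unit weights
        m[ray] -= (q - t) / len(ray)      # smear the correction over the ray
```

A SIRT-style variant would instead accumulate the corrections for every ray first, then apply the averaged correction to each cell once per sweep.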

Example 6.2
Returning to our earlier tomography example, Figure 6.6 shows the
ART solution obtained after 200 iterations. Again, the solution is
very similar to the truncated SVD solution.
Figure 6.7 shows the SIRT solution for our example tomography
problem. This solution is similar to the Kaczmarz and ART solutions.




[Figure]
Figure 6.6: ART Solution.

[Figure]
Figure 6.7: SIRT Solution.


6.2 The Conjugate Gradient Method

In this section we will consider the method of conjugate gradients (CG) for
solving a symmetric and positive definite system of equations Ax = b. In
the next section we will apply the CG method to the normal equations for
Gm = d.
We begin by considering the quadratic optimization problem
\[
\min \phi(x) = \frac{1}{2} x^T A x - b^T x \tag{6.8}
\]
where A is an n by n symmetric and positive definite matrix. We require that
A be positive definite so that the function φ(x) will be convex and have a
unique minimum. We can calculate ∇φ(x) = Ax − b and set it equal to zero to
find the minimum. The minimum occurs at a point x that satisfies the equation
\[
\nabla \phi(x) = A x - b = 0 \tag{6.9}
\]
or
\[
A x = b . \tag{6.10}
\]
Thus solving the system of equations Ax = b is equivalent to minimizing φ(x).
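This equivalence is easy to check numerically: the solution of Ax = b gives a smaller value of φ than nearby perturbed points. The small SPD matrix below is an arbitrary example:

```python
import numpy as np

# Numerical check: the solution of Ax = b minimizes
# phi(x) = 0.5 x^T A x - b^T x, for an arbitrary SPD matrix A.
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0])

def phi(x):
    return 0.5 * (x @ A @ x) - b @ x

x_star = np.linalg.solve(A, b)

# phi at x_star is below phi at randomly perturbed nearby points.
rng = np.random.default_rng(3)
perturbed = [phi(x_star + 0.1 * rng.normal(size=2)) for _ in range(100)]
```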
The method of conjugate gradients approaches the problem of minimizing
φ(x) by constructing a basis for R^n in which the minimization problem is
extremely simple. The basis vectors p_0, p_1, ..., p_{n−1} are selected so
that
\[
p_i^T A p_j = 0 \quad \text{when } i \ne j . \tag{6.11}
\]
A collection of vectors with this property is said to be mutually conjugate
with respect to A. We express x in terms of these basis vectors as
\[
x = \sum_{i=0}^{n-1} \alpha_i p_i . \tag{6.12}
\]
Then
\[
\phi(\alpha) = \frac{1}{2} \left( \sum_{i=0}^{n-1} \alpha_i p_i \right)^{\!T} A \left( \sum_{i=0}^{n-1} \alpha_i p_i \right) - b^T \sum_{i=0}^{n-1} \alpha_i p_i . \tag{6.13}
\]
The product x^T A x can be written as a double sum:
\[
\phi(\alpha) = \frac{1}{2} \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} \alpha_i \alpha_j \, p_i^T A p_j - b^T \sum_{i=0}^{n-1} \alpha_i p_i . \tag{6.14}
\]
Since the vectors are mutually conjugate with respect to A, this simplifies to
\[
\phi(\alpha) = \frac{1}{2} \sum_{i=0}^{n-1} \alpha_i^2 \, p_i^T A p_i - b^T \sum_{i=0}^{n-1} \alpha_i p_i \tag{6.15}
\]
\[
= \frac{1}{2} \sum_{i=0}^{n-1} \left( \alpha_i^2 \, p_i^T A p_i - 2 \alpha_i \, b^T p_i \right) . \tag{6.16}
\]

Equation (6.16) shows that φ(α) consists of n terms, each of which is inde-
pendent of the other terms. Thus we can minimize φ(α) by selecting each α_i
to minimize the ith term,
\[
\alpha_i^2 \, p_i^T A p_i - 2 \alpha_i \, b^T p_i .
\]
Differentiating with respect to α_i and setting the derivative equal to zero,
we find that the optimal value for α_i is
\[
\alpha_i = \frac{b^T p_i}{p_i^T A p_i} . \tag{6.17}
\]

We’ve seen that if we have a basis of vectors which are mutually conjugate
with respect to A, then minimizing φ(x) is very easy. We have not yet shown
how to construct the mutually conjugate basis vectors.
Our algorithm will actually construct a sequence of solution vectors xi , resid-
ual vectors ri = b − Axi , and basis vectors pi . The algorithm begins with
x0 = 0, r0 = b, p0 = r0 , and α0 = (rT0 r0 )/(pT0 Ap0 ).
Suppose that at the start of iteration k of the algorithm we have constructed
x0 , x1 , . . ., xk , r0 , r1 , . . ., rk , p0 , p1 , . . ., pk and α0 , α1 , . . ., αk . We assume
that the first k + 1 basis vectors pi are mutually conjugate with respect to A,
the first k + 1 residual vectors ri are mutually orthogonal, and that rTi pj = 0
when i 6= j.
We let
xk+1 = xk + αk pk . (6.18)
This effectively adds one more term of (6.12) into the solution. Next, we let

rk+1 = rk − αk Apk . (6.19)

This correctly updates the residual, because

rk+1 = b − Axk+1 (6.20)

= b − A (xk + αk pk ) (6.21)
= (b − Axk ) − αk Apk (6.22)
= rk − αk Apk . (6.23)

We let
\[
\beta_{k+1} = \frac{r_{k+1}^T r_{k+1}}{r_k^T r_k} . \tag{6.24}
\]
Finally, we let
\[
p_{k+1} = r_{k+1} + \beta_{k+1} p_k . \tag{6.25}
\]

In the following calculations, it will be useful to know that bT pk = rTk rk .

To see this,

bT pk = (rk + Axk )T pk (6.26)

= rTk pk + pTk Axk (6.27)
= rTk (rk + βk pk−1 ) + pTk Axk (6.28)
= rTk rk + βk rTk pk−1 + pTk A (α0 p0 + . . . αk−1 pk−1 ) (6.29)
= rTk rk + 0 + 0 (6.30)
= rTk rk (6.31)

We will now show that rk+1 is orthogonal to ri for i ≤ k. For every i < k,

rTk+1 ri = (rk − αk Apk )T ri (6.32)

= rTk ri − αk rTi Apk (6.33)

Since r_k is orthogonal to all of the earlier r_i vectors,
\[
r_{k+1}^T r_i = 0 - \alpha_k \, r_i^T A p_k . \tag{6.34}
\]
Since p_i = r_i + β_i p_{i−1}, we can substitute r_i = p_i − β_i p_{i−1} to
obtain
\[
r_{k+1}^T r_i = 0 - \alpha_k (p_i - \beta_i p_{i-1})^T A p_k . \tag{6.35}
\]

Both pi and pi−1 are conjugate with pk . Thus

rTk+1 ri = 0. (6.36)

We also have to show that r_{k+1}^T r_k = 0:
\[
r_{k+1}^T r_k = (r_k - \alpha_k A p_k)^T r_k \tag{6.37}
\]
\[
= r_k^T r_k - \alpha_k (p_k - \beta_k p_{k-1})^T A p_k \tag{6.38}
\]
\[
= r_k^T r_k - \alpha_k \, p_k^T A p_k + \alpha_k \beta_k \, p_{k-1}^T A p_k \tag{6.39}
\]
\[
= r_k^T r_k - r_k^T r_k + \alpha_k \beta_k \cdot 0 \tag{6.40}
\]
\[
= 0 . \tag{6.41}
\]

Next, we will show that r_{k+1} is orthogonal to p_i for i ≤ k:
\[
r_{k+1}^T p_i = r_{k+1}^T (r_i + \beta_i p_{i-1}) \tag{6.42}
\]
\[
= r_{k+1}^T r_i + \beta_i \, r_{k+1}^T p_{i-1} \tag{6.43}
\]
\[
= 0 + \beta_i \, r_{k+1}^T p_{i-1} \tag{6.44}
\]
\[
= \beta_i (r_k - \alpha_k A p_k)^T p_{i-1} \tag{6.45}
\]
\[
= \beta_i \left( r_k^T p_{i-1} - \alpha_k \, p_{i-1}^T A p_k \right) \tag{6.46}
\]
\[
= \beta_i (0 - 0) \tag{6.47}
\]
\[
= 0 . \tag{6.48}
\]

Finally, we need to show that p_{k+1}^T A p_i = 0 for i ≤ k. For i < k,
using A p_i = (r_i − r_{i+1})/α_i from (6.19),
\[
p_{k+1}^T A p_i = (r_{k+1} + \beta_{k+1} p_k)^T A p_i \tag{6.49}
\]
\[
= r_{k+1}^T A p_i + \beta_{k+1} \, p_k^T A p_i \tag{6.50}
\]
\[
= r_{k+1}^T A p_i + 0 \tag{6.51}
\]
\[
= \frac{1}{\alpha_i} r_{k+1}^T (r_i - r_{i+1}) \tag{6.52}
\]
\[
= \frac{1}{\alpha_i} \left( r_{k+1}^T r_i - r_{k+1}^T r_{i+1} \right) \tag{6.53}
\]
\[
= 0 . \tag{6.54}
\]

We also have to deal with the case where i = k:
\[
p_{k+1}^T A p_k = (r_{k+1} + \beta_{k+1} p_k)^T \frac{1}{\alpha_k}(r_k - r_{k+1}) \tag{6.55}
\]
\[
= \frac{1}{\alpha_k}\left( \beta_{k+1} (r_k + \beta_k p_{k-1})^T r_k - r_{k+1}^T r_{k+1} \right) \tag{6.56}
\]
\[
= \frac{1}{\alpha_k}\left( \beta_{k+1} \, r_k^T r_k + \beta_{k+1} \beta_k \, p_{k-1}^T r_k - r_{k+1}^T r_{k+1} \right) \tag{6.57}
\]
\[
= \frac{1}{\alpha_k}\left( r_{k+1}^T r_{k+1} + \beta_{k+1} \beta_k \cdot 0 - r_{k+1}^T r_{k+1} \right) \tag{6.58}
\]
\[
= 0 . \tag{6.59}
\]

We have now shown that the algorithm generates a sequence of mutually
conjugate basis vectors. In theory, the algorithm finds an exact solution to
the system of equations in n iterations. In practice, due to round-off errors
in the computation, the exact solution may not be obtained in n iterations.
In practical implementations of the algorithm, we iterate until the residual
is smaller than some tolerance that we specify. The algorithm can be
summarized as:

Algorithm 6.1 Conjugate Gradients

Given a positive definite and symmetric system of equations Ax = b and an
initial solution x_0, let β_{−1} = 0, p_{−1} = 0, r_0 = A x_0 − b, and
k = 0. (Note that here r_k is the gradient A x_k − b, the negative of the
residual used in the derivation above.) Repeat the following steps until
convergence.

1. Let p_k = −r_k + β_{k−1} p_{k−1}.
2. Let α_k = ‖r_k‖_2^2 / (p_k^T A p_k).
3. Let x_{k+1} = x_k + α_k p_k.
4. Let r_{k+1} = r_k + α_k A p_k.
5. Let β_k = ‖r_{k+1}‖_2^2 / ‖r_k‖_2^2.
6. Let k = k + 1.
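Algorithm 6.1 can be transcribed directly into code. The sketch below uses the convention r_k = A x_k − b (the gradient of φ), so that p_k = −r_k + β_{k−1} p_{k−1} is a descent direction; the 2 by 2 system is an arbitrary SPD example:

```python
import numpy as np

def cg(A, b, x0, tol=1e-10, max_iter=1000):
    # Conjugate gradient iteration of Algorithm 6.1, with the residual
    # convention r = A x - b (the gradient of phi at x).
    x = x0.copy()
    r = A @ x - b
    p = np.zeros_like(b)
    beta = 0.0
    for _ in range(max_iter):
        p = -r + beta * p
        alpha = (r @ r) / (p @ (A @ p))
        x = x + alpha * p
        r_new = r + alpha * (A @ p)
        beta = (r_new @ r_new) / (r @ r)
        r = r_new
        if np.linalg.norm(r) < tol:
            break
    return x

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])              # symmetric positive definite
b = np.array([1.0, 2.0])
x = cg(A, b, np.zeros(2))               # agrees with solving Ax = b
```

For this 2 by 2 system the exact solution is reached (up to round-off) in two iterations, as the theory predicts.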

A major advantage of the CG method is that it requires storage only for the
vectors x_k, p_k, r_k and the matrix A. If A is large and sparse, then sparse
matrix techniques can be used to store A. Unlike factorization methods (QR,
SVD, Cholesky), there will be no fill-in of the zero entries in A. Thus it is
possible to solve extremely large systems (n in the hundreds of thousands)
using CG, while direct factorization would require far too much storage.

6.3 The CGLS Method

The CG method by itself can only be applied to positive definite systems of
equations. It is not directly applicable to the sorts of least squares problems
that we are interested in. However, we can apply the CG method to the normal
equations for a least squares problem. In the CGLS method, we solve a least
squares problem
min ‖Gm − d‖_2    (6.60)
by applying CG to the normal equations

G^T G m = G^T d .    (6.61)

In implementing this algorithm it is important to avoid round-off errors. One important source of error is the evaluation of the residual, G^T G m − G^T d. It turns out that this calculation is more accurate when we factor out G^T and compute G^T(Gm − d). We will use the notation s_k = G m_k − d, and r_k = G^T s_k. Note that we can compute s_{k+1} recursively from s_k as follows:

s_{k+1} = G m_{k+1} − d    (6.62)
    = G(m_k + α_k p_k) − d    (6.63)
    = (G m_k − d) + α_k G p_k    (6.64)
    = s_k + α_k G p_k .    (6.65)

With this trick, we can now state the CGLS algorithm.

Algorithm 6.2 CGLS

Given a system of equations Gm = d, let k = 0, m_0 = 0, p_{−1} = 0, β_{−1} = 0, s_0 = −d, and r_0 = G^T s_0. Repeat the following steps until convergence.

1. Let p_k = −r_k + β_{k−1} p_{k−1}.
2. Let α_k = ‖r_k‖_2^2 / ((p_k^T G^T)(G p_k)).
3. Let m_{k+1} = m_k + α_k p_k.
4. Let s_{k+1} = s_k + α_k G p_k.
5. Let r_{k+1} = G^T s_{k+1}.
6. Let β_k = ‖r_{k+1}‖_2^2 / ‖r_k‖_2^2.
7. Let k = k + 1.
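Algorithm 6.2 can be sketched as follows (an illustrative implementation of ours, not the book's code; note that G^T G is never formed explicitly):

```python
import numpy as np

def cgls(G, d, maxiter=50, tol=1e-10):
    """CGLS sketch following Algorithm 6.2 (illustrative, not the book's code).

    Applies CG to the normal equations G^T G m = G^T d without ever
    forming G^T G; s_k = G m_k - d is updated recursively as in (6.65).
    """
    m = np.zeros(G.shape[1])
    s = -d.astype(float)            # s_0 = G m_0 - d with m_0 = 0
    r = G.T @ s                     # r_0 = G^T s_0
    p = -r                          # beta_{-1} = 0
    for _ in range(maxiter):
        if np.linalg.norm(r) < tol:
            break
        Gp = G @ p
        alpha = (r @ r) / (Gp @ Gp)     # (p^T G^T)(G p) in the denominator
        m = m + alpha * p
        s = s + alpha * Gp              # recursive residual update
        r_new = G.T @ s
        beta = (r_new @ r_new) / (r @ r)
        p = -r_new + beta * p
        r = r_new
    return m

# Overdetermined least squares example.
rng = np.random.default_rng(0)
G = rng.standard_normal((20, 5))
d = rng.standard_normal(20)
m = cgls(G, d)
```

For ill-posed problems, the iteration count acts as the regularization parameter: following the discrepancy principle described below, one stops as soon as ‖G m_k − d‖_2 < δ rather than iterating to convergence.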

The CGLS algorithm has an important property that makes it particularly useful for ill-posed problems. It can be shown [Han98] that, at least in exact arithmetic, ‖m_k‖_2 increases monotonically and ‖G m_k − d‖_2 decreases monotonically. We can use the discrepancy principle together with this property to obtain a regularized solution: simply stop the CGLS algorithm as soon as ‖G m_k − d‖_2 < δ.
The CGLS algorithm is only applicable to zeroth-order regularization. If we wish to use first- or second-order regularization, the CGLS algorithm cannot be directly applied. However, it is possible to effectively transform higher-order regularization into zeroth-order regularization and apply CGLS to the transformed problem [Han98]. The MATLAB regularization toolbox contains a command, pcgls, that performs CGLS with this transformation.

Example 6.3
Recall the image deblurring problem that we discussed in Chapter
1 and Chapter 3. Figure 6.8 shows an image that has been blurred
and also has a small amount of added noise.
This image is of size 200 pixels by 200 pixels, so the G matrix for the
blurring operator is of size 40,000 by 40,000. Fortunately, the blur-
ring matrix G is quite sparse, with less than 0.1% nonzero elements.
The matrix requires about 12 megabytes of storage. A dense matrix
of this size would require about 13 gigabytes of storage. Using the
SVD approach to Tikhonov regularization would require far more
storage than most current computers have. However, CGLS works
quite well on this problem.
Figure 6.9 shows the L-curve for the CGLS solution of this problem. For the first 15 or so iterations, ‖Gm − d‖_2 decreases quickly. After that point, the improvement in misfit slows down, while ‖m‖_2 increases rapidly. Figure 6.10 shows the CGLS solution after 30 iterations. The blurring has been largely removed. Note that 30 iterations is far less than the size of the matrix, n = 40,000. Unfortunately, further CGLS iterations do not significantly improve the image. In fact, noise builds up rapidly. Figure 6.11 shows the CGLS solution after 100 iterations. In this image the noise has been greatly amplified, with little or no improvement in the clarity of the image.

Figure 6.8: Blurred Image.


Figure 6.9: L–curve for CGLS Deblurring.


Figure 6.10: CGLS Solution After 30 Iterations.

Figure 6.11: CGLS Solution After 100 Iterations.


6.4 Exercises
1. Consider the cross well tomography problem of exercise 5.2.
(a) Apply Kaczmarz’s algorithm to this problem.
(b) Apply ART to this problem.
(c) Apply SIRT to this problem.
(d) Comment on the solutions that you obtained.
2. The regularization toolbox command blur computes the system matrix for
the problem of deblurring an image that has been blurred by a Gaussian
point spread function. The file blur.mat contains a particular G matrix
and a data vector d.
(a) How large is the G matrix? How many nonzero entries does it have?
How much storage would be required for the G matrix if all of its en-
tries were nonzero? How much storage would the SVD of G require?
(b) Plot the raw image.
(c) Using CGLS, obtain an estimate of the true image. Plot your solution.
3. Show that if p0 , p1 , . . ., pn−1 are mutually conjugate with respect to an n
by n symmetric and positive definite matrix A, then the vectors are also
linearly independent. Hint: Use the definition of linear independence.

6.5 Notes and Further Reading

Iterative methods for tomography problems including Kaczmarz’s algorithm,
ART, and SIRT are discussed in [KS01]. Parallel algorithms based on ART and
SIRT are discussed in [CZ97].
Shewchuk’s technical report [She94] provides a good introduction to the
Conjugate Gradient Method. The CGLS method is discussed in [HH93, Han98].
A popular alternative to CGLS which we have not discussed is the LSQR
method of Paige and Saunders [Han98, PS82b, PS82a]. Berryman has studied
the resolution of LSQR solutions [Ber00a, Ber00b]. In statistics, a procedure
called Partial Least Squares (PLS) is equivalent to LSQR.
The performance of the CG algorithm degrades dramatically on poorly con-
ditioned systems of equations. This is not a particular problem for CGLS, since
we don’t run the iteration until it converges to a least squares solution. However,
in other applications including the numerical solution of PDEs, higher accuracy
is required. In such situations a technique called preconditioning is used to
improve the performance of CG. Essentially, preconditioning involves a change
of variables x̄ = Cx. The matrix C is selected so that the resulting system
of equations will be better conditioned than the original system of equations
[Dem97, GL96, TB97].

Chapter 7

Other Regularization

In addition to Tikhonov regularization, there are a number of other techniques for regularizing discrete ill-posed inverse problems. All of the methods discussed in this chapter use prior assumptions about the nature of the correct solution to pick a "good" solution from the collection of solutions that adequately fit the data.

7.1 Using bounds constraints

In many physical situations, there are bounds on the maximum and/or minimum
values of model parameters. For example, our model parameters may represent
a physical quantity such as the density of the earth that is inherently non–
negative. This establishes a strict lower bound of 0. We might also declare that
a density greater than 3.5 grams per cubic centimeter would be unrealistic for
a particular problem.
We can try to improve the solution by solving for an L2 norm solution that includes the constraint that all of the elements of the discretized model m be nonnegative:

min ‖Gm − d‖_2    (7.1)
m ≥ 0 .

This non-negative least squares problem can be solved by an algorithm called NNLS that was originally developed by Lawson and Hanson [LH95]. MATLAB includes a command, lsqnonneg, that implements the NNLS algorithm.
In general, given vectors l and u of lower and upper bounds, we can consider the bounded variables least squares (BVLS) problem

min ‖Gm − d‖_2
m ≥ l    (7.2)
m ≤ u .

Stark and Parker developed an algorithm for solving the BVLS problem and
implemented their algorithm in Fortran [SP95]. A similar algorithm is given in
the 1995 edition of Lawson and Hanson’s book [LH95].
Given the BVLS algorithm for (7.2), we can perform Tikhonov regularization with bounds. The MATLAB regularization toolbox includes a command, tikhcstr, for Tikhonov regularization with lower and upper bounds on the model parameters.
A related optimization problem involves minimizing or maximizing a linear
functional of the model subject to bounds constraints and a constraint on the
misfit. This problem can be formulated as

min c^T m
‖Gm − d‖_2 ≤ δ
m ≥ l    (7.3)
m ≤ u .

This problem can be solved by an algorithm given in Stark and Parker [SP95].
Solutions to this problem can be used to obtain bounds on the maximum and
minimum possible values of model parameters.
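As an illustration of the BVLS problem (7.2), the following NumPy sketch solves it by projected gradient descent. This is our own simple example, not the active set algorithms of Lawson and Hanson or Stark and Parker, which terminate in finitely many steps and are preferred in practice:

```python
import numpy as np

def bvls_projected_gradient(G, d, l, u, niter=5000):
    """Minimize ||G m - d||_2 subject to l <= m <= u by projected gradient
    descent. A simple illustrative sketch; the active set methods cited in
    the text are the standard tools for this problem.
    """
    m = np.clip(np.zeros(G.shape[1]), l, u)
    step = 1.0 / np.linalg.norm(G.T @ G, 2)   # safe step: 1/Lipschitz constant
    for _ in range(niter):
        grad = G.T @ (G @ m - d)
        m = np.clip(m - step * grad, l, u)    # gradient step, then project
    return m

# Tiny example where the unconstrained least squares solution violates
# the bounds 0 <= m <= 1, so the constraints are active at the solution.
G = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
d = np.array([2.0, -1.0, 0.5])
l = np.zeros(2)
u = np.ones(2)
m = bvls_projected_gradient(G, d, l, u)
```

The projection step is just a componentwise clip to the box [l, u], which is what makes bound constraints so much cheaper to handle than general linear constraints.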

Example 7.1
Recall the source history reconstruction problem of Example 3.3.2.
Figure 7.1 shows a hypothetical source history. Figure 7.2 shows the
corresponding samples at T = 300, with noise added.
Figure 7.3 shows the least squares solution. Clearly, some regular-
ization is required. Figure 7.4 shows the nonnegative least squares
solution. This solution is somewhat better, in that concentrations
are all nonnegative. However, the true source history is not ac-
curately reconstructed. Furthermore, the NNLS solution includes
concentrations that are larger than one. Suppose for a moment that
the solubility limit of the contaminant in water is known to be 1.1.
This provides a natural upper bound on the elements of the model
vector. Figure 7.5 shows the corresponding BVLS solution. Further
regularization is required.
Figure 7.6 shows the L–curve for a second order Tikhonov regular-
ization solution with bounds on the model parameters. Figure 7.7
shows the regularized solution for λ = 0.0616. This solution cor-
rectly shows the two major peaks in the input concentration, but








Figure 7.1: Hypothetical Source History.

both peaks are slightly broader than they should be. The solution
misses one smaller peak at around t = 150.
We can use (7.3) to establish bounds on the values of the model
parameters. For example, we might want to establish bounds on the
average concentration from t = 125 to t = 150. These concentrations
appear in positions 51 through 60 of the model vector m. We let
ci be zero in positions 1 through 50 and 61 through 100, and let ci
be 0.1 in positions 51 through 60. The solution to (7.3) is then a
bound on the average concentration from t = 125 to t = 150. After
solving the optimization problem, we obtained a lower bound of 0.36 and an upper bound of 0.73 for the average concentration during this time period.










Figure 7.2: Samples at T = 300.






Figure 7.3: Least Squares Solution.







Figure 7.4: NNLS Solution.








Figure 7.5: BVLS Solution.




Figure 7.6: L–Curve for Tikhonov solution with bounds.










Figure 7.7: Tikhonov regularization solution, λ = 0.0616.


7.2 Maximum Entropy Regularization

In maximum entropy regularization, we replace the regularization term used in Tikhonov regularization with \sum_{i=1}^{n} m_i ln(w_i m_i), where the weights w_i can be adjusted to favor particular types of solutions. Because logarithms of negative numbers are complex, we restrict our attention to solutions in which the m_i are positive.
The term maximum entropy comes from a Bayesian approach to selecting a prior probability distribution, in which we select a probability distribution which maximizes −\sum_{i=1}^{n} p_i ln p_i subject to the constraint that \sum_{i=1}^{n} p_i = 1. The quantity −\sum_{i=1}^{n} p_i ln p_i has the same form as entropy in statistical physics. Since the model m is not itself a probability distribution, and also because we have added the effect of weights w_i, maximum entropy regularization is distinct from the maximum entropy approach to computing probability distributions discussed in Chapter 11.
In the maximum entropy regularization approach, we maximize the entropy of m subject to a constraint on the size of the misfit ‖Gm − d‖:

max −\sum_{i=1}^{n} m_i ln(w_i m_i)
‖Gm − d‖ ≤ δ    (7.4)
m ≥ 0 .

Using a Lagrange multiplier in the same way that we did with Tikhonov regularization, we can transform this problem into

min ‖Gm − d‖^2 + α^2 \sum_{i=1}^{n} m_i ln(w_i m_i)    (7.5)
m ≥ 0 .

It can be shown that as long as α ≠ 0, the objective function in (7.5) is strictly convex. Thus (7.5) has a unique solution. However, the optimization problem can become difficult to solve as α approaches 0.
This optimization problem can be solved using a variety of nonlinear op-
timization methods. For example, the regularization toolbox contains a rou-
tine maxent which uses a nonlinear conjugate gradient method to solve the
problem. Our experience has been that this method is somewhat unreliable,
especially when α is small. For the example presented here, we have used a
penalty function approach to deal with the nonnegativity constraints and the
BFGS method to perform the minimization. This approach is more robust than
the nonlinear conjugate gradient method but also somewhat slower.
Figure 7.8 shows the regularization function f (x) = x ln(wx) for three differ-
ent values of w, along with the conventional 0th order Tikhonov regularization
function x2 . Notice that the maximum entropy function is 0 at x = 0, decreases
to a minimum value and then increases. We can find the minimum point by
setting the derivative equal to zero.
f′(x) = ln(wx) + 1 .    (7.6)

ln(wx) + 1 = 0 .    (7.7)

Figure 7.8: The maximum entropy regularization function x ln(wx) for w = 1, 0.2, and 5.

x = 1/(ew) .    (7.8)
Thus maximum entropy regularization favors solutions with x values near 1/(ew), and penalizes solutions with larger or smaller x values. For large values of x, the function x^2 grows faster than x ln(wx). Maximum entropy regularization thus penalizes solutions with large x values, but not as heavily as 0th order Tikhonov regularization. It should be clear that the choice of w, along with the magnitude of the model parameters, can have a significant effect on the maximum entropy regularization solution.
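The location of this minimum is easy to check numerically. The following sketch (ours, not from the text) evaluates x ln(wx) on a fine grid and confirms that the minimizer is near 1/(ew), where the minimum value is −1/(ew):

```python
import numpy as np

w = 5.0
x = np.linspace(1e-6, 1.0, 200001)        # fine grid on (0, 1]
f = x * np.log(w * x)                     # maximum entropy regularization term
xmin = x[np.argmin(f)]                    # numerical minimizer

# Analytical minimizer from setting f'(x) = ln(wx) + 1 = 0:
x_star = 1.0 / (np.e * w)
# At the minimum, f(x_star) = x_star * ln(1/e) = -x_star.
```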
Maximum entropy regularization is particularly popular in astronomy and
image processing applications [CE85, NN86, SB84]. Here the goal is to recover
a solution which consists of bright spots (stars, galaxies or other astronomical
objects) on a dark background. The non-negativity constraints in maximum
entropy furthermore ensure that the resulting image will not include features
with negative intensities. While conventional Tikhonov regularization tends
to broaden peaks in the solution, maximum entropy regularization may not
penalize sharp peaks as much.

Example 7.2

In this example we will use maximum entropy regularization to at-

tempt to recover a solution to the Shaw problem. Our target model
will consist of two spikes over a smaller random background inten-
sity. The target model is shown in Figure 7.9.











−2 −1.5 −1 −0.5 0 0.5 1 1.5 2

Figure 7.9: Target model for maximum entropy regularization.

We will assume that the data errors are normally distributed with mean 0 and standard deviation 1. Following the discrepancy principle, we will look for a solution with ‖Gm − d‖_2 around 4.4.
For this problem, default weights of w = 1 are appropriate, since the background noise level of about 0.5 is close to the minimum of the regularization term. We next solved (7.5) for several values of the regularization parameter α. We found that at α = 0.2, the misfit was ‖Gm − d‖_2 = 4.4. The corresponding solution is shown in Figure 7.10. The spike near θ = −0.5 is visible, but the magnitude of the peak is not correctly estimated. The second spike near θ = 0.7 is not very well resolved.
For comparison, we also used 0th order Tikhonov regularization with
a nonnegativity constraint to solve the same problem. Figure 7.11
shows the Tikhonov solution. This solution is similar to the solution
produced by maximum entropy regularization.
For this sample problem, maximum entropy regularization had no advantage over 0th order Tikhonov regularization with non-negativity constraints. The best maximum entropy solution was comparable to the solution obtained by Tikhonov regularization with non-negativity constraints. This result is consistent with the results from a number of sample problems in [AH91]. Amato and Hughes found that maximum entropy regularization was at best comparable to, and often worse than, Tikhonov regularization with non-negativity constraints.











Figure 7.10: Maximum entropy solution.












Figure 7.11: Tikhonov regularization solution.


7.3 Total Variation

Total variation (TV) is a regularization functional that works well for problems in which there are discontinuous jumps in the model. The TV regularization functional is

TV(m) = \sum_{i=1}^{n−1} |m_{i+1} − m_i| .    (7.9)

It is convenient to write this as

TV(m) = ‖Lm‖_1    (7.10)

where L is the (n − 1) by n first difference matrix

L = [ −1   1
           −1   1
                ⋱   ⋱
                   −1   1 ] .    (7.11)
In first and second order Tikhonov regularization, discontinuities in the
model are smoothed out and do not show up well in the inverse solution. This
is because smooth transitions are penalized less by the regularization term than
sharp transitions. The particular advantage of TV regularization is that the
regularization term does not penalize discontinuous transitions in the model
any more than smooth transitions.
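Concretely, L is just a first difference operator. A small dense sketch of ours (for large n a sparse L would be used) also shows why TV does not favor smooth transitions over sharp ones:

```python
import numpy as np

def first_difference(n):
    """The (n-1) by n first difference matrix L of (7.11)."""
    L = np.zeros((n - 1, n))
    for i in range(n - 1):
        L[i, i] = -1.0
        L[i, i + 1] = 1.0
    return L

def tv(m):
    """Total variation TV(m) = ||L m||_1 = sum of |m_{i+1} - m_i|."""
    L = first_difference(len(m))
    return np.linalg.norm(L @ m, 1)

# A sharp step and a smooth ramp between the same endpoints have the
# same total variation, which is why TV regularization does not penalize
# a discontinuous jump any more than a gradual transition.
m_step = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
m_ramp = np.linspace(0.0, 1.0, 6)
```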
This approach has been particularly popular for the problem of “denoising”
a model. The denoising problem can be thought of as a linear inverse problem
in which G = I. The goal is to take a noisy data set, and smooth out the
noise, while still retaining long term trends and even sharp discontinuities in
the model. TV regularization can also be used with other linear and nonlinear
inverse problems.
We could use a TV regularization term in place of ‖Lm‖_2 in the optimization problem

min ‖Gm − d‖_2^2 + α^2 ‖Lm‖_1^2 .    (7.12)

However, this problem is no longer a least squares problem, and the machinery that we have developed for solving least squares problems (SVD, GSVD, etc.) is no longer applicable. In fact, (7.12) is a nondifferentiable optimization problem.
Since we have a 1-norm in the regularization term, it makes sense to switch to the 1-norm in the data misfit term as well. Consider the problem

min ‖Gm − d‖_1 + α ‖Lm‖_1 .    (7.13)

We can rewrite this problem as

min ‖ [ G ; αL ] m − [ d ; 0 ] ‖_1 ,    (7.14)

where [G; αL] and [d; 0] denote the vertically stacked matrix and vector. This is a 1-norm minimization problem that we can solve easily by using the iteratively reweighted least squares algorithm discussed in Chapter 2.
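A sketch of this approach in NumPy (our illustrative code; the damping constant eps and the iteration count are arbitrary choices, and the IRLS details in Chapter 2 may differ):

```python
import numpy as np

def irls_l1(A, c, niter=50, eps=1e-8):
    """Minimize ||A m - c||_1 by iteratively reweighted least squares.

    Each pass solves a weighted least squares problem whose weights
    1/(|r_i| + eps) come from the current residuals r = A m - c; eps
    (our choice) damps the weights where residuals are near zero.
    """
    m = np.linalg.lstsq(A, c, rcond=None)[0]       # 2-norm starting point
    for _ in range(niter):
        r = A @ m - c
        w = 1.0 / np.sqrt(np.abs(r) + eps)         # sqrt so W^T W = diag(1/(|r|+eps))
        m = np.linalg.lstsq(w[:, None] * A, w * c, rcond=None)[0]
    return m

def tv_l1(G, d, L, alpha):
    """Set up and solve the stacked 1-norm problem (7.14)."""
    A = np.vstack((G, alpha * L))
    c = np.concatenate((d, np.zeros(L.shape[0])))
    return irls_l1(A, c)

# Sanity check: with a single column of ones, the 1-norm fit tends to the
# median of the data, largely ignoring the outlier at 100.
A = np.ones((5, 1))
c = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
m_l1 = irls_l1(A, c)
```

The median example illustrates the robustness of the 1-norm misfit that makes (7.13) attractive in the first place.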

Per Hansen has suggested another approach which retains the two-norm of the data misfit while incorporating the TV regularization term [Han98]. In Hansen's PP-TSVD method, we begin by using the SVD and the k largest singular values of G to obtain a rank k approximation to G,

G_k = \sum_{i=1}^{k} s_i U_{·,i} V_{·,i}^T .    (7.15)

Note that the matrix G_k will be rank deficient. The point of the approximation is to obtain a matrix with a well defined null space. The vectors V_{·,k+1}, . . ., V_{·,n} form a basis for the null space of G_k. We will need this basis later, so let

B_k = [V_{·,k+1} . . . V_{·,n}] .    (7.16)

Using the model basis set [V_{·,1} . . . V_{·,k}], the minimum length least squares solution is, from the SVD,

m_k = \sum_{i=1}^{k} (U_{·,i}^T d / s_i) V_{·,i} .    (7.17)

We can modify this solution by adding in any vector in the null space of G_k. This will increase ‖m‖_2, but have no effect on ‖G_k m − d‖.
We can use this to find solutions which minimize some regularization functional, but have minimum misfit. For example, in the modified truncated SVD (MTSVD) method, we find a model m_{L,k} which minimizes ‖Lm‖ among those models which minimize ‖G_k m − d‖. Since all models which minimize ‖G_k m − d‖ can be written as m = m_k − B_k z for some vector z, the MTSVD problem can be written as

min ‖L(m_k − B_k z)‖_2    (7.18)

or equivalently

min ‖L B_k z − L m_k‖_2 .    (7.19)

This is a least squares problem that can be solved with the SVD, QR factorization, or the normal equations.
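The MTSVD computation (7.16)-(7.19) is only a few lines of linear algebra; an illustrative NumPy sketch of ours:

```python
import numpy as np

def mtsvd(G, d, L, k):
    """Modified truncated SVD sketch: among the minimizers of
    ||G_k m - d||_2, pick the one minimizing ||L m||_2, via (7.16)-(7.19).
    """
    U, s, Vt = np.linalg.svd(G)
    V = Vt.T
    # Truncated SVD solution (7.17), built from the first k singular vectors.
    mk = V[:, :k] @ ((U[:, :k].T @ d) / s[:k])
    # Basis (7.16) for the null space of the rank-k approximation G_k.
    Bk = V[:, k:]
    # Least squares problem (7.19): min_z ||L Bk z - L mk||_2.
    z = np.linalg.lstsq(L @ Bk, L @ mk, rcond=None)[0]
    return mk - Bk @ z

# Small demo: random 4 by 6 system, first difference regularization, k = 2.
rng = np.random.default_rng(1)
G = rng.standard_normal((4, 6))
d = rng.standard_normal(4)
n = G.shape[1]
L = np.eye(n - 1, n, k=1) - np.eye(n - 1, n)   # first difference operator
m = mtsvd(G, d, L, k=2)
```

Because B_k spans the null space of G_k, adding B_k z leaves the truncated misfit unchanged while reducing ‖Lm‖_2.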
The PP-TSVD algorithm uses a similar approach. First, we minimize ‖G_k m − d‖_2. Let β be the minimum value of ‖G_k m − d‖_2. Instead of minimizing the two-norm of Lm, we minimize the 1-norm of Lm, subject to the constraint that m must be a least squares solution:

min ‖Lm‖_1
‖G_k m − d‖_2 = β .    (7.20)

Again, we can write any least squares solution as m = m_k − B_k z, so the PP-TSVD problem can be reformulated as

min ‖L(m_k − B_k z)‖_1    (7.21)






Figure 7.12: The target model.

min ‖L B_k z − L m_k‖_1 .    (7.22)

This is a 1-norm minimization problem which can be solved in many ways, including IRLS.
Note that in (7.22) the matrix L B_k has n − 1 rows and n − k columns. In general, when we solve this 1-norm minimization problem, we can find a solution in which n − k of the equations are satisfied exactly. Thus at most k − 1 entries in L(m_k − B_k z) will be nonzero. Since each nonzero entry in L(m_k − B_k z) corresponds to a discontinuity in the model, there will be at most k − 1 discontinuities in the model. Furthermore, the zero entries in L(m_k − B_k z) correspond to points at which the model is constant. For example, if we use k = 1, we will get a flat model with no discontinuities. For k = 2, we can obtain a model with two flat sections and one discontinuity, and so on.
The PP-TSVD method can also be extended to piecewise linear functions
and to piecewise higher order polynomials by using a matrix L which approxi-
mates the second or higher order derivatives. The MATLAB function pptsvd,
available from Per Hansen’s web page, implements the PP-TSVD algorithm.

Example 7.3
In this example we consider the Shaw problem with a target model
which consists of a single step function. Our target model is shown
in Figure 7.12. The corresponding data, with noise (normal, 0.001
standard deviation) added is shown in Figure 7.13.
First, we solved the problem with conventional Tikhonov regular-
ization. Figure 7.14 shows the 0th order Tikhonov regularization


Figure 7.13: Data for the target model, noise added.

solution. Figure 7.15 shows the 2nd order Tikhonov regularization solution. Both solutions show the discontinuity in the solution as a smooth transition. The relative error (‖m_true − m‖_2/‖m_true‖_2) is about 9% for the 0th order solution and about 7% for the second order solution.
Next, we solved the problem by minimizing the 1-norm misfit with TV regularization using (7.14). Since the noise level is 0.001, we expected the two-norm of the misfit to be about 0.0045. Using the value α = 1.0, we obtained a solution with ‖Gm − d‖_2 = 0.0040. This solution is shown in Figure 7.16. This solution is extremely good, with ‖m_true − m‖_2/‖m_true‖_2 < 0.001.
Next, we solved the problem with pptsvd. The PP-TSVD solution with k = 2 is shown in Figure 7.17. Again, we get a very good solution, with ‖m_true − m‖_2/‖m_true‖_2 < 0.0001.









Figure 7.14: Zeroth order Tikhonov regularization solution.







Figure 7.15: Second order Tikhonov regularization solution.






Figure 7.16: TV regularized solution.





Figure 7.17: PP-TSVD solution, k = 2.


7.4 Exercises

1. Consider the problem

min c^T m
‖Gm − d‖_2^2 ≤ δ^2 .    (7.23)

Using the method of Lagrange multipliers, develop a formula that can be used to solve (7.23).
2. (a) Compute the gradient of the function being minimized in (7.5).
(b) Compute the Hessian of the function in (7.5).
(c) Show that for α 6= 0, the Hessian is positive definite for all m ≥ 0.
Thus the function being minimized is strictly convex, and there is a
unique minimum.
3. In image processing applications of maximum entropy regularization, it is common to add a constraint to (7.4) that

\sum_{i=1}^{n} m_i = C

for some constant C. Show how this constraint can be incorporated into (7.5) using a second Lagrange multiplier.

7.5 Notes and Further Reading

Methods for bounded variables least squares problems and minimizing a linear
functional subject to a bound on the misfit are given in [SP95]. Some applica-
tions of these techniques can be found in [HSH+ 90, Par91, PZ89, SP87, SP88].
The maximum entropy method is particularly popular in deconvolution and
denoising of astronomical images from radio telescope data. Algorithms for
maximum entropy regularization are discussed in [CE85, NN86, Sau90, SB84].
More general discussions can be found in [AH91, Han98]. Another popular
deconvolution algorithm used in radio astronomy is the “CLEAN” algorithm
[Cla80, Hög74]. Briggs [Bri95] compares the performance of CLEAN, Maximum
Entropy regularization and NNLS.
Vogel [Vog02] has a chapter on numerical methods for total variation regu-
larization. The PP-TSVD method is discussed in [HM96, Han98].

Chapter 8

Fourier Techniques

8.1 Linear Systems in the Time and Frequency Domains

A remarkable feature of linear time-invariant systems is that the forward oper-
ator can be generally described by a particular form of Fredholm integral of the
first kind, a convolution (1.7),
d(t) = \int_{−∞}^{∞} m(τ) g(t − τ) dτ .    (8.1)

Inverse problems involving such systems can thus be usefully solved for m(t)
via an inverse or deconvolution operation. Here, the independent variable t is
time and the data d, model, m, and system kernel g are all time functions.
However, the results here are just as applicable to spatial problems and are also
generalizable to higher dimensions. Here, we will briefly overview the essentials
of Fourier theory in the context of performing convolutions and deconvolutions.
Readers are urged to consult one of the many references for a more complete treatment.
Consider a linear time-invariant operator, G, that converts an unknown
model, m(t) into an observable data function d(t)

d(t) = G[m(t)] (8.2)

and which follows the principles of superposition

G[m1 (t) + m2 (t)] = G[m1 (t)] + G[m2 (t)] (8.3)

and scaling
G[αm(t)] = αG[m(t)] , (8.4)
where α is a scalar. (8.4) also implies that the output of the system is zero when
m(t) = 0.
G[0] = 0 . (8.5)


To show that the operation of any system satisfying (8.3) and (8.4) can be cast in the form of (8.1), we utilize the sifting property of the impulse or delta function, δ(t). The delta function can be conceptualized as the limiting case of a pulse as its width goes to zero, its height goes to infinity, and its area stays constant and equal to one, e.g.,

δ(t) = lim_{τ→0} τ^{−1} Π(t/τ)    (8.6)

where τ^{−1} Π(t/τ) is a unit-area rectangle function of height τ^{−1} and width τ.

The sifting property allows us to extract the value of a function at a particular point from inside of an integral,

\int_a^b f(t) δ(t − t_0) dt    (8.7)
    = f(t_0)   for a ≤ t_0 ≤ b    (8.8)
    = 0   elsewhere,    (8.9)

for any f(t) continuous at finite t = t_0.
for any f (t) continuous at finite t = t0 . The impulse response of a system,
where the model and data are related by the operation G, is defined as the
output produced when the input is a delta function
g(t) = G[δ(t)] . (8.10)
An impulse response is also widely referred to as a system Green’s function
in many applications.
We will now show that all linear, time-invariant system outputs for a given
input are characterizable via the convolution integral (8.1). First, note that any
input signal, m(t), can clearly be written as a summation of impulse functions
through the sifting property
m(t) = \int_{−∞}^{∞} m(τ) δ(t − τ) dτ .    (8.11)

Thus, a general linear system response d(t) to an arbitrary input m(t) can be written as

d(t) = G[ \int_{−∞}^{∞} m(τ) δ(t − τ) dτ ]    (8.12)

or, from the fundamental definition of the integral operation,

d(t) = G[ lim_{Δτ→0} \sum_n m(τ_n) δ(t − τ_n) Δτ ] .    (8.13)

Because G characterizes a linear process, we can move it inside of the summation using the scaling relation, where the m(τ_n) are now weights, to obtain

d(t) = lim_{Δτ→0} \sum_n m(τ_n) G[δ(t − τ_n)] Δτ .    (8.14)

We now note that this defines the integral

d(t) = \int_{−∞}^{∞} m(τ) g(t − τ) dτ ,

which is, again, (8.1), the convolution of m(t) and g(t), often abbreviated as d(t) = m(t) ∗ g(t).
The convolution operation thus generally describes the transformation of models to data in any linear, time-invariant, physical process, as well as the smearing action of a linear measuring instrument. For example, an (unattainable) perfect instrument which recorded some m(t) with no distortion whatsoever would have a delta function impulse response, perhaps with a time delay t_0, as

d(t) = m(t) ∗ δ(t − t_0) = \int_{−∞}^{∞} m(τ) δ(t − t_0 − τ) dτ = m(t − t_0) .    (8.15)

An important and very useful relationship exists between the convolution operation and the Fourier transform

G(f) = F[g(t)] ≡ \int_{−∞}^{∞} g(t) e^{−ı2πf t} dt    (8.16)

and its inverse

g(t) = F^{−1}[G(f)] ≡ \int_{−∞}^{∞} G(f) e^{ı2πf t} df ,    (8.17)

where F denotes the Fourier transform operator, and F^{−1} denotes the inverse Fourier transform. The impulse response g(t) is called the time domain response, and its Fourier transform, G(f), is called the spectrum of g(t). For a system with impulse response g(t), G(f) is called the frequency domain response or transfer function of the system. The Fourier transform (8.16) gives a formula for evaluating the spectrum, and the inverse Fourier transform (8.17) says that the time domain function g(t) can be exactly reconstructed by a complex weighted integration of functions of the form e^{ı2πf t}, where the weighting is provided by G(f). The essence of Fourier analysis is thus representing functions using this particular infinite set of Fourier basis functions, e^{ı2πf t}.
It is important to note that for a real-valued function g(t) the spectrum G(f) will be complex. |G(f)| is called the spectral amplitude, and the angle that G(f) makes in the complex plane, tan^{−1}(imag(G(f))/real(G(f))), is called the spectral phase.
Readers should note that in physics and geophysics applications the sign convention chosen for the complex exponentials in the Fourier transform and its inverse may be reversed, so that the forward transform (8.16) has a plus sign in the exponent and the inverse transform (8.17) has a minus sign in the exponent. This convention change simply causes a complex conjugation in the spectrum that is reversed when the corresponding inverse transform is applied to return to the time domain, and is thus simply an alternative spectral phase convention. An additional convention issue arises as to whether to express frequency in Hertz (f) or radians per second (ω = 2πf). Alternative Fourier transform formulations using ω instead of f have scaling factors of 2π in the forward, reverse, or both transforms.
Consider the Fourier transform of the convolution of two functions,

F[m(t) ∗ g(t)] = \int_{−∞}^{∞} [ \int_{−∞}^{∞} m(τ) g(t − τ) dτ ] e^{−ı2πf t} dt .    (8.18)

Reversing the order of integration gives

F[m(t) ∗ g(t)] = \int_{−∞}^{∞} m(τ) [ \int_{−∞}^{∞} g(t − τ) e^{−ı2πf t} dt ] dτ .

A subsequent change of variables, ξ = t − τ, gives

F[m(t) ∗ g(t)] = \int_{−∞}^{∞} m(τ) [ \int_{−∞}^{∞} g(ξ) e^{−ı2πf (ξ+τ)} dξ ] dτ
    = \int_{−∞}^{∞} m(τ) e^{−ı2πf τ} dτ \int_{−∞}^{∞} g(ξ) e^{−ı2πf ξ} dξ
    = M(f) G(f) ,    (8.19)

which is the convolution theorem. The convolution theorem shows that convolution of two functions in the time domain has the simple effect of multiplying their Fourier transforms in the frequency domain. The Fourier transform of the impulse response, G(f) = F[g(t)], thus characterizes how M(f) is altered in spectral amplitude and phase by the convolution.
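The discrete analogue of this result is easy to verify numerically with the FFT. The sketch below (ours; note that the discrete Fourier transform pairs with circular rather than linear convolution) checks that transforming, multiplying, and inverse transforming reproduces a direct convolution:

```python
import numpy as np

# Discrete check of the convolution theorem: the DFT of a circular
# convolution equals the pointwise product of the DFTs.
rng = np.random.default_rng(0)
n = 64
m = rng.standard_normal(n)
g = rng.standard_normal(n)

# Circular convolution computed directly in the "time" domain.
conv = np.array([sum(m[j] * g[(i - j) % n] for j in range(n))
                 for i in range(n)])

# The same result via multiplication in the frequency domain.
conv_fft = np.real(np.fft.ifft(np.fft.fft(m) * np.fft.fft(g)))
```

This identity is also the basis of fast convolution: the FFT route costs O(n log n) rather than the O(n^2) of the direct sum.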
To see the implication of the convolution theorem more explicitly, consider the response of a linear system, characterized by the impulse response g(t) in the time domain and the transfer function G(f) in the frequency domain, to a single model Fourier basis function of frequency f_0, m(t) = e^{ı2πf_0 t}. The spectrum of e^{ı2πf_0 t} can be seen to be δ(f − f_0) by examining the corresponding inverse Fourier transform (8.17)

e^{ı2πf_0 t} = \int_{−∞}^{∞} δ(f − f_0) e^{ı2πf t} df .    (8.20)

The frequency domain response of a linear system to this basis function model is therefore (8.19)

F[e^{ı2πf_0 t}] G(f) = δ(f − f_0) G(f_0) .    (8.21)

The corresponding time domain response is thus (8.17)

\int_{−∞}^{∞} G(f_0) δ(f − f_0) e^{ı2πf t} df = G(f_0) e^{ı2πf_0 t} .    (8.22)

Linear time-invariant systems thus only map model Fourier basis functions to
data functions at the same frequency, and only alter them in spectral ampli-
tude and phase, with the alteration being characterized by the spectrum of the
system impulse response evaluated at the appropriate frequency. Of particular
interest here is the result that model basis functions at frequencies that are
weakly mapped to the data (frequencies where G(f) is small) may be difficult
or impossible to recover in an inverse problem.
The transfer function can be expressed in a particularly useful analytical
form if we can express a linear time-invariant system as a linear differential
equation

a_n dⁿy/dtⁿ + a_{n−1} d^{n−1}y/dt^{n−1} + · · · + a_1 dy/dt + a_0 y =
b_m d^m x/dt^m + b_{m−1} d^{m−1}x/dt^{m−1} + · · · + b_1 dx/dt + b_0 x ,   (8.23)

where the coefficients a_i and b_i are constants. Because each of the terms in
(8.23) is linear with constant coefficients (there are no powers or other nonlinear
functions of x, y, or their derivatives), (8.23) expresses a linear time-invariant
system obeying superposition (8.3) and scaling (8.4), because differentiation is
itself a linear operation.
If a system of the form of (8.23) operates on a model of the form m(t) =
e^{ı2πft}, (8.22) indicates that the corresponding predicted data will be d(t) =
G(f)e^{ı2πft}. A time derivative of such a function merely generates a multiplier
of 2πıf. Dividing the resulting equation on both sides by e^{ı2πft} and solving for
G(f) gives the transfer function

G(f) = D(f)/M(f) = [ Σ_{j=0}^{m} b_j(2πıf)^j ] / [ Σ_{k=0}^{n} a_k(2πıf)^k ] .   (8.24)

(8.24) is a ratio of two complex polynomials in f for any system expressible
in the form of (8.23). The m complex frequencies, f_z, where the numerator
(and transfer function) is zero are referred to as zeros of G(f). The predicted
data will be zero for input components at such frequencies, regardless of their
amplitude. Conversely, any model components at such frequencies will have no
influence on the data. Corresponding model basis functions, e^{ı2πf_z t}, will thus
lie in the model null space and be unrecoverable by any inverse methodology.
Frequencies for which the denominator of (8.24) is zero are called poles. The
predicted data is infinite at such frequencies, so that continuous excitation of
the system at such frequencies leads to an unstable forward problem.
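As an illustration, consider a hypothetical system of the form (8.23) with a₂y″ + a₁y′ + a₀y = b₁x′; the coefficient values in the sketch below are invented for illustration only. The transfer function (8.24) can be evaluated directly, and the zeros and poles found as polynomial roots in s = 2πıf:

```python
import numpy as np

# Hypothetical damped oscillator driven by the derivative of the input:
#   y'' + 2 y' + 100 y = x'   (illustrative coefficients only)
a = [100.0, 2.0, 1.0]   # a0, a1, a2
b = [0.0, 1.0]          # b0, b1

def G(f):
    """Transfer function (8.24) evaluated at frequency f in Hz."""
    s = 2j * np.pi * f
    return (sum(bj * s**j for j, bj in enumerate(b)) /
            sum(ak * s**k for k, ak in enumerate(a)))

# Zeros and poles are the roots of the numerator and denominator
# polynomials in s = 2*pi*i*f (np.roots wants highest degree first).
zeros_s = np.roots(b[::-1])
poles_s = np.roots(a[::-1])

# The zero at s = 0 (f = 0) means constant input components produce
# no output and lie in the model null space.
assert abs(G(0.0)) == 0.0
```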

8.2 Deconvolution from a Fourier Perspective

In recovering a model, m(t), which has been convolved with some g(t), using
Fourier theory, we wish to recover the basis function spectral amplitude and
phase (the Fourier components) of m by reversing the changes in spectral am-
plitude and phase (8.19) caused by the convolution. As noted above, this may
be difficult or impossible at and near frequencies where the spectral amplitude
of the transfer function, |G(f)|, is small (i.e., close to the zero frequencies).


Figure 8.1: Upward continuation of a vertical gravitational field anomaly.

Example 8.1
An illustrative physical example of an ill-posed inverse system is down-
ward continuation (Figure 8.1). Consider a point mass, M, located
at the origin, which has a total gravitational field

g_t(r) = (g_x, g_y, g_z) = −Mγr̂/‖r‖² = −Mγ(xx̂ + yŷ + zẑ)/‖r‖³ ,   (8.25)

where γ is Newton's gravitational constant. At a general position
r = xx̂ + yŷ + zẑ, the vertical (ẑ) component of the gravitational
field will be

g_z = ẑ · g_t = −Mγz/‖r‖³ .   (8.26)

We note that the integral of g_z over the xy plane is a constant with
respect to the observation height, z,

∫_{−∞}^{∞} ∫_{−∞}^{∞} g_z dx dy = −Mγz ∫_{−∞}^{∞} ∫_{−∞}^{∞} dx dy/‖r‖³ .   (8.27)

Making the polar substitution ρ² = x² + y² gives

−Mγz ∫_{−∞}^{∞} ∫_{−∞}^{∞} dx dy/‖r‖³ = −2πMγz ∫_0^{∞} ρ dρ/(z² + ρ²)^{3/2}
= −2πMγz [ −1/(z² + ρ²)^{1/2} ]_0^{∞} = −2πMγ ,   (8.28)

which is independent of the plane height z.
If we consider the vertical field at z = 0+ from a point mass, we
thus have a delta function singularity at the origin with a magnitude
given by (8.28), as the field has no vertical component except exactly
at the origin. We can thus write the vertical field an infinitesimal
distance above the xy plane as

g_z|_{z=0+} = −2πMγ δ(x, y) .   (8.29)

Now consider z = 0 to be the ocean floor or surface of the Earth,
and suppose that gravity data is collected from a ship or airplane
located at height h. The vertical field observed at z = h will be

g_z(h) = h/(2π(x² + y² + h²)^{3/2}) ,   (8.30)

if we normalize the response (the integrand of (8.27)) by the magni-
tude of the delta function at z = 0, −2πMγ. As a consequence of
the vertical gravitational field obeying superposition and linearity,
this can be recognized as a linear problem, where (8.30) is the impulse
response for observations on the plane z = h. Vertical field measure-
ments obtained at z = h can thus be expressed by a 2-dimensional
convolution operation in the xy plane,

g_z(h, x, y) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g_z(0, s_1 − x, s_2 − y) ·
h/(2π(s_1² + s_2² + h²)^{3/2}) ds_1 ds_2 .   (8.31)
We can examine the effects of the convolution process, and address
the inverse problem of downward continuation of the vertical field
at z = h back to z = 0, by examining the transfer function cor-
responding to (8.30), which will reveal how sensitive observations
made at z = h are to the field values at z = 0 as a function of
spatial frequency, k. The Fourier transform of the impulse response
is the transfer function

G(k_x, k_y) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} h e^{−ı2πk_x x} e^{−ı2πk_y y}/(2π(x² + y² + h²)^{3/2}) dx dy
= e^{−2πh(k_x² + k_y²)^{1/2}} ,   (8.32)

which is the upward continuation filter. In this case the model
and data basis functions are of the form e^{ı2πk_x x} and e^{ı2πk_y y}, where
k_{x,y} are spatial frequencies (analogous to f in temporal problems).
As h increases, (8.32) indicates that higher frequency (larger k) spa-
tial basis functions of the true seafloor gravity signal are exponen-
tially attenuated to greater and greater degrees. Such an operation
is thus referred to as a low–pass filter. Predicted observations
thus become smoother with increasing h.

Investigating the inverse problem of inferring the field at z = 0 from
measurements made at z = h, we must devise an inverse filter to
reverse the smoothing effect of upward continuation. An obvious
candidate is the reciprocal of (8.32),

[G(k_x, k_y)]⁻¹ = e^{2πh(k_x² + k_y²)^{1/2}} .   (8.33)

(8.33) is the transfer function of the deconvolution that undoes the
convolution of the forward problem. However, (8.33) has a response
that grows without bound as the spatial frequencies k_x and k_y be-
come large. Thus, small high-spatial-frequency components in the
data acquired at height h > 0 may produce enormous inferred variations
in the seafloor gravity field (h = 0). If these high-frequency amplitudes
of the Fourier basis functions in the data are noisy, the inverse op-
eration will be very unstable.

The ill-posed character of downward continuation at high frequencies is typ-

ical of such problems. This means that solutions obtained using the Fourier
methodology will have to be regularized in many cases to produce stable and
meaningful results.
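The exponential character of (8.32) and (8.33) is easy to quantify numerically. In the following sketch the observation height and spatial frequencies are illustrative values only, not taken from any survey:

```python
import numpy as np

# Upward continuation transfer function (8.32) and its reciprocal
# (8.33) for radial spatial frequency k = (kx^2 + ky^2)^(1/2).
h = 1000.0                             # observation height (m), illustrative
k = np.array([1e-4, 1e-3, 1e-2])       # spatial frequencies (cycles/m)

upward = np.exp(-2 * np.pi * h * k)    # low-pass: attenuates large k
downward = np.exp(2 * np.pi * h * k)   # inverse filter: amplifies large k

# A 1% noise component at k = 0.01 cycles/m is amplified by a factor of
# roughly e^(20*pi) when continued downward by 1000 m, swamping the signal.
noise_amplification = 0.01 * downward[-1]
assert noise_amplification > 1e20
```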

8.3 Linear Systems in Discrete Time

We can often readily approximate the continuous time transforms (8.16) and
(8.17) with a corresponding discrete operation, the discrete Fourier trans-
form, or DFT. The discrete Fourier transform operates on time series, e.g.,
sequences m_j consisting of equally time-spaced, sequential values of m(t). The
rate, f_s, at which this sampling occurs is called the sampling rate. The for-
ward discrete Fourier transform is

M_k = DFT[m]_k ≡ Σ_{j=0}^{n−1} m_j e^{−2πıjk/n}   (8.34)

and its inverse is

m_j = DFT⁻¹[M]_j ≡ (1/n) Σ_{k=0}^{n−1} M_k e^{2πıjk/n} .   (8.35)
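These conventions match those used by common FFT implementations, which can be checked directly against the definitions. A sketch, assuming NumPy's `fft`/`ifft` place the sign and 1/n factor as in (8.34) and (8.35):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
m = rng.standard_normal(n)

# Forward DFT (8.34), written out directly from the definition.
jj = np.arange(n)
M = np.array([np.sum(m * np.exp(-2j * np.pi * jj * k / n)) for k in range(n)])
assert np.allclose(M, np.fft.fft(m))

# Inverse DFT (8.35); the 1/n factor appears in the inverse transform.
m_back = np.array([np.sum(M * np.exp(2j * np.pi * jj * j / n)) / n
                   for j in range(n)])
assert np.allclose(m_back, m)

# Hermitian symmetry for a real-valued series: M_k = conj(M_{n-k}).
assert np.allclose(M[1:], np.conj(M[1:][::-1]))
```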

(8.34) and (8.35) use the common conventions that the indexing ranges from 0
to n − 1, and also use the same complex exponential sign convention as (8.16)
and (8.17). (8.35) shows that a model time series can be expressed as a linear
combination of the n basis functions e2πıjk/n , where the complex coefficients are
the discrete spectral values Mk . The DFT is also frequently referred to as the
FFT because a particularly efficient algorithm, the fast Fourier transform,

Figure 8.2: Frequency and index mapping describing the DFT of a real-valued
series (n = 16) sampled at a sampling rate f_s. For a real-valued series, the
n complex elements of the DFT exhibit even spectral amplitude symmetry
and odd spectral phase symmetry about the index n/2. Indices less than n/2
correspond to the zero and positive frequencies 0, f_s/n, 2f_s/n, . . . , f_s/2 −
f_s/n. Indices greater than n/2 correspond to the negative frequencies −f_s/2 +
f_s/n, −f_s/2 + 2f_s/n, . . . , −f_s/n. Index n/2 corresponds to the frequency at half
the sampling rate, ±f_s/2. For the DFT to adequately represent the spectrum of an
assumed periodic time series, f_s must be greater than or equal to the Nyquist
frequency (8.36).

is widely exploited to evaluate (8.34) and (8.35). Figure 8.2 shows the frequency
and index mapping for an n-length DFT.
The DFT spectra, M_k (8.34), are complex and discrete, with positive real
frequencies assigned to indices 1 through n/2 − 1, and negative real frequencies
assigned to indices n/2 + 1 through n − 1. The k = 0 term is the sum of the m_j, or
n times the average series value. There is an implicit assumption in the DFT
that the underlying time series, m_j, is periodic over n terms. If m(t) is real
valued, substitution of n − k for k in (8.34) shows that the DFT has Hermitian
symmetry, where M_k = M∗_{n−k}. Because of this complex conjugate symmetry
about k = n/2, the spectral amplitude, |M|, is symmetric and the spectral phase is
antisymmetric with respect to k = n/2. For this reason it is customary to only plot
the positive frequency spectral amplitude and phase for the spectrum of a real-valued series.
For a sampled time series to accurately convey information about a contin-
uous function, it can be shown that the continuous real-world function must be
sampled at a rate f_s that is at least twice the highest frequency, f_max, at which
there is appreciable energy, i.e., at or above the Nyquist frequency,

fs ≥ fN = 2fmax . (8.36)

Should the condition (8.36) not be met, a usually irreversible nonlinear distor-
tion called aliasing will occur, where energy at frequencies higher than f_s/2 will
appear in the spectrum to lie at frequencies below f_s/2.
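A minimal numerical illustration of aliasing: a sinusoid just below the sampling rate produces exactly the same samples as a low-frequency sinusoid. The 9 Hz and 10 Hz values below are arbitrary choices for illustration:

```python
import numpy as np

fs = 10.0                       # sampling rate (Hz), illustrative
t = np.arange(32) / fs          # sample times

# A 9 Hz sinusoid sampled at 10 Hz violates (8.36) and is
# indistinguishable from a 1 Hz sinusoid at the sample times,
# because cos(2*pi*9*n/10) = cos(2*pi*n - 2*pi*n/10) = cos(2*pi*1*n/10).
x_high = np.cos(2 * np.pi * 9.0 * t)
x_low = np.cos(2 * np.pi * 1.0 * t)
assert np.allclose(x_high, x_low)
```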
The discrete convolution of two sampled time series with equal sampling
intervals Δt = 1/f_s can be performed in two ways. The first of these is serial
convolution,

d_j = Σ_i m_i g_{j−i} Δt ,   (8.37)

where we assume that the shorter of the two time series is zero for all indices
greater than m. In serial convolution we get a result with at most n + m − 1
nonzero terms.
The second type of discrete convolution is circular, which can be applied only
to two time series of equal length (n = m). If the time series are of different
lengths, the lengths can be equalized by padding the shorter of the two with
zeros. The result of a circular convolution is as if we had joined each time series
to its tail and convolved them in circular fashion. Applying the convolution
theorem,

d_i = DFT⁻¹[DFT[m] · DFT[g]]_i Δt = DFT⁻¹[M · G]_i Δt ,   (8.38)

where M_i · G_i indicates element-by-element multiplication with no summation

(not the vector dot product), produces the circular convolution. To avoid wrap-
around effects and thus obtain a result that is indistinguishable from the serial
convolution (8.37), it may be necessary to pad both m and g with up to n zeros
and to perform (8.38) on series of length up to 2n. Because of the factoring

strategy used in the FFT algorithm, it is also desirable to pad m and g up to
lengths that are powers of two, or at least highly composite numbers.
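The wrap-around effect and its removal by zero padding can be demonstrated directly. In this sketch the series lengths are arbitrary, and the Δt scaling in (8.37) and (8.38) is omitted for clarity:

```python
import numpy as np

rng = np.random.default_rng(2)
m = rng.standard_normal(50)
g = rng.standard_normal(50)

# Serial convolution: at most n + m - 1 = 99 nonzero terms.
serial = np.convolve(m, g)

# Circular convolution via the DFT without padding wraps around...
circular = np.fft.ifft(np.fft.fft(m) * np.fft.fft(g)).real

# ...but padding both series to at least 99 points (here 128, a
# power of two, convenient for the FFT) reproduces the serial result.
npad = 128
padded = np.fft.ifft(np.fft.fft(m, npad) * np.fft.fft(g, npad)).real

assert not np.allclose(circular, serial[:50])   # wrap-around corrupts
assert np.allclose(padded[:99], serial)         # padding fixes it
```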
Consider the case where we have a theoretically known, or accurately esti-
mated, system impulse response, g(t), convolved with an unknown model, m(t).
We note in passing that, although we will examine the deconvolution problem
in one dimension for simplicity, the results are generalizable to higher dimensions.
The forward problem is

d(t) = ∫_a^b g(t − τ)m(τ) dτ .   (8.39)

Uniformly discretizing this expression using simple collocation with a sampling
interval Δt = 1/f_s gives

d = Gm ,   (8.40)

where d and m are appropriate length time series (sampled approximations of
d(t) and m(t)), and G is a matrix with rows that are padded, time reversed,
sampled impulse response vectors (Figure 4.4), so that

G_{i,j} = g(t_i − τ_j)Δt .   (8.41)

This time domain representation of the forward problem was previously exam-
ined in Example 4.5.
Taking the Fourier perspective, an inverse solution can be obtained by first
padding d and g appropriately with zeros so that they are of equal and suffi-
cient length to avoid wrap-around artifacts associated with circular convolution.
Evaluating the discrete Fourier transforms of both vectors, (8.19) allows us to
cast the problem as a complex-valued linear system

D = G · MΔt ,   (8.42)

with G being a diagonal matrix with

G_{i,i} = G_i ,   (8.43)

where G is the discrete Fourier transform of the sampled impulse response, g,
D is the discrete Fourier transform of the data vector, d, and M is the discrete
Fourier transform of the model vector, m.
(8.42) suggests a straightforward solution by spectral division,

m = DFT⁻¹[M] = DFT⁻¹[G⁻¹ · D] .   (8.44)
(8.44) is appealing in its simplicity and efficiency. The application of (8.19),
combined with the efficient FFT implementation of the DFT, reduces the nec-
essary computational effort from solving a potentially very large linear system
of time-domain equations (8.40) to just three n-length DFT operations and n
complex divisions. If d and g are real, packing/unpacking algorithms exist that
allow the DFT operations to be further reduced to operations involving complex
vectors of length n/2.

However, (8.44) does not avoid instability associated with deconvolution sim-
ply by transforming the problem into the frequency domain, because reciprocals
of any very small elements of G will dominate the diagonal of G⁻¹. (8.44) will
thus frequently require regularization to be useful.

8.4 Water Level Regularization

A simple and widely-applied method of regularizing spectral division is water
level regularization. The water level strategy employs a modified G matrix,
G_w, in the spectral division, where

G_{w,i,i} = G_{i,i}   (|G_{i,i}| > w)
G_{w,i,i} = wG_{i,i}/|G_{i,i}|   (0 < |G_{i,i}| ≤ w)
G_{w,i,i} = w   (G_{i,i} = 0) .

The water-level damped model estimate is then

m_w = DFT⁻¹[G_w⁻¹ D] .   (8.45)

The colorful name for this technique arises from the analogy of pouring water
into the holes of the Fourier transform of g until the spectral amplitude level
there reaches w. The effect of the water level is to prevent unregularized spec-
tral division at frequencies where the spectral amplitude of the system transfer
function is small, and thus prevent undesirable noise amplification.
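A sketch of water level regularized deconvolution in Python follows; the impulse response, noise level, and water level values are invented for illustration and are not the seismometer example discussed below:

```python
import numpy as np

def water_level_deconvolve(d, g, w):
    """Water level regularized deconvolution, a sketch of (8.45).

    d and g are equal-length (already padded) data and impulse
    response series; w is the water level.
    """
    D = np.fft.fft(d)
    G = np.fft.fft(g)
    Gw = G.copy()
    small = (np.abs(G) > 0) & (np.abs(G) <= w)
    Gw[small] = w * G[small] / np.abs(G[small])  # raise amplitude, keep phase
    Gw[np.abs(G) == 0] = w                       # fill exact spectral zeros
    return np.fft.ifft(D / Gw).real

# Synthetic test problem (all values invented for illustration).
n = 256
t = np.arange(n)
g = np.exp(-t / 10.0)                  # a smoothing impulse response
m_true = np.zeros(n)
m_true[40:60] = 1.0                    # boxcar model
d = np.fft.ifft(np.fft.fft(m_true) * np.fft.fft(g)).real
d_noisy = d + 0.1 * np.random.default_rng(3).standard_normal(n)

# Raising w damps the solution: |Gw| >= |G| elementwise, so the
# model 2-norm decreases monotonically with w.
m_w = water_level_deconvolve(d_noisy, g, w=1.0)
m_w_big = water_level_deconvolve(d_noisy, g, w=10.0)
assert np.linalg.norm(m_w_big) <= np.linalg.norm(m_w)
```

In practice d and g would first be zero padded as in Section 8.3 to avoid wrap-around artifacts; here the data are generated by circular convolution, so no padding is needed.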
An optimal water level value w will reduce the sensitivity to noise in the
inverse solution while still recovering important model features. As is typical
of the regularization process, it is possible to choose a “best” solution by
assessing the tradeoff between the forward model residual and the model norm
as the regularization parameter w is varied. A useful property in evaluating
data misfit and model length for calculations in the frequency domain is that the
2–norm of a Fourier transform vector (defined for complex vectors as the
square root of the sum of the squared complex element amplitudes)
is proportional to the 2–norm of the time-domain vector (the specific constant
of proportionality depends on the DFT conventions used). One can thus easily
evaluate 2–norm tradeoff metrics in the frequency domain without calculating
inverse Fourier transforms. The 2–norm of the water level-regularized solution,
m_w, will thus decrease monotonically as w increases, because |G_{w,i,i}| ≥ |G_{i,i}|.
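For example, for the DFT convention used by NumPy (no 1/n factor in the forward transform), the constant of proportionality is √n:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.standard_normal(128)
X = np.fft.fft(x)

# Parseval's theorem for this convention: ||X||_2 = sqrt(n) * ||x||_2,
# so 2-norm tradeoff metrics can be computed entirely in the
# frequency domain.
assert np.allclose(np.linalg.norm(X), np.sqrt(128) * np.linalg.norm(x))
```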

Example 8.2
Revisiting the time-domain seismometer deconvolution example, which
was regularized in Chapter 4 using SVD truncation (Example 4.5),
we investigate a frequency-domain solution regularized via the water
level technique. The impulse response, true model, and noisy data
for this example are plotted in Figures 4.3, 4.6, and 4.9, respec-
tively. We first pad the n = 210 point data and impulse response






Figure 8.3: Spectral amplitudes of the impulse response, noise-free, and noisy
data vectors for the seismometer deconvolution problem. Spectra range in fre-
quency from zero (not shown) to half of the sampling frequency (1 Hz). Because
spectral amplitudes for real-valued time series are symmetric with frequency,
spectra are shown only for f = kf_s/n > 0.

vectors with 210 zeros to eliminate wrap-around artifacts and apply
the fast Fourier transform to both vectors to obtain the correspond-
ing discrete spectra. The spectral elements are complex, but we can
examine the element magnitudes (which are critical when assessing
the stability of the spectral division solution) by plotting the ampli-
tudes of the impulse response, noise-free data, and data with noise
added (Figure 8.3).
Examining the impulse response spectral amplitude, |Gk |, we note
that it decreases by approximately three orders of magnitude be-
tween low frequencies and 1 Hz. The convolution theorem (8.19)
tells us that G is the spectral amplitude shaping that the forward
convolution will impose on a general model in mapping it to the data.
Thus, the convolution of a general signal, consisting of a range of fre-
quencies, in this system will result in increased attenuation of higher
frequencies, which is the hallmark of a low–pass filter. Figure 8.3
also shows that the spectral amplitudes of the noise-free data falls
off even more quickly than the impulse response. This means that
spectral division will be a stable process for noise free data in this
problem, and that the quotient will indeed converge to numerical
accuracy to the Fourier transform of the true model.

Figure 8.4: Spectral amplitudes of the Fourier transform of the noisy data
divided by the transfer function (the Fourier transform of the impulse response).
This spectrum is dominated by amplified noise at frequencies above about 0.1
Hz and is the Fourier transform amplitude of Figure 4.10.

However, Figure 8.3 also shows that the spectral amplitudes of the noisy data
dominate the signal at frequencies higher than ≈ 0.1 Hz. Because
of the small values of G_k at these frequencies, the spectral division
solution will result in a model that is dominated by noise, exactly
as in the time-domain solution (Figure 4.10). Figure 8.4 shows
the amplitude spectrum resulting from this spectral division.
To regularize the spectral division solution, an optimal water level is
needed. Examining Figure 8.3 it is readily observed that the value
of w that will deconvolve the portion of the data spectrum that is
unobscured by noise while suppressing the amplification of higher
frequency noise must be close to unity. Such a spectral observation
might be more difficult for real data with more complex spectra,
however. A more adaptable way to select w is to construct a tradeoff
curve between model smoothness and data misfit by trying a range
of water level values. Figure 8.5 shows the curve for this synthetic
example, showing that the optimum w is close to 3. Figure 8.6 shows
a corresponding range of solutions, and Figure 8.7 shows the solution
for w = 3.16.
The solution of Figure 8.7, chosen from the corner of the tradeoff
curve of Figure 8.5, shows the resolution reduction characteristic of
regularized solutions, manifested in reduced amplitude and increased



Figure 8.5: Tradeoff curve between model 2–norm and data 2–norm misfit for
a range of water level values.




Figure 8.6: Models corresponding to a range of water level values and used to
construct Figure 8.5. Dashed curves show the true model (Figure 4.6).


Figure 8.7: Model corresponding to w = 3.16 (Figure 8.5; Figure 8.6). Dashed
curves show the true model (Figure 4.6).

spread into adjacent model elements relative to the true model.

A significant new idea introduced by the Fourier methodology in our context
is that it provides a new set of basis functions (the complex exponentials) that
have the remarkable property (shown by the convolution theorem) of passing
through a linear system altered only in phase and amplitude. Spectral plots
such as Figures 8.3 and 8.4 can thus be used to understand which frequency
components will have stability issues in the inverse solution. The information
contained in the spectral plot of Figure 8.3 is thus analogous to that obtained
with a Picard plot in the context of the SVD. The Fourier perspective also
provides a link between inverse theory and the vast field of linear filtering. In
the context of Example 8.2, the deconvolution problem is seen to be identical
to the problem of finding an optimal high–pass filter to recover the model
while suppressing the influence of noise. The Fourier methodology is also
spectacularly computationally efficient relative to time domain deconvolution
methods. This efficiency can become critically important when larger and/or
higher dimensional models are of interest, a great many deconvolutions must
be performed, or speed is essential.

8.5 Exercises
1. Consider regularized deconvolution as the solution to a pth order Tikhonov
regularized system

⎡ d ⎤   ⎡  G    ⎤
⎢   ⎥ = ⎢       ⎥ m
⎣ O ⎦   ⎣ αL_p  ⎦

which is, for example for p = 0,

        ⎡ hᵀ  0   0  · · ·  0  ⎤
        ⎢ 0   hᵀ  0  · · ·  0  ⎥
        ⎢ .   .   .          . ⎥
⎡ d ⎤   ⎢ 0   0   0  · · ·  hᵀ ⎥
⎢   ⎥ = ⎢ α   0   0  · · ·  0  ⎥ m    (8.46)
⎣ O ⎦   ⎢ 0   α   0  · · ·  0  ⎥
        ⎢ .   .   .          . ⎥
        ⎣ 0   0   0  · · ·  α  ⎦

where m is a model of length n, d is a data vector of length n, hT is a

time-reversed system impulse response of length n/2, and O is an n-length
zero vector.
a) Show that summing the two constraint systems of (8.46) and solving
the resultant n by n system via spectral division is equivalent to solving
a modified water level damping system where G_{w,i,i} = G_{i,i} + α.
b) Show that a corresponding pth order Tikhonov-regularized deconvo-
lution solution is equivalent to solving the modified water level damping
system where G_{w,i,i} = G_{i,i} + (2πıf)^p α.
Hint: Apply the convolution theorem and note that the Fourier transform
of dg(t)/dt is 2πıf times the Fourier transform of g(t).
2. A displacement seismogram is observed from a large earthquake at a far-
field seismic station, such that the source region can be approximated
as a point. A much smaller aftershock from the main shock region is used
as an empirical Green’s function for this event. It is supposed that the
observed signal from the large event should be approximately equal to
the convolution of the main shock’s rupture history with this empirical
Green’s function.
The 256 point seismogram is in the file seis.mat. The impulse response of
the seismometer is in the file impresp.mat.
a) Deconvolve the impulse response from the observed mainshock seis-
mogram using water-level-damped deconvolution to solve for the source

time function of the large earthquake (it should consist of a non-negative

pulse or set of pulses). Estimate the source duration in samples and assess
any evidence for subevents and their relative durations and amplitudes.
Approximately what water level do you believe is best for this data set?
b) Recast the problem as a linear inverse system, as described in the
example for Chapter 4, and solve the system for a least-squares source
time function.
c) Are the results in (b) better or worse than in (a)? Why?

8.6 Notes and Further Reading

Because of its tremendous utility, there are many, many resources on Fourier
applications in the physical sciences, engineering, and pure mathematics. A
basic text covering theory and some applications at the approximate level of
this text is [Bra00]. An advanced text on the topic is [Pri83]. Kak and Slaney
give an extensive treatment of Fourier based methods for tomographic imaging.
Vogel [Vog02] discusses Fourier transform based methods for image deblurring.
Chapter 9

Nonlinear Regression

In this chapter we will consider nonlinear regression problems. These problems

are similar to the linear regression problems that we considered in chapter 2,
except that now the relationship between the model parameters and the data
can be nonlinear.
Specifically, we consider the problem of fitting n parameters m1, m2, . . .,
mn to a data set with N data points (x1, d1), (x2, d2), . . ., (xN, dN), with asso-
ciated measurement standard deviations σ1, σ2, . . ., σN. There is a nonlinear
function G(m, x) which, given the model parameters m and a point xi, predicts
an observation di. Our goal is to find the values of the parameters m which
best fit the data.
As with linear regression, if we assume that the measurement errors are nor-
mally distributed, then the maximum likelihood principle leads us to minimizing
the sum of squared errors. The function to be minimized can be written as

f(m) = Σ_{i=1}^{N} [ (G(m, x_i) − d_i)/σ_i ]² .   (9.1)
In this chapter we will discuss methods for minimizing f (m) and the statis-
tical analysis of the nonlinear regression fit.

9.1 Newton’s Method

Consider a nonlinear system of equations
F(x) = d (9.2)
where x is a vector of m unknowns and d is a vector of m known values. We
will construct a sequence of vectors x0 , x1 , . . . which will converge to a solution
x∗ to the system of equations.
If we assume that F is continuously differentiable, we can construct a simple
Taylor series approximation
F(x0 + s) ≈ F(x0 ) + ∇F(x0 )s (9.3)


where ∇F(x0) is the Jacobian matrix

          ⎡ ∂F1(x0)/∂x1   · · ·   ∂F1(x0)/∂xm ⎤
∇F(x0) =  ⎢     ...        ...        ...      ⎥ .   (9.4)
          ⎣ ∂Fm(x0)/∂x1   · · ·   ∂Fm(x0)/∂xm ⎦

Using this approximation, we can obtain an approximate equation for the dif-
ference between x0 and the unknown x∗ .

F(x∗ ) = d ≈ F(x0 ) + ∇F(x0 )(x∗ − x0 ). (9.5)

We can rewrite this equation as

∇F(x0 )(x∗ − x0 ) ≈ d − F(x0 ) (9.6)

x∗ − x0 ≈ ∇F(x0 )−1 (d − F(x0 )). (9.7)
This leads to Newton’s Method.

Algorithm 9.3 Newton’s Method

Given a system of equations F(x) = d and an initial guess x0 , use

the formula
xi+1 = xi + ∇F (xi )−1 (d − F(xi ))
to compute a sequence of solutions x1 , x2 , . . .. Stop if and when the
sequence converges to an acceptable solution.

The theoretical properties of Newton’s method are summarized in the fol-

lowing theorem. For a proof, see [DS96].

Theorem 9.1

If x0 is close enough to x∗ , F(x) is continuously differentiable in

a neighborhood of x∗ , and ∇F(x∗ ) is nonsingular, then Newton’s
method will converge to x∗ . The convergence rate is quadratic in
the sense that there is a constant c such that for large n,

‖x^{n+1} − x∗‖2 ≤ c‖x^n − x∗‖2² .

In practical terms, quadratic convergence means that as we approach x∗ , the

number of accurate digits in the solution doubles at each iteration. Unfortu-
nately, if the hypotheses of the above theorem are not satisfied, then Newton’s
method can converge very slowly or simply fail to converge to any solution.
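A minimal implementation of Algorithm 9.3 follows; the test system and starting point are invented for illustration:

```python
import numpy as np

def newton(F, jac, x0, d, tol=1e-12, maxit=50):
    """Newton's method for F(x) = d (Algorithm 9.3), a sketch."""
    x = np.asarray(x0, dtype=float)
    for _ in range(maxit):
        # Solve the linearized system (9.6) for the step.
        step = np.linalg.solve(jac(x), d - F(x))
        x = x + step
        if np.linalg.norm(step) < tol:
            break
    return x

# Illustrative system: F(x, y) = (x^2 + y^2, x - y) = (2, 0),
# with a solution at x = y = 1.
F = lambda v: np.array([v[0]**2 + v[1]**2, v[0] - v[1]])
jac = lambda v: np.array([[2*v[0], 2*v[1]], [1.0, -1.0]])

x = newton(F, jac, [2.0, 0.5], np.array([2.0, 0.0]))
assert np.allclose(x, [1.0, 1.0])
```

Starting from (2, 0.5), the iterates 
(1.25, 1.25), (1.025, 1.025), . . . exhibit the quadratic convergence described in Theorem 9.1.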

A simple modification to the basic Newton’s method algorithm often helps
with convergence problems. In the damped Newton method, we use the New-
ton’s method equations at each iteration to compute a direction in which to
move. However, instead of simply taking the full step s, we search along the
line from xi to xi + s for a point which minimizes ‖F(xi + αs) − d‖2, and take
the step which minimizes the norm.
Now suppose that we have a scalar valued function f(x), and want to min-
imize f. If we assume that f is twice continuously differentiable, then we have
the Taylor series approximation

f(x0 + s) ≈ f(x0) + ∇f(x0)ᵀs + (1/2)sᵀ∇²f(x0)s ,   (9.8)
where ∇f(x0) is the gradient of f,

∇f(x0) = [ ∂f(x0)/∂x1 , . . . , ∂f(x0)/∂xm ]ᵀ ,   (9.9)

and ∇²f(x0) is the Hessian matrix

           ⎡ ∂²f(x0)/∂x1²     · · ·   ∂²f(x0)/∂x1∂xm ⎤
∇²f(x0) =  ⎢     ...           ...         ...        ⎥ .   (9.10)
           ⎣ ∂²f(x0)/∂xm∂x1   · · ·   ∂²f(x0)/∂xm²   ⎦

A necessary condition for x∗ to be a minimum of f is that ∇f (x∗ ) = 0. We

can approximate the gradient of f by

∇f (x0 + s) ≈ ∇f (x0 ) + ∇2 f (x0 )s (9.11)

and then set this approximate gradient equal to zero.

∇f (x0 ) + ∇2 f (x0 )s = 0. (9.12)

∇2 f (x0 )s = −∇f (x0 ). (9.13)

Solving this system of equations for a step s leads to Newton’s method for
minimizing f (x).

Algorithm 9.4 Newton’s Method for Minimizing f (x)

Given a twice continuously differentiable function f (x), and an ini-
tial guess x0 , use the formula
xi+1 = xi − (∇²f(xi))⁻¹ ∇f(xi)

to compute a sequence of solutions x1 , x2 , . . .. Stop if and when the

sequence converges to a solution with ∇f (x∗ ) = 0.

The theoretical properties of Newton’s method for minimizing f (x) are sum-
marized in the following theorem. Since Newton’s method for minimizing f (x)
is just Newton’s method for solving a nonlinear system of equations applied to
∇f (x) = 0, the proof follows immediately from the proof of theorem 9.1.

Theorem 9.2
If f is twice continuously differentiable in a neighborhood of a local
minimizer x∗ , there is a constant λ such that k∇2 f (x)−∇2 f (y)k2 ≤
λkx − yk2 in the neighborhood, ∇2 f (x∗ ) is positive definite, and x0
is close enough to x∗ , then Newton’s method will converge quadrat-
ically to x∗ .

As with Newton’s method for systems of equations, Newton’s method for

minimizing f (x) is very efficient when it works, but the method can also fail to
converge. Because of its speed, Newton’s method is the basis for most methods
of nonlinear optimization. Various modifications to the basic method are used
to ensure convergence to a local minimum of f (x).
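A corresponding sketch of Algorithm 9.4; because the illustrative function below is quadratic, a single Newton step lands exactly on the minimizer:

```python
import numpy as np

def newton_min(grad, hess, x0, tol=1e-10, maxit=50):
    """Newton's method for minimization (Algorithm 9.4), a sketch."""
    x = np.asarray(x0, dtype=float)
    for _ in range(maxit):
        # Solve (9.13) for the step s.
        s = np.linalg.solve(hess(x), -grad(x))
        x = x + s
        if np.linalg.norm(s) < tol:
            break
    return x

# Illustrative function f(x, y) = (x - 3)^2 + 10*(y + 1)^2,
# whose minimizer is (3, -1).
grad = lambda v: np.array([2*(v[0] - 3), 20*(v[1] + 1)])
hess = lambda v: np.array([[2.0, 0.0], [0.0, 20.0]])

x = newton_min(grad, hess, [0.0, 0.0])
assert np.allclose(x, [3.0, -1.0])
```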

9.2 The Gauss–Newton and Levenberg–Marquardt Methods

In this section we will consider variations of Newton’s method that have been
adapted to the nonlinear regression problem. Recall that our objective in non-
linear regression is to minimize the sum of squared errors,

f(m) = Σ_{i=1}^{N} [ (G(m, x_i) − d_i)/σ_i ]² .   (9.14)

For convenience, we will let

f_i(m) = (G(m, x_i) − d_i)/σ_i   (9.15)

and

F(m) = [ f1(m), . . . , fN(m) ]ᵀ .   (9.16)
The gradient of f(m) can be expressed in terms of the individual terms in
the sum of squares,

∇f(m) = Σ_{i=1}^{N} ∇(f_i(m)²)   (9.17)
       = Σ_{i=1}^{N} 2 f_i(m) ∇f_i(m) .   (9.18)

This formula can be simplified by using matrix notation to

∇f(m) = 2J(m)ᵀF(m) ,   (9.19)

where J(m) is the Jacobian matrix

        ⎡ ∂f1(m)/∂m1   · · ·   ∂f1(m)/∂mn ⎤
J(m) =  ⎢     ...       ...        ...     ⎥ .   (9.20)
        ⎣ ∂fN(m)/∂m1   · · ·   ∂fN(m)/∂mn ⎦

Similarly, we can express the Hessian of f in terms of the individual f_i(m),

∇²f(m) = Σ_{i=1}^{N} ∇²(f_i(m)²)   (9.21)
        = Σ_{i=1}^{N} Hⁱ(m) .   (9.22)

Here Hⁱ(m) is the Hessian of f_i(m)². The j, k element of this Hessian matrix
is given by

Hⁱ_{j,k}(m) = ∂²(f_i(m)²)/(∂m_j ∂m_k)   (9.23)
            = ∂/∂m_j [ 2 f_i(m) ∂f_i(m)/∂m_k ]   (9.24)
            = 2 [ ∂f_i(m)/∂m_j · ∂f_i(m)/∂m_k + f_i(m) ∂²f_i(m)/(∂m_j ∂m_k) ] .   (9.25)

Thus

∇²f(m) = 2J(m)ᵀJ(m) + Q(m) ,   (9.26)

where

Q(m) = 2 Σ_{i=1}^{N} f_i(m)∇²f_i(m) .   (9.27)

In the Gauss–Newton method, we ignore the Q(m) term and simply

approximate the Hessian by

∇2 f (m) ≈ 2J(m)T J(m). (9.28)

In the context of nonlinear regression, we typically expect that the fi (m) terms
will be reasonably small as we approach the optimal parameters m∗ , so that this
is a reasonable approximation. Specialized methods are available for problems
with large residuals in which this approximation is not justified.
Using our approximation, the equations for Newton's method are

2 J(mᵏ)ᵀ J(mᵏ) (mᵏ⁺¹ − mᵏ) = −2 J(mᵏ)ᵀ F(mᵏ),   (9.29)


or just

J(mᵏ)ᵀ J(mᵏ) (mᵏ⁺¹ − mᵏ) = −J(mᵏ)ᵀ F(mᵏ).   (9.30)
Because the Gauss–Newton method is based on Newton’s method, it can
fail for all of the same reasons as Newton’s method. Furthermore, because we
have introduced an additional approximation, the Gauss–Newton method can
also fail when Q(m) is not small. However, the Gauss–Newton method often works well
in practice.
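To make the iteration concrete, the Gauss–Newton loop can be sketched in a few lines. (Our examples elsewhere use MATLAB; this sketch is in Python/NumPy, and the exponential model, data, and starting point are hypothetical, chosen only for illustration.)

```python
import numpy as np

def gauss_newton(F, J, m0, tol=1e-10, maxiter=50):
    """Basic Gauss-Newton iteration: repeatedly solve (9.30) for the step."""
    m = np.asarray(m0, dtype=float)
    for _ in range(maxiter):
        r = F(m)                       # residual vector F(m)
        Jm = J(m)                      # Jacobian at the current model
        # Solving min ||Jm dm + r||_2 is equivalent to the normal
        # equations (9.30), but avoids squaring the condition number.
        dm, *_ = np.linalg.lstsq(Jm, -r, rcond=None)
        m = m + dm
        if np.linalg.norm(dm) <= tol * (1 + np.linalg.norm(m)):
            break
    return m

# Hypothetical two-parameter model y = m1 * exp(m2 * x), noise-free data.
x = np.linspace(0.0, 1.0, 20)
d = 2.0 * np.exp(-1.0 * x)
F = lambda m: m[0] * np.exp(m[1] * x) - d
J = lambda m: np.column_stack([np.exp(m[1] * x),
                               m[0] * x * np.exp(m[1] * x)])
mstar = gauss_newton(F, J, [1.5, -0.5])
```

Because the data here are noise-free, the residuals vanish at the solution, and the iteration converges rapidly to m = (2, −1).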
In the Levenberg–Marquardt method, we modify the Gauss–Newton
method equations to

( J(mᵏ)ᵀ J(mᵏ) + λI ) (mᵏ⁺¹ − mᵏ) = −J(mᵏ)ᵀ F(mᵏ).   (9.31)

Here the parameter λ is adjusted during the course of the algorithm to ensure convergence. One important reason for using a nonzero value of λ is that the λI term ensures that the matrix is nonsingular. For very large values of λ, the λI term dominates and the step is approximately

mᵏ⁺¹ − mᵏ ≈ −(1/λ) J(mᵏ)ᵀ F(mᵏ) = −(1/(2λ)) ∇f(mᵏ).   (9.32)

This is a steepest descent step: the algorithm simply moves downhill. The steepest descent step provides very slow, but certain, convergence. For very small values of λ, we get the Gauss–Newton direction, which gives fast but uncertain convergence.
The hard part of the Levenberg–Marquardt method is determining the right
value of λ. The general idea is to use small values of λ in situations where the
Gauss–Newton method is working well, but to switch to larger values of λ when
the Gauss–Newton method is not making progress. A very simple approach is to start with a small value of λ, and then adjust it in every iteration. If the L–M
step leads to a reduction in f (m), then decrease λ by a constant factor (say 2).
If the L–M step does not lead to a reduction in f (m), then do not take the step.
Instead, increase λ by a constant factor (say 2), and try again. Repeat this
process until a step is found which actually does decrease the value of f (m).
Robust implementations of the L–M method use sophisticated strategies for
adjusting the parameter λ. In practice, a careful implementation of the L–M
method typically has the good performance of the Gauss–Newton method as
well as very good convergence properties. In general, the L–M method is the
method of choice for small to medium sized nonlinear least squares problems.
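The simple λ-adjustment strategy described above can be sketched as follows (again in Python/NumPy rather than MATLAB; the test problem is hypothetical, and robust codes use more sophisticated λ updates):

```python
import numpy as np

def levenberg_marquardt(F, J, m0, lam=0.001, tol=1e-10, maxiter=500):
    """Sketch of the L-M method with the simple lambda strategy above:
    halve lambda after a successful step, double it after a failed one."""
    m = np.asarray(m0, dtype=float)
    fval = np.sum(F(m) ** 2)
    for _ in range(maxiter):
        r, Jm = F(m), J(m)
        g = 2.0 * Jm.T @ r                 # gradient of f, as in (9.19)
        if np.linalg.norm(g) <= tol * (1 + abs(fval)):
            break
        # L-M step: solve (J^T J + lambda I) dm = -J^T F, equation (9.31)
        dm = np.linalg.solve(Jm.T @ Jm + lam * np.eye(len(m)), -Jm.T @ r)
        fnew = np.sum(F(m + dm) ** 2)
        if fnew < fval:      # accept the step; trust Gauss-Newton more
            m, fval = m + dm, fnew
            lam /= 2.0
        else:                # reject the step; lean toward steepest descent
            lam *= 2.0
    return m

# Hypothetical model, as before: y = m1 * exp(m2 * x), noise-free data.
x = np.linspace(0.0, 1.0, 20)
d = 2.0 * np.exp(-1.0 * x)
F = lambda m: m[0] * np.exp(m[1] * x) - d
J = lambda m: np.column_stack([np.exp(m[1] * x),
                               m[0] * x * np.exp(m[1] * x)])
mstar = levenberg_marquardt(F, J, [1.0, 0.0])
```

Note that a rejected step leaves m unchanged; the loop simply retries with a larger λ, exactly as described in the text.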
Note that the λI term in the L–M method looks a lot like Tikhonov reg-
ularization. It is important to understand that this is not actually a case of
Tikhonov regularization, since the λI is only used as a way to improve the con-
vergence of the algorithm, and does not enter into the objective function being
minimized. We will discuss regularization for nonlinear problems in chapter 10.

9.3 Statistical Aspects

Recall from appendix B that if an n element vector d has a multivariate normal
distribution, and A is an m by n matrix, then Ad also has a multivariate normal
distribution with covariance matrix

Cov(Ad) = A Cov(d) Aᵀ.   (9.33)

We applied this formula to the linear system Gm = d, which we solved by the

normal equations. The resulting formula for Cov(m) was

Cov(m) = (GᵀG)⁻¹ Gᵀ Cov(d) G (GᵀG)⁻¹.   (9.34)

In the simplest case, where Cov(d) = σ 2 I, this simplifies to

Cov(m) = σ² (GᵀG)⁻¹.   (9.35)

For the nonlinear regression problem we no longer have a linear relationship

between the data and the estimated model parameters m, so we cannot assume
that the estimated model parameters have a multivariate normal distribution,
and we cannot use the above formula to obtain the covariance matrix of the
estimated parameters.
Since we are interested in how small perturbations in the data result in small perturbations in the estimated model parameters, we can consider a linearization of the nonlinear function F(m),

F(m* + Δm) ≈ F(m*) + J(m*) Δm.

Under this approximation, there is a linear relationship between changes in the data d and changes in the parameters m,

Δd = J(m*) Δm.

Using this linear approximation, our nonlinear least squares problem becomes a
linear least squares problem. Thus the matrix J(m∗ ) takes the place of G in an
approximate estimate of the covariance of the model parameters. Since we have
incorporated the σi into the formula for f , the Cov(d) matrix is the identity
matrix, and we obtain the formula

Cov(m*) = ( J(m*)ᵀ J(m*) )⁻¹.   (9.36)

Unlike linear regression, the parameter covariance matrix in nonlinear regression is the result of two critical approximations. We first approximated the
Hessian of f (m) by dropping the Q(m) term. We then linearized f (m) around
m∗ . Depending on the particular problem that we are solving, either of these
approximations could be problematic. The Monte Carlo approach discussed in
Chapter 2 is a robust (but computationally intensive) alternative approach to
estimating Cov(m∗ ).
As with linear regression, it is possible to apply nonlinear regression when the
measurement errors are independent and normally distributed and the standard

deviations are unknown but assumed to be equal. We set the σ_i in (9.1) to 1, and minimize the sum of squared errors. Define a vector of residuals by

r_i = G(m*, x_i) − d_i,   i = 1, 2, …, N,

and let r̄ be the mean of the residuals. Our estimate of the measurement standard deviation is then given by

s = √( Σ_{i=1}^{N} (r_i − r̄)² / (N − n) ).   (9.37)

The covariance matrix for the estimated model parameters is

Cov(m*) = s² ( J(m*)ᵀ J(m*) )⁻¹.   (9.38)

Once we have m∗ and Cov(m∗ ), we can establish confidence intervals for the
model parameters exactly as we did in Chapter 2. Just as with linear regression,
it is also important to examine the residuals for systematic patterns or deviations
from normality. If we have not estimated the measurement standard deviation
s, then it is also important to test the χ2 value for goodness of fit.
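As a sketch of these computations (in Python/NumPy rather than the MATLAB used in our examples; the function name and the toy Jacobian and residuals are hypothetical, chosen so the result can be checked by hand):

```python
import numpy as np

def confidence_intervals_95(mstar, Jstar, r):
    """Approximate 95% intervals from Cov(m*) = s^2 (J^T J)^{-1},
    with s estimated from the residuals as in (9.37)."""
    N, n = Jstar.shape
    s2 = np.sum((r - r.mean()) ** 2) / (N - n)      # s^2 from (9.37)
    covm = s2 * np.linalg.inv(Jstar.T @ Jstar)      # equation (9.38)
    half = 1.96 * np.sqrt(np.diag(covm))
    return covm, np.column_stack([mstar - half, mstar + half])

# Hypothetical numbers: N = 4, n = 2, residuals with mean 0.
Jstar = np.vstack([np.eye(2), np.eye(2)])           # a 4 x 2 Jacobian
r = np.array([1.0, -1.0, 1.0, -1.0])
covm, ci = confidence_intervals_95(np.zeros(2), Jstar, r)
```

For these hand-checkable numbers, s² = 2 and JᵀJ = 2I, so Cov(m*) = I and each 95% interval is ±1.96 about the parameter estimate.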

9.4 Implementation Issues

In this section we consider a number of important issues in the implementation
of the Gauss–Newton and Levenberg–Marquardt methods.
The most important difference between the linear regression problems that we solved in Chapter 2 and the nonlinear regression problems discussed in this chapter is that in nonlinear regression we have a nonlinear function G(m, x_i). Our nonlinear regression methods require us to compute values of G and its partial derivatives with respect to the model parameters m_i. In some cases, we have explicit formulas for G, and can easily find formulas for the derivatives of G. In other cases, G exists only as a black box subroutine that we can call as required to compute G values.
When an explicit formula for G is available, there are a number of options for
computing the derivatives. We can differentiate G by hand, or by using a computer algebra system (CAS)
such as Mathematica or Maple. There are also automatic differentiation
software packages that can translate the source code of a program to compute
G into a program to compute the derivatives of G.
One very simple approach is to use finite differences to approximate the derivatives of G. The forward difference formula

∂G(m, x)/∂m_i ≈ ( G(m + h e_i, x) − G(m, x) ) / h

can be used to estimate the derivatives. Here h must be carefully selected. The finite difference approximation is based on a Taylor series approximation to G which becomes inaccurate as h increases. On the other hand, as h becomes very small, we get significant round-off error in the numerator of the fraction. A good rule of thumb is to set h to √ε, where ε is the accuracy of the evaluations of G. For example, if the function evaluations are accurate to 0.0001, then an appropriate choice of h would be about 0.01.
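A minimal forward-difference Jacobian routine along these lines might look like the following (a Python/NumPy sketch; the helper name and the toy linear model are hypothetical):

```python
import numpy as np

def fd_jacobian(G, m, x, h=1e-4):
    """Forward-difference approximation to the Jacobian of G(m, x) with
    respect to m; h should be about sqrt(eps_G), per the rule of thumb."""
    g0 = np.atleast_1d(G(m, x))
    J = np.empty((g0.size, m.size))
    for i in range(m.size):
        mp = m.copy()
        mp[i] += h                     # perturb one parameter at a time
        J[:, i] = (np.atleast_1d(G(mp, x)) - g0) / h
    return J

# Hypothetical linear model G(m, x) = m1 * x + m2, whose exact Jacobian
# columns are x and a column of ones, so the answer is easy to check.
G = lambda m, x: m[0] * x + m[1]
x = np.array([0.0, 1.0, 2.0])
J = fd_jacobian(G, np.array([3.0, 1.0]), x)
```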
When G is available only as a black box subroutine that can be called with
particular values of m and x, and the source code for G is not available, then
our options are much more limited. In this case, the only practical approach
is to use finite differences. Finite difference approximations are inevitably less accurate than exact formulas, and this can lead to numerical problems in the
solution of the nonlinear regression problem.
A second important issue in the implementation of the G–N and L–M meth-
ods is when to terminate the iterations. We would like to terminate when
∇f (m) is approximately 0 and the values of m have stopped changing substan-
tially from one iteration to the next. Because of scaling issues, it is difficult to
set an absolute tolerance on ‖∇f(m)‖₂ that would be appropriate for all problems. Similarly, it is difficult to pick a single absolute tolerance on ‖mⁱ⁺¹ − mⁱ‖₂ or |f(mⁱ⁺¹) − f(mⁱ)| that would be appropriate for all problems.
The following convergence tests have been normalized so that they will work well on a wide variety of problems. We assume that values of G can be calculated with an accuracy of ε. To ensure that the gradient of f is approximately 0, we require that

‖∇f(mⁱ)‖₂ < √ε (1 + |f(mⁱ)|).

To ensure that successive values of m are close, we require

‖mⁱ − mⁱ⁻¹‖₂ < √ε (1 + ‖mⁱ‖₂).

Finally, to make sure that the values of f(m) have stopped changing, we require

|f(mⁱ) − f(mⁱ⁻¹)| < ε (1 + |f(mⁱ)|).
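Collected into code, the three tests might read as follows (a Python sketch; the function name is our own):

```python
import numpy as np

def converged(grad, m, m_prev, f, f_prev, eps):
    """The three normalized stopping tests, with eps the accuracy of
    the evaluations of G."""
    rt = np.sqrt(eps)
    ok_grad = np.linalg.norm(grad) < rt * (1 + abs(f))          # gradient small
    ok_step = np.linalg.norm(m - m_prev) < rt * (1 + np.linalg.norm(m))
    ok_fval = abs(f - f_prev) < eps * (1 + abs(f))              # f stabilized
    return ok_grad and ok_step and ok_fval
```

All three conditions should hold simultaneously before the iteration is stopped.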
There are a number of problems that can arise during the solution of a
nonlinear regression problem by the G–N or L–M methods.
The first issue is that our nonlinear regression methods are based on the
assumption that f (m) is a smooth function. This means not only that f (m)
must be continuous, but also that its first and second partial derivatives with
respect to the parameters must be continuous. Figure 9.1 shows a function which
is itself continuous, but has discontinuities in the first derivative at x = 0.2 and
the second derivative at x = 0.5. When G is given by an explicit formula, it is
usually easy to check this assumption. When G is given to us in the form of a
black box subroutine, it can be very difficult to check this assumption.
A second issue is that the function f (m) may have a “flat bottom”. See
Figure 9.2. In this case, there are many values of the parameters that come close
to fitting the data, and it is difficult to determine the optimal m∗ . In practice,
this condition is seen to occur when J(m∗ )T J(m∗ ) is very badly conditioned.
We will address this problem by regularization in the next chapter.
The final problem that we will consider is that the f (m) function may be
nonconvex and thus have multiple local minimum points. See figure 9.3. The










Figure 9.1: An Example of Nonsmoothness in f(m).










Figure 9.2: An Example of f(m) with a flat bottom.










Figure 9.3: An Example of f(m) with multiple local minima.

G–N and L–M methods are designed to converge to a local minimum, but de-
pending on the point at which we begin the search, there is no way to be certain
that we will converge to a global minimum.
Global optimization methods have been developed to deal with this problem. One simple global optimization procedure is called multistart. In this
procedure, we randomly generate a large number of initial solutions, and per-
form the L–M method starting with each of the random solutions. We then
examine the local minimum solutions found by the procedure, and select the
minimum with the smallest value of f (m).
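A multistart driver is only a few lines (a Python sketch; the mock "local solver" below just snaps the starting point to the nearest integer, standing in for an actual L–M run on a real problem):

```python
import numpy as np

def multistart(local_solver, f, lo, hi, nstarts=100, seed=0):
    """Multistart: run a local solver from random uniform starting
    points in [lo, hi] and keep the local minimum with smallest f."""
    rng = np.random.default_rng(seed)
    best_m, best_f = None, np.inf
    for _ in range(nstarts):
        m0 = rng.uniform(lo, hi)         # random starting model
        m = local_solver(m0)             # local minimization from m0
        fm = f(m)
        if fm < best_f:
            best_m, best_f = m, fm
    return best_m, best_f

# Toy illustration: the "local solver" rounds to the nearest integer,
# and f is smallest at m = 3, which some random start will find.
f = lambda m: abs(m[0] - 3.0)
solver = lambda m0: np.round(m0)
best_m, best_f = multistart(solver, f, lo=[0.0], hi=[5.0])
```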

9.5 An Example

Example 9.1

Consider the problem of fitting

y = m₁ e^{m₂ x} + m₃ x e^{m₄ x}   (9.39)

to a set of data. We will start with the true parameters:

>> mtrue

mtrue =

    1.0000
   -0.5000
    2.0000
   -1.0000

Number      m1        m2        m3        m4        χ²       p-value
  1       0.8380   -0.4616    2.2461   -1.0214    9.5957    0.73
  2       3.3135   -0.6132   -7.5628   -2.8005   10.7563    0.63
  3       1.8864    0.2017   -0.9294    0.0151   37.4439   < 1 × 10⁻³
  4       2.3725   -0.5059   -2.7754  -66.0067  271.5124   < 1 × 10⁻³

Table 9.1: Locally optimal solutions for our sample problem.


Our x points will be 17 evenly spaced points between 1 and 5. After generating the corresponding y values and adding normally distributed noise with a standard deviation of 0.01, we obtain our data set:

>> [x y]

ans =

1.0000 1.3308
1.2500 1.2634
1.5000 1.1536
1.7500 1.0247
2.0000 0.9125
2.2500 0.8007
2.5000 0.6951
2.7500 0.6117
3.0000 0.5160
3.2500 0.4708
3.5000 0.3838
3.7500 0.3309
4.0000 0.2925
4.2500 0.2413
4.5000 0.2044
4.7500 0.1669
5.0000 0.1524

We next used the L–M method to solve the problem a total of 100
times, using random starting points with each parameter uniformly
distributed between -2 and 2. This produced a number of different
locally optimal solutions, which are shown in Table 9.1. Since solution
number 1 has the best χ2 value, we will analyze it.






Figure 9.4: Normalized residuals.

The χ² value of 9.5957 has an associated p-value (based on 13 degrees of freedom) of 0.73, so this regression fit passes the χ² test.
Figure 9.4 shows the normalized residuals for this regression fit. No-
tice that the majority of the residuals are within 0.5 standard devia-
tions, with a few residuals as large as 1.9 standard deviations. There
is no obvious trend in the residuals as x ranges from 1 to 5.
Next, we compute the covariance matrix for the model parameters.
The square roots of the diagonal elements of the covariance matrix
are standard deviations for the individual model parameters. These
are then used to compute approximate 95% confidence intervals for
model parameters.

>> J=jac(mstar);
>> covm=inv(J’*J)

covm =

0.0483 -0.0153 -0.0809 0.0100

-0.0153 0.0056 0.0262 -0.0037
-0.0809 0.0262 0.1369 -0.0173
0.0100 -0.0037 -0.0173 0.0026

>> %
>> % Now, compute the confidence intervals.
>> %

>> [mstar-1.96*sqrt(diag(covm)) mstar+1.96*sqrt(diag(covm))]

ans =

0.4073 1.2687
-0.6077 -0.3156
1.5209 2.9713
-1.1206 -0.9221

Notice that the true parameters (1, -.5, 2, and -1) are all covered
by these confidence intervals. However, there is quite a bit of un-
certainty in the model parameters. This is an example of a poorly
conditioned nonlinear regression problem in which the data do not
constrain the possible parameter values very well.
The correlation matrix provides some insight into the nature of the
ill conditioning. For our solution, the correlation matrix is

>> corm

corm =

1.0000 -0.9362 -0.9949 0.8951

-0.9362 1.0000 0.9502 -0.9863
-0.9949 0.9502 1.0000 -0.9243
0.8951 -0.9863 -0.9243 1.0000

Notice the strong negative correlation between m1 and m2 . This

tells us that by increasing m1 and simultaneously decreasing m2 , we
can obtain a solution which is very nearly as good as our optimal
solution. There are also strong negative correlations between m2 and
m4 and m3 and m4 . Similarly, there are strong positive correlations
between m1 and m4 and between m2 and m3 .

9.6 Exercises
1. A recording instrument sampling at 50 Hz is run with a high precision sine wave generator connected to its input. The signal generator produces a 39.98 s signal of the form

x(t) = A sin(2πf_a t + φ_a)

and the recorded signal consists of instrumental noise, n(t), plus a mean offset, c:

y(t) = B sin(2πf_b t + φ_b) + c + n(t).

Using the data in instdata.mat, and a starting model: f_b = 0.5, φ_b = 0, c = x̄, and B = (max(x) − min(x))/2, apply Newton's method to solve for the unknown parameters (B, f_b, φ_b, c).

2. An alternative version of the Levenberg–Marquardt method stabilizes the

Gauss–Newton method by multiplicative damping. Instead of adding λI
to the diagonal of J(mk )T J(mk ), this method multiplies the diagonal of
J(mk )T J(mk ) by a factor of (1 + λ). Show that this method can fail
by producing an example in which the modified J(mk )T J(mk ) matrix is
singular, no matter how large λ becomes.

3. A cluster of 10 small earthquakes occurs in a shallow geothermal reservoir.

The field is instrumented with nine seismometers, eight of which are at the
surface and one of which is 300 m down a borehole. The P-wave velocity of
the fractured granite medium is thought to be approximately uniform at 2 km/s. The station locations (in meters relative to a central origin) are:

x y z

st1 500 -500 0

st2 -500 -500 0
st3 100 100 0
st4 -100 0 0
st5 0 100 0
st6 0 -100 0
st7 0 -50 0
st8 0 200 0
st9 10 50 -300

The arrival times of P-waves from the earthquakes are carefully measured
at the stations, with an estimated error of approximately 1 ms. The arrival
time estimates for each earthquake at each station (in seconds, with the
nearest clock second subtracted) are

eq1 eq2 eq3 eq4 eq5 eq6 eq7


st1 0.8423 1.2729 0.8164 1.1745 1.1954 0.5361 0.7633

st2 0.8680 1.2970 0.8429 1.2009 1.2238 0.5640 0.7878
st3 0.5826 1.0095 0.5524 0.9177 0.9326 0.2812 0.5078
st4 0.5975 1.0274 0.5677 0.9312 0.9496 0.2953 0.5213
st5 0.5802 1.0093 0.5484 0.9145 0.9313 0.2795 0.5045
st6 0.5988 1.0263 0.5693 0.9316 0.9480 0.2967 0.5205
st7 0.5857 1.0141 0.5563 0.9195 0.9351 0.2841 0.5095
st8 0.6017 1.0319 0.5748 0.9362 0.9555 0.3025 0.5275
st9 0.5266 0.9553 0.5118 0.8533 0.8870 0.2115 0.4448

eq8 eq9 eq10

st1 0.8865 1.0838 0.9413

st2 0.9120 1.1114 0.9654
st3 0.6154 0.8164 0.6835
st4 0.6360 0.8339 0.6982
st5 0.6138 0.8144 0.6833
st6 0.6347 0.8336 0.6958
st7 0.6215 0.8211 0.6857
st8 0.6394 0.8400 0.7020
st9 0.5837 0.7792 0.6157

The above data can be found in eqdata.mat.

(a) Apply the L–M method to this data set to estimate minimum 2-norm
misfit locations of the earthquakes.
(b) Estimate the errors in x, y, z (in meters) and origin time (in s) for each
earthquake using the diagonal elements of the appropriate covariance
matrix. Do the earthquakes form any sort of discernible trend?

4. The Lightning Mapping Array (LMA) is a deployable, GPS-based system that has recently been developed to locate the sources of lightning radiation in three spatial dimensions and time [RTK+99]. The system accurately measures the arrival time of impulsive radiation events in an unused VHF television channel at 60 MHz. The measurements are made at nine or more locations in a region 40 to 60 km in diameter. Each station records the peak radiation event in successive 100 μs time intervals; from this, several hundred to over a thousand radiation sources are typically located per lightning discharge. In a large storm it is not uncommon to locate over a million sources each minute.

(a) Use the arrival times at stations 1, 2, 4, 6, 7, 8, 10, and 13 to find the time and location of the radio frequency source in the lightning flash. Assume the radio waves travel in a straight line at the speed of light, 2.99731435 × 10⁸ m/s.

arrival time (s) station location (km)

No. x y z
1. 0.0922360280 -24.3471411 2.14673146 1.18923667
2. 0.0921837940 -12.8746056 14.5005985 1.10808551
3. 0.0922165500 16.0647214 -4.41975194 1.12675062
4. 0.0921199690 0.450543748 30.0267473 1.06693166
6. 0.0923199800 -17.3754105 -27.1991732 1.18526730
7. 0.0922839580 -44.0424408 -4.95601205 1.13775547
8. 0.0922030460 -34.6170855 17.4012873 1.14296361
9. 0.0922797660 17.6625731 -24.1712580 1.09097830
10. 0.0922497250 0.837203704 -10.7394229 1.18219520
11. 0.0921672710 4.88218031 10.5960946 1.12031719
12. 0.0921702350 16.9664920 9.64835135 1.09399160
13. 0.0922357370 32.6468622 -13.2199767 1.01175261
(b) During lightning storms we record the arrival of thousands of events
each second. We use several methods to find which arrival times
are due to the same event. The above data were chosen as possible
candidates to go together. We require any solution to use time from
at least 6 stations. Find the largest subset of this data that gives a
good solution. See if it is the same subset suggested in part a of this

9.7 Notes and Further Reading

Dennis and Schnabel [DS96] is a classic reference on Newton's method. The Gauss–Newton and Levenberg–Marquardt methods are discussed in [Bjö96, NS96]. Statistical aspects of nonlinear regression are discussed in [BW88, DS98]. There are a number of freely available and commercial software packages for nonlinear regression; one freely available package is GaussFit [JFM87].
Chapter 10

Nonlinear Inverse Problems

10.1 Regularizing Nonlinear Least Squares Problems

As with linear problems, the nonlinear least squares approach can run into
problems with extremely ill-conditioned problems. This typically happens as
the number of model parameters grows. In this chapter we will discuss regularization for nonlinear inverse problems and a popular algorithm for computing a
regularized solution to a nonlinear inverse problem.
The basic idea of Tikhonov regularization can be carried over to nonlinear problems in a straightforward way. Suppose that we are given a nonlinear forward model G(m), and wish to find the solution with smallest ‖Lm‖₂ which comes close enough to matching the data. We can formulate this problem as

min ‖Lm‖₂
subject to ‖G(m) − d‖₂ ≤ δ.   (10.1)

Notice that the form of the problem is virtually identical to the problem that we considered earlier in the linear case. The only difference is that we now use G(m) instead of Gm. As in the linear case, we can also reformulate this problem in terms of minimizing the misfit subject to a constraint on ‖Lm‖₂,

min ‖G(m) − d‖₂
subject to ‖Lm‖₂ ≤ ε,   (10.2)

or as a damped least squares problem

min ‖G(m) − d‖₂² + α² ‖Lm‖₂².   (10.3)
All three versions of the regularized least squares problem can be solved by applying standard nonlinear optimization software. In particular, (10.3) is a nonlinear least squares problem, so we could apply the L–M or G–N methods to it. Of course, we will still have to deal with the possibility of local minimum points.

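For instance, the damped problem (10.3) can be handed directly to standard nonlinear least squares software by stacking the weighted regularization term onto the residual vector. A Python/SciPy sketch with a hypothetical two-parameter forward model (the solver choice, model, and α here are ours, for illustration only):

```python
import numpy as np
from scipy.optimize import least_squares

def damped_residual(m, G, L, d, alpha):
    # Residual vector whose squared 2-norm equals (10.3):
    # ||G(m) - d||^2 + alpha^2 ||L m||^2
    return np.concatenate([G(m) - d, alpha * (L @ m)])

# Hypothetical forward model whose damped problem has its minimum
# at m = (0, 0), where both the misfit and ||L m|| vanish.
G = lambda m: np.array([np.exp(m[0]), m[0] + m[1]])
d = np.array([1.0, 0.0])
L = np.eye(2)
res = least_squares(damped_residual, x0=[0.5, 0.5],
                    args=(G, L, d, 0.1), method='lm')
```

Here the minimizer of (10.3) is exactly m = (0, 0), since both the data misfit and the regularization term vanish there.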

10.2 Occam’s Inversion

Occam’s inversion is a popular algorithm for nonlinear inversion that was in-
troduced by Constable, Parker, and Constable [CPC87]. The name refers to
“Occam’s Razor”, the principle that models should not be unnecessarily com-
plicated. Occam's inversion uses the discrepancy principle, and searches for the solution that minimizes ‖Lm‖₂ subject to the constraint ‖G(m) − d‖₂ ≤ δ.
The algorithm is easy to implement, requires only the nonlinear forward model
G(m) and its Jacobian matrix, and works well in practice. However, to the
best of our knowledge, no theoretical analysis proving the convergence of the
algorithm to even a local minimum has been done.
In the following we will assume that our nonlinear inverse problem has been formulated as

min ‖Lm‖₂
subject to ‖G(m) − d‖₂ ≤ δ.   (10.4)

Here L can be I for zeroth order regularization, or it can be a finite difference approximation of a first or second derivative. In fact, Occam's inversion is often used on two dimensional problems in which L is an approximate Laplacian operator.
As usual, we will assume that the measurement errors in d are independent
and normally distributed. For convenience, we will also assume that the system of equations G(m) = d has been scaled so that the standard deviations σ_i are
all the same.
The basic idea behind Occam’s inversion is local linearization. Given a model
m, we can use Taylor’s theorem to obtain the approximation

G(m + ∆m) ≈ G(m) + J(m)∆m (10.5)

where J(m) is the Jacobian matrix

 
        [ ∂G_1(m)/∂m_1  …  ∂G_1(m)/∂m_m ]
J(m) =  [       ⋮        …        ⋮      ] .   (10.6)
        [ ∂G_n(m)/∂m_1  …  ∂G_n(m)/∂m_m ]

Here n is the number of data points and m is the number of model parameters.
Under this linear approximation, the damped least squares problem (10.3) becomes

min ‖G(m) + J(m)Δm − d‖₂² + α² ‖L(m + Δm)‖₂²,   (10.7)

where the variable is Δm and m is a constant. In practice, it is easier to formulate this as a problem in which the variable is actually m + Δm,

min ‖J(m)(m + Δm) − d̂(m)‖₂² + α² ‖L(m + Δm)‖₂²,   (10.8)

where

d̂(m) = d − G(m) + J(m)m.   (10.9)

In (10.8), J(m) and d̂(m) are constant. This problem is a damped linear least squares problem which we learned how to solve when we studied Tikhonov regularization. The solution is given by

m + Δm = ( J(m)ᵀ J(m) + α² Lᵀ L )⁻¹ J(m)ᵀ d̂(m).   (10.10)

It is worth noting that this method is very similar to the Gauss–Newton

method applied to the damped least squares problem (10.3). As a result, it
suffers from the same sorts of problems with convergence that we saw earlier.
In particular, for poor initial guesses, the method often fails to converge to a
solution. Convergence could be improved by using line search to pick the best
solution along the line from the current model to the new model given by (10.10).
Convergence could also be improved by incorporating the L–M idea into the iteration.
We could just use this iteration to solve (10.3) for a number of values of
α, and then pick the largest value of α for which the corresponding model
satisfies the data constraint ‖G(m) − d‖₂ < δ. However, the authors of Occam's
inversion came up with an alternative approach which is much faster in practice.
At each iteration of the algorithm, we pick the largest value of α which keeps
the χ2 value of the solution within its bound. If this is not possible, we pick
the value of α that minimizes the χ2 value. Effectively, we adjust α throughout
the course of the algorithm so that in the end, we should have a value of α that
leads to a solution with χ² = δ².
We can now state the algorithm.

Algorithm 10.5 Occam’s inversion algorithm

Beginning with a solution m0 , use the formula
mk+1 = J(mk )T J(mk ) + α2 LT L J(mk )T d̂(mk ).

Pick the largest value of α such that χ2 (mk+1 ) ≤ δ 2 . If no such

value exists, then pick a value of α which minimizes χ2 (mk+1 ).
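One iteration of the algorithm might be sketched as follows (Python/NumPy; searching over a fixed list of candidate α values is our simplification, and practical codes locate the critical α more carefully):

```python
import numpy as np

def occam_step(G, jac, L, d, m, alphas, delta):
    """One iteration of Occam's inversion: evaluate the update (10.10)
    for each candidate alpha, keep the largest alpha whose model has
    chi^2 <= delta^2, and otherwise fall back to the model that
    minimizes chi^2."""
    J = jac(m)
    dhat = d - G(m) + J @ m                       # equation (10.9)
    feasible, best, best_chi2 = None, None, np.inf
    for alpha in sorted(alphas):                  # ascending order
        A = J.T @ J + alpha**2 * (L.T @ L)
        m_new = np.linalg.solve(A, J.T @ dhat)    # equation (10.10)
        chi2 = np.sum((G(m_new) - d) ** 2)        # sigma scaled to 1
        if chi2 <= delta**2:
            feasible = m_new       # largest feasible alpha seen so far
        if chi2 < best_chi2:
            best, best_chi2 = m_new, chi2
    return feasible if feasible is not None else best

# Hypothetical linear check: G(m) = m, so one step solves the problem.
G = lambda m: m
jac = lambda m: np.eye(2)
d = np.array([1.0, 1.0])
m1 = occam_step(G, jac, np.eye(2), d, np.zeros(2), [0.0, 1.0], 10.0)
```

For this linear toy problem with a loose δ, both candidate α values are feasible, so the step returns the more regularized of the two models.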

Example 10.1
In this example we will consider the problem of estimating subsurface
electrical conductivities from above ground EM induction measure-
ments. The instrument used in this example is the Geonics EM–38
ground conductivity meter. A description of the instrument and the
mathematical model of the response of the instrument can be found
in [HBR+ 02]. The forward model is quite complicated, but we will
treat it as a black box model, and instead concentrate on the inverse problem.
Measurements are taken at heights of 0, 10, 20, 30, 40, 50, 75, 100,
and 150 centimeters above the ground, with the coils oriented in

Height (cm) EMV (mS/m) EMH (mS/m)

0 134.5 117.4
10 129.0 97.7
20 120.5 81.7
30 110.5 69.2
40 100.5 59.6
50 90.8 51.8
75 70.9 38.2
100 56.8 29.8
150 38.5 19.9

Table 10.1: Data for the EM-38 example.

both the vertical and horizontal orientations. There are a total of

18 observations. The data are shown in table 10.1. We will assume
measurement standard deviations of 0.1 mS/m.
We will discretize the subsurface electrical conductivity profile into
10 layers each 20 cm thick, with a semi–infinite layer below two
meters. Thus we have 11 parameters to estimate.
The forward model G(m) is available to us in the form of a subrou-
tine for computing the forward model predictions. Since we do not
have simple formulas for G(m), we cannot simply write down for-
mulas for the Jacobian matrix. However, we can use finite difference
approximations to estimate J(m).
We first tried using the L–M method to estimate the model param-
eters. After 50 iterations, the L–M method produced a model which
is shown in Figure 10.1. The χ2 value for this model is 9.62, with 9
degrees of freedom, so the model actually fits the data adequately.
However, the least squares problem is very badly conditioned. The
condition number of JᵀJ is approximately 7.6 × 10¹⁷. Furthermore,
this model is unrealistic because it includes negative electrical con-
ductivities. The model exhibits the kinds of high frequency oscilla-
tions that we have come to expect of under-regularized solutions to inverse problems. Clearly, we need to find a regularized solution.
We next tried Occam’s inversion, with second order regularization
and δ = 0.4243. The resulting model is shown in Figure 10.2. Figure
10.3 shows the true model. The Occam’s inversion solution is a fairly
good reproduction of the original model.




Figure 10.1: L–M solution (conductivity in mS/m versus depth in m).



Figure 10.2: Occam's inversion solution (conductivity in mS/m versus depth in m).




Figure 10.3: Target model (conductivity in mS/m versus depth in m).


10.3 Exercises
1. Recall example 1.3, in which we had gravity anomaly observations above
a density perturbation of variable depth m(x), and fixed density ∆ρ. In
this exercise, you will use Occam’s inversion to solve an instance of this
inverse problem.
Consider a gravity perturbation along a one kilometer section, with observations taken every 50 meters, and a density perturbation of 200 kg/m³ (0.2 g/cm³). The perturbation is expected to be at a depth of roughly 200 meters.
The data file gravprob.mat contains a vector x of observation locations.
Use the same coordinates for your discretization of the model. The vector
obs contains the actual observations. Assume that the observations are
accurate to about 1.0 × 10−12 .

(a) Derive a formula for the entries in the Jacobian matrix.

(b) Write MATLAB routines to compute the forward model predictions
and the Jacobian matrix for this problem.
(c) Use the supplied implementation of Occam’s inversion to solve the
inverse problem.
(d) Discuss your results. What features in the inverse solution appear
to be real? What is the resolution of your solution? Were there any
difficulties with local minimum points?
(e) What would happen if the density perturbation was instead at about
1,000 meters depth?
Chapter 11

Bayesian Methods

So far, we have followed a classical least squares/regularization approach to

solving inverse problems. There is an alternative approach which is based on
very different ideas about the fundamental nature of inverse problems. In this
chapter, we will review the classical approach to inverse problems, introduce the
Bayesian approach, and then compare the two methods.

11.1 Review of the Classical Approach

In the classical approach, we begin with a mathematical model of the form
Gm = d in the linear case or G(m) = d in the nonlinear case. We assume that
there is a true model mtrue , and a true data set dtrue such that Gmtrue −dtrue =
0. We are given an actual data set d, which is dtrue plus measurement noise.
Our goal is to recover mtrue from this noisy data. Of course, without perfect
data, we cannot hope to exactly recover mtrue , but we can hope to find a good
approximate solution.
An approximate solution can be found by minimizing kGm − dk2 to obtain
a least squares solution, mL2 . Intuitively, the least squares solution is the
solution that best fits the data. We also saw that under the assumption that the
data errors are independent and normally distributed, the maximum likelihood
principle leads to the same least squares solution.
Since there is noise in the data, we should expect some misfit: the sum of squared errors will not typically be zero. We saw that the χ² distribution can be used to set a reasonable bound δ on ‖Gm − d‖₂. This leads to a set
of solutions which fit the data “well enough”. We were also able to compute a
covariance matrix for the estimated parameters

Cov(m) = (GᵀG)⁻¹ Gᵀ Cov(d) G (GᵀG)⁻¹.   (11.1)

We then used Cov(m) to compute confidence intervals for the estimated parameters.

This approach worked quite well for linear regression problems in which the
least squares problem is well conditioned. However, we found that in many
cases the least squares problem is not well conditioned. The set of solutions
that adequately fits the data is huge, and contains many solutions which are
completely unreasonable.
We discussed a number of approaches to regularizing the least squares prob-
lem. All of these approaches pick one “best” solution out of the set of solutions
that adequately fit the data. The different regularization methods differ in what
constitutes the best solution. For example, Tikhonov regularization picks the model that minimizes ‖m‖₂ subject to the constraint ‖Gm − d‖₂ < δ, while higher order Tikhonov regularization selects the model that minimizes ‖Lm‖₂ subject to ‖Gm − d‖₂ < δ.
The regularization approach can be applied to both linear and nonlinear
problems. For relatively small linear problems the computation of the regu-
larized solution is generally done with the help of the SVD. This process is
straightforward and extremely robust. For large sparse linear problems, iterative
methods such as CGLS can be used. For nonlinear problems, things are
more complicated. We saw that the Gauss–Newton and Levenberg–Marquardt
methods could be used to find a local minimum of the least squares problem.
However, these nonlinear least squares problems often have many
local minima. Finding the global minimum is challenging. We discussed
the multistart approach, in which we start the L–M method from a
number of random starting points. Other global optimization techniques for
solving the nonlinear least squares problem include simulated annealing and
genetic algorithms.
How can we justify selecting one solution from the set of models which
adequately fit the data? One justification is Occam’s razor. Occam’s razor is
the principle that when we have several different theories to consider, we should
select the simplest theory. The solutions selected by regularization are in some
sense the simplest models which fit the data. Any feature seen in the regularized
solution must be required by the data. If fitting the data did not require a feature
seen in the regularized solution, then that feature would have been smoothed
out by the regularization term. This answer is not entirely satisfactory.
It is also worth recalling that once we have regularized a least squares prob-
lem, we lose the ability to obtain statistically valid confidence intervals for the
parameters. The problem is that by regularizing the problem we bias the so-
lution. In particular, this means that the expected value of the regularized
solution is not the true solution.

11.2 The Bayesian Approach

The Bayesian approach to solving inverse problems is based on completely dif-
ferent ideas. However, as we will soon see, it often results in similar solutions.
The most fundamental difference between the classical approach and the
Bayesian approach is in the nature of the answer. In the classical approach, there
is a specific (but unknown) model mtrue that we would like to discover. In the
Bayesian approach, the model is a random variable. Our goal is to compute a
probability distribution for the model. Once we have a probability distribution
for the model, we can use the distribution to answer questions about the model,
such as “What is the probability that m5 is less than 0?” An important advantage
of the Bayesian approach is that it explicitly addresses the uncertainty in
the model, while the classical approach provides only a single solution.
A second very important difference between the classical and Bayesian ap-
proaches is that in the Bayesian approach we can incorporate additional in-
formation about the solution which comes from other data sets or our own
intuition. This prior information is expressed in the form of a probability distri-
bution for m. This prior distribution may incorporate the user’s subjective
judgment, or it may incorporate information from other experiments. If no
other information is available, then under the principle of indifference, we
may pick a prior distribution in which each possible model parameter has equal
likelihood. Such a prior distribution is said to be uninformative.
One of the main objections to the Bayesian approach is that the method is
“unscientific” because it allows the analyst to incorporate subjective judgments
into the model which are not based on the data alone. Bayesians reply that
the analyst is free to pick an uninformative prior distribution. Furthermore, it
is possible to complete the Bayesian analysis with a variety of prior distributions
and examine the effects of different prior distributions on the posterior
distribution.
It should be pointed out that if the parameters m are contained in the range
(−∞, ∞), then the uninformative prior is not a proper probability distribution.
The problem is that there does not exist a probability distribution p(m) such that

∫_{−∞}^{∞} p(m) dm = 1    (11.2)

and p(m) is constant. In practice, the use of this improper prior distribution can
be justified, because the posterior distribution for m is a proper distribution.
We will use the notation p(m) for the prior distribution. We also assume
that using the forward model we can compute the probability that given a
particular model, a particular data value will be observed. This is a conditional
probability distribution. We will use the notation f(d|m) for this conditional
probability distribution. Of course we know the data and are attempting to
estimate the model. Thus we are interested in the conditional distribution of
the model parameter(s) given the data. We will use the notation q(m|d) for
this posterior probability distribution.
Bayes’ theorem relates these distributions in a way that makes it possible
for us to compute what we want. Recall Bayes’ theorem:

Theorem 11.1

q(m|d) = f(d|m) p(m) / ∫ f(d|m) p(m) dm.    (11.3)

Notice that the denominator in this formula is simply a normalizing constant
which is used to ensure that the integral of the conditional distribution q is one.
Since the normalization constant is not always needed, this formula is sometimes
written as

q(m|d) ∝ f(d|m) p(m).    (11.4)
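Discretizing (11.3) and (11.4) on a grid makes the normalizing denominator concrete. A hedged Python sketch, not from the text (the grid limits, resolution, and observed value are arbitrary choices for illustration):

```python
import numpy as np

# Discretized Bayes' theorem for a single parameter m:
# posterior ∝ likelihood × prior, normalized so it integrates to one.
m_grid = np.linspace(0.0, 20.0, 2001)
dm = m_grid[1] - m_grid[0]

prior = np.ones_like(m_grid)          # flat (uninformative) prior
prior /= prior.sum() * dm             # normalize to a density

d_obs, sigma = 10.3, 1.0              # one observation, N(m, sigma^2) noise
likelihood = np.exp(-0.5 * ((d_obs - m_grid) / sigma) ** 2)

posterior = likelihood * prior
posterior /= posterior.sum() * dm     # the denominator of (11.3)

m_map = m_grid[np.argmax(posterior)]  # MAP estimate
print(m_map)  # close to 10.3
```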
The posterior distribution q(m|d) does not provide a single model that we
can consider the “answer.” In cases where we want to single out one model as the
answer, it is appropriate to use the model with the largest value of q(m|d). This
is the so-called maximum a posteriori (MAP) model. An alternative would be
to use the mean of the posterior distribution. In situations where the posterior
distribution is symmetric, the MAP model and the posterior mean model are
the same.
In general, the computation of a posterior distribution can be a complicated
process. The difficulty is in evaluating the integrals in (11.3). These are often
integrals in very high dimensions, for which numerical integration techniques are
computationally expensive. One important approach to such problems is the
Markov Chain Monte Carlo (MCMC) method [GRS96]. Fortunately, there are
a number of special cases in which the computation of the posterior distribution
is greatly simplified.
One simplification occurs when the prior distribution p(m) is constant. In
this case, we have
q(m|d) ∝ f (d|m). (11.5)
In this context, the conditional distribution of d given m is known as the
likelihood of m, written L(m). Under the maximum likelihood principle we
select the model mML that maximizes L(m). This is exactly the MAP model.

A further simplification occurs when the noise in the measured data is in-
dependent and normally distributed with standard deviation σ. Because the
measurement errors are independent, we can write the likelihood function as
the product of the likelihoods of the individual data points:

L(m) = f(d|m) = f(d1|m) f(d2|m) · · · f(dn|m).    (11.6)

Since the individual data points di are normally distributed with mean (G(m))_i
and standard deviation σ, we can write f(di|m), up to a constant normalization
factor, as

f(di|m) ∝ e^{−(di − (G(m))_i)^2/(2σ^2)}.    (11.7)

Thus the likelihood of m is

L(m) ∝ e^{−Σ_{i=1}^{n} (di − (G(m))_i)^2/(2σ^2)}.    (11.8)

We can maximize the likelihood by maximizing the exponent or, equivalently,
minimizing its negative:

min Σ_{i=1}^{n} (di − (G(m))_i)^2/(2σ^2).    (11.9)

Except for the constant factor of 1/(2σ^2), this is precisely the least squares
problem min ||G(m) − d||_2^2. Thus we have shown that the Bayesian approach leads to
the least squares solution when we have independent and normally distributed
measurement errors and we use a flat prior distribution.

Example 11.1
Consider the following very simple inverse problem. We have an
object of unknown mass m. We have a scale that can be used to
weigh the object. However, the measurement errors are normally
distributed with mean 0 and standard deviation σ = 1 kg. With
this error model, we have
f(d|m) = (1/√(2π)) e^{−(d−m)^2/2}.    (11.10)

We weigh the object, and get a measurement of 10.3 kg. What do we

know about the mass of the object? We will consider what happens
when we use a flat prior probability distribution. In this case, since
p(m) is constant,
q(m|d) ∝ f (d|m). (11.11)

q(m|d = 10.3 kg) ∝ (1/√(2π)) e^{−(10.3−m)^2/2}.    (11.12)

In fact, if we integrate the right hand side of this last equation from
m = −∞ to m = +∞, we find the integral is one, so the constant
of proportionality is one, and

q(m|d = 10.3 kg) = (1/√(2π)) e^{−(10.3−m)^2/2}.    (11.13)

This posterior distribution is shown in Figure 11.1.
Next, suppose that we obtain a second measurement of 10.1 kg.
Now, we use the distribution (11.13) as our prior distribution and
compute a new posterior distribution.

q(m|d1 = 10.3 kg, d2 = 10.1 kg) ∝ f(d2 = 10.1 kg|m) q(m|d1 = 10.3 kg)

q(m|d1 = 10.3 kg, d2 = 10.1 kg) ∝ (1/√(2π)) e^{−(10.1−m)^2/2} (1/√(2π)) e^{−(10.3−m)^2/2}.

We can multiply the exponentials by adding exponents. We can also
absorb the factors of 1/√(2π) into the constant of proportionality:

q(m|d1 = 10.3 kg, d2 = 10.1 kg) ∝ e^{−((10.3−m)^2 + (10.1−m)^2)/2}.

Figure 11.1: Posterior distribution q(m|d = 10.3 kg), flat prior.

Finally, we can simplify (10.3 − m)^2 + (10.1 − m)^2 by combining
terms and completing the square:

(10.3 − m)^2 + (10.1 − m)^2 = 2(10.2 − m)^2 + 0.02    (11.17)

q(m|d1 = 10.3 kg, d2 = 10.1 kg) ∝ e^{−(2(10.2−m)^2 + 0.02)/2}.    (11.18)

The constant factor e^{−0.02/2} can also be absorbed into the constant
of proportionality. We are left with

q(m|d1 = 10.3 kg, d2 = 10.1 kg) ∝ e^{−(10.2−m)^2}.    (11.19)

After normalization, this can be written as

q(m|d1 = 10.3 kg, d2 = 10.1 kg) = (1/(√(2π)(1/√2))) e^{−(10.2−m)^2/(2(1/√2)^2)}.    (11.20)

This last distribution is a normal distribution with mean 10.2 kg
and a standard deviation of 1/√2 kg. The distribution is shown in
Figure 11.2.
It is remarkable that when we start with a normal prior distribution
and take into account normally distributed data,
we obtain a normal posterior distribution. The property that the
posterior distribution has the same form as the prior distribution is
called conjugacy. There are other families of conjugate distribu-
tions, but in general this is a very rare property [GCSR03].
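The posterior of Example 11.1 can be checked numerically by evaluating the unnormalized posterior on a grid. An illustrative Python sketch, not from the text (the grid is an arbitrary choice):

```python
import numpy as np

# Numerical check of Example 11.1: two measurements with sigma = 1 and a
# flat prior give a normal posterior with mean 10.2 and std 1/sqrt(2).
m = np.linspace(0.0, 20.0, 200001)
dm = m[1] - m[0]

post = np.exp(-0.5 * (10.3 - m) ** 2) * np.exp(-0.5 * (10.1 - m) ** 2)
post /= post.sum() * dm               # normalize to a density

mean = (m * post).sum() * dm
std = np.sqrt(((m - mean) ** 2 * post).sum() * dm)
print(mean, std)  # approximately 10.2 and 0.7071
```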

Figure 11.2: Posterior distribution q(m|d1 = 10.3 kg, d2 = 10.1 kg), flat prior.

11.3 The Multivariate Normal Case

The idea that a normal prior distribution leads to a normal posterior distribution
can be extended to situations in which there are many model parameters. In
situations where we have a linear model Gm = d, the data errors have a multi-
variate normal distribution, and the prior distribution for the model parameters
is also multivariate normal, the computation of the posterior distribution is rela-
tively simple. Tarantola [Tar87] develops the solution to the Bayesian inversion
problem under these conditions.
Let dobs be the observed data, and let CD be the covariance matrix for
the data. Let mprior be the mean of the prior distribution and let CM be the
covariance matrix for the prior distribution. The prior distribution is

p(m) ∝ e^{−(1/2)(m − mprior)^T C_M^{-1} (m − mprior)}.    (11.21)

The conditional distribution of the data given m is

f(d|m) ∝ e^{−(1/2)(Gm − dobs)^T C_D^{-1} (Gm − dobs)}.    (11.22)

By (11.4), the posterior distribution is

q(m|d) ∝ e^{−(1/2)((Gm − dobs)^T C_D^{-1} (Gm − dobs) + (m − mprior)^T C_M^{-1} (m − mprior))}.    (11.23)

Tarantola shows that this can be simplified to

q(m|d) ∝ e^{−(1/2)(m − mMAP)^T C_M'^{-1} (m − mMAP)}    (11.24)

where mMAP is the MAP solution, and

C_M' = (G^T C_D^{-1} G + C_M^{-1})^{-1}.    (11.25)

The MAP solution can be found by minimizing the exponent in (11.23):

min (Gm − dobs)^T C_D^{-1} (Gm − dobs) + (m − mprior)^T C_M^{-1} (m − mprior).    (11.26)

The key to minimizing this expression is to rewrite it in terms of the matrix
square roots of C_M^{-1} and C_D^{-1}. Note that every symmetric and positive
definite matrix has a matrix square root. The matrix square root can be computed
from the SVD of the matrix. The MATLAB command sqrtm can also be used to find
the square root of a matrix.

min (C_D^{-1/2}(Gm − dobs))^T (C_D^{-1/2}(Gm − dobs)) + (C_M^{-1/2}(m − mprior))^T (C_M^{-1/2}(m − mprior)).    (11.27)

min ||C_D^{-1/2}(Gm − dobs)||_2^2 + ||C_M^{-1/2}(m − mprior)||_2^2.    (11.28)

min || [ C_D^{-1/2} G ; C_M^{-1/2} ] m − [ C_D^{-1/2} dobs ; C_M^{-1/2} mprior ] ||_2^2.    (11.29)

This is a standard linear least squares problem.
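The stacked system (11.29) is straightforward to set up and solve numerically. A Python sketch under simplifying assumptions (diagonal covariances, and a small random G and true model invented purely for illustration):

```python
import numpy as np

# Solving (11.29) as a stacked least squares problem. For the diagonal case
# the matrix square roots C_D^{-1/2} and C_M^{-1/2} are just (1/sigma) I and
# (1/alpha) I.
rng = np.random.default_rng(1)
G = rng.normal(size=(8, 3))
m_true = np.array([1.0, -2.0, 0.5])
sigma, alpha = 0.1, 1.0
d_obs = G @ m_true + sigma * rng.normal(size=8)
m_prior = np.zeros(3)

CD_inv_half = (1.0 / sigma) * np.eye(8)   # C_D^{-1/2} for C_D = sigma^2 I
CM_inv_half = (1.0 / alpha) * np.eye(3)   # C_M^{-1/2} for C_M = alpha^2 I

# Stack the weighted forward operator on top of the weighted prior term.
A = np.vstack([CD_inv_half @ G, CM_inv_half])
b = np.concatenate([CD_inv_half @ d_obs, CM_inv_half @ m_prior])
m_map = np.linalg.lstsq(A, b, rcond=None)[0]

# Equivalently, the normal equations of (11.26):
# (G^T C_D^{-1} G + C_M^{-1}) m = G^T C_D^{-1} d_obs + C_M^{-1} m_prior.
lhs = G.T @ G / sigma**2 + np.eye(3) / alpha**2
rhs = G.T @ d_obs / sigma**2 + m_prior / alpha**2
print(m_map, np.linalg.solve(lhs, rhs))
```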

It is worthwhile to consider what happens to the posterior distribution in the
most extreme case, in which the prior distribution provides essentially no
information. If the prior distribution has a covariance matrix C_M = σ^2 I, where σ is
extremely large, then C_M^{-1} will be extremely small, and the posterior covariance
matrix will be essentially

C_M' ≈ (G^T C_D^{-1} G)^{-1}.    (11.30)

For convenience, let us assume that the data are independent with variance one. Then

C_M' ≈ (G^T I G)^{-1}    (11.31)

C_M' ≈ (G^T G)^{-1}.    (11.32)

This is precisely the covariance matrix for the model parameters in (11.1). Furthermore,
when we solve (11.29) to obtain the MAP solution, we find that
(11.29) simplifies to the least squares problem min ||Gm − d||_2^2. Thus under the
typical assumption of independent data errors, a very broad prior distribution
leads essentially to the unregularized least squares solution!
It is also worthwhile to consider what happens in the special case where C_D =
σ^2 I and C_M = α^2 I. In this special case, (11.29) simplifies to

min ||(1/σ)(Gm − dobs)||_2^2 + ||(1/α)(m − mprior)||_2^2.    (11.33)

This is simply a case of zeroth order Tikhonov regularization!

Once we have obtained the posterior distribution, it is also possible to generate
random models according to the posterior distribution. We begin by
computing the Cholesky factorization of C_M':

C_M' = R^T R.    (11.34)

This can be done easily using the chol command in MATLAB. We then generate
a vector s of normally distributed random numbers with mean zero and standard
deviation one. This can be done using the randn command in MATLAB. The
covariance matrix of s is the identity matrix. Finally, we generate our random
solution with
m = RT s + mM AP . (11.35)
The expected value of m is

E[m] = E[RT s + mM AP ]. (11.36)

E[m] = RT E[s] + mM AP . (11.37)

Since E[s] = 0,
E[m] = mM AP . (11.38)
We will also compute Cov(m) to show that it is C_M'.

Cov(m) = Cov(RT s + mM AP ). (11.39)

Since mM AP is a constant,

Cov(m) = Cov(RT s). (11.40)

From Appendix A, we know how to find the covariance of a matrix times a random vector:

Cov(m) = R^T Cov(s) R.    (11.41)

Cov(m) = RT IR. (11.42)

Cov(m) = RT R. (11.43)

Cov(m) = C_M'.    (11.44)
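The sampling recipe (11.34)–(11.35) can be sketched as follows. The book uses MATLAB's chol and randn; this illustrative Python version uses numpy.linalg.cholesky and a random generator, with a made-up 2-by-2 posterior covariance and MAP point (not from the text):

```python
import numpy as np

# Generating random models from the posterior N(m_MAP, C_M') using the
# Cholesky factorization C_M' = R^T R, as in (11.34)-(11.35).
rng = np.random.default_rng(2)
m_map = np.array([1.0, 2.0])
CMp = np.array([[0.5, 0.2],
                [0.2, 0.4]])          # posterior covariance (must be SPD)

L = np.linalg.cholesky(CMp)           # CMp = L L^T, so R = L^T
R = L.T

# Each column of S is an independent N(0, I) vector s; m = R^T s + m_MAP.
n = 200_000
S = rng.normal(size=(2, n))
samples = (R.T @ S).T + m_map

print(samples.mean(axis=0))           # close to m_MAP
print(np.cov(samples.T))              # close to C_M'
```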

Figure 11.3: The MAP solution and the target model.

Example 11.2

As an example of this approach, we will solve the Shaw problem with

the smooth target model. We have previously solved this problem
using Tikhonov regularization.
For our first solution, we will use a multivariate normal prior dis-
tribution with mean one and standard deviation one for each model
parameter, with independent model parameters. Thus CM = I and
mprior is a vector of ones. We used MATLAB to set up and solve
(11.29) and obtained the m_MAP solution shown in Figure 11.3.
Figure 11.4 shows this same solution with error bars. These error bars
were computed by adding and subtracting 1.96 times the posterior
standard deviation from the MAP solution.
Figure 11.5 shows a total of twenty solutions which were randomly
generated according to the posterior distribution. From these so-
lutions, it is clear that there is a high probability of a peak in the
model near θ = 1. Another peak near θ = −1 is also evident.
Next, we considered a broader prior distribution. We used a prior
mean of one, but used variances of 100 in the prior distribution of
the model parameters instead of variances of one. Figure 11.6 shows
the MAP model for this case, together with its error bars. Notice
that this solution is not nearly as good as the model in Figure 11.3. With
a very unrestrictive prior, we have depended mostly on the available
data, which simply does not pin down the solution.







Figure 11.4: The MAP solution with error bars.


Figure 11.5: Twenty randomly generated solutions.


Figure 11.6: The MAP solution with a broader prior distribution.

This illustrates a major problem with applying the Bayesian
approach to poorly conditioned problems. In order to get a reasonably
tight posterior distribution, we have to make very strong assumptions
in the prior distribution. If these assumptions are not
warranted, then we cannot obtain a good solution to the inverse
problem.
11.4 Maximum Entropy Methods

We have already seen that an important issue in the Bayesian approach is the
selection of an appropriate prior distribution. In this section we will consider
maximum entropy methods which can be used to select a prior distribution
subject to available information such as bounds on the parameter, the mean of
the parameter, and so forth.

Definition 11.1

The entropy of a discrete probability distribution

P(X = xi) = pi

is given by

H(X) = −Σ_i pi ln pi.

The entropy of a continuous probability distribution

P(X ≤ a) = ∫_{−∞}^{a} f(x) dx

is given by

H(X) = −∫_{−∞}^{∞} f(x) ln f(x) dx.
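The discrete entropy definition can be computed directly. A small Python sketch, not from the text (the example distributions are invented for illustration), showing that the uniform distribution on six outcomes attains the maximum entropy ln 6:

```python
import numpy as np

def entropy(p):
    """Entropy H = -sum p_i ln p_i of a discrete distribution (0 ln 0 = 0)."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

uniform = np.full(6, 1.0 / 6.0)       # a fair die
skewed = np.array([0.5, 0.1, 0.1, 0.1, 0.1, 0.1])

print(entropy(uniform))               # ln 6, the maximum for six outcomes
print(entropy(skewed))                # smaller than ln 6
```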

Under the maximum entropy principle, we select as our prior distribution

a distribution which has the largest possible entropy subject to any constraints
imposed by available information about the distribution.
For continuous random variables, the optimization problems resulting from
the maximum entropy principle involve an unknown density function f (x). Such
problems can be solved by techniques from the calculus of variations. Fortu-
nately, maximum entropy distributions for a number of important cases have
already been worked out. See Kapur and Kesavan [KK92] for a collection of
results on maximum entropy distributions.

Example 11.3
Suppose that we know that X takes on only nonnegative values, and
that the mean value of X is µ. It can be shown using the calculus of
variations that the maximum entropy distribution is an exponential
distribution [KK92]. The maximum entropy distribution is

fX(x) = (1/µ) e^{−x/µ},  x ≥ 0.
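As a numerical check (not from the text), the differential entropy of this exponential distribution can be integrated on a grid and compared with the closed form H = 1 + ln µ:

```python
import numpy as np

# Differential entropy of f(x) = (1/mu) e^{-x/mu}, computed as a Riemann
# sum and compared with the closed form H = 1 + ln(mu). The grid length
# and resolution are arbitrary illustrative choices.
mu = 2.0
x = np.linspace(0.0, 200.0, 200_001)
dx = x[1] - x[0]
f = np.exp(-x / mu) / mu
H = -np.sum(f * np.log(f)) * dx
print(H, 1 + np.log(mu))
```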

Definition 11.2
Given a discrete probability distribution
P (X = xi ) = pi
and an alternative distribution
P (X = xi ) = qi ,
the Kullback–Leibler cross entropy is given by

D(p, q) = Σ_i pi ln(pi/qi).

Given continuous distributions f(x) and g(x), the cross entropy of
the distributions is

D(f, g) = ∫_{−∞}^{∞} f(x) ln(f(x)/g(x)) dx.
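The cross entropy definition translates directly into code. A Python sketch with invented example distributions (not from the text):

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler cross entropy D(p, q) = sum p_i ln(p_i / q_i)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / q[nz]))

p = np.array([0.5, 0.3, 0.2])
q = np.array([1 / 3, 1 / 3, 1 / 3])

print(kl(p, p))  # identical distributions: 0
print(kl(p, q))  # > 0; note that kl(p, q) != kl(q, p) in general
```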

The cross entropy is a measure of how close two distributions are. Notice that if
the two distributions are identical, then the cross entropy is 0. It can be shown
that in all other cases, the cross entropy is greater than zero.
Under the minimum cross entropy principle, if we are given a prior
distribution q and some additional constraints on the distribution, we should select
a posterior distribution p which minimizes the cross entropy of p and q subject
to the constraints.
For example, in the Minimum Relative Entropy (MRE) method of
Woodbury and Ulrych [UBL90, WU96], the minimum cross entropy principle
is applied to linear inverse problems of the form Gm = d subject to lower
and upper bounds on the model elements. First, a maximum entropy prior
distribution is computed using the lower and upper bounds and a given prior
mean. Then, a posterior distribution is selected to minimize the cross entropy
subject to the constraint that the posterior mean m must satisfy the
equations Gm = d.

11.5 Exercises
1. Consider the following coin tossing experiment. We repeatedly toss a coin,
and each time record whether it comes up heads (0), or tails (1). The bias
b of the coin is the probability that it comes up heads. We do not have
reason to believe that this is a fair coin, so we will not assume that b = 1/2.
Instead, we will begin with a flat prior distribution p(b) = 1, for 0 ≤ b ≤ 1.

(a) What is f (d|b)? Note that the only possible data are 0 and 1, so this
distribution will involve delta functions at d = 0, and d = 1.
(b) Suppose that on our first flip, the coin comes up heads. Compute
the posterior distribution q(b|d1 = 0).
(c) The second, third, fourth, and fifth flips all come up tails. Find
the posterior distribution q(b|d1 = 0, d2 = 1, d3 = 1, d4 = 1, d5 = 1).
Plot the posterior distribution.
(d) What is your MAP estimate of the bias?
(e) Now, suppose that you initially felt that the coin was at least close
to fair, with
p(b) ∝ e^{−10(b−0.5)^2},  0 ≤ b ≤ 1.
Repeat the analysis of the five coin flips.

2. Apply the Bayesian method discussed in this chapter to problem 2 in

Chapter 5. Select what you consider to be a reasonable prior. How sensi-
tive is your solution to the prior mean and covariance?

3. Consider a conventional six sided die, with faces numbered 1, 2, 3, 4, 5,

and 6. If each side were equally likely to come up, the mean value would
be 7/2. Suppose instead that the mean is 9/2. Formulate an optimization

problem that could be solved to find the maximum entropy distribution

subject to the constraint that the mean is 9/2. Use the method of Lagrange
multipliers to solve this optimization problem and obtain the maximum
entropy distribution.
4. Let X be a discrete random variable that takes on the values 1, 2, . . ., n.
Suppose that E[X] = µ is given. Find the maximum entropy distribution
for X.

11.6 Notes and Further Reading

Sivia’s book [Siv96] is a good general introduction to Bayesian ideas for scien-
tists and engineers. The textbook by Gelman et al. [GCSR03] provides a more
comprehensive introduction to Bayesian statistics. Tarantola [Tar87] is the stan-
dard reference work on Bayesian methods for inverse problems. The paper of
Gouveia and Scales [GS97] discusses the relative advantages and disadvantages
of Bayesian and classical methods for inverse problems. The draft textbook by
Scales and Smith [SS97] takes a Bayesian approach to inverse problems. Sivia’s
book includes a brief introduction to the maximum entropy principle [Siv96]. A
complete survey of maximum entropy methods can be found in [KK92].

Appendix A

Review of Linear Algebra

The topics discussed in this appendix have been selected because they are log-
ically necessary to the development of the rest of the textbook. This is by no
means a complete review of linear algebra. It would take an entire textbook
to completely cover the material in this appendix. Instead, the purpose of this
appendix is to summarize some important concepts, definitions, and theorems
for linear algebra that will be used throughout the book.

A.1 Systems of Linear Equations

Recall that a system of linear equations can be solved by the process of Gaussian elimination.

Example A.1
Consider the system of equations:
x +2y +3z = 14
x +2y +2z = 11 .
x +3y +4z = 19

We can eliminate x from the second and third equations by subtracting
the first equation from the second and third equations to obtain:
x +2y +3z = 14
−z = −3 .
y +z = 5
We would like to get y into the second equation, so we simply inter-
change the second and third equations:
x +2y +3z = 14
y +z = 5 .
−z = −3


Next, we can eliminate y from the first equation by subtracting two
times the second equation from the first equation.

x +z = 4
y +z = 5 .
−z = −3

Next, we multiply the third equation by −1 to get an equation for z:

x +z = 4
y +z = 5 .
z = 3
Finally, we eliminate z from the first two equations.

x = 1
y = 2 .
z = 3

The solution to the original system of equations is x = 1, y = 2,

z = 3. Geometrically, this system of equations describes three planes
(given by the three equations) which intersect in a single point.

In solving this system of equations, we made use of three fundamental oper-

ations: adding a multiple of one equation to another equation, multiplying an
equation by a nonzero constant, and swapping two equations. This process can
be extended to solve systems of equations with three, four, or more variables.
Relatively small systems can be solved by hand, while systems with thousands
or even millions of variables can be solved by computer.
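Such computer solutions are a one-liner in most linear algebra libraries. For instance, a Python/NumPy sketch (the book itself uses MATLAB; this is purely illustrative) solving the system of Example A.1:

```python
import numpy as np

# The 3-by-3 system from Example A.1, solved by computer. numpy's solve
# uses an LU factorization, i.e. Gaussian elimination with pivoting.
A = np.array([[1.0, 2.0, 3.0],
              [1.0, 2.0, 2.0],
              [1.0, 3.0, 4.0]])
b = np.array([14.0, 11.0, 19.0])
x = np.linalg.solve(A, b)
print(x)  # the solution x = 1, y = 2, z = 3
```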
In performing the elimination process, the actual names of the variables are
insignificant. We could have renamed the variables in the above example to
a, b, and c without changing the solution in any significant way. Since the
actual names of the variables are insignificant, we can save time by writing
down the significant coefficients from the system of equations in a table called
an augmented matrix. This augmented matrix form is also useful in solving
a system of equations by computer. The elements of the augmented matrix are
simply stored in an array.
In augmented matrix form, our example system becomes:
 
1 2 3 14
 1 2 2 11  .
1 3 4 19

In this notation, our elementary row operations are adding a multiple

of one row to another row, multiplying a row by a nonzero constant, and in-
terchanging two rows. The elimination process is essentially identical to the

process used in the previous example. In the example, the final version of the
augmented matrix is
 
1 0 0 1
 0 1 0 2 .
0 0 1 3

Definition A.1

A matrix is said to be in reduced row echelon form (RREF)

if it has the following properties:

1. The first nonzero element in each row is a one. These elements
of the matrix are called pivot elements. A column in which
a pivot element appears is called a pivot column.

2. Except for the pivot element, all entries in pivot columns are zero.

3. If the matrix has any rows of zeros, these zero rows are at the
bottom of the matrix.

In solving a system of equations in augmented matrix form, we use ele-

mentary row operations to reduce the augmented matrix to RREF and then
convert back to conventional notation to read off the solutions. The process
of transforming a matrix into RREF can easily be automated. Most graphing
calculators have built in functions for computing the RREF of a matrix. In
MATLAB, the rref command computes the RREF of a matrix. In addition to
solving equations, the RREF can be used for many other calculations in linear
It can be shown that any linear system of equations has either no solutions,
exactly one solution, or infinitely many solutions. For example, in two dimen-
sions, the lines represented by the equations can fail to intersect (no solution),
intersect at a point (one solution) or intersect in a line (many solutions.) In the
following examples, we show how it is easy to determine the number of solutions
from the RREF of the augmented matrix.

Example A.2

Consider the system of equations:

x +y = 1
x +y = 2

This system of equations describes two parallel lines. The augmented

matrix is  
1 1 1
1 1 2

The RREF of this augmented matrix is

[ 1 1 0 ]
[ 0 0 1 ]
In equation form, this is
x +y = 0
0 = 1
There are obviously no solutions to 0 = 1, so the original system of
equations has no solutions.

Example A.3
In this example, we will consider a system of two equations in three
variables which has many solutions. Our system of equations is:
x1 +x2 +x3 = 0
. (A.1)
x1 +2x2 +2x3 = 0
We put this system of equations into augmented matrix form and
then find the RREF  
1 0 0 0
0 1 1 0
We can translate this back into equation form as

x1 = 0
x2 + x3 = 0

Clearly, x1 must be 0 in any solution to the system of equations.
However, x2 and x3 are not fixed. We can treat x3 as a free variable
and allow it to take on any value. However, whatever value x3 takes
on, x2 must be equal to −x3. Geometrically, this system of equations
describes the intersection of two planes. The intersection of the two
planes consists of the points on the line x2 = −x3 in the x1 = 0 plane.
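The reduction to RREF can be automated in a few lines. The book points to MATLAB's rref; the following Python sketch (not from the text, with partial pivoting added for numerical stability) reproduces the reductions of Examples A.2 and A.3:

```python
import numpy as np

def rref(M, tol=1e-12):
    """Reduce a matrix to reduced row echelon form by elementary row ops."""
    A = np.array(M, dtype=float)
    rows, cols = A.shape
    r = 0
    for c in range(cols):
        if r == rows:
            break
        pivot = r + np.argmax(np.abs(A[r:, c]))   # partial pivoting
        if abs(A[pivot, c]) < tol:
            continue                              # no pivot in this column
        A[[r, pivot]] = A[[pivot, r]]             # swap two rows
        A[r] /= A[r, c]                           # scale the pivot to 1
        for i in range(rows):
            if i != r:
                A[i] -= A[i, c] * A[r]            # clear the pivot column
        r += 1
    return A

# Augmented matrix from Example A.2 (parallel lines, no solution):
print(rref([[1, 1, 1],
            [1, 1, 2]]))
# Augmented matrix from Example A.3 (many solutions):
print(rref([[1, 1, 1, 0],
            [1, 2, 2, 0]]))
```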

A linear system of equations may have more equations than variables, in
which case the system of equations is overdetermined. Although overdetermined
systems often have no solutions, it is possible for an overdetermined
system of equations to have many solutions or exactly one solution.
A system of equations with fewer equations than variables is underdetermined.
Although in many cases underdetermined systems of equations
have infinitely many solutions, it is possible for an underdetermined system of
equations to have no solutions.
A system of equations with all zeros on the right hand side is homogeneous.
Every homogeneous system of equations has at least one solution. You can set
all of the variables to 0 and get the trivial solution. A system of equations
with a nonzero right hand side is nonhomogeneous.

A.2 Matrix and Vector Algebra zero matrix

linear combination
As we have seen in the previous section, a matrix is a table of numbers laid
out in rows and columns. A vector is a matrix that consists of a single column
of numbers.
There are several important notational conventions used with matrices and
vectors. Bold face capital letters such as A, B, . . . are used to denote matrices.
Bold face lower case letters such as x, y, . . . are used to denote vectors. Lower
case letters or Greek letters such as m, n, α, β, . . . will be used to denote scalars.
At times we will need to refer to specific parts of a matrix. The notation
Aij denotes the element of the matrix A in row i and column j. We denote the
jth element of the vector x by xj . The notation A·,j is used to refer to column
j of the matrix A, while Ai,· refers to row i of A.
We can also build up larger matrices from smaller matrices. The notation
A = [B C] means that the matrix A is composed of the matrices B and C,
with matrix C beside matrix B. The notation A = [B; C] means that matrix
A is composed of the matrices B and C, with C below matrix B.
It is also possible to perform arithmetic on matrices and vectors. If A and B
are two matrices of the same size, we can add them together by simply adding
the corresponding elements of the two matrices. Similarly, we can subtract B
from A by subtracting the elements of B from the elements of A. We can
multiply a scalar times a vector by multiplying the scalar times each element
of the vector. Since vectors are just n by 1 matrices, we can perform the same
arithmetic operations on vectors. A zero matrix, 0 is a matrix with all zero
elements. A zero matrix plays the same role in matrix algebra as the scalar 0,
with A + 0 = 0 + A = A.
In general, matrices and vectors may contain complex numbers as well as
real numbers. However, in this book all of our matrices will have real entries.
Using vector notation, we can write a linear system of equations in vector

Example A.4
Recall the system of equations (A.1)

x1 + x2 + x3 = 0
x1 + 2x2 + 2x3 = 0

from Example A.3. We can write this in vector form as

x1 [ 1 ] + x2 [ 1 ] + x3 [ 1 ] = [ 0 ]
   [ 1 ]      [ 2 ]      [ 2 ]   [ 0 ].

The expression on the left hand side of the equation in which scalars are
multiplied by vectors and then added together is called a linear combination.
Although this form is not convenient for solving the system of equations, it can
be useful in setting up a system of equations.

If A is an m by n matrix, and x is an n element vector, we can multiply A

times x. The product is defined by

Ax = x1 A·,1 + x2 A·,2 + . . . + xn A·,n .

Example A.5
Let

A = [ 1 2 3 ]
    [ 4 5 6 ]

and

x = [ 1 ]
    [ 0 ]
    [ 2 ].

Then

Ax = 1 [ 1 ] + 0 [ 2 ] + 2 [ 3 ] = [  7 ]
       [ 4 ]     [ 5 ]     [ 6 ]   [ 16 ].

Notice that the formula for Ax involves a linear combination much like the
one that occurred in the vector form of a system of equations. It is possible
to take any linear system of equations and rewrite the system of equations as
Ax = b, where A is a matrix containing the coefficients of the variables in the
equations, b is a vector containing the coefficients on the right hand sides of the
equations, and x is a vector containing the variables.
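The column-combination view of Ax is easy to verify numerically. An illustrative Python sketch (not from the text) using the matrix from Example A.5:

```python
import numpy as np

# The matrix-vector product Ax as a linear combination of the columns of A,
# using the matrix and vector from Example A.5.
A = np.array([[1, 2, 3],
              [4, 5, 6]])
x = np.array([1, 0, 2])

by_columns = x[0] * A[:, 0] + x[1] * A[:, 1] + x[2] * A[:, 2]
print(by_columns)     # [ 7 16]
print(A @ x)          # the same product, computed by the library
```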

Example A.6
The system of equations

x1 + x2 + x3 = 0
x1 + 2x2 + 2x3 = 0

can be written as Ax = b, where

A = [1 1 1; 1 2 2],

x = [x1; x2; x3],

and

b = [0; 0].

Definition A.2
If A is a matrix of size m by n, and B is a matrix of size n by r,
then the product C = AB is obtained by multiplying A times each
of the columns of B and assembling the matrix–vector products in
C. That is,

C = [AB1 AB2 . . . ABr ] . (A.2)

Note that the product is only possible if the two matrices are of compatible
sizes. In general, if A has m rows and n columns, and B has n rows and r
columns, then the product AB exists and is of size m by r. In some cases, it is
possible to multiply AB but not BA. Also, it turns out that even when both
AB and BA exist, AB is not always equal to BA!
There is an alternate way to compute the matrix–matrix product. In the
row–column expansion method, we obtain the entry in row i and column j
of C by computing the matrix product of row i of A and column j of B.

Example A.7
Let

A = [1 2; 3 4; 5 6]

and

B = [5 2; 3 7],

and let C = AB. The matrix C will be of size 3 by 2. We compute
the product using both of the methods. First, using the matrix–
vector approach:

C = [AB1 AB2]

C = [5 [1; 3; 5] + 3 [2; 4; 6]    2 [1; 3; 5] + 7 [2; 4; 6]]

C = [11 16; 27 34; 43 52].

Next, we use the row–column approach:

C = [1×5+2×3  1×2+2×7; 3×5+4×3  3×2+4×7; 5×5+6×3  5×2+6×7]

C = [11 16; 27 34; 43 52].

The n by n identity matrix In consists of 1’s on the diagonal and 0’s on the
off diagonal. For example, the 3 by 3 identity matrix is

I3 = [1 0 0; 0 1 0; 0 0 1].

We often write I without specifying the size of the matrix in situations where
the size of the matrix is obvious from context. You can easily show that if A
is an m by n matrix, then

AIn = A
Im A = A.

Thus multiplying by I in matrix algebra is similar to multiplying by 1 in con-
ventional scalar algebra.
We have not defined division of matrices, but it is possible to define the
matrix algebra equivalent of the reciprocal.

Definition A.3

If A is an n by n matrix, and there is a matrix B such that

AB = BA = I, (A.3)

then B is the inverse of A. We write B = A−1 .

How do we compute the inverse of a matrix? If AB = I, then

[AB1 AB2 . . . ABn ] = I.

Since the columns of the identity matrix are known, and A is known, we can
solve

AB1 = [1; 0; 0; . . .; 0]

to obtain B1. In the same way, we can find the remaining columns of the inverse.
If any of these systems of equations are inconsistent, then A−1 does not exist.

Example A.8
Let

A = [2 1; 5 2].

To find the first column of A−1, we solve the system of equations
with the augmented matrix

[2 1 | 1; 5 2 | 0].

The RREF is

[1 0 | −2; 0 1 | 5].

Thus the first column of A−1 is [−2; 5]. Similarly, the second column
of A−1 is [1; −2]. Thus

A−1 = [−2 1; 5 −2].
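The column-by-column construction of the inverse described above is easy to carry out numerically. A sketch in NumPy (the book's examples use MATLAB), building each column of A−1 by solving against the corresponding column of the identity:

```python
import numpy as np

A = np.array([[2.0, 1.0], [5.0, 2.0]])

# Build the inverse one column at a time by solving A b_j = e_j,
# mirroring the augmented-matrix computation in Example A.8.
n = A.shape[0]
I = np.eye(n)
Ainv = np.column_stack([np.linalg.solve(A, I[:, j]) for j in range(n)])

print(Ainv)  # [[-2.  1.] [ 5. -2.]]
```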

The inverse matrix can be used to solve a system of linear equations with n
equations and n variables. Given the system of equations Ax = b, and A−1 ,
we can calculate

Ax = b
A−1 Ax = A−1 b
Ix = A−1 b
x = A−1 b.

This argument shows that if A−1 exists, then for any right hand side b, the
system of equations Ax = b has a unique solution. If A−1 does not exist, then
the system Ax = b may either have many solutions or no solution.

Example A.9
Consider the system of equations Ax = b, where

A = [2 1; 5 2]

and

b = [1; 2].

From the previous example, we know A−1, so

x = A−1 b = [−2 1; 5 −2] [1; 2] = [0; 1].
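This solve-by-inverse recipe can be checked numerically. A sketch in NumPy, using the matrix from Example A.8 and b = [1; 2] for illustration; in practice it is both faster and more accurate to solve the system directly than to form the inverse:

```python
import numpy as np

A = np.array([[2.0, 1.0], [5.0, 2.0]])
Ainv = np.array([[-2.0, 1.0], [5.0, -2.0]])  # computed in Example A.8
b = np.array([1.0, 2.0])

x = Ainv @ b  # x = A^{-1} b

# Preferred in practice: solve the system without forming A^{-1}.
x_direct = np.linalg.solve(A, b)
print(x, x_direct)
```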

Definition A.4

When A is an n by n matrix, Ak is the product of k copies of A.

By convention, we define A0 = I.

Definition A.5

The transpose of a matrix A is obtained by taking the columns
of A and writing them as the rows of the transpose. The transpose
of A is denoted by AT .

Example A.10
Let

A = [2 1; 5 2].

Then

AT = [2 5; 1 2].

We will frequently encounter symmetric matrices.

Definition A.6

A matrix is symmetric if A = AT .

Although many textbooks on linear algebra consider only square diagonal
matrices, we will have occasion to refer to matrices of size m by n which have
nonzero entries only on the diagonal.

Definition A.7

An m by n matrix A is diagonal if Ai,j = 0 whenever i ≠ j.


We will also work with non square matrices which are upper triangular.
Definition A.8
An m by n matrix R is upper triangular if Ri,j = 0 whenever
i > j. A matrix L is lower triangular if LT is upper triangular.

Example A.11
The matrix

S = [1 0 0 0 0; 0 2 0 0 0; 0 0 3 0 0]

is diagonal, and the matrix

[1 2 3; 0 2 4; 0 0 5; 0 0 0]

is upper triangular.

Theorem A.1
The following statements are true for any scalars s and t and any
matrices A, B, and C. It is assumed that the matrices are of the
appropriate size for the operations involved and that whenever an
inverse occurs, the matrix is invertible.
1. A + 0 = 0 + A = A.
2. A + B = B + A.
3. (A + B) + C = A + (B + C).
4. A(BC) = (AB)C.
5. A(B + C) = AB + AC.
6. (A + B)C = AC + BC.
7. (st)A = s(tA).
8. s(AB) = (sA)B = A(sB).
9. (s + t)A = sA + tA.
10. s(A + B) = sA + sB.
11. (AT )T = A.
12. (sA)T = s(AT ).

13. (A + B)T = AT + BT .

14. (AB)T = BT AT .
15. (AB)−1 = B−1 A−1 .
16. (A−1 )−1 = A.
17. (AT )−1 = (A−1 )T .
18. If A and B are n by n matrices, and AB = I, then A−1 = B
and B−1 = A.

The first ten rules in this list are identical to rules of conventional alge-
bra, and you should have little trouble in applying them. The rules involving
transposes and inverses are new, but they can be mastered without too much
difficulty.
Many students have difficulty with the following statements, which would
appear to be true on the surface, but which are in fact false for at least some
matrices:

1. AB = BA.

2. If AB = 0, then A = 0 or B = 0.

3. If AB = AC and A ≠ 0, then B = C.

It is a worthwhile exercise to construct examples of 2 by 2 matrices for which
each of these statements is false.
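One set of 2 by 2 counterexamples (these particular matrices are our illustrative choice, not from the text) can be checked with NumPy:

```python
import numpy as np

# 2-by-2 counterexamples to the three plausible-looking but false statements.
A = np.array([[0, 1], [0, 0]])
B = np.array([[0, 0], [0, 1]])

AB = A @ B   # [[0, 1], [0, 0]]
BA = B @ A   # the zero matrix, so AB != BA (statement 1)

# A is nonzero, yet A times A is the zero matrix (statement 2).
AA = A @ A

# A B2 equals A C2 even though B2 != C2 and A != 0 (statement 3).
B2 = np.array([[1, 1], [0, 0]])
C2 = np.array([[0, 0], [0, 0]])
AB2 = A @ B2
AC2 = A @ C2
```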

A.3 Linear Independence

Definition A.9

The vectors v1 , v2 , . . ., vn are linearly independent if the system

of equations
c1 v1 + c2 v2 + . . . + cn vn = 0 (A.4)
has only the trivial solution c = 0. If there are multiple solutions,
then the vectors are linearly dependent.

Determining whether or not a set of vectors is linearly independent is simple.

Just solve the above system of equations (A.4).

Example A.12
Let

A = [1 2 3; 4 5 6; 7 8 9].

Are the columns of A linearly independent vectors? To determine
this we set up the system of equations Ax = 0 in an augmented
matrix, and then find the RREF

[1 0 −1 | 0; 0 1 2 | 0; 0 0 0 | 0].

The solutions are

x = x3 [1; −2; 1].

We can set x3 = 1 and obtain the nonzero solution

x = [1; −2; 1].

Thus the columns of A are linearly dependent.
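The test in Example A.12 can be restated numerically: the columns of A are independent exactly when the rank of A equals the number of columns. A sketch in NumPy (the book's examples use MATLAB):

```python
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# The columns are independent exactly when Ax = 0 has only the trivial
# solution, i.e. when the rank equals the number of columns.
rank = np.linalg.matrix_rank(A)
print(rank)  # 2 < 3, so the columns are linearly dependent

# x = (1, -2, 1) is a nonzero solution of Ax = 0, as in Example A.12.
x = np.array([1, -2, 1])
print(A @ x)  # [0 0 0]
```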

There are a number of important theoretical consequences of linear indepen-

dence. For example, it can be shown that if the columns of an n by n matrix A
are linearly independent, then A−1 exists, and the system of equations Ax = b
has a unique solution for every right hand side b.

A.4 Subspaces of Rn
So far, we have worked with vectors of real numbers in the n dimensional space
Rn . There are a number of properties of Rn that make it convenient to work
with vectors. First, the operation of vector addition always works— we can take
any two vectors in Rn and add them together and get another vector in Rn .
Second, we can multiply any vector in Rn by a scalar and obtain another vector
in Rn . Finally, we have the 0 vector, with the property that for any vector x,
x + 0 = 0 + x = x.

Definition A.10
A subspace W of Rn is a subset of Rn which satisfies the following
three conditions:
1. If x and y are vectors in W , then x + y is also a vector in W .
2. If x is a vector in W and s is any scalar, then sx is also a vector
in W .
3. The 0 vector is in W .

Example A.13

In R3 , the plane P defined by the equation

x1 + x2 + x3 = 0

is a subspace of R3 . To see this, note that if we take any two vectors

in the plane and add them together, we get another vector in the
plane. If we take a vector in this plane and multiply it by any scalar,
we get another vector in the plane. Finally, 0 is a vector in the plane.

Subspaces are important because they provide an environment within which

all of the rules of matrix/vector algebra apply. The most important subspace of
Rn that we will work with in this course is the null space of an m by n matrix.

Definition A.11

Let A be an m by n matrix. The null space of A (written N (A))

is the set of all vectors x such that Ax = 0.

To show that N (A) is actually a subspace of Rn , we need to show three
things:

1. If x and y are in N (A), then Ax = 0 and Ay = 0. By adding these

equations, we find that A(x + y) = 0. Thus x + y is in N (A).

2. If x is in N (A) and s is any scalar, then Ax = 0. We can multiply this

equation by s to get sAx = 0. Thus A(sx) = 0, and sx is in N (A).

3. A0 = 0, so 0 is in N (A).

Computationally, finding the null space of a matrix is as simple as solving

the system of equations Ax = 0.

Example A.14
Let

A = [3 1 9 4; 2 1 7 3; 5 2 16 7].

In order to find the null space of A, we solve the system of equations
Ax = 0. To solve the equations, we put the system of equations into
an augmented matrix

[3 1 9 4 | 0; 2 1 7 3 | 0; 5 2 16 7 | 0]

and find the RREF

[1 0 2 1 | 0; 0 1 3 1 | 0; 0 0 0 0 | 0].

From the augmented matrix, we find that

x = x3 [−2; −3; 1; 0] + x4 [−1; −1; 0; 1].

Any vector in the null space can be written as a linear combination
of the above vectors, so the null space is a two dimensional plane
within R4 .
Now, consider the problem of solving Ax = b, where

b = [22; 17; 39].

It happens that one particular solution to this system of equations
is

p = [1; 2; 1; 2].

However, we can take any vector in the null space of A and add it to
this solution to obtain another solution. Suppose that x is in N (A).
Then

A(x + p) = Ax + Ap
A(x + p) = 0 + b
A(x + p) = b.

For example,

[1; 2; 1; 2] + 2 [−2; −3; 1; 0] + 3 [−1; −1; 0; 1]

is also a solution to Ax = b.

In the context of inverse problems, the null space is critical because the
presence of a nontrivial null space leads to non uniqueness in the solution to a
linear system of equations.
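The null space computation in Example A.14, and the resulting non uniqueness, can be reproduced numerically. A sketch in NumPy using the SVD (the book's examples use MATLAB; the particular solution p below is chosen to be consistent with Example A.14):

```python
import numpy as np

A = np.array([[3, 1, 9, 4], [2, 1, 7, 3], [5, 2, 16, 7]], dtype=float)

# A basis for N(A) from the SVD: right singular vectors whose singular
# values are (numerically) zero span the null space.
U, s, Vt = np.linalg.svd(A)
rank = int(np.sum(s > 1e-10))
null_basis = Vt[rank:].T          # columns span N(A)
print(null_basis.shape)           # (4, 2): a two-dimensional null space

# Adding any null-space vector to a particular solution gives another
# solution of Ax = b.
p = np.array([1.0, 2.0, 1.0, 2.0])
b = A @ p
x = p + null_basis @ np.array([0.5, -1.5])
print(np.allclose(A @ x, b))      # True
```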

Definition A.12
A basis for a subspace W is a set of vectors v1 , . . ., vp such that
1. Any vector in W can be written as a linear combination of the
basis vectors.
2. The basis vectors are linearly independent.

A particularly simple and useful basis is the standard basis.

Definition A.13
The standard basis for Rn is the set of vectors e1 , . . ., en such
that the elements of ei are all zero, except for the ith element, which
is one.

Any nontrivial subspace W of Rn will have many different bases. For exam-
ple, we can take any basis and multiply one of the basis vectors by 2 to obtain a
new basis. However, it is possible to show that all bases for a subspace W have
the same number of basis vectors.

Theorem A.2
Let W be a subspace of Rn with basis v1 , v2 , . . ., vp . Then all bases
for W have p basis vectors, and p is the dimension of W .

Example A.15
In the previous example, the vectors

v1 = [−2; −3; 1; 0]   and   v2 = [−1; −1; 0; 1]

form a basis for the null space of A because any vector in the null
space can be written as a linear combination of v1 and v2 , and
because the vectors v1 and v2 are linearly independent. Since the
basis has two vectors, the dimension of the null space of A is two.

It can be shown that the procedure used in the above example always pro-
duces a basis for N (A).
Definition A.14
Let A be an m by n matrix. The column space or range of A
(written R(A)) is the set of all vectors b such that Ax = b has at
least one solution. In other words, the column space is the set of all
vectors b that can be written as a linear combination of the columns
of A.

The range is important in the context of discrete linear inverse problems,

because R(G) consists of all vectors d for which there is a consistent model m.
To find the column space of a matrix, we consider what happens when we
compute the RREF of [A | b]. In the part of the augmented matrix correspond-
ing to the left hand side of the equations we always get the same result, namely
the RREF of A. The solution to the system of equations may involve some
free variables, but we can always set these free variables to 0. Thus when we
can solve Ax = b, we can solve the system of equations by using only variables
corresponding to the pivot columns in the RREF of A. In other words, if we
can solve Ax = b, then we can write b as a linear combination of the pivot
columns of A. Note that these are columns from the original matrix A, not
columns from the RREF of A.

Example A.16
As in the previous example, let
 
3 1 9 4
A= 2 1 7 3 .
5 2 16 7

We want to find the column space of A. We already know that the

RREF of A is  
1 0 2 1
 0 1 3 1 .
0 0 0 0
Thus whenever we can solve Ax = b, we can find a solution in
which x3 and x4 are 0. In other words, whenever there is a solution
to Ax = b, we can write b as a linear combination of the first two
columns of A:

b = x1 [3; 2; 5] + x2 [1; 1; 2].
Since these two vectors are linearly independent and span R(A),
they form a basis for R(A). The dimension of R(A) is two.

In the context of inverse problems, the range of a matrix A is important
because it tells us when we can solve Ax = b.
In finding the null space and range of a matrix A we found that the basis
vectors for N (A) corresponded to non pivot columns of A, while the basis
vectors for R(A) corresponded to pivot columns of A. Since the matrix A had
n columns, we obtain the following theorem.

Theorem A.3

dim N (A) + dim R(A) = n . (A.5)

In addition to the null space and range of a matrix A, we will often work
with the null space and range of the transpose of A. Since the columns of AT
are rows of A, the column space of AT is also called the row space of A. Since
each row of A can be written as a linear combination of the nonzero rows of the
RREF of A, the nonzero rows of the RREF form a basis for the row space of
A. There are exactly as many nonzero rows in the RREF of A as there are
pivot columns. Thus we have the following theorem.

Theorem A.4

dim(R(AT )) = dim R(A). (A.6)

Definition A.15

The rank of a matrix A is the dimension of R(A).

A.5 Orthogonality and the Dot Product

Definition A.16

Let x and y be two vectors in Rn . The dot product of x and y is

x · y = xT y = x1 y1 + x2 y2 + . . . + xn yn

Definition A.17
Let x be a vector in Rn . The Euclidean length or 2–norm of x
is

‖x‖2 = √(xT x) = √(x1² + x2² + . . . + xn²) .

[Figure A.1: Relationship between the dot product and the angle between two
vectors.]

Later we will introduce two other ways of measuring the “length” of a vector.
The subscript 2 is used to distinguish this 2–norm from the other norms.
You may be familiar with an alternative definition of the dot product in
which x · y = ‖x‖‖y‖ cos(θ), where θ is the angle between the two vectors. The
two definitions are equivalent. To see this, consider a triangle with sides x, y,
and x − y. See Figure A.1. The angle between sides x and y is θ. By the law
of cosines,

‖x − y‖2² = ‖x‖2² + ‖y‖2² − 2‖x‖2 ‖y‖2 cos(θ)
(x − y)T (x − y) = xT x + yT y − 2‖x‖2 ‖y‖2 cos(θ)
xT x − 2xT y + yT y = xT x + yT y − 2‖x‖2 ‖y‖2 cos(θ)
−2xT y = −2‖x‖2 ‖y‖2 cos(θ)
xT y = ‖x‖2 ‖y‖2 cos(θ).

We can also use this formula to compute the angle between two vectors:

θ = cos−1 ( xT y / (‖x‖2 ‖y‖2 ) ) .
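The angle formula translates directly into code. A small sketch in NumPy (the book's examples use MATLAB; the clip guards against round-off pushing the cosine slightly outside [−1, 1]):

```python
import numpy as np

def angle_between(x, y):
    """Angle (in radians) between vectors x and y via the dot-product formula."""
    c = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return np.arccos(np.clip(c, -1.0, 1.0))  # clip guards against round-off

theta = angle_between(np.array([1.0, 0.0]), np.array([1.0, 1.0]))
print(np.degrees(theta))  # 45.0
```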

Definition A.18

Two vectors x and y in Rn are orthogonal or perpendicular

(written x ⊥ y) if xT y = 0.

Definition A.19
A set of vectors v1 , v2 , . . ., vp is orthogonal if each pair of vectors
in the set is orthogonal.

Definition A.20
Two subspaces V and W of Rn are orthogonal or perpendicular
if every vector in V is perpendicular to every vector in W .

If x is in N (A), then Ax = 0. Since each element of the product Ax can
be obtained by taking the dot product of a row of A and x, x is perpendicular
to each row of A. Since x is perpendicular to all of the columns of AT , it is
perpendicular to R(AT ). We have the following theorem.

Theorem A.5
Let A be an m by n matrix. Then

N (A) ⊥ R(AT )

and

N (A) + R(AT ) = Rn .

That is, any vector x in Rn can be written uniquely as x = p + q
where p is in N (A) and q is in R(AT ).

Definition A.21
A basis in which the basis vectors are orthogonal is an orthogonal
basis.

Definition A.22
An n by n matrix Q is orthogonal if the columns of Q are orthog-
onal and each column of Q has length one.

Orthogonal matrices have a number of useful properties.

Theorem A.6
If Q is an orthogonal matrix, then:
1. QT Q = I. In other words, Q−1 = QT .

[Figure A.2: The orthogonal projection of x onto y.]

2. For any vector x in Rn , ‖Qx‖2 = ‖x‖2 .
3. For any two vectors x and y in Rn , xT y = (Qx)T (Qy) .

A problem that we will often encounter in practice is projecting a vector x
onto another vector y or onto a subspace W to obtain a projected vector p. See
Figure A.2. We know that

xT y = ‖x‖2 ‖y‖2 cos(θ) ,

where θ is the angle between x and y. Also,

cos(θ) = ‖p‖2 / ‖x‖2 ,

so

‖p‖2 = xT y / ‖y‖2 .

Since p points in the same direction as y,

p = projy x = (xT y / yT y) y .
The vector p is called the orthogonal projection of x onto y.
Similarly, if W is a subspace of Rn with an orthogonal basis w1 , w2 , . . .,
wp , then the orthogonal projection of x onto W is

p = projW x = (xT w1 / w1T w1 ) w1 + (xT w2 / w2T w2 ) w2 + . . . + (xT wp / wpT wp ) wp .

It is inconvenient that the projection formula requires an orthogonal basis.
The Gram–Schmidt orthogonalization process can be used to turn any
basis for a subspace of Rn into an orthogonal basis. We begin with a basis v1 ,
v2 , . . ., vp . The process recursively constructs an orthogonal basis by taking
each vector in the original basis and then subtracting off its projection on the
space spanned by the previous vectors. The formulas are

w1 = v1
w2 = v2 − (w1T v2 / w1T w1 ) w1
. . .
wp = vp − (w1T vp / w1T w1 ) w1 − . . . − (wp−1T vp / wp−1T wp−1 ) wp−1 .
Unfortunately, the Gram–Schmidt process is numerically unstable when applied
to large bases. In MATLAB the command orth provides a numerically stable
way to produce an orthogonal basis from a nonorthogonal basis.
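The Gram–Schmidt formulas above can be sketched in Python (the book's examples use MATLAB). This is the classical, numerically unstable version, shown only to illustrate the formulas; as noted, a stable routine such as MATLAB's orth should be used in practice:

```python
import numpy as np

def gram_schmidt(V):
    """Orthogonalize the columns of V by classical Gram-Schmidt.

    Numerically unstable for large or nearly dependent bases; shown only
    to illustrate the Gram-Schmidt formulas.
    """
    W = []
    for v in V.T:
        w = v.astype(float)
        for u in W:
            w = w - (u @ v) / (u @ u) * u   # subtract projection onto u
        W.append(w)
    return np.column_stack(W)

# Basis of the plane x1 + x2 + x3 = 0 (the basis used in Example A.17).
V = np.array([[-1.0, -1.0], [1.0, 0.0], [0.0, 1.0]])
W = gram_schmidt(V)
print(W[:, 0] @ W[:, 1])  # 0.0: the output columns are orthogonal
```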
An important property of the orthogonal projection is that the projection
of x onto W is the point in W which is closest to x. In the special case that
x is in W , the projection of x onto W is x. This provides a convenient way to
write a vector in W as a linear combination of the orthogonal basis vectors.

Example A.17
In this example, we will find the point on the plane

x1 + x2 + x3 = 0

which is closest to the point x1 = 1, x2 = 2, x3 = 3.
First, we must find an orthogonal basis for our subspace. Our sub-
space is the null space of the matrix

A = [1 1 1].

Using the RREF, we find that the null space has the basis

u1 = [−1; 1; 0]   u2 = [−1; 0; 1].

Unfortunately, this basis is not orthogonal. Using the MATLAB
orth command, we obtain the orthogonal basis

w1 = [−0.7071; 0; 0.7071]   w2 = [0.4082; −0.8165; 0.4082].

Using this orthogonal basis, we compute the projection of x onto
the plane:

p = (xT w1 / w1T w1 ) w1 + (xT w2 / w2T w2 ) w2

p = [−1; 0; 1].

Given an inconsistent system of equations Ax = b, it is often desirable
to find an approximate solution. A natural measure of the quality of an ap-
proximate solution is the distance from Ax to b, ‖Ax − b‖. A solution which
minimizes ‖Ax − b‖2 is called a least squares solution, because it minimizes
the sum of the squares of the errors.
The least squares solution can be obtained by projecting b onto the range
of A. This calculation requires us to first find an orthogonal basis for R(A).
There is an alternative approach which does not require the orthogonal basis.
Suppose that x is a least squares solution, so that

Ax = p = projR(A) b .

Then Ax − b is perpendicular to R(A). In particular, each of the columns of
A is orthogonal to Ax − b. Thus

AT (Ax − b) = 0

or

AT Ax = AT b . (A.7)
This last system of equations is referred to as the normal equations for the
least squares problem. It can be shown that if the columns of A are linearly
independent, then the normal equations have exactly one solution. This solution
minimizes the sum of squared errors.

Example A.18
Let

A = [1 2; 1 3; 1 1; 2 2]

and

b = [1; 2; 3; 4].

It is easy to see that the system of equations Ax = b is inconsis-
tent. We will find the least squares solution by solving the normal
equations. Here

AT A = [7 10; 10 18]

and

AT b = [14; 19].

The solution to AT Ax = AT b is

x = [2.3846; −0.2692].
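The normal-equations computation can be reproduced in NumPy (the book's examples use MATLAB; b = [1; 2; 3; 4] here is consistent with the normal equations and solution quoted in Example A.18):

```python
import numpy as np

A = np.array([[1.0, 2.0], [1.0, 3.0], [1.0, 1.0], [2.0, 2.0]])
b = np.array([1.0, 2.0, 3.0, 4.0])

# Solve the normal equations A^T A x = A^T b.
x = np.linalg.solve(A.T @ A, A.T @ b)
print(x)  # approximately [ 2.3846 -0.2692]

# lstsq minimizes ||Ax - b||_2 directly and agrees with the normal equations.
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
```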

A.6 Eigenvalues and Eigenvectors

Definition A.23

An n by n matrix A has an eigenvalue λ with an associated eigen-

vector x if x is not 0, and

Ax = λx .

The eigenvalues and eigenvectors of a matrix A are important in analyzing

difference equations of the form

x(k + 1) = Ax(k)

and differential equations of the form

x0 (t) = Ax(t) .

In general, they give us information about the effect of multiplying A times a
vector.
In order to find eigenvalues and eigenvectors, we consider the equation

Ax = λx .

This can be written as

(A − λI)x = 0. (A.8)
In order to find nonzero eigenvectors, the matrix A − λI must be singular. This
leads to the characteristic equation

det(A − λI) = 0 . (A.9)

For small matrices (2 by 2 or 3 by 3), it is relatively simple to solve the character-

istic equation (A.9) to find the eigenvalues and then substitute the eigenvalues
into (A.8) and solve the system of equations to find corresponding eigenvectors.
Notice that the eigenvalues can in general be complex numbers. For larger ma-
trices solving the characteristic equation becomes completely impractical and
more sophisticated numerical methods are used. The MATLAB command eig
can be used to find eigenvalues and eigenvectors of a matrix.

Suppose that we can find a set of n linearly independent eigenvectors

v1 , v2 , . . . , vn

of a matrix A with associated eigenvalues

λ1 , λ2 , . . . , λn .

These eigenvectors form a basis for Rn . We can use the eigenvectors to diago-
nalize the matrix. Let

P = [v1 v2 . . . vn ]

and

Λ = [λ1 0 . . . 0; 0 λ2 . . . 0; . . .; 0 . . . 0 λn ] .

To see that this works, simply compute AP:

AP = A [v1 v2 . . . vn ]
AP = [λ1 v1 λ2 v2 . . . λn vn ]
AP = PΛ .

Since the columns of P are linearly independent, P is invertible, and

A = PΛP−1 .
Not all matrices are diagonalizable, because not all matrices have n linearly
independent eigenvectors. However, there is an important special case in which
matrices can always be diagonalized.

Theorem A.7
If A is a real symmetric matrix, then A can be written as

A = QΛQT ,

where Q is a real orthogonal matrix of eigenvectors of A and Λ is a

real diagonal matrix of the eigenvalues of A.

This orthogonal diagonalization of a real symmetric matrix A will be

useful later on when we consider orthogonal factorizations of general matrices.
The eigenvalues of symmetric matrices are particularly important in the
analysis of quadratic forms.
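The orthogonal diagonalization of Theorem A.7 can be illustrated numerically. A sketch in NumPy (the example matrix is an arbitrary symmetric choice); numpy.linalg.eigh is designed for symmetric matrices and returns an orthogonal matrix of eigenvectors:

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 2.0]])  # a real symmetric matrix

# eigh returns eigenvalues (ascending) and orthonormal eigenvectors.
lam, Q = np.linalg.eigh(A)
print(lam)  # [1. 3.]

# A = Q diag(lam) Q^T: the orthogonal diagonalization of Theorem A.7.
A_rebuilt = Q @ np.diag(lam) @ Q.T
```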

Definition A.24
A quadratic form is a function of the form

f (x) = xT Ax ,

where A is a symmetric n by n matrix. The quadratic form f (x)
is positive definite (PD) if f (x) ≥ 0 for all x and f (x) = 0 only
when x = 0. The quadratic form is positive semidefinite (PSD)
if f (x) ≥ 0 for all x. Similarly, a symmetric matrix A is positive
definite if the associated quadratic form f (x) = xT Ax is positive
definite. The quadratic form is negative semidefinite if −f (x)
is positive semidefinite. If f (x) is neither positive semidefinite nor
negative semidefinite, then f (x) is indefinite.

Positive definite quadratic forms have an important application in analytic
geometry. Let A be a positive definite and symmetric matrix. Then the region
defined by the inequality

(x − c)T A(x − c) ≤ δ

is an ellipsoidal volume, with its center at c. We can diagonalize A as

A = PΛP−1 ,

where the columns of P are normalized eigenvectors of A, and Λ is a diagonal
matrix whose entries are the eigenvalues of A. It can be shown that the ith
eigenvector of A points in the direction of the ith semimajor axis of the ellipsoid,
and the length of the ith semimajor axis is given by √(δ/λi ).
An important connection between positive semidefinite matrices and eigen-
values is the following theorem.

Theorem A.8
A symmetric matrix A is positive semidefinite if and only if its eigen-
values are greater than or equal to 0.

This provides a convenient way to check whether or not a matrix is positive

semidefinite. However, for large matrices, computing all of the eigenvalues can
be time consuming.
The Cholesky factorization provides an another way to determine whether
or not a symmetric matrix is positive definite.

Theorem A.9
Let A be an n by n positive definite and symmetric matrix. Then
A can be written uniquely as
A = RT R ,
where R is an upper triangular matrix. Furthermore, A can be
factored in this way only if it is positive definite.
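Theorem A.9 gives a practical test for positive definiteness: attempt the factorization and see whether it succeeds. A sketch in NumPy (note that numpy.linalg.cholesky returns a lower triangular factor L with A = LLT, the transpose of the RT R convention above; the two example matrices are arbitrary):

```python
import numpy as np

def is_positive_definite(A):
    """Test a symmetric matrix for positive definiteness by attempting a
    Cholesky factorization; NumPy raises LinAlgError if none exists."""
    try:
        np.linalg.cholesky(A)
        return True
    except np.linalg.LinAlgError:
        return False

A = np.array([[2.0, 1.0], [1.0, 2.0]])  # eigenvalues 1 and 3: PD
B = np.array([[1.0, 2.0], [2.0, 1.0]])  # eigenvalues 3 and -1: not PD
print(is_positive_definite(A), is_positive_definite(B))  # True False
```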

The MATLAB command chol can be used to compute the Cholesky factor-
ization of a symmetric and positive definite matrix.
We will need to compute derivatives of quadratic forms.

Theorem A.10
Let f (x) = xT Ax where A is an n by n symmetric matrix. Then

∇f (x) = 2Ax

∇2 f (x) = 2A .

A.7 Lagrange Multipliers

The method of Lagrange multipliers is an important technique for solving
optimization problems of the form

min f (x)
g(x) = 0 . (A.10)

Theorem A.11
A minimum point of (A.10) can occur only at a point x where

∇f (x) = λ∇g(x) (A.11)

for some λ.

This condition is a necessary but not sufficient condition. A point satisfying

(A.11) is called a stationary point. A stationary point may be a minimum,
but it could also be a maximum point or a saddle point. Thus it is necessary to
check all of the stationary points to find the true minimum.
The Lagrange multiplier condition can be extended to problems of the form

min f (x)
g(x) ≤ 0 . (A.12)

Theorem A.12
A minimum point of (A.12) can occur only at a point x where

∇f (x) + λ∇g(x) = 0 (A.13)

for some Lagrange multiplier λ ≥ 0.


Example A.19
Consider the problem

min x1 + x2
x1² + x2² − 1 ≤ 0 .

The Lagrange multiplier condition is

[1; 1] + λ [2x1; 2x2] = [0; 0]

with λ ≥ 0. One solution to this nonlinear system of equations is
x1 = 0.7071, x2 = 0.7071, with λ = −0.7071. Since λ < 0, this point
does not satisfy the Lagrange multiplier condition. In fact, it is the
maximum of f (x) subject to g(x) ≤ 0. The second solution to the
Lagrange multiplier equations is x1 = −0.7071, x2 = −0.7071, with
λ = 0.7071. Since this is the only solution with λ ≥ 0, this point
solves the minimization problem.

Notice that (A.13) is (except for the condition λ ≥ 0) the necessary condition
for a minimum point of the unconstrained minimization problem

min f (x) + λg(x) . (A.14)

Here the parameter λ can be adjusted so that in the optimal solution x∗ ,

g(x∗ ) ≤ 0. We will make frequent use of this technique to convert constrained
optimization problems into unconstrained optimization problems.

A.8 Taylor’s Theorem

Theorem A.13
Suppose that f (x) and its first and second partial derivatives are
continuous. Then given some point x0 , we can write f (x) as

f (x) = f (x0 ) + ∇f (x0 )T (x − x0 ) + (1/2)(x − x0 )T ∇2 f (c)(x − x0 ) (A.15)

for some point c between x and x0 .

If x − x0 is small, then we can approximate f (x) as

f (x) ≈ f (x0 ) + ∇f (x0 )T (x − x0 ) + (1/2)(x − x0 )T ∇2 f (x0 )(x − x0 ) . (A.16)

We will frequently make use of this approximation. Taylor’s theorem can easily
be specialized to functions of a single variable:

f (x) ≈ f (x0 ) + f ′(x0 )(x − x0 ) + (1/2)f ′′(x0 )(x − x0 )² .

The theorem can also be extended to include terms associated with higher order
derivatives, but we will not need this.
The gradient ∇f (x0 ) has an important geometric interpretation. This vector
points in the direction in which f (x) increases most rapidly as we move away
from x0 . The Hessian, ∇2 f (x0 ), plays a role similar to the second derivative of
a function of a single variable.

Theorem A.14

If f (x) is a twice continuously differentiable function, and ∇2 f (x) is

a positive semidefinite matrix, then f (x) is convex in a neighborhood
of x. If ∇2 f (x) is positive definite, then f (x) is strictly convex.

A.9 Vector and Matrix Norms

Although the conventional Euclidean length is most commonly used, there are
alternative ways to measure the length of a vector.

Definition A.25

Any measure of the length of a vector that satisfies the following

four properties can be used as a norm.

1. For any vector x, ‖x‖ ≥ 0.
2. For any vector x and any scalar s, ‖sx‖ = |s|‖x‖.
3. For any vectors x and y, ‖x + y‖ ≤ ‖x‖ + ‖y‖.
4. ‖x‖ = 0 if and only if x = 0.

Definition A.26

The p–norm of a vector in Rn is defined for p ≥ 1 by

‖x‖p = (|x1 |p + |x2 |p + . . . + |xn |p )^(1/p) .


It can be shown that for any p ≥ 1, the p–norm satisfies the conditions of
Definition A.25. The conventional Euclidean length is just the 2–norm. Two
other p–norms are commonly used. The 1–norm is the sum of the absolute
values of the entries in x. The infinity–norm is obtained by taking the limit
as p goes to infinity. The infinity–norm is the maximum of the absolute values
of the entries in x. The MATLAB command norm can be used to compute the
norm of a vector. It has options for the 1, 2, and infinity norms.

Example A.20
Let

x = [1; 2; 3].

Then

‖x‖1 = 6

‖x‖2 = √14

‖x‖∞ = 3.
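The MATLAB norm computations in Example A.20 have direct NumPy equivalents; a sketch:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])

n1 = np.linalg.norm(x, 1)         # sum of absolute values
n2 = np.linalg.norm(x)            # Euclidean length (the default, p = 2)
ninf = np.linalg.norm(x, np.inf)  # largest absolute value
print(n1, n2, ninf)  # 6.0 3.7416... 3.0
```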

These alternative norms can be useful in finding approximate solutions to
over determined linear systems of equations. To minimize the maximum of the
errors, we minimize ‖Ax − b‖∞ . To minimize the sum of the absolute values of
the errors, we minimize ‖Ax − b‖1 . Unfortunately, these minimization problems
can be much harder to solve than the least squares problem.

Definition A.27
Any measure of the size or length of an m by n matrix that satisfies
the following five properties can be used as a matrix norm.
1. For any matrix A, ‖A‖ ≥ 0.
2. For any matrix A and any scalar s, ‖sA‖ = |s|‖A‖.
3. For any matrices A and B, ‖A + B‖ ≤ ‖A‖ + ‖B‖.
4. ‖A‖ = 0 if and only if A = 0.
5. For any two matrices A and B of compatible sizes, ‖AB‖ ≤ ‖A‖‖B‖.
Definition A.28
The p–norm of a matrix A is defined by

‖A‖p = max ‖Ax‖p subject to ‖x‖p = 1.

In this formula, ‖x‖p and ‖Ax‖p are vector p–norms, while ‖A‖p is
the matrix p–norm of A.

Solving the maximization problem to determine a matrix p–norm could be
extremely difficult. Fortunately, there are simpler formulas for the most com-
monly used matrix p–norms:

‖A‖1 = max over j of Σi |Ai,j | (the largest column sum) (A.17)

‖A‖2 = √( λmax (AT A) ) (A.18)

‖A‖∞ = max over i of Σj |Ai,j | (the largest row sum) . (A.19)

Here λmax (AT A) denotes the largest eigenvalue of AT A.

Definition A.29
The Frobenius norm of an m by n matrix is given by

‖A‖F = √( Σi Σj Aij² ) ,

where the sums run over i = 1, . . ., m and j = 1, . . ., n.

Definition A.30
A matrix norm and a vector norm are compatible if

‖Ax‖ ≤ ‖A‖‖x‖ .

The matrix p–norm is (by its definition) compatible with the vector p–norm
from which it was derived. It can also be shown that the Frobenius norm of a
matrix is compatible with the vector 2–norm. Thus the Frobenius norm is often
used with the vector 2–norm.
In practice, the Frobenius norm, 1–norm, and infinity–norm of a matrix are
easy to compute, while the 2–norm of a matrix can be difficult to compute for
large matrices. The MATLAB command norm has options for computing the
1, 2, infinity, and Frobenius norms of a matrix.
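The matrix norm formulas (A.17)–(A.19) and the Frobenius norm can be checked against NumPy's built-in norms; a sketch with an arbitrary example matrix:

```python
import numpy as np

A = np.array([[1.0, -2.0], [3.0, 4.0]])

n1 = np.linalg.norm(A, 1)          # largest column sum of |A_ij|
ninf = np.linalg.norm(A, np.inf)   # largest row sum of |A_ij|
n2 = np.linalg.norm(A, 2)          # sqrt of largest eigenvalue of A^T A
nF = np.linalg.norm(A, 'fro')      # Frobenius norm

# Recompute each from the formulas in the text.
col_sums = np.abs(A).sum(axis=0)
row_sums = np.abs(A).sum(axis=1)
lam_max = np.linalg.eigvalsh(A.T @ A).max()
print(n1, ninf)  # 6.0 7.0
```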

A.10 The Condition Number of a Linear System

Suppose that we want to solve a system of n equations in n variables

Ax = b .

Suppose further that because of measurement errors in b, we actually solve

Ax̂ = b̂ .

Can we get a bound on kx − x̂k in terms of kb − b̂k?

Ax = b
Ax̂ = b̂
A(x − x̂) = b − b̂
(x − x̂) = A−1 (b − b̂)
kx − x̂k = kA−1 (b − b̂)k
kx − x̂k ≤ kA−1 kkb − b̂k .

This formula provides an absolute bound on the error in the solution. It is also
worthwhile to compute a relative error bound.

kx − x̂k / kbk ≤ kA−1 k kb − b̂k / kbk

kx − x̂k / kAxk ≤ kA−1 k kb − b̂k / kbk

kx − x̂k ≤ kAxk kA−1 k kb − b̂k / kbk

kx − x̂k ≤ kAk kxk kA−1 k kb − b̂k / kbk

kx − x̂k / kxk ≤ kAk kA−1 k kb − b̂k / kbk .

The relative error in b is measured by kb − b̂k/kbk, and the relative error in
x is measured by kx − x̂k/kxk. The constant

cond(A) = kAkkA−1 k

is called the condition number of A.
Note that nothing that we did in the calculation of the condition number
depends on which norm we used. The condition number can be computed
using the 1–norm, 2–norm, infinity–norm, or Frobenius norm. The MATLAB
command cond can be used to find the condition number of a matrix. It has
options for the 1, 2, infinity, and Frobenius norms.
The condition number provides an upper bound on how inaccurate the solu-
tion to a system of equations might be because of errors in the right hand side.
In some cases, the condition number greatly overestimates the error in the solu-
tion. As a practical matter, it is wise to assume that the error is of roughly the
size predicted by the condition number. In practice, floating point arithmetic
only allows us to store numbers to about 16 digits of precision. If the condition
number is greater than 1016 , then by the above inequality, there may be no
accurate digits in the computer solution to the system of equations. Systems of
equations with very large condition numbers are called ill-conditioned.
It is important to understand that ill-conditioning is a property of the system
of equations and not the algorithm used to solve the system of equations. Ill-
conditioning cannot be fixed simply by using a better algorithm. Instead, we
must either increase the precision of our computer or find a different, better
conditioned system of equations to solve.
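The error bound can be illustrated with a small numerical experiment. The NumPy sketch below (the matrix and right-hand-side values are made-up example values; numpy.linalg.cond corresponds to MATLAB's cond) shows a nearly singular system in which a tiny relative error in b produces a large relative error in x, while still respecting the bound:

```python
import numpy as np

# A nearly singular 2 by 2 system (hypothetical example values)
A = np.array([[1.0, 1.0],
              [1.0, 1.0001]])
b = np.array([2.0, 2.0001])          # exact solution is x = [1, 1]
b_hat = b + np.array([0.0, 1e-4])    # small error in the right hand side

x = np.linalg.solve(A, b)
x_hat = np.linalg.solve(A, b_hat)

kappa = np.linalg.cond(A)            # 2-norm condition number

rel_err_b = np.linalg.norm(b - b_hat) / np.linalg.norm(b)
rel_err_x = np.linalg.norm(x - x_hat) / np.linalg.norm(x)

# The relative error in x is bounded by cond(A) times the relative error in b
assert rel_err_x <= kappa * rel_err_b
```

Here the relative error in b is about 3.5e-5, yet the relative error in x is of order one, close to what the condition number (about 4e4) allows.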

A.11 The QR Factorization

Although the theory of linear algebra can be developed using the reduced row
echelon form, there is an alternative computational approach which works bet-
ter in practice. The basic idea is to compute factorizations of matrices that
involve orthogonal, diagonal, and upper triangular matrices. This alternative
approach leads to algorithms which can quickly compute accurate solutions to
linear systems of equations and least squares problems.

Theorem A.15

Let A be an m by n matrix. A can be written as

A = QR

where Q is an m by m orthogonal matrix, and R is an m by n upper

triangular matrix. This is called the QR factorization of A.

The MATLAB command qr can be used to compute the QR factorization of a
matrix.
In a common situation, A will be an m by n matrix with m > n and the
rank of A will be n. In this case, we can write

R = [ R1 ]
    [ 0  ] ,

where R1 is n by n, and

Q = [Q1 Q2 ] ,

where Q1 is m by n and Q2 is m by m − n. In this case the QR factorization
has some important properties.

Theorem A.16

Let Q and R be the QR factorization of an m by n matrix A with

m > n and rank(A) = n. Then

1. The columns of Q1 form an orthonormal basis for R(A).

2. The columns of Q2 form an orthonormal basis for N (AT ).

3. The matrix R1 is nonsingular.

Now, suppose that we want to solve the least squares problem

min kAx − bk2 .

Since multiplying a vector by an orthogonal matrix does not change its length,
this is equivalent to

min kQT (Ax − b)k2 .

Because QT A = QT QR = R, we have

min kRx − QT bk2 ,

where

Rx − QT b = [ R1 x − QT1 b ]
            [ 0x − QT2 b  ] .
Whatever value of x we pick, we will probably end up with nonzero error because
of the 0x − QT2 b part of the least squares problem. We cannot minimize the
norm of this part of the vector. However, we can find an x that exactly solves
R1 x = QT1 b. Thus we can minimize the least squares problem by solving the
square system of equations
R1 x = QT1 b. (A.20)
The advantage of solving this system of equations instead of the normal equa-
tions (A.7) is that the normal equations are typically much more badly condi-
tioned than (A.20).
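This procedure is easy to sketch in NumPy (the data values below are hypothetical): numpy.linalg.qr returns the thin factors Q1 and R1 directly, and the least squares solution follows by solving the triangular system (A.20).

```python
import numpy as np

# Overdetermined system: fit a straight line to hypothetical data points
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([1.1, 1.9, 3.1, 3.9])

# Thin QR factorization: Q1 is m by n, R1 is n by n upper triangular
Q1, R1 = np.linalg.qr(A)

# Solve R1 x = Q1^T b (a triangular system; solved here with solve for brevity)
x = np.linalg.solve(R1, Q1.T @ b)

# Agrees with the normal-equations solution, but is better conditioned
x_ne = np.linalg.solve(A.T @ A, A.T @ b)
assert np.allclose(x, x_ne)
```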

A.12 Linear Algebra in More General Spaces
So far, we have considered only vectors in Rn . However, the concepts of linear
algebra can be extended to other contexts. In general, as long as the objects
that we want to consider can be multiplied by scalars and added together, and
as long as they obey the laws of vector algebra, then we have a vector space in
which we can practice linear algebra. If we can also define a dot product, then we
have what is called an inner product space, and we can define orthogonality,
projections, and the 2–norm.
There are many different vector spaces used in various areas of science and
mathematics. For our work in inverse problems, a very commonly used vector
space is the space of functions defined on an interval [a, b].
Multiplying a scalar times a function or adding two functions together clearly
produces another function. In this space, the function z(x) = 0 takes the place
of the 0 vector, since f (x) + z(x) = f (x). Two functions f (x) and g(x) are
linearly independent if the only solution to

c1 f (x) + c2 g(x) = z(x)

is c1 = c2 = 0.
We can define the dot product of two functions f and g to be

f · g = ∫_a^b f (x)g(x) dx .

Another commonly used notation for this dot product or inner product of f
and g is
f · g = hf, gi.
It is easy to show that this inner product has all of the algebraic properties of
the dot product of two vectors in Rn . A more important motivation for defining
the dot product in this way is that it leads to a useful definition of the 2–norm
of a function. Following our earlier formula that kxk2 = √(xT x), we have

kf k2 = √( ∫_a^b f (x)2 dx ) .

Using this definition, the distance between two functions f and g is

kf − gk2 = √( ∫_a^b (f (x) − g(x))2 dx ) .

This measure is obviously zero when f (x) = g(x) everywhere. More generally,
it is zero precisely when f (x) = g(x) except possibly at some isolated points.
Using this inner product and norm, we can reconstruct the theory of lin-
ear algebra from Rn in our space of functions. This includes the concepts of
orthogonality, projections, norms, and least squares solutions.
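These function-space definitions can be explored numerically. The sketch below (NumPy, with a simple trapezoidal quadrature and the arbitrary choices f = sin, g = cos on [0, 1]) approximates the inner product, the induced 2–norm, and the distance between two functions:

```python
import numpy as np

def inner(f, g, a=0.0, b=1.0, n=100000):
    # Trapezoidal approximation to the inner product integral_a^b f(x) g(x) dx
    x = np.linspace(a, b, n + 1)
    y = f(x) * g(x)
    h = (b - a) / n
    return h * (y[0] / 2 + np.sum(y[1:-1]) + y[-1] / 2)

f, g = np.sin, np.cos

norm_f = np.sqrt(inner(f, f))          # function 2-norm of f
diff = lambda t: f(t) - g(t)
dist = np.sqrt(inner(diff, diff))      # distance between f and g

# sin and cos are not orthogonal on [0, 1]:
# integral_0^1 sin(x) cos(x) dx = (1 - cos 2)/4
assert np.isclose(inner(f, g), (1.0 - np.cos(2.0)) / 4.0)
```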

A.13 Exercises
1. Construct an example of an over determined system of equations that has
infinitely many solutions.
2. Construct an example of an over determined system of equations that has
exactly one solution.
3. Construct an example of an under determined system of equations that
has no solutions.
4. Is it possible for an under determined system of equations to have exactly
one solution? If so, construct an example. If not, then explain why it is
not possible.
5. Let A be an m by n matrix with n pivot columns in its RREF. Can the
system of equations Ax = b have infinitely many solutions?
6. Write a MATLAB routine to compute the RREF of a matrix.

% R=myrref(A)
% A an m by n input matrix.
% R output: the RREF(A)

Check your routine by comparing its results with results from MATLAB’s
rref command. Try some large (100 by 200) random matrices. Do you
always get the same result? If not, explain why.
7. If C = AB is a 5 by 4 matrix, then how many rows does A have? How
many columns does B have? Can you say anything about the number of
columns in A?
8. Write a MATLAB function to compute the product of two matrices

% C=myprod(A,B)
% A a matrix of size m by n
% B a matrix of size n by r
% C output: C=A*B

9. Construct a pair of 2 by 2 matrices A and B such that AB 6= BA.

10. Construct a pair of 2 by 2 matrices A and B such that AB = 0, but
neither of A and B are zero.

11. Construct a triple of 2 by 2 matrices A, B, and C such that AB = AC,

A 6= 0, but B 6= C.
12. Let A be an m by n matrix, and let

P = A(AT A)−1 AT .

We will assume that (AT A)−1 exists, but not that A−1 exists.
(a) Show that P is symmetric.
(b) Show that P2 = P.
13. Suppose that v1 , v2 and v3 are three vectors in R3 and that v3 = −2v1 +
3v2 . Are the vectors linearly dependent or linearly independent?
14. Let A be an n by n matrix. Show that if A−1 exists, then the columns
of A are linearly independent. Also show that if the columns of A are
linearly independent, then A−1 exists.
15. Show that any collection of four vectors in R3 must be linearly depen-
dent. Show that in general, any collection of more than n vectors in Rn
must be linearly dependent.
16. Let

A = [ 1 2 3  4 ]
    [ 2 2 1  3 ] .
    [ 4 6 7 11 ]
Find bases for N (A), R(A), N (AT ) and R(AT ). What are the dimensions
of the four subspaces?
17. Let A be an n by n matrix such that A−1 exists. What are N (A), R(A),
N (AT ), and R(AT )?
18. Let A be any 9 by 6 matrix. If the dimension of the null space of A is 5,
then what is the dimension of R(A)? What is the dimension of R(AT )?
What is the rank of A?
19. Suppose that a non homogeneous system of equations with four equations
and six unknowns has a solution with two free variables. Is it possible to
change the right hand side of the system of equations so that the modified
system of equations has no solutions?
20. Let W be the set of vectors x in R4 such that x1 x2 = 0. Is W a subspace
of R4 ?
21. Find vectors v1 and v2 such that

v1 ⊥ v2
v1 ⊥ x
v2 ⊥ x

where
x= 2 .
22. Let v1 , v2 , v3 be a set of three nonzero, mutually orthogonal vectors.
Show that the vectors are also linearly independent.
23. Show that if x ⊥ y, then

kx + yk22 = kxk22 + kyk22 .

24. Let Q be an n by n matrix such that

QT Q = I.

Show that Q is an orthogonal matrix.

25. Let u be any nonzero vector in Rn . Let

P = I − (u uT )/(uT u) .

Show that P is a symmetric matrix.
26. Prove the parallelogram law

ku + vk22 + ku − vk22 = 2kuk22 + 2kvk22

27. Suppose that A is an n by n matrix such that the elements in each row of A
add up to 1. Show that 1 is an eigenvalue of A and find a corresponding
eigenvector.
28. Let A be an n by n matrix with eigenvalues

λ1 , λ 2 , . . . , λn .

Show that
det(A) = λ1 λ2 · · · λn .

29. Suppose that a non singular matrix A can be diagonalized as

A = PΛP−1 .

Find a diagonalization of A−1 . What are the eigenvalues of A−1 ?


30. Suppose that A is diagonalizable and that all eigenvalues of A have ab-
solute value less than one. What is the limit as k goes to infinity of Ak ?
31. Let A be an m by n matrix, and let

P = A(AT A)−1 AT .

We will assume that (AT A)−1 exists, but not that A−1 exists.
(a) Show that for any vector x, Px is in R(A).
(b) Show that for any vector x, (x − Px) ⊥ R(A). Hint: Show that
x − Px is in N (AT ).
This shows that multiplying P times x computes the orthogonal projection
of x onto R(A). Can you find a similar formula for projecting onto N (A)?
32. In this exercise, we will derive the formula (A.17) for the one–norm of a
matrix. Begin with the optimization problem

kAk1 = max kAxk1 .

kxk1 =1

(a) Show that if kxk1 = 1, then

kAxk1 ≤ max_j Σ_{i=1}^{m} |Ai,j | .

(b) Find a vector x such that kxk1 = 1, and

kAxk1 = max_j Σ_{i=1}^{m} |Ai,j | .

(c) Conclude that

kAk1 = max_{kxk1 =1} kAxk1 = max_j Σ_{i=1}^{m} |Ai,j | .

33. Derive the formula (A.19) for the infinity norm of a matrix.
34. In this exercise we will derive the formula (A.18) for the two–norm of a
matrix. Begin with the maximization problem

max kAxk22 .
kxk2 =1

Note that we have squared kAxk2 . We will take the square root at the
end of the problem.

(a) Using the formula kxk2 = √(xT x), rewrite the above maximization
problem without norms.

(b) Use the Lagrange multiplier method to find a system of equations
that must be satisfied by any stationary point of the maximization
problem.
(c) Explain how the eigenvalues and eigenvectors of AT A are related to
this system of equations. Express the solution to the maximization
problem in terms of the eigenvalues and eigenvectors of AT A.
(d) Use this solution to get kAk2 .
35. In this exercise we will derive the normal equations using calculus. Let
f (x) = kAx − bk22
We want to minimize f (x).
(a) Rewrite f (x) as a dot product and then expand.
(b) Find ∇f (x).
(c) Set ∇f (x) = 0, and obtain the normal equations.
36. Let A be an m by n matrix.
(a) Show that AT A is symmetric.
(b) Show that AT A is positive semidefinite. Hint: Use the definition of
PSD rather than trying to compute eigenvalues.
(c) Show that if rank(A) = n, then the only vector x such that Ax = 0
is x = 0. Use this to show that if rank(A) = n, then AT A is positive
definite.
37. Solve the system of equations
x + y = 1
1.0000001x + y = 2.
Now change the two on the right hand side of the second equation to a
one. Find the new solution. In both cases, plot the solutions. Why did
the solution change so much?
38. Show that
cond(AB) ≤ cond(A)cond(B).
39. Consider the Hilbert matrix of order 10,

H = [ 1     1/2   . . .  1/10 ]
    [ 1/2   1/3   . . .  1/11 ]
    [ ...   ...   . . .  ...  ] ,
    [ 1/10  1/11  . . .  1/19 ]

where Hi,j = 1/(i + j − 1). This matrix can be computed with the hilb
command in MATLAB. The Hilbert matrix is known to be extremely
badly conditioned. Solve Hx = b, where b is the last column of the
identity matrix. How accurate is your solution likely to be?

40. Let A be a symmetric and positive definite matrix with Cholesky factorization

A = RT R .
Show how the Cholesky factorization can be used to solve Ax = b by
solving two systems of equations, each of which has R or RT as its matrix.
41. Show that if A is an m by n matrix with QR factorization

A = QR ,

then

kAkF = kRkF .

42. Let P3 [0, 1] be the space of polynomials of degree less than or equal to 3
on the interval [0, 1]. The polynomials p1 (x) = 1, p2 (x) = x, p3 (x) = x2 ,
and p4 (x) = x3 form a basis for P3 [0, 1], but they are not orthogonal with
respect to the inner product
Z 1
f ·g = f (x)g(x) dx .

Use the Gram-Schmidt process to construct an orthogonal basis for P3 [0, 1].
Once you have your basis, use it to find the third degree polynomial that
best approximates f (x) = e−x on the interval [0, 1].

A.14 Notes and Further Reading

Much of the material in this appendix is typically covered in sophomore level
linear algebra courses. There are a huge number of textbooks at this level.
One good introductory linear algebra textbook that covers the material in this
appendix is [Lay03]. At a slightly more advanced level, [Str88] and [Mey00] are
both excellent. Fast and accurate algorithms for linear algebra computations
are a somewhat more advanced topic. A classic reference is [GL96]. Other good
books on this topic include [TB97] and [Dem97].

Appendix B

Review of Probability and Statistics


This appendix contains a brief review of important topics in classical probability

and statistics. The topics have been selected because they are used in various
places throughout the book. In writing this appendix we have also attempted to
discuss some of the important connections between the mathematics of probabil-
ity theory and its application to the analysis of experimental data with random
measurement errors. The approach used in this appendix is a classical one. It is
important to note that in Chapter 11 (Bayesian methods), some very different
interpretations of probability theory are discussed.

B.1 Probability and Random Variables

The mathematical theory of probability begins with an experiment, which has
a set S of possible outcomes. We will be interested in events which are subsets
A of S.

Definition B.1

The probability function P is a function defined on subsets of S

with the following properties

1. P (S) = 1.
2. For every event A ⊆ S, P (A) ≥ 0.
3. If events A1 , A2 , . . . are pairwise mutually exclusive (that is, if
Ai ∩ Aj is empty for all pairs i 6= j), then

P (∪∞i=1 Ai ) = Σ∞i=1 P (Ai ) .


The list of properties of probability given in this definition is extremely
useful in developing the mathematics of probability theory. However, applying
this definition of probability to real world situations requires some ingenuity.

Example B.1
Consider the experiment of throwing a dart at a dart board. We will
assume that our dart thrower is an expert who always hits the dart
board. The sample space S consists of the points on the dart board.
We can define an event A that consists of the points in the bulls eye.
Then P (A) is the probability that the thrower hits the bulls eye.

In practice, the outcome of an experiment is often a number rather than
an event. Random variables are a useful generalization of the basic concept of
an experimental outcome.

Definition B.2
A random variable X is a function X(s) that assigns a value to
each outcome s in the sample space S.
Each time we perform an experiment, we obtain a specific value of
the random variable. These values are called realizations of the
random variable.
We will typically use capital letters to denote random variables, with
lower case letters used to denote realizations of a random variable.

Example B.2
To continue our previous example, let X be the function that takes
a point on the dart board and returns the associated score. Suppose
that throwing the dart in the bulls eye scores 50 points. Then for
each point s in the bullseye, X(s) = 50.

In this book we will deal frequently with experimental results in the form of
measurements that can include some random measurement error.

Example B.3
Suppose we weigh an object five times and obtain masses of m1 =
10.1 kg, m2 = 10.0 kg, m3 = 10.0 kg, m4 = 9.9 kg, and m5 = 10.1
kg. We will assume that there is one true mass m, and that the mea-
surements we obtained varied because of some random measurement
error. Thus

m1 = m+e1 , m2 = m+e2 , m3 = m+e3 , m4 = m+e4 , m5 = m+e5


We can treat the measurement errors e1 , e2 , . . ., e5 as realizations
of a random variable E. Equivalently, since the true mass m is
just a constant, we could treat the measurements m1 , m2 , . . ., m5
as realizations of a random variable M . In practice it makes little
difference whether we treat the measurements or the measurement
errors as random variables.
Again, it is important to note that in a Bayesian approach the mass
m of the object would itself be a random variable. This is an issue
that we will take up in Chapter 11.

The relative probability of realization values for a random variable can

be characterized by a unit area, nonnegative, probability density function
(PDF), fX (x). Since the random variable always has some value, the PDF
must have a total area of one. Furthermore, the PDF is nonnegative for all x.
The following definitions give some useful probability density functions which
frequently arise in inverse problems.

Definition B.3
The uniform probability density on the interval [a, b] has

fU (x) = 1/(b − a)   a ≤ x ≤ b
fU (x) = 0           x < a          (B.1)
fU (x) = 0           x > b

Definition B.4
The Gaussian or normal probability density (with parameters µ
and σ) has

fN (x) = (1/(σ√(2π))) e−(x−µ)²/(2σ²) = N (µ, σ 2 )   (B.2)

The standard normal random variable, N (0, 1), has µ = 0 and
σ = 1.

Definition B.5
The exponential random variable has

fexp (x) = λe−λx   x ≥ 0
fexp (x) = 0       x < 0          (B.3)

Figure B.1: The uniform probability density function.

Figure B.2: The standard normal probability density function.



Figure B.3: The exponential probability density function (λ = 1).

Definition B.6
The double-sided exponential probability density has

fdexp (x) = (1/(√2 σ)) e−√2|x−µ|/σ .   (B.4)

Definition B.7
The χ2 probability density with parameter ν has

fχ2 (x) = (1/(2ν/2 Γ(ν/2))) x(ν/2)−1 e−x/2 ,   (B.5)

where the gamma function

Γ(x) = ∫_0^∞ ξ x−1 e−ξ dξ

is a generalized factorial (Γ(n) = (n − 1)! if n is a positive integer).

It can be shown that if X1 , X2 , . . ., Xn are independent with stan-
dard normal distributions, then

Z = X12 + X22 + . . . + Xn2

has a χ2 distribution with ν = n.
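This relationship between normal and χ2 random variables is easy to check by simulation. The NumPy sketch below (the sample size and seed are arbitrary) sums ν squared standard normal samples and verifies that the result has the expected value ν and variance 2ν of a χ2 random variable:

```python
import numpy as np

rng = np.random.default_rng(0)
nu, nsamples = 3, 200000

# Sum of nu squared independent standard normal random variables
X = rng.standard_normal((nsamples, nu))
Z = np.sum(X ** 2, axis=1)

# A chi-square random variable with nu degrees of freedom has
# expected value nu and variance 2 * nu
assert abs(Z.mean() - nu) < 0.05
assert abs(Z.var() - 2 * nu) < 0.2
```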


Figure B.4: The double-sided exponential probability density function (µ = 0,

λ = 1).

Figure B.5: The chi-square probability density function (ν = 1, 3, 5, 7, 9).



Figure B.6: The student’s t probability density function.

Definition B.8
The Student’s t distribution with n degrees of freedom has

ft (x) = ( Γ((n + 1)/2) / (Γ(n/2)√(nπ)) ) (1 + x²/n)−(n+1)/2 .   (B.6)

The Student’s t distribution is so named because W. S. Gosset used the
pseudonym “Student” in publishing the first paper in which the distribution
appeared.
It should be noted that in the limit as n goes to infinity, the t distribution
approaches a standard normal distribution (Figure B.2; Figure B.6).

Definition B.9
The F distribution is a function of two degrees of freedom, ν1 and ν2 , with

fF (x) = ( (ν1 /ν2 )ν1/2 x(ν1−2)/2 (1 + x ν1 /ν2 )−(ν1+ν2)/2 ) / β(ν1 /2, ν2 /2) ,   (B.7)

where the beta function is

β(z, w) = ∫_0^1 tz−1 (1 − t)w−1 dt .   (B.8)


Figure B.7: The F probability density function.

The F distribution is important because the ratio of two independent χ2
random variables, each divided by its degrees of freedom (ν1 and ν2 ,
respectively), has an F distribution with parameters ν1 and ν2 .

The cumulative distribution function (CDF) FX (a) for a one-dimensional
random variable X is just P (X ≤ a), which is given by the definite integral of
the associated PDF

FX (a) = P (X ≤ a) = ∫_{−∞}^a fX (x) dx .

Note that FX (a) must lie in the interval [0, 1] for all a, and is a non-decreasing
function of a because of the unit area and non negativity of the PDF.
For the uniform PDF on the unit interval, for example, the CDF is a ramp
function

FU (a) = ∫_{−∞}^a fU (z) dz

FU (a) = 0   a ≤ 0
FU (a) = a   0 ≤ a ≤ 1
FU (a) = 1   a > 1

The PDF, fX (x), and CDF, FX (a), completely determine the probabilistic
properties of a random variable.
The probability that a particular realization of X will lie within a general
interval [a, b] is

P (a ≤ X ≤ b) = P (X ≤ b) − P (X ≤ a) = F (b) − F (a)
             = ∫_{−∞}^b f (x) dx − ∫_{−∞}^a f (x) dx = ∫_a^b f (x) dx .
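The relation P(a ≤ X ≤ b) = F(b) − F(a) can be verified numerically. The sketch below (plain Python, using the exponential density with an arbitrary choice λ = 2) compares the CDF difference against a midpoint-rule integral of the PDF over [a, b]:

```python
import math

lam = 2.0  # arbitrary rate parameter for the exponential density

def pdf(x):
    # f(x) = lam * exp(-lam * x) for x >= 0, zero otherwise
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

def cdf(a):
    # F(a) = 1 - exp(-lam * a) for a >= 0, zero otherwise
    return 1.0 - math.exp(-lam * a) if a >= 0 else 0.0

a, b = 0.5, 1.5

# P(a <= X <= b) = F(b) - F(a)
p_cdf = cdf(b) - cdf(a)

# ... which equals the integral of the PDF over [a, b] (midpoint rule)
n = 100000
h = (b - a) / n
p_int = sum(pdf(a + (i + 0.5) * h) for i in range(n)) * h

assert abs(p_cdf - p_int) < 1e-6
```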

B.2 Expected Value and Variance

Definition B.10

The expected value of a random variable X, denoted µX or E[X],
is

µX = E[X] = ∫_{−∞}^∞ x fX (x) dx .

In general, if g(X) is some function of a random variable X, then

E[g(X)] = ∫_{−∞}^∞ g(x)fX (x) dx .

Some authors use the term “mean” for the expected value of a random
variable. We will reserve the term mean for the average of a set of data. Note
that the expected value of a random variable is not necessarily identical to the
mode (the value of x with the largest value of f (x)), nor is it necessarily identical
to the median, the value of x for which F (x) = 1/2.

Example B.4

The expected value of an N (µ, σ 2 ) random variable X is

E[X] = ∫_{−∞}^∞ x (1/(σ√(2π))) e−(x−µ)²/(2σ²) dx .

Substituting x + µ for x gives

E[X] = ∫_{−∞}^∞ (x + µ) (1/(σ√(2π))) e−x²/(2σ²) dx

     = µ ∫_{−∞}^∞ (1/(σ√(2π))) e−x²/(2σ²) dx + ∫_{−∞}^∞ x (1/(σ√(2π))) e−x²/(2σ²) dx .

The first integral term is µ because the integral of the entire PDF is 1,
and the second term is zero because it is an odd function integrated
over a symmetric interval. Thus

E[X] = µ .

Definition B.11

The variance of a random variable X, denoted by Var(X) or σX2 ,
is given by

Var(X) = σX2 = E[(X − µX )2 ] = E[X 2 ] − µ2X = ∫_{−∞}^∞ (x − µX )2 fX (x) dx .

The standard deviation of X, often denoted σX , is

σX = √( Var(X) ) .

The variance and standard deviation serve as measures of the spread of the
random variable about its expected value. Since the units of σ are the same as
the units of µ, the standard deviation is generally more practical as a measure of
the spread of the random variable. However, the variance has many properties
that make it more useful for certain calculations.
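As a numerical illustration (a NumPy sketch using an arbitrary uniform density on [2, 5]), both expressions for the variance in Definition B.11 can be evaluated by quadrature and compared with the known uniform-density formulas E[X] = (a + b)/2 and Var(X) = (b − a)²/12:

```python
import numpy as np

a, b = 2.0, 5.0                       # arbitrary uniform density on [a, b]
x = np.linspace(a, b, 200001)
f = np.full_like(x, 1.0 / (b - a))    # fU(x) = 1/(b - a)
h = x[1] - x[0]

def integrate(y):
    # trapezoidal rule on the fixed grid
    return h * (y[0] / 2 + np.sum(y[1:-1]) + y[-1] / 2)

mu = integrate(x * f)                        # E[X]
var = integrate((x - mu) ** 2 * f)           # E[(X - mu)^2]
var_alt = integrate(x ** 2 * f) - mu ** 2    # E[X^2] - mu^2
sigma = np.sqrt(var)                         # standard deviation

assert np.isclose(mu, (a + b) / 2)           # 3.5
assert np.isclose(var, (b - a) ** 2 / 12)    # 0.75
assert np.isclose(var, var_alt)
```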

B.3 Joint Distributions

Definition B.12

If we have two random variables X and Y , they may have a joint
probability density (JDF), f (x, y), with

P (X ≤ a and Y ≤ b) = ∫_{−∞}^a ∫_{−∞}^b f (x, y) dy dx .   (B.9)

If X and Y have a joint probability density, then we can use it to evaluate
the expected value of a function of X and Y . The expected value of g(X, Y ) is
given by

E[g(X, Y )] = ∫_{−∞}^∞ ∫_{−∞}^∞ g(x, y)f (x, y) dy dx .   (B.10)

Definition B.13

Two random variables X and Y are independent if they have a JDF
given by

f (x, y) = fX (x)fY (y) .

Definition B.14

If X and Y have a JDF, then the covariance of X and Y is

Cov(X, Y ) = E[(X − E[X])(Y − E[Y ])] = E[XY ] − E[X]E[Y ] .

If X and Y are independent, then E[XY ] = E[X]E[Y ], and Cov(X, Y ) = 0.
However, if X and Y are dependent, it is still possible, given some particular
distributions, for X and Y to have Cov(X, Y ) = 0. If Cov(X, Y ) = 0, X and
Y are called uncorrelated.

Definition B.15

The correlation of X and Y is

ρXY = Cov(X, Y ) / √( Var(X)Var(Y ) ) .

Correlation is thus a scaled covariance.

Theorem B.1

The following properties of Var, Cov, and correlation hold for any
random variables X and Y and scalars s and a.

• Var(X) ≥ 0
• Var(X + a) = Var(X)
• Var(sX) = s2 Var(X)
• Var(X + Y ) = Var(X) + Var(Y ) + 2Cov(X, Y )
• Cov(X, Y ) = Cov(Y, X)
• ρ(X, Y ) = ρ(Y, X)
• −1 ≤ ρXY ≤ 1 .

The following example demonstrates the use of some of these properties.

Example B.5

Suppose that Z is a standard normal random variable. Let

X = µ + σZ .

Then

E[X] = E[µ] + σE[Z] = µ

and

Var(X) = Var(µ) + σ 2 Var(Z) = σ 2 .
Thus if we have a program to generate random numbers with the
standard normal distribution, we can use it to generate normal ran-
dom numbers with any desired expected value and standard devia-
tion. The MATLAB command randn generates independent real-
izations of an N (0, 1) random variable.
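A sketch of this recipe using NumPy's random generator in place of MATLAB's randn (the values of µ and σ below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)   # NumPy analogue of MATLAB's randn
mu, sigma = 10.0, 2.5             # arbitrary target parameters

# X = mu + sigma * Z has expected value mu and standard deviation sigma
z = rng.standard_normal(500000)
x = mu + sigma * z

assert abs(x.mean() - mu) < 0.02
assert abs(x.std() - sigma) < 0.02
```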

Example B.6

What is the CDF (or PDF) of the sum of two independent random
variables X + Y ? To see this, we write the desired CDF in terms of
an appropriate integral over the JDF, f (x, y) (Figure B.8)

FX+Y (z) = P (X + Y ≤ z)
         = ∫∫_{x+y≤z} f (x, y) dx dy
         = ∫∫_{x+y≤z} fX (x)fY (y) dx dy
         = ∫_{−∞}^∞ ∫_{−∞}^{z−y} fX (x)fY (y) dx dy
         = ∫_{−∞}^∞ ( ∫_{−∞}^{z−y} fX (x) dx ) fY (y) dy
         = ∫_{−∞}^∞ FX (z − y)fY (y) dy .

The associated PDF can be obtained as

fX+Y (z) = d/dz ∫_{−∞}^∞ FX (z − y)fY (y) dy
         = ∫_{−∞}^∞ d/dz FX (z − y)fY (y) dy
         = ∫_{−∞}^∞ fX (z − y)fY (y) dy
         = fX (z) ∗ fY (z) .

Adding two independent random variables thus produces a new ran-

dom variable which has a PDF given by the convolution of the PDF’s
of the two individual variables.
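The convolution result can be illustrated numerically. The NumPy sketch below convolves the uniform density on [0, 1] with itself on a discrete grid (the grid spacing is an arbitrary choice), which approximates the triangular density of the sum of two independent uniform random variables:

```python
import numpy as np

x = np.linspace(0.0, 1.0, 1001)
h = x[1] - x[0]                    # grid spacing, 0.001
f = np.ones_like(x)                # uniform PDF on [0, 1]

# Discrete approximation to the convolution fX * fY
f_sum = np.convolve(f, f) * h
z = np.arange(len(f_sum)) * h      # support of X + Y: [0, 2]

# The sum of two uniforms has the triangular density, peaking at 1 for z = 1,
# and the convolved density still integrates to (approximately) one.
assert abs(f_sum[np.argmin(np.abs(z - 1.0))] - 1.0) < 0.01
assert abs(np.sum(f_sum) * h - 1.0) < 0.01
```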

The JDF can be used to evaluate the CDF or PDF arising from a general
function of independent random variables. The process is identical to the pre-
vious example except that the specific form of the integral limits is determined
by the specific function.

Example B.7

Consider the product of two independent, identically distributed,
zero-mean normal random variables,

Z = XY ,

Figure B.8: Integration of a joint probability density for two independent ran-
dom variables, X, and Y , to evaluate the CDF of Z = X + Y .

with a JDF given by

f (x, y) = fX (x)fY (y) = (1/(2πσ 2 )) e−(x²+y²)/(2σ²) .
The CDF of Z is

F (z) = P (Z ≤ z) = P (XY ≤ z) .

For z ≤ 0, this is the integral of the JDF over the exterior of the
hyperbolas defined by xy ≤ z ≤ 0, while for z ≥ 0, we integrate
over the interior of the complementary hyperbolas xy ≤ z ≥ 0. At
z = 0, the integral covers exactly half of the (x, y) plane (the 2nd
and 4th quadrants) and, because of the symmetry of the JDF, has
accumulated half of the probability, or 1/2.
The integral is thus

F (z) = 2 ∫_{−∞}^0 ∫_{z/x}^∞ (1/(2πσ 2 )) e−(x²+y²)/(2σ²) dy dx   (z ≤ 0)

F (z) = 1/2 + 2 ∫_{−∞}^0 ∫_0^{z/x} (1/(2πσ 2 )) e−(x²+y²)/(2σ²) dy dx   (z ≥ 0) .

As in the previous example for the sum of two random variables, the
PDF may be obtained from the CDF by differentiating with respect
to z.

B.4 Conditional Probability

In some situations we will be interested in the probability of an event happening
given that some other event has also happened.

Definition B.16
The conditional probability of A given that B has occurred is
given by

P (A|B) = P (A ∩ B)/P (B) .   (B.11)

Arguments based on conditional probabilities are often very helpful in com-
puting probabilities. The key to such arguments is the law of total probability.
Theorem B.2
Suppose that B1 , B2 , . . ., Bn are mutually disjoint and exhaustive
events. That is, Bi ∩ Bj = ∅ for i 6= j, and

∪ni=1 Bi = S .

Then

P (A) = Σni=1 P (A|Bi )P (Bi ) .   (B.12)

It is often necessary to reverse the order of conditioning in a conditional

probability. Bayes’ theorem provides a way to do this.

Theorem B.3

P (B|A) = P (A|B)P (B)/P (A) .   (B.13)

Example B.8
A screening test has been developed for a very serious but rare dis-
ease. If a person has the disease, then the test will detect the disease
with probability 99%. If a person does not have the disease, then
the test will give a false positive detection with probability 1%. The
probability that any individual in the population has the disease is
0.01%. Suppose that a randomly selected individual tests positive
for the disease. What is the probability that this individual actually
has the disease?
Let A be the event “the person tests positive”. Let B be the event
“The person has the disease”. Then we want to compute P (B|A).
By Bayes theorem,

P (B|A) = P (A|B)P (B)/P (A) .

We have that P (A|B) is 0.99, and that P (B) is 0.0001. To compute

P (A), we apply the law of total probability, considering separately
the probability of a diseased individual testing positive and the prob-
ability of someone without the disease testing positive.

P (A) = 0.99 × 0.0001 + 0.01 × 0.9999 = 0.010098.

P (B|A) = (0.99 × 0.0001)/0.010098 ≈ 0.0098 .
In other words, even after a positive screening test, it is still unlikely
that the individual will have the disease. The vast majority of those
individuals who test positive will in fact not have the disease.
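The computation in this example is simple to carry out in code; the sketch below just restates the arithmetic of the law of total probability and Bayes' theorem:

```python
# Bayes' theorem applied to the screening-test example
p_pos_given_disease = 0.99      # P(A|B): true positive rate
p_pos_given_healthy = 0.01      # false positive rate
p_disease = 0.0001              # P(B): prevalence in the population

# Law of total probability: P(A), the overall probability of testing positive
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes' theorem: P(B|A), probability of disease given a positive test
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos

assert abs(p_pos - 0.010098) < 1e-9
assert abs(p_disease_given_pos - 0.0098) < 1e-4
```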

The concept of conditioning can be extended from simple events to distribu-

tions and expected values of random variables. If the distribution of X depends
on the value of Y , then we can work with the conditional PDF fX|Y (x), the
conditional CDF FX|Y (a), and the conditional expected value E[X|Y ].
In this notation, we can also specify a particular value of Y by using the nota-
tion fX|Y =y , FX|Y =y , or E[X|Y = y]. In working with conditional distributions
and expected values, the following versions of the law of total probability can
be very useful.

Theorem B.4
Given two random variables X and Y , with the distribution of X
depending on Y , we can compute
P (X ≤ a) = ∫_{−∞}^∞ P (X ≤ a|Y = y)fY (y) dy   (B.14)

and

E[X] = ∫_{−∞}^∞ E[X|Y = y]fY (y) dy .   (B.15)

Example B.9
Let U be a random variable uniformly distributed on (1, 2). Let X
be an exponential random variable with parameter λ = U . We will
find the expected value of X.
E[X] = ∫_1^2 E[X|U = u]fU (u) du .

Since the expected value of an exponential random variable with pa-
rameter λ is 1/λ, and the PDF of a uniform random variable on
(1, 2) is fU (u) = 1,

E[X] = ∫_1^2 (1/u) du = ln 2 .

B.5 The Multivariate Normal Distribution

Definition B.17
If the random variables X1 , . . ., Xn have a multivariate normal
distribution (MVN), then the joint probability density function is

f (x) = (1/((2π)n/2 √(det(C)))) e−(x−µ)T C−1 (x−µ)/2   (B.16)

where µ = [µ1 , µ2 , . . . , µn ]T is a vector containing the expected
values along each of the coordinate directions of X1 , . . ., Xn , and C
contains the covariances between the random variables

Ci,j = Cov(Xi , Xj ) .   (B.17)

Notice that if C is singular, then the joint density function involves

a division by zero, and the joint density simply isn’t defined.

The vector µ and the covariance matrix C completely characterize the

MVN distribution. However, there are other multivariate distributions that are
not completely characterized by the expected values and covariance matrix.

Theorem B.5
Let X be a multivariate normal random vector with expected values
defined by the vector µ and covariance C, and let Y = AX. Then
Y is also multivariate normal, with
E[Y] = Aµ
Cov(Y) = ACAT .

Theorem B.6
If we have an n-dimensional MVN distribution with covariance ma-
trix C and expected value µ, and the covariance matrix is of full
rank, then the quantity
Z = (X − µ)^T C^{−1} (X − µ)
has a χ2 distribution with n degrees of freedom.
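Theorem B.6 can be checked by simulation. The sketch below uses Python with NumPy rather than MATLAB, and the particular µ and C are arbitrary illustrative choices; a χ² random variable with 3 degrees of freedom has mean 3 and variance 6.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.0, 1.0, -1.0])
C = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.3],
              [0.0, 0.3, 1.5]])
X = rng.multivariate_normal(mu, C, size=200_000)   # rows are MVN samples
D = X - mu
Z = np.sum((D @ np.linalg.inv(C)) * D, axis=1)     # (x - mu)^T C^{-1} (x - mu)
print(Z.mean(), Z.var())                           # near 3 and 6 (chi-square, 3 df)
```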

Example B.10
We can generate vectors of random numbers according to an MVN
distribution by using the following process, which is very similar to
the process for generating random normal scalars.
1. Find the Cholesky factorization C = LLT .
2. Let Z be a vector of n independent N (0, 1) random numbers.
3. Let X = µ + LZ.
Because E[Z] = 0, E[X] = µ + L0 = µ. Also, since Cov(Z) = I and
Cov(µ) = 0, Cov(X) = Cov(µ + LZ) = LIL^T = LL^T = C.
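The three steps above can be sketched in code. This version uses Python with NumPy rather than the MATLAB used elsewhere in this book, with illustrative values of µ and C:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
C = np.array([[2.0, 0.6],
              [0.6, 1.0]])              # covariance matrix (positive definite)
L = np.linalg.cholesky(C)              # step 1: C = L L^T
Z = rng.standard_normal((2, 100_000))  # step 2: independent N(0, 1) entries
X = mu[:, None] + L @ Z                # step 3: columns are MVN(mu, C) vectors
sample_mean = X.mean(axis=1)
sample_cov = np.cov(X)
print(sample_mean)                     # close to mu
print(sample_cov)                      # close to C
```

The sample mean and sample covariance of the generated columns recover µ and C to within sampling error.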

B.6 The Central Limit Theorem

Theorem B.7
Let X1 , X2 , . . ., Xn be independent and identically distributed (IID)
random variables with a finite expected value µ and variance σ². Let

Zn = (X1 + X2 + · · · + Xn − nµ) / (σ√n) .

In the limit as n approaches infinity, the distribution of Zn ap-
proaches the standard normal distribution.

Figure B.9: Histogram of a sample data set.

The central limit theorem shows why approximately normal distributions appear so
frequently in nature; the superposition of numerous independent random vari-
ables produces an approximately normal PDF, regardless of the PDF of the
underlying IID variables.
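A quick simulation illustrates the theorem (a Python/NumPy sketch, not from the book): averaging 50 exponential random variables, which are strongly skewed individually, already produces a Zn that is nearly standard normal, as measured by how often |Zn| falls below 1.96.

```python
import numpy as np

rng = np.random.default_rng(0)
reps, n = 100_000, 50
samples = rng.exponential(scale=1.0, size=(reps, n))   # IID with mu = 1, sigma = 1
z = (samples.sum(axis=1) - n * 1.0) / (1.0 * np.sqrt(n))
coverage = np.mean(np.abs(z) < 1.96)
print(coverage)    # close to 0.95, as for a standard normal
```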

B.7 Testing for Normality

Many of the statistical procedures that we will use assume that data are nor-
mally distributed. Fortunately, the statistical techniques that we describe are
generally robust in the face of small deviations from normality. Nevertheless, it is
important to be able to examine a data set to see whether or not the distribution
is approximately normal.
Plotting a histogram of the data provides a quick view of the distribution.
For example, Figure B.9 shows the histogram from a set of 100 data points.
The characteristic bell shaped curve in the histogram makes it seem that these
data might be normally distributed. The sample mean is -0.03 and the sample
standard deviation is 1.26.
The Q-Q Plot provides a more precise graphical test of whether a set of ob-
servations of a random variable could have come from a particular distribution.
The data points,
d = [d1 , d2 , . . . , dn ]T ,
are first sorted in numerical order from smallest to largest into a vector y, which
is plotted versus

xi = F −1 ((i − 0.5)/n) (i = 1, 2, . . . , n) ,

Figure B.10: Q–Q plot for the sample data set.

Figure B.11: Q–Q plot for a data set with a skewed distribution.


where F (x) is the CDF of the distribution against which we wish to compare
our observations.
If we are testing to see if the elements of d could have come from the normal
distribution, then F (x) is the CDF for the standard normal distribution
Z x
1 1 2
FN (x) = √ e− 2 z dz .
2π −∞
If the elements of d are normally distributed, the points (xi , yi ) will fall along a
straight line with a slope and intercept determined by the standard deviation
and expected value, respectively, of the normal distribution that produced the
data. The MATLAB statistics toolbox includes a command, qqplot, that can
be used to generate a Q–Q plot.
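The construction of the Q–Q plot points can also be sketched directly. This is a Python/NumPy sketch rather than the MATLAB qqplot command mentioned above; the standard normal quantiles F⁻¹((i − 0.5)/n) come from the Python standard library's NormalDist.

```python
import numpy as np
from statistics import NormalDist   # stdlib inverse normal CDF

def qq_points(d):
    """Sorted data y paired with standard normal quantiles x."""
    y = np.sort(np.asarray(d, dtype=float))           # sorted observations
    n = len(y)
    p = (np.arange(1, n + 1) - 0.5) / n               # (i - 0.5)/n
    x = np.array([NormalDist().inv_cdf(pi) for pi in p])
    return x, y

rng = np.random.default_rng(0)
x, y = qq_points(rng.standard_normal(100))
# For normal data the points fall near a line whose slope estimates the
# standard deviation and whose intercept estimates the expected value.
slope, intercept = np.polyfit(x, y, 1)
print(slope, intercept)
```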
Figure B.10 shows the Q − Q plot for our sample data set. The point at the
upper right corner of the plot has the value 4.62, but under a normal distribution
with expected value -0.03 and standard deviation 1.26, the largest of the 100
points should be about 2.5. Similarly, some points in the lower left corner stray
from the line. It is apparent that the data set contains more extreme values
than the normal distribution would predict. In fact, these data were generated
according to a t distribution with 5 degrees of freedom, which has broader tails
than the normal distribution (Figure B.6).
Figure B.11 shows a more extreme example. These data are clearly not normally
distributed. The Q–Q plot shows that the distribution is skewed to the right. It
would be unwise to treat these data as if they were normally distributed.
There are a number of statistical tests for normality. These tests, which include
the Kolmogorov–Smirnov test, the Anderson–Darling test, and the Lilliefors test,
each produce probabilistic measures called p-values. A small p-value indicates that
the observed data would be unlikely if the distribution were in fact normal,
while a large p-value is consistent with normality. In practice, these tests may
declare a data set to be non-normal even though the data are “normal enough”
for practical applications. For example, the Lilliefors test implemented by the
MATLAB statistics toolbox command lillietest rejects the data in Figure B.10
with a p-value of 0.04.

B.8 Estimating Means and Confidence Intervals

Given a collection of noisy measurements m1 , m2 , . . ., mn of some quantity
of interest, how can we estimate the true value m, and how uncertain is our
estimate? This is a classic problem in statistics.
We will assume first that the measurement errors are independent and nor-
mally distributed with expected value 0 and some unknown standard deviation
σ. Equivalently, the measurements themselves are normally distributed with
expected value m and standard deviation σ.
We begin by computing the measurement average

m̄ = (m1 + m2 + · · · + mn)/n .

This sample mean m̄ will serve as our estimate of m. We will also compute
an estimate s of the standard deviation,

s = √( ∑_{i=1}^{n} (mi − m̄)² / (n − 1) ) .
The key to our approach to estimating m is the following theorem.

Theorem B.8
(The Sampling Theorem) Under the assumption that measurements
are independent and normally distributed with expected value m
and standard deviation σ, the random quantity
t = (m̄ − m) / (s/√n)

has a Student’s t distribution with n − 1 degrees of freedom.

If we had the true standard deviation σ instead of the estimate s, then

t would in fact be normally distributed with expected value 0 and standard
deviation 1. This doesn’t quite work out because we’ve used an estimate s of
the standard deviation. For smaller values of n, the estimate s is less accurate,
and thus the t distribution has fatter tails than the standard normal distribution.
As n goes to infinity, s becomes a better estimate of σ and it can be shown that
the t distribution converges to a standard normal distribution.
Let tn−1, 0.975 be the 97.5%-tile of the t distribution with n − 1 degrees of
freedom and let tn−1, 0.025 be the 2.5%-tile of the t distribution. Then

P ( tn−1, 0.025 ≤ (m̄ − m)/(s/√n) ≤ tn−1, 0.975 ) = 0.95 .

This can be rewritten as

P ( tn−1, 0.025 s/√n ≤ m̄ − m ≤ tn−1, 0.975 s/√n ) = 0.95 .

We can construct the 95% confidence interval for m as the interval from
m̄ + tn−1, 0.025 s/√n to m̄ + tn−1, 0.975 s/√n. Since the t distribution is
symmetric about zero, tn−1, 0.025 = −tn−1, 0.975 , and this interval can also be
written as m̄ − tn−1, 0.975 s/√n to m̄ + tn−1, 0.975 s/√n.
As we have seen, there is a 95% probability that when we construct the
confidence interval, that interval will contain the true mean, m. Note that we
have not said that, given a particular set of data and the resulting confidence
interval, there is a 95% probability that m is in the confidence interval. The
semantic difficulty here is that m is not a random variable, but is rather some
true fixed quantity that we are estimating; the measurements m1 , m2 , . . ., mn ,
and the calculated m̄, s and confidence interval are the random quantities.

Example B.11

Suppose that we want to estimate the mass of an object and obtain
the following ten measurements of the mass (in grams):

9.98 10.07 9.94 10.22 9.98

10.01 10.11 10.01 9.99 9.92

The sample mean is m̄ = 10.02 g. The sample standard deviation is
s = 0.0883 g. The 97.5%-tile of the t distribution with n − 1 = 9 degrees
of freedom is (from a t-table or function) 2.262. Thus our 95%
confidence interval for the mean is

[ m̄ − 2.262 s/√n, m̄ + 2.262 s/√n ] =

[ 10.02 − 2.262 × 0.0883/√10, 10.02 + 2.262 × 0.0883/√10 ] g =

[9.96, 10.08] g .
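The arithmetic in Example B.11 can be checked numerically. The sketch below uses Python with NumPy rather than the MATLAB used elsewhere in this book, and it takes the tabulated value t9, 0.975 = 2.262 from the example rather than computing it.

```python
import numpy as np

m = np.array([9.98, 10.07, 9.94, 10.22, 9.98,
              10.01, 10.11, 10.01, 9.99, 9.92])   # masses in grams
n = len(m)
mbar = m.mean()
s = m.std(ddof=1)                 # sample standard deviation (n - 1 divisor)
t975 = 2.262                      # 97.5%-tile of t with 9 degrees of freedom
half = t975 * s / np.sqrt(n)
lo, hi = mbar - half, mbar + half
print(mbar, s)                    # sample mean and standard deviation
print(lo, hi)                     # close to the interval in the example
```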

The above procedure for constructing a confidence interval for the mean us-
ing the t distribution was based on the assumption that the measurements were
normally distributed. In situations where the data are not normally distributed
this procedure can fail in a very dramatic fashion. However, it may be safe
to generate an approximate confidence interval using this procedure if (1) the
number n of data is large (50 or more) or (2) the distribution of the data is not
strongly skewed and n is at least 15.

B.9 Hypothesis Tests

In some situations we want to test whether or not a set of data could
reasonably have come from a normal distribution with expected
value µ0 . Applying the sampling theorem (B.8), we see that if our data did
come from a normal distribution with expected value µ0 , then there would be a
95% probability that

tobs = (m̄ − µ0) / (s/√n)

would lie in the interval

[F_t^{−1}(0.025), F_t^{−1}(0.975)] = [tn−1, 0.025 , tn−1, 0.975 ]

and only a 5% probability that t would lie outside this interval. Equivalently,
there is only a 5% probability that |tobs | ≥ tn−1, 0.975 .
This leads to the t–test: If |tobs | ≥ tn−1, 0.975 , then we reject the hypothesis
that µ = µ0 . On the other hand, if |tobs | < tn−1, 0.975 , then we cannot reject the
hypothesis that µ = µ0 . Although the 95% confidence level is traditional, we
can also perform the t-test at a 99% or some other confidence level. In general,
if we want a confidence level of 1 − α, then we compare |tobs | to tn−1, 1−α/2 .
In addition to reporting whether or not a set of data passes a t-test, it is good
practice to report the associated p–value. The p-value associated with a
t-test is the largest value of α for which the data pass the t-test. Equivalently,
it is the probability that we could have gotten a larger value of |t| than we have
observed, given that all of our assumptions are correct.

Example B.12

Consider the following data

1.2944 −0.3362 1.7143 2.6236 0.3082

1.8580 2.2540 −0.5937 −0.4410 1.5711

These appear to be roughly normally distributed, with a mean that

seems to be larger than 0. We will test the hypothesis µ = 0. The t
statistic is

tobs = (m̄ − µ0) / (s/√n)

which for this data set is

tobs = (1.0253 − 0) / (1.1895/√10) ≈ 2.725 .

Because |tobs | is larger than t9, 0.975 = 2.262, we reject the hypoth-
esis that these data came from a normal distribution with expected
value 0 at the 95% confidence level.
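The test statistic in Example B.12 can be reproduced with a few lines of code. As with the other sketches here, this is Python with NumPy rather than MATLAB, and t9, 0.975 = 2.262 is taken from a t table rather than computed.

```python
import numpy as np

d = np.array([1.2944, -0.3362, 1.7143, 2.6236, 0.3082,
              1.8580, 2.2540, -0.5937, -0.4410, 1.5711])
n = len(d)
t_obs = (d.mean() - 0.0) / (d.std(ddof=1) / np.sqrt(n))   # test mu = 0
t_crit = 2.262                     # t_{9, 0.975}
reject = abs(t_obs) >= t_crit
print(t_obs, reject)               # about 2.725, True: reject mu = 0
```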

The t-test (or any other statistical test) can fail in two ways. First, it could
be that the hypothesis that µ = µ0 is true, but our particular data set contained
some unlikely values and failed the t-test. Rejecting the hypothesis when it is
in fact true is called a type I error . We can control the probability of a type
I error by decreasing α.
The second way in which the t-test can fail is more difficult to control. It
could be that the hypothesis µ = µ0 was false, but the sample mean was close
enough to µ0 to pass the t-test. In this case, we have a type II error. The
probability of a type II error depends very much on how close the true mean is
to µ0 . If the true mean µ = µ1 is very close to µ0 , then a type II error is quite
likely. However, if the true mean µ = µ1 is very far from µ0 then a type II error
will be less likely. Given a particular alternative hypothesis, µ = µ1 , we call
the probability of a type II error β(µ1 ), and call the probability of not making
a type II error (1 − β(µ1 )) the power of the test. We can estimate β(µ1 ) by
repeatedly generating sets of n random numbers with µ = µ1 and performing
the hypothesis test on the sets of random numbers.
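The simulation approach to estimating the power just described can be sketched as follows. This is a Python/NumPy sketch rather than MATLAB; the alternative µ1 = 0.8 with n = 10, σ = 1, and µ0 = 0 is an illustrative choice, and t9, 0.975 = 2.262 is taken from a t table.

```python
import numpy as np

rng = np.random.default_rng(0)
mu0, mu1, n = 0.0, 0.8, 10
t_crit = 2.262                          # t_{9, 0.975}
trials, rejections = 10_000, 0
for _ in range(trials):
    x = rng.normal(mu1, 1.0, size=n)    # data actually come from mu = mu1
    t = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(n))
    if abs(t) >= t_crit:                # t-test rejects mu = mu0
        rejections += 1
power = rejections / trials             # estimate of 1 - beta(mu1)
print(power)
```

The fraction of trials in which the test rejects µ = µ0 estimates the power 1 − β(µ1).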

The results of a hypothesis test should always be reported with care. It is
important to discuss and justify any assumptions (such as the normality
assumption made in the t-test) underlying the test. The p-value should always be
reported along with whether or not the hypothesis was rejected. If the hypoth-
esis was not rejected and some particular alternative hypothesis is available, it
is good practice to estimate the power of the hypothesis test against this alter-
native hypothesis. Confidence intervals for the mean should be reported along
with the results of a hypothesis test.
It is important to distinguish between the statistical significance of a hy-
pothesis test and the actual magnitude of any difference between the observed
mean and the hypothesized mean. For example, with very large n it is nearly
always possible to achieve statistical significance at the 95% confidence level,
even though the observed mean may differ from the hypothesized value by only 1% or less.

B.10 Exercises
1. Compute the expected value and variance of a uniform random variable
in terms of the parameters a and b.

2. Compute the CDF of an exponential random variable with parameter λ.

3. Compute the expected value and variance of an exponential random vari-

able in terms of the parameter λ.

4. Suppose that X is a random variable. Show that for any c > 0,

P (|X| ≥ c) ≤ E[|X|]/c .

This result is known as Markov's inequality.

5. Suppose that X is a random variable with expected value µ and variance
σ². Show that for any k > 0,

P (|X − µ| ≥ kσ) ≤ 1/k² .

This result is known as Chebychev's inequality.

6. Construct an example of two random variables X and Y such that X
and Y are uncorrelated, but not independent. Hint: What happens if X
is symmetric about zero and Y = X²?

7. Show that
Var(X) = E[X 2 ] − E[X]2

8. Show that
Cov(aX, Y ) = aCov(X, Y )
and that
Cov(X + Y, Z) = Cov(X, Z) + Cov(Y, Z)

9. Show that the PDF for the sum of two independent uniform random vari-
ables on [a, b] = [0, 1] is

f (x) =
  0        (x ≤ 0)
  x        (0 ≤ x ≤ 1)
  2 − x    (1 ≤ x ≤ 2)
  0        (x ≥ 2) .

10. On a multiple choice question with four possible answers, there is a 75%
chance that a student will know the correct answer and 25% chance that
the student will randomly pick one of the four answers. What is the
probability that the student will get a correct answer? Given that the
answer was correct, what is the probability that the student actually knew
the correct answer?
11. Suppose that X and Y are independent random variables. Use condition-
ing to find a formula for the CDF of X + Y in terms of the PDF’s and
CDF’s of X and Y .
12. Suppose that X is a vector of 2 random variables with an MVN distribu-
tion (expected value µ, covariance C), and that A is a 2 by 2 matrix. Use
the properties of expected value and covariance to show that Y = AX has
expected value Aµ and covariance ACAT .
13. Consider a least squares problem Ax = b, where the A matrix is known
exactly, but the right hand side vector b includes some random measure-
ment noise. Thus
bnoise = btrue + ε
where ε is a vector of independent N (0, 1) random numbers. We would
like to find xtrue such that

Axtrue = btrue .

Unfortunately, since we do not have btrue , the best we can do is to
compute a least squares estimate using the normal equations and the noisy
data:

xL2 = (A^T A)^{−1} A^T bnoise .
What is the expected value of xL2 ? What is the covariance matrix?
14. Consider the following data, which we will assume are drawn from a normal distribution.

-0.4326 -1.6656 0.1253 0.2877 -1.1465

1.1909 1.1892 -0.0376 0.3273 0.1746

Find the sample mean and standard deviation. Use these to construct a
95% confidence interval for the mean. Test the hypothesis H0 : µ = 0
at the 95% confidence level. What do you conclude? What was the
corresponding p-value?
15. Using MATLAB, repeat the following experiment 1,000 times. Use the
statistics toolbox function exprnd() to generate 5 exponentially distributed
random numbers (B.3) with expected value 10 (λ = 1/10). Use these 5 random
numbers to generate a 95% confidence interval for the mean. How many times out of the
1,000 experiments did the 95% confidence interval cover the expected value
of 10? What happens if you instead generate 50 exponentially distributed
random numbers at a time? Discuss your results.
16. Using MATLAB, repeat the following experiment 1,000 times. Use the
randn function to generate a set of 10 normally distributed random num-
bers with expected value 10.5 and standard deviation 1. Perform a t-test
of the hypothesis µ = 10 at the 95% confidence level. How many type II
errors were committed? What is the approximate power of the t-test with
n = 10 against the alternative hypothesis µ = 10.5? Discuss your results.
17. Using MATLAB, repeat the following experiment 1,000 times. Using the
exprnd() function of the statistics toolbox, generate 5 exponentially dis-
tributed random numbers with expected value 10. Take the average of the
5 random numbers. Plot a histogram and a probability plot of the 1,000
averages that you computed. Are the averages approximately normally
distributed? Explain why or why not. What would you expect to happen
if you took averages of 50 exponentially distributed random numbers at a
time? Try it and discuss the results.

B.11 Notes and Further Reading

Most of the material in this appendix can be found in virtually any introductory
textbook in probability and statistics. Some recent textbooks include [BE00,
CB02]. The multivariate normal distribution is a somewhat more advanced topic
that is often ignored in introductory courses. [Sea82] has a good discussion of
the multivariate normal distribution and its properties. Numerical methods
for probability and statistics are a specialized topic. Two standard references
include [JG80, Thi88].
Appendix C

Glossary of Mathematical Notation

• A, B, C, ... : Matrices.
• Ai,· : ith row of matrix A.
• A·,i : ith column of matrix A.
• Ai,j : (i, j)th element of matrix A.
• E[X]: Expected value of the random variable X.
• G† : Generalized inverse of the matrix G calculated from the SVD.
• A, B, C, ... : Functions; Random Variables.
• N (µ, σ 2 ): Normal probability density function with expected value µ and
variance σ 2 .
• N (A): Null space of the matrix A.
• rank(A): Rank of the matrix A.
• R(A): Range of the matrix A.
• Rn : Vector space of dimension n.
• A, B, C, ... : Fourier transforms.
• a, b, c, ... : Column vectors.
• ai : ith element of vector a.
• ā: Mean value of the elements in vector a.
• Cov(x), Cov(X, Y ): Covariance between the elements of vector x or
between the random variables X and Y .


• ρ(x), ρ(X, Y ): Correlation between the elements of vector x or between

the random variables X and Y .
• si : Singular values.
• Var(X): Variance of the random variable X.

• α, β, γ, ... : Scalars.
• a, b, c, ... : Scalars or functions.
• χ2 : Chi-square random variable.
• λi : Eigenvalues.
• µ(1) : 1-norm misfit measure.
• σ: Standard deviation.
• σ 2 : Variance.
• ν: Degrees of freedom.
• CDF: Cumulative distribution function.
• JDF: Joint probability density function.
• MVN: Multivariate normal probability density function.
• PDF: Probability density function.
• SVD: Singular value decomposition.
• tν,p : p-percentile of the t distribution with ν degrees of freedom.
• Fν1 ,ν2 ,p : p-percentile of the F distribution with ν1 and ν2 degrees of freedom.

[AH91] U. Amato and W. Hughes. Maximum entropy regularization of fred-

holm integral equations of the 1st kind. Inverse Problems, 7(6):793–
808, 1991.

[Ast88] R. C. Aster. On projecting error ellipsoids. Bulletin of the Seismo-

logical Society of America, 78(3):1373–1374, 1988.

[Bau87] Johann Baumeister. Stable Solution of Inverse Problems. Vieweg,

Braunschweig, 1987.

[BDBS89] P. T. Boggs, J. R. Donaldson, R. H. Byrd, and R. B. Schnabel. ODR-

PACK software for weighted orthogonal distance regression. ACM
Transactions on Mathematical Software, 15(4):348–364, 1989. The
software is available at

[BE00] Lee J. Bain and Max Englehardt. Introduction to Probability and

Mathematical Statistics. Brooks/Cole, Pacific Grove, CA, second
edition, 2000.

[Ber00a] J. G. Berryman. Analysis of approximate inverses in tomography

I. Resolution analysis. Optimization and Engineering, 1(1):87–115, 2000.

[Ber00b] J. G. Berryman. Analysis of approximate inverses in tomography

II. Iterative inverses. Optimization and Engineering, 1(4):437–473, 2000.

[Bjö96] Åke Björck. Numerical Methods for Least Squares Problems. SIAM,
Philadelphia, 1996.

[Bra00] R. Bracewell. The Fourier Transform and its Applications. McGraw–

Hill, Boston, third edition, 2000.

[Bri95] Daniel Shenon Briggs. High Fidelity Deconvolution of Moderately

Resolved Sources. PhD thesis, New Mexico Institute of Mining and
Technology, 1995.


[BTN01] Aharon Ben-Tal and Arkadi Nemirovski. Lectures on Modern Convex

Optimization: Analysis, Algorithms, and Engineering Applications.
SIAM, Philadelphia, 2001.

[BW88] Douglas M. Bates and Donald G. Watts. Nonlinear Regression anal-

ysis and its applications. Wiley, New York, 1988.

[Car87] Philip Carrion. Inverse Problems and Tomography in Acoustics and

Seismology. Penn Publishing Company, Atlanta, 1987.

[CB02] George Casella and Roger L. Berger. Statistical Inference. Duxbury,

Pacific Grove, CA, second edition, 2002.

[CE85] T. J. Cornwell and K. F. Evans. A simple maximum entropy de-

convolution algorithm. Astronomy and Astrophysics, 143(1):77–83, 1985.

[Cla80] B. G. Clark. An efficient implementation of the algorithm “CLEAN”.

Astronomy and Astrophysics, 89(3):377–378, 1980.

[CPC87] Steven C. Constable, Robert L. Parker, and Catherine G. Consta-

ble. Occam’s inversion: A practical algorithm for generating smooth
models from electromagnetic sounding data. Geophysics, 52(3):289–
300, 1987.

[CZ97] Y. Censor and S. A. Zenios. Parallel Optimization: Theory, Algo-

rithms, and Applications. Oxford University Press, New York, 1997.

[Dem97] James W. Demmel. Applied Numerical Linear Algebra. SIAM,

Philadelphia, 1997.

[DS96] J. E. Dennis, Jr. and Robert B. Schnabel. Numerical Methods for Un-
constrained Optimization and Nonlinear Equations. SIAM, Philadel-
phia, 1996.

[DS98] Norman R. Draper and Harry Smith. Applied Regression Analysis.

Wiley, New York, third edition, 1998.

[EHN96] Heinz W. Engl, Martin Hanke, and Andreas Neubauer. Regularization
of Inverse Problems. Kluwer Academic Publishers, Boston, 1996.

[EL97] Laurent ElGhaoui and Hervé Lebret. Robust solutions to least–

squares problems with uncertain data. SIAM Journal on Matrix
Analysis and Applications, 18(4):1035–1064, 1997.

[Eng93] Heinz W. Engl. Regularization methods for the stable solution of

inverse problems. Surveys on Mathematics for Industry, 3:71–143, 1993.

[GCSR03] Andrew Gelman, John B. Carlin, Hal S. Stern, and Donald B. Rubin.
Bayesian Data Analysis. Chapman & Hall/CRC, Boca Raton, FL,
second edition, 2003.

[GL96] Gene H. Golub and Charles F. Van Loan. Matrix Computations.

Johns Hopkins University Press, Baltimore, third edition, 1996.

[Gro93] Charles W. Groetsch. Inverse Problems in the Mathematical Sci-

ences. Vieweg, Braunschweig, 1993.

[Gro99] Charles W. Groetsch. Inverse Problems: Activities for Undergradu-

ates. Mathematical Association of America, Washington, 1999.

[GRS96] W.R. Gilks, S. Richardson, and D. J. Spiegelhalter. Markov Chain

Monte Carlo in Practice. Chapman & Hall, London, 1996.

[GS97] Wences P. Gouveia and John A. Scales. Resolution of seismic wave-

form inversion: Bayes versus Occam. Inverse Problems, 13(2):323–
349, 1997.

[Han92] Per Christian Hansen. Analysis of discrete ill-posed problems by

means of the L-curve. SIAM Review, 34(4):561–580, December 1992.

[Han94] Per Christian Hansen. Regularization tools: A MAT-

LAB package for analysis and solution of discrete ill–
posed problems. Numerical Algorithms, 6(I–II):1–35, 1994.

[Han98] Per Christian Hansen. Rank–Deficient and Discrete Ill-Posed Problems:
numerical aspects of linear inversion. SIAM, Philadelphia, 1998.

[HBR+ 02] J.M.H. Hendrickx, B. Borchers, J.D. Rhoades, D.L. Corwin, S.M.
Lesch, A.C. Hilgendorf, and J. Schlue. Inversion of soil conductivity
profiles from electromagnetic induction measurements; theory and
experimental verification. Soil Science Society of America Journal,
66(3):673–685, 2002.

[Her80] Gabor T. Herman. Image Reconstruction from Projections. Aca-

demic Press, San Francisco, 1980.

[HH93] Martin Hanke and Per Christian Hansen. Regularization methods for
large–scale problems. Surveys on Mathematics for Industry, 3:253–
315, 1993.

[HM96] Per Christian Hansen and K. Mosegaard. Piecewise polynomial so-

lutions without a priori break points. Numerical Linear Algebra with
Applications, 3(6):513–524, 1996.

[Hög74] J. A. Högbom. Aperture synthesis with a non-regular distribution of

interferometer baselines. Astronomy and Astrophysics Supplement,
15:417–426, 1974.

[HR96] M. Hanke and T. Raus. A general heuristic for choosing the regular-
ization parameter in ill–posed problems. SIAM Journal on Scientific
Computing, 17(4):956–972, 1996.

[HSH+ 90] J. A. Hildebrand, J. M. Stevenson, P. T. C. Hammer, M. A. Zum-

berge, R. L. Parker, C. J. Fox, and P. J. Meis. A sea–floor and
sea–surface gravity survey of axial volcano. Journal of Geophysical
Research– Solid Earth and Planets, 95(B8):12751–12763, 1990.

[Hub96] Peter J. Huber. Robust Statistical Procedures. SIAM, Philadelphia,

second edition, 1996.

[HV91] Sabine Van Huffel and Joos Vandewalle. The Total Least Squares
Problem: computational aspects and analysis. SIAM, Philadelphia, 1991.

[JFM87] W. H. Jeffreys, M. J. Fitzpatrick, and B. E. McArthur. Gauss-

fit - a system for least squares and robust estimation. Celes-
tial Mechanics, 41(1–4):39–49, 1987. The software is available at

[JG80] William J. Kennedy Jr. and James E. Gentle. Statistical Computing.

Marcel Dekker, New York, 1980.

[Kir96] Andreas Kirsch. An Introduction to the Mathematical Theory of

Inverse Problems. Springer Verlag, New York, 1996.

[KK92] J. N. Kapur and H. K. Kesavan. Entropy Optimization Principles

with Applications. Academic Press, Boston, 1992.

[KPR+ 91] F. J. Klopping, G. Peter, D. S. Robertson, K. A. Berstis, R. E.

Moose, and W. E. Carter. Improvements in absolute gravity obser-
vations. Journal of Geophysical Research– Solid Earth and Planets,
96(B5):8295–8303, 1991.

[KS01] Avinash C. Kak and Malcolm Slaney. Principles of Computerized

Tomographic Imaging. SIAM, Philadelphia, 2001.

[Lay03] David C. Lay. Linear Algebra and its Applications. Addison–Wesley,

Boston, third edition, 2003.

[LH95] Charles L. Lawson and Richard J. Hanson. Solving Least Squares

Problems. SIAM, Philadelphia, 1995.

[Lin88] Laurence R. Lines, editor. Inversion of Geophysical Data. Society of

Exploration Geophysicists, Tulsa, OK, 1988.

[LL00] Zhi-Pei Liang and Paul C. Lauterbur. Principles of Magnetic Reso-

nance Imaging: A Signal Processing Perspective. IEEE Press, New
York, 2000.
[Men89] William Menke. Geophysical Data Analysis: discrete inverse theory,
volume 45 of International Geophysics Series. Academic Press, San
Diego, revised edition, 1989.
[Mey00] Carl D. Meyer. Matrix Analysis and Applied Linear Algebra. SIAM,
Philadelphia, 2000.
[MGH80] J. J. More, B. S. Garbow, and K. E. Hillstrom. User guide for
MINPACK-1. Technical Report ANL–80–74, Argonne National Lab-
oratory, 1980.
[Moo20] E. H. Moore. On the reciprocal of the general algebraic matrix.
Bulletin of the American Mathematical Society, pages 394–395, 1920.
[Mor84] V. A. Morozov. Methods for Solving Incorrectly Posed Problems.
Springer-Verlag, New York, 1984.
[Mye90] Raymond H. Myers. Classical and Modern Regression with Applica-
tions. PWS Kent, Boston, second edition, 1990.
[Nat01] Frank Natterer. The Mathematics of Computerized Tomography.
SIAM, Philadelphia, 2001.
[NBW00] Roseanna Neupauer, Brian Borchers, and John L. Wilson. Com-
parison of inverse methods for reconstructing the release history of
a groundwater contamination source. Water Resources Research,
36(9):2469–2475, 2000.
[Neu98] Arnold Neumaier. Solving ill-conditioned and singular linear systems:
A tutorial on regularization. SIAM Review, 40(3):636–666, 1998.
[NN86] Ramesh Narayan and Rajaram Nityananda. Maximum entropy im-
age restoration in astronomy. In Annual Review of Astronomy and
Astrophysics, volume 24, pages 127–170. Annual Reviews Inc., Palo
Alto, CA, 1986.
[Nol87] Guust Nolet, editor. Seismic Tomography With Applications in
Global Seismology and Exploration Geophysics. D. Reidel, Boston, 1987.
[NS96] Stephen G. Nash and Ariela Sofer. Linear and Nonlinear Programming.
McGraw–Hill, New York, 1996.
[Par91] R. L. Parker. A theory of ideal bodies for seamount magnetism.
Journal of Geophysical Research– Solid Earth, 96(B10):16101–16112, 1991.

[Par94] Robert L. Parker. Geophysical Inverse Theory. Princeton University

Press, Princeton, NJ, 1994.

[Pen55] R. Penrose. A generalized inverse for matrices. Proc. Cambridge.

Philos. Soc., pages 406–413, 1955.

[PM80] R. L. Parker and M. K. McNutt. Statistics for the one-norm misfit

measure. Journal of Geophysical Research, pages 4429–4430, 1980.

[Pri83] Maurice B. Priestley. Spectral Analysis and Time Series. Academic

Press, London, 1983.

[PS82a] C. C. Paige and M. A. Saunders. Algorithm 583, LSQR: Sparse linear
equations and least squares problems. ACM Transactions on
Mathematical Software, 8(2):195–209, 1982.

[PS82b] C. C. Paige and M. A. Saunders. LSQR: An algorithm for sparse

linear equations and sparse least squares. ACM Transactions on
Mathematical Software, 8(1):43–71, 1982.

[PZ89] R. L. Parker and M. A. Zumberge. An analysis of geophysical experiments
to test Newton's law of gravity. Nature, 342:29–32, 1989.

[RTK+ 99] W. Rison, R. J. Thomas, P. R. Krehbiel, T. Hamlin, and J. Harlin.

A GPS–based three-dimensional lightning mapping system: Initial
observations in central New Mexico. Geophysical Research Letters,
26(23):3573–3576, 1999.

[Rud87] Walter Rudin. Real and Complex Analysis. McGraw–Hill, New York,
third edition, 1987.

[Sau90] R. J. Sault. A modification of the Cornwell and Evans maximum

entropy algorithm. Astrophysical Journal, 354(2):L61–L63, 1990.

[SB84] J. Skilling and R. K. Bryan. Maximum entropy image reconstruc-

tion: General algorithm. Monthly Notices of the Royal Astronomical
Society, 211:111–124, 1984.

[Sea82] Shayle R. Searle. Matrix Algebra Useful for Statistics. Wiley, New
York, 1982.

[SGT88] John A. Scales, Adam Gersztenkorn, and Sven Treitel. Fast lp so-
lution of large, sparse, linear systems: Application to seismic travel
time tomography. Journal of Computational Physics, 75(2):314–333, 1988.

[Sha72] C. B. Shaw, Jr. Improvement of the resolution of an instrument by

numerical solution of an integral equation. Journal of Mathematical
Analysis and Applications, 37:83–112, 1972.

[She94] Jonathan Richard Shewchuk. An introduction to the conjugate gra-

dient method without the agonizing pain, edition 1–1/4. Technical
report, School of Computer Science, Carnegie Mellon University, Au-
gust 1994.

[Siv96] D. S. Sivia. Data Analysis, A Bayesian Tutorial. Oxford University

Press, Oxford, 1996.

[SK94] T. H. Skaggs and Z. J. Kabala. Recovering the release history of a

groundwater contaminant. Water Resources Research, 30(1):71–79, 1994.

[SK95] T. H. Skaggs and Z. J. Kabala. Recovering the history of a ground-

water contaminant plume: Method of quasi-reversibility. Water Re-
sources Research, 31:2669–2673, 1995.

[SP87] P. B. Stark and R. L. Parker. Velocity bounds from statistical es-

timates of gamma(p) and x(p). Journal of Geophysical Research–
Solid Earth and Planets, 92(B3):2713–2719, 1987.

[SP88] P. B. Stark and R. L. Parker. Correction to ’velocity bounds from

statistical estimates of gamma(p) and x(p)’. Journal of Geophysical
Research, 93:13821–13822, 1988.

[SP95] P. B. Stark and R. L. Parker. Bounded–variable least–squares: An

algorithm and applications. Computational Statistics, 10(2):129–141, 1995.

[SS97] John Scales and Martin Smith. DRAFT: Geophysical inverse theory.
http://landau.Mines.EDU/ samizdat/inverse theory/, 1997.

[Str88] Gilbert Strang. Linear Algebra and its Applications. Harcourt Brace
Jovanovich Inc., San Diego, third edition, 1988.

[TA77] A. N. Tikhonov and V. Y. Arsenin. Solutions of Ill-Posed Problems.

Halsted Press, New York, 1977.

[Tar87] Albert Tarantola. Inverse Problem Theory: Methods for Data Fitting
and Model Parameter Estimation. Elsevier, New York, 1987.

[TB97] Loyd N. Trefethen and David Bau. Numerical Linear Algebra. SIAM,
Philadelphia, 1997.

[TG65] A.N. Tikhonov and V.B. Glasko. Use of the regularization method in
non–linear problems. USSR Computational Mathematics and Math-
ematical Physics, 5(3):93–107, 1965.

[TG87] A. N. Tikhonov and A. V. Goncharsky, editors. Ill–Posed Problems

in the Natural Sciences. MIR Publishers, Moscow, 1987.

[Thi88] Ronald A. Thisted. Elements of Statistical Computing. Chapman

and Hall, New York, 1988.
[Thu92] C. Thurber. Hypocenter velocity structure coupling in local earth-
quake tomography. Physics of the Earth and Planetary Interiors,
75(1–3):55–62, 1992.

[Two96] S. Twomey. Introduction to the Mathematics of Inversion in Remote

Sensing and Indirect Measurements. Dover, Mineola, New York,
[UBL90] T. Ulrych, A. Bassrei, and M. Lane. Minimum relative entropy
inversion of 1d data with applications. Geophysical Prospecting,
38(5):465–487, 1990.
[Vog02] Curtis R. Vogel. Computational Methods for Inverse Problems.
SIAM, Philadelphia, 2002.
[Wah90] Grace Wahba. Spline Models for Observational Data. SIAM,
Philadelphia, 1990.
[Wat00] G. A. Watson. Approximation in normed linear spaces. Journal of
Computational and Applied Mathematics, 121(1–2):1–36, 2000.
[Win91] G. Milton Wing. A Primer on Integeral Equations of the First Kind:
the problem of deconvolution and unfolding. SIAM, Philadelphia,
[WU96] A. D. Woodbury and T. J. Ulrych. Minimum relative entropy inver-
sion: Theory and application to recovering the release history of a
ground water contaminant. Water Resources Research, 32(9):2671–
2681, 1996.

χ2 test, 2-5
p–value, 2-5
p–value, t–test, B-24
t–test, B-23
1–norm solution, 2-17
1-norm, 1-5
2–norm, 1-5, A-19
2–norm solution, 2-1

ART algorithm, 6-4
automatic differentiation, 9-8

basis, A-16
basis functions, Fourier, 8-3
basis, standard, A-16
Bayes Theorem, 11-3
Bayes’ theorem, B-15
black box function, 9-8
bound variables least squares, 7-2
BVLS, 7-2

calculus of variations, 11-13
CDF, B-8
CG method, 6-8
CGLS, 6-12
characteristic equation, A-24
Chebychev’s inequality, B-25
checkerboard resolution test, 4-17
chi2pdf, 2-23
chol, A-27
CLEAN, 7-17
collocation, simple, 3-3
column space, A-17
compatible matrices, A-7
compatible matrix and vector norms, A-31
cond, A-33
cond MATLAB command, 4-11
condition number, A-32, A-33
condition number, of svd, 4-10
conditional distribution, B-16
conditional expected value, B-16
conditional probability, B-15
confidence interval, B-22
confidence regions, 2-10
conjugacy, 11-6
conjugate gradient least squares method, 6-12
conjugate gradient method, 6-8
continuous inverse problem, 1-2
convex function, A-29
convolution, 1-3, 8-1
convolution equation, 1-3
convolution theorem, 8-4
convolution, circular, 8-10
convolution, serial, 8-10
covariance, B-11
covariance matrix, B-17
cross entropy, 11-13
cumulative distribution function, B-8

damped Newton method, 9-3
damped SVD method, 5-5
data kernels, 3-2
data null space, 4-7
data space, 4-1
data space resolution matrix, 4-9
deconvolution, 1-3, 3-2, 4-19, 8-1, 8-8
delta function, 8-2
denoising, 7-11
DFT, 8-8
diagonal matrix, A-10


dimension, A-16
direct problem, 1-1
discrepancy principle, 4-12, 5-1
discrete Fourier Transform, 8-8
discrete ill–posed problem, 4-18
discrete inverse problem, 1-2
discrete Picard condition, 4-12
discrete-ill-posed problem, 1-12
double-sided exponential distribution, B-5
downward continuation, 8-6, 8-7

earthquake location problem, 1-5
eig, A-24
eigenvalue, A-24
eigenvector, A-24
elementary row operations, A-2
entropy, 11-12
Euclidean length, A-19
events, B-1
expected value, B-9
experiment, B-1
exponential random variable, B-3

F distribution, B-7
FFT, 8-8
filter factors, 5-5
filter, high–pass, 8-16
filter, low–pass, 8-7
finite differences, 9-8
first order Tikhonov regularization, 5-13
forward operator, 1-1
forward problem, 1-1
Fourier basis functions, 8-3
Fourier Transform, 1-3
Fourier transform, 8-3
Fourier Transform, discrete, 8-8
Fredholm Integral Equation, 3-1
Fredholm integral of the first kind, 1-3
frequency domain response, 8-3
Frobenius norm, A-31
full rank, 2-1

gamma function, B-5
Gauss–Newton method, 9-5
Gaussian elimination, A-1
Gaussian probability density, B-3
GCV, 5-25
generalized cross validation, 5-25
generalized inverse, 4-3
generalized singular value decomposition, 5-14
generalized singular values, 5-14
global optimization, 9-11
Gram matrix technique, 3-7
Gram–Schmidt orthogonalization process, A-22
Green’s function, 8-2
GSVD, 5-14

Hermitian symmetry, spectral, 8-10
high–pass filter, 8-16
homogeneous, A-4

identity matrix, A-8
IFK, 1-3, 3-1
ill-conditioned
    linear system of equations, A-33
ill-conditioned system, 1-12
ill-posed problem, 1-12
impulse function, 8-2
impulse resolution test, 4-16
impulse response, 8-2
inconsistent, 1-5
indefinite matrix, A-26
independent random variables, B-10
inner product, 3-8, A-35
instrument response, 4-18
inverse discrete Fourier transform, 8-8
inverse Fourier transform, 8-3
inverse Fourier transform, discrete, 8-8
inverse problem, 1-1
IRLS, 2-19
iteratively reweighted least squares, 2-19

joint probability density, B-10


Kaczmarz’s algorithm, 6-1
kernel, 1-2
Kullback, 11-13

L–curve, 5-3
Lagrange multipliers, A-27
Laguerre polynomial basis, 3-9
law of cosines, A-19
law of total probability, B-15
least squares, 2-1
least squares solution, 2-1, A-23
Leibler, 11-13
Levenberg–Marquardt method, 9-6
likelihood function, 2-3
linear combination, A-5
linear equations, A-1
linear independence, A-12
linear regression, 1-3
linear systems, 1-2
local minimum points, 9-9
low–pass filter, 8-7
lower triangular matrix, A-11

MAP model, 11-4
Markov Chain Monte Carlo Method, 11-4
Markov’s inequality, B-25
mathematical model, 1-1
matrix norm, A-31
matrix times matrix, A-7
matrix times vector, A-6
maximum entropy methods, 11-12
maximum entropy principle, 11-13
maximum entropy regularization, 7-7
maximum likelihood, 11-4
maximum likelihood estimation, 2-2
maximum likelihood principle, 2-3
median of a PDF, B-9
midpoint rule, 3-3
minimum cross entropy principle, 11-14
minimum length solution, 4-4
mode of a PDF, B-9
model identification problem, 1-1
model null space, 4-7
model resolution, 1-12
model space, 4-1
model space resolution matrix, 4-8
moderately ill–posed, 4-18
modified truncated SVD, 7-12
Monte Carlo error propagation, 2-21
Moore–Penrose pseudoinverse, 4-3
MTSVD, 7-12
multistart, 9-11
multivariate normal distribution, 11-7, B-17
mutually conjugate vectors, 6-8
MVN, B-17

negative definite matrix, A-26
negative semidefinite matrix, A-26
Newton’s method, 9-2
Newton’s method for minimizing f(x), 9-3
NNLS, 7-1
non homogeneous, A-4
non–negative least squares, 7-1
nonuniqueness, 4-7
norm, A-30
    compatible matrix and vector, A-31
    definition, A-29
    Frobenius, A-31
    infinity, A-30
    of a matrix, A-31
normal equations, A-23
normal probability density, B-3
null space, 1-12, A-14
null space model, 1-12
null space models, 1-12
Nyquist frequency, 8-10

ODE, 1-1
ordinary differential equation, 1-1
orth, A-22
orthogonal basis, A-20
orthogonal diagonalization, A-25
orthogonal matrix, A-20
orthogonal projection, A-21
orthogonal subspaces, A-20

orthogonal vectors, A-19
outliers, 2-17
over fitting of data, 4-12

parallelogram law, A-38
parameter correlation, 2-11
parameter estimation problem, 1-2
partial differential equation, 1-1
PD matrix, A-26
PDE, 1-1
PDF, B-3
perpendicular vectors, A-19
pinv MATLAB command, 4-7
poles, 8-5
positive definite matrix, A-26
positive semidefinite matrix, A-26
posterior distribution, 11-3
power of a matrix, A-10
power, of a hypothesis test, B-24
PP-TSVD, 7-12
principal axes, error ellipsoid, 2-10
principle of indifference, 11-3
prior distribution, 11-3
probability, B-1
probability density function, B-3
probability function, B-1
proportional effect, 2-14
PSD matrix, A-26
pseudoinverse, 4-3

Q–Q plot, B-19
qr, A-33
quadratic form, A-25
quadrature rule, 3-2

randn, B-12
random variable, B-2
range, A-17
rank, A-18
rank deficient, 1-11, 4-12
rank deficient system, 4-4
realizations, B-2
reduced row echelon form, A-3
regularization, 1-12, 4-12
representers, 3-2
reshape, 4-14
residual, 1-5, 2-1
resolution test, 4-16
Riemann–Lebesgue lemma, 1-14
robust least squares, 2-25
robust solution, 2-17
row–column expansion method, A-7
RREF, A-3

sample mean, B-22
second order Tikhonov regularization, 5-14
severely ill–posed, 4-18
sifting property, delta function, 8-2
sifting property, of delta function, 8-2
simultaneous iterative reconstruction technique, 6-6
singular value decomposition, 4-1
singular value spectrum, 4-9
SIRT technique, 6-6
slowness, 1-10
solution existence, 1-12
solution instability, 1-12
solution nonuniqueness, 1-12
solution uniqueness, 1-12
spectral amplitude, 8-3
spectral division, 8-11
spectral phase, 8-3
spectrum, 8-3
spike resolution test, 4-16
sqrtm, 11-8
square-integrable function, 1-14
standard basis, A-16
standard deviation, B-10
standard normal random variable, B-3
stationary point, A-27
Student’s t distribution, B-7, B-22
subjective prior, 11-3
subspace, A-13
SVD, 4-1
svd MATLAB command, 4-1
svd, compact form, 4-2
symmetric matrix, A-10

TGSVD, 5-24

tikhcstr, 7-2
time domain response, 8-3
time series, 8-8
tomography, 1-10
total least squares, 2-25
total variation regularization, 7-11
transfer function, 8-3
transpose, A-10
trivial solution, A-4
truncated generalized singular value decomposition, 5-24
truncated SVD, 4-12
TSVD, 4-12
type I error, B-24
type II error, B-24

unbiased, 2-7
uncorrelated random variables, B-
uniform, B-3
uninformative prior distribution, 11-
upper triangular matrix, A-11
upward continuation filter, 8-7

variance, B-10
vector space, A-35
vertical seismic profiling, 1-7

water level regularization, 8-12

weakly ill–posed, 4-18

zero matrix, A-5

zeros, 8-5
zeroth order Tikhonov regularization,