Parameter Estimation and Inverse Problems


© 2002–2003, Aster, Borchers, and Thurber

Preface for Draft Versions

This textbook grew out of a course in geophysical inverse methods that has

been taught for over 10 years at New Mexico Tech, first by Rick Aster and for

the last four years by Rick Aster and Brian Borchers. The lecture notes were

first assembled into book format and used as the official text for the course in

the fall of 2001. The draft of the textbook was then used in Cliff Thurber’s

course at the University of Wisconsin during the spring semester of 2002. In

Fall, 2002, Aster, Borchers, and Thurber signed a contract with Academic Press

to produce a text by the end of 2003 for publication in 2004.

We expect that readers of this book will have some familiarity with calculus,

differential equations, linear algebra, probability and statistics at the undergrad-

uate level. Appendices A and B review the required linear algebra, probability,

and statistics. In our experience teaching this course, many students have had

weaknesses in their preparation in linear algebra or in probability and statistics.

For that reason, we typically spend the first two to three weeks of the course

reviewing this material.

Chapters one through five form the heart of the book and should be read

in sequence. Chapters six, seven, and eight are independent of each other, but

depend strongly on the material in Chapters one through five. They can be

covered in any order. Chapters nine and ten are independent of Chapters six,

seven, and eight, but should be covered in sequence. Chapter 11 is independent

of Chapters six through ten.

If significant time for review of linear algebra, probability, and statistics is

taken, then there will not be time to cover the entire book in one semester.

However, it should be possible to cover the majority of the material by skipping

one or two of the chapters after Chapter five.

The text is a work in progress, but already contains much usable material

that should be of interest to a large audience. We and Academic Press feel that

the final content, presentation, examples, index terms, references, and other

features will benefit substantially from having a large number of instructors,

students, and researchers use, and provide feedback on, draft versions. Such

comments and corrections are greatly appreciated, and significant contributions

will be acknowledged in the book.


Contents

1 Introduction 1-1
1.1 Classification of Inverse Problems 1-1
1.2 Examples of Parameter Estimation Problems 1-3
1.3 Examples of Inverse Problems 1-7
1.4 Why Inverse Problems Are Hard 1-11
1.5 Exercises 1-14
1.6 Notes and Further Reading 1-15
2.1 Introduction to Linear Regression 2-1
2.2 Statistical Aspects of Least Squares 2-2
2.3 L1 Regression 2-17
2.4 Monte Carlo Error Propagation 2-21
2.5 Exercises 2-22
2.6 Notes and Further Reading 2-25
3.1 Integral Equations 3-1
3.2 Quadrature Methods 3-2
3.3 The Gram Matrix Technique 3-7
3.4 Expansion in Terms of Basis Functions 3-8
3.5 Exercises 3-12
3.6 Notes and Further Reading 3-13
4.1 The SVD and the Generalized Inverse 4-1
4.2 Covariance and Resolution of the Generalized Inverse Solution 4-7
4.3 Instability of the Generalized Inverse Solution 4-9
4.4 Rank Deficient Problems 4-12
4.5 Discrete Ill-Posed Problems 4-17
4.6 Exercises 4-34
5.1 Selecting a Good Solution 5-1
5.2 SVD Implementation of Tikhonov Regularization 5-3
5.3 Resolution, Bias, and Uncertainty in the Tikhonov Solution 5-9
5.4 Higher Order Tikhonov Regularization 5-12
5.5 Resolution in Higher Order Regularization 5-19
5.6 The TGSVD Method 5-24
5.7 Generalized Cross Validation 5-25
5.8 Error Bounds 5-28
5.9 Exercises 5-32
5.10 Notes and Further Reading 5-34
6.1 Iterative Methods for Tomography Problems 6-1
6.2 The Conjugate Gradient Method 6-8
6.3 The CGLS Method 6-12
6.4 Exercises 6-16
6.5 Notes and Further Reading 6-16
7.1 Using bounds constraints 7-1
7.2 Maximum Entropy Regularization 7-7
7.3 Total Variation 7-11
7.4 Exercises 7-17
7.5 Notes and Further Reading 7-17
8.1 Linear Systems in the Time and Frequency Domains 8-1
8.2 Deconvolution from a Fourier Perspective 8-5
8.3 Linear Systems in Discrete Time 8-8
8.4 Water Level Regularization 8-12
8.5 Exercises 8-17
8.6 Notes and Further Reading 8-18
9.1 Newton’s Method 9-1
9.2 The Gauss–Newton and Levenberg–Marquardt Methods 9-4
9.3 Statistical Aspects 9-7
9.4 Implementation Issues 9-8
9.5 An Example 9-11
9.6 Exercises 9-15
9.7 Notes and Further Reading 9-17
10.1 Regularizing Nonlinear Least Squares Problems 10-1
10.2 Occam’s Inversion 10-2
10.3 Exercises 10-7
11.1 Review of the Classical Approach 11-1
11.2 The Bayesian Approach 11-2
11.3 The Multivariate Normal Case 11-7
11.4 Maximum Entropy Methods 11-12
11.5 Exercises 11-14
11.6 Notes and Further Reading 11-15
A.1 Systems of Linear Equations A-1
A.2 Matrix and Vector Algebra A-5
A.3 Linear Independence A-12
A.4 Subspaces of Rn A-13
A.5 Orthogonality and the Dot Product A-18
A.6 Eigenvalues and Eigenvectors A-24
A.7 Lagrange Multipliers A-27
A.8 Taylor’s Theorem A-28
A.9 Vector and Matrix Norms A-29
A.10 The Condition Number of a Linear System A-32
A.11 The QR Factorization A-33
A.12 Linear Algebra in More General Spaces A-35
A.13 Exercises A-36
A.14 Notes and Further Reading A-41
B.1 Probability and Random Variables B-1
B.2 Expected Value and Variance B-9
B.3 Joint Distributions B-10
B.4 Conditional Probability B-15
B.5 The Multivariate Normal Distribution B-17
B.6 The Central Limit Theorem B-18
B.7 Testing for Normality B-19
B.8 Estimating Means and Confidence Intervals B-21
B.9 Hypothesis Tests B-23
B.10 Exercises B-25
B.11 Notes and Further Reading B-27

Chapter 1

Introduction

Scientists and engineers frequently need to relate the physical parameters of a

system to observations of the system. The observations or data will be denoted

by d, while the physical parameters or model will be denoted by m. We assume

that the physics of the system are understood, so that we can specify a function

G that relates the model m to the data d.

G(m) = d . (1.1)

In practice, d might be a continuous function of time and/or space, or it might simply be a collection of discrete observations. Similarly, the model m might be a

continuous function of time and/or space, or it might simply be a collection

of discrete parameters. When m and d are themselves functions, we typically

refer to G as an operator, while G may be called a function when m and d

are vectors. The operator G can take on many forms. In some cases G is an

ordinary differential equation (ODE) or partial differential equation (PDE). In

other cases, G might be a linear or nonlinear system of algebraic equations.

Note that there is some confusion between mathematicians and other sci-

entists in the terminology that we have used. Applied mathematicians usually

refer to G(m) = d as the “mathematical model”, and m as the “parameters”,

while scientists often refer to m as the “model” and G as the forward opera-

tor. We will adopt the scientist’s convention of calling m the model and G the

forward operator.

The forward problem or direct problem is to find d given m. Computing

G(m) might involve solving an ODE, solving a PDE, or evaluating an integral.

Our focus will be on the inverse problem of obtaining an estimate of m

from d. A third problem that we will not consider in this book is the model

identification problem of determining G from several pairs of m and d.

In many cases, we will want to determine a finite number, n, of parameters

that define a model. The parameters may define a physical entity directly, or


discrete inverse problem may be coefficients or other constants in a functional relationship that describes

parameter estimation problem a physical process. We can express the parameters as a vector m. Similarly,

continuous inverse problem if there are only a finite number of data collected, then we can express the

linear systems

data as a vector d. Such problems are called discrete inverse problems or

kernel

parameter estimation problems. A parameter estimation problem can be

written as a nonlinear system of equations

G(m) = d . (1.2)

In other cases, where m and d are functions of time and space, we cannot

simply express them as vectors. Such problems are called continuous in-

verse problems. For convenience, we will refer to discrete inverse problems as

parameter estimation problems, and reserve the phrase “inverse problem” for

continuous inverse problems.

A central theme of this book is that continuous inverse problems can of-

ten be approximated by parameter estimation problems with a large number of

parameters. This process of discretizing a continuous inverse problem is dis-

cussed in Chapter 3. By solving the discretized parameter estimation problem,

we can avoid some of the mathematical difficulties associated with continuous

inverse problems. However, it happens that the parameter estimation prob-

lems that arise from discretization of continuous inverse problems still require

sophisticated solution techniques.

A class of mathematical models for which many useful results exist are linear

systems, which obey superposition,

G(m_1 + m_2) = G(m_1) + G(m_2) ,    (1.3)

and scaling,

G(αm) = αG(m) .    (1.4)

In the case of a discrete linear inverse problem, (1.2) can always be written

in the form of a linear system of algebraic equations.

G(m) = Gm = d . (1.5)

In the case of a continuous linear inverse problem, G is frequently a linear integral operator. Equation (1.1) becomes

d(s) = \int_a^b g(s, x)\, m(x)\, dx .    (1.6)

The linearity [(1.3), (1.4)] of (1.6) is easily seen because

\int_a^b g(s, x)\,(m_1(x) + m_2(x))\, dx = \int_a^b g(s, x)\, m_1(x)\, dx + \int_a^b g(s, x)\, m_2(x)\, dx

and

\int_a^b g(s, x)\, αm(x)\, dx = α \int_a^b g(s, x)\, m(x)\, dx .


Equations of the form of (1.6) where m(x) is the unknown are called Fredholm integral equations of the first kind (IFK). IFK's arise in a surprisingly large number of inverse problems. Unfortunately, they can also have mathematical properties that make it difficult to obtain useful solutions.

In some cases, g(s, x) depends only on the difference s − x, and we have a convolution equation

d(s) = \int_{-∞}^{∞} g(s − x)\, m(x)\, dx .    (1.7)

The inverse problem of determining m(x) from d(s) is called deconvolution.

Methods for deconvolution are discussed in Chapter 8.

Another example of an IFK is the problem of inverting a Fourier transform

Φ(f) = \int_{-∞}^{∞} e^{−ı 2πf x}\, φ(x)\, dx

to obtain φ(x). Although there are many tables and analytic methods of ob-

taining Fourier transforms and their inverses, it is sometimes desirable to obtain

numerical estimates of φ(x), such as where there is no analytic inverse, or where

we wish to find φ(x) from spectral data at a discrete collection of frequencies.

It is an intriguing question why linearity appears in many interesting geo-

physical problems. Many physical systems encountered in practice are linear to

a very high degree because only small departures from equilibrium occur. An

important example from geophysics is that of seismic wave propagation, where

the stresses caused by the propagation of seismic disturbances are usually very

small relative to the magnitudes of the elastic moduli. This situation leads to

small strains and a very nearly linear stress-strain relationship. Because of this,

many seismic wave field problems obey superposition and scaling. Other physi-

cal systems such as gravity and magnetic fields at the strengths encountered in

geophysics have inherently linear physics.

Because many important inverse problems are linear, Chapters 2 through

8 of this book deal with methods for the solution of linear inverse problems.

We discuss methods for nonlinear parameter estimation and inverse problems

in Chapters 9 and 10.

Example 1.1

An archetypical example of a linear parameter estimation problem

is the fitting of a function to a data set via linear regression. In

an ancient example, the observed data, y, are altitude observations

of a ballistic body observed at a set of times t (Figure 1.1), and we

wish to solve for a model, m, giving the initial altitude (m1), the initial vertical velocity (m2), and the effective gravitational acceleration (m3) experienced by the body during its trajectory. This problem,

much studied over the centuries, is naturally of practical interest in

warfare, but is also of fundamental interest, for example, in abso-

lute gravity meters capable of estimating g from the acceleration of

a falling object in a vacuum to accuracies approaching 10^−9 (e.g.,

[KPR+ 91]).

The forward model is a quadratic function in the t − y plane,

y(t) = m_1 + m_2 t − (1/2) m_3 t^2 ,    (1.8)

giving the predicted altitude of the body yi at time ti. Applying (1.8) to each of the observations, we

obtain a matrix equation which relates the observations, yi , to the

model parameters.

\begin{bmatrix} 1 & t_1 & -\frac{1}{2}t_1^2 \\ 1 & t_2 & -\frac{1}{2}t_2^2 \\ 1 & t_3 & -\frac{1}{2}t_3^2 \\ \vdots & \vdots & \vdots \\ 1 & t_m & -\frac{1}{2}t_m^2 \end{bmatrix} \begin{bmatrix} m_1 \\ m_2 \\ m_3 \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_m \end{bmatrix} .    (1.9)

Even though the functional relationship (1.8) is quadratic, the equa-

tions for the three parameters are linear, so this is a linear parameter

estimation problem.
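As a concrete illustration, the following short Python/NumPy sketch (not part of the original text; the observation times and altitudes are invented purely for illustration) builds the system matrix of (1.9) and solves the resulting linear parameter estimation problem in the least squares sense:

    import numpy as np

    # Hypothetical observation times (s) and altitudes (m); illustrative values only.
    t = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([105.0, 190.0, 265.0, 330.0, 385.0])

    # Build the m x 3 system matrix of (1.9): columns are 1, t_i, -t_i^2 / 2.
    G = np.column_stack([np.ones_like(t), t, -0.5 * t**2])

    # Least squares estimate of [m1, m2, m3] (initial altitude, initial
    # vertical velocity, effective gravitational acceleration).
    m, residual, rank, sv = np.linalg.lstsq(G, y, rcond=None)
    print(m)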

Because there are more (known) data than (unknown) model parameters in

(1.9), and because our forward modeling of the physics may be approximate and the observations may contain noise, it is generally impossible to find a model m that satisfies every observation in y. In introductory linear algebra, where an exact solution is expected, we might simply state that no solution exists. However, useful solutions to such systems may be found by finding model parameters which best satisfy the data set in some approximate sense.

A traditional strategy for finding approximate solutions to an inconsistent

system of linear equations is to find the model that produces the smallest squared

misfit, or residual, between the observed data and the predictions of the for-

ward model, i.e., find the m that minimizes the Euclidean length (or the 2–

norm) of the residual

\|y − Gm\|_2 = \sqrt{\sum_{i=1}^{m} (y_i − (Gm)_i)^2} .    (1.10)

However, (1.10) is not always the best misfit measure to use in solving para-

metric systems. Another misfit measure that we may wish to optimize instead,

and which is decidedly better in many situations, is the 1-norm

\|y − Gm\|_1 = \sum_{i=1}^{m} |y_i − (Gm)_i| .    (1.11)

Methods for linear parameter estimation using both of these misfit measures

are discussed in Chapter 2.

Example 1.2

A classic nonlinear parameter estimation problem is the determination of an earthquake hypocenter in space

and time (Figure 1.2), specified by the 4-vector

m = [x, τ ]T , (1.12)

where x is the three-dimensional location of the earthquake and τ is its time of occurrence. The desired model is that which best fits a set of

observed seismic phase arrival times

d = [t1 , t2 , . . . , tm ]T (1.13)

The forward operator is a nonlinear function that maps a trial hypocen-

ter (x, τ )0 into a vector of predicted arrival times, taking into ac-

count the network geometry and the physics of seismic wave propa-

gation,

\begin{bmatrix} t_1 \\ t_2 \\ t_3 \\ \vdots \\ t_m \end{bmatrix} = G((x, τ)_0 ; W) ,

where W is a matrix where the ith row, Wi,· gives the spatial co-

ordinates of the ith station. G returns arrival–times for the network

of seismic stations for a given source location and time by forward

modeling arrival times through some seismic velocity structure, v(x).

This problem is nonlinear even if the seismic velocity is a constant, v.

In this case, the ray paths in Figure 1.2 are straight. Furthermore,

the phase arrival time at station i located at position Wi is

t_i = \frac{\|W_{i,·}^T − x\|_2}{v} + τ .    (1.14)

Because the forward model (1.14) is nonlinear there is no general way

to write the relationship between the data, the hypocenter model m,

and G as a linear system of equations.
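A minimal Python/NumPy sketch of the constant-velocity forward model (1.14) might look like the following; the station coordinates, trial hypocenter, and velocity value are hypothetical and are supplied only to make the example runnable:

    import numpy as np

    def predicted_arrival_times(x, tau, W, v):
        """Forward model (1.14): straight-ray arrival times for a constant
        velocity v, trial hypocenter x (3-vector), origin time tau, and an
        m x 3 matrix W of station coordinates (one station per row)."""
        distances = np.linalg.norm(W - x, axis=1)
        return distances / v + tau

    # Hypothetical station geometry and trial hypocenter (illustrative only).
    W = np.array([[0.0, 0.0, 0.0],
                  [10.0, 0.0, 0.0],
                  [0.0, 10.0, 0.0],
                  [10.0, 10.0, 0.0]])
    x0 = np.array([4.0, 6.0, 8.0])   # trial location (km)
    tau0 = 0.5                       # trial origin time (s)
    v = 5.0                          # constant velocity (km/s)

    print(predicted_arrival_times(x0, tau0, W, v))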

Nonlinear parameter estimation problems are frequently solved by

guessing a starting solution (which is hopefully close to the true solution) and then iteratively improving it until an acceptable solution is (hopefully) obtained. We will discuss methods for solving nonlinear

parameter estimation problems in Chapter 9.

Example 1.3

In vertical seismic profiling we wish to know the vertical seis-

mic velocity of the Earth surrounding a borehole. A downward-

propagating seismic disturbance is generated at the surface by a

source and seismic waves are sensed by a string of seismometers in

the borehole (Figure 1.3).

The arrival time of the wavefront at each instrument is measured

from the recorded seismograms. As in the tomography example, we work with a slowness (reciprocal velocity) model. The observed travel time at depth y is then the definite integral of the vertical slowness, s, from the surface to y,

t(y) = \int_0^y s(z)\, dz = \int_0^∞ s(z) H(y − z)\, dz ,    (1.15)

where the kernel H is the Heaviside step function, which is one when its argument is nonnegative and zero when its argument is negative.

The form (y − z) of the kernel argument shows that (1.15) is a

convolution.

In theory, we can solve this problem quite easily. By the fundamental

theorem of calculus,

t′(y) = s(y) .    (1.16)

However, in practice we only have noisy observations of t(y). Differ-

entiating these observations is an extremely unstable process.
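A short numerical sketch (not from the text; the slowness model and noise level are invented for illustration) makes this instability visible: integrating a smooth slowness model gives smooth travel times, but differentiating even slightly noisy travel times produces a badly corrupted slowness estimate.

    import numpy as np

    # Depths (km) and a smooth "true" slowness model (s/km); illustrative values.
    y = np.linspace(0.0, 2.0, 201)
    s_true = 0.25 + 0.05 * np.sin(2 * np.pi * y)

    # Forward problem (1.15): travel time is the integral of slowness down to depth y
    # (trapezoidal rule on the depth grid).
    dy = y[1] - y[0]
    t_clean = np.concatenate([[0.0], np.cumsum(0.5 * (s_true[1:] + s_true[:-1]) * dy)])

    # Add a small amount of observational noise to the travel times.
    rng = np.random.default_rng(0)
    t_noisy = t_clean + rng.normal(scale=1e-3, size=t_clean.shape)

    # "Solving" the inverse problem by differentiation (1.16) amplifies the noise.
    s_from_clean = np.gradient(t_clean, dy)
    s_from_noisy = np.gradient(t_noisy, dy)

    print("max error, clean data :", np.max(np.abs(s_from_clean - s_true)))
    print("max error, noisy data :", np.max(np.abs(s_from_noisy - s_true)))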

Example 1.4

A simple linear inverse problem is to invert a vertical gravitational anomaly, d(x), observed at some

height, h, to find an unknown line mass density distribution, m(x) =

∆ρ(x) (Figure 1.4). This problem can be written as an IFK, as the

data are just a superposition of the vertical gravity contributions

from each differential element of the line mass

d(s) = Γ \int_{-∞}^{∞} \frac{h}{((x − s)^2 + h^2)^{3/2}}\, m(x)\, dx ,

where Γ is the gravitational constant. Note that the kernel has the

form g(x−s). Because the kernel depends only on x−s, this forward

problem is also an example of convolution. Because g(x) is a smooth

function, d(s) will be a smoothed version of m(x), and inverse solu-

tions for m(x) will consequently be some roughened transformation

of d(x).
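The smoothing action of this kernel is easy to see numerically. The following sketch (not part of the text; the density model, height, and grids are arbitrary assumptions) evaluates the forward convolution by simple quadrature; the resulting anomaly d(s) varies smoothly even though the assumed m(x) is discontinuous.

    import numpy as np

    GAMMA = 6.674e-11  # gravitational constant (SI units)

    # Hypothetical line density contrast m(x): a narrow anomaly (kg/m).
    x = np.linspace(-500.0, 500.0, 1001)       # model coordinate (m)
    dx = x[1] - x[0]
    m = np.where(np.abs(x) < 50.0, 100.0, 0.0)

    h = 100.0                                  # observation height (m)
    s = np.linspace(-500.0, 500.0, 201)        # observation locations (m)

    # Forward problem: superpose the vertical gravity of each model element.
    kernel = h / ((x[None, :] - s[:, None]) ** 2 + h ** 2) ** 1.5
    d = GAMMA * kernel @ m * dx

    # d is a smoothed version of m(x), as described above.
    print(d.min(), d.max())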

Example 1.5

We can recast the problem above in the form of a model where the gravity anomaly is at-

tributable to a variation in depth, m(x) = h(x), of a fixed line

density perturbation, ∆ρ, rather than its density contrast (Figure


Figure 1.4: A simple linear inverse problem; find ∆ρ(x) from gravity anomaly

observations, d(x).

Figure 1.5: A simple nonlinear inverse problem; find h(x) from gravity anomaly

observations, d(x).

1.5). In this case, the data are still, of course, just given by the su-

perposition of the contributions to the gravitational anomaly field,

but the forward problem is now

d(s) = Γ \int_{-∞}^{∞} \frac{m(x)}{((x − s)^2 + m^2(x))^{3/2}}\, ∆ρ\, dx .

This relationship between the model and the data does not follow the superposition and scaling rules (1.3) and (1.4), so the problem is nonlinear.

Nonlinear inverse problems are generally considerably more diffi-

cult to solve than linear ones. In some cases, they may be solvable

by coordinate transformations which linearize the problem or other

clever special-case methods. In other cases the problem cannot be

linearized and computer intensive optimization techniques must be

used to solve the problem. The essential differences arise because all

linear problems can be generalized to be the “same” in some sense.

Specifically, the data can be expressed as an inner product between

the kernel function and the model [Par94], but each nonlinear prob-

lem is nonlinear in its own way.

Example 1.6

Tomography takes its name from the Greek root tomos, or cut. Tomography is the general technique

of determining a model from ray-path integrated properties such as

attenuation (X-ray, radar, seismic), travel time (seismic or acoustic),

and/or source intensity (positron emission). Tomography has many

applications in medicine, acoustics and earth science. An important

example in earth science is cross–well seismic or radar tomography,

where the source excitation is in one bore hole, and the signals are

received by sensors in another bore hole. Another important exam-

ple is joint earthquake/velocity inversion on scales ranging from a

few cubic kilometers to global, where the locations of earthquake

sources and the velocity parameters of the volume sampled by the

associated seismic ray paths are estimated [Thu92].

The physical model for tomography in its most basic form assumes

that geometric ray theory (essentially a high frequency limit to the

wave equation) is valid for the problem under consideration, so that

we can consider electromagnetic or elastic energy to be propagating

along ray paths. The density of ray path coverage in a tomographic

problem may vary significantly throughout the model and thus pro-

vide much better information on the physical properties of interest

in regions of denser ray coverage while providing little or no infor-

mation in other regions.

In tomography it is convenient to work with slowness, which is simply the reciprocal of velocity. If the slowness at a point x is s(x), and we know the ray path ℓ, then the travel time along that ray path is given by the line integral

t = \int_{ℓ} s(x)\, dℓ .

In general, the ray paths l can bend due to refraction effects. How-

ever, if refraction effects are negligible, then it may be feasible to

approximate the ray paths with straight lines.

A simple way to parameterize the slowness structure that we wish to image with tomography is as uniform

blocks. If we assume straight rays, then we obtain a linear model.

The elements of the matrix G are just the lengths of the intersections

of the ray paths with the individual blocks. Consider the simple ex-

ample shown in (Figure 1.6), where 9 homogeneous blocks with sides

of unit length and unknown slowness are constrained by 8 travel-time observations along the ray paths shown:

Gm = \begin{bmatrix} 1 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 & 1 & 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 1 \\ 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 \\ \sqrt{2} & 0 & 0 & 0 & \sqrt{2} & 0 & 0 & 0 & \sqrt{2} \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & \sqrt{2} \end{bmatrix} \begin{bmatrix} s_{11} \\ s_{12} \\ s_{13} \\ s_{21} \\ s_{22} \\ s_{23} \\ s_{31} \\ s_{32} \\ s_{33} \end{bmatrix} = \begin{bmatrix} t_1 \\ t_2 \\ t_3 \\ t_4 \\ t_5 \\ t_6 \\ t_7 \\ t_8 \end{bmatrix} .

Because there are 9 unknown parameters, sij , that we would ideally

like to estimate, but only 8 equations, this system is clearly rank

deficient. In fact, rank(G) is only 7. Also, we can easily see that

there is redundant information in this system; the slowness s33 is

completely determined by t8 , but is also measured by observations

t3 , t6 , and t7 . We shall see that, despite these complicating factors,

discretized versions of tomography and other inverse problems can

nonetheless be usefully solved.
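The rank deficiency can be checked directly. The sketch below (added here for illustration; it simply transcribes the G matrix given above) builds the 8 by 9 system matrix and reports its numerical rank:

    import numpy as np

    sq2 = np.sqrt(2.0)
    G = np.array([
        [1, 0, 0, 1, 0, 0, 1, 0, 0],
        [0, 1, 0, 0, 1, 0, 0, 1, 0],
        [0, 0, 1, 0, 0, 1, 0, 0, 1],
        [1, 1, 1, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 1, 1, 1, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 1, 1, 1],
        [sq2, 0, 0, 0, sq2, 0, 0, 0, sq2],
        [0, 0, 0, 0, 0, 0, 0, 0, sq2],
    ], dtype=float)

    # rank(G) is less than 9, so some combinations of the block slownesses
    # are not constrained by the eight travel-time observations.
    print("shape:", G.shape, "rank:", np.linalg.matrix_rank(G))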

1.4 Why Inverse Problems Are Hard

Scientists and engineers must be interested in far more than simply finding mathematically acceptable answers to inverse problems. There may be many models that adequately fit the data. It is essential to characterize just what solution has been obtained, how “good” it is in terms of physical plausibility and fit to the data, and perhaps how consistent it is with other constraints. Some important issues that must be considered include solution existence, solution uniqueness, and instability of the solution process.

1. Existence. There may be no model that exactly fits a given data set if our mathematical model of the system's physics is approximate and/or the data contains noise.

2. Uniqueness. If exact solutions do exist, they may not be unique, even for

an infinite number of exact data. This can easily occur in potential field

problems. A trivial example is the case of the external gravitational field

from a spherically-symmetric mass distribution of unknown density, where

we can only determine the total mass, not the dimensions and density of

the body. Non-uniqueness is also a characteristic of rank deficient linear

problems because the matrix G can have a nontrivial null space.

One manifestation of this is that a model may fit a finite number of data

points acceptably, but predict wild and physically unrealistic values where

we have no observations. In linear parameter estimation problems, models

that lie in the null space of G, m0 , are solutions to Gm0 = 0, and hence fit

d = 0 in the forward problem exactly. These null space models can be

added to any particular model that satisfies (1.5), and not change the fit

to the data. The result is an infinite number of mathematically acceptable

models.

In practical terms, suppose that there exists a nonzero model m which

results in an instrument reading of zero. One cannot discriminate this

situation from the situation where m is really zero.

An important and thorny issue with problems that have nonunique solu-

tions is that an estimated model may be significantly smoothed relative to

the true situation. Characterizing this smoothing is essential to interpret-

ing models in terms of their likely correspondence to reality. This analysis

falls under the general topic of model resolution analysis.

3. Instability. The process of computing an inverse solution can be extremely unstable, in that a small change in measurement can produce an enormous change in the implied models. Systems where small features

of the data can drive large changes in inferred models are commonly re-

ferred to as ill-posed (in the case of continuous systems), or discrete

ill-posed or ill-conditioned (in the case of discrete linear systems).

It is frequently possible to stabilize the inversion process by adding addi-

tional constraints. This process is generally referred to as regularization.

Regularization is frequently necessary to produce a useful solution to an

otherwise intractable ill-posed or ill-conditioned inverse problem at the

cost of limiting us to a smoothed model.

To illustrate these issues, consider some simple examples where each of them arise in systems characterized by an IFK

\int_0^1 g(s, x)\, m(x)\, dx = y(s) .

First, take the trivial kernel g(s, x) = 1, so that

\int_0^1 m(x)\, dx = y(s) .    (1.17)

Clearly, since the left hand side of (1.17) is independent of s, this system has

no solution unless y(s) is a constant. Thus, there are an infinite number of

mathematically conceivable data sets y(s), which are not constant and where

no exact solution exists. This is an existence issue; there are data sets for which

there are no exact solutions.

Conversely, where a solution to (1.17) does exist, the solution is nonunique,

since there are an infinite number of functions which integrate over the unit

interval to produce the same constant. This is a uniqueness issue; there are an

infinite number of models that satisfy the IFK exactly.

A more subtle example of nonuniqueness is shown by the equation

\int_0^1 g(s, x)\, m(x)\, dx = \int_0^1 s \cdot \sin(πx)\, m(x)\, dx = y(s) .    (1.18)

Note that, for integers n and m,

\int_0^1 \sin(nπx) \sin(mπx)\, dx = -\frac{1}{2} \int_0^1 \left[ \cos(π(n+m)x) − \cos(π(n−m)x) \right] dx    (1.19)

= -\frac{1}{2π} \left[ \frac{\sin(π(n+m))}{n+m} − \frac{\sin(π(n−m))}{n−m} \right] = 0 \quad (n ≠ ±m;\; n, m ≠ 0) .

Thus, in (1.18) we have

\int_0^1 g(s, x) \sin(nπx)\, dx = 0 \quad (n = 2, 3, \ldots) .

Furthermore, because (1.18) is a linear system, we can add any function of

the form

m_0(x) = \sum_{n=2}^{∞} α_n \sin(nπx)

to any model solution, m(x), and obtain a new model that fits the data just as well because

\int_0^1 s \cdot \sin(πx)\,(m(x) + m_0(x))\, dx = \int_0^1 s \cdot \sin(πx)\, m(x)\, dx + \int_0^1 s \cdot \sin(πx)\, m_0(x)\, dx

= \int_0^1 s \cdot \sin(πx)\, m(x)\, dx + 0 .

Equation (1.18) thus allows an infinite range of very different solutions which

fit the data equally well.

Even if we do not encounter existence or uniqueness issues, it can be shown

that instability is a fundamental feature of IFK’s, because

\lim_{n→∞} \int_{-∞}^{∞} g(s, t) \sin(nπt)\, dt = 0    (1.20)

for any square-integrable kernel, that is, whenever

\int_{-∞}^{∞} \int_{-∞}^{∞} g(s, t)^2\, ds\, dt < ∞ .

This is a consequence of the Riemann–Lebesgue lemma.

Without proving (1.20) rigorously, we can understand why this damping

out of oscillatory functions occurs. Model “wiggliness” is smoothed by the

integration with the kernel, g(s, t). Thus a significant perturbation to the

model of the form sin(nπt) leads, for large n, to an insignificant change in y(s).

The inverse problem has the situation reversed; an inferred model can be very

sensitive to changes in the data, including random noise components that have

nothing to do with the physical system that we are trying to learn about.

This property of IFK’s is similar to the problem of solving linear systems of

equations where the condition number of the matrix is very large. That is, the

matrix is nearly singular. In both cases, the difficulty lies in the nature of the

problem, not in the algorithm used to solve the problem. Ill-posed behavior is a

fundamental feature of many inverse problems because of the general smoothing

that occurs in most forward problems (and the corresponding roughening that

must occur in the solution of the inverse problem); understanding and dealing

with it are fundamental issues in solving many problems in the physical sciences.
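The damping of oscillatory model components can be demonstrated numerically. The following sketch (not from the text; the kernel and the observation point are arbitrary assumptions) integrates a smooth, square-integrable kernel against sin(nπt) for increasing n; the integrals shrink toward zero, so high-frequency model perturbations produce almost no change in the data.

    import numpy as np

    # A smooth, square-integrable kernel on [0, 1] (illustrative choice).
    def g(s, t):
        return np.exp(-(s - t) ** 2)

    t = np.linspace(0.0, 1.0, 20001)
    dt = t[1] - t[0]
    s = 0.3  # fixed observation point

    for n in (1, 2, 4, 16, 64, 256):
        integrand = g(s, t) * np.sin(n * np.pi * t)
        value = np.sum(integrand) * dt   # simple quadrature of the integral
        print(n, value)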

1.5 Exercises

1. Consider a mathematical model (1.1) of the form G(m) = d, where m is

a vector of size n, and d is a vector of size m. Suppose that the model

obeys the superposition and scaling laws and is thus linear. Show that

G(m) can be written in the form

G(m) = Γm ,

where Γ is an m by n matrix. Hint: consider the standard basis, and write m as a linear combination of the vectors in

the standard basis. Apply the superposition and scaling laws. Finally,

recall the definition of matrix–vector multiplication.

2. Can (1.9) be inconsistent, even with only m = 3 data points? How about

just m = 2 data points? What if the ti are distinct? If the system can be

inconsistent, give an example; if not, explain why not.

3. Find a journal article which discusses the solution of an inverse problem

in your discipline. What is the data? Are the data discrete or continuous?

Have the authors discussed possible sources of noise in the data? What

is the model? Is the model continuous or discrete? What physical laws

determine the forward operator G? Is G linear or nonlinear? Discuss any

issues associated with existence, uniqueness, or instability of solutions.

1.6 Notes and Further Reading

Some important references on inverse problems in geophysics and remote sens-

ing include [Car87, Lin88, Par94, Two96]. Many examples of ill–posed prob-

lems and their solution can be found in the book edited by Tikhonov and

Goncharsky [TG87]. More mathematically oriented references include [Bau87,

Gro93, Han98, Kir96, LH95, Mor84, Men89, TA77, Tar87].

Tomography, and in particular medical imaging applications of tomography,

is a very large field. Some references on tomography include [Her80, KS01,

LL00, Nat01, Nol87].



Chapter 2

Linear Regression

Consider a discrete linear inverse problem. We begin with a data vector, d,

of m observations, and a vector m of n model parameters that we wish to

determine. As we have seen, the inverse problem can be written as a linear

system of equations

Gm = d . (2.1)

In this chapter, we will assume that there are more observations than model parameters, and thus

more linear equations than unknowns. We will also assume that the system of

equations is of full rank. In Chapter 4 we will consider rank deficient problems.

For a full rank system with more equations than unknowns, it is frequently

the case that no solution m satisfies all of the equations in (2.1) exactly. This

happens because the dimension of the range of G is smaller than m and a noisy

data vector can easily lie outside of the range of G.

However, a useful approximate solution may still be found by finding a par-

ticular model m that minimizes some measure of the misfit between the

actual data and the forward model predictions using m. The residual is given

by

r = d − Gm. (2.2)

One commonly used measure of the misfit is the 2–norm of the residual. The

model that minimizes the 2–norm of the residual is called the least squares

solution. The least squares or 2–norm solution is of special interest both

because it is very amenable to analysis and geometric intuition, and because it

turns out to be statistically the most likely solution if data errors are normally

distributed.

A typical example of a linear parameter estimation problem is linear regres-

sion to a line, where we seek a model of the form y = m_1 + m_2 x which best fits a set of m > 2 data points. We seek a minimum 2–norm solution for

Gm = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_m \end{bmatrix} \begin{bmatrix} m_1 \\ m_2 \end{bmatrix} = \begin{bmatrix} d_1 \\ d_2 \\ \vdots \\ d_m \end{bmatrix} = d .    (2.3)

The 2–norm solution for m is, from the normal equations (A.7),

m_{L2} = (G^T G)^{−1} G^T d .    (2.4)

It can be shown that if G is of full rank then (G^T G)^{−1} always exists.

Equation (2.4) is a general 2–norm minimizing solution for any full-rank

linear parameter estimation problem. Returning to our example of fitting a

straight line to m data points, we can now write a closed–form solution for this

particular problem as

m_{L2} = \left( \begin{bmatrix} 1 & \cdots & 1 \\ x_1 & \cdots & x_m \end{bmatrix} \begin{bmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_m \end{bmatrix} \right)^{-1} \begin{bmatrix} 1 & \cdots & 1 \\ x_1 & \cdots & x_m \end{bmatrix} \begin{bmatrix} d_1 \\ \vdots \\ d_m \end{bmatrix}

= \begin{bmatrix} m & \sum_{i=1}^m x_i \\ \sum_{i=1}^m x_i & \sum_{i=1}^m x_i^2 \end{bmatrix}^{-1} \begin{bmatrix} \sum_{i=1}^m d_i \\ \sum_{i=1}^m x_i d_i \end{bmatrix}

= \frac{1}{m \sum_{i=1}^m x_i^2 − \left( \sum_{i=1}^m x_i \right)^2} \begin{bmatrix} \sum_{i=1}^m x_i^2 & -\sum_{i=1}^m x_i \\ -\sum_{i=1}^m x_i & m \end{bmatrix} \begin{bmatrix} \sum_{i=1}^m d_i \\ \sum_{i=1}^m x_i d_i \end{bmatrix} .
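A small Python/NumPy sketch (added for illustration; the x and d values are invented) evaluates this closed-form solution and cross-checks it against a general least squares solver:

    import numpy as np

    # Hypothetical data for a straight-line fit (illustrative values only).
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    d = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])
    m_pts = len(x)

    # Closed-form solution from the normal equations for y = m1 + m2 * x.
    Sx, Sx2 = x.sum(), (x ** 2).sum()
    Sd, Sxd = d.sum(), (x * d).sum()
    det = m_pts * Sx2 - Sx ** 2
    m1 = (Sx2 * Sd - Sx * Sxd) / det
    m2 = (m_pts * Sxd - Sx * Sd) / det

    # Cross-check against a general-purpose least squares solver.
    G = np.column_stack([np.ones(m_pts), x])
    m_lstsq, *_ = np.linalg.lstsq(G, d, rcond=None)

    print([m1, m2], m_lstsq)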

If we consider our data points not to be absolutely perfect measurements, but

rather to possess random errors, then we have the problem of finding the solu-

tion which is best from a statistical point of view. One statistical approach,

maximum likelihood estimation, asks the question, given that we observed

a certain data set, that we know the statistical characteristics of our observa-

tions, and that we have a forward model for the problem, what is the model

from which these observables would be most likely?

Maximum likelihood estimation is a general method which can be used for

any estimation problem where we can write down a joint probability density

function for the observations. The essential problem is to find the most likely

model, as parameterized by the n elements in the vector m, for a set of m observations in the vector d. We will assume that the observations are statistically

independent so that we can use a product form joint density function.

Given a model m, we have a probability density function fi (di , m) for the

ith observation. In general, this probability distribution will vary, depending

on m. The joint probability density for a vector of observations d will then be

f(d, m) = \prod_{i=1}^{m} f_i(d_i, m) .    (2.6)

Note that the values of f are probability densities not probabilities. We can

only compute the probability of getting data in some range given a model m

by computing an integral of f (d, m) over that range. In fact, the probability of

getting any particular data set is precisely 0!

We can get around this problem by considering the probability of getting

a data set in a small box around a particular data set d. This probability

is approximately proportional to the probability density f (d, m). For many

possible models m, the probability density will be quite small. Such models

would be relatively unlikely to result in the data d. For other models, the

probability density might be much larger. These models would be relatively

likely to result in the data d.

For this reason, (2.6) is called the likelihood function. According to the

maximum likelihood principle we should select the model m that maximizes

the likelihood function. Estimates of the model obtained by following the max-

imum likelihood principle have many desirable properties [CB02, DS98]. It is

particularly interesting that when we have a discrete linear inverse problem and

the data errors are independent and normally distributed, then the maximum

likelihood principle leads us to the least squares solution.

Suppose that our data have independent random errors which are normally

distributed with expected value zero. Let the standard deviation of the ith

observation be σi . The likelihood for the ith observation is

f_i(d_i, m) = \frac{1}{(2π)^{1/2} σ_i}\, e^{-(d_i − (Gm)_i)^2 / (2σ_i^2)} .    (2.7)

Considering the entire data set, the likelihood function is the product of the individual likelihoods,

f(d, m) = \frac{1}{(2π)^{m/2}\, σ_1 \cdots σ_m} \prod_{i=1}^{m} e^{-(d_i − (Gm)_i)^2 / (2σ_i^2)} .    (2.8)

The constant factor does not affect the maximization of f, so we can solve the maximization problem

max \prod_{i=1}^{m} e^{-(d_i − (Gm)_i)^2 / (2σ_i^2)} .    (2.9)

Because the logarithm is a monotonically increasing function, this is equivalent to the maximization problem

max \log \prod_{i=1}^{m} e^{-(d_i − (Gm)_i)^2 / (2σ_i^2)} .    (2.10)

Since the logarithm of a product is the sum of the logarithms of its factors, this becomes

max − \sum_{i=1}^{m} \frac{(d_i − (Gm)_i)^2}{2σ_i^2} .    (2.11)

We can convert this maximization into a minimization by changing the sign and changing max to min. Also, it will be convenient to eliminate a constant factor of 1/2. Our problem becomes

min \sum_{i=1}^{m} \frac{(d_i − (Gm)_i)^2}{σ_i^2} .    (2.12)

Aside from the constant factors of 1/σ_i^2, this is the least squares problem for Gm = d.

To incorporate the standard deviations, we scale the system of equations.

Let

W = diag(1/σ1 , 1/σ2, . . . , 1/σm ). (2.13)

Then let

Gw = WG (2.14)

and

dw = Wd . (2.15)

The weighted system of equations is

Gw m = dw . (2.16)

The least squares solution of this weighted system, which we will denote by m_w, is given by the normal equations (A.7) as

m_w = (G_w^T G_w)^{−1} G_w^T d_w .    (2.17)

Now,

\|d_w − G_w m\|_2^2 = \sum_{i=1}^{m} (d_i − (Gm)_i)^2 / σ_i^2 .    (2.18)

Thus the least squares solution to Gw m = dw is the maximum likelihood solu-

tion.
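To make the weighting concrete, the following Python/NumPy sketch (added for illustration; the data values and error bars are hypothetical) scales the rows of G and d by 1/σ_i as in (2.13)–(2.16) and solves the weighted least squares problem:

    import numpy as np

    def weighted_least_squares(G, d, sigma):
        """Maximum likelihood (weighted least squares) solution for independent
        normal errors with standard deviations sigma, following (2.13)-(2.17)."""
        W = np.diag(1.0 / sigma)
        Gw = W @ G
        dw = W @ d
        m, *_ = np.linalg.lstsq(Gw, dw, rcond=None)
        return m

    # Hypothetical example: straight-line data with unequal error bars.
    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    d = np.array([1.1, 2.8, 5.2, 7.1, 8.7])
    sigma = np.array([0.1, 0.1, 0.5, 0.5, 1.0])
    G = np.column_stack([np.ones_like(x), x])

    print(weighted_least_squares(G, d, sigma))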

The sum of squares also provides important statistical information about

the quality of our estimate of the model. Let

χ^2_{obs} = \sum_{i=1}^{m} (d_i − (Gm_w)_i)^2 / σ_i^2 .    (2.19)

It can be shown that under our assumptions, χ2obs has a χ2 distribution with

ν = m − n degrees of freedom [CB02, DS98].

Recall that the pdf of the χ2 distribution is given by

f_{χ^2}(x) = \frac{1}{2^{ν/2}\, Γ(ν/2)}\, x^{ν/2 − 1} e^{-x/2} .    (2.20)

(Figure: the χ2 probability density function.)

The χ2 test is a statistical test of the assumptions that we used in finding

the least squares solution. In this test, we compute χ2obs and compare it to the

theoretical χ2 distribution with ν = m − n degrees of freedom. The probability

of obtaining a χ2 value as large or larger than the observed value is

p = \int_{χ^2_{obs}}^{∞} f_{χ^2}(x)\, dx .    (2.21)

This value is called the p–value of the test.

There are three general possibilities for p:

1. The p–value is not too small and not too large. Our least squares solution

produces an acceptable data fit and our assumptions are consistent with

the data. In general use, p does not actually have to be very large to

be deemed marginally “acceptable” in many cases (e.g., p ≈ 10−2 ), as

truly “wrong” models (see below) will typically produce extremely small

p–values (e.g., 10−12 ) because of the short-tailed nature of the normal

distribution.

2. The p–value is very small. We are faced with three nonexclusive possibil-

ities, but something is clearly wrong.

(a) The data are genuinely improbable and we were simply unlucky. This possibility is easy to rule out for p–values very close to zero. For example, if the probability of an equal or worse fit was 10^−9 and the model and forward problem are correct,

then we would have to perform on the order of a billion experiments

to get a worse fit to the data. It is far more likely that something

else is wrong.

(b) The mathematical model Gm = d is simply incorrect. Most often

this happens because we have left some important physics out of our

mathematical model.

(c) The data errors are underestimated (e.g., the values of σi are too

small or the errors are appreciably non-normally distributed, and are

better characterized by a longer-tailed distribution than the normal

pdf).

3. The p–value is very large (very close to one). The fit of the forward model

to the data is almost exact. We should investigate the possibility that we

have overestimated the data errors. A more sinister possibility is that a

very high p–value is indicative of data fraud, such as might happen if data

were cooked-up ahead of time to fit a particular model.

A rule of thumb for problems with a large number of degrees of freedom is that an acceptable value of χ^2_{obs} is approximately ν. This arises because, as the central limit theorem predicts, the χ^2 random variable itself becomes normally distributed with mean ν and standard deviation (2ν)^{1/2} for large ν.

In addition to examining the χ2obs statistic, it is always a good idea to

examine the residuals rw = Gw mw − dw . The residuals should be roughly

normally distributed with standard deviation one and no obvious patterns. In

some cases where an incorrect model has been fitted to the data, the residuals

will reveal the nature of the modeling error. For example, in a linear regression

to a straight line, it might be that all of the residuals are negative for small

values of the independent variable t and then positive for larger values of t.

This would indicate that perhaps an additional quadratic term is needed in the

regression model.

If we have two alternative models that we have fit to the data, it can be impor-

tant to compare the models to determine whether or not one model fits the data

significantly better than the other model. The F distribution (Appendix B) can

be used to compare the fit of two models to assess whether or not a statistically

significant residual reduction has occurred [DS98].

The respective χ^2_{obs} values of the two models have χ^2 distributions with their respective degrees of freedom, ν_1 and ν_2. It can be shown that under our statistical assumptions, the ratio

R = \frac{χ_1^2 / ν_1}{χ_2^2 / ν_2} = \frac{χ_1^2\, ν_2}{χ_2^2\, ν_1}    (2.22)

will have an F distribution with parameters ν1 and ν2 [DS98].

If the observed value of the ratio, R_obs, is extremely large or small, this indicates that one of the models fits the data significantly better than the other at the corresponding confidence level. If R_obs is greater than the 95th percentile of the F distribution with parameters ν_1 and ν_2, then we can conclude that the second model has a statistically significant improvement in data fit over the first model.
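A minimal sketch of this comparison in Python (added for illustration; the misfit values and degrees of freedom are hypothetical) uses the F distribution from scipy.stats:

    from scipy.stats import f

    def f_test_ratio(chi2_1, nu1, chi2_2, nu2):
        """Compare two fits via the ratio (2.22); returns the ratio and the
        95th percentile of the corresponding F distribution."""
        R = (chi2_1 / nu1) / (chi2_2 / nu2)
        return R, f.ppf(0.95, nu1, nu2)

    # Hypothetical misfits for two competing models.
    R, threshold = f_test_ratio(chi2_1=30.0, nu1=10, chi2_2=9.0, nu2=9)
    print(R, threshold, R > threshold)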

Parameter estimates in linear regression are constructed of linear combina-

tions of independent data with normal errors. Because a linear combination of

normally distributed random variables is normally distributed (Appendix B),

the model parameters will also be normally distributed. Note that, even though

the data errors are uncorrelated, the covariance matrix of the model parameters

will generally have nonzero off-diagonal terms.

To derive the mapping between data and model covariances, consider the

covariance of an m-length data vector, d, of normally distributed, independent

random variables operated on by a general linear transformation specified by a matrix A. From Appendix B, we know that

Cov(Ad) = A\, Cov(d)\, A^T .    (2.23)

The least squares solution (2.17) has A = (G_w^T G_w)^{−1} G_w^T. Since the weighted data have an identity covariance matrix, the covariance of the model parameters is

C = (G_w^T G_w)^{−1} G_w^T I_m G_w (G_w^T G_w)^{−1} = (G_w^T G_w)^{−1} .    (2.24)

In the case of equal and uncorrelated normal data errors, so that the data covariance matrix Cov(d) is simply the variance σ^2 times the m by m identity matrix I_m, (2.24) simplifies to

Cov(m_{L2}) = σ^2 (G^T G)^{−1} .    (2.25)

This will generally not be a diagonal matrix, indicating that the model parameters are correlated. Because the

components of mL2 are each constructed from linear combinations of the com-

ponents in the data vector, statistical dependence between the elements of m

should not be surprising.

The expected value of the least squares solution is

E[m_{L2}] = (G^T G)^{−1} G^T E[d] = (G^T G)^{−1} G^T d_{true} .

Because G m_{true} = d_{true}, we have

G^T G m_{true} = G^T d_{true} .

Thus

E[m_{L2}] = m_{true} .

In statistical terms, the least squares solution is said to be unbiased.

We can compute 95% confidence intervals for the individual model param-

eters using the fact that each model parameter mi has a normal distribution

with mean (m_{true})_i and variance Cov(m_{L2})_{i,i}. The 95% confidence intervals are given by

m_{L2} ± 1.96 · diag(Cov(m_{L2}))^{1/2} ,    (2.26)

where the factor 1.96 arises because, for a normal distribution,

\frac{1}{σ\sqrt{2π}} \int_{-1.96σ}^{1.96σ} e^{-x^2/(2σ^2)}\, dx ≈ 0.95 .
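The covariance (2.24) and the individual confidence intervals (2.26) are straightforward to compute numerically. The following sketch (added here for illustration) uses the sampling design of the ballistics example that follows (observations at t = 1, ..., 10 s with σ = 8 m); it should reproduce, up to rounding, the covariance matrix (2.30) given below.

    import numpy as np

    def model_covariance_and_ci(G, sigma):
        """Covariance (2.24) of the weighted least squares solution and the
        half-widths of the individual 95% confidence intervals (2.26)."""
        Gw = G / sigma[:, None]          # row scaling by 1/sigma_i
        C = np.linalg.inv(Gw.T @ Gw)
        half_widths = 1.96 * np.sqrt(np.diag(C))
        return C, half_widths

    # Quadratic ballistic model sampled at t = 1..10 s, sigma = 8 m everywhere.
    t = np.arange(1, 11, dtype=float)
    G = np.column_stack([np.ones_like(t), t, -0.5 * t ** 2])
    sigma = np.full(10, 8.0)
    C, hw = model_covariance_and_ci(G, sigma)
    print(np.round(C, 2))
    print(np.round(hw, 2))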

Example 2.1

We return to the ballistics problem of Example 1.1, fitting altitude observations to a quadratic model of the form (1.8). The synthetic data set below includes independent normal data errors (σ = 8 m) and was generated using a known true model. The data are:

t y

1 109.3827

2 187.5385

3 267.5319

4 331.8753

5 386.0535

6 428.4271

7 452.1644

8 498.1461

9 512.3499

10 512.9753

To compute the least squares solution, we first construct the G matrix. The rows of G are given by

G_{i,·} = [1,\; t_i,\; −(1/2) t_i^2] ,

so that

G = \begin{bmatrix} 1 & 1 & −0.5 \\ 1 & 2 & −2.0 \\ 1 & 3 & −4.5 \\ 1 & 4 & −8.0 \\ 1 & 5 & −12.5 \\ 1 & 6 & −18.0 \\ 1 & 7 & −24.5 \\ 1 & 8 & −32.0 \\ 1 & 9 & −40.5 \\ 1 & 10 & −50.0 \end{bmatrix} .


Figure 2.2: Data and model predictions for the ballistics example.

We apply (2.17) to obtain a model estimate, m_{L2}.

Figure 2.2 shows the observed data and the fitted curve.

mL2 has a corresponding model covariance matrix, from (2.25), of

C = \begin{bmatrix} 88.53 & −33.60 & −5.33 \\ −33.60 & 15.44 & 2.67 \\ −5.33 & 2.67 & 0.48 \end{bmatrix} .    (2.30)

(2.31)

The χ2 value for this regression is about 4.2, and the number of

degrees of freedom is ν = m − n = 10 − 3 = 7, so the p–value (2.21)

is

p = \int_{4.20}^{∞} \frac{1}{2^{7/2}\, Γ(7/2)}\, x^{5/2} e^{-x/2}\, dx ≈ 0.76 ,

which is in the realm of acceptability, meaning that modeling as-

sumptions and the 2–norm error for our estimated model are con-

sistent with both the data and the data uncertainty assumptions.
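The computations of this example can be reproduced directly from the tabulated data. The following Python/SciPy sketch (added for illustration) performs the weighted fit, the χ2 statistic (2.19), and the p–value (2.21); the printed values should be close to those quoted above (χ2 ≈ 4.2, p ≈ 0.76).

    import numpy as np
    from scipy.stats import chi2

    # Data from the table above (t in seconds, y in meters), sigma = 8 m.
    t = np.arange(1, 11, dtype=float)
    y = np.array([109.3827, 187.5385, 267.5319, 331.8753, 386.0535,
                  428.4271, 452.1644, 498.1461, 512.3499, 512.9753])
    sigma = 8.0

    # Weighted system (here the weighting is a uniform scaling by 1/sigma).
    G = np.column_stack([np.ones_like(t), t, -0.5 * t ** 2])
    Gw, yw = G / sigma, y / sigma

    m, *_ = np.linalg.lstsq(Gw, yw, rcond=None)
    chi2_obs = np.sum((yw - Gw @ m) ** 2)
    p = chi2.sf(chi2_obs, len(y) - len(m))

    print("model:", m)
    print("chi-square:", chi2_obs, "p-value:", p)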

If we consider combinations of model parameters, the situation becomes more complex. To characterize model uncertainty more effectively, we can examine

95% confidence regions for pairs or larger collections of parameters. When

confidence regions for parameters considered jointly are projected onto model

coordinate axes, we obtain intervals for individual parameters which may be

significantly larger than parameter confidence intervals considered individually.

For a vector of estimated model parameters that have an n-dimensional

MVN distribution with mean m_{true} and covariance matrix C, the quantity

(m − m_{true})^T C^{−1} (m − m_{true})

has a χ^2 distribution with n degrees of freedom. If ∆^2 is chosen to be the 95th percentile of the χ^2 distribution with n degrees of freedom, the 95% confidence region is defined by the inequality

(m − m_{true})^T C^{−1} (m − m_{true}) ≤ ∆^2 .

If we wish to find an error ellipsoid for a lower dimensional subset of the

model parameters, we can project the n dimensional error ellipsoid down to the

lower dimensional space [Ast88] by taking only those rows and columns of C

and elements of m which correspond to the dimensions that we want to keep.

In this case, if the unprojected parameters are not of interest, the number of

degrees of freedom in the associated χ2 calculation, n, should also be reduced

to match the number of model parameters in the projected error ellipsoid.

Since the covariance matrix and its inverse are symmetric and positive defi-

nite, we can diagonalize C^{−1} as

C^{−1} = P^T Λ P ,    (2.33)

where Λ is a diagonal matrix of positive eigenvalues and the columns of P are orthonormal eigenvectors. The semiaxes defined by the columns of P are referred to as error ellipsoid principal axes; the ith principal axis is in the direction P_{·,i} and has length ∆/\sqrt{Λ_{i,i}}.

Because the model covariance matrix is typically not diagonal, the principal

axes are typically not aligned in the mi directions. However, we can project

the appropriate confidence ellipsoid onto the mi axes to obtain a “box” which

includes the entire 95% error ellipsoid, along with some points outside of the

error ellipsoid. This box provides a conservative confidence interval for the joint

collection of model parameters.

The correlations for the pairs of parameters (mi , mj ) are measures of the

ellipticity of the projection of the error ellipsoid onto the (mi , mj ) plane. A

correlation approaching +1 means the projection is ”needle-like” with a positive

slope, a near-zero correlation means that the projection is approximately spher-

ical, and a correlation approaching -1 means that the projection is ”needle-like”

with a negative slope.

For the ballistics example, we can evaluate the correlations between pairs of parameters,

ρ_{m_i, m_j} = \frac{Cov(m_i, m_j)}{\sqrt{Var(m_i) · Var(m_j)}} ,

which give

ρm1 ,m2 = −0.91

ρm1 ,m3 = −0.81

ρm2 ,m3 = 0.97 .

This shows that the three model parameters are highly statistically

dependent, and that the error ellipsoid is thus very elliptical.

Application of (2.33) for this example shows that the directions of

the error ellipsoid principal axes are

P = [P_{·,1}, P_{·,2}, P_{·,3}] ≈ \begin{bmatrix} −0.03 & −0.93 & 0.36 \\ −0.23 & 0.36 & 0.90 \\ 0.97 & 0.06 & 0.23 \end{bmatrix} ,

and the corresponding semiaxis lengths along the principal axes directions are

\sqrt{F_{χ^2}^{−1}(0.95)}\; [1/\sqrt{λ_1},\; 1/\sqrt{λ_2},\; 1/\sqrt{λ_3}] ≈ [0.24, 24.72, 3.85] ,

where \sqrt{F_{χ^2}^{−1}(0.95)} ≈ 2.80 is the square root of the 95th percentile of the χ^2 distribution with three degrees of freedom.

Projecting the 95% confidence ellipsoid axes back into the (m1 , m2 , m3 )

coordinate system (Figure 2.3) we obtain 95% confidence intervals

for the parameters considered jointly

which are about 40% broader than the single parameter confidence

estimates obtained using only the diagonal covariance matrix terms

(2.31). Note that there is actually a greater than 95% probability

that the box in (2.34) will include the true values of the param-

eters. The reason is that these intervals considered together as a

rectangular region includes many points which lie outside of the

95% confidence ellipsoid.
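The projection of the confidence ellipsoid onto the parameter axes is easy to evaluate numerically. The following sketch (added for illustration) uses the covariance matrix (2.30); the projected joint intervals come out noticeably wider than the individual 1.96σ intervals, consistent with the roughly 40% broadening noted above.

    import numpy as np
    from scipy.stats import chi2

    def joint_confidence_half_widths(C, confidence=0.95):
        """Half-widths of the projections of the confidence ellipsoid onto the
        coordinate axes: sqrt(Delta^2 * diag(C)), with Delta^2 the chosen
        percentile of the chi-square distribution with n degrees of freedom."""
        n = C.shape[0]
        delta2 = chi2.ppf(confidence, n)
        return np.sqrt(delta2 * np.diag(C))

    # Covariance matrix (2.30) from the ballistics example.
    C = np.array([[88.53, -33.60, -5.33],
                  [-33.60, 15.44, 2.67],
                  [-5.33, 2.67, 0.48]])

    print(joint_confidence_half_widths(C))   # joint (projected) 95% intervals
    print(1.96 * np.sqrt(np.diag(C)))        # individual 95% intervals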


Figure 2.3: Projections of the 95% error ellipsoid onto model axes.

The model covariance matrix (2.24) depends only on where and how often we made measurements, and on what the

standard deviations of those measurements were. Covariance is thus totally a

characteristic of experimental design reflective of how much influence a general

data set will have on a model estimate. It does not depend upon particular data

values from some individual realization. This is why it is essential to evaluate

the p–value, or some other “goodness–of–fit” measure, for an estimated model.

Examining the solution parameters and the covariance matrix alone will not

reveal whether we are fitting the data adequately!

Suppose that we do not know the standard deviations of the measurement

errors a priori. Can we still perform linear regression? If we assume that the

measurement errors are independent and normally distributed with expected

value 0 and common standard deviation σ, then we can perform the linear regression and

estimate σ from the residuals.

First, we find the least squares solution to the unweighted problem Gm = d.

We let

r = d − GmL2 .

To estimate the standard deviation from the residual vector, let

s = \sqrt{\frac{1}{m − n} \sum_{i=1}^{m} r_i^2} .    (2.35)

As you might expect, there is a statistical cost associated with our not

knowing the true standard deviation. If the data standard deviations are known, then the model errors

m'_i = \frac{m_i − (m_{true})_i}{σ}    (2.36)

have the standard normal distribution. However, if instead of a known σ, we

have an estimate of σ, s obtained from (2.35), then the model errors

m'_i = \frac{m_i − (m_{true})_i}{s}    (2.37)

have a t distribution with ν = m − n degrees of freedom. For smaller

degrees of freedom this produces appreciably broader confidence intervals. As

ν becomes large, s becomes an increasingly better estimate of σ.

A second major problem is that we can no longer use the χ2 test of goodness–

of–fit. The χ2 test was based on the assumption that the data errors were

normally distributed with known standard deviation σ. If the actual residuals

were too large relative to σ, then χ2 would be large, and we would reject the

linear regression fit. If we substitute the estimate (2.35) into (2.19), we find

that χ2obs = ν, so that the fit always passes the χ2 test.
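A compact sketch of this procedure (added for illustration; the data values are hypothetical) estimates s from the residuals as in (2.35) and forms t-based 95% confidence intervals for the parameters:

    import numpy as np
    from scipy.stats import t as t_dist

    def fit_with_estimated_sigma(G, d):
        """Unweighted least squares fit with sigma estimated from the residuals
        (2.35); returns the model, s, and the half-widths of 95% confidence
        intervals based on the t distribution."""
        m_obs, n_par = G.shape
        m, *_ = np.linalg.lstsq(G, d, rcond=None)
        r = d - G @ m
        s = np.sqrt(np.sum(r ** 2) / (m_obs - n_par))
        C = s ** 2 * np.linalg.inv(G.T @ G)
        tcrit = t_dist.ppf(0.975, m_obs - n_par)
        return m, s, tcrit * np.sqrt(np.diag(C))

    # Hypothetical straight-line data with errors of unknown size.
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
    d = np.array([9.8, 20.5, 29.3, 41.0, 48.9, 61.2, 69.7, 80.5])
    G = np.column_stack([np.ones_like(x), x])
    print(fit_with_estimated_sigma(G, d))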

Example 2.3

Consider a linear regression problem in which the measurement errors are assumed to be normally distributed, but

with unknown σ. We are given a set of x and y data that appear to

follow a linear relationship.

In this case,

G = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} ,

and we want to find the least squares solution to

Gm = y . (2.38)

The least squares solution gives the fitted line

y = −1.03 + 10.09x .

Figure 2.4 shows the data and the linear regression line. Our es-

timate of the standard deviation of the measurement errors is s =

30.74. Thus the estimated covariance matrix for the fit parameters

is

C = s^2 (G^T G)^{−1} = \begin{bmatrix} 338.24 & −4.93 \\ −4.93 & 0.08 \end{bmatrix} .


The resulting 95% confidence intervals for the parameters are
\[
m_1 = -1.03 \pm t_{n-2,\,0.975}\sqrt{338.24} = -1.03 \pm 38.05
\]
and
\[
m_2 = 10.09 \pm t_{n-2,\,0.975}\sqrt{0.08} = 10.09 \pm 0.59\,.
\]

Because σ was not known a priori, we cannot perform a χ² test of goodness-of-fit. However, we can still examine the residuals. Figure 2.5 shows the residuals. It is clear that although the residuals seem to be random, their standard deviation seems to increase as x and y increase. This is a common phenomenon in linear regression, called a proportional effect. One possible reason for this is that in some instruments the measurement errors are of a size proportional to the magnitude of the measurement.

We will address the proportional effect by assuming that the standard deviation is proportional to y. We then rescale the system of equations (2.38) by dividing each equation by y_i, so that
\[
G_w m = y_w\,.
\]

The weighted least squares fit is
\[
y = -12.24 + 10.25\,x\,,
\]
with 95% confidence intervals
\[
m_1 = -12.24 \pm 22.39
\quad\text{and}\quad
m_2 = 10.25 \pm 0.47\,.
\]

Figure 2.6 shows the data and the weighted least squares fit. Figure 2.7 shows the scaled residuals. Notice that this time there is no trend in the magnitude of the residuals as x and y increase. The estimated standard deviation is 0.045, or 4.5%. In fact, these data were generated according to y = 10x, with standard deviations for the measurement errors at 5% of the y value.
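A compact MATLAB sketch of this weighting scheme (ours; it assumes column vectors x and y containing the example data):

% Sketch: weighted regression for a proportional error effect.
n = length(x);
G = [ones(n,1), x];
W = diag(1 ./ y);                 % weight each equation by 1/y_i
Gw = W*G;  yw = W*y;
mw = Gw \ yw;                     % weighted least squares fit
rw = yw - Gw*mw;                  % scaled residuals (cf. Figure 2.7)
s = sqrt(sum(rw.^2) / (n - 2))    % estimated proportional std. dev. (about 0.045)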


Figure 2.4: The data and the linear regression line.

Figure 2.5: Residuals for the unweighted linear regression fit.


Figure 2.6: The data and the weighted least squares fit.

Figure 2.7: Scaled residuals for the weighted least squares fit.

2.3 L1 Regression

Linear regression solutions are very susceptible to even small numbers of outliers: extreme observations that are sufficiently different from the other observations to affect the fitted model. In practice, such observations are often the result of a procedural error in making a measurement.

We can readily appreciate this from a maximum likelihood perspective by noting the very rapid fall-off of the tails of the normal distribution. For example, the probability of a data point with a normally distributed error occurring more than 5 standard deviations to one side or the other of its expected value is less than one in a million,
\[
P(|X - E[X]| \ge 5\sigma) = \frac{2}{\sqrt{2\pi}} \int_5^{\infty} e^{-\frac{1}{2}x^2}\, dx \approx 6\times 10^{-7}\,. \qquad (2.39)
\]

Thus, if an outlier occurs in the data set due to a non-normal error process, the least squares solution will go to great lengths to accommodate it, because otherwise the contribution of that point would make the total likelihood of the solution extremely small.

Consider the solution that minimizes the 1–norm of the residual vector,
\[
\mu^{(1)} = \sum_{i=1}^{m} \frac{|d_i - (Gm)_i|}{\sigma_i} = \| d_w - G_w m \|_1\,, \qquad (2.40)
\]
rather than minimizing the 2–norm. The 1–norm solution, m_{L1}, will be more outlier resistant, or robust, than the least squares solution, m_{L2}, because (2.40) does not square each of the terms in the misfit measure, as (2.12) does.

The 1–norm solution m_{L1} also has a maximum likelihood interpretation; it is the maximum likelihood estimator for data with errors distributed according to a two-sided exponential distribution,
\[
f(x) = \frac{1}{2\sigma}\, e^{-|x-\mu|/\sigma}\,. \qquad (2.41)
\]

Data sets distributed as (2.41) are unusual. Nevertheless, it is still often worth-

while to find a solution where (2.40) is minimized rather than (2.12), even if

measurement errors are normally distributed, because of the all-too-frequent occurrence of incorrect measurements that do not follow the normal distribution.

Example 2.4

It is easy to demonstrate the advantages of 1–norm minimization

using the quadratic regression example discussed earlier. Figure 2.8

shows the original sequence of independent data points with unit

standard deviations, where one of the points (number 4) is clearly

an outlier if a model of the form (2.27) is appropriate (it is the

original data with 200 m subtracted from it). The data prediction

using the 2–norm minimizing solution for this data set,

\[
m_{L2} = [26.4\ \mathrm{m},\ 75.6\ \mathrm{m/s},\ 4.9\ \mathrm{m/s^2}]^T\,, \qquad (2.42)
\]


Figure 2.8: Parabolic data set (elevation in m versus time in s) with an outlier at t = 4 s.

(the lower of the two curves) is clearly skewed away from the major-

ity of the data points due to its efforts to accommodate the outlier

data point and thus minimize χ2 , and it is a poor estimate of the

true model (2.28). Even without a graphical view of the data fit, we

can note immediately that (2.42) fails to fit the data acceptably because of the huge χ² value of ≈ 1109. This is astronomically out of bounds for a problem with 7 degrees of freedom, where the value of χ² should not be far from 7. The corresponding p-value for this huge χ² value is effectively zero.

The upper data prediction, obtained using the 1–norm solution (2.43), fits the quadratic trend of the majority of the data points and ignores the outlier at t = 4 s. It is also much closer to the true model (2.28), and to the 2–norm model for the data set without the outlier (2.29).

Consider the simple problem of estimating the value of a single observable, m, from n repeated measurements. The system of equations is
\[
G m = \begin{bmatrix} 1 \\ 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix} m
    = \begin{bmatrix} d_1 \\ d_2 \\ d_3 \\ \vdots \\ d_n \end{bmatrix} = d\,. \qquad (2.44)
\]

The least squares solution to (2.44) can be found from the normal equations,
\[
m_{L2} = (G^T G)^{-1} G^T d = \frac{1}{n}\sum_{i=1}^{n} d_i\,. \qquad (2.45)
\]
Thus the 2–norm solution is simply the average of the observations.

The 1–norm solution is somewhat more complicated. One important property of
\[
f(m) = \| d - Gm \|_1 = \sum_{i=1}^{n} |d_i - m| \qquad (2.46)
\]
is that f(m) is a convex function of m. Thus any local minimum point is also a global minimum point.

We can proceed by finding f′(m) at those points where it is defined, and then separately considering the points at which the derivative is not defined. Every minimum point must either have f′(m) undefined or f′(m) = 0.

At those points where f′(m) is defined,
\[
f'(m) = \sum_{i=1}^{n} \mathrm{sgn}(m - d_i)\,. \qquad (2.47)
\]

The derivative is 0 when exactly half of the data are less than m and half of

the data are greater than m. Of course, this can only happen when there are

an even number of observations. In this case, any value of m between the two

middle observations is a 1–norm solution. When there are an odd number of

data, the median data point is the unique 1–norm solution. Even an extreme

outlier will not have a large effect on the median of the data. This is the reason

for the robustness of the 1–norm solution.

The general problem of minimizing ||d − Gm||_1 is somewhat more complicated. One practical way to find 1–norm solutions is iteratively reweighted

least squares, or IRLS [SGT88]. The IRLS algorithm solves a sequence of

weighted least squares problems whose solutions converge to a 1–norm minimiz-

ing solution.

Let
\[
r = Gm - d\,. \qquad (2.48)
\]
We want to minimize
\[
f(m) = \| r \|_1 = \sum_{i=1}^{m} |r_i|\,. \qquad (2.49)
\]


This function is not differentiable at any point where one of the residuals is zero. Ignoring this issue for a moment, we can compute the derivatives of f at other points:

\[
\frac{\partial f(m)}{\partial m_k} = \sum_{i=1}^{m} \frac{\partial |r_i|}{\partial m_k} \qquad (2.50)
\]
\[
= \sum_{i=1}^{m} G_{i,k}\, \mathrm{sgn}(r_i) \qquad (2.51)
\]
\[
= \sum_{i=1}^{m} G_{i,k}\, \frac{1}{|r_i|}\, r_i\,. \qquad (2.52)
\]

Thus
\[
\nabla f(m) = G^T R\, r = G^T R\, (Gm - d)\,, \qquad (2.53)
\]
where R is a diagonal weighting matrix with R_{i,i} = 1/|r_i|. To find the 1–norm minimizing solution, we solve ∇f(m) = 0, which leads to the system of equations
\[
G^T R\, G\, m = G^T R\, d\,. \qquad (2.55)
\]

Since R depends on m, this is a nonlinear system of equations that we cannot solve directly. Fortunately, we can use a simple iterative scheme to find the appropriate weights. The IRLS algorithm begins with the least squares solution m⁰ = m_{L2}. We calculate the initial residual vector r⁰ = d − Gm⁰, construct the corresponding weighting matrix R, and solve (2.55) to obtain a new model m¹ and residual vector r¹. The process is repeated until the model and residual vector stabilize. A typical rule is to stop the iteration when
\[
\frac{\| m^{k+1} - m^{k} \|_2}{1 + \| m^{k+1} \|_2} < \tau \qquad (2.56)
\]
for some tolerance τ.

The procedure will fail if any element of the residual vector becomes zero. A simple modification to the algorithm deals with this problem. We select a tolerance ε below which we consider the residuals to be effectively zero. If |r_i| < ε, then we set R_{i,i} = 1/ε. With this modification, it can be shown that the procedure will always converge to an approximate 1–norm minimizing solution.
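The following MATLAB sketch (ours, not from the text) implements this IRLS iteration; the function name irls and the arguments tau, epsilon, and maxiter are our own choices.

function m = irls(G, d, tau, epsilon, maxiter)
% Sketch of IRLS for approximately minimizing ||G*m - d||_1.
m = G \ d;                          % start from the least squares solution
for k = 1:maxiter
    r = G*m - d;                    % current residual vector
    r(abs(r) < epsilon) = epsilon;  % guard against zero residuals
    R = diag(1 ./ abs(r));          % weighting matrix, R_{i,i} = 1/|r_i|
    mnew = (G'*R*G) \ (G'*R*d);     % solve equation (2.55)
    if norm(mnew - m) / (1 + norm(mnew)) < tau
        m = mnew;
        return
    end
    m = mnew;
end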

As with the χ² misfit measure, there is a corresponding p-value that can be used for 1–norm solutions under the assumption of normal data errors [PM80]. For the 1–norm misfit measure (2.40) with ν degrees of freedom, the probability that a misfit worse than the observed value, μ^(1)_obs, could have occurred given independent normally distributed data is approximately
\[
p^{(1)}(y,\nu) = \mathrm{Prob}(\mu^{(1)} > \mu^{(1)}_{\mathrm{obs}}) \approx S(x) - \frac{\gamma}{6}\, Z^{(2)}(x)\,, \qquad (2.57)
\]
where
\[
S(x) = \frac{1}{\sigma_1\sqrt{2\pi}} \int_{-\infty}^{x} e^{-\frac{\xi^2}{2\sigma_1}}\, d\xi\,, \qquad
\sigma_1 = (1 - 2/\pi)\,\nu\,,
\]
\[
\gamma = \frac{2 - \pi/2}{(\pi/2 - 1)^{3/2}}\, \frac{1}{\sqrt{\nu}}\,, \qquad
Z^{(2)}(x) = \frac{x^2 - 1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2}}\,,
\]
\[
x = \frac{\mu^{(1)} - \bar{\mu}}{\sigma_1}\,, \qquad
\bar{\mu} = \sqrt{2/\pi}\;\nu\,.
\]

2.4 Monte Carlo Error Propagation

For solution techniques that are nonlinear and/or algorithmic, such as IRLS, there is typically no simple way to propagate uncertainties in the data to uncertainties in the estimated model parameters. In such cases, we can apply Monte Carlo error propagation techniques, in which we simulate a collection of noisy data vectors and then measure the statistics of the resulting models. We can obtain covariance matrix information for the 1–norm solution by first forward propagating (2.43) into an assumed noise-free data vector,
\[
G m_{L1} = d_b\,.
\]

We next re-solve the IRLS problem many (q) times for 1–norm models corresponding to independent data realizations,
\[
G m_{L1,i} = d_b + n_i\,,
\]
where n_i is the ith noise vector. If A is the q by m matrix whose ith row contains the difference between the ith model estimate and the average model, then an empirical estimate of the covariance is
\[
\mathrm{Cov}(m_{L1}) = \frac{A^T A}{q}\,.
\]
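A compact MATLAB sketch of this procedure (our illustration; irls is the hypothetical 1–norm solver sketched above, and q and sigma are assumed values):

% Sketch: Monte Carlo covariance estimate for the IRLS (1-norm) solution.
q = 1000;                                 % number of noise realizations (assumed)
db = G*mL1;                               % noise-free baseline data from the 1-norm model
M = zeros(q, length(mL1));
for i = 1:q
    ni = sigma .* randn(size(db));        % simulated noise vector
    M(i,:) = irls(G, db + ni, 1e-6, 1e-8, 100)';   % re-solve for each realization
end
A = M - repmat(mean(M), q, 1);            % rows: deviations from the average model
covmL1 = (A'*A) / q                       % empirical covariance estimate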


Example 2.5

Recall Example 2.4. An estimate of Cov(m_{L1}) for this example, using 10,000 iterations of the Monte Carlo procedure, is
\[
\mathrm{Cov}(m_{L1}) = \begin{bmatrix} 123.92 & -46.96 & -7.43 \\ -46.96 & 15.44 & 3.72 \\ -7.43 & 3.72 & 0.48 \end{bmatrix} \approx 1.4\cdot \mathrm{Cov}(m_{L2})\,.
\]
Although we have no reason to believe that the model parameters will be normally distributed, we can still compute approximate 95% confidence intervals for the parameters by simply acting as though they were normally distributed. We obtain
\[
m_{L1} = [17.6 \pm 21.8\ \mathrm{m},\ 96.4 \pm 7.7\ \mathrm{m/s},\ 9.3 \pm 1.4\ \mathrm{m/s^2}]^T\,. \qquad (2.58)
\]

2.5 Exercises

1. A seismic profiling experiment is performed where the first arrival times of seismic energy from a mid-crustal refractor are measured at distances (in km) of
\[
x = \begin{bmatrix} 6.0000 \\ 10.1333 \\ 14.2667 \\ 18.4000 \\ 22.5333 \\ 26.6667 \end{bmatrix}
\]
from the source, and are found to be (in seconds after the source origin time)
\[
t = \begin{bmatrix} 3.4935 \\ 4.2853 \\ 5.1374 \\ 5.8181 \\ 6.8632 \\ 8.1841 \end{bmatrix}\,.
\]
The model for this arrival time pattern is a simple two-layer, flat-lying structure that predicts
\[
t_i = t_0 + s_2 x_i\,,
\]
where the intercept time, t_0, depends on the thickness and slowness of the upper layer, and s_2 is the slowness of the lower layer. The noise is believed to be uncorrelated and normally distributed with expected value 0 and standard deviation 0.1 s, i.e., σ = 0.1 s.


(a) Find the least squares solution for the two model parameters t_0 and s_2. Plot the data predictions from your model relative to the true data.

(b) Calculate and comment on the parameter correlation matrix. How

will the correlation entries be reflected in the appearance of the error

ellipsoid?

(c) Plot the error ellipsoid in the (t0 , s2 ) plane and calculate conservative

95% confidence intervals for t0 and s2 .

Hint: The following MATLAB code will plot a 2-dimensional covariance ellipse, where covm is the covariance matrix, m is the 2-vector of model parameters, and delta is the confidence-region scaling factor (e.g., delta = sqrt(chi2inv(0.95, 2)) for a 95% region).

%diagonalize the covariance matrix
[u, lam] = eig(inv(covm));
%generate a vector of angles from 0 to 2*pi
theta = (0:0.01:2*pi)';
%calculate the x component of the ellipsoid for all angles
r(:,1) = (delta/sqrt(lam(1,1)))*u(1,1)*cos(theta) + ...
         (delta/sqrt(lam(2,2)))*u(1,2)*sin(theta);
%calculate the y component of the ellipsoid for all angles
r(:,2) = (delta/sqrt(lam(1,1)))*u(2,1)*cos(theta) + ...
         (delta/sqrt(lam(2,2)))*u(2,2)*sin(theta);
%plot the ellipse, adding in the model parameters
plot(m(1)+r(:,1), m(2)+r(:,2))

(d) Evaluate the p–value for this model (you may find the MATLAB

Statistics Toolbox function chi2cdf to be useful here).

(e) Evaluate the value of χ2 for 1000 Monte Carlo simulations using the

data prediction from your model perturbed by noise that is consistent

with the data assumptions. Compare a histogram of these χ2 values

with the theoretical χ2 distribution for the correct number of degrees

of freedom (you may find the MATLAB statistical toolbox function

chi2pdf to be useful here).

(f) Are your p–value and Monte Carlo χ2 distribution consistent with

the theoretical modeling and the data set? If not, explain what is

wrong.

(g) Use IRLS to evaluate 1–norm estimates for t0 and s2 . Plot the data

predictions from your model relative to the true data and compare

with (a).

(h) Use Monte Carlo error propagation and IRLS to estimate symmetric

95% confidence intervals on the 1–norm solution for t0 and s2 .

(i) Examining the contributions from each of the data points to the 1–

norm misfit measure, can you make a case that any of the data points

are statistical outliers?


2. Consider fitting data points d = a + bx + n, where x = [1, 2, 3, 4, 5]^T, the n = 2 true model parameters are a = b = 1, and n is an m-element vector of uncorrelated N(0, 1) noise.

(a) Generate a large number of data realizations, solve for the 2–norm parameters for each realization, and histogram the estimates in 100 bins.

(b) Calculate the parameter covariance matrix for uncorrelated N(0, 1) data errors, C = (G^T G)^{-1}, and give the standard deviations, σ_a and σ_b, for your estimates of a and b.

(c) Calculate standardized parameter estimates
\[
a' = \frac{a - \bar{a}}{\sigma\, C_{1,1}^{1/2}}
\quad\text{and}\quad
b' = \frac{b - \bar{b}}{\sigma\, C_{2,2}^{1/2}}\,,
\]
and demonstrate using a Q–Q plot that your estimates for a' and b' are distributed as N(0, 1).

(d) Show using a Q–Q plot that the squared residual lengths
\[
\| r \|_2^2 = \| d - Gm \|_2^2
\]
are distributed as χ² with 3 degrees of freedom.

(e) Assume that the noise standard deviation for the synthetic data set is not known, and estimate it for each realization as
\[
s = \sqrt{\frac{1}{m-n}\sum_{i=1}^{m} r_i^2}\,.
\]
Recompute the standardized parameter estimates
\[
a' = \frac{a - \bar{a}}{s\, C_{1,1}^{1/2}}
\quad\text{and}\quad
b' = \frac{b - \bar{b}}{s\, C_{2,2}^{1/2}}\,,
\]
where each solution is normalized by its respective standard deviation estimate.

(f) Demonstrate using a Q–Q plot that your estimates for a' and b' are distributed as the t PDF with ν = 3 degrees of freedom.

2.6 Notes and Further Reading

Linear regression is a major subfield within statistics, and there are hundreds of

associated textbooks. However, many of these textbooks focus on applications

of linear regression in the social sciences. In such applications, the primary focus

is often on determining which variables have an effect on the response variable of

interest, rather than on estimating parameter values for a predetermined model.

In this context it is important to test the hypothesis that a predictor variable has

a 0 coefficient in the regression model. Since we normally know which predictor

variables are important in the physical sciences, this is not typically important

to us. Useful books on linear regression from the standpoint of estimating

parameters in the context considered here include [DS98, Mye90].

Statistical methods that are robust to outliers are an important topic. Huber

discusses a variety of robust statistical procedures [Hub96]. The computational

problem of computing a 1–norm solution has been extensively researched. Wat-

son reviews the history of methods for finding p–norm solutions including the

1–norm case [Wat00].

We have assumed that G is known exactly. In some cases entries in this

matrix might be subject to measurement error. This problem has been studied

as the total least squares problem [HV91]. An alternative approach to least

squares problems with uncertainties in G that has recently received considerable

attention is called robust least squares [BTN01, EL97].



Chapter 3

Discretizing Continuous Inverse Problems

In this chapter, we will discuss inverse problems involving functions rather than

vectors of parameters. There is a large body of mathematical theory that can

be applied to such problems. Some of these problems can be solved analytically.

However, in practice such problems are often approximated by linear systems

of equations. Thus we will focus here on techniques for discretizing continuous inverse problems, and then consider techniques for solving the discretized problems in subsequent chapters.

We begin by considering problems of the form
\[
d(s) = \int_a^b g(s,t)\, m(t)\, dt\,. \qquad (3.1)
\]

Here d(s) is a known function, typically representing the observed data. The

function g(s, t) is also known. It encodes the physics that relates the unknown

model m(t) to the observed d(s). The interval [a, b] may be finite, in which case

the analysis is somewhat simpler, or the interval may be infinite. The function

d(s) might in theory be known over an entire interval [c, d], but in practice we

will only have measurements of d(s) at a finite set of points.

We wish to solve for the unknown function m(t). This type of linear equation,

which we previously saw in Chapter 1, is called a Fredholm integral equation

of the first kind or IFK. A surprisingly large number of inverse problems

can be written as Fredholm integral equations of the first kind. Unfortunately,

some IFK’s have properties that can make them very difficult to solve.



Example 3.1

One common type of inverse problem that we have seen previously, and which can be written in the form of an IFK, is the deconvolution problem
\[
y(t) = \int_0^t g(t-u)\, f(u)\, du\,.
\]

Here the kernel function g depends only on the difference between

t and u (1.7). Our goal is to obtain f(t). Convolution integrals of

this type arise whenever the response of an instrument depends on

some sort of weighted time average of the value of the parameter

of interest. The instrument response is y(t), while f (t) is the value

that we wish to measure.

This problem is not quite in the form of (3.1), since the upper limit of integration is not constant. However, if we fix some time b beyond which we are not interested in the solution, and let
\[
g(t,u) = \begin{cases} g(t-u) & u \le t \\ 0 & u > t\,, \end{cases}
\]
then we can write the problem as
\[
y(t) = \int_0^b g(t,u)\, f(u)\, du\,.
\]

3.2 Quadrature Methods

There are a number of techniques for approximating IFKs by discrete systems of linear equations. We will first assume that d is known only at a finite number of points s_1, s_2, ..., s_m. This is not an important restriction, since all practical data consist of a finite number of measurements. We can write the IFK as

\[
d_i = d(s_i) = \int_a^b g(s_i, t)\, m(t)\, dt \qquad (i = 1, 2, \ldots, m)\,,
\]
or as
\[
d_i = \int_a^b g_i(t)\, m(t)\, dt \qquad (i = 1, 2, \ldots, m)\,, \qquad (3.2)
\]
where g_i(t) = g(s_i, t). The functions g_i(t) are referred to as representers or data kernels.

In the quadrature approach to discretizing an IFK, we use a quadrature rule (an approximate numerical integration scheme) to approximate
\[
\int_a^b g_i(t)\, m(t)\, dt\,.
\]


Figure 3.1: Midpoint-rule discretization of the interval [a, b] with points t_1, t_2, ..., t_n.

The simplest quadrature rule is the midpoint rule. We divide the interval [a, b] into n subintervals and pick points t_1, t_2, ..., t_n in the middle of each subinterval. Thus
\[
t_i = a + \frac{\Delta t}{2} + (i-1)\,\Delta t\,,
\]
where
\[
\Delta t = \frac{b-a}{n}\,.
\]

The integral is then approximated by
\[
\int_a^b g_i(t)\, m(t)\, dt \approx \sum_{j=1}^{n} g_i(t_j)\, m(t_j)\, \Delta t\,, \qquad (3.3)
\]
so that the discretized data equations become
\[
d_i = \sum_{j=1}^{n} g_i(t_j)\, m(t_j)\, \Delta t \qquad (i = 1, 2, \ldots, m)\,. \qquad (3.4)
\]

If we let
\[
G_{i,j} = g_i(t_j)\,\Delta t \qquad (i = 1, \ldots, m;\ j = 1, \ldots, n) \qquad (3.5)
\]
and
\[
m_j = m(t_j) \qquad (j = 1, 2, \ldots, n)\,, \qquad (3.6)
\]
then we obtain a linear system of equations Gm = d.

The approach of using the midpoint rule to approximate the integral is known

as simple collocation. Of course, there are also more sophisticated quadrature

rules such as the trapezoidal rule and Simpson’s rule. In each case, we end up

with a similar linear system of equations.
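As a concrete illustration (ours, not from the text), the following MATLAB sketch builds G by the midpoint rule for a kernel supplied as a function handle g(s,t); the function name and arguments are our own choices.

function G = midpoint_G(g, s, a, b, n)
% Sketch: midpoint-rule (simple collocation) discretization of an IFK.
% g is a function handle g(s,t); s is an m-vector of measurement points.
dt = (b - a)/n;
t = a + dt/2 + (0:n-1)*dt;        % midpoints t_1, ..., t_n
m = length(s);
G = zeros(m, n);
for i = 1:m
    for j = 1:n
        G(i,j) = g(s(i), t(j)) * dt;   % equation (3.5)
    end
end
end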

Example 3.2

Consider the vertical seismic profiling example (Example 1.3) of Chapter 1, where we wish to estimate vertical seismic slowness using arrival time measurements of downward-propagating seismic waves (Figure 3.2). As noted in Chapter 1, the data in this case are related to the slowness model by (1.15).


Figure 3.2: Discretization of the vertical seismic profiling problem (n/m = 2).

Discretizing (1.15) for m observations, t_i, at depths y_i (assumed equally spaced at intervals ∆y; Figure 3.2), and for n model depths z_j (assumed equally spaced at intervals ∆z), gives n/m = ∆y/∆z. The corresponding discretized system of equations is
\[
t_i = \sum_{j=1}^{n} H(y_i - z_j)\, s_j\, \Delta z\,. \qquad (3.7)
\]

The rows G_{i,·} of the matrix each consist of i·n/m entries equal to ∆z on the left and n − (i·n/m) zeros on the right. For n = m, G is simply a lower triangular matrix with each nonzero entry equal to ∆z.
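A small MATLAB sketch of this construction (ours; the numerical values of n and dz are assumptions):

% Sketch: build G for the discretized VSP problem (3.7).
n = 100;                      % number of model depths (assumed)
dz = 0.2;                     % depth interval in km (assumed)
G = tril(ones(n)) * dz;       % n = m case: lower triangular with entries dz
% For n/m = 2 (two model cells per observation), row i instead has 2*i
% entries of dz followed by zeros:
m = n/2;
G2 = zeros(m, n);
for i = 1:m
    G2(i, 1:2*i) = dz;
end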

Example 3.3

Another classic problem that can be written as an IFK and then discretized by the method of simple collocation concerns an optics experiment in which light passes through a slit [Sha72]. We measure the intensity of diffracted light, d(s), as a function of the outgoing angle −π/2 ≤ s ≤ π/2. Our goal is to find the intensity of the light incident on the slit, m(t), as a function of the incoming angle −π/2 ≤ t ≤ π/2 (Figure 3.3).


The data and model are related by
\[
d(s) = \int_{-\pi/2}^{\pi/2} (\cos(s) + \cos(t))^2 \left(\frac{\sin(\pi(\sin(s)+\sin(t)))}{\pi(\sin(s)+\sin(t))}\right)^2 m(t)\, dt\,. \qquad (3.8)
\]

We use the method of simple collocation with n equal intervals for both the model and the data functions, where n is a multiple of 2. Although it is not generally necessary, we will assume for convenience that the data are collected at the same n equally spaced angles,
\[
s_i = t_i = \frac{(i-0.5)\pi}{n} - \frac{\pi}{2} \qquad (i = 1, 2, \ldots, n)\,. \qquad (3.9)
\]

Let
\[
G_{i,j} = \Delta s\, (\cos(s_i) + \cos(t_j))^2 \left(\frac{\sin(\pi(\sin(s_i)+\sin(t_j)))}{\pi(\sin(s_i)+\sin(t_j))}\right)^2\,, \qquad (3.10)
\]
where
\[
\Delta s = \frac{\pi}{n}\,. \qquad (3.11)
\]

If we let
\[
d_i = d(s_i) \qquad (i = 1, 2, \ldots, n) \qquad (3.12)
\]
and
\[
m_j = m(t_j) \qquad (j = 1, 2, \ldots, n)\,, \qquad (3.13)
\]
then our discretized linear system for this problem is Gm = d.

The MATLAB Regularization Toolbox [Han94] contains a routine

shaw that computes the G matrix for this problem.
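For readers without the toolbox, a direct MATLAB sketch of (3.9)–(3.10) (ours; the value of n is an assumption) is:

% Sketch: build the Shaw problem matrix G directly from (3.9)-(3.10).
n = 20;                                   % must be even (assumed value)
s = ((1:n)' - 0.5)*pi/n - pi/2;           % angles (3.9); here t = s
t = s;
ds = pi/n;
G = zeros(n);
for ii = 1:n
    for jj = 1:n
        u = pi*(sin(s(ii)) + sin(t(jj)));
        if u == 0
            sincterm = 1;                 % limit of sin(u)/u as u -> 0
        else
            sincterm = sin(u)/u;
        end
        G(ii,jj) = ds*(cos(s(ii)) + cos(t(jj)))^2 * sincterm^2;
    end
end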

Example 3.4


Figure 3.4: Groundwater flow in the x direction away from the contamination source.

In this example we consider the problem of reconstructing the history of groundwater pollution at a site from later measurements of the groundwater contamination at other points "downstream" from the initial contamination (see Figure 3.4). This "source history reconstruction problem" has been considered by a number of authors [NBW00, SK94, SK95, WU96].

The transport of the contaminant is governed by an advection–diffusion equation,
\[
\frac{\partial C}{\partial t} = D\,\frac{\partial^2 C}{\partial x^2} - v\,\frac{\partial C}{\partial x}\,, \qquad (3.14)
\]
\[
C(0, t) = C_{in}(t)\,, \qquad C(x,t) \to 0 \ \text{as}\ x \to \infty\,, \qquad C(x, 0) = C_0(x)\,,
\]
where D is the diffusion coefficient and v is the velocity of the groundwater flow.

The solution to this PDE is given by the convolution
\[
C(x, T) = \int_0^T C_{in}(t)\, f(x, T-t)\, dt\,, \qquad (3.15)
\]
where the advection–diffusion kernel is
\[
f(x, T-t) = \frac{x}{2\sqrt{\pi D (T-t)^3}}\; e^{-\frac{[x - v(T-t)]^2}{4D(T-t)}}\,.
\]

We assume that the parameters of the advection–diffusion equation are known and that we want to estimate C_{in}(t) from observations at some later time T. The convolution formula (3.15) for C(x, T) is discretized as
\[
d = Gm\,,
\]


where d is a vector of sampled concentrations at points x_i at a time T, m is a vector of C_{in} values, and
\[
G_{i,j} = f(x_i, T - t_j)\,\Delta t = \frac{x_i}{2\sqrt{\pi D (T-t_j)^3}}\; e^{-\frac{[x_i - v(T-t_j)]^2}{4D(T-t_j)}}\,\Delta t\,.
\]

3.3 The Gram Matrix Technique

In the Gram matrix technique, we write the model m(t) as a linear combination of the m representers (3.2),
\[
m(t) = \sum_{j=1}^{m} \alpha_j\, g_j(t)\,. \qquad (3.16)
\]

Substituting (3.16) into the IFK then gives
\[
d(s_i) = \int_a^b g_i(t) \sum_{j=1}^{m} \alpha_j\, g_j(t)\, dt \qquad (3.17)
\]
\[
= \sum_{j=1}^{m} \alpha_j \int_a^b g_i(t)\, g_j(t)\, dt \qquad (i = 1, 2, \ldots, m)\,. \qquad (3.18)
\]
If we define the Gram matrix
\[
\Gamma_{i,j} = \int_a^b g_i(t)\, g_j(t)\, dt \qquad (i, j = 1, 2, \ldots, m) \qquad (3.20)
\]
and the data vector
\[
d_i = d(s_i) \qquad (i = 1, 2, \ldots, m)\,, \qquad (3.21)
\]
then we obtain the m by m linear system
\[
\Gamma \alpha = d\,.
\]

Once this linear system has been solved, the corresponding model is
\[
m(t) = \sum_{j=1}^{m} \alpha_j\, g_j(t)\,. \qquad (3.22)
\]

If the representers and the Gram matrix are analytically expressible, then the Gram matrix formulation facilitates finding continuous solutions for m(t) [Par94]. Where only numerical representations of the representers exist, the method can still be used, although the elements of Γ must then be obtained by numerical integration.
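A minimal MATLAB sketch of this approach (ours; gfun is an assumed function handle returning g_i(t) = g(s_i, t), vectorized in t, and a, b, s, and d are assumed to be defined):

% Sketch: Gram matrix discretization with numerically integrated entries.
m = length(s);
Gamma = zeros(m);
for i = 1:m
    for j = 1:m
        Gamma(i,j) = integral(@(t) gfun(s(i),t) .* gfun(s(j),t), a, b);  % (3.20)
    end
end
alpha = Gamma \ d;                    % solve Gamma*alpha = d
tt = linspace(a, b, 200);             % evaluate the continuous model (3.22) on a grid
mt = zeros(size(tt));
for j = 1:m
    mt = mt + alpha(j) * gfun(s(j), tt);
end
plot(tt, mt)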

It can be shown (see the Exercises) that if the representers gi (t) are linearly

independent, then the matrix Γ will be nonsingular.

However, as we will see in the next chapter, the Γ matrix tends to become

very badly conditioned as m increases. On the other hand, we want to use as large a value of m as possible so as to increase the accuracy of the discretization.

There is thus a tradeoff between the discretization error due to using too small

a value of m and the ill-conditioning of Γ due to using too large a value of m.

3.4 Expansion in Terms of Basis Functions

In the Gram matrix technique, we approximated m(t) as a linear combination of the m representers, which form a natural basis. More generally, given suitable functions h_1(t), h_2(t), ..., h_n(t) that form a basis for a space H of functions, we could approximate m(t) by
\[
m(t) = \sum_{j=1}^{n} \alpha_j\, h_j(t)\,. \qquad (3.23)
\]

We can substitute this approximation into the integral equation to obtain
\[
d(s_i) = \int_a^b g_i(t) \sum_{j=1}^{n} \alpha_j\, h_j(t)\, dt \qquad (3.24)
\]
\[
= \sum_{j=1}^{n} \alpha_j \int_a^b g_i(t)\, h_j(t)\, dt \qquad (i = 1, 2, \ldots, m)\,. \qquad (3.25)
\]
This leads to the m by n linear system
\[
G\alpha = d\,,
\]
where
\[
G_{i,j} = \int_a^b g_i(t)\, h_j(t)\, dt\,.
\]

If we define the dot product or inner product of two functions to be
\[
f \cdot g = \int_a^b f(x)\, g(x)\, dx\,, \qquad (3.27)
\]
then the corresponding function norm is
\[
\|f\|_2 = \sqrt{\int_a^b f(x)^2\, dx}\,. \qquad (3.28)
\]


This norm can be very useful in approximating functions. The corresponding measure of the difference between functions f and g is
\[
\|f - g\|_2 = \sqrt{\int_a^b (f(x) - g(x))^2\, dx}\,. \qquad (3.29)
\]

If it happens that our basis functions h_j(x) are orthonormal with respect to this inner product, then the projection of g_i(t) onto the space H spanned by the basis is
\[
\mathrm{proj}_H\, g_i(t) = (g_i \cdot h_1)\,h_1(t) + (g_i \cdot h_2)\,h_2(t) + \ldots + (g_i \cdot h_n)\,h_n(t)\,.
\]
The elements of the G matrix are given by the same dot products,
\[
G_{i,j} = g_i \cdot h_j\,. \qquad (3.30)
\]
Thus we have effectively projected the original representers onto our function space H.

A variety of basis functions have been used to discretize integral equations

including sines and cosines, spherical harmonics, B-splines, and wavelets. In

selecting the basis functions, it is important to select a basis that can reasonably

represent likely models. The basis functions should be linearly independent, so

that a function can be written in terms of the basis functions in exactly one

way, and (3.23) is thus unique. It is also desirable to use an orthonormal basis.

The selection of an appropriate basis for a particular problem is a fine art

that requires detailed knowledge of the problem as well as of the behavior of

the basis functions. Beware that a poorly selected basis may not adequately

approximate the solution, resulting in an estimated model m(t) which is very

wrong.

Example 3.5

Consider the discretization of a Fredholm integral equation of the first kind in which the interval of integration is [0, ∞),
\[
d(s) = \int_0^{\infty} g(s,t)\, m(t)\, dt\,.
\]

One approach would be to perform a change of variables from t to x so that the x interval becomes finite. We could then use simple collocation to discretize the problem. Unfortunately, this method often introduces singularities in the integrand that are difficult to handle numerically.

Instead, we will use a more sophisticated approach involving the Laguerre polynomials. The Laguerre basis consists of polynomials L_i(t), where
\[
L_{-1}(t) = 0\,, \qquad L_0(t) = 1\,,
\]
and
\[
L_{n+1}(t) = \frac{2n+1-t}{n+1}\, L_n(t) - \frac{n}{n+1}\, L_{n-1}(t) \qquad (n = 0, 1, \ldots)\,.
\]
The first few polynomials include L_1(t) = 1 − t, L_2(t) = 1 − 2t + t²/2, and L_3(t) = 1 − 3t + 3t²/2 − t³/6.
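A short MATLAB sketch of this recurrence (ours; the function name is our own choice):

function L = laguerre(k, t)
% Sketch: evaluate the Laguerre polynomial L_k(t) via the recurrence.
Lprev = zeros(size(t));     % L_{-1}(t) = 0
L = ones(size(t));          % L_0(t) = 1
for n = 0:k-1
    Lnext = ((2*n + 1 - t)/(n + 1)) .* L - (n/(n + 1)) .* Lprev;
    Lprev = L;
    L = Lnext;
end
end

For example, laguerre(2, 1) returns −0.5, matching L_2(1) = 1 − 2 + 1/2.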

We will use a slightly modified inner product on this interval,
\[
f \cdot g = \int_0^{\infty} f(x)\, g(x)\, e^{-x}\, dx\,.
\]
The weighting factor of e^{-x} helps to ensure that the integral will converge, but it has the effect of down-weighting the importance of values of the function for large x.

The Laguerre polynomial basis functions are orthonormal in the sense that
\[
\int_0^{\infty} L_n(t)\, L_m(t)\, e^{-t}\, dt = \begin{cases} 0 & m \neq n \\ 1 & m = n\,. \end{cases}
\]

Since these functions form a basis, we can expand m(t) in terms of them. In order to avoid an infinite sum, we will truncate the expansion after n terms,
\[
m(t) = \sum_{j=1}^{n} \alpha_j\, L_{j-1}(t)\, e^{-t}\,,
\]
so that
\[
d(s_i) = \int_0^{\infty} g_i(t)\, m(t)\, dt = \sum_{j=1}^{n} \alpha_j \int_0^{\infty} g_i(t)\, L_{j-1}(t)\, e^{-t}\, dt\,.
\]

These integrals can be computed either exactly using analytical methods or approximately using numerical methods. In either case, we obtain a linear system of equations
\[
G\alpha = d\,,
\]
where
\[
G_{i,j} = \int_0^{\infty} g_i(t)\, L_{j-1}(t)\, e^{-t}\, dt \qquad (i = 1, \ldots, m;\ j = 1, \ldots, n)\,.
\]
In terms of the inner product defined above, these entries are
\[
G_{i,j} = g_i \cdot L_{j-1}\,,
\]
which are exactly the coefficients of the L_{j-1} in the orthogonal expansion of g_i(t).

In this example, we selected an inner product and orthogonal basis that give less weight to values of m(t) for large t. We are effectively measuring how close two functions f and g are by
\[
\|f - g\| = \sqrt{\int_0^{\infty} (f(x) - g(x))^2\, e^{-x}\, dx}\,,
\]
so the approximation to m(t) is allowed to become less accurate as t increases. This might be appropriate if we are uninterested in what happens for large values of t. However, if the solution at large values of t is important, then this would not be an appropriate norm.


3.5 Exercises

1. Consider the following data set:

y d(y)

0.025 0.2388

0.075 0.2319

0.125 0.2252

0.175 0.2188

0.225 0.2126

0.275 0.2066

0.325 0.2008

0.375 0.1952

0.425 0.1898

0.475 0.1846

0.525 0.1795

0.575 0.1746

0.625 0.1699

0.675 0.1654

0.725 0.1610

0.775 0.1567

0.825 0.1526

0.875 0.1486

0.925 0.1447

0.975 0.1410

These data can be found in the file ifk.mat. The function d(y), 0 ≤ y ≤ 1, is related to an unknown function m(x), 0 ≤ x ≤ 1, through the IFK
\[
d(y) = \int_0^1 x\, e^{-xy}\, m(x)\, dx\,. \qquad (3.31)
\]

(a) Using the data provided, discretize the integral equation using simple

collocation and solve the resulting system of equations.

(b) What is the condition number for this system of equations? Given

that the data d(y) are only accurate to about 4 digits, what does this

tell you about the accuracy of your solution?

2. Use the Gram matrix technique to discretize the integral equation from

problem 3.1.

(a) Solve the resulting linear system of equations, and plot the resulting

model.

(b) What was the condition number of Γ? What does this tell you about

the accuracy of your solution?


3. Show that if the representers gi (t) are linearly independent, then the Gram

matrix Γ is nonsingular.

3.6 Notes and Further Reading

Techniques for discretizing integral equations are discussed in [Par94, Two96, Win91].



Chapter 4

Rank Deficiency and Ill-Conditioning

We next describe the general solution to linear least squares problems, especially

those that are ill-conditioned and/or rank deficient.

4.1 The SVD and the Generalized Inverse

One method of analyzing and solving such systems is the singular value decomposition, or SVD. The SVD is a decomposition whereby a general m by n system matrix, G, relating model and data,
\[
Gm = d\,, \qquad (4.1)
\]
is factored into
\[
G = USV^T\,, \qquad (4.2)
\]
where

• U is an m by m orthogonal matrix whose columns are unit basis vectors spanning the data space, R^m.

• V is an n by n orthogonal matrix whose columns are unit basis vectors spanning the model space, R^n.

• S is an m by n diagonal matrix of singular values.

The SVD matrices can be computed in MATLAB with the svd command.

The singular values along the diagonal of S are customarily arranged in decreasing size, s_1 ≥ s_2 ≥ ... ≥ s_{min(m,n)} ≥ 0. Some of the singular values may be zero. If only the first p singular values are nonzero, we can partition S as
\[
S = \begin{bmatrix} S_p & 0 \\ 0 & 0 \end{bmatrix}\,, \qquad (4.3)
\]



where S_p is a p by p diagonal matrix composed of the positive singular values. Now, expand the SVD representation of G:
\[
G = USV^T \qquad (4.4)
\]
\[
= [U_{\cdot,1}, U_{\cdot,2}, \ldots, U_{\cdot,m}] \begin{bmatrix} S_p & 0 \\ 0 & 0 \end{bmatrix} [V_{\cdot,1}, V_{\cdot,2}, \ldots, V_{\cdot,n}]^T \qquad (4.5)
\]
\[
= [U_p,\ U_0] \begin{bmatrix} S_p & 0 \\ 0 & 0 \end{bmatrix} [V_p,\ V_0]^T\,, \qquad (4.6)
\]
where U_p denotes the first p columns of U, U_0 denotes the last m − p columns of U, V_p denotes the first p columns of V, and V_0 denotes the last n − p columns of V. Because the last m − p columns of U and the last n − p columns of V in (4.6) are multiplied by zeros in S, we can simplify the SVD of G into its compact form,
\[
G = U_p S_p V_p^T\,. \qquad (4.8)
\]

Take any vector y in R(G). Using (4.8), we can write y as
\[
y = Gx = U_p S_p V_p^T x\,. \qquad (4.9)
\]
Writing out this matrix–vector multiplication, we see that any vector y in R(G) can be written as a linear combination of the columns of U_p,
\[
y = \sum_{i=1}^{p} z_i\, U_{\cdot,i}\,. \qquad (4.10)
\]
Since the columns of U_p span R(G), are linearly independent, and are orthonormal, they form an orthonormal basis for R(G). Because this basis has p vectors, rank(G) = p.

Since U is an orthogonal matrix, the columns of U form an orthonormal basis

for Rm . We have already seen that the p columns of Up form an orthonormal

basis for R(G). By theorem (A.5), N (GT ) + R(G) = Rm , so the remaining

m − p columns of U_0 form an orthonormal basis for N(G^T). Similarly, because G^T = V_p S_p U_p^T, the columns of V_p form an orthonormal basis for R(G^T) and the columns of V_0 form an orthonormal basis for N(G).

Two other important properties of the SVD are similar to properties of eigenvalues and eigenvectors. Since the columns of V are orthonormal,
\[
V^T V_{\cdot,i} = e_i\,.
\]
Thus
\[
G V_{\cdot,i} = USV^T V_{\cdot,i} = US e_i = s_i\, U_{\cdot,i} \qquad (4.11)
\]
and
\[
G^T U_{\cdot,i} = VS^T U^T U_{\cdot,i} = VS^T e_i = s_i\, V_{\cdot,i}\,. \qquad (4.12)
\]


There is an important connection between the singular values of G and the eigenvalues of the matrices GG^T and G^T G:
\[
GG^T U_{\cdot,i} = G\, s_i\, V_{\cdot,i} \qquad (4.13)
\]
\[
= s_i\, G V_{\cdot,i} \qquad (4.14)
\]
\[
= s_i^2\, U_{\cdot,i}\,. \qquad (4.15)
\]
Similarly,
\[
G^T G V_{\cdot,i} = s_i^2\, V_{\cdot,i}\,. \qquad (4.16)
\]

These relations show that we could, in theory, compute the SVD by finding

the eigenvalues and eigenvectors of GT G and GGT . In practice, more efficient

specialized algorithms are used [Dem97, GL96, TB97].

The SVD can be used to compute a generalized inverse of G, which is called the Moore–Penrose pseudoinverse because it has desirable properties originally identified by Moore and Penrose [Moo20, Pen55]. The generalized inverse is
\[
G^{\dagger} = V_p S_p^{-1} U_p^T\,. \qquad (4.17)
\]

Using (4.17), let
\[
m^{\dagger} = V_p S_p^{-1} U_p^T d = G^{\dagger} d\,. \qquad (4.18)
\]
We will show that m† is a least squares solution. Among the desirable properties of (4.18) is that G†, and hence m†, always exist, unlike, for example, the inverse of G^T G in the normal equations (2.4), which does not exist when G is not of full rank.

To encapsulate what the SVD tells us about our linear system, G, and the corresponding generalized inverse system G†, consider four cases:

1. Both the model and data null spaces, N(G) and N(G^T), are trivial. U_p and V_p are then both square orthogonal matrices, so that U_p^T = U_p^{-1} and V_p^T = V_p^{-1}, and (4.18) gives
\[
G^{\dagger} = V_p S_p^{-1} U_p^T = (U_p S_p V_p^T)^{-1} = G^{-1} \qquad (4.19)
\]
(with m = n = p = rank(G)). The solution is unique, and the data are fit exactly.

2. N(G) is nontrivial, while N(G^T) is trivial (p = m < n). In this case V_p^T V_p = I_p (the p by p identity matrix), and G applied to the generalized inverse solution gives
\[
Gm^{\dagger} = U_p S_p V_p^T V_p S_p^{-1} U_p^T d \qquad (4.21)
\]
\[
= U_p S_p I_p S_p^{-1} U_p^T d \qquad (4.22)
\]
\[
= d\,, \qquad (4.23)
\]


so the data are fit. However, the solution is nonunique because of the existence of the nontrivial model null space N(G). The general solution is the sum of m† and an arbitrary model component in N(G),
\[
m = m^{\dagger} + m_0 = m^{\dagger} + \sum_{i=p+1}^{n} \alpha_i\, V_{\cdot,i}\,. \qquad (4.24)
\]
Because the columns of V are orthonormal, the squared norm of the general solution m is
\[
\|m\|_2^2 = \|m^{\dagger}\|_2^2 + \sum_{i=p+1}^{n} \alpha_i^2 \ge \|m^{\dagger}\|_2^2\,, \qquad (4.25)
\]
where we have equality only if all of the model null space coefficients α_i are zero. The generalized inverse solution, which lies in R(G^T), is thus a minimum 2–norm or minimum length solution for a rank deficient system with m = rank(G) < n.

We can also write this solution in terms of G and G^T:
\[
m^{\dagger} = V_p S_p^{-1} U_p^T d \qquad (4.26)
\]
\[
= V_p S_p U_p^T U_p S_p^{-2} U_p^T d \qquad (4.27)
\]
\[
= G^T (U_p S_p^{-2} U_p^T)\, d \qquad (4.28)
\]
\[
= G^T (GG^T)^{-1} d\,. \qquad (4.29)
\]
In practice, however, it is generally better to compute m† from the SVD than to use the above formula involving (GG^T)^{-1}, because of round-off issues.

3. N(G^T) is nontrivial, while N(G) is trivial (p = n < m), so that R(G) is a proper subspace of R^m. Here
\[
Gm^{\dagger} = U_p S_p V_p^T (V_p S_p^{-1} U_p^T d) \qquad (4.30)
\]
\[
= U_p U_p^T d\,. \qquad (4.31)
\]
The product U_p U_p^T d gives the projection of d onto R(G). That is, Gm† is equal to the projection of d onto the range of G. This shows that the generalized inverse gives a least squares solution to Gm = d.

This solution is exactly the same as that obtained from the normal equations, because
\[
(G^T G)^{-1} = (V_p S_p U_p^T U_p S_p V_p^T)^{-1} \qquad (4.32)
\]
\[
= (V_p S_p^2 V_p^T)^{-1} \qquad (4.33)
\]
\[
= V_p S_p^{-2} V_p^T \qquad (4.34)
\]
and
\[
m^{\dagger} = G^{\dagger} d \qquad (4.36)
\]
\[
= V_p S_p^{-1} U_p^T d \qquad (4.37)
\]
\[
= V_p S_p^{-2} V_p^T\, V_p S_p U_p^T d \qquad (4.38)
\]
\[
= (G^T G)^{-1} G^T d\,. \qquad (4.39)
\]
In practice, however, it is generally better to compute m† from the SVD than to calculate (G^T G)^{-1} and use the normal equation solution, because of round-off issues.

4. Both N(G^T) and N(G) are nontrivial, and rank(G) is less than both m and n. In this case, the generalized inverse solution encapsulates the behavior of both of the two previous cases, minimizing both ||Gm − d||_2 and ||m||_2. As in case 3,
\[
Gm^{\dagger} = U_p S_p V_p^T (V_p S_p^{-1} U_p^T d) \qquad (4.40)
\]
\[
= U_p U_p^T d \qquad (4.41)
\]
\[
= \mathrm{proj}_{R(G)}\, d\,. \qquad (4.42)
\]
As in case 2, any least squares model m can be written as
\[
m = m^{\dagger} + m_0 = m^{\dagger} + \sum_{i=p+1}^{n} \alpha_i\, V_{\cdot,i}\,. \qquad (4.43)
\]
Again,
\[
\|m\|_2^2 = \|m^{\dagger}\|_2^2 + \sum_{i=p+1}^{n} \alpha_i^2 \ge \|m^{\dagger}\|_2^2\,, \qquad (4.44)
\]
with equality only if all of the null space coefficients α_i are zero.

The generalized inverse thus produces an inverse solution that always exists,

and that is both least-squares and minimum length, regardless of the relative

sizes of rank(G), m and n. Relationships between the subspaces R(G), N (GT ),

R(GT ), N (G), and the operators G and G† , are shown schematically in Figure

4.1. Table 4.1 summarizes the SVD and its properties.

To examine the significance of the N(G) subspace, which is spanned by the columns of V_0, consider an arbitrary model m_0 that lies in N(G),
\[
m_0 = \sum_{i=p+1}^{n} \alpha_i\, V_{\cdot,i}\,. \qquad (4.45)
\]


Figure 4.1: SVD model and data space mappings, where G is the forward

operator and G† is the generalized inverse. N (GT ) and N (G) are the data and

model null spaces, respectively.

Table 4.1: Summary of the SVD and its properties.

G        m by n       G = USV^T = U_p S_p V_p^T.
U        m by m       U is an orthogonal matrix; U = [U_p, U_0].
S        m by n       S is a diagonal matrix of singular values, S_{i,i} = s_i.
V        n by n       V is an orthogonal matrix; V = [V_p, V_0].
U_p      m by p       The columns of U_p form a basis for R(G).
S_p      p by p       Diagonal matrix of the nonzero singular values.
V_p      n by p       The columns of V_p form a basis for R(G^T).
U_0      m by (m-p)   The columns of U_0 form a basis for N(G^T).
V_0      n by (n-p)   The columns of V_0 form a basis for N(G).
U_{.,i}  m by 1       Column i of U is an eigenvector of GG^T with eigenvalue s_i^2.
V_{.,i}  n by 1       Column i of V is an eigenvector of G^T G with eigenvalue s_i^2.
G†       n by m       The pseudoinverse of G, G† = V_p S_p^{-1} U_p^T.
m†       n by 1       The generalized inverse solution, m† = G† d = sum_{i=1}^{p} (U_{.,i}^T d / s_i) V_{.,i}.


Then
\[
Gm_0 = U_p S_p V_p^T m_0 = U_p S_p \sum_{i=p+1}^{n} \alpha_i\, V_p^T V_{\cdot,i} = 0\,. \qquad (4.46)
\]
Because the operator G is linear, we can add any such m_0 to a general model, m, and not change the fit of the model to the data, because
\[
G(m + m_0) = Gm + Gm_0 = Gm\,. \qquad (4.47)
\]
For this reason, the null space of G is commonly called the model null space.

The existence of a nontrivial model null space (one that includes more than

just the zero vector) is at the heart of solution nonuniqueness. General models

in Rn include an infinite number of solutions that will fit the data equally

well, because model components in N(G) have no effect on the data fit. To select a

particular preferred solution from this infinite set thus requires more constraints

(such as minimum length or smoothing constraints) than are encoded in the

matrix G.

To see the significance of the N(G^T) subspace, consider an arbitrary data vector, d_0, that lies in N(G^T),
\[
d_0 = \sum_{i=p+1}^{m} \beta_i\, U_{\cdot,i}\,. \qquad (4.48)
\]

Then
\[
m^{\dagger} = V_p S_p^{-1} U_p^T d_0 = V_p S_p^{-1} \sum_{i=p+1}^{m} \beta_i\, U_p^T U_{\cdot,i} = 0\,, \qquad (4.49)
\]
so N(G^T) consists of vectors d_0 that have no influence on the generalized inverse model, m†. If p < m, there are an infinite number of potential data sets that will produce the same model when (4.18) is applied. N(G^T) is called the data null space.

MATLAB has a pinv command that generates the G† matrix, rejecting

singular values that are less than a default or specified tolerance.
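For illustration (ours, not from the text), the generalized inverse and solution can also be formed directly from the compact SVD; the truncation tolerance below mirrors MATLAB's usual default:

% Sketch: generalized inverse solution from the compact SVD.
[U, S, V] = svd(G);
s = diag(S);
p = sum(s > max(size(G)) * eps(max(s)));   % count singular values above a tolerance
Up = U(:, 1:p);  Vp = V(:, 1:p);  Sp = diag(s(1:p));
Gdag = Vp * inv(Sp) * Up';                 % generalized inverse (4.17)
mdag = Gdag * d;                           % generalized inverse solution (4.18)
% pinv(G)*d gives the same result with a default tolerance.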

4.2 Covariance and Resolution of the Generalized Inverse Solution

The generalized inverse always gives us a solution, m† , with well-determined

properties, but it is essential to investigate how faithfully any such model is likely to represent the true situation.


In Chapter 2, we found that under the assumption of independent, normally distributed measurement errors, the least squares solution was an unbiased estimator of the true model, and that the estimated model parameters had a multivariate normal distribution with covariance
\[
\mathrm{Cov}(m_{L2}) = \sigma^2 (G^T G)^{-1}\,.
\]
We could attempt the same analysis for the generalized inverse solution m†. The covariance matrix would be given by
\[
\mathrm{Cov}(m^{\dagger}) = G^{\dagger}\, \mathrm{Cov}(d)\, (G^{\dagger})^T\,.
\]

However, unless the least squares problem is of full rank, the generalized inverse

solution is not an unbiased estimator of the true solution. This is because

the true model may have nonzero projections in the model null space which

are not represented in the generalized inverse solution. In practice, the bias

introduced by this can be far larger than the uncertainty due to measurement

error. Estimating this bias is a hard problem, which we will address in Chapter

5.

The concept of resolution is another way to characterize the quality of the generalized inverse solution. In this approach we examine how closely the generalized inverse solution matches a given model, assuming that there are no errors in the data. We begin with any model m. Multiplying G times m, we can find a corresponding data vector d. If we then multiply G† times d, we get back a generalized inverse solution
\[
m^{\dagger} = G^{\dagger} G m\,. \qquad (4.50)
\]
Since the original model may have had a nonzero projection onto the model null space N(G), m† will not in general be equal to m. The model space resolution matrix is given by
\[
R_m = G^{\dagger} G\,. \qquad (4.51)
\]

In terms of the SVD,
\[
R_m = V_p S_p^{-1} U_p^T\, U_p S_p V_p^T \qquad (4.52)
\]
\[
= V_p S_p^{-1} S_p V_p^T \qquad (4.53)
\]
\[
= V_p V_p^T\,. \qquad (4.54)
\]
If N(G) is trivial, then p = n and R_m = V_p V_p^T = I, the n by n identity matrix. In this case the original model is recovered exactly, and we say that the resolution is perfect. If N(G) is a nontrivial subspace of R^n, then p = rank(G) < n, so that R_m is not the identity matrix. The model space resolution matrix is instead a symmetric matrix describing how the generalized inverse solution smears out the original model, m, into a recovered model, m†.


In practice, the model space resolution matrix is used in two different ways. If the diagonal entries of R_m are close to one, then we should have good resolution. If any of the diagonal entries are close to zero, then the corresponding model parameters will be poorly resolved. We can also multiply R_m times a particular model m to see how that model would be resolved by the inverse solution.

We can perform the operations of multiplication by G† and G in the opposite order to obtain a data space resolution matrix, R_d:
\[
d^{\dagger} = Gm^{\dagger} \qquad (4.55)
\]
\[
= GG^{\dagger} d \qquad (4.56)
\]
\[
= R_d\, d\,, \qquad (4.57)
\]
where
\[
R_d = GG^{\dagger} \qquad (4.58)
\]
\[
= U_p U_p^T\,. \qquad (4.59)
\]

If N(G^T) is trivial, then R_d is the m by m identity matrix, d† = d, and the generalized inverse solution m† fits the data exactly. However, if N(G^T) is nontrivial, then p = rank(G) < m, and R_d is not the identity matrix. In this case, d† is not equal to d, and m† does not exactly fit the data d.

Note that model and data space resolution matrices (4.54) and (4.59) do not

depend on specific data values, but are exclusively properties of the matrix G.

They reflect the physics and geometry of the experiment, and can thus be calculated during the design phase of an experiment to assess the expected model resolution and ability to fit the data, respectively.
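A brief MATLAB sketch (ours) of computing both resolution matrices from the compact SVD factors Up and Vp obtained in the earlier sketch:

% Sketch: model and data space resolution matrices from the compact SVD.
Rm = Vp * Vp';          % model space resolution matrix (4.54)
Rd = Up * Up';          % data space resolution matrix (4.59)
diag(Rm)                % diagonal entries near 1 indicate well-resolved parameters
% A simple spike-test illustration (n is the number of model parameters,
% and the spike position is an arbitrary choice):
mtest = zeros(n, 1); mtest(5) = 1;
mrecovered = Rm * mtest;             % how the spike is smeared by the inverse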

4.3 Instability of the Generalized Inverse Solution

The generalized inverse solution (4.18) is selected so that its projection onto

N(G) is 0. We have effectively multiplied each of the basis vectors in V_0 by 0 in computing m†. However, m† does include terms involving the singular vectors in V_p

with very small singular values. As a practical matter, it can be very difficult to

distinguish between zero singular values and extremely small singular values. In

analyzing the generalized inverse solution we will consider the singular value

spectrum, which is simply the set of singular values of G.

It turns out that small singular values cause the generalized inverse solution

to be extremely sensitive to small amounts of noise in the data vector d. We

can quantify the instabilities that are brought about by small singular values by

rewriting the generalized inverse solution (4.18) in a form that makes the effect

of small singular values explicit. We start with the formula for the generalized

inverse solution

m† = Vp S−1 T

p Up d. (4.60)


The elements of the vector U_p^T d are simply the dot products of the first p columns of U with d. This vector can be written as
\[
U_p^T d = \begin{bmatrix} (U_{\cdot,1})^T d \\ (U_{\cdot,2})^T d \\ \vdots \\ (U_{\cdot,p})^T d \end{bmatrix}\,. \qquad (4.61)
\]
Multiplying S_p^{-1} times this vector, we obtain
\[
S_p^{-1} U_p^T d = \begin{bmatrix} \frac{(U_{\cdot,1})^T d}{s_1} \\ \frac{(U_{\cdot,2})^T d}{s_2} \\ \vdots \\ \frac{(U_{\cdot,p})^T d}{s_p} \end{bmatrix}\,. \qquad (4.62)
\]

Finally, multiplying by V_p shows that the generalized inverse solution is a linear combination of the columns of V_p that can be written as
\[
m^{\dagger} = V_p S_p^{-1} U_p^T d = \sum_{i=1}^{p} \frac{U_{\cdot,i}^T d}{s_i}\, V_{\cdot,i}\,. \qquad (4.63)
\]

The coefficients in this sum involve projections of the data vector d onto each of the directions specified by the columns of U (the rows of U^T). The presence of a very small s_i in the denominator of (4.63) can thus give us a very large coefficient for the corresponding model space basis vector V_{·,i}, and these basis vectors can then dominate the solution. In the worst case, the generalized inverse solution is just a noise amplifier, and the answer is nonphysical and practically useless.

A simple measure of the instability of the solution is the condition number (see Appendix A). Suppose that we have a data vector d and an associated generalized inverse solution m† = G†d. If we consider a slightly perturbed data vector d′ and its associated generalized inverse solution m′† = G†d′, then
\[
m^{\dagger} - m'^{\dagger} = G^{\dagger}(d - d')\,.
\]

Thus
\[
\| m^{\dagger} - m'^{\dagger} \|_2 \le \| G^{\dagger} \|_2\, \| d - d' \|_2\,.
\]
From (4.63), it is clear that the largest difference in the inverse models will occur when d − d′ is in the direction U_{·,p}. If
\[
d - d' = \alpha\, U_{\cdot,p}\,,
\]


then ||d − d′||_2 = α. We can then compute the effect on the generalized inverse solution as
\[
m^{\dagger} - m'^{\dagger} = \frac{\alpha}{s_p}\, V_{\cdot,p}\,,
\]
with
\[
\| m^{\dagger} - m'^{\dagger} \|_2 = \frac{\alpha}{s_p}\,.
\]
Thus we have a bound on the instability of the generalized inverse solution,
\[
\frac{\| m^{\dagger} - m'^{\dagger} \|_2}{\| d - d' \|_2} \le \frac{1}{s_p}\,,
\]
or
\[
\| m^{\dagger} - m'^{\dagger} \|_2 \le \frac{1}{s_p}\, \| d - d' \|_2\,.
\]
Similarly, we can see that the generalized inverse model is smallest in norm when d points in a direction parallel to U_{·,1}. Thus
\[
\| m^{\dagger} \|_2 \ge \frac{1}{s_1}\, \| d \|_2\,.
\]
Combining these inequalities, we obtain
\[
\frac{\| m^{\dagger} - m'^{\dagger} \|_2}{\| m^{\dagger} \|_2} \le \frac{s_1}{s_p}\, \frac{\| d - d' \|_2}{\| d \|_2}\,. \qquad (4.64)
\]

This instability bound depends strongly on the value of p we use. As we decrease p, the solution becomes more stable. However,

this stability comes at the expense of reducing the dimension of the subspace

of Rn from which we can find a solution. As we reduce p, the model resolution

matrix becomes less like the identity matrix. Thus there is a tradeoff between

sensitivity to noise and model resolution.

The condition number of G is defined as
\[
\mathrm{cond}(G) = \frac{s_1}{s_k}\,,
\]

where k = min(m, n). The MATLAB command cond can be used to compute

the condition number of G. If G is of full rank, and we use all of the singular

values in the pseudoinverse solution (p = k), then the condition number is

exactly the coefficient in (4.64). If G is of less than full rank, then the condition

number is effectively infinite.

Note that the condition number depends only on the matrix G and does not

take into account the actual data. Just as with the resolution matrix, or the

covariance matrix in linear regression, we can compute the condition number

before we gather any data.


A condition that ensures solution stability is the discrete Picard condition [Han98]. The discrete Picard condition is satisfied when the dot products of the columns of U_p with the data vector decay to zero more quickly than the singular values, s_i. Under this condition, we should not see instability due to contributions from the smallest singular values.

If the discrete Picard condition is not satisfied, we may still be able to recover a usable solution by truncating (4.63) at some highest term p′ < p to produce a truncated SVD, or TSVD, solution. One way to decide when to truncate (4.63) is to apply the discrepancy principle, where we consider all solutions that minimize ||m||, subject to fitting the data to some tolerance,
\[
\| G_w m - d_w \|_2 \le \delta\,, \qquad (4.65)
\]
where G_w and d_w are the weighted system matrix and data vector (2.16). A simple way to choose δ for independent normal data errors is to note that, in this case, the square of the weighting-normalized 2–norm misfit measure (4.65) will be distributed as χ² with ν = m − n degrees of freedom. We can therefore choose a δ corresponding to some confidence level for that χ² distribution (e.g., 95%). The truncated SVD solution is then straightforward: we sum up terms in (4.63) until the discrepancy principle is just met. This produces a stable solution that can be statistically justified as being consistent with both modeling assumptions and data errors.
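A minimal MATLAB sketch of a TSVD solution chosen by the discrepancy principle (our illustration; Gw and dw are the weighted system and data, and the χ²-based threshold shown is one possible choice of δ²):

% Sketch: truncated SVD solution selected by the discrepancy principle.
[U, S, V] = svd(Gw);
s = diag(S);
[mrow, ncol] = size(Gw);
delta2 = chi2inv(0.95, mrow - ncol);     % one possible delta^2 (95% level)
m_tsvd = zeros(ncol, 1);
for pprime = 1:length(s)
    m_tsvd = m_tsvd + (U(:,pprime)'*dw / s(pprime)) * V(:,pprime);  % add term of (4.63)
    if norm(Gw*m_tsvd - dw)^2 <= delta2
        break                            % stop as soon as the misfit tolerance is met
    end
end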

This solution will not fit the data as well as solutions that include the small

singular value model space basis vectors. Perhaps surprisingly, this is what we

should generally do when we solve ill-posed problems with noise. If we fit the

data vector exactly or nearly exactly, we are in fact over fitting the data and

perhaps letting the noise control major features of the model. This corresponds

to using the available freedom of a large model space to produce a χ2 value that

is too good (Chapter 2). Truncating (4.63) also decreases the used dimension

of the model space by p − p0 . we should thus be careful to acknowledge that

true model projections in the directions of the newly omitted columns of V

cannot appear in the truncated SVD solution, even if they are present in the

true model.

The truncated SVD solution is but one example of regularization, whereby

solutions are selected to sacrifice fit to the data for solution stability. Much of

the utility of inverse solutions in many situations, as we shall subsequently see,

hinges on the application of good regularization strategies.

4.4 Rank Deficient Problems

A linear least squares problem is said to be rank deficient if there is a clear distinction between the nonzero and zero singular values and the number of nonzero singular values, p, is less than the minimum of m and n. In practice, the computed singular values will often include values that are extremely small but not quite zero because of round-off errors in the computation of the SVD. If there is a substantial gap between the largest of these tiny singular values and the first


truly nonzero singular value, then it can be easy to distinguish between the nonzero and zero singular values. Rank deficient problems can be solved in a straightforward manner by applying the generalized inverse solution. Instability

of the resulting solution due to small singular values is seldom an issue.

Example 4.1

With the tools of the SVD at our disposal, let us reconsider the

straight-line tomography example (example 1.3; Figure 4.2), where

we had a rank deficient system in which we were constraining a

9-parameter model with 8 observations.

\[
Gm = \begin{bmatrix}
1 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 \\
0 & 1 & 0 & 0 & 1 & 0 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 & 1 \\
1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 1 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 \\
\sqrt{2} & 0 & 0 & 0 & \sqrt{2} & 0 & 0 & 0 & \sqrt{2} \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & \sqrt{2}
\end{bmatrix}
\begin{bmatrix} s_{11} \\ s_{12} \\ s_{13} \\ s_{21} \\ s_{22} \\ s_{23} \\ s_{31} \\ s_{32} \\ s_{33} \end{bmatrix}
= \begin{bmatrix} t_1 \\ t_2 \\ t_3 \\ t_4 \\ t_5 \\ t_6 \\ t_7 \\ t_8 \end{bmatrix}\,. \qquad (4.66)
\]
Computing the singular values of G in MATLAB gives


>> svd(G)

ans =

3.180e+00

2.000e+00

1.732e+00

1.732e+00

1.732e+00

1.607e+00

5.535e-01

4.230e-16

The smallest reported value results from round-off errors in the computation of the singular values. In this case it is easy to see that this number is so small that it is effectively 0. The ratio of the largest nonzero singular value to the smallest nonzero singular value is about 6, so the generalized inverse solution will not be particularly unstable in the presence of noise.

Since rank(G) = 7, the problem is rank deficient, and the model null space, N(G), is spanned by the two orthonormal vectors given by the 8th and 9th columns of V,
\[
V_0 = \begin{bmatrix}
-0.136 & -0.385 \\
 0.385 & -0.136 \\
-0.249 &  0.521 \\
-0.385 &  0.136 \\
 0.136 &  0.385 \\
 0.249 & -0.521 \\
 0.521 &  0.249 \\
-0.521 & -0.249 \\
 0.000 &  0.000
\end{bmatrix}\,.
\]

To obtain a geometric appreciation for the two model null space vectors, we can reshape them into 3 by 3 matrices corresponding to the geometry of the blocks (e.g., using the MATLAB reshape command):
\[
\mathrm{reshape}(m_{0,1}, 3, 3) = \begin{bmatrix} -0.136 & -0.385 & 0.521 \\ 0.385 & 0.136 & -0.521 \\ -0.249 & 0.249 & 0.000 \end{bmatrix}
\]
\[
\mathrm{reshape}(m_{0,2}, 3, 3) = \begin{bmatrix} -0.385 & 0.136 & 0.249 \\ -0.136 & 0.385 & -0.249 \\ 0.521 & -0.521 & 0.000 \end{bmatrix}\,.
\]


Recall that we can add any multiple of m_0 to any solution and not change the fit to the data. Three common features of the two model null space basis vectors stand out:

1. The sums along all rows and columns of m_0 are zero.

2. The upper-left to lower-right diagonal sum is zero.

3. There is no projection of m_0 in the m_9 = s_{33} model space direction.

The zero-sum conditions (1) and (2) arise because the ray path geometry constrains only the average slowness along each of those rows and columns; paths passing through three blocks can only constrain the average block value. The zero value for s_{33} in (3) occurs because s_{33} is uniquely constrained by the observation t_8.

The lone data space null basis vector, given by the 8th column of U, is
\[
U_0 = \begin{bmatrix} -0.408 \\ -0.408 \\ -0.408 \\ 0.408 \\ 0.408 \\ 0.408 \\ 0.000 \\ 0.000 \end{bmatrix}\,.
\]

This is a basis vector in the data space that we can never fit with any model, because such data are inconsistent. Measurements of the form cU_0, where c is a nonzero constant, would require the average values of the s_{i,·} along rows to decrease or increase while the average values of the s_{·,i} along columns simultaneously increase or decrease, with the upper-left to lower-right diagonal average and the value of s_{33} not changing at all. This is impossible for any model to satisfy.

The n by n model space resolution matrix, R_m, tells us about the uniqueness of the generalized inverse solution parameters. The diagonal elements indicate how recoverable each parameter value will be. The reshaped diagonal of R_m for this system is
\[
\mathrm{reshape}(\mathrm{diag}(R_m), 3, 3) = \begin{bmatrix} 0.833 & 0.833 & 0.667 \\ 0.833 & 0.833 & 0.667 \\ 0.667 & 0.667 & 1.000 \end{bmatrix}\,,
\]

which tells us that m9 = s33 is perfectly resolved, but that we can

expect loss of resolution (and hence smearing of the true model into

other blocks) for all of the other solution parameters. The full Rm

gives the specifics of how this smearing occurs:
\[
R_m = \begin{bmatrix}
 0.833 &  0     &  0.167 &  0     &  0.167 & -0.167 &  0.167 & -0.167 & 0 \\
 0     &  0.833 &  0.167 &  0.167 &  0     & -0.167 & -0.167 &  0.167 & 0 \\
 0.167 &  0.167 &  0.667 & -0.167 & -0.167 &  0.333 &  0     &  0     & 0 \\
 0     &  0.167 & -0.167 &  0.833 &  0     &  0.167 &  0.167 & -0.167 & 0 \\
 0.167 &  0     & -0.167 &  0     &  0.833 &  0.167 & -0.167 &  0.167 & 0 \\
-0.167 & -0.167 &  0.333 &  0.167 &  0.167 &  0.667 &  0     &  0     & 0 \\
 0.167 & -0.167 &  0     &  0.167 & -0.167 &  0     &  0.667 &  0.333 & 0 \\
-0.167 &  0.167 &  0     & -0.167 &  0.167 &  0     &  0.333 &  0.667 & 0 \\
 0     &  0     &  0     &  0     &  0     &  0     &  0     &  0     & 1.000
\end{bmatrix}\,. \qquad (4.67)
\]

As you can probably surmise, examining the entire model resolution

matrix becomes cumbersome in large problems. In such cases it is

instead common to perform a resolution test by generating syn-

thetic data for some known model of interest and then assessing the

recovery of that model in the inverse solution. One such synthetic

model commonly used in resolution tests is uniform or zero except

for a single perturbed model element. Examining the inverse recov-

ery of this model is commonly referred to as a spike or impulse

resolution test. For this example, consider the model perturbation

\[
m_{\rm test} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 1 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix} .
\]

The predicted data set for this mtest is

\[
d_{\rm test} = G m_{\rm test} = \begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \\ 1 \\ 0 \\ \sqrt{2} \\ 0 \end{bmatrix} ,
\]

and the reshaped generalized inverse model is
\[
\operatorname{reshape}(m^{\dagger},3,3) = \begin{bmatrix} 0.167 & 0 & -0.167 \\ 0 & 0.833 & 0.167 \\ -0.167 & 0.167 & 0.000 \end{bmatrix}
\]


which is just the 5th column of Rm, showing that information about the central block slowness will bleed into some, but not all, of the adjacent blocks, even for noise-free data. A related resolution test that is commonly performed in tomography work is a checkerboard test, which, as the name suggests, consists of generating synthetic data from a model of alternating positive and negative perturbations and then inverting in an attempt to recover it. For example, the true checkerboard slowness model

\[
\operatorname{reshape}(m_{\rm true},3,3) = \begin{bmatrix} 2 & 1 & 2 \\ 1 & 2 & 1 \\ 2 & 1 & 2 \end{bmatrix}
\]
produces the noise-free data vector
\[
d = G m_{\rm true} = \begin{bmatrix} 5.000 \\ 4.000 \\ 5.000 \\ 5.000 \\ 4.000 \\ 5.000 \\ 8.485 \\ 2.828 \end{bmatrix} ,
\]
and is recovered as
\[
\operatorname{reshape}(m^{\dagger},3,3) = \begin{bmatrix} 2.333 & 1.000 & 1.667 \\ 1.000 & 1.667 & 1.333 \\ 1.667 & 1.333 & 2.000 \end{bmatrix} .
\]

Model recovery in practice is affected not only by resolution concerns, which are a characteristic purely of the system G, and hence apply even for noise-free data, but also by the mapping of any data noise into the model parameters. For example, if we add independent noise distributed as N(0, 0.2) to each data point in the checkerboard resolution test to obtain a noisy data vector, dn, the recovered model degrades further to
\[
\operatorname{reshape}(m^{\dagger},3,3) = \begin{bmatrix} 2.365 & 1.302 & 1.732 \\ 1.215 & 1.476 & 1.245 \\ 1.672 & 1.272 & 2.042 \end{bmatrix} .
\]
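The following MATLAB sketch illustrates how such a spike (or checkerboard) resolution test can be carried out; it assumes that the 8 by 9 system matrix G for this example has already been constructed.

    % Spike resolution test sketch for the rank deficient tomography example
    [U,S,V] = svd(G);
    p = rank(G);                                 % effective rank (here p = 7)
    Gdag = V(:,1:p)*diag(1./diag(S(1:p,1:p)))*U(:,1:p)';   % generalized inverse
    mtest = zeros(9,1); mtest(5) = 1;            % spike in the central block
    mrecov = Gdag*(G*mtest);                     % equals the 5th column of Rm
    reshape(mrecov,3,3)
    Rm = Gdag*G;                                 % full model resolution matrix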

4.5 Discrete Ill-Posed Problems

In this section we will consider problems in which the singular values decay towards zero without any obvious gap between nonzero and zero singular values.


This happens frequently when we discretize Fredholm integral equations of the first kind, as we did in Chapter 3. In particular, as we increase the number of points n in the discretization, we typically find that the matrix G becomes more and more badly conditioned. Discrete inverse problems such as these cannot formally be called ill-posed, because the condition number remains finite, although very large. However, we will refer to these problems as discrete ill-posed problems.

The rate at which the singular values decay can be used to characterize a discrete ill-posed problem as weakly, moderately, or severely ill-posed. If $s_i = O(i^{-\alpha})$ for $\alpha \le 1$, then we call the problem weakly ill-posed. If $s_i = O(i^{-\alpha})$ for $\alpha > 1$, then the problem is moderately ill-posed. If $s_i = O(e^{-\alpha i})$, then the problem is severely ill-posed.
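As a rough, hedged diagnostic (not part of the discussion above), the algebraic decay rate of a computed set of singular values can be estimated with a log-log least squares fit; the vector s of singular values is assumed to be available.

    % Crude estimate of the decay exponent alpha in s_i = O(i^(-alpha))
    s = s(:);
    i = (1:length(s))';
    c = polyfit(log(i), log(s), 1);
    alpha_est = -c(1)     % alpha_est <= 1: weakly; > 1: moderately ill-posed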

In addition to the general pattern of singular values which decay to 0, dis-

crete ill–posed problems are typically characterized by differences in the singular

vectors V·,j [Han98]. For large singular values, the singular vectors are typically

smooth, while the singular vectors corresponding to the smaller singular values

typically have lots of oscillations. These oscillations become apparent in the

generalized inverse solution.

When we attempt to solve such a problem with the SVD, it becomes difficult to decide where to truncate the sum. If we truncate the sum too early, then we effectively lose details in our model solution corresponding to the model space basis vectors $V_{\cdot,k}$ that are left out of the sum. If we include all of the singular

values, then the generalized inverse solution becomes extremely unstable in the

presence of noise. In particular, we can expect that high frequencies in the

generalized inverse solution will be strongly affected by the noise. Regularization

techniques are required to address this problem.

Example 4.2

Consider the problem of recovering the time history of ground acceleration (ground motion) recorded by a linear instrument of limited bandwidth (e.g., a vertical seismometer). The response of a linear instrument may be characterized by an instrument impulse response. Consider the impulse response
\[
g(t) = \begin{cases} g_0\, t\, e^{-t/T_0} & (t \ge 0) \\ 0 & (t < 0) . \end{cases}
\tag{4.68}
\]

Equation (4.68) models the output of a seismometer with a characteristic time constant T0 to a unit area (1 m/s² · s) impulsive ground acceleration input, and g0 is a gain constant. Assuming that the displacement of the seismometer is electronically converted to output volts, we choose g0 = e/T0 so that the impulse response has a 1 V maximum output value, and we take T0 = 10 s. The response of this instrument to an acceleration impulse is shown in Figure 4.3.

Figure 4.3: Seismometer response (in V) to a unit area ground acceleration impulse.

The recorded output, v(t), is given by the convolution (1.7) of the true ground acceleration, mtrue(t), with (4.68),
\[
v(t) = \int_{-\infty}^{\infty} g(t-\tau)\, m_{\rm true}(\tau)\, d\tau .
\tag{4.69}
\]
The deconvolution problem is to remove the smoothing effect of g(t) in (4.69) to recover the true ground acceleration mtrue.

Discretizing (4.69) using the midpoint rule (3.3) with a time interval ∆t, we obtain
\[
d = G m
\tag{4.70}
\]
where
\[
G_{i,j} = \begin{cases} (t_i - t_j)\, e^{-(t_i - t_j)/T_0}\, \Delta t & (t_i \ge t_j) \\ 0 & (t_i < t_j) . \end{cases}
\]

Note that the rows of G are time reversed, and the columns of G

are non-time-reversed, sampled versions of the impulse response g(t),

lagged by i and j, respectively. Using a time interval of [−5, 100] s,

outside of which (4.68) and (we assume) any model, m, of interest

will be very small or zero, and a discretization interval of ∆t = 0.5 s

we obtain a discretized m by n system matrix G with m = n = 210 (Figure 4.4).
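A minimal MATLAB sketch of this discretization is given below; it follows the $G_{i,j}$ expression above and, like that expression, omits the gain factor g0.

    % Build the discretized convolution matrix for the seismometer example
    T0 = 10; dt = 0.5;
    t = (-5 + dt/2):dt:(100 - dt/2);     % midpoints of the 210 intervals
    n = length(t);
    G = zeros(n);
    for i = 1:n
      for j = 1:n
        if t(i) >= t(j)
          G(i,j) = (t(i)-t(j))*exp(-(t(i)-t(j))/T0)*dt;
        end
      end
    end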


Figure 4.4: Grayscale image of the system matrix, G. Values range from 0

(black) to 0.5 (white). Discretization interval is ∆t = 0.5 s and the time interval

is [−5, 100] s.

The singular values of G are all nonzero and range from about 25.3

to 0.017 (Figure 4.5), giving a condition number of approximately

1480, which is not an especially large range relative to what can arise

in extremely ill-posed problems. However, adding noise at the level

of 1 part in 1000 will be sufficient to make the generalized inverse

solution unstable. The reason for the moderately large condition

number can be seen by examining successive rows of G, which are

nearly (but not quite) identical.

\[
\frac{G_{i,\cdot}\, G_{i+1,\cdot}^{T}}{\|G_{i,\cdot}\|\,\|G_{i+1,\cdot}\|} \approx 0.999 .
\]
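This near-parallelism of adjacent rows can be checked directly; the row index i = 100 below is only illustrative.

    i = 100;
    cosangle = (G(i,:)*G(i+1,:)')/(norm(G(i,:))*norm(G(i+1,:)))   % roughly 0.999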

The true model consists of two Gaussian ground acceleration pulses with temporal widths of σ = 2 s, centered at t = 8 s and t = 25 s,
\[
m_{\rm true}(t) = e^{-(t-8)^2/(2\sigma^2)} + 0.5\, e^{-(t-25)^2/(2\sigma^2)} .
\tag{4.71}
\]

We sample mtrue(t) on the time interval [−5, 100] s and generate the synthetic, noise-free data set
\[
d_{\rm syn} = G m_{\rm true} ,
\tag{4.72}
\]
shown in Figure 4.7.

Figure 4.5: Singular values of G for the discretized deconvolution problem.

Figure 4.6: The true ground acceleration model mtrue(t) (acceleration in m/s² versus time in s).

Figure 4.7: Noise-free synthetic data dsyn predicted by the true model (output in V versus time in s).

The recovered 2-norm model from the full generalized inverse solution
\[
m = V S^{-1} U^{T} d_{\rm syn}
\tag{4.73}
\]
is shown in Figure 4.8. The solution (4.73) fits its noiseless data vector, dsyn, perfectly, and is essentially identical to the true model.

Next consider what happens when we add a modest amount of noise

to (4.72). Figure 4.9 shows the data after normally distributed noise

with standard deviation 0.05 has been added.

The 2-norm solution for a noisy data vector, dsyn + n, where the elements of n are uncorrelated N(0, (0.05)²) random variables, is shown in Figure 4.10.

The solution in Figure 4.10, although it fits its particular data vec-

tor, dsyn + n, exactly, is worthless in divining information about the

true ground motion ((4.71); Figure 4.6). Information about mtrue

is overwhelmed by the small amount of added noise, amplified enor-

mously by the inverse operator.

Can a useful model be recovered via the truncated SVD? Using the discrepancy principle (4.65) as our guide and selecting a range of solutions with varying p0, we can in fact obtain approximate agreement between the number


Figure 4.8: Generalized inverse solution using all 210 singular values and the noise-free data of Figure 4.7.


Figure 4.9: Predicted data from the true model plus uncorrelated N (0, (0.05)2 )

noise.


Figure 4.10: Generalized inverse solution using all 210 singular values, and the

noisy data of Figure 4.9.

of degrees of freedom (m − p0) and the square of the data misfit normalized by the noise standard deviation when we keep p0 = 26 columns in V (Figure 4.11).
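A minimal sketch of the truncated SVD solution for a chosen truncation level p0 follows; dn denotes the noisy data vector and is assumed to be available.

    [U,S,V] = svd(G);
    s = diag(S);
    p0 = 26;
    m_tsvd = V(:,1:p0)*((U(:,1:p0)'*dn)./s(1:p0));    % truncated SVD solution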

Essential features of the true model are resolved in the solution of Figure 4.11, but the solution technique introduces oscillations and loss of resolution. Specifically, the inferred pulses are somewhat wider, and their amplitudes somewhat smaller, than those of the true ground acceleration. These effects are both hallmarks of limited resolution, as characterized by a non-identity model resolution matrix. An image of the model resolution matrix (4.54) in Figure 4.12 shows a finite-width central band and oscillatory side lobes.

A typical column of the model resolution matrix (Figure 4.13) quantifies

the smearing of the true model into the recovered model for the

choice of the p = 26 inverse operator. The smoothing is over a

characteristic width of about 5 seconds, which is why our recovered

model, although it does a decent job of rejecting noise, underesti-

mates the amplitude and narrowness of the true model. The oscilla-

tory behavior of the resolution matrix is attributable to our abrupt

truncation of the model space. Each of the n columns of V is an oscillatory model basis function, with i − 1 zero crossings, where i is the column number.


Figure 4.11: Solution using the 26 largest singular values; N(0, (0.05)²) data noise.


Figure 4.12: Grayscale image of the model resolution matrix, Rm, for the truncated SVD solution including the 26 largest singular values. Values range from 0.35 (white) to −0.07 (black).


Figure 4.13: A column from the model resolution matrix, Rm, for the truncated SVD solution including the 26 largest singular values.

By truncating the singular value expansion of the solution, we place a limit on the most oscillatory model space basis vectors that we will allow in our solution. This truncation gives us a model, and a model resolution, containing oscillatory structure with around p − 1 = 25 zero crossings. The oscillatory, orthogonal columns of V form a sort of pseudo-Fourier basis for the model. We will examine this perspective further in Chapter 8.

Example 4.3

Recall the Shaw example problem from Example 3.2. The MATLAB

Regularization Toolbox contains a routine shaw that computes the

G matrix for this problem. Using this routine, we computed the

G matrix for n = 20. We then computed the singular values of

this matrix. Figure 4.14 shows the singular values of the G matrix,

which decay very rapidly to zero. Because the singular values decay to 0 in an exponential fashion, this is a severely ill-posed problem. There is no obvious break point above which the singular values are nonzero and below which the singular values are 0. The MATLAB rank command gives a value of p = 18, suggesting that the last two singular values are effectively 0. The condition number of this problem is enormous ($> 10^{14}$).
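These quantities can be reproduced with a short MATLAB sketch; the usual calling form [G, d, mtrue] = shaw(n) of the Regularization Toolbox routine is assumed here.

    n = 20;
    [G, d, mtrue] = shaw(n);
    [U, S, V] = svd(G);
    semilogy(diag(S), 'o');     % exponentially decaying singular values
    cond(G)                     % enormous condition number, > 1e14
    rank(G)                     % numerical rank estimate, p = 18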

Figure 4.14: Singular values of the G matrix for the Shaw problem with n = 20.

The column of V corresponding to the smallest effectively nonzero singular value is shown in Figure 4.15. Notice that this vector is very rough. Column 1 of V, which corresponds to the largest singular value, is shown in Figure 4.16. Notice that this column represents a relatively smooth function.

Next, we will consider a simple resolution test. Suppose that the input to the system is given by
\[
m_i = \begin{cases} 1 & i = 10 \\ 0 & \text{otherwise.} \end{cases}
\]

We can run the model forward and then use the generalized inverse

with various values of p to obtain inverse solutions. The spike in-

put is shown in Figure 4.17. The corresponding data (with no noise

added) is shown in Figure 4.18. If we compute the generalized in-

verse with p = 18, and apply it to the data in Figure 4.18, we get

essentially perfect recovery of the original spike. See Figure 4.19.

However, if we add a very small amount of noise to the data in Fig-

ure 4.18, things change dramatically. Adding normally distributed

random noise with mean 0 and standard deviation 10−6 to the data

in Figure 4.18 and computing the inverse solution using p = 18,

produces the wild solution shown in Figure 4.20, which bears little

resemblance to the input (notice that the vertical scale is multiplied

by 105 !). This inverse operator is even more unstable than that of

Figure 4.15: The column of V corresponding to the smallest effectively nonzero singular value (Shaw problem, n = 20).

Figure 4.16: Column 1 of V for the Shaw problem, n = 20.

Figure 4.17: The spike model.

Figure 4.18: Noise-free data predicted by the spike model.

Figure 4.19: Generalized inverse solution for the spike model using noise-free data, p = 18.

Example 4.2; here, noise on the order of 1 part in a million will dominate the inverse solution!

Next, we consider what happens when we use only the 10 largest

singular values and their corresponding model space vectors to con-

struct a solution. Figure 4.21 shows this solution with no data noise.

Because we have cut off a number of singular values, we have lost

resolution. The inverse solution is smeared out, but it is still possi-

ble to tell that there is some significant spike-like feature near t = 0.

Figure 4.22 shows the solution using 10 singular values with the same

noise as Figure 4.20. This time the recovery is not strongly affected

by the noise, although we still have imperfect resolution.

What happens if we discretize the problem with a larger number of

intervals? Figure 4.23 shows the singular values for the G matrix

with n = 100 intervals. The first 20 or so singular values are appar-

ently nonzero, while the last 80 or so singular values are effectively

zero. Figure 4.24 shows the inverse solution for the spike model with

n = 100 and p = 20. This is much worse than the solution that we

obtained with n = 20. In general, discretizing over more intervals

does not help and can often make things much worse.

What about a smaller number of intervals? Figure 4.25 shows the

singular values of the G matrix with n = 6. In this case there are no

terribly small singular values. However, with only 6 elements in the

model vector, we cannot hope to resolve the details of an arbitrary

source intensity distribution.

Figure 4.20: Recovery of the spike model with noise (standard deviation 10⁻⁶), n = 20, p = 18. Note the 10⁵ vertical scale.

Figure 4.21: Recovery of the spike model with no noise, n = 20, p = 10.

Figure 4.22: Recovery of the spike model with noise, n = 20, p = 10.

Figure 4.23: Singular values of the G matrix for n = 100.


Figure 4.24: Recovery of the spike model with noise, n = 100, p = 20.

Figure 4.25: Singular values of the G matrix for n = 6.


This example illustrates the fundamental tradeoff associated with small singular values. If we include the small singular values, then our inverse solution becomes incredibly sensitive to noise in the data. If we do not include the smaller singular values, then our solution is not as sensitive to noise in the data, but we lose resolution. The tradeoff between sensitivity to noise and resolution is the central issue in solving discrete ill-posed problems.

4.6 Exercises

1. A large NS-EW oriented, nearly square plan view, sandstone quarry block

(16 m by 16 m) with a bulk P-wave seismic velocity of approximately

3000 m/s is suspected of harboring higher-velocity dinosaur remains. An

ultrasonic P-wave travel-time tomography scan is performed in a hori-

zontal plane bisecting the boulder, producing a data set consisting of 16

E→W, 16 N→S, 31 SW→NE, and 31 NW→SE travel times. Each travel-

time measurement has statistically independent errors with estimated

standard deviations of 15 µs.

The data files that you will need to load from your working directory into

your MATLAB program are: rowscan.m, colscan.m, diag1scan.m,

diag2scan.m (these contain the travel-time data), and std.m (which

contains the standard deviation of the data measurements). The travel

time contribution from a uniform background model (velocity of 3000 m/s)

has been subtracted from each travel-time measurement for you, so you

will be solving for perturbations from a uniform slowness model of 3000

m/s. The row format of each data file is (x1, y1, x2, y2, t) where the

starting point coordinate of each shot is (x1, y1), the end point coordinate

is (x2, y2), and the travel time along a ray path between the start and

end points is a path integral (in seconds)

Z

t = s(x)dl

l

source and receiving points, and ∆lblock is the length for which the ray is

in each block.

Parameterize your tomographic model of the slowness perturbation in the

plane of the survey by dividing the boulder into a 16 by 16 grid of 256

1-m-square, N by E blocks to construct a linear system for the problem.

Assume that the ray paths through each homogeneous block can be well-

approximated by straight lines, so that the travel time expression is
\[
t = \int_{\ell} s(\mathbf{x})\, d\ell = \sum_{\rm blocks} s_{\rm block}\, \Delta l_{\rm block} ,
\]


where ∆l_block is 1 m for the row and column scans and √2 m for the diagonal scans.

Utilize the SVD to find a minimum-length/least-squares solution, m† ,

for the 256 block slowness perturbations which fit the data as exactly as

possible. Perform two inversions:

(A) Using the row and column scans only, and

(B) Using the complete data set.

For each inversion:

(a) State and discuss the significance of the elements and dimensions of

the data and model null spaces.

(b) Note if there are any model parameters that have perfect resolution.

(c) Note the condition number of your raw system matrix relating the

data and model.

(d) Note the condition number of your generalized inverse matrix.

(e) Produce a 16 by 16 element contour or other plot of your slowness

perturbation model, displaying the maximum and minimum slowness

perturbations in the title of each plot. Anything in there? If so, how

fast or slow is it (in m/s)?

(f) Show the model resolution by contouring or otherwise displaying the

256 diagonal elements of the model resolution matrix, reshaped into

an appropriate 16 by 16 grid.

(g) Construct, and contour or otherwise display a nonzero model which

fits the trivial data set d = 0 exactly.

(h) Describe how one could use solutions of the type discussed in (g)

to demonstrate that very rough models exist which will fit any data

set just as well as a generalized inverse model. Show one such wild

model.

2. Find the singular value decomposition of the G matrix from problem 3.1.

Taking into account the fact that the measured data are only accurate to

about four digits, use the truncated SVD to compute a solution to this

problem.


Chapter 5

Tikhonov Regularization

We saw in Chapter 4 that, given the SVD of G (4.2), we can express a generalized inverse solution by (4.63),
\[
m^{\dagger} = V_p S_p^{-1} U_p^{T} d = \sum_{i=1}^{p} \frac{U_{\cdot,i}^{T} d}{s_i}\, V_{\cdot,i} ,
\]

and that this expression can become extremely unstable when one or more of the

singular values, si become small. We considered the option of dropping terms

in the sum that correspond to the smaller singular values. This regularized or

stabilized the solution in the sense that it made the solution less sensitive to

data noise. However, we paid a price for this stability in that the regularized

solution had reduced resolution.

In this chapter we will discuss Tikhonov regularization, which is perhaps

the most widely used technique for regularizing discrete ill–posed problems. It

turns out that the Tikhonov solution can be expressed quite easily in terms of

the SVD of G. We will derive a formula for the Tikhonov solution and see how

it is a variant on the generalized inverse solution that effectively gives larger

weight to larger singular values in the SVD solution and gives lower weight to

small singular values.

For a general linear least squares problem, there may be infinitely many least

squares solutions. If we consider that the data contain noise, and that there

is no point in fitting such noise exactly, it becomes evident that there can be

many solutions which can adequately fit the data in the sense that kGm − dk2

is small enough. The discrepancy principle (4.65) can be used to regularize the solution of a discrete ill-posed problem based on the assumption that a reasonable level for $\|Gm - d\|_2$ is known.

What do we mean by small enough? For data, d, with measurement errors

that are independent and normal with mean 0 and standard deviation σ, the


Figure 5.1: Minimum values of the model norm, $\|m\|_2$, for varying values of δ.

misfit measure (1/σ 2 )kGmtrue −dk22 will have a χ2 distribution with m degrees

of freedom. If the standard deviations are not all equal, but the errors are

still normal and independent, we can divide each equation by the standard

deviation associated with its right hand side (2.14) and obtain a weighted system

of equations (2.17) in which kGw mtrue − dw k22 will follow a χ2 distribution.

We learned in Chapter 2 that when we estimate the solution to a full rank

least squares problem, $\|G_w m_{L2} - d_w\|_2^2$ has a $\chi^2$ distribution with $m - n$ degrees of freedom. Unfortunately, when the number of model parameters n is greater than or equal to the number of data m, this simply does not work, because there is no $\chi^2$ distribution with fewer than one degree of freedom.

In practice, a common heuristic is to require $\|G_w m - d_w\|_2$ to be smaller than $\sqrt{m}$, since $m$ is the approximate median of a $\chi^2$ distribution with m degrees of freedom.

Under the discrepancy principle, we consider all solutions with $\|Gm - d\|_2 \le \delta$, and select from these solutions the one that minimizes the norm of m,
\[
\min \ \|m\|_2 \quad \text{subject to} \quad \|Gm - d\|_2 \le \delta .
\tag{5.1}
\]

Why select the minimum norm solution from among those solutions that

adequately fit the data? One interpretation is that any significant nonzero

feature that appears in the regularized solution is there because it was necessary

to fit the data. Model features that are unnecessary to fit the data will be

removed by the regularization.

Notice that as δ increases, the set of feasible models expands, and the min-

imum value of kmk2 decreases. We can thus trace out a curve of minimum

values of kmk2 versus δ (Figure 5.1). It is also possible to trace out this curve

by considering problems of the form

(5.2)

kmk2 ≤ .


Figure 5.2: Optimal values of the model norm, $\|m\|_2$, and the misfit norm, $\|Gm - d\|_2$, as $\epsilon$ varies.

As $\epsilon$ decreases, the set of feasible solutions becomes smaller, and the minimum value of $\|Gm - d\|_2$ increases. Again, as we adjust $\epsilon$ we trace out the curve of optimal values of $\|m\|_2$ and $\|Gm - d\|_2$. See Figure 5.2.

A third option is to consider the damped least squares problem
\[
\min \ \|Gm - d\|_2^2 + \alpha^2 \|m\|_2^2 .
\tag{5.3}
\]
It can be shown [Han98] that for appropriate choices of δ, $\epsilon$, and α, all three problems yield the same solution. We will concentrate on solving the damped least squares form of the problem (5.3). Solutions to (5.1) and (5.2) can be obtained by adjusting α until the constraints are satisfied.

An important observation is that when plotted on a log–log scale, the curve

of optimal values of kmk2 and kGm − dk2 often takes on an L shape. For this

reason, the curve is called an L–curve [Han92]. In addition to the discrepancy

principle, another popular criterion for picking the value of α is the L–curve

criterion in which we pick the value of α that gives the solution closest to the

corner of the L–curve.

5.2 SVD Implementation of Tikhonov Regularization

The damped least squares problem (5.3) is equivalent to the ordinary least squares problem
\[
\min \ \left\| \begin{bmatrix} G \\ \alpha I \end{bmatrix} m - \begin{bmatrix} d \\ 0 \end{bmatrix} \right\|_2^2 .
\tag{5.4}
\]

As long as α is nonzero, the last n rows of the matrix are linearly independent,

so we have a full rank least squares problem that could be solved by the method


of normal equations,
\[
\begin{bmatrix} G \\ \alpha I \end{bmatrix}^{T} \begin{bmatrix} G \\ \alpha I \end{bmatrix} m = \begin{bmatrix} G \\ \alpha I \end{bmatrix}^{T} \begin{bmatrix} d \\ 0 \end{bmatrix} ,
\tag{5.5}
\]

which simplifies to
\[
(G^{T}G + \alpha^2 I)\, m = G^{T} d .
\tag{5.6}
\]
Writing G in terms of its SVD, $G = USV^{T}$, the normal equations (5.6) become
\[
(VS^{T}SV^{T} + \alpha^2 I)\, m = VS^{T}U^{T}d .
\tag{5.8}
\]
Rather than deriving the solution from scratch, we will simply show that the solution is

\[
m_{\alpha} = \sum_{i=1}^{k} \frac{s_i^2}{s_i^2+\alpha^2}\, \frac{U_{\cdot,i}^{T}d}{s_i}\, V_{\cdot,i} .
\tag{5.9}
\]

In this formula, we pick k = min(m, n) so that all singular values are included,

no matter how small they might be. To show that this solution is correct, we

substitute (5.9) into the left hand side of (5.8)

\[
(VS^{T}SV^{T} + \alpha^2 I) \sum_{i=1}^{k} \frac{s_i^2}{s_i^2+\alpha^2}\, \frac{U_{\cdot,i}^{T}d}{s_i}\, V_{\cdot,i}
= \sum_{i=1}^{k} \frac{s_i^2}{s_i^2+\alpha^2}\, \frac{U_{\cdot,i}^{T}d}{s_i}\, (VS^{T}SV^{T} + \alpha^2 I)\, V_{\cdot,i} .
\tag{5.10}
\]

$(VS^{T}SV^{T} + \alpha^2 I)V_{\cdot,i}$ can be simplified by noting that $V^{T}V_{\cdot,i}$ is a standard basis vector, $e_i$. When we multiply $S^{T}S$ times a standard basis vector, we get a vector with $s_i^2$ in position i and zeros elsewhere. When we multiply V times this vector, we get $s_i^2 V_{\cdot,i}$. Thus

\[
(VS^{T}SV^{T} + \alpha^2 I) \sum_{i=1}^{k} \frac{s_i^2}{s_i^2+\alpha^2}\, \frac{U_{\cdot,i}^{T}d}{s_i}\, V_{\cdot,i}
= \sum_{i=1}^{k} \frac{s_i^2}{s_i^2+\alpha^2}\, \frac{U_{\cdot,i}^{T}d}{s_i}\, (s_i^2+\alpha^2)\, V_{\cdot,i}
= \sum_{i=1}^{k} s_i\, (U_{\cdot,i}^{T}d)\, V_{\cdot,i}
= VS^{T}U^{T}d ,
\tag{5.11}
\]
which is the right hand side of (5.8), so (5.9) does indeed solve the normal equations.

The quantities
\[
f_i = \frac{s_i^2}{s_i^2+\alpha^2}
\tag{5.12}
\]


are called filter factors. They serve to reduce the relative contribution of the smaller singular values to the solution. A similar alternative method, called the damped SVD method, uses the filter factors
\[
\hat{f}_i = \frac{s_i}{s_i+\alpha} .
\tag{5.13}
\]

This has a similar effect to using (5.12), but transitions more slowly with the

index i between including large and rejecting small singular values.
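A minimal sketch of computing the filtered solution (5.9) directly from the SVD follows; it assumes that G, d, and a chosen regularization parameter alpha are available.

    [U, S, V] = svd(G);
    s = diag(S);
    k = min(size(G));
    f = s(1:k).^2./(s(1:k).^2 + alpha^2);              % filter factors (5.12)
    m_alpha = V(:,1:k)*(f.*(U(:,1:k)'*d)./s(1:k));     % Tikhonov solution (5.9)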

The MATLAB regularization toolbox [Han94] contains a number of useful commands for performing Tikhonov regularization. These commands include l_curve for plotting the L-curve and finding its corner, tikhonov for computing the solution for a particular value of α, lsqi for solving (5.2), discrep for solving (5.1), picard for producing a Picard plot of the singular values, and fil_fac for computing filter factors.

The package also includes a routine regudemo which leads the user through a

tour of the features of the toolbox. Note that the regularization toolbox uses

notation which is somewhat different from the notation used in this book. Read

the documentation to understand the notation used by the toolbox.

Example 5.1

We will now use the regularization toolbox to analyze the Shaw

problem (with noise), as introduced in Example 3.2. We begin by

computing the L–curve and finding its corner.

>> [reg_corner,rho,eta,reg_param]=l_curve(U,diag(S),dspiken,'Tikh');

>> reg_corner

reg_corner =

3.120998999813652e-07

The corner of the L-curve occurs at $\alpha \approx 3.12 \times 10^{-7}$ (Figure 5.3).

Next, we compute the Tikhonov regularization solution correspond-

ing to this value of α.

>> [mtik,rho,eta]=tikhonov(U,diag(S),V,dspiken,3.12e-7)

mtik =

-0.01756815988942

0.08736794531934

-0.11352225772109

-0.02687189186254

0.13097279143005

Figure 5.3: L-curve for the Shaw problem, zeroth order Tikhonov regularization.

0.02511897984851

-0.17334073966220

-0.02730844460233

0.38747989040776

0.52263562656346

0.24617936521424

-0.01194773792291

-0.03194830349568

0.00037512886287

-0.01136710435884

0.00811188289513

0.02757132302673

-0.01861173224197

-0.01291369695969

0.01038008213527

rho =

2.921973109767267e-07

eta =

    0.74616447426107


Figure 5.4: Tikhonov solution for the Shaw problem, α = 3.12 × 10−7 .


Here mtik is the solution, rho is the misfit norm, and eta is the solution norm. The solution is shown in Figure 5.4. Notice

that this solution is much better than the wild solution obtained by

the truncated SVD with p = 18 (Figure 4.20).

We can also use the discrep command to find the appropriate α to obtain a Tikhonov regularized solution. Because normally distributed random noise with standard deviation $1 \times 10^{-6}$ was added to this data, we search for a solution where the square of the norm of the noise vector (and of the residuals) is roughly n = 20 times $10^{-12}$, corresponding to a misfit norm $\|Gm - d\|_2$ of roughly $\sqrt{20} \times 10^{-6} \approx 4.47 \times 10^{-6}$.

>> [mdisc,alpha]=discrep(U,diag(S),V,dspiken,1.0e-6*sqrt(20));

>> alpha

alpha =

5.3561e-05

This is a significantly larger value of α than that obtained using the L-curve technique. The corresponding solution (Figure 5.5) thus has a smaller norm. This solution is perhaps somewhat better than the L-curve solution in the sense of having smaller side lobes.


Figure 5.5: Tikhonov solution for the Shaw problem, α = 5.36 × 10−5 .

The misfit of this solution, approximately $4.42 \times 10^{-7}$, is actually smaller than the tolerance that we specified in finding a solution by the discrepancy principle. Why didn't discrep recover the original spike model? One quick

observation is that the spike model has a norm of 1, while the solu-

tion obtained by discrep has a norm of only 0.66. Since Tikhonov

regularization prefers solutions with smaller norms, we ended up

with the solution in Figure 5.5. More generally, this is due to the

presence of noise and to limited resolution (see below).

A useful diagnostic is the Picard plot: a plot of the singular values $s_i$, the values of $|U_{\cdot,i}^{T}d|$, and the ratios $|U_{\cdot,i}^{T}d|/s_i$. Figure 5.6 shows these values for our problem.

Notice that |(U·,i )T d| reaches a noise floor of about 1 × 10−7 after

i = 13. The singular values continue to decay. As a consequence,

the ratios increase rapidly. It is clear from this plot that we cannot

expect to obtain useful information from the singular values beyond

p = 13. Notice that the 13th singular value is ≈ 2.6 × 10−7 , which

is comparable to the values of α that we have been using.

Figure 5.6: Picard plot for the Shaw problem, showing the singular values σᵢ, |uᵢᵀb|, and |uᵢᵀb|/σᵢ.

5.3 Resolution, Bias, and Uncertainty in the Tikhonov Solution

Just as with our earlier truncated singular value decomposition approach, we can compute a model resolution matrix for the Tikhonov regularization method. From equation (5.6) we know that
\[
m_{\alpha} = (G^{T}G + \alpha^2 I)^{-1} G^{T} d .
\]
Let
\[
G^{\sharp} \equiv (G^{T}G + \alpha^2 I)^{-1} G^{T} .
\]
This can be rewritten in terms of the SVD as
\[
G^{\sharp} = V F S^{\dagger} U^{T}
\]
where F is an n by n diagonal matrix with diagonal elements given by the filter factors $f_i$, and $S^{\dagger}$ is the generalized inverse of S. $G^{\sharp}$ is the appropriate matrix for constructing the model resolution matrix, which is
\[
R_{m,\alpha} = G^{\sharp} G = V F V^{T} .
\tag{5.14}
\]

Similarly, we can compute a data resolution matrix,
\[
R_{d,\alpha} = G G^{\sharp} = U F U^{T} .
\tag{5.15}
\]
Notice that $R_{m,\alpha}$ and $R_{d,\alpha}$ depend on the particular value of α used.
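For modest problem sizes, the resolution matrices can be formed explicitly; the sketch below assumes G and alpha are available.

    [U, S, V] = svd(G);
    s = diag(S);
    k = min(size(G));
    F = diag(s(1:k).^2./(s(1:k).^2 + alpha^2));   % filter factor matrix
    Rm = V(:,1:k)*F*V(:,1:k)';                    % model resolution (5.14)
    Rd = U(:,1:k)*F*U(:,1:k)';                    % data resolution (5.15)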

Example 5.2


For the Tikhonov regularized Shaw problem, the model resolution matrix has the following diagonal entries

>> diag(R)

ans =

0.8981

0.4720

0.4461

0.3837

0.4145

0.4087

0.4205

0.4375

0.4390

0.4475

0.4475

0.4390

0.4375

0.4205

0.4087

0.4145

0.3837

0.4461

0.4720

0.8981

indicating that most entries in the solution vector are rather

poorly resolved. Figure 5.7 displays this poor resolution by apply-

ing R to the (true) spike model (4.54). Recall that this is the model

recovered when the true model is a spike and there is no noise added

to the data vector; additional features of a recovered model in prac-

tice (5.1) will also be influenced by noise.

We can also compute the covariance of the estimated model parameters. Since
\[
m_{\alpha} = G^{\sharp} d ,
\]
the covariance is
\[
\operatorname{Cov}(m_{\alpha}) = G^{\sharp} \operatorname{Cov}(d)\, (G^{\sharp})^{T} .
\]

As with the SVD solution of Chapter 4, the Tikhonov regularized solution will

generally be biased and the differences between the regularized solution values

and the true model may actually be much larger than the confidence intervals

obtained from the covariance matrix of the model parameters.
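A minimal sketch of forming the covariance and approximate 95% confidence intervals follows; it assumes G, d, alpha, and an independent data noise standard deviation sigma are given.

    n = size(G,2);
    Gsharp = (G'*G + alpha^2*eye(n))\G';
    m_alpha = Gsharp*d;
    Cm = Gsharp*(sigma^2*eye(size(G,1)))*Gsharp';
    ci = 1.96*sqrt(diag(Cm));    % m_alpha +/- ci, ignoring the regularization bias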

Figure 5.7: The model resolution matrix Rm applied to the spike model for the Shaw problem.

Example 5.3

Recall our earlier example of the Shaw problem with the spike model

input. Figure 5.8 shows the true model, the solution obtained using

α = 5.36 × 10−5 , and 95% confidence intervals for the estimated

parameters. Notice that very few of the actual model parameters

are included within the confidence intervals.

Ideally, we would like our regularized solution to be as close to mtrue as possible. Of course, because of noise in the data and the bias introduced by

regularization, we cannot hope to determine mtrue or even the exact distance

from our model m to mtrue . However, it might be possible to estimate the

magnitude of the difference between our inverse solution and the true model. If

we have such an estimate as a function of the regularization parameter α, then

we can obviously try to minimize the estimated error.

Several estimates of this sort have been developed by different authors. The quasi-optimality criterion of Tikhonov and Glasko [TG65] is based on the estimate
\[
\|m_{\rm true} - G^{\sharp} d\|_2 \approx \left( d^{T} \left(GG^{T} + \alpha I\right)^{-4} GG^{T} d \right)^{1/2} .
\]
Hanke and Raus [HR96] use the estimate
\[
\|m_{\rm true} - G^{\sharp} d\|_2 \approx \alpha^2 \left( d^{T} \left(GG^{T} + \alpha^2 I\right)^{-3} d \right)^{1/2} .
\]

These error estimates are very rough, order-of-magnitude estimates. They should not be counted on to give a very accurate estimate of the error in the


Figure 5.8: Confidence intervals for the Tikhonov solution to the Shaw problem

and a spike model.

solution, but rather a rough guide for choosing the regularization parameter. Unfortunately, these estimates of the regulariza-

tion error are non–convex functions of α, and because of local minima, it can

be hard to find the value of α which actually minimizes the error estimate.

Example 5.4

Figure 5.9 shows a plot of the Hanke and Raus estimates of the

error in the Shaw problem with noisy data from the spike model.

The graph has a number of local minimum points, and it is not

obvious which value of α should be used. One minimum lies near α =

5.36 × 10−5 , which was the value of α determined by the discrepancy

principle. At this value of α, the estimated error is about 0.058,

which is considerably smaller than the actual error of about 0.74.

5.4 Higher Order Tikhonov Regularization

So far in our discussions of Tikhonov regularization we have minimized an objective function involving $\|m\|_2$. In many situations, we would prefer to obtain

a solution which minimizes some other measure of m, such as the norm of the

first or second derivative.


Figure 5.9: Hanke and Raus estimates of the solution error for the Shaw problem

with spike model.

If the model is discretized with n equally spaced points, then we can approximate, to a multiplicative constant, the first derivative of the model by Lm, where
\[
L = \begin{bmatrix} -1 & 1 & & & \\ & -1 & 1 & & \\ & & \ddots & \ddots & \\ & & & -1 & 1 \end{bmatrix} .
\tag{5.16}
\]

By minimizing $\|Lm\|_2$, we will favor solutions which are relatively flat. In first order Tikhonov regularization, we solve the damped least squares problem
\[
\min \ \|Gm - d\|_2^2 + \alpha^2 \|Lm\|_2^2 .
\tag{5.17}
\]

Another commonly used operator is the second difference, where
\[
L = \begin{bmatrix} 1 & -2 & 1 & & & \\ & 1 & -2 & 1 & & \\ & & \ddots & \ddots & \ddots & \\ & & & 1 & -2 & 1 \end{bmatrix} .
\tag{5.18}
\]

This L approximates, to a multiplicative constant, the second derivative of m, and minimizing $\|Lm\|_2$ penalizes solutions which are rough in a


second derivative sense. In second order Tikhonov regularization, we use the L from (5.18).
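The roughening matrices (5.16) and (5.18) are easy to construct, and the damped least squares problem (5.17) can be solved in its stacked form; the sketch below assumes G, d, and alpha are given.

    n = size(G,2);
    L1 = diff(eye(n));        % first difference operator (5.16), (n-1) by n
    L2 = diff(eye(n),2);      % second difference operator (5.18), (n-2) by n
    L = L2;                   % choose the regularization order here
    m_alpha_L = [G; alpha*L]\[d; zeros(size(L,1),1)];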

We have already seen (5.9) how to solve (5.17) when L = I. This is called zeroth order Tikhonov regularization. It is possible to extend this idea to higher orders, and to problems in which the models are higher (e.g., 2- or 3-) dimensional.

To solve higher order Tikhonov regularization problems, we employ the generalized singular value decomposition, or GSVD. Using the GSVD, the solution to (5.17) can be expressed as a sum of filter factors times generalized singular vectors.

Unfortunately, the definition of the GSVD and associated notation are not

standardized. In the following, we will generally follow the conventions used by

the MATLAB regularization toolbox and its cgsvd command. One important

difference is that we will use λi for the generalized singular value, while the

toolbox uses σi for the generalized singular value. Note that MATLAB also has

a built in command, gsvd, which uses a different definition of the GSVD.

We will assume that G is an m by n matrix, and that L is a p by n matrix,

with m ≥ n ≥ p, that rank(L) = p, and that the null spaces of G and L intersect

only at the zero vector. It is possible to define the GSVD in a more general

context, but these assumptions are necessary for this development of the GSVD

and are reasonable for most problems.

Under the above assumptions there exist matrices U, V, Λ, M, and X with the following properties and relationships:

• U is m by n with orthonormal columns.

• V is p by p and orthogonal.

• X is n by n and nonsingular.

• Λ is a p by p diagonal matrix with
\[
0 \le \lambda_1 \le \lambda_2 \le \ldots \le \lambda_p \le 1 .
\tag{5.19}
\]

• M is a p by p nonsingular diagonal matrix with
\[
1 \ge \mu_1 \ge \mu_2 \ge \ldots \ge \mu_p > 0 .
\tag{5.20}
\]

• The diagonal entries are normalized so that
\[
\lambda_i^2 + \mu_i^2 = 1, \qquad i = 1, 2, \ldots, p .
\tag{5.21}
\]

• The generalized singular values are
\[
\gamma_i \equiv \frac{\lambda_i}{\mu_i} ,
\tag{5.22}
\]
where
\[
0 \le \gamma_1 \le \gamma_2 \le \ldots \le \gamma_p .
\tag{5.23}
\]


•
\[
G = U \begin{bmatrix} \Lambda & 0 \\ 0 & I_{n-p} \end{bmatrix} X^{-1}
\tag{5.24}
\]

•
\[
L = V \begin{bmatrix} M & 0 \end{bmatrix} X^{-1}
\tag{5.25}
\]

•
\[
X^{T} G^{T} G X = \begin{bmatrix} \Lambda^2 & 0 \\ 0 & I \end{bmatrix}
\tag{5.26}
\]

•
\[
X^{T} L^{T} L X = \begin{bmatrix} M^2 & 0 \\ 0 & 0 \end{bmatrix}
\tag{5.27}
\]

• When p < n, the matrix L will have a nontrivial null space. It can be

shown that the vectors X·,p+1 , X·,p+2 , . . ., X·,n form a basis for the null

space of L.

When G comes from an IFK, the GSVD typically has two properties that

were also characteristic of the SVD. First, the generalized singular values γi

(5.22) tend to zero without any obvious break in the sequence. Second, the vec-

tors U·,i , V·,i , and X·,i tend to become rougher as i increases and γi decreases.

In applying the GSVD to solve (5.17), we note that we can reformulate the problem as a least squares system
\[
\min \ \left\| \begin{bmatrix} G \\ \alpha L \end{bmatrix} m - \begin{bmatrix} d \\ 0 \end{bmatrix} \right\|_2^2 .
\tag{5.28}
\]

Assuming that (5.28) is of full rank, the corresponding normal equations are
\[
\begin{bmatrix} G \\ \alpha L \end{bmatrix}^{T} \begin{bmatrix} G \\ \alpha L \end{bmatrix} m = \begin{bmatrix} G \\ \alpha L \end{bmatrix}^{T} \begin{bmatrix} d \\ 0 \end{bmatrix} .
\tag{5.29}
\]

We could solve these normal equations directly, but the GSVD provides a short

cut. Rather than deriving the GSVD solution from scratch, we will simply show

that the solution is

\[
m_{\alpha,L} = \sum_{i=1}^{p} \frac{\gamma_i^2}{\gamma_i^2+\alpha^2}\, \frac{U_{\cdot,i}^{T}d}{\lambda_i}\, X_{\cdot,i}
+ \sum_{i=p+1}^{n} (U_{\cdot,i}^{T}d)\, X_{\cdot,i} .
\tag{5.30}
\]

Substituting (5.30) into the left hand side of (5.29) and using (5.24) and (5.25) gives
\[
(G^{T}G + \alpha^2 L^{T}L)\, m_{\alpha,L} =
\left(
X^{-T} \begin{bmatrix} \Lambda^{T} & 0 \\ 0 & I \end{bmatrix} U^{T}U \begin{bmatrix} \Lambda & 0 \\ 0 & I \end{bmatrix} X^{-1}
+ \alpha^2\, X^{-T} \begin{bmatrix} M^{T} \\ 0 \end{bmatrix} V^{T}V \begin{bmatrix} M & 0 \end{bmatrix} X^{-1}
\right)
\left(
\sum_{i=1}^{p} \frac{\gamma_i^2}{\gamma_i^2+\alpha^2}\, \frac{U_{\cdot,i}^{T}d}{\lambda_i}\, X_{\cdot,i}
+ \sum_{i=p+1}^{n} (U_{\cdot,i}^{T}d)\, X_{\cdot,i}
\right) .
\tag{5.31}
\]


The products $U^{T}U$ and $V^{T}V$ produce identity matrices. Also, Λ and M are diagonal and are thus symmetric. Hence
\[
(G^{T}G + \alpha^2 L^{T}L)\, m_{\alpha,L} =
X^{-T} \begin{bmatrix} \Lambda^2 + \alpha^2 M^2 & 0 \\ 0 & I \end{bmatrix} X^{-1}
\left(
\sum_{i=1}^{p} \frac{\gamma_i^2}{\gamma_i^2+\alpha^2}\, \frac{U_{\cdot,i}^{T}d}{\lambda_i}\, X_{\cdot,i}
+ \sum_{i=p+1}^{n} (U_{\cdot,i}^{T}d)\, X_{\cdot,i}
\right) .
\]
Since $X^{-1}X_{\cdot,i} = e_i$, each term in the sums is mapped onto a scaled standard basis vector, so
\[
\begin{aligned}
(G^{T}G + \alpha^2 L^{T}L)\, m_{\alpha,L}
&= \sum_{i=1}^{p} \frac{\gamma_i^2}{\gamma_i^2+\alpha^2}\, \frac{U_{\cdot,i}^{T}d}{\lambda_i}\,
(\lambda_i^2 + \alpha^2\mu_i^2)\, X^{-T}e_i
+ \sum_{i=p+1}^{n} (U_{\cdot,i}^{T}d)\, X^{-T}e_i \\
&= \sum_{i=1}^{p} \frac{\gamma_i^2}{\gamma_i^2+\alpha^2}\, \frac{U_{\cdot,i}^{T}d}{\lambda_i}\,
\mu_i^2\left(\frac{\lambda_i^2}{\mu_i^2} + \alpha^2\right) X^{-T}e_i
+ \sum_{i=p+1}^{n} (U_{\cdot,i}^{T}d)\, X^{-T}e_i \\
&= \sum_{i=1}^{p} \frac{U_{\cdot,i}^{T}d}{\lambda_i}\,\gamma_i^2\mu_i^2\, X^{-T}e_i
+ \sum_{i=p+1}^{n} (U_{\cdot,i}^{T}d)\, X^{-T}e_i \\
&= \sum_{i=1}^{p} \lambda_i\,(U_{\cdot,i}^{T}d)\, X^{-T}e_i
+ \sum_{i=p+1}^{n} (U_{\cdot,i}^{T}d)\, X^{-T}e_i .
\end{aligned}
\]

We can also simplify the right hand side of the original system of equations:
\[
G^{T}d = X^{-T} \begin{bmatrix} \Lambda^{T} & 0 \\ 0 & I \end{bmatrix} U^{T}d
= \sum_{i=1}^{p} \lambda_i\,(U_{\cdot,i}^{T}d)\, X^{-T}e_i
+ \sum_{i=p+1}^{n} (U_{\cdot,i}^{T}d)\, X^{-T}e_i .
\tag{5.32}
\]

Figure 5.10: The true slowness model for the vertical seismic profiling example (true slowness in s/km versus depth in m).

Thus
\[
(G^{T}G + \alpha^2 L^{T}L)\, m_{\alpha,L} = G^{T}d ,
\]
which confirms that (5.30) solves the normal equations (5.29).

Example 5.5

The spike model of our earlier Shaw example is poorly characterized by the norm of its derivatives and thus does not provide a suitable example for model estimation via higher order Tikhonov regularization. We will thus switch to a smoother alternative example using the vertical seismic profiling problem of Example 1.3, where the true slowness model is a smooth function (Figure 5.10). In this case, for a 1-km deep borehole experiment, we discretized the system (Example 3.2) using m = n = 50 data and model points corresponding to sensors every 20 m and 20-m-thick model slowness intervals. Figure 5.11 shows corresponding arrival time data with N(0, (2 × 10⁻⁴ s)²) noise added.

Solving the problem using zeroth order Tikhonov regularization pro-

duces a model (Figure 5.13) with many fine-scale features that are

not present in the true model (Figure 5.10). The failure to identify a

corner in the L-curve (Figure 5.12) further suggests that this model is not a

good tradeoff between data fit and regularization. If we have reason

to believe that the true model is a smooth function, it is appropriate

to apply higher–order regularization.

Figure 5.11: Noisy data for the vertical seismic profiling example (arrival time in s versus depth in m).

Figure 5.12: L-curve for the vertical seismic profiling example, zeroth order regularization.

Figure 5.13: Zeroth order Tikhonov regularization solution for the vertical seismic profiling example, in comparison with the true model.

Figure 5.14 shows the first-order L-curve for this problem obtained using the regularization toolbox commands get_l, cgsvd, l_curve, and tikhonov. The L-curve now has a well-defined corner near

α ≈ 137 for this particular data realization, and Figure 5.15 shows

the corresponding solution. The first–order regularized solution is

much smoother than the zeroth order regularized one, and is much

closer to the true solution.

Figure 5.16 shows the L-curve for second-order Tikhonov regularization, which has a corner at α ≈ 2325, and Figure 5.17 shows

the corresponding solution. This solution is smoother still compared

to the first–order regularized solution. Both the first– and second–

order solutions depart most from the true solution at shallow depths

where the true slowness has the greatest slope and curvature.

5.5 Resolution in Higher Order Regularization

As with zeroth order Tikhonov regularization, we can compute a resolution matrix for higher order Tikhonov regularization. For particular values of L and

α, the Tikhonov regularization solution can be written as
\[
m_{\alpha,L} = G^{\sharp} d
\tag{5.33}
\]

Figure 5.14: L-curve for the vertical seismic profiling example, first order regularization.

Figure 5.15: First order Tikhonov regularization solution for the vertical seismic profiling example, in comparison with the true model.

Figure 5.16: L-curve for the vertical seismic profiling example, second order regularization.

Figure 5.17: Second order Tikhonov regularization solution for the vertical seismic profiling example, in comparison with the true model.


where
\[
G^{\sharp} = (G^{T}G + \alpha^2 L^{T}L)^{-1} G^{T} .
\tag{5.34}
\]

Using properties of the GSVD we can simplify this expression to
\[
\begin{aligned}
G^{\sharp}
&= X \begin{bmatrix} (\Lambda^2 + \alpha^2 M^2)^{-1} & 0 \\ 0 & I \end{bmatrix} X^{T}\, X^{-T}
\begin{bmatrix} \Lambda^{T} & 0 \\ 0 & I \end{bmatrix} U^{T} \\
&= X \begin{bmatrix} (\Lambda^2 + \alpha^2 M^2)^{-1}\Lambda^{T} & 0 \\ 0 & I \end{bmatrix} U^{T} \\
&= X \begin{bmatrix} F\Lambda^{-1} & 0 \\ 0 & I \end{bmatrix} U^{T}
\end{aligned}
\tag{5.35}
\]

where F is a diagonal matrix with diagonal elements
\[
F_{i,i} = \frac{\gamma_i^2}{\gamma_i^2+\alpha^2} .
\tag{5.36}
\]

The model resolution matrix is then
\[
R_{\alpha,L} = G^{\sharp} G
= X \begin{bmatrix} F\Lambda^{-1} & 0 \\ 0 & I \end{bmatrix} U^{T}\, U \begin{bmatrix} \Lambda & 0 \\ 0 & I \end{bmatrix} X^{-1}
= X \begin{bmatrix} F & 0 \\ 0 & I \end{bmatrix} X^{-1} .
\tag{5.37}
\]
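For problems of modest size, G-sharp and the resolution matrix can also be computed directly from (5.34) and (5.37) without forming the GSVD; the spike position used below is only illustrative.

    Gsharp = (G'*G + alpha^2*(L'*L))\G';
    R = Gsharp*G;                          % R_{alpha,L} of (5.37)
    spike = zeros(size(G,2),1);
    spike(25) = 1;                         % hypothetical spike location
    plot(R*spike);                         % smearing of the spike by the inversion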

Example 5.6

To examine the resolution of the Tikhonov-regularized inversions

of Example 5.5, we perform a spike test using (5.37). Figure 5.18

shows the effect of multiplying Rα,L with a unit amplitude spike

model (spike depth 500 m) under zeroth–, first–, and second–order

Tikhonov regularization using the corresponding α values of 10.0,

137, and 2325, respectively. The shapes of these curves (Figure

5.18) are indicative of the ability of each inversion to recover abrupt

changes in slowness halfway through the model. Progressively in-

creasing widths with increasing regularization order demonstrate de-

creasing resolution as more misfit is allowed in the interest of obtain-

ing a progressively smoother solution. Under first– or second–order

regularization, the resolution of various model features will depend

critically on how smooth or rough these features are in the true

model. In this example, the higher–order solutions recover the true

model better because the true model is smooth.


Figure 5.18: Rα,L times the spike model for each of the regularized solutions of Example 5.5.


Figure 5.19: L–curve for the TGSVD solution of the Shaw problem, smooth

model.

When we first started working with the singular value decomposition, we examined the truncated singular value decomposition (TSVD) method of regularization. We can conceptualize the TSVD as simply tossing out small singular values and their associated model space basis vectors, or as a damped SVD solution in which filter factors of one are used for larger singular values and filter factors of

zero are used for smaller singular values. This approach can also be extended to

work with the generalized singular value decomposition to obtain a truncated

generalized singular value decomposition or TGSVD solution. In this

case we include the k largest generalized singular values with filter factors of

one to obtain the solution
\[
m_{k,L} = \sum_{i=p-k+1}^{p} \frac{U_{\cdot,i}^{T}d}{\lambda_i}\, X_{\cdot,i}
+ \sum_{i=p+1}^{n} (U_{\cdot,i}^{T}d)\, X_{\cdot,i} .
\tag{5.38}
\]

Figure 5.19 shows the L–curve for the TGSVD solution of the Shaw problem

with the smooth true model. The corner occurs at k = 10. Figure 5.20 shows

the corresponding solution. The model recovery is reasonably good, but it is not

as close to the original model as the solution obtained by second order Tikhonov

regularization.

Figure 5.20: TGSVD solution of the Shaw problem with the smooth true model, k = 10.

5.7 Generalized Cross Validation

Generalized cross validation (GCV), developed by Wahba [Wah90], is a method for selecting the regularization parameter which has a number of desirable statistical properties. Let
\[
G(\alpha) = \frac{\|G m_{\alpha,L} - d\|_2^2}{\left(\operatorname{Tr}(I - G G^{\sharp}_{\alpha,L})\right)^2}
= \frac{V(\alpha)}{T(\alpha)} .
\tag{5.39}
\]

Here V (α) measures the misfit. As α increases, V (α) will also increase.

Below the corner of the L–curve, V (α) should increase very slowly. At the corner

of the L–curve, V (α) should begin to increase fairly rapidly. The denominator,

T (α) measures the closeness of the data resolution matrix to the identity matrix.

T (α) is a slowly increasing function of α. Intuitively, we would expect G(α) to

have a minimum near the corner of the L–curve. For values of α below the

corner of the curve, V (α) will be roughly constant and T (α) will be increasing,

so G(α) should be decreasing. For values of α above the corner of the L–curve,

V (α) should be increasing rapidly, while T (α) should be increasing slowly. Thus

G(α) should be increasing. In the GCV method, we pick the value α∗ which

minimizes G(α).
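A minimal sketch of evaluating the zeroth order GCV function (5.39) over a grid of regularization parameters follows, using the SVD and the fact that the trace of I minus the data resolution matrix equals m minus the sum of the filter factors; G and d are assumed to be available.

    [U, S, V] = svd(G);
    s = diag(S); k = min(size(G)); m = size(G,1);
    alphas = logspace(-8, 0, 200);
    g = zeros(size(alphas));
    for j = 1:length(alphas)
      f = s(1:k).^2./(s(1:k).^2 + alphas(j)^2);
      m_alpha = V(:,1:k)*(f.*(U(:,1:k)'*d)./s(1:k));
      g(j) = norm(G*m_alpha - d)^2/(m - sum(f))^2;   % V(alpha)/T(alpha) of (5.39)
    end
    [gmin, jmin] = min(g);
    alpha_star = alphas(jmin)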

Wahba was able to show that, under reasonable assumptions about the smoothness of mtrue and under the assumption that the noise is white, the value of α that minimizes G(α) approaches the value that minimizes $E[\|G m_{\alpha,L} - d_{\rm true}\|_2^2]$ as the number of data points m goes to infinity [Wah90]. Wahba was

also able to show that under the same assumptions, E[kmtrue − mα,L k2 ] goes

to 0 as m goes to infinity [Wah90]. These results are too complicated to prove


Figure 5.21: GCV curve for the Shaw problem, second order regularization.

here, but they do provide an important theoretical justification for the GCV

method of selecting the regularization parameter.

Example 5.7

Figure 5.21 shows a graph of G(α) for our Shaw test problem, us-

ing second order Tikhonov regularization. The minimum occurs at

α = 7.04 × 10−6 . Figure 5.22 shows the corresponding solution.

This solution is actually the best of all the solutions that we have

obtained for this test problem. The norm of the difference between

this solution and the true solution is 0.028.


Figure 5.22: GCV solution for the Shaw problem, second order, α = 7.04×10−6 .


5.8 Error Bounds

In this section we present two theoretical results which help to answer some questions about the accuracy of Tikhonov regularization solutions. We will present these results in a very simplified form, covering only zeroth order Tikhonov regularization.

The first question is whether for a particular value of the regularization

parameter α, we can establish a bound on the sensitivity of the regularized

solution to the noise in the observed data d and/or errors in the system matrix

G. This would provide a sort of condition number for the inverse problem.

Note that this does not tell us how far the regularized solution is from the

true model, since Tikhonov regularization has introduced a bias in the solution.

Under Tikhonov regularization with a nonzero α, we would not obtain the true

model even if the noise was 0.

Hansen [Han98] gives such a bound, and the following theorem gives the

bound for zeroth order Tikhonov regularization. A slightly more complicated

formula is available for higher order Tikhonov regularization.

Theorem 5.1

Suppose that the two problems
\[
\min \ \|Gm - d\|_2^2 + \alpha^2 \|m\|_2^2
\tag{5.40}
\]
and
\[
\min \ \|\bar{G}m - \bar{d}\|_2^2 + \alpha^2 \|m\|_2^2
\tag{5.41}
\]
are solved to obtain $m_{\alpha}$ and $\bar{m}_{\alpha}$. Then
\[
\frac{\|m_{\alpha} - \bar{m}_{\alpha}\|_2}{\|m_{\alpha}\|_2}
\le \frac{\bar{\kappa}_{\alpha}}{1 - \epsilon\bar{\kappa}_{\alpha}}
\left( 2\epsilon + \frac{\|e\|_2}{\|d_{\alpha}\|_2}
+ \epsilon\,\bar{\kappa}_{\alpha}\frac{\|r_{\alpha}\|_2}{\|d_{\alpha}\|_2} \right)
\tag{5.42}
\]
where
\[
\bar{\kappa}_{\alpha} = \frac{\|G\|_2}{\alpha}
\tag{5.43}
\]
\[
E = G - \bar{G}
\tag{5.44}
\]
\[
e = d - \bar{d}
\tag{5.45}
\]
\[
\epsilon = \frac{\|E\|_2}{\|G\|_2}
\tag{5.46}
\]
\[
d_{\alpha} = G m_{\alpha}
\tag{5.47}
\]
\[
r_{\alpha} = d - d_{\alpha} .
\tag{5.48}
\]

Figure 5.23: The spike model for the Shaw problem.

In the particular case when $G = \bar{G}$, and the only difference between the two problems is $e = d - \bar{d}$, the inequality becomes even simpler:
\[
\frac{\|m_{\alpha} - \bar{m}_{\alpha}\|_2}{\|m_{\alpha}\|_2}
\le \bar{\kappa}_{\alpha}\, \frac{\|e\|_2}{\|d_{\alpha}\|_2} .
\tag{5.49}
\]
Since $\bar{\kappa}_{\alpha} = \|G\|_2/\alpha$, increasing the regularization parameter α will decrease the sensitivity of the solution to perturbations in the data. Of course, increasing α also increases the error in the solution due to regularization bias.

The second question is whether we can establish any sort of bound on the

norm of the difference between the regularized solution and the true model. This

bound would incorporate both sensitivity to noise and the bias introduced by

Tikhonov regularization. Such a bound must of course depend on the magnitude

of the noise on the data. It must also depend on the particular regularization

parameter chosen. Tikhonov [TA77] developed a beautiful theorem that ad-

dresses this question in the context of inverse problems involving IFK’s. More

recently, Neumaier [Neu98] has developed a version of Tikhonov’s theorem that

can be applied directly to discretized problems.

Recall that in a discrete ill-posed problem, the matrix G has a smoothing effect: when we compute Gm, the result is smoother than m. Similarly, if we multiply G^T times Gm, the result is smoother than Gm. For example, Figures 5.23 through 5.25 show m (our original spike model for the Shaw problem), G^T m, and G^T Gm.

It should be clear that models in the range of GT form a relatively smooth

subspace of all possible models. Models in this subspace of Rn can be written

as m = GT w, for some weights w. Furthermore, models in the range of GT G


Figure 5.24: GT times the spike model for the Shaw problem.


Figure 5.25: GT G times the spike model for the Shaw problem.


can be written as m = G^T(Gw), which is a linear combination of the columns of G^T. Because of

the smoothing effect of G and GT , we would expect these models to be even

smoother than the models in R(GT ). We could construct smaller subspaces of

Rn that contain even smoother models, but it turns out that with zeroth order

Tikhonov regularization these are the only subspaces of interest.

There is another way to see that models in R(GT ) will be relatively smooth.

Recall that the vectors V·,1, V·,2, . . ., V·,p from the SVD of G form an

orthogonal basis for R(GT ). For discrete ill–posed problems, we know from

Chapter 4 that these basis vectors will be relatively smooth, so linear combina-

tions of these vectors in R(GT ) should be smooth.

The following theorem gives a bound on the total error (bias due to regular-

ization plus error due to noise in the data) for zeroth order Tikhonov regular-

ization. Unfortunately, the theorem only applies in situations where we begin

with a true model mtrue in one of the aforementioned subspaces of Rn . This

seldom happens in practice, but the theorem does help us understand how the error

in the Tikhonov regularization solution behaves in the best possible case.

Theorem 5.2

Suppose that we use zeroth order Tikhonov regularization to solve Gm = d and that mtrue can be expressed as
\[
m_{\rm true} = \begin{cases} G^{T} w & p = 1 \\ G^{T}G w & p = 2 \end{cases}
\tag{5.50}
\]
and that
\[
\|G m_{\rm true} - d\|_2 \le \Delta \|w\|_2
\tag{5.51}
\]
for some Δ > 0. Then
\[
\|m_{\rm true} - G^{\sharp} d\|_2 \le \left( \frac{\Delta}{2\alpha} + \gamma\alpha^{p} \right) \|w\|_2
\tag{5.52}
\]
where
\[
\gamma = \begin{cases} 1/2 & p = 1 \\ 1 & p = 2 . \end{cases}
\tag{5.53}
\]

Furthermore, if we begin with the bound
\[
\|G m_{\rm true} - d\|_2 \le \delta ,
\tag{5.54}
\]
we can let
\[
\Delta = \frac{\delta}{\|w\|_2} .
\tag{5.55}
\]
Under this condition, it can be shown that the optimal value of α is
\[
\hat{\alpha} = \left( \frac{\Delta}{2\gamma p} \right)^{\frac{1}{p+1}} = O(\Delta^{\frac{1}{p+1}}) .
\tag{5.56}
\]


With this choice of α,
\[
\Delta = 2\gamma p\,\hat{\alpha}^{p+1}
\tag{5.57}
\]
and the bound (5.52) becomes
\[
\|m_{\rm true} - G^{\sharp}_{\hat{\alpha}} d\|_2 \le \gamma(p+1)\,\hat{\alpha}^{p} = O(\Delta^{\frac{p}{p+1}}) .
\tag{5.58}
\]

This theorem tells us that the error in the Tikhonov regularization solution

depends on both the noise level ∆ and on the regularization parameter α. For

very large values of α, the error due to regularization will be dominant. For very

small values of α, the error due to noise in the data will be dominant. There

is an optimal value of α which balances these effects. Using the optimal α, we

can obtain an error bound of O(∆2/3 ) if p = 2, and an error bound of O(∆1/2 )

if p = 1.

Of course, the above result can only be applied when our true model lives

in the very restricted subspace of models in R(GT ). In practice this seldom

happens. The result also depends on ∆, which we can at best approximate by

∆ ≈ δ. As a rule of thumb, if the model mtrue is reasonably smooth, then

we can hope for an error in the Tikhonov regularized solution which is O(δ 1/2 ).

Another way of saying this is that we can hope for an answer with about half

as many accurate digits as the data.

5.9 Exercises

1. Use the method of Lagrange multipliers to derive the damped least squares problem (5.3) from the discrepancy principle problem (5.1), and demonstrate that (5.3) can be written as (5.4).

2. Consider the integral equation and data set from Problem 3.1. You can

find a copy of this data set in ifk.mat.

(b) Using the data supplied, and assuming that the numbers are accurate

to four significant figures, determine a reasonable bound δ for the

misfit.

(c) Use zeroth order Tikhonov regularization to solve the problem. Use

GCV, the discrepancy principle and the L–curve criterion to pick

the regularization parameter. Estimate the norm of the difference

between your solutions and mtrue .

(d) Use first order Tikhonov regularization to solve the problem. Use

GCV, the discrepancy principle and the L–curve criterion to pick the

regularization parameter.


(e) Use second order Tikhonov regularization to solve the problem. Use

GCV, the discrepancy principle and the L–curve criterion to pick the

regularization parameter.

(f) Analyze the resolution of your solutions. Are the features you see

in your inverse solutions unambiguously real? Interpret your results.

Describe the size and location of any significant features in the solu-

tion.

3. A cross–well tomography experiment is performed in which two wells are located 1600 meters apart. A seismic source is inserted in one well at depths of 50, 150, ..., 1550 m. A string of hydrophones is inserted

in the other well at depths of 50m, 150m, ..., 1550m. For each source

receiver pair, a travel time is recorded, with a measurement standard

deviation of 0.5 msec. We wish to determine the velocity structure in the

two dimensional plane between the two wells.

Discretizing the problem into a 16 by 16 grid of 100 meter by 100 meter

blocks gives 256 ray paths and 256 data points. The G matrix and noisy data vector d for this problem, assuming straight-line ray paths, are in the file crosswell.mat.

(a) Use the truncated SVD to solve this inverse problem. Plot the result

using imagesc and colorbar.

(b) Use zeroth order Tikhonov regularization to solve this problem and

plot your solution. Explain why it is hard to use the discrepancy

principle to select the regularization parameter. Use the L–curve

criterion to select your regularization parameter. Plot the L–curve

as well as your solution.

(c) Use second order Tikhonov regularization to solve this problem and

plot your solution. Because this is a two dimensional problem, you

will need to implement a finite difference approximation to the Lapla-

cian (second derivative in the horizontal direction plus the second

derivative in the vertical direction). You can generate the appropri-

ate L roughening matrix using the following MATLAB code:

% Build the 2-D Laplacian roughening matrix for the 16 by 16 model grid.
L=zeros(14*14,256);
k=1;
for i=2:15,
  for j=2:15,
    M=zeros(16,16);
    M(i,j)=-4;       % center of the finite difference Laplacian stencil
    M(i,j+1)=1;
    M(i,j-1)=1;
    M(i+1,j)=1;
    M(i-1,j)=1;
    L(k,:)=(reshape(M,256,1))';   % store the stencil as one row of L
    k=k+1;
  end
end

What if any problems did you have in using the L–curve criterion on

this problem? Plot the L–curve as well as your solution.

(d) Discuss your results. Can you explain the characteristic vertical

bands that appeared in some of your solutions?

Per Christian Hansen’s book [Han98] is a very complete reference on the linear

algebra of Tikhonov regularization. Arnold Neumaier’s tutorial [Neu98] is also

a very useful reference. Two other useful surveys of Tikhonov regularization are

[Eng93, EHN96]. Vogel [Vog02] includes an extensive discussion of methods for

selecting the regularization parameter. Hansen’s MATLAB regularization tool-

box [Han94] is a very useful collection of software for performing regularization

within MATLAB.

Within statistics, badly conditioned linear regression problems are said to suffer from "multicollinearity." A method called "ridge regression," which is exactly Tikhonov regularization, is often used to deal with such problems. Statisticians also use a method called "Principal Components Regression (PCR)," which is simply the TSVD method.


Chapter 6

Iterative Methods

SVD based pseudo inverse and Tikhonov regularization solutions become im-

practical when we consider larger problems in which G has tens of thousands

of rows and columns. The problem is that while G may be a sparse matrix, the

U and V matrices in the SVD will typically be dense matrices. For example,

consider a tomography problem in which the model is of size 256 by 256 (65536

model elements), and there are 100,000 ray paths. Most of the ray paths miss

most of the model cells, so the majority of the entries in G are zero. Thus the

G matrix might have a density of less than 1%. If we stored G as a regular

dense matrix, it would require about 50 gigabytes of storage. Furthermore, the

U matrix would require 80 gigabytes of storage, and the V matrix would require

about 35 gigabytes of storage. However, the sparse matrix G can be stored in

less than one gigabyte of storage.
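As a rough illustration of this storage difference (our own example, not from the text), the following MATLAB fragment builds a small random matrix with about 1% nonzero entries and compares the memory used by its dense and sparse representations:

% Illustrative only: compare dense and sparse storage for a matrix that is ~1% nonzero.
Gsparse = sprand(2000, 2000, 0.01);   % sparse random matrix, roughly 1% nonzero entries
Gdense = full(Gsparse);               % the same matrix stored densely
whos Gdense Gsparse                   % the sparse copy uses a small fraction of the memory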

Because they can take advantage of the sparsity commonly found in the G matrix, iterative methods are often the only practical approach for large problems.

6.1 Iterative Methods for Tomography Problems

We will concentrate in this section on Kaczmarz's algorithm and its ART and

SIRT variants. These algorithms were originally developed for tomographic

problems and are particularly effective for such problems.

Kaczmarz’s algorithm is an easy to implement algorithm for solving a linear

system of equations Gm = d. The algorithm starts with an initial solution

m(0) , and then moves to a solution m(1) by projecting the initial solution onto

the hyperplane defined by the first equation in the system of equations. Next,

we project the solution m(1) onto the hyperplane defined by the second equa-

tion. The process is repeated until the solution has been projected onto the

hyperplanes defined by all of the equations. At that point, we can begin a new

cycle through the hyperplanes.


[Figure 6.1: Kaczmarz's algorithm applied to the system of equations y = 1, −x + y = −1.]

In order to implement the algorithm, we need a formula to compute the

projection of a vector onto the hyperplane defined by equation i. Let Gi,· be

the ith row of G. Consider the hyperplane defined by Gi+1,· m = di+1 . The

vector GTi+1,· is perpendicular to this hyperplane. Thus the update to m(i) will

be of the form

m^(i+1) = m^(i) + α G_{i+1,·}^T .                               (6.1)

We use the fact that G_{i+1,·} m^(i+1) = d_{i+1} to solve for α:

G_{i+1,·} ( m^(i) + α G_{i+1,·}^T ) = d_{i+1}                    (6.2)

G_{i+1,·} m^(i) + α G_{i+1,·} G_{i+1,·}^T = d_{i+1}              (6.3)

α = − ( G_{i+1,·} m^(i) − d_{i+1} ) / ( G_{i+1,·} G_{i+1,·}^T ) .   (6.4)

Thus the update formula is

m^(i+1) = m^(i) − ( ( G_{i+1,·} m^(i) − d_{i+1} ) / ( G_{i+1,·} G_{i+1,·}^T ) ) G_{i+1,·}^T .   (6.5)

Figure 6.2: Slow convergence occurs when hyperplanes are nearly parallel.

It can be shown that if the system of equations has a unique solution, then Kaczmarz's algorithm will converge to this solution. If the system

of equations has many solutions, then the algorithm will converge to the solution

which is closest to the point m(0) . In particular, if we start with m(0) = 0, we

will obtain a minimum length solution. If there is no exact solution to the

system of equations, then the algorithm will fail to converge, but will typically

bounce around near an approximate solution.

A second important question is how fast Kaczmarz’s algorithm will converge

to a solution. If the hyperplanes described by the system of equations are nearly

orthogonal, then the algorithm will converge very quickly. However, if two or

more hyperplanes are nearly parallel to each other, convergence can be extremely

slow. Figure 6.2 shows a typical situation in which the algorithm zigzags back

and forth without making much progress towards a solution. As the two lines

become more nearly parallel, the problem becomes worse. This problem can be

alleviated by picking an ordering of the equations such that adjacent equations

describe hyperplanes which are nearly orthogonal to each other. In the context

of tomography, this can be done by ordering the equations so that successive

equations do not share common cells in the 2–dimensional model.
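A minimal MATLAB sketch of Kaczmarz's algorithm, written directly from the update formula (6.5), is given below. The function name and the fixed number of cycles are our own choices; G and d are assumed to be available in the workspace.

% Minimal sketch of Kaczmarz's algorithm based on (6.5) (illustrative, not optimized).
function m = kaczmarz(G, d, ncycles)
  [nrows, ncols] = size(G);
  m = zeros(ncols, 1);               % starting from m^(0) = 0 yields a minimum length solution
  for cycle = 1:ncycles
    for i = 1:nrows
      Gi = G(i, :);                  % i-th row of G
      m = m - ((Gi * m - d(i)) / (Gi * Gi')) * Gi';   % project onto the hyperplane Gi*m = d(i)
    end
  end
end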

Example 6.1

We consider a tomography problem with the same geometry used in homework exercise 4.1. The true model is shown in Figure 6.3.

[Figure 6.3: The true model, discretized on a 16 by 16 grid of blocks.]

Synthetic travel time data were generated for this model, with normally distributed random noise added. The random noise had

standard deviation 0.01. Figure 6.4 shows the truncated SVD solu-

tion. The two blobs are apparent, but it is not possible to distinguish

the small hole within the larger of the two blobs.

Figure 6.5 shows the solution obtained after 200 iterations of Kacz-

marz’s algorithm. This solution is very similar to the truncated SVD

solution and both solutions are about the same distance from the

true model.

The algebraic reconstruction technique (ART) is a version of Kaczmarz's algorithm modified especially for the tomographic reconstruction problem. Notice that in

(6.5), the updates to the solution always consist of adding a multiple of a row

of G to the current solution. The numerator in the fraction is the difference

between the right hand side of equation i+1 and the value of the i+1 component

of Gm. The denominator is simply the square of the norm of row i + 1 of G.

Effectively, Kaczmarz’s algorithm is determining the error in equation i + 1,

and then adjusting the solution by spreading the required correction over the

elements of m which appear in equation i + 1.

A very crude approximation to the Kaczmarz’s update is to replace all of

the nonzero entries in row i + 1 of G with ones. Using this approximation,

q_{i+1} = Σ_{cell j in ray path i+1} m_j


is an approximation to the sum of the slownesses along ray path i + 1. The difference between q_{i+1} and d_{i+1} is roughly the error in our predicted travel time for ray i + 1. The denominator of the fraction becomes N_{i+1}, the number

of cells along ray path i + 1 or equivalently the number of nonzero entries in

row i + 1 of G.

Thus we are taking the total error in the travel time for ray i + 1 and

dividing it by the number of cells in ray path i + 1. This correction factor is

then multiplied by a vector which has ones in cells along the ray path i + 1.

This has the effect of smearing the needed correction in travel time over all of

the cells in ray path i + 1.

The new approximate update formula can be written as

m_j^(i+1) = m_j^(i) − ( q_{i+1} − d_{i+1} ) / N_{i+1}     (cell j in ray path i + 1)
m_j^(i+1) = m_j^(i)                                       (cell j not in ray path i + 1) .   (6.6)

The approximation can be improved by taking into account that the ray

paths through some cells are longer than the ray paths through other cells. Let

Li+1 be the length of ray path i + 1. We can improve the approximation by

using the following update formula.

m_j^(i+1) = m_j^(i) + d_{i+1}/L_{i+1} − q_{i+1}/N_{i+1}   (cell j in ray path i + 1)
m_j^(i+1) = m_j^(i)                                       (cell j not in ray path i + 1) .   (6.7)

The main advantage of ART is that it saves storage: we need only store

information about which rays pass through which cells, and we do not need to

record the length of each ray in each cell. A second advantage of the method is

that it reduces the number of floating point multiplications required by Kacz-

marz’s algorithm. Although in current computers floating point multiplications

and additions require roughly the same amount of time, during the 1970’s when

ART was first developed, multiplication was slower than addition.

One problem with ART is that the resulting tomographic images tend to

be somewhat more noisy than the images resulting from Kaczmarz’s algorithm.

The Simultaneous Iterative Reconstruction Technique (SIRT) is a variation on

ART which gives slightly better images in practice, at the expense of a slightly

slower algorithm. In the SIRT algorithm, we compute updates to cell j of the

model for each ray i that passes through cell j. The updates for cell j are added

together and then divided by Nj before being added to mj .
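The following hedged MATLAB sketch shows one common way to organize a single SIRT sweep, using the crude ray-count approximation of (6.6); the variable names and the exact weighting are our own choices for illustration.

% Hedged sketch of one SIRT sweep using the ray-count approximation of (6.6).
% G (rays by cells) and d are assumed to be in the workspace; the function name is ours.
function m = sirt_sweep(G, d, m)
  [nrays, ncells] = size(G);
  dm = zeros(ncells, 1);        % accumulated corrections for each cell
  nhits = zeros(ncells, 1);     % number of rays passing through each cell (N_j)
  for i = 1:nrays
    cells = find(G(i, :));      % cells touched by ray i
    Ni = length(cells);         % N_i, number of cells along ray i
    qi = sum(m(cells));         % approximate predicted travel time q_i for ray i
    dm(cells) = dm(cells) - (qi - d(i)) / Ni;   % spread the correction along the ray
    nhits(cells) = nhits(cells) + 1;
  end
  upd = nhits > 0;
  m(upd) = m(upd) + dm(upd) ./ nhits(upd);      % average the accumulated corrections over N_j
end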

Example 6.2

Returning to our earlier tomography example, Figure 6.6 shows the

ART solution obtained after 200 iterations. Again, the solution is

very similar to the truncated SVD solution.

Figure 6.7 shows the SIRT solution for our example tomography

problem. This solution is similar to the Kaczmarz’s and ART solu-

tions.


6.2 The Conjugate Gradient Method

In this section we will consider the method of conjugate gradients (CG) for

solving a symmetric and positive definite system of equations Ax = b. In

the next section we will apply the CG method to the normal equations for

Gm = d.

We begin by considering the quadratic optimization problem

min φ(x) = (1/2) x^T A x − b^T x                               (6.8)

where A is an n by n symmetric and positive definite matrix. We require that A

be positive definite so that the function φ(x) will be convex and have a unique

minimum. We can calculate ∇φ(x) = Ax − b and set it equal to zero to find

the minimum. The minimum occurs at a point x that satisfies the equation

∇φ(x) = Ax − b = 0 (6.9)

or

Ax = b . (6.10)

Thus solving the system of equations Ax = b is equivalent to minimizing φ(x).

The method of conjugate gradients approaches the problem of minimizing

φ(x) by constructing a basis for Rn in which the minimization problem is ex-

tremely simple. The basis vectors p0 , p1 , . . ., pn−1 are selected so that

they are mutually conjugate with respect to A,

p_i^T A p_j = 0   when i ≠ j .                                 (6.11)

We express x in terms of these basis vectors as

x = Σ_{i=0}^{n−1} α_i p_i .                                    (6.12)

Then

φ(α) = (1/2) ( Σ_{i=0}^{n−1} α_i p_i )^T A ( Σ_{i=0}^{n−1} α_i p_i ) − b^T ( Σ_{i=0}^{n−1} α_i p_i ) .   (6.13)

φ(α) = (1/2) Σ_{i=0}^{n−1} Σ_{j=0}^{n−1} α_i α_j p_i^T A p_j − b^T ( Σ_{i=0}^{n−1} α_i p_i ) .   (6.14)

Since the vectors are mutually conjugate with respect to A, this simplifies to

φ(α) = (1/2) Σ_{i=0}^{n−1} α_i^2 p_i^T A p_i − b^T ( Σ_{i=0}^{n−1} α_i p_i )   (6.15)


or

φ(α) = (1/2) Σ_{i=0}^{n−1} ( α_i^2 p_i^T A p_i − 2 α_i b^T p_i ) .   (6.16)

Equation (6.16) shows that φ(α) consists of n terms, each of which is inde-

pendent of the other terms. Thus we can minimize φ(α) by selecting each αi to

minimize the ith term,

α_i^2 p_i^T A p_i − 2 α_i b^T p_i .

Differentiating with respect to αi and setting the derivative equal to zero, we

find that the optimal value for αi is

α_i = ( b^T p_i ) / ( p_i^T A p_i ) .                          (6.17)

We’ve seen that if we have a basis of vectors which are mutually conjugate

with respect to A, then minimizing φ(x) is very easy. We have not yet shown

how to construct the mutually conjugate basis vectors.

Our algorithm will actually construct a sequence of solution vectors xi , resid-

ual vectors ri = b − Axi , and basis vectors pi . The algorithm begins with

x0 = 0, r0 = b, p0 = r0 , and α0 = (rT0 r0 )/(pT0 Ap0 ).

Suppose that at the start of iteration k of the algorithm we have constructed

x0 , x1 , . . ., xk , r0 , r1 , . . ., rk , p0 , p1 , . . ., pk and α0 , α1 , . . ., αk . We assume

that the first k + 1 basis vectors pi are mutually conjugate with respect to A,

the first k + 1 residual vectors r_i are mutually orthogonal, and that r_i^T p_j = 0 when i ≠ j.

We let

xk+1 = xk + αk pk . (6.18)

This effectively adds one more term of (6.12) into the solution. Next, we let

r_{k+1} = b − A x_{k+1}                                        (6.20)
        = b − A ( x_k + α_k p_k )                              (6.21)
        = ( b − A x_k ) − α_k A p_k                            (6.22)
        = r_k − α_k A p_k .                                    (6.23)

We let

β_{k+1} = ( r_{k+1}^T r_{k+1} ) / ( r_k^T r_k ) .              (6.24)

Finally, we let

pk+1 = rk+1 + βk+1 pk (6.25)

In the algorithm, α_k can be computed as α_k = ( r_k^T r_k ) / ( p_k^T A p_k ) rather than by (6.17), because b^T p_k = r_k^T r_k. To see this, note that b = r_k + A x_k, so that

b^T p_k = r_k^T p_k + p_k^T A x_k                              (6.27)
        = r_k^T ( r_k + β_k p_{k−1} ) + p_k^T A x_k            (6.28)
        = r_k^T r_k + β_k r_k^T p_{k−1} + p_k^T A ( α_0 p_0 + · · · + α_{k−1} p_{k−1} )   (6.29)
        = r_k^T r_k + 0 + 0                                    (6.30)
        = r_k^T r_k .                                          (6.31)

We will now show that r_{k+1} is orthogonal to r_i for i ≤ k. For every i < k,

r_{k+1}^T r_i = ( r_k − α_k A p_k )^T r_i                      (6.32)
             = r_k^T r_i − α_k r_i^T A p_k                     (6.33)
             = 0 − α_k r_i^T A p_k                             (6.34)
             = 0 − α_k ( p_i − β_i p_{i−1} )^T A p_k .          (6.35)

Because p_i and p_{i−1} are both conjugate to p_k with respect to A,

r_{k+1}^T r_i = 0 .                                            (6.36)

For i = k,

r_{k+1}^T r_k = ( r_k − α_k A p_k )^T r_k                      (6.37)
             = r_k^T r_k − α_k ( p_k − β_k p_{k−1} )^T A p_k    (6.38)
             = r_k^T r_k − α_k p_k^T A p_k + α_k β_k p_{k−1}^T A p_k   (6.39)
             = r_k^T r_k − r_k^T r_k + α_k β_k · 0              (6.40)
             = 0 .                                             (6.41)

A similar argument shows that r_{k+1} is orthogonal to p_i for i ≤ k:

r_{k+1}^T p_i = r_{k+1}^T ( r_i + β_i p_{i−1} )                 (6.42)
             = r_{k+1}^T r_i + β_i r_{k+1}^T p_{i−1}            (6.43)
             = 0 + β_i r_{k+1}^T p_{i−1}                        (6.44)
             = β_i ( r_k − α_k A p_k )^T p_{i−1}                (6.45)
             = β_i ( r_k^T p_{i−1} − α_k p_{i−1}^T A p_k )      (6.46)
             = 0 − 0                                           (6.47)
             = 0 .                                             (6.48)

Finally, we show that the new basis vector p_{k+1} is conjugate to p_i for i ≤ k. For i < k,

p_{k+1}^T A p_i = ( r_{k+1} + β_{k+1} p_k )^T A p_i             (6.49)
               = r_{k+1}^T A p_i + β_{k+1} p_k^T A p_i          (6.50)
               = r_{k+1}^T A p_i + 0                            (6.51)
               = r_{k+1}^T (1/α_i) ( r_i − r_{i+1} )            (6.52)
               = (1/α_i) ( r_{k+1}^T r_i − r_{k+1}^T r_{i+1} )  (6.53)
               = 0 .                                            (6.54)

For i = k, using r_{k+1}^T r_k = 0 and r_{k+1}^T p_k = 0,

p_{k+1}^T A p_k = ( r_{k+1} + β_{k+1} p_k )^T (1/α_k) ( r_k − r_{k+1} )                     (6.55)
               = (1/α_k) ( β_{k+1} ( r_k + β_k p_{k−1} )^T r_k − r_{k+1}^T r_{k+1} )        (6.56)
               = (1/α_k) ( β_{k+1} r_k^T r_k + β_{k+1} β_k p_{k−1}^T r_k − r_{k+1}^T r_{k+1} )   (6.57)
               = (1/α_k) ( r_{k+1}^T r_{k+1} + β_{k+1} β_k · 0 − r_{k+1}^T r_{k+1} )        (6.58)
               = 0 .                                            (6.59)

Thus the vectors p_0, p_1, . . . generated by the algorithm are indeed mutually conjugate basis vectors. In theory, the algorithm finds an exact solution to the system of equations in n iterations. In practice, due to round off errors in the

computation, the exact solution may not be obtained in n iterations. In practical

implementations of the algorithm, we iterate until the residual is smaller than

some tolerance that we specify. The algorithm can be summarized as:

Given a symmetric and positive definite system of equations Ax = b and an initial solution x_0, let β_{−1} = 0, p_{−1} = 0, r_0 = b − A x_0, and k = 0. Repeat the following steps until convergence.

1. Let p_k = r_k + β_{k−1} p_{k−1}.

2. Let α_k = ||r_k||_2^2 / ( p_k^T A p_k ).

3. Let x_{k+1} = x_k + α_k p_k.

4. Let r_{k+1} = r_k − α_k A p_k.

5. Let β_k = ||r_{k+1}||_2^2 / ||r_k||_2^2.

6. Let k = k + 1.
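A hedged MATLAB translation of these steps is sketched below; the convergence test on the residual norm and the function name are our additions.

% Hedged MATLAB sketch of the conjugate gradient iteration summarized above.
% A is assumed to be symmetric and positive definite (possibly sparse).
function x = cg_solve(A, b, tol, maxit)
  n = length(b);
  x = zeros(n, 1);
  r = b - A * x;                      % r0 = b - A*x0
  p = zeros(n, 1);
  beta = 0;
  for k = 1:maxit
    p = r + beta * p;                 % step 1
    Ap = A * p;
    alpha = (r' * r) / (p' * Ap);     % step 2
    x = x + alpha * p;                % step 3
    rnew = r - alpha * Ap;            % step 4
    if norm(rnew) < tol
      break
    end
    beta = (rnew' * rnew) / (r' * r); % step 5
    r = rnew;
  end
end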


A major advantage of the CG method is that it requires storage only for the vectors x_k, p_k, r_k and the matrix A. If A is large and sparse, then sparse matrix techniques can be used to store A. Unlike factorization methods (QR, SVD,

Cholesky), there will be no fill in of the zero entries in A. Thus it is possible to

solve extremely large systems (n in the hundreds of thousands) using CG, while

direct factorization would require far too much storage.

6.3 The CGLS Method

The CG method by itself can only be applied to positive definite systems of

equations. It is not directly applicable to the sorts of least squares problems

that we are interested in. However, we can apply the CG method to the normal

equations for a least squares problem. In the CGLS method, we solve a least

squares problem

min kGm − dk2 (6.60)

by applying CG to the normal equations

GT Gm = GT d . (6.61)

One important source of error is the evaluation of the residual, GT Gm − GT d.

It turns out that this calculation is more accurate when we factor out GT and

compute GT (Gm−d). We will use the notation sk = Gmk −d, and rk = GT sk .

Note that we can compute s_{k+1} recursively from s_k as follows:

s_{k+1} = G m_{k+1} − d = G ( m_k + α_k p_k ) − d = s_k + α_k G p_k .

With this trick, we can now state the CGLS algorithm.

Given a system of equations Gm = d, let k = 0, m_0 = 0, p_{−1} = 0, β_{−1} = 0, s_0 = −d, and r_0 = G^T s_0. Repeat the following iterations until convergence.

1. Let p_k = −r_k + β_{k−1} p_{k−1}.

2. Let α_k = ||r_k||_2^2 / ( (G p_k)^T (G p_k) ).

3. Let m_{k+1} = m_k + α_k p_k.

4. Let s_{k+1} = s_k + α_k G p_k.

5. Let r_{k+1} = G^T s_{k+1}.

6. Let β_k = ||r_{k+1}||_2^2 / ||r_k||_2^2.

7. Let k = k + 1.

The CGLS iteration has a property that makes it particularly useful for ill-posed problems. It can be shown [Han98] that, at least in exact arithmetic, ||m_k||_2 increases monotonically, and that ||G m_k − d||_2 decreases

monotonically. We can use the discrepancy principle together with this property

to obtain a regularized solution. Simply stop the CGLS algorithm as soon as

kGmk − dk2 < δ.
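The following MATLAB sketch combines the CGLS steps above with this discrepancy principle stopping rule; it is illustrative only, and assumes G (possibly sparse), d, and a misfit tolerance delta are available.

% Hedged MATLAB sketch of CGLS with discrepancy principle stopping.
function m = cgls_discrep(G, d, delta, maxit)
  [nr, nc] = size(G);
  m = zeros(nc, 1);
  s = -d;                             % s0 = G*m0 - d with m0 = 0
  r = G' * s;                         % r0 = G'*s0
  p = zeros(nc, 1);
  beta = 0;
  for k = 1:maxit
    p = -r + beta * p;                % step 1
    Gp = G * p;
    alpha = (r' * r) / (Gp' * Gp);    % step 2
    m = m + alpha * p;                % step 3
    s = s + alpha * Gp;               % step 4 (s = G*m - d, updated recursively)
    if norm(s) < delta                % stop as soon as ||G*m - d|| < delta
      break
    end
    rnew = G' * s;                    % step 5
    beta = (rnew' * rnew) / (r' * r); % step 6
    r = rnew;
  end
end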

The CGLS algorithm is only applicable to zeroth order regularization. If

we wish to use first or second order regularization, the CGLS algorithm cannot be directly applied. However, it is possible to effectively transform higher order regularization into zeroth order regularization and apply CGLS to the transformed

problem [Han98]. The MATLAB regularization toolbox contains a command,

pcgls, that performs CGLS with this transformation.

Example 6.3

Recall the image deblurring problem that we discussed in Chapter

1 and Chapter 3. Figure 6.8 shows an image that has been blurred

and also has a small amount of added noise.

This image is of size 200 pixels by 200 pixels, so the G matrix for the

blurring operator is of size 40,000 by 40,000. Fortunately, the blur-

ring matrix G is quite sparse, with less than 0.1% nonzero elements.

The matrix requires about 12 megabytes of storage. A dense matrix

of this size would require about 13 gigabytes of storage. Using the

SVD approach to Tikhonov regularization would require far more

storage than most current computers have. However, CGLS works

quite well on this problem.

Figure 6.9 shows the L–curve for the CGLS solution of this prob-

lem. For the first 15 or so iterations, kGm − dk2 decreases quickly.

After that point, the improvement in misfit slows down, while kmk2

increases rapidly. Figure 6.10 shows the CGLS solution after 30 iter-

ations. The blurring has been greatly reduced. Note that 30 iterations is

far less than the size of the matrix, n = 40, 000. Unfortunately, fur-

ther CGLS iterations do not significantly improve the image. In fact,

noise builds up rapidly. Figure 6.11 shows the CGLS solution after

100 iterations. In this image the noise has been greatly amplified,

with little or no improvement in the clarity of the image.

[Figure 6.9: L–curve of ||m|| versus ||Gm − d|| for the CGLS iterations.]

6.4 Exercises

1. Consider the cross well tomography problem of exercise 5.2.

(a) Apply Kaczmarz’s algorithm to this problem.

(b) Apply ART to this problem.

(c) Apply SIRT to this problem.

(d) Comment on the solutions that you obtained.

2. The regularization toolbox command blur computes the system matrix for

the problem of deblurring an image that has been blurred by a Gaussian

point spread function. The file blur.mat contains a particular G matrix

and a data vector d.

(a) How large is the G matrix? How many nonzero entries does it have?

How much storage would be required for the G matrix if all of its en-

tries were nonzero? How much storage would the SVD of G require?

(b) Plot the raw image.

(c) Using CGLS, obtain an estimate of the true image. Plot your solu-

tion.

3. Show that if p0 , p1 , . . ., pn−1 are mutually conjugate with respect to an n

by n symmetric and positive definite matrix A, then the vectors are also

linearly independent. Hint: Use the definition of linear independence.

Iterative methods for tomography problems including Kaczmarz’s algorithm,

ART, and SIRT are discussed in [KS01]. Parallel algorithms based on ART and

SIRT are discussed in [CZ97].

Shewchuk’s technical report [She94] provides a good introduction to the

Conjugate Gradient Method. The CGLS method is discussed in [HH93, Han98].

A popular alternative to CGLS which we have not discussed is the LSQR

method of Paige and Saunders [Han98, PS82b, PS82a]. Berryman has studied

the resolution of LSQR solutions [Ber00a, Ber00b]. In statistics, a procedure

called Partial Least Squares (PLS) is equivalent to LSQR.

The performance of the CG algorithm degrades dramatically on poorly con-

ditioned systems of equations. This is not a particular problem for CGLS, since

we don’t run the iteration until it converges to a least squares solution. However,

in other applications including the numerical solution of PDE’s, higher accuracy

is required. In such situations a technique called preconditioning is used to

improve the performance of CG. Essentially, preconditioning involves a change

of variables x̄ = Cx. The matrix C is selected so that the resulting system

of equations will be better conditioned than the original system of equations

[Dem97, GL96, TB97].


Chapter 7

Other Regularization Techniques

In this chapter we consider several additional techniques for regularizing discrete ill–posed inverse problems. All of the methods discussed in this chapter use prior assumptions about the nature of the correct solution to pick a "good" solution from the collection of solutions which adequately fit the data.

7.1 Using Bounds Constraints

In many physical situations, there are bounds on the maximum and/or minimum

values of model parameters. For example, our model parameters may represent

a physical quantity such as the density of the earth that is inherently non–

negative. This establishes a strict lower bound of 0. We might also declare that

a density greater than 3.5 grams per cubic centimeter would be unrealistic for

a particular problem.

We can try to improve the solution by solving for an L2 norm solution that

includes the constraint that all of the elements of the discretized model, m be

non–negative.

min ||Gm − d||_2
m ≥ 0 .                                                        (7.1)

This non–negative least squares problem can be solved using an algorithm called NNLS that was originally developed by Lawson and Hanson [LH95]. MATLAB includes a command, lsqnonneg, that implements the NNLS algorithm.
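As a minimal illustration (assuming G and d are already in the workspace), the non–negative least squares solution can be computed with a single call:

% Non-negative least squares with MATLAB's built-in solver.
m_nnls = lsqnonneg(G, d);     % solves min ||G*m - d||_2 subject to m >= 0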

In general, given vectors l and u of lower and upper bounds, we can consider


the bounded variables least squares (BVLS) problem

min ||Gm − d||_2
m ≥ l
m ≤ u .                                                        (7.2)

Stark and Parker developed an algorithm for solving the BVLS problem and

implemented their algorithm in Fortran [SP95]. A similar algorithm is given in

the 1995 edition of Lawson and Hanson’s book [LH95].

Given the BVLS algorithm for (7.2), we can perform Tikhonov regulariza-

tion with bounds. The MATLAB regularization toolbox includes a command

tikhcstr for Tikhonov regularization with lower and upper bounds on the model

parameters.
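One hedged way to set this up without the toolbox routine is to apply a bounded least squares solver to the damped (augmented) system of Chapter 5. The sketch below uses lsqlin from the MATLAB Optimization Toolbox purely as an example; G, d, a roughening matrix L (or the identity for zeroth order), a regularization parameter alpha, and bound vectors lb and ub are assumed to be in the workspace.

% Hedged sketch: Tikhonov regularization with bounds via an augmented system.
Gaug = [G; alpha * L];                % stack the data equations and regularization equations
daug = [d; zeros(size(L, 1), 1)];
m_bnd = lsqlin(Gaug, daug, [], [], [], [], lb, ub);   % min ||Gaug*m - daug||_2 with lb <= m <= ub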

A related optimization problem involves minimizing or maximizing a linear

functional of the model subject to bounds constraints and a constraint on the

misfit. This problem can be formulated as

min c^T m
||Gm − d||_2 ≤ δ
m ≥ l
m ≤ u .                                                        (7.3)

This problem can be solved by an algorithm given in Stark and Parker [SP95].

Solutions to this problem can be used to obtain bounds on the maximum and

minimum possible values of model parameters.

Example 7.1

Recall the source history reconstruction problem of Example 3.3.2.

Figure 7.1 shows a hypothetical source history. Figure 7.2 shows the

corresponding samples at T = 300, with noise added.

Figure 7.3 shows the least squares solution. Clearly, some regular-

ization is required. Figure 7.4 shows the nonnegative least squares

solution. This solution is somewhat better, in that concentrations

are all nonnegative. However, the true source history is not ac-

curately reconstructed. Furthermore, the NNLS solution includes

concentrations that are larger than one. Suppose for a moment that

the solubility limit of the contaminant in water is known to be 1.1.

This provides a natural upper bound on the elements of the model

vector. Figure 7.5 shows the corresponding BVLS solution. Further

regularization is required.

Figure 7.6 shows the L–curve for a second order Tikhonov regular-

ization solution with bounds on the model parameters. Figure 7.7

shows the regularized solution for λ = 0.0616. This solution cor-

rectly shows the two major peaks in the input concentration, but


both peaks are slightly broader than they should be. The solution

misses one smaller peak at around t = 150.

We can use (7.3) to establish bounds on the values of the model

parameters. For example, we might want to establish bounds on the

average concentration from t = 125 to t = 150. These concentrations

appear in positions 51 through 60 of the model vector m. We let

ci be zero in positions 1 through 50 and 61 through 100, and let ci

be 0.1 in positions 51 through 60. The solution to (7.3) is then a

bound on the average concentration from t = 125 to t = 150. After

solving the optimization problem, we obtained a lower bound 0.36

and an upper bound 0.73 for the average concentration during this

time period.


7.2 Maximum Entropy Regularization

In maximum entropy regularization, we replace the regularization term used in Tikhonov regularization with Σ_{i=1}^{n} m_i ln(w_i m_i), where the weights w_i can be

adjusted to favor particular types of solutions. Because logarithms of negative

numbers are complex, we restrict our attention to solutions in which the mi are

non-negative.

The term maximum entropy comes from a Bayesian approach to selecting a prior probability distribution, in which we select a probability distribution which maximizes −Σ_{i=1}^{n} p_i ln p_i subject to the constraint that Σ_{i=1}^{n} p_i = 1. The quantity −Σ_{i=1}^{n} p_i ln p_i has the same form as entropy in statistical physics.

Since the model m is not itself a probability distribution, and also because we

have added the effect of weights wi , maximum entropy regularization is distinct

from the maximum entropy approach to computing probability distributions

discussed in Chapter 11.

In the maximum entropy regularization approach, we maximize the entropy

of m subject to a constraint on the size of the misfit kGm − dk.

max −Σ_{i=1}^{n} m_i ln(w_i m_i)
||Gm − d|| ≤ δ
m ≥ 0 .                                                        (7.4)

Using a Lagrange multiplier in the same way that we did with Tikhonov regu-

larization, we can transform this problem into

min ||Gm − d||_2^2 + α^2 Σ_{i=1}^{n} m_i ln(w_i m_i)
m ≥ 0 .                                                        (7.5)

It can be shown that as long as α 6= 0, the objective function in (7.5) is strictly

convex. Thus (7.5) has a unique solution. However, the optimization problem

can become difficult to solve as α approaches 0.

This optimization problem can be solved using a variety of nonlinear op-

timization methods. For example, the regularization toolbox contains a rou-

tine maxent which uses a nonlinear conjugate gradient method to solve the

problem. Our experience has been that this method is somewhat unreliable,

especially when α is small. For the example presented here, we have used a

penalty function approach to deal with the nonnegativity constraints and the

BFGS method to perform the minimization. This approach is more robust than

the nonlinear conjugate gradient method but also somewhat slower.

Figure 7.8 shows the regularization function f (x) = x ln(wx) for three differ-

ent values of w, along with the conventional 0th order Tikhonov regularization

function x2 . Notice that the maximum entropy function is 0 at x = 0, decreases

to a minimum value and then increases. We can find the minimum point by

setting the derivative equal to zero.

f′(x) = ln(wx) + 1 .                                           (7.6)

ln(wx) + 1 = 0. (7.7)

[Figure 7.8: The functions x ln(x), x ln(0.2x), and x ln(5x), compared with x^2.]

x = 1/(ew) .                                                   (7.8)

Thus maximum entropy regularization favors solutions with x values near 1/(ew),

and penalizes solutions with larger or smaller x values. For large values of x,

the function x2 grows faster than x ln wx. Maximum entropy regularization pe-

nalizes solutions with large x values, but not as much as 0th order Tikhonov

regularization. It should be clear that the choice of w, along with the magnitude

of the model parameters can have a significant effect on the maximum entropy

regularization solution.

Maximum entropy regularization is particularly popular in astronomy and

image processing applications [CE85, NN86, SB84]. Here the goal is to recover

a solution which consists of bright spots (stars, galaxies or other astronomical

objects) on a dark background. The non-negativity constraints in maximum

entropy furthermore ensure that the resulting image will not include features

with negative intensities. While conventional Tikhonov regularization tends

to broaden peaks in the solution, maximum entropy regularization may not

penalize sharp peaks as much.

Example 7.2

In this example, we attempt to recover a solution to the Shaw problem. Our target model

will consist of two spikes over a smaller random background inten-

sity. The target model is shown in Figure 7.9.


We will assume that the data errors are normally distributed with

mean 0 and standard deviation 1. Following the discrepancy princi-

ple, we will look for a solution with kGm − dk2 around 4.4.

For this problem, default weights of w = 1 are appropriate, since the

background noise level of about 0.5 is close to the minimum of the

regularization term. We next solved (7.5) for several values of the

regularization parameter α. We found that at α = 0.2, the misfit was

kGm − dk2 = 4.4. The corresponding solution is shown in Figure

7.10. The spike near θ = −0.5 is visible, but the magnitude of the

peak is not correctly estimated. The second spike near θ = 0.7 is

not very well resolved.

For comparison, we also used 0th order Tikhonov regularization with

a nonnegativity constraint to solve the same problem. Figure 7.11

shows the Tikhonov solution. This solution is similar to the solution

produced by maximum entropy regularization.

For this sample problem, maximum entropy regularization had no

advantage over 0th order Tikhonov regularization with non-negativity

constraints. The best maximum entropy solution was compara-

ble to the solution obtained by Tikhonov regularization with non-

negativity constraints. This result is consistent with the results from

a number of sample problems in [AH91]. Amato and Hughes found

that maximum entropy regularization was at best comparable to,

and often worse than Tikhonov regularization with non-negativity

constraints.


7.3 Total Variation

Total variation (TV) is a regularization functional that works well for problems

in which there are discontinuous jumps in the model. The TV regularization

functional is

TV(m) = Σ_{i=1}^{n−1} |m_{i+1} − m_i| .                        (7.9)

It is convenient to write this as

TV(m) = ||Lm||_1                                               (7.10)

where

L = [ −1   1
           −1   1
                ...  ...
                     −1   1 ] .                                (7.11)

In first and second order Tikhonov regularization, discontinuities in the

model are smoothed out and do not show up well in the inverse solution. This

is because smooth transitions are penalized less by the regularization term than

sharp transitions. The particular advantage of TV regularization is that the

regularization term does not penalize discontinuous transitions in the model

any more than smooth transitions.

This approach has been particularly popular for the problem of “denoising”

a model. The denoising problem can be thought of as a linear inverse problem

in which G = I. The goal is to take a noisy data set, and smooth out the

noise, while still retaining long term trends and even sharp discontinuities in

the model. TV regularization can also be used with other linear and nonlinear

inverse problems.

We could use a TV regularization term in place of ||Lm||_2 in the optimization problem

min ||Gm − d||_2^2 + α^2 ||Lm||_1^2 .                           (7.12)

However, this problem is no longer a least squares problem, and the machinery

that we have developed for solving least squares problems (SVD, GSVD, etc.) is

no longer applicable. In fact, (7.12) is a non differentiable optimization problem.

Since we have a 1-norm in the regularization term, it makes sense to switch

to the 1-norm in the data misfit term as well. Consider the problem

min kGm − dk1 + αkLmk1 . (7.13)

We can rewrite this problem as

min || [ G ; αL ] m − [ d ; 0 ] ||_1 .                          (7.14)

This is a 1-norm minimization problem that we can solve easily by using the

iteratively reweighted least squares algorithm discussed in Chapter 2.
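A hedged sketch of this approach is given below: it forms the stacked system of (7.14) and applies a basic IRLS loop, with a small tolerance and a fixed iteration count chosen only for illustration.

% Hedged sketch of solving (7.14) by iteratively reweighted least squares (IRLS).
% G, d, alpha, and the roughening matrix L of (7.11) are assumed to be in the workspace.
A = [G; alpha * L];
b = [d; zeros(size(L, 1), 1)];
m = A \ b;                              % start from the ordinary least squares solution
tol = 1.0e-6;                           % keeps the weights finite
for iter = 1:50
  r = A * m - b;                        % residuals of the stacked system
  W = diag(1 ./ max(abs(r), tol));      % IRLS weights that approximate the 1-norm objective
  m = (A' * W * A) \ (A' * W * b);      % weighted least squares step (fine for small problems)
end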


Per Hansen has suggested another approach which retains the two–norm of the data misfit while incorporating the TV regularization term [Han98]. In Hansen's PP-TSVD method, we begin by using the SVD and the k largest

singular values of G to obtain a rank k approximation to G. This approximation

is

G_k = Σ_{i=1}^{k} s_i U_{·,i} V_{·,i}^T .                       (7.15)

Note that the matrix Gk will be rank deficient. The point of the approximation

is to obtain a matrix with a well defined null space. The vectors V·,k+1 , . . .,

V_{·,n} form a basis for the null space of G_k. We will need this basis later, so let

B_k = [ V_{·,k+1} . . . V_{·,n} ] .                             (7.16)

Using the model basis set [V·,1 . . . V·,k ], the minimum length least squares

solution is, from the SVD,

m_k = Σ_{i=1}^{k} ( U_{·,i}^T d / s_i ) V_{·,i} .               (7.17)

We can modify this solution by adding in any vector in the null space of Gk .

This will increase kmk2 , but have no effect on kGk m − dk.

We can use this to find solutions which minimize some regularization func-

tional, but have minimum misfit. For example, in the modified truncated SVD

(MTSVD) method, we find a model mL,k which minimizes kLmk among those

models which minimize kGk m−dk. Since all models which minimize kGk m−dk

can be written as m = mk − Bk z for some vector z, the MTSVD problem can

be written as

min kL(mk − Bk z)k2 (7.18)

or

min kLBk z − Lmk k2 . (7.19)

This is a least squares problem that can be solved with the SVD, QR factoriza-

tion, or by the normal equations.
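For instance, a hedged MATLAB fragment for computing the MTSVD solution, with the truncation level k assumed given and Bk denoting the null space basis of (7.16), might look like the following:

% Hedged MTSVD sketch: minimize ||L*(mk - Bk*z)||_2 over z and form the solution.
% G, d, L, and the truncation level k are assumed to be available.
[U, S, V] = svd(G);
s = diag(S);
mk = V(:, 1:k) * ((U(:, 1:k)' * d) ./ s(1:k));   % truncated SVD solution (7.17)
Bk = V(:, k+1:end);                              % basis for the null space of Gk, as in (7.16)
z = (L * Bk) \ (L * mk);                         % least squares solution of (7.19)
mLk = mk - Bk * z;                               % MTSVD model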

The PP-TSVD algorithm uses a similar approach. First, we minimize kGk m−

dk2 . Let β be the minimum value of kGk m − dk2 . Instead of minimizing the

two–norm of Lm, we minimize the 1-norm of Lm, subject to the constraint that

m must be a least squares solution.

min ||Lm||_1
||G_k m − d||_2 = β .                                           (7.20)

As in the MTSVD method, the models satisfying ||G_k m − d||_2 = β can be written as m = m_k − B_k z, so the PP-TSVD problem can be reformulated as

min ||L ( m_k − B_k z )||_1                                     (7.21)

or

min kLBk z − Lmk k1 . (7.22)

This is a 1-norm minimization problem which can be solved in many ways,

including IRLS.

Note that in (7.22) the matrix LBk has n − 1 rows and n − k columns.

In general, when we solve this 1-norm minimization problem, we can find a

solution in which n − k of the equations are satisfied exactly. Thus at most

k − 1 entries in L(mk − Bk z) will be nonzero. Since each nonzero entry in

L(mk − Bk z) corresponds to a discontinuity in the model, there will be at most

k −1 discontinuities in the model. Furthermore, the zero entries in L(mk −Bk z)

correspond to points at which the model is constant. For example, if we use

k = 1, we will get a flat model with no discontinuities. For k = 2, we can obtain

a model with two flat sections and one discontinuity, and so on.

The PP-TSVD method can also be extended to piecewise linear functions

and to piecewise higher order polynomials by using a matrix L which approxi-

mates the second or higher order derivatives. The MATLAB function pptsvd,

available from Per Hansen’s web page, implements the PP-TSVD algorithm.

Example 7.3

In this example we consider the Shaw problem with a target model

which consists of a single step function. Our target model is shown

in Figure 7.12. The corresponding data, with noise (normal, 0.001

standard deviation) added is shown in Figure 7.13.

First, we solved the problem with conventional Tikhonov regular-

ization. Figure 7.14 shows the 0th order Tikhonov regularization solution, and Figure 7.15 shows the corresponding second order solution. Both solutions show the discontinuity in the solution as a

smooth transition. The relative error (kmtrue − mk2 /kmtrue k2 ) is

about 9% for the 0th order solution and about 7% for the second

order solution.

Next, we solved the problem by minimizing the L1 misfit with TV

regularization using (7.14). Since the noise level is 0.001, we ex-

pected the two–norm of the misfit to be about 0.0045. Using the

value α = 1.0, we obtained a solution with kGm − dk2 = 0.0040.

This solution is shown in Figure 7.16. This solution is extremely

good, with kmtrue − mk2 /kmtrue k2 < 0.001.

Next, we solved the problem with pptsvd. The PP-TSVD solution

with k = 2 is shown in Figure 7.17. Again, we get a very good

solution, with kmtrue − mk2 /kmtrue k2 < 0.0001.


7.4 Exercises

1. Consider the optimization problem

min c^T m
||Gm − d||_2^2 ≤ δ^2 .                                          (7.23)

Describe how one of the methods discussed in this chapter could be used to solve (7.23).

2. (a) Compute the gradient of the function being minimized in (7.5).

(b) Compute the Hessian of the function in (7.5).

(c) Show that for α 6= 0, the Hessian is positive definite for all m ≥ 0.

Thus the function being minimized is strictly convex, and there is a

unique minimum.

3. In image processing applications of maximum entropy regularization, it is

common to add a constraint to (7.4) that

Σ_{i=1}^{N} m_i = C

for some constant C. Show how this constraint can be incorporated into

(7.5) using a second Lagrange Multiplier.

Methods for bounded variables least squares problems and minimizing a linear

functional subject to a bound on the misfit are given in [SP95]. Some applica-

tions of these techniques can be found in [HSH+ 90, Par91, PZ89, SP87, SP88].

The maximum entropy method is particularly popular in deconvolution and

denoising of astronomical images from radio telescope data. Algorithms for

maximum entropy regularization are discussed in [CE85, NN86, Sau90, SB84].

More general discussions can be found in [AH91, Han98]. Another popular

deconvolution algorithm used in radio astronomy is the “CLEAN” algorithm

[Cla80, Hög74]. Briggs [Bri95] compares the performance of CLEAN, Maximum

Entropy regularization and NNLS.

Vogel [Vog02] has a chapter on numerical methods for total variation regu-

larization. The PP-TSVD method is discussed in [HM96, Han98].



Chapter 8

Fourier Techniques

8.1 Linear Systems in the Time and Frequency Domains

A remarkable feature of linear time-invariant systems is that the forward oper-

ator can be generally described by a particular form of Fredholm integral of the

first kind, a convolution (1.7),

d(t) = ∫_{−∞}^{∞} m(τ) g(t − τ) dτ .                            (8.1)

Inverse problems involving such systems can thus be usefully solved for m(t)

via an inverse or deconvolution operation. Here, the independent variable t is

time and the data d, model, m, and system kernel g are all time functions.

However, the results here are just as applicable to spatial problems and are also

generalizable to higher dimensions. Here, we will briefly overview the essentials

of Fourier theory in the context of performing convolutions and deconvolutions.

Readers are urged to consult one of the many references for a more complete

treatment.

Consider a linear time-invariant operator, G, that converts an unknown

model, m(t), into an observable data function d(t),

d(t) = G[m(t)] .                                                (8.2)

Linearity implies that G obeys superposition,

G[m_1(t) + m_2(t)] = G[m_1(t)] + G[m_2(t)] ,                    (8.3)

and scaling

G[αm(t)] = αG[m(t)] , (8.4)

where α is a scalar. (8.4) also implies that the output of the system is zero when

m(t) = 0.

G[0] = 0 . (8.5)


To show that the operation of any system satisfying (8.3) and (8.4) can be cast in the form of (8.1), we utilize the sifting property of the impulse or delta function, δ(t). The delta function can be conceptualized as the limiting case of a pulse as its width goes to zero, its height goes to infinity, and its area stays constant and equal to one, e.g.,

δ(t) = lim_{τ→0} τ^{−1} Π(t/τ) .                                (8.6)

The sifting property allows us to extract the value of a function at a particular

point from inside of an integral,

∫_a^b f(t) δ(t − t_0) dt                                        (8.7)
   = f(t_0)     a ≤ t_0 ≤ b                                     (8.8)
   = 0          elsewhere                                       (8.9)

for any f (t) continuous at finite t = t0 . The impulse response of a system,

where the model and data are related by the operation G, is defined as the

output produced when the input is a delta function

g(t) = G[δ(t)] . (8.10)

An impulse response is also widely referred to as a system Green’s function

in many applications.

We will now show that all linear, time-invariant system outputs for a given

input are characterizable via the convolution integral (8.1). First, note that any

input signal, m(t), can clearly be written as a summation of impulse functions

through the sifting property

m(t) = ∫_{−∞}^{∞} m(τ) δ(t − τ) dτ .                            (8.11)

Thus, a general linear system response d(t) to an arbitrary input m(t) can be

written as

d(t) = G[ ∫_{−∞}^{∞} m(τ) δ(t − τ) dτ ]                         (8.12)

or, from the fundamental definition of the integral operation

d(t) = G[ lim_{∆τ→0} Σ_{n=−∞}^{∞} m(τ_n) δ(t − τ_n) ∆τ ] .      (8.13)

We can apply superposition to distribute G across the sum and then use the scaling relation, where the m(τ_n) are now weights, to obtain

d(t) = lim_{∆τ→0} Σ_{n=−∞}^{∞} m(τ_n) G[δ(t − τ_n)] ∆τ .        (8.14)

Because the system is time invariant, G[δ(t − τ_n)] = g(t − τ_n), and in the limit the sum becomes the integral

d(t) = ∫_{−∞}^{∞} m(τ) g(t − τ) dτ ,

which is, again, (8.1), the convolution of m(t) and g(t), often abbreviated as d(t) = m(t) ∗ g(t).

The convolution operation thus generally describes the transformation of models to data in any linear, time-invariant, physical process, as well as the

smearing action of a linear measuring instrument. For example, an (unattain-

able) perfect instrument which recorded some m(t) with no distortion whatso-

ever would have a delta function impulse response, perhaps with a time delay

t0 , as

d(t) = m(t) ∗ δ(t − t_0) = ∫_{−∞}^{∞} m(τ) δ(t − t_0 − τ) dτ = m(t − t_0) .   (8.15)

There is a fundamental relationship between the convolution operation and the Fourier transform,

G(f) = F[g(t)] ≡ ∫_{−∞}^{∞} g(t) e^{−ı2πft} dt                   (8.16)

g(t) = F^{−1}[G(f)] ≡ ∫_{−∞}^{∞} G(f) e^{ı2πft} df ,             (8.17)

where F denotes the Fourier transform operator, and F −1 denotes the inverse

Fourier transform. The impulse response g(t) is called the time domain

response, and its Fourier transform, G(f ), is called the spectrum of g(t). For

a system with impulse response g(t), G(f ) is called the frequency domain

response or transfer function of the system. The Fourier transform (8.16)

gives a formula for evaluating the spectrum, and the inverse Fourier transform

(8.17) says that the time domain function g(t) can be exactly reconstructed by a

complex weighted integration of functions of the form eı2πf t , where the weight-

ing is provided by G(f ). The essence of Fourier analysis is thus representing

functions using this particular infinite set of Fourier basis functions, eı2πf t .

It is important to note that, for a real-valued function g(t) the spectrum G(f )

will be complex. |G(f)| is called the spectral amplitude, and the angle that G(f) makes in the complex plane, tan^{−1}(imag(G(f))/real(G(f))), is called

the spectral phase.

Readers should note that in physics and geophysics applications the sign

convention chosen for the complex exponentials in the Fourier transform and its

inverse may be reversed, so that the forward transform (8.16) has a plus sign in

the exponent and the inverse transform (8.17) has a minus sign in the exponent.

This convention change simply causes a complex conjugation in the spectrum

that is reversed when the corresponding inverse transform is applied to return to


convolution theorem the time domain, and is thus simply an alternative spectral phase convention.

An additional convention issue arises as to whether to express frequency in Hertz (f) or radians per second (ω = 2πf). Alternative Fourier transform formulations using ω instead of f have scaling factors of 2π in the forward, reverse, or

both transforms.

Consider the Fourier transform of the convolution of two functions

F[m(t) ∗ g(t)] = ∫_{−∞}^{∞} [ ∫_{−∞}^{∞} m(τ) g(t − τ) dτ ] e^{−ı2πft} dt .   (8.18)

Interchanging the order of integration and substituting ξ = t − τ,

F[m(t) ∗ g(t)] = ∫_{−∞}^{∞} m(τ) [ ∫_{−∞}^{∞} g(t − τ) e^{−ı2πft} dt ] dτ
              = ∫_{−∞}^{∞} m(τ) [ ∫_{−∞}^{∞} g(ξ) e^{−ı2πf(ξ+τ)} dξ ] dτ
              = ∫_{−∞}^{∞} m(τ) e^{−ı2πfτ} dτ ∫_{−∞}^{∞} g(ξ) e^{−ı2πfξ} dξ
              = M(f) G(f) ,                                     (8.19)

which is the convolution theorem. The convolution theorem shows that con-

volution of two functions in the time domain has the simple effect of multiplying

their Fourier transforms the frequency domain. The Fourier transform of the

impulse response, G(f ) = F [g(t)], thus characterizes how M(f ) is altered in

spectral amplitude and phase by the convolution.

To see the implication of the convolution theorem more explicitly, consider

the response of a linear system, characterized by the impulse response g(t) in

the time domain and the transfer function G(f ) in the frequency domain, to

a single model Fourier basis function of frequency f0 , m(t) = eı2πf0 t . The

spectrum of e^{ı2πf_0 t} can be seen to be δ(f − f_0), by examining the corresponding

inverse Fourier transform (8.17)

e^{ı2πf_0 t} = ∫_{−∞}^{∞} δ(f − f_0) e^{ı2πft} df .              (8.20)

The frequency domain response of a linear system to this basis function model

is therefore (8.19)

F[e^{ı2πf_0 t}] G(f) = δ(f − f_0) G(f_0) .                       (8.21)

The corresponding time domain response is thus (8.17)

∫_{−∞}^{∞} G(f_0) δ(f − f_0) e^{ı2πft} df = G(f_0) e^{ı2πf_0 t} .   (8.22)

Linear time-invariant systems thus only map model Fourier basis functions to

data functions at the same frequency, and only alter them in spectral ampli-

tude and phase, with the alteration being characterized by the spectrum of the

system impulse response evaluated at the appropriate frequency. Of particular

interest here is the result that model basis functions at frequencies that are

weakly mapped to the data (frequencies where G(f) is small) may be difficult or impossible to recover in an inverse problem.

The transfer function can be expressed in a particularly useful analytical

form if we can express a linear time-invariant system as a linear differential

equation

a_n d^n y/dt^n + a_{n−1} d^{n−1} y/dt^{n−1} + · · · + a_1 dy/dt + a_0 y =
b_m d^m x/dt^m + b_{m−1} d^{m−1} x/dt^{m−1} + · · · + b_1 dx/dt + b_0 x ,   (8.23)

where the coefficients a_i and b_i are constants. Because each of the terms in (8.23) is linear with constant coefficients (there are no powers or other nonlinear functions

of x, y, or their derivatives), (8.23) expresses a linear time-invariant system

obeying superposition (8.3) and scaling (8.4), because differentiation is itself a

linear operation.

If a system of the form of (8.23) operates on a model of the form m(t) =

eı2πf t , (8.22) indicates that the corresponding predicted data will be d(t) =

G(f )eı2πf t . A time derivative for such a function merely generates a multiplier

of 2πıf . Dividing the resulting equation on both sides by eı2πf t and solving for

G(f ) gives the transfer function

G(f) = D(f)/M(f) = ( Σ_{j=0}^{m} b_j (2πıf)^j ) / ( Σ_{k=0}^{n} a_k (2πıf)^k ) .   (8.24)

Equation (8.24) gives the transfer function of any system that can be written in the form of (8.23). The m complex frequencies, f_z, where the numerator

(and transfer function) is zero are referred to as zeros of G(f ). The predicted

data will be zero for input components at such frequencies, regardless of their

amplitude. Conversely, any model components at such frequencies will have no

influence on the data. Corresponding model basis functions, eı2πfz t , will thus

lie in the model null space and be unrecoverable by any inverse methodology.

Frequencies for which the denominator of (8.24) is zero are called poles. The

predicted data is infinite at such frequencies, so that continuous excitation of

the system at such frequencies leads to an unstable forward problem.
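As a hedged numerical illustration of (8.24) (our own example, not from the text), consider the damped oscillator y'' + 2c y' + w0^2 y = x; its transfer function can be evaluated and plotted as follows:

% Hedged example: transfer function (8.24) of y'' + 2*c*y' + w0^2*y = x.
w0 = 2 * pi * 5;                 % assumed natural frequency (5 Hz)
c = 2;                           % assumed (light) damping coefficient
f = linspace(0, 20, 400);        % frequencies in Hz
s = 2i * pi * f;                 % each time derivative contributes a factor of 2*pi*i*f
G = 1 ./ (w0^2 + 2 * c * s + s.^2);   % G(f) = D(f)/M(f) from (8.24) with b0 = 1
plot(f, abs(G));                 % spectral amplitude; a resonance appears near 5 Hz
xlabel('f (Hz)'); ylabel('|G(f)|');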

8.2 Deconvolution from a Fourier Perspective

In recovering a model, m(t), which has been convolved with some g(t) using

Fourier theory, we wish to recover the basis function spectral amplitude and

phase (the Fourier components) of m by reversing the changes in spectral am-

plitude and phase (8.19) caused by the convolution. As noted above, this may

be difficult or impossible at and near frequencies where the spectral amplitude

of the transfer function, |G(f )| is small (i.e., close to the zero frequencies).


Example 8.1

An illustrative physical example of an ill-posed inverse system is downward continuation (Figure 8.1). Consider a point mass, M, located at the origin, which has a total gravitational field

g_t(r) = (g_x, g_y, g_z) = −Mγ r̂ / ||r||^2 = −Mγ (x x̂ + y ŷ + z ẑ) / ||r||^3   (8.25)

where γ is Newton’s gravitational constant. At a general position

(r = xx̂ + y ŷ + z ẑ), the vertical (ẑ) component of the gravitational

field will be

g_z = ẑ · g_t = −Mγz / ||r||^3 .                                 (8.26)

We note that the integral of gz over the xy plane is a constant with

respect to the observation height, z,

∫_{−∞}^{∞} ∫_{−∞}^{∞} g_z dx dy = −Mγz ∫_{−∞}^{∞} ∫_{−∞}^{∞} dx dy / ||r||^3 .   (8.27)

Converting to polar coordinates,

∫_{−∞}^{∞} ∫_{−∞}^{∞} g_z dx dy = −2πMγz ∫_0^∞ ρ dρ / (z^2 + ρ^2)^{3/2}
                               = −2πMγz [ −1/(z^2 + ρ^2)^{1/2} ]_0^∞
                               = −2πMγ ,                         (8.28)

which is independent of the plane height z.

If we consider the vertical field at z = 0+ from a point mass, we

thus have a delta function singularity at the origin with a magnitude

given by (8.26), as the field has no vertical component except exactly at the origin. We can thus write the vertical field an infinitesimal distance above the xy plane as

gz |z=0+ = −2πM γδ(x, y) . (8.29)

Now suppose that gravity data is collected from a ship or airplane located at height h. The vertical field observed at z = h will be

g_z(h) = h / ( 2π (x^2 + y^2 + h^2)^{3/2} )                      (8.30)

when normalized by the magnitude of the delta function at z = 0, −2πMγ. As a consequence of

the vertical gravitational field obeying superposition and linearity,

this can be recognized as a linear problem, where (8.30) is the impulse

response for observations on the plane z = h. Vertical field measure-

ments obtained at z = h can thus be expressed by a 2-dimensional

convolution operation in the xy plane

g_z(h, x, y) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g_z(0, s_1 − x, s_2 − y) · h / ( 2π (s_1^2 + s_2^2 + h^2)^{3/2} ) ds_1 ds_2 .   (8.31)

We can examine the effects of the convolution process, and address

the inverse problem of downward continuation of the vertical field

at z = h back to z = 0, by examining the transfer function cor-

responding to (8.30), which will reveal how sensitive observations

made at z = h are to the field values at z = 0 as a function of

spatial frequency, k. The Fourier transform of the impulse response

is the transfer function

G(k_x, k_y) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} h e^{−ı2πk_x x} e^{−ı2πk_y y} / ( 2π (x^2 + y^2 + h^2)^{3/2} ) dx dy
            = e^{−2πh (k_x^2 + k_y^2)^{1/2}} ,                    (8.32)

which is the upward continuation filter. In this case the model

and data basis functions are of the form eı2πkx x , and eı2πky y , where

kx,y are spatial frequencies (analogous to f in temporal problems).

As h increases, (8.32) indicates that higher frequency (larger k) spa-

tial basis functions of the true seafloor gravity signal are exponen-

tially attenuated to greater and greater degrees. Such an operation

is thus referred to as a low–pass filter. Predicted observations

thus become smoother with increasing h.


To recover the field at z = 0 from measurements made at z = h, we must devise an inverse filter to reverse the smoothing effect of upward continuation. An obvious candidate is the reciprocal of (8.32),

[G(k_x, k_y)]^{−1} = e^{2πh (k_x^2 + k_y^2)^{1/2}} .              (8.33)

(8.33) is the transfer function of the deconvolution that undoes the convolution of the forward problem. However, (8.33) has a response

that grows without bound as the spatial frequencies kx and ky be-

come large. Thus, small high-spatial-frequency components in the

data acquired at (h > 0) may leverage enormous inferred variations

in seafloor gravity field (h = 0). If these high-frequency amplitudes

of the Fourier basis functions in the data are noisy, the inverse op-

eration will be very unstable.

This type of instability is typical of such problems. This means that solutions obtained using the Fourier

methodology will have to be regularized in many cases to produce stable and

meaningful results.
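The following hedged MATLAB fragment (with an arbitrary observation height and grid spacing of our choosing) evaluates the upward continuation filter (8.32) and its reciprocal (8.33), making the instability of naive downward continuation explicit:

% Hedged numerical illustration of the upward continuation filter (8.32) and its inverse (8.33).
h = 100;                       % observation height in meters (assumed value)
dx = 10; n = 256;              % sample interval and grid size (assumed values)
kx = (-n/2 : n/2 - 1) / (n * dx);          % spatial frequencies in cycles per meter
[KX, KY] = meshgrid(kx, kx);
K = sqrt(KX.^2 + KY.^2);
Gup = exp(-2 * pi * h * K);    % upward continuation filter (8.32): a low-pass filter
Gdown = 1 ./ Gup;              % naive downward continuation (8.33): grows without bound
max(Gdown(:))                  % even a modest height makes the largest gain enormous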

8.3 Linear Systems in Discrete Time

We can often readily approximate the continuous time transforms (8.16) and

(8.17) with a corresponding discrete operation, the discrete Fourier trans-

form, or DFT. The discrete Fourier Transform operates on time series, e.g.,

sequences mj consisting of equally time-spaced, sequential values of m(t). The

rate, fs , at which this sampling occurs is called the sampling rate. The for-

ward discrete Fourier transform is

M_k = DFT[m]_k ≡ Σ_{j=0}^{n−1} m_j e^{−2πıjk/n}                  (8.34)

and the inverse discrete Fourier transform is

m_j = DFT^{−1}[M]_j ≡ (1/n) Σ_{k=0}^{n−1} M_k e^{2πıjk/n} .      (8.35)

(8.34) and (8.35) use the common conventions that the indexing ranges from 0

to n − 1, and also use the same complex exponential sign convention as (8.16)

and (8.17). (8.35) shows that a model time series can be expressed as a linear

combination of the n basis functions e2πıjk/n , where the complex coefficients are

the discrete spectral values Mk . The DFT is also frequently referred to as the

FFT because a particularly efficient algorithm, the fast Fourier transform,


Figure 8.2: Frequency and index mapping describing the DFT of a real-valued series (n = 16) sampled at a sampling rate f_s. For a real-valued series, the n complex elements of the DFT exhibit even spectral amplitude symmetry and odd spectral phase symmetry about the index n/2. Indices less than n/2 correspond to the zero and positive frequencies 0, f_s/n, 2f_s/n, . . . , f_s/2 − f_s/n. Indices greater than n/2 correspond to the negative frequencies −f_s/2 + f_s/n, −f_s/2 + 2f_s/n, . . . , −f_s/n. Index n/2 corresponds to the frequency at half the sampling rate, ±f_s/2. For the DFT to adequately represent the spectrum of an assumed periodic time series, f_s must be greater than or equal to the Nyquist frequency (8.36).


Figure 8.2 shows the frequency and index mapping for an n-length DFT.

The DFT spectral values, M_k (8.34), are complex and discrete, with positive frequencies assigned to indices 1 through n/2 − 1, and negative frequencies assigned to indices n/2 through n − 1. The k = 0 term is the sum of the m_j, or n times the average series value. There is an implicit assumption in the DFT that the underlying time series, m_j, is periodic over n terms. If m(t) is real valued, substitution of n − k for k in (8.34) shows that the DFT has Hermitian symmetry, where M_k = M^*_{n−k}. Because of this complex conjugate symmetry about k = n/2, the spectral amplitude, |M_k|, is symmetric and the spectral phase is antisymmetric with respect to k. For this reason it is customary to only plot the positive frequency spectral amplitude and phase for the spectrum of a real signal.
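As a brief illustration of these conventions, the following MATLAB/Octave sketch evaluates (8.34) with the built-in fft function and checks the Hermitian symmetry M_k = M^*_{n−k} for a real-valued series. The test series, sampling rate, and variable names are assumptions made purely for illustration.

% Sketch: DFT conventions of (8.34)-(8.35) using MATLAB's fft/ifft.
% The test series m and sampling rate fs are assumed for illustration only.
n  = 16;                                   % series length (as in Figure 8.2)
fs = 2;                                    % assumed sampling rate (Hz)
t  = (0:n-1)'/fs;                          % sample times
m  = sin(2*pi*0.25*t) + 0.1*randn(n,1);    % example real-valued series

M  = fft(m);                               % forward DFT; fft uses the sign convention of (8.34)
m2 = ifft(M);                              % inverse DFT (8.35); recovers m to roundoff

% Hermitian symmetry for a real series: M_k = conj(M_{n-k}), k = 1,...,n-1
k = (1:n-1)';
symmetryError = max(abs(M(k+1) - conj(M(n-k+1))));   % should be near machine precision

% Frequencies associated with DFT indices 0,...,n-1 (compare Figure 8.2)
f = [0:n/2, -n/2+1:-1]'*fs/n;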

For a sampled time series to accurately convey information about a continuous function, it can be shown that the continuous real-world function must be sampled at a rate f_s that is at least twice the highest frequency, f_{max}, at which there is appreciable energy, i.e., at or above the Nyquist frequency,

f_s \ge f_N = 2 f_{max} .   (8.36)

Should the condition (8.36) not be met, a usually irreversible nonlinear distortion called aliasing will occur, where energy at frequencies higher than f_s/2 will appear in the spectrum to lie at frequencies below f_s/2.

The discrete convolution of two sampled time series with equal sampling intervals ∆t = 1/f_s can be performed in two ways. The first of these is serial fashion,

d_j = \sum_{i=0}^{n-1} m_i\, g_{j-i}\, \Delta t ,   (8.37)

where we assume that the shorter of the two time series, of length m, is zero for all indices greater than m − 1. In serial convolution we get a result with at most n + m − 1 nonzero terms.

The second type of discrete convolution, circular convolution, can be applied only to two time series of equal length (n = m). If the time series are of different lengths, the lengths can be equalized by padding the shorter of the two with zeros. The result of a circular convolution is as if we had joined each time series to its tail and convolved them in circular fashion. Applying the convolution theorem,

d = DFT^{-1}[\,DFT[m] \cdot DFT[g]\,]\,\Delta t ,   (8.38)

where · indicates element-by-element multiplication of the two spectra (not the vector dot product), produces the circular convolution. To avoid wrap-around effects and thus obtain a result that is indistinguishable from the serial convolution (8.37), it may be necessary to pad both m and g with up to n zeros and to perform (8.38) on series of length up to 2n. Because of the factoring strategy used in the FFT algorithm, it is also desirable to pad m and g up to lengths that are powers of two, or at least highly composite numbers.
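A minimal MATLAB/Octave sketch of this equivalence between serial and zero-padded circular convolution is shown below; the short test series and the sampling interval are assumed purely for illustration.

% Sketch: serial versus zero-padded circular convolution, (8.37)-(8.38).
% The test series and dt are assumed for illustration.
dt = 0.5;                               % sampling interval (s)
m  = [1 2 0 -1]';                       % model series (n = 4)
g  = [1 1 0.5]';                        % impulse response (m = 3)

dSerial = conv(m, g)*dt;                % serial convolution, n + m - 1 = 6 terms

N = length(m) + length(g) - 1;          % pad both series to avoid wrap-around
dCirc = ifft(fft(m, N).*fft(g, N))*dt;  % convolution theorem (8.38)

maxDifference = max(abs(dSerial - dCirc));   % agrees to roundoff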

Consider the case where we have a theoretically known, or accurately estimated, system impulse response, g(t), convolved with an unknown model, m(t). We note in passing that, although we will examine the deconvolution problem in 1 dimension for simplicity, the results are generalizable to higher dimensions. The forward problem is

d(t) = \int_a^b g(t - \tau)\, m(\tau)\, d\tau .   (8.39)

Discretizing (8.39) using the sampling interval, ∆t = 1/f_s, gives

d = Gm ,   (8.40)

where d and m are appropriate-length time series (sampled approximations of d(t) and m(t)), and G is a matrix with rows that are padded, time-reversed, sampled impulse response vectors (Figure 4.4). This time domain representation of the forward problem was previously examined in Example 4.5.

Taking the Fourier perspective, an inverse solution can be obtained by first padding d and g appropriately with zeros so that they are of equal and sufficient length to avoid wrap-around artifacts associated with circular convolution. Evaluating the discrete Fourier transforms of both vectors, the convolution theorem (8.19) allows us to cast the problem as a complex-valued linear system

D = G\, M\, \Delta t ,   (8.42)

where G is a diagonal matrix whose elements are the discrete Fourier transform values of the impulse response,

G_{i,i} = G_i ,   (8.43)

D is the discrete Fourier transform of the data vector, d, and M is the discrete Fourier transform of the model vector, m.

(8.42) suggests a straightforward solution by spectral division,

m = DFT^{-1}[\,G^{-1} D\,] ,   (8.44)

which, combined with the efficient FFT implementation of the DFT, reduces the necessary computational effort from solving a potentially very large linear system of time-domain equations (8.40) to just three n-length DFT operations and n complex divisions. If d and g are real, packing/unpacking algorithms exist that allow the DFT operations to be further reduced to operations involving complex vectors of length n/2.
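A minimal MATLAB/Octave sketch of deconvolution by spectral division, assuming zero-padded, equal-length vectors d and g (the names are illustrative only), is:

% Sketch: deconvolution by spectral division (8.44).
% Assumes d and g have already been zero-padded to a common length; any
% dt scaling from the discretization (8.42) can be applied to the quotient.
D = fft(d);            % spectrum of the data
G = fft(g);            % transfer function (spectrum of the impulse response)
M = D./G;              % element-by-element spectral division
m = real(ifft(M));     % recovered model (imaginary part is roundoff only)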


However, (8.44) does not avoid instability associated with deconvolution simply by transforming the problem into the frequency domain, because reciprocals of any very small elements of G will dominate the diagonal of G^{-1}. (8.44) will thus frequently require regularization to be useful.

8.4 Water Level Regularization

A simple and widely-applied method of regularizing spectral division is water level regularization. The water level strategy employs a modified G matrix, G_w, in the spectral division, where

G_{w,i,i} = G_{i,i}   (|G_{i,i}| > w)
G_{w,i,i} = w\,G_{i,i}/|G_{i,i}|   (0 < |G_{i,i}| \le w)
G_{w,i,i} = w   (G_{i,i} = 0) .

The water-level damped model estimate is then

m_w = DFT^{-1}[\,G_w^{-1} D\,] .   (8.45)

The colorful name for this technique arises from the analogy of pouring water

into the holes of the Fourier transform of g until the spectral amplitude level

there reaches w. The effect of the water level is to prevent unregularized spec-

tral division at frequencies where the spectral amplitude of the system transfer

function is small, and thus prevent undesirable noise amplification.

An optimal water level value w will reduce the sensitivity to noise in the inverse solution while still recovering important model features. As is typical of the regularization process, it is possible to choose a "best" solution by assessing the tradeoff between the forward model residual and the model norm as the regularization parameter w is varied. A useful property in evaluating data misfit and model length for calculations in the frequency domain is that the 2–norm of a Fourier transform vector (defined for complex vectors as the square root of the sum of the squared complex element amplitudes) is proportional to the 2–norm of the corresponding time-domain vector (the specific constant of proportionality depends on the DFT conventions used). One can thus easily evaluate 2–norm tradeoff metrics in the frequency domain without calculating inverse Fourier transforms. The 2–norm of the water level-regularized solution, m_w, will thus decrease monotonically as w increases because |G_{w,i,i}| ≥ |G_{i,i}|.
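A minimal MATLAB/Octave sketch of water level regularized deconvolution, assuming padded equal-length vectors d and g and an assumed water level value w (all names are illustrative), is:

% Sketch: water level regularized spectral division (8.45).
% Assumes d and g are zero-padded to a common length; w is the water level.
D  = fft(d);
G  = fft(g);
Gw = G;
small = abs(G) < w;                           % spectral "holes" below the water level
Gw(small & G ~= 0) = w*G(small & G ~= 0)./abs(G(small & G ~= 0));  % raise amplitude, keep phase
Gw(G == 0) = w;                               % zero elements are set to w
Mw = D./Gw;                                   % regularized spectral division
mw = real(ifft(Mw));                          % regularized model estimate

% Tradeoff quantities can be evaluated directly in the frequency domain
% (up to a constant that depends on the DFT conventions used):
misfitNorm = norm(G.*Mw - D)/sqrt(length(d)); % proportional to the residual 2-norm
modelNorm  = norm(Mw)/sqrt(length(d));        % proportional to the model 2-norm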

Example 8.2

Revisiting the time-domain seismometer deconvolution example, which was regularized in Chapter 4 using SVD truncation (Example 4.5), we investigate a frequency-domain solution regularized via the water level technique. The impulse response, true model, and noisy data for this example are plotted in Figures 4.3, 4.6, and 4.9, respectively. We first pad the n = 210 point data and impulse response vectors with zeros and apply the fast Fourier transform to both to obtain the corresponding discrete spectra.



Figure 8.3: Spectral amplitudes of the impulse response, noise-free, and noisy data vectors for the seismometer deconvolution problem. Spectra range in frequency from zero (not shown) to half of the sampling frequency (1 Hz). Because spectral amplitudes for real-valued time series are symmetric with frequency, spectra are shown only for f = k f_s/n > 0.

The spectral elements are complex, but we can examine the element magnitudes (which are critical when assessing the stability of the spectral division solution) by plotting the amplitudes of the impulse response, noise-free data, and data with noise added (Figure 8.3).

Examining the impulse response spectral amplitude, |G_k|, we note that it decreases by approximately three orders of magnitude between low frequencies and 1 Hz. The convolution theorem (8.19) tells us that |G| describes the spectral amplitude shaping that the forward convolution will impose on a general model in mapping it to the data. Thus, the convolution of a general signal, consisting of a range of frequencies, in this system will result in increased attenuation of higher frequencies, which is the hallmark of a low–pass filter. Figure 8.3 also shows that the spectral amplitudes of the noise-free data fall off even more quickly than those of the impulse response. This means that spectral division will be a stable process for noise-free data in this problem, and that the quotient will indeed converge to numerical accuracy to the Fourier transform of the true model.



Figure 8.4: Spectral amplitudes of the Fourier transform of the noisy data

divided by the transfer function (the Fourier transform of the impulse response).

This spectrum is dominated by amplified noise at frequencies above about 0.1

Hz and is the Fourier transform amplitude of Figure 4.10.

However, Figure 8.3 also shows that the spectral amplitudes of the noisy data dominate the signal at frequencies higher than ≈ 0.1 Hz. Because of the small values of G_k at these frequencies, the spectral division solution will result in a model that is dominated by noise, exactly as in the time-domain solution (Figure 4.10). Figure 8.4 shows the amplitude spectrum resulting from this spectral division.

To regularize the spectral division solution, an optimal water level is

needed. Examining Figure 8.3 it is readily observed that the value

of w that will deconvolve the portion of the data spectrum that is

unobscured by noise while suppressing the amplification of higher

frequency noise must be close to unity. Such a spectral observation

might be more difficult for real data with more complex spectra,

however. A more adaptable way to select w is to construct a tradeoff

curve between model smoothness and data misfit by trying a range

of water level values. Figure 8.5 shows the curve for this synthetic

example, showing that the optimum w is close to 3. Figure 8.6 shows

a corresponding range of solutions, and Figure 8.7 shows the solution

for w = 3.16.

The solution of Figure 8.7, chosen from the corner of the tradeoff curve of Figure 8.5, shows the resolution reduction characteristic of regularized solutions, manifested in reduced amplitude and increased width of the recovered model features relative to the true model.



Figure 8.5: Tradeoff curve between model 2–norm and data 2–norm misfit for

a range of water level values.


Figure 8.6: Models corresponding to a range of water level values and used to

construct Figure 8.5. Dashed curves show the true model (Figure 4.6).



Figure 8.7: Model corresponding to w = 3.16 (Figure 8.5; Figure 8.6). Dashed

curves show the true model (Figure 4.6).

A key advantage of the Fourier methodology is that it provides a new set of basis functions (the complex exponentials) that have the remarkable property (shown by the convolution theorem) of passing through a linear system altered only in phase and amplitude. Spectral plots such as Figures 8.3 and 8.4 can thus be used to understand which frequency components will have stability issues in the inverse solution. The information contained in the spectral plot of Figure 8.3 is thus analogous to that obtained with a Picard plot in the context of the SVD. The Fourier perspective also provides a link between inverse theory and the vast field of linear filtering. In the context of Example 8.2, the deconvolution problem is seen to be identical to the problem of finding an optimal high–pass filter to recover the model while suppressing the influence of noise. The Fourier methodology is also spectacularly computationally efficient relative to time domain deconvolution methods. This efficiency can become critically important when larger and/or higher dimensional models are of interest, a great many deconvolutions must be performed, or speed is essential.


8.5 Exercises

1. Consider regularized deconvolution as the solution to a pth order Tikhonov regularized system

\begin{bmatrix} d \\ O \end{bmatrix} = \begin{bmatrix} G \\ \alpha L_p \end{bmatrix} m ,

which is, for example, for p = 0,

\begin{bmatrix} d \\ O \end{bmatrix} =
\begin{bmatrix}
h^T & 0 & 0 & \cdots & 0 \\
0 & h^T & 0 & \cdots & 0 \\
0 & 0 & h^T & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & h^T \\
\alpha & 0 & 0 & \cdots & 0 \\
0 & \alpha & 0 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & \alpha
\end{bmatrix} m ,   (8.46)

where h is the time-reversed system impulse response of length n/2, and O is an n-length zero vector.

a) Show that summing the two constraint systems of (8.46) and solving the resultant n by n system via spectral division is equivalent to solving a modified water level damping system where G_{w,i,i} = G_{i,i} + α.

b) Show that a corresponding pth order Tikhonov-regularized deconvolution solution is equivalent to solving the modified water level damping system where G_{w,i,i} = G_{i,i} + (2\pi\imath f_i)^p α.

Hint: Apply the convolution theorem and note that the Fourier transform of dg(t)/dt is 2\pi\imath f times the Fourier transform of g(t).

2. A displacement seismogram is observed from a large earthquake at a far-field seismic station, from which the source region can be approximated as a point. A much smaller aftershock from the main shock region is used as an empirical Green's function for this event. It is supposed that the observed signal from the large event should be approximately equal to the convolution of the main shock's rupture history with this empirical Green's function.

The 256 point seismogram is in the file seis.mat. The impulse response of the seismometer is in the file impresp.mat.

a) Deconvolve the impulse response from the observed mainshock seismogram using water-level-damped deconvolution to solve for the source time function (a pulse or set of pulses). Estimate the source duration in samples and assess any evidence for subevents and their relative durations and amplitudes. Approximately what water level do you believe is best for this data set? Why?

b) Recast the problem as a linear inverse system, as described in the

example for Chapter 4, and solve the system for a least-squares source

time function.

c) Are the results in (b) better or worse than in (a)? Why?

8.6 Notes and Further Reading

Because of its tremendous utility, there are many, many resources on Fourier applications in the physical sciences, engineering, and pure mathematics. A basic text covering theory and some applications at the approximate level of this text is [Bra00]. An advanced text on the topic is [Pri83]. Kak and Slaney give an extensive treatment of Fourier based methods for tomographic imaging. Vogel [Vog02] discusses Fourier transform based methods for image deblurring.

Chapter 9

Nonlinear Regression

Nonlinear regression problems are similar to the linear regression problems that we considered in Chapter 2, except that now the relationship between the model parameters and the data can be nonlinear.

Specifically, we consider the problem of fitting n parameters m_1, m_2, . . ., m_n to a data set with N data points (x_1, d_1), (x_2, d_2), . . ., (x_N, d_N), with associated measurement standard deviations σ_1, σ_2, . . ., σ_N. There is a nonlinear function G(m, x) which, given the model parameters m and a point x_i, predicts an observation d_i. Our goal is to find the values of the parameters m which best fit the data.

As with linear regression, if we assume that the measurement errors are normally distributed, then the maximum likelihood principle leads us to minimizing the sum of squared errors. The function to be minimized can be written as

f(m) = \sum_{i=1}^{N} \left( \frac{G(m, x_i) - d_i}{\sigma_i} \right)^2 .   (9.1)

In this chapter we will discuss methods for minimizing f (m) and the statis-

tical analysis of the nonlinear regression fit.

9.1 Newton's Method

Consider a nonlinear system of equations

F(x) = d ,   (9.2)

where x is a vector of m unknowns and d is a vector of m known values. We will construct a sequence of vectors x^0, x^1, . . . which will converge to a solution x^∗ of the system of equations.

If we assume that F is continuously differentiable, we can construct a simple Taylor series approximation

F(x^0 + s) ≈ F(x^0) + ∇F(x^0)\, s ,   (9.3)

where ∇F(x^0) is the Jacobian matrix

∇F(x^0) = \begin{bmatrix} \frac{\partial F_1(x^0)}{\partial x_1} & \cdots & \frac{\partial F_1(x^0)}{\partial x_m} \\ \vdots & \ddots & \vdots \\ \frac{\partial F_m(x^0)}{\partial x_1} & \cdots & \frac{\partial F_m(x^0)}{\partial x_m} \end{bmatrix} .   (9.4)

Using this approximation, we can obtain an approximate equation for the difference between x^0 and the unknown x^∗. Setting F(x^0 + s) = d in (9.3) gives

∇F(x^0)\,(x^∗ − x^0) ≈ d − F(x^0) ,

or

x^∗ − x^0 ≈ ∇F(x^0)^{-1}\,(d − F(x^0)) .   (9.7)

This leads to Newton's method: given a system of equations F(x) = d and an initial solution x^0, use the formula

x^{i+1} = x^i + ∇F(x^i)^{-1}\,(d − F(x^i))

to compute a sequence of solutions x^1, x^2, . . .. Stop if and when the sequence converges to an acceptable solution.
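A minimal MATLAB/Octave sketch of this iteration is given below. The function handles F and gradF, the starting vector x0, and the stopping tolerance are assumptions made for illustration only.

% Sketch: Newton's method for F(x) = d.
% F and gradF are assumed user-supplied function handles; x0 is the initial guess.
x = x0;
for iter = 1:50
    s = gradF(x) \ (d - F(x));           % solve grad F(x) * s = d - F(x) for the step
    x = x + s;
    if norm(s) < 1e-8*(1 + norm(x))      % assumed stopping tolerance
        break
    end
end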

The theoretical properties of Newton's method are summarized in the following theorem. For a proof, see [DS96].

Theorem 9.1
If F(x) is continuously differentiable, x^0 is sufficiently close to a solution x^∗ with F(x^∗) = d, ∇F satisfies a Lipschitz condition in a neighborhood of x^∗, and ∇F(x^∗) is nonsingular, then Newton's method will converge to x^∗. The convergence rate is quadratic in the sense that there is a constant c such that for large n,

‖x^{n+1} − x^∗‖_2 ≤ c\, ‖x^n − x^∗‖_2^2 .

Roughly speaking, quadratic convergence means that the number of accurate digits in the solution doubles at each iteration. Unfortunately, if the hypotheses of the above theorem are not satisfied, then Newton's method can converge very slowly or simply fail to converge to any solution.


A simple modification to the basic Newton's method algorithm often helps with convergence problems. In the damped Newton method, we use the Newton's method equations at each iteration to compute a direction in which to move. However, instead of simply taking the full step s, we search along the line from x^i to x^i + s for a point which minimizes ‖F(x^i + αs) − d‖_2, and take the step which minimizes the norm.

Now suppose that we have a scalar valued function f(x), and want to minimize f. If we assume that f is twice continuously differentiable, then we have the Taylor series approximation

f(x^0 + s) ≈ f(x^0) + ∇f(x^0)^T s + \frac{1}{2} s^T ∇^2 f(x^0)\, s ,   (9.8)

where ∇f(x^0) is the gradient of f,

∇f(x^0) = \begin{bmatrix} \frac{\partial f(x^0)}{\partial x_1} \\ \vdots \\ \frac{\partial f(x^0)}{\partial x_m} \end{bmatrix} ,   (9.9)

and ∇^2 f(x^0) is the Hessian of f,

∇^2 f(x^0) = \begin{bmatrix} \frac{\partial^2 f(x^0)}{\partial x_1^2} & \cdots & \frac{\partial^2 f(x^0)}{\partial x_1 \partial x_m} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f(x^0)}{\partial x_m \partial x_1} & \cdots & \frac{\partial^2 f(x^0)}{\partial x_m^2} \end{bmatrix} .   (9.10)

A necessary condition for a minimum is ∇f(x) = 0. Near x^0, we can approximate the gradient of f by

∇f(x^0 + s) ≈ ∇f(x^0) + ∇^2 f(x^0)\, s .

Setting this approximation equal to zero gives the system of equations ∇^2 f(x^0)\, s = −∇f(x^0). Solving this system of equations for a step s leads to Newton's method for minimizing f(x).

Given a twice continuously differentiable function f(x), and an initial guess x^0, use the formula

x^{i+1} = x^i − \left[ ∇^2 f(x^i) \right]^{-1} ∇f(x^i)

to compute a sequence of solutions x^1, x^2, . . .. Stop if and when the sequence converges to a solution with ∇f(x^∗) = 0.


The theoretical properties of Newton’s method for minimizing f (x) are sum-

marized in the following theorem. Since Newton’s method for minimizing f (x)

is just Newton’s method for solving a nonlinear system of equations applied to

∇f (x) = 0, the proof follows immediately from the proof of theorem 9.1.

Theorem 9.2

If f is twice continuously differentiable in a neighborhood of a local

minimizer x∗ , there is a constant λ such that k∇2 f (x)−∇2 f (y)k2 ≤

λkx − yk2 in the neighborhood, ∇2 f (x∗ ) is positive definite, and x0

is close enough to x∗ , then Newton’s method will converge quadrat-

ically to x∗ .

Newton's method for minimizing f(x) is very efficient when it works, but the method can also fail to converge. Because of its speed, Newton's method is the basis for most methods of nonlinear optimization. Various modifications to the basic method are used to ensure convergence to a local minimum of f(x).

9.2 The Gauss–Newton and Levenberg–Marquardt Methods

In this section we will consider variations of Newton's method that have been adapted to the nonlinear regression problem. Recall that our objective in nonlinear regression is to minimize the sum of squared errors,

f(m) = \sum_{i=1}^{N} \left( \frac{G(m, x_i) - d_i}{\sigma_i} \right)^2 .   (9.14)

It is convenient to define the scaled residual functions

f_i(m) = \frac{G(m, x_i) - d_i}{\sigma_i}   (9.15)

and

F(m) = \begin{bmatrix} f_1(m) \\ \vdots \\ f_N(m) \end{bmatrix} .   (9.16)

The gradient of f(m) can be expressed in terms of the individual terms in the sum of squares,

∇f(m) = \sum_{i=1}^{N} ∇\left( f_i(m)^2 \right)   (9.17)
= \sum_{i=1}^{N} 2\, f_i(m)\, ∇f_i(m) .   (9.18)

This formula can be simplified by using matrix notation to

∇f(m) = 2\, J(m)^T F(m) ,   (9.19)

where J(m) is the Jacobian matrix

J(m) = \begin{bmatrix} \frac{\partial f_1(m)}{\partial m_1} & \cdots & \frac{\partial f_1(m)}{\partial m_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_N(m)}{\partial m_1} & \cdots & \frac{\partial f_N(m)}{\partial m_n} \end{bmatrix} .   (9.20)

Similarly, the Hessian of f(m) can be expressed in terms of the Hessians of the individual functions,

∇^2 f(m) = \sum_{i=1}^{N} ∇^2\left( f_i(m)^2 \right)   (9.21)
= \sum_{i=1}^{N} H^i(m) .   (9.22)

Here H^i(m) is the Hessian of f_i(m)^2. The j, k element of this Hessian matrix is given by

H^i_{j,k}(m) = \frac{\partial^2 (f_i(m)^2)}{\partial m_j\, \partial m_k}   (9.23)
= \frac{\partial}{\partial m_j}\left( 2 f_i(m) \frac{\partial f_i(m)}{\partial m_k} \right)   (9.24)
= 2\left( \frac{\partial f_i(m)}{\partial m_j}\frac{\partial f_i(m)}{\partial m_k} + f_i(m)\frac{\partial^2 f_i(m)}{\partial m_j\, \partial m_k} \right) .   (9.25)

Thus

∇^2 f(m) = 2\, J(m)^T J(m) + Q(m) ,   (9.26)

where

Q(m) = 2 \sum_{i=1}^{N} f_i(m)\, ∇^2 f_i(m) .   (9.27)

In the Gauss–Newton (G–N) method, we approximate the Hessian by dropping the Q(m) term,

∇^2 f(m) ≈ 2\, J(m)^T J(m) .

In the context of nonlinear regression, we typically expect that the f_i(m) terms will be reasonably small as we approach the optimal parameters m^∗, so that this is a reasonable approximation. Specialized methods are available for problems with large residuals in which this approximation is not justified.

Using our approximation, the equations for Newton's method are

J(m^k)^T J(m^k)\,(m^{k+1} − m^k) = −J(m^k)^T F(m^k) .   (9.30)

Because the Gauss–Newton method is based on Newton’s method, it can

fail for all of the same reasons as Newton’s method. Furthermore, because we

have introduced an additional approximation, the Gauss–Newton method can

also fail when Q(m) is not small. However, the Gauss–Newton method often works well in practice.

In the Levenberg–Marquardt (L–M) method, we modify the Gauss–Newton method equations to

\left( J(m^k)^T J(m^k) + λ I \right) (m^{k+1} − m^k) = −J(m^k)^T F(m^k) .   (9.31)

Here the parameter λ is adjusted during the course of the algorithm to ensure convergence. One important reason for using a nonzero value of λ is that the λI term ensures that the matrix is nonsingular. For very large values of λ, we get

m^{k+1} − m^k ≈ −\frac{1}{λ}\, ∇f(m^k) .   (9.32)

This is a steepest descent step. The algorithm simply moves downhill. The

steepest descent step provides very slow, but certain convergence. For very

small values of λ, we get the Gauss–Newton direction, which gives fast but

uncertain convergence.

The hard part of the Levenberg–Marquardt method is determining the right

value of λ. The general idea is to use small values of λ in situations where the

Gauss–Newton method is working well, but to switch to larger values of λ when

the Gauss–Newton method is not making progress. A very simple approach is to start with a small value of λ, and then adjust it in every iteration. If the L–M

step leads to a reduction in f (m), then decrease λ by a constant factor (say 2).

If the L–M step does not lead to a reduction in f (m), then do not take the step.

Instead, increase λ by a constant factor (say 2), and try again. Repeat this

process until a step is found which actually does decrease the value of f (m).

Robust implementations of the L–M method use sophisticated strategies for

adjusting the parameter λ. In practice, a careful implementation of the L–M

method typically has the good performance of the Gauss–Newton method as

well as very good convergence properties. In general, the L–M method is the

method of choice for small to medium sized nonlinear least squares problems.
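A minimal MATLAB/Octave sketch of this simple λ adjustment strategy is given below; the function handles F (the scaled residual vector) and jac (its Jacobian), the starting model m0, and all constants are assumptions made purely for illustration, not a robust implementation.

% Sketch: a simple Levenberg-Marquardt iteration with multiplicative
% adjustment of lambda.  F and jac are assumed function handles.
m = m0;  lambda = 1e-3;
fcur = norm(F(m))^2;
for iter = 1:100
    J = jac(m);
    for attempt = 1:60                                          % cap lambda increases
        dm = -((J'*J + lambda*eye(length(m))) \ (J'*F(m)));     % L-M step, cf. (9.31)
        ftrial = norm(F(m + dm))^2;
        if ftrial < fcur                        % step reduced f: accept, decrease lambda
            m = m + dm;  fcur = ftrial;  lambda = lambda/2;
            break
        end
        lambda = 2*lambda;                      % step failed: increase lambda, retry
    end
end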

Note that the λI term in the L–M method looks a lot like Tikhonov reg-

ularization. It is important to understand that this is not actually a case of

Tikhonov regularization, since the λI is only used as a way to improve the con-

vergence of the algorithm, and does not enter into the objective function being

minimized. We will discuss regularization for nonlinear problems in chapter 10.

9.3 Statistical Aspects

Recall from Appendix B that if an n element vector d has a multivariate normal distribution, and A is an m by n matrix, then Ad also has a multivariate normal distribution with covariance matrix

Cov(Ad) = A\, Cov(d)\, A^T .

In Chapter 2, we applied this result to the least squares solution obtained from the normal equations. The resulting formula for Cov(m) was

Cov(m_{L2}) = (G^T G)^{-1} G^T\, Cov(d)\, G\, (G^T G)^{-1} .

In nonlinear regression there is no such linear relationship between the data and the estimated model parameters m, so we cannot assume that the estimated model parameters have a multivariate normal distribution, and we cannot use the above formula to obtain the covariance matrix of the estimated parameters.

Since we are interested in how small perturbations in the data result in small perturbations in the estimated model parameters, we can consider a linearization of the nonlinear

function F(m),

F(m∗ + ∆m) ≈ F(m∗ ) + J(m∗ )∆m.

Under this approximation, there is a linear relationship between changes in the

data d and changes in the parameters m.

∆d = J(m∗ )∆m

Using this linear approximation, our nonlinear least squares problem becomes a

linear least squares problem. Thus the matrix J(m∗ ) takes the place of G in an

approximate estimate of the covariance of the model parameters. Since we have incorporated the σ_i into the formula for f, the Cov(d) matrix is the identity matrix, and we obtain the formula

Cov(m^∗) ≈ \left( J(m^∗)^T J(m^∗) \right)^{-1} .

It is important to keep in mind that this approximate covariance matrix for nonlinear regression is the result of two critical approximations. We first approximated the

Hessian of f (m) by dropping the Q(m) term. We then linearized f (m) around

m∗ . Depending on the particular problem that we are solving, either of these

approximations could be problematic. The Monte Carlo approach discussed in

Chapter 2 is a robust (but computationally intensive) alternative approach to

estimating Cov(m∗ ).

As with linear regression, it is possible to apply nonlinear regression when the measurement errors are independent and normally distributed and the standard deviations are unknown but assumed to be equal. We set the σ_i in (9.1) to 1, and minimize the sum of squared errors. Define a vector of residuals by

r_i = G(m^∗, x_i) − d_i ,   i = 1, 2, . . . , N ,

and let r̄ be the mean of the residuals. Our estimate of the measurement standard deviation is then given by

s = \sqrt{ \frac{\sum_{i=1}^{N} (r_i - \bar{r})^2}{N - n} } .   (9.37)

Once we have m∗ and Cov(m∗ ), we can establish confidence intervals for the

model parameters exactly as we did in Chapter 2. Just as with linear regression,

it is also important to examine the residuals for systematic patterns or deviations

from normality. If we have not estimated the measurement standard deviation

s, then it is also important to test the χ2 value for goodness of fit.

9.4 Implementation Issues

In this section we consider a number of important issues in the implementation of the Gauss–Newton and Levenberg–Marquardt methods.

The most important difference between the linear regression problems that we solved in Chapter 2 and the nonlinear regression problems discussed in this chapter is that in nonlinear regression we have a nonlinear function G(m, x_i). Our nonlinear regression methods require us to compute values of G and its partial derivatives with respect to the model parameters m_i. In some cases, we have explicit formulas for G, and can easily find formulas for the derivatives of G. In other cases, G exists only as a black box subroutine that we can call as required to compute G values.

When an explicit formula for G is available, there are a number of options for

computing the derivatives. We can differentiate G by hand, or by using a computer algebra system (CAS)

such as Mathematica or Maple. There are also automatic differentiation

software packages that can translate the source code of a program to compute

G into a program to compute the derivatives of G.

One very simple approach is to use finite differences to approximate the derivatives of G. The forward difference approximation

\frac{\partial G(m, x)}{\partial m_i} ≈ \frac{G(m + h e_i, x) − G(m, x)}{h}

can be used to estimate the derivatives. Here h must be carefully selected. The finite difference approximation is based on a Taylor series approximation to G which becomes inaccurate as h increases. On the other hand, as h becomes very small, the difference is dominated by round-off error in the numerator of the fraction. A good rule of thumb is to set h to \sqrt{\epsilon}, where \epsilon is the accuracy of the evaluations of G. For example, if the function evaluations are accurate to 0.0001, then an appropriate choice of h would be about 0.01.
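A minimal MATLAB/Octave sketch of a finite-difference Jacobian is given below; the forward-model function handle G, the model vector m, and the step h are assumptions made purely for illustration.

% Sketch: forward-difference approximation of the Jacobian of G(m).
% G is an assumed function handle returning a vector of predictions.
function J = fdjac(G, m, h)
  % h is typically sqrt(eps_G), where eps_G is the accuracy of G evaluations
  g0 = G(m);
  n  = length(m);
  J  = zeros(length(g0), n);
  for i = 1:n
      mp = m;
      mp(i) = mp(i) + h;              % perturb the i-th parameter
      J(:, i) = (G(mp) - g0)/h;       % forward difference column
  end
end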

When G is available only as a black box subroutine that can be called with

particular values of m and x, and the source code for G is not available, then

our options are much more limited. In this case, the only practical approach

is to use finite differences. Finite difference approximations are inevitably less accurate than exact formulas, and this can lead to numerical problems in the

solution of the nonlinear regression problem.

A second important issue in the implementation of the G–N and L–M meth-

ods is when to terminate the iterations. We would like to terminate when

∇f (m) is approximately 0 and the values of m have stopped changing substan-

tially from one iteration to the next. Because of scaling issues, it is difficult to

set an absolute tolerance on k∇f (m)k2 that would be appropriate for all prob-

lems. Similarly, it is difficult to pick a single absolute tolerance on kmi+1 −mi k2

or |f (mi+1 ) − f (mi )| that would be appropriate for all problems.

The following convergence tests have been normalized so that they will work well on a wide variety of problems. We assume that values of G can be calculated with an accuracy of ε. To ensure that the gradient of f is approximately 0, we require that

‖∇f(m^i)‖_2 < \sqrt{\epsilon}\, (1 + |f(m^i)|) .

To ensure that successive values of m are close, we require

‖m^i − m^{i−1}‖_2 < \sqrt{\epsilon}\, (1 + ‖m^i‖_2) .

Finally, to make sure that the values of f(m) have stopped changing, we require that

|f(m^i) − f(m^{i−1})| < \epsilon\, (1 + |f(m^i)|) .
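These three tests translate directly into code. A sketch (where gradf, mnew, mold, fnew, fold are assumed to be available from the iteration and epsG is the assumed accuracy of the forward-model evaluations) is:

% Sketch: normalized convergence tests for an iterative nonlinear solver.
tol = sqrt(epsG);
gradSmall  = norm(gradf)       < tol*(1 + abs(fnew));
modelClose = norm(mnew - mold) < tol*(1 + norm(mnew));
fStalled   = abs(fnew - fold)  < epsG*(1 + abs(fnew));
converged  = gradSmall && modelClose && fStalled;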

There are a number of problems that can arise during the solution of a

nonlinear regression problem by the G–N or L–M methods.

The first issue is that our nonlinear regression methods are based on the

assumption that f (m) is a smooth function. This means not only that f (m)

must be continuous, but also that its first and second partial derivatives with

respect to the parameters must be continuous. Figure 9.1 shows a function which

is itself continuous, but has discontinuities in the first derivative at x = 0.2 and

the second derivative at x = 0.5. When G is given by an explicit formula, it is

usually easy to check this assumption. When G is given to us in the form of a

black box subroutine, it can be very difficult to check this assumption.

A second issue is that the function f (m) may have a “flat bottom”. See

Figure 9.2. In this case, there are many values of the parameters that come close

to fitting the data, and it is difficult to determine the optimal m∗ . In practice,

this condition is seen to occur when J(m∗ )T J(m∗ ) is very badly conditioned.

We will address this problem by regularization in the next chapter.

The final problem that we will consider is that the f(m) function may be nonconvex and thus have multiple local minimum points. See Figure 9.3.

Figure 9.1: A continuous function with discontinuous first and second derivatives (plot of f(p) versus p).

Figure 9.2: A function with a "flat bottom" of nearly equally good solutions (plot of f(p) versus p).

Figure 9.3: A nonconvex function with multiple local minima (plot of f(p) versus p).

The G–N and L–M methods are designed to converge to a local minimum, but depending on the point at which we begin the search, there is no way to be certain that we will converge to a global minimum.

Global optimization methods have been developed to deal with this problem. One simple global optimization procedure is called multistart. In this procedure, we randomly generate a large number of initial solutions, and perform the L–M method starting with each of the random solutions. We then examine the local minimum solutions found by the procedure, and select the minimum with the smallest value of f(m).
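A sketch of the multistart procedure in MATLAB/Octave is given below; the local solver lm, the residual function handle F, the number of parameters, and the starting interval are all assumptions made for illustration.

% Sketch: multistart global optimization using repeated L-M solves.
% lm is an assumed local solver; F is the scaled residual function handle.
nStarts = 100;
best = [];  fbest = Inf;
for k = 1:nStarts
    m0 = -2 + 4*rand(4, 1);          % random start; assumed 4 parameters in [-2, 2]
    m  = lm(F, jac, m0);             % local nonlinear least squares solution
    fm = norm(F(m))^2;
    if fm < fbest
        best = m;  fbest = fm;       % keep the solution with the smallest misfit
    end
end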

9.5 An Example

Example 9.1

In this example we fit a four-parameter nonlinear model to a synthetic data set. The true model parameters are

>> mtrue

mtrue =

    1.0000
   -0.5000
    2.0000
   -1.0000

Generating the corresponding y values and adding normally distributed noise with a standard deviation of 0.01, we obtain our data set:

>> [x y]

ans =

1.0000 1.3308

1.2500 1.2634

1.5000 1.1536

1.7500 1.0247

2.0000 0.9125

2.2500 0.8007

2.5000 0.6951

2.7500 0.6117

3.0000 0.5160

3.2500 0.4708

3.5000 0.3838

3.7500 0.3309

4.0000 0.2925

4.2500 0.2413

4.5000 0.2044

4.7500 0.1669

5.0000 0.1524

We next used the L–M method to solve the problem a total of 100 times, using random starting points with each parameter uniformly distributed between -2 and 2. This produced a number of different locally optimal solutions, which are shown in Table 1.

Table 1: Locally optimal solutions found by the L–M method.

Number      m1        m2        m3         m4          χ2        p-value
  1       0.8380   -0.4616    2.2461    -1.0214      9.5957     0.73
  2       3.3135   -0.6132   -7.5628    -2.8005     10.7563     0.63
  3       1.8864    0.2017   -0.9294     0.0151     37.4439     < 1 × 10−3
  4       2.3725   -0.5059   -2.7754   -66.0067    271.5124     < 1 × 10−3

Since solution number 1 has the best χ2 value, we will analyze it.

Figure 9.4: Normalized residuals for the nonlinear regression fit (residual in standard deviations versus observation number).

Solution number 1 has a χ2 value of approximately 9.6 and a p-value (with 13 degrees of freedom) of 0.73, so this regression fit passes the χ2 test.

Figure 9.4 shows the normalized residuals for this regression fit. No-

tice that the majority of the residuals are within 0.5 standard devia-

tions, with a few residuals as large as 1.9 standard deviations. There

is no obvious trend in the residuals as x ranges from 1 to 5.

Next, we compute the covariance matrix for the model parameters.

The square roots of the diagonal elements of the covariance matrix

are standard deviations for the individual model parameters. These

are then used to compute approximate 95% confidence intervals for

model parameters.

>> J=jac(mstar);

>> covm=inv(J’*J)

covm =

-0.0153 0.0056 0.0262 -0.0037

-0.0809 0.0262 0.1369 -0.0173

0.0100 -0.0037 -0.0173 0.0026

>> %

>> % Now, compute the confidence intervals.

>> %


ans =

0.4073 1.2687

-0.6077 -0.3156

1.5209 2.9713

-1.1206 -0.9221

Notice that the true parameters (1, -.5, 2, and -1) are all covered

by these confidence intervals. However, there is quite a bit of un-

certainty in the model parameters. This is an example of a poorly

conditioned nonlinear regression problem in which the data do not

constrain the possible parameter values very well.

The correlation matrix provides some insight into the nature of the

ill conditioning. For our solution, the correlation matrix is

>> corm

corm =

-0.9362 1.0000 0.9502 -0.9863

-0.9949 0.9502 1.0000 -0.9243

0.8951 -0.9863 -0.9243 1.0000

The strong negative correlation between m1 and m2 tells us that by increasing m1 and simultaneously decreasing m2, we

can obtain a solution which is very nearly as good as our optimal

solution. There are also strong negative correlations between m2 and

m4 and m3 and m4 . Similarly, there are strong positive correlations

between m1 and m4 and between m2 and m3 .


9.6 Exercises

1. A recording instrument sampling at 50 Hz is run with a high precision

sine wave generator connected to its input. The signal generator produces

a 39.98 s signal of the form

x(t) = A sin(2πfa t + φa )

and the recorded signal consists of instrumental noise, n(t), plus a mean

offset

y(t) = B sin(2πfb t + φb ) + c + n(t) .

Using the data in instdata.mat, and a starting model: fb = 0.5, φb = 0,

c = x̄, and B = (max(x) − min(x))/2, apply Newton’s method to solve for

the unknown parameters (B, fb , φb , c).

2. An alternative to the Levenberg–Marquardt method modifies the Gauss–Newton method by multiplicative damping. Instead of adding λI to the diagonal of J(m^k)^T J(m^k), this method multiplies the diagonal of J(m^k)^T J(m^k) by a factor of (1 + λ). Show that this method can fail by producing an example in which the modified J(m^k)^T J(m^k) matrix is singular, no matter how large λ becomes.

3. In this exercise, you will locate a set of earthquakes using P-wave arrival times recorded at a small seismic network. The field is instrumented with nine seismometers, eight of which are at the surface and one of which is 300 m down a borehole. The P-wave velocity of the fractured granite medium is thought to be an approximately uniform 2 km/s. The station locations (in meters relative to a central origin) are:

x y z

st2 -500 -500 0

st3 100 100 0

st4 -100 0 0

st5 0 100 0

st6 0 -100 0

st7 0 -50 0

st8 0 200 0

st9 10 50 -300

The arrival times of P-waves from the earthquakes are carefully measured

at the stations, with an estimated error of approximately 1 ms. The arrival

time estimates for each earthquake at each station (in seconds, with the

nearest clock second subtracted) are


st2 0.8680 1.2970 0.8429 1.2009 1.2238 0.5640 0.7878

st3 0.5826 1.0095 0.5524 0.9177 0.9326 0.2812 0.5078

st4 0.5975 1.0274 0.5677 0.9312 0.9496 0.2953 0.5213

st5 0.5802 1.0093 0.5484 0.9145 0.9313 0.2795 0.5045

st6 0.5988 1.0263 0.5693 0.9316 0.9480 0.2967 0.5205

st7 0.5857 1.0141 0.5563 0.9195 0.9351 0.2841 0.5095

st8 0.6017 1.0319 0.5748 0.9362 0.9555 0.3025 0.5275

st9 0.5266 0.9553 0.5118 0.8533 0.8870 0.2115 0.4448

st2 0.9120 1.1114 0.9654

st3 0.6154 0.8164 0.6835

st4 0.6360 0.8339 0.6982

st5 0.6138 0.8144 0.6833

st6 0.6347 0.8336 0.6958

st7 0.6215 0.8211 0.6857

st8 0.6394 0.8400 0.7020

st9 0.5837 0.7792 0.6157

(a) Apply the L–M method to this data set to estimate minimum 2-norm

misfit locations of the earthquakes.

(b) Estimate the errors in x, y, z (in meters) and origin time (in s) for each

earthquake using the diagonal elements of the appropriate covariance

matrix. Do the earthquakes form any sort of discernible trend?

4. Consider a system that has recently been developed that locates the sources of lightning radiation in three spatial dimensions and time [RTK+99]. The system accurately measures the arrival time of impulsive radiation events in an unused VHF television channel at 60 MHz. The measurements are made at nine or more locations in a region 40 to 60 km in diameter. Each station records the peak radiation event in successive 100 μs time intervals; from this, several hundred to over a thousand radiation sources are typically located per lightning discharge. In a large storm it is not uncommon to locate over a million sources each minute.

(a) Use the arrival times at stations 1, 2, 4, 6, 7, 8, 10, and 13 to find the time and location of the radio frequency source in the lightning flash. Assume the radio waves travel in a straight line at the speed of light, 2.99731435 × 10^8 m/s.

No.      t            x            y            z

1. 0.0922360280 -24.3471411 2.14673146 1.18923667

2. 0.0921837940 -12.8746056 14.5005985 1.10808551

3. 0.0922165500 16.0647214 -4.41975194 1.12675062

4. 0.0921199690 0.450543748 30.0267473 1.06693166

6. 0.0923199800 -17.3754105 -27.1991732 1.18526730

7. 0.0922839580 -44.0424408 -4.95601205 1.13775547

8. 0.0922030460 -34.6170855 17.4012873 1.14296361

9. 0.0922797660 17.6625731 -24.1712580 1.09097830

10. 0.0922497250 0.837203704 -10.7394229 1.18219520

11. 0.0921672710 4.88218031 10.5960946 1.12031719

12. 0.0921702350 16.9664920 9.64835135 1.09399160

13. 0.0922357370 32.6468622 -13.2199767 1.01175261

(b) During lightning storms we record the arrival of thousands of events

each second. We use several methods to find which arrival times

are due to the same event. The above data were chosen as possible

candidates to go together. We require any solution to use time from

at least 6 stations. Find the largest subset of this data that gives a

good solution. See if it is the same subset suggested in part a of this

problem.

9.7 Notes and Further Reading

Dennis and Schnabel [DS96] is a classic reference on Newton's method. The Gauss–Newton and Levenberg–Marquardt methods are discussed in [Bjö96, NS96]. Statistical aspects of nonlinear regression are discussed in [BW88, DS98, Mye90].

There are a number of freely available and commercial software packages for

nonlinear regression. Some freely available packages include GaussFit [JFM87],

MINPACK [MGH80], and ODRPACK [BDBS89].


Chapter 10

Nonlinear Inverse Problems

As with linear problems, the nonlinear least squares approach can run into trouble with extremely ill-conditioned problems. This typically happens as the number of model parameters grows. In this chapter we will discuss regularization for nonlinear inverse problems and a popular algorithm for computing a regularized solution to a nonlinear inverse problem.

The basic idea of Tikhonov regularization can be carried over to nonlinear problems in a straightforward way. Suppose that we are given a nonlinear forward model G(m), and wish to find the solution with smallest ‖Lm‖ which comes close enough to matching the data. We can formulate this problem as

min ‖Lm‖_2   subject to   ‖G(m) − d‖_2 ≤ δ .   (10.1)

Notice that the form of the problem is virtually identical to the problem that we considered earlier in the linear case. The only difference is that we now use G(m) instead of Gm. As in the linear case, we can also reformulate this problem in terms of minimizing the misfit subject to a constraint on ‖Lm‖,

min ‖G(m) − d‖_2   subject to   ‖Lm‖_2 ≤ ε ,   (10.2)

or as a damped least squares problem

min ‖G(m) − d‖_2^2 + α^2 ‖Lm‖_2^2 .   (10.3)

All three versions of the regularized least squares problem can be solved by

applying standard nonlinear optimization software. In particular, (10.3) is a

nonlinear least squares problem, so we could apply the L–M or G–N methods

to it. Of course, we will still have to deal with the possibility of local minimum

solutions.
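One way to see that (10.3) is an ordinary nonlinear least squares problem is to stack the data residuals and the scaled regularization term into a single residual vector. A minimal MATLAB/Octave sketch of this idea is shown below; the function handles G and Jg, the matrix L, the data d, and the parameter alpha are all assumed to be available and the names are illustrative only.

% Sketch: recasting the damped problem (10.3) as nonlinear least squares
% by augmenting the residual vector and Jacobian.
Faug = @(m) [G(m) - d; alpha*(L*m)];        % stacked residuals
Jaug = @(m) [Jg(m);    alpha*L];            % stacked Jacobian
% Faug and Jaug can now be passed to any nonlinear least squares solver,
% e.g. the simple L-M iteration sketched in Chapter 9.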


10.2 Occam's Inversion

Occam's inversion is a popular algorithm for nonlinear inversion that was introduced by Constable, Parker, and Constable [CPC87]. The name refers to

“Occam’s Razor”, the principle that models should not be unnecessarily com-

plicated. Occam’s inversion uses the discrepancy principle, and searches for the

solution that minimizes kLmk2 subject to the constraint kG(m) − dk2 ≤ δ.

The algorithm is easy to implement, requires only the nonlinear forward model

G(m) and its Jacobian matrix, and works well in practice. However, to the

best of our knowledge, no theoretical analysis proving the convergence of the

algorithm to even a local minimum has been done.

In the following we will assume that our nonlinear inverse problem has been formulated as

min ‖Lm‖_2   subject to   ‖G(m) − d‖_2 ≤ δ .   (10.4)

Here L can be I for zeroth order regularization, or it can be a finite difference

approximation of a first or second derivative. In fact, Occam’s inversion is often

used on two dimensional problems in which L is an approximate Laplacian

operator.

As usual, we will assume that the measurement errors in d are independent

and normally distributed. For convenience, we will also assume that the system of equations G(m) = d has been scaled so that the standard deviations σ_i are

all the same.

The basic idea behind Occam's inversion is local linearization. Given a model m, we can use Taylor's theorem to obtain the approximation

G(m + ∆m) ≈ G(m) + J(m)\,∆m ,   (10.5)

where J(m) is the Jacobian matrix

J(m) = \begin{bmatrix} \frac{\partial G_1(m)}{\partial m_1} & \cdots & \frac{\partial G_1(m)}{\partial m_m} \\ \vdots & \ddots & \vdots \\ \frac{\partial G_n(m)}{\partial m_1} & \cdots & \frac{\partial G_n(m)}{\partial m_m} \end{bmatrix} .   (10.6)

Here n is the number of data points and m is the number of model parameters.

Under this linear approximation, the damped least squares problem (10.3) becomes

min ‖G(m) + J(m)∆m − d‖_2^2 + α^2 ‖L(m + ∆m)‖_2^2 .   (10.7)

It is convenient to rewrite this as a problem in which the variable is actually m + ∆m,

min ‖J(m)(m + ∆m) − \hat{d}(m)‖_2^2 + α^2 ‖L(m + ∆m)‖_2^2 ,   (10.8)

where

\hat{d}(m) = d − G(m) + J(m)\,m .   (10.9)

In (10.8), J(m) and d̂(m) are constant. This problem is a damped linear least squares problem which we learned how to solve when we studied Tikhonov regularization. The solution is given by

m + ∆m = \left( J(m)^T J(m) + α^2 L^T L \right)^{-1} J(m)^T \hat{d}(m) .   (10.10)

This iteration is essentially the Gauss–Newton method applied to the damped least squares problem (10.3). As a result, it suffers from the same sorts of problems with convergence that we saw earlier. In particular, for poor initial guesses, the method often fails to converge to a solution. Convergence could be improved by using a line search to pick the best solution along the line from the current model to the new model given by (10.10). Convergence could also be improved by incorporating the L–M idea of adding a damping term.

We could just use this iteration to solve (10.3) for a number of values of

α, and then pick the largest value of α for which the corresponding model

satisfies the data constraint kG(m)−dk2 < δ. However, the authors of Occam’s

inversion came up with an alternative approach which is much faster in practice.

At each iteration of the algorithm, we pick the largest value of α which keeps

the χ2 value of the solution within its bound. If this is not possible, we pick

the value of α that minimizes the χ2 value. Effectively, we adjust α throughout

the course of the algorithm so that in the end, we should have a value of α that

leads to a solution with χ2 = δ 2 .

We can now state the algorithm. Beginning with an initial solution m^0, use the formula

m^{k+1} = \left( J(m^k)^T J(m^k) + α^2 L^T L \right)^{-1} J(m^k)^T \hat{d}(m^k)

to compute successive models. In each iteration, select the largest value of α for which χ^2(m^{k+1}) ≤ δ^2. If no such value exists, then pick a value of α which minimizes χ^2(m^{k+1}).
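A minimal MATLAB/Octave sketch of one Occam iteration is given below; the function handles G and Jg, the roughening matrix L, the target misfit delta, and the trial range of α values are all assumptions made for illustration.

% Sketch: one iteration of Occam's inversion over a trial set of alpha values.
% G, Jg are assumed function handles; m is the current model; delta is the
% discrepancy-principle target for the 2-norm misfit.
J    = Jg(m);
dhat = d - G(m) + J*m;                       % effective data vector (10.9)
alphas = logspace(-4, 4, 50);                % assumed range of trial values
bestAlpha = NaN;  bestM = [];  bestChi2 = Inf;
for a = alphas
    mtrial = (J'*J + a^2*(L'*L)) \ (J'*dhat);    % damped solution (10.10)
    chi2   = norm(G(mtrial) - d)^2;              % true (nonlinear) misfit
    if chi2 <= delta^2
        bestAlpha = a;  bestM = mtrial;          % largest feasible alpha so far
    end
    if chi2 < bestChi2
        bestChi2 = chi2;  fallbackM = mtrial;    % smallest-misfit fallback
    end
end
if isnan(bestAlpha), m = fallbackM; else, m = bestM; end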

Example 10.1

In this example we will consider the problem of estimating subsurface

electrical conductivities from above ground EM induction measure-

ments. The instrument used in this example is the Geonics EM–38

ground conductivity meter. A description of the instrument and the

mathematical model of the response of the instrument can be found

in [HBR+ 02]. The forward model is quite complicated, but we will

treat it as a black box model, and instead concentrate on the inverse

problem.

Measurements are taken at heights of 0, 10, 20, 30, 40, 50, 75, 100, and 150 centimeters above the ground, with the coils oriented in both the vertical and horizontal directions, for a total of 18 observations. The data are shown in Table 10.1. We will assume measurement standard deviations of 0.1 mS/m.

Table 10.1: Apparent conductivity measurements (mS/m) for the two coil orientations at each instrument height (cm).

  0    134.5   117.4
 10    129.0    97.7
 20    120.5    81.7
 30    110.5    69.2
 40    100.5    59.6
 50     90.8    51.8
 75     70.9    38.2
100     56.8    29.8
150     38.5    19.9

We will discretize the subsurface electrical conductivity profile into

10 layers each 20 cm thick, with a semi–infinite layer below two

meters. Thus we have 11 parameters to estimate.

The forward model G(m) is available to us in the form of a subrou-

tine for computing the forward model predictions. Since we do not

have simple formulas for G(m), we cannot simply write down for-

mulas for the Jacobian matrix. However, we can use finite difference

approximations to estimate J(m).

We first tried using the L–M method to estimate the model param-

eters. After 50 iterations, the L–M method produced a model which

is shown in Figure 10.1. The χ2 value for this model is 9.62, with 9

degrees of freedom, so the model actually fits the data adequately.

However, the least squares problem is very badly conditioned. The

condition number of J^T J is approximately 7.6 × 10^17. Furthermore,

this model is unrealistic because it includes negative electrical con-

ductivities. The model exhibits the kinds of high frequency oscilla-

tions that we have come to expect of under regularized solutions to

inverse problems. Clearly, we need to find a regularized solution.

We next tried Occam’s inversion, with second order regularization

and δ = 0.4243. The resulting model is shown in Figure 10.2. Figure

10.3 shows the true model. The Occam’s inversion solution is a fairly

good reproduction of the original model.

Figure 10.1: The unregularized L–M solution for the subsurface profile (conductivity in mS/m versus depth in m).

Figure 10.2: The Occam's inversion solution with second order regularization (conductivity in mS/m versus depth in m).

Figure 10.3: The true conductivity model (conductivity in mS/m versus depth in m).


10.3 Exercises

1. Recall example 1.3, in which we had gravity anomaly observations above

a density perturbation of variable depth m(x), and fixed density ∆ρ. In

this exercise, you will use Occam’s inversion to solve an instance of this

inverse problem.

Consider a gravity perturbation along a one kilometer section, with observations taken every 50 meters, and a density perturbation of 200 kg/m³ (0.2 g/cm³). The perturbation is expected to be at a depth of roughly 200 meters.

The data file gravprob.mat contains a vector x of observation locations.

Use the same coordinates for your discretization of the model. The vector

obs contains the actual observations. Assume that the observations are

accurate to about 1.0 × 10−12 .

(b) Write MATLAB routines to compute the forward model predictions

and the Jacobian matrix for this problem.

(c) Use the supplied implementation of Occam’s inversion to solve the

inverse problem.

(d) Discuss your results. What features in the inverse solution appear

to be real? What is the resolution of your solution? Were there any

difficulties with local minimum points?

(e) What would happen if the density perturbation was instead at about

1,000 meters depth?


Chapter 11

Bayesian Methods

The previous chapters have taken what might be called a classical approach to solving inverse problems. There is an alternative approach which is based on very different ideas about the fundamental nature of inverse problems. In this chapter, we will review the classical approach to inverse problems, introduce the Bayesian approach, and then compare the two methods.

In the classical approach, we begin with a mathematical model of the form

Gm = d in the linear case or G(m) = d in the nonlinear case. We assume that

there is a true model mtrue , and a true data set dtrue such that Gmtrue −dtrue =

0. We are given an actual data set d, which is dtrue plus measurement noise.

Our goal is to recover mtrue from this noisy data. Of course, without perfect

data, we cannot hope to exactly recover mtrue , but we can hope to find a good

approximate solution.

An approximate solution can be found by minimizing kGm − dk2 to obtain

a least squares solution, mL2 . Intuitively, the least squares solution is the

solution that best fits the data. We also saw that under the assumption that the

data errors are independent and normally distributed, the maximum likelihood

principle leads to the same least squares solution.

Since there is noise in the data, we should expect some misfit in the data; the

sum of squared errors will not typically be zero. We saw that the χ2 distribution

can be used to set a reasonable bound δ on kGm − dk2 . This leads to a set

of solutions which fit the data "well enough". We were also able to compute a covariance matrix for the estimated parameters,

Cov(m_{L2}) = (G^T G)^{-1} G^T\, Cov(d)\, G\, (G^T G)^{-1} .   (11.1)

We then used Cov(m) to compute confidence intervals for the estimated parameters.


This approach worked quite well for linear regression problems in which the

least squares problem is well conditioned. However, we found that in many

cases the least squares problem is not well conditioned. The set of solutions

that adequately fits the data is huge, and contains many solutions which are

completely unreasonable.

We discussed a number of approaches to regularizing the least squares prob-

lem. All of these approaches pick one “best” solution out of the set of solutions

that adequately fit the data. The different regularization methods differ in what

constitutes the best solution. For example, Tikhonov regularization picks the

model that minimizes kmk2 subject to the constraint kGm − dk2 < δ, while

higher order Tikhonov regularization selects the model that minimizes kLmk2

subject to kGm − dk2 < δ.

The regularization approach can be applied to both linear and nonlinear

problems. For relatively small linear problems the computation of the regu-

larized solution is generally done with the help of the SVD. This process is

straight forward and extremely robust. For large sparse linear problems iter-

ative methods such as CGLS can be used. For nonlinear problems things are

more complicated. We saw that the Gauss–Newton and Levenberg-Marquardt

methods could be used to find a local minimum of the least squares problem.

However, these nonlinear least squares problems often have a large number of

local minimum solutions. Finding the global minimum is challenging. We dis-

cussed the multi start approach in which we start the L–M method from a

number of random starting points. Other global optimization techniques for

solving the nonlinear least squares problem include simulated annealing and

genetic algorithms.

How can we justify selecting one solution from the set of models which

adequately fit the data? One justification is Occam’s razor. Occam’s razor is

the principle that when we have several different theories to consider, we should

select the simplest theory. The solutions selected by regularization are in some

sense the simplest models which fit the data. Any feature seen in the regularized

solution must be required by the data. If fitting the data did not require a feature

seen in the regularized solution, then that feature would have been smoothed

out by the regularization term. This answer is not entirely satisfactory.

It is also worth recalling that once we have regularized a least squares prob-

lem, we lose the ability to obtain statistically valid confidence intervals for the

parameters. The problem is that by regularizing the problem we bias the so-

lution. In particular, this means that the expected value of the regularized

solution is not the true solution.

11.2 The Bayesian Approach

The Bayesian approach to solving inverse problems is based on completely different ideas. However, as we will soon see, it often results in similar solutions.

The most fundamental difference between the classical approach and the Bayesian approach is in the nature of the answer. In the classical approach, there

is a specific (but unknown) model mtrue that we would like to discover. In the Bayesian approach, the model is a random variable. Our goal is to compute a probability distribution for the model. Once we have a probability distribution for the model, we can use the distribution to answer questions about the model such as "What is the probability that m5 is less than 0?" An important advantage of the Bayesian approach is that it explicitly addresses the uncertainty in the model, while the classical approach provides only a single solution.

A second very important difference between the classical and Bayesian ap-

proaches is that in the Bayesian approach we can incorporate additional in-

formation about the solution which comes from other data sets or our own

intuition. This prior information is expressed in the form of a probability distri-

bution for m. This prior distribution may incorporate the user’s subjective

judgment, or it may incorporate information from other experiments. If no

other information is available, then under the principle of indifference, we

may pick a prior distribution in which each possible model parameter has equal

likelihood. Such a prior distribution is said to be uninformative.

One of the main objections to the Bayesian approach is that the method is

“unscientific” because it allows the analyst to incorporate subjective judgments

into the model which are not based on the data alone. Bayesians reply that

the analyst is free to pick an uninformative prior distribution. Furthermore, it

is possible to complete the Bayesian analysis with a variety of prior distribu-

tions and examine the effects of different prior distributions on the posterior

distribution.

It should be pointed out that if the parameters m are contained in the range (−∞, ∞), then the uninformative prior is not a proper probability distribution. The problem is that there does not exist a probability distribution p(m) such that

\int_{-\infty}^{\infty} p(m)\, dm = 1   (11.2)

and p(m) is constant. In practice, the use of this improper prior distribution can be justified, because the posterior distribution for m is a proper distribution.

We will use the notation p(m) for the prior distribution. We also assume

that using the forward model we can compute the probability that given a

particular model, a particular data value will be observed. This is a conditional

probability distribution. We will use the notation f (d|m) for this conditional

probability distribution. Of course we know the data and are attempting to

estimate the model. Thus we are interested in the conditional distribution of

the model parameter(s) given the data. We will use the notation q(m|d) for

this posterior probability distribution.

Bayes’ theorem relates these distributions in a way that makes it possible

for us to compute what we want. Recall Bayes’ theorem.

Theorem 11.1

q(m|d) = f (d|m) p(m) / ∫_{all models} f (d|m) p(m) dm .    (11.3)

Notice that the denominator in this formula is simply a normalizing constant which is used to ensure that the integral of the conditional distribution q is one. Since the normalization constant is not always needed, this formula is sometimes written as

q(m|d) ∝ f (d|m)p(m). (11.4)

The posterior distribution q(m|d) does not provide a single model that we

can consider the “answer.” In cases where we want to single out one model as the

answer, it is appropriate to use the model with the largest value of q(m|d). This

is the so called maximum a posteriori (MAP) model. An alternative would be

to use the mean of the posterior distribution. In situations where the posterior

distribution is symmetric, the MAP model and the posterior mean model are

the same.

In general, the computation of a posterior distribution can be a complicated

process. The difficulty is in evaluating the integrals in (11.3). These are often

integrals in very high dimensions, for which numerical integration techniques are

computationally expensive. One important approach to such problems is the

Markov Chain Monte Carlo Method (MCMC) [GRS96]. Fortunately, there are

a number of special cases in which the computation of the posterior distribution

is greatly simplified.
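As a concrete illustration of (11.3) and (11.4), the following MATLAB sketch evaluates a posterior on a grid of model values for a hypothetical one-parameter problem; the forward model g, the observed datum dobs, and the noise level sigma are invented for illustration and are not from the text.

    % Sketch: evaluate a one-parameter posterior on a grid (hypothetical problem).
    g = @(m) m.^2;                  % assumed forward model
    dobs = 4.2;                     % assumed observed datum
    sigma = 0.5;                    % assumed noise standard deviation
    m = linspace(0, 4, 1000)';      % grid of candidate models
    prior = ones(size(m));          % flat (uninformative) prior
    like = exp(-(dobs - g(m)).^2/(2*sigma^2));  % f(d|m), up to a constant
    post = like .* prior;           % numerator of (11.3)
    post = post/trapz(m, post);     % normalize so the posterior integrates to one
    plot(m, post); xlabel('m'); ylabel('q(m|d)');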

One simplification occurs when the prior distribution p(m) is constant. In

this case, we have

q(m|d) ∝ f (d|m). (11.5)

In this context, the conditional distribution of d given m is known as the likelihood of m, written L(m). Under the maximum likelihood principle we select the model mML which maximizes L(m). This is exactly the MAP model.

A further simplification occurs when the noise in the measured data is in-

dependent and normally distributed with standard deviation σ. Because the

measurement errors are independent, we can write the likelihood function as

the product of the likelihoods of each individual data point.

Since the individual data points di are normally distributed with mean (G(m))i and standard deviation σ, we can write f (di|m) as

f (di|m) ∝ e^{−(di − (G(m))i)² / (2σ²)} .    (11.7)

Multiplying the likelihoods of the individual data points together gives

L(m) ∝ e^{−∑_{i=1}^{n} (di − (G(m))i)² / (2σ²)} .    (11.8)

Maximizing L(m) is equivalent to minimizing the negative logarithm of L(m), which (up to an additive constant) is simply minus the exponent. Thus the maximum likelihood model solves

min ∑_{i=1}^{n} (di − (G(m))i)² / (2σ²) .    (11.9)


Except for the constant factor of 1/(2σ²), this is precisely the least squares problem min ‖G(m) − d‖₂². Thus we have shown that the Bayesian approach leads to

the least squares solution when we have independent and normally distributed

measurement errors and we use a flat prior distribution.

Example 11.1

Consider the following very simple inverse problem. We have an

object of unknown mass m. We have a scale that can be used to

weigh the object. However, the measurement errors are normally

distributed with mean 0 and standard deviation σ = 1 kg. With

this error model, we have

f (d|m) = (1/√(2π)) e^{−(d−m)²/2} .    (11.10)

Suppose that we weigh the object once and obtain a measurement of d = 10.3 kg. What do we now know about the mass of the object? We will consider what happens when we use a flat prior probability distribution. In this case, since p(m) is constant,

q(m|d) ∝ f (d|m) .    (11.11)

Thus

q(m|d = 10.3 kg) ∝ (1/√(2π)) e^{−(10.3−m)²/2} .    (11.12)

In fact, if we integrate the right hand side of this last equation from m = −∞ to m = +∞, we find the integral is one, so the constant of proportionality is one, and

q(m|d = 10.3 kg) = (1/√(2π)) e^{−(10.3−m)²/2} .    (11.13)

This posterior distribution is shown in Figure 11.1.

Next, suppose that we obtain a second measurement of 10.1 kg.

Now, we use the distribution (11.13) as our prior distribution and

compute a new posterior distribution.

q(m|d1 = 10.3 kg, d2 = 10.1 kg) ∝ f (d2 = 10.1 kg|m)q(m|d1 = 10.3 kg)

(11.14)

q(m|d1 = 10.3 kg, d2 = 10.1 kg) ∝ (1/√(2π)) e^{−(10.1−m)²/2} · (1/√(2π)) e^{−(10.3−m)²/2} .    (11.15)

We can multiply the exponentials by adding exponents. We can also absorb the factors of 1/√(2π) into the constant of proportionality:

q(m|d1 = 10.3 kg, d2 = 10.1 kg) ∝ e^{−((10.3−m)² + (10.1−m)²)/2} .    (11.16)

[Figure 11.1: Posterior distribution q(m|d = 10.3 kg), flat prior.]

We can simplify the exponent by collecting terms and completing the square:

(10.3 − m)2 + (10.1 − m)2 = 2(10.2 − m)2 + 0.02 (11.17)

q(m|d1 = 10.3 kg, d2 = 10.1 kg) ∝ e^{−(2(10.2−m)² + 0.02)/2} .    (11.18)

The constant factor e−0.02/2 can also be absorbed into the constant

of proportionality. We are left with

q(m|d1 = 10.3 kg, d2 = 10.1 kg) ∝ e^{−(10.2−m)²} .    (11.19)

After normalization, this can be written as

q(m|d1 = 10.3 kg, d2 = 10.1 kg) = (1/((1/√2)·√(2π))) e^{−(10.2−m)²/(2(1/√2)²)} .    (11.20)

This is a normal distribution with mean 10.2 kg and a standard deviation of 1/√2 kg. The distribution is shown in

Figure 11.2.
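The calculation in this example is easy to reproduce numerically. The following MATLAB sketch plots the posterior after the first measurement and after both measurements; it assumes only the numbers given above.

    % Sketch: posteriors for the weighing example (sigma = 1 kg, flat prior).
    m = linspace(5, 15, 1000);
    q1 = exp(-(10.3 - m).^2/2)/sqrt(2*pi);          % posterior after d1 = 10.3 kg
    q2 = exp(-((10.3 - m).^2 + (10.1 - m).^2)/2);   % unnormalized, after d1 and d2
    q2 = q2/trapz(m, q2);                           % numerical normalization
    plot(m, q1, m, q2);
    legend('after d_1', 'after d_1 and d_2'); xlabel('mass (kg)');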

It is remarkable that when we start with a normal prior distribution and take into account normally distributed data,

we obtain a normal posterior distribution. The property that the

posterior distribution has the same form as the prior distribution is

called conjugacy. There are other families of conjugate distribu-

tions, but in general this is a very rare property [GCSR03].


Figure 11.2: Posterior distribution q(m|d1 = 10.3 kg, d2 = 10.1 kg), flat prior.

11.3 The Multivariate Normal Case

The idea that a normal prior distribution leads to a normal posterior distribution

can be extended to situations in which there are many model parameters. In

situations where we have a linear model Gm = d, the data errors have a multi-

variate normal distribution, and the prior distribution for the model parameters

is also multivariate normal, the computation of the posterior distribution is rela-

tively simple. Tarantola [Tar87] develops the solution to the Bayesian inversion

problem under these conditions.

Let dobs be the observed data, and let CD be the covariance matrix for

the data. Let mprior be the mean of the prior distribution and let CM be the

covariance matrix for the prior distribution. The prior distribution is

p(m) ∝ e^{−(1/2)(m − mprior)^T CM^{−1} (m − mprior)} .    (11.21)

The conditional distribution of the data given the model is

f (d|m) ∝ e^{−(1/2)(Gm − dobs)^T CD^{−1} (Gm − dobs)} .    (11.22)

Thus

q(m|d) ∝ e^{−(1/2)((Gm − dobs)^T CD^{−1} (Gm − dobs) + (m − mprior)^T CM^{−1} (m − mprior))} .    (11.23)

It can be shown that this posterior distribution is itself multivariate normal and can be written as

q(m|d) ∝ e^{−(1/2)(m − mMAP)^T CM′^{−1} (m − mMAP)} ,    (11.24)

where the posterior covariance matrix is

CM′ = (G^T CD^{−1} G + CM^{−1})^{−1} ,    (11.25)

and mMAP is the model that minimizes

(Gm − dobs)^T CD^{−1} (Gm − dobs) + (m − mprior)^T CM^{−1} (m − mprior) .    (11.26)

We can minimize (11.26) by rewriting it in terms of the matrix square roots of CM^{−1} and CD^{−1}. Note that every symmetric and positive definite matrix has a matrix square root. The matrix square root can be computed from the SVD of the matrix. The MATLAB command sqrtm can also be used to find the square root of a matrix. In terms of these square roots, the minimization problem becomes

min (CD^{−1/2}(Gm − dobs))^T (CD^{−1/2}(Gm − dobs)) + (CM^{−1/2}(m − mprior))^T (CM^{−1/2}(m − mprior)) ,    (11.27)

or

min ‖CD^{−1/2}(Gm − dobs)‖₂² + ‖CM^{−1/2}(m − mprior)‖₂² .    (11.28)

" # " #
2

−1/2 −1/2

CD G CD dobs

min
m−
. (11.29)

−1/2 −1/2

CM CM mprior

2
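In MATLAB, (11.29) can be set up and solved directly with sqrtm and the backslash operator. The following sketch uses a small synthetic problem; the matrices G, CD, CM and the vectors dobs, mprior are illustrative assumptions, not values from the text.

    % Sketch: MAP solution from (11.29) for a small synthetic problem.
    G = [1 0.5; 0.5 1; 1 1];                 % assumed forward operator
    mtrue = [1; 2];
    CD = 0.1^2*eye(3);                       % data covariance
    dobs = G*mtrue + 0.1*randn(3,1);         % synthetic noisy data
    mprior = [0; 0];
    CM = eye(2);                             % prior covariance
    CDih = sqrtm(inv(CD));                   % CD^(-1/2)
    CMih = sqrtm(inv(CM));                   % CM^(-1/2)
    mmap = [CDih*G; CMih] \ [CDih*dobs; CMih*mprior];   % solves (11.29)
    CMprime = inv(G'*inv(CD)*G + inv(CM));               % posterior covariance (11.25)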

It is worthwhile to consider what happens to the posterior distribution in the most extreme case in which the prior distribution provides essentially no information. If the prior distribution has a covariance matrix CM = σ²I, where σ is extremely large, then CM^{−1} will be extremely small, and the posterior covariance matrix will be essentially

CM′ ≈ (G^T CD^{−1} G)^{−1} .    (11.30)

For convenience, let us assume that the data are independent with variance one. Then

CM′ ≈ (G^T I G)^{−1} = (G^T G)^{−1} .    (11.31)

This is precisely the covariance matrix for the model parameters in (11.1). Furthermore, when we solve (11.29) to obtain the MAP solution, we find that (11.29) simplifies to the least squares problem min ‖Gm − d‖₂². Thus under the

typical assumption of independent data errors, a very broad prior distribution

leads essentially to the unregularized least squares solution!

It is also worthwhile to consider what happens in the special case where CD = σ²I and CM = α²I. In this special case, (11.29) simplifies to

min ‖Gm − dobs‖₂² + (σ²/α²) ‖m − mprior‖₂² ,

which has the same form as a zeroth–order Tikhonov regularization problem.

Once we have obtained the posterior distribution, it is also possible to gen-

erate random models according to the posterior distribution. We begin by

computing the Cholesky factorization of CM′,

CM′ = R^T R .    (11.34)

This can be done easily using the chol command in MATLAB. We then generate

a vector s of normally distributed random numbers with mean zero and standard

deviation one. This can be done using the randn command in MATLAB. The

covariance matrix of s is the identity matrix. Finally, we generate our random

solution with

m = R^T s + mMAP .    (11.35)

The expected value of m is

E[m] = E[R^T s + mMAP] = R^T E[s] + mMAP .

Since E[s] = 0,

E[m] = mMAP .    (11.38)

We will also compute Cov(m) to show that it is CM′. Since mMAP is a constant,

Cov(m) = Cov(R^T s + mMAP) = Cov(R^T s) .

From Appendix A, we know how to find the covariance of a matrix times a random vector:

Cov(m) = R^T Cov(s) R .    (11.41)

Since Cov(s) = I,

Cov(m) = R^T R ,    (11.43)

so

Cov(m) = CM′ .    (11.44)
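A minimal MATLAB sketch of this sampling procedure, assuming that the posterior covariance CMprime and the MAP model mmap are already available (for example from the sketch following (11.29)):

    % Sketch: draw random models from the posterior, equations (11.34)-(11.35).
    R = chol(CMprime);                  % CMprime = R'*R, with R upper triangular
    nsamp = 20;
    msamp = zeros(length(mmap), nsamp);
    for k = 1:nsamp
        s = randn(length(mmap), 1);     % independent N(0,1) entries
        msamp(:,k) = R'*s + mmap;       % equation (11.35)
    end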

[Figure 11.3: MAP solution and the target model.]

Example 11.2
In this example we revisit an inverse problem with a smooth target model (Figure 11.3). We have previously solved this problem using Tikhonov regularization.

For our first solution, we will use a multivariate normal prior dis-

tribution with mean one and standard deviation one for each model

parameter, with independent model parameters. Thus CM = I and

mprior is a vector of ones. We used MATLAB to set up and solve (11.29) and obtained the mMAP solution shown in Figure 11.3. Figure 11.4 shows this same solution with error bars. These error bars

were computed by adding and subtracting 1.96 times the posterior

standard deviation from the MAP solution.

Figure 11.5 shows a total of twenty solutions which were randomly

generated according to the posterior distribution. From these so-

lutions, it is clear that there is a high probability of a peak in the

model near θ = 1. Another peak near θ = −1 is also evident.

Next, we considered a broader prior distribution. We used a prior

mean of one, but used variances of 100 in the prior distribution of

the model parameters instead of variances of one. Figure 11.6 shows

the MAP model for this case, together with its error bars. Notice that this solution is not nearly as good as the model in Figure 11.3. With

a very unrestrictive prior, we have depended mostly on the available

data, which simply does not pin down the solution.

[Figure 11.4: MAP solution with 95% error bars, prior variance 1.]
[Figure 11.5: Twenty random solutions generated from the posterior distribution.]
[Figure 11.6: MAP solution with 95% error bars, prior variance 100.]

This example illustrates a weakness of the Bayesian approach to poorly conditioned problems. In order to get a reason-

ably tight posterior distribution, we have to make very strong as-

sumptions in the prior distribution. If these assumptions are not

warranted, then we cannot obtain a good solution to the inverse

problem.

11.4 Maximum Entropy Methods

We have already seen that an important issue in the Bayesian approach is the

selection of an appropriate prior distribution. In this section we will consider

maximum entropy methods which can be used to select a prior distribution

subject to available information such as bounds on the parameter, the mean of

the parameter, and so forth.

Definition 11.1
The entropy of a discrete random variable X with distribution

P(X = xi) = pi

is given by

H(X) = − ∑_i pi ln pi .

The entropy of a continuous random variable X with probability density function f (x), so that

P(X ≤ a) = ∫_{−∞}^{a} f (x) dx ,

is given by

H(X) = − ∫_{−∞}^{∞} f (x) ln f (x) dx .

Under the maximum entropy principle, we select a prior distribution which has the largest possible entropy subject to any constraints

imposed by available information about the distribution.

For continuous random variables, the optimization problems resulting from

the maximum entropy principle involve an unknown density function f (x). Such

problems can be solved by techniques from the calculus of variations. Fortu-

nately, maximum entropy distributions for a number of important cases have

already been worked out. See Kapur and Kesavan [KK92] for a collection of

results on maximum entropy distributions.

Example 11.3

Suppose that we know that X takes on only nonnegative values, and

that the mean value of X is µ. It can be shown using the calculus of

variations that the maximum entropy distribution is an exponential

distribution [KK92]. The maximum entropy distribution is

fX (x) = (1/µ) e^{−x/µ} ,   x ≥ 0 .

Definition 11.2

Given a discrete probability distribution

P (X = xi ) = pi

and an alternative distribution

P (X = xi ) = qi ,

the Kullback–Leibler cross entropy is given by

D(p, q) = ∑_i pi ln(pi / qi) .

Given continuous distributions f (x) and g(x), the cross entropy of

the distributions is

D(f, g) = ∫_{−∞}^{∞} f (x) ln( f (x) / g(x) ) dx .

The cross entropy is a measure of how close two distributions are. Notice that if

the two distributions are identical, then the cross entropy is 0. It can be shown

that in all other cases, the cross entropy is greater than zero.

Under the minimum cross entropy principle, if we are given a prior distri-

bution q and some additional constraints on the distribution, we should select

a posterior distribution p which minimizes the cross entropy of p and q subject

to the constraints.

For example, in the Minimum Relative Entropy (MRE) method of

Woodbury and Ulrych [UBL90, WU96], the minimum cross entropy principle

is applied to linear inverse problems of the form Gm = d subject to lower

and upper bounds on the model elements. First, a maximum entropy prior

distribution is computed using the lower and upper bounds and a given prior

mean. Then, a posterior distribution is selected to minimize the cross entropy

subject to the constraint that the mean of the posterior distribution must satisfy the

equations Gm = d.

11.5 Exercises

1. Consider the following coin tossing experiment. We repeatedly toss a coin,

and each time record whether it comes up heads (0), or tails (1). The bias

b of the coin is the probability that it comes up heads. We do not have

reason to believe that this is a fair coin, so we will not assume that b = 1/2.

Instead, we will begin with a flat prior distribution p(b) = 1, for 0 ≤ b ≤ 1.

(a) What is f (d|b)? Note that the only possible data are 0 and 1, so this

distribution will involve delta functions at d = 0, and d = 1.

(b) Suppose that on our first flip, the coin comes up heads. Compute

the posterior distribution q(b|d1 = 0).

(c) The second, third, fourth, and fifth flips are 1, 1, 1, and 1. Find the posterior distribution q(b|d1 = 0, d2 = 1, d3 = 1, d4 = 1, d5 = 1). Plot the posterior distribution.

(d) What is your MAP estimate of the bias?

(e) Now, suppose that you initially felt that the coin was at least close

to fair, with

p(b) ∝ e^{−10(b−0.5)²} ,   0 ≤ b ≤ 1 .

Repeat the analysis of the five coin flips.

2. Apply the Bayesian approach of this chapter to one of the linear inverse problems from Chapter 5. Select what you consider to be a reasonable prior. How sensitive is your solution to the prior mean and covariance?

3. Consider a die whose faces show the values 1, 2, 3, 4, 5, and 6. If each side were equally likely to come up, the mean value would be 7/2. Suppose instead that the mean is 9/2. Formulate an optimization problem to find the maximum entropy distribution for the faces of the die, subject to the constraint that the mean is 9/2. Use the method of Lagrange

multipliers to solve this optimization problem and obtain the maximum

entropy distribution.

4. Let X be a discrete random variable that takes on the values 1, 2, . . ., n.

Suppose that E[X] = µ is given. Find the maximum entropy distribution

for X.

11.6 Notes and Further Reading

Sivia’s book [Siv96] is a good general introduction to Bayesian ideas for scien-

tists and engineers. The textbook by Gelman et al. [GCSR03] provides a more

comprehensive introduction to Bayesian statistics. Tarantola [Tar87] is the stan-

dard reference work on Bayesian methods for inverse problems. The paper of

Gouveia and Scales [GS97] discusses the relative advantages and disadvantages

of Bayesian and classical methods for inverse problems. The draft textbook by

Scales and Smith [SS97] takes a Bayesian approach to inverse problems. Sivia’s

book includes a brief introduction to the maximum entropy principle [Siv96]. A

complete survey of maximum entropy methods can be found in [KK92].


Appendix A

Review of Linear Algebra

The topics discussed in this appendix have been selected because they are log-

ically necessary to the development of the rest of the textbook. This is by no

means a complete review of linear algebra. It would take an entire textbook

to completely cover the material in this appendix. Instead, the purpose of this

appendix is to summarize some important concepts, definitions, and theorems

for linear algebra that will be used throughout the book.

A.1 Systems of Linear Equations

Recall that a system of linear equations can be solved by the process of Gaussian

elimination.

Example A.1

Consider the system of equations:

x +2y +3z = 14

x +2y +2z = 11 .

x +3y +4z = 19

We can eliminate x from the second and third equations by subtracting the first equation from the second and third equations to

get:

x +2y +3z = 14

−z = −3 .

y +z = 5

We would like to get y into the second equation, so we simply inter-

change the second and third equations:

x +2y +3z = 14

y +z = 5 .

−z = −3

A-1


Next, we can eliminate y from the first equation by subtracting two

times the second equation from the first equation.

x +z = 4

y +z = 5 .

−z = −3

Next, we multiply the third equation by −1 to solve for z:

x +z = 4

y +z = 5 .

z = 3

Finally, we eliminate z from the first two equations.

x = 1

y = 2 .

z = 3

The solution is x = 1, y = 2, and z = 3. Geometrically, this system of equations describes three planes

(given by the three equations) which intersect in a single point.

In solving this system of equations, we used three elementary operations: adding a multiple of one equation to another equation, multiplying an

equation by a nonzero constant, and swapping two equations. This process can

be extended to solve systems of equations with three, four, or more variables.

Relatively small systems can be solved by hand, while systems with thousands

or even millions of variables can be solved by computer.

In performing the elimination process, the actual names of the variables are

insignificant. We could have renamed the variables in the above example to

a, b, and c without changing the solution in any significant way. Since the

actual names of the variables are insignificant, we can save time by writing

down the significant coefficients from the system of equations in a table called

an augmented matrix. This augmented matrix form is also useful in solving

a system of equations by computer. The elements of the augmented matrix are

simply stored in an array.

In augmented matrix form, our example system becomes:

1 2 3 14

1 2 2 11 .

1 3 4 19

The elementary row operations are adding a multiple of one row to another row, multiplying a row by a nonzero constant, and in-

terchanging two rows. The elimination process is essentially identical to the

process used in the previous example. In the example, the final version of the augmented matrix is

1 0 0 1

0 1 0 2 .

0 0 1 3

Definition A.1
A matrix is in reduced row echelon form (RREF) if it has the following properties:

1. The first nonzero entry in each nonzero row is a 1. These leading entries of the matrix are called pivot elements. A column in which a pivot element appears is called a pivot column.

2. Except for the pivot element, all entries in pivot columns are zero.

3. If the matrix has any rows of zeros, these zero rows are at the bottom of the matrix.

To solve a system of equations, we can use elementary row operations to reduce the augmented matrix to RREF and then

convert back to conventional notation to read off the solutions. The process

of transforming a matrix into RREF can easily be automated. Most graphing

calculators have built in functions for computing the RREF of a matrix. In

MATLAB, the rref command computes the RREF of a matrix. In addition to

solving equations, the RREF can be used for many other calculations in linear

algebra.
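For instance, the example system above can be solved in MATLAB with a single call to rref (a quick sketch):

    % Sketch: solving the example system with rref.
    Aaug = [1 2 3 14
            1 2 2 11
            1 3 4 19];     % augmented matrix for the example system
    rref(Aaug)             % gives [1 0 0 1; 0 1 0 2; 0 0 1 3], i.e. x=1, y=2, z=3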

It can be shown that any linear system of equations has either no solutions,

exactly one solution, or infinitely many solutions. For example, in two dimen-

sions, the lines represented by the equations can fail to intersect (no solution),

intersect at a point (one solution) or intersect in a line (many solutions.) In the

following examples, we show how it is easy to determine the number of solutions

from the RREF of the augmented matrix.

Example A.2
Consider the system of equations

x + y = 1
x + y = 2 .

The augmented matrix is

1 1 1

.

1 1 2

The RREF of this augmented matrix is

1 1 0
0 0 1 .

In equation form, this is

x +y = 0

.

0 = 1

There are obviously no solutions to 0 = 1, so the original system of

equations has no solutions.

Example A.3

In this example, we will consider a system of two equations in three

variables which has many solutions. Our system of equations is:

x1 +x2 +x3 = 0

. (A.1)

x1 +2x2 +2x3 = 0

We put this system of equations into augmented matrix form and

then find the RREF

1 0 0 0

.

0 1 1 0

We can translate this back into equation form as

x1 = 0
x2 + x3 = 0 .

Clearly, x1 must be 0 in any solution to the system of equations. However, x2 and x3 are not fixed. We can treat x3 as a free variable and allow it to take on any value. However, whatever value x3 takes on, x2 must be equal to −x3. Geometrically, this system of equations describes the intersection of two planes. The intersection of the two planes consists of the points on the line x2 = −x3 in the x1 = 0 plane.

A system with more equations than variables is over determined. Although over de-

termined systems often have no solutions, it is possible for an over determined

system of equations to have many solutions or exactly one solution.

A system of equations with fewer equations than variables is under de-

termined. Although in many cases under determined systems of equations

have infinitely many solutions, it is possible for an under determined system of

equations to have no solutions.

A system of equations with all zeros on the right hand side is homogeneous.

Every homogeneous system of equations has at least one solution. You can set

all of the variables to 0 and get the trivial solution. A system of equations

with a nonzero right hand side is non homogeneous.

A.2 Matrix and Vector Algebra

As we have seen in the previous section, a matrix is a table of numbers laid

out in rows and columns. A vector is a matrix that consists of a single column

of numbers.

There are several important notational conventions used with matrices and

vectors. Bold face capital letters such as A, B, . . . are used to denote matrices.

Bold face lower case letters such as x, y, . . . are used to denote vectors. Lower

case letters or Greek letters such as m, n, α, β, . . . will be used to denote scalars.

At times we will need to refer to specific parts of a matrix. The notation

Aij denotes the element of the matrix A in row i and column j. We denote the

jth element of the vector x by xj . The notation A·,j is used to refer to column

j of the matrix A, while Ai,· refers to row i of A.

We can also build up larger matrices from smaller matrices. The notation

A = [B C] means that the matrix A is composed of the matrices B and C,

with matrix C beside matrix B. The notation A = [B; C] means that matrix

A is composed of the matrices B and C, with C below matrix B.

It is also possible to perform arithmetic on matrices and vectors. If A and B

are two matrices of the same size, we can add them together by simply adding

the corresponding elements of the two matrices. Similarly, we can subtract B

from A by subtracting the elements of B from the elements of A. We can

multiply a scalar times a vector by multiplying the scalar times each element

of the vector. Since vectors are just n by 1 matrices, we can perform the same

arithmetic operations on vectors. A zero matrix, 0 is a matrix with all zero

elements. A zero matrix plays the same role in matrix algebra as the scalar 0,

with A + 0 = 0 + A = A.

In general, matrices and vectors may contain complex numbers as well as

real numbers. However, in this book all of our matrices will have real entries.

Using vector notation, we can write a linear system of equations in vector

form.

Example A.4

Recall the system of equations (A.1)

x1 +x2 +x3 = 0

x1 +2x2 +2x3 = 0

from Example A.3. We can write this in vector form as

x1 [1; 1] + x2 [1; 2] + x3 [1; 2] = [0; 0] .

The expression on the left hand side of the equation in which scalars are

multiplied by vectors and then added together is called a linear combination.

Although this form is not convenient for solving the system of equations, it can

be useful in setting up a system of equations.

The product of an m by n matrix A and an n element vector x is the linear combination of the columns of A with coefficients given by the entries of x,

Ax = x1 A·,1 + x2 A·,2 + . . . + xn A·,n .

Example A.5

Let

1 2 3

A=

4 5 6

and

1

x= 0

2

Then

Ax = 1 [1; 4] + 0 [2; 5] + 2 [3; 6] = [7; 16] .

Notice that the formula for Ax involves a linear combination much like the

one that occurred in the vector form of a system of equations. It is possible

to take any linear system of equations and rewrite the system of equations as

Ax = b, where A is a matrix containing the coefficients of the variables in the

equations, b is a vector containing the coefficients on the right hand sides of the

equations, and x is a vector containing the variables.

Example A.6

The system of equations

x1 +x2 +x3 = 0

x1 +2x2 +2x3 = 0

1 1 1

A=

1 2 2

x1

x = x2

x3

and

0

b= .

0

If A is a matrix of size m by n, and B is a matrix of size n by r,

then the product C = AB is obtained by multiplying A times each

of the columns of B and assembling the matrix vector products in

C. That is,

C = [AB1 AB2 . . . ABr ] . (A.2)

Note that the product is only possible if the two matrices are of compatible

sizes. In general, if A has m rows and n columns, and B has n rows and r

columns, then the product AB exists and is of size m by r. In some cases, it is

possible to multiply AB but not BA. Also, it turns out that even when both

AB and BA exist, AB is not always equal to BA!

There is an alternate way to compute the matrix–matrix product. In the

row–column expansion method, we obtain the entry in row i and column j

of C by computing the matrix product of row i of A and column j of B.

Example A.7

Let

1 2

A= 3 4

5 6

and

5 2

B=

3 7

and let C = AB. The matrix C will be of size 3 by 2. We compute

the product using both of the methods. First, using the matrix–

vector approach:

C = [AB1 AB2 ]

C = [ 5 [1; 3; 5] + 3 [2; 4; 6]     2 [1; 3; 5] + 7 [2; 4; 6] ]

11 16

C = 27 34 .

43 52

Next, we use the row–column approach

1×5+2×3 1×2+2×7

C= 3×5+4×3 3×2+4×7

5×5+6×3 5×2+6×7

11 16

C = 27 34 .

43 52

The n by n identity matrix In consists of 1’s on the diagonal and 0’s on the

off diagonal. For example, the 3 by 3 identity matrix is

1 0 0

I3 = 0 1 0 .

0 0 1

We often write I without specifying the size of the matrix in situations where

the size of matrix is obvious from context. You can easily show that if A is an

m by n matrix, then

AIn = A

and

Im A = A.

Thus multiplying by I in matrix algebra is similar to multiplying by 1 in con-

ventional scalar algebra.

We have not defined division of matrices, but it is possible to define the

matrix algebra equivalent of the reciprocal.

Definition A.3
Given an n by n matrix A, if there is an n by n matrix B such that

AB = BA = I ,    (A.3)

then B is called the inverse of A, and we write B = A−1.

Since the columns of the identity matrix are known, and A is known, we can solve

AB1 = [1; 0; . . . ; 0]

to obtain B1 . In the same way, we can find the remaining columns of the inverse.

If any of these systems of equations are inconsistent, then A−1 does not exist.

Example A.8

Let

A = [2 1; 5 2] .

To find the first column of A−1, we solve AB1 = [1; 0], starting with the augmented matrix

2 1 1
5 2 0 .

The RREF is

1 0 −2
0 1 5 .

Thus the first column of A−1 is [−2; 5]. Similarly, solving AB2 = [0; 1] gives the second column of A−1, [1; −2]. Thus

A−1 = [−2 1; 5 −2] .
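The same computation is easy to check in MATLAB, solving for the inverse one column at a time (a sketch):

    % Sketch: computing the inverse of Example A.8 column by column.
    A = [2 1; 5 2];
    B1 = A \ [1; 0];       % first column of the inverse
    B2 = A \ [0; 1];       % second column of the inverse
    Ainv = [B1 B2]         % equals [-2 1; 5 -2], the same as inv(A)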

The inverse matrix can be used to solve a system of linear equations with n

equations and n variables. Given the system of equations Ax = b, and A−1 ,

we can calculate

Ax = b
A−1 Ax = A−1 b
Ix = A−1 b
x = A−1 b .

This argument shows that if A−1 exists, then for any right hand side b, the

system of equations Ax = b has a unique solution. If A−1 does not exist, then

the system Ax = b may either have many solutions or no solution.

Example A.9

Consider the system of equations Ax = b, where

A = [2 1; 5 2]

and

b = [3; 7] .

Using the inverse of A computed in Example A.8, the solution is

x = A−1 b = [−2 1; 5 −2] [3; 7] = [1; 1] .

Definition A.4
If A is an n by n matrix and k is a positive integer, then Ak denotes the product of k copies of A. By convention, we define A0 = I.

Definition A.5
The transpose of a matrix A is obtained by taking the columns of A and writing them as the rows of the transpose. The transpose of A is denoted by AT .

Example A.10

Let

2 1

A= .

5 2

Then

2 5

AT = .

1 2

Definition A.6

A matrix is symmetric if A = AT .

In addition to square diagonal matrices, we will have occasion to refer to matrices of size m by n which have nonzero entries only on the diagonal.

Definition A.7
An m by n matrix S is diagonal if Si,j = 0 whenever i ≠ j.

We will also work with non square matrices which are upper triangular.

lower triangular matrix

Definition A.8

An m by n matrix R is upper triangular if Ri,j = 0 whenever

i > j. A matrix L is lower triangular if LT is upper triangular.

Example A.11

The matrix

1 0 0 0 0

S= 0 2 0 0 0

0 0 3 0 0

is diagonal, and the matrix

R = [1 2 3; 0 2 4; 0 0 5; 0 0 0]

is upper triangular.

Theorem A.1

The following statements are true for any scalars s and t and any

matrices A, B, and C. It is assumed that the matrices are of the

appropriate size for the operations involved and that whenever an

inverse occurs, the matrix is invertible.

1. A + 0 = 0 + A = A.

2. A + B = B + A.

3. (A + B) + C = A + (B + C).

4. A(BC) = (AB)C.

5. A(B + C) = AB + AC.

6. (A + B)C = AC + BC.

7. (st)A = s(tA).

8. s(AB) = (sA)B = A(sB).

9. (s + t)A = sA + tA.

10. s(A + B) = sA + sB.

11. (AT )T = A.

12. (sA)T = s(AT ).

13. (A + B)T = AT + BT .

14. (AB)T = BT AT .

15. (AB)−1 = B−1 A−1 .

16. (A−1 )−1 = A.

17. (AT )−1 = (A−1 )T .

18. If A and B are n by n matrices, and AB = I, then A−1 = B

and B−1 = A.

The first ten rules in this list are identical to rules of conventional alge-

bra, and you should have little trouble in applying them. The rules involving

transposes and inverses are new, but they can be mastered without too much

trouble.

Many students have difficulty with the following statements, which would

appear to be true on the surface, but which are in fact false for at least some

matrices.

1. AB = BA.

2. If AB = 0, then A = 0 or B = 0.

3. If AB = AC and A 6= 0, then B = C.

It is not difficult to construct specific matrices for which each of these statements is false.

A.3 Linear Independence

Definition A.9
The vectors v1 , v2 , . . ., vn are linearly independent if the system of equations

c1 v1 + c2 v2 + . . . + cn vn = 0 (A.4)

has only the trivial solution c = 0. If there are multiple solutions,

then the vectors are linearly dependent.

How can we determine whether a given set of vectors is linearly independent? Just solve the system of equations (A.4).

Example A.12

Let

1 2 3

A= 4 5 6 .

7 8 9

Are the columns of A linearly independent vectors? To determine this, we set up the system of equations Ax = 0 in an augmented matrix, and then find the RREF

1 0 −1 0

0 1 2 0 .

0 0 0 0

From the RREF, we see that the solutions are of the form

x = x3 [1; −2; 1] .

We can set x3 = 1 and obtain the nonzero solution

x = [1; −2; 1] .

Thus the columns of A are linearly dependent.

There is an important connection between linear independence and the existence of the inverse of a matrix. For example, it can be shown that if the columns of an n by n matrix A

are linearly independent, then A−1 exists, and the system of equations Ax = b

has a unique solution for every right hand side b.

A.4 Subspaces of Rn

So far, we have worked with vectors of real numbers in the n dimensional space

Rn . There are a number of properties of Rn that make it convenient to work

with vectors. First, the operation of vector addition always works— we can take

any two vectors in Rn and add them together and get another vector in Rn .

Second, we can multiply any vector in Rn by a scalar and obtain another vector

in Rn . Finally, we have the 0 vector, with the property that for any vector x,

x + 0 = 0 + x = x.

Definition A.10
A subspace W of Rn is a subset of Rn with the following properties:

1. If x and y are vectors in W , then x + y is in W .

2. If x is a vector in W and s is a scalar, then sx is in W .

3. The 0 vector is in W .

Example A.13
The plane defined by the equation

x1 + x2 + x3 = 0

is a subspace of R3 . If we take two vectors in the plane and add them together, we get another vector in the

plane. If we take a vector in this plane and multiply it by any scalar,

we get another vector in the plane. Finally, 0 is a vector in the plane.

Because a subspace is closed under vector addition and scalar multiplication, all of the rules of matrix/vector algebra apply within it. The most important subspace of

Rn that we will work with in this course is the null space of an m by n matrix.

Definition A.11
The null space of an m by n matrix A, written N (A), is the set of all vectors x such that Ax = 0.

To show that N (A) is a subspace of Rn , we need to show three things:

1. If x and y are in N (A), then Ax = 0 and Ay = 0. Adding these equations, we find that A(x + y) = 0. Thus x + y is in N (A).

2. If x is in N (A) and s is a scalar, then Ax = 0. We can multiply this equation by s to get sAx = 0. Thus A(sx) = 0, and sx is in N (A).

3. A0 = 0, so 0 is in N (A).

To find the null space of a matrix, we solve the system of equations Ax = 0.

Example A.14

Let

3 1 9 4

A= 2 1 7 3 .

5 2 16 7

We wish to find the null space of A, so we must solve Ax = 0. To solve the equations, we put the system of equations into

an augmented matrix

3 1 9 4 0

2 1 7 3 0

5 2 16 7 0

and find the RREF

1 0 2 1 0

0 1 3 1 0 .

0 0 0 0 0

From the augmented matrix, we find that

x = x3 [−2; −3; 1; 0] + x4 [−1; −1; 0; 1] .

Any vector in the null space can be written as a linear combination

of the above vectors, so the null space is a two dimensional plane

within R4 .

Now, consider the problem of solving Ax = b, where

22

b = 17 .

39

It happens that one particular solution to this system of equations

is

1

2

p= 1 .

2

However, we can take any vector in the null space of A and add it to

this solution to obtain another solution. Suppose that x is in N (A).

Then

A(x + p) = Ax + Ap

A(x + p) = 0 + b

A(x + p) = b.

For example,

x = [1; 2; 1; 2] + 2 [−2; −3; 1; 0] + 3 [−1; −1; 0; 1]

is also a solution to Ax = b.
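The null space calculations in this example can be checked in MATLAB; note that null returns an orthonormal basis for N (A), which spans the same subspace as the basis found from the RREF (a sketch):

    % Sketch: null space and non-uniqueness for the matrix of Example A.14.
    A = [3 1 9 4; 2 1 7 3; 5 2 16 7];
    b = [22; 17; 39];
    rref([A b])            % exposes the free variables x3 and x4
    N = null(A);           % orthonormal basis for N(A) (two columns)
    p = [1; 2; 1; 2];      % particular solution from the example
    x = p + N*[2; 3];      % adding any null space vector gives another solution
    A*x - b                % zero, up to roundoff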

In the context of inverse problems, the null space is critical because the

presence of a nontrivial null space leads to non uniqueness in the solution to a

linear system of equations.

Definition A.12

A basis for a subspace W is a set of vectors v1 , . . ., vp such that

1. Any vector in W can be written as a linear combination of the

basis vectors.

2. The basis vectors are linearly independent.

Definition A.13

The standard basis for Rn is the set of vectors e1 , . . ., en such

that the elements of ei are all zero, except for the ith element, which

is one.

Any nontrivial subspace W of Rn will have many different bases. For exam-

ple, we can take any basis and multiply one of the basis vectors by 2 to obtain a

new basis. However, it is possible to show that all bases for a subspace W have

the same number of basis vectors.

Theorem A.2

Let W be a subspace of Rn with basis v1 , v2 , . . ., vp . Then all bases

for W have p basis vectors, and p is the dimension of W .

Example A.15

In the previous example, the vectors

v1 = [−2; −3; 1; 0]    and    v2 = [−1; −1; 0; 1]

form a basis for the null space of A because any vector in the null

space can be written as a linear combination of v1 and v2 , and

because the vectors v1 and v2 are linearly independent. Since the

basis has two vectors, the dimension of the null space of A is two.

It can be shown that the procedure used in the above example always produces a basis for N (A).

Definition A.14

Let A be an m by n matrix. The column space or range of A

(written R(A)) is the set of all vectors b such that Ax = b has at

least one solution. In other words, the column space is the set of all

vectors b that can be written as a linear combination of the columns

of A.

For the linear inverse problem Gm = d, the range of G is important because R(G) consists of all vectors d for which there is a model m that exactly fits the data.

To find the column space of a matrix, we consider what happens when we

compute the RREF of [A | b]. In the part of the augmented matrix correspond-

ing to the left hand side of the equations we always get the same result, namely

the RREF of A. The solution to the system of equations may involve some

free variables, but we can always set these free variables to 0. Thus when we

can solve Ax = b, we can solve the system of equations by using only variables

corresponding to the pivot columns in the RREF of A. In other words, if we

can solve Ax = b, then we can write b as a linear combination of the pivot

columns of A. Note that these are columns from the original matrix A, not

columns from the RREF of A.

Example A.16

As in the previous example, let

3 1 9 4

A= 2 1 7 3 .

5 2 16 7

RREF of A is

1 0 2 1

0 1 3 1 .

0 0 0 0

Thus whenever we can solve Ax = b, we can find a solution in

which x3 and x4 are 0. In other words, whenever there is a solution

to Ax = b, we can write b as a linear combination of the first two

columns of A

b = x1 [3; 2; 5] + x2 [1; 1; 2] .

Since these two vectors are linearly independent and span R(A),

they form a basis for R(A). The dimension of R(A) is two.

The column space of A is important because it tells us when we can solve Ax = b.

In finding the null space and range of a matrix A we found that the basis

vectors for N (A) corresponded to non pivot columns of A, while the basis

vectors for R(A) corresponded to pivot columns of A. Since the matrix A had

n columns, we obtain the theorem.

Theorem A.3
Let A be an m by n matrix. Then dim N (A) + dim R(A) = n.

In addition to the null space and range of a matrix A, we will often work

with the null space and range of the transpose of A. Since the columns of AT

are rows of A, the column space of AT is also called the row space of A. Since

each row of A can be written as a linear combination of the nonzero rows of the

RREF of A, the nonzero rows of the RREF form a basis for the row space of

A. There are exactly as many nonzero rows in the RREF of A as there are

pivot columns. Thus we have the following theorem.

Theorem A.4
Let A be an m by n matrix. Then dim R(AT ) = dim R(A).

Definition A.15
The rank of an m by n matrix A is the dimension of R(A).

A.5 Orthogonality and the Dot Product

Definition A.16
The dot product of two vectors x and y in Rn is

x · y = xT y = x1 y1 + x2 y2 + . . . + xn yn .

Definition A.17
The Euclidean length, or 2–norm, of a vector x in Rn is

‖x‖2 = √(xT x) = √(x1² + x2² + . . . + xn²) .

[Figure A.1: Relationship between the dot product and the angle between two vectors.]

Later we will introduce two other ways of measuring the “length” of a vector.

The subscript 2 is used to distinguish this 2–norm from the other norms.

You may be familiar with an alternative definition of the dot product in

which x · y = kxkkyk cos(θ) where θ is the angle between the two vectors. The

two definitions are equivalent. To see this, consider a triangle with sides x, y,

and x − y. See Figure A.1. The angle between sides x and y is θ. By the law of

cosines,

(x − y)T (x − y) = xT x + yT y − 2kxk2 kyk2 cos(θ)

xT x − 2xT y + yT y = xT x + yT y − 2kxk2 kyk2 cos(θ)

−2xT y = −2kxk2 kyk2 cos(θ)

xT y = kxk2 kyk2 cos(θ).

We can also use this formula to compute the angle between two vectors.

θ = cos−1( xT y / (‖x‖2 ‖y‖2 ) ) .

Definition A.18
Two vectors x and y in Rn are orthogonal, or perpendicular (written x ⊥ y), if xT y = 0.

Definition A.19
A set of vectors v1 , v2 , . . ., vp is orthogonal if each pair of vectors

in the set is orthogonal.

Definition A.20

Two subspaces V and W of Rn are orthogonal or perpendicular

if every vector in V is perpendicular to every vector in W .

Suppose that x is in N (A), so that Ax = 0. Since each element of Ax can be obtained by taking the dot product of a row of A and x, x is perpendicular

to each row of A. Since x is perpendicular to all of the columns of AT , it is

perpendicular to R(AT ). We have the following theorem.

Theorem A.5

Let A be an m by n matrix. Then

N (A) ⊥ R(AT ) .

Furthermore,

N (A) + R(AT ) = Rn .

That is, any vector x in Rn can be written uniquely as x = p + q

where p is in N (A) and q is in R(AT ).

Definition A.21

A basis in which the basis vectors are orthogonal is an orthogonal

basis.

Definition A.22

An n by n matrix Q is orthogonal if the columns of Q are orthog-

onal and each column of Q has length one.

Theorem A.6

If Q is an orthogonal matrix, then:

1. QT Q = I. In other words, Q−1 = QT .

2. For any vector x in Rn , ‖Qx‖2 = ‖x‖2 .

3. For any two vectors x and y in Rn , xT y = (Qx)T (Qy) .

We often want to project a vector x onto another vector y or onto a subspace W to obtain a projected vector p. See Figure A.2. We know that

cos(θ) = ‖p‖2 / ‖x‖2 .

Thus

‖p‖2 = xT y / ‖y‖2 .

Since p points in the same direction as y,

p = projy x = (xT y / yT y) y .

The vector p is called the orthogonal projection of x onto y.

Similarly, if W is a subspace of Rn with an orthogonal basis w1 , w2 , . . .,

wp , then the orthogonal projection of x onto W is

p = projW x = (xT w1 / w1T w1 ) w1 + (xT w2 / w2T w2 ) w2 + . . . + (xT wp / wpT wp ) wp .

The Gram–Schmidt orthogonalization process can be used to turn any

basis for a subspace of Rn into an orthogonal basis. We begin with a basis v1 , v2 , . . ., vp . The process recursively constructs an orthogonal basis by taking each vector in the original basis and then subtracting off its projection onto the space spanned by the previous vectors. The formulas are

w1 = v1

w2 = v2 − (v1T v2 / v1T v1 ) v1

. . .

wp = vp − (w1T vp / w1T w1 ) w1 − . . . − (wp−1T vp / wp−1T wp−1 ) wp−1 .

Unfortunately, the Gram–Schmidt process is numerically unstable when applied

to large bases. In MATLAB the command orth provides a numerically stable

way to produce an orthogonal basis from a nonorthogonal basis.
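A direct MATLAB implementation of these formulas is short (a sketch; the function name gramschmidt is ours). For large or nearly dependent bases, orth should be preferred.

    % Sketch: classical Gram-Schmidt applied to the columns of V.
    function W = gramschmidt(V)
        W = zeros(size(V));
        for j = 1:size(V,2)
            w = V(:,j);
            for i = 1:j-1
                % subtract the projection of V(:,j) onto w_i
                w = w - (W(:,i)'*V(:,j))/(W(:,i)'*W(:,i))*W(:,i);
            end
            W(:,j) = w;
        end
    end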

An important property of the orthogonal projection is that the projection

of x onto W is the point in W which is closest to x. In the special case that

x is in W , the projection of x onto W is x. This provides a convenient way to

write a vector in W as a linear combination of the orthogonal basis vectors.

Example A.17

In this example, we will find the point p on the plane

x1 + x2 + x3 = 0

that is closest to a given point x. First, we must find an orthogonal basis for our subspace. Our sub-

space is the null space of the matrix

A= 1 1 1

Using the RREF, we find that the null space has the basis

−1 −1

u1 = 1 u2 = 0 .

0 1

Unfortunately, this basis is not orthogonal. Using the MATLAB

orth command, we obtain the orthogonal basis

−0.7071 0.4082

w1 = 0 w2 = −0.8165 .

0.7071 0.4082

Using this orthogonal basis, we compute the projection of x onto

the plane

p = (xT w1 / w1T w1 ) w1 + (xT w2 / w2T w2 ) w2 .

The result is

p = [−1; 0; 1] .

When the system of equations Ax = b has no exact solution, we may still want to find an approximate solution. A natural measure of the quality of an ap-

proximate solution is the distance from Ax to b, kAx − bk. A solution which

minimizes kAx − bk2 is called a least squares solution, because it minimizes

the sum of the squares of the errors.

The least squares solution can be obtained by projecting b onto the range

of A. This calculation requires us to first find an orthogonal basis for R(A).

There is an alternative approach which does not require the orthogonal basis.

Let

Ax = p = projR(A) b .

Then Ax − b is perpendicular to R(A). In particular, each of the columns of

A is orthogonal to Ax − b. Thus

AT (Ax − b) = 0 .

AT Ax = AT b . (A.7)

This last system of equations is referred to as the normal equations for the

least squares problem. It can be shown that if the columns of A are linearly

independent, then the normal equations have exactly one solution. This solution

minimizes the sum of squared errors.

Example A.18

Let

A = [1 2; 1 3; 1 1; 2 2]

and

b = [1; 2; 3; 4] .

It is easy to see that the system of equations Ax = b is inconsis-

tent. We will find the least squares solution by solving the normal

equations.

AT A = [7 10; 10 18] ,

AT b = [14; 19] .

Solving the normal equations gives the least squares solution

x = [2.3846; −0.2692] .
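In MATLAB, the normal equations solution for this example can be computed directly (a sketch):

    % Sketch: least squares via the normal equations for Example A.18.
    A = [1 2; 1 3; 1 1; 2 2];
    b = [1; 2; 3; 4];
    x = (A'*A) \ (A'*b)    % approximately [2.3846; -0.2692]
    % A\b returns the same least squares solution using a more stable algorithm.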

A.6 Eigenvalues and Eigenvectors

Definition A.23
A scalar λ is an eigenvalue of the n by n matrix A, with an associated nonzero eigenvector x, if

Ax = λx .

Eigenvalues and eigenvectors are useful in solving systems of difference equations of the form

x(k + 1) = Ax(k)

and systems of differential equations of the form

x′(t) = Ax(t) .

In order to find eigenvalues and eigenvectors, we consider the equation

Ax = λx ,

which can be rewritten as

(A − λI)x = 0 .    (A.8)

In order to find nonzero eigenvectors, the matrix A − λI must be singular. This leads to the characteristic equation

det(A − λI) = 0 .    (A.9)

For small matrices, we can solve the characteristic equation (A.9) to find the eigenvalues and then substitute the eigenvalues

into (A.8) and solve the system of equations to find corresponding eigenvectors.

Notice that the eigenvalues can in general be complex numbers. For larger ma-

trices solving the characteristic equation becomes completely impractical and

more sophisticated numerical methods are used. The MATLAB command eig

can be used to find eigenvalues and eigenvectors of a matrix.
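For example (a sketch; the matrix is an arbitrary illustration):

    % Sketch: eigenvalues and eigenvectors with eig.
    A = [2 1; 1 2];
    [V, D] = eig(A);       % columns of V are eigenvectors, D is diagonal
    A*V - V*D              % zero, up to roundoff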

Suppose that we can find a set of n linearly independent eigenvectors

v1 , v2 , . . . , vn

of A, with associated eigenvalues

λ1 , λ2 , . . . , λn .

These eigenvectors form a basis for Rn . We can use the eigenvectors to diagonalize the matrix as

A = PΛP−1 ,

where

P= v1 v2 . . . vn

and

λ1 0 ... 0

0 λ2 0 ...

Λ=

... ... ... ... .

0 ... 0 λn

To see that this works, simply compute AP

AP = A v1 v2 . . . vn

AP = λ1 v1 λ2 v2 . . . λn vn

AP = PΛ .

Thus

A = PΛP−1 .

Not all matrices are diagonalizable, because not all matrices have n linearly

independent eigenvectors. However, there is an important special case in which

matrices can always be diagonalized.

Theorem A.7

If A is a real symmetric matrix, then A can be written as

A = QΛQT ,

where Q is an orthogonal matrix whose columns are eigenvectors of A and Λ is a real diagonal matrix of the eigenvalues of A.

This orthogonal diagonalization of a symmetric matrix will be useful later on when we consider orthogonal factorizations of general matrices.

The eigenvalues of symmetric matrices are particularly important in the

analysis of quadratic forms.

Definition A.24
A quadratic form is a function of the form

f (x) = xT Ax ,

where A is a symmetric n by n matrix. The quadratic form f (x) is positive definite (PD) if f (x) ≥ 0 for all x and f (x) = 0 only when x = 0. The quadratic form is positive semidefinite (PSD) if f (x) ≥ 0 for all x. Similarly, a symmetric matrix A is positive definite if the associated quadratic form f (x) = xT Ax is positive definite. The quadratic form is negative semidefinite if −f (x) is positive semidefinite. If f (x) is neither positive semidefinite nor negative semidefinite, then f (x) is indefinite.

Positive definite quadratic forms have a useful geometric interpretation. Let A be a positive definite and symmetric matrix. Then the region defined by the inequality

(x − c)T A(x − c) ≤ δ

is an ellipsoidal volume, with its center at c. We can diagonalize A as

A = PΛP−1 ,

where the columns of P are normalized eigenvectors of A, and Λ is a diagonal matrix whose entries are the eigenvalues of A. It can be shown that the ith eigenvector of A points in the direction of the ith semi major axis of the ellipsoid, and the length of the ith semi major axis is given by √(δ/λi) .

An important connection between positive semidefinite matrices and eigen-

values is the following theorem.

Theorem A.8

A symmetric matrix A is positive semidefinite if and only if its eigen-

values are greater than or equal to 0.

Thus we can check whether a symmetric matrix is positive semidefinite by computing its eigenvalues. However, for large matrices, computing all of the eigenvalues can

be time consuming.

The Cholesky factorization provides an another way to determine whether

or not a symmetric matrix is positive definite.

Theorem A.9

Let A be an n by n positive definite and symmetric matrix. Then

A can be written uniquely as

A = RT R ,

where R is an upper triangular matrix. Furthermore, A can be

factored in this way only if it is positive definite.

The MATLAB command chol can be used to compute the Cholesky factorization of a symmetric and positive definite matrix.
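A quick way to test positive definiteness numerically is the two-output form of chol (a sketch; the matrix is an arbitrary example):

    % Sketch: testing positive definiteness with chol.
    A = [4 1; 1 3];
    [R, p] = chol(A);      % p == 0 indicates that A is positive definite
    R'*R                   % reproduces A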

We will need to compute derivatives of quadratic forms.

Theorem A.10

Let f (x) = xT Ax where A is an n by n symmetric matrix. Then

∇f (x) = 2Ax

and

∇2 f (x) = 2A .

A.7 Lagrange Multipliers

The method of Lagrange multipliers is an important technique for solving

optimization problems of the form

min f (x)

(A.10)

g(x) = 0.

Theorem A.11
A minimum point of (A.10) can occur only at a point x where

∇f (x) + λ∇g(x) = 0    (A.11)

for some λ.

(A.11) is called a stationary point. A stationary point may be a minimum,

but it could also be a maximum point or a saddle point. Thus it is necessary to

check all of the stationary points to find the true minimum.

The Lagrange multiplier condition can be extended to problems of the form

min f (x)

(A.12)

g(x) ≤ 0 .

Theorem A.12

A minimum point of (A.12) can occur only at a point x where

∇f (x) + λ∇g(x) = 0    (A.13)

for some λ ≥ 0.

Example A.19

We wish to solve

min x1 + x2
subject to x1² + x2² − 1 ≤ 0 .

The Lagrange multiplier condition is

[1; 1] + λ [2x1; 2x2] = 0 .

Together with the constraint, this system has two solutions. The first solution is x1 = 0.7071, x2 = 0.7071, with λ = −0.7071. Since λ < 0, this point does not satisfy the Lagrange multiplier condition. In fact, it is the maximum of f (x) subject to g(x) ≤ 0. The second solution to the Lagrange multiplier equations is x1 = −0.7071, x2 = −0.7071, with λ = 0.7071. Since this is the only solution with λ ≥ 0, this point

solves the minimization problem.

Notice that (A.13) is (except for the condition λ > 0) the necessary condition

for a minimum point of the unconstrained minimization problem

min f (x) + λg(x) .

Thus, if the appropriate value of λ is known, we can find the constrained minimum x∗ by solving this unconstrained problem, provided that g(x∗ ) ≤ 0. We will make frequent use of this technique to convert constrained

optimization problems into unconstrained optimization problems.

A.8 Taylor’s Theorem

Theorem A.13

Suppose that f (x) and its first and second partial derivatives are

continuous. Then given some point x0 , we can write f (x) as

f (x) = f (x0 ) + ∇f (x0 )T (x − x0 ) + (1/2)(x − x0 )T ∇2 f (c)(x − x0 )    (A.15)

for some point c between x and x0 .

f (x) ≈ f (x0 ) + ∇f (x0 )T (x − x0 ) + (1/2)(x − x0 )T ∇2 f (x0 )(x − x0 ) .    (A.16)

We will frequently make use of this approximation. Taylor’s theorem can easily be specialized to functions of a single variable:

f (x) ≈ f (x0 ) + f ′(x0 )(x − x0 ) + (1/2) f ′′(x0 )(x − x0 )² .

The theorem can also be extended to include terms associated with higher order

derivatives, but we will not need this.

The gradient ∇f (x0 ) has an important geometric interpretation. This vector

points in the direction in which f (x) increases most rapidly as we move away

from x0 . The Hessian, ∇2 f (x0 ), plays a role similar to the second derivative of

a function of a single variable.

Theorem A.14
If ∇2 f (x) is a positive semidefinite matrix, then f (x) is convex in a neighborhood

of x. If ∇2 f (x) is positive definite, then f (x) is strictly convex.

A.9 Vector and Matrix Norms

Although the conventional Euclidean length is most commonly used, there are

alternative ways to measure the length of a vector.

Definition A.25
Any measure of the size or length of a vector that satisfies the following four properties can be used as a norm.

1. For any vector x, ‖x‖ ≥ 0.

2. For any vector x and any scalar s, ‖sx‖ = |s| ‖x‖.

3. For any vectors x and y, ‖x + y‖ ≤ ‖x‖ + ‖y‖.

4. ‖x‖ = 0 if and only if x = 0.

Definition A.26
The p–norm of a vector x in Rn is

‖x‖p = ( |x1|^p + |x2|^p + . . . + |xn|^p )^{1/p} .

It can be shown that for any p ≥ 1, the p–norm satisfies the conditions of Definition A.25. The conventional Euclidean length is just the 2–norm. Two

other p–norms are commonly used. The 1–norm is the sum of the absolute

values of the entries in x. The infinity–norm is obtained by taking the limit

as p goes to infinity. The infinity–norm is the maximum of the absolute values

of the entries in x. The MATLAB command norm can be used to compute the

norm of a vector. It has options for the 1, 2, and infinity norms.

Example A.20

Let

−1

x = 2 .

−3

Then

kxk1 = 6

√

kxk2 = 14

and

kxk∞ = 3
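These values can be checked with the norm command (a sketch):

    % Sketch: the vector norms of Example A.20.
    x = [-1; 2; -3];
    norm(x, 1)             % 1-norm: 6
    norm(x)                % 2-norm: sqrt(14)
    norm(x, inf)           % infinity-norm: 3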

The 1–norm and infinity–norm are also useful in finding approximate solutions to over determined linear systems of equations. To minimize the maximum of the

errors, we minimize kAx − bk∞ . To minimize the sum of the absolute values of

the errors, we minimize kAx−bk1 . Unfortunately, these minimization problems

can be much harder to solve than the least squares problem.

Definition A.27

Any measure of the size or length of an m by n matrix that satisfies

the following five properties can be used as a matrix norm.

1. For any matrix A, kAk ≥ 0.

2. For any matrix A and any scalar s, ksAk = |s|kAk.

3. For any matrices A and B, kA + Bk ≤ kAk + kBk.

4. kAk = 0 if and only if A = 0.

5. For any two matrices A and B of compatible sizes, kABk ≤

kAkkBk.

Definition A.28
The p–norm of a matrix A is defined by

‖A‖p = max_{‖x‖p = 1} ‖Ax‖p .

In this formula, ‖x‖p and ‖Ax‖p are vector p–norms, while ‖A‖p is the matrix p–norm of A.

Computing the p–norm of a matrix directly from this definition can be extremely difficult. Fortunately, there are simpler formulas for the most commonly used matrix p–norms:

‖A‖1 = max_j ∑_{i=1}^{m} |Ai,j|    (A.17)

‖A‖2 = √( λmax(AT A) )    (A.18)

‖A‖∞ = max_i ∑_{j=1}^{n} |Ai,j| .    (A.19)

Definition A.29

The Frobenius norm of an m by n matrix is given by

‖A‖F = √( ∑_{i=1}^{m} ∑_{j=1}^{n} A²i,j ) .

Definition A.30

A matrix norm and a vector norm are compatible if

kAxk ≤ kAkkxk .

The matrix p–norm is (by its definition) compatible with the vector p–norm

from which it was derived. It can also be shown that the Frobenius norm of a

matrix is compatible with the vector 2–norm. Thus the Frobenius norm is often

used with the vector 2–norm.

In practice, the Frobenius norm, 1–norm, and infinity–norm of a matrix are

easy to compute, while the 2–norm of a matrix can be difficult to compute for

large matrices. The MATLAB command norm has options for computing the

1, 2, infinity, and Frobenius norms of a matrix.

A.10 The Condition Number

Suppose that we want to solve a system of n equations in n variables, Ax = b, but that because of errors in the right hand side we actually solve

Ax̂ = b̂ .

How far can x̂ be from the true solution x? Subtracting the two systems of equations and taking norms,

Ax = b

Ax̂ = b̂

A(x − x̂) = b − b̂

(x − x̂) = A−1 (b − b̂)

kx − x̂k = kA−1 (b − b̂)k

kx − x̂k ≤ kA−1 kkb − b̂k .

This formula provides an absolute bound on the error in the solution. It is also

worthwhile to compute a relative error bound.

Dividing both sides by ‖b‖ and using b = Ax,

‖x − x̂‖ / ‖b‖ ≤ ‖A−1‖ ‖b − b̂‖ / ‖b‖
‖x − x̂‖ / ‖Ax‖ ≤ ‖A−1‖ ‖b − b̂‖ / ‖b‖
‖x − x̂‖ ≤ ‖Ax‖ ‖A−1‖ ‖b − b̂‖ / ‖b‖
‖x − x̂‖ ≤ ‖A‖ ‖x‖ ‖A−1‖ ‖b − b̂‖ / ‖b‖
‖x − x̂‖ / ‖x‖ ≤ ‖A‖ ‖A−1‖ ‖b − b̂‖ / ‖b‖ .

The relative error in the right hand side is measured by ‖b − b̂‖ / ‖b‖, and the relative error in x is measured by ‖x − x̂‖ / ‖x‖.

The quantity

cond(A) = ‖A‖ ‖A−1‖

is called the condition number of A.


Note that nothing that we did in the calculation of the condition number

depends on which norm we used. The condition number can be computed

using the 1–norm, 2–norm, infinity–norm, or Frobenius norm. The MATLAB

command cond can be used to find the condition number of a matrix. It has

options for the 1, 2, infinity, and Frobenius norms.
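The following sketch shows the effect of ill-conditioning on a small, artificially constructed system; the matrix is an illustrative assumption.

    % Sketch: a small ill-conditioned system.
    A = [1 1; 1 1.0001];
    cond(A)                   % roughly 4e4
    x  = A \ [2; 2.0001];     % solution [1; 1]
    xh = A \ [2; 2.0002];     % right hand side perturbed in the fifth digit
    xh - x                    % the solution changes in its leading digits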

The condition number provides an upper bound on how inaccurate the solu-

tion to a system of equations might be because of errors in the right hand side.

In some cases, the condition number greatly overestimates the error in the solu-

tion. As a practical matter, it is wise to assume that the error is of roughly the

size predicted by the condition number. In practice, floating point arithmetic

only allows us to store numbers to about 16 digits of precision. If the condition

number is greater than 1016 , then by the above inequality, there may be no

accurate digits in the computer solution to the system of equations. Systems of

equations with very large condition numbers are called ill-conditioned.

It is important to understand that ill-conditioning is a property of the system

of equations and not the algorithm used to solve the system of equations. Ill-

conditioning cannot be fixed simply by using a better algorithm. Instead, we

must either increase the precision of our computer or find a different, better

conditioned system of equations to solve.

A.11 The QR Factorization

Although the theory of linear algebra can be developed using the reduced row

echelon form, there is an alternative computational approach which works bet-

ter in practice. The basic idea is to compute factorizations of matrices that

involve orthogonal, diagonal, and upper triangular matrices. This alternative

approach leads to algorithms which can quickly compute accurate solutions to

linear systems of equations and least squares problems.

Theorem A.15
Let A be an m by n matrix. Then A can be written as
$$A = QR \, ,$$
where Q is an m by m orthogonal matrix and R is an m by n upper triangular matrix. This is called the QR factorization of A.


In many cases of interest, A will have more rows than columns (m > n), and the rank of A will be n. In this case, we can write
$$R = \begin{bmatrix} R_1 \\ 0 \end{bmatrix} \, ,$$
where $R_1$ is n by n, and
$$Q = \begin{bmatrix} Q_1 & Q_2 \end{bmatrix} \, ,$$
where $Q_1$ is m by n and $Q_2$ is m by m − n. In this case the QR factorization has some important properties.

Theorem A.16
Suppose that A is an m by n matrix with m > n and rank(A) = n. Then

1. The columns of $Q_1$ form an orthogonal basis for R(A).

2. The columns of $Q_2$ form an orthogonal basis for N(A^T).

3. The matrix $R_1$ is nonsingular.

Now consider the least squares problem
$$\min \|Ax - b\|_2 \, .$$
Since multiplying a vector by an orthogonal matrix does not change its length, this is equivalent to
$$\min \|Q^T (Ax - b)\|_2 \, .$$
But
$$Q^T A = Q^T Q R = R \, .$$
So, we have
$$\min \|Rx - Q^T b\|_2$$
or
$$\min \left\| \begin{bmatrix} R_1 x - Q_1^T b \\ 0x - Q_2^T b \end{bmatrix} \right\|_2 \, .$$

Whatever value of x we pick, we will probably end up with nonzero error because

of the 0x − QT2 b part of the least squares problem. We cannot minimize the

norm of this part of the vector. However, we can find an x that exactly solves

R1 x = QT1 b. Thus we can minimize the least squares problem by solving the

square system of equations

R1 x = QT1 b. (A.20)

The advantage of solving this system of equations instead of the normal equa-

tions (A.7) is that the normal equations are typically much more badly condi-

tioned than (A.20).
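A minimal MATLAB sketch of this approach (ours, assuming a full column rank A with m > n):

% Solve min ||Ax - b||_2 via the QR factorization and compare with
% the normal equations.
m = 100; n = 5;
A = randn(m,n);
b = randn(m,1);

[Q1,R1] = qr(A,0);       % economy-size QR: Q1 is m by n, R1 is n by n
x_qr = R1\(Q1'*b);       % solve R1*x = Q1'*b, as in (A.20)

x_ne = (A'*A)\(A'*b);    % normal equations, for comparison

norm(x_qr - x_ne)        % the two solutions agree to roundoff here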

A.12 Linear Algebra in More General Spaces

So far, we have considered only vectors in Rn . However, the concepts of linear

algebra can be extended to other contexts. In general, as long as the objects

that we want to consider can be multiplied by scalars and added together, and

as long as they obey the laws of vector algebra, then we have a vector space in

which we can practice linear algebra. If we can also define a dot product, then we

have what is called an inner product space, and we can define orthogonality,

projections, and the 2–norm.

There are many different vector spaces used in various areas of science and

mathematics. For our work in inverse problems, a very commonly used vector

space is the space of functions defined on an interval [a, b].

Multiplying a scalar times a function or adding two functions together clearly produces another function. In this space, the function z(x) = 0 takes the place of the 0 vector, since f(x) + z(x) = f(x). Two functions f(x) and g(x) are linearly independent if the only solution to
$$c_1 f(x) + c_2 g(x) = 0$$
is $c_1 = c_2 = 0$.

We can define the dot product of two functions f and g to be

$$f \cdot g = \int_a^b f(x) g(x) \, dx \, .$$
Another commonly used notation for this dot product or inner product of f and g is
$$f \cdot g = \langle f, g \rangle \, .$$

It is easy to show that this inner product has all of the algebraic properties of the dot product of two vectors in $R^n$. A more important motivation for defining the dot product in this way is that it leads to a useful definition of the 2–norm of a function. Following our earlier formula that $\|x\|_2 = \sqrt{x^T x}$, we have
$$\|f\|_2 = \sqrt{\int_a^b f(x)^2 \, dx} \, .$$
The 2–norm of the difference between two functions provides a measure of the distance between them:
$$\|f - g\|_2 = \sqrt{\int_a^b (f(x) - g(x))^2 \, dx} \, .$$
This measure is obviously zero when f(x) = g(x) everywhere, and it can only be zero when f(x) = g(x) except possibly at some isolated points.

Using this inner product and norm, we can reconstruct the theory of lin-

ear algebra from Rn in our space of functions. This includes the concepts of

orthogonality, projections, norms, and least squares solutions.
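As a small numerical illustration (ours, not from the text), MATLAB's integral command can approximate the inner product and 2-norm of functions on an interval:

% Inner product and 2-norm of functions on [0,1], evaluated numerically.
f = @(x) exp(-x);
g = @(x) x.^2;

fg      = integral(@(x) f(x).*g(x), 0, 1);            % f . g
norm_f  = sqrt(integral(@(x) f(x).^2, 0, 1));         % ||f||_2
dist_fg = sqrt(integral(@(x) (f(x)-g(x)).^2, 0, 1));  % ||f - g||_2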


A.13 Exercises

1. Construct an example of an over determined system of equations that has

infinitely many solutions.

2. Construct an example of an over determined system of equations that has

exactly one solution.

3. Construct an example of an under determined system of equations that

has no solutions.

4. Is it possible for an under determined system of equations to have exactly

one solution? If so, construct an example. If not, then explain why it is

not possible.

5. Let A be an m by n matrix with n pivot columns in its RREF. Can the system of equations Ax = b have infinitely many solutions?

6. Write a MATLAB routine to compute the RREF of a matrix.

%

% R=myrref(A)

%

% A an m by n input matrix.

% R output: the RREF(A)

%

Check your routine by comparing its results with results from MATLAB’s

rref command. Try some large (100 by 200) random matrices. Do you

always get the same result? If not, explain why.

7. If C = AB is a 5 by 4 matrix, then how many rows does A have? How

many columns does B have? Can you say anything about the number of

columns in A?

8. Write a MATLAB function to compute the product of two matrices

%

% C=myprod(A,B)

%

% A a matrix of size m by n

% B a matrix of size n by r

% C output: C=A*B

%

10. Construct a pair of 2 by 2 matrices A and B such that AB = 0, but neither A nor B is zero.


11. Give an example of matrices A, B, and C such that AB = AC, with A ≠ 0, but B ≠ C.

12. Let A be an m by n matrix, and let

P = A(AT A)−1 AT .

We will assume that (AT A)−1 exists, but not that A−1 exists.

(a) Show that P is symmetric.

(b) Show that P² = P.

13. Suppose that v1 , v2 and v3 are three vectors in R3 and that v3 = −2v1 +

3v2 . Are the vectors linearly dependent or linearly independent?

14. Let A be an n by n matrix. Show that if A−1 exists, then the columns

of A are linearly independent. Also show that if the columns of A are

linearly independent, then A−1 exists.

15. Show that any collection of four vectors in R³ must be linearly dependent. Show that in general, any collection of more than n vectors in Rⁿ must be linearly dependent.

16. Let
$$A = \begin{bmatrix} 1 & 2 & 3 & 4 \\ 2 & 2 & 1 & 3 \\ 4 & 6 & 7 & 11 \end{bmatrix} \, .$$

Find bases for N (A), R(A), N (AT ) and R(AT ). What are the dimensions

of the four subspaces?

17. Let A be an n by n matrix such that A−1 exists. What are N (A), R(A),

N (AT ), and R(AT )?

18. Let A be any 9 by 6 matrix. If the dimension of the null space of A is 5,

then what is the dimension of R(A)? What is the dimension of R(AT )?

What is the rank of A?

19. Suppose that a non homogeneous system of equations with four equations

and six unknowns has a solution with two free variables. Is it possible to

change the right hand side of the system of equations so that the modified

system of equations has no solutions?

20. Let W be the set of vectors x in R4 such that x1 x2 = 0. Is W a subspace

of R4 ?

21. Find vectors v₁ and v₂ such that v₁ ⊥ v₂, v₁ ⊥ x, and v₂ ⊥ x, where
$$x = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \, .$$

22. Let v1 , v2 , v3 be a set of three orthogonal vectors. Show that the vectors

are also linearly independent.

23. Show that if x ⊥ y, then $\|x + y\|_2^2 = \|x\|_2^2 + \|y\|_2^2$.

24. Let Q be a matrix with orthonormal columns. Show that
$$Q^T Q = I \, .$$

25. Let u be any nonzero vector in Rⁿ. Let
$$P = I - \frac{u u^T}{u^T u} \, .$$

Show that P is a symmetric matrix.

26. Prove the parallelogram law
$$\|x + y\|_2^2 + \|x - y\|_2^2 = 2\|x\|_2^2 + 2\|y\|_2^2 \, .$$

27. Let A be an n by n matrix in which the entries of each row add up to 1. Show that 1 is an eigenvalue of A and find a corresponding eigenvector.

28. Let A be an n by n matrix with eigenvalues λ₁, λ₂, ..., λₙ. Show that
$$\det(A) = \lambda_1 \lambda_2 \cdots \lambda_n \, .$$

29. Suppose that A is diagonalizable, with
$$A = P \Lambda P^{-1} \, .$$
Show that $A^k = P \Lambda^k P^{-1}$.


30. Suppose that A is diagonalizable and that all eigenvalues of A have absolute value less than one. What is the limit as k goes to infinity of $A^k$?

31. Let A be an m by n matrix, and let

P = A(AT A)−1 AT .

We will assume that (AT A)−1 exists, but not that A−1 exists.

(a) Show that for any vector x, Px is in R(A).

(b) Show that for any vector x, (x − Px) ⊥ R(A). Hint: Show that

x − Px is in N (AT ).

This shows that multiplying P times x computes the orthogonal projection

of x onto R(A). Can you find a similar formula for projecting onto N (A)?

32. In this exercise, we will derive the formula (A.17) for the 1–norm of a matrix. Begin with the optimization problem
$$\|A\|_1 = \max_{\|x\|_1 = 1} \|Ax\|_1 \, .$$

(a) Show that for any x with $\|x\|_1 = 1$,
$$\|Ax\|_1 \le \max_j \sum_{i=1}^{m} |A_{i,j}| \, .$$

(b) Find a particular vector x with $\|x\|_1 = 1$ for which
$$\|Ax\|_1 = \max_j \sum_{i=1}^{m} |A_{i,j}| \, .$$

(c) Conclude that
$$\|A\|_1 = \max_{\|x\|_1 = 1} \|Ax\|_1 = \max_j \sum_{i=1}^{m} |A_{i,j}| \, .$$

33. Derive the formula (A.19) for the infinity norm of a matrix.

34. In this exercise we will derive the formula (A.18) for the 2–norm of a matrix. Begin with the maximization problem
$$\max_{\|x\|_2 = 1} \|Ax\|_2^2 \, .$$
Note that we have squared $\|Ax\|_2$. We will take the square root at the end of the problem.

(a) Using the formula $\|x\|_2 = \sqrt{x^T x}$, rewrite the above maximization problem without norms.

(b) Using the method of Lagrange multipliers, write down the system of equations that must be satisfied by any stationary point of the maximization problem.

(c) Explain how the eigenvalues and eigenvectors of AT A are related to

this system of equations. Express the solution to the maximization

problem in terms of the eigenvalues and eigenvectors of AT A.

(d) Use this solution to get kAk2 .

35. In this exercise we will derive the normal equations using calculus. Let

f (x) = kAx − bk22

We want to minimize f (x).

(a) Rewrite f (x) as a dot product and then expand.

(b) Find ∇f (x).

(c) Set ∇f (x) = 0, and obtain the normal equations.

36. Let A be an m by n matrix.

(a) Show that AT A is symmetric.

(b) Show that AT A is positive semidefinite. Hint: Use the definition of

PSD rather than trying to compute eigenvalues.

(c) Show that if rank(A) = n, then the only vector x such that Ax = 0

is x = 0. Use this to show that if rank(A) = n, then AT A is positive

definite.

37. Solve the system of equations

x + y = 1

1.0000001x + y = 2.

Now change the two on the right hand side of the second equation to a

one. Find the new solution. In both cases, plot the solutions. Why did

the solution change so much?

38. Show that

cond(AB) ≤ cond(A)cond(B).

39. Consider the Hilbert matrix of order 10,
$$H = \begin{bmatrix} 1 & 1/2 & \cdots & 1/10 \\ 1/2 & 1/3 & \cdots & 1/11 \\ \vdots & \vdots & \ddots & \vdots \\ 1/10 & 1/11 & \cdots & 1/19 \end{bmatrix} \, ,$$
where $H_{i,j} = 1/(i + j - 1)$. This matrix can be computed with the hilb command in MATLAB. The Hilbert matrix is known to be extremely badly conditioned. Solve Hx = b, where b is the last column of the identity matrix. How accurate is your solution likely to be?


40. Let A be a symmetric and positive definite matrix with Cholesky factor-

ization

A = RT R.

Show how the Cholesky factorization can be used to solve Ax = b by

solving two systems of equations, each of which has R or RT as its matrix.

41. Show that if A is an m by n matrix with QR factorization

A = QR

then

kAkF = kRkF .

42. Let P3 [0, 1] be the space of polynomials of degree less than or equal to 3

on the interval [0, 1]. The polynomials p1 (x) = 1, p2 (x) = x, p3 (x) = x2 ,

and p4 (x) = x3 form a basis for P3 [0, 1], but they are not orthogonal with

respect to the inner product

$$f \cdot g = \int_0^1 f(x) g(x) \, dx \, .$$
Use the Gram-Schmidt process to construct an orthogonal basis for P₃[0, 1]. Once you have your basis, use it to find the third degree polynomial that best approximates $f(x) = e^{-x}$ on the interval [0, 1].

A.14 Notes and Further Reading

Much of the material in this appendix is typically covered in sophomore level

linear algebra courses. There are a huge number of textbooks at this level.

One good introductory linear algebra textbook that covers the material in this

appendix is [Lay03]. At a slightly more advanced level, [Str88] and [Mey00] are

both excellent. Fast and accurate algorithms for linear algebra computations

are a somewhat more advanced topic. A classic reference is [GL96]. Other good

books on this topic include [TB97] and [Dem97].


Appendix B

Review of Probability and Statistics

This appendix provides a brief review of fundamental concepts from probability and statistics. The topics have been selected because they are used in various

places throughout the book. In writing this appendix we have also attempted to

discuss some of the important connections between the mathematics of probabil-

ity theory and its application to the analysis of experimental data with random

measurement errors. The approach used in this appendix is a classical one. It is

important to note that in Chapter 11 (Bayesian methods), some very different

interpretations of probability theory are discussed.

B.1 Probability and Random Variables

The mathematical theory of probability begins with an experiment, which has

a set S of possible outcomes. We will be interested in events which are subsets

A of S.

Definition B.1
A probability function P assigns a number P(A) to each event A ⊆ S, with the following properties:

1. P(S) = 1.

2. For every event A ⊆ S, P(A) ≥ 0.

3. If the events A₁, A₂, ... are pairwise mutually exclusive (that is, if $A_i \cap A_j$ is empty for all pairs i, j), then
$$P\left( \bigcup_{i=1}^{\infty} A_i \right) = \sum_{i=1}^{\infty} P(A_i) \, .$$


The list of properties of probability given in this definition is extremely useful in developing the mathematics of probability theory. However, applying this definition of probability to real world situations requires some ingenuity.

Example B.1

Consider the experiment of throwing a dart at a dart board. We will

assume that our dart thrower is an expert who always hits the dart

board. The sample space S consists of the points on the dart board.

We can define an event A that consists of the points in the bulls eye.

Then P (A) is the probability that the thrower hits the bulls eye.

Often we are interested in a numerical quantity associated with the outcome of an experiment, rather than simply in whether or not the outcome lies in an event. Random variables are a useful generalization of the basic concept of probability.

Definition B.2

A random variable X is a function X(s) that assigns a value to

each outcome s in the sample space S.

Each time we perform an experiment, we obtain a specific value of

the random variable. These are called realizations of the random

variable.

We will typically use capital letters to denote random variables, with

lower case letters used to denote realizations of a random variable.

Example B.2

To continue our previous example, let X be the function that takes

a point on the dart board and returns the associated score. Suppose

that throwing the dart in the bulls eye scores 50 points. Then for

each point s in the bullseye, X(s) = 50.

In this book we will deal frequently with experimental results in the form of

measurements that can include some random measurement error.

Example B.3

Suppose we weigh an object five times and obtain masses of m1 =

10.1 kg, m2 = 10.0 kg, m3 = 10.0 kg, m4 = 9.9 kg, and m5 = 10.1

kg. We will assume that there is one true mass m, and that the mea-

surements we obtained varied because of some random measurement

error. Thus
$$m_i = m + e_i \, ,$$
where the measurement errors $e_i$ can be treated as realizations of a random variable E. Equivalently, since the true mass m is just a constant, we could treat the measurements m₁, m₂, ..., m₅ as realizations of a random variable M. In practice it makes little difference whether we treat the measurements or the measurement errors as random variables.

Again, it is important to note that in a Bayesian approach the mass m of the object would itself be a random variable. This is an issue that we will take up in Chapter 11.

The probability distribution of a continuous random variable X can be characterized by a unit area, nonnegative probability density function (PDF), $f_X(x)$. Since the random variable always has some value, the PDF

must have a total area of one. Furthermore, the PDF is nonnegative for all x.

The following definitions give some useful probability density functions which

frequently arise in inverse problems.

Definition B.3

The uniform probability density on the interval [a, b] has

$$f_U(x) = \begin{cases} \frac{1}{b-a} & a \le x \le b \\ 0 & x < a \\ 0 & x > b \end{cases} \qquad (B.1)$$

Definition B.4

The Gaussian or normal probability density (with parameters µ

and σ) has

$$f_N(x) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{1}{2}(x-\mu)^2/\sigma^2} = N(\mu, \sigma^2) \qquad (B.2)$$

The standard normal random variable, N (0, 1), has µ = 0 and

σ = 1.

Definition B.5

The exponential random variable has

$$f_{\exp}(x) = \begin{cases} \lambda e^{-\lambda x} & x \ge 0 \\ 0 & x < 0 \end{cases} \qquad (B.3)$$


Definition B.6
The double-sided exponential probability density has
$$f_{\mathrm{dexp}}(x) = \frac{1}{\sqrt{2}\,\sigma} \, e^{-\sqrt{2}\,|x-\mu|/\sigma} \, . \qquad (B.4)$$

Definition B.7
The χ² probability density with parameter ν has
$$f_{\chi^2}(x) = \frac{1}{2^{\nu/2}\,\Gamma(\nu/2)} \, x^{\frac{\nu}{2}-1} e^{-x/2} \, , \qquad (B.5)$$
where the gamma function is
$$\Gamma(x) = \int_0^{\infty} \xi^{x-1} e^{-\xi} \, d\xi \, .$$
It can be shown that if X₁, X₂, ..., Xₙ are independent random variables with standard normal distributions, then X₁² + X₂² + ⋯ + Xₙ² has a χ² distribution with ν = n degrees of freedom.


Definition B.8
The Student's t distribution with n degrees of freedom has
$$f_t(x) = \frac{\Gamma((n+1)/2)}{\Gamma(n/2)\sqrt{n\pi}} \left(1 + \frac{x^2}{n}\right)^{-(n+1)/2} \, . \qquad (B.6)$$
The distribution is named for W. S. Gosset, who used the pseudonym "Student" in publishing the first paper in which the distribution appeared.

It should be noted that in the limit as n goes to infinity, t approaches a

standard normal distribution (Figure B.2; Figure B.6).

Definition B.9
The F distribution is a function of two degrees of freedom, ν₁ and ν₂:
$$f_F(x) = \frac{\left(\frac{\nu_1}{\nu_2}\right)^{\nu_1/2} \left(1 + \frac{\nu_1}{\nu_2} x\right)^{-\frac{\nu_1+\nu_2}{2}} x^{\frac{\nu_1 - 2}{2}}}{\beta(\nu_1/2, \; \nu_2/2)} \, , \qquad (B.7)$$
where the beta function is
$$\beta(z, w) = \int_0^1 t^{z-1} (1-t)^{w-1} \, dt \, . \qquad (B.8)$$

It can be shown that the ratio of two independent χ² random variables, each divided by its degrees of freedom, with respective parameters ν₁ and ν₂ has an F distribution with parameters ν₁ and ν₂.

The cumulative distribution function (CDF) of a random variable X is just P(X ≤ a), which is given by the definite integral of the associated PDF
$$F_X(a) = P(X \le a) = \int_{-\infty}^{a} f(x) \, dx \, .$$

Note that FX (a) must lie in the interval [0, 1] for all a, and is a non-decreasing

function of a because of the unit area and nonnegativity of the PDF.

For the uniform PDF on the unit interval, for example, the CDF is a ramp function
$$F_U(a) = \int_{-\infty}^{a} f_U(z) \, dz = \begin{cases} 0 & a \le 0 \\ a & 0 \le a \le 1 \\ 1 & a > 1 \end{cases}$$

The PDF, fX (x), and CDF, FX (a), completely determine the probabilistic

properties of a random variable.

The probability that a particular realization of X will lie within a general

interval [a, b] is

P (a ≤ X ≤ b) = P (X ≤ b) − P (X ≤ a) = F (b) − F (a)

$$= \int_{-\infty}^{b} f(x) \, dx - \int_{-\infty}^{a} f(x) \, dx = \int_{a}^{b} f(x) \, dx \, .$$

B.2 Expected Value and Variance

Definition B.10
The expected value of a random variable X is
$$\mu_X = E[X] = \int_{-\infty}^{\infty} x f_X(x) \, dx \, .$$
More generally, the expected value of a function g of the random variable X is
$$E[g(X)] = \int_{-\infty}^{\infty} g(x) f_X(x) \, dx \, .$$

Some authors use the term “mean” for the expected value of a random

variable. We will reserve the term mean for the average of a set of data. Note

that the expected value of a random variable is not necessarily identical to the

mode (the value with the largest value of f (x)) nor is it necessarily identical

to the median, the value of x for which F(x) = 1/2.

Example B.4
Let X be a normal random variable with parameters µ and σ. Then
$$E[X] = \int_{-\infty}^{\infty} x \, \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \, dx \, .$$
Applying the change of variables u = x − µ (and then renaming u as x) gives
$$E[X] = \int_{-\infty}^{\infty} \frac{1}{\sigma\sqrt{2\pi}} (x + \mu) \, e^{-\frac{x^2}{2\sigma^2}} \, dx$$
$$E[X] = \mu \int_{-\infty}^{\infty} \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{x^2}{2\sigma^2}} \, dx + \int_{-\infty}^{\infty} \frac{1}{\sigma\sqrt{2\pi}} \, x \, e^{-\frac{x^2}{2\sigma^2}} \, dx \, .$$
The first integral term is µ because the integral of the entire PDF is 1, and the second term is zero because it is an odd function integrated over a symmetric interval. Thus
$$E[X] = \mu \, .$$

Definition B.11
The variance of a random variable X is
$$\mathrm{Var}(X) = \sigma_X^2 = E[(X - \mu_X)^2] = E[X^2] - \mu_X^2 = \int_{-\infty}^{\infty} (x - \mu_X)^2 f_X(x) \, dx \, ,$$
and the standard deviation is given by
$$\sigma_X = \sqrt{\mathrm{Var}(X)} \, .$$

The variance and standard deviation serve as measures of the spread of the

random variable about its expected value. Since the units of σ are the same as

the units of µ, the standard deviation is generally more practical as a measure of

the spread of the random variable. However, the variance has many properties

that make it more useful for certain calculations.

B.3 Joint Distributions

Definition B.12
Two random variables X and Y have a joint probability density (JDF), f(x, y), with
$$P(X \le a \text{ and } Y \le b) = \int_{-\infty}^{a} \int_{-\infty}^{b} f(x, y) \, dy \, dx \, . \qquad (B.9)$$

The JDF can be used to compute the expected value of a function of X and Y. The expected value of g(X, Y) is given by
$$E[g(X, Y)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x, y) \, f(x, y) \, dy \, dx \, . \qquad (B.10)$$

Definition B.13
The random variables X and Y are independent if their joint probability density is the product of their individual densities:
$$f(x, y) = f_X(x) \, f_Y(y) \, .$$

Definition B.14
The covariance of two random variables X and Y is
$$\mathrm{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y] \, .$$
If X and Y are independent, then Cov(X, Y) = 0. However if X and Y are dependent, it is still possible, given some particular distributions, for X and Y to have Cov(X, Y) = 0. If Cov(X, Y) = 0, X and Y are called uncorrelated.

Definition B.15
The correlation of X and Y is
$$\rho_{XY} = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}} \, .$$

Theorem B.1

The following properties of Var, Cov, and correlation hold for any

random variables X and Y and scalars s and a.

• Var(X) ≥ 0

• Var(X + a) = Var(X)

• Var(sX) = s2 Var(X)

• Var(X + Y ) = Var(X) + Var(Y ) + 2Cov(X, Y )

• Cov(X, Y ) = Cov(Y, X)

• ρ(X, Y ) = ρ(Y, X)

• −1 ≤ ρXY ≤ 1 .

Example B.5
Let Z be a standard normal random variable, and let
$$X = \mu + \sigma Z \, .$$

Then

E[X] = E[µ] + σE[Z]

so

E[X] = µ .

Also,

Var(X) = Var(µ) + σ 2 Var(Z) = σ 2 .

Thus if we have a program to generate random numbers with the

standard normal distribution, we can use it to generate normal ran-

dom numbers with any desired expected value and standard devia-

tion. The MATLAB command randn generates independent real-

izations of an N (0, 1) random variable.
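For example (a sketch of ours, not part of the original text), realizations of an N(mu, sigma^2) random variable can be produced from standard normal random numbers:

% Generate 1000 realizations of an N(mu, sigma^2) random variable
% from standard normal random numbers.
mu = 5; sigma = 2;
z = randn(1000,1);      % N(0,1) realizations
x = mu + sigma*z;       % N(mu, sigma^2) realizations

mean(x)                 % should be close to mu
std(x)                  % should be close to sigma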


Example B.6
What is the CDF (or PDF) of the sum of two independent random variables X + Y? To see this, we write the desired CDF in terms of an appropriate integral over the JDF, f(x, y) (Figure B.8):
$$F_{X+Y}(z) = P(X + Y \le z)$$
$$= \iint_{x+y \le z} f(x, y) \, dx \, dy$$
$$= \iint_{x+y \le z} f_X(x) f_Y(y) \, dx \, dy$$
$$= \int_{-\infty}^{\infty} \int_{-\infty}^{z-y} f_X(x) f_Y(y) \, dx \, dy$$
$$= \int_{-\infty}^{\infty} \left( \int_{-\infty}^{z-y} f_X(x) \, dx \right) f_Y(y) \, dy$$
$$= \int_{-\infty}^{\infty} F_X(z - y) f_Y(y) \, dy \, .$$
Differentiating with respect to z gives the PDF:
$$f_{X+Y}(z) = \frac{d}{dz} \int_{-\infty}^{\infty} F_X(z - y) f_Y(y) \, dy$$
$$= \int_{-\infty}^{\infty} \frac{d}{dz} F_X(z - y) f_Y(y) \, dy$$
$$= \int_{-\infty}^{\infty} f_X(z - y) f_Y(y) \, dy$$
$$= f_X(z) * f_Y(z) \, .$$
Thus the sum of two independent random variables is a random variable whose PDF is given by the convolution of the PDFs of the two individual variables.

The JDF can be used to evaluate the CDF or PDF arising from a general

function of independent random variables. The process is identical to the pre-

vious example except that the specific form of the integral limits is determined

by the specific function.
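A quick numerical check of the convolution result (our illustration, not from the text): convolving two uniform PDFs on a grid produces the triangular PDF of their sum.

% Numerically convolve two uniform [0,1] PDFs to get the PDF of X + Y.
dx = 0.001;
x  = 0:dx:1;
fX = ones(size(x));            % uniform PDF on [0,1]
fY = ones(size(x));

fZ = conv(fX,fY)*dx;           % PDF of Z = X + Y (triangular on [0,2])
z  = 0:dx:2;

plot(z,fZ);
xlabel('z'); ylabel('f_{X+Y}(z)');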

Example B.7
Consider the product of two independent normal random variables X and Y, each with expected value zero and standard deviation σ,
$$Z = XY \, .$$


Figure B.8: Integration of a joint probability density for two independent ran-

dom variables, X, and Y , to evaluate the CDF of Z = X + Y .

Because X and Y are independent, the JDF is
$$f(x, y) = f(x) f(y) = \frac{1}{2\pi\sigma^2} \, e^{-(x^2 + y^2)/2\sigma^2} \, .$$

The CDF of Z is

F (z) = P (Z ≤ z) = P (XY ≤ z) .

For z ≤ 0, this is the integral of the JDF over the exterior of the

hyperbolas defined by xy ≤ z ≤ 0, while for z ≥ 0, we integrate

over the interior of the complementary hyperbolas xy ≤ z ≥ 0. At

z = 0, the integral covers exactly half of the (x, y) plane (the 2nd

and 4th quadrants) and, because of the symmetry of the JDF, has

accumulated half of the probability, or 1/2.

The integral is thus
$$F(z) = 2 \int_{-\infty}^{0} \int_{z/x}^{\infty} \frac{1}{2\pi\sigma^2} \, e^{-(x^2+y^2)/2\sigma^2} \, dy \, dx \qquad (z \le 0)$$
and
$$F(z) = 1/2 + 2 \int_{-\infty}^{0} \int_{0}^{z/x} \frac{1}{2\pi\sigma^2} \, e^{-(x^2+y^2)/2\sigma^2} \, dy \, dx \qquad (z \ge 0) \, .$$


As in the previous example for the sum of two random variables, the PDF may be obtained from the CDF by differentiating with respect to z.

B.4 Conditional Probability

In some situations we will be interested in the probability of an event happening

given that some other event has also happened.

Definition B.16

The conditional probability of A given that B has occurred is

given by
$$P(A|B) = \frac{P(A \cap B)}{P(B)} \, . \qquad (B.11)$$

Conditional probabilities are often useful in computing probabilities. The key to such arguments is the law of total probability.

Theorem B.2

Suppose that B1 , B2 , . . ., Bn are mutually disjoint and exhaustive

events. That is, $B_i \cap B_j = \emptyset$ for i ≠ j, and
$$\bigcup_{i=1}^{n} B_i = S \, .$$
Then
$$P(A) = \sum_{i=1}^{n} P(A|B_i) \, P(B_i) \, . \qquad (B.12)$$

In many situations we know P(A|B) but need to find the reversed conditional probability P(B|A). Bayes' theorem provides a way to do this.

Theorem B.3

$$P(B|A) = \frac{P(A|B) \, P(B)}{P(A)} \qquad (B.13)$$

Example B.8
A screening test has been developed for a very serious but rare dis-

ease. If a person has the disease, then the test will detect the disease

with probability 99%. If a person does not have the disease, then

the test will give a false positive detection with probability 1%. The

probability that any individual in the population has the disease is

0.01%. Suppose that a randomly selected individual tests positive

for the disease. What is the probability that this individual actually

has the disease?

Let A be the event “the person tests positive”. Let B be the event

“The person has the disease”. Then we want to compute P (B|A).

By Bayes theorem,

P (A|B)P (B)

P (B|A) = .

P (A)

To compute P(A), we apply the law of total probability, considering separately the probability of a diseased individual testing positive and the probability of someone without the disease testing positive:
$$P(A) = P(A|B)P(B) + P(A|B^C)P(B^C) = 0.99 \times 0.0001 + 0.01 \times 0.9999 = 0.010098 \, .$$
Thus
$$P(B|A) = \frac{0.99 \times 0.0001}{0.010098} \approx 0.0098 \, .$$

In other words, even after a positive screening test, it is still unlikely

that the individual will have the disease. The vast majority of those

individuals who test positive will in fact not have the disease.
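The arithmetic in this example is easy to verify in MATLAB (our sketch, not part of the original text):

% Bayes' theorem for the screening test example.
P_B      = 0.0001;          % P(person has the disease)
P_A_B    = 0.99;            % P(positive test | disease)
P_A_notB = 0.01;            % P(positive test | no disease)

% Law of total probability for P(A), then Bayes' theorem for P(B|A).
P_A   = P_A_B*P_B + P_A_notB*(1 - P_B);
P_B_A = P_A_B*P_B/P_A       % approximately 0.0098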

The concept of conditioning can also be applied to the distributions and expected values of random variables. If the distribution of X depends

on the value of Y , then we can work with the conditional PDF fX|Y (x), the

conditional CDF FX|Y (a), and the conditional expected value E[X|Y ].

In this notation, we can also specify a particular value of Y by using the nota-

tion fX|Y =y , FX|Y =y , or E[X|Y = y]. In working with conditional distributions

and expected values, the following versions of the law of total probability can

be very useful.

Theorem B.4

Given two random variables X and Y , with the distribution of X

depending on Y , we can compute

$$P(X \le a) = \int_{-\infty}^{\infty} P(X \le a \mid Y = y) \, f_Y(y) \, dy \qquad (B.14)$$
and
$$E[X] = \int_{-\infty}^{\infty} E[X \mid Y = y] \, f_Y(y) \, dy \, . \qquad (B.15)$$

Example B.9

Let U be a random variable uniformly distributed on (1, 2). Let X

be an exponential random variable with parameter λ = U . We will

find the expected value of X.

$$E[X] = \int_{1}^{2} E[X \mid U = u] \, f_U(u) \, du \, .$$
Since the expected value of an exponential random variable with parameter λ is 1/λ, and the PDF of a uniform random variable on (1, 2) is $f_U(u) = 1$,
$$E[X] = \int_{1}^{2} \frac{1}{u} \, du = \ln 2 \, .$$

B.5 The Multivariate Normal Distribution

Definition B.17
If the random variables X₁, ..., Xₙ have a multivariate normal distribution (MVN), then the joint probability density function is
$$f(x) = \frac{1}{(2\pi)^{n/2} \sqrt{\det(C)}} \, e^{-(x - \mu)^T C^{-1} (x - \mu)/2} \qquad (B.16)$$
where µ = [µ₁, µ₂, ..., µₙ]ᵀ is a vector containing the expected values along each of the coordinate directions of X₁, ..., Xₙ, and C contains the covariances between the random variables, $C_{i,j} = \mathrm{Cov}(X_i, X_j)$. If C is singular, then det(C) = 0, the formula (B.16) involves a division by zero, and the joint density simply isn't defined.

The expected values and covariance matrix completely characterize an MVN distribution. However, there are other multivariate distributions that are not completely characterized by the expected values and covariance matrix.


Theorem B.5

Let X be a multivariate normal random vector with expected values

defined by the vector µ and covariance C, and let Y = AX. Then

Y is also multivariate normal, with

E[Y] = Aµ

and

Cov(Y) = ACAT .

Theorem B.6

If we have an n-dimensional MVN distribution with covariance ma-

trix C and expected value µ, and the covariance matrix is of full

rank, then the quantity

Z = (X − µ)T C−1 (X − µ)

has a χ2 distribution with n degrees of freedom.

Example B.10

We can generate vectors of random numbers according to an MVN

distribution by using the following process, which is very similar to

the process for generating random normal scalars.

1. Find the Cholesky factorization C = LLT .

2. Let Z be a vector of n independent N (0, 1) random numbers.

3. Let X = µ + LZ.

Because E[Z] = 0, E[X] = µ + L0 = µ. Also, since Cov(Z) = I and

Cov(µ) = 0, Cov(X) = Cov(µ + LZ) = LILT = C.
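A minimal MATLAB sketch of this procedure (ours; note that MATLAB's chol returns an upper triangular factor R with R'*R = C, so L = R'):

% Generate one realization of an MVN random vector with given mu and C.
mu = [1; 2];
C  = [2 1; 1 3];            % must be symmetric and positive definite

R = chol(C);                % upper triangular R with R'*R = C
Z = randn(2,1);             % independent N(0,1) random numbers
X = mu + R'*Z;              % realization of the MVN random vector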

B.6 The Central Limit Theorem

Theorem B.7
Let X₁, X₂, ..., Xₙ be independent and identically distributed (IID) random variables with a finite expected value µ and variance σ². Let
$$Z_n = \frac{X_1 + X_2 + \cdots + X_n - n\mu}{\sqrt{n}\,\sigma} \, .$$

In the limit as n approaches infinity, the distribution of Zn ap-

proaches the standard normal distribution.

[Figure B.9: Histogram of a sample data set of 100 points.]

The central limit theorem explains why the normal distribution appears so frequently in nature; the superposition of numerous independent random variables produces an approximately normal PDF, regardless of the PDF of the underlying IID variables.
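The following MATLAB experiment (our illustration, not from the text) shows the central limit theorem at work: averages of many IID uniform random numbers are approximately normally distributed.

% Central limit theorem demonstration: averages of 50 uniform random
% numbers are approximately normally distributed.
nexp = 10000;                  % number of averages to compute
n    = 50;                     % sample size for each average
xbar = mean(rand(n,nexp))';    % one average per column

hist(xbar,30);                 % approximately bell shaped, centered at 0.5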

B.7 Testing for Normality

Many of the statistical procedures that we will use assume that data are normally distributed. Fortunately, the statistical techniques that we describe are

generally robust in the face of small deviations from normality. Thus it may be

important to be able to examine a data set to see whether or not the distribution

is approximately normal.

Plotting a histogram of the data provides a quick view of the distribution.

For example, Figure B.9 shows the histogram from a set of 100 data points.

The characteristic bell shaped curve in the histogram makes it seem that these

data might be normally distributed. The sample mean is -0.03 and the sample

standard deviation is 1.26.

The Q-Q Plot provides a more precise graphical test of whether a set of ob-

servations of a random variable could have come from a particular distribution.

The data points,

d = [d1 , d2 , . . . , dn ]T ,

are first sorted in numerical order from smallest to largest into a vector y, which

is plotted versus

xi = F −1 ((i − 0.5)/n) (i = 1, 2, . . . , n) ,

where F(x) is the CDF of the distribution against which we wish to compare our observations.

[Figure B.10: Q–Q plot of the sample data set (quantiles of the input sample versus standard normal quantiles).]

[Figure B.11: Q–Q plot of a second, clearly non-normal data set (quantiles of the input sample versus standard normal quantiles).]

If we are testing to see if the elements of d could have come from the normal

distribution, then F (x) is the CDF for the standard normal distribution

$$F_N(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-\frac{1}{2} z^2} \, dz \, .$$

If the elements of d are normally distributed, the points (yi , xi ) will follow a

straight line with a slope and intercept determined by the standard deviation

and expected value, respectively of the normal distribution that produced the

data. The MATLAB statistics toolbox includes a command, qqplot, that can

be used to generate a Q − Q plot.
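For example (our sketch; trnd and qqplot require the statistics toolbox), histograms and Q–Q plots of normal and t-distributed samples can be compared as follows.

% Compare normally distributed data and t-distributed data (5 degrees
% of freedom) using histograms and Q-Q plots.
d1 = randn(100,1);
d2 = trnd(5,100,1);

subplot(2,2,1); hist(d1,15); title('normal data');
subplot(2,2,2); hist(d2,15); title('t-distributed data');
subplot(2,2,3); qqplot(d1);
subplot(2,2,4); qqplot(d2);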

Figure B.10 shows the Q − Q plot for our sample data set. The point at the

upper right corner of the plot has the value 4.62, but under a normal distribution

with expected value -0.03 and standard deviation 1.26, the largest of the 100

points should be about 2.5. Similarly, some points in the lower left corner stray

from the line. It is apparent that the data set contains more extreme values

than the normal distribution would predict. In fact, these data were generated

according to a t distribution with 5 degrees of freedom, which has broader tails

than the normal distribution (Figure B.6).

Figure B.11 shows a more extreme example. This data is clearly not normally

distributed. The Q − Q plot shows that distribution is skewed to the right. It

would be unwise to treat these data as if they were normally distributed.

There are a number of statistical tests for normality. These tests, including

the Kolmogorov–Smirnov test, Anderson–Darling test, and Lilliefors test each

produce probabilistic measures called p-values. A small p-value indicates that

the observed data would be unlikely if the distribution were in fact normal,

while a large p-value is consistent with normality. In practice, these tests may

declare a data set to be non-normal even though the data are “normal enough”

for practical applications. For example, the Lilliefors test implemented by the

MATLAB statistics toolbox command lillietest rejects the data in Figure B.10

with a p-value of 0.04.

B.8 Estimating Means and Confidence Intervals

Given a collection of noisy measurements m₁, m₂, ..., mₙ of some quantity

of interest, how can we estimate the true value m, and how uncertain is our

estimate? This is a classic problem in statistics.

We will assume first that the measurement errors are independent and nor-

mally distributed with expected value 0 and some unknown standard deviation

σ. Equivalently, the measurements themselves are normally distributed with

expected value m and standard deviation σ.

We begin by computing the measurement average

$$\bar{m} = \frac{m_1 + m_2 + \cdots + m_n}{n} \, .$$

This sample mean m̄ will serve as our estimate of m. We will also compute an estimate s of the standard deviation
$$s = \sqrt{\frac{\sum_{i=1}^{n} (m_i - \bar{m})^2}{n-1}} \, .$$

Theorem B.8

(The Sampling Theorem) Under the assumption that measurements

are independent and normally distributed with expected value m

and standard deviation σ, the random quantity

$$t = \frac{\bar{m} - m}{s/\sqrt{n}}$$
has a Student's t distribution with n − 1 degrees of freedom.

If we knew σ exactly and used it in place of the estimate s, then t would in fact be normally distributed with expected value 0 and standard deviation 1. This doesn't quite work out because we've used an estimate s of

the standard deviation. For smaller values of n, the estimate s is less accurate,

and thus t distribution has fatter tails than the standard normal distribution.

As n goes to infinity, s becomes a better estimate of σ and it can be shown that

the t distribution converges to a standard normal distribution.

Let $t_{n-1,\,0.975}$ be the 97.5%-tile of the t distribution with n − 1 degrees of freedom and let $t_{n-1,\,0.025}$ be the 2.5%-tile. Then
$$P\left( t_{n-1,\,0.025} \le \frac{\bar{m} - m}{s/\sqrt{n}} \le t_{n-1,\,0.975} \right) = 0.95 \, .$$
Multiplying through by $s/\sqrt{n}$ gives
$$P\left( t_{n-1,\,0.025} \, s/\sqrt{n} \le (\bar{m} - m) \le t_{n-1,\,0.975} \, s/\sqrt{n} \right) = 0.95 \, .$$
Thus a 95% confidence interval for the mean extends from $\bar{m} + t_{n-1,\,0.025}\, s/\sqrt{n}$ to $\bar{m} + t_{n-1,\,0.975}\, s/\sqrt{n}$. As the t distribution is symmetric, this can also be written as $\bar{m} - t_{n-1,\,0.975}\, s/\sqrt{n}$ to $\bar{m} + t_{n-1,\,0.975}\, s/\sqrt{n}$.

As we have seen, there is a 95% probability that when we construct the

confidence interval, that interval will contain the true mean, m. Note that we

have not said that, given a particular set of data and the resulting confidence

interval, there is a 95% probability that m is in the confidence interval. The

semantic difficulty here is that m is not a random variable, but is rather some

true fixed quantity that we are estimating; the measurements m1 , m2 , . . ., mn ,

and the calculated m̄, s and confidence interval are the random quantities.


Example B.11
Suppose that we have made the following ten measurements of the mass of an object (in grams):

10.01 10.11 10.01 9.99 9.92

The sample mean is m̄ = 10.02 g, and the sample standard deviation is s = 0.0883. The 97.5%-tile of the t distribution with n − 1 = 9 degrees of freedom is (from a t-table or function) 2.262. Thus our 95% confidence interval for the mean is
$$\left[\, \bar{m} - 2.262\, s/\sqrt{n},\ \bar{m} + 2.262\, s/\sqrt{n} \,\right] \text{ g} = \left[\, 10.02 - 2.262 \times 0.0883/\sqrt{10},\ 10.02 + 2.262 \times 0.0883/\sqrt{10} \,\right] \text{ g} = [9.96,\ 10.08] \text{ g} \, .$$
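A MATLAB sketch of this computation (ours, using simulated data; tinv requires the statistics toolbox):

% 95% confidence interval for the mean of a set of measurements,
% assuming normally distributed measurement errors.
m = 10 + 0.1*randn(10,1);     % simulated measurements (for illustration)
n    = length(m);
mbar = mean(m);
s    = std(m);

tcrit = tinv(0.975, n-1);     % 97.5%-tile of the t distribution, n-1 dof
ci = [mbar - tcrit*s/sqrt(n), mbar + tcrit*s/sqrt(n)]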

The above procedure for constructing a confidence interval for the mean us-

ing the t distribution was based on the assumption that the measurements were

normally distributed. In situations where the data are not normally distributed

this procedure can fail in a very dramatic fashion. However, it may be safe

to generate an approximate confidence interval using this procedure if (1) the

number n of data is large (50 or more) or (2) the distribution of the data is not

strongly skewed and n is at least 15.

B.9 Hypothesis Tests

In some situations we want to test whether or not a set of normally distributed

data could reasonably have come from a normal distribution with expected

value µ0 . Applying the sampling theorem (B.8), we see that if our data did

come from a normal distribution with expected value µ0 , then there would a

95% probability that

$$t_{obs} = \frac{\bar{m} - \mu_0}{s/\sqrt{n}}$$
would lie in the interval
$$[\, t_{n-1,\,0.025},\ t_{n-1,\,0.975} \,] \, ,$$

and only a 5% probability that t would lie outside this interval. Equivalently,

there is only a 5% probability that |tobs | ≥ tn−1, 0.975 .

This leads to the t–test: If |tobs | ≥ tn−1, 0.975 , then we reject the hypothesis

that µ = µ0 . On the other hand, if |t| < tn−1, 0.975 , then we cannot reject the

hypothesis that µ = µ0 . Although the 95% confidence level is traditional, we

can also perform the t-test at a 99% or some other confidence level. In general, if we want a confidence level of 1 − α, then we compare |t| to $t_{n-1,\,1-\alpha/2}$.

In addition to reporting whether or not a set of data passes a t-test, it is good practice to report the associated t-test p-value. The p-value associated with a

t-test is the largest value of α for which the data passes the t-test. Equivalently,

it is the probability that we could have gotten a greater t value than we have

observed, given that all of our assumptions are correct.

Example B.12

1.8580 2.2540 −0.5937 −0.4410 1.5711

The sample mean, m̄ = 1.0253, seems to be larger than 0. We will test the hypothesis µ = 0. The t

statistic is

$$t_{obs} = \frac{\bar{m} - \mu_0}{s/\sqrt{n}} \, ,$$
which for this data set is
$$t_{obs} = \frac{1.0253 - 0}{1.1895/\sqrt{10}} \approx 2.725 \, .$$

Because |tobs | is larger than t9, 0.975 = 2.262, we reject the hypoth-

esis that these data came from a normal distribution with expected

value 0 at the 95% confidence level.
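In MATLAB, the same test can be carried out as follows (our sketch; tinv and ttest require the statistics toolbox, and the data here are simulated):

% t-test of the hypothesis mu = 0 for a data vector d at the 95% level.
d = randn(10,1) + 1;                   % simulated data with true mean 1
tobs  = (mean(d) - 0)/(std(d)/sqrt(length(d)))
tcrit = tinv(0.975, length(d)-1)       % reject if abs(tobs) >= tcrit

[h, p] = ttest(d, 0)                   % h = 1 means the hypothesis is rejected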

The t-test (or any other statistical test) can fail in two ways. First, it could

be that the hypothesis that µ = µ0 is true, but our particular data set contained

some unlikely values and failed the t-test. Rejecting the hypothesis when it is

in fact true is called a type I error . We can control the probability of a type

I error by decreasing α.

The second way in which the t-test can fail is more difficult to control. It

could be that the hypothesis µ = µ0 was false, but the sample mean was close

enough to µ0 to pass the t-test. In this case, we have a type II error. The

probability of a type II error depends very much on how close the true mean is

to µ0 . If the true mean µ = µ1 is very close to µ0 , then a type II error is quite

likely. However, if the true mean µ = µ1 is very far from µ0 then a type II error

will be less likely. Given a particular alternative hypothesis, µ = µ1 , we call

the probability of a type II error β(µ1 ), and call the probability of not making

a type II error (1 − β(µ1 )) the power of the test. We can estimate β(µ1 ) by

repeatedly generating sets of n random numbers with µ = µ1 and performing

the hypothesis test on the sets of random numbers.
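This Monte Carlo procedure can be sketched in a few lines of MATLAB (our illustration; tinv requires the statistics toolbox):

% Estimate the power of the t-test of mu = mu0 against the alternative
% mu = mu1 by repeated simulation.
mu0 = 0; mu1 = 0.5; sigma = 1; n = 10;
ntrials = 1000;
tcrit = tinv(0.975, n-1);
nreject = 0;

for k = 1:ntrials
  d = mu1 + sigma*randn(n,1);              % data from the alternative
  tobs = (mean(d) - mu0)/(std(d)/sqrt(n));
  if abs(tobs) >= tcrit
    nreject = nreject + 1;
  end
end

power = nreject/ntrials      % estimate of 1 - beta(mu1)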

B.10. EXERCISES B-25

The results of a hypothesis test should always be reported with care. It is Markov’s inequality

important to discuss and justify any assumptions (such as the normality as- Chebychev’s inequality

sumption made in the t-test) underlying the test. The p-value should always be

reported along with whether or not the hypothesis was rejected. If the hypoth-

esis was not rejected and some particular alternative hypothesis is available, it

is good practice to estimate the power of the hypothesis test against this alter-

native hypothesis. Confidence intervals for the mean should be reported along

with the results of a hypothesis test.

It is important to distinguish between the statistical significance of a hy-

pothesis test and the actual magnitude of any difference between the observed

mean and the hypothesized mean. For example, with very large n it is nearly

always possible to achieve statistical significance at the 95% confidence level,

even though the observed mean may differ from the hypothesis by only 1% or

less.

B.10 Exercises

1. Compute the expected value and variance of a uniform random variable

in terms of the parameters a and b.

2. Compute the expected value and variance of an exponential random variable in terms of the parameter λ.

3. Let X be a random variable and let c be a positive constant. Show that
$$P(|X| \ge c) \le \frac{E[|X|]}{c} \, .$$
This result is known as Markov's inequality.

4. Let X be a random variable with expected value µ and variance σ². Show that for any k > 0,
$$P(|X - \mu| \ge k\sigma) \le \frac{1}{k^2} \, .$$
This result is known as Chebychev's inequality.

and Y are uncorrelated, but not independent. Hint: What happens if

Y = −X?

7. Show that

Var(X) = E[X 2 ] − E[X]2


8. Show that

Cov(aX, Y ) = aCov(X, Y )

and that

Cov(X + Y, Z) = Cov(X, Z) + Cov(Y, Z)

9. Show that the PDF for the sum of two independent uniform random vari-

ables on [a, b] = [0, 1] is

$$f(x) = \begin{cases} 0 & x \le 0 \\ x & 0 \le x \le 1 \\ 2 - x & 1 \le x \le 2 \\ 0 & x \ge 2 \, . \end{cases}$$

10. On a multiple choice question with four possible answers, there is a 75%

chance that a student will know the correct answer and 25% chance that

the student will randomly pick one of the four answers. What is the

probability that the student will get a correct answer? Given that the

answer was correct, what is the probability that the student actually knew

the correct answer?

11. Suppose that X and Y are independent random variables. Use condition-

ing to find a formula for the CDF of X + Y in terms of the PDF’s and

CDF’s of X and Y .

12. Suppose that X is a vector of 2 random variables with an MVN distribu-

tion (expected value µ, covariance C), and that A is a 2 by 2 matrix. Use

the properties of expected value and covariance to show that Y = AX has

expected value Aµ and covariance ACAT .

13. Consider a least squares problem Ax = b, where the A matrix is known

exactly, but the right hand side vector b includes some random measure-

ment noise. Thus

$$b_{\mathrm{noise}} = b_{\mathrm{true}} + \eta \, ,$$
where η is a vector of independent N(0, 1) random numbers. We would like to find $x_{\mathrm{true}}$ such that
$$A x_{\mathrm{true}} = b_{\mathrm{true}} \, .$$
Suppose that we instead compute a least squares estimate using the normal equations and the noisy data,
$$x_{L2} = (A^T A)^{-1} A^T b_{\mathrm{noise}} \, .$$

What is the expected value of xL2 ? What is the covariance matrix?

14. Consider the following data, which we will assume are drawn from a normal

distribution.


1.1909 1.1892 -0.0376 0.3273 0.1746

Find the sample mean and standard deviation. Use these to construct a

95% confidence interval for the mean. Test the hypothesis H0 : µ = 0

at the 95% confidence level. What do you conclude? What was the

corresponding p-value?

15. Using MATLAB, repeat the following experiment 1,000 times. Use the

statistics toolbox function exprnd() to generate 5 exponentially distributed

random numbers (B.3) with λ = 10. Use these 5 random numbers to gen-

erate a 95% confidence interval for the mean. How many times out of the

1,000 experiments did the 95% confidence interval cover the expected value

of 10? What happens if you instead generate 50 exponentially distributed

random numbers at a time? Discuss your results.

16. Using MATLAB, repeat the following experiment 1,000 times. Use the

randn function to generate a set of 10 normally distributed random num-

bers with expected value 10.5 and standard deviation 1. Perform a t-test

of the hypothesis µ = 10 at the 95% confidence level. How many type II

errors were committed? What is the approximate power of the t-test with

n = 10 against the alternative hypothesis µ = 10.5? Discuss your results.

17. Using MATLAB, repeat the following experiment 1,000 times. Using the

exprnd() function of the statistics toolbox, generate 5 exponentially dis-

tributed random numbers with expected value 10. Take the average of the

5 random numbers. Plot a histogram and a probability plot of the 1,000

averages that you computed. Are the averages approximately normally

distributed? Explain why or why not. What would you expect to happen

if you took averages of 50 exponentially distributed random numbers at a

time? Try it and discuss the results.

B.11 Notes and Further Reading

Most of the material in this appendix can be found in virtually any introductory

textbook in probability and statistics. Some recent textbooks include [BE00,

CB02]. The multivariate normal distribution is a somewhat more advanced topic

that is often ignored in introductory courses. [Sea82] has a good discussion of

the multivariate normal distribution and its properties. Numerical methods

for probability and statistics are a specialized topic. Two standard references

include [JG80, Thi88].


Appendix C

Glossary of Mathematical Terminology

• A, B, C, ... : Matrices.

• Ai,· : ith row of matrix A.

• A·,i : ith column of matrix A.

• Ai,j : (i, j)th element of matrix A.

• E[X]: Expected value of the random variable X.

• G† : Generalized inverse of the matrix G calculated from the SVD.

• A, B, C, ... : Functions; Random Variables.

• N (µ, σ 2 ): Normal probability density function with expected value µ and

variance σ 2 .

• N (A): Null space of the matrix A.

• rank(A): Rank of the matrix A.

• R(A): Range of the matrix A.

• Rn : Vector space of dimension n.

• A, B, C, ... : Fourier transforms.

• a, b, c, ... : Column vectors.

• ai : ith element of vector a.

• ā: Mean value of the elements in vector a.

• Cov(x), Cov(X, Y ): Covariance between the elements of vector x or

between the random variables X and Y .


• ρXY , ρ(X, Y ): Correlation between the random variables X and Y .

• si : Singular values.

• Var(X): Variance of the random variable X.

• α, β, γ, ... : Scalars.

• a, b, c, ... : Scalars or functions.

• χ2 : Chi-square random variable.

• λi : Eigenvalues.

• µ(1) : 1-norm misfit measure.

• σ: Standard deviation.

• σ 2 : Variance.

• ν: Degrees of freedom.

• CDF: Cumulative distribution function.

• JDF: Joint probability density function.

• MVN: Multivariate normal probability density function.

• PDF: Probability density function.

• SVD: Singular value decomposition.

• tν,p : p-percentile of the t distribution with ν degrees of freedom.

• Fν1 ,ν2 ,p : p-percentile of the F distribution with ν1 and ν2 degrees of free-

dom.

Bibliography

holm integral equations of the 1st kind. Inverse Problems, 7(6):793–

808, 1991.

logical Society of America, 78(3):1373–1374, 1988.

Braunschweig, 1987.

PACK software for weighted orthogonal distance regression. ACM

Transactions on Mathematical Software, 15(4):348–364, 1989. The

software is available at http://www.netlib.org/odrpack/.

Mathematical Statistics. Brooks/Cole, Pacific Grove, CA, second

edition, 2000.

I. resolution analysis. Optimization and Engineering, 1(1):87–115,

2000.

II. iterative inverses. Optimization and Engineering, 1(4):437–473,

2000.

[Bjö96] Åke Björck. Numerical Methods for Least Squares Problems. SIAM,

Philadelphia, 1996.

Hill, Boston, third edition, 2000.

Resolved Sources. PhD thesis, New Mexico Institute of Mining and

Technology, 1995.


Optimization: Analysis, Algorithms, and Engineering Applications.

SIAM, Philadelphia, 2001.

ysis and its applications. Wiley, New York, 1988.

Seismology. Penn Publishing Company, Atlanta, 1987.

Pacific Grove, CA, second edition, 2002.

convolution algorithm. Astronomy and Astrophysics, 143(1):77–83,

1985.

Astronomy and Astrophysics, 89(3):377–378, 1980.

ble. Occam’s inversion: A practical algorithm for generating smooth

models from electromagnetic sounding data. Geophysics, 52(3):289–

300, 1987.

rithms, and Applications. Oxford University Press, New York, 1997.

Philadelphia, 1997.

[DS96] J. E. Dennis, Jr. and Robert B. Schnabel. Numerical Methods for Un-

constrained Optimization and Nonlinear Equations. SIAM, Philadel-

phia, 1996.

Wiley, New York, third edition, 1998.

ization of Inverse Problems. Kluwer Academic Publishers, Boston,

1996.

squares problems with uncertain data. SIAM Journal on Matrix

Analysis and Applications, 18(4):1035–1064, 1997.

inverse problems. Surveys on Mathematics for Industry, 3:71–143,

1993.


[GCSR03] Andrew Gelman, John B. Carlin, Hal S. Stern, and Donald B. Rubin.

Bayesian Data Analysis. Chapman & Hall/CRC, Boca Raton, FL,

second edition, 2003.

Johns Hopkins University Press, Baltimore, third edition, 1996.

ences. Vieweg, Braunschweig, 1993.

ates. Mathematical Association of America, Washington, 1999.

Monte Carlo in Practice. Chapman & Hall, London, 1996.

form inversion: Bayes versus Occam. Inverse Problems, 13(2):323–

349, 1997.

means of the L-curve. SIAM Review, 34(4):561–580, December 1992.

LAB package for analysis and solution of discrete ill–

posed problems. Numerical Algorithms, 6(I–II):1–35, 1994.

http://www.imm.dtu.dk/documents/users/pch/Regutools/regutools.html.

lems: numerical aspects of linear inversion. SIAM, Philadelphia,

1998.

[HBR+ 02] J.M.H. Hendrickx, B. Borchers, J.D. Rhoades, D.L. Corwin, S.M.

Lesch, A.C. Hilgendorf, and J. Schlue. Inversion of soil conductivity

profiles from electromagnetic induction measurements; theory and

experimental verification. Soil Science Society of America Journal,

66(3):673–685, 2002.

demic Press, San Francisco, 1980.

[HH93] Martin Hanke and Per Christian Hansen. Regularization methods for

large–scale problems. Surveys on Mathematics for Industry, 3:253–

315, 1993.

lutions without a priori break points. Numerical Linear Algebra with

Applications, 3(6):513–524, 1996.


interferometer baselines. Astronomy and Astrophysics Supplement,

15:417–426, 1974.

[HR96] M. Hanke and T. Raus. A general heuristic for choosing the regular-

ization parameter in ill–posed problems. SIAM Journal on Scientific

Computing, 17(4):956–972, 1996.

berge, R. L. Parker, C. J. Fox, and P. J. Meis. A sea–floor and

sea–surface gravity survey of axial volcano. Journal of Geophysical

Research– Solid Earth and Planets, 95(B8):12751–12763, 1990.

second edition, 1996.

[HV91] Sabine Van Huffel and Joos Vandewalle. The Total Least Squares

Problem: computational aspects and analysis. SIAM, Philadelphia,

1991.

fit - a system for least squares and robust estimation. Celes-

tial Mechanics, 41(1–4):39–49, 1987. The software is available at

http://clyde.as.utexas.edu/Gaussfit.html.

Marcel Dekker, New York, 1980.

Inverse Problems. Springer Verlag, New York, 1996.

with Applications. Academic Press, Boston, 1992.

Moose, and W. E. Carter. Improvements in absolute gravity obser-

vations. Journal of Geophysical Research– Solid Earth and Planets,

96(B5):8295–8303, 1991.

Tomographic Imaging. SIAM, Philadelphia, 2001.

Boston, third edition, 2003.

Problems. SIAM, Philadelphia, 1995.

Exploration Geophysicists, Tulsa, OK, 1988.


nance Imaging: A Signal Processing Perspective. IEEE Press, New

York, 2000.

[Men89] William Menke. Geophysical Data Analysis: discrete inverse theory,

volume 45 of International Geophysics Series. Academic Press, San

Diego, revised edition, 1989.

[Mey00] Carl D. Meyer. Matrix Analysis and Applied Linear Algebra. SIAM,

Philadelphia, 2000.

[MGH80] J. J. More, B. S. Garbow, and K. E. Hillstrom. User guide for

MINPACK-1. Technical Report ANL–80–74, Argonne National Lab-

oratory, 1980.

[Moo20] E. H. Moore. On the reciprocal of the general algebraic matrix.

Bulletin of the American Mathematical Society, pages 394–395, 1920.

[Mor84] V. A. Morozov. Methods for Solving Incorrectly Posed Problems.

Springer-Verlag, New York, 1984.

[Mye90] Raymond H. Myers. Classical and Modern Regression with Applica-

tions. PWS Kent, Boston, second edition, 1990.

[Nat01] Frank Natterer. The Mathematics of Computerized Tomography.

SIAM, Philadelphia, 2001.

[NBW00] Roseanna Neupauer, Brian Borchers, and John L. Wilson. Com-

parison of inverse methods for reconstructing the release history of

a groundwater contamination source. Water Resources Research,

36(9):2469–2475, 2000.

[Neu98] Arnold Neumaier. Solving ill-conditioned and singular linear sys-

tems: A tutorial on regularization. SIAM Review, 40(3):636–666,

1998.

[NN86] Ramesh Narayan and Rajaram Nityananda. Maximum entropy im-

age restoration in astronomy. In Annual Review of Astronomy and

Astrophysics, volume 24, pages 127–170. Annual Reviews Inc., Palo

Alto, CA, 1986.

[Nol87] Guust Nolet, editor. Seismic Tomography With Applications in

Global Seismology and Exploration Geophysics. D. Reidel, Boston,

1987.

[NS96] Stephen G. Nash and Ariela Sofer. Linear and Nonlinear Programming. McGraw–Hill, New York, 1996.

[Par91] R. L. Parker. A theory of ideal bodies for seamount magnetism.

Journal of Geophysical Research– Solid Earth, 96(B10):16101–16112,

1991.


Press, Princeton, NJ, 1994.

Philos. Soc., pages 406–413, 1955.

measure. Journal of Geophysical Research, pages 4429–4430, 1980.

Press, London, 1983.

ear equations and least squares problems. ACM Transactions on

Mathematical Software, 8(2):195–209, 1982.

linear equations and sparse least squares. ACM Transactions on

Mathematical Software, 8(1):43–71, 1982.

periments to test newton’s law of gravity. Nature, 342:29–32, 1989.

A GPS–based three-dimensional lightning mapping system: Initial

observations in central new mexico. Geophysical Research Letters,

26(23):3573–3576, 1999.

[Rud87] Walter Rudin. Real and Complex Analysis. McGraw–Hill, New York,

third edition, 1987.

entropy algorithm. Astrophysical Journal, 354(2):L61–L63, 1990.

tion: General algorithm. Monthly Notices of the Royal Astronomical

Society, 211:111–124, 1984.

[Sea82] Shayle R. Searle. Matrix Algebra Useful for Statistics. Wiley, New

York, 1982.

[SGT88] John A. Scales, Adam Gersztenkorn, and Sven Treitel. Fast lp so-

lution of large, sparse, linear systems: Application to seismic travel

time tomography. Journal of Computational Physics, 75(2):314–333,

1988.

numerical solution of an integral equation. Journal of Mathematical

Analysis and Applications, 37:83–112, 1972.


dient method without the agonizing pain, edition 1–1/4. Technical

report, School of Computer Science, Carnegie Mellon University, Au-

gust 1994.

Press, Oxford, 1996.

groundwater contaminant. Water Resources Research, 30(1):71–79,

1994.

water contaminant plume: Method of quasi-reversibility. Water Re-

sources Research, 31:2669–2673, 1995.

timates of gamma(p) and x(p). Journal of Geophysical Research–

Solid Earth and Planets, 92(B3):2713–2719, 1987.

statistical estimates of gamma(p) and x(p)’. Journal of Geophysical

Research, 93:13821–13822, 1988.

algorithm and applications. Computational Statistics, 10(2):129–141,

1995.

[SS97] John Scales and Martin Smith. DRAFT: Geophysical inverse theory.

http://landau.Mines.EDU/ samizdat/inverse theory/, 1997.

[Str88] Gilbert Strang. Linear Algebra and its Applications. Harcourt Brace

Jovanovich Inc., San Diego, third edition, 1988.

Halsted Press, New York, 1977.

[Tar87] Albert Tarantola. Inverse Problem Theory: Methods for Data Fitting

and Model Parameter Estimation. Elsevier, New York, 1987.

[TB97] Loyd N. Trefethen and David Bau. Numerical Linear Algebra. SIAM,

Philadelphia, 1997.

[TG65] A.N. Tikhonov and V.B. Glasko. Use of the regularization method in

non–linear problems. USSR Computational Mathematics and Math-

ematical Physics, 5(3):93–107, 1965.

in the Natural Sciences. MIR Publishers, Moscow, 1987.


and Hall, New York, 1988.

[Thu92] C. Thurber. Hypocenter velocity structure coupling in local earth-

quake tomography. Physics of the Earth and Planetary Interiors,

75(1–3):55–62, 1992.

Sensing and Indirect Measurements. Dover, Mineola, New York,

1996.

[UBL90] T. Ulrych, A. Bassrei, and M. Lane. Minimum relative entropy

inversion of 1d data with applications. Geophysical Prospecting,

38(5):465–487, 1990.

[Vog02] Curtis R. Vogel. Computational Methods for Inverse Problems.

SIAM, Philadelphia, 2002.

[Wah90] Grace Wahba. Spline Models for Observational Data. SIAM,

Philadelphia, 1990.

[Wat00] G. A. Watson. Approximation in normed linear spaces. Journal of

Computational and Applied Mathematics, 121(1–2):1–36, 2000.

[Win91] G. Milton Wing. A Primer on Integral Equations of the First Kind: the problem of deconvolution and unfolding. SIAM, Philadelphia, 1991.

[WU96] A. D. Woodbury and T. J. Ulrych. Minimum relative entropy inver-

sion: Theory and application to recovering the release history of a

ground water contaminant. Water Resources Research, 32(9):2671–

2681, 1996.

Index

p–value, 2-5
p–value, t–test, B-24
t–test, B-23
1–norm solution, 2-17
1-norm, 1-5
2–norm, 1-5, A-19
2–norm solution, 2-1

ART algorithm, 6-4
automatic differentiation, 9-8

basis, A-16
basis functions, Fourier, 8-3
basis, standard, A-16
Bayes Theorem, 11-3
Bayes' theorem, B-15
black box function, 9-8
bound variables least squares, 7-2
BVLS, 7-2

calculus of variations, 11-13
CDF, B-8
CG method, 6-8
CGLS, 6-12
characteristic equation, A-24
Chebychev's inequality, B-25
checkerboard resolution test, 4-17
chi2pdf, 2-23
chol, A-27
CLEAN, 7-17
collocation, simple, 3-3
column space, A-17
compatible matrices, A-7
compatible matrix and vector norms, A-31
cond, A-33
condition number, A-32, A-33
condition number, of svd, 4-10
conditional distribution, B-16
conditional expected value, B-16
conditional probability, B-15
confidence interval, B-22
confidence regions, 2-10
conjugacy, 11-6
conjugate gradient least squares method, 6-12
conjugate gradient method, 6-8
continuous inverse problem, 1-2
convex function, A-29
convolution, 1-3, 8-1
convolution equation, 1-3
convolution theorem, 8-4
convolution, circular, 8-10
convolution, serial, 8-10
covariance, B-11
covariance matrix, B-17
cross entropy, 11-13
cumulative distribution function, B-8

damped Newton method, 9-3
damped SVD method, 5-5
data kernels, 3-2
data null space, 4-7
data space, 4-1
data space resolution matrix, 4-9
deconvolution, 1-3, 3-2, 4-19, 8-1, 8-8
delta function, 8-2
denoising, 7-11
DFT, 8-8
diagonal matrix, A-10

direct problem, 1-1
discrepancy principle, 4-12, 5-1
discrete Fourier Transform, 8-8
discrete ill–posed problem, 4-18
discrete inverse problem, 1-2
discrete Picard condition, 4-12
discrete-ill-posed problem, 1-12
double-sided exponential distribution, B-5
downward continuation, 8-6, 8-7

earthquake location problem, 1-5
eig, A-24
eigenvalue, A-24
eigenvector, A-24
elementary row operations, A-2
entropy, 11-12
Euclidean length, A-19
events, B-1
expected value, B-9
experiment, B-1
exponential random variable, B-3

F distribution, B-7
FFT, 8-8
filter factors, 5-5
filter, high–pass, 8-16
filter, low–pass, 8-7
finite differences, 9-8
first order Tikhonov regularization, 5-13
forward operator, 1-1
forward problem, 1-1
Fourier basis functions, 8-3
Fourier Transform, 1-3
Fourier transform, 8-3
Fourier Transform, discrete, 8-8
Fredholm Integral Equation, 3-1
Fredholm integral of the first kind, 1-3
frequency domain response, 8-3
Frobenius norm, A-31
full rank, 2-1

Gaussian elimination, A-1
Gaussian probability density, B-3
GCV, 5-25
generalized cross validation, 5-25
generalized inverse, 4-3
generalized singular value decomposition, 5-14
generalized singular values, 5-14
global optimization, 9-11
Gram matrix technique, 3-7
Gram–Schmidt orthogonalization process, A-22
Green's function, 8-2
GSVD, 5-14

Hermitian symmetry, spectral, 8-10
high–pass filter, 8-16
homogeneous, A-4

identity matrix, A-8
IFK, 1-3, 3-1
ill-conditioned
    linear system of equations, A-33
ill-conditioned system, 1-12
ill-posed problem, 1-12
impulse function, 8-2
impulse resolution test, 4-16
impulse response, 8-2
inconsistent, 1-5
indefinite matrix, A-26
independent random variables, B-10
inner product, 3-8, A-35
instrument response, 4-18
inverse discrete Fourier transform, 8-8
inverse Fourier transform, 8-3
inverse Fourier transform, discrete, 8-8
inverse problem, 1-1
IRLS, 2-19
iteratively reweighted least squares, 2-19

kernel, 1-2
Kullback, 11-13

L–curve, 5-3
Lagrange multipliers, A-27
Laguerre polynomial basis, 3-9
law of cosines, A-19
law of total probability, B-15
least squares, 2-1
least squares solution, 2-1, A-23
Leibler, 11-13
Levenberg–Marquardt method, 9-6
likelihood function, 2-3
linear combination, A-5
linear equations, A-1
linear independence, A-12
linear regression, 1-3
linear systems, 1-2
local minimum points, 9-9
low–pass filter, 8-7
lower triangular matrix, A-11

MAP model, 11-4
Markov Chain Monte Carlo Method, 11-4
Markov's inequality, B-25
mathematical model, 1-1
matrix norm, A-31
matrix times matrix, A-7
matrix times vector, A-6
maximum entropy methods, 11-12
maximum entropy principle, 11-13
maximum entropy regularization, 7-7
maximum likelihood, 11-4
maximum likelihood estimation, 2-2
maximum likelihood principle, 2-3
median of a PDF, B-9
midpoint rule, 3-3
minimum cross entropy principle, 11-14
minimum length solution, 4-4
mode of a PDF, B-9
model identification problem, 1-1
model null space, 4-7
model space, 4-1
model space resolution matrix, 4-8
moderately ill–posed, 4-18
modified truncated SVD, 7-12
Monte Carlo error propagation, 2-21
Moore–Penrose pseudoinverse, 4-3
MTSVD, 7-12
multistart, 9-11
multivariate normal distribution, 11-7, B-17
mutually conjugate vectors, 6-8
MVN, B-17

negative definite matrix, A-26
negative semidefinite matrix, A-26
Newton's method, 9-2
Newton's method for minimizing f(x), 9-3
NNLS, 7-1
non homogeneous, A-4
non–negative least squares, 7-1
nonuniqueness, 4-7
norm, A-30
    compatible matrix and vector, A-31
    definition, A-29
    Frobenius, A-31
    infinity, A-30
    of a matrix, A-31
normal equations, A-23
normal probability density, B-3
null space, 1-12, A-14
null space model, 1-12
null space models, 1-12
Nyquist frequency, 8-10

ODE, 1-1
ordinary differential equation, 1-1
orth, A-22
orthogonal basis, A-20
orthogonal diagonalization, A-25
orthogonal matrix, A-20
orthogonal projection, A-21
orthogonal subspaces, A-20

outliers, 2-17
over fitting of data, 4-12

parallelogram law, A-38
parameter correlation, 2-11
parameter estimation problem, 1-2
partial differential equation, 1-1
PD matrix, A-26
PDE, 1-1
PDF, B-3
perpendicular vectors, A-19
pinv MATLAB command, 4-7
poles, 8-5
positive definite matrix, A-26
positive semidefinite matrix, A-26
posterior distribution, 11-3
power of a matrix, A-10
power, of a hypothesis test, B-24
PP-TSVD, 7-12
principal axes, error ellipsoid, 2-10
principle of indifference, 11-3
prior distribution, 11-3
probability, B-1
probability density function, B-3
probability function, B-1
proportional effect, 2-14
PSD matrix, A-26
pseudoinverse, 4-3

Q–Q plot, B-19
qr, A-33
quadratic form, A-25
quadrature rule, 3-2

randn, B-12
random variable, B-2
range, A-17
rank, A-18
rank deficient, 1-11, 4-12
rank deficient system, 4-4
realizations, B-2
reduced row echelon form, A-3
regularization, 1-12, 4-12
representers, 3-2
reshape, 4-14
resolution test, 4-16
Riemann–Lebesgue lemma, 1-14
robust least squares, 2-25
robust solution, 2-17
row–column expansion method, A-7
RREF, A-3

sample mean, B-22
second order Tikhonov regularization, 5-14
severely ill–posed, 4-18
sifting property, delta function, 8-2
sifting property, of delta function, 8-2
simultaneous iterative reconstruction technique, 6-6
singular value decomposition, 4-1
singular value spectrum, 4-9
SIRT technique, 6-6
slowness, 1-10
solution existence, 1-12
solution instability, 1-12
solution nonuniqueness, 1-12
solution uniqueness, 1-12
spectral amplitude, 8-3
spectral division, 8-11
spectral phase, 8-3
spectrum, 8-3
spike resolution test, 4-16
sqrtm, 11-8
square-integrable function, 1-14
standard basis, A-16
standard deviation, B-10
standard normal random variable, B-3
stationary point, A-27
Student's t distribution, B-7, B-22
subjective prior, 11-3
subspace, A-13
SVD, 4-1
svd MATLAB command, 4-1
svd, compact form, 4-2
symmetric matrix, A-10

TGSVD, 5-24

tikhcstr, 7-2
time domain response, 8-3
time series, 8-8
tomography, 1-10
total least squares, 2-25
total variation regularization, 7-11
transfer function, 8-3
transpose, A-10
trivial solution, A-4
truncated generalized singular value decomposition, 5-24
truncated SVD, 4-12
TSVD, 4-12
type I error, B-24
type II error, B-24

unbiased, 2-7
uncorrelated random variables, B-11
uniform, B-3
uninformative prior distribution, 11-3
upper triangular matrix, A-11
upward continuation filter, 8-7

variance, B-10
vector space, A-35
vertical seismic profiling, 1-7

weakly ill–posed, 4-18

zeros, 8-5
zeroth order Tikhonov regularization, 5-14
