This action might not be possible to undo. Are you sure you want to continue?
1
Mathematical Modelling of Continuous Systems
Carlo Tomasi
Duke University
Fall 2004
2
Chapter 1
Introduction
Fields such as robotics or computer vision are interdisciplinary subjects at the intersection of engineering and computer
science. By their nature, they deal with both computers and the physical world. Although the former are in the latter,
the workings of computers are best described in the blackandwhite vocabulary of discrete mathematics, which is
foreign to most classical models of reality, quantum physics notwithstanding.
This class surveys some of the key tools of applied math to be used at the interface of continuous and discrete. It
is not on robotics or computer vision, nor does it cover any other application area. Applications evolve rapidly, but
their mathematical foundations remain. Even if you will not pursue any of these ﬁelds, the mathematics that you learn
in this class will not go wasted. To be sure, applied mathematics is a discipline in itself and, in many universities, a
separate department. Consequently, this class can be a quick tour at best. It does not replace calculus or linear algebra,
which are assumed as prerequisites, nor is it a comprehensive survey of applied mathematics. What is covered is a
compromise between the time available and what is useful and fun to talk about. Even if in some cases you may have
to wait until you take an applied class to fully appreciate the usefulness of a particular topic, I hope that you will enjoy
studying these subjects in their own right.
1.1 Who Should Take This Class
The main goal of this class is to present a collection of mathematical tools for both understanding and solving problems
in ﬁelds that manipulate models of the real world, such as robotics, artiﬁcial intelligence, vision, engineering, or several
aspects of the biological sciences. Several classes at most universities each cover some of the topics presented in this
class, and do so in much greater detail. If you want to understand the full details of any one of the topics in the
syllabus below, you should take one or more of these other classes instead. If you want to understand how these tools
are implemented numerically, you should take one of the classes in the scientiﬁc computing program, which again
cover these issues in much better detail. Finally, if you want to understand robotics, vision, or other applied ﬁelds, you
should take classes in these subjects, since this course is not on applications.
On the other hand, if you do plan to study robotics, vision, or other applied subjects in the future, and you regard
yourself as a user of the mathematical techniques outlined in the syllabus below, then you may beneﬁt from this course.
Of the proofs, we will only see those that add understanding. Of the implementation aspects of algorithms that are
available in, say, Matlab or LApack, we will only see the parts that we need to understand when we use the code.
In brief, we will be able to cover more topics than other classes because we will be often (but not always) un
concerned with rigorous proof or implementation issues. The emphasis will be on intuition and on practicality of the
various algorithms. For instance, why are singular values important, and how do they relate to eigenvalues? What are
the dangers of Newtonstyle minimization? How does a Kalman ﬁlter work, and why do PDEs lead to sparse linear
systems? In this spirit, for instance, we discuss Singular Value Decomposition and Schur decomposition both because
they never fail and because they clarify the structure of an algebraic or a differential linear problem.
3
4 CHAPTER 1. INTRODUCTION
1.2 Syllabus
Here is the ideal syllabus, but how much we cover depends on how fast we go.
1. Introduction
2. Unknown numbers
2.1 Algebraic linear systems
2.1.1 Characterization of the solutions to a linear system
2.1.2 Gaussian elimination
2.1.3 The Singular Value Decomposition
2.1.4 The pseudoinverse
2.2 Function optimization
2.2.1 Newton and GaussNewton methods
2.2.2 LevenbergMarquardt method
2.2.3 Constraints and Lagrange multipliers
3. Unknown functions of one real variable
3.1 Ordinary differential linear systems
3.1.1 Eigenvalues and eigenvectors
3.1.2 The Schur decomposition
3.1.3 Ordinary differential linear systems
3.1.4 The matrix zoo
3.1.5 Real, symmetric, positivedeﬁnite matrices
3.2 Statistical estimation
3.2.1 Linear estimation
3.2.2 Weighted least squares
3.2.3 The Kalman ﬁlter
4. Unknown functions of several variables
4.1 Tensor ﬁelds of several variables
4.1.1 Grad, div, curl
4.1.2 Line, surface, and volume integrals
4.1.3 Green’s theorem and potential ﬁelds of two variables
4.1.4 Stokes’ and divergence theorems and potential ﬁelds of three variables
4.1.5 Diffusion and ﬂow problems
4.2 Partial differential equations and sparse linear systems
4.2.1 Finite differences
4.2.2 Direct versus iterative solution methods
4.2.3 Jacobi and GaussSeidel iterations
4.2.4 Successive overrelaxation
4.3 Calculus of variations
4.3.1 EulerLagrange equations
4.3.2 The brachistochrone
1.3. DISCUSSION OF THE SYLLABUS 5
1.3 Discussion of the Syllabus
In robotics, vision, physics, and any other branch of science whose subject belongs to or interacts with the real
world, mathematical models are developed that describe the relationship between different quantities. Some of these
quantities are measured, or sensed, while others are inferred by calculation. For instance, in computer vision, equations
tie the coordinates of points in space to the coordinates of corresponding points in different images. Image points are
data, world points are unknowns to be computed.
Similarly, in robotics, a robot arm is modeled by equations that describe where each link of the robot is as a
function of the conﬁguration of the link’s own joints and that of the links that support it. The desired position of the
end effector, as well as the current conﬁguration of all the joints, are the data. The unknowns are the motions to be
imparted to the joints so that the end effector reaches the desired target position.
Of course, what is data and what is unknown depends on the problem. For instance, the vision system mentioned
above could be looking at the robot arm. Then, the robot’s end effector position could be the unknowns to be solved
for by the vision system. Once vision has solved its problem, it could feed the robot’s endeffector position as data for
the robot controller to use in its own motion planning problem.
Sensed data are invariably noisy, because sensors have inherent limitations of accuracy, precision, resolution,
and repeatability. Consequently, the systems of equations to be solved are typically overconstrained: there are more
equations than unknowns, and it is hoped that the errors that affect the coefﬁcients of one equation are partially
cancelled by opposite errors in other equations. This is the basis of optimization problems: Rather than solving a
minimal system exactly, an optimization problem tries to solve many equations simultaneously, each of them only
approximately, but collectively as well as possible, according to some global criterion. Least squares is perhaps the
most popular such criterion, and we will devote a good deal of attention to it.
In summary, the problems encountered in robotics and vision, as well as other applications of mathematics, are
optimization problems. A fundamental distinction between different classes of problems reﬂects the complexity of the
unknowns. In the simplest case, unknowns are scalars. When there is more than one scalar, the unknown is a vector
of numbers, typically either real or complex. Accordingly, the ﬁrst part of this course will be devoted to describing
systems of algebraic equations, especially linear equations, and optimization techniques for problems whose solution
is a vector of reals. The main tool for understanding linear algebraic systems is the Singular Value Decomposition
(SVD), which is both conceptually fundamental and practically of extreme usefulness. When the systems are nonlinear,
they can be solved by various techniques of function optimization, of which we will consider the basic aspects.
Since physical quantities often evolve over time, many problems arise in which the unknowns are themselves
functions of time. This is our second class of problems. Again, problems can be cast as a set of equations to be solved
exactly, and this leads to the theory of Ordinary Differential Equations (ODEs). Here, “ordinary” expresses the fact
that the unknown functions depend on just one variable (e.g., time). The main conceptual tool for addressing ODEs is
the theory of eigenvalues, and the primary computational tool is the Schur decomposition.
Alternatively, problems with time varying solutions can be stated as minimization problems. When viewed glob
ally, these minimization problems lead to the calculus of variations. When the minimization problems above are
studied locally, they become state estimation problems, and the relevant theory is that of dynamic systems and Kalman
ﬁltering.
The third category of problems concerns unknown functions of more than one variable. The images taken by a
moving camera, for instance, are functions of time and space, and so are the unknown quantities that one can compute
fromthe images, such as the distance of points in the world fromthe camera. This leads to Partial Differential equations
(PDEs), or to extensions of the calculus of variations. In this class, we will see how PDEs arise, and how they can be
solved numerically.
6 CHAPTER 1. INTRODUCTION
1.4 Books
The class will be based on these lecture notes, and additional notes handed out when necessary. Other useful references
include the following.
• R. Courant and D. Hilbert, Methods of Mathematical Physics, Volume I and II, John Wiley and Sons, 1989.
• D. A. Danielson, Vectors and Tensors in Engineering and Physics, AddisonWesley, 1992.
• J. W. Demmel, Applied Numerical Linear Algebra, SIAM, 1997.
• A. Gelb et al., Applied Optimal Estimation, MIT Press, 1974.
• P. E. Gill, W. Murray, and M. H. Wright, Practical Optimization, Academic Press, 1993.
• G. H. Golub and C. F. Van Loan, Matrix Computations, 2nd Edition, Johns Hopkins University Press, 1989, or
3rd edition, 1997.
• W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling, Numerical Recipes in C, 2nd Edition,
Cambridge University Press, 1992.
• G. Strang, Introduction to Applied Mathematics, Wellesley Cambridge Press, 1986.
• A. E. Taylor and W. R. Mann, Advanced Calculus, 3rd Edition, John Wiley and Sons, 1983.
• L. N. Trefethen and D. Bau, III, Numerical Linear Algebra, SIAM, 1997.
• R. Weinstock, Calculus of Variations, Dover, 1974.
Chapter 2
Algebraic Linear Systems
An algebraic linear system is a set of m equations in n unknown scalars, which appear linearly. Without loss of
generality, an algebraic linear system can be written as follows:
Ax = b (2.1)
where A is an m× n matrix, x is an ndimensional vector that collects all of the unknowns, and b is a known vector
of dimension m. In this chapter, we only consider the cases in which the entries of A, b, and x are real numbers.
Two reasons are usually offered for the importance of linear systems. The ﬁrst is apparently deep, and refers
to the principle of superposition of effects. For instance, in dynamics, superposition of forces states that if force
f
1
produces acceleration a
1
(both possibly vectors) and force f
2
produces acceleration a
2
, then the combined force
f
1
+ αf
2
produces acceleration a
1
+ αa
2
. This is Newton’s second law of dynamics, although in a formulation less
common than the equivalent f = ma. Because Newton’s laws are at the basis of the entire ediﬁce of Mechanics,
linearity appears to be a fundamental principle of Nature. However, like all physical laws, Newton’s second law is an
abstraction, and ignores viscosity, friction, turbulence, and other nonlinear effects. Linearity, then, is perhaps more in
the physicist’s mind than in reality: if nonlinear effects can be ignored, physical phenomena are linear!
A more pragmatic explanation is that linear systems are the only ones we know how to solve in general. This
argument, which is apparently more shallow than the previous one, is actually rather important. Here is why. Given
two algebraic equations in two variables,
f(x, y) = 0
g(x, y) = 0 ,
we can eliminate, say, y and obtain the equivalent system
F(x) = 0
y = h(x) .
Thus, the original system is as hard to solve as it is to ﬁnd the roots of the polynomial F in a single variable. Unfortu
nately, if f and g have degrees d
f
and d
g
, the polynomial F has generically degree d
f
d
g
.
Thus, the degree of a system of equations is, roughly speaking, the product of the degrees. For instance, a system of
m quadratic equations corresponds to a polynomial of degree 2
m
. The only case in which the exponential is harmless
is when its base is 1, that is, when the system is linear.
In this chapter, we ﬁrst review a few basic facts about vectors in sections 2.1 through 2.4. More speciﬁcally, we
develop enough language to talk about linear systems and their solutions in geometric terms. In contrast with the
promise made in the introduction, these sections contain quite a few proofs. This is because a large part of the course
material is based on these notions, so we want to make sure that the foundations are sound. In addition, some of the
proofs lead to useful algorithms, and some others prove rather surprising facts. Then, in section 2.5, we characterize
the solutions of linear algebraic systems.
7
8 CHAPTER 2. ALGEBRAIC LINEAR SYSTEMS
2.1 Linear (In)dependence
Given n vectors a
1
, . . . , a
n
and n real numbers x
1
, . . . , x
n
, the vector
b =
n
j=1
x
j
a
j
(2.2)
is said to be a linear combination of a
1
, . . . , a
n
with coefﬁcients x
1
, . . . , x
n
.
The vectors a
1
, . . . , a
n
are linearly dependent if they admit the null vector as a nonzero linear combination. In
other words, they are linearly dependent if there is a set of coefﬁcients x
1
, . . . , x
n
, not all of which are zero, such that
n
j=1
x
j
a
j
= 0 . (2.3)
For later reference, it is useful to rewrite the last two equalities in a different form. Equation (2.2) is the same as
Ax = b (2.4)
and equation (2.3) is the same as
Ax = 0 (2.5)
where
A =
_
a
1
· · · a
n
¸
, x =
_
¸
_
x
1
.
.
.
x
n
_
¸
_ , b =
_
¸
_
b
1
.
.
.
b
m
_
¸
_ .
If you are not convinced of these equivalences, take the time to write out the components of each expression for a
small example. This is important. Make sure that you are comfortable with this.
Thus, the columns of a matrix A are dependent if there is a nonzero solution to the homogeneous system (2.5).
Vectors that are not dependent are independent.
Theorem 2.1.1 The vectors a
1
, . . . , a
n
are linearly dependent iff
1
at least one of them is a linear combination of the
others.
Proof. In one direction, dependency means that there is a nonzero vector x such that
n
j=1
x
j
a
j
= 0 .
Let x
k
be nonzero for some k. We have
n
j=1
x
j
a
j
= x
k
a
k
+
n
j=1, j=k
x
j
a
j
= 0
so that
a
k
= −
n
j=1, j=k
x
j
x
k
a
j
(2.6)
as desired. The converse is proven similarly: if
a
k
=
n
j=1, j=k
x
j
a
j
1
“iff” means “if and only if.”
2.2. BASIS 9
for some k, then
n
j=1
x
j
a
j
= 0
by letting x
k
= −1 (so that x is nonzero). ∆
2
We can make the ﬁrst part of the proof above even more speciﬁc, and state the following
Lemma 2.1.2 If n nonzero vectors a
1
, . . . , a
n
are linearly dependent then at least one of them is a linear combination
of the ones that precede it.
Proof. Just let k be the last of the nonzero x
j
. Then x
j
= 0 for j > k in (2.6), which then becomes
a
k
=
n
j<k
x
j
x
k
a
j
as desired. ∆
2.2 Basis
A set a
1
, . . . , a
n
is said to be a basis for a set B of vectors if the a
j
are linearly independent and every vector in B can
be written as a linear combination of them. B is said to be a vector space if it contains all the linear combinations of
its basis vectors. In particular, this implies that every linear space contains the zero vector. The basis vectors are said
to span the vector space.
Theorem 2.2.1 Given a vector b in the vector space B and a basis a
1
, . . . , a
n
for B, the coefﬁcients x
1
, . . . , x
n
such
that
b =
n
j=1
x
j
a
j
are uniquely determined.
Proof. Let also
b =
n
j=1
x
j
a
j
.
Then,
0 = b −b =
n
j=1
x
j
a
j
−
n
j=1
x
j
a
j
=
n
j=1
(x
j
−x
j
)a
j
but because the a
j
are linearly independent, this is possible only when x
j
−x
j
= 0 for every j. ∆
The previous theorem is a very important result. An equivalent formulation is the following:
If the columns a
1
, . . . , a
n
of A are linearly independent and the system Ax = b admits a solution, then
the solution is unique.
2
This symbol marks the end of a proof.
10 CHAPTER 2. ALGEBRAIC LINEAR SYSTEMS
Pause for a minute to verify that this formulation is equivalent.
Theorem 2.2.2 Two different bases for the same vector space B have the same number of vectors.
Proof. Let a
1
, . . . , a
n
and a
1
, . . . , a
n
be two different bases for B. Then each a
j
is in B (why?), and can therefore
be written as a linear combination of a
1
, . . . , a
n
. Consequently, the vectors of the set
G = a
1
, a
1
, . . . , a
n
must be linearly dependent. We call a set of vectors that contains a basis for B a generating set for B. Thus, G is a
generating set for B.
The rest of the proof now proceeds as follows: we keep removing a vectors from G and replacing them with a
vectors in such a way as to keep Ga generating set for B. Then we show that we cannot run out of a vectors before we
run out of a
vectors, which proves that n ≥ n
. We then switch the roles of a and a
vectors to conclude that n
≥ n.
This proves that n = n
.
From lemma 2.1.2, one of the vectors in G is a linear combination of those preceding it. This vector cannot be a
1
,
since it has no other vectors preceding it. So it must be one of the a
j
vectors. Removing the latter keeps Ga generating
set, since the removed vector depends on the others. Now we can add a
2
to G, writing it right after a
1
:
G = a
1
, a
2
, . . . .
G is still a generating set for B.
Let us continue this procedure until we run out of either a vectors to remove or a
vectors to add. The a vectors
cannot run out ﬁrst. Suppose in fact per absurdum that G is now made only of a
vectors, and that there are still
leftover a
vectors that have not been put into G. Since the a
s form a basis, they are mutually linearly independent.
Since B is a vector space, all the a
s are in B. But then G cannot be a generating set, since the vectors in it cannot
generate the leftover a
s, which are independent of those in G. This is absurd, because at every step we have made
sure that G remains a generating set. Consequently, we must run out of a
s ﬁrst (or simultaneously with the last a).
That is, n ≥ n
.
Now we can repeat the whole procedure with the roles of a vectors and a
vectors exchanged. This shows that
n
≥ n, and the two results together imply that n = n
. ∆
A consequence of this theorem is that any basis for R
m
has m vectors. In fact, the basis of elementary vectors
e
j
= jth column of the m×m identity matrix
is clearly a basis for R
m
, since any vector
b =
_
¸
_
b
1
.
.
.
b
m
_
¸
_
can be written as
b =
m
j=1
b
j
e
j
and the e
j
are clearly independent. Since this elementary basis has m vectors, theorem 2.2.2 implies that any other
basis for R
m
has m vectors.
Another consequence of theorem 2.2.2 is that n vectors of dimension m < n are bound to be dependent, since any
basis for R
m
can only have m vectors.
Since all bases for a space have the same number of vectors, it makes sense to deﬁne the dimension of a space as
the number of vectors in any of its bases.
2.3. INNER PRODUCT AND ORTHOGONALITY 11
2.3 Inner Product and Orthogonality
In this section we establish the geometric meaning of the algebraic notions of norm, inner product, projection, and
orthogonality. The fundamental geometric fact that is assumed to be known is the law of cosines: given a triangle with
sides a, b, c (see ﬁgure 2.1), we have
a
2
= b
2
+c
2
−2bc cos θ
where θ is the angle between the sides of length b and c. A special case of this law is Pythagoras’ theorem, obtained
when θ = ±π/2.
θ
c
b
a
Figure 2.1: The law of cosines states that a
2
= b
2
+c
2
−2bc cos θ.
In the previous section we saw that any vector in R
m
can be written as the linear combination
b =
m
j=1
b
j
e
j
(2.7)
of the elementary vectors that point along the coordinate axes. The length of these elementary vectors is clearly one,
because each of them goes from the origin to the unit point of one of the axes. Also, any two of these vectors form a
90degree angle, because the coordinate axes are orthogonal by construction. How long is b? From equation (2.7) we
obtain
b = b
1
e
1
+
m
j=2
b
j
e
j
and the two vectors b
1
e
1
and
m
j=2
b
j
e
j
are orthogonal. By Pythagoras’ theorem, the square of the length b of b is
b
2
= b
2
1
+
m
j=2
b
j
e
j
2
.
Pythagoras’ theorem can now be applied again to the last sum by singling out its ﬁrst term b
2
e
2
, and so forth. In
conclusion,
b
2
=
m
j=1
b
2
j
.
This result extends Pythagoras’ theorem to m dimensions.
If we deﬁne the inner product of two mdimensional vectors as follows:
b
T
c =
m
j=1
b
j
c
j
,
then
b
2
= b
T
b . (2.8)
Thus, the squared length of a vector is the inner product of the vector with itself. Here and elsewhere, vectors are
column vectors by default, and the symbol
T
makes them into row vectors.
12 CHAPTER 2. ALGEBRAIC LINEAR SYSTEMS
Theorem 2.3.1
b
T
c = b c cos θ
where θ is the angle between b and c.
Proof. The law of cosines applied to the triangle with sides b, c, and b −c yields
b −c
2
= b
2
+c
2
−2b c cos θ
and from equation (2.8) we obtain
b
T
b + c
T
c −2b
T
c = b
T
b + c
T
c −2b c cos θ .
Canceling equal terms and dividing by 2 yields the desired result. ∆
Corollary 2.3.2 Two nonzero vectors b and c in R
m
are mutually orthogonal iff b
T
c = 0.
Proof. When θ = ±π/2, the previous theorem yields b
T
c = 0. ∆
Given two vectors b and c applied to the origin, the projection of b onto c is the vector from the origin to the point
p on the line through c that is nearest to the endpoint of b. See ﬁgure 2.2.
p
b
c
Figure 2.2: The vector from the origin to point p is the projection of b onto c. The line from the endpoint of b to p is
orthogonal to c.
Theorem 2.3.3 The projection of b onto c is the vector
p = P
c
b
where P
c
is the following square matrix:
P
c
=
cc
T
c
T
c
.
Proof. Since by deﬁnition point p is on the line through c, the projection vector p has the form p = ac, where
a is some real number. From elementary geometry, the line between p and the endpoint of b is shortest when it is
orthogonal to c:
c
T
(b −ac) = 0
2.4. ORTHOGONAL SUBSPACES AND THE RANK OF A MATRIX 13
which yields
a =
c
T
b
c
T
c
so that
p = ac = c a =
cc
T
c
T
c
b
as advertised. ∆
2.4 Orthogonal Subspaces and the Rank of a Matrix
Linear transformations map spaces into spaces. It is important to understand exactly what is being mapped into what
in order to determine whether a linear system has solutions, and if so how many. This section introduces the notion of
orthogonality between spaces, deﬁnes the null space and range of a matrix, and its rank. With these tools, we will be
able to characterize the solutions to a linear system in section 2.5. In the process, we also introduce a useful procedure
(GramSchmidt) for orthonormalizing a set of linearly independent vectors.
Two vector spaces A and B are said to be orthogonal to one another when every vector in A is orthogonal to every
vector in B. If vector space A is a subspace of R
m
for some m, then the orthogonal complement of A is the set of all
vectors in R
m
that are orthogonal to all the vectors in A.
Notice that complement and orthogonal complement are very different notions. For instance, the complement of
the xy plane in R
3
is all of R
3
except the xy plane, while the orthogonal complement of the xy plane is the z axis.
Theorem 2.4.1 Any basis a
1
, . . . , a
n
for a subspace A of R
m
can be extended into a basis for R
m
by adding m− n
vectors a
n+1
, . . . , a
m
.
Proof. If n = m we are done. If n < m, the given basis cannot generate all of R
m
, so there must be a vector, call
it a
n+1
, that is linearly independent of a
1
, . . . , a
n
. This argument can be repeated until the basis spans all of R
m
, that
is, until m = n. ∆
Theorem 2.4.2 (GramSchmidt) Given n vectors a
1
, . . . , a
n
, the following construction
r = 0
for j = 1 to n
a
j
= a
j
−
r
l=1
(q
T
l
a
j
)q
l
if a
j
= 0
r = r + 1
q
r
=
a
j
a
j
end
end
yields a set of orthonormal
3
vectors q
1
. . . , q
r
that span the same space as a
1
, . . . , a
n
.
Proof. We ﬁrst prove by induction on r that the vectors q
r
are mutually orthonormal. If r = 1, there is little to
prove. The normalization in the above procedure ensures that q
1
has unit norm. Let us now assume that the procedure
3
Orthonormal means orthogonal and with unit norm.
14 CHAPTER 2. ALGEBRAIC LINEAR SYSTEMS
above has been performed a number j −1 of times sufﬁcient to ﬁnd r −1 vectors q
1
, . . . , q
r−1
, and that these vectors
are orthonormal (the inductive assumption). Then for any i < r we have
q
T
i
a
j
= q
T
i
a
j
−
r−1
l=1
(q
T
l
a
j
)q
T
i
q
l
= 0
because the term q
T
i
a
j
cancels the ith term (q
T
i
a
j
)q
T
i
q
i
of the sum (remember that q
T
i
q
i
= 1), and the inner products
q
T
i
q
l
are zero by the inductive assumption. Because of the explicit normalization step q
r
= a
j
/a
j
, the vector q
r
, if
computed, has unit norm, and because q
T
i
a
j
= 0, it follwos that q
r
is orthogonal to all its predecessors, q
T
i
q
r
= 0 for
i = 1, . . . , r −1.
Finally, we notice that the vectors q
j
span the same space as the a
j
s, because the former are linear combinations
of the latter, are orthonormal (and therefore independent), and equal in number to the number of linearly independent
vectors in a
1
, . . . , a
n
. ∆
Theorem 2.4.3 If A is a subspace of R
m
and A
⊥
is the orthogonal complement of A in R
m
, then
dim(A) + dim(A
⊥
) = m .
Proof. Let a
1
, . . . , a
n
be a basis for A. Extend this basis to a basis a
1
, . . . , a
m
for R
m
(theorem 2.4.1). Orthonor
malize this basis by the GramSchmidt procedure (theorem 2.4.2) to obtain q
1
, . . . , q
m
. By construction, q
1
, . . . , q
n
span A. Because the new basis is orthonormal, all vectors generated by q
n+1
, . . . , q
m
are orthogonal to all vectors
generated by q
1
, . . . , q
n
, so there is a space of dimension at least m− n that is orthogonal to A. On the other hand,
the dimension of this orthogonal space cannot exceed m−n, because otherwise we would have more than m vectors
in a basis for R
m
. Thus, the dimension of the orthogonal space A
⊥
is exactly m−n, as promised. ∆
We can now start to talk about matrices in terms of the subspaces associated with them. The null space null(A) of
an m× n matrix A is the space of all ndimensional vectors that are orthogonal to the rows of A. The range of A is
the space of all mdimensional vectors that are generated by the columns of A. Thus, x ∈ null(A) iff Ax = 0, and
b ∈ range(A) iff Ax = b for some x.
From theorem 2.4.3, if null(A) has dimension h, then the space generated by the rows of A has dimension r =
n −h, that is, A has n −h linearly independent rows. It is not obvious that the space generated by the columns of A
has also dimension r = n −h. This is the point of the following theorem.
Theorem 2.4.4 The number r of linearly independent columns of any m× n matrix A is equal to the number of its
independent rows, and
r = n −h
where h = dim(null(A)).
Proof. We have already proven that the number of independent rows is n − h. Now we show that the number of
independent columns is also n −h, by constructing a basis for range(A).
Let v
1
, . . . , v
h
be a basis for null(A), and extend this basis (theorem 2.4.1) into a basis v
1
, . . . , v
n
for R
n
. Then
we can show that the n −h vectors Av
h+1
, . . . , Av
n
are a basis for the range of A.
First, these n − h vectors generate the range of A. In fact, given an arbitrary vector b ∈ range(A), there must be
a linear combination of the columns of A that is equal to b. In symbols, there is an ntuple x such that Ax = b. The
ntuple x itself, being an element of R
n
, must be some linear combination of v
1
, . . . , v
n
, our basis for R
n
:
x =
n
j=1
c
j
v
j
.
2.5. THE SOLUTIONS OF A LINEAR SYSTEM 15
Thus,
b = Ax = A
n
j=1
c
j
v
j
=
n
j=1
c
j
Av
j
=
n
j=h+1
c
j
Av
j
since v
1
, . . . , v
h
span null(A), so that Av
j
= 0 for j = 1, . . . , h. This proves that the n −h vectors Av
h+1
, . . . , Av
n
generate range(A).
Second, we prove that the n − h vectors Av
h+1
, . . . , Av
n
are linearly independent. Suppose, per absurdum, that
they are not. Then there exist numbers x
h+1
, . . . , x
n
, not all zero, such that
n
j=h+1
x
j
Av
j
= 0
so that
A
n
j=h+1
x
j
v
j
= 0 .
But then the vector
n
j=h+1
x
j
v
j
is in the null space of A. Since the vectors v
1
, . . . , v
h
are a basis for null(A), there
must exist coefﬁcients x
1
, . . . , x
h
such that
n
j=h+1
x
j
v
j
=
h
j=1
x
j
v
j
,
in conﬂict with the assumption that the vectors v
1
, . . . , v
n
are linearly independent. ∆
Thanks to this theorem, we can deﬁne the rank of Ato be equivalently the number of linearly independent columns
or of linearly independent rows of A:
rank(A) = dim(range(A)) = n −dim(null(A)) .
2.5 The Solutions of a Linear System
Thanks to the results of the previous sections, we now have a complete picture of the four spaces associated with an
m×n matrix A of rank r and nullspace dimension h:
range(A); dimension r = rank(A)
null(A); dimension h
range(A)
⊥
; dimension m−r
null(A)
⊥
; dimension r = n −h .
The space range(A)
⊥
is called the left nullspace of the matrix, and null(A)
⊥
is called the rowspace of A. A
frequently used synonym for “range” is column space. It should be obvious from the meaning of these spaces that
null(A)
⊥
= range(A
T
)
range(A)
⊥
= null(A
T
)
where A
T
is the transpose of A, deﬁned as the matrix obtained by exchanging the rows of A with its columns.
Theorem 2.5.1 The matrix A transforms a vector x in its null space into the zero vector, and an arbitrary vector x
into a vector in range(A).
16 CHAPTER 2. ALGEBRAIC LINEAR SYSTEMS
This allows characterizing the set of solutions to linear system as follows. Let
Ax = b
be an m×n system (m can be less than, equal to, or greater than n). Also, let
r = rank(A)
be the number of linearly independent rows or columns of A. Then,
b ∈ range(A) ⇒ no solutions
b ∈ range(A) ⇒ ∞
n−r
solutions
with the convention that ∞
0
= 1. Here, ∞
k
is the cardinality of a kdimensional vector space.
In the ﬁrst case above, there can be no linear combination of the columns (no x vector) that gives b, and the system
is said to be incompatible. In the second, compatible case, three possibilities occur, depending on the relative sizes of
r, m, n:
• When r = n = m, the system is invertible. This means that there is exactly one x that satisﬁes the system, since
the columns of A span all of R
n
. Notice that invertibility depends only on A, not on b.
• When r = n and m > n, the system is redundant. There are more equations than unknowns, but since b is in
the range of A there is a linear combination of the columns (a vector x) that produces b. In other words, the
equations are compatible, and exactly one solution exists.
4
• When r < n the system is underdetermined. This means that the null space is nontrivial (i.e., it has dimension
h > 0), and there is a space of dimension h = n −r of vectors x such that Ax = 0. Since b is assumed to be in
the range of A, there are solutions x to Ax = b, but then for any y ∈ null(A) also x + y is a solution:
Ax = b , Ay = 0 ⇒ A(x + y) = b
and this generates the ∞
h
= ∞
n−r
solutions mentioned above.
Notice that if r = n then n cannot possibly exceed m, so the ﬁrst two cases exhaust the possibilities for r = n. Also,
r cannot exceed either m or n. All the cases are summarized in ﬁgure 2.3.
Of course, listing all possibilities does not provide an operational method for determining the type of linear system
for a given pair A, b. Gaussian elimination, and particularly its version called reduction to echelon form is such a
method, and is summarized in the next section.
2.6 Gaussian Elimination
Gaussian elimination is an important technique for solving linear systems. In addition to always yielding a solution,
no matter whether the system is invertible or not, it also allows determining the rank of a matrix.
Other solution techniques exist for linear systems. Most notably, iterative methods solve systems in a time that
depends on the accuracy required, while direct methods, like Gaussian elimination, are done in a ﬁnite amount of
time that can be bounded given only the size of a matrix. Which method to use depends on the size and structure
(e.g., sparsity) of the matrix, whether more information is required about the matrix of the system, and on numerical
considerations. More on this in chapter 3.
Consider the m×n system
Ax = b (2.9)
4
Notice that the technical meaning of “redundant” has a stronger meaning than “with more equations than unknowns.” The case r < n < mis
possible, has more equations (m) than unknowns (n), admits a solution if b ∈ range(A), but is called “underdetermined” because there are fewer
(r) independent equations than there are unknowns (see next item). Thus, “redundant” means “with exactly one solution and with more equations
than unknowns.”
2.6. GAUSSIAN ELIMINATION 17
yes no
yes no
yes no
underdetermined
redundant invertible
b in range(A)
r = n
m = n
incompatible
Figure 2.3: Types of linear systems.
which can be square or rectangular, invertible, incompatible, redundant, or underdetermined. In short, there are no
restrictions on the system. Gaussian elimination replaces the rows of this system by linear combinations of the rows
themselves until A is changed into a matrix U that is in the socalled echelon form. This means that
• Nonzero rows precede rows with all zeros. The ﬁrst nonzero entry, if any, of a row, is called a pivot.
• Below each pivot is a column of zeros.
• Each pivot lies to the right of the pivot in the row above.
The same operations are applied to the rows of A and to those of b, which is transformed to a new vector c, so equality
is preserved and solving the ﬁnal system yields the same solution as solving the original one.
Once the system is transformed into echelon form, we compute the solution x by backsubstitution, that is, by
solving the transformed system
Ux = c .
2.6.1 Reduction to Echelon Form
The matrix A is reduced to echelon form by a process in m − 1 steps. The ﬁrst step is applied to U
(1)
= A and
c
(1)
= b. The kth step is applied to rows k, . . . , m of U
(k)
and c
(k)
and produces U
(k+1)
and c
(k+1)
. The last step
produces U
(m)
= U and c
(m)
= c. Initially, the “pivot column index” p is set to one. Here is step k, where u
ij
denotes
entry i, j of U
(k)
:
Skip nopivot columns If u
ip
is zero for every i = k, . . . , m, then increment p by 1. If p exceeds n stop.
5
Row exchange Now p ≤ n and u
ip
is nonzero for some k ≤ i ≤ m. Let l be one such value of i
6
. If l = k, exchange
rows l and k of U
(k)
and of c
(k)
.
Triangularization The new entry u
kp
is nonzero, and is called the pivot. For i = k + 1, . . . , m, subtract row k of
U
(k)
multiplied by u
ip
/u
kp
from row i of U
(k)
, and subtract entry k of c
(k)
multiplied by u
ip
/u
kp
from entry i
of c
(k)
. This zeros all the entries in the column below the pivot, and preserves the equality of left and righthand
side.
When this process is ﬁnished, U is in echelon form. In particular, if the matrix is square and if all columns have a
pivot, then U is uppertriangular.
5
“Stop” means that the entire algorithm is ﬁnished.
6
Different ways of selecting l here lead to different numerical properties of the algorithm. Selecting the largest entry in the column leads to
better roundoff properties.
18 CHAPTER 2. ALGEBRAIC LINEAR SYSTEMS
2.6.2 Backsubstitution
A system
Ux = c (2.10)
in echelon form is easily solved for x. To see this, we ﬁrst solve the system symbolically, leaving undetermined vari
ables speciﬁed by their name, and then transformthis solution procedure into one that can be more readily implemented
numerically.
Let r be the index of the last nonzero row of U. Since this is the number of independent rows of U, r is the rank
of U. It is also the rank of A, because A and U admit exactly the same solutions and are equal in size. If r < m, the
last m−r equations yield a subsystem of the following form:
_
¸
_
0
.
.
.
0
_
¸
_ =
_
¸
_
c
r+1
.
.
.
c
m
_
¸
_ .
Let us call this the residual subsystem. If on the other hand r = m (obviously r cannot exceed m), there is no residual
subsystem.
If there is a residual system (i.e., r < m) and some of c
r+1
, . . . , c
m
are nonzero, then the equations corresponding
to these nonzero entries are incompatible, because they are of the form 0 = c
i
with c
i
= 0. Since no vector x can
satisfy these equations, the linear system admits no solutions: it is incompatible.
Let us now assume that either there is no residual system, or if there is one it is compatible, that is, c
r+1
= . . . =
c
m
= 0. Then, solutions exist, and they can be determined by backsubstitution, that is, by solving the equations
starting from the last one and replacing the result in the equations higher up.
Backsubstitutions works as follows. First, remove the residual system, if any. We are left with an r ×n system. In
this system, call the variables corresponding to the r columns with pivots the basic variables, and call the other n −r
the free variables. Say that the pivot columns are j
1
, . . . , j
r
. Then symbolic backsubstitution consists of the following
sequence:
for i = r downto 1
x
j
i
=
1
u
ij
i
_
_
c
i
−
n
l=ji+1
u
il
x
l
_
_
end
This is called symbolic backsubstitution because no numerical values are assigned to free variables. Whenever they
appear in the expressions for the basic variables, free variables are speciﬁed by name rather than by value. The ﬁnal
result is a solution with as many free parameters as there are free variables. Since any value given to the free variables
leaves the equality of system (2.10) satisﬁed, the presence of free variables leads to an inﬁnity of solutions.
When solving a system in echelon form numerically, however, it is inconvenient to carry around nonnumeric
symbol names (the free variables). Here is an equivalent solution procedure that makes this unnecessary. The solution
obtained by backsubstitution is an afﬁne function
7
of the free variables, and can therefore be written in the form
x = v
0
+x
j
1
v
1
+. . . +x
j
n−r
v
n−r
(2.11)
where the x
j
i
are the free variables. The vector v
0
is the solution when all free variables are zero, and can therefore be
obtained by replacing each free variable by zero during backsubstitution. Similarly, the vector v
i
for i = 1, . . . , n −r
can be obtained by solving the homogeneous system
Ux = 0
with x
ji
= 1 and all other free variables equal to zero. In conclusion, the general solution can be obtained by running
backsubstitution n−r +1 times, once for the nonhomogeneous system, and n−r times for the homogeneous system,
with suitable values of the free variables. This yields the solution in the form (2.11).
Notice that the vectors v
1
, . . . , v
n−r
form a basis for the null space of U, and therefore of A.
7
An afﬁne function is a linear function plus a constant.
2.6. GAUSSIAN ELIMINATION 19
2.6.3 An Example
An example will clarify both the reduction to echelon form and backsubstitution. Consider the system
Ax = b
where
U
(1)
= A =
_
_
1 3 3 2
2 6 9 5
−1 −3 3 0
_
_
, c
(1)
= b =
_
_
1
5
5
_
_
.
Reduction to echelon form transforms A and b as follows. In the ﬁrst step (k = 1), there are no nopivot columns,
so the pivot column index p stays at 1. Throughout this example, we choose a trivial pivot selection rule: we pick the
ﬁrst nonzero entry at or below row k in the pivot column. For k = 1, this means that u
(1)
11
= a
11
= 1 is the pivot. In
other words, no row exchange is necessary.
8
The triangularization step subtracts row 1 multiplied by 2/1 from row 2,
and subtracts row 1 multiplied by 1/1 from row 3. When applied to both U
(1)
and c
(1)
this yields
U
(2)
=
_
_
1 3 3 2
0 0 3 1
0 0 6 2
_
_
, c
(2)
=
_
_
1
3
6
_
_
.
Notice that now (k = 2) the entries u
(2)
ip
are zero for i = 2, 3, for both p = 1 and p = 2, so p is set to 3: the second
pivot column is column 3, and u
(2)
23
is nonzero, so no row exchange is necessary. In the triangularization step, row 2
multiplied by 6/3 is subtracted from row 3 for both U
(2)
and c
(2)
to yield
U = U
(3)
=
_
_
1 3 3 2
0 0 3 1
0 0 0 0
_
_
, c = c
(3)
=
_
_
1
3
0
_
_
.
There is one zero row in the lefthand side, and the rank of U and that of A is r = 2, the number of nonzero rows.
The residual system is 0 = 0 (compatible), and r < n = 4, so the system is underdetermined, with ∞
n−r
= ∞
2
solutions.
In symbolic backsubstitution, the residual subsystem is ﬁrst deleted. This yields the reduced system
_
1 3 3 2
0 0 3 1
_
x =
_
1
3
_
(2.12)
The basic variables are x
1
and x
3
, corresponding to the columns with pivots. The other two variables, x
2
and
x
4
, are free. Backsubstitution applied ﬁrst to row 2 and then to row 1 yields the following expressions for the pivot
variables:
x
3
=
1
u
23
(c
2
−u
24
x
4
) =
1
3
(3 −x
4
) = 1 −
1
3
x
4
x
1
=
1
u
11
(c
1
−u
12
x
2
−u
13
x
3
−u
14
x
4
) =
1
1
(1 −3x
2
−3x
3
−2x
4
)
= 1 −3x
2
−(3 −x
4
) −2x
4
= −2 −3x
2
−x
4
so the general solution is
x =
_
¸
¸
_
−2 −3x
2
−x
4
x
2
1 −
1
3
x
4
x
4
_
¸
¸
_
=
_
¸
¸
_
−2
0
1
0
_
¸
¸
_
+x
2
_
¸
¸
_
−3
1
0
0
_
¸
¸
_
+x
4
_
¸
¸
_
−1
0
−
1
3
1
_
¸
¸
_
.
8
Selecting the largest entry in the column at or below row k is a frequent choice, and this would have caused rows 1 and 2 to be switched.
20 CHAPTER 2. ALGEBRAIC LINEAR SYSTEMS
This same solution can be found by the numerical backsubstitution method as follows. Solving the reduced system
(2.12) with x
2
= x
4
= 0 by numerical backsubstitution yields
x
3
=
1
3
(3 −1 · 0) = 1
x
1
=
1
1
(1 −3 · 0 −3 · 1 −2 · 0) = −2
so that
v
0
=
_
¸
¸
_
−2
0
1
0
_
¸
¸
_
.
Then v
1
is found by solving the nonzero part (ﬁrst two rows) of Ux = 0 with x
2
= 1 and x
4
= 0 to obtain
x
3
=
1
3
(−1 · 0) = 0
x
1
=
1
1
(−3 · 1 −3 · 0 −2 · 0) = −3
so that
v
1
=
_
¸
¸
_
−3
1
0
0
_
¸
¸
_
.
Finally, solving the nonzero part of Ux = 0 with x
2
= 0 and x
4
= 1 leads to
x
3
=
1
3
(−1 · 1) = −
1
3
x
1
=
1
1
(−3 · 0 −3 ·
_
−
1
3
_
−2 · 1) = −1
so that
v
2
=
_
¸
¸
_
−1
0
−
1
3
1
_
¸
¸
_
and
x = v
0
+x
2
v
1
+x
4
v
2
=
_
¸
¸
_
−2
0
1
0
_
¸
¸
_
+x
2
_
¸
¸
_
−3
1
0
0
_
¸
¸
_
+x
4
_
¸
¸
_
−1
0
−
1
3
1
_
¸
¸
_
just as before.
As mentioned at the beginning of this section, Gaussian elimination is a direct method, in the sense that the answer
can be found in a number of steps that depends only on the size of the matrix A. In the next chapter, we study a different
2.6. GAUSSIAN ELIMINATION 21
method, based on the socalled the Singular Value Decomposition (SVD). This is an iterative method, meaning that an
exact solution usually requires an inﬁnite number of steps, and the number of steps necessary to ﬁnd an approximate
solution depends on the desired number of correct digits.
This state of affairs would seem to favor Gaussian elimination over the SVD. However, the latter yields a much
more complete answer, since it computes bases for all the four spaces mentioned above, as well as a set of quantities,
called the singular values, which provide great insight into the behavior of the linear transformation represented by
the matrix A. Singular values also allow deﬁning a notion of approximate rank which is very useful in a large number
of applications. It also allows ﬁnding approximate solutions when the linear system in question is incompatible. In
addition, for reasons that will become apparent in the next chapter, the computation of the SVD is numerically well
behaved, much more so than Gaussian elimination. Finally, very efﬁcient algorithms for the SVD exist. For instance,
on a regular workstation, one can compute several thousand SVDs of 5 × 5 matrices in one second. More generally,
the number of ﬂoating point operations necessary to compute the SVD of an m×n matrix is amn
2
+bn
3
where a, b
are small numbers that depend on the details of the algorithm.
22 CHAPTER 2. ALGEBRAIC LINEAR SYSTEMS
Chapter 3
The Singular Value Decomposition
In section 2, we saw that a matrix transforms vectors in its domain into vectors in its range (column space), and vectors
in its null space into the zero vector. No nonzero vector is mapped into the left null space, that is, into the orthogonal
complement of the range. In this section, we make this statement more speciﬁc by showing how unit vectors
1
in the
rowspace are transformed by matrices. This describes the action that a matrix has on the magnitudes of vectors as
well. To this end, we ﬁrst need to introduce the notion of orthogonal matrices, and interpret them geometrically as
transformations between systems of orthonormal coordinates. We do this in section 3.1. Then, in section 3.2, we use
these new concepts to introduce the allimportant concept of the Singular Value Decomposition (SVD). The chapter
concludes with some basic applications and examples.
3.1 Orthogonal Matrices
Let S be an ndimensional subspace of R
m
(so that we necessarily have n ≤ m), and let v
1
, . . . , v
n
be an orthonormal
basis for S. Consider a point P in S. If the coordinates of P in R
m
are collected in an mdimensional vector
p =
_
¸
_
p
1
.
.
.
p
m
_
¸
_ ,
and since P is in S, it must be possible to write p as a linear combination of the v
j
s. In other words, there must exist
coefﬁcients
q =
_
¸
_
q
1
.
.
.
q
n
_
¸
_
such that
p = q
1
v
1
+. . . +q
n
v
n
= V q
where
V =
_
v
1
· · · v
n
¸
is an m×n matrix that collects the basis for S as its columns. Then for any i = 1, . . . , n we have
v
T
i
p = v
T
i
n
j=1
q
j
v
j
=
n
j=1
q
j
v
T
i
v
j
= q
i
,
since the v
j
are orthonormal. This is important, and may need emphasis:
1
Vectors with unit norm.
23
24 CHAPTER 3. THE SINGULAR VALUE DECOMPOSITION
If
p =
n
j=1
q
j
v
j
and the vectors of the basis v
1
, . . . , v
n
are orthonormal, then the coefﬁcients q
j
are the signed mag
nitudes of the projections of p onto the basis vectors:
q
j
= v
T
j
p . (3.1)
In matrix form,
q = V
T
p . (3.2)
Also, we can collect the n
2
equations
v
T
i
v
j
=
_
1 if i = j
0 otherwise
into the following matrix equation:
V
T
V = I (3.3)
where I is the n ×n identity matrix. A matrix V that satisﬁes equation (3.3) is said to be orthogonal. Thus, a matrix
is orthogonal if its columns are orthonormal. Since the left inverse of a matrix V is deﬁned as the matrix L such that
LV = I , (3.4)
comparison with equation (3.3) shows that the left inverse of an orthogonal matrix V exists, and is equal to the
transpose of V .
Of course, this argument requires V to be full rank, so that the solution L to equation (3.4) is unique. However, V
is certainly full rank, because it is made of orthonormal columns.
Notice that V R = I cannot possibly have a solution when m > n, because the m × m identity matrix has m
linearly independent
2
columns, while the columns of V R are linear combinations of the n columns of V , so V R can
have at most n linearly independent columns.
Of course, this result is still valid when V is m×m and has orthonormal columns, since equation (3.3) still holds.
However, for square, fullrank matrices (r = m = n), the distinction between left and right inverse vanishes. In fact,
suppose that there exist matrices L and R such that LV = I and V R = I. Then L = L(V R) = (LV )R = R, so the
left and the right inverse are the same. Thus, for square orthogonal matrices, V
T
is both the left and the right inverse:
V
T
V = V V
T
= I ,
and V
T
is then simply said to be the inverse of V :
V
T
= V
−1
.
Since the matrix V V
T
contains the inner products between the rows of V (just as V
T
V is formed by the inner
products of its columns), the argument above shows that the rows of a square orthogonal matrix are orthonormal as
well. We can summarize this discussion as follows:
Theorem 3.1.1 The left inverse of an orthogonal m×n matrix V with m ≥ n exists and is equal to the transpose of
V :
V
T
V = I .
In particular, if m = n, the matrix V
−1
= V
T
is also the right inverse of V :
V square ⇒ V
−1
V = V
T
V = V V
−1
= V V
T
= I .
2
Nay, orthonormal.
3.1. ORTHOGONAL MATRICES 25
Sometimes, when m = n, the geometric interpretation of equation (3.2) causes confusion, because two interpreta
tions of it are possible. In the interpretation given above, the point P remains the same, and the underlying reference
frame is changed from the elementary vectors e
j
(that is, from the columns of I) to the vectors v
j
(that is, to the
columns of V ). Alternatively, equation (3.2) can be seen as a transformation, in a ﬁxed reference system, of point P
with coordinates p into a different point Q with coordinates q. This, however, is relativity, and should not be surpris
ing: If you spin clockwise on your feet, or if you stand still and the whole universe spins counterclockwise around
you, the result is the same.
3
Consistently with either of these geometric interpretations, we have the following result:
Theorem 3.1.2 The norm of a vector x is not changed by multiplication by an orthogonal matrix V :
V x = x .
Proof.
V x
2
= x
T
V
T
V x = x
T
x = x
2
.
∆
We conclude this section with an obvious but useful consequence of orthogonality. In section 2.3 we deﬁned the
projection p of a vector b onto another vector c as the point on the line through c that is closest to b. This notion of
projection can be extended from lines to vector spaces by the following deﬁnition: The projection p of a point b ∈ R
n
onto a subspace C is the point in C that is closest to b.
Also, for unit vectors c, the projection matrix is cc
T
(theorem 2.3.3), and the vector b − p is orthogonal to c. An
analogous result holds for subspace projection, as the following theorem shows.
Theorem 3.1.3 Let U be an orthogonal matrix. Then the matrix UU
T
projects any vector b onto range(U). Further
more, the difference vector between b and its projection p onto range(U) is orthogonal to range(U):
U
T
(b −p) = 0 .
Proof. A point p in range(U) is a linear combination of the columns of U:
p = Ux
where x is the vector of coefﬁcients (as many coefﬁcients as there are columns in U). The squared distance between b
and p is
b −p
2
= (b −p)
T
(b −p) = b
T
b + p
T
p −2b
T
p = b
T
b + x
T
U
T
Ux −2b
T
Ux .
Because of orthogonality, U
T
U is the identity matrix, so
b −p
2
= b
T
b + x
T
x −2b
T
Ux .
The derivative of this squared distance with respect to x is the vector
2x −2U
T
b
3
At least geometrically. One solution may be more efﬁcient than the other in other ways.
26 CHAPTER 3. THE SINGULAR VALUE DECOMPOSITION
x
b
2
v
1
u σ
v
u
3
2
x
1
x
2
2
b
b
3
1
2
u σ
1 1
b
Figure 3.1: The matrix in equation (3.5) maps a circle on the plane into an ellipse in space. The two small boxes are
corresponding points.
which is zero iff
x = U
T
b ,
that is, when
p = Ux = UU
T
b
as promised.
For this value of p the difference vector b −p is orthogonal to range(U), in the sense that
U
T
(b −p) = U
T
(b −UU
T
b) = U
T
b −U
T
b = 0 .
∆
3.2 The Singular Value Decomposition
In these notes, we have often used geometric intuition to introduce newconcepts, and we have then translated these into
algebraic statements. This approach is successful when geometry is less cumbersome than algebra, or when geometric
intuition provides a strong guiding element. The geometric picture underlying the Singular Value Decomposition is
crisp and useful, so we will use geometric intuition again. Here is the main intuition:
An m × n matrix A of rank r maps the rdimensional unit hypersphere in rowspace(A) into an r
dimensional hyperellipse in range(A).
This statement is stronger than saying that A maps rowspace(A) into range(A), because it also describes what
happens to the magnitudes of the vectors: a hypersphere is stretched or compressed into a hyperellipse, which is a
quadratic hypersurface that generalizes the twodimensional notion of ellipse to an arbitrary number of dimensions. In
three dimensions, the hyperellipse is an ellipsoid, in one dimension it is a pair of points. In all cases, the hyperellipse
in question is centered at the origin.
For instance, the rank2 matrix
A =
1
√
2
_
_
√
3
√
3
−3 3
1 1
_
_
(3.5)
transforms the unit circle on the plane into an ellipse embedded in threedimensional space. Figure 3.1 shows the map
b = Ax .
3.2. THE SINGULAR VALUE DECOMPOSITION 27
Two diametrically opposite points on the unit circle are mapped into the two endpoints of the major axis of the
ellipse, and two other diametrically opposite points on the unit circle are mapped into the two endpoints of the minor
axis of the ellipse. The lines through these two pairs of points on the unit circle are always orthogonal. This result can
be generalized to any m×n matrix.
Simple and fundamental as this geometric fact may be, its proof by geometric means is cumbersome. Instead, we
will prove it algebraically by ﬁrst introducing the existence of the SVD and then using the latter to prove that matrices
map hyperspheres into hyperellipses.
Theorem 3.2.1 If A is a real m×n matrix then there exist orthogonal matrices
U =
_
u
1
· · · u
m
¸
∈ R
m×m
V =
_
v
1
· · · v
n
¸
∈ R
n×n
such that
U
T
AV = Σ = diag(σ
1
, . . . , σ
p
) ∈ R
m×n
where p = min(m, n) and σ
1
≥ . . . ≥ σ
p
≥ 0. Equivalently,
A = UΣV
T
.
Proof. Let x and y be unit vectors in R
n
and R
m
, respectively, and consider the bilinear form
z = y
T
Ax .
The set
S = {x, y  x ∈ R
n
, y ∈ R
m
, x = y = 1}
is compact, so that the scalar function z(x, y) must achieve a maximum value on S, possibly at more than one point
4
.
Let u
1
, v
1
be two unit vectors in R
m
and R
n
respectively where this maximum is achieved, and let σ
1
be the corre
sponding value of z:
max
x=y=1
y
T
Ax = u
T
1
Av
1
= σ
1
.
It is easy to see that u
1
is parallel to the vector Av
1
. If this were not the case, their inner product u
T
1
Av
1
could
be increased by rotating u
1
towards the direction of Av
1
, thereby contradicting the fact that u
T
1
Av
1
is a maximum.
Similarly, by noticing that
u
T
1
Av
1
= v
T
1
A
T
u
1
and repeating the argument above, we see that v
1
is parallel to A
T
u
1
.
By theorems 2.4.1 and 2.4.2, u
1
and v
1
can be extended into orthonormal bases for R
m
and R
n
, respectively.
Collect these orthonormal basis vectors into orthogonal matrices U
1
and V
1
. Then
U
T
1
AV
1
= S
1
=
_
σ
1
0
T
0 A
1
_
.
In fact, the ﬁrst column of AV
1
is Av
1
= σ
1
u
1
, so the ﬁrst entry of U
T
1
AV
1
is u
T
1
σ
1
u
1
= σ
1
, and its other entries
are u
T
j
Av
1
= 0 because Av
1
is parallel to u
1
and therefore orthogonal, by construction, to u
2
, . . . , u
m
. A similar
argument shows that the entries after the ﬁrst in the ﬁrst row of S
1
are zero: the row vector u
T
1
A is parallel to v
T
1
, and
therefore orthogonal to v
2
, . . . , v
n
, so that u
T
1
Av
2
= . . . = u
T
1
Av
n
= 0.
The matrix A
1
has one fewer row and column than A. We can repeat the same construction on A
1
and write
U
T
2
A
1
V
2
= S
2
=
_
σ
2
0
T
0 A
2
_
4
Actually, at least at two points: if u
T
1
Av
1
is a maximum, so is (−u
1
)
T
A(−v
1
).
28 CHAPTER 3. THE SINGULAR VALUE DECOMPOSITION
so that
_
1 0
T
0 U
T
2
_
U
T
1
AV
1
_
1 0
T
0 V
2
_
=
_
_
σ
1
0 0
T
0 σ
2
0
T
0 0 A
2
_
_
.
This procedure can be repeated until A
k
vanishes (zero rows or zero columns) to obtain
U
T
AV = Σ
where U
T
and V are orthogonal matrices obtained by multiplying together all the orthogonal matrices used in the
procedure, and
Σ = diag(σ
1
, . . . , σ
p
) .
Since matrices U and V are orthogonal, we can premultiply the matrix product in the theorem by U and postmultiply
it by V
T
to obtain
A = UΣV
T
,
which is the desired result.
It only remains to show that the elements on the diagonal of Σ are nonnegative and arranged in nonincreasing
order. To see that σ
1
≥ . . . ≥ σ
p
(where p = min(m, n)), we can observe that the successive maximization problems
that yield σ
1
, . . . , σ
p
are performed on a sequence of sets each of which contains the next. To show this, we just need
to show that σ
2
≤ σ
1
, and induction will do the rest. We have
σ
2
= max
ˆ
x=
ˆ
y=1
ˆy
T
A
1
ˆx = max
ˆ
x=
ˆ
y=1
_
0 ˆy
¸
T
S
1
_
0
ˆx
_
= max
ˆ
x=
ˆ
y=1
_
0 ˆy
¸
T
U
T
1
AV
1
_
0
ˆx
_
= max
x = y = 1
x
T
v
1
= y
T
u
1
= 0
y
T
Ax ≤ σ
1
.
To explain the last equality above, consider the vectors
x = V
1
_
0
ˆx
_
and y = U
1
_
0
ˆy
_
.
The vector x is equal to the unit vector [0 ˆx]
T
transformed by the orthogonal matrix V
1
, and is therefore itself a unit
vector. In addition, it is a linear combination of v
2
, . . . , v
n
, and is therefore orthogonal to v
1
. A similar argument
shows that y is a unit vector orthogonal to u
1
. Because x and y thus deﬁned belong to subsets (actually subspheres)
of the unit spheres in R
n
and R
m
, we conclude that σ
2
≤ σ
1
.
The σ
i
are nonnegative because all these maximizations are performed on unit hyperspheres. The σ
i
s are maxima
of the function z(x, y) which always assumes both positive and negative values on any hypersphere: If z(x, y) is
negative, then z(−x, y) is positive, and if x is on a hypersphere, so is −x. ∆
We can now review the geometric picture in ﬁgure 3.1 in light of the singular value decomposition. In the process,
we introduce some nomenclature for the three matrices in the SVD. Consider the map in ﬁgure 3.1, represented by
equation (3.5), and imagine transforming point x (the small box at x on the unit circle) into its corresponding point
b = Ax (the small box on the ellipse). This transformation can be achieved in three steps (see ﬁgure 3.2):
1. Write x in the frame of reference of the two vectors v
1
, v
2
on the unit circle that map into the major axes of the
ellipse. There are a few ways to do this, because axis endpoints come in pairs. Just pick one way, but order
v
1
, v
2
so they map into the major and the minor axis, in this order. Let us call v
1
, v
2
the two right singular
vectors of A. The corresponding axis unit vectors u
1
, u
2
on the ellipse are called left singular vectors. If we
deﬁne
V =
_
v
1
v
2
¸
,
3.2. THE SINGULAR VALUE DECOMPOSITION 29
2
x
1
x
v
2
v
1
2 2
2
v’
1
v’
2
y
y
1
u
3
y
3
u σ
2 2
u σ
1 1
σ
2 2
u’
σ
1 1
u’
x
ξ
ξ
1
ξ
η
η
1
y
η
Figure 3.2: Decomposition of the mapping in ﬁgure 3.1.
the new coordinates ξ of x become
ξ = V
T
x
because V is orthogonal.
2. Transform ξ into its image on a “straight” version of the ﬁnal ellipse. “Straight” here means that the axes of the
ellipse are aligned with the y
1
, y
2
axes. Otherwise, the “straight” ellipse has the same shape as the ellipse in
ﬁgure 3.1. If the lengths of the halfaxes of the ellipse are σ
1
, σ
2
(major axis ﬁrst), the transformed vector η has
coordinates
η = Σξ
where
Σ =
_
_
σ
1
0
0 σ
2
0 0
_
_
is a diagonal matrix. The real, nonnegative numbers σ
1
, σ
2
are called the singular values of A.
3. Rotate the reference frame in R
m
= R
3
so that the “straight” ellipse becomes the ellipse in ﬁgure 3.1. This
rotation brings η along, and maps it to b. The components of η are the signed magnitudes of the projections of
b along the unit vectors u
1
, u
2
, u
3
that identify the axes of the ellipse and the normal to the plane of the ellipse,
so
b = Uη
where the orthogonal matrix
U =
_
u
1
u
2
u
3
¸
collects the left singular vectors of A.
30 CHAPTER 3. THE SINGULAR VALUE DECOMPOSITION
We can concatenate these three transformations to obtain
b = UΣV
T
x
or
A = UΣV
T
since this construction works for any point x on the unit circle. This is the SVD of A.
The singular value decomposition is “almost unique”. There are two sources of ambiguity. The ﬁrst is in the
orientation of the singular vectors. One can ﬂip any right singular vector, provided that the corresponding left singular
vector is ﬂipped as well, and still obtain a valid SVD. Singular vectors must be ﬂipped in pairs (a left vector and its
corresponding right vector) because the singular values are required to be nonnegative. This is a trivial ambiguity. If
desired, it can be removed by imposing, for instance, that the ﬁrst nonzero entry of every left singular value be positive.
The second source of ambiguity is deeper. If the matrix A maps a hypersphere into another hypersphere, the axes
of the latter are not deﬁned. For instance, the identity matrix has an inﬁnity of SVDs, all of the form
I = UIU
T
where U is any orthogonal matrix of suitable size. More generally, whenever two or more singular values coincide,
the subspaces identiﬁed by the corresponding left and right singular vectors are unique, but any orthonormal basis can
be chosen within, say, the right subspace and yield, together with the corresponding left singular vectors, a valid SVD.
Except for these ambiguities, the SVD is unique.
Even in the general case, the singular values of a matrix A are the lengths of the semiaxes of the hyperellipse E
deﬁned by
E = {Ax : x = 1} .
The SVD reveals a great deal about the structure of a matrix. If we deﬁne r by
σ
1
≥ . . . ≥ σ
r
> σ
r+1
= . . . = 0 ,
that is, if σ
r
is the smallest nonzero singular value of A, then
rank(A) = r
null(A) = span{v
r+1
, . . . , v
n
}
range(A) = span{u
1
, . . . , u
r
} .
The sizes of the matrices in the SVD are as follows: U is m× m, Σ is m× n, and V is n × n. Thus, Σ has the
same shape and size as A, while U and V are square. However, if m > n, the bottom (m−n) ×n block of Σ is zero,
so that the last m− n columns of U are multiplied by zero. Similarly, if m < n, the rightmost m× (n − m) block
of Σ is zero, and this multiplies the last n −m rows of V . This suggests a “small,” equivalent version of the SVD. If
p = min(m, n), we can deﬁne U
p
= U(:, 1 : p), Σ
p
= Σ(1 : p, 1 : p), and V
p
= V (:, 1 : p), and write
A = U
p
Σ
p
V
T
p
where U
p
is m×p, Σ
p
is p ×p, and V
p
is n ×p.
Moreover, if p−r singular values are zero, we can let U
r
= U(:, 1 : r), Σ
r
= Σ(1 : r, 1 : r), and V
r
= V (:, 1 : r),
then we have
A = U
r
Σ
r
V
T
r
=
r
i=1
σ
i
u
i
v
T
i
,
which is an even smaller, minimal, SVD.
Finally, both the 2norm and the Frobenius norm
A
F
=
¸
¸
¸
_
m
i=1
n
j=1
a
ij

2
3.3. THE PSEUDOINVERSE 31
and
A
2
= sup
x=0
Ax
x
are neatly characterized in terms of the SVD:
A
2
F
= σ
2
1
+. . . +σ
2
p
A
2
= σ
1
.
In the next few sections we introduce fundamental results and applications that testify to the importance of the
SVD.
3.3 The Pseudoinverse
One of the most important applications of the SVD is the solution of linear systems in the least squares sense. A linear
system of the form
Ax = b (3.6)
arising froma reallife application may or may not admit a solution, that is, a vector x that satisﬁes this equation exactly.
Often more measurements are available than strictly necessary, because measurements are unreliable. This leads to
more equations than unknowns (the number m of rows in A is greater than the number n of columns), and equations
are often mutually incompatible because they come from inexact measurements (incompatible linear systems were
deﬁned in chapter 2). Even when m ≤ n the equations can be incompatible, because of errors in the measurements
that produce the entries of A. In these cases, it makes more sense to ﬁnd a vector x that minimizes the norm
Ax −b
of the residual vector
r = Ax −b .
where the double bars henceforth refer to the Euclidean norm. Thus, x cannot exactly satisfy any of the m equations
in the system, but it tries to satisfy all of them as closely as possible, as measured by the sum of the squares of the
discrepancies between left and righthand sides of the equations.
In other circumstances, not enough measurements are available. Then, the linear system (3.6) is underdetermined,
in the sense that it has fewer independent equations than unknowns (its rank r is less than n, see again chapter 2).
Incompatibility and underdeterminacy can occur together: the system admits no solution, and the leastsquares
solution is not unique. For instance, the system
x
1
+x
2
= 1
x
1
+x
2
= 3
x
3
= 2
has three unknowns, but rank 2, and its ﬁrst two equations are incompatible: x
1
+ x
2
cannot be equal to both 1 and
3. A leastsquares solution turns out to be x = [1 1 2]
T
with residual r = Ax − b = [1 − 1 0], which has norm
√
2
(admittedly, this is a rather high residual, but this is the best we can do for this problem, in the leastsquares sense).
However, any other vector of the form
x
=
_
_
1
1
2
_
_
+α
_
_
−1
1
0
_
_
is as good as x. For instance, x
= [0 2 2], obtained for α = 1, yields exactly the same residual as x (check this).
In summary, an exact solution to the system (3.6) may not exist, or may not be unique, as we learned in chapter 2.
An approximate solution, in the leastsquares sense, always exists, but may fail to be unique.
32 CHAPTER 3. THE SINGULAR VALUE DECOMPOSITION
If there are several leastsquares solutions, all equally good (or bad), then one of them turns out to be shorter than
all the others, that is, its norm x is smallest. One can therefore redeﬁne what it means to “solve” a linear system so
that there is always exactly one solution. This minimum norm solution is the subject of the following theorem, which
both proves uniqueness and provides a recipe for the computation of the solution.
Theorem 3.3.1 The minimumnorm least squares solution to a linear system Ax = b, that is, the shortest vector x
that achieves the
min
x
Ax −b ,
is unique, and is given by
ˆx = V Σ
†
U
T
b (3.7)
where
Σ
†
=
_
¸
¸
¸
¸
¸
¸
¸
¸
¸
_
1/σ
1
0 · · · 0
.
.
.
1/σ
r
.
.
.
.
.
.
0
.
.
.
0 0 · · · 0
_
¸
¸
¸
¸
¸
¸
¸
¸
¸
_
is an n ×m diagonal matrix.
The matrix
A
†
= V Σ
†
U
T
is called the pseudoinverse of A.
Proof. The minimumnorm Least Squares solution to
Ax = b
is the shortest vector x that minimizes
Ax −b
that is,
UΣV
T
x −b .
This can be written as
U(ΣV
T
x −U
T
b) (3.8)
because U is an orthogonal matrix, UU
T
= I. But orthogonal matrices do not change the norm of vectors they are
applied to (theorem 3.1.2), so that the last expression above equals
ΣV
T
x −U
T
b
or, with y = V
T
x and c = U
T
b,
Σy −c .
In order to ﬁnd the solution to this minimization problem, let us spell out the last expression. We want to minimize the
norm of the following vector:
_
¸
¸
¸
¸
¸
¸
¸
¸
¸
_
σ
1
0 · · · 0
0
.
.
.
· · · 0
σ
r
.
.
. 0
.
.
.
.
.
.
0 0
_
¸
¸
¸
¸
¸
¸
¸
¸
¸
_
_
¸
¸
¸
¸
¸
¸
¸
¸
_
y
1
.
.
.
y
r
y
r+1
.
.
.
y
n
_
¸
¸
¸
¸
¸
¸
¸
¸
_
−
_
¸
¸
¸
¸
¸
¸
¸
¸
_
c
1
.
.
.
c
r
c
r+1
.
.
.
c
m
_
¸
¸
¸
¸
¸
¸
¸
¸
_
.
3.4. LEASTSQUARES SOLUTION OF A HOMOGENEOUS LINEAR SYSTEMS 33
The last m−r differences are of the form
0 −
_
¸
_
c
r+1
.
.
.
c
m
_
¸
_
and do not depend on the unknown y. In other words, there is nothing we can do about those differences: if some or
all the c
i
for i = r +1, . . . , m are nonzero, we will not be able to zero these differences, and each of them contributes
a residual c
i
 to the solution. In each of the ﬁrst r differences, on the other hand, the last n − r components of y are
multiplied by zeros, so they have no effect on the solution. Thus, there is freedom in their choice. Since we look for
the minimumnorm solution, that is, for the shortest vector x, we also want the shortest y, because x and y are related
by an orthogonal transformation. We therefore set y
r+1
= . . . = y
n
= 0. In summary, the desired y has the following
components:
y
i
=
c
i
σ
i
for i = 1, . . . , r
y
i
= 0 for i = r + 1, . . . , n .
When written as a function of the vector c, this is
y = Σ
+
c .
Notice that there is no other choice for y, which is therefore unique: minimum residual forces the choice of y
1
, . . . , y
r
,
and minimumnorm solution forces the other entries of y. Thus, the minimumnorm, leastsquares solution to the
original system is the unique vector
ˆx = V y = V Σ
+
c = V Σ
+
U
T
b
as promised. The residual, that is, the norm of Ax − b when x is the solution vector, is the norm of Σy − c, since
this vector is related to Ax −b by an orthogonal transformation (see equation (3.8)). In conclusion, the square of the
residual is
Ax −b
2
= Σy −c
2
=
m
i=r+1
c
2
i
=
m
i=r+1
(u
T
i
b)
2
which is the projection of the righthand side vector b onto the complement of the range of A. ∆
3.4 LeastSquares Solution of a Homogeneous Linear Systems
Theorem 3.3.1 works regardless of the value of the righthand side vector b. When b = 0, that is, when the system is
homogeneous, the solution is trivial: the minimumnorm solution to
Ax = 0 (3.9)
is
x = 0 ,
which happens to be an exact solution. Of course it is not necessarily the only one (any vector in the null space of A
is also a solution, by deﬁnition), but it is obviously the one with the smallest norm.
Thus, x = 0 is the minimumnorm solution to any homogeneous linear system. Although correct, this solution is
not too interesting. In many applications, what is desired is a nonzero vector x that satisﬁes the system (3.9) as well
as possible. Without any constraints on x, we would fall back to x = 0 again. For homogeneous linear systems, the
meaning of a leastsquares solution is therefore usually modiﬁed, once more, by imposing the constraint
x = 1
34 CHAPTER 3. THE SINGULAR VALUE DECOMPOSITION
on the solution. Unfortunately, the resulting constrained minimization problem does not necessarily admit a unique
solution. The following theorem provides a recipe for ﬁnding this solution, and shows that there is in general a whole
hypersphere of solutions.
Theorem 3.4.1 Let
A = UΣV
T
be the singular value decomposition of A. Furthermore, let v
n−k+1
, . . . , v
n
be the k columns of V whose correspond
ing singular values are equal to the last singular value σ
n
, that is, let k be the largest integer such that
σ
n−k+1
= . . . = σ
n
.
Then, all vectors of the form
x = α
1
v
n−k+1
+. . . +α
k
v
n
(3.10)
with
α
2
1
+. . . +α
2
k
= 1 (3.11)
are unitnorm least squares solutions to the homogeneous linear system
Ax = 0,
that is, they achieve the
min
x=1
Ax .
Note: when σ
n
is greater than zero the most common case is k = 1, since it is very unlikely that different singular
values have exactly the same numerical value. When A is rank deﬁcient, on the other case, it may often have more
than one singular value equal to zero. In any event, if k = 1, then the minimumnorm solution is unique, x = v
n
. If
k > 1, the theorem above shows how to express all solutions as a linear combination of the last k columns of V .
Proof. The reasoning is very similar to that for the previous theorem. The unitnorm Least Squares solution to
Ax = 0
is the vector x with x = 1 that minimizes
Ax
that is,
UΣV
T
x .
Since orthogonal matrices do not change the norm of vectors they are applied to (theorem 3.1.2), this norm is the same
as
ΣV
T
x
or, with y = V
T
x,
Σy .
Since V is orthogonal, x = 1 translates to y = 1. We thus look for the unitnorm vector y that minimizes the
norm (squared) of Σy, that is,
σ
2
1
y
2
1
+. . . +σ
2
n
y
2
n
.
This is obviously achieved by concentrating all the (unit) mass of y where the σs are smallest, that is by letting
y
1
= . . . = y
n−k
= 0. (3.12)
From y = V
T
x we obtain x = V y = y
1
v
1
+. . . +y
n
v
n
, so that equation (3.12) is equivalent to equation (3.10) with
α
1
= y
n−k+1
, . . . , α
k
= y
n
, and the unitnorm constraint on y yields equation (3.11). ∆
Section 3.5 shows a sample use of theorem 3.4.1.
3.5. SVD LINE FITTING 35
3.5 SVD Line Fitting
The Singular Value Decomposition of a matrix yields a simple method for ﬁtting a line to a set of points on the plane.
3.5.1 Fitting a Line to a Set of Points
Let p
i
= (x
i
, y
i
)
T
be a set of m ≥ 2 points on the plane, and let
ax +by −c = 0
be the equation of a line. If the lefthand side of this equation is multiplied by a nonzero constant, the line does not
change. Thus, we can assume without loss of generality that
n = a
2
+b
2
= 1 , (3.13)
where the unit vector n = (a, b)
T
, orthogonal to the line, is called the line normal.
The distance from the line to the origin is c (see ﬁgure 3.3), and the distance between the line n and a point p
i
is
equal to
d
i
= ax
i
+by
i
−c = p
T
i
n −c . (3.14)
p
i
a
b
c
Figure 3.3: The distance between point p
i
= (x
i
, y
i
)
T
and line ax +by −c = 0 is ax
i
+by
i
−c.
The bestﬁt line minimizes the sum of the squared distances. Thus, if we let d = (d
1
, . . . , d
m
) and P =
(p
1
. . . , p
m
)
T
, the bestﬁt line achieves the
min
n=1
d
2
= min
n=1
Pn −c1
2
. (3.15)
In equation (3.15), 1 is a vector of m ones.
3.5.2 The Best Line Fit
Since the third line parameter c does not appear in the constraint (3.13), at the minimum (3.15) we must have
∂d
2
∂c
= 0 . (3.16)
If we deﬁne the centroid p of all the points p
i
as
p =
1
m
P
T
1 ,
36 CHAPTER 3. THE SINGULAR VALUE DECOMPOSITION
equation (3.16) yields
∂d
2
∂c
=
∂
∂c
_
n
T
P
T
−c1
T
_
(Pn −1c)
=
∂
∂c
_
n
T
P
T
Pn +c
2
1
T
1 −2n
T
P
T
c1
_
= 2
_
mc −n
T
P
T
1
_
= 0
from which we obtain
c =
1
m
n
T
P
T
1 ,
that is,
c = p
T
n .
By replacing this expression into equation (3.15), we obtain
min
n=1
d
2
= min
n=1
Pn −1p
T
n
2
= min
n=1
Qn
2
,
where Q = P − 1p
T
collects the centered coordinates of the m points. We can solve this constrained minimization
problem by theorem 3.4.1. Equivalently, and in order to emphasize the geometric meaning of signular values and
vectors, we can recall that if n is on a circle, the shortest vector of the form Qn is obtained when n is the right singular
vector v
2
corresponding to the smaller σ
2
of the two singular values of Q. Furthermore, since Qv
2
has norm σ
2
, the
residue is
min
n=1
d = σ
2
and more speciﬁcally the distances d
i
are given by
d = σ
2
u
2
where u
2
is the left singular vector corresponding to σ
2
. In fact, when n = v
2
, the SVD
Q = UΣV
T
=
2
i=1
σ
i
u
i
v
T
i
yields
Qn = Qv
2
=
2
i=1
σ
i
u
i
v
T
i
v
2
= σ
2
u
2
because v
1
and v
2
are orthonormal vectors.
To summarize, to ﬁt a line (a, b, c) to a set of m points p
i
collected in the m × 2 matrix P = (p
1
. . . , p
m
)
T
,
proceed as follows:
1. compute the centroid of the points (1 is a vector of m ones):
p =
1
m
P
T
1
2. form the matrix of centered coordinates:
Q = P −1p
T
3. compute the SVD of Q:
Q = UΣV
T
3.5. SVD LINE FITTING 37
4. the line normal is the second column of the 2 ×2 matrix V :
n = (a, b)
T
= v
2
,
5. the third coefﬁcient of the line is
c = p
T
n
6. the residue of the ﬁt is
min
n=1
d = σ
2
The following matlab code implements the line ﬁtting method.
function [l, residue] = linefit(P)
% check input matrix sizes
[m n] = size(P);
if n ˜= 2, error(’matrix P must be m x 2’), end
if m < 2, error(’Need at least two points’), end
one = ones(m, 1);
% centroid of all the points
p = (P’ * one) / m;
% matrix of centered coordinates
Q = P  one * p’;
[U Sigma V] = svd(Q);
% the line normal is the second column of V
n = V(:, 2);
% assemble the three line coefficients into a column vector
l = [n ; p’ * n];
% the smallest singular value of Q
% measures the residual fitting error
residue = Sigma(2, 2);
A useful exercise is to think how this procedure, or something close to it, can be adapted to ﬁt a set of data points
in R
m
with an afﬁne subspace of given dimension n. An afﬁne subspace is a linear subspace plus a point, just like an
arbitrary line is a line through the origin plus a point. Here “plus” means the following. Let L be a linear space. Then
an afﬁne space has the form
A = p +L = {a  a = p + l and l ∈ L} .
Hint: minimizing the distance between a point and a subspace is equivalent to maximizing the norm of the projection
of the point onto the subspace. The ﬁtting problem (including ﬁtting a line to a set of points) can be cast either as a
maximization or a minimization problem.
38 CHAPTER 3. THE SINGULAR VALUE DECOMPOSITION
Chapter 4
Function Optimization
There are three main reasons why most problems in robotics, vision, and arguably every other science or endeavor
take on the form of optimization problems. One is that the desired goal may not be achievable, and so we try to get as
close as possible to it. The second reason is that there may be more ways to achieve the goal, and so we can choose
one by assigning a quality to all the solutions and selecting the best one. The third reason is that we may not know
how to solve the system of equations f(x) = 0, so instead we minimize the norm f(x), which is a scalar function of
the unknown vector x.
We have encountered the ﬁrst two situations when talking about linear systems. The case in which a linear system
admits exactly one exact solution is simple but rare. More often, the system at hand is either incompatible (some say
overconstrained) or, at the opposite end, underdetermined. In fact, some problems are both, in a sense. While these
problems admit no exact solution, they often admit a multitude of approximate solutions. In addition, many problems
lead to nonlinear equations.
Consider, for instance, the problem of Structure From Motion (SFM) in computer vision. Nonlinear equations
describe how points in the world project onto the images taken by cameras at given positions in space. Structure from
motion goes the other way around, and attempts to solve these equations: image points are given, and one wants to
determine where the points in the world and the cameras are. Because image points come from noisy measurements,
they are not exact, and the resulting system is usually incompatible. SFM is then cast as an optimization problem.
On the other hand, the exact system (the one with perfect coefﬁcients) is often close to being underdetermined. For
instance, the images may be insufﬁcient to recover a certain shape under a certain motion. Then, an additional criterion
must be added to deﬁne what a “good” solution is. In these cases, the noisy system admits no exact solutions, but has
many approximate ones.
The term “optimization” is meant to subsume both minimization and maximization. However, maximizing the
scalar function f(x) is the same as minimizing −f(x), so we consider optimization and minimization to be essentially
synonyms. Usually, one is after global minima. However, global minima are hard to ﬁnd, since they involve a universal
quantiﬁer: x
∗
is a global minimum of f if for every other x we have f(x) ≥ f(x
∗
). Global minization techniques
like simulated annealing have been proposed, but their convergence properties depend very strongly on the problem at
hand. In this chapter, we consider local minimization: we pick a starting point x
0
, and we descend in the landscape of
f(x) until we cannot go down any further. The bottom of the valley is a local minimum.
Local minimization is appropriate if we know how to pick an x
0
that is close to x
∗
. This occurs frequently in
feedback systems. In these systems, we start at a local (or even a global) minimum. The system then evolves and
escapes from the minimum. As soon as this occurs, a control signal is generated to bring the system back to the
minimum. Because of this immediate reaction, the old minimum can often be used as a starting point x
0
when looking
for the new minimum, that is, when computing the required control signal. More formally, we reach the correct
minimum x
∗
as long as the initial point x
0
is in the basin of attraction of x
∗
, deﬁned as the largest neighborhood of x
∗
in which f(x) is convex.
Good references for the discussion in this chapter are Matrix Computations, Practical Optimization, and Numerical
Recipes in C, all of which are listed with full citations in section 1.4.
39
40 CHAPTER 4. FUNCTION OPTIMIZATION
4.1 Local Minimization and Steepest Descent
Suppose that we want to ﬁnd a local minimum for the scalar function f of the vector variable x, starting from an initial
point x
0
. Picking an appropriate x
0
is crucial, but also very problemdependent. We start from x
0
, and we go downhill.
At every step of the way, we must make the following decisions:
• Whether to stop.
• In what direction to proceed.
• How long a step to take.
In fact, most minimization algorithms have the following structure:
k = 0
while x
k
is not a minimum
compute step direction p
k
with p
k
= 1
compute step size α
k
x
k+1
= x
k
+α
k
p
k
k = k + 1
end.
Different algorithms differ in how each of these instructions is performed.
It is intuitively clear that the choice of the step size α
k
is important. Too small a step leads to slow convergence,
or even to lack of convergence altogether. Too large a step causes overshooting, that is, leaping past the solution. The
most disastrous consequence of this is that we may leave the basin of attraction, or that we oscillate back and forth
with increasing amplitudes, leading to instability. Even when oscillations decrease, they can slow down convergence
considerably.
What is less obvious is that the best direction of descent is not necessarily, and in fact is quite rarely, the direction
of steepest descent, as we now show. Consider a simple but important case,
f(x) = c + a
T
x +
1
2
x
T
Qx (4.1)
where Q is a symmetric, positive deﬁnite matrix. Positive deﬁnite means that for every nonzero x the quantity x
T
Qx
is positive. In this case, the graph of f(x) −c is a plane a
T
x plus a paraboloid.
Of course, if f were this simple, no descent methods would be necessary. In fact the minimum of f can be found
by setting its gradient to zero:
∂f
∂x
= a +Qx = 0
so that the minimum x
∗
is the solution to the linear system
Qx = −a . (4.2)
Since Q is positive deﬁnite, it is also invertible (why?), and the solution x
∗
is unique. However, understanding the
behavior of minimization algorithms in this simple case is crucial in order to establish the convergence properties of
these algorithms for more general functions. In fact, all smooth functions can be approximated by paraboloids in a
sufﬁciently small neighborhood of any point.
Let us therefore assume that we minimize f as given in equation (4.1), and that at every step we choose the
direction of steepest descent. In order to simplify the mathematics, we observe that if we let
˜ e(x) =
1
2
(x −x
∗
)
T
Q(x −x
∗
)
then we have
˜ e(x) = f(x) −c +
1
2
x
∗T
Qx
∗
= f(x) −f(x
∗
) (4.3)
4.1. LOCAL MINIMIZATION AND STEEPEST DESCENT 41
so that ˜ e and f differ only by a constant. In fact,
˜ e(x) =
1
2
(x
T
Qx + x
∗T
Qx
∗
−2x
T
Qx
∗
) =
1
2
x
T
Qx + a
T
x +
1
2
x
∗T
Qx
∗
= f(x) −c +
1
2
x
∗T
Qx
∗
and from equation (4.2) we obtain
f(x
∗
) = c + a
T
x
∗
+
1
2
x
∗T
Qx
∗
= c −x
∗T
Qx
∗
+
1
2
x
∗T
Qx
∗
= c −
1
2
x
∗T
Qx
∗
.
The result (4.3),
˜ e(x) = f(x) −f(x
∗
) ,
is rather interesting in itself. It says that adding a linear term a
T
x (and a constant c) to a paraboloid
1
2
x
T
Qx merely
shifts the bottom of the paraboloid, both in position (x
∗
rather than 0) and value (c −
1
2
x
∗T
Qx
∗
rather than zero).
Adding the linear term does not “warp” or “tilt” the shape of the paraboloid in any way.
Since ˜ e is simpler, we consider that we are minimizing ˜ e rather than f. In addition, we can let
y = x −x
∗
,
that is, we can shift the origin of the domain to x
∗
, and study the function
e(y) =
1
2
y
T
Qy
instead of f or ˜ e, without loss of generality. We will transform everything back to f and x once we are done. Of
course, by construction, the new minimum is at
y
∗
= 0
where e reaches a value of zero:
e(y
∗
) = e(0) = 0 .
However, we let our steepest descent algorithm ﬁnd this minimum by starting from the initial point
y
0
= x
0
−x
∗
.
At every iteration k, the algorithm chooses the direction of steepest descent, which is in the direction
p
k
= −
g
k
g
k
opposite to the gradient of e evaluated at y
k
:
g
k
= g(y
k
) =
∂e
∂y
¸
¸
¸
¸
y=y
k
= Qy
k
.
We select for the algorithm the most favorable step size, that is, the one that takes us from y
k
to the lowest point in
the direction of p
k
. This can be found by differentiating the function
e(y
k
+αp
k
) =
1
2
(y
k
+αp
k
)
T
Q(y
k
+αp
k
)
with respect to α, and setting the derivative to zero to obtain the optimal step α
k
. We have
∂e(y
k
+αp
k
)
∂α
= (y
k
+αp
k
)
T
Qp
k
and setting this to zero yields
α
k
= −
(Qy
k
)
T
p
k
p
T
k
Qp
k
= −
g
T
k
p
k
p
T
k
Qp
k
= g
k
p
T
k
p
k
p
T
k
Qp
k
= g
k
g
T
k
g
k
g
T
k
Qg
k
. (4.4)
42 CHAPTER 4. FUNCTION OPTIMIZATION
Thus, the basic step of our steepest descent can be written as follows:
y
k+1
= y
k
+g
k
g
T
k
g
k
g
T
k
Qg
k
p
k
that is,
y
k+1
= y
k
−
g
T
k
g
k
g
T
k
Qg
k
g
k
. (4.5)
How much closer did this step bring us to the solution y
∗
= 0? In other words, how much smaller is e(y
k+1
),
relative to the value e(y
k
) at the previous step? The answer is, often not much, as we shall now prove. The arguments
and proofs below are adapted from D. G. Luenberger, Introduction to Linear and Nonlinear Programming, Addison
Wesley, 1973.
From the deﬁnition of e and from equation (4.5) we obtain
e(y
k
) −e(y
k+1
)
e(y
k
)
=
y
T
k
Qy
k
−y
T
k+1
Qy
k+1
y
T
k
Qy
k
=
y
T
k
Qy
k
−
_
y
k
−
g
T
k
g
k
g
T
k
Qg
k
g
k
_
T
Q
_
y
k
−
g
T
k
g
k
g
T
k
Qg
k
g
k
_
y
T
k
Qy
k
=
2
g
T
k
g
k
g
T
k
Qg
k
g
T
k
Qy
k
−
_
g
T
k
g
k
g
T
k
Qg
k
_
2
g
T
k
Qg
k
y
T
k
Qy
k
=
2g
T
k
g
k
g
T
k
Qy
k
−(g
T
k
g
k
)
2
y
T
k
Qy
k
g
T
k
Qg
k
.
Since Q is invertible we have
g
k
= Qy
k
⇒ y
k
= Q
−1
g
k
and
y
T
k
Qy
k
= g
T
k
Q
−1
g
k
so that
e(y
k
) −e(y
k+1
)
e(y
k
)
=
(g
T
k
g
k
)
2
g
T
k
Q
−1
g
k
g
T
k
Qg
k
.
This can be rewritten as follows by rearranging terms:
e(y
k+1
) =
_
1 −
(g
T
k
g
k
)
2
g
T
k
Q
−1
g
k
g
T
k
Qg
k
_
e(y
k
) (4.6)
so if we can bound the expression in parentheses we have a bound on the rate of convergence of steepest descent. To
this end, we introduce the following result.
Lemma 4.1.1 (Kantorovich inequality) Let Qbe a positive deﬁnite, symmetric, n×n matrix. For any vector y there
holds
(y
T
y)
2
y
T
Q
−1
y y
T
Qy
≥
4σ
1
σ
n
(σ
1
+σ
n
)
2
where σ
1
and σ
n
are, respectively, the largest and smallest singular values of Q.
4.1. LOCAL MINIMIZATION AND STEEPEST DESCENT 43
Proof. Let
Q = UΣU
T
be the singular value decomposition of the symmetric (hence V = U) matrix Q. Because Q is positive deﬁnite, all its
singular values are strictly positive, since the smallest of them satisﬁes
σ
n
= min
y=1
y
T
Qy > 0
by the deﬁnition of positive deﬁniteness. If we let
z = U
T
y
we have
(y
T
y)
2
y
T
Q
−1
y y
T
Qy
=
(y
T
U
T
Uy)
2
y
T
UΣ
−1
U
T
y y
T
UΣU
T
y
=
(z
T
z)
2
z
T
Σ
−1
z z
T
Σz
=
1/
n
i=1
θ
i
σ
i
n
i=1
θ
i
/σ
i
=
φ(σ)
ψ(σ)
(4.7)
where the coefﬁcients
θ
i
=
z
2
i
z
2
add up to one. If we let
σ =
n
i=1
θ
i
σ
i
, (4.8)
then the numerator φ(σ) in (4.7) is 1/σ. Of course, there are many ways to choose the coefﬁcients θ
i
to obtain a
particular value of σ. However, each of the singular values σ
j
can be obtained by letting θ
j
= 1 and all other θ
i
to
zero. Thus, the values 1/σ
j
for j = 1, . . . , n are all on the curve 1/σ. The denominator ψ(σ) in (4.7) is a convex
combination of points on this curve. Since 1/σ is a convex function of σ, the values of the denominator ψ(σ) of (4.7)
must be in the shaded area in ﬁgure 4.1. This area is delimited from above by the straight line that connects point
(σ
1
, 1/σ
1
) with point (σ
n
, 1/σ
n
), that is, by the line with ordinate
λ(σ) = (σ
1
+σ
n
−σ)/(σ
1
σ
n
) .
For the same vector of coefﬁcients θ
i
, the values of φ(σ), ψ(σ), and λ(σ) are on the vertical line corresponding to
the value of σ given by (4.8). Thus an appropriate bound is
φ(σ)
ψ(σ)
≥ min
σ
1
≤σ≤σ
n
φ(σ)
λ(σ)
= min
σ
1
≤σ≤σ
n
1/σ
(σ
1
+σ
n
−σ)/(σ
1
σ
n
)
.
The minimum is achieved at σ = (σ
1
+σ
n
)/2, yielding the desired result. ∆
Thanks to this lemma, we can state the main result on the convergence of the method of steepest descent.
Theorem 4.1.2 Let
f(x) = c + a
T
x +
1
2
x
T
Qx
be a quadratic function of x, with Q symmetric and positive deﬁnite. For any x
0
, the method of steepest descent
x
k+1
= x
k
−
g
T
k
g
k
g
T
k
Qg
k
g
k
(4.9)
where
g
k
= g(x
k
) =
∂f
∂x
¸
¸
¸
¸
x=x
k
= a +Qx
k
44 CHAPTER 4. FUNCTION OPTIMIZATION
σ
2 1
σ σ σ
n
λ(σ)
φ(σ)
ψ(σ)
φ,ψ,λ
σ
Figure 4.1: Kantorovich inequality.
converges to the unique minimum point
x
∗
= −Q
−1
a
of f. Furthermore, at every step k there holds
f(x
k+1
) −f(x
∗
) ≤
_
σ
1
−σ
n
σ
1
+σ
n
_
2
(f(x
k
) −f(x
∗
))
where σ
1
and σ
n
are, respectively, the largest and smallest singular value of Q.
Proof. From the deﬁnitions
y = x −x
∗
and e(y) =
1
2
y
T
Qy (4.10)
we immediately obtain the expression for steepest descent in terms of f and x. By equations (4.3) and (4.6) and the
Kantorovich inequality we obtain
f(x
k+1
) −f(x
∗
) = e(y
k+1
) =
_
1 −
(g
T
k
g
k
)
2
g
T
k
Q
−1
g
k
g
T
k
Qg
k
_
e(y
k
) ≤
_
1 −
4σ
1
σ
n
(σ
1
+σ
n
)
2
_
e(y
k
) (4.11)
=
_
σ
1
−σ
n
σ
1
+σ
n
_
2
(f(x
k
) −f(x
∗
)) . (4.12)
Since the ratio in the last term is smaller than one, it follows immediately that f(x
k
) − f(x
∗
) → 0 and hence, since
the minimum of f is unique, that x
k
→x
∗
. ∆
The ratio κ(Q) = σ
1
/σ
n
is called the condition number of Q. The larger the condition number, the closer the
fraction (σ
1
− σ
n
)/(σ
1
+ σ
n
) is to unity, and the slower convergence. It is easily seen why this happens in the case
4.1. LOCAL MINIMIZATION AND STEEPEST DESCENT 45
p
0
x
*
x
0 1
p
Figure 4.2: Trajectory of steepest descent.
in which x is a twodimensional vector, as in ﬁgure 4.2, which shows the trajectory x
k
superimposed on a set of
isocontours of f(x).
There is one good, but very precarious case, namely, when the starting point x
0
is at one apex (tip of either axis)
of an isocontour ellipse. In that case, one iteration will lead to the minimum x
∗
. In all other cases, the line in the
direction p
k
of steepest descent, which is orthogonal to the isocontour at x
k
, will not pass through x
∗
. The minimum
of f along that line is tangent to some other, lower isocontour. The next step is orthogonal to the latter isocontour (that
is, parallel to the gradient). Thus, at every step the steepest descent trajectory is forced to make a ninetydegree turn.
If isocontours were circles (σ
1
= σ
n
) centered at x
∗
, then the ﬁrst turn would make the new direction point to x
∗
, and
minimization would get there in just one more step. This case, in which κ(Q) = 1, is consistent with our analysis,
because then
σ
1
−σ
n
σ
1
+σ
n
= 0 .
The more elongated the isocontours, that is, the greater the condition number κ(Q), the farther away a line orthogonal
to an isocontour passes from x
∗
, and the more steps are required for convergence.
For general (that is, nonquadratic) f, the analysis above applies once x
k
gets close enough to the minimum, so
that f is well approximated by a paraboloid. In this case, Q is the matrix of second derivatives of f with respect to x,
and is called the Hessian of f. In summary, steepest descent is good for functions that have a well conditioned Hessian
near the minimum, but can become arbitrarily slow for poorly conditioned Hessians.
To characterize the speed of convergence of different minimization algorithms, we introduce the notion of the
order of convergence. This is deﬁned as the largest value of q for which the
lim
k→∞
x
k+1
−x
∗
x
k
−x
∗
q
is ﬁnite. If β is this limit, then close to the solution (that is, for large values of k) we have
x
k+1
−x
∗
≈ βx
k
−x
∗
q
for a minimization method of order q. In other words, the distance of x
k
from x
∗
is reduced by the qth power at every
step, so the higher the order of convergence, the better. Theorem 4.1.2 implies that steepest descent has at best a linear
order of convergence. In fact, the residuals f(x
k
) − f(x
∗
) in the values of the function being minimized converge
46 CHAPTER 4. FUNCTION OPTIMIZATION
linearly. Since the gradient of f approaches zero when x
k
tends to x
∗
, the arguments x
k
to f can converge to x
∗
even
more slowly.
To complete the steepest descent algorithm we need to specify how to check whether a minimum has been reached.
One criterion is to check whether the value of f(x
k
) has signiﬁcantly decreased from f(x
k−1
). Another is to check
whether x
k
is signiﬁcantly different from x
k−1
. Close to the minimum, the derivatives of f are close to zero, so
f(x
k
) − f(x
k−1
) may be very small but x
k
− x
k−1
may still be relatively large. Thus, the check on x
k
is more
stringent, and therefore preferable in most cases. In fact, usually one is interested in the value of x
∗
, rather than in that
of f(x
∗
). In summary, the steepest descent algorithm can be stopped when
x
k
−x
k−1
<
where the positive constant is provided by the user.
In our analysis of steepest descent, we used the Hessian Qin order to compute the optimal step size α (see equation
(4.4)). We used Qbecause it was available, but its computation during steepest descent would in general be overkill. In
fact, only gradient information is necessary to ﬁnd p
k
, and a line search in the direction of p
k
can be used to determine
the step size α
k
. In contrast, the Hessian of f(x) requires computing
_
n
2
_
second derivatives if x is an ndimensional
vector.
Using line search to ﬁnd α
k
guarantees that a minimum in the direction p
k
is actually reached even when the
parabolic approximation is inadequate. Here is how line search works.
Let
h(α) = f(x
k
+αp
k
) (4.13)
be the scalar function of one variable that is obtained by restricting the function f to the line through the current point
x
k
and in the direction of p
k
. Line search ﬁrst determines two points a, c that bracket the desired minimum α
k
, in the
sense that a ≤ α
k
≤ c, and then picks a point between a and c, say, b = (a + c)/2. The only difﬁculty here is to
ﬁnd c. In fact, we can set a = 0, corresponding through equation (4.13) to the starting point x
k
. A point c that is on
the opposite side of the minimum with respect to a can be found by increasing α through values α
1
= a, α
2
, . . . until
h(α
i
) is greater than h(α
i−1
). Then, if we can assume that h is convex between α
1
and α
i
, we can set c = α
i
. In
fact, the derivative of h at a is negative, so the function is initially decreasing, but it is increasing between α
i−1
and
α
i
= c, so the minimum must be somewhere between a and c. Of course, if we cannot assume convexity, we may ﬁnd
the wrong minimum, but there is no generalpurpose ﬁx to this problem.
Line search now proceeds by shrinking the bracketing triple (a, b, c) until c−a is smaller than the desired accuracy
in determining α
k
. Shrinking works as follows:
if b −a > c −b
u = (a +b)/2
if f(u) > f(b)
(a, b, c) = (u, b, c)
otherwise
(a, b, c) = (a, u, b)
end
otherwise
u = (b +c)/2
if f(u) > f(b)
(a, b, c) = (a, b, u)
otherwise
(a, b, c) = (b, u, c)
end
end.
4.2. NEWTON’S METHOD 47
It is easy to see that in each case the bracketing triple (a, b, c) preserves the property that f(b) ≤ f(a) and
f(b) ≤ f(c), and therefore the minimum is somewhere between a and c. In addition, at every step the interval (a, c)
shrinks to 3/4 of its previous size, so line search will ﬁnd the minimum in a number of steps that is logarithmic in the
desired accuracy.
4.2 Newton’s Method
If a function can be well approximated by a paraboloid in the region in which minimization is performed, the analysis
in the previous section suggests a straightforward ﬁx to the slow convergence of steepest descent. In fact, equation
(4.2) tells us how to jump in one step from the starting point x
0
to the minimum x
∗
. Of course, when f(x) is not
exactly a paraboloid, the new value x
1
will be different from x
∗
. Consequently, iterations are needed, but convergence
can be expected to be faster. This is the idea of Newton’s method, which we now summarize. Let
f(x
k
+ ∆x) ≈ f(x
k
) + g
T
k
∆x +
1
2
∆x
T
Q
k
∆x (4.14)
be the ﬁrst terms of the Taylor series expansion of f about the current point x
k
, where
g
k
= g(x
k
) =
∂f
∂x
¸
¸
¸
¸
x=x
k
and
Q
k
= Q(x
k
) =
∂
2
f
∂x∂x
T
¸
¸
¸
¸
x=x
k
=
_
¸
¸
_
∂
2
f
∂x
2
1
· · ·
∂
2
f
∂x1∂xn
.
.
.
.
.
.
∂
2
f
∂xn∂x1
· · ·
∂
2
f
∂x
2
n
_
¸
¸
_
x=x
k
are the gradient and Hessian of f evaluated at the current point x
k
. Notice that even when f is a paraboloid, the
gradient g
k
is different from a as used in equation (4.1). In fact, a and Q are the coefﬁcients of the Taylor expansion
of f around point x = 0, while g
k
and Q
k
are the coefﬁcients of the Taylor expansion of f around the current point
x
k
. In other words, gradient and Hessian are constantly reevaluated in Newton’s method.
To the extent that approximation (4.14) is valid, we can set the derivatives of f(x
k
+ ∆x) with respect to ∆x to
zero, and obtain, analogously to equation (4.2), the linear system
Q
k
∆x = −g
k
, (4.15)
whose solution ∆x
k
= α
k
p
k
yields at the same time the step direction p
k
= ∆x
k
/∆x
k
and the step size α
k
=
∆x
k
. The direction is of course undeﬁned once the algorithm has reached a minimum, that is, when α
k
= 0.
A minimization algorithm in which the step direction p
k
and size α
k
are deﬁned in this manner is called Newton’s
method. The corresponding p
k
is termed the Newton direction, and the step deﬁned by equation (4.15) is the Newton
step.
The greater speed of Newton’s method over steepest descent is borne out by analysis: while steepest descent has a
linear order of convergence, Newton’s method is quadratic. In fact, let
y(x) = x −Q(x)
−1
g(x)
be the place reached by a Newton step starting at x (see equation (4.15)), and suppose that at the minimum x
∗
the
Hessian Q(x
∗
) is nonsingular. Then
y(x
∗
) = x
∗
because g(x
∗
) = 0, and
x
k+1
−x
∗
= y(x
k
) −x
∗
= y(x
k
) −y(x
∗
) .
48 CHAPTER 4. FUNCTION OPTIMIZATION
From the meanvalue theorem, we have
x
k+1
−x
∗
= y(x
k
) −y(x
∗
) ≤
_
_
_
_
_
∂y
∂x
T
_
x=x
∗
(x
k
−x
∗
)
_
_
_
_
+
1
2
¸
¸
¸
¸
∂
2
y
∂x∂x
T
¸
¸
¸
¸
x=
ˆ
x
x
k
−x
∗
2
where ˆx is some point on the line between x
∗
and x
k
. Since y(x
∗
) = x
∗
, the ﬁrst derivatives of y at x
∗
are zero, so that
the ﬁrst term in the righthand side above vanishes, and
x
k+1
−x
∗
≤ c x
k
−x
∗
2
where c depends on thirdorder derivatives of f near x
∗
. Thus, the convergence rate of Newton’s method is of order at
least two.
For a quadratic function, as in equation (4.1), steepest descent takes many steps to converge, while Newton’s
method reaches the minimum in one step. However, this single iteration in Newton’s method is more expensive,
because it requires both the gradient g
k
and the Hessian Q
k
to be evaluated, for a total of n +
_
n
2
_
derivatives. In
addition, the Hessian must be inverted, or, at least, system (4.15) must be solved. For very large problems, in which
the dimension n of x is thousands or more, storing and manipulating a Hessian can be prohibitive. In contrast, steepest
descent requires the gradient g
k
for selecting the step direction p
k
, and a line search in the direction p
k
to ﬁnd the
step size. The method of conjugate gradients, discussed in the next section, is motivated by the desire to accelerate
convergence with respect to the steepest descent method, but without paying the storage cost of Newton’s method.
4.3 Conjugate Gradients
Newton’s method converges faster (quadratically) than steepest descent (linear convergence rate) because it uses more
information about the function f being minimized. Steepest descent locally approximates the function with planes,
because it only uses gradient information. All it can do is to go downhill. Newton’s method approximates f with
paraboloids, and then jumps at every iteration to the lowest point of the current approximation. The bottom line is that
fast convergence requires work that is equivalent to evaluating the Hessian of f.
Prima facie, the method of conjugate gradients discussed in this section seems to violate this principle: it achieves
fast, superlinear convergence, similarly to Newton’s method, but it only requires gradient information. This paradox,
however, is only apparent. Conjugate gradients works by taking n steps for each of the steps in Newton’s method.
It effectively solves the linear system (4.2) of Newton’s method, but it does so by a sequence of n onedimensional
minimizations, each requiring one gradient computation and one line search.
Overall, the work done by conjugate gradients is equivalent to that done by Newton’s method. However, system
(4.2) is never constructed explicitly, and the matrix Q is never stored. This is very important in cases where x has
thousands or even millions of components. These highdimensional problems arise typically from the discretization
of partial differential equations. Say for instance that we want to compute the motion of points in an image as a
consequence of camera motion. Partial differential equations relate image intensities over space and time to the motion
of the underlying image features. At every pixel in the image, this motion, called the motion ﬁeld, is represented by
a vector whose magnitude and direction describe the velocity of the image feature at that pixel. Thus, if an image
has, say, a quarter of a million pixels, there are n = 500, 000 unknown motion ﬁeld values. Storing and inverting a
500, 000 ×500, 000 Hessian is out of the question. In cases like these, conjugate gradients saves the day.
The conjugate gradients method described in these notes is the socalled PolakRibi` ere variation. It will be intro
duced in three steps. First, it will be developed for the simple case of minimizing a quadratic function with positive
deﬁnite and known Hessian. This quadratic function f(x) was introduced in equation (4.1). We know that in this case
minimizing f(x) is equivalent to solving the linear system (4.2). Rather than an iterative method, conjugate gradients
is a direct method for the quadratic case. This means that the number of iterations is ﬁxed. Speciﬁcally, the method
converges to the solution in n steps, where n is the number of components of x. Because of the equivalence with
a linear system, conjugate gradients for the quadratic case can also be seen as an alternative method for solving a
linear system, although the version presented here will only work if the matrix of the system is symmetric and positive
deﬁnite.
4.3. CONJUGATE GRADIENTS 49
Second, the assumption that the Hessian Q in expression (4.1) is known will be removed. As discussed above, this
is the main reason for using conjugate gradients.
Third, the conjugate gradients method will be extended to general functions f(x). In this case, the method is no
longer direct, but iterative, and the cost of ﬁnding the minimum depends on the desired accuracy. This occurs because
the Hessian of f is no longer a constant, as it was in the quadratic case. As a consequence, a certain property that holds
in the quadratic case is now valid only approximately. In spite of this, the convergence rate of conjugate gradients is
superlinear, somewhere between Newton’s method and steepest descent. Finding tight bounds for the convergence rate
of conjugate gradients is hard, and we will omit this proof. We rely instead on the intuition that conjugate gradients
solves system (4.2), and that the quadratic approximation becomes more and more valid as the algorithm converges
to the minimum. If the function f starts to behave like a quadratic function early, that is, if f is nearly quadratic in a
large neighborhood of the minimum, convergence is fast, as it requires close to the n steps that are necessary in the
quadratic case, and each of the steps is simple. This combination of fast convergence, modest storage requirements,
and low computational cost per iteration explains the popularity of conjugate gradients methods for the optimization
of functions of a large number of variables.
4.3.1 The Quadratic Case
Suppose that we want to minimize the quadratic function
f(x) = c + a
T
x +
1
2
x
T
Qx (4.16)
where Q is a symmetric, positive deﬁnite matrix, and x has n components. As we saw in our discussion of steepest
descent, the minimum x
∗
is the solution to the linear system
Qx = −a . (4.17)
We know how to solve such a system. However, all the methods we have seen so far involve explicit manipulation
of the matrix Q. We now consider an alternative solution method that does not need Q, but only the quantity
g
k
= Qx
k
+ a
that is, the gradient of f(x), evaluated at n different points x
1
, . . . , x
n
. We will see that the conjugate gradients method
requires n gradient evaluations and n line searches in lieu of each n ×n matrix inversion in Newton’s method.
Formal proofs can be found in Elijah Polak, Optimization — Algorithms and consistent approximations, Springer,
NY, 1997. The arguments offered below appeal to intuition.
Consider the case n = 3, in which the variable x in f(x) is a threedimensional vector. Then the quadratic function
f(x) is constant over ellipsoids, called isosurfaces, centered at the minimum x
∗
. How can we start from a point x
0
on one of these ellipsoids and reach x
∗
by a ﬁnite sequence of onedimensional searches? In connection with steepest
descent, we noticed that for poorly conditioned Hessians orthogonal directions lead to many small steps, that is, to
slow convergence.
When the ellipsoids are spheres, on the other hand, this works much better. The ﬁrst step takes from x
0
to x
1
, and
the line between x
0
and x
1
is tangent to an isosurface at x
1
. The next step is in the direction of the gradient, so that
the new direction p
1
is orthogonal to the previous direction p
0
. This would then take us to x
∗
right away. Suppose
however that we cannot afford to compute this special direction p
1
orthogonal to p
0
, but that we can only compute
some direction p
1
orthogonal to p
0
(there is an n − 1dimensional space of such directions!). It is easy to see that in
that case n steps will take us to x
∗
. In fact, since isosurfaces are spheres, each line minimization is independent of the
others: The ﬁrst step yields the minimum in the space spanned by p
0
, the second step then yields the minimum in the
space spanned by p
0
and p
1
, and so forth. After n steps we must be done, since p
0
. . . , p
n−1
span the whole space.
In summary, any set of orthogonal directions, with a line search in each direction, will lead to the minimum for
spherical isosurfaces. Given an arbitrary set of ellipsoidal isosurfaces, there is a onetoone mapping with a spherical
system: if Q = UΣU
T
is the SVD of the symmetric, positive deﬁnite matrix Q, then we can write
1
2
x
T
Qx =
1
2
y
T
y
50 CHAPTER 4. FUNCTION OPTIMIZATION
where
y = Σ
1/2
U
T
x . (4.18)
Consequently, there must be a condition for the original problem (in terms of Q) that is equivalent to orthogonality for
the spherical problem. If two directions q
i
and q
j
are orthogonal in the spherical context, that is, if
q
T
i
q
j
= 0 ,
what does this translate into in terms of the directions p
i
and p
j
for the ellipsoidal problem? We have
q
i,j
= Σ
1/2
U
T
p
i,j
,
so that orthogonality for q
i,j
becomes
p
T
i
UΣ
1/2
Σ
1/2
U
T
p
j
= 0
or
p
T
i
Qp
j
= 0 . (4.19)
This condition is called Qconjugacy, or Qorthogonality: if equation (4.19) holds, then p
i
and p
j
are said to be
Qconjugate or Qorthogonal to each other. We will henceforth simply say “conjugate” for brevity.
In summary, if we can ﬁnd n directions p
0
, . . . , p
n−1
that are mutually conjugate, and if we do line minimization
along each direction p
k
, we reach the minimum in at most n steps. Of course, we cannot use the transformation (4.18)
in the algorithm, because Σ and especially U
T
are too large. So now we need to ﬁnd a method for generating n
conjugate directions without using either Q or its SVD. We do this in two steps. First, we ﬁnd conjugate directions
whose deﬁnitions do involve Q. Then, in the next subsection, we rewrite these expressions without Q.
Here is the procedure, due to Hestenes and Stiefel (Methods of conjugate gradients for solving linear systems, J.
Res. Bureau National Standards, section B, Vol 49, pp. 409436, 1952), which also incorporates the steps from x
0
to
x
n
:
g
0
= g(x
0
)
p
0
= −g
0
for k = 0 . . . , n −1
α
k
= arg min
α≥0
f(x
k
+αp
k
)
x
k+1
= x
k
+α
k
p
k
g
k+1
= g(x
k+1
)
γ
k
=
g
T
k+1
Qp
k
p
T
k
Qp
k
p
k+1
= −g
k+1
+γ
k
p
k
end
where
g
k
= g(x
k
) =
∂f
∂x
¸
¸
¸
¸
x=x
k
is the gradient of f at x
k
.
It is simple to see that p
k
and p
k+1
are conjugate. In fact,
p
T
k
Qp
k+1
= p
T
k
Q(−g
k+1
+γ
k
p
k
)
= −p
T
k
Qg
k+1
+
g
T
k+1
Qp
k
p
T
k
Qp
k
p
T
k
Qp
k
= −p
T
k
Qg
k+1
+ g
T
k+1
Qp
k
= 0 .
It is somewhat more cumbersome to show that p
i
and p
k+1
for i = 0, . . . , k are also conjugate. This can be done by
induction. The proof is based on the observation that the vectors p
k
are found by a generalization of GramSchmidt
(theorem 2.4.2) to produce conjugate rather than orthogonal vectors. Details can be found in Polak’s book mentioned
earlier.
4.3. CONJUGATE GRADIENTS 51
4.3.2 Removing the Hessian
The algorithm shown in the previous subsection is a correct conjugate gradients algorithm. However, it is computa
tionally inadequate because the expression for γ
k
contains the Hessian Q, which is too large. We now show that γ
k
can be rewritten in terms of the gradient values g
k
and g
k+1
only. To this end, we notice that
g
k+1
= g
k
+α
k
Qp
k
,
or
α
k
Qp
k
= g
k+1
−g
k
.
In fact,
g(x) = a +Qx
so that
g
k+1
= g(x
k+1
) = g(x
k
+α
k
p
k
) = a +Q(x
k
+α
k
p
k
) = g
k
+α
k
Qp
k
.
We can therefore write
γ
k
=
g
T
k+1
Qp
k
p
T
k
Qp
k
=
g
T
k+1
α
k
Qp
k
p
T
k
α
k
Qp
k
=
g
T
k+1
(g
k+1
−g
k
)
p
T
k
(g
k+1
−g
k
)
,
and Q has disappeared.
This expression for γ
k
can be further simpliﬁed by noticing that
p
T
k
g
k+1
= 0
because the line along p
k
is tangent to an isosurface at x
k+1
, while the gradient g
k+1
is orthogonal to the isosurface at
x
k+1
. Similarly,
p
T
k−1
g
k
= 0 .
Then, the denominator of γ
k
becomes
p
T
k
(g
k+1
−g
k
) = −p
T
k
g
k
= (g
k
−γ
k−1
p
k−1
)
T
g
k
= g
T
k
g
k
.
In conclusion, we obtain the PolakRibi` ere formula
γ
k
=
g
T
k+1
(g
k+1
−g
k
)
g
T
k
g
k
.
4.3.3 Extension to General Functions
We now know how to minimize the quadratic function (4.16) in n steps, without ever constructing the Hessian explic
itly. When the function f(x) is arbitrary, the same algorithm can be used.
However, n iterations will not sufﬁce. In fact, the Hessian, which was constant for the quadratic case, now is a
function of x
k
. Strictly speaking, we then lose conjugacy, since p
k
and p
k+1
are associated to different Hessians.
However, as the algorithm approaches the minimum x
∗
, the quadratic approximation becomes more and more valid,
and a few cycles of n iterations each will achieve convergence.
52 CHAPTER 4. FUNCTION OPTIMIZATION
Chapter 5
Eigenvalues and Eigenvectors
Given a linear transformation
b = Ax ,
the singular value decomposition A = UΣV
T
of A transforms the domain of the transformation via the matrix V
T
and its range via the matrix U
T
so that the transformed system is diagonal. In fact, the equation b = UΣV
T
x can be
written as follows
U
T
b = ΣV
T
x ,
that is,
c = Σy
where
y = V
T
x and c = U
T
b ,
and where Σ is diagonal. This is a fundamental transformation to use whenever the domain and the range of A are
separate spaces. Often, however, domain and range are intimately related to one another even independently of the
transformation A. The most important example is perhaps that of a system of linear differential equations, of the form
˙ x = Ax
where Ais n×n. For this equation, the fact that Ais square is not a coincidence. In fact, x is assumed to be a function
of some real scalar variable t (often time), and ˙ x is the derivative of x with respect to t:
˙ x =
dx
dt
.
In other words, there is an intimate, preexisting relation between x and ˙ x, and one cannot change coordinates for x
without also changing those for ˙ x accordingly. In fact, if V is an orthogonal matrix and we deﬁne
y = V
T
x ,
then the deﬁnition of ˙ x forces us to transform ˙ x by V
T
as well:
d V
T
x
dt
= V
T
dx
dt
= V
T
˙ x .
In brief, the SVD does nothing useful for systems of linear differential equations, because it diagonalizes Aby two
different transformations, one for the domain and one for the range, while we need a single transformation. Ideally,
we would like to ﬁnd an orthogonal matrix S and a diagonal matrix Λ such that
A = SΛS
T
(5.1)
53
54 CHAPTER 5. EIGENVALUES AND EIGENVECTORS
so that if we deﬁne
y = S
T
x
we can write the equivalent but diagonal differential system
˙ y = Λy .
This is now much easier to handle, because it is a system of n independent, scalar differential equations, which can be
solved separately. The solutions can then be recombined through
x = Sy .
We will see all of this in greater detail soon.
Unfortunately, writing A in the form (5.1) is not always possible. This stands to reason, because now we are
imposing stronger constraints on the terms of the decomposition. It is like doing an SVD but with the additional
constraint U = V . If we refer back to ﬁgure 3.1, now the circle and the ellipse live in the same space, and the
constraint U = V implies that the vectors v
i
on the circle that map into the axes σ
i
u
i
of the ellipse are parallel to the
axes themselves. This will only occur for very special matrices.
In order to make a decomposition like (5.1) possible, we weaken the constraints in several ways:
• the elements of S and Λ are allowed to be complex, rather than real;
• the elements on the diagonal of Λ are allowed to be negative; in fact, they can be even nonreal;
• S is required to be only invertible, rather than orthogonal.
To distinguish invertible from orthogonal matrices we use the symbol Q for invertible and S for orthogonal. In some
cases, it will be possible to diagonalize A by orthogonal transformations S and S
T
. Finally, for complex matrices we
generalize the notion of transpose by introducing the Hermitian operator: The matrix Q
H
(pronounced “QHermitian”)
is deﬁned to be the complex conjugate of the transpose of Q. If Q happens to be real, conjugate transposition becomes
simply transposition, so the Hermitian is a generalization of the transpose. A matrix S is said to be unitary if
S
H
S = SS
H
= I ,
so unitary generalizes orthogonal for complex matrices. Unitary matrices merely rotate or ﬂip vectors, in the sense
that they do not alter the vectors’ norms. For complex vectors, the norm squared is deﬁned as
x
2
= x
H
x ,
and if S is unitary we have
Sx
2
= x
H
S
H
Sx = x
H
x = x
2
.
Furthermore, if x
1
and x
2
are mutually orthogonal, in the sense that
x
H
1
x
2
= 0 ,
then Sx
1
and Sx
2
are orthogonal as well:
x
H
1
S
H
Sx
2
= x
H
1
x
2
= 0 .
In contrast, a nonunitary transformation Q can change the norms of vectors, as well as the inner products between
vectors. A matrix that is equal to its Hermitian is called a Hermitian matrix.
In summary, in order to diagonalize a square matrix A from a system of linear differential equations we generally
look for a decomposition of A of the form
A = QΛQ
−1
(5.2)
55
where Q and Λ are complex, Q is invertible, and Λ is diagonal. For some special matrices, this may specialize to
A = SΛS
H
with unitary S.
Whenever two matrices A and B, diagonal or not, are related by
A = QBQ
−1
,
they are said to be similar to each other, and the transformation of B into A (and vice versa) is called a similarity
transformation.
The equation A = QΛQ
−1
can be rewritten as follows:
AQ = QΛ
or separately for every column of Q as follows:
Aq
i
= λ
i
q
i
(5.3)
where
Q =
_
q
1
· · · q
n
¸
and Λ = diag(λ
1
, . . . , λ
n
) .
Thus, the columns of q
i
of Q and the diagonal entries λ
i
of Λ are solutions of the eigenvalue/eigenvector equation
Ax = λx , (5.4)
which is how eigenvalues and eigenvectors are usually introduced. In contrast, we have derived this equation from the
requirement of diagonalizing a matrix by a similarity transformation. The columns of Q are called eigenvectors, and
the diagonal entries of Λ are called eigenvalues.
−2 −1 0 1 2
−2
−1.5
−1
−0.5
0
0.5
1
1.5
2
Figure 5.1: Effect of the transformation (5.5) on a sample of points on the unit circle. The dashed lines are vectors that
do not change direction under the transformation.
That real eigenvectors and eigenvalues do not always exist can be clariﬁed by considering the eigenvalue problem
from a geometrical point of view in the n = 2 case. As we know, an invertible linear transformation transforms the
unit circle into an ellipse. Each point on the unit circle is transformed into some point on the ellipse. Figure 5.1 shows
the effect of the transformation represented by the matrix
A =
_
2/3 4/
√
3
0 2
_
(5.5)
56 CHAPTER 5. EIGENVALUES AND EIGENVECTORS
for a sample of points on the unit circle. Notice that there are many transformations that map the unit circle into the
same ellipse. In fact, the circle in ﬁgure 5.1 can be rotated, pulling the solid lines along. Each rotation yields another
matrix A, but the resulting ellipse is unchanged. In other words, the curvetocurve transformation from circle to
ellipse is unique, but the pointtopoint transformation is not. Matrices represent pointtopoint transformations.
The eigenvalue problem amounts to ﬁnding axes q
1
, q
2
that are mapped into themselves by the original transfor
mation A (see equation (5.3)). In ﬁgure 5.1, the two eigenvectors are shown as dashed lines. Notice that they do not
correspond to the axes of the ellipse, and that they are not orthogonal. Equation (5.4) is homogeneous in x, so x can
be assumed to be a unit vector without loss of generality.
Given that the directions of the input vectors are generally changed by the transformation A, as evident from ﬁgure
5.1, it is not obvious whether the eigenvalue problem admits a solution at all. We will see that the answer depends
on the matrix A, and that a rather diverse array of situations may arise. In some cases, the eigenvalues and their
eigenvectors exist, but they are complex. The geometric intuition is hidden, and the problem is best treated as an
algebraic one. In other cases, all eigenvalues exist, perhaps all real, but not enough eigenvectors can be found, and the
matrix A cannot be diagonalized. In particularly good cases, there are n real, orthonormal eigenvectors. In bad cases,
we have to give up the idea of diagonalizing A, and we can only triangularize it. This turns out to be good enough for
solving linear differential systems, just as triangularization was sufﬁcient for solving linear algebraic systems.
5.1 Computing Eigenvalues and Eigenvectors Algebraically
Let us rewrite the eigenvalue equation
Ax = λx
as follows:
(A−λI)x = 0 . (5.6)
This is a homogeneous, square system of equations, which admits nontrivial solutions iff the matrix A − λI is rank
deﬁcient. A square matrix B is rankdeﬁcient iff its determinant,
det(B) =
_
b
11
if B is 1 ×1
n
i=1
(−1)
i+1
b
i1
det(B
i1
) otherwise
is zero. In this expression, B
ij
is the algebraic complement of entry b
ij
, deﬁned as the (n − 1) × (n − 1) matrix
obtained by removing row i and column j from B.
Volumes have been written about the properties of the determinant. For our purposes, it is sufﬁcient to recall the
following properties from linear algebra:
• det(B) = det(B
T
);
• det(
_
b
1
· · · b
n
¸
) = 0 iff b
1
, . . . , b
n
are linearly dependent;
• det(
_
b
1
· · · b
i
· · · b
j
· · · b
n
¸
) = −det(
_
b
1
· · · b
j
· · · b
i
· · · b
n
¸
);
• det(BC) = det(B) det(C).
Thus, for system (5.6) to admit nontrivial solutions, we need
det(A−λI) = 0 . (5.7)
From the deﬁnition of determinant, it follows, by very simple induction, that the lefthand side of equation (5.7)
is a polynomial of degree n in λ, and that the coefﬁcient of λ
n
is 1. Therefore, equation (5.7), which is called the
characteristic equation of A, has n complex solutions, in the sense that
det(A−λI) = (−1)
n
(λ −λ
1
) · . . . · (λ −λ
n
)
where some of the λ
i
may coincide. In other words, an n × n matrix has at most n distinct eigenvalues. The case of
exactly n distinct eigenvalues is of particular interest, because of the following results.
5.1. COMPUTING EIGENVALUES AND EIGENVECTORS ALGEBRAICALLY 57
Theorem 5.1.1 Eigenvectors x
1
, . . . , x
k
corresponding to distinct eigenvalues λ
1
, . . . , λ
k
are linearly independent.
Proof. Suppose that c
1
x
1
+. . . +c
k
x
k
= 0 where the x
i
are eigenvectors of a matrix A. We need to show that then
c
1
= . . . = c
k
= 0. By multiplying by A we obtain
c
1
Ax
1
+. . . +c
k
Ax
k
= 0
and because x
1
, . . . , x
k
are eigenvectors corresponding to eigenvalues λ
1
, . . . , λ
k
, we have
c
1
λ
1
x
1
+. . . +c
k
λ
k
x
k
= 0 . (5.8)
However, from
c
1
x
1
+. . . +c
k
x
k
= 0
we also have
c
1
λ
k
x
1
+. . . +c
k
λ
k
x
k
= 0
and subtracting this equation from equation (5.8) we have
c
1
(λ
1
−λ
k
)x
1
+. . . +c
k−1
(λ
k−1
−λ
k
)x
k−1
= 0 .
Thus, we have reduced the summation to one containing k − 1 terms. Since all λ
i
are distinct, the differences in
parentheses are all nonzero, and we can replace each x
i
by x
i
= (λ
i
−λ
k
)x
i
, which is still an eigenvector of A:
c
1
x
1
+. . . +c
k−1
x
k−1
= 0 .
We can repeat this procedure until only one term remains, and this forces c
1
= 0, so that
c
2
x
2
+. . . +c
k
x
k
= 0
This entire argument can be repeated for the last equation, therefore forcing c
2
= 0, and so forth.
In summary, the equation c
1
x
1
+. . . +c
k
x
k
= 0 implies that c
1
= . . . = c
k
= 0, that is, that the vectors x
1
, . . . , x
k
are linearly independent. ∆
For Hermitian matrices (and therefore for real symmetric matrices as well), the situation is even better.
Theorem 5.1.2 A Hermitian matrix has real eigenvalues.
Proof. A matrix A is Hermitian iff A = A
H
. Let λ and x be an eigenvalue of A and a corresponding eigenvector:
Ax = λx . (5.9)
By taking the Hermitian we obtain
x
H
A
H
= λ
∗
x
H
.
Since A = A
H
, the last equation can be rewritten as follows:
x
H
A = λ
∗
x
H
. (5.10)
If we multiply equation (5.9) from the left by x
H
and equation (5.10) from the right by x, we obtain
x
H
Ax = λx
H
x
x
H
Ax = λ
∗
x
H
x
which implies that
λx
H
x = λ
∗
x
H
x
58 CHAPTER 5. EIGENVALUES AND EIGENVECTORS
Since x is an eigenvector, the scalar x
H
x is nonzero, so that we have
λ = λ
∗
as promised. ∆
Corollary 5.1.3 A real and symmetric matrix has real eigenvalues.
Proof. A real and symmetric matrix is Hermitian. ∆
Theorem 5.1.4 Eigenvectors corresponding to distinct eigenvalues of a Hermitian matrix are mutually orthogonal.
Proof. Let λ and µ be two distinct eigenvalues of A, and let x and y be corresponding eigenvectors:
Ax = λx
Ay = µy ⇒ y
H
A = µy
H
because A = A
H
and from theorem 5.1.2 µ = µ
∗
. If we multiply these two equations by y
H
from the left and x from
the right, respectively, we obtain
y
H
Ax = λy
H
x
y
H
Ax = µy
H
x ,
which implies
λy
H
x = µy
H
x
or
(λ −µ)y
H
x = 0 .
Since the two eigenvalues are distinct, λ −µ is nonzero, and we must have y
H
x = 0. ∆
Corollary 5.1.5 An n ×n Hermitian matrix with n distinct eigenvalues admits n orthonormal eigenvectors.
Proof. From theorem 5.1.4, the eigenvectors of an n×n Hermitian matrix with n distinct eigenvalues are all mutu
ally orthogonal. Since the eigenvalue equation Ax = λx is homogeneous in x, the vector x can be normalized without
violating the equation. Consequently, the eigenvectors can be made to be orthonormal. ∆
In summary, any square matrix with n distinct eigenvalues can be diagonalized by a similarity transformation, and
any square Hermitian matrix with n distinct eigenvalues can be diagonalized by a unitary similarity transformation.
Notice that the converse is not true: a matrix can have coincident eigenvalues and still admit n independent, and
even orthonormal, eigenvectors. For instance, the n × n identity matrix has n equal eigenvalues but n orthonormal
eigenvectors (which can be chosen in inﬁnitely many ways).
The examples in section 5.2 show that when some eigenvalues coincide, rather diverse situations can arise concern
ing the eigenvectors. First, however, we point out a simple but fundamental fact about the eigenvalues of a triangular
matrix.
Theorem 5.1.6 The determinant of a triangular matrix is the product of the elements on its diagonal.
5.2. GOOD AND BAD MATRICES 59
Proof. This follows immediately from the deﬁnition of determinant. Without loss of generality, we can assume a
triangular matrix B to be uppertriangular, for otherwise we can repeat the argument for the transpose, which because
of the properties above has the same eigenvalues. Then, the only possibly nonzero b
i1
of the matrix B is b
11
, and the
summation in the deﬁnition of determinant given above reduces to a single term:
det(B) =
_
b
11
if B is 1 ×1
b
11
det(B
11
) otherwise
.
By repeating the argument for B
11
and so forth until we are left with a single scalar, we obtain
det(B) = b
11
· . . . · b
nn
.
∆
Corollary 5.1.7 The eigenvalues of a triangular matrix are the elements on its diagonal.
Proof. The eigenvalues of a matrix A are the solutions of the equation
det(A−λI) = 0 .
If A is triangular, so is B = A−λI, and from the previous theorem we obtain
det(A−λI) = (a
11
−λ) · . . . · (a
nn
−λ)
which is equal to zero for
λ = a
11
, . . . , a
nn
.
∆
Note that diagonal matrices are triangular, so this result holds for diagonal matrices as well.
5.2 Good and Bad Matrices
Solving differential equations becomes much easier when matrices have a full set of orthonormal eigenvectors. For
instance, the matrix
A =
_
2 0
0 1
_
(5.11)
has eigenvalues 2 and 1 and eigenvectors
s
1
=
_
1
0
_
s
2
=
_
0
1
_
.
Matrices with n orthonormal eigenvectors are called normal. Normal matrices are good news, because then the
n ×n system of differential equations
˙ x = Ax
has solution
x(t) =
n
i=1
c
i
s
i
e
λit
= S
_
¸
_
e
λ
1
t
.
.
.
e
λnt
_
¸
_c
60 CHAPTER 5. EIGENVALUES AND EIGENVECTORS
where S = [s
1
· · · s
n
] are the eigenvectors, λ
i
are the eigenvalues, and the vector c of constants c
i
is
c = S
H
x(0) .
More compactly,
x(t) = S
_
¸
_
e
λ1t
.
.
.
e
λ
n
t
_
¸
_S
H
x(0) .
Fortunately these matrices occur frequently in practice. However, not all matrices are as good as these. First, there
may still be a complete set of n eigenvectors, but they may not be orthonormal. An example of such a matrix is
_
2 −1
0 1
_
which has eigenvalues 2 and 1 and nonorthogonal eigenvectors
q
1
=
_
1
0
_
q
2
=
√
2
2
_
1
1
_
.
This is conceptually only a slight problem, because the unitary matrix S is replaced by an invertible matrix Q, and the
solution becomes
x(t) = Q
_
¸
_
e
λ1t
.
.
.
e
λnt
_
¸
_Q
−1
x(0) .
Computationally this is more expensive, because a computation of a Hermitian is replaced by a matrix inversion.
However, things can be worse yet, and a full set of eigenvectors may fail to exist, as we now show.
A necessary condition for an n ×n matrix to be defective, that is, to have fewer than n eigenvectors, is that it have
repeated eigenvalues. In fact, we have seen (theorem 5.1.1) that a matrix with distinct eigenvalues (zero or nonzero
does not matter) has a full set of eigenvectors (perhaps nonorthogonal, but independent). The simplest example of a
defective matrix is
_
0 1
0 0
_
which has double eigenvalue 0 and only eigenvector [1 0]
T
, while
_
3 1
0 3
_
has double eigenvalue 3 and only eigenvector [1 0]
T
, so zero eigenvalues are not the problem.
However, repeated eigenvalues are not a sufﬁcient condition for defectiveness, as the identity matrix proves.
How bad can a matrix be? Here is a matrix that is singular, has fewer than n eigenvectors, and the eigenvectors it
has are not orthogonal. It belongs to the scum of all matrices:
A =
_
_
0 2 −1
0 2 1
0 0 2
_
_
.
Its eigenvalues are 0, because the matrix is singular, and 2, repeated twice. A has to have a repeated eigenvalue if it is
to be defective. Its two eigenvectors are
q
1
=
_
_
1
0
0
_
_
, q
2
=
√
2
2
_
_
1
1
0
_
_
corresponding to eigenvalues 0 and 2 in this order, and there is no q
3
. Furthermore, q
1
and q
2
are not orthogonal to
each other.
5.3. COMPUTING EIGENVALUES AND EIGENVECTORS NUMERICALLY 61
5.3 Computing Eigenvalues and Eigenvectors Numerically
The examples above have shown that not every n × n matrix admits n independent eigenvectors, so some matrices
cannot be diagonalized by similarity transformations. Fortunately, these matrices can be triangularized by similarity
transformations, as we now show. We will show later on that this allows solving systems of linear differential equations
regardless of the structure of the system’s matrix of coefﬁcients.
It is important to notice that if a matrix A is triangularized by similarity transformations,
T = Q
−1
AQ ,
then the eigenvalues of the triangular matrix T are equal to those of the original matrix A. In fact, if
Ax = λx ,
then
QTQ
−1
x = λx ,
that is,
Ty = λy
where
y = Q
−1
x ,
so λ is also an eigenvalue for T. The eigenvectors, however, are changed according to the last equation.
The Schur decomposition does even better, since it triangularizes any square matrix A by a unitary (possibly
complex) transformation:
T = S
H
AS .
This transformation is equivalent to factoring A into the product
A = STS
H
,
and this product is called the Schur decomposition of A. Numerically stable and efﬁcient algorithms exist for the Schur
decomposition. In this note, we will not study these algorithms, but only show that all square matrices admit a Schur
decomposition.
5.3.1 Rotations into the x
1
Axis
An important preliminary fact concerns vector rotations. Let e
1
be the ﬁrst column of the identity matrix. It is
intuitively obvious that any nonzero real vector x can be rotated into a vector parallel to e
1
. Formally, take any
orthogonal matrix S whose ﬁrst column is
s
1
=
x
x
.
Since s
T
1
x = x
T
x/x = x, and since all the other s
j
are orthogonal to s
1
, we have
S
T
x =
_
¸
_
s
T
1
.
.
.
s
T
n
_
¸
_x =
_
¸
_
s
T
1
x
.
.
.
s
T
n
x
_
¸
_ =
_
¸
¸
¸
_
x
0
.
.
.
0
_
¸
¸
¸
_
which is parallel to e
1
as desired. It may be less obvious that a complex vector x can be transformed into a real vector
parallel to e
1
by a unitary transformation. But the trick is the same: let
s
1
=
x
x
.
62 CHAPTER 5. EIGENVALUES AND EIGENVECTORS
Now s
1
may be complex. We have s
H
1
x = x
H
x/x = x, and
S
H
x =
_
¸
_
s
H
1
.
.
.
s
H
n
_
¸
_x =
_
¸
_
s
H
1
x
.
.
.
s
H
n
x
_
¸
_ =
_
¸
¸
¸
_
x
0
.
.
.
0
_
¸
¸
¸
_
,
just about like before. We are now ready to triangularize an arbitrary square matrix A.
5.3.2 The Schur Decomposition
The Schur decomposition theorem is the cornerstone of eigenvalue computations. It states that any square matrix can
be triangularized by unitary transformations. The diagonal elements of a triangular matrix are its eigenvalues, and
unitary transformations preserve eigenvalues. Consequently, if the Schur decomposition of a matrix can be computed,
its eigenvalues can be determined. Moreover, as we will see later, a system of linear differential equations can be
solved regardless of the structure of the matrix of its coefﬁcients.
Lemma 5.3.1 If A is an n ×n matrix and λ and x are an eigenvalue of A and its corresponding eigenvector,
Ax = λx (5.12)
then there is a transformation
T = U
H
AU
where U is a unitary, n ×n matrix, such that
T =
_
¸
¸
¸
_
λ
0
.
.
.
0
C
_
¸
¸
¸
_
.
Proof. Let U be a unitary transformation that transforms the (possibly complex) eigenvector x of A into a real
vector on the x
1
axis:
x = U
_
¸
¸
¸
_
r
0
.
.
.
0
_
¸
¸
¸
_
where r is the nonzero norm of x. By substituting this into (5.12) and rearranging we have
AU
_
¸
¸
¸
_
r
0
.
.
.
0
_
¸
¸
¸
_
= λU
_
¸
¸
¸
_
r
0
.
.
.
0
_
¸
¸
¸
_
U
H
AU
_
¸
¸
¸
_
r
0
.
.
.
0
_
¸
¸
¸
_
= λ
_
¸
¸
¸
_
r
0
.
.
.
0
_
¸
¸
¸
_
U
H
AU
_
¸
¸
¸
_
1
0
.
.
.
0
_
¸
¸
¸
_
= λ
_
¸
¸
¸
_
1
0
.
.
.
0
_
¸
¸
¸
_
5.3. COMPUTING EIGENVALUES AND EIGENVECTORS NUMERICALLY 63
T
_
¸
¸
¸
_
1
0
.
.
.
0
_
¸
¸
¸
_
=
_
¸
¸
¸
_
λ
0
.
.
.
0
_
¸
¸
¸
_
.
The last lefthand side is the ﬁrst column of T, and the corresponding righthand side is of the form required by the
lemma. ∆
Theorem 5.3.2 (Schur) If A is any n ×n matrix then there exists a unitary n ×n matrix S such that
S
H
AS = T
where T is triangular. Furthermore, S can be chosen so that the eigenvalues λ
i
of A appear in any order along the
diagonal of T.
Proof. By induction. The theorem obviously holds for n = 1:
1 A1 = A .
Suppose it holds for all matrices of order n −1. Then from the lemma there exists a unitary U such that
U
H
AU =
_
¸
¸
¸
_
λ
0
.
.
.
0
C
_
¸
¸
¸
_
where λ is any eigenvalue of A. Partition C into a row vector and an (n −1) ×(n −1) matrix G:
C =
_
w
H
G
_
.
By the inductive hypothesis, there is a unitary matrix V such that V
H
GV is a Schur decomposition of G. Let
S = U
_
¸
¸
¸
_
1 0 · · · 0
0
.
.
.
0
V
_
¸
¸
¸
_
.
Clearly, S is a unitary matrix, and S
H
AS is uppertriangular. Since the elements on the diagonal of a triangular matrix
are the eigenvalues, S
H
AS is the Schur decomposition of A. Because we can pick any eigenvalue as λ, the order of
eigenvalues can be chosen arbitrarily. ∆
This theorem does not say how to compute the Schur decomposition, only that it exists. Fortunately, there is a
stable and efﬁcient algorithm to compute the Schur decomposition. This is the preferred way to compute eigenvalues
numerically.
64 CHAPTER 5. EIGENVALUES AND EIGENVECTORS
5.4 Eigenvalues/Vectors and Singular Values/Vectors
In this section we prove a few additional important properties of eigenvalues and eigenvectors. In the process, we also
establish a link between singular values/vectors and eigenvalues/vectors. While this link is very important, it is useful
to remember that eigenvalues/vectors and singular values/vectors are conceptually and factually very distinct entities
(recall ﬁgure 5.1).
First, a general relation between determinant and eigenvalues.
Theorem 5.4.1 The determinant of a matrix is equal to the product of its eigenvalues.
Proof. The proof is very simple, given the Schur decomposition. In fact, we know that the eigenvalues of a matrix
A are equal to those of the triangular matrix in the Schur decomposition of A. Furthermore, we know from theorem
5.1.6 that the determinant of a triangular matrix is the product of the elements on its diagonal. If we recall that a
unitary matrix has determinant 1 or 1, that the determinants of S and S
H
are the same, and that the determinant of a
product of matrices is equal to the product of the determinants, the proof is complete. ∆
We saw that an n ×n Hermitian matrix with n distinct eigenvalues admits n orthonormal eigenvectors (corollary
5.1.5). The assumption of distinct eigenvalues made the proof simple, but is otherwise unnecessary. In fact, now that
we have the Schur decomposition, we can state the following stronger result.
Theorem 5.4.2 (Spectral theorem) Every Hermitian matrix can be diagonalized by a unitary matrix, and every real
symmetric matrix can be diagonalized by an orthogonal matrix:
A = A
H
⇒ A = SΛS
H
A real, A = A
T
⇒ A = SΛS
T
, S real .
In either case, Λ is real and diagonal.
Proof. We already know that Hermitian matrices (and therefore real and symmetric ones) have real eigenvalues
(theorem 5.1.2), so Λ is real. Let now
A = STS
H
be the Schur decomposition of A. Since A is Hermitian, so is T. In fact, T = S
H
AS, and
T
H
= (S
H
AS)
H
= S
H
A
H
S = S
H
AS = T .
But the only way that T can be both triangular and Hermitian is for it to be diagonal, because 0
∗
= 0. Thus, the
Schur decomposition of a Hermitian matrix is in fact a diagonalization, and this is the ﬁrst equation of the theorem
(the diagonal of a Hermitian matrix must be real).
Let now Abe real and symmetric. All that is left to prove is that then its eigenvectors are real. But eigenvectors are
the solution of the homogeneous system (5.6), which is both real and rankdeﬁcient, and therefore admits nontrivial
real solutions. Thus, S is real, and S
H
= S
T
. ∆
In other words, a Hermitian matrix, real or not, with distinct eigenvalues or not, has real eigenvalues and n or
thonormal eigenvectors. If in addition the matrix is real, so are its eigenvectors.
We recall that a real matrix A such that for every nonzero x we have x
T
Ax > 0 is said to be positive deﬁnite. It is
positive semideﬁnite if for every nonzero x we have x
T
Ax ≥ 0. Notice that a positive deﬁnite matrix is also positive
semideﬁnite. Positive deﬁnite or semideﬁnite matrices arise in the solution of overconstrained linear systems, because
A
T
A is positive semideﬁnite for every A (lemma 5.4.5). They also occur in geometry through the equation of an
ellipsoid,
x
T
Qx = 1
5.4. EIGENVALUES/VECTORS AND SINGULAR VALUES/VECTORS 65
in which Q is positive deﬁnite. In physics, positive deﬁnite matrices are associated to quadratic forms x
T
Qx that
represent energies or secondorder moments of mass or force distributions. Their physical meaning makes them
positive deﬁnite, or at least positive semideﬁnite (for instance, energies cannot be negative). The following result
relates eigenvalues/vectors with singular values/vectors for positive semideﬁnite matrices.
Theorem 5.4.3 The eigenvalues of a real, symmetric, positive semideﬁnite matrix A are equal to its singular values.
The eigenvectors of A are also its singular vectors, both left and right.
Proof. From the previous theorem, A = SΛS
T
, where both Λ and S are real. Furthermore, the entries in Λ are
nonnegative. In fact, from
As
i
= λs
i
we obtain
s
T
i
As
i
= s
T
i
λs
i
= λs
T
i
s
i
= λs
i
2
= λ .
If A is positive semideﬁnite, then x
T
Ax ≥ 0 for any nonzero x, and in particular s
T
i
As
i
≥ 0, so that λ ≥ 0.
But
A = SΛS
T
with nonnegative diagonal entries in Λ is the singular value decomposition A = UΣV
T
of A with Σ = Λ and
U = V = S. Recall that the eigenvalues in the Schur decomposition can be arranged in any desired order along the
diagonal. ∆
Theorem 5.4.4 A real, symmetric matrix is positive semideﬁnite iff all its eigenvalues are nonnegative. It is positive
deﬁnite iff all its eigenvalues are positive.
Proof. Theorem 5.4.3 implies one of the two directions: If A is real, symmetric, and positive semideﬁnite, then its
eigenvalues are nonnegative. If the proof of that theorem is repeated with the strict inequality, we also obtain that if A
is real, symmetric, and positive deﬁnite, then its eigenvalues are positive.
Conversely, we show that if all eigenvalues λ of a real and symmetric matrix A are positive (nonnegative) then A
is positive deﬁnite (semideﬁnite). To this end, let x be any nonzero vector. Since real and symmetric matrices have n
orthonormal eigenvectors (theorem 5.4.2), we can use these eigenvectors s
1
, . . . , s
n
as an orthonormal basis for R
n
,
and write
x = c
1
s
1
+. . . +c
n
s
n
with
c
i
= x
T
s
i
.
But then
x
T
Ax = x
T
A(c
1
s
1
+. . . +c
n
s
n
) = x
T
(c
1
As
1
+. . . +c
n
As
n
)
= x
T
(c
1
λ
1
s
1
+. . . +c
n
λ
n
s
n
) = c
1
λ
1
x
T
s
1
+. . . +c
n
λ
n
x
T
s
n
= λ
1
c
2
1
+. . . +λ
n
c
2
n
> 0 (or ≥ 0)
because the λ
i
are positive (nonnegative) and not all c
i
can be zero. Since x
T
Ax > 0 (or ≥ 0) for every nonzero x, A
is positive deﬁnite (semideﬁnite). ∆
Theorem 5.4.3 establishes one connection between eigenvalues/vectors and singular values/vectors: for symmetric,
positive deﬁnite matrices, the concepts coincide. This result can be used to introduce a less direct link, but for arbitrary
matrices.
Lemma 5.4.5 A
T
A is positive semideﬁnite.
66 CHAPTER 5. EIGENVALUES AND EIGENVECTORS
Proof. For any nonzero x we can write x
T
A
T
Ax = Ax
2
≥ 0. ∆
Theorem 5.4.6 The eigenvalues of A
T
A with m ≥ n are the squares of the singular values of A; the eigenvectors of
A
T
Aare the right singular vectors of A. Similarly, for m ≤ n, the eigenvalues of AA
T
are the squares of the singular
values of A, and the eigenvectors of AA
T
are the left singular vectors of A.
Proof. If m ≥ n and A = UΣV
T
is the SVD of A, we have
A
T
A = V ΣU
T
UΣV
T
= V Σ
2
V
T
which is in the required format to be a (diagonal) Schur decomposition with S = V and T = Λ = Σ
2
. Similarly, for
m ≤ n,
AA
T
= UΣV
T
V ΣU
T
= UΣ
2
U
T
is a Schur decomposition with S = U and T = Λ = Σ
2
. ∆
We have seen that important classes of matrices admit a full set of orthonormal eigenvectors. The theorem below
characterizes the class of all matrices with this property, that is, the class of all normal matrices. To prove the theorem,
we ﬁrst need a lemma.
Lemma 5.4.7 If for an n ×n matrix B we have BB
H
= B
H
B, then for every i = 1, . . . , n, the norm of the ith row
of B equals the norm of its ith column.
Proof. From BB
H
= B
H
B we deduce
Bx
2
= x
H
B
H
Bx = x
H
BB
H
x = B
H
x
2
. (5.13)
If x = e
i
, the ith column of the n × n identity matrix, Be
i
is the ith column of B, and B
H
e
i
is the ith column of
B
H
, which is the conjugate of the ith row of B. Since conjugation does not change the norm of a vector, the equality
(5.13) implies that the ith column of B has the same norm as the ith row of B. ∆
Theorem 5.4.8 An n ×n matrix is normal if an only if it commutes with its Hermitian:
AA
H
= A
H
A .
Proof. Let A = STS
H
be the Schur decomposition of A. Then,
AA
H
= STS
H
ST
H
S
H
= STT
H
S
H
and A
H
A = ST
H
S
H
STS
H
= ST
H
TS
H
.
Because S is invertible (even unitary), we have AA
H
= A
H
A if and only if TT
H
= T
H
T.
However, a triangular matrix T for which TT
H
= T
H
T must be diagonal. In fact, from the lemma, the norm of
the ith row of T is equal to the norm of its ith column. Let i = 1. Then, the ﬁrst column of T has norm t
11
. The
ﬁrst row has ﬁrst entry t
11
, so the only way that its norm can be t
11
 is for all other entries in the ﬁrst row to be zero.
We now proceed through i = 2, . . . , n, and reason similarly to conclude that T must be diagonal.
The converse is also obviously true: if T is diagonal, then TT
H
= T
H
T. Thus, AA
H
= A
H
A if and only if T is
diagonal, that is, if and only if A can be diagonalized by a unitary similarity transformation. This is the deﬁnition of a
normal matrix. ∆
5.4. EIGENVALUES/VECTORS AND SINGULAR VALUES/VECTORS 67
Corollary 5.4.9 A triangular, normal matrix must be diagonal.
Proof. We proved this in the proof of theorem 5.4.8. ∆
Checking that A
H
A = AA
H
is much easier than computing eigenvectors, so theorem 5.4.8 is a very useful
characterization of normal matrices. Notice that Hermitian (and therefore also real symmetric) matrices commute
trivially with their Hermitians, but so do, for instance, unitary (and therefore also real orthogonal) matrices:
UU
H
= U
H
U = I .
Thus, Hermitian, real symmetric, unitary, and orthogonal matrices are all normal.
68 CHAPTER 5. EIGENVALUES AND EIGENVECTORS
Chapter 6
Ordinary Differential Systems
In this chapter we use the theory developed in chapter 5 in order to solve systems of ﬁrstorder linear differential
equations with constant coefﬁcients. These systems have the following form:
˙ x = Ax + b(t) (6.1)
x(0) = x
0
(6.2)
where x = x(t) is an ndimensional vector function of time t, the dot denotes differentiation, the coefﬁcients a
ij
in
the n ×n matrix A are constant, and the vector function b(t) is a function of time. The equation (6.2), in which x
0
is
a known vector, deﬁnes the initial value of the solution.
First, we show that scalar differential equations of order greater than one can be reduced to systems of ﬁrstorder
differential equations. Then, in section 6.2, we recall a general result for the solution of ﬁrstorder differential systems
from the elementary theory of differential equations. In section 6.3, we make this result more speciﬁc by showing
that the solution to a homogeneous system is a linear combination of exponentials multiplied by polynomials in t.
This result is based on the Schur decomposition introduced in chapter 5, which is numerically preferable to the more
commonly used Jordan canonical form. Finally, in sections 6.4 and 6.5, we set up and solve a particular differential
system as an illustrative example.
6.1 Scalar Differential Equations of Order Higher than One
The ﬁrstorder system (6.1) subsumes also the case of a scalar differential equation of order n, possibly greater than 1,
d
n
y
dt
n
+c
n−1
d
n−1
y
dt
n−1
+. . . +c
1
dy
dt
+c
0
y = b(t) . (6.3)
In fact, such an equation can be reduced to a ﬁrstorder system of the form (6.1) by introducing the ndimensional
vector
x =
_
¸
_
x
1
.
.
.
x
n
_
¸
_ =
_
¸
¸
¸
_
y
dy
dt
.
.
.
d
n−1
y
dt
n−1
_
¸
¸
¸
_
.
With this deﬁnition, we have
d
i
y
dt
i
= x
i+1
for i = 0, . . . , n −1
d
n
y
dt
n
=
dx
n
dt
,
69
70 CHAPTER 6. ORDINARY DIFFERENTIAL SYSTEMS
and x satisﬁes the additional n −1 equations
x
i+1
=
dx
i
dt
(6.4)
for i = 1, . . . , n − 1. If we write the original system (6.3) together with the n − 1 differential equations (6.4), we
obtain the ﬁrstorder system
˙ x = Ax + b(t)
where
A =
_
¸
¸
¸
¸
¸
_
0 1 0 · · · 0
0 0 1 · · · 0
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
0 0 0 · · · 1
−c
0
−c
1
−c
2
· · · −c
n−1
_
¸
¸
¸
¸
¸
_
is the socalled companion matrix of (6.3) and
b(t) =
_
¸
¸
¸
¸
¸
_
0
0
.
.
.
0
b(t)
_
¸
¸
¸
¸
¸
_
.
6.2 General Solution of a Linear Differential System
We know from the general theory of differential equations that a general solution of system (6.1) with initial condition
(6.2) is given by
x(t) = x
h
(t) + x
p
(t)
where x
h
(t) is the solution of the homogeneous system
˙ x = Ax
x(0) = x
0
and x
p
(t) is a particular solution of
˙ x = Ax + b(t)
x(0) = 0 .
The two solution components x
h
and x
p
can be written by means of the matrix exponential, introduced in the following.
For the scalar exponential e
λt
we can write a Taylor series expansion
e
λt
= 1 +
λt
1!
+
λ
2
t
2
2!
+· · · =
∞
j=0
λ
j
t
j
j!
.
Usually
1
, in calculus classes, the exponential is introduced by other means, and the Taylor series expansion above is
proven as a property.
For matrices, the exponential e
Z
of a matrix Z ∈ R
n×n
is instead deﬁned by the inﬁnite series expansion
e
Z
= I +
Z
1!
+
Z
2
2!
+· · · =
∞
j=0
Z
j
j!
.
1
Not always. In some treatments, the exponential is deﬁned through its Taylor series.
6.2. GENERAL SOLUTION OF A LINEAR DIFFERENTIAL SYSTEM 71
Here I is the n ×n identity matrix, and the general term Z
j
/j! is simply the matrix Z raised to the jth power divided
by the scalar j!. It turns out that this inﬁnite sum converges (to an n×n matrix which we write as e
Z
) for every matrix
Z. Substituting Z = At gives
e
At
= I +
At
1!
+
A
2
t
2
2!
+
A
3
t
3
3!
+· · · =
∞
j=0
A
j
t
j
j!
. (6.5)
Differentiating both sides of (6.5) gives
de
At
dt
= A+
A
2
t
1!
+
A
3
t
2
2!
+· · ·
= A
_
I +
At
1!
+
A
2
t
2
2!
+· · ·
_
de
At
dt
= Ae
At
.
Thus, for any vector w, the function x
h
(t) = e
At
w satisﬁes the homogeneous differential system
˙ x
h
= Ax
h
.
By using the initial values (6.2) we obtain w = x
0
, and
x
h
(t) = e
At
x(0) (6.6)
is a solution to the differential system (6.1) with b(t) = 0 and initial values (6.2). It can be shown that this solution is
unique.
From the elementary theory of differential equations, we also know that a particular solution to the nonhomoge
neous (b(t) = 0) equation (6.1) is given by
x
p
(t) =
_
t
0
e
A(t−s)
b(s) ds .
This is easily veriﬁed, since by differentiating this expression for x
p
we obtain
˙ x
p
= Ae
At
_
t
0
e
−As
b(s) ds +e
At
e
−At
b(t) = Ax
p
+ b(t) ,
so x
p
satisﬁes equation (6.1).
In summary, we have the following result.
The solution to
˙ x = Ax + b(t) (6.7)
with initial value
x(0) = x
0
(6.8)
is
x(t) = x
h
(t) + x
p
(t) (6.9)
where
x
h
(t) = e
At
x(0) (6.10)
and
x
p
(t) =
_
t
0
e
A(t−s)
b(s) ds . (6.11)
72 CHAPTER 6. ORDINARY DIFFERENTIAL SYSTEMS
Since we now have a formula for the general solution to a linear differential system, we seem to have all we need.
However, we do not know how to compute the matrix exponential. The naive solution to use the deﬁnition (6.5)
requires too many terms for a good approximation. As we have done for the SVD and the Schur decomposition, we
will only point out that several methods exist for computing a matrix exponential, but we will not discuss how this is
done
2
. In a fundamental paper on the subject, Nineteen dubious ways to compute the exponential of a matrix (SIAM
Review, vol. 20, no. 4, pp. 80136), Cleve Moler and Charles Van Loan discuss a large number of different methods,
pointing out that no one of them is appropriate for all situations. A full discussion of this matter is beyond the scope
of these notes.
When the matrix A is constant, as we currently assume, we can be much more speciﬁc about the structure of the
solution (6.9) of system (6.7), and particularly so about the solution x
h
(t) to the homogeneous part. Speciﬁcally, the
matrix exponential (6.10) can be written as a linear combination, with constant vector coefﬁcients, of scalar expo
nentials multiplied by polynomials. In the general theory of linear differential systems, this is shown via the Jordan
canonical form. However, in the paper cited above, Moler and Van Loan point out that the Jordan form cannot be
computed reliably, and small perturbations in the data can change the results dramatically. Fortunately, a similar result
can be found through the Schur decomposition introduced in chapter 5. The next section shows how to do this.
6.3 Structure of the Solution
For the homogeneous case b(t) = 0, consider the ﬁrst order system of linear differential equations
˙ x = Ax (6.12)
x(0) = x
0
. (6.13)
Two cases arise: either A admits n distinct eigenvalues, or is does not. In chapter 5, we have seen that if (but not only
if) A has n distinct eigenvalues then it has n linearly independent eigenvectors (theorem 5.1.1), and we have shown
how to ﬁnd x
h
(t) by solving an eigenvalue problem. In section 6.3.1, we brieﬂy review this solution. Then, in section
6.3.2, we show how to compute the homogeneous solution x
h
(t) in the extreme case of an n × n matrix A with n
coincident eigenvalues.
To be sure, we have seen that matrices with coincident eigenvalues can still have a full set of linearly independent
eigenvectors (see for instance the identity matrix). However, the solution procedure we introduce in section 6.3.2 for
the case of n coincident eigenvalues can be applied regardless to how many linearly independent eigenvectors exist.
If the matrix has a full complement of eigenvectors, the solution obtained in section 6.3.2 is the same as would be
obtained with the method of section 6.3.1.
Once these two extreme cases (nondefective matrix or allcoincident eigenvalues) have been handled, we show a
general procedure in section 6.3.3 for solving a homogeneous or nonhomogeneous differential system for any, square,
constant matrix A, defective or not. This procedure is based on backsubstitution, and produces a result analogous
to that obtained via Jordan decomposition for the homogeneous part x
h
(t) of the solution. However, since it is
based on the numerically sound Schur decomposition, the method of section 6.3.3 is superior in practice. For a
nonhomogeneous system, the procedure can be carried out analytically if the functions in the righthand side vector
b(t) can be integrated.
6.3.1 A is Not Defective
In chapter 5 we saw how to ﬁnd the homogeneous part x
h
(t) of the solution when A has a full set of n linearly
independent eigenvectors. This result is brieﬂy reviewed in this section for convenience.
3
If A is not defective, then it has n linearly independent eigenvectors q
1
, . . . , q
n
with corresponding eigenvalues
λ
1
, . . . , λ
n
. Let
Q =
_
q
1
· · · q
n
¸
.
2
In Matlab, expm(A) is the matrix exponential of A.
3
Parts of this subsection and of the following one are based on notes written by Scott Cohen.
6.3. STRUCTURE OF THE SOLUTION 73
This square matrix is invertible because its columns are linearly independent. Since Aq
i
= λ
i
q
i
, we have
AQ = QΛ, (6.14)
where Λ = diag(λ
1
, . . . , λ
n
) is a square diagonal matrix with the eigenvalues of A on its diagonal. Multiplying both
sides of (6.14) by Q
−1
on the right, we obtain
A = QΛQ
−1
. (6.15)
Then, system (6.12) can be rewritten as follows:
˙ x = Ax
˙ x = QΛQ
−1
x
Q
−1
˙ x = ΛQ
−1
x
˙ y = Λy, (6.16)
where y = Q
−1
x. The last equation (6.16) represents n uncoupled, homogeneous, differential equations ˙ y
i
= λ
i
y
i
.
The solution is
y
h
(t) = e
Λt
y(0),
where
e
Λt
= diag(e
λ1t
, . . . , e
λnt
).
Using the relation x = Qy, and the consequent relation y(0) = Q
−1
x(0), we see that the solution to the homogeneous
system (6.12) is
x
h
(t) = Qe
Λt
Q
−1
x(0).
If A is normal, that is, if it has n orthonormal eigenvectors q
1
, . . . q
n
, then Q is replaced by the Hermitian matrix
S =
_
s
1
· · · s
n
¸
, Q
−1
is replaced by S
H
, and the solution to (6.12) becomes
x
h
(t) = Se
Λt
S
H
x(0).
6.3.2 A Has n Coincident Eigenvalues
When A = QΛQ
−1
, we derived that the solution to (6.12) is x
h
(t) = Qe
Λt
Q
−1
x(0). Comparing with (6.6), it should
be the case that
e
Q(Λt)Q
−1
= Qe
Λt
Q
−1
.
This follows easily from the deﬁnition of e
Z
and the fact that (Q(Λt)Q
−1
)
j
= Q(Λt)
j
Q
−1
. Similarly, if A = SΛS
H
,
where S is Hermitian, then the solution to (6.12) is x
h
(t) = Se
Λt
S
H
x(0), and
e
S(Λt)S
H
= Se
Λt
S
H
.
How can we compute the matrix exponential in the extreme case in which A has n coincident eigenvalues, regard
less of the number of its linearly independent eigenvectors? In any case, A admits a Schur decomposition
A = STS
H
(theorem 5.3.2). We recall that S is a unitary matrix and T is upper triangular with the eigenvalues of Aon its diagonal.
Thus we can write T as
T = Λ +N,
where Λ is diagonal and N is strictly upper triangular. The solution (6.6) in this case becomes
x
h
(t) = e
S(Tt)S
H
x(0) = Se
Tt
S
H
x(0) = Se
Λt+Nt
S
H
x(0).
74 CHAPTER 6. ORDINARY DIFFERENTIAL SYSTEMS
Thus we can compute (6.6) if we can compute e
Tt
= e
Λt+Nt
. This turns out to be almost as easy as computing e
Λt
when the diagonal matrix Λ is a multiple of the identity matrix:
Λ = λI
that is, when all the eigenvalues of A coincide. In fact, in this case, Λt and Nt commute:
Λt Nt = λIt Nt = λt Nt = Nt λt = Nt λIt = Nt Λt .
It can be shown that if two matrices Z
1
and Z
2
commute, that is if
Z
1
Z
2
= Z
2
Z
1
,
then
e
Z1+Z2
= e
Z1
e
Z2
= e
Z2
e
Z1
.
Thus, in our case, we can write
e
Λt+Nt
= e
Λt
e
Nt
.
We already know how to compute e
Λt
, so it remains to show how to compute e
Nt
. The fact that Nt is strictly upper
triangular makes the computation of this matrix exponential much simpler than for a general matrix Z.
Suppose, for example, that N is 4 ×4. Then N has three nonzero superdiagonals, N
2
has two nonzero superdiag
onals, N
3
has one nonzero superdiagonal, and N
4
is the zero matrix:
N =
_
¸
¸
_
0 ∗ ∗ ∗
0 0 ∗ ∗
0 0 0 ∗
0 0 0 0
_
¸
¸
_
→N
2
=
_
¸
¸
_
0 0 ∗ ∗
0 0 0 ∗
0 0 0 0
0 0 0 0
_
¸
¸
_
→
N
3
=
_
¸
¸
_
0 0 0 ∗
0 0 0 0
0 0 0 0
0 0 0 0
_
¸
¸
_
→N
4
=
_
¸
¸
_
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
_
¸
¸
_
.
In general, for a strictly upper triangular n ×n matrix, we have N
j
= 0 for all j ≥ n (i.e., N is nilpotent of order n).
Therefore,
e
Nt
=
∞
j=0
N
j
t
j
j!
=
n−1
j=0
N
j
t
j
j!
is simply a ﬁnite sum, and the exponential reduces to a matrix polynomial.
In summary, the general solution to the homogeneous differential system (6.12) with initial value (6.13) when the
n ×n matrix A has n coincident eigenvalues is given by
x
h
(t) = Se
Λt
n−1
j=0
N
j
t
j
j!
S
H
x
0
(6.17)
where
A = S(Λ +N)S
H
is the Schur decomposition of A,
Λ = λI
is a multiple of the identity matrix containing the coincident eigenvalues of A on its diagonal, and N is strictly upper
triangular.
6.3. STRUCTURE OF THE SOLUTION 75
6.3.3 The General Case
We are now ready to solve the linear differential system
˙ x = Ax + b(t) (6.18)
x(0) = x
0
(6.19)
in the general case of a constant matrix A, defective or not, with arbitrary b(t). In fact, let A = STS
H
be the Schur
decomposition of A, and consider the transformed system
˙ y(t) = Ty(t) + c(t) (6.20)
where
y(t) = S
H
x(t) and c(t) = S
H
b(t) . (6.21)
The triangular matrix T can always be written in the following form:
T =
_
¸
¸
¸
_
T
11
· · · · · · T
1k
0 T
22
· · · T
2k
.
.
.
.
.
.
.
.
.
0 · · · 0 T
kk
_
¸
¸
¸
_
where the diagonal blocks T
ii
for i = 1, . . . , k are of size n
i
×n
i
(possibly 1×1) and contain allcoincident eigenvalues.
The remaining nonzero blocks T
ij
with i < j can be in turn bundled into matrices
R
i
=
_
T
i,i+1
· · · T
i,k
¸
that contain everything to the right of the corresponding T
ii
. The vector c(t) can be partitioned correspondingly as
follows
c(t) =
_
¸
_
c
1
(t)
.
.
.
c
k
(t)
_
¸
_
where c
i
has n
i
entries, and the same can be done for
y(t) =
_
¸
_
y
1
(t)
.
.
.
y
k
(t)
_
¸
_
and for the initial values
y(0) =
_
¸
_
y
1
(0)
.
.
.
y
k
(0)
_
¸
_ .
The triangular system (6.20) can then be solved by backsubstitution as follows:
for i = k down to 1
if i < k
d
i
(t) = R
i
[y
i+1
(t), . . . , y
k
(t)]
T
else
d
i
(t) = 0 (an n
k
dimensional vector of zeros)
end
T
ii
= λ
i
I +N
i
(diagonal and strictly uppertriangular part of T
ii
)
y
i
(t) = e
λ
i
It
n
i
−1
j=0
N
j
i
t
j
j!
y
i
(0) +
_
t
0
_
e
λ
i
I(t−s)
n
i
−1
j=0
N
j
i
(t−s)
j
j!
_
(c
i
(s) + d
i
(s)) ds
end.
76 CHAPTER 6. ORDINARY DIFFERENTIAL SYSTEMS
In this procedure, the expression for y
i
(t) is a direct application of equations (6.9), (6.10), (6.11), and (6.17) with
S = I. In the general case, the applicability of this routine depends on whether the integral in the expression for y
i
(t)
can be computed analytically. This is certainly the case when b(t) is a constant vector b, because then the integrand is
a linear combination of exponentials multiplied by polynomials in t −s, which can be integrated by parts.
The solution x(t) for the original system (6.18) is then
x(t) = Sy(t) .
As an illustration, we consider a very small example, the 2 ×2 homogeneous, triangular case,
_
˙ y
1
˙ y
2
_
=
_
t
11
t
12
0 t
22
_ _
y
1
y
2
_
. (6.22)
When t
11
= t
22
= λ, we obtain
y(t) = e
λIt
_
1 t
12
t
0 1
_
y(0) .
In scalar form, this becomes
y
1
(t) = (y
1
(0) +t
12
y
2
(0) t) e
λt
y
2
(t) = y
2
(0) e
λt
,
and it is easy to verify that this solution satisﬁes the differential system (6.22).
When t
11
= λ
1
= t
22
= λ
2
, we could solve the system by ﬁnding the eigenvectors of T, since we know that in this
case two linearly independent eigenvectors exist (theorem 5.1.1). Instead, we apply the backsubstitution procedure
introduced in this section. The second equation of the system,
˙ y
2
(t) = t
22
y
2
has solution
y
2
(t) = y
2
(0) e
λ
2
t
.
We then have
d
1
(t) = t
12
y
2
(t) = t
12
y
2
(0) e
λ2t
and
y
1
(t) = y
1
(0)e
λ1t
+
_
t
0
e
λ1(t−s)
d
1
(s) ds
= y
1
(0)e
λ1t
+t
12
y
2
(0) e
λ1t
_
t
0
e
−λ1s
e
λ2s
ds
= y
1
(0)e
λ1t
+t
12
y
2
(0) e
λ1t
_
t
0
e
(λ2−λ1)s
ds
= y
1
(0)e
λ1t
+
t
12
y
2
(0)
λ
2
−λ
1
e
λ1t
(e
(λ2−λ1)t
−1)
= y
1
(0)e
λ
1
t
+
t
12
y
2
(0)
λ
2
−λ
1
(e
λ
2
t
−e
λ
1
t
)
Exercise: verify that this solution satisﬁes both the differential equation (6.22) and the initial value equation y(0) = y
0
.
Thus, the solutions to system (6.22) for t
11
= t
22
and for t
11
= t
22
have different forms. While y
2
(t) is the same
in both cases, we have
y
1
(t) = y
1
(0) e
λt
+t
12
y
2
(0) t e
λt
if t
11
= t
22
y
1
(t) = y
1
(0)e
λ1t
+
t
12
y
2
(0)
λ
2
−λ
1
(e
λ2t
−e
λ1t
) if t
11
= t
22
.
6.4. A CONCRETE EXAMPLE 77
rest position
of mass 1
rest position
of mass 2
2
v
1
v
1
2
2
1
3
Figure 6.1: A system of masses and springs. In the absence of external forces, the two masses would assume the
positions indicated by the dashed lines.
This would seem to present numerical difﬁculties when t
11
≈ t
22
, because the solution would suddenly switch
from one form to the other as the difference between t
11
and t
22
changes from about zero to exactly zero or viceversa.
This, however, is not a problem. In fact,
lim
λ
1
→λ
e
λt
−e
λ
1
t
λ −λ
1
= t e
λt
,
and the transition between the two cases is smooth.
6.4 A Concrete Example
In this section we set up and solve a more concrete example of a system of differential equations. The initial system
has two secondorder equations, and is transformed into a ﬁrstorder system with four equations. The 4 ×4 matrix of
the resulting system has an interesting structure, which allows ﬁnding eigenvalues and eigenvectors analytically with a
little trick. The point of this section is to show how to transform the complex formal solution of the differential system,
computed with any of the methods described above, into a real solution in a form appropriate to the problem at hand.
Consider the mechanical system in ﬁgure 6.1. Suppose that we want to study the evolution of the system over time.
Since forces are proportional to accelerations, because of Newton’s law, and since accelerations are second derivatives
of position, the new equations are differential. Because differentiation occurs only with respect to one variable, time,
these are ordinary differential equations, as opposed to partial.
In the following we write the differential equations that describe this system. Two linear differential equations of
the second order
4
result. We will then transform these into four linear differential equations of the ﬁrst order.
By Hooke’s law, the three springs exert forces that are proportional to the springs’ elongations:
f
1
= c
1
v
1
f
2
= c
2
(v
2
−v
1
)
4
Recall that the order of a differential equation is the highest degree of derivative that appears in it.
78 CHAPTER 6. ORDINARY DIFFERENTIAL SYSTEMS
f
3
= −c
3
v
2
where the c
i
are the positive spring constants (in newtons per meter).
The accelerations of masses 1 and 2 (springs are assumed to be massless) are proportional to their accelerations,
according to Newton’s second law:
m
1
¨ v
1
= −f
1
+f
2
= −c
1
v
1
+c
2
(v
2
−v
1
) = −(c
1
+c
2
)v
1
+c
2
v
2
m
2
¨ v
2
= −f
2
+f
3
= −c
2
(v
2
−v
1
) −c
3
v
2
= c
2
v
1
−(c
2
+c
3
)v
2
or, in matrix form,
¨v = Bv (6.23)
where
v =
_
v
1
v
2
_
and B =
_
−
c
1
+c
2
m
1
c
2
m
1
c
2
m
2
−
c
2
+c
3
m
2
_
.
We also assume that initial conditions
v(0) and ˙ v(0) (6.24)
are given, which specify positions and velocities of the two masses at time t = 0.
To solve the secondorder system (6.23), we will ﬁrst transform it to a system of four ﬁrstorder equations. As
shown in the introduction to this chapter, the trick is to introduce variables to denote the ﬁrstorder derivatives of v, so
that secondorder derivatives of v are ﬁrstorder derivatives of the new variables. For uniformity, we deﬁne four new
variables
u =
_
¸
¸
_
u
1
u
2
u
3
u
4
_
¸
¸
_
=
_
¸
¸
_
v
1
v
2
˙ v
1
˙ v
2
_
¸
¸
_
(6.25)
so that
u
3
= ˙ v
1
and u
4
= ˙ v
2
,
while the original system (6.23) becomes
_
˙ u
3
˙ u
4
_
= B
_
u
1
u
2
_
.
We can now gather these four ﬁrstorder differential equations into a single system as follows:
˙ u = Au (6.26)
where
A =
_
¸
¸
_
0 0
0 0
1 0
0 1
B
0 0
0 0
_
¸
¸
_
.
Likewise, the initial conditions (6.24) are replaced by the (known) vector
u(0) =
_
v(0)
˙ v(0)
_
.
In the next section we solve equation (6.26).
6.5. SOLUTION OF THE EXAMPLE 79
6.5 Solution of the Example
Not all matrices have a full set of linearly independent eigenvectors. With the system of springs in ﬁgure 6.1, however,
we are lucky. The eigenvalues of A are solutions to the equation
Ax = λx , (6.27)
where we recall that
A =
_
0 I
B 0
_
and B =
_
−
c
1
+c
2
m1
c
2
m1
c
2
m2
−
c
2
+c
3
m2
_
.
Here, the zeros in A are 2 ×2 matrices of zeros, and I is the 2 ×2 identity matrix. If we partition the vector x into its
upper and lower halves y and z,
x =
_
y
z
_
,
we can write
Ax =
_
0 I
B 0
_ _
y
z
_
=
_
z
By
_
so that the eigenvalue equation (6.27) can be written as the following pair of equations:
z = λy (6.28)
By = λz ,
which yields
By = µy with µ = λ
2
.
In other words, the eigenvalues of A are the square roots of the eigenvalues of B: if we denote the two eigenvalues of
B as µ
1
and µ
2
, then the eigenvalues of A are
λ
1
=
√
µ
1
λ
2
= −
√
µ
1
λ
3
=
√
µ
2
λ
4
= −
√
µ
2
.
The eigenvalues µ
1
and µ
2
of B are the solutions of
det(B −µI) =
_
c
1
+c
2
m
1
+µ
__
c
2
+c
3
m
2
+µ
_
−
c
2
2
m
1
m
2
= µ
2
+ 2αµ +β = 0
where
α =
1
2
_
c
1
+c
2
m
1
+
c
2
+c
3
m
2
_
and β =
c
1
c
2
+c
1
c
3
+c
2
c
3
m
1
m
2
are positive constants that depend on the elastic properties of the springs and on the masses. We then obtain
µ
1,2
= −α ±γ ,
where
γ =
_
α
2
−β =
¸
1
4
_
c
1
+c
2
m
1
−
c
2
+c
3
m
2
_
2
+
c
2
2
m
1
m
2
.
The constant γ is real because the radicand is nonnegative. We also have that α ≥ γ, so that the two solutions µ
1,2
are
real and negative, and the four eigenvalues of A,
λ
1
=
√
−α +γ , λ
2
= −
√
−α +γ , (6.29)
λ
3
=
√
−α −γ , λ
4
= −
√
−α −γ (6.30)
80 CHAPTER 6. ORDINARY DIFFERENTIAL SYSTEMS
come in nonreal, complexconjugate pairs. This is to be expected, since our system of springs obviously exhibits an
oscillatory behavior.
Also the eigenvectors of A can be derived from those of B. In fact, from equation (6.28) we see that if y is an
eigenvector of B corresponding to eigenvalue µ = λ
2
, then there are two corresponding eigenvectors for A of the
form
x =
_
y
±λy
_
. (6.31)
The eigenvectors of B are the solutions of
(B −(−α ±γ)I)y = 0 . (6.32)
Since ±(−α±γ) are eigenvalues of B, the determinant of this equation is zero, and the two scalar equations in (6.32)
must be linearly dependent. The ﬁrst equation reads
−
_
c
1
+c
2
m
1
−α ±γ
_
y
1
+
c
2
m
1
y
2
= 0
and is obviously satisﬁed by any vector of the form
y = k
_
c
2
m
1
c
1
+c
2
m
1
−α ±γ
_
where k is an arbitrary constant. For k = 0, y denotes the two eigenvectors of B, and from equation (6.31) the four
eigenvectors of A are proportional to the four columns of the following matrix:
Q =
_
¸
¸
_
c2
m1
c2
m1
c2
m1
c2
m1
a +λ
2
1
a +λ
2
2
a +λ
2
3
a +λ
2
4
λ
1
c2
m
1
λ
2
c2
m
1
λ
3
c2
m
1
λ
4
c2
m
1
λ
1
_
a +λ
2
1
_
λ
2
_
a +λ
2
2
_
λ
3
_
a +λ
2
3
_
λ
4
_
a +λ
2
4
_
_
¸
¸
_
(6.33)
where
a =
c
1
+c
2
m
1
.
The general solution to the ﬁrstorder differential system (6.26) is then given by equation (6.17). Since we just found
four distinct eigenvectors, however, we can write more simply
u(t) = Qe
Λt
Q
−1
u(0) (6.34)
where
Λ =
_
¸
¸
_
λ
1
0 0 0
0 λ
2
0 0
0 0 λ
3
0
0 0 0 λ
4
_
¸
¸
_
.
In these expressions, the values of λ
i
are given in equations (6.30), and Q is in equation (6.33).
Finally, the solution to the original, secondorder system (6.23) can be obtained from equation (6.25) by noticing
that v is equal to the ﬁrst two components of u.
This completes the solution of our system of differential equations. However, it may be useful to add some
algebraic manipulation in order to show that the solution is indeed oscillatory. As we see in the following, the masses’
motions can be described by the superposition of two sinusoids whose frequencies depend on the physical constants
involved (masses and spring constants). The amplitudes and phases of the sinusoids, on the other hand, depend on the
initial conditions.
To simplify our manipulation, we note that
u(t) = Qe
Λt
w ,
6.5. SOLUTION OF THE EXAMPLE 81
where we deﬁned
w = Q
−1
u(0) . (6.35)
We now leave the constants in w unspeciﬁed, and derive the general solution v(t) for the original, secondorder
problem. Numerical values for the constants can be found from the initial conditions u(0) by equation (6.35). We
have
v(t) = Q(1 : 2, :)e
Λt
w ,
where Q(1 : 2, :) denotes the ﬁrst two rows of Q. Since
λ
2
= −λ
1
and λ
4
= −λ
3
(see equations (6.30)), we have
Q(1 : 2, :) =
_
q
1
q
1
q
2
q
2
¸
where we deﬁned
q
1
=
_
c
2
m1
c
1
+c
2
m1
+λ
2
1
_
and q
2
=
_
c
2
m1
c
1
+c
2
m1
+λ
2
3
_
.
Thus, we can write
v(t) = q
1
_
k
1
e
λ
1
t
+k
2
e
−λ
1
t
_
+ q
2
_
k
3
e
λ
3
t
+k
4
e
−λ
3
t
_
.
Since the λs are imaginary but v(t) is real, the k
i
must come in complexconjugate pairs:
k
1
= k
∗
2
and k
3
= k
∗
4
. (6.36)
In fact, we have
v(0) = q
1
(k
1
+k
2
) + q
2
(k
3
+k
4
)
and from the derivative
˙ v(t) = q
1
λ
1
_
k
1
e
λ1t
−k
2
e
−λ1t
_
+ q
2
λ
3
_
k
3
e
λ3t
−k
4
e
−λ3t
_
we obtain
˙ v(0) = q
1
λ
1
(k
1
−k
2
) + q
2
λ
3
(k
3
−k
4
) .
Since the vectors q
i
are independent (assuming that the mass c
2
is nonzero), this means that
k
1
+k
2
is real k
1
−k
2
is purely imaginary
k
3
+k
4
is real k
3
−k
4
is purely imaginary ,
from which equations (6.36) follow.
Finally, by using the relation
e
jx
+e
−jx
2
= cos x ,
and simple trigonometry we obtain
v(t) = q
1
A
1
cos(ω
1
t +φ
1
) + q
2
A
2
cos(ω
2
t +φ
2
)
where
ω
1
=
√
α −γ =
¸
¸
¸
_
1
2
(a +b) −
¸
1
4
(a −b)
2
+
c
2
2
m
1
m
2
ω
2
=
√
α +γ =
¸
¸
¸
_
1
2
(a +b) +
¸
1
4
(a −b)
2
+
c
2
2
m
1
m
2
82 CHAPTER 6. ORDINARY DIFFERENTIAL SYSTEMS
and
a =
c
1
+c
2
m
1
, b =
c
2
+c
3
m
2
.
Notice that these two frequencies depend only on the conﬁguration of the system, and not on the initial conditions.
The amplitudes A
i
and phases φ
i
, on the other hand, depend on the constants k
i
as follows:
A
1
= 2k
1
 , A
2
= 2k
3

φ
1
= arctan
2
(Im(k
1
), Re(k
1
)) φ
2
= arctan
2
(Im(k
3
), Re(k
3
))
where Re, Im denote the real and imaginary part and where the twoargument function arctan
2
is deﬁned as follows
for (x, y) = (0, 0)
arctan
2
(y, x) =
_
¸
¸
_
¸
¸
_
arctan(
y
x
) if x > 0
π + arctan(
y
x
) if x < 0
π
2
if x = 0 and y > 0
−
π
2
if x = 0 and y < 0
and is undeﬁned for (x, y) = (0, 0). This function returns the arctangent of y/x (notice the order of the arguments) in
the proper quadrant, and extends the function by continuity along the y axis.
The two constants k
1
and k
3
can be found from the given initial conditions v(0) and ˙ v(0) from equations (6.35)
and (6.25).
Chapter 7
Stochastic State Estimation
Perhaps the most important part of studying a problem in robotics or vision, as well as in most other sciences, is to
determine a good model for the phenomena and events that are involved. For instance, studying manipulation requires
deﬁning models for how a robot arm can move and for how it interacts with the world. Analyzing image motion
implies deﬁning models for how points move in space and how this motion projects onto the image. When motion
is involved, as is very often the case, models take on frequently the form of dynamic systems. A dynamic system
is a mathematical description of a quantity that evolves over time. The theory of dynamic systems is both rich and
fascinating. Although in this chapter we will barely scratch its surface, we will consider one of its most popular and
useful aspects, the theory of state estimation, in the particular form of Kalman ﬁltering. To this purpose, an informal
deﬁnition of a dynamic system is given in the next section. The deﬁnition is then illustrated by setting up the dynamic
system equations for a simple but realistic application, that of modeling the trajectory of an enemy mortar shell. In
sections 7.3 through 7.5, we will develop the theory of the Kalman ﬁlter, and in section 7.6 we will see that the shell
can be shot down before it hits us. As discussed in section 7.7, Kalman ﬁltering has intimate connections with the
theory of algebraic linear systems we have developed in chapters 2 and 3.
7.1 Dynamic Systems
In its most general meaning, the term system refers to some physical entity on which some action is performed by
means of an input u. The system reacts to this input and produces an output y (see ﬁgure 7.1).
A dynamic system is a system whose phenomena occur over time. One often says that a system evolves over time.
Simple examples of a dynamic system are the following:
• An electric circuit, whose input is the current in a given branch and whose output is a voltage across a pair of
nodes.
• A chemical reactor, whose inputs are the external temperature, the temperature of the gas being supplied, and
the supply rate of the gas. The output can be the temperature of the reaction product.
• A mass suspended from a spring. The input is the force applied to the mass and the output is the position of the
mass.
input output system
S u y
Figure 7.1: A general system.
83
84 CHAPTER 7. STOCHASTIC STATE ESTIMATION
In all these examples, what is input and what is output is a choice that depends on the application. Also, all the
quantities in the examples vary continuously with time. In other cases, as for instance for switching networks and
computers, it is more natural to consider time as a discrete variable. If time varies continuously, the system is said to
be continuous; if time varies discretely, the system is said to be discrete.
7.1.1 State
Given a dynamic system, continuous or discrete, the modeling problem is to somehow correlate inputs (causes) with
outputs (effects). The examples above suggest that the output at time t cannot be determined in general by the value
assumed by the input quantity at the same point in time. Rather, the output is the result of the entire history of the
system. An effort of abstraction is therefore required, which leads to postulating a new quantity, called the state, which
summarizes information about the past and the present of the system. Speciﬁcally, the value x(t) taken by the state at
time t must be sufﬁcient to determine the output at the same point in time. Also, knowledge of both x(t
1
) and u
[t
1
,t
2
)
,
that is, of the state at time t
1
and the input over the interval t
1
≤ t < t
2
, must allow computing the state (and hence
the output) at time t
2
. For the mass attached to a spring, for instance, the state could be the position and velocity of
the mass. In fact, the laws of classical mechanics allow computing the new position and velocity of the mass at time t
2
given its position and velocity at time t
1
and the forces applied over the interval [t
1
, t
2
). Furthermore, in this example,
the output y of the system happens to coincide with one of the two state variables, and is therefore always deducible
from the latter.
Thus, in a dynamic system the input affects the state, and the output is a function of the state. For a discrete
system, the way that the input changes the state at time instant number k into the new state at time instant k + 1 can
be represented by a simple equation:
x
k+1
= f(x
k
, u
k
, k)
where f is some function that represents the change, and u
k
is the input at time k. Similarly, the relation between state
and output can be expressed by another function:
y
k
= h(x
k
, k) .
A discrete dynamic system is completely described by these two equations and an initial state x
0
. In general, all
quantities are vectors.
For continuous systems, time does not come in quanta, so one cannot compute x
k+1
as a function of x
k
, u
k
, and
k, but rather compute x(t
2
) as a functional φ of x(t
1
) and the entire input u over the interval [t
1
, t
2
):
x(t
2
) = φ(x(t
1
), u(·), t
1
, t
2
)
where u(·) represents the entire function u, not just one of its values. A description of the system in terms of func
tions, rather than functionals, can be given in the case of a regular system, for which the functional φ is continuous,
differentiable, and with continuous ﬁrst derivative. In that case, one can show that there exists a function f such that
the state x(t) of the system satisﬁes the differential equation
˙ x(t) = f(x(t), u(t), t)
where the dot denotes differentiation with respect to time. The relation from state to output, on the other hand, is
essentially the same as for the discrete case:
y(t) = h(x(t), t) .
Specifying the initial state x
0
completes the deﬁnition of a continuous dynamic system.
7.1.2 Uncertainty
The systems deﬁned in the previous section are called deterministic, since the evolution is exactly determined once
the initial state x at time 0 is known. Determinism implies that both the evolution function f and the output function
h are known exactly. This is, however, an unrealistic state of affairs. In practice, the laws that govern a given physical
7.2. AN EXAMPLE: THE MORTAR SHELL 85
system are known up to some uncertainty. In fact, the equations themselves are simple abstractions of a complex
reality. The coefﬁcients that appear in the equations are known only approximately, and can change over time as a
result of temperature changes, component wear, and so forth. A more realistic model then allows for some inherent,
unresolvable uncertainty in both f and h. This uncertainty can be represented as noise that perturbs the equations we
have presented so far. A discrete system then takes on the following form:
x
k+1
= f(x
k
, u
k
, k) +η
k
y
k
= h(x
k
, k) +ξ
k
and for a continuous system
˙ x(t) = f(x(t), u(t), t) +η(t)
y(t) = h(x(t), t) +ξ(t) .
Without loss of generality, the noise distributions can be assumed to have zero mean, for otherwise the mean can be
incorporated into the deterministic part, that is, in either f or h. The mean may not be known, but this is a different
story: in general the parameters that enter into the deﬁnitions of f and h must be estimated by some method, and the
mean perturbations are no different.
A common assumption, which is sometimes valid and always simpliﬁes the mathematics, is that η and ξ are
zeromean Gaussian random variables with known covariance matrices Q and R, respectively.
7.1.3 Linearity
The mathematics becomes particularly simple when both the evolution function f and the output function h are linear.
Then, the system equations become
x
k+1
= F
k
x
k
+G
k
u
k
+η
k
y
k
= H
k
x
k
+ξ
k
for the discrete case, and
˙ x(t) = F(t)x(t) +G(t)u(t) +η(t)
y(t) = H(t)x(t) +ξ(t)
for the continuous one. It is useful to specify the sizes of the matrices involved. We assume that the input u is a vector
in R
p
, the state x is in R
n
, and the output y is in R
m
. Then, the state propagation matrix F is n ×n, the input matrix
G is n × p, and the output matrix H is m × n. The covariance matrix Q of the system noise η is n × n, and the
covariance matrix of the output noise ξ is m×m.
7.2 An Example: the Mortar Shell
In this section, the example of the mortar shell will be discussed in order to see some of the technical issues involved
in setting up the equations of a dynamic system. In particular, we consider discretization issues because the physical
system is itself continuous, but we choose to model it as a discrete system for easier implementation on a computer.
In sections 7.3 through 7.5, we consider the state estimation problem: given observations of the output y over an
interval of time, we want to determine the state x of the system. This is a very important task. For instance, in the case
of the mortar shell, the state is the (initially unknown) position and velocity of the shell, while the output is a set of
observations made by a tracking system. Estimating the state then leads to enough knowledge about the shell to allow
driving an antiaircraft gun to shoot the shell down in midﬂight.
You spotted an enemy mortar installation about thirty kilometers away, on a hill that looks about 0.5 kilometers
higher than your own position. You want to track incoming projectiles with a Kalman ﬁlter so you can aim your guns
86 CHAPTER 7. STOCHASTIC STATE ESTIMATION
accurately. You do not know the initial velocity of the projectiles, so you just guess some values: 0.6 kilometers/second
for the horizontal component, 0.1 kilometers/second for the vertical component. Thus, your estimate of the initial state
of the projectile is
ˆx
0
=
_
¸
¸
_
˙
d
d
˙ z
z
_
¸
¸
_
=
_
¸
¸
_
−0.6
30
0.1
0.5
_
¸
¸
_
where d is the horizontal coordinate, z is the vertical, you are at (0, 0), and dots denote derivatives with respect to
time.
From your highschool physics, you remember that the laws of motion for a ballistic trajectory are the following:
d(t) = d(0) +
˙
d(0)t (7.1)
z(t) = z(0) + ˙ z(0)t −
1
2
gt
2
(7.2)
where g is the gravitational acceleration, equal to 9.8 × 10
−3
kilometers per second squared. Since you do not trust
your physics much, and you have little time to get ready, you decide to ignore air drag. Because of this, you introduce
a state update covariance matrix Q = 0.1I
4
, where I
4
is the 4 ×4 identity matrix.
All you have to track the shells is a camera pointed at the mortar that will rotate so as to keep the projectile at the
center of the image, where you see a blob that increases in size as the projectile gets closer. Thus, the aiming angle of
the camera gives you elevation information about the projectile’s position, and the size of the blob tells you something
about the distance, given that you know the actual size of the projectiles used and all the camera parameters. The
projectile’s elevation is
e = 1000
z
d
(7.3)
when the projectile is at (d, z). Similarly, the size of the blob in pixels is
s =
1000
√
d
2
+z
2
. (7.4)
You do not have very precise estimates of the noise that corrupts e and s, so you guess measurement covariances
R
e
= R
s
= 1000, which you put along the diagonal of a 2 ×2 diagonal measurement covariance matrix R.
7.2.1 The Dynamic System Equation
Equations (7.1) and (7.2) are continuous. Since you are taking measurements every dt = 0.2 seconds, you want to
discretize these equations. For the z component, equation (7.2) yields
z(t +dt) −z(t) = z(0) + ˙ z(0)(t +dt) −
1
2
g(t +dt)
2
−
_
z(0) + ˙ z(0)t −
1
2
gt
2
_
= ( ˙ z(0) −gt)dt −
1
2
g(dt)
2
= ˙ z(t)dt −
1
2
g(dt)
2
,
since ˙ z(0) −gt = ˙ z(t).
Consequently, if t +dt is time instant k + 1 and t is time instant k, you have
z
k+1
= z
k
+ ˙ z
k
dt −
1
2
g(dt)
2
. (7.5)
The reasoning for the horizontal component d is the same, except that there is no acceleration:
d
k+1
= d
k
+
˙
d
k
dt . (7.6)
7.2. AN EXAMPLE: THE MORTAR SHELL 87
Equations (7.5) and (7.6) can be rewritten as a single system update equation
x
k+1
= Fx
k
+Gu
where
x
k
=
_
¸
¸
_
˙
d
k
d
k
˙ z
k
z
k
_
¸
¸
_
is the state, the 4 × 4 matrix F depends on dt, the control scalar u is equal to −g, and the 4 × 1 control matrix G
depends on dt. The two matrices F and G are as follows:
F =
_
¸
¸
_
1 0 0 0
dt 1 0 0
0 0 1 0
0 0 dt 1
_
¸
¸
_
G =
_
¸
¸
_
0
0
dt
−dt
2
2
_
¸
¸
_
.
7.2.2 The Measurement Equation
The two nonlinear equations (7.3) and (7.4) express the available measurements as a function of the true values of the
projectile coordinates d and z. We want to replace these equations with linear approximations. To this end, we develop
both equations as Taylor series around the current estimate and truncate them after the linear term. From the elevation
equation (7.3), we have
e
k
= 1000
z
k
d
k
≈ 1000
_
ˆ z
k
ˆ
d
k
+
z
k
− ˆ z
k
ˆ
d
k
−
ˆ z
k
ˆ
d
2
k
(d
k
−
ˆ
d
k
)
_
,
so that after simplifying we can redeﬁne the measurement to be the discrepancy from the estimated value:
e
k
= e
k
−1000
ˆ z
k
ˆ
d
k
≈ 1000(
z
k
ˆ
d
k
−
ˆ z
k
ˆ
d
2
k
d
k
) . (7.7)
We can proceed similarly for equation (7.4):
s
k
=
1000
_
d
2
k
+z
2
k
≈
1000
_
ˆ
d
2
k
+ ˆ z
2
k
−
1000
ˆ
d
k
(
ˆ
d
2
k
+ ˆ z
2
k
)
3/2
(d
k
−
ˆ
d
k
) −
1000ˆ z
k
(
ˆ
d
2
k
+ ˆ z
2
k
)
3/2
(z
k
− ˆ z
k
)
and after simplifying:
s
k
= s
k
−
2000
_
ˆ
d
2
+ ˆ z
2
≈ −1000
_
ˆ
d
k
(
ˆ
d
2
k
+ ˆ z
2
k
)
3/2
d
k
+
ˆ z
k
(
ˆ
d
2
k
+ ˆ z
2
k
)
3/2
z
k
_
. (7.8)
The two measurements s
k
and e
k
just deﬁned can be collected into a single measurement vector
y
k
=
_
s
k
e
k
_
,
and the two approximate measurement equations (7.7) and (7.8) can be written in the matrix form
y
k
= H
k
x
k
(7.9)
where the measurement matrix H
k
depends on the current state estimate ˆx
k
:
H
k
= −1000
_
_
0
ˆ
d
k
(
ˆ
d
2
k
+ˆ z
2
k
)
3/2
0
ˆ z
k
(
ˆ
d
2
k
+ˆ z
2
k
)
3/2
0
ˆ z
k
ˆ
d
2
k
0 −
1
ˆ
d
k
_
_
88 CHAPTER 7. STOCHASTIC STATE ESTIMATION
As the shell approaches us, we frantically start studying state estimation, and in particular Kalman ﬁltering, in the
hope to build a system that lets us shoot down the shell before it hits us. The next few sections will be read under this
impending threat.
Knowing the model for the mortar shell amounts to knowing the laws by which the object moves and those that
relate the position of the projectile to our observations. So what else is there left to do? From the observations, we
would like to know where the mortar shell is right now, and perhaps predict where it will be in a few seconds, so we
can direct an antiaircraft gun to shoot down the target. In other words, we want to know x
k
, the state of the dynamic
system. Clearly, knowing x
0
instead is equivalent, at least when the dynamics of the system are known exactly (the
system noise η
k
is zero). In fact, from x
0
we can simulate the system up until time t, thereby determining x
k
as well.
Most importantly, we do not want to have all the observations before we shoot: we would be dead by then. A scheme
that reﬁnes an initial estimation of the state as new observations are acquired is called a recursive
1
state estimation
system. The Kalman ﬁlter is one of the most versatile schemes for recursive state estimations. The original paper
by Kalman (R. E. Kalman, “A new approach to linear ﬁltering and prediction problems,” Transactions of the ASME
Journal Basic Engineering, 82:34–45, 1960) is still one of the most readable treatments of this subject from the point
of view of stochastic estimation.
Even without noise, a single observation y
k
may not be sufﬁcient to determine the state x
k
(in the example, one
observation happens to be sufﬁcient). This is a very interesting aspect of state estimation. It is really the ensemble of
all observations that let one estimate the state, and yet observations are processed one at a time, as they become avail
able. A classical example of this situation in computer vision is the reconstruction of threedimensional shape from
a sequence of images. A single image is twodimensional, so by itself it conveys no threedimensional information.
Kalman ﬁlters exist that recover shape information froma sequence of images. See for instance L. Matthies, T. Kanade,
and R. Szeliski, “Kalman ﬁlterbased algorithms for estimating depth from image sequences,” International Journal of
Computer Vision, 3(3):209236, September 1989; and T.J. Broida, S. Chandrashekhar, and R. Chellappa, “Recursive
3D motion estimation from a monocular image sequence,” IEEE Transactions on Aerospace and Electronic Systems,
26(4):639–656, July 1990.
Here, we introduce the Kalman ﬁlter from the simpler point of view of least squares estimation, since we have
developed all the necessary tools in the ﬁrst part of this course. The next section deﬁnes the state estimation problem
for a discrete dynamic system in more detail. Then, section 7.4 deﬁnes the essential notions of estimation theory
that are necessary to understand the quantitative aspects of Kalman ﬁltering. Section 7.5 develops the equation of the
Kalman ﬁlter, and section 7.6 reconsiders the example of the mortar shell. Finally, section 7.7 establishes a connection
between the Kalman ﬁlter and the solution of a linear system.
7.3 State Estimation
In this section, the estimation problem is deﬁned in some more detail. Given a discrete dynamic system
x
k+1
= F
k
x
k
+G
k
u
k
+η
k
(7.10)
y
k
= H
k
x
k
+ξ
k
(7.11)
where the system noise η
k
and the measurement noise ξ
k
are Gaussian variables,
η
k
∼ N(0, Q
k
)
ξ
k
∼ N(0, R
k
) ,
as well as a (possibly completely wrong) estimate ˆx
0
of the initial state and an initial covariance matrix P
0
of the
estimate ˆx
0
, the Kalman ﬁlter computes the optimal estimate ˆx
kk
at time k given the measurements y
0
, . . . , y
k
. The
ﬁlter also computes an estimate P
kk
of the covariance of ˆx
kk
given those measurements. In these expressions, the hat
means that the quantity is an estimate. Also, the ﬁrst k in the subscript refers to which variable is being estimated, the
second to which measurements are being used for the estimate. Thus, in general, ˆx
ij
is the estimate of the value that
x assumes at time i given the ﬁrst j + 1 measurements y
0
, . . . , y
j
.
1
The term “recursive” in the systems theory literature corresponds loosely to “incremental” or “iterative” in computer science.
7.3. STATE ESTIMATION 89
k  k1
^
y
H
k
x
k  k1
^
x
^
k  k
k  k
P
x
^
P
k+1  k
k+1  k
propagate propagate
x
^
P
k1  k1
k1  k1
y
k
y
k1
y
k+1
k
update
k  k1
P
k1 k+1
time
Figure 7.2: The update stage of the Kalman ﬁlter changes the estimate of the current system state x
k
to make the
prediction of the measurement closer to the actual measurement y
k
. Propagation then accounts for the evolution of the
system state, as well as the consequent growing uncertainty.
7.3.1 Update
The covariance matrix P
kk
must be computed in order to keep the Kalman ﬁlter running, in the following sense. At
time k, just before the new measurement y
k
comes in, we have an estimate ˆx
kk−1
of the state vector x
k
based on the
previous measurements y
0
, . . . , y
k−1
. Now we face the problem of incorporating the new measurement y
k
into our
estimate, that is, of transforming ˆx
kk−1
into ˆx
kk
. If ˆx
kk−1
were exact, we could compute the new measurement y
k
without even looking at it, through the measurement equation (7.11). Even if ˆx
kk−1
is not exact, the estimate
ˆy
kk−1
= H
k
ˆx
kk−1
is still our best bet. Now y
k
becomes available, and we can consider the residue
r
k
= y
k
−ˆy
kk−1
= y
k
−H
k
ˆx
kk−1
.
If this residue is nonzero, we probably need to correct our estimate of the state x
k
, so that the new prediction
ˆy
kk
= H
k
ˆx
kk
of the measurement value is closer to the measurement y
k
than the old prediction
ˆy
kk−1
= H
k
ˆx
kk−1
that we made just before the new measurement y
k
was available.
The question however is, by how much should we correct our estimate of the state? We do not want to make ˆy
kk
coincide with y
k
. That would mean that we trust the new measurement completely, but that we do not trust our state
estimate ˆx
kk−1
at all, even if the latter was obtained through a large number of previous measurements. Thus, we
need some criterion for comparing the quality of the new measurement y
k
with that of our old estimate ˆx
kk−1
of the
state. The uncertainty about the former is R
k
, the covariance of the observation error. The uncertainty about the state
just before the new measurement y
k
becomes available is P
kk−1
. The update stage of the Kalman ﬁlter uses R
k
and
P
kk−1
to weigh past evidence (ˆx
kk−1
) and new observations (y
k
). This stage is represented graphically in the middle
of ﬁgure 7.2. At the same time, also the uncertainty measure P
kk−1
must be updated, so that it becomes available for
the next step. Because a new measurement has been read, this uncertainty becomes usually smaller: P
kk
< P
kk−1
.
The idea is that as time goes by the uncertainty on the state decreases, while that about the measurements may
remain the same. Then, measurements count less and less as the estimate approaches its true value.
90 CHAPTER 7. STOCHASTIC STATE ESTIMATION
7.3.2 Propagation
Just after arrival of the measurement y
k
, both state estimate and state covariance matrix have been updated as described
above. But between time k and time k + 1 both state and covariance may change. The state changes according to the
system equation (7.10), so our estimate ˆx
k+1k
of x
k+1
given y
0
, . . . , y
k
should reﬂect this change as well. Similarly,
because of the system noise η
k
, our uncertainty about this estimate may be somewhat greater than one time epoch ago.
The system equation (7.10) essentially “dead reckons” the new state from the old, and inaccuracies in our model of
how this happens lead to greater uncertainty. This increase in uncertainty depends on the system noise covariance Q
k
.
Thus, both state estimate and covariance must be propagated to the new time k + 1 to yield the new state estimate
ˆx
k+1k
and the new covariance P
k+1k
. Both these changes are shown on the right in ﬁgure 7.2.
In summary, just as the state vector x
k
represents all the information necessary to describe the evolution of a
deterministic system, the covariance matrix P
kk
contains all the necessary information about the probabilistic part of
the system, that is, about how both the system noise η
k
and the measurement noise ξ
k
corrupt the quality of the state
estimate ˆx
kk
.
Hopefully, this intuitive introduction to Kalman ﬁltering gives you an idea of what the ﬁlter does, and what infor
mation it needs to keep working. To turn these concepts into a quantitative algorithm we need some preliminaries on
optimal estimation, which are discussed in the next section. The Kalman ﬁlter itself is derived in section 7.5.
7.4 BLUE Estimators
In what sense does the Kalman ﬁlter use covariance information to produce better estimates of the state? As we will
se later, the Kalman ﬁlter computes the Best Linear Unbiased Estimate (BLUE) of the state. In this section, we see
what this means, starting with the deﬁnition of a linear estimation problem, and then considering the attributes “best”
and “unbiased” in turn.
7.4.1 Linear Estimation
Given a quantity y (the observation) that is a known function of another (deterministic but unknown) quantity x (the
state) plus some amount of noise,
y = h(x) + n , (7.12)
the estimation problem amounts to ﬁnding a function
ˆx = L(y)
such that ˆx is as close as possible to x. The function L is called an estimator, and its value ˆx given the observations y
is called an estimate. Inverting a function is an example of estimation. If the function h is invertible and the noise term
n is zero, then L is the inverse of h, no matter how the phrase “as close as possible” is interpreted. In fact, in that case
ˆx is equal to x, and any distance between ˆx and x must be zero. In particular, solving a square, nonsingular system
y = Hx (7.13)
is, in this somewhat trivial sense, a problem of estimation. The optimal estimator is then represented by the matrix
L = H
−1
and the optimal estimate is
ˆx = Ly .
A less trivial example occurs, for a linear observation function, when the matrix H has more rows than columns,
so that the system (7.13) is overconstrained. In this case, there is usually no inverse to H, and again one must say in
what sense ˆx is required to be “as close as possible” to x. For linear systems, we have so far considered the criterion
that prefers a particular ˆx if it makes the Euclidean norm of the vector y − Hx as small as possible. This is the
7.4. BLUE ESTIMATORS 91
(unweighted) least squares criterion. In section 7.4.2, we will see that in a very precise sense ordinary least squares
solve a particular type of estimation problem, namely, the estimation problem for the observation equation (7.12) with
h a linear function and n Gaussian zeromean noise with the indentity matrix for covariance.
An estimator is said to be linear if the function L is linear. Notice that the observation function h can still be
nonlinear. If L is required to be linear but h is not, we will probably have an estimator that produces a worse estimate
than a nonlinear one. However, it still makes sense to look for the best possible linear estimator. The best estimator
for a linear observation function happens to be a linear estimator.
7.4.2 Best
In order to deﬁne what is meant by a “best” estimator, one needs to deﬁne a measure of goodness of an estimate. In
the least squares approach to solving a linear system like (7.13), this distance is deﬁned as the Euclidean norm of the
residue vector
y −Hˆx
between the left and the righthand sides of equation (7.13), evaluated at the solution ˆx. Replacing (7.13) by a “noisy
equation”,
y = Hx + n (7.14)
does not change the nature of the problem. Even equation (7.13) has no exact solution when there are more independent
equations than unknowns, so requiring equality is hopeless. What the least squares approach is really saying is that
even at the solution ˆx there is some residue
n = y −Hˆx (7.15)
and we would like to make that residue as small as possible in the sense of the Euclidean norm. Thus, an overcon
strained system of the form (7.13) and its “noisy” version (7.14) are really the same problem. In fact, (7.14) is the
correct version, if the equality sign is to be taken literally.
The noise term, however, can be used to generalize the problem. In fact, the Euclidean norm of the residue (7.15)
treats all components (all equations in (7.14)) equally. In other words, each equation counts the same when computing
the norm of the residue. However, different equations can have noise terms of different variance. This amounts to
saying that we have reasons to prefer the quality of some equations over others or, alternatively, that we want to
enforce different equations to different degrees. From the point of view of least squares, this can be enforced by some
scaling of the entries of n or, even, by some linear transformation of them:
n →Wn
so instead of minimizing n
2
= n
T
n (the square is of course irrelevant when it comes to minimization), we now
minimize
Wn
2
= n
T
R
−1
n
where
R
−1
= W
T
W
is a symmetric, nonnegativedeﬁnite matrix. This minimization problem, called weighted least squares, is only slightly
different from its unweighted version. In fact, we have
Wn = W(y −Hx) = Wy −WHx
so we are simply solving the system
Wy = WHx
in the traditional, “unweighted” sense. We know the solution from normal equations:
ˆx = ((WH)
T
WH)
−1
(WH)
T
Wy = (H
T
R
−1
H)
−1
H
T
R
−1
y .
92 CHAPTER 7. STOCHASTIC STATE ESTIMATION
Interestingly, this same solution is obtained from a completely different criterion of goodness of a solution ˆx. This
criterion is a probabilistic one. We consider this different approach because it will let us show that the Kalman ﬁlter is
optimal in a very useful sense.
The new criterion is the socalled minimumcovariance criterion. The estimate ˆx of x is some function of the
measurements y, which in turn are corrupted by noise. Thus, ˆx is a function of a random vector (noise), and is therefore
a random vector itself. Intuitively, if we estimate the same quantity many times, from measurements corrupted by
different noise samples from the same distribution, we obtain different estimates. In this sense, the estimates are
random.
It makes therefore sense to measure the quality of an estimator by requiring that its variance be as small as possible:
the ﬂuctuations of the estimate ˆx with respect to the true (unknown) value x from one estimation experiment to the
next should be as small as possible. Formally, we want to choose a linear estimator L such that the estimates ˆx = Ly
it produces minimize the following covariance matrix:
P = E[(x −ˆx)(x −ˆx)
T
] .
Minimizing a matrix, however, requires a notion of “size” for matrices: how large is P? Fortunately, most inter
esting matrix norms are equivalent, in the sense that given two different deﬁnitions P
1
and P
2
of matrix norm
there exist two positive scalars α, β such that
αP
1
< P
2
< βP
1
.
Thus, we can pick any norm we like. In fact, in the derivations that follow, we only use properties shared by all norms,
so which norm we actually use is irrelevant. Some matrix norms were mentioned in section 3.2.
7.4.3 Unbiased
In additionto requiring our estimator to be linear and with minimum covariance, we also want it to be unbiased, in the
sense that if repeat the same estimation experiment many times we neither consistently overestimate nor consistently
underestimate x. Mathematically, this translates into the following requirement:
E[x −ˆx] = 0 and E[ˆx] = E[x] .
7.4.4 The BLUE
We now address the problem of ﬁnding the Best Linear Unbiased Estimator (BLUE)
ˆx = Ly
of x given that y depends on x according to the model (7.14), which is repeated here for convenience:
y = Hx + n . (7.16)
First, we give a necessary and sufﬁcient condition for L to be unbiased.
Lemma 7.4.1 Let n in equation (7.16) be zero mean. Then the linear estimator L is unbiased if an only if
LH = I ,
the identity matrix.
Proof.
E[x −ˆx] = E[x −Ly] = E[x −L(Hx + n)]
= E[(I −LH)x] −E[Ln] = (I −HL)E[x]
7.4. BLUE ESTIMATORS 93
since E[Ln] = LE[n] and E[n] = 0. For this to hold for all x we need I −LH = 0. ∆
And now the main result.
Theorem 7.4.2 The Best Linear Unbiased Estimator (BLUE)
ˆx = Ly
for the measurement model
y = Hx + n
where the noise vector n has zero mean and covariance R is given by
L = (H
T
R
−1
H)
−1
H
T
R
−1
and the covariance of the estimate ˆx is
P = E[(x −ˆx)(x −ˆx)
T
] = (H
T
R
−1
H)
−1
. (7.17)
Proof. We can write
P = E[(x −ˆx)(x −ˆx)
T
] = E[(x −Ly)(x −Ly)
T
]
= E[(x −LHx −Ln)(x −LHx −Ln)
T
] = E[((I −LH)x −Ln)((I −LH)x −Ln)
T
]
= E[Lnn
T
L
T
] = LE[nn
T
] L
T
= LRL
T
because L is unbiased, so that LH = I.
To show that
L
0
= (H
T
R
−1
H)
−1
H
T
R
−1
(7.18)
is the best choice, let L be any (other) linear unbiased estimator. We can trivially write
L = L
0
+ (L −L
0
)
and
P = LRL
T
= [L
0
+ (L −L
0
)]R[L
0
+ (L −L
0
)]
T
= L
0
RL
T
0
+ (L −L
0
)RL
T
0
+L
0
R(L −L
0
)
T
+ (L −L
0
)R(L −L
0
)
T
.
From (7.18) we obtain
RL
T
0
= RR
−1
H(H
T
R
−1
H)
−1
= H(H
T
R
−1
H)
−1
so that
(L −L
0
)RL
T
0
= (L −L
0
)H(H
T
R
−1
H)
−1
= (LH −L
0
H)(H
T
R
−1
H)
−1
.
But L and L
0
are unbiased, so LH = L
0
H = I, and
(L −L
0
)RL
T
0
= 0 .
The term L
0
R(L −L
0
)
T
is the transpose of this, so it is zero as well. In conclusion,
P = L
0
RL
T
0
+ (L −L
0
)R(L −L
0
)
T
,
the sum of two positive deﬁnite or at least semideﬁnite matrices. For such matrices, the norm of the sum is greater or
equal to either norm, so this expression is minimized when the second term vanishes, that is, when L = L
0
.
This proves that the estimator given by (7.18) is the best, that is, that it has minimum covariance. To prove that the
covariance P of ˆx is given by equation (7.17), we simply substitute L
0
for L in P = LRL
T
:
P = L
0
RL
T
0
= (H
T
R
−1
H)
−1
H
T
R
−1
RR
−1
H(H
T
R
−1
H)
−1
= (H
T
R
−1
H)
−1
H
T
R
−1
H(H
T
R
−1
H)
−1
= (H
T
R
−1
H)
−1
as promised. ∆
94 CHAPTER 7. STOCHASTIC STATE ESTIMATION
7.5 The Kalman Filter: Derivation
We now have all the components necessary to write the equations for the Kalman ﬁlter. To summarize, given a linear
measurement equation
y = Hx + n
where n is a Gaussian random vector with zero mean and covariance matrix R,
n ∼ N(0, R) ,
the best linear unbiased estimate ˆx of x is
ˆx = PH
T
R
−1
y
where the matrix
P
∆
= E[(ˆx −x)(ˆx −x)
T
] = (H
T
R
−1
H)
−1
is the covariance of the estimation error.
Given a dynamic system with system and measurement equations
x
k+1
= F
k
x
k
+G
k
u
k
+η
k
(7.19)
y
k
= H
k
x
k
+ξ
k
where the system noise η
k
and the measurement noise ξ
k
are Gaussian random vectors,
η
k
∼ N(0, Q
k
)
ξ
k
∼ N(0, R
k
) ,
as well as the best, linear, unbiased estimate ˆx
0
of the initial state with an error covariance matrix P
0
, the Kalman
ﬁlter computes the best, linear, unbiased estimate ˆx
kk
at time k given the measurements y
0
, . . . , y
k
. The ﬁlter also
computes the covariance P
kk
of the error ˆx
kk
−x
k
given those measurements. Computation occurs according to the
phases of update and propagation illustrated in ﬁgure 7.2. We now apply the results from optimal estimation to the
problem of updating and propagating the state estimates and their error covariances.
7.5.1 Update
At time k, two pieces of data are available. One is the estimate ˆx
kk−1
of the state x
k
given measurements up to but not
including y
k
. This estimate comes with its covariance matrix P
kk−1
. Another way of saying this is that the estimate
ˆx
kk−1
differs from the true state x
k
by an error term e
k
whose covariance is P
kk−1
:
ˆx
kk−1
= x
k
+ e
k
(7.20)
with
E[e
k
e
T
k
] = P
kk−1
.
The other piece of data is the new measurement y
k
itself, which is related to the state x
k
by the equation
y
k
= H
k
x
k
+ξ
k
(7.21)
with error covariance
E[ξ
k
ξ
T
k
] = R
k
.
We can summarize this available information by grouping equations 7.20 and 7.21 into one, and packaging the error
covariances into a single, blockdiagonal matrix. Thus, we have
y = Hx
k
+ n
7.5. THE KALMAN FILTER: DERIVATION 95
where
y =
_
ˆx
kk−1
y
k
_
, H =
_
I
H
k
_
, n =
_
e
k
ξ
k
_
,
and where n has covariance
R =
_
P
kk−1
0
0 R
k
_
.
As we know, the solution to this classical estimation problem is
ˆx
kk
= P
kk
H
T
R
−1
y
P
kk
= (H
T
R
−1
H)
−1
.
This pair of equations represents the update stage of the Kalman ﬁlter. These expressions are somewhat wasteful,
because the matrices H and R contain many zeros. For this reason, these two update equations are now rewritten in a
more efﬁcient and more familiar form. We have
P
−1
kk
= H
T
R
−1
H
=
_
I H
T
k
¸
_
P
−1
kk−1
0
0 R
−1
k
_ _
I
H
k
_
= P
−1
kk−1
+H
T
k
R
−1
k
H
k
and
ˆx
kk
= P
kk
H
T
R
−1
y
= P
kk
_
P
−1
kk−1
H
T
k
R
−1
k
_
_
ˆx
kk−1
y
k
_
= P
kk
(P
−1
kk−1
ˆx
kk−1
+H
T
k
R
−1
k
y
k
)
= P
kk
((P
−1
kk
−H
T
k
R
−1
k
H
k
)ˆx
kk−1
+H
T
k
R
−1
k
y
k
)
= ˆx
kk−1
+P
kk
H
T
k
R
−1
k
(y
k
−H
k
ˆx
kk−1
) .
In the last line, the difference
r
k
∆
= y
k
−H
k
ˆx
kk−1
is the residue between the actual measurement y
k
and its best estimate based on ˆx
kk−1
, and the matrix
K
k
∆
= P
kk
H
T
k
R
−1
k
is usually referred to as the Kalman gain matrix, because it speciﬁes the amount by which the residue must be multi
plied (or ampliﬁed) to obtain the correction term that transforms the old estimate ˆx
kk−1
of the state x
k
into its new
estimate ˆx
kk
.
7.5.2 Propagation
Propagation is even simpler. Since the new state is related to the old through the system equation 7.19, and the noise
term η
k
is zero mean, unbiasedness requires
ˆx
k+1k
= F
k
ˆx
kk
+G
k
u
k
,
96 CHAPTER 7. STOCHASTIC STATE ESTIMATION
which is the state estimate propagation equation of the Kalman ﬁlter. The error covariance matrix is easily propagated
thanks to the linearity of the expectation operator:
P
k+1k
= E[(ˆx
k+1k
−x
k+1
)(ˆx
k+1k
−x
k+1
)
T
]
= E[(F
k
(ˆx
kk
−x
k
) −η
k
)(F
k
(ˆx
kk
−x
k
) −η
k
)
T
]
= F
k
E[(ˆx
kk
−x
k
)(ˆx
kk
−x
k
)
T
]F
T
k
+E[η
k
η
T
k
]
= F
k
P
kk
F
T
k
+Q
k
where the system noise η
k
and the previous estimation error ˆx
kk
−x
k
were assumed to be uncorrelated.
7.5.3 Kalman Filter Equations
In summary, the Kalman ﬁlter evolves an initial estimate and an initial error covariance matrix,
ˆx
0−1
∆
= ˆx
0
and P
0−1
∆
= P
0
,
both assumed to be given, by the update equations
ˆx
kk
= ˆx
kk−1
+K
k
(y
k
−H
k
ˆx
kk−1
)
P
−1
kk
= P
−1
kk−1
+H
T
k
R
−1
k
H
k
where the Kalman gain is deﬁned as
K
k
= P
kk
H
T
k
R
−1
k
and by the propagation equations
ˆx
k+1k
= F
k
ˆx
kk
+G
k
u
k
P
k+1k
= F
k
P
kk
F
T
k
+Q
k
.
7.6 Results of the Mortar Shell Experiment
In section 7.2, the dynamic system equations for a mortar shell were set up. Matlab routines available through the
class Web page implement a Kalman ﬁlter (with naive numerics) to estimate the state of that system from simulated
observations. Figure 7.3 shows the true and estimated trajectories. Notice that coincidence of the trajectories does not
imply that the state estimate is uptodate. For this it is also necessary that any given point of the trajectory is reached
by the estimate at the same time instant. Figure 7.4 shows that the distance between estimated and true target position
does indeed converge to zero, and this occurs in time for the shell to be shot down. Figure 7.5 shows the 2norm of the
covariance matrix over time. Notice that the covariance goes to zero only asymptotically.
7.7 Linear Systems and the Kalman Filter
In order to connect the theory of state estimation with what we have learned so far about linear systems, we now show
that estimating the initial state x
0
from the ﬁrst k +1 measurements, that is, obtaining ˆx
0k
, amounts to solving a linear
system of equations with suitable weights for its rows.
The basic recurrence equations (7.10) and (7.11) can be expanded as follows:
y
k
= H
k
x
k
+ξ
k
= H
k
(F
k−1
x
k−1
+G
k−1
u
k−1
+η
k−1
) +ξ
k
= H
k
F
k−1
x
k−1
+H
k
G
k−1
u
k−1
+H
k
η
k−1
+ξ
k
= H
k
F
k−1
(F
k−2
x
k−2
+G
k−2
u
k−2
+η
k−2
) +H
k
G
k−1
u
k−1
+H
k
η
k−1
+ξ
k
7.7. LINEAR SYSTEMS AND THE KALMAN FILTER 97
5 10 15 20 25 30 35
−0.2
0
0.2
0.4
0.6
0.8
1
1.2
1.4
true (dashed) and estimated (solid) missile trajectory
Figure 7.3: The true and estimated trajectories get closer to one another. Trajectories start on the right.
0 5 10 15 20 25 30
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
2
distance between true and estimated missile position vs. time
Figure 7.4: The estimate actually closes in towards the target.
98 CHAPTER 7. STOCHASTIC STATE ESTIMATION
0 5 10 15 20 25 30
0
5
10
15
20
25
30
35
40
norm of the state covariance matrix vs time
Figure 7.5: After an initial increase in uncertainty, the norm of the state covariance matrix converges to zero. Upwards
segments correspond to state propagation, downwards ones to state update.
= H
k
F
k−1
F
k−2
x
k−2
+H
k
(F
k−1
G
k−2
u
k−2
+G
k−1
u
k−1
) +
H
k
(F
k−1
η
k−2
+η
k−1
) +ξ
k
.
.
.
= H
k
F
k−1
. . . F
0
x
0
+H
k
(F
k−1
. . . F
1
G
0
u
0
+. . . +G
k−1
u
k−1
) +
H
k
(F
k−1
. . . F
1
η
0
+. . . +η
k−1
) +ξ
k
or in a more compact form,
y
k
= H
k
Φ(k −1, 0)x
0
+H
k
k
j=1
Φ(k −1, j)G
j−1
u
j−1
+ν
k
(7.22)
where
Φ(l, j) =
_
F
l
. . . F
j
for l ≥ j
1 for l < j
and the term
ν
k
= H
k
k
j=1
Φ(k −1, j)η
j−1
+ξ
k
is noise.
The key thing to notice about this somewhat intimidating expression is that for any k it is a linear system in x
0
, the
initial state of the system. We can write one system like the one in equation (7.22) for every value of k = 0, . . . , K,
where K is the last time instant considered, and we obtain a large system of the form
z
K
= Ψ
K
x
0
+ g
K
+ n
K
(7.23)
where
z
K
=
_
¸
_
y
0
.
.
.
y
K
_
¸
_
7.7. LINEAR SYSTEMS AND THE KALMAN FILTER 99
Ψ
K
=
_
¸
¸
¸
_
H
0
H
1
F
0
.
.
.
H
K
Φ(K −1, 0)
_
¸
¸
¸
_
g
K
=
_
¸
¸
¸
_
0
H
1
G
0
u
0
.
.
.
H
K
(Φ(K −1, 1)G
0
u
0
+. . . + Φ(K −1, K)G
K−1
u
K−1
)
_
¸
¸
¸
_
n
K
=
_
¸
_
ν
0
.
.
.
ν
K
_
¸
_ .
Without knowing anything about the statistics of the noise vector n
K
in equation (7.23), the best we can do is to
solve the system
z
K
= Ψ
K
x
0
+ g
K
in the sense of least squares, to obtain an estimate of x
0
from the measurements y
0
, . . . , y
K
:
ˆx
0K
= Ψ
†
K
(z
K
−g
K
)
where Ψ
†
K
is the pseudoinverse of Ψ
K
. We know that if Ψ
K
has full rank, the result with the pseudoinverse is the
same as we would obtain by solving the normal equations, so that
Ψ
†
K
= (Ψ
T
K
Ψ
K
)
−1
Ψ
T
K
.
The least square solution to system (7.23) minimizes the residue between the left and the righthand side under the
assumption that all equations are to be treated the same way. This is equivalent to assuming that all the noise terms in
n
K
are equally important. However, we know the covariance matrices of all these noise terms, so we ought to be able
to do better, and weigh each equation to keep these covariances into account. Intuitively, a small covariance means that
we believe in that measurement, and therefore in that equation, which should consequently be weighed more heavily
than others. The quantitative embodiment of this intuitive idea is at the core of the Kalman ﬁlter.
In summary, the Kalman ﬁlter for a linear system has been shown to be equivalent to a linear equation solver, under
the assumption that the noise that affects each of the equations has the same probability distribution, that is, that all
the noise terms in n
K
in equation 7.23 are equally important. However, the Kalman ﬁlter differs from a linear solver
in the following important respects:
1. The noise terms in n
K
in equation 7.23 are not equally important. Measurements come with covariance matrices,
and the Kalman ﬁlter makes optimal use of this information for a proper weighting of each of the scalar equations
in (7.23). Better information ought to yield more accurate results, and this is in fact the case.
2. The system (7.23) is not solved all at once. Rather, an initial solution is reﬁned over time as new measurements
become available. The ﬁnal solution can be proven to be exactly equal to solving system (7.23) all at once.
However, having better and better approximations to the solution as new data come in is much preferable in a
dynamic setting, where one cannot in general wait for all the data to be collected. In some applications, data my
never stop arriving.
3. A solution for the estimate ˆx
kk
of the current state is given, and not only for the estimate ˆx
0k
of the initial state.
As time goes by, knowledge of the initial state may obsolesce and become less and less useful. The Kalman
ﬁlter computes uptodate information about the current state.
2
Chapter 1
Introduction
Fields such as robotics or computer vision are interdisciplinary subjects at the intersection of engineering and computer science. By their nature, they deal with both computers and the physical world. Although the former are in the latter, the workings of computers are best described in the blackandwhite vocabulary of discrete mathematics, which is foreign to most classical models of reality, quantum physics notwithstanding. This class surveys some of the key tools of applied math to be used at the interface of continuous and discrete. It is not on robotics or computer vision, nor does it cover any other application area. Applications evolve rapidly, but their mathematical foundations remain. Even if you will not pursue any of these ﬁelds, the mathematics that you learn in this class will not go wasted. To be sure, applied mathematics is a discipline in itself and, in many universities, a separate department. Consequently, this class can be a quick tour at best. It does not replace calculus or linear algebra, which are assumed as prerequisites, nor is it a comprehensive survey of applied mathematics. What is covered is a compromise between the time available and what is useful and fun to talk about. Even if in some cases you may have to wait until you take an applied class to fully appreciate the usefulness of a particular topic, I hope that you will enjoy studying these subjects in their own right.
1.1 Who Should Take This Class
The main goal of this class is to present a collection of mathematical tools for both understanding and solving problems in ﬁelds that manipulate models of the real world, such as robotics, artiﬁcial intelligence, vision, engineering, or several aspects of the biological sciences. Several classes at most universities each cover some of the topics presented in this class, and do so in much greater detail. If you want to understand the full details of any one of the topics in the syllabus below, you should take one or more of these other classes instead. If you want to understand how these tools are implemented numerically, you should take one of the classes in the scientiﬁc computing program, which again cover these issues in much better detail. Finally, if you want to understand robotics, vision, or other applied ﬁelds, you should take classes in these subjects, since this course is not on applications. On the other hand, if you do plan to study robotics, vision, or other applied subjects in the future, and you regard yourself as a user of the mathematical techniques outlined in the syllabus below, then you may beneﬁt from this course. Of the proofs, we will only see those that add understanding. Of the implementation aspects of algorithms that are available in, say, Matlab or LApack, we will only see the parts that we need to understand when we use the code. In brief, we will be able to cover more topics than other classes because we will be often (but not always) unconcerned with rigorous proof or implementation issues. The emphasis will be on intuition and on practicality of the various algorithms. For instance, why are singular values important, and how do they relate to eigenvalues? What are the dangers of Newtonstyle minimization? How does a Kalman ﬁlter work, and why do PDEs lead to sparse linear systems? In this spirit, for instance, we discuss Singular Value Decomposition and Schur decomposition both because they never fail and because they clarify the structure of an algebraic or a differential linear problem.
3
and volume integrals Green’s theorem and potential ﬁelds of two variables Stokes’ and divergence theorems and potential ﬁelds of three variables Diffusion and ﬂow problems Finite differences Direct versus iterative solution methods Jacobi and GaussSeidel iterations Successive overrelaxation 4.2 Partial differential equations and sparse linear systems 4. surface.3 4.3. Unknown functions of several variables 4. Introduction 2.3 Constraints and Lagrange multipliers 3.2 Weighted least squares 3.2.1.2.1 Linear estimation 3.5 Eigenvalues and eigenvectors The Schur decomposition Ordinary differential linear systems The matrix zoo Real.2 Syllabus Here is the ideal syllabus.2 LevenbergMarquardt method 2.2 Function optimization 2.3 Calculus of variations 4.3 3.1.4 Characterization of the solutions to a linear system Gaussian elimination The Singular Value Decomposition The pseudoinverse 2.2.5 4. curl Line.1 Ordinary differential linear systems 3.1 4.2 2.1.3 4.4 Grad.1.1 4.2.1 Newton and GaussNewton methods 2.1.2 4. but how much we cover depends on how fast we go.4 CHAPTER 1.2.1.3 2.2.2 4. Unknown functions of one real variable 3. div.4 4. 1.1. INTRODUCTION 1.2 Statistical estimation 3.3 The Kalman ﬁlter 4. positivedeﬁnite matrices 3.1.1 Algebraic linear systems 2.1.1 Tensor ﬁelds of several variables 4.3.2 3.1.4 3.1.2.1 EulerLagrange equations 4.1.1.1 2.2 The brachistochrone .1 3.2. symmetric.1. Unknown numbers 2.2.2.
1.3. DISCUSSION OF THE SYLLABUS
5
1.3 Discussion of the Syllabus
In robotics, vision, physics, and any other branch of science whose subject belongs to or interacts with the real world, mathematical models are developed that describe the relationship between different quantities. Some of these quantities are measured, or sensed, while others are inferred by calculation. For instance, in computer vision, equations tie the coordinates of points in space to the coordinates of corresponding points in different images. Image points are data, world points are unknowns to be computed. Similarly, in robotics, a robot arm is modeled by equations that describe where each link of the robot is as a function of the conﬁguration of the link’s own joints and that of the links that support it. The desired position of the end effector, as well as the current conﬁguration of all the joints, are the data. The unknowns are the motions to be imparted to the joints so that the end effector reaches the desired target position. Of course, what is data and what is unknown depends on the problem. For instance, the vision system mentioned above could be looking at the robot arm. Then, the robot’s end effector position could be the unknowns to be solved for by the vision system. Once vision has solved its problem, it could feed the robot’s endeffector position as data for the robot controller to use in its own motion planning problem. Sensed data are invariably noisy, because sensors have inherent limitations of accuracy, precision, resolution, and repeatability. Consequently, the systems of equations to be solved are typically overconstrained: there are more equations than unknowns, and it is hoped that the errors that affect the coefﬁcients of one equation are partially cancelled by opposite errors in other equations. This is the basis of optimization problems: Rather than solving a minimal system exactly, an optimization problem tries to solve many equations simultaneously, each of them only approximately, but collectively as well as possible, according to some global criterion. Least squares is perhaps the most popular such criterion, and we will devote a good deal of attention to it. In summary, the problems encountered in robotics and vision, as well as other applications of mathematics, are optimization problems. A fundamental distinction between different classes of problems reﬂects the complexity of the unknowns. In the simplest case, unknowns are scalars. When there is more than one scalar, the unknown is a vector of numbers, typically either real or complex. Accordingly, the ﬁrst part of this course will be devoted to describing systems of algebraic equations, especially linear equations, and optimization techniques for problems whose solution is a vector of reals. The main tool for understanding linear algebraic systems is the Singular Value Decomposition (SVD), which is both conceptually fundamental and practically of extreme usefulness. When the systems are nonlinear, they can be solved by various techniques of function optimization, of which we will consider the basic aspects. Since physical quantities often evolve over time, many problems arise in which the unknowns are themselves functions of time. This is our second class of problems. Again, problems can be cast as a set of equations to be solved exactly, and this leads to the theory of Ordinary Differential Equations (ODEs). Here, “ordinary” expresses the fact that the unknown functions depend on just one variable (e.g., time). The main conceptual tool for addressing ODEs is the theory of eigenvalues, and the primary computational tool is the Schur decomposition. Alternatively, problems with time varying solutions can be stated as minimization problems. When viewed globally, these minimization problems lead to the calculus of variations. When the minimization problems above are studied locally, they become state estimation problems, and the relevant theory is that of dynamic systems and Kalman ﬁltering. The third category of problems concerns unknown functions of more than one variable. The images taken by a moving camera, for instance, are functions of time and space, and so are the unknown quantities that one can compute from the images, such as the distance of points in the world from the camera. This leads to Partial Differential equations (PDEs), or to extensions of the calculus of variations. In this class, we will see how PDEs arise, and how they can be solved numerically.
6
CHAPTER 1. INTRODUCTION
1.4 Books
The class will be based on these lecture notes, and additional notes handed out when necessary. Other useful references include the following. • R. Courant and D. Hilbert, Methods of Mathematical Physics, Volume I and II, John Wiley and Sons, 1989. • D. A. Danielson, Vectors and Tensors in Engineering and Physics, AddisonWesley, 1992. • J. W. Demmel, Applied Numerical Linear Algebra, SIAM, 1997. • A. Gelb et al., Applied Optimal Estimation, MIT Press, 1974. • P. E. Gill, W. Murray, and M. H. Wright, Practical Optimization, Academic Press, 1993. • G. H. Golub and C. F. Van Loan, Matrix Computations, 2nd Edition, Johns Hopkins University Press, 1989, or 3rd edition, 1997. • W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling, Numerical Recipes in C, 2nd Edition, Cambridge University Press, 1992. • G. Strang, Introduction to Applied Mathematics, Wellesley Cambridge Press, 1986. • A. E. Taylor and W. R. Mann, Advanced Calculus, 3rd Edition, John Wiley and Sons, 1983. • L. N. Trefethen and D. Bau, III, Numerical Linear Algebra, SIAM, 1997. • R. Weinstock, Calculus of Variations, Dover, 1974.
Chapter 2
Algebraic Linear Systems
An algebraic linear system is a set of m equations in n unknown scalars, which appear linearly. Without loss of generality, an algebraic linear system can be written as follows: Ax = b (2.1)
where A is an m × n matrix, x is an ndimensional vector that collects all of the unknowns, and b is a known vector of dimension m. In this chapter, we only consider the cases in which the entries of A, b, and x are real numbers. Two reasons are usually offered for the importance of linear systems. The ﬁrst is apparently deep, and refers to the principle of superposition of effects. For instance, in dynamics, superposition of forces states that if force f1 produces acceleration a1 (both possibly vectors) and force f2 produces acceleration a2 , then the combined force f1 + αf2 produces acceleration a1 + αa2 . This is Newton’s second law of dynamics, although in a formulation less common than the equivalent f = ma. Because Newton’s laws are at the basis of the entire ediﬁce of Mechanics, linearity appears to be a fundamental principle of Nature. However, like all physical laws, Newton’s second law is an abstraction, and ignores viscosity, friction, turbulence, and other nonlinear effects. Linearity, then, is perhaps more in the physicist’s mind than in reality: if nonlinear effects can be ignored, physical phenomena are linear! A more pragmatic explanation is that linear systems are the only ones we know how to solve in general. This argument, which is apparently more shallow than the previous one, is actually rather important. Here is why. Given two algebraic equations in two variables, f (x, y) = g(x, y) = we can eliminate, say, y and obtain the equivalent system F (x) = 0 y = h(x) . Thus, the original system is as hard to solve as it is to ﬁnd the roots of the polynomial F in a single variable. Unfortunately, if f and g have degrees df and dg , the polynomial F has generically degree df dg . Thus, the degree of a system of equations is, roughly speaking, the product of the degrees. For instance, a system of m quadratic equations corresponds to a polynomial of degree 2m . The only case in which the exponential is harmless is when its base is 1, that is, when the system is linear. In this chapter, we ﬁrst review a few basic facts about vectors in sections 2.1 through 2.4. More speciﬁcally, we develop enough language to talk about linear systems and their solutions in geometric terms. In contrast with the promise made in the introduction, these sections contain quite a few proofs. This is because a large part of the course material is based on these notions, so we want to make sure that the foundations are sound. In addition, some of the proofs lead to useful algorithms, and some others prove rather surprising facts. Then, in section 2.5, we characterize the solutions of linear algebraic systems. 7 0 0,
j=1 (2. . In other words. . xn . dependency means that there is a nonzero vector x such that n xj aj = 0 .2) is said to be a linear combination of a1 . the vector n b= j=1 x j aj (2. In one direction. This is important. . xn . an are linearly dependent if they admit the null vector as a nonzero linear combination. b= . . j=k 1 “iff” x j aj means “if and only if. .5) (2. . . an are linearly dependent iff1 at least one of them is a linear combination of the others. .4) If you are not convinced of these equivalences. . they are linearly dependent if there is a set of coefﬁcients x1 . xn b1 .3) For later reference. the columns of a matrix A are dependent if there is a nonzero solution to the homogeneous system (2. The vectors a1 . .1.1 Linear (In)dependence Given n vectors a1 . x= . take the time to write out the components of each expression for a small example. . bm (2. . x1 .8 CHAPTER 2.3) is the same as Ax = 0 where A= a1 ··· an .” . Proof. . . xn . such that n xj aj = 0 . Make sure that you are comfortable with this. We have n n xj aj = xk ak + j=1 j=1. an with coefﬁcients x1 .5). . not all of which are zero.6) as desired. . . an and n real numbers x1 . . . . j=k n xj aj = 0 so that ak = − j=1. Equation (2. Theorem 2. The converse is proven similarly: if n ak = j=1. Thus.2) is the same as Ax = b and equation (2. Vectors that are not dependent are independent. . . j=k xj aj xk (2. . j=1 Let xk be nonzero for some k. . . ALGEBRAIC LINEAR SYSTEMS 2. . . it is useful to rewrite the last two equalities in a different form. . . .1 The vectors a1 . . .
∆2 We can make the ﬁrst part of the proof above even more speciﬁc. an are linearly dependent then at least one of them is a linear combination of the ones that precede it. an of A are linearly independent and the system Ax = b admits a solution. Proof. . . this implies that every linear space contains the zero vector. . and state the following Lemma 2. . . Just let k be the last of the nonzero xj . . 2 This symbol marks the end of a proof.6). Then xj = 0 for j > k in (2. . Then. an is said to be a basis for a set B of vectors if the aj are linearly independent and every vector in B can be written as a linear combination of them. .1. The basis vectors are said to span the vector space. . this is possible only when xj − xj = 0 for every j. Theorem 2. . . . 2. an for B. then 9 n xj aj = 0 j=1 by letting xk = −1 (so that x is nonzero). then the solution is unique. . BASIS for some k. .2. In particular. .1 Given a vector b in the vector space B and a basis a1 . An equivalent formulation is the following: If the columns a1 . . . The previous theorem is a very important result. 0=b−b= n n n xj aj − j=1 j=1 xj aj = j=1 (xj − xj )aj ∆ but because the aj are linearly independent. .2 Basis A set a1 . the coefﬁcients x1 . xn such that n b= j=1 xj aj are uniquely determined. .2. which then becomes n ak = j<k xj aj xk ∆ as desired. B is said to be a vector space if it contains all the linear combinations of its basis vectors. . Let also b= j=1 n xj aj .2. . Proof.2 If n nonzero vectors a1 .
2.10 Pause for a minute to verify that this formulation is equivalent. an be two different bases for B. . The rest of the proof now proceeds as follows: we keep removing a vectors from G and replacing them with a vectors in such a way as to keep G a generating set for B. Then each aj is in B (why?). This shows that n ≥ n. the basis of elementary vectors ej = jth column of the m × m identity matrix is clearly a basis for Rm . . Since this elementary basis has m vectors. bm m can be written as b= bj e j j=1 and the ej are clearly independent. since it has no other vectors preceding it. Consequently. Another consequence of theorem 2.2 implies that any other basis for Rm has m vectors. . and can therefore be written as a linear combination of a1 . a2 . From lemma 2. . ∆ A consequence of this theorem is that any basis for Rm has m vectors. . This vector cannot be a1 . Now we can repeat the whole procedure with the roles of a vectors and a vectors exchanged. n ≥ n . . G is still a generating set for B. it makes sense to deﬁne the dimension of a space as the number of vectors in any of its bases. . CHAPTER 2. . So it must be one of the aj vectors. Since all bases for a space have the same number of vectors. Now we can add a2 to G. . . Since B is a vector space. Suppose in fact per absurdum that G is now made only of a vectors.2. This is absurd. This proves that n = n . an and a1 . . which are independent of those in G. . Since the a s form a basis. .2 Two different bases for the same vector space B have the same number of vectors. Proof. Let us continue this procedure until we run out of either a vectors to remove or a vectors to add. . . The a vectors cannot run out ﬁrst. which proves that n ≥ n .2 is that n vectors of dimension m < n are bound to be dependent. an must be linearly dependent. .2. and the two results together imply that n = n . . since any vector b1 . . since the removed vector depends on the others. We then switch the roles of a and a vectors to conclude that n ≥ n.1. That is. G is a generating set for B. Let a1 . In fact. because at every step we have made sure that G remains a generating set. ALGEBRAIC LINEAR SYSTEMS Theorem 2. Then we show that we cannot run out of a vectors before we run out of a vectors. the vectors of the set G = a1 . . they are mutually linearly independent. Removing the latter keeps G a generating set. But then G cannot be a generating set. writing it right after a1 : G = a1 . . b= . Consequently. since any basis for Rm can only have m vectors. . Thus. we must run out of a s ﬁrst (or simultaneously with the last a). since the vectors in it cannot generate the leftover a s.2. . We call a set of vectors that contains a basis for B a generating set for B. theorem 2. a1 . an . one of the vectors in G is a linear combination of those preceding it. all the a s are in B. and that there are still leftover a vectors that have not been put into G.
j This result extends Pythagoras’ theorem to m dimensions. we have a2 = b2 + c2 − 2bc cos θ where θ is the angle between the sides of length b and c. inner product. (2. By Pythagoras’ theorem. b. c (see ﬁgure 2. and the symbol T makes them into row vectors. and so forth. Here and elsewhere. and orthogonality. the squared length of a vector is the inner product of the vector with itself.8) Thus. vectors are column vectors by default. The length of these elementary vectors is clearly one. the square of the length b of b is m b 2 = b2 + 1 j=2 bj e j 2 . If we deﬁne the inner product of two mdimensional vectors as follows: m b c= j=1 T bj cj . A special case of this law is Pythagoras’ theorem. How long is b? From equation (2. because each of them goes from the origin to the unit point of one of the axes. m b 2 = j=1 b2 . c a θ b Figure 2. .3 Inner Product and Orthogonality In this section we establish the geometric meaning of the algebraic notions of norm.7) we obtain m b = b1 e1 + j=2 bj ej and the two vectors b1 e1 and m j=2 bj ej are orthogonal.3. INNER PRODUCT AND ORTHOGONALITY 11 2. Also. then b 2 = bT b . Pythagoras’ theorem can now be applied again to the last sum by singling out its ﬁrst term b2 e2 .1). The fundamental geometric fact that is assumed to be known is the law of cosines: given a triangle with sides a. projection. any two of these vectors form a 90degree angle. In conclusion.7) of the elementary vectors that point along the coordinate axes. In the previous section we saw that any vector in Rm can be written as the linear combination m b= j=1 bj e j (2.2. obtained when θ = ±π/2.1: The law of cosines states that a2 = b2 + c2 − 2bc cos θ. because the coordinate axes are orthogonal by construction.
Proof. ∆ Corollary 2. and b − c yields b−c 2 = b 2 + c 2 −2 b c cos θ and from equation (2. c cos θ . Proof. cT c Proof. Theorem 2. Since by deﬁnition point p is on the line through c. The line from the endpoint of b to p is orthogonal to c. the line between p and the endpoint of b is shortest when it is orthogonal to c: cT (b − ac) = 0 .12 Theorem 2.3.8) we obtain bT b + cT c − 2bT c = bT b + cT c − 2 b Canceling equal terms and dividing by 2 yields the desired result. See ﬁgure 2. From elementary geometry. CHAPTER 2.3 The projection of b onto c is the vector p = Pc b where Pc is the following square matrix: Pc = ccT .2. where a is some real number.1 bT c = b where θ is the angle between b and c.3. the projection vector p has the form p = ac. When θ = ±π/2. ALGEBRAIC LINEAR SYSTEMS c cos θ The law of cosines applied to the triangle with sides b . b p c Figure 2.3. c .2 Two nonzero vectors b and c in Rm are mutually orthogonal iff bT c = 0.2: The vector from the origin to point p is the projection of b onto c. ∆ Given two vectors b and c applied to the origin. the previous theorem yields bT c = 0. the projection of b onto c is the vector from the origin to the point p on the line through c that is nearest to the endpoint of b.
and its rank. then the orthogonal complement of A is the set of all vectors in Rm that are orthogonal to all the vectors in A. and if so how many. With these tools.5. . Proof.1 Any basis a1 . . . we also introduce a useful procedure (GramSchmidt) for orthonormalizing a set of linearly independent vectors. the complement of the xy plane in R3 is all of R3 except the xy plane. It is important to understand exactly what is being mapped into what in order to determine whether a linear system has solutions. am . ∆ Theorem 2. . that is. This section introduces the notion of orthogonality between spaces. an . . . Theorem 2. . we will be able to characterize the solutions to a linear system in section 2. until m = n. qr that span the same space as a1 . This argument can be repeated until the basis spans all of Rm . . while the orthogonal complement of the xy plane is the z axis. Proof. . cT b cT c ccT b cT c 13 ∆ 2. . If r = 1. Two vector spaces A and B are said to be orthogonal to one another when every vector in A is orthogonal to every vector in B. the following construction r=0 for j = 1 to n aj = aj − if aj = 0 r =r+1 a qr = aj j end end r T l=1 (ql aj )ql yields a set of orthonormal 3 vectors q1 . If n = m we are done. Let us now assume that the procedure 3 Orthonormal means orthogonal and with unit norm. .4.4. .2 (GramSchmidt) Given n vectors a1 . call it an+1 . . . an . .4. In the process.4 Orthogonal Subspaces and the Rank of a Matrix Linear transformations map spaces into spaces. .2. deﬁnes the null space and range of a matrix. . The normalization in the above procedure ensures that q1 has unit norm. If vector space A is a subspace of Rm for some m. . For instance. . the given basis cannot generate all of Rm . an . . If n < m. Notice that complement and orthogonal complement are very different notions. . so there must be a vector. there is little to prove. . ORTHOGONAL SUBSPACES AND THE RANK OF A MATRIX which yields a= so that p = ac = c a = as advertised. an for a subspace A of Rm can be extended into a basis for Rm by adding m − n vectors an+1 . . . that is linearly independent of a1 . We ﬁrst prove by induction on r that the vectors qr are mutually orthonormal.
these n − h vectors generate the range of A.1).3. there must be a linear combination of the columns of A that is equal to b. Proof. . Finally. . . ∆ We can now start to talk about matrices in terms of the subspaces associated with them. an be a basis for A. First. qn . . In symbols. . it follwos that qr is orthogonal to all its predecessors. q1 . Because the new basis is orthonormal. are orthonormal (and therefore independent). . A has n − h linearly independent rows. and that these vectors are orthonormal (the inductive assumption). .4. . . . . . .4 The number r of linearly independent columns of any m × n matrix A is equal to the number of its independent rows. if computed. .3 If A is a subspace of Rm and A⊥ is the orthogonal complement of A in Rm .4. . . and the inner products i T qi ql are zero by the inductive assumption. . . . x ∈ null(A) iff Ax = 0. . . Extend this basis to a basis a1 . we notice that the vectors qj span the same space as the aj s. the dimension of the orthogonal space A⊥ is exactly m − n. r − 1. .4. our basis for Rn : n x= j=1 cj vj . . . the dimension of this orthogonal space cannot exceed m − n. Now we show that the number of independent columns is also n − h. must be some linear combination of v1 . From theorem 2. and b ∈ range(A) iff Ax = b for some x. . given an arbitrary vector b ∈ range(A). ALGEBRAIC LINEAR SYSTEMS above has been performed a number j − 1 of times sufﬁcient to ﬁnd r − 1 vectors q1 . vh be a basis for null(A).4. as promised. and because qT aj = 0. . qr−1 . Thus. Orthonormalize this basis by the GramSchmidt procedure (theorem 2. qm are orthogonal to all vectors generated by q1 .14 CHAPTER 2.2) to obtain q1 . . We have already proven that the number of independent rows is n − h.4. Proof. The range of A is the space of all mdimensional vectors that are generated by the columns of A.4. By construction. qT qr = 0 for i i i = 1. . qm . by constructing a basis for range(A). and equal in number to the number of linearly independent vectors in a1 . ∆ qT aj i (qT aj )qT qi i i Theorem 2. . am for Rm (theorem 2. The ntuple x itself. Thus. . . being an element of Rn . . then the space generated by the rows of A has dimension r = n − h. has unit norm. . Theorem 2. qn span A. Let v1 . . vn for Rn . . there is an ntuple x such that Ax = b. . The null space null(A) of an m × n matrix A is the space of all ndimensional vectors that are orthogonal to the rows of A. so there is a space of dimension at least m − n that is orthogonal to A. because the former are linear combinations of the latter. Then for any i < r we have r−1 qT aj = qT aj − i i l=1 (qT aj )qT ql = 0 l i because the term cancels the ith term of the sum (remember that qT qi = 1). . the vector qr . It is not obvious that the space generated by the columns of A has also dimension r = n − h. Avn are a basis for the range of A. . Let a1 . that is. then dim(A) + dim(A⊥ ) = m . and r =n−h where h = dim(null(A)). . vn . . This is the point of the following theorem. and extend this basis (theorem 2. . . . In fact. . . . Because of the explicit normalization step qr = aj / aj . On the other hand. all vectors generated by qn+1 .1) into a basis v1 . . an . . . if null(A) has dimension h. . . . . because otherwise we would have more than m vectors in a basis for Rm . Then we can show that the n − h vectors Avh+1 .
we prove that the n − h vectors Avh+1 .1 The matrix A transforms a vector x in its null space into the zero vector. Thanks to this theorem. . This proves that the n − h vectors Avh+1 . . . dimension h range(A)⊥ . . .5. . It should be obvious from the meaning of these spaces that null(A)⊥ range(A)⊥ = range(AT ) = null(AT ) where AT is the transpose of A. not all zero. . vh span null(A). Suppose.2. . and null(A)⊥ is called the rowspace of A. Second. . . . . . .5. . . dimension m − r null(A)⊥ . . such that n xj Avj = 0 j=h+1 so that A n xj vj = 0 . vh are a basis for null(A). deﬁned as the matrix obtained by exchanging the rows of A with its columns. and an arbitrary vector x into a vector in range(A). Theorem 2. . . there must exist coefﬁcients x1 . h. so that Avj = 0 for j = 1.5 The Solutions of a Linear System Thanks to the results of the previous sections. 2. . Then there exist numbers xh+1 . . j=h+1 But then the vector j=h+1 xj vj is in the null space of A. b = Ax = A j=1 15 n n n cj vj = j=1 cj Avj = j=h+1 cj Avj since v1 . . . . ∆ in conﬂict with the assumption that the vectors v1 . . The space range(A)⊥ is called the left nullspace of the matrix. that they are not. . A frequently used synonym for “range” is column space. . Avn generate range(A). THE SOLUTIONS OF A LINEAR SYSTEM Thus. dimension r = n − h . Avn are linearly independent. . per absurdum. . Since the vectors v1 . . . dimension r = rank(A) null(A). xn . we now have a complete picture of the four spaces associated with an m × n matrix A of rank r and nullspace dimension h: range(A). vn are linearly independent. we can deﬁne the rank of A to be equivalently the number of linearly independent columns or of linearly independent rows of A: rank(A) = dim(range(A)) = n − dim(null(A)) . . xh such that n h n xj vj = j=h+1 j=1 xj vj . .
the system is redundant.” The case r < n < m is possible. and on numerical considerations. there are solutions x to Ax = b. Then. 4 • When r < n the system is underdetermined. In the second. There are more equations than unknowns. three possibilities occur. n: • When r = n = m. Notice that if r = n then n cannot possibly exceed m. the system is invertible. and exactly one solution exists. but since b is in the range of A there is a linear combination of the columns (a vector x) that produces b. Also.16 CHAPTER 2. Also. More on this in chapter 3. but then for any y ∈ null(A) also x + y is a solution: Ax = b .3. Let Ax = b be an m × n system (m can be less than. All the cases are summarized in ﬁgure 2. it has dimension h > 0). and the system is said to be incompatible. 2. Notice that invertibility depends only on A. so the ﬁrst two cases exhaust the possibilities for r = n. and particularly its version called reduction to echelon form is such a method. admits a solution if b ∈ range(A). but is called “underdetermined” because there are fewer (r) independent equations than there are unknowns (see next item). not on b. whether more information is required about the matrix of the system.6 Gaussian Elimination Gaussian elimination is an important technique for solving linear systems. ∞k is the cardinality of a kdimensional vector space. there can be no linear combination of the columns (no x vector) that gives b. the equations are compatible. Here. • When r = n and m > n. are done in a ﬁnite amount of time that can be bounded given only the size of a matrix. Thus. Most notably. Gaussian elimination. no matter whether the system is invertible or not. like Gaussian elimination. and there is a space of dimension h = n − r of vectors x such that Ax = 0. equal to. compatible case. has more equations (m) than unknowns (n). In addition to always yielding a solution.. In the ﬁrst case above. or greater than n).” . This means that there is exactly one x that satisﬁes the system. iterative methods solve systems in a time that depends on the accuracy required. while direct methods. r cannot exceed either m or n. Which method to use depends on the size and structure (e. sparsity) of the matrix. Other solution techniques exist for linear systems. In other words. b ∈ range(A) b ∈ range(A) ⇒ no solutions ⇒ ∞n−r solutions with the convention that ∞0 = 1. it also allows determining the rank of a matrix. m. Of course. depending on the relative sizes of r. listing all possibilities does not provide an operational method for determining the type of linear system for a given pair A. Since b is assumed to be in the range of A.9) 4 Notice that the technical meaning of “redundant” has a stronger meaning than “with more equations than unknowns. Ay = 0 ⇒ A(x + y) = b and this generates the ∞h = ∞n−r solutions mentioned above.e. Consider the m × n system Ax = b (2. “redundant” means “with exactly one solution and with more equations than unknowns.g. This means that the null space is nontrivial (i. let r = rank(A) be the number of linearly independent rows or columns of A. since the columns of A span all of Rn . ALGEBRAIC LINEAR SYSTEMS This allows characterizing the set of solutions to linear system as follows. b. and is summarized in the next section..
and is called the pivot. The ﬁrst step is applied to U (1) = A and c(1) = b. . This zeros all the entries in the column below the pivot. then U is uppertriangular. The last step produces U (m) = U and c(m) = c. if the matrix is square and if all columns have a pivot. Initially. This means that • Nonzero rows precede rows with all zeros. . exchange rows l and k of U (k) and of c(k) . and preserves the equality of left. of a row. . invertible. Gaussian elimination replaces the rows of this system by linear combinations of the rows themselves until A is changed into a matrix U that is in the socalled echelon form. 17 no incompatible no underdetermined which can be square or rectangular. GAUSSIAN ELIMINATION b in range(A) yes r=n yes m=n yes invertible no redundant Figure 2. m.6. then increment p by 1. . m of U (k) and c(k) and produces U (k+1) and c(k+1) . . by solving the transformed system Ux = c . U is in echelon form. and subtract entry k of c(k) multiplied by uip /ukp from entry i of c(k) . we compute the solution x by backsubstitution. . 6 Different 5 “Stop” .5 Row exchange Now p ≤ n and uip is nonzero for some k ≤ i ≤ m. For i = k + 1. or underdetermined. Let l be one such value of i6 . means that the entire algorithm is ﬁnished. • Below each pivot is a column of zeros.2. . If p exceeds n stop. . . there are no restrictions on the system. The kth step is applied to rows k. the “pivot column index” p is set to one. is called a pivot. . that is. Here is step k. If l = k. m.1 Reduction to Echelon Form The matrix A is reduced to echelon form by a process in m − 1 steps. When this process is ﬁnished. In particular. Once the system is transformed into echelon form. redundant. In short. 2. • Each pivot lies to the right of the pivot in the row above. Selecting the largest entry in the column leads to better roundoff properties. where uij denotes entry i. Triangularization The new entry ukp is nonzero.6. incompatible. . The ﬁrst nonzero entry. subtract row k of U (k) multiplied by uip /ukp from row i of U (k) .and righthand side. if any. so equality is preserved and solving the ﬁnal system yields the same solution as solving the original one. j of U (k) : Skip nopivot columns If uip is zero for every i = k. The same operations are applied to the rows of A and to those of b. which is transformed to a new vector c. ways of selecting l here lead to different numerical properties of the algorithm. .3: Types of linear systems.
once for the nonhomogeneous system. and can therefore be written in the form x = v0 + xj1 v1 + . the linear system admits no solutions: it is incompatible. then the equations corresponding to these nonzero entries are incompatible.e. . call the variables corresponding to the r columns with pivots the basic variables. . If there is a residual system (i. Let r be the index of the last nonzero row of U . solutions exist.. and therefore of A. We are left with an r × n system. . remove the residual system. and they can be determined by backsubstitution. . The vector v0 is the solution when all free variables are zero. .18 CHAPTER 2. Say that the pivot columns are j1 . In this system. the last m − r equations yield a subsystem of the following form: cr+1 0 . and n − r times for the homogeneous system. If on the other hand r = m (obviously r cannot exceed m). that is. . . . In conclusion. vn−r form a basis for the null space of U . leaving undetermined variables speciﬁed by their name. Backsubstitutions works as follows. . Since any value given to the free variables leaves the equality of system (2. . . ALGEBRAIC LINEAR SYSTEMS 2. . r < m) and some of cr+1 . n − r can be obtained by solving the homogeneous system Ux = 0 with xji = 1 and all other free variables equal to zero. r is the rank of U . we ﬁrst solve the system symbolically. . with suitable values of the free variables. and call the other n − r the free variables. Here is an equivalent solution procedure that makes this unnecessary. The ﬁnal result is a solution with as many free parameters as there are free variables. To see this. 0 cm Let us call this the residual subsystem. because A and U admit exactly the same solutions and are equal in size. Then. . cm are nonzero. 7 An n l=ji +1 uil xl afﬁne function is a linear function plus a constant. It is also the rank of A. if any. cr+1 = .10) in echelon form is easily solved for x.6. = cm = 0. This yields the solution in the form (2. that is. . . . there is no residual subsystem. because they are of the form 0 = ci with ci = 0.2 Backsubstitution A system Ux = c (2. and then transform this solution procedure into one that can be more readily implemented numerically. Then symbolic backsubstitution consists of the following sequence: for i = r downto 1 1 xji = ci − uiji end This is called symbolic backsubstitution because no numerical values are assigned to free variables. Similarly. . free variables are speciﬁed by name rather than by value. the presence of free variables leads to an inﬁnity of solutions. . = . First. The solution obtained by backsubstitution is an afﬁne function7 of the free variables. Whenever they appear in the expressions for the basic variables. + xjn−r vn−r (2. it is inconvenient to carry around nonnumeric symbol names (the free variables). .10) satisﬁed. the general solution can be obtained by running backsubstitution n − r + 1 times. Let us now assume that either there is no residual system. the vector vi for i = 1. . . . Notice that the vectors v1 . If r < m. . . When solving a system in echelon form numerically. jr . and can therefore be obtained by replacing each free variable by zero during backsubstitution.11) where the xji are the free variables. however. or if there is one it is compatible. Since no vector x can satisfy these equations. . Since this is the number of independent rows of U . by solving the equations starting from the last one and replacing the result in the equations higher up.11).
Throughout this example. 0 0 6 2 6 Notice that now (k = 2) the entries uip are zero for i = 2.3 An Example An example will clarify both the reduction to echelon form and backsubstitution. we choose a trivial pivot selection rule: we pick the (1) ﬁrst nonzero entry at or below row k in the pivot column. are free. so the system is underdetermined. Consider the system Ax = b where U (1) 3 2 9 5 . 3 0 1 =b= 5 .6. the residual subsystem is ﬁrst deleted. row 2 multiplied by 6/3 is subtracted from row 3 for both U (2) and c(2) to yield 1 3 3 2 1 U = U (3) = 0 0 3 1 . c = c(3) = 3 . In the ﬁrst step (k = 1). and the rank of U and that of A is r = 2. Backsubstitution applied ﬁrst to row 2 and then to row 1 yields the following expressions for the pivot variables: x3 x1 1 1 1 (c2 − u24 x4 ) = (3 − x4 ) = 1 − x4 u23 3 3 1 1 = (c1 − u12 x2 − u13 x3 − u14 x4 ) = (1 − 3x2 − 3x3 − 2x4 ) u11 1 = 1 − 3x2 − (3 − x4 ) − 2x4 = −2 − 3x2 − x4 = −3 + x2 1 + x4 0 0 −1 0 .12) (2) (2) The basic variables are x1 and x3 . The triangularization step subtracts row 1 multiplied by 2/1 from row 2. there are no nopivot columns. . this means that u11 = a11 = 1 is the pivot. corresponding to the columns with pivots. For k = 1. x2 and x4 . This yields the reduced system 1 3 0 0 3 3 2 1 x= 1 3 (2.6. with ∞n−r = ∞2 solutions. The other two variables. so the pivot column index p stays at 1. 5 1 =A= 2 −1 3 6 −3 c(1) Reduction to echelon form transforms A and b as follows. The residual system is 0 = 0 (compatible). and subtracts row 1 multiplied by 1/1 from row 3. and u23 is nonzero. 3.2. no row exchange is necessary. so no row exchange is necessary. 0 0 0 0 0 There is one zero row in the lefthand side. for both p = 1 and p = 2. When applied to both U (1) and c(1) this yields 1 3 3 2 1 U (2) = 0 0 3 1 . In 8 other words. the number of nonzero rows. −1 3 1 so the general solution is −2 − 3x2 − x4 −2 0 x2 = x= 1 1 1 − 3 x4 0 x4 8 Selecting the largest entry in the column at or below row k is a frequent choice. and r < n = 4. and this would have caused rows 1 and 2 to be switched. In symbolic backsubstitution. In the triangularization step. so p is set to 3: the second pivot column is column 3. c(2) = 3 . GAUSSIAN ELIMINATION 19 2.
in the sense that the answer can be found in a number of steps that depends only on the size of the matrix A. ALGEBRAIC LINEAR SYSTEMS This same solution can be found by the numerical backsubstitution method as follows.12) with x2 = x4 = 0 by numerical backsubstitution yields x3 x1 so that = = 1 (3 − 1 · 0) = 1 3 1 (1 − 3 · 0 − 3 · 1 − 2 · 0) = −2 1 −2 0 v0 = 1 . 0 Finally. Solving the reduced system (2. solving the nonzero part of U x = 0 with x2 = 0 and x4 = 1 leads to x3 x1 so that = = 1 1 (−1 · 1) = − 3 3 1 1 (−3 · 0 − 3 · − 1 3 −1 0 v2 = 1 − 3 1 −2 0 x = v0 + x2 v1 + x4 v2 = 1 + x2 0 −3 1 + x4 0 0 −1 0 −1 3 1 − 2 · 1) = −1 and just as before. 0 Then v1 is found by solving the nonzero part (ﬁrst two rows) of U x = 0 with x2 = 1 and x4 = 0 to obtain x3 x1 so that = = 1 (−1 · 0) = 0 3 1 (−3 · 1 − 3 · 0 − 2 · 0) = −3 1 −3 1 v1 = 0 . As mentioned at the beginning of this section. In the next chapter. we study a different .20 CHAPTER 2. Gaussian elimination is a direct method.
For instance. b are small numbers that depend on the details of the algorithm. Finally. much more so than Gaussian elimination.2. the number of ﬂoating point operations necessary to compute the SVD of an m × n matrix is amn2 + bn3 where a. This state of affairs would seem to favor Gaussian elimination over the SVD. and the number of steps necessary to ﬁnd an approximate solution depends on the desired number of correct digits. which provide great insight into the behavior of the linear transformation represented by the matrix A. for reasons that will become apparent in the next chapter. More generally. one can compute several thousand SVDs of 5 × 5 matrices in one second. GAUSSIAN ELIMINATION 21 method. called the singular values. since it computes bases for all the four spaces mentioned above. the computation of the SVD is numerically well behaved.6. In addition. on a regular workstation. Singular values also allow deﬁning a notion of approximate rank which is very useful in a large number of applications. It also allows ﬁnding approximate solutions when the linear system in question is incompatible. as well as a set of quantities. However. . the latter yields a much more complete answer. based on the socalled the Singular Value Decomposition (SVD). This is an iterative method. meaning that an exact solution usually requires an inﬁnite number of steps. very efﬁcient algorithms for the SVD exist.
22 CHAPTER 2. ALGEBRAIC LINEAR SYSTEMS .
we ﬁrst need to introduce the notion of orthogonal matrices. . i since the vj are orthonormal. . into the orthogonal complement of the range. . This describes the action that a matrix has on the magnitudes of vectors as well. . pm and since P is in S. 23 . 3. and vectors in its null space into the zero vector. . In this section. we saw that a matrix transforms vectors in its domain into vectors in its range (column space). . and interpret them geometrically as transformations between systems of orthonormal coordinates. q= . .1.2. n we have n n vT p = vT i i j=1 qj vj = j=1 q j vT vj = q i .Chapter 3 The Singular Value Decomposition In section 2. In other words. We do this in section 3. . and may need emphasis: 1 Vectors with unit norm. This is important. . Then for any i = 1. it must be possible to write p as a linear combination of the vj s. If the coordinates of P in Rm are collected in an mdimensional vector p1 . in section 3. and let v1 . + qn vn = V q where V = v1 ··· vn is an m × n matrix that collects the basis for S as its columns. p= . vn be an orthonormal basis for S. we make this statement more speciﬁc by showing how unit vectors1 in the rowspace are transformed by matrices. qn such that p = q1 v1 + .1 Orthogonal Matrices Let S be an ndimensional subspace of Rm (so that we necessarily have n ≤ m). . that is. . To this end. Consider a point P in S. Then. . there must exist coefﬁcients q1 . The chapter concludes with some basic applications and examples. . No nonzero vector is mapped into the left null space. we use these new concepts to introduce the allimportant concept of the Singular Value Decomposition (SVD).
since equation (3. we can collect the n equations vT vj = i into the following matrix equation: 1 0 if i = j otherwise (3. In fact. . the matrix V −1 = V T is also the right inverse of V : V square 2 Nay.3) 2 (3. j In matrix form. . then the coefﬁcients qj are the signed magnitudes of the projections of p onto the basis vectors: qj = vT p . for square. Of course.3) still holds. V T is both the left and the right inverse: V TV = V V T = I . (3. because it is made of orthonormal columns.1) q = V Tp . ⇒ V −1 V = V T V = V V −1 = V V T = I . vn are orthonormal.24 If CHAPTER 3. so V R can have at most n linearly independent columns. Then L = L(V R) = (LV )R = R.2) V TV = I where I is the n × n identity matrix. .4) is unique. Also. Since the matrix V V T contains the inner products between the rows of V (just as V T V is formed by the inner products of its columns). if m = n. this argument requires V to be full rank. We can summarize this discussion as follows: Theorem 3. In particular. so that the solution L to equation (3. Of course. Notice that V R = I cannot possibly have a solution when m > n. and V T is then simply said to be the inverse of V : V T = V −1 . V is certainly full rank. for square orthogonal matrices. Thus. because the m × m identity matrix has m linearly independent 2 columns. fullrank matrices (r = m = n). so the left and the right inverse are the same. Thus. . THE SINGULAR VALUE DECOMPOSITION n p= j=1 qj vj and the vectors of the basis v1 .1. suppose that there exist matrices L and R such that LV = I and V R = I.3) is said to be orthogonal. while the columns of V R are linear combinations of the n columns of V . A matrix V that satisﬁes equation (3. However. this result is still valid when V is m × m and has orthonormal columns. and is equal to the transpose of V .4) comparison with equation (3. However.3) shows that the left inverse of an orthogonal matrix V exists. orthonormal. .1 The left inverse of an orthogonal m × n matrix V with m ≥ n exists and is equal to the transpose of V: V TV = I . the distinction between left and right inverse vanishes. (3. a matrix is orthogonal if its columns are orthonormal. the argument above shows that the rows of a square orthogonal matrix are orthonormal as well. Since the left inverse of a matrix V is deﬁned as the matrix L such that LV = I .
Proof.3 Let U be an orthogonal matrix. Also. the result is the same. ORTHOGONAL MATRICES 25 Sometimes. the point P remains the same.2) causes confusion.1. is relativity. we have the following result: Theorem 3. Proof. In section 2. from the columns of I) to the vectors vj (that is. .3. One solution may be more efﬁcient than the other in other ways.1.1. for unit vectors c. however. because two interpretations of it are possible.2 The norm of a vector x is not changed by multiplication by an orthogonal matrix V : Vx = x . the difference vector between b and its projection p onto range(U ) is orthogonal to range(U ): U T (b − p) = 0 . to the columns of V ). The derivative of this squared distance with respect to x is the vector 2x − 2U T b 3 At least geometrically.3). An analogous result holds for subspace projection. equation (3. This notion of projection can be extended from lines to vector spaces by the following deﬁnition: The projection p of a point b ∈ Rn onto a subspace C is the point in C that is closest to b. the projection matrix is ccT (theorem 2. This. Then the matrix U U T projects any vector b onto range(U ). In the interpretation given above. the geometric interpretation of equation (3. and the vector b − p is orthogonal to c. U T U is the identity matrix. so b−p 2 = bT b + xT x − 2bT U x . Vx 2 = xT V T V x = xT x = x 2 . Alternatively. A point p in range(U ) is a linear combination of the columns of U : p = Ux where x is the vector of coefﬁcients (as many coefﬁcients as there are columns in U ). Furthermore.3 we deﬁned the projection p of a vector b onto another vector c as the point on the line through c that is closest to b.2) can be seen as a transformation. ∆ We conclude this section with an obvious but useful consequence of orthogonality. when m = n. of point P with coordinates p into a different point Q with coordinates q. and should not be surprising: If you spin clockwise on your feet. Because of orthogonality.3. Theorem 3.3 Consistently with either of these geometric interpretations. or if you stand still and the whole universe spins counterclockwise around you. in a ﬁxed reference system. as the following theorem shows. and the underlying reference frame is changed from the elementary vectors ej (that is. The squared distance between b and p is b − p 2 = (b − p)T (b − p) = bT b + pT p − 2bT p = bT b + xT U T U x − 2bT U x .
In three dimensions. the hyperellipse in question is centered at the origin. Here is the main intuition: An m × n matrix A of rank r maps the rdimensional unit hypersphere in rowspace(A) into an rdimensional hyperellipse in range(A). in the sense that U T (b − p) = U T (b − U U T b) = U T b − U T b = 0 . . in one dimension it is a pair of points. and we have then translated these into algebraic statements.2 The Singular Value Decomposition In these notes. the rank2 matrix √ √ 3 1 3 A= √ (3. ∆ 3. which is a quadratic hypersurface that generalizes the twodimensional notion of ellipse to an arbitrary number of dimensions. which is zero iff x = UT b . This statement is stronger than saying that A maps rowspace(A) into range(A). The geometric picture underlying the Singular Value Decomposition is crisp and useful.5) −3 3 2 1 1 transforms the unit circle on the plane into an ellipse embedded in threedimensional space. For instance. For this value of p the difference vector b − p is orthogonal to range(U ). when p = Ux = UUT b as promised. so we will use geometric intuition again. because it also describes what happens to the magnitudes of the vectors: a hypersphere is stretched or compressed into a hyperellipse. THE SINGULAR VALUE DECOMPOSITION b3 x v2 x 1 v1 u3 σ u2 2 σ u1 1 b b 1 b 2 Figure 3.1: The matrix in equation (3. This approach is successful when geometry is less cumbersome than algebra. the hyperellipse is an ellipsoid. we have often used geometric intuition to introduce new concepts. that is. In all cases. Figure 3.5) maps a circle on the plane into an ellipse in space. The two small boxes are corresponding points. or when geometric intuition provides a strong guiding element.26 x 2 CHAPTER 3.1 shows the map b = Ax .
By theorems 2. thereby contradicting the fact that uT Av1 is a maximum. . Theorem 3. um . and two other diametrically opposite points on the unit circle are mapped into the two endpoints of the minor axis of the ellipse.1 and 2. possibly at more than one point 4 . y  x ∈ Rn . Simple and fundamental as this geometric fact may be. σp ) ∈ Rm×n A = U ΣV T . σ2 0 0T A2 at least at two points: if uT Av1 is a maximum. . Then T U1 AV1 = S1 = σ1 0 0T A1 . . . . where p = min(m. so that uT Av2 = . and 1 1 therefore orthogonal to v2 . y) must achieve a maximum value on S. . so that the scalar function z(x. we see that v1 is parallel to AT u1 . its proof by geometric means is cumbersome. vn . . . u1 and v1 can be extended into orthonormal bases for Rm and Rn .1 If A is a real m × n matrix then there exist orthogonal matrices U V such that = = u1 v1 ··· ··· um vn ∈ Rm×m ∈ Rn×n U T AV = Σ = diag(σ1 . . This result can be generalized to any m × n matrix. Proof. If this were not the case. THE SINGULAR VALUE DECOMPOSITION 27 Two diametrically opposite points on the unit circle are mapped into the two endpoints of the major axis of the ellipse. 1 Similarly. and let σ1 be the corresponding value of z: max yT Ax = uT Av1 = σ1 . .2. the ﬁrst column of AV1 is Av1 = σ1 u1 . The lines through these two pairs of points on the unit circle are always orthogonal. .3. .4. . and consider the bilinear form z = yT Ax . . . their inner product uT Av1 could 1 be increased by rotating u1 towards the direction of Av1 . . v1 be two unit vectors in Rm and Rn respectively where this maximum is achieved.2. A similar argument shows that the entries after the ﬁrst in the ﬁrst row of S1 are zero: the row vector uT A is parallel to vT . and its other entries 1 T are uj Av1 = 0 because Av1 is parallel to u1 and therefore orthogonal.2. ≥ σp ≥ 0. so the ﬁrst entry of U1 AV1 is uT σ1 u1 = σ1 . We can repeat the same construction on A1 and write T U2 A1 V2 = S2 = 4 Actually. 1 . respectively. by noticing that uT Av1 = vT AT u1 1 1 and repeating the argument above. respectively. = uT Avn = 0. Equivalently. x = y = 1} is compact. by construction. Collect these orthonormal basis vectors into orthogonal matrices U1 and V1 . n) and σ1 ≥ .4. 1 1 The matrix A1 has one fewer row and column than A. T In fact. to u2 . The set S = {x. we will prove it algebraically by ﬁrst introducing the existence of the SVD and then using the latter to prove that matrices map hyperspheres into hyperellipses. y ∈ Rm . so is (−u1 )T A(−v1 ). Instead. Let u1 . Let x and y be unit vectors in Rn and Rm . 1 x = y =1 It is easy to see that u1 is parallel to the vector Av1 .
We have σ2 = = ˆ x max ˆT A1 ˆ = max y x ˆ ˆ ˆ = y =1 x = y =1 max 0 ˆ y T T U1 AV1 0 0 ˆ x ˆ y = T S1 0 ˆ x ˆ ˆ x = y =1 max yT Ax ≤ σ1 . vn . y) is negative. . it is a linear combination of v2 . . then z(−x. consider the vectors x = V1 0 ˆ x and y = U1 0 ˆ y . and if x is on a hypersphere. and induction will do the rest.1. we can observe that the successive maximization problems that yield σ1 . we just need to show that σ2 ≤ σ1 . we introduce some nomenclature for the three matrices in the SVD. Because x and y thus deﬁned belong to subsets (actually subspheres) of the unit spheres in Rn and Rm . . n)). A2 1 0 0 V2 T σ1 = 0 0 0 σ2 0 This procedure can be repeated until Ak vanishes (zero rows or zero columns) to obtain U T AV = Σ where U T and V are orthogonal matrices obtained by multiplying together all the orthogonal matrices used in the procedure. The corresponding axis unit vectors u1 . ≥ σp (where p = min(m. v2 the two right singular vectors of A. represented by equation (3. but order v1 . and Σ = diag(σ1 . ∆ We can now review the geometric picture in ﬁgure 3.28 so that 1 0 0 T U2 T T U1 AV1 CHAPTER 3. . we can premultiply the matrix product in the theorem by U and postmultiply it by V T to obtain A = U ΣV T . To see that σ1 ≥ . Since matrices U and V are orthogonal. In the process. . so is −x. v2 on the unit circle that map into the major axes of the ellipse. Consider the map in ﬁgure 3. and is therefore itself a unit vector. and imagine transforming point x (the small box at x on the unit circle) into its corresponding point b = Ax (the small box on the ellipse). The σi are nonnegative because all these maximizations are performed on unit hyperspheres. u2 on the ellipse are called left singular vectors. If we deﬁne V = v1 v2 . A similar argument shows that y is a unit vector orthogonal to u1 . . σp are performed on a sequence of sets each of which contains the next. y) is positive. Just pick one way. σp ) . y) which always assumes both positive and negative values on any hypersphere: If z(x. It only remains to show that the elements on the diagonal of Σ are nonnegative and arranged in nonincreasing order. x = y =1 xT v1 = yT u1 = 0 To explain the last equality above. .1 in light of the singular value decomposition. which is the desired result. . . . . v2 so they map into the major and the minor axis. because axis endpoints come in pairs.5). . This transformation can be achieved in three steps (see ﬁgure 3. in this order. To show this. and is therefore orthogonal to v1 .2): 1. THE SINGULAR VALUE DECOMPOSITION 0T 0T . ˆ The vector x is equal to the unit vector [0 x]T transformed by the orthogonal matrix V1 . we conclude that σ2 ≤ σ1 . Write x in the frame of reference of the two vectors v1 . There are a few ways to do this. Let us call v1 . The σi s are maxima of the function z(x. . In addition. . .
u2 . the transformed vector η has coordinates η = Σξ where σ1 Σ= 0 0 0 σ2 0 is a diagonal matrix. so b = Uη where the orthogonal matrix U= collects the left singular vectors of A. THE SINGULAR VALUE DECOMPOSITION x 2 29 y3 x v1 v2 x 1 u3 σu 2 2 σ u1 1 y y 1 y η 2 ξ 2 2 v’2 ξ σ u’ 2 2 v’1 ξ1 η σ u’ 1 1 η1 Figure 3. y2 axes.2. This rotation brings η along. 2. nonnegative numbers σ1 . If the lengths of the halfaxes of the ellipse are σ1 . the new coordinates ξ of x become ξ = V Tx because V is orthogonal. σ2 are called the singular values of A. Transform ξ into its image on a “straight” version of the ﬁnal ellipse. Rotate the reference frame in Rm = R3 so that the “straight” ellipse becomes the ellipse in ﬁgure 3. Otherwise. u1 u2 u3 .1. and maps it to b. the “straight” ellipse has the same shape as the ellipse in ﬁgure 3.3. u3 that identify the axes of the ellipse and the normal to the plane of the ellipse.1. The real. 3. “Straight” here means that the axes of the ellipse are aligned with the y1 .1. σ2 (major axis ﬁrst).2: Decomposition of the mapping in ﬁgure 3. The components of η are the signed magnitudes of the projections of b along the unit vectors u1 .
If desired. . n). There are two sources of ambiguity. Similarly. the singular values of a matrix A are the lengths of the semiaxes of the hyperellipse E deﬁned by E = {Ax : x = 1} . whenever two or more singular values coincide. . . Thus. i which is an even smaller. Finally. Even in the general case. 1 : p). the subspaces identiﬁed by the corresponding left and right singular vectors are unique.30 CHAPTER 3. a valid SVD. . and still obtain a valid SVD. together with the corresponding left singular vectors. This suggests a “small. say. the right subspace and yield. both the 2norm and the Frobenius norm m n A F = i=1 j=1 aij 2 . that the ﬁrst nonzero entry of every left singular value be positive. 1 : r). the identity matrix has an inﬁnity of SVDs. The SVD reveals a great deal about the structure of a matrix. If the matrix A maps a hypersphere into another hypersphere. while U and V are square. . the SVD is unique. and this multiplies the last n − m rows of V . Σ has the same shape and size as A. all of the form I = U IU T where U is any orthogonal matrix of suitable size. THE SINGULAR VALUE DECOMPOSITION We can concatenate these three transformations to obtain b = U ΣV T x or A = U ΣV T since this construction works for any point x on the unit circle. If p = min(m. ≥ σr > σr+1 = . if m < n. More generally. and Vr = V (:. and write A = Up Σp VpT where Up is m × p. Singular vectors must be ﬂipped in pairs (a left vector and its corresponding right vector) because the singular values are required to be nonnegative. if σr is the smallest nonzero singular value of A. . the rightmost m × (n − m) block of Σ is zero. so that the last m − n columns of U are multiplied by zero. If we deﬁne r by σ1 ≥ . The singular value decomposition is “almost unique”. then we have r A = Ur Σr VrT = i=1 σi ui vT . that is. 1 : r). and Vp is n × p. ur } . if m > n. and V is n × n. provided that the corresponding left singular vector is ﬂipped as well. then rank(A) = r null(A) = span{vr+1 . 1 : p).” equivalent version of the SVD. Σr = Σ(1 : r. However. . SVD. the bottom (m − n) × n block of Σ is zero. and Vp = V (:. For instance. Moreover. The second source of ambiguity is deeper. The ﬁrst is in the orientation of the singular vectors. but any orthonormal basis can be chosen within. the axes of the latter are not deﬁned. The sizes of the matrices in the SVD are as follows: U is m × m. if p − r singular values are zero. vn } range(A) = span{u1 . 1 : r). Σ is m × n. This is the SVD of A. it can be removed by imposing. Except for these ambiguities. . we can deﬁne Up = U (:. One can ﬂip any right singular vector. minimal. = 0 . . . This is a trivial ambiguity. Σp = Σ(1 : p. Σp is p × p. . 1 : p). for instance. . we can let Ur = U (:.
However. Often more measurements are available than strictly necessary. that is. and its ﬁrst two equations are incompatible: x1 + x2 cannot be equal to both 1 and √ 3. because measurements are unreliable. and equations are often mutually incompatible because they come from inexact measurements (incompatible linear systems were deﬁned in chapter 2). a vector x that satisﬁes this equation exactly. and the leastsquares solution is not unique. the system x1 + x2 x1 + x2 x3 = 1 = 3 = 2 has three unknowns. This leads to more equations than unknowns (the number m of rows in A is greater than the number n of columns). where the double bars henceforth refer to the Euclidean norm. THE PSEUDOINVERSE and A are neatly characterized in terms of the SVD: A 2 F A 2 2 2 = σ1 + . in the sense that it has fewer independent equations than unknowns (its rank r is less than n. an exact solution to the system (3. Even when m ≤ n the equations can be incompatible. In other circumstances. this is a rather high residual. x cannot exactly satisfy any of the m equations in the system. Incompatibility and underdeterminacy can occur together: the system admits no solution. any other vector of the form 1 −1 x = 1 + α 1 2 0 is as good as x. 2 31 = sup x=0 Ax x In the next few sections we introduce fundamental results and applications that testify to the importance of the SVD. the linear system (3. .6) arising from a reallife application may or may not admit a solution.and righthand sides of the equations. but it tries to satisfy all of them as closely as possible. as we learned in chapter 2. as measured by the sum of the squares of the discrepancies between left. In summary.6) is underdetermined.6) may not exist.3. always exists. yields exactly the same residual as x (check this). A linear system of the form Ax = b (3. + σp = σ1 . . in the leastsquares sense). but rank 2. An approximate solution. 3. In these cases. or may not be unique. obtained for α = 1. but may fail to be unique. x = [0 2 2]. not enough measurements are available. For instance. Then. it makes more sense to ﬁnd a vector x that minimizes the norm Ax − b of the residual vector r = Ax − b . see again chapter 2). because of errors in the measurements that produce the entries of A. . A leastsquares solution turns out to be x = [1 1 2]T with residual r = Ax − b = [1 − 1 0]. but this is the best we can do for this problem. in the leastsquares sense.3. For instance. which has norm 2 (admittedly. Thus.3 The Pseudoinverse One of the most important applications of the SVD is the solution of linear systems in the least squares sense.
. . 0 is called the pseudoinverse of A.. . . .32 CHAPTER 3. all equally good (or bad). 1/σr 0 . . yn cm 0 0 . . This can be written as T U ΣV T x − b . . . which both proves uniqueness and provides a recipe for the computation of the solution.7) where † Σ = is an n × m diagonal matrix. U U = I. In order to ﬁnd the solution to this minimization problem. let us spell out the last expression. 0 ··· 0 . Proof. Theorem 3. 0 .. so that the last expression above equals ΣV T x − U T b or. that is.1 The minimumnorm least squares solution to a linear system Ax = b. . and is given by ˆ = V Σ† U T b x (3. that is. Σy − c . The minimumnorm Least Squares solution to Ax = b is the shortest vector x that minimizes Ax − b that is. . .. . THE SINGULAR VALUE DECOMPOSITION If there are several leastsquares solutions. the shortest vector x that achieves the min Ax − b . . 0 0 ··· . . . We want to minimize the norm of the following vector: σ1 0 ··· 0 y1 c1 0 . . x is unique. This minimum norm solution is the subject of the following theorem. .. . .. with y = V T x and c = U T b. y r+1 − cr+1 . . One can therefore redeﬁne what it means to “solve” a linear system so that there is always exactly one solution. σr yr cr .8) because U is an orthogonal matrix. then one of them turns out to be shorter than all the others. The matrix A† = V Σ† U T 1/σ1 .1.2). U (ΣV T x − U T b) (3. . . But orthogonal matrices do not change the norm of vectors they are applied to (theorem 3.3. ··· 0 . its norm x is smallest.
The residual. Thus. . by deﬁnition). we also want the shortest y. . which is therefore unique: minimum residual forces the choice of y1 . . that is. what is desired is a nonzero vector x that satisﬁes the system (3. that is. the meaning of a leastsquares solution is therefore usually modiﬁed. the desired y has the following components: yi yi = ci for i = 1.4 LeastSquares Solution of a Homogeneous Linear Systems Theorem 3. . Although correct. for the shortest vector x. leastsquares solution to the original system is the unique vector ˆ = V y = V Σ+ c = V Σ+ U T b x as promised. .8)). is the norm of Σy − c. . the square of the residual is m m Ax − b 2 = Σy − c 2 = i=r+1 c2 = i (uT b)2 i i=r+1 which is the projection of the righthand side vector b onto the complement of the range of A. = yn = 0. yr . Thus. In each of the ﬁrst r differences. we would fall back to x = 0 again.9) as well as possible. In many applications. . Without any constraints on x.9) . x = 0 is the minimumnorm solution to any homogeneous linear system. .3. .3. since this vector is related to Ax − b by an orthogonal transformation (see equation (3. In summary. . by imposing the constraint x =1 (3. when the system is homogeneous. this solution is not too interesting. . . LEASTSQUARES SOLUTION OF A HOMOGENEOUS LINEAR SYSTEMS The last m − r differences are of the form cr+1 . the last n − r components of y are multiplied by zeros. once more.4. In conclusion. Notice that there is no other choice for y. In other words. the solution is trivial: the minimumnorm solution to Ax = 0 is x=0. When b = 0. r σi = 0 for i = r + 1. which happens to be an exact solution. there is freedom in their choice. this is y = Σ+ c . . so they have no effect on the solution. cm 33 and do not depend on the unknown y. . and each of them contributes a residual ci  to the solution. there is nothing we can do about those differences: if some or all the ci for i = r + 1. we will not be able to zero these differences.1 works regardless of the value of the righthand side vector b. that is. ∆ 3. . but it is obviously the one with the smallest norm. . on the other hand. n . 0− . Since we look for the minimumnorm solution. m are nonzero. We therefore set yr+1 = . . When written as a function of the vector c. . and minimumnorm solution forces the other entries of y. Thus. . the norm of Ax − b when x is the solution vector. For homogeneous linear systems. the minimumnorm. Of course it is not necessarily the only one (any vector in the null space of A is also a solution. because x and y are related by an orthogonal transformation.
34 CHAPTER 3. .1. then the minimumnorm solution is unique. We thus look for the unitnorm vector y that minimizes the norm (squared) of Σy. Unfortunately. + σn yn . so that equation (3. + αk = 1 (3. + αk vn with 2 2 α1 + . .2).11) are unitnorm least squares solutions to the homogeneous linear system Ax = 0. .11). . . on the other case. . . In any event. . U ΣV T x . ∆ Section 3. x = vn . since it is very unlikely that different singular values have exactly the same numerical value.4. and shows that there is in general a whole hypersphere of solutions. that is by letting y1 = . . (3. Proof. vn be the k columns of V whose corresponding singular values are equal to the last singular value σn . if k = 1. with y = V T x. the theorem above shows how to express all solutions as a linear combination of the last k columns of V . Theorem 3.12) is equivalent to equation (3. the resulting constrained minimization problem does not necessarily admit a unique solution. . . it may often have more than one singular value equal to zero. all vectors of the form x = α1 vn−k+1 + . The following theorem provides a recipe for ﬁnding this solution. . . 2 2 2 2 σ1 y1 + . = yn−k = 0. . αk = yn . . .4. .1. This is obviously achieved by concentrating all the (unit) mass of y where the σs are smallest. let k be the largest integer such that σn−k+1 = . Then. x = 1 translates to y = 1. let vn−k+1 . that is. and the unitnorm constraint on y yields equation (3. that is.5 shows a sample use of theorem 3. . The unitnorm Least Squares solution to Ax = 0 is the vector x with x = 1 that minimizes Ax that is. . .10) (3. If k > 1.1 Let A = U ΣV T be the singular value decomposition of A. x =1 Note: when σn is greater than zero the most common case is k = 1. Since V is orthogonal. When A is rank deﬁcient. + yn vn .12) From y = V T x we obtain x = V y = y1 v1 + . that is. Since orthogonal matrices do not change the norm of vectors they are applied to (theorem 3.10) with α1 = yn−k+1 . this norm is the same as ΣV T x or. Σy . THE SINGULAR VALUE DECOMPOSITION on the solution. . The reasoning is very similar to that for the previous theorem. Furthermore. = σn . they achieve the min Ax .
(3.5. at the minimum (3.16) If we deﬁne the centroid p of all the points pi as p= 1 T P 1. is called the line normal. . (3.5.13). SVD LINE FITTING 35 3.15) we must have =0. orthogonal to the line. . the line does not change. the bestﬁt line achieves the min d n =1 In equation (3. and let be the equation of a line.3: The distance between point pi = (xi .15). dm ) and P = (p1 .13) where the unit vector n = (a. yi )T be a set of m ≥ 2 points on the plane. The bestﬁt line minimizes the sum of the squared distances.14) i pi c b a Figure 3.1 Fitting a Line to a Set of Points ax + by − c = 0 Let pi = (xi .5. Thus. yi )T and line ax + by − c = 0 is axi + byi − c. b)T . .3. 2 = min P n − c1 n =1 2 . and the distance between the line n and a point pi is equal to di = axi + byi − c = pT n − c .5 SVD Line Fitting The Singular Value Decomposition of a matrix yields a simple method for ﬁtting a line to a set of points on the plane. m .3). . Thus. 3.2 The Best Line Fit ∂ d ∂c 2 Since the third line parameter c does not appear in the constraint (3. if we let d = (d1 . . (3. pm )T . 1 is a vector of m ones. .15) 3. If the lefthand side of this equation is multiplied by a nonzero constant. we can assume without loss of generality that n = a2 + b2 = 1 . . The distance from the line to the origin is c (see ﬁgure 3. (3.
the shortest vector of the form Qn is obtained when n is the right singular vector v2 corresponding to the smaller σ2 of the two singular values of Q.36 equation (3. Equivalently. the residue is min d = σ2 n =1 and more speciﬁcally the distances di are given by d = σ2 u2 where u2 is the left singular vector corresponding to σ2 . THE SINGULAR VALUE DECOMPOSITION = ∂ nT P T − c1T (P n − 1c) ∂c ∂ = nT P T P n + c2 1T 1 − 2nT P T c1 ∂c = 2 mc − nT P T 1 = 0 1 T T n P 1. compute the centroid of the points (1 is a vector of m ones): p= 2. c = pT n . .16) yields ∂ d ∂c 2 CHAPTER 3.1. pm )T . we can recall that if n is on a circle. .4. to ﬁt a line (a. where Q = P − 1pT collects the centered coordinates of the m points. and in order to emphasize the geometric meaning of signular values and vectors. the SVD 2 Q = U ΣV T = i=1 σi ui vT i yields 2 Qn = Qv2 = i=1 σi ui vT v2 = σ2 u2 i because v1 and v2 are orthonormal vectors. proceed as follows: 1. Furthermore. form the matrix of centered coordinates: Q = P − 1pT 3. By replacing this expression into equation (3. we obtain min d n =1 2 = min P n − 1pT n n =1 2 = min Qn n =1 2 . compute the SVD of Q: Q = U ΣV T 1 T P 1 m . b. In fact. c) to a set of m points pi collected in the m × 2 matrix P = (p1 . . We can solve this constrained minimization problem by theorem 3. when n = v2 . To summarize.15). m from which we obtain c= that is. since Qv2 has norm σ2 .
end if m < 2. function [l. 5.5. b)T = v2 . the residue of the ﬁt is The following matlab code implements the line ﬁtting method. % the smallest singular value of Q % measures the residual fitting error residue = Sigma(2. Here “plus” means the following. An afﬁne subspace is a linear subspace plus a point. error(’Need at least two points’). if n ˜= 2. or something close to it. 2). The ﬁtting problem (including ﬁtting a line to a set of points) can be cast either as a maximization or a minimization problem.3. 2). p’ * n]. [U Sigma V] = svd(Q). Let L be a linear space. 1). % matrix of centered coordinates Q = P . the line normal is the second column of the 2 × 2 matrix V : n = (a. just like an arbitrary line is a line through the origin plus a point. % assemble the three line coefficients into a column vector l = [n . % the line normal is the second column of V n = V(:. residue] = linefit(P) % check input matrix sizes [m n] = size(P). error(’matrix P must be m x 2’). can be adapted to ﬁt a set of data points in Rm with an afﬁne subspace of given dimension n. end one = ones(m. SVD LINE FITTING 4. the third coefﬁcient of the line is 37 c = pT n min d = σ2 n =1 6. . A useful exercise is to think how this procedure. % centroid of all the points p = (P’ * one) / m. Then an afﬁne space has the form A = p + L = {a  a = p + l and l ∈ L} .one * p’. Hint: minimizing the distance between a point and a subspace is equivalent to maximizing the norm of the projection of the point onto the subspace.
38 CHAPTER 3. THE SINGULAR VALUE DECOMPOSITION .
Chapter 4
Function Optimization
There are three main reasons why most problems in robotics, vision, and arguably every other science or endeavor take on the form of optimization problems. One is that the desired goal may not be achievable, and so we try to get as close as possible to it. The second reason is that there may be more ways to achieve the goal, and so we can choose one by assigning a quality to all the solutions and selecting the best one. The third reason is that we may not know how to solve the system of equations f(x) = 0, so instead we minimize the norm f(x) , which is a scalar function of the unknown vector x. We have encountered the ﬁrst two situations when talking about linear systems. The case in which a linear system admits exactly one exact solution is simple but rare. More often, the system at hand is either incompatible (some say overconstrained) or, at the opposite end, underdetermined. In fact, some problems are both, in a sense. While these problems admit no exact solution, they often admit a multitude of approximate solutions. In addition, many problems lead to nonlinear equations. Consider, for instance, the problem of Structure From Motion (SFM) in computer vision. Nonlinear equations describe how points in the world project onto the images taken by cameras at given positions in space. Structure from motion goes the other way around, and attempts to solve these equations: image points are given, and one wants to determine where the points in the world and the cameras are. Because image points come from noisy measurements, they are not exact, and the resulting system is usually incompatible. SFM is then cast as an optimization problem. On the other hand, the exact system (the one with perfect coefﬁcients) is often close to being underdetermined. For instance, the images may be insufﬁcient to recover a certain shape under a certain motion. Then, an additional criterion must be added to deﬁne what a “good” solution is. In these cases, the noisy system admits no exact solutions, but has many approximate ones. The term “optimization” is meant to subsume both minimization and maximization. However, maximizing the scalar function f (x) is the same as minimizing −f (x), so we consider optimization and minimization to be essentially synonyms. Usually, one is after global minima. However, global minima are hard to ﬁnd, since they involve a universal quantiﬁer: x∗ is a global minimum of f if for every other x we have f (x) ≥ f (x∗ ). Global minization techniques like simulated annealing have been proposed, but their convergence properties depend very strongly on the problem at hand. In this chapter, we consider local minimization: we pick a starting point x0 , and we descend in the landscape of f (x) until we cannot go down any further. The bottom of the valley is a local minimum. Local minimization is appropriate if we know how to pick an x0 that is close to x∗ . This occurs frequently in feedback systems. In these systems, we start at a local (or even a global) minimum. The system then evolves and escapes from the minimum. As soon as this occurs, a control signal is generated to bring the system back to the minimum. Because of this immediate reaction, the old minimum can often be used as a starting point x0 when looking for the new minimum, that is, when computing the required control signal. More formally, we reach the correct minimum x∗ as long as the initial point x0 is in the basin of attraction of x∗ , deﬁned as the largest neighborhood of x∗ in which f (x) is convex. Good references for the discussion in this chapter are Matrix Computations, Practical Optimization, and Numerical Recipes in C, all of which are listed with full citations in section 1.4. 39
40
CHAPTER 4. FUNCTION OPTIMIZATION
4.1 Local Minimization and Steepest Descent
Suppose that we want to ﬁnd a local minimum for the scalar function f of the vector variable x, starting from an initial point x0 . Picking an appropriate x0 is crucial, but also very problemdependent. We start from x0 , and we go downhill. At every step of the way, we must make the following decisions: • Whether to stop. • In what direction to proceed. • How long a step to take. In fact, most minimization algorithms have the following structure: k=0 while xk is not a minimum compute step direction pk with pk = 1 compute step size αk xk+1 = xk + αk pk k =k+1 end. Different algorithms differ in how each of these instructions is performed. It is intuitively clear that the choice of the step size αk is important. Too small a step leads to slow convergence, or even to lack of convergence altogether. Too large a step causes overshooting, that is, leaping past the solution. The most disastrous consequence of this is that we may leave the basin of attraction, or that we oscillate back and forth with increasing amplitudes, leading to instability. Even when oscillations decrease, they can slow down convergence considerably. What is less obvious is that the best direction of descent is not necessarily, and in fact is quite rarely, the direction of steepest descent, as we now show. Consider a simple but important case, 1 f (x) = c + aT x + xT Qx 2 (4.1)
where Q is a symmetric, positive deﬁnite matrix. Positive deﬁnite means that for every nonzero x the quantity xT Qx is positive. In this case, the graph of f (x) − c is a plane aT x plus a paraboloid. Of course, if f were this simple, no descent methods would be necessary. In fact the minimum of f can be found by setting its gradient to zero: ∂f = a + Qx = 0 ∂x so that the minimum x∗ is the solution to the linear system Qx = −a . (4.2)
Since Q is positive deﬁnite, it is also invertible (why?), and the solution x∗ is unique. However, understanding the behavior of minimization algorithms in this simple case is crucial in order to establish the convergence properties of these algorithms for more general functions. In fact, all smooth functions can be approximated by paraboloids in a sufﬁciently small neighborhood of any point. Let us therefore assume that we minimize f as given in equation (4.1), and that at every step we choose the direction of steepest descent. In order to simplify the mathematics, we observe that if we let e(x) = ˜ then we have 1 (x − x∗ )T Q(x − x∗ ) 2 (4.3)
1 e(x) = f (x) − c + x∗ T Qx∗ = f (x) − f (x∗ ) ˜ 2
4.1. LOCAL MINIMIZATION AND STEEPEST DESCENT
so that e and f differ only by a constant. In fact, ˜ e(x) = ˜ 1 T 1 1 1 (x Qx + x∗ T Qx∗ − 2xT Qx∗ ) = xT Qx + aT x + x∗ T Qx∗ = f (x) − c + x∗ T Qx∗ 2 2 2 2 1 1 1 f (x∗ ) = c + aT x∗ + x∗ T Qx∗ = c − x∗ T Qx∗ + x∗ T Qx∗ = c − x∗ T Qx∗ . 2 2 2 The result (4.3), e(x) = f (x) − f (x∗ ) , ˜
41
and from equation (4.2) we obtain
is rather interesting in itself. It says that adding a linear term aT x (and a constant c) to a paraboloid 1 xT Qx merely 2 1 shifts the bottom of the paraboloid, both in position (x∗ rather than 0) and value (c − 2 x∗ T Qx∗ rather than zero). Adding the linear term does not “warp” or “tilt” the shape of the paraboloid in any way. Since e is simpler, we consider that we are minimizing e rather than f . In addition, we can let ˜ ˜ y = x − x∗ , that is, we can shift the origin of the domain to x∗ , and study the function e(y) = 1 T y Qy 2
instead of f or e, without loss of generality. We will transform everything back to f and x once we are done. Of ˜ course, by construction, the new minimum is at y∗ = 0 where e reaches a value of zero: e(y∗ ) = e(0) = 0 .
However, we let our steepest descent algorithm ﬁnd this minimum by starting from the initial point y0 = x0 − x∗ . At every iteration k, the algorithm chooses the direction of steepest descent, which is in the direction pk = − opposite to the gradient of e evaluated at yk : gk = g(yk ) = ∂e = Qyk . ∂y y=y k gk gk
We select for the algorithm the most favorable step size, that is, the one that takes us from yk to the lowest point in the direction of pk . This can be found by differentiating the function e(yk + αpk ) = 1 (y + αpk )T Q(yk + αpk ) 2 k
with respect to α, and setting the derivative to zero to obtain the optimal step αk . We have ∂e(yk + αpk ) = (yk + αpk )T Qpk ∂α and setting this to zero yields αk = − (Qyk )T pk gT p pT p gT g = − Tk k = gk Tk k = gk Tk k . pT Qpk pk Qpk pk Qpk gk Qgk k (4.4)
6) so if we can bound the expression in parentheses we have a bound on the rate of convergence of steepest descent. Lemma 4.5) we obtain e(yk ) − e(yk+1 ) e(yk ) = yT Qyk − yT Qyk+1 k k+1 yT Qyk k T gT gk gT gk yT Qyk − yk − gTkQg gk Q yk − gTkQg gk k k k k k yT Qyk k = = = Since Q is invertible we have gT gk gT gk 2 gTkQg gT Qyk − gTkQg k k k k k yT Qyk k 2gT gk gT Qyk − (gT gk )2 k k k .5) How much closer did this step bring us to the solution y∗ = 0? In other words. Luenberger. the basic step of our steepest descent can be written as follows: yk+1 = yk + gk that is. From the deﬁnition of e and from equation (4. To this end. respectively. relative to the value e(yk ) at the previous step? The answer is.1. The arguments and proofs below are adapted from D. FUNCTION OPTIMIZATION gT gk k p gT Qgk k k gT gk k g . we introduce the following result. AddisonWesley.1 (Kantorovich inequality) Let Q be a positive deﬁnite. For any vector y there holds 4σ1 σn (yT y)2 ≥ yT Q−1 y yT Qy (σ1 + σn )2 where σ1 and σn are. the largest and smallest singular values of Q. e(yk ) gk Q gk gk Qgk This can be rewritten as follows by rearranging terms: e(yk+1 ) = 1− (gT gk )2 k T Q−1 g gT Qg gk k k k e(yk ) (4. yT Qyk gT Qgk k k 2 gT Qgk k gk = Qyk and ⇒ yk = Q−1 gk yT Qyk = gT Q−1 gk k k so that e(yk ) − e(yk+1 ) (gT g )2 = T −1k k T . how much smaller is e(yk+1 ). 1973. as we shall now prove. . n × n matrix. G.42 Thus. often not much. gT Qgk k k (4. Introduction to Linear and Nonlinear Programming. symmetric. yk+1 = yk − CHAPTER 4.
.8). . For the same vector of coefﬁcients θi .7) where the coefﬁcients θi σi . each of the singular values σj can be obtained by letting θj = 1 and all other θi to zero. ∆ Thanks to this lemma. the method of steepest descent xk+1 = xk − where gk = g(xk ) = gT gk k g gT Qgk k k (4.9) ∂f = a + Qxk ∂x x=xk . Thus an appropriate bound is φ(σ) φ(σ) 1/σ ≥ min = min .8) then the numerator φ(σ) in (4. Of course. that is. the values 1/σj for j = 1. all its singular values are strictly positive. with Q symmetric and positive deﬁnite. . However. . (4.2 Let 1 f (x) = c + aT x + xT Qx 2 be a quadratic function of x. LOCAL MINIMIZATION AND STEEPEST DESCENT Proof. n are all on the curve 1/σ. there are many ways to choose the coefﬁcients θi to obtain a particular value of σ. Let 43 Q = U ΣU T be the singular value decomposition of the symmetric (hence V = U ) matrix Q. Theorem 4. ψ(σ) σ1 ≤σ≤σn λ(σ) σ1 ≤σ≤σn (σ1 + σn − σ)/(σ1 σn ) The minimum is achieved at σ = (σ1 + σn )/2. This area is delimited from above by the straight line that connects point (σ1 .7) must be in the shaded area in ﬁgure 4. the values of the denominator ψ(σ) of (4. 1/σ1 ) with point (σn . since the smallest of them satisﬁes σn = min yT Qy > 0 y =1 by the deﬁnition of positive deﬁniteness.1.7) is a convex combination of points on this curve. we can state the main result on the convergence of the method of steepest descent. If we let z = UT y we have 1/ (yT y)2 (yT U T U y)2 (zT z)2 = T = T −1 T = T Q−1 y yT Qy −1 U T y yT U ΣU T y y y UΣ z Σ z z Σz θi = add up to one.1. the values of φ(σ). and λ(σ) are on the vertical line corresponding to the value of σ given by (4. Thus.7) is 1/σ. The denominator ψ(σ) in (4. If we let σ= i=1 n 2 zi z 2 n i=1 θi σi n i=1 θi /σi = φ(σ) ψ(σ) (4. 1/σn ).1. by the line with ordinate λ(σ) = (σ1 + σn − σ)/(σ1 σn ) . Because Q is positive deﬁnite.4. Since 1/σ is a convex function of σ. ψ(σ). yielding the desired result. For any x0 .
the closer the fraction (σ1 − σn )/(σ1 + σn ) is to unity. it follows immediately that f (xk ) − f (x∗ ) → 0 and hence. FUNCTION OPTIMIZATION φ.6) and the Kantorovich inequality we obtain f (xk+1 ) − f (x∗ ) = e(yk+1 ) = = σ1 − σn σ1 + σn 1− 2 (gT gk )2 k T Q−1 g gT Qg gk k k k e(yk ) ≤ 1− 4σ1 σn (σ1 + σn )2 e(yk ) (4.10) we immediately obtain the expression for steepest descent in terms of f and x. Furthermore. It is easily seen why this happens in the case . respectively. Since the ratio in the last term is smaller than one. By equations (4.11) (4. at every step k there holds f (xk+1 ) − f (x∗ ) ≤ σ1 − σn σ1 + σn 2 (f (xk ) − f (x∗ )) where σ1 and σn are.1: Kantorovich inequality. converges to the unique minimum point x∗ = −Q−1 a of f . since the minimum of f is unique. The larger the condition number. that xk → x∗ .3) and (4.λ λ(σ) ψ(σ) φ(σ) σ1 σ2 σ σn σ Figure 4.ψ. and the slower convergence.12) (f (xk ) − f (x∗ )) . ∆ The ratio κ(Q) = σ1 /σn is called the condition number of Q. From the deﬁnitions y = x − x∗ and e(y) = 1 T y Qy 2 (4. the largest and smallest singular value of Q. Proof.44 CHAPTER 4.
1. that is. the greater the condition number κ(Q). σ1 + σn The more elongated the isocontours.4. To characterize the speed of convergence of different minimization algorithms. as in ﬁgure 4. There is one good. but can become arbitrarily slow for poorly conditioned Hessians. which is orthogonal to the isocontour at xk . which shows the trajectory xk superimposed on a set of isocontours of f (x). namely.1. Q is the matrix of second derivatives of f with respect to x. If isocontours were circles (σ1 = σn ) centered at x∗ . For general (that is. the farther away a line orthogonal to an isocontour passes from x∗ . and minimization would get there in just one more step. The next step is orthogonal to the latter isocontour (that is. In that case. then close to the solution (that is. at every step the steepest descent trajectory is forced to make a ninetydegree turn. is consistent with our analysis. In fact. but very precarious case.2 implies that steepest descent has at best a linear order of convergence. the analysis above applies once xk gets close enough to the minimum. parallel to the gradient). the residuals f (xk ) − f (x∗ ) in the values of the function being minimized converge . In other words. and is called the Hessian of f . If β is this limit. nonquadratic) f . we introduce the notion of the order of convergence. This case.2. the line in the direction pk of steepest descent. Theorem 4. In all other cases. so the higher the order of convergence. in which κ(Q) = 1. then the ﬁrst turn would make the new direction point to x∗ . The minimum of f along that line is tangent to some other. so that f is well approximated by a paraboloid. and the more steps are required for convergence. This is deﬁned as the largest value of q for which the lim xk+1 − x∗ xk − x∗ q k→∞ is ﬁnite. in which x is a twodimensional vector. the better. will not pass through x∗ .2: Trajectory of steepest descent. Thus. In summary. one iteration will lead to the minimum x∗ . for large values of k) we have xk+1 − x∗ ≈ β xk − x∗ q for a minimization method of order q. lower isocontour. when the starting point x0 is at one apex (tip of either axis) of an isocontour ellipse. LOCAL MINIMIZATION AND STEEPEST DESCENT 45 x* x0 p0 p1 Figure 4. because then σ1 − σn =0. In this case. the distance of xk from x∗ is reduced by the qth power at every step. steepest descent is good for functions that have a well conditioned Hessian near the minimum.
Let h(α) = f (xk + αpk ) (4. . . . b) end otherwise u = (b + c)/2 if f (u) > f (b) (a. c) = (a. Shrinking works as follows: if b − a > c − b u = (a + b)/2 if f (u) > f (b) (a. Then. c) until c − a is smaller than the desired accuracy in determining αk . FUNCTION OPTIMIZATION linearly. and therefore preferable in most cases. in the sense that a ≤ αk ≤ c. b. b. corresponding through equation (4. One criterion is to check whether the value of f (xk ) has signiﬁcantly decreased from f (xk−1 ). In our analysis of steepest descent. Another is to check whether xk is signiﬁcantly different from xk−1 . so the function is initially decreasing. Thus. c) = (u. u. Using line search to ﬁnd αk guarantees that a minimum in the direction pk is actually reached even when the parabolic approximation is inadequate. rather than in that of f (x∗ ). the check on xk is more stringent. In contrast. In fact.4)). In fact. and then picks a point between a and c. c) otherwise (a. we can set a = 0. Line search ﬁrst determines two points a.13) to the starting point xk . The only difﬁculty here is to ﬁnd c. Line search now proceeds by shrinking the bracketing triple (a. We used Q because it was available. b = (a + c)/2. Close to the minimum. b. the steepest descent algorithm can be stopped when xk − xk−1 < where the positive constant is provided by the user. . c) = (b. Since the gradient of f approaches zero when xk tends to x∗ . To complete the steepest descent algorithm we need to specify how to check whether a minimum has been reached. Of course. b. only gradient information is necessary to ﬁnd pk . so the minimum must be somewhere between a and c. so f (xk ) − f (xk−1 ) may be very small but xk − xk−1 may still be relatively large. b. c) end end. c) = (a. b. say. α2 . if we can assume that h is convex between α1 and αi . Here is how line search works. we can set c = αi . u) otherwise (a. until h(αi ) is greater than h(αi−1 ). the derivative of h at a is negative.13) be the scalar function of one variable that is obtained by restricting the function f to the line through the current point xk and in the direction of pk . u. but its computation during steepest descent would in general be overkill. and a line search in the direction of pk can be used to determine the step size αk . the derivatives of f are close to zero. In fact. but there is no generalpurpose ﬁx to this problem. we used the Hessian Q in order to compute the optimal step size α (see equation (4. the Hessian of f (x) requires computing n second derivatives if x is an ndimensional 2 vector. In fact. the arguments xk to f can converge to x∗ even more slowly. if we cannot assume convexity. b. but it is increasing between αi−1 and αi = c. c that bracket the desired minimum αk . we may ﬁnd the wrong minimum.46 CHAPTER 4. A point c that is on the opposite side of the minimum with respect to a can be found by increasing α through values α1 = a. usually one is interested in the value of x∗ . In summary.
. while gk and Qk are the coefﬁcients of the Taylor expansion of f around the current point xk . a and Q are the coefﬁcients of the Taylor expansion of f around point x = 0. we can set the derivatives of f (xk + ∆x) with respect to ∆x to zero. the analysis in the previous section suggests a straightforward ﬁx to the slow convergence of steepest descent. The greater speed of Newton’s method over steepest descent is borne out by analysis: while steepest descent has a linear order of convergence. . b. . so line search will ﬁnd the minimum in a number of steps that is logarithmic in the desired accuracy. equation (4. In addition. and suppose that at the minimum x∗ the Hessian Q(x∗ ) is nonsingular. iterations are needed. Consequently. when αk = 0. This is the idea of Newton’s method. In fact. To the extent that approximation (4. . A minimization algorithm in which the step direction pk and size αk are deﬁned in this manner is called Newton’s method. In other words.2) tells us how to jump in one step from the starting point x0 to the minimum x∗ .15) is the Newton step. ∂2f ∂x2 n ∂2f ∂xn ∂x1 are the gradient and Hessian of f evaluated at the current point xk . .2 Newton’s Method If a function can be well approximated by a paraboloid in the region in which minimization is performed. Notice that even when f is a paraboloid.2. where gk = g(xk ) = and Qk = Q(xk ) = ∂2f = T ∂x∂x x=xk ∂f ∂x x=xk ∂2f ∂x2 1 (4. the gradient gk is different from a as used in equation (4. gradient and Hessian are constantly reevaluated in Newton’s method. that is. In fact. and the step deﬁned by equation (4. Then y(x∗ ) = x∗ because g(x∗ ) = 0. at every step the interval (a. Of course. Newton’s method is quadratic. In fact.1). the new value x1 will be different from x∗ . The direction is of course undeﬁned once the algorithm has reached a minimum.2). (4. Let 1 f (xk + ∆x) ≈ f (xk ) + gT ∆x + ∆xT Qk ∆x k 2 be the ﬁrst terms of the Taylor series expansion of f about the current point xk . and therefore the minimum is somewhere between a and c. and xk+1 − x∗ = y(xk ) − x∗ = y(xk ) − y(x∗ ) . the linear system Qk ∆x = −gk .15)). 4.14) ··· ··· ∂2f ∂x1 ∂xn x= xk .14) is valid. which we now summarize.15) whose solution ∆xk = αk pk yields at the same time the step direction pk = ∆xk / ∆xk and the step size αk = ∆xk . c) preserves the property that f (b) ≤ f (a) and f (b) ≤ f (c). and obtain. but convergence can be expected to be faster. analogously to equation (4. let y(x) = x − Q(x)−1 g(x) be the place reached by a Newton step starting at x (see equation (4. c) shrinks to 3/4 of its previous size.4. The corresponding pk is termed the Newton direction. NEWTON’S METHOD 47 It is easy to see that in each case the bracketing triple (a. . when f (x) is not exactly a paraboloid.
and the matrix Q is never stored. It effectively solves the linear system (4. Newton’s method approximates f with paraboloids. conjugate gradients for the quadratic case can also be seen as an alternative method for solving a linear system. Overall.3 Conjugate Gradients Newton’s method converges faster (quadratically) than steepest descent (linear convergence rate) because it uses more information about the function f being minimized. for a total of n + n derivatives. The bottom line is that fast convergence requires work that is equivalent to evaluating the Hessian of f . the Hessian must be inverted. For a quadratic function. the convergence rate of Newton’s method is of order at least two. Steepest descent locally approximates the function with planes. while Newton’s method reaches the minimum in one step. system (4.15) must be solved. this motion. This means that the number of iterations is ﬁxed. similarly to Newton’s method. First. we have xk+1 − x∗ = y(xk ) − y(x∗ ) ≤ CHAPTER 4. because it only uses gradient information. called the motion ﬁeld. All it can do is to go downhill. At every pixel in the image. FUNCTION OPTIMIZATION ∂y 1 ∂2y (xk − x∗ ) + xk − x∗ T ∂x x=x∗ 2 ∂x∂xT x=x ˆ 2 where ˆ is some point on the line between x∗ and xk . 4. it will be developed for the simple case of minimizing a quadratic function with positivedeﬁnite and known Hessian. Prima facie. the ﬁrst derivatives of y at x∗ are zero. although the version presented here will only work if the matrix of the system is symmetric and positive deﬁnite. is only apparent. 000 × 500.2). where n is the number of components of x. but it only requires gradient information. storing and manipulating a Hessian can be prohibitive. We know that in this case minimizing f (x) is equivalent to solving the linear system (4. This quadratic function f (x) was introduced in equation (4. 000 Hessian is out of the question. each requiring one gradient computation and one line search. is represented by a vector whose magnitude and direction describe the velocity of the image feature at that pixel. Say for instance that we want to compute the motion of points in an image as a consequence of camera motion. The conjugate gradients method described in these notes is the socalled PolakRibi` re variation. or. say. However. however. so that x the ﬁrst term in the righthand side above vanishes. Thus. but it does so by a sequence of n onedimensional minimizations. but without paying the storage cost of Newton’s method. the method converges to the solution in n steps. is motivated by the desire to accelerate convergence with respect to the steepest descent method. This paradox. steepest descent takes many steps to converge.2) of Newton’s method. 000 unknown motion ﬁeld values. the work done by conjugate gradients is equivalent to that done by Newton’s method. These highdimensional problems arise typically from the discretization of partial differential equations. in which the dimension n of x is thousands or more.1). steepest descent requires the gradient gk for selecting the step direction pk . at least. In contrast. conjugate gradients saves the day. a quarter of a million pixels. and a line search in the direction pk to ﬁnd the step size. Rather than an iterative method. there are n = 500. For very large problems. Partial differential equations relate image intensities over space and time to the motion of the underlying image features. system (4. superlinear convergence. conjugate gradients is a direct method for the quadratic case. Since y(x∗ ) = x∗ .2) is never constructed explicitly. In cases like these. Thus. the method of conjugate gradients discussed in this section seems to violate this principle: it achieves fast. It will be introe duced in three steps. However. Storing and inverting a 500. The method of conjugate gradients. discussed in the next section. Conjugate gradients works by taking n steps for each of the steps in Newton’s method. if an image has. This is very important in cases where x has thousands or even millions of components.48 From the meanvalue theorem. and then jumps at every iteration to the lowest point of the current approximation. Because of the equivalence with a linear system. this single iteration in Newton’s method is more expensive. because it requires both the gradient gk and the Hessian Qk to be evaluated. In 2 addition. . as in equation (4.1). Speciﬁcally. and xk+1 − x∗ ≤ c xk − x∗ 2 where c depends on thirdorder derivatives of f near x∗ .
We now consider an alternative solution method that does not need Q. if f is nearly quadratic in a large neighborhood of the minimum. the method is no longer direct. and so forth. As we saw in our discussion of steepest descent. (4. . we noticed that for poorly conditioned Hessians orthogonal directions lead to many small steps. then we can write 1 1 T x Qx = yT y 2 2 . Third. 1997. If the function f starts to behave like a quadratic function early. in which the variable x in f (x) is a threedimensional vector. evaluated at n different points x1 . on the other hand. . Suppose however that we cannot afford to compute this special direction p1 orthogonal to p0 . convergence is fast. the minimum x∗ is the solution to the linear system Qx = −a . but iterative. Given an arbitrary set of ellipsoidal isosurfaces. We rely instead on the intuition that conjugate gradients solves system (4. NY. This would then take us to x∗ right away. this is the main reason for using conjugate gradients. the conjugate gradients method will be extended to general functions f (x). However. the gradient of f (x). and low computational cost per iteration explains the popularity of conjugate gradients methods for the optimization of functions of a large number of variables. In summary. and the cost of ﬁnding the minimum depends on the desired accuracy. and x has n components.2). the assumption that the Hessian Q in expression (4. 4. and that the quadratic approximation becomes more and more valid as the algorithm converges to the minimum. positive deﬁnite matrix Q. In fact. As discussed above. Consider the case n = 3. but that we can only compute some direction p1 orthogonal to p0 (there is an n − 1dimensional space of such directions!). Finding tight bounds for the convergence rate of conjugate gradients is hard. this works much better. called isosurfaces. This occurs because the Hessian of f is no longer a constant. Then the quadratic function f (x) is constant over ellipsoids.16) where Q is a symmetric. a certain property that holds in the quadratic case is now valid only approximately. After n steps we must be done. with a line search in each direction. In spite of this. will lead to the minimum for spherical isosurfaces. We will see that the conjugate gradients method requires n gradient evaluations and n line searches in lieu of each n × n matrix inversion in Newton’s method. . modest storage requirements. This combination of fast convergence. and the line between x0 and x1 is tangent to an isosurface at x1 .4. as it was in the quadratic case. somewhere between Newton’s method and steepest descent. Springer. . . since isosurfaces are spheres. so that the new direction p1 is orthogonal to the previous direction p0 . there is a onetoone mapping with a spherical system: if Q = U ΣU T is the SVD of the symmetric. centered at the minimum x∗ . The next step is in the direction of the gradient. and we will omit this proof.17) We know how to solve such a system. positive deﬁnite matrix. and each of the steps is simple.3. the convergence rate of conjugate gradients is superlinear. CONJUGATE GRADIENTS 49 Second. as it requires close to the n steps that are necessary in the quadratic case. pn−1 span the whole space. How can we start from a point x0 on one of these ellipsoids and reach x∗ by a ﬁnite sequence of onedimensional searches? In connection with steepest descent. As a consequence. Optimization — Algorithms and consistent approximations. since p0 . that is. The arguments offered below appeal to intuition. all the methods we have seen so far involve explicit manipulation of the matrix Q. It is easy to see that in that case n steps will take us to x∗ . the second step then yields the minimum in the space spanned by p0 and p1 . to slow convergence. any set of orthogonal directions.3. .1 The Quadratic Case 1 f (x) = c + aT x + xT Qx 2 Suppose that we want to minimize the quadratic function (4. Formal proofs can be found in Elijah Polak. When the ellipsoids are spheres. xn . In this case.1) is known will be removed. that is. . The ﬁrst step takes from x0 to x1 . but only the quantity gk = Qxk + a that is. each line minimization is independent of the others: The ﬁrst step yields the minimum in the space spanned by p0 .
Details can be found in Polak’s book mentioned earlier.50 where CHAPTER 4. If two directions qi and qj are orthogonal in the spherical context. Of course.j = Σ1/2 U T pi.19) This condition is called Qconjugacy. due to Hestenes and Stiefel (Methods of conjugate gradients for solving linear systems. So now we need to ﬁnd a method for generating n conjugate directions without using either Q or its SVD. k are also conjugate. 1952). we reach the minimum in at most n steps. in the next subsection.18) in the algorithm. pn−1 that are mutually conjugate.j . . (4. . there must be a condition for the original problem (in terms of Q) that is equivalent to orthogonality for the spherical problem. 409436. In summary. This can be done by induction. section B. J. Res. The proof is based on the observation that the vectors pk are found by a generalization of GramSchmidt (theorem 2. .19) holds. . . Here is the procedure. Bureau National Standards. FUNCTION OPTIMIZATION y = Σ1/2 U T x . Vol 49. we rewrite these expressions without Q. . or Qorthogonality: if equation (4. . i (4. pp. and if we do line minimization along each direction pk . . pT Qpk+1 k = pT Q(−gk+1 + γk pk ) k = −pT Qgk+1 + k gT Qpk T k+1 p Qpk pT Qpk k k ∂f ∂x x=xk = −pT Qgk+1 + gT Qpk = 0 . First.4. In fact.j becomes pT U Σ1/2 Σ1/2 U T pj = 0 i or pT Qpj = 0 . We will henceforth simply say “conjugate” for brevity. if we can ﬁnd n directions p0 . . we cannot use the transformation (4. . that is. We do this in two steps. Then. we ﬁnd conjugate directions whose deﬁnitions do involve Q.2) to produce conjugate rather than orthogonal vectors. so that orthogonality for qi. It is simple to see that pk and pk+1 are conjugate. i what does this translate into in terms of the directions pi and pj for the ellipsoidal problem? We have qi. then pi and pj are said to be Qconjugate or Qorthogonal to each other.18) Consequently. which also incorporates the steps from x0 to xn : g0 = g(x0 ) p0 = −g0 for k = 0 . . n − 1 αk = arg minα≥0 f (xk + αpk ) xk+1 = xk + αk pk gk+1 = g(xk+1 ) gT Qp k+1 γ k = pT Q p k k k pk+1 = −gk+1 + γk pk end where gk = g(xk ) = is the gradient of f at xk . . if qT qj = 0 . k k+1 It is somewhat more cumbersome to show that pi and pk+1 for i = 0. because Σ and especially U T are too large.
gT gk k 4. the denominator of γk becomes pT (gk+1 − gk ) = −pT gk = (gk − γk−1 pk−1 )T gk = gT gk . Strictly speaking. we obtain the PolakRibi` re formula e γk = gT (gk+1 − gk ) k+1 . However. However. However.3.3. and a few cycles of n iterations each will achieve convergence. Similarly. k−1 Then. it is computationally inadequate because the expression for γk contains the Hessian Q. In fact. CONJUGATE GRADIENTS 51 4. k k k In conclusion.2 Removing the Hessian The algorithm shown in the previous subsection is a correct conjugate gradients algorithm. To this end. which was constant for the quadratic case.3. When the function f (x) is arbitrary. the same algorithm can be used. n iterations will not sufﬁce. we then lose conjugacy. without ever constructing the Hessian explicitly. . g(x) = a + Qx so that gk+1 = g(xk+1 ) = g(xk + αk pk ) = a + Q(xk + αk pk ) = gk + αk Qpk . since pk and pk+1 are associated to different Hessians.16) in n steps. or αk Qpk = gk+1 − gk . now is a function of xk . We now show that γk can be rewritten in terms of the gradient values gk and gk+1 only. pT gk = 0 . the Hessian. This expression for γk can be further simpliﬁed by noticing that pT gk+1 = 0 k because the line along pk is tangent to an isosurface at xk+1 .3 Extension to General Functions We now know how to minimize the quadratic function (4. pT Qpk pT αk Qpk pT (gk+1 − gk ) k k k and Q has disappeared. In fact. which is too large. the quadratic approximation becomes more and more valid. we notice that gk+1 = gk + αk Qpk .4. as the algorithm approaches the minimum x∗ . while the gradient gk+1 is orthogonal to the isosurface at xk+1 . We can therefore write γk = gT αk Qpk gT (gk+1 − gk ) gT Qpk k+1 = k+1 = k+1 .
52 CHAPTER 4. FUNCTION OPTIMIZATION .
dt ˙ In other words. domain and range are intimately related to one another even independently of the transformation A. In fact. while we need a single transformation. we would like to ﬁnd an orthogonal matrix S and a diagonal matrix Λ such that A = SΛS T 53 (5.Chapter 5 Eigenvalues and Eigenvectors Given a linear transformation b = Ax . Ideally. the singular value decomposition A = U ΣV of A transforms the domain of the transformation via the matrix V T and its range via the matrix U T so that the transformed system is diagonal. because it diagonalizes A by two different transformations. T and where Σ is diagonal. the fact that A is square is not a coincidence. and x is the derivative of x with respect to t: ˙ x= dx . dt dt In brief. For this equation. if V is an orthogonal matrix and we deﬁne y = V Tx . c = Σy where y = V Tx and c = UT b . x is assumed to be a function ˙ of some real scalar variable t (often time). preexisting relation between x and x. Often. the SVD does nothing useful for systems of linear differential equations. the equation b = U ΣV T x can be written as follows U T b = ΣV T x . that is. there is an intimate. This is a fundamental transformation to use whenever the domain and the range of A are separate spaces.1) . and one cannot change coordinates for x ˙ without also changing those for x accordingly. one for the domain and one for the range. however. In fact. In fact. of the form ˙ x = Ax where A is n × n. The most important example is perhaps that of a system of linear differential equations. ˙ ˙ then the deﬁnition of x forces us to transform x by V T as well: dx dV Tx ˙ =VT = V Tx .
it will be possible to diagonalize A by orthogonal transformations S and S T . A matrix that is equal to its Hermitian is called a Hermitian matrix. Furthermore. as well as the inner products between vectors. In some cases.2) . in the sense that they do not alter the vectors’ norms. In summary. If we refer back to ﬁgure 3. the norm squared is deﬁned as x and if S is unitary we have Sx 2 2 = xH x . 1 then Sx1 and Sx2 are orthogonal as well: xH S H Sx2 = xH x2 = 0 . conjugate transposition becomes simply transposition. if x1 and x2 are mutually orthogonal. and the constraint U = V implies that the vectors vi on the circle that map into the axes σi ui of the ellipse are parallel to the axes themselves. now the circle and the ellipse live in the same space. To distinguish invertible from orthogonal matrices we use the symbol Q for invertible and S for orthogonal. for complex matrices we generalize the notion of transpose by introducing the Hermitian operator: The matrix QH (pronounced “Q Hermitian”) is deﬁned to be the complex conjugate of the transpose of Q. We will see all of this in greater detail soon. scalar differential equations. This stands to reason. because it is a system of n independent.1) possible. • S is required to be only invertible. rather than orthogonal. It is like doing an SVD but with the additional constraint U = V . writing A in the form (5. This will only occur for very special matrices. which can be solved separately. If Q happens to be real. in order to diagonalize a square matrix A from a system of linear differential equations we generally look for a decomposition of A of the form A = QΛQ−1 (5.54 so that if we deﬁne CHAPTER 5.1. they can be even nonreal. rather than real. in fact. EIGENVALUES AND EIGENVECTORS y = ST x we can write the equivalent but diagonal differential system ˙ y = Λy . Finally. 2 = xH S H Sx = xH x = x . For complex vectors. The solutions can then be recombined through x = Sy . we weaken the constraints in several ways: • the elements of S and Λ are allowed to be complex.1) is not always possible. A matrix S is said to be unitary if S H S = SS H = I . Unfortunately. because now we are imposing stronger constraints on the terms of the decomposition. so the Hermitian is a generalization of the transpose. In order to make a decomposition like (5. a nonunitary transformation Q can change the norms of vectors. in the sense that xH x2 = 0 . 1 1 In contrast. • the elements on the diagonal of Λ are allowed to be negative. so unitary generalizes orthogonal for complex matrices. Unitary matrices merely rotate or ﬂip vectors. This is now much easier to handle.
and the transformation of B into A (and vice versa) is called a similarity transformation.1 shows the effect of the transformation represented by the matrix √ 2/3 4/ 3 (5.4) which is how eigenvalues and eigenvectors are usually introduced. diagonal or not. an invertible linear transformation transforms the unit circle into an ellipse. That real eigenvectors and eigenvalues do not always exist can be clariﬁed by considering the eigenvalue problem from a geometrical point of view in the n = 2 case. .5 0 −0. For some special matrices. are related by A = QBQ−1 . Figure 5. . .5) A= 0 2 .5 −2 −2 −1 0 1 2 Figure 5. 2 1.55 where Q and Λ are complex. Each point on the unit circle is transformed into some point on the ellipse. λn ) .5 1 0. In contrast. As we know.5) on a sample of points on the unit circle. (5. this may specialize to A = SΛS H with unitary S. the columns of qi of Q and the diagonal entries λi of Λ are solutions of the eigenvalue/eigenvector equation Ax = λx . Q is invertible. The dashed lines are vectors that do not change direction under the transformation. The columns of Q are called eigenvectors. . we have derived this equation from the requirement of diagonalizing a matrix by a similarity transformation.1: Effect of the transformation (5. (5.3) Thus. and Λ is diagonal. Whenever two matrices A and B. The equation A = QΛQ−1 can be rewritten as follows: AQ = QΛ or separately for every column of Q as follows: Aqi = λi qi where Q= q1 ··· qn and Λ = diag(λ1 . and the diagonal entries of Λ are called eigenvalues. they are said to be similar to each other.5 −1 −1.
and that the coefﬁcient of λn is 1. The geometric intuition is hidden. The eigenvalue problem amounts to ﬁnding axes q1 . in the sense that det(A − λI) = (−1)n (λ − λ1 ) · . it follows. bn are linearly dependent. that the lefthand side of equation (5. and we can only triangularize it. In some cases. . Volumes have been written about the properties of the determinant. but they are complex. but the pointtopoint transformation is not. In this expression.1. Equation (5. The case of exactly n distinct eigenvalues is of particular interest. In bad cases. In other words.6) to admit nontrivial solutions. by very simple induction. pulling the solid lines along.3)). In fact.1. an n × n matrix has at most n distinct eigenvalues. because of the following results. Therefore. Matrices represent pointtopoint transformations.1 Computing Eigenvalues and Eigenvectors Algebraically Let us rewrite the eigenvalue equation Ax = λx as follows: (A − λI)x = 0 . we have to give up the idea of diagonalizing A.4) is homogeneous in x. all eigenvalues exist. but the resulting ellipse is unchanged. which is called the characteristic equation of A. perhaps all real. Notice that there are many transformations that map the unit circle into the same ellipse. . det(B) = b11 n i+1 bi1 i=1 (−1) det(Bi1 ) if B is 1 × 1 otherwise is zero. which admits nontrivial solutions iff the matrix A − λI is rankdeﬁcient. just as triangularization was sufﬁcient for solving linear algebraic systems. it is sufﬁcient to recall the following properties from linear algebra: • det(B) = det(B T ). For our purposes.56 CHAPTER 5. there are n real. .6) This is a homogeneous. This turns out to be good enough for solving linear differential systems. . Thus. but not enough eigenvectors can be found. the curvetocurve transformation from circle to ellipse is unique. (5. A square matrix B is rankdeﬁcient iff its determinant. . Notice that they do not correspond to the axes of the ellipse. the eigenvalues and their eigenvectors exist. bi ··· bj ··· bn ) = − det( b1 ··· bj ··· bi ··· bn ). Each rotation yields another matrix A. Given that the directions of the input vectors are generally changed by the transformation A. the two eigenvectors are shown as dashed lines.1 can be rotated. (5. so x can be assumed to be a unit vector without loss of generality. orthonormal eigenvectors. it is not obvious whether the eigenvalue problem admits a solution at all. In ﬁgure 5. • det(BC) = det(B) det(C). equation (5. deﬁned as the (n − 1) × (n − 1) matrix obtained by removing row i and column j from B. We will see that the answer depends on the matrix A. In other cases.7) is a polynomial of degree n in λ.7). In particularly good cases. q2 that are mapped into themselves by the original transformation A (see equation (5. has n complex solutions. square system of equations. 5. . the circle in ﬁgure 5. for system (5. . Bij is the algebraic complement of entry bij . and the matrix A cannot be diagonalized. • det( b1 • det( b1 ··· ··· bn ) = 0 iff b1 . and the problem is best treated as an algebraic one. In other words. we need det(A − λI) = 0 . as evident from ﬁgure 5. and that they are not orthogonal. · (λ − λn ) where some of the λi may coincide. EIGENVALUES AND EIGENVECTORS for a sample of points on the unit circle.7) From the deﬁnition of determinant. and that a rather diverse array of situations may arise.
1. . + ck λk xk = 0 and subtracting this equation from equation (5. . . . + ck−1 xk−1 = 0 . However. xk are linearly independent. and so forth. . . . + ck−1 (λk−1 − λk )xk−1 = 0 . Theorem 5. . ∆ (5. A matrix A is Hermitian iff A = AH . .1 Eigenvectors x1 . Since all λi are distinct. . . we have c1 λ1 x1 + . therefore forcing c2 = 0. . = ck = 0. .9) from the left by xH and equation (5. = ck = 0. . .9) Since A = AH . and this forces c1 = 0. . If we multiply equation (5. Proof. . + ck xk = 0 we also have c1 λk x1 + . . . By multiplying by A we obtain c1 Ax1 + . . xk corresponding to distinct eigenvalues λ1 .8) we have c1 (λ1 − λk )x1 + . . . . Proof. . the situation is even better. COMPUTING EIGENVALUES AND EIGENVECTORS ALGEBRAICALLY 57 Theorem 5.8) For Hermitian matrices (and therefore for real symmetric matrices as well). the last equation can be rewritten as follows: xH A = λ∗ xH . we have reduced the summation to one containing k − 1 terms. . (5. .2 A Hermitian matrix has real eigenvalues. . . . Suppose that c1 x1 + . . + ck xk = 0 This entire argument can be repeated for the last equation. . + ck xk = 0 where the xi are eigenvectors of a matrix A. from c1 x1 + . which is still an eigenvector of A: c1 x1 + . In summary. .10) from the right by x. . . . . we obtain xH Ax xH Ax which implies that = = λxH x λ∗ xH x (5. .10) λxH x = λ∗ xH x . . By taking the Hermitian we obtain xH AH = λ∗ xH . the differences in parentheses are all nonzero.5. We can repeat this procedure until only one term remains.1. . λk . Let λ and x be an eigenvalue of A and a corresponding eigenvector: Ax = λx . We need to show that then c1 = . + ck xk = 0 implies that c1 = . λk are linearly independent.1. so that c2 x2 + . that is. + ck Axk = 0 and because x1 . Thus. the equation c1 x1 + . xk are eigenvectors corresponding to eigenvalues λ1 . . that the vectors x1 . . . + ck λk xk = 0 . and we can replace each xi by xi = (λi − λk )xi .
and let x and y be corresponding eigenvectors: Ax Ay = = λx µy ⇒ yH A = µyH because A = AH and from theorem 5.1. Proof.1. ∆ In summary.2 µ = µ∗ . respectively. Let λ and µ be two distinct eigenvalues of A. eigenvectors. The examples in section 5. so that we have λ = λ∗ as promised. Consequently. . ∆ Corollary 5. Notice that the converse is not true: a matrix can have coincident eigenvalues and still admit n independent. ∆ Theorem 5. A real and symmetric matrix is Hermitian.4 Eigenvectors corresponding to distinct eigenvalues of a Hermitian matrix are mutually orthogonal.4. we obtain yH Ax yH Ax which implies or = = λyH x µyH x .1. Proof. λ − µ is nonzero. and we must have yH x = 0. the scalar xH x is nonzero. and even orthonormal. the n × n identity matrix has n equal eigenvalues but n orthonormal eigenvectors (which can be chosen in inﬁnitely many ways). the vector x can be normalized without violating the equation.5 An n × n Hermitian matrix with n distinct eigenvalues admits n orthonormal eigenvectors.1.2 show that when some eigenvalues coincide.6 The determinant of a triangular matrix is the product of the elements on its diagonal. λyH x = µyH x (λ − µ)yH x = 0 . From theorem 5. and any square Hermitian matrix with n distinct eigenvalues can be diagonalized by a unitary similarity transformation. any square matrix with n distinct eigenvalues can be diagonalized by a similarity transformation. Theorem 5. First. EIGENVALUES AND EIGENVECTORS Since x is an eigenvector. we point out a simple but fundamental fact about the eigenvalues of a triangular matrix.1.3 A real and symmetric matrix has real eigenvalues.1. If we multiply these two equations by yH from the left and x from the right. Proof.58 CHAPTER 5. the eigenvectors of an n × n Hermitian matrix with n distinct eigenvalues are all mutually orthogonal. ∆ Since the two eigenvalues are distinct. rather diverse situations can arise concerning the eigenvectors. For instance. Since the eigenvalue equation Ax = λx is homogeneous in x. however. the eigenvectors can be made to be orthonormal. Corollary 5.
Matrices with n orthonormal eigenvectors are called normal. ann . 5. eλn t c . Then. the matrix 2 0 A= (5. and from the previous theorem we obtain det(A − λI) = (a11 − λ) · . and the summation in the deﬁnition of determinant given above reduces to a single term: det(B) = b11 b11 det(B11 ) if B is 1 × 1 otherwise . .2. If A is triangular. we can assume a triangular matrix B to be uppertriangular. This follows immediately from the deﬁnition of determinant. the only possibly nonzero bi1 of the matrix B is b11 . . · (ann − λ) which is equal to zero for λ = a11 . . ∆ Note that diagonal matrices are triangular. we obtain det(B) = b11 · . . which because of the properties above has the same eigenvalues. .2 Good and Bad Matrices Solving differential equations becomes much easier when matrices have a full set of orthonormal eigenvectors. ∆ Corollary 5.5. so this result holds for diagonal matrices as well. By repeating the argument for B11 and so forth until we are left with a single scalar. Normal matrices are good news. For instance. for otherwise we can repeat the argument for the transpose. · bnn .11) 0 1 has eigenvalues 2 and 1 and eigenvectors s1 = 1 0 s2 = 0 1 . so is B = A − λI. GOOD AND BAD MATRICES 59 Proof. . because then the n × n system of differential equations ˙ x = Ax has solution n x(t) = i=1 ci si eλi t = S eλ1 t .1. . . The eigenvalues of a matrix A are the solutions of the equation det(A − λI) = 0 . Without loss of generality. .. Proof.7 The eigenvalues of a triangular matrix are the elements on its diagonal.
not all matrices are as good as these. and a full set of eigenvectors may fail to exist. .1. and the vector c of constants ci is c = S H x(0) . Furthermore. repeated twice. so zero eigenvalues are not the problem. .1) that a matrix with distinct eigenvalues (zero or nonzero does not matter) has a full set of eigenvectors (perhaps nonorthogonal. However. but they may not be orthonormal. because the matrix is singular. x(t) = Q Q x(0) . A has to have a repeated eigenvalue if it is to be defective. as we now show. eλn t Computationally this is more expensive. However.. 0 0 2 Its eigenvalues are 0. things can be worse yet. repeated eigenvalues are not a sufﬁcient condition for defectiveness. Its two eigenvectors are √ 1 1 2 1 q1 = 0 . because a computation of a Hermitian is replaced by a matrix inversion. there may still be a complete set of n eigenvectors. λi are the eigenvalues. EIGENVALUES AND EIGENVECTORS where S = [s1 · · · sn ] are the eigenvectors. is that it have repeated eigenvalues. It belongs to the scum of all matrices: 0 2 −1 A= 0 2 1 . and 2. to have fewer than n eigenvectors. that is. How bad can a matrix be? Here is a matrix that is singular. eλn t Fortunately these matrices occur frequently in practice. q1 and q2 are not orthogonal to each other. as the identity matrix proves. q2 = 2 0 0 corresponding to eigenvalues 0 and 2 in this order. An example of such a matrix is 2 0 −1 1 √ q2 = H S x(0) . The simplest example of a defective matrix is 0 1 0 0 which has double eigenvalue 0 and only eigenvector [1 0]T . x(t) = S eλ1 t . However. A necessary condition for an n × n matrix to be defective. and there is no q3 . has fewer than n eigenvectors. which has eigenvalues 2 and 1 and nonorthogonal eigenvectors q1 = 1 0 2 2 1 1 . and the eigenvectors it has are not orthogonal. because the unitary matrix S is replaced by an invertible matrix Q.. First. and the solution becomes λt e 1 −1 . we have seen (theorem 5.60 CHAPTER 5. . This is conceptually only a slight problem. but independent). In fact. while 3 1 0 3 has double eigenvalue 3 and only eigenvector [1 0]T . More compactly.
Let e1 be the ﬁrst column of the identity matrix. . T y = λy where y = Q−1 x . we will not study these algorithms. It is intuitively obvious that any nonzero real vector x can be rotated into a vector parallel to e1 . COMPUTING EIGENVALUES AND EIGENVECTORS NUMERICALLY 61 5. .3. QT Q−1 x = λx . and this product is called the Schur decomposition of A. we have 1 T T x s1 s1 x . Fortunately. x = . This transformation is equivalent to factoring A into the product A = ST S H . But the trick is the same: let s1 = x .3. We will show later on that this allows solving systems of linear differential equations regardless of the structure of the system’s matrix of coefﬁcients. In fact. these matrices can be triangularized by similarity transformations. T = Q−1 AQ . since it triangularizes any square matrix A by a unitary (possibly complex) transformation: T = S H AS . Numerically stable and efﬁcient algorithms exist for the Schur decomposition. and since all the other sj are orthogonal to s1 . = .3 Computing Eigenvalues and Eigenvectors Numerically The examples above have shown that not every n × n matrix admits n independent eigenvectors. It is important to notice that if a matrix A is triangularized by similarity transformations. x . . however. The eigenvectors. . sT sT x n n 0 which is parallel to e1 as desired. take any orthogonal matrix S whose ﬁrst column is x s1 = . as we now show. 5. are changed according to the last equation. then that is. 0 ST x = . x Since sT x = xT x/ x = x . so some matrices cannot be diagonalized by similarity transformations. so λ is also an eigenvalue for T . if Ax = λx . Formally.1 Rotations into the x1 Axis An important preliminary fact concerns vector rotations. In this note. then the eigenvalues of the triangular matrix T are equal to those of the original matrix A. but only show that all square matrices admit a Schur decomposition.5. The Schur decomposition does even better. It may be less obvious that a complex vector x can be transformed into a real vector parallel to e1 by a unitary transformation. .
if the Schur decomposition of a matrix can be computed. Consequently. . We have sH x = xH x/ x = x . and unitary transformations preserve eigenvalues. .1 If A is an n × n matrix and λ and x are an eigenvalue of A and its corresponding eigenvector. . T = U H AU (5. 0 = = λ λ r 0 . Lemma 5. . . . Let U be a unitary transformation that transforms the (possibly complex) eigenvector x of A into a real vector on the x1 axis: r 0 x=U . Ax = λx then there is a transformation where U is a unitary. = .2 The Schur Decomposition The Schur decomposition theorem is the cornerstone of eigenvalue computations. EIGENVALUES AND EIGENVECTORS Now s1 may be complex. . as we will see later.3. U H AU U H AU 0 r 0 . 0 . n × n matrix. its eigenvalues can be determined.12) . . . = λU . It states that any square matrix can be triangularized by unitary transformations. . a system of linear differential equations can be solved regardless of the structure of the matrix of its coefﬁcients. . sH sH x n n x 0 . . Moreover. such that T = λ 0 . . . 0 1 0 . x = . SH x = . 0 0 C . . By substituting this into (5. 5. 0 1 0 . .3. . The diagonal elements of a triangular matrix are its eigenvalues. We are now ready to triangularize an arbitrary square matrix A. . . . 0 Proof.12) and rearranging we have r r 0 0 AU . and 1 sH 1 sH x 1 . 0 where r is the nonzero norm of x.62 CHAPTER 5. just about like before.
the order of eigenvalues can be chosen arbitrarily. Proof. . 0 λ 0 . and S H AS is uppertriangular. S is a unitary matrix. only that it exists. Because we can pick any eigenvalue as λ. and the corresponding righthand side is of the form required by the lemma. ∆ This theorem does not say how to compute the Schur decomposition. . 63 = The last lefthand side is the ﬁrst column of T . C . Fortunately.2 (Schur) If A is any n × n matrix then there exists a unitary n × n matrix S such that S H AS = T where T is triangular. . . there is a unitary matrix V such that V H GV is a Schur decomposition of G. 0 . 0 where λ is any eigenvalue of A. Then from the lemma there exists a unitary U such that λ 0 U H AU = . Suppose it holds for all matrices of order n − 1. Furthermore. 0 Clearly. ∆ Theorem 5. Let 0 ··· 0 1 0 S=U . Since the elements on the diagonal of a triangular matrix are the eigenvalues. By induction.3. V . S can be chosen so that the eigenvalues λi of A appear in any order along the diagonal of T . .5. Partition C into a row vector and an (n − 1) × (n − 1) matrix G: C= wH G . . . . there is a stable and efﬁcient algorithm to compute the Schur decomposition. This is the preferred way to compute eigenvalues numerically. COMPUTING EIGENVALUES AND EIGENVECTORS NUMERICALLY T 1 0 . The theorem obviously holds for n = 1: 1A1 = A . By the inductive hypothesis.3. S H AS is the Schur decomposition of A.
While this link is very important. Λ is real and diagonal.6 that the determinant of a triangular matrix is the product of the elements on its diagonal.2 (Spectral theorem) Every Hermitian matrix can be diagonalized by a unitary matrix. because 0∗ = 0. so Λ is real. But eigenvectors are the solution of the homogeneous system (5. It is positive semideﬁnite if for every nonzero x we have xT Ax ≥ 0.1). Thus. S real .2). because AT A is positive semideﬁnite for every A (lemma 5. a general relation between determinant and eigenvalues. Proof. Theorem 5. which is both real and rankdeﬁcient. In fact. First.4. Positive deﬁnite or semideﬁnite matrices arise in the solution of overconstrained linear systems. ∆ We saw that an n × n Hermitian matrix with n distinct eigenvalues admits n orthonormal eigenvectors (corollary 5. Thus. so are its eigenvectors.1.64 CHAPTER 5. We already know that Hermitian matrices (and therefore real and symmetric ones) have real eigenvalues (theorem 5. In fact. the Schur decomposition of a Hermitian matrix is in fact a diagonalization. the proof is complete. has real eigenvalues and n orthonormal eigenvectors. The proof is very simple. Let now A = ST S H be the Schur decomposition of A. but is otherwise unnecessary. T = S H AS. They also occur in geometry through the equation of an ellipsoid. But the only way that T can be both triangular and Hermitian is for it to be diagonal. EIGENVALUES AND EIGENVECTORS 5. xT Qx = 1 . In the process. Proof.4 Eigenvalues/Vectors and Singular Values/Vectors In this section we prove a few additional important properties of eigenvalues and eigenvectors.5).5). given the Schur decomposition. and that the determinant of a product of matrices is equal to the product of the determinants. we know that the eigenvalues of a matrix A are equal to those of the triangular matrix in the Schur decomposition of A. and S H = S T . with distinct eigenvalues or not. ∆ ⇒ A = SΛS H ⇒ A = SΛS T . and every real symmetric matrix can be diagonalized by an orthogonal matrix: A = AH A real.1 The determinant of a matrix is equal to the product of its eigenvalues. Furthermore. Notice that a positive deﬁnite matrix is also positive semideﬁnite. that the determinants of S and S H are the same. In other words.1. we know from theorem 5. and this is the ﬁrst equation of the theorem (the diagonal of a Hermitian matrix must be real). so is T . We recall that a real matrix A such that for every nonzero x we have xT Ax > 0 is said to be positive deﬁnite. now that we have the Schur decomposition. real or not. Let now A be real and symmetric. a Hermitian matrix. it is useful to remember that eigenvalues/vectors and singular values/vectors are conceptually and factually very distinct entities (recall ﬁgure 5. Since A is Hermitian. If we recall that a unitary matrix has determinant 1 or 1. and therefore admits nontrivial real solutions.4. we also establish a link between singular values/vectors and eigenvalues/vectors.4. The assumption of distinct eigenvalues made the proof simple. If in addition the matrix is real.6). In fact. All that is left to prove is that then its eigenvectors are real. A = AT In either case. and T H = (S H AS)H = S H AH S = S H AS = T . we can state the following stronger result.1. Theorem 5. S is real.
then its eigenvalues are nonnegative. Their physical meaning makes them positive deﬁnite. .5 AT A is positive semideﬁnite. . sn as an orthonormal basis for Rn . The eigenvectors of A are also its singular vectors. positive semideﬁnite matrix A are equal to its singular values. the entries in Λ are nonnegative. + cn λn xT sn = λ1 c2 + . energies cannot be negative). . If A is positive semideﬁnite. . then xT Ax ≥ 0 for any nonzero x. Conversely.2).3 implies one of the two directions: If A is real. If the proof of that theorem is repeated with the strict inequality. + λn c2 > 0 (or ≥ 0) 1 n because the λi are positive (nonnegative) and not all ci can be zero. A is positive deﬁnite (semideﬁnite). + cn sn with But then xT Ax = xT A(c1 s1 + . . .4. . Recall that the eigenvalues in the Schur decomposition can be arranged in any desired order along the diagonal.4. Since xT Ax > 0 (or ≥ 0) for every nonzero x.5.4. so that λ ≥ 0. Theorem 5. Proof. A = SΛS T . positive deﬁnite matrices. and positive semideﬁnite. .4. from Asi = λsi we obtain sT Asi = sT λsi = λsT si = λ si i i i 2 =λ. or at least positive semideﬁnite (for instance. but for arbitrary matrices. then its eigenvalues are positive. In fact. From the previous theorem. and in particular sT Asi ≥ 0. To this end.4. we show that if all eigenvalues λ of a real and symmetric matrix A are positive (nonnegative) then A is positive deﬁnite (semideﬁnite).3 The eigenvalues of a real. It is positive deﬁnite iff all its eigenvalues are positive. i But A = SΛS T with nonnegative diagonal entries in Λ is the singular value decomposition A = U ΣV T of A with Σ = Λ and U = V = S. . . Proof. and positive deﬁnite. symmetric matrix is positive semideﬁnite iff all its eigenvalues are nonnegative. positive deﬁnite matrices are associated to quadratic forms xT Qx that represent energies or secondorder moments of mass or force distributions. ∆ Theorem 5. .4 A real.3 establishes one connection between eigenvalues/vectors and singular values/vectors: for symmetric. . and write x = c1 s1 + . This result can be used to introduce a less direct link. the concepts coincide.4. let x be any nonzero vector. + cn sn ) = xT (c1 As1 + . . . .4. Theorem 5. In physics. Lemma 5. symmetric. symmetric. Theorem 5. symmetric. we can use these eigenvectors s1 . EIGENVALUES/VECTORS AND SINGULAR VALUES/VECTORS 65 in which Q is positive deﬁnite. ∆ ci = xT si . + cn λn sn ) = c1 λ1 xT s1 + . The following result relates eigenvalues/vectors with singular values/vectors for positive semideﬁnite matrices. . we also obtain that if A is real. both left and right. Furthermore. + cn Asn ) = xT (c1 λ1 s1 + . where both Λ and S are real. . Since real and symmetric matrices have n orthonormal eigenvectors (theorem 5.
Then. Bei is the ith column of B.8 An n × n matrix is normal if an only if it commutes with its Hermitian: AAH = AH A . . The theorem below characterizes the class of all matrices with this property. Proof. Lemma 5. AAT = U ΣV T V ΣU T = U Σ2 U T is a Schur decomposition with S = U and T = Λ = Σ2 .4. Similarly. Let A = ST S H be the Schur decomposition of A. .66 Proof. Proof. ∆ Theorem 5. . Since conjugation does not change the norm of a vector. and the eigenvectors of AAT are the left singular vectors of A. EIGENVALUES AND EIGENVECTORS 2 ≥ 0. and B H ei is the ith column of B H . . ∆ .4. and reason similarly to conclude that T must be diagonal. ∆ We have seen that important classes of matrices admit a full set of orthonormal eigenvectors.7 If for an n × n matrix B we have BB H = B H B. . the norm of the ith row of B equals the norm of its ith column. for m ≤ n. . The converse is also obviously true: if T is diagonal. Similarly. the eigenvectors of AT A are the right singular vectors of A. from the lemma. Proof. if and only if A can be diagonalized by a unitary similarity transformation. ∆ Theorem 5.4. the ith column of the n × n identity matrix. Let i = 1. for m ≤ n. the class of all normal matrices. (5. the equality (5. a triangular matrix T for which T T H = T H T must be diagonal. For any nonzero x we can write xT AT Ax = Ax CHAPTER 5. then T T H = T H T . . . n. AAH = ST S H ST H S H = ST T H S H and AH A = ST H S H ST S H = ST H T S H . The ﬁrst row has ﬁrst entry t11 . Because S is invertible (even unitary). that is. then for every i = 1. From BB H = B H B we deduce Bx 2 = xH B H Bx = xH BB H x = B H x 2 . AAH = AH A if and only if T is diagonal.13) If x = ei . we have AT A = V ΣU T U ΣV T = V Σ2 V T which is in the required format to be a (diagonal) Schur decomposition with S = V and T = Λ = Σ2 . so the only way that its norm can be t11  is for all other entries in the ﬁrst row to be zero.6 The eigenvalues of AT A with m ≥ n are the squares of the singular values of A. This is the deﬁnition of a normal matrix. we have AAH = AH A if and only if T T H = T H T . the eigenvalues of AAT are the squares of the singular values of A. We now proceed through i = 2. Then. the ﬁrst column of T has norm t11 . To prove the theorem. Thus. the norm of the ith row of T is equal to the norm of its ith column. If m ≥ n and A = U ΣV T is the SVD of A. However. we ﬁrst need a lemma. n.13) implies that the ith column of B has the same norm as the ith row of B. which is the conjugate of the ith row of B. In fact. that is.
for instance.4. EIGENVALUES/VECTORS AND SINGULAR VALUES/VECTORS Corollary 5.4. but so do.5.4. so theorem 5. We proved this in the proof of theorem 5.4. Proof. unitary (and therefore also real orthogonal) matrices: UUH = UHU = I . real symmetric. 67 ∆ Checking that AH A = AAH is much easier than computing eigenvectors. Thus. Hermitian. Notice that Hermitian (and therefore also real symmetric) matrices commute trivially with their Hermitians. and orthogonal matrices are all normal. normal matrix must be diagonal.8 is a very useful characterization of normal matrices. unitary. .8.9 A triangular.
EIGENVALUES AND EIGENVECTORS .68 CHAPTER 5.
.4 and 6. Finally. In section 6. These systems have the following form: ˙ x = Ax + b(t) x0 (6.1) (6.2). The equation (6. in section 6. . n − 1 .1) by introducing the ndimensional vector y x1 dy . 6.1) subsumes also the case of a scalar differential equation of order n. possibly greater than 1. such an equation can be reduced to a ﬁrstorder system of the form (6.Chapter 6 Ordinary Differential Systems In this chapter we use the theory developed in chapter 5 in order to solve systems of ﬁrstorder linear differential equations with constant coefﬁcients. .1 Scalar Differential Equations of Order Higher than One The ﬁrstorder system (6. which is numerically preferable to the more commonly used Jordan canonical form. . xn n−1 d y dtn−1 With this deﬁnition.5. dt 69 for i = 0. First.2.3) In fact. dt . we set up and solve a particular differential system as an illustrative example. . . we show that scalar differential equations of order greater than one can be reduced to systems of ﬁrstorder differential equations. in which x0 is a known vector. in sections 6. we make this result more speciﬁc by showing that the solution to a homogeneous system is a linear combination of exponentials multiplied by polynomials in t. This result is based on the Schur decomposition introduced in chapter 5. Then. .3. dn y dn−1 y dy + cn−1 n−1 + . . = x= . and the vector function b(t) is a function of time.2) x(0) = where x = x(t) is an ndimensional vector function of time t. + c1 + c0 y = b(t) . deﬁnes the initial value of the solution. the dot denotes differentiation. we have di y dti dn y dtn = = xi+1 dxn . . we recall a general result for the solution of ﬁrstorder differential systems from the elementary theory of differential equations. n dt dt dt (6. . the coefﬁcients aij in the n × n matrix A are constant.
. For matrices. n − 1. . ORDINARY DIFFERENTIAL SYSTEMS dxi (6. the exponential is deﬁned through its Taylor series. . 0 0 . For the scalar exponential eλt we can write a Taylor series expansion eλt = 1 + λt λ2 t2 + + ··· = 1! 2! ∞ j=0 λj t j . .1) with initial condition (6. and the Taylor series expansion above is proven as a property. introduced in the following.4). 0 −c2 ··· ··· . is the socalled companion matrix of (6. . ··· 1 · · · −cn−1 . . . . .. . . in calculus classes. .2 General Solution of a Linear Differential System We know from the general theory of differential equations that a general solution of system (6.70 and x satisﬁes the additional n − 1 equations CHAPTER 6. j! Usually1 . . 0 b(t) 6. .2) is given by x(t) = xh (t) + xp (t) where xh (t) is the solution of the homogeneous system ˙ x = x(0) = and xp (t) is a particular solution of ˙ x = x(0) = Ax + b(t) 0. the exponential is introduced by other means. the exponential eZ of a matrix Z ∈ Rn×n is instead deﬁned by the inﬁnite series expansion Z Z2 e =I+ + + ··· = 1! 2! Z 1 Not ∞ j=0 Zj . .4) dt for i = 1.3) and 0 0 . we obtain the ﬁrstorder system ˙ x = Ax + b(t) xi+1 = where A= 0 0 . .3) together with the n − 1 differential equations (6. 0 −c0 1 0 . j! always. Ax x0 The two solution components xh and xp can be written by means of the matrix exponential. In some treatments. If we write the original system (6. 0 −c1 b(t) = 0 1 .
Substituting Z = At gives eAt = I + Differentiating both sides of (6. This is easily veriﬁed.5) deAt dt Thus. It turns out that this inﬁnite sum converges (to an n × n matrix which we write as eZ ) for every matrix Z. It can be shown that this solution is unique.2).2) we obtain w = x0 .1).7) (6. for any vector w.8) (6. since by differentiating this expression for xp we obtain ˙ xp = AeAt 0 t e−As b(s) ds + eAt e−At b(t) = Axp + b(t) .11) . The solution to ˙ x = Ax + b(t) with initial value x(0) = x0 is x(t) = xh (t) + xp (t) where and xp (t) = 0 (6.5) gives deAt dt A2 t A3 t2 + + ··· 1! 2! At A2 t2 = A I+ + + ··· 1! 2! = A+ = AeAt . GENERAL SOLUTION OF A LINEAR DIFFERENTIAL SYSTEM 71 Here I is the n × n identity matrix. (6.6) is a solution to the differential system (6.10) xh (t) = eAt x(0) t eA(t−s) b(s) ds . j! (6. the function xh (t) = eAt w satisﬁes the homogeneous differential system ˙ xh = Axh .1) with b(t) = 0 and initial values (6. we have the following result. From the elementary theory of differential equations. and xh (t) = eAt x(0) (6.9) (6.2. so xp satisﬁes equation (6. and the general term Z j /j! is simply the matrix Z raised to the jth power divided by the scalar j!. By using the initial values (6. we also know that a particular solution to the nonhomogeneous (b(t) = 0) equation (6.6. At A2 t2 A3 t3 + + + ··· = 1! 2! 3! ∞ j=0 Aj tj .1) is given by t xp (t) = 0 eA(t−s) b(s) ds . In summary.
we will only point out that several methods exist for computing a matrix exponential. we can be much more speciﬁc about the structure of the solution (6. we brieﬂy review this solution.3. pp.3 If A is not defective. λn . qn with corresponding eigenvalues λ1 . and we have shown how to ﬁnd xh (t) by solving an eigenvalue problem. Once these two extreme cases (nondefective matrix or allcoincident eigenvalues) have been handled. However. 4. Let Q = q1 · · · qn . However.3.72 CHAPTER 6. . In a fundamental paper on the subject. To be sure. we seem to have all we need. we have seen that matrices with coincident eigenvalues can still have a full set of linearly independent eigenvectors (see for instance the identity matrix). .3 is superior in practice. However. . This procedure is based on backsubstitution. If the matrix has a full complement of eigenvectors. However. When the matrix A is constant. this is shown via the Jordan canonical form. square. and small perturbations in the data can change the results dramatically.1. Speciﬁcally. . For a nonhomogeneous system.3 Structure of the Solution For the homogeneous case b(t) = 0. the matrix exponential (6. 6. . Cleve Moler and Charles Van Loan discuss a large number of different methods.1). no. defective or not. in section 6. a similar result can be found through the Schur decomposition introduced in chapter 5. As we have done for the SVD and the Schur decomposition. as we currently assume. (6. .3. the method of section 6.10) can be written as a linear combination.3.3 for solving a homogeneous or nonhomogeneous differential system for any.13) Two cases arise: either A admits n distinct eigenvalues. Nineteen dubious ways to compute the exponential of a matrix (SIAM Review.2 is the same as would be obtained with the method of section 6.1.5) requires too many terms for a good approximation. Then. In chapter 5. constant matrix A. .3.3. expm(A) is the matrix exponential of A. In the general theory of linear differential systems. A full discussion of this matter is beyond the scope of these notes. we show how to compute the homogeneous solution xh (t) in the extreme case of an n × n matrix A with n coincident eigenvalues. 20. with constant vector coefﬁcients.3. since it is based on the numerically sound Schur decomposition. This result is brieﬂy reviewed in this section for convenience. The naive solution to use the deﬁnition (6.7).9) of system (6. pointing out that no one of them is appropriate for all situations. The next section shows how to do this. of scalar exponentials multiplied by polynomials. the solution obtained in section 6. but we will not discuss how this is done2 . . and produces a result analogous to that obtained via Jordan decomposition for the homogeneous part xh (t) of the solution. of this subsection and of the following one are based on notes written by Scott Cohen.1 A is Not Defective In chapter 5 we saw how to ﬁnd the homogeneous part xh (t) of the solution when A has a full set of n linearly independent eigenvectors. we show a general procedure in section 6. Moler and Van Loan point out that the Jordan form cannot be computed reliably. In section 6. .3. we have seen that if (but not only if) A has n distinct eigenvalues then it has n linearly independent eigenvectors (theorem 5. 6. ORDINARY DIFFERENTIAL SYSTEMS Since we now have a formula for the general solution to a linear differential system.12) (6. the procedure can be carried out analytically if the functions in the righthand side vector b(t) can be integrated. vol.2.1. 80136). the solution procedure we introduce in section 6. we do not know how to compute the matrix exponential. Fortunately. consider the ﬁrst order system of linear differential equations ˙ x = x(0) = Ax x0 .2 for the case of n coincident eigenvalues can be applied regardless to how many linearly independent eigenvectors exist. or is does not. in the paper cited above. 2 In 3 Parts Matlab. then it has n linearly independent eigenvectors q1 . and particularly so about the solution xh (t) to the homogeneous part.
where S is Hermitian. system (6. This follows easily from the deﬁnition of eZ and the fact that (Q(Λt)Q−1 )j = Q(Λt)j Q−1 . The last equation (6.16) represents n uncoupled.3.6) in this case becomes xh (t) = eS(T t)S x(0) = SeT t S H x(0) = SeΛt+N t S H x(0). Since Aqi = λi qi . The solution (6. we have AQ = QΛ. H .12) can be rewritten as follows: ˙ x = Ax ˙ x = QΛQ−1 x −1 ˙ Q x = ΛQ−1 x ˙ y = Λy. 73 (6.6. 6. STRUCTURE OF THE SOLUTION This square matrix is invertible because its columns are linearly independent. (6. .2 A Has n Coincident Eigenvalues When A = QΛQ−1 . .2).12) is xh (t) = QeΛt Q−1 x(0). regardless of the number of its linearly independent eigenvectors? In any case. and eS(Λt)S H = SeΛt S H . and the consequent relation y(0) = Q−1 x(0). where Λ is diagonal and N is strictly upper triangular. Multiplying both sides of (6.6). . λn ) is a square diagonal matrix with the eigenvalues of A on its diagonal. . Similarly.12) becomes xh (t) = SeΛt S H x(0). ˙ The solution is yh (t) = eΛt y(0). then the solution to (6. qn . Comparing with (6. We recall that S is a unitary matrix and T is upper triangular with the eigenvalues of A on its diagonal. If A is normal.12) is xh (t) = SeΛt S H x(0). How can we compute the matrix exponential in the extreme case in which A has n coincident eigenvalues. homogeneous. Q−1 is replaced by S H .15) Then. .14) where Λ = diag(λ1 . . where eΛt = diag(eλ1 t . then Q is replaced by the Hermitian matrix S = s1 · · · sn . differential equations yi = λi yi .12) is xh (t) = QeΛt Q−1 x(0). if it has n orthonormal eigenvectors q1 . . we derived that the solution to (6.16) where y = Q−1 x. Using the relation x = Qy. we see that the solution to the homogeneous system (6. . and the solution to (6. that is. . A admits a Schur decomposition A = ST S H (theorem 5.3. it should be the case that −1 eQ(Λt)Q = QeΛt Q−1 . we obtain A = QΛQ−1 . eλn t ). if A = SΛS H .14) by Q−1 on the right. .3. (6. Thus we can write T as T = Λ + N. .
Then N has three nonzero superdiagonals. for example. in this case. We already know how to compute eΛt .74 CHAPTER 6. the general solution to the homogeneous differential system (6. N 3 has one nonzero superdiagonal. that N is 4 × 4.6) if we can compute eT t = eΛt+N t . Suppose. ∞ n−1 N j tj N j tj eN t = = j! j! j=0 j=0 is simply a ﬁnite sum. 0 0 0 0 0 0 0 0 In general. that is if Z1 Z2 = Z2 Z1 . and the exponential reduces to a matrix polynomial. for a strictly upper triangular n × n matrix. when all the eigenvalues of A coincide. . Therefore.17) where A = S(Λ + N )S H is the Schur decomposition of A.13) when the n × n matrix A has n coincident eigenvalues is given by n−1 xh (t) = SeΛt j=0 N j tj H S x0 j! (6.. we have N j = 0 for all j ≥ n (i. Λt and N t commute: Λt N t = λ It N t = λt N t = N t λt = N t λ It = N t Λt . so it remains to show how to compute eN t . we can write eΛt+N t = eΛt eN t . It can be shown that if two matrices Z1 and Z2 commute. ORDINARY DIFFERENTIAL SYSTEMS Thus we can compute (6. and N 4 is the zero matrix: 0 ∗ ∗ ∗ 0 0 ∗ ∗ 0 0 ∗ ∗ 0 0 0 ∗ 2 N = 0 0 0 ∗ →N = 0 0 0 0 → 0 0 0 0 0 0 0 0 0 0 0 ∗ 0 0 0 0 0 0 0 0 0 0 0 0 4 N3 = 0 0 0 0 →N = 0 0 0 0 . and N is strictly upper triangular. Thus. N 2 has two nonzero superdiagonals. The fact that N t is strictly upper triangular makes the computation of this matrix exponential much simpler than for a general matrix Z. N is nilpotent of order n). in our case. In summary. then eZ1 +Z2 = eZ1 eZ2 = eZ2 eZ1 . This turns out to be almost as easy as computing eΛt when the diagonal matrix Λ is a multiple of the identity matrix: Λ = λI that is.12) with initial value (6. Λ = λI is a multiple of the identity matrix containing the coincident eigenvalues of A on its diagonal.e. In fact.
.3.k that contain everything to the right of the corresponding Tii . . . . c(t) = . .20) The triangular matrix T can always be written in the following form: T11 · · · · · · T1k 0 T22 · · · T2k T = . yk (t) and for the initial values y1 (0) .6. . y(0) = .3. 0 ··· 0 Tkk where the diagonal blocks Tii for i = 1. . j j ni −1 Ni t j=0 j! yi (0) + t 0 eλi I(t−s) j j ni −1 Ni (t−s) j=0 j! (ci (s) + di (s)) ds . The vector c(t) can be partitioned correspondingly as follows c1 (t) . . . k are of size ni ×ni (possibly 1×1) and contain allcoincident eigenvalues.18) (6. and the same can be done for y1 (t) .19) in the general case of a constant matrix A. yk (t)]T else di (t) = 0 (an nk dimensional vector of zeros) end Tii = λi I + Ni (diagonal and strictly uppertriangular part of Tii ) yi (t) = eλi It end. ck (t) where ci has ni entries.i+1 ··· Ti. . let A = ST S H be the Schur decomposition of A. . (6. .21) (6. y(t) = . . The remaining nonzero blocks Tij with i < j can be in turn bundled into matrices Ri = Ti. STRUCTURE OF THE SOLUTION 75 6. with arbitrary b(t). .20) can then be solved by backsubstitution as follows: for i = k down to 1 if i < k di (t) = Ri [yi+1 (t).. .3 The General Case We are now ready to solve the linear differential system ˙ x = x(0) = Ax + b(t) x0 (6. defective or not. . yk (0) The triangular system (6. . . . and consider the transformed system ˙ y(t) = T y(t) + c(t) where y(t) = S H x(t) and c(t) = S H b(t) . In fact.
the expression for yi (t) is a direct application of equations (6. we apply the backsubstitution procedure introduced in this section. y1 ˙ y2 ˙ When t11 = t22 = λ. ORDINARY DIFFERENTIAL SYSTEMS In this procedure. .10).9).76 CHAPTER 6. we consider a very small example.11). the solutions to system (6. When t11 = λ1 = t22 = λ2 .18) is then x(t) = Sy(t) . (6.1. we have y1 (t) = y1 (0) eλt + t12 y2 (0) t eλt t12 y2 (0) λ2 t (e − eλ1 t ) y1 (t) = y1 (0)eλ1 t + λ2 − λ1 if if t11 = t22 t11 = t22 . because then the integrand is a linear combination of exponentials multiplied by polynomials in t − s. the 2 × 2 homogeneous. Instead. the applicability of this routine depends on whether the integral in the expression for yi (t) can be computed analytically.22) for t11 = t22 and for t11 = t22 have different forms. this becomes y1 (t) = y2 (t) = (y1 (0) + t12 y2 (0) t) eλt y2 (0) eλt . (6. Thus. = t11 0 1 0 t12 t22 t12 t 1 y1 y2 . we obtain y(t) = eλIt In scalar form. triangular case.22) and the initial value equation y(0) = y0 . (6. The solution x(t) for the original system (6. As an illustration. and it is easy to verify that this solution satisﬁes the differential system (6.1). While y2 (t) is the same in both cases. we could solve the system by ﬁnding the eigenvectors of T . and (6. The second equation of the system. This is certainly the case when b(t) is a constant vector b. In the general case. y2 (t) = t22 y2 ˙ has solution We then have and y1 (t) = y1 (0)eλ1 t + 0 t y2 (t) = y2 (0) eλ2 t . d1 (t) = t12 y2 (t) = t12 y2 (0) eλ2 t eλ1 (t−s) d1 (s) ds t 0 = = = = y1 (0)eλ1 t + t12 y2 (0) eλ1 t y1 (0)eλ1 t + t12 y2 (0) eλ1 t 0 e−λ1 s eλ2 s ds e(λ2 −λ1 )s ds t t12 y2 (0) λ1 t (λ2 −λ1 )t e (e − 1) λ2 − λ1 t12 y2 (0) λ2 t y1 (0)eλ1 t + (e − eλ1 t ) λ2 − λ1 y1 (0)eλ1 t + Exercise: verify that this solution satisﬁes both the differential equation (6.22).17) with S = I.22) y(0) . since we know that in this case two linearly independent eigenvectors exist (theorem 5. which can be integrated by parts.
the three springs exert forces that are proportional to the springs’ elongations: f1 f2 4 Recall = = c1 v1 c2 (v2 − v1 ) that the order of a differential equation is the highest degree of derivative that appears in it. eλt − eλ1 t lim = t eλt . time. the two masses would assume the positions indicated by the dashed lines. Since forces are proportional to accelerations. Suppose that we want to study the evolution of the system over time. and since accelerations are second derivatives of position. A CONCRETE EXAMPLE 77 1 v1 rest position of mass 1 1 2 v2 2 rest position of mass 2 3 Figure 6. however. Consider the mechanical system in ﬁgure 6. By Hooke’s law. This would seem to present numerical difﬁculties when t11 ≈ t22 .1. In the absence of external forces. is not a problem. as opposed to partial. .4. these are ordinary differential equations. the new equations are differential. The point of this section is to show how to transform the complex formal solution of the differential system. This. We will then transform these into four linear differential equations of the ﬁrst order. λ1 →λ λ − λ1 and the transition between the two cases is smooth. The initial system has two secondorder equations. The 4 × 4 matrix of the resulting system has an interesting structure. 6. which allows ﬁnding eigenvalues and eigenvectors analytically with a little trick. because of Newton’s law. into a real solution in a form appropriate to the problem at hand. Two linear differential equations of the second order4 result. computed with any of the methods described above. because the solution would suddenly switch from one form to the other as the difference between t11 and t22 changes from about zero to exactly zero or viceversa. In fact.1: A system of masses and springs.6. and is transformed into a ﬁrstorder system with four equations.4 A Concrete Example In this section we set up and solve a more concrete example of a system of differential equations. Because differentiation occurs only with respect to one variable. In the following we write the differential equations that describe this system.
24) are given.78 f3 = CHAPTER 6.24) are replaced by the (known) vector u(0) = In the next section we solve equation (6. so that secondorder derivatives of v are ﬁrstorder derivatives of the new variables.23) . we will ﬁrst transform it to a system of four ﬁrstorder equations.26). .23). ¨ = Bv v where v= v1 v2 and B= +c − c1m1 2 c2 m2 c2 m1 c2 +c3 − m2 = −f1 + f2 = −c1 v1 + c2 (v2 − v1 ) = −(c1 + c2 )v1 + c2 v2 = −f2 + f3 = −c2 (v2 − v1 ) − c3 v2 = c2 v1 − (c2 + c3 )v2 (6. according to Newton’s second law: m1 v1 ¨ m2 v2 ¨ or. As shown in the introduction to this chapter. ˙ We can now gather these four ﬁrstorder differential equations into a single system as follows: ˙ u = Au where 0 0 A= 0 0 B 1 0 0 0 0 1 0 0 .25) u3 v1 ˙ u4 v2 ˙ so that u3 = v1 ˙ while the original system (6. we deﬁne four new variables u1 v1 u v u= 2 = 2 (6. We also assume that initial conditions v(0) and ˙ v(0) (6. (6. the initial conditions (6. and u4 = v2 . To solve the secondorder system (6. The accelerations of masses 1 and 2 (springs are assumed to be massless) are proportional to their accelerations. which specify positions and velocities of the two masses at time t = 0. v(0) ˙ v(0) . ORDINARY DIFFERENTIAL SYSTEMS −c3 v2 where the ci are the positive spring constants (in newtons per meter).26) Likewise. in matrix form. For uniformity.23) becomes u3 ˙ u4 ˙ =B u1 u2 . the trick is to introduce variables to denote the ﬁrstorder derivatives of v.
We also have that α ≥ γ. m1 m2 The constant γ is real because the radicand is nonnegative. so that the two solutions µ1.29) λ1 = √ √ λ3 = −α − γ . z we can write Ax = 0 B I 0 y z = z By so that the eigenvalue equation (6. √ √ −α + γ . we are lucky. and the four eigenvalues of A.5 Solution of the Example Not all matrices have a full set of linearly independent eigenvectors.1. y x= . where we recall that A= 0 B I 0 and B= +c − c1m1 2 c2 m2 c2 m1 c2 +c3 − m2 (6. where γ= α2 − β = 1 4 c1 + c2 c2 + c3 − m1 m2 2 + c2 2 . The eigenvalues µ1 and µ2 of B are the solutions of det(B − µI) = where α= 1 2 c1 + c2 c2 + c3 + m1 m2 and β= c1 c2 + c1 c3 + c2 c3 m1 m2 c1 + c2 +µ m1 c2 + c3 c2 2 +µ − = µ2 + 2αµ + β = 0 m2 m1 m2 are positive constants that depend on the elastic properties of the springs and on the masses.27) . and I is the 2 × 2 identity matrix. (6. SOLUTION OF THE EXAMPLE 79 6. the eigenvalues of A are the square roots of the eigenvalues of B: if we denote the two eigenvalues of B as µ1 and µ2 . If we partition the vector x into its upper and lower halves y and z.2 = −α ± γ .30) . however. λ2 = − −α + γ . = λy = λz . the zeros in A are 2 × 2 matrices of zeros.2 are real and negative. (6. The eigenvalues of A are solutions to the equation Ax = λx . With the system of springs in ﬁgure 6. then the eigenvalues of A are √ √ √ √ λ1 = µ1 λ2 = − µ1 λ3 = µ2 λ4 = − µ2 . We then obtain µ1.6.5.28) In other words. λ4 = − −α − γ (6. Here.27) can be written as the following pair of equations: z By which yields By = µy with µ = λ2 .
The ﬁrst equation reads − c2 c1 + c2 − α ± γ y1 + y2 = 0 m1 m1 and is obviously satisﬁed by any vector of the form y=k c1 +c2 m1 c2 m1 −α±γ where k is an arbitrary constant. depend on the initial conditions. it may be useful to add some algebraic manipulation in order to show that the solution is indeed oscillatory. 0 λ4 (6. on the other hand. and the two scalar equations in (6. ORDINARY DIFFERENTIAL SYSTEMS come in nonreal. Finally. m1 The general solution to the ﬁrstorder differential system (6. y denotes the two eigenvectors of B. (6.25) by noticing that v is equal to the ﬁrst two components of u. and from equation (6.31) the four eigenvectors of A are proportional to the four columns of the following matrix: c2 c2 c2 c2 Q= a + λ2 1 c λ1 m21 λ1 a + λ2 1 m1 a + λ2 2 c λ2 m21 λ2 a + λ2 2 a= m1 a + λ2 3 c λ3 m21 λ3 a + λ2 3 m1 a + λ2 4 c λ4 m21 λ4 a + λ2 4 m1 (6. (6.33) where c1 + c2 . the determinant of this equation is zero.32) must be linearly dependent. complexconjugate pairs.80 CHAPTER 6. from equation (6. In fact. Also the eigenvectors of A can be derived from those of B. we note that u(t) = QeΛt w .28) we see that if y is an eigenvector of B corresponding to eigenvalue µ = λ2 . the values of λi are given in equations (6. however. This is to be expected.30). The amplitudes and phases of the sinusoids.32) Since ±(−α ± γ) are eigenvalues of B.34) In these expressions. However.26) is then given by equation (6. then there are two corresponding eigenvectors for A of the form y x= . . Since we just found four distinct eigenvectors.31) ±λy The eigenvectors of B are the solutions of (B − (−α ± γ)I)y = 0 .23) can be obtained from equation (6. the solution to the original. To simplify our manipulation. This completes the solution of our system of differential equations.17). the masses’ motions can be described by the superposition of two sinusoids whose frequencies depend on the physical constants involved (masses and spring constants). As we see in the following. since our system of springs obviously exhibits an oscillatory behavior. For k = 0. secondorder system (6.33). we can write more simply u(t) = QeΛt Q−1 u(0) where λ1 0 Λ= 0 0 0 λ2 0 0 0 0 λ3 0 0 0 . and Q is in equation (6.
Finally. Since λ2 = −λ1 and (see equations (6.5.36) In fact. secondorder problem. we can write c2 m1 c1 +c2 m1 + λ4 = −λ3 q1 q2 q2 c2 m1 c1 +c2 m1 + q1 λ2 1 and q2 = λ2 3 . by using the relation k1 − k2 is purely imaginary k3 − k4 is purely imaginary .30)). We have v(t) = Q(1 : 2. SOLUTION OF THE EXAMPLE where we deﬁned 81 w = Q−1 u(0) . Since the λs are imaginary but v(t) is real.6. :) denotes the ﬁrst two rows of Q. ∗ k1 = k2 ∗ k3 = k4 . Numerical values for the constants can be found from the initial conditions u(0) by equation (6. where Q(1 : 2. 2 and simple trigonometry we obtain v(t) = q1 A1 cos(ω1 t + φ1 ) + q2 A2 cos(ω2 t + φ2 ) where ω1 = √ α−γ = 1 (a + b) − 2 1 (a + b) + 2 c2 1 2 (a − b)2 + 4 m1 m2 1 c2 2 (a − b)2 + 4 m1 m2 ω2 = √ α+γ = . (6. this means that k1 + k2 is real k3 + k4 is real from which equations (6. ejx + e−jx = cos x . :) = where we deﬁned q1 = Thus.36) follow. the ki must come in complexconjugate pairs: and (6. we have Q(1 : 2. Since the vectors qi are independent (assuming that the mass c2 is nonzero). :)eΛt w . we have v(0) = q1 (k1 + k2 ) + q2 (k3 + k4 ) and from the derivative ˙ v(t) = q1 λ1 k1 eλ1 t − k2 e−λ1 t + q2 λ3 k3 eλ3 t − k4 e−λ3 t we obtain ˙ v(0) = q1 λ1 (k1 − k2 ) + q2 λ3 (k3 − k4 ) . and derive the general solution v(t) for the original. v(t) = q1 k1 eλ1 t + k2 e−λ1 t + q2 k3 eλ3 t + k4 e−λ3 t .35).35) We now leave the constants in w unspeciﬁed.
and not on the initial conditions. x) = if x = 0 and y > 0 π 2π −2 if x = 0 and y < 0 and is undeﬁned for (x.35) and (6. m2 Notice that these two frequencies depend only on the conﬁguration of the system. Im denote the real and imaginary part and where the twoargument function arctan2 is deﬁned as follows for (x. 0). This function returns the arctangent of y/x (notice the order of the arguments) in the proper quadrant.82 and a= c1 + c2 m1 . The amplitudes Ai and phases φi . Re(k1 )) A2 = 2k3  φ2 = arctan2 (Im(k3 ). 0) y if x > 0 arctan( x ) y π + arctan( x ) if x < 0 arctan2 (y. CHAPTER 6. y) = (0. . and extends the function by continuity along the y axis. y) = (0. ORDINARY DIFFERENTIAL SYSTEMS b= c2 + c3 . depend on the constants ki as follows: A1 = 2k1  . φ1 = arctan2 (Im(k1 ). Re(k3 )) where Re. on the other hand.25). ˙ The two constants k1 and k3 can be found from the given initial conditions v(0) and v(0) from equations (6.
83 . The output can be the temperature of the reaction product. and the supply rate of the gas. The deﬁnition is then illustrated by setting up the dynamic system equations for a simple but realistic application. u input S system y output Figure 7. we will develop the theory of the Kalman ﬁlter.1). models take on frequently the form of dynamic systems. • A chemical reactor. as is very often the case. Although in this chapter we will barely scratch its surface. Simple examples of a dynamic system are the following: • An electric circuit.1 Dynamic Systems In its most general meaning. For instance. 7. we will consider one of its most popular and useful aspects.6 we will see that the shell can be shot down before it hits us. whose input is the current in a given branch and whose output is a voltage across a pair of nodes. As discussed in section 7. A dynamic system is a system whose phenomena occur over time. To this purpose.7. the theory of state estimation. studying manipulation requires deﬁning models for how a robot arm can move and for how it interacts with the world. The system reacts to this input and produces an output y (see ﬁgure 7. an informal deﬁnition of a dynamic system is given in the next section. In sections 7.5. and in section 7.1: A general system. One often says that a system evolves over time. Kalman ﬁltering has intimate connections with the theory of algebraic linear systems we have developed in chapters 2 and 3. Analyzing image motion implies deﬁning models for how points move in space and how this motion projects onto the image. The theory of dynamic systems is both rich and fascinating.Chapter 7 Stochastic State Estimation Perhaps the most important part of studying a problem in robotics or vision. in the particular form of Kalman ﬁltering. is to determine a good model for the phenomena and events that are involved. The input is the force applied to the mass and the output is the position of the mass.3 through 7. as well as in most other sciences. When motion is involved. • A mass suspended from a spring. that of modeling the trajectory of an enemy mortar shell. A dynamic system is a mathematical description of a quantity that evolves over time. the temperature of the gas being supplied. the term system refers to some physical entity on which some action is performed by means of an input u. whose inputs are the external temperature.
as for instance for switching networks and computers. uk . must allow computing the state (and hence the output) at time t2 . one can show that there exists a function f such that the state x(t) of the system satisﬁes the differential equation ˙ x(t) = f (x(t).1 State Given a dynamic system. In other cases. can be given in the case of a regular system. t1 . In fact. u(t). all quantities are vectors. For the mass attached to a spring. that is. the output is the result of the entire history of the system. the value x(t) taken by the state at time t must be sufﬁcient to determine the output at the same point in time. t2 ) where u(·) represents the entire function u. The examples above suggest that the output at time t cannot be determined in general by the value assumed by the input quantity at the same point in time. In general. however. Thus. and with continuous ﬁrst derivative.2 Uncertainty The systems deﬁned in the previous section are called deterministic. knowledge of both x(t1 ) and u[t1 . In that case. t) . A discrete dynamic system is completely described by these two equations and an initial state x0 . continuous or discrete. all the quantities in the examples vary continuously with time. if time varies discretely. k) where f is some function that represents the change. k) . 7. not just one of its values. but rather compute x(t2 ) as a functional φ of x(t1 ) and the entire input u over the interval [t1 . for instance. rather than functionals. Similarly. t) where the dot denotes differentiation with respect to time. the output y of the system happens to coincide with one of the two state variables. If time varies continuously. the state could be the position and velocity of the mass.1. the relation between state and output can be expressed by another function: yk = h(xk . uk . which leads to postulating a new quantity. for which the functional φ is continuous. the modeling problem is to somehow correlate inputs (causes) with outputs (effects). since the evolution is exactly determined once the initial state x at time 0 is known. A description of the system in terms of functions. Specifying the initial state x0 completes the deﬁnition of a continuous dynamic system. Also. Rather. and is therefore always deducible from the latter. This is. so one cannot compute xk+1 as a function of xk .84 CHAPTER 7. For continuous systems. The relation from state to output. called the state. and uk is the input at time k. An effort of abstraction is therefore required. For a discrete system. STOCHASTIC STATE ESTIMATION In all these examples. the system is said to be discrete. in a dynamic system the input affects the state. which summarizes information about the past and the present of the system. t2 ): x(t2 ) = φ(x(t1 ). and the output is a function of the state. In practice.1. of the state at time t1 and the input over the interval t1 ≤ t < t2 . the way that the input changes the state at time instant number k into the new state at time instant k + 1 can be represented by a simple equation: xk+1 = f (xk . what is input and what is output is a choice that depends on the application. Determinism implies that both the evolution function f and the output function h are known exactly. on the other hand.t2 ) . differentiable. the system is said to be continuous. in this example. u(·). time does not come in quanta. the laws of classical mechanics allow computing the new position and velocity of the mass at time t2 given its position and velocity at time t1 and the forces applied over the interval [t1 . 7. and k. Speciﬁcally. an unrealistic state of affairs. it is more natural to consider time as a discrete variable. t2 ). Also. is essentially the same as for the discrete case: y(t) = h(x(t). Furthermore. the laws that govern a given physical .
t) + η(t) h(x(t). k) + ξk Without loss of generality.7.3 Linearity The mathematics becomes particularly simple when both the evolution function f and the output function h are linear. and the output y is in Rm . The mean may not be known. The covariance matrix Q of the system noise η is n × n. while the output is a set of observations made by a tracking system.2. component wear. the system equations become xk+1 yk for the discrete case. the state is the (initially unknown) position and velocity of the shell. in either f or h. and the mean perturbations are no different. This is a very important task. the noise distributions can be assumed to have zero mean. This uncertainty can be represented as noise that perturbs the equations we have presented so far. k) + ηk = h(xk . A common assumption. In sections 7. and so forth. on a hill that looks about 0. which is sometimes valid and always simpliﬁes the mathematics. Then. is that η and ξ are zeromean Gaussian random variables with known covariance matrices Q and R. You spotted an enemy mortar installation about thirty kilometers away. and can change over time as a result of temperature changes. It is useful to specify the sizes of the matrices involved. A more realistic model then allows for some inherent. we consider the state estimation problem: given observations of the output y over an interval of time. the state propagation matrix F is n × n. that is.5 kilometers higher than your own position. the input matrix G is n × p. respectively. 7. AN EXAMPLE: THE MORTAR SHELL 85 system are known up to some uncertainty. the equations themselves are simple abstractions of a complex reality. = f (xk . t) + ξ(t) . uk . In particular. You want to track incoming projectiles with a Kalman ﬁlter so you can aim your guns . in the case of the mortar shell.2 An Example: the Mortar Shell In this section. We assume that the input u is a vector in Rp .5. Then. the state x is in Rn . 7. u(t).3 through 7. and ˙ x(t) = y(t) = F (t)x(t) + G(t)u(t) + η(t) H(t)x(t) + ξ(t) = = Fk xk + Gk uk + ηk Hk xk + ξk for the continuous one. we want to determine the state x of the system. unresolvable uncertainty in both f and h. we consider discretization issues because the physical system is itself continuous. and the covariance matrix of the output noise ξ is m × m. Estimating the state then leads to enough knowledge about the shell to allow driving an antiaircraft gun to shoot the shell down in midﬂight. The coefﬁcients that appear in the equations are known only approximately. for otherwise the mean can be incorporated into the deterministic part. In fact. but this is a different story: in general the parameters that enter into the deﬁnitions of f and h must be estimated by some method. and the output matrix H is m × n. but we choose to model it as a discrete system for easier implementation on a computer. the example of the mortar shell will be discussed in order to see some of the technical issues involved in setting up the equations of a dynamic system.1. For instance. A discrete system then takes on the following form: xk+1 yk and for a continuous system ˙ x(t) = y(t) = f (x(t).
equation (7. The projectile’s elevation is z e = 1000 (7. 0).6) (7. You do not know the initial velocity of the projectiles.8 × 10−3 kilometers per second squared. you want to discretize these equations. and you have little time to get ready. so you guess measurement covariances Re = Rs = 1000. z is the vertical. Thus.6 d d 30 ˆ0 = x z = 0. For the z component.2) are continuous.1) and (7. you are at (0. equal to 9. Since you are taking measurements every dt = 0. where you see a blob that increases in size as the projectile gets closer. you remember that the laws of motion for a ballistic trajectory are the following: d(t) z(t) ˙ = d(0) + d(0)t 1 = z(0) + z(0)t − gt2 ˙ 2 (7.1I4 . and dots denote derivatives with respect to time. you decide to ignore air drag. Thus. All you have to track the shells is a camera pointed at the mortar that will rotate so as to keep the projectile at the center of the image. which you put along the diagonal of a 2 × 2 diagonal measurement covariance matrix R. 0. ˙ 2 since z(0) − gt = z(t). given that you know the actual size of the projectiles used and all the camera parameters. and the size of the blob tells you something about the distance. STOCHASTIC STATE ESTIMATION accurately. Because of this.86 CHAPTER 7.4) You do not have very precise estimates of the noise that corrupts e and s.2 seconds.6 kilometers/second for the horizontal component.2. Similarly.1 ˙ 0. the aiming angle of the camera gives you elevation information about the projectile’s position.2) yields z(t + dt) − z(t) 1 1 = z(0) + z(0)(t + dt) − g(t + dt)2 − z(0) + z(0)t − gt2 ˙ ˙ 2 2 1 = (z(0) − gt)dt − g(dt)2 ˙ 2 1 = z(t)dt − g(dt)2 . your estimate of the initial state of the projectile is ˙ −0.2) where g is the gravitational acceleration. 7. z). if t + dt is time instant k + 1 and t is time instant k.5 z where d is the horizontal coordinate. you introduce a state update covariance matrix Q = 0.1) (7.1 The Dynamic System Equation Equations (7. you have 1 zk+1 = zk + zk dt − g(dt)2 . so you just guess some values: 0. ˙ ˙ Consequently. d2 + z 2 (7. except that there is no acceleration: ˙ dk+1 = dk + dk dt .5) . From your highschool physics. (7. Since you do not trust your physics much.3) d when the projectile is at (d. where I4 is the 4 × 4 identity matrix.1 kilometers/second for the vertical component. the size of the blob in pixels is s= √ 1000 . ˙ 2 The reasoning for the horizontal component d is the same.
7. (7. ˆ ˆ ˆ dk dk d2 k (7. AN EXAMPLE: THE MORTAR SHELL Equations (7.7) and (7.3) and (7.2 The Measurement Equation The two nonlinear equations (7. we develop both equations as Taylor series around the current estimate and truncate them after the linear term. we have zk zk − zk ˆ zk ˆ zk ˆ ˆ ek = 1000 ≈ 1000 + − (dk − dk ) . −dt2 0 0 dt 1 2 7. The two matrices F and G are as follows: 0 1 0 0 0 dt 1 0 0 0 F = G= 0 0 1 0 dt . To this end. the 4 × 4 matrix F depends on dt. and the 4 × 1 control matrix G depends on dt.2.2.3). ˆk ˆk ˆ dk d d d2 k so that after simplifying we can redeﬁne the measurement to be the discrepancy from the estimated value: ek = ek − 1000 We can proceed similarly for equation (7.8) can be written in the matrix form yk = Hk xk where the measurement matrix Hk depends on the current state estimate ˆk : x ˆ zk ˆ dk 0 0 3/2 3/2 ˆ z2 ˆ z2 (d2 +ˆk ) (d2 +ˆk ) k k Hk = −1000 zk ˆ 0 0 − d1 ˆ2 ˆ d k k (7.5) and (7.4): sk = and after simplifying: sk = sk − ˆ 2000 dk zk ˆ ≈ −1000 dk + zk ˆ ˆ ˆ2 + z 2 (d2 + zk )3/2 ˆ2 (d2 + zk )3/2 ˆ2 d ˆ k k .8) 1000 ≈ 2 d2 + zk k 1000 ˆ d2 + zk ˆ2 k − ˆ 1000dk 1000ˆk z ˆ (d − dk ) − (z − zk ) ˆ ˆ2 + z 2 )3/2 k ˆ2 + z 2 )3/2 k (dk ˆk (dk ˆk zk zk ˆ zk ˆ ≈ 1000( − dk ) . the control scalar u is equal to −g.7) The two measurements sk and ek just deﬁned can be collected into a single measurement vector yk = sk ek . We want to replace these equations with linear approximations.4) express the available measurements as a function of the true values of the projectile coordinates d and z. and the two approximate measurement equations (7.9) . From the elevation equation (7.6) can be rewritten as a single system update equation xk+1 = F xk + Gu where ˙ dk d xk = k zk ˙ zk 87 is the state.
In other words. Also.J. section 7. STOCHASTIC STATE ESTIMATION As the shell approaches us. Given a discrete dynamic system xk+1 yk = = Fk xk + Gk uk + ηk Hk xk + ξk (7. The next few sections will be read under this impending threat. and section 7. 1960) is still one of the most readable treatments of this subject from the point of view of stochastic estimation. we do not want to have all the observations before we shoot: we would be dead by then. So what else is there left to do? From the observations.88 CHAPTER 7. the state of the dynamic system. one observation happens to be sufﬁcient). Chellappa. and yet observations are processed one at a time.” Transactions of the ASME Journal Basic Engineering. 1 The term “recursive” in the systems theory literature corresponds loosely to “incremental” or “iterative” in computer science. “Kalman ﬁlterbased algorithms for estimating depth from image sequences. Rk ) . in the hope to build a system that lets us shoot down the shell before it hits us.10) (7. ηk ξk ∼ ∼ N (0. and perhaps predict where it will be in a few seconds. July 1990. A scheme that reﬁnes an initial estimation of the state as new observations are acquired is called a recursive1 state estimation system. as they become available. It is really the ensemble of all observations that let one estimate the state. . thereby determining xk as well. The Kalman ﬁlter is one of the most versatile schemes for recursive state estimations. we want to know xk . In these expressions. and R.4 deﬁnes the essential notions of estimation theory that are necessary to understand the quantitative aspects of Kalman ﬁltering. the Kalman ﬁlter computes the optimal estimate ˆkk at time k given the measurements y0 . yj . section 7. . . 26(4):639–656. we would like to know where the mortar shell is right now.7 establishes a connection between the Kalman ﬁlter and the solution of a linear system. in general.6 reconsiders the example of the mortar shell. we introduce the Kalman ﬁlter from the simpler point of view of least squares estimation. S.” IEEE Transactions on Aerospace and Electronic Systems. A classical example of this situation in computer vision is the reconstruction of threedimensional shape from a sequence of images. . T. the hat x means that the quantity is an estimate. Broida. ˆij is the estimate of the value that x x assumes at time i given the ﬁrst j + 1 measurements y0 . 3(3):209236. Chandrashekhar. E. the estimation problem is deﬁned in some more detail. Kalman ﬁlters exist that recover shape information from a sequence of images. Finally. we frantically start studying state estimation. 7. Section 7.11) where the system noise ηk and the measurement noise ξk are Gaussian variables. yk . In fact. . . This is a very interesting aspect of state estimation. September 1989. See for instance L. The original paper by Kalman (R. ˆ as well as a (possibly completely wrong) estimate x0 of the initial state and an initial covariance matrix P0 of the estimate ˆ0 . so by itself it conveys no threedimensional information. Most importantly. Thus. Kanade. the ﬁrst k in the subscript refers to which variable is being estimated. and T. so we can direct an antiaircraft gun to shoot down the target. Szeliski.5 develops the equation of the Kalman ﬁlter. Here. A single image is twodimensional. Qk ) N (0. “Recursive 3D motion estimation from a monocular image sequence. Matthies. . a single observation yk may not be sufﬁcient to determine the state xk (in the example.3 State Estimation In this section. The next section deﬁnes the state estimation problem for a discrete dynamic system in more detail. and in particular Kalman ﬁltering. knowing x0 instead is equivalent. Then. Clearly. the second to which measurements are being used for the estimate. Knowing the model for the mortar shell amounts to knowing the laws by which the object moves and those that relate the position of the projectile to our observations. The x x ﬁlter also computes an estimate Pkk of the covariance of ˆkk given those measurements. . 82:34–45.” International Journal of Computer Vision. from x0 we can simulate the system up until time t. Kalman. since we have developed all the necessary tools in the ﬁrst part of this course. . Even without noise. and R. “A new approach to linear ﬁltering and prediction problems. at least when the dynamics of the system are known exactly (the system noise ηk is zero).
so that it becomes available for the next step. The uncertainty about the former is Rk . we x need some criterion for comparing the quality of the new measurement yk with that of our old estimate ˆkk−1 of the x state. At time k. measurements count less and less as the estimate approaches its true value. so that the new prediction ˆkk = Hk ˆkk y x of the measurement value is closer to the measurement yk than the old prediction ˆkk−1 = Hk ˆkk−1 y x that we made just before the new measurement yk was available. . and we can consider the residue rk = yk − ˆkk−1 = yk − Hk ˆkk−1 . the estimate x ˆkk−1 = Hk ˆkk−1 y x is still our best bet. That would mean that we trust the new measurement completely. just before the new measurement yk comes in.1 Update The covariance matrix Pkk must be computed in order to keep the Kalman ﬁlter running. 7.3. This stage is represented graphically in the middle x of ﬁgure 7. Now we face the problem of incorporating the new measurement yk into our estimate. . Thus.2. . as well as the consequent growing uncertainty. by how much should we correct our estimate of the state? We do not want to make ˆkk y coincide with yk . Now yk becomes available. Then. Even if ˆkk−1 is not exact.7. The uncertainty about the state just before the new measurement yk becomes available is Pkk−1 . of transforming ˆkk−1 into ˆkk .2: The update stage of the Kalman ﬁlter changes the estimate of the current system state xk to make the prediction of the measurement closer to the actual measurement yk . even if the latter was obtained through a large number of previous measurements. the covariance of the observation error. while that about the measurements may remain the same. y x If this residue is nonzero. The idea is that as time goes by the uncertainty on the state decreases. also the uncertainty measure Pkk−1 must be updated. STATE ESTIMATION 89 yk1 k1 Hk ^ xk1  k1 Pk1  k1 propagate ^ xk  k1 Pk  k1 yk k ^ yk  k1 ^ xk  k Pk  k ^ xk+1  k Pk+1  k yk+1 time k+1 update propagate Figure 7. this uncertainty becomes usually smaller: Pkk < Pkk−1 . Propagation then accounts for the evolution of the system state. The question however is.3. but that we do not trust our state estimate ˆkk−1 at all. in the following sense. we have an estimate ˆkk−1 of the state vector xk based on the x previous measurements y0 . If ˆkk−1 were exact. we could compute the new measurement yk x x x without even looking at it. . . we probably need to correct our estimate of the state xk . At the same time. yk−1 . Because a new measurement has been read. that is. The update stage of the Kalman ﬁlter uses Rk and Pkk−1 to weigh past evidence (ˆkk−1 ) and new observations (yk ).11). through the measurement equation (7.
that is. 7. both state estimate and state covariance matrix have been updated as described above. just as the state vector xk represents all the information necessary to describe the evolution of a deterministic system. and any distance between ˆ and x must be zero. STOCHASTIC STATE ESTIMATION 7. The state changes according to the system equation (7. The Kalman ﬁlter itself is derived in section 7. and what information it needs to keep working. y = h(x) + n . in this somewhat trivial sense. both state estimate and covariance must be propagated to the new time k + 1 to yield the new state estimate ˆk+1k and the new covariance Pk+1k . x A less trivial example occurs. The function L is called an estimator. . and its value ˆ given the observations y x x is called an estimate. the covariance matrix Pkk contains all the necessary information about the probabilistic part of the system. solving a square. The optimal estimator is then represented by the matrix L = H −1 and the optimal estimate is ˆ = Ly . In particular.10). x In summary. the Kalman ﬁlter computes the Best Linear Unbiased Estimate (BLUE) of the state. x because of the system noise ηk . This increase in uncertainty depends on the system noise covariance Qk . we have so far considered the criterion x that prefers a particular ˆ if it makes the Euclidean norm of the vector y − Hx as small as possible.2. . yk should reﬂect this change as well. To turn these concepts into a quantitative algorithm we need some preliminaries on optimal estimation. But between time k and time k + 1 both state and covariance may change.13) is overconstrained. Inverting a function is an example of estimation. Similarly.12) the estimation problem amounts to ﬁnding a function ˆ = L(y) x such that ˆ is as close as possible to x.4. this intuitive introduction to Kalman ﬁltering gives you an idea of what the ﬁlter does. (7. starting with the deﬁnition of a linear estimation problem. for a linear observation function. there is usually no inverse to H. For linear systems. in that case ˆ is equal to x. and then considering the attributes “best” and “unbiased” in turn. about how both the system noise ηk and the measurement noise ξk corrupt the quality of the state estimate ˆkk .4 BLUE Estimators In what sense does the Kalman ﬁlter use covariance information to produce better estimates of the state? As we will se later. which are discussed in the next section. then L is the inverse of h. If the function h is invertible and the noise term n is zero.3. and inaccuracies in our model of how this happens lead to greater uncertainty.90 CHAPTER 7. x Hopefully.2 Propagation Just after arrival of the measurement yk . so our estimate ˆk+1k of xk+1 given y0 .13) is. we see what this means. . In this case. The system equation (7. . so that the system (7.1 Linear Estimation Given a quantity y (the observation) that is a known function of another (deterministic but unknown) quantity x (the state) plus some amount of noise. and again one must say in what sense ˆ is required to be “as close as possible” to x. no matter how the phrase “as close as possible” is interpreted. a problem of estimation. This is the x . when the matrix H has more rows than columns.5. nonsingular system x x y = Hx (7. our uncertainty about this estimate may be somewhat greater than one time epoch ago.10) essentially “dead reckons” the new state from the old. In this section. In fact. Thus. Both these changes are shown on the right in ﬁgure 7. 7.
an overconstrained system of the form (7. The best estimator for a linear observation function happens to be a linear estimator. “unweighted” sense.13) and its “noisy” version (7. namely. is only slightly different from its unweighted version. An estimator is said to be linear if the function L is linear.13). can be used to generalize the problem. y = Hx + n (7. In the least squares approach to solving a linear system like (7. What the least squares approach is really saying is that even at the solution ˆ there is some residue x n = y − Hˆ x (7.4. it still makes sense to look for the best possible linear estimator. In fact. the Euclidean norm of the residue (7. evaluated at the solution ˆ.14)) equally. nonnegativedeﬁnite matrix.13) by a “noisy x equation”. we have W n = W (y − Hx) = W y − W Hx so we are simply solving the system W y = W Hx in the traditional. 7. we will probably have an estimator that produces a worse estimate than a nonlinear one.14) are really the same problem.13).2. alternatively. that we want to enforce different equations to different degrees. Thus.7. different equations can have noise terms of different variance. From the point of view of least squares. If L is required to be linear but h is not. Notice that the observation function h can still be nonlinear. BLUE ESTIMATORS 91 (unweighted) least squares criterion.13) has no exact solution when there are more independent equations than unknowns. However. each equation counts the same when computing the norm of the residue. The noise term. Even equation (7. called weighted least squares. We know the solution from normal equations: ˆ = ((W H)T W H)−1 (W H)T W y = (H T R−1 H)−1 H T R−1 y .4.2 Best In order to deﬁne what is meant by a “best” estimator. however. x 2 = nT n (the square is of course irrelevant when it comes to minimization). In other words. this can be enforced by some scaling of the entries of n or. This amounts to saying that we have reasons to prefer the quality of some equations over others or. by some linear transformation of them: n → Wn so instead of minimizing n minimize where R−1 = W T W is a symmetric. Replacing (7. we will see that in a very precise sense ordinary least squares solve a particular type of estimation problem. so requiring equality is hopeless. (7.14) is the correct version.14) does not change the nature of the problem.4. However. This minimization problem. one needs to deﬁne a measure of goodness of an estimate. if the equality sign is to be taken literally.12) with h a linear function and n Gaussian zeromean noise with the indentity matrix for covariance. In fact. In section 7. we now Wn 2 = nT R−1 n . In fact. even. the estimation problem for the observation equation (7.15) treats all components (all equations in (7.15) and we would like to make that residue as small as possible in the sense of the Euclidean norm. this distance is deﬁned as the Euclidean norm of the residue vector y − Hˆ x between the left and the righthand sides of equation (7.
Then the linear estimator L is unbiased if an only if LH = I . we also want it to be unbiased. in the sense that if repeat the same estimation experiment many times we neither consistently overestimate nor consistently underestimate x. however.4.2. this same solution is obtained from a completely different criterion of goodness of a solution ˆ. we obtain different estimates.4. we give a necessary and sufﬁcient condition for L to be unbiased. Thus. Thus. In this sense. The new criterion is the socalled minimumcovariance criterion. x 7. we can pick any norm we like. ˆ is a function of a random vector (noise).14). It makes therefore sense to measure the quality of an estimator by requiring that its variance be as small as possible: the ﬂuctuations of the estimate ˆ with respect to the true (unknown) value x from one estimation experiment to the x next should be as small as possible. Lemma 7.16) .4.1 Let n in equation (7. which is repeated here for convenience: y = Hx + n . and is therefore x a random vector itself. The estimate ˆ of x is some function of the x measurements y. 7.92 CHAPTER 7. if we estimate the same quantity many times. β such that α P 1 < P 2 <β P 1 . Formally.16) be zero mean. This x criterion is a probabilistic one. we only use properties shared by all norms. E[x − ˆ] x = E[x − Ly] = E[x − L(Hx + n)] = E[(I − LH)x] − E[Ln] = (I − HL)E[x] (7. STOCHASTIC STATE ESTIMATION Interestingly. this translates into the following requirement: E[x − ˆ] = 0 x and E[ˆ] = E[x] . Mathematically. requires a notion of “size” for matrices: how large is P ? Fortunately. which in turn are corrupted by noise. First. Intuitively. in the derivations that follow. x x Minimizing a matrix. Some matrix norms were mentioned in section 3. We consider this different approach because it will let us show that the Kalman ﬁlter is optimal in a very useful sense. the identity matrix. from measurements corrupted by different noise samples from the same distribution. the estimates are random.4 The BLUE ˆ = Ly x We now address the problem of ﬁnding the Best Linear Unbiased Estimator (BLUE) of x given that y depends on x according to the model (7. In fact. so which norm we actually use is irrelevant.3 Unbiased In additionto requiring our estimator to be linear and with minimum covariance. most interesting matrix norms are equivalent. Proof. in the sense that given two different deﬁnitions P 1 and P 2 of matrix norm there exist two positive scalars α. we want to choose a linear estimator L such that the estimates ˆ = Ly x it produces minimize the following covariance matrix: P = E[(x − ˆ)(x − ˆ)T ] .
17). so it is zero as well. that is. For such matrices. We can trivially write and P = = LRLT = [L0 + (L − L0 )]R[L0 + (L − L0 )]T L0 RLT + (L − L0 )RLT + L0 R(L − L0 )T + (L − L0 )R(L − L0 )T . and The term L0 R(L − L0 )T is the transpose of this. We can write P = = = E[(x − ˆ)(x − ˆ)T ] = E[(x − Ly)(x − Ly)T ] x x E[(x − LHx − Ln)(x − LHx − Ln)T ] = E[((I − LH)x − Ln)((I − LH)x − Ln)T ] E[LnnT LT ] = L E[nnT ] LT = LRLT (7. the norm of the sum is greater or equal to either norm. so this expression is minimized when the second term vanishes. 0 (L − L0 )RLT = 0 . To show that L0 = (H T R−1 H)−1 H T R−1 L = L0 + (L − L0 ) (7. that it has minimum covariance.18) is the best. In conclusion. that is. so that LH = I. 0 the sum of two positive deﬁnite or at least semideﬁnite matrices. This proves that the estimator given by (7.18) we obtain so that (L − L0 )RLT = (L − L0 )H(H T R−1 H)−1 = (LH − L0 H)(H T R−1 H)−1 .4. x x Proof. 0 But L and L0 are unbiased.2 The Best Linear Unbiased Estimator (BLUE) ˆ = Ly x for the measurement model y = Hx + n where the noise vector n has zero mean and covariance R is given by L = (H T R−1 H)−1 H T R−1 and the covariance of the estimate ˆ is x P = E[(x − ˆ)(x − ˆ)T ] = (H T R−1 H)−1 .4. when L = L0 . To prove that the covariance P of ˆ is given by equation (7. P = L0 RLT + (L − L0 )R(L − L0 )T . 93 ∆ And now the main result. so LH = L0 H = I. BLUE ESTIMATORS since E[Ln] = L E[n] and E[n] = 0. For this to hold for all x we need I − LH = 0. 0 0 RLT = RR−1 H(H T R−1 H)−1 = H(H T R−1 H)−1 0 From (7. = L0 RLT = (H T R−1 H)−1 H T R−1 RR−1 H(H T R−1 H)−1 0 = (H T R−1 H)−1 H T R−1 H(H T R−1 H)−1 = (H T R−1 H)−1 ∆ . Theorem 7.18) is the best choice.7. let L be any (other) linear unbiased estimator. we simply substitute L0 for L in P = LRLT : x P as promised.17) because L is unbiased.
. ˆ as well as the best. Thus. unbiased estimate ˆkk at time k given the measurements y0 . given a linear measurement equation y = Hx + n where n is a Gaussian random vector with zero mean and covariance matrix R. the best linear unbiased estimate ˆ of x is x where the matrix ˆ = P H T R−1 y x P = E[(ˆ − x)(ˆ − x)T ] = (H T R−1 H)−1 x x ∆ is the covariance of the estimation error. 7. Another way of saying this is that the estimate ˆkk−1 differs from the true state xk by an error term ek whose covariance is Pkk−1 : x ˆkk−1 = xk + ek x with E[ek eT ] = Pkk−1 . . Given a dynamic system with system and measurement equations xk+1 yk = = Fk xk + Gk uk + ηk Hk xk + ξk (7. We now apply the results from optimal estimation to the problem of updating and propagating the state estimates and their error covariances.5 The Kalman Filter: Derivation We now have all the components necessary to write the equations for the Kalman ﬁlter. two pieces of data are available.19) where the system noise ηk and the measurement noise ξk are Gaussian random vectors.21) We can summarize this available information by grouping equations 7. and packaging the error covariances into a single.20 and 7. ηk ξk ∼ ∼ N (0.5. R) . k (7. . Rk ) . the Kalman ﬁlter computes the best. unbiased estimate x0 of the initial state with an error covariance matrix P0 . (7.1 Update At time k. yk .20) The other piece of data is the new measurement yk itself.94 CHAPTER 7.2. To summarize. which is related to the state xk by the equation yk = Hk xk + ξk with error covariance T E[ξk ξk ] = Rk . blockdiagonal matrix. One is the estimate ˆkk−1 of the state xk given measurements up to but not x including yk .21 into one. STOCHASTIC STATE ESTIMATION 7. linear. linear. we have y = Hxk + n . . Qk ) N (0. Computation occurs according to the x phases of update and propagation illustrated in ﬁgure 7. n ∼ N (0. This estimate comes with its covariance matrix Pkk−1 . The ﬁlter also x computes the covariance Pkk of the error ˆkk − xk given those measurements.
x x is usually referred to as the Kalman gain matrix. x 7. We have −1 Pkk = H T R−1 H = I T Hk −1 Pkk−1 0 0 −1 Rk I Hk −1 T −1 = Pkk−1 + Hk Rk Hk and ˆkk x = = = = = In the last line. the solution to this classical estimation problem is ˆkk x Pkk = = Pkk H T R−1 y (H T R−1 H)−1 . because it speciﬁes the amount by which the residue must be multiplied (or ampliﬁed) to obtain the correction term that transforms the old estimate ˆkk−1 of the state xk into its new x estimate ˆkk . H= I Hk .5. and the noise term ηk is zero mean. unbiasedness requires ˆk+1k = Fk ˆkk + Gk uk . THE KALMAN FILTER: DERIVATION where y= and where n has covariance R= Pkk−1 0 0 Rk .5. This pair of equations represents the update stage of the Kalman ﬁlter. Since the new state is related to the old through the system equation 7.7. x x .2 Propagation Propagation is even simpler. because the matrices H and R contain many zeros. n= ek ξk . These expressions are somewhat wasteful.19. and the matrix x T −1 Kk = Pkk Hk Rk ∆ ∆ Pkk H T R−1 y Pkk −1 Pkk−1 T −1 Hk Rk ˆkk−1 x yk −1 T −1 Pkk (Pkk−1 ˆkk−1 + Hk Rk yk ) x −1 T −1 T −1 Pkk ((Pkk − Hk Rk Hk )ˆkk−1 + Hk Rk yk ) x T −1 ˆkk−1 + Pkk Hk Rk (yk − Hk ˆkk−1 ) . 95 As we know. ˆkk−1 x yk . the difference rk = yk − Hk ˆkk−1 x is the residue between the actual measurement yk and its best estimate based on ˆkk−1 . For this reason. these two update equations are now rewritten in a more efﬁcient and more familiar form.
The basic recurrence equations (7. 7.11) can be expanded as follows: yk = = = Hk xk + ξk = Hk (Fk−1 xk−1 + Gk−1 uk−1 + ηk−1 ) + ξk Hk Fk−1 xk−1 + Hk Gk−1 uk−1 + Hk ηk−1 + ξk Hk Fk−1 (Fk−2 xk−2 + Gk−2 uk−2 + ηk−2 ) + Hk Gk−1 uk−1 + Hk ηk−1 + ξk . ˆ0−1 = ˆ0 x x both assumed to be given.4 shows that the distance between estimated and true target position does indeed converge to zero. = ˆkk−1 + Kk (yk − Hk ˆkk−1 ) x x −1 T −1 = Pkk−1 + Hk Rk Hk where the Kalman gain is deﬁned as and by the propagation equations T −1 Kk = Pkk Hk Rk ˆk+1k x Pk+1k = = Fk ˆkk + Gk uk x T Fk Pkk Fk + Qk . by the update equations ˆkk x −1 Pkk and P0−1 = P0 . STOCHASTIC STATE ESTIMATION which is the state estimate propagation equation of the Kalman ﬁlter. obtaining ˆ0k .3 shows the true and estimated trajectories.5 shows the 2norm of the covariance matrix over time.7 Linear Systems and the Kalman Filter In order to connect the theory of state estimation with what we have learned so far about linear systems. that is. the dynamic system equations for a mortar shell were set up. amounts to solving a linear x system of equations with suitable weights for its rows.3 Kalman Filter Equations ∆ ∆ In summary. x 7.5.96 CHAPTER 7.2. the Kalman ﬁlter evolves an initial estimate and an initial error covariance matrix. Notice that coincidence of the trajectories does not imply that the state estimate is uptodate. Notice that the covariance goes to zero only asymptotically.6 Results of the Mortar Shell Experiment In section 7. and this occurs in time for the shell to be shot down. Figure 7. Figure 7. For this it is also necessary that any given point of the trajectory is reached by the estimate at the same time instant. The error covariance matrix is easily propagated thanks to the linearity of the expectation operator: Pk+1k = = = = E[(ˆk+1k − xk+1 )(ˆk+1k − xk+1 )T ] x x E[(Fk (ˆkk − xk ) − ηk )(Fk (ˆkk − xk ) − ηk )T ] x x T T Fk E[(ˆkk − xk )(ˆkk − xk )T ]Fk + E[ηk ηk ] x x T Fk Pkk Fk + Qk where the system noise ηk and the previous estimation error ˆkk − xk were assumed to be uncorrelated. Figure 7.10) and (7. we now show that estimating the initial state x0 from the ﬁrst k + 1 measurements. 7. Matlab routines available through the class Web page implement a Kalman ﬁlter (with naive numerics) to estimate the state of that system from simulated observations.
time 2 1.4 0.2 1 0.2 0 0 5 10 15 20 25 30 Figure 7.7.2 1 0.8 0.3: The true and estimated trajectories get closer to one another.4 1. distance between true and estimated missile position vs. Trajectories start on the right.4 1.7.2 5 10 15 20 25 30 35 Figure 7. LINEAR SYSTEMS AND THE KALMAN FILTER 97 true (dashed) and estimated (solid) missile trajectory 1.4: The estimate actually closes in towards the target.2 0 −0.8 0.4 0.8 1. .6 1.6 0.6 0.
k yk = Hk Φ(k − 1. .22) for every value of k = 0. . downwards ones to state update. . . . the initial state of the system. + Gk−1 uk−1 ) + Hk (Fk−1 . = . + ηk−1 ) + ξk or in a more compact form. j)ηj−1 + ξk is noise.98 40 CHAPTER 7.22) where Φ(l. . We can write one system like the one in equation (7. STOCHASTIC STATE ESTIMATION norm of the state covariance matrix vs time 35 30 25 20 15 10 5 0 0 5 10 15 20 25 30 Figure 7. F0 x0 + Hk (Fk−1 . where K is the last time instant considered. . . . The key thing to notice about this somewhat intimidating expression is that for any k it is a linear system in x0 . = Hk Fk−1 Fk−2 xk−2 + Hk (Fk−1 Gk−2 uk−2 + Gk−1 uk−1 ) + Hk (Fk−1 ηk−2 + ηk−1 ) + ξk Hk Fk−1 . .23) . . . .5: After an initial increase in uncertainty. the norm of the state covariance matrix converges to zero. yK (7. and we obtain a large system of the form zK = ΨK x0 + gK + nK where zK = y0 . K. Fj 1 for l ≥ j for l < j νk = Hk j=1 Φ(k − 1. F1 η0 + . 0)x0 + Hk j=1 Φ(k − 1. . . j)Gj−1 uj−1 + νk (7. . j) = and the term k Fl . . . F1 G0 u0 + . . . Upwards segments correspond to state propagation.
23) all at once. = . 99 nK HK (Φ(K − 1. As time goes by. having better and better approximations to the solution as new data come in is much preferable in a dynamic setting. . HK Φ(K − 1. Rather. . Intuitively. . that is. However. under the assumption that the noise that affects each of the equations has the same probability distribution. This is equivalent to assuming that all the noise terms in nK are equally important. we know the covariance matrices of all these noise terms. . . that all the noise terms in nK in equation 7. the result with the pseudoinverse is the K same as we would obtain by solving the normal equations. νK Without knowing anything about the statistics of the noise vector nK in equation (7. the Kalman ﬁlter for a linear system has been shown to be equivalent to a linear equation solver. to obtain an estimate of x0 from the measurements y0 . Better information ought to yield more accurate results. knowledge of the initial state may obsolesce and become less and less useful.23) minimizes the residue between the left and the righthand side under the assumption that all equations are to be treated the same way. data my never stop arriving.23) is not solved all at once. . the Kalman ﬁlter differs from a linear solver in the following important respects: 1. and the Kalman ﬁlter makes optimal use of this information for a proper weighting of each of the scalar equations in (7. LINEAR SYSTEMS AND THE KALMAN FILTER ΨK = gK = H0 H1 F0 .23 are equally important.23). and weigh each equation to keep these covariances into account.23).23 are not equally important. an initial solution is reﬁned over time as new measurements become available. a small covariance means that we believe in that measurement. and therefore in that equation. In some applications. so that Ψ† = (ΨT ΨK )−1 ΨT . and this is in fact the case.7. . The noise terms in nK in equation 7. K K K The least square solution to system (7. the best we can do is to solve the system zK = ΨK x0 + gK in the sense of least squares. x x 3. . A solution for the estimate ˆkk of the current state is given. K)GK−1 uK−1 ) ν0 . The Kalman ﬁlter computes uptodate information about the current state. and not only for the estimate ˆ0k of the initial state. so we ought to be able to do better. where one cannot in general wait for all the data to be collected. Measurements come with covariance matrices. However. . The system (7. . 1)G0 u0 + . . In summary. yK : ˆ0K = Ψ† (zK − gK ) x K where Ψ† is the pseudoinverse of ΨK . The ﬁnal solution can be proven to be exactly equal to solving system (7. 2. 0) 0 H1 G0 u0 . . . + Φ(K − 1. We know that if ΨK has full rank. However. which should consequently be weighed more heavily than others. The quantitative embodiment of this intuitive idea is at the core of the Kalman ﬁlter.7.
This action might not be possible to undo. Are you sure you want to continue?
We've moved you to where you read on your other device.
Get the full title to continue reading from where you left off, or restart the preview.