Linear Algebra and Optimization
Contents

4 Geometric Foundations
  4.1 Vector Spaces
  4.2 Application: Polynomial Interpolation and Overfitting

7 Connections to Optimization
  7.1 Low Rank Approximation
  7.2 Application: Recommendation Systems

8 Quadratic Programming
  8.1 Using the Eigendecomposition
  8.2 Optimality Conditions
Chapter 1
Vectors, Matrices and Linear Equations
Sometimes we will work with vectors whose entries are complex numbers, and in other classes you may also come across vectors over finite fields.
The dimension of the vector is just the size of the tuple. So in the example
above, the dimension is three. Moreover sometimes we will want to extract
individual entries from the vector, which are called coordinates. In the
example above, the first coordinate is −1, the second coordinate is 3/2 and
so on.
And more generally we can form a linear combination of them too which is
an expression that looks like
αx + βy
for some scalars α and β and some vectors x and y. We will spend much
of the beginning of the course developing geometric and algebraic insights
about what certain operations on vectors do and how we can reason about
them.
But perhaps another question on your mind is: Why do we care about
vectors in the first place? The usefulness of linear algebra comes from the fact that, all around us in science, engineering and the social sciences, there are things that can be modeled as vectors, and the insights of linear algebra will teach us something useful and important about them. Throughout this course, we
will motivate the theory with concrete applications. As an example where
vectors arise naturally, consider the two jpegs below. One picture is of
Marilyn Monroe and the other is of Albert Einstein.
Each picture is 320 × 360 pixels and we can represent each grayscale
value as a number between 0 and 1. Thus we can think about a picture as
a vector with 115200 = 320 × 360 dimensions, once we fix a convention of which pixel is associated with which coordinate. Is there a way to manipulate
these two vectors, one for each image, so that we can combine them in a
visually interesting way?
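As a sketch of how one might manipulate these image-vectors in code (the picture files themselves are not included here, so the arrays below are random stand-ins), a simple linear combination of the two vectors already produces a blended image:

```python
import numpy as np

# A minimal sketch: treat each grayscale picture as a vector in R^115200
# and form a linear combination. The arrays are placeholders for the
# actual Einstein and Monroe images.
h, w = 360, 320                     # 320 x 360 pixels as in the text
einstein = np.random.rand(h * w)    # stand-in for the Einstein image vector
marilyn = np.random.rand(h * w)     # stand-in for the Monroe image vector

alpha, beta = 0.5, 0.5              # scalars of the linear combination
blend = alpha * einstein + beta * marilyn

image = blend.reshape(h, w)         # back to a 2D grid of pixels for display
print(image.shape)                  # (360, 320)
```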
If you look at the picture up close, it looks like Albert Einstein, but if you step back or squint, it looks more like Marilyn Monroe.
The dimensions of the matrix are the number of rows and columns. So in
this example the matrix is 2 × 2 and more generally a matrix with n rows
and m columns would be n × m. If the matrix above represents a linear
function, what operation on a vector is it supposed to represent?
Returning to the example above, what does the function that corre-
sponds to multiplying the vector by the matrix
$$\begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix}$$
actually do to a point? If we think about a general point in the x-y plane
as a vector then
$$\begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} y \\ -x \end{pmatrix}$$
This corresponds to rotating every vector in the x-y plane clockwise by
π/2. Much of this class will also be about translating applications into
the language of linear algebra, so that we have many powerful tools at our
disposal. So, for example, what if we want to rotate every vector in the x-
y plane counterclockwise by π/4? What matrix-vector product represents
this operation? It takes some practice to go back and forth between the
abstraction and the application, and you should try this for yourself here
and throughout these notes.
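As one hedged answer to the π/4 question, here is a small numerical sketch using the standard counterclockwise rotation matrix (which reappears in Chapter 3):

```python
import numpy as np

theta = np.pi / 4                       # counterclockwise rotation angle
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

v = np.array([1.0, 0.0])                # a point on the x-axis
print(R @ v)                            # [0.7071..., 0.7071...], rotated by 45 degrees
```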
first coordinate of the vector, and so on. We can also define the operation
of multiplying a 3 × 3 matrix by a 3-dimensional vector:
$$\begin{pmatrix} a & b & c \\ d & e & f \\ g & h & i \end{pmatrix}\begin{pmatrix} x \\ y \\ z \end{pmatrix} = \begin{pmatrix} ax + by + cz \\ dx + ey + fz \\ gx + hy + iz \end{pmatrix}$$
But again we can interpret the result as a linear combination of the columns
of the matrix:
$$x\begin{pmatrix} a \\ d \\ g \end{pmatrix} + y\begin{pmatrix} b \\ e \\ h \end{pmatrix} + z\begin{pmatrix} c \\ f \\ i \end{pmatrix}$$
In exactly the same way we can define the general matrix-vector product,
but only if the dimensions agree! If x is a vector, we let xj denote its jth
coordinate. And if A is a matrix we let Ai,j denote its entry in row i, column
j. Then
$$(Ax)_i = \sum_{j} A_{i,j}\, x_j$$
The proof of this proposition is helpful for building intuition. If you were
given a linear function f how would you go about finding the corresponding
matrix A? Suppose x is m dimensional.
$$x = x_1 e_1 + x_2 e_2 + \cdots + x_m e_m$$
Now suppose we knew f (e1 ), f (e2 ), etc. Can we figure out what f (x) is for
any x? We can use the linearity assumption to write
$$f(x) = x_1 f(e_1) + x_2 f(e_2) + \cdots + x_m f(e_m)$$
Just like vectors, matrices are a useful and powerful way to represent
things around us. We have already seen one example, where the right way to
create a composite image of Marilyn Monroe and Albert Einstein is actually
to design the right matrices to define the right linear function to apply to
vectors representing the two images. As another example, in engineering,
the state of a system is often represented as a vector and the way that the
system evolves over time can be expressed as a matrix-vector product.
For example, suppose we have two species, let’s say the frogs and the
flies. These species have a predator-prey relationship whereby if there were
no frogs to eat the flies, the number of flies would grow exponentially.
But conversely if there were too many frogs and too few flies for them to
feed on, the number of frogs might decay exponentially. We can model
this system as follows. Let g(t) and y(t) denote the number of frogs and
flies respectively at time t. Then from studying their dynamics, we might
come to the conclusion that the way the system updates is governed by the
equations
$$\begin{pmatrix} g(t+1) \\ y(t+1) \end{pmatrix} = A \begin{pmatrix} g(t) \\ y(t) \end{pmatrix}$$
for some 2 × 2 update matrix A. There are many questions we can ask about how the system evolves: Does it converge, and if so to what? If we know the state at some later time, can we solve for the initial conditions? And how sensitive is the behavior if we perturb the initial conditions?
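Since the specific update coefficients are not reproduced above, here is a small simulation sketch with made-up numbers, just to show how such a linear update is iterated in code:

```python
import numpy as np

# Hypothetical update matrix: these coefficients are placeholders for
# illustration only, not the values used in the notes.
A = np.array([[0.6, 0.4],    # frogs at time t+1 from (frogs, flies) at time t
              [-0.3, 1.2]])  # flies at time t+1 from (frogs, flies) at time t

state = np.array([10.0, 100.0])   # initial (frogs, flies)
for t in range(5):
    state = A @ state             # one step of the linear dynamical system
    print(t + 1, state)
```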
The 2 × 2 matrix $A = \begin{pmatrix} 2 & -3 \\ 1 & 1 \end{pmatrix}$, which collects the coefficients of the constraints 2x − 3y = 0 and x + y = 5, is called the constraint matrix. Last time we talked
about matrix-vector products when all the entries were scalars. Now we are
allowing the entries to be variables too. The recipe for how to compute a
matrix-vector product still works the same, and when we multiply a matrix
and a vector of variables, instead of a vector of scalars, we get a vector of
linear functions of the variables x and y. In particular we get
$$\begin{pmatrix} 2x - 3y \\ x + y \end{pmatrix} = \begin{pmatrix} 0 \\ 5 \end{pmatrix}$$
More compactly, we write the whole system as Ax = b.
Let's dig into the geometry of the inner product some more. The set of solutions to a single linear equation is an affine hyperplane.
Let’s visualize what’s happening when we have two linear equations and
two variables. Consider the constraint 2x − 3y = 0 that comes from the
first row of our linear system. The solution is the blue line:
If we also plot the vector corresponding to the first row of A in red, we can
see that it makes a right angle with the line. Similarly if we take the second
row, which corresponds to the constraint x + y = 5, we would get the green
line:
The set of solutions is the intersection of the two lines. Thus there is a
solution x = 3, y = 2 and moreover it is unique.
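The same intersection can be computed numerically; a minimal sketch using NumPy:

```python
import numpy as np

A = np.array([[2.0, -3.0],
              [1.0,  1.0]])
b = np.array([0.0, 5.0])

x = np.linalg.solve(A, b)   # intersection of the lines 2x - 3y = 0 and x + y = 5
print(x)                    # [3. 2.]
```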
In three and higher dimensions it works the same, it’s just harder to
draw! For example, consider a linear system with three variables and three
constraints:
$$\begin{pmatrix} 2 & -3 & 1 \\ 1 & 1 & 2 \\ 3 & -2 & -1 \end{pmatrix}\begin{pmatrix} x \\ y \\ z \end{pmatrix} = \begin{pmatrix} 0 \\ 5 \\ 2 \end{pmatrix}$$
The first row corresponds to the constraint 2x − 3y + z = 0. Instead of a
line it is a plane – i.e. an infinite sheet of paper in three dimensions. The
vector corresponding to the first row makes a right angle with the sheet of
paper. This is an affine hyperplane and when we take the intersection of
the three affine hyperplanes, we get exactly the solutions to the system of
linear equations.
We can also go the other direction, from the visualization to the system
of linear equations. If we have a linear system Ax = b where the constraint
matrix is 2 × 2 which has no solution, what would A look like? Well, since
the solutions are the intersections of two lines, one for each row, we need
the two lines to have no point in their intersection. The only way to do this
is to have the lines be parallel. For example:
Since the two lines are parallel, their normal vectors are either the same
or are scalar multiples of each other. This means that the first and second
rows of A must also be scalar multiples of each other. For example, we
could get
$$\begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} 5 \\ 3 \end{pmatrix}$$
which indeed has no solutions because the equations are contradictory.
However, for a 3 × 3 constraint matrix, we could have no solutions for
a more complicated reason. We will get to this later.
The Column View In the previous subsection, we pared away the can-
didate solutions by taking the intersection of constraints. Alternatively we
could incrementally build up the set of b’s for which there is a solution to
Ax = b. To make this precise, we need another key definition:
$$c_1 v_1 + c_2 v_2 + \cdots + c_m v_m$$
Again, let’s see how this works out on the same system of two equations
in two variables. Our linear system
$$\begin{pmatrix} 2 & -3 \\ 1 & 1 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} 0 \\ 5 \end{pmatrix}$$
is equivalent to
$$x\begin{pmatrix} 2 \\ 1 \end{pmatrix} + y\begin{pmatrix} -3 \\ 1 \end{pmatrix} = \begin{pmatrix} 0 \\ 5 \end{pmatrix}$$
Now what happens if we start with a1 , the first column of A? What is the
set of vectors we can obtain as a linear combination of this one fixed vector?
We get the blue line:
And what about all the vectors that are linear combinations of the second
column of A? We get the red line:
and everything in between! Thus the set of all vectors we can get as a linear
combination of the columns of A is all of R2 . The way to think about it
is that for any point we want to represent, we just have to walk along the blue line until the direction we have left to go is along the direction of the red line. What this means for us is that it's not just that the particular choice
of b we had has a solution to Ax = b, but rather every choice of b will have
a solution. Moreover the solution is unique too, because for every vector
b there is a unique way to reach it by walking in the direction of the blue
line, and then switching to walking in the direction of the red line.
Definition 8. The span of vectors v1 , v2 , · · · , vn is all the vectors that can
be obtained as linear combinations of v1 , v2 , · · · , vn .
Just as we did before, we can go the opposite direction, from the visual-
ization to the linear system. If A is 2 × 2, what does a linear system Ax = b
look like that does not have a solution? If we think about how we built up
our skewed grid, if the blue line and red line pointed in the same direction,
when we walk along the blue direction and then switch to the red direction
we would never leave the original line. Thus we need the columns of A to
be scalar multiples of each other. So, for example, the linear system
$$\begin{pmatrix} 2 & 4 \\ 1 & 2 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} 3 \\ 1 \end{pmatrix}$$
would not have a solution because the span of the columns of A is a line
and the vector on the right hand side is not on it.
We will develop the theory in more detail, but for now let’s appreciate
that things become more complicated when we have 3×3 or larger constraint
matrices. From the 2 × 2 case, you might get the false impression that a
linear system can only have no solution if there is some pair of rows that are
scalar multiples of each other. But consider the system of linear equations
$$\begin{pmatrix} 2 & -3 & 0 \\ 1 & 1 & 2 \\ 3 & -2 & 2 \end{pmatrix}\begin{pmatrix} x \\ y \\ z \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \\ 3 \end{pmatrix}$$
You can check that there are no two rows that are scalar multiples of each
other. Similarly, there are no two columns that are scalar multiples of each
other. But nevertheless, there is no solution! How can we see this? If we
add up the linear equations corresponding to the first two rows, we will get
another valid constraint
3x − 2y + 2z = 2
However the third row corresponds to the constraint
3x − 2y + 2z = 3
And thus we cannot have all three equations hold simultaneously. Taking
this a step further, now that we know the linear system does not have a
solution, it must be true that the span of the columns of A is not all of
R3 . But how can we see this? Check for yourself that you can express
the third column as a linear combination of the first two columns. This
means that as we grow the set of vectors, from the ones we can obtain as a linear combination of the first column of A, to the ones we can obtain as a linear combination of the first and second columns of A, and finally of the first, second and third columns of A, at one of these steps we won't actually reach any new vectors: the set of vectors we can reach stays confined to a plane. This leads
us to another key definition:
Definition 9. A set of vectors v1 , v2 , · · · , vn is linearly independent if no
vi can be expressed as a linear combination of the others.
Gaussian elimination gives us a systematic procedure for finding solutions. We have already seen some of the ingredients that
will go into this, namely we showed how to combine equations to produce
a contradiction, which in turn told us that there were no solutions to our
linear system at hand. The idea here will be related. We will reason about
ways to make a linear system simpler while preserving its solutions.
For example, for the augmented matrix of the linear system we are interested in, we have:
$$\left[\begin{array}{ccc|c} 1 & -1 & 2 & 1 \\ -2 & 2 & -3 & -1 \\ -3 & -1 & 2 & -3 \end{array}\right]$$
$$r_2' = \begin{pmatrix} 0 & 0 & 1 & 1 \end{pmatrix}$$
The new row, which we denote by r2′ to keep track of the fact that it’s a
synthetic row that wasn’t in our original linear system, has the property
that its pivot is now in the third column. So we’ve made progress!
$$r_3' = \begin{pmatrix} 0 & -4 & 8 & 0 \end{pmatrix}$$
Thus the new augmented matrix is:
$$\left[\begin{array}{ccc|c} 1 & -1 & 2 & 1 \\ 0 & 0 & 1 & 1 \\ 0 & -4 & 8 & 0 \end{array}\right]$$
Now you might be wondering: Why are we allowed to do these operations?
The key point is that the steps we are performing preserve the set of all
solutions. The steps were defined with respect to the augmented matrix,
so to make sense of this statement we have to unpack what this means for
the associated linear system:
Proposition 6. Suppose we have a linear system Ax = b and we create its
augmented matrix [A, b] and add a scalar multiple of one row to another to
get a new augmented matrix [A′ , b′ ]. Then x is a solution to Ax = b if and
only if x is a solution to A′ x = b′ .
The augmented matrix is really just bookkeeping. Let’s see why this
proposition is true in a simplified setting. Let S be the solutions to the
linear system
$$\begin{cases} x_1 - x_2 + 2x_3 = 1 \\ -2x_1 + 2x_2 - 3x_3 = -1 \end{cases}$$
These equations represent rows r1 and r2 . Let S ′ be the solutions to the
linear system
$$\begin{cases} x_1 - x_2 + 2x_3 = 1 \\ x_3 = 1 \end{cases}$$
These equations represent r1 and r2′ . Recall that r2′ = r2 +2r1 . Now it is easy
to see why if x ∈ S we must also have x ∈ S ′ : If x satisfies the constraints
associated with r1 and r2 then when we add scalar multiples of the two
equations that hold, we get another equation, in our case corresponding
to r2′ , that also must hold. And so if x ∈ S we have that x ∈ S ′ too.
The other direction is more interesting. Suppose x ∈ S ′ , meaning that it
satisfies the equations corresponding to r1 and r2′ . How can we show that
x must also satisfy the equation corresponding to r2 ? The key point is
that the operation we performed on the rows of the augmented matrix was
invertible, and we can undo it. We can rewrite the relation r2′ = r2 + 2r1
instead as r2 = r2′ − 2r1 . From this equation it now follows that if x satisfies
the constraints associated with r1 and r2′ we can again add scalar multiples
of the two equations to get another equation, in this case we get back r2 ,
that also must hold. This completes the equivalence because we have now
shown that if x ∈ S ′ we must have that x ∈ S too.
In any case, we have now taken one step of Gaussian elimination. The
new linear system definitely looks simpler. But when should we stop trying
to move the pivots to the right?
Definition 12. An augmented matrix is in row echelon form if all its pivots
go strictly from left to right.
The new augmented matrix is not in row echelon form. In fact we’ll need
a new operation to fix it. The missing ingredient is that we can also swap
rows. For example, if we swap the second and third rows of the augmented
matrix, we get:
$$\left[\begin{array}{ccc|c} 1 & -1 & 2 & 1 \\ 0 & -4 & 8 & 0 \\ 0 & 0 & 1 & 1 \end{array}\right]$$
But again, we ask: Why are we allowed to do this operation? It again
preserves the set of all solutions. This is easy to see because each row of
the augmented matrix represents a linear equation we must satisfy, and if
we think back to our geometric picture where we reasoned about the set
of solutions by taking all points in Rm and intersecting them with affine
hyperplanes, it doesn’t matter which order we take the intersection. We
still get the same set of solutions at the end.
Backsubstitution Now that we have put our linear system in row echelon
form, it is easy to find one or even all of the solutions by backsubstitution.
Let's translate our augmented matrix to a linear system. We have
$$\begin{cases} x_1 - x_2 + 2x_3 = 1 \\ -4x_2 + 8x_3 = 0 \\ x_3 = 1 \end{cases}$$
The last equation gives x3 = 1; substituting into the second gives x2 = 2; and substituting both into the first gives x1 = 1. You can check that when we plug in these values
into the original linear system, they work too. But of course we knew that
would be true already. Moreover not only have we found a solution, we’ve
actually shown that it is unique, because we didn't have a choice for how to set x3, and conditioned on what we set for its value, we didn't have a choice
for x2 and so on.
$$r_2' = \begin{pmatrix} 0 & \tfrac{5}{2} & 2 & b_2 - \tfrac{b_1}{2} \end{pmatrix}$$
As an aside, you can always solve linear systems on a computer. But this is
one example where understanding how to do things by hand gives you new
insights. For example, what if we want to solve a system of linear equations
Ax = b, Ax = b′ , and so on, where the constraint matrix always stays the
same? Instead of calling the linear system solver separately on each one,
you could instead put the system in row echelon form just once, and keep
track of the corresponding operations you are supposed to perform on the
coordinates of the vector on the right hand side, and then run backsolve for
each linear system. This would be much faster.
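A minimal sketch of this idea using SciPy's LU routines (this is one standard way to reuse the elimination work, not necessarily the exact workflow the notes have in mind):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

A = np.array([[2.0, -3.0,  1.0],
              [1.0,  1.0,  2.0],
              [3.0, -2.0, -1.0]])

lu, piv = lu_factor(A)                        # do the elimination work once
for b in (np.array([0.0, 5.0, 2.0]), np.array([1.0, 0.0, 1.0])):
    print(lu_solve((lu, piv), b))             # reuse the factorization for each right-hand side
```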
Again we can ask: What would its row echelon form look like? And how
could we deduce that the solution is not unique? Putting it in row echelon
form, we get:
$$\begin{pmatrix} 2 & -3 & 0 \\ 0 & \tfrac{5}{2} & 2 \\ 0 & 0 & 0 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} 1 \\ \tfrac{1}{2} \\ 0 \end{pmatrix}$$
Notice that there is no pivot in the last row. But it still meets our definition
of row echelon form. In particular, any row consisting of only zeros must be
at the bottom. Now what would happen if we perform backsubstitution?
We would get the equation 0 = 0 from the last row, so any choice of x3
works! But now for any choice we make, say x3 = t, the second equation tells
us a unique way to set x2 , and so on. So the set of solutions looks like a line
in three dimensions because there is one degree of freedom corresponding
to the choice of x3 . To be more explicit, we can write down the equations
of the line as
$$x_1 = \frac{4}{5} - \frac{6t}{5}, \qquad x_2 = \frac{1}{5} - \frac{4t}{5}, \qquad x_3 = t$$
Note that it does not matter whether the augmented matrix has zeros or
not in the last column, since that corresponds to what we do to the vector
b. Being singular is a property of just the constraint matrix.
Again we can put it in row echelon form. (Does the sequence of operations, i.e. which rows we add to and subtract from which other rows, need to change?)
We get
$$\begin{pmatrix} 2 & -3 & 0 \\ 0 & \tfrac{5}{2} & 2 \\ 0 & 0 & 0 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} 1 \\ \tfrac{1}{2} \\ 1 \end{pmatrix}$$
Now if we try backsubstitution we get the equation 0 = 1 from the last
row. So while our operations have preserved the set of solutions, the linear
system we derived clearly has no solutions, so that must be true of the
linear system we started off with too!
Taking a step back, there is an important lesson that will come up again
and again for us: It is much easier to get geometric insights about a system
of linear equations (e.g. what does the set of solutions look like?) by
first putting it into a convenient normal form. We will develop much more
powerful normal forms, that will tell us other things too, like quantitative
ways to measure how far a matrix is from being singular, etc.
First we can multiply the rows by scalars to make all the pivots equal to
one:
$$\left[\begin{array}{ccc|c} 1 & -1 & 2 & 1 \\ 0 & 1 & -2 & 0 \\ 0 & 0 & 1 & 1 \end{array}\right]$$
And now, for any row which contains a pivot, we can add and subtract it
from the rows above it to zero out the remaining entries in that column.
We would get
$$\left[\begin{array}{ccc|c} 1 & 0 & 0 & 1 \\ 0 & 1 & 0 & 2 \\ 0 & 0 & 1 & 1 \end{array}\right]$$
This corresponds to the linear system
$$\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} 1 \\ 2 \\ 1 \end{pmatrix}$$
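For concreteness, here is a small sketch of forward elimination on an augmented matrix. It uses partial pivoting (swapping in the largest available pivot), so the intermediate matrices differ from the hand computation above, but the resulting row echelon form has the same solution set:

```python
import numpy as np

def row_echelon(M):
    """Reduce an augmented matrix to row echelon form (a minimal sketch
    with partial pivoting; it does not handle every degenerate layout)."""
    M = M.astype(float).copy()
    rows, cols = M.shape
    r = 0
    for c in range(cols - 1):                     # last column is the right-hand side
        pivot = r + np.argmax(np.abs(M[r:, c]))   # choose the largest pivot candidate
        if np.isclose(M[pivot, c], 0.0):
            continue                              # no pivot in this column
        M[[r, pivot]] = M[[pivot, r]]             # swap rows
        for i in range(r + 1, rows):              # zero out entries below the pivot
            M[i] -= (M[i, c] / M[r, c]) * M[r]
        r += 1
        if r == rows:
            break
    return M

aug = np.array([[ 1.0, -1.0,  2.0,  1.0],
                [-2.0,  2.0, -3.0, -1.0],
                [-3.0, -1.0,  2.0, -3.0]])
print(row_echelon(aug))
```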
All the resistors in the circuit have unit resistance. Our ammeter will send
one unit of current into the junction a1 and one unit of current will leave
from the junction a6 . We would like to know how this current splits along
the different branches of the circuit.
Fact 1. Ohm's law tells us that the current across a resistor is equal to
$$i = \frac{v_1 - v_2}{R}$$
where v1 and v2 represent the voltages at the two endpoints, and R is the
resistance.
Consider the junction a2 . The total current going in must equal the current
going out, so we get
i1 = i2 + i5
We can use Ohm's law (Fact 1) to rewrite this in terms of the voltage differences
as
(v2 − v1 ) = (v3 − v2 ) + (v5 − v2 )
Moreover we assumed there is one unit of current going into a1 and this
same amount of current must exit. So we get
v2 − v1 = 1
because all the resistors have unit resistance. Similarly we get the equation
v6 − v5 = 1
Moreover there are interesting ways to connect the general theory and
systems of linear equations. Since the vector on the right hand side repre-
sents the net current out of each junction, we know its entries must sum to
zero, since otherwise current would not be conserved. So if we try to solve
the linear system with a right hand side whose entries do not sum to zero,
what should happen? We should find that there is no solution. How can
we see this? If we add up all the rows in the constraint matrix, they would
sum to zero. This means that our matrix is singular, and if the sum of the
entries in the vector on the right hand side is not also zero, we would get a
contradiction.
Let’s take these ideas even further. If you’ve had to solve electrical
circuit problems before, you are probably familiar with the rule for how to
simplify the circuit diagram of two resistors in parallel:
This means that for any amount of current i that we send through the cir-
cuit, the voltage difference v1 − v2 will be the same across the two circuits.
There are even more complicated rules, such as the Y − ∆ transform, that
tell us the following two circuits are equivalent:
where we set
$$R_1' = \frac{R_2 R_3}{R_1 + R_2 + R_3}, \qquad R_2' = \frac{R_1 R_3}{R_1 + R_2 + R_3} \qquad \text{and} \qquad R_3' = \frac{R_1 R_2}{R_1 + R_2 + R_3}$$
The claim is that for any amount of current that we send into and out of
the three junctions, the voltage differences we get are the same. This is
equivalent to saying that for any right hand side in our system of linear
equations, the solution we find from the two circuits is the same. This
is because we can go from one linear system to the other by doing only
elementary row operations where we add and subtract rows from each other.
The details are complicated, however.
Chapter 2
Matrix Multiplication and Inverses
When the number of columns in A does not match the number of rows
in B, we cannot multiply them. Another way to think about the formula
$$C_{i,k} = \sum_{j} A_{i,j} B_{j,k}$$
is the following. Let Bk and Ck be the kth columns of B and C respectively.
Then Ck = ABk . Thus matrix multiplication can also be thought of as a
sequence of matrix-vector products. Yet another way to think about it is
that the entry Ci,k is the result of taking the inner product of the ith row
of A and the kth column of B.
h(x+x′ ) = g(f (x+x′ )) = g(f (x)+f (x′ )) = g(f (x))+g(f (x′ )) = h(x)+h(x′ )
So what is this matrix C? Recall that ei is the ith standard basis vector
(Definition 4). Then Cei is the ith column of C. So if we want to compute
C we can compute
h(ei ) = g(f (ei )) = A(Bei )
for all i. From this formula, we get that Ci = ABi , which agrees with our
formula for matrix multiplication. Moreover it shows that the matrix mul-
tiplication formula represents the natural way to compose linear functions.
Graphs come up all over the place. For example, they can be used to
describe social networks. Vertices would represent people and there would
be an edge between a pair of people if, say, they are friends on Facebook.
We will work with the following graph:
Definition 18. An adjacency matrix is a matrix where each row and each
column represents a vertex in V . Moreover there is a 1 in row i, column j
if and only if there is an edge between the corresponding vertices in G.
where we have chosen the convention that the first row and column represent
the vertex a, the second row and column represent the vertex b, and so on.
Now what happens if we multiply A by itself? We usually write AA = A2 .
We get
$$\begin{pmatrix} 0 & 1 & 0 & 0 \\ 1 & 0 & 1 & 1 \\ 0 & 1 & 0 & 1 \\ 0 & 1 & 1 & 0 \end{pmatrix}\begin{pmatrix} 0 & 1 & 0 & 0 \\ 1 & 0 & 1 & 1 \\ 0 & 1 & 0 & 1 \\ 0 & 1 & 1 & 0 \end{pmatrix} = \begin{pmatrix} 1 & 0 & 1 & 1 \\ 0 & 3 & 1 & 1 \\ 1 & 1 & 2 & 1 \\ 1 & 1 & 1 & 2 \end{pmatrix}$$
What remains is to figure out how to interpret the result in graph theoretic
terms. Consider the entry in row 1, column 3 in A2 . It came from the
expression
A1,1 A1,3 + A1,2 A2,3 + A1,3 A3,3 + A1,4 A4,3
We claim that each term in this expression is a candidate walk. For exam-
ple, since the first row and column are associated with vertex a, and the
third row and column are associated with vertex c, the first term A1,1 A1,3
represents the length two walk a, a, c. But since there is no edge from a
to a nor is there one from a to c this walk shouldn’t be counted, and in-
deed A1,1 A1,3 = 0. We can do this exercise for the second term A1,2 A2,3
too. This term represents the walk a, b, c, which is a valid walk, and indeed
A1,2 A2,3 = 1 and contributes one to our sum, and so on. Thus the entry in
row 1 column 3 in A2 counts the number of length two walks from a to c in
G.
$$A^{\ell} = A^{\ell-1} A$$
Now by induction the (i, j) entry of A^{ℓ−1} counts the number of walks from i to j of length ℓ − 1. And Aj,k counts the number of length one walks from j to k. The
point is that every walk of length ℓ from i to k corresponds to the choice
of a walk of length ℓ − 1 from i to j and a walk of length 1 from j to k for
some j. And conversely every choice of a walk of length ℓ − 1 from i to j
and a walk of length 1 from j to k for some j also yields a walk of length
ℓ from i to k by concatenation. Thus matrix multiplication is counting the
right thing.
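We can check the walk-counting interpretation numerically for the adjacency matrix of the four-vertex example:

```python
import numpy as np

# Adjacency matrix of the four-vertex graph a, b, c, d from the example
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]])

A2 = A @ A
print(A2[0, 2])                          # 1: one walk of length two from a to c (namely a, b, c)
print(np.linalg.matrix_power(A, 3))      # entries count walks of length three
```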
Basic Facts We will close this section by introducing some basic facts
about multiplying matrices:
Fact 2. Matrix multiplication is associative, i.e. for A, B and C which are
m × n, n × p and p × ℓ matrices respectively, we have
A(BC) = (AB)C
You can check this fact algebraically by using the formula for matrix
multiplication. However the intuition behind this fact is simple, again us-
ing Proposition 1 and the connection between matrix multiplication and
composing linear functions. If A, B and C represent the linear functions
f , g and h respectively then A(BC) corresponds to defining a new linear
function p = g ◦ h, i.e. the composition of g and h and computing f ◦ p.
Similarly (AB)C corresponds to defining a new linear function q = f ◦ g
and computing q ◦ h. And f ◦ p = q ◦ h because they both correspond to
the same linear function.
Recall in Section 1.1 we defined the predator-prey model and wrote out
an expression
$$\begin{pmatrix} g(t+1) \\ y(t+1) \end{pmatrix} = A \begin{pmatrix} g(t) \\ y(t) \end{pmatrix}$$
that describes how the number of frogs and flies evolves over time. The
associativity of matrix multiplication implies that if we want to compute
what the population looks like at time t from its initial data, instead of
computing
$$\begin{pmatrix} g(t) \\ y(t) \end{pmatrix} = A\Big(A\Big(\cdots\Big(A\begin{pmatrix} g(0) \\ y(0) \end{pmatrix}\Big)\cdots\Big)\Big)$$
we can compute
$$\begin{pmatrix} g(t) \\ y(t) \end{pmatrix} = \big(\cdots((AA)A)\cdots A\big)\begin{pmatrix} g(0) \\ y(0) \end{pmatrix} = A^{t}\begin{pmatrix} g(0) \\ y(0) \end{pmatrix}$$
Thus if we want to simulate the model for many different initial conditions,
we can just compute A^t once. Later on, we will study what intrinsic properties of A govern how quickly A^t grows/decays, which will tell us important
things about the long term behavior of linear dynamical systems like the
one above.
Fact 3. Matrix multiplication is (right and left) distributive, i.e. for A, B
and C which are m × n, n × p and n × p matrices respectively, we have
A(B + C) = AB + AC
Recall that the identity matrix is a square matrix which has ones along the diagonal and zeros everywhere else. We use In to denote the n × n identity
matrix. We sometimes write I when the dimensions are clear from context.
AB ̸= BA
The dimensions do not necessarily match, so there’s not even a way to make
sense of this statement.
Definition 19. Given a square matrix A, its right inverse is a matrix A−1
with the property that
AA−1 = I
Fact 5. A matrix A−1 is the left inverse, i.e. A−1 A = I if and only if it is
the right inverse, i.e. AA−1 = I.
Thus we will refer to A−1 as simply the inverse of A and drop the qualifier
left or right.
There are many basic questions left. When does the inverse of a matrix
exist? Is it unique? And how do we find it?
Adding a scalar multiple of one row to another row is not the only
operation we perform in Gauss-Jordan elimination. Sometimes we also need to swap two rows or to rescale a row by a nonzero scalar.
Now instead of just keeping track of the current constraint matrix, let’s
also keep track of how it relates to the previous constraint matrix. We start
off with A and the first step of Gauss-Jordan elimination can be performed
by computing
A′ = B1 A
Similarly the next step can be performed by multiplying by B2 and so on.
In the end, this gives us
A → B1 A → B2 B1 A ⇝ (Bp · · · B2 B1 )A
(Bp · · · B2 B1 )A = I
Ax = ei
What happens if we do not get lucky? What if, when we are performing
Gauss-Jordan elimination, we end up with a row of just zeros? For example
when solving the linear system
$$\begin{pmatrix} 2 & -3 & 0 \\ 1 & 1 & 2 \\ 3 & -2 & 2 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \\ 3 \end{pmatrix} \;\rightsquigarrow\; \begin{pmatrix} 2 & -3 & 0 \\ 0 & \tfrac{5}{2} & 2 \\ 0 & 0 & 0 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} = \begin{pmatrix} 1 \\ \tfrac{1}{2} \\ 1 \end{pmatrix}$$
Recall that such a matrix is called singular (Definition 13). What does this
tell us about A−1 ?
Lemma 3. If A is singular then it does not have an inverse.
Proof. Suppose, for the sake of contradiction, that A did have an inverse
A−1. Then for any b, the linear system Ax = b would have the solution
x∗ = A−1 b
However when A is singular, we know that there are some choices of b for
which Ax = b does not have a solution, e.g. the example above. Thus we
have a contradiction, and the inverse of A must not exist.
Other Useful Facts We close this section by stating some other basic
facts about inverses.
(AB)−1 = B −1 A−1
its inverse is
$$A^{-1} = \frac{1}{ad - bc}\begin{pmatrix} d & -b \\ -c & a \end{pmatrix}$$
Chapter 3
Rotations, Projections, Reflections and Transpositions
Lengths and Angles First let’s talk about lengths and angles:
$$\|v\|^2 = \langle v, v\rangle$$
In any case, we can use the inner-product to talk about angles between two
vectors:
Definition 22. The angle θ between two vectors v and w is determined by
the expression
$$\cos\theta = \frac{\langle v, w\rangle}{\|v\|\,\|w\|}$$
Then the circle above denotes the set of all unit vectors. Without loss of
generality, we can assume that v lies on the x axis. Then the inner-product
of w and v is just the x-coordinate of w, which is the same thing as the
cosine of the angle between them. Lengths and angles are helpful things to
keep track of when we apply various matrix operations.
It is easy to see that the length stays the same, because when we compute
the length, we’re just taking the sum of squares of the coordinates in a
different order, but the result is the same.
Question 3. Does applying a permutation matrix change the angle between
vectors?
Again, we can see that the angle is unchanged, using the definition of the
inner-product. We already know that the lengths remain unchanged, and
also
$$\langle A^{\pi}x, A^{\pi}w\rangle = \sum_{i=1}^{d} x_i' w_i' = \sum_{i=1}^{d} x_{\pi^{-1}(i)}\, w_{\pi^{-1}(i)} = \langle x, w\rangle$$
where x′ = Aπ x and w′ = Aπ w.
Now let’s take our 3 × 3 permutation matrix A from earlier. We can check
$$\underbrace{\begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix}}_{A^T}\underbrace{\begin{pmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}}_{A} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$$
Thus A−1 = AT . We will see that this is true for any permutation matrix:
2. Thus the matrix AT has a 1 in row j, column i and all the other entries
in this row and column are 0. And so the matrix-vector product AT w sends the ith coordinate of w to the jth coordinate.
These two properties imply that A and AT undo each other.
Let’s dive deeper into the transposition operation. First, applying the
transpose to a product of matrices reverses their order.
Fact 9. (AB)T = B T AT
This last fact is the only one that is not an immediate consequence of
the definition of the transpose. But it is also easy to verify in the sense
that we just need to check that (A−1 )T is the inverse of AT . This follows
because
(A−1 )T AT = (AA−1 )T = I T = I
where the second equality follows from Fact 9.
This ought to preserve lengths. Indeed, writing v = (x, y), we can verify this by observing that
$$\|R_\theta v\|^2 = (x\cos\theta - y\sin\theta)^2 + (x\sin\theta + y\cos\theta)^2 = x^2 + y^2 = \|v\|^2$$
which shows that the squared length of the vector before and after the transformation is the same.
It turns out that for rotation matrices, the inverse is again easy to
describe. If the matrix A rotates a vector by θ in the counter-clockwise
direction, A−1 rotates by −θ in the counter-clockwise direction. Thus we
have
$$R_\theta^{-1} = R_{-\theta} = \begin{pmatrix} \cos(-\theta) & -\sin(-\theta) \\ \sin(-\theta) & \cos(-\theta) \end{pmatrix}$$
But since cos is an even function and sin is an odd function, we can see
that
$$R_\theta^{-1} = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix}$$
And now we can see that Rθ−1 = RθT . Thus we have:
Each matrix performs a rotation in either the x-y, x-z or y-z plane sepa-
rately.
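A quick numerical check of the fact that a rotation matrix is inverted by its transpose:

```python
import numpy as np

theta = 0.7                                         # any angle works
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

print(np.allclose(np.linalg.inv(R), R.T))           # True: the inverse is the transpose
print(np.allclose(R.T @ R, np.eye(2)))              # True: R^T R = I
```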
I.e. the closest vector to v is obtained by zeroing out the y coordinate. Now
we can see that
$$D = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}$$
With this as a building block, how can we project onto a general line ℓ that
makes an angle θ with the x-axis? We claim that it can be written as
P = Rθ DR−θ
This follows because if we apply the operation R−θ , we rotate the line ℓ
onto the x-axis and we rotate v along with it. At this point, we know the
projection onto ℓ can be implemented by multiplying by D. Finally we need
to rotate back to put the line ℓ back to where it should be. This last step can
be implemented by multiplying by Rθ . This is actually our first exposure
to a powerful decomposition, called the singular value decomposition. We
expressed a projection as the product of rotation matrices whose inverses are
easy to understand and a diagonal matrix D that is also easy to understand.
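A small sketch that builds the projection P = Rθ D R−θ for one choice of θ and checks two of its basic properties:

```python
import numpy as np

theta = np.pi / 6                       # the line makes a 30 degree angle with the x-axis
c, s = np.cos(theta), np.sin(theta)
R = np.array([[c, -s], [s, c]])         # rotation by theta
D = np.array([[1.0, 0.0], [0.0, 0.0]])  # projection onto the x-axis

P = R @ D @ R.T                         # R_{-theta} equals R_theta transposed
print(np.allclose(P @ P, P))            # True: projecting twice changes nothing
print(np.isclose(np.linalg.det(P), 0))  # True: P is singular, hence not invertible
```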
From this vantage point, it is easy to answer all our usual questions
about projections:
Question 4. What does a projection do to the length?
Again, we can visualize this geometrically using the picture from earlier.
Suppose we draw a circle with radius ∥a∥. Then we can see that the projection of
a onto the line ℓ is contained in the circle, and the only case it ends up on
the boundary is when a is already on the line. Finally we can ask:
Question 5. Is the projection onto ℓ invertible?
To answer this question, we will appeal to the following fact about in-
vertibility:
Fact 15. Let A and B be square matrices and suppose that A is invertible.
Then B is invertible if and only if AB is invertible.
We will prove this fact later. But already we can see from the expression
P = Rθ DR−θ that a projection is not invertible. This is true because D
is already in reduced row echelon form, but it does not have a pivot in the
last row. We can also see why invertibility fails geometrically: It is easy
to construct two points a and a′ that both project to the same point b.
Hence there is no way to undo the operation uniquely, and knowing the
projection is not enough information to determine the original point. This
is the same thing as saying that the linear system Ax = b does not have a
unique solution.
2P − I = Rθ (2D − I)R−θ
Observing that 2D − I is in row echelon form and all its pivots are nonzero
we know that it is invertible. Now appealing to Fact 15 once more we
conclude that every reflection is invertible.
Chapter 4
Geometric Foundations
Thus a subspace is just a vector space sitting inside another vector space.
We already know that R2 is a vector space. What are its subspaces? We
could choose S = R2 or S = {0}. A more interesting example is a line
passing through the origin. But note that a line that does not pass through
the origin is not a vector space. It would fail to satisfy the property that
αv ∈ S for α = 0. Just as before, we can also think about vector spaces of
matrices. The set of all d × d matrices is a vector space, in which case the
set of all d × d symmetric matrices is a subspace.
A key property of vector spaces is that you can combine them in different
ways to get another vector space:
Lemma 5. If S1 and S2 are subspaces then S = S1 ∩ S2 is too.
span(a1 , a2 , · · · , an ) = {α1 a1 + α2 a2 + · · · + αn an }
A(x + y) = v + w
has a solution for any choice of b1 and b2 . Another way to interpret this
linear system is that it asks whether the vector on the right hand side can be written as a linear combination of the columns of A. What we argued, implicitly, was that C(A) = R2. More generally we have:
Fact 16. The linear system Ax = b has a solution if and only if b ∈ C(A).
Recall that in Fact 5 we asserted that the right and left inverses of square
matrix A are the same, provided A is non-singular. Now here is a powerful
way to connect invertibility to vector spaces:
Ax = (AA−1 )b = Ib = b
Second, let’s prove the reverse direction. What if C(A) = Rn ? How can
we find the inverse of A? Recall in Definition 4 we defined the standard
basis vectors e1 , e2 , · · · , en . Now if e1 ∈ C(A) there must be some x1 that
satisfies Ax1 = e1 , and so on. Now consider
$$A\begin{pmatrix} x_1 & x_2 & \cdots & x_n \end{pmatrix} = \begin{pmatrix} e_1 & e_2 & \cdots & e_n \end{pmatrix} = I$$
and so
$$\begin{pmatrix} x_1 & x_2 & \cdots & x_n \end{pmatrix} = A^{-1}$$
The Nullspace Now let’s talk about our second fundamental subspace:
Definition 31. The nullspace N (A) of a matrix A is
N (A) = {x|Ax = 0}
Thus it is the set of all ways of writing the all zero vector as a linear
combination of the columns of A. Note that if A is an m × n matrix then
its column space C(A) is a subspace of Rm while its nullspace N (A) is a
subspace of Rn . Again it is easy to check that N (A) really is a subspace: If
x, y ∈ N (A) then x+y satisfies A(x+y) = Ax+Ay = 0 and so x+y ∈ N (A).
Moreover for any x ∈ N (A) and any α ∈ R we have Aαx = αAx = 0 and
thus αx ∈ N (A) too. As a caution note that the set S = {x|Ax = b} for
b ̸= 0 is not a subspace. In particular the all zero vector is not in S. It is
sometimes called an affine subspace. As we will see, the columnspace and
the nullspace are very interrelated.
Let’s revisit another example from Section 1.2 with our new abstraction.
We studied the linear system
$$\begin{pmatrix} 2 & -3 & 0 \\ 1 & 1 & 2 \\ 3 & -2 & 2 \end{pmatrix}\begin{pmatrix} x \\ y \\ z \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \\ b_3 \end{pmatrix}$$
and argued that there are some choices of b1 , b2 and b3 for which it has no
solution. We can use Gaussian elimination to put it in row echelon form
and write
$$A = B\begin{pmatrix} 1 & 0 & \tfrac{6}{5} \\ 0 & 1 & \tfrac{4}{5} \\ 0 & 0 & 0 \end{pmatrix}$$
where A is our original constraint matrix and B keeps track of how to undo
the operations we performed in the process of doing Gaussian elimination.
Let A′ be the row echelon form. Can we use this expression to find a nonzero
vector x ∈ N (A)?
Fact 17. If B is square and invertible and A = BA′ then N (A′ ) = N (A).
Fact 18. For any B not necessarily invertible, and A = BA′ then N (A′ ) ⊆
N (A)
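For the 3 × 3 example above, we can compute a nonzero nullspace vector numerically (SciPy's null_space is one convenient way to do this; it is not the method used in the notes, which proceed by elimination):

```python
import numpy as np
from scipy.linalg import null_space

A = np.array([[2.0, -3.0, 0.0],
              [1.0,  1.0, 2.0],
              [3.0, -2.0, 2.0]])

ns = null_space(A)              # orthonormal basis for N(A); here a single column
x = ns[:, 0] / ns[2, 0]         # rescale so the last coordinate is 1
print(x)                        # proportional to (-6/5, -4/5, 1)
print(np.allclose(A @ x, 0))    # True
```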
Finally, let’s connect the columnspace and nullspace. We claim the fact
that A in the example above is square, and that it has a nonzero vector in
its nullspace implies that C(A) ̸= Rn . Why is this? In our example, we
can consider the subspaces S1 , S2 and S3 we get from taking the span of the
first column of A, the span of the first two columns of A, and the span of
all three columns of A. In our example we have that
$$S_1 = \mathrm{span}\left(\begin{pmatrix} 2 \\ 1 \\ 3 \end{pmatrix}\right)$$
is a line. And
$$S_2 = \mathrm{span}\left(\begin{pmatrix} 2 \\ 1 \\ 3 \end{pmatrix}, \begin{pmatrix} -3 \\ 1 \\ -2 \end{pmatrix}\right)$$
is a plane. But what happens when we add the third column in too? We
are now looking at linear combinations of the form
$$\alpha_1\begin{pmatrix} 2 \\ 1 \\ 3 \end{pmatrix} + \alpha_2\begin{pmatrix} -3 \\ 1 \\ -2 \end{pmatrix} + \alpha_3\begin{pmatrix} 0 \\ 2 \\ 2 \end{pmatrix}$$
But we claim we still just get a plane. From the nonzero vector in the
nullspace that we have found, we find that
$$\begin{pmatrix} 0 \\ 2 \\ 2 \end{pmatrix} = \frac{6}{5}\begin{pmatrix} 2 \\ 1 \\ 3 \end{pmatrix} + \frac{4}{5}\begin{pmatrix} -3 \\ 1 \\ -2 \end{pmatrix}$$
Hence we can rewrite the general linear combination of the three vectors as
$$\left(\alpha_1 + \frac{6}{5}\alpha_3\right)\begin{pmatrix} 2 \\ 1 \\ 3 \end{pmatrix} + \left(\alpha_2 + \frac{4}{5}\alpha_3\right)\begin{pmatrix} -3 \\ 1 \\ -2 \end{pmatrix}$$
Proof. The example above already showed how to go in one direction, but
we can state the argument more generally. If we have a nonzero vector
z ∈ N (A) then when we put A in row echelon form we must get a row of
all zeros. But this means that A is singular and by Lemma 6 we have that
C(A) ̸= Rn .
Let p(x) = p0 + p1 x + p2 x^2 + · · · + pd x^d be a degree d polynomial. We want that p(xi) = yi for all i. There are many
basic questions we could ask: Is there always a solution? And if so, how
small can we choose the degree to be? And for a given degree, is there more
than one polynomial that interpolates the data? These questions are all
equivalent to understanding the fundamental vector spaces associated with a particular matrix.
Definition 32. The Vandermonde matrix is an n × (d + 1) matrix
$$V = \begin{pmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^d \\ 1 & x_2 & x_2^2 & \cdots & x_2^d \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_n & x_n^2 & \cdots & x_n^d \end{pmatrix}$$
$$V p = y$$
One way to write down an interpolating polynomial explicitly is to take p(x) = y1 L1(x) + · · · + yn Ln(x), where we set
$$L_i(x) = \prod_{1 \le j \le n,\; j \neq i} \frac{x - x_j}{x_i - x_j}$$
This definition only makes sense if all the xi's are distinct. The key obser-
vation is that Li (xj ) is equal to 1 if j = i and otherwise is equal to 0. From
this fact we can check that
$$p(x_i) = \sum_{j=1}^{n} y_j L_j(x_i) = \sum_{j=1}^{n} y_j \mathbf{1}_{i=j} = y_i$$
We can not only make our polynomial interpolate our data: if we have any other point x0, then by adding to p a polynomial q whose coefficient vector lies in the nullspace of V (so that q vanishes at every xi) and choosing the coefficient α appropriately, we can make p(x) + αq(x) take on essentially any value we want at x0 while still interpolating the data. So when
you are overfitting, i.e. when the nullspace contains nonzero vectors, there
is no guarantee that what you’ve found extrapolates beyond your original
collection of data!
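A small sketch of interpolation through the Vandermonde system, on made-up data points:

```python
import numpy as np

# Hypothetical data: five points sampled from a smooth curve
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.sin(x)

# Degree 4 fits 5 points exactly: the Vandermonde system is square and,
# since the x_i are distinct, invertible.
V = np.vander(x, 5, increasing=True)       # columns 1, x, x^2, x^3, x^4
p = np.linalg.solve(V, y)                  # interpolating coefficients
print(np.allclose(V @ p, y))               # True: the polynomial passes through every point

# With a higher degree the nullspace of V is nonzero: many polynomials
# interpolate, and their values away from the data can differ wildly.
```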
Chapter 5
Linear Independence and Bases
The first subspace is just the zero vector. The second subspace is a
line. The third subspace is the entire plane. Clearly the last subspace is
the biggest, because it strictly contains the others. But what if we have
two subspaces, neither of which contains the other? Consider the following
example:
The first subspace is a line. The second subspace is a plane, but it does
not contain the first subspace. Still we would like to say that the second
subspace is bigger. How can we make this intuitive notion precise?
Linear independence can be defined in several equivalent ways, some of which are more convenient than others at times. Let's give
another equivalent definition:
Definition 33 (Linear independence). A set of vectors {v1 , . . . , vk } ⊂ Rn
is linearly independent (LI) if whenever we have a linear combination equal
to zero, then necessarily all the scalars are zero. Equivalently,
λ1 v1 + · · · + λk vk = 0 ⇒ λ1 = 0, . . . , λk = 0.
Definition 34 (Linear dependence). A set of vectors {v1 , . . . , vk } ⊂ Rn is
linearly dependent (LD) if they are not linearly independent.
(We can also verify this “directly”, since the third vector is three times
the second vector minus twice the first one.)
λ1 v1 + · · · + λk vk = 0.
Notice that the linear system Aλ = 0 always has a solution that comes
from setting λ = 0. The condition that the vectors are linearly independent
stipulates that this is the only solution. We can now nicely put together
some of these notions:
We have then
$$N(A) = N(TA) = N(R) = \left\{\begin{pmatrix} 2\alpha \\ -3\alpha \\ \alpha \end{pmatrix} : \alpha \in \mathbb{R}\right\}.$$
Since the nullspace is nonzero, the vectors are indeed linearly dependent.
Let's see some examples showing that being defined by a large number of gen-
erators does not necessarily imply that a subspace is big, at least without
some further conditions.
Example 4. We have
$$\mathrm{span}\left\{\begin{pmatrix} 2 \\ 2 \end{pmatrix}, \begin{pmatrix} -1 \\ 3 \end{pmatrix}, \begin{pmatrix} -3 \\ 1 \end{pmatrix}\right\} = \mathbb{R}^2,$$
For instance, if fi(t) = t^i, then the elements of this subspace are the univariate polynomials of degree k.
A natural question in this context is “what are all the possible output
trajectories we can obtain?” This is known as the reachable subspace of
a linear dynamical system. We are often interested in understanding how
large this subspace is – in particular, it may allow us to understand which
trajectories are possible.
5.3 Bases
Generators are great, but sometimes they can be a bit “wasteful.”
span{v1 , . . . , vk , w} = span{v1 , . . . , vk }.
Generators: a finite set of vectors that span S. For instance the col-
umn space description C(B) = span{b1 , . . . , bn }, where B = [b1 · · · bn ].
Chapter 6
The Singular Value Decomposition

1. U and V are n × n and m × m matrices and have orthonormal columns
Here the σi ’s are called the singular values. The ui ’s and vi ’s are the columns
of U and V and they are called the left and right singular vectors respec-
tively. We will spend the remainder of this section digesting what this
decomposition means and why it is so powerful. Our first goal is to see how
it unifies our understanding of linear algebra thus far:
Question 6. How can we read off important properties of A from its sin-
gular value decomposition?
Proof. Let’s get some intuition and the proof will follow. Since u1 is in the
column space of A, we should be able to find a vector x so that Ax is in the
direction of u1 , i.e. Ax = cu1 for some non-zero scalar c. How should we
choose such an x? To answer this question, it is convenient to work with
the alternative expression for the singular value decomposition, which tells
us that
Ax = σ1 u1 v1T x + σ2 u2 v2T x + · · · + σr ur vrT x
We want all the terms in this expression, except for the first one, to be zero.
Let's choose x = v1. Since the vi's are orthonormal, v2T v1 = v3T v1 = · · · = 0, and so Av1 = σ1 u1.
C(A) = span(u1 , u2 , · · · , ur )
Lemma 12. The vectors vr+1 , vr+2 , · · · , vm are an orthonormal basis for
N (A).
Proof. The proof follows along similar lines as Lemma 11. If we choose x = vj for j ≥ r + 1 we have
$$A v_j = \sigma_1 u_1 v_1^T v_j + \sigma_2 u_2 v_2^T v_j + \cdots + \sigma_r u_r v_r^T v_j$$
and because the vi's are orthonormal, all the terms are zero. Thus Avj = 0
for j ≥ r + 1. Equivalently vj ∈ N (A). Conversely, consider some x such
that Ax = 0 and let
$$x = \alpha_1 v_1 + \alpha_2 v_2 + \cdots + \alpha_m v_m$$
We want to show that α1 must be zero. What if it were not zero? How
could we contradict the assumption that Ax = 0? Consider
$$u_1^T A x = \sigma_1 u_1^T u_1 v_1^T x + \sigma_2 u_1^T u_2 v_2^T x + \cdots + \sigma_r u_1^T u_r v_r^T x = \sigma_1 v_1^T x$$
since the ui's are orthonormal. Now plugging in the expression for x we have
$$u_1^T A x = \sigma_1 v_1^T x = \sigma_1 \alpha_1 v_1^T v_1 + \sigma_1 \alpha_2 v_1^T v_2 + \cdots + \sigma_1 \alpha_m v_1^T v_m = \sigma_1 \alpha_1$$
Thus we conclude that α1 = 0 as desired. Of course, nothing about this
argument is particular to v1 and for any vj with σj ̸= 0 we must have that
αj = 0. And so
$$x = \alpha_{r+1} v_{r+1} + \alpha_{r+2} v_{r+2} + \cdots + \alpha_m v_m$$
which shows that Ax = 0 implies that x ∈ span(vr+1 , vr+2 , · · · , vm ). This
completes the proof.
Not only can we immediately read off the rank and related properties of A, the singular value decomposition also contains powerful
theorems as a corollary. Recall:
Theorem 1. For any n × m matrix A, we have
rank(A) + dim(N (A)) = m
How can we prove this using the singular value decomposition? Recall
that we are using r to denote the number of nonzero singular values. From
Lemma 10 we have
rank(A) = r
and from Lemma 12 we have
dim(N (A)) = m − r
thus the Rank-Nullity theorem follows!
Actually this hints at another way to prove Lemma 12 that avoids doing
so much algebra. We will use the fact that C(AT ) and N (A) are orthogonal
complements of each other instead. First applying Lemma 11 to AT , we
have that v1 , v2 , · · · , vr is an orthonormal basis for C(AT ). Finally
N (A) = (C(AT ))⊥ = (span(v1 , v2 , · · · , vr ))⊥ = span(vr+1 , vr+2 , · · · , vm )
Inverses and Pseudoinverses Let’s take the idea of using the singular
value decomposition to unify what we know one step further. In Section 2.2
we saw how to compute the inverse of a matrix using Gauss-Jordan elim-
ination. It turns out that we can compute A−1 directly from the singular
value decomposition instead.
Proposition 11. Suppose A is n × n and nonsingular and has singular
value decomposition A = U ΣV T . Then
A−1 = V Σ−1 U T
The expression above is quite natural. It comes from using our rules for
computing the inverses of products of matrices:
(V Σ−1 U T )A = V Σ−1 U T U ΣV T
= V Σ−1 ΣV T
= VVT =I
where we have used the fact that U and V are orthogonal, and hence their
inverse is their transpose.
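A quick numerical check of Proposition 11 on a small nonsingular matrix (the 3 × 3 matrix from Chapter 1):

```python
import numpy as np

A = np.array([[2.0, -3.0,  1.0],
              [1.0,  1.0,  2.0],
              [3.0, -2.0, -1.0]])           # a nonsingular 3 x 3 matrix

U, s, Vt = np.linalg.svd(A)
A_inv = Vt.T @ np.diag(1.0 / s) @ U.T       # V Sigma^{-1} U^T

print(np.allclose(A_inv, np.linalg.inv(A)))   # True
```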
In fact, even when A is not invertible (or even not square!), in lieu of
computing the inverse, we can still do the next best thing:
Definition 38. The pseudoinverse of A, denoted by A+ , is
$$A^{+} = \sum_{i=1}^{r} \sigma_i^{-1} v_i u_i^T$$
When A is nonsingular, it is easy to see that the inverse and the pseu-
doinverse are the same. But when A is singular, how can we think about
the pseudoinverse? Again, let’s proceed by direct computation:
$$AA^{+} = \left(\sum_{i=1}^{r} \sigma_i u_i v_i^T\right)\left(\sum_{i=1}^{r} \sigma_i^{-1} v_i u_i^T\right) = \sum_{i=1}^{r}\sum_{j=1}^{r} \sigma_i \sigma_j^{-1} u_i v_i^T v_j u_j^T = \sum_{i=1}^{r} \sigma_i \sigma_i^{-1} u_i u_i^T = \sum_{i=1}^{r} u_i u_i^T$$
The last line follows because the vj ’s are orthonormal, so all the cross terms
where j ̸= i are zero. From this expression, we get:
B = {x|∥x∥ ≤ 1}
What does the set AB = {Ax|∥x∥ ≤ 1} look like? The singular value
decomposition of A will give us a way to decompose A into simpler geo-
metric building blocks. In particular, let’s visualize what happens when we
multiply by V T , then Σ and finally by U . The unit ball looks like:
Here we have marked the north and south poles. What happens when
we apply V T ? Since V has orthonormal columns, multiplying by V T does
not change the length of a vector. This implies that
B = V T B = {V T x|∥x∥ ≤ 1}
Even though the lengths do not change, the coordinates do. In our visual,
this means that the locations of the north and south poles change, even
though multiplying by V T maps the unit ball to the unit ball.
The principal axes of the ellipsoid are the vectors σ1 u1 , σ2 u2 and so on.
Thus, from a geometric standpoint, the singular value decomposition is a
way to think about a general linear transformation as mapping a unit ball
to an ellipsoid. The singular values describe the lengths of the principal
axes. The left singular vectors are the directions of the principal axes. And
finally the right singular vectors are the directions in the unit ball that get
mapped to each of the principal axes.
Many estimation problems, for example reconstructing an image x from measurements y = Ax + z, come back to solving linear systems. The catch is that there is also some unknown noise z in the process, and to be conservative we might only assume that the noise is small, i.e. ∥z∥ ≤ δ for some small constant δ. If A is known,
how can we estimate x? And furthermore, can we quantify our uncertainty
in x?
We will assume that A has full column rank. This is equivalent to the
condition that A has a left inverse – i.e. there is a matrix N so that N A = I.
So the natural way to estimate x is using
$$\hat{x} = N y$$
The reconstruction error is defined as x̂ − x. It represents the parts of the image that we got wrong. What we'd like to do is use properties of N
and the assumption that ∥z∥ ≤ δ to bound our reconstruction error. After
all, there might be some pixels that we can confidently say we accurately
reconstructed, and that is useful to know! Towards that end we can write
$$\hat{x} - x = N(Ax + z) - x = Ix + Nz - x = Nz$$
Thus we know that the reconstruction error x̂ − x is contained in an uncertainty ellipsoid
$$\{Nz \mid \|z\| \le \delta\}$$
And so the singular value decomposition of N can be used to understand
the reconstruction error. For example, if we want to make a diagnosis
based on our estimate x̂, it is useful to know if there are other images in the uncertainty ellipsoid centered at x̂ where we would have made a different diagnosis.
Of course this leads to even more questions. It is not hard to see that
in general the left inverse of a matrix is not unique. There are often in-
finitely many. So which one should we choose in terms of minimizing the
uncertainty ellipsoid? It turns out that there is a best choice. We will state
without proof the following fact:
Fact 19. The uncertainty ellipsoid of A+ is contained in those of any other
left inverse of A.
Thus A+ is the best left inverse in the sense that it yields an estimator x̂ = A+ y with the best bound on the reconstruction error.
Chapter 7
Connections to Optimization
where each zi,j is chosen uniformly at random from the interval [−10^{-2}, 10^{-2}].
What is the rank of B? This is a good problem to test your intuition. Maybe
the rank is unchanged, and rank(B) = rank(A) = 1? Or perhaps the rank
doubles? Or does the rank become as large as possible, and rank(B) = 50?
You can try this experiment out for yourself and check that indeed
rank(B) = 50. This is true in general, regardless of how small you make
the additive noise terms zi,j , and regardless of how large n is too. You’ll
always get rank(B) = n, unless you run into machine precision issues with
the computation itself. But why does this happen? Recall that the rank
of a matrix is equal to the dimension of its columnspace. And when we
sequentially consider the span of the first column of B, then the span of
the first two columns of B, and so on, we are incredibly unlikely to get any
column belonging to the span of the previous columns of B. Thus at each
step, the dimension grows by one. This is the intuition behind why B has
full rank.
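You can run this experiment in a few lines (the rank one matrix below is built as an outer product, which is one way to construct it, not necessarily the construction the notes have in mind):

```python
import numpy as np

n = 50
u = np.random.rand(n, 1)
v = np.random.rand(n, 1)
A = u @ v.T                                      # a rank one matrix

Z = np.random.uniform(-1e-2, 1e-2, size=(n, n))  # tiny additive noise
B = A + Z

print(np.linalg.matrix_rank(A))   # 1
print(np.linalg.matrix_rank(B))   # 50 (with overwhelming probability)
```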
Now we come to the key idea: Even though the rank of B is large,
it is still close to a rank one matrix, if we define an appropriate notion
of what being close means. As we will see, the idea of finding a low rank
approximation to a matrix has many important applications in data science
and engineering.
$$\max_{x:\ \|x\| \le 1} \|Ax\|$$
Fact 20. For any A, its operator norm is its largest singular value.
There is another way to derive this fact that will be a useful perspective
for later:
For any matrix A and any orthogonal matrix R, we have ∥A∥ = ∥AR∥.
Proof. We will verify that the two optimization problems, one computing
the operator norm of A and one computing the operator norm of AR are
optimizing over the same set. In particular consider the optimization prob-
lems
$$\|A\| = \max_{x:\ \|x\| \le 1} \|Ax\|$$
as well as
$$\|AR\| = \max_{z:\ \|z\| \le 1} \|ARz\|$$
But if we set x = Rz then ∥x∥ = ∥z∥ and we can go back and forth. Or to
be more formal, if there is a z with ∥z∥ ≤ 1 and ∥ARz∥ ≥ C then we can
set x = Rz and we will have ∥x∥ = ∥Rz∥ ≤ 1 and ∥Ax∥ = ∥ARz∥ ≥ C
as well. An identical argument works in the reverse direction, and we must
have that the values of the optimization problems are the same.
Similarly, for any orthogonal matrix R, we have ∥A∥ = ∥RA∥.
This fact is even easier to prove. It holds because ∥Ax∥ = ∥RAx∥ for
any vector x. Now putting it all together we can think about why the
singular value decomposition reveals the operator norm:
∥A∥ = ∥U ΣV T ∥
= ∥U T U ΣV T ∥ = ∥ΣV T ∥
= ∥ΣV T V ∥ = ∥Σ∥
Here the first equality in the second line follows from Fact 21 and the first
equality in the third line follows from Fact 22. Now since Σ is diagonal we have
$$\|\Sigma x\| = \sqrt{\sum_{i=1}^{r} \sigma_i^2 x_i^2}$$
and the largest you can make it, subject to the constraint that x is a unit
vector, is by setting x1 = 1 and xi = 0 for all other i’s.
But before we get ahead of ourselves, can we actually solve this op-
timization problem? Later in this course, we will study optimization in
more depth, and understand what families of optimization problems we
can efficiently solve, and what ones are generally hard. But even without
the general theory, the optimization problem above seems to be searching
over a complex, high-dimensional set – the set of all n × n rank one matri-
ces. Amazingly, it turns out that the singular value decomposition already
contains the answer!
We have written B as the sum of r rank one matrices. The main idea is to take the largest one, i.e. the one that has the biggest operator norm, which is
$$C = \sigma_1 u_1 v_1^T$$
with σ1 ≥ σ2 ≥ · · · ≥ σr . Then
$$\min_{C:\ \mathrm{rank}(C) \le k} \|B - C\| = \sigma_{k+1}$$
Proof. First we claim that ∥A∥F = ∥RA∥F = ∥AS∥F for any orthogonal
matrices R and S. This follows because ∥A∥2F is the sum of squares of the
lengths of the columns of A. When we left multiply by R, none of these
lengths change, and thus ∥A∥_F² = ∥RA∥_F². A similar argument works for right multiplying by S, but considering the sum of squares of the lengths
of rows. These facts again allow us to conclude
∥A∥_F = ∥Σ∥_F = √( σ_1² + σ_2² + · · · + σ_r² )
we can think of the rank one matrix A as the structured data we would
like to recover, but unfortunately we only have access to B. However we
can use the singular value decomposition to approximate A. There are
many bounds that quantify how good an approximation we find, but these
are beyond the scope of the course, and involve sophisticated tools from
random matrix theory.
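To make this concrete, here is a hedged NumPy sketch of the recipe just described: form B = A plus noise with A rank one, keep only the top singular triple of B, and compare the result to A. The sizes and noise level are invented for illustration.

    import numpy as np

    rng = np.random.default_rng(3)
    n = 100
    A = np.outer(rng.standard_normal(n), rng.standard_normal(n))  # the structured rank one matrix
    B = A + 0.1 * rng.standard_normal((n, n))                     # what we actually observe
    U, s, Vt = np.linalg.svd(B)
    A_hat = s[0] * np.outer(U[:, 0], Vt[0, :])                    # best rank one approximation of B
    print(np.linalg.norm(A - A_hat, 'fro') / np.linalg.norm(A, 'fro'))  # error of the approximation
    print(np.linalg.norm(A - B, 'fro') / np.linalg.norm(A, 'fro'))      # error of using B as-is

On typical runs the first number is noticeably smaller than the second, which is the sense in which truncating the SVD denoises B.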
Still we can get an idea of how these ideas can be applied in practical
settings. In 2006, Netflix offered a challenge to the machine learning com-
munity: Improve upon our recommendation systems considerably, and we’ll
give you a million dollars! The problem was formulated as follows: Netflix
released the movie ratings of about 480k users on about 18k movies. Each
rating was a numerical score from one to five stars, but of course most of
the ratings were blank because the user had not rated the corresponding
movie.
We can think about this data as being organized into a 480k × 18k
matrix where each row represents a user and each column represents a
movie. The entry in row i, column j is the numerical score or is zero if it is
missing. The main challenge is that only about one percent of the entries
are observed. The goal is to estimate the missing entries. Netflix measured the performance of a submitted algorithm by its mean squared error.
Each rank one matrix has the form uv^T, and we can think of the ith entry of u as representing how much user i likes, say, drama movies. Then the jth entry of v represents how much the jth movie appeals to viewers who like drama. In
any case, we can now think of our observed matrix B as having Bi,j = Mi,j
for entries we observe, and Bi,j being equal to zero everywhere else. Thus
B is the sum of a low-rank (or approximately low rank) matrix and a noise
matrix, and one simple and powerful way to predict the missing entries is
to compute a low rank approximation B̂ to B and use B̂_{i,j} to predict how user i would rate movie j even when that rating is not in our system.
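The following sketch mimics this idea on a tiny synthetic ratings matrix. Everything here is invented for illustration; in particular, half the entries are observed rather than one percent, and the rescaling step is one crude way, not mentioned in the text, to compensate for the zero-filled entries.

    import numpy as np

    rng = np.random.default_rng(4)
    n_users, n_movies, k = 500, 100, 2
    # a synthetic "true" preference matrix of rank k, roughly in a 0-5 star range
    M = 2.5 * rng.random((n_users, k)) @ rng.random((k, n_movies))
    mask = rng.random((n_users, n_movies)) < 0.5    # which ratings we observe
    B = np.where(mask, M, 0.0)                      # unobserved entries stored as zero

    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    B_hat = (U[:, :k] * s[:k]) @ Vt[:k, :]          # rank k approximation of B
    B_hat = B_hat / mask.mean()                     # rescale to offset the zeroed-out entries

    pred_err = np.abs(B_hat - M)[~mask].mean()           # error on the ratings we never saw
    base_err = np.abs(B[mask].mean() - M[~mask]).mean()  # error of predicting the average rating
    print(pred_err, base_err)

On typical runs the low rank prediction error is noticeably smaller than the naive baseline, though a real recommendation system would treat the missing entries far more carefully.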
A = [ a_1  a_2  · · ·  a_p ]
where each data point ai is an n-dimensional vector. Some important ex-
amples, which we will study later, include:
Now suppose we take our high-dimensional data points and project them
onto a line or onto a subspace. It turns out that we can understand the
mean and covariance of the new, lower dimensional data points linear alge-
braically based on the mean and covariance of the old, higher dimensional
data points. Let’s dig into this. Throughout the rest of the section we
will assume that we have recentered our data so that its mean is zero. In
particular let yi = ai − µ. Then the covariance becomes
S = (1/p) ∑_{i=1}^{p} (y_i − ȳ)(y_i − ȳ)^T = (1/p) ∑_{i=1}^{p} y_i y_i^T

where the last equality follows because the average ȳ of the y_i's is the zero vector. Now we can compute the variance of the projections z_i = c^T y_i of our data points onto the direction of a unit vector c:
(1/p) ∑_{i=1}^{p} z_i² = (1/p) ∑_{i=1}^{p} (c^T y_i)²
                       = (1/p) ∑_{i=1}^{p} c^T y_i y_i^T c
                       = c^T ( (1/p) ∑_{i=1}^{p} y_i y_i^T ) c = c^T S c
This expression will play a crucial role in principal component analysis, and
in quadratic programming later on. Let’s give it a name:
Definition 42. The quadratic form of a vector c on a matrix S is c^T S c.
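Here is a small illustrative sketch of this definition (the data is random and has no special meaning): the average squared projection of the recentered points onto a unit vector c equals the quadratic form c^T S c.

    import numpy as np

    rng = np.random.default_rng(5)
    n, p = 3, 1000
    Y = rng.standard_normal((n, p))
    Y = Y - Y.mean(axis=1, keepdims=True)   # recenter so the data has mean zero
    S = (Y @ Y.T) / p                       # S = (1/p) sum_i y_i y_i^T
    c = rng.standard_normal(n)
    c = c / np.linalg.norm(c)               # a unit vector
    z = c @ Y                               # projections z_i = c^T y_i
    print(np.mean(z ** 2))                  # variance of the projections
    print(c @ S @ c)                        # the quadratic form; the same number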
We will solve this problem through the singular value decomposition. Recall
that A is an n × p matrix that represents our data, and we have assumed
that the columns of A have zero mean. Then we can write A = U ΣV T .
Using the expression for the covariance, we can write
S = (1/p) ∑_{i=1}^{p} y_i y_i^T = (1/p) A A^T
  = (1/p) U Σ V^T V Σ^T U^T
  = (1/p) U Σ Σ^T U^T
Now we can reformulate our optimization problem as
max_{c : ∥c∥ = 1}  (1/p) c^T U Σ Σ^T U^T c
If we make the substitution b = U^T c we get

max_{b : ∥b∥ = 1}  (1/p) b^T Σ Σ^T b
This is the same type of change of variables we used in Section 7.1, and
works because ∥b∥ = ∥U T c∥ = ∥c∥. In any case, since Σ is diagonal, it is
now easy to understand the maximum of the optimization problem. Since
the singular values are non-increasing, the maximum is achieved by setting
b1 = 1 and bi = 0 for all other i. Since b = U T c, in our original optimization
problem the direction b = e_1 corresponds to the direction c = U e_1 = u_1. Thus we can read off the answer from the singular value decomposition:
Lemma 14. The direction that maximizes the projected variance of A is
its top left singular vector.
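Here is a hedged NumPy sketch of Lemma 14: generate synthetic data with one dominant direction, recenter it, and check that the top left singular vector gives at least as much projected variance as a batch of random unit directions. Nothing here beyond the recipe comes from the text.

    import numpy as np

    rng = np.random.default_rng(6)
    n, p = 5, 2000
    A = np.outer(rng.standard_normal(n), rng.standard_normal(p))  # one dominant direction
    A = A + 0.3 * rng.standard_normal((n, p))                     # plus some spread
    A = A - A.mean(axis=1, keepdims=True)                         # recenter the data
    S = (A @ A.T) / p                                             # covariance matrix
    u1 = np.linalg.svd(A)[0][:, 0]                                # top left singular vector
    C = rng.standard_normal((100, n))
    C = C / np.linalg.norm(C, axis=1, keepdims=True)              # 100 random unit directions
    print(u1 @ S @ u1)                                            # projected variance along u1
    print(max(c @ S @ c for c in C))                              # never exceeds the value above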
We will not prove this theorem. But the intuition is that if you look
for the best k dimensional subspace, it contains the best k − 1 dimensional
subspace inside it, and so on. Thus once you have found the 1 dimensional
subspace that maximizes the projected variance, which is achieved by taking
the line in the direction of u1 , you can look orthogonal to it to find the new
vector you should add in to get a 2 dimensional subspace, and so on.
It turns out that this optimization problem is the same one as before,
just written in a different way. First note that ŷ_i and y_i − ŷ_i are orthogonal because ŷ_i is the projection of y_i onto a subspace. Now we have
where the first equality follows from the Pythagorean theorem and the last equality follows from the assumption that C has orthonormal columns. Note that C is not orthogonal because, in general, k < n. Nevertheless the identity ∥CC^T y_i∥² = ∥C^T y_i∥² still holds because C^T C = I and, for any vector z,

∥Cz∥² = z^T C^T C z = z^T z = ∥z∥²

Applying this with z = C^T y_i gives the identity.
Quadratic Programming
Actually there is a more convenient way to write this where we are taking
a quadratic form on a symmetric matrix.
min_{x,y}   [ x  y ] [  2  −1 ] [ x ]   +   [ −2√2   4√2 ] [ x ]
                     [ −1   2 ] [ y ]                      [ y ]
This alternative formulation will help us because we already know a lot
about the existence and structure of the eigendecomposition of a symmetric
matrix. In contrast, non-symmetric matrices cannot always be diagonalized.
Furthermore we can write the quadratic programming problem even more
compactly as
min_z  z^T A z + b^T z
min_z  z^T A z + b^T z = min_z  z^T U D U^T z + b^T z
                       = min_z  (z^T U) D (U^T z) + (b^T U)(U^T z)
                       = min_{z'}  z'^T D z' + b'^T z'

where z' = U^T z and b' = U^T b.
In our example this becomes

min_{x',y'}  3x'² + y'² + 6x' + 2y'
What makes this problem simpler than the one we started off with is that
there are no terms that involve both x′ and y ′ . Thus the eigendecompo-
sition allowed us to separate the variables. Now we can solve each of the
minimization problems, one over x′ and one over y ′ independently. One
easy way to do this is to complete the squares. By this we mean: collect all the terms that involve x′ and write them in the form α(x′ + β)² + γ for some choice of α, β and γ. We do the same thing for y′ too. This allows us to
rewrite the problem as:
min_{x',y'}  3(x' + 1)² + (y' + 1)² − 4
And from this expression it is easy to find the optimum. We should set x′ = −1 and y′ = −1. This will make the objective value equal to −4. Conversely, the objective function cannot be made less than −4 over the reals, because it is the sum of two nonnegative terms and the constant −4. Now all that remains is to find a choice of x and y that achieves this
value. Applying the change of variables in reverse gives x = 0 and y = −√2, and we can check that plugging in this choice of x and y makes the objective value −4, as it should.
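As a numerical cross-check of this example (a sketch only, using the A and b from the matrix form above), we can set the gradient of z^T A z + b^T z to zero (a fact about gradients of quadratic forms that is developed later in this chapter) and solve the resulting linear system.

    import numpy as np

    A = np.array([[2.0, -1.0], [-1.0, 2.0]])
    b = np.array([-2 * np.sqrt(2), 4 * np.sqrt(2)])
    z = np.linalg.solve(2 * A, -b)   # stationary point: 2 A z + b = 0
    print(z)                         # approximately (0, -1.414), i.e. (0, -sqrt(2))
    print(z @ A @ z + b @ z)         # approximately -4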
The Role of the Eigenvalues In this section, let’s study some other
examples of unconstrained quadratic programming. Our goal is to under-
stand what the eigenvalues directly tell us about our optimization problem.
Suppose we instead had the problem:
min_{x,y}  x² − 4xy + y² − 2√2 x + 4√2 y
The only difference is that, before, the coefficients in front of x′² and y′² were 3 and 1, respectively, and now they are 3 and −1. Observe that these coefficients are the eigenvalues of A. Thus, instead of minimizing a quadratic form on a symmetric matrix with nonnegative eigenvalues, we are now minimizing one where an eigenvalue is negative. If we complete the square as before, we get
min_{x',y'}  3(x' + 1)² − (y' − 1)² − 2
From this expression, we can immediately see that the optimum is −∞:
We can make the minimum arbitrarily small, say, by setting x′ to zero and
making y ′ larger and larger. In fact, we didn’t need to compute the change
of variables or complete the square. The fact that one of the eigenvalues
was negative already implied it, and the rest is just bookkeeping, because
as long as the optimization problem turns into something of the form
min_{x',y'}  α(x' + β)² − (y' − γ)² + C

the optimum is −∞: we can fix x' and make y' as large as we like.
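A tiny numerical illustration of this point, using the objective from this example: moving along the direction (1, 1), which is an eigenvector for the negative eigenvalue, drives the objective to −∞.

    import numpy as np

    # the objective x^2 - 4xy + y^2 - 2*sqrt(2)*x + 4*sqrt(2)*y from above
    f = lambda x, y: x**2 - 4*x*y + y**2 - 2*np.sqrt(2)*x + 4*np.sqrt(2)*y
    for t in [1.0, 10.0, 100.0, 1000.0]:
        print(f(t, t))   # the values head off to minus infinity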
Observe that Lemma 15 does not tell us what happens when A is positive
semidefinite, but has a zero eigenvalue. For example, suppose we had the
problem:
min_{x,y}  (3/2)x² − 3xy + (3/2)y² − 2√2 x
Again, computing the eigendecomposition and applying a change of vari-
ables gives us
min_{x',y'}  3x'² + 6x' + 2y'
Returning to our first example, recall that the problem min_{x',y'} 3x'² + y'² + 6x' + 2y' is equivalent to
min_{x',y'}  3(x' + 1)² + (y' + 1)² − 4
Is the solution unique? We know that the optimal value is −4 and this can
be achieved by setting x′ = −1 and y ′ = −1. If we were to choose any
different value for x′ or y ′ the objective function would be the sum of −4
and a strictly positive term, and thus it would not achieve the minimum.
Thus the minimum is unique. Moreover since our change of variables is an
invertible transformation, the minimization problem over x and y also has
a unique solution.
As another example, suppose that after the change of variables the problem becomes

min_{x',y'}  3(x' + 1)² − 3
From this expression, we can see that the optimum is −3 and setting x′ =
−1 and y ′ to anything achieves the minimum. Thus the optimum is not
unique. When we apply the change of variables in reverse we still have a
one-dimensional space of optimal solutions.
min_z  z^T A z + b^T z + c
Proof. We will ignore the case where the optimum is −∞. We can think
about this as a case where the optimum is not achieved. We have already
seen how, in the case where A is positive definite, the objective function
takes the form

C + ∑_{i=1}^{n} α_i (x_i − β_i)²
with every α_i strictly positive, so the minimizer is unique. When A has a zero eigenvalue, the same change of variables instead produces an objective that involves only n′ < n of the variables. But then in any optimal solution, we can set an unused variable, say x_n, arbitrarily and not change the objective value. Thus the solution is not unique. This completes the proof.
to select the one that is the simplest in terms of minimizing ∥x∥2 . Now we
can see that this optimization problem
min_x  ∥x∥²  s.t.  Ax = b
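For intuition, here is a small sketch of this minimum norm problem on a made-up underdetermined system. It relies on np.linalg.lstsq returning the minimum norm solution in this situation, and it checks that moving within N(A) can only make the solution longer.

    import numpy as np

    rng = np.random.default_rng(7)
    A = rng.standard_normal((2, 5))                 # a wide matrix: Ax = b has many solutions
    b = rng.standard_normal(2)
    x_min, *_ = np.linalg.lstsq(A, b, rcond=None)   # the minimum norm solution
    print(A @ x_min - b)                            # essentially zero, so x_min is feasible
    d = rng.standard_normal(5)
    d = d - np.linalg.pinv(A) @ (A @ d)             # project d onto the null space of A
    print(np.linalg.norm(x_min), np.linalg.norm(x_min + d))   # the first norm is smaller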
Similarly we find
(∂/∂y) f(x, y) = −2x + 4y
We can actually express the answer in matrix-vector notation:

∇f = 2 [  2  −1 ] [ x ]  =  [ 4x − 2y  ]
       [ −1   2 ] [ y ]     [ −2x + 4y ]

where the 2 × 2 matrix is A and the vector is z = (x, y).
This expression should look somewhat familiar from calculus since for a scalar z we have (d/dz)(az²) = 2az. This fact is true more generally:
Fact 25. Let f (z) = z T Az and suppose A is symmetric. Then ∇f = 2Az.
This should still remind you of the chain rule. As a side note, this is
why we chose the convention to have a 1/2 in front of P in our definition
of quadratic programming: It will make the expressions for the gradients
simpler. Let’s do another important example which we will use later on.
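Before moving on, here is a quick illustrative check of Fact 25, comparing 2Az against a centered finite difference approximation of the gradient of f(z) = z^T A z for a random symmetric A.

    import numpy as np

    rng = np.random.default_rng(8)
    n = 4
    M = rng.standard_normal((n, n))
    A = (M + M.T) / 2                       # a random symmetric matrix
    z = rng.standard_normal(n)
    f = lambda v: v @ A @ v
    eps = 1e-6
    grad_fd = np.array([(f(z + eps * e) - f(z - eps * e)) / (2 * eps) for e in np.eye(n)])
    print(grad_fd)
    print(2 * A @ z)                        # matches the finite difference estimate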
From calculus, for a function f (z) where z is a scalar we know that the
derivative gives us a linear approximation
f(z + δ) ≈ f(z) + δ · (d/dz) f(z)
Similarly when z is a vector we have
f (z + δ) ≈ f (z) + δ T ∇f (z)
If we let δ be any vector with norm at most c, we should have δ point in the direction of ∇f(z) in order to maximize the linear approximation. Thus the gradient is the direction of largest infinitesimal increase.
Now let’s talk about what it means for a point to be a local minimum.
Intuitively it means that there is no direction we can move in that infinitesimally decreases the objective function. However this is not just about the
gradient of the function, but also its relationship to the directions we are
allowed to move in while maintaining feasibility. Formally:
Proof. We will only prove one direction, because we will give a stronger
characterization in the other direction below. Now consider the case where
∇f (z) is not orthogonal to N (A). Then there is a direction δ ∈ N (A) so
that
⟨δ, ∇f (z)⟩ < 0
From calculus, we have the estimate that for small α

f(z + αδ) ≈ f(z) + α⟨δ, ∇f(z)⟩
This implies that if we move a small enough amount in the direction δ (i.e.
α is sufficiently small), we will not only decrease the value of the linear
approximation, but also decrease the value of the function too. Thus z is
not a local minimum.
Now let’s prove that z is a global minimum. Consider any other feasible
point z′. Since the set of feasible points is the solution set of the linear system Ax = b, we know that z′ = z + δ for some δ ∈ N(A). Let's compare the
objective values. Since P is symmetric, we have
f(z + δ) = (1/2)(z + δ)^T P (z + δ) + (z + δ)^T q
         = f(z) + (1/2) δ^T P δ + δ^T P z + δ^T q
P z = −q − A^T ν
⇒ δ^T P z = −δ^T q − δ^T A^T ν
⇒ δ^T P z = −δ^T q

where the last implication uses δ ∈ N(A), so that δ^T A^T ν = (Aδ)^T ν = 0.
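To see these conditions in action, here is a minimal sketch (with invented P, q, A and b) that solves min (1/2) z^T P z + q^T z subject to Az = b by stacking the condition P z = −q − A^T ν together with feasibility into one linear system.

    import numpy as np

    P = np.array([[2.0, -1.0], [-1.0, 2.0]])   # positive definite (made up for illustration)
    q = np.array([1.0, -3.0])
    A = np.array([[1.0, 1.0]])                 # a single constraint: z_1 + z_2 = 1
    b = np.array([1.0])

    n, m = P.shape[0], A.shape[0]
    K = np.block([[P, A.T], [A, np.zeros((m, m))]])
    sol = np.linalg.solve(K, np.concatenate([-q, b]))
    z, nu = sol[:n], sol[n:]
    print(A @ z)                               # equals b, so z is feasible
    print(np.allclose(P @ z, -q - A.T @ nu))   # the optimality condition from above holds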
5. N(A) = {0}
6. C(A) = R^n