
Linear Algebra and Optimization

Ankur Moitra and Pablo A. Parrilo

September 26, 2022

Contents

1 Vectors, Matrices and Linear Equations
  1.1 A Panoramic View
  1.2 The Geometry of Linear Equations
  1.3 Solving Systems of Linear Equations
  1.4 Application: Electrical Circuits

2 Matrix Multiplication and Inverses
  2.1 Multiplying Matrices
  2.2 The Inverse of a Matrix

3 Rotations, Projections, Reflections and Transpositions

4 Geometric Foundations
  4.1 Vector Spaces
  4.2 Application: Polynomial Interpolation and Overfitting

5 Linear Independence and Bases
  5.1 Linear Independence
  5.2 Generators and Bases
  5.3 Bases
  5.4 Describing subspaces

6 The Singular Value Decomposition
  6.1 A Unifying View
  6.2 Application: Uncertainty Regions

7 Connections to Optimization
  7.1 Low Rank Approximation
  7.2 Application: Recommendation Systems
  7.3 Principal Component Analysis

8 Quadratic Programming
  8.1 Using the Eigendecomposition
  8.2 Optimality Conditions
Chapter 1

Vectors, Matrices and Linear Equations

1.1 A Panoramic View


You can view linear algebra from many different perspectives. At an
abstract level, we will be working with objects called vectors. A vector is
just a way to store a tuple of real numbers, for example
\[
\begin{bmatrix} -1 \\ 3/2 \\ 2 \end{bmatrix}
\]
Sometimes we will work with vectors whose entries are complex numbers
and in other classes you may also come across vectors over finite fields too.
The dimension of the vector is just the size of the tuple. So in the example
above, the dimension is three. Moreover sometimes we will want to extract
individual entries from the vector, which are called coordinates. In the
example above, the first coordinate is −1, the second coordinate is 3/2 and
so on.

There are natural ways to manipulate vectors:


1. We can multiply a vector by a scalar:
\[
2 \times \begin{bmatrix} -1 \\ 3 \\ 2 \end{bmatrix} = \begin{bmatrix} -2 \\ 6 \\ 4 \end{bmatrix}
\]
which performs the operation of multiplying each coordinate of the
vector separately by the scalar.

2. And we can add together two vectors of the same dimension:
\[
\begin{bmatrix} -1 \\ 3 \\ 2 \end{bmatrix} + \begin{bmatrix} 4 \\ -1 \\ -1 \end{bmatrix} = \begin{bmatrix} 3 \\ 2 \\ 1 \end{bmatrix}
\]
which adds up the vectors coordinate-by-coordinate.

And more generally we can form a linear combination of them, which is
an expression that looks like
\[
\alpha x + \beta y
\]
for some scalars α and β and some vectors x and y. We will spend much
of the beginning of the course developing geometric and algebraic insights
about what certain operations on vectors do and how we can reason about
them.

But perhaps another question on your mind is: Why do we care about
vectors in the first place? The usefulness of linear algebra comes from
the fact that all around us, in science, engineering, and the social sciences, there are
things that can be modeled as vectors, where the insights of linear algebra
will teach us something useful and important. Throughout this course, we
will motivate the theory with concrete applications. As an example where
vectors arise naturally, consider the two jpegs below. One picture is of
Marilyn Monroe and the other is of Albert Einstein.

Each picture is 320 × 360 pixels and we can represent each grayscale
value as a number between 0 and 1. Thus we can think about a picture as
a vector with 115200 = 320 × 360 dimensions once we fix a convention of
which pixel is associated with which coordinate. Is there a way to manipulate
these two vectors, one for each image, so that we can combine them in a
visually interesting way?

If you look at the picture, it looks like Albert Einstein but when you
squint it looks like Marilyn Monroe. Thus understanding operations you
can perform on vectors often gives you interesting families of tools to process
your data. But this is just one of a myriad of applications. In the sciences,
you can often represent your experimental data as a vector or a collection of
vectors. Linear algebra will give us powerful tools for extracting some latent
structure among them, which in turn have fueled many scientific discoveries.
In other contexts, you might have a system that is evolving over time, and
again the way that you can represent how your system updates is through
the right tools from linear algebra.

Next, let's introduce a well-behaved class of functions that we can apply
to vectors:

Definition 1. A function f that maps n dimensional vectors to m dimensional
vectors is called linear if it satisfies the properties:

1. For all vectors x and all scalars α, f (αx) = αf (x)

2. For all vectors x and y, f (x + y) = f (x) + f (y)

For example, if we take f to be the function that multiplies the entries
of x by 2, it is easy to check that f is linear. But we could do more exotic
things like swapping two coordinates of x, or adding one coordinate to all
the others, etc. It turns out that every linear function can be represented
as a matrix. A matrix is just a grid of numbers, for example
\[
\begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix}
\]

The dimensions of the matrix are the number of rows and columns. So in
this example the matrix is 2 × 2 and more generally a matrix with n rows
and m columns would be n × m. If the matrix above represents a linear
function, what operation on a vector is it supposed to represent?

Definition 2. The matrix-vector product of a 2 × 2 matrix and a 2 dimensional
vector is defined as
\[
\begin{bmatrix} a & b \\ c & d \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} ax + by \\ cx + dy \end{bmatrix}
\]

Returning to the example above, what does the function that corresponds
to multiplying the vector by the matrix
\[
\begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix}
\]
actually do to a point? If we think about a general point in the x-y plane
as a vector then
\[
\begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} y \\ -x \end{bmatrix}
\]
This corresponds to rotating every vector in the x-y plane clockwise by
π/2. Much of this class will also be about translating applications into
the language of linear algebra, so that we have many powerful tools at our
disposal. So, for example, what if we want to rotate every vector in the x-
y plane counterclockwise by π/4? What matrix-vector product represents
this operation? It takes some practice to go back and forth between the
abstraction and the application, and you should try this for yourself here
and throughout these notes.
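For instance, here is a minimal numerical sketch (using numpy, with the angle π/4 chosen just for illustration) of the standard counterclockwise rotation matrix, together with a check that the matrix above is the clockwise rotation by π/2:

```python
import numpy as np

def rotation(theta):
    # Counterclockwise rotation of the x-y plane by angle theta.
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])

# The matrix from the text rotates clockwise by pi/2, i.e. it equals rotation(-pi/2).
A = np.array([[0.0, 1.0], [-1.0, 0.0]])
print(np.allclose(A, rotation(-np.pi / 2)))    # True

# Counterclockwise rotation by pi/4 applied to the point (1, 0).
R = rotation(np.pi / 4)
print(R @ np.array([1.0, 0.0]))                # [0.7071..., 0.7071...]
```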

We claim that any matrix-vector product is a linear function. This is
easy to check since for any scalar α we have
\[
f\left(\alpha \begin{bmatrix} x \\ y \end{bmatrix}\right) = \begin{bmatrix} a & b \\ c & d \end{bmatrix} \begin{bmatrix} \alpha x \\ \alpha y \end{bmatrix} = \begin{bmatrix} \alpha a x + \alpha b y \\ \alpha c x + \alpha d y \end{bmatrix} = \alpha f\left(\begin{bmatrix} x \\ y \end{bmatrix}\right)
\]
Notice that at each step above we are either invoking the definition of f or
using the definition of what it means to multiply a vector by a scalar. You
should check the other property of linearity for yourself, which will reduce
to verifying that for any pair of vectors
\[
\begin{bmatrix} a & b \\ c & d \end{bmatrix} \left( \begin{bmatrix} x \\ y \end{bmatrix} + \begin{bmatrix} x' \\ y' \end{bmatrix} \right) = \begin{bmatrix} a & b \\ c & d \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} + \begin{bmatrix} a & b \\ c & d \end{bmatrix} \begin{bmatrix} x' \\ y' \end{bmatrix}
\]

Another helpful way to think about the matrix-vector product is as a
linear combination. In particular
\[
\begin{bmatrix} a & b \\ c & d \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = x \begin{bmatrix} a \\ c \end{bmatrix} + y \begin{bmatrix} b \\ d \end{bmatrix}
\]
Thus a matrix-vector product forms a linear combination of the columns
of the matrix, where the scalar you put in front of the first column is the
first coordinate of the vector, and so on. We can also define the operation
of multiplying a 3 × 3 dimensional matrix by a 3 dimensional vector:
\[
\begin{bmatrix} a & b & c \\ d & e & f \\ g & h & i \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} ax + by + cz \\ dx + ey + fz \\ gx + hy + iz \end{bmatrix}
\]
But again we can interpret the result as a linear combination of the columns
of the matrix:
\[
x \begin{bmatrix} a \\ d \\ g \end{bmatrix} + y \begin{bmatrix} b \\ e \\ h \end{bmatrix} + z \begin{bmatrix} c \\ f \\ i \end{bmatrix}
\]
In exactly the same way we can define the general matrix-vector product,
but only if the dimensions agree! If x is a vector, we let xj denote its jth
coordinate. And if A is a matrix we let Ai,j denote its entry in row i, column
j. Then

Definition 3. If A is an n × m dimensional matrix and x is an m dimensional
vector then Ax is an n dimensional vector and its ith coordinate is
\[
(Ax)_i = \sum_{j=1}^{m} A_{i,j} x_j
\]
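As a quick numerical illustration of this formula (a minimal sketch using numpy; the particular matrix and vector are arbitrary):

```python
import numpy as np

A = np.array([[2.0, -3.0, 1.0],
              [1.0,  1.0, 2.0]])        # an n x m matrix with n = 2, m = 3
x = np.array([1.0, 0.5, -1.0])          # an m dimensional vector

# Entry-by-entry computation following (Ax)_i = sum_j A[i, j] * x[j].
Ax = np.array([sum(A[i, j] * x[j] for j in range(A.shape[1]))
               for i in range(A.shape[0])])

print(Ax)                       # [-0.5 -0.5]
print(np.allclose(Ax, A @ x))   # True: matches numpy's matrix-vector product
```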

Not only is the matrix-vector product a linear function, it is the canonical
one in the following sense:

Proposition 1. Any linear function f on vectors can be represented as a
matrix-vector product. I.e. there is some A so that f(x) = Ax.

The proof of this proposition is helpful for building intuition. If you were
given a linear function f how would you go about finding the corresponding
matrix A? Suppose x is m dimensional.

Definition 4. The standard basis vectors e1, e2, · · · , em have the property
that ei is the vector with a 1 in its ith coordinate and all its other coordinates
are 0.

Then we can write
\[
x = x_1 e_1 + x_2 e_2 + \cdots + x_m e_m
\]

Now suppose we knew f (e1 ), f (e2 ), etc. Can we figure out what f (x) is for
any x? We can use the linearity assumption to write

f (x) = x1 f (e1 ) + x2 f (e2 ) + · · · + xm f (em )

Thus if we interpret f (ei ) as the ith column of a matrix A we have expressed


f (x) as a linear combination of the columns of A, which was exactly our
alternative view of the matrix-vector product.

Just like vectors, matrices are a useful and powerful way to represent
things around us. We have already seen one example, where the right way to
create a composite image of Marilyn Monroe and Albert Einstein is actually
to design the right matrices to define the right linear function to apply to
vectors representing the two images. As another example, in engineering,
the state of a system is often represented as a vector and the way that the
system evolves over time can be expressed as a matrix-vector product.

For example, suppose we have two species, let’s say the frogs and the
flies. These species have a predator-prey relationship whereby if there were
no frogs to eat the flies, the number of flies would grow exponentially.
But conversely if there were too many frogs and too few flies for them to
feed on, the number of frogs might decay exponentially. We can model
this system as follows. Let g(t) and y(t) denote the number of frogs and
flies respectively at time t. Then from studying their dynamics, we might
come to the conclusion that the way the system updates is governed by the
equations

g(t + 1) = 0.4g(t) + 0.2y(t)


y(t + 1) = −0.6g(t) + 1.8y(t)

We can write these equations in matrix-vector product format as
\[
\begin{bmatrix} g(t+1) \\ y(t+1) \end{bmatrix} = \begin{bmatrix} 0.4 & 0.2 \\ -0.6 & 1.8 \end{bmatrix} \begin{bmatrix} g(t) \\ y(t) \end{bmatrix}
\]
It turns out that building an abstract geometric toolkit for how to think
about matrices will help us answer all sorts of questions about how the
system evolves, like does it converge and if so to what? When we know the
state at some later time, can we solve for the initial conditions? And how
sensitive is the behavior if we perturb the initial conditions?
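For instance, a minimal simulation sketch of the update rule above (using numpy, with arbitrarily chosen initial populations) looks like this:

```python
import numpy as np

A = np.array([[0.4, 0.2],
              [-0.6, 1.8]])          # update matrix from the frog/fly model

state = np.array([10.0, 100.0])      # hypothetical initial frogs and flies
for t in range(5):
    print(f"t={t}: frogs={state[0]:8.2f}, flies={state[1]:8.2f}")
    state = A @ state                # one step of the dynamics
```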

1.2 The Geometry of Linear Equations


We will develop the basic tools of linear algebra in the context of solving
systems of linear equations, before we apply them more broadly. But we
will write our systems of linear equations using the matrix-vector product.
In particular consider the equation:
\[
\begin{bmatrix} 2 & -3 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 0 \\ 5 \end{bmatrix}
\]

The 2 × 2 matrix above is called the constraint matrix. Last time we talked
about matrix-vector products when all the entries were scalars. Now we are
allowing the entries to be variables too. The recipe for how to compute a
matrix-vector product still works the same, and when we multiply a matrix
and a vector of variables, instead of a vector of scalars, we get a vector of
linear functions of the variables x and y. In particular we get
\[
\begin{bmatrix} 2x - 3y \\ x + y \end{bmatrix} = \begin{bmatrix} 0 \\ 5 \end{bmatrix}
\]
Thus our matrix-vector product represents a system of two linear equations
in two variables:
\[
\begin{cases} 2x - 3y = 0 \\ x + y = 5 \end{cases}
\]
Often we will write a linear system in an even more abstract notation as

Ax = b

Here A is an m × n matrix, x is a vector of variables with coordinates


x1 , x2 , · · · , xn and b is a vector of dimension m. Thus when we write out the
matrix-vector product we get a system of m linear equations in n variables.
The main questions we will be interested in are:
Question 1. Does the linear system Ax = b have a solution? And if so, is
it unique?

These questions are easy to answer when we have a 2 × 2 constraint


matrix, but what we are after is a more geometric understanding of when
there is a solution and when it is unique that can apply even in higher
dimensions. One of the central themes in linear algebra is being able to
think about a matrix in terms of its rows, or in terms of its columns, and
understanding the relationships between these two views. The goal of this
section is to understand how the geometric properties of the rows (or the
columns) determine the answers to these main questions.

The Row View As we discussed above in the case of a 2 × 2 constraint


matrix, each row gives us a linear equation that our solution must satisfy.
Let’s generalize this to higher dimensions, by first defining the notion of an
inner-product:

Definition 5. The inner-product between two vectors a and b of dimension
n is
\[
\langle a, b \rangle = \sum_{i=1}^{n} a_i b_i
\]

We can also write the inner-product as a matrix-vector product between
a 1 × n matrix and an n dimensional vector. We usually refer to a 1 × n
matrix as a row vector and an n × 1 matrix as a column vector. Now we
can check that
\[
\langle a, b \rangle = \begin{bmatrix} a_1, a_2, \cdots, a_n \end{bmatrix} \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix}
\]

Let’s dig into the geometry of the inner-product some more. The set of
solutions to a linear equation is an affine hyperplane:

Definition 6. An affine hyperplane is the set of all solutions to a linear
equation
\[
\langle a, x \rangle = a_1 x_1 + a_2 x_2 + \cdots + a_n x_n = b
\]

Now the set of solutions to a system of linear equations Ax = b is the
intersection of affine hyperplanes. In particular let
\[
H_i = \{ x \text{ s.t. } A_{i,1} x_1 + A_{i,2} x_2 + \cdots + A_{i,n} x_n = b_i \}
\]
Now we have:

Proposition 2. x is a solution to Ax = b if and only if x ∈ ∩_{i=1}^{m} H_i.

Let’s visualize what’s happening when we have two linear equations and
two variables. Consider the constraint 2x − 3y = 0 that comes from the
first row of our linear system. The solution is the blue line:

If we also plot the vector corresponding to the first row of A in red, we can
see that it makes a right angle with the line. Similarly if we take the second
row, which corresponds to the constraint x + y = 5, we would get the green
line:

Again, the vector corresponding to the second row of A is at a right angle


with the green line but only if we put the tail of the vector on the green
line too.

Now we can answer the existence and uniqueness questions visually:



The set of solutions is the intersection of the two lines. Thus there is a
solution x = 3, y = 2 and moreover it is unique.

In three and higher dimensions it works the same, it’s just harder to
draw! For example, consider a linear system with three variables and three
constraints:
\[
\begin{bmatrix} 2 & -3 & 1 \\ 1 & 1 & 2 \\ 3 & -2 & -1 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 0 \\ 5 \\ 2 \end{bmatrix}
\]
The first row corresponds to the constraint 2x − 3y + z = 0. Instead of a
line it is a plane – i.e. an infinite sheet of paper in three dimensions. The
vector corresponding to the first row makes a right angle with the sheet of
paper. This is an affine hyperplane and when we take the intersection of
the three affine hyperplanes, we get exactly the solutions to the system of
linear equations.

We can also go the other direction, from the visualization to the system
of linear equations. If we have a linear system Ax = b where the constraint
matrix is 2 × 2 which has no solution, what would A look like? Well, since
the solutions are the intersections of two lines, one for each row, we need
the two lines to have no point in their intersection. The only way to do this
is to have the lines be parallel. For example:

Since the two lines are parallel, their normal vectors are either the same
or are scalar multiples of each other. This means that the first and second
rows of A must also be scalar multiples of each other. For example, we
could get
\[
\begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 5 \\ 3 \end{bmatrix}
\]
which indeed has no solutions because the equations are contradictory.
However, for a 3 × 3 constraint matrix, we could have no solutions for
a more complicated reason. We will get to this later.

The Column View In the previous subsection, we pared away the can-
didate solutions by taking the intersection of constraints. Alternatively we
could incrementally build up the set of b’s for which there is a solution to
Ax = b. To make this precise, we need another key definition:

Definition 7. A linear combination (over the reals) of vectors v1, v2, · · · , vm
of dimension n is a sum
\[
c_1 v_1 + c_2 v_2 + \cdots + c_m v_m
\]
where each ci is a real number.

Recall that a matrix-vector product Ax produces a linear combination
of the columns of A. In particular
\[
Ax = \sum_{i=1}^{n} a_i x_i
\]
where ai is the ith column of A. Thus we have:

Proposition 3. Ax = b has a solution if and only if there is a linear
combination of the columns of A that equals b.

Again, let's see how this works out on the same system of two equations
in two variables. Our linear system
\[
\begin{bmatrix} 2 & -3 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 0 \\ 5 \end{bmatrix}
\]
is equivalent to
\[
x \begin{bmatrix} 2 \\ 1 \end{bmatrix} + y \begin{bmatrix} -3 \\ 1 \end{bmatrix} = \begin{bmatrix} 0 \\ 5 \end{bmatrix}
\]
Now what happens if we start with a1 , the first column of A? What is the
set of vectors we can obtain as a linear combination of this one fixed vector?
We get the blue line:

And what about all the vectors that are linear combinations of the second
column of A? We get the red line:

Putting it all together we get a skewed grid:

and everything in between! Thus the set of all vectors we can get as a linear
combination of the columns of A is all of R2 . The way to think about it
is for any point we want to represent, we just have to walk along the blue
line until the direction we have left to go is along the direction of the red
line. What this means for us is that it's not just that the particular choice
of b we had has a solution to Ax = b, but rather every choice of b will have
a solution. Moreover the solution is unique too, because for every vector
b there is a unique way to reach it by walking in the direction of the blue
line, and then switching to walking in the direction of the red line.
Definition 8. The span of vectors v1 , v2 , · · · , vn is all the vectors that can
be obtained as linear combinations of v1 , v2 , · · · , vn .

Now we have deduced:


Proposition 4. If A is m × n, the system of linear equations Ax = b has
a solution for every choice of b if and only if the span of the columns of A
is all of Rm .

Just as we did before, we can go the opposite direction, from the visual-
ization to the linear system. If A is 2 × 2, what does a linear system Ax = b
look like that does not have a solution? If we think about how we built up
our skewed grid, if the blue line and red line pointed in the same direction,
when we walk along the blue direction and then switch to the red direction
we would never leave the original line. Thus we need the columns of A to
be scalar multiples of each other. So, for example, the linear system
\[
\begin{bmatrix} 2 & 4 \\ 1 & 2 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} 3 \\ 1 \end{bmatrix}
\]
would not have a solution because the span of the columns of A is a line
and the vector on the right hand side is not on it.
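One way to check this kind of claim numerically (a minimal sketch using numpy) is to compare the rank of A with the rank of the augmented matrix, which tells you whether b lies in the span of the columns:

```python
import numpy as np

A = np.array([[2.0, 4.0],
              [1.0, 2.0]])
b = np.array([3.0, 1.0])

rank_A = np.linalg.matrix_rank(A)                         # 1: the columns span only a line
rank_Ab = np.linalg.matrix_rank(np.column_stack([A, b]))  # 2: b is not on that line

print(rank_A, rank_Ab)                  # 1 2
print("solvable:", rank_A == rank_Ab)   # False, so Ax = b has no solution
```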

We will develop the theory in more detail, but for now let's appreciate
that things become more complicated when we have 3 × 3 or larger constraint
matrices. From the 2 × 2 case, you might get the false impression that a
linear system can only have no solution if there is some pair of rows that are
scalar multiples of each other. But consider the system of linear equations
\[
\begin{bmatrix} 2 & -3 & 0 \\ 1 & 1 & 2 \\ 3 & -2 & 2 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \\ 3 \end{bmatrix}
\]

You can check that there are no two rows that are scalar multiples of each
other. Similarly, there are no two columns that are scalar multiples of each
other. But nevertheless, there is no solution! How can we see this? If we
add up the linear equations corresponding to the first two rows, we will get
another valid constraint
3x − 2y + 2z = 2
However the third row corresponds to the constraint
3x − 2y + 2z = 3
And thus we cannot have all three equations hold simultaneously. Taking
this a step further, now that we know the linear system does not have a
solution, it must be true that the span of the columns of A is not all of
R3 . But how can we see this? Check for yourself that you can express
the third column as a linear combination of the first two columns. This
means, that as we grow the set of vectors from the ones we can obtain as
a linear combination of the first column of A, to ones we can obtain as a
linear combination of the first and second columns of A, and finally the
first, second and third columns of A, at one step we won’t actually reach
any new vectors and our solutions will be confined to a plane. This leads
us to another key definition:
Definition 9. A set of vectors v1, v2, · · · , vn is linearly independent if no
vi can be expressed as a linear combination of the others.

And finally we have:


Proposition 5. If A is n × n, the system of linear equations Ax = b has a
solution for every b if and only if the columns of A are linearly independent.

Note that if A is m × n things are more subtle. If m < n we could have


that the columns of A are linearly dependent but nevertheless Ax = b has
a solution for every b.

1.3 Solving Systems of Linear Equations


So far we have developed geometric ways to visualize the solutions to a sys-
tem of linear equations. In this section, we will develop a general procedure
for finding solutions. We have already seen some of the ingredients that
will go into this, namely we showed how to combine equations to produce
a contradiction, which in turn told us that there were no solutions to our
linear system at hand. The idea here will be related. We will reason about
ways to make a linear system simpler while preserving its solutions.

Our first procedure is called Gaussian elimination. At a high-level, we
will add and subtract rows from each other. Let's do an example. Given
the linear system:
\[
\begin{cases} x_1 - x_2 + 2x_3 = 1 \\ -2x_1 + 2x_2 - 3x_3 = -1 \\ -3x_1 - x_2 + 2x_3 = -3 \end{cases}
\]
First we put it in matrix-vector form
\[
\begin{bmatrix} 1 & -1 & 2 \\ -2 & 2 & -3 \\ -3 & -1 & 2 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 1 \\ -1 \\ -3 \end{bmatrix}
\]

Let A be the 3 × 3 constraint matrix and let b be the vector of dimension 3
representing the right hand side of the equation.

Definition 10. The augmented matrix of a system of linear equations Ax = b is
\[
[A, b]
\]
i.e. what we get from appending b to the end of A as another column.

We will need to do some bookkeeping for Gaussian elimination. Towards


that end:

Definition 11. The pivot in a row is the leftmost nonzero entry.

For example, for the augmented matrix of the linear system we are
interested in, we have:
\[
\begin{bmatrix} 1 & -1 & 2 & 1 \\ -2 & 2 & -3 & -1 \\ -3 & -1 & 2 & -3 \end{bmatrix}
\]

The goal of Gaussian elimination is to get the pivots, as we traverse the
rows from top to bottom, to go from left to right, strictly. As a first step,
how can we move the pivot of the 2nd and 3rd rows strictly to the right by
adding and subtracting the 1st row from the others? Because the rows of
the augmented matrix will change throughout our algorithm, let's use the
convention that r1, r2 and r3 denote the original rows. Then we have
\[
\begin{aligned}
r_2 &= \begin{bmatrix} -2 & 2 & -3 & -1 \end{bmatrix} \\
+\, 2r_1 &= \begin{bmatrix} 2 & -2 & 4 & 2 \end{bmatrix} \\
r_2' &= \begin{bmatrix} 0 & 0 & 1 & 1 \end{bmatrix}
\end{aligned}
\]
The new row, which we denote by r2' to keep track of the fact that it's a
synthetic row that wasn't in our original linear system, has the property
that its pivot is now in the third column. So we've made progress!

Similarly, what multiple of r1 should we add to r3 to move its pivot to
the right?
\[
\begin{aligned}
r_3 &= \begin{bmatrix} -3 & -1 & 2 & -3 \end{bmatrix} \\
+\, 3r_1 &= \begin{bmatrix} 3 & -3 & 6 & 3 \end{bmatrix} \\
r_3' &= \begin{bmatrix} 0 & -4 & 8 & 0 \end{bmatrix}
\end{aligned}
\]
Thus the new augmented matrix is:
\[
\begin{bmatrix} 1 & -1 & 2 & 1 \\ 0 & 0 & 1 & 1 \\ 0 & -4 & 8 & 0 \end{bmatrix}
\]
Now you might be wondering: Why are we allowed to do these operations?
The key point is that the steps we are performing preserve the set of all
solutions. The steps were defined with respect to the augmented matrix,
so to make sense of this statement we have to unpack what this means for
the associated linear system:
Proposition 6. Suppose we have a linear system Ax = b and we create its
augmented matrix [A, b] and add a scalar multiple of one row to another to
get a new augmented matrix [A′ , b′ ]. Then x is a solution to Ax = b if and
only if x is a solution to A′ x = b′ .

The augmented matrix is really just bookkeeping. Let's see why this
proposition is true in a simplified setting. Let S be the solutions to the
linear system
\[
\begin{cases} x_1 - x_2 + 2x_3 = 1 \\ -2x_1 + 2x_2 - 3x_3 = -1 \end{cases}
\]
These equations represent rows r1 and r2. Let S' be the solutions to the
linear system
\[
\begin{cases} x_1 - x_2 + 2x_3 = 1 \\ x_3 = 1 \end{cases}
\]
These equations represent r1 and r2′ . Recall that r2′ = r2 +2r1 . Now it is easy
to see why if x ∈ S we must also have x ∈ S ′ : If x satisfies the constraints
associated with r1 and r2 then when we add scalar multiples of the two
equations that hold, we get another equation, in our case corresponding
to r2′ , that also must hold. And so if x ∈ S we have that x ∈ S ′ too.
The other direction is more interesting. Suppose x ∈ S ′ , meaning that it
satisfies the equations corresponding to r1 and r2′ . How can we show that
x must also satisfy the equation corresponding to r2 ? The key point is
that the operation we performed on the rows of the augmented matrix was
invertible, and we can undo it. We can rewrite the relation r2′ = r2 + 2r1
instead as r2 = r2′ − 2r1 . From this equation it now follows that if x satisfies
the constraints associated with r1 and r2′ we can again add scalar multiples
of the two equations to get another equation, in this case we get back r2 ,
that also must hold. This completes the equivalence because we have now
shown that if x ∈ S' we must have that x ∈ S too.

Now if we do a sequence of operations of adding a scalar multiple of one


row to another we preserve the set of all solutions along the way. There is
one subtlety: You should think of the operations as atomic and happening
one after another. So you are not allowed, for example, to replace r1 with
r1′ = r1 − r2 and at the same time replace r2 with r2′ = r2 − r1 . Can you see
what would go wrong if we were allowed to do this?

In any case, we have now taken one step of Gaussian elimination. The
new linear system definitely looks simpler. But when should we stop trying
to move the pivots to the right?

Definition 12. An augmented matrix is in row echelon form if all its pivots
go strictly from left to right.

The new augmented matrix is not in row echelon form. In fact we’ll need
a new operation to fix it. The missing ingredient is that we can also swap
rows. For example, if we swap the second and third rows of the augmented
matrix, we get:
\[
\begin{bmatrix} 1 & -1 & 2 & 1 \\ 0 & -4 & 8 & 0 \\ 0 & 0 & 1 & 1 \end{bmatrix}
\]
But again, we ask: Why are we allowed to do this operation? It again
preserves the set of all solutions. This is easy to see because each row of
the augmented matrix represents a linear equation we must satisfy, and if
we think back to our geometric picture where we reasoned about the set
of solutions by taking all points in Rn and intersecting them with affine
hyperplanes, it doesn't matter in which order we take the intersections. We
still get the same set of solutions at the end.

Backsubstitution Now that we have put our linear system in row echelon
form, it is easy to find one or even all of the solutions by backsubstitution.
Let's translate our augmented matrix to a linear system. We have
\[
\begin{aligned}
x_1 - x_2 + 2x_3 &= 1 \\
-4x_2 + 8x_3 &= 0 \\
x_3 &= 1
\end{aligned}
\]

We immediately read off the value of x3 because we have a linear equation


that involves x3 and none of the other variables. This is the benefit of having
a pivot all the way on the right. So we conclude that x3 = 1. But now that
we know the value of x3 , we can substitute it into the other equations. And
since the second equation involves only x2 and x3 and none of the other
variables, we can now deduce the value of x2 as well. We get
−4x2 + 8 = 0
which implies that x2 = 2. And finally, the first equation involves x1 and
we can substitute in our values for x2 and x3 to get the equation
x1 − 2 + 2 = 1

which implies that x1 = 1. You can check that when we plug in these values
into the original linear system, they work too. But of course we knew that
would be true already. Moreover not only have we found a solution, we’ve
actually shown that it is unique because we didn't have a choice for how to set
x3 , and conditioned on what we set for its value, we didn’t have a choice
for x2 and so on.
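Here is a minimal sketch of the whole procedure in numpy (with partial pivoting added so a zero pivot never gets in the way; a production solver is more careful than this):

```python
import numpy as np

A = np.array([[1.0, -1.0, 2.0],
              [-2.0, 2.0, -3.0],
              [-3.0, -1.0, 2.0]])
b = np.array([1.0, -1.0, -3.0])

M = np.column_stack([A, b])          # augmented matrix [A, b]
n = len(b)

# Forward elimination with partial pivoting.
for col in range(n):
    pivot_row = col + np.argmax(np.abs(M[col:, col]))  # pick the largest pivot
    M[[col, pivot_row]] = M[[pivot_row, col]]          # swap rows
    for row in range(col + 1, n):
        M[row] -= (M[row, col] / M[col, col]) * M[col]

# Backsubstitution on the row echelon form.
x = np.zeros(n)
for row in reversed(range(n)):
    x[row] = (M[row, -1] - M[row, row + 1:n] @ x[row + 1:]) / M[row, row]

print(x)                             # [1. 2. 1.]
print(np.allclose(A @ x, b))         # True
```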

Connecting Gaussian Elimination to Geometry Now let’s connect


this back to earlier sections. By reasoning about the span of the columns
of the constraint matrix, we argued that the linear system
\[
\begin{bmatrix} 2 & -3 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \end{bmatrix}
\]
has a solution for any choice of b1 and b2. How does putting the linear
system into row echelon form reveal this fact? All we need to do is form a
new row r2' = r2 - (1/2) r1 which gives
\[
\begin{aligned}
r_2 &= \begin{bmatrix} 1 & 1 & b_2 \end{bmatrix} \\
-\tfrac{1}{2} r_1 &= \begin{bmatrix} -1 & \tfrac{3}{2} & -\tfrac{b_1}{2} \end{bmatrix} \\
r_2' &= \begin{bmatrix} 0 & \tfrac{5}{2} & b_2 - \tfrac{b_1}{2} \end{bmatrix}
\end{aligned}
\]
Thus our equivalent linear system is
\[
\begin{bmatrix} 2 & -3 \\ 0 & \tfrac{5}{2} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 - \tfrac{b_1}{2} \end{bmatrix}
\]
In essence, we are performing the steps of Gaussian elimination but keeping
track of the variables on the right hand side. From this expression it is now
clear why it has a solution for every b1 and b2: Given b1 and b2, backsolving
gives us a procedure to generate a solution since all we would do would be
to set
\[
x_2 = \frac{2b_2}{5} - \frac{b_1}{5}
\]
and similarly set
\[
x_1 = \frac{3b_2}{5} + \frac{b_1}{5}
\]

As an aside, you can always solve linear systems on a computer. But this is
one example where understanding how to do things by hand gives you new
insights. For example, what if we want to solve a system of linear equations
Ax = b, Ax = b′ , and so on, where the constraint matrix always stays the
same. Instead of calling the linear system solver separately on each one,
you could instead put the system in row echelon form just once, and keep
track of the corresponding operations you are supposed to perform on the
coordinates of the vector on the right hand side, and then run backsolve for
each linear system. This would be much faster.
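This is essentially what an LU factorization provides. For illustration, a minimal sketch using scipy (the right hand sides below are arbitrary): factor once, then backsolve cheaply for each b:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

A = np.array([[1.0, -1.0, 2.0],
              [-2.0, 2.0, -3.0],
              [-3.0, -1.0, 2.0]])

lu, piv = lu_factor(A)               # do the elimination work once

for b in [np.array([1.0, -1.0, -3.0]), np.array([0.0, 1.0, 2.0])]:
    x = lu_solve((lu, piv), b)       # only a substitution step per right hand side
    print(x, np.allclose(A @ x, b))
```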

Other linear systems that we have encountered do not have a unique
solution. For example, consider
\[
\begin{cases} 2x_1 - 3x_2 = 1 \\ x_1 + x_2 + 2x_3 = 1 \\ 3x_1 - 2x_2 + 2x_3 = 2 \end{cases}
\]
First we put it in matrix-vector form
\[
\begin{bmatrix} 2 & -3 & 0 \\ 1 & 1 & 2 \\ 3 & -2 & 2 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \\ 2 \end{bmatrix}
\]
Again we can ask: What would its row echelon form look like? And how
could we deduce that the solution is not unique? Putting it in row echelon
form, we get:
\[
\begin{bmatrix} 2 & -3 & 0 \\ 0 & \tfrac{5}{2} & 2 \\ 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 1 \\ \tfrac{1}{2} \\ 0 \end{bmatrix}
\]
Notice that there is no pivot in the last row. But it still meets our definition
of row echelon form. In particular, any row consisting of only zeros must be
at the bottom. Now what would happen if we perform backsubstitution?
We would get the equation 0 = 0 from the last row, so any choice of x3
works! But now for any choice we make, say x3 = t, the second equation tells
us a unique way to set x2 , and so on. So the set of solutions looks like a line
in three dimensions because there is one degree of freedom corresponding
to the choice of x3 . To be more explicit, we can write down the equations
of the line as
\[
\begin{aligned}
x_1 &= \tfrac{4}{5} - \tfrac{6t}{5} \\
x_2 &= \tfrac{1}{5} - \tfrac{4t}{5} \\
x_3 &= t
\end{aligned}
\]

This leads us to another key definition:


Definition 13. If performing Gaussian elimination on a linear system
Ax = b results in a row of all zeros in the new constraint matrix, we say
that A is singular.

Note that it does not matter whether the augmented matrix has zeros or
not in the last column, since that corresponds to what we do to the vector
b. Being singular is a property of just the constraint matrix.

So we saw above how a singular matrix A can have infinitely many


solutions. But they can also have no solutions. Suppose we take the same
linear system but change the entry b3 so that we get
\[
\begin{bmatrix} 2 & -3 & 0 \\ 1 & 1 & 2 \\ 3 & -2 & 2 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \\ 3 \end{bmatrix}
\]
Again we can put it in row echelon form. (Does the sequence of operations,
of what row we add and subtract from which other rows, need to change?)
We get
\[
\begin{bmatrix} 2 & -3 & 0 \\ 0 & \tfrac{5}{2} & 2 \\ 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 1 \\ \tfrac{1}{2} \\ 1 \end{bmatrix}
\]
Now if we try backsubstitution we get the equation 0 = 1 from the last
row. So while our operations have preserved the set of solutions, the linear
system we derived clearly has no solutions, so that must be true of the
linear system we started off with too!

Taking a step back, there is an important lesson that will come up again
and again for us: It is much easier to get geometric insights about a system
of linear equations (e.g. what does the set of solutions look like?) by
first putting it into a convenient normal form. We will develop much more
powerful normal forms, that will tell us other things too, like quantitative
ways to measure how far a matrix is from being singular, etc.

Gauss-Jordan Elimination It turns out that we can perform elementary
row operations to make the matrix even simpler. Consider an augmented
matrix in row echelon form:
\[
\begin{bmatrix} 1 & -1 & 2 & 1 \\ 0 & -4 & 8 & 0 \\ 0 & 0 & 1 & 1 \end{bmatrix}
\]
First we can multiply the rows by scalars to make all the pivots equal to
one:
\[
\begin{bmatrix} 1 & -1 & 2 & 1 \\ 0 & 1 & -2 & 0 \\ 0 & 0 & 1 & 1 \end{bmatrix}
\]
And now, for any row which contains a pivot, we can add and subtract it
from the rows above it to zero out the remaining entries in that column.
We would get
\[
\begin{bmatrix} 1 & 0 & 0 & 1 \\ 0 & 1 & 0 & 2 \\ 0 & 0 & 1 & 1 \end{bmatrix}
\]
This corresponds to the linear system
\[
\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \\ 1 \end{bmatrix}
\]

Now backsubstitution is even easier! We can directly read off x1 = 1,


x2 = 2, x3 = 1. An n × n matrix with ones along the diagonal and zeros
everywhere else is called an identity matrix. We will see why, later when
we talk about matrix multiplication.

Definition 14. An augmented matrix is in reduced row echelon form if all of its
pivots go strictly from left to right, are equal to one, and in a column containing a
pivot all the other entries are zero.

It is not always possible to get the identity matrix. Sometimes we have
to settle for systems that look like
\[
\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 2 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix}
\]
But this is still in reduced row echelon form and it is still easy to read off
the set of solutions without even doing any further algebra.
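If you want to experiment with reduced row echelon form on other examples, a computer algebra system can compute it exactly; here is a minimal sketch using sympy:

```python
from sympy import Matrix

# Augmented matrix [A, b] for the example system from this section.
M = Matrix([[1, -1, 2, 1],
            [-2, 2, -3, -1],
            [-3, -1, 2, -3]])

rref_M, pivot_cols = M.rref()   # reduced row echelon form and pivot columns
print(rref_M)                   # Matrix([[1, 0, 0, 1], [0, 1, 0, 2], [0, 0, 1, 1]])
print(pivot_cols)               # (0, 1, 2)
```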

1.4 Application: Electrical Circuits


Linear algebra arises throughout science and engineering, but sometimes
you don’t realize that it’s just linear algebra in disguise! Towards that end,
let’s do an example of how to model a problem we might need to solve as
a system of linear equations. Consider an electrical circuit:

All the resistors in the circuit have unit resistance. Our ammeter will send
one unit of current into the junction a1 and one unit of current will leave
from the junction a6 . We would like to know how this current splits along
the different branches of the circuit.
Fact 1. Kirchhoff's law tells us that the current across a resistor
is equal to
\[
i = \frac{v_1 - v_2}{R}
\]

where v1 and v2 represent the voltages at the two endpoints, and R is the
resistance.

So now we can write down a system of linear equations that describes


how the voltages translate into currents. We will let i1 be the current going
from junction a1 to a2 , and similarly for all the other currents in the diagram
below. We will let v1 denote the voltage at junction a1 , and so on.

Consider the junction a2 . The total current going in must equal the current
going out, so we get
i1 = i2 + i5
We can use Kirchhoff’s law to rewrite this in terms of the voltage differences
as
(v2 − v1 ) = (v3 − v2 ) + (v5 − v2 )
Moreover we assumed there is one unit of current going into a1 and this
same amount of current must exit. So we get

v2 − v1 = 1

because all the resistors have unit resistance. Similarly we get the equation

v6 − v5 = 1

Putting it all together, we get a system of linear equations
\[
\begin{bmatrix}
1 & -1 & 0 & 0 & 0 & 0 \\
-1 & 3 & -1 & 0 & -1 & 0 \\
0 & -1 & 2 & -1 & 0 & 0 \\
0 & 0 & -1 & 2 & -1 & 0 \\
0 & -1 & 0 & -1 & 3 & -1 \\
0 & 0 & 0 & 0 & -1 & 1
\end{bmatrix}
\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ v_4 \\ v_5 \\ v_6 \end{bmatrix}
=
\begin{bmatrix} -1 \\ 0 \\ 0 \\ 0 \\ 0 \\ 1 \end{bmatrix}
\]
Now all we have to do is solve the linear system to find the voltages at
each junction. And then we can solve for the currents just by applying
Kirchhoff’s laws again. The beauty of this approach is that we are not
solving just one electrical circuit problem, but what we actually have is a
general recipe of translating an electrical engineering problem into a system
of linear equations from which point we can apply Gaussian elimination.

Moreover there are interesting ways to connect the general theory and
systems of linear equations. Since the vector on the right hand side repre-
sents the net current out of each junction, we know its entries must sum to
zero, since otherwise current would not be conserved. So if we try to solve
the linear system with a right hand side whose entries do not sum to zero,
what should happen? We should find that there is no solution. How can
we see this? If we add up all the rows in the constraint matrix, they would
sum to zero. This means that our matrix is singular, and if the sum of the
entries in the vector on the right hand side is not also zero, we would get a
contradiction.
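For illustration, here is a minimal sketch of solving the circuit system numerically (using numpy; since the constraint matrix is singular, we use a least-squares solve, which returns one particular solution of the consistent system, and only voltage differences are physically meaningful):

```python
import numpy as np

A = np.array([[ 1, -1,  0,  0,  0,  0],
              [-1,  3, -1,  0, -1,  0],
              [ 0, -1,  2, -1,  0,  0],
              [ 0,  0, -1,  2, -1,  0],
              [ 0, -1,  0, -1,  3, -1],
              [ 0,  0,  0,  0, -1,  1]], dtype=float)
b = np.array([-1, 0, 0, 0, 0, 1], dtype=float)

# A is singular (its rows sum to zero), so use least squares to get one solution.
v, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(A @ v, b))   # True: the right hand side sums to zero, so it is consistent
print(v[1] - v[0])             # 1.0: the voltage difference v2 - v1 derived in the text
```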

Let’s take these ideas even further. If you’ve had to solve electrical
circuit problems before, you are probably familiar with the rule for how to
simplify the circuit diagram of two resistors in parallel:

They act as one resistor whose resistance is
\[
R = \frac{1}{\frac{1}{R_1} + \frac{1}{R_2}}
\]
This means that for any amount of current i that we send through the circuit,
the voltage difference v1 − v2 will be the same across the two circuits.
There are even more complicated rules, such as the Y − Δ transform, that
tell us the following two circuits are equivalent:

Where we set
\[
R_1' = \frac{R_2 R_3}{R_1 + R_2 + R_3}, \quad R_2' = \frac{R_1 R_3}{R_1 + R_2 + R_3} \quad \text{and} \quad R_3' = \frac{R_1 R_2}{R_1 + R_2 + R_3}
\]

The claim is that for any amount of current that we send into and out of
the three junctions, the voltage differences we get are the same. This is
equivalent to saying that for any right hand side in our system of linear
equations, the solution we find from the two circuits is the same. This
is because we can go from one linear system to the other by doing only
elementary row operations where we add and subtract rows from each other.
The details are complicated, however.
Chapter 2

Matrix Multiplication and Inverses

2.1 Multiplying Matrices


In this section, we will talk about how to multiply matrices and what it
means. There are a few ways to think about it. First, we can think about
it as a formula.
Definition 15. Given matrices A and B of size m × n and n × p respectively,
their product C = AB is an m × p matrix where
\[
C_{i,k} = \sum_{j=1}^{n} A_{i,j} B_{j,k}
\]

When the number of columns in A does not match the number of rows
in B, we cannot multiply them. Another way to think about this formula is
the following. Let Bk and Ck be the kth columns of B and C respectively.
Then Ck = ABk . Thus matrix multiplication can also be thought of as a
sequence of matrix-vector products. Yet another way to think about it is
that the entry Ci,k is the result of taking the inner product of the ith row
of A and the kth column of B.
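As a quick numerical check that these three views agree (a minimal sketch using numpy with an arbitrary random example):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.integers(-3, 4, size=(2, 3)).astype(float)   # m x n
B = rng.integers(-3, 4, size=(3, 4)).astype(float)   # n x p

C = A @ B

# Entry formula: C[i, k] = sum_j A[i, j] * B[j, k]
C_entries = np.array([[sum(A[i, j] * B[j, k] for j in range(3))
                       for k in range(4)] for i in range(2)])

# Column view: the kth column of C is A times the kth column of B.
C_columns = np.column_stack([A @ B[:, k] for k in range(4)])

print(np.allclose(C, C_entries), np.allclose(C, C_columns))   # True True
```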

Derivation through Composition Where does the matrix multiplication
formula come from? Recall that if we have a linear function f (see
Definition 1) that maps vectors of dimension p to vectors of dimension n,


we can represent it as a matrix-vector product. Let f (x) = Bx for some
n × p matrix. Now what if we have another linear function g that maps
vectors of dimension n to vectors of dimension m? Again we can represent
it as a matrix-vector product. Let g(y) = Ay for some m × n matrix A.
Now consider the composition h(x) = g(f (x)). We claim that if f and g
are linear, then h is too. This is easy to see by verifying the conditions.
First for any scalar α we have

h(αx) = g(f (αx)) = g(αf (x)) = αg(f (x)) = αh(x)

where in each step we have either used the linearity of f or of g. Similarly


for any x and x′ we have

h(x+x′ ) = g(f (x+x′ )) = g(f (x)+f (x′ )) = g(f (x))+g(f (x′ )) = h(x)+h(x′ )

again using the linearity of f and g. Now appealing to Proposition 1 we


know that there must be a m × p matrix C so that h(x) = Cx.

So what is this matrix C? Recall that ei is the ith standard basis vector
(Definition 4). Then Cei is the ith column of C. So if we want to compute
C we can compute
h(ei ) = g(f (ei )) = A(Bei )
for all i. From this formula, we get that Ci = ABi , which agrees with our
formula for matrix multiplication. Moreover it shows that the matrix mul-
tiplication formula represents the natural way to compose linear functions.

Application: Counting Walks Let’s see an example from graph theory


where matrix multiplication comes up. We will be interested in counting
walks in graphs:

Definition 16. A graph G = (V, E) is a collection of vertices V and a set


of edges E that are unordered pairs of vertices.

Graphs come up all over the place. For example, they can be used to
describe social networks. Vertices would represent people and there would
be an edge between a pair of people if, say, they are friends on Facebook.
We will work with the following graph:

We will be interested in counting the number of walks between a pair


of vertices:

Definition 17. A walk is a sequence of vertices connected by edges, where


repetitions are allowed. The length of the walk is the number of edges.

For example, a, b, a, b, d is a walk in the graph of length 4. Given a


graph G = (V, E) can we count the number of walks between some pair
of vertices? It turns out that we will be able to do this through matrix
multiplication.

The first step is to figure out how to represent a graph as a matrix.

Definition 18. An adjacency matrix is a matrix where each row and each
column represents a vertex in V . Moreover there is a 1 in row i, column j
if and only if there is an edge between the corresponding vertices in G.

The adjacency matrix of our graph is:
\[
\begin{bmatrix} 0 & 1 & 0 & 0 \\ 1 & 0 & 1 & 1 \\ 0 & 1 & 0 & 1 \\ 0 & 1 & 1 & 0 \end{bmatrix}
\]

where we have chosen the convention that the first row and column represent
the vertex a, the second row and column represent the vertex b, and so on.
Now what happens if we multiply A by itself? We usually write AA = A2 .

We get
\[
\begin{bmatrix} 0 & 1 & 0 & 0 \\ 1 & 0 & 1 & 1 \\ 0 & 1 & 0 & 1 \\ 0 & 1 & 1 & 0 \end{bmatrix}
\begin{bmatrix} 0 & 1 & 0 & 0 \\ 1 & 0 & 1 & 1 \\ 0 & 1 & 0 & 1 \\ 0 & 1 & 1 & 0 \end{bmatrix}
=
\begin{bmatrix} 1 & 0 & 1 & 1 \\ 0 & 3 & 1 & 1 \\ 1 & 1 & 2 & 1 \\ 1 & 1 & 1 & 2 \end{bmatrix}
\]
What remains is to figure out how to interpret the result in graph theoretic
terms. Consider the entry in row 1, column 3 in A2 . It came from the
expression
A1,1 A1,3 + A1,2 A2,3 + A1,3 A3,3 + A1,4 A4,3
We claim that each term in this expression is a candidate walk. For exam-
ple, since the first row and column are associated with vertex a, and the
third row and column are associated with vertex c, the first term A1,1 A1,3
represents the length two walk a, a, c. But since there is no edge from a
to a nor is there one from a to c this walk shouldn’t be counted, and in-
deed A1,1 A1,3 = 0. We can do this exercise for the second term A1,2 A2,3
too. This term represents the walk a, b, c, which is a valid walk, and indeed
A1,2 A2,3 = 1 and contributes one to our sum, and so on. Thus the entry in
row 1 column 3 in A2 counts the number of length two walks from a to c in
G.

We can prove a more general statement. Let
\[
A^{\ell} = \underbrace{A A \cdots A}_{\ell \text{ times}}
\]

Lemma 1. Let G = (V, E) be a graph and let A be its adjacency matrix.
Then the entry in row i, column j of A^ℓ counts the number of walks of
length ℓ from the ith vertex to the jth vertex in G.

Proof. The proof is by induction. For ℓ = 1 we are interested in counting


walks of length 1. There is a walk of length 1 between a pair of vertices if
and only if there is an edge between them, which is exactly how we formed
the adjacency matrix in the first place.

Now suppose the lemma is true for ℓ − 1. Then we can write

Aℓ = Aℓ−1 A

Let |V| = n. From the matrix multiplication formula, we have
\[
(A^{\ell})_{i,k} = \sum_{j=1}^{n} (A^{\ell-1})_{i,j} A_{j,k}
\]
Now by induction (A^{ℓ−1})_{i,j} counts the number of walks from i to j of length
ℓ − 1. And A_{j,k} counts the number of length one walks from j to k. The
point is that every walk of length ℓ from i to k corresponds to the choice
of a walk of length ℓ − 1 from i to j and a walk of length 1 from j to k for
some j. And conversely every choice of a walk of length ℓ − 1 from i to j
and a walk of length 1 from j to k for some j also yields a walk of length
ℓ from i to k by concatenation. Thus matrix multiplication is counting the
right thing.

When we want to count short walks, it is easy to do it by inspection.
But what if we want to count the number of walks of length 20? It would
no longer be feasible to write down a list of all of the walks. But matrix
multiplication doesn't need to do this. We can compute:
\[
A^{20} = \begin{bmatrix}
426811 & 923802 & 790748 & 790748 \\
923802 & 2008307 & 1714550 & 1714550 \\
790748 & 1714550 & 1466055 & 1466054 \\
790748 & 1714550 & 1466054 & 1466055
\end{bmatrix}
\]


which computes all length 20 walks between every pair of vertices in the
graph! It turns out that counting walks, and related problems like comput-
ing the PageRank vector of a graph, tell us important connectivity prop-
erties about graphs that are too large to write down. Even in our small
example, we can see this in that the number of walks of length 20 from
a to the other vertices (or vice-versa) is much smaller than the number of
walks of length 20 among vertices on the triangle. They have important
applications in search and detecting communities in social networks.
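For illustration, this computation is a one-liner with numpy's matrix power:

```python
import numpy as np

A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]])

walks20 = np.linalg.matrix_power(A, 20)
print(walks20)             # the matrix of length-20 walk counts shown above
print(walks20[0, 2])       # number of length-20 walks from a to c
```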

Basic Facts We will close this section by introducing some basic facts
about multiplying matrices:
Fact 2. Matrix multiplication is associative, i.e. for A, B and C which are
m × n, n × p and p × ℓ matrices respectively, we have
A(BC) = (AB)C

You can check this fact algebraically by using the formula for matrix
multiplication. However the intuition behind this fact is simple, again us-
ing Proposition 1 and the connection between matrix multiplication and
composing linear functions. If A, B and C represent the linear functions
f , g and h respectively then A(BC) corresponds to defining a new linear
function p = g ◦ h, i.e. the composition of g and h and computing f ◦ p.
Similarly (AB)C corresponds to defining a new linear function q = f ◦ g
and computing q ◦ h. And f ◦ p = q ◦ h because they both correspond to
the same linear function.

Recall in Section 1.1 we defined the predator-prey model and wrote out
an expression
\[
\begin{bmatrix} g(t+1) \\ y(t+1) \end{bmatrix} = A \begin{bmatrix} g(t) \\ y(t) \end{bmatrix}
\]
that describes how the number of frogs and flies evolves over time. The
associativity of matrix multiplication implies that if we want to compute
what the population looks like at time t from its initial data, instead of
computing
\[
\begin{bmatrix} g(t+1) \\ y(t+1) \end{bmatrix} = A(A(\cdots(A \begin{bmatrix} g(0) \\ y(0) \end{bmatrix})\cdots))
\]
we can compute
\[
\begin{bmatrix} g(t+1) \\ y(t+1) \end{bmatrix} = (\cdots((A)A)\cdots A) \begin{bmatrix} g(0) \\ y(0) \end{bmatrix} = A^{t+1} \begin{bmatrix} g(0) \\ y(0) \end{bmatrix}
\]

Thus if we want to simulate the model for many different initial conditions,
we can just compute the power of A once. Later on, we will study what intrinsic prop-
erties of A govern how quickly At grows/decays which will tell us important
things about the long term behavior of linear dynamical systems like the
one above.
Fact 3. Matrix multiplication is (right and left) distributive, i.e. for A, B
and C which are m × n, n × p and n × p matrices respectively, we have

A(B + C) = AB + AC

Similarly if A, B and C are m × n, m × n and n × p matrices respectively


we also have
(A + B)C = AC + BC

Recall that the identity matrix is a square matrix which has ones along
the diagonal and zeros everywhere else. We use In to denote the n × n identity
matrix. We sometimes write I when the dimensions are clear from context.

Fact 4. If A is an m × n matrix then Im A = A and also AIn = A.

Sometimes I is called a unit element, because it acts like multiplying by


the scalar 1.

Finally it’s important to beware of what you’re not allowed to do when


multiplying matrices:

Anti-Fact 1. Matrix multiplication is generally not commutative, i.e.
\[
AB \neq BA
\]
When A and B are not square, the dimensions may not even match, so one of the
two products might not be defined; and even when both products are defined, they
are generally not equal.

2.2 The Inverse of a Matrix


Throughout this course, we will develop recipes for solving important prob-
lems, like finding a solution to a system of linear equations. Often times our
recipe will produce a new, important abstraction as a by-product. Indeed,
in this section, we will see how Gauss-Jordan elimination actually gives us
a recipe for computing the inverse of a matrix:

Definition 19. Given a square matrix A, its right inverse is a matrix A−1
with the property that
AA−1 = I

While matrix multiplication is in general not commutative, nevertheless


the left and right inverse of a matrix turn out to be the same:

Fact 5. A matrix A−1 is the left inverse, i.e. A−1 A = I if and only if it is
the right inverse, i.e. AA−1 = I.

Thus we will refer to A−1 as simply the inverse of A and drop the qualifier
left or right.

Now suppose we want to solve a linear system Ax = b. If we know the


inverse A−1 , we can write down a closed form expression for the solution.
Consider x∗ = A−1 b. Substituting this back into the equation, we find
Ax∗ = A(A−1 b) = (AA−1 )b = Ib = b
and thus x∗ solves the equation.

There are many basic questions left. When does the inverse of a matrix
exist? Is it unique? And how do we find it?

Back to Gauss-Jordan Elimination It turns out that we already know


how to find A−1 . It just involves keeping track of the operations we per-
formed in Gauss-Jordan elimination in a matrix. Suppose we want to solve
a linear system
\[
\begin{bmatrix} 1 & -1 & 2 \\ -2 & 2 & -3 \\ -3 & -1 & 2 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 1 \\ -1 \\ -3 \end{bmatrix}
\]
Let ri denote the ith row of the constraint matrix. The first step of Gauss-
Jordan elimination was to add 2r1 to r2. But it turns out that we can
represent this operation using matrix multiplication:
\[
\begin{bmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} r_1 \\ r_2 \\ r_3 \end{bmatrix} = \begin{bmatrix} r_1 \\ r_2 + 2r_1 \\ r_3 \end{bmatrix}
\]
Here each ri is a 1 × 3 matrix, i.e. a row vector of dimension three. Thus in
the above expression we are multiplying two 3 × 3 matrices to get another
3 × 3 matrix. In fact, this recipe works more generally:
Fact 6. Given an n × m matrix A, the operation of adding α times the
ith row to the jth row can be performed by computing BA, where B is an
n × n matrix with ones along the diagonal, Bj,i = α and all of the other
entries are zero.
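As a quick numerical check of this fact (a minimal sketch using numpy; note that numpy indexes rows from 0):

```python
import numpy as np

A = np.array([[1.0, -1.0, 2.0],
              [-2.0, 2.0, -3.0],
              [-3.0, -1.0, 2.0]])

# Elementary matrix that adds 2 times row 1 to row 2 (rows numbered as in the text).
B = np.eye(3)
B[1, 0] = 2.0                 # B_{j,i} = alpha, with j = 2 and i = 1 in 0-indexed form

print(B @ A)
# the second row becomes r2 + 2*r1 = [0, 0, 1]; the other rows are unchanged
```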

Adding a scalar multiple of one row to another row is not the only
operation we perform in Gauss-Jordan elimination. Sometimes we need
to swap rows. This can also be performed by matrix multiplication. For
example, if we want to swap the second and third rows, this can be done
by:
\[
\begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} r_1 \\ r_2 \\ r_3 \end{bmatrix} = \begin{bmatrix} r_1 \\ r_3 \\ r_2 \end{bmatrix}
\]
And more generally:

Definition 20. Given an n × m matrix A, the operation of swapping the
ith and jth rows can be performed by computing BA, where B is an n × n
matrix with Bk,k = 1 for all k ≠ i, j and Bj,i = Bi,j = 1 and all of the other
entries are zero.

Now instead of just keeping track of the current constraint matrix, let’s
also keep track of how it relates to the previous constraint matrix. We start
off with A and the first step of Gauss-Jordan elimination can be performed
by computing
A′ = B1 A
Similarly the next step can be performed by multiplying by B2 and so on.
In the end, this gives us

A → B1 A → B2 B1 A ⇝ (Bp · · · B2 B1 )A

Recall that at the end of Gauss-Jordan elimination, the constraint matrix


is in reduced row echelon form. Suppose we got lucky, so that none of the
rows were just zero. Since A is square, this would mean

(Bp · · · B2 B1 )A = I

And so we have found A−1 , because A−1 = (Bp · · · B2 B1 ). Thus we have


shown:
Proposition 7. If A is square and non-singular, then its inverse exists.

We can also think about the inverse, column-by-column, as the solution


to a particular linear system. Recall that ei is the ith standard basis vector
(Definition 4). Consider the linear systems

Ax = ei

for each i from 1 to n. Let xi be the solution. And suppose we concatenate
all these solutions together. We claim that
\[
A^{-1} = \begin{bmatrix} x_1, x_2, \cdots, x_n \end{bmatrix}
\]
This is easy to check using the column view of matrix multiplication:
\[
A \begin{bmatrix} x_1, x_2, \cdots, x_n \end{bmatrix} = \begin{bmatrix} e_1, e_2, \cdots, e_n \end{bmatrix} = I
\]
This yields an answer to another one of our questions:
Lemma 2. If it exists, the inverse of A is unique.

Proof. We found the inverse of A by performing Gauss-Jordan elimination


and reaching the identity matrix at the end of the process. Recall that the
steps of Gauss-Jordan elimination preserve the set of solutions (Proposi-
tion 6). Thus, since the linear system Ix = b has a unique solution for any
b so too does the linear system Ax = b. This means that the columns of
A−1 , which solve the linear systems Axi = ei , are unique and thus A−1 is
unique too.

What happens if we do not get lucky? What if, when we are performing
Gauss-Jordan elimination, we end up with a row of just zeros? For example
when solving the linear system
\[
\begin{bmatrix} 2 & -3 & 0 \\ 1 & 1 & 2 \\ 3 & -2 & 2 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \\ 3 \end{bmatrix}
\rightsquigarrow
\begin{bmatrix} 2 & -3 & 0 \\ 0 & \tfrac{5}{2} & 2 \\ 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 1 \\ \tfrac{1}{2} \\ 1 \end{bmatrix}
\]
Recall that such a matrix is called singular (Definition 13). What does this
tell us about A−1 ?
Lemma 3. If A is singular then it does not have an inverse.

Proof. Suppose, for the sake of contradiction, that A did have an inverse
A−1 . Then for any b, the linear system Ax = b would have the solution
\[ x^* = A^{-1} b \]
However when A is singular, we know that there are some choices of b for
which Ax = b does not have a solution, e.g. the example above. Thus we
have a contradiction, and the inverse of A must not exist.

Other Useful Facts We close this section by stating some other basic
facts about inverses.

Fact 7. If A and B are square, non-singular matrices then

(AB)−1 = B −1 A−1

Moreover there is a simple closed-form expression for the inverse of a


2 × 2 matrix:

Lemma 4. Given a non-singular 2 × 2 matrix
\[ A = \begin{bmatrix} a & b \\ c & d \end{bmatrix} \]
its inverse is
\[ A^{-1} = \frac{1}{ad - bc} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix} \]

The expression ad − bc is the determinant of a 2 × 2 matrix, and later


we will see that it is zero if and only if A is singular. Thus the expression
makes sense, and we can check that A−1 is indeed the inverse just by writing
out AA−1 . There are explicit expressions for the inverse of n × n square
matrices involving the determinant, but these are numerically unstable and
they do not yield good algorithms for computing the inverse.
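A quick numerical sanity check of Lemma 4 (a sketch; the entries are made up):

```julia
# Verify the 2×2 inverse formula on one example.
a, b, c, d = 2.0, -3.0, 1.0, 1.0
A = [a b; c d]
Ainv = (1 / (a * d - b * c)) * [d -b; -c a]
@show A * Ainv        # ≈ the 2×2 identity
```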
Chapter 3

Rotations, Projections,
Reflections and Transpositions

In this section we will introduce some important families of matrices. These


matrices will apply geometric operations to vectors, like permuting their
coordinates, rotating them, or projecting them. Apart from the fact that
they are interesting in their own right, they will also serve as building blocks
for some of the matrix decompositions we develop later, whereby we will see
how to think about general matrices as a composition of simpler operations.

Lengths and Angles First let’s talk about lengths and angles:

Definition 21. Let v be a vector in d dimensions. Its length is
\[ \|v\| = \sqrt{v_1^2 + v_2^2 + \cdots + v_d^2} \]

If ∥v∥ = 1 we say that v is a unit vector.

Recall we introduced the notion of an inner-product in Definition 5. As


an aside, the length can also be defined in terms of the inner-product, since

∥v∥2 = ⟨v, v⟩

In any case, we can use the inner-product to talk about angles between two
vectors:

Definition 22. The angle θ between two vectors v and w is determined by
the expression
\[ \cos\theta = \frac{\langle v, w \rangle}{\|v\|\,\|w\|} \]

We can visualize the inner-product as follows. Suppose, for simplicity,


that v and w are both unit vectors.

Then the circle above denotes the set of all unit vectors. Without loss of
generality, we can assume that v lies on the x axis. Then the inner-product
of w and v is just the x-coordinate of w, which is the same thing as the
cosine of the angle between them. Lengths and angles are helpful things to
keep track of when we apply various matrix operations.

Permutation Matrices In the process of describing the steps of Gaus-


sian elimination, and seeing how it gave us a recipe for constructing the
inverse of a matrix, we had to express the operation of swapping entries of
a vector as a matrix-vector product. Let’s study the more general problem
of applying a permutation to the coordinates of a vector:
Definition 23. A permutation π is a function
π : {1, 2, · · · , d} → {1, 2, · · · , d}
that is one-to-one, i.e. each element of {1, 2, · · · , d} is mapped to exactly once.
For example, for d = 3 we could have the permutation π where π(1) =
2, π(2) = 3 and π(3) = 1. Now we can apply a permutation π to the
coordinates of a vector as follows. If v is the vector we apply it to, and v ′
is the result, and if π(i) = j then we want to put the ith coordinate of v
in the jth coordinate in v ′ . How could we express this as a matrix-vector
product? First consider our example for d = 3. We can check that
\[
\begin{bmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}
\begin{bmatrix} x \\ y \\ z \end{bmatrix}
=
\begin{bmatrix} z \\ x \\ y \end{bmatrix}
\]
But what does the general recipe, for turning a permutation into a matrix-
vector product, look like?
Definition 24. A permutation matrix is a square matrix with 0s and 1s
where there is exactly one 1 in each row or column. Given a permutation π
we can define its associated permutation matrix Aπ as follows:
\[ A^{\pi}_{i,j} = \mathbf{1}_{\pi(j)=i} \]

Here we have used the notation 1 to denote the indicator function. In


particular, 1π(j)=i is 1 if π(j) = i and otherwise is 0.
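Here is a small sketch in Julia that builds the permutation matrix for the example permutation above and applies it (the test vector is made up):

```julia
using LinearAlgebra

# The permutation π(1) = 2, π(2) = 3, π(3) = 1 from the example above.
perm = Dict(1 => 2, 2 => 3, 3 => 1)
d = 3
Aπ = [perm[j] == i ? 1 : 0 for i in 1:d, j in 1:d]   # entry (i,j) is 1 iff π(j) = i

v = [10.0, 20.0, 30.0]
@show Aπ * v                    # the ith coordinate of v lands in coordinate π(i)
@show norm(Aπ * v) ≈ norm(v)    # permuting coordinates preserves the length
```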

Now let’s test our geometric intuition.


Question 2. Does applying a permutation matrix change the length of a
vector?

It is easy to see that the length stays the same, because when we compute
the length, we’re just taking the sum of squares of the coordinates in a
different order, but the result is the same.
Question 3. Does applying a permutation matrix change the angle between
vectors?

Again, we can see that the angle is unchanged, using the definition of the
inner-product. We already know that the lengths remain unchanged, and
also
\[ \langle A^\pi x, A^\pi w \rangle = \sum_{i=1}^{d} x'_i w'_i = \sum_{i=1}^{d} x_{\pi^{-1}(i)} w_{\pi^{-1}(i)} = \langle x, w \rangle \]
where x′ = Aπ x and w′ = Aπ w.

Something special about permutation matrices (that is also true of other


families of matrices we will see shortly) is that computing their inverse is easy:

Definition 25. The transpose of an m × n matrix A, denoted by AT is an


n × m matrix with
(AT )i,j = Aj,i
Moreover if A = AT we say that A is symmetric.

For example, we have
\[ A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}, \qquad A^T = \begin{bmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{bmatrix} \]

Now let's take our 3 × 3 permutation matrix A from earlier. We can check
\[
\underbrace{\begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{bmatrix}}_{A^T}
\underbrace{\begin{bmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}}_{A}
=
\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}
\]

Thus A−1 = AT . We will see that this is true for any permutation matrix:

Fact 8. If A is a permutation matrix, then A−1 = AT .

The intuition for this fact is simple. Suppose A corresponds to a permuta-


tion π and π(j) = i. Then

1. The matrix-vector product Av sends the jth coordinate of v to the


ith coordinate. This is because it has a 1 in row i, column j and all
the other entries in this row and column are 0.

2. Thus the matrix AT has a 1 in row j, column i and all the other entries
in this row and column are 0. And so the matrix-vector product AT w
sends the ith coordinate of w to the jth coordinate.
These two properties imply that A and AT undo each other.

Let’s dive deeper into the transposition operation. First, applying the
transpose to a product of matrices reverses their order.

Fact 9. (AB)T = B T AT

If A and B are n × m and m × p matrices, then AT and B T are m × n


and p × m matrices. If we didn’t swap their order, the dimensions wouldn’t
even match. Let’s state some more basic facts that we will use later:

Fact 10. (AT )T = A

Fact 11. (A + B)T = AT + B T

Fact 12. For a square, invertible A we have (A−1 )T = (AT )−1 .

This last fact is the only one that is not an immediate consequence of
the definition of the transpose. But it is also easy to verify in the sense
that we just need to check that (A−1 )T is the inverse of AT . This follows
because
(A−1 )T AT = (AA−1 )T = I T = I
where the first equality follows from Fact 9.

Finally, transposes give us a helpful way to express many kinds of oper-


ations. For example, if we have two column vectors v and w and we want to
take their inner-product, how can we express this as a matrix-vector prod-
uct? We can turn v into a 1 × d matrix, i.e. a row vector using a transpose.
Then we have
\[ v^T w = \begin{bmatrix} v_1 & v_2 & \cdots & v_d \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_d \end{bmatrix} = \sum_{i=1}^{d} v_i w_i = \langle v, w \rangle \]

Rotations In later sections, we will talk about higher dimensional rota-


tions and projections. But for now, let’s focus on the two-dimensional case.
In Section 1.1 we talked about how to rotate a two-dimensional vector by
π/2 in the counter-clockwise direction. We saw that this could be expressed
as a matrix-vector product as
\[ \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} -x_2 \\ x_1 \end{bmatrix} \]

More generally, how can we rotate a vector by θ in the counter-clockwise


direction? It is easy to check that we can do this as:
\[ R_\theta \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} x_1 \cos\theta - x_2 \sin\theta \\ x_1 \sin\theta + x_2 \cos\theta \end{bmatrix} \]

This ought to preserve lengths. Indeed we can verify this by observing that
\[
(x_1 \cos\theta - x_2 \sin\theta)^2 + (x_1 \sin\theta + x_2 \cos\theta)^2 = x_1^2(\cos^2\theta + \sin^2\theta) + x_2^2(\sin^2\theta + \cos^2\theta) = x_1^2 + x_2^2
\]

which shows that the squared length of the vector before and after the
transformation is the same.

It turns out that for rotation matrices, the inverse is again easy to
describe. If the matrix A rotates a vector by θ in the counter-clockwise
direction, A−1 rotates by −θ in the counter-clockwise direction. Thus we
have
\[ R_\theta^{-1} = R_{-\theta} = \begin{bmatrix} \cos(-\theta) & -\sin(-\theta) \\ \sin(-\theta) & \cos(-\theta) \end{bmatrix} \]
But since cos is an even function and sin is an odd function, we can see that
\[ R_\theta^{-1} = \begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix} \]
And now we can see that Rθ−1 = RθT . Thus we have:

Fact 13. For any rotation matrix R, its inverse is RT .

We have only talked about rotations in two-dimensions so far, but when we


talk about rotations in higher-dimensions we will see that this fact remains
true.
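A small numerical check of these two facts (a sketch; θ and x are arbitrary choices):

```julia
using LinearAlgebra

θ = π / 6
R = [cos(θ) -sin(θ); sin(θ) cos(θ)]
x = [3.0, 4.0]

@show norm(R * x) ≈ norm(x)        # rotations preserve length
@show R' * R ≈ [1.0 0.0; 0.0 1.0]  # Rᵀ R = I, so R⁻¹ = Rᵀ (Fact 13)
```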
Applications to Robotics In robotics, we often want to describe a cam-
era angle in three-dimensions. It turns out that we can decompose the
transformation, of how we should rotate the camera in three-dimensions to
a new direction, in terms of simpler transformations. In our case, we can
act on just two coordinates at a time:
\[
\underbrace{\begin{bmatrix} \cos\gamma & -\sin\gamma & 0 \\ \sin\gamma & \cos\gamma & 0 \\ 0 & 0 & 1 \end{bmatrix}}_{\text{yaw}}
\underbrace{\begin{bmatrix} \cos\beta & 0 & \sin\beta \\ 0 & 1 & 0 \\ -\sin\beta & 0 & \cos\beta \end{bmatrix}}_{\text{pitch}}
\underbrace{\begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\alpha & -\sin\alpha \\ 0 & \sin\alpha & \cos\alpha \end{bmatrix}}_{\text{roll}}
\]

Each matrix performs a rotation in either the x-y, x-z or y-z plane sepa-
rately.

Projections and Reflections Finally let’s talk about projections:

Definition 26. The projection of a point a onto a line ℓ is
\[ \mathrm{proj}_\ell(a) = \arg\min_{b \in \ell} \|a - b\| \]

In particular, a projection maps a to the closest point on the line ℓ. The


following picture is helpful to keep in mind:
Our strategy will be to understand projections by breaking them down into
simpler building blocks. First, what if we want to project a vector v onto
the x-axis? It is easy to see that
\[ P \begin{bmatrix} v_1 \\ v_2 \end{bmatrix} = \begin{bmatrix} v_1 \\ 0 \end{bmatrix} \]
I.e. the closest vector to v is obtained by zeroing out the y coordinate, so this
projection is given by the diagonal matrix
\[ D = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix} \]
With this as a building block, how can we project onto a general line ℓ that
makes an angle θ with the x-axis? We claim that it can be written as

P = Rθ DR−θ
This follows because if we apply the operation R−θ , we rotate the line ℓ
onto the x-axis and we rotate v along with it. At this point, we know the
projection onto ℓ can be implemented by multiplying by D. Finally we need
to rotate back to put the line ℓ back to where it should be. This last step can
be implemented by multiplying by Rθ . This is actually our first exposure
to a powerful decomposition, called the singular value decomposition. We
expressed a projection as the project of rotation matrices whose inverses are
easy to understand and a diagonal matrix D that is also easy to understand.

From this vantage point, it is easy to answer all our usual questions
about projections:
Question 4. What does a projection do to the length?

Coming back to the expression P = Rθ DR−θ , the matrices Rθ and R−θ


preserve the length, and applying D can never increase the length. Thus
we have:
Fact 14. For any projection matrix P , ∥P v∥ ≤ ∥v∥.

Again, we can visualize this geometrically using the picture from earlier.
Suppose we draw a circle with radius ∥a∥ we can see that the projection of
a onto the line ℓ is contained in the circle, and the only case it ends up on
the boundary is when a is already on the line. Finally we can ask:
Question 5. Is the projection onto ℓ invertible?


To answer this question, we will appeal to the following fact about in-
vertibility:
Fact 15. Let A and B be square matrices and suppose that A is invertible.
Then B is invertible if and only if AB is invertible.

We will prove this fact later. But already we can see from the expression
P = Rθ DR−θ that a projection is not invertible. This is true because D
is already in reduced row echelon form, but it does not have a pivot in the
last row. We can also see why invertibility fails geometrically: It is easy
to construct two points a and a′ that both project to the same point b.
Hence there is no way to undo the operation uniquely, and knowing the
projection is not enough information to determine the original point. This
is the same thing as saying that the linear system P x = b does not have a
unique solution.

Finally, if we want to reflect a across a line ℓ we can express the reflection


as 2(b−a)+a. This tells us that we can express it as a matrix vector product
as 2P − I. And again using the expression P = Rθ DR−θ and the fact that
I = Rθ R−θ we have that

2P − I = Rθ (2D − I)R−θ

Observing that 2D − I is in row echelon form and all its pivots are nonzero
we know that it is invertible. Now appealing to Fact 15 once more we
conclude that every reflection is invertible.
Chapter 4

Geometric Foundations

4.1 Vector Spaces


In this section, we will explore a key abstraction and how it helps us think
about matrices and operators in new ways:
Definition 27. A vector space over the reals is a set V with notions of

1. How to add: For all v, w ∈ V we have v + w ∈ V


2. How to scale: For all v ∈ V and any α ∈ R we have αv ∈ V

We have already seen some examples of vector spaces, such as V = Rd


which we think of as the set of all d-tuples of reals. But there are other
important examples of vector spaces, that don’t necessarily look like familiar
finite dimensional vectors. For example in Definition 25 we defined the set
of symmetric matrices, which are all matrices A so that A = AT . The set of
all d × d symmetric matrices is also a vector space, because when we have
two symmetric matrices A and B which are both d × d, their sum A + B is
also d × d and furthermore is symmetric since
(A + B)T = AT + B T = A + B
Moreover if A is symmetric then αA is too. Even when vector spaces don’t
look like familiar finite dimensional vectors, the tools of linear algebra can
still give us important insights.


We can also define the notion of a subspace:


Definition 28. If V is a vector space then S ⊆ V is a subspace if

1. For all v, w ∈ S we have v + w ∈ S


2. For all v ∈ S, and any α ∈ R we have αv ∈ S

Thus a subspace is just a vector space sitting inside another vector space.
We already know that R2 is a vector space. What are its subspaces? We
could choose S = R2 or S = {0}. A more interesting example is a line
passing through the origin. But note that a line that does not pass through
the origin is not a vector space. It would fail to satisfy the property that
αv ∈ S for α = 0. Just as before, we can also think about vector spaces of
matrices. The set of all d × d matrices is a vector space, in which case the
set of all d × d symmetric matrices is a subspace.

A key property of vector spaces is that you can combine them in different
ways to get another vector space:
Lemma 5. If S1 and S2 are subspaces then S = S1 ∩ S2 is too.

Proof. We just need to verify the properties of being a subspace:

1. Addition: If we are given v, w ∈ S = S1 ∩S2 then consider v+w. Since


S1 is a subspace and v, w ∈ S1 we have that v + w ∈ S1 . Similarly
v + w ∈ S2 . Thus v + w ∈ S1 ∩ S2 = S, which proves that S is closed
under addition.
2. Scaling: If we are given v ∈ S1 ∩S2 and α ∈ R then consider αv. Since
S1 is a subspace we have that αv ∈ S1 . Similarly αv ∈ S2 . Thus, as
before, we conclude αv ∈ S1 ∩ S2 = S.

This completes the proof.

The main goal of this section is to define certain fundamental vector


spaces associated with matrices, and use them to get new geometric insights.

The Columnspace We start with a simple but powerful definition. Re-


call in Definition 7 we defined the notion of a linear combination of a set of
vectors. First we define the set of all linear combinations:

Definition 29. The span of a collection of vectors a1 , a2 , · · · , an is the set


of all linear combinations of them, i.e.

span(a1 , a2 , · · · , an ) = {α1 a1 + α2 a2 + · · · + αn an }

where the αi ’s are reals.

Definition 30. The columnspace C(A) of an m × n matrix A is the span


of its n columns.

It is easy to see that the columnspace C(A) is a vector space. This


is true because if we have v, w ∈ C(A) then by definition it means that
there are vectors x and y of dimension n so that Ax = v and Ay = w.
This follows from our interpretation of the matrix-vector product as taking
a linear combination of the columns of A. Now how can we show that
v + w ∈ C(A)? It is easy to check that

A(x + y) = v + w

Thus the vector z = x + y gives us a way to represent v + w as a linear


combination of the columns of A, and hence v + w ∈ C(A) too. The
argument for scaling works similarly: Given Ax = v we have that αv =
A(αx) and hence v ∈ C(A) implies that αv ∈ C(A) too.

Actually we have been talking about columnspaces implicitly for a while


now. In Section 1.2 we argued that the linear system
\[ \begin{bmatrix} 2 & -3 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \end{bmatrix} \]

has a solution for any choice of b1 and b2 . Another way to interpret this
linear system is it’s asking if the vector on the right hand side can be
written as a linear combination of the columns of A. What we argued,
implicitly, was that C(A) = R2 . More generally we have:

Fact 16. The linear system Ax = b has a solution if and only if b ∈ C(A).

Recall that in Fact 5 we asserted that the right and left inverses of square
matrix A are the same, provided A is non-singular. Now here is a powerful
way to connect invertibility to vector spaces:

Lemma 6. Suppose A is an n × n matrix. Then A is invertible if and only


if C(A) = Rn .

Proof. This is one of the most important results in this section, so it is


essential to understand the proof. First let’s prove the forwards direction,
where we assume that A is invertible and want to conclude that C(A) = Rn .
By Fact 16 we want to show that b ∈ C(A) for any b ∈ Rn . So for a given
b, what coefficients should we use to express b as a linear combination of
the columns of A? We can try x = A−1 b which implies

Ax = (AA−1 )b = Ib = b

Thus the x we have guessed implies b ∈ C(A).

Second, let’s prove the reverse direction. What if C(A) = Rn ? How can
we find the inverse of A? Recall in Definition 4 we defined the standard
basis vectors e1 , e2 , · · · , en . Now if e1 ∈ C(A) there must be some x1 that
satisfies Ax1 = e1 , and so on. Now consider
\[ A \begin{bmatrix} x_1 & x_2 & \cdots & x_n \end{bmatrix} = \begin{bmatrix} e_1 & e_2 & \cdots & e_n \end{bmatrix} = I \]

Thus we conclude that
\[ \begin{bmatrix} x_1 & x_2 & \cdots & x_n \end{bmatrix} = A^{-1} \]

We have found the inverse of A just by expressing each of the standard


basis vectors as a linear combination of the columns of A. This completes
the second part of the proof.

As an aside, now that we have many linear algebraic concepts under


our belt, it is helpful to go back to earlier examples and understand old
tools (like Gaussian elimination and matrix inverses) in our new language.
Linear algebra is interconnected, and understanding how to go back and
forth between different ways of thinking about a problem is important.

The Nullspace Now let’s talk about our second fundamental subspace:
Definition 31. The nullspace N (A) of a matrix A is

N (A) = {x|Ax = 0}

Thus it is the set of all ways of writing the all zero vector as a linear
combination of the columns of A. Note that if A is an m × n matrix then
its column space C(A) is a subspace of Rm while its nullspace N (A) is a
subspace of Rn . Again it is easy to check that N (A) really is a subspace: If
x, y ∈ N (A) then x+y satisfies A(x+y) = Ax+Ay = 0 and so x+y ∈ N (A).
Moreover for any x ∈ N (A) and any α ∈ R we have Aαx = αAx = 0 and
thus αx ∈ N (A) too. As a caution note that the set S = {x|Ax = b} for
b ̸= 0 is not a subspace. In particular the all zero vector is not in S. It is
sometimes called an affine subspace. As we will see, the columnspace and
the nullspace are very interrelated.

Let’s revisit another example from Section 1.2 with our new abstraction.
We studied the linear system
\[ \begin{bmatrix} 2 & -3 & 0 \\ 1 & 1 & 2 \\ 3 & -2 & 2 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix} \]
and argued that there are some choices of b1 , b2 and b3 for which it has no
solution. We can use Gaussian elimination to put it in row echelon form
and write
\[ A = B \begin{bmatrix} 1 & 0 & \tfrac{6}{5} \\ 0 & 1 & \tfrac{4}{5} \\ 0 & 0 & 0 \end{bmatrix} \]
where A is our original constraint matrix and B keeps track of how to undo
the operations we performed in the process of doing Gaussian elimination.
Let A′ be the row echelon form. Can we use this expression to find a nonzero
vector x ∈ N (A)?

First let's find a nonzero x that satisfies A′ x = 0. We can write
\[ \begin{bmatrix} 1 & 0 & \tfrac{6}{5} \\ 0 & 1 & \tfrac{4}{5} \\ 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} x + \tfrac{6}{5}z \\ y + \tfrac{4}{5}z \\ 0 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix} \]

And setting x = 1, z = −5/6 and y = 2/3 will work, as will any scalar multiple


of it. But now because A = BA′ we have that A′ x = 0 implies Ax = 0
too. We have not only found a nonzero vector in N (A) but rather a full
description of N (A). What we have used implicitly in this argument is the
following important fact:

Fact 17. If B is square and invertible and A = BA′ then N (A′ ) = N (A).

Proof. In one direction, if we have x ∈ N (A′ ), i.e. A′ x = 0 then the


expression A = BA′ implies Ax = 0 too. Hence x ∈ N (A). In the other
direction, since B is invertible we can write B −1 A = A′ . And now the
argument proceeds as before, since x ∈ N (A), meaning Ax = 0, implies
B −1 Ax = A′ x = 0 too.

As a word of caution, this fact is only true when B is invertible.

Fact 18. For any B, not necessarily invertible, if A = BA′ then N (A′ ) ⊆
N (A).

Without invertibility we can only prove containment in one direction.


For example, if B is the all zero matrix then clearly N (A) can strictly
contain N (A′ ). It is easy to get confused about the direction of the con-
tainment, so you should convince yourself that the proof in Fact 17 implies
N (A′ ) ⊆ N (A) and not the other way around. And you can always check
yourself by making sure that the direction you think the containment goes
in matches what happens in some trivial cases, like when B = 0.

Returning to the earlier thread, the matrices B that arise in Gaussian


elimination are invertible. Thus Gaussian elimination gives us a way to find
the nullspace because we can find a matrix A′ in row echelon form with the
same nullspace, and computing the nullspace of a matrix in row echelon
form is straightforward.
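In Julia this whole recipe is a single call (a sketch, reusing the 3 × 3 example from above):

```julia
using LinearAlgebra

A = [2.0 -3.0 0.0; 1.0 1.0 2.0; 3.0 -2.0 2.0]

Z = nullspace(A)          # columns form an (orthonormal) basis of N(A)
@show size(Z, 2)          # 1: the nullspace is a line
@show norm(A * Z)         # ≈ 0

# The hand-computed vector (1, 2/3, -5/6) is a scalar multiple of that basis vector.
x = [1.0, 2/3, -5/6]
@show norm(A * x)         # ≈ 0
```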

Finally, let’s connect the columnspace and nullspace. We claim the fact
that A in the example above is square, and that it has a nonzero vector in
its nullspace implies that C(A) ̸= Rn . Why is this? In our example, we
can consider the subspaces S1 , S2 and S3 we get from taking the span of the

first column of A, the span of the first two columns of A, and the span of
all three columns of A. In our example we have that
\[ S_1 = \mathrm{span}\left\{ \begin{bmatrix} 2 \\ 1 \\ 3 \end{bmatrix} \right\} \]

is a line. And
\[ S_2 = \mathrm{span}\left\{ \begin{bmatrix} 2 \\ 1 \\ 3 \end{bmatrix}, \begin{bmatrix} -3 \\ 1 \\ -2 \end{bmatrix} \right\} \]
is a plane. But what happens when we add the third column in too? We
are now looking at linear combinations of the form
\[ \alpha_1 \begin{bmatrix} 2 \\ 1 \\ 3 \end{bmatrix} + \alpha_2 \begin{bmatrix} -3 \\ 1 \\ -2 \end{bmatrix} + \alpha_3 \begin{bmatrix} 0 \\ 2 \\ 2 \end{bmatrix} \]

But we claim we still just get a plane. From the nonzero vector in the
nullspace that we have found, we find that
\[ \frac{5}{6} \begin{bmatrix} 0 \\ 2 \\ 2 \end{bmatrix} = \begin{bmatrix} 2 \\ 1 \\ 3 \end{bmatrix} + \frac{2}{3} \begin{bmatrix} -3 \\ 1 \\ -2 \end{bmatrix} \]

Hence we can rewrite the general linear combination of the three vectors as
\[ \left( \alpha_1 + \frac{6}{5}\alpha_3 \right) \begin{bmatrix} 2 \\ 1 \\ 3 \end{bmatrix} + \left( \alpha_2 + \frac{4}{5}\alpha_3 \right) \begin{bmatrix} -3 \\ 1 \\ -2 \end{bmatrix} \]

We conclude with the following lemma, which explains what is happening


in this example in more generality:

Lemma 7. If A is n × n then C(A) = Rn if and only if N (A) = {0}.

Proof. The example above already showed how to go in one direction, but
we can state the argument more generally. If we have a nonzero vector
z ∈ N (A) then when we put A in row echelon form we must get a row of

all zeros. But this means that A is singular and by Lemma 6 we have that
C(A) ̸= Rn .

Alternatively, if N (A) = {0} when we put A in row echelon form we


must have all nonzero pivots. This means that A is invertible and again by
Lemma 6 we have that C(A) = Rn .

4.2 Application: Polynomial Interpolation


and Overfitting
In this section, we will consider the problem of finding a polynomial that
passes through a given collection of data points (x1 , y1 ), (x2 , y2 ), · · · , (xn , yn ).

This is called polynomial interpolation. To be more precise, let
\[ p(x) = \sum_{k=0}^{d} p_k x^k \]

be a degree d polynomial. We want that p(xi ) = yi for all i. There are many
basic questions we could ask: Is there always a solution? And if so, how

small can we choose the degree to be? And for a given degree, is there more
than one polynomial that interpolates the data? These questions are all
equivalent to understanding the fundamental vector spaces associated
with a particular matrix.
Definition 32. The Vandermonde matrix is an n × (d + 1) matrix
\[ V = \begin{bmatrix} 1 & x_1 & x_1^2 & \cdots & x_1^d \\ 1 & x_2 & x_2^2 & \cdots & x_2^d \\ \vdots & \vdots & \vdots & \cdots & \vdots \\ 1 & x_n & x_n^2 & \cdots & x_n^d \end{bmatrix} \]

Now if we abuse notation and let p be a d + 1 dimensional vector whose


first coordinate is p0 , whose second coordinate is p1 , and so on, finding an
interpolating polynomial is equivalent to solving the linear system

Vp=y

where y is an n dimensional vector whose coordinates are the yi ’s. It turns


out that there is an explicit solution called Lagrange interpolation where we let
\[ p(x) = \sum_{i=1}^{n} y_i L_i(x) \]
where we set
\[ L_i(x) = \prod_{1 \le j \le n,\; j \ne i} \frac{x - x_j}{x_i - x_j} \]

This definition only makes sense if all the xi 's are distinct. The key obser-
vation is that Li (xj ) is equal to 1 if j = i and otherwise is equal to 0. From
this fact we can check that
\[ p(x_j) = \sum_{i=1}^{n} y_i L_i(x_j) = \sum_{i=1}^{n} y_i \mathbf{1}_{i=j} = y_j \]

The degree of p is at most n − 1. Can we understand why Lagrange inter-


polation works, from the vantage point of linear algebra?
Lemma 8. The square Vandermonde matrix is invertible if and only if all
of the xi ’s are distinct.

Proof. First observe that if xi = xj for any i ̸= j then there is a repeated


row in the Vandermonde matrix in which case we will get a row of all zeros
when we perform Gaussian elimination, and the columnspace cannot be
all of Rn . If instead the xi 's are distinct, Lagrange interpolation succeeds.
In particular let qi be the d + 1 dimensional vector that represents the
polynomial Li . In particular, its first coordinate is the constant term of Li ,
and its second term is the coefficient on the linear term, and so on. Now the
fact that Li (xj ) is equal to one iff j = i and otherwise is zero is equivalent
to the statement that V qi = ei where ei is the standard basis vector. Thus
Lagrange interpolation works because it is an explicit construction of the
inverse when n = d + 1 and we have
\[ V \begin{bmatrix} q_1 & q_2 & \cdots & q_n \end{bmatrix} = I \]

This completes the proof.
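Here is a sketch of interpolation via the Vandermonde matrix in Julia (the data points are made up, with distinct xᵢ):

```julia
using LinearAlgebra

xs = [0.0, 1.0, 2.0, 3.0]      # distinct interpolation points (made up)
ys = [1.0, 2.0, 0.0, 5.0]      # values to interpolate (made up)
n  = length(xs)
d  = n - 1

V = [xs[i]^k for i in 1:n, k in 0:d]   # the n × (d+1) Vandermonde matrix
p = V \ ys                              # coefficients p₀, p₁, ..., p_d

poly(t) = sum(p[k+1] * t^k for k in 0:d)
@show [poly(x) for x in xs] ≈ ys        # true: the polynomial passes through the data
```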

Thus choosing d = n − 1 always works, provided the xi ’s are distinct.


And if we choose d any smaller the Vandermonde matrix would not be
square and its columnspace cannot be all of Rn , so you will not be able
to interpolate in general. What happens if we choose d > n − 1? The
trouble is that N (V ) must contain nonzero vectors. Let z be such a vector
and let q(x) be the associated polynomial. Then if we have any polynomial
p(x) that interpolates our data, the polynomial p(x) + q(x) will too because
V z = 0 means that q(xi ) = 0 for all i. This is called overfitting.

We can not only make our polynomial interpolate our data, but if we
have any other data point x0 by choosing the coefficient α appropriately
we can make p(x) + αq(x) take on any value we want on x0 . So when
you are overfitting, i.e. when the nullspace contains nonzero vectors, there
is no guarantee that what you’ve found extrapolates beyond your original
collection of data!
Chapter 5

Linear Independence and


Bases

Given a matrix A ∈ Rn×m , we have defined two natural subspaces associated


with it, namely the nullspace
N (A) = {x ∈ Rm : Ax = 0}
and the column space
C(A) = {y ∈ Rn : y = Az, z ∈ Rm }.
Intuitively, they tell us complementary things about A. However, notice
that if A is not square (i.e., n ̸= m), then the nullspace and the column
space live in different dimensions – the column space is a subspace of Rn
and the nullspace is a subspace of Rm .
Remark 1. Another name, that you might see elsewhere, for the nullspace
N (A) is the kernel of A. Similarly, the column space C(A) is sometimes
called the range of A.

When we introduced subspaces in Section 4.1 we saw that they are


invariants of Gaussian elimination. In particular, the steps of Gaussian
elimination, which add a scalar multiple of one row to another row, or swap
rows, leave the nullspace unchanged (the columnspace itself can change, but
its dimension is preserved). Eventually
we will want to be able to compare subspaces. Consider the following
examples:


The first subspace is just the zero vector. The second subspace is a
line. The third subspace is the entire plane. Clearly the last subspace is
the biggest, because it strictly contains the others. But what if we have
two subspaces, neither of which contains the other? Consider the following
example:

The first subspace is a line. The second subspace is a plane, but it does
not contain the first subspace. Still we would like to say that the second
subspace is bigger. How can we make this intuitive notion precise?

5.1 Linear Independence


Recall that in earlier sections we defined the notion of linear independence
of a set of vectors – no vector can be written as a linear combination of the
others (Definition 9). Linear independence can be stated in a few equivalent

ways, some of which are more convenient than others at times. Let’s give
another equivalent definition:
Definition 33 (Linear independence). A set of vectors {v1 , . . . , vk } ⊂ Rn
is linearly independent (LI) if whenever we have a linear combination equal
to zero, then necessarily all the scalars are zero. Equivalently,
λ1 v1 + · · · + λk vk = 0 ⇒ λ1 = 0, . . . , λk = 0.
Definition 34 (Linear dependence). A set of vectors {v1 , . . . , vk } ⊂ Rn is
linearly dependent (LD) if they are not linearly independent.

Let’s see some examples first:


Example 1. Consider the vectors in R2 :
\[ \begin{bmatrix} 2 \\ 1 \end{bmatrix}, \quad \begin{bmatrix} 1 \\ 0 \end{bmatrix}. \]
Let’s form a generic linear combination with scalars λ1 , λ2 , and set it equal
to zero. We have then
\[ \lambda_1 \begin{bmatrix} 2 \\ 1 \end{bmatrix} + \lambda_2 \begin{bmatrix} 1 \\ 0 \end{bmatrix} = \begin{bmatrix} 2\lambda_1 + \lambda_2 \\ \lambda_1 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}. \]
Solving these equations, we have λ1 = λ2 = 0. Thus any linear combina-
tion of these two vectors that results in the zero vector necessarily has zero
coefficients, and so we conclude that the vectors are LI.
Example 2. Consider the following three vectors in R2 :
\[ \begin{bmatrix} 2 \\ 1 \end{bmatrix}, \quad \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \quad \begin{bmatrix} -1 \\ -2 \end{bmatrix}. \]
As before, we consider a generic linear combination and set it equal to zero.
We have then
\[ \lambda_1 \begin{bmatrix} 2 \\ 1 \end{bmatrix} + \lambda_2 \begin{bmatrix} 1 \\ 0 \end{bmatrix} + \lambda_3 \begin{bmatrix} -1 \\ -2 \end{bmatrix} = \begin{bmatrix} 2\lambda_1 + \lambda_2 - \lambda_3 \\ \lambda_1 - 2\lambda_3 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}. \]
Now there are nonzero solutions – for instance, we can choose λ1 = 2,
λ2 = −3, λ3 = 1, and thus the vectors are linearly dependent.

(We can also verify this “directly”, since the third vector is three times
the second vector minus twice the first one.)

Exercise 1. The notions of linear independence and linear dependence are


not quite symmetric. In particular:

1. Show that if {v1 , . . . , vk } are linearly independent, then any subset of


the vectors is also linearly independent.

2. Does a similar statement hold for linear dependence? Prove this, or


give a counterexample.

We focus now on a “matrix” way of understanding linear independence.


Consider as before the vectors {v1 , . . . , vk } ∈ Rn and the equation

λ1 v1 + · · · + λk vk = 0.

Putting all the vectors as columns of a matrix, we can write this as
\[ \underbrace{\begin{bmatrix} v_1 & v_2 & \cdots & v_k \end{bmatrix}}_{A} \begin{bmatrix} \lambda_1 \\ \lambda_2 \\ \vdots \\ \lambda_k \end{bmatrix} = 0 \]

or simply Aλ = 0, where (careful with dimensions!), A ∈ Rn×k and λ ∈ Rk .

Notice that the linear system Aλ = 0 always has a solution that comes
from setting λ = 0. The condition that the vectors are linearly independent
stipulates that this is the only solution. We can now nicely put together
some of these notions:

Lemma 9. Let A = [v1 · · · vk ]. The following statements are equivalent:

1. The vectors {v1 , . . . , vk } are linearly independent.

2. The linear system Aλ = 0 has a unique solution.

3. The nullspace of A is the zero vector, i.e., N (A) = {0}.



How do we determine computationally if a set of vectors is linearly


independent? The different characterizations in Lemma 9 make it easy,
whether you are doing it “by hand” using Gaussian elimination, or on a
computer (e.g., with Julia). Recall that the nullspace of a matrix does not
change when you premultiply it by a nonsingular matrix (Fact 17).
As we have seen, Gaussian elimination is equivalent to premultiplication by
an invertible matrix (a product of elementary matrices and permutations).

Since finding the nullspace of a matrix in row echelon form is straightfor-


ward from backsubstitution, we now have a recipe for determining whether
a set of vectors is linearly independent, and if not, exhibiting a linear depen-
dence. We concatenate the vectors into a matrix A and put it in row echelon
form. The vectors are linearly independent if and only if N (A) = {0}. If
the nullspace contains a nonzero vector, its coefficients explicitly give us a
linear dependence among the columns. Let’s see this in an example.
Example 3. Consider again the vectors of Example 2, and form the cor-
responding matrix
\[ A = \begin{bmatrix} 2 & 1 & -1 \\ 1 & 0 & -2 \end{bmatrix}. \]
We can compute its reduced row echelon form R = rref(A) (either by hand, or
using Julia), to obtain
\[ R = \begin{bmatrix} 1 & 0 & -2 \\ 0 & 1 & 3 \end{bmatrix}. \]
(In fact, we have the factorization R = T A, where \( T = \begin{bmatrix} 0 & 1 \\ 1 & -2 \end{bmatrix} \) is invertible.)

We have then
\[ N(A) = N(TA) = N(R) = \left\{ \begin{bmatrix} 2\alpha \\ -3\alpha \\ \alpha \end{bmatrix} : \alpha \in \mathbb{R} \right\}. \]
 

Since the nullspace is nonzero, the vectors are indeed linearly dependent.
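The same check in Julia, directly from the characterizations in Lemma 9 (a sketch):

```julia
using LinearAlgebra

A = [2.0 1.0 -1.0; 1.0 0.0 -2.0]   # columns are the three vectors of Example 2

@show rank(A)          # 2, which is less than the number of columns
Z = nullspace(A)       # a nonzero nullspace means the columns are linearly dependent
@show Z
@show norm(A * Z)      # ≈ 0: each column of Z gives a linear dependence
```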

5.2 Generators and Bases


Next we will study how to build up and represent subspaces.

Definition 35 (Generators). Let S be a subspace. The vectors {v1 , . . . , vk } ⊂


S are generators of S (or generate, or span S) if every vector v ∈ S can
be written as
v = λ1 v1 + · · · + λk vk ,
for some scalars λ1 , . . . , λk .

If the vi generate S, we write S = ⟨v1 , . . . , vk ⟩, or S = span{v1 , . . . , vk }.

Let's see some examples showing that being defined by a large number of gen-
erators does not necessarily imply that a subspace is big, at least without
some further conditions.

Example 4. We have
\[ \mathrm{span}\left\{ \begin{bmatrix} 2 \\ 2 \end{bmatrix}, \begin{bmatrix} -1 \\ 3 \end{bmatrix}, \begin{bmatrix} -3 \\ 1 \end{bmatrix} \right\} = \mathbb{R}^2, \]

i.e., the whole two-dimensional plane.

On the other hand,
\[ \mathrm{span}\left\{ \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \begin{bmatrix} -1 \\ -1 \end{bmatrix}, \begin{bmatrix} 2 \\ 2 \end{bmatrix} \right\} = \left\{ \begin{bmatrix} \alpha \\ \alpha \end{bmatrix} : \alpha \in \mathbb{R} \right\}, \]

which is just a line.

Spans of Functions Suppose we have a collection of simple functions


f1 (t), . . . , fk (t). Using these as building blocks, what sorts of more complex
functions can we express? In particular, consider their span, i.e. the set of
all functions we can obtain as linear combinations of them.
\[ f(t) = \sum_{i=1}^{k} \lambda_i f_i(t) = \lambda_1 f_1(t) + \cdots + \lambda_k f_k(t). \]

For instance, if fi (t) = ti , then the elements of this subspace are the uni-
variate polynomials of degree at most k.

Applications in Robotics and Control Consider a linear dynamical


system
xt+1 = Axt + But
Here xt is a d-dimensional vector representing the state of the system and
A is a d × d matrix that represents how the system evolves. Additionally
we can apply external control, in the form of choosing a vector ut at time
t to move the system in a different direction.

A natural question in this context is “what are all the possible output
trajectories we can obtain?” This is known as the reachable subspace of
a linear dynamical system. We are often interested in understanding how
large this subspace is – in particular, it may allow us to understand which
trajectories are possible.
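As a sketch of how the tools of this chapter apply (the system matrices below are made up, and we only look at states reachable from x₀ = 0): unrolling the recursion x_{t+1} = A x_t + B u_t shows that every reachable state is a linear combination of the columns of B, AB, A²B, and so on, so the dimension of their span measures how large the reachable subspace is.

```julia
using LinearAlgebra

# Made-up example system (an illustrative assumption, not from the text).
A = [1.0 1.0 0.0; 0.0 1.0 0.0; 0.0 0.0 0.5]
B = reshape([0.0, 1.0, 0.0], 3, 1)

# Columns of B, AB, A²B generate the states reachable from x₀ = 0 in three steps.
K = hcat(B, A * B, A * A * B)
@show rank(K)     # dimension of the subspace those columns span (here 2)
```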

5.3 Bases
Generators are great, but sometimes they can be a bit “wasteful.”

Example 5. Consider the 17 vectors v1 , . . . , v17 in the figure. We clearly


have span{v1 , . . . , v17 } = R2 , i.e., they generate the whole plane. However,
we can give a much smaller set (how big?) that also spans R2 .

In general, if one vector is a linear combination of others, then the span


does not change when we remove it. More formally:

Proposition 8. Let w be a linear combination of v1 , . . . , vk . Then,

span{v1 , . . . , vk , w} = span{v1 , . . . , vk }.

This motivates the following important definition

Definition 36 (Basis). A set of vectors {v1 , . . . , vk } is a basis of the sub-


space S, if they generate S and they are linearly independent.

The simplest example is the “canonical” basis of Rn :



Example 6. Recall that the vectors e1 , . . . , en are the columns of the n × n


identity matrix. Let’s show that {e1 , . . . , en } form a basis of Rn .

To do this, we need to prove two things:

• They generate Rn . This is easy, since every vector x = (x1 , . . . , xn )
can be written as
\[ x = x_1 e_1 + \cdots + x_n e_n. \]

• The ei are linearly independent, since no ej is a linear combination
of the others.

Example 7. A few more examples of bases and not-bases.


This is nice. But, is it possible to have a subspace, with a basis with 3


elements, and another basis with 8? Can this actually happen?

Fortunately, this is not the case!

Proposition 9. All bases of a subspace S have the same cardinality (num-


ber of vectors).

This fact justifies the following key definition:

Definition 37 (Dimension). The dimension of a subspace is the cardinality


of any basis.

Remark 2. From Example 6, we easily conclude that the dimension of Rn


is equal to n.

5.4 Describing subspaces


Given a subspace S, how do we present it in a computationally convenient way?
It turns out that there are two “natural” descriptions of subspaces:

• Equations: a finite list of constraints (equations) that the vectors in our
subspace must satisfy. For instance, the nullspace description
N (A) = {x : Ax = 0}.

• Generators: a finite set of vectors that span S. For instance the column
space description C(B) = span{b1 , . . . , bn }, where B = [b1 · · · bn ].

Depending on what we want to do, one kind of description may be more


advantageous than the other. For instance, when computing the intersec-
tion of two subspaces, it is easier if both subspaces are in equational form,
since then we can just put together all the equations. Therefore, a useful
skill is to convert one kind of description into the other.
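Here is a sketch in Julia of going back and forth between the two descriptions. The generators-to-equations direction uses the fact, discussed in Chapter 6, that C(B) and N(Bᵀ) are orthogonal complements; the matrices are made up:

```julia
using LinearAlgebra

# Generators -> equations: C(B) = {x : W * x = 0}, where the rows of W span N(Bᵀ).
B = [1.0 0.0; 0.0 1.0; 1.0 1.0]        # C(B) is a plane in R³
W = nullspace(Matrix(B'))'             # one equation in this example
@show norm(W * B)                      # ≈ 0: every column of B satisfies the equation

# Equations -> generators: a basis of {x : A * x = 0} comes straight from nullspace.
A = [1.0 1.0 -1.0]                     # the single equation x₁ + x₂ - x₃ = 0
@show nullspace(A)                     # two generators spanning that plane
```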
Chapter 6

The Singular Value


Decomposition

6.1 A Unifying View


In this section, we will introduce arguably the most powerful tool in lin-
ear algebra: The singular value decomposition gives us a way to express
an arbitrary linear transformation as the composition of simpler, easier to
understand pieces.

Proposition 10. Any n × m matrix A can be factorized as A = U ΣV T


where

1. U and V are n×n and m×m matrices and have orthonormal columns

2. and Σ is an n × m matrix whose off-diagonal entries are zero and


whose diagonal entries are nonnegative.

In some applications, it is more convenient to think about the singular


value decomposition in an alternative form. First let
\[ \Sigma = \begin{bmatrix} \sigma_1 & 0 & \cdots & 0 \\ 0 & \sigma_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \end{bmatrix} \]


Suppose the entries on the diagonal are arranged in non-increasing order so


that
σ1 ≥ σ2 ≥ · · · ≥ σr > σr+1 = σr+2 = · · · = 0
Now the singular value decomposition can also be written as
\[ A = \sum_{i=1}^{r} \sigma_i u_i v_i^T \]

Here the σi ’s are called the singular values. The ui ’s and vi ’s are the columns
of U and V and they are called the left and right singular vectors respec-
tively. We will spend the remainder of this section digesting what this
decomposition means and why it is so powerful. Our first goal is to see how
it unifies our understanding of linear algebra thus far:

Question 6. How can we read off important properties of A from its sin-
gular value decomposition?

The Rank, Column Space and Nullspace First we will show:

Lemma 10. The rank of A is equal to the number of nonzero singular


values.

This lemma follows from an even more general one:

Lemma 11. The vectors u1 , u2 , · · · , ur are an orthonormal basis for C(A).

Proof. Let’s get some intuition and the proof will follow. Since u1 is in the
column space of A, we should be able to find a vector x so that Ax is in the
direction of u1 , i.e. Ax = cu1 for some non-zero scalar c. How should we
choose such an x? To answer this question, it is convenient to work with
the alternative expression for the singular value decomposition, which tells
us that
Ax = σ1 u1 v1T x + σ2 u2 v2T x + · · · + σr ur vrT x
We want all the terms in this expression, except for the first one, to be zero.
Let’s choose x = v1 . This gives

Av1 = σ1 u1 v1T v1 + σ2 u2 v2T v1 + · · · + σr ur vrT v1



Since the vi 's are orthonormal, we know that
\[ v_2^T v_1 = v_3^T v_1 = \cdots = 0 \]

And moreover v1T v1 = 1. And so we have Av1 = σ1 u1 , as desired. Of course


nothing about this argument is particular to u1 , and for any i we have that
Avi = σi ui . This proves the first part, that for any i, ui ∈ C(A).

Conversely, consider Ax for an arbitrary x. We will show that Ax can


be expressed as a linear combination of the ui ’s. Again, this follows imme-
diately from the singular value decomposition:

\[ Ax = (\sigma_1 v_1^T x)u_1 + (\sigma_2 v_2^T x)u_2 + \cdots + (\sigma_r v_r^T x)u_r \]

Here we have rearranged the expression to make it clearer that, despite


looking complicated, it’s still just a linear combination of the ui ’s since the
viT x’s are just scalars. Together these two calculations tell us that

C(A) = span(u1 , u2 , · · · , ur )

and since the ui ’s are orthonormal, this completes the proof.

Lemma 10 now follows as a corollary from Lemma 11 since the rank of


A is equal to the dimension of its column space. Similarly, we can read off
properties of the nullspace.

Lemma 12. The vectors vr+1 , vr+2 , · · · , vm are an orthonormal basis for
N (A).

Proof. The proof follows along similar lines as Lemma 11. If we choose
x = vj for j ≥ r + 1 we have

Avj = σ1 u1 v1T vj + σ2 u2 v2T vj + · · · + σr ur vrT vj

and because the vi ’s are orthonormal, all the terms are zero. Thus Avj = 0
for j ≥ r + 1. Equivalently vj ∈ N (A). Conversely, consider some x such
that Ax = 0 and let

\[ x = \alpha_1 v_1 + \alpha_2 v_2 + \cdots + \alpha_m v_m \]

We want to show that α1 must be zero. What if it were not zero? How
could we contradict the assumption that Ax = 0? Consider
\[ u_1^T A x = \sigma_1 u_1^T u_1 v_1^T x + \sigma_2 u_1^T u_2 v_2^T x + \cdots + \sigma_r u_1^T u_r v_r^T x = \sigma_1 v_1^T x \]
since the ui 's are orthonormal. Now plugging in the expression for x we have
\[ u_1^T A x = \sigma_1 v_1^T x = \sigma_1 \alpha_1 v_1^T v_1 + \sigma_1 \alpha_2 v_1^T v_2 + \cdots + \sigma_1 \alpha_m v_1^T v_m = \sigma_1 \alpha_1 \]
But Ax = 0, so the left hand side is zero, and since σ1 > 0 we conclude that
α1 = 0 as desired. Of course, nothing about this
argument is particular to v1 and for any vj with σj ̸= 0 we must have that
αj = 0. And so
\[ x = \alpha_{r+1} v_{r+1} + \alpha_{r+2} v_{r+2} + \cdots + \alpha_m v_m \]
which shows that Ax = 0 implies that x ∈ span(vr+1 , vr+2 , · · · , vm ). This
completes the proof.

Not only can we read off the rank and related properties of A immediately,
the singular value decomposition also contains powerful
theorems as a corollary. Recall:
Theorem 1. For any n × m matrix A, we have
rank(A) + dim(N (A)) = m

How can we prove this using the singular value decomposition? Recall
that we are using r to denote the number of nonzero singular values. From
Lemma 10 we have
rank(A) = r
and from Lemma 12 we have
dim(N (A)) = m − r
thus the Rank-Nullity theorem follows!

Actually this hints at another way to prove Lemma 12 that avoids doing
so much algebra. We will use the fact that C(AT ) and N (A) are orthogonal
complements of each other instead. First applying Lemma 11 to AT , we
have that v1 , v2 , · · · , vr is an orthonormal basis for C(AT ). Finally
N (A) = (C(AT ))⊥ = (span(v1 , v2 , · · · , vr ))⊥ = span(vr+1 , vr+2 , · · · , vm )
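These read-offs are easy to see numerically. A sketch in Julia, reusing the singular 3 × 3 matrix from Chapter 4:

```julia
using LinearAlgebra

A = [2.0 -3.0 0.0; 1.0 1.0 2.0; 3.0 -2.0 2.0]
F = svd(A)
@show F.S                              # singular values; exactly two are (numerically) nonzero

r = count(s -> s > 1e-10 * F.S[1], F.S)
@show r == rank(A)                     # number of nonzero singular values is the rank

U1 = F.U[:, 1:r]                       # u₁, ..., u_r: orthonormal basis of C(A)
@show norm(A - U1 * (U1' * A))         # ≈ 0: projecting the columns onto span(U1) changes nothing

Vn = F.V[:, r+1:end]                   # v_{r+1}, ..., v_m: orthonormal basis of N(A)
@show norm(A * Vn)                     # ≈ 0
```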

Inverses and Pseudoinverses Let’s take the idea of using the singular
value decomposition to unify what we know one step further. In Section 2.2
we saw how to compute the inverse of a matrix using Gauss-Jordan elim-
ination. It turns out that we can compute A−1 directly from the singular
value decomposition instead.
Proposition 11. Suppose A is n × n and nonsingular and has singular
value decomposition A = U ΣV T . Then

A−1 = V Σ−1 U T

Then Σ−1 is an n × n matrix whose off-diagonal entries are zero, and


whose diagonal entries are σ1−1 , σ2−1 , · · · σn−1 . All these values are defined be-
cause, using Lemma 10, A being nonsingular is equivalent to all its singular
values being strictly positive. Caution: This is only true when A is square!

The expression above is quite natural. It comes from using our rules for
computing the inverses of products of matrices:

A−1 = (U ΣV T )−1 = (V T )−1 Σ−1 U −1 = V Σ−1 U T

Now, continuing to the proof:

Proof. We can prove this proposition by direct computation:

(V Σ−1 U T )A = V Σ−1 U T U ΣV T
= V Σ−1 ΣV T
= VVT =I

where we have used the fact that U and V are orthogonal, and hence their
inverse is their transpose.

In fact, even when A is not invertible (or even not square!), in lieu of
computing the inverse, we can still do the next best thing:
Definition 38. The pseudoinverse of A, denoted by A+ , is
\[ A^+ = \sum_{i=1}^{r} \sigma_i^{-1} v_i u_i^T \]

When A is nonsingular, it is easy to see that the inverse and the pseu-
doinverse are the same. But when A is singular, how can we think about
the pseudoinverse? Again, let's proceed by direct computation:
\[
AA^+ = \left( \sum_{i=1}^{r} \sigma_i u_i v_i^T \right) \left( \sum_{i=1}^{r} \sigma_i^{-1} v_i u_i^T \right)
= \sum_{i=1}^{r} \sum_{j=1}^{r} \sigma_i \sigma_j^{-1} u_i v_i^T v_j u_j^T
= \sum_{i=1}^{r} \sigma_i \sigma_i^{-1} u_i u_i^T = \sum_{i=1}^{r} u_i u_i^T
\]

The last line follows because the vj ’s are orthonormal, so all the cross terms
where j ̸= i are zero. From this expression, we get:

Lemma 13. AA+ is the projection onto the column space of A.

When A is singular there is no B so that AB = I. This is true because


no matter what B we choose, ABx will always be in the column span of A,
which by assumption is not all of Rn . However the pseudoinverse is the next
best thing in the sense that AA+ is a projection onto the largest possible
space we can get, the column space. Similarly we can check that
\[ A^+ A = \sum_{i=1}^{r} v_i v_i^T \]

which implies that A+ A is the projection onto N (A)⊥ .
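A sketch checking these projection properties numerically (pinv is Julia's built-in pseudoinverse; the matrix is the singular example from before):

```julia
using LinearAlgebra

A  = [2.0 -3.0 0.0; 1.0 1.0 2.0; 3.0 -2.0 2.0]
Ap = pinv(A)

F  = svd(A)
r  = rank(A)
U1 = F.U[:, 1:r]
V1 = F.V[:, 1:r]

@show norm(A * Ap - U1 * U1')      # ≈ 0: A A⁺ is the projection onto C(A)
@show norm(Ap * A - V1 * V1')      # ≈ 0: A⁺ A is the projection onto N(A)⊥
```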

6.2 Application: Uncertainty Regions


In this section, we will re-examine the singular value decomposition from a
geometric viewpoint. First let’s study what a transformation A does to the
unit ball – i.e. the set of all vectors with unit length (Definition 21).

Definition 39. The unit ball B is defined as

B = {x|∥x∥ ≤ 1}

What does the set AB = {Ax|∥x∥ ≤ 1} look like? The singular value
decomposition of A will give us a way to decompose A into simpler geo-
metric building blocks. In particular, let’s visualize what happens when we
multiply by V T , then Σ and finally by U . The unit ball looks like:

Here we have marked the north and south poles. What happens when
we apply V T ? Since V has orthonormal columns, multiplying by V T does
not change the length of a vector. This implies that

B = V T B = {V T x|∥x∥ ≤ 1}

Even though the lengths do not change, the coordinates do. In our visual,
this means that the locations of the north and south poles change, even
though multiplying by V T maps the unit ball to the unit ball.

Now what happens when we multiply by Σ? Since Σ is a diagonal


matrix, it stretches the unit ball along the coordinate axes, proportional to
the singular values. In R2 we could get something like:

In particular, the axes corresponding to larger singular values are more


stretched than those corresponding to smaller singular values. Thus we no

longer have a ball, but rather an ellipsoid. And finally multiplying by U


rotates the ellipsoid.

The principal axes of the ellipsoid are the vectors σ1 u1 , σ2 u2 and so on.
Thus, from a geometric standpoint, the singular value decomposition is a
way to think about a general linear transformation as mapping a unit ball
to an ellipsoid. The singular values describe the lengths of the principal
axes. The left singular vectors are the directions of the principal axes. And
finally the right singular vectors are the directions in the unit ball that get
mapped to each of the principal axes.

Reconstruction Error Now let’s use our new geometric viewpoint on


the singular value decomposition to understand uncertainty regions. As a
motivation, in many applications like MRI, we get noisy linear measure-
ments of some unknown vector x. For example, x might represent a picture
of organs, bones, blood vessels, etc, that we would like to reconstruct. Sup-
pose we have
y = Ax + z
Here y represents our observations, and A represents how the pixel values
in the image x affect our measurements. If that’s all there is, we would be

back to solving linear systems. The catch is there is also some unknown
noise z in the process, and to be conservative we might only assume that
the noise is small, i.e. ∥z∥ ≤ δ for some small constant δ. If A is known,
how can we estimate x? And furthermore, can we quantify our uncertainty
in x?

We will assume that A has full column rank. This is equivalent to the
condition that A has a left inverse – i.e. there is a matrix N so that N A = I.
So the natural way to estimate x is using
\[ \hat{x} = N y \]
The reconstruction error is defined as \(\hat{x} - x\). It represents the parts of the
image that we got wrong. What we’d like to do is use properties of N
and the assumption that ∥z∥ ≤ δ to bound our reconstruction error. After
all, there might be some pixels that we can confidently say we accurately
reconstructed, and that is useful to know! Towards that end we can write
\[ \hat{x} - x = N(Ax + z) - x = Ix + Nz - x = Nz \]
Thus we know that the reconstruction error \(\hat{x} - x\) is contained in an uncertainty ellipsoid
\[ \{ Nz \mid \|z\| \le \delta \} \]
And so the singular value decomposition of N can be used to understand
the reconstruction error. For example, if we want to make a diagnosis
based on our estimate \(\hat{x}\) it is useful to know if there are other images in the
uncertainty ellipsoid centered at \(\hat{x}\) where we would have made a different
diagnosis.

Of course this leads to even more questions. It is not hard to see that
in general the left inverse of a matrix is not unique. There are often in-
finitely many. So which one should we choose in terms of minimizing the
uncertainty ellipsoid? It turns out that there is a best choice. We will state
without proof the following fact:
Fact 19. The uncertainty ellipsoid of A+ is contained in those of any other
left inverse of A.

Thus A+ is the best left inverse in the sense that it yields an estimator
\(\hat{x} = A^+ y\) with the best bound on the reconstruction error.
Chapter 7

Connections to Optimization

7.1 Low Rank Approximation


In this section, we will start with a thought experiment. Suppose we have
an n × n rank one matrix
A = uv T
where u and v are unit vectors and we add small random numbers to its
entries. What happens to its rank? To be concrete, for n = 50 we could set

Bi,j = Ai,j + zi,j

where each zi,j is chosen uniformly at random from the interval [−10−2 , 10−2 ].
What is the rank of B? This is a good problem to test your intuition. Maybe
the rank is unchanged, and rank(B) = rank(A) = 1? Or perhaps the rank
doubles? Or does the rank become as large as possible, and rank(B) = 50?

You can try this experiment out for yourself and check that indeed
rank(B) = 50. This is true in general, regardless of how small you make
the additive noise terms zi,j , and regardless of how large n is too. You’ll
always get rank(B) = n, unless you run into machine precision issues with
the computation itself. But why does this happen? Recall that the rank
of a matrix is equal to the dimension of its columnspace. And when we
sequentially consider the span of the first column of B, then the span of
the first two columns of B, and so on, we are incredibly unlikely to get any
column belonging to the span of the previous columns of B. Thus at each


step, the dimension grows by one. This is the intuition behind why B has
full rank.
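Here is a sketch of the experiment in Julia (the draws are random, so your exact numbers will differ):

```julia
using LinearAlgebra

n = 50
u = normalize(randn(n))
v = normalize(randn(n))
A = u * v'                              # a rank-one matrix

Z = 2e-2 .* rand(n, n) .- 1e-2          # noise, uniform in [-10⁻², 10⁻²]
B = A + Z

@show rank(A), rank(B)                  # 1 and, almost surely, 50
@show opnorm(B - A)                     # small: B is still close to a rank-one matrix
```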

Now we come to the key idea: Even though the rank of B is large,
it is still close to a rank one matrix, if we define an appropriate notion
of what being close means. As we will see, the idea of finding a low rank
approximation to a matrix has many important applications in data science
and engineering.

Definition 40. The operator norm of A, denoted by ∥A∥, is defined as
\[ \|A\| = \max_{x \text{ s.t. } \|x\| \le 1} \|Ax\| \]

Note that when ∥ · ∥ is applied to a vector, it is the length. But when


it is applied to a matrix it is the operator norm – i.e. it is the maximum
factor by which the length of a vector can grow, when multiplying by A.
Let’s recall the geometric picture of the singular value decomposition:

From this picture, we can immediately determine the operator norm


of A. In particular, when we consider the image of the unit ball when
we multiply by A, which vector is the farthest from the origin? Since the
singular values are in non-increasing order σ1 ≥ σ2 ≥ · · · ≥ σn it is σ1 u1 .
Moreover since u1 is a unit vector, we have ∥A∥ = σ1 and we know that
choosing x = v1 achieves this since Av1 = σ1 u1 .

Fact 20. For any A, its operator norm is its largest singular value.

There is another way to derive this fact that will be a useful perspective
for later:

Fact 21. For any matrix A and orthogonal matrix R we have

∥A∥ = ∥AR∥

Proof. We will verify that the two optimization problems, one computing
the operator norm of A and one computing the operator norm of AR are
optimizing over the same set. In particular consider the optimization prob-
lems
\[ \|A\| = \max_{x \text{ s.t. } \|x\| \le 1} \|Ax\| \]

as well as
\[ \|AR\| = \max_{z \text{ s.t. } \|z\| \le 1} \|ARz\| \]
But if we set x = Rz then ∥x∥ = ∥z∥ and we can go back and forth. Or to
be more formal, if there is a z with ∥z∥ ≤ 1 and ∥ARz∥ ≥ C then we can
set x = Rz and we will have ∥x∥ = ∥Rz∥ ≤ 1 and ∥Ax∥ = ∥ARz∥ ≥ C
as well. An identical argument works in the reverse direction, and we must
have that the values of the optimization problems are the same.

The fact also holds if we left multiply by an orthogonal matrix instead:

Fact 22. For any matrix A and orthogonal matrix R we have

∥A∥ = ∥RA∥

This fact is even easier to prove. It holds because ∥Ax∥ = ∥RAx∥ for
any vector x. Now putting it all together we can think about why the
singular value decomposition reveals the operator norm:

∥A∥ = ∥U ΣV T ∥
= ∥U T U ΣV T ∥ = ∥ΣV T ∥
= ∥ΣV T V ∥ = ∥Σ∥

Here the first equality in the second line follows from Fact 22 and the first
equality in the third line follows from Fact 21. Now since Σ is diagonal we have
\[ \|\Sigma x\| = \sqrt{\sum_{i=1}^{r} \sigma_i^2 x_i^2} \]

and the largest you can make it, subject to the constraint that x is a unit
vector, is by setting x1 = 1 and xi = 0 for all other i’s.
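A quick numerical check of Fact 20 (a sketch with a made-up matrix):

```julia
using LinearAlgebra

A = [3.0 1.0; 0.0 2.0]
F = svd(A)

@show opnorm(A) ≈ F.S[1]               # the operator norm is the largest singular value
@show norm(A * F.V[:, 1]) ≈ F.S[1]     # and it is achieved at x = v₁
```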

Now let’s return to our thought experiment. Suppose we want to recover


A approximately from B. At first this might seem hopeless because A is
rank one and B has rank n. But since the noise we added was so small, we
know that ∥B − A∥ is small too. Thus a natural strategy for estimating A
is to solve
\[ \hat{A} = \arg\min_{C \text{ s.t. } \mathrm{rank}(C) \le 1} \|B - C\| \]

In words, we are looking for an n × n matrix \(\hat{A}\) which is rank at most
one and approximates B in the sense that it makes the operator norm of
the error matrix \(B - \hat{A}\) as small as possible. The hope is that we know
C = A is already a good solution, so perhaps the best such C, i.e. \(\hat{A}\), will
be close to A.
be close to A.

But before we get ahead of ourselves, can we actually solve this op-
timization problem? Later in this course, we will study optimization in
more depth, and understand what families of optimization problems we
can efficiently solve, and what ones are generally hard. But even without
the general theory, the optimization problem above seems to be searching
over a complex, high-dimensional set – the set of all n × n rank one matri-
ces. Amazingly, it turns out that the singular value decomposition already
contains the answer!

First consider the singular value decomposition of B given by


B = ∑_{i=1}^r σ_i u_i v_i^T

We have written B as the sum of r rank one matrices. The main idea is to
take the largest one, i.e. the one that has the biggest operator norm, which

is the first term. So consider choosing

C = σ_1 u_1 v_1^T

Now let’s compute how good this approximation is:


∥B − C∥ = ∥ ∑_{i=2}^r σ_i u_i v_i^T ∥

What is the operator norm of B − C? From Fact 20 we know that it is the


largest singular value of B − C. The good news is the expression
B − C = ∑_{i=2}^r σ_i u_i v_i^T

is the singular value decomposition of B − C. Its singular values are σ_2, σ_3,


and so on. Thus we have that ∥B − C∥ = σ2 . Actually this is the best you
can do. In fact there is an even more general characterization that works
not just for rank one approximation, but rank k approximation too:

Theorem 2 (Eckart–Young). Let B be a matrix whose singular value decomposition is given by

B = ∑_{i=1}^r σ_i u_i v_i^T

with σ_1 ≥ σ_2 ≥ · · · ≥ σ_r. Then

min_{C : rank(C) ≤ k} ∥B − C∥ = σ_{k+1}

which is achieved by setting

C = ∑_{i=1}^k σ_i u_i v_i^T

In the expression above, C is called the truncated singular value decom-


position of B. The proof of this theorem is beyond the scope of this course.
However it even holds for other norms:

Definition 41. The Frobenius norm of an n × m matrix A, denoted by ∥A∥_F, is √( ∑_i ∑_j A_{i,j}² ).

The Frobenius norm is also invariant under orthogonal transformations, which gives us:


Fact 23. For any n × m matrix A with singular values σ_i, its Frobenius norm is √( ∑_i σ_i² ).

Proof. First we claim that ∥A∥F = ∥RA∥F = ∥AS∥F for any orthogonal
matrices R and S. This follows because ∥A∥2F is the sum of squares of the
lengths of the columns of A. When we left multiply by R, none of these
lengths change and thus ∥A∥2F = ∥RA∥2F . A similar argument works for
right multiplying by S, but considering the sum of squares of the lengths
of rows. These facts again allow us to conclude
∥A∥_F = ∥Σ∥_F = √( ∑_i σ_i² )

which completes the proof.

And finally we have:


Theorem 3 (Frobenius Variant). In the same setup as Theorem 2 we have

min_{C : rank(C) ≤ k} ∥B − C∥_F = √( ∑_{i=k+1}^r σ_i² )

which is achieved by setting

C = ∑_{i=1}^k σ_i u_i v_i^T
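Both theorems are easy to check numerically. The following sketch (assuming numpy and an arbitrary random test matrix) forms the rank k truncated singular value decomposition and compares the two approximation errors against the formulas above:

    import numpy as np

    # Numerical check of Theorems 2 and 3: the rank-k truncated SVD achieves
    # error sigma_{k+1} in operator norm and sqrt(sum of the remaining sigma_i^2)
    # in Frobenius norm.
    rng = np.random.default_rng(1)
    B = rng.standard_normal((8, 6))
    k = 2

    U, s, Vt = np.linalg.svd(B, full_matrices=False)

    # Truncated SVD: keep only the k largest singular values.
    C = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

    print(np.linalg.norm(B - C, ord=2), s[k])                            # operator norm error
    print(np.linalg.norm(B - C, ord='fro'), np.sqrt(np.sum(s[k:]**2)))   # Frobenius norm error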

7.2 Application: Recommendation Systems


In many applications in data science, we want to find some kind of hidden
structure that is obfuscated by noise. In our original thought experiment,

we can think of the rank one matrix A as the structured data we would
like to recover, but unfortunately we only have access to B. However we
can use the singular value decomposition to approximate A. There are
many bounds that quantify how good an approximation we find, but these
are beyond the scope of the course, and involve sophisticated tools from
random matrix theory.

Still we can get an idea of how these ideas can be applied in practical
settings. In 2006, Netflix offered a challenge to the machine learning com-
munity: Improve upon our recommendation systems considerably, and we’ll
give you a million dollars! The problem was formulated as follows: Netflix
released the movie ratings of about 480k users on about 18k movies. Each
rating was a numerical score from one to five stars, but of course most of
the ratings were blank because the user had not rated the corresponding
movie.

We can think about this data as being organized into a 480k × 18k
matrix where each row represents a user and each column represents a
movie. The entry in row i, column j is the numerical score or is zero if it is
missing. The main challenge is that only about one percent of the entries
are observed. The goal is to estimate the missing entries. Netflix measured
the performance of its algorithm based on its mean squared error.

Estimating the missing entries of the matrix is clearly impossible unless


we make some assumption about its structure. A convenient and plausi-
ble assumption is that if we could fully observe the matrix M , it would
be approximately low rank. This could happen, for example, because we
can write it as the sum of rank one matrices where each rank one matrix
represents a different category like drama or comedy.

Each rank one matrix has the form uv^T, and we can think of the ith entry of u as representing how much user i likes drama movies. Then the jth entry of v represents how much the jth movie appeals to viewers who like drama. In
any case, we can now think of our observed matrix B as having Bi,j = Mi,j
for entries we observe, and Bi,j being equal to zero everywhere else. Thus
B is the sum of a low-rank (or approximately low rank) matrix and a noise
matrix, and one simple and powerful way to predict the missing entries is
to compute a low rank approximation B̂ to B and use B̂_{i,j} to predict how user i would rate movie j even when that rating is not in our system.
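Here is a minimal sketch of this pipeline on synthetic data (numpy assumed). The observation rate, the rescaling by that rate to correct the bias of zero-filling, and the assumption that the rank is known are simplifications of my own; real recommendation systems also recenter the ratings and regularize:

    import numpy as np

    # Synthetic version of the setup above: a hidden low rank matrix M of which
    # we only observe a fraction of the entries (30 percent, an arbitrary choice).
    rng = np.random.default_rng(2)
    n_users, n_movies, k = 200, 50, 3
    M = rng.standard_normal((n_users, k)) @ rng.standard_normal((k, n_movies))

    obs_rate = 0.3
    mask = rng.random((n_users, n_movies)) < obs_rate   # True where a rating is observed
    # Zero-fill the missing entries; dividing by the observation rate corrects the
    # downward bias this introduces (a standard trick, not discussed in the text).
    B = np.where(mask, M, 0.0) / obs_rate

    # Predict the missing entries with a rank-k truncated SVD of B.
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    B_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

    # Mean squared error on the entries we never observed, against a baseline
    # that simply predicts zero everywhere.
    print(np.mean((B_hat[~mask] - M[~mask]) ** 2))   # truncated SVD predictions
    print(np.mean(M[~mask] ** 2))                    # baseline: predict zero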

7.3 Principal Component Analysis


In this section, we will study principal component analysis, which is one of
the most widely used tools for extracting structure from high-dimensional
data. The setup is as follows: You have an n × p matrix of data points
. .
.. .. · · · ...

A = a1 a2 · · · ap 
 
.. .. .
. . · · · ..
where each data point ai is an n-dimensional vector. Some important ex-
amples, which we will study later, include:

1. Natural language processing: Suppose we are given a collection of p


documents. We associate the rows and columns of A with words and
documents respectively. Then we ignore all notions of syntax and
grammar, and the entry Ai,j just counts the number of times word i
occurs in document j.
2. Genetics: Suppose we fully sequence the genomes of p people. We
associate the rows and columns of A with SNPs and people respec-
tively, and the entry Ai,j records which nucleotide occurs in location
i in person j.
3. Voting records: Suppose we have the voting records of p senators on
n different bills. We associate the rows and columns of A with bills
and senators respectively, and the entry Ai,j records whether senator
j voted yea or nay on bill i.

In each of these examples, and many more, it is natural to represent data


as a large matrix. However it is difficult to visualize it. Our motivating
question in this section is: Can we map high-dimensional data to low-
dimensions while approximately preserving its structure? We will see how
the singular value decomposition can be used to solve this problem too.

The Mean and Covariance   Recall that if we have a collection of scalars y_1, y_2, · · · , y_p the mean or average is µ = (1/p) ∑_{i=1}^p y_i and the variance is

σ² = (1/p) ∑_{i=1}^p (y_i − µ)²
We will define a high-dimensional analogue of these quantities.

1. The high-dimensional mean is


µ = (1/p) ∑_{i=1}^p a_i

2. The high-dimensional covariance is


S = (1/p) ∑_{i=1}^p (a_i − µ)(a_i − µ)^T

Now suppose we take our high-dimensional data points and project them
onto a line or onto a subspace. It turns out that we can understand the
mean and covariance of the new, lower dimensional data points linear alge-
braically based on the mean and covariance of the old, higher dimensional
data points. Let’s dig into this. Throughout the rest of the section we
will assume that we have recentered our data so that its mean is zero. In
particular let yi = ai − µ. Then the covariance becomes
S = (1/p) ∑_{i=1}^p y_i y_i^T

Now suppose we project our data onto a line in some direction c:

The new lower dimensional data points are the zi ’s where zi = cT yi .



Can we express the variance of the zi ’s in terms of the covariance of the


yi ’s? First we can check that the average of the zi ’s is zero. This follows
because
(1/p) ∑_{i=1}^p z_i = (1/p) ∑_{i=1}^p c^T y_i = c^T ( (1/p) ∑_{i=1}^p y_i ) = 0

where the last equality follows because the average of the yi ’s is the zero
vector. Now we can compute the variance
(1/p) ∑_{i=1}^p z_i² = (1/p) ∑_{i=1}^p (c^T y_i)²
                     = (1/p) ∑_{i=1}^p c^T y_i y_i^T c
                     = c^T ( (1/p) ∑_{i=1}^p y_i y_i^T ) c = c^T S c

This expression will play a crucial role in principal component analysis, and
in quadratic programming later on. Let’s give it a name:
Definition 42. The quadratic form of a vector c on matrix S is cT Sc

In summary, we have shown:


Fact 24. Suppose y1 , y2 , · · · , yp are vectors and we project them onto a
direction c. Let zi = cT yi be the projections. Then
var(zi ) = cT Sc
where S is the covariance of the yi ’s.
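A short numerical check of Fact 24 on synthetic, recentered data (numpy assumed):

    import numpy as np

    # Check of Fact 24: the variance of the projections z_i = c^T y_i equals
    # c^T S c, where S is the covariance of the centered data points y_i.
    rng = np.random.default_rng(3)
    p, n = 500, 4
    Y = rng.standard_normal((n, p))           # columns are the data points y_i
    Y = Y - Y.mean(axis=1, keepdims=True)     # recenter so the mean is zero

    S = (Y @ Y.T) / p                         # covariance, S = (1/p) sum_i y_i y_i^T

    c = rng.standard_normal(n)
    c /= np.linalg.norm(c)                    # a unit direction

    z = c @ Y                                 # projections z_i = c^T y_i
    print(np.mean(z**2))                      # variance of the z_i's (their mean is zero)
    print(c @ S @ c)                          # the quadratic form c^T S c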

Maximizing the Projected Variance Principal component analysis


maps high-dimensional data to lower dimensions by finding directions that
maximize the projected variance. In particular, based on Fact 24, we want
to solve
max_{c : ∥c∥ = 1} c^T S c

We will solve this problem through the singular value decomposition. Recall
that A is an n × p matrix that represents our data, and we have assumed
that the columns of A have zero mean. Then we can write A = U ΣV T .
Using the expression for the covariance, we can write
S = (1/p) ∑_{i=1}^p y_i y_i^T = (1/p) AA^T
  = (1/p) UΣV^T VΣ^T U^T
  = (1/p) UΣΣ^T U^T

Now we can reformulate our optimization problem as

max_{c : ∥c∥ = 1} (1/p) c^T UΣΣ^T U^T c

If we make the substitution b = U^T c we get

max_{b : ∥b∥ = 1} (1/p) b^T ΣΣ^T b
This is the same type of change of variables we used in Section 7.1, and
works because ∥b∥ = ∥U T c∥ = ∥c∥. In any case, since Σ is diagonal, it is
now easy to understand the maximum of the optimization problem. Since
the singular values are non-increasing, the maximum is achieved by setting
b1 = 1 and bi = 0 for all other i. Since b = U T c, in our original optimization
problem the direction b = e1 corresponds to the direction c = U e1 . Thus
we can read off from the singular value decomposition the answer:
Lemma 14. The direction that maximizes the projected variance of A is
its top left singular vector.
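In code, Lemma 14 says that the top principal direction is just the first left singular vector of the centered data matrix. A minimal sketch on synthetic data (numpy assumed; the planted direction (1, 1, 0) is my own choice):

    import numpy as np

    # Lemma 14 in code: the direction maximizing the projected variance is the
    # top left singular vector of the centered data matrix A.
    rng = np.random.default_rng(4)
    n, p = 3, 1000
    # Synthetic data with most of its variance along the direction (1, 1, 0).
    A = (np.outer([1.0, 1.0, 0.0], rng.standard_normal(p)) * 3.0
         + 0.2 * rng.standard_normal((n, p)))
    A = A - A.mean(axis=1, keepdims=True)     # recenter the columns

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    c = U[:, 0]                               # top left singular vector
    print(c)                                  # close to +/- (1, 1, 0)/sqrt(2)

    S = (A @ A.T) / p
    print(c @ S @ c)                          # the maximal projected variance
    print(s[0]**2 / p)                        # equals sigma_1^2 / p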

More generally we might be interested in maximizing the projected


spread onto a k dimensional subspace. This translates into a more compli-
cated objective function:
max_C (1/p) ∑_{i=1}^p ∥C^T y_i∥²

where we set C = [ c_1  c_2  · · ·  c_k ] and c_1, c_2, · · · , c_k are orthonormal so
that zi = C T yi is the projection of yi onto the k dimensional space spanned
by the ci ’s.

Theorem 4. The k dimensional space that maximizes the spread of A is


achieved by taking the span of the top k left singular vectors of A.

The operation of taking your data, finding the k dimensional subspace


that maximizes the projected variance, and projecting your data onto it,
is called principal component analysis. As we will see, it has wide ranging
applications.

We will not prove this theorem. But the intuition is that if you look
for the best k dimensional subspace, it contains the best k − 1 dimensional
subspace inside it, and so on. Thus once you have found the 1 dimensional
subspace that maximizes the projected variance, which is achieved by taking
the line in the direction of u1 , you can look orthogonal to it to find the new
vector you should add in to get a 2 dimensional subspace, and so on.

Minimizing the Reconstruction Error There is another interpreta-


tion of principal component analysis, in terms of a different optimization
problem. Suppose our goal is to project our data points onto a k dimen-
sional subspace and to minimize the amount that they move. In particular
we want to solve:
min_C (1/p) ∑_{i=1}^p ∥y_i − ŷ_i∥²

where ŷ_i = CC^T y_i and as before C = [ c_1  c_2  · · ·  c_k ] and c_1, c_2, · · · , c_k are
orthonormal. Caution: We are thinking about projections in a different
way here. Instead of the projection of yi being a k dimensional vector, it
is an n dimensional vector. To put it another way, earlier we were thinking
about the projection in terms of the natural coordinates of the subspace,
but now we are thinking about it in our original coordinate system, so that
we can measure the distance between a vector and its projection.

It turns out that this optimization problem is the same one as before,
just written in a different way. First note that ŷ_i and y_i − ŷ_i are orthogonal because ŷ_i is the projection of y_i onto a subspace. Now we have

∥y_i − ŷ_i∥² = ∥y_i∥² − ∥ŷ_i∥²
             = ∥y_i∥² − ∥CC^T y_i∥²
             = ∥y_i∥² − ∥C^T y_i∥²

where the first equality follows from the Pythagorean theorem and the last equal-
ity follows from the assumption that C has orthonormal columns. Note that
C is not orthogonal because, in general, k < n. Nevertheless the identity
∥CC T yi ∥2 = ∥C T yi ∥2 still holds because C T C = I and

∥Cx∥² = x^T C^T C x = x^T x = ∥x∥²

Returning to the minimization problem above, we have


min_C (1/p) ∑_{i=1}^p ∥y_i − ŷ_i∥² = (1/p) ∑_{i=1}^p ∥y_i∥² − max_C (1/p) ∑_{i=1}^p ∥C^T y_i∥²

From these manipulations, we have:

Theorem 5. The k dimensional subspace that minimizes the reconstruction


error is achieved by taking the span of the top k left singular vectors of A.

Thus principal component analysis is not only a mechanism to visualize


high-dimensional data, but also a way to find the optimal linear compression
onto a lower dimensional subspace.
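To see both views in one place, here is a compact sketch (numpy, synthetic data of my own construction): it projects centered data onto the span of the top k left singular vectors and checks that the average reconstruction error matches the discarded singular values.

    import numpy as np

    # PCA as compression: project the centered data onto the span of the top k
    # left singular vectors and measure the average reconstruction error.
    # Theorem 5 says no other k dimensional subspace does better.
    rng = np.random.default_rng(5)
    n, p, k = 10, 400, 2
    Y = (rng.standard_normal((n, k)) @ rng.standard_normal((k, p))
         + 0.1 * rng.standard_normal((n, p)))
    Y = Y - Y.mean(axis=1, keepdims=True)

    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    C = U[:, :k]                      # orthonormal columns spanning the subspace

    Z = C.T @ Y                       # k dimensional coordinates of each point
    Y_hat = C @ Z                     # projections back in the original coordinates

    err = np.mean(np.sum((Y - Y_hat)**2, axis=0))
    print(err)                        # average squared reconstruction error
    print(np.sum(s[k:]**2) / p)       # equals (1/p) * sum of the discarded sigma_i^2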
Chapter 8

Quadratic Programming

8.1 Using the Eigendecomposition


In this section, we will study an important class of optimization problems
that is closely connected to linear algebra. Let’s start with a concrete
example:

min_{x,y} 2x² − 2xy + 2y² − 2√2 x + 4√2 y

This is a quadratic programming problem because x and y are scalars and


we would like to minimize a degree two polynomial in x and y. Such a
polynomial is called a quadratic. Moreover this is an unconstrained opti-
mization problem. Later we will allow constraints that the variables must
belong to some region. The main goal of this section is to show how the
eigendecomposition can be used to solve the above optimization problem,
and more generally any unconstrained quadratic optimization problem in
any number of variables. We will explain the key steps as applied to our
concrete example:

(1) Write the quadratic program in matrix vector notation:


min_{x,y} [x  y] [[2, −2], [0, 2]] [x; y] + [−2√2, 4√2] [x; y]


Actually there is a more convenient way to write this where we are taking
a quadratic form on a symmetric matrix.
min_{x,y} [x  y] [[2, −1], [−1, 2]] [x; y] + [−2√2, 4√2] [x; y]
This alternative formulation will help us because we already know a lot
about the existence and structure of the eigendecomposition of a symmetric
matrix. In contrast, non-symmetric matrices cannot always be diagonalized.
Furthermore we can write the quadratic programming problem even more
compactly as
min_z z^T A z + b^T z

where z is a two-dimensional vector with coordinates x and y, A is the 2 × 2 symmetric matrix above, and b is the two-dimensional vector of scalars above.

(2) Use the eigendecomposition to find a convenient change of variables

Since A is symmetric, we know that for some orthogonal matrix U we


have A = U DU T and D is diagonal and its entries are the eigenvalues of
A. In our concrete example, we have
  " −1 1 #   " −1 1 #
2 −1 √ √ 3 0 √2 √2
A= = √12 √12
−1 2 2 2
0 1 √12 √12
| {z } | {z } | {z }
U D
UT

Now we can apply the following change of variables


[x′; y′] = U^T [x; y],   with U^T = [[−1/√2, 1/√2], [1/√2, 1/√2]]

And applying z ′ = U T z to our original optimization problem, we have

min_z z^T A z + b^T z = min_z z^T U D U^T z + b^T z
                      = min_z (z^T U) D (U^T z) + (b^T U)(U^T z)
                      = min_{z′} z′^T D z′ + b′^T z′

where b′ = U T b. In our concrete problem, we get

min_{x′,y′} 3x′² + y′² + 6x′ + 2y′

What makes this problem simpler than the one we started off with is that
there are no terms that involve both x′ and y ′ . Thus the eigendecompo-
sition allowed us to separate the variables. Now we can solve each of the
minimization problems, one over x′ and one over y ′ independently. One
easy way to do this is to complete the square. By this we mean: collect all the terms that involve x′ and write them in the form α(x′ + β)² + γ for some
choice of α, β and γ. We do the same thing for y ′ too. This allows us to
rewrite the problem as:

min_{x′,y′} 3(x′ + 1)² + (y′ + 1)² − 4

And from this expression it is easy to find the optimum. We should set x′ = −1 and y′ = −1. This will make the objective value equal to −4. Conversely, the objective function cannot be made less than −4 over the reals, because it is the sum of two nonnegative terms and the constant −4. Now all that remains is to find a choice of x and y that achieve this value:

(3) Apply the change of variables in reverse to find an optimal solution


to the original problem

In our concrete problem, we write


" # 
−1 1
x′
 
x √ √
2 2
=
y √1
2
√1
2
y′
" #  
−1
√ √1

2 2 −1 0

= √1 √1
=
2 2
−1 − 2

And we can check that plugging in this choice of x and y makes the objective value equal to −4, as it should.
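The three steps are also easy to carry out numerically. The sketch below (numpy assumed; the data is just the concrete example above) diagonalizes A, minimizes the separated problem by completing the square in each variable, and maps the answer back:

    import numpy as np

    # Steps (1)-(3) on the concrete example
    #   min 2x^2 - 2xy + 2y^2 - 2*sqrt(2)*x + 4*sqrt(2)*y
    # written as min z^T A z + b^T z with the symmetric choice of A.
    A = np.array([[2.0, -1.0], [-1.0, 2.0]])
    b = np.array([-2.0 * np.sqrt(2), 4.0 * np.sqrt(2)])

    # Step (2): eigendecomposition A = U D U^T and change of variables z' = U^T z.
    # (eigh returns an orthogonal U for a symmetric A; the order and signs of its
    # columns may differ from the text, but the recovered minimizer is the same.)
    eigvals, U = np.linalg.eigh(A)
    b_prime = U.T @ b

    # The separated problem is sum_i (eigvals[i] * z_i'^2 + b_prime[i] * z_i'),
    # minimized at z_i' = -b_prime[i] / (2 * eigvals[i]) by completing the square
    # (valid here because both eigenvalues are positive).
    z_prime = -b_prime / (2.0 * eigvals)

    # Step (3): map back to the original variables.
    z = U @ z_prime
    print(z)                      # approximately (0, -sqrt(2))
    print(z @ A @ z + b @ z)      # approximately -4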

The Role of the Eigenvalues In this section, let’s study some other
examples of unconstrained quadratic programming. Our goal is to under-
stand what the eigenvalues directly tell us about our optimization problem.
Suppose we instead had the problem:
min_{x,y} x² − 4xy + y² − 2√2 x + 4√2 y

Following the same steps as earlier, we get


  " −1 1 #   " −1 1 #
1 −2 √ √ 3 0 √ √
A= = √12 √12 1
2
√1
2
−2 1 2 2
0 −1 √
2 2
| {z } | {z } | {z }
U D
UT

Now when we apply the same change of variables, z ′ = U T z as before, we


get
min_{x′,y′} 3x′² − y′² + 6x′ + 2y′

The only difference is, before, the coefficients in front of x′2 and y ′2 were 3
and 1 respectively and now they are 3 and −1. Observe that these coef-
ficients are the eigenvalues of A. Thus, instead of minimizing a quadratic
form on a symmetric matrix with nonnegative eigenvalues, we now have
negative eigenvalues. If we complete the square as before, we would
get
min_{x′,y′} 3(x′ + 1)² − (y′ − 1)² − 2

From this expression, we can immediately see that the optimum is −∞:
We can make the minimum arbitrarily small, say, by setting x′ to zero and
making y ′ larger and larger. In fact, we didn’t need to compute the change
of variables or complete the square. The fact that one of the eigenvalues
was negative already implied it, and the rest is just bookkeeping, because
as long as the optimization problem turns into something of the form

min_{x′,y′} α(x′ + β)² − (y′ − γ)² + C

the optimum will be −∞.


Definition 43. A matrix A is positive semidefinite, written A ⪰ 0 if it
is symmetric and has nonnegative eigenvalues. Moreover if A has positive
eigenvalues, we say it is positive definite, written as A ≻ 0.

Lemma 15. Consider an unconstrained quadratic program

min_z z^T A z + b^T z + c

where A is symmetric. If A is positive definite, then the optimum is finite.


If A has a negative eigenvalue, then the optimum is −∞.
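Here is a small sketch that classifies an unconstrained quadratic program by the eigenvalues of A in the spirit of Lemma 15 (numpy assumed; the tolerance, the example matrices, and the handling of the zero-eigenvalue case, which peeks ahead to the discussion below by inspecting the transformed linear term, are my own choices):

    import numpy as np

    def classify_quadratic(A, b, tol=1e-10):
        """Classify min_z z^T A z + b^T z for symmetric A, following Lemma 15."""
        eigvals, U = np.linalg.eigh(A)
        b_prime = U.T @ b
        if np.any(eigvals < -tol):
            return "a negative eigenvalue: optimum is -infinity"
        zero = np.abs(eigvals) <= tol
        if not np.any(zero):
            return "positive definite: finite optimum, unique minimizer"
        # Nonnegative eigenvalues with some zero eigenvalues: bounded only if the
        # linear term vanishes along every zero-eigenvalue direction.
        if np.any(np.abs(b_prime[zero]) > tol):
            return "zero eigenvalue with a linear term along it: optimum is -infinity"
        return "positive semidefinite: finite optimum, but not unique"

    A1 = np.array([[2.0, -1.0], [-1.0, 2.0]])    # eigenvalues 1, 3
    A2 = np.array([[1.0, -2.0], [-2.0, 1.0]])    # eigenvalues -1, 3
    A3 = np.array([[1.5, -1.5], [-1.5, 1.5]])    # eigenvalues 0, 3
    b = np.array([1.0, 1.0])
    for A in (A1, A2, A3):
        print(classify_quadratic(A, b))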

There is a geometric picture that explains this lemma. For positive


definite A, the objective function is bowl shaped, curving upward in every direction.

Such a function is called convex. We will develop the theory of convex


functions more generally later. The important thing is that there are ef-
ficient algorithms for minimizing a convex function, and this holds much
more generally when our function is not a quadratic or perhaps not even
a polynomial. If A has a negative eigenvalue, the objective function is saddle shaped.

Such a function is nonconvex. There are no known algorithms for mini-


mizing a general nonconvex function. However there are some restricted
families of nonconvex optimization problems that can be solved, although
this is usually done by revealing some sort of hidden convexity. We will
spend more time on these concepts later, but already we can see some
important connections between linear algebra and convex/nonconvex opti-
mization even for quadratic programming problems.

Observe that Lemma 15 does not tell us what happens when A is positive
semidefinite, but has a zero eigenvalue. For example, suppose we had the
problem:
min_{x,y} (3/2)x² − 3xy + (3/2)y² − 2√2 x
Again, computing the eigendecomposition and applying a change of vari-
ables gives us
min_{x′,y′} 3x′² + 6x′ + 2y′

And completing the square gives us


min_{x′,y′} 3(x′ + 1)² + 2y′ − 3

Since we have a zero eigenvalue, the coefficient of y ′2 is zero. However we


still have a term that is linear in y ′ . From this expression it is easy to see
that the optimum is −∞ because we can, say, set x′ = 0 and send y ′ to −∞.
If we instead got a negative term multiplying the y ′ , the optimum would still
be −∞ but we would send y ′ to ∞ instead. This example illustrates that if
the eigenvalues are nonnegative, but there is one or more zero eigenvalue,
whether the optimum is bounded or not depends on what the change of
variables does to the linear terms.

Uniqueness Sometimes we are interested in more than just finding an


optimal solution, but also in determining if there is a unique solution. It
turns out that the eigenvalues of A can help us address this question too.
Let’s return to our earlier example. We know that the optimization problem
min_{x,y} 2x² − 2xy + 2y² − 2√2 x + 4√2 y

is equivalent to
min_{x′,y′} 3(x′ + 1)² + (y′ + 1)² − 4

Is the solution unique? We know that the optimal value is −4 and this can
be achieved by setting x′ = −1 and y ′ = −1. If we were to choose any
different value for x′ or y ′ the objective function would be the sum of −4
and a strictly positive term, and thus it would not achieve the minimum.
Thus the minimum is unique. Moreover since our change of variables is an
invertible transformation, the minimization problem over x and y also has
a unique solution.

But suppose we instead had the problem:


min_{x,y} (3/2)x² − 3xy + (3/2)y² − 2√2 x
Again, computing the eigendecomposition and applying a change of vari-
ables gives us
min_{x′,y′} 3x′² + 6x′

And completing the square gives us

min_{x′,y′} 3(x′ + 1)² − 3

From this expression, we can see that the optimum is −3 and setting x′ =
−1 and y ′ to anything achieves the minimum. Thus the optimum is not
unique. When we apply the change of variables in reverse we still have a
one-dimensional space of optimal solutions.

Lemma 16. Consider an unconstrained quadratic program

min_z z^T A z + b^T z + c

where A is symmetric. The optimal solution is unique if and only if A is


positive definite.

Proof. We will ignore the case where the optimum is −∞. We can think
about this as a case where the optimum is not achieved. We have already
seen how, in the case where A is positive definite, the objective function
takes the form

C + ∑_{i=1}^n α_i (x_i − β_i)²

And if any xi is set to a value different than βi we would get an objective


value that is strictly larger than C, and is suboptimal. Thus the solution
is unique. In the remaining case, where A is positive semidefinite but has
a zero eigenvalue, the only way the optimum is bounded from below is if
after the change of variables, we get an expression of the form
C + ∑_{i=1}^{n′} α_i (x_i − β_i)²

that involves only n′ < n of the variables. But then in any optimal solution, we can set the remaining variables x_{n′+1}, . . . , x_n arbitrarily and not change the objective value. Thus the solution is
not unique. This completes the proof.

8.2 Optimality Conditions


In this section, our main motivation will be an even more expressive class of
optimization problems called equality constrained quadratic programming.
The general problem is:
min_z (1/2) z^T P z + q^T z   s.t.   A z = b
Earlier, we introduced underdetermined least squares: Given an underde-
termined linear system Ax = b that has infinitely many solutions, we want

to select the one that is the simplest in terms of minimizing ∥x∥2 . Now we
can see that this optimization problem
min_x ∥x∥²   s.t.   Ax = b

corresponds to choosing P = 2I and q = 0 in the general setup. Moreover


we found a closed-form expression for the optimal solution
x∗ = AT (AAT )−1 b
The key point is that for more complex problems we won’t always have
a closed-form solution. Instead we will use iterative methods to find the
optimum.
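Before moving on, here is a quick numerical check of that closed form (numpy assumed; A is a random wide matrix, which has full row rank with probability one):

    import numpy as np

    # Check the closed-form minimum-norm solution x* = A^T (A A^T)^{-1} b
    # for an underdetermined system.
    rng = np.random.default_rng(6)
    A = rng.standard_normal((3, 7))
    b = rng.standard_normal(3)

    x_star = A.T @ np.linalg.solve(A @ A.T, b)
    print(np.allclose(A @ x_star, b))              # x* is feasible

    # Any other solution has the form x* + d with d in N(A); compare norms.
    d = rng.standard_normal(7)
    d = d - A.T @ np.linalg.solve(A @ A.T, A @ d)  # project d onto N(A)
    x_other = x_star + d
    print(np.allclose(A @ x_other, b))             # still feasible
    print(np.linalg.norm(x_star), np.linalg.norm(x_other))   # x* has the smaller norm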

In this section we will study optimality conditions, which tell us when a


given solution x is the optimal solution. It turns out that for quadratic pro-
gramming, both without constraints and in the linearly constrained case,
deciding if a solution x is optimal will reduce to solving a linear system.
Later, when we introduce iterative methods like gradient descent that, start-
ing from an initial point, take steps towards the optimal solution, optimality
conditions will be able to tell us when we should stop.

Gradients and Hessians Before we dig into optimality conditions, we


need to understand how to take derivatives in matrix-vector notation.
Definition 44. The gradient of a function f(x_1, x_2, · · · , x_d) is the d dimensional vector

∇f = [ ∂f/∂x_1,  ∂f/∂x_2,  · · · ,  ∂f/∂x_d ]^T

Let’s revisit the example from Section 8.1 and consider

f(x, y) = [x  y] [[2, −1], [−1, 2]] [x; y] = z^T A z

Let’s compute the gradient:


∂f/∂x = ∂/∂x ( 2x² − 2xy + 2y² ) = 4x − 2y

Similarly we find

∂f/∂y = −2x + 4y

We can actually express the answer in matrix-vector notation:

∇f = 2 [[2, −1], [−1, 2]] [x; y] = [4x − 2y;  −2x + 4y]

This expression should look somewhat familiar from calculus, since for a scalar z we have d/dz (az²) = 2az. This fact is true more generally:
Fact 25. Let f (z) = z T Az and suppose A is symmetric. Then ∇f = 2Az.

When A is not necessarily symmetric, the expression becomes slightly


more complicated
Lemma 17. Let f (z) = z T Az. Then ∇f = (A + AT )z.

Proof. We can write


f(z) = ∑_{i=1}^d ∑_{j=1}^d A_{i,j} z_i z_j

and then it follows that


∂f/∂z_k = 2A_{k,k} z_k + ∑_{i≠k} A_{i,k} z_i + ∑_{j≠k} A_{k,j} z_j
        = ∑_{i=1}^d A_{i,k} z_i + ∑_{j=1}^d A_{k,j} z_j
        = (A^T z)_k + (A z)_k = ( (A + A^T) z )_k

and this completes the proof.

This should still remind you of the chain rule. As a side note, this is
why we chose the convention to have a 1/2 in front of P in our definition
of quadratic programming: It will make the expressions for the gradients
simpler. Let’s do another important example which we will use later on.

Claim 1. Let f (z) = z T b. Then ∇f (z) = b.

Relatedly, if we let f (z) = bT z what is ∇f (z)? It is easy to get tripped


up and think that the answer is ∇f (z) = bT but this is not correct because
the gradient is always a column vector by convention. Another way to think
about this is that z T b and bT z are the same function (they are just written
in different ways), so their gradient must be the same too.
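Both gradient formulas are easy to verify with finite differences. A minimal sketch (numpy assumed; the step size and test data are arbitrary choices):

    import numpy as np

    # Finite-difference check of the gradient formulas:
    #   grad of z^T A z is (A + A^T) z   (Lemma 17; equals 2Az when A is symmetric)
    #   grad of b^T z   is b             (Claim 1)
    rng = np.random.default_rng(7)
    d = 4
    A = rng.standard_normal((d, d))    # deliberately not symmetric
    b = rng.standard_normal(d)
    z = rng.standard_normal(d)

    def num_grad(f, z, h=1e-6):
        g = np.zeros_like(z)
        for k in range(len(z)):
            e = np.zeros_like(z)
            e[k] = h
            g[k] = (f(z + e) - f(z - e)) / (2 * h)   # central difference
        return g

    print(num_grad(lambda z: z @ A @ z, z))
    print((A + A.T) @ z)               # should match the line above
    print(num_grad(lambda z: b @ z, z))
    print(b)                           # should match the line above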

Using the Gradient Let’s return to equality constrained quadratic pro-


gramming:
min_z f(z) = (1/2) z^T P z + q^T z   s.t.   A z = b
Now let’s introduce some more general terminology

Definition 45. An optimization problem is composed of an objective func-


tion f (z) and a feasible region P which is the set of allowable solutions that
z must belong to.

In equality constrained quadratic programming, the feasible region is


P = {z s.t. Az = b}, which is an affine plane that does not necessarily go
through the origin. In any case, we know a lot about affine planes, and
we can use our understanding of their geometry as well as the gradient to
understand pictorially when a solution z is optimal.

From calculus, for a function f (z) where z is a scalar we know that the
derivative gives us a linear approximation
f(z + δ) ≈ f(z) + δ (d/dz) f(z)
Similarly when z is a vector we have

f (z + δ) ≈ f (z) + δ T ∇f (z)

If we let δ be any vector with norm at most c, we should have δ point in the direction of ∇f(z) in order to maximize the linear approximation. Thus the gradient is the direction of largest infinitesimal increase.

Now let’s talk about what it means for a point to be a local minimum.
Intuitively it means that there is no direction we can move in that infinites-
simally decreases the objective function. However this is not just about the
gradient of the function, but also its relationship to the directions we are
allowed to move in while maintaining feasibility. Formally:

Definition 46. We say that a point z is a local minimum if there is some ϵ > 0 so that for every feasible point z + δ with ∥δ∥ ≤ ϵ we have

f(z) ≤ f(z + δ)

When we have linear constraints, i.e. P = {z s.t. Az = b} what are the


directions δ we are allowed to move in while maintaining feasibility? The
following claim gives the answer:

Claim 2. If z satisfies Az = b then z + δ satisfies A(z + δ) = b if and only if δ ∈ N(A).

Now we come to our main results about optimality:

Lemma 18. When optimizing a quadratic function f (z) with respect to a


linear constraint Az = b, a feasible point z is locally optimal if and only if
∇f (z) is orthogonal to N (A).

Proof. We will only prove one direction, because we will give a stronger
characterization in the other direction below. Now consider the case where
∇f (z) is not orthogonal to N (A). Then there is a direction δ ∈ N (A) so
that
⟨δ, ∇f (z)⟩ < 0
From calculus, we have the estimate that for small α

f (z + αδ) = f (z) + αδ T ∇f (z) + O(α2 )

This implies that if we move a small enough amount in the direction δ (i.e.
α is sufficiently small), we will not only decrease the value of the linear
approximation, but also decrease the value of the function too. Thus z is
not a local minimum.

Of course our main interest is in finding a globally optimal solution:


Definition 47. We say that a point z is a global minimum if for any other
feasible point z ′ we have f (z) ≤ f (z ′ ).

A global minimum clearly must be a local minimum too. However the


converse is often not true. When you have found a local minimum, there
certainly could be another feasible point that is very far away that achieves
an even smaller objective value. Still, it turns out that for nice classes
of functions, every local minimum is a global minimum too. We will de-
velop this theory later on, but for now we observe that this local-to-global
property holds for equality constrained quadratic programming:
Lemma 19. For an equality constrained quadratic program

min_z (1/2) z^T P z + q^T z   s.t.   A z = b
with positive semidefinite P , a feasible point z is a local minimum if and
only if it is a global minimum.

Proof. Let’s assume that z is a local minimum. We want to prove that


z is a global minimum too. Our first step is to translate the condition
from Lemma 18 into a more convenient form. (Recall, we only proved one
direction but that is the direction we will use here). Since z is a local
minimum we know that ∇f (z) must be orthogonal to N (A). But since
C(A^T) = N(A)^⊥ we have that ∇f(z) = P z + q ∈ C(A^T). This is equivalent to saying there is a vector ν such that

P z + q = A^T ν

Now let’s prove that z is a global minimum. Consider any other feasible
point z ′ . Since the set of feasible points are the solutions to a linear system
Ax = b we know that z ′ = z + δ for some δ ∈ N (A). Let’s compare the
objective values. Since P is symmetric, we have
1
f (z + δ) = (z + δ)T P (z + δ) + (z + δ)T q
2
1
= f (z) + δ T P δ + δ T P z + δ T q
2

We claim that δ T P z + δ T q = 0. Let’s see why. From local optimality, we


have

P z = −q + A^T ν
⇒ δ^T P z = −δ^T q + δ^T A^T ν
⇒ δ^T P z = −δ^T q

The third line follows because δ ∈ N (A). Now plugging in δ T P z + δ T q = 0


into our expression from above, we have
f(z + δ) = f(z) + (1/2) δ^T P δ
And since P is positive semidefinite, by definition δ^T P δ ≥ 0. Thus f(z + δ) ≥ f(z)
as desired. This completes the proof.

We remark that we can combine the optimality and feasibility condi-


tions for equality constrained quadratic programming into one composite
condition: z is feasible and a local/global minimum if and only if there
exists ν satisfying

[[P, A^T], [A, 0]] [z; ν] = [−q; b]
Thus we can find a solution to a linearly constrained quadratic program by
solving a linear system.
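Here is a minimal sketch that assembles and solves this composite linear system on random problem data (numpy assumed; P is constructed to be positive definite so that, by Lemma 19, the solution is a global minimum):

    import numpy as np

    # Solve min (1/2) z^T P z + q^T z  s.t.  A z = b  by assembling the composite
    # linear system  [P A^T; A 0] [z; nu] = [-q; b]  from the optimality conditions.
    rng = np.random.default_rng(8)
    n, m = 5, 2
    M = rng.standard_normal((n, n))
    P = M @ M.T + np.eye(n)              # a positive definite P
    q = rng.standard_normal(n)
    A = rng.standard_normal((m, n))
    b = rng.standard_normal(m)

    KKT = np.block([[P, A.T], [A, np.zeros((m, m))]])
    rhs = np.concatenate([-q, b])
    sol = np.linalg.solve(KKT, rhs)
    z, nu = sol[:n], sol[n:]

    print(np.allclose(A @ z, b))              # feasibility
    print(np.allclose(P @ z + q, A.T @ nu))   # the gradient lies in C(A^T)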
Linear Algebra TLDR

Theorem 6. (“tall matrix”) Let A ∈ Rm×n , with m ≥ n (“tall” matrix).


The following statements are equivalent:

1. The columns of A are LI


2. If solvable, the system Ax = b has a unique solution
3. A has a left inverse
4. N (A) = {0}
5. rank(A) = n (“full column rank”)
Theorem 7. (“fat matrix”) Let A ∈ Rm×n , with m ≤ n (“fat” matrix).
The following statements are equivalent:

1. The rows of A are LI


2. The system Ax = b is solvable for every b
3. A has a right inverse
4. C(A) = Rm
5. rank(A) = m (“full row rank”)
Theorem 8. (“square matrix”) Let A ∈ Rn×n (square matrix). The fol-
lowing statements are equivalent:

1. The rows of A are LI


2. The columns of A are LI

3. The system Ax = b always has a unique solution

4. A is invertible, i.e., it has both a left and a right inverse

5. N (A) = {0}

6. C(A) = Rn

7. rank(A) = n (“full rank”)
