
Additional Topics for Chapter 4

Linear Algebra and Differential Equations


Matrix Factorization¹

Review of Elementary Matrices
Definition 1 An elementary matrix is an n × n matrix that can be obtained by performing a single elementary row operation on the identity matrix $I_n$. (Note that the identity matrix itself is an elementary matrix because we could multiply any row of $I_n$ by the scalar 1.)
Recall that the elementary row operations are:
1. Swap two rows
2. Multiply a row by a nonzero constant
3. Add a multiple of one row to another row
Example 2 Row swap: Multiplying matrix A by the elementary matrix $E_1$, in which rows 1 and 2 of $I_3$ are swapped, produces a matrix in which rows 1 and 2 of A have also been swapped.
\[
\begin{bmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 4 & 7 \\ 2 & 5 & 6 \\ 3 & 1 & 2 \end{bmatrix}
=
\begin{bmatrix} 2 & 5 & 6 \\ 1 & 4 & 7 \\ 3 & 1 & 2 \end{bmatrix}
\]
Example 3 Multiplication of a row by a scalar: Multiplying matrix A by the elementary matrix $E_2$, in which the second row of $I_3$ has been multiplied by $\frac{1}{3}$, produces a new matrix in which the second row of A has been multiplied by $\frac{1}{3}$.
\[
\begin{bmatrix} 1 & 0 & 0 \\ 0 & \frac{1}{3} & 0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 4 & 7 \\ 2 & 5 & 6 \\ 3 & 1 & 2 \end{bmatrix}
=
\begin{bmatrix} 1 & 4 & 7 \\ \frac{2}{3} & \frac{5}{3} & 2 \\ 3 & 1 & 2 \end{bmatrix}
\]
Example 4 Adding a multiple of one row to another: Multiplying matrix A by the elementary matrix $E_3$, in which two times the first row has been subtracted from the second row of $I_3$, produces a new matrix in which two times the first row of A has been subtracted from the second row of A.
\[
\begin{bmatrix} 1 & 0 & 0 \\ -2 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 4 & 7 \\ 2 & 5 & 6 \\ 3 & 1 & 2 \end{bmatrix}
=
\begin{bmatrix} 1 & 4 & 7 \\ 0 & -3 & -8 \\ 3 & 1 & 2 \end{bmatrix}
\]
This leads us to the following theorems, the second of which is a direct result of the fact that elementary row operations are reversible.

Theorem 5 If an elementary row operation is performed on a matrix A, the resulting matrix can also be obtained by multiplying A (on the left) by the corresponding elementary matrix E.

Theorem 6 If E is an elementary matrix, then $E^{-1}$ exists and is also an elementary matrix.
As confirmation of the previous theorem, note that the elementary matrices $E_1$, $E_2$, and $E_3$ from above have inverses
\[
\begin{bmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}, \quad
\begin{bmatrix} 1 & 0 & 0 \\ 0 & 3 & 0 \\ 0 & 0 & 1 \end{bmatrix}, \quad \text{and} \quad
\begin{bmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}
\]
because
\[
E_1 E_1^{-1} = E_1^{-1} E_1 =
\begin{bmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}
=
\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix},
\]
¹ Material from Falvo, David C. and Larson, Ron. Elementary Linear Algebra, 6th ed. Brooks/Cole, 2010.
while
\[
E_2 E_2^{-1} = E_2^{-1} E_2 =
\begin{bmatrix} 1 & 0 & 0 \\ 0 & \frac{1}{3} & 0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 \\ 0 & 3 & 0 \\ 0 & 0 & 1 \end{bmatrix}
=
\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix},
\]
and
\[
E_3 E_3^{-1} = E_3^{-1} E_3 =
\begin{bmatrix} 1 & 0 & 0 \\ -2 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}
=
\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}.
\]
Theorem 7 Two matrices A and B are row equivalent if there exists a finite number of elementary matrices $E_1, E_2, \ldots, E_k$ such that $B = E_k E_{k-1} \cdots E_2 E_1 A$. (In other words, A and B are row equivalent if we can get from A to B via a finite number of elementary row operations.)

Following is an example of elementary matrices in use to reduce a 2 × 2 matrix to reduced row-echelon form (i.e., $I_2$ in this case):
Example 8 Start with $A = \begin{bmatrix} 5 & 18 \\ 1 & 4 \end{bmatrix}$. Each step lists the current matrix, the elementary row operation applied to it, the corresponding elementary matrix, and its inverse.

Matrix $\begin{bmatrix} 5 & 18 \\ 1 & 4 \end{bmatrix}$: swap $R_1$ and $R_2$, with elementary matrix $E_1 = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$ and inverse $E_1^{-1} = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$.

Matrix $\begin{bmatrix} 1 & 4 \\ 5 & 18 \end{bmatrix}$: add $-5R_1$ to $R_2$, with elementary matrix $E_2 = \begin{bmatrix} 1 & 0 \\ -5 & 1 \end{bmatrix}$ and inverse $E_2^{-1} = \begin{bmatrix} 1 & 0 \\ 5 & 1 \end{bmatrix}$.

Matrix $\begin{bmatrix} 1 & 4 \\ 0 & -2 \end{bmatrix}$: multiply $R_2$ by $-\frac{1}{2}$, with elementary matrix $E_3 = \begin{bmatrix} 1 & 0 \\ 0 & -\frac{1}{2} \end{bmatrix}$ and inverse $E_3^{-1} = \begin{bmatrix} 1 & 0 \\ 0 & -2 \end{bmatrix}$.

Matrix $\begin{bmatrix} 1 & 4 \\ 0 & 1 \end{bmatrix}$: add $-4R_2$ to $R_1$, with elementary matrix $E_4 = \begin{bmatrix} 1 & -4 \\ 0 & 1 \end{bmatrix}$ and inverse $E_4^{-1} = \begin{bmatrix} 1 & 4 \\ 0 & 1 \end{bmatrix}$.

Then $E_4 E_3 E_2 E_1 A = I$. Since each of the $E_i$ are invertible, we also see that
\[
E_1^{-1} E_2^{-1} E_3^{-1} E_4^{-1} E_4 E_3 E_2 E_1 A = E_1^{-1} E_2^{-1} E_3^{-1} E_4^{-1} I
\]
\[
A = E_1^{-1} E_2^{-1} E_3^{-1} E_4^{-1}.
\]
In other words,
\[
A = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}
\begin{bmatrix} 1 & 0 \\ 5 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 \\ 0 & -2 \end{bmatrix}
\begin{bmatrix} 1 & 4 \\ 0 & 1 \end{bmatrix},
\]
or, A is the product of the inverses of the elementary matrices that were used to reduce A to I.
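The factorization in Example 8 is easy to check numerically. The following NumPy sketch (our addition, not part of the original notes; the variable names are ours) rebuilds A from the four inverse elementary matrices and confirms that the forward product of the elementary matrices reduces A to I.

```python
import numpy as np

# Inverse elementary matrices from Example 8, in the order E1^-1, E2^-1, E3^-1, E4^-1.
E1_inv = np.array([[0, 1], [1, 0]])    # undo the row swap
E2_inv = np.array([[1, 0], [5, 1]])    # undo "add -5*R1 to R2"
E3_inv = np.array([[1, 0], [0, -2]])   # undo "multiply R2 by -1/2"
E4_inv = np.array([[1, 4], [0, 1]])    # undo "add -4*R2 to R1"

A = E1_inv @ E2_inv @ E3_inv @ E4_inv
print(A)                               # [[ 5 18] [ 1  4]]

# The forward product of the elementary matrices reduces A to the identity.
E = np.linalg.inv(E4_inv) @ np.linalg.inv(E3_inv) @ np.linalg.inv(E2_inv) @ np.linalg.inv(E1_inv)
print(np.allclose(E @ A, np.eye(2)))   # True
```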
The LU-Factorization (without row interchanges)

There are a number of "matrix factorizations" in frequent use. Perhaps the most basic of these is what is known as the "LU-Factorization." To motivate its development, let us consider an example:

Example 9 Start with $A = \begin{bmatrix} 2 & 1 \\ 8 & 7 \end{bmatrix}$. We can accomplish row-echelon form with only one row operation. Here is that row operation and its associated elementary matrix:

Matrix $\begin{bmatrix} 2 & 1 \\ 8 & 7 \end{bmatrix}$: add $-4R_1$ to $R_2$, with elementary matrix $E_1 = \begin{bmatrix} 1 & 0 \\ -4 & 1 \end{bmatrix}$ and inverse $E_1^{-1} = \begin{bmatrix} 1 & 0 \\ 4 & 1 \end{bmatrix}$, giving
\[
\begin{bmatrix} 2 & 1 \\ 0 & 3 \end{bmatrix}.
\]
The above example shows that $E_1 A = U$, so the relation $A = LU$ implies that L must actually be $E_1^{-1}$, or
\[
A = \begin{bmatrix} 2 & 1 \\ 8 & 7 \end{bmatrix}
= \begin{bmatrix} 1 & 0 \\ 4 & 1 \end{bmatrix}
\begin{bmatrix} 2 & 1 \\ 0 & 3 \end{bmatrix}
= LU.
\]
What is the significance of this factorization? First of all, we use the letters L and U for a reason. Note that L is lower triangular (any nonzero elements are on or below the diagonal) and U is upper triangular (any nonzero elements are on or above the diagonal). Additionally, the diagonal elements of the L matrix are 1s. Once we have an LU-factorization of a matrix, we can generate an algorithm to easily solve numerous systems involving that same coefficient matrix. The practical significance of this is that it is even more efficient than Gaussian elimination when we need to reuse a coefficient matrix with varying right-hand sides (i.e., what we've been calling the b vector).²

² For n × n systems, LU-factorization requires $(4n^3 - 3n^2 - n)/6$ arithmetic operations for the factorization itself (which only has to be done once and can then be reused). Then each solution of the two resulting triangular systems (more on this later) can be carried out in $2n^2 - n$ operations per system. On the other hand, Gaussian elimination uses $(4n^3 + 9n^2 - 7n)/6$ arithmetic operations to arrive at a solution, and it requires this many operations for each system.
Before we proceed, we need to mention an important "lemma" (a lemma is a sort of warm-up to a theorem):

Lemma 10 If L and $\tilde{L}$ are lower triangular matrices of the same size, so is their product $L\tilde{L}$. Furthermore, if both of the matrices have ones on their diagonals, then so does their product. If U and $\tilde{U}$ are upper triangular matrices of the same size, so is their product $U\tilde{U}$.
Let us illustrate with another example, this time taking note of the result of the above lemma.

Example 11 Find an LU-factorization of the matrix $A = \begin{bmatrix} 2 & 1 & 1 \\ 4 & 5 & 2 \\ -2 & 2 & 0 \end{bmatrix}$.

Here is the procedure (Gaussian elimination) and its associated elementary matrices.

Matrix $\begin{bmatrix} 2 & 1 & 1 \\ 4 & 5 & 2 \\ -2 & 2 & 0 \end{bmatrix}$: add $-2R_1$ to $R_2$, with $E_1 = \begin{bmatrix} 1 & 0 & 0 \\ -2 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$ and $E_1^{-1} = \begin{bmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$.

Matrix $\begin{bmatrix} 2 & 1 & 1 \\ 0 & 3 & 0 \\ -2 & 2 & 0 \end{bmatrix}$: add $R_1$ to $R_3$, with $E_2 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 1 & 0 & 1 \end{bmatrix}$ and $E_2^{-1} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ -1 & 0 & 1 \end{bmatrix}$.

Matrix $\begin{bmatrix} 2 & 1 & 1 \\ 0 & 3 & 0 \\ 0 & 3 & 1 \end{bmatrix}$: add $-R_2$ to $R_3$, with $E_3 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & -1 & 1 \end{bmatrix}$ and $E_3^{-1} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 1 \end{bmatrix}$.

The result is the upper triangular matrix
\[
U = \begin{bmatrix} 2 & 1 & 1 \\ 0 & 3 & 0 \\ 0 & 0 & 1 \end{bmatrix}.
\]
Just as in the earlier 2 × 2 example, we have
\[
E_3 E_2 E_1 A = U,
\]
so
\[
E_1^{-1} E_2^{-1} E_3^{-1} E_3 E_2 E_1 A = A = E_1^{-1} E_2^{-1} E_3^{-1} U.
\]
But, note that each of the $E_i^{-1}$ are lower triangular with ones on their diagonal. According to the previous lemma, their product will also have this form. Indeed,
\[
E_1^{-1} E_2^{-1} E_3^{-1} =
\begin{bmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ -1 & 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 1 \end{bmatrix}
=
\begin{bmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ -1 & 1 & 1 \end{bmatrix}, \tag{1}
\]
and we realize that $E_1^{-1} E_2^{-1} E_3^{-1} = L$, and that $A = LU$, as desired. In other words, A can be "factored" into
\[
A = \begin{bmatrix} 2 & 1 & 1 \\ 4 & 5 & 2 \\ -2 & 2 & 0 \end{bmatrix}
= \begin{bmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ -1 & 1 & 1 \end{bmatrix}
\begin{bmatrix} 2 & 1 & 1 \\ 0 & 3 & 0 \\ 0 & 0 & 1 \end{bmatrix}
= LU, \tag{2}
\]
which, again, is a product of a lower and an upper triangular matrix. Note too that the result of the multiplication in (1) is a matrix whose diagonal elements are ones and whose other elements are the individual entries of the inverse elementary matrices "condensed" into one matrix. We can look directly at L (at least in this case) and see exactly what row operations were performed to get from A to U.
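As a quick numerical check (a sketch we have added; the original notes do not use software), NumPy confirms that the L and U found in Example 11 reproduce A. SciPy's scipy.linalg.lu computes a factorization of the form A = PLU with partial pivoting, so its L and U may differ from the hand computation above, which uses no row interchanges.

```python
import numpy as np
from scipy.linalg import lu

A = np.array([[2.0, 1.0, 1.0],
              [4.0, 5.0, 2.0],
              [-2.0, 2.0, 0.0]])
L = np.array([[1.0, 0.0, 0.0],
              [2.0, 1.0, 0.0],
              [-1.0, 1.0, 1.0]])
U = np.array([[2.0, 1.0, 1.0],
              [0.0, 3.0, 0.0],
              [0.0, 0.0, 1.0]])

print(np.allclose(L @ U, A))        # True: A = LU as computed by hand

# For comparison, SciPy returns a pivoted factorization A = P L U.
P, L2, U2 = lu(A)
print(np.allclose(P @ L2 @ U2, A))  # True
```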
Using A = LU to Solve Systems

So how do we use this factorization to solve a system Ax = b? We can use a simple two-stage process:

1. Solve the lower triangular system Ly = b for the vector y by forward substitution.

2. Solve the resulting upper triangular system Ux = y for x by back substitution.

The above two-stage process works because if Ux = y and Ly = b, then Ax = LUx = Ly = b.
As an example, consider the LU-factorization we found in (2) above, namely
\[
\begin{bmatrix} 2 & 1 & 1 \\ 4 & 5 & 2 \\ -2 & 2 & 0 \end{bmatrix}
= \begin{bmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ -1 & 1 & 1 \end{bmatrix}
\begin{bmatrix} 2 & 1 & 1 \\ 0 & 3 & 0 \\ 0 & 0 & 1 \end{bmatrix}.
\]
Suppose we seek to find the solution to the system
\[
\begin{bmatrix} 2 & 1 & 1 \\ 4 & 5 & 2 \\ -2 & 2 & 0 \end{bmatrix}
\begin{bmatrix} x \\ y \\ z \end{bmatrix}
= \begin{bmatrix} 1 \\ 2 \\ -2 \end{bmatrix}.
\]
We first solve the lower triangular system
\[
\begin{bmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ -1 & 1 & 1 \end{bmatrix}
\begin{bmatrix} a \\ b \\ c \end{bmatrix}
= \begin{bmatrix} 1 \\ 2 \\ -2 \end{bmatrix},
\quad \text{or} \quad
\begin{cases} a = 1 \\ 2a + b = 2 \\ -a + b + c = -2 \end{cases},
\]
which is easy to do via forward substitution (i.e., a = 1, b = 0, and c = -1). Then solve the upper triangular system
\[
\begin{bmatrix} 2 & 1 & 1 \\ 0 & 3 & 0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} x \\ y \\ z \end{bmatrix}
= \begin{bmatrix} 1 \\ 0 \\ -1 \end{bmatrix},
\quad \text{or} \quad
\begin{cases} 2x + y + z = 1 \\ 3y = 0 \\ z = -1 \end{cases},
\]
which is easy to do via back substitution (i.e., z = -1, y = 0, and x = 1), and this is easily verifiable as the solution to the original system.
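The two-stage solve is easy to express in code. The sketch below (ours, not from the text) implements forward and back substitution directly and reproduces the solution (x, y, z) = (1, 0, -1) of the example above.

```python
import numpy as np

def forward_substitution(L, b):
    """Solve L y = b for a lower triangular L with nonzero diagonal."""
    n = len(b)
    y = np.zeros(n)
    for i in range(n):
        y[i] = (b[i] - L[i, :i] @ y[:i]) / L[i, i]
    return y

def back_substitution(U, y):
    """Solve U x = y for an upper triangular U with nonzero diagonal."""
    n = len(y)
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x

L = np.array([[1.0, 0.0, 0.0],
              [2.0, 1.0, 0.0],
              [-1.0, 1.0, 1.0]])
U = np.array([[2.0, 1.0, 1.0],
              [0.0, 3.0, 0.0],
              [0.0, 0.0, 1.0]])
b = np.array([1.0, 2.0, -2.0])

y = forward_substitution(L, b)   # intermediate vector (1, 0, -1)
x = back_substitution(U, y)      # solution (1, 0, -1)
print(x)
```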
Review of Permutation Matrices

Sometimes in the process of row reduction we encounter a pivot position in which there is a zero. We can often get around this inconvenience by interchanging rows. Just as there are elementary matrices that represent multiplication of a row by a scalar or the addition of a multiple of one row to another, there are permutation matrices that represent the interchange of rows. We define such a matrix as follows.

Definition 12 A permutation matrix is a matrix obtained from the identity matrix by any combination of row interchanges.

Note that interchanging rows of a permutation matrix results in another permutation matrix. All that matters is that the individual elements in a particular row are not changed. We have the following lemma:

Lemma 13 A matrix P is a permutation matrix if and only if each row of P contains all 0 entries except for a single 1, and each column of P contains all 0 entries except for a single 1.

It should be easy to convince yourself that there are n! permutation matrices for an n × n matrix: we have n "choices" for where we place the first row, then n - 1 "choices" for the second row, etc. The 3! = 6 permutation matrices for a 3 × 3 matrix are
\[
\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix},
\begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix},
\begin{bmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix},
\begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{bmatrix},
\begin{bmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix},
\begin{bmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \end{bmatrix}.
\]
The effect of these matrices on a matrix with row vectors a, b, and c would be (reading the matrices from left to right, line to line):
\[
\begin{bmatrix} a \\ b \\ c \end{bmatrix},
\begin{bmatrix} a \\ c \\ b \end{bmatrix},
\begin{bmatrix} b \\ a \\ c \end{bmatrix},
\begin{bmatrix} b \\ c \\ a \end{bmatrix},
\begin{bmatrix} c \\ a \\ b \end{bmatrix}, \text{ and }
\begin{bmatrix} c \\ b \\ a \end{bmatrix}.
\]
The first permutation matrix, which is the identity matrix, is the "do-nothing" permutation matrix. The second, third, and sixth matrices are equivalent to a single row interchange, whereas the fourth and fifth matrices require a combination of row interchanges (and are therefore called "non-elementary" permutation matrices).
It should be no surprise that the product of two permutation matrices is another permutation matrix. However, multiplication of permutation matrices is not commutative in general. Switching the first and second rows, and then switching the second and third rows does not have the same effect as switching the second and third rows followed by switching the first and second rows (confirm this!). In addition, the product of permutation matrices works in a similar manner to products of elementary matrices in that $P = P_n \cdots P_2 P_1$ summarizes all of the permutations in a single matrix. For example:
\[
P_2 P_1 A =
\begin{bmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \end{bmatrix}
\begin{bmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}
\begin{bmatrix} a \\ b \\ c \end{bmatrix}
=
\begin{bmatrix} b \\ a \\ c \end{bmatrix}
\]
is equivalent to the following permutations, in order:
\[
P_1 A =
\begin{bmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}
\begin{bmatrix} a \\ b \\ c \end{bmatrix}
=
\begin{bmatrix} c \\ a \\ b \end{bmatrix},
\]
which corresponds to a non-elementary permutation matrix that swaps rows 1 and 2 and then rows 1 and 3 (or rows 2 and 3 and then rows 1 and 2, or rows 1 and 3 and then rows 2 and 3, etc.), and then
\[
P_2 (P_1 A) =
\begin{bmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \end{bmatrix}
\begin{bmatrix} c \\ a \\ b \end{bmatrix}
=
\begin{bmatrix} b \\ a \\ c \end{bmatrix},
\]
which in turn corresponds to a swapping of rows 1 and 3. We could achieve this same result by finding
\[
P = P_2 P_1 =
\begin{bmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix},
\]
and then multiplying:
\[
PA =
\begin{bmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} a \\ b \\ c \end{bmatrix}
=
\begin{bmatrix} b \\ a \\ c \end{bmatrix}.
\]
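A short NumPy check (added here as an illustration, not part of the original notes) of the claims above: the product of two permutation matrices is again a permutation matrix, the order of multiplication matters, and P₂P₁ applied to A reorders the rows in a single step.

```python
import numpy as np

P1 = np.array([[0, 0, 1],
               [1, 0, 0],
               [0, 1, 0]])
P2 = np.array([[0, 0, 1],
               [0, 1, 0],
               [1, 0, 0]])

# Rows play the role of a, b, c in the discussion above.
A = np.array([[1.0, 10.0],   # row a
              [2.0, 20.0],   # row b
              [3.0, 30.0]])  # row c

P = P2 @ P1
print(P)                                  # [[0 1 0] [1 0 0] [0 0 1]]
print(P @ A)                              # rows come out in the order b, a, c
print(np.array_equal(P2 @ P1, P1 @ P2))   # False: products do not commute
print(np.array_equal(P.T @ P, np.eye(3, dtype=int)))  # True: P^T is P^{-1}
```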
The Permuted LU-Factorization

Even though we haven't officially proven this result, it should be somewhat believable that any nonsingular matrix (i.e., one that can be reduced to the identity matrix) can be reduced to upper triangular form by interchanging rows and/or adding multiples of one row to another. We know, from above, how to find an LU-factorization for a matrix A that can be row reduced without any row interchanges. If we encounter a matrix A that does require row interchanges in its row reduction, we can still determine an LU-factorization if we multiply A by its permutation matrix ahead of time, which will just yield some "new" matrix B. In other words, if A isn't arranged how I need it to be arranged, I multiply A by P (recall that P itself could be a product of any number of permutation matrices), and the result is some matrix B with A's rows in a different order. So PA = B. Then B is just a "new" row-reducible matrix, so we should be able to find B = LU. This really amounts to
\[
PA = LU.
\]
The best way to accomplish this is a matter of bookkeeping. Start with the matrix, keep track of A, P, and L, then use these matrices to rearrange the result back into a solvable system of equations. For example:
Example 14 Compute a permuted LU-factorization of the matrix
\[
A = \begin{bmatrix} 1 & 2 & 1 & 0 \\ 2 & 4 & 2 & -1 \\ 3 & 5 & 6 & -1 \\ 1 & -2 & 8 & 2 \end{bmatrix}.
\]
First, eliminate the entries below the first pivot, keeping track of A, L, and P as you go:
\[
A = \begin{bmatrix} 1 & 2 & 1 & 0 \\ 0 & 0 & 0 & -1 \\ 0 & -1 & 3 & -1 \\ 0 & -4 & 7 & 2 \end{bmatrix}, \quad
P = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}, \quad
L = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 2 & 1 & 0 & 0 \\ 3 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \end{bmatrix}.
\]
Now we need to interchange the second and third rows:
\[
A = \begin{bmatrix} 1 & 2 & 1 & 0 \\ 0 & -1 & 3 & -1 \\ 0 & 0 & 0 & -1 \\ 0 & -4 & 7 & 2 \end{bmatrix}, \quad
P = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}, \quad
L = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 3 & 1 & 0 & 0 \\ 2 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \end{bmatrix}.
\]
(Note the effect this had on L as well as on A and P.) Now we can eliminate the -4 in the (4, 2) position:
\[
A = \begin{bmatrix} 1 & 2 & 1 & 0 \\ 0 & -1 & 3 & -1 \\ 0 & 0 & 0 & -1 \\ 0 & 0 & -5 & 6 \end{bmatrix}, \quad
P = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}, \quad
L = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 3 & 1 & 0 & 0 \\ 2 & 0 & 1 & 0 \\ 1 & 4 & 0 & 1 \end{bmatrix}.
\]
Finally, we interchange the third and fourth rows to obtain an upper triangular matrix U:
\[
U = \begin{bmatrix} 1 & 2 & 1 & 0 \\ 0 & -1 & 3 & -1 \\ 0 & 0 & -5 & 6 \\ 0 & 0 & 0 & -1 \end{bmatrix}, \quad
P = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 \end{bmatrix}, \quad
L = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 3 & 1 & 0 & 0 \\ 1 & 4 & 1 & 0 \\ 2 & 0 & 0 & 1 \end{bmatrix}.
\]
Now,
\[
PA =
\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 \end{bmatrix}
\begin{bmatrix} 1 & 2 & 1 & 0 \\ 2 & 4 & 2 & -1 \\ 3 & 5 & 6 & -1 \\ 1 & -2 & 8 & 2 \end{bmatrix}
=
\begin{bmatrix} 1 & 2 & 1 & 0 \\ 3 & 5 & 6 & -1 \\ 1 & -2 & 8 & 2 \\ 2 & 4 & 2 & -1 \end{bmatrix}
=
\begin{bmatrix} 1 & 0 & 0 & 0 \\ 3 & 1 & 0 & 0 \\ 1 & 4 & 1 & 0 \\ 2 & 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 2 & 1 & 0 \\ 0 & -1 & 3 & -1 \\ 0 & 0 & -5 & 6 \\ 0 & 0 & 0 & -1 \end{bmatrix}
= LU.
\]
Using PA = LU to solve a system of equations requires a slight adjustment as well. We need to first multiply the system Ax = b by the permutation matrix P, resulting in
\[
PAx = Pb = \tilde{b}.
\]
We then solve as we did earlier, noting that if $Ux = c$ and $Lc = \tilde{b}$, then $PAx = LUx = Lc = \tilde{b}$.

For example, if we wish to solve the system
\[
\begin{bmatrix} 1 & 2 & 1 & 0 \\ 2 & 4 & 2 & -1 \\ 3 & 5 & 6 & -1 \\ 1 & -2 & 8 & 2 \end{bmatrix}
\begin{bmatrix} x \\ y \\ z \\ w \end{bmatrix}
=
\begin{bmatrix} 1 \\ -1 \\ -3 \\ 0 \end{bmatrix},
\]
we set out to solve the triangular systems $Lc = \tilde{b}$ and $Ux = c$:
\[
\begin{bmatrix} 1 & 0 & 0 & 0 \\ 3 & 1 & 0 & 0 \\ 1 & 4 & 1 & 0 \\ 2 & 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} a \\ b \\ c \\ d \end{bmatrix}
=
\begin{bmatrix} 1 \\ -3 \\ 0 \\ -1 \end{bmatrix}
\]
(where the entries of b have been permuted in the same way as the other matrices by multiplying it by P). Using forward substitution it is easy to obtain a = 1, b = -6, c = 23, and d = -3. We then solve the resulting upper triangular system
\[
\begin{bmatrix} 1 & 2 & 1 & 0 \\ 0 & -1 & 3 & -1 \\ 0 & 0 & -5 & 6 \\ 0 & 0 & 0 & -1 \end{bmatrix}
\begin{bmatrix} x \\ y \\ z \\ w \end{bmatrix}
=
\begin{bmatrix} 1 \\ -6 \\ 23 \\ -3 \end{bmatrix},
\]
which, using back substitution, leads to w = 3, z = -1, y = 0, and x = 2.
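In practice, library routines handle the pivoting bookkeeping automatically. The sketch below (ours; the original notes do not use software) solves the same 4 × 4 system with SciPy's lu_factor/lu_solve, which compute a pivoted LU factorization internally. The pivoting strategy may differ from the hand computation, but the solution is the same.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

A = np.array([[1.0, 2.0, 1.0, 0.0],
              [2.0, 4.0, 2.0, -1.0],
              [3.0, 5.0, 6.0, -1.0],
              [1.0, -2.0, 8.0, 2.0]])
b = np.array([1.0, -1.0, -3.0, 0.0])

lu, piv = lu_factor(A)        # factor once (row interchanges recorded in piv)
x = lu_solve((lu, piv), b)    # reuse the factorization for any right-hand side
print(x)                      # approximately [ 2.  0. -1.  3.]
```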
The LDU or LDV Factorization

Recall the factorization in (2):
\[
A = \begin{bmatrix} 2 & 1 & 1 \\ 4 & 5 & 2 \\ -2 & 2 & 0 \end{bmatrix}
= \begin{bmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ -1 & 1 & 1 \end{bmatrix}
\begin{bmatrix} 2 & 1 & 1 \\ 0 & 3 & 0 \\ 0 & 0 & 1 \end{bmatrix}
= LU.
\]
We can also "factor out" the diagonal elements (i.e., the pivots) in the U matrix so that U's diagonal elements are all 1s. This results in a lower triangular matrix and an upper triangular matrix with a diagonal matrix in between:
\[
A = \begin{bmatrix} 2 & 1 & 1 \\ 4 & 5 & 2 \\ -2 & 2 & 0 \end{bmatrix}
= \begin{bmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ -1 & 1 & 1 \end{bmatrix}
\begin{bmatrix} 2 & 0 & 0 \\ 0 & 3 & 0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & \frac{1}{2} & \frac{1}{2} \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}.
\]
This is called the LDU-factorization of A. Note how factoring out each of the pivots $d_i$ affects the entire row.
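Extracting D from U is a one-liner once L and U are known. Here is a small NumPy sketch (added by us) that factors out the pivots of the U found above and confirms A = LDU.

```python
import numpy as np

L = np.array([[1.0, 0.0, 0.0],
              [2.0, 1.0, 0.0],
              [-1.0, 1.0, 1.0]])
U = np.array([[2.0, 1.0, 1.0],
              [0.0, 3.0, 0.0],
              [0.0, 0.0, 1.0]])

D = np.diag(np.diag(U))            # diagonal matrix of pivots
V = np.diag(1.0 / np.diag(U)) @ U  # divide each row of U by its pivot
A = L @ D @ V
print(V)        # unit upper triangular: [[1, 1/2, 1/2], [0, 1, 0], [0, 0, 1]]
print(A)        # recovers the original matrix
```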
Inner Products

We've previously discussed the dot product of two vectors. Recall that if $u = (u_1, u_2, \ldots, u_n)$ and $v = (v_1, v_2, \ldots, v_n)$, then the dot product is given by
\[
u \cdot v = u_1 v_1 + u_2 v_2 + \cdots + u_n v_n.
\]
This dot product is a specific example of a more general product called an inner product. The dot product, when viewed as an inner product, is sometimes called the Euclidean inner product because of its close ties with $\mathbb{R}^n$ Euclidean space. Because dot products and inner products are different from one another, we use different notation:
\[
u \cdot v : \text{dot product on } \mathbb{R}^n
\]
\[
\langle u, v \rangle : \text{general inner product for a vector space } V
\]
In order for a product to be considered an inner product, it must satisfy certain properties. Any vector space with an inner product is called an inner product space. The inner product is defined as follows.

Definition 15 Let u, v, and w be vectors in a vector space V, and let c be any scalar. An inner product on V is a function that associates a real number, denoted $\langle u, v \rangle$, with each pair of vectors u and v and satisfies the following properties.

(1) $\langle u, v \rangle = \langle v, u \rangle$

(2) $\langle u, v + w \rangle = \langle u, v \rangle + \langle u, w \rangle$

(3) $c \langle u, v \rangle = \langle cu, v \rangle = \langle u, cv \rangle$

(4) $\langle v, v \rangle \geq 0$, and $\langle v, v \rangle = 0$ if and only if v = 0.
It is easy to verify the above four axioms for the dot product as defined earlier. Here are some other inner products.

Example 16 Consider $\langle u, v \rangle = u_1 v_1 + 2 u_2 v_2$ on $\mathbb{R}^2$. Then

(1) By commutativity of real numbers, $\langle u, v \rangle = u_1 v_1 + 2 u_2 v_2 = v_1 u_1 + 2 v_2 u_2 = \langle v, u \rangle$.

(2) Let $w = (w_1, w_2)$. Then
\[
\langle u, v + w \rangle = u_1 (v_1 + w_1) + 2 u_2 (v_2 + w_2)
= u_1 v_1 + u_1 w_1 + 2 u_2 v_2 + 2 u_2 w_2
= (u_1 v_1 + 2 u_2 v_2) + (u_1 w_1 + 2 u_2 w_2)
= \langle u, v \rangle + \langle u, w \rangle.
\]
(3) If c is any scalar, $c \langle u, v \rangle = c (u_1 v_1 + 2 u_2 v_2) = (c u_1) v_1 + 2 (c u_2) v_2 = \langle cu, v \rangle$.

(4) Finally, because the square of a real number is nonnegative, $\langle v, v \rangle = v_1^2 + 2 v_2^2 \geq 0$, and $v_1^2 + 2 v_2^2 = 0$ implies $v_1 = v_2 = 0$, i.e., $\langle v, v \rangle = 0$ if and only if v = 0.

This example generalizes to show that products satisfying
\[
\langle u, v \rangle = c_1 u_1 v_1 + c_2 u_2 v_2 + \cdots + c_n u_n v_n, \quad c_i > 0,
\]
with positive scalars $c_i$ (called weights) are all inner products on $\mathbb{R}^n$. Note the condition $c_i > 0$. If any of the $c_i$ are zero or negative, the product is no longer an inner product.
Example 17 Consider the real-valued and continuous functions in the vector space C[a, b] (the space of all continuous functions on the interval [a, b], with a < b). Then $\langle f, g \rangle = \int_a^b f(x) g(x)\,dx$ is an inner product on C[a, b].

(1) $\langle f, g \rangle = \int_a^b f(x) g(x)\,dx = \int_a^b g(x) f(x)\,dx = \langle g, f \rangle$

(2) $\langle f, g + h \rangle = \int_a^b f(x) [g(x) + h(x)]\,dx = \int_a^b f(x) g(x)\,dx + \int_a^b f(x) h(x)\,dx = \langle f, g \rangle + \langle f, h \rangle$

(3) $c \langle f, g \rangle = c \int_a^b f(x) g(x)\,dx = \int_a^b (c f(x)) g(x)\,dx = \langle cf, g \rangle$

(4) $\langle f, f \rangle = \int_a^b f(x) f(x)\,dx \geq 0$ because $(f(x))^2 \geq 0$ for all x. Additionally, $\langle f, f \rangle = 0$ if and only if f is the zero function on [a, b].
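This integral inner product is easy to approximate numerically. The sketch below (our illustration, not part of the text) uses scipy.integrate.quad on C[0, 1] to check symmetry and positivity for a pair of sample functions.

```python
import numpy as np
from scipy.integrate import quad

def inner(f, g, a=0.0, b=1.0):
    """Approximate <f, g> = integral of f(x) g(x) over [a, b]."""
    value, _ = quad(lambda x: f(x) * g(x), a, b)
    return value

f = np.sin
g = np.exp

print(inner(f, g), inner(g, f))   # equal, illustrating <f, g> = <g, f>
print(inner(f, f) >= 0)           # True: <f, f> is nonnegative
```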
Orthogonal Projections

Review of Dot Products and Orthogonality

Recall the following:

Two vectors are said to be orthogonal if their dot product is zero, namely
\[
u \cdot v = 0 \quad \text{or} \quad u^T v = 0,
\]
where u and v are column vectors. By definition, the zero vector is orthogonal to all other vectors.

The angle θ between two vectors is given by the relation
\[
u \cdot v = \|u\| \|v\| \cos\theta \quad \text{or} \quad \cos\theta = \frac{u \cdot v}{\|u\| \|v\|}.
\]
The length or norm of a vector is given by $\|v\|^2 = v \cdot v$.

The distance between two points (or vectors) is given by
\[
d(u, v) = \|u - v\| = \|v - u\|.
\]
A set of vectors is said to be mutually orthogonal if every pair of vectors in the set is orthogonal. Additionally, if all of the vectors are unit vectors (i.e., have length one), the set is said to be orthonormal.

An orthogonal set of nonzero vectors is linearly independent.

A basis that is an orthogonal set is called an orthogonal basis. If the vectors in the basis are all of length one, the basis is called an orthonormal basis. (All of the familiar "standard" bases are orthonormal, e.g., (1, 0, 0), (0, 1, 0), (0, 0, 1).)
Orthogonal and Orthonormal Bases

Why make a big deal out of orthogonal and orthonormal bases? It turns out that the orthonormal bases of a vector space are quite useful because there is a simple formula for writing any vector in the vector space as a linear combination of those orthonormal basis vectors. We do not have to start over and solve a system of equations just to determine the coefficients of the given vector relative to the basis every single time. Here is the derivation of that formula.

Suppose we have an orthonormal basis $\{u_1, \ldots, u_n\}$ for a vector space V. If v is a vector in V, there must exist scalars $c_1, \ldots, c_n$ such that
\[
v = c_1 u_1 + c_2 u_2 + \cdots + c_n u_n. \tag{3}
\]
We seek a formula to determine each of the $c_i$'s. Start with the ith basis vector, namely $u_i$. If we take the dot product of $u_i$ with both sides of (3), we have
\[
v \cdot u_i = (c_1 u_1 + c_2 u_2 + \cdots + c_n u_n) \cdot u_i,
\]
and using the properties of dot products, this leads to
\[
v \cdot u_i = c_1\, u_1 \cdot u_i + c_2\, u_2 \cdot u_i + \cdots + c_n\, u_n \cdot u_i.
\]
Now, since the basis vectors are mutually orthogonal, we must have $u_i \cdot u_j = 0$ for any two distinct vectors in the set $\{u_1, \ldots, u_n\}$ (i.e., $u_i \cdot u_j = 0$ unless i = j). Therefore,
\[
v \cdot u_i = 0 + 0 + \cdots + c_i\, u_i \cdot u_i + \cdots + 0 + 0.
\]
Since the basis vectors are orthonormal, we know their lengths are all one, so $u_i \cdot u_i = \|u_i\|^2 = 1$, and
\[
v \cdot u_i = c_i (u_i \cdot u_i) = c_i.
\]
We have therefore found a formula for the ith coefficient $c_i$. As i ranges from 1 to n, we find that $c_1 = v \cdot u_1$, $c_2 = v \cdot u_2$, ..., $c_n = v \cdot u_n$. Consequently, we have proven the following theorem.

Theorem 18 If $\{u_1, \ldots, u_n\}$ is an orthonormal basis for a vector space V, any vector v in V can be written as a linear combination of these basis vectors as follows:
\[
v = c_1 u_1 + c_2 u_2 + \cdots + c_n u_n = (v \cdot u_1) u_1 + (v \cdot u_2) u_2 + \cdots + (v \cdot u_n) u_n.
\]
Example 19 The vectors $u_1 = (0, 1, 0)$, $u_2 = \left(\tfrac{3}{5}, 0, -\tfrac{4}{5}\right)$, and $u_3 = \left(\tfrac{4}{5}, 0, \tfrac{3}{5}\right)$ form an orthonormal basis B for $\mathbb{R}^3$. Express the vector v = (2, 3, 1) as a linear combination of these basis vectors.

Solution 20 Take the three required dot products:
\[
v \cdot u_1 = (2, 3, 1) \cdot (0, 1, 0) = 3
\]
\[
v \cdot u_2 = (2, 3, 1) \cdot \left(\tfrac{3}{5}, 0, -\tfrac{4}{5}\right) = \tfrac{2}{5}
\]
\[
v \cdot u_3 = (2, 3, 1) \cdot \left(\tfrac{4}{5}, 0, \tfrac{3}{5}\right) = \tfrac{11}{5}
\]
These scalars represent the "coordinates of v relative to the basis B," and
\[
v = 3\,(0, 1, 0) + \tfrac{2}{5} \left(\tfrac{3}{5}, 0, -\tfrac{4}{5}\right) + \tfrac{11}{5} \left(\tfrac{4}{5}, 0, \tfrac{3}{5}\right).
\]
(Multiply it out to confirm this!)

Furthermore, note that taking dot products in this manner, with the first vector the same each time, is equivalent to the following matrix multiplications:
\[
\begin{bmatrix} 2 & 3 & 1 \end{bmatrix} \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} = 3, \quad
\begin{bmatrix} 2 & 3 & 1 \end{bmatrix} \begin{bmatrix} \tfrac{3}{5} \\ 0 \\ -\tfrac{4}{5} \end{bmatrix} = \tfrac{2}{5}, \quad \text{and} \quad
\begin{bmatrix} 2 & 3 & 1 \end{bmatrix} \begin{bmatrix} \tfrac{4}{5} \\ 0 \\ \tfrac{3}{5} \end{bmatrix} = \tfrac{11}{5},
\]
and we can combine all of them into a single matrix multiplication:
\[
\begin{bmatrix} 2 & 3 & 1 \end{bmatrix}
\begin{bmatrix} 0 & \tfrac{3}{5} & \tfrac{4}{5} \\ 1 & 0 & 0 \\ 0 & -\tfrac{4}{5} & \tfrac{3}{5} \end{bmatrix}
=
\begin{bmatrix} 3 & \tfrac{2}{5} & \tfrac{11}{5} \end{bmatrix},
\]
yielding the desired coefficients of $u_1$, $u_2$, and $u_3$, respectively. (Compare this to the technique we had to use to find the coordinates of a vector relative to a nonstandard basis.)
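The single matrix multiplication at the end of Example 19 is exactly how one would compute the coordinates in code. A brief NumPy sketch (ours) with the same basis:

```python
import numpy as np

u1 = np.array([0.0, 1.0, 0.0])
u2 = np.array([3/5, 0.0, -4/5])
u3 = np.array([4/5, 0.0, 3/5])
v = np.array([2.0, 3.0, 1.0])

B = np.column_stack([u1, u2, u3])   # columns are the orthonormal basis vectors
coords = v @ B                      # row vector times B gives (v.u1, v.u2, v.u3)
print(coords)                       # [3.   0.4  2.2]
print(np.allclose(B @ coords, v))   # True: the combination reproduces v
```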
Distance and Projections

We quite often need to determine the distance between a point b and a line in the direction of vector a, as shown in the figure below. Or, we might want to determine "how much" of the force vector b is pointing in the direction of a. (We have probably all done this with respect to the coordinate axes in the former case or horizontal and vertical vector components in the latter.) Regardless of the question, the approach is the same. We need to determine the projection of b onto a, denoted by $\mathrm{proj}_a b$ and represented by p in the figure.

[Figure: vectors a and b drawn from a common origin with angle θ between them; p is the projection of b onto a, and e = b - p is the perpendicular error vector.]

It might help to think of $\mathrm{proj}_a b$ as what b would look like if you were "above" it and looking directly down at a, with a line of sight perpendicular to a.
We will now derive the formula for p. Note that p must be some scalar multiple of vector a because it is in the same direction (or the opposite direction if the angle is obtuse). Therefore, p = ca, and we need to solve for c. Of course, the point on the vector a that is closest to b would be the point at the foot of the perpendicular dropped from b onto a. In other words, the line from b to the closest point p on a would be perpendicular to a. Note that in terms of vector subtraction, the side opposite angle θ (denoted e in the figure) represents the vector subtraction e = b - p, or, because p = ca, e = b - ca. Since vector e is perpendicular to a, we must have
\[
a \cdot e = 0, \quad \text{or} \quad a \cdot (b - ca) = 0, \quad \text{or} \quad a \cdot b - c\, a \cdot a = 0,
\]
which in turn leads to the solution
\[
c = \frac{a \cdot b}{a \cdot a}.
\]
Therefore, the projection p of vector b onto a is given by
\[
p = \mathrm{proj}_a b = ca = \frac{a \cdot b}{a \cdot a}\, a. \tag{4}
\]
If we rewrite the dot products in (4) in the equivalent form $a \cdot b = a^T b$ and $a \cdot a = a^T a$, we have
\[
\mathrm{proj}_a b = \frac{a^T b}{a^T a}\, a.
\]
Realizing that this is a scalar $\frac{a^T b}{a^T a}$ multiplied by the vector a and rearranging, we have³
\[
\mathrm{proj}_a b = a\, \frac{a^T b}{a^T a} = \frac{a a^T}{a^T a}\, b.
\]
Note that the quantity $\frac{a a^T}{a^T a}$ actually represents a matrix called the projection matrix P. (It is a matrix because $a a^T$ is a column times a row (an n × 1 times a 1 × n, so the product is an n × n matrix), and $a^T a$ is the familiar dot product of a with itself.) Thus we conclude that the projection of b onto a can be found by multiplying the projection matrix $P = \frac{a a^T}{a^T a}$ by the vector b:
\[
p = Pb.
\]
³ The 1 × 1 "matrix" (i.e., scalar) $a^T a$ is called an "inner product," while the n × n matrix $a a^T$ is called the "outer product."
Example 21 The matrix that projects any vector onto the line through the point a = (1, 1, 1) is given by
\[
P = \frac{a a^T}{a^T a} = \frac{1}{3}
\begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}
\begin{bmatrix} 1 & 1 & 1 \end{bmatrix}
=
\begin{bmatrix} \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \\ \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \\ \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \end{bmatrix}.
\]
For example, to determine the projection of (2, 1, 5) onto the line through (1, 1, 1), we would simply calculate
\[
\begin{bmatrix} \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \\ \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \\ \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} \end{bmatrix}
\begin{bmatrix} 2 \\ 1 \\ 5 \end{bmatrix}
=
\begin{bmatrix} \tfrac{8}{3} \\ \tfrac{8}{3} \\ \tfrac{8}{3} \end{bmatrix}.
\]
Note again the ease with which projections can be found if the vector a has unit length. The dot product $a \cdot a$ would be 1, and the resulting formulas would become
\[
\mathrm{proj}_a b = (a \cdot b)\, a
\]
and
\[
P = a a^T.
\]
Example 22 Determine the projection of the vector v = (6, 7) onto the vector u = (1, 4).

Method 1 Using the formula $\mathrm{proj}_a b = \frac{a \cdot b}{a \cdot a}\, a$, we have
\[
\mathrm{proj}_u v = \frac{34}{17}(1, 4) = (2, 8).
\]
Method 2 Using the projection matrix $P = \frac{a a^T}{a^T a}$, we find
\[
P = \frac{1}{17}
\begin{bmatrix} 1 \\ 4 \end{bmatrix}
\begin{bmatrix} 1 & 4 \end{bmatrix}
=
\begin{bmatrix} \tfrac{1}{17} & \tfrac{4}{17} \\ \tfrac{4}{17} & \tfrac{16}{17} \end{bmatrix}.
\]
Then
\[
\mathrm{proj}_u v = Pv =
\begin{bmatrix} \tfrac{1}{17} & \tfrac{4}{17} \\ \tfrac{4}{17} & \tfrac{16}{17} \end{bmatrix}
\begin{bmatrix} 6 \\ 7 \end{bmatrix}
=
\begin{bmatrix} 2 \\ 8 \end{bmatrix},
\]
both of which agree with the figure shown below.

[Figure: the vectors u = (1, 4) and v = (6, 7) drawn from the origin, with the projection p = (2, 8) of v onto the line through u and the angle θ between u and v.]
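Both methods of Example 22 translate directly into NumPy. The following sketch (added by us) computes the projection with the scalar formula and with the projection matrix:

```python
import numpy as np

u = np.array([1.0, 4.0])
v = np.array([6.0, 7.0])

# Method 1: scalar coefficient times u.
p1 = (u @ v) / (u @ u) * u

# Method 2: projection matrix P = u u^T / (u^T u).
P = np.outer(u, u) / (u @ u)
p2 = P @ v

print(p1, p2)                 # both are [2. 8.]
print(np.allclose(P @ P, P))  # True: projecting twice changes nothing
```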
Gram-Schmidt Orthonormalization

Recall that, in $\mathbb{R}^2$, the projection of a vector v onto a nonzero vector u is given by
\[
\mathrm{proj}_u v = \frac{u \cdot v}{u \cdot u}\, u.
\]
If the vector u is of unit length, this projection becomes
\[
\mathrm{proj}_u v = \frac{u \cdot v}{u \cdot u}\, u = (u \cdot v)\, u. \tag{5}
\]
Now suppose we have a basis $\{w_1, \ldots, w_n\}$ for some vector space V and we wish to use this basis to construct an orthogonal (or orthonormal) basis $\{v_1, \ldots, v_n\}$ for V. Start by choosing
\[
v_1 = w_1
\]
(where $v_1 \neq 0$ because $w_1$ was a member of the original basis). We then require that the second vector be orthogonal to the first, or $v_1 \cdot v_2 = 0$. We've seen previously that at least one way to obtain an orthogonal vector is to consider the perpendicular dropped from v onto u in the projection $\mathrm{proj}_u v$:

[Figure: the vector v, its projection proj(v) onto u, and the perpendicular component v - proj(v), with the angle θ at the origin.]

So let's take the next vector, $v_2$, to be the perpendicular dropped from $w_2$ onto $v_1$, i.e.,
\[
v_2 = w_2 - \mathrm{proj}_{v_1} w_2. \tag{6}
\]
As confirmation of this choice, note that this will satisfy the orthogonality requirement because
\[
v_1 \cdot v_2 = v_1 \cdot \left( w_2 - \mathrm{proj}_{v_1} w_2 \right)
= v_1 \cdot w_2 - v_1 \cdot \frac{v_1 \cdot w_2}{v_1 \cdot v_1}\, v_1
= v_1 \cdot w_2 - v_1 \cdot w_2 = 0.
\]
Because $v_1 = w_1$ and $w_2$ are members of the original basis, we know they are linearly independent, and therefore $v_1$ and $v_2$ are also linearly independent; thus $v_2 = w_2 - \frac{v_1 \cdot w_2}{v_1 \cdot v_1}\, v_1 \neq 0$.
Now we need the third basis vector to be perpendicular to the first two. Note from Eq. (6) that in order to construct a new orthogonal basis vector (i.e., $v_2$), we took the next given basis vector (i.e., $w_2$) and removed the component of $w_2$ that pointed in the direction of $v_1$, our already settled basis vector. If we continue in this manner, to find $v_3$ we would subtract the components of $w_3$ in the directions of $v_1$ and $v_2$ to obtain a vector that is perpendicular to both $v_1$ and $v_2$; then to find $v_4$ we would subtract the components of $w_4$ in the directions of $v_1$, $v_2$, and $v_3$, and so on. In other words, we will take
\[
v_3 = w_3 - \mathrm{proj}_{v_1} w_3 - \mathrm{proj}_{v_2} w_3,
\]
and then
\[
v_4 = w_4 - \mathrm{proj}_{v_1} w_4 - \mathrm{proj}_{v_2} w_4 - \mathrm{proj}_{v_3} w_4,
\]
and so on. This leads to the following generalization:

Theorem 23 Gram-Schmidt Orthogonalization: Let $W = \{w_1, \ldots, w_n\}$ be a basis for a vector space V. To create a set of orthogonal basis vectors $B = \{v_1, \ldots, v_n\}$ from W, construct the $v_i$ as follows:
\[
v_1 = w_1
\]
\[
v_2 = w_2 - \mathrm{proj}_{v_1} w_2
\]
\[
v_3 = w_3 - \mathrm{proj}_{v_1} w_3 - \mathrm{proj}_{v_2} w_3
\]
\[
\vdots
\]
\[
v_n = w_n - \mathrm{proj}_{v_1} w_n - \mathrm{proj}_{v_2} w_n - \cdots - \mathrm{proj}_{v_{n-1}} w_n
\]
To create an orthonormal basis, normalize each of the vectors $v_i$.

If we normalize the vectors as we go through the process, all of the dot products, as we are reminded in (5), are easier to calculate. However, the normalization usually introduces many square roots into the calculation, which may be cumbersome to work with.

Here are some examples of this process.
Example 24 Apply the Gram-Schmidt process to the following basis for $\mathbb{R}^2$: B = {(1, 1), (0, 1)}.

Solution: Choose $v_1 = (1, 1)$. Then remove the component of $w_2 = (0, 1)$ that points in the direction of $v_1$:
\[
v_2 = w_2 - \mathrm{proj}_{v_1} w_2 = (0, 1) - \frac{(1, 1) \cdot (0, 1)}{(1, 1) \cdot (1, 1)}\,(1, 1)
= (0, 1) - \left(\tfrac{1}{2}, \tfrac{1}{2}\right)
= \left(-\tfrac{1}{2}, \tfrac{1}{2}\right).
\]
Therefore an orthogonal basis for $\mathbb{R}^2$ based on the two vectors (1, 1) and (0, 1) would be (1, 1) and $\left(-\tfrac{1}{2}, \tfrac{1}{2}\right)$. If we desire an orthonormal basis, divide each vector by its respective length, namely $\|v_1\| = \sqrt{2}$ and $\|v_2\| = \tfrac{1}{\sqrt{2}}$, so the basis would be
\[
\left(\tfrac{\sqrt{2}}{2}, \tfrac{\sqrt{2}}{2}\right) \quad \text{and} \quad \left(-\tfrac{\sqrt{2}}{2}, \tfrac{\sqrt{2}}{2}\right).
\]
Note: Had we chosen $v_1 = (0, 1)$, we would have found
\[
v_2 = (1, 1) - \frac{(0, 1) \cdot (1, 1)}{(0, 1) \cdot (0, 1)}\,(0, 1) = (1, 0),
\]
which we should have been able to guess in the first place, since (1, 0) and (0, 1) make up the standard basis for $\mathbb{R}^2$!
Example 25 Apply the Gram-Schmidt process to the following basis for a three-dimensional subspace of $\mathbb{R}^4$: B = {(1, 2, 0, 3), (4, 0, 5, 8), (8, 1, 5, 6)}.

Solution: Choose $v_1 = (1, 2, 0, 3)$. Then remove the component of $w_2 = (4, 0, 5, 8)$ that points in the direction of $v_1$:
\[
v_2 = w_2 - \mathrm{proj}_{v_1} w_2 = (4, 0, 5, 8) - \frac{(1, 2, 0, 3) \cdot (4, 0, 5, 8)}{(1, 2, 0, 3) \cdot (1, 2, 0, 3)}\,(1, 2, 0, 3)
= (4, 0, 5, 8) - (2, 4, 0, 6)
= (2, -4, 5, 2).
\]
Now remove the components of $w_3 = (8, 1, 5, 6)$ that point in the directions of $v_1$ and $v_2$:
\[
v_3 = w_3 - \mathrm{proj}_{v_1} w_3 - \mathrm{proj}_{v_2} w_3
= (8, 1, 5, 6) - \frac{(1, 2, 0, 3) \cdot (8, 1, 5, 6)}{(1, 2, 0, 3) \cdot (1, 2, 0, 3)}\,(1, 2, 0, 3) - \frac{(2, -4, 5, 2) \cdot (8, 1, 5, 6)}{(2, -4, 5, 2) \cdot (2, -4, 5, 2)}\,(2, -4, 5, 2)
\]
\[
= (8, 1, 5, 6) - (2, 4, 0, 6) - (2, -4, 5, 2)
= (4, 1, 0, -2).
\]
We conclude that the set {(1, 2, 0, 3), (2, -4, 5, 2), (4, 1, 0, -2)} constitutes an orthogonal basis for this particular subspace. We get an orthonormal basis by dividing each vector by its length:
\[
\|(1, 2, 0, 3)\| = \sqrt{14}, \qquad \|(2, -4, 5, 2)\| = 7, \qquad \|(4, 1, 0, -2)\| = \sqrt{21},
\]
so the orthonormal basis is given by
\[
\left\{ \left(\tfrac{1}{\sqrt{14}}, \tfrac{2}{\sqrt{14}}, 0, \tfrac{3}{\sqrt{14}}\right),\ \left(\tfrac{2}{7}, -\tfrac{4}{7}, \tfrac{5}{7}, \tfrac{2}{7}\right),\ \left(\tfrac{4}{\sqrt{21}}, \tfrac{1}{\sqrt{21}}, 0, -\tfrac{2}{\sqrt{21}}\right) \right\}.
\]
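The procedure in Theorem 23 is a few lines of code. The sketch below (ours, not from the text) implements classical Gram-Schmidt and reproduces the orthogonal basis of Example 25; for large or nearly dependent bases, the modified Gram-Schmidt variant is numerically preferable.

```python
import numpy as np

def gram_schmidt(W):
    """Return an orthogonal basis from the rows of W (assumed independent)."""
    V = []
    for w in W:
        v = w.astype(float).copy()
        for u in V:
            v -= (u @ w) / (u @ u) * u   # subtract the projection onto each earlier v
        V.append(v)
    return np.array(V)

W = np.array([[1.0, 2.0, 0.0, 3.0],
              [4.0, 0.0, 5.0, 8.0],
              [8.0, 1.0, 5.0, 6.0]])

V = gram_schmidt(W)
print(V)                                   # rows: (1,2,0,3), (2,-4,5,2), (4,1,0,-2)
Q = V / np.linalg.norm(V, axis=1, keepdims=True)
print(np.allclose(Q @ Q.T, np.eye(3)))     # True: the normalized rows are orthonormal
```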
Projection and Distances on Subspaces; QR-Factorization

Quick Review

We now know how to project one vector onto another vector, namely via any of the following formulas:
\[
\mathrm{proj}_u v = \frac{u \cdot v}{u \cdot u}\, u
\quad \text{or} \quad
\mathrm{proj}_u v = \frac{u^T v}{u^T u}\, u
\quad \text{or} \quad
\mathrm{proj}_u v = \frac{u u^T}{u^T u}\, v.
\]
We also know how to write any vector w in a vector space V in terms of its orthonormal basis vectors $u_1, \ldots, u_n$:
\[
w = (w \cdot u_1) u_1 + (w \cdot u_2) u_2 + \cdots + (w \cdot u_n) u_n.
\]
Finally, we've devised a way to generate an orthogonal basis $\{v_1, \ldots, v_n\}$ (which can then be normalized to an orthonormal basis) from another basis $\{w_1, \ldots, w_n\}$ via the Gram-Schmidt process:
\[
v_1 = w_1
\]
\[
v_2 = w_2 - \mathrm{proj}_{v_1} w_2
\]
\[
v_3 = w_3 - \mathrm{proj}_{v_1} w_3 - \mathrm{proj}_{v_2} w_3
\]
\[
\vdots
\]
\[
v_n = w_n - \mathrm{proj}_{v_1} w_n - \mathrm{proj}_{v_2} w_n - \cdots - \mathrm{proj}_{v_{n-1}} w_n.
\]
Projection onto a Subspace

The projection of a vector v onto a subspace tells us "how much" of the given vector v lies in that particular subspace. Put another way (and rather non-rigorously), the projection of v onto the subspace tells us "how many" of each of the subspace's orthonormal basis vectors we would need to represent v. We have met this quantity before, and you should recognize the right-hand side of the following.

Definition 26 Consider the subspace W of $\mathbb{R}^n$ and let $\{u_1, \ldots, u_k\}$ be an orthonormal basis for W. If v is a vector in $\mathbb{R}^n$, the projection of vector v onto the subspace W, denoted $\mathrm{proj}_W v$, is defined as
\[
\mathrm{proj}_W v = (v \cdot u_1) u_1 + (v \cdot u_2) u_2 + \cdots + (v \cdot u_k) u_k.
\]
This is the exact same formula we encountered when writing a vector in terms of the orthonormal basis vectors of a particular subspace! In addition, it would make sense (and we accept without proof) that every vector in $\mathbb{R}^n$ can be "decomposed" into a vector w within the subspace W and a vector $w^{\perp}$ orthogonal to W. In symbols,
\[
v = w + w^{\perp}, \quad \text{where } w \text{ is in } W \text{ and } w^{\perp} \text{ is in } W^{\perp}.
\]
It should come as no surprise, especially if one considers the two-dimensional case, that
\[
w = \mathrm{proj}_W v,
\]
and because $v = w + w^{\perp}$, we must have
\[
w^{\perp} = v - \mathrm{proj}_W v.
\]
Example 27 Suppose we have the vector v = (3, 2, 6) in $\mathbb{R}^3$, and we wish to decompose v into the sum of a vector that lies in the subspace W consisting of all vectors of the form (a, b, b) and a vector orthogonal to that subspace.

Solution: The vectors (1, 0, 0) and (0, 1, 1) span all of W and are orthogonal (hence linearly independent), and therefore form a basis for W. Normalizing, we find orthonormal basis vectors
\[
u_1 = (1, 0, 0) \quad \text{and} \quad u_2 = \left(0, \tfrac{1}{\sqrt{2}}, \tfrac{1}{\sqrt{2}}\right).
\]
Then
\[
w = \mathrm{proj}_W v = (v \cdot u_1) u_1 + (v \cdot u_2) u_2
= ((3, 2, 6) \cdot (1, 0, 0))\,(1, 0, 0) + \left((3, 2, 6) \cdot \left(0, \tfrac{1}{\sqrt{2}}, \tfrac{1}{\sqrt{2}}\right)\right) \left(0, \tfrac{1}{\sqrt{2}}, \tfrac{1}{\sqrt{2}}\right)
= (3, 0, 0) + (0, 4, 4)
= (3, 4, 4).
\]
Now,
\[
w^{\perp} = v - \mathrm{proj}_W v = (3, 2, 6) - (3, 4, 4) = (0, -2, 2).
\]
We can then conclude that (3, 4, 4) is a vector in W while (0, -2, 2) is a vector that is orthogonal to W.
Distance from a Point to a Subspace

Again, it would seem reasonable to extend the concept of "distance between points" to "distance from a point to a line" to "distance from a point to a subspace" by realizing that the latter is simply the distance of the point from its projection in the subspace. In symbols,
\[
d(x, W) = \left\| x - \mathrm{proj}_W x \right\|.
\]
Example 28 Determine the distance of the point x = (4, -1, 7) from the subspace W discussed in the previous example.

Solution: We have already found an orthonormal basis for W, namely
\[
u_1 = (1, 0, 0) \quad \text{and} \quad u_2 = \left(0, \tfrac{1}{\sqrt{2}}, \tfrac{1}{\sqrt{2}}\right).
\]
Then
\[
\mathrm{proj}_W x = (x \cdot u_1) u_1 + (x \cdot u_2) u_2
= ((4, -1, 7) \cdot (1, 0, 0))\,(1, 0, 0) + \left((4, -1, 7) \cdot \left(0, \tfrac{1}{\sqrt{2}}, \tfrac{1}{\sqrt{2}}\right)\right) \left(0, \tfrac{1}{\sqrt{2}}, \tfrac{1}{\sqrt{2}}\right)
= (4, 0, 0) + (0, 3, 3)
= (4, 3, 3).
\]
The distance of the point x from the subspace W is then
\[
\left\| x - \mathrm{proj}_W x \right\| = \|(4, -1, 7) - (4, 3, 3)\| = \|(0, -4, 4)\| = \sqrt{32}.
\]
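Both the decomposition of Example 27 and the distance in Example 28 can be checked with a few lines of NumPy (our sketch, not part of the notes), using the same orthonormal basis for W:

```python
import numpy as np

u1 = np.array([1.0, 0.0, 0.0])
u2 = np.array([0.0, 1.0, 1.0]) / np.sqrt(2)

def proj_W(v):
    """Project v onto the subspace W spanned by the orthonormal vectors u1, u2."""
    return (v @ u1) * u1 + (v @ u2) * u2

v = np.array([3.0, 2.0, 6.0])
w = proj_W(v)
print(w, v - w)                         # [3. 4. 4.] and [ 0. -2.  2.]

x = np.array([4.0, -1.0, 7.0])
print(np.linalg.norm(x - proj_W(x)))    # sqrt(32), about 5.657
```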
Orthogonal Matrices

Definition 29 An orthogonal matrix is a square matrix with orthonormal columns. Denoting this matrix Q, it is easy to determine that $Q^T Q = I$, and therefore $Q^T = Q^{-1}$. In other words, the transpose of an orthogonal matrix is its inverse.⁴

⁴ The $Q^T Q = I$ relation still works even if Q is not square. If Q is an m × n matrix with orthonormal columns, $Q^T$ would be an n × m matrix, and their product $Q^T Q$ would be the n × n identity matrix.

Example 30 Consider the rotation matrix $Q = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}$. Then $Q^T = \begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix}$, and it is easy to verify that $Q^T Q = I$. This type of matrix is called an isometry because it represents a length-preserving transformation. We can calculate the length of $(1, 2)^T$ to be $\sqrt{5}$. Then
\[
\begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}
\begin{bmatrix} 1 \\ 2 \end{bmatrix}
=
\begin{bmatrix} \cos\theta - 2\sin\theta \\ 2\cos\theta + \sin\theta \end{bmatrix},
\]
which still has a length of $\sqrt{5}$.

Example 31 All permutation matrices are orthogonal; hence we confirm that the inverse of a permutation matrix is actually its transpose.

Another important property of orthogonal matrices is that multiplication by Q preserves lengths, inner products, and angles (i.e., lengths, inner products, and angles that existed before multiplication by Q will be the same after multiplication by Q). For instance, lengths are preserved (i.e., $\|Qx\|^2 = \|x\|^2$) because $(Qx)^T (Qx) = x^T Q^T Q x = x^T x$, and inner products are preserved because $(Qx)^T (Qy) = x^T Q^T Q y = x^T y$. Therefore, the following statements are equivalent for an n × n matrix Q:

1. Q is orthogonal.

2. $\|Qx\| = \|x\|$ for every x in $\mathbb{R}^n$.

3. $Qx \cdot Qy = x \cdot y$ for every x and y in $\mathbb{R}^n$.
Note that the discussion earlier regarding the expression of a vector v as a linear combination of a subspace's orthonormal basis vectors can be reinterpreted here if we consider again the system Ax = b. This time, however, we will consider Qx = b, where the columns of Q are the orthonormal basis vectors. Then writing b as a linear combination of the basis vectors $q_1, \ldots, q_n$ simply equates to solving the system
\[
x_1 q_1 + x_2 q_2 + \cdots + x_n q_n = b, \quad \text{or} \quad Qx = b.
\]
The solution to this system is $x = Q^{-1} b$, and since $Q^{-1} = Q^T$, this becomes
\[
x = Q^T b =
\begin{bmatrix} q_1^T \\ \vdots \\ q_n^T \end{bmatrix} b
=
\begin{bmatrix} q_1^T b \\ \vdots \\ q_n^T b \end{bmatrix}, \tag{7}
\]
where the components of x are the dot products of the orthonormal basis vectors with b, as we would expect.

Note: When we projected a vector b onto a line, we ended up with the expression $\frac{a^T b}{a^T a}$. Note here that a is actually $q_i$, and because of the unit lengths, the denominator is 1. What Eq. (7) then shows is that every vector b is the sum of its one-dimensional projections onto the lines spanned by each of the orthonormal vectors $q_i$.

Note: Furthermore, because $Q^T = Q^{-1}$ we have $Q Q^T = I$ (in addition to $Q^T Q = I$). This leads to the somewhat remarkable conclusion that the rows of a square matrix are orthonormal whenever the columns are!
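A quick numerical illustration (ours, not from the text) of these properties for a rotation matrix: the transpose acts as the inverse, and lengths and dot products are unchanged.

```python
import numpy as np

theta = 0.7
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

x = np.array([1.0, 2.0])
y = np.array([-3.0, 5.0])

print(np.allclose(Q.T @ Q, np.eye(2)))            # True: Q^T Q = I
print(np.allclose(Q @ Q.T, np.eye(2)))            # True: rows are orthonormal too
print(np.linalg.norm(Q @ x), np.linalg.norm(x))   # both sqrt(5): lengths preserved
print((Q @ x) @ (Q @ y), x @ y)                   # equal: dot products preserved
```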
QR-Factorization

In the Gram-Schmidt process, we start with independent vectors in $\mathbb{R}^m$, namely $a_1, \ldots, a_n$, and end with orthonormal vectors $q_1, \ldots, q_n$ (again in $\mathbb{R}^m$). If we make these vectors the columns of matrices A and Q, respectively, we have two m × n matrices. Is there a third matrix that connects these two?

Recall that we can easily write vectors in a space as linear combinations of the vectors in any orthonormal basis of that space. Since the $q_i$ constitute an orthonormal basis, we have
\[
a_1 = (q_1^T a_1) q_1 + (q_2^T a_1) q_2 + \cdots + (q_n^T a_1) q_n
\]
\[
a_2 = (q_1^T a_2) q_1 + (q_2^T a_2) q_2 + \cdots + (q_n^T a_2) q_n
\]
\[
a_3 = (q_1^T a_3) q_1 + (q_2^T a_3) q_2 + \cdots + (q_n^T a_3) q_n
\]
\[
\vdots
\]
\[
a_n = (q_1^T a_n) q_1 + (q_2^T a_n) q_2 + \cdots + (q_n^T a_n) q_n.
\]
However, because of the manner in which the Gram-Schmidt process is performed, we know that the vector $a_1$ is orthogonal to the vectors $q_2, q_3, q_4, \ldots$, the vector $a_2$ is orthogonal to the vectors $q_3, q_4, q_5, \ldots$, the vector $a_3$ is orthogonal to the vectors $q_4, q_5, q_6, \ldots$, and so on. Therefore, all of the dot products $q_j^T a_i$ with $j > i$ will equal zero, yielding the following:
\[
a_1 = (q_1^T a_1) q_1
\]
\[
a_2 = (q_1^T a_2) q_1 + (q_2^T a_2) q_2
\]
\[
a_3 = (q_1^T a_3) q_1 + (q_2^T a_3) q_2 + (q_3^T a_3) q_3
\]
\[
\vdots
\]
\[
a_n = (q_1^T a_n) q_1 + (q_2^T a_n) q_2 + \cdots + (q_n^T a_n) q_n.
\]
Of course, this corresponds exactly to the following matrix equation:
\[
A = \underbrace{\begin{bmatrix} a_1 & a_2 & \cdots & a_n \end{bmatrix}}_{m \times n}
= \underbrace{\begin{bmatrix} q_1 & q_2 & \cdots & q_n \end{bmatrix}}_{m \times n}
\underbrace{\begin{bmatrix}
q_1^T a_1 & q_1^T a_2 & q_1^T a_3 & \cdots & q_1^T a_n \\
0 & q_2^T a_2 & q_2^T a_3 & \cdots & q_2^T a_n \\
0 & 0 & q_3^T a_3 & \cdots & q_3^T a_n \\
\vdots & \vdots & & \ddots & \vdots \\
0 & 0 & 0 & \cdots & q_n^T a_n
\end{bmatrix}}_{n \times n}
= QR,
\]
and we have arrived at the QR-factorization of matrix A, in which Q has orthonormal columns and R is upper triangular (because of how Gram-Schmidt is performed: we start with vector $a_1$, which lies on the same line as $q_1$; then $a_1$ and $a_2$ lie in the same plane as $q_1$ and $q_2$; and so on). Thus matrix R is the matrix that connects Q back to A, and we have the following theorem:

Theorem 32 Let A be an m × n matrix with linearly independent columns. Then A can be factored as A = QR, where Q is an m × n matrix with orthonormal columns and R is an invertible upper triangular matrix.
Example 33 Find a QR factorization of
\[
A = \begin{bmatrix} 1 & 2 & 2 \\ -1 & 1 & 2 \\ -1 & 0 & 1 \\ 1 & 1 & 2 \end{bmatrix}.
\]
Solution: It is easy to determine that the columns of A are linearly independent, so they form a basis for the subspace spanned by those columns (i.e., the column space of A). Start the Gram-Schmidt process by setting $v_1 = a_1$:
\[
v_1 = \begin{bmatrix} 1 \\ -1 \\ -1 \\ 1 \end{bmatrix}.
\]
Then,
\[
v_2 = \begin{bmatrix} 2 \\ 1 \\ 0 \\ 1 \end{bmatrix} - \left( \frac{v_1 \cdot a_2}{v_1 \cdot v_1} \right) \begin{bmatrix} 1 \\ -1 \\ -1 \\ 1 \end{bmatrix}
= \begin{bmatrix} 2 \\ 1 \\ 0 \\ 1 \end{bmatrix} - \frac{2}{4} \begin{bmatrix} 1 \\ -1 \\ -1 \\ 1 \end{bmatrix}
= \begin{bmatrix} \tfrac{3}{2} \\ \tfrac{3}{2} \\ \tfrac{1}{2} \\ \tfrac{1}{2} \end{bmatrix}.
\]
Note: Since we will be normalizing later, we can "rescale" $v_2$ without changing any orthogonality relationships to make future calculations easier. So we'll replace $v_2$ with $v_2' = (3, 3, 1, 1)$. Finally,
\[
v_3 = \begin{bmatrix} 2 \\ 2 \\ 1 \\ 2 \end{bmatrix} - \left( \frac{v_1 \cdot a_3}{v_1 \cdot v_1} \right) \begin{bmatrix} 1 \\ -1 \\ -1 \\ 1 \end{bmatrix} - \left( \frac{v_2' \cdot a_3}{v_2' \cdot v_2'} \right) \begin{bmatrix} 3 \\ 3 \\ 1 \\ 1 \end{bmatrix}
= \begin{bmatrix} 2 \\ 2 \\ 1 \\ 2 \end{bmatrix} - \frac{1}{4} \begin{bmatrix} 1 \\ -1 \\ -1 \\ 1 \end{bmatrix} - \frac{15}{20} \begin{bmatrix} 3 \\ 3 \\ 1 \\ 1 \end{bmatrix}
= \begin{bmatrix} -\tfrac{1}{2} \\ 0 \\ \tfrac{1}{2} \\ 1 \end{bmatrix}.
\]
We can again rescale $v_3$ to obtain $v_3' = (-1, 0, 1, 2)$. We now have an orthogonal basis $\{v_1, v_2', v_3'\}$ for the column space of A. Now, to obtain an orthonormal basis, normalize each vector (the details are left to you):
\[
\{q_1, q_2, q_3\} =
\left\{
\begin{bmatrix} 1/2 \\ -1/2 \\ -1/2 \\ 1/2 \end{bmatrix},
\begin{bmatrix} 3\sqrt{5}/10 \\ 3\sqrt{5}/10 \\ \sqrt{5}/10 \\ \sqrt{5}/10 \end{bmatrix},
\begin{bmatrix} -\sqrt{6}/6 \\ 0 \\ \sqrt{6}/6 \\ \sqrt{6}/3 \end{bmatrix}
\right\}.
\]
Now, to obtain a QR factorization for A, we have
\[
Q = \begin{bmatrix}
1/2 & 3\sqrt{5}/10 & -\sqrt{6}/6 \\
-1/2 & 3\sqrt{5}/10 & 0 \\
-1/2 & \sqrt{5}/10 & \sqrt{6}/6 \\
1/2 & \sqrt{5}/10 & \sqrt{6}/3
\end{bmatrix}.
\]
Because Q has orthonormal columns, we know that $Q^T Q = I$. Therefore, if A = QR,
\[
Q^T A = Q^T Q R = I R = R.
\]
So to find R, just calculate $Q^T A$:
\[
Q^T A =
\begin{bmatrix}
\tfrac{1}{2} & -\tfrac{1}{2} & -\tfrac{1}{2} & \tfrac{1}{2} \\
3\sqrt{5}/10 & 3\sqrt{5}/10 & \sqrt{5}/10 & \sqrt{5}/10 \\
-\sqrt{6}/6 & 0 & \sqrt{6}/6 & \sqrt{6}/3
\end{bmatrix}
\begin{bmatrix} 1 & 2 & 2 \\ -1 & 1 & 2 \\ -1 & 0 & 1 \\ 1 & 1 & 2 \end{bmatrix}
=
\begin{bmatrix}
2 & 1 & \tfrac{1}{2} \\
0 & \sqrt{5} & \tfrac{3}{2}\sqrt{5} \\
0 & 0 & \tfrac{1}{2}\sqrt{6}
\end{bmatrix}
= R.
\]
Note that the diagonal entries of R contain the lengths of the vectors $v_1$, $v_2$, and $v_3$.
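NumPy's built-in QR routine gives the same factorization up to the signs of the columns of Q (and the corresponding rows of R), since it uses Householder reflections rather than Gram-Schmidt. A comparison sketch (ours):

```python
import numpy as np

A = np.array([[1.0, 2.0, 2.0],
              [-1.0, 1.0, 2.0],
              [-1.0, 0.0, 1.0],
              [1.0, 1.0, 2.0]])

Q, R = np.linalg.qr(A)      # "reduced" QR: Q is 4x3, R is 3x3
print(R)                    # upper triangular; diagonal may differ in sign from the hand computation
print(np.allclose(Q @ R, A))            # True
print(np.allclose(Q.T @ Q, np.eye(3)))  # True: orthonormal columns
```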
Using the QR Factorization to Solve Systems

Note that the system Ax = b becomes QRx = b, and hence
\[
Rx = Q^T b \tag{8}
\]
(because $Q^{-1} = Q^T$). Because R is upper triangular, the equation in (8) can be solved easily via back substitution. For example, given the system $Ax = (0, 4, -5)^T$ and the fact that the QR factorization of A yields
\[
\begin{bmatrix} 1 & 1 & 2 \\ 1 & 0 & -2 \\ -1 & 2 & 3 \end{bmatrix}
=
\begin{bmatrix}
\tfrac{1}{\sqrt{3}} & \tfrac{4}{\sqrt{42}} & \tfrac{2}{\sqrt{14}} \\
\tfrac{1}{\sqrt{3}} & \tfrac{1}{\sqrt{42}} & -\tfrac{3}{\sqrt{14}} \\
-\tfrac{1}{\sqrt{3}} & \tfrac{5}{\sqrt{42}} & -\tfrac{1}{\sqrt{14}}
\end{bmatrix}
\begin{bmatrix}
\sqrt{3} & -\tfrac{1}{\sqrt{3}} & -\sqrt{3} \\
0 & \tfrac{\sqrt{14}}{\sqrt{3}} & \tfrac{\sqrt{21}}{\sqrt{2}} \\
0 & 0 & \tfrac{\sqrt{7}}{\sqrt{2}}
\end{bmatrix},
\]
we find
\[
Q^T b =
\begin{bmatrix}
\tfrac{1}{3}\sqrt{3} & \tfrac{1}{3}\sqrt{3} & -\tfrac{1}{3}\sqrt{3} \\
\tfrac{2}{21}\sqrt{42} & \tfrac{1}{42}\sqrt{42} & \tfrac{5}{42}\sqrt{42} \\
\tfrac{1}{7}\sqrt{14} & -\tfrac{3}{14}\sqrt{14} & -\tfrac{1}{14}\sqrt{14}
\end{bmatrix}
\begin{bmatrix} 0 \\ 4 \\ -5 \end{bmatrix}
=
\begin{bmatrix} 3\sqrt{3} \\ -\tfrac{1}{2}\sqrt{42} \\ -\tfrac{1}{2}\sqrt{14} \end{bmatrix}.
\]
Then solve
\[
Rx =
\begin{bmatrix}
\sqrt{3} & -\tfrac{1}{\sqrt{3}} & -\sqrt{3} \\
0 & \tfrac{\sqrt{14}}{\sqrt{3}} & \tfrac{\sqrt{21}}{\sqrt{2}} \\
0 & 0 & \tfrac{\sqrt{7}}{\sqrt{2}}
\end{bmatrix}
\begin{bmatrix} x \\ y \\ z \end{bmatrix}
=
\begin{bmatrix} 3\sqrt{3} \\ -\tfrac{1}{2}\sqrt{42} \\ -\tfrac{1}{2}\sqrt{14} \end{bmatrix}
\]
by back substitution to obtain
\[
x = \begin{bmatrix} 2 \\ 0 \\ -1 \end{bmatrix}.
\]
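In code, the pattern "form Q^T b, then back substitute against R" looks like the following sketch (ours), which uses SciPy's triangular solver in place of hand back substitution:

```python
import numpy as np
from scipy.linalg import solve_triangular

A = np.array([[1.0, 1.0, 2.0],
              [1.0, 0.0, -2.0],
              [-1.0, 2.0, 3.0]])
b = np.array([0.0, 4.0, -5.0])

Q, R = np.linalg.qr(A)
x = solve_triangular(R, Q.T @ b, lower=False)   # back substitution on Rx = Q^T b
print(x)                                        # approximately [ 2.  0. -1.]
```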
Least Squares and the QR Factorization

Review of Least Squares and the Normal Equations

This topic builds off of what we did in Computer Lab #10. In the lab, we learned:

In a least-squares situation, in order to minimize all of the errors (specifically, the sum of the squared distances between the "best-fit" line and the actual data points), we needed to determine the vector of the form Ax that was closest to the vector b.

This is the same as determining the projection of b onto a subspace, and that subspace was actually the column space of A.

Typically, in a least squares setting, we have many more data points than variables, so if A is m × n, then m > n, and we most likely do not have an exact solution (i.e., rarely will all the points follow the mathematical model exactly). In terms of matrix subspaces, the vector b will most likely be outside the column space of A.

However, the point p in the subspace that is closest to b would be in the column space of A, so it can be written as $p = A\hat{x}$, where $\hat{x}$ represents the "best estimate" vector for the "almost" solution vector x.

Since p is the projection of b onto the column space, the error vector we wish to minimize, i.e., $e = b - A\hat{x}$, will be orthogonal to that space.

However, if a vector is orthogonal to the column space of the matrix A, it is also orthogonal to the row space of the transpose $A^T$, and any vector orthogonal to the row space of a matrix is in the null space of that matrix. Therefore, because e is orthogonal to the column space of A, we can conclude that it is in the null space of $A^T$. This is what finally allowed us to make the following important connection:
\[
A^T (b - A\hat{x}) = 0
\]
\[
A^T b - A^T A \hat{x} = 0
\]
\[
A^T A \hat{x} = A^T b, \tag{9}
\]
the last line of which describes what are called the normal equations.

Finally, the matrix $A^T A$ is invertible exactly when the columns of A are linearly independent.⁵ Then, the best estimate $\hat{x}$, which gives us the coefficients in the mathematical model (or "line" of best fit),⁶ can be found as
\[
\hat{x} = \left( A^T A \right)^{-1} A^T b.
\]

⁵ Be careful here: because A might be rectangular, we are actually dealing with what is called a "left inverse," and the relation $(A^T A)^{-1} = A^{-1} (A^T)^{-1}$ does not hold as it does with square matrices.

⁶ I use quotes here because we are not limited to linear models with this technique.
Example 34 Find a least squares solution to the inconsistent system Ax = b, where
\[
A = \begin{bmatrix} 1 & 5 \\ 2 & -2 \\ -1 & 1 \end{bmatrix}
\quad \text{and} \quad
b = \begin{bmatrix} 3 \\ 2 \\ 5 \end{bmatrix}.
\]
Solution: Compute
\[
A^T A = \begin{bmatrix} 1 & 2 & -1 \\ 5 & -2 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 5 \\ 2 & -2 \\ -1 & 1 \end{bmatrix}
= \begin{bmatrix} 6 & 0 \\ 0 & 30 \end{bmatrix}
\]
and
\[
A^T b = \begin{bmatrix} 1 & 2 & -1 \\ 5 & -2 & 1 \end{bmatrix}
\begin{bmatrix} 3 \\ 2 \\ 5 \end{bmatrix}
= \begin{bmatrix} 2 \\ 16 \end{bmatrix}.
\]
Then the normal equations are
\[
A^T A \hat{x} = A^T b
\]
\[
\begin{bmatrix} 6 & 0 \\ 0 & 30 \end{bmatrix} \hat{x} = \begin{bmatrix} 2 \\ 16 \end{bmatrix},
\]
from which it is easy to see that $\hat{x} = \left( \tfrac{1}{3}, \tfrac{8}{15} \right)^T$.
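The normal equations of Example 34 can be formed and solved in a couple of lines (a sketch we have added; np.linalg.lstsq solves the same problem directly and more stably):

```python
import numpy as np

A = np.array([[1.0, 5.0],
              [2.0, -2.0],
              [-1.0, 1.0]])
b = np.array([3.0, 2.0, 5.0])

x_hat = np.linalg.solve(A.T @ A, A.T @ b)       # solve the normal equations
print(x_hat)                                    # [0.3333... 0.5333...] = (1/3, 8/15)

x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None) # library least squares for comparison
print(np.allclose(x_hat, x_lstsq))              # True
```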
Example 35 Find the least squares approximating line for the data points (1, 2), (2, 2), and (3, 4).

Solution: We want the line y = a + bx that best fits these three points. The appropriate system would be
\[
a + b(1) = 2
\]
\[
a + b(2) = 2
\]
\[
a + b(3) = 4,
\]
which can be reformed into Ax = b as
\[
\begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{bmatrix}
\begin{bmatrix} a \\ b \end{bmatrix}
= \begin{bmatrix} 2 \\ 2 \\ 4 \end{bmatrix}.
\]
Again, compute
\[
A^T A = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 3 \end{bmatrix}
\begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{bmatrix}
= \begin{bmatrix} 3 & 6 \\ 6 & 14 \end{bmatrix}
\quad \text{and} \quad
A^T b = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 2 & 3 \end{bmatrix}
\begin{bmatrix} 2 \\ 2 \\ 4 \end{bmatrix}
= \begin{bmatrix} 8 \\ 18 \end{bmatrix}.
\]
Solving
\[
\begin{bmatrix} 3 & 6 \\ 6 & 14 \end{bmatrix} \hat{x} = \begin{bmatrix} 8 \\ 18 \end{bmatrix}
\]
leads to the solution $\hat{x} = \left( \tfrac{2}{3}, 1 \right)^T$, so the equation for the line of best fit would be $y = \tfrac{2}{3} + x$.

[Figure: plot of the three data points (1, 2), (2, 2), and (3, 4) together with the best-fit line y = 2/3 + x.]

While we're at it, we can also calculate the actual least squares error. If $\hat{x}$ represents the least squares solution of Ax = b, then $A\hat{x}$ is the vector in the column space of A that is closest to b. The distance from b to $A\hat{x}$ is simply the length of the perpendicular component of the projection of b onto the column space of A. In symbols,
\[
\|e\| = \|b - A\hat{x}\|.
\]
Now,
\[
e = b - A\hat{x} = \begin{bmatrix} 2 \\ 2 \\ 4 \end{bmatrix}
- \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{bmatrix}
\begin{bmatrix} \tfrac{2}{3} \\ 1 \end{bmatrix}
= \begin{bmatrix} \tfrac{1}{3} \\ -\tfrac{2}{3} \\ \tfrac{1}{3} \end{bmatrix},
\]
and the length of e is then
\[
\sqrt{ \left(\tfrac{1}{3}\right)^2 + \left(\tfrac{2}{3}\right)^2 + \left(\tfrac{1}{3}\right)^2 } = \sqrt{\tfrac{2}{3}} \approx 0.816.
\]
Least Squares and the QR Factorization

One major advantage of orthogonalization is that it greatly simplifies the least squares problem Ax = b. The normal equations from (9) are still
\[
A^T A \hat{x} = A^T b,
\]
but with the QR factorization, $A^T A$ becomes
\[
A^T A = (QR)^T (QR) = R^T Q^T Q R = R^T R \quad \text{(because } Q^T Q = I\text{)}.
\]
Then, the equations in (9) become
\[
A^T A \hat{x} = A^T b
\]
\[
R^T R \hat{x} = R^T Q^T b,
\]
or,
\[
R \hat{x} = Q^T b. \tag{10}
\]
Although this may not look like much of an improvement, it most certainly is, particularly because R is upper triangular. Therefore, the solution to (10) can be found via back substitution. We still need to use Gram-Schmidt to produce Q and R, but the payoff is that the equations in (10) are less prone to numerical inaccuracies such as round-off error.

Example 36 Consider the previous example in which we found the line of best fit for the points (1, 2), (2, 2), and (3, 4). If we instead find the QR factorization, we have
\[
A = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{bmatrix}
= \begin{bmatrix}
\tfrac{1}{3}\sqrt{3} & -\tfrac{1}{2}\sqrt{2} \\
\tfrac{1}{3}\sqrt{3} & 0 \\
\tfrac{1}{3}\sqrt{3} & \tfrac{1}{2}\sqrt{2}
\end{bmatrix}
\begin{bmatrix} \sqrt{3} & 2\sqrt{3} \\ 0 & \sqrt{2} \end{bmatrix}
= QR.
\]
Then $R\hat{x} = Q^T b$ becomes
\[
\begin{bmatrix} \sqrt{3} & 2\sqrt{3} \\ 0 & \sqrt{2} \end{bmatrix} \hat{x}
=
\begin{bmatrix}
\tfrac{1}{3}\sqrt{3} & \tfrac{1}{3}\sqrt{3} & \tfrac{1}{3}\sqrt{3} \\
-\tfrac{1}{2}\sqrt{2} & 0 & \tfrac{1}{2}\sqrt{2}
\end{bmatrix}
\begin{bmatrix} 2 \\ 2 \\ 4 \end{bmatrix}
=
\begin{bmatrix} \tfrac{8}{3}\sqrt{3} \\ \sqrt{2} \end{bmatrix}.
\]
Hence, $\sqrt{2}\, b = \sqrt{2} \implies b = 1$, and so $\sqrt{3}\, a + 2\sqrt{3}\,(1) = \tfrac{8}{3}\sqrt{3} \implies a = \tfrac{2}{3}$, as we found earlier.
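The QR route of Example 36 in code (our sketch): factor A once, then back substitute R x̂ = Q^T b.

```python
import numpy as np
from scipy.linalg import solve_triangular

A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([2.0, 2.0, 4.0])

Q, R = np.linalg.qr(A)                  # reduced QR: Q is 3x2, R is 2x2
x_hat = solve_triangular(R, Q.T @ b)    # back substitution on R x_hat = Q^T b
print(x_hat)                            # [0.6666... 1.] = (2/3, 1), the intercept and slope
```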
An Aside: Least Squares and Calculus

Consider the simple system
\[
a_1 x = b_1
\]
\[
a_2 x = b_2
\]
\[
a_3 x = b_3.
\]
This is solvable only if $b_1$, $b_2$, and $b_3$ are in the ratio $a_1 : a_2 : a_3$. In practice, this would rarely be the case if the above equations came from "real" data. So, instead of trying to solve the unsolvable, we proceed by choosing an x that minimizes the average error E in the equations. A convenient error measurement to use is the "sum of squares," namely
\[
E^2 = (a_1 x - b_1)^2 + (a_2 x - b_2)^2 + (a_3 x - b_3)^2.
\]
If there were an exact solution, we would have E = 0. If there is not an exact solution, we can find the minimum error by setting the derivative of $E^2$ equal to zero,
\[
\frac{dE^2}{dx} = 2\left[ (a_1 x - b_1) a_1 + (a_2 x - b_2) a_2 + (a_3 x - b_3) a_3 \right] = 0,
\]
and then solving for x:
\[
0 = 2\left( (a_1 x - b_1) a_1 + (a_2 x - b_2) a_2 + (a_3 x - b_3) a_3 \right)
= 2x a_1^2 + 2x a_2^2 + 2x a_3^2 - 2a_1 b_1 - 2a_2 b_2 - 2a_3 b_3
\]
\[
\implies x = \frac{a_1 b_1 + a_2 b_2 + a_3 b_3}{a_1^2 + a_2^2 + a_3^2} = \frac{a^T b}{a^T a}.
\]
This result, which you should recognize as the coefficient in the projection calculations, gives us the least-squares solution to a problem ax = b in one variable x.