Lecture Notes
Plamen Koev
Contents
Chapter 1. Preliminaries
1. Motivation: why study numerical linear algebra?
2. Matrices: notation and operations
3. Norms of vectors and matrices
4. Absolute and relative errors
5. Computer arithmetic
6. Writing for loops: chasing entries around a matrix
7. Determining the complexity of an algorithm
Bibliography
CHAPTER 1
Preliminaries
1. Motivation: why study numerical linear algebra?
The unique challenges of numerical linear algebra are likely to keep it a vibrant area of research and its
techniques important despite continuing advances in computing technology. Here are some of the reasons.
The computational bottleneck in most mathematical models is a linear system or an eigenvalue problem.
Most natural phenomena are described by differential equations. Modeling those phenomena requires solving these differential equations, which is rarely possible explicitly. Thus one attempts to discretize and solve them numerically. This usually means picking a set of points and obtaining a linear problem for the quantities of interest (volume, temperature, pressure, etc.) at these points.
Efficiency
Computers are slow, so to speak. Everyone expects to get answers within a reasonable amount
of time, preferably immediately. Results within seconds would be perfect, but waiting minutes
might be alright. Predicting weather only makes sense if the data comes back before we see what
the weather really is. Waiting months for computational results is unlikely to be acceptable in any
situation. So, how large of a linear system can be solved in (say) less than a day?
If we're modeling a 3D phenomenon on a uniform 100-by-100-by-100 mesh, this means we need to solve at $n = 1{,}000{,}000$ points, i.e., a million-by-million linear system. This mesh isn't all that dense to give great approximations, but it's a start. With the average unstructured algorithm requiring $O(n^3)$ operations, this means $(10^6)^3 = 10^{18}$ operations. How long will it take to perform that many operations? Modern machines run a few billion operations per second at best, say $10^{10}$ to be optimistic. So we need $10^8$ seconds to solve our problem. This is 3.16 years, give or take.
There are a lot of assumptions in the above calculation, but you get the picture. You can't just throw MATLAB's linear solver at a problem and expect to solve a decent size problem.
There are, of course, adaptive meshes, but even those only go so far. There are structure-exploiting algorithms that run in less than $O(n^3)$ time, but these algorithms need to be tailored to every problem (so you need to know numerical linear algebra). There are multicore processors and parallel computers that can perform multiple operations simultaneously, but you need to know and understand the algorithms in order to properly divide the work among the processors. This alone is a highly nontrivial task.
Efficiency is about performing the minimum number of operations to accomplish a given computational task. For example, $x^8$ can be computed with 7 multiplications in the obvious way $x \cdot x \cdots x$, or with 3 as $((x^2)^2)^2$. Thus the obvious way to compute $x^n$ takes $O(n)$ operations, but the clever, repeated-squaring method takes $O(\log n)$, which is a lot better. Ultimately one can compute $x^n$ as $e^{n \ln x}$, which takes a fixed amount of time for any $n$ and $x$, i.e., $O(1)$ operations, which is best.
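As an illustration, here is a minimal MATLAB sketch of the repeated-squaring idea (the function name and structure are ours, not part of the text):

function p = powsq(x, n)
% Compute x^n with O(log n) multiplications by repeated squaring.
% Assumes n is a nonnegative integer.
p = 1;
while n > 0
    if mod(n, 2) == 1   % current binary digit of n is 1
        p = p*x;        % fold the current power of x into the result
    end
    x = x*x;            % square
    n = floor(n/2);     % move to the next binary digit
end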
Accuracy. For reasons that require their own course to fully explain and understand, computations are performed in IEEE binary floating point arithmetic. This means almost every number gets rounded before it's stored, and almost every floating point operation results in a rounding error.
These errors are tiny, but they can accumulate. A lot. If one tries to compute the determinant of $\begin{pmatrix} 1 & 3 \\ 3 & 9 \end{pmatrix}$ in MATLAB, one deservedly gets 0.
>> det([1 3; 3 9])
ans = 0
If, however, one tries the determinant of one tenth of that, i.e., $\begin{pmatrix} 0.1 & 0.3 \\ 0.3 & 0.9 \end{pmatrix}$, one gets
>> det([0.1 0.3; 0.3 0.9])
ans = 1.6653e-17
The problem is that 0.1, etc., are not finite binary fractions, so the 0.1 is approximated by the closest binary fraction that is representable in the computer. The determinant of the stored matrix then may no longer be zero, a matter further complicated by the fact that the already-rounded input quantities are subject to further rounding errors as the determinant is being computed.
So, instead of a zero we got a tiny quantity, and we need to recognize when tiny quantities are really zeros.
It is now clear that:
• we can't even input most quantities exactly into the computer; most will get rounded as they are stored;
• the rounded input quantities are subject to further rounding errors during arithmetic operations.
So, what results can we possibly trust? Computers are widely used, so we can trust a lot, but one of the main goals of this course is to learn how to recognize the abilities as well as the limitations of numerical linear algebra algorithms.
2. Matrices: notation and operations

An $m \times n$ matrix $A$ is an array
$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix},$$
also denoted as $A = [a_{ij}]_{i,j=1}^{m,n}$.
Identity matrix:
$$I = \begin{pmatrix} 1 & & \\ & \ddots & \\ & & 1 \end{pmatrix}.$$
Transpose:
$$A^T = \begin{pmatrix} a_{11} & a_{21} & \cdots & a_{m1} \\ a_{12} & a_{22} & \cdots & a_{m2} \\ \vdots & \vdots & \ddots & \vdots \\ a_{1n} & a_{2n} & \cdots & a_{mn} \end{pmatrix},$$
or, equivalently, $A^T = [a_{ji}]_{i,j=1}^{n,m}$. The conjugate transpose is defined as $A^* = [\bar a_{ji}]_{i,j=1}^{n,m}$.
A matrix A is upper triangular if it is zero below its main diagonal, i.e., if aij = 0 for all i > j:
$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ & a_{22} & \cdots & a_{2n} \\ & & \ddots & \vdots \\ & & & a_{nn} \end{pmatrix}.$$
$A$ is called a unit upper triangular matrix if, in addition, it has ones on the main diagonal, i.e., $a_{11} = a_{22} = \cdots = a_{nn} = 1$. (Unit) lower triangular matrices are defined analogously.
A tridiagonal matrix is a matrix that is nonzero only on its diagonal and its first super- and subdiagonals, i.e., $a_{ij} = 0$ unless $|i - j| \le 1$:
$$A = \begin{pmatrix} a_1 & b_1 & & \\ c_1 & a_2 & \ddots & \\ & \ddots & \ddots & b_{n-1} \\ & & c_{n-1} & a_n \end{pmatrix}.$$
A matrix $A$ is bidiagonal if it is nonzero only on its main diagonal and first superdiagonal:
$$A = \begin{pmatrix} a_1 & b_1 & & \\ & a_2 & \ddots & \\ & & \ddots & b_{n-1} \\ & & & a_n \end{pmatrix}.$$
Matrix-vector product:
$$\begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} = x_1 \begin{pmatrix} a_{11} \\ a_{21} \\ \vdots \\ a_{m1} \end{pmatrix} + x_2 \begin{pmatrix} a_{12} \\ a_{22} \\ \vdots \\ a_{m2} \end{pmatrix} + \cdots + x_n \begin{pmatrix} a_{1n} \\ a_{2n} \\ \vdots \\ a_{mn} \end{pmatrix}.$$
Matrix-matrix product. If $A = [a_{ij}]_{i,j=1}^{m,n}$ is $m \times n$ and $B = [b_{ij}]_{i,j=1}^{n,k}$ is $n \times k$, then the product $C = AB = [c_{ij}]_{i,j=1}^{m,k}$ is an $m \times k$ matrix, where
$$c_{ij} = \sum_{t=1}^{n} a_{it} b_{tj},$$
and
$$(AB)^T = B^T A^T.$$
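The formula for $c_{ij}$ translates directly into the classic triple loop; the following sketch (illustrative, not optimized) performs exactly $2mnk$ arithmetic operations:

% C = A*B from the definition; A is m-by-n, B is n-by-k.
[m, n] = size(A);
k = size(B, 2);
C = zeros(m, k);
for i = 1:m
    for j = 1:k
        for t = 1:n
            C(i,j) = C(i,j) + A(i,t)*B(t,j);   % one multiplication, one addition
        end
    end
end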
Symmetric and Hermitian matrices. A real matrix such that $A^T = A$ is called symmetric. A (complex) matrix such that $A^* = A$ is called Hermitian.
A symmetric/Hermitian positive definite matrix is a symmetric/Hermitian matrix such that $x^* A x > 0$ for all nonzero vectors $x$. This condition is equivalent to all eigenvalues being positive.
Orthogonal/unitary matrices. A real $m \times n$, $m \ge n$, matrix $Q$ is orthogonal if $Q^T Q = I$. A complex $m \times n$, $m \ge n$, matrix $Q$ is unitary if $Q^* Q = I$. In either case, when $Q$ is square, $Q^{-1} = Q^T$ (respectively $Q^*$).
Note that a matrix need not be square in order to be orthogonal/unitary, but it does need to have at least as many rows as it has columns.
A product of orthogonal matrices is also an orthogonal matrix, namely, if $P$ and $Q$ are orthogonal, then $(PQ)^T PQ = Q^T P^T P Q = Q^T Q = I$.
Eigenvalues and eigenvectors. We say that $\lambda$ is an eigenvalue of an $n \times n$ matrix $A$ if $Ax = \lambda x$ for some nonzero vector $x$, called a (right) eigenvector. A left eigenvector is one such that $y^T A = \lambda y^T$.
An $n \times n$ matrix has $n$ eigenvalues. Those are the roots of the characteristic polynomial $\det(A - \lambda I)$ (each taken with its multiplicity).
When $A$ is $n \times n$ and symmetric, it has $n$ real eigenvalues and $n$ eigenvectors. The eigenvectors can be chosen to form an orthonormal set. When all eigenvectors are placed in an (orthogonal) matrix $Q$, we have $A = Q \Lambda Q^T$, where $\Lambda = \operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$ and $\lambda_1, \lambda_2, \ldots, \lambda_n$ are the (real) eigenvalues. The left and right eigenvectors corresponding to the same eigenvalue are the same.
Problems:
(1) Prove that the product of two lower triangular matrices is lower triangular.
(2) Prove that the inverse of a lower triangular invertible matrix is lower triangular.
(3) Prove that if a triangular matrix is orthogonal, then it must be diagonal. What are the elements
on the diagonal?
(4) Prove that if the m n, m n matrix A is orthogonal and k < n, then the m k matrix B
obtained by taking the first k columns of A is also orthogonal.
(5) Give an example or explain why, if A is m n, m > n, orthogonal, then AT is not orthogonal.
(6) Prove that if A and B are orthogonal, then AB is also orthogonal.
$$\|x\|_\infty = \max_j |x_j|, \qquad \|x\|_2 = \sqrt{|x_1|^2 + |x_2|^2 + \cdots + |x_n|^2}$$
(the absolute values in the two-norm are needed for complex vectors).
For the 2-norm we have, very importantly,
$$\|x\|_2 = \sqrt{x^* x}.$$
Also, if $Q$ is orthogonal (or unitary), then
$$\|Qx\|_2 = \sqrt{(Qx)^*(Qx)} = \sqrt{x^* Q^* Q x} = \sqrt{x^* x} = \|x\|_2.$$
An operator (induced) matrix norm is defined as
$$\|A\| = \max_{x \ne 0} \frac{\|Ax\|}{\|x\|},$$
and it satisfies $\|AB\| \le \|A\|\,\|B\|$ (submultiplicativity). Equivalently (since $\frac{\|Ax\|}{\|x\|} = \big\| A \frac{x}{\|x\|} \big\|$),
$$\|A\| = \max_{\|x\|=1} \|Ax\|.$$
Theorem 3.1. The following are the induced 1, 2, and $\infty$ norms for matrices:
$$\|A\|_1 = \max_j \sum_i |a_{ij}|; \qquad \|A\|_\infty = \max_i \sum_j |a_{ij}|; \qquad \|A\|_2 = \sqrt{\lambda_{\max}(A^T A)}.$$
For symmetric matrices $\|A\|_2 = |\lambda|_{\max}(A)$, the absolute value of the largest-by-magnitude eigenvalue of $A$.
Proof. We prove the third part only. Let $A^T A = Q^T \Lambda Q$ be the eigendecomposition of $A^T A$, where $\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_n)$, $\lambda_1 \ge \cdots \ge \lambda_n \ge 0$. Then, since $\|Qx\|_2 = \|x\|_2$,
$$\|A\|_2^2 = \max_{\|x\|_2 = 1} \|Ax\|_2^2 = \max_{\|x\|_2 = 1} x^T A^T A x = \max_{\|x\|_2 = 1} x^T Q^T \Lambda Q x = \max_{\|Qx\|_2 = 1} (Qx)^T \Lambda (Qx) = \max_{\|y\|_2 = 1} y^T \Lambda y,$$
and
$$y^T \Lambda y = y_1^2 \lambda_1 + \cdots + y_n^2 \lambda_n \le \lambda_1 (y_1^2 + \cdots + y_n^2) = \lambda_1,$$
with equality for $y = e_1 = (1, 0, 0, \ldots, 0)^T$. Thus $\|A\|_2 = \sqrt{\lambda_1}$.
The matrix 2-norm is unaffected by orthogonal (unitary) transformations for the same reason the vector 2-norm is: $\|QA\|_2 = \|A\|_2$.
The Frobenius norm is a matrix norm which is not an operator norm:
$$\|A\|_F = \sqrt{\sum_{i,j} |a_{ij}|^2} = \sqrt{\operatorname{trace}(A^* A)}.$$
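These formulas are easy to check in MATLAB; the following sanity checks (our own, using only built-in functions) should all print numbers of order eps:

A = rand(5);
norm(A,1)     - max(sum(abs(A),1))      % induced 1-norm: largest column sum
norm(A,inf)   - max(sum(abs(A),2))      % induced inf-norm: largest row sum
norm(A,2)     - sqrt(max(eig(A'*A)))    % 2-norm: sqrt of lambda_max(A'A)
norm(A,'fro') - sqrt(trace(A'*A))       % Frobenius norm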
Problems.
(1) Prove that if $W$ is symmetric positive definite, then $\|x\|_W = \sqrt{x^T W x}$ is a vector norm.
(2) Find the 1, 2, and $\infty$ norms of the vectors
$$x = \begin{pmatrix} 1 \\ 1 \\ 3 \end{pmatrix}, \qquad y = \begin{pmatrix} 0 \\ 2 \\ 4 \end{pmatrix}.$$
(3) Find the 1 and $\infty$ norms of the matrices
$$A = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 0 & 0 \\ 3 & 1 & 7 \end{pmatrix}, \qquad B = \begin{pmatrix} 0 & 1 & 1 \\ 1 & 1 & 2 \\ 3 & 2 & 4 \end{pmatrix}.$$
(4) Find the 2-norms of the matrices
$$A = \begin{pmatrix} 1 & 2 \\ 1 & 2 \end{pmatrix}, \qquad B = \begin{pmatrix} 0 & 2 \\ 2 & 3 \end{pmatrix}.$$
(5) Prove that $\|I\| \ge 1$ for any matrix norm and that $\|I\| = 1$ for any induced norm.
(6) Prove that $\|Q\|_2 = 1$ for any orthogonal matrix $Q$.
Consider the vector
$$x = \begin{pmatrix} 100{,}000 \\ 1{,}000 \\ 10 \end{pmatrix}.$$
Suppose we have a computed solution $\hat x$ with a relative norm error $10^{-3}$, namely,
$$\frac{\|x - \hat x\|}{\|x\|} = 10^{-3}.$$
Since $\|x\| = 100{,}000$, this means $\|x - \hat x\| = 10^5 \cdot 10^{-3} = 10^2$.
For the individual entries $\hat x_i$, $i = 1, 2, 3$, we have
$$|x_i - \hat x_i| \le \|x - \hat x\| = 10^2.$$
This error bound means different things for $\hat x_1$, $\hat x_2$, and $\hat x_3$ because of their magnitudes:
$$|x_1 - \hat x_1| \le 10^2 \text{ means } \frac{|x_1 - \hat x_1|}{|x_1|} \le \frac{10^2}{|x_1|} = 10^{-3}, \text{ i.e., } \hat x_1 \text{ is accurate to at least 3 decimal digits;}$$
$$|x_2 - \hat x_2| \le 10^2 \text{ means } \frac{|x_2 - \hat x_2|}{|x_2|} \le \frac{10^2}{|x_2|} = 10^{-1}, \text{ i.e., } \hat x_2 \text{ is accurate to about 1 decimal digit;}$$
$$|x_3 - \hat x_3| \le 10^2 \text{ means } \frac{|x_3 - \hat x_3|}{|x_3|} \le \frac{10^2}{|x_3|} = 10, \text{ i.e., } \hat x_3 \text{ may have no accurate digits at all.}$$
Understanding the correct implications about the number of correct digits in each computed quantity
based on absolute and relative error bounds is critical in numerical linear algebra.
Problems.
(1) If $\hat x = 1.1234567890 \times 10^{40}$ is computed with relative error $10^{-6}$, write down all digits of $x$ that you can infer to be correct.
(2) If the vector
$$\hat x = \begin{pmatrix} 1{,}000.123456 \\ 10.123456 \\ 0.01234 \end{pmatrix}$$
is known to be computed with a relative error satisfying
$$\frac{\|x - \hat x\|}{\|x\|} \le 10^{-6},$$
write down the vector $x$ and in every entry specify only the digits which you know to be correct.
5. Computer arithmetic
Pretty much all computers nowadays conform to the IEEE standard for binary floating point arithmetic [1]. In the most popular double precision, the numbers take 64 bits to represent: 1 for the sign, 11 for the exponent, and 52 for the fraction. Any nonzero number can be written as
$$(-1)^s \, 2^{e - 1023} \cdot (1.f_1 f_2 \ldots f_{52})_2,$$
where $1 \le e \le 2046$ is the exponent ($e = 0$ and $e = 2047$ are special; we discuss them below) and $f = f_1 f_2 \ldots f_{52}$ are the fraction bits. Since in binary the first significant digit is 1, we get an extra, 53rd significant digit for free: it does not need to be stored (the leading significant digit of a binary number is always 1!).
The binary floating point representation of the number 0 is $e = 0$, $f = 0$. It can have either sign. By convention $+0 = -0$.
What about the remaining floating point numbers with $e = 0$, i.e., the ones with $e = 0$, $f \ne 0$? These are called subnormal numbers, because the assumption of a leading significant digit 1 is dropped. In every other respect, these are perfectly functional floating point numbers of the form
$$(-1)^s \, 2^{-1022} \cdot (0.f_1 f_2 \ldots f_{52})_2$$
and serve an extremely important purpose: to ensure that the smallest distance between floating point numbers is also a floating point number. This way $a \ne b$ means $\operatorname{fl}(a - b) \ne 0$. Note that even though the exponent field is $e = 0$, it is treated as if it were 1 (thus the power of 2 is $-1022$, not $-1023$).
The exponent $e = 2047$ (all 1s in binary) is also special.
If $e = 2047$, $f = 0$, this is infinity, or INF. It behaves like infinity in the sense that $1/\mathrm{INF} = 0$ and $1/0 = \mathrm{INF}$. Why infinity? In a well designed calculation it should never appear, right? If it appears, something went wrong and the programmer should be notified. But what happens if this is performed on an unmanned craft in outer space? One may as well let the calculations continue. For example, say we have a parallel two-resistor circuit where the overall resistance is
$$R = \frac{1}{\frac{1}{R_1} + \frac{1}{R_2}}.$$
Shorts shouldn't happen, but what if one did? If $R_1 = 0$, floating point arithmetic will still correctly return the overall resistance $R = 0$.
If $e = 2047$ and $f \ne 0$, this is NaN, or Not-a-Number. It is an indication that a $0/0$ or a similar calculation occurred.
The range of floating point numbers is from $2^{-1074}$ (the smallest subnormal number: $e = 0$, $f_1 = \cdots = f_{51} = 0$, $f_{52} = 1$) to $2^{1023}(2 - 2^{-52}) \approx 2^{1024}$; these are also the underflow and overflow thresholds, respectively.
When arithmetic operations are performed, the result is always the correctly rounded true answer. In other words, one can think of each of the operations $+, -, \times, /$ as performed exactly, with the result then rounded to the nearest floating point number. Without belaboring the point further, we will use a simplified model in which the result of any floating point computation is
$$\operatorname{fl}(a \odot b) = (a \odot b)(1 + \delta), \tag{5.1}$$
where $\odot$ is any of $+, -, \times, /$ and $|\delta| \le \varepsilon \approx 1.1 \times 10^{-16}$, the machine precision.
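The model (5.1) is easy to observe in MATLAB (a small illustration of ours):

eps == 2^-52            % true: eps is the gap between 1 and the next float
0.1 + 0.2 == 0.3        % false: each quantity is rounded, cf. (5.1)
(0.1 + 0.2) - 0.3       % about 5.6e-17, a tiny nonzero number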
Catastrophic cancellation in floating point. How is accuracy lost in floating point? It turns out that addition (of same-sign quantities), multiplication, and division preserve the relative accuracy of intermediate results. In other words, if we have two intermediate computed quantities $\hat a > 0$ and $\hat b > 0$ accurate to (say) 10 digits, then $\hat a + \hat b$, $\hat a \hat b$, and $\hat a / \hat b$ will also be accurate to about 10 digits.
In contrast, the difference $\hat a - \hat b$ may be inaccurate, depending on how close $a$ is to $b$. This is known as subtractive cancellation and is the way accuracy is lost in numerical linear algebra. Understanding this helps in the design of accurate algorithms.
First, we explain why addition, multiplication, and division preserve the relative accuracy. Let $\hat a = a(1 + \delta_1)$ and $\hat b = b(1 + \delta_2)$, with $|\delta_1| \le k\varepsilon$ and $|\delta_2| \le l\varepsilon$ for some modest $k$ and $l$. Assume additionally that $k \ge l$.
In the following calculations we will ignore terms which are products of 2 or more $\delta$'s, e.g., $\delta_1\delta_2$, which are $O(\varepsilon^2)$, so too tiny to matter. This is done for convenience here and does not compromise the analysis; there is a method which does account for all $\delta$'s and yields the same conclusions, see, e.g., Higham [5]. In particular, we write
$$\frac{1}{1 + \delta_1} = 1 - \delta_1 + \delta_1^2 - \cdots = 1 - \delta_1.$$
Then the relative errors for $+, \times, /$ are as follows.
Addition: We have $\operatorname{fl}(\hat a + \hat b) = (\hat a + \hat b)(1 + \delta_3)$, where $|\delta_3| \le \varepsilon$. Therefore
$$\frac{|\operatorname{fl}(\hat a + \hat b) - (a + b)|}{a + b} = \frac{|(\hat a + \hat b)(1 + \delta_3) - (a + b)|}{a + b} = \frac{|(a(1+\delta_1) + b(1+\delta_2))(1+\delta_3) - (a + b)|}{a + b} = \frac{|a\delta_1 + b\delta_2 + (a+b)\delta_3|}{a + b} \le \frac{a|\delta_1| + b|\delta_2| + (a+b)|\delta_3|}{a + b} \le (k + 1)\varepsilon$$
(we've dropped the $\delta_1\delta_3$ and $\delta_2\delta_3$ terms).
Division: Similarly,
$$\frac{\operatorname{fl}(\hat a / \hat b) - a/b}{a/b} = (1 + \delta_1)\frac{1}{1 + \delta_2}(1 + \delta_3) - 1 = \delta_1 - \delta_2 + \delta_3,$$
which is at most $(k + l + 1)\varepsilon$ in absolute value.
We see that in all three cases ($+, \times, /$) the relative errors accumulate modestly, with the resulting quantity still having about as many correct digits as the two arguments. One also shouldn't worry about the $k$ and $l$ coefficients in the relative errors. In practical computations with $n \times n$ matrices these typically grow as $O(n)$, leaving plenty of correct digits even for $n$ in the millions.
Subtractions are tricky: following the same calculation as for addition, the relative error in subtraction is easily seen to be
$$\frac{|\operatorname{fl}(\hat a - \hat b) - (a - b)|}{|a - b|} = \frac{|a\delta_1 - b\delta_2 + (a - b)\delta_3|}{|a - b|} \le \frac{a|\delta_1| + b|\delta_2|}{|a - b|} + |\delta_3| \le \frac{a + b}{|a - b|}\,k\varepsilon + \varepsilon.$$
Unlike $+, \times, /$, this relative error now depends on $a$ and $b$! In particular, if $|a - b|$ is much smaller than $a + b$, i.e., when $a$ and $b$ share a few significant digits, the relative error can be enormous!
For example, if $a$ and $b$ have 7 digits in common, e.g., $a = 1.234567444444444$ and $b = 1.234567333333333$, then
$$\frac{a + b}{|a - b|} \approx 10^7$$
and $\operatorname{fl}(\hat a - \hat b)$ will have 7 fewer correct significant digits than either $\hat a$ or $\hat b$!
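This is easy to reproduce (our own illustration):

format long
a = 1.234567444444444;
b = 1.234567333333333;
a - b     % about 1.1111111e-07: the 7 shared leading digits cancel,
          % taking roughly 7 of the 16 digits of accuracy with them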
Problems.
(1) In double precision IEEE 754 arithmetic (e.g., in MATLAB), what is the largest floating point number $a$ so that $\operatorname{fl}(100 + a) = 100$? Specify the exact value. For example, $a = 2^{-100}$ is one such number, but is not the largest.
(2) (Hard) Prove that if $a$ and $b$ are binary floating point numbers, then $a/b$ can't fall exactly halfway between two floating point numbers. In other words, there will never be an ambiguity in how to round a fraction.
Solution: Say it's $k$-digit binary floating point arithmetic and we have two numbers $a$ and $b$ such that $c = a/b$ is exactly halfway between two floating point numbers, i.e., $c$ has exactly $k + 1$ significant digits. The first digit of $c$ must be 1 (it always is), and the $(k+1)$st must also be 1 so that $c$ falls exactly halfway between two floating point numbers. Now $bc$, regardless of what $b$ is, must have at least $k + 1$ significant digits, which is impossible since $bc = a$, which only has $k$ significant digits.
(3) (Hard) If $2b > a > b > 0$, then $\operatorname{fl}(a - b) = a - b$, i.e., $a - b$ is an exact binary floating point number.
Solution: If $a = 1.a_1 a_2 \ldots a_{52} \times 2^k$ and $b = 1.b_1 b_2 \ldots b_{52} \times 2^n$, then $2b > a > b$ implies $k = n$ or $k = n + 1$.
If $k = n$, since $a - b > 0$, the digits of $a$ and $b$ align in the subtraction, and $a - b$ thus can't have more than 52 significant digits (since the leading ones cancel), i.e., it is an exactly representable floating point number. There is no need to round.
If $k = n + 1$, then $a$ and $b$ are misaligned by one place. The condition $2b > a$ means $a - b < b$, i.e., the leading 1 in $a$ will get cancelled in the subtraction, leaving an at most 53-digit number, which again is exactly representable.
The same idea, but go through the entries in each column starting from the bottom:
for j=1:n-1
    for i=n:-1:j+1
        A(i,j)=0
        pause
    end
end
Next, we go through all subdiagonals of A starting at the very bottom left-hand corner (i.e., the entry (n,1)):
for i=n-1:-1:1
    for j=1:n-i
        A(i+j,j)=0
        pause
    end
end
Problems.
(1) Write a program that sets to zero the entries below the diagonal in a matrix, one row at a time,
starting with the second row. Namely, the program will set a21 = 0, a31 = 0, a32 = 0, a41 = 0, a42 =
0, a43 = 0, etc.
(2) Write a program that computes the average value of each of the $2n - 1$ antidiagonals of an $n \times n$ matrix $A$. Apply this program to the Pascal matrices pascal(n) in MATLAB and find the size of the smallest Pascal matrix for which the largest average value of any of the antidiagonals is at least $10^{10}$. Turn in your program and the size $n$ of that matrix.
Each of the $n^2$ entries $c_{ij}$ requires $n$ multiplications and $n$ additions, for a total of $2n^3$ operations.
Is this reasonable if $A = E_{ij}(x)$? Forming $AB$ is equivalent to changing row $i$ of $B$ only, by adding a multiple of row $j$ to it. Namely, we are changing $n$ entries only, $b_{i1}, \ldots, b_{in}$, using two arithmetic operations per entry:
$$b_{ik} = b_{ik} + x\, b_{jk},$$
for a total of $2n$ operations.
It therefore matters tremendously whether we account for the structure of our problem (and don't just blindly call matrix-matrix multiplication); doing so can save major computational resources!
Such decisions have huge implications for algorithms (which are otherwise mathematically equivalent), and there is a systematic way of analyzing the cost of an algorithm.
7.2. Formal analysis of the complexity of an algorithm. In the analysis of the complexity of an algorithm, we assume that each arithmetic operation $\{+, -, \times, /\}$ takes one unit of time. This is not a bad assumption and allows us to perform complexity analysis without always referring to the latest technology. As of this writing, on modern computers, addition, subtraction, and multiplication do take one unit of time, whereas division takes about 8. We ignore such peculiarities and assume all arithmetic operations (including square roots, sines, cosines, etc.) take only one unit of time. It makes for a systematic analysis of all linear algebra algorithms.
The complexity $f(n)$ of an algorithm is the number of arithmetic operations that it performs. It is a function of the problem size $n$. The goal is to express $f(n)$ as a function of $n$, most often in the form
$$f(n) = an^k + O(n^{k-1}).$$
Note that:
• The problem size $n$ is usually the size of a matrix, the number of discretization points, the size of the mesh, or another appropriate quantity that measures the size of the problem.
• The complexity behavior matters when $n$ gets large, which is exactly when the lower order terms ($O(n^{k-1})$) no longer contribute anything significant compared with $an^k$. This is why these lower order terms are most often ignored in the complexity analysis, and an algorithm is said to have complexity $an^k$ or simply $n^k$.
• It is common practice in numerical analysis to count each arithmetic operation ($+, -, \times, /$, square root) as 1 and compare algorithms that way. Computer technology changes rapidly, so who knows how many clock cycles each operation takes on a computer this month. At the time of this writing, a multiplication, addition, or subtraction can be done in 1 clock cycle, with division taking 6 to 8. Such performance is only possible, however, through sophisticated cache optimization and programming techniques, which we will not address.
• The complexity of an algorithm can be determined by counting the number of arithmetic operations either analytically or experimentally.
The goal is to figure out what $f(n)$ is as a function of $n$, i.e., to find $a$ and $k$.
Analytically, one counts the operations directly, using formulas such as
$$\sum_{i=1}^{n} i = \frac{n(n+1)}{2} \qquad \text{and} \qquad \sum_{i=1}^{n} i^2 = \frac{n(n+1)(2n+1)}{6}.$$
Experimentally, one adds a variable, e.g., flopcount, to the code and adds 1 to it every time an arithmetic operation is performed. Then the code is run for different (reasonably large) values of $n$, say $n_1$ and $n_2$. The variable flopcount returns $f(n_1)$ and $f(n_2)$. Since
$$f(n_1) = an_1^k + O(n_1^{k-1}),$$
we have
$$a \approx \frac{f(n_1)}{n_1^k}, \qquad k \approx \frac{\log\big(f(n_2)/f(n_1)\big)}{\log(n_2/n_1)}.$$
For example, solving a lower triangular system $Ax = b$ by forward substitution (the code below) takes
$$\sum_{i=1}^{n} (2i - 1) = 2\,\frac{n(n+1)}{2} - n = n^2$$
operations.
On the experimental side, the code that solves $Ax = b$ is
for i=1:n
    x(i)=b(i);
    for j=1:i-1
        x(i)=x(i)-A(i,j)*x(j);
    end
    x(i)=x(i)/A(i,i);
end
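Here is an instrumented variant (our own sketch) that counts the operations, from which $a$ and $k$ can be estimated as above:

function [x, flopcount] = forwsub_counted(A, b)
% Forward substitution with an operation counter; flopcount returns f(n).
n = length(b);
x = zeros(n, 1);
flopcount = 0;
for i = 1:n
    x(i) = b(i);
    for j = 1:i-1
        x(i) = x(i) - A(i,j)*x(j);
        flopcount = flopcount + 2;   % one multiplication, one subtraction
    end
    x(i) = x(i)/A(i,i);
    flopcount = flopcount + 1;       % one division
end
% For example:
% [~, f1] = forwsub_counted(tril(rand(100)), rand(100,1));   % f1 = 100^2
% [~, f2] = forwsub_counted(tril(rand(200)), rand(200,1));   % f2 = 200^2
% log(f2/f1)/log(2)                                          % approximately k = 2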
CHAPTER 2
Adding a multiple of one row to another. The elimination matrix is
$$E_{ij}(x) = \begin{pmatrix} 1 & & & & \\ & \ddots & & & \\ & & 1 & & \\ & x & & \ddots & \\ & & & & 1 \end{pmatrix}$$
($x$ is in position $(i,j)$). This matrix differs from the identity matrix only in the $(i,j)$th entry $x$.
Scaling a row. We scale row $i$ of $A$ by $d$ by multiplying $A$ on the left by
$$S_i(d) = \begin{pmatrix} 1 & & & & \\ & \ddots & & & \\ & & d & & \\ & & & \ddots & \\ & & & & 1 \end{pmatrix}$$
($d$ is in position $(i,i)$).
Note that both matrices $E_{ij}(x)$ and $S_i(d)$ invert very easily:
$$(E_{ij}(x))^{-1} = E_{ij}(-x) \qquad \text{and} \qquad (S_i(d))^{-1} = S_i(1/d).$$
For example,
$$\begin{pmatrix} 1 & & & \\ 0 & 1 & & \\ 0 & 0 & 1 & \\ 0 & 3 & 0 & 1 \end{pmatrix} \begin{pmatrix} 1 & & & \\ 0 & 1 & & \\ 0 & 0 & 1 & \\ 0 & -3 & 0 & 1 \end{pmatrix} = I \qquad \text{and} \qquad \begin{pmatrix} 1 & & & \\ & 2 & & \\ & & 1 & \\ & & & 1 \end{pmatrix} \begin{pmatrix} 1 & & & \\ & \frac12 & & \\ & & 1 & \\ & & & 1 \end{pmatrix} = I.$$
Throughout this section we illustrate the row and column operations on the $4 \times 4$ Vandermonde matrix
$$V = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 4 & 8 \\ 1 & 3 & 9 & 27 \\ 1 & 4 & 16 & 64 \end{pmatrix}. \tag{2.1}$$
(1) To add 3 times row 2 to row 4, we form the product
$$E_{42}(3)\, V = \begin{pmatrix} 1 & & & \\ 0 & 1 & & \\ 0 & 0 & 1 & \\ 0 & 3 & 0 & 1 \end{pmatrix} \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 4 & 8 \\ 1 & 3 & 9 & 27 \\ 1 & 4 & 16 & 64 \end{pmatrix} = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 4 & 8 \\ 1 & 3 & 9 & 27 \\ 4 & 10 & 28 & 88 \end{pmatrix}.$$
(2) In order to subtract 3 times row 2 from row 4, we form
$$E_{42}(-3)\, V = \begin{pmatrix} 1 & & & \\ 0 & 1 & & \\ 0 & 0 & 1 & \\ 0 & -3 & 0 & 1 \end{pmatrix} \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 4 & 8 \\ 1 & 3 & 9 & 27 \\ 1 & 4 & 16 & 64 \end{pmatrix} = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 4 & 8 \\ 1 & 3 & 9 & 27 \\ -2 & -2 & 4 & 40 \end{pmatrix}.$$
(3) If we wanted to create a zero in position (4,1) using row 1, we would subtract 1 times row 1 from row 4 of $V$ by forming the product
$$E_{41}(-1)\, V = \begin{pmatrix} 1 & & & \\ 0 & 1 & & \\ 0 & 0 & 1 & \\ -1 & 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 4 & 8 \\ 1 & 3 & 9 & 27 \\ 1 & 4 & 16 & 64 \end{pmatrix} = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 4 & 8 \\ 1 & 3 & 9 & 27 \\ 0 & 3 & 15 & 63 \end{pmatrix}.$$
(4) If we wanted to scale row 2 of $V$ by (say) 2, we would form
$$S_2(2)\, V = \begin{pmatrix} 1 & & & \\ & 2 & & \\ & & 1 & \\ & & & 1 \end{pmatrix} \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 4 & 8 \\ 1 & 3 & 9 & 27 \\ 1 & 4 & 16 & 64 \end{pmatrix} = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 2 & 4 & 8 & 16 \\ 1 & 3 & 9 & 27 \\ 1 & 4 & 16 & 64 \end{pmatrix}.$$
(5) Column operations work analogously, by applying the appropriate matrix $E_{ij}(x)$ or $S_i(d)$ on the right. For example, if we wanted to subtract 2 times column 2 from column 3, we'd form the product
$$V E_{23}(-2) = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 4 & 8 \\ 1 & 3 & 9 & 27 \\ 1 & 4 & 16 & 64 \end{pmatrix} \begin{pmatrix} 1 & & & \\ & 1 & -2 & \\ & & 1 & \\ & & & 1 \end{pmatrix} = \begin{pmatrix} 1 & 1 & -1 & 1 \\ 1 & 2 & 0 & 8 \\ 1 & 3 & 3 & 27 \\ 1 & 4 & 8 & 64 \end{pmatrix}.$$
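These operations are easy to experiment with in MATLAB (a small illustration of ours):

V = [1 1 1 1; 1 2 4 8; 1 3 9 27; 1 4 16 64];
E = eye(4); E(4,2) = 3;          % E_42(3): x = 3 in position (4,2)
Einv = eye(4); Einv(4,2) = -3;   % E_42(-3)
E*V                              % row 4 becomes (4, 10, 28, 88), as in (1)
E*Einv                           % the identity: (E_42(3))^{-1} = E_42(-3)
S = eye(4); S(2,2) = 2;          % S_2(2)
S*V                              % row 2 becomes (2, 4, 8, 16), as in (4)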
Problems. For the Vandermonde matrix (2.1):
(1) Find the matrix $E_{ij}(x)$ that uses row 2 to create a zero in position (4,4).
(2) Find the matrix $E_{ij}(x)$ that uses column 4 to create a zero in position (2,1).
(3) Find the matrix that scales column 4 by $-1$.
CHAPTER 3
An upper triangular linear system
$$\begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ & a_{22} & \cdots & a_{2n} \\ & & \ddots & \vdots \\ & & & a_{nn} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{pmatrix}$$
means
$$\begin{aligned} a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n &= b_1 \\ a_{22}x_2 + \cdots + a_{2n}x_n &= b_2 \\ &\;\,\vdots \\ a_{n-1,n-1}x_{n-1} + a_{n-1,n}x_n &= b_{n-1} \\ a_{nn}x_n &= b_n. \end{aligned}$$
The solution to the above system is easily obtained by substitution, starting with $x_n$ and going back:
$$\begin{aligned} x_n &= b_n / a_{nn} \\ x_{n-1} &= (b_{n-1} - a_{n-1,n}x_n)/a_{n-1,n-1} \\ &\;\,\vdots \\ x_1 &= (b_1 - a_{12}x_2 - \cdots - a_{1n}x_n)/a_{11}. \end{aligned}$$
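In MATLAB, back substitution is a few lines (a minimal sketch of ours; the backslash operator does this automatically for triangular matrices):

function x = backsub(U, b)
% Solve Ux = b for upper triangular U by back substitution.
n = length(b);
x = zeros(n, 1);
for i = n:-1:1
    x(i) = (b(i) - U(i,i+1:n)*x(i+1:n)) / U(i,i);   % empty product is 0 when i = n
end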
Therefore, in order to solve a linear system, our first goal is to reduce it to a triangular form which is
then easy to solve.
We can do this using either the elimination matrices Eij (x) (obtaining the Gaussian elimination algorithm) or orthogonal matrices (obtaining the QR algorithm, which is also useful for solving least squares
problems).
2. Gaussian Elimination
Gaussian elimination is the process of reducing a matrix to upper triangular form using only subtraction
of a multiple of one row from another, i.e., the matrices Eij (x). The result is a decomposition
A = LU,
where L is unit lower triangular and U is upper triangular.
We start with the simplest $2 \times 2$ case. Let
$$A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}.$$
Then
$$E_{21}(-l_{21})\, A = \begin{pmatrix} 1 & \\ -l_{21} & 1 \end{pmatrix} \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} = \begin{pmatrix} a_{11} & a_{12} \\ 0 & a_{22}' \end{pmatrix},$$
where $l_{21} = \frac{a_{21}}{a_{11}}$ is the factor that we need to multiply the first row by so that when we subtract it from the second we get a zero in position (2,1), and $a_{22}' = a_{22} - a_{12} l_{21}$. Therefore
$$A = \begin{pmatrix} 1 & \\ l_{21} & 1 \end{pmatrix} \begin{pmatrix} a_{11} & a_{12} \\ 0 & a_{22}' \end{pmatrix}.$$
We have factored $A$ into the product of a lower triangular matrix (we'll call it $L$) and an upper triangular matrix $U$.
In the general $n \times n$ case the process is analogous: we proceed one column at a time, with the diagonal entry being used as a pivot to eliminate all entries below it in the corresponding column.
Let $l_{ij}$ be the multiplier needed to multiply the pivot row $j$ by so that when it is subtracted from the $i$th row, we get a zero in position $(i,j)$.
In particular, in order to eliminate the entries in the first column, $a_{21}$ through $a_{n1}$, we need multipliers $l_{i1} = a_{i1}/a_{11}$, $i = 2, 3, \ldots, n$. Forming the product
$$A_1 = E_{n1}(-l_{n1}) E_{n-1,1}(-l_{n-1,1}) \cdots E_{21}(-l_{21})\, A$$
results in a matrix $A_1$ whose entries in the first column below the main diagonal are zero. Moving all the $E$'s to the other side (and recalling that $E_{ij}(x)^{-1} = E_{ij}(-x)$) we get
$$A = E_{21}(l_{21}) E_{31}(l_{31}) \cdots E_{n1}(l_{n1})\, A_1.$$
The matrix $A_1$ is called the first Schur complement of $A$.
We repeat the same process for the remaining columns and get
$$A = \Big( \prod_{j=1}^{n-1} \prod_{i=j+1}^{n} E_{ij}(l_{ij}) \Big)\, U. \tag{2.1}$$
Solving $Ax = b$ then reduces to the triangular system
$$Ux = \Big( \prod_{j=1}^{n-1} \prod_{i=j+1}^{n} E_{ij}(l_{ij}) \Big)^{-1} b = \Big( \prod_{j=n-1}^{1} \prod_{i=n}^{j+1} E_{ij}(-l_{ij}) \Big) b,$$
where inverting the product reverses the order of the factors.
Remarkably, the product of all these elimination matrices is simply the unit lower triangular matrix containing all the multipliers:
$$L = \prod_{j=1}^{n-1} \prod_{i=j+1}^{n} E_{ij}(l_{ij}) = \begin{pmatrix} 1 & & & & \\ l_{21} & 1 & & & \\ l_{31} & l_{32} & 1 & & \\ \vdots & \vdots & \ddots & \ddots & \\ l_{n1} & l_{n2} & \cdots & l_{n,n-1} & 1 \end{pmatrix}.$$
This fact requires no proof, as it is a direct consequence of how the product of matrices is formed, but it is highly recommended that the reader perform several LU decompositions in order to "make friends" with the process.
One thus obtains the LU decomposition of $A$:
$$A = LU = \begin{pmatrix} 1 & & & & \\ l_{21} & 1 & & & \\ l_{31} & l_{32} & 1 & & \\ \vdots & \vdots & \ddots & \ddots & \\ l_{n1} & l_{n2} & \cdots & l_{n,n-1} & 1 \end{pmatrix} \begin{pmatrix} u_{11} & u_{12} & u_{13} & \cdots & u_{1n} \\ & u_{22} & u_{23} & \cdots & u_{2n} \\ & & u_{33} & \cdots & u_{3n} \\ & & & \ddots & \vdots \\ & & & & u_{nn} \end{pmatrix}.$$
In practice, the multipliers $l_{ij}$ are stored in the strictly lower triangular part of $A$ as the entries below the diagonal are zeroed out, so both factors are encoded in a single array. For example, the factors
$$L = \begin{pmatrix} 1 & & \\ 2 & 1 & \\ 3 & 4 & 1 \end{pmatrix}, \qquad U = \begin{pmatrix} 5 & 6 & 7 \\ 0 & 8 & 9 \\ 0 & 0 & 10 \end{pmatrix}$$
will be encoded in the matrix
$$\begin{pmatrix} 5 & 6 & 7 \\ 2 & 8 & 9 \\ 3 & 4 & 10 \end{pmatrix}.$$
Here is the MATLAB code that does just that:
function A=lu143Mexpert(A);
n=size(A,1);
for j=1:n-1
    for i=j+1:n
        A(i,j) = A(i,j)/A(j,j);
        A(i,j+1:n) = A(i,j+1:n) - A(i,j)*A(j,j+1:n);
    end
end
This code returns the encoded A. If L and U are needed, they can be obtained as L=eye(n)+tril(A,-1) and U=triu(A).
2.2. Operation count for computing the LU decomposition. Referring back to the algorithm, its cost is
$$\sum_{j=1}^{n-1} \sum_{i=j+1}^{n} \big( 2(n - j + 1) + 1 \big) = \frac{2}{3} n^3 + O(n^2).$$
2.3. Solving Ax = b using the LU decomposition. Once we have the decomposition $A = LU$, solving $Ax = b$ is very easy. It requires two triangular solves: first we solve $Ly = b$ for $y$, and then $Ux = y$ for $x$. We thus have $x = U^{-1}y = U^{-1}L^{-1}b = (LU)^{-1}b = A^{-1}b$.
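For example (our own usage sketch, assuming lu143Mexpert from the previous section is on the path):

n = 4;
A = pascal(n); b = rand(n,1);
F = lu143Mexpert(A);        % L and U encoded in a single array
L = eye(n) + tril(F,-1);
U = triu(F);
x = U \ (L \ b);            % the two triangular solves
norm(A*x - b)               % small residual, of order eps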
2.4. Uniqueness of the LU decomposition. If $A$ is nonsingular and has an LU decomposition, it is unique. To see that, write $A = LU = L_1 U_1$. Then $L_1^{-1} L = U_1 U^{-1}$. On the left we have a unit lower triangular matrix (see the problems after this section) and on the right we have an upper triangular matrix. This is only possible if both equal the identity matrix, i.e., $L = L_1$ and $U = U_1$.
2.5. LDU decomposition. Sometimes it's preferable to work with the so-called LDU decomposition of a matrix, which is readily obtained from the LU decomposition by factoring the diagonal out of $U$, leaving $U$ unit upper triangular:
$$A = LDU.$$
Problems.
(1) Compute the LU decompositions of the following matrices:
$$\begin{pmatrix} 1 & 1 & 1 \\ 1 & 2 & 4 \\ 1 & 3 & 9 \end{pmatrix}, \qquad \begin{pmatrix} 1 & 1 & 1 \\ 1 & 2 & 3 \\ 1 & 3 & 6 \end{pmatrix}, \qquad \begin{pmatrix} 1 & 2 & 3 \\ 1 & 0 & 4 \\ 2 & 10 & 2 \end{pmatrix}.$$
(2) Compute the LU decompositions of
• the $4 \times 4$ Pascal matrix (pascal(4) in MATLAB);
• the Vandermonde matrix
$$\begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 4 & 8 \\ 1 & 3 & 9 & 27 \\ 1 & 4 & 16 & 64 \end{pmatrix}.$$
(3) Prove that the product of two unit lower triangular matrices is also unit lower triangular.
(4) Prove that the inverse of a unit lower triangular matrix is also unit lower triangular.
(5) Compute the LDU decomposition of the matrix
$$\begin{pmatrix} 2 & 2 & 4 \\ 2 & 5 & 5 \\ 4 & 5 & 39 \end{pmatrix}.$$
3. Pivoting
3.1. The need for pivoting. One faces the obvious problem of not being able to start Gaussian elimination if the (1,1) entry is 0, e.g., for the matrix
$$\begin{pmatrix} 0 & 1 \\ 1 & 1 \end{pmatrix}.$$
Serious problems also occur if the 0 is replaced by a very small number, e.g., $2^{-54}$:
$$A = \begin{pmatrix} 2^{-54} & 1 \\ 1 & 1 \end{pmatrix}.$$
With a right-hand side $b = \begin{pmatrix} 1 \\ 0 \end{pmatrix}$, the solution to $Ax = b$ is $x \approx \begin{pmatrix} -1 \\ 1 \end{pmatrix}$. However, the computed $L$ and $U$ are
$$\hat L = \begin{pmatrix} 1 & 0 \\ 2^{54} & 1 \end{pmatrix}, \qquad \hat U = \begin{pmatrix} 2^{-54} & 1 \\ 0 & -2^{54} \end{pmatrix}$$
(note that the (2,2) entry of $U$ is $1 - 2^{54}$, which is rounded to $-2^{54}$). Then
$$\hat L \hat U = \begin{pmatrix} 2^{-54} & 1 \\ 1 & 0 \end{pmatrix},$$
which differs from $A$ in the (2,2) entry, and solving with these computed factors produces a badly inaccurate solution.
3.2. Partial and complete pivoting. Partial pivoting occurs when during Gaussian elimination one chooses the largest element (in absolute value) in the current column, at or below the diagonal, as the pivot element. When that element is chosen from the entire Schur complement, this is complete pivoting.
In practice, partial pivoting almost always suffices to achieve stability (although there are extremely rare counterexamples), so this is the default setting in MATLAB.
The process of pivoting produces the decomposition
A = P LU,
where P is a permutation matrix and L and U are as usual.
After $k$ steps of Gaussian elimination with pivoting we have a partial decomposition
$$A = P \begin{pmatrix} 1 & & & & & \\ \vdots & \ddots & & & & \\ l_{k1} & \cdots & 1 & & & \\ l_{k+1,1} & \cdots & l_{k+1,k} & 1 & & \\ \vdots & & \vdots & & \ddots & \\ l_{n1} & \cdots & l_{nk} & & & 1 \end{pmatrix} \begin{pmatrix} u_{11} & \cdots & u_{1k} & u_{1,k+1} & \cdots & u_{1n} \\ & \ddots & \vdots & \vdots & & \vdots \\ & & u_{kk} & u_{k,k+1} & \cdots & u_{kn} \\ & & & u_{k+1,k+1} & \cdots & u_{k+1,n} \\ & & & \vdots & & \vdots \\ & & & u_{n,k+1} & \cdots & u_{nn} \end{pmatrix}.$$
Say $u_{i,k+1}$ is the largest (in absolute value) among $u_{k+1,k+1}$ through $u_{n,k+1}$, i.e., $u_{i,k+1}$ is our pivot that needs to be brought to position $(k+1, k+1)$. Let $S$ be the permutation matrix that swaps rows $k+1$ and $i$. Since $S^2 = I$, we may insert $SS$ between $L$ and $U$:
$$A = P\,(LS)(SU).$$
Multiplying $U$ on the left by $S$ swaps rows $k+1$ and $i$ of $U$ and brings the pivot to position $(k+1,k+1)$. Multiplying $L$ on the right by $S$ swaps columns $k+1$ and $i$ of $L$; since columns $k+1$ through $n$ of $L$ are still columns of the identity, we can then factor $S$ back out of $LS$ on the left:
$$LS = S L',$$
where $L'$ is again unit lower triangular and is obtained from $L$ by swapping the multipliers $(l_{k+1,1}, \ldots, l_{k+1,k})$ in row $k+1$ with the multipliers $(l_{i1}, \ldots, l_{ik})$ in row $i$. Therefore
$$A = (PS)\, L'\, (SU).$$
Thus we update $P$ by $S$ and we have our new $L$ and $U$, with which we can proceed in the usual manner for one more step, since the pivot is now on the diagonal.
With complete pivoting, one also swaps columns by multiplying by permutation matrices on the right,
ultimately getting
A = P LU Q,
where P and Q are permutation matrices and L and U are as before.
Example. We will compute the LU decomposition with partial pivoting of the matrix
$$A = \begin{pmatrix} 1 & 2 & 1 \\ 2 & 2 & 4 \\ -1 & 1 & -1 \end{pmatrix}.$$
The largest element (in absolute value) in the first column is 2, so we need to swap rows 1 and 2:
$$A = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} 2 & 2 & 4 \\ 1 & 2 & 1 \\ -1 & 1 & -1 \end{pmatrix};$$
then we carry out the elimination of the first column, with multipliers $0.5$ and $-0.5$:
$$A = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} 1 & & \\ 0.5 & 1 & \\ -0.5 & 0 & 1 \end{pmatrix} \begin{pmatrix} 2 & 2 & 4 \\ 0 & 1 & -1 \\ 0 & 2 & 1 \end{pmatrix}.$$
We now need to swap rows 2 and 3 in $U$, inserting the swap matrix and its inverse:
$$A = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} 1 & & \\ 0.5 & 1 & \\ -0.5 & 0 & 1 \end{pmatrix} \underbrace{\begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}}_{= I} \begin{pmatrix} 2 & 2 & 4 \\ 0 & 1 & -1 \\ 0 & 2 & 1 \end{pmatrix},$$
and swapping rows 2 and 3 of the multipliers in $L$ restores its shape:
$$A = \begin{pmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} 1 & & \\ -0.5 & 1 & \\ 0.5 & 0 & 1 \end{pmatrix} \begin{pmatrix} 2 & 2 & 4 \\ 0 & 2 & 1 \\ 0 & 1 & -1 \end{pmatrix}.$$
Finally, eliminating the (3,2) entry with multiplier $0.5$ gives
$$A = \begin{pmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} 1 & & \\ -0.5 & 1 & \\ 0.5 & 0.5 & 1 \end{pmatrix} \begin{pmatrix} 2 & 2 & 4 \\ 0 & 2 & 1 \\ 0 & 0 & -1.5 \end{pmatrix} = PLU.$$
Condensed technique. One can compute an LU decomposition with pivoting while avoiding writing the full permutation matrices, by recording the column indices instead. Also, we record the multipliers of $L$ in the lower triangular portion of $U$ (shown in boldface below). We start with the following tableau, which represents the fact that the columns of $P$ start off in their natural order 1, 2, 3:
$$\begin{array}{c|rrr} 1 & 1 & 2 & 1 \\ 2 & 2 & 2 & 4 \\ 3 & -1 & 1 & -1 \end{array}$$
We need to pivot rows 1 and 2; this is automatically registered in $P$ as the labels 1 and 2 get swapped as well. Eliminating, pivoting rows 2 and 3, and eliminating again:
$$\begin{array}{c|rrr} 2 & 2 & 2 & 4 \\ 1 & 1 & 2 & 1 \\ 3 & -1 & 1 & -1 \end{array} \;\to\; \begin{array}{c|rrr} 2 & 2 & 2 & 4 \\ 1 & \mathbf{0.5} & 1 & -1 \\ 3 & \mathbf{-0.5} & 2 & 1 \end{array} \;\to\; \begin{array}{c|rrr} 2 & 2 & 2 & 4 \\ 3 & \mathbf{-0.5} & 2 & 1 \\ 1 & \mathbf{0.5} & 1 & -1 \end{array} \;\to\; \begin{array}{c|rrr} 2 & 2 & 2 & 4 \\ 3 & \mathbf{-0.5} & 2 & 1 \\ 1 & \mathbf{0.5} & \mathbf{0.5} & -1.5 \end{array}$$
The matrix $P$ thus contains the columns of the identity in the order 2, 3, 1, and we have
$$P = \begin{pmatrix} 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}, \qquad L = \begin{pmatrix} 1 & & \\ -0.5 & 1 & \\ 0.5 & 0.5 & 1 \end{pmatrix}, \qquad U = \begin{pmatrix} 2 & 2 & 4 \\ & 2 & 1 \\ & & -1.5 \end{pmatrix}.$$
Problems:
(1) Compute the decomposition $A = PLU$ from Gaussian elimination with partial pivoting of the matrix
$$\begin{pmatrix} 0 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 3 & 3 \end{pmatrix}.$$
4. Cholesky decomposition
If a matrix is symmetric and positive definite, it turns out it can be decomposed into triangular factors in half the time that LU takes, and there is no need to pivot. This is the so-called Cholesky decomposition.
Recall that a matrix $A$ is symmetric positive definite (s.p.d.) if $A$ is symmetric and $x^T A x > 0$ for all nonzero vectors $x$.
In particular, the positive definiteness of $A$ implies that it is nonsingular and has an LDU decomposition $A = L_1 D U$, where $L_1$ is unit lower triangular, $U$ is unit upper triangular, and $D$ is diagonal (we call the lower triangular factor $L_1$ because we need $L$ for later).
Since $A = A^T = (L_1 D U)^T = U^T D L_1^T$ (note that $D = D^T$ since $D$ is diagonal), $U^T D L_1^T$ is also an LDU decomposition of $A$, thus we must have $L_1 = U^T$ (because $U^T$ would be obtained by the same Gaussian elimination process as $L_1$). Now, $D$ has positive entries on the diagonal (or otherwise we can find a vector $x$ such that $x^T A x \le 0$; which vector is that?), thus $D^{1/2}$ is real and $A = (L_1 D^{1/2})(L_1 D^{1/2})^T$. For $L \equiv L_1 D^{1/2}$,
$$A = LL^T.$$
This is the Cholesky decomposition of $A$. Here $L$ is lower triangular with positive diagonal (but not necessarily unit lower triangular).
There are two main properties that make the Cholesky decomposition deserve special attention: it's stable without the need to pivot, and it can be computed in half the time that it takes to compute the LU decomposition.
First, on stability. If $A$ is s.p.d., i.e., $x^T A x > 0$ for all $x \ne 0$, and we pick $x = e_i + z e_j$ (i.e., $x$ is a vector of zeros except for 1 and $z$ in positions $i$ and $j$, respectively), we get
$$0 < x^T A x = a_{ii} + 2 z a_{ij} + z^2 a_{jj}.$$
The above inequality holds true for all $z$, thus we must have
$$a_{ij}^2 < a_{ii} a_{jj},$$
implying that we can never have a zero on the main diagonal, and also that the largest element of the matrix (or of any of the subsequent Schur complements, which are also s.p.d.) is always on the diagonal. Pivoting is not required for the Cholesky decomposition to be stable.
Now, efficiency. Since the symmetry is preserved in Gaussian elimination, the Schur complement will be symmetric at every step. Thus only the elements at and above the diagonal need to be worked with and updated, halving the work. We need to symmetrize the LU decomposition at the end, so we scale by $D^{-1/2}$, where $D = \operatorname{diag}(u_{11}, \ldots, u_{nn})$, to obtain the Cholesky decomposition.
Here is the code. Compare it with the LU decomposition code. It returns $L^T$ rather than $L$, which is typical and is what MATLAB does.
function U=cholesky(A)
U=triu(A);
n=size(A,1);
for j=1:n
    for i=j+1:n
        t = U(j,i)/U(j,j);
        U(i,i:n) = U(i,i:n) - t*U(j,i:n);
    end
end
for j=1:n
    U(j,j:n)=U(j,j:n)/sqrt(U(j,j));
end
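A quick check of this code against MATLAB's built-in chol (our own test):

A = pascal(5);              % Pascal matrices are symmetric positive definite
U = cholesky(A);
norm(U'*U - A)              % of order eps: U is indeed a Cholesky factor
norm(U - chol(A))           % matches the built-in factor (both have positive diagonals)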
The cost is $\frac{1}{3}n^3 + O(n^2)$, half the cost of the LU decomposition.
Problems.
(1) Prove that the cost of Cholesky is $\frac13 n^3 + O(n^2)$, analytically or empirically.
(2) Compute the Cholesky decompositions of the matrices
$$A = \begin{pmatrix} 4 & 6 & 10 \\ 6 & 25 & 39 \\ 10 & 39 & 110 \end{pmatrix}, \qquad B = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 2 & 3 \\ 1 & 3 & 6 \end{pmatrix}.$$
…of the form $100\,\varepsilon\, g(n)$, where $g(n)$ is a polynomial in $n$. Most significantly, $O(\varepsilon)$ is a bound that does not depend on $x$, only on its size, and modestly so.
6. Condition numbers
So how sensitive is a problem to perturbations? Since
$$f(x + \delta x) - f(x) \approx f'(x)\,\delta x,$$
we call $|f'(x)|$ the absolute condition number, and similarly, since
$$\frac{f(x + \delta x) - f(x)}{f(x)} \approx \frac{x f'(x)}{f(x)} \cdot \frac{\delta x}{x},$$
the quantity $\big| \frac{x f'(x)}{f(x)} \big|$ is called the relative condition number.
The condition number tells us how much the backward error gets magnified to produce the forward error. Since in all our algorithms the backward errors will be small, the forward error will be small for problems with small condition numbers (well conditioned problems), and large for ill conditioned problems.
Condition number of solving Ax = b. In this section we address the question of the sensitivity of the solution of a linear system to perturbations. Namely, if we perturb $A$ to $\hat A = A + \delta A$, how much does $x$ change?
Consider the perturbed solution $\hat x = x + \delta x$ such that $\hat A \hat x = b$. Then $(A + \delta A)(x + \delta x) = b$, or
$$A\,\delta x + \delta A\,\hat x = 0, \qquad \text{i.e.,} \qquad \delta x = -A^{-1}\,\delta A\,\hat x,$$
which implies
$$\|\delta x\| = \|A^{-1}\,\delta A\,\hat x\| \le \|A^{-1}\| \|\delta A\| \|\hat x\|$$
and in turn
$$\frac{\|\delta x\|}{\|\hat x\|} \le \|A\| \|A^{-1}\| \cdot \frac{\|\delta A\|}{\|A\|}.$$
This justifies calling the quantity
$$\kappa(A) = \|A\| \|A^{-1}\|$$
the condition number of $A$. It tells us how relative changes in $A$ are multiplied to result in relative perturbations in $x$.
The condition number is always at least 1:
$$\kappa(A) \ge 1.$$
This is because $1 = \|I\| = \|A A^{-1}\| \le \|A\| \|A^{-1}\| = \kappa(A)$ (recall that $\|I\| = 1$ in any induced norm). In the two-norm, equality is attained for orthogonal matrices: $\kappa_2(Q) = 1$, since the two-norm of any orthogonal matrix is 1.
6.1. Practical interpretation of $\kappa(A)$ when solving Ax = b. We can expect relative errors of at least about $10^{-16}$ in $A$ just from storing $A$ in the computer. Therefore $\frac{\|\delta A\|}{\|A\|} \approx 10^{-16}$, and thus
$$\frac{\|\delta x\|}{\|\hat x\|} \lesssim \kappa(A) \cdot 10^{-16}.$$
Since a relative error of $10^{-k}$ means we will have about $k$ correct decimal digits in the answer, the condition number $\kappa(A)$ can be interpreted as a measure of how many digits will be lost in solving $Ax = b$. If $\kappa(A) \approx 10^5$, then we can expect to lose 5 decimal digits and the answer to have about 11 correct decimal digits.
In general, we would expect the solution of $Ax = b$ to have about $16 - \log_{10} \kappa(A)$ correct digits. This can all be verified empirically very easily in MATLAB.
n=50;
kappa=10^5;
% generate a random matrix with condition number kappa
[Q,R]=qr(rand(n));
A=Q*diag(linspace(1,kappa,n))*Q';
x=rand(n,1);   % our random solution
b=A*x;         % right hand side
y=A\b;         % the computed solution
disp('The relative error is:');
norm(y-x)/norm(y)
disp('Expected number of correct decimal digits is at least:');
16-log10(kappa)
disp('Actual number of correct decimal digits (in the norm sense):');
-log10(norm(y-x)/norm(y))
Problems:
(1) Find (by hand) the two-norm condition number of the matrix
$$A = \begin{pmatrix} 0 & 2 \\ 1 & 1 \end{pmatrix}.$$
(2) Prove that if $L$ is the Cholesky factor of $A$, then $\kappa_2(A) = (\kappa_2(L))^2$.
The classic example of element growth in Gaussian elimination with partial pivoting is the matrix
$$K = \begin{pmatrix} 1 & & & & 1 \\ -1 & 1 & & & 1 \\ -1 & -1 & 1 & & 1 \\ \vdots & \vdots & \ddots & \ddots & \vdots \\ -1 & -1 & \cdots & -1 & 1 \end{pmatrix}, \qquad \text{for which} \qquad U = \begin{pmatrix} 1 & & & & 1 \\ & 1 & & & 2 \\ & & 1 & & 4 \\ & & & \ddots & \vdots \\ & & & & 2^{n-1} \end{pmatrix}.$$
The growth factor is $2^{n-1}$, a worst case scenario. Matrices like this are never encountered in practice, so Gaussian elimination with partial pivoting is stable in practice, and this is the reason MATLAB defaults to it when one calls A\b.
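The exponential growth is easy to reproduce (a small experiment of ours):

n = 30;
K = eye(n) - tril(ones(n), -1);   % 1 on the diagonal, -1 strictly below
K(:, n) = 1;                       % last column of ones
[L, U, P] = lu(K);                 % partial pivoting performs no row swaps here
U(n, n)                            % 2^(n-1): the last entry doubles at every step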
8. Givens rotations
A Givens rotation is the third tool we'll use. It is not really a "third" one, since it's a product of the first two, but we'll use it often enough that it deserves separate attention.
The need for a Givens rotation arises because our main tool for the creation of zeros, the matrix $E_{ij}$, is not orthogonal.
A Givens rotation is a matrix of the form
$$G = \begin{pmatrix} \cos x & \sin x \\ -\sin x & \cos x \end{pmatrix}.$$
One trivially verifies that $G^T G = I$, so $G$ is indeed orthogonal. The idea is to choose $x$ so that $G$ creates zeros:
$$G \begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} c \\ 0 \end{pmatrix}.$$
Since orthogonal matrices do not change the 2-norm of a vector, we must have $c = \sqrt{a^2 + b^2}$. Also, from
$$a \cos x + b \sin x = \sqrt{a^2 + b^2}, \qquad -a \sin x + b \cos x = 0,$$
we get
$$\sin x = \frac{b}{\sqrt{a^2 + b^2}} \qquad \text{and} \qquad \cos x = \frac{a}{\sqrt{a^2 + b^2}}.$$
For example, the Givens rotation that kills the (2,1) entry in the vector $\begin{pmatrix} 3 \\ 4 \end{pmatrix}$ is
$$\begin{pmatrix} \frac35 & \frac45 \\ -\frac45 & \frac35 \end{pmatrix} \begin{pmatrix} 3 \\ 4 \end{pmatrix} = \begin{pmatrix} 5 \\ 0 \end{pmatrix}.$$
Givens rotations in $n \times n$ matrices. Since we'll use Givens rotations to create zeros in large matrices, we also call a Givens rotation the $n \times n$ matrix
$$G = \begin{pmatrix} 1 & & & & \\ & \cos x & & \sin x & \\ & & \ddots & & \\ & -\sin x & & \cos x & \\ & & & & 1 \end{pmatrix}.$$
Note that $G$ differs from the identity matrix only in the 4 entries in positions $(i,i)$, $(i,j)$, $(j,i)$, and $(j,j)$. One trivially verifies that $G$ is still orthogonal; it rotates rows $i$ and $j$ and leaves the rest of the matrix unchanged.
A Givens rotation is nothing new; it is simply a sequence of our previous two tools. If we perform Gaussian elimination on $G$ we get
$$\begin{pmatrix} \cos x & \sin x \\ -\sin x & \cos x \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ -\tan x & 1 \end{pmatrix} \begin{pmatrix} \cos x & 0 \\ 0 & \sec x \end{pmatrix} \begin{pmatrix} 1 & \tan x \\ 0 & 1 \end{pmatrix}.$$
In other words, using a Givens rotation is equivalent to the following sequence of operations:
(1) Adding a multiple of the second row to the first;
(2) scaling both rows;
(3) adding a multiple of the first row to the second.
Complex Givens rotations. In the complex case, a Givens rotation has the form
$$G = \begin{pmatrix} c & s \\ -\bar s & c \end{pmatrix},$$
where $c$ is real and $s$ is complex such that $c^2 + |s|^2 = 1$. One directly checks that $G^* G = I$, so $G$ is unitary. We need
$$G \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} \alpha \sqrt{|x|^2 + |y|^2} \\ 0 \end{pmatrix},$$
where $\alpha$ can be any complex number on the unit circle, $|\alpha| = 1$.
Writing $cx + sy = \alpha \sqrt{|x|^2 + |y|^2}$ and $-\bar s x + c y = 0$, we can choose $\alpha = x/|x|$, so that
$$c = \frac{|x|}{\sqrt{|x|^2 + |y|^2}} \qquad \text{and} \qquad s = \frac{c\,\bar y}{\bar x}.$$
For example, the Givens rotation that rotates rows 1 and 4 of the Vandermonde matrix (2.1) to create a zero in position (4,1) is built from $a = 1$ and $b = 1$, so $\cos x = \sin x = \frac{\sqrt2}{2}$:
$$\begin{pmatrix} \frac{\sqrt2}{2} & 0 & 0 & \frac{\sqrt2}{2} \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ -\frac{\sqrt2}{2} & 0 & 0 & \frac{\sqrt2}{2} \end{pmatrix} \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 2 & 4 & 8 \\ 1 & 3 & 9 & 27 \\ 1 & 4 & 16 & 64 \end{pmatrix} = \begin{pmatrix} \sqrt2 & \frac{5\sqrt2}{2} & \frac{17\sqrt2}{2} & \frac{65\sqrt2}{2} \\ 1 & 2 & 4 & 8 \\ 1 & 3 & 9 & 27 \\ 0 & \frac{3\sqrt2}{2} & \frac{15\sqrt2}{2} & \frac{63\sqrt2}{2} \end{pmatrix};$$
only rows 1 and 4 of the matrix $GA$ change; rows 2 and 3 are unchanged.
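The codes in the following sections call a helper givens(a,b); MATLAB's built-in planerot computes the same rotation. Here is a sketch of ours, consistent with the convention above:

function G = givens(a, b)
% 2-by-2 Givens rotation with G*[a; b] = [sqrt(a^2+b^2); 0].
r = sqrt(a^2 + b^2);
if r == 0
    G = eye(2);          % nothing to rotate
else
    c = a/r; s = b/r;
    G = [c s; -s c];
end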
Problems. For the $4 \times 4$ Vandermonde matrix (2.1):
(1) Find the Givens rotation that rotates rows 2 and 4 to create a zero in position (4,4).
(2) Find the Givens rotation that rotates columns 1 and 2 to create a zero in position (1,1).
9. The QR algorithm
The QR algorithm reduces a matrix to an upper triangular form by using Givens rotations to zero out
all the entries below the main diagonal.
We start with an $m \times n$ matrix $A$, then choose an appropriate Givens rotation $G_1$ to zero out the $(m,1)$ entry, and get $A_1 = G_1 A$. Then we choose another Givens rotation $G_2$ to zero out the $(m-1,1)$ entry, obtaining $A_2 = G_2 A_1$. Note that $A_2 = G_2 A_1 = G_2 G_1 A$. We go on until we get an upper triangular matrix $R$. If the total number of entries to be zeroed out is $k$, then
$$R = G_k \cdots G_2 G_1 A.$$
Therefore $A = G_1^T G_2^T \cdots G_k^T R$, since the $G_i$'s are orthogonal. We define $Q = G_1^T G_2^T \cdots G_k^T$, which is an orthogonal matrix as a product of orthogonal ones. The decomposition
$$A = QR$$
is called the QR decomposition of $A$.
So here is the QR algorithm. We need two nested loops to kill all entries below the main diagonal. We start with column 1, killing entries $(2,1)$ through $(m,1)$, then proceed with column 2, and so on.
function [Q,A]=qr143M(A)
[m,n]=size(A);
Q=eye(m);
for j=1:n
    for i=j+1:m
        G=givens(A(j,j),A(i,j));
        A([j i],j:n)=G*A([j i],j:n);
        Q(:,[j i])=Q(:,[j i])*G';
    end
end
For square matrices, solving $Ax = b$ is then easy. From $QRx = b$ we get $Rx = Q^T b$, which is a triangular linear system that we solve through substitution. It will turn out that solving the least squares problem $\min_x \|Ax - b\|_2$ for a rectangular $A$ has the exact same solution.
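For example (a usage sketch of ours, with qr143M from above):

A = pascal(4); b = rand(4,1);
[Q, R] = qr143M(A);
x = R \ (Q'*b);          % one triangular solve after forming Q'*b
norm(A*x - b)            % small residual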
One can compute the QR decomposition through other methods, e.g., Householder transformations,
which yield the same decomposition.
Uniqueness of the QR decomposition. The QR decomposition is unique only up to signs, because $A = (QJ)(JR)$ is also a QR decomposition of $A$ for any choice of a diagonal matrix $J$ with $+1$s and $-1$s on the diagonal. The matrix $QJ$ is still orthogonal and $JR$ upper triangular. In the complex case one can choose $J$ to be any diagonal matrix of complex signs (i.e., complex numbers on the unit circle).
Reduced QR decomposition. The QR algorithm applied to an $m \times n$ matrix yields an $m \times m$ matrix $Q$ and an $m \times n$ matrix $R$. If we take only the first $n$ columns of $Q$ and the first $n$ rows of $R$, the resulting $m \times n$ matrix $Q$ will still be orthogonal, the resulting $n \times n$ matrix $R$ will still be upper triangular, and the product will still equal $A$. This is called the reduced QR decomposition. The reason is that the last $m - n$ columns of the original $Q$ multiply the zeros in the bottom $m - n$ rows of the original $R$ and can thus be ignored.
For example,
$$\begin{pmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \end{pmatrix} = \begin{pmatrix} 0.5 & -0.6708 & 0.0236 & 0.5472 \\ 0.5 & -0.2236 & -0.4393 & -0.7120 \\ 0.5 & 0.2236 & 0.8079 & -0.2176 \\ 0.5 & 0.6708 & -0.3921 & 0.3824 \end{pmatrix} \begin{pmatrix} 2.0000 & 5.0000 \\ 0 & 2.2361 \\ 0 & 0 \\ 0 & 0 \end{pmatrix} = \begin{pmatrix} 0.5 & -0.6708 \\ 0.5 & -0.2236 \\ 0.5 & 0.2236 \\ 0.5 & 0.6708 \end{pmatrix} \begin{pmatrix} 2.0000 & 5.0000 \\ 0 & 2.2361 \end{pmatrix}.$$
Problems.
(1) Compute (by hand) the QR decomposition of the matrix
$$\begin{pmatrix} 1 & 2 \\ 1 & 3 \end{pmatrix}.$$
(2) Write a program to compute a QL decomposition of a matrix as a product of an orthogonal matrix and a lower triangular matrix. Turn in your code and the result of your code (the matrices Q and L) for the $5 \times 5$ Pascal matrix pascal(5).
(3) Compute the reduced QR decomposition of the $5 \times 3$ matrix obtained in MATLAB as A=pascal(5); A=A(:,1:3); Turn in the matrix Q only.
$$\frac{\|\delta x\|}{\|\hat x\|} \le \|A^{-1}\| \|A\| \cdot \frac{\|\delta A\|}{\|A\|}.$$
In other words, the relative change in $A$ gets magnified by $\|A^{-1}\| \|A\|$ to produce the relative change in $x$. This justifies calling the quantity
$$\kappa(A) = \|A^{-1}\| \|A\|$$
the condition number of $A$. When we refer to a particular norm, we write, e.g., $\kappa_2(A) = \|A\|_2 \|A^{-1}\|_2$.
Properties of $\kappa(A)$:
• $\kappa(A) \ge 1$ in the 1, 2, and infinity norms. Follows from $\|I\| = \|A A^{-1}\| \le \|A\| \|A^{-1}\|$.
• $\kappa_2(Q) = 1$ for orthogonal matrices $Q$: $\kappa_2(Q) = \|Q\|_2 \|Q^{-1}\|_2 = \|QI\|_2 \|Q^T I\|_2 = \|I\|_2 \|I\|_2 = 1$, since the 2-norm is unaffected by multiplication by an orthogonal matrix.
11.2. Backward stability. The addition of three numbers is backward stable:
$$\operatorname{fl}(a + b + c) = \operatorname{fl}(\operatorname{fl}(a+b) + c) = \operatorname{fl}((a+b)(1+\delta) + c) = \big((a+b)(1+\delta) + c\big)(1+\delta_1) = a(1+\delta)(1+\delta_1) + b(1+\delta)(1+\delta_1) + c(1+\delta_1) = a_1 + b_1 + c_1,$$
where $a_1 = a(1+\delta)(1+\delta_1)$, $b_1 = b(1+\delta)(1+\delta_1)$, $c_1 = c(1+\delta_1)$, and $|\delta|, |\delta_1| \le \varepsilon$.
In other words, the floating point result of the sum of the numbers $a$, $b$, and $c$ is the exact sum of the numbers $a_1$, $b_1$, and $c_1$. The relative perturbation in $a_1$ is
$$\Big| \frac{a_1 - a}{a} \Big| = |(1+\delta)(1+\delta_1) - 1| = |\delta + \delta_1| \le 2\varepsilon,$$
where we ignored the $\delta\delta_1$ term, which is $O(\varepsilon^2)$. It is common practice to ignore second and higher order terms in backward error analysis.
We get similar inequalities for $b_1$ and $c_1$, and say that the backward error in the sum of three numbers is $2\varepsilon$. Any algorithm whose backward error is $O(\varepsilon)$ is called backward stable.
As an exercise (see the problems after this section), prove that multiplication by a Givens rotation $G$ is backward stable, i.e., $\operatorname{fl}(\hat G A) = G(A + E)$, where $\|E\|_2 = O(\varepsilon)\|A\|_2$ and $\hat G$ is the floating point representation of $G$.
Theorem 11.1. Applying a series of Givens rotations is backward stable:
$$\operatorname{fl}(\hat G_k \cdots \hat G_1 A) = G_k \cdots G_1 (A + E),$$
where $\|E\|_2 = O(\varepsilon)\|A\|_2$ and $\hat G_i$ is the floating point representation of the Givens rotation $G_i$, $i = 1, 2, \ldots, k$.
Proof. The proof goes by induction, the $k = 1$ step being trivial. Assume that $\operatorname{fl}(\hat G_{k-1} \cdots \hat G_1 A) = G_{k-1} \cdots G_1 A + E_1$, where $\|E_1\|_2 = O(\varepsilon)\|A\|_2$. Then for $B \equiv \operatorname{fl}(\hat G_{k-1} \cdots \hat G_1 A)$ we have
$$\operatorname{fl}(\hat G_k B) = G_k(B + E_2) = G_k(G_{k-1} \cdots G_1 A + E_1) + G_k E_2 = G_k G_{k-1} \cdots G_1 A + G_k E_1 + G_k E_2,$$
where $\|G_k E_1 + G_k E_2\|_2 \le \|E_1\|_2 + \|E_2\|_2 \le O(\varepsilon)\|A\|_2 + O(\varepsilon)\|B\|_2 = O(\varepsilon)\|A\|_2$, since $\|B\|_2 \le \|G_{k-1} \cdots G_1 A\|_2 + \|E_1\|_2 = \|A\|_2 + O(\varepsilon)\|A\|_2$.
Theorem 11.2. The QR decomposition of a matrix is backward stable, i.e., if $\hat Q$ and $\hat R$ are the computed $Q$ and $R$, then
$$\hat Q \hat R = A + \delta A, \qquad \text{where } \frac{\|\delta A\|}{\|A\|} = O(\varepsilon).$$
Theorem 11.3. If $Ax = b$ is solved by first computing the QR decomposition of $A$ and then solving $Rx = Q^T b$ via backward substitution, then
$$(A + \delta A)\hat x = b, \qquad \text{where } \frac{\|\delta A\|}{\|A\|} = O(\varepsilon),$$
and
$$\frac{\|\hat x - x\|}{\|x\|} = O(\varepsilon\,\kappa(A)).$$
Problems.
Partition $U = [u, V]$, where $u$ is the (normalized) eigenvector and $V$ contains the remaining columns of the unitary matrix $U$.¹ Then
$$U^* A U = \begin{pmatrix} u^* \\ V^* \end{pmatrix} A \begin{pmatrix} u & V \end{pmatrix} = \begin{pmatrix} \lambda & u^* A V \\ 0 & V^* A V \end{pmatrix},$$
where the 0 stands for a zero vector of length $n - 1$. We can then proceed by induction with the trailing $(n-1) \times (n-1)$ matrix $V^* A V$.
So our goal is to compute the Schur decomposition of $A$ using orthogonal similarity transformations. Those, of course, will be comprised of Givens rotations.
Problems.
(1) Let $x$ and $y$ be vectors and the rank-1 matrix $A$ be defined as $A = x y^T$. In terms of $x$ and $y$, find the only nonzero eigenvalue of $A$ and a left and a right eigenvector for it. What is $\|A\|_2$?

¹We must allow $U$ to be unitary rather than orthogonal, because $\lambda$ may be complex. One method to obtain $U$ is to take an arbitrary $n \times n$ matrix, replace the first column with $u$, and compute its QR decomposition. The first column of $Q$ will be $u$.
14. QR iteration

14.1. Reduction to Hessenberg form.
We use a Givens rotation of rows 2 and 3 to create a zero in position (3,1), completing the similarity on the right (rotating columns 2 and 3), which does not disturb the zero we just created. We then use row 2 again to create a zero in position (4,1), again completing the similarity on the right, which again does not disturb the zero we just created in position (3,1) (since this rotates columns 2 and 4). We proceed like this, zeroing out all elements in the first column down to the (n,1) entry.
It might be tempting to zero out the (2,1) entry as well. We could do so by applying a Givens rotation on the left rotating rows 1 and 2. However, completing the similarity on the right by rotating columns 1 and 2 would destroy all the zeros in the first column we just created. Thus we leave the (2,1) entry intact.
We proceed with the second column, and so on. One can easily see that neither the Givens rotations needed to create the new zeros nor the completion of the similarity transformations on the right affect any of the zeros we already have.
Here is the algorithm.
function A=hessenberg(A)
n=size(A,1);
for j=1:n-2
    for i=j+2:n
        G=givens(A(j+1,j),A(i,j));
        A([j+1 i],:)=G*A([j+1 i],:);
        A(:,[j+1 i])=A(:,[j+1 i])*G';
    end
end
14.2. Wilkinson shift. While the QR iteration as we have described it works most of the time, there are examples on which it fails to converge at all, or converges very slowly.
For example, QR iteration fails to converge on the matrix $\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$: the Givens rotation for the QR iteration is $\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$ itself, so the QR iteration produces the sequence $\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \to \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix} \to \cdots$, even though the eigenvalues $+1$ and $-1$ are nice.
Additionally, QR iteration can be slow to converge to clustered eigenvalues, i.e., ones that have small relative gaps
$$\frac{|\lambda_i - \lambda_j|}{|\lambda_i|}.$$
For example, the eigenvalues 1.00001 and 1.00002 are clustered (the relative gap between them is $10^{-5}$), whereas 0.00001 and 0.00002 are not (the relative gap between these two is 1).
Both of these problems can be remedied by introducing shifts, which is actually a very simple change to the algorithm. If the eigenvalues of $A$ are $\lambda_i$, $i = 1, 2, \ldots, n$, those of $A - sI$ are $\lambda_i - s$, $i = 1, 2, \ldots, n$. If $\lambda_i$ and $\lambda_j$ have a small relative gap, then the relative gap between $\lambda_i - s$ and $\lambda_j - s$ is
$$\frac{|\lambda_i - \lambda_j|}{|\lambda_i - s|}.$$
As long as we pick $s$ near $\lambda_i$, this new relative gap can be made very large (i.e., close to 1), and then $\lambda_i - s$ and $\lambda_j - s$ will be very well separated. For example, if $\lambda_i = 1.00001$ and $\lambda_j = 1.00002$, choosing any shift between 1 and 1.00003 makes the relative gap between $\lambda_i - s$ and $\lambda_j - s$ at least 0.5.
One shift which works particularly well is the Wilkinson shift: it is the eigenvalue of the trailing $2 \times 2$ submatrix
$$\begin{pmatrix} a_{n-1,n-1} & a_{n-1,n} \\ a_{n,n-1} & a_{nn} \end{pmatrix}$$
that is closer to $a_{nn}$. With this shift the QR iteration converges in 2-3 steps per eigenvalue. It requires that we change the inner loop of the QR iteration to
[Q,R]=qr(A-s*I)
A=R*Q+s*I
Note that with the shift, we still have a similarity transformation: if $A - sI = QR$, then
$$RQ + sI = Q^T Q R Q + sI = Q^T(A - sI)Q + sI = Q^T A Q - sQ^T I Q + sI = Q^T A Q.$$
14.3. Convergence. When have we converged to an eigenvalue in position (n,n)? Theoretically, this happens when $a_{n,n-1} = 0$. But in the presence of roundoff errors zeros are elusive, so we settle for machine precision: we say that we've converged when $|a_{n,n-1}| < 10^{-16} |a_{nn}|$.
14.4. Our ultimate QR iteration algorithm. Our ultimate algorithm performs $n - 1$ Givens rotations to compute the QR decomposition, records them, and then applies them on the right.
function A=QRiteration(A)
A=hessenberg(A);
n=size(A,1);
StoredGivens=zeros(2,2,n);
i=n;
while i>1
    L=eig(A(i-1:i,i-1:i));
    if abs(L(1)-A(i,i))<abs(L(2)-A(i,i))
        shift=L(1);
    else
        shift=L(2);
    end
    for k=1:i
        A(k,k)=A(k,k)-shift;
    end
    for k=2:i
        G=givens(A(k-1,k-1),A(k,k-1));
        A(k-1:k,:)=G*A(k-1:k,:);
        StoredGivens(:,:,k)=G;
    end
    for k=2:i
        G=StoredGivens(:,:,k);
        A(:,k-1:k)=A(:,k-1:k)*G';
    end
    for k=1:i
        A(k,k)=A(k,k)+shift;
    end
    if abs(A(i,i-1))<10^(-16)*abs(A(i,i))
        i=i-1;
    end
end
14.5. The symmetric eigenvalue problem. If $A$ is symmetric, the exact same QR iteration yields the eigenvalues of $A$. The difference is that the Hessenberg form of $A$ is now tridiagonal (since if $QAQ^T = H$ and $A^T = A$, we get $H^T = H$, so $H$ must be symmetric, thus tridiagonal).
One can account for that fact by applying each Givens rotation to only those entries in the rows and columns of $A$ that are nonzero.
14.6. Operation count. We will only do the accounting in the case when only the eigenvalues (but not the eigenvectors) are desired.
The reduction to Hessenberg form requires $(n-2) + (n-1) + \cdots + 1$ Givens rotations, each costing $O(n)$, for a total of $O(n^3)$.
The iterative part of the algorithm then requires 2-3 iterations for each of the $n$ eigenvalues. Each iteration consists of $2(n-1)$ Givens rotation applications ($n - 1$ rotations, applied to each side of the matrix).
In the nonsymmetric case each Givens rotation costs $O(n)$. In the symmetric case each Givens rotation only changes 4 entries in each of the 2 rows (or columns, when applied on the right) that it touches; at 4 arithmetic operations per entry, this is $2 \cdot 4 \cdot 4 = 32$ arithmetic operations each. The overall cost of the iterative part is thus:
• in the nonsymmetric case, $O(n^3)$;
• in the symmetric case, $O(n^2)$.
Note that in the symmetric case the finite part, i.e., the reduction to tridiagonal form, costs $O(n^3)$, while the potentially infinite iterative part takes $O(n^2)$ and is negligible compared to the finite one.
14.7. Stability. The following theorem states that the application of one or multiple Givens rotations to a matrix $A$ (possibly with a shift) is backward stable.
Theorem 14.1. If $A = Q B Q^T$ is computed using Givens rotations in floating point arithmetic with precision $\varepsilon$, then for the computed factors $\hat Q$, $\hat B$ we have
$$\hat Q \hat B \hat Q^T = A + \delta A, \qquad \text{where } \frac{\|\delta A\|}{\|A\|} = O(\varepsilon).$$
This theorem indicates that the reduction to Hessenberg form, as well as the subsequent eigenvalue computation, is backward stable. In other words, the computed eigenvalues $\hat\lambda_i$ will be those of a nearby matrix $A + \delta A$. For an eigenvalue $\lambda$ with (normalized) right and left eigenvectors $x$ and $y$, a perturbation $\delta A$ causes
$$\delta\lambda \approx \frac{y^*\,\delta A\, x}{y^* x}, \qquad \text{so} \qquad |\delta\lambda| \lesssim \frac{\|\delta A\|}{|y^* x|}.$$
Therefore eigenvalues where the angle between the left and right eigenvectors is large (i.e., $1/|y^* x| = \sec\angle(x,y)$ is large) will be very ill conditioned and will likely be computed inaccurately in the presence of roundoff.
Even when $|y^* x|$ is as large as possible, e.g., in the symmetric case when $x = y$, there is no guarantee that the computed eigenvalue will be accurate if it is very small compared to $\|A\|$. In the symmetric case the above implies $|\delta\lambda| \lesssim \|\delta A\|$.
Problems:
(1) Modify the MATLAB code QRiteration to accumulate the eigenvector matrix and compute (and turn in) the eigenvector matrix of pascal(5). You will need to accumulate all the Givens rotations for the reduction to Hessenberg form, then the Givens rotations in the reduction to Schur form (which for this symmetric matrix will be diagonal). Compare your output with that of eig.
(2) Modify the MATLAB code QRiteration so that when the matrix is tridiagonal, it only takes $O(n^2)$ operations to compute all the eigenvalues. You will need to modify the application of the Givens rotations so that they only affect the nonzero entries of $A$, then empirically estimate the total cost. Turn in your code and the number of flops for the $20 \times 20$ tridiagonal matrix with $a_{ii} = 2$, $i = 1, 2, \ldots, n$, and $a_{ij} = -1$ for $|i - j| = 1$.
Properties of the SVD $A = U \Sigma V^T$ of an $m \times n$ ($m \ge n$) matrix, with singular values $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_n \ge 0$:
• $\sigma_1 = \|A\|_2$.
• $A_k = \sum_{i=1}^{k} \sigma_i u_i v_i^T$ is the best rank-$k$ approximation to $A$.
• If $A$ is symmetric, $\sigma_i = |\lambda_i|$, $i = 1, 2, \ldots, n$.
• The eigenvalues of $A^T A$ are $\sigma_i^2$, and the eigenvalues of $A A^T$ are $\sigma_i^2$ and $m - n$ zeros.
• $\operatorname{rank}(A)$ equals the number of nonzero singular values.
• If $A$ is nonsingular, $\|A^{-1}\|_2 = 1/\sigma_n$ and $\kappa(A) = \sigma_1/\sigma_n$.
• If $A$ has full rank, then the solution to the least squares problem $\min_x \|Ax - b\|_2$ is $x = V \Sigma^{-1} U^T b$.
Computing the SVD. The computation of the SVD consists of two parts. First, we use Givens rotations to reduce the matrix to bidiagonal form through a process called Golub-Kahan bidiagonalization.
We use $m - 1$ Givens rotations on the left to zero out the entries below the diagonal in the first column, and then $n - 2$ Givens rotations on the right to zero out the third through the last elements in the first row. This way only 2 nonzeros remain in the first row, and creating the zeros in the first row does not affect the zeros we already created in the first column. We then proceed by induction with the trailing $(m-1) \times (n-1)$ matrix:
0
a11 a12 a13 . . . a1n
a11 a012 a013 . . . a01n
a21 a22 a23 . . . a2n 0
a022 a023 . . . a02n
Gn1 Gn2 . . . G1 .
=
..
..
..
..
..
..
..
..
..
..
.
.
.
.
.
.
.
.
.
am1
am2
am3
...
amn
a0m2
a0m3
...
a0mn
and
a011
0
..
.
a012
a022
..
.
a013
a023
..
.
...
...
..
.
a01n
a02n
..
.
a0m2
a0m3
...
a0mn
0 0
G1 G2 G0n2 =
57
a011
0
..
.
a0012
a0022
..
.
0
a0023
..
.
...
...
..
.
0
a002n
..
.
a00m2
a00m3
...
a00mn
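A sketch of the full bidiagonalization (my code, not from the notes; it uses MATLAB's planerot to build each rotation and does not accumulate the rotations, which a real SVD code would do):

    % Golub-Kahan bidiagonalization of an m x n matrix, m >= n, via Givens rotations.
    function A = bidiagonalize(A)
    [m, n] = size(A);
    for j = 1:n
        for i = j+1:m                        % zero A(i,j) below the diagonal
            [G, ~] = planerot(A([j i], j));
            A([j i], j:n) = G*A([j i], j:n);
        end
        for k = j+2:n                        % zero A(j,k) right of the superdiagonal
            [G, ~] = planerot(A(j, [j+1 k]).');
            A(:, [j+1 k]) = A(:, [j+1 k])*G.';
        end
    end
    end

For example, bidiagonalize(magic(5)) produces, up to roundoff, zeros everywhere outside the diagonal and the first superdiagonal.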
The Krylov matrix $K_m = (b, Ab, A^2b, \ldots, A^{m-1}b)$ satisfies $AK_m = K_mC$, where
$$C = \begin{pmatrix} 0 & 0 & \cdots & 0 & c_1 \\ 1 & 0 & \cdots & 0 & c_2 \\ 0 & 1 & \cdots & 0 & c_3 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & 1 & c_m \end{pmatrix}$$
is Hessenberg (its last column contains the coefficients $c_i$ of $A^mb$ expressed in the Krylov basis). Now if $Q_mR_m = K_m$ is the reduced QR decomposition of $K_m$, then $AK_m = K_mC$ means
$$(Q_mR_m)^{-1}AQ_mR_m = C$$
or
$$Q_m^TAQ_m = R_mCR_m^{-1} = H,$$
where $H$ is also Hessenberg (because one can easily verify that the product of the two upper triangular matrices $R_m$ and $R_m^{-1}$ with the Hessenberg matrix $C$ is also Hessenberg).
What are the entries of H?
We have
$$AQ_m = Q_mH.$$
We compare column $j$ on both sides. On the left it is $Aq_j$, where $q_j$ is the $j$th column of $Q_m$; on the right it is $Q_mh_j$, where $h_j$ is the $j$th column of $H$. Since $H$ is Hessenberg,
$$Aq_j = \sum_{i=1}^{j+1} h_{ij}\,q_i.$$
Multiplying both sides on the left by $q_k^T$ for $k \le j$ gives $q_k^TAq_j = \sum_{i=1}^{j+1} h_{ij}\,q_k^Tq_i = h_{kj}$ (by the orthonormality of the $q_i$), which implies
$$h_{j+1,j}\,q_{j+1} = Aq_j - \sum_{i=1}^{j} h_{ij}\,q_i.$$
Therefore we can get the vector $q_{j+1}$ by forming $Aq_j$ and subtracting all its components in the direction of $q_1, q_2, \ldots, q_j$. This is just the modified Gram-Schmidt algorithm.
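The recurrence translates directly into code. A minimal sketch (the function name and interface are mine):

    % Arnoldi process via modified Gram-Schmidt: A*Q(:,1:m) = Q*H, where
    % the columns of Q are an orthonormal basis of the Krylov subspace.
    function [Q, H] = arnoldi(A, b, m)
    n = length(b);
    Q = zeros(n, m+1); H = zeros(m+1, m);
    Q(:,1) = b/norm(b);
    for j = 1:m
        w = A*Q(:,j);                  % form A q_j
        for i = 1:j                    % subtract components along q_1, ..., q_j
            H(i,j) = Q(:,i)'*w;
            w = w - H(i,j)*Q(:,i);
        end
        H(j+1,j) = norm(w);
        Q(:,j+1) = w/H(j+1,j);         % assumes no breakdown, H(j+1,j) ~= 0
    end
    end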
We seek the approximate solution $x \in \mathcal{K}_m$, i.e., $x = Q_mc$, that minimizes the residual $\|AQ_mc - b\|_2$. The vector $b$ is chosen as the initial guess, thus $b \in \mathcal{K}_m \subset \mathcal{K}_{m+1}$, and also $AQ_mc \in \mathcal{K}_{m+1}$. Therefore multiplying by $Q_{m+1}^T$ does not change the 2-norm above, so
$$\min_c \|AQ_mc - b\|_2 = \min_c \|Q_{m+1}^TAQ_mc - Q_{m+1}^Tb\|_2 = \min_c \|\tilde Hc - Q_{m+1}^Tb\|_2,$$
where $\tilde H = Q_{m+1}^TAQ_m$ is the $(m+1) \times m$ Hessenberg matrix of the Arnoldi recurrence. Since $q_1 = b/\|b\|_2$, we have $Q_{m+1}^Tb = \|b\|_2e_1$, so $c$ is the solution to the least squares problem $\min_c \|\tilde Hc - \|b\|_2e_1\|_2$. Once we have $c$, we obtain $x = Q_mc$.
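The procedure just described is the GMRES method. A minimal sketch of mine, built on the arnoldi function sketched above:

    function x = gmres_sketch(A, b, m)
    [Q, H] = arnoldi(A, b, m);         % H is the (m+1) x m Hessenberg matrix
    e1 = zeros(m+1, 1); e1(1) = norm(b);
    c = H \ e1;                        % least squares: min ||H c - ||b|| e1||_2
    x = Q(:,1:m)*c;                    % map back: x = Q_m c
    end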
The $N \times N$ DFT matrix is $\Phi = [w^{lk}]_{l,k=0}^{N-1}$, where $w = e^{-2\pi i/N} = \cos\frac{2\pi}{N} - i\sin\frac{2\pi}{N}$ is a principal $N$th root of unity.
Its utility comes from:
(1) $\Phi^{-1} = \frac{1}{N}\bar\Phi$.
(2) $Z$ is the imaginary part of $\Phi_{2N+2}(2:N+1, 2:N+1)$; also $Z^{-1} = \frac{2}{N+1}Z$ since $Z$ is symmetric.
(3) One can multiply by $\Phi$ in $O(N\log N)$ time (vs. $O(N^2)$ for conventional matrix-vector multiplication).

Proof of (1): Since $\bar w = w^{-1}$ and $w^N = 1$,
$$(\Phi\bar\Phi)_{lj} = \sum_{k=0}^{N-1} w^{lk}\,\bar w^{kj} = \sum_{k=0}^{N-1} w^{k(l-j)}.$$
When $l = j$, every term equals 1 and the sum is $N$. When $l \ne j$, the sum is geometric and equals
$$\frac{1 - w^{N(l-j)}}{1 - w^{l-j}} = 0,$$
since $w^{N(l-j)} = 1$. Hence $\Phi\bar\Phi = NI$.
For the cost $C(N)$ of the algorithm, we assume the powers of $w$ are precomputed and stored, so we have
$$C(N) = 2C\Big(\frac{N}{2}\Big) + \frac{3N}{2} = 4C\Big(\frac{N}{4}\Big) + 2\cdot\frac{3N}{2} = 8C\Big(\frac{N}{8}\Big) + 3\cdot\frac{3N}{2} = \cdots = \frac{3N}{2}\log_2 N.$$
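The recurrence reflects the splitting of a length-$N$ transform into two length-$N/2$ transforms plus $3N/2$ operations to combine them. A minimal recursive sketch of mine (assumes $x$ is a column vector whose length is a power of 2):

    function y = myfft(x)
    N = length(x);
    if N == 1, y = x; return; end
    ye = myfft(x(1:2:N-1));            % transform of the even-indexed entries
    yo = myfft(x(2:2:N));              % transform of the odd-indexed entries
    w  = exp(-2i*pi/N).^(0:N/2-1).';   % twiddle factors w^k
    t  = w.*yo;                        % N/2 multiplications
    y  = [ye + t; ye - t];             % N additions: the butterfly step
    end

The output matches MATLAB's fft(x) up to roundoff.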
17.2. Multiplication by V. We now explain how to perform the multiplication $Vx$ in $O(N^2\log N)$ time. Since $V$ is $N^2 \times N^2$ and $x$ is $N^2 \times 1$, we partition $x$ into sections of size $N$:
$$x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{pmatrix}, \qquad \text{where each} \qquad x_i = \begin{pmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{iN} \end{pmatrix}.$$
Then
$$Vx = [z_{jk}Z]_{j,k=1}^{N}\,x = \begin{pmatrix} z_{11}Z & z_{12}Z & \cdots & z_{1N}Z \\ z_{21}Z & z_{22}Z & \cdots & z_{2N}Z \\ \vdots & \vdots & & \vdots \\ z_{N1}Z & z_{N2}Z & \cdots & z_{NN}Z \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{pmatrix} = \begin{pmatrix} z_{11}y^1 + z_{12}y^2 + \cdots + z_{1N}y^N \\ z_{21}y^1 + z_{22}y^2 + \cdots + z_{2N}y^N \\ \vdots \\ z_{N1}y^1 + z_{N2}y^2 + \cdots + z_{NN}y^N \end{pmatrix},$$
where $y^i = Zx_i$, $i = 1, 2, \ldots, N$. Computing $y^1, \ldots, y^N$ takes $N$ FFT-based multiplications by $Z$, i.e., $O(N^2\log N)$ operations.

To combine the results, look at the $l$th entries of the blocks. For fixed $l$, the vector of $l$th entries of the result is
$$\begin{pmatrix} z_{11}y_l^1 + z_{12}y_l^2 + \cdots + z_{1N}y_l^N \\ \vdots \\ z_{N1}y_l^1 + z_{N2}y_l^2 + \cdots + z_{NN}y_l^N \end{pmatrix} = Z \begin{pmatrix} y_l^1 \\ y_l^2 \\ \vdots \\ y_l^N \end{pmatrix},$$
since the numbers $z_{jk}$ are just the entries of $Z$. This is another $N$ multiplications by $Z$ (one for each $l$), again $O(N^2\log N)$ operations, for a total cost of $O(N^2\log N)$.
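A sketch of mine of the two-stage procedure; multZ applies $Z$ with one FFT of length $2N+2$, using property (2) above:

    function r = multV(x)
    % Multiply x (length N^2) by V = [z_jk Z] in O(N^2 log N) time.
    N = round(sqrt(length(x)));
    X = reshape(x, N, N);                         % column i is the block x_i
    for i = 1:N, X(:,i) = multZ(X(:,i)); end      % stage 1: y^i = Z x_i
    for l = 1:N, X(l,:) = multZ(X(l,:).').'; end  % stage 2: combine the blocks
    r = X(:);
    end

    function y = multZ(v)
    % Z = imag(Phi_{2N+2}(2:N+1,2:N+1)) = -S, where S_jk = sin(jk*pi/(N+1));
    % S*v is extracted from one FFT of the odd extension of v.
    N = length(v); v = v(:);
    F = fft([0; v; 0; -v(N:-1:1)]);    % odd extension, length 2N+2
    y = imag(F(2:N+1))/2;              % S*v = -imag(F(2:N+1))/2, so Z*v = +imag(...)/2
    end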
17.3. The right-hand side of $T_{N\times N}x = b$. We saw that solving the heat equation in 2D means solving $T_{N\times N}x = b$ for a certain right-hand side $b$. We will establish its exact form.

Say we're solving the heat equation on $[0,1] \times [0,1]$, subdivided into $N+1$ parts so that in both the $x$ and $y$ directions we have $N+2$ points: 2 for boundary conditions and $N$ unknown. Say all data is in an $(N+2) \times (N+2)$ matrix $u$ with the unknown values in $u(2:N+1, 2:N+1)$. The boundary conditions are $u(:,1)$, $u(:,N+2)$, $u(1,:)$, and $u(N+2,:)$.

Here are all our equations for the $u_{ij}$'s:
$$T_{N\times N}\,x = b = \begin{pmatrix} u_{1,2}+u_{2,1} \\ u_{1,3} \\ \vdots \\ u_{1,N} \\ u_{1,N+1}+u_{2,N+2} \\ u_{3,1} \\ 0 \\ \vdots \\ 0 \\ u_{3,N+2} \\ \vdots \\ u_{N,1} \\ 0 \\ \vdots \\ 0 \\ u_{N,N+2} \\ u_{N+2,2}+u_{N+1,1} \\ u_{N+2,3} \\ \vdots \\ u_{N+2,N} \\ u_{N+2,N+1}+u_{N+1,N+2} \end{pmatrix}.$$
The blocks of $b$ correspond to the rows of unknowns: the first and last blocks pick up the top and bottom boundary values, every block picks up the left and right boundary values in its first and last entries, and all other entries of $b$ are zero.
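A sketch of the assembly (my own code; u is the $(N+2)\times(N+2)$ matrix of values described above):

    B = zeros(N,N);
    B(1,:) = B(1,:) + u(1,2:N+1);      % top boundary enters the first block
    B(N,:) = B(N,:) + u(N+2,2:N+1);    % bottom boundary enters the last block
    B(:,1) = B(:,1) + u(2:N+1,1);      % left boundary: first entry of each block
    B(:,N) = B(:,N) + u(2:N+1,N+2);    % right boundary: last entry of each block
    b = reshape(B.', [], 1);           % stack the rows to match the ordering above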
Here $\hat a = a(1+\delta_1)$ and $\hat b = b(1+\delta_2)$ denote quantities carrying relative errors $|\delta_1| \le k\varepsilon$ and $|\delta_2| \le l\varepsilon$ from previous computations. For the addition of two positive quantities, $\mathrm{fl}(\hat a + \hat b) = (\hat a + \hat b)(1+\delta_3)$ with $|\delta_3| \le \varepsilon$, and the relative error is bounded by
$$\left|\frac{\mathrm{fl}(\hat a + \hat b) - (a+b)}{a+b}\right| = \left|\frac{a\delta_1 + b\delta_2 + (a+b)\delta_3}{a+b}\right| \le \frac{ak\varepsilon + bl\varepsilon + (a+b)\varepsilon}{a+b} \le (k+l+1)\varepsilon.$$

Multiplication: We have $\mathrm{fl}(\hat a\hat b) = (\hat a\hat b)(1+\delta_3)$, where $|\delta_3| \le \varepsilon$. Therefore
$$\frac{\mathrm{fl}(\hat a\hat b) - ab}{ab} = \frac{a(1+\delta_1)\,b(1+\delta_2)(1+\delta_3) - ab}{ab} = (1+\delta_1)(1+\delta_2)(1+\delta_3) - 1 \approx \delta_1 + \delta_2 + \delta_3,$$
and $|\delta_1 + \delta_2 + \delta_3| \le (k+l+1)\varepsilon$.

Division: Using $\frac{1}{1+\delta_2} \approx 1 - \delta_2$, the relative error in $\mathrm{fl}(\hat a/\hat b)$ is
$$\approx \delta_1 - \delta_2 + \delta_3, \qquad |\delta_1 - \delta_2 + \delta_3| \le (k+l+1)\varepsilon.$$
We see that in all three cases ($+$, $\times$, $/$) the relative errors accumulate modestly, with the result still having about as many correct digits as the two arguments.
One also shouldn't worry about the $k$ and $l$ coefficients in the relative errors: in practical computations with $n \times n$ matrices these typically grow as $O(n)$, leaving plenty of correct digits even for $n$ in the millions.
Subtractions are tricky: following the same calculation as for addition, the relative error in subtraction is easily seen to be
$$\frac{\mathrm{fl}(\hat a - \hat b) - (a-b)}{a-b} = \frac{a\delta_1 - b\delta_2 + (a-b)\delta_3}{a-b}, \qquad \left|\frac{a\delta_1 - b\delta_2 + (a-b)\delta_3}{a-b}\right| \le \frac{a|\delta_1| + b|\delta_2|}{|a-b|} + |\delta_3| \le \frac{a+b}{|a-b|}\max(k,l)\,\varepsilon + \varepsilon.$$
Unlike $+$, $\times$, $/$, this relative error now depends on $a$ and $b$! In particular, if $a - b$ is much smaller than $a + b$, i.e., when $a$ and $b$ share a few significant digits, the relative error can be enormous!
For example, if $a$ and $b$ have 7 digits in common, e.g., $a = 1.234567444444444$ and $b = 1.234567333333333$, then
$$\frac{a+b}{a-b} \approx 10^7$$
and $\mathrm{fl}(\hat a - \hat b)$ will have 7 fewer correct significant digits than either $\hat a$ or $\hat b$!
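This is easy to observe in MATLAB (a small demo of mine, using the numbers above):

    format long e
    a = 1.234567444444444;
    b = 1.234567333333333;
    (a + b)/(a - b)   % roughly 2e7: the amplification factor of prior errors
    a - b             % the exact answer is 1.11111111e-07; the computed value
                      % carries the rounding of a and b, so only about 9 of
                      % its 16 digits can be trusted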
This is how accuracy is lost in numerical computations: when subtracting quantities that
(1) are results of previous floating point computations (and thus carry rounding errors), and
(2) share significant digits.
Item (1) above has a (pleasant) caveat: if $a$ and $b$ are initial data (and thus assumed to be known exactly), then $a - b$ is computed accurately, as per the fact that $\mathrm{fl}(a - b) = (a - b)(1 + \delta)$ with relative error $|\delta| \le \varepsilon$.
The moral of this story is that
the output will be accurate if we only multiply, divide, add, and only subtract initial data!
If we perform other subtractions, it does not automatically mean the results will be inaccurate, just that there's no guarantee of accuracy.
For example, the formula for the determinant of the Vandermonde matrix $V = [x_i^{j-1}]_{i,j=1}^n$,
$$\det V = \prod_{i>j} (x_i - x_j),$$
will always be accurate because it only involves subtractions of initial data and multiplications.
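For instance, a minimal sketch of mine evaluating this product formula:

    x = [1 2 3 4.000001];        % nodes (initial data)
    n = length(x);
    d = 1;
    for j = 1:n-1
        for i = j+1:n
            d = d*(x(i) - x(j)); % only subtractions of initial data: accurate
        end
    end
    d                            % det of the Vandermonde matrix [x_i^(j-1)]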
Only true subtractions should be avoided. Just because there is a minus sign in the formula does not mean there is a potential problem: it must be a true subtraction (i.e., a positive quantity subtracted from a positive quantity). Subtracting a positive from a negative is a net addition and cannot result in loss of significant digits.
For example, we can be certain to get an accurate result if we multiply two matrices with positive entries (since we'll only add and multiply), but it is also OK to multiply two checkerboard sign pattern matrices, since then the multiplication will not involve any true subtractions regardless of all the minuses. For example, the product of the matrices
$$\begin{pmatrix} 1 & -1 & 1 & -1 \\ -1 & 2 & -3 & 4 \\ 2 & -1 & 0 & -3 \\ -1 & 2 & -3 & 5 \end{pmatrix} \qquad \text{and} \qquad \begin{pmatrix} 2 & -1 & 2 & -3 \\ -5 & 2 & -1 & 3 \\ 3 & -2 & 5 & -1 \\ -6 & 1 & -2 & 1 \end{pmatrix}$$
will involve no (true) subtractions! E.g., the $(1,1)$ entry of the product is $1\cdot 2 + 1\cdot 5 + 1\cdot 3 + 1\cdot 6$, the $(2,4)$ entry is $1\cdot 3 + 2\cdot 3 + 3\cdot 1 + 4\cdot 1$, and similarly for all the other entries, despite all the minuses!
This is an important observation, since the inverses of the Hilbert and Pascal matrices have checkerboard sign patterns, and in general all linear algebra with these matrices can be performed to high relative accuracy [6].
How can we perform linear algebra without subtracting? This obviously cannot always be done, but there are many classes of structured matrices where accurate computations are possible through the avoidance of subtractive cancellation [2, 6]. The problems after this section provide an example of how, by avoiding subtractive cancellation, one can compute accurate results in about the same amount of time that MATLAB will take (and deliver the wrong answer!).
Problems.
(1) Compute the smallest eigenvalue of the $100 \times 100$ matrix $H$, where
$$h_{ij} = \frac{1}{i+j}.$$
Turn in your code and all 16 digits of the smallest eigenvalue. The first 4 digits are $1.001 \times 10^{-151}$.
While there are many ways to solve this problem, here are some ideas.
This matrix is an example of a Cauchy matrix. A Cauchy matrix, $C = C(x,y)$, is defined as
$$c_{ij} = \frac{1}{x_i + y_j}.$$
Its determinant is given by the explicit formula
$$\det C = \frac{\displaystyle\prod_{i=1}^{n}\prod_{j=i+1}^{n} (x_j - x_i)(y_j - y_i)}{\displaystyle\prod_{i=1}^{n}\prod_{j=1}^{n} (x_i + y_j)},$$
which will not result in subtractive cancellation and will thus be accurate.
In the process of writing your code, you may want to write it as a function of $n$. This way, for small $n$, e.g., $n = 6, 7$, your code should match the output of MATLAB's eig; for these values of $n$ the matrix $H$ is still well conditioned and MATLAB's results are still accurate.
You may also want to test your code against the output of invhilb, a built-in MATLAB function that accurately computes the inverse of the Hilbert matrix $[1/(i+j-1)]_{i,j=1}^n$ (note the slight difference with the matrix $H$), so the minimum eigenvalue of the Hilbert matrix can be computed accurately in MATLAB as 1/max(eig(invhilb(100))).
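For instance, a sketch of the sanity checks suggested above (the code is mine):

    n = 7;                            % small enough that eig is still accurate
    H = 1./((1:n)' + (1:n));          % h_ij = 1/(i+j)
    min(eig(H))                       % reference value for validating your method
    Hn = 1./((1:n)' + (1:n) - 1);     % the Hilbert matrix 1/(i+j-1), a different matrix
    [min(eig(Hn)), 1/max(eig(invhilb(n)))]   % these two should agree for small n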
(2) Compute the smallest eigenvalue of the $200 \times 200$ Pascal matrix. Turn in your code and answer (all 16 digits of it). The first few digits of the answer are $2.9 \times 10^{-119}$.
Here are some suggestions. Details on the Pascal matrix can be obtained on Wikipedia. While there are many approaches to solve this problem, one way would be to compute the largest eigenvalue of the inverse. The inverse (which has checkerboard sign pattern) can (once again) be computed without performing any subtractions if one takes the correct approach. Instead of eliminating the matrix in the typical Gaussian elimination fashion, try to eliminate it using only ADJACENT rows and columns. This process is called Neville elimination. Once you eliminate the first row and first column, you will see that the Schur complement is also a Pascal matrix of one size less. In matrix form this elimination can be written as
$$LP_nL^T = P'_{n-1},$$
where $L$ is a lower bidiagonal matrix with ones on the main diagonal and $-1$ on the first subdiagonal, and $P'_{n-1}$ is an $n \times n$ matrix with zeros in the first row and column, except for the $(1,1)$ entry (which equals one), and the matrix $P_{n-1}$ in the lower right-hand corner. You can now observe (no need to prove) that if you have $P_{n-1}^{-1}$, you can compute $P_n^{-1}$ using the above equality without performing any subtractions.
(3) Let $A = LL^T$, where
$$L = \begin{pmatrix} a & d & \\ & b & e \\ & & c \end{pmatrix}.$$
Design a subtraction-free algorithm that computes the Cholesky factor of $A$ given $a$, $b$, $c$, $d$, and $e$ that performs no more than 19 arithmetic operations (it can even be done with 18). Turn in your code and your output in format long e for $a = 10^{20}$, $b = 1$, $c = 2$, $d = 1$, $e = 1$.
The obvious call L=diag([1e20 1 2])+diag([1 1],1); chol(L*L') will return an error saying that the computed $A$ is not positive definite, yet the true matrix $A$ is positive definite.
Bibliography
1. ANSI/IEEE, New York, IEEE Standard for Binary Floating Point Arithmetic, Std 754-1985 ed., 1985.
2. J. Demmel, M. Gu, S. Eisenstat, I. Slapničar, K. Veselić, and Z. Drmač, Computing the singular value decomposition with high relative accuracy, Linear Algebra Appl. 299 (1999), no. 1-3, 21-80.
3. J. W. Demmel, Applied Numerical Linear Algebra, SIAM, Philadelphia, 1997. MR 1463942 (98m:65001)
4. G. Golub and C. Van Loan, Matrix computations, 3rd ed., Johns Hopkins University Press, Baltimore, MD, 1996.
5. N. J. Higham, Accuracy and stability of numerical algorithms, 2nd ed., SIAM, Philadelphia, 2002. MR 2003g:65064
6. Plamen Koev, Accurate computations with totally nonnegative matrices, SIAM J. Matrix Anal. Appl. 29 (2007), 731-751.