# Massively Multivariable Open Online Calculus

Jim Fowler and Steve Gubkin
EXPERIMENTAL DRAFT
This document was typeset on April 27, 2014.
1 An n-dimensional space
We package lists of numbers as “vectors.”
In this course we will be studying the calculus of many variables. That means that
instead of just seeing how one quantity depends on another, we will see how two
quantities could affect a third, or how five inputs might cause changes in three
outputs. The very first step of this journey is to give a convenient mathematical
framework for talking about lists of numbers. To that end we define:
Definition 1 $\mathbb{R}^n$ is the set of all ordered lists containing $n$ real numbers. That is,

$$\mathbb{R}^n = \{(x_1, x_2, \ldots, x_n) : x_1, x_2, \ldots, x_n \in \mathbb{R}\}.$$

The number $n$ is called the dimension of $\mathbb{R}^n$, and $\mathbb{R}^n$ is called $n$-dimensional
space. When speaking aloud, it is also acceptable to say “are en.” We call the
elements of $\mathbb{R}^n$ points or $n$-tuples.
Example 2 $\mathbb{R}^1$ is just the set of all real numbers, which is often visualized by
the number line, which is 1-dimensional.

[Figure: a number line with ticks at −1, 0, 1, 2, 3.]

Example 3 $\mathbb{R}^2$ is the set of all pairs of real numbers, like $(2, 5)$ or $(1.54, \pi)$. This
can be visualized by the coordinate plane, which is 2-dimensional.

[Figure: the coordinate plane with the point (1, 2) marked.]

Example 4 $\mathbb{R}^3$ is the set of all triples of real numbers. It can be visualized as
3-dimensional space, with three coordinate axes.
Question 5 If $(3, 2, e, 1.4) \in \mathbb{R}^n$, what is $n$?
Solution
Hint: $n$-dimensional space consists of ordered $n$-tuples of numbers. How many coordinates does $(3, 2, e, 1.4)$ have?
Hint:
Warning 6 Be careful to distinguish between commas and periods.
Hint: $n = 4$
$n = 4$
Question 7 Which of the following points is farthest from the point $(0, 0)$?
Solution
Hint: [Figure: the points (0, 1), (1, 1), (−2, 3), and (1, 4) plotted in the plane.]
(a) $(0, 1)$
(b) $(1, 1)$
(c) $(-2, 3)$
(d) $(1, 4)$
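The comparison in Question 7 can also be checked numerically. The sketch below assumes that "farther away" means the usual Euclidean distance, which the text has not formally defined yet; the helper name `distance_from_origin` is made up for this illustration.

```python
import math

def distance_from_origin(point):
    """Euclidean distance from the origin to a point given as a tuple (assumed metric)."""
    return math.sqrt(sum(coordinate ** 2 for coordinate in point))

# The four candidate points from Question 7.
candidates = [(0, 1), (1, 1), (-2, 3), (1, 4)]

# Pick the candidate whose distance from the origin is largest.
farthest = max(candidates, key=distance_from_origin)
print(farthest)  # (1, 4), since sqrt(17) is larger than sqrt(13)
```

Because the function sums over however many coordinates the tuple has, the same code works unchanged for points in any $\mathbb{R}^n$.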
It becomes quite difficult to visualize high dimensional spaces. You can sometimes
visualize a higher dimensional object by having 3 spatial dimensions and one
color dimension, or 3 spatial dimensions and one time dimension to get a movie.
Sometimes you can project a higher dimensional object into a lower dimensional
space. If you have the time, you should watch the excellent film Dimensions
(http://www.dimensions-math.org/), which will get you to visualize some higher
dimensional objects.
Although we may often be working with high-dimensional objects, we will generally
not try to visualize objects in dimensions above 3 in this course. Nevertheless,
we hope the video is enlightening!
2 Vector spaces
Vector spaces are where vectors live.
It will be convenient for us to equip $\mathbb{R}^n$ with two algebraic operations: “vector
addition” and “scalar multiplication” (to be defined soon). This additional structure
will transform $\mathbb{R}^n$ from a mere set into a “vector space.” To distinguish between
$\mathbb{R}^n$ as a set and $\mathbb{R}^n$ as a vector space, we think of elements of $\mathbb{R}^n$ as a set as being
ordered lists, such as

$$\mathbf{p} = (x_1, x_2, x_3, \ldots, x_n),$$

but elements of $\mathbb{R}^n$ the vector space will be written typographically as vertically
oriented lists flanked with square brackets, like this:

$$\vec{v} = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_n \end{bmatrix}$$

We will try to stick to the convention that bold letters like $\mathbf{p}$ represent points,
while letters with little arrows above them (like $\vec{v}$) represent vectors.
Unfortunately (like practically everybody else in the world), we use the same
symbol $\mathbb{R}^n$ to refer to both the vector space $\mathbb{R}^n$ and the underlying set of points
$\mathbb{R}^n$.
Vector addition is defined as follows:

$$\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} + \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} x_1 + y_1 \\ x_2 + y_2 \\ \vdots \\ x_n + y_n \end{bmatrix}$$

Warning 1 You cannot add vectors in $\mathbb{R}^n$ and $\mathbb{R}^m$ unless $n = m$.
An element of $\mathbb{R}$ is a number, but it is also called a “scalar” in this context, and
vectors can be multiplied by scalars as follows:

$$c \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} c x_1 \\ c x_2 \\ \vdots \\ c x_n \end{bmatrix}$$
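The two operations on $\mathbb{R}^n$ are short enough to try out directly. The following is a minimal sketch in which a vector is modeled as a Python list of numbers; the function names `vector_add` and `scalar_multiply` are chosen here for illustration and are not part of the course materials.

```python
def vector_add(v, w):
    """Add two vectors entry by entry; both must live in the same R^n."""
    if len(v) != len(w):
        raise ValueError("cannot add vectors of different dimensions")
    return [x + y for x, y in zip(v, w)]

def scalar_multiply(c, v):
    """Multiply every entry of the vector v by the scalar c."""
    return [c * x for x in v]

print(vector_add([1, 2, 3], [3, -2, 4]))   # [4, 0, 7]
print(scalar_multiply(3, [3, -2, 4]))      # [9, -6, 12]
```

The explicit length check mirrors the warning that vectors in $\mathbb{R}^n$ and $\mathbb{R}^m$ can only be added when $n = m$; without it, `zip` would silently drop the extra entries.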
Warning 2 We have not yet defined a notion of multiplication for vectors. You
might think it is reasonable to define

$$\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} x_1 y_1 \\ x_2 y_2 \\ \vdots \\ x_n y_n \end{bmatrix},$$

but actually this operation is not especially useful, and will never be utilized in this
course. We will have a notion of “vector multiplication” called the dot product, but
that is not the (faulty) definition above.
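Question 3 What is $\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} + \begin{bmatrix} 3 \\ -2 \\ 4 \end{bmatrix}$?
Solution
Hint:
$$\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} + \begin{bmatrix} 3 \\ -2 \\ 4 \end{bmatrix} = \begin{bmatrix} 1 + 3 \\ 2 + (-2) \\ 3 + 4 \end{bmatrix} = \begin{bmatrix} 4 \\ 0 \\ 7 \end{bmatrix}$$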
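Question 4 What is $3 \begin{bmatrix} 3 \\ -2 \\ 4 \end{bmatrix}$?
Solution
Hint:
$$3 \begin{bmatrix} 3 \\ -2 \\ 4 \end{bmatrix} = \begin{bmatrix} 3(3) \\ 3(-2) \\ 3(4) \end{bmatrix} = \begin{bmatrix} 9 \\ -6 \\ 12 \end{bmatrix}$$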
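Question 5 If $\vec{v}_1 = \begin{bmatrix} 3 \\ -2 \end{bmatrix}$, $\vec{v}_2 = \begin{bmatrix} 1 \\ 5 \end{bmatrix}$, and $\vec{v}_3 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$, can you find $a, b \in \mathbb{R}$ so that
$a\vec{v}_1 + b\vec{v}_2 = \vec{v}_3$?
Solution
Hint:
$$\begin{aligned}
a\vec{v}_1 + b\vec{v}_2 &= \vec{v}_3 \\
a \begin{bmatrix} 3 \\ -2 \end{bmatrix} + b \begin{bmatrix} 1 \\ 5 \end{bmatrix} &= \begin{bmatrix} 1 \\ 1 \end{bmatrix} \\
\begin{bmatrix} 3a \\ -2a \end{bmatrix} + \begin{bmatrix} b \\ 5b \end{bmatrix} &= \begin{bmatrix} 1 \\ 1 \end{bmatrix} \\
\begin{bmatrix} 3a + b \\ -2a + 5b \end{bmatrix} &= \begin{bmatrix} 1 \\ 1 \end{bmatrix}
\end{aligned}$$
Can you turn this into a system of two equations?
Hint:
$$\begin{cases} 3a + b = 1 \\ -2a + 5b = 1 \end{cases}
\;\Rightarrow\;
\begin{cases} 15a + 5b = 5 \\ -2a + 5b = 1 \end{cases}
\;\Rightarrow\;
\begin{cases} 17a = 4 \\ -2a + 5b = 1 \end{cases}
\;\Rightarrow\;
\begin{cases} a = \frac{4}{17} \\ -2\left(\frac{4}{17}\right) + 5b = 1 \end{cases}
\;\Rightarrow\;
\begin{cases} a = \frac{4}{17} \\ b = \frac{5}{17} \end{cases}$$
$a = 4/17$
Solution $b = 5/17$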
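The answer to Question 5 can be double-checked with exact rational arithmetic. This sketch does not repeat the elimination steps from the hint; it swaps in Cramer's rule for a 2-by-2 system, and the helper name `solve_2x2` is hypothetical.

```python
from fractions import Fraction

def solve_2x2(a11, a12, a21, a22, r1, r2):
    """Solve a11*a + a12*b = r1 and a21*a + a22*b = r2 exactly (Cramer's rule)."""
    det = a11 * a22 - a12 * a21
    if det == 0:
        raise ValueError("the system has no unique solution")
    a = Fraction(r1 * a22 - a12 * r2, det)
    b = Fraction(a11 * r2 - r1 * a21, det)
    return a, b

# The system 3a + b = 1, -2a + 5b = 1 from Question 5.
a, b = solve_2x2(3, 1, -2, 5, 1, 1)
print(a, b)  # 4/17 5/17

# Substitute back: a*v1 + b*v2 should equal v3 = (1, 1).
print(a * 3 + b * 1, a * (-2) + b * 5)  # 1 1
```

Using `Fraction` instead of floating point keeps the result as the exact fractions 4/17 and 5/17 that appear in the hint.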
3 Geometry
Vectors can be viewed geometrically.
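Graphically, we depict a vector $\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}$ in $\mathbb{R}^n$ as an arrow whose base is at the origin
and whose head is at the point $(x_1, x_2, \ldots, x_n)$. For example, in $\mathbb{R}^2$ we would depict
the vector $\vec{v} = \begin{bmatrix} 3 \\ 4 \end{bmatrix}$ as follows:

[Figure: the vector $\vec{v}$ drawn as an arrow from the origin to the point (3, 4).]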
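Question 1 What is the vector $\vec{w}$ pictured below?

[Figure: a vector $\vec{w}$ drawn in the plane.]

Solution
Hint: Consider whether the x and y coordinates are positive or negative.
(a) $\begin{bmatrix} -4 \\ 2 \end{bmatrix}$
(b) $\begin{bmatrix} 3 \\ -3 \end{bmatrix}$
(c) $\begin{bmatrix} -4 \\ -2 \end{bmatrix}$
(d) $\begin{bmatrix} 4 \\ 2 \end{bmatrix}$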
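Question 2 On a sheet of paper, draw the vector $\vec{v} = \begin{bmatrix} 3 \\ 1 \end{bmatrix}$. Click the hint to see if you got
it right.
Hint: [Figure: the vector $\vec{v}$ drawn from the origin to the point (3, 1).]

Question 3 $\vec{v}_1$ and $\vec{v}_2$ are drawn below. Redraw them on a sheet of paper, and also draw
their sum $\vec{v}_1 + \vec{v}_2$. Click the hint to see if you got it right.

[Figure: the vectors $\vec{v}_1$ and $\vec{v}_2$.]

Hint: [Figure: the vectors $\vec{v}_1$, $\vec{v}_2$, and $\vec{v}_1 + \vec{v}_2$.]

Question 4 $\vec{v}$ is drawn below. Redraw it on a sheet of paper, and also draw $3\vec{v}$. Click the
hint to see if you got it right.

[Figure: the vector $\vec{v}$.]

Hint: [Figure: the vectors $\vec{v}$ and $3\vec{v}$.]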
You may have noticed that you can sum vectors graphically by forming a parallelogram.
You also may have noticed that multiplying a vector by a positive scalar leaves the vector
pointing in the same direction but “scales” its length. That is the reason we call
real numbers “scalars” when they are coefficients of vectors: it is to remind us that
they act geometrically by scaling the vector.
4 Span
Vectors can be combined; all those combinations form the “span.”
Definition 1 We say that a vector $\vec{w}$ is a linear combination of the vectors
$\vec{v}_1, \vec{v}_2, \vec{v}_3, \ldots, \vec{v}_k$ if there are scalars $a_1, a_2, \ldots, a_k$ so that
$\vec{w} = a_1\vec{v}_1 + a_2\vec{v}_2 + \cdots + a_k\vec{v}_k$.
Definition 2 The span of a set of vectors $\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_k \in \mathbb{R}^n$ is the set of all
linear combinations of the vectors. Symbolically,

$$\operatorname{span}(\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_k) = \{a_1\vec{v}_1 + a_2\vec{v}_2 + \cdots + a_k\vec{v}_k : a_1, a_2, \ldots, a_k \in \mathbb{R}\}.$$
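Example 3 The span of $\begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}$ is all vectors of the form $\begin{bmatrix} x \\ y \\ 0 \end{bmatrix}$ for some $x, y \in \mathbb{R}$.
Example 4 $\begin{bmatrix} 8 \\ 13 \end{bmatrix}$ is in the span of $\begin{bmatrix} 2 \\ 3 \end{bmatrix}$ and $\begin{bmatrix} 4 \\ 7 \end{bmatrix}$ because $2 \begin{bmatrix} 2 \\ 3 \end{bmatrix} + \begin{bmatrix} 4 \\ 7 \end{bmatrix} = \begin{bmatrix} 8 \\ 13 \end{bmatrix}$.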
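A linear combination is simple to compute by machine, which gives a quick way to confirm a claim like the one in Example 4. This is a small sketch with vectors modeled as Python lists; the name `linear_combination` is an illustrative choice.

```python
def linear_combination(scalars, vectors):
    """Return a_1*v_1 + a_2*v_2 + ... + a_k*v_k for vectors of equal length."""
    dimension = len(vectors[0])
    result = [0] * dimension
    for a, v in zip(scalars, vectors):
        for i in range(dimension):
            result[i] += a * v[i]
    return result

# Example 4: 2*(2, 3) + 1*(4, 7)
print(linear_combination([2, 1], [[2, 3], [4, 7]]))  # [8, 13]
```

Exhibiting one such pair of scalars shows that a vector lies in the span. Showing that a vector is not in the span requires an argument, not a search over scalars.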
Question 5 Is $\begin{bmatrix} 3 \\ 4 \\ 2 \end{bmatrix}$ in the span of $\begin{bmatrix} 1 \\ 2 \\ 0 \end{bmatrix}$ and $\begin{bmatrix} 3 \\ -3 \\ 0 \end{bmatrix}$?
Solution
Hint: The linear combinations of $\begin{bmatrix} 1 \\ 2 \\ 0 \end{bmatrix}$ and $\begin{bmatrix} 3 \\ -3 \\ 0 \end{bmatrix}$ are all the vectors of the form
$a \begin{bmatrix} 1 \\ 2 \\ 0 \end{bmatrix} + b \begin{bmatrix} 3 \\ -3 \\ 0 \end{bmatrix}$ for scalars $a, b \in \mathbb{R}$. Could $\begin{bmatrix} 3 \\ 4 \\ 2 \end{bmatrix}$ be written in such a form?
Hint: No, because the last coordinate of all of these vectors is 0. In fact, graphically,
the span of these two vectors is just the entire xy-plane, and $\begin{bmatrix} 3 \\ 4 \\ 2 \end{bmatrix}$ lives off of that plane.
(a) Yes, it is in the span of those two vectors.
(b) No, it is not in the span of those two vectors.
Graphically, we should think of the span of one vector as the line through the origin which
contains the vector (unless the vector is the zero vector, in which case its span is just the
zero vector).
The span of two vectors which are not in the same line is the plane containing
the two vectors.
The span of three vectors which are not in the same plane is the “3D-space”
which contains those 3 vectors.
5 Functions
A function relates inputs and outputs.
Definition 1 A function $f$ from a set $A$ to a set $B$ is an assignment of exactly
one element of $B$ to each element of $A$. If $a$ is an element of $A$, we write $f(a)$ for
the element of $B$ which is assigned to $a$ by $f$.
We call $A$ the domain of $f$, and $B$ the codomain of $f$. We will also commonly
write $f : A \to B$, which we read out loud as “f from A to B” or “f maps A to B.”
Example 2 Let $W = \{\text{yes}, \text{no}\}$ and $A = \{\text{Dog}, \text{Cat}, \text{Walrus}\}$. Let $f : A \to W$ be
the function which assigns to each animal in $A$ the answer to the question “Is this
animal commonly a pet?” Then $f(\text{Dog}) = \text{yes}$, $f(\text{Cat}) = \text{yes}$, and $f(\text{Walrus}) = \text{no}$.
In this case, $A$ is the domain, and $W$ is the codomain.
In these activities, we mostly study functions from $\mathbb{R}^n$ to $\mathbb{R}^m$.
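Question 3 Let $g : \mathbb{R}^1 \to \mathbb{R}^2$ be defined by $g(\theta) = (\cos(\theta), \sin(\theta))$. What is $g\left(\frac{\pi}{6}\right)$?
Solution
Hint:
Warning 4 In everything that follows, cos and sin are in terms of radians.
Hint: $g\left(\frac{\pi}{6}\right) = \left(\cos\left(\frac{\pi}{6}\right), \sin\left(\frac{\pi}{6}\right)\right)$
Hint: If you remember your trig facts, this is $\left(\frac{\sqrt{3}}{2}, \frac{1}{2}\right)$. Format this as $\begin{bmatrix} \frac{\sqrt{3}}{2} \\ \frac{1}{2} \end{bmatrix}$ for this
question.
Can you imagine what would happen to the point $g(\theta)$ as $\theta$ moved from $0$ to
$2\pi$?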
Question 5 Let $h : \mathbb{R}^2 \to \mathbb{R}^2$ be defined by $h(x, y) = (x, -y)$.
What is $h(2, 1)$? Format your answer as a vertical column of numbers.
Solution
Hint: Consider $h(2, 1) = (2, -1)$.
Hint: [Figure: the point (2, 1) and its image h(2, 1) plotted in the plane.]
Hint: $h$ takes any point $(x, y)$ to its reflection in the x-axis.
$\begin{bmatrix} 2 \\ -1 \end{bmatrix}$
Try to understand this function graphically. How does it transform the plane? The
hint reveals the answer to this question.
Question 6 Let $f : \mathbb{R}^4 \to \mathbb{R}^2$ be defined by $f((x_1, x_2, x_3, x_4)) = (x_1 x_2 + x_3, \; x_4^2 + x_1)$.
What is $f(3, 4, 1, 9)$? Format your answer as a vertical column of numbers.
Solution
Hint: $f(3, 4, 1, 9) = (3 \cdot 4 + 1, \; 9^2 + 3) = (13, 84)$.
Hint: Format this as $\begin{bmatrix} 13 \\ 84 \end{bmatrix}$.
Note that this function has too many inputs and outputs to visualize easily.
That certainly does not stop it from being a useful and meaningful function; this
is a “massively multivariable” course.
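Even when a function from $\mathbb{R}^4$ to $\mathbb{R}^2$ is hard to picture, it is easy to evaluate. Here is a minimal sketch of the function from Question 6, under the assumption that points are modeled as Python tuples.

```python
def f(point):
    """The map from R^4 to R^2 in Question 6: (x1*x2 + x3, x4**2 + x1)."""
    x1, x2, x3, x4 = point
    return (x1 * x2 + x3, x4 ** 2 + x1)

print(f((3, 4, 1, 9)))  # (13, 84)
```

Unpacking the tuple on the first line of the body also enforces the domain: passing a point with the wrong number of coordinates raises an error instead of returning nonsense.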
6 Composition
One way to build new functions is via “composition.”
Practically the most important thing you can do with functions is to compose them.
Definition 1 Let $f : A \to B$ and $g : B \to C$. Then there is another function
$(g \circ f) : A \to C$ defined by $(g \circ f)(a) = g(f(a))$ for each $a \in A$.
It is called the composition of $g$ with $f$.
Warning 2 The composition is only defined if the codomain of $f$ is the domain
of $g$.
Question 3 Let $A = \{\text{cat}, \text{dog}\}$, $B = \{(2, 3), (5, 6), (7, 8)\}$, $C = \mathbb{R}$. Let $f$ be
defined by $f(\text{cat}) = (2, 3)$ and $f(\text{dog}) = (7, 8)$. Let $g$ be defined by the rule
$g((x, y)) = x + y$.
Solution
Hint: First, $(g \circ f)(\text{cat}) = g(f(\text{cat}))$.
Hint: Then note that $f(\text{cat}) = (2, 3)$.
Hint: So this is $g((2, 3)) = 2 + 3 = 5$.
$(g \circ f)(\text{cat}) = 5$
Question 4 Let $h : \mathbb{R}^2 \to \mathbb{R}^3$ be defined by $h(x, y) = (x^2, xy, y)$, and let $\omega : \mathbb{R}^3 \to \mathbb{R}^2$ be defined by $\omega(x, y, z) = (xyz, z)$.
What is $(\omega \circ h)(x, y)$? Format your answer as a vertical column of formulas.
Solution
Hint:
$$\begin{aligned}
(\omega \circ h)(x, y) &= \omega(h(x, y)) \\
&= \omega(x^2, xy, y) \\
&= ((x^2)(xy)(y), \; y) \\
&= (x^3 y^2, \; y)
\end{aligned}$$
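Composition translates directly into code: a function that takes two functions and returns a third. The sketch below uses the maps $h$ and $\omega$ from Question 4; the helper name `compose` is an illustrative choice and points are modeled as tuples.

```python
def compose(g, f):
    """Return the composition g after f, i.e. the function a -> g(f(a))."""
    return lambda a: g(f(a))

def h(point):
    x, y = point
    return (x ** 2, x * y, y)

def omega(point):
    x, y, z = point
    return (x * y * z, z)

omega_after_h = compose(omega, h)

# h(2, 3) = (4, 6, 3), and omega(4, 6, 3) = (72, 3).
print(omega_after_h((2, 3)))  # (72, 3), matching (x^3 y^2, y) at x = 2, y = 3
```

Python will not stop you from composing functions whose codomain and domain do not match; the mismatch only shows up as an error when the composite is called.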
7 Higher-order functions
Sometimes functions act on functions.
Functions from $\mathbb{R}^n$ to $\mathbb{R}^m$ are not the only useful kind of function. While such
functions are our primary object of study in this multivariable calculus class, it will
often be helpful to think about “functions of functions.” The next examples might
seem a bit peculiar, but later on in the course these kinds of mappings will become
very important.
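Question 1 Let $C_{[0,1]}$ be the set of all continuous functions from $[0, 1]$ to $\mathbb{R}$. Define
$I : C_{[0,1]} \to \mathbb{R}$ by $I(f) = \int_0^1 f(x)\,dx$.
Solution
Hint:
$$\begin{aligned}
I(g) &= \int_0^1 g(x)\,dx \\
&= \int_0^1 x^2\,dx \\
&= \left.\frac{1}{3}x^3\right|_0^1 \\
&= \frac{1}{3}(1 - 0) \\
&= \frac{1}{3}.
\end{aligned}$$
If $g(x) = x^2$, then $I(g) = 1/3$.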
Question 2 Let $C^\infty(\mathbb{R})$ be the set of all infinitely differentiable (“smooth”) functions
on $\mathbb{R}$. Define $Q : C^\infty(\mathbb{R}) \to C^\infty(\mathbb{R})$ by $Q(f)(x) = f(0) + f'(0)x + \frac{f''(0)}{2}x^2$.
Solution
Hint: Question 3 Solution
Hint: $f(0) = \cos(0) = 1$
$f(0) = 1$
Question 4 Solution
Hint: $f'(x) = -\sin(x)$, so $f'(0) = -\sin(0) = 0$
$f'(0) = 0$
Question 5 Solution
Hint: $f''(x) = -\cos(x)$, so $f''(0) = -\cos(0) = -1$
$f''(0) = -1$
Hint: So $Q(f)(x) = 1 - \frac{x^2}{2}$
If $f(x) = \cos(x)$, then $Q(f)(x) = 1 - x^2/2$.
This is an example of a function which eats a function and spits out another function.
In particular, this takes a function and returns the second order Maclaurin
polynomial of that function.
Question 6 Define $\operatorname{dot}_n : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ by
$$\operatorname{dot}_n((x_1, x_2, \ldots, x_n), (y_1, y_2, \ldots, y_n)) = x_1 y_1 + x_2 y_2 + x_3 y_3 + \cdots + x_n y_n.$$
Solution
Hint: $\operatorname{dot}_3((2, 4, 5), (0, 1, 4)) = 2(0) + 4(1) + 5(4) = 24$
$\operatorname{dot}_3((2, 4, 5), (0, 1, 4)) = 24$
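The family of functions $\operatorname{dot}_n$ can be modeled by a single Python function, since the length of the input tuples determines $n$. This is a sketch; the name `dot` and the explicit length check are choices made for this illustration.

```python
def dot(x, y):
    """Sum of the products of corresponding entries of x and y."""
    if len(x) != len(y):
        raise ValueError("both inputs must have the same number of entries")
    return sum(a * b for a, b in zip(x, y))

print(dot((2, 4, 5), (0, 1, 4)))  # 24
```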
8 Currying
Higher-order functions provide a diﬀerent perspective on functions that take many
inputs.
Definition 1 Let $A$ and $B$ be two sets. The product $A \times B$ of the two sets is the
set of all ordered pairs $A \times B = \{(a, b) : a \in A \text{ and } b \in B\}$.
Example 2 If $A = \{1, 2, \text{Wolf}\}$ and $B = \{4, 5\}$, then $A \times B = \{(1, 4), (1, 5), (2, 4), (2, 5), (\text{Wolf}, 4), (\text{Wolf}, 5)\}$.
Example 3 We write $\mathbb{R}^2$ for pairs of real numbers, but we could have written
$\mathbb{R} \times \mathbb{R}$.
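Question 4 Let $\operatorname{Func}(\mathbb{R}, \mathbb{R})$ be the set of all functions from $\mathbb{R}$ to $\mathbb{R}$. Define $\operatorname{Eval} :
\mathbb{R} \times \operatorname{Func}(\mathbb{R}, \mathbb{R}) \to \mathbb{R}$ by $\operatorname{Eval}(x, f) = f(x)$.
Solution
Hint: $\operatorname{Eval}(-3, g) = g(-3) = |-3| = 3$
If $g(x) = |x|$, then $\operatorname{Eval}(-3, g) = 3$.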
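Question 5 Let $\operatorname{Func}(A, B)$ be the set of all functions from $A$ to $B$ for any two
sets $A$ and $B$.
Let $\operatorname{Curry} : \operatorname{Func}(\mathbb{R}^2, \mathbb{R}) \to \operatorname{Func}(\mathbb{R}, \operatorname{Func}(\mathbb{R}, \mathbb{R}))$ be defined by $\operatorname{Curry}(f)(x)(y) =
f(x, y)$.
Let $h : \mathbb{R}^2 \to \mathbb{R}$ be defined by $h(x, y) = x^2 + xy$.
Solution
Hint:
$$\begin{aligned}
G(3) &= \operatorname{Curry}(h)(2)(3) \\
&= h(2, 3) \\
&= 2^2 + 2(3) \\
&= 10
\end{aligned}$$
Let $G = \operatorname{Curry}(h)(2)$. Then $G(3) = 10$.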
This wacky way of thinking is helpful when thinking about the λ-calculus
(http://en.wikipedia.org/wiki/Lambda_calculus). It also helps a lot if you ever want
to learn to program in Haskell, which is one of the languages that Ximera was written in.
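Both Eval and Curry from this section can be written in a few lines of Python, which may make them feel less wacky. This is a minimal sketch; the lowercase names `evaluate` and `curry` are chosen here to avoid clashing with Python built-ins and are not part of the course code.

```python
def evaluate(x, f):
    """Eval(x, f) = f(x): apply the function f to the number x."""
    return f(x)

def curry(f):
    """Turn a two-input function into a function returning a function."""
    return lambda x: lambda y: f(x, y)

def h(x, y):
    return x ** 2 + x * y

G = curry(h)(2)            # G is the one-variable function y -> h(2, y)
print(G(3))                # 10
print(evaluate(-3, abs))   # 3
```

Here `G` is itself a function of one variable, obtained by fixing the first input of `h`; that is exactly the point of view currying provides.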
9 Python
Python provides a playground for multivariable functions.
We can use Python to experiment a bit with multivariable functions.
Question 1 Model the function $f(x) = x^2$ as a Python function.
Solution
Hint:
Warning 2 Python does not use ^ for exponentiation; it denotes this by **
Hint: Try using return x**2
Python

    def f(x):
        return  # the value of f(x)

    def validator():
        return (f(4) == 16) and (f(-5) == 25)

Solution Model the function $g(x) = \begin{cases} -1 & \text{if } x \le 0 \\ 1 & \text{if } x > 0 \end{cases}$ as a Python function.
Hint: Try using an if
Python

    def g(x):
        return  # the value of g(x)

    def validator():
        return (g(0) == -1) and (g(-17) == -1) and (g(25) == 1)

Solution Model the function $h(x, y) = \begin{cases} x/(1 + y) & \text{if } y \ne -1 \\ 0 & \text{if } y = -1 \end{cases}$ as a Python function.
Python

    def h(x, y):
        return  # the value of h(x,y)

    def validator():
        return (h(6, 2) == 2) and (h(17, -1) == 0) and (h(-24, 5) == -4)
10 Higher-order python
One nice feature of Python is that we can play with functions which act on functions.
Question 1 Here is an example of a higher order function horizontal_shift.
It takes a function f of one variable, and a horizontal shift H, and returns the
function whose graph is the same as f, only shifted horizontally by H units.
Solution Find a function f so that horizontal_shift(f,2) is the squaring function.
Python

    def horizontal_shift(f, H):
        # first we define a new function shifted_f which is the appropriate shift of f
        def shifted_f(x):
            return f(x - H)
        # then we return that function
        return shifted_f

    def f(x):
        return  # a function so that horizontal_shift(f,2) is the squaring function

    def validator():
        return (f(1) == 9) and (f(0) == 4) and (f(-3) == 1)

Solution Write a function forward_difference which takes a function $f : \mathbb{R} \to \mathbb{R}$
and returns another real-valued function defined by forward_difference(f)(x) = f(x + 1) − f(x).
Python

    def forward_difference(f):
        return  # the function which sends x to f(x + 1) - f(x)

    def validator():
        def f(x):
            return x**2
        def g(x):
            return x**3
        return (forward_difference(f)(3) == 7) and (forward_difference(g)(4) == 61)
11 Calculus
We can do some calculus with Python, too.
Let’s try doing some single-variable calculus with a bit of Python.
Let epsilon be a small, but positive number. Suppose $f : \mathbb{R} \to \mathbb{R}$ has been
coded as a Python function f which takes a real number and returns a real number.
Seeing as

$$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h},$$

can you find a Python function which approximates $f'(x)$?
Given a Python function f which takes a real number and returns a real number,
we can approximate $f'(x)$ by using epsilon. Write a Python function derivative
which takes a function f and returns an approximation to its derivative.
Solution
Hint: To approximate this, use (f(x+epsilon) - f(x))/epsilon.
Python

    epsilon = 0.0001

    def derivative(f):
        def df(x):
            # replace each ... with an expression built from x and epsilon
            return (f(...) - f(...)) / ...
        return df

    def validator():
        df = derivative(lambda x: 1 + x**2 + x**3)
        if abs(df(2) - 16) > 0.01:
            return False
        df = derivative(lambda x: (1 + x)**4)
        if abs(df(-2.642) - -17.708405152) > 0.01:
            return False
        return True

This is great! In the future, we’ll review this activity, and then extend it to a multivariable
setting.
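As a point of comparison with the one-sided quotient in the hint, here is a sketch of a different technique, the symmetric (central) difference quotient $(f(x + \epsilon) - f(x - \epsilon)) / (2\epsilon)$. It is not what the activity asks for; the name `central_difference` and the default step size are assumptions made for this illustration.

```python
def central_difference(f, epsilon=0.0001):
    """Return a function approximating f' with a symmetric difference quotient."""
    def df(x):
        return (f(x + epsilon) - f(x - epsilon)) / (2 * epsilon)
    return df

df = central_difference(lambda x: 1 + x**2 + x**3)

# The exact derivative is 2x + 3x^2, which equals 16 at x = 2.
print(abs(df(2) - 16) < 1e-6)  # True
```

For smooth functions the symmetric quotient is typically more accurate than the one-sided quotient for the same epsilon, at the cost of evaluating f on both sides of x.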
12 Linear maps
Linear maps respect addition and scalar multiplication.
We begin by deﬁning linear maps.
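Definition 1 A function $L : \mathbb{R}^n \to \mathbb{R}^m$ is called a linear map if it “respects
addition and scalar multiplication.”
Symbolically, for a map to be linear, we must have that $L(\vec{v} + \vec{w}) = L(\vec{v}) + L(\vec{w})$
for all $\vec{v}, \vec{w} \in \mathbb{R}^n$ and also $L(a\vec{v}) = aL(\vec{v})$ for all $a \in \mathbb{R}$ and $\vec{v} \in \mathbb{R}^n$.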
Deﬁnition 2 Linear Algebra is the branch of mathematics concerning vector
spaces and linear mappings between such spaces.
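Question 3 Which of the following functions are linear?
(a) $f : \mathbb{R}^2 \to \mathbb{R}^1$ defined by $f\left(\begin{bmatrix} x \\ y \end{bmatrix}\right) = x + 2y$
(b) $h : \mathbb{R}^2 \to \mathbb{R}^2$ defined by $h\left(\begin{bmatrix} x \\ y \end{bmatrix}\right) = \begin{bmatrix} 17 \\ x \end{bmatrix}$
Solution
Hint: For a function to be linear, it must respect scalar multiplication. Let’s see how
$f\left(5\begin{bmatrix} 1 \\ 1 \end{bmatrix}\right)$ compares to $5f\left(\begin{bmatrix} 1 \\ 1 \end{bmatrix}\right)$, and also how $h\left(5\begin{bmatrix} 1 \\ 1 \end{bmatrix}\right)$ compares to $5h\left(\begin{bmatrix} 1 \\ 1 \end{bmatrix}\right)$.
Question 4 What is $f\left(5\begin{bmatrix} 1 \\ 1 \end{bmatrix}\right)$?
Solution
Hint: Remember $f$ is defined by $f\left(\begin{bmatrix} x \\ y \end{bmatrix}\right) = x + 2y$, so $f\left(5\begin{bmatrix} 1 \\ 1 \end{bmatrix}\right) = f\left(\begin{bmatrix} 5 \\ 5 \end{bmatrix}\right) = 5 + 2(5) = 15$
$f\left(5\begin{bmatrix} 1 \\ 1 \end{bmatrix}\right) = 15$
What is $f\left(\begin{bmatrix} 1 \\ 1 \end{bmatrix}\right)$?
Solution
Hint: Remember $f$ is defined by $f\left(\begin{bmatrix} x \\ y \end{bmatrix}\right) = x + 2y$, so $f\left(\begin{bmatrix} 1 \\ 1 \end{bmatrix}\right) = 1 + 2(1) = 3$
$f\left(\begin{bmatrix} 1 \\ 1 \end{bmatrix}\right) = 3$
Solution Is $f\left(5\begin{bmatrix} 1 \\ 1 \end{bmatrix}\right) = 5f\left(\begin{bmatrix} 1 \\ 1 \end{bmatrix}\right)$?
(a) Yes
(b) No
Great! So $f$ has a chance of being linear, since it is respecting scalar multiplication in this case.
What is $h\left(5\begin{bmatrix} 1 \\ 1 \end{bmatrix}\right)$?
Solution
Hint: Remember $h$ is defined by $h\left(\begin{bmatrix} x \\ y \end{bmatrix}\right) = \begin{bmatrix} 17 \\ x \end{bmatrix}$, so $h\left(5\begin{bmatrix} 1 \\ 1 \end{bmatrix}\right) = h\left(\begin{bmatrix} 5 \\ 5 \end{bmatrix}\right) = \begin{bmatrix} 17 \\ 5 \end{bmatrix}$
What is $h\left(\begin{bmatrix} 1 \\ 1 \end{bmatrix}\right)$?
Solution
Hint: Remember $h$ is defined by $h\left(\begin{bmatrix} x \\ y \end{bmatrix}\right) = \begin{bmatrix} 17 \\ x \end{bmatrix}$, so $h\left(\begin{bmatrix} 1 \\ 1 \end{bmatrix}\right) = \begin{bmatrix} 17 \\ 1 \end{bmatrix}$
Solution Is $h\left(5\begin{bmatrix} 1 \\ 1 \end{bmatrix}\right) = 5h\left(\begin{bmatrix} 1 \\ 1 \end{bmatrix}\right)$?
(a) Yes
(b) No
Great! By looking at this particular example, we can see that $h$ does
not always respect scalar multiplication. So $h$ is not linear.
Since we know one of the two functions is linear, we can already answer the question:
The answer is $f$. To be thorough, let’s check that $f$ really is linear.
First we check that $f$ really does respect scalar multiplication:
Let $a \in \mathbb{R}$ be an arbitrary scalar and $\begin{bmatrix} x \\ y \end{bmatrix} \in \mathbb{R}^2$ be an arbitrary vector. Then
$$\begin{aligned}
f\left(a\begin{bmatrix} x \\ y \end{bmatrix}\right) &= f\left(\begin{bmatrix} ax \\ ay \end{bmatrix}\right) \\
&= ax + 2ay \\
&= a(x + 2y) \\
&= a f\left(\begin{bmatrix} x \\ y \end{bmatrix}\right)
\end{aligned}$$
Now we check that $f$ really does respect vector addition:
Let $\begin{bmatrix} x_1 \\ y_1 \end{bmatrix}$ and $\begin{bmatrix} x_2 \\ y_2 \end{bmatrix}$ be arbitrary vectors in $\mathbb{R}^2$. Then
$$\begin{aligned}
f\left(\begin{bmatrix} x_1 \\ y_1 \end{bmatrix} + \begin{bmatrix} x_2 \\ y_2 \end{bmatrix}\right) &= f\left(\begin{bmatrix} x_1 + x_2 \\ y_1 + y_2 \end{bmatrix}\right) \\
&= (x_1 + x_2) + 2(y_1 + y_2) \\
&= x_1 + x_2 + 2y_1 + 2y_2 \\
&= (x_1 + 2y_1) + (x_2 + 2y_2) \\
&= f\left(\begin{bmatrix} x_1 \\ y_1 \end{bmatrix}\right) + f\left(\begin{bmatrix} x_2 \\ y_2 \end{bmatrix}\right)
\end{aligned}$$
This proves that $f$ is linear!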
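The spot check used in the hints, comparing a function applied to a scaled input with the scaled output, is easy to automate. This is a sketch only: vectors are modeled as lists, and the helper name `respects_scaling` is made up for this illustration.

```python
def respects_scaling(func, scalar, v):
    """Check whether func(scalar * v) equals scalar * func(v) for one input."""
    scaled_input = [scalar * x for x in v]
    scaled_output = [scalar * y for y in func(v)]
    return func(scaled_input) == scaled_output

def f(v):
    x, y = v
    return [x + 2 * y]

def h(v):
    x, y = v
    return [17, x]

print(respects_scaling(f, 5, [1, 1]))  # True:  f(5, 5) = 15 = 5 * 3
print(respects_scaling(h, 5, [1, 1]))  # False: h(5, 5) = (17, 5), but 5 * h(1, 1) = (85, 5)
```

A single failed check is enough to show that a function is not linear. A passed check proves nothing by itself, which is why the argument for $f$ works with arbitrary scalars and vectors.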
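What about these two functions? Which of them is a linear map?
(a) $g : \mathbb{R}^3 \to \mathbb{R}^2$ defined by $g\left(\begin{bmatrix} x \\ y \\ z \end{bmatrix}\right) = \begin{bmatrix} x \\ xy \end{bmatrix}$
(b) $h : \mathbb{R} \to \mathbb{R}^4$ defined by $h(x) = \begin{bmatrix} x \\ x \\ x \\ 4x \end{bmatrix}$
Solution
Hint: For a function to be linear, it must respect vector addition. Let’s see how $h(5 + 2)$
compares to $h(5) + h(2)$, and also how $g\left(\begin{bmatrix} 2 \\ 3 \\ 1 \end{bmatrix} + \begin{bmatrix} 1 \\ 4 \\ 5 \end{bmatrix}\right)$ compares to $g\left(\begin{bmatrix} 2 \\ 3 \\ 1 \end{bmatrix}\right) + g\left(\begin{bmatrix} 1 \\ 4 \\ 5 \end{bmatrix}\right)$.
Question 5 What is $h(5 + 2)$?
Solution
Hint: Remember $h$ is defined by $h(x) = \begin{bmatrix} x \\ x \\ x \\ 4x \end{bmatrix}$, so $h(5 + 2) = h(7) = \begin{bmatrix} 7 \\ 7 \\ 7 \\ 28 \end{bmatrix}$
What is $h(5) + h(2)$?
Solution
Hint: Remember $h$ is defined by $h(x) = \begin{bmatrix} x \\ x \\ x \\ 4x \end{bmatrix}$, so $h(5) + h(2) = \begin{bmatrix} 5 \\ 5 \\ 5 \\ 20 \end{bmatrix} + \begin{bmatrix} 2 \\ 2 \\ 2 \\ 8 \end{bmatrix} = \begin{bmatrix} 7 \\ 7 \\ 7 \\ 28 \end{bmatrix}$
Solution Is $h(5 + 2) = h(5) + h(2)$?
(a) Yes
(b) No
Great! So $h$ has a chance of being linear, since it is respecting vector addition in this case.
What is $g\left(\begin{bmatrix} 2 \\ 3 \\ 1 \end{bmatrix} + \begin{bmatrix} 1 \\ 4 \\ 5 \end{bmatrix}\right)$?
Solution
Hint: Remember $g$ is defined by $g\left(\begin{bmatrix} x \\ y \\ z \end{bmatrix}\right) = \begin{bmatrix} x \\ xy \end{bmatrix}$, so
$$g\left(\begin{bmatrix} 2 \\ 3 \\ 1 \end{bmatrix} + \begin{bmatrix} 1 \\ 4 \\ 5 \end{bmatrix}\right) = g\left(\begin{bmatrix} 3 \\ 7 \\ 6 \end{bmatrix}\right) = \begin{bmatrix} 3 \\ 3(7) \end{bmatrix} = \begin{bmatrix} 3 \\ 21 \end{bmatrix}$$
What is $g\left(\begin{bmatrix} 2 \\ 3 \\ 1 \end{bmatrix}\right) + g\left(\begin{bmatrix} 1 \\ 4 \\ 5 \end{bmatrix}\right)$?
Solution
Hint: Remember $g$ is defined by $g\left(\begin{bmatrix} x \\ y \\ z \end{bmatrix}\right) = \begin{bmatrix} x \\ xy \end{bmatrix}$, so
$$g\left(\begin{bmatrix} 2 \\ 3 \\ 1 \end{bmatrix}\right) + g\left(\begin{bmatrix} 1 \\ 4 \\ 5 \end{bmatrix}\right) = \begin{bmatrix} 2 \\ 2(3) \end{bmatrix} + \begin{bmatrix} 1 \\ 1(4) \end{bmatrix} = \begin{bmatrix} 2 \\ 6 \end{bmatrix} + \begin{bmatrix} 1 \\ 4 \end{bmatrix} = \begin{bmatrix} 3 \\ 10 \end{bmatrix}$$
Solution Is $g\left(\begin{bmatrix} 2 \\ 3 \\ 1 \end{bmatrix} + \begin{bmatrix} 1 \\ 4 \\ 5 \end{bmatrix}\right) = g\left(\begin{bmatrix} 2 \\ 3 \\ 1 \end{bmatrix}\right) + g\left(\begin{bmatrix} 1 \\ 4 \\ 5 \end{bmatrix}\right)$?
(a) Yes
(b) No
Great! By looking at this particular example, we can see that $g$
does not always respect vector addition. So $g$ is not linear.
Since we know one of the two functions is linear, we can already answer the question:
The answer is $h$. To be thorough, let’s check that $h$ really is linear.
First we check that $h$ really does respect scalar multiplication:
Let $a \in \mathbb{R}$ be an arbitrary scalar and $x \in \mathbb{R}^1$ be an arbitrary vector. Then
$$h(ax) = \begin{bmatrix} ax \\ ax \\ ax \\ 4ax \end{bmatrix} = a \begin{bmatrix} x \\ x \\ x \\ 4x \end{bmatrix} = a h(x)$$
Now we check that $h$ really does respect vector addition:
Let $x$ and $y$ be arbitrary vectors in $\mathbb{R}^1$. Then
$$h(x + y) = \begin{bmatrix} x + y \\ x + y \\ x + y \\ 4(x + y) \end{bmatrix} = \begin{bmatrix} x + y \\ x + y \\ x + y \\ 4x + 4y \end{bmatrix} = \begin{bmatrix} x \\ x \\ x \\ 4x \end{bmatrix} + \begin{bmatrix} y \\ y \\ y \\ 4y \end{bmatrix} = h(x) + h(y)$$
This proves that $h$ is linear!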
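And finally, which of the following functions are linear?
(a) $G : \mathbb{R}^4 \to \mathbb{R}^3$ defined by $G\left(\begin{bmatrix} x \\ y \\ z \\ t \end{bmatrix}\right) = \begin{bmatrix} e^{x+y} \\ x + z \\ \sin(x + t) \end{bmatrix}$
(b) $A : \mathbb{R}^2 \to \mathbb{R}^2$ defined by $A\left(\begin{bmatrix} x \\ y \end{bmatrix}\right) = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$
Solution
Hint: For a function to be linear, it must respect scalar multiplication. Let’s see how
$A\left(2\begin{bmatrix} 2 \\ 3 \end{bmatrix}\right)$ compares to $2A\left(\begin{bmatrix} 2 \\ 3 \end{bmatrix}\right)$, and also how $G\left(2\begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix}\right)$ compares to $2G\left(\begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix}\right)$.
Question 6 What is $A\left(2\begin{bmatrix} 2 \\ 3 \end{bmatrix}\right)$?
Solution
Hint: Remember $A$ is defined by $A\left(\begin{bmatrix} x \\ y \end{bmatrix}\right) = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$, so $A\left(2\begin{bmatrix} 2 \\ 3 \end{bmatrix}\right) = A\left(\begin{bmatrix} 4 \\ 6 \end{bmatrix}\right) = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$
What is $2A\left(\begin{bmatrix} 2 \\ 3 \end{bmatrix}\right)$?
Solution
Hint: Remember $A$ is defined by $A\left(\begin{bmatrix} x \\ y \end{bmatrix}\right) = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$, so $2A\left(\begin{bmatrix} 2 \\ 3 \end{bmatrix}\right) = 2\begin{bmatrix} 0 \\ 0 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$
Solution Is $A\left(2\begin{bmatrix} 2 \\ 3 \end{bmatrix}\right) = 2A\left(\begin{bmatrix} 2 \\ 3 \end{bmatrix}\right)$?
(a) Yes
(b) No
Great! So $A$ has a chance of being linear, since it is respecting scalar multiplication in this case.
What is $G\left(2\begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix}\right)$?
Solution
Hint: Remember $G$ is defined by $G\left(\begin{bmatrix} x \\ y \\ z \\ t \end{bmatrix}\right) = \begin{bmatrix} e^{x+y} \\ x + z \\ \sin(x + t) \end{bmatrix}$, so
$$G\left(2\begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix}\right) = G\left(\begin{bmatrix} 2 \\ 4 \\ 6 \\ 8 \end{bmatrix}\right) = \begin{bmatrix} e^{2+4} \\ 2 + 6 \\ \sin(2 + 8) \end{bmatrix} = \begin{bmatrix} e^{6} \\ 8 \\ \sin(10) \end{bmatrix}$$
What is $2G\left(\begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix}\right)$?
Solution
Hint: Remember $G$ is defined by $G\left(\begin{bmatrix} x \\ y \\ z \\ t \end{bmatrix}\right) = \begin{bmatrix} e^{x+y} \\ x + z \\ \sin(x + t) \end{bmatrix}$, so
$$2G\left(\begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix}\right) = 2\begin{bmatrix} e^{1+2} \\ 1 + 3 \\ \sin(1 + 4) \end{bmatrix} = 2\begin{bmatrix} e^{3} \\ 4 \\ \sin(5) \end{bmatrix} = \begin{bmatrix} 2e^{3} \\ 8 \\ 2\sin(5) \end{bmatrix}$$
Solution Is $G\left(2\begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix}\right) = 2G\left(\begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix}\right)$?
(a) Yes
(b) No
Great! By looking at this particular example, we can see that $G$
does not always respect scalar multiplication. So $G$ is not linear.
Since we know one of the two functions is linear, we can already answer the question:
The answer is $A$. To be thorough, let’s check that $A$ really is linear.
First we check that $A$ really does respect scalar multiplication:
Let $c \in \mathbb{R}$ be an arbitrary scalar and $\begin{bmatrix} x \\ y \end{bmatrix} \in \mathbb{R}^2$ be an arbitrary vector. Then
$$A\left(c\begin{bmatrix} x \\ y \end{bmatrix}\right) = A\left(\begin{bmatrix} cx \\ cy \end{bmatrix}\right) = \begin{bmatrix} 0 \\ 0 \end{bmatrix} = c\begin{bmatrix} 0 \\ 0 \end{bmatrix} = cA\left(\begin{bmatrix} x \\ y \end{bmatrix}\right)$$
Now we check that $A$ really does respect vector addition:
Let $\begin{bmatrix} x_1 \\ y_1 \end{bmatrix}$ and $\begin{bmatrix} x_2 \\ y_2 \end{bmatrix}$ be arbitrary vectors in $\mathbb{R}^2$. Then
$$A\left(\begin{bmatrix} x_1 \\ y_1 \end{bmatrix} + \begin{bmatrix} x_2 \\ y_2 \end{bmatrix}\right) = A\left(\begin{bmatrix} x_1 + x_2 \\ y_1 + y_2 \end{bmatrix}\right) = \begin{bmatrix} 0 \\ 0 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} + \begin{bmatrix} 0 \\ 0 \end{bmatrix} = A\left(\begin{bmatrix} x_1 \\ y_1 \end{bmatrix}\right) + A\left(\begin{bmatrix} x_2 \\ y_2 \end{bmatrix}\right)$$
This proves that $A$ is linear!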
Warning 7 Note that the function which sends every vector to the zero vector
is linear.
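Question 8 Let $L : \mathbb{R}^3 \to \mathbb{R}^2$ be a linear function. Suppose $L\left(\begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}\right) = \begin{bmatrix} 3 \\ 4 \end{bmatrix}$,
$L\left(\begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}\right) = \begin{bmatrix} -2 \\ 0 \end{bmatrix}$, and $L\left(\begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}\right) = \begin{bmatrix} 1 \\ -1 \end{bmatrix}$.
Let $\vec{v} = L\left(\begin{bmatrix} 4 \\ -1 \\ 2 \end{bmatrix}\right)$. What is $\vec{v}$?
Solution
Hint: The only thing we know about linear maps is that they respect scalar multiplication
and vector addition. So we need to somehow rewrite the vector $\begin{bmatrix} 4 \\ -1 \\ 2 \end{bmatrix}$ in terms
of the vectors $\begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}$, $\begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}$ and $\begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}$, scalar multiplication, and vector addition, to exploit
the linearity of $L$.
Question 9 Can you rewrite $\begin{bmatrix} 4 \\ -1 \\ 2 \end{bmatrix}$ in the form $a\begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} + b\begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} + c\begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}$?
Solution
Hint: Observe that $\begin{bmatrix} 4 \\ -1 \\ 2 \end{bmatrix} = 4\begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} + (-1)\begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} + 2\begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}$.
Hint: Consider the coefficient on $\begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}$.
Hint: In this case, $a = 4$.
Hint: Moreover, $b = -1$.
Hint: Finally, $c = 2$.
$a = 4$
Solution $b = -1$
Solution $c = 2$
Now using the linearity of $L$, we can see that
$$\begin{aligned}
L\left(\begin{bmatrix} 4 \\ -1 \\ 2 \end{bmatrix}\right) &= L\left(4\begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} + (-1)\begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} + 2\begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}\right) \\
&= 4L\left(\begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}\right) + (-1)L\left(\begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}\right) + 2L\left(\begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}\right)
\end{aligned}$$
Can you finish off the computation?
Hint:
$$\begin{aligned}
L\left(\begin{bmatrix} 4 \\ -1 \\ 2 \end{bmatrix}\right) &= 4L\left(\begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}\right) + (-1)L\left(\begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}\right) + 2L\left(\begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}\right) \\
&= 4\begin{bmatrix} 3 \\ 4 \end{bmatrix} + (-1)\begin{bmatrix} -2 \\ 0 \end{bmatrix} + 2\begin{bmatrix} 1 \\ -1 \end{bmatrix} \\
&= \begin{bmatrix} 12 \\ 16 \end{bmatrix} + \begin{bmatrix} 2 \\ 0 \end{bmatrix} + \begin{bmatrix} 2 \\ -2 \end{bmatrix} \\
&= \begin{bmatrix} 16 \\ 14 \end{bmatrix}
\end{aligned}$$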
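Can you generalize this? Let $\vec{v} = L\left(\begin{bmatrix} x \\ y \\ z \end{bmatrix}\right)$. What is $\vec{v}$?
Solution
Hint: The only thing we know about linear maps is that they respect scalar multiplication
and vector addition. So we need to somehow rewrite the vector $\begin{bmatrix} x \\ y \\ z \end{bmatrix}$ in terms
of the vectors $\begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}$, $\begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}$ and $\begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}$, scalar multiplication, and vector addition, to exploit
the linearity of $L$.
Question 10 Can you rewrite $\begin{bmatrix} x \\ y \\ z \end{bmatrix}$ in the form $a\begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} + b\begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} + c\begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}$?
Solution
Hint: $\begin{bmatrix} x \\ y \\ z \end{bmatrix} = x\begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} + y\begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} + z\begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}$
$a = x$
Solution $b = y$
Solution $c = z$
Hint: Now using the linearity of $L$, we can see that
$$\begin{aligned}
L\left(\begin{bmatrix} x \\ y \\ z \end{bmatrix}\right) &= L\left(x\begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} + y\begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} + z\begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}\right) \\
&= xL\left(\begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}\right) + yL\left(\begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}\right) + zL\left(\begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}\right)
\end{aligned}$$
Can you finish off the computation?
Hint:
$$\begin{aligned}
L\left(\begin{bmatrix} x \\ y \\ z \end{bmatrix}\right) &= xL\left(\begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}\right) + yL\left(\begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}\right) + zL\left(\begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}\right) \\
&= x\begin{bmatrix} 3 \\ 4 \end{bmatrix} + y\begin{bmatrix} -2 \\ 0 \end{bmatrix} + z\begin{bmatrix} 1 \\ -1 \end{bmatrix} \\
&= \begin{bmatrix} 3x \\ 4x \end{bmatrix} + \begin{bmatrix} -2y \\ 0 \end{bmatrix} + \begin{bmatrix} z \\ -z \end{bmatrix} \\
&= \begin{bmatrix} 3x - 2y + z \\ 4x - z \end{bmatrix}
\end{aligned}$$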
As you have already discovered, a linear map $L : \mathbb{R}^n \to \mathbb{R}^m$ is fully determined
by its action on the “standard basis vectors”

$$\vec{e}_1 = \begin{bmatrix} 1 \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \quad \vec{e}_2 = \begin{bmatrix} 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \quad \text{and so on, until we reach} \quad \vec{e}_n = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \\ 1 \end{bmatrix}.$$

Argue convincingly that if $L : \mathbb{R}^n \to \mathbb{R}^m$ is a linear map and you know $L(\vec{e}_i)$ for
$i = 1, 2, 3, \ldots, n$, then you could figure out $L(\vec{v})$ for any $\vec{v} \in \mathbb{R}^n$.

I want to determine what $L$ does to any vector $\vec{v} = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_n \end{bmatrix} \in \mathbb{R}^n$. I can rewrite $\vec{v}$ as
$x_1\vec{e}_1 + x_2\vec{e}_2 + x_3\vec{e}_3 + \cdots + x_n\vec{e}_n$. By the linearity of $L$,
$$L(\vec{v}) = x_1 L(\vec{e}_1) + x_2 L(\vec{e}_2) + x_3 L(\vec{e}_3) + \cdots + x_n L(\vec{e}_n).$$
Since I already know the value of $L(\vec{e}_i)$ for all $i = 1, 2, 3, \ldots, n$, this allows me to
compute $L(\vec{v})$. So $L$ is completely determined once I know what it does to each of
the standard basis vectors.
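The argument above is effectively a recipe: scale each known image $L(\vec{e}_i)$ by the matching coordinate and add the results. The sketch below follows that recipe, with vectors as Python lists; the function name `apply_linear_map` and the argument name `images_of_basis` are hypothetical.

```python
def apply_linear_map(images_of_basis, v):
    """Compute L(v) = x_1*L(e_1) + ... + x_n*L(e_n) from the images L(e_i)."""
    m = len(images_of_basis[0])          # dimension of the codomain
    result = [0] * m
    for coefficient, image in zip(v, images_of_basis):
        for i in range(m):
            result[i] += coefficient * image[i]
    return result

# The linear map from Question 8, given by L(e_1), L(e_2), L(e_3).
images = [[3, 4], [-2, 0], [1, -1]]
print(apply_linear_map(images, [4, -1, 2]))  # [16, 14]
```

Note that the function never needs a formula for $L$; the list of images of the standard basis vectors carries all of the information.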
13 Matrices
Matrices are a way to represent linear maps.
To make writing a linear map a little less cumbersome, we will develop a com-
pact notation for linear maps using our previous observation that a linear map is
determined by its action on the standard basis vectors.
Definition 1 An $m \times n$ matrix is an array of numbers which has $m$ rows and $n$
columns. The numbers in a matrix are called entries.
When $A$ is a matrix, we write $A = (a_{i,j})$, meaning that $a_{i,j}$ is the entry in the
$i$th row and $j$th column of the matrix. Note: We start counting with 1, not 0. So
the upper left-hand entry of the matrix is $a_{1,1}$.
Question 2 The matrix $A = \begin{bmatrix} 1 & -1 \\ 2 & 4 \\ 3 & -5 \end{bmatrix}$ is an $n \times m$ matrix.
Solution
Hint: Note that this is $n \times m$, whereas the definition above used $m \times n$.
Hint: $n$ is the number of rows, and $m$ is the number of columns.
Hint: $n = 3$ and $m = 2$
In this case, $n$ is 3.
Solution And $m$ is 2.
Remember, we write $a_{i,j}$ for the entry in the $i$th row and $j$th column of the
matrix.
Solution
Hint: $a_{3,2}$ is the entry in the 3rd row and the 2nd column.
Hint: $a_{3,2} = -5$
Therefore $a_{3,2}$ is $-5$.
Next, suppose the $3 \times 4$ matrix $B$ has $b_{i,j} = i + j$. What is $B$?
Solution
Hint: Question 3 Solution
Hint: $b_{1,2} = 1 + 2 = 3$
According to this rule, $b_{1,2}$ is 3.
So the entry in the first row and second column of this matrix should be 3.
Hint: $B = \begin{bmatrix} 2 & 3 & 4 & 5 \\ 3 & 4 & 5 & 6 \\ 4 & 5 & 6 & 7 \end{bmatrix}$
Definition 4 To each linear map $L : \mathbb{R}^n \to \mathbb{R}^m$ we associate an $m \times n$ matrix $A_L$
called the matrix of the linear map with respect to the standard coordinates. It is
defined by setting $a_{i,j}$ to be the $i$th component of $L(\vec{e}_j)$. In other words, the $j$th
column of the matrix $A_L$ is the vector $L(\vec{e}_j)$.
Going the other way, we likewise associate to each $m \times n$ matrix $M$ a
linear map $L_M : \mathbb{R}^n \to \mathbb{R}^m$ by requiring that $L_M(\vec{e}_j)$ be the $j$th column of the matrix
$M$.
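Definition 4 can be read as a procedure: feed each standard basis vector to the map and record the outputs as columns. Here is a sketch of that procedure, assuming a matrix is stored as a list of rows and a vector as a list; the name `matrix_of` is made up, and the sample map is the one computed in the previous section, $L(x, y, z) = (3x - 2y + z, \; 4x - z)$.

```python
def matrix_of(L, n):
    """Build the matrix of a linear map L on R^n, stored as a list of rows."""
    columns = []
    for j in range(n):
        # e_j has a 1 in position j and 0 elsewhere (Python counts from 0).
        e_j = [1 if i == j else 0 for i in range(n)]
        columns.append(L(e_j))
    m = len(columns[0])
    # Entry (i, j) of the matrix is the i-th component of L(e_j).
    return [[columns[j][i] for j in range(n)] for i in range(m)]

def L(v):
    x, y, z = v
    return [3 * x - 2 * y + z, 4 * x - z]

print(matrix_of(L, 3))  # [[3, -2, 1], [4, 0, -1]]
```

The text numbers rows and columns starting from 1, while Python lists start from 0, so the entry $a_{1,1}$ lives at `matrix[0][0]`. The procedure only produces a meaningful matrix when the function passed in really is linear.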
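Question 5 The linear map $L : \mathbb{R}^2 \to \mathbb{R}^3$ satisfies $L\left(\begin{bmatrix} 1 \\ 0 \end{bmatrix}\right) = \begin{bmatrix} 3 \\ -5 \\ 2 \end{bmatrix}$ and $L\left(\begin{bmatrix} 0 \\ 1 \end{bmatrix}\right) = \begin{bmatrix} -1 \\ 1 \\ 1 \end{bmatrix}$. What is the matrix of $L$?
Solution
Hint: Remember that, by definition, the first column of this matrix should be $L\left(\begin{bmatrix} 1 \\ 0 \end{bmatrix}\right)$
and the second column should be $L\left(\begin{bmatrix} 0 \\ 1 \end{bmatrix}\right)$.
Hint: The matrix of $L$ is $\begin{bmatrix} 3 & -1 \\ -5 & 1 \\ 2 & 1 \end{bmatrix}$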
Let’s do another example.
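Question 6 Suppose $L$ is a linear map represented by the matrix $A = \begin{bmatrix} 1 & -1 \\ 2 & 4 \\ 3 & -5 \end{bmatrix}$.
Solution
Hint: $A$ should have one column for each basis vector of the domain.
Hint: $A$ has 2 columns, so the dimension of the domain is 2.
The dimension of the domain of $L$ is 2.
Solution
Hint: Each column of $A$ is the image of a basis vector under the action of $L$.
Hint: Since the columns are of length 3, that means $L$ is spitting out vectors of length 3.
Hint: The codomain of $L$ is $\mathbb{R}^3$, which is 3-dimensional.
The dimension of the codomain of $L$ is 3.
Suppose $\vec{v} = L\left(\begin{bmatrix} 0 \\ 1 \end{bmatrix}\right)$. What is $\vec{v}$?
Solution
Hint: Remember that, by definition, the $i$th column of $A$ is $L(\vec{e}_i)$.
Hint: So, by definition, $L\left(\begin{bmatrix} 0 \\ 1 \end{bmatrix}\right)$ is the second column of the matrix $A$.
Hint: So $L\left(\begin{bmatrix} 0 \\ 1 \end{bmatrix}\right) = \begin{bmatrix} -1 \\ 4 \\ -5 \end{bmatrix}$
Suppose $\vec{w} = L\left(\begin{bmatrix} 4 \\ 5 \end{bmatrix}\right)$. What is $\vec{w}$?
Solution
Hint: By definition of the matrix associated to a linear map, we know that $L\left(\begin{bmatrix} 1 \\ 0 \end{bmatrix}\right) = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}$ and $L\left(\begin{bmatrix} 0 \\ 1 \end{bmatrix}\right) = \begin{bmatrix} -1 \\ 4 \\ -5 \end{bmatrix}$.
Hint: Can you rewrite $\begin{bmatrix} 4 \\ 5 \end{bmatrix}$ in terms of $\begin{bmatrix} 1 \\ 0 \end{bmatrix}$ and $\begin{bmatrix} 0 \\ 1 \end{bmatrix}$ so that you can use the linearity
of $L$ to compute $L\left(\begin{bmatrix} 4 \\ 5 \end{bmatrix}\right)$?
Hint: $L\left(\begin{bmatrix} 4 \\ 5 \end{bmatrix}\right) = L\left(4\begin{bmatrix} 1 \\ 0 \end{bmatrix} + 5\begin{bmatrix} 0 \\ 1 \end{bmatrix}\right)$
Hint:
$$\begin{aligned}
L\left(\begin{bmatrix} 4 \\ 5 \end{bmatrix}\right) &= L\left(4\begin{bmatrix} 1 \\ 0 \end{bmatrix} + 5\begin{bmatrix} 0 \\ 1 \end{bmatrix}\right) \\
&= 4L\left(\begin{bmatrix} 1 \\ 0 \end{bmatrix}\right) + 5L\left(\begin{bmatrix} 0 \\ 1 \end{bmatrix}\right) \\
&= 4\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} + 5\begin{bmatrix} -1 \\ 4 \\ -5 \end{bmatrix} \\
&= \begin{bmatrix} 4 \\ 8 \\ 12 \end{bmatrix} + \begin{bmatrix} -5 \\ 20 \\ -25 \end{bmatrix} \\
&= \begin{bmatrix} -1 \\ 28 \\ -13 \end{bmatrix}
\end{aligned}$$
What is $L\left(\begin{bmatrix} x \\ y \end{bmatrix}\right)$?
Solution
Hint: By definition of the matrix associated to a linear map, we know that $L\left(\begin{bmatrix} 1 \\ 0 \end{bmatrix}\right) = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}$ and $L\left(\begin{bmatrix} 0 \\ 1 \end{bmatrix}\right) = \begin{bmatrix} -1 \\ 4 \\ -5 \end{bmatrix}$.
Hint: Can you rewrite $\begin{bmatrix} x \\ y \end{bmatrix}$ in terms of $\begin{bmatrix} 1 \\ 0 \end{bmatrix}$ and $\begin{bmatrix} 0 \\ 1 \end{bmatrix}$ so that you can use the linearity
of $L$ to compute $L\left(\begin{bmatrix} x \\ y \end{bmatrix}\right)$?
Hint: $L\left(\begin{bmatrix} x \\ y \end{bmatrix}\right) = L\left(x\begin{bmatrix} 1 \\ 0 \end{bmatrix} + y\begin{bmatrix} 0 \\ 1 \end{bmatrix}\right)$
Hint:
$$\begin{aligned}
L\left(\begin{bmatrix} x \\ y \end{bmatrix}\right) &= L\left(x\begin{bmatrix} 1 \\ 0 \end{bmatrix} + y\begin{bmatrix} 0 \\ 1 \end{bmatrix}\right) \\
&= xL\left(\begin{bmatrix} 1 \\ 0 \end{bmatrix}\right) + yL\left(\begin{bmatrix} 0 \\ 1 \end{bmatrix}\right) \\
&= x\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} + y\begin{bmatrix} -1 \\ 4 \\ -5 \end{bmatrix} \\
&= \begin{bmatrix} x \\ 2x \\ 3x \end{bmatrix} + \begin{bmatrix} -y \\ 4y \\ -5y \end{bmatrix} \\
&= \begin{bmatrix} x - y \\ 2x + 4y \\ 3x - 5y \end{bmatrix}
\end{aligned}$$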
As an antidote to the abstraction, let’s take a look at a simplistic “real world”
example.
Question 7 In the local barter economy, there is an exchange where you can
• trade 1 spoon for 2 apples and 1 orange,
• trade 1 knife for 2 oranges, and
• trade 1 fork for 3 apples and 4 oranges.
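Model this as a linear map $L : \mathbb{R}^3 \to \mathbb{R}^2$, where the coordinates on $\mathbb{R}^3$ are $\begin{bmatrix} \text{spoons} \\ \text{knives} \\ \text{forks} \end{bmatrix}$ and the coordinates on $\mathbb{R}^2$ are $\begin{bmatrix} \text{apples} \\ \text{oranges} \end{bmatrix}$.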
Solution
Hint: Remember that the matrix of a linear map is defined by the fact that the $k$th column of the matrix is the image of the $k$th standard basis vector.
Hint: $\begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}$ represents one spoon in the domain. Its image under this linear map is 2 apples and 1 orange, which is represented by the vector $\begin{bmatrix} 2 \\ 1 \end{bmatrix}$ in the codomain. So the first column of the matrix should be $\begin{bmatrix} 2 \\ 1 \end{bmatrix}$.
Hint: The full matrix is $\begin{bmatrix} 2 & 0 & 3 \\ 1 & 2 & 4 \end{bmatrix}$
What is the matrix of the linear map L?
Solution
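Hint:
\begin{align}
L\left(\begin{bmatrix} 3 \\ 0 \\ 4 \end{bmatrix}\right) &= L\left(3\begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} + 4\begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}\right) \\
&= 3L\left(\begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}\right) + 4L\left(\begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}\right) \\
&= 3\begin{bmatrix} 2 \\ 1 \end{bmatrix} + 4\begin{bmatrix} 3 \\ 4 \end{bmatrix} \\
&= \begin{bmatrix} 6 \\ 3 \end{bmatrix} + \begin{bmatrix} 12 \\ 16 \end{bmatrix} \\
&= \begin{bmatrix} 18 \\ 19 \end{bmatrix}
\end{align}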
So you would be able to get 18 apples and 19 oranges.
Hint: Now the “5 year old” solution: If you have 3 spoons, 0 knives, and 4 forks, and
you traded them all in for fruit, how many apples would you have?
Hint: 3 spoons would get you 6 apples, and 4 forks get you 12 apples, so you would
have a total of 18 apples.
The first (“apples”) entry of $L\left(\begin{bmatrix} 3 \\ 0 \\ 4 \end{bmatrix}\right)$ is 18.
Try to answer this question both by applying the matrix to the vector and by solving it the way a
5 year old would solve it.
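The two approaches to the barter question can be compared in a few lines of Python. This is only a sketch: the helper names `apply_matrix` and `trade_like_a_five_year_old` are made up for illustration, and the matrix is stored as a list of rows.

```python
# Barter matrix from the example: the columns are the images of one spoon,
# one knife, and one fork; the rows count apples and oranges.
BARTER = [[2, 0, 3],
          [1, 2, 4]]

def apply_matrix(M, v):
    """Apply the matrix M (a list of rows) to the vector v (hypothetical helper)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def trade_like_a_five_year_old(spoons, knives, forks):
    """Count the fruit directly from the three trade rules."""
    apples = 2 * spoons + 3 * forks
    oranges = 1 * spoons + 2 * knives + 4 * forks
    return [apples, oranges]

print(apply_matrix(BARTER, [3, 0, 4]))       # [18, 19]
print(trade_like_a_five_year_old(3, 0, 4))   # [18, 19]
```

Both computations give 18 apples and 19 oranges, matching the hints above.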
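Prove the following statement: if $S : \mathbb{R}^n \to \mathbb{R}^m$ and $T : \mathbb{R}^n \to \mathbb{R}^m$ are both linear maps, then the map $(S + T) : \mathbb{R}^n \to \mathbb{R}^m$ defined by $(S + T)(\vec{v}) = S(\vec{v}) + T(\vec{v})$ is also linear.
We need to check that $(S + T)$ respects both scalar multiplication and vector addition.
Scalar multiplication:
Choose an arbitrary scalar $c \in \mathbb{R}$ and an arbitrary vector $\vec{v} \in \mathbb{R}^n$. Then
\begin{align*}
(S + T)(c\vec{v}) &= S(c\vec{v}) + T(c\vec{v}) && \text{by definition of } (S + T) \\
&= cS(\vec{v}) + cT(\vec{v}) && \text{by the linearity of } S \text{ and } T \\
&= c\left(S(\vec{v}) + T(\vec{v})\right) && \text{by the distributivity of scalar multiplication over addition in } \mathbb{R}^m \\
&= c(S + T)(\vec{v}) && \text{by definition of } (S + T)
\end{align*}
Vector addition: Choose two arbitrary vectors $\vec{v}$ and $\vec{w}$ in $\mathbb{R}^n$. Then
\begin{align*}
(S + T)(\vec{v} + \vec{w}) &= S(\vec{v} + \vec{w}) + T(\vec{v} + \vec{w}) && \text{by definition of } S + T \\
&= S(\vec{v}) + S(\vec{w}) + T(\vec{v}) + T(\vec{w}) && \text{by the linearity of } S \text{ and } T \\
&= S(\vec{v}) + T(\vec{v}) + S(\vec{w}) + T(\vec{w}) && \text{by the commutativity of vector addition in } \mathbb{R}^m \\
&= (S + T)(\vec{v}) + (S + T)(\vec{w}) && \text{by the definition of } S + T.
\end{align*}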
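Prove that if $T : \mathbb{R}^n \to \mathbb{R}^m$ is a linear map and $c \in \mathbb{R}$ is a scalar, then the map $cT : \mathbb{R}^n \to \mathbb{R}^m$, defined by
\[ (cT)(\vec{v}) = cT(\vec{v}) \]
is also a linear map.
We need to check that $cT$ respects both scalar multiplication and vector addition.
Scalar multiplication:
Choose an arbitrary scalar $a \in \mathbb{R}$ and an arbitrary vector $\vec{v} \in \mathbb{R}^n$. Then
\begin{align*}
(cT)(a\vec{v}) &= cT(a\vec{v}) \\
&= acT(\vec{v}) \\
&= a(cT)(\vec{v})
\end{align*}
Vector addition: Choose two arbitrary vectors $\vec{v}$ and $\vec{w}$ in $\mathbb{R}^n$. Then
\begin{align*}
(cT)(\vec{v} + \vec{w}) &= cT(\vec{v} + \vec{w}) \\
&= c\left(T(\vec{v}) + T(\vec{w})\right) \\
&= cT(\vec{v}) + cT(\vec{w}) \\
&= (cT)(\vec{v}) + (cT)(\vec{w})
\end{align*}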
Observation 8 The last two exercises show that we have a nice way to both
add linear maps and multiply linear maps by scalars. So linear maps themselves
“feel” a bit like vectors. You do not have to worry about this now, but we will see
that the linear maps from $\mathbb{R}^n$ to $\mathbb{R}^m$ form an “abstract vector space.” Much of the
power of linear algebra is that we can apply linear algebra to spaces of linear maps!
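To make the idea of adding and scaling linear maps concrete, here is a small Python sketch that treats linear maps as functions on lists. The names `add_maps` and `scale_map`, and the two sample maps S and T, are assumptions chosen purely for illustration.

```python
def add_maps(S, T):
    """Return the map S + T, defined by (S + T)(v) = S(v) + T(v)."""
    return lambda v: [s + t for s, t in zip(S(v), T(v))]

def scale_map(c, T):
    """Return the map cT, defined by (cT)(v) = c * T(v)."""
    return lambda v: [c * t for t in T(v)]

# Two sample linear maps from R^2 to R^2 (arbitrary illustrative choices).
S = lambda v: [v[0] + 2 * v[1], 3 * v[0]]
T = lambda v: [-v[1], v[0] + v[1]]

U = add_maps(S, scale_map(2, T))   # the map S + 2T

v, w = [1, 2], [3, -1]
v_plus_w = [a + b for a, b in zip(v, w)]

# Spot check of additivity: U(v + w) should equal U(v) + U(w).
print(U(v_plus_w) == [a + b for a, b in zip(U(v), U(w))])   # True
```

A spot check on one pair of vectors is of course not a proof; the exercises above are what establish linearity in general.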
14 Composition
The composition of linear maps can be computed with matrices.
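Prove that if $S : \mathbb{R}^n \to \mathbb{R}^m$ is a linear map, and $T : \mathbb{R}^m \to \mathbb{R}^k$ is a linear map, then the composite function $T \circ S : \mathbb{R}^n \to \mathbb{R}^k$ is also linear.
We need to show that $T \circ S$ respects scalar multiplication and vector addition:
Scalar multiplication: For every scalar $a \in \mathbb{R}$ and every vector $\vec{v} \in \mathbb{R}^n$, we have:
\begin{align*}
(T \circ S)(a\vec{v}) &= T(S(a\vec{v})) \\
&= T(aS(\vec{v})) && \text{because } S \text{ respects scalar multiplication} \\
&= aT(S(\vec{v})) && \text{because } T \text{ respects scalar multiplication} \\
&= a(T \circ S)(\vec{v})
\end{align*}
Vector addition: For every two vectors $\vec{v}, \vec{w} \in \mathbb{R}^n$, we have:
\begin{align*}
(T \circ S)(\vec{v} + \vec{w}) &= T(S(\vec{v} + \vec{w})) \\
&= T(S(\vec{v}) + S(\vec{w})) && \text{because } S \text{ respects vector addition} \\
&= T(S(\vec{v})) + T(S(\vec{w})) && \text{because } T \text{ respects vector addition} \\
&= (T \circ S)(\vec{v}) + (T \circ S)(\vec{w})
\end{align*}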
Question 1 Suppose the matrix of S is $M_S = \begin{bmatrix} 2 & 0 & -1 \\ -1 & 1 & 1 \end{bmatrix}$ and the matrix of T is $M_T = \begin{bmatrix} -1 & -1 \\ 0 & 2 \\ -1 & 1 \end{bmatrix}$.
Solution
Hint: Remember that the matrix for $S \circ T$ will have columns given by $(S \circ T)\left(\begin{bmatrix} 1 \\ 0 \end{bmatrix}\right)$ and $(S \circ T)\left(\begin{bmatrix} 0 \\ 1 \end{bmatrix}\right)$
Hint: Question 2 Solution
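Hint:
\begin{align*}
(S \circ T)\left(\begin{bmatrix} 1 \\ 0 \end{bmatrix}\right) &= S\left(T\left(\begin{bmatrix} 1 \\ 0 \end{bmatrix}\right)\right) \\
&= S\left(\begin{bmatrix} -1 \\ 0 \\ -1 \end{bmatrix}\right) && \text{because by definition, } T\left(\begin{bmatrix} 1 \\ 0 \end{bmatrix}\right) \text{ is the first column of the matrix of } T \\
&= -1\,S\left(\begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}\right) + -1\,S\left(\begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}\right) && \text{by the linearity of } S \\
&= -1\begin{bmatrix} 2 \\ -1 \end{bmatrix} + -1\begin{bmatrix} -1 \\ 1 \end{bmatrix} && \text{because ???} \\
&= \begin{bmatrix} -1 \\ 0 \end{bmatrix}
\end{align*}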
What is $(S \circ T)\left(\begin{bmatrix} 1 \\ 0 \end{bmatrix}\right)$?
Question 3 Solution
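Hint:
\begin{align*}
(S \circ T)\left(\begin{bmatrix} 0 \\ 1 \end{bmatrix}\right) &= S\left(T\left(\begin{bmatrix} 0 \\ 1 \end{bmatrix}\right)\right) \\
&= S\left(\begin{bmatrix} -1 \\ 2 \\ 1 \end{bmatrix}\right) && \text{because by definition, } T\left(\begin{bmatrix} 0 \\ 1 \end{bmatrix}\right) \text{ is the second column of the matrix of } T \\
&= -1\,S\left(\begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}\right) + 2S\left(\begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}\right) + S\left(\begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}\right) && \text{by the linearity of } S \\
&= -1\begin{bmatrix} 2 \\ -1 \end{bmatrix} + 2\begin{bmatrix} 0 \\ 1 \end{bmatrix} + \begin{bmatrix} -1 \\ 1 \end{bmatrix} && \text{because ???} \\
&= \begin{bmatrix} -3 \\ 4 \end{bmatrix}
\end{align*}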
What is $(S \circ T)\left(\begin{bmatrix} 0 \\ 1 \end{bmatrix}\right)$?
Hint: The matrix of $(S \circ T)$ is $\begin{bmatrix} -1 & -3 \\ 0 & 4 \end{bmatrix}$
What is the matrix of S ◦ T?
Solution
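Hint: Remember that the matrix for $T \circ S$ will have columns given by $(T \circ S)\left(\begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}\right)$, $(T \circ S)\left(\begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}\right)$ and $(T \circ S)\left(\begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}\right)$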
Hint: Question 4 Solution
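Hint:
\begin{align*}
(T \circ S)\left(\begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}\right) &= T\left(S\left(\begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}\right)\right) \\
&= T\left(\begin{bmatrix} 2 \\ -1 \end{bmatrix}\right) && \text{because by definition, } S\left(\begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}\right) \text{ is the first column of the matrix of } S \\
&= 2T\left(\begin{bmatrix} 1 \\ 0 \end{bmatrix}\right) + -1\,T\left(\begin{bmatrix} 0 \\ 1 \end{bmatrix}\right) && \text{by the linearity of } T \\
&= 2\begin{bmatrix} -1 \\ 0 \\ -1 \end{bmatrix} + -1\begin{bmatrix} -1 \\ 2 \\ 1 \end{bmatrix} && \text{because ???} \\
&= \begin{bmatrix} -1 \\ -2 \\ -3 \end{bmatrix}
\end{align*}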
What is $(T \circ S)\left(\begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}\right)$?
Question 5 Solution
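Hint:
\begin{align*}
(T \circ S)\left(\begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}\right) &= T\left(S\left(\begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}\right)\right) \\
&= T\left(\begin{bmatrix} 0 \\ 1 \end{bmatrix}\right) && \text{because by definition, } S\left(\begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}\right) \text{ is the second column of the matrix of } S \\
&= \begin{bmatrix} -1 \\ 2 \\ 1 \end{bmatrix} && \text{we got lucky: by definition } T\left(\begin{bmatrix} 0 \\ 1 \end{bmatrix}\right) \text{ is the second column of the matrix of } T
\end{align*}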
What is $(T \circ S)\left(\begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}\right)$?
Question 6 Solution
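Hint:
\begin{align*}
(T \circ S)\left(\begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}\right) &= T\left(S\left(\begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}\right)\right) \\
&= T\left(\begin{bmatrix} -1 \\ 1 \end{bmatrix}\right) && \text{because by definition, } S\left(\begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}\right) \text{ is the third column of the matrix of } S \\
&= -1\,T\left(\begin{bmatrix} 1 \\ 0 \end{bmatrix}\right) + T\left(\begin{bmatrix} 0 \\ 1 \end{bmatrix}\right) && \text{by the linearity of } T \\
&= -1\begin{bmatrix} -1 \\ 0 \\ -1 \end{bmatrix} + \begin{bmatrix} -1 \\ 2 \\ 1 \end{bmatrix} && \text{because ???} \\
&= \begin{bmatrix} 0 \\ 2 \\ 2 \end{bmatrix}
\end{align*}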
What is $(T \circ S)\left(\begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}\right)$?
Hint: The matrix of $(T \circ S)$ is $\begin{bmatrix} -1 & -1 & 0 \\ -2 & 2 & 2 \\ -3 & 1 & 2 \end{bmatrix}$
What is the matrix of T ◦ S?
Definition 7 If M is an $m \times n$ matrix and N is a $k \times m$ matrix, then the product
NM of the matrices is defined as the matrix of the composition of the linear maps
defined by M and N.
In other words, NM is the matrix of $L_N \circ L_M$.
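The definition can be followed quite literally on a computer: apply the first map to each standard basis vector, apply the second map to the result, and record the outputs as columns. The sketch below assumes matrices are stored as lists of rows; the helper names (`apply_matrix`, `standard_basis`, `product_by_composition`) are hypothetical, and the sample matrices are the $M_S$ and $M_T$ from Question 1 of this section.

```python
def apply_matrix(M, v):
    """Apply the matrix M (a list of rows) to the vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in M]

def standard_basis(n):
    """The standard basis vectors e_1, ..., e_n of R^n, as lists."""
    return [[1 if i == j else 0 for i in range(n)] for j in range(n)]

def product_by_composition(N, M):
    """Matrix of L_N o L_M: its j-th column is N applied to (M applied to e_j)."""
    n = len(M[0])  # dimension of the domain of L_M
    columns = [apply_matrix(N, apply_matrix(M, e)) for e in standard_basis(n)]
    # columns[j][i] is the entry in row i and column j; transpose into rows.
    return [list(row) for row in zip(*columns)]

M_S = [[2, 0, -1], [-1, 1, 1]]
M_T = [[-1, -1], [0, 2], [-1, 1]]

print(product_by_composition(M_S, M_T))   # [[-1, -3], [0, 4]]
print(product_by_composition(M_T, M_S))   # [[-1, -1, 0], [-2, 2, 2], [-3, 1, 2]]
```

The two printed matrices agree with the matrices of $S \circ T$ and $T \circ S$ worked out by hand above.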
Warning 8 You may have seen another deﬁnition for matrix multiplication in
the past. That deﬁnition could be seen as a shortcut for how to compute the
product, but it is usually presented devoid of mathematical meaning.
Hopefully our deﬁnition seems properly motivated: matrix multiplication is just
what you do to compose linear maps. We suggest working out the problems here
using our deﬁnition: you will develop your own eﬃcient shortcuts in time.
You have already multiplied two matrices, even though you didn’t know it,
above. Take some time now to get a whole lot of practice. You do not need us
to prompt you: invent your own matrices and try to multiply them, on paper.
What condition is needed on the rows and columns of the two matrices for matrix
multiplication to even make sense? You can check your work using a computer
algebra system, like SAGE [1], or you can use a free web hosted app like Reshish [2]. Use
our deﬁnition, and think through it each time. Try to get faster and more eﬃcient.
Eventually you should be able to do this quite rapidly.
Question 9 Suppose $B = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$. Find a $2 \times 2$ matrix A so that $AB \neq BA$. Play
around! Can you find more than one?
Solution
Hint: There is no systematic way to answer this question: you just have to play
around, and see what you discover!
Hint: Question 10 Solution
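Hint: $\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}\begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 3 & 0 \end{bmatrix}$
What is $\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}\begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}$?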
Question 11 Solution
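Hint: $\begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} = \begin{bmatrix} 1 & 2 \\ 0 & 0 \end{bmatrix}$
What is $\begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}$?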
A matrix that doesn’t commute with B is
[1] http://www.sagemath.org/
[2] http://matrix.reshish.com/
Question 12 Solution
Hint: Try some simple matrices. Maybe limit yourself to $2 \times 2$ matrices?
Hint: One simple linear map which would work is $L\left(\begin{bmatrix} x \\ y \end{bmatrix}\right) = \begin{bmatrix} y \\ 0 \end{bmatrix}$. Applying this
twice to any vector would give you the zero vector. This linear map is great for cooking
up counterexamples to all sorts of naive things you might think about matrices! See this [3]
(you will understand more and more of these terms as the course progresses).
Question 13 Hint: The matrix of L is $\begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}$
What is the matrix of the example linear map L?
Find $A \neq 0$ with $AA = 0$. (Note: such a matrix is called “nilpotent”.)
Question 14 If $A = \begin{bmatrix} 2 & 8 \\ 3 & 12 \end{bmatrix}$, find $\vec{v} \neq \vec{0}$ with $A\vec{v} = \vec{0}$.
Solution
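Hint: Let $\vec{v} = \begin{bmatrix} x \\ y \end{bmatrix}$, and solve a system of equations.
Hint:
\begin{align*}
A(\vec{v}) &= \vec{0} \\
\begin{bmatrix} 2 & 8 \\ 3 & 12 \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix} &= \begin{bmatrix} 0 \\ 0 \end{bmatrix} \\
\begin{bmatrix} 2x + 8y \\ 3x + 12y \end{bmatrix} &= \begin{bmatrix} 0 \\ 0 \end{bmatrix}
\end{align*}
Hint: Both of these conditions ($2x + 8y = 0$ and $3x + 12y = 0$) are saying the same
thing: $x = -4y$.
Hint: So $\begin{bmatrix} -4 \\ 1 \end{bmatrix}$ works, for example.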
Question 15 If $A = \begin{bmatrix} 1 & 3 \\ 2 & 4 \end{bmatrix}$, find $\vec{v}$ with $A\vec{v} = \begin{bmatrix} 0 \\ 8 \end{bmatrix}$.
Solution
[3] http://mathoverflow.net/questions/16829/what-are-your-favorite-instructional-counterexamples/16841#16841
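Hint: Let $\vec{v} = \begin{bmatrix} x \\ y \end{bmatrix}$ and solve a system of equations.
Hint:
\begin{align*}
A\vec{v} &= \begin{bmatrix} 0 \\ 8 \end{bmatrix} \\
\begin{bmatrix} 1 & 3 \\ 2 & 4 \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix} &= \begin{bmatrix} 0 \\ 8 \end{bmatrix} \\
\begin{bmatrix} x + 3y \\ 2x + 4y \end{bmatrix} &= \begin{bmatrix} 0 \\ 8 \end{bmatrix}
\end{align*}
Hint:
\[
\begin{cases} x + 3y = 0 \\ 2x + 4y = 8 \end{cases}
\qquad
\begin{cases} x + 3y = 0 \\ x + 2y = 4 \end{cases}
\qquad
\begin{cases} x + 3y = 0 \\ y = -4 \end{cases}
\qquad
\begin{cases} x = 12 \\ y = -4 \end{cases}
\]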
In the last two exercises, you found that solving matrix equations is equivalent
to solving systems of linear equations.
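One way to see this equivalence is to plug the answers back in: applying each matrix to the vector found by solving the system should reproduce the right hand side. A minimal sketch, with a hypothetical helper `apply_matrix` and matrices stored as lists of rows:

```python
def apply_matrix(M, v):
    """Apply the matrix M (a list of rows) to the vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in M]

# Question 14: a nonzero vector sent to the zero vector.
print(apply_matrix([[2, 8], [3, 12]], [-4, 1]))   # [0, 0]

# Question 15: the solution x = 12, y = -4 of the system.
print(apply_matrix([[1, 3], [2, 4]], [12, -4]))   # [0, 8]
```

Each entry of the output is the left hand side of one equation of the system, evaluated at the proposed solution.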
Question 16 Rewrite
\[
\begin{cases} 4x + 7y + z = 3 \\ -x + 8y - z = 2 \end{cases}
\]
as $A\begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 3 \\ 2 \end{bmatrix}$.
Solution
Hint: $A = \begin{bmatrix} 4 & 7 & 1 \\ -1 & 8 & -1 \end{bmatrix}$
15 Python
Build up some linear algebra in python.
Exercise 1 We will store a vector as a list. So the vector $\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}$ will be stored as
[1,2,3]. Let’s try to write some Python code for working with lists as if they were
vectors.
Solution
Hint: This was discussed on StackOverflow: http://stackoverflow.com/questions/14050824/add-sum-of-values-of-two-lists-into-new-list
Write a “vector add” function. Your function may assume that the two vectors have
the same number of entries.
Python
# write a function vector_sum(v,w) which takes two vectors v and w,
# and returns the sum v + w.
#
# For example, vector_sum([1,2], [4,1]) equals [5,3]
#

def vector_sum(v,w):
    return # the sum v+w

def validator():
    # It would be better to try more cases
    if vector_sum([-5,23],[10,2])[0] != 5:
        return False
    if vector_sum([1,5,6],[2,3,6])[1] != 8:
        return False
    return True

Solution
Hint: Try a Python “list comprehension”
Hint: For example, return [alpha * x for x in v]
Next, write a scalar multiplication function.
Python
# write a function scale_vector(alpha, v) which takes a number alpha and a vector v
# and returns alpha * v
#
# For example, scale_vector(5,[1,2,3]) equals [5,10,15]

def scale_vector(alpha, v):
    return # the scaled vector alpha * v

def validator():
    # It would be better to try more cases
    if scale_vector(-3,[2,3,10])[1] != -9:
        return False
    if scale_vector(10,[4,3,2,1])[2] != 20:
        return False
    return True

Let’s write a dot product function.
Solution
Python
# Write a function dot_product(v,w) which takes two vectors v and w,
# and returns the dot product of v and w.
#
# For example, dot_product([1,2],[0,3]) is 6.

def dot_product(v,w):
    return # the dot product "v dot w"

def validator():
    if dot_product([1,2],[-3,5]) != 7:
        return False
    if dot_product([0,4,2],[2,3,-7]) != -2:
        return False
    return True
And we will store a matrix as a list of lists. For example the list [[1,3,5],[2,4,6]]
will represent the matrix $\begin{bmatrix} 1 & 3 & 5 \\ 2 & 4 & 6 \end{bmatrix}$.
Note that there are two diﬀerent conventions that we could have chosen: the in-
nermost lists could be the rows, or the columns. There are good reasons to have
chosen the opposite convention: after all, when thinking of a matrix as a linear
map, we should be paying attention to the columns, since the $i$th column tells us
what the corresponding linear map does when applied to $\vec{e}_i$.
Nevertheless, the innermost lists are rows in our chosen representation.
This way, to talk about the entry $m_{ij}$ we write the row index $i$ first and the column index $j$ second. With the other
choice, the $m_{ij}$ entry would have been accessed by writing j and i in the other order.
This is also the same convention used by the computer algebra system, Sage.
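As a small illustration of the row-first convention, the sketch below reads entries, rows, and columns out of a list-of-lists matrix. The helpers `entry` and `column` are hypothetical names, and they shift by one because Python counts from 0 while the mathematical indices $i$ and $j$ count from 1.

```python
M = [[1, 3, 5],
     [2, 4, 6]]

def entry(M, i, j):
    """The entry m_ij, using 1-based mathematical indices."""
    return M[i - 1][j - 1]

def column(M, j):
    """The j-th column (1-based), collected from each row."""
    return [row[j - 1] for row in M]

print(M[0])             # the first row: [1, 3, 5]
print(entry(M, 2, 3))   # 6
print(column(M, 2))     # [3, 4]
```

Rows come for free in this representation, while columns have to be assembled; with the opposite convention it would be the other way around.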
Exercise 2 Write a “matrix multiplication” function.
Solution
Python
# write a function multiply(A,B) which takes two matrices A and B stored in the above format,
# and returns the matrix of their product

def multiply(A,B):
    return # the product AB

def validator():
    # It would be better to try more cases
    a = [[-2, 0], [-2, -3], [-1, 3]]
    b = [[-3, 2, -1, -2], [3, 2, 1, 3]]
    result = multiply(a,b)
    if (len(result) != 3):
        return False
    if (len(result[0]) != 4):
        return False
    if (result[2][1] != 4):
        return False
    return True
Fantastic!
Next, let’s think more about how matrices and linear maps are related.
Solution
Hint:
Warning 3 This is a function whose output is a function.
Hint: Try using lambda.
Write a function matrix_to_function which takes a matrix ML representing the linear
map L, and returns a Python function. The returned Python function should take a vector
v and send it to L(v).
Python
# For example, if M = [[1,2],[3,4]], then matrix_to_function(M)([0,1]) should be [2,4]

def matrix_to_function(M):
    return # the function which sends v to M(v)

def validator():
    if matrix_to_function([[-3,2,4],[5,-7,2]])([5,3,2])[0] != -1:
        return False
    if matrix_to_function([[4,3],[2,-1],[-5,3]])([2,-4])[2] != -22:
        return False
    return True
Now you can go back and check—for some examples of A, B, and v—that the
following is true: matrix_to_function(A)(matrix_to_function(B)(v)) is the
same as matrix_to_function(multiply(A,B))(v).
Solution Now let’s go the other way. Write a function function_to_matrix which
takes a Python function f (assumed to be a linear map from $\mathbb{R}^2$ to $\mathbb{R}^2$) and returns the
$2 \times 2$ matrix representing that linear map.
Python
# For example if you had defined
#
# def L(v):
#     return [2*v[0]+3*v[1], -4*v[0]]
#
# Then function_to_matrix(L) is [[2,3],[-4,0]]

# You may assume that L takes [x,y] to another list with two entries
# and you may assume that L is linear

def function_to_matrix(L):
    return # the matrix

def validator():
    M = function_to_matrix( lambda v: [3*v[0]+5*v[1], -2*v[0] + 4*v[1]] )
    if (M[0][0] != 3):
        return False
    M = function_to_matrix( lambda v: [2*v[0]-3*v[1], -7*v[0] - 5*v[1]] )
    if (M[1][0] != -7):
        return False
    M = function_to_matrix( lambda v: [v[0]+7*v[1], 3*v[0] - 2*v[1]] )
    if (M[1][1] != -2):
        return False
    return True
Great work! If you like, you can try to compute function_to_matrix(matrix_to_function(M)).
You should get back M.
16 An inner product space
The dot product provides a way to compute lengths and angles.
In order to do geometry in $\mathbb{R}^n$, we will want to be able to compute the length of
a vector, and the angle between two vectors. Miraculously, a single operation will
allow us to compute both quantities.
17 Covectors
A covector eats vectors and provides numbers.
Definition 1 A covector on $\mathbb{R}^n$ is a linear map from $\mathbb{R}^n$ to $\mathbb{R}$.
As a matrix, it is a single row of length n.
Example 2 $\begin{bmatrix} 2 & -1 & 3 \end{bmatrix}$ is the matrix of a covector on $\mathbb{R}^3$.
Question 3 Solution
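Hint: $\begin{bmatrix} 2 & -1 & 3 \end{bmatrix}\begin{bmatrix} 3 \\ 5 \\ 7 \end{bmatrix} = 2(3) + (-1)(5) + 3(7) = 22$
$\begin{bmatrix} 2 & -1 & 3 \end{bmatrix}\begin{bmatrix} 3 \\ 5 \\ 7 \end{bmatrix} = 22$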
Now we can do this a bit more abstractly.
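Hint: $\begin{bmatrix} x & y & z \end{bmatrix}\begin{bmatrix} a \\ b \\ c \end{bmatrix} = ax + by + cz$
$\begin{bmatrix} x & y & z \end{bmatrix}\begin{bmatrix} a \\ b \\ c \end{bmatrix} = ax + by + cz$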
There is a natural way to turn a vector into a covector, or a covector into a
vector: just turn the matrix $90^\circ$ one direction or the other!
Definition 4 We define the transpose of a vector $\vec{v} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}$ to be the covector
$\vec{v}^\top$ with matrix $\begin{bmatrix} x_1 & x_2 & \cdots & x_n \end{bmatrix}$.
Similarly we define the transpose of a covector $\omega : \begin{bmatrix} x_1 & x_2 & \cdots & x_n \end{bmatrix}$ to be
the vector $\omega^\top$ with matrix $\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}$.
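Question 5 Suppose $\vec{v} = \begin{bmatrix} 1 \\ 4 \\ 3 \end{bmatrix}$. What is $(\vec{v}^\top)^\top$?
Solution
(a) $(\vec{v}^\top)^\top = \begin{bmatrix} 1 \\ 4 \\ 3 \end{bmatrix}$
(b) $(\vec{v}^\top)^\top = \begin{bmatrix} 1 & 4 & 3 \end{bmatrix}$
Indeed, $(\vec{v}^\top)^\top = \vec{v}$ and $(\omega^\top)^\top = \omega$ for any vector $\vec{v}$ and covector $\omega$.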
Let $\vec{v} = \begin{bmatrix} 5 \\ 3 \\ 1 \end{bmatrix}$ and $\vec{w} = \begin{bmatrix} 2 \\ -2 \\ 7 \end{bmatrix}$
Solution
Hint: $\vec{v}^\top(\vec{w}) = \begin{bmatrix} 5 & 3 & 1 \end{bmatrix}\begin{bmatrix} 2 \\ -2 \\ 7 \end{bmatrix} = 5(2) + 3(-2) + 1(7) = 11$
$\vec{v}^\top(\vec{w}) = 11$
Solution
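Hint:
\[
\vec{w}(\vec{v}^\top) = \begin{bmatrix} 2 \\ -2 \\ 7 \end{bmatrix}\begin{bmatrix} 5 & 3 & 1 \end{bmatrix} = \begin{bmatrix} 10 & 6 & 2 \\ -10 & -6 & -2 \\ 35 & 21 & 7 \end{bmatrix}
\]
What is $\vec{w}\vec{v}^\top$?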
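The contrast between the two products is easy to see in code: a covector applied to a vector gives one number, while a vector times a covector gives a whole matrix. This is a sketch with made-up helper names (`covector_apply`, `outer`), using plain lists for both vectors and covectors.

```python
def covector_apply(v, w):
    """The covector v^T applied to the vector w: a single number."""
    return sum(a * b for a, b in zip(v, w))

def outer(w, v):
    """The matrix w v^T, stored as a list of rows."""
    return [[a * b for b in v] for a in w]

v = [5, 3, 1]
w = [2, -2, 7]

print(covector_apply(v, w))   # 11
print(outer(w, v))            # [[10, 6, 2], [-10, -6, -2], [35, 21, 7]]
```

Row $i$ of the matrix is the $i$th entry of $\vec{w}$ times the covector $\vec{v}^\top$, which matches the hand computation in the hint.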
18 Dot product
The standard inner product is the dot product.
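Definition 1 Given two vectors $\vec{v}, \vec{w} \in \mathbb{R}^n$, we define their standard inner product
$\langle \vec{v}, \vec{w} \rangle$ by $\langle \vec{v}, \vec{w} \rangle = \vec{v}^\top(\vec{w}) \in \mathbb{R}$. We sometimes use the notation $\vec{v} \cdot \vec{w}$ for $\langle \vec{v}, \vec{w} \rangle$, and
call the operation the dot product.
Warning 2 Note that $\vec{v}^\top(\vec{w}) \neq \vec{w}(\vec{v}^\top)$: one is a number, while the other is an
$n \times n$ matrix.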
Question 3 Make sure for yourself, by using the definition, that
\[
\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \cdot \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = x_1 y_1 + x_2 y_2 + x_3 y_3 + \cdots + x_n y_n.
\]
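Prove the following facts about the dot product. Let $\vec{u}, \vec{v}, \vec{w} \in \mathbb{R}^n$ and $a \in \mathbb{R}$.
(a) $\vec{v} \cdot \vec{w} = \vec{w} \cdot \vec{v}$ (The dot product is commutative)
(b) $(\vec{u} + \vec{v}) \cdot \vec{w} = \vec{u} \cdot \vec{w} + \vec{v} \cdot \vec{w}$ and $(a\vec{v}) \cdot \vec{w} = a(\vec{v} \cdot \vec{w})$ (The dot product is linear
in the first argument)
(c) $\vec{u} \cdot (\vec{v} + \vec{w}) = \vec{u} \cdot \vec{v} + \vec{u} \cdot \vec{w}$ and $\vec{v} \cdot (a\vec{w}) = a(\vec{v} \cdot \vec{w})$ (The dot product is linear in
the second argument)
(d) $\vec{v} \cdot \vec{v} \geq 0$ (We say that the dot product is “positive definite”)
(e) if $\vec{v} \cdot \vec{z} = 0$ for all $\vec{z} \in \mathbb{R}^n$, then $\vec{v} = \vec{0}$ (The dot product is nondegenerate)
1. $\vec{v} \cdot \vec{w} = v_1 w_1 + v_2 w_2 + \dots + v_n w_n = w_1 v_1 + w_2 v_2 + \dots + w_n v_n = \vec{w} \cdot \vec{v}$, so the
dot product is commutative.
(skipping item 2 for now)
3.
\begin{align*}
\vec{u} \cdot (\vec{v} + \vec{w}) &= \vec{u}^\top(\vec{v} + \vec{w}) && \text{by definition} \\
&= \vec{u}^\top(\vec{v}) + \vec{u}^\top(\vec{w}) && \text{since } \vec{u}^\top : \mathbb{R}^n \to \mathbb{R} \text{ is linear} \\
&= \vec{u} \cdot \vec{v} + \vec{u} \cdot \vec{w} && \text{by definition}
\end{align*}
and
\begin{align*}
\vec{u} \cdot (a\vec{w}) &= \vec{u}^\top(a\vec{w}) && \text{by definition} \\
&= a\vec{u}^\top(\vec{w}) && \text{since } \vec{u}^\top : \mathbb{R}^n \to \mathbb{R} \text{ is linear} \\
&= a(\vec{u} \cdot \vec{w}) && \text{by definition}
\end{align*}
2. follows from 3 and 1
4. $\vec{v} \cdot \vec{v} = v_1^2 + v_2^2 + v_3^2 + \dots + v_n^2$, and the square of a real number is nonnegative,
so the sum of these squares is also nonnegative.
5. is perhaps the trickiest fact to prove. Observe that if $\vec{v} \cdot \vec{z} = 0$ for every $\vec{z} \in \mathbb{R}^n$,
then this formula is true in particular for $\vec{z} = \vec{e}_j$. But $\vec{v} \cdot \vec{e}_j = v_j$. Thus, by dotting
with all of the standard basis vectors, we see that every coordinate of $\vec{v}$ must be 0.
Thus $\vec{v}$ is the zero vector.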
The fact that the dot product is linear in two separate vector variables means
that it is an example of a “bilinear form”. We will make a careful study of bi-
linear forms later in this course: it will turn out that the second derivative of a
multivariable function gives a bilinear form at each point.
So far, the inner product feels like it belongs to the realm of pure algebra. In
the next few exercises, we will start to see some hints of its geometric meaning.
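Question 4 Let $\vec{v} = \begin{bmatrix} 5 \\ 1 \end{bmatrix}$.
Solution
Hint: $\langle \vec{v}, \vec{v} \rangle = 5^2 + 1^2 = 26$
$\langle \vec{v}, \vec{v} \rangle = 26$
Let $\vec{v} = \begin{bmatrix} x \\ y \end{bmatrix}$.
Solution
Hint: $\langle \vec{v}, \vec{v} \rangle = x^2 + y^2$
$\langle \vec{v}, \vec{v} \rangle = x^2 + y^2$
Notice that the length of the line segment from (0, 0) to (x, y) is $\sqrt{x^2 + y^2}$ by
the Pythagorean theorem.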
19 Length
The inner product provides a way to measure the length of a vector.
You should have discovered that $\vec{v} \cdot \vec{v}$ is the square of the length of the vector $\vec{v}$
when viewed as an arrow based at the origin. So far, you have only shown this in
the 2-dimensional case. See if you can do it in three dimensions.
Show that the length of the line segment from (0, 0, 0) to (x, y, z) is $\sqrt{\vec{v} \cdot \vec{v}}$, where
$\vec{v} = \begin{bmatrix} x \\ y \\ z \end{bmatrix}$.
Until now, you may not have seen a treatment of length in higher dimensions.
Generalizing the results above, we define:
Definition 1 The length of a vector $\vec{v} \in \mathbb{R}^n$ is defined by $|\vec{v}| = \sqrt{\vec{v} \cdot \vec{v}}$.
Question 2 Solution The length of the vector $\begin{bmatrix} 6 \\ 2 \\ 3 \\ 1 \end{bmatrix}$ is sqrt(6^2 + 2^2 + 3^2 + 1^2)
Question 3 Solution
Hint: By the Pythagorean theorem, we can see that the distance is $\sqrt{(5 - 2)^2 + (9 - 3)^2}$
Hint: We could also view this as the length of the vector $\begin{bmatrix} 3 \\ 6 \end{bmatrix}$ which “points” from
(2, 3) to (5, 9).
The distance between the points (2, 3) and (5, 9) is sqrt(3^2 + 6^2)
Definition 4 The distance between two points p and q in $\mathbb{R}^n$ is defined to be
the length of the “displacement” vector $p - q$.
Question 5 Solution
Hint: The displacement vector between these points is $\begin{bmatrix} 5 - 2 \\ 6 - 7 \\ 9 - 3 \\ 8 - 1 \end{bmatrix} = \begin{bmatrix} 3 \\ -1 \\ 6 \\ 7 \end{bmatrix}$
Hint: The length of the displacement vector is $\sqrt{3^2 + (-1)^2 + 6^2 + 7^2}$
The distance between the points (2, 7, 3, 1) and (5, 6, 9, 8) is sqrt(3^2 + (-1)^2 + 6^2 + 7^2)
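The definitions of length and distance translate directly into Python. The following is a minimal sketch with hypothetical function names `length` and `distance`, storing points and vectors as lists.

```python
import math

def length(v):
    """Length of v: the square root of v dotted with itself."""
    return math.sqrt(sum(x * x for x in v))

def distance(p, q):
    """Distance between points p and q: the length of the displacement p - q."""
    return length([a - b for a, b in zip(p, q)])

print(length([3, 4]))                         # 5.0
print(length([6, 2, 3, 1]))                   # the square root of 50
print(distance([2, 7, 3, 1], [5, 6, 9, 8]))   # the square root of 95
```

Since every coordinate of the displacement is squared, it does not matter whether the displacement is computed as $p - q$ or $q - p$.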
Question 6 Write an equation for the sphere centered at (0, 0, 0, 0) in $\mathbb{R}^4$
of radius r using the coordinates x, y, z, w on $\mathbb{R}^4$.
Solution
Hint: For a point p = (x, y, z, w) to be on the sphere of radius r centered at (0, 0, 0, 0),
the distance from p to the origin must be r
Hint: $r = \sqrt{x^2 + y^2 + z^2 + w^2}$
Hint: $x^2 + y^2 + z^2 + w^2 = r^2$
x^2 + y^2 + z^2 + w^2 = r^2
Question 7 Write an inequality stating that the point (x, y, z, w) is more than 4
units away from the point (2, 3, 1, 9)
Solution
Hint: The distance between the point (x, y, z, w) and (2, 3, 1, 9) is
$\sqrt{(x - 2)^2 + (y - 3)^2 + (z - 1)^2 + (w - 9)^2}$.
Hint: So we need $\sqrt{(x - 2)^2 + (y - 3)^2 + (z - 1)^2 + (w - 9)^2} > 4$
sqrt((x - 2)^2 + (y - 3)^2 + (z - 1)^2 + (w - 9)^2) > 4
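Prove that $|a\vec{v}| = |a||\vec{v}|$ for every $a \in \mathbb{R}$.
Warning 8 These two uses of $|\cdot|$ are distinct: $|a|$ means the absolute value of
a, and $|\vec{v}|$ is the length of $\vec{v}$.
\begin{align*}
|a\vec{v}| &= \sqrt{\langle a\vec{v}, a\vec{v} \rangle} && \text{by definition} \\
&= \sqrt{a^2 \langle \vec{v}, \vec{v} \rangle} && \text{by the linearity of the inner product in each slot} \\
&= \sqrt{a^2}\sqrt{\langle \vec{v}, \vec{v} \rangle} \\
&= |a||\vec{v}|
\end{align*}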
20 Angles
Dot products can be used to compute angles.
Question 1 Give a vector of length 1 which points in the same direction as $\vec{v} = \begin{bmatrix} 1 \\ 2 \end{bmatrix}$ (i.e. is a positive multiple of $\vec{v}$).
Solution
Hint: Remember that you just argued that $|a\vec{v}| = |a||\vec{v}|$ for any $a \in \mathbb{R}$. What positive
a could you choose to make $|a||\vec{v}| = 1$?
Hint: We need to take $a = \frac{1}{|\vec{v}|}$
Hint: The length of $\vec{v}$ is $\sqrt{1^2 + 2^2} = \sqrt{5}$
Hint: The vector $\begin{bmatrix} \frac{1}{\sqrt{5}} \\ \frac{2}{\sqrt{5}} \end{bmatrix}$ points in the same direction as $\vec{v}$, but has length 1.
Now that we understand the relationship between the inner product and length
of vectors, we will attempt to establish a connection between the inner product and
the angle between two vectors.
Do you remember the law of cosines? It states the following:
Theorem 2 If a triangle has side lengths a, b, and c, then $c^2 = a^2 + b^2 - 2ab\cos(\theta)$,
where $\theta$ is the angle opposite the side with length c.
Prove the law of cosines. You may want to read the lovely proof at mathproofs [1].
You can find a beautiful proof here [2].
We can rephrase this in terms of vectors, since geometrically if $\vec{v}$ and $\vec{w}$ are
vectors, the third side of the triangle is the vector $\vec{w} - \vec{v}$.
Theorem 3 For any two vectors $\vec{v}, \vec{w} \in \mathbb{R}^n$, $|\vec{w} - \vec{v}|^2 = |\vec{w}|^2 + |\vec{v}|^2 - 2|\vec{v}||\vec{w}|\cos(\theta)$,
where $\theta$ is the angle between $\vec{v}$ and $\vec{w}$.
(For you sticklers, this is really being taken as the definition of the angle between
two vectors in arbitrary dimension.)
Rewrite the theorem above by using our definition of length in terms of the dot
product. Performing some algebra you should obtain a nice expression for $\vec{v} \cdot \vec{w}$ in
terms of $|\vec{v}|$, $|\vec{w}|$, and $\cos(\theta)$.
[1] http://mathproofs.blogspot.com/2006/06/law-of-cosines.html
[2] http://mathproofs.blogspot.com/2006/06/law-of-cosines.html
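\begin{align*}
|\vec{w} - \vec{v}|^2 &= |\vec{v}|^2 + |\vec{w}|^2 - 2|\vec{v}||\vec{w}|\cos(\theta) \\
\langle \vec{w} - \vec{v}, \vec{w} - \vec{v} \rangle &= |\vec{v}|^2 + |\vec{w}|^2 - 2|\vec{v}||\vec{w}|\cos(\theta) \\
\langle \vec{w}, \vec{w} - \vec{v} \rangle - \langle \vec{v}, \vec{w} - \vec{v} \rangle &= |\vec{v}|^2 + |\vec{w}|^2 - 2|\vec{v}||\vec{w}|\cos(\theta) && \text{by the linearity of the inner product in the first slot} \\
\langle \vec{w}, \vec{w} \rangle - \langle \vec{w}, \vec{v} \rangle - \langle \vec{v}, \vec{w} \rangle + \langle \vec{v}, \vec{v} \rangle &= |\vec{v}|^2 + |\vec{w}|^2 - 2|\vec{v}||\vec{w}|\cos(\theta) && \text{by the linearity of the inner product in the second slot} \\
|\vec{w}|^2 - 2\langle \vec{v}, \vec{w} \rangle + |\vec{v}|^2 &= |\vec{v}|^2 + |\vec{w}|^2 - 2|\vec{v}||\vec{w}|\cos(\theta) \\
\langle \vec{v}, \vec{w} \rangle &= |\vec{v}||\vec{w}|\cos(\theta)
\end{align*}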
You should have discovered the following theorem:
Theorem 4 For any two vectors $\vec{v}, \vec{w} \in \mathbb{R}^n$, $\vec{v} \cdot \vec{w} = |\vec{v}||\vec{w}|\cos(\theta)$. In words, the
dot product of two vectors is the product of the lengths of the two vectors, times
the cosine of the angle between them.
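Solving the formula in Theorem 4 for $\theta$ gives a recipe for computing angles. The sketch below is one possible implementation; the function names are hypothetical, both vectors are assumed to be nonzero, and the clamping step is only there to protect against floating point rounding.

```python
import math

def dot(v, w):
    return sum(a * b for a, b in zip(v, w))

def length(v):
    return math.sqrt(dot(v, v))

def angle(v, w):
    """Angle in radians between two nonzero vectors v and w."""
    cosine = dot(v, w) / (length(v) * length(w))
    # Rounding can push the ratio slightly outside [-1, 1]; clamp it back.
    cosine = max(-1.0, min(1.0, cosine))
    return math.acos(cosine)

print(angle([1, 0], [0, 1]))   # a right angle, pi/2
print(angle([1, 0], [1, 1]))   # approximately pi/4
```

Note that `math.acos` always returns a value between 0 and $\pi$, so this computes the unsigned angle between the two vectors.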
This gives an almost totally geometric picture of the dot product: Given two
vectors $\vec{v}$ and $\vec{w}$, $|\vec{v}\cos(\theta)|$ can be viewed as the length of the projection of $\vec{v}$ onto
the line containing $\vec{w}$. So $|\vec{v}||\vec{w}|\cos(\theta)$ is the “length of the projection of $\vec{v}$ in the
direction of $\vec{w}$ times the length of $\vec{w}$”.
As mentioned above, this theorem is really being used to define the angle between
two vectors. This is not quite rigorous: how do we even know that $\frac{\vec{v} \cdot \vec{w}}{|\vec{v}||\vec{w}|}$ is
even between −1 and 1, so that it could be the cosine of an angle? This is clear
from the “Euclidean Geometry” perspective, but not as clear from the “Cartesian
Geometry” perspective. To make sure that everything is okay, we prove the
“Cauchy-Schwarz” theorem which reconciles these two worlds.
21 Cauchy-Schwarz
The Cauchy-Schwarz inequality relates the inner product and the norm of the two
vectors.
This is the Cauchy-Schwarz inequality.
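Theorem 1 $|\vec{v} \cdot \vec{w}| \leq |\vec{v}||\vec{w}|$ for any two vectors $\vec{v}, \vec{w} \in \mathbb{R}^n$
Proof If $\vec{v}$ or $\vec{w}$ is the zero vector, the result is trivial. So assume $\vec{v} \neq \vec{0}$ and
$\vec{w} \neq \vec{0}$. Start by noting that $\langle \vec{v} - \vec{w}, \vec{v} - \vec{w} \rangle \geq 0$. Expanding this out, we have:
\begin{align*}
\langle \vec{v}, \vec{v} \rangle - 2\langle \vec{v}, \vec{w} \rangle + \langle \vec{w}, \vec{w} \rangle &\geq 0 \\
2\langle \vec{v}, \vec{w} \rangle &\leq \langle \vec{v}, \vec{v} \rangle + \langle \vec{w}, \vec{w} \rangle
\end{align*}
Now, if $\vec{v}$ and $\vec{w}$ are unit vectors, this says that
\begin{align*}
2\langle \vec{v}, \vec{w} \rangle &\leq 2 \\
\langle \vec{v}, \vec{w} \rangle &\leq 1
\end{align*}
Now to prove the result for any pair of nonzero vectors, simply scale them to
make them unit vectors:
\begin{align*}
\left\langle \frac{1}{|\vec{v}|}\vec{v}, \frac{1}{|\vec{w}|}\vec{w} \right\rangle &\leq 1 \\
\langle \vec{v}, \vec{w} \rangle &\leq |\vec{v}||\vec{w}|
\end{align*}
$\square$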
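We are not quite done with the proof, because we have not proven that $\vec{v} \cdot \vec{w} \geq -|\vec{v}||\vec{w}|$.
Following the same basic outline, try to prove the other half of this
inequality below. Start by noting that $\langle \vec{v} + \vec{w}, \vec{v} + \vec{w} \rangle \geq 0$. Expanding this out, we
have:
\begin{align*}
\langle \vec{v}, \vec{v} \rangle + 2\langle \vec{v}, \vec{w} \rangle + \langle \vec{w}, \vec{w} \rangle &\geq 0 \\
2\langle \vec{v}, \vec{w} \rangle &\geq -\langle \vec{v}, \vec{v} \rangle - \langle \vec{w}, \vec{w} \rangle
\end{align*}
Now, if $\vec{v}$ and $\vec{w}$ are unit vectors, this says that
\begin{align*}
2\langle \vec{v}, \vec{w} \rangle &\geq -2 \\
\langle \vec{v}, \vec{w} \rangle &\geq -1
\end{align*}
Now to prove the result for any pair of nonzero vectors, simply scale them to
make them unit vectors:
\begin{align*}
\left\langle \frac{1}{|\vec{v}|}\vec{v}, \frac{1}{|\vec{w}|}\vec{w} \right\rangle &\geq -1 \\
\langle \vec{v}, \vec{w} \rangle &\geq -|\vec{v}||\vec{w}|
\end{align*}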
In the next question, we ask you to ﬁll in the details of an alternative proof
which, while a little harder than the one above, is at least as beautiful.
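Question 2 Start by noting that $\langle \vec{v} - \vec{w}, \vec{v} - \vec{w} \rangle \geq 0$. Expanding this out, we have:
\begin{align*}
\langle \vec{v}, \vec{v} \rangle - 2\langle \vec{v}, \vec{w} \rangle + \langle \vec{w}, \vec{w} \rangle &\geq 0 \\
2\langle \vec{v}, \vec{w} \rangle &\leq \langle \vec{v}, \vec{v} \rangle + \langle \vec{w}, \vec{w} \rangle
\end{align*}
Now notice that the left hand side is unaffected by scaling $\vec{v}$ by a scalar $\lambda$ and
$\vec{w}$ by $\frac{1}{\lambda}$, but the right hand side is! This allows us to breathe new life into the
inequality: we know that for every scalar $\lambda \in (0, \infty)$
\[
2\langle \vec{v}, \vec{w} \rangle \leq \lambda^2 |\vec{v}|^2 + \frac{1}{\lambda^2} |\vec{w}|^2
\]
This is somewhat miraculous: we have a stronger inequality than the one we started with.
This new inequality is strongest when the right hand side (RHS) is minimized.
As it stands the RHS is just a function of one real variable $\lambda$.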
Solution
Hint: We can minimize the right hand side using single variable calculus.
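Hint: Let $f(\lambda) = \lambda^2 |\vec{v}|^2 + \frac{1}{\lambda^2} |\vec{w}|^2$.
Then $f'(\lambda) = 2\lambda |\vec{v}|^2 - 2\frac{|\vec{w}|^2}{\lambda^3}$
The minimum must occur where $f'$ vanishes
Hint:
\begin{align*}
f'(\lambda) &= 0 \\
2\lambda |\vec{v}|^2 - 2\frac{|\vec{w}|^2}{\lambda^3} &= 0 \\
\lambda^4 |\vec{v}|^2 &= |\vec{w}|^2 \\
\lambda &= \sqrt{\frac{|\vec{w}|}{|\vec{v}|}}
\end{align*}
Hint: You can type $|w|$ by writing abs(w).
The value of $\lambda$ which minimizes the right hand side is sqrt(abs(w)/abs(v))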
Conclude that the Cauchy-Schwarz theorem is true!
Credit for this beautiful line of reasoning goes to Terry Tao at this blog post [1].
Question 3 Solution
Hint: We know that $\vec{v} \cdot \vec{w} = |\vec{v}||\vec{w}|\cos\theta$
[1] https://terrytao.wordpress.com/2007/09/05/amplification-arbitrage-and-the-tensor-power-trick/
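Hint: $\begin{bmatrix} 2 \\ 3 \\ 1 \end{bmatrix} \cdot \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} = 2(1) + 3(1) + 1(1) = 6$
Hint: $|\vec{v}| = \sqrt{\vec{v} \cdot \vec{v}} = \sqrt{14}$
Hint: $|\vec{w}| = \sqrt{\vec{w} \cdot \vec{w}} = \sqrt{3}$
Hint: Thus, $6 = \sqrt{14}\sqrt{3}\cos(\theta)$
Hint: Therefore, $\theta = \arccos\left(\frac{6}{\sqrt{42}}\right)$
The angle between the vectors $\vec{v} = \begin{bmatrix} 2 \\ 3 \\ 1 \end{bmatrix}$ and $\vec{w} = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}$ is arccos(6/(sqrt(14)*sqrt(3)))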
This problem probably would have stumped you before you started this activity!
Question 4 Find a vector which is perpendicular to $\vec{w} = \begin{bmatrix} 2 \\ 3 \\ 1 \end{bmatrix}$.
Solution
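Hint: For $\vec{v}$ to be perpendicular to $(2, 3, 1)$, we would need that the angle between
$\vec{v}$ and $\vec{w}$ is $\frac{\pi}{2}$ (or $\frac{-\pi}{2}$). In either case $\vec{v} \cdot \vec{w} = |\vec{v}||\vec{w}|\cos\left(\frac{\pm\pi}{2}\right) = 0$. So we need to find a
vector for which $\vec{v} \cdot \vec{w} = 0$
Hint: Let $\vec{v} = \begin{bmatrix} x \\ y \\ z \end{bmatrix}$. Then
\begin{align*}
\vec{v} \cdot \vec{w} &= 0 \\
\begin{bmatrix} x \\ y \\ z \end{bmatrix} \cdot \begin{bmatrix} 2 \\ 3 \\ 1 \end{bmatrix} &= 0 \\
2x + 3y + z &= 0
\end{align*}
Hint: There are a whole lot of choices for x, y, and z that fit these criteria (In fact
there is an entire plane of vectors perpendicular to $\vec{w}$)
Hint: $\begin{bmatrix} 0 \\ 1 \\ -3 \end{bmatrix}$ works for instance.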
Question 5 Find a vector $\vec{u}$ which is perpendicular to both $\vec{v} = \begin{bmatrix} 2 \\ 3 \\ 1 \end{bmatrix}$ and $\vec{w} = \begin{bmatrix} 5 \\ 9 \\ 2 \end{bmatrix}$
Solution
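Hint: We need both $\vec{u} \cdot \vec{v} = 0$ and $\vec{u} \cdot \vec{w} = 0$
Hint: Letting $\vec{u} = \begin{bmatrix} x \\ y \\ z \end{bmatrix}$, we have the conditions
\[
\begin{cases} 2x + 3y + z = 0 \\ 5x + 9y + 2z = 0 \end{cases}
\]
Hint:
\[
\begin{cases} 4x + 6y + 2z = 0 \\ 5x + 9y + 2z = 0 \end{cases}
\qquad
\begin{cases} x + 3y = 0 \\ 5x + 9y + 2z = 0 \end{cases}
\]
Hint: Picking whatever you like for x, you should be able to find the other values
now. Try x = 3.
Hint: $\begin{bmatrix} 3 \\ -1 \\ -3 \end{bmatrix}$ works, since $2(3) + 3(-1) + (-3) = 0$ and $5(3) + 9(-1) + 2(-3) = 0$.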
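Prove the “Triangle inequality”: For any two vectors $\vec{v}, \vec{w} \in \mathbb{R}^n$, $|\vec{v} + \vec{w}| \leq |\vec{v}| + |\vec{w}|$.
Draw a picture. Why is this called the triangle inequality?
The inequality is equivalent to $|\vec{v} + \vec{w}|^2 \leq (|\vec{v}| + |\vec{w}|)^2$, which is easier to handle
because it does not involve square roots.
\begin{align*}
|\vec{v} + \vec{w}|^2 &= \langle \vec{v} + \vec{w}, \vec{v} + \vec{w} \rangle \\
&= |\vec{v}|^2 + 2\langle \vec{v}, \vec{w} \rangle + |\vec{w}|^2 \\
&\leq |\vec{v}|^2 + 2|\vec{v}||\vec{w}| + |\vec{w}|^2 && \text{by the Cauchy-Schwarz inequality} \\
&= (|\vec{v}| + |\vec{w}|)^2
\end{align*}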
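Neither inequality needs a computer, but a numerical spot check can be reassuring. The sketch below tests the Cauchy-Schwarz and triangle inequalities on random vectors; the function names, the choice of dimension 5, and the small tolerance `tol` (which absorbs floating point rounding) are all assumptions made for illustration.

```python
import math
import random

def dot(v, w):
    return sum(a * b for a, b in zip(v, w))

def length(v):
    return math.sqrt(dot(v, v))

def check_inequalities(v, w, tol=1e-9):
    """Return (Cauchy-Schwarz holds, triangle inequality holds) for v and w."""
    cauchy_schwarz = abs(dot(v, w)) <= length(v) * length(w) + tol
    v_plus_w = [a + b for a, b in zip(v, w)]
    triangle = length(v_plus_w) <= length(v) + length(w) + tol
    return cauchy_schwarz, triangle

random.seed(0)
all_ok = True
for _ in range(1000):
    v = [random.uniform(-10, 10) for _ in range(5)]
    w = [random.uniform(-10, 10) for _ in range(5)]
    all_ok = all_ok and check_inequalities(v, w) == (True, True)
print(all_ok)
```

Passing such a check is evidence, not proof; the arguments above are what guarantee the inequalities for every pair of vectors.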
22 Multiplying matrices using dot products
There is a quick way to multiply matrices using dot products
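Question 1 Let $M = \begin{bmatrix} 2 & 3 \\ 4 & 5 \\ 1 & 2 \end{bmatrix}$, and $\vec{e}_2 = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}$.
Solution
Hint:
\[
\vec{e}_2^\top M = \begin{bmatrix} 0 & 1 & 0 \end{bmatrix}\begin{bmatrix} 2 & 3 \\ 4 & 5 \\ 1 & 2 \end{bmatrix} = \begin{bmatrix} 4 & 5 \end{bmatrix}
\]
$\vec{e}_2^\top M =$
Did you notice how multiplying by $\vec{e}_2^\top$ on the left selected the 2nd row of M?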
Prove that if M is an $m \times n$ matrix and $\vec{e}_j \in \mathbb{R}^m$ is the $j$th standard basis vector
of $\mathbb{R}^m$, then $\vec{e}_j^\top M$ is the $j$th row of M. We know that $w = \vec{e}_j^\top M$ is a covector
(row) just by looking at dimensions. What is the $i$th entry of this row? Well, we
can only figure that out by applying the map to the basis vectors. $\vec{e}_j^\top M \vec{e}_i$ (where $\vec{e}_i$ is the $i$th standard basis vector of $\mathbb{R}^n$) is the
dot product of $\vec{e}_j$ with the $i$th column of M. But that just selects the $j$th element
of that column. So the $i$th element of w is the $j$th element of the $i$th column of M.
This just says that w is the $j$th row of M. (Whew.)
Now we can use this observation to great effect. If M is an $m \times n$ matrix, $\vec{e}_j$
is the standard basis of $\mathbb{R}^m$ and $\vec{b}_k$ is the standard basis of $\mathbb{R}^n$, then we can select
$M_{j,k}$ by performing the operation $\vec{e}_j^\top M \vec{b}_k$. This is so important we will label it as
a theorem:
Theorem 2 If M is an $m \times n$ matrix, $\vec{e}_j$ is the standard basis of $\mathbb{R}^m$ and $\vec{b}_k$ is
the standard basis of $\mathbb{R}^n$, then $M_{j,k} = \vec{e}_j^\top M \vec{b}_k$.
Proof The proof is simply that $M\vec{b}_k$ is by definition the $k$th column of the
matrix, and by our observation above $\vec{e}_j^\top M \vec{b}_k$ must be the $j$th row of that column
vector, which consists of the single number $M_{j,k}$. $\square$
Question 3 Let $M = \begin{bmatrix} 4 & 1 & -2 \\ 3 & 1 & 0 \end{bmatrix}$.
Solution
Hint: By the above theorem, it will be the entry in the 2nd row and the 1st column
of M
Hint: $\begin{bmatrix} 0 & 1 \end{bmatrix} M \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} = 3$
$\begin{bmatrix} 0 & 1 \end{bmatrix} M \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} = 3$
The philosophical import of this theorem is that we can probe the inner structure
of any matrix with simple row and column vectors to ﬁnd out every component of
the matrix. What happens when we apply this insight to a product of matrices?
Question 4 Let $A = \begin{bmatrix} -1 & 1 \\ 2 & 2 \\ 3 & 0 \end{bmatrix}$ and B = [ ]. Let C = AB.
Solution
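Hint: By the theorem above, $C_{2,3} = \begin{bmatrix} 0 & 1 & 0 \end{bmatrix} C \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix}$
Hint: So $C_{2,3} = \begin{bmatrix} 0 & 1 & 0 \end{bmatrix} AB \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix}$
Hint: But $\begin{bmatrix} 0 & 1 & 0 \end{bmatrix} A$ is the 2nd row of A, and $B\begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix}$ is the 3rd column of B
Hint: So $\begin{bmatrix} 0 & 1 & 0 \end{bmatrix} A = \begin{bmatrix} 2 & 2 \end{bmatrix}$ and $B\begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix} = \begin{bmatrix} 1 \\ 9 \end{bmatrix}$
Hint: Thus $C_{2,3} = \begin{bmatrix} 2 & 2 \end{bmatrix}\begin{bmatrix} 1 \\ 9 \end{bmatrix} = 2(1) + 2(9) = 20$
Without computing the whole matrix C, can you find $C_{2,3}$?
$C_{2,3} = 20$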
Wow! So it looks like we can ﬁnd the entries of a product of two matrices just
by looking at the dot product of rows of the ﬁrst matrix with columns of the second
matrix!
Theorem 5 Let A and B be composable matrices. Let C = AB. Then $C_{i,j}$ is
the product of the $i$th row of A with the $j$th column of B
Prove this theorem We can prove this by combining the other two theorems
in this section. $C_{i,j} = \vec{e}_i^\top C \vec{e}_j$ by the second theorem. But C = AB, so we have
$C_{i,j} = \vec{e}_i^\top AB \vec{e}_j$. By the first theorem $\vec{e}_i^\top A$ is the $i$th row of A, and by our definition
of matrix multiplication, $B\vec{e}_j$ is the $j$th column of B. So $C_{i,j}$ is the product of the
$i$th row of A with the $j$th column of B.
Now try multiplying some matrices of your choosing using this method. This is likely the definition of matrix multiplication you learned in high school (or the same thing defined by some messy formula with a $\sum$). Do you prefer this method? Or do you prefer whatever method you came up with on your own earlier? Maybe they are the same!
Another note: it is interesting that we are feeding two vectors $e_i$ and $e_j$ into the matrix and getting out a number somehow. In week 4 we will learn that we are treading in deep water here: this is the very tip of the iceberg of bilinear forms, which are a kind of 2-tensor.
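The row-times-column rule of Theorem 5 can be sketched in a few lines of Python. This is only an illustration: the helper names are made up for this sketch, and because the entries of B are not given in Question 4, the matrix `B` below is a hypothetical 2 x 4 matrix whose third column is the (1, 9) used in the hints.

```python
def dot(u, v):
    """Dot product of two equal-length lists of numbers."""
    return sum(a * b for a, b in zip(u, v))

def row(M, i):
    """The i-th row of M (0-indexed), as a list."""
    return list(M[i])

def column(M, j):
    """The j-th column of M (0-indexed), as a list."""
    return [r[j] for r in M]

def entry_of_product(A, B, i, j):
    """Entry (i, j) of AB, computed without forming the whole product."""
    return dot(row(A, i), column(B, j))

def multiply(A, B):
    """The full product AB, built one entry at a time."""
    return [[entry_of_product(A, B, i, j) for j in range(len(B[0]))]
            for i in range(len(A))]

A = [[-1, 1], [2, 2], [3, 0]]
# Hypothetical B: only its third column (1, 9) is taken from the hints.
B = [[5, 0, 1, 2], [7, -1, 9, 4]]

# Python indexes from 0, so C_{2,3} is entry (1, 2).
print(entry_of_product(A, B, 1, 2))
```

Only one row of A and one column of B are touched when a single entry is requested, which is exactly the point of Question 4.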
23 Limits
Limits are the diﬀerence between analysis and algebra
Limits are the backbone of calculus. Multivariable calculus is no diﬀerent. In this
section we will deal with limits on an intuitive level.
We will postpone the rigorous $\varepsilon$-$\delta$ analysis to the next section.
Definition 1 Let $f : \mathbb{R}^n \to \mathbb{R}^m$ and let $p \in \mathbb{R}^n$. We say that
$$\lim_{x \to p} f(x) = L$$
for some $L \in \mathbb{R}^m$ if as $x$ "gets arbitrarily close to" $p$, the points $f(x)$ "get arbitrarily close to" $L$.
Definition 2 A function $f : \mathbb{R}^n \to \mathbb{R}^m$ is said to be continuous at a point $p \in \mathbb{R}^n$ if $\lim_{x \to p} f(x) = f(p)$.
Most functions defined by formulas are continuous where they are defined. For example, the function $f(x, y) = (\cos(xy + y^2),\ e^{\sin(x)+y} + y^2)$ is continuous because each component function is a string of composites of continuous functions. $f(x, y) = (xy,\ \cos(x)/(x+y))$ is continuous everywhere it is defined (it is not defined on the line $y = -x$, because the denominator of the second component function vanishes there). This is basically because all of the functions we have names for, like $\cos(x)$, $\sin(x)$, $e^x$, polynomials, and rational functions, are continuous, so if you can write down a function as a "single formula" it is probably continuous. The problematic points are basically just zeros of denominators, like our example above.
Piecewise defined functions can also be problematic:
Argue intuitively that the function $f : \mathbb{R}^2 \to \mathbb{R}$ defined by
$$f(x, y) = \begin{cases} 0 & \text{if } x < y \\ 1 & \text{if } x \ge y \end{cases}$$
is continuous at every point off the line $y = x$, and is discontinuous at every point on the line $y = x$.
For any point $p$ which is not on the line $y = x$, there is a little neighborhood of $p$ where $f$ is constant (identically 0 on the side where $x < y$, identically 1 on the side where $x > y$), and constant functions are known to be continuous. So $f$ is continuous at $p$. For any point $p$ on the line $y = x$, we get a different limit if we approach $p$ along the line $y = x$ (we get 1), versus approaching through points with $x < y$ (we get 0).
Question 3
Solution
Hint: Since $x\cos(\pi(x+y)) + \sin\left(\frac{\pi y}{4}\right)$ is continuous, we can just evaluate the function at $(1, 2)$.
Hint: So $\lim_{(x,y)\to(1,2)} x\cos(\pi(x+y)) + \sin\left(\frac{\pi y}{4}\right) = 1\cdot\cos(\pi(1+2)) + \sin\left(\frac{2\pi}{4}\right) = -1 + 1 = 0$
$\lim_{(x,y)\to(1,2)} x\cos(\pi(x+y)) + \sin\left(\frac{\pi y}{4}\right) = 0$
If we are confronted with a limit like $\lim_{(x,y)\to(0,0)} \frac{x^2 + xy}{x + y}$, this is actually a little bit interesting. The function is not continuous at $(0, 0)$, because it is not even defined at $(0, 0)$. What is more, the numerator and denominator are both approaching 0, which each "pull" the limit in opposite directions. (Dividing by smaller and smaller numbers would tend to make the value larger and larger, while multiplying by smaller and smaller numbers has the opposite effect.) There are essentially two ways to work with this:
• show that it does not have a limit by finding two different ways of approaching $(0, 0)$ which give different limiting values, or
• show that it does have a limit by rewriting the expression algebraically as a continuous function, and just plug in to get the value of the limit.
Question 4 Consider $\lim_{(x,y)\to(0,0)} \frac{x^2 + xy}{x + y}$.
Solution
Hint: This limit does exist, because it can be rewritten as a continuous function.
Do you think the limit exists?
(a) Yes
(b) No
Solution
Hint: $\lim_{(x,y)\to(0,0)} \frac{x^2 + xy}{x + y} = \lim_{(x,y)\to(0,0)} \frac{x(x + y)}{(x + y)}$
Hint: $\lim_{(x,y)\to(0,0)} \frac{x(x + y)}{(x + y)} = \lim_{(x,y)\to(0,0)} x = 0$
$\lim_{(x,y)\to(0,0)} \frac{x^2 + xy}{x + y} = 0$
Question 5 Consider $\lim_{(x,y)\to(3,3)} \frac{x^2 - 9}{xy - 3y}$.
Solution
Hint: This limit does exist, because it can be rewritten as a continuous function.
Do you think the limit exists?
(a) Yes
(b) No
Solution
Hint: $\lim_{(x,y)\to(3,3)} \frac{x^2 - 9}{xy - 3y} = \lim_{(x,y)\to(3,3)} \frac{(x - 3)(x + 3)}{y(x - 3)}$
Hint: $\lim_{(x,y)\to(3,3)} \frac{(x - 3)(x + 3)}{y(x - 3)} = \lim_{(x,y)\to(3,3)} \frac{x + 3}{y} = \frac{3 + 3}{3} = 2$
$\lim_{(x,y)\to(3,3)} \frac{x^2 - 9}{xy - 3y} = 2$
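Question 6 Let $f : \mathbb{R}^2 \to \mathbb{R}^2$ be defined by $f(x, y) = \left(\frac{x^2 y - 4y}{x - 2},\ xy\right)$
Solution
Hint: We can consider the limit component by component
Hint:
$$\begin{aligned}
\lim_{(x,y)\to(2,2)} \frac{x^2 y - 4y}{x - 2} &= \lim_{(x,y)\to(2,2)} \frac{(x - 2)(x + 2)y}{x - 2} \\
&= \lim_{(x,y)\to(2,2)} y(x + 2) \\
&= 2(2 + 2) \\
&= 8
\end{aligned}$$
Hint: $\lim_{(x,y)\to(2,2)} xy = 2(2) = 4$, since $xy$ is continuous.
What is $\lim_{(x,y)\to(2,2)} f(x, y)$?
$\begin{bmatrix} 8 \\ 4 \end{bmatrix}$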
Question 7 Consider $\lim_{(x,y)\to(0,0)} \frac{x}{y}$
Solution
Hint: Think about approaching $(0, 0)$ along the line $x = 0$ first, and then along the line $x = y$
Hint: If we look at $\lim_{(0,y)\to(0,0)} \frac{0}{y}$, this is just the limit of the constant 0 function. So the function approaches the limit 0 along the line $x = 0$
Hint: If we look at $\lim_{(t,t)\to(0,0)} \frac{t}{t}$, this is just the limit of the constant 1 function. So the function approaches the limit 1 along the line $y = x$
Hint: So the limit does not exist.
Do you think the limit exists?
(a) Yes
(b) No
The last example showcased how you could show that a limit does not exist by finding two different paths along which you approach different limiting values.
Let's try another example of that form.
Question 8
Solution
Hint: On the line $y = kx$, we have $f(x, y) = f(x, kx) = \frac{x + kx + x^2}{x - kx} = \frac{1 + k + x}{1 - k}$.
Hint: So we have $\lim_{x\to 0} \frac{1 + k + x}{1 - k} = \frac{1 + k}{1 - k}$
The limit of $f : \mathbb{R}^2 \to \mathbb{R}$ defined by $f(x, y) = \frac{x + y + x^2}{x - y}$ as $(x, y) \to (0, 0)$ along the line $y = kx$ is $(1 + k)/(1 - k)$
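The dependence of the limiting value on the slope k can also be seen numerically. The sketch below simply samples the function of Question 8 at points of the line y = kx that get closer and closer to the origin; the helper name `value_along_line` and the particular slopes and sample points are choices made for this illustration.

```python
def f(x, y):
    """The function from Question 8; undefined on the line y = x."""
    return (x + y + x**2) / (x - y)

def value_along_line(k, x):
    """f restricted to the line y = kx (requires k != 1 and x != 0)."""
    return f(x, k * x)

for k in (0.0, 2.0, -1.0):
    samples = [value_along_line(k, 10.0 ** (-n)) for n in (2, 4, 6)]
    predicted = (1 + k) / (1 - k)
    # The samples should drift toward the predicted value (1 + k) / (1 - k).
    print(k, samples, predicted)
```

Different slopes give different limiting values, so no single limit at the origin can exist.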
The last two questions may have given you the idea that if a limit does not exist, it must be because you get a different value by approaching along two different lines. This is not always the case. Consider the function
$$f(x, y) = \begin{cases} 1 & \text{if } y = x^2 \\ 0 & \text{if } y \ne x^2 \end{cases}$$
Through any line containing the origin, $f$ approaches 0 as points get closer and closer to $(0, 0)$, but as points approach $(0, 0)$ along the parabola $y = x^2$, $f$ approaches 1. So the limit $\lim_{(x,y)\to(0,0)} f(x, y)$ does not exist, even though the limit along each line does.
Here is a more "natural" example of such a phenomenon (defined by a single formula, not a piecewise defined function):
$$f(x, y) = \frac{x^2 y}{x^4 + y^2}$$
Along each line $y = kx$, we have $f(x, y) = \frac{kx^3}{x^4 + k^2x^2}$, so $\lim_{x\to 0} \frac{kx^3}{x^4 + k^2x^2} = \lim_{x\to 0} \frac{kx}{x^2 + k^2} = 0$ (and along the vertical line $x = 0$, $f$ is identically 0 away from the origin). On the other hand, along the parabola $y = x^2$, we have $f(x, y) = \frac{x^4}{2x^4} = \frac{1}{2}$, where the limit is $\frac{1}{2}$. So even though the limit along all lines through the origin is 0, the limit does not exist.
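A quick numerical experiment makes the contrast between lines and the parabola visible. This is only a sketch: the helper names `along_line` and `along_parabola` and the chosen sample points are assumptions made for illustration.

```python
def f(x, y):
    """f(x, y) = x^2 y / (x^4 + y^2), defined away from the origin."""
    return x**2 * y / (x**4 + y**2)

def along_line(k, t):
    """Value of f at the point (t, k*t) on the line y = kx."""
    return f(t, k * t)

def along_parabola(t):
    """Value of f at the point (t, t^2) on the parabola y = x^2."""
    return f(t, t**2)

for t in (1e-1, 1e-3, 1e-5):
    # Along the line the values shrink toward 0; along the parabola they stay at 1/2.
    print(t, along_line(3.0, t), along_parabola(t))
```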
24 The formal deﬁnition of the limit
Limits are deﬁned by formalizing the notion of closeness.
This optional section explores limits from a formal and rigorous point of view. The
level of mathematical maturity required to get through this section is much higher
than others. If you get through it and understand everything, you can consider
yourself “hardcore.”
Definition 1 Let $U \subset \mathbb{R}^n$. The closure of $U$, written $\overline{U}$, is defined to be the set of all $p \in \mathbb{R}^n$ such that every solid ball centered at $p$ contains at least one point of $U$.
Symbolically,
$$\overline{U} = \{p \in \mathbb{R}^n : \text{for all } r > 0 \text{ there exists } x \in U \text{ so that } |x - p| < r\}$$
Prove that $U \subset \overline{U}$ for any subset $U$ of $\mathbb{R}^n$.
Let $p \in U$. Then for every $r > 0$, $p$ is an element of $U$ whose distance to $p$ is less than $r$. In other words, since every solid ball centered at $p$ must contain $p$, and $p$ is in $U$, then $p$ must be in the closure of $U$. So $p \in \overline{U}$.
Prove that the closure of the open unit ball is the closed unit ball. That is, show that if $U = \{x : |x| < 1\}$, then $\overline{U} = \{x : |x| \le 1\}$.
Let $B = \{x : |x| \le 1\}$. We need to see that $\overline{U} = B$.
It is easy to see that $B \subset \overline{U}$, since for each point $p \in B$, either $p \in U$ (in which case it is in the closure), or $|p| = 1$. In this case, for every $r > 0$, let $s = \min(r, 1)$; the point $q = p - \frac{1}{2}sp$ is in $U$ (since $|q| = 1 - \frac{s}{2} < 1$) and satisfies $|p - q| = \frac{s}{2} < r$.
On the other hand, if $|p| > 1$, then a solid ball of radius $\frac{|p| - 1}{2}$ centered at $p$ will not intersect $U$, so $p \notin \overline{U}$. So we are done.
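Definition 2 Let $f : U \to V$ with $U \subset \mathbb{R}^n$, $V \subset \mathbb{R}^m$ and $p \in \overline{U}$. We say that $\lim_{x\to p} f(x) = L$ if for every $\varepsilon > 0$ we can find a $\delta > 0$ so that if $0 < |x - p| < \delta$ and $x \in U$, then $|f(x) - L| < \varepsilon$.
Definition 3 Let $f : U \to V$ with $U \subset \mathbb{R}^n$ and $V \subset \mathbb{R}^m$. We say that $f$ is continuous at $p \in U$ if $\lim_{x\to p} f(x) = f(p)$.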
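Prove, using the $\varepsilon$-$\delta$ definition of the limit, that $f : \mathbb{R}^2 \to \mathbb{R}$ defined by $f(x, y) = xy$ is continuous everywhere.
Let $p = (a, b)$. Let $\varepsilon > 0$ be given. Without loss of generality, assume $a, b \ge 0$. We work "backwards":
$$\begin{aligned}
& |xy - ab| < \varepsilon \\
\Longleftarrow\ & |[(x - a) + a][(y - b) + b] - ab| < \varepsilon \\
\Longleftarrow\ & |(x - a)(y - b) + a(y - b) + b(x - a)| < \varepsilon \\
\Longleftarrow\ & |x - a||y - b| + a|y - b| + b|x - a| < \varepsilon \quad \text{by the triangle inequality} \\
\Longleftarrow\ & \begin{cases} |x - a||y - b| < \frac{\varepsilon}{3} \\ a|y - b| < \frac{\varepsilon}{3} \\ b|x - a| < \frac{\varepsilon}{3} \end{cases}
\end{aligned}$$
Now it is easy to arrange that $a|y - b|$ and $b|x - a|$ are less than $\frac{\varepsilon}{3}$. If $a = 0$ or $b = 0$, you do not have to do anything to get the corresponding condition satisfied (and the corresponding term is left out of the minimum below), but otherwise $|x - a| < \frac{\varepsilon}{3b}$ is implied by $|(x, y) - (a, b)| < \frac{\varepsilon}{3b\sqrt{2}}$ and $|y - b| < \frac{\varepsilon}{3a}$ is implied by $|(x, y) - (a, b)| < \frac{\varepsilon}{3a\sqrt{2}}$.
$|x - a||y - b| < \frac{\varepsilon}{3}$ is implied by $|(x, y) - (a, b)| < \sqrt{\frac{\varepsilon}{3}}$, since each factor is at most $|(x, y) - (a, b)|$. So if we let $\delta = \min\left(\frac{\varepsilon}{3b\sqrt{2}}, \frac{\varepsilon}{3a\sqrt{2}}, \sqrt{\frac{\varepsilon}{3}}\right)$ we are done.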
Of course, this fact—that (x, y) →xy is continuous—is something you probably
believe intuitively. “Wiggling two numbers by a little bit doesn’t aﬀect their product
by very much.” Making that intuition precise obviously took some work in this
activity.
25 Single variable derivative, redux
The derivative is the slope of the best linear approximation.
Our goal is to deﬁne the derivative of a multivariable function, but ﬁrst we will
recast the derivative of a single variable function in a manner which is ripe for
generalization.
The derivative of a function $f : \mathbb{R} \to \mathbb{R}$ at a point $x = a$ is the "instantaneous rate of change" of $f(x)$ with respect to $x$. In other words,
$$f(a + \Delta x) \approx f(a) + f'(a)\Delta x.$$
This is really the essential thing to understand about the derivative.
Question 1 Let $f$ be a function with $f(3) = 2$, and $f'(3) = 5$.
Solution
Hint: $f(3.01) \approx f(3) + f'(3)(0.01)$
Hint: $\approx 2 + 5(0.01)$
Hint: $\approx 2.05$
Then $f(3.01) \approx 2.05$
Question 2 Let $f$ be a function with $f(4) = 2$ and $f(4.2) = 2.6$.
Solution
Hint: $f(4.2) \approx f(4) + f'(4)(0.2)$
Hint: $2.6 \approx 2 + f'(4)(0.2)$
Hint: $f'(4) \approx \frac{2.6 - 2}{0.2}$
Hint: $f'(4) \approx 3$
Then $f'(4) \approx 3$
We have not made precise what we mean by the approximate sign. After all, if
∆x is small enough and f is continuous, f(a +∆x) will be close to f(a), but we do
not want to say that the derivative is always zero. We will make the ≈ sign precise
by asking that the diﬀerence between the actual value and the estimated value goes
to zero faster than ∆x goes to zero.
Definition 3 Let $f : \mathbb{R} \to \mathbb{R}$ be a function, and let $a \in \mathbb{R}$. $f$ is said to be differentiable at $x = a$ if there is a number $m$ such that
$$f(a + \Delta x) = f(a) + m\Delta x + \mathrm{Error}_a(\Delta x)$$
with
$$\lim_{\Delta x\to 0} \frac{|\mathrm{Error}_a(\Delta x)|}{|\Delta x|} = 0.$$
If $f$ is differentiable at $a$, there is only one such number $m$, which we call the derivative of $f$ at $a$.
Verbally, m is the number which makes the error between the function value
f(a + ∆x) and the linear approximation f(a) + m∆x go to zero “faster than ∆x”
does.
This deﬁnition looks more complicated than the usual deﬁnition (and it is!), but
it has the advantage that it will generalize directly to the derivative of a multivari-
able function.
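Confirm that for $f(x) = x^2$, we have $f'(2) = 4$ using our definition of the derivative.
$$f(2 + \Delta x) = (2 + \Delta x)^2 = 2^2 + 2(2)\Delta x + (\Delta x)^2$$
So we have $f(2 + \Delta x) = f(2) + 4\Delta x + \mathrm{Error}(\Delta x)$, where $\mathrm{Error}(\Delta x) = (\Delta x)^2$.
$$\lim_{\Delta x\to 0} \frac{\mathrm{Error}(\Delta x)}{\Delta x} = \lim_{\Delta x\to 0} \frac{(\Delta x)^2}{\Delta x} = \lim_{\Delta x\to 0} (\Delta x) = 0$$
Thus, $f'(2) = 2(2) = 4$, according to our new definition!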
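To see why the error condition singles out one slope, it can help to compute the ratio $|\mathrm{Error}(\Delta x)|/|\Delta x|$ for the correct slope and for a wrong one. The sketch below does this for $f(x) = x^2$ at $a = 2$; the function names and the wrong candidate slope 3 are choices made only for this illustration.

```python
def square(x):
    return x * x

def error(f, a, m, dx):
    """Error_a(dx) = f(a + dx) - f(a) - m*dx for a candidate slope m."""
    return f(a + dx) - f(a) - m * dx

def error_ratio(f, a, m, dx):
    """|Error_a(dx)| / |dx|, which must tend to 0 when m is the derivative."""
    return abs(error(f, a, m, dx)) / abs(dx)

for dx in (0.1, 0.01, 0.001):
    # Second column: slope 4 (ratio shrinks like dx).
    # Third column: slope 3 (ratio stays near 1 and does not go to zero).
    print(dx, error_ratio(square, 2.0, 4.0, dx), error_ratio(square, 2.0, 3.0, dx))
```

With any slope other than 4 the error is still small, but it is not small compared with $\Delta x$, which is the distinction the definition is built to capture.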
Show the equivalence of our definition of the derivative with the "usual" definition. That is, show that the number $m$ in our definition satisfies $m = \lim_{\Delta x\to 0} \frac{f(a + \Delta x) - f(a)}{\Delta x}$.
This also shows the uniqueness of $m$.
Let $f$ be differentiable (in the sense above) at $x = a$, with derivative $m$. Then
$$\lim_{\Delta x\to 0} \frac{|\mathrm{Error}_a(\Delta x)|}{|\Delta x|} = 0$$
where $\mathrm{Error}_a(\Delta x)$ is defined by $f(a + \Delta x) = f(a) + m\Delta x + \mathrm{Error}_a(\Delta x)$, i.e. $\mathrm{Error}_a(\Delta x) = f(a + \Delta x) - f(a) - m\Delta x$.
So
$$\lim_{\Delta x\to 0} \frac{|f(a + \Delta x) - f(a) - m\Delta x|}{|\Delta x|} = 0$$
$$\lim_{\Delta x\to 0} \left| \frac{f(a + \Delta x) - f(a)}{\Delta x} - m \right| = 0$$
But this implies that
$$m = \lim_{\Delta x\to 0} \frac{f(a + \Delta x) - f(a)}{\Delta x}$$
So our definition of the derivative agrees with the "usual" definition.
26 Multivariable derivatives
We introduce the derivative.
The derivative in multiple variables requires a bit more machinery.
27 Intuitively
The derivative is the linear map which best approximates changes in a function near
a point.
The single variable derivative allows us to ﬁnd the best linear approximation to a
function at a point. In several variables we will deﬁne the derivative to be a linear
approximation which approximates the change in the values of a function. In this
section we will explore what the multivariable derivative is from an intuitive point
of view, without making anything too formal.
We give the following wishy-washy “deﬁnition”:
Definition 1 Let $f : \mathbb{R}^n \to \mathbb{R}^m$ be a function. Then the derivative of $f$ at a point $p \in \mathbb{R}^n$ is the linear map $D(f)\big|_p : \mathbb{R}^n \to \mathbb{R}^m$ which allows the following approximation property:
$$f(p + \vec{h}) \approx f(p) + D(f)\big|_p(\vec{h})$$
We will make the sense in which this approximation holds precise in the next section.
Note: we also call the matrix of the derivative the Jacobian Matrix in honor of the mathematician Carl Gustav Jacob Jacobi¹.
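Question 2 Let $f : \mathbb{R}^2 \to \mathbb{R}^3$ be a function, and suppose $f(2, 3) = (4, 8, 9)$. Suppose that the matrix of $D(f)\big|_{(2,3)}$ is $\begin{bmatrix} -1 & 3 \\ 4 & 5 \\ 2 & -3 \end{bmatrix}$.
Solution
Hint: By the defining property of derivatives, $f(2.01, 3.04) \approx f(2, 3) + D(f)\big|_{(2,3)}\left(\begin{bmatrix} 0.01 \\ 0.04 \end{bmatrix}\right)$
Hint: $= \begin{bmatrix} 4 \\ 8 \\ 9 \end{bmatrix} + \begin{bmatrix} -1 & 3 \\ 4 & 5 \\ 2 & -3 \end{bmatrix} \begin{bmatrix} 0.01 \\ 0.04 \end{bmatrix}$
Hint: $= \begin{bmatrix} 4 \\ 8 \\ 9 \end{bmatrix} + \begin{bmatrix} 0.01(-1) + 0.04(3) \\ 0.01(4) + 0.04(5) \\ 0.01(2) + 0.04(-3) \end{bmatrix}$
Hint: $= \begin{bmatrix} 4 \\ 8 \\ 9 \end{bmatrix} + \begin{bmatrix} 0.11 \\ 0.24 \\ -0.1 \end{bmatrix}$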
1
http://en.wikipedia.org/wiki/Carl_Gustav_Jacob_Jacobi
Hint: $= \begin{bmatrix} 4.11 \\ 8.24 \\ 8.9 \end{bmatrix}$
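The approximation property is a one-line computation once a matrix-times-vector helper is available. The sketch below redoes the arithmetic of Question 2; the function names `mat_vec` and `affine_approximation` are made up for this illustration.

```python
def mat_vec(M, v):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(m * x for m, x in zip(r, v)) for r in M]

def affine_approximation(value_at_p, jacobian, h):
    """f(p + h) is approximately f(p) + Df(p)(h)."""
    return [y + dy for y, dy in zip(value_at_p, mat_vec(jacobian, h))]

value_at_p = [4, 8, 9]                  # f(2, 3)
jacobian = [[-1, 3], [4, 5], [2, -3]]   # matrix of the derivative at (2, 3)
h = [0.01, 0.04]                        # displacement from (2, 3) to (2.01, 3.04)

approx = affine_approximation(value_at_p, jacobian, h)
print(approx)  # close to (4.11, 8.24, 8.9), up to floating-point rounding
```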
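Question 3 Let $f : \mathbb{R}^2 \to \mathbb{R}$ be a function with $f(1, 2) = 3$, $f(1.01, 2) = 3.04$ and $f(1, 2.002) = 3.002$.
Solution
Hint: Since $f : \mathbb{R}^2 \to \mathbb{R}$, $D(f)\big|_{(1,2)} : \mathbb{R}^2 \to \mathbb{R}$, so the matrix of the derivative is a row of length 2.
Hint: To find the matrix, we need to see how $D(f)\big|_{(1,2)}$ acts on $\begin{bmatrix} 1 \\ 0 \end{bmatrix}$ and $\begin{bmatrix} 0 \\ 1 \end{bmatrix}$
Hint: $f(1.01, 2) \approx f(1, 2) + D(f)\big|_{(1,2)}\left(\begin{bmatrix} 0.01 \\ 0 \end{bmatrix}\right)$ by the fundamental property of the derivative
Hint:
$$\begin{aligned}
3.04 &\approx 3 + D(f)\big|_{(1,2)}\left(\begin{bmatrix} 0.01 \\ 0 \end{bmatrix}\right) \\
0.04 &\approx 0.01\, D(f)\big|_{(1,2)}\left(\begin{bmatrix} 1 \\ 0 \end{bmatrix}\right) \quad \text{by the linearity of the derivative} \\
D(f)\big|_{(1,2)}\left(\begin{bmatrix} 1 \\ 0 \end{bmatrix}\right) &\approx 4
\end{aligned}$$
Hint:
$$\begin{aligned}
f(1, 2.002) &\approx f(1, 2) + D(f)\big|_{(1,2)}\left(\begin{bmatrix} 0 \\ 0.002 \end{bmatrix}\right) \\
3.002 &\approx 3 + D(f)\big|_{(1,2)}\left(\begin{bmatrix} 0 \\ 0.002 \end{bmatrix}\right) \\
0.002 &\approx 0.002\, D(f)\big|_{(1,2)}\left(\begin{bmatrix} 0 \\ 1 \end{bmatrix}\right) \quad \text{by the linearity of the derivative} \\
D(f)\big|_{(1,2)}\left(\begin{bmatrix} 0 \\ 1 \end{bmatrix}\right) &\approx 1
\end{aligned}$$
Hint: Thus the matrix of $D(f)\big|_{(1,2)}$ is $\begin{bmatrix} 4 & 1 \end{bmatrix}$
What is the Jacobian matrix of $f$ at $(1, 2)$?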
Solution
Hint: $f(0.9, 2.03) \approx f(1, 2) + D(f)\big|_{(1,2)}\left(\begin{bmatrix} -0.1 \\ 0.03 \end{bmatrix}\right)$
Hint: $= 3 + \begin{bmatrix} 4 & 1 \end{bmatrix} \begin{bmatrix} -0.1 \\ 0.03 \end{bmatrix}$
Hint: $= 3 + 4(-0.1) + 1(0.03) = 2.63$
Using your approximation of the Jacobian matrix, $f(0.9, 2.03) \approx 2.63$
This problem shows that if a function has a derivative, then knowing only how it changes in the coordinate directions lets you determine how it changes in any direction. This is so important it is worth driving it home: we only started with information about how $f(1.01, 2)$ and $f(1, 2.002)$ compared to $f(1, 2)$, but because this function had a derivative, we could obtain the approximate value of the function at any nearby point by exploiting linearity. This is powerful.
Prepare yourself : the following two paragraphs are going to be very diﬃcult
to digest.
So far we have only talked about the derivative of a function at a point. The derivative of a function is actually a function which assigns a linear map to each point in the domain of the original function. So the derivative is a function which takes (functions from $\mathbb{R}^n \to \mathbb{R}^m$) and returns a function which takes (points in $\mathbb{R}^n$) and returns (linear maps from $\mathbb{R}^n \to \mathbb{R}^m$). This level of abstraction is why we wanted you to get comfortable with "higher-order functions" earlier. We are not as crazy as we seem.
As an example, if $f : \mathbb{R}^2 \to \mathbb{R}^2$ is the function defined by $f(x, y) = (x^2 y, y + x)$, then it will turn out that at any point $(a, b)$, the derivative $Df\big|_{(a,b)}$ will be the linear map from $\mathbb{R}^2$ to $\mathbb{R}^2$ given by the matrix $\begin{bmatrix} 2ab & a^2 \\ 1 & 1 \end{bmatrix}$ (we do not know why yet, but this is true). So $Df$ is really a function which takes a point $(a, b)$ and spits out the linear map with matrix $\begin{bmatrix} 2ab & a^2 \\ 1 & 1 \end{bmatrix}$. So what about just plain old $D$? $D$ takes a function ($f$) and returns the function $Df$ which takes a point $(a, b)$ and returns the linear map whose matrix is $\begin{bmatrix} 2ab & a^2 \\ 1 & 1 \end{bmatrix}$. Letting $L(A, B)$ stand for all the linear functions from $A \to B$ and $\mathrm{Func}(A, B)$ be the set of all functions from $A \to B$, we could write $D : \mathrm{Func}(\mathbb{R}^n, \mathbb{R}^m) \to \mathrm{Func}(\mathbb{R}^n, L(\mathbb{R}^n, \mathbb{R}^m))$.
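The type of $D$ is perhaps easiest to absorb as a higher-order function in code. The sketch below is an illustration only: it assumes functions take and return Python lists, and it approximates each matrix entry by a central difference with a hypothetical step size `h`, which is a numerical stand-in rather than the definition of the derivative.

```python
def D(func, h=1e-6):
    """Take a function and return Df, which sends a point to a Jacobian matrix."""
    def Df(point):
        outputs = len(func(point))
        matrix = [[0.0] * len(point) for _ in range(outputs)]
        for j in range(len(point)):
            forward = list(point)
            backward = list(point)
            forward[j] += h
            backward[j] -= h
            fp, fb = func(forward), func(backward)
            for i in range(outputs):
                # Central-difference estimate of the (i, j) entry.
                matrix[i][j] = (fp[i] - fb[i]) / (2 * h)
        return matrix
    return Df

def f(point):
    x, y = point
    return [x**2 * y, y + x]

Df = D(f)               # D eats a function and returns a function
print(Df([1.0, 3.0]))   # compare with [[2ab, a^2], [1, 1]] at (a, b) = (1, 3)
```

Here `D` has the shape described in the text: its input is a function, and its output `Df` is itself a function from points to matrices.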
Please do not give up on the course after the last two paragraphs! Everything
is going to be okay. Hopefully you will be able to slowly digest these statements
throughout the course. Not understanding them now will not hold you back.
Question 4 Let $f$ be a function which satisfies $Df\big|_{(x,y)} = \begin{bmatrix} 3x^2y^2 & 2x^3y \\ 2x & 2y \\ ye^{xy} & xe^{xy} \end{bmatrix}$.
Solution
Hint: $D(f)\big|_{(1,2)} = \begin{bmatrix} 3(1)^2(2)^2 & 2(1)^3(2) \\ 2(1) & 2(2) \\ 2e^{(1)(2)} & 1e^{(1)(2)} \end{bmatrix}$
Hint: $f\left((1, 2) + \begin{bmatrix} 0.01 \\ -0.02 \end{bmatrix}\right) \approx f(1, 2) + \begin{bmatrix} 12 & 4 \\ 2 & 4 \\ 2e^2 & e^2 \end{bmatrix} \begin{bmatrix} 0.01 \\ -0.02 \end{bmatrix}$
Hint: $f(1.01, 1.98) \approx (2, 3, 1) + \begin{bmatrix} 12(0.01) + 4(-0.02) \\ 2(0.01) + 4(-0.02) \\ 2e^2(0.01) + e^2(-0.02) \end{bmatrix}$
Hint: $f(1.01, 1.98) \approx (2.04, 2.94, 1)$
Hint: Format this as $\begin{bmatrix} 2.04 \\ 2.94 \\ 1 \end{bmatrix}$
Given that $f(1, 2) = (2, 3, 1)$, approximate $f(1.01, 1.98)$.
28 Rigorously
The derivative approximates the changes in a function to ﬁrst order accuracy
We are now ready to deﬁne the derivative rigorously. Mimicking our development
of the single variable derivative, we deﬁne:
Definition 1 Let $f : \mathbb{R}^n \to \mathbb{R}^m$ be a function, and let $p \in \mathbb{R}^n$. $f$ is said to be differentiable at $p$ if there is a linear map $M : \mathbb{R}^n \to \mathbb{R}^m$ such that
$$f(p + \vec{h}) = f(p) + M(\vec{h}) + \mathrm{Error}_p(\vec{h})$$
with
$$\lim_{\vec{h}\to 0} \frac{|\mathrm{Error}_p(\vec{h})|}{|\vec{h}|} = 0.$$
If $f$ is differentiable at $p$, there is only one such linear map $M$, which we call the (total) derivative of $f$ at $p$.
Verbally, $M$ is the linear function which makes the error between the function value $f(p + \vec{h})$ and the affine approximation $f(p) + M(\vec{h})$ go to zero "faster than $\vec{h}$" does.
This definition is great, but it doesn't tell us how to actually compute the derivative of a differentiable function! Let's dig a little deeper:
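Example 2 Let $f : \mathbb{R}^2 \to \mathbb{R}^2$ be defined by $f\left(\begin{bmatrix} x \\ y \end{bmatrix}\right) = \begin{bmatrix} f_1(x, y) \\ f_2(x, y) \end{bmatrix}$. Assuming $f$ is differentiable at the point $(1, 2)$, let's try to compute the derivative there. Let $M$ be the derivative of $f$ at $(1, 2)$. Then
$$\lim_{h\to 0} \frac{\left| f\left((1, 2) + h\begin{bmatrix} 1 \\ 0 \end{bmatrix}\right) - f\left(\begin{bmatrix} 1 \\ 2 \end{bmatrix}\right) - M\left(h\begin{bmatrix} 1 \\ 0 \end{bmatrix}\right) \right|}{\left| h\begin{bmatrix} 1 \\ 0 \end{bmatrix} \right|} = 0$$
$$\lim_{h\to 0} \left| \frac{f(1 + h, 2) - f(1, 2) - hM\left(\begin{bmatrix} 1 \\ 0 \end{bmatrix}\right)}{h} \right| = 0$$
$$\lim_{h\to 0} \left| \frac{f\left(\begin{bmatrix} 1 + h \\ 2 \end{bmatrix}\right) - f\left(\begin{bmatrix} 1 \\ 2 \end{bmatrix}\right)}{h} - M\left(\begin{bmatrix} 1 \\ 0 \end{bmatrix}\right) \right| = 0$$
so
$$M\left(\begin{bmatrix} 1 \\ 0 \end{bmatrix}\right) = \lim_{h\to 0} \frac{f(1 + h, 2) - f(1, 2)}{h}$$
$$M\left(\begin{bmatrix} 1 \\ 0 \end{bmatrix}\right) = \lim_{h\to 0} \begin{bmatrix} \dfrac{f_1(1 + h, 2) - f_1(1, 2)}{h} \\[2ex] \dfrac{f_2(1 + h, 2) - f_2(1, 2)}{h} \end{bmatrix}$$
But each of the remaining quantities is a derivative of a one variable function! In particular, we have that
$$M\left(\begin{bmatrix} 1 \\ 0 \end{bmatrix}\right) = \begin{bmatrix} \dfrac{d}{dx}\left(f_1(x, 2)\right)\Big|_{x=1} \\[2ex] \dfrac{d}{dx}\left(f_2(x, 2)\right)\Big|_{x=1} \end{bmatrix}.$$
We will call these kinds of quantities partial derivatives in the next section.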
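Without copying the work in the example above (if you can) try to find $M(0, 1)$.
$$\lim_{h\to 0} \frac{\left| f\left((1, 2) + h\begin{bmatrix} 0 \\ 1 \end{bmatrix}\right) - f(1, 2) - M\left(h\begin{bmatrix} 0 \\ 1 \end{bmatrix}\right) \right|}{\left| h\begin{bmatrix} 0 \\ 1 \end{bmatrix} \right|} = 0$$
$$\lim_{h\to 0} \left| \frac{f(1, 2 + h) - f(1, 2) - hM\left(\begin{bmatrix} 0 \\ 1 \end{bmatrix}\right)}{h} \right| = 0$$
$$\lim_{h\to 0} \left| \frac{f(1, 2 + h) - f(1, 2)}{h} - M\left(\begin{bmatrix} 0 \\ 1 \end{bmatrix}\right) \right| = 0$$
so
$$M\left(\begin{bmatrix} 0 \\ 1 \end{bmatrix}\right) = \lim_{h\to 0} \frac{f(1, 2 + h) - f(1, 2)}{h}$$
$$M\left(\begin{bmatrix} 0 \\ 1 \end{bmatrix}\right) = \lim_{h\to 0} \begin{bmatrix} \dfrac{f_1(1, 2 + h) - f_1(1, 2)}{h} \\[2ex] \dfrac{f_2(1, 2 + h) - f_2(1, 2)}{h} \end{bmatrix}$$
$$M\left(\begin{bmatrix} 0 \\ 1 \end{bmatrix}\right) = \begin{bmatrix} \dfrac{d}{dy}\left(f_1(1, y)\right)\Big|_{y=2} \\[2ex] \dfrac{d}{dy}\left(f_2(1, y)\right)\Big|_{y=2} \end{bmatrix}$$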
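This question and the previous example show that the matrix of the derivative of $f$ at $(1, 2)$ is
$$\begin{bmatrix} \dfrac{d}{dx}\left(f_1(x, 2)\right)\Big|_{x=1} & \dfrac{d}{dy}\left(f_1(1, y)\right)\Big|_{y=2} \\[2ex] \dfrac{d}{dx}\left(f_2(x, 2)\right)\Big|_{x=1} & \dfrac{d}{dy}\left(f_2(1, y)\right)\Big|_{y=2} \end{bmatrix}$$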
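Question 3 Use the results of this question and the previous example to find the matrix of the derivative of $f\left(\begin{bmatrix} x \\ y \end{bmatrix}\right) = \begin{bmatrix} x^2 + y^2 \\ xy \end{bmatrix}$ at the point $(1, 2)$.
Solution
Hint: In this case $f_1(x, y) = x^2 + y^2$ and $f_2(x, y) = xy$
Hint: By the result of the last two exercises, the matrix of the derivative is
$$\begin{bmatrix} \dfrac{d}{dx}\left(f_1(x, 2)\right)\Big|_{x=1} & \dfrac{d}{dy}\left(f_1(1, y)\right)\Big|_{y=2} \\[2ex] \dfrac{d}{dx}\left(f_2(x, 2)\right)\Big|_{x=1} & \dfrac{d}{dy}\left(f_2(1, y)\right)\Big|_{y=2} \end{bmatrix}$$
Hint:
$$\begin{aligned}
f_1(x, 2) &= x^2 + 2^2 \\
f_1(1, y) &= 1^2 + y^2 \\
f_2(x, 2) &= 2x \\
f_2(1, y) &= y
\end{aligned}$$
Hint:
$$\begin{aligned}
\frac{d}{dx}\left(f_1(x, 2)\right)\Big|_{x=1} &= \frac{d}{dx}\left(x^2 + 2^2\right)\Big|_{x=1} = 2x\Big|_{x=1} = 2 \\
\frac{d}{dx}\left(f_2(x, 2)\right)\Big|_{x=1} &= \frac{d}{dx}(2x)\Big|_{x=1} = 2\Big|_{x=1} = 2 \\
\frac{d}{dy}\left(f_1(1, y)\right)\Big|_{y=2} &= \frac{d}{dy}\left(1^2 + y^2\right)\Big|_{y=2} = 2y\Big|_{y=2} = 4 \\
\frac{d}{dy}\left(f_2(1, y)\right)\Big|_{y=2} &= \frac{d}{dy}(y)\Big|_{y=2} = 1\Big|_{y=2} = 1
\end{aligned}$$
Hint: Thus the matrix of the derivative is $\begin{bmatrix} 2 & 4 \\ 2 & 1 \end{bmatrix}$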
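The recipe of this section (freeze all but one variable, then take an ordinary one-variable derivative) can be checked numerically. The sketch below mirrors the 2 x 2 matrix of slice derivatives for $f(x, y) = (x^2 + y^2, xy)$ at $(1, 2)$; the helper `derivative` and its central-difference formula with step `h` are assumptions made for illustration, not the limit used in the text.

```python
def derivative(g, t, h=1e-6):
    """Central-difference estimate of the one-variable derivative g'(t)."""
    return (g(t + h) - g(t - h)) / (2 * h)

def f1(x, y):
    return x**2 + y**2

def f2(x, y):
    return x * y

# Each entry differentiates a one-variable "slice" of a component function.
jacobian = [
    [derivative(lambda x: f1(x, 2), 1), derivative(lambda y: f1(1, y), 2)],
    [derivative(lambda x: f2(x, 2), 1), derivative(lambda y: f2(1, y), 2)],
]
print(jacobian)  # approximately [[2, 4], [2, 1]]
```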
29 Partial Derivatives
The entries in the Jacobian matrix are partial derivatives
There is a familiar looking formula for the derivative of a differentiable function:
Theorem 1 Let $f : \mathbb{R}^n \to \mathbb{R}^m$ be a differentiable function. Then
$$D(f)\big|_p(v) = \lim_{h\to 0} \frac{f(p + hv) - f(p)}{h}$$
Prove this theorem.
By the definition of the derivative, we have that
$$\lim_{h\to 0} \frac{|f(p + hv) - f(p) - Df(p)(hv)|}{|hv|} = 0$$
$$\frac{1}{|v|} \lim_{h\to 0} \left| \frac{f(p + hv) - f(p) - hDf(p)(v)}{h} \right| = 0 \quad \text{since } \tfrac{1}{|v|} \text{ is a constant}$$
$$\lim_{h\to 0} \left| \frac{f(p + hv) - f(p)}{h} - Df(p)(v) \right| = 0 \quad \text{since } \tfrac{1}{|v|} \text{ is a constant}$$
We conclude that
$$D(f)\big|_p(v) = \lim_{h\to 0} \frac{f(p + hv) - f(p)}{h}$$
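Question 2 Let $f : \mathbb{R}^2 \to \mathbb{R}^2$ be defined by $f(x, y) = (x^2 - y^2, 2xy)$.
Solution
Hint:
$$\begin{aligned}
Df\big|_{(3,4)}\left(\begin{bmatrix} -1 \\ 2 \end{bmatrix}\right) &= \lim_{h\to 0} \frac{f\left((3, 4) + h\begin{bmatrix} -1 \\ 2 \end{bmatrix}\right) - f(3, 4)}{h} \\
&= \lim_{h\to 0} \frac{f(3 - h, 4 + 2h) - f(3, 4)}{h} \\
&= \lim_{h\to 0} \frac{1}{h}\left(\begin{bmatrix} (3 - h)^2 - (4 + 2h)^2 \\ 2(3 - h)(4 + 2h) \end{bmatrix} - \begin{bmatrix} -7 \\ 24 \end{bmatrix}\right) \\
&= \lim_{h\to 0} \frac{1}{h}\begin{bmatrix} (3 - h)^2 - (4 + 2h)^2 + 7 \\ 2(3 - h)(4 + 2h) - 24 \end{bmatrix}
\end{aligned}$$
Hint:
$$= \lim_{h\to 0} \begin{bmatrix} \dfrac{-22h - 3h^2}{h} \\[2ex] \dfrac{4h - 4h^2}{h} \end{bmatrix} = \lim_{h\to 0} \begin{bmatrix} -22 - 3h \\ 4 - 4h \end{bmatrix} = \begin{bmatrix} -22 \\ 4 \end{bmatrix}$$
Using the theorem above, compute $Df\big|_{(3,4)}\left(\begin{bmatrix} -1 \\ 2 \end{bmatrix}\right)$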
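The difference quotient in Theorem 1 can be evaluated directly for small values of h. The sketch below uses the function, point, and direction of Question 2; the helper name `difference_quotient` and the sample values of h are choices made for this illustration.

```python
def f(x, y):
    return (x**2 - y**2, 2 * x * y)

def difference_quotient(func, p, v, h):
    """(f(p + h*v) - f(p)) / h, computed component by component."""
    moved = func(*[pi + h * vi for pi, vi in zip(p, v)])
    base = func(*p)
    return [(a - b) / h for a, b in zip(moved, base)]

p = (3.0, 4.0)
v = (-1.0, 2.0)
for h in (1e-1, 1e-2, 1e-4):
    # The algebra in the hints predicts exactly (-22 - 3h, 4 - 4h).
    print(h, difference_quotient(f, p, v, h))
```

As h shrinks, the quotients approach (-22, 4), the value found in the hints.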
Since the unit directions are especially important we define:
Definition 3 Let $f : \mathbb{R}^n \to \mathbb{R}$ be a (not necessarily differentiable) function. We define its partial derivative with respect to $x_i$ by
$$\frac{\partial f}{\partial x_i}(p) = f_{x_i}(p) := \lim_{h\to 0} \frac{f(p + he_i) - f(p)}{h}$$
In other words, $\frac{\partial f}{\partial x_i}(p)$ is the instantaneous rate of change in $f$ by moving only in the $e_i$ direction.
Example 4 There is really only a good visualization of the partial derivatives
of a map f : R
2
→ R, because this is really the only type of higher dimensional
function we can eﬀectively graph.
Computing partial derivatives is no harder than computing derivatives of single variable functions. You take a partial derivative of a function with respect to $x_i$ just by treating all other variables as constants, and taking the derivative with respect to $x_i$.
Question 5 Let $f : \mathbb{R}^2 \to \mathbb{R}$ be defined by $f(x, y) = x\sin(y)$.
Solution
Hint: We are trying to compute $\frac{\partial}{\partial x}\left(x\sin(y)\right)\Big|_{(a,b)}$
Hint: We just differentiate as if $y$ were a constant, so $\frac{\partial}{\partial x}\left(x\sin(y)\right)\Big|_{(a,b)} = \sin(y)\Big|_{(a,b)}$
Hint: $f_x(a, b) = \sin(b)$
$f_x(a, b) = \sin(b)$
Solution
Hint: We are trying to compute $\frac{\partial}{\partial y}\left(x\sin(y)\right)\Big|_{(a,b)}$
Hint: We just differentiate as if $x$ were a constant, so $\frac{\partial}{\partial y}\left(x\sin(y)\right)\Big|_{(a,b)} = x\cos(y)\Big|_{(a,b)}$
Hint: $f_y(a, b) = a\cos(b)$
$f_y(a, b) = a\cos(b)$
We have already proven the following theorem in the special case n = m = 2
in the previous activity. Proving it in the general case requires no new ideas: only
better notational bookkeeping.
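Theorem 6 Let $f : \mathbb{R}^n \to \mathbb{R}^m$ be a function with component functions $f_i : \mathbb{R}^n \to \mathbb{R}$, for $i = 1, 2, 3, \ldots, m$. In other words, $f(p) = \begin{bmatrix} f_1(p) \\ f_2(p) \\ f_3(p) \\ \vdots \\ f_m(p) \end{bmatrix}$. If $f$ is differentiable at $p$, then its Jacobian matrix at $p$ is
$$\begin{bmatrix}
\dfrac{\partial f_1}{\partial x_1}(p) & \dfrac{\partial f_1}{\partial x_2}(p) & \cdots & \dfrac{\partial f_1}{\partial x_n}(p) \\[2ex]
\dfrac{\partial f_2}{\partial x_1}(p) & \dfrac{\partial f_2}{\partial x_2}(p) & \cdots & \dfrac{\partial f_2}{\partial x_n}(p) \\[1ex]
\vdots & \vdots & \ddots & \vdots \\[1ex]
\dfrac{\partial f_m}{\partial x_1}(p) & \dfrac{\partial f_m}{\partial x_2}(p) & \cdots & \dfrac{\partial f_m}{\partial x_n}(p)
\end{bmatrix}$$
More compactly, we might write $\left[\dfrac{\partial f_i}{\partial x_j}(p)\right]$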
Try to prove this theorem. Using the more compact notation will be helpful. Follow along the proof we developed together in the last section! By the definition of the derivative, we have
$$\lim_{h\to 0} \frac{|f(p + he_i) - f(p) - M(he_i)|}{|he_i|} = 0$$
$$\lim_{h\to 0} \frac{|f(p + he_i) - f(p) - hM(e_i)|}{|h|} = 0$$
$$\lim_{h\to 0} \left| \frac{f(p + he_i) - f(p) - hM(e_i)}{h} \right| = 0$$
$$\lim_{h\to 0} \left| \frac{f(p + he_i) - f(p)}{h} - M(e_i) \right| = 0$$
So $\lim_{h\to 0} \frac{f(p + he_i) - f(p)}{h} = M(e_i)$. But for this to be true, the $j$th row of each side must be equal, so
$$\lim_{h\to 0} \frac{f_j(p + he_i) - f_j(p)}{h} = M_{ji}$$
But the quantity on the left hand side is $\frac{\partial f_j}{\partial x_i}\Big|_p$
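Question 7 Let $f : \mathbb{R}^3 \to \mathbb{R}^2$ be defined by $f(x, y, z) = (x^2 + y + z^3, xy + yz^2)$.
Solution
Hint: The Jacobian Matrix is
$$\begin{bmatrix} \dfrac{\partial f_1}{\partial x} & \dfrac{\partial f_1}{\partial y} & \dfrac{\partial f_1}{\partial z} \\[2ex] \dfrac{\partial f_2}{\partial x} & \dfrac{\partial f_2}{\partial y} & \dfrac{\partial f_2}{\partial z} \end{bmatrix}$$
Hint: As an example, $\frac{\partial f_2}{\partial z} = \frac{\partial}{\partial z}\left(xy + yz^2\right) = 2yz$. Remember that we just differentiate with respect to $z$, treating $x$ and $y$ as constants.
Hint:
$$\begin{aligned}
\frac{\partial f_1}{\partial x} &= 2x & \frac{\partial f_1}{\partial y} &= 1 & \frac{\partial f_1}{\partial z} &= 3z^2 \\
\frac{\partial f_2}{\partial x} &= y & \frac{\partial f_2}{\partial y} &= x + z^2 & \frac{\partial f_2}{\partial z} &= 2yz
\end{aligned}$$
Hint: The Jacobian matrix is $\begin{bmatrix} 2x & 1 & 3z^2 \\ y & x + z^2 & 2yz \end{bmatrix}$
What is the Jacobian Matrix of $f$? This should be a matrix valued function of $x, y, z$.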
The formula for the derivative
$$Df(p)(v) = \lim_{h\to 0} \frac{f(p + hv) - f(p)}{h}$$
looks a lot more familiar than our deﬁnition. You might be asking why we didn’t
take this formula as our deﬁnition of the derivative. After all, we usually take
something that looks like this as our deﬁnition in single variable calculus.
In the following two optional exercises you will ﬁnd out why.
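Find a function $f : \mathbb{R}^2 \to \mathbb{R}$ such that at $(0, 0)$, the limit $M(v) = \lim_{h\to 0} \frac{f((0, 0) + hv) - f(0, 0)}{h}$ exists for every vector $v \in \mathbb{R}^2$, but $M$ is not a linear map.
Hint: Try showing that the function
$$f(x, y) = \begin{cases} \dfrac{x^3}{x^2 + y^2} & \text{if } (x, y) \ne (0, 0) \\ 0 & \text{if } (x, y) = (0, 0) \end{cases}$$
has the desired properties.
Let
$$f(x, y) = \begin{cases} \dfrac{x^3}{x^2 + y^2} & \text{if } (x, y) \ne (0, 0) \\ 0 & \text{if } (x, y) = (0, 0) \end{cases}$$
Then for any vector $v = \begin{bmatrix} a \\ b \end{bmatrix}$, we have
$$\begin{aligned}
M(v) &= \lim_{h\to 0} \frac{f((0, 0) + hv) - f(0, 0)}{h} \\
&= \lim_{h\to 0} \frac{f(ha, hb)}{h} \\
&= \lim_{h\to 0} \frac{h^3a^3}{h(h^2a^2 + h^2b^2)} \\
&= \lim_{h\to 0} \frac{a^3}{a^2 + b^2} \\
&= \frac{a^3}{a^2 + b^2}
\end{aligned}$$
So $M\left(\begin{bmatrix} a \\ b \end{bmatrix}\right) = \frac{a^3}{a^2 + b^2}$. This is certainly not a linear function from $\mathbb{R}^2 \to \mathbb{R}$.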
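One concrete way to see that this $M$ fails to be linear is to test additivity on the two standard basis vectors. The sketch below uses the closed form $M(a, b) = a^3/(a^2 + b^2)$ derived above; the choice to set $M(0, 0) = 0$ and the test vectors are assumptions made for this illustration.

```python
def M(a, b):
    """Directional limit from the text: a^3 / (a^2 + b^2), with M(0, 0) = 0."""
    if a == 0 and b == 0:
        return 0.0
    return a**3 / (a**2 + b**2)

# A linear map would satisfy M(e1 + e2) = M(e1) + M(e2).
sum_of_values = M(1, 0) + M(0, 1)
value_of_sum = M(1, 1)
print(sum_of_values, value_of_sum)
```

The two numbers differ, so $M$ is not additive, even though it does scale correctly: $M(2a, 2b) = 2M(a, b)$.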
So this formula cannot serve as a good definition of the derivative, because it does not have to produce linear functions. What if we require that the function is linear as well? Even then, it is no good:
Find a function $f : \mathbb{R}^2 \to \mathbb{R}$ such that at $(0, 0)$, $M(v) = \lim_{h\to 0} \frac{f((0, 0) + hv) - f(0, 0)}{h}$ exists for every vector $v \in \mathbb{R}^2$ and the function $M$ defined this way is linear, but nevertheless, $f$ is not differentiable at $(0, 0)$.
Hint: Try the function
$$f(x, y) = \begin{cases} 1 & \text{if } y = x^2 \text{ and } (x, y) \ne (0, 0) \\ 0 & \text{else} \end{cases}$$
Let
$$f(x, y) = \begin{cases} 1 & \text{if } y = x^2 \text{ and } (x, y) \ne (0, 0) \\ 0 & \text{else} \end{cases}$$
Let $v = \begin{bmatrix} a \\ b \end{bmatrix}$. Then
$$M(v) = \lim_{h\to 0} \frac{f((0, 0) + hv) - f(0, 0)}{h} = \lim_{h\to 0} \frac{f(ha, hb)}{h}$$
Now, the intersection of the line $t \mapsto (ta, tb)$ with the parabola $y = x^2$ happens when $tb = t^2a^2$, i.e. when $t = 0$ or (if $a \ne 0$) $t = \frac{b}{a^2}$. So as long as we choose $h \ne 0$ with $|h|$ smaller than $\frac{|b|}{a^2}$, we know that $(ha, hb)$ is not on the parabola $y = x^2$ (if $a = 0$ or $b = 0$ the only intersection is at $t = 0$, so any $h \ne 0$ works). Hence $f(ha, hb) = 0$ for small enough $h$.
Thus $M(v) = 0$ for all $v \in \mathbb{R}^2$.
This definitely is a linear function, but $f$ is not differentiable at $(0, 0)$ using our definition, since
$$\lim_{\vec{h}\to 0} \frac{|f((0, 0) + \vec{h}) - f(0, 0) - M(\vec{h})|}{|\vec{h}|} = \lim_{\vec{h}\to 0} \frac{|f(\vec{h})|}{|\vec{h}|}$$
does not exist, since taking $\vec{h}$ on the parabola $y = x^2$ yields the limit $\lim_{t\to 0} \frac{|f(t, t^2)|}{|(t, t^2)|} = \lim_{t\to 0} \frac{1}{|t|\sqrt{1 + t^2}}$, which diverges to $\infty$.
The gradient is a vector version of the derivative.
In this section, we will focus on gaining a more "geometric" understanding of derivatives of functions $f : \mathbb{R}^n \to \mathbb{R}$.
If $f$ is such a function, the derivative $Df(p) : \mathbb{R}^n \to \mathbb{R}$ is a covector. So, by the definition of the dot product, we can reinterpret that derivative as the dot product with the fixed vector $Df(p)^\top$.
Definition 1 The gradient of a differentiable function $f : \mathbb{R}^n \to \mathbb{R}$ is defined by $\nabla f(p) = Df(p)^\top$. Equivalently, $\nabla f$ is the (unique) vector which makes the following equation true for all $v \in \mathbb{R}^n$:
$$\nabla f(p) \cdot v = Df(p)(v)$$
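Question 2
Solution
Hint: $\nabla f(x, y, z) = \begin{bmatrix} \dfrac{\partial}{\partial x}\sin(xyz^2) \\[1.5ex] \dfrac{\partial}{\partial y}\sin(xyz^2) \\[1.5ex] \dfrac{\partial}{\partial z}\sin(xyz^2) \end{bmatrix}$
Hint: $\nabla f(x, y, z) = \begin{bmatrix} yz^2\cos(xyz^2) \\ xz^2\cos(xyz^2) \\ 2xyz\cos(xyz^2) \end{bmatrix}$
If $f : \mathbb{R}^3 \to \mathbb{R}$ is defined by $f(x, y, z) = \sin(xyz^2)$, what is $\nabla f(x, y, z)$?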
We can now use what we know about the geometry of the dot product to
understand some interesting things about the derivative.
In a sentence, how does the vector $\frac{v}{|v|}$ relate to the vector $v$?
$\frac{v}{|v|}$ is the unit vector which points in the same direction as $v$.
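Theorem 3 Let $f : \mathbb{R}^n \to \mathbb{R}$, and $p \in \mathbb{R}^n$. Let $\eta = \frac{\nabla f(p)}{|\nabla f(p)|}$. If $|v| = 1$, then $Df(p)(v) \le Df(p)(\eta)$.
More geometrically, this theorem says that $\nabla f(p)$ points in the direction of "greatest increase" for the function $f$. More poetically, $\nabla f$ always points "straight up the mountain".
Prove this theorem.
$$\begin{aligned}
Df(p)(v) &= \nabla f(p) \cdot v \\
&\le |v||\nabla f(p)| \quad \text{by Cauchy-Schwarz} \\
&= |\nabla f(p)| \quad \text{since } |v| = 1
\end{aligned}$$
On the other hand,
$$\begin{aligned}
Df\big|_p(\eta) &= \nabla f(p) \cdot \eta \\
&= \nabla f(p) \cdot \frac{\nabla f(p)}{|\nabla f(p)|} \\
&= \frac{|\nabla f(p)|^2}{|\nabla f(p)|} \\
&= |\nabla f(p)|
\end{aligned}$$
The inequality follows.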
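Theorem 3 can be illustrated by brute force in the plane: sample many unit vectors, compute the directional derivative for each, and compare the largest value with the one obtained in the gradient direction. In the sketch below the gradient value (3, 4) is a hypothetical example, not taken from the text, and the one-degree sampling grid is an arbitrary choice.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

gradient = [3.0, 4.0]   # hypothetical value of the gradient at some point p
norm = math.hypot(*gradient)
eta = [g / norm for g in gradient]   # unit vector in the gradient direction

# Directional derivatives Df(p)(v) = gradient . v over a fan of unit vectors.
samples = []
for degrees in range(360):
    angle = math.radians(degrees)
    v = [math.cos(angle), math.sin(angle)]
    samples.append(dot(gradient, v))

best = max(samples)
print(best, dot(gradient, eta), norm)
```

No sampled direction beats the gradient direction, and the value attained there is the length of the gradient, as in the proof.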
One of the ways that we learned to visualize functions was via contour plots. We will see that there is a very nice relationship between contours and gradient vectors.
Let $f : \mathbb{R}^2 \to \mathbb{R}$, and consider the contour $\mathcal{C} = \{(x, y) \in \mathbb{R}^2 : f(x, y) = c\}$ for some $c \in \mathbb{R}$.
Question 4 Let $v$ be a tangent vector to $\mathcal{C}$ at a point $p \in \mathcal{C}$
Solution
Hint: Since $v$ is pointing in the direction of a level curve of $f$, $f$ should not be changing as you move in the direction $v$
Hint: So $Df(p)(v) = 0$.
$Df(p)(v) = 0$
Note: You should be able to answer this from an intuitive point of view, but we will not develop the formal tool to prove this (the implicit function theorem¹) in this course.
In general, if $f : \mathbb{R}^n \to \mathbb{R}$, and the contour $\mathcal{C} = \{p \in \mathbb{R}^n : f(p) = c\}$ for some $c \in \mathbb{R}$, then for every tangent vector $v$ to $\mathcal{C}$, we will have $Df\big|_p(v) = 0$. Intuitively this is true because moving a small amount in the direction of $v$ will not change the value of the function much, since you are staying as close as possible to the contour where the function is constant. Accepting this, we have the following:
Theorem 5 If $f : \mathbb{R}^n \to \mathbb{R}$, and the contour $\mathcal{C} = \{p \in \mathbb{R}^n : f(p) = c\}$ for some $c \in \mathbb{R}$, then for every tangent vector $v$ to $\mathcal{C}$, we will have $\nabla f(p) \cdot v = 0$. In other words, $\nabla f(p)$ is perpendicular to the contour.
Question 6 Write an equation for the tangent plane to the surface $x^2 + xy + 4z^2 = 1$ at the point $(1, 0, 0)$.
Solution
Hint: Our general strategy will be to find a vector which is perpendicular to the plane. Writing down what that means in terms of dot products should yield the equation.
1
http://en.wikipedia.org/wiki/Implicit_function_theorem
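Hint: This surface is a level surface of $f(x, y, z) = x^2 + xy + 4z^2$, namely $f(x, y, z) = 1$.
Hint: $\nabla f = \begin{bmatrix} \dfrac{\partial f}{\partial x} \\[1.5ex] \dfrac{\partial f}{\partial y} \\[1.5ex] \dfrac{\partial f}{\partial z} \end{bmatrix}$
Hint: $\nabla f = \begin{bmatrix} 2x + y \\ x \\ 8z \end{bmatrix}$
Hint: $\nabla f(1, 0, 0) = \begin{bmatrix} 2 \\ 1 \\ 0 \end{bmatrix}$
Hint: For a point $(x, y, z)$ to be in the tangent plane, we would need that $(x, y, z) - (1, 0, 0)$ is perpendicular to $\nabla f(1, 0, 0)$.
Hint: So we need $\begin{bmatrix} x - 1 \\ y \\ z \end{bmatrix} \cdot \begin{bmatrix} 2 \\ 1 \\ 0 \end{bmatrix} = 0$
Hint: This says that the equation of the plane is $2x - 2 + y = 0$
$2x + y - 2 = 0$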
31 One forms
One forms are covector ﬁelds.
In this section we just want to introduce you to some new notation and terminol-
ogy which will be helpful to keep in mind for the next course, which will cover
multivariable integration theory.
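As we observed in the last section, the derivative of a function $f : \mathbb{R}^n \to \mathbb{R}$ assigns a covector to each point in $\mathbb{R}^n$. In particular, $Df\big|_p : \mathbb{R}^n \to \mathbb{R}$ is the covector whose matrix is the row
$$\begin{bmatrix} \dfrac{\partial f}{\partial x_1}\Big|_p & \dfrac{\partial f}{\partial x_2}\Big|_p & \cdots & \dfrac{\partial f}{\partial x_n}\Big|_p \end{bmatrix}$$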
Definition 1 A covector field, also known as a differential 1-form, is a function
which takes points in $\mathbb{R}^n$ and returns a covector on $\mathbb{R}^n$. In other words, it is a
covector valued function. We can always write any covector field $\omega$ as
$$\omega(x) = \begin{bmatrix} f_1(x) & f_2(x) & \cdots & f_n(x) \end{bmatrix}$$
for $n$ functions $f_i : \mathbb{R}^n \to \mathbb{R}$.
The derivative of a function $f : \mathbb{R}^n \to \mathbb{R}$ is the quintessential example of a
1-form on $\mathbb{R}^n$.
Question 2 Let $f : \mathbb{R}^3 \to \mathbb{R}$ be the function $f(x, y, z) = y$.
Solution
Hint: The Jacobian of $f$ is $\begin{bmatrix} 0 & 1 & 0 \end{bmatrix}$ everywhere.
What is the matrix for $Df$ at the point $(a, b, c)$?
Generalizing the result of the previous question, we see that if $\pi_i : \mathbb{R}^n \to \mathbb{R}$ is
defined by $\pi_i(x_1, x_2, \ldots, x_n) = x_i$, then $D(\pi_i)$ will be the row
$\begin{bmatrix} 0 & 0 & \cdots & 0 & 1 & 0 & \cdots & 0 \end{bmatrix}$,
where the 1 appears in the $i$th slot.
We introduce the notation $dx_i$ for the covector field $D(\pi_i)$. So we can rewrite
any covector field $\omega(x) = \begin{bmatrix} f_1(x) & f_2(x) & \cdots & f_n(x) \end{bmatrix}$ for $n$ functions $f_i : \mathbb{R}^n \to \mathbb{R}$
in the form $\omega(x) = f_1(x)\,dx_1 + f_2(x)\,dx_2 + \cdots + f_n(x)\,dx_n$.
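To make the notation concrete, here is a minimal sketch of a covector field in Python, following the convention used later in these notes that a covector is stored as a 1 × n matrix. The particular field omega = y dx + x dy and the helper apply_covector are assumptions chosen for illustration.

```python
def omega(point):
    # hypothetical covector field omega = y dx + x dy on R^2,
    # stored as a 1 x 2 matrix (a list containing one row)
    x, y = point
    return [[y, x]]

def apply_covector(row_matrix, v):
    # a covector eats a vector and returns a number: row times column
    row = row_matrix[0]
    return sum(a * b for a, b in zip(row, v))

dx = lambda point: [[1, 0]]   # the constant covector field D(pi_1)
dy = lambda point: [[0, 1]]   # the constant covector field D(pi_2)

p = [7, 2]
v = [3, 5]
value = apply_covector(omega(p), v)
# the same number, assembled as f_1(p) dx(v) + f_2(p) dy(v)
combo = p[1] * apply_covector(dx(p), v) + p[0] * apply_covector(dy(p), v)
```

Both value and combo come out to 2·3 + 7·5 = 41, which is the statement that omega = y dx + x dy.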
It turns out that 1-forms, not functions, are the appropriate objects to integrate
along curves in R
n
. The sequel to this course will focus on the integration of
diﬀerential forms: we will not touch on it in this course.
32 Numerical integration
Integrate a covector ﬁeld.
Exercise 1 Suppose we have a one-form expressed as a Python function, e.g.,
omega (which we will often write as ω), which takes a point (expressed as a list)
and returns a 1 × n matrix. For example, perhaps we have that omega([7,2,5])
is [[5,3,2]].
Let's only consider the case n = 2, and suppose that ω is the derivative of
some mystery function $f : \mathbb{R}^2 \to \mathbb{R}$. If we have access to omega and we know that
f(3, 2) = 5, can we approximate f(4, 3)?
We can take a path from (3, 2) to (4, 3), and break it up into small pieces; on
each piece, we can use the derivative to approximate how a small change to the
input will affect the output. And repeat.
Do this in Python.
Solution
Python
# suppose the derivative of f is omega, and f(3,2) = 5.
# so omega([3,2]) is (perhaps) [[-4,3]].
#
# integrate(omega) is an approximation to the value of f at (4,3).
#
def integrate(omega):
    return # the value

def validator():
    return abs(integrate( lambda p: [[2*p[0] - p[1], -p[0] + 1]] ) - 7.0) < 0.05
How did you move from (3, 2) to (4, 3)? Did it matter which path you walked
along?
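One possible approach, offered only as a sketch: walk along the straight line from (3, 2) to (4, 3) in many small steps, adding up ω applied to each step. The keyword arguments start, end, f_start and steps are assumptions added here so the function can still be called as integrate(omega); other paths would serve equally well when ω really is a derivative.

```python
def integrate(omega, start=(3.0, 2.0), end=(4.0, 3.0), f_start=5.0, steps=1000):
    # one small displacement along the straight line from start to end
    step = [(e - s) / steps for s, e in zip(start, end)]
    point = list(start)
    value = f_start
    for _ in range(steps):
        row = omega(point)[0]
        # the derivative estimates how much f changes over this small step
        value += sum(a * h for a, h in zip(row, step))
        point = [x + h for x, h in zip(point, step)]
    return value

approx = integrate(lambda p: [[2*p[0] - p[1], -p[0] + 1]])
```

For the one-form used in the validator, approx lands within the required tolerance of 7.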
33 Python
We approximate derivatives in Python.
There are two different perspectives on the derivative available to us.
Suppose $f : \mathbb{R}^n \to \mathbb{R}^m$ is a differentiable function, and we have a point $p \in \mathbb{R}^n$.
The first perspective on the derivative is the total derivative, which is the linear
map $Df(p)$ which sends the vector $\vec{v}$ to $Df(p)(\vec{v})$, recording how much an
infinitesimal change in the $\vec{v}$ direction in $\mathbb{R}^n$ will affect the output of $f$ in $\mathbb{R}^m$.
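The second perspective on the derivative is the Jacobian matrix, which is the
matrix of partials given by
$$\begin{bmatrix}
\dfrac{\partial f_1}{\partial x_1}(p) & \dfrac{\partial f_1}{\partial x_2}(p) & \cdots & \dfrac{\partial f_1}{\partial x_n}(p) \\
\dfrac{\partial f_2}{\partial x_1}(p) & \dfrac{\partial f_2}{\partial x_2}(p) & \cdots & \dfrac{\partial f_2}{\partial x_n}(p) \\
\vdots & \vdots & \ddots & \vdots \\
\dfrac{\partial f_m}{\partial x_1}(p) & \dfrac{\partial f_m}{\partial x_2}(p) & \cdots & \dfrac{\partial f_m}{\partial x_n}(p)
\end{bmatrix}.$$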
Observation 1 The Jacobian matrix is the matrix representing the linear map
Df(p).
This observation can be “seen” with some Python code.
34 Derivative
Code the total derivative.
Exercise 1 Let epsilon be a small, but positive number. Suppose $f : \mathbb{R} \to \mathbb{R}$
has been coded as a Python function f which takes a real number and returns a
real number. Seeing as
$$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h},$$
can you find a Python function which approximates $f'(x)$?
Given a Python function f which takes a real number and returns a real number,
we can approximate $f'(x)$ by using epsilon. Write a Python function derivative
which takes a function f and returns an approximation to its derivative.
Solution
Hint: To approximate this, use (f(x+epsilon) - f(x))/epsilon.
Python
epsilon = 0.0001
def derivative(f):
    # forward difference quotient with step size epsilon
    def df(x): return (f(x + epsilon) - f(x)) / epsilon
    return df


def validator():
    df = derivative(lambda x: 1+x**2+x**3)
    if abs(df(2) - 16) > 0.01:
        return False
    df = derivative(lambda x: (1+x)**4)
    if abs(df(-2.642) - -17.708405152) > 0.01:
        return False
    return True
Great work! Now let’s do this in a multivariable setting.
A function $f : \mathbb{R}^n \to \mathbb{R}^m$ should be stored as a Python function which takes a
list with n entries and returns a list with m entries.
Solution Implement f(x, y, z) = (xy, x + z) as a Python function.
Hint: You can get away with
def f(v):
    return [v[0]*v[1], v[0] + v[2]]
Python
def f(v):
    x = v[0]
    y = v[1]
    z = v[2]
    return [x*y, x + z]

def validator():
    if f([3,2,7])[0] != 6:
        return False
    if f([3,2,7])[1] != 10:
        return False
    return True
Now we provide you with a function add_vector which takes two vectors v and
w and returns v + w, and a function scale_vector which takes a scalar c and a
vector v and returns the vector cv. Finally, vector_length(v) computes the length
of the vector v.
Given all of this preamble, write a function D which takes a Python function
$f : \mathbb{R}^n \to \mathbb{R}^m$ and returns the function $Df : \mathbb{R}^n \to \mathrm{Lin}(\mathbb{R}^n, \mathbb{R}^m)$ which takes a
point $p \in \mathbb{R}^n$ and returns (an approximation to) the linear map $Df(p) : \mathbb{R}^n \to \mathbb{R}^m$.
Solution
Hint:
def D(ff):
    def Df(p):
        f = ff # band-aid over a Python interpreter bug
        def L(v):
            return scale_vector( 1/(epsilon),
                add_vector( f(add_vector(p, scale_vector(epsilon, v))),
                            scale_vector(-1, f(p)) ) )
        return L
    return Df
Python
epsilon = 0.0001
n = 3
m = 2
def add_vector(v,w):
    return [sum(v) for v in zip(v,w)]
def scale_vector(c,v):
    return [c*x for x in v]
def vector_length(v):
    return sum([x**2 for x in v])**0.5
def D(f):
    def Df(p):
        def L(v):
            # compare f(p + epsilon*v) to f(p), then rescale by 1/epsilon
            return scale_vector( 1/(epsilon),
                add_vector( f(add_vector(p, scale_vector(epsilon, v))),
                            scale_vector(-1, f(p)) ) )
        return L
    return Df


def validator():
    # f(x,y,z) = (3*x^2 + 2*x*y*z, x*y^3*z^2)
    Df = D(lambda v: [3*v[0]*v[0] + 2*v[0]*v[1]*v[2], v[0]*(v[1]**3)*(v[2]**2)])
    Dfp = Df([2,3,4])
    Dfpv = Dfp([3,2,1])
    if abs(Dfpv[0] - 152) > 1.0:
        return False
    if abs(Dfpv[1] - 3456) > 10.0:
        return False
    return True
Note that Df(p) is a linear map, so we can represent that linear map as a matrix.
We do so in the next activity.
35 Jacobian matrix
Code the Jacobian matrix.
In the previous activity, we wrote some code to compute D. Armed with this, we
can take a function $f : \mathbb{R}^n \to \mathbb{R}^m$ and a point $p \in \mathbb{R}^n$ and compute $Df(p)$, the
linear map which describes, infinitesimally, how wiggling the input to f will affect
its output.
Assuming f is differentiable, we have that $Df(p)$ is a linear map, so we can
write down a matrix for it. Let's do so now.
Exercise 1 To get started, we begin by computing partial derivatives in Python.
To make things easy, let's differentiate functions like
def fi(v):
    return v[0] * (v[1]**2)
In other words, our functions will send an n-tuple to a single real number. In this
case where $f_i(x, y) = xy^2$, we should have that partial(fi,1)([2,3]) is close to
12, since
$$\frac{\partial}{\partial y}\left(xy^2\right) = 2xy,$$
and so at the point (2, 3), the derivative is $2 \cdot 2 \cdot 3 = 12$.
Solution
Hint:
def partial(fi,j):
    def derivative(p):
        p_shifted = p[:]
        p_shifted[j] += epsilon
        return (fi(p_shifted) - fi(p))/epsilon
    return derivative
Python
epsilon = 0.0001
n = 2
#
# fi is a function from R^n to R
def partial(fi,j):
    def derivative(p):
        # the partial derivative of fi in the j-th coordinate at p
        p_shifted = p[:]
        p_shifted[j] += epsilon
        return (fi(p_shifted) - fi(p))/epsilon
    return derivative
#
# this should be close to 12
print(partial(lambda v: v[0] * v[1]**2, 1)([2,3]))

def validator():
    return abs(partial(lambda v: v[0]**2 * v[1]**3, 0)([7,2]) - 112) < 0.01
If we have a function $f : \mathbb{R}^n \to \mathbb{R}^m$, we'll encode it as a Python function which
takes a list with n entries, and returns a list with m entries. Let's write a Python
helper for pulling out just the ith component of the output.
Solution
Python
# if f is a function from R^n to R^m,
# then component(f,i) is the function R^n to R,
# which just looks at the i-th entry of the output
#
def component(f,i):
    return lambda p: f(p)[i] # the i-th component of the output

def validator():
    return component(lambda v: [v[0],v[1]],1)([1,17]) == 17
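Now we put it all together. For a function $f : \mathbb{R}^n \to \mathbb{R}^m$, the Jacobian matrix
is given by
$$\begin{bmatrix}
\dfrac{\partial f_1}{\partial x_1}(p) & \dfrac{\partial f_1}{\partial x_2}(p) & \cdots & \dfrac{\partial f_1}{\partial x_n}(p) \\
\dfrac{\partial f_2}{\partial x_1}(p) & \dfrac{\partial f_2}{\partial x_2}(p) & \cdots & \dfrac{\partial f_2}{\partial x_n}(p) \\
\vdots & \vdots & \ddots & \vdots \\
\dfrac{\partial f_m}{\partial x_1}(p) & \dfrac{\partial f_m}{\partial x_2}(p) & \cdots & \dfrac{\partial f_m}{\partial x_n}(p)
\end{bmatrix}$$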
Solution Implement the function jacobian which takes a function $f : \mathbb{R}^n \to \mathbb{R}^m$ and
a point $p \in \mathbb{R}^n$, and returns the Jacobian matrix of f at the point p.
Hint: You can write this matrix as
[[partial(component(f,i),j)(p) for j in range(n)] for i in range(m)]
Python
epsilon = 0.0001
n = 3 # the dimension of the domain
m = 2 # the dimension of the codomain
def component(f,i):
    return lambda p: f(p)[i]
def partial(fi,j):
    def derivative(p):
        p_shifted = p[:]
        p_shifted[j] += epsilon
        return (fi(p_shifted) - fi(p))/epsilon
    return derivative
#
# f is a function from R^n to R^m
# jacobian(f,p) is its Jacobian matrix at the point p
def jacobian(f,p):
    # row i holds the partials of the i-th component; len(p) is the domain dimension
    return [[partial(component(f,i),j)(p) for j in range(len(p))] for i in range(len(f(p)))]

def validator():
    m = jacobian(lambda v: [v[0]**2, (v[1]**3)*(v[0]**2)], [3,7])
    return abs(m[1][0] - 2058) < 0.1
36 Relationship
Relate the Jacobian matrix and the total derivative.
In the previous activities, we wrote some code to compute Df(p) and the Jacobian
matrix of f at the point p. In this activity, we observe that these are related.
Exercise 1 Try running the code below.
Solution
Python
epsilon = 0.0001
n = 3 # the dimension of the domain
m = 2 # the dimension of the codomain
def component(f,i):
    return lambda p: f(p)[i]
def partial(fi,j):
    def derivative(p):
        p_shifted = p[:]
        p_shifted[j] += epsilon
        return (fi(p_shifted) - fi(p))/epsilon
    return derivative
def jacobian(f,p):
    return [[partial(component(f,i),j)(p) for j in range(n)] for i in range(m)]
def add_vector(v,w):
    return [sum(v) for v in zip(v,w)]
def scale_vector(c,v):
    return [c*x for x in v]
def vector_length(v):
    return sum([x**2 for x in v])**0.5
def D(ff):
    def Df(p):
        f = ff
        def L(v):
            return scale_vector( 1/(epsilon),
                add_vector( f(add_vector(p, scale_vector(epsilon, v))),
                            scale_vector(-1, f(p)) ) )
        return L
    return Df
#
def dot_product(v,w):
    return sum([x[0] * x[1] for x in zip(v,w)])
def apply_matrix(m,v):
    return [dot_product(row,v) for row in m]
#
f = lambda p: [p[0] + p[0]*p[1], p[1] * p[2]**2]
p = [1,2,0]
v = [2,2,1]
print(apply_matrix(jacobian(f,p),v))
print(D(f)(p)(v))

def validator():
    # I just want them to try running this
    return True
In this case, we set $f(x, y, z) = (x + xy, yz^2)$, and we computed $Df(p)(\vec{v})$ in two
different ways. The two different methods were close, but not exactly the same.
Why are they not exactly the same? The D function is computing $Df(p)(\vec{v})$ by
comparing $f(p + \epsilon\vec{v})$ to $f(p)$.
In contrast, the jacobian function computes $Df(p)$ by computing partial derivatives,
so we are not actually computing $f(p + \epsilon\vec{v})$ in that case, but rather $f(p + \epsilon\vec{e}_i)$
for various i's.
That the way f changes when wiggling each component separately has anything
to do with what happens to f when we wiggle the inputs together boils down to
the assumption that f be differentiable. This relationship is true infinitesimally,
but here we're working with $\epsilon = 0.0001$, so it is not surprising that it is not true
on the nose.
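To see the gap between the two methods shrink, one can repeat the comparison for several step sizes. The following is a self-contained sketch; the helper names via_D and via_jacobian and the chosen step sizes are assumptions for illustration, not part of the course code.

```python
def f(p):
    return [p[0] + p[0]*p[1], p[1] * p[2]**2]

def via_D(f, p, v, epsilon):
    # compare f(p + epsilon*v) to f(p), as the D function does
    shifted = [pi + epsilon * vi for pi, vi in zip(p, v)]
    return [(a - b) / epsilon for a, b in zip(f(shifted), f(p))]

def via_jacobian(f, p, v, epsilon):
    # wiggle one coordinate at a time, then combine the columns with weights v[j]
    out = [0.0] * len(f(p))
    for j in range(len(p)):
        shifted = list(p)
        shifted[j] += epsilon
        column = [(a - b) / epsilon for a, b in zip(f(shifted), f(p))]
        out = [o + c * v[j] for o, c in zip(out, column)]
    return out

p = [1, 2, 0]
v = [2, 2, 1]
gaps = []
for epsilon in [0.1, 0.01, 0.001]:
    a = via_D(f, p, v, epsilon)
    b = via_jacobian(f, p, v, epsilon)
    gaps.append(max(abs(x - y) for x, y in zip(a, b)))
```

The entries of gaps decrease as epsilon decreases, consistent with the two computations agreeing in the limit.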
37 The Chain Rule
Diﬀerentiating a composition of functions is the same as composing the derivatives
of the functions.
The chain rule of single variable calculus tells you how the derivative of a compo-
sition of functions relates to the derivatives of each of the original functions. The
chain rule of multivariable calculus will work analogously.
Question 1 Let $f : \mathbb{R}^2 \to \mathbb{R}^3$ and $g : \mathbb{R}^3 \to \mathbb{R}$. The only things you know about f
are that $f(2, 3) = (3, 0, 0)$ and $Df(2, 3)$ has matrix $\begin{bmatrix} 4 & -1 \\ 2 & 3 \\ 0 & 1 \end{bmatrix}$. The only things you
know about g are that $g(3, 0, 0) = 4$ and $Dg(3, 0, 0)$ has matrix $\begin{bmatrix} 4 & 5 & 6 \end{bmatrix}$.
Solution
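Hint: $g(f(2 + a, 3 + b)) \approx g\left(f(2, 3) + Df(2, 3)\begin{bmatrix} a \\ b \end{bmatrix}\right)$ using the linear approximation
to f at (2, 3)
Hint:
$$\begin{aligned}
g\left(f(2, 3) + Df(2, 3)\begin{bmatrix} a \\ b \end{bmatrix}\right)
&= g\left((3, 0, 0) + \begin{bmatrix} 4 & -1 \\ 2 & 3 \\ 0 & 1 \end{bmatrix}\begin{bmatrix} a \\ b \end{bmatrix}\right) \\
&= g\left((3, 0, 0) + \begin{bmatrix} 4a - b \\ 2a + 3b \\ b \end{bmatrix}\right) \\
&\approx g(3, 0, 0) + Dg(3, 0, 0)\begin{bmatrix} 4a - b \\ 2a + 3b \\ b \end{bmatrix} \quad \text{using the linear approximation to } g \text{ at } (3, 0, 0) \\
&= 4 + \begin{bmatrix} 4 & 5 & 6 \end{bmatrix}\begin{bmatrix} 4a - b \\ 2a + 3b \\ b \end{bmatrix} \\
&= 4 + (16a - 4b) + (10a + 15b) + (6b) \\
&= 4 + 26a + 17b
\end{aligned}$$
Assuming $a, b \in \mathbb{R}$ are small, $(g \circ f)(2 + a, 3 + b) \approx 4 + 26a + 17b$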
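Solution So the matrix of $D(g \circ f)(2, 3)$ is
Hint: $Dg\big|_{f(2,3)} \circ Df\big|_{(2,3)}$ has matrix
$$\begin{bmatrix} 4 & 5 & 6 \end{bmatrix}\begin{bmatrix} 4 & -1 \\ 2 & 3 \\ 0 & 1 \end{bmatrix}
= \begin{bmatrix} 4(4) + 5(2) + 6(0) & 4(-1) + 5(3) + 6(1) \end{bmatrix}
= \begin{bmatrix} 26 & 17 \end{bmatrix}$$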
Notice that $D(g \circ f)(2, 3)$ is the same as $Dg\big|_{f(2,3)} \circ Df\big|_{(2,3)}$! You should really
check this to make sure. Look at the hint to see how to do the composition if you
need help.
The heuristic approximations in the last question lead us to expect the following
theorem:
Theorem 2 Let $f : \mathbb{R}^n \to \mathbb{R}^m$ and $g : \mathbb{R}^m \to \mathbb{R}^k$ be differentiable functions, and
$p \in \mathbb{R}^n$. Then
$$D(g \circ f)(p) = Dg(f(p)) \circ Df(p)$$
In other words, the derivative of a composition of functions is the composition
of the derivatives of the functions.
The trickiest part of the above theorem is remembering that you need to apply
Dg at the point f(p).
The proof of this theorem is a little bit beyond what we want to require you to
think about in this course, but the essential idea of the proof is just that
$$(g \circ f)(p + \vec{h}) \approx g(f(p) + Df(p)(\vec{h})) \approx g(f(p)) + Dg(f(p))(Df(p)(\vec{h}))$$
You should understand this essential idea, even if you do not understand the
full proof. We cover the full proof in an optional section after this one.
Question 3 Let $f : \mathbb{R}^2 \to \mathbb{R}^2$ be defined by $f(r, t) = (r\cos(t), r\sin(t))$. Let
$g : \mathbb{R}^2 \to \mathbb{R}$ be defined by $g(x, y) = x^2 + y^2$.
Don't let the choice of variable names scare you.
Solution
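Hint: $\begin{bmatrix} \frac{\partial}{\partial r} r\cos(t) & \frac{\partial}{\partial t} r\cos(t) \\ \frac{\partial}{\partial r} r\sin(t) & \frac{\partial}{\partial t} r\sin(t) \end{bmatrix}$
Hint: $\begin{bmatrix} \cos(t) & -r\sin(t) \\ \sin(t) & r\cos(t) \end{bmatrix}$
What is the Jacobian of f at (r, t)?
Solution
Hint: $\begin{bmatrix} \frac{\partial}{\partial x}(x^2 + y^2) & \frac{\partial}{\partial y}(x^2 + y^2) \end{bmatrix}$
Hint: $\begin{bmatrix} 2x & 2y \end{bmatrix}$
What is the Jacobian of g at (x, y)?
Solution
Hint: Just plug $(r\cos(t), r\sin(t))$ into the formula for the Jacobian of g you obtained
above.
$\begin{bmatrix} 2r\cos(t) & 2r\sin(t) \end{bmatrix}$
What is the matrix of $Dg(f(r, t))$?
Solution
Hint:
$$\begin{aligned}
\begin{bmatrix} 2r\cos(t) & 2r\sin(t) \end{bmatrix}\begin{bmatrix} \cos(t) & -r\sin(t) \\ \sin(t) & r\cos(t) \end{bmatrix}
&= \begin{bmatrix} 2r\cos^2(t) + 2r\sin^2(t) & -2r^2\cos(t)\sin(t) + 2r^2\sin(t)\cos(t) \end{bmatrix} \\
&= \begin{bmatrix} 2r & 0 \end{bmatrix}
\end{aligned}$$
What is the matrix of $Dg(f(r, t)) \circ Df(r, t)$?
Solution
Hint:
$$\begin{aligned}
(g \circ f)(r, t) &= g(r\cos(t), r\sin(t)) \\
&= (r\cos(t))^2 + (r\sin(t))^2 \\
&= r^2(\cos^2(t) + \sin^2(t)) \\
&= r^2
\end{aligned}$$
Compute the composite directly: $(g \circ f)(r, t) = r^2$
Solution Compute $D(g \circ f)(r, t)$ directly from the formula for $(g \circ f)$.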
This example has demonstrated the chain rule in action! Computing the deriva-
tive of the composite was the same as composing the derivatives.
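The same comparison can be run numerically. The sketch below approximates both sides of the chain rule at one sample point with finite differences; the helper names jacobian and matmul, the step size, and the sample point (2, 0.7) are assumptions made for illustration.

```python
import math

epsilon = 1e-6

def f(p):
    r, t = p
    return [r * math.cos(t), r * math.sin(t)]

def g(q):
    x, y = q
    return [x**2 + y**2]

def jacobian(func, p):
    base = func(p)
    columns = []
    for j in range(len(p)):
        shifted = list(p)
        shifted[j] += epsilon
        columns.append([(a - b) / epsilon for a, b in zip(func(shifted), base)])
    # columns[j][i] approximates d(func_i)/d(x_j); transpose into rows
    return [[columns[j][i] for j in range(len(p))] for i in range(len(base))]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

p = [2.0, 0.7]
composite = jacobian(lambda q: g(f(q)), p)            # D(g o f)(p)
product = matmul(jacobian(g, f(p)), jacobian(f, p))   # Dg(f(p)) composed with Df(p)
```

Both composite and product should be close to the row [2r, 0] = [4, 0], up to the finite-difference error.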
The product rule from single variable calculus can be proved by invoking the
chain rule for multivariable functions. We supply a basic outline of the proof below:
can you complete this outline to give a full proof?
Let $f, g : \mathbb{R} \to \mathbb{R}$ be two differentiable functions. Let $p : \mathbb{R} \to \mathbb{R}$ be defined by
$p(t) = f(t)g(t)$. Let $\mathrm{Both} : \mathbb{R} \to \mathbb{R}^2$ be defined by $\mathrm{Both}(t) = (f(t), g(t))$. Finally let
$\mathrm{Multiply} : \mathbb{R}^2 \to \mathbb{R}$ be defined by $\mathrm{Multiply}(x, y) = xy$. Then we can differentiate
$\mathrm{Multiply}(\mathrm{Both}(t)) = p(t)$ using the multivariable chain rule. This should result in
the product rule.
$D(\mathrm{Both})$ has the matrix $\begin{bmatrix} f'(t) \\ g'(t) \end{bmatrix}$ at the point $t \in \mathbb{R}$.
$D(\mathrm{Multiply})$ has the matrix $\begin{bmatrix} y & x \end{bmatrix}$ at the point $(x, y) \in \mathbb{R}^2$.
So $p'(t) = D(\mathrm{Multiply})\big|_{\mathrm{Both}(t)} \circ D(\mathrm{Both})\big|_t$ by the multivariable chain rule.
So
$$\begin{aligned}
p'(t) &= D(\mathrm{Multiply})\big|_{\mathrm{Both}(t)} \circ D(\mathrm{Both})\big|_t \\
&= D(\mathrm{Multiply})\big|_{(f(t), g(t))} \circ D(\mathrm{Both})\big|_t \\
&= \begin{bmatrix} g(t) & f(t) \end{bmatrix}\begin{bmatrix} f'(t) \\ g'(t) \end{bmatrix} \\
&= g(t)f'(t) + f(t)g'(t)
\end{aligned}$$
This is the product rule from single variable differential calculus.
38 Proof of the chain rule
This section is optional
Before we begin the proof of the chain rule, we need to introduce a new piece
of machinery:
Let $L : \mathbb{R}^n \to \mathbb{R}^m$ be a linear map. Let $S^{n-1} = \{\vec{v} \in \mathbb{R}^n : |\vec{v}| = 1\}$ be the
unit sphere in $\mathbb{R}^n$. Then there is a function $F : S^{n-1} \to \mathbb{R}$ defined on this sphere
which takes a vector and returns the length of its image under L, $F(\vec{v}) = |L(\vec{v})|$.
Definition 1 The maximum value of the function F is called the operator
norm of L, and is written $|L|_{\mathrm{op}}$.
The fact that the operator norm of a linear transformation exists is a kind of
deep (for this course) piece of analysis. It follows from the fact that the sphere is
compact¹, and continuous functions on compact spaces must achieve a maximum
value. For example, every continuous function on the closed interval $[0, 1] \subset \mathbb{R}$
has a maximum, although not every continuous function on the open interval (0, 1)
does. The essential difference between these two intervals is that the first one is
compact, while the second one is not.
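The essential property of the operator norm is that $|L(\vec{v})| \le |L|_{\mathrm{op}}\,|\vec{v}|$ for each vector
$\vec{v} \in \mathbb{R}^n$. This is true because, for $\vec{v} \ne \vec{0}$, we have $\left|L\!\left(\frac{\vec{v}}{|\vec{v}|}\right)\right| \le |L|_{\mathrm{op}}$, since $\frac{\vec{v}}{|\vec{v}|}$ is a unit
vector, and by the definition of the operator norm; multiplying both sides by $|\vec{v}|$ and
using the linearity of L gives the claim. This fact will be essential to us
as we prove the chain rule.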
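For a feel of what the operator norm measures, one can sample the function F on the unit circle for a small matrix. This is a rough sketch under stated assumptions: the matrix L below is a made-up example, and sampling only gives a lower estimate of the true maximum.

```python
import math

def apply_matrix(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def length(v):
    return math.sqrt(sum(x * x for x in v))

def operator_norm_estimate(m, samples=3600):
    # sample F(v) = |L(v)| on the unit circle and keep the largest value seen;
    # this is only a lower estimate of the true maximum
    best = 0.0
    for k in range(samples):
        t = 2 * math.pi * k / samples
        best = max(best, length(apply_matrix(m, [math.cos(t), math.sin(t)])))
    return best

L = [[3.0, 0.0], [0.0, 1.0]]   # hypothetical example matrix
norm = operator_norm_estimate(L)
v = [1.5, -2.0]
lhs = length(apply_matrix(L, v))   # |L(v)|
rhs = norm * length(v)             # |L|_op |v|
```

For this diagonal matrix the estimate is 3, attained at the unit vector (1, 0), and lhs is no larger than rhs, as the essential property requires.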
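Let $f : \mathbb{R}^n \to \mathbb{R}^m$ and $g : \mathbb{R}^m \to \mathbb{R}^k$ be differentiable functions. We want to
show that $D(g \circ f)(p) = Dg(f(p)) \circ Df(p)$.
All we know at the beginning of the day is that
$$f(p + \vec{h}) = f(p) + Df(p)(\vec{h}) + \mathrm{fError}(\vec{h})$$
with
$$\lim_{\vec{h} \to 0} \frac{|\mathrm{fError}(\vec{h})|}{|\vec{h}|} = 0$$
and
$$g(f(p) + \vec{u}) = g(f(p)) + Dg(f(p))(\vec{u}) + \mathrm{gError}(\vec{u})$$
with
$$\lim_{\vec{u} \to 0} \frac{|\mathrm{gError}(\vec{u})|}{|\vec{u}|} = 0$$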
1
http://en.wikipedia.org/wiki/Compact_space
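We will simply compose these two formulas for f and g, and try to get some
control on the complicated error term which results.
$$\begin{aligned}
g\big(f(p + \vec{h})\big) &= g\big(f(p) + Df(p)(\vec{h}) + \mathrm{fError}(\vec{h})\big) \\
&= g(f(p)) + Dg(f(p))\big(Df(p)(\vec{h}) + \mathrm{fError}(\vec{h})\big) + \mathrm{gError}\big(Df(p)(\vec{h}) + \mathrm{fError}(\vec{h})\big) \\
&= (g \circ f)(p) + \big(Dg(f(p)) \circ Df(p)\big)(\vec{h}) \\
&\qquad + Dg(f(p))\big(\mathrm{fError}(\vec{h})\big) + \mathrm{gError}\big(Df(p)(\vec{h}) + \mathrm{fError}(\vec{h})\big)
\end{aligned}$$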
This looks pretty horrible, but at least we can see the error term (the last two
summands) that we have to get some control over. In particular we will have proven
the chain rule if we can show that
$$\lim_{\vec{h} \to 0} \frac{\big|Dg(f(p))\big(\mathrm{fError}(\vec{h})\big) + \mathrm{gError}\big(Df(p)(\vec{h}) + \mathrm{fError}(\vec{h})\big)\big|}{|\vec{h}|} = 0$$
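Since
$$0 \le \big|Dg(f(p))\big(\mathrm{fError}(\vec{h})\big) + \mathrm{gError}\big(Df(p)(\vec{h}) + \mathrm{fError}(\vec{h})\big)\big| \le \big|Dg(f(p))\big(\mathrm{fError}(\vec{h})\big)\big| + \big|\mathrm{gError}\big(Df(p)(\vec{h}) + \mathrm{fError}(\vec{h})\big)\big|$$
by the triangle inequality, we will have the result if we can prove separately that
$$\lim_{\vec{h} \to 0} \frac{\big|Dg(f(p))\big(\mathrm{fError}(\vec{h})\big)\big|}{|\vec{h}|} = 0$$
and
$$\lim_{\vec{h} \to 0} \frac{\big|\mathrm{gError}\big(Df(p)(\vec{h}) + \mathrm{fError}(\vec{h})\big)\big|}{|\vec{h}|} = 0$$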
Let's prove the first limit. This is where operator norms enter the picture.
$$\frac{\big|Dg(f(p))\big(\mathrm{fError}(\vec{h})\big)\big|}{|\vec{h}|} \le \frac{|Dg(f(p))|_{\mathrm{op}}\,|\mathrm{fError}(\vec{h})|}{|\vec{h}|}$$
Since $\frac{|\mathrm{fError}(\vec{h})|}{|\vec{h}|} \to 0$ as $\vec{h} \to 0$, we see that $\frac{|Dg(f(p))(\mathrm{fError}(\vec{h}))|}{|\vec{h}|}$ must
as well, since it is bounded above by a constant multiple of something which goes
to 0 (and bounded below by 0).
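For the other part of the error,
$$\frac{\big|\mathrm{gError}\big(Df(p)(\vec{h}) + \mathrm{fError}(\vec{h})\big)\big|}{|\vec{h}|} = \frac{\big|\mathrm{gError}\big(Df(p)(\vec{h}) + \mathrm{fError}(\vec{h})\big)\big|}{\big|Df(p)(\vec{h}) + \mathrm{fError}(\vec{h})\big|} \cdot \frac{\big|Df(p)(\vec{h}) + \mathrm{fError}(\vec{h})\big|}{|\vec{h}|}$$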
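The first factor in this expression goes to 0 as $\vec{h} \to 0$ because g is differentiable.
So all we need to do is make sure that the second factor is bounded.
$$\begin{aligned}
\frac{\big|Df(p)(\vec{h}) + \mathrm{fError}(\vec{h})\big|}{|\vec{h}|}
&\le \frac{|Df(p)(\vec{h})|}{|\vec{h}|} + \frac{|\mathrm{fError}(\vec{h})|}{|\vec{h}|} && \text{by the triangle inequality} \\
&\le \frac{|Df(p)|_{\mathrm{op}}\,|\vec{h}|}{|\vec{h}|} + \frac{|\mathrm{fError}(\vec{h})|}{|\vec{h}|} \\
&= |Df(p)|_{\mathrm{op}} + \frac{|\mathrm{fError}(\vec{h})|}{|\vec{h}|}
\end{aligned}$$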
Now the second term in this expression goes to 0 as $\vec{h} \to 0$ since f is differentiable.
So the whole expression is bounded by, say, $|Df(p)|_{\mathrm{op}} + \frac{1}{2}$ if $\vec{h}$ is small
enough.
Now we are done! We have successfully shown that the nasty error term Error
satisfies
$$\lim_{\vec{h} \to 0} \frac{|\mathrm{Error}(\vec{h})|}{|\vec{h}|} = 0$$
Thus $g \circ f$ is differentiable at p, and its derivative is given by $Dg(f(p)) \circ Df(p)$.
QED
39 End of Week Practice
Practice doing computations.
This section just contains practice problems on the material we have learned this
week. These problems do not have detailed hints: clicking on the hint will
one of the forums.
40 Jacobian practice
Practice computing the Jacobian.
Question 1 Compute the Jacobian of $f : \mathbb{R}^2 \to \mathbb{R}^3$ defined by $f(x, y) = (\sin(xy), x^2y^3, x^3)$.
Solution
Hint: $J = \begin{bmatrix} y\cos(xy) & x\cos(xy) \\ 2xy^3 & 3x^2y^2 \\ 3x^2 & 0 \end{bmatrix}$
Question 2 Compute the Jacobian of $f : \mathbb{R}^4 - \{y = 0\} \to \mathbb{R}^1$ defined by
$f(x, y, z, t) = x^2yz^3t^4 + \dfrac{x}{y}$.
Solution
Hint: $J = \begin{bmatrix} 2xyz^3t^4 + \dfrac{1}{y} & x^2z^3t^4 - \dfrac{x}{y^2} & 3x^2yz^2t^4 & 4x^2yz^3t^3 \end{bmatrix}$
Question 3 Compute the Jacobian of $f : \mathbb{R}^2 - \{(0, 0)\} \to \mathbb{R}^2$ defined by
$f(x, y) = \left(\dfrac{x}{x^2 + y^2}, \dfrac{y}{x^2 + y^2}\right)$.
Solution
Hint: $J = \begin{bmatrix} \dfrac{y^2 - x^2}{(x^2 + y^2)^2} & \dfrac{-2xy}{(x^2 + y^2)^2} \\ \dfrac{-2xy}{(x^2 + y^2)^2} & \dfrac{x^2 - y^2}{(x^2 + y^2)^2} \end{bmatrix}$
Question 4 Compute the Jacobian of $f : \mathbb{R} \to \mathbb{R}^4$ defined by $f(t) = (\cos(t), \sin(t), t, t^2)$.
Solution
Hint: $J = \begin{bmatrix} -\sin(t) \\ \cos(t) \\ 1 \\ 2t \end{bmatrix}$
Question 1 Compute $\nabla f$, where $f : \mathbb{R}^3 \to \mathbb{R}$ is defined by $f(x, y, z) = x^2 + xyz$.
Solution
Hint: $\nabla f = \begin{bmatrix} 2x + yz \\ xz \\ xy \end{bmatrix}$
Question 2 Compute $\nabla f$, where $f : \mathbb{R}^4 \to \mathbb{R}$ is defined by $f(x, y, z, t) = \cos(xy)\sin(zt)$.
Solution
Hint: $\nabla f = \begin{bmatrix} -y\sin(xy)\sin(zt) \\ -x\sin(xy)\sin(zt) \\ t\cos(xy)\cos(zt) \\ z\cos(xy)\cos(zt) \end{bmatrix}$
Question 3 Compute $\nabla f$, where $f : \mathbb{R}^2 - \{y = 0\} \to \mathbb{R}$ is defined by $f(x, y) = \dfrac{x}{y}$.
Solution
Hint: $\nabla f = \begin{bmatrix} \dfrac{1}{y} \\ \dfrac{-x}{y^2} \end{bmatrix}$
42 Linear approximation practice
Practice doing some linear approximations.
Question 1 Let $f : \mathbb{R}^2 \to \mathbb{R}^2$ be given by $f(x, y) = (x^2 - y^2, 2xy)$.
Solution
Hint: $\begin{bmatrix} 4.4 \\ 14.2 \end{bmatrix}$
Use the linear approximation to f at (3, 2) to approximate f(3.1, 2.3). Give your answer
as a column vector.
Question 2 Let $f : \mathbb{R}^1 \to \mathbb{R}^3$ be given by $f(t) = (t, t^2, t^3)$.
Solution
Hint: $\begin{bmatrix} 1.1 \\ 1.2 \\ 1.3 \end{bmatrix}$
Use the linear approximation to f at 1 to approximate f(1.1). Give your answer as a
column vector.
Question 3 Let $f : \mathbb{R}^3 \to \mathbb{R}^2$ be given by $f(x, y, z) = (xy, yz)$.
Solution
Hint: $\begin{bmatrix} 2.1 \\ 6.1 \end{bmatrix}$
Use the linear approximation to f at (1, 2, 3) to approximate f(1.1, 1.9, 3.2). Give your
answer as a column vector.
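These approximations all follow the same recipe, f(q) ≈ f(p) + J(q − p), which can be sketched in a few lines of Python. The helper name linear_approximation is an assumption, and the values of f(p) and the Jacobian for the first question are entered by hand rather than computed.

```python
def linear_approximation(f_at_p, jacobian_at_p, p, q):
    # f(q) is approximated by f(p) + J (q - p)
    h = [qi - pi for qi, pi in zip(q, p)]
    return [fi + sum(a * b for a, b in zip(row, h))
            for fi, row in zip(f_at_p, jacobian_at_p)]

# f(x, y) = (x^2 - y^2, 2xy) at p = (3, 2);
# the Jacobian [[2x, -2y], [2y, 2x]] evaluated at p is [[6, -4], [4, 6]]
p = [3.0, 2.0]
approx = linear_approximation([5.0, 12.0], [[6.0, -4.0], [4.0, 6.0]], p, [3.1, 2.3])
exact = [3.1**2 - 2.3**2, 2 * 3.1 * 2.3]
```

Comparing approx with exact shows how close the linear approximation is for a displacement of this size.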
43 Increasing and decreasing
Consider whether a function is increasing or decreasing.
Question 1 Let $f : \mathbb{R}^2 \to \mathbb{R}$ be given by $f(x, y) = (x + 2y)^2 + y^3$.
Solution
Hint: By computing the directional derivative in that direction, we see that f is
decreasing.
At the point (−1, 5), is f increasing or decreasing in the direction $\begin{bmatrix} 1 \\ -1 \end{bmatrix}$?
(a) Increasing
(b) Decreasing
Question 2 Let $f : \mathbb{R}^3 \to \mathbb{R}$ be given by $f(x, y, z) = \dfrac{x + y}{z}$.
Solution
Hint: By computing the directional derivative in that direction, we see that f is
decreasing. You could also see that f is decreasing by noting that the numerator is left
unchanged when x and y change by equal amounts in opposite directions, but the denominator
is increasing.
At the point (3, 2, 6), is f increasing or decreasing in the direction $\begin{bmatrix} 1 \\ -1 \\ 2 \end{bmatrix}$?
(a) Increasing
(b) Decreasing
Question 3 Let $f : \mathbb{R}^4 \to \mathbb{R}$ be given by $f(x, y, z, t) = \dfrac{xy^2}{t} + z^2$.
Solution
Hint: By computing the directional derivative in that direction, we see that f is
decreasing.
At the point (0, 1, 5, −1), is f increasing or decreasing in the direction $\begin{bmatrix} 1 \\ 1 \\ -2 \\ 3 \end{bmatrix}$?
(a) Increasing
(b) Decreasing
44 Stationary points
Find stationary points.
Question 1 Let $f : \mathbb{R}^2 \to \mathbb{R}^2$ be defined by $f(x, y) = (x^2 - x - y^2, 2xy - y)$. There
is one stationary point of f (one place where the derivative vanishes). What is this
point?
Question 2 Let $f : \mathbb{R}^3 \to \mathbb{R}$ be defined by $f(x, y, z) = x^2 + xy + xz + z^2$. There
is one stationary point of f (one place where the derivative vanishes). What is this
point?
Question 3 Let $f : \mathbb{R}^2 \to \mathbb{R}^3$ be given by $f(x, y) = \left(x^2, y - x, \sqrt{x^2 - 4x + 4 + y^2}\right)$.
Question 4 f is differentiable everywhere except for one point. What is that
point?
45 Hypersurfaces
Find tangent planes and lines.
Question 1 Hint: $6(x - 3) - 4(y - 2) = 0$
Let $f : \mathbb{R}^2 \to \mathbb{R}$ be defined by $f(x, y) = x^2 - y^2$.
Question 2 The equation of the tangent line to the curve $f(x, y) = 5$ at the point
(3, 2) is $0 = 6(x - 3) - 4(y - 2)$
Question 3 Hint: $4(x - 3) - 2y - 6(z - 1) = 0$
Let $f : \mathbb{R}^3 \to \mathbb{R}$ be defined by $f(x, y, z) = (x - z)^2 - (y + z)^2$.
Question 4 The equation of the tangent plane to the surface $f(x, y, z) = 3$ at the
point (3, 0, 1) is $0 = 4(x - 3) - 2y - 6(z - 1)$
46 Use the chain rule
Compute using the chain rule.
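Question 1 Let $f : \mathbb{R}^2 \to \mathbb{R}^2$ be defined by $f(x, y) = \left(x^2y + x, \dfrac{x}{y}\right)$. Let $g : \mathbb{R}^2 \to \mathbb{R}^3$
be defined by $g(x, y) = (x + y, x - y, xy)$.
Solution
Hint: $\begin{bmatrix} 2ab + 1 + \dfrac{1}{b} & a^2 - \dfrac{a}{b^2} \\ 2ab + 1 - \dfrac{1}{b} & a^2 + \dfrac{a}{b^2} \\ 3a^2 + \dfrac{2a}{b} & \dfrac{-a^2}{b^2} \end{bmatrix}$
Use the chain rule to find the Jacobian of $(g \circ f)$ at the point (a, b).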
Question 2 Let $f : \mathbb{R}^2 \to \mathbb{R}$ be defined by $f(x, y) = x^3 + y^3$. Let $g : \mathbb{R} \to \mathbb{R}^2$ be
defined by $g(x) = (x, \sin(x))$.
Solution
Hint: $\begin{bmatrix} 3a^2 & 3b^2 \\ 3a^2\cos(a^3 + b^3) & 3b^2\cos(a^3 + b^3) \end{bmatrix}$
Use the chain rule to find the Jacobian of $(g \circ f)$ at the point (a, b).
47 Abstraction
Not every vector space consists of lists of numbers.
So far, we've been thinking of all of our vector spaces as being $\mathbb{R}^n$. We now relax
this condition by providing a definition of a vector space in general.
48 Vector spaces
Vector spaces are sets with a notion of addition and scaling.
Until now, we have only dealt with the spaces $\mathbb{R}^n$. We now begin the journey of
understanding more general spaces.
The crucial structure we need to talk about linear maps on $\mathbb{R}^n$ is addition and
scalar multiplication. Addition is a function which takes a pair of vectors $\vec{v}, \vec{w} \in \mathbb{R}^n$
and returns a new vector $\vec{v} + \vec{w} \in \mathbb{R}^n$. Scalar multiplication is a function which takes
a vector $\vec{v} \in \mathbb{R}^n$ and a scalar $c \in \mathbb{R}$, and returns a new vector $c\vec{v}$.
Definition 1 A vector space is a set V equipped with a notion of addition, which
is a function that takes a pair of vectors $\vec{v}, \vec{w} \in V$ and returns a new vector $\vec{v} + \vec{w} \in V$,
and a notion of scalar multiplication, which is a function that takes a scalar $c \in \mathbb{R}$
and a vector $\vec{v} \in V$ and returns a new vector $c\vec{v} \in V$.
These operations are subject to the following requirements:
Commutativity For each $\vec{v}, \vec{w} \in V$, $\vec{v} + \vec{w} = \vec{w} + \vec{v}$
Associativity For each $\vec{v}, \vec{w}, \vec{u} \in V$, $\vec{v} + (\vec{w} + \vec{u}) = (\vec{v} + \vec{w}) + \vec{u}$
Additive identity There is a vector called $\vec{0} \in V$ with $\vec{v} + \vec{0} = \vec{v}$ for each $\vec{v} \in V$.
Additive inverse For each $\vec{v} \in V$ there is a vector $\vec{w} \in V$ with $\vec{v} + \vec{w} = \vec{0}$
Multiplicative identity For each $\vec{v} \in V$, $1\vec{v} = \vec{v}$ (here 1 is really the real
number $1 \in \mathbb{R}$, and $1\vec{v}$ is the scalar product of 1 with $\vec{v}$)
Distributivity of scalar multiplication over vector addition For each $\vec{v}, \vec{w} \in V$
and $c \in \mathbb{R}$, $c(\vec{v} + \vec{w}) = c\vec{v} + c\vec{w}$
Distributivity of vector multiplication under scalar addition For each $a, b \in \mathbb{R}$
and $\vec{v} \in V$, $(a + b)\vec{v} = a\vec{v} + b\vec{v}$
Let’s list oﬀ some nice examples of vector spaces.
Example 2 Our old friend $\mathbb{R}^n$ is a vector space with the notions of vector addition
and scalar multiplication we introduced in Week 1.
Example 3 Let $\mathrm{Poly}_2$ be the set of all polynomials of degree at most 2 in one
variable. For example $1 + 2x \in \mathrm{Poly}_2$ and $3 + x^2 \in \mathrm{Poly}_2$. Then the usual way of
adding polynomials and multiplying them by constants turns $\mathrm{Poly}_2$ into a vector
space. For example $2(1 + 2x) + (3 + x^2) = 5 + 4x + x^2$.
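One way to experiment with this example is to store a polynomial of degree at most 2 as its list of coefficients. This representation and the helper names poly_add and poly_scale are assumptions made for illustration.

```python
def poly_add(p, q):
    # polynomials in Poly_2 stored as coefficient lists [a0, a1, a2]
    return [a + b for a, b in zip(p, q)]

def poly_scale(c, p):
    # multiply every coefficient by the scalar c
    return [c * a for a in p]

p = [1, 2, 0]   # 1 + 2x
q = [3, 0, 1]   # 3 + x^2
result = poly_add(poly_scale(2, p), q)   # 2(1 + 2x) + (3 + x^2)
```

Here result is the coefficient list of 5 + 4x + x^2, matching the computation in the example. With this encoding, the vector space operations on Poly_2 look exactly like those on R^3.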
The next example will probably be the most important example for us. Thinking
of matrices as points in a vector space will be useful for us when we start thinking
Example 4 Let $\mathrm{Mat}_{nm}$ be the collection of all $n \times m$ matrices. Then $\mathrm{Mat}_{nm}$ is
a vector space with the usual notion of matrix addition and scalar multiplication.
The following is an important, but very small, ﬁrst step into the world of func-
tional analysis.
Example 5 Let $C([0, 1])$ be the set of all continuous functions from [0, 1] to $\mathbb{R}$.
Then $C([0, 1])$ is a vector space with addition and scalar multiplication defined
pointwise (f + g is the function whose value at x is f(x) + g(x), and cf is the
function whose value at x is cf(x)).
Realizing that solution sets of certain diﬀerential equations form vector spaces
is important.
Let V be the set of all smooth functions f : R →R which satisfy the diﬀerential
equation
d
2
f
dx
2
+ 3
df
dx
meaning that f + g denotes the function which sends x to f(x) + g(x). Scalar
multiplication cf means the function that sends x to c f(x). With this, we can
show that V is a vector space.
What if we change the differential equation to $\dfrac{d^2f}{dx^2} + 3\dfrac{df}{dx} + 4f(x) = 0$? Is this
still a vector space? Why or why not? We already know that function addition
and scalar multiplication of functions satisfy all of the axioms of a vector space:
what we do not know is whether function addition and scalar multiplication are
well defined for solutions to this differential equation.
We need to check that if f and g are solutions, and $c \in \mathbb{R}$, then f + g and cf
are as well.
$$\begin{aligned}
\frac{d^2}{dx^2}(f + g) + 3\frac{d}{dx}(f + g) + 4(f + g)(x)
&= \frac{d^2f}{dx^2} + \frac{d^2g}{dx^2} + 3\frac{df}{dx} + 3\frac{dg}{dx} + 4f(x) + 4g(x) \\
&= \left(\frac{d^2f}{dx^2} + 3\frac{df}{dx} + 4f(x)\right) + \left(\frac{d^2g}{dx^2} + 3\frac{dg}{dx} + 4g(x)\right) \\
&= 0
\end{aligned}$$
So f + g is a solution to the DE if f and g are.
$$\frac{d^2}{dx^2}(cf(x)) + 3\frac{d}{dx}(cf(x)) + 4cf(x) = c\left(\frac{d^2f}{dx^2} + 3\frac{df}{dx} + 4f(x)\right) = 0$$
So cf is a solution to the DE if f is, and $c \in \mathbb{R}$.
So addition and scalar multiplication are well defined on V, giving V the structure
of a vector space.
We are dealing with a level of abstraction here that you may not have met
before. It is worthwhile taking some time to prove certain “obvious” (though they
are not so obvious) statements formally from the axioms:
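Prove that in any vector space, there is a unique additive identity. In other
words, if V is a vector space, and there are two elements $\vec{0} \in V$ and $\vec{0}' \in V$ so that
for each $\vec{v} \in V$, $\vec{v} + \vec{0} = \vec{v}$ and $\vec{v} + \vec{0}' = \vec{v}$, then $\vec{0} = \vec{0}'$. Every line of your proof
should be justified with a vector space axiom!
We start with $\vec{0}$, and use our vector space axioms to construct a string of
equalities ending in $\vec{0}'$.
$$\begin{aligned}
\vec{0} &= \vec{0} + \vec{0}' && \text{because } \vec{0}' \text{ is an additive identity} \\
&= \vec{0}' + \vec{0} && \text{by the commutativity of vector addition} \\
&= \vec{0}' && \text{because } \vec{0} \text{ is an additive identity}
\end{aligned}$$
So $\vec{0} = \vec{0}'$.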
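Prove that each element of a vector space has a unique (only one) additive
inverse. Let $\vec{v} \in V$. Assume that $\vec{w}_1$ and $\vec{w}_2$ are both additive inverses of $\vec{v}$.
We will show that $\vec{w}_1 = \vec{w}_2$.
$$\begin{aligned}
\vec{w}_1 &= \vec{w}_1 + \vec{0} && \text{by the definition of the additive identity} \\
&= \vec{w}_1 + (\vec{v} + \vec{w}_2) && \text{because } \vec{v} \text{ and } \vec{w}_2 \text{ are additive inverses} \\
&= (\vec{w}_1 + \vec{v}) + \vec{w}_2 && \text{by the associativity of vector addition} \\
&= \vec{0} + \vec{w}_2 && \text{because } \vec{v} \text{ and } \vec{w}_1 \text{ are additive inverses (using commutativity)} \\
&= \vec{w}_2 && \text{by the definition of the additive identity (using commutativity)}
\end{aligned}$$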
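Let V be a vector space. Prove that $0\vec{v} = \vec{0}$ for every $\vec{v} \in V$. (Note: 0 means
different things on the different sides of the equation! On the left hand side, it is
the scalar $0 \in \mathbb{R}$, whereas on the right hand side it is the zero vector $\vec{0} \in V$.)
$$\begin{aligned}
0\vec{v} &= (0 + 0)\vec{v} && \text{nothing funny here: } 0 + 0 = 0 \\
&= 0\vec{v} + 0\vec{v} && \text{by the distributivity of vector multiplication under scalar addition}
\end{aligned}$$
So $0\vec{v} = 0\vec{v} + 0\vec{v}$.
Now let $\vec{w}$ be the additive inverse of $0\vec{v}$, and add it to both sides of the equation:
$$\begin{aligned}
0\vec{v} + \vec{w} &= (0\vec{v} + 0\vec{v}) + \vec{w} \\
0\vec{v} + \vec{w} &= 0\vec{v} + (0\vec{v} + \vec{w}) && \text{by the associativity of vector addition} \\
\vec{0} &= 0\vec{v} + \vec{0} && \text{by the definition of additive inverses} \\
\vec{0} &= 0\vec{v} && \text{by the definition of the additive identity}
\end{aligned}$$
QED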
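Let V be a vector space. Prove that $a\vec{0} = \vec{0}$ for every $a \in \mathbb{R}$.
$$\begin{aligned}
a\vec{0} &= a(\vec{0} + \vec{0}) && \text{by definition of the additive identity} \\
&= a\vec{0} + a\vec{0} && \text{by the distributivity of scalar multiplication over vector addition}
\end{aligned}$$
So
$$a\vec{0} = a\vec{0} + a\vec{0}$$
Let $\vec{w}$ be the additive inverse of $a\vec{0}$. Adding $\vec{w}$ to both sides we have
$$\begin{aligned}
a\vec{0} + \vec{w} &= \left(a\vec{0} + a\vec{0}\right) + \vec{w} \\
a\vec{0} + \vec{w} &= a\vec{0} + \left(a\vec{0} + \vec{w}\right) && \text{by the associativity of vector addition} \\
\vec{0} &= a\vec{0} + \vec{0} = a\vec{0} && \text{by definition of the additive identity}
\end{aligned}$$
QED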
Let V be a vector space. Prove that $(-1)\vec{v}$ is the additive inverse of $\vec{v}$ for every
$\vec{v} \in V$.
The proof of this uses the "rat poison principle": if you want to show that
something is rat poison, try feeding it to a rat! In this case we want to see if $(-1)\vec{v}$
is the additive inverse of $\vec{v}$, so we should try adding it to $\vec{v}$.
$$\begin{aligned}
(-1)\vec{v} + \vec{v} &= (-1)\vec{v} + 1\vec{v} && \text{Multiplicative identity property} \\
&= (-1 + 1)\vec{v} && \text{distributivity of vector multiplication under scalar addition} \\
&= 0\vec{v} = \vec{0} && \text{by one of the theorems above}
\end{aligned}$$
So, indeed, $(-1)\vec{v}$ is an additive inverse of $\vec{v}$. We already proved uniqueness
of additive inverses above, so we are done. We will often simply write $-\vec{v}$ for the
additive inverse of $\vec{v}$ in the future.
121
49 Linear maps, redux
Linear maps respect scalar multiplication and vector addition.
Definition 1 Let $V$ and $W$ be two vector spaces. A function $L : V \to W$ is a linear map if it
Respects vector addition: For all $\vec{v}_1, \vec{v}_2 \in V$, $L(\vec{v}_1 + \vec{v}_2) = L(\vec{v}_1) + L(\vec{v}_2)$.
Respects scalar multiplication: For all $c \in \mathbb{R}$ and $\vec{v} \in V$, $L(c\vec{v}) = cL(\vec{v})$.
If the domain and codomain of a linear map are both V , then we may call it a
linear operator to emphasize this fact. For instance, you might hear someone say
“L : V →V is a linear operator.”
Let $V$ be the space of all polynomials in one variable $x$, and $W = \mathbb{R}$. For each real number $a \in \mathbb{R}$, define the function $\mathrm{Eval}_a : V \to W$ by $\mathrm{Eval}_a(p) = p(a)$. Show that $\mathrm{Eval}_a$ is a linear map.
Let $p_1, p_2 \in V$. Then $\mathrm{Eval}_a(p_1 + p_2) = p_1(a) + p_2(a) = \mathrm{Eval}_a(p_1) + \mathrm{Eval}_a(p_2)$.
Also if $c \in \mathbb{R}$, then $\mathrm{Eval}_a(cp) = cp(a) = c\,\mathrm{Eval}_a(p)$. So $\mathrm{Eval}_a$ is a linear map.
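The two linearity checks for the evaluation map can be tried numerically. The sketch below is only an illustration, not part of the course's exercises: it assumes a polynomial is stored as a list of coefficients `[c0, c1, c2, ...]`, and the helper names `evaluate`, `poly_add`, and `poly_scale` are hypothetical.

```python
def evaluate(coeffs, a):
    """Evaluate c0 + c1*x + c2*x**2 + ... at x = a (Horner's rule)."""
    result = 0
    for c in reversed(coeffs):
        result = result * a + c
    return result

def poly_add(p, q):
    """Add two polynomials given as coefficient lists."""
    n = max(len(p), len(q))
    p = p + [0] * (n - len(p))
    q = q + [0] * (n - len(q))
    return [x + y for x, y in zip(p, q)]

def poly_scale(c, p):
    """Multiply a polynomial by the scalar c."""
    return [c * x for x in p]

p1 = [1, 0, 2]   # 1 + 2x^2
p2 = [0, 3]      # 3x
a = 5

# Evaluation respects vector addition: both numbers agree.
print(evaluate(poly_add(p1, p2), a), evaluate(p1, a) + evaluate(p2, a))
# Evaluation respects scalar multiplication: both numbers agree.
print(evaluate(poly_scale(4, p1), a), 4 * evaluate(p1, a))
```

A check at one point with two polynomials is of course not a proof; the argument above is what shows the identities hold for every $p_1$, $p_2$, $c$, and $a$.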
To make sure linear maps work the way we expect them to in this new context, and to flex our brains a little bit, let’s prove some facts about linear functions:
Let $L : V \to W$ be a linear map. Show that $L(\vec{0}) = \vec{0}$. (Note: $\vec{0}$ means different things on either side of the equation. On the LHS it means the additive identity of $V$, while on the RHS it means the additive identity of $W$.)
\begin{align*}
L(\vec{0}) &= L(0\vec{0}) && \text{we proved in the last section that $0\vec{v} = \vec{0}$ for any $\vec{v}$} \\
&= 0L(\vec{0}) && \text{because $L$ respects scalar multiplication} \\
&= \vec{0} && \text{by the same reasoning quoted above}
\end{align*}
Another way to do this would be by starting with $L(\vec{0}) = L(\vec{0} + \vec{0})$ and using the fact that $L$ respects vector addition. Try this proof out too!
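Let $V$ and $W$ be vector spaces, and define a function $\mathrm{Zero} : V \to W$ by $\mathrm{Zero}(\vec{v}) = \vec{0}$ for all $\vec{v} \in V$. Show that $\mathrm{Zero}$ is a linear function.
Let $\vec{v}_1, \vec{v}_2 \in V$. Then
\begin{align*}
\mathrm{Zero}(\vec{v}_1 + \vec{v}_2) &= \vec{0} \\
&= \vec{0} + \vec{0} \\
&= \mathrm{Zero}(\vec{v}_1) + \mathrm{Zero}(\vec{v}_2)
\end{align*}
So $\mathrm{Zero}$ respects vector addition.
Let $\vec{v} \in V$ and $c \in \mathbb{R}$.
\begin{align*}
\mathrm{Zero}(c\vec{v}) &= \vec{0} \\
&= c\vec{0} \\
&= c\,\mathrm{Zero}(\vec{v})
\end{align*}
So $\mathrm{Zero}$ respects scalar multiplication.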
50 Python
Certain Python functions form a vector space?
Let T be the collection of all Python functions f with the properties that
• f accepts a single numeric parameter,
• f returns a single numeric value, and
• no matter what number x is, the function call f(x) successfully returns a
number.
We’ll say that two Python functions are “equal” if they produce the same outputs
for the same inputs.
Now the collection T (arguably) forms a vector space. I say “arguably” because
“numbers” in Python aren’t real numbers, but let’s just play along and pretend
that they are.
Question 1 What function plays the role of

0 in T?
Solution
Python
    def zero(x):
        # return ?
        pass  # placeholder so the template parses; replace with your answer

    def validator():
        return (zero(17) == 0)
Suppose we have two functions f and g. What is their sum?
Solution
Python
    def vector_sum(f, g):
        # return a new Python function which is the sum of f and g
        pass  # placeholder so the template parses; replace with your answer

    def validator():
        return (vector_sum(lambda x: x**2, lambda x: x**3)(3) == 36)
Now suppose we have a function f and a scalar c. What is c times f?
Solution
Python
    def scalar_multiple(c, f):
        # return a new Python function which is c*f
        pass  # placeholder so the template parses; replace with your answer

    def validator():
        return (scalar_multiple(17, lambda x: x**2)(2) == 68)
Now suppose we have a function f and a point a. The evaluation map sends f ∈ T to the value f(a).
Solution
Python
    def scalar_multiple(c, f):
        return (lambda x: c * f(x))

    def vector_sum(f, g):
        return (lambda x: f(x) + g(x))

    def evaluation_map(a, f):
        # return the value of f at the point a
        pass  # placeholder so the template parses; replace with your answer

    # Now note that evaluation_map(a,vector_sum(f,g)) = evaluation_map(a,f) + evaluation_map(a,g)
    # (the two print calls below only work once evaluation_map returns a number)
    f = lambda x: x**2
    g = lambda x: x**3
    a = 3
    print(evaluation_map(a, vector_sum(f, g)))
    print(evaluation_map(a, f) + evaluation_map(a, g))

    def validator():
        return (evaluation_map(17, (lambda x: x**2)) == 289)
This is an example of the fact that “evaluation at a” is a linear map from T to
the underlying number system.
Finally, some food for thought: in a little while, we’ll be thinking about “dimension.” Keep in mind the following question: what is the dimension of the vector space T, if we pretend that “Python functions” take honest real numbers as their inputs and outputs?
51 Bases
Basis vectors span the space without redundancy.
In our study of the vector spaces $\mathbb{R}^n$, we have relied quite heavily on the “standard basis vectors”
\[
\vec{e}_1 = \begin{bmatrix} 1 \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix},\quad
\vec{e}_2 = \begin{bmatrix} 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix},\quad
\vec{e}_3 = \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix},\quad \ldots,\quad
\vec{e}_n = \begin{bmatrix} 0 \\ \vdots \\ 0 \\ 1 \end{bmatrix}.
\]
We’ll write $\mathcal{E}$ for the collection of vectors $(\vec{e}_1, \vec{e}_2, \ldots, \vec{e}_n)$.
A great feature of these vectors is that they span all of $\mathbb{R}^n$: every vector $\vec{v} \in \mathbb{R}^n$ can be written in the form
\[ \vec{v} = x_1\vec{e}_1 + x_2\vec{e}_2 + \cdots + x_n\vec{e}_n. \]
What is even better is that this representation is unique: if I also have that
\[ \vec{v} = y_1\vec{e}_1 + y_2\vec{e}_2 + \cdots + y_n\vec{e}_n, \]
then $x_1 = y_1, x_2 = y_2, \ldots, x_n = y_n$.
Our goal in this section will be to ﬁnd similarly nice sets of vectors in an abstract
vector space.
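Question 1 A linear combination of two vectors $\vec{v}$ and $\vec{w}$ is an expression of the form
\[ \alpha\vec{v} + \beta\vec{w} \]
for some numbers $\alpha, \beta \in \mathbb{R}$.
Which of the following vectors is a linear combination of $\vec{v} = \begin{bmatrix} 3 \\ 2 \\ 1 \end{bmatrix}$ and $\vec{w} = \begin{bmatrix} 1 \\ 5 \\ 1 \end{bmatrix}$?
Solution
(a) $\begin{bmatrix} 1 \\ -8 \\ -1 \end{bmatrix}$ ✓
(b) $\begin{bmatrix} -1 \\ 8 \\ 1 \end{bmatrix}$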
So not every vector in $\mathbb{R}^3$ is a linear combination of $\vec{v}$ and $\vec{w}$. Vectors which are a linear combination of $\vec{v}$ and $\vec{w}$ are said to be in the “span” of $\vec{v}$ and $\vec{w}$. Let’s make this more general for more than just two vectors.
Definition 2 The span of an ordered list of vectors $(\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_n)$ is the set of all linear combinations of the $\vec{v}_i$:
\[ \mathrm{Span}(\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_n) = \{a_1\vec{v}_1 + a_2\vec{v}_2 + \cdots + a_n\vec{v}_n : a_i \in \mathbb{R}\} \]
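Question 3 Do $\begin{bmatrix} 1 \\ 1 \end{bmatrix}$ and $\begin{bmatrix} 1 \\ 0 \end{bmatrix}$ together span the vector space $\mathbb{R}^2$?
Solution
(a) Yes.
(b) No.
Indeed, every vector in $\mathbb{R}^2$ can be written as a linear combination of $\begin{bmatrix} 1 \\ 1 \end{bmatrix}$ and $\begin{bmatrix} 1 \\ 0 \end{bmatrix}$. Prove it!
Since $\begin{bmatrix} 1 \\ 0 \end{bmatrix}$ and $\begin{bmatrix} 0 \\ 1 \end{bmatrix}$ span $\mathbb{R}^2$, it is enough to show that $\begin{bmatrix} 1 \\ 1 \end{bmatrix}$ and $\begin{bmatrix} 1 \\ 0 \end{bmatrix}$ span these two vectors, i.e. we need only show that $\begin{bmatrix} 0 \\ 1 \end{bmatrix}$ is in the span of these two vectors.
But $\begin{bmatrix} 0 \\ 1 \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \end{bmatrix} + (-1)\begin{bmatrix} 1 \\ 0 \end{bmatrix}$, so we are done.
To be a bit more explicit, we can write any vector
\begin{align*}
\begin{bmatrix} x \\ y \end{bmatrix} &= x\begin{bmatrix} 1 \\ 0 \end{bmatrix} + y\begin{bmatrix} 0 \\ 1 \end{bmatrix} \\
&= x\begin{bmatrix} 1 \\ 0 \end{bmatrix} + y\left(\begin{bmatrix} 1 \\ 1 \end{bmatrix} + (-1)\begin{bmatrix} 1 \\ 0 \end{bmatrix}\right) \\
&= (x - y)\begin{bmatrix} 1 \\ 0 \end{bmatrix} + y\begin{bmatrix} 1 \\ 1 \end{bmatrix}
\end{align*}
So we have expressed any vector as a linear combination of $\begin{bmatrix} 1 \\ 1 \end{bmatrix}$ and $\begin{bmatrix} 1 \\ 0 \end{bmatrix}$.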
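The explicit formula $(x - y)$ times $(1, 0)$ plus $y$ times $(1, 1)$ can be checked on a few points. This is a small sketch with hypothetical helper names (`coordinates`, `combine`); vectors are stored as plain Python tuples.

```python
def coordinates(x, y):
    """Return (alpha, beta) with (x, y) = alpha*(1, 0) + beta*(1, 1)."""
    return (x - y, y)

def combine(alpha, beta):
    """Rebuild the vector alpha*(1, 0) + beta*(1, 1)."""
    return (alpha * 1 + beta * 1, alpha * 0 + beta * 1)

for point in [(0, 1), (3, -2), (7, 7)]:
    alpha, beta = coordinates(*point)
    # The rebuilt vector should be the point we started with.
    print(point, "has coefficients", (alpha, beta), "and rebuilds to", combine(alpha, beta))
```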
Question 4 Can every polynomial of degree at most 2 be written in the form
\[ \alpha(1) + \beta(x - 1) + \gamma(x - 1)^2\,? \]
Solution
(a) Yes.
(b) No.
In other words: the polynomials $1$, $x - 1$, and $(x - 1)^2$ span the vector space of polynomials of degree at most 2. Prove it!
Let $p(x) = a_0 + a_1x + a_2x^2$.
Then
\begin{align*}
p(x) &= a_0 + a_1[(x - 1) + 1] + a_2[(x - 1) + 1]^2 \\
&= a_0 + a_1(x - 1) + a_1 + a_2[(x - 1)^2 + 2(x - 1) + 1] \\
&= (a_0 + a_1 + a_2)1 + (a_1 + 2a_2)(x - 1) + a_2(x - 1)^2
\end{align*}
so we have expressed every polynomial of degree at most 2 as a linear combination of $1$, $(x - 1)$ and $(x - 1)^2$.
You could also solve this problem by appealing to Taylor’s theorem in one
variable calculus. Can you see how?
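The coefficients found in the computation above can be sanity-checked by evaluating both forms of $p$ at several points. The function names below are hypothetical, and agreement at a handful of points is only a spot check (although for polynomials of degree at most 2, agreement at three distinct points already forces equality).

```python
def shifted_coefficients(a0, a1, a2):
    """Coefficients (alpha, beta, gamma) of a0 + a1*x + a2*x**2
    with respect to the list (1, x - 1, (x - 1)**2)."""
    return (a0 + a1 + a2, a1 + 2 * a2, a2)

def p_standard(a0, a1, a2, x):
    return a0 + a1 * x + a2 * x**2

def p_shifted(alpha, beta, gamma, x):
    return alpha + beta * (x - 1) + gamma * (x - 1)**2

a0, a1, a2 = 4, -3, 2
alpha, beta, gamma = shifted_coefficients(a0, a1, a2)
for x in [-2, 0, 1, 5]:
    # Both columns should show the same value of p(x).
    print(x, p_standard(a0, a1, a2, x), p_shifted(alpha, beta, gamma, x))
```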
52 Dimension
Basis vectors span the space without redundancy.
Definition 1 A vector space is called finite dimensional if it has a finite list of spanning vectors. A space which is not finite dimensional is called infinite dimensional.
Question 2 The space P of all polynomials in one variable x is:
Solution
(a) Inﬁnite dimensional.
(b) Finite dimensional.
Can you prove it? Suppose that $P$ were finite dimensional, and then deduce a contradiction to show that it is impossible.
Let $p_1, p_2, \ldots, p_n$ be a finite list of vectors in $P$. Since this list of polynomials is finite they must be bounded in degree, i.e. the degree of $p_i$ must be at most some $k$ for each $i$. But a linear combination of polynomials of degree at most $k$ is also of degree at most $k$. So the polynomial $x^{k+1} \notin \mathrm{Span}(p_1, p_2, \ldots, p_n)$. Thus no finite list of polynomials spans all of $P$. So $P$ is infinite dimensional.
Definition 3 Let $V$ be a vector space. An ordered list of vectors $(\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_n)$ where all the $\vec{v}_i \in V$ is linearly independent if
\[ a_1\vec{v}_1 + a_2\vec{v}_2 + \cdots + a_n\vec{v}_n = b_1\vec{v}_1 + b_2\vec{v}_2 + \cdots + b_n\vec{v}_n \]
implies that $a_1 = b_1, a_2 = b_2, \ldots, a_n = b_n$. In other words, every vector in the span of $(\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_n)$ can be expressed as a linear combination of the $\vec{v}_i$ in only one way.
If the list of vectors is not linearly independent, it is linearly dependent.
Show that the following alternative definition for linear independence is equivalent to our definition:
Definition 4 Let $V$ be a vector space. An ordered list of vectors $(\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_n)$ where all the $\vec{v}_i \in V$ is called linearly independent if $a_1\vec{v}_1 + a_2\vec{v}_2 + \cdots + a_n\vec{v}_n = \vec{0}$ implies that $a_i = 0$ for all $i = 1, 2, 3, \ldots, n$.
Let us say that our original definition is of being linearly independent in the first sense, while this second definition is being linearly independent in the second sense.
If a list of vectors $(\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_n)$ is linearly independent in the first sense, then if $a_1\vec{v}_1 + a_2\vec{v}_2 + \cdots + a_n\vec{v}_n = \vec{0}$ we have $a_1\vec{v}_1 + a_2\vec{v}_2 + \cdots + a_n\vec{v}_n = 0\vec{v}_1 + 0\vec{v}_2 + \cdots + 0\vec{v}_n$, so by the definition of linear independence in the first sense, we have $a_1 = a_2 = \cdots = a_n = 0$.
On the other hand, if $(\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_n)$ are linearly independent in the second sense, then if $a_1\vec{v}_1 + a_2\vec{v}_2 + \cdots + a_n\vec{v}_n = b_1\vec{v}_1 + b_2\vec{v}_2 + \cdots + b_n\vec{v}_n$ we have $(a_1 - b_1)\vec{v}_1 + (a_2 - b_2)\vec{v}_2 + \cdots + (a_n - b_n)\vec{v}_n = \vec{0}$, so $a_i - b_i = 0$ for each $i$. Thus $a_i = b_i$ for each $i$, proving that the list was linearly independent in the first sense.
Often this definition is easier to check, although it does not capture the “meaning” of linear independence as well as the first definition.
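For vectors in $\mathbb{R}^m$, the second definition can be tested mechanically. The sketch below is one possible approach, not the text's method: it relies on the standard fact that the only solution of $a_1\vec{v}_1 + \cdots + a_n\vec{v}_n = \vec{0}$ is the trivial one exactly when Gaussian elimination on the matrix whose rows are the $\vec{v}_i$ leaves no zero row. The names `rank` and `is_linearly_independent` are hypothetical, and exact fractions are used to avoid rounding trouble.

```python
from fractions import Fraction

def rank(rows):
    """Number of nonzero rows left after Gaussian elimination (exact arithmetic)."""
    m = [[Fraction(x) for x in row] for row in rows]
    r = 0  # index of the next pivot row
    ncols = len(m[0]) if m else 0
    for col in range(ncols):
        # Look for a row at or below r with a nonzero entry in this column.
        pivot = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        # Clear the entries below the pivot.
        for i in range(r + 1, len(m)):
            factor = m[i][col] / m[r][col]
            m[i] = [a - factor * b for a, b in zip(m[i], m[r])]
        r += 1
    return r

def is_linearly_independent(vectors):
    """True when no nontrivial combination of the vectors is the zero vector."""
    return rank(vectors) == len(vectors)

print(is_linearly_independent([[1, 1], [1, 0]]))        # the pair from the span question
print(is_linearly_independent([[1, 2, 3], [2, 4, 6]]))  # second vector is twice the first
```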
Prove that any ordered list of vectors containing the zero vector is linearly dependent. We can see immediately from the second definition that since $1\vec{0} = \vec{0}$, but $1 \neq 0$, the list cannot be linearly independent.
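Prove that an ordered list of length 2 (i.e. $(\vec{v}_1, \vec{v}_2)$) is linearly dependent if and only if one vector is a scalar multiple of the other. For $\vec{v}_1$ and $\vec{v}_2$ to be linearly dependent there must be two scalars $a, b \in \mathbb{R}$ with $a\vec{v}_1 + b\vec{v}_2 = \vec{0}$ and at least one of $a$ or $b$ nonzero. Let us assume (without loss of generality) that $a \neq 0$. Then $a\vec{v}_1 = -b\vec{v}_2$, so $\vec{v}_1 = \frac{-b}{a}\vec{v}_2$. Thus one vector is a scalar multiple of the other. Conversely, if (say) $\vec{v}_1 = c\vec{v}_2$ for some $c \in \mathbb{R}$, then $1\vec{v}_1 + (-c)\vec{v}_2 = \vec{0}$ with the coefficient $1 \neq 0$, so the list is linearly dependent.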
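Theorem 5 If $(\vec{v}_1, \vec{v}_2, \vec{v}_3, \ldots, \vec{v}_n)$ is linearly dependent in $V$ and $\vec{v}_1 \neq \vec{0}$, then one of the vectors $\vec{v}_j$ (with $j \geq 2$) is in the span of $\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_{j-1}$.
Prove this theorem.
Since $(\vec{v}_1, \vec{v}_2, \vec{v}_3, \ldots, \vec{v}_n)$ is linearly dependent, by definition there are scalars $a_i \in \mathbb{R}$ with $a_1\vec{v}_1 + a_2\vec{v}_2 + \cdots + a_n\vec{v}_n = \vec{0}$, and not all of the $a_i$ equal to 0. Let $j$ be the largest index so that $a_j$ is not equal to 0. We must have $j \geq 2$: if $j = 1$, the equation would read $a_1\vec{v}_1 = \vec{0}$ with $a_1 \neq 0$, forcing $\vec{v}_1 = \vec{0}$. Then we have
\[ \vec{v}_j = -\frac{a_1}{a_j}\vec{v}_1 - \frac{a_2}{a_j}\vec{v}_2 - \frac{a_3}{a_j}\vec{v}_3 - \cdots - \frac{a_{j-1}}{a_j}\vec{v}_{j-1}. \]
So $\vec{v}_j$ is in the span of $\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_{j-1}$.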
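If 2 vectors $\vec{v}_1, \vec{v}_2$ span $V$, is it possible that three vectors $\vec{w}_1, \vec{w}_2, \vec{w}_3$ in $V$ are linearly independent?
Warning 6 This is harder to prove than you might think!
No!
Assume to the contrary that $\vec{w}_1, \vec{w}_2, \vec{w}_3$ are linearly independent.
Since the list $(\vec{v}_1, \vec{v}_2)$ spans $V$, the list $(\vec{w}_1, \vec{v}_1, \vec{v}_2)$ is linearly dependent. Note that $\vec{w}_1 \neq \vec{0}$, since the $\vec{w}$’s are linearly independent. Thus by the previous theorem, either $\vec{v}_1$ is in the span of $\vec{w}_1$, or $\vec{v}_2$ is in the span of $(\vec{w}_1, \vec{v}_1)$. In either case we get that $(\vec{w}_1, \vec{v})$ spans $V$, where $\vec{v}$ is either $\vec{v}_1$ or $\vec{v}_2$.
Now apply the same trick: since $(\vec{w}_1, \vec{v})$ spans $V$, the list $(\vec{w}_2, \vec{w}_1, \vec{v})$ must be linearly dependent. So by the previous theorem, either $\vec{w}_1$ is in the span of $\vec{w}_2$ or $\vec{v}$ is in the span of $(\vec{w}_2, \vec{w}_1)$. But $\vec{w}_1$ cannot be in the span of $\vec{w}_2$ because the $\vec{w}$’s are linearly independent. So $\vec{v}$ is in the span of $(\vec{w}_2, \vec{w}_1)$.
So $(\vec{w}_2, \vec{w}_1)$ spans $V$. But then $\vec{w}_3$ is in the span of $(\vec{w}_2, \vec{w}_1)$, contradicting the fact that it is linearly independent from those two vectors. We have arrived at our contradiction.
Therefore, $\vec{w}_1, \vec{w}_2, \vec{w}_3$ cannot be linearly independent.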
This problem generalizes:
Theorem 7 The length of a linearly independent list of vectors is less than or equal to the length of any spanning list of vectors.
Prove this theorem. We will follow the same procedure that we did above. Assume $(\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_n)$ is a list of vectors which spans $V$, and $(\vec{w}_1, \vec{w}_2, \ldots, \vec{w}_m)$ is a linearly independent list of vectors. We must show that $m \leq n$.
$(\vec{w}_1, \vec{v}_1, \vec{v}_2, \ldots, \vec{v}_n)$ is linearly dependent since $\vec{w}_1$ is in the span of the $\vec{v}_i$. By the theorem above, we can remove one of the $\vec{v}_i$ and still have a spanning list of length $n$.
Repeating this, we can always add one $\vec{w}$ vector to the beginning of the list, while deleting a $\vec{v}$ vector from the list. This maintains a list of length $n$ which spans all of $V$. We know that it must be a $\vec{v}$ which gets deleted, because the $\vec{w}$’s are all linearly independent. If $m > n$, then at the $n$th stage of this process we obtain that $(\vec{w}_1, \vec{w}_2, \ldots, \vec{w}_n)$ spans all of $V$, which contradicts the fact that $\vec{w}_{n+1}$ is supposed to be linearly independent from the rest of the $\vec{w}$’s.
Definition 8 An ordered list of vectors $B = (\vec{v}_1, \vec{v}_2, \vec{v}_3, \ldots, \vec{v}_n)$ is called a basis of the vector space $V$ if $B$ both spans $V$ and is linearly independent.
Let $V$ be a finite dimensional vector space. Show that $V$ has a basis. Let $(\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_n)$ be a spanning list of vectors (which exists and is finite since $V$ is finite dimensional). If this list is linearly dependent we can go through the following process: For each $i$, if $\vec{v}_i \in \mathrm{Span}(\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_{i-1})$, delete $\vec{v}_i$ from the list. Note that this also covers the 1st case: if $\vec{v}_1 = \vec{0}$, delete it from the list.
At the end of this process, we have a list of vectors which spans $V$, and in which no vector is in the span of the previous vectors. By the theorem above, the list is linearly independent. So this new list is a basis for $V$.
Note: Let $V$ be a finite dimensional vector space. Then every basis of $V$ has the same length. In other words, if $\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_n$ is a basis and $\vec{w}_1, \vec{w}_2, \ldots, \vec{w}_m$ is a basis, then $n = m$. This follows because we have already proven that $n \leq m$ and $m \leq n$.
Deﬁnition 9 We say that a ﬁnite dimensional vector space has dimension n if
it has a basis of length n.
Let $p_1, p_2, \ldots, p_n, p_{n+1}$ be polynomials in the space $P_n$ of all polynomials of degree at most $n$. Assume $p_i(3) = 0$ for $i = 1, 2, \ldots, n + 1$. Is it possible that $p_1, p_2, \ldots, p_n, p_{n+1}$ are all linearly independent? Why or why not? No. If $p_1, p_2, \ldots, p_n, p_{n+1}$ were all linearly independent then they would form a basis of $P_n$, since $P_n$ has dimension $n + 1$. But every polynomial in the span of the $p_i$ must evaluate to 0 at $x = 3$, while some polynomials in $P_n$ do not evaluate to 0 at $x = 3$ (for example, the polynomial $x$).
53 Matrix of a linear map
Matrices record where basis vectors go.
In our first brush with linear algebra, we only dealt with linear maps between the spaces $\mathbb{R}^n$ for varying $n$. For those maps and those spaces, the convenient standard basis allowed us to record linear maps using the finite data of a matrix.
In this section we will see that a similar story plays out for maps between ﬁnite
dimensional vector spaces: they too can be described by a matrix, but only after
making a choice of “basis” on the domain and codomain.
Question 1 Let $V$ be a vector space with basis $(\vec{v}_1, \vec{v}_2, \vec{v}_3)$ and $W$ be a vector space with basis $(\vec{w}_1, \vec{w}_2)$.
Suppose there is a linear map $L : V \to W$ for which
\begin{align*}
L(\vec{v}_1) &= 3\vec{w}_1 + 2\vec{w}_2, \\
L(\vec{v}_2) &= 3\vec{w}_1 - 2\vec{w}_2, \text{ and} \\
L(\vec{v}_3) &= \vec{w}_1 + \vec{w}_2.
\end{align*}
In light of all this, compute $L(2\vec{v}_1)$. But how will we write down our answer? Where does $L(2\vec{v}_1)$ live?
Solution
(a) In $W$.
(b) In $V$.
And since we know that $L(2\vec{v}_1) \in W$, we’ll write our answer as $\alpha\vec{w}_1 + \beta\vec{w}_2$ for some numbers $\alpha$ and $\beta$.
So say $L(2\vec{v}_1) = \alpha\vec{w}_1 + \beta\vec{w}_2$.
Solution In this case, $\alpha = 6$.
Solution And $\beta = 4$.
Next compute $L(2\vec{v}_1 + 3\vec{v}_2 - 4\vec{v}_3) = \alpha\vec{w}_1 + \beta\vec{w}_2$.
Solution
Hint: Use the fact that $L(2\vec{v}_1 + 3\vec{v}_2 - 4\vec{v}_3) = L(2\vec{v}_1) + L(3\vec{v}_2) - L(4\vec{v}_3)$.
Hint: Further use the fact that $L(2\vec{v}_1) + L(3\vec{v}_2) - L(4\vec{v}_3) = 2L(\vec{v}_1) + 3L(\vec{v}_2) - 4L(\vec{v}_3)$.
In this case, $\alpha = 11$ but $\beta = -6$.
What we are seeing is an instance of the following observation.
Observation 2 If $L : V \to W$ is a linear map, and you know the value of $L(\vec{v}_i)$ for each vector in the basis $(\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_n)$ of $V$, then you can compute $L(\vec{v})$ for any $\vec{v} \in V$.
And by “compute,” I mean you can write down $L(\vec{v})$ in terms of a basis of $W$.
Definition 3 Let $L : V \to W$ be a linear map between finite dimensional vector spaces, let $B_V = (\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_n)$ be a basis for $V$, and let $B_W = (\vec{w}_1, \vec{w}_2, \ldots, \vec{w}_m)$ be a basis for $W$. For each $i$, write
\[ L(\vec{v}_i) = a_{i,1}\vec{w}_1 + a_{i,2}\vec{w}_2 + \cdots + a_{i,m}\vec{w}_m. \]
Then the matrix of $L$ with respect to the bases $B_V$ and $B_W$ is the matrix $M$ whose entry in the $i$th column and $j$th row is $a_{i,j}$.
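To see the definition in action, here is a small sketch using the map from Question 1 of this section. It assumes the coefficients of each $L(\vec{v}_i)$ are stored as a list, one list per basis vector of $V$; the helper names `matrix_of_map` and `mat_vec` are hypothetical.

```python
def matrix_of_map(images):
    """images[i] holds the coefficients of L(v_i) in the basis of W.
    The matrix has those coefficient lists as its columns."""
    return [list(row) for row in zip(*images)]

def mat_vec(matrix, coords):
    """Multiply a matrix (list of rows) by a coordinate vector."""
    return [sum(a * x for a, x in zip(row, coords)) for row in matrix]

# L(v1) = 3w1 + 2w2, L(v2) = 3w1 - 2w2, L(v3) = w1 + w2
images = [[3, 2], [3, -2], [1, 1]]
M = matrix_of_map(images)
print(M)

# Coordinates of 2v1 + 3v2 - 4v3 are (2, 3, -4); the product gives (alpha, beta).
print(mat_vec(M, [2, 3, -4]))
```

The second line of output should reproduce the values of $\alpha$ and $\beta$ found in Question 1.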
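Question 4 Let
\[
B_1 = \left( \begin{bmatrix} 1 \\ 1 \end{bmatrix}_{\mathcal{E}_2}, \begin{bmatrix} 1 \\ 0 \end{bmatrix}_{\mathcal{E}_2} \right)
\quad\text{and}\quad
B_2 = \left( \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}_{\mathcal{E}_3}, \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}_{\mathcal{E}_3}, \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}_{\mathcal{E}_3} \right)
\]
be bases for $\mathbb{R}^2$ and $\mathbb{R}^3$, respectively.
What is the matrix for the linear map $L\left(\begin{bmatrix} x \\ y \end{bmatrix}_{\mathcal{E}_2}\right) = \begin{bmatrix} x + y \\ x \\ 0 \end{bmatrix}_{\mathcal{E}_3}$ with respect to the bases $B_1$ and $B_2$?
Solution
Hint: The first column of the matrix will be $L\left(\begin{bmatrix} 1 \\ 1 \end{bmatrix}_{\mathcal{E}_2}\right)$, but written with respect to the basis $B_2$. Remember the order of vectors in the basis matters.
Hint:
\[
L\left(\begin{bmatrix} 1 \\ 1 \end{bmatrix}_{\mathcal{E}_2}\right) = \begin{bmatrix} 2 \\ 1 \\ 0 \end{bmatrix}_{\mathcal{E}_3}
= 2\begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}_{\mathcal{E}_3} + 0\begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix}_{\mathcal{E}_3} + 1\begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}_{\mathcal{E}_3}
\]
Hint: So the first column of the matrix is $\begin{bmatrix} 2 \\ 0 \\ 1 \end{bmatrix}_{B_2}$.
Hint: Similarly, the second column is $\begin{bmatrix} 1 \\ 0 \\ 1 \end{bmatrix}_{B_2}$.
Hint: So the matrix of this linear map is
\[ \begin{bmatrix} 2 & 1 \\ 0 & 0 \\ 1 & 1 \end{bmatrix} \]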
Question 5 Let $P_2$ be the space of polynomials of degree at most 2. Let $B_0 = (1, x, x^2)$ and $B_1 = (1, (x - 1), (x - 1)^2)$. Consider the map $L : P_2 \to \mathbb{R}$ given by $L(p) = p(1)$.
What is the matrix of this linear map with respect to the basis $B_0$?
Solution
Hint:
\begin{align*}
L(1) &= 1 \\
L(x) &= 1 \\
L(x^2) &= 1^2 = 1
\end{align*}
Hint: The matrix of $L$ with respect to $B_0$ is $\begin{bmatrix} 1 & 1 & 1 \end{bmatrix}$
What is the matrix of this linear map with respect to the basis $B_1$?
Solution
Hint:
\begin{align*}
L(1) &= 1 \\
L(x - 1) &= 1 - 1 = 0 \\
L((x - 1)^2) &= (1 - 1)^2 = 0
\end{align*}
Hint: The matrix of $L$ with respect to $B_1$ is $\begin{bmatrix} 1 & 0 & 0 \end{bmatrix}$
Question 6 Let $P_3$ be the space of polynomials of degree at most 3. Let $B = (1, x, x^2, x^3)$. Consider the map $L : P_3 \to P_3$ given by $L(p(x)) = \frac{d}{dx}p(x)$. This map is linear (why?).
What is the matrix for $L$ with respect to the basis $B$?
Solution
Hint: $L$ is linear because the derivative of a sum of two functions is the sum of the derivatives of the two functions, and because the derivative of a constant times a function is the constant times the derivative of the function.
Hint:
\begin{align*}
L(1) &= 0 \\
L(x) &= 1 \\
L(x^2) &= 2x \\
L(x^3) &= 3x^2
\end{align*}
Hint: The matrix of $L$ is
\[ \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 2 & 0 \\ 0 & 0 & 0 & 3 \\ 0 & 0 & 0 & 0 \end{bmatrix} \]
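The same pattern works for polynomials of any degree. As a sketch (with hypothetical names `derivative_matrix` and `mat_vec`, and polynomials stored as coefficient lists in the basis $(1, x, \ldots, x^n)$), the matrix of the derivative can be built column by column and then applied to a sample polynomial:

```python
def derivative_matrix(n):
    """Matrix of d/dx on polynomials of degree at most n,
    with respect to the basis (1, x, ..., x**n)."""
    size = n + 1
    M = [[0] * size for _ in range(size)]
    for k in range(1, size):
        # d/dx x**k = k * x**(k-1): column k has the entry k in row k-1.
        M[k - 1][k] = k
    return M

def mat_vec(matrix, coords):
    """Multiply a matrix (list of rows) by a coordinate vector."""
    return [sum(a * x for a, x in zip(row, coords)) for row in matrix]

D = derivative_matrix(3)
for row in D:
    print(row)

# p(x) = 5 + x - 4x^2 + 2x^3 has derivative 1 - 8x + 6x^2.
print(mat_vec(D, [5, 1, -4, 2]))
```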
54 Subspaces
A subspace is a subset of a vector space which is also a vector space.
Deﬁnition 1 A subset U of a vector space V is a subspace of V if U is a vector
space with respect to the scalar multiplication and the vector addition inherited
from V .
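Question 2 Which of the following is a subspace of $\mathbb{R}^2$?
Solution
(a) The $x$-axis.
(b) The set $P = \left\{ \begin{bmatrix} 1 \\ 2 \end{bmatrix} \right\}$.
(c) The line $\ell = \left\{ \begin{bmatrix} x \\ y \end{bmatrix} \in \mathbb{R}^2 : y = x + 1 \right\}$.
Hint: The vectors $\begin{bmatrix} 1 \\ 2 \end{bmatrix}$ and $\begin{bmatrix} 0 \\ 1 \end{bmatrix}$ are both on the line $\ell$, but the sum $\begin{bmatrix} 1 \\ 2 \end{bmatrix} + \begin{bmatrix} 0 \\ 1 \end{bmatrix}$ is not on the line $\ell$.
Hint: So $\ell$ is not a subspace.
Hint: The set $P$ consists of a single vector.
Hint: But the vector in $P$ is not the origin $\begin{bmatrix} 0 \\ 0 \end{bmatrix}$.
Hint: So $\begin{bmatrix} 1 \\ 2 \end{bmatrix} \in P$ but $10\begin{bmatrix} 1 \\ 2 \end{bmatrix} \notin P$.
Hint: So $P$ is not a subspace.
Hint: By process of elimination, the $x$-axis must be a subspace. Is it really?
Hint: Yes, if I multiply any vector $\begin{bmatrix} x \\ 0 \end{bmatrix}$ by a scalar, it is still on the $x$-axis.
Hint: And if I add together two vectors of the form $\begin{bmatrix} x \\ 0 \end{bmatrix}$, the result is still on the $x$-axis.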
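Let’s look at some more examples! Which of the following is a subspace of $\mathbb{R}^2$?
Solution
(a) The set $A = \left\{ \begin{bmatrix} x \\ y \end{bmatrix} : x \geq 0 \text{ and } y \geq 0 \right\}$
(b) The set $B = \left\{ \begin{bmatrix} x \\ y \end{bmatrix} : x = 2y \right\}$
(c) The set $C = \left\{ \begin{bmatrix} x \\ y \end{bmatrix} : |y| \geq |x| \right\}$
Hint: The set $A$ is not a subspace because $\begin{bmatrix} 1 \\ 0 \end{bmatrix} \in A$, but $-1\begin{bmatrix} 1 \\ 0 \end{bmatrix}$ is not in $A$, so a scalar multiple of something in $A$ need not be in $A$.
Hint: The set $C$ is not a subspace because even though it is closed under scalar multiplication (check this!) it is not closed under vector addition, since $\begin{bmatrix} 1 \\ -2 \end{bmatrix}$ and $\begin{bmatrix} 1 \\ 2 \end{bmatrix}$ are both in $C$, but their sum $\begin{bmatrix} 2 \\ 0 \end{bmatrix}$ is not (draw a picture of this example!).
Hint: As the only choice left, $B$ must be a subspace. The reason is that it is just the span of the vector $\begin{bmatrix} 2 \\ 1 \end{bmatrix}$, and as such, is closed under scalar multiplication and vector addition.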
Question 3 What about the line $y = 3$? Does it form a subspace of $\mathbb{R}^2$?
Solution
(a) Yes.
(b) No.
That’s right; the tip of the vector $\begin{bmatrix} 0 \\ 3 \end{bmatrix}$ is on that line, but a scalar multiple of that vector, like $2\begin{bmatrix} 0 \\ 3 \end{bmatrix} = \begin{bmatrix} 0 \\ 6 \end{bmatrix}$, is not on the line.
So when do the points on a line in $\mathbb{R}^2$ form a subspace?
Solution
(a) When the line passes through the point $(0, 0)$.
(b) When the line is parallel to the $x$-axis.
This is an important observation.
Observation 4 Suppose $U$ is a subspace of a vector space $V$. Then the “zero vector” is in $U$.
55 Kernel
A kernel is everything sent to zero.
There are some special subspaces that we will want to pay attention to.
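Theorem 1 If $L : V \to W$ is a linear transformation, then the kernel of $L$, defined by
\[ \ker(L) = \{\vec{v} \in V : L(\vec{v}) = \vec{0}\}, \]
is a subspace of $V$.
You may also hear this referred to as the null space of $L$.
Prove this theorem that $\ker L$ is a subspace.
We only need to show that $\ker(L)$ is closed under scalar multiplication and vector addition.
For any $\vec{v} \in \ker(L)$ and $c \in \mathbb{R}$,
\begin{align*}
L(c\vec{v}) &= cL(\vec{v}) \\
&= c\vec{0} \\
&= \vec{0}
\end{align*}
so $c\vec{v} \in \ker(L)$.
If $\vec{v}, \vec{w} \in \ker(L)$, then
\begin{align*}
L(\vec{v} + \vec{w}) &= L(\vec{v}) + L(\vec{w}) \\
&= \vec{0} + \vec{0} \\
&= \vec{0}
\end{align*}
so $\vec{v} + \vec{w} \in \ker(L)$.
Thus $\ker(L)$ is a subspace!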
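Question 2 Let $L : \mathbb{R}^3 \to \mathbb{R}^2$ be the linear map whose matrix is
\[ \begin{bmatrix} 2 & 3 & 1 \\ 1 & 0 & -1 \end{bmatrix} \]
Which of the following vectors is in the kernel of $L$?
Solution
(a) $\begin{bmatrix} 1 \\ -1 \\ 1 \end{bmatrix}$ ✓
(b) $\begin{bmatrix} 3 \\ 2 \\ 0 \end{bmatrix}$
(c) $\begin{bmatrix} 0 \\ 0 \\ 2 \end{bmatrix}$
Hint: Just by evaluating all three, we see the only one which gets sent to $\begin{bmatrix} 0 \\ 0 \end{bmatrix}$ by $L$ is $\begin{bmatrix} 1 \\ -1 \\ 1 \end{bmatrix}$.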
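“Evaluating all three” candidates is a quick computation. A minimal sketch, with hypothetical helper names `mat_vec` and `in_kernel` and the matrix stored as a list of rows:

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]

def in_kernel(matrix, vector):
    """True when the matrix sends the vector to the zero vector."""
    return all(entry == 0 for entry in mat_vec(matrix, vector))

M = [[2, 3, 1],
     [1, 0, -1]]

for v in [[1, -1, 1], [3, 2, 0], [0, 0, 2]]:
    # Print the candidate, its image under L, and whether it lies in ker(L).
    print(v, mat_vec(M, v), in_kernel(M, v))
```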
Theorem 3 A linear map $L : V \to W$ is injective if and only if $\ker(L) = \{\vec{0}\}$.
Definition 4 The word “injective” is an adjective meaning the same thing as “one to one.” In other words, a function $f : A \to B$ is injective if $f(a_1) = f(a_2)$ implies $a_1 = a_2$.
Prove this theorem.
Let $L$ be injective. Then $L(\vec{v}) = \vec{0}$ implies $L(\vec{v}) = L(\vec{0})$. Since $L$ is injective, this implies $\vec{v} = \vec{0}$. Thus the only element of the kernel is $\vec{0}$.
On the other hand, if $\ker(L) = \{\vec{0}\}$, then if $L(\vec{v}_1) = L(\vec{v}_2)$, then $L(\vec{v}_1 - \vec{v}_2) = \vec{0}$, so $\vec{v}_1 - \vec{v}_2$ is in the null space, and hence must be equal to $\vec{0}$. But then we can conclude that $\vec{v}_1 = \vec{v}_2$.
Definition 5 The dimension of the kernel of $L$ is the nullity of $L$.
Be careful to observe that $\ker L$ is a subspace, while $\dim \ker L$ is a number, so the nullity of $L$ is just a number.
56 Image
The “image” is every actual output.
Definition 1 If $L : V \to W$ is a linear transformation, then the image of $L$ is
\[ \mathrm{Imag}(L) = \{\vec{w} \in W : \exists \vec{v} \in V, L(\vec{v}) = \vec{w}\}. \]
Remember to read ∃ as “there exists.”
Warning 2 Some people may call this the “range.” Some other people use the
word “range” for what we’ve been calling the codomain. The result is that, in my
opinion, the word “range” is now overused, so we give up and never use the word.
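Question 3 Suppose $L : \mathbb{R}^2 \to \mathbb{R}^3$ is a linear map, and suppose that
\[
L\left(\begin{bmatrix} 1 \\ 0 \end{bmatrix}\right) = \begin{bmatrix} 3 \\ 2 \\ 1 \end{bmatrix}, \text{ and }
L\left(\begin{bmatrix} 0 \\ 1 \end{bmatrix}\right) = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}.
\]
What is a vector $\vec{v} \in \mathbb{R}^2$ so that $L(\vec{v}) = \begin{bmatrix} 2 \\ 1 \\ 0 \end{bmatrix}$?
Solution
Hint: Use the fact that $\begin{bmatrix} 2 \\ 1 \\ 0 \end{bmatrix} = \begin{bmatrix} 3 \\ 2 \\ 1 \end{bmatrix} - \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}$.
Hint: In other words, $\begin{bmatrix} 2 \\ 1 \\ 0 \end{bmatrix} = L(\vec{e}_1) - L(\vec{e}_2)$.
Hint: By linearity of $L$, we have $L(\vec{e}_1) - L(\vec{e}_2) = L(\vec{e}_1 - \vec{e}_2)$.
Hint: And so a vector in the domain which is sent to $\begin{bmatrix} 2 \\ 1 \\ 0 \end{bmatrix}$ is the vector $\begin{bmatrix} 1 \\ -1 \end{bmatrix}$.
This is a special case of a general fact: if we have two vectors in the image, then their sum is in the image, too.
Theorem 4 The image of a linear map is a subspace of the codomain.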
Prove this.
If $\vec{w} \in \mathrm{Imag}(L)$, then there is a $\vec{v} \in V$ with $L(\vec{v}) = \vec{w}$. $L(c\vec{v}) = cL(\vec{v}) = c\vec{w}$, so $c\vec{w} \in \mathrm{Imag}(L)$ for any $c \in \mathbb{R}$. Thus $\mathrm{Imag}(L)$ is closed under scalar multiplication.
If $\vec{w}_1, \vec{w}_2 \in \mathrm{Imag}(L)$, then there are $\vec{v}_1, \vec{v}_2 \in V$ with $L(\vec{v}_1) = \vec{w}_1$ and $L(\vec{v}_2) = \vec{w}_2$. $L(\vec{v}_1 + \vec{v}_2) = L(\vec{v}_1) + L(\vec{v}_2) = \vec{w}_1 + \vec{w}_2$, so $\vec{w}_1 + \vec{w}_2 \in \mathrm{Imag}(L)$. Thus $\mathrm{Imag}(L)$ is closed under vector addition.
We finish with some terminology.
Definition 5 The dimension of the image of $L$ is the rank of $L$.
Be careful to observe that the image of $L$ is a subspace, while the dimension of the image of $L$ is a number, so the rank of $L$ is just a number.
Question 6 Consider the linear map $L : \mathbb{R}^2 \to \mathbb{R}^3$ given by the matrix
\[ \begin{bmatrix} 2 & 1 \\ 4 & 2 \\ 6 & 3 \end{bmatrix}. \]
Solution
Hint: Not every vector in $\mathbb{R}^3$ is in the image of $L$.
Hint: Let’s think about which vectors are in the image of $L$.
Question 7 Is $\begin{bmatrix} 2 \\ 4 \\ 6 \end{bmatrix}$ in the image of $L$?
Solution
(a) Yes.
(b) No.
In fact, $\begin{bmatrix} 2 \\ 4 \\ 6 \end{bmatrix} = L\left(\begin{bmatrix} 1 \\ 0 \end{bmatrix}\right)$.
Is $\begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}$ in the image of $L$?
Solution
(a) Yes.
(b) No.
But how can we tell? The only things in the image of $L$ are vectors of the form $\begin{bmatrix} x \\ 2x \\ 3x \end{bmatrix}$ for some $x \in \mathbb{R}$. This is the span of $\left(\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}\right)$.
So what is the dimension of the vector space spanned by this single vector?
Solution
(a) 0
(b) 1
(c) 2
And so the rank is one.
The rank of $L$ is 1.
57 Rank nullity theorem
Rank plus nullity is the dimension of the domain.
Theorem 1 (Rank-Nullity) If $L : V \to W$ is a linear transformation and $V$ is finite dimensional, then the sum of the dimension of the kernel of $L$ and the dimension of the image of $L$ is the dimension of $V$.
The dimension of the kernel is sometimes called the “nullity” of $L$, and the dimension of the image is sometimes called the “rank” of $L$.
Hence the name “rank-nullity” theorem.
Prove this theorem.
Warning 2 This is hard!
Let $\vec{v}_1, \vec{v}_2, \vec{v}_3, \ldots, \vec{v}_m$ be a basis of $\ker(L)$. We can extend this to a basis of $V$, $\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_m, \vec{u}_1, \vec{u}_2, \ldots, \vec{u}_k$, so that the dimension of $V$ is $m + k$. We will be done if we can show that $L(\vec{u}_1), L(\vec{u}_2), \ldots, L(\vec{u}_k)$ form a basis of $\mathrm{Im}(L)$.
Let $\vec{w} \in \mathrm{Im}(L)$. Then $\vec{w} = L(\vec{v})$ for some $\vec{v} \in V$. Since $\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_m, \vec{u}_1, \vec{u}_2, \ldots, \vec{u}_k$ is a basis of $V$, we can write
\begin{align*}
\vec{w} &= L(a_1\vec{v}_1 + a_2\vec{v}_2 + \cdots + a_m\vec{v}_m + b_1\vec{u}_1 + \cdots + b_k\vec{u}_k) \\
&= a_1L(\vec{v}_1) + \cdots + a_mL(\vec{v}_m) + b_1L(\vec{u}_1) + \cdots + b_kL(\vec{u}_k) \\
&= b_1L(\vec{u}_1) + \cdots + b_kL(\vec{u}_k) && \text{because each $\vec{v}_i$ is in the kernel}
\end{align*}
So $\mathrm{Im}(L)$ is spanned by the $L(\vec{u}_i)$. Now we need to see that the $L(\vec{u}_i)$ are linearly independent.
Assume $b_1L(\vec{u}_1) + b_2L(\vec{u}_2) + \cdots + b_kL(\vec{u}_k) = \vec{0}$. Then $L(b_1\vec{u}_1 + \cdots + b_k\vec{u}_k) = \vec{0}$. Then $b_1\vec{u}_1 + \cdots + b_k\vec{u}_k$ would be in the null space of $L$, so it could be written as $c_1\vec{v}_1 + \cdots + c_m\vec{v}_m$. But then $b_1\vec{u}_1 + \cdots + b_k\vec{u}_k - c_1\vec{v}_1 - \cdots - c_m\vec{v}_m = \vec{0}$, and the $\vec{v}_i$ together with the $\vec{u}_i$ form a basis of $V$, so they are linearly independent. So $b_1 = b_2 = \cdots = b_k = 0$. Thus the $L(\vec{u}_i)$ are linearly independent and we are done.
58 Eigenvectors
Eigenvectors are mapped to multiples of themselves.
Deﬁnition 1 Let L : V → V be a linear map. A vector v ∈ V is called an
eigenvector of L if L(v) = λv for some λ ∈ R.
A constant λ ∈ R is called an eigenvalue of L if there is a nonzero eigenvector
v with L(v) = λv.
Geometrically, eigenvectors of L are those vectors whose direction is not changed
(or at worst, negated!) when they are transformed by L.
Let’s try some examples.
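Question 2 Suppose $L : \mathbb{R}^2 \to \mathbb{R}^2$ is the linear map represented by the matrix $\begin{bmatrix} 3 & 2 \\ 4 & 1 \end{bmatrix}$.
Which of these vectors is an eigenvector of $L$?
Solution
(a) $\begin{bmatrix} 1 \\ 1 \end{bmatrix}$ ✓
(b) $\begin{bmatrix} 1 \\ -1 \end{bmatrix}$
Hint: Question 3 What is $L\left(\begin{bmatrix} 1 \\ -1 \end{bmatrix}\right)$?
Solution
Hint: We want to compute $\begin{bmatrix} 3 & 2 \\ 4 & 1 \end{bmatrix}\begin{bmatrix} 1 \\ -1 \end{bmatrix}$.
Hint: In this case, $\begin{bmatrix} 3 & 2 \\ 4 & 1 \end{bmatrix}\begin{bmatrix} 1 \\ -1 \end{bmatrix} = \begin{bmatrix} 1 \\ 3 \end{bmatrix}$.
Is $\begin{bmatrix} 1 \\ 3 \end{bmatrix}$ a multiple of $\begin{bmatrix} 1 \\ -1 \end{bmatrix}$?
Solution
(a) No.
(b) Yes.
Consequently, $\begin{bmatrix} 1 \\ -1 \end{bmatrix}$ is not an eigenvector. The eigenvector must be $\begin{bmatrix} 1 \\ 1 \end{bmatrix}$.
That’s right! Note that $L\left(\begin{bmatrix} 1 \\ 1 \end{bmatrix}\right)$ is $\begin{bmatrix} 5 \\ 5 \end{bmatrix} = 5\begin{bmatrix} 1 \\ 1 \end{bmatrix}$, and so $\begin{bmatrix} 1 \\ 1 \end{bmatrix}$ is an eigenvector.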
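Which of the following is another eigenvector?
Solution
(a) $\begin{bmatrix} 1 \\ -2 \end{bmatrix}$ ✓
(b) $\begin{bmatrix} 2 \\ -1 \end{bmatrix}$
Hint: Try computing $L\left(\begin{bmatrix} 1 \\ -2 \end{bmatrix}\right)$.
Hint: In this case, $L\left(\begin{bmatrix} 1 \\ -2 \end{bmatrix}\right) = \begin{bmatrix} -1 \\ 2 \end{bmatrix}$.
Hint: Question 4 Find $\lambda \in \mathbb{R}$ so that $\begin{bmatrix} -1 \\ 2 \end{bmatrix} = \lambda\begin{bmatrix} 1 \\ -2 \end{bmatrix}$.
Solution
Hint: The sign is opposite on both sides of the equation.
Hint: So try $\lambda = -1$.
$\lambda = -1$
And so $-1$ is an eigenvalue, with eigenvector $\begin{bmatrix} 1 \\ -2 \end{bmatrix}$.
Rock on! We check that
\[ L\left(\begin{bmatrix} 1 \\ -2 \end{bmatrix}\right) = \begin{bmatrix} -1 \\ 2 \end{bmatrix} = -1\begin{bmatrix} 1 \\ -2 \end{bmatrix} \]
and so $\begin{bmatrix} 1 \\ -2 \end{bmatrix}$ is an eigenvector with eigenvalue $-1$.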
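The test used in these questions (“is $L(\vec{v})$ a multiple of $\vec{v}$?”) is easy to automate. The sketch below assumes integer entries so that exact fractions can be used; `mat_vec` and `eigenvalue_for` are hypothetical names.

```python
from fractions import Fraction

def mat_vec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in M]

def eigenvalue_for(M, v):
    """Return lambda with M v = lambda v if v is a nonzero eigenvector,
    and None otherwise (including when v is the zero vector)."""
    w = mat_vec(M, v)
    index = next((i for i, x in enumerate(v) if x != 0), None)
    if index is None:
        return None
    # The only possible lambda is forced by one nonzero coordinate of v.
    lam = Fraction(w[index], v[index])
    return lam if all(wi == lam * vi for wi, vi in zip(w, v)) else None

M = [[3, 2], [4, 1]]
for v in [[1, 1], [1, -1], [1, -2], [2, -1]]:
    print(v, mat_vec(M, v), eigenvalue_for(M, v))
```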
59 Eigenvalues
Eigenvalues measure how much eigenvectors are scaled.
Deﬁnition 1 Let L : V → V be a linear operator (NB: linear maps with the
same domain and codomain are called linear operators). The set of all eigenvalues
of L is the spectrum of L.
Let’s try ﬁnding the spectrum.
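Question 2 Let $L : \mathbb{R}^2 \to \mathbb{R}^2$ be the linear map whose matrix is $\begin{bmatrix} 1 & 2 \\ 2 & 1 \end{bmatrix}$ with respect to the standard basis. $L$ has two different eigenvalues. What are they? Give your answer as $\begin{bmatrix} \lambda_1 \\ \lambda_2 \end{bmatrix}$, where $\lambda_1 \leq \lambda_2$.
Solution
Hint: For $\lambda$ to be an eigenvalue we need
\[ \begin{bmatrix} 1 & 2 \\ 2 & 1 \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix} = \lambda\begin{bmatrix} x \\ y \end{bmatrix} \]
Hint: This is the same as
\[
\begin{cases} x + 2y = \lambda x \\ 2x + y = \lambda y \end{cases}
\quad\text{or}\quad
\begin{cases} (1 - \lambda)x + 2y = 0 \\ 2x + (1 - \lambda)y = 0 \end{cases}
\]
Hint: These are two lines passing through the origin. To have more than just the origin as a solution, we need that the slope of the two lines is the same. So
\[ \frac{1 - \lambda}{2} = \frac{2}{1 - \lambda} \]
Hint:
\begin{align*}
\frac{1 - \lambda}{2} &= \frac{2}{1 - \lambda} \\
(1 - \lambda)^2 &= 4 \\
1 - \lambda &= \pm 2 \\
\lambda &= -1 \text{ or } 3
\end{align*}
Hint: Let us now check that these really are eigenvalues:
If we let $\lambda = -1$, we have the equation $2x + 2y = 0$. Check that $\begin{bmatrix} 1 \\ -1 \end{bmatrix}$ is an eigenvector with eigenvalue $-1$.
If we let $\lambda = 3$, we have the equation $2x - 2y = 0$. Check that $\begin{bmatrix} 1 \\ 1 \end{bmatrix}$ is an eigenvector with eigenvalue $3$.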
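For a general $2 \times 2$ matrix with rows $(a, b)$ and $(c, d)$, cross-multiplying the same slope condition gives $(a - \lambda)(d - \lambda) - bc = 0$, a quadratic in $\lambda$. The sketch below solves that quadratic with the usual formula; the function name `real_eigenvalues_2x2` is hypothetical, and floating point arithmetic is assumed to be accurate enough for small examples.

```python
import math

def real_eigenvalues_2x2(M):
    """Real solutions of (a - t)(d - t) - b*c = 0, in increasing order.
    An empty list means there is no real eigenvalue."""
    (a, b), (c, d) = M
    trace = a + d
    det = a * d - b * c
    # (a - t)(d - t) - bc = t^2 - trace*t + det
    disc = trace * trace - 4 * det
    if disc < 0:
        return []
    root = math.sqrt(disc)
    return sorted({(trace - root) / 2, (trace + root) / 2})

print(real_eigenvalues_2x2([[1, 2], [2, 1]]))
print(real_eigenvalues_2x2([[3, 2], [4, 1]]))
```

The first matrix is the one from this question and the second is the one from the eigenvector questions earlier, so the printed values can be compared with the answers found by hand.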
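Question 3 Let’s try another example. Suppose $F : \mathbb{R}^2 \to \mathbb{R}^2$ is the linear map represented by the matrix $\begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix}$.
Which of these numbers is an eigenvalue of $F$?
Solution
Hint: Let’s suppose that $\begin{bmatrix} x \\ y \end{bmatrix}$ is an eigenvector.
Hint: Then there is some $\lambda \in \mathbb{R}$ so that $\begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix} = \lambda\begin{bmatrix} x \\ y \end{bmatrix}$.
Hint: But $\begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} -y \\ x \end{bmatrix}$.
Hint: And so $\begin{bmatrix} -y \\ x \end{bmatrix} = \lambda\begin{bmatrix} x \\ y \end{bmatrix}$.
Hint: This means that $-y = \lambda x$ and $x = \lambda y$.
Hint: Putting this together, $-y = \lambda^2 y$ and $x = -\lambda^2 x$.
Hint: Since we are looking for a nonzero eigenvector (in order to have an eigenvalue), we must have that either $x \neq 0$ or $y \neq 0$.
Hint: Consequently, $\lambda^2 = -1$.
Hint: But there is no real number $\lambda \in \mathbb{R}$ so that $\lambda^2 = -1$, since the square of any real number is nonnegative.
Hint: Therefore, there is no real eigenvalue.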
(a) There is no real eigenvalue.
(b) −1
(c)

2
(d) 1
Perhaps surprisingly, some linear operators from $\mathbb{R}^n$ to $\mathbb{R}^n$ have no real eigenvalues at all.
Geometrically, what is this linear map $F$ doing?
Solution
(a) Rotation by $90^\circ$ counterclockwise.
(b) Rotation by $90^\circ$ clockwise.
(c) Rotation by $180^\circ$.
This geometric fact also explains why there is no eigenvalue: what would be the corresponding eigenvector whose direction is unchanged by applying $F$? Every nonzero vector is moved by a rotation!
The additional fact that there are imaginary solutions to $\lambda^2 = -1$ is hinting that $i$ should have something to do with rotation, too.
60 Eigenspace
An eigenspace collects together all the eigenvectors for a given eigenvalue.
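Theorem 1 Let $\lambda$ be an eigenvalue of a linear operator $L : V \to V$. Then the set $E_\lambda(L) = \{\vec{v} \in V : L(\vec{v}) = \lambda\vec{v}\}$ of all (including zero) eigenvectors with eigenvalue $\lambda$ forms a subspace of $V$.
This subspace is the eigenspace associated to the eigenvalue $\lambda$.
Prove this theorem. We need to check that $E_\lambda(L)$ is closed under scalar multiplication and vector addition.
If $\vec{v} \in E_\lambda(L)$, and $c \in \mathbb{R}$, then $L(c\vec{v}) = cL(\vec{v}) = c\lambda\vec{v} = \lambda(c\vec{v})$, so $c\vec{v}$ is also an eigenvector of $L$ with eigenvalue $\lambda$.
If $\vec{v}_1, \vec{v}_2 \in E_\lambda(L)$, then $L(\vec{v}_1 + \vec{v}_2) = \lambda\vec{v}_1 + \lambda\vec{v}_2 = \lambda(\vec{v}_1 + \vec{v}_2)$, so $\vec{v}_1 + \vec{v}_2$ is also an eigenvector of $L$ with eigenvalue $\lambda$.
The kernel of $L$ is the eigenspace of the eigenvalue 0.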
61 Eigenbasis
An eigenbasis is a basis of eigenvectors.
Observation 1 If (v
1
, v
2
, ..., v
n
) is a basis of eigenvectors of a linear operator L,
then the matrix of L with respect to that basis is diagonal, with the eigenvalues of
L appearing along the diagonal.
Theorem 2 Let L : V → W be a linear map. If v
1
, v
2
, ..., v
n
are nonzero eigen-
vectors of L with distinct eigenvalues λ
1
, λ
2
, ...λ
n
, then (v
1
, v
2
, ..., v
n
) are linearly
independent.
Prove this theorem.
Assume to the contrary that the list is linearly dependent. Let $v_k$ be the first vector in the list which is in the span of the preceding vectors, so that the vectors $(v_1, v_2, \ldots, v_{k-1})$ are linearly independent. Write
$$a_1 v_1 + a_2 v_2 + \cdots + a_{k-1} v_{k-1} = v_k.$$
Applying $L$ to both sides of this equation, we have
$$a_1 \lambda_1 v_1 + a_2 \lambda_2 v_2 + \cdots + a_{k-1} \lambda_{k-1} v_{k-1} = \lambda_k v_k.$$
If we instead multiply the first equation by $\lambda_k$, we also have
$$a_1 \lambda_k v_1 + a_2 \lambda_k v_2 + \cdots + a_{k-1} \lambda_k v_{k-1} = \lambda_k v_k.$$
Subtracting these two equations, we have
$$a_1 (\lambda_k - \lambda_1) v_1 + a_2 (\lambda_k - \lambda_2) v_2 + \cdots + a_{k-1} (\lambda_k - \lambda_{k-1}) v_{k-1} = 0.$$
Since the vectors $(v_1, v_2, \ldots, v_{k-1})$ are linearly independent, we must have that $a_i (\lambda_k - \lambda_i) = 0$ for each $i$. But $\lambda_k \neq \lambda_i$, so $a_i = 0$ for each $i$. Looking back at where the $a_i$ came from, we see that this implies that $v_k = 0$. This contradicts the assumption that the $v_j$ were all nonzero.
So our assumption that the list was linearly dependent was absurd, hence the
list is linearly independent.
A corollary of this theorem is that if $V$ is $n$-dimensional and $L : V \to V$ has $n$ distinct eigenvalues, then choosing one eigenvector for each eigenvalue gives a basis of $V$. The matrix of the operator with respect to this basis is diagonal.
62 Python
We can ﬁnd eigenvectors in Python.
Let’s suppose I have an $n \times n$ matrix $M$, expressed in Python as a list of lists. For example, suppose
$$M = \begin{pmatrix} 6 & 4 & 3 \\ 4 & 5 & 2 \\ 3 & 2 & 7 \end{pmatrix},$$
which in Python is [[6,4,3],[4,5,2],[3,2,7]].
Further suppose that the matrix $M = (m_{ij})$ is symmetric, meaning that $m_{ij} = m_{ji}$. I’d like to compute an eigenvector of $M$ quickly.
Question 1 Here’s a procedure that I’d like you to code in Python:
(a) Start with some nonzero vector v, for instance the vector whose entries are all 1.
(b) Replace v with the result of applying the linear map $L_M$ to the vector v.
(c) Normalize v so that it has unit length.
(d) Repeat many times.
You can try print(eigenvector([[6,4,3],[4,5,2],[3,2,7]])) to see what happens in the case of the matrix above.
Solution
Python
    def eigenvector(M):
        # start with the vector whose entries are all 1
        v = [1] * len(M[0])
        # for many, many times
        #     replace v with Mv
        #     normalize v
        # return v

    def validator():
        v = eigenvector([[6, 5, 5], [5, 2, 3], [5, 3, 8]])
        if abs((v[1] / v[0]) - 0.6514182851) > 0.01:
            return False
        if abs((v[2] / v[0]) - 1.0603152077) > 0.01:
            return False
        return True
Can you use your program to ﬁnd, numerically, an eigenvector of the matrix
M?
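One possible way to fill in the procedure is sketched below. This is only an illustration, not the course's reference solution: the helper names apply_matrix and normalize, and the default of 200 repetitions, are assumptions made for this example. The last lines check the answer by estimating the eigenvalue as $\langle Mv, v \rangle$ and measuring how far $Mv$ is from that multiple of $v$.

```python
import math

def apply_matrix(M, v):
    """Return the matrix-vector product Mv, where M is a list of rows."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def normalize(v):
    """Scale v to unit length (assumes v is not the zero vector)."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def eigenvector(M, steps=200):
    """Approximate an eigenvector of M by repeatedly applying M and normalizing."""
    v = [1.0] * len(M[0])          # step (a): start with the all-ones vector
    for _ in range(steps):         # step (d): repeat many times
        v = apply_matrix(M, v)     # step (b): replace v with Mv
        v = normalize(v)           # step (c): rescale to unit length
    return v

M = [[6, 4, 3], [4, 5, 2], [3, 2, 7]]
v = eigenvector(M)
Mv = apply_matrix(M, v)
# Since v has unit length, <Mv, v> estimates the eigenvalue.
eigenvalue = sum(a * b for a, b in zip(Mv, v))
# If v is (nearly) an eigenvector, Mv - eigenvalue * v is (nearly) zero.
residual = max(abs(a - eigenvalue * b) for a, b in zip(Mv, v))
print(v, eigenvalue, residual)
```

The procedure finds an eigenvector for the eigenvalue of largest absolute value, which is why the number of repetitions needed depends on how well separated that eigenvalue is from the others.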
63 Cayley-Hamilton theorem
Sometimes eigen-information reveals quite a bit about linear operators.
We will not be proving (or even stating!) the Cayley-Hamilton theorem¹, but there
is one very special case which provides a nice activity. This activity will force us to
Here’s the setup: suppose $L : \mathbb{R}^2 \to \mathbb{R}^2$ is a linear map, and it has an eigenvector $u$ (with eigenvalue 2) and an eigenvector $w$ (with eigenvalue 3).
Question 1 Now suppose $v \in \mathbb{R}^2$ is some arbitrary vector. How does $L(L(v))$ compare to $-6v + 5L(v)$?
Solution
(a) $L(L(v)) = -6v + 5L(v)$
(b) $L(L(v)) \neq -6v + 5L(v)$
(c) It cannot be determined from the information given.
Why is this the case?
Solution
Hint: The vectors $u$ and $w$ together form a basis for $\mathbb{R}^2$.
Can we write v as αu + β w?
(a) Yes.
(b) No.
What is L(v) in terms of α, β, u, and w?
Solution
(a) αL(u) + βL( w)
(b) αL( w) + βL(u)
But what is L(u)?
Solution
Hint: Remember that $u$ is an eigenvector with eigenvalue 2.
Consequently $L(u) = 2u$.
(a) 2u
(b) 3u
(c) 2 w
(d) 3 w
And what is L( w)?
1
http://en.wikipedia.org/wiki/CayleyHamilton_theorem
Solution
Hint: Remember that $w$ is an eigenvector with eigenvalue 3.
Consequently $L(w) = 3w$.
(a) 2u
(b) 3u
(c) 2 w
(d) 3 w
Using these facts, what is L(v) in terms of α, β, u, and w?
Solution
(a) 2αu + 3β w
(b) 3αu + 2β w
(c) 2α w + 3β u
(d) 3α w + 2β u
Solution
Hint: Using linearity of $L$, what is $L(L(v))$?
(a) 2αL(u) + 3β L( w)
(b) 3αL(u) + 2β L( w)
(c) 2αL( w) + 3β L(u)
(d) 3αL( w) + 2β L(u)
Hint: But what is $L(u)$?
(a) L(u) = 2u
(b) L(u) = 3u
(c) L(u) = 2 w
(d) L(u) = 3 w
Hint: And what is $L(w)$?
(a) L( w) = 3 w
(b) L( w) = 2 w
(c) L( w) = 2u
(d) L( w) = 3u
Hint: Try substituting the facts that L(u) = 2u and L( w) = 3 w into 2αL(u) +
3β L( w).
What is L(L(v))?
(a) 4αu + 9β w
(b) 4α w + 9β u
(c) 9αu + 4β w
(d) 9α w + 4β u
What is −6v + 5 L(v) in terms of α, β, u, and w?
Solution
Hint: Earlier we wrote v = αu + β w.
Hint: Since L is a linear map, we have L(v) = αL(u) + βL( w).
(a) −6 (αu + β w) + 5αL(u) + 5βL( w)
(b) −6 (αu + β w) + 5βL(u) + 5αL( w)
(c) −6 (αu + β w) + 3αL(u) + 3βL( w)
(d) −6 (αu + β w) + 3βL(u) + 3αL( w)
Solution
Hint: But what is $L(u)$?
(a) L(u) = 2u
(b) L(u) = 3u
(c) L(u) = 2 w
(d) L(u) = 3 w
Hint: And what is $L(w)$?
(a) L( w) = 3 w
(b) L( w) = 2 w
(c) L( w) = 2u
(d) L( w) = 3u
Hint: Try substituting the facts that L(u) = 2u and L( w) = 3 w into −6 (αu + β w) +
5αL(u) + 5βL( w).
Hint: Then we get $-6\alpha u - 6\beta w + (5 \cdot 2)\alpha u + (5 \cdot 3)\beta w$.
Hint: But −6 + 10 = 4 and −6 + 15 = 9.
Hint: Consequently, this simpliﬁes to 4αu + 9β w.
Now write −6v + 5 L(v) but without referring to L.
(a) 4αu + 9β w
(b) 4α w + 9β u
(c) 9αu + 4β w
(d) 9α w + 4β u
And so, after all this, we see that L(L(v)) = −6v + 5 L(v).
What happens if you try this in higher dimensions? Suppose you have a map
L : R
3
→ R
3
and it has three eigenvectors with three diﬀerent eigenvalues. Can
you rewrite L(L(L(v))) in terms of v and L(v) and L(L(v)) in that case?
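The identity $L(L(v)) = -6v + 5L(v)$ can also be spot-checked numerically. The sketch below uses a hypothetical matrix chosen only for illustration: it has eigenvector $(1, 0)$ with eigenvalue 2 and eigenvector $(1, 1)$ with eigenvalue 3. The names M, L, and check are assumptions of this example, not part of the activity.

```python
def apply_matrix(M, v):
    """Return the matrix-vector product Mv, where M is a list of rows."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]

# Hypothetical example: M(1, 0) = 2 * (1, 0) and M(1, 1) = 3 * (1, 1).
M = [[2, 1],
     [0, 3]]

def L(v):
    """The linear map represented by M."""
    return apply_matrix(M, v)

def check(v):
    """Return True when L(L(v)) equals -6v + 5L(v)."""
    Lv = L(v)
    left = L(Lv)
    right = [-6 * a + 5 * b for a, b in zip(v, Lv)]
    return left == right

print(check([7, -4]))
```

Because the entries are integers, the comparison is exact. Trying a few other vectors v is a quick way to build confidence before working through the algebra above.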
64 Bilinear maps
Bilinear maps are linear in two vector variables separately.
Definition 1 Let $V$, $W$ and $U$ be vector spaces. A bilinear map $B : V \times W \to U$ is a function of two vector variables which is linear in each variable separately. That is,
Additivity in the first slot For all $v_1, v_2 \in V$ and all $w \in W$, we have $B(v_1 + v_2, w) = B(v_1, w) + B(v_2, w)$.
Additivity in the second slot For all $v \in V$ and all $w_1, w_2 \in W$, we have $B(v, w_1 + w_2) = B(v, w_1) + B(v, w_2)$.
Scaling in each slot For all $c \in \mathbb{R}$ and all $v \in V$ and all $w \in W$, we have $B(cv, w) = B(v, cw) = cB(v, w)$.
A bilinear map $V \times V \to \mathbb{R}$ is called a bilinear form on $V$. We will mostly be focusing on bilinear forms on $\mathbb{R}^n$, but we will sometimes need to work with more general bilinear maps.
Example 2 The map $B : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ given by $B(v, w) = v \cdot w$ is a bilinear form, since we confirmed that the dot product has these properties immediately after defining the dot product.
Question 3 $\mathbb{R}^n \times \mathbb{R}^m$ can be identified with $\mathbb{R}^{n+m}$. Is a bilinear map $\mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^k$ linear when viewed as a map $\mathbb{R}^{n+m} \to \mathbb{R}^k$?
Solution
(a) No.
(b) Yes.
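You are correct: a bilinear map $\mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}^k$ is not necessarily a linear map when we identify $\mathbb{R}^n \times \mathbb{R}^m$ with $\mathbb{R}^{n+m}$. Why? What is an example? For example, the dot product $B : \mathbb{R}^2 \times \mathbb{R}^2 \to \mathbb{R}$ defined by $B\left(\begin{pmatrix} x \\ y \end{pmatrix}, \begin{pmatrix} z \\ t \end{pmatrix}\right) = xz + yt$ is bilinear, but it is certainly not a linear map $\mathbb{R}^4 \to \mathbb{R}$: doubling all four inputs multiplies the output by 4, not by 2.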
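Question 4 Let $B : \mathbb{R}^2 \times \mathbb{R}^3 \to \mathbb{R}$ be a bilinear mapping, and suppose you know the following values of $B$:
• $B\left(\begin{pmatrix}1\\0\end{pmatrix}, \begin{pmatrix}1\\0\\0\end{pmatrix}\right) = 2$
• $B\left(\begin{pmatrix}1\\0\end{pmatrix}, \begin{pmatrix}0\\1\\0\end{pmatrix}\right) = 1$
• $B\left(\begin{pmatrix}1\\0\end{pmatrix}, \begin{pmatrix}0\\0\\1\end{pmatrix}\right) = -3$
• $B\left(\begin{pmatrix}0\\1\end{pmatrix}, \begin{pmatrix}1\\0\\0\end{pmatrix}\right) = 2$
• $B\left(\begin{pmatrix}0\\1\end{pmatrix}, \begin{pmatrix}0\\1\\0\end{pmatrix}\right) = 5$
• $B\left(\begin{pmatrix}0\\1\end{pmatrix}, \begin{pmatrix}0\\0\\1\end{pmatrix}\right) = 4$
What is $B\left(\begin{pmatrix}3\\2\end{pmatrix}, \begin{pmatrix}4\\2\\1\end{pmatrix}\right)$?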
Solution
Hint: We need to use the linearity in each slot to break this down into a computation
involving only the basis vectors.
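Hint:
$$\begin{aligned} B\left(\begin{pmatrix}3\\2\end{pmatrix}, \begin{pmatrix}4\\2\\1\end{pmatrix}\right) &= B\left(\begin{pmatrix}3\\0\end{pmatrix} + \begin{pmatrix}0\\2\end{pmatrix}, \begin{pmatrix}4\\2\\1\end{pmatrix}\right) \\ &= B\left(\begin{pmatrix}3\\0\end{pmatrix}, \begin{pmatrix}4\\2\\1\end{pmatrix}\right) + B\left(\begin{pmatrix}0\\2\end{pmatrix}, \begin{pmatrix}4\\2\\1\end{pmatrix}\right) \end{aligned}$$
Hint:
$$\begin{aligned} &= 3B\left(\begin{pmatrix}1\\0\end{pmatrix}, \begin{pmatrix}4\\2\\1\end{pmatrix}\right) + 2B\left(\begin{pmatrix}0\\1\end{pmatrix}, \begin{pmatrix}4\\2\\1\end{pmatrix}\right) \\ &= 3\left(4B\left(\begin{pmatrix}1\\0\end{pmatrix}, \begin{pmatrix}1\\0\\0\end{pmatrix}\right) + 2B\left(\begin{pmatrix}1\\0\end{pmatrix}, \begin{pmatrix}0\\1\\0\end{pmatrix}\right) + B\left(\begin{pmatrix}1\\0\end{pmatrix}, \begin{pmatrix}0\\0\\1\end{pmatrix}\right)\right) \\ &\quad + 2\left(4B\left(\begin{pmatrix}0\\1\end{pmatrix}, \begin{pmatrix}1\\0\\0\end{pmatrix}\right) + 2B\left(\begin{pmatrix}0\\1\end{pmatrix}, \begin{pmatrix}0\\1\\0\end{pmatrix}\right) + B\left(\begin{pmatrix}0\\1\end{pmatrix}, \begin{pmatrix}0\\0\\1\end{pmatrix}\right)\right) \end{aligned}$$
Hint:
$$\begin{aligned} &= 3\,(4(2) + 2(1) + 1(-3)) + 2\,(4(2) + 2(5) + 1(4)) \\ &= 21 + 44 \\ &= 65 \end{aligned}$$
$B\left(\begin{pmatrix}3\\2\end{pmatrix}, \begin{pmatrix}4\\2\\1\end{pmatrix}\right) = 65$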
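The same computation can be sketched in Python. The table VALUES below simply records the six given values $B(e_i, f_j)$ from the question; the name VALUES and the double-sum implementation are assumptions of this sketch, chosen to mirror the expansion by bilinearity in the hints.

```python
# VALUES[i][j] is the given value of B on the i-th standard basis vector of R^2
# and the j-th standard basis vector of R^3 (taken from the question above).
VALUES = [[2, 1, -3],
          [2, 5, 4]]

def B(v, w):
    """Evaluate the bilinear map by expanding both inputs in the standard bases."""
    return sum(VALUES[i][j] * v[i] * w[j]
               for i in range(len(v))
               for j in range(len(w)))

print(B([3, 2], [4, 2, 1]))
```

Expanding in both slots at once is exactly what the hints do by hand: each term is a coordinate of v times a coordinate of w times one of the known values.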
Question 5 Hint: If we set L(x) = B(x, 3), then L should be a linear map R →R.
Hint: But a linear map R → R is just multiplication, so B(x, 3) = αx for some
number α.
Hint: But a bilinear map is linear in both variables, so B(17, y) = βy for some number
β.
Hint: So one way to get a bilinear map would be to set B(x, y) = 10xy. You can
enter this as 10 * x * y.
Hint: Can you think of other examples?
Hint: Sure! Another way to get a bilinear map would be to set B(x, y) = 13xy. You
can enter this as 13 * x * y.
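Hint: In general, if $B : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ is bilinear, then it must be $B(x, y) = \lambda xy$ for some $\lambda \in \mathbb{R}$.
Write a nonzero bilinear map $B : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$.
Solution B(x, y) =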
65 Tensor products
Bilinear forms comprise a vector space of tensors.
The set of all bilinear maps $V \times W \to \mathbb{R}$ has the structure of a vector space: we can add such maps, and multiply them by scalars.
Definition 1 We define $V^* \otimes W^*$ to be the vector space of all bilinear maps $V \times W \to \mathbb{R}$.
This is the tensor product of the dual spaces $V^*$ and $W^*$.
Hopefully the reason for the duality involved in the definition above will become clear shortly.
Given covectors $S : V \to \mathbb{R}$ and $T : W \to \mathbb{R}$, their tensor product is the map $S \otimes T : V \times W \to \mathbb{R}$ given by the rule $(S \otimes T)(v, w) = S(v)\,T(w)$.
Warning 2 This formula involves the product of $S(v)$ and $T(w)$ as real numbers.
Question 3 $S \otimes T$ is a function from $V \times W$ to $\mathbb{R}$. Is it bilinear?
Solution
(a) Yes.
(b) No.
Let’s prove it!
First let’s check additivity in the first slot:
$$\begin{aligned} (S \otimes T)(v_1 + v_2, w) &= S(v_1 + v_2)\,T(w) \\ &= (S(v_1) + S(v_2))\,T(w) \\ &= S(v_1)T(w) + S(v_2)T(w) \\ &= (S \otimes T)(v_1, w) + (S \otimes T)(v_2, w) \end{aligned}$$
Proving additivity in the second slot is similar.
Let’s check scaling in the first slot:
$$\begin{aligned} (S \otimes T)(cv, w) &= S(cv)\,T(w) \\ &= cS(v)\,T(w) \\ &= c\,(S \otimes T)(v, w) \end{aligned}$$
Proving scaling in the other slot is similar.
So $S \otimes T$ really is bilinear!
66 Some nice covectors
Bilinear forms can be written in terms of particularly nice covectors.
Let us recall some notation from the section on derivatives.
Let $e_i$ denote the $i$th standard basis vector of $\mathbb{R}^n$. The covector $e_i^\top : \mathbb{R}^n \to \mathbb{R}$ is
$$e_i^\top(v) = \langle e_i, v \rangle.$$
Question 1 We can build more complicated examples, too. Suppose $L : \mathbb{R}^5 \to \mathbb{R}$ is the covector given by
$$L = 4e_3^\top - 2e_4^\top + e_5^\top.$$
Solution
Hint: Set $v = \begin{pmatrix}1\\4\\2\\3\\5\end{pmatrix}$. We are considering $L(v)$.
Hint: Then $L(v) = (4e_3^\top - 2e_4^\top + e_5^\top)(v)$.
Hint: So $L(v) = 4e_3^\top(v) - 2e_4^\top(v) + e_5^\top(v)$.
Hint: Replacing $e_i^\top(v)$ by $\langle v, e_i \rangle$ yields $L(v) = 4\langle v, e_3 \rangle - 2\langle v, e_4 \rangle + \langle v, e_5 \rangle$.
Hint: In this case, $\langle v, e_3 \rangle = 2$.
Hint: And $\langle v, e_4 \rangle = 3$.
Hint: And $\langle v, e_5 \rangle = 5$.
Hint: We conclude $L(v) = 4 \cdot 2 - 2 \cdot 3 + 5 = 8 - 6 + 5 = 7$.
Then $L\left(\begin{pmatrix}1\\4\\2\\3\\5\end{pmatrix}\right) = 7$.
This is a special case of something quite general.
Theorem 2 Any covector $\mathbb{R}^n \to \mathbb{R}$ can be written as a linear combination of the covectors $e_i^\top$.
Why is this? Think of a covector as
Solution
(a) a row vector $\begin{pmatrix} a_1 & a_2 & \cdots & a_n \end{pmatrix}$.
(b) a column vector $\begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{pmatrix}$.
Then we can write that row vector as
$$a_1 \begin{pmatrix} 1 & 0 & \cdots & 0 \end{pmatrix} + a_2 \begin{pmatrix} 0 & 1 & 0 & \cdots & 0 \end{pmatrix} + \cdots + a_n \begin{pmatrix} 0 & \cdots & 0 & 1 \end{pmatrix}.$$
But those row vectors are just the duals to the standard basis, so we can write the covector as
$$a_1 e_1^\top + a_2 e_2^\top + \cdots + a_n e_n^\top.$$
How is this related to derivatives?
Define the coordinate functions to be $\pi_i : \mathbb{R}^n \to \mathbb{R}$ given by $\pi_i(x_1, x_2, x_3, \ldots, x_n) = x_i$.
What is the derivative of $\pi_i$ at any point $p$ in $\mathbb{R}^n$?
Solution
Hint: This is a special case of a general theorem.
Theorem 3 The derivative of a linear map (at any point) is the same linear map.
Hint: In this case, $\pi_i$ is a linear map.
Hint: So $D\pi_i(p) = \pi_i$.
Hint: But another way to write $\pi_i$ is $e_i^\top$.
(a) $D\pi_i(p) = e_i^\top$.
(b) $D\pi_i(p) = e_i$.
(c) $D\pi_i(p) = \vec{0}$.
As a result of this, we will often write $dx_i$ as a more suggestive notation for the covector with the rather more cumbersome name $e_i^\top$. Let’s do some calculations with this new notation.
Solution
Hint: $dx_2$ will select the second entry of any vector.
Hint: $dx_2\begin{pmatrix} 3 \\ 6 \\ 4 \end{pmatrix} = 6$
$dx_2\begin{pmatrix} 3 \\ 6 \\ 4 \end{pmatrix} = 6$
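We can also consider the tensor product of these covectors.
Solution
Hint: $(dx_1 \otimes dx_2)\left(\begin{pmatrix}2\\5\\3\end{pmatrix}, \begin{pmatrix}7\\9\\4\end{pmatrix}\right) = dx_1\begin{pmatrix}2\\5\\3\end{pmatrix} \, dx_2\begin{pmatrix}7\\9\\4\end{pmatrix}$
Hint:
$$= 2(9) = 18$$
$(dx_1 \otimes dx_2)\left(\begin{pmatrix}2\\5\\3\end{pmatrix}, \begin{pmatrix}7\\9\\4\end{pmatrix}\right) = 18$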
67 A basis for forms
A basis for the space of bilinear forms consists of tensors of coordinate functions.
Prove that the set of bilinear forms $\{dx_i \otimes dy_j : 1 \le i \le n \text{ and } 1 \le j \le m\}$ forms a basis for the space $(\mathbb{R}^n)^* \otimes (\mathbb{R}^m)^*$.
Warning 1 One of your greatest challenges here will be dealing with all of the indexes.
Let $B : \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}$ be a bilinear map. Let $x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}$ and $y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{pmatrix}$. Then we can write
$$\begin{aligned} B(x, y) &= \sum_{i=1}^{n} x_i \, B(e_i, y) \\ &= \sum_{i=1}^{n} \sum_{j=1}^{m} x_i y_j \, B(e_i, e_j) \\ &= \sum_{i=1}^{n} \sum_{j=1}^{m} B(e_i, e_j) \, (dx_i \otimes dy_j)(x, y) \end{aligned}$$
So $B = \sum_{i=1}^{n} \sum_{j=1}^{m} B(e_i, e_j) \, dx_i \otimes dy_j$. This shows that the $dx_i \otimes dy_j$ span all of $(\mathbb{R}^n)^* \otimes (\mathbb{R}^m)^*$.
To see that the $dx_i \otimes dy_j$ are linearly independent, simply observe that if $\sum_{i=1}^{n} \sum_{j=1}^{m} a_{i,j} \, dx_i \otimes dy_j = 0$, then in particular evaluating at the pair $(e_k, e_l)$ gives $\sum_{i=1}^{n} \sum_{j=1}^{m} a_{i,j} \, (dx_i \otimes dy_j)(e_k, e_l) = a_{k,l} = 0$, which implies that $a_{k,l} = 0$ for all $k, l$.
Example 2 The dot product on $\mathbb{R}^2$ is given by the expression $dx_1 \otimes dy_1 + dx_2 \otimes dy_2$.
Question 3 Can you write $dx_1 \otimes dy_1 + dx_2 \otimes dy_2$ as $\alpha \otimes \beta$ for some covectors $\alpha, \beta : \mathbb{R}^2 \to \mathbb{R}$?
Solution
(a) No.
(b) Yes.
Why not? Suppose this were possible, so that $\alpha \otimes \beta$ is the dot product. By the rank-nullity theorem, there must be some nonzero vector $v$ which is in the kernel of $\alpha$. That is, there is some nonzero vector $v \in \mathbb{R}^2$ so that $\alpha(v) = 0$.
But $\langle v, v \rangle \neq 0$ because $v$ is nonzero, so $(\alpha \otimes \beta)(v, v) \neq 0$.
On the other hand, $(\alpha \otimes \beta)(v, v) = \alpha(v)\,\beta(v) = 0$, which is a contradiction.
Definition 4 Bilinear forms which can be written as $\alpha \otimes \beta$ are pure tensors.
So what we have shown here is that not all bilinear forms are pure tensors.
Example 5 Let $\mathbb{R}^4$ have coordinates $(t, x, y, z)$. The bilinear form $\eta = -dt \otimes dt + dx \otimes dx + dy \otimes dy + dz \otimes dz$ on $\mathbb{R}^4$ is the Minkowski inner product.
The Minkowski inner product¹ is one of the basic structures underlying the local geometry of our universe.
1
http://en.wikipedia.org/wiki/Minkowski_space
68 Python
Build some bilinear maps in Python.
Question 1 Suppose $v$ and $w$ are both vectors in $\mathbb{R}^4$, represented in Python as two lists of four real numbers called v and w. Build a Python function B which represents some bilinear form $B : \mathbb{R}^4 \times \mathbb{R}^4 \to \mathbb{R}$.
Solution
Hint: For example, you could try returning 17 * v[0] * w[3].
Python
    def B(v,w):
        return # the real number B(v,w)

    def validator():
        if B([4,2,3,4],[6,5,4,3]) + B([6,2,3,4],[6,5,4,3]) != B([10,4,6,8],[6,5,4,3]):
            return False

        if B([1,2,3,4],[6,5,4,3]) + B([1,2,3,4],[6,3,4,3]) != B([1,2,3,4],[12,8,8,6]):
            return False

        if 2*B([1,2,3,4],[6,5,4,3]) != B([2,4,6,8],[6,5,4,3]):
            return False

        if 2*B([1,2,3,4],[6,5,4,3]) != B([1,2,3,4],[12,10,8,6]):
            return False

        return True
Now let’s write a Python function tensor which takes two covectors $\alpha$ and $\beta$, and returns their tensor product $\alpha \otimes \beta$.
Solution
Hint: The returned function should take two parameters (say v and w) and output $\alpha(v)\,\beta(w)$.
Hint: Specifically, you could try return lambda v,w: alpha(v) * beta(w)
Python
    def tensor(alpha,beta):
        return # the bilinear form alpha tensor beta

    def validator():
        return tensor(lambda x: 4*x[0] + 5*x[1], lambda y: 2*y[0] - 3*y[1])([1,3],[4,5]) == -133
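Following the hints, a filled-in version might look like the sketch below. The helper dx, which builds the coordinate covector $dx_i$ using 1-based indexing, is a hypothetical addition for illustration and is not part of the exercise.

```python
def tensor(alpha, beta):
    """Return the bilinear form (v, w) -> alpha(v) * beta(w)."""
    return lambda v, w: alpha(v) * beta(w)

def dx(i):
    """Hypothetical helper: the covector dx_i, which selects the i-th entry (1-based)."""
    return lambda v: v[i - 1]

# The covectors used by the validator above.
alpha = lambda x: 4 * x[0] + 5 * x[1]
beta = lambda y: 2 * y[0] - 3 * y[1]
form = tensor(alpha, beta)

print(form([1, 3], [4, 5]))                       # alpha gives 19, beta gives -7
print(tensor(dx(1), dx(2))([2, 5, 3], [7, 9, 4]))  # first entry of v times second entry of w
```

Note that tensor returns a function, so it is itself a function that takes functions as input and produces a function as output.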
69 Linear maps and bilinear forms
Associated to a bilinear form is a linear map.
It turns out that we will be able to use the inner product on $\mathbb{R}^n$ to rewrite any bilinear form on $\mathbb{R}^n$ in a special form.
Given a bilinear map $B : V \times W \to \mathbb{R}$, we obtain a new map $B(\cdot, w) : V \to \mathbb{R}$ for each vector $w \in W$. $B(\cdot, w)$ is linear, since by definition of bilinearity it is linear in the first slot for a fixed vector $w$ in the second slot. Thus we have a map $\mathrm{Curry}(B) : W \to V^*$ defined by $\mathrm{Curry}(B)(w) = B(\cdot, w)$.
If $V$ and $W$ are Euclidean spaces, then every bilinear map $\mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}$ gives rise to a map $\mathbb{R}^m \to (\mathbb{R}^n)^*$. But every element $\omega \in (\mathbb{R}^n)^*$ is just a row vector, and so can be represented as the dot product against the vector $\omega^\top$. Thus we obtain a map $L_B : \mathbb{R}^m \to \mathbb{R}^n$ defined by $L_B(w) = B(\cdot, w)^\top$. This is called the linear map associated to the bilinear form. We also call the matrix of $L_B$ the matrix of $B$.
Computing some examples will make these deﬁnitions more concrete in our
minds.
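Question 1 Let $B : \mathbb{R}^2 \times \mathbb{R}^3 \to \mathbb{R}$ be a bilinear mapping, and suppose we have the following values of $B$.
• $B\left(\begin{pmatrix}1\\0\end{pmatrix}, \begin{pmatrix}1\\0\\0\end{pmatrix}\right) = 2$
• $B\left(\begin{pmatrix}1\\0\end{pmatrix}, \begin{pmatrix}0\\1\\0\end{pmatrix}\right) = 1$
• $B\left(\begin{pmatrix}1\\0\end{pmatrix}, \begin{pmatrix}0\\0\\1\end{pmatrix}\right) = -3$
• $B\left(\begin{pmatrix}0\\1\end{pmatrix}, \begin{pmatrix}1\\0\\0\end{pmatrix}\right) = 3$
• $B\left(\begin{pmatrix}0\\1\end{pmatrix}, \begin{pmatrix}0\\1\\0\end{pmatrix}\right) = 5$
• $B\left(\begin{pmatrix}0\\1\end{pmatrix}, \begin{pmatrix}0\\0\\1\end{pmatrix}\right) = 4$
Solution
Hint: $L_B : \mathbb{R}^3 \to \mathbb{R}^2$.
Hint: $L_B(e_1) = B(\cdot, e_1)^\top$.
Hint: To find the matrix of $B(\cdot, e_1) : \mathbb{R}^2 \to \mathbb{R}$, we need to see its effect on basis vectors.
$B(e_1, e_1) = 2$
$B(e_2, e_1) = 3$
so the matrix of $B(\cdot, e_1)$ is $\begin{pmatrix} 2 & 3 \end{pmatrix}$.
Hint: Thus $L_B(e_1) = B(\cdot, e_1)^\top = \begin{pmatrix} 2 \\ 3 \end{pmatrix}$.
Hint: Similarly, $L_B(e_2) = \begin{pmatrix} 1 \\ 5 \end{pmatrix}$ and $L_B(e_3) = \begin{pmatrix} -3 \\ 4 \end{pmatrix}$.
Hint: Thus the matrix of $L_B$ is $\begin{pmatrix} 2 & 1 & -3 \\ 3 & 5 & 4 \end{pmatrix}$.
What is the matrix of $L_B$?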
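Question 2 Suppose $B : \mathbb{R}^3 \times \mathbb{R}^3 \to \mathbb{R}$ is a bilinear map, and the matrix of $B$ is
$$\begin{pmatrix} 2 & 3 & 1 \\ -2 & 1 & 5 \\ 3 & 2 & 1 \end{pmatrix}.$$
Solution
Hint: By definition, $B\left(\begin{pmatrix}1\\2\\0\end{pmatrix}, \begin{pmatrix}0\\0\\1\end{pmatrix}\right) = \begin{pmatrix}1\\2\\0\end{pmatrix}^\top L_B\left(\begin{pmatrix}0\\0\\1\end{pmatrix}\right)$.
Hint: Thus $B\left(\begin{pmatrix}1\\2\\0\end{pmatrix}, \begin{pmatrix}0\\0\\1\end{pmatrix}\right) = \begin{pmatrix}1\\2\\0\end{pmatrix}^\top \begin{pmatrix}1\\5\\1\end{pmatrix} = 11$.
Then $B\left(\begin{pmatrix}1\\2\\0\end{pmatrix}, \begin{pmatrix}0\\0\\1\end{pmatrix}\right) = 11$.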
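As a rough illustration of using a matrix as a bilinear form, the sketch below evaluates $v^\top M w$ for a matrix given as a list of rows. The function name bilinear_from_matrix is an assumption made for this example.

```python
def bilinear_from_matrix(M):
    """Return the bilinear form B(v, w) = v^T M w, where M is a list of rows."""
    def B(v, w):
        # First apply the linear map: Mw is the vector L_B(w).
        Mw = [sum(m * x for m, x in zip(row, w)) for row in M]
        # Then take the dot product with v.
        return sum(a * b for a, b in zip(v, Mw))
    return B

# The matrix from Question 2.
B = bilinear_from_matrix([[2, 3, 1],
                          [-2, 1, 5],
                          [3, 2, 1]])
print(B([1, 2, 0], [0, 0, 1]))
```

Feeding in standard basis vectors recovers individual matrix entries, since $B(e_i, e_j)$ is the entry in row $i$ and column $j$.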
Show that the matrix of the bilinear form $\sum a_{i,j} \, dx_i \otimes dx_j$ is the matrix $(a_{i,j})$.
Let $M$ be the matrix of $L_B$.
Following the same line of reasoning as in a previous activity¹, we know that $M_{i,j} = e_i^\top L_B(e_j)$. But by definition, this is $B(e_i, e_j)$, which plainly evaluates to $a_{i,j}$. The claim is proven.
From every linear map $L : \mathbb{R}^m \to \mathbb{R}^n$ we also obtain a bilinear map $B_L : \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}$ defined by $B_L(v, w) = v^\top L(w)$.
1
http://ximera.osu.edu/course/kisonecat/m2o2c2/course/activity/week1/
inner-product/multiply-dot/
To summarize, we have a really nice story about bilinear maps $B : \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}$: every single one of them can be written as $B(v, w) = v^\top L(w)$ for some unique linear map $L : \mathbb{R}^m \to \mathbb{R}^n$. Also, every linear map $\mathbb{R}^m \to \mathbb{R}^n$ gives rise to a bilinear form by defining $B(v, w) = v^\top L(w)$. On the level of matrices, we just have that $B(v, w) = v^\top M w$ where $M$ is the matrix of the linear map $L_B$. We will sometimes talk about “using a matrix as a bilinear form”: this is what we mean by that. This will be very important to us when we start talking about the second derivative.
In this activity we have shown that for bilinear maps $\mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}$, there is a useful notion of a linear map $\mathbb{R}^m \to \mathbb{R}^n$ associated to it. If the codomain of the original bilinear map had been anything other than $\mathbb{R}$ we would not have such luck: our work depended crucially on the ability to turn covectors into vectors using the inner product on $\mathbb{R}^n$.
70 Python
Use Python to ﬁnd the linear maps associated to bilinear forms.
Question 1 Suppose alpha is a covector $\alpha : \mathbb{R}^n \to \mathbb{R}$. Write a Python function for converting such a covector in $(\mathbb{R}^n)^*$ into a vector $v \in \mathbb{R}^n$. More specifically, covector_to_vector should take as input a Python function alpha, and return a list of n real numbers. This list of n real numbers, when regarded as a vector $v \in \mathbb{R}^n$, should have the property that $\alpha(w) = \langle v, w \rangle$.
Solution
Hint: You can determine what $v_i$ must be by considering $\alpha(e_i)$.
Hint: Define e(i) by [0] * i + [1] + [0] * (n-i-1).
Hint: In other words, e = lambda i: [0] * i + [1] + [0] * (n-i-1).
Hint: Then $\alpha(e_i)$ is alpha(e(i)).
Hint: So to form a vector, we need only use [alpha(e(i)) for i in range(0,n)].
Python
    n = 4
    def covector_to_vector(alpha):
        return # a vector v so that alpha(w) = v dot w

    def validator():
        if covector_to_vector(lambda x: 3*x[0] + 2*x[1] - 17*x[2] + 30*x[3])[1] != 2:
            return False
        if covector_to_vector(lambda x: 3*x[0] + 2*x[1] - 17*x[2] + 30*x[3])[2] != -17:
            return False
        if covector_to_vector(lambda x: 3*x[0] + 2*x[1] - 17*x[2] + 30*x[3])[3] != 30:
            return False
        return True
Suppose B is a bilinear form $B : \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}$. Write a Python function for taking such a bilinear form, and producing the corresponding linear map $L_B$.
Recall that we encode a linear map $\mathbb{R}^m \to \mathbb{R}^n$ in Python as a regular Python function which takes as input a list of m real numbers, and outputs a list of n real numbers.
Solution
Hint: You may want to make use of covector_to_vector; copy the code from above and paste it into the box below.
Hint: We defined $L_B(w) = B(\cdot, w)^\top$.
Hint: In other words, $L_B$ sends a vector $w$ to the vector corresponding to the covector $x \mapsto B(x, w)$.
Hint: So we should return a linear map sending $w$ to covector_to_vector applied to $x \mapsto B(x, w)$.
Hint: So we should return lambda w: covector_to_vector(lambda x: B(x,w)).
Python
    n = 4
    m = 3
    def bilinear_to_linear(B):
        return # the associated linear map L_B

    def validator():
        if bilinear_to_linear(lambda x,y: 7 * x[0] * y[1] + 3*x[1]*y[2])([3,5,7])[1] != 21:
            return False
        if bilinear_to_linear(lambda x,y: 7 * x[0] * y[1] + 3*x[1]*y[2])([3,5,7])[0] != 35:
            return False
        return True
These are again examples of “higher-order functions.” Keeping track of dual
spaces and thinking about operations which transform bilinear maps into linear
maps are two examples of where such “higher-order thinking” comes in handy.
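Putting the hints for both questions together, one possible sketch is shown below. Unlike the exercise skeletons, which read the dimension from a global variable n, this version passes n as an explicit parameter; that change is a choice made for this illustration only.

```python
def covector_to_vector(alpha, n):
    """Return the vector v in R^n with alpha(w) = <v, w>, by evaluating alpha on e_0, ..., e_{n-1}."""
    e = lambda i: [0] * i + [1] + [0] * (n - i - 1)
    return [alpha(e(i)) for i in range(n)]

def bilinear_to_linear(B, n):
    """Return L_B, which sends w to the vector representing the covector x -> B(x, w)."""
    return lambda w: covector_to_vector(lambda x: B(x, w), n)

alpha = lambda x: 3 * x[0] + 2 * x[1] - 17 * x[2] + 30 * x[3]
print(covector_to_vector(alpha, 4))

B = lambda x, y: 7 * x[0] * y[1] + 3 * x[1] * y[2]
L_B = bilinear_to_linear(B, 4)
print(L_B([3, 5, 7]))
```

The inner lambda x: B(x, w) is the Python counterpart of the covector $B(\cdot, w)$: it fixes the second slot and leaves the first slot open.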
Let $L : \mathbb{R}^m \to \mathbb{R}^n$ be a linear map. Then there is an associated bilinear form $B_L : \mathbb{R}^n \times \mathbb{R}^m \to \mathbb{R}$ given by $B_L(v, w) = \langle v, L(w) \rangle$.
One thing we can do to a bilinear map is swap the two inputs, namely, we can build $B_L^* : \mathbb{R}^m \times \mathbb{R}^n \to \mathbb{R}$ by the rule $B_L^*(w, v) = B_L(v, w)$.
And with this “swapped” bilinear map, we can go back and recover an associated linear map $L_{B_L^*} : \mathbb{R}^n \to \mathbb{R}^m$.
Question 1 The domain of $L_{B_L^*}$ is the same as
Solution
(a) the codomain of $L$.
(b) the domain of $L$.
Right! The minor surprise is that $L$ went from $\mathbb{R}^m$ to $\mathbb{R}^n$, but $L_{B_L^*}$ went “the other way” from $\mathbb{R}^n$ to $\mathbb{R}^m$.
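Definition 2 If $L : \mathbb{R}^m \to \mathbb{R}^n$ is a linear map, the adjoint of $L$ is the map $L_{B_L^*} : \mathbb{R}^n \to \mathbb{R}^m$. We usually write $L^*$ for the adjoint of $L$.
Let $L : \mathbb{R}^m \to \mathbb{R}^n$ be a linear map. Show that $\langle v, L(w) \rangle = \langle L^*(v), w \rangle$ for every $v \in \mathbb{R}^n$ and $w \in \mathbb{R}^m$.
$$\begin{aligned} \langle v, L(w) \rangle &= B_L(v, w) \\ &= B_L^*(w, v) \\ &= \langle w, L^*(v) \rangle \\ &= \langle L^*(v), w \rangle \end{aligned}$$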
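Question 3 Let’s work through an example. Let $L : \mathbb{R}^3 \to \mathbb{R}^2$ be the linear map represented by the matrix $\begin{pmatrix} 3 & 2 & 1 \\ -4 & 2 & 9 \end{pmatrix}$ with respect to the standard basis.
Solution
Hint: $L(e_1) = \begin{pmatrix} 3 \\ -4 \end{pmatrix}$
Hint: $\langle e_2, L(e_1) \rangle = \left\langle e_2, \begin{pmatrix} 3 \\ -4 \end{pmatrix} \right\rangle$
Hint: $\left\langle e_2, \begin{pmatrix} 3 \\ -4 \end{pmatrix} \right\rangle = -4$.
$\langle e_2, L(e_1) \rangle = -4$
Recall that $\langle v, L(w) \rangle = \langle L^*(v), w \rangle$. Consequently, setting $v = e_2$ and $w = e_1$, we have that $\langle L^*(e_2), e_1 \rangle$ is also $-4$.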
Let’s write $(\ell_{ij})$ for the entries of the matrix for $L$, and $(\ell^*_{ij})$ for the entries of the matrix for $L^*$. The fact that $\langle e_2, L(e_1) \rangle = -4$ amounts to saying $\ell_{2,1} = -4$, and then since $\langle L^*(e_2), e_1 \rangle = -4$, we have that $\ell^*_{1,2} = -4$.
Solution
Hint: The matrix of the adjoint of a linear map is the transpose of the matrix of that linear map.
Hint: The matrix of $L^*$ is $\begin{pmatrix} 3 & -4 \\ 2 & 2 \\ 1 & 9 \end{pmatrix}$.
What is the matrix of $L^*$?
What do you notice about these entries?
Solution
(a) $\ell_{ij} = \ell^*_{ji}$
(b) $\ell_{ij} = \ell^*_{ij}$
Let’s summarize this fact as a theorem.
Theorem 4 Let $L : \mathbb{R}^n \to \mathbb{R}^m$ be a linear map. If $L$ has matrix $M$ with respect to the standard basis, then $L^*$ has matrix $M^\top$.
Recall that $M^\top$ is the transpose of $M$, meaning that $M_{ij} = M^\top_{ji}$.
It is your turn to prove this theorem.
Let’s use the fundamental insight from this activity¹.
Let the matrix of $L^*$ be called $M^*$ for now. To find the entry in the $i$th row and $j$th column of $M^*$, we just compute
$$\begin{aligned} M^*_{i,j} &= e_i^\top M^*(e_j) \\ &= e_i^\top L^*(e_j) \\ &= \langle e_i, L^*(e_j) \rangle \\ &= \langle L(e_i), e_j \rangle \\ &= e_j^\top L(e_i) \\ &= e_j^\top M(e_i) \\ &= M_{j,i} \end{aligned}$$
So the entry in the $i$th row and $j$th column of $M^*$ is the entry in the $j$th row and $i$th column of $M$. Thus $M^* = M^\top$.
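The relationship between adjoints and transposes can be spot-checked numerically. The sketch below uses the matrix from the worked example above; the helper names transpose, apply_matrix, and dot, and the particular test vectors, are assumptions chosen for this illustration.

```python
def transpose(M):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(column) for column in zip(*M)]

def apply_matrix(M, v):
    """Return the matrix-vector product Mv."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def dot(v, w):
    """Return the inner product <v, w>."""
    return sum(a * b for a, b in zip(v, w))

M = [[3, 2, 1],
     [-4, 2, 9]]        # the matrix of L : R^3 -> R^2 from the example
Mt = transpose(M)       # the matrix of the adjoint, going from R^2 to R^3

v = [5, -1]             # an arbitrary vector in R^2
w = [2, 0, 7]           # an arbitrary vector in R^3
left = dot(v, apply_matrix(M, w))     # <v, L(w)>
right = dot(apply_matrix(Mt, v), w)   # <L*(v), w>
print(Mt, left, right)
```

A check like this on a few vectors is not a proof, but it is a quick way to catch a transposition done incorrectly.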
Here, ﬁnally, is a question for you to ponder: why are we bothering about adjoints
of linear maps if we have transposes of matrices?
1
http://ximera.osu.edu/course/kisonecat/m2o2c2/course/activity/week1/
inner-product/multiply-dot/
Taking adjoints doesn’t aﬀect the spectrum.
The set of eigenvalues of a linear operator (what we call the spectrum of the linear operator) is of fundamental importance. Taking adjoints is one way to build a new linear operator from an old linear operator. Fusing these two ideas together results in a question: how does the spectrum of $L$ relate to the spectrum of its adjoint, $L^*$?
Surprisingly, the spectrum of $L$ is the same as the spectrum of $L^*$.
Let’s get started: show that if $v$ is a nonzero eigenvector of $L : \mathbb{R}^n \to \mathbb{R}^n$ with eigenvalue $\lambda$, then there is a nonzero eigenvector $u$ of $L^*$ with eigenvalue $\lambda$.
Warning 1 This is a very hard problem.
Hint: Consider the map $S : \mathbb{R}^n \to \mathbb{R}^n$ given by $S(w) = L^*(w) - \lambda w$, or in other words $S = L^* - \lambda I$. Showing that this map has a nontrivial kernel is the same as showing that $L^*$ has $\lambda$ as an eigenvalue.
Hint: Notice that $L - \lambda I$ is adjoint to $S = L^* - \lambda I$.
Hint: For all $w \in \mathbb{R}^n$, we have $\langle S(w), v \rangle = \langle w, L(v) - \lambda v \rangle = 0$.
Hint: Thus $S(w)$ is in the subspace of vectors perpendicular to the eigenvector $v$, which we denote $v^\perp$.
Hint: Thus we have that $\operatorname{Im}(S) \subset v^\perp$. This implies that $\dim(\operatorname{Im}(S)) \le n - 1$.
Hint: By the rank-nullity theorem, we have that $\dim(\ker(S)) \ge 1$.
Hint: So $S$ has a nontrivial kernel, so $L^*$ has a nonzero eigenvector $u$ with eigenvalue $\lambda$.
Consider the map $S : \mathbb{R}^n \to \mathbb{R}^n$ given by $S(w) = L^*(w) - \lambda w$. Showing that this map has a nontrivial kernel is the same as showing that $L^*$ has $\lambda$ as an eigenvalue.
Notice that $L - \lambda I$ is adjoint to $S = L^* - \lambda I$.
For all $w \in \mathbb{R}^n$, we have $\langle S(w), v \rangle = \langle w, L(v) - \lambda v \rangle = 0$.
Thus $S(w)$ is in the subspace of vectors perpendicular to the eigenvector $v$, which we denote $v^\perp$.
Thus we have that $\operatorname{Im}(S) \subset v^\perp$. This implies that $\dim(\operatorname{Im}(S)) \le n - 1$.
By the rank-nullity theorem, we have that $\dim(\ker(S)) \ge 1$.
So $S$ has a nontrivial kernel, so $L^*$ has a nonzero eigenvector $u$ with eigenvalue $\lambda$.
Linear maps which equal their own adjoint are important
Definition 1 A linear operator $L : \mathbb{R}^n \to \mathbb{R}^n$ is self-adjoint if $L = L^*$.
These ideas also pop up when considering bilinear forms.
Definition 2 A bilinear form on $\mathbb{R}^n$ is symmetric if for all $v, w \in \mathbb{R}^n$ we have $B(v, w) = B(w, v)$.
Question 3 Consider the bilinear form $B : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ given by $B(v, w) = \langle v, w \rangle$. In other words, $B$ is just the inner product. Is $B$ symmetric?
Solution
(a) Yes.
(b) No.
That’s right!
Example 4 The dot product on $\mathbb{R}^n$ is a symmetric bilinear form, since we have already shown that $v \cdot w = w \cdot v$.
Which of the following bilinear forms on $\mathbb{R}^2$ are symmetric?
Solution
Hint: $B((x_1, x_2), (y_1, y_2)) = x_1 y_2 + 3 x_2 y_1$ is not symmetric since, for example, $B((1, 0), (0, 1)) = 1$, but $B((0, 1), (1, 0)) = 3$.
Hint: $B((x_1, x_2), (y_1, y_2)) = x_1 y_1 + 5 x_2 y_2 + x_1 y_2$ is not symmetric since, for example, $B((1, 0), (0, 1)) = 1$ but $B((0, 1), (1, 0)) = 0$.
Hint: $B((x_1, x_2), (y_1, y_2)) = 2 x_1 y_1 + 4 x_2 y_2$ is symmetric, since $B((x_1, x_2), (y_1, y_2)) = 2 x_1 y_1 + 4 x_2 y_2 = B((y_1, y_2), (x_1, x_2))$.
(a) $B((x_1, x_2), (y_1, y_2)) = x_1 y_2 + 3 x_2 y_1$
(b) $B((x_1, x_2), (y_1, y_2)) = 2 x_1 y_1 + 4 x_2 y_2$
(c) $B((x_1, x_2), (y_1, y_2)) = x_1 y_1 + 5 x_2 y_2 + x_1 y_2$
74 Symmetric matrices
The matrix of a self-adjoint linear map is symmetric.
The matrix of a self-adjoint operator equals its own transpose. In other words, it
is symmetric about the main diagonal.
Definition 1 A matrix which equals its transpose is a symmetric matrix.
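Question 2 Let $B$ be the symmetric bilinear form on $\mathbb{R}^2$ defined by $B\left(\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \begin{pmatrix} y_1 \\ y_2 \end{pmatrix}\right) = 2 x_1 y_1 + 4 x_2 y_2 + x_1 y_2 + x_2 y_1$. What is the matrix of $B$? What do you notice about this matrix?
Solution
Hint: Remember that the entry $M_{i,j} = B(e_i, e_j)$.
Hint:
$$\begin{aligned} M_{1,1} &= B\left(\begin{pmatrix}1\\0\end{pmatrix}, \begin{pmatrix}1\\0\end{pmatrix}\right) = 2 \\ M_{1,2} &= B\left(\begin{pmatrix}1\\0\end{pmatrix}, \begin{pmatrix}0\\1\end{pmatrix}\right) = 1 \\ M_{2,1} &= B\left(\begin{pmatrix}0\\1\end{pmatrix}, \begin{pmatrix}1\\0\end{pmatrix}\right) = 1 \\ M_{2,2} &= B\left(\begin{pmatrix}0\\1\end{pmatrix}, \begin{pmatrix}0\\1\end{pmatrix}\right) = 4 \end{aligned}$$
Hint: The matrix of $B$ is $\begin{pmatrix} 2 & 1 \\ 1 & 4 \end{pmatrix}$.
Notice that this matrix is a symmetric matrix!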
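Show that $L$ is self-adjoint if and only if the bilinear form associated to it is symmetric. If $L$ is self-adjoint, then
$$\begin{aligned} B_L(v, w) = v^\top L(w) &= \langle v, L(w) \rangle \\ &= \langle L(v), w \rangle \\ &= w^\top L(v) \\ &= B_L(w, v) \end{aligned}$$
So the bilinear form associated to $L$ is symmetric.
On the other hand, if $B$ is symmetric, then, recalling that $L_B(v) = B(\cdot, v)^\top$,
$$\begin{aligned} \langle L_B(v), w \rangle &= \langle B(\cdot, v)^\top, w \rangle \\ &= B(\cdot, v)(w) \\ &= B(w, v) \\ &= B(v, w) \\ &= B(\cdot, w)(v) \\ &= \langle B(\cdot, w)^\top, v \rangle = \langle L_B(w), v \rangle \end{aligned}$$
So the linear map associated with $B$ is self-adjoint.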
75 Python
Build some self-adjoint operators in Python.
Question 1 We will represent a linear operator in Python as a function which
takes as input a list of n real numbers, and outputs a list of n real numbers.
Write down a linear operator L which is self-adjoint.
Solution
Hint: In this problem, n = 4.
Hint: So the input v will be a list of four numbers, namely v[0], v[1], v[2], and
v[3].
Hint: The output should also be a list of four numbers.
Hint: We must make sure that the resulting operator is self-adjoint, which we can
achieve if the corresponding matrix is symmetric.
Hint: Since we just need to write down one example, we could even get away with
return v, namely, the identity operator. But let’s try to be fancier!
Hint: Let’s make L into the linear operator represented by the matrix
$$\begin{pmatrix} 2 & 3 & 0 & 0 \\ 3 & 4 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}.$$
Hint: We can achieve this with return [2*v[0] + 3*v[1], 3*v[0] + 4*v[1], v[2], v[3]].
Python
    n = 4
    def L(v):
        return # the vector L(v), but make sure that L is self-adjoint

    def validator():
        e = lambda i: [0] * i + [1] + [0] * (n-i-1)
        for i in range(0,4):
            for j in range(i,4):
                if L(e(i))[j] != L(e(j))[i]:
                    return False
        return True
Fantastic!
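For reference, here is a sketch of the operator suggested in the hints, together with a check of self-adjointness phrased directly in terms of inner products. The helper names dot and is_self_adjoint are assumptions of this illustration; the check compares $\langle L(e_i), e_j \rangle$ with $\langle e_i, L(e_j) \rangle$ on the standard basis, which is enough because both sides are bilinear.

```python
def L(v):
    """Apply the symmetric matrix [[2,3,0,0],[3,4,0,0],[0,0,1,0],[0,0,0,1]] to v."""
    return [2 * v[0] + 3 * v[1], 3 * v[0] + 4 * v[1], v[2], v[3]]

def dot(v, w):
    """Return the inner product <v, w>."""
    return sum(a * b for a, b in zip(v, w))

def is_self_adjoint(op, n):
    """Check <op(e_i), e_j> == <e_i, op(e_j)> for all pairs of standard basis vectors."""
    e = lambda i: [0] * i + [1] + [0] * (n - i - 1)
    return all(dot(op(e(i)), e(j)) == dot(e(i), op(e(j)))
               for i in range(n) for j in range(n))

print(is_self_adjoint(L, 4))
# A shift-like operator whose matrix is not symmetric, for contrast.
print(is_self_adjoint(lambda v: [v[1], 0, 0, 0], 4))
```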
76 Deﬁniteness and the spectral theorem
“Deﬁniteness” describes what we can say about the sign of the output of a bilinear
form.
Definition 1 A bilinear form $B : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ is
Positive definite if $B(v, v) > 0$ for all $v \neq \vec{0}$,
Positive semidefinite if $B(v, v) \ge 0$ for all $v$,
Negative definite if $B(v, v) < 0$ for all $v \neq \vec{0}$,
Negative semidefinite if $B(v, v) \le 0$ for all $v$, and
Indefinite if there are $v$ and $w$ with $B(v, v) > 0$ and $B(w, w) < 0$.
Let $M$ be a diagonal matrix. In a sentence, can you relate the signs of the entries $M_{i,i}$ to the definiteness of the associated bilinear form?
Given a diagonal $n \times n$ matrix $M$ with $M_{i,i} = \lambda_i$, we see that
$$B(x, x) = \begin{pmatrix} x_1 & x_2 & \cdots & x_n \end{pmatrix} M \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} = \lambda_1 x_1^2 + \lambda_2 x_2^2 + \cdots + \lambda_n x_n^2.$$
If the $\lambda_i$ are all positive, this expression is positive whenever the $x_i$ are not all 0. So the bilinear form is positive definite.
If the $\lambda_i$ are all nonnegative, this expression is nonnegative for every $x$. So the bilinear form is positive semidefinite.
If the $\lambda_i$ are all negative, this expression is negative whenever the $x_i$ are not all 0. So the bilinear form is negative definite.
If the $\lambda_i$ are all nonpositive, this expression is nonpositive for every $x$. So the bilinear form is negative semidefinite.
If $\lambda_i > 0$ and $\lambda_j < 0$ for some $i, j \leq n$, then $B(e_i, e_i) = \lambda_i > 0$ and $B(e_j, e_j) = \lambda_j < 0$, so the bilinear form is indefinite.
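The case analysis for diagonal matrices can be sketched in a few lines of Python. The function names `classify_diagonal` and `quadratic_value` are illustrative assumptions; the function reports the strongest label that applies (a positive definite form is also positive semidefinite, and the zero form is reported as positive semidefinite although it is negative semidefinite too).

```python
def quadratic_value(lams, x):
    """B(x, x) = sum of lambda_i * x_i^2 for the diagonal matrix diag(lams)."""
    return sum(l * xi * xi for l, xi in zip(lams, x))

def classify_diagonal(lams):
    """Classify the bilinear form whose matrix is diag(lams)."""
    if all(l > 0 for l in lams):
        return "positive definite"
    if all(l < 0 for l in lams):
        return "negative definite"
    if any(l > 0 for l in lams) and any(l < 0 for l in lams):
        return "indefinite"
    if all(l >= 0 for l in lams):
        return "positive semidefinite"
    return "negative semidefinite"
```

For instance, `classify_diagonal([1, -1])` reports an indefinite form, witnessed by the basis vectors: `quadratic_value([1, -1], [1, 0])` is positive and `quadratic_value([1, -1], [0, 1])` is negative.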
Our goal will now be to reduce the study of general symmetric bilinear forms
to those whose associated matrix is diagonal.
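Let $L : \mathbb{R}^n \to \mathbb{R}^n$ be a self-adjoint linear operator. Prove that if $v_1$ and $v_2$ are eigenvectors with distinct eigenvalues $\lambda_1$ and $\lambda_2$, then $v_1 \perp v_2$.
$$\begin{aligned}
\langle L(v_1), v_2 \rangle &= \langle v_1, L(v_2) \rangle \\
\langle \lambda_1 v_1, v_2 \rangle &= \langle v_1, \lambda_2 v_2 \rangle \\
(\lambda_1 - \lambda_2) \langle v_1, v_2 \rangle &= 0 \\
\langle v_1, v_2 \rangle &= 0
\end{aligned}$$
The last step uses $\lambda_1 \neq \lambda_2$.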
Let $L : \mathbb{R}^n \to \mathbb{R}^n$ be a self-adjoint linear operator. Let $v$ be an eigenvector of $L$ with eigenvalue $\lambda$. Prove that $L$ restricts to a self-adjoint linear operator on $v^\perp$, the space of vectors perpendicular to $v$.
All we need to show is that $w \perp v$ implies $L(w) \perp v$.
$$\begin{aligned}
\langle L(w), v \rangle &= \langle w, L(v) \rangle \\
&= \langle w, \lambda v \rangle \\
&= \lambda \langle w, v \rangle \\
&= 0
\end{aligned}$$
so we are done!
Theorem 2 If $L : \mathbb{R}^n \to \mathbb{R}^n$ is a self-adjoint linear operator, then $L$ has an eigenvector.
Proof If $L$ is the identically 0 map, we are done, because every nonzero vector is an eigenvector (with eigenvalue 0) in that case. So assume that $L$ is not the identically 0 map.
Since the unit sphere in $\mathbb{R}^n$ is compact (see footnote 1), the function $v \mapsto |L(v)|$ achieves its maximum $M$. So there is a unit vector $u$ so that $|L(u)| = M$, and $|L(v)| \leq M$ for all other unit vectors $v$. $M > 0$ because $L \neq 0$.
Now let $w = L(u)/M$. This is another unit vector.
Note that $\langle w, L(u) \rangle = M$, so we also have $\langle L(w), u \rangle = M$, since $L$ is self-adjoint.
By Cauchy-Schwarz, $\langle L(w), u \rangle \leq |L(w)||u|$, with equality if and only if $L(w) \in \operatorname{span}(u)$.
But $|L(w)||u| = |L(w)|$ because $u$ is a unit vector!
Since $M$ is the maximum value of $|L(v)|$ over all unit vectors $v$, we have $M = \langle L(w), u \rangle \leq |L(w)| \leq M$. Equality holds throughout, so we must have $L(w) \in \operatorname{span}(u)$.
Since $\langle L(w), u \rangle = M$, we can conclude that $L(w) = Mu$.
Now either $u + w \neq 0$ or $u - w \neq 0$. In either case,
$L(u \pm w) = Mw \pm Mu = \pm M(u \pm w)$, so the nonzero vector $u \pm w$ is an eigenvector of $L$ with eigenvalue $\pm M$.
Credit for this beautiful line of reasoning goes to Marcos Cossarini (see footnote 2). Most proofs of this theorem use either Lagrange multipliers (which we will learn about soon), or complex analysis. Here we use only linear algebra along with the one analytic fact that a continuous function on a compact set achieves its maximum value.
We can combine the previous theorem and question to prove the Spectral Theorem.
Theorem 3 (Spectral Theorem) A self-adjoint operator $L : \mathbb{R}^n \to \mathbb{R}^n$ has an orthonormal basis of eigenvectors.
Proof $L$ has an eigenvector $v_1$. The restriction $L|_{v_1^\perp} : v_1^\perp \to v_1^\perp$ is also a self-adjoint operator, and so it has an eigenvector $v_2$. Continue in this way until you have constructed all $n$ eigenvectors. Because of how they are constructed, each one is perpendicular to all of the eigenvectors which came before it. Scaling each eigenvector to unit length gives an orthonormal basis.
1 http://en.wikipedia.org/wiki/Compact_space
2 http://mathoverflow.net/a/118759/1106
This, in some sense, completely answers the question of how to characterize the definiteness of a symmetric bilinear form $B$. Look at its associated linear operator $L_B$, which must be self-adjoint. By the Spectral Theorem, it has an orthonormal basis of eigenvectors. Then
• $B$ positive definite $\iff$ $L_B$ has all positive eigenvalues
• $B$ positive semidefinite $\iff$ $L_B$ has all nonnegative eigenvalues
• $B$ negative definite $\iff$ $L_B$ has all negative eigenvalues
• $B$ negative semidefinite $\iff$ $L_B$ has all nonpositive eigenvalues
• $B$ indefinite $\iff$ $L_B$ has both positive and negative eigenvalues
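For a symmetric $2 \times 2$ matrix the eigenvalues have a closed form, so the eigenvalue criterion can be tried out by hand or in code. The sketch below assumes the matrix is $\begin{pmatrix} a & b \\ b & c \end{pmatrix}$; the function names and the tolerance `tol` are illustrative choices, not something fixed by the text.

```python
import math

def eigenvalues_sym2(a, b, c):
    """Eigenvalues of the symmetric matrix [[a, b], [b, c]], smallest first."""
    mean = (a + c) / 2
    radius = math.hypot((a - c) / 2, b)
    return mean - radius, mean + radius

def classify_sym2(a, b, c, tol=1e-12):
    """Definiteness of the bilinear form with matrix [[a, b], [b, c]]."""
    lo, hi = eigenvalues_sym2(a, b, c)
    if lo > tol:
        return "positive definite"
    if hi < -tol:
        return "negative definite"
    if lo < -tol and hi > tol:
        return "indefinite"
    if lo >= -tol:
        return "positive semidefinite"
    return "negative semidefinite"
```

For example, the upper-left block $\begin{pmatrix} 2 & 3 \\ 3 & 4 \end{pmatrix}$ of the matrix from the Python section has eigenvalues $3 \pm \sqrt{10}$, one of each sign, so its bilinear form is indefinite.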
This will be crucially important when we get to the second derivative test:
it will turn out that the second derivative is a symmetric bilinear form, and the
deﬁniteness of this bilinear form is analogous to the concavity of a function of one
variable. Identifying local maxima and minima with the “second derivative test”
will require analysis of the eigenvalues of the associated linear map.
77 Second order partial derivatives
Second order partial derivatives are partial derivatives of partial derivatives
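Definition 1 If $f : \mathbb{R}^n \to \mathbb{R}$ is a function, then $\frac{\partial f}{\partial x_i} : \mathbb{R}^n \to \mathbb{R}$ is another function, so we can take its partial derivative with respect to $x_j$. We define the second order partial derivative $\frac{\partial^2 f}{\partial x_j \partial x_i} = \frac{\partial}{\partial x_j} \frac{\partial f}{\partial x_i}$. In the special case that $i = j$ we will write $\frac{\partial^2 f}{\partial x_i^2}$ (even though this notation is horrible, it is standard, so we will follow it).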
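Question 2 Let $f(x, y) = x^2 y^3$.
Solution
Hint: $\frac{\partial^2 f}{\partial x^2} = \frac{\partial}{\partial x}\left(\frac{\partial}{\partial x}(x^2 y^3)\right) = \frac{\partial}{\partial x}(2xy^3) = 2y^3$
$\frac{\partial^2 f}{\partial x^2} = 2y^3$
Solution
Hint: $\frac{\partial^2 f}{\partial x \partial y} = \frac{\partial}{\partial x}\left(\frac{\partial}{\partial y}(x^2 y^3)\right) = \frac{\partial}{\partial x}(3x^2 y^2) = 6xy^2$
$\frac{\partial^2 f}{\partial x \partial y} = 6xy^2$
Solution
Hint: $\frac{\partial^2 f}{\partial y \partial x} = \frac{\partial}{\partial y}\left(\frac{\partial}{\partial x}(x^2 y^3)\right) = \frac{\partial}{\partial y}(2xy^3) = 6xy^2$
$\frac{\partial^2 f}{\partial y \partial x} = 6xy^2$
Solution
Hint: $\frac{\partial^2 f}{\partial y^2} = \frac{\partial}{\partial y}\left(\frac{\partial}{\partial y}(x^2 y^3)\right) = \frac{\partial}{\partial y}(3x^2 y^2) = 6x^2 y$
$\frac{\partial^2 f}{\partial y^2} = 6x^2 y$
Solution Did you notice how $\frac{\partial^2 f}{\partial x \partial y} = \frac{\partial^2 f}{\partial y \partial x}$? Doesn't that fill you with a sense of wonder and mystery?
(a) Yes!
(b) No :(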
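Question 3 Let $f(x, y) = \sin(xy^2)$.
Solution
Hint: $\frac{\partial^2 f}{\partial x^2} = \frac{\partial}{\partial x}\left(\frac{\partial}{\partial x}\sin(xy^2)\right) = \frac{\partial}{\partial x}\left(y^2 \cos(xy^2)\right) = -y^4 \sin(xy^2)$
$\frac{\partial^2 f}{\partial x^2} = -y^4 \sin(xy^2)$
Solution
Hint: $\frac{\partial^2 f}{\partial x \partial y} = \frac{\partial}{\partial x}\left(\frac{\partial}{\partial y}\sin(xy^2)\right) = \frac{\partial}{\partial x}\left(2xy \cos(xy^2)\right) = 2y\cos(xy^2) - 2xy(y^2)\sin(xy^2) = 2y\cos(xy^2) - 2xy^3 \sin(xy^2)$
$\frac{\partial^2 f}{\partial x \partial y} = 2y\cos(xy^2) - 2xy^3 \sin(xy^2)$
Solution
Hint: $\frac{\partial^2 f}{\partial y \partial x} = \frac{\partial}{\partial y}\left(\frac{\partial}{\partial x}\sin(xy^2)\right) = \frac{\partial}{\partial y}\left(y^2 \cos(xy^2)\right) = 2y\cos(xy^2) - y^2(2xy)\sin(xy^2) = 2y\cos(xy^2) - 2xy^3 \sin(xy^2)$
$\frac{\partial^2 f}{\partial y \partial x} = 2y\cos(xy^2) - 2xy^3 \sin(xy^2)$
Solution
Hint: $\frac{\partial^2 f}{\partial y^2} = \frac{\partial}{\partial y}\left(\frac{\partial}{\partial y}\sin(xy^2)\right) = \frac{\partial}{\partial y}\left(2xy \cos(xy^2)\right) = 2x\cos(xy^2) - 2xy(2xy)\sin(xy^2) = 2x\cos(xy^2) - 4x^2 y^2 \sin(xy^2)$
$\frac{\partial^2 f}{\partial y^2} = 2x\cos(xy^2) - 4x^2 y^2 \sin(xy^2)$
This lends even more evidence to the startling claim that $\frac{\partial^2 f}{\partial x \partial y} = \frac{\partial^2 f}{\partial y \partial x}$.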
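Question 4 Let $f : \mathbb{R}^3 \to \mathbb{R}$ be the function $f(x, y, z) = x^2 y z^3$.
Solution
Hint: $\frac{\partial^2 f}{\partial x \partial z} = \frac{\partial}{\partial x} \frac{\partial}{\partial z} x^2 y z^3 = \frac{\partial}{\partial x} 3x^2 y z^2 = 6xyz^2$
$\frac{\partial^2 f}{\partial x \partial z} = 6xyz^2$
Solution
Hint: $\frac{\partial^2 f}{\partial z \partial x} = \frac{\partial}{\partial z} \frac{\partial}{\partial x} x^2 y z^3 = \frac{\partial}{\partial z} 2xyz^3 = 6xyz^2$
$\frac{\partial^2 f}{\partial z \partial x} = 6xyz^2$
Okay, it really really looks like $\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i}$.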
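The same observation can be probed numerically. The sketch below approximates partial derivatives by central differences and nests them in both orders for $f(x, y) = \sin(xy^2)$ from Question 3. The helper name `partial` and the step sizes are assumptions made for illustration; the inner and outer steps are deliberately different so that the two orders are genuinely different computations.

```python
import math

def partial(f, i, h):
    """Return a function approximating the partial derivative of f in variable i."""
    def df(*p):
        up = list(p)
        up[i] += h
        down = list(p)
        down[i] -= h
        return (f(*up) - f(*down)) / (2 * h)
    return df

def f(x, y):
    return math.sin(x * y**2)

# d/dx of (df/dy), and d/dy of (df/dx), each built from nested differences.
f_xy = partial(partial(f, 1, 1e-5), 0, 1e-3)
f_yx = partial(partial(f, 0, 1e-5), 1, 1e-3)

def mixed_exact(x, y):
    """The formula computed by hand in Question 3."""
    return 2*y*math.cos(x*y**2) - 2*x*y**3*math.sin(x*y**2)
```

At a sample point such as `(0.7, 1.3)`, both `f_xy(0.7, 1.3)` and `f_yx(0.7, 1.3)` should agree with `mixed_exact(0.7, 1.3)` up to the finite-difference error.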
78 Mixed partials commute
Order of partial diﬀerentiation doesn’t matter
In the last section on partial derivatives we made the interesting observation that $\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i}$ for all of the functions we considered. We will now prove this, modulo some technical assumptions.
Theorem 1 Let $f : \mathbb{R}^n \to \mathbb{R}$ be a differentiable function. Assume that the partial derivatives $f_{x_i} : \mathbb{R}^n \to \mathbb{R}$ are all differentiable, and the second partial derivatives $f_{x_i, x_j}$ are continuous. Then $f_{x_i, x_j} = f_{x_j, x_i}$.
First, let's develop some intuition about why this result is true. This informal discussion will also suggest how we should proceed with the formal proof.
Let's restrict our attention, for the moment, to functions $g : \mathbb{R}^2 \to \mathbb{R}$. Observe that $g_x(a, b) \approx \frac{g(a + h, b) - g(a, b)}{h}$ for small values of $h$. Analogously, $g_y(a, b) \approx \frac{g(a, b + k) - g(a, b)}{k}$.
Now applying this idea twice to a function $f : \mathbb{R}^2 \to \mathbb{R}$, we have
$$\begin{aligned}
f_{xy}(a, b) &\approx \frac{1}{h}\left(f_y(a + h, b) - f_y(a, b)\right) \\
&\approx \frac{1}{h}\left(\frac{f(a + h, b + k) - f(a + h, b)}{k} - \frac{f(a, b + k) - f(a, b)}{k}\right) \\
&= \frac{f(a + h, b + k) - f(a + h, b) - f(a, b + k) + f(a, b)}{hk}
\end{aligned}$$
Going through the same process with $f_{yx}$ leads to exactly the same approximation!
So our strategy of proof will be to show that we can express both of these partial derivatives as the two variable limit:
$$f_{xy}(a, b) = \lim_{h, k \to 0} \frac{f(a + h, b + k) - f(a + h, b) - f(a, b + k) + f(a, b)}{hk} = f_{yx}(a, b)$$
Proof Let $\mathrm{HODQ}(h, k) = f(a + h, b + k) - f(a + h, b) - f(a, b + k) + f(a, b)$.
(Here HODQ stands for "higher order difference quotient".)
Let $Q(s) = f(s, b + k) - f(s, b)$.
Then $\mathrm{HODQ}(h, k) = Q(a + h) - Q(a)$.
By the mean value theorem for derivatives, there is an $0 < \epsilon_1 < h$ such that $Q(a + h) - Q(a) = hQ'(a + \epsilon_1)$.
So $\mathrm{HODQ}(h, k) = h\left(f_x(a + \epsilon_1, b + k) - f_x(a + \epsilon_1, b)\right)$.
By the mean value theorem again, we have $\mathrm{HODQ}(h, k) = hk f_{yx}(a + \epsilon_1, b + \epsilon_2)$ for some $0 < \epsilon_2 < k$.
Now apply exactly the same reasoning, with the roles of the two variables exchanged, to conclude that $\mathrm{HODQ}(h, k) = hk f_{xy}(a + \xi_2, b + \xi_1)$ for some $0 < \xi_1 < k$ and $0 < \xi_2 < h$:
Let $R(s) = f(a + h, s) - f(a, s)$.
Then $\mathrm{HODQ}(h, k) = R(b + k) - R(b)$.
By the mean value theorem for derivatives, there is an $0 < \xi_1 < k$ such that $R(b + k) - R(b) = kR'(b + \xi_1)$.
So $\mathrm{HODQ}(h, k) = k\left(f_y(a + h, b + \xi_1) - f_y(a, b + \xi_1)\right)$.
By the mean value theorem again, we have $\mathrm{HODQ}(h, k) = hk f_{xy}(a + \xi_2, b + \xi_1)$ for some $0 < \xi_2 < h$.
So we have
$$\lim_{h, k \to 0} \frac{f(a + h, b + k) - f(a + h, b) - f(a, b + k) + f(a, b)}{hk} = \lim_{h, k \to 0} \frac{\mathrm{HODQ}(h, k)}{hk} = \lim_{h, k \to 0} f_{yx}(a + \epsilon_1, b + \epsilon_2)$$
But since $0 < \epsilon_1 < h$ and $0 < \epsilon_2 < k$, as $h, k \to 0$ we have $a + \epsilon_1 \to a$ and $b + \epsilon_2 \to b$. By the continuity of $f_{yx}$, the limit equals $f_{yx}(a, b)$.
Apply the same reasoning to the second expression for HODQ:
$$\lim_{h, k \to 0} \frac{f(a + h, b + k) - f(a + h, b) - f(a, b + k) + f(a, b)}{hk} = \lim_{h, k \to 0} \frac{\mathrm{HODQ}(h, k)}{hk} = \lim_{h, k \to 0} f_{xy}(a + \xi_2, b + \xi_1)$$
But since $0 < \xi_1 < k$ and $0 < \xi_2 < h$, as $h, k \to 0$ we have $a + \xi_2 \to a$ and $b + \xi_1 \to b$. By the continuity of $f_{xy}$, the limit equals $f_{xy}(a, b)$.
So we can conclude that $f_{xy}(a, b) = f_{yx}(a, b)$, because they are the common value of the limit $\lim_{h, k \to 0} \frac{f(a + h, b + k) - f(a + h, b) - f(a, b + k) + f(a, b)}{hk}$.
We close with a cautionary example. This result is not always true if the second partial derivatives are not continuous. Remember that we define
$g_x(a, b) = \lim_{h \to 0} \frac{g(a + h, b) - g(a, b)}{h}$, and similarly $g_y(a, b) = \lim_{k \to 0} \frac{g(a, b + k) - g(a, b)}{k}$.
Question 2 Define $f(x, y) = \begin{cases} xy\,\dfrac{x^2 - y^2}{x^2 + y^2} & \text{if } (x, y) \neq (0, 0) \\ 0 & \text{if } (x, y) = (0, 0) \end{cases}$
Solution
Hint: Question 3 Solution
Hint:
$$f_y(x, 0) = \lim_{k \to 0} \frac{f(x, k) - f(x, 0)}{k} = \lim_{k \to 0} \frac{xk\,\frac{x^2 - k^2}{x^2 + k^2} - 0}{k} = \lim_{k \to 0} x\,\frac{x^2 - k^2}{x^2 + k^2} = x$$
$f_y(x, 0) = x$
Solution
Hint:
$$f_x(0, y) = \lim_{h \to 0} \frac{f(h, y) - f(0, y)}{h} = \lim_{h \to 0} \frac{hy\,\frac{h^2 - y^2}{h^2 + y^2} - 0}{h} = \lim_{h \to 0} y\,\frac{h^2 - y^2}{h^2 + y^2} = -y$$
$f_x(0, y) = -y$
Hint:
$$f_{xy}(0, 0) = \lim_{h \to 0} \frac{f_y(h, 0) - f_y(0, 0)}{h} = \lim_{h \to 0} \frac{h - 0}{h} = 1$$
Hint:
$$f_{yx}(0, 0) = \lim_{k \to 0} \frac{f_x(0, k) - f_x(0, 0)}{k} = \lim_{k \to 0} \frac{-k - 0}{k} = -1$$
$f_{xy}(0, 0) = 1$
Solution $f_{yx}(0, 0) = -1$
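The failure of symmetry at the origin can also be seen numerically. This is a rough sketch using central differences; the function names and step sizes are assumptions chosen for illustration, with the inner step much smaller than the outer one so that the inner difference quotient behaves like a first partial derivative.

```python
def f(x, y):
    """The cautionary example: xy(x^2 - y^2)/(x^2 + y^2), extended by 0 at the origin."""
    if x == 0 and y == 0:
        return 0.0
    return x * y * (x * x - y * y) / (x * x + y * y)

def f_x(x, y, h=1e-8):
    """Central-difference approximation of the partial derivative in x."""
    return (f(x + h, y) - f(x - h, y)) / (2 * h)

def f_y(x, y, k=1e-8):
    """Central-difference approximation of the partial derivative in y."""
    return (f(x, y + k) - f(x, y - k)) / (2 * k)

def f_xy_at_origin(h=1e-3):
    """Approximate d/dx of f_y at (0, 0)."""
    return (f_y(h, 0.0) - f_y(-h, 0.0)) / (2 * h)

def f_yx_at_origin(k=1e-3):
    """Approximate d/dy of f_x at (0, 0)."""
    return (f_x(0.0, k) - f_x(0.0, -k)) / (2 * k)
```

The two estimates should come out close to $1$ and $-1$ respectively, matching the limits computed in the hints.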
79 Second derivative
The second derivative records how small changes aﬀect the derivative.
The derivative allowed us to ﬁnd the best linear approximation to a function at a
point. But how do these linear approximations change as we move from point to
nearby point? That is exactly what the second derivative is aiming for.
80 Intuitively
The second derivative is a bilinear form.
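From our perspective, the second derivative of a function $f : \mathbb{R}^n \to \mathbb{R}$ at a point will be a bilinear form on $\mathbb{R}^n$. Let us take some time to understand, intuitively, why that should be the case.
Let $f : \mathbb{R}^2 \to \mathbb{R}$ be defined by $f(x, y) = x^2 y$.
$D(f)\big|_{(x,y)}$ is the linear map given by the matrix $\begin{pmatrix} 2xy & x^2 \end{pmatrix}$. That is to say,
$$D(f)\big|_{(x,y)}\begin{pmatrix} \Delta x \\ \Delta y \end{pmatrix} = 2xy\,\Delta x + x^2\,\Delta y \approx f(x + \Delta x, y + \Delta y) - f(x, y).$$
The second derivative should now tell you how much the derivative changes from point to point. If we increment $(x, y)$ by a little bit to $(x + \Delta x, y)$, then we should expect the derivative to increase by about
$$\begin{pmatrix} \Delta x\,\frac{\partial}{\partial x}(2xy) & \Delta x\,\frac{\partial}{\partial x}(x^2) \end{pmatrix} = \begin{pmatrix} 2y\,\Delta x & 2x\,\Delta x \end{pmatrix}.$$
Similarly, when we increase $y$ by $\Delta y$, the derivative should change by about
$$\begin{pmatrix} \Delta y\,\frac{\partial}{\partial y}(2xy) & \Delta y\,\frac{\partial}{\partial y}(x^2) \end{pmatrix} = \begin{pmatrix} 2x\,\Delta y & 0\,\Delta y \end{pmatrix}.$$
By linearity, if we change from $(x, y)$ to $(x + \Delta x, y + \Delta y)$, we expect the derivative to change by
$$\begin{pmatrix} 2y\,\Delta x + 2x\,\Delta y & 2x\,\Delta x + 0 \end{pmatrix} = \begin{pmatrix} \Delta x & \Delta y \end{pmatrix} \begin{pmatrix} 2y & 2x \\ 2x & 0 \end{pmatrix}$$
This gives a row matrix which is the approximate change in the derivative. You can then apply this to another vector if you so wish.
Summing it up, if you wanted to see approximately how much the derivative changes from $p = (x, y)$ to $(x + \Delta x_2, y + \Delta y_2) = p + \vec{h}_2$ (where $\vec{h}_2 = \begin{pmatrix} \Delta x_2 \\ \Delta y_2 \end{pmatrix}$) when both are evaluated in the same direction $\vec{h}_1 = \begin{pmatrix} \Delta x_1 \\ \Delta y_1 \end{pmatrix}$, you would perform the computation:
$$Df\big|_{p + \vec{h}_2}(\vec{h}_1) - Df\big|_p(\vec{h}_1) \approx \begin{pmatrix} \Delta x_2 & \Delta y_2 \end{pmatrix} \begin{pmatrix} 2y & 2x \\ 2x & 0 \end{pmatrix} \begin{pmatrix} \Delta x_1 \\ \Delta y_1 \end{pmatrix}$$
This is exactly using the matrix $\begin{pmatrix} 2y & 2x \\ 2x & 0 \end{pmatrix}$ as a bilinear form applied to the two vectors $\vec{h}_1 = \begin{pmatrix} \Delta x_1 \\ \Delta y_1 \end{pmatrix}$ and $\vec{h}_2 = \begin{pmatrix} \Delta x_2 \\ \Delta y_2 \end{pmatrix}$.
With all of this as motivation, we make the following wishy-washy "definition":
Definition 1 The second derivative of a function $f : \mathbb{R}^n \to \mathbb{R}$ at a point $p \in \mathbb{R}^n$ is a bilinear form $D^2 f\big|_p : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ enjoying the following approximation property:
$$Df\big|_{p + \vec{h}_1}(\vec{h}_2) \approx Df\big|_p(\vec{h}_2) + D^2 f\big|_p(\vec{h}_1, \vec{h}_2)$$
We will make the sense in which this approximation holds precise in another section, but for now this is good enough.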
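Question 2 Suppose $f : \mathbb{R}^2 \to \mathbb{R}$ is a function with $Df\big|_{(1,2)} = \begin{pmatrix} -1 & 1 \end{pmatrix}$ and $Hf\big|_{(1,2)} = \begin{pmatrix} 0 & 4 \\ 4 & 1 \end{pmatrix}$. Approximate $Df\big|_{(1.2, 2.1)}\begin{pmatrix} -0.2 \\ 0.3 \end{pmatrix}$.
Solution
Hint: By the fundamental approximating property, we have
$$Df\big|_{(1,2) + (0.2, 0.1)}\begin{pmatrix} -0.2 \\ 0.3 \end{pmatrix} \approx Df\big|_{(1,2)}\begin{pmatrix} -0.2 \\ 0.3 \end{pmatrix} + D^2 f\big|_{(1,2)}\left(\begin{pmatrix} 0.2 \\ 0.1 \end{pmatrix}, \begin{pmatrix} -0.2 \\ 0.3 \end{pmatrix}\right)$$
Hint: Thus
$$Df\big|_{(1,2) + (0.2, 0.1)}\begin{pmatrix} -0.2 \\ 0.3 \end{pmatrix} \approx \begin{pmatrix} -1 & 1 \end{pmatrix}\begin{pmatrix} -0.2 \\ 0.3 \end{pmatrix} + \begin{pmatrix} 0.2 & 0.1 \end{pmatrix}\begin{pmatrix} 0 & 4 \\ 4 & 1 \end{pmatrix}\begin{pmatrix} -0.2 \\ 0.3 \end{pmatrix}$$
Hint:
$$\begin{aligned}
Df\big|_{(1.2, 2.1)}\begin{pmatrix} -0.2 \\ 0.3 \end{pmatrix} &\approx \begin{pmatrix} -1 & 1 \end{pmatrix}\begin{pmatrix} -0.2 \\ 0.3 \end{pmatrix} + \begin{pmatrix} 0.2 & 0.1 \end{pmatrix}\begin{pmatrix} 0 & 4 \\ 4 & 1 \end{pmatrix}\begin{pmatrix} -0.2 \\ 0.3 \end{pmatrix} \\
&= -1(-0.2) + 1(0.3) + (0.2)(0)(-0.2) + (0.2)(4)(0.3) + (0.1)(4)(-0.2) + (0.1)(1)(0.3) \\
&= 0.2 + 0.3 + 0 + 0.24 - 0.08 + 0.03 \\
&= 0.69
\end{aligned}$$
$Df\big|_{(1.2, 2.1)}\begin{pmatrix} -0.2 \\ 0.3 \end{pmatrix} \approx 0.69$
Note that the computation really splits into a first order term ($Df|_p(\vec{h}_2)$) and a second order change ($D^2 f|_p(\vec{h}_1, \vec{h}_2)$). In this case the first order term was 0.5, and the second order change was 0.19. This should be a better approximation to the real value than if we had used the first derivative alone.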
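Question 3 Let $f : \mathbb{R}^2 \to \mathbb{R}$ be a function with $Df|_p = \begin{pmatrix} 3 & 4 \end{pmatrix}$ and $Hf|_p = \begin{pmatrix} 1 & 3 \\ 3 & -2 \end{pmatrix}$. Approximate $Df\big|_{p + (0.01, 0.02)}$.
Solution
Hint: By the fundamental approximation property, $Df\big|_{p + (0.01, 0.02)}(v) \approx Df|_p(v) + \begin{pmatrix} 0.01 & 0.02 \end{pmatrix} Hf|_p\, v$.
So $Df\big|_{p + (0.01, 0.02)} \approx Df|_p + \begin{pmatrix} 0.01 & 0.02 \end{pmatrix} Hf|_p$ as linear maps from $\mathbb{R}^2 \to \mathbb{R}$.
Hint:
$$\begin{aligned}
\begin{pmatrix} 0.01 & 0.02 \end{pmatrix} Hf|_p &= \begin{pmatrix} 0.01 & 0.02 \end{pmatrix}\begin{pmatrix} 1 & 3 \\ 3 & -2 \end{pmatrix} \\
&= \begin{pmatrix} 0.01(1) + 0.02(3) & 0.01(3) + 0.02(-2) \end{pmatrix} \\
&= \begin{pmatrix} 0.07 & -0.01 \end{pmatrix}
\end{aligned}$$
Hint: So $Df\big|_{p + (0.01, 0.02)} \approx \begin{pmatrix} 3 & 4 \end{pmatrix} + \begin{pmatrix} 0.07 & -0.01 \end{pmatrix} = \begin{pmatrix} 3.07 & 3.99 \end{pmatrix}$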
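The approximation $Df|_{p + \vec{h}_1} \approx Df|_p + \vec{h}_1^\top Hf|_p$ used in this question is just a row vector plus a row-vector-times-matrix product, which can be sketched as follows. The function name `approx_derivative` is an illustrative assumption, and the data are those of the question above.

```python
def approx_derivative(Df_p, H_p, h1):
    """Approximate the row vector Df(p + h1) by Df(p) + h1^T H(p)."""
    n = len(Df_p)
    # j-th entry of the row vector h1^T H(p)
    correction = [sum(h1[i] * H_p[i][j] for i in range(n)) for j in range(n)]
    return [Df_p[j] + correction[j] for j in range(n)]

# Data from the question: Df|p = (3 4), Hf|p = ((1 3), (3 -2)), h1 = (0.01, 0.02).
estimate = approx_derivative([3, 4], [[1, 3], [3, -2]], [0.01, 0.02])
```

Up to floating point rounding, `estimate` should reproduce the row vector $\begin{pmatrix} 3.07 & 3.99 \end{pmatrix}$ found by hand.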
Following the development at the beginning of this activity, we can anticipate how to compute the second derivative as a bilinear form:
Warning 4 This is an intuitive development, not a rigorous proof.
Let $f : \mathbb{R}^n \to \mathbb{R}$.
Since
$$Df\big|_p = \begin{pmatrix} f_{x_1}(p) & f_{x_2}(p) & \cdots & f_{x_n}(p) \end{pmatrix}$$
it is reasonable to think that
$$Df\big|_{p + \vec{h}_1} \approx Df\big|_p + \begin{pmatrix} Df_{x_1}(p)(\vec{h}_1) & Df_{x_2}(p)(\vec{h}_1) & \cdots & Df_{x_n}(p)(\vec{h}_1) \end{pmatrix}$$
but
$$Df_{x_i}(p)(\vec{h}_1) = \begin{pmatrix} f_{x_1 x_i}(p) & f_{x_2 x_i}(p) & \cdots & f_{x_n x_i}(p) \end{pmatrix} \vec{h}_1$$
We can rewrite this as $\vec{h}_1^\top \begin{pmatrix} f_{x_1 x_i}(p) \\ f_{x_2 x_i}(p) \\ \vdots \\ f_{x_n x_i}(p) \end{pmatrix}$, so we obtain the rather pleasing formula
$$Df\big|_{p + \vec{h}_1} \approx Df\big|_p + \vec{h}_1^\top \begin{pmatrix} f_{x_1 x_1}(p) & f_{x_1 x_2}(p) & \cdots & f_{x_1 x_n}(p) \\ f_{x_2 x_1}(p) & f_{x_2 x_2}(p) & \cdots & f_{x_2 x_n}(p) \\ \vdots & & & \vdots \\ f_{x_n x_1}(p) & f_{x_n x_2}(p) & \cdots & f_{x_n x_n}(p) \end{pmatrix}$$
So
$$Df\big|_{p + \vec{h}_1}(\vec{h}_2) \approx Df\big|_p(\vec{h}_2) + \vec{h}_1^\top \begin{pmatrix} f_{x_1 x_1}(p) & f_{x_1 x_2}(p) & \cdots & f_{x_1 x_n}(p) \\ f_{x_2 x_1}(p) & f_{x_2 x_2}(p) & \cdots & f_{x_2 x_n}(p) \\ \vdots & & & \vdots \\ f_{x_n x_1}(p) & f_{x_n x_2}(p) & \cdots & f_{x_n x_n}(p) \end{pmatrix} \vec{h}_2$$
So it looks like we have:
Theorem 5 If $f : \mathbb{R}^n \to \mathbb{R}$, the matrix of the bilinear form $D^2 f\big|_p : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ is
$$\begin{pmatrix} f_{x_1 x_1}(p) & f_{x_1 x_2}(p) & \cdots & f_{x_1 x_n}(p) \\ f_{x_2 x_1}(p) & f_{x_2 x_2}(p) & \cdots & f_{x_2 x_n}(p) \\ \vdots & & & \vdots \\ f_{x_n x_1}(p) & f_{x_n x_2}(p) & \cdots & f_{x_n x_n}(p) \end{pmatrix}$$
This matrix is also called the Hessian matrix of $f$.
We could also express this in the following convenient notation:
$$D^2 f\big|_p = \sum_{i,j=1}^{n} \frac{\partial^2 f}{\partial x_i \partial x_j}\, dx_i \otimes dx_j$$
By the equality of mixed partial derivatives, this bilinear form is actually symmetric! So all of the theory we developed about self-adjoint linear operators and symmetric bilinear forms can (and will) be brought to bear on the study of the second derivative.
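Question 6 Let $f : \mathbb{R}^2 \to \mathbb{R}$ be defined by $f(x, y) = \frac{x}{y}$.
Solution
Hint: Question 7 Solution
Hint: $f_{xx} = \frac{\partial}{\partial x} \frac{\partial}{\partial x} \frac{x}{y} = \frac{\partial}{\partial x} \frac{1}{y} = 0$
$f_{xx} = 0$
Solution
Hint: $f_{xy} = \frac{\partial}{\partial x} \frac{\partial}{\partial y} \frac{x}{y} = \frac{\partial}{\partial x} \frac{-x}{y^2} = \frac{-1}{y^2}$
$f_{xy} = -1/y^2$
Solution
Hint: $f_{yx} = f_{xy}$ by equality of mixed partials, so $f_{yx} = \frac{-1}{y^2}$
$f_{yx} = -1/y^2$
Solution
Hint: $f_{yy} = \frac{\partial}{\partial y} \frac{\partial}{\partial y} \frac{x}{y} = \frac{\partial}{\partial y} \frac{-x}{y^2} = \frac{2x}{y^3}$
$f_{yy} = 2x/y^3$
Hint: $\mathcal{H} = \begin{pmatrix} 0 & \frac{-1}{y^2} \\ \frac{-1}{y^2} & \frac{2x}{y^3} \end{pmatrix}$
What is the Hessian matrix of $f$ at the point $(x, y)$?
Question 8 Let $f : \mathbb{R}^3 \to \mathbb{R}$ be defined by $f(x, y, z) = xy + yz$.
Solution
Hint: The only second partials which are not zero are $f_{xy} = f_{yx}$ and $f_{yz} = f_{zy}$.
Hint: $f_{xy} = f_{yx} = 1$ and $f_{yz} = f_{zy} = 1$
Hint: $\mathcal{H} = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}$
What is the Hessian matrix of $f$ at the point $(x, y, z)$?
Notice that the second derivative of this function is the same at every point because $f$ was a quadratic function. Any other polynomial of degree 2 in $n$ variables would also have a constant second derivative. For example $f(x, y, z, t) = 1 + 2x + 3z + 4z^2 + zx + xt + t^2 + yx$ would also have constant second derivative.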
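The claim that a quadratic function has the same Hessian everywhere can be checked numerically. The sketch below estimates the Hessian by central differences; the function name `hessian` and the step size `h` are assumptions made for illustration, and the estimate is only accurate up to finite-difference and rounding error.

```python
def hessian(f, p, h=1e-3):
    """Central-difference estimate of the Hessian matrix of f at the point p."""
    n = len(p)

    def at(i, si, j, sj):
        # Evaluate f after shifting coordinate i by si*h and coordinate j by sj*h.
        q = list(p)
        q[i] += si * h
        q[j] += sj * h
        return f(*q)

    return [[(at(i, 1, j, 1) - at(i, 1, j, -1)
              - at(i, -1, j, 1) + at(i, -1, j, -1)) / (4 * h * h)
             for j in range(n)]
            for i in range(n)]

def f(x, y, z):
    """The function from Question 8."""
    return x * y + y * z
```

Evaluating `hessian(f, [1.0, 2.0, 3.0])` and `hessian(f, [-4.0, 0.5, 7.0])` should give (approximately) the same matrix, the one found in Question 8.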
81 Rigorously
The second derivative allows approximations to the derivative.
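Definition 1 Let $f : \mathbb{R}^n \to \mathbb{R}$ be a differentiable function, and $p \in \mathbb{R}^n$. We say that $f$ is twice differentiable at $p$ if there is a bilinear form $B : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ with
$$Df(p + \vec{h}_1)(\vec{h}_2) = Df(p)(\vec{h}_2) + B(\vec{h}_1, \vec{h}_2) + \mathrm{Error}(\vec{h}_1, \vec{h}_2)$$
with
$$\lim_{\vec{h}_1, \vec{h}_2 \to 0} \frac{\left|\mathrm{Error}(\vec{h}_1, \vec{h}_2)\right|}{|\vec{h}_1||\vec{h}_2|} = 0$$
In this case we call $B$ the second derivative of $f$ at $p$ and write $B = D^2 f(p)$.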
Theorem 2 Let $f : \mathbb{R}^n \to \mathbb{R}$ be a function which is twice differentiable everywhere. Then the second derivative of $f$ at $p$ has the matrix
$$\mathcal{H}(p) = \begin{pmatrix} f_{x_1 x_1}(p) & f_{x_1 x_2}(p) & \cdots & f_{x_1 x_n}(p) \\ f_{x_2 x_1}(p) & f_{x_2 x_2}(p) & \cdots & f_{x_2 x_n}(p) \\ \vdots & & & \vdots \\ f_{x_n x_1}(p) & f_{x_n x_2}(p) & \cdots & f_{x_n x_n}(p) \end{pmatrix}$$
Prove this theorem!
Hint: Apply the definition to $B(h e_i, k e_j)$.
We want to show that $D^2 f(p)(e_i, e_j) = f_{x_i, x_j}(p)$.
By definition, we have that
$$\lim_{h, k \to 0} \frac{\left|Df(p + h e_i)(k e_j) - Df(p)(k e_j) - D^2 f(p)(h e_i, k e_j)\right|}{|h e_i||k e_j|} = 0$$
So by the linearity of the derivative, and the bilinearity of the second derivative,
$$\lim_{h, k \to 0} \frac{\left|k\,Df(p + h e_i)(e_j) - k\,Df(p)(e_j) - hk\,D^2 f(p)(e_i, e_j)\right|}{|hk|} = 0.$$
So we have
$$\lim_{h, k \to 0} \left|\frac{Df(p + h e_i)(e_j) - Df(p)(e_j)}{h} - D^2 f(p)(e_i, e_j)\right| = 0$$
which implies
$$D^2 f(p)(e_i, e_j) = \lim_{h \to 0} \frac{Df(p + h e_i)(e_j) - Df(p)(e_j)}{h}$$
But $Df(x)(e_j) = f_{x_j}(x)$ for any $x \in \mathbb{R}^n$, so this is
$$D^2 f(p)(e_i, e_j) = \lim_{h \to 0} \frac{f_{x_j}(p + h e_i) - f_{x_j}(p)}{h}$$
But by definition of the partial derivative, this implies that
$$D^2 f(p)(e_i, e_j) = f_{x_i, x_j}(p)$$
82 Taylor series
The second derivative allows us to approximate functions better than just the ﬁrst
derivative
As it stands, the second derivative lets us get approximations of the first derivative. The first derivative allows us to get approximations of the original function. In the following extended question, we will see how we can use the second derivative to get information about the original function. This will lead to approximations with second order accuracy, rather than just first order accuracy. This is the essence of the second order Taylor's theorem.
83 An example
A speciﬁc example sheds light on Taylor series.
Let’s work through an example.
Question 1 Let $f : \mathbb{R}^2 \to \mathbb{R}$ be a function. All we know about $f$ at the point $(1, 2)$ is the following:
• $f(1, 2) = 6$
• $Df(1, 2) = \begin{pmatrix} 4 & 5 \end{pmatrix}$
• $D^2 f(1, 2) = \begin{pmatrix} 1 & -2 \\ -2 & 3 \end{pmatrix}$
Suppose that we want to approximate $f(1.1, 1.9)$ as accurately as we can given this information. We can simply use the linear approximation to $f$ at $(1, 2)$:
Solution
Hint:
$$f(1.1, 1.9) \approx 6 + \begin{pmatrix} 4 & 5 \end{pmatrix}\begin{pmatrix} 0.1 \\ -0.1 \end{pmatrix} = 6 + 0.4 - 0.5 = 5.9$$
Using the linear approximation to $f$ at $(1, 2)$, we find $f(1.1, 1.9) \approx 5.9$.
This approximation ignores the second order data provided by the second derivative: we have essentially assumed that the first derivative is constant along the line from $(1, 2)$ to $(1.1, 1.9)$. Since we know the second derivative at the point $(1, 2)$, we can estimate how the derivative is changing along this line, assuming the second derivative is constant, and get a better approximation.
For example, we could use the following three step process:
• Use the linear approximation to $f$ at $(1, 2)$ to approximate $f(1.05, 1.95)$
• Use the second derivative to approximate $Df(1.05, 1.95)$
• Use the linear approximation to $f$ at $(1.05, 1.95)$ to approximate $f(1.1, 1.9)$
Solution
Hint:
$$f(1.05, 1.95) \approx 6 + \begin{pmatrix} 4 & 5 \end{pmatrix}\begin{pmatrix} 0.05 \\ -0.05 \end{pmatrix} = 6 + 0.2 - 0.25 = 5.95$$
Let's try that here: $f(1.05, 1.95) \approx 5.95$
Solution
Hint:
$$\begin{aligned}
Df(1.05, 1.95) &\approx Df(1, 2) + \begin{pmatrix} 0.05 & -0.05 \end{pmatrix}\begin{pmatrix} 1 & -2 \\ -2 & 3 \end{pmatrix} \\
&= \begin{pmatrix} 4 & 5 \end{pmatrix} + \begin{pmatrix} 0.05(1) + (-0.05)(-2) & 0.05(-2) + (-0.05)(3) \end{pmatrix} \\
&= \begin{pmatrix} 4 & 5 \end{pmatrix} + \begin{pmatrix} 0.15 & -0.25 \end{pmatrix} \\
&= \begin{pmatrix} 4.15 & 4.75 \end{pmatrix}
\end{aligned}$$
Using the second derivative, $Df(1.05, 1.95)$ is approximately: $\begin{pmatrix} 4.15 & 4.75 \end{pmatrix}$
Solution
Hint:
$$f(1.1, 1.9) \approx 5.95 + \begin{pmatrix} 4.15 & 4.75 \end{pmatrix}\begin{pmatrix} 0.05 \\ -0.05 \end{pmatrix} = 5.95 + 4.15(0.05) + 4.75(-0.05) = 5.92$$
Using the linear approximation to $f$ at $(1.05, 1.95)$, $f(1.1, 1.9) \approx 5.92$
So this method allowed us to get a slightly better approximation of $f(1.1, 1.9)$ using the fact that $Df_p\begin{pmatrix} 1 \\ -1 \end{pmatrix}$ is increasing as $p$ moves from $(1, 2)$ in the direction $\begin{pmatrix} 1 \\ -1 \end{pmatrix}$. We got a slightly higher estimate for $f(1.1, 1.9)$ using this two step approximation compared to using the linear approximation because $D^2 f(1, 2)\left(\begin{pmatrix} 1 \\ -1 \end{pmatrix}, \begin{pmatrix} 1 \\ -1 \end{pmatrix}\right) = 8$ is positive.
We do not have to limit ourselves to only using a two step approximation: we could get better and better approximations of $f(1.1, 1.9)$ by using more and more partitions of the line segment from $(1, 2)$ to $(1.1, 1.9)$. For example, we could use ten partitions:
• Use the linear approximation to $f$ at $(1, 2)$ to approximate $f(1.01, 1.99)$
• Use the second derivative to approximate $Df(1.01, 1.99)$
• Use the linear approximation to $f$ at $(1.01, 1.99)$ to approximate $f(1.02, 1.98)$
• Use the second derivative to approximate $Df(1.02, 1.98)$
• Use the linear approximation to $f$ at $(1.02, 1.98)$ to approximate $f(1.03, 1.97)$
• $\vdots$
• Use the linear approximation to $f$ at $(1.09, 1.91)$ to approximate $f(1.1, 1.9)$
This kind of process, where we are summing more and more of smaller and smaller values to approximate something, furiously demands to be phrased as an integral.
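Solution
Hint: Write $\vec{h} = \begin{pmatrix} 0.1 \\ -0.1 \end{pmatrix}$. Notice that
$$\begin{aligned}
D^2 f(1, 2)\left(\tfrac{1}{n}\vec{h}, \tfrac{1}{n}\vec{h}\right) &= \begin{pmatrix} 0.1\tfrac{1}{n} & -0.1\tfrac{1}{n} \end{pmatrix}\begin{pmatrix} 1 & -2 \\ -2 & 3 \end{pmatrix}\begin{pmatrix} 0.1\tfrac{1}{n} \\ -0.1\tfrac{1}{n} \end{pmatrix} \\
&= \big(0.1(0.1)(1) + 0.1(-0.1)(-2) + (-0.1)(0.1)(-2) + (-0.1)(-0.1)(3)\big)\frac{1}{n^2} \\
&= 0.08\,\frac{1}{n^2}
\end{aligned}$$
Hint: By partitioning $[0, 1]$ into $n$ little pieces of equal width, the contribution to the sum
• over $[0, \frac{1}{n}]$ is
$$Df(1, 2)\left(\tfrac{1}{n}\vec{h}\right) = \begin{pmatrix} 4 & 5 \end{pmatrix}\begin{pmatrix} 0.1\tfrac{1}{n} \\ -0.1\tfrac{1}{n} \end{pmatrix} = -0.1\,\frac{1}{n}$$
• over $[\frac{1}{n}, \frac{2}{n}]$ is
$$Df\left((1, 2) + \tfrac{1}{n}\vec{h}\right)\left(\tfrac{1}{n}\vec{h}\right) \approx Df(1, 2)\left(\tfrac{1}{n}\vec{h}\right) + D^2 f(1, 2)\left(\tfrac{1}{n}\vec{h}, \tfrac{1}{n}\vec{h}\right) = -0.1\,\frac{1}{n} + 0.08\,\frac{1}{n^2}$$
• over $[\frac{2}{n}, \frac{3}{n}]$ is
$$\begin{aligned}
Df\left((1, 2) + \tfrac{2}{n}\vec{h}\right)\left(\tfrac{1}{n}\vec{h}\right) &\approx Df\left((1, 2) + \tfrac{1}{n}\vec{h}\right)\left(\tfrac{1}{n}\vec{h}\right) + D^2 f(1, 2)\left(\tfrac{1}{n}\vec{h}, \tfrac{1}{n}\vec{h}\right) \\
&\approx -0.1\,\frac{1}{n} + 0.08\,\frac{1}{n^2} + 0.08\,\frac{1}{n^2} \\
&= -0.1\,\frac{1}{n} + 0.08\,\frac{2}{n}\,\frac{1}{n}
\end{aligned}$$
• $\vdots$
• over $[\frac{k}{n}, \frac{k+1}{n}]$ is
$$\begin{aligned}
Df\left((1, 2) + \tfrac{k}{n}\vec{h}\right)\left(\tfrac{1}{n}\vec{h}\right) &\approx Df\left((1, 2) + \tfrac{k-1}{n}\vec{h}\right)\left(\tfrac{1}{n}\vec{h}\right) + D^2 f(1, 2)\left(\tfrac{1}{n}\vec{h}, \tfrac{1}{n}\vec{h}\right) \\
&\approx -0.1\,\frac{1}{n} + (k - 1)\,0.08\,\frac{1}{n^2} + 0.08\,\frac{1}{n^2} \\
&= -0.1\,\frac{1}{n} + 0.08\,\frac{k}{n}\,\frac{1}{n}
\end{aligned}$$
Hint: So $f(1.1, 1.9) \approx 6 + \sum_{k=0}^{n-1} \left(-0.1\,\frac{1}{n} + 0.08\,\frac{k}{n}\,\frac{1}{n}\right)$
Hint: By definition of the integral we have $\lim_{n \to \infty} \sum_{k=0}^{n-1} \left(-0.1\,\frac{1}{n} + 0.08\,\frac{k}{n}\,\frac{1}{n}\right) = \int_0^1 (-0.1 + 0.08t)\,dt$
In this case, we get that $f(1.1, 1.9) \approx 6 + \int_0^1 (-0.1 + 0.08t)\,dt$; that is, the integrand is $-0.1 + 0.08t$.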
Solution
Hint:
$$\begin{aligned}
f(1.1, 1.9) &\approx 6 + \int_0^1 (-0.1 + 0.08t)\,dt \\
&= 6 + \left[-0.1t + \tfrac{1}{2}(0.08)t^2\right]_0^1 \\
&= 6 - 0.1 + 0.04 \\
&= 5.94
\end{aligned}$$
Evaluating this integral we have $f(1.1, 1.9) \approx 5.94$. This is the best approximation we can really expect to get given only this information about $f$ at $(1, 2)$.
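The multi-step procedure described in this question is easy to simulate. The sketch below uses only the data given at $(1, 2)$, takes $n$ equal steps along the segment to $(1.1, 1.9)$, and updates the estimate of $Df$ after each step with the (assumed constant) second derivative. The function name `multistep_estimate` is an illustrative assumption.

```python
def multistep_estimate(n):
    """Estimate f(1.1, 1.9) using n linear steps from (1, 2)."""
    f_est = 6.0                       # f(1, 2)
    Df = [4.0, 5.0]                   # Df(1, 2)
    H = [[1.0, -2.0], [-2.0, 3.0]]    # D^2 f(1, 2), treated as constant
    step = [0.1 / n, -0.1 / n]        # one n-th of the displacement (0.1, -0.1)
    for _ in range(n):
        # Linear approximation across this small step.
        f_est += Df[0] * step[0] + Df[1] * step[1]
        # Update the derivative: Df <- Df + step^T H.
        Df = [Df[j] + step[0] * H[0][j] + step[1] * H[1][j] for j in range(2)]
    return f_est
```

With `n = 1` this is the plain linear approximation (5.9), with `n = 2` it is the two step estimate (5.92), and as `n` grows the values approach the integral's value 5.94.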
Notice that this approximation of $f$ is really just
$$f(1, 2) + Df(1, 2)\begin{pmatrix} 0.1 \\ -0.1 \end{pmatrix} + \frac{1}{2} D^2 f(1, 2)\left(\begin{pmatrix} 0.1 \\ -0.1 \end{pmatrix}, \begin{pmatrix} 0.1 \\ -0.1 \end{pmatrix}\right).$$
The first two terms are just the regular linear approximation to $f$ at $(1, 2)$, but the next term arose from integrating the function $D^2 f\left(\begin{pmatrix} 0.1 \\ -0.1 \end{pmatrix}, \begin{pmatrix} 0.1 \\ -0.1 \end{pmatrix}\right) t$ from $t = 0$ to $t = 1$. This is exactly $\frac{1}{2} D^2 f\left(\begin{pmatrix} 0.1 \\ -0.1 \end{pmatrix}, \begin{pmatrix} 0.1 \\ -0.1 \end{pmatrix}\right)$.
Generalizing, we might expect that
Theorem 2 $f(p + \vec{h}) \approx f(p) + Df(p)(\vec{h}) + \frac{1}{2} D^2 f(p)(\vec{h}, \vec{h})$
This is the second order Taylor approximation of $f$ at $p$. Notice how similar it looks to the second order Taylor approximation of a single variable function! If we had not taken the time to develop an understanding of $D^2 f(p)$ as a bilinear map, it would be quite messy to even state this theorem, and it would only get worse for the higher order Taylor's theorem we will be learning about next week.
Hopefully this (admittedly long) discussion has helped you to understand where this approximation comes from! We will give a rigorous statement and proof of the theorem in the next section.
84 Rigorously
The second derivative enables quadratic approximation
You should know the statement of the following theorem for this course:
Theorem 1 (Second order Taylor's theorem) If $f : \mathbb{R}^n \to \mathbb{R}$ is a twice differentiable function and $p \in \mathbb{R}^n$, then we have
$$f(p + \vec{h}) = f(p) + Df(p)\vec{h} + \frac{1}{2} D^2 f(p + \xi\vec{h})(\vec{h}, \vec{h})$$
for some $\xi \in [0, 1]$.
It follows (after a lot of work!) that
$$f(p + \vec{h}) = f(p) + Df(p)\vec{h} + \frac{1}{2} D^2 f(p)(\vec{h}, \vec{h}) + \mathrm{Error}(\vec{h})$$
with $\lim_{\vec{h} \to \vec{0}} \frac{|\mathrm{Error}(\vec{h})|}{|\vec{h}|^2} = 0$.
This approximation is also sometimes phrased as $f(x) \approx f(p) + Df(p)(x - p) + \frac{1}{2} D^2 f(p)(x - p, x - p)$.
In the future, we will prove the above theorem. To prepare yourself, you should at a minimum make sure you have already worked through the other two optional sections on the formal definition of a limit (footnote 1) and also the proof of the chain rule (footnote 2), where operator norms are introduced.
Proof
1 http://ximera.osu.edu/course/kisonecat/m2o2c2/course/activity/week2/limits/formal-limit/
2 http://ximera.osu.edu/course/kisonecat/m2o2c2/course/activity/week2/chain-rule/proof/
85 Taylor’s theorem examples
Let’s see how Taylor’s theorem gives us better approximations
Warning 1 Just due to the sheer number of calculations, these questions are
quite long.
Question 2 Consider $f : \mathbb{R}^2 \to \mathbb{R}$ defined by $f(x, y) = x\cos(y) + xy$.
Solution
Hint: Question 3 Solution
Hint: $\frac{\partial}{\partial x}\left(x\cos(y) + xy\right) = \cos(y) + y$
So $f_x(0, 0) = 1$
$f_x(0, 0) = 1$
Solution
Hint: $\frac{\partial}{\partial y}\left(x\cos(y) + xy\right) = -x\sin(y) + x$
So $f_y(0, 0) = 0$
$f_y(0, 0) = 0$
Hint:
$$\begin{aligned}
f(x, y) &\approx f(0, 0) + Df(0, 0)\begin{pmatrix} x \\ y \end{pmatrix} \\
&= 0\cos(0) + 0(0) + \begin{pmatrix} f_x(0, 0) & f_y(0, 0) \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} \\
&= \begin{pmatrix} 1 & 0 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} \\
&= x
\end{aligned}$$
The linear approximation to $f$ at $(0, 0)$ is $f(x, y) \approx x$
Solution
Hint: Question 4 Solution
Hint: $f_{xx} = 0$
$f_{xx}(0, 0) = 0$
Solution
Hint: $f_{xy} = -\sin(y) + 1$
$f_{xy}(0, 0) = 1$
Solution
Hint: $f_{yx} = -\sin(y) + 1$
$f_{yx}(0, 0) = 1$
Solution
Hint: $f_{yy} = -x\cos(y)$, which vanishes at $(0, 0)$
$f_{yy}(0, 0) = 0$
Hint: So $\mathcal{H}(0, 0) = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$
Hint: By Taylor's theorem, $f(x, y) \approx f(0, 0) + Df(0, 0)\begin{pmatrix} x \\ y \end{pmatrix} + \frac{1}{2} D^2 f(0, 0)\left(\begin{pmatrix} x \\ y \end{pmatrix}, \begin{pmatrix} x \\ y \end{pmatrix}\right)$
Hint: So
$$\begin{aligned}
f(x, y) &\approx 0 + x + \frac{1}{2}\begin{pmatrix} x & y \end{pmatrix}\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} \\
&= x + \frac{1}{2}(2xy) \\
&= x + xy
\end{aligned}$$
The second order approximation to $f$ at $(0, 0)$ is $f(x, y) \approx x + xy$
It is kind of cool that we could also read this off from the following magic:
$$\begin{aligned}
x\cos(y) + xy &= x\left(1 - \frac{y^2}{2!} + \frac{y^4}{4!} - \cdots\right) + xy \\
&= x + xy - \frac{xy^2}{2!} + \frac{xy^4}{4!} - \cdots
\end{aligned}$$
So it looks like the second order approximation is $x + xy$.
Solution Using the first order approximation $f(0.1, 0.2) \approx 0.1$
Solution Using the second order approximation $f(0.1, 0.2) \approx 0.12$
A calculator tells me $f(0.1, 0.2) \approx 0.11800665778$. So clearly, the second order approximation is better. Notice that the second order approximation is slightly high, and this is apparent from our magical calculation, since the next term should be $-\frac{0.1(0.2)^2}{2} = -0.002$, which gets us even closer to the exact answer. We will make the magic more precise when we deal with the full multivariable Taylor's theorem later.
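The comparison of the two approximations at $(0.1, 0.2)$ can be reproduced in a few lines. The function names `first_order`, `second_order` and `third_order` are illustrative assumptions; the third one simply adds the next term $-xy^2/2$ read off from the cosine series above.

```python
import math

def f(x, y):
    return x * math.cos(y) + x * y

def first_order(x, y):
    """Linear approximation of f at (0, 0)."""
    return x

def second_order(x, y):
    """Second order Taylor approximation of f at (0, 0)."""
    return x + x * y

def third_order(x, y):
    """Second order approximation plus the next series term -x*y^2/2."""
    return x + x * y - x * y**2 / 2

exact = f(0.1, 0.2)
errors = [abs(exact - g(0.1, 0.2)) for g in (first_order, second_order, third_order)]
```

The list `errors` should be decreasing: each additional order brings the estimate closer to the exact value.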
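Question 5 Consider $f : \mathbb{R}^3 \to \mathbb{R}$ defined by $f(x, y, z) = xe^{z+y} + z^2$.
Solution
Hint: $f(x, y, z) \approx f(0, 0, 1) + Df(0, 0, 1)\begin{pmatrix} x \\ y \\ z - 1 \end{pmatrix} + \frac{1}{2} D^2 f(0, 0, 1)\left(\begin{pmatrix} x \\ y \\ z - 1 \end{pmatrix}, \begin{pmatrix} x \\ y \\ z - 1 \end{pmatrix}\right)$
Hint: Question 6 Solution
Hint:
$$Df(0, 0, 1) = \begin{pmatrix} \frac{\partial f}{\partial x} & \frac{\partial f}{\partial y} & \frac{\partial f}{\partial z} \end{pmatrix}\bigg|_{(0,0,1)} = \begin{pmatrix} e^{z+y} & xe^{z+y} & xe^{z+y} + 2z \end{pmatrix}\bigg|_{(0,0,1)} = \begin{pmatrix} e & 0 & 2 \end{pmatrix}$$
The matrix of $Df(0, 0, 1)$ is $\begin{pmatrix} e & 0 & 2 \end{pmatrix}$
Solution
Hint:
$$\mathcal{H}(0, 0, 1) = \begin{pmatrix} \frac{\partial^2 f}{\partial x \partial x} & \frac{\partial^2 f}{\partial x \partial y} & \frac{\partial^2 f}{\partial x \partial z} \\ \frac{\partial^2 f}{\partial y \partial x} & \frac{\partial^2 f}{\partial y \partial y} & \frac{\partial^2 f}{\partial y \partial z} \\ \frac{\partial^2 f}{\partial z \partial x} & \frac{\partial^2 f}{\partial z \partial y} & \frac{\partial^2 f}{\partial z \partial z} \end{pmatrix}\Bigg|_{(0,0,1)} = \begin{pmatrix} 0 & e^{z+y} & e^{z+y} \\ e^{z+y} & xe^{z+y} & xe^{z+y} \\ e^{z+y} & xe^{z+y} & xe^{z+y} + 2 \end{pmatrix}\Bigg|_{(0,0,1)} = \begin{pmatrix} 0 & e & e \\ e & 0 & 0 \\ e & 0 & 2 \end{pmatrix}$$
The Hessian matrix of $f$ at $(0, 0, 1)$ is $\begin{pmatrix} 0 & e & e \\ e & 0 & 0 \\ e & 0 & 2 \end{pmatrix}$
Hint:
$$\begin{aligned}
f(x, y, z) &\approx 1 + \begin{pmatrix} e & 0 & 2 \end{pmatrix}\begin{pmatrix} x \\ y \\ z - 1 \end{pmatrix} + \frac{1}{2}\begin{pmatrix} x & y & z - 1 \end{pmatrix}\begin{pmatrix} 0 & e & e \\ e & 0 & 0 \\ e & 0 & 2 \end{pmatrix}\begin{pmatrix} x \\ y \\ z - 1 \end{pmatrix} \\
&= 1 + ex + 2(z - 1) + exy + ex(z - 1) + (z - 1)^2
\end{aligned}$$
The second order Taylor expansion of $f$ about the point $(0, 0, 1)$ is $f(x, y, z) \approx 1 + ex + 2(z - 1) + exy + ex(z - 1) + (z - 1)^2$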
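Question 7 Let $f : \mathbb{R}^4 \to \mathbb{R}$ be a function with
• $f(0, 0, 0, 0) = 2$
• $Df(0, 0, 0, 0) = \begin{pmatrix} 1 & -1 & 0 & 0 \end{pmatrix}$
• $D^2 f(0, 0, 0, 0) = \begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & 2 & 0 & 3 \\ 0 & 0 & 0 & 0 \\ 0 & 3 & 0 & 0 \end{pmatrix}$
Solution
Hint: $f(x, y, z, t) \approx f(0, 0, 0, 0) + Df(0, 0, 0, 0)\begin{pmatrix} x \\ y \\ z \\ t \end{pmatrix} + \frac{1}{2} D^2 f(0, 0, 0, 0)\left(\begin{pmatrix} x \\ y \\ z \\ t \end{pmatrix}, \begin{pmatrix} x \\ y \\ z \\ t \end{pmatrix}\right)$
Hint:
$$\begin{aligned}
f(x, y, z, t) &\approx 2 + \begin{pmatrix} 1 & -1 & 0 & 0 \end{pmatrix}\begin{pmatrix} x \\ y \\ z \\ t \end{pmatrix} + \frac{1}{2}\begin{pmatrix} x & y & z & t \end{pmatrix}\begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & 2 & 0 & 3 \\ 0 & 0 & 0 & 0 \\ 0 & 3 & 0 & 0 \end{pmatrix}\begin{pmatrix} x \\ y \\ z \\ t \end{pmatrix} \\
&= 2 + x - y + y^2 + 3yt
\end{aligned}$$
The second order approximation to $f$ at $(0, 0, 0, 0)$ is $f(x, y, z, t) \approx 2 + x - y + y^2 + 3yt$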
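The second order approximation only needs three pieces of data at the base point, so it can be written as one small function. The name `taylor2` and its argument names are illustrative assumptions; the checks below use the data of this question and of the earlier example at $(1, 2)$.

```python
def taylor2(f_p, Df_p, H_p, h):
    """Second order Taylor approximation f(p) + Df(p) h + (1/2) h^T H(p) h."""
    n = len(h)
    linear = sum(Df_p[i] * h[i] for i in range(n))
    quadratic = sum(h[i] * H_p[i][j] * h[j] for i in range(n) for j in range(n))
    return f_p + linear + 0.5 * quadratic

# Data from this question, evaluated at a sample displacement (x, y, z, t).
H7 = [[0, 0, 0, 0], [0, 2, 0, 3], [0, 0, 0, 0], [0, 3, 0, 0]]
x, y, z, t = 0.1, 0.2, 0.3, 0.4
from_matrix = taylor2(2, [1, -1, 0, 0], H7, [x, y, z, t])
from_formula = 2 + x - y + y**2 + 3 * y * t
```

The two values `from_matrix` and `from_formula` should agree, since the formula is just the matrix expression multiplied out.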
86 Optimization
Optimization means ﬁnding a biggest (or smallest) value.
Suppose $A$ is a subset of $\mathbb{R}^n$, meaning that each element of $A$ is a vector in $\mathbb{R}^n$. Maybe $A$ contains all vectors in $\mathbb{R}^n$, maybe not. Further suppose $f : A \to \mathbb{R}$ is a function.
Question 1 Is there a vector $v \in A$ so that $f(v)$ is at least as large as any other output of $f$?
Solution
(a) This is always the case.
(b) This is not necessarily the case.
It really does depend on $A$ and on $f$.
For example, suppose $f(v) = \langle v, v \rangle$, meaning $f$ sends $v$ to the square of the length of $v$. Further suppose that $A = \{v \in \mathbb{R}^n : |v| < 1\}$.
Then for all $v \in A$, it is the case that $f(v) < 1$. And yet, there is not a single vector $v \in A$ so that $f(v)$ is at least as large as all outputs of $f$.
If you claim that you have found a nonzero vector $v$ so that $f(v)$ is as large as any output of $f$, then you should consider the input
$$w = \frac{1 + |v|}{2|v|}\, v,$$
which has length $\frac{1 + |v|}{2}$, strictly between $|v|$ and 1, and note that $f(w) > f(v)$. (If $v$ is the zero vector, any nonzero $w \in A$ already has $f(w) > f(v)$.)
Question 2 Let's consider an example. Let $g : \mathbb{R}^2 \to \mathbb{R}$ be the function given by
$$g\begin{pmatrix} x \\ y \end{pmatrix} = 10 - (x + 1)^2 - (y - 2)^2.$$
Solution
Hint: No matter what $x$ is, $(x + 1)^2 \geq 0$.
Hint: No matter what $y$ is, $(y - 2)^2 \geq 0$.
Hint: No matter what $x$ and $y$ are, $(x + 1)^2 + (y - 2)^2 \geq 0$.
Hint: No matter what $x$ and $y$ are, $-(x + 1)^2 - (y - 2)^2 \leq 0$.
Hint: No matter what $x$ and $y$ are, $10 - (x + 1)^2 - (y - 2)^2 \leq 10$.
Hint: If $(x, y) = (-1, 2)$, then $10 - (x + 1)^2 - (y - 2)^2 = 10$.
Hint: Consequently, the largest possible output of $g$ is 10.
The largest possible output of $g$ is 10.
Solution This largest possible output occurs when $x$ is $-1$.
Solution This largest possible output occurs when $y$ is 2.
In this case, we were able to think through the situation by considering some algebra, namely the fact that when we square a real number, the result is nonnegative.
Here is the key idea that motivates everything we are about to do: using the second derivative, we can approximate complicated functions by quadratic functions, and then analyze those in the way we analyzed this example.
87 Deﬁnitions
“Local” means “after restricting to a small neighborhood.”
Deﬁnition 1 Let X ⊂ R
n
and f : X → R. To say that the maximum value
of f occurs at the point p ∈ X is to say that, for all q ∈ X, we have f(p) ≥ f(q).
Conversely, to say that the minimum value of f occurs at the point p ∈ X is
to say that, for all q ∈ X, we have f(p) ≤ f(q).
Sometimes people use the term “extremum value” to speak of both maximum values and minimum values. Sometimes people say “maxima” instead of “maximums.”
A function need not achieve a maximum or a minimum value.
Our goal will be to use calculus to search for maximums and minimums, but that
raises a problem. The derivative at a point is only describing what is happening
around that point, so if we use calculus to search for extreme values, then we will
only see “local” extremes.
Definition 2 Let $X \subset \mathbb{R}^n$ and $f : X \to \mathbb{R}$. To say that a local maximum of f occurs at the point p ∈ X is to say that there is an ε > 0 so that for all q ∈ X within ε of p, we have f(p) ≥ f(q).
Conversely, to say that a local minimum of f occurs at the point p ∈ X is to say that there is an ε > 0 so that for all q ∈ X within ε of p, we have f(p) ≤ f(q).
Here's an example of how this works out in practice. Let $g : \mathbb{R}^2 \to \mathbb{R}$ be the function given by
$$ g(x, y) = x^2 + y^2 + y^3. $$
Question 3 Does this function g achieve a minimum value?
Solution
(a) No.
(b) Yes.
That’s correct: there is no “global” minimum. No matter how negative you
want the output to g to be, you can achieve it by looking at g(0, y) where y is a
very negative number.
On the other hand, if we restrict our attention near the point (0, 0), then g is
nonnegative there.
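Solution Whenever (x, y) is within ε of (0, 0) for a small enough ε, then g(x, y) ≥ g(0, 0) = 0. For instance ε = 1 works, because $y^2 + y^3 = y^2(1 + y) \geq 0$ whenever $y \geq -1$.
As a result, g achieves a local minimum at (0, 0), in spite of the fact that there is no global minimum.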
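The difference between the local and the global picture for $g(x, y) = x^2 + y^2 + y^3$ can be explored numerically. The sketch below samples g on small circles around the origin and then evaluates it far down the negative y-axis. The radii, the number of sample angles, and the far point (0, −10) are arbitrary choices for illustration.

```python
import math

def g(x, y):
    return x ** 2 + y ** 2 + y ** 3

# Sample g on circles of radius 0.1, 0.2, ..., 0.9 centered at the origin.
nearby_values = [
    g(r * math.cos(t), r * math.sin(t))
    for r in [0.1 * k for k in range(1, 10)]
    for t in [2 * math.pi * j / 360 for j in range(360)]
]

# Near the origin, no sampled value is smaller than g(0, 0) = 0.
no_smaller_nearby = min(nearby_values) >= g(0, 0)

# Far from the origin the cubic term dominates and g becomes very negative.
far_value = g(0, -10)

print(no_smaller_nearby, far_value)
```

Sampling cannot prove that (0, 0) is a local minimum, but it is consistent with the inequality $y^2(1 + y) \geq 0$ for $y \geq -1$.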
88 Critical points and extrema
Extremes happen where the derivative vanishes
Definition 1 Let $f : \mathbb{R}^n \to \mathbb{R}$ be a function. A point $p \in \mathbb{R}^n$ is called a critical point of f if f is not differentiable at p, or if Df(p) is the zero map.
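Question 2 Consider $f : \mathbb{R}^2 \to \mathbb{R}$ defined by $f(x, y) = e^{x^2 + y^2}$. The function f has only one critical point.
Solution
Hint: Question 3 Solution
Hint:
$$ Df(x, y) = \begin{pmatrix} \dfrac{\partial f}{\partial x} & \dfrac{\partial f}{\partial y} \end{pmatrix} = \begin{pmatrix} 2xe^{x^2 + y^2} & 2ye^{x^2 + y^2} \end{pmatrix} $$
What is Df(x, y)?
Hint: So we need
$$ \begin{pmatrix} 2xe^{x^2 + y^2} & 2ye^{x^2 + y^2} \end{pmatrix} = \begin{pmatrix} 0 & 0 \end{pmatrix} $$
Hint: This only occurs when x = 0 and y = 0.
Hint: Enter this as $\begin{pmatrix} 0 \\ 0 \end{pmatrix}$
What is this critical point? Give your answer as a vertical vector.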
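Question 4 Consider $f : \mathbb{R}^2 \to \mathbb{R}$ defined by $f(x, y) = x^3 + y^3 - 3xy$.
Solution
Hint: Question 5 Solution
Hint:
$$ Df(x, y) = \begin{pmatrix} \dfrac{\partial f}{\partial x} & \dfrac{\partial f}{\partial y} \end{pmatrix} = \begin{pmatrix} 3x^2 - 3y & 3y^2 - 3x \end{pmatrix} $$
What is Df(x, y)?
Hint: So we need
$$ \begin{pmatrix} 3x^2 - 3y & 3y^2 - 3x \end{pmatrix} = \begin{pmatrix} 0 & 0 \end{pmatrix} $$
Hint:
$$ \begin{cases} 3x^2 - 3y = 0 \\ 3y^2 - 3x = 0 \end{cases} \implies \begin{cases} y = x^2 \\ x = y^2 \end{cases} \implies \begin{cases} y = y^4 \\ x = y^2 \end{cases} \implies \begin{cases} y(y - 1)(y^2 + y + 1) = 0 \\ x = y^2 \end{cases} $$
Hint: The only two points that work are (0, 0) and (1, 1).
f has two critical points. One of them is (0, 0). What is the other?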
1
.
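A quick way to sanity-check a list of candidate critical points is to evaluate the derivative numerically. The sketch below does this for $f(x, y) = x^3 + y^3 - 3xy$ using central differences. The helper name `numerical_gradient` and the step size `h` are choices made for this illustration, not part of the course's code.

```python
def f(x, y):
    return x ** 3 + y ** 3 - 3 * x * y

def numerical_gradient(func, x, y, h=1e-6):
    """Approximate (df/dx, df/dy) at (x, y) with central differences."""
    df_dx = (func(x + h, y) - func(x - h, y)) / (2 * h)
    df_dy = (func(x, y + h) - func(x, y - h)) / (2 * h)
    return df_dx, df_dy

# (0, 0) and (1, 1) should give gradients near zero; (2, 2) is a non-critical comparison point.
for point in [(0, 0), (1, 1), (2, 2)]:
    print(point, numerical_gradient(f, *point))
```

At (0, 0) and (1, 1) both entries come out very close to 0, while at (2, 2) they are close to 6, so (2, 2) is not a critical point.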
Definition 6 A function $f : \mathbb{R}^n \to \mathbb{R}$ has a local maximum at the point $p \in \mathbb{R}^n$ if there is an ε > 0 so that for all $x \in \mathbb{R}^n$ with |x − p| ≤ ε, f(x) ≤ f(p).
Warning 7 The fact that the inequalities in this definition are not strict means that, for example, for the function f(x, y) = 1 every point is both a local maximum and a local minimum.
Write a good definition for the local minimum of a function: A function $f : \mathbb{R}^n \to \mathbb{R}$ has a local minimum at the point $p \in \mathbb{R}^n$ if there is an ε > 0 so that for all $x \in \mathbb{R}^n$ with |x − p| ≤ ε, f(x) ≥ f(p).
We call points which are either local maxima or local minima local extrema.
Theorem 8 If $f : \mathbb{R}^n \to \mathbb{R}$ is a differentiable function and p is a local extremum, then p is a critical point of f.
Prove this theorem. Let p be a local maximum. We want to show that Df(p)(v) = 0 for all $v \in \mathbb{R}^n$. Recall that one formula for the derivative is
$$ Df(p)(v) = \lim_{t \to 0} \frac{f(p + tv) - f(p)}{t}. $$
Since f is differentiable, this limit must exist. As $t \to 0^+$, we have
$$ \frac{f(p + tv) - f(p)}{t} \leq 0, $$
since the numerator is less than or equal to zero by definition of a local maximum, and the denominator is greater than 0. So the limit must be less than or equal to 0.
On the other hand, as $t \to 0^-$, the numerator is still less than or equal to 0, but the denominator is now negative, so the limit must be greater than or equal to 0.
Therefore
$$ Df(p)(v) = \lim_{t \to 0} \frac{f(p + tv) - f(p)}{t} = 0. $$
1
http://ximera.osu.edu/course/kisonecat/m2o2c2/course/activity/week2/practice/
stationary-points/
Since we did this with an arbitrary vector $v \in \mathbb{R}^n$, we see that Df(p) is the zero map.
We leave the nearly identical case of a local minimum to you.
This theorem tells us that if we want to identify local extrema, a good place to start is by looking for all the critical points. It is worthwhile to note that just because a point is a critical point does not mean it is a local extremum:
Example 9 Let $f : \mathbb{R}^2 \to \mathbb{R}$ be defined by $f(x, y) = x^2 - y^2$. Then (0, 0) is a critical point of f (check this!), but (0, 0) is not a local extremum. In fact we can see that along the line y = 0, where $f = x^2$, (0, 0) is a local minimum, while along the line x = 0, where $f = -y^2$, it is a local maximum. The graph of f looks like a saddle.
Deﬁnition 10 A critical point which is not a local extremum is a saddle point.
Warning 11 A saddle point does not need to be a local minimum in some directions and a local maximum in others. For example, according to our definition (0, 0) is a saddle point of $f(x, y) = x^3$.
In the next section we will learn how to determine when a critical point is a
local maximum, minimum, or saddle by using the second derivative.
89 Second derivative test
Deﬁniteness of the second derivative determines extreme behavior at critical points.
In this section, we apply the second derivative to extreme value problems.
Theorem 1 (Second derivative test) Let $f : \mathbb{R}^n \to \mathbb{R}$ be a $C^2$ function. Let $p \in \mathbb{R}^n$ be a critical point of f. Then
• If $D^2 f(p)$ is positive definite, p is a local minimum
• If $D^2 f(p)$ is negative definite, p is a local maximum
• If $D^2 f(p)$ is indefinite, then p is a saddle point
• If $D^2 f(p)$ is only positive semidefinite or negative semidefinite, we get no information
Proof By the second order Taylor's theorem, we have
$$ f(p + \vec{h}) = f(p) + Df(p)(\vec{h}) + \frac{1}{2} D^2 f(p + \xi \vec{h})(\vec{h}, \vec{h}) \quad \text{for some } \xi \in [0, 1]. $$
Since p is a critical point,
$$ f(p + \vec{h}) = f(p) + \frac{1}{2} D^2 f(p + \xi \vec{h})(\vec{h}, \vec{h}) \quad \text{for some } \xi \in [0, 1]. $$
If $D^2 f(p)$ is positive definite, then because f is $C^2$, $D^2 f(p + \xi \vec{h})$ is also positive definite for small enough $\vec{h}$ (this just uses continuity of the second derivative). Thus $D^2 f(p + \xi \vec{h})(\vec{h}, \vec{h}) > 0$ for all small enough nonzero values of $\vec{h}$, say $0 < |\vec{h}| \leq \epsilon$. But this just says that $f(p + \vec{h}) > f(p)$, so p is a local minimum.
If $D^2 f(p)$ is negative definite, a completely analogous proof works.
If $D^2 f(p)$ is indefinite, then there are directions $\vec{h}_1$ where $D^2 f(p)(\vec{h}_1, \vec{h}_1) > 0$, and $\vec{h}_2$ where $D^2 f(p)(\vec{h}_2, \vec{h}_2) < 0$. By continuity again, for $|\vec{h}_i| < \epsilon$, we have $D^2 f(p + \xi_1 \vec{h}_1)(\vec{h}_1, \vec{h}_1) > 0$ and $D^2 f(p + \xi_2 \vec{h}_2)(\vec{h}_2, \vec{h}_2) < 0$.
So we have $f(p + \vec{h}_1) > f(p)$ and $f(p + \vec{h}_2) < f(p)$. Thus p is neither a local maximum nor a minimum, and so is a saddle point.
The method of proof shows why the semidefinite cases might break down. Without strict positivity or negativity, continuity does not guarantee that $D^2 f$ is still positive semidefinite or negative semidefinite at nearby points: you might slip into an indefinite case, for instance.
For a concrete counterexample in the semidefinite case, consider $f(x, y) = x^2 + y^3$. The point (0, 0) is a critical point, and the Hessian $\begin{pmatrix} 2 & 0 \\ 0 & 0 \end{pmatrix}$ is positive semidefinite, but (0, 0) is not a local minimum, since $f(0, -\epsilon) = -\epsilon^3$ is always less than f(0, 0) for ε > 0.
The trouble here is really that the Hessian at nearby points (x, y) is $\begin{pmatrix} 2 & 0 \\ 0 & 6y \end{pmatrix}$, which is indefinite for y < 0.
We have already proven in the section on bilinear forms that the deﬁniteness
of a bilinear form is completely determined by the sign of the eigenvalues of the
associated linear operator.
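For a 2 × 2 Hessian, the eigenvalues have a closed form, so the classification can be sketched in a few lines of plain Python. The function names `eigenvalues_2x2_symmetric` and `classify`, and the tolerance `tol`, are assumptions made for this illustration; for larger matrices you would use a computer algebra system as recommended in the exercises.

```python
import math

def eigenvalues_2x2_symmetric(a, b, d):
    """Eigenvalues of the symmetric matrix [[a, b], [b, d]], smallest first."""
    mean = (a + d) / 2
    radius = math.hypot((a - d) / 2, b)
    return mean - radius, mean + radius

def classify(a, b, d, tol=1e-12):
    """Definiteness of the bilinear form with matrix [[a, b], [b, d]]."""
    low, high = eigenvalues_2x2_symmetric(a, b, d)
    if low > tol:
        return "positive definite"
    if high < -tol:
        return "negative definite"
    if low < -tol and high > tol:
        return "indefinite"
    return "semidefinite"

print(classify(2, 0, 0))   # Hessian of x^2 + y^3 at (0, 0)
print(classify(1, 0, -1))  # an indefinite example
```

The first call reports a semidefinite form, which is exactly the case in which the second derivative test gives no information.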
For the following exercises, use whatever means necessary to compute the eigenvalues of the Hessian, use that information to determine the definiteness of the second derivative, and use this to draw extremum information about f. I recommend using a computer algebra system like Sage¹ to compute the eigenvalues, but you can also use a free online app like this one².
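Question 2 Let $f(x, y) = x^3 + e^{3y} - 3xe^y$.
Solution
Hint: We need $Df(x, y) = \begin{pmatrix} 0 & 0 \end{pmatrix}$, that is,
$$ \begin{pmatrix} 3x^2 - 3e^y & 3e^{3y} - 3xe^y \end{pmatrix} = \begin{pmatrix} 0 & 0 \end{pmatrix} $$
$$ \begin{cases} 3x^2 - 3e^y = 0 \\ 3e^{3y} - 3xe^y = 0 \end{cases} \implies \begin{cases} x^2 = e^y \\ e^{3y} = xe^y \end{cases} \implies \begin{cases} x^2 = e^y \\ x = e^{2y} \end{cases} \implies \begin{cases} x^4 = x \\ x = e^{2y} \end{cases} $$
To satisfy the first equation, either x = 0 or x = 1. If x = 0, then $x = e^{2y}$ has no solutions, so we must have x = 1. Thus (1, 0) is the only critical point.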
What is the critical point of f? Give your answer as a vertical vector.
Solution
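Hint: $\mathcal{H}(x, y) = \begin{pmatrix} 6x & -3e^y \\ -3e^y & 9e^{3y} - 3xe^y \end{pmatrix}$
Hint: $\mathcal{H}(1, 0) = \begin{pmatrix} 6 & -3 \\ -3 & 6 \end{pmatrix}$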
What is the Hessian matrix of f at (1, 0)?
Solution
Hint: By using a computer algebra system, we see that the eigenvalues of the Hessian
are 3 and 9.
Hint: So $D^2 f(1, 0)$ is positive definite.
Hint: Thus (1, 0) is a local minimum.
1
http://www.sagemath.org/
2
http://www.bluebit.gr/matrix-calculator/
(a) (1, 0) is a local maximum
(b) (1, 0) is a local minimum
(c) (1, 0) is a saddle point
(d) The second derivative gives no information in this case
Observe that even though (1, 0) is the only local extremum, and this is a local
minimum, (1, 0) is not a global minimum because f(1, 0) = −1 but f(−10, 0) =
−1000 + 1 −3(−10) = −969. Contemplate this carefully.
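To make the contrast between local and global behavior concrete, the following sketch evaluates $f(x, y) = x^3 + e^{3y} - 3xe^y$ at the critical point and at a few points further out along the negative x-axis. The sample points are arbitrary choices for illustration.

```python
import math

def f(x, y):
    return x ** 3 + math.exp(3 * y) - 3 * x * math.exp(y)

# Value at the only critical point.
value_at_critical_point = f(1, 0)

# Values along the negative x-axis keep decreasing.
values_far_away = [f(x, 0) for x in (-10, -100, -1000)]

print(value_at_critical_point, values_far_away)
```

The values along the negative x-axis fall far below f(1, 0) = −1, so the unique local minimum cannot be a global minimum.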
Question 3 Let $f : \mathbb{R}^3 \to \mathbb{R}$ be defined by $f(x, y, z) = e^{x + y + z} - x - y - z + 4z^2 + xy$. f has a critical point at (0, 0, 0).
Solution
Hint: Question 4 Solution
Hint:
$$ \begin{pmatrix} 1 & 2 & 1 \\ 2 & 1 & 1 \\ 1 & 1 & 9 \end{pmatrix} $$
The Hessian matrix of f at (0, 0, 0) is
Hint: According to computer algebra software, the eigenvalues of this matrix are approximately 9.31662, −1 and 2.68338, so the second derivative is indefinite.
Hint: Thus f has a saddle point at (0, 0, 0).
(a) (0, 0, 0) is a local maximum
(b) (0, 0, 0) is a local minimum
(c) (0, 0, 0) is a saddle point
(d) The second derivative gives no information in this case
Question 5 Let $f(x, y) = \cos(x + y^2) - x^2$. (0, 0) is a critical point of f.
Solution
Hint: The Hessian matrix of f at (0, 0) is $\begin{pmatrix} -3 & 0 \\ 0 & 0 \end{pmatrix}$, which plainly has eigenvalues −3 and 0.
Hint: Thus $D^2 f(0, 0)$ is only negative semidefinite, so the second derivative test by itself gives no information.
Hint: Nevertheless, you can see directly that f has a local maximum at (0, 0), and in fact a global maximum: the largest cos could ever be is 1, and the largest $-x^2$ could ever be is 0, so 1 is the largest value that f ever attains, and this is attained at (0, 0). In fact this value is attained at infinitely many points on the line x = 0.
(a) (0, 0) is a local maximum
(b) (0, 0) is a local minimum
(c) (0, 0) is a saddle point
(d) The second derivative gives no information in this case
90 Lagrange multipliers
Lagrange multipliers enable constrained optimization
In the previous section we considered unconstrained optimization problems. Some-
times we want to ﬁnd extreme values of a function f : R
n
→ R subject to some
constraints.
Example 1 Say we want to maximize $f(x, y) = x^2 y$ subject to the constraint that $x^2 + y^2 = 1$. In words, we want to know among all points on the unit circle, which of them has the greatest product of the square of the first coordinate with the second coordinate. One way we can do this is to reduce it to a single variable calculus problem: we can parameterize the unit circle by γ(t) = (cos(t), sin(t)), and try to maximize $f \circ \gamma : [0, 2\pi] \to \mathbb{R}$. In other words, we are maximizing $\cos^2(t) \sin(t)$ on the interval [0, 2π].
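The reduction to one variable is easy to try numerically. The following sketch samples $f(\gamma(t)) = \cos^2(t)\sin(t)$ on a fine grid of t values in [0, 2π] and keeps the largest sample. The number of samples `n` is an arbitrary choice, and a grid search only approximates the true maximum.

```python
import math

def f(x, y):
    return x ** 2 * y

def gamma(t):
    """Parameterization of the unit circle."""
    return math.cos(t), math.sin(t)

n = 100000
ts = [2 * math.pi * k / n for k in range(n + 1)]

# Pick the sample parameter at which f(gamma(t)) is largest.
best_t = max(ts, key=lambda t: f(*gamma(t)))
best_point = gamma(best_t)
best_value = f(*best_point)

print(best_t, best_point, best_value)
```

The largest sampled value is approximately 0.385, attained where the second coordinate is approximately 0.577.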
Definition 2 Let $f : \mathbb{R}^n \to \mathbb{R}$ be a function, and $g : \mathbb{R}^n \to \mathbb{R}^m$ be another function. A point p is called a local maximum of f with constraint $g(x) = \vec{0}$ if there is an ε > 0 such that if |x − p| < ε and g(x) = 0, then f(x) ≤ f(p).
Give a definition of a constrained local minimum:
Let $f : \mathbb{R}^n \to \mathbb{R}$ be a function, and $g : \mathbb{R}^n \to \mathbb{R}^m$ be another function. A point p is called a local minimum of f with constraint $g(x) = \vec{0}$ if there is an ε > 0 such that if |x − p| < ε and g(x) = 0, then f(x) ≥ f(p).
The method we outlined above is a great way to find constrained local extrema: If you can parameterize $g^{-1}(\{0\})$ by some function $M : U \subset \mathbb{R}^k \to \mathbb{R}^n$, then you can just try to find the unconstrained extrema of f ◦ M. The problem is that, although finding a parameterization of the circle was easy, more general sets might be harder to parameterize. Some of them might not be parameterizable by just one open set: you might have to use several patches. We will not pursue this method further. Instead we will develop an alternative method, the method of Lagrange multipliers.
Theorem 3 Let $f : \mathbb{R}^n \to \mathbb{R}$ and $g : \mathbb{R}^n \to \mathbb{R}^m$. Assume g has the m component functions $g_1, g_2, \ldots, g_m : \mathbb{R}^n \to \mathbb{R}$. If p is a constrained extremum of f with the constraint g(x) = 0, and Rank(Dg(p)) = m, then there exist $\lambda_1, \lambda_2, \ldots, \lambda_m$ with $Df(p) = \lambda_1 Dg_1(p) + \lambda_2 Dg_2(p) + \cdots + \lambda_m Dg_m(p)$. The scalars $\lambda_i$ are called Lagrange multipliers.
Proof The full proof of this theorem would require the implicit function theorem¹. We will make one intuitive assumption to get around this.
A vector v should be tangent to the set $g^{-1}(\{\vec{0}\})$ at the point p if and only if Dg(p)(v) = 0. This is just because moving “infinitesimally” in the direction of a tangent vector to $g^{-1}(\{\vec{0}\})$ should not change the value of g to first order.
Since p is a constrained maximum, moving in one of these tangent directions should not affect the value of f to first order either.
We can summarize these intuitive statements as Null(Dg(p)) ⊂ Null(Df(p)). This is the assumption whose formal proof would require the implicit function theorem.
1
http://en.wikipedia.org/wiki/Implicit_function_theorem
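Given this assumption, the result follows essentially formally from our work with linear algebra:
$$ \mathrm{Null}(Dg(p)) \subset \mathrm{Null}(Df(p)) $$
$$ \mathrm{Null}(Df(p))^{\perp} \subset \mathrm{Null}(Dg(p))^{\perp} $$
$$ \mathrm{Image}(Df(p)^{\top}) \subset \mathrm{Image}(Dg(p)^{\top}) $$
This last line is exactly what we are trying to prove!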
Now let’s get some practice actually using this theorem as a practical tool.
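First some questions with only one constraint equation: $g : \mathbb{R}^n \to \mathbb{R}$
Question 4 Solution
Hint: At a constrained maximum point (x, y) we would need Df(x, y) = λDg(x, y) for some λ ∈ R.
Hint: Question 5 Solution What is Df(x, y)?
Solution What is Dg(x, y)?
Hint: So we must have
$$ \begin{pmatrix} 2xy & x^2 \end{pmatrix} = \lambda \begin{pmatrix} 2x & 2y \end{pmatrix} $$
Hint:
$$ \begin{cases} 2xy = 2\lambda x \\ x^2 = 2\lambda y \end{cases} $$
If x = 0, then the constraint forces y = ±1 and the second equation forces λ = 0; at these two points f = 0. Otherwise x ≠ 0, and dividing the first equation by 2x gives
$$ \begin{cases} y = \lambda \\ x^2 = 2\lambda y \end{cases} \implies \begin{cases} y = \lambda \\ x^2 = 2\lambda^2 \end{cases} $$
But $x^2 + y^2 = 1$, so we have $3\lambda^2 = 1$, or $\lambda = \pm\dfrac{1}{\sqrt{3}}$.
Hint: Apart from (0, ±1), this results in only 4 possible extrema, at $\left( \pm\sqrt{\dfrac{2}{3}}, \pm\dfrac{1}{\sqrt{3}} \right)$.
Hint: The values of f at these points are $\pm\dfrac{2}{3\sqrt{3}}$. So the maximum value of f on the unit circle is $\dfrac{2}{3\sqrt{3}}$ and the minimum value is $-\dfrac{2}{3\sqrt{3}}$.
The maximum value of $f(x, y) = x^2 y$ subject to the constraint $x^2 + y^2 = 1$ is 2/(3 ∗ sqrt(3))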
Question 6 Let $f : \mathbb{R}^3 \to \mathbb{R}$ be defined by $f(x, y, z) = x^2 + y^2 + z^2$ subject to the constraint that g(x, y, z) = xyz − 1 = 0.
Solution
Hint: At a constrained extremum (x, y, z) we would need Df(x, y, z) = λDg(x, y, z) for some λ ∈ R.
Hint: Question 7 Solution What is Df(x, y, z)?
Solution What is Dg(x, y, z)?
Hint: So we must have
$$ \begin{pmatrix} 2x & 2y & 2z \end{pmatrix} = \lambda \begin{pmatrix} yz & xz & xy \end{pmatrix} $$
Hint:
$$ \begin{cases} 2x = \lambda yz \\ 2y = \lambda xz \\ 2z = \lambda xy \end{cases} $$
Hint: Multiplying all of these equations together, we have $8xyz = \lambda^3 (xyz)^2$. Since xyz = 1, we have $\lambda^3 = 8$, so λ = 2.
Hint:
$$ \begin{cases} x = yz \\ y = xz \\ z = xy \end{cases} \implies \begin{cases} x = xz^2 \\ y = yx^2 \\ z = zy^2 \end{cases} $$
Hint: Since xyz = 1, none of x, y, z is zero, so x = ±1, y = ±1, z = ±1. So the only possible locations of constrained extrema are (1, 1, 1), (1, −1, −1), (−1, 1, −1), (−1, −1, 1). At each of these points f(x, y, z) = 3. These are all local minima. f has no local or global maxima.
The minimum value of f subject to this constraint is 3
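The candidate points can be checked mechanically. The sketch below lists the sign patterns (±1, ±1, ±1) that satisfy xyz = 1, tests the Lagrange condition Df = λDg with λ = 2 at each one, and evaluates f there. The names `grad_f`, `grad_g`, and `results` are introduced only for this illustration.

```python
from itertools import product

def grad_f(x, y, z):
    """Derivative of f(x, y, z) = x^2 + y^2 + z^2."""
    return (2 * x, 2 * y, 2 * z)

def grad_g(x, y, z):
    """Derivative of g(x, y, z) = xyz - 1."""
    return (y * z, x * z, x * y)

# Sign patterns that lie on the constraint surface xyz = 1.
candidates = [p for p in product([1, -1], repeat=3) if p[0] * p[1] * p[2] == 1]

lam = 2
results = []
for p in candidates:
    satisfies_lagrange = grad_f(*p) == tuple(lam * c for c in grad_g(*p))
    value = sum(c ** 2 for c in p)
    results.append((p, satisfies_lagrange, value))
    print(p, satisfies_lagrange, value)
```

All four candidates satisfy the Lagrange condition, and f takes the same value, $1 + 1 + 1 = 3$, at each of them.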
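Here is a question with two constraint equations: $g : \mathbb{R}^n \to \mathbb{R}^2$
Question 8 Hint: Here $g(x, y, z) = \begin{pmatrix} x^2 + y^2 + z^2 - 1 \\ x + y + z - 1 \end{pmatrix}$
Hint: We need
$$ \begin{pmatrix} 1 & -1 & 0 \end{pmatrix} = \lambda_1 \begin{pmatrix} 2x & 2y & 2z \end{pmatrix} + \lambda_2 \begin{pmatrix} 1 & 1 & 1 \end{pmatrix} $$
Hint:
$$ \begin{cases} 2\lambda_1 x + \lambda_2 = 1 \\ 2\lambda_1 y + \lambda_2 = -1 \\ 2\lambda_1 z + \lambda_2 = 0 \end{cases} $$
Adding these all together and using that (x + y + z) = 1, we have $2\lambda_1 + 3\lambda_2 = 0$.
So the last equation becomes $-3\lambda_2 z + \lambda_2 = 0$, or $\lambda_2 (1 - 3z) = 0$. $\lambda_2 \neq 0$, for otherwise $\lambda_1 = 0$ as well, and the first equation would read 0 = 1.
So we know that $z = \dfrac{1}{3}$.
Hint: There are only two points satisfying both constraints and with $z = \dfrac{1}{3}$.
Hint: Solving the system of 2 equations
$$ \begin{cases} x^2 + y^2 + \left(\dfrac{1}{3}\right)^2 = 1 \\ x + y + \dfrac{1}{3} = 1 \end{cases} $$
we obtain that the only two points which work are $\left( \dfrac{1 - \sqrt{3}}{3}, \dfrac{1 + \sqrt{3}}{3}, \dfrac{1}{3} \right)$ and $\left( \dfrac{1 + \sqrt{3}}{3}, \dfrac{1 - \sqrt{3}}{3}, \dfrac{1}{3} \right)$.
Hint: So the maximum value of x − y is $\dfrac{2\sqrt{3}}{3}$.
The maximum value of f(x, y, z) = x − y subject to the two constraints that $x^2 + y^2 + z^2 = 1$ and x + y + z = 1 is 2sqrt(3)/3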
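Because the two constraints cut out a circle (the intersection of the unit sphere with the plane x + y + z = 1), the answer can be cross-checked by the parameterization method described earlier. The sketch below is one way to do that. The circle's center, radius, and the orthonormal vectors `u` and `w` spanning the plane's directions are worked out by hand for this illustration and are not taken from the text.

```python
import math

# Circle = unit sphere intersected with the plane x + y + z = 1.
center = (1 / 3, 1 / 3, 1 / 3)
radius = math.sqrt(2 / 3)

# Two orthonormal vectors parallel to the plane (each has coordinate sum 0).
u = (1 / math.sqrt(2), -1 / math.sqrt(2), 0.0)
w = (1 / math.sqrt(6), 1 / math.sqrt(6), -2 / math.sqrt(6))

def point(t):
    """Point on the constraint circle with parameter t."""
    return tuple(c + radius * (math.cos(t) * a + math.sin(t) * b)
                 for c, a, b in zip(center, u, w))

n = 3600
points = [point(2 * math.pi * k / n) for k in range(n)]

# Maximize f(x, y, z) = x - y over the sampled points.
best = max(points, key=lambda p: p[0] - p[1])
print(best, best[0] - best[1])
```

The best sampled point has z-coordinate approximately 1/3, and the largest sampled value of x − y is approximately 1.1547.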
91 Hill climbing in Python
Hill climbing is a computational technique for ﬁnding a local maximum.
Let’s try hill climbing to—at least numerically—attempt to ﬁnd a local maximum.
The idea is the following: the gradient points up hill so if I want to ﬁnd a
local maximum, I should start somewhere and follow the gradient up. Hopefully
I’ll ﬁnd a point where the gradient vanishes (i.e., a critical point).
Question 1 Here's the procedure that I'd like you to code in Python:
(a) Start with some point p.
(b) Replace p with p plus a small multiple of ∇f(p).
(c) If ∇f(p) is very small, stop!
(d) Otherwise, repeat.
Solution
Hint: You might want to use some code to add and scale vectors, like
def add_vector(v,w):
    return [a + b for a, b in zip(v,w)]
def scale_vector(c,v):
    return [c*x for x in v]
def vector_length(v):
    return sum([x**2 for x in v])**0.5
Hint: You may also want some code to compute the gradient numerically.
epsilon = 0.01
def gradient(f, p):  # the name "gradient" is our choice
    n = len(p)
    # ei(i) is epsilon times the i-th standard basis vector
    ei = lambda i: [0]*i + [epsilon] + [0]*(n-i-1)
    return [ (f(add_vector(p, ei(i))) - f(p)) / epsilon for i in range(n) ]
Hint: To do the hill climbing, we can put together these pieces.
def climb_hill(f, starting_point):
    p = starting_point
    nabla = gradient(f, p)
    # step uphill until the numerical gradient is shorter than epsilon
    while vector_length(nabla) > epsilon:
        p = add_vector(p, scale_vector(epsilon, nabla))
        nabla = gradient(f, p)
    return p
Hint: Incidentally, be careful with your choice of ε in this problem; if it is too small, the Python code might take too long to run!
Python
def climb_hill(f, starting_point):
    p = starting_point
    # while gradient of f is pretty big at p
    #     p = p + multiple of gradient f
    # return p
#
# here's an example to try
p = [3,6,2]
f = lambda x: 10 - (x[0] + x[1])**2 - (x[0] - 3)**2 - (x[2] - 4)**2
print(climb_hill(f, p))

def validator():
    f = lambda x: 10 - (x[0] - 2)**4 - (x[1] - 3)**2 - (x[2] - 4)**2
    p = climb_hill(f, [3,6,2])
    return abs(p[0] - 2) < 0.5 and abs(p[1] - 3) < 0.5 and abs(p[2] - 4) < 0.5
So you can use your program to find the maximum value of the function $f : \mathbb{R}^3 \to \mathbb{R}$ given by
$$ f(x, y, z) = 10 - (x + y)^2 - (x - 3)^2 - (z - 4)^2. $$
Solution In this case, x is 3.
Solution And y is −3.
Solution And z is 4.
Fantastic!
92 Multilinear forms
Multilinear forms are separately linear in multiple vector variables.
Definition 1 Let X be a set. We introduce the notation $X^k$ to stand for the set of all ordered k-tuples of elements of X. In other words, $X^k = X \times X \times \cdots \times X$, with k “factors” of X.
For example, if X = {cat, dog}, then $X^3$ is a set with eight elements, consisting of 3-tuples of either cat or dog. For example, (cat, cat, dog) ∈ $X^3$.
Definition 2 A k-linear form on a vector space V is a function $T : V^k \to \mathbb{R}$ which is linear in each vector variable. In other words, given (k − 1) vectors $v_1, v_2, \ldots, v_{i-1}, v_{i+1}, \ldots, v_k$, the map $T_i : V \to \mathbb{R}$ defined by $T_i(v) = T(v_1, v_2, \ldots, v_{i-1}, v, v_{i+1}, \ldots, v_k)$ is linear.
The k-linear forms on V form a vector space.
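Question 3 Let $T : \mathbb{R}^2 \times \mathbb{R}^2 \times \mathbb{R}^2 \to \mathbb{R}$ be a trilinear form on $\mathbb{R}^2$. Suppose we know that
• $T\left( \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 \\ 0 \end{pmatrix} \right) = 1$
• $T\left( \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \begin{pmatrix} 0 \\ 1 \end{pmatrix} \right) = 2$
• $T\left( \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \begin{pmatrix} 1 \\ 0 \end{pmatrix} \right) = 3$
• $T\left( \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \begin{pmatrix} 0 \\ 1 \end{pmatrix} \right) = 4$
• $T\left( \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 \\ 0 \end{pmatrix} \right) = 5$
• $T\left( \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \begin{pmatrix} 0 \\ 1 \end{pmatrix} \right) = 6$
• $T\left( \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \begin{pmatrix} 1 \\ 0 \end{pmatrix} \right) = 7$
• $T\left( \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \begin{pmatrix} 0 \\ 1 \end{pmatrix} \right) = 8$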
Solution
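Hint:
$$ \begin{aligned} T\left( \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \begin{pmatrix} 1 \\ 2 \end{pmatrix}, \begin{pmatrix} 1 \\ 0 \end{pmatrix} \right) &= T\left( \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 \\ 2 \end{pmatrix}, \begin{pmatrix} 1 \\ 0 \end{pmatrix} \right) + T\left( \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \begin{pmatrix} 1 \\ 2 \end{pmatrix}, \begin{pmatrix} 1 \\ 0 \end{pmatrix} \right) \\ &= T\left( \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 \\ 0 \end{pmatrix} \right) + 2T\left( \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \begin{pmatrix} 1 \\ 0 \end{pmatrix} \right) \\ &\quad + T\left( \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 \\ 0 \end{pmatrix} \right) + 2T\left( \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \begin{pmatrix} 1 \\ 0 \end{pmatrix} \right) \\ &= 1 + 2(3) + 5 + 2(7) \\ &= 26 \end{aligned} $$
$T\left( \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \begin{pmatrix} 1 \\ 2 \end{pmatrix}, \begin{pmatrix} 1 \\ 0 \end{pmatrix} \right) = 26$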
From the last example (and by analogy with the bilinear case) it is clear that if you know the value of a k-linear form on all k-tuples of basis vectors of V (there are $(\dim V)^k$ of such), then you can find the value of T on any k-tuple of vectors.
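This observation translates directly into code: store the values of the form on basis tuples in a table, then expand by multilinearity. The sketch below does this for the trilinear form of the previous question. The dictionary `values` (indexed from 0) and the function name `evaluate` are choices made for this illustration.

```python
from itertools import product

# values[(i, j, k)] = T(e_i, e_j, e_k), where index 0 means (1, 0) and index 1 means (0, 1).
values = {
    (0, 0, 0): 1, (0, 0, 1): 2, (0, 1, 0): 3, (0, 1, 1): 4,
    (1, 0, 0): 5, (1, 0, 1): 6, (1, 1, 0): 7, (1, 1, 1): 8,
}

def evaluate(values, *vectors):
    """Evaluate a multilinear form, given its values on tuples of basis vectors."""
    n = len(vectors[0])
    total = 0
    for indices in product(range(n), repeat=len(vectors)):
        # Each term is T(e_i1, ..., e_ik) times the matching coordinates of the inputs.
        term = values[indices]
        for vector, i in zip(vectors, indices):
            term *= vector[i]
        total += term
    return total

print(evaluate(values, [1, 1], [1, 2], [1, 0]))
```

This prints 26, matching the computation in the hint.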
Definition 4 Let $T : V^{k_1} \to \mathbb{R}$ and $S : V^{k_2} \to \mathbb{R}$ be multilinear forms. Then we define their tensor product $T \otimes S : V^{k_1 + k_2} \to \mathbb{R}$ by multiplication: $(T \otimes S)(v_1, v_2, \ldots, v_{k_1 + k_2}) = T(v_1, v_2, \ldots, v_{k_1}) \, S(v_{k_1 + 1}, v_{k_1 + 2}, \ldots, v_{k_1 + k_2})$.
Theorem 5 The k-linear forms $dx_{i_1} \otimes dx_{i_2} \otimes \cdots \otimes dx_{i_k}$ where $1 \leq i_j \leq n$ form a basis for the space of all k-linear forms on $\mathbb{R}^n$. In fact,
$$ T = \sum T(e_{i_1}, e_{i_2}, \ldots, e_{i_k}) \, dx_{i_1} \otimes dx_{i_2} \otimes \cdots \otimes dx_{i_k}, $$
where the sum ranges over all $n^k$ k-tuples of basis vectors.
The proof is as straightforward as the corresponding proof for bilinear forms,
but the notation is something awful.
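Question 6 Solution
Hint: $dx_1 \begin{pmatrix} 1 \\ 2 \end{pmatrix} = 1$.
Hint: $dx_2 \begin{pmatrix} -2 \\ 4 \end{pmatrix} = 4$.
Hint: $dx_2 \begin{pmatrix} 5 \\ 6 \end{pmatrix} = 6$.
Hint: So putting this all together, we have
$$ (dx_1 \otimes dx_2 \otimes dx_2)\left( \begin{pmatrix} 1 \\ 2 \end{pmatrix}, \begin{pmatrix} -2 \\ 4 \end{pmatrix}, \begin{pmatrix} 5 \\ 6 \end{pmatrix} \right) = 1 \cdot 4 \cdot 6 = 24 $$
$(dx_1 \otimes dx_2 \otimes dx_2)\left( \begin{pmatrix} 1 \\ 2 \end{pmatrix}, \begin{pmatrix} -2 \\ 4 \end{pmatrix}, \begin{pmatrix} 5 \\ 6 \end{pmatrix} \right) = 24$.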
Question 7 Let $T = dx_1 \otimes dx_1 \otimes dx_1 + 4\,dx_2 \otimes dx_2 \otimes dx_1$ be a trilinear form on $\mathbb{R}^2$. Let $v = \begin{pmatrix} x \\ y \end{pmatrix}$.
Solution
Hint:
$$ \begin{aligned} T(v, v, v) &= (dx_1 \otimes dx_1 \otimes dx_1)\left( \begin{pmatrix} x \\ y \end{pmatrix}, \begin{pmatrix} x \\ y \end{pmatrix}, \begin{pmatrix} x \\ y \end{pmatrix} \right) + 4\,(dx_2 \otimes dx_2 \otimes dx_1)\left( \begin{pmatrix} x \\ y \end{pmatrix}, \begin{pmatrix} x \\ y \end{pmatrix}, \begin{pmatrix} x \\ y \end{pmatrix} \right) \\ &= x \cdot x \cdot x + 4\, y \cdot y \cdot x \\ &= x^3 + 4y^2 x \end{aligned} $$
As a function of x and y, $T(v, v, v) = x^3 + 4y^2 x$.
As this example shows, applying a trilinear form to the same vector three times gives a polynomial.
Solution
Hint: The monomial $x^3$ has degree three.
Hint: The monomial $4y^2 x$ also has degree three.
Hint: So the total degree of each monomial is three.
The total degree of each monomial is 3.
What we are seeing is a special case of the following result.
Theorem 8 Applying a k-linear form to the same vector k times gives a homogeneous polynomial of degree k.
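Homogeneity of degree k means that scaling the input vector by t scales the output by $t^k$. The sketch below checks this for the trilinear form $T = dx_1 \otimes dx_1 \otimes dx_1 + 4\,dx_2 \otimes dx_2 \otimes dx_1$ from the preceding question. The function name `p` for the resulting polynomial is a label chosen for this illustration.

```python
def T(u, v, w):
    """T = dx1 (x) dx1 (x) dx1 + 4 dx2 (x) dx2 (x) dx1 on R^2."""
    return u[0] * v[0] * w[0] + 4 * u[1] * v[1] * w[0]

def p(x, y):
    """The polynomial obtained by feeding the same vector into every slot of T."""
    return T((x, y), (x, y), (x, y))

x, y, t = 2, 3, 5
print(p(x, y), p(t * x, t * y), t ** 3 * p(x, y))
```

Here p(2, 3) = 8 + 72 = 80, and scaling the input by t = 5 multiplies the output by 125, as expected for a homogeneous polynomial of degree three.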
93 Symmetry
Various sorts of symmetry are possible for multilinear forms.
Definition 1 A k-linear form F is symmetric if
$$ F(v_1, v_2, \ldots, v_k) = F(v_{i_1}, v_{i_2}, \ldots, v_{i_k}), $$
whenever $(i_1, i_2, \ldots, i_k)$ is a rearrangement of (1, 2, . . . , k).
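Question 2 Let $B : \mathbb{R}^2 \times \mathbb{R}^2 \to \mathbb{R}$ be the bilinear form
$$ B = dx_1 \otimes dx_2 + dx_2 \otimes dx_1. $$
Solution
Hint: $dx_1 \begin{pmatrix} 1 \\ 2 \end{pmatrix} = 1$.
Hint: $dx_2 \begin{pmatrix} 1 \\ 2 \end{pmatrix} = 2$.
Hint: $dx_1 \begin{pmatrix} 3 \\ 4 \end{pmatrix} = 3$.
Hint: $dx_2 \begin{pmatrix} 3 \\ 4 \end{pmatrix} = 4$.
Hint: $(dx_1 \otimes dx_2)\left( \begin{pmatrix} 1 \\ 2 \end{pmatrix}, \begin{pmatrix} 3 \\ 4 \end{pmatrix} \right) = 1 \cdot 4$.
Hint: $(dx_2 \otimes dx_1)\left( \begin{pmatrix} 1 \\ 2 \end{pmatrix}, \begin{pmatrix} 3 \\ 4 \end{pmatrix} \right) = 2 \cdot 3$.
Hint: $B\left( \begin{pmatrix} 1 \\ 2 \end{pmatrix}, \begin{pmatrix} 3 \\ 4 \end{pmatrix} \right) = 1 \cdot 4 + 2 \cdot 3 = 4 + 6 = 10$.
$B\left( \begin{pmatrix} 1 \\ 2 \end{pmatrix}, \begin{pmatrix} 3 \\ 4 \end{pmatrix} \right) = 10$.
Solution
Hint: $B\left( \begin{pmatrix} 3 \\ 4 \end{pmatrix}, \begin{pmatrix} 1 \\ 2 \end{pmatrix} \right) = 3 \cdot 2 + 4 \cdot 1 = 6 + 4 = 10$.
$B\left( \begin{pmatrix} 3 \\ 4 \end{pmatrix}, \begin{pmatrix} 1 \\ 2 \end{pmatrix} \right) = 10$.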
Is the bilinear form B symmetric?
Solution
(a) Yes.
(b) No.
Now let’s consider trilinear forms.
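Let $T : \mathbb{R}^2 \times \mathbb{R}^2 \times \mathbb{R}^2 \to \mathbb{R}$ be the trilinear form
$$ T = dx_1 \otimes dx_1 \otimes dx_2 + dx_1 \otimes dx_2 \otimes dx_1. $$
Is the trilinear form T symmetric?
Solution
(a) Yes.
(b) No.
For example, compare
$$ T\left( \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \begin{pmatrix} 1 \\ 0 \end{pmatrix} \right) = 1 $$
to the value of
$$ T\left( \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 \\ 0 \end{pmatrix} \right) = 0. $$
Can you cook up some examples of symmetric trilinear forms? Sure! Here is an example:
$$ T = dx_1 \otimes dx_1 \otimes dx_2 + dx_1 \otimes dx_2 \otimes dx_1 + dx_2 \otimes dx_1 \otimes dx_1. $$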
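Since a multilinear form is determined by its values on tuples of basis vectors, symmetry only needs to be checked on those tuples. The sketch below does this by brute force for the two trilinear forms discussed in this section. The helper names `tensor` and `is_symmetric` are assumptions made for this illustration.

```python
from itertools import permutations, product

def tensor(*indices):
    """The form dx_{i1} (x) dx_{i2} (x) ... as a Python function (indices are 1-based)."""
    def form(*vectors):
        result = 1
        for i, v in zip(indices, vectors):
            result *= v[i - 1]
        return result
    return form

def is_symmetric(form, k, n):
    """Check symmetry of a k-linear form on R^n on all k-tuples of standard basis vectors."""
    basis = [tuple(1 if r == c else 0 for c in range(n)) for r in range(n)]
    for vectors in product(basis, repeat=k):
        for rearranged in permutations(vectors):
            if form(*rearranged) != form(*vectors):
                return False
    return True

def T_two_terms(u, v, w):
    return tensor(1, 1, 2)(u, v, w) + tensor(1, 2, 1)(u, v, w)

def T_three_terms(u, v, w):
    return T_two_terms(u, v, w) + tensor(2, 1, 1)(u, v, w)

print(is_symmetric(T_two_terms, 3, 2), is_symmetric(T_three_terms, 3, 2))
```

The two-term form fails the check, while adding the third term $dx_2 \otimes dx_1 \otimes dx_1$ makes the form symmetric.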
94 Higher order derivatives
Higher derivatives of a function are multilinear maps.
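The (k + 1)st order derivative of a function $f : \mathbb{R}^n \to \mathbb{R}$ at a point p is a (k + 1)-linear form $D^{k+1} f(p)$, which allows us to approximate changes in the kth order derivative. This approximation works as follows.
Definition 1
$$ D^k f(p + \vec{v}_{k+1})(\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_k) = D^k f(p)(\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_k) + D^{k+1} f(p)(\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_k, \vec{v}_{k+1}) + \mathrm{Error}(p)(\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_k, \vec{v}_{k+1}) $$
where
$$ \lim_{\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_{k+1} \to \vec{0}} \frac{\mathrm{Error}(p)(\vec{v}_1, \vec{v}_2, \ldots, \vec{v}_k, \vec{v}_{k+1})}{|\vec{v}_1| |\vec{v}_2| \cdots |\vec{v}_{k+1}|} = 0. $$
Theorem 2 $D^k f(p) = \sum \dfrac{\partial^k f}{\partial x_{i_1} \partial x_{i_2} \cdots \partial x_{i_k}} \, dx_{i_1} \otimes dx_{i_2} \otimes \cdots \otimes dx_{i_k}$, where the sum ranges over all k-tuples of basis covectors.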
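Question 3 $f : \mathbb{R}^2 \to \mathbb{R}$ is defined by $f(x, y) = x^2 y$.
Solution
Hint: The only terms which are not zero are the terms involving 2 partial derivatives with respect to x and 1 partial derivative with respect to y.
Hint: So $D^3 f = \dfrac{\partial^3 f}{\partial x \partial x \partial y} \, dx \otimes dx \otimes dy + \dfrac{\partial^3 f}{\partial x \partial y \partial x} \, dx \otimes dy \otimes dx + \dfrac{\partial^3 f}{\partial y \partial x \partial x} \, dy \otimes dx \otimes dx$
Hint: So $D^3 f(0, 0) = 2\,dx \otimes dx \otimes dy + 2\,dx \otimes dy \otimes dx + 2\,dy \otimes dx \otimes dx$
Hint: So $D^3 f(0, 0)\left( \begin{pmatrix} 1 \\ 2 \end{pmatrix}, \begin{pmatrix} 3 \\ 4 \end{pmatrix}, \begin{pmatrix} 0 \\ 1 \end{pmatrix} \right) = 2(1 \cdot 3 \cdot 1) + 2(1 \cdot 4 \cdot 0) + 2(2 \cdot 3 \cdot 0) = 6$
$D^3 f(0, 0)\left( \begin{pmatrix} 1 \\ 2 \end{pmatrix}, \begin{pmatrix} 3 \\ 4 \end{pmatrix}, \begin{pmatrix} 0 \\ 1 \end{pmatrix} \right) = 6$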
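Question 4 Assume $D^2 f(p) = 3\,dx_1 \otimes dx_2 + 3\,dx_2 \otimes dx_1$. In other words, the matrix of $D^2 f(p)$ is $\begin{pmatrix} 0 & 3 \\ 3 & 0 \end{pmatrix}$. Assume $D^3 f(p) = dx \otimes dx \otimes dx$.
Solution
Hint: By the definition of higher order derivatives, we have
$$ D^2 f\left( p + \begin{pmatrix} 0.3 \\ 0.2 \\ 0.3 \end{pmatrix} \right)(v_1, v_2) \approx D^2 f(p)(v_1, v_2) + D^3 f(p)\left( \begin{pmatrix} 0.3 \\ 0.2 \\ 0.3 \end{pmatrix}, v_1, v_2 \right) $$
Hint: So
$$ \begin{aligned} D^2 f\left( p + \begin{pmatrix} 0.3 \\ 0.2 \\ 0.3 \end{pmatrix} \right)(v_1, v_2) &\approx 3\,dx_1 \otimes dx_2 (v_1, v_2) + 3\,dx_2 \otimes dx_1 (v_1, v_2) + dx \otimes dx \otimes dx \left( \begin{pmatrix} 0.3 \\ 0.2 \\ 0.3 \end{pmatrix}, v_1, v_2 \right) \\ &= 3\,dx_1 \otimes dx_2 (v_1, v_2) + 3\,dx_2 \otimes dx_1 (v_1, v_2) + 0.3\,dx \otimes dx (v_1, v_2) \end{aligned} $$
Hint: The matrix of this bilinear form is $\begin{pmatrix} 0.3 & 3 \\ 3 & 0 \end{pmatrix}$
The matrix of the bilinear form $D^2 f\left( p + \begin{pmatrix} 0.3 \\ 0.2 \\ 0.3 \end{pmatrix} \right)$ is approximately $\begin{pmatrix} 0.3 & 3 \\ 3 & 0 \end{pmatrix}$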
95 Symmetry
In many nice situations, higher-order derivatives are symmetric.
Recall that we once saw the following theorem.
Theorem 1 Let $f : \mathbb{R}^n \to \mathbb{R}$ be a differentiable function. Assume that the partial derivatives $f_{x_i} : \mathbb{R}^n \to \mathbb{R}$ are all differentiable, and the second partial derivatives $f_{x_i, x_j}$ are continuous. Then $f_{x_i, x_j} = f_{x_j, x_i}$.
After interpreting the “second derivative” as a bilinear form, we were then
able to say something nicer (though the hypothesis is stronger, so this is a weaker
theorem).
Theorem 2 Let $f : \mathbb{R}^n \to \mathbb{R}$ be a continuously twice differentiable function; then the bilinear form representing the second derivative is symmetric.
And ﬁnally, we are in a position to formulate the higher-order version of this
theorem.
Theorem 3 Let $f : \mathbb{R}^n \to \mathbb{R}$ be a continuously k-times differentiable function; then the k-linear form representing the k-th order derivative is a symmetric form.
96 Taylor’s theorem
Higher order derivatives give rise to higher order polynomial approximations.
Here is a statement of Taylor's theorem for many variables.
Theorem 1 Let $f : \mathbb{R}^n \to \mathbb{R}$ be a (k + 1)-times differentiable function. Then
$$ f(p + \vec{h}) = f(p) + Df(p)(\vec{h}) + \frac{1}{2!} D^2 f(p)(\vec{h}, \vec{h}) + \frac{1}{3!} D^3 f(p)(\vec{h}, \vec{h}, \vec{h}) + \cdots + \frac{1}{k!} D^k f(p)(\vec{h}^k) + \frac{1}{(k + 1)!} D^{k+1} f(p + \xi \vec{h})(\vec{h}^{k+1}) $$
for some ξ ∈ [0, 1], where we have abbreviated the ordered tuple of i copies of $\vec{h}$ as $\vec{h}^i$.
Let’s apply this to a speciﬁc function.
Question 2 Let $f : \mathbb{R}^2 \to \mathbb{R}$ be defined by $f(x, y) = e^{x + y}$.
Solution
Hint: The second order Taylor approximation is $1 + (x + y) + \dfrac{(x + y)^2}{2}$
Hint: Every partial derivative of this function is $e^{x + y}$, so all of the third partial derivatives at (0, 0) are 1
Hint: So the third derivative is the sum of all of the following terms
• dx ⊗dx ⊗dx
• dx ⊗dx ⊗dy
• dx ⊗dy ⊗dx
• dx ⊗dy ⊗dy
• dy ⊗dx ⊗dx
• dy ⊗dx ⊗dy
• dy ⊗dy ⊗dx
• dy ⊗dy ⊗dy
Hint: Applying this tensor to $\left( \begin{pmatrix} x \\ y \end{pmatrix}, \begin{pmatrix} x \\ y \end{pmatrix}, \begin{pmatrix} x \\ y \end{pmatrix} \right)$ we get $xxx + xxy + xyx + xyy + yxx + yxy + yyx + yyy = (x + y)^3$
Hint: So the third order Taylor expansion is $1 + (x + y) + \dfrac{(x + y)^2}{2} + \dfrac{(x + y)^3}{6}$
The third order Taylor series of f about the point (0, 0) is $1 + (x + y) + (x + y)^2/2 + (x + y)^3/6$
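To get a feel for how good this approximation is, the following sketch compares the third order Taylor polynomial of $e^{x + y}$ with the exact value at two sample points. The function name `taylor3` and the sample points are choices made for this illustration.

```python
import math

def taylor3(x, y):
    """Third order Taylor polynomial of e^(x + y) about (0, 0)."""
    s = x + y
    return 1 + s + s ** 2 / 2 + s ** 3 / 6

errors = []
for x, y in [(0.1, 0.2), (0.01, 0.02)]:
    exact = math.exp(x + y)
    errors.append(abs(exact - taylor3(x, y)))
    print((x, y), exact, taylor3(x, y), errors[-1])
```

Shrinking the input by a factor of 10 shrinks the error by roughly a factor of $10^4$, which is what the fourth order remainder term in Taylor's theorem predicts.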
97 Python
There are numerical examples of higher-order Taylor series.
In this exercise, given a function f, we compute a higher-order Taylor series for f
numerically.
Question 1 Suppose f is a Python function with two real inputs, perhaps
def f(x,y):
    return 2.71828182845904**(x+y)
We have a couple of “numerical differentiation” functions
epsilon = 0.001
def partial_x(f):
    return lambda x,y: (f(x+epsilon,y) - f(x,y))/epsilon
def partial_y(f):
    return lambda x,y: (f(x,y+epsilon) - f(x,y))/epsilon
We can build a linear approximation function.
epsilon = 0.001
def linear_approximation(f):
    return lambda x,y: f(0,0) + x*partial_x(f)(0,0) + y*partial_y(f)(0,0)
Solution
Hint: We need only write down the second order Taylor series.
def quadratic_approximation(f):
    # constant term, linear terms, then the three second order terms
    return lambda x, y: (f(0,0)
        + x*partial_x(f)(0,0) + y*partial_y(f)(0,0)
        + x*x*partial_x(partial_x(f))(0,0)/2
        + y*y*partial_y(partial_y(f))(0,0)/2
        + x*y*partial_x(partial_y(f))(0,0))
Python
epsilon = 0.001
def partial_x(f):
    return lambda x,y: (f(x+epsilon,y) - f(x,y))/epsilon
def partial_y(f):
    return lambda x,y: (f(x,y+epsilon) - f(x,y))/epsilon
def quadratic_approximation(f):
    return lambda x,y: # the second order Taylor series approximation
#
# here's an example to try
f = lambda x,y: 2.71828182845904**(x+y)

def validator():
    f = lambda x,y: x*y
    if abs(quadratic_approximation(f)(2,3) - 6) > 0.1:
        return False
    f = lambda x,y: x*x
    if abs(quadratic_approximation(f)(2,3) - 4) > 0.1:
        return False
    f = lambda x,y: y*y
    if abs(quadratic_approximation(f)(2,3) - 9) > 0.1:
        return False
    f = lambda x,y: x
    if abs(quadratic_approximation(f)(2,3) - 2) > 0.1:
        return False
    f = lambda x,y: y
    if abs(quadratic_approximation(f)(2,3) - 3) > 0.1:
        return False
    return True
If you like this, you could build a version of this that produces a third order
approximation.
98 Denouement
Farewell!
That’s it! You have reached the end of this course.
If you want a high level overview of essentially everything we did in this course,
we recommend that you read this set of lecture notes
1
.
It has been a joy working with all of you and talking to you on the forums. I
am grateful for the many people who submitted pull requests to ﬁx errors in these
notes. Keep an eye on this space: we will hopefully be improving this course by