Often we have not only one, but several variables in a problem. The issues that come up are somewhat more complex than for one variable. Let us first start with vector spaces and linear functions.
While it is common to use x⃗ or boldface x for elements of R^n, especially in the applied sciences, we will just use x, which is common in mathematics. That is, x ∈ R^n is a vector, which means that x = (x^1, x^2, ..., x^n) is an n-tuple of real numbers. We will use upper indices for identifying components. That leaves us the lower index for sequences of vectors. That is, we can have vectors x_1 and x_2 in R^n, and then x_1 = (x^1_1, x^2_1, ..., x^n_1) and x_2 = (x^1_2, x^2_2, ..., x^n_2). It is common to write vectors as column vectors, that is,

x = (x^1, x^2, ..., x^n) = [ x^1 ; x^2 ; ⋮ ; x^n ],

and we will do so when convenient. We will use this notation with square brackets and use round brackets for just an n-tuple of numbers. We will call real numbers scalars to distinguish them from vectors.
Definition: A set X together with operations +: X × X → X and ·: R × X → X (where we will just write ax instead of a · x) is called a vector space (or a real vector space) if the following conditions are satisfied:
(i) (Addition is associative) If u, v, w ∈ X, then u + (v + w) = (u + v) + w.
(ii) (Addition is commutative) If u, v ∈ X, then u + v = v + u.
(iii) (Additive identity) There is a 0 ∈ X such that v + 0 = v for all v ∈ X.
(iv) (Additive inverse) For every v ∈ X, there is a −v ∈ X, such that v + (−v) = 0.
(v) (Distributive law) If a ∈ R, u, v ∈ X, then a(u + v) = au + av.
(vi) (Distributive law) If a, b ∈ R, v ∈ X, then (a + b)v = av + bv.
(vii) (Multiplication is associative) If a, b ∈ R, v ∈ X, then (ab)v = a(bv).
(viii) (Multiplicative identity) 1v = v for all v ∈ X.
Elements of a vector space are usually called vectors, even if they are not elements of R^n (vectors in the “traditional” sense).
In short, X is an Abelian group with respect to addition, equipped with multiplication by scalars.
An example vector space is R^n, where addition and multiplication by a scalar are done componentwise. We will mostly deal with vector spaces as subsets of R^n, but there are other vector spaces that are useful in analysis. For example, the space C([0, 1], R) of continuous functions on the interval [0, 1] is a vector space. The space L^1(X, µ) is also a vector space.
A trivial example of a vector space (the smallest one in fact) is just X = {0}. The operations are defined in the obvious way. You always need a zero vector to exist, so all vector spaces are nonempty sets.

It is also possible to use other fields than R in the definition (for example, it is common to use C), but let us stick with R.¹
Definition: If we have vectors x_1, ..., x_k ∈ R^n and scalars a^1, ..., a^k ∈ R, then

a^1 x_1 + a^2 x_2 + ··· + a^k x_k

is called a linear combination of the vectors x_1, ..., x_k.
Note that if x_1, ..., x_k are in a vector space X, then any linear combination of x_1, ..., x_k is also in X.

If Y ⊂ R^n is a set, then the span of Y, or in notation span(Y), is the set of all linear combinations of some finite number of elements of Y. We also say that Y spans span(Y).
Example: Let Y = {(1, 1)} ⊂ R². Then

span(Y) = {(x, x) ∈ R² : x ∈ R},

or in other words, the line through the origin and the point (1, 1).
¹ If you want a funky vector space over a different field, R is an infinite-dimensional vector space over the rational numbers.
Example: Let Y = {(1, 1), (0, 1)} ⊂ R². Then

span(Y) = R²,

as any point (x, y) ∈ R² can be written as a linear combination

(x, y) = x(1, 1) + (y − x)(0, 1).
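The coefficients of such a linear combination can also be found numerically by solving a linear system. A small sketch, assuming NumPy is available (the point (3, 5) is an arbitrary choice):

```python
import numpy as np

# Columns are the spanning vectors (1, 1) and (0, 1).
Y = np.array([[1.0, 0.0],
              [1.0, 1.0]])

point = np.array([3.0, 5.0])  # an arbitrary point (x, y)

# Solve Y @ coeffs = point for the coefficients of the linear combination.
coeffs = np.linalg.solve(Y, point)
print(coeffs)  # expect [x, y - x] = [3, 2]
```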
Since a sum of two linear combinations is again a linear combination, and a scalar multiple of a linear
combination is a linear combination, we see that:
Proposition: The span of a set Y ⊂ R^n is a vector space.

If Y is already a vector space, then span(Y) = Y.
Definition: A set of vectors {x_1, x_2, ..., x_k} is said to be linearly independent if the only solution to

a^1 x_1 + ··· + a^k x_k = 0

is the trivial solution a^1 = ··· = a^k = 0. Here 0 is the vector of all zeros. A set that is not linearly independent is linearly dependent.
Note that if a set is linearly dependent, then one of the vectors is a linear combination of the others.

A linearly independent set B of vectors that spans a vector space X is called a basis of X.
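Linear independence can be tested numerically: vectors are independent exactly when the matrix having them as columns has full column rank. A sketch, assuming NumPy (the helper `is_linearly_independent` is ours, not a library function):

```python
import numpy as np

def is_linearly_independent(vectors):
    """Vectors are independent iff the matrix with them as columns has full column rank."""
    M = np.column_stack(vectors)
    return np.linalg.matrix_rank(M) == len(vectors)

print(is_linearly_independent([np.array([1.0, 1.0]), np.array([0.0, 1.0])]))  # True
print(is_linearly_independent([np.array([1.0, 2.0]), np.array([2.0, 4.0])]))  # False
```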
If a vector space X contains a linearly independent set of d vectors, but no linearly independent set of d + 1 vectors, then we say the dimension of X is d and write dim X = d. If for all d ∈ N the vector space X contains a set of d linearly independent vectors, we say X is infinite dimensional and write dim X = ∞.

Note that we have dim{0} = 0. So far we have not shown that any other vector space has a finite dimension. We will see in a moment that any vector space that is a subset of R^n has a finite dimension, and that dimension is less than or equal to n.
Proposition: If B = {x_1, ..., x_k} is a basis of a vector space X, then every point y ∈ X has a unique representation of the form

y = Σ_{j=1}^{k} α^j x_j

for some numbers α^1, ..., α^k.
Proof. First, every y ∈ X is a linear combination of elements of B since X is the span of B. For uniqueness suppose

y = Σ_{j=1}^{k} α^j x_j = Σ_{j=1}^{k} β^j x_j.

Then

Σ_{j=1}^{k} (α^j − β^j) x_j = 0,

so by linear independence of the basis, α^j = β^j for all j.
For R^n we define

e_1 = (1, 0, 0, ..., 0),  e_2 = (0, 1, 0, ..., 0),  ...,  e_n = (0, 0, 0, ..., 1),

and call this the standard basis of R^n. We use the same letters e_j for any R^n, and what space R^n we are working in is understood from context. A direct computation shows that {e_1, e_2, ..., e_n} is really a basis of R^n; it is easy to show that it spans R^n and is linearly independent. In fact,

x = (x^1, ..., x^n) = Σ_{j=1}^{n} x^j e_j.
Proposition (Theorems 9.2 and 9.3 in Rudin): Suppose that X is a vector space.
(i) If X is spanned by d vectors, then dim X ≤ d.
(ii) dim X = d if and only if X has a basis of d vectors (and so every basis has d vectors).
(iii) In particular, dim R^n = n.
(iv) If Y ⊂ X is a vector space and dim X = d, then dim Y ≤ d.
(v) If dim X = d and a set T of d vectors spans X, then T is linearly independent.
(vi) If dim X = d and a set T of m vectors is linearly independent, then there is a set S of d − m vectors such that T ∪ S is a basis of X.
Proof. Let us start with (i). Suppose that S = {x_1, ..., x_d} spans X. Now suppose that T = {y_1, ..., y_m} is a set of linearly independent vectors of X. We wish to show that m ≤ d. Write

y_1 = Σ_{k=1}^{d} α^k_1 x_k,

which we can do as S spans X. One of the α^k_1 is nonzero (otherwise y_1 would be zero), so suppose without loss of generality that this is α^1_1. Then we can solve

x_1 = (1/α^1_1) y_1 − Σ_{k=2}^{d} (α^k_1 / α^1_1) x_k.
In particular, {y_1, x_2, ..., x_d} spans X, since x_1 can be obtained from {y_1, x_2, ..., x_d}. Next,

y_2 = α^1_2 y_1 + Σ_{k=2}^{d} α^k_2 x_k.
As T is linearly independent, we must have that one of the α^k_2 for k ≥ 2 is nonzero. Without loss of generality suppose that this is α^2_2. Proceed to solve for

x_2 = (1/α^2_2) y_2 − (α^1_2 / α^2_2) y_1 − Σ_{k=3}^{d} (α^k_2 / α^2_2) x_k.
In particular, {y_1, y_2, x_3, ..., x_d} spans X. The astute reader will think back to linear algebra and notice that we are row-reducing a matrix.
We continue this procedure. If m < d, we are done. So suppose that m ≥ d. After d steps we obtain that {y_1, y_2, ..., y_d} spans X. So any other vector v in X is a linear combination of {y_1, y_2, ..., y_d}, and hence cannot be in T, as T is linearly independent. So m = d.
Let us look at (ii). First notice that if we have a set T of k linearly independent vectors that do not span X, then we can always choose a vector v ∈ X \ span(T). The set T ∪ {v} is linearly independent (exercise). If dim X = d, then there must exist some linearly independent set of d vectors T, and it must span X, otherwise we could choose a larger set of linearly independent vectors. So we have a basis of d vectors. On the other hand, if we have a basis of d vectors, it is linearly independent and spans X. By (i) we know there is no set of d + 1 linearly independent vectors, so the dimension must be d.
For (iii) notice that {e_1, ..., e_n} is a basis of R^n.

To see (iv), suppose that Y is a vector space and Y ⊂ X, where dim X = d. As X cannot contain d + 1 linearly independent vectors, neither can Y.
For (v) suppose that T is a set of m vectors that is linearly dependent and spans X. Then one of the vectors is a linear combination of the others. Therefore, if we remove it from T, we obtain a set of m − 1 vectors that still spans X, and hence dim X ≤ m − 1.
For (vi) suppose that T = {x_1, ..., x_m} is a linearly independent set. We follow the procedure above in the proof of (ii) to keep adding vectors while keeping the set linearly independent. As the dimension is d, we can add a vector exactly d − m times.
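The completion procedure in (vi) can be sketched numerically: greedily add standard basis vectors, keeping only those that preserve linear independence. A sketch assuming NumPy; `complete_to_basis` is a hypothetical helper of ours:

```python
import numpy as np

def complete_to_basis(T, n):
    """Extend a linearly independent list T of vectors in R^n to a basis,
    greedily adding standard basis vectors that keep the list independent."""
    basis = list(T)
    for j in range(n):
        e = np.zeros(n)
        e[j] = 1.0
        candidate = basis + [e]
        # Keep e only if the enlarged set is still independent (full column rank).
        if np.linalg.matrix_rank(np.column_stack(candidate)) == len(candidate):
            basis.append(e)
        if len(basis) == n:
            break
    return basis

T = [np.array([1.0, 1.0, 0.0])]
B = complete_to_basis(T, 3)
print(len(B))  # 3
```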
Definition: A mapping A: X → Y of vector spaces X and Y is said to be linear (or a linear transformation) if for every a ∈ R and x, y ∈ X we have

A(ax) = aA(x),  A(x + y) = A(x) + A(y).

We will usually just write Ax instead of A(x) if A is linear.

If A is one-to-one and onto, then we say A is invertible, and we define A^{-1} as the inverse.
If A: X → X is linear then we say A is a linear operator on X.
We will write L(X, Y) for the set of all linear transformations from X to Y, and just L(X) for the set of linear operators on X. If a, b ∈ R and A, B ∈ L(X, Y), then define the transformation aA + bB by

(aA + bB)(x) = aAx + bBx.
It is not hard to see that aA + bB is linear.
If A ∈ L(Y, Z) and B ∈ L(X, Y), then define the transformation AB as

ABx = A(Bx).
It is trivial to see that AB ∈ L(X, Z).
Finally denote by I ∈ L(X) the identity, that is the linear operator such that Ix = x for all x.
Note that it is obvious that A0 = 0.
Proposition: If A: X → Y is invertible, then A^{-1} is linear.

Proof. Let a ∈ R and y ∈ Y. As A is onto, there is an x such that y = Ax, and further, as it is also one-to-one, A^{-1}(Az) = z for all z ∈ X. So

A^{-1}(ay) = A^{-1}(aAx) = A^{-1}(A(ax)) = ax = aA^{-1}(y).
Similarly let y_1, y_2 ∈ Y, and x_1, x_2 ∈ X such that Ax_1 = y_1 and Ax_2 = y_2. Then

A^{-1}(y_1 + y_2) = A^{-1}(Ax_1 + Ax_2) = A^{-1}(A(x_1 + x_2)) = x_1 + x_2 = A^{-1}(y_1) + A^{-1}(y_2).
Proposition: If A: X → Y is linear, then it is completely determined by its values on a basis of X. Furthermore, if B is a basis, then any function Ã: B → Y extends to a linear function on X.

Proof. For infinite-dimensional spaces the proof is essentially the same, but a little trickier to write, so let's stick with finitely many dimensions. Let {x_1, ..., x_n} be a basis and suppose that A(x_j) = y_j. Then
every x ∈ X has a unique representation

x = Σ_{j=1}^{n} b^j x_j

for some numbers b^1, ..., b^n. Then by linearity
Ax = A(Σ_{j=1}^{n} b^j x_j) = Σ_{j=1}^{n} b^j A x_j = Σ_{j=1}^{n} b^j y_j.
The “furthermore” follows by defining the extension Ax = Σ_{j=1}^{n} b^j y_j, and noting that this is well defined by uniqueness of the representation of x.
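This proposition is how one computes with linear maps in practice: once the images of the basis vectors are chosen, the matrix with those images as columns determines A everywhere. A small sketch assuming NumPy (the images y1, y2 are arbitrary choices):

```python
import numpy as np

# A linear map is determined by its values on a basis: if A e_j = y_j,
# the matrix with columns y_j reproduces A on every vector by linearity.
y1 = np.array([2.0, 0.0, 1.0])   # assumed image of e_1
y2 = np.array([0.0, 3.0, -1.0])  # assumed image of e_2
A = np.column_stack([y1, y2])    # 3x2 matrix

x = np.array([4.0, -1.0])        # x = 4 e_1 - 1 e_2
print(A @ x)                     # 4*y1 - 1*y2 = [8, -3, 5]
```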
Theorem 9.5: If X is a finite-dimensional vector space and A: X → X is linear, then A is one-to-one if and only if it is onto.
Proof. Let {x_1, ..., x_n} be a basis for X. Suppose that A is one-to-one. Now suppose

Σ_{j=1}^{n} c^j A x_j = A(Σ_{j=1}^{n} c^j x_j) = 0.
As A is one-to-one, the only vector that is taken to 0 is 0 itself. Hence,

0 = Σ_{j=1}^{n} c^j x_j,
and so c^j = 0 for all j. Therefore, {Ax_1, ..., Ax_n} is linearly independent. By an above proposition and the fact that the dimension is n, we have that {Ax_1, ..., Ax_n} spans X. Any point x ∈ X can then be written as
x = Σ_{j=1}^{n} a^j A x_j = A(Σ_{j=1}^{n} a^j x_j),

so A is onto.
Now suppose that A is onto. As A is determined by its action on the basis, every element of X has to be in the span of {Ax_1, ..., Ax_n}. Suppose that

A(Σ_{j=1}^{n} c^j x_j) = Σ_{j=1}^{n} c^j A x_j = 0.

By the same proposition, as {Ax_1, ..., Ax_n} spans X, the set is linearly independent, and hence c^j = 0 for all j. This means that A is one-to-one: if Ax = Ay, then A(x − y) = 0, and so x = y.
Let us start measuring distance. If X is a vector space, then we say a real-valued function ‖·‖ is a norm if:
(i) ‖x‖ ≥ 0, with ‖x‖ = 0 if and only if x = 0.
(ii) ‖cx‖ = |c| ‖x‖ for all c ∈ R and x ∈ X.
(iii) ‖x + y‖ ≤ ‖x‖ + ‖y‖ for all x, y ∈ X.
Let x = (x^1, ..., x^n) ∈ R^n. Define

‖x‖ = √((x^1)² + (x^2)² + ··· + (x^n)²)

to be the Euclidean norm. Then d(x, y) = ‖x − y‖ is the standard distance function on R^n that we used when we talked about metric spaces. In fact, we proved last semester that the Euclidean norm is a norm. On any vector space X, once we have a norm, we can define a distance d(x, y) = ‖x − y‖ that makes X a metric space.
Let A ∈ L(X, Y). Define

‖A‖ = sup{‖Ax‖ : x ∈ X with ‖x‖ = 1}.

The number ‖A‖ is called the operator norm. We will see below that indeed it is a norm (at least for finite-dimensional spaces). By linearity we get

‖A‖ = sup{‖Ax‖ : x ∈ X with ‖x‖ = 1} = sup_{x ∈ X, x ≠ 0} ‖Ax‖ / ‖x‖.

This implies that

‖Ax‖ ≤ ‖A‖ ‖x‖.

It is not hard to see from the definition that ‖A‖ = 0 if and only if A = 0, that is, if A takes every vector to the zero vector.
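For the Euclidean norms on R^n and R^m, the operator norm can be computed exactly as the largest singular value; sampling unit vectors approximates the supremum from below. A sketch assuming NumPy (the matrix is an arbitrary choice):

```python
import numpy as np

A = np.array([[3.0, 0.0],
              [4.0, 5.0]])

# Approximate sup{ ||Ax|| : ||x|| = 1 } by sampling unit vectors in R^2.
thetas = np.linspace(0.0, 2.0 * np.pi, 10000)
units = np.stack([np.cos(thetas), np.sin(thetas)])
sampled = np.linalg.norm(A @ units, axis=0).max()

# The exact operator norm (for the Euclidean norm) is the largest singular value.
exact = np.linalg.norm(A, 2)
print(sampled <= exact + 1e-9, abs(sampled - exact) < 1e-3)
```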
For finite-dimensional spaces ‖A‖ is always finite, as we will prove below. For infinite-dimensional spaces this need not be true. For a simple example, take the vector space of continuously differentiable functions on [0, 1] and as the norm use the uniform norm. Then, for example, the functions sin(nx) have norm 1, but the derivatives have norm n. So differentiation (which is a linear operator) has unbounded norm on this space. But let us stick to finite-dimensional spaces now.
Proposition (Theorem 9.7 in Rudin):
(i) If A ∈ L(R^n, R^m), then ‖A‖ < ∞ and A is uniformly continuous (Lipschitz with constant ‖A‖).
(ii) If A, B ∈ L(R^n, R^m) and c ∈ R, then

‖A + B‖ ≤ ‖A‖ + ‖B‖,  ‖cA‖ = |c| ‖A‖.

In particular, L(R^n, R^m) is a metric space with distance ‖A − B‖.
(iii) If A ∈ L(R^n, R^m) and B ∈ L(R^m, R^k), then

‖BA‖ ≤ ‖B‖ ‖A‖.
Proof. For (i), let x ∈ R^n. We know that A is defined by its action on a basis. Write

x = Σ_{j=1}^{n} c^j e_j.
Then

‖Ax‖ = ‖Σ_{j=1}^{n} c^j A e_j‖ ≤ Σ_{j=1}^{n} |c^j| ‖A e_j‖.

If ‖x‖ = 1, then it is easy to see that |c^j| ≤ 1 for all j, so

‖Ax‖ ≤ Σ_{j=1}^{n} |c^j| ‖A e_j‖ ≤ Σ_{j=1}^{n} ‖A e_j‖.
The right-hand side does not depend on x, and so we are done: we have found a finite upper bound. Next,

‖A(x − y)‖ ≤ ‖A‖ ‖x − y‖

as we mentioned above. So if ‖A‖ < ∞, then this says that A is Lipschitz with constant ‖A‖.
For (ii), let us note that

‖(A + B)x‖ = ‖Ax + Bx‖ ≤ ‖Ax‖ + ‖Bx‖ ≤ ‖A‖ ‖x‖ + ‖B‖ ‖x‖ = (‖A‖ + ‖B‖) ‖x‖.

So ‖A + B‖ ≤ ‖A‖ + ‖B‖. Similarly,

‖(cA)x‖ = |c| ‖Ax‖ ≤ (|c| ‖A‖) ‖x‖.

So ‖cA‖ ≤ |c| ‖A‖. Next note

|c| ‖Ax‖ = ‖cAx‖ ≤ ‖cA‖ ‖x‖.

So |c| ‖A‖ ≤ ‖cA‖.
That we have a metric space follows pretty easily, and is left to the student.
For (iii) write

‖BAx‖ ≤ ‖B‖ ‖Ax‖ ≤ ‖B‖ ‖A‖ ‖x‖.
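The three norm inequalities can be spot-checked numerically on random matrices. A sketch assuming NumPy (`op` abbreviates the Euclidean operator norm; the shapes are arbitrary):

```python
import numpy as np

op = lambda M: np.linalg.norm(M, 2)  # operator norm = largest singular value

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((3, 4))
C = rng.standard_normal((5, 3))

assert op(A + B) <= op(A) + op(B) + 1e-12   # triangle inequality
assert abs(op(2.5 * A) - 2.5 * op(A)) < 1e-12
assert op(C @ A) <= op(C) * op(A) + 1e-12   # submultiplicativity
print("inequalities hold")
```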
As a norm defines a metric, we have defined a metric space topology on L(R^n, R^m), so we can talk about open/closed sets, continuity, and convergence. Note that we have defined a norm only on R^n and not on an arbitrary finite-dimensional vector space. However, after picking bases, we can define a norm on any vector space in the same way. So we really have a topology on any L(X, Y), although the precise metric would depend on the bases picked.
Theorem 9.8: Let U ⊂ L(R^n) be the set of invertible linear operators.
(i) If A ∈ U and B ∈ L(R^n), and

‖A − B‖ < 1/‖A^{-1}‖,  (1)

then B is invertible.
(ii) U is open and A ↦ A^{-1} is a continuous function on U.

The theorem says that U is an open set and A ↦ A^{-1} is continuous on U.
You should always think back to R^1, where linear operators are just numbers a. The operator a is invertible (a^{-1} = 1/a) whenever a ≠ 0. Of course a ↦ 1/a is continuous. When n > 1, then there are other noninvertible operators, and in general things are a bit more difficult.
Proof. Let us prove (i). First a straightforward computation:

‖x‖ = ‖A^{-1}Ax‖ ≤ ‖A^{-1}‖ ‖Ax‖ ≤ ‖A^{-1}‖ (‖(A − B)x‖ + ‖Bx‖) ≤ ‖A^{-1}‖ ‖A − B‖ ‖x‖ + ‖A^{-1}‖ ‖Bx‖.
Now assume that x ≠ 0 and so ‖x‖ ≠ 0. By (1) we have ‖A^{-1}‖ ‖A − B‖ < 1, so the computation above gives

(1 − ‖A^{-1}‖ ‖A − B‖) ‖x‖ ≤ ‖A^{-1}‖ ‖Bx‖,

where the left-hand side is positive. In other words, ‖Bx‖ ≠ 0 for all nonzero x, and hence Bx ≠ 0 for all nonzero x. This is enough to see that B is one-to-one (if Bx = By, then B(x − y) = 0, so x = y). As B is a one-to-one operator from R^n to R^n, it is onto and hence invertible.
Let us look at (ii). Let B be near A, that is, (1) is satisfied, so B is invertible. In fact, suppose that ‖A − B‖ ‖A^{-1}‖ < 1/2. Then we have shown above (using B^{-1}y instead of x)
‖B^{-1}y‖ ≤ ‖A^{-1}‖ ‖A − B‖ ‖B^{-1}y‖ + ‖A^{-1}‖ ‖y‖ ≤ (1/2) ‖B^{-1}y‖ + ‖A^{-1}‖ ‖y‖,

or

‖B^{-1}y‖ ≤ 2 ‖A^{-1}‖ ‖y‖.

So ‖B^{-1}‖ ≤ 2 ‖A^{-1}‖.
Now note that

A^{-1}(A − B)B^{-1} = A^{-1}(AB^{-1} − I) = B^{-1} − A^{-1},

and

‖B^{-1} − A^{-1}‖ = ‖A^{-1}(A − B)B^{-1}‖ ≤ ‖A^{-1}‖ ‖A − B‖ ‖B^{-1}‖ ≤ 2 ‖A^{-1}‖² ‖A − B‖.

So as B tends to A, ‖B^{-1} − A^{-1}‖ tends to zero, and hence A ↦ A^{-1} is continuous.
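The final estimate can be observed numerically: for a small perturbation B of an invertible A, the distance ‖B^{-1} − A^{-1}‖ is controlled by ‖A − B‖. A sketch assuming NumPy (the matrices are arbitrary choices):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [0.0, 3.0]])
Ainv = np.linalg.inv(A)
op = lambda M: np.linalg.norm(M, 2)

# Perturb A slightly; the estimate above bounds how far B^{-1} can drift.
B = A + 1e-3 * np.array([[1.0, -1.0], [0.5, 0.25]])
lhs = op(np.linalg.inv(B) - Ainv)
rhs = 2.0 * op(Ainv) ** 2 * op(A - B)
print(lhs <= rhs)  # True (provided ||A - B|| ||A^{-1}|| < 1/2)
```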
Finally, let's get to matrices, which are a convenient way to represent finite-dimensional operators. If we have bases {x_1, ..., x_n} and {y_1, ..., y_m} for vector spaces X and Y, then we know that a linear operator is determined by its values on the basis. Given A ∈ L(X, Y), define the numbers {a^i_j} as follows:

A x_j = Σ_{i=1}^{m} a^i_j y_i,
and write them as a matrix

A = [ a^1_1  a^1_2  ···  a^1_n
      a^2_1  a^2_2  ···  a^2_n
        ⋮      ⋮     ⋱     ⋮
      a^m_1  a^m_2  ···  a^m_n ].
Note that the columns of the matrix are precisely the coefficients that represent A x_j. We can represent x_j as a column vector of n numbers (an n × 1 matrix) with 1 in the jth position and zero elsewhere, and then
A x_j = [ a^1_1 ··· a^1_n ; a^2_1 ··· a^2_n ; ⋮ ; a^m_1 ··· a^m_n ] [ 0 ; ⋮ ; 1 ; ⋮ ; 0 ] = [ a^1_j ; a^2_j ; ⋮ ; a^m_j ],

where the column vector has the 1 in the jth position.
When

x = Σ_{j=1}^{n} γ^j x_j,
then

Ax = Σ_{j=1}^{n} Σ_{i=1}^{m} γ^j a^i_j y_i = Σ_{i=1}^{m} (Σ_{j=1}^{n} γ^j a^i_j) y_i,

which gives rise to the familiar rule for matrix multiplication.
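The double sum is exactly matrix-vector multiplication; one can verify the index bookkeeping directly. A sketch assuming NumPy (the matrix and coefficients are arbitrary):

```python
import numpy as np

A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 3.0]])      # entries a^i_j, i = rows, j = columns
gamma = np.array([1.0, -1.0, 2.0])   # coefficients of x in the basis x_1, ..., x_n

# Coefficients of Ax in the basis y_1, ..., y_m, computed from the double sum.
coeffs = np.array([sum(gamma[j] * A[i, j] for j in range(3)) for i in range(2)])
print(coeffs, np.allclose(coeffs, A @ gamma))
```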
There is a one-to-one correspondence between matrices and linear operators in L(X, Y), once we fix a basis in X and in Y. If we chose a different basis, we would get a different matrix. This is important: the operator A acts on elements of X; the matrix is something that works with n-tuples of numbers.
Note that if B is an r-by-m matrix with entries b^i_j, then the matrix for BA has the (i, k)th entry c^i_k given by

c^i_k = Σ_{j=1}^{m} b^i_j a^j_k.

Note how the upper and lower indices line up.
Now suppose that all the bases are just the standard bases and X = R^n and Y = R^m. If we recall the Schwarz inequality, we note that

‖Ax‖² = Σ_{i=1}^{m} (Σ_{j=1}^{n} γ^j a^i_j)² ≤ Σ_{i=1}^{m} (Σ_{j=1}^{n} (γ^j)²)(Σ_{j=1}^{n} (a^i_j)²) = Σ_{i=1}^{m} (Σ_{j=1}^{n} (a^i_j)²) ‖x‖².
In other words,

‖A‖ ≤ √(Σ_{i=1}^{m} Σ_{j=1}^{n} (a^i_j)²).

Hence, if the entries go to zero, then ‖A‖ goes to zero. In particular, if the entries of A − B go to zero, then B goes to A in operator norm, that is, in the metric space topology induced by the operator norm. We have proved the first part of:
Proposition: If f: S → R^{nm} is a continuous function for a metric space S, then, taking the components of f as the entries of a matrix, f is a continuous mapping from S to L(R^n, R^m). Conversely, if f: S → L(R^n, R^m) is a continuous function, then the entries of the matrix are continuous functions.
The proof of the second part is rather easy. Take f(x)e_j and note that it is a continuous function to R^m with the standard Euclidean norm (note that ‖(A − B)e_j‖ ≤ ‖A − B‖). Recall from last semester that such a function is continuous if and only if its components are continuous, and these are the components of the jth column of the matrix f(x).
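The entrywise bound above says the operator norm is at most the square root of the sum of the squared entries (the Frobenius norm). A quick numerical check, assuming NumPy (the matrix is a random choice):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 6))

operator_norm = np.linalg.norm(A, 2)       # largest singular value
frobenius = np.linalg.norm(A)              # sqrt of the sum of squared entries
print(operator_norm <= frobenius + 1e-12)  # True
```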
The derivative

Recall that when we had a function f: (a, b) → R, we defined the derivative as

lim_{h→0} (f(x + h) − f(x)) / h.
In other words, there was a number a such that

lim_{h→0} |(f(x + h) − f(x))/h − a| = lim_{h→0} |f(x + h) − f(x) − ah| / |h| = 0.
Multiplying by a is a linear map in one dimension. That is, we can think of a ∈ L(R^1, R^1). So we can use this definition to extend differentiation to more variables.
Definition: Let U ⊂ R^n be an open subset and f: U → R^m. We say f is differentiable at x ∈ U if there exists an A ∈ L(R^n, R^m) such that

lim_{h→0, h ∈ R^n} ‖f(x + h) − f(x) − Ah‖ / ‖h‖ = 0.

We write Df(x) = A, or f′(x) = A, and we say A is the derivative of f at x. When f is differentiable at all x ∈ U, we say simply that f is differentiable.
Note that the derivative is a function from U to L(R^n, R^m).
The norms above must be in the right spaces, of course. Normally it is just understood that h ∈ R^n, and so we won't explicitly say so from now on.
We have again (as last semester) cheated somewhat and said that A is the derivative. We have, of course, not yet shown that there is only one.
Proposition: Let U ⊂ R^n be an open subset and f: U → R^m. Suppose that x ∈ U and there exist A, B ∈ L(R^n, R^m) such that

lim_{h→0} ‖f(x + h) − f(x) − Ah‖ / ‖h‖ = 0  and  lim_{h→0} ‖f(x + h) − f(x) − Bh‖ / ‖h‖ = 0.

Then A = B.
Proof.

‖(A − B)h‖ / ‖h‖ = ‖f(x + h) − f(x) − Ah − (f(x + h) − f(x) − Bh)‖ / ‖h‖
 ≤ ‖f(x + h) − f(x) − Ah‖ / ‖h‖ + ‖f(x + h) − f(x) − Bh‖ / ‖h‖.

So ‖(A − B)h‖ / ‖h‖ goes to 0 as h → 0. That is, given ε > 0, we have that for all h in some δ-ball around the origin

ε > ‖(A − B)h‖ / ‖h‖ = ‖(A − B)(h/‖h‖)‖.

For any x with ‖x‖ = 1, let h = (δ/2) x; then ‖h‖ < δ and h/‖h‖ = x, and so ‖A − B‖ ≤ ε. So A = B.
Example: If f(x) = Ax for a linear mapping A, then f′(x) = A. This is easily seen:

‖f(x + h) − f(x) − Ah‖ / ‖h‖ = ‖A(x + h) − Ax − Ah‖ / ‖h‖ = 0/‖h‖ = 0.
Proposition: Let U ⊂ R^n be open and f: U → R^m be differentiable at x_0. Then f is continuous at x_0.
Proof. Another way to write the differentiability is to write

r(h) = f(x_0 + h) − f(x_0) − f′(x_0)h.

As ‖r(h)‖/‖h‖ must go to zero as h → 0, r(h) itself must go to zero. As h ↦ f′(x_0)h is continuous (it's linear) and hence also goes to zero, that means that f(x_0 + h) must go to f(x_0). That is, f is continuous at x_0.
Theorem 9.15 (Chain rule): Let U ⊂ R^n be open and let f: U → R^m be differentiable at x_0 ∈ U. Let V ⊂ R^m be open, f(U) ⊂ V, and let g: V → R^k be differentiable at f(x_0). Then

F(x) = g(f(x))

is differentiable at x_0 and

F′(x_0) = g′(f(x_0)) f′(x_0).

Without the points this is sometimes written as F′ = g′f′. The way to understand it is that the derivative of the composition g(f(x)) is the composition of the derivatives of g and f. That is, if A = f′(x_0) and B = g′(f(x_0)), then F′(x_0) = BA.
Proof. Let A = f′(x_0) and B = g′(f(x_0)). Let h vary in R^n and write y_0 = f(x_0), k = f(x_0 + h) − f(x_0). Write

r(h) = f(x_0 + h) − f(x_0) − Ah = k − Ah.

Then

‖F(x_0 + h) − F(x_0) − BAh‖ / ‖h‖
 = ‖g(f(x_0 + h)) − g(f(x_0)) − BAh‖ / ‖h‖
 = ‖g(y_0 + k) − g(y_0) − B(k − r(h))‖ / ‖h‖
 ≤ ‖g(y_0 + k) − g(y_0) − Bk‖ / ‖h‖ + ‖B‖ ‖r(h)‖/‖h‖
 = (‖g(y_0 + k) − g(y_0) − Bk‖ / ‖k‖) (‖f(x_0 + h) − f(x_0)‖ / ‖h‖) + ‖B‖ ‖r(h)‖/‖h‖.
First, ‖B‖ is constant and f is differentiable at x_0, so the term ‖B‖ ‖r(h)‖/‖h‖ goes to 0. Next, as f is continuous at x_0, we have that as h goes to 0, then k goes to 0. Therefore, ‖g(y_0 + k) − g(y_0) − Bk‖ / ‖k‖ goes to 0, because g is differentiable at y_0. Finally,
‖f(x_0 + h) − f(x_0)‖ / ‖h‖ ≤ ‖f(x_0 + h) − f(x_0) − Ah‖ / ‖h‖ + ‖Ah‖/‖h‖ ≤ ‖f(x_0 + h) − f(x_0) − Ah‖ / ‖h‖ + ‖A‖.
As f is differentiable at x_0, the term ‖f(x_0 + h) − f(x_0)‖ / ‖h‖ stays bounded as h goes to 0. Therefore, ‖F(x_0 + h) − F(x_0) − BAh‖ / ‖h‖ goes to zero, and hence F′(x_0) = BA, which is what was claimed.
There is another way to generalize the derivative from one dimension. We can simply hold all but one variable constant and take the regular derivative.
Definition: Let f: U → R be a function on an open set U ⊂ R^n. If the following limit exists, we write

∂f/∂x^j (x) = lim_{h→0} (f(x^1, ..., x^{j−1}, x^j + h, x^{j+1}, ..., x^n) − f(x)) / h = lim_{h→0} (f(x + h e_j) − f(x)) / h.

We call ∂f/∂x^j (x) the partial derivative of f with respect to x^j. Sometimes we write D_j f instead.
When f: U → R^m is a function, then we can write f = (f^1, f^2, ..., f^m), where the f^k are real-valued functions. Then we can define ∂f^k/∂x^j (or write it as D_j f^k).
Partial derivatives are easier to compute with all the machinery of calculus, and they provide a way to
compute the total derivative of a function.
Theorem 9.17: Let U ⊂ R^n be open and let f: U → R^m be differentiable at x_0 ∈ U. Then all the partial derivatives at x_0 exist and, in terms of the standard bases of R^n and R^m, f′(x_0) is represented by the matrix

[ ∂f^1/∂x^1 (x_0)  ∂f^1/∂x^2 (x_0)  ···  ∂f^1/∂x^n (x_0)
  ∂f^2/∂x^1 (x_0)  ∂f^2/∂x^2 (x_0)  ···  ∂f^2/∂x^n (x_0)
         ⋮                ⋮          ⋱          ⋮
  ∂f^m/∂x^1 (x_0)  ∂f^m/∂x^2 (x_0)  ···  ∂f^m/∂x^n (x_0) ].
In other words,

f′(x_0) e_j = Σ_{k=1}^{m} (∂f^k/∂x^j)(x_0) e_k.
If h = Σ_{j=1}^{n} c^j e_j, then

f′(x_0) h = Σ_{j=1}^{n} Σ_{k=1}^{m} c^j (∂f^k/∂x^j)(x_0) e_k.
Again note the up-down pattern with the indices being summed over. That is on purpose.
Proof. Fix a j and note that

‖(f(x_0 + h e_j) − f(x_0))/h − f′(x_0)e_j‖ = ‖(f(x_0 + h e_j) − f(x_0) − f′(x_0)h e_j)/h‖ = ‖f(x_0 + h e_j) − f(x_0) − f′(x_0)h e_j‖ / ‖h e_j‖.
As h goes to 0, the right-hand side goes to zero by differentiability of f, and hence

lim_{h→0} (f(x_0 + h e_j) − f(x_0)) / h = f′(x_0)e_j.
Note that f is vector-valued. So represent f by components f = (f^1, f^2, ..., f^m), and note that taking a limit in R^m is the same as taking the limit in each component separately. Therefore, for any k the partial derivative

∂f^k/∂x^j (x_0) = lim_{h→0} (f^k(x_0 + h e_j) − f^k(x_0)) / h

exists and is equal to the kth component of f′(x_0)e_j, and we are done.
One of the consequences of the theorem is that if f is differentiable on U, then f′: U → L(R^n, R^m) is a continuous function if and only if all the ∂f^k/∂x^j are continuous functions.
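Partial derivatives give a practical way to approximate the matrix of f′(x_0): column j is approximately (f(x_0 + h e_j) − f(x_0))/h for small h. A sketch assuming NumPy; the map f is an arbitrary example:

```python
import numpy as np

def f(x):
    # A sample map f: R^2 -> R^2 (chosen for illustration).
    return np.array([x[0] ** 2 * x[1], np.sin(x[0]) + x[1]])

def jacobian_fd(f, x, h=1e-6):
    """Approximate the matrix of f'(x): column j is (f(x + h e_j) - f(x)) / h."""
    fx = f(x)
    cols = []
    for j in range(len(x)):
        e = np.zeros(len(x))
        e[j] = h
        cols.append((f(x + e) - fx) / h)
    return np.column_stack(cols)

x0 = np.array([1.0, 2.0])
analytic = np.array([[2 * x0[0] * x0[1], x0[0] ** 2],
                     [np.cos(x0[0]), 1.0]])
print(np.allclose(jacobian_fd(f, x0), analytic, atol=1e-4))  # True
```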
When U ⊂ R^n is open and f: U → R is a differentiable function, we call the following vector the gradient:

∇f(x) = Σ_{j=1}^{n} (∂f/∂x^j)(x) e_j.
Note that the upper-lower indices don't really match up. As a preview of Math 621, we note that we write

df = Σ_{j=1}^{n} (∂f/∂x^j) dx^j,

where the dx^j are really the standard basis (though we're thinking of dx^j as being in L(R^n, R), which is really equivalent to R^n). But we digress.
Suppose that γ: (a, b) ⊂ R → R^n is a differentiable function and γ((a, b)) ⊂ U. Write γ = (γ^1, ..., γ^n). Then we can write

g(t) = f(γ(t)).
The function g is then differentiable and the derivative is

g′(t) = Σ_{j=1}^{n} (∂f/∂x^j)(γ(t)) (dγ^j/dt)(t) = Σ_{j=1}^{n} (∂f/∂x^j)(dγ^j/dt),
where we sometimes, for convenience, leave out the points at which we are evaluating. We notice that

g′(t) = (∇f)(γ(t)) · γ′(t) = ∇f · γ′.

The dot represents the standard scalar dot product.
We use this idea to define derivatives in a specific direction. A direction is simply a vector pointing in that direction. So pick a vector u ∈ R^n such that ‖u‖ = 1. Fix x ∈ U. Then define

γ(t) = x + tu.

It is easy to compute that γ′(t) = u for all t. By the chain rule,

d/dt |_{t=0} [f(x + tu)] = (∇f)(x) · u,

where the notation d/dt |_{t=0} represents the derivative evaluated at t = 0. We also just compute directly:

d/dt |_{t=0} [f(x + tu)] = lim_{h→0} (f(x + hu) − f(x)) / h.
We obtain what is usually called the directional derivative, sometimes denoted by

D_u f(x) = d/dt |_{t=0} [f(x + tu)],

which can be computed by one of the methods above.
Let us suppose that (∇f)(x) ≠ 0. By the Schwarz inequality we have

|D_u f(x)| ≤ ‖(∇f)(x)‖,

and further, equality is achieved when u is a scalar multiple of (∇f)(x). When

u = (∇f)(x) / ‖(∇f)(x)‖,

we get D_u f(x) = ‖(∇f)(x)‖. So the gradient points in the direction in which the function grows fastest, that is, the direction in which D_u f is maximal.
Let us prove a “mean value theorem” for vector-valued functions.

Theorem 5.19: If ϕ: [a, b] → R^n is differentiable on (a, b) and continuous on [a, b], then there exists a t such that

‖ϕ(b) − ϕ(a)‖ ≤ (b − a) ‖ϕ′(t)‖.
Proof. By the mean value theorem applied to the function (ϕ(b) − ϕ(a)) · ϕ(t) (the dot is the scalar dot product), we obtain that there is a t such that

(ϕ(b) − ϕ(a)) · ϕ(b) − (ϕ(b) − ϕ(a)) · ϕ(a) = ‖ϕ(b) − ϕ(a)‖² = (b − a)(ϕ(b) − ϕ(a)) · ϕ′(t),

where we treat ϕ′ as simply a column vector of numbers, by abuse of notation. Note that in this case, it is not hard to see that ‖ϕ′(t)‖_{L(R,R^n)} = ‖ϕ′(t)‖_{R^n} (exercise).
By the Schwarz inequality,

‖ϕ(b) − ϕ(a)‖² = (b − a)(ϕ(b) − ϕ(a)) · ϕ′(t) ≤ (b − a) ‖ϕ(b) − ϕ(a)‖ ‖ϕ′(t)‖,

and dividing by ‖ϕ(b) − ϕ(a)‖ (if it is nonzero; otherwise the inequality is trivial) gives the result.
A set U is convex if whenever x, y ∈ U, the line segment from x to y lies in U. That is, if the convex combination (1 − t)x + ty is in U for all t ∈ [0, 1]. Note that in R, every connected interval is convex. In R² (or higher dimensions) there are lots of nonconvex connected sets. The ball B(x, r) is always convex by the triangle inequality (exercise).
Theorem 9.19: Let U ⊂ R^n be a convex open set, f: U → R^m a differentiable function, and M such that

‖f′(x)‖ ≤ M

for all x ∈ U. Then f is Lipschitz with constant M, that is,

‖f(x) − f(y)‖ ≤ M ‖x − y‖

for all x, y ∈ U.
Note that if U is not convex, this is not true. To see this, take the set

U = {(x, y) : 0.9 < x² + y² < 1.1} \ {(x, 0) : x < 0}.

Let f(x, y) be the angle that the line from the origin to (x, y) makes with the positive x-axis. You can even write the formula for f:

f(x, y) = 2 arctan( y / (x + √(x² + y²)) ).
Think of a spiral staircase with room in the middle. [Figure: a point (x, y) on the annulus and the angle θ = f(x, y).]
In any case, the function is differentiable and the derivative is bounded on U, which is not hard to see; but thinking of what happens near where the negative x-axis cuts the annulus in half, we see that the conclusion cannot hold.
Proof. Fix x and y in U and note that (1 − t)x + ty ∈ U for all t ∈ [0, 1] by convexity. Next,

d/dt [f((1 − t)x + ty)] = f′((1 − t)x + ty)(y − x).

By the mean value theorem above we get, for some t ∈ [0, 1],

‖f(x) − f(y)‖ ≤ ‖d/dt [f((1 − t)x + ty)]‖ ≤ ‖f′((1 − t)x + ty)‖ ‖y − x‖ ≤ M ‖y − x‖.
Let us solve the differential equation f′ = 0.
Corollary: If U ⊂ R^n is connected and f: U → R^m is differentiable and f′(x) = 0 for all x ∈ U, then f is constant.

Proof. For any x ∈ U, there is a ball B(x, δ) ⊂ U. The ball B(x, δ) is convex. Since ‖f′(y)‖ ≤ 0 for all y ∈ B(x, δ), then by the theorem, ‖f(x) − f(y)‖ ≤ 0 ‖x − y‖ = 0, so f(x) = f(y) for all y ∈ B(x, δ).
This means that f^{-1}(c) is open for any c ∈ R^m. Suppose that f^{-1}(c) is nonempty. The two sets

U′ = f^{-1}(c),  U″ = f^{-1}(R^m \ {c}) = ∪_{a ∈ R^m, a ≠ c} f^{-1}(a)

are open and disjoint, and further U = U′ ∪ U″. So as U′ is nonempty and U is connected, we have that U″ = ∅. So f(x) = c for all x ∈ U.
Definition: We say f: U ⊂ R^n → R^m is continuously differentiable, or C^1(U), if f is differentiable and f′: U → L(R^n, R^m) is continuous.
Theorem 9.21: Let U ⊂ R^n be open and f: U → R^m. The function f is continuously differentiable if and only if all the partial derivatives exist and are continuous.
Note that without continuity the theorem does not hold. Just because partial derivatives exist doesn't mean that f is differentiable; in fact, f may not even be continuous. See the homework.
Proof. We have seen that if f is differentiable, then the partial derivatives exist. Furthermore, the partial derivatives are the entries of the matrix of f′(x). So if f′: U → L(R^n, R^m) is continuous, then the entries are continuous, hence the partial derivatives are continuous.
To prove the opposite direction, suppose the partial derivatives exist and are continuous. Fix x ∈ U. If we can show that f′(x) exists, we are done, because the entries of the matrix f′(x) are then the partial derivatives, and if the entries are continuous functions, the matrix-valued function f′ is continuous.
Let us do induction on dimension. First let us note that the conclusion is true when n = 1. In this case the derivative is just the regular derivative (exercise: you should check that the fact that the function is vector-valued is not a problem).
Suppose that the conclusion is true for R^{n−1}, that is, if we restrict to the first n − 1 variables, the conclusion is true. It is easy to see that the first n − 1 partial derivatives of f restricted to the set where the last coordinate is fixed are the same as those for f. In the following we will think of R^{n−1} as a subset of R^n, that is, the set in R^n where x^n = 0. Let
A = [ ∂f^1/∂x^1 (x)  ···  ∂f^1/∂x^n (x)
            ⋮         ⋱         ⋮
      ∂f^m/∂x^1 (x)  ···  ∂f^m/∂x^n (x) ],

A_1 = [ ∂f^1/∂x^1 (x)  ···  ∂f^1/∂x^{n−1} (x)
              ⋮         ⋱          ⋮
        ∂f^m/∂x^1 (x)  ···  ∂f^m/∂x^{n−1} (x) ],

v = [ ∂f^1/∂x^n (x)
            ⋮
      ∂f^m/∂x^n (x) ].
Let ε > 0 be given. Let δ > 0 be such that for any k ∈ R^{n-1} with ‖k‖ < δ we have

  ‖f(x + k) − f(x) − A_1 k‖ / ‖k‖ < ε.
By continuity of the partial derivatives, suppose that δ is small enough so that

  | ∂f^j/∂x^n (x + h) − ∂f^j/∂x^n (x) | < ε,

for all j and all h with ‖h‖ < δ.
Let h = h_1 + t e_n be a vector in R^n, where h_1 ∈ R^{n-1}, such that ‖h‖ < δ. Then
‖h_1‖ ≤ ‖h‖ < δ. Note that Ah = A_1 h_1 + t v.
  ‖f(x + h) − f(x) − Ah‖ = ‖f(x + h_1 + t e_n) − f(x + h_1) − t v + f(x + h_1) − f(x) − A_1 h_1‖
                         ≤ ‖f(x + h_1 + t e_n) − f(x + h_1) − t v‖ + ‖f(x + h_1) − f(x) − A_1 h_1‖
                         ≤ ‖f(x + h_1 + t e_n) − f(x + h_1) − t v‖ + ε ‖h_1‖ .
As all the partial derivatives exist, by the mean value theorem, for each j there is some
θ_j ∈ [0, t] (or [t, 0] if t < 0) such that

  f^j(x + h_1 + t e_n) − f^j(x + h_1) = t ∂f^j/∂x^n (x + h_1 + θ_j e_n).
Note that if ‖h‖ < δ, then ‖h_1 + θ_j e_n‖ ≤ ‖h‖ < δ. So to finish the estimate

  ‖f(x + h) − f(x) − Ah‖ ≤ ‖f(x + h_1 + t e_n) − f(x + h_1) − t v‖ + ε ‖h_1‖
                         ≤ √( Σ_{j=1}^{m} ( t ∂f^j/∂x^n (x + h_1 + θ_j e_n) − t ∂f^j/∂x^n (x) )^2 ) + ε ‖h_1‖
                         ≤ ε √m |t| + ε ‖h_1‖
                         ≤ ε (√m + 1) ‖h‖ .

As ε > 0 was arbitrary, f is differentiable at x and f′(x) = A.
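The content of Theorem 9.21 can be checked numerically: for a map with continuous partials, the matrix of partials A really does satisfy ‖f(x + h) − f(x) − Ah‖/‖h‖ → 0. The particular map below is our own choice for illustration.

```python
import math

# Hedged numeric illustration (the map is chosen for this sketch):
# f(x, y) = (x^2 y, sin x + y) has continuous partial derivatives, so by
# Theorem 9.21 the matrix of partials A is the derivative f'(x0, y0).
def f(x, y):
    return (x * x * y, math.sin(x) + y)

x0, y0 = 1.0, 2.0
A = [[2 * x0 * y0, x0 * x0],    # partials of f^1: (2xy, x^2)
     [math.cos(x0), 1.0]]       # partials of f^2: (cos x, 1)

# ||f(p + h) - f(p) - A h|| / ||h|| should shrink as h -> 0.
def remainder(hx, hy):
    u1, u2 = f(x0 + hx, y0 + hy)
    v1, v2 = f(x0, y0)
    r1 = u1 - v1 - (A[0][0] * hx + A[0][1] * hy)
    r2 = u2 - v2 - (A[1][0] * hx + A[1][1] * hy)
    return math.hypot(r1, r2) / math.hypot(hx, hy)

print(remainder(1e-2, 1e-2), remainder(1e-4, 1e-4))
```

Shrinking h by a factor of 100 shrinks the quotient by roughly the same factor, consistent with differentiability.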
Contraction mapping principle
Let us review the contraction mapping principle.
Definition: Let (X, d) and (X′, d′) be metric spaces. A map f : X → X′ is said to be a
contraction (or a contractive map) if it is a k-Lipschitz map for some k < 1, i.e. if there
exists a k < 1 such that

  d′( f(x), f(y) ) ≤ k d(x, y)   for all x, y ∈ X.
If f : X → X is a map, x ∈ X is called a ﬁxed point if f(x) = x.
Theorem 9.23 (Contraction mapping principle or Fixed point theorem): Let (X, d) be a
nonempty complete metric space and f : X → X a contraction. Then f has a unique fixed point.
The words complete and contraction are necessary. For example, f : (0, 1) → (0, 1) deﬁned by f(x) = kx
for any 0 < k < 1 is a contraction with no ﬁxed point. Also f : R → R deﬁned by f(x) = x + 1 is not a
contraction (k = 1) and has no ﬁxed point.
Proof. Pick any x_0 ∈ X. Define a sequence {x_n} by x_{n+1} := f(x_n). Then

  d(x_{n+1}, x_n) = d( f(x_n), f(x_{n-1}) ) ≤ k d(x_n, x_{n-1}) ≤ ··· ≤ k^n d(x_1, x_0).
Suppose m ≥ n, then

  d(x_m, x_n) ≤ Σ_{i=n}^{m-1} d(x_{i+1}, x_i)
             ≤ Σ_{i=n}^{m-1} k^i d(x_1, x_0)
             = k^n d(x_1, x_0) Σ_{i=0}^{m-n-1} k^i
             ≤ k^n d(x_1, x_0) Σ_{i=0}^{∞} k^i
             = k^n d(x_1, x_0) · 1/(1 − k).
In particular the sequence is Cauchy (why?). Since X is complete we let x := lim x_n, and we
claim that x is our unique fixed point.
Fixed point? Note that f is continuous because it is a contraction. Hence

  f(x) = lim f(x_n) = lim x_{n+1} = x.
Unique? Let y be a fixed point. Then

  d(x, y) = d( f(x), f(y) ) ≤ k d(x, y).

As k < 1 this means that d(x, y) = 0 and hence x = y. The theorem is proved.
Note that the proof is constructive: not only do we know that a unique fixed point exists, we
also know how to find it. We used the theorem to prove Picard's theorem last semester. This
semester we will use it to prove the inverse and implicit function theorems.
Also note that the proof of uniqueness holds even if X is not complete: if a contraction has
a fixed point, then that fixed point is unique.
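The constructive iteration from the proof can be run directly. A small sketch (the map cos is our choice of example): cos maps [−1, 1] into itself, and there |cos′(t)| = |sin t| ≤ sin 1 < 1, so it is a contraction on the complete space [−1, 1].

```python
import math

# The proof's scheme x_{n+1} = f(x_n), run as an algorithm. Example map
# (ours, not from the text): f = cos on X = [-1, 1], a complete metric
# space on which |f'(t)| = |sin t| <= sin(1) < 1, so f is a contraction.
def iterate(f, x0, n):
    x = x0
    for _ in range(n):
        x = f(x)          # geometric convergence by the Cauchy estimate
    return x

x = iterate(math.cos, 1.0, 100)
print(x)                  # ~0.739085..., the unique solution of cos(x) = x
print(abs(math.cos(x) - x))
```

The error bound k^n d(x_1, x_0)/(1 − k) from the proof tells in advance how many iterations guarantee a given accuracy.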
Inverse function theorem
The idea of a derivative is that if a function is diﬀerentiable, then it locally “behaves like” the derivative
(which is a linear function). So for example, if a function is diﬀerentiable and the derivative is invertible,
the function is (locally) invertible.
Theorem 9.24: Let U ⊂ R^n be an open set and let f : U → R^n be a continuously
differentiable function. Also suppose that x_0 ∈ U, f(x_0) = y_0, and f′(x_0) is invertible.
Then there exist open sets V, W ⊂ R^n such that x_0 ∈ V ⊂ U, f(V) = W, and f|_V is
one-to-one and onto. Furthermore, the inverse g(y) = (f|_V)^{-1}(y) is continuously
differentiable and

  g′(y) = ( f′(x) )^{-1},   for all x ∈ V, y = f(x).
Proof. Write A = f′(x_0). As f′ is continuous, there exists an open ball V around x_0 such
that

  ‖A − f′(x)‖ < 1/(2 ‖A^{-1}‖)   for all x ∈ V .

Note that f′(x) is invertible for all x ∈ V (since ‖A − f′(x)‖ ‖A^{-1}‖ < 1/2, this follows
from Theorem 9.8).
Given y ∈ R^n, we define ϕ_y : V → R^n by

  ϕ_y(x) = x + A^{-1}( y − f(x) ).
As A^{-1} is one-to-one, we notice that ϕ_y(x) = x (x is a fixed point) if and only if
y − f(x) = 0, or in other words f(x) = y. Using the chain rule we obtain

  ϕ_y′(x) = I − A^{-1} f′(x) = A^{-1}( A − f′(x) ),
so for x ∈ V we have

  ‖ϕ_y′(x)‖ ≤ ‖A^{-1}‖ ‖A − f′(x)‖ < 1/2.
As V is a ball it is convex, and hence

  ‖ϕ_y(x_1) − ϕ_y(x_2)‖ ≤ (1/2) ‖x_1 − x_2‖   for all x_1, x_2 ∈ V .
In other words, ϕ_y is a contraction defined on V, though we do not yet know what the range
of ϕ_y is. We cannot apply the fixed point theorem, but we can say that ϕ_y has at most one
fixed point (note the proof of uniqueness in the contraction mapping principle). That is,
there exists at most one x ∈ V such that f(x) = y, and so f|_V is one-to-one.
Let W = f(V). We need to show that W is open. Take y_1 ∈ W; then there is a unique x_1 ∈ V
such that f(x_1) = y_1. Let r > 0 be small enough such that the closed ball C(x_1, r) ⊂ V
(such an r > 0 exists as V is open).
Suppose y is such that

  ‖y − y_1‖ < r/(2 ‖A^{-1}‖).
If we can show that y ∈ W, then we have shown that W is open. Define ϕ_y(x) = x + A^{-1}( y − f(x) )
as before. If x ∈ C(x_1, r), then
  ‖ϕ_y(x) − x_1‖ ≤ ‖ϕ_y(x) − ϕ_y(x_1)‖ + ‖ϕ_y(x_1) − x_1‖
                 ≤ (1/2) ‖x − x_1‖ + ‖A^{-1}(y − y_1)‖
                 ≤ (1/2) r + ‖A^{-1}‖ ‖y − y_1‖
                 < (1/2) r + ‖A^{-1}‖ · r/(2 ‖A^{-1}‖) = r.
So ϕ_y takes C(x_1, r) into B(x_1, r) ⊂ C(x_1, r). It is a contraction on C(x_1, r), and
C(x_1, r) is complete (a closed subset of R^n is complete). Applying the contraction mapping
principle, we obtain a fixed point x, i.e. ϕ_y(x) = x, that is f(x) = y. So
y ∈ f( C(x_1, r) ) ⊂ f(V) = W. Therefore W is open.
Next we need to show that g is continuously differentiable and compute its derivative. First
let us show that it is differentiable. Let y ∈ W and k ∈ R^n, k ≠ 0, such that y + k ∈ W.
Then there are unique x ∈ V and h ∈ R^n, h ≠ 0 with x + h ∈ V, such that f(x) = y and
f(x + h) = y + k, as f|_V is a one-to-one and onto mapping of V onto W. In other words,
g(y) = x and g(y + k) = x + h. We can still squeeze some information from the fact that ϕ_y
is a contraction:
  ϕ_y(x + h) − ϕ_y(x) = h + A^{-1}( f(x) − f(x + h) ) = h − A^{-1} k.

So

  ‖h − A^{-1} k‖ = ‖ϕ_y(x + h) − ϕ_y(x)‖ ≤ (1/2) ‖x + h − x‖ = ‖h‖/2.
By the inverse triangle inequality, ‖h‖ − ‖A^{-1} k‖ ≤ (1/2) ‖h‖, so

  ‖h‖ ≤ 2 ‖A^{-1} k‖ ≤ 2 ‖A^{-1}‖ ‖k‖.
In particular as k goes to 0, so does h.
As x ∈ V, f′(x) is invertible. Let B = ( f′(x) )^{-1}, which is what we expect the derivative
of g at y to be. Then

  ‖g(y + k) − g(y) − Bk‖ / ‖k‖ = ‖h − Bk‖ / ‖k‖
      = ‖h − B( f(x + h) − f(x) )‖ / ‖k‖
      = ‖B( f(x + h) − f(x) − f′(x) h )‖ / ‖k‖        (as B f′(x) h = h)
      ≤ ‖B‖ (‖h‖/‖k‖) · ‖f(x + h) − f(x) − f′(x) h‖ / ‖h‖
      ≤ 2 ‖B‖ ‖A^{-1}‖ · ‖f(x + h) − f(x) − f′(x) h‖ / ‖h‖ .
As k goes to 0, so does h. So the right-hand side goes to 0 as f is differentiable, and hence
the left-hand side also goes to 0. And B is precisely what we wanted g′(y) to be.
We have that g is differentiable; let us show it is C^1(W). Now, g : W → V is continuous (it
is differentiable), f′ is a continuous function from V to L(R^n), and X → X^{-1} is a
continuous function (on the set of invertible operators). As

  g′(y) = ( f′( g(y) ) )^{-1}

is the composition of these three continuous functions, it is continuous.
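The proof's map ϕ_y(x) = x + A^{-1}(y − f(x)) doubles as a practical algorithm for computing g(y). A minimal sketch, assuming our own example map f(x_1, x_2) = (e^{x_1} + x_2, x_1 − x_2), whose derivative at the origin is A = [[1, 1], [1, −1]] with A^{-1} = [[1/2, 1/2], [1/2, −1/2]]:

```python
import math

# Iterating phi_y(x) = x + A^{-1}(y - f(x)) to invert f near x_0 = (0, 0).
# The map f and the target point y are our choices for this illustration.
def f(x1, x2):
    return (math.exp(x1) + x2, x1 - x2)

def phi(y, x):
    r1 = y[0] - f(x[0], x[1])[0]          # residual y - f(x)
    r2 = y[1] - f(x[0], x[1])[1]
    # apply A^{-1} = [[0.5, 0.5], [0.5, -0.5]] to the residual
    return (x[0] + 0.5 * r1 + 0.5 * r2, x[1] + 0.5 * r1 - 0.5 * r2)

y = (1.05, 0.02)          # a point near y_0 = f(0, 0) = (1, 0)
x = (0.0, 0.0)
for _ in range(60):       # contraction => geometric convergence
    x = phi(y, x)

print(f(x[0], x[1]))      # reproduces y, so x = g(y)
```

This is essentially a simplified Newton's method with the derivative frozen at x_0; the contraction estimate from the proof is what guarantees convergence.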
Corollary: Suppose U ⊂ R^n is open and f : U → R^n is a continuously differentiable mapping
such that f′(x) is invertible for all x ∈ U. Then given any open set V ⊂ U, f(V) is open
(f is an open mapping).
Proof. WLOG suppose U = V. For each point y ∈ f(V), pick x ∈ f^{-1}(y) (there could be more
than one such point); by the inverse function theorem there is a neighbourhood of x in V
that maps onto a neighbourhood of y. Hence f(V) is open.
The theorem, and the corollary, are not true if f′(x) is not invertible for some x. For
example, the map f(x, y) = (x, xy) maps R^2 onto the set R^2 ∖ {(0, y) : y ≠ 0}, which is
neither open nor closed. In fact, f^{-1}(0, 0) = {(0, y) : y ∈ R}. Note that this bad
behaviour only occurs on the y-axis; everywhere else the function is locally invertible. In
fact, if we avoid the y-axis, it is even one-to-one.
Also note that just because f′(x) is invertible everywhere does not mean that f is one-to-one
globally. It is definitely "locally" one-to-one. For an example, take the map
f : C ∖ {0} → C defined by f(z) = z^2, where we treat the map as going from R^2 ∖ {0} to
R^2. Every nonzero complex number has exactly two square roots, so the map is actually
2-to-1. It is left to the student to show that f is differentiable and the derivative is
invertible (hint: let z = x + iy and write down the real and imaginary parts of f in terms
of x and y).
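Writing the squaring map out in real coordinates (a small check of the hint above, done for you numerically rather than symbolically):

```python
# The squaring map z -> z^2 viewed on R^2: f(x, y) = (x^2 - y^2, 2xy).
# Its derivative matrix is [[2x, -2y], [2y, 2x]], with determinant
# 4(x^2 + y^2), nonzero away from the origin -- yet f is 2-to-1,
# since z and -z always have the same square.
def f(x, y):
    return (x * x - y * y, 2 * x * y)

def jac_det(x, y):
    # det [[2x, -2y], [2y, 2x]] = 4(x^2 + y^2)
    return 2 * x * 2 * x - (-2 * y) * (2 * y)

print(f(1.0, 1.0), f(-1.0, -1.0))   # the same point, (0.0, 2.0), twice
print(jac_det(1.0, 1.0))            # 8.0, nonzero
```

So local invertibility holds at every point of R^2 ∖ {0} even though global injectivity fails.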
Also note that the invertibility of the derivative is not a necessary condition, just
sufficient, for having a continuous inverse and being an open mapping. For example, the
function f(x) = x^3 is an open mapping from R to R and is globally one-to-one with a
continuous inverse, even though f′(0) = 0 is not invertible.
Implicit function theorem
The inverse function theorem is really a special case of the implicit function theorem, which
we prove next, although somewhat ironically we will prove the implicit function theorem using
the inverse function theorem. What we were really showing in the inverse function theorem was
that the equation x − f(y) = 0 is solvable for y in terms of x if the derivative with respect
to y is invertible, that is, if f′(y) is invertible. That is, there is locally a function g
such that x − f( g(x) ) = 0.
OK, so how about we look at the equation f(x, y) = 0. Obviously this is not solvable for y in
terms of x in every case. For example, when f(x, y) does not actually depend on y. For a
slightly more complicated example, notice that x^2 + y^2 − 1 = 0 defines the unit circle, and
we can locally solve for y in terms of x when (1) we are near a point which lies on the unit
circle and (2) we are not at a point where the circle has a vertical tangency, in other words
where ∂f/∂y = 0.
To make things simple we fix some notation. We let (x, y) ∈ R^{n+m} denote the coordinates
(x^1, ..., x^n, y^1, ..., y^m). A linear transformation A ∈ L(R^{n+m}, R^m) can then always
be written as A = [A_x  A_y], so that A(x, y) = A_x x + A_y y, where A_x ∈ L(R^n, R^m) and
A_y ∈ L(R^m).
Note that Rudin does things "in reverse" from how the statement is usually given. I will do
it in the usual order, as that is what I am used to, where we take the derivatives in y, not
x (but it does not really matter in the end). First, a linear version of the implicit
function theorem.
Proposition (Theorem 9.27): Let A = [A_x  A_y] ∈ L(R^{n+m}, R^m) and suppose that A_y is
invertible. If B = −(A_y)^{-1} A_x, then

  0 = A(x, Bx) = A_x x + A_y Bx.

The proof is obvious: we simply solve A_x x + A_y y = 0 and obtain y = Bx. Let us therefore
show that the same can be done for C^1 functions.
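The linear case can be computed concretely. A small sketch with numbers of our own choosing (n = 1, m = 2), showing that B = −(A_y)^{-1} A_x really does satisfy A(x, Bx) = 0:

```python
# Linear implicit function theorem, worked numerically. Our example:
# A_x = (1, 2)^T and A_y = [[2, 1], [1, 1]] (invertible, det = 1).
Ax = (1.0, 2.0)
Ay = ((2.0, 1.0), (1.0, 1.0))

# Invert the 2x2 block A_y by the adjugate formula.
det = Ay[0][0] * Ay[1][1] - Ay[0][1] * Ay[1][0]
Ay_inv = ((Ay[1][1] / det, -Ay[0][1] / det),
          (-Ay[1][0] / det, Ay[0][0] / det))

# B = -(A_y)^{-1} A_x
B = (-(Ay_inv[0][0] * Ax[0] + Ay_inv[0][1] * Ax[1]),
     -(Ay_inv[1][0] * Ax[0] + Ay_inv[1][1] * Ax[1]))

x = 3.0                       # any scalar x will do
y = (B[0] * x, B[1] * x)      # y = Bx
residual = (Ax[0] * x + Ay[0][0] * y[0] + Ay[0][1] * y[1],
            Ax[1] * x + Ay[1][0] * y[0] + Ay[1][1] * y[1])
print(B, residual)            # residual is (0.0, 0.0)
```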
Theorem 9.28 (Implicit function theorem): Let U ⊂ R^{n+m} be an open set and let
f : U → R^m be a C^1(U) mapping. Let (x_0, y_0) ∈ U be a point such that f(x_0, y_0) = 0.
Write A = [A_x  A_y] = f′(x_0, y_0) and suppose that A_y is invertible. Then there exists an
open set W ⊂ R^n with x_0 ∈ W, and a C^1(W) mapping g : W → R^m with g(x_0) = y_0, such
that for all x ∈ W we have (x, g(x)) ∈ U and

  f( x, g(x) ) = 0.

Furthermore,

  g′(x_0) = −(A_y)^{-1} A_x.
Proof. Define F : U → R^{n+m} by F(x, y) = ( x, f(x, y) ). It is clear that F is C^1, and we
want to show that the derivative at (x_0, y_0) is invertible.
Let us compute the derivative. We know that
  ‖f(x_0 + h, y_0 + k) − f(x_0, y_0) − A_x h − A_y k‖ / ‖(h, k)‖

goes to zero as ‖(h, k)‖ = √(‖h‖² + ‖k‖²) goes to zero. But then so does

  ‖( h, f(x_0 + h, y_0 + k) − f(x_0, y_0) ) − (h, A_x h + A_y k)‖ / ‖(h, k)‖
      = ‖f(x_0 + h, y_0 + k) − f(x_0, y_0) − A_x h − A_y k‖ / ‖(h, k)‖ .
So the derivative of F at (x_0, y_0) takes (h, k) to (h, A_x h + A_y k). If
(h, A_x h + A_y k) = (0, 0), then h = 0, and so A_y k = 0. As A_y is one-to-one, k = 0.
Therefore F′(x_0, y_0) is one-to-one, in other words invertible, and we can apply the inverse
function theorem.
That is, there exists some open set V ⊂ R^{n+m} with (x_0, 0) ∈ V, and an inverse mapping
G : V → R^{n+m}, that is, F( G(x, s) ) = (x, s) for all (x, s) ∈ V (where x ∈ R^n and
s ∈ R^m). Write G = (G_1, G_2) (the first n and the second m components of G). Then

  F( G_1(x, s), G_2(x, s) ) = ( G_1(x, s), f( G_1(x, s), G_2(x, s) ) ) = (x, s).
So x = G_1(x, s) and f( G_1(x, s), G_2(x, s) ) = f( x, G_2(x, s) ) = s. Plugging in s = 0 we
obtain

  f( x, G_2(x, 0) ) = 0.
Let W = {x ∈ R^n : (x, 0) ∈ V} and define g : W → R^m by g(x) = G_2(x, 0). This is the g in
the theorem.
Next, differentiate

  x ↦ f( x, g(x) )

at x_0. This map is identically zero, so its derivative is the zero map. The derivative is
computed in the same way as above: we get, for all h ∈ R^n,

  0 = A( h, g′(x_0) h ) = A_x h + A_y g′(x_0) h,

and we obtain the desired derivative for g as well, g′(x_0) = −(A_y)^{-1} A_x.
In other words, in the context of the theorem we have m equations in n + m unknowns:

  f^1(x^1, ..., x^n, y^1, ..., y^m) = 0
      ⋮
  f^m(x^1, ..., x^n, y^1, ..., y^m) = 0.

The condition guaranteeing a solution is that this is a C^1 mapping (that all the components
are C^1, or in other words all the partial derivatives exist and are continuous), and that
the matrix

  [ ∂f^1/∂y^1  ...  ∂f^1/∂y^m ]
  [      ⋮              ⋮     ]
  [ ∂f^m/∂y^1  ...  ∂f^m/∂y^m ]

is invertible at (x_0, y_0).
Example: Consider the set given by x^2 + y^2 − (z + 1)^3 = −1, e^x + e^y + e^z = 3, near the
point (0, 0, 0). The function we are looking at is

  f(x, y, z) = ( x^2 + y^2 − (z + 1)^3 + 1,  e^x + e^y + e^z − 3 ).

We find that

  Df = [ 2x   2y   −3(z + 1)^2 ]
       [ e^x  e^y  e^z         ] .
The matrix of partial derivatives in the variables (y, z) at the origin,

  [ 2(0)  −3(0 + 1)^2 ]   =   [ 0  −3 ]
  [ e^0    e^0        ]       [ 1   1 ] ,

is invertible. Hence near (0, 0, 0) we can find y and z as C^1 functions of x such that for
x near 0 we have

  x^2 + y(x)^2 − (z(x) + 1)^3 = −1,   e^x + e^{y(x)} + e^{z(x)} = 3.

The theorem does not tell us how to find y(x) and z(x) explicitly; it just tells us they
exist. In other words, near the origin the set of solutions is a smooth curve that goes
through the origin.
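Even though the theorem gives no formula for y(x) and z(x), they can be computed numerically. A sketch using Newton's method (our numerical stand-in for g; the theorem predicts g′(0) = −(A_y)^{-1} A_x = (−1, 0) for this example):

```python
import math

# Solve the example's two equations for (y, z) at a fixed x near 0 by
# Newton's method, inverting the 2x2 Jacobian in (y, z) by hand.
def solve_yz(x, y=0.0, z=0.0):
    for _ in range(50):
        f1 = x * x + y * y - (z + 1.0) ** 3 + 1.0
        f2 = math.exp(x) + math.exp(y) + math.exp(z) - 3.0
        a, b = 2.0 * y, -3.0 * (z + 1.0) ** 2   # Jacobian rows in (y, z)
        c, d = math.exp(y), math.exp(z)
        det = a * d - b * c
        y -= (d * f1 - b * f2) / det            # Newton step: J^{-1} F
        z -= (-c * f1 + a * f2) / det
    return y, z

h = 0.01
yp, zp = solve_yz(h)
ym, zm = solve_yz(-h)
# Central differences should match g'(0) = (-1, 0).
print((yp - ym) / (2 * h), (zp - zm) / (2 * h))
```

The printed slopes agree with implicit differentiation: at the origin, 1 + y′(0) = 0 and −3 z′(0) = 0.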
Note that there are versions of the theorem for arbitrarily many derivatives. If f has k continuous
derivatives, then the solution also has k derivatives.
So it would be good to have an easy test for when a matrix is invertible. This is where
determinants come in. Suppose σ = (σ_1, ..., σ_n) is a permutation of the integers
(1, ..., n). It is not hard to see that any permutation can be obtained by a sequence of
transpositions (switchings of two elements). Call a permutation even (resp. odd) if it takes
an even (resp. odd) number of transpositions to get from σ to (1, ..., n). It can be shown
that this is well defined; in fact, it is not hard to show that

  sgn(σ) = sgn(σ_1, ..., σ_n) = ∏_{p<q} sgn(σ_q − σ_p)

is −1 if σ is odd and 1 if σ is even. Here the symbol sgn(x) for a number x is defined by

  sgn(x) = −1 if x < 0,   0 if x = 0,   1 if x > 0.

This can be proved by noting that applying a transposition changes the sign (which is not
hard to prove by induction on n), and then noting that the sign of (1, 2, ..., n) is 1.
Let S_n be the set of all permutations on n elements (the symmetric group). Let A = [a^i_j]
be an n×n matrix. Define the determinant of A by

  det(A) = Σ_{σ∈S_n} sgn(σ) ∏_{i=1}^{n} a^i_{σ_i} .
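The permutation-sum definition can be written out directly, which is a useful sanity check even though it has n! terms (the matrix below is our own example):

```python
from itertools import permutations

# sgn(sigma) via inversion counting, equivalent to the product formula
# prod_{p<q} sgn(sigma_q - sigma_p) for a permutation of distinct entries.
def sgn(sigma):
    s = 1
    for p in range(len(sigma)):
        for q in range(p + 1, len(sigma)):
            if sigma[q] < sigma[p]:
                s = -s                    # each inversion flips the sign
    return s

# det(A) = sum over permutations of sgn(sigma) * prod_i a[i][sigma_i]
def det(A):
    n = len(A)
    total = 0
    for sigma in permutations(range(n)):
        term = sgn(sigma)
        for i in range(n):
            term *= A[i][sigma[i]]
        total += term
    return total

A = [[1, 2, 3], [4, 5, 6], [7, 8, 10]]
print(det(A))   # cofactor expansion gives 1*2 - 2*(-2) + 3*(-3) = -3
```

This is exponentially slow for large n, but it is exactly the definition above, term for term.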
Proposition (Theorem 9.34 and other observations):
(i) det(I) = 1.
(ii) det([x_1 x_2 ... x_n]), where the x_j are the column vectors, is linear in each
variable x_j separately.
(iii) If two columns of a matrix are interchanged, the determinant changes sign.
(iv) If two columns of A are equal, then det(A) = 0.
(v) If a column is zero, then det(A) = 0.
(vi) A ↦ det(A) is a continuous function.
(vii) det [ a b ; c d ] = ad − bc, and det[a] = a.
In fact, the determinant is the unique function that satisﬁes (i), (ii), and (iii). But we digress.
Proof. We go through the proof quickly, as you have likely seen this before.
(i) is trivial. For (ii), notice that each term in the definition of the determinant contains
exactly one factor from each column.
Part (iii) follows by noting that switching two columns is like switching the two
corresponding numbers in every element of S_n; hence all the signs are changed. Part (iv)
follows because if two columns are equal and we switch them, we get the same matrix back, and
so part (iii) says the determinant must have been 0.
Part (v) follows because the product in each term of the definition includes one element from
the zero column. Part (vi) follows as det is a polynomial in the entries of the matrix and
hence continuous; we have seen that a function defined on matrices is continuous in the
operator norm if it is continuous in the entries. Finally, part (vii) is a direct
computation.
Theorem 9.35+9.36: If A and B are n×n matrices, then det(AB) = det(A) det(B). In particular,
A is invertible if and only if det(A) ≠ 0, and in this case det(A^{-1}) = 1/det(A).
Proof. Let b_1, ..., b_n be the columns of B. Then

  AB = [Ab_1  Ab_2  ...  Ab_n].

That is, the columns of AB are Ab_1, ..., Ab_n.
Let b^i_j denote the entries of B and a_j the columns of A. Note that Ae_j = a_j. By
linearity of the determinant, as proved above, we have

  det(AB) = det([Ab_1  Ab_2  ...  Ab_n]) = det([ Σ_{j=1}^{n} b^j_1 a_j   Ab_2  ...  Ab_n ])
          = Σ_{j=1}^{n} b^j_1 det([a_j  Ab_2  ...  Ab_n])
          = Σ_{1≤j_1,...,j_n≤n} b^{j_1}_1 b^{j_2}_2 ··· b^{j_n}_n det([a_{j_1}  a_{j_2}  ...  a_{j_n}])
          = ( Σ_{(j_1,...,j_n)∈S_n} b^{j_1}_1 b^{j_2}_2 ··· b^{j_n}_n sgn(j_1, ..., j_n) ) det([a_1  a_2  ...  a_n]).

In the above we could pass from a sum over all n-tuples of integers to a sum over just the
elements of S_n, because whenever two of the j's repeat, the resulting matrix has two equal
columns and its determinant is zero.
The conclusion follows by recognizing the determinant of B in the parenthesized sum. Actually
the roles of rows and columns are swapped, but a moment's reflection reveals that it does not
matter. We could also just plug in A = I.
For the second part of the theorem, note that if A is invertible, then A^{-1} A = I and so
det(A^{-1}) det(A) = 1. If A is not invertible, then the columns are linearly dependent; that
is, suppose

  Σ_{j=1}^{n} c_j a_j = 0.
Without loss of generality suppose that c_1 ≠ 0. Then take

  B = [ c_1  0  0  ...  0 ]
      [ c_2  1  0  ...  0 ]
      [ c_3  0  1  ...  0 ]
      [  ⋮   ⋮  ⋮   ⋱   ⋮ ]
      [ c_n  0  0  ...  1 ] .

It is not hard to see from the definition that det(B) = c_1 ≠ 0. Then
det(AB) = det(A) det(B) = c_1 det(A). Note that the first column of AB is Σ_j c_j a_j = 0,
and hence det(AB) = 0. Thus det(A) = 0.
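A quick numeric sanity check of the product rule det(AB) = det(A) det(B), using the 2×2 formula from part (vii) (the matrices are our own examples):

```python
# 2x2 determinant ad - bc and 2x2 matrix product, checked against
# det(AB) = det(A) det(B).
def det2(M):
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

def mul2(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

A = [[2, 1], [5, 3]]     # det = 6 - 5 = 1
B = [[4, 7], [1, 2]]     # det = 8 - 7 = 1
print(det2(mul2(A, B)), det2(A) * det2(B))   # equal
```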
Proposition: The determinant is independent of the basis. In other words, if B is invertible,
then

  det(A) = det(B^{-1} A B).

The proof is immediate: det(B^{-1} A B) = det(B^{-1}) det(A) det(B) = det(A). If in one basis
A is the matrix representing a linear operator, then for another basis we can find an
invertible matrix B such that B^{-1} A B takes us to the first basis, applies A in the first
basis, and takes us back to the basis we started with. Therefore, the determinant can be
defined as a function on the space L(R^n), not just on matrices: no matter what basis we
choose, the function is the same. It follows from the two propositions that

  det : L(R^n) → R

is a well-defined and continuous function.
We can now test whether a matrix is invertible.
Definition: Let U ⊂ R^n be open and f : U → R^n a differentiable mapping. Define the
Jacobian of f at x as

  J_f(x) = det( f′(x) ).

Sometimes this is written as

  ∂(f^1, ..., f^n) / ∂(x^1, ..., x^n).

To the uninitiated this can be a somewhat confusing notation, but it is useful when you need
to specify the exact variables and function components used. When f is C^1, then J_f(x) is
a continuous function.
The Jacobian is a real-valued function, and when n = 1 it is simply the derivative. Also note
that the chain rule implies

  J_{f∘g}(x) = J_f( g(x) ) J_g(x).

We can restate the inverse function theorem using the Jacobian: f : U → R^n is locally
invertible near x if J_f(x) ≠ 0.
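The Jacobian chain rule can be verified at a point by hand (the two maps below are our own small examples):

```python
# Chain rule for Jacobians, J_{f o g} = J_f(g) * J_g, checked at one point.
# Maps (ours): g(u, v) = (u + v, u*v),  f(x, y) = (x^2, x + y).
def det2(M):
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

u, v = 2.0, 1.0
x, y = u + v, u * v                       # the point g(u, v)

Jg = det2([[1.0, 1.0], [v, u]])           # det g'(u, v) = u - v
Jf = det2([[2 * x, 0.0], [1.0, 1.0]])     # det f'(g(u, v)) = 2x

# (f o g)(u, v) = ((u+v)^2, u + v + u*v); its derivative matrix by hand:
Jfg = det2([[2 * (u + v), 2 * (u + v)], [1 + v, 1 + u]])
print(Jfg, Jf * Jg)                       # both 6.0
```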
For the implicit function theorem, the condition is normally stated as

  ∂(f^1, ..., f^m) / ∂(y^1, ..., y^m) (x_0, y_0) ≠ 0.
It can be computed directly that the determinant tells us what happens to area and volume.
Suppose we are in R^2. If A is a linear transformation, it follows by direct computation that
the image of the unit square A([0, 1]^2) has area |det(A)|. The sign of the determinant
determines "orientation": if the determinant is negative, then the two sides of the unit
square are flipped in the image. We claim without proof that this holds for arbitrary
figures, not just the square.
Similarly, the Jacobian measures how much a differentiable mapping stretches things locally,
and whether it flips orientation. We should see more of this geometry next semester.
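The area claim for the unit square can be checked directly: the image is the parallelogram spanned by the columns of A, and the shoelace formula recovers its area (the matrix is our example):

```python
# Image of the unit square under a linear map A is the parallelogram
# spanned by the columns of A; the shoelace formula gives area = |det A|.
A = [[2.0, 1.0], [1.0, 3.0]]        # det = 2*3 - 1*1 = 5

corners = [(0, 0), (1, 0), (1, 1), (0, 1)]
image = [(A[0][0] * x + A[0][1] * y, A[1][0] * x + A[1][1] * y)
         for (x, y) in corners]

def shoelace(pts):
    # Standard polygon-area formula: half the absolute cross-product sum.
    n = len(pts)
    s = sum(pts[i][0] * pts[(i + 1) % n][1] - pts[(i + 1) % n][0] * pts[i][1]
            for i in range(n))
    return abs(s) / 2.0

print(shoelace(image))   # 5.0 = |det A|
```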
2
Example: Let Y = {(1, 1), (0, 1)} ⊂ R2 . Then span(Y ) = R2 , as any point (x, y) ∈ R2 can be written as a linear combination (x, y) = x(1, 1) + (y − x)(0, 1). Since a sum of two linear combinations is again a linear combination, and a scalar multiple of a linear combination is a linear combination, we see that: Proposition: A span of a set Y ⊂ Rn is a vector space. If Y is already a vector space then span(Y ) = Y . Deﬁnition: A set of vectors {x1 , x2 , . . . , xk } is said to be linearly independent, if the only solution to a1 x 1 + · · · + ak x k = 0 is the trivial solution a1 = · · · = ak = 0. Here 0 is the vector of all zeros. A set that is not linearly independent, is linearly dependent. Note that if a set is linearly dependent, then this means that one of the vectors is a linear combination of the others. A linearly independent set B of vectors that span a vector space X is called a basis of X. If a vector space X contains a linearly independent set of d vectors, but no linearly independent set of d + 1 vectors then we say the dimension or dim X = d. If for all d ∈ N the vector space X contains a set of d linearly independent vectors, we say X is inﬁnite dimensional and write dim X = ∞. Note that we have dim{0} = 0. So far we have not shown that any other vector space has a ﬁnite dimension. We will see in a moment that any vector space that is a subset of Rn has a ﬁnite dimension, and that dimension is less than or equal to n. Proposition: If B = {x1 , . . . , xk } is a basis of a vector space X, then every point y ∈ X has a unique representation of the form
k
y=
j=1
α j xj
for some numbers α , . . . , α . Proof. First, every y ∈ X is a linear combination of elements of B since X is the span of B. For uniqueness suppose
k k
1
k
y=
j=1
α xj =
j=1 k
j
β j xj
then (αj − β j )xj = 0
j=1
so by linear independence of the basis α = β j for all j. For Rn we deﬁne e1 = (1, 0, 0, . . . , 0), e2 = (0, 1, 0, . . . , 0), ..., en = (0, 0, 0, . . . , 1), and call this the standard basis of Rn . We use the same letters ej for any Rn , and what space Rn we are working in is understood from context. A direct computation shows that {e1 , e2 , . . . , en } is really a basis of Rn ; it is easy to show that it spans Rn and is linearly independent. In fact,
n
j
x = (x1 , . . . , xn ) =
j=1
xj e j .
Proceed to solve for k α2 α1 1 x . dim Rn = n. . . If dim X = d. . After d steps we obtain that {y1 . . . yd }. x2 = 2 y 2 − 2 y 1 − 2 2 k α2 α2 α2 k=3 d In particular {y1 . So m = d. . so dimension must be d. Proof. For (v) suppose that T is a set of m vectors that is linearly dependent and spans X. . . . yd } spans X. k As T is linearly independent. . . We follow the procedure above in the proof of (ii) to keep adding vectors while keeping the set linearly independent. and it must span X. . (v) If dim X = d and a set T of d vectors spans X. As the dimension is d we can add a vector exactly d − m times. We wish to show that m ≤ d. For (iii) notice that {e1 . (vi) If dim X = d and a set T of m vectors is linearly independent. Write d y1 = k=1 k α 1 xk . suppose that Y is a vector space and Y ⊂ X. y2 . . 1 k α1 α1 k=2 d In particular {y1 . xd } span X. then there must exist some linearly independent set of d vectors T . ym } is a set of linearly independent vectors of X. On the other hand if we have a basis of d vectors. .2 and 9. Suppose that S = {x1 . One of the α1 is nonzero (otherwise y1 would be zero). k which we can do as S spans X. xd } spans X. . Next. First notice that if we have a set T of k linearly independent vectors that do not span X. . So any other vector v in X is a linear combination of {y1 . x3 . xm } is a linearly independent set. As X cannot contain d + 1 linearly independent vectors. Now suppose that T = {y1 . For (vi) suppose that T = {x1 . . then dim Y ≤ d. . Then we can solve k 1 α1 x1 = 1 y 1 − x . . (ii) dim X = d if and only if X has a basis of d vectors (and so every basis has d vectors). x2 . then there is a set S of d − m vectors such that T ∪ S is a basis of X. So suppose that m ≥ d. We continue this procedure. . So we have a basis of d vectors. y2 . (i) If X is spanned by d vectors. . Therefore if we remove it from T we obtain a set of m − 1 vectors that still span X and hence dim X ≤ m − 1. xd } span X. Either m < d and we are done. 
The astute reader will think back to linear algebra and notice that we are rowreducing a matrix. so suppose without 1 loss of generality that this is α1 . . . By (i) we know there is no set of d + 1 linearly independent vectors. . The set T ∪ {v} is linearly independent (exercise). since x1 can be obtained from {y1 . en } is a basis of Rn . Then one of the vectors is a linear combination of the others. d y2 = 1 α2 y 1 + k=2 k α 2 xk . xd }. (iii) In particular. (iv) If Y ⊂ X is a vector space and dim X = d. . . Without loss of 2 generality suppose that this is α2 . . then we can always choose a vector v ∈ X \ span(T ). and hence cannot be in T as T is linearly independent. it is linearly independent and spans X. then T is linearly independent. Let us look at (ii). . x2 . where dim X = d. then dim X ≤ d. . . . neither can Y . To see (iv). we must have that one of the α2 for k ≥ 2 must be nonzero.3 in Rudin): Suppose that X is a vector space. Let us start with (i). . .3 Proposition (Theorems 9. . otherwise we could choose a larger set of linearly independent vectors. . y2 . . .
so let’s stick with ﬁnitely many dimensions. then A−1 (y1 + y2 ) = A−1 (Ax1 + Ax2 ) = A−1 (A(x1 + x2 )) = x1 + x2 = A−1 (y1 ) + A−1 (y2 ). and noting that this is well deﬁned The “furthermore” follows by deﬁning the extension Ax = by uniqueness of the representation of x. . If A : X → X is linear then we say A is a linear operator on X. Similarly let y1 . then there is an x such that y = Ax. Note that it is obvious that A0 = 0. that is the linear operator such that Ix = x for all x. If A is onetoone an onto then we say A is invertible and we deﬁne A−1 as the inverse. Theorem 9. . Y ). As A is onto. Let a ∈ R and y ∈ Y . For inﬁnite dimensional spaces. b ∈ R and A. Proposition: If A : X → Y is linear then it is completely determined by its values on a basis of X. then A is onetoone if and only if it is onto. So A−1 (ay) = A−1 (aAx) = A−1 (A(ax)) = ax = aA−1 (y). and further as it is also onetoone A−1 (Az) = z for all z ∈ X. . and x1 . then deﬁne the transformation AB as ABx = A(Bx). n j j=1 b yj . It is trivial to see that AB ∈ L(X. Finally denote by I ∈ L(X) the identity. x2 ∈ X such that Ax1 = y1 and Ax2 = y2 . . Y ) for the set of all linear transformations from X to Y . ˜ Furthermore. xn } be a basis and suppose that A(xj ) = yj . the proof is essentially the same. . . and just L(X) for the set of linear operators on X. then any function A : B → Y extends to a linear function on X. if B is a basis. Y ) then deﬁne the transformation aA + bB (aA + bB)(x) = aAx + bBx. b . Proof. If A ∈ L(Y. If a. y ∈ X we have A(ax) = aA(x) A(x + y) = A(x) + A(y). We will write L(X. . Then every x ∈ X has a unique representation n x= j=1 bj x j for some numbers b .4 Deﬁnition: A mapping A : X → Y of vector spaces X and Y is said to be linear (or a linear transformation) if for every a ∈ R and x. Proposition: If A : X → Y is invertible.5: If X is a ﬁnite dimensional vector space and A : X → X is linear. y2 ∈ Y . We will usually just write Ax instead of A(x) if A is linear. Z). 
B ∈ L(X. It is not hard to see that aA + bB is linear. then A−1 is linear. Then by linearity n n n 1 n Ax = A j=1 b xj = j=1 j b Axj = j=1 j bj yj . . Z) and B ∈ L(X. . Proof. but a little trickier to write. Let {x1 .
. 1] and as the norm use the uniform norm. . . Now suppose that A is onto. Axn } span X. Then for example the functions sin(nx) have norm 1. {Ax1 . Let us start measuring distance. Suppose that n n A j=1 cj x j = j=1 cj Axj = 0. . On any vector space X. . By linearity we get A = sup{ Ax : x ∈ X This implies that Ax ≤ A x . . The number A is called the operator norm. Then d(x. and hence cj = 0 for all j. By the same proposition as {Ax1 . For a simple example.5 Proof. once we have a norm. Axn } is linearly independent. . . x . Hence. . but with x = 1} = sup x∈X x=0 Ax . This means that A is onetoone. (iii) x + y ≤ x + y for all x. y) = x − y is the standard distance function on Rn that we used when we talked about metric spaces. the only vector that is taken to 0 is 0 itself. Therefore. . then A(x − y) = 0 and so x = y. If X is a vector space. Let A ∈ L(X. Axn }. so A is onto. that is. . . n 0= j=1 j cj x j and so c = 0 for all j. . As any point x ∈ X can be written as n n x= j=1 a Axj = A j=1 j aj x j . . Now suppose n n cj Axj = A j=1 j=1 cj xj = 0. xn } be a basis for X. . . . As A is onetoone. Deﬁne A = sup{ Ax : x ∈ X with x = 1}. Deﬁne x = (x1 )2 + (x2 )2 + · · · + (xn )2 be the Euclidean norm. It is not hard to see from the deﬁnition that A = 0 if and only if A = 0. . (ii) cx = c x for all c ∈ R and x ∈ X. . then we say a real valued function · is a norm if: (i) x ≥ 0. . For inﬁnite dimensional spaces this need not be true. If Ax = Ay. By an above proposition and the fact that the dimension is n. with x = 0 if and only if x = 0. Y ). we have that {Ax1 . We will see below that indeed it is a norm (at least for ﬁnite dimensional spaces). the set is independent. . y) = x − y that makes X a metric space. we can deﬁne a distance d(x. Let x = (x1 . As A is determined by the action on the basis we see that every element of X has to be in the span of {Ax1 . In fact we proved last semester that the Euclidean norm is a norm. Let {x1 . Axn } span X. . . xn ) ∈ Rn . 
take the vector space of continuously diﬀerentiable functions on [0. if A takes every vector to the zero vector. Suppose that A is onetoone. For ﬁnite dimensional spaces A is always ﬁnite as we will prove below. . y ∈ X.
However. and 1 A−B < . Next note c Ax = cAx ≤ cA x . Rm ) and B ∈ L(Rm . So we really have a topology on any L(X. So diﬀerentiation (which is a linear operator) has unbounded norm on this space. after picking bases. But let us stick to ﬁnite dimensional spaces now. The right hand side does not depend on x and so we are done. n m cA = c A . we have deﬁned a metric space topology on L(Rn . R ) is a metric space with distance A − B . then it is easy to see that c  ≤ 1 for all j. let x ∈ Rn . Y ). Write n x= j=1 cj e j . then A < ∞ and A is uniformly continuous (Lipschitz with constant A ). Proof. n Then Ax = n cj Aej ≤ j=1 j=1 cj Aej . If x = 1. So cA ≤ c A . Proposition (Theorem 9. Rm ). Rk ). We know that A is deﬁned by its action on a basis. A(x − y) ≤ A x−y as we mentioned above. (ii) If A. B ∈ L(Rn . Similarly (cA)x = c Ax ≤ (c A ) x . So if A < ∞. A−1 then B is invertible. Theorem 9. then this says that A is Lipschitz with constant A . we have found a ﬁnite upper bound. and is left to student. Note that we have deﬁned a norm only on Rn and not on an arbitrary ﬁnite dimensional vector space. (i) If A ∈ U and B ∈ L(Rn ). let us note that (A + B)x = Ax + Bx ≤ Ax + Bx ≤ A So A + B ≤ A + B . we can deﬁne a norm on any vector space in the same way. Rm ) and c ∈ R. In particular L(R . x + B x =( A + B ) x . As a norm deﬁnes a metric. then BA ≤ B A . Rm ) so we can talk about open/closed sets. so n n j Ax ≤ j=1 c j Aej ≤ j=1 Aej .6 the derivatives have norm n. and convergence.8: Let U ⊂ L(Rn ) be the set of invertible linear operators. For (iii) write BAx ≤ B Ax ≤ B A x . although the precise metric would depend on the basis picked. (iii) If A ∈ L(Rn .7 in Rudin): (i) If A ∈ L(Rn . Next. (1) . continuity. then A+B ≤ A + B . So c A ≤ cA . For (i). For (ii). That we have a metric space follows pretty easily.
The theorem says that U is an open set and A ↦ A^(−1) is continuous on U; that is, in the metric space topology induced by the operator norm. You should always think back to R^1, where linear operators are just numbers a. The operator a is invertible (a^(−1) = 1/a) whenever a ≠ 0, and of course a ↦ 1/a is continuous. When n > 1, then there are other noninvertible operators, and in general things are a bit more difficult.

Proof of Theorem 9.8. Let us prove (i). First, a straightforward computation: for any x ∈ R^n,

‖x‖ = ‖A^(−1)Ax‖ ≤ ‖A^(−1)‖ ‖Ax‖ ≤ ‖A^(−1)‖ ( ‖(A − B)x‖ + ‖Bx‖ ) ≤ ‖A^(−1)‖ ‖A − B‖ ‖x‖ + ‖A^(−1)‖ ‖Bx‖.

Now suppose that ‖A − B‖ ‖A^(−1)‖ < 1/2. Then ‖x‖ < (1/2)‖x‖ + ‖A^(−1)‖ ‖Bx‖, or in other words,

‖x‖ ≤ 2 ‖A^(−1)‖ ‖Bx‖.    (1)

Now assume that x ≠ 0 and so ‖x‖ ≠ 0. Using (1) we obtain ‖Bx‖ > 0, or in other words Bx ≠ 0 for all nonzero x. This is enough to see that B is one-to-one (if Bx = By, then B(x − y) = 0, so x = y). As B is a one-to-one linear operator from R^n to R^n, it is onto and hence invertible, and (i) is proved.

Let us look at (ii). Let B be invertible and near A, that is, ‖A − B‖ ‖A^(−1)‖ < 1/2, so that (1) is satisfied. Given y ∈ R^n, we have shown above (using B^(−1)y in place of x)

‖B^(−1)y‖ ≤ 2 ‖A^(−1)‖ ‖B B^(−1) y‖ = 2 ‖A^(−1)‖ ‖y‖,

so ‖B^(−1)‖ ≤ 2 ‖A^(−1)‖. Now note that

A^(−1)(A − B)B^(−1) = A^(−1)(AB^(−1) − I) = B^(−1) − A^(−1),

and hence

‖B^(−1) − A^(−1)‖ = ‖A^(−1)(A − B)B^(−1)‖ ≤ ‖A^(−1)‖ ‖A − B‖ ‖B^(−1)‖ ≤ 2 ‖A^(−1)‖² ‖A − B‖.

So ‖B^(−1) − A^(−1)‖ goes to zero as B goes to A. This shows at once that every B close enough to A is invertible, so U is open, and that inversion is continuous on U.

Finally, let's get to matrices, which are a convenient way to represent finite-dimensional operators. If we have bases {x_1, …, x_n} and {y_1, …, y_m} for vector spaces X and Y, then we know that a linear operator A ∈ L(X, Y) is determined by its values on the basis. Given A, define the numbers {a^i_j} as follows:

A x_j = Σ_(i=1..m) a^i_j y_i,

and write them as a matrix

A =
[ a^1_1  a^1_2  ⋯  a^1_n ]
[ a^2_1  a^2_2  ⋯  a^2_n ]
[  ⋮      ⋮         ⋮    ]
[ a^m_1  a^m_2  ⋯  a^m_n ].

Note that the columns of the matrix are precisely the coefficients that represent A x_j. We can represent x_j as a column vector of n numbers (an n × 1 matrix) with 1 in the jth position and zeros elsewhere; then the matrix product A x_j picks out exactly the jth column.
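The observation that the jth column of the matrix is the coefficient vector of A x_j gives a mechanical recipe for writing down the matrix of an operator: apply A to each standard basis vector. A small sketch (the map `A` below is a hypothetical example, and NumPy is assumed):

```python
import numpy as np

# A hypothetical linear map A: R^3 -> R^2, given only by its action on vectors.
def A(v):
    x, y, z = v
    return np.array([2*x - y, x + 3*z])

# Column j of the matrix is A applied to the j-th standard basis vector e_j.
M = np.column_stack([A(e) for e in np.eye(3)])

# The matrix now reproduces the action of A on any vector.
v = np.array([1.0, -2.0, 0.5])
assert np.allclose(M @ v, A(v))
```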
When x = Σ_(j=1..n) γ^j x_j, then

A x = Σ_(j=1..n) Σ_(i=1..m) γ^j a^i_j y_i.

There is a one-to-one correspondence between matrices and linear operators in L(X, Y), once we fix a basis in X and in Y. If we would choose a different basis, we would get different matrices. The operator A acts on elements of X; the matrix is something that works with n-tuples of numbers.

Now suppose that all the bases are just the standard bases and X = R^n, Y = R^m. If we recall the Schwarz inequality, we note that

‖Ax‖² = Σ_(i=1..m) ( Σ_(j=1..n) γ^j a^i_j )² ≤ Σ_(i=1..m) ( Σ_(j=1..n) (γ^j)² ) ( Σ_(j=1..n) (a^i_j)² ) = ( Σ_(i=1..m) Σ_(j=1..n) (a^i_j)² ) ‖x‖².

In other words,

‖A‖ ≤ sqrt( Σ_(i=1..m) Σ_(j=1..n) (a^i_j)² ).

Hence if the entries go to zero, then ‖A‖ goes to zero. In particular, if the entries of A − B go to zero, then B goes to A in operator norm. This is important. Conversely, since ‖(A − B)e_j‖ ≤ ‖A − B‖, convergence in operator norm forces the entries to converge as well.

Note that if B is an r-by-m matrix with entries b^i_j and A is an m-by-n matrix with entries a^j_k, then the matrix for BA has i,k-th entry

c^i_k = Σ_(j=1..m) b^i_j a^j_k,

which gives rise to the familiar rule for matrix multiplication. Note how the upper and lower indices line up.

Proposition: If f : S → R^(nm) is a continuous function for a metric space S, then, taking the components of f as the entries of a matrix, f is a continuous mapping from S to L(R^n, R^m). Conversely, if f : S → L(R^n, R^m) is a continuous function, then the entries of the matrix are continuous functions.

Proof. The first part follows from the estimate on ‖A‖ above. The proof of the second part is rather easy: take f(x)e_j and note that it is a continuous function to R^m with the standard Euclidean norm (note ‖(A − B)e_j‖ ≤ ‖A − B‖). Recall from last semester that such a function is continuous if and only if its components are continuous, and these are the components of the jth column of the matrix of f(x).

The derivative

Recall that when we had a function f : (a, b) → R, we defined the derivative as

f'(x) = lim_(h→0) ( f(x + h) − f(x) ) / h.

In other words, there was a number a such that

lim_(h→0) | ( f(x + h) − f(x) ) / h − a | = lim_(h→0) | f(x + h) − f(x) − ah | / |h| = 0.

Multiplying by a is a linear map in one dimension; that is, we can think of a ∈ L(R^1, R^1). So we can use this formulation to extend differentiation to more variables.

Definition: Let U ⊂ R^n be an open subset and f : U → R^m. We say f is differentiable at x ∈ U if there exists an A ∈ L(R^n, R^m) such that

lim_(h→0, h ∈ R^n) ‖f(x + h) − f(x) − Ah‖ / ‖h‖ = 0.

We then write Df(x) = A, or f'(x) = A, and we say A is the derivative of f at x. When f is differentiable at all x ∈ U, we say simply that f is differentiable. Normally it is just understood that h ∈ R^n, and so we won't explicitly say so from now on.
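To make the definition concrete, one can watch the quotient ‖f(x + h) − f(x) − Ah‖/‖h‖ shrink numerically when A is the correct linear map. The function, point, and direction below are arbitrary choices for illustration only (NumPy assumed):

```python
import numpy as np

def f(v):                       # an arbitrary smooth map R^2 -> R^2
    x, y = v
    return np.array([x**2 * y, x + y**2])

p = np.array([1.0, 2.0])
A = np.array([[2*p[0]*p[1], p[0]**2],   # candidate for f'(p): the matrix of partials
              [1.0,         2*p[1]]])

u = np.array([0.6, -0.8])       # a fixed direction with ||u|| = 1
for t in [1e-1, 1e-2, 1e-3, 1e-4]:
    h = t * u
    ratio = np.linalg.norm(f(p + h) - f(p) - A @ h) / np.linalg.norm(h)
    # the ratio shrinks roughly like ||h|| since f is smooth

assert ratio < 1e-3             # the defining limit is visibly tending to 0
```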
The norms in the definition must be in the right spaces, of course: the numerator is a norm in R^m and the denominator a norm in R^n. Note also that the derivative is then a function from U to L(R^n, R^m): to each point it assigns a linear map. We have again (as last semester) cheated somewhat and said that A is "the" derivative; we have of course not shown yet that there is only one.

Proposition: Let U ⊂ R^n be an open subset and f : U → R^m. Suppose that x ∈ U and there exist A, B ∈ L(R^n, R^m) such that

lim_(h→0) ‖f(x + h) − f(x) − Ah‖ / ‖h‖ = 0   and   lim_(h→0) ‖f(x + h) − f(x) − Bh‖ / ‖h‖ = 0.

Then A = B.

Proof. We compute

‖(A − B)h‖ / ‖h‖ = ‖f(x + h) − f(x) − Bh − ( f(x + h) − f(x) − Ah )‖ / ‖h‖
≤ ‖f(x + h) − f(x) − Ah‖ / ‖h‖ + ‖f(x + h) − f(x) − Bh‖ / ‖h‖.

So ‖(A − B)h‖ / ‖h‖ goes to 0 as h → 0. That is, given ε > 0, we have ε > ‖(A − B)h‖ / ‖h‖ for all h in some δ-ball around the origin. For any x with ‖x‖ = 1, let h = (δ/2) x; then ‖h‖ < δ and h/‖h‖ = x, and so ‖(A − B)x‖ < ε. Hence ‖A − B‖ ≤ ε for every ε > 0, so A = B.

Example: If f(x) = Ax for a linear mapping A, then f'(x) = A. This is easily seen:

‖f(x + h) − f(x) − Ah‖ / ‖h‖ = ‖A(x + h) − Ax − Ah‖ / ‖h‖ = 0 / ‖h‖ = 0.

Another way to write the differentiability is to write

r(h) = f(x0 + h) − f(x0) − f'(x0)h.

Then r(h)/‖h‖ must go to zero as h → 0, and hence r(h) itself must go to zero.

Proposition: Let U ⊂ R^n be open and f : U → R^m be differentiable at x0. Then f is continuous at x0.

Proof. Write A = f'(x0). Then

‖f(x0 + h) − f(x0)‖ ≤ ‖f(x0 + h) − f(x0) − Ah‖ + ‖Ah‖ ≤ ( ‖f(x0 + h) − f(x0) − Ah‖ / ‖h‖ ) ‖h‖ + ‖A‖ ‖h‖.

As h ↦ f'(x0)h is continuous (it is linear), ‖Ah‖ goes to zero as h → 0, and the first term goes to zero by differentiability. That means f(x0 + h) must go to f(x0), that is, f is continuous at x0.

Theorem 9.15 (Chain rule): Let U ⊂ R^n be open and let f : U → R^m be differentiable at x0 ∈ U. Let V ⊂ R^m be open, f(U) ⊂ V, and let g : V → R^ℓ be differentiable at f(x0). Then F(x) = g( f(x) ) is differentiable at x0 and

F'(x0) = g'( f(x0) ) f'(x0).

Without the points this is sometimes written as F' = (g' ∘ f) f'. The way to understand it is that the derivative of the composition g( f(x) ) is the composition of the derivatives of g and f: if A = f'(x0) and B = g'( f(x0) ), then F'(x0) = BA.
Proof. Let A = f'(x0) and B = g'( f(x0) ). Let h vary in R^n and write y0 = f(x0), k = f(x0 + h) − f(x0), and

r(h) = f(x0 + h) − f(x0) − Ah = k − Ah.

Then

‖F(x0 + h) − F(x0) − BAh‖ / ‖h‖ = ‖g( f(x0 + h) ) − g( f(x0) ) − BAh‖ / ‖h‖
= ‖g(y0 + k) − g(y0) − B(k − r(h))‖ / ‖h‖
≤ ‖g(y0 + k) − g(y0) − Bk‖ / ‖h‖ + ‖B‖ ‖r(h)‖ / ‖h‖
= ( ‖g(y0 + k) − g(y0) − Bk‖ / ‖k‖ ) ( ‖f(x0 + h) − f(x0)‖ / ‖h‖ ) + ‖B‖ ‖r(h)‖ / ‖h‖.

First, ‖B‖ is constant and f is differentiable at x0, so the term ‖B‖ ‖r(h)‖/‖h‖ goes to 0. Next, as f is differentiable at x0, the ratio ‖f(x0 + h) − f(x0)‖/‖h‖ stays bounded as h goes to 0. As f is continuous at x0, k goes to 0 as h goes to 0, so the term ‖g(y0 + k) − g(y0) − Bk‖/‖k‖ goes to 0 because g is differentiable at y0. Therefore ‖F(x0 + h) − F(x0) − BAh‖/‖h‖ goes to 0, and hence F'(x0) = BA, which is what was claimed.

There is another way to generalize the derivative from one dimension: we can simply hold all but one variable constant and take the regular derivative.

Definition: Let f : U → R be a function on an open set U ⊂ R^n. If the following limit exists, we write

∂f/∂x^j (x) = lim_(h→0) ( f(x^1, …, x^(j−1), x^j + h, x^(j+1), …, x^n) − f(x) ) / h = lim_(h→0) ( f(x + h e_j) − f(x) ) / h.

We call ∂f/∂x^j (x) the partial derivative of f with respect to x^j. Sometimes we write D_j f instead. When f : U → R^m is a function, then we can write f = (f^1, f^2, …, f^m), where the f^k are real-valued functions, and then define ∂f^k/∂x^j (or write it D_j f^k).

Partial derivatives are easier to compute with all the machinery of calculus, and they provide a way to compute the total derivative of a function.

Theorem 9.17: Let U ⊂ R^n be open and let f : U → R^m be differentiable at x0 ∈ U. Then all the partial derivatives at x0 exist, and in terms of the standard bases of R^n and R^m, f'(x0) is represented by the matrix

[ ∂f^1/∂x^1 (x0)  ∂f^1/∂x^2 (x0)  ⋯  ∂f^1/∂x^n (x0) ]
[ ∂f^2/∂x^1 (x0)  ∂f^2/∂x^2 (x0)  ⋯  ∂f^2/∂x^n (x0) ]
[       ⋮               ⋮                  ⋮         ]
[ ∂f^m/∂x^1 (x0)  ∂f^m/∂x^2 (x0)  ⋯  ∂f^m/∂x^n (x0) ].

In other words,

f'(x0) e_j = Σ_(k=1..m) ∂f^k/∂x^j (x0) e_k.

If h = Σ_(j=1..n) c^j e_j, then

f'(x0) h = Σ_(j=1..n) Σ_(k=1..m) c^j ∂f^k/∂x^j (x0) e_k.

Again note the up-down pattern in the indices being summed over. That is on purpose.
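Once f'(x0) is identified with the matrix of partial derivatives, the chain rule F'(x0) = g'(f(x0)) f'(x0) becomes a statement about matrix products, which can be sanity-checked with finite differences. The maps f and g below are arbitrary smooth examples chosen for illustration (NumPy assumed):

```python
import numpy as np

def f(v):                        # f: R^2 -> R^2
    x, y = v
    return np.array([x * y, x + y])

def Jf(v):                       # matrix of partial derivatives of f
    x, y = v
    return np.array([[y, x], [1.0, 1.0]])

def g(w):                        # g: R^2 -> R^1
    u1, u2 = w
    return np.array([np.sin(u1) + u2**2])

def Jg(w):                       # matrix of partial derivatives of g
    u1, u2 = w
    return np.array([[np.cos(u1), 2 * u2]])

x0 = np.array([0.7, -0.3])
chain = Jg(f(x0)) @ Jf(x0)       # B A, the chain-rule prediction

# central finite differences for the composition F = g ∘ f
eps = 1e-6
num = np.zeros((1, 2))
for j in range(2):
    e = np.zeros(2); e[j] = eps
    num[:, j] = (g(f(x0 + e)) - g(f(x0 - e))) / (2 * eps)

assert np.allclose(chain, num, atol=1e-6)
```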
Proof. Fix a j and note that

‖ ( f(x0 + h e_j) − f(x0) ) / h − f'(x0) e_j ‖ = ‖ f(x0 + h e_j) − f(x0) − f'(x0) h e_j ‖ / |h| = ‖ f(x0 + h e_j) − f(x0) − f'(x0) h e_j ‖ / ‖h e_j‖.

As h goes to 0, the right-hand side goes to zero by differentiability of f, and hence

lim_(h→0) ( f(x0 + h e_j) − f(x0) ) / h = f'(x0) e_j.

Note that f is vector-valued. So represent f by components f = (f^1, f^2, …, f^m), and note that taking a limit in R^m is the same as taking the limit in each component separately. Therefore, for any k the partial derivative

∂f^k/∂x^j (x0) = lim_(h→0) ( f^k(x0 + h e_j) − f^k(x0) ) / h

exists and is equal to the kth component of f'(x0) e_j, and we are done.

One of the consequences of the theorem is that if f is differentiable on U, then f' : U → L(R^n, R^m) is a continuous function if and only if all the ∂f^k/∂x^j are continuous functions.

When U ⊂ R^n is open and f : U → R is a differentiable function, we call the following vector the gradient:

∇f(x) = Σ_(j=1..n) ∂f/∂x^j (x) e_j.

Note that the upper-lower indices don't really match up here. As a preview of Math 621, we note that we write

df = Σ_(j=1..n) ∂f/∂x^j dx^j,

where dx^j is really the standard basis (though we're thinking of dx^j as an element of L(R^n, R), which is really equivalent to R^n). But we digress.

Suppose γ : (a, b) ⊂ R → R^n is a differentiable function and γ( (a, b) ) ⊂ U. Write γ = (γ^1, …, γ^n). Then we can define g(t) = f( γ(t) ). The function g is then differentiable, and the derivative is

g'(t) = Σ_(j=1..n) ∂f/∂x^j ( γ(t) ) dγ^j/dt (t) = Σ_(j=1..n) ∂f/∂x^j dγ^j/dt,

where we sometimes, for convenience, leave out the points at which we are evaluating. We notice that

g'(t) = (∇f)( γ(t) ) · γ'(t) = ∇f · γ',

where the dot represents the standard scalar dot product.

We use this idea to define derivatives in a specific direction. A direction is simply a vector pointing in that direction. So pick a vector u ∈ R^n such that ‖u‖ = 1, and fix x ∈ U. Then define γ(t) = x + tu. It is easy to compute that γ'(t) = u for all t. By the chain rule,

d/dt |_(t=0) [ f(x + tu) ] = (∇f)(x) · u,

where the notation d/dt |_(t=0) represents the derivative evaluated at t = 0. We can also just compute directly:

d/dt |_(t=0) [ f(x + tu) ] = lim_(h→0) ( f(x + hu) − f(x) ) / h.
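The formula g'(t) = (∇f)(γ(t)) · γ'(t) can likewise be checked against a finite-difference quotient. The function f and curve γ below are arbitrary smooth choices for illustration (NumPy assumed):

```python
import numpy as np

f      = lambda x, y: x**2 + 3*x*y                    # an arbitrary f: R^2 -> R
gradf  = lambda x, y: np.array([2*x + 3*y, 3*x])      # its gradient
gamma  = lambda t: np.array([np.cos(t), np.sin(t)])   # a curve in R^2
dgamma = lambda t: np.array([-np.sin(t), np.cos(t)])  # its derivative

t0 = 0.4
predicted = gradf(*gamma(t0)) @ dgamma(t0)            # (∇f)(γ(t)) · γ'(t)

eps = 1e-6                                            # finite-difference check
numeric = (f(*gamma(t0 + eps)) - f(*gamma(t0 - eps))) / (2 * eps)
assert abs(predicted - numeric) < 1e-6
```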
We obtain what is usually called the directional derivative, sometimes denoted by

D_u f(x) = d/dt |_(t=0) [ f(x + tu) ],

which can be computed by one of the methods above. Let us suppose that (∇f)(x) ≠ 0. By the Schwarz inequality we have

|D_u f(x)| ≤ ‖(∇f)(x)‖,

and furthermore equality is achieved when u is a scalar multiple of (∇f)(x). When

u = (∇f)(x) / ‖(∇f)(x)‖,

we get D_u f(x) = ‖(∇f)(x)‖. So the gradient points in the direction in which the function grows fastest, that is, the direction in which D_u f is maximal.

Let us prove a "mean value theorem" for vector-valued functions.

Theorem (cf. Theorem 5.19): If ϕ : [a, b] → R^n is differentiable on (a, b) and continuous on [a, b], then there exists a t such that

‖ϕ(b) − ϕ(a)‖ ≤ (b − a) ‖ϕ'(t)‖.

Proof. By the mean value theorem applied to the function t ↦ ( ϕ(b) − ϕ(a) ) · ϕ(t) (the dot is the scalar dot product), where we treat ϕ as simply a column vector of numbers by abuse of notation, we obtain a t such that

‖ϕ(b) − ϕ(a)‖² = ( ϕ(b) − ϕ(a) ) · ϕ(b) − ( ϕ(b) − ϕ(a) ) · ϕ(a) = (b − a) ( ϕ(b) − ϕ(a) ) · ϕ'(t).

By the Schwarz inequality,

‖ϕ(b) − ϕ(a)‖² = (b − a) ( ϕ(b) − ϕ(a) ) · ϕ'(t) ≤ (b − a) ‖ϕ(b) − ϕ(a)‖ ‖ϕ'(t)‖,

and dividing by ‖ϕ(b) − ϕ(a)‖ (the statement is trivial if this is zero) finishes the proof. Here ϕ'(t) ∈ L(R, R^n), but it is not hard to see that ‖ϕ'(t)‖_(L(R,R^n)) = ‖ϕ'(t)‖_(R^n) (exercise).

A set U is convex if whenever x, y ∈ U, the line segment from x to y lies in U, that is, if the convex combination (1 − t)x + ty is in U for all t ∈ [0, 1]. The ball B(x, r) is always convex by the triangle inequality (exercise). Note that in R, every connected interval is convex. In R² (or higher dimensions) there are lots of nonconvex connected sets.

Theorem 9.19: Let U ⊂ R^n be a convex open set, f : U → R^m a differentiable function, and M such that ‖f'(x)‖ ≤ M for all x ∈ U. Then f is Lipschitz with constant M, that is,

‖f(x) − f(y)‖ ≤ M ‖x − y‖ for all x, y ∈ U.

Note that if U is not convex this is not true. To see this, take the set

U = { (x, y) : 0.9 < x² + y² < 1 } \ { (x, 0) : x < 0 },

and let f(x, y) be the angle that the line from the origin to (x, y) makes with the positive x-axis. You can even write a formula for f:

f(x, y) = 2 arctan( y / ( x + sqrt(x² + y²) ) ).

Think of a spiral staircase with room in the middle. The function is differentiable and its derivative is bounded on U, but points just above and just below the cut along the negative x-axis are arbitrarily close together while their angles differ by nearly 2π, so the conclusion of the theorem cannot hold.
Proof (of Theorem 9.19). Fix x and y in U and note that (1 − t)x + ty ∈ U for all t ∈ [0, 1] by convexity. Next,

d/dt [ f( (1 − t)x + ty ) ] = f'( (1 − t)x + ty ) (y − x).

By the mean value theorem above (applied to ϕ(t) = f( (1 − t)x + ty ) on [0, 1]), there is a t such that

‖f(x) − f(y)‖ ≤ ‖ f'( (1 − t)x + ty ) (y − x) ‖ ≤ ‖ f'( (1 − t)x + ty ) ‖ ‖y − x‖ ≤ M ‖y − x‖.

Let us solve the differential equation f' = 0.

Corollary: If U ⊂ R^n is connected and f : U → R^m is differentiable with f'(x) = 0 for all x ∈ U, then f is constant.

Proof. For any x ∈ U, there is a ball B(x, δ) ⊂ U. The ball B(x, δ) is convex. Since ‖f'(y)‖ ≤ 0 for all y ∈ B(x, δ), then by the theorem, ‖f(x) − f(y)‖ ≤ 0 ‖x − y‖ = 0, so f(x) = f(y) for all y ∈ B(x, δ). This means that f^(−1)(c) is open for any c ∈ R^m. Suppose that f^(−1)(c) is nonempty. The two sets

U' = f^(−1)(c),   U'' = f^(−1)(R^m \ {c}) = ⋃_(a ∈ R^m, a ≠ c) f^(−1)(a)

are open and disjoint, and further U = U' ∪ U''. So as U' is nonempty and U is connected, we have U'' = ∅. So f(x) = c for all x ∈ U.

Definition: We say f : U ⊂ R^n → R^m is continuously differentiable, or C¹(U), if f is differentiable and f' : U → L(R^n, R^m) is continuous.

Theorem 9.21: Let U ⊂ R^n be open and f : U → R^m. The function f is continuously differentiable if and only if all the partial derivatives exist and are continuous.

Note that without continuity of the partial derivatives the theorem does not hold. Just because partial derivatives exist does not mean that f is differentiable; in fact, f may not even be continuous. See the homework.

Proof. We have seen that if f is differentiable, then the partial derivatives exist. Furthermore, the partial derivatives are the entries of the matrix of f'(x). So if f' : U → L(R^n, R^m) is continuous, then the entries are continuous, and hence the partial derivatives are continuous.

To prove the opposite direction, suppose the partial derivatives exist and are continuous. Fix x ∈ U. If we can show that f'(x) exists, we are done, because the entries of the matrix f'(x) are then the partial derivatives, and if the entries are continuous functions, the matrix-valued function f' is continuous.
Let us do induction on dimension. First, the conclusion is true when n = 1. In this case the derivative is just the regular derivative (exercise: you should check that the fact that the function is vector-valued is not a problem).

Suppose that the conclusion is true for R^(n−1), that is, if we restrict to the first n − 1 variables. In the following, we will think of R^(n−1) as a subset of R^n, namely the set in R^n where x^n = 0. It is easy to see that the first n − 1 partial derivatives of f restricted to the set where the last coordinate is fixed are the same as those for f. Let A be the m × n matrix of all the partial derivatives at x, let A_1 be the m × (n−1) matrix of its first n − 1 columns, and let v be its last column,

v = ( ∂f^1/∂x^n (x), …, ∂f^m/∂x^n (x) ).

Let ε > 0 be given. By the induction hypothesis, let δ > 0 be such that for any k ∈ R^(n−1) with ‖k‖ < δ we have

‖f(x + k) − f(x) − A_1 k‖ < ε ‖k‖.

By continuity of the partial derivatives, suppose that δ is small enough so that

| ∂f^j/∂x^n (x + h) − ∂f^j/∂x^n (x) | < ε

for all j and all h with ‖h‖ < δ.

Let h = h_1 + t e_n be a vector in R^n, where h_1 ∈ R^(n−1), such that ‖h‖ < δ. Then ‖h_1‖ ≤ ‖h‖ < δ. Note that Ah = A_1 h_1 + t v. As all the partial derivatives exist, by the mean value theorem, for each j there is some θ_j ∈ [0, t] (or [t, 0] if t < 0) such that

f^j(x + h_1 + t e_n) − f^j(x + h_1) = t ∂f^j/∂x^n (x + h_1 + θ_j e_n).

Note that if ‖h‖ < δ, then ‖h_1 + θ_j e_n‖ ≤ ‖h‖ < δ. So

‖f(x + h) − f(x) − Ah‖ = ‖f(x + h_1 + t e_n) − f(x + h_1) − t v + f(x + h_1) − f(x) − A_1 h_1‖
≤ ‖f(x + h_1 + t e_n) − f(x + h_1) − t v‖ + ‖f(x + h_1) − f(x) − A_1 h_1‖.

To finish the estimate,

‖f(x + h) − f(x) − Ah‖ ≤ sqrt( Σ_(j=1..m) | t ∂f^j/∂x^n (x + h_1 + θ_j e_n) − t ∂f^j/∂x^n (x) |² ) + ε ‖h_1‖
≤ sqrt(m) ε |t| + ε ‖h_1‖ ≤ ( sqrt(m) + 1 ) ε ‖h‖.

That is, f'(x) exists and equals A, and the conclusion is true.

Contraction mapping principle

Let us review the contraction mapping principle.

Definition: Let (X, d) and (X', d') be metric spaces. A map f : X → X' is said to be a contraction (or a contractive map) if it is a k-Lipschitz map for some k < 1, i.e. if there exists a k < 1 such that

d'( f(x), f(y) ) ≤ k d(x, y) for all x, y ∈ X.

If f : X → X is a map, then x ∈ X is called a fixed point if f(x) = x.

Theorem 9.23 (Contraction mapping principle or Fixed point theorem): Let (X, d) be a nonempty complete metric space and f : X → X a contraction. Then f has a unique fixed point.
Proof. Pick any x0 ∈ X and define a sequence {x_n} by x_(n+1) := f(x_n). Then

d(x_(n+1), x_n) = d( f(x_n), f(x_(n−1)) ) ≤ k d(x_n, x_(n−1)) ≤ ⋯ ≤ k^n d(x_1, x_0).

Hence, for m ≥ n,

d(x_m, x_n) ≤ Σ_(i=n..m−1) d(x_(i+1), x_i) ≤ Σ_(i=n..m−1) k^i d(x_1, x_0)
= k^n d(x_1, x_0) Σ_(i=0..m−n−1) k^i ≤ k^n d(x_1, x_0) Σ_(i=0..∞) k^i = k^n d(x_1, x_0) 1/(1 − k).

In particular the sequence is Cauchy (why? because k^n → 0 as k < 1). Since X is complete, we let x := lim x_n and we claim that x is our unique fixed point.

Fixed point? Note that f is continuous because it is a contraction. Hence

f(x) = lim f(x_n) = lim x_(n+1) = x.

Unique? Let y be a fixed point. Then

d(x, y) = d( f(x), f(y) ) ≤ k d(x, y).

As k < 1, this means that d(x, y) = 0 and hence x = y. The theorem is proved.

The words complete and contraction are necessary. For example, f : (0, 1) → (0, 1) defined by f(x) = kx for any 0 < k < 1 is a contraction with no fixed point (the space is not complete). Also, f : R → R defined by f(x) = x + 1 is not a contraction (k = 1) and has no fixed point.

Note that the proof is constructive. Not only do we know that a unique fixed point exists, we also know how to find it. Do also note that the proof of uniqueness holds even if X is not complete: if a map f is a contraction and has a fixed point, then that point is unique.

We've used the theorem to prove Picard's theorem last semester. This semester, we will use it to prove the inverse and implicit function theorems.

Inverse function theorem

The idea of a derivative is that if a function is differentiable, then it locally "behaves like" the derivative (which is a linear function). For example, if a function is differentiable and the derivative is invertible, the function is (locally) invertible.

Theorem 9.24: Let U ⊂ R^n be a set and let f : U → R^n be a continuously differentiable function. Also suppose that x0 ∈ U, f(x0) = y0, and f'(x0) is invertible. Then there exist open sets V, W ⊂ R^n such that x0 ∈ V ⊂ U, f(V) = W, and f restricted to V is one-to-one and onto. Furthermore, the inverse g(y) = (f|_V)^(−1)(y) is continuously differentiable and

g'(y) = ( f'(x) )^(−1),   for all x ∈ V, y = f(x).

Proof. Write A = f'(x0). As f' is continuous, there exists an open ball V around x0 such that

‖A − f'(x)‖ < 1 / ( 2 ‖A^(−1)‖ ) for all x ∈ V.

Note that f'(x) is invertible for all x ∈ V (by Theorem 9.8).
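The iteration x_(n+1) = f(x_n) in the proof is exactly how one computes fixed points in practice. A classical illustration (not from the text) is f(x) = cos x, which is a contraction on the complete space [−1, 1] since |f'(x)| = |sin x| ≤ sin 1 < 1 there, and cos maps [−1, 1] into itself:

```python
import math

f = lambda x: math.cos(x)   # a contraction on [-1, 1]: |f'(x)| = |sin x| <= sin(1) < 1

x = 1.0                     # any starting point in the space works
for _ in range(100):
    x = f(x)                # the iteration x_{n+1} = f(x_n) from the proof

# x is now, up to rounding, the unique fixed point cos(x) = x
assert abs(f(x) - x) < 1e-12
```

The convergence rate is geometric with ratio at most the contraction constant, exactly as the Cauchy estimate in the proof predicts.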
Given y ∈ R^n, we define ϕ_y : V → R^n by

ϕ_y(x) = x + A^(−1)( y − f(x) ).

Notice that ϕ_y(x) = x (x is a fixed point) if and only if y − f(x) = 0, that is, f(x) = y. Using the chain rule,

ϕ_y'(x) = I − A^(−1) f'(x) = A^(−1)( A − f'(x) ),

so for x ∈ V we have

‖ϕ_y'(x)‖ ≤ ‖A^(−1)‖ ‖A − f'(x)‖ < 1/2.

As V is a ball, it is convex, and hence

‖ϕ_y(x_1) − ϕ_y(x_2)‖ ≤ (1/2) ‖x_1 − x_2‖ for all x_1, x_2 ∈ V.

In other words, ϕ_y is a contraction defined on V, though we so far do not know what is the range of ϕ_y. We cannot yet apply the fixed point theorem, but we can say that ϕ_y has at most one fixed point (note the proof of uniqueness in the contraction mapping principle). That is, there exists at most one x ∈ V such that f(x) = y, and so f|_V is one-to-one.

Let W = f(V). We need to show that W is open. Take y_1 ∈ W; then there is a unique x_1 ∈ V such that f(x_1) = y_1. Let r > 0 be small enough that the closed ball C(x_1, r) ⊂ V (such r > 0 exists as V is open). Suppose y is such that

‖y − y_1‖ < r / ( 2 ‖A^(−1)‖ ).

If we can show that y ∈ W, then we have shown that W is open. If x ∈ C(x_1, r), then

‖ϕ_y(x) − x_1‖ ≤ ‖ϕ_y(x) − ϕ_y(x_1)‖ + ‖ϕ_y(x_1) − x_1‖
≤ (1/2) ‖x − x_1‖ + ‖A^(−1)(y − y_1)‖
≤ r/2 + ‖A^(−1)‖ ‖y − y_1‖
< r/2 + ‖A^(−1)‖ r / ( 2 ‖A^(−1)‖ ) = r.

So ϕ_y takes C(x_1, r) into C(x_1, r). It is a contraction on C(x_1, r), and C(x_1, r) is complete (a closed subset of R^n is complete). Apply the contraction mapping principle to obtain a fixed point x, i.e. ϕ_y(x) = x, or in other words f(x) = y. So y ∈ f( C(x_1, r) ) ⊂ f(V) = W. Therefore W is open.

Next we need to show that g is continuously differentiable and compute its derivative. First let us show that it is differentiable. Let y ∈ W and k ∈ R^n, k ≠ 0, such that y + k ∈ W. Then there are unique x ∈ V and h ∈ R^n, h ≠ 0 and x + h ∈ V, such that f(x) = y and f(x + h) = y + k, as f|_V is a one-to-one and onto mapping of V onto W. In other words, g(y) = x and g(y + k) = x + h. We can still squeeze some information from the fact that ϕ_y is a contraction:

ϕ_y(x + h) − ϕ_y(x) = h + A^(−1)( f(x) − f(x + h) ) = h − A^(−1) k.

So

‖h − A^(−1) k‖ = ‖ϕ_y(x + h) − ϕ_y(x)‖ ≤ (1/2) ‖x + h − x‖ = ‖h‖/2.

By the inverse triangle inequality, ‖h‖ − ‖A^(−1) k‖ ≤ ‖h‖/2, so

‖h‖ ≤ 2 ‖A^(−1) k‖ ≤ 2 ‖A^(−1)‖ ‖k‖.

In particular, as k goes to 0, so does h. Let B = ( f'(x) )^(−1). Then

‖g(y + k) − g(y) − Bk‖ / ‖k‖ = ‖h − Bk‖ / ‖k‖
= ‖h − B( f(x + h) − f(x) )‖ / ‖k‖
= ‖B( f(x + h) − f(x) − f'(x) h )‖ / ‖k‖
≤ ‖B‖ ( ‖h‖ / ‖k‖ ) ( ‖f(x + h) − f(x) − f'(x) h‖ / ‖h‖ )
≤ 2 ‖B‖ ‖A^(−1)‖ ( ‖f(x + h) − f(x) − f'(x) h‖ / ‖h‖ ).

As k goes to 0, so does h, and so the right-hand side goes to 0 as f is differentiable; hence the left-hand side also goes to 0. And B is precisely what we wanted g'(y) to be.
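The map ϕ_y is not just a proof device: iterating it is a practical way to solve f(x) = y near x0. A sketch with a hypothetical f whose derivative at 0 is the identity, so that A = I and A^(−1) = I (NumPy assumed; the map and target y are arbitrary illustrative choices):

```python
import numpy as np

def f(x):                      # a hypothetical C^1 map with f(0) = 0, f'(0) = I
    return np.array([x[0] + x[1]**2, x[1] + x[0]**2])

A = np.eye(2)                  # A = f'(x0) at x0 = 0, which is invertible
Ainv = np.linalg.inv(A)

y = np.array([0.1, -0.05])     # a target value near f(x0) = 0
x = np.zeros(2)                # start the iteration at x0
for _ in range(50):
    x = x + Ainv @ (y - f(x))  # the contraction phi_y from the proof

assert np.linalg.norm(f(x) - y) < 1e-12   # x = g(y), the local inverse at y
```

Near x0 the iteration map has derivative I − A^(−1) f'(x), which is small in norm, so the contraction mapping principle guarantees the geometric convergence seen here.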
As x ∈ V, f'(x) is invertible, so g is differentiable with g'(y) = ( f'( g(y) ) )^(−1). It remains to show g is C¹(W). Now g : W → V is continuous (it's differentiable), f' is a continuous function from V to L(R^n), and X ↦ X^(−1) is a continuous function. As

g'(y) = ( f'( g(y) ) )^(−1)

is the composition of these three continuous functions, it is continuous. The theorem is proved.

Corollary: Suppose U ⊂ R^n is open and f : U → R^n is a continuously differentiable mapping such that f'(x) is invertible for all x ∈ U. Then, given any open set V ⊂ U, f(V) is open (f is an open mapping).

Proof. WLOG suppose U = V. For each point y ∈ f(V), we pick x ∈ f^(−1)(y) (there could be more than one such point). Since f'(x) is invertible, by the inverse function theorem there is a neighbourhood of x in V that maps onto a neighbourhood of y. Hence f(V) is open.

The theorem, and the corollary, are not true if f'(x) is not invertible for some x. For example, the map f(x, y) = (x, xy) maps R² onto the set R² \ { (0, y) : y ≠ 0 }, which is neither open nor closed, and f^(−1)(0, 0) = { (0, y) : y ∈ R }. Note that this bad behaviour only occurs on the y-axis; everywhere else the function is locally invertible, and in fact if we avoid the y-axis it is even one-to-one.

Also note that just because f'(x) is invertible everywhere does not mean that f is one-to-one globally. It is definitely "locally" one-to-one. For an example, take the map f : C \ {0} → C defined by f(z) = z². Here we treat the map as if it went from R² \ {0} to R². It is left to the student to show that f is differentiable and the derivative is invertible (hint: let z = x + iy and write down the real and imaginary parts of f in terms of x and y). For any nonzero complex number there are always two square roots, so the map is actually 2-to-1.

Also note that the invertibility of the derivative is not a necessary condition for having a continuous inverse and being an open mapping, just a sufficient one. For example, the function f(x) = x³ is an open mapping from R to R and is globally one-to-one with a continuous inverse.

Implicit function theorem

The inverse function theorem is really a special case of the implicit function theorem, which we prove next. Although somewhat ironically, we will prove the implicit function theorem using the inverse function theorem. Really, what we were showing in the inverse function theorem was that the equation x − f(y) = 0 was solvable for y in terms of x if the derivative in terms of y was invertible, that is, if f'(y) was invertible. That is, there was locally a function g such that x − f( g(x) ) = 0.

OK, so how about we look at the equation f(x, y) = 0? Obviously this is not solvable for y in terms of x in every case. For example, when f(x, y) does not actually depend on y. For a slightly more complicated example, notice that x² + y² − 1 = 0 defines the unit circle, and we can locally solve for y in terms of x when (1) we are near a point which lies on the unit circle and (2) we are not at a point where the circle has a vertical tangency, or in other words where ∂f/∂y = 0.
To make things simple, we fix some notation. We let (x, y) ∈ R^(n+m) denote the coordinates (x^1, …, x^n, y^1, …, y^m). A linear transformation A ∈ L(R^(n+m), R^m) can then always be written as A = [A_x A_y], where A_x ∈ L(R^n, R^m) and A_y ∈ L(R^m, R^m), so that A(x, y) = A_x x + A_y y.

First, a linear version of the implicit function theorem.

Proposition (Theorem 9.27): Let A = [A_x A_y] ∈ L(R^(n+m), R^m) and suppose that A_y is invertible. If B = −(A_y)^(−1) A_x, then

0 = A(x, Bx) = A_x x + A_y Bx.

The proof is obvious. We simply solve 0 = A_x x + A_y y and obtain y = −(A_y)^(−1) A_x x = Bx. Let us therefore show that the same can be done for C¹ functions.

Theorem 9.28 (Implicit function theorem): Let U ⊂ R^(n+m) be an open set and let f : U → R^m be a C¹(U) mapping. Let (x0, y0) ∈ U be a point such that f(x0, y0) = 0. Write A = [A_x A_y] = f'(x0, y0) and suppose that A_y is invertible. Then there exists an open set W ⊂ R^n with x0 ∈ W and a C¹(W) mapping g : W → R^m, with g(x0) = y0, such that for all x ∈ W, (x, g(x)) ∈ U and

f( x, g(x) ) = 0.

Furthermore,

g'(x0) = −(A_y)^(−1) A_x.

(Note that Rudin does things "in reverse" from how the statement is usually given, taking the derivatives with respect to x rather than y. We do it in the usual order, as that's what I am used to; it doesn't really matter in the end.)

Proof. Define F : U → R^(n+m) by F(x, y) = ( x, f(x, y) ). It is clear that F is C¹. Let us compute the derivative. We want to show that the derivative of F at (x0, y0) takes (h, k) to (h, A_x h + A_y k). Indeed,

‖F(x0 + h, y0 + k) − F(x0, y0) − (h, A_x h + A_y k)‖ / ‖(h, k)‖ = ‖f(x0 + h, y0 + k) − f(x0, y0) − A_x h − A_y k‖ / ‖(h, k)‖,

which goes to zero as ‖(h, k)‖ = sqrt( ‖h‖² + ‖k‖² ) goes to zero, because f is differentiable. So the derivative of F at (x0, y0) takes (h, k) to (h, A_x h + A_y k). This map is one-to-one: if (h, A_x h + A_y k) = (0, 0), then h = 0, and so A_y k = 0; as A_y is one-to-one, k = 0. Therefore F'(x0, y0) is one-to-one, in other words invertible, and we can apply the inverse function theorem.

By the inverse function theorem, there exists some open set V ⊂ R^(n+m) with (x0, 0) = F(x0, y0) ∈ V, and a C¹ inverse mapping G : V → R^(n+m), that is, F( G(x, s) ) = (x, s) for all (x, s) ∈ V (where x ∈ R^n and s ∈ R^m). Write G = (G_1, G_2) (the first n and the second m components of G). Then

F( G_1(x, s), G_2(x, s) ) = ( G_1(x, s), f( G_1(x, s), G_2(x, s) ) ) = (x, s).

So x = G_1(x, s) and f( x, G_2(x, s) ) = s. Let W = { x ∈ R^n : (x, 0) ∈ V } and define g : W → R^m by g(x) = G_2(x, 0). Plugging in s = 0 we obtain

f( x, g(x) ) = 0.

We obtain the g in the theorem. Next differentiate the map x ↦ f( x, g(x) ), which is the zero map, at x0 by the chain rule. We get, for all h ∈ R^n,

0 = A( h, g'(x0) h ) = A_x h + A_y g'(x0) h,

and we obtain the desired derivative for g as well: g'(x0) = −(A_y)^(−1) A_x.
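The derivative formula g'(x0) = −(A_y)^(−1) A_x can be checked on the unit circle example f(x, y) = x² + y² − 1, where the implicit function g(x) = sqrt(1 − x²) is known explicitly. A small illustration (not part of the text; the base point is an arbitrary choice on the circle):

```python
from math import sqrt

# f(x, y) = x^2 + y^2 - 1 = 0 defines y = g(x) = sqrt(1 - x^2) near (0.6, 0.8),
# where A_y = df/dy = 2*y0 = 1.6 is invertible (nonzero).
x0, y0 = 0.6, 0.8
Ax, Ay = 2 * x0, 2 * y0

formula  = -Ax / Ay                   # g'(x0) = -(A_y)^{-1} A_x
explicit = -x0 / sqrt(1 - x0**2)      # differentiate g(x) = sqrt(1 - x^2) directly

assert abs(formula - explicit) < 1e-12
```

At a point where A_y = 2y = 0 (a vertical tangency of the circle), the formula fails, exactly as the theorem's hypothesis predicts.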
In other words, in the context of the theorem we have m equations in n + m unknowns:

f^1(x^1, …, x^n, y^1, …, y^m) = 0
⋮
f^m(x^1, …, x^n, y^1, …, y^m) = 0.

The condition guaranteeing a solution is that this is a C¹ mapping (that all the components are C¹, or in other words all the partial derivatives exist and are continuous), and that the matrix

[ ∂f^1/∂y^1  ⋯  ∂f^1/∂y^m ]
[     ⋮             ⋮     ]
[ ∂f^m/∂y^1  ⋯  ∂f^m/∂y^m ]

is invertible at (x0, y0).

Example: Consider the set given by x² + y² − (z + 1)³ = −1 and e^x + e^y + e^z = 3 near the point (0, 0, 0). The function we are looking at is

f(x, y, z) = ( x² + y² − (z + 1)³ + 1, e^x + e^y + e^z − 3 ).

We find that

Df = [ 2x    2y    −3(z + 1)² ]
     [ e^x   e^y   e^z        ].

The matrix of partial derivatives with respect to y and z at the origin,

[ 2(0)   −3(0 + 1)² ]   =   [ 0   −3 ]
[ e^0    e^0        ]       [ 1    1 ],

is invertible. Hence near (0, 0, 0) we can find y and z as C¹ functions of x such that for x near 0 we have

x² + y(x)² − (z(x) + 1)³ = −1,   e^x + e^(y(x)) + e^(z(x)) = 3.

The theorem doesn't tell us how to find y(x) and z(x) explicitly, it just tells us they exist. In other words, near the origin the set of solutions is a smooth curve that goes through the origin.

Note that there are versions of the theorem for arbitrarily many derivatives: if f has k continuous derivatives, then the solution also has k derivatives.

Determinants

It would be good to have an easy test for when a matrix is invertible. This is where determinants come in. First define the symbol sgn(x) for a number x by

sgn(x) = −1 if x < 0,  0 if x = 0,  1 if x > 0.

Suppose σ = (σ_1, …, σ_n) is a permutation of the integers (1, …, n). It is not hard to see that any permutation can be obtained by a sequence of transpositions (switchings of two elements). Call a permutation even (resp. odd) if it takes an even (resp. odd) number of transpositions to get from σ to (1, …, n). It can be shown that this is well defined; in fact, it is not hard to show that

sgn(σ) = sgn(σ_1, …, σ_n) = Π_(p<q) sgn(σ_q − σ_p)

is 1 if σ is even and −1 if σ is odd. This fact can be proved by noting that applying a transposition changes the sign, which is not hard to prove by induction on n. Then note that the sign of the identity permutation (1, 2, …, n) is 1.

Let S_n be the set of all permutations on n elements (the symmetric group). Let A = [a^i_j] be an n × n matrix. Define the determinant of A as

det(A) = Σ_(σ ∈ S_n) sgn(σ) Π_(i=1..n) a^i_(σ_i).
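The permutation-sum definition is directly executable for small n, which makes a handy check of the properties that follow, e.g. the 2-by-2 formula ad − bc and the product rule det(AB) = det(A) det(B). A sketch in pure Python (standard library only):

```python
from itertools import permutations
from math import prod

def sgn(sigma):
    # sign of a permutation: flip once per inversion (pair out of order)
    s = 1
    for p in range(len(sigma)):
        for q in range(p + 1, len(sigma)):
            if sigma[q] < sigma[p]:
                s = -s
    return s

def det(A):
    # the definition: sum over all permutations of signed products,
    # one factor from each row, column chosen by the permutation
    n = len(A)
    return sum(sgn(s) * prod(A[i][s[i]] for i in range(n))
               for s in permutations(range(n)))

A = [[1, 2], [3, 4]]
B = [[2, 0], [1, 5]]
assert det(A) == 1*4 - 2*3          # the 2x2 formula ad - bc
AB = [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
      for i in range(2)]
assert det(AB) == det(A) * det(B)   # the product rule, Theorem 9.36 below
```

This brute-force sum has n! terms, so it is only a conceptual tool; it is not how determinants are computed in practice.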
the determinant is the unique function that satisﬁes (i). .35+9. Actually the rows and columns are swapped. b (vii) det [ a d ] = ad − bc and det[a] = a. (iv) If two columns of A are equal. .. (vi) A → det(A) is a continuous function. then the columns are linearly dependent. That is. . j=1 . 1 2 n (j1 . the columns of AB are Ab1 .34 and other observations): (i) det(I) = 1. Let bi denote the elements of B and aj the columns of A. Part (vi) follows as det is a polynomial in the entries of the matrix and hence continuous. For (ii) Notice that each term in the deﬁnition of the determinant contains exactly one factor from each column. But we digress. det(A−1 ) = det(A) . jn ) det([a1 a2 · · · an ]). The conclusion follows by recognizing the determinant of B. . (ii). . (v) If a column is zero. . Then AB = [Ab1 Ab2 · · · Abn ]. then det(A) = 0.. That is suppose that n cj aj = 0. to just elements of Sn by noting that the determinant of the resulting matrix is just zero. (i) is trivial. (iii) If two columns of a matrix are interchanged determinant changes sign. A 1 is invertible if and only if det(A) = 0 and in this case. Abn . then det(A) = 0. . If A is not invertible. (ii) det([x1 x2 .. Finally. Theorem 9. ... Note that Aej = aj .. ...jn ≤n = = bj1 bj2 · · · bjn sgn(j1 . bn be the columns of B.20 Proposition (Theorem 9. For the second part of the theorem note that if A is invertible.36: If A and B are n × n matrices. . then A−1 A = I and so det(A−1 ) det(A) = 1. Part (iii) follows by noting that switching two columns is like switching the two corresponding numbers in every element in Sn . By linearity of the j determinant as proved above we have n det(AB) = det([Ab1 Ab2 · · · Abn ]) = det j=1 n bj aj Ab2 · · · Abn 1 = j=1 bj det([aj Ab2 · · · Abn ]) 1 j bj1 b22 · · · bjn det([aj1 aj2 · · · ajn ]) 1 n 1≤j1 . as you have likely seen this before. but a moment’s reﬂection will reveal that it does not matter. c In fact. We could also just plug in A = I. . 
Theorem 9.36: If A and B are n × n matrices, then det(AB) = det(A) det(B). In particular, A is invertible if and only if det(A) ≠ 0, and in this case

\[
\det(A^{-1}) = \frac{1}{\det(A)} .
\]

Proof. Let b1, . . . , bn be the columns of B. Then AB = [Ab1 Ab2 · · · Abn]; that is, the columns of AB are Ab1, Ab2, . . . , Abn. Let b_j^i denote the entries of B and aj the columns of A. Note that Aej = aj, and so Ab1 = Σ_{j=1}^n b_1^j aj. By linearity of the determinant as proved above, we have

\[
\det(AB) = \det([Ab_1 \; Ab_2 \; \cdots \; Ab_n])
= \det\Bigl(\Bigl[\sum_{j=1}^n b_1^j a_j \;\; Ab_2 \;\cdots\; Ab_n\Bigr]\Bigr)
= \sum_{j=1}^n b_1^j \det([a_j \; Ab_2 \; \cdots \; Ab_n])
\]
\[
= \sum_{1 \le j_1, \ldots, j_n \le n} b_1^{j_1} b_2^{j_2} \cdots b_n^{j_n} \det([a_{j_1} \; a_{j_2} \; \cdots \; a_{j_n}])
= \sum_{(j_1, \ldots, j_n) \in S_n} b_1^{j_1} b_2^{j_2} \cdots b_n^{j_n} \operatorname{sgn}(j_1, \ldots, j_n) \det([a_1 \; a_2 \; \cdots \; a_n]).
\]

In the above, we could go from a sum over all integers 1 ≤ j1, . . . , jn ≤ n to a sum over just the elements of Sn by noting that when an index repeats, two columns are equal, and the determinant of the resulting matrix is just zero. We then reorder the columns a_{j1}, . . . , a_{jn} into a1, . . . , an; each transposition of two columns changes the sign, which is where the factor sgn(j1, . . . , jn) comes from. The conclusion follows by recognizing the determinant of B in the last expression. Actually the roles of the rows and columns are swapped, but a moment's reflection will reveal that it does not matter.

For the second part of the theorem, note that if A is invertible, then A⁻¹A = I and so det(A⁻¹) det(A) = det(I) = 1. If A is not invertible, then the columns are linearly dependent. That is, suppose that

\[
\sum_{j=1}^n c_j a_j = 0,
\]

where not all the cj are zero. Without loss of generality suppose that c1 = 1. Then take

\[
B =
\begin{bmatrix}
c_1 & 0 & 0 & \cdots & 0 \\
c_2 & 1 & 0 & \cdots & 0 \\
c_3 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
c_n & 0 & 0 & \cdots & 1
\end{bmatrix}.
\]

It is not hard to see from the definition that det(B) = c1 = 1. Then det(AB) = det(A) det(B) = c1 det(A) = det(A). Note that the first column of AB is Σ_j cj aj = 0, and hence det(AB) = 0. Thus det(A) = 0.
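Theorem 9.36 and the formula det(A⁻¹) = 1/det(A) are easy to check numerically on a concrete pair of 2 × 2 matrices (chosen arbitrarily here, not taken from the text):

```python
# Numerical spot check of det(AB) = det(A) det(B) and det(A^{-1}) = 1/det(A).
def det2(M):
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

def matmul2(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

A = [[1, 2], [3, 4]]
B = [[0, 1], [5, 6]]
print(det2(matmul2(A, B)), det2(A) * det2(B))   # 10 10

# Explicit 2x2 inverse: A^{-1} = (1/det A) [[d, -b], [-c, a]].
d = det2(A)
Ainv = [[A[1][1] / d, -A[0][1] / d], [-A[1][0] / d, A[0][0] / d]]
print(det2(Ainv), 1 / d)                        # -0.5 -0.5
```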
We can now test whether a matrix is invertible. In fact, the determinant can be defined as a function on the space L(Rn), not just on matrices. If in one basis A is the matrix representing a linear operator, then for another basis we can find a matrix B such that the matrix B⁻¹AB first takes us to the first basis, applies the operator in the first basis, and takes us back to the basis we started with. No matter what basis we choose, the function is the same.

Proposition: The determinant is independent of the basis. In other words, if B is invertible, then det(A) = det(B⁻¹AB).

Proof. The proof is immediate: det(B⁻¹AB) = det(B⁻¹) det(A) det(B) = det(A), since det(B⁻¹) det(B) = det(B⁻¹B) = det(I) = 1. (We could also just plug A = I into this computation to see that det(B⁻¹) = 1/det(B).)

It follows from the two propositions that det : L(Rn) → R is a well-defined and continuous function.
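Basis independence, det(A) = det(B⁻¹AB), can likewise be verified on a concrete example; the matrices below are our own illustration:

```python
# Similar matrices have the same determinant: det(A) = det(B^{-1} A B).
def det2(M):
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

def matmul2(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

A = [[2, 1], [1, 3]]
B = [[1, 1], [0, 1]]
Binv = [[1, -1], [0, 1]]          # inverse of the shear B

similar = matmul2(Binv, matmul2(A, B))
print(det2(A), det2(similar))     # 5 5
```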
Definition: Let U ⊂ Rn and let f : U → Rn be a differentiable mapping. Define the Jacobian of f at x as

\[
J_f(x) = \det\bigl( f'(x) \bigr).
\]

Sometimes this is written as

\[
\frac{\partial(f^1, f^2, \ldots, f^n)}{\partial(x^1, x^2, \ldots, x^n)} .
\]

To the uninitiated this can be a somewhat confusing notation, but it is useful when you need to specify the exact variables and function components used.

The Jacobian is a real-valued function, and when n = 1 it is simply the derivative. When f is C¹, that is, continuously differentiable (or in other words, all the partial derivatives exist and are continuous), then Jf(x) is a continuous function. More generally, f is Cᵏ if f has k continuous derivatives. Also note that from the chain rule it follows that

\[
J_{f \circ g}(x) = J_f\bigl(g(x)\bigr) \, J_g(x).
\]

We can restate the inverse function theorem using the Jacobian: f : U → Rn is locally invertible near x if Jf(x) ≠ 0. For the implicit function theorem, the condition is normally stated as

\[
\frac{\partial(f^1, \ldots, f^n)}{\partial(y^1, \ldots, y^n)}(x_0, y_0) \neq 0.
\]

Neither theorem tells us how to compute the local inverse or the implicit function; it just tells us that they exist.

It can be computed directly that the determinant tells us what happens to area and volume. Suppose that we are in R². Then if A is a linear transformation, it follows by direct computation that the direct image of the unit square, A([0, 1]²), has area |det(A)|. The sign of the determinant determines the "orientation": if the determinant is negative, then the two sides of the unit square are flipped in the image. We claim without proof that this holds for arbitrary figures, not just the square. Therefore, the Jacobian measures how much a differentiable mapping stretches things locally, and whether it flips orientation. We should see more of this geometry next semester.
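The chain-rule identity for Jacobians can be checked by hand on a pair of simple maps; f(u, v) = (u², v³) and g(x, y) = (x + y, x − y) below are hypothetical examples of ours, not from the text:

```python
# Check J_{f∘g}(x, y) = J_f(g(x, y)) * J_g(x, y) for
# f(u, v) = (u^2, v^3) and g(x, y) = (x + y, x - y).
def det2(M):
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

def Jg(x, y):
    # g'(x, y) = [[1, 1], [1, -1]], a constant matrix
    return det2([[1, 1], [1, -1]])

def Jf(u, v):
    # f'(u, v) = [[2u, 0], [0, 3v^2]]
    return det2([[2 * u, 0], [0, 3 * v * v]])

def J_comp(x, y):
    # (f∘g)'(x, y) = f'(g(x, y)) g'(x, y) = [[2u, 2u], [3v^2, -3v^2]]
    # with u = x + y, v = x - y, computed by hand.
    u, v = x + y, x - y
    return det2([[2 * u, 2 * u], [3 * v * v, -3 * v * v]])

x, y = 2.0, 1.0
print(J_comp(x, y), Jf(x + y, x - y) * Jg(x, y))   # -36.0 -36.0
```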