Mathematics for Machine Learning solutions


Linear Algebra

Exercises
2.1 We consider (R\{−1}, ⋆), where
a ⋆ b := ab + a + b, a, b ∈ R\{−1} (2.1)
a. Show that (R\{−1}, ⋆) is an Abelian group.
b. Solve
3 ⋆ x ⋆ x = 15

in the Abelian group (R\{−1}, ⋆), where ⋆ is defined in (2.1).


a. First, we show that R\{−1} is closed under ⋆: For all a, b ∈ R\{−1}:
a ⋆ b = ab + a + b + 1 − 1 = (a + 1)(b + 1) − 1 ≠ −1 ,

since (a + 1) ≠ 0 and (b + 1) ≠ 0.

⇒ a ⋆ b ∈ R\{−1}

Next, we show the group axioms


Associativity: For all a, b, c ∈ R\{−1}:
(a ⋆ b) ⋆ c = (ab + a + b) ⋆ c
= (ab + a + b)c + (ab + a + b) + c
= abc + ac + bc + ab + a + b + c
= a(bc + b + c) + a + (bc + b + c)
= a ⋆ (bc + b + c)
= a ⋆ (b ⋆ c)

Commutativity:
∀a, b ∈ R\{−1} : a ⋆ b = ab + a + b = ba + b + a = b ⋆ a

Neutral Element: n = 0 is the neutral element since


∀a ∈ R\{−1} : a ⋆ 0 = a = 0 ⋆ a

Inverse Element: We need to find ā, such that a ⋆ ā = 0 = ā ⋆ a.


ā ⋆ a = 0 ⇐⇒ āa + a + ā = 0
⇐⇒ ā(a + 1) = −a
⇐⇒ ā = −a/(a + 1) = −1 + 1/(a + 1)   (using a ≠ −1),

so that ā ≠ −1 and thus ā ∈ R\{−1}.


b.

3 ⋆ x ⋆ x = 15 ⇐⇒ 3 ⋆ (x² + x + x) = 15
⇐⇒ 3x² + 6x + 3 + x² + 2x = 15
⇐⇒ 4x² + 8x − 12 = 0
⇐⇒ (x − 1)(x + 3) = 0
⇐⇒ x ∈ {−3, 1}
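As an optional numerical sanity check (not part of the original solution), a short Python sketch evaluates 3 ⋆ x ⋆ x for both candidates:

    def star(a, b):
        # the operation from (2.1): a * b := ab + a + b
        return a * b + a + b

    for x in (-3, 1):
        print(x, star(star(3, x), x))  # both print 15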

2.2 Let n be in N\{0}. Let k, x be in Z. We define the congruence class k̄ of the


integer k as the set

k̄ = {x ∈ Z | x − k ≡ 0 (mod n)}
  = {x ∈ Z | ∃a ∈ Z : x − k = n · a} .

We now define Z/nZ (sometimes written Zn ) as the set of all congruence


classes modulo n. Euclidean division implies that this set is a finite set con-
taining n elements:

Zn = {0, 1, . . . , n − 1}

For all a, b ∈ Zn , we define

a ⊕ b := a + b

a. Show that (Zn , ⊕) is a group. Is it Abelian?


b. We now define another operation ⊗ for all a and b in Zn as

a ⊗ b = a × b, (2.2)

where a × b represents the usual multiplication in Z.


Let n = 5. Draw the times table of the elements of Z5 \{0} under ⊗, i.e.,
calculate the products a ⊗ b for all a and b in Z5 \{0}.
Hence, show that Z5 \{0} is closed under ⊗ and possesses a neutral
element for ⊗. Display the inverse of all elements in Z5 \{0} under ⊗.
Conclude that (Z5 \{0}, ⊗) is an Abelian group.
c. Show that (Z8 \{0}, ⊗) is not a group.
d. We recall that the Bézout theorem states that two integers a and b are
relatively prime (i.e., gcd(a, b) = 1) if and only if there exist two integers
u and v such that au + bv = 1. Show that (Zn \{0}, ⊗) is a group if and
only if n ∈ N\{0} is prime.

a. We show that the group axioms are satisfied:


Closure: Let a, b be in Zn . We have:

a⊕b=a+b
= (a + b) mod n

by definition of the congruence class, and since [(a + b) mod n] ∈


{0, . . . , n − 1}, it follows that a ⊕ b ∈ Zn . Thus, Zn is closed under ⊕.


Associativity: Let c be in Zn . We have:


(a ⊕ b) ⊕ c = (a + b) ⊕ c = (a + b) + c = a + (b + c)
= a ⊕ (b + c) = a ⊕ (b ⊕ c)

so that ⊕ is associative.
Neutral element: We have
a+0=a+0=a=0+a

so 0 is the neutral element for ⊕.


Inverse element: We have
a + (−a) = a − a = 0 = (−a) + a

and we know that (−a) is equal to (−a) mod n which belongs to Zn


and is thus the inverse of a.
Commutativity: Finally, the commutativity of (Zn , ⊕) follows from
that of (Z, +) since we have
a ⊕ b = a + b = b + a = b ⊕ a,

which shows that (Zn , ⊕) is an Abelian group.


b. Let us calculate the times table of Z5 \{0} under ⊗:

⊗ 1 2 3 4
1 1 2 3 4
2 2 4 1 3
3 3 1 4 2
4 4 3 2 1

We can notice that all the products are in Z5 \{0}, and that in particular,
none of them is equal to 0. Thus, Z5 \{0} is closed under ⊗. The neutral
element is 1 and we have (1)−1 = 1, (2)−1 = 3, (3)−1 = 2, and (4)−1 = 4.
Associativity and commutativity are straightforward and (Z5 \{0}, ⊗) is
an Abelian group.
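An optional Python sketch (not part of the original solution) that reproduces the times table of Z5\{0} and the inverse of each element:

    n = 5
    elements = range(1, n)
    for a in elements:
        row = [(a * b) % n for b in elements]
        inv = next(b for b in elements if (a * b) % n == 1)
        print(a, row, "inverse:", inv)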
c. The elements 2 and 4 belong to Z8 \{0}, but their product 2 ⊗ 4 = 8 = 0
does not. Thus, this set is not closed under ⊗ and is not a group.
d. Let us assume that n is not prime and can thus be written as a product
n = a × b of two integers a and b in {2, . . . , n − 1}. Both elements a
and b belong to Zn \{0} but their product a ⊗ b = n = 0 does not.
Thus, this set is not closed under ⊗ and (Zn \{0}, ⊗) is not a group.
Let n be a prime number. Let a and b be in Zn \{0} with a and b in
{1, . . . , n − 1}. As n is prime, we know that a is relatively prime to n,
and so is b. Let us then take four integers u, v, u′ and v ′ , such that
au + nv = 1
bu′ + nv ′ = 1 .

We thus have that (au + nv)(bu′ + nv ′ ) = 1, which we can rewrite as


ab(uu′ ) + n(auv ′ + vbu′ + nvv ′ ) = 1


By virtue of the Bézout theorem, this implies that ab and n are rela-
tively prime, which ensures that the product a ⊗ b is not equal to 0
and belongs to Zn \{0}, which is thus closed under ⊗.
The associativity and commutativity of ⊗ are straightforward, but we
need to show that every element has an inverse. First, the neutral
element is 1. Let us again consider an element a in Zn \{0} with a in
{1, . . . , n − 1}. As a and n are coprime, the Bézout theorem enables
us to define two integers u and v such that
au + nv = 1 , (2.3)
which implies that au = 1 − nv and thus
au = 1 mod n , (2.4)
which means that a ⊗ u = au = 1, or that u is the inverse of a. Over-
all, (Zn \{0}, ⊗) is an Abelian group. Note that the Bézout theorem
ensures the existence of an inverse without yielding its explicit value,
which is the purpose of the extended Euclidean algorithm.
2.3 Consider the set G of 3 × 3 matrices defined as follows:
G = { ⎡1 x z⎤
      ⎢0 1 y⎥ ∈ R^(3×3)  :  x, y, z ∈ R }
      ⎣0 0 1⎦

We define · as the standard matrix multiplication.


Is (G, ·) a group? If yes, is it Abelian? Justify your answer.
Closure: Let a, b, c, x, y and z be in R and let us define A and B in G as
⎡ ⎤ ⎡ ⎤
1 x z 1 a c
A = ⎣0 1 y⎦ , B = ⎣0 1 b⎦ .
0 0 1 0 0 1
Then,
⎡ ⎤⎡ ⎤ ⎡ ⎤
1 x z 1 a c 1 a+x c + xb + z
A · B = ⎣0 1 y ⎦ ⎣0 1 b ⎦ = ⎣0 1 b+y ⎦ .
0 0 1 0 0 1 0 0 1
Since a + x, b + y and c + xb + z are in R we have A · B ∈ G . Thus, G is
closed under matrix multiplication.
Associativity: Let α, β and γ be in R and let C in G be defined as
⎡ ⎤
1 α γ
C = ⎣0 1 β⎦ .
0 0 1
It holds that
⎡ ⎤⎡ ⎤
1 a+x c + xb + z 1 α γ
(A · B) · C = ⎣0 1 b + y ⎦ ⎣0 1 β⎦
0 0 1 0 0 1
⎡ ⎤
1 α+a+x γ + αβ + xβ + c + xb + z
= ⎣0 1 β+b+y ⎦.
0 0 1


Similarly,
⎡ ⎤ ⎡ ⎤
1 x z 1 α+a γ + αβ + c
A · (B · C) = ⎣0 1 y ⎦ · ⎣0 1 β+b ⎦
0 0 1 0 0 1
⎡ ⎤
1 α+a+x γ + αβ + xβ + c + xb + z
= ⎣0 1 β+b+y ⎦ = (A · B) · C .
0 0 1

Therefore, · is associative.
Neutral element: For all A in G , we have: I 3 · A = A = A · I 3 and thus
I 3 is the neutral element.
Non-commutativity: We show that · is not commutative. Consider the
matrices X, Y ∈ G , where
⎡ ⎤ ⎡ ⎤
1 1 0 1 0 0
X = ⎣0 1 0⎦ , Y = ⎣0 1 1⎦ . (2.5)
0 0 1 0 0 1

Then,
⎡ ⎤
1 1 1
X · Y = ⎣0 1 1⎦ ,
0 0 1
⎡ ⎤
1 1 0
Y · X = ⎣0 1 1⎦ ̸= X · Y .
0 0 1

Therefore, · is not commutative.


Inverse element: Let us look for a right inverse A−1 r of A. Such a matrix
should satisfy AA−1 r = I 3 . We thus solve the linear system [A|I 3 ] that
we transform into [I 3 |A−1
r ]:
⎡ ⎤
1 x z 1 0 0 −zR3
⎣ 0 1 y 0 1 0 ⎦ −yR3
0 0 1 0 0 1
⎡ ⎤
1 x 0 1 0 −z −xR2
! ⎣ 0 1 0 0 1 −y ⎦
0 0 1 0 0 1
⎡ ⎤
1 0 0 1 −x xy − z
! ⎣ 0 1 0 0 1 −y ⎦ .
0 0 1 0 0 1

Therefore, we obtain the right inverse


⎡ ⎤
1 −x xy − z
A−1
r = ⎣0 1 −y ⎦ ∈ G
0 0 1

Because of the uniqueness of the inverse element, if a left inverse A−1 l


exists, then it is equal to the right inverse. But as · is not commutative,


we need to check explicitly that A⁻¹ᵣ A = I₃ also holds, which we do
next:
⎡ ⎤ ⎡ ⎤
1 −x −z + xy 1 x z
A−1
r A = ⎣0 1 −y ⎦ · ⎣0 1 y⎦
0 0 1 0 0 1
⎡ ⎤
1 x−x z − xy − z + xy
= ⎣0 1 y+y ⎦ = I3 .
0 0 1
Thus, every element of G has an inverse. Overall, (G, ·) is a non-Abelian
group.
2.4 Compute the following matrix products, if possible:
a.
    ⎡1 2⎤ ⎡1 1 0⎤
    ⎢4 5⎥ ⎢0 1 1⎥
    ⎣7 8⎦ ⎣1 0 1⎦

    This matrix product is not defined: the neighboring dimensions have to fit,
    i.e., an m × n matrix can only be multiplied by an n × p matrix (from the
    right) or by a k × m matrix (from the left).
b.
    ⎡1 2 3⎤ ⎡1 1 0⎤   ⎡ 4  3  5⎤
    ⎢4 5 6⎥ ⎢0 1 1⎥ = ⎢10  9 11⎥
    ⎣7 8 9⎦ ⎣1 0 1⎦   ⎣16 15 17⎦
c.
    ⎡1 1 0⎤ ⎡1 2 3⎤   ⎡ 5  7  9⎤
    ⎢0 1 1⎥ ⎢4 5 6⎥ = ⎢11 13 15⎥
    ⎣1 0 1⎦ ⎣7 8 9⎦   ⎣ 8 10 12⎦
d.
    ⎡1 2  1  2⎤   ⎡0  3⎤     ⎡ 14  6⎤
    ⎣4 1 −1 −4⎦ · ⎢1 −1⎥  =  ⎣−21  2⎦
                  ⎢2  1⎥
                  ⎣5  2⎦
e.
    ⎡0  3⎤                    ⎡12  3 −3 −12⎤
    ⎢1 −1⎥ · ⎡1 2  1  2⎤   =  ⎢−3  1  2   6⎥
    ⎢2  1⎥   ⎣4 1 −1 −4⎦      ⎢ 6  5  1   0⎥
    ⎣5  2⎦                    ⎣13 12  3   2⎦
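These products can be reproduced numerically. A short NumPy sketch (an optional addition, not part of the original solution):

    import numpy as np

    A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
    B = np.array([[1, 1, 0], [0, 1, 1], [1, 0, 1]])
    C = np.array([[1, 2, 1, 2], [4, 1, -1, -4]])
    D = np.array([[0, 3], [1, -1], [2, 1], [5, 2]])

    print(A @ B)   # part b
    print(B @ A)   # part c
    print(C @ D)   # part d
    print(D @ C)   # part e
    # part a: a (3x2) matrix times a (3x3) matrix raises a ValueError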
2.5 Find the set S of all solutions in x of the following inhomogeneous linear
systems Ax = b, where A and b are defined as follows:
a.
        ⎡1  1 −1 −1⎤       ⎡ 1⎤
    A = ⎢2  5 −7 −5⎥ , b = ⎢−2⎥
        ⎢2 −1  1  3⎥       ⎢ 4⎥
        ⎣5  2 −4  2⎦       ⎣ 6⎦


We apply Gaussian elimination to the augmented matrix


⎡ 1  1 −1 −1 |  1 ⎤
⎢ 2  5 −7 −5 | −2 ⎥  −2R1
⎢ 2 −1  1  3 |  4 ⎥  −2R1
⎣ 5  2 −4  2 |  6 ⎦  −5R1

⎡ 1  1 −1 −1 |  1 ⎤  −(1/3)R2
⎢ 0  3 −5 −3 | −4 ⎥  ·(1/3)
⎢ 0 −3  3  5 |  2 ⎥  +R2
⎣ 0 −3  1  7 |  1 ⎦  +R2

⎡ 1  0  2/3   0 |  7/3 ⎤          ⎡ 1  0  2/3   0 |  7/3 ⎤
⎢ 0  1 −5/3  −1 | −4/3 ⎥          ⎢ 0  1 −5/3  −1 | −4/3 ⎥
⎢ 0  0  −2    2 |  −2  ⎥          ⎢ 0  0  −2    2 |  −2  ⎥
⎣ 0  0  −4    4 |  −3  ⎦ −2R3     ⎣ 0  0   0    0 |   1  ⎦

The last row of the final linear system shows that the equation system
has no solution and thus S = ∅.
b.
    ⎡ 1 −1 0  0  1⎤       ⎡ 3⎤
A = ⎢ 1  1 0 −3  0⎥ , b = ⎢ 6⎥
    ⎢ 2 −1 0  1 −1⎥       ⎢ 5⎥
    ⎣−1  2 0 −2 −1⎦       ⎣−1⎦

We start again by writing down the augmented matrix and apply Gaus-
sian elimination to obtain the reduced row echelon form:
⎡  1 −1 0  0  1 |  3 ⎤
⎢  1  1 0 −3  0 |  6 ⎥  −R1
⎢  2 −1 0  1 −1 |  5 ⎥  −2R1
⎣ −1  2 0 −2 −1 | −1 ⎦  +R1

⎡ 1 −1 0  0  1 |  3 ⎤  +R3
⎢ 0  2 0 −3 −1 |  3 ⎥  −2R3
⎢ 0  1 0  1 −3 | −1 ⎥  swap with R2
⎣ 0  1 0 −2  0 |  2 ⎦  −R3

⎡ 1 0 0  1 −2 |  2 ⎤
⎢ 0 1 0  1 −3 | −1 ⎥
⎢ 0 0 0 −5  5 |  5 ⎥  ·(−1/5)
⎣ 0 0 0 −3  3 |  3 ⎦  −(3/5)R3

⎡ 1 0 0 1 −2 |  2 ⎤  −R3
⎢ 0 1 0 1 −3 | −1 ⎥  −R3
⎢ 0 0 0 1 −1 | −1 ⎥
⎣ 0 0 0 0  0 |  0 ⎦

⎡ 1 0 0 0 −1 |  3 ⎤
⎢ 0 1 0 0 −2 |  0 ⎥
⎣ 0 0 0 1 −1 | −1 ⎦

From the reduced row echelon form we apply the “Minus-1 Trick” in
order to get the following system:


⎡ 1 0  0 0 −1 |  3 ⎤
⎢ 0 1  0 0 −2 |  0 ⎥
⎢ 0 0 −1 0  0 |  0 ⎥
⎢ 0 0  0 1 −1 | −1 ⎥
⎣ 0 0  0 0 −1 |  0 ⎦

The right-hand side of the system yields us a particular solution while


the columns corresponding to the −1 pivots are the directions of the
solution space. We then obtain the set of all possible solutions as
S = { x ∈ R⁵ : x = [3, 0, 0, −1, 0]⊤ + λ1·[0, 0, −1, 0, 0]⊤ + λ2·[−1, −2, 0, −1, −1]⊤ ,  λ1, λ2 ∈ R } .
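A quick, optional NumPy check (not from the original text) that the particular solution and the two direction vectors satisfy Ax = b and Ax = 0:

    import numpy as np

    A = np.array([[1, -1, 0, 0, 1],
                  [1, 1, 0, -3, 0],
                  [2, -1, 0, 1, -1],
                  [-1, 2, 0, -2, -1]])
    b = np.array([3, 6, 5, -1])
    xp = np.array([3, 0, 0, -1, 0])        # particular solution
    d1 = np.array([0, 0, -1, 0, 0])        # homogeneous direction 1
    d2 = np.array([-1, -2, 0, -1, -1])     # homogeneous direction 2

    print(A @ xp)          # -> [ 3  6  5 -1], i.e. b
    print(A @ d1, A @ d2)  # -> zero vectors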

2.6 Using Gaussian elimination, find all solutions of the inhomogeneous equa-
tion system Ax = b with
⎡ ⎤ ⎡ ⎤
0 1 0 0 1 0 2
A = ⎣0 0 0 1 1 0⎦ , b = ⎣−1⎦ .
0 1 0 0 0 1 1

We start by determining the reduced row echelon form of the augmented


matrix [A|b].
⎡ ⎤
0 1 0 0 1 0 2
⎣ 0 0 0 1 1 0 −1 ⎦
0 1 0 0 0 1 1 −R1
⎡ ⎤
0 1 0 0 1 0 2 +R3
! ⎣ 0 0 0 1 1 0 −1 ⎦ +R3
0 0 0 0 −1 1 −1 ·(−1)
⎡ ⎤
0 1 0 0 0 1 1
! ⎣ 0 0 0 1 0 1 −2 ⎦
0 0 0 0 1 −1 1

This augmented matrix is now in reduced row echelon form. Applying the
“Minus-1 trick” gives us the following augmented matrix:

⎡ ⎤
−1 0 0 0 0 0 0
⎢ 0 1 0 0 0 1 1 ⎥
⎢ ⎥
⎢ 0
⎢ 0 −1 0 0 0 0 ⎥

⎢ 0
⎢ 0 0 1 0 1 −2 ⎥

⎣ 0 0 0 0 1 −1 1 ⎦
0 0 0 0 0 −1 0

The right-hand side of this augmented matrix gives us a particular solution


while the columns corresponding to the −1 pivots are the span the solution


space S . Therefore, we obtain the general solution


S = { x ∈ R⁶ : x = [0, 1, 0, −2, 1, 0]⊤ + λ1·[−1, 0, 0, 0, 0, 0]⊤ + λ2·[0, 0, −1, 0, 0, 0]⊤
                    + λ3·[0, 1, 0, 1, −1, −1]⊤ ,  λ1, λ2, λ3 ∈ R } ,

where the three direction vectors solve Ax = 0.

Now, we find a particular solution that solves the inhomogeneous equation


system. From the RREF, we see that x2 = 1, x4 = −2, x5 = 1 + x6 and
x1 , x3 , x6 ∈ R are free variables (they correspond to variables that belong
to the non-pivot columns of the augmented matrix). Therefore, a particular
solution is
⎡ ⎤
0
⎢1⎥
⎢ ⎥
⎢0⎥ 6
⎢−2⎥ ∈ R .
xp = ⎢ ⎥
⎢ ⎥
⎣1⎦
0

The general solution adds solutions from the homogeneous equation system
Ax = 0. We can use the RREF of the augmented system to read out these so-
lutions by using the Minus-1 Trick. Padding the RREF with rows containing
-1 on the diagonal gives us
⎡ ⎤
−1 0 0 0 0 0
⎢0 1 0 0 0 1 ⎥
⎢ ⎥
⎢0
⎢ 0 −1 0 0 0 ⎥
⎥.
⎢0 0 0 1 0 1 ⎥
⎢ ⎥
⎣0 0 0 0 1 −1⎦
0 0 0 0 0 −1

This yields the general solution

x = xp + λ1·[−1, 0, 0, 0, 0, 0]⊤ + λ2·[0, 0, −1, 0, 0, 0]⊤ + λ3·[0, 1, 0, 1, −1, −1]⊤ ,  λ1, λ2, λ3 ∈ R ,

where the three direction vectors solve Ax = 0.
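An optional NumPy check (not part of the original solution) of the particular solution and the homogeneous directions:

    import numpy as np

    A = np.array([[0, 1, 0, 0, 1, 0],
                  [0, 0, 0, 1, 1, 0],
                  [0, 1, 0, 0, 0, 1]])
    xp = np.array([0, 1, 0, -2, 1, 0])
    dirs = [np.array([-1, 0, 0, 0, 0, 0]),
            np.array([0, 0, -1, 0, 0, 0]),
            np.array([0, 1, 0, 1, -1, -1])]

    print(A @ xp)                       # -> [ 2 -1  1], i.e. b
    print([list(A @ d) for d in dirs])  # -> all zero vectors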

2.7 Find all solutions in x = [x1, x2, x3]⊤ ∈ R³ of the equation system Ax = 12x,


where
⎡ ⎤
6 4 3
A = ⎣6 0 9⎦
0 8 0
and Σᵢ₌₁³ xᵢ = 1.
We start by rephrasing the problem into solving a homogeneous system of
linear equations. Let x be in R3 . We notice that Ax = 12x is equivalent
to (A − 12I)x = 0, which can be rewritten as the homogeneous system
Ãx = 0, where we define
⎡ ⎤
−6 4 3
à = ⎣ 6 −12 9 ⎦.
0 8 −12
The constraint Σᵢ₌₁³ xᵢ = 1 can be transcribed as a fourth equation, which
leads us to consider the following linear system, which we bring to reduced
row echelon form:
⎡ −6   4    3 | 0 ⎤  +R2
⎢  6 −12    9 | 0 ⎥  ·(1/3)
⎢  0   8  −12 | 0 ⎥  ·(1/4)
⎣  1   1    1 | 1 ⎦

⎡ 0 −8  12 | 0 ⎤  +4R3
⎢ 2 −4   3 | 0 ⎥  +2R3
⎢ 0  2  −3 | 0 ⎥
⎣ 1  1   1 | 1 ⎦

⎡ 0  0   0 | 0 ⎤
⎢ 2  0  −3 | 0 ⎥  ·(1/2)
⎢ 0  2  −3 | 0 ⎥  ·(1/2)
⎣ 1  1   1 | 1 ⎦

⎡ 1  0  −3/2 | 0 ⎤
⎢ 0  1  −3/2 | 0 ⎥
⎣ 1  1    1  | 1 ⎦  −R1 − R2

⎡ 1  0  −3/2 | 0 ⎤  +(3/8)R3       ⎡ 1  0  0 | 3/8 ⎤
⎢ 0  1  −3/2 | 0 ⎥  +(3/8)R3       ⎢ 0  1  0 | 3/8 ⎥
⎣ 0  0    4  | 1 ⎦  ·(1/4)         ⎣ 0  0  1 | 1/4 ⎦

Therefore, we obtain the unique solution

x = [3/8, 3/8, 1/4]⊤ = (1/8)·[3, 3, 2]⊤ .
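A short numerical sanity check (an addition, not from the original solution) that x satisfies both Ax = 12x and the sum constraint:

    import numpy as np

    A = np.array([[6, 4, 3], [6, 0, 9], [0, 8, 0]])
    x = np.array([3/8, 3/8, 1/4])
    print(A @ x, 12 * x)   # both -> [4.5 4.5 3. ]
    print(x.sum())         # -> 1.0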


2.8 Determine the inverses of the following matrices if possible:

a.
⎡ ⎤
2 3 4
A = ⎣3 4 5⎦
4 5 6

To determine the inverse of a matrix, we start with the augmented ma-


trix [A | I] and transform it into [I | B], where B turns out to be A−1 :

⎡ 2 3 4 | 1 0 0 ⎤
⎢ 3 4 5 | 0 1 0 ⎥  −(3/2)R1
⎣ 4 5 6 | 0 0 1 ⎦  −2R1

⎡ 2   3   4 |   1   0 0 ⎤
⎢ 0 −1/2 −1 | −3/2  1 0 ⎥  −(1/2)R3
⎣ 0  −1  −2 |  −2   0 1 ⎦  ·(−1)

⎡ 2 3 4 |   1   0    0  ⎤
⎢ 0 0 0 | −1/2  1  −1/2 ⎥ .
⎣ 0 1 2 |   2   0   −1  ⎦

Here, we see that this system of linear equations is not solvable. There-
fore, the inverse does not exist.
b.
⎡ ⎤
1 0 1 0
⎢0 1 1 0⎥
A=⎢
⎣1

1 0 1⎦
1 1 1 0

⎡ ⎤
1 0 1 0 1 0 0 0
⎢ ⎥

⎢ 0 1 1 0 0 1 0 0 ⎥

⎢ ⎥
1 1 0 1 0 0 1 ⎦ −R1
0 ⎥


1 1 1 0 0 0 0 1 −R1
⎡ ⎤
1 0 1 0 1 0 0 0
⎢ ⎥
0 1 1 0 0 1 0 ⎥ −R4
0 ⎥


! ⎢ ⎥

⎣ 0 1 −1 1 −1 0 1 ⎦ −R4
0 ⎥
0 1 0 0 −1 0 0 1 swap with R2

⎡ ⎤
1 0 1 0 1 0 0 0 −R4
⎢ ⎥

⎢ 0 1 0 0 −1 0 0 1 ⎥

! ⎢ ⎥

⎣ 0 0 −1 1 0 0 1 −1 ⎥
⎦ +R4
0 0 1 0 1 1 0 −1 swap with R3
⎡ ⎤
1 0 0 0 0 −1 0 1
⎢ ⎥

⎢ 0 1 0 0 −1 0 0 1 ⎥

! ⎢ ⎥

⎣ 0 0 1 0 1 1 0 −1 ⎥

0 0 0 1 1 1 1 −2

Therefore,
      ⎡ 0 −1 0  1⎤
A⁻¹ = ⎢−1  0 0  1⎥
      ⎢ 1  1 0 −1⎥
      ⎣ 1  1 1 −2⎦
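An optional NumPy check (not part of the original solution): the first matrix has rank 2 and is therefore singular, while the computed inverse of the second matrix matches numpy.linalg.inv:

    import numpy as np

    A1 = np.array([[2, 3, 4], [3, 4, 5], [4, 5, 6]])
    print(np.linalg.matrix_rank(A1))   # 2 < 3, so A1 is not invertible

    A2 = np.array([[1, 0, 1, 0], [0, 1, 1, 0], [1, 1, 0, 1], [1, 1, 1, 0]])
    print(np.linalg.inv(A2))           # matches the inverse computed above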

2.9 Which of the following sets are subspaces of R3 ?


a. A = {(λ, λ + µ3 , λ − µ3 ) | λ, µ ∈ R}
b. B = {(λ2 , −λ2 , 0) | λ ∈ R}
c. Let γ be in R.
C = {(ξ1 , ξ2 , ξ3 ) ∈ R3 | ξ1 − 2ξ2 + 3ξ3 = γ}
d. D = {(ξ1 , ξ2 , ξ3 ) ∈ R3 | ξ2 ∈ Z}
As a reminder: Let V be a vector space. U ⊆ V is a subspace if
1. U ̸= ∅. In particular, 0 ∈ U .
2. ∀a, b ∈ U : a + b ∈ U Closure with respect to the inner operation
3. ∀a ∈ U, λ ∈ R : λa ∈ U Closure with respect to the outer operation
The standard vector space properties (Abelian group, distributivity, asso-
ciativity and neutral element) do not have to be shown because they are
inherited from the vector space (R3 , +, ·).
Let us now have a look at the sets A, B, C, D.
a. 1. We have that (0, 0, 0) ∈ A for λ = 0 = µ.
2. Let a = (λ1 , λ1 + µ31 , λ1 − µ31 ) and b = (λ2 , λ2 + µ32 , λ2 − µ32 ) be two
elements of A, where λ1 , µ1 , λ2 , µ2 ∈ R. Then,

a + b = (λ1 , λ1 + µ31 , λ1 − µ31 ) + (λ2 , λ2 + µ32 , λ2 − µ32 )


= (λ1 + λ2 , λ1 + µ31 + λ2 + µ32 , λ1 − µ31 + λ2 − µ32 )
= (λ1 + λ2 , (λ1 + λ2 ) + (µ31 + µ32 ), (λ1 + λ2 ) − (µ31 + µ32 )) ,

which belongs to A.
3. Let α be in R. Then,

α(λ, λ + µ3 , λ − µ3 ) = (αλ, αλ + αµ3 , αλ − αµ3 ) ∈ A .

Therefore, A is a subspace of R3 .


b. The vector (1, −1, 0) belongs to B , but (−1) · (1, −1, 0) = (−1, 1, 0) does
not. Thus, B is not closed under scalar multiplication and is not a sub-
space of R3 .
c. Let A ∈ R1×3 be defined as A = [1, −2, 3]. The set C can be written as:

C = {x ∈ R3 | Ax = γ} .

We first notice that 0 belongs to C only if γ = 0, since A0 = 0.

Let us thus consider γ = 0 and ask whether C is a subspace of R3 . Let x
and y be in C . We know that Ax = 0 and Ay = 0, so that

A(x + y) = Ax + Ay = 0 + 0 = 0 .

Therefore, x + y belongs to C . Let λ be in R. Similarly,

A(λx) = λ(Ax) = λ0 = 0

Therefore, C is closed under scalar multiplication, and thus is a subspace


of R3 if (and only if) γ = 0.
d. The vector (0, 1, 0) belongs to D but π(0, 1, 0) does not and thus D is not
a subspace of R3 .
2.10 Are the following sets of vectors linearly independent?
a.
⎡ ⎤ ⎡ ⎤ ⎡ ⎤
2 1 3
x1 = ⎣−1⎦ , x2 = ⎣ 1 ⎦ , x3 = ⎣−3⎦
3 −2 8

To determine whether these vectors are linearly independent, we check


if the 0-vector can be non-trivially represented as a linear combination of
x1 , . . . , x3 . Therefore, we try to solve the homogeneous linear equation
system Σᵢ₌₁³ λᵢ xᵢ = 0 for λᵢ ∈ R. We use Gaussian elimination to solve
Ax = 0 with
⎡ ⎤
2 1 3
A = ⎣−1 1 −3⎦ ,
3 −2 8

which leads to the reduced row echelon form


⎡ ⎤
1 0 2
⎣0 1 −1⎦ .
0 0 0

This means that A is rank deficient/singular and, therefore, the three


vectors are linearly dependent. For example, with λ1 = 2, λ2 = −1, λ3 =
−1 we have a non-trivial linear combination Σᵢ₌₁³ λᵢ xᵢ = 0.
b.
⎡ ⎤ ⎡ ⎤ ⎡ ⎤
1 1 1
⎢2⎥ ⎢1 ⎥ ⎢0⎥
⎢ ⎥ ⎢ ⎥ ⎢ ⎥
x1 = ⎢
⎢1⎥ ,
⎥ x2 = ⎢
⎢0 ⎥ ,
⎥ x3 = ⎢
⎢0⎥

⎣0⎦ ⎣1 ⎦ ⎣1⎦
0 1 1


Here, we are looking at the distribution of 0s in the vectors. x1 is the


only vector whose third component is non-zero. Therefore, λ1 must be
0. Similarly, λ2 must be 0 because of the second component (already
conditioning on λ1 = 0). And finally, λ3 = 0 as well. Therefore, the
three vectors are linearly independent.
An alternative solution, using Gaussian elimination, is possible and would
lead to the same conclusion.
2.11 Write
⎡ ⎤
1
y = ⎣−2⎦
5

as linear combination of
⎡ ⎤ ⎡ ⎤ ⎡ ⎤
1 1 2
x1 = ⎣ 1 ⎦ , x2 = ⎣ 2 ⎦ , x3 = ⎣−1⎦
1 3 1
We are looking for λ1, λ2, λ3 ∈ R, such that Σᵢ₌₁³ λᵢ xᵢ = y. Therefore, we
need to solve the inhomogeneous linear equation system
⎡ ⎤
1 1 2 1
⎣ 1 2 −1 −2 ⎦
1 3 1 5

Using Gaussian elimination, we obtain λ1 = −6, λ2 = 3, λ3 = 2.
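A quick, optional NumPy verification (not from the original text) of these coefficients:

    import numpy as np

    X = np.array([[1, 1, 2], [1, 2, -1], [1, 3, 1]])   # columns are x1, x2, x3
    y = np.array([1, -2, 5])
    lam = np.linalg.solve(X, y)
    print(lam)        # -> [-6.  3.  2.]
    print(X @ lam)    # -> [ 1. -2.  5.], i.e. y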


2.12 Consider two subspaces of R4 :
⎡ ⎤ ⎡ ⎤ ⎡ ⎤ ⎡ ⎤ ⎡ ⎤ ⎡ ⎤
1 2 −1 −1 2 −3
⎢1⎥ ⎢−1⎥ ⎢1⎥ ⎢−2⎥ ⎢−2⎥ ⎢6⎥
U1 = span[⎢
⎣−3⎦
⎥ ,⎢
⎣0⎦
⎥ ,⎢
⎣−1⎦] ,
⎥ U2 = span[⎢
⎣2⎦
⎥ ,⎢
⎣0⎦
⎥ ,⎢
⎣−2⎦] .

1 −1 1 1 0 −1

Determine a basis of U1 ∩ U2 .
We start by checking whether the vectors in the generating sets of U1
(and U2 ) are linearly dependent. Thereby, we can determine bases of U1 and
U2 , which will make the following computations simpler.
We start with U1 . To see whether the three vectors are linearly dependent,
we need to find a linear combination of these vectors that allows a non-
trivial representation of 0, i.e., λ1 , λ2 , λ3 ∈ R, such that
⎡ ⎤ ⎡ ⎤ ⎡ ⎤ ⎡ ⎤
1 2 −1 0
⎢1⎥ ⎢−1⎥ ⎢ 1 ⎥ ⎢0 ⎥
⎣−3⎦ + λ2 ⎣ 0 ⎦ + λ3 ⎣−1⎦ = ⎣0⎦ .
λ1 ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥

1 −1 1 0

We see that necessarily: λ3 = −3λ1 (otherwise, the third component can


never be 0). With this, we get
⎡ ⎤ ⎡ ⎤ ⎡ ⎤
1+3 2 0
⎢ 1−3 ⎥ ⎢−1⎥ ⎢0⎥
⎣−3 + 3⎦ + λ2 ⎣ 0 ⎦ = ⎣0⎦
λ1 ⎢ ⎥ ⎢ ⎥ ⎢ ⎥

1−3 −1 0



⎡ ⎤ ⎡ ⎤ ⎡ ⎤
4 2 0
⎢−2⎥ ⎢−1⎥ ⎢0⎥
⇐⇒ λ1 ⎢
⎣ 0 ⎦ + λ2 ⎣ 0 ⎦ = ⎣0 ⎦
⎥ ⎢ ⎥ ⎢ ⎥

−2 −1 0

and, therefore, λ2 = −2λ1 . This means that there exists a non-trivial linear
combination of 0 using spanning vectors of U1 , for example: λ1 = 1, λ2 = −2
and λ3 = −3. Therefore, not all vectors in the generating set of U1 are
necessary, such that U1 can be more compactly represented as
⎡ ⎤ ⎡ ⎤
1 2
⎢ 1 ⎥ ⎢−1⎥
U1 = span[⎢
⎣−3⎦ , ⎣ 0 ⎦] .
⎥ ⎢ ⎥

1 −1

Now, we see whether the generating set of U2 is also a basis. We try again
whether we can find a non-trivial linear combination of 0 using the spanning
vectors of U2 , i.e., a triple (α1 , α2 , α3 ) ∈ R3 such that
⎡ ⎤ ⎡ ⎤ ⎡ ⎤ ⎡ ⎤
−1 2 −3 0
⎢−2⎥ ⎢−2⎥ ⎢ 6 ⎥ ⎢0 ⎥
α1 ⎢
⎣ 2 ⎦ + α2 ⎣ 0 ⎦ + α3 ⎣−2⎦ = ⎣0⎦ .
⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥

1 0 −1 0

Here, we see that necessarily α1 = α3 . Then, α2 = 2α1 gives a non-trivial


representation of 0, and the three vectors are linearly dependent. However,
any two of them are linearly independent, and we choose the first two vec-
tors of the generating set as a basis of U2 , such that
⎡ ⎤ ⎡ ⎤
−1 2
⎢−2⎥ ⎢−2⎥
U2 = span[⎢
⎣2⎦
⎥ ,⎢
⎣ 0 ⎦] .

1 0

Now, we determine U1 ∩ U2 . Let x be in R4 . Then,

x ∈ U1 ∩ U2 ⇐⇒ x ∈ U1 ∧ x ∈ U2
⎛ ⎡ ⎤ ⎡ ⎤⎞
−1 2
⎜ ⎢−2⎥ ⎢−2⎥⎟
⇐⇒ ∃λ1 , λ2 , α1 , α2 ∈ R : ⎜
⎝x = α 1 ⎣ 2 ⎦ + α 2 ⎣ 0 ⎦ ⎠
⎢ ⎥ ⎢ ⎥⎟

1 0
⎛ ⎡ ⎤ ⎡ ⎤⎞
1 2
⎜ ⎢1⎥ ⎢−1⎥⎟
∧⎜
⎝x = λ1 ⎣−3⎦ + λ2 ⎣ 0 ⎦⎠
⎢ ⎥ ⎢ ⎥⎟

1 −1
⎛ ⎡ ⎤ ⎡ ⎤⎞
−1 2
⎜ ⎢−2⎥ ⎢−2⎥⎟
⇐⇒ ∃λ1 , λ2 , α1 , α2 ∈ R : ⎜
⎝x = α 1 ⎣ 2 ⎦ + α 2 ⎣ 0 ⎦ ⎠
⎢ ⎥ ⎢ ⎥⎟

1 0

⎛ ⎡ ⎤ ⎡ ⎤ ⎡ ⎤ ⎡ ⎤⎞
1 2 −1 2
⎜ ⎢1⎥ ⎢−1⎥ ⎢−2⎥ ⎢−2⎥⎟
∧⎜
⎝λ1 ⎣−3⎦ + λ2 ⎣ 0 ⎦ = α1 ⎣ 2 ⎦ + α2 ⎣ 0 ⎦⎠
⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥⎟

1 −1 1 0

A general approach is to use Gaussian elimination to solve for either λ1 , λ2


or α1 , α2 . In this particular case, we can find the solution by careful inspec-
tion: From the third component, we see that we need −3λ1 = 2α1 and thus
α1 = − 32 λ1 . Then:

⎛ ⎡ ⎤ ⎡ ⎤⎞
−1 2
⎜ 3
⎢ −2 ⎥ ⎢ −2 ⎥⎟
x ∈ U1 ∩ U2 ⇐⇒ ∃λ1 , λ2 , α2 ∈ R : ⎜
⎝x = − 2 λ 1 ⎣ 2 ⎦ + α 2 ⎣ 0 ⎦ ⎠
⎢ ⎥ ⎢ ⎥⎟

1 0
⎛ ⎡ ⎤ ⎡ ⎤ ⎡ ⎤ ⎡ ⎤⎞
1 −1 2 2
⎜ ⎢ 1 ⎥ 3 ⎢−2⎥ ⎢−1⎥ ⎢−2⎥⎟
∧⎜
⎝λ1 ⎣−3⎦ + 2 λ1 ⎣ 2 ⎦ + λ2 ⎣ 0 ⎦ = α2 ⎣ 0 ⎦⎠
⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥⎟

1 1 −1 0
⎛ ⎡ ⎤ ⎡ ⎤⎞
−1 2
⎜ 3
⎢ −2 ⎥ ⎢ −2 ⎥⎟
⇐⇒ ∃λ1 , λ2 , α2 ∈ R : ⎜
⎝x = − 2 λ 1 ⎣ 2 ⎦ + α 2 ⎣ 0 ⎦ ⎠
⎢ ⎥ ⎢ ⎥⎟

1 0
⎛ ⎡ ⎤ ⎡ ⎤ ⎡ ⎤⎞
− 12 2 2
⎜ ⎢ −2 ⎥ ⎢−1⎥ ⎢−2⎥⎟
∧⎜
⎝λ 1 ⎣ 0 ⎦ + λ 2 ⎣ 0 ⎦ = α 2 ⎣ 0 ⎦ ⎠
⎢ ⎥ ⎢ ⎥ ⎢ ⎥⎟
5
2 −1 0

The last component requires that λ2 = 52 λ1 . Therefore,

⎛ ⎡ ⎤ ⎡ ⎤⎞
−1 2
⎜ 3
⎢−2⎥ ⎢−2⎥⎟
x ∈ U1 ∩ U2 ⇐⇒ ∃λ1 , α2 ∈ R : ⎜
⎝x = − 2 λ1 ⎣ 2 ⎦ + α2 ⎣ 0 ⎦⎠
⎢ ⎥ ⎢ ⎥⎟

1 0
⎛ ⎡ 9 ⎤ ⎡ ⎤⎞
2 2
⎜ ⎢− 9 ⎥ ⎢−2⎥⎟
∧⎜ ⎢ 2⎥
⎝λ 1 ⎣ 0 ⎦ = α 2 ⎣ 0 ⎦ ⎠
⎢ ⎥⎟

0 0
⎛ ⎡ ⎤ ⎡ ⎤⎞
−1 2
⎢−2⎥ ⎢−2⎥⎟
⎜ 3
⎢ ⎥ ⎢ ⎥⎟ 9
⎝x = − 2 λ1 ⎣ 2 ⎦ + α2 ⎣ 0 ⎦⎠ ∧ (α2 = 4 λ1 )
⇐⇒ ∃λ1 , α2 ∈ R : ⎜

1 0
⎛ ⎡ ⎤ ⎡ ⎤⎞
−1 2
⎜ 3
⎢−2⎥ 9 ⎢−2⎥⎟
⇐⇒ ∃λ1 ∈ R : ⎜
⎝x = − 2 λ1 ⎣ 2 ⎦ + 4 λ1 ⎣ 0 ⎦⎠
⎢ ⎥ ⎢ ⎥⎟

1 0



⎛ ⎡ ⎤ ⎡ ⎤⎞
−1 2
⎜ ⎢−2⎥ ⎢−2⎥⎟
⇐⇒ ∃λ1 ∈ R : ⎜
⎝x = −6λ1 ⎣ 2 ⎦ + 9λ1 ⎣ 0 ⎦⎠
⎢ ⎥ ⎢ ⎥⎟ (multiplied by 4)

1 0
⎛ ⎡ ⎤⎞
24
⎜ ⎢ −6 ⎥⎟
⇐⇒ ∃λ1 ∈ R : ⎜
⎝x = λ1 ⎣−12⎦⎠
⎢ ⎥⎟

−6
⎛ ⎡ ⎤⎞
4
⎜ ⎢−1⎥⎟
⇐⇒ ∃λ1 ∈ R : ⎜
⎝x = λ1 ⎣−2⎦⎠
⎢ ⎥⎟

−1

Thus, we have

U1 ∩ U2 = { λ1·[4, −1, −2, −1]⊤ | λ1 ∈ R } = span[[4, −1, −2, −1]⊤] ,

i.e., we obtain the vector space spanned by [4, −1, −2, −1]⊤ .
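An optional rank-based check (not part of the original solution) that [4, −1, −2, −1]⊤ indeed lies in both subspaces:

    import numpy as np

    U1 = np.array([[1, 1, -3, 1], [2, -1, 0, -1]]).T   # basis of U1 as columns
    U2 = np.array([[-1, -2, 2, 1], [2, -2, 0, 0]]).T   # basis of U2 as columns
    v = np.array([4, -1, -2, -1])

    # v lies in a subspace iff appending it to a basis does not increase the rank
    for U in (U1, U2):
        print(np.linalg.matrix_rank(U),
              np.linalg.matrix_rank(np.column_stack([U, v])))   # both print "2 2"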


2.13 Consider two subspaces U1 and U2 , where U1 is the solution space of the
homogeneous equation system A1 x = 0 and U2 is the solution space of the
homogeneous equation system A2 x = 0 with
⎡ ⎤ ⎡ ⎤
1 0 1 3 −3 0
⎢1 −2 −1⎥ ⎢1 2 3⎥
A1 = ⎢ ⎥, A2 = ⎢ ⎥.
⎣2 1 3 ⎦ ⎣7 −5 2⎦
1 0 1 3 −1 2

a. Determine the dimension of U1 , U2 .


We determine U1 by computing the reduced row echelon form of A1 as
⎡ ⎤
1 0 1
⎢0 1 1⎥
⎢ ⎥,
⎣0 0 0⎦
0 0 0

which gives us
⎡ ⎤
1
U1 = span[⎣ 1 ⎦] .
−1

Therefore, dim(U1 ) = 1. Similarly, we determine U2 by computing the


reduced row echelon form of A2 as
⎡ ⎤
1 0 1
⎢0 1 1⎥
⎢ ⎥,
⎣0 0 0⎦
0 0 0


which gives us
⎡ ⎤
1
U2 = span[⎣ 1 ⎦] .
−1

Therefore, dim(U2 ) = 1.
b. Determine bases of U1 and U2 .
The basis vector that spans both U1 and U2 is
⎡ ⎤
1
⎣1⎦.
−1

c. Determine a basis of U1 ∩ U2 .
Since both U1 and U2 are spanned by the same basis vector, it must be
that U1 = U2 , and the desired basis is
⎡ ⎤
1
U1 ∩ U2 = U1 = U2 = span[⎣ 1 ⎦] .
−1

2.14 Consider two subspaces U1 and U2 , where U1 is spanned by the columns of


A1 and U2 is spanned by the columns of A2 with
⎡ ⎤ ⎡ ⎤
1 0 1 3 −3 0
⎢1 −2 −1⎥ ⎢1 2 3⎥
A1 = ⎢ ⎥, A2 = ⎢ ⎥.
⎣2 1 3 ⎦ ⎣7 −5 2⎦
1 0 1 3 −1 2

a. Determine the dimension of U1 , U2


We start by noting that U1 , U2 ⊆ R4 since we are interested in the space
spanned by the columns of the corresponding matrices. Looking at A1 ,
we see that −d1 + d3 = d2 , where di are the columns of A1 . This means
that the second column can be expressed as a linear combination of d1
and d3 . d1 and d3 are linearly independent, i.e., dim(U1 ) = 2.
Similarly, for A2 , we see that the third column is the sum of the first two
columns, and again we arrive at dim(U2 ) = 2.
Alternatively, we can use Gaussian elimination to determine a set of
linearly independent columns in both matrices.
b. Determine bases of U1 and U2
A basis B of U1 is given by the first two columns of A1 (any pair of
columns would be fine), which are independent. A basis C of U2 is given
by the second and third columns of A2 (again, any pair of columns
would be a basis), such that
B = { [1, 1, 2, 1]⊤ , [0, −2, 1, 0]⊤ } ,   C = { [−3, 2, −5, −1]⊤ , [0, 3, 2, 2]⊤ }


c. Determine a basis of U1 ∩ U2
Let us call b1 , b2 , c1 and c2 the vectors of the bases B and C such that
B = {b1 , b2 } and C = {c1 , c2 }. Let x be in R4 . Then,

x ∈ U1 ∩ U2 ⇐⇒ ∃λ1 , λ2 , λ3 , λ4 ∈ R : (x = λ1 b1 + λ2 b2 ) ∧ (x = λ3 c1 + λ4 c2 )
⇐⇒ ∃λ1 , λ2 , λ3 , λ4 ∈ R : (x = λ1 b1 + λ2 b2 )
∧ (λ1 b1 + λ2 b2 = λ3 c1 + λ4 c2 )
⇐⇒ ∃λ1 , λ2 , λ3 , λ4 ∈ R : (x = λ1 b1 + λ2 b2 )
∧ (λ1 b1 + λ2 b2 − λ3 c1 − λ4 c2 = 0)

Let λ := [λ1 , λ2 , λ3 , λ4 ]⊤ . The last equation of the system can be written


as the linear system Aλ = 0, where we define the matrix A as the
concatenation of the column vectors b1 , b2 , −c1 and −c2 .
⎡ ⎤
1 0 3 0
⎢1 −2 −2 −3⎥
A=⎢ ⎥.
⎣2 1 5 −2⎦
1 0 1 −2

We solve this homogeneous linear system using Gaussian elimination.


⎡ ⎤
1 0 3 0
⎢ 1
⎢ ⎥ −R1
−2 −2 −3 ⎥
⎣ 2 1 5 −2 ⎦ −2R1
1 0 1 −2 −R1
⎡ ⎤
1 0 3 0
⎢ 0 −2 −5 −3 ⎥
⎥ +2R3
! ⎢
⎣ 0 1 −1 −2 ⎦ swap with R2
0 0 −2 −2 ·(− 21 )
⎡ ⎤ ⎡ ⎤
1 0 3 0 −3R4 1 0 0 −3
⎢ 0 1 −1 −2 ⎥
⎥ +R4
⎢ 0 1 0 −1 ⎥
! ⎢ ! ⎢ ⎥
⎣ 0 0 −7 −7 ⎦ +7R4 ⎣ 0 0 1 1 ⎦
0 0 1 1 swap with R3 0 0 0 0

From the reduced row echelon form we find that the set
⎡ ⎤
−3
⎢−1⎥
⎣ 1 ⎦]
S := span[⎢ ⎥

−1

describes the solution space of the system of equations in λ.


We can now resume our equivalence derivation and replace the homo-
geneous system with its solution space. It holds

x ∈ U1 ∩ U2 ⇐⇒ ∃λ1 , λ2 , λ3 , λ4 , α ∈ R : (x = λ1 b1 + λ2 b2 )
∧ ([λ1 , λ2 , λ3 , λ4 ]⊤ = α[−3, −1, 1, −1]⊤ )
⇐⇒ ∃α ∈ R : x = −3αb1 − αb2
⇐⇒ ∃α ∈ R : x = α[−3, −1, −7, −3]⊤


Finally,
⎡ ⎤
−3
⎢−1⎥
U1 ∩ U2 = span[⎢
⎣−7⎦] .

−3

Alternatively, we could have expressed the solutions of x in terms of b1


and c2 with the condition on λ being ∃α ∈ R : (λ3 = α) ∧ (λ4 = −α) to
obtain [3, 1, 7, 3]⊤ .
2.15 Let F = {(x, y, z) ∈ R3 | x+y−z = 0} and G = {(a−b, a+b, a−3b) | a, b ∈ R}.
a. Show that F and G are subspaces of R3 .
We have (0, 0, 0) ∈ F since 0 + 0 − 0 = 0.
Let a = (x, y, z) ∈ R3 and b = (x′ , y ′ , z ′ ) ∈ R3 be two elements of F . We
have: (x + y − z) + (x′ + y ′ − z ′ ) = 0 + 0 = 0 so a + b ∈ F .
Let λ ∈ R. We also have: λx + λy − λz = λ0 = 0 so λa ∈ F and thus F
is a subspace of R3 .
Similarly, we have (0, 0, 0) ∈ G by setting a and b to 0.
Let a, b, a′ and b′ be in R and let x = (a − b, a + b, a − 3b) and y =
(a′ − b′ , a′ + b′ , a′ − 3b′ ) be two elements of G. We have x + y = ((a +
a′ ) − (b + b′ ), (a + a′ ) + (b + b′ ), (a + a′ ) − 3(b + b′ )) and (a + a′ , b + b′ ) ∈ R2
so x + y ∈ G.
Let λ be in R. We have (λa, λb) ∈ R2 so λx ∈ G and thus G is a subspace
of R3 .
b. Calculate F ∩ G without resorting to any basis vector.
Combining both constraints, we have:

F ∩ G = {(a − b, a + b, a − 3b) | (a, b ∈ R) ∧ [(a − b) + (a + b) − (a − 3b) = 0]}


= {(a − b, a + b, a − 3b) | (a, b ∈ R) ∧ (a = −3b)}
= {(−4b, −2b, −6b) | b ∈ R}
= {(2b, b, 3b) | b ∈ R}
⎡ ⎤
2
= span[⎣1⎦]
3

c. Find one basis for F and one for G, calculate F ∩G using the basis vectors
previously found and check your result with the previous question.
We can see that F is a subset of R3 with one linear constraint. It thus
has dimension 2, and it suffices to find two independent vectors in F to
construct a basis. By setting (x, y) = (1, 0) and (x, y) = (0, 1) successively,
we obtain the following basis for F :
⎧⎡ ⎤ ⎡ ⎤⎫
⎨ 1 0 ⎬
⎣0⎦ , ⎣1⎦
⎩ ⎭
1 1

Let us consider the set G. We introduce u, v ∈ R and perform the fol-


lowing variable substitutions: u := a + b and v := a − b. Note that then


a = (u + v)/2 and b = (u − v)/2 and thus a − 3b = 2v − u, so that G can


be written as
G = {(v, u, 2v − u) | u, v ∈ R} .

The dimension of G is clearly 2 and a basis can be found by choosing


two independent vectors of G, e.g.,
⎧ ⎡ ⎤ ⎡ ⎤⎫
⎨ 1 0 ⎬
⎣0⎦ , ⎣ 1 ⎦
⎩ ⎭
2 −1

Let us now find F ∩ G. Let x ∈ R3 . It holds that


⎛ ⎡ ⎤ ⎡ ⎤⎞
1 0
x ∈ F ∩ G ⇐⇒ ∃λ1 , λ2 , µ1 , µ2 ∈ R : ⎝x = λ1 ⎣0⎦ + λ2 ⎣1⎦⎠
1 1
⎛ ⎡ ⎤ ⎡ ⎤⎞
1 0
∧ ⎝x = µ 1 ⎣ 0 ⎦ + µ 2 ⎣ 1 ⎦ ⎠
2 −1
⎛ ⎡ ⎤ ⎡ ⎤⎞
1 0
⇐⇒ ∃λ1 , λ2 , µ1 , µ2 ∈ R : ⎝x = λ1 ⎣0⎦ + λ2 ⎣1⎦⎠
1 1
⎛ ⎡ ⎤ ⎡ ⎤ ⎡ ⎤ ⎡ ⎤ ⎞
1 0 1 0
∧ ⎝λ 1 ⎣ 0 ⎦ + λ 2 ⎣ 1 ⎦ + µ 1 ⎣ 0 ⎦ + µ 2 ⎣ 1 ⎦ = 0 ⎠
1 1 2 −1

Note that for simplicity purposes, we have not reversed the sign of the
coefficients for µ1 and µ2 , which we can do since we could replace µ1
by −µ1 .
The latter equation is a linear system in [λ1 , λ2 , µ1 , µ2 ]⊤ that we solve
next.

⎡ ⎤ ⎡ ⎤
1 0 1 0 1 0 0 2
⎣ 0 1 0 1 ⎦(· · · ) ! ⎣ 0 1 0 1 ⎦
1 1 2 −1 0 0 1 −2

The solution space for (λ1 , λ2 , µ1 , µ2 ) is therefore


⎡ ⎤
2
⎢1⎥
span[⎢
⎣−2⎦] ,

−1

and we can resume our equivalence


⎡ ⎤ ⎡ ⎤
1 0
x ∈ F ∩ G ⇐⇒ ∃α ∈ R : x = 2α ⎣0⎦ + α ⎣1⎦
1 1
⎡ ⎤
2
⇐⇒ ∃α ∈ R : x = α ⎣1⎦ ,
3


which yields the same result as the previous question, i.e.,


⎡ ⎤
2
F ∩ G = span[⎣1⎦] .
3

2.16 Are the following mappings linear?


Recall: To show that Φ is a linear mapping from E to F , we need to show
that for all x and y in E and all λ in R:
Φ(x + y) = Φ(x) + Φ(y)
Φ(λx) = λΦ(x)

a. Let a, b ∈ R.

Φ : L¹([a, b]) → R
f ↦ Φ(f) = ∫ₐᵇ f(x) dx ,

where L1 ([a, b]) denotes the set of integrable functions on [a, b].
Let f, g ∈ L1 ([a, b]). It holds that
Φ(f) + Φ(g) = ∫ₐᵇ f(x) dx + ∫ₐᵇ g(x) dx = ∫ₐᵇ (f(x) + g(x)) dx = Φ(f + g) .

For λ ∈ R we have
Φ(λf) = ∫ₐᵇ λf(x) dx = λ ∫ₐᵇ f(x) dx = λΦ(f) .

Therefore, Φ is a linear mapping. (In more advanced courses/books,


you may learn that Φ is a linear functional, i.e., it takes functions as
arguments. But for our purposes here this is not relevant.)
b.

Φ : C¹ → C⁰
f ↦ Φ(f) = f′ ,

where for k ≥ 1, Cᵏ denotes the set of k times continuously differentiable functions, and C⁰ denotes the set of continuous functions.
For f, g ∈ C¹ we have
For f, g ∈ C 1 we have

Φ(f + g) = (f + g)′ = f ′ + g ′ = Φ(f ) + Φ(g)

For λ ∈ R we have

Φ(λf ) = (λf )′ = λf ′ = λΦ(f )

Therefore, Φ is linear. (Again, Φ is a linear functional.)


From the first two exercises, we have seen that both integration and
differentiation are linear operations.


c.

Φ : R → R
x ↦ Φ(x) = cos(x)

We have cos(π) = −1 and cos(2π) = 1, which is different from 2 cos(π) = −2, i.e., Φ(2π) ≠ 2Φ(π).

Therefore, Φ is not linear.
d.

Φ : R³ → R²
x ↦ ⎡1 2 3⎤ x
    ⎣1 4 3⎦

We define the matrix as A. Let x and y be in R3 . Let λ be in R. Then:

Φ(x + y) = A(x + y) = Ax + Ay = Φ(x) + Φ(y) .

Similarly,
Φ(λx) = A(λx) = λAx = λΦ(x) .

Therefore, this mapping is linear.


e. Let θ be in [0, 2π[ and

Φ : R² → R²
x ↦ ⎡ cos(θ)  sin(θ)⎤ x
    ⎣−sin(θ)  cos(θ)⎦

We define the (rotation) matrix as A. Then the reasoning is identical to


the previous question. Therefore, this mapping is linear.
The mapping Φ represents a rotation of x by an angle θ. Rotations are
also linear mappings.
2.17 Consider the linear mapping

Φ : R3 → R4
⎡ ⎤
⎛⎡ ⎤⎞ 3x1 + 2x2 + x3
x1 ⎢ x1 + x2 + x3 ⎥
Φ ⎝⎣ x 2 ⎦ ⎠ = ⎢
⎣ x1 − 3x2 ⎦

x3
2x1 + 3x2 + x3

Find the transformation matrix AΦ .


Determine rk(AΦ ).
Compute the kernel and image of Φ. What are dim(ker(Φ)) and dim(Im(Φ))?

The transformation matrix is


     ⎡3  2 1⎤
AΦ = ⎢1  1 1⎥
     ⎢1 −3 0⎥
     ⎣2  3 1⎦

(reading the coefficients of x1, x2, x3 off the mapping above).


The rank of AΦ is the number of linearly independent rows/columns. We


use Gaussian elimination on AΦ to determine the reduced row eche-
lon form (Not necessary to identify the number of linearly independent
rows/columns, but useful for the next questions.):
⎡ ⎤
1 0 0
⎢0 1 0⎥
⎢ ⎥.
⎣0 0 1⎦
0 0 0

From here, we see that rk(AΦ ) = 3.


ker(Φ) = 0 and dim(ker(Φ)) = 0. From the reduced row echelon form,
we see that all three columns of AΦ are linearly independent. Therefore,
they form a basis of Im(Φ), and dim(Im(Φ)) = 3.
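An optional NumPy check (not from the original text), using the transformation matrix read off from the mapping above:

    import numpy as np

    A_phi = np.array([[3, 2, 1], [1, 1, 1], [1, -3, 0], [2, 3, 1]])
    print(np.linalg.matrix_rank(A_phi))   # 3, so dim(Im(Phi)) = 3 and dim(ker(Phi)) = 0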
2.18 Let E be a vector space. Let f and g be two automorphisms on E such that
f ◦ g = idE (i.e., f ◦ g is the identity mapping idE ). Show that ker(f ) =
ker(g ◦ f ), Im(g) = Im(g ◦ f ) and that ker(f ) ∩ Im(g) = {0E }.
Let x ∈ ker(f ). We have g(f (x)) = g(0) = 0 since g is linear. Therefore,
ker(f ) ⊆ ker(g ◦ f ) (this always holds).
Let x ∈ ker(g ◦ f ). We have g(f (x)) = 0 and as f is linear, f (g(f (x))) =
f (0) = 0. This implies that (f ◦ g)(f (x)) = 0 so that f (x) = 0 since
f ◦ g = idE . So ker(g ◦ f ) ⊆ ker(f ) and thus ker(g ◦ f ) = ker(f ).

Let y ∈ Im(g ◦ f ) and let x ∈ E so that y = (g ◦ f )(x). Then y = g(f (x)),


which shows that Im(g ◦ f ) ⊆ Im(g) (which is always true).
Let y ∈ Im(g). Let then x ∈ E such that y = g(x). We have y = g((f ◦ g)(x))
and thus y = (g ◦ f )(g(x)), which means that y ∈ Im(g ◦ f ). Therefore,
Im(g) ⊆ Im(g ◦ f ). Overall, Im(g) = Im(g ◦ f ).

Let y ∈ ker(f ) ∩ Im(g). Let x ∈ E such that y = g(x). Applying f gives


us f (y) = (f ◦ g)(x) and as y ∈ ker(f ), we have 0 = x. This means that
ker(f ) ∩ Im(g) ⊆ {0} but the intersection of two subspaces is a subspace and
thus always contains 0, so ker(f ) ∩ Im(g) = {0}.
2.19 Consider an endomorphism Φ : R3 → R3 whose transformation matrix
(with respect to the standard basis in R3 ) is
⎡ ⎤
1 1 0
AΦ = ⎣1 −1 0⎦ .
1 1 1

a. Determine ker(Φ) and Im(Φ).


The image Im(Φ) is spanned by the columns of AΦ. To determine a basis, we need to find the smallest generating set of the columns of AΦ. This can be done by Gaussian elimination. However, in this case,
it is quite obvious that AΦ has full rank, i.e., the set of columns is already
minimal, such that
Im(Φ) = span[[1, 1, 1]⊤ , [1, −1, 1]⊤ , [0, 0, 1]⊤] = R³


We know that dim(Im(Φ)) = 3. Using the rank-nullity theorem, we get


that dim(ker(Φ)) = 3 − dim(Im(Φ)) = 0, and ker(Φ) = {0} consists of the
0-vector alone.
b. Determine the transformation matrix ÃΦ with respect to the basis
⎡ ⎤ ⎡ ⎤ ⎡ ⎤
1 1 1
B = ( ⎣1 ⎦ , ⎣2 ⎦ , ⎣0 ⎦) ,
1 1 0

i.e., perform a basis change toward the new basis B .


Let B the matrix built out of the basis vectors of B (order is important):
⎡ ⎤
1 1 1
B = ⎣1 2 0⎦ .
1 1 0

Then, ÃΦ = B −1 AΦ B . The inverse is given by


⎡ ⎤
0 −1 2
B −1 = ⎣0 1 −1⎦ ,
1 0 −1

and the desired transformation matrix of Φ with respect to the new basis
B of R3 is
⎡ ⎤⎡ ⎤⎡ ⎤ ⎡ ⎤⎡ ⎤
0 −1 2 1 1 0 1 1 1 1 3 2 1 1 1
⎣0 1 −1⎦ ⎣1 −1 0 ⎦ ⎣1 2 0 ⎦ = ⎣0 −2 −1⎦ ⎣1 2 0⎦
1 0 −1 1 1 1 1 1 0 0 0 −1 1 1 0
⎡ ⎤
6 9 1
= ⎣−3 −5 0⎦ .
−1 −1 0
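The basis change can also be verified numerically; a short optional NumPy sketch (not part of the original solution):

    import numpy as np

    A_phi = np.array([[1, 1, 0], [1, -1, 0], [1, 1, 1]])
    B = np.array([[1, 1, 1], [1, 2, 0], [1, 1, 0]])   # new basis vectors as columns
    print(np.linalg.inv(B) @ A_phi @ B)   # -> [[ 6  9  1], [-3 -5  0], [-1 -1  0]]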

2.20 Let us consider b1 , b2 , b′1 , b′2 , 4 vectors of R2 expressed in the standard basis
of R2 as
b1 = [2, 1]⊤ ,  b2 = [−1, −1]⊤ ,  b′1 = [2, −2]⊤ ,  b′2 = [1, 1]⊤

and let us define two ordered bases B = (b1 , b2 ) and B ′ = (b′1 , b′2 ) of R2 .
a. Show that B and B ′ are two bases of R2 and draw those basis vectors.
The vectors b1 and b2 are clearly linearly independent and so are b′1 and
b′2 .
b. Compute the matrix P 1 that performs a basis change from B ′ to B .
We need to express the vector b′1 (and b′2 ) in terms of the vectors b1
and b2 . In other words, we want to find the real coefficients λ1 and λ2
such that b′1 = λ1 b1 + λ2 b2 . In order to do that, we will solve the linear
equation system
= ′
>
b1 b2 b1

i.e.,
0 1
2 −1 2
1 −1 −2


and which results in the reduced row echelon form


0 1
1 0 4
.
0 1 6

This gives us b′1 = 4b1 + 6b2 .


Similarly for b′2 , Gaussian elimination gives us b′2 = −1b2 .
Thus, the matrix that performs a basis change from B ′ to B is given as
P1 = ⎡4  0⎤
     ⎣6 −1⎦ .

c. We consider c1 , c2 , c3 , three vectors of R3 defined in the standard basis


of R as
c1 = [1, 2, −1]⊤ ,  c2 = [0, −1, 2]⊤ ,  c3 = [1, 0, −1]⊤

and we define C = (c1 , c2 , c3 ).


(i) Show that C is a basis of R3 , e.g., by using determinants (see
Section 4.1).
We have:
, ,
,1 0 1 ,,
,
det(c1 , c2 , c3 ) = ,, 2 −1 0 ,, = 4 ̸= 0
,−1 2 −1,

Therefore, C is regular, and the columns of C are linearly inde-


pendent, i.e., they form a basis of R3 .
(ii) Let us call C ′ = (c′1 , c′2 , c′3 ) the standard basis of R3 . Determine
the matrix P 2 that performs the basis change from C to C ′ .
In order to write the matrix that performs a basis change from
C to C ′ , we need to express the vectors of C in terms of those
of C ′ . But as C ′ is the standard basis, it is straightforward that
c1 = 1c′1 + 2c′2 − 1c′3 for example. Therefore,
⎡ ⎤
1 0 1
P 2 := ⎣ 2 −1 0 ⎦.
−1 2 −1

simply contains the column vectors of C (this would not be the


case if C ′ was not the standard basis).
d. We consider a homomorphism Φ : R2 −→ R3 , such that

Φ(b1 + b2 ) = c2 + c3
Φ(b1 − b2 ) = 2c1 − c2 + 3c3

where B = (b1 , b2 ) and C = (c1 , c2 , c3 ) are ordered bases of R2 and R3 ,


respectively.
Determine the transformation matrix AΦ of Φ with respect to the or-
dered bases B and C .


Adding and subtracting both equations gives us


?
Φ(b1 + b2 ) + Φ(b1 − b2 ) = 2c1 + 4c3
Φ(b1 + b2 ) − Φ(b1 − b2 ) = −2c1 + 2c2 − 2c3
As Φ is linear, we obtain

?
Φ(2b1 ) = 2c1 + 4c3
Φ(2b2 ) = −2c1 + 2c2 − 2c3
And by linearity of Φ again, the system of equations gives us

?
Φ(b1 ) = c1 + 2c3
.
Φ(b2 ) = −c1 + c2 − c3
Therefore, the transformation matrix of AΦ with respect to the bases B
and C is
⎡ ⎤
1 −1
A Φ = ⎣0 1 ⎦.
2 −1

e. Determine A′ , the transformation matrix of Φ with respect to the bases


B ′ and C ′ .
We have:
               ⎡ 1  0  1⎤ ⎡1 −1⎤            ⎡  0  2⎤
A′ = P2 A P1 = ⎢ 2 −1  0⎥ ⎢0  1⎥ ⎡4  0⎤  =  ⎢−10  3⎥ .
               ⎣−1  2 −1⎦ ⎣2 −1⎦ ⎣6 −1⎦     ⎣ 12 −4⎦

f. Let us consider the vector x ∈ R2 whose coordinates in B ′ are [2, 3]⊤ .


In other words, x = 2b′1 + 3b′2 .
(i) Calculate the coordinates of x in B .
By definition of P 1 , x can be written in B as
P1 · [2, 3]⊤ = [8, 9]⊤ .

(ii) Based on that, compute the coordinates of Φ(x) expressed in C .


Using the transformation matrix A of Φ with respect to the bases
B and C , we get the coordinates of Φ(x) in C with
A · [8, 9]⊤ = [−1, 9, 7]⊤ .

(iii) Then, write Φ(x) in terms of c′1 , c′2 , c′3 .


Going back to the basis C ′ thanks to the matrix P 2 gives us the
expression of Φ(x) in C ′
P2 · [−1, 9, 7]⊤ = [6, −11, 12]⊤


In other words, Φ(x) = 6c′1 − 11c′2 + 12c′3 .


(iv) Use the representation of x in B ′ and the matrix A′ to find this
result directly.
We can calculate Φ(x) in C ′ directly:

A′ · [2, 3]⊤ = [6, −11, 12]⊤ .
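An optional NumPy consistency check (not from the original text) of parts e and f:

    import numpy as np

    P1 = np.array([[4, 0], [6, -1]])
    P2 = np.array([[1, 0, 1], [2, -1, 0], [-1, 2, -1]])
    A = np.array([[1, -1], [0, 1], [2, -1]])     # transformation matrix w.r.t. B and C

    A_prime = P2 @ A @ P1
    x = np.array([2, 3])                          # coordinates of x in B'
    print(A_prime)            # [[0 2], [-10 3], [12 -4]]
    print(P2 @ A @ (P1 @ x))  # [  6 -11  12]
    print(A_prime @ x)        # [  6 -11  12], the same result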


Analytic Geometry

Exercises
3.1 Show that ⟨·, ·⟩ defined for all x = [x1 , x2 ]⊤ ∈ R2 and y = [y1 , y2 ]⊤ ∈ R2 by

⟨x, y⟩ := x1 y1 − (x1 y2 + x2 y1 ) + 2(x2 y2 )

is an inner product.
We need to show that ⟨x, y⟩ is a symmetric, positive definite bilinear form.
Let x := [x1 , x2 ]⊤ , y = [y1 , y2 ]⊤ ∈ R2 . Then,

⟨x, y⟩ = x1 y1 − (x1 y2 + x2 y1 ) + 2x2 y2


= y1 x1 − (y1 x2 + y2 x1 ) + 2y2 x2 = ⟨y, x⟩ ,

where we exploited the commutativity of addition and multiplication in


R. Therefore, ⟨·, ·⟩ is symmetric.
It holds that

⟨x, x⟩ = x1² − 2x1x2 + 2x2² = (x1 − x2)² + x2² .

This is a sum of positive terms for x ̸= 0. Moreover, this expression shows


that if ⟨x, x⟩ = 0 then x2 = 0 and then x1 = 0, i.e., x = 0. Hence, ⟨·, ·⟩ is
positive definite.
In order to show that ⟨·, ·⟩ is bilinear (linear in both arguments), we will
simply show that ⟨·, ·⟩ is linear in its first argument. Symmetry will ensure
that ⟨·, ·⟩ is bilinear. Do not duplicate the proof of linearity in both argu-
ments.
Let z = [z1 , z2 ]⊤ ∈ R2 and λ ∈ R. Then,
⟨x + y, z⟩ = (x1 + y1)z1 − ((x1 + y1)z2 + (x2 + y2)z1) + 2(x2 + y2)z2
           = x1z1 − (x1z2 + x2z1) + 2(x2z2) + y1z1 − (y1z2 + y2z1) + 2(y2z2)
           = ⟨x, z⟩ + ⟨y, z⟩
⟨λx, y⟩ = λx1y1 − (λx1y2 + λx2y1) + 2(λx2y2)
        = λ(x1y1 − (x1y2 + x2y1) + 2(x2y2)) = λ⟨x, y⟩

Thus, ⟨·, ·⟩ is linear in its first variable. By symmetry, it is bilinear. Overall,


⟨·, ·⟩ is an inner product.


3.2 Consider R2 with ⟨·, ·⟩ defined for all x and y in R2 as


⟨x, y⟩ := x⊤ A y ,   where   A := ⎡2 0⎤
                                  ⎣1 2⎦ .

Is ⟨·, ·⟩ an inner product?


Let us define x and y as x = [1, 0]⊤ and y = [0, 1]⊤ .
We have ⟨x, y⟩ = 0 but ⟨y, x⟩ = 1, i.e., ⟨·, ·⟩ is not symmetric. Therefore, it is
not an inner product.
3.3 Compute the distance between
⎡ ⎤ ⎡ ⎤
1 −1
x = ⎣2⎦ , y = ⎣−1⎦
3 0
using
a. ⟨x, y⟩ := x⊤ y ⎡ ⎤
2 1 0

b. ⟨x, y⟩ := x Ay , A := ⎣1 3 −1⎦
0 −1 2
The difference vector is
⎡ ⎤
2
z = x − y = ⎣3 ⎦ .
4
a. ∥z∥ = √(z⊤z) = √29
b. ∥z∥ = √(z⊤Az) = √55
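An optional NumPy check of both distances (not part of the original solution):

    import numpy as np

    x = np.array([1, 2, 3]); y = np.array([-1, -1, 0])
    A = np.array([[2, 1, 0], [1, 3, -1], [0, -1, 2]])
    z = x - y
    print(np.sqrt(z @ z))       # sqrt(29) ~ 5.385
    print(np.sqrt(z @ A @ z))   # sqrt(55) ~ 7.416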
3.4 Compute the angle between
0 1 0 1
1 −1
x= , y=
2 −1
using
a. ⟨x, y⟩ := x⊤ y 0 1
⊤ 2 1
b. ⟨x, y⟩ := x By , B :=
1 3
It holds that
⟨x, y⟩
cos ω = ,
∥x∥ ∥y∥
where ω is the angle between x and y .
a.
   cos ω = −3/(√5 · √2) = −3/√10 , so ω ≈ 2.82 rad ≈ 161.5°
b.
   cos ω = x⊤By / (√(x⊤Bx) · √(y⊤By)) = −11/(√18 · √7) = −11/√126 , so ω ≈ 2.94 rad ≈ 168.5°
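An optional NumPy check of both angles (not part of the original solution):

    import numpy as np

    x = np.array([1, 2]); y = np.array([-1, -1])
    B = np.array([[2, 1], [1, 3]])

    cos_a = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
    cos_b = (x @ B @ y) / (np.sqrt(x @ B @ x) * np.sqrt(y @ B @ y))
    print(np.arccos(cos_a), np.degrees(np.arccos(cos_a)))   # ~2.82 rad, ~161.6 deg
    print(np.arccos(cos_b), np.degrees(np.arccos(cos_b)))   # ~2.94 rad, ~168.5 deg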


3.5 Consider the Euclidean vector space R5 with the dot product. A subspace
U ⊆ R5 and x ∈ R5 are given by
U = span[[0, −1, 2, 0, 2]⊤ , [1, −3, 1, −1, 2]⊤ , [−3, 4, 1, 2, 1]⊤ , [−1, −3, 5, 0, 7]⊤] ,   x = [−1, −9, −1, 4, 1]⊤
2 2 1 7 1

a. Determine the orthogonal projection πU (x) of x onto U


First, we determine a basis of U . Writing the spanning vectors as the
columns of a matrix A, we use Gaussian elimination to bring A into
(reduced) row echelon form:
⎡ ⎤
1 0 0 1
⎢0 1 0 2⎥
⎢ ⎥
⎢0 0 1 1⎥
⎢ ⎥
⎣0 0 0 0⎦
0 0 0 0

From here, we see that the first three columns are pivot columns, i.e.,
the first three vectors in the generating set of U form a basis of U :
⎡ ⎤ ⎡ ⎤ ⎡ ⎤
0 1 −3
⎢−1⎥ ⎢−3⎥ ⎢4⎥
⎢ ⎥ ⎢ ⎥ ⎢ ⎥
U = span[⎢
⎢ 2 ⎥,
⎥ ⎢ 1 ⎥,
⎢ ⎥
⎢ 1 ⎥] .
⎢ ⎥
⎣0⎦ ⎣−1⎦ ⎣2⎦
2 2 1

Now, we define
⎡ ⎤
0 1 −3
⎢−1 −3 4 ⎥
⎢ ⎥
B=⎢
⎢2 1 ⎥,
1 ⎥
⎣0 −1 2 ⎦
2 2 1

where we define three basis vectors bi of U as the columns of B for


1 # i # 3.
We know that the projection of x on U exists and we define p := πU (x).
Moreover, we know that p ∈ U . We define λ := [λ1 , λ2 , λ3 ]⊤ ∈ R3 , such
5
that p can be written p = 3i=1 λi bi = Bλ.
As p is the orthogonal projection of x onto U , then x − p is orthogonal
to all the basis vectors of U , so that

B ⊤ (x − Bλ) = 0 .

Therefore,
B ⊤ Bλ = B ⊤ x .

Solving in λ the inhomogeneous system B ⊤ Bλ = B ⊤ x gives us a single


solution
⎡ ⎤
−3
λ=⎣ 4 ⎦
1

and, therefore, the desired projection


⎡ ⎤
1
⎢−5⎥
⎢ ⎥
⎢−1⎥ ∈ U .
p = Bλ = ⎢ ⎥
⎣−2⎦
3

b. Determine the distance d(x, U )


The distance is simply the length of x − p:
∥x − p∥ = ∥[−2, −4, 0, 6, −2]⊤∥ = √60
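An optional NumPy sketch (not part of the original solution) that recomputes the projection via the normal equations, using the basis B and the vector x as they appear in this solution:

    import numpy as np

    B = np.array([[0, 1, -3],
                  [-1, -3, 4],
                  [2, 1, 1],
                  [0, -1, 2],
                  [2, 2, 1]])
    x = np.array([-1, -9, -1, 4, 1])

    lam = np.linalg.solve(B.T @ B, B.T @ x)   # normal equations B^T B lambda = B^T x
    p = B @ lam
    print(lam)                     # [-3.  4.  1.]
    print(p)                       # [ 1. -5. -1. -2.  3.]
    print(np.linalg.norm(x - p))   # sqrt(60) ~ 7.746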

3.6 Consider R3 with the inner product


⎡ ⎤
2 1 0
⟨x, y⟩ := x⊤ ⎣1 2 −1⎦ y .
0 −1 2

Furthermore, we define e1 , e2 , e3 as the standard/canonical basis in R3 .


a. Determine the orthogonal projection πU (e2 ) of e2 onto

U = span[e1 , e3 ] .

Hint: Orthogonality is defined through the inner product.


Let p = πU (e2 ). As p ∈ U , we can define Λ = (λ1 , λ3 ) ∈ R2 such that p
can be written p = U Λ. In fact, p becomes p = λ1 e1 +λ3 e3 = [λ1 , 0, λ3 ]⊤
expressed in the canonical basis.
Now, we know by orthogonal projection that

p = πU (e2 ) =⇒ (p − e2 ) ⊥ U
0 1 0 1
⟨p − e2 , e1 ⟩ 0
=⇒ =
⟨p − e2 , e3 ⟩ 0
0 1 0 1
⟨p, e1 ⟩ − ⟨e2 , e1 ⟩ 0
=⇒ =
⟨p, e3 ⟩ − ⟨e2 , e3 ⟩ 0

We compute the individual components as


⎡ ⎤⎡ ⎤
= > 2 1 0 1
⟨p, e1 ⟩ = λ1 0 λ3 ⎣1 2 −1⎦ ⎣0⎦ = 2λ1
0 −1 2 0



⎡ ⎤⎡ ⎤
= > 2 1 0 0
⟨p, e3 ⟩ = λ1 0 λ3 ⎣1 2 −1⎦ ⎣1⎦ = 2λ3
0 −1 2 0
⎡ ⎤⎡ ⎤⎡ ⎤
0 2 1 0 1
⟨e2 , e1 ⟩ = ⎣1⎦ ⎣1 2 −1⎦ ⎣0⎦ = 1
0 0 −1 2 0
⎡ ⎤⎡ ⎤
= > 2 1 0 0
⟨e2 , e3 ⟩ = 0 1 0 ⎣1 2 −1 ⎦ ⎣ 0⎦ = −1
0 −1 2 1

This now leads to the inhomogeneous linear equation system

2λ1 = 1
2λ3 = −1

This immediately gives the coordinates of the projection as


⎡ 1 ⎤
2
πU (e2 ) = ⎣ 0 ⎦
− 12

b. Compute the distance d(e2 , U ).


The distance of d(e2 , U ) is the distance between e2 and its orthogonal
projection p = πU (e2 ) onto U . Therefore,
D
d(e2 , U ) = ⟨p − e2 , p − e2 ⟩2 .

However,
⎡ ⎤⎡ 1 ⎤
=1 > 2 1 0 2
⟨p − e2 , p − e2 ⟩ = 2 −1 − 1 ⎣1
2 2 −1⎦ ⎣ −1 ⎦ = 1 ,
0 −1 2 − 21
B
which yields d(e2 , U ) = ⟨p − e2 , p − e2 ⟩ = 1
c. Draw the scenario: standard basis vectors and πU (e2 )
See Figure 3.1.
3.7 Let V be a vector space and π an endomorphism of V .
a. Prove that π is a projection if and only if idV − π is a projection, where
idV is the identity endomorphism on V .
b. Assume now that π is a projection. Calculate Im(idV −π) and ker(idV −π)
as a function of Im(π) and ker(π).

a. It holds that (idV − π)2 = idV − 2π + π 2 . Therefore,

(idV − π)2 = idV − π ⇐⇒ π 2 = π ,

which is exactly what we want. Note that we reasoned directly at the


endomorphism level, but one can also take any x ∈ V and prove the
same results. Also note that π 2 means π ◦ π as in “π composed with π ”.


(Figure 3.1, referenced in Exercise 3.6, shows the standard basis vectors e1, e2, e3 and the projection πU(e2); the figure itself is not reproduced here.)
b. We have π ◦ (idV − π) = π − π 2 = 0V , where 0V represents the null
endomorphism. Then Im(idV − π) ⊆ ker(π).
Conversely, let x ∈ ker(π). Then

(idV − π)(x) = x − π(x) = x ,

which means that x is the image of itself by idV −π . Hence, x ∈ Im(idV −


π). In other words, ker(π) ⊆ Im(idV − π) and thus ker(π) = Im(idV − π).

Similarly, we have

(idV − π) ◦ π = π − π 2 = π − π = 0V

so Im(π) ⊆ ker(idV − π).


Conversely, let x ∈ ker(idV − π). We have (idV − π)(x) = 0 and thus
x − π(x) = 0 or x = π(x). This means that x is its own image by π , and
therefore ker(idV − π) ⊆ Im(π). Overall,

ker(idV − π) = Im(π) .

3.8 Using the Gram-Schmidt method, turn the basis B = (b1 , b2 ) of a two-
dimensional subspace U ⊆ R3 into an ONB C = (c1 , c2 ) of U , where
⎡ ⎤ ⎡ ⎤
1 −1
b1 := ⎣1⎦ , b2 := ⎣ 2 ⎦ .
1 0


We start by normalizing b1:

c1 := b1/∥b1∥ = (1/√3)·[1, 1, 1]⊤ .        (3.1)

To get c2, we project b2 onto the subspace spanned by c1. Since ∥c1∥ = 1, this projection is

(c1⊤ b2) c1 = (1/3)·[1, 1, 1]⊤ ∈ U .

By subtracting this projection (a multiple of c1) from b2, we get a vector that is orthogonal to c1:

c̃2 := b2 − (c1⊤ b2) c1 = [−1, 2, 0]⊤ − (1/3)·[1, 1, 1]⊤ = (1/3)·[−4, 5, −1]⊤ = (1/3)(−b1 + 3b2) ∈ U .

Normalizing c̃2 yields

c2 = c̃2/∥c̃2∥ = (1/√42)·[−4, 5, −1]⊤ .

We see that c1 ⊥ c2 and that ∥c1∥ = 1 = ∥c2∥. Moreover, since c1, c2 ∈ U, it follows that (c1, c2) is an ONB of U.
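An optional NumPy check of the Gram-Schmidt step (not part of the original solution):

    import numpy as np

    b1 = np.array([1., 1., 1.]); b2 = np.array([-1., 2., 0.])
    c1 = b1 / np.linalg.norm(b1)
    c2_tilde = b2 - (c1 @ b2) * c1
    c2 = c2_tilde / np.linalg.norm(c2_tilde)
    print(c1)        # (1/sqrt(3)) * [1, 1, 1]
    print(c2)        # (1/sqrt(42)) * [-4, 5, -1]
    print(c1 @ c2)   # ~0, i.e. orthogonal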
3.9 Let n ∈ N and let x1 , . . . , xn > 0 be n positive real numbers so that x1 +
· · · + xn = 1. Use the Cauchy-Schwarz inequality and show that
a. Σᵢ₌₁ⁿ xᵢ² ≥ 1/n
b. Σᵢ₌₁ⁿ 1/xᵢ ≥ n²
Hint: Think about the dot product on Rn . Then, choose specific vectors
x, y ∈ Rn and apply the Cauchy-Schwarz inequality.
Recall Cauchy-Schwarz inequality expressed with the dot product in Rn .
Let x = [x1 , . . . , xn ]⊤ and y = [y1 , . . . , yn ]⊤ be two vectors of Rn . Cauchy-
Schwarz tells us that
⟨x, y⟩2 # ⟨x, x⟩ · ⟨y, y⟩ ,
which, applied with the dot product in Rn , can be rephrased as
(Σᵢ₌₁ⁿ xᵢ yᵢ)² ≤ (Σᵢ₌₁ⁿ xᵢ²) · (Σᵢ₌₁ⁿ yᵢ²) .

a. Consider x = [x1 , . . . , xn ] as defined in the question. Let us choose
y = [1, . . . , 1]⊤ . Then, the Cauchy-Schwarz inequality becomes
(Σᵢ₌₁ⁿ xᵢ · 1)² ≤ (Σᵢ₌₁ⁿ xᵢ²) · (Σᵢ₌₁ⁿ 1²)

and thus

1 ≤ (Σᵢ₌₁ⁿ xᵢ²) · n ,


which yields the expected result.


b. Let us now choose both vectors differently to obtain the expected result.
Let x = [1/√x1 , . . . , 1/√xn]⊤ and y = [√x1 , . . . , √xn]⊤ . Note that this choice is legal since all xi are strictly positive. The Cauchy-Schwarz inequality now becomes

(Σᵢ₌₁ⁿ (1/√xᵢ) · √xᵢ)² ≤ (Σᵢ₌₁ⁿ 1/xᵢ) · (Σᵢ₌₁ⁿ xᵢ)

so that

n² ≤ (Σᵢ₌₁ⁿ 1/xᵢ) · (Σᵢ₌₁ⁿ xᵢ) = (Σᵢ₌₁ⁿ 1/xᵢ) · 1 ,

which gives the expected result.
3.10 Rotate the vectors
x1 := [2, 3]⊤ ,   x2 := [0, −1]⊤

by 30◦ .
Since 30° = π/6 rad, we obtain the rotation matrix

A = ⎡cos(π/6)  −sin(π/6)⎤
    ⎣sin(π/6)   cos(π/6)⎦

and the rotated vectors are

Ax1 = [2cos(π/6) − 3sin(π/6), 2sin(π/6) + 3cos(π/6)]⊤ ≈ [0.23, 3.60]⊤
Ax2 = [sin(π/6), −cos(π/6)]⊤ ≈ [0.5, −0.87]⊤ .
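An optional NumPy check (not part of the original solution), which also confirms the sign of the second component of Ax2:

    import numpy as np

    theta = np.pi / 6
    A = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta), np.cos(theta)]])
    print(A @ np.array([2, 3]))    # ~[0.232, 3.598]
    print(A @ np.array([0, -1]))   # ~[0.5, -0.866]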


Matrix Decompositions

Exercises
4.1 Compute the determinant using the Laplace expansion (using the first row)
and the Sarrus rule for
⎡ ⎤
1 3 5
A=⎣ 2 4 6 ⎦.
0 2 4

The determinant is

|A| = det(A) = 1 · det⎡4 6⎤ − 3 · det⎡2 6⎤ + 5 · det⎡2 4⎤ = 1·4 − 3·8 + 5·4 = 0     (Laplace expansion along the first row)
                      ⎣2 4⎦          ⎣0 4⎦          ⎣0 2⎦

By Sarrus' rule: |A| = 16 + 0 + 20 − 0 − 12 − 24 = 0 .

4.2 Compute the following determinant efficiently:


⎡ ⎤
2 0 1 2 0
⎢2
⎢ −1 0 1 1⎥⎥
⎥.
⎢0 1 2 1 2⎥

⎣−2 0 2 −1 2⎦
2 0 0 1 1

This strategy shows the power of the methods we learned in this and the
previous chapter. We can first apply Gaussian elimination to transform A
into a triangular form, and then use the fact that the determinant of a trian-
gular matrix equals the product of its diagonal elements.
, , , , , ,
,2 0 1 2 0,, ,2 0 1 2 0,, ,,2 0 1 2 0,,
, ,
,2
, −1 0 1 1,, ,,0 −1 −1 −1 1,, ,,0 −1 −1 −1 1,,
,0
, 1 2 1 2,, = ,,0 1 2 1 2,, = ,,0 0 1 0 3,,
,−2 0 2 −1 2,, ,,0 0 3 1 2,, ,,0 0 3 1 2,,
,
,2 0 0 1 1 , ,0 0 −1 −1 1 , ,0 0 −1 −1 1,

, , , ,
,2 0 1 2 0 , ,2 0 1 2 0 ,,
, , ,
,0
, −1 −1 −1 1 ,, ,,0 −1 −1 −1 1 ,,
= ,,0 0 1 0 3 ,, = ,,0 0 1 0 3 ,, = 6 .
,0
, 0 0 1 −7,, ,,0 0 0 1 −7,,
,0 0 0 −1 4 , ,0 0 0 0 −3,

Alternatively, we can apply the Laplace expansion and arrive at the same
solution:
, , , ,
,2 0 1 2 0 , ,2 0 1 2 0,,
, , ,
,2
, −1 0 1 1,, ,,0 −1 −1 −1 1,,
,0
, 1 2 1 2,, = ,,0 1 2 1 2,,
,−2 0 2 −1 2,, ,,0 0 3 1 2,,
,
,2 0 0 1 1 , ,0 0 −1 −1 1,
, ,
,−1 −1 −1 1,,
,
1st col. ,1 2 1 2,,
= (−1)1+1 2 · ,, .
,0 3 1 2,,
,0 −1 −1 1,

If we now subtract the fourth row from the first row and multiply (−2) times
the third column to the fourth column we obtain
, ,
,−1 0 0 0,, , ,
, ,2 1 0,, , ,
,1 2 1 0, 1st row , 3rd=col. (−2) · 3(−1)3+3 · ,2 1, = 6 .
, , , ,
2 ,, = −2 ,3 1 0
,0 3 1 0, 3 1,
, , , ,
,−1 −1 3,
, 0 −1 −1 3,

4.3 Compute the eigenspaces of


a.
0 1
1 0
A :=
1 1

b.
0 1
−2 2
B :=
2 1

a. For
0 1
1 0
A=
1 1
, ,
,1 − λ 0 ,,
(i) Characteristic polynomial: p(λ) = |A − λI 2 | = ,, =
1 1 − λ,
(1 − λ)2 . Therefore λ = 1 is the only root of p and, therefore, the
only eigenvalue of A
(ii) To compute the eigenspace for the eigenvalue λ = 1, we need to
compute the null space of A − I :
0 1
0 0
(A − 1 · I)x = 0 ⇐⇒ x=0
1 0
0 1
0
⇒ E1 = [ ]
1

c
⃝2020 M. P. Deisenroth, A. A. Faisal, C. S. Ong. To be published by Cambridge University Press.

Downloaded by ???[ ???? / ?????·?? ? (kyungie0311@korea.ac.kr)


lOMoARcPSD|31719747

506 Matrix Decompositions

E1 is the only eigenspace of A.


b. For B , the corresponding eigenspaces (for λi ∈ C) are
0 1 0 1
1 1
Ei = [ ], E−i = [ ].
i −i

4.4 Compute all eigenspaces of


⎡ ⎤
0 −1 1 1
⎢−1 1 −2 3⎥
A=⎢ ⎥.
⎣2 −1 0 0⎦
1 −1 1 0

(i) Characteristic polynomial:


, , , ,
,−λ −1 1 1 ,, ,,−λ −1 1 1 ,,
,
, −1 1 − λ −2 3 ,, ,, 0 −λ −1 3 − λ,,
p(λ) = ,, =,
, 2 −1 −λ 0 , , 0
, 1 −2 − λ 2λ ,,
, 1 −1 1 −λ , , 1 −1 1 −λ ,
, ,
,−λ −1 − λ 0 1 ,,
,
, 0 −λ −1 − λ 3 − λ,,
= ,,
, 0 1 −1 − λ 2λ ,,
, 1 0 0 −λ ,
, , , ,
,−λ −1 − λ 3 − λ, ,−1 − λ 0 1 ,,
, , ,
= (−λ) ,, 1 −1 − λ 2λ ,, − ,, −λ −1 − λ 3 − λ,,
, 0 0 −λ , , 1 −1 − λ 2λ ,
, ,
,−λ −1 − λ, ,−1 − λ 0 1 ,,
, , ,
= (−λ)2 ,, , − , −λ −1 − λ 3 − λ,,
1 −1 − λ, ,,
1 −1 − λ 2λ ,
= (1 + λ)2 (λ2 − 3λ + 2) = (1 + λ)2 (1 − λ)(2 − λ)

Therefore, the eigenvalues of A are λ1 = −1, λ2 = 1, λ3 = 2.


(ii) The corresponding eigenspaces are the solutions of (A − λi I)x =
0, i = 1, 2, 3, and given by
⎡ ⎤ ⎡ ⎤ ⎡ ⎤
0 1 1
⎢1 ⎥ ⎢1⎥ ⎢0⎥
E−1 = span[⎢
⎣1⎦],
⎥ E1 = span[⎢
⎣1⎦],
⎥ E2 = span[⎢
⎣1 ⎦] .

0 1 1

4.5 Diagonalizability of a matrix is unrelated to its invertibility. Determine for


the following four matrices whether they are diagonalizable and/or invert-
ible
0 1 0 1 0 1 0 1
1 0 1 0 1 1 0 1
, , , .
0 1 0 0 0 1 0 0

In the four matrices above, the first one is diagonalizable and invertible, the
second one is diagonalizable but is not invertible, the third one is invertible
but is not diagonalizable, and, finally, the fourth ione s neither invertible
nor is it diagonalizable.

Draft (2020-02-23) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com.

Downloaded by ???[ ???? / ?????·?? ? (kyungie0311@korea.ac.kr)


lOMoARcPSD|31719747

Exercises 507

4.6 Compute the eigenspaces of the following transformation matrices. Are they
diagonalizable?
a. For
⎡ ⎤
2 3 0
A = ⎣1 4 3⎦
0 0 1

(i) Compute the eigenvalue as the roots of the characteristic polyno-


mial
, ,
,2 − λ 3 0 ,, , ,
, ,2 − λ 3 ,,
p(λ) = det(A − λI) = , 1
, 4−λ 3 , = (1 − λ) ,
, ,
1 4 − λ,
, 0 0 1 − λ,
@ A 2
= (1 − λ) (2 − λ)(4 − λ) − 3 = (1 − λ)(8 − 2λ − 4λ + λ − 3)
= (1 − λ)(λ2 − 6λ + 5) = (1 − λ)(λ − 1)(λ − 5) = −(1 − λ)2 (λ − 5) .

Therefore, we obtain the eigenvalues 1 and 5.


(ii) To compute the eigenspaces, we need to solve (A − λi I)x = 0,
where λ1 = 1, λ2 = 5:
⎡ ⎤ ⎡ ⎤
1 3 0 1 3 0
E1 : ⎣ 1 3 3 ⎦ ! ⎣0 0 1⎦ ,
0 0 0 0 0 0

where we subtracted the first row from the second and, subse-
quently, divided the second row by 3 to obtain the reduced row
echelon form. From here, we see that
⎡ ⎤
3
E1 = span[⎣−1⎦]
0

Now, we compute E5 by solving (A − 5I)x = 0:


⎡ ⎤ 1 ⎡ ⎤
−3 3 0 | · (− 3 ) 1 −1 0
⎣ 1 −1 3 ⎦ + 1 R1 + 3 R3 ! ⎣ 0 0 1 ⎦
3 4
0 0 −4 | · (− 14 ), swap with R2 0 0 0

Then,
⎡ ⎤
1
E5 = span[⎣1⎦] .
0

(iii) This endomorphism cannot be diagonalized because dim(E1 ) +


dim(E5 ) ̸= 3.
Alternative arguments:
dim(E1 ) does not correspond to the algebraic multiplicity of the eigen-
value λ = 1 in the characteristic polynomial
rk(A − I) ̸= 3 − 2.

c
⃝2020 M. P. Deisenroth, A. A. Faisal, C. S. Ong. To be published by Cambridge University Press.

Downloaded by ???[ ???? / ?????·?? ? (kyungie0311@korea.ac.kr)


lOMoARcPSD|31719747

508 Matrix Decompositions

b. For
⎡ ⎤
1 1 0 0
⎢0 0 0 0⎥
A=⎢
⎣0

0 0 0⎦
0 0 0 0

(i) Compute the eigenvalues as the roots of the characteristic poly-


nomial
p(λ) = det(A − λI) = (1 − λ)λ3
, ,
,1 − λ 1 0 0 ,,
,
, 0 −λ 0 0 ,,
= ,, = −(1 − λ)λ3 .
, 0 0 −λ 0 ,,
, 0 0 0 −λ,

It follows that the eigenvalues are 0 and 1 with algebraic multi-


plicities 3 and 1, respectively.
(ii) We compute the eigenspaces E0 and E1 , which requires us to de-
termine the null spaces A − λi I , where λi ∈ {0, 1}.
(iii) We compute the eigenspaces E0 and E1 , which requires us to de-
termine the null spaces A − λi I , where λi ∈ {0, 1}. For E0 , we
compute the null space of A directly and obtain
⎡ ⎤ ⎡ ⎤ ⎡ ⎤
1 0 0
⎢−1⎥ ⎢0 ⎥ ⎢0⎥
E0 = span[⎢
⎣ 0 ⎦,
⎥ ⎢ ⎥,
⎣1 ⎦
⎢ ⎥] .
⎣0⎦
0 0 1

To determine E1 , we need to solve (A − I)x = 0:


⎡ ⎤ ⎡ ⎤
0 1 0 0 0 1 0 0
−1 0 ⎥ +R1 | move to R4 !
⎢ 0 0 ⎥ ⎢ 0 0 1 0 ⎥
⎢ ⎢ ⎥
⎣ 0 0 −1 0 ⎦ ·(−1) ⎣ 0 0 0 1 ⎦
0 0 0 −1 ·(−1) 0 0 0 0

From here, we see that


⎡ ⎤
1
⎢0⎥
⎣0⎦] .
E1 = span[⎢ ⎥

(iv) Since dim(E0 )+dim(E1 ) = 4 = dim(R4 ), it follows that a diagonal


form exists.
4.7 Are the following matrices diagonalizable? If yes, determine their diagonal
form and a basis with respect to which the transformation matrices are di-
agonal. If no, give reasons why they are not diagonalizable.
a.
0 1
0 1
A=
−8 4

Draft (2020-02-23) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com.

Downloaded by ???[ ???? / ?????·?? ? (kyungie0311@korea.ac.kr)


lOMoARcPSD|31719747

Exercises 509

We determine the characteristic polynomial as

p(λ) = det(A − λI) = −λ(4 − λ) + 8 = λ2 − 4λ + 8 .

The characteristic polynomial does not decompose into linear factors


over R because the roots of p(λ) are complex and given by λ1,2 =

2 ± −4. Since the characteristic polynomial does not decompose into
linear factors, A cannot be diagonalized (over R).
b.
⎡ ⎤
1 1 1
A = ⎣1 1 1⎦
1 1 1

(i) The characteristic polynomial is p(λ) = det(A − λI).


, , , ,
,1 − λ 1 1 ,, ,1 − λ 1 1 ,,
, subtr. R1 from R2 ,R3 ,
p(λ) = , 1
, 1−λ 1 , , = , λ
, −λ 0 ,,
, 1 1 1−λ , , λ 0 −λ,
, , , ,
develop last row , 1 1,, ,1 − λ 1 ,
= λ ,, − λ ,, ,
−λ 0 , λ −λ,
= λ2 + λ(λ(1 − λ) + λ) = λ(−λ2 + 3λ) = λ2 (λ − 3) .

Therefore, the roots of p(λ) are 0 and 3 with algebraic multiplici-


ties 2 and 1, respectively.
(ii) To determine whether A is diagonalizable, we need to show that
the dimension of E0 is 2 (because the dimension of E3 is nec-
essarily 1: an eigenspace has at least dimension 1 by definition,
and its dimension cannot exceed the algebraic multiplicity of its
associated eigenvalue).
Let us study E0 = ker(A − 0I) :
⎡ ⎤ ⎡ ⎤
1 1 1 1 1 1
A − 0I = ⎣1 1 1 ⎦ ! ⎣0 0 0⎦ .
1 1 1 0 0 0

Here, dimE0 = 2, which is identical to the algebraic multiplicity of


the eigenvalue 0 in the characteristic polynomial. Thus A is diag-
onalizable. Moreover, we can read from the reduced row echelon
form that
⎡ ⎤ ⎡ ⎤
1 1
E0 = span[⎣−1⎦ , ⎣ 0 ⎦] .
0 −1

(iii) For E3 , we obtain


⎡ ⎤ ⎡ ⎤
−2 1 1 1 0 −1
A − 3I = ⎣ 1 −2 1 ⎦ ! ⎣0 1 −1⎦ ,
1 1 −2 0 0 0

which has rank 2, and, therefore (using the rank-nullity theorem),

c
⃝2020 M. P. Deisenroth, A. A. Faisal, C. S. Ong. To be published by Cambridge University Press.

Downloaded by ???[ ???? / ?????·?? ? (kyungie0311@korea.ac.kr)


lOMoARcPSD|31719747

510 Matrix Decompositions

E3 has dimension 1 (it could not be anything else anyway, as jus-


tified above) and
⎡ ⎤
−1
E1 = span[⎣−1⎦] .
−1

(iv) Therefore, we can find a new basis P as the concatenation of the


spanning vectors of the eigenspaces. If we identify the matrix
⎡ ⎤
−1 1 1
P = ⎣−1 −1 0 ⎦,
−1 0 −1

whose columns are composed of the basis vectors of basis P , then


our endomorphism will have the diagonal form: D = diag[3, 0, 0]
with respect to this new basis. As a reminder, diag[3, 0, 0] refers to
the 3 × 3 diagonal matrix with 3, 0, 0 as values on the diagonal.
Note that the diagonal form is not unique and depends on the
order of the eigenvectors in the new basis. For example, we can
define another matrix Q composed of the same vectors as P but
in a different order:
⎡ ⎤
1 −1 1
Q = ⎣−1 −1 0 ⎦.
0 −1 −1

If we use this matrix, our endomorphism would have another di-


agonal form: D ′ = diag[0, 3, 0]. Sceptical students can check that
Q−1 AQ = D ′ and P −1 AP = D .
c.
⎡ ⎤
5 4 2 1
⎢ 0 1 −1 −1⎥
A=⎢
⎣ −1

−1 3 0 ⎦
1 1 −1 2

p(λ) = (λ − 1)(λ − 2)(λ − 4)2


⎡ ⎤ ⎡ ⎤ ⎡ ⎤
−1 1 1
⎢1⎥ ⎢−1⎥ ⎢0⎥
E1 = span[⎢
⎣ 0 ⎦] ,
⎥ E2 = span[⎢
⎣ 0 ⎦] ,

⎣−1⎦] .
E4 = span[⎢ ⎥

0 1 1

Here, we see that dim(E4 ) = 1 ̸= 2 (which is the algebraic multiplicity


of the eigenvalue 4). Therefore, A cannot be diagonalized.
d.
⎡ ⎤
5 −6 −6
A = ⎣−1 4 2 ⎦
3 −6 −4

Draft (2020-02-23) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com.

Downloaded by ???[ ???? / ?????·?? ? (kyungie0311@korea.ac.kr)


lOMoARcPSD|31719747

Exercises 511

(i) We compute the characteristic polynomial p(λ) = det(A − λI) as


, ,
,5 − λ −6 −6 ,,
,
p(λ) = ,, −1 4−λ 2 ,,
, 3 −6 −4 − λ,
= (5 − λ)(4 − λ)(−4 − λ) − 36 − 36 + 18(4 − λ)
+ 12(5 − λ) − 6(−4 − λ)
= −λ3 + 5λ2 − 8λ + 4 = (1 − λ)(2 − λ)2 ,

where we used Sarrus rule. The characteristic polynomial decom-


poses into linear factors, and the eigenvalues are λ1 = 1, λ2 = 2
with (algebraic) multiplicity 1 and 2, respectively.
(ii) If the dimension of the eigenspaces are identical to multiplicity of
the corresponding eigenvalues, the matrix is diagonalizable. The
eigenspace dimension is the dimension of ker(A − λi I), where λi
are the eigenvalues (here: 1,2). For a simple check whether the
matrices are diagonalizable, it is sufficient to compute the rank ri
of A − λi I since the eigenspace dimension is n − ri (rank-nullity
theorem).
Let us study E2 and apply Gaussian elimination on A − 2I :
⎡ ⎤
3 −6 −6
⎣−1 2 2 ⎦. (4.1)
3 −6 −6

(iii) We can immediately see that the rank of this matrix is 1 since
the first and third row are three times the second. Therefore, the
eigenspace dimension is dim(E2 ) = 3 − 1 = 2, which corresponds
to the algebraic multiplicity of the eigenvalue λ = 2 in p(λ). More-
over, we know that the dimension of E1 is 1 since it cannot exceed
its algebraic multiplicity, and the dimension of an eigenspace is at
least 1. Hence, A is diagonalizable.
(iv) The diagonal matrix is easy to determine since it just contains the
eigenvalues (with corresponding multiplicities) on its diagonal:
⎡ ⎤
1 0 0
D = ⎣0 2 0⎦ .
0 0 2

(v) We need to determine a basis with respect to which the trans-


formation matrix is diagonal. We know that the basis that con-
sists of the eigenvectors has exactly this property. Therefore, we
need to determine the eigenvectors for all eigenvalues. Remem-
ber that x is an eigenvector for an eigenvalue λ if they satisfy
Ax = λx ⇐⇒ (A − λI)x = 0. Therefore, we need to find the
basis vectors of the eigenspaces E1 , E2 .
For E1 = ker(A − I) we obtain:
⎡ ⎤ ⎡ ⎤ 1
4 −6 −6 +4R2 0 6 2 ·( 6 )
⎣−1 3 2 ⎦ ! ⎣−1 3 2 ⎦ ·(−1)|swap with R1
3 −6 −5 +3R2 0 3 1 − 12 R1

c
⃝2020 M. P. Deisenroth, A. A. Faisal, C. S. Ong. To be published by Cambridge University Press.

Downloaded by ???[ ???? / ?????·?? ? (kyungie0311@korea.ac.kr)


lOMoARcPSD|31719747

512 Matrix Decompositions


⎡ ⎤ ⎡ ⎤
1 −3 −2 +3R2 1 0 −1
⎣ 0 1 ⎦ 1
! 1 3 ! ⎣ 0 1 3

0 0 0 0 0 0

The rank of this matrix is 2. Since 3 − 2 = 1 it follows that


dim(E1 ) = 1, which corresponds to the algebraic multiplicity of
the eigenvalue λ = 1 in the characteristic polynomial.
(vi) From the reduced row echelon form we see that
⎡ ⎤
3
E1 = span[⎣−1⎦] ,
3

and our first eigenvector is [3, −1, 3]⊤ .


(vii) We proceed with determining a basis of E2 , which will give us the
other two basis vectors that we need (remember that dim(E2 ) =
2). From (4.1), we (almost) immediately obtain the reduced row
echelon form
⎡ ⎤
1 −2 −2
⎣0 0 0 ⎦,
0 0 0

and the corresponding eigenspace


⎡ ⎤ ⎡ ⎤
2 2
E2 = span[⎣1⎦ , ⎣0⎦] .
0 1

(viii) Overall, an ordered basis with respect to which A has diagonal


form D consists of all eigenvectors is
⎡ ⎤ ⎡ ⎤ ⎡ ⎤
3 2 2
B = (⎣−1⎦ , ⎣1⎦ , ⎣0⎦) .
3 0 1

4.8 Find the SVD of the matrix


0 1
3 2 2
A= .
2 3 −2

⎡ ⎤
J K0 √1 − 1
√ − 23
√1 − √1
1 2 3 2
2 2 5 0 0 ⎢ √1 1
√ 2 ⎥
A= 3
√1 √1
⎣ 2 3 √2 ⎦
0 3 0
2 2 2 2 1
! "# $! "# $ 0 − 3 3
=U =Σ ! "# $
=V

4.9 Find the singular value decomposition of


0 1
2 2
A= .
−1 1

Draft (2020-02-23) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com.

Downloaded by ???[ ???? / ?????·?? ? (kyungie0311@korea.ac.kr)


lOMoARcPSD|31719747

Exercises 513

(i) Compute the symmetrized matrix A⊤ A. We first compute the sym-


metrized matrix
0 10 1 0 1
2 −1 2 2 5 3
A⊤ A = = . (4.2)
2 1 −1 1 3 5

(ii) Find the right-singular vectors and singular values from A⊤ A.


The characteristic polynomial of A⊤ A is

(5 − λ)2 − 9 = λ2 − 10λ + 16 = 0 . (4.3)

This yields eigenvalues, sorted from largest absolute first, λ1 = 8 and


λ2 = 2. The associated normalized eigenvectors are respectively
J K J K
− √1 √1
v1 = 2 and v 2 = 2 . (4.4)
√1 √1
2 2

We have thus obtained the right-singular orthogonal matrixx


J 1 1
K
√ √
V = [v 1 , v 2 ] = 2 2 . (4.5)
− √1 √1
2 2

(iii) Determine the singular values.


We obtain the two singular

values from the square
√ √
root of the eigen-
√ √
values σ1 = λ1 = 8 = 2 2 and σ2 = λ2 = 2. We construct the
singular value diagonal matrix as
0 1 0 √ 1
σ 0 2 2 0
Σ= 1 = √ . (4.6)
0 σ2 0 2

(iv) Find the left-singular eigenvectors.


We have to map the two eigenvectors v 1 , v 2 using A. This yields two
self-consistent equations that enable us to find orthogonal eigenvec-
tors u1 , u2 :

0 √ 1
2 2
Av 1 = (σ1 u1 v ⊤
1 )v 1 = σ1 u1 (v ⊤
1 v1 ) = σ 1 u1 =
0
0 1
0
Av 2 = (σ2 u2 v ⊤ ⊤
2 )v 2 = σ2 u2 (v 2 v 2 ) = σ2 u2 = √
2

We normalize the left-singular vectors0 by


1 diving them
0 1by their respec-
1 0
tive singular values and obtain u1 = and u2 = , which yields
0 1
0 1
1 0
U = [u1 , u2 ] = . (4.7)
0 1

(v) Assemble the left-/right-singular vectors and singular values. The SVD
of A is
0 1 0√ 1 J √1 √1
K
1 0 8 0
A = U ΣV ⊤ = √ 2 2 .
0 1 0 2 − √1 √1
2 2

c
⃝2020 M. P. Deisenroth, A. A. Faisal, C. S. Ong. To be published by Cambridge University Press.

Downloaded by ???[ ???? / ?????·?? ? (kyungie0311@korea.ac.kr)


lOMoARcPSD|31719747

514 Matrix Decompositions

4.10 Find the rank-1 approximation of


0 1
3 2 2
A=
2 3 −2

To find the rank-1 approximation we apply the SVD to A (as in Exercise 4.7)
to obtain
J 1 1
K
√ −√
U= 2 2
√1 √1
2 2
⎡ 1 1 ⎤
√ − √ − 32
2 3 2
⎢ √1 √1 2 ⎥
V = ⎣ 2 3 √ 2 3 ⎦.
0 −232 1
3

We apply the construction rule for rank-1 matrices


A i = ui v ⊤
i .

We use the largest singular value (σ1 = 5, i.e., i = 1 and the first column
vectors of the U and V matrices, respectively:
J 1 K
√ L M 1 01 1 0 1

A 1 = u1 v 1 = 2 √1 √1 0 = .
√1 2 2 2 1 1 0
2

To find the rank-1 approximation, we apply the SVD to A to obtain


J 1 1
K
√ −√
U= 2 2
√1 √1
2 2
⎡ 1 1 ⎤
√ − √ − 32
2 3 2
⎢ √1 √1 2 ⎥
V = ⎣ 2 3 √ 2 3 ⎦.
2 2 1
0 − 3 3

We apply the construction rule for rank-1 matrices


A i = ui v ⊤
i .

We use the largest singular value (σ1 = 5, i.e., i = 1) and therefore, the first
column vectors of U and V , respectively, which then yields
J 1 K 0 1
√ L M 1 1 1 0
A 1 = u1 v ⊤ 2 √1 √1 0 =
1 = √1 2 2
.
2 1 1 0
2

4.11 Show that for any A ∈ Rm×n the matrices A⊤ A and AA⊤ possess the
same nonzero eigenvalues.
Let us assume that λ is a nonzero eigenvalue of AA⊤ and x is an eigenvector
belonging to λ. Thus, the eigenvalue equation
(AA⊤ )x = λx

can be manipulated by left multiplying by A⊤ and pulling on the right-hand


side the scalar factor λ forward. This yields
A⊤ (AA⊤ )x = A⊤ (λx) = λ(A⊤ x)

Draft (2020-02-23) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com.

Downloaded by ???[ ???? / ?????·?? ? (kyungie0311@korea.ac.kr)


lOMoARcPSD|31719747

Exercises 515
and we can use matrix multiplication associativity to reorder the left-hand
side factors

(A⊤ A)(A⊤ x) = λ(A⊤ x) .

This is the eigenvalue equation for A⊤ A. Therefore, λ is the same eigen-


value for AA⊤ and A⊤ A.
4.12 Show that for x ̸= 0 Theorem 4.24 holds, i.e., show that
∥Ax∥2
max = σ1 ,
x ∥x∥2

where σ1 is the largest singular value of A ∈ Rm×n .


(i) We compute the eigendecomposition of the symmetric matrix

A⊤ A = P DP ⊤

for diagonal D and orthogonal P . Since the columns of P are an


ONB of Rn , we can write every y = P x as a linear combination of
the eigenvectors pi so that
n
F
y = Px = x i pi , x∈ R . (4.8)
i=1

Moreover, since the orthogonal matrix P preserves lengths (see Sec-


tion 3.4), we obtain
n
F
∥y∥22 = ∥P x∥22 = ∥x∥22 = x2i . (4.9)
i=1

(ii) Then,
N n B n B
O
F F
∥Ax∥22 ⊤ ⊤
= x (P DP )x = y Dy = ⊤
λ i x i pi , λi xi pi ,
i=1 i=1

where we used ⟨·, ·⟩ to denote the dot product.


(iii) The bilinearity of the dot product gives us
n
F n
F
∥Ax∥22 = λi ⟨xi pi , xi pi ⟩ = λi x2i
i=1 i=1

where we exploited that the pi are an ONB and p⊤


i pi = 1 .
(iv) With (4.8) we obtain
n
F (4.9)
∥Ax∥22 # ( max λj ) x2i = max λj ∥x∥22
1!j!n 1!j!n
i=1

so that
∥Ax∥22
# max λj ,
∥x∥22 1!j!n

where λj are the eigenvalues of A⊤ A.

c
⃝2020 M. P. Deisenroth, A. A. Faisal, C. S. Ong. To be published by Cambridge University Press.

Downloaded by ???[ ???? / ?????·?? ? (kyungie0311@korea.ac.kr)


lOMoARcPSD|31719747

516 Matrix Decompositions

(v) Assuming the eigenvalues of A⊤ A are sorted in descending order, we


get
∥Ax∥2 B
# λ1 = σ 1 ,
∥x∥2
where σ1 is the maximum singular value of A.

Draft (2020-02-23) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com.

Downloaded by ???[ ???? / ?????·?? ? (kyungie0311@korea.ac.kr)


lOMoARcPSD|31719747

Vector Calculus

Exercises
5.1 Compute the derivative f ′ (x) for
f (x) = log(x4 ) sin(x3 ) .

4
f ′ (x) = sin(x3 ) + 12x2 log(x) cos(x3 )
x

5.2 Compute the derivative f ′ (x) of the logistic sigmoid


1
f (x) = .
1 + exp(−x)

exp(x)
f ′ (x) =
(1 + exp(x))2

5.3 Compute the derivative f ′ (x) of the function


f (x) = exp(− 2σ1 2 (x − µ)2 ) ,

where µ, σ ∈ R are constants.


f ′ (x) = − σ12 f (x)(x − µ)

5.4 Compute the Taylor polynomials Tn , n = 0, . . . , 5 of f (x) = sin(x) + cos(x)


at x0 = 0.
T0 (x) = 1
T1 (x) = T0 (x) + x
x2
T2 (x) = T1 (x) −
2
x3
T3 (x) = T2 (x) −
6
x4
T4 (x) = T3 (x) +
24
x5
T5 (x) = T4 (x) +
120

517
This material will be published by Cambridge University Press as Mathematics for Machine Learn-
ing by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong. This pre-publication version is
free to view and download for personal use only. Not for re-distribution, re-sale or use in deriva-
c
tive works. ⃝by M. P. Deisenroth, A. A. Faisal, and C. S. Ong, 2020. https://mml-book.com.

Downloaded by ???[ ???? / ?????·?? ? (kyungie0311@korea.ac.kr)


lOMoARcPSD|31719747

518 Vector Calculus

5.5 Consider the following functions:


f1 (x) = sin(x1 ) cos(x2 ) , x ∈ R2
f2 (x, y) = x⊤ y , x, y ∈ Rn
f3 (x) = xx⊤ , x ∈ Rn
∂fi
a. What are the dimensions of ∂x ?
b. Compute the Jacobians.
f1
∂f1
= cos(x1 ) cos(x2 )
∂x1
∂f1
= − sin(x1 ) sin(x2 )
∂x2
L M = >
=⇒ J = ∂f1
∂x1
∂f1
∂x2 = cos(x1 ) cos(x2 ) − sin(x1 ) sin(x2 ) ∈ R1×2

f2
F
x⊤ y = x i yi
i

∂f2
L M = >
= ∂f2
∂x1 ··· ∂f2
∂xn = y1 ··· yn = y ⊤ ∈ Rn
∂x

∂f2
L M = >
= ∂f2
∂y1 ··· ∂f2
∂yn = x1 ··· xn = x⊤ ∈ Rn
∂y
L M = >
∂f2 ∂f2 1×2n
=⇒ J = ∂x ∂y = y⊤ x⊤ ∈ R

f3 : Rn → Rn×n
⎡ ⎤
x 1 x⊤
⎢ x 2 x⊤ ⎥ = >
xx⊤ = ⎢ xxn ∈ Rn×n
⎢ ⎥
.. ⎥ = xx1 xx2 ···
⎣ . ⎦
x n x⊤
⎡ ⎤
x⊤
⎢ 0⊤ ⎥
∂f3 =⎢ n⎥ >
=⇒ = ⎢ . ⎥+ x 0n ··· 0n ∈ Rn×n
∂x1 ⎣ .. ⎦ ! "# $
∈Rn×n
0⊤n
! "# $
∈Rn×n
⎡ ⎤
0(i−1)×n
∂f3 = >
=⇒ =⎣ x⊤ ⎦ + 0n×(i−1) x 0n×(n−i+1) ∈ Rn×n
∂xi ! "# $
0(n−1+1)×n
! "# $ ∈Rn×n
∈Rn×n
∂f3
To get the Jacobian, we need to concatenate all partial derivatives ∂xi
and obtain
L M
J = ∂x∂f3
1
∂f3
· · · ∂x n
∈ R(n×n)×n

Draft (2020-02-23) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com.

Downloaded by ???[ ???? / ?????·?? ? (kyungie0311@korea.ac.kr)


lOMoARcPSD|31719747

Exercises 519

5.6 Differentiate f with respect to t and g with respect to X , where


f (t) = sin(log(t⊤ t)) , t ∈ RD
g(X) = tr(AXB) , A ∈ RD×E , X ∈ RE×F , B ∈ RF ×D ,

where tr(·) denotes the trace.

∂f 1
= cos(log(t⊤ t)) · ⊤ · 2t⊤
∂t t t
The trace for T ∈ RD×D is defined as
D
F
tr(T ) = Tii
i=1

A matrix product ST can be written as


F
(ST )pq = Spi Tiq
i

The product AXB contains the elements


E F
F F
(AXB)pq = Api Xij Bjq
i=1 j=1

When we compute the trace, we sum up the diagonal elements of the


matrix. Therefore we obtain,
⎛ ⎞
FD D
F FE FF
tr(AXB) = (AXB)kk = ⎝ Aki Xij Bjk ⎠
k=1 k=1 i=1 j=1

∂ F
tr(AXB) = Aki Bjk = (BA)ji
∂Xij
k

We know that the size of the gradient needs to be of the same size as X
(i.e., E × F ). Therefore, we have to transpose the result above, such that
we finally obtain

A⊤ !"#$
tr(AXB) = !"#$ B⊤
∂X
E×D D×F

5.7 Compute the derivatives df /dx of the following functions by using the chain
rule. Provide the dimensions of every single partial derivative. Describe your
steps in detail.
a.
f (z) = log(1 + z) , z = x⊤ x , x ∈ RD

b.
f (z) = sin(z) , z = Ax + b , A ∈ RE×D , x ∈ RD , b ∈ RE

where sin(·) is applied to every element of z .

c
⃝2020 M. P. Deisenroth, A. A. Faisal, C. S. Ong. To be published by Cambridge University Press.

Downloaded by ???[ ???? / ?????·?? ? (kyungie0311@korea.ac.kr)


lOMoARcPSD|31719747

520 Vector Calculus

a.
df ∂f ∂z
= ∈ R1×D
dx ∂z ∂x
!"#$ !"#$
∈R ∈R1×D
∂f 1 1
= =
∂z 1+z 1 + x⊤ x
∂z
= 2x⊤
∂x
df 2x⊤
=⇒ =
dx 1 + x⊤ x
b.
df ∂f ∂z
= ∈ RE×D
dx ∂z
!"#$ ∂x
!"#$
∈RE×E ∈RE×D
⎡ ⎤
sin z1
sin(z) = ⎣
⎢ .. ⎥
. ⎦
sin zE
⎡ ⎤
0
⎢ . ⎥
⎢ .. ⎥
⎢ ⎥
⎢ 0 ⎥
⎢ ⎥
∂ sin z
= ⎢cos(zi )⎥ ∈ RE
⎢ ⎥
∂zi ⎢ ⎥
⎢ 0 ⎥
⎢ ⎥
⎢ .. ⎥
⎣ . ⎦
0
∂f
=⇒ = diag(cos(z)) ∈ RE×E
∂z
∂z
= A ∈ RE×D :
∂x
D
F ∂ci
ci = Aij xj =⇒ = Aij , i = 1, . . . , E, j = 1, . . . , D
∂xj
j=1

Here, we defined ci to be the ith component of Ax. The offset b is con-


stant and vanishes when taking the gradient with respect to x. Overall,
we obtain
df
= diag(cos(Ax + b))A
dx
5.8 Compute the derivatives df /dx of the following functions. Describe your
steps in detail.
a. Use the chain rule. Provide the dimensions of every single partial deriva-
tive.
f (z) = exp(− 21 z)
z = g(y) = y ⊤ S −1 y
y = h(x) = x − µ

Draft (2020-02-23) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com.

Downloaded by ???[ ???? / ?????·?? ? (kyungie0311@korea.ac.kr)


lOMoARcPSD|31719747

Exercises 521

where x, µ ∈ RD , S ∈ RD×D .
The desired derivative can be computed using the chain rule:
df ∂f ∂g ∂h
= ∈ R1×D
dx ∂z ∂y ∂x
!"#$ !"#$ !"#$
1×1 1×D D×D

Here
∂f
= − 21 exp(− 12 z)
∂z
∂g
= 2y ⊤ S −1
∂y
∂h
= ID
∂x
so that
df @ A
= − exp − 21 (x − µ)⊤ S −1 (x − µ) (x − µ)⊤ S −1
dx

b.

f (x) = tr(xx⊤ + σ 2 I) , x ∈ RD

Here tr(A) is the trace of A, i.e., the sum of the diagonal elements Aii .
Hint: Explicitly write out the outer product.
Let us have a look at the outer product. We define X = xx⊤ with

Xij = xi xj

The trace sums up all the diagonal elements, such that


D
F ∂Xii + σ 2

tr(X + σ 2 I) = = 2xj
∂xj ∂xj
i=1

for j = 1, . . . , D. Overall, we get



tr(xx⊤ + σ 2 I) = 2x⊤ ∈ R1×D
∂x

c. Use the chain rule. Provide the dimensions of every single partial deriva-
tive. You do not need to compute the product of the partial derivatives
explicitly.

f = tanh(z) ∈ RM
z = Ax + b, x ∈ RN , A ∈ RM ×N , b ∈ RM .

Here, tanh is applied to every component of z .

∂f
= diag(1 − tanh2 (z)) ∈ RM ×M
∂z
∂z ∂Ax
= = A ∈ RM ×N
∂x ∂x

c
⃝2020 M. P. Deisenroth, A. A. Faisal, C. S. Ong. To be published by Cambridge University Press.

Downloaded by ???[ ???? / ?????·?? ? (kyungie0311@korea.ac.kr)


lOMoARcPSD|31719747

522 Vector Calculus

We get the latter result by defining y = Ax, such that


F ∂yi ∂yi
yi = Aij xj =⇒ = Aik =⇒ = [Ai1 , ..., AiN ] ∈ R1×N
∂xk ∂x
j
∂y
=⇒ =A
∂x
The overall derivative is an M × N matrix.
5.9 We define
g(z, ν) := log p(x, z) − log q(z, ν)
z := t(ϵ, ν)

for differentiable functions p, q, t. By using the chain rule, compute the gra-
dient
d
g(z, ν) .

d d d d
g(z, ν) = (log p(x, z) − log q(z, ν)) = log p(x, z) − log q(z, ν)
dν dν dν dν
∂ ∂t(ϵ, ν) ∂ ∂t(ϵ, ν) ∂
= log p(x, z) − log q(z, ν) − log q(z, ν)
∂z
P ∂ν ∂z Q ∂ν ∂ν
∂ ∂ ∂t(ϵ, ν) ∂
= log p(x, z) − log q(z, ν) − log q(z, ν)
∂z ∂z ∂ν ∂ν
P Q
1 ∂ 1 ∂ ∂t(ϵ, ν) ∂
= p(x, z) − q(z, ν) − log q(z, ν)
p(x, z) ∂z q(z, ν) ∂z ∂ν ∂ν

Draft (2020-02-23) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com.

Downloaded by ???[ ???? / ?????·?? ? (kyungie0311@korea.ac.kr)


lOMoARcPSD|31719747

Probability and Distributions

Exercises
6.1 Consider the following bivariate distribution p(x, y) of two discrete random
variables X and Y .

y1 0.01 0.02 0.03 0.1 0.1

Y y2 0.05 0.1 0.05 0.07 0.2

y3 0.1 0.05 0.03 0.05 0.04

x1 x2 x3 x4 x5
X

Compute:
a. The marginal distributions p(x) and p(y).
b. The conditional distributions p(x|Y = y1 ) and p(y|X = x3 ).
The marginal and conditional distributions are given by

p(x) = [0.16, 0.17, 0.11, 0.22, 0.34]⊤


p(y) = [0.26, 0.47, 0.27]⊤
p(x|Y = y1 ) = [0.01, 0.02, 0.03, 0.1, 0.1]⊤
p(y|X = x3 ) = [0.03, 0.05, 0.03]⊤ .

6.2 Consider a mixture of two Gaussian distributions (illustrated in Figure 6.4),


P0 1 0 1Q P0 1 0 1Q
10 1 0 0 8.4 2.0
0.4 N , + 0.6 N , .
2 0 1 0 2.0 1.7

a. Compute the marginal distributions for each dimension.


b. Compute the mean, mode and median for each marginal distribution.
c. Compute the mean and mode for the two-dimensional distribution.
Consider the mixture of two Gaussians,
P0 1Q P0 1 0 1Q P0 1 0 1Q
x1 10 1 0 0 8.4 2.0
p = 0.4 N , + 0.6 N , .
x2 2 0 1 0 2.0 1.7

523
This material will be published by Cambridge University Press as Mathematics for Machine Learn-
ing by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong. This pre-publication version is
free to view and download for personal use only. Not for re-distribution, re-sale or use in deriva-
c
tive works. ⃝by M. P. Deisenroth, A. A. Faisal, and C. S. Ong, 2020. https://mml-book.com.

Downloaded by ???[ ???? / ?????·?? ? (kyungie0311@korea.ac.kr)


lOMoARcPSD|31719747

524 Probability and Distributions

a. Compute the marginal distribution for each dimension.

< P P0 1 0 1Q P0 1 0 1QQ
10 1 0 0 8.4 2.0
p(x1 ) = 0.4 N , + 0.6N , dx2
2 0 1 0 2.0 1.7
(6.1)
< P0 1 0 1Q < P0 1 0 1Q
10 1 0 0 8.4 2.0
= 0.4 N , dx2 + 0.6 N , dx2
2 0 1 0 2.0 1.7
(6.2)
@ A @ A
= 0.4N 10, 1 + 0.6N 0, 8.4 .

From (6.1) to (6.2), we used the result that the integral of a sum is the
sum of the integrals.
Similarly, we can get
@ A @ A
p(x2 ) = 0.4N 2, 1 + 0.6N 0, 1.7 .

b. Compute the mean, mode and median for each marginal distribution.
Mean:
<
E(x1 ) = x1 p(x1 )dx1
<
@ @ A @ AA
= x1 0.4N x1 | 10, 1 + 0.6N x1 | 0, 8.4 dx1 (6.3)
< <
@ A @ A
= 0.4 x1 N x1 | 10, 1 dx1 + 0.6 x1 N x1 | 0, 8.4 dx1 (6.4)

= 0.4 · 10 + 0.6 · 0 = 4 .
@ A
From step (6.3) to step (6.4), we use the fact that for Y ∼ N µ, σ ,
where E[Y ] = µ.
Similarly,
E(x2 ) = 0.4 · 2 + 0.6 · 0 = 0.8 .
Mode: In principle, we would need to solve for
dp(x1 ) dp(x2 )
=0 and = 0.
dx1 dx2
However, we can observe that the modes of each individual distribution
are the peaks of the Gaussians, that is the Gaussian means for each
dimension. Median:
< a
1
p(x1 )dx1 = .
−∞ 2

c. Compute the mean and mode for the 2 dimensional distribution.


Mean:
From (6.30), we know that
0 1 0 1 0 1
x1 E[x1 ] 4
E[ ]= = .
x2 E[x2 ] 0.8

Mode:
The two dimensional distribution has two peaks, and hence there are

Draft (2020-02-23) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com.

Downloaded by ???[ ???? / ?????·?? ? (kyungie0311@korea.ac.kr)


lOMoARcPSD|31719747

Exercises 525

two modes. In general, we would need to solve an optimization problem


to find the maxima. However, in this particular case we can observe that
the two modes correspond to the individual Gaussian means, that is
0 1 0 1
10 0
and .
2 0

6.3 You have written a computer program that sometimes compiles and some-
times not (code does not change). You decide to model the apparent stochas-
ticity (success vs. no success) x of the compiler using a Bernoulli distribution
with parameter µ:

p(x | µ) = µx (1 − µ)1−x , x ∈ {0, 1} .

Choose a conjugate prior for the Bernoulli likelihood and compute the pos-
terior distribution p(µ | x1 , . . . , xN ).
The posterior is proportional to the product of the likelihood (Bernoulli)
and the prior (Beta). Given the choice of the prior, we already know that
the posterior will be a Beta distribution. With a conjugate Beta prior p(µ) =
Beta(a, b) obtain
N
p(µ)p(X | µ) R x
p(µ | X) = ∝ µa−1 (1 − µ)b−1 µ i (1 − µ)1−xi
p(X)
i=1
! ! F F
a−1+ i xi b−1+N − xi
∝µ (1 − µ) i ∝ Beta(a + xi , b + N − xi ) .
i i

Alternative approach: We look at the log-posterior. Ignoring all constants


that are independent of µ and xi , we obtain

log p(µ | X) = (a − 1) log µ + (b − 1) log(1 − µ)


F F
+ log µ xi + log(1 − µ) (1 − xi )
i i
F F
= log µ(a − 1 + xi ) + log(1 − µ)(b − 1 + N − xi ) .
i i
5 5
Therefore, the posterior is again Beta(a + i xi , b + N − i xi ).
6.4 There are two bags. The first bag contains four mangos and two apples; the
second bag contains four mangos and four apples.
We also have a biased coin, which shows “heads” with probability 0.6 and
“tails” with probability 0.4. If the coin shows “heads”. we pick a fruit at
random from bag 1; otherwise we pick a fruit at random from bag 2.
Your friend flips the coin (you cannot see the result), picks a fruit at random
from the corresponding bag, and presents you a mango.
What is the probability that the mango was picked from bag 2?
Hint: Use Bayes’ theorem.
We apply Bayes’ theorem and compute the posterior p(b2 | m) of picking a
mango from bag 2.
p(m | b2 )p(b2 )
p(b2 | m) =
p(m)

c
⃝2020 M. P. Deisenroth, A. A. Faisal, C. S. Ong. To be published by Cambridge University Press.

Downloaded by ???[ ???? / ?????·?? ? (kyungie0311@korea.ac.kr)


lOMoARcPSD|31719747

526 Probability and Distributions

where
32 21 2 1 3
p(m) = p(b1 )p(m | b1 ) + p(b2 )p(m | b2 ) = + = + = Evidence
53 52 5 5 5
2
p(b2 ) = 5 Prior
1
p(m | b2 ) = 2 Likelihood

Therefore,
21
p(m | b2 )p(b2 ) 52 1
p(b2 | m) = = 3
= .
p(m) 5
3

6.5 Consider the time-series model


@ A
xt+1 = Axt + w , w ∼ N 0, Q
@ A
y t = Cxt + v , v ∼ N 0, R ,

where
@ w, vA are i.i.d. Gaussian noise variables. Further, assume that p(x0 ) =
N µ0 , Σ 0 .
a. What is the form of p(x0 , x1 , . . . , xT )? Justify your answer (you do not
have to explicitly compute the joint distribution).
@ A
b. Assume that p(xt | y 1 , . . . , y t ) = N µt , Σt .
1. Compute p(xt+1 | y 1 , . . . , y t ).
2. Compute p(xt+1 , y t+1 | y 1 , . . . , y t ).
3. At time t+1, we observe the value y t+1 = ŷ . Compute the conditional
distribution p(xt+1 | y 1 , . . . , y t+1 ).

1. The joint distribution is Gaussian: p(x0 ) is Gaussian, and xt+1 is a lin-


ear/affine transformations of xt . Since affine transformations leave the
Gaussianity of the random variable invariant, the joint distribution must
be Gaussian.
2. 1. We use the results from linear transformation of Gaussian random
variables (Section 6.5.3). We immediately obtain
@ A
p(xt+1 | y 1:t ) = N xt+1 | µt+1 | t , Σt+1 | t
µt+1 | t := Aµt
Σt+1 | t := AΣt A⊤ + Q .

2. The joint distribution p(xt+1 , y t+1 | y 1:t ) is Gaussian (linear transfor-


mation of random variables). We compute every component of the
Gaussian separately:

E[y t+1 | y 1:t ] = Cµt+1 | t =: µyt+1 | t


V[y t+1 | y 1:t ] = CΣt+1 | t C ⊤ + R =: Σyt+1 | t
Cov[xt+1 , y t+1 | y 1:t ] = Cov[xt+1 , Cxt+1 | y 1:t ] = Σt+1 | t C ⊤ =: Σxy

Therefore,
EJ K J K G
µt+1 | t Σt+1 | t Σxy
p(xt+1 , y t+1 | y 1:t ) = N , .
µyt+1 | t Σyx Σyt+1 | t

Draft (2020-02-23) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com.

Downloaded by ???[ ???? / ?????·?? ? (kyungie0311@korea.ac.kr)


lOMoARcPSD|31719747

Exercises 527

3. We obtain the desired distribution (again: Gaussian) by applying the


rules for Gaussian conditioning of the joint distribution in B):
@ A
p(xt+1 | y 1:t+1 ) = N µt+1 | t+1 , Σt+1 | t+1
µt+1 | t+1 = µt+1 | t + Σxy (Σyt+1 | t )−1 (ŷ − µyt+1 | t )

Σt+1 | t+1 = Σt+1 | t − Σxy (Σyt+1 | t )−1 Σyx

6.6 Prove the relationship in (6.44), which relates the standard definition of the
variance to the raw-score expression for the variance.
The formula in (6.43) can be converted to the so called raw-score formula
for variance
N N H I
1 F 1 F 2
(xi − µ)2 = xi − 2xi µ + µ2
N N
i=1 i=1
N N
1 F 2 F
= x2i − µ x i + µ2
N N
i=1 i=1
N
1 F
= x2i − µ2
N
i=1
N
E N
G2
1 F 2 1 F
= xi − xi .
N N
i=1 i=1

6.7 Prove the relationship in (6.45), which relates the pairwise difference be-
tween examples in a dataset with the raw-score expression for the variance.
Consider the pairwise squared deviation of a random variable:
N N
1 F 2 1 F 2
(x i − x j ) = xi + x2j − 2xi xj (6.5)
N2 N2
i,j=1 i,j=1
N N N
1 F 2 1 F 2 1 F
= xi + xj − 2 2 xi xj (6.6)
N N N
i=1 j=1 i,j=1
⎛ ⎞
N N N
2 F 2 1 F ⎝1 F ⎠
= xi − 2 xi xj (6.7)
N N N
i=1 i=1 j=1
⎡ E G2 ⎤
N N
1 F 2 1 F
= 2⎣ xi − xi ⎦. (6.8)
N N
i=1 i=1

Observe that the last equation above is twice of (6.44).


Going from (6.6) to (6.7), we need to make two observations. The first two
terms in (6.6) are sums over all examples, and the index is not important,
therefore they can be combined into twice the sum. The last term is obtained
by “pushing in” the sum over j . This can be seen by a small example by
considering N = 3:
x1 x1 + x1 x2 + x1 x 3 + x2 x1 + x 2 x 2 + x 2 x3 + x3 x 1 + x 3 x2 + x 3 x3
=x1 (x1 + x2 + x3 ) + x2 (x1 + x2 + x3 ) + x3 (x1 + x2 + x3 ) .

c
⃝2020 M. P. Deisenroth, A. A. Faisal, C. S. Ong. To be published by Cambridge University Press.

Downloaded by ???[ ???? / ?????·?? ? (kyungie0311@korea.ac.kr)


lOMoARcPSD|31719747

528 Probability and Distributions

Going from (6.7) to (6.8), we consider the sum over j as a common factor,
as seen in the following small example:

x1 (x1 + x2 + x3 ) + x2 (x1 + x2 + x3 ) + x3 (x1 + x2 + x3 )


=(x1 + x2 + x3 )(x1 + x2 + x3 ) .

6.8 Express the Bernoulli distribution in the natural parameter form of the ex-
ponential family, see (6.107).

p(x | µ) = µx (1 − µ)1−x
L H IM
= exp log µx (1 − µ)1−x

= exp [x log µ + (1 − x) log(1 − µ)]


0 1
µ
= exp x log + log(1 − µ)
1−µ
= exp [xθ − log(1 + exp(θ))] ,
µ
where the last line is obtained by substituting θ = log 1−µ . Note that this
is in exponential family form, as θ is the natural parameter, the sufficient
statistic φ(x) = x and the log partition function is A(θ) = log(1 + exp(θ)).
6.9 Express the Binomial distribution as an exponential family distribution. Also
express the Beta distribution is an exponential family distribution. Show that
the product of the Beta and the Binomial distribution is also a member of
the exponential family.
Express Binomial distribution as an exponential family distribution:

Recall the Binomial distribution from Example 6.11


E G
N
p(x|N, µ) = µx (1 − µ)(N −x) x = 0, 1, ..., N .
x

This can be written in exponential family form


E G
N
p(x|N, µ) = exp [log(µx (1 − µ)(N −x) )]
x
E G
N
= exp [x log µ + (N − x) log(1 − µ)]
x
E G
N µ
= exp [x log + N log(1 − µ)] .
x 1−µ

The last line can be identified as being in exponential family form by ob-
serving that
S@ A
N
x x = 0, 1, ..., N.
h(x) =
0 otherwise
µ
θ = log
1−µ

Draft (2020-02-23) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com.

Downloaded by ???[ ???? / ?????·?? ? (kyungie0311@korea.ac.kr)


lOMoARcPSD|31719747

Exercises 529

φ(x) = x
A(θ) = −N log(1 − µ) = N log(1 + eθ ) .

Express Beta distribution as an exponential family distribution:

Recall the Beta distribution from Example 6.11


Γ(α + β) α−1
p(µ|α, β) = µ (1 − µ)(β−1)
Γ(α)Γ(β)
1
= µα−1 (1 − µ)(β−1) .
B(α, β)
This can be written in exponential family form
p(µ|α, β) = exp [(α − 1) log µ + (β − 1) log(1 − µ) − log B(α, β)] .

This can be identified as being in exponential family form by observing that


h(µ) = 1
θ|(α, β) = [α − 1 β − 1]⊤
φ(µ) = [log µ log(1 − µ)]
A(θ) = log B(α + β) .

Show the product of the Beta and Binomial distribution is also a member of
exponential family:

We treat µ as random variable here, x as an known integer between 0 and


N . For x = 0, . . . , N , we get

E G
N 1
p(x|N, µ)p(µ|α, β) = µx (1 − µ)(N −x) µα−1 (1 − µ)(β−1)
x B(α, β)
E G
N µ
= exp [x log + N log(1 − µ)]
x 1−µ

· exp [(α − 1) log µ + (β − 1) log(1 − µ) − log B(α, β)]


L
= exp (x + α − 1) log µ + (N − x + β − 1) log(1 − µ)
E G
N
M
+ log − log B(α, β) .
x

The last line can be identified as being in exponential family form by ob-
serving that
h(µ) = 1
θ = [x + α − 1 N − x + β − 1]⊤
φ(µ) = [log µ log(1 − µ)]
E G
N
A(θ) = log + log B(α + β) .
x

c
⃝2020 M. P. Deisenroth, A. A. Faisal, C. S. Ong. To be published by Cambridge University Press.

Downloaded by ???[ ???? / ?????·?? ? (kyungie0311@korea.ac.kr)


lOMoARcPSD|31719747

530 Probability and Distributions

6.10 Derive the relationship in Section 6.5.2 in two ways:


a. By completing the square
b. By expressing the Gaussian in its exponential family form
@ A @ A
The product of two Gaussians
@ N Ax | a, A N x | b, B is an unnormalized
Gaussian distribution c N x | c, C with

C = (A−1 + B −1 )−1
c = C(A−1 a + B −1 b)
D 1 @ A
c = (2π)− 2 | A + B | − 2 exp − 12 (a − b)⊤ (A + B)−1 (a − b) .

Note that the normalizing constant c itself can be considered a (normalized)


Gaussian distribution
@ either in aA or in @b with an “inflated”
A covariance matrix
A + B , i.e., c = N a | b, A + B = N b | a, A + B .
(i) By completing the square, we obtain
@ A D 1 1
N x | a, A ) =(2π)− 2 |A|− 2 exp [− (x − a)⊤ A−1 (x − a))],
2
@ A −D 1 1

N x | b, B ) =(2π) 2 |B| 2 exp [− (x − b)⊤ B −1 (x − b))] .
2
For convenience, we consider the log of the product:
H @ A @ AI
log N x | a, A N x | b, B
@ A @ A
= log N x | a, A + log N x | b, B
1
= − [(x − a)⊤ A−1 (x − a) + (x − b)⊤ B −1 (x − b)] + const
2
1
= − [x⊤ A−1 x − 2x⊤ A−1 a + a⊤ A−1 a + x⊤ B −1 x
2
− 2x⊤ B −1 b + b⊤ B −1 b] + const
1
= − [x⊤ (A−1 + B −1 x − 2x(A−1 a + B −1 b) + a⊤ A−1 a + b⊤ B −1 b]
2
+ const
1
L
=− (x − (A−1 + B −1 )−1 (A−1 a + B −1 b))⊤
2
M
· (A−1 + B −1 )(x − (A−1 + B −1 )−1 (A−1 a + B −1 b) + const .
@ A
Thus, the corresponding product is N x | c, C with

C = (A−1 + B −1 )−1 (6.9)


−1 −1 −1 −1 −1 −1 −1
c = (A +B ) (A a+B b) = C(A a+B b) . (6.10)

(ii) By expressing Gaussian in its exponential family form:

Recall from the Example 6.13 we get the exponential form of univari-
ate Gaussian distribution, similarly we can get the exponential form
of multivariate Gaussian distribution as
H 1 1
p(x|a, A) = exp − x⊤ A−1 x + x⊤ A−1 a − a⊤ A−1 a
2 2

Draft (2020-02-23) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com.

Downloaded by ???[ ???? / ?????·?? ? (kyungie0311@korea.ac.kr)


lOMoARcPSD|31719747

Exercises 531
1 1
I
− log(2π) − log |A| .
2 2
This can be identified as being in exponential family form by observ-
ing that
h(µ) = 1
0 @ A1
− 21 vec A−1
θ|(a, A) =
A−1 a
J H IK
vec xx⊤
φ(x) = .
x

Then the product of Gaussian distribution can be expressed as


p(x|a, A)p(x|b, B) = exp(⟨θ|(a, A), φ(x)⟩ + ⟨θ|(b, B), φ(x)⟩) + const
= exp(⟨θ|(a, A, b, B), φ(x)⟩) + const,
= >⊤
where θ | (a, A, b, B) = − 21 (A−1 + B −1 ) A−1 a + B −1 b .
Then we end up with the same answer as shown in (6.9)–(6.10).
6.11 Iterated Expectations.
Consider two random variables x, y with joint distribution p(x, y). Show that
= >
EX [x] = EY EX [x | y] .
Here, EX [x | y] denotes the expected value of x under the conditional distri-
bution p(x | y).

= >
EX [x] = EY EX [x | y]
< <<
⇐⇒ xp(x)dx = xp(x | y)p(y)dxdy
< <<
⇐⇒ xp(x)dx = xp(x, y)dxdy
< < < <
⇐⇒ xp(x)dx = x p(x, y)dydx = xp(x)dx ,

which proves the claim.


6.12 Manipulation of Gaussian Random Variables.
@ A
Consider a Gaussian random variable x ∼ N x | µx , Σx , where x ∈ RD .
Furthermore, we have
y = Ax + b + w ,
@ A
where y ∈ RE , A ∈ RE×D , b ∈ RE , and w ∼ N w | 0, Q is indepen-
dent Gaussian noise. “Independent” implies that x and w are independent
random variables and that Q is diagonal.
a. Write down the likelihood p(y | x).
@ A
p(y | x) = N y | Ax + b, Q
@ A
= Z exp − 12 (y − Ax − b)⊤ Q−1 (y − Ax − b)
where Z = (2π)−E/2 | Q | −1/2

c
⃝2020 M. P. Deisenroth, A. A. Faisal, C. S. Ong. To be published by Cambridge University Press.

Downloaded by ???[ ???? / ?????·?? ? (kyungie0311@korea.ac.kr)


lOMoARcPSD|31719747

532 Probability and Distributions

T
b. The distribution p(y) = p(y | x)p(x)dx is Gaussian. Compute the mean
µy and the covariance Σy . Derive your result in detail.
p(y) is Gaussian distributed (affine transformation of the Gaussian ran-
dom variable x) with mean µy and covariance matrix Σy . We obtain

µy = EY [y] = EX,W [Ax + b + w] = AEX [x] + b + EW [w]


= Aµx + b, and
Σy = EY [yy ] − EY [y]EY [y]⊤

= >
= EX,W (Ax + b + w)(Ax + b + w)⊤ − µy µ⊤
y
=
= EX,W Axx⊤ A⊤ + Axb⊤ + Axw⊤ + bx⊤ A⊤ + bb⊤ + bw⊤
>
+ w(Ax + b)⊤ + ww⊤ ) − µy µ⊤
y .

We use the linearity of the expected value, move all constants out of the
expected value, and exploit the independence of w and x:

Σy = AEX [xx⊤ ]A⊤ + AEX [x]b⊤ + bEX [x⊤ ]A⊤ + bb⊤ + EW [ww⊤ ]
− (Aµx + b)(Aµx + b)⊤ ,

where we used our previous result for µy . Note that EW [w] = 0. We


continue as follows:

Σy = AEX [xx⊤ ]A⊤ + Aµx b⊤ + bµ⊤ ⊤ ⊤


x A + bb + Q

− Aµx µ⊤ ⊤ ⊤ ⊤ ⊤
x A − Aµx b − bµx A − bb

@ A
= A EX [xx⊤ ] − µx µ⊤ ⊤
x A +Q
! "# $
=Σx

= AΣx A⊤ + Q .

Alternatively, we could have exploited


i.i.d.
Vy [y] = Vx,w [Ax + b + w] = Vx [Ax + b] + Vw [w]
= AVx A⊤ + Q = AΣx A⊤ + Q .

c. The random variable y is being transformed according to the measure-


ment mapping

z = Cy + v ,
@ A
where z ∈ RF , C ∈ RF ×E , and v ∼ N v | 0, R is independent Gaus-
sian (measurement) noise.
Write down p(z | y).
@ A
p(z | y) = N z | Cy, R
@ A
= (2π)−F/2 | R | −1/2 exp − 21 (z − Cy)⊤ R−1 (z − Cy) .

Draft (2020-02-23) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com.

Downloaded by ???[ ???? / ?????·?? ? (kyungie0311@korea.ac.kr)


lOMoARcPSD|31719747

Exercises 533

Compute p(z), i.e., the mean µz and the covariance Σz . Derive your
result in detail.
Remark: Since y is Gaussian and z is a linear transformation of y ,
p(z) is Gaussian, too. Let’s compute its moments (similar to the “time
update”):

µz = EZ [z] = EY,V [Cy + v] = C EY [y] + EV [v]


! "# $
=0
= Cµy .

For the covariance matrix, we compute

Σz = EZ [zz ⊤ ] − EZ [z]EZ [z]⊤


= >
= EY,V (Cy + v)(Cy + v)⊤ − µz µ⊤
z
= >
= EY,V Cyy ⊤ C ⊤ + Cyv ⊤ + v(Cy)⊤ + vv ⊤ ) − µz µ⊤
z .

We use the linearity of the expected value, move all constants out of
the expected value, and exploit the independence of v and y :

Σz = C EY [yy ⊤ ]C ⊤ + EV [vv ⊤ ] − Cµy µ⊤ ⊤


yC ,

where we used our previous result for µz . Note that EV [v] = 0. We


continue as follows:
@ A ⊤
Σy = C EY [yy ⊤ ] − µy µ⊤
y C +R
! "# $
=Σx

= CΣy C + R.

d. Now, a value ŷ is measured. Compute the posterior distribution p(x | ŷ).


Hint for solution: This posterior is also Gaussian, i.e., we need to de-
termine only its mean and covariance matrix. Start by explicitly com-
puting the joint Gaussian p(x, y). This also requires us to compute the
cross-covariances Covx,y [x, y] and Covy,x [y, x]. Then apply the rules
for Gaussian conditioning.
We derive the posterior distribution following the second hint since we
do not have to worry about normalization constants:
Assume, we know the joint Gaussian distribution
P0 1 , 0 1 0 1Q
x ,, µx Σx Σxy
p(x, y) = N , , (6.11)
, y µy Σyx Σy

where we defined Σxy := Covx,y [x, y].


Now, we apply the rules for Gaussian conditioning to obtain
@ A
p(x | ŷ) = N x | µx | y , Σx | y
µx | y = µx + Σxy Σ−1
y (ŷ − µy )

Σx | y = Σx − Σxy Σ−1
y Σyx .

c
⃝2020 M. P. Deisenroth, A. A. Faisal, C. S. Ong. To be published by Cambridge University Press.

Downloaded by ???[ ???? / ?????·?? ? (kyungie0311@korea.ac.kr)


lOMoARcPSD|31719747

534 Probability and Distributions

Looking at (6.11), it remains to compute the cross-covariance term Σxy


(the marginal distributions p(x) and p(y) are known and Σyx = Σ⊤ xy ):

Σxy = Cov[x, y] = EX,Y [xy ⊤ ] − EX [x]EY [y]⊤ = EX,Y [xy ⊤ ] − µx µ⊤


y ,

where µx and µy are known. Hence, it remains to compute

EX,Y [xy ⊤ ] = EX [x(Ax + b + w)⊤ ] = EX [xx⊤ ]A⊤ + EX [x]b⊤


= EX [xx⊤ ]A⊤ + µx b⊤ .

With
µx µ⊤ ⊤ ⊤
y = µx µx A + µx b

we obtain the desired cross-covariance


Σxy = EX [xx⊤ ]A⊤ + µx b⊤ − µx µ⊤ ⊤
x A − µx b

@ A
= EX [xx⊤ ] − µx µ⊤ ⊤ ⊤
x A = Σx A .

And finally
µx | y = µx + Σx A⊤ (AΣx A⊤ + Q)−1 (ŷ − Aµx − b) ,
Σx | y = Σx − Σx A⊤ (AΣx A⊤ + Q)−1 AΣx .

6.13 Probability Integral Transformation


Given a continuous random variable x, with cdf Fx (x), show that the ran-
dom variable y = Fx (x) is uniformly distributed.
Proof We need to show that the cumulative distribution function of Y de-
fines a distribution of a uniform random variable. Recall that by the axioms
of probability (Section 6.1) probabilities must be non-negative and sum/
integrate to one. Therefore, the range of possible values of Y = FX (x) is
−1
the interval [0, 1]. For any FX (x), the inverse FX (x) exists because we as-
sumed that FX (x) is strictly monotonically increasing, which we will use in
the following.
Given any continuous random variable X , the definition of a cdf gives
FY (y) = P (Y # y)
= P (FX (x) # y) transformation of interest
−1
= P (X # FX (y)) inverse exists
−1
= FX (FX (y)) definition of cdf
= y,

where the last line is due to the fact that FX (x) composed with its inverse
results in an identity transformation. The statement FY (y) = y along with
the fact that y lies in the interval [0, 1] means that FY (x) is the cdf of the
uniform random variable on the unit interval.

Draft (2020-02-23) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com.

Downloaded by ???[ ???? / ?????·?? ? (kyungie0311@korea.ac.kr)


lOMoARcPSD|31719747

Continuous Optimization

Exercises
7.1 Consider the univariate function
f (x) = x3 + 6x2 − 3x − 5.

Find its stationary points and indicate whether they are maximum, mini-
mum, or saddle points.
Given the function f (x), we obtain the following gradient and Hessian,
df
= 3x2 + 12x − 3
dx
d2 f
= 6x + 12 .
dx2
We find stationary points by setting the gradient to zero, and solving for x.
One option is to use the formula for quadratic functions, but below we show
how to solve it using completing the square. Observe that
(x + 2)2 = x2 + 4x + 4

and therefore (after dividing all terms by 3),


(x2 + 4x) − 1 = ((x + 2)2 − 4) − 1 .

By setting this to zero, we obtain that


(x + 2)2 = 5,

and hence

x = −2 ± 5.
df
Substituting the solutions of dx = 0 into the Hessian gives
2
d f √ d2 f √
(−2 − 5) ≈ −13.4 and (−2 + 5) ≈ 13.4 .
dx2 dx2
This means that the left stationary point is a maximum and the right one is
a minimum. See Figure 7.1 for an illustration.
7.2 Consider the update equation for stochastic gradient descent (Equation (7.15)).
Write down the update when we use a mini-batch size of one.
θ i+1 = θ i − γi (∇L(θ i ))⊤ = θ i − γi ∇Lk (θ i )⊤

where k is the index of the example that is randomly chosen.


7.3 Consider whether the following statements are true or false:

535
This material will be published by Cambridge University Press as Mathematics for Machine Learn-
ing by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong. This pre-publication version is
free to view and download for personal use only. Not for re-distribution, re-sale or use in deriva-
c
tive works. ⃝by M. P. Deisenroth, A. A. Faisal, and C. S. Ong, 2020. https://mml-book.com.

Downloaded by ???[ ???? / ?????·?? ? (kyungie0311@korea.ac.kr)


lOMoARcPSD|31719747

536 Continuous Optimization

Figure 7.1 A plot of


100 f (x)
the function f (x) df
dx
80 d2 f
along with its dx2

gradient and 60

Hessian. 40

20

−20

−40
−6 −4 −2 0 2

a. The intersection of any two convex sets is convex.


b. The union of any two convex sets is convex.
c. The difference of a convex set A from another convex set B is convex.
a. true
b. false
c. false
7.4 Consider whether the following statements are true or false:
a. The sum of any two convex functions is convex.
b. The difference of any two convex functions is convex.
c. The product of any two convex functions is convex.
d. The maximum of any two convex functions is convex.
a. true
b. false
c. false
d. true
7.5 Express the following optimization problem as a standard linear program in
matrix notation
max p⊤ x + ξ
x∈R2 , ξ∈R

subject to the constraints that ξ " 0, x0 # 0 and x1 # 3.


The optimization target and constraints must all be linear. To make the op-
timization target linear, we combine x and ξ into one matrix equation:
⎡ ⎤⊤ ⎡ ⎤
p0 x0
max ⎣p 1 ⎦ ⎣x1 ⎦
x∈R2 ,ξ∈R
1 ξ

We can then combine all of the inequalities into one matrix inequality:
⎡ ⎤⎡ ⎤ ⎡ ⎤
1 0 0 x0 0
subject to ⎣0 1 0 ⎦ ⎣ x1 ⎦ − ⎣3 ⎦ # 0
0 0 −1 ξ 0

7.6 Consider the linear program illustrated in Figure 7.9,


0 1⊤ 0 1
5 x1
min −
x∈R2 3 x2

Draft (2020-02-23) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com.

Downloaded by ???[ ???? / ?????·?? ? (kyungie0311@korea.ac.kr)


lOMoARcPSD|31719747

Exercises 537
⎡ ⎤ ⎡ ⎤
2 2 33
⎢2
⎢ −4⎥ 0 1 ⎢8⎥
⎥ x1 ⎢ ⎥
⎢−2
subject to ⎢ 1 ⎥
⎥ x2 # ⎢ 5 ⎥
⎢ ⎥
⎣0 −1⎦ ⎣−1⎦
0 1 8
Derive the dual linear program using Lagrange duality.
Write down the Lagrangian:
⎛⎡ ⎤ ⎡ ⎤⎞
2 2 33
0 1⊤ ⎜⎢ 2 −4⎥ ⎢ 8 ⎥⎟
5 ⎜⎢ ⎥ ⎢ ⎥⎟
L(x, λ) = − x + λ⊤ ⎜
⎜⎢−2

⎥ x − ⎢ 5 ⎥⎟
1 ⎥ ⎢ ⎥⎟
3 ⎝⎣ 0 −1 ⎦ ⎣−1⎦⎠
0 1 8
Rearrange and factorize x:
⎛ ⎡ ⎤⎞ ⎡ ⎤
2 2 33
⎜ 0 1⊤
⎜ 5
⎢2
⎢ −4⎥⎥⎟
⎟ ⎢8⎥
⎢ ⎥
⊤⎢ ⊤⎢
⎜− 3 + λ ⎢−2
L(x, λ) = ⎜ ⎥⎟ x − λ ⎢ 5 ⎥
1 ⎥ ⎟ ⎥
⎝ ⎣0 −1 ⎦ ⎠ ⎣ −1⎦
0 1 8
Differentiate with respect to x and set to zero:
⎡ ⎤
2 2
0 1⊤ ⎢2 −4⎥
5 ⎢ ⎥
∇x L = − + λ⊤ ⎢
⎢−2 ⎥=0
1 ⎥
3 ⎣0 −1⎦
0 1
Then substitute back into the Lagrangian to obtain the dual Lagrangian:
⎡ ⎤
33
⎢8⎥
⎢ ⎥
D(λ) = −λ⊤ ⎢
⎢5⎥

⎣−1⎦
8
The dual optimization problem is therefore
⎡ ⎤
33
⎢8⎥
⎢ ⎥
⊤⎢
max − λ ⎢ 5 ⎥

λ∈Rm ⎣−1⎦
8
⎡ ⎤
2 2
0 1⊤ ⎢2 −4⎥
5 ⎢ ⎥
subject to − + λ⊤ ⎢
⎢−2 ⎥=0
1 ⎥
3 ⎣0 −1⎦
0 1
λ " 0.

c
⃝2020 M. P. Deisenroth, A. A. Faisal, C. S. Ong. To be published by Cambridge University Press.

Downloaded by ???[ ???? / ?????·?? ? (kyungie0311@korea.ac.kr)


lOMoARcPSD|31719747

538 Continuous Optimization

7.7 Consider the quadratic program illustrated in Figure 7.4,


0 1⊤ 0 1 0 1 0 1⊤ 0 1
1 x1 2 1 x1 5 x1
min +
x∈R2 2 x2 1 4 x2 3 x2
⎡ ⎤ ⎡ ⎤
1 0 1
0 1
⎢−1 0 ⎥ x 1
⎢1 ⎥
subject to ⎢
⎣0
⎥ #⎢ ⎥
1 ⎦ x2 ⎣1 ⎦
0 −1 1

Derive the dual quadratic program


⎡ using Lagrange
⎤ duality.
⎡ ⎤
1 0 1
0 1 0 1
2 1 5 ⎢−1 0 ⎥ ⎢1 ⎥
Let Q = ,c = ,A = ⎢ ⎥ , and b = ⎢ ⎥. Then by (7.45)
1 4 3 ⎣0 1 ⎦ ⎣1 ⎦
0 −1 1
and (7.52) the dual optimization problem is
⎛ ⎡ ⎤ ⊤ ⎞⊤ ⎛ ⎡ ⎤⊤ ⎞ ⎡ ⎤
1 0 1 0 1
⎜0 1 ⎢ ⎟ 0 1−1 ⎜0 1 ⎟
1⎜ 5 −1 0 ⎥
⎥ λ⎟ 2 1 ⎜ 5 ⎢−1 0 ⎥ ⎟ ⊤
⎢1 ⎥
max − ⎜ +⎢ ⎟ ⎜ +⎢ ⎥ λ⎟ − λ ⎢ ⎥
λ∈R4 2⎝ 3 ⎣0 1 ⎦ ⎠ 1 4 ⎝ 3 ⎣0 1 ⎦ ⎠ ⎣1 ⎦
0 −1 0 −1 1
subject to λ " 0.
which expands to
⎛ ⎡ ⎤⊤ ⎡ ⎤ ⎞
33 4 −4 −1 1
⎜ ⎢−35⎥ ⎢−4 ⎟
1 ⎜ ⊤ 4 1 −1⎥⎥ λ⎟
max − ⎜88 + ⎢
⎣ 1 ⎦ λ + λ ⎣−1
⎥ ⎢ ⎟
λ∈R4 14 ⎝ 1 2 −2⎦ ⎠
−3 1 −1 −2 2
subject to λ " 0.

Alternative derivation
Write down the Lagrangian:
⎛⎡ ⎤ ⎡ ⎤⎞
1 0 1
0 1 0 1⊤
1 ⊤ 2 1 5 ⎜⎢ −1 0 ⎥ ⎢1⎥⎟
L(x, λ) = x x+ x + λ⊤ ⎜⎢ ⎥ x − ⎢ ⎥⎟
2 1 4 3 ⎝⎣ 0 1 ⎦ ⎣1⎦⎠
0 −1 1

Differentiate with respect to x and set to zero:


⎡ ⎤
1 0
0 1 0 1⊤
⊤ 2 1 5 ⊤ ⎢−1 0 ⎥

∇x L = x + +λ ⎣ ⎥=0
1 4 3 0 1 ⎦
0 −1

Solve for x:
⎛ ⎡ ⎤⎞
1 0
0 1⊤ 0 1−1
⊤ ⎢−1
⎜ 5 0 ⎥
⎥⎟ 2 1
⎢ ⎟
x⊤ = − ⎜ +λ ⎣
⎝ 3 0 1 ⎦⎠ 1 4
0 −1

Draft (2020-02-23) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com.

Downloaded by ???[ ???? / ?????·?? ? (kyungie0311@korea.ac.kr)


lOMoARcPSD|31719747

Exercises 539

Then substitute back to get the dual Lagrangian:


⎛⎡ ⎤ ⎡ ⎤⎞
1 0 1
0 1 0 1⊤
1 ⊤ 2 1 5 ⎜ ⎢ −1 0 ⎥ ⎢1 ⎥⎟
D(λ) = x x+ x + λ⊤ ⎜ ⎢ ⎥ x − ⎢ ⎥⎟
2 1 4 3 ⎝⎣ 0 1 ⎦ ⎣1 ⎦⎠
0 −1 1

7.8 Consider the following convex optimization problem


1 ⊤
min w w
w∈RD 2
subject to w⊤ x " 1 .

Derive the Lagrangian dual by introducing the Lagrange multiplier λ.


First we express the convex optimization problem in standard form,
1 ⊤
min w w
w∈RD 2
subject to 1 − w⊤ x # 0 .

By introducing a Lagrange multiplier λ " 0, we obtain the following La-


grangian
1 ⊤
L(w) = w w + λ(1 − w⊤ x)
2
Taking the gradient of the Lagrangian with respect to w gives
dL(w)
= w⊤ − λx⊤ .
dw
Setting the gradient to zero and solving for w gives
w = λx .

Substituting back into L(w) gives the dual Lagrangian


λ2 ⊤
D(λ) = x x + λ − λ 2 x⊤ x
2
λ2
= − x⊤ x + λ .
2
Therefore the dual optimization problem is given by
λ2 ⊤
max − x x+λ
λ∈R 2
subject to λ " 0 .

7.9 Consider the negative entropy of x ∈ RD ,


D
F
f (x) = xd log xd .
d=1

Derive the convex conjugate function f ∗ (s), by assuming the standard dot

c
⃝2020 M. P. Deisenroth, A. A. Faisal, C. S. Ong. To be published by Cambridge University Press.

Downloaded by ???[ ???? / ?????·?? ? (kyungie0311@korea.ac.kr)


lOMoARcPSD|31719747

540 Continuous Optimization

product.
Hint: Take the gradient of an appropriate function and set the gradient to zero.
From the definition of the Legendre Fenchel conjugate
D
F
f ∗ (s) = sup sd xd − xd log xd .
x∈RD d=1

Define a function (for notational convenience)


D
F
g(x) = sd xd − xd log xd .
d=1

The gradient of g(x) with respect to xd is


dg(x) x
= sd − d − log xd
dxd xd
= sd − 1 − log xd .

Setting the gradient to zero gives


xd = exp(sd − 1) .

Substituting the optimum value of xd back into f ∗ (s) gives


D
F
f ∗ (s) = = sd exp(sd − 1) − (sd − 1) exp(sd − 1)
d=1
= exp(sd − 1) .

7.10 Consider the function


1 ⊤
f (x) = x Ax + b⊤ x + c ,
2
where A is strictly positive definite, which means that it is invertible. Derive
the convex conjugate of f (x).
Hint: Take the gradient of an appropriate function and set the gradient to zero.
From the definition of the Legendre Fenchel transform,
1 ⊤
f ∗ (s) = sup s⊤ x − x Ax − b⊤ x − c .
x∈RD 2

Define a function (for notational convenience)


1 ⊤
g(x) = s⊤ x − x Ax − b⊤ x − c .
2
The gradient of g(x) with respect to x is
dg(x)
= s ⊤ − x⊤ A − b ⊤ .
dx
Setting the gradient to zero gives (note that A⊤ = A)
sA⊤ x − b = 0
A⊤ x = s − b
x = A−1 (s − b) .

Draft (2020-02-23) of “Mathematics for Machine Learning”. Feedback: https://mml-book.com.

Downloaded by ???[ ???? / ?????·?? ? (kyungie0311@korea.ac.kr)


lOMoARcPSD|31719747

Exercises 541

Substituting the optimum value of x back into f ∗ (s) gives


1
f ∗ (s) = s⊤ A−1 (s − b) − (s − b)A−1 (s − b) − b⊤ A−1 (s − b) − c
2
1
= (s − b)⊤ A−1 (s − b) − c .
2
7.11 The hinge loss (which is the loss used by the support vector machine) is
given by
L(α) = max{0, 1 − α} ,

If we are interested in applying gradient methods such as L-BFGS, and do


not want to resort to subgradient methods, we need to smooth the kink in
the hinge loss. Compute the convex conjugate of the hinge loss L∗ (β) where
β is the dual variable. Add a ℓ2 proximal term, and compute the conjugate
of the resulting function
γ 2
L∗ (β) + β ,
2
where γ is a given hyperparameter.
Recall that the hinge loss is given by
L(α) = max{0, 1 − α}

The convex conjugate of L(α) is


L∗ (β) = sup {αβ − max{0, 1 − α}}
α∈R
S
β if − 1 # β # 0,
=
∞ otherwise

The smoothed conjugate is


γ 2
L∗γ (β) = L∗ (β) + β .
2
The corresponding primal smooth hinge loss is given by
U γ
V
Lγ (α) = sup αβ − β − β2
−1!β!0 2
⎧ γ
⎨1 − α 2−
⎪ 2 if α < 1 − γ,
(α−1)
= if 1 − γ # α # 1,
⎪ 2γ

0 if α > 1.

Lγ (α) is convex and differentiable with the derivative



⎨−1
⎪ if α < 1 − γ,
L′γ (α) = α−1
if 1 − γ # α # 1,
⎪ γ

0 if α > 1.

c
⃝2020 M. P. Deisenroth, A. A. Faisal, C. S. Ong. To be published by Cambridge University Press.

Downloaded by ???[ ???? / ?????·?? ? (kyungie0311@korea.ac.kr)

You might also like