Linear Algebra
Exercises
2.1 We consider (R\{−1}, ⋆), where
a ⋆ b := ab + a + b, a, b ∈ R\{−1} (2.1)
a. Show that (R\{−1}, ⋆) is an Abelian group.
b. Solve
3 ⋆ x ⋆ x = 15
Closure: Let a, b ∈ R\{−1} and suppose a ⋆ b = −1. Then ab + a + b + 1 = (a + 1)(b + 1) = 0, so that a = −1 or b = −1, a contradiction. Hence a ⋆ b ∈ R\{−1}.
Commutativity:
∀a, b ∈ R\{−1} : a ⋆ b = ab + a + b = ba + b + a = b ⋆ a
This material will be published by Cambridge University Press as Mathematics for Machine Learning by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong. This pre-publication version is free to view and download for personal use only. Not for re-distribution, re-sale or use in derivative works. © by M. P. Deisenroth, A. A. Faisal, and C. S. Ong, 2020. https://mml-book.com.
b.
3 ⋆ x ⋆ x = 15 ⇐⇒ 3 ⋆ (x² + 2x) = 15
⇐⇒ 3x² + 6x + 3 + x² + 2x = 15
⇐⇒ 4x² + 8x − 12 = 0 ⇐⇒ x² + 2x − 3 = 0
⇐⇒ (x − 1)(x + 3) = 0
⇐⇒ x ∈ {−3, 1}
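The two candidate solutions can be checked directly against the definition of ⋆. The snippet below is a quick numerical sketch (the helper name `star` is ours, not the book's):

```python
def star(a, b):
    # The group operation from (2.1): a * b + a + b
    return a * b + a + b

# Both roots of 4x^2 + 8x - 12 = 0 satisfy 3 ⋆ x ⋆ x = 15
# (using associativity to group x ⋆ x first).
for x in (-3, 1):
    assert star(3, star(x, x)) == 15
```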
\overline{k} = {x ∈ Z | x − k = 0 (mod n)}
  = {x ∈ Z | ∃a ∈ Z : (x − k = n · a)} .
Z_n = {\overline{0}, \overline{1}, . . . , \overline{n − 1}}
\overline{a} ⊕ \overline{b} := \overline{a + b} ,  \overline{a} ⊗ \overline{b} := \overline{a × b} , (2.2)
\overline{a} ⊕ \overline{b} = \overline{a + b} = \overline{(a + b) mod n}
so that ⊕ is associative.
Neutral element: We have
\overline{a} ⊕ \overline{0} = \overline{a + 0} = \overline{a} = \overline{0} ⊕ \overline{a}
⊗ 1 2 3 4
1 1 2 3 4
2 2 4 1 3
3 3 1 4 2
4 4 3 2 1
We can notice that all the products are in Z5 \{0}, and that in particular,
none of them is equal to 0. Thus, Z5 \{0} is closed under ⊗. The neutral
element is 1 and we have (1)−1 = 1, (2)−1 = 3, (3)−1 = 2, and (4)−1 = 4.
Associativity and commutativity are straightforward and (Z5 \{0}, ⊗) is
an Abelian group.
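The multiplication table and the claimed inverses can be reproduced mechanically; this sketch rebuilds ⊗ on Z5\{0} and checks closure and the inverses 1⁻¹ = 1, 2⁻¹ = 3, 3⁻¹ = 2, 4⁻¹ = 4:

```python
n = 5
elems = list(range(1, n))  # Z5 \ {0} = {1, 2, 3, 4}

# a ⊗ b := (a * b) mod n
products = {(a, b): (a * b) % n for a in elems for b in elems}

# Closure: no product is 0, so Z5\{0} is closed under ⊗.
assert all(p != 0 for p in products.values())

# Inverses: for each a there is a b with a ⊗ b = 1.
inv = {a: next(b for b in elems if (a * b) % n == 1) for a in elems}
assert inv == {1: 1, 2: 3, 3: 2, 4: 4}
```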
c. The elements 2 and 4 belong to Z8 \{0}, but their product 2 ⊗ 4 = 8 = 0
does not. Thus, this set is not closed under ⊗ and is not a group.
d. Let us assume that n is not prime and can thus be written as a product
n = a × b of two integers a and b in {2, . . . , n − 1}. Both elements a
and b belong to Zn \{0} but their product a ⊗ b = n = 0 does not.
Thus, this set is not closed under ⊗ and (Zn \{0}, ⊗) is not a group.
Let n be a prime number. Let a and b be in Zn \{0} with a and b in
{1, . . . , n − 1}. As n is prime, we know that a is relatively prime to n,
and so is b. Let us then take four integers u, v, u′ and v ′ , such that
au + nv = 1
bu′ + nv ′ = 1 .
Exercises 471
By virtue of the Bézout theorem, this implies that ab and n are rela-
tively prime, which ensures that the product a ⊗ b is not equal to 0
and belongs to Zn \{0}, which is thus closed under ⊗.
The associativity and commutativity of ⊗ are straightforward, but we
need to show that every element has an inverse. First, the neutral
element is 1. Let us again consider an element a in Zn \{0} with a in
{1, . . . , n − 1}. As a and n are coprime, the Bézout theorem enables
us to define two integers u and v such that
au + nv = 1 , (2.3)
which implies that au = 1 − nv and thus
au = 1 mod n , (2.4)
which means that a ⊗ u = au = 1, or that u is the inverse of a. Over-
all, (Zn \{0}, ⊗) is an Abelian group. Note that the Bézout theorem
ensures the existence of an inverse without yielding its explicit value,
which is the purpose of the extended Euclidean algorithm.
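The extended Euclidean algorithm mentioned above can be sketched in a few lines; `inverse_mod` is our own helper name, and the assertions check it against the Z5 inverses computed earlier:

```python
def egcd(a, b):
    """Return (g, u, v) with a*u + b*v = g = gcd(a, b)."""
    if b == 0:
        return a, 1, 0
    g, u, v = egcd(b, a % b)
    return g, v, u - (a // b) * v

def inverse_mod(a, n):
    """Inverse of a in (Zn \\ {0}, ⊗); exists whenever gcd(a, n) = 1."""
    g, u, _ = egcd(a, n)
    assert g == 1, "a and n must be coprime (Bezout)"
    return u % n

assert inverse_mod(2, 5) == 3
assert inverse_mod(3, 5) == 2
assert (3 * inverse_mod(3, 7)) % 7 == 1
```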
2.3 Consider the set G of 3 × 3 matrices defined as follows:

G = { [ 1 x z ]
      [ 0 1 y ]  ∈ R^{3×3} | x, y, z ∈ R }
      [ 0 0 1 ]
Similarly,

A · (B · C) = [ 1 x z ]   [ 1 α+a γ+aβ+c ]
              [ 0 1 y ] · [ 0  1    β+b  ]
              [ 0 0 1 ]   [ 0  0     1   ]

            = [ 1 α+a+x γ+aβ+xβ+c+xb+z ]
              [ 0   1        β+b+y     ] = (A · B) · C .
              [ 0   0          1       ]
Therefore, · is associative.
Neutral element: For all A ∈ G, we have I₃ · A = A = A · I₃, and thus I₃ is the neutral element.
Non-commutativity: We show that · is not commutative. Consider the
matrices X, Y ∈ G , where
X = [ 1 1 0 ]       Y = [ 1 0 0 ]
    [ 0 1 0 ] ,         [ 0 1 1 ] . (2.5)
    [ 0 0 1 ]           [ 0 0 1 ]

Then,

X · Y = [ 1 1 1 ]
        [ 0 1 1 ] ,
        [ 0 0 1 ]

Y · X = [ 1 1 0 ]
        [ 0 1 1 ] ≠ X · Y .
        [ 0 0 1 ]
The last row of the final linear system shows that the equation system
has no solution and thus S = ∅.
b.

A = [  1 −1 0  0  1 ]        b = [  3 ]
    [  1  1 0 −3  0 ]            [  6 ]
    [  2 −1 0  1 −1 ] ,          [  5 ]
    [ −1  2 0 −2 −1 ]            [ −1 ]
We start again by writing down the augmented matrix and apply Gaus-
sian elimination to obtain the reduced row echelon form:
[  1 −1 0  0  1 |  3 ]
[  1  1 0 −3  0 |  6 ]  −R1
[  2 −1 0  1 −1 |  5 ]  −2R1
[ −1  2 0 −2 −1 | −1 ]  +R1

→ [ 1 −1 0  0  1 |  3 ]  +R3
  [ 0  2 0 −3 −1 |  3 ]  −2R3
  [ 0  1 0  1 −3 | −1 ]  swap with R2
  [ 0  1 0 −2  0 |  2 ]  −R3

→ [ 1 0 0  1 −2 |  2 ]
  [ 0 1 0  1 −3 | −1 ]
  [ 0 0 0 −5  5 |  5 ]  ·(−1/5)
  [ 0 0 0 −3  3 |  3 ]  −(3/5)R3

→ [ 1 0 0 1 −1... ]
  [ 0 1 0 1 −3 | −1 ]  −R3
  [ 0 0 0 1 −1 | −1 ]
  [ 0 0 0 0  0 |  0 ]

→ [ 1 0 0 0 −1 |  3 ]
  [ 0 1 0 0 −2 |  0 ]
  [ 0 0 0 1 −1 | −1 ]
From the reduced row echelon form we apply the “Minus-1 Trick” in
order to get the following system:
[ 1 0  0 0 −1 |  3 ]
[ 0 1  0 0 −2 |  0 ]
[ 0 0 −1 0  0 |  0 ]
[ 0 0  0 1 −1 | −1 ]
[ 0 0  0 0 −1 |  0 ]
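A particular solution can be read off the reduced row echelon form (setting the free variables x3 = x5 = 0), and the Minus-1 Trick columns give the homogeneous solutions; both are easy to verify against the original system (a sketch, with vectors read off the matrices above):

```python
A = [[1, -1, 0,  0,  1],
     [1,  1, 0, -3,  0],
     [2, -1, 0,  1, -1],
     [-1, 2, 0, -2, -1]]
b = [3, 6, 5, -1]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

# Particular solution from the RREF with free variables x3 = x5 = 0.
x_p = [3, 0, 0, -1, 0]
assert matvec(A, x_p) == b

# Homogeneous solutions: the columns of the padded matrix holding a -1.
for x_h in ([0, 0, -1, 0, 0], [-1, -2, 0, -1, -1]):
    assert matvec(A, x_h) == [0, 0, 0, 0]
```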
2.6 Using Gaussian elimination, find all solutions of the inhomogeneous equa-
tion system Ax = b with
A = [ 0 1 0 0 1 0 ]        [  2 ]
    [ 0 0 0 1 1 0 ] ,  b = [ −1 ] .
    [ 0 1 0 0 0 1 ]        [  1 ]
This augmented matrix is now in reduced row echelon form. Applying the
“Minus-1 trick” gives us the following augmented matrix:
[ −1 0  0 0 0  0 |  0 ]
[  0 1  0 0 0  1 |  1 ]
[  0 0 −1 0 0  0 |  0 ]
[  0 0  0 1 0  1 | −2 ]
[  0 0  0 0 1 −1 |  1 ]
[  0 0  0 0 0 −1 |  0 ]
The general solution adds solutions from the homogeneous equation system
Ax = 0. We can use the RREF of the augmented system to read out these so-
lutions by using the Minus-1 Trick. Padding the RREF with rows containing
-1 on the diagonal gives us
[ −1 0  0 0 0  0 ]
[  0 1  0 0 0  1 ]
[  0 0 −1 0 0  0 ]
[  0 0  0 1 0  1 ] .
[  0 0  0 0 1 −1 ]
[  0 0  0 0 0 −1 ]
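As a check, a particular solution read off the RREF (free variables x1 = x3 = x6 = 0) and the Minus-1 Trick columns can be substituted back into Ax = b (a sketch):

```python
A = [[0, 1, 0, 0, 1, 0],
     [0, 0, 0, 1, 1, 0],
     [0, 1, 0, 0, 0, 1]]
b = [2, -1, 1]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

# Particular solution with free variables x1 = x3 = x6 = 0.
x_p = [0, 1, 0, -2, 1, 0]
assert matvec(A, x_p) == b

# Null-space directions from the columns containing the padded -1 entries.
for x_h in ([-1, 0, 0, 0, 0, 0], [0, 0, -1, 0, 0, 0], [0, 1, 0, 1, -1, -1]):
    assert matvec(A, x_h) == [0, 0, 0]
```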
⎡ ⎤
x1
2.7 Find all solutions in x = ⎣x2 ⎦ ∈ R3 of the equation system Ax = 12x,
x3
Exercises 477
where
⎡ ⎤
6 4 3
A = ⎣6 0 9⎦
0 8 0
5
and 3i=1 xi = 1.
We start by rephrasing the problem into solving a homogeneous system of
linear equations. Let x be in R3 . We notice that Ax = 12x is equivalent
to (A − 12I)x = 0, which can be rewritten as the homogeneous system
Ãx = 0, where we define
Ã = [ −6   4   3 ]
    [  6 −12   9 ] .
    [  0   8 −12 ]

The constraint Σ_{i=1}^{3} x_i = 1 can be transcribed as a fourth equation, which leads us to consider the following linear system, which we bring to reduced row echelon form:
[ −6   4   3 | 0 ]  +R2
[  6 −12   9 | 0 ]  ·(1/3)
[  0   8 −12 | 0 ]  ·(1/4)
[  1   1   1 | 1 ]

→ [ 0 −8 12 | 0 ]  +4R3
  [ 2 −4  3 | 0 ]  +2R3
  [ 0  2 −3 | 0 ]
  [ 1  1  1 | 1 ]

→ [ 0 0  0 | 0 ]
  [ 2 0 −3 | 0 ]  ·(1/2)
  [ 0 2 −3 | 0 ]  ·(1/2)
  [ 1 1  1 | 1 ]

→ [ 1 0 −3/2 | 0 ]
  [ 0 1 −3/2 | 0 ]
  [ 1 1   1  | 1 ]  −R1 − R2

→ [ 1 0 −3/2 | 0 ]  +(3/8)R3
  [ 0 1 −3/2 | 0 ]  +(3/8)R3
  [ 0 0   4  | 1 ]  ·(1/4)

→ [ 1 0 0 | 3/8 ]
  [ 0 1 0 | 3/8 ]
  [ 0 0 1 | 1/4 ]
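Reading the solution x = [3/8, 3/8, 1/4]⊤ off the final RREF, both the eigenvalue equation Ax = 12x and the sum constraint can be verified exactly with rational arithmetic:

```python
from fractions import Fraction as F

A = [[6, 4, 3],
     [6, 0, 9],
     [0, 8, 0]]
x = [F(3, 8), F(3, 8), F(1, 4)]

Ax = [sum(a * xi for a, xi in zip(row, x)) for row in A]
assert Ax == [12 * xi for xi in x]  # Ax = 12x
assert sum(x) == 1                  # the constraint x1 + x2 + x3 = 1
```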
a.

A = [ 2 3 4 ]
    [ 3 4 5 ]
    [ 4 5 6 ]

[ 2 3 4 | 1 0 0 ]
[ 3 4 5 | 0 1 0 ]  −(3/2)R1
[ 4 5 6 | 0 0 1 ]  −2R1

→ [ 2   3   4 |   1  0 0 ]
  [ 0 −1/2 −1 | −3/2 1 0 ]  −(1/2)R3
  [ 0  −1  −2 |  −2  0 1 ]  ·(−1)

→ [ 2 3 4 |   1  0   0  ]
  [ 0 0 0 | −1/2 1 −1/2 ] .
  [ 0 1 2 |   2  0  −1  ]
Here, we see that this system of linear equations is not solvable. There-
fore, the inverse does not exist.
b.

A = [ 1 0 1 0 ]
    [ 0 1 1 0 ]
    [ 1 1 0 1 ]
    [ 1 1 1 0 ]

[ 1 0 1 0 | 1 0 0 0 ]
[ 0 1 1 0 | 0 1 0 0 ]
[ 1 1 0 1 | 0 0 1 0 ]  −R1
[ 1 1 1 0 | 0 0 0 1 ]  −R1

→ [ 1 0  1 0 |  1 0 0 0 ]
  [ 0 1  1 0 |  0 1 0 0 ]  −R4
  [ 0 1 −1 1 | −1 0 1 0 ]  −R4
  [ 0 1  0 0 | −1 0 0 1 ]  swap with R2

→ [ 1 0  1 0 |  1 0 0  0 ]  −R4
  [ 0 1  0 0 | −1 0 0  1 ]
  [ 0 0 −1 1 |  0 0 1 −1 ]  +R4
  [ 0 0  1 0 |  1 1 0 −1 ]  swap with R3

→ [ 1 0 0 0 |  0 −1 0  1 ]
  [ 0 1 0 0 | −1  0 0  1 ]
  [ 0 0 1 0 |  1  1 0 −1 ]
  [ 0 0 0 1 |  1  1 1 −2 ]

Therefore,

A⁻¹ = [  0 −1 0  1 ]
      [ −1  0 0  1 ] .
      [  1  1 0 −1 ]
      [  1  1 1 −2 ]
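Multiplying A by the matrix obtained above confirms it is indeed the inverse (a quick integer-arithmetic sketch):

```python
A = [[1, 0, 1, 0],
     [0, 1, 1, 0],
     [1, 1, 0, 1],
     [1, 1, 1, 0]]
A_inv = [[0, -1, 0, 1],
         [-1, 0, 0, 1],
         [1, 1, 0, -1],
         [1, 1, 1, -2]]

# A @ A_inv should be the 4x4 identity matrix.
prod = [[sum(A[i][k] * A_inv[k][j] for k in range(4)) for j in range(4)]
        for i in range(4)]
identity = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
assert prod == identity
```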
which belongs to A.
3. Let α be in R. Then,
Therefore, A is a subspace of R3 .
b. The vector (1, −1, 0) belongs to B , but (−1) · (1, −1, 0) = (−1, 1, 0) does
not. Thus, B is not closed under scalar multiplication and is not a sub-
space of R3 .
c. Let A ∈ R1×3 be defined as A = [1, −2, 3]. The set C can be written as:
C = {x ∈ R3 | Ax = γ} .
A(x + y) = Ax + Ay = 0 + 0 = 0 .
A(λx) = λ(Ax) = λ0 = 0
as linear combination of

x1 = [1, 1, 1]⊤ ,  x2 = [1, 2, 3]⊤ ,  x3 = [2, −1, 1]⊤ .

We are looking for λ1, . . . , λ3 ∈ R such that Σ_{i=1}^{3} λ_i x_i = y. Therefore, we need to solve the inhomogeneous linear equation system

[ 1 1  2 |  1 ]
[ 1 2 −1 | −2 ]
[ 1 3  1 |  5 ]
1 −1 1 1 0 −1
Determine a basis of U1 ∩ U2 .
We start by checking whether the vectors in the generating sets of U1 (and U2) are linearly dependent. Thereby, we can determine bases of U1 and U2, which will make the following computations simpler.
We start with U1. To see whether the three vectors are linearly dependent, we need to find a linear combination of these vectors that allows a non-trivial representation of 0, i.e., λ1, λ2, λ3 ∈ R, such that

λ1 [1, 1, −3, 1]⊤ + λ2 [2, −1, 0, −1]⊤ + λ3 [−1, 1, −1, 1]⊤ = [0, 0, 0, 0]⊤ .
1−3 −1 0
−2 −1 0
and, therefore, λ2 = −2λ1 . This means that there exists a non-trivial linear
combination of 0 using spanning vectors of U1 , for example: λ1 = 1, λ2 = −2
and λ3 = −3. Therefore, not all vectors in the generating set of U1 are
necessary, such that U1 can be more compactly represented as
U1 = span[[1, 1, −3, 1]⊤ , [2, −1, 0, −1]⊤] .
Now, we see whether the generating set of U2 is also a basis. We try again
whether we can find a non-trivial linear combination of 0 using the spanning
vectors of U2 , i.e., a triple (α1 , α2 , α3 ) ∈ R3 such that
α1 [−1, −2, 2, 1]⊤ + α2 [2, −2, 0, 0]⊤ + α3 [−3, 6, −2, −1]⊤ = [0, 0, 0, 0]⊤ .
x ∈ U1 ∩ U2 ⇐⇒ x ∈ U1 ∧ x ∈ U2

⇐⇒ ∃λ1, λ2, α1, α2 ∈ R : (x = α1 [−1, −2, 2, 1]⊤ + α2 [2, −2, 0, 0]⊤)
   ∧ (x = λ1 [1, 1, −3, 1]⊤ + λ2 [2, −1, 0, −1]⊤)

⇐⇒ ∃λ1, λ2, α1, α2 ∈ R : (x = α1 [−1, −2, 2, 1]⊤ + α2 [2, −2, 0, 0]⊤)
   ∧ (λ1 [1, 1, −3, 1]⊤ + λ2 [2, −1, 0, −1]⊤ = α1 [−1, −2, 2, 1]⊤ + α2 [2, −2, 0, 0]⊤)

The third component of the vector equation reads −3λ1 = 2α1, i.e., α1 = −(3/2)λ1, so that

x ∈ U1 ∩ U2 ⇐⇒ ∃λ1, λ2, α2 ∈ R : (x = −(3/2)λ1 [−1, −2, 2, 1]⊤ + α2 [2, −2, 0, 0]⊤)
   ∧ (λ1 [1, 1, −3, 1]⊤ + (3/2)λ1 [−1, −2, 2, 1]⊤ + λ2 [2, −1, 0, −1]⊤ = α2 [2, −2, 0, 0]⊤)

⇐⇒ ∃λ1, λ2, α2 ∈ R : (x = −(3/2)λ1 [−1, −2, 2, 1]⊤ + α2 [2, −2, 0, 0]⊤)
   ∧ (λ1 [−1/2, −2, 0, 5/2]⊤ + λ2 [2, −1, 0, −1]⊤ = α2 [2, −2, 0, 0]⊤)

The fourth component now gives (5/2)λ1 − λ2 = 0, i.e., λ2 = (5/2)λ1, so that

x ∈ U1 ∩ U2 ⇐⇒ ∃λ1, α2 ∈ R : (x = −(3/2)λ1 [−1, −2, 2, 1]⊤ + α2 [2, −2, 0, 0]⊤)
   ∧ (λ1 [9/2, −9/2, 0, 0]⊤ = α2 [2, −2, 0, 0]⊤)

⇐⇒ ∃λ1, α2 ∈ R : (x = −(3/2)λ1 [−1, −2, 2, 1]⊤ + α2 [2, −2, 0, 0]⊤) ∧ (α2 = (9/4)λ1)

⇐⇒ ∃λ1 ∈ R : x = −(3/2)λ1 [−1, −2, 2, 1]⊤ + (9/4)λ1 [2, −2, 0, 0]⊤

⇐⇒ ∃λ1 ∈ R : x = λ1 [24, −6, −12, −6]⊤   (rescaling the free parameter λ1)

⇐⇒ ∃λ1 ∈ R : x = λ1 [4, −1, −2, −1]⊤
Thus, we have

U1 ∩ U2 = { λ1 [4, −1, −2, −1]⊤ | λ1 ∈ R } = span[[4, −1, −2, −1]⊤] .
which gives us

U1 = span[[1, 1, −1]⊤] ,

and, likewise,

U2 = span[[1, 1, −1]⊤] .

Therefore, dim(U2) = 1.
b. Determine bases of U1 and U2 .
The basis vector that spans both U1 and U2 is [1, 1, −1]⊤.
c. Determine a basis of U1 ∩ U2 .
Since both U1 and U2 are spanned by the same basis vector, it must be that U1 = U2 , and the desired basis is

U1 ∩ U2 = U1 = U2 = span[[1, 1, −1]⊤] .
c. Determine a basis of U1 ∩ U2
Let us call b1 , b2 , c1 and c2 the vectors of the bases B and C such that
B = {b1 , b2 } and C = {c1 , c2 }. Let x be in R4 . Then,
x ∈ U1 ∩ U2 ⇐⇒ ∃λ1 , λ2 , λ3 , λ4 ∈ R : (x = λ1 b1 + λ2 b2 ) ∧ (x = λ3 c1 + λ4 c2 )
⇐⇒ ∃λ1 , λ2 , λ3 , λ4 ∈ R : (x = λ1 b1 + λ2 b2 )
∧ (λ1 b1 + λ2 b2 = λ3 c1 + λ4 c2 )
⇐⇒ ∃λ1 , λ2 , λ3 , λ4 ∈ R : (x = λ1 b1 + λ2 b2 )
∧ (λ1 b1 + λ2 b2 − λ3 c1 − λ4 c2 = 0)
From the reduced row echelon form we find that the set

S := span[[−3, −1, 1, −1]⊤]
x ∈ U1 ∩ U2 ⇐⇒ ∃λ1 , λ2 , λ3 , λ4 , α ∈ R : (x = λ1 b1 + λ2 b2 )
∧ ([λ1 , λ2 , λ3 , λ4 ]⊤ = α[−3, −1, 1, −1]⊤ )
⇐⇒ ∃α ∈ R : x = −3αb1 − αb2
⇐⇒ ∃α ∈ R : x = α[−3, −1, −7, −3]⊤
Finally,

U1 ∩ U2 = span[[−3, −1, −7, −3]⊤] .
c. Find one basis for F and one for G, calculate F ∩G using the basis vectors
previously found and check your result with the previous question.
We can see that F is a subset of R3 with one linear constraint. It thus
has dimension 2, and it suffices to find two independent vectors in F to
construct a basis. By setting (x, y) = (1, 0) and (x, y) = (0, 1) successively,
we obtain the following basis for F :
{ [1, 0, 1]⊤ , [0, 1, 1]⊤ }
Note that for simplicity purposes, we have not reversed the sign of the
coefficients for µ1 and µ2 , which we can do since we could replace µ1
by −µ1 .
The latter equation is a linear system in [λ1 , λ2 , µ1 , µ2 ]⊤ that we solve
next.
[ 1 0 1  0 ]          [ 1 0 0  2 ]
[ 0 1 0  1 ]  (⋯)  →  [ 0 1 0  1 ]
[ 1 1 2 −1 ]          [ 0 0 1 −2 ]
−1
a. Let a, b ∈ R.

Φ : L¹([a, b]) → R
f ↦ Φ(f) = ∫_a^b f(x) dx ,

where L¹([a, b]) denotes the set of integrable functions on [a, b].
Let f, g ∈ L¹([a, b]). It holds that

Φ(f) + Φ(g) = ∫_a^b f(x) dx + ∫_a^b g(x) dx = ∫_a^b (f(x) + g(x)) dx = Φ(f + g) .

For λ ∈ R we have

Φ(λf) = ∫_a^b λ f(x) dx = λ ∫_a^b f(x) dx = λΦ(f) .
Φ : C¹ → C⁰
f ↦ Φ(f) = f′ ,
For λ ∈ R we have
c.

Φ : R → R
x ↦ Φ(x) = cos(x)

Φ : R³ → R²
x ↦ [ 1 2 3 ] x
    [ 1 4 3 ]
Similarly,
Φ(λx) = A(λx) = λAx = λΦ(x) .
Φ : R² → R²
x ↦ [  cos(θ) sin(θ) ] x
    [ −sin(θ) cos(θ) ]
Φ : R³ → R⁴

Φ([x1, x2, x3]⊤) = [ 3x1 + 2x2 + x3 ]
                   [ x1 + x2 + x3   ]
                   [ x1 − 3x2       ]
                   [ 2x1 + 3x2 + x3 ]
and the desired transformation matrix of Φ with respect to the new basis B of R³ is

[ 0 −1  2 ] [ 1  1 0 ] [ 1 1 1 ]   [ 1  3  2 ] [ 1 1 1 ]
[ 0  1 −1 ] [ 1 −1 0 ] [ 1 2 0 ] = [ 0 −2 −1 ] [ 1 2 0 ]
[ 1  0 −1 ] [ 1  1 1 ] [ 1 1 0 ]   [ 0  0 −1 ] [ 1 1 0 ]

                                     [  6  9 1 ]
                                   = [ −3 −5 0 ] .
                                     [ −1 −1 0 ]
2.20 Let us consider b1, b2, b′1, b′2, four vectors of R² expressed in the standard basis of R² as

b1 = [2, 1]⊤ ,  b2 = [−1, −1]⊤ ,  b′1 = [2, −2]⊤ ,  b′2 = [1, 1]⊤

and let us define two ordered bases B = (b1, b2) and B′ = (b′1, b′2) of R².
a. Show that B and B ′ are two bases of R2 and draw those basis vectors.
The vectors b1 and b2 are clearly linearly independent and so are b′1 and
b′2 .
b. Compute the matrix P 1 that performs a basis change from B ′ to B .
We need to express the vector b′1 (and b′2 ) in terms of the vectors b1
and b2 . In other words, we want to find the real coefficients λ1 and λ2
such that b′1 = λ1 b1 + λ2 b2 . In order to do that, we will solve the linear
equation system
[ b1 b2 | b′1 ] ,

i.e.,

[ 2 −1 |  2 ]
[ 1 −1 | −2 ]
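The 2 × 2 system can be solved by Cramer's rule; the sketch below computes the B-coordinates of b′1 and b′2 (this excerpt does not show the resulting matrix P1, so the values here are computed from the stated system, not quoted from the book):

```python
# Columns of the basis matrix [b1 b2] = [[2, -1], [1, -1]].
a, b_, c, d = 2, -1, 1, -1
det = a * d - b_ * c  # = -1, nonzero, so (b1, b2) is indeed a basis

def coords(v):
    """B-coordinates (l1, l2) of v, i.e., v = l1*b1 + l2*b2 (Cramer's rule)."""
    l1 = (v[0] * d - b_ * v[1]) / det
    l2 = (a * v[1] - c * v[0]) / det
    return l1, l2

assert coords((2, -2)) == (4.0, 6.0)   # b1' = 4*b1 + 6*b2
assert coords((1, 1)) == (0.0, -1.0)   # b2' = -b2
```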
Φ(b1 + b2 ) = c2 + c3
Φ(b1 − b2 ) = 2c1 − c2 + 3c3
c
⃝2020 M. P. Deisenroth, A. A. Faisal, C. S. Ong. To be published by Cambridge University Press.
Φ(2b1) = 2c1 + 4c3
Φ(2b2) = −2c1 + 2c2 − 2c3

And by linearity of Φ again, the system of equations gives us

Φ(b1) = c1 + 2c3
Φ(b2) = −c1 + c2 − c3 .
Therefore, the transformation matrix AΦ of Φ with respect to the bases B and C is

AΦ = [ 1 −1 ]
     [ 0  1 ] .
     [ 2 −1 ]
Analytic Geometry
Exercises
3.1 Show that ⟨·, ·⟩ defined for all x = [x1 , x2 ]⊤ ∈ R2 and y = [y1 , y2 ]⊤ ∈ R2 by
is an inner product.
We need to show that ⟨x, y⟩ is a symmetric, positive definite bilinear form.
Let x := [x1 , x2 ]⊤ , y = [y1 , y2 ]⊤ ∈ R2 . Then,
3.5 Consider the Euclidean vector space R⁵ with the dot product. A subspace U ⊆ R⁵ and x ∈ R⁵ are given by

U = span[[0, −1, 2, 0, 2]⊤ , [1, −3, 1, −1, 2]⊤ , [−3, 4, 1, 2, 1]⊤ , [−1, −3, 5, 0, 7]⊤] ,
x = [−1, −9, −1, 4, 1]⊤
From here, we see that the first three columns are pivot columns, i.e., the first three vectors in the generating set of U form a basis of U:

U = span[[0, −1, 2, 0, 2]⊤ , [1, −3, 1, −1, 2]⊤ , [−3, 4, 1, 2, 1]⊤] .
Now, we define

B = [  0  1 −3 ]
    [ −1 −3  4 ]
    [  2  1  1 ] ,
    [  0 −1  2 ]
    [  2  2  1 ]
B ⊤ (x − Bλ) = 0 .
Therefore,
B ⊤ Bλ = B ⊤ x .
Solving this linear equation system yields the solution

λ = [−3, 4, 1]⊤ .
U = span[e1 , e3 ] .
p = πU(e2) ⟹ (p − e2) ⊥ U

⟹ [ ⟨p − e2, e1⟩ ]   [ 0 ]
   [ ⟨p − e2, e3⟩ ] = [ 0 ]

⟹ [ ⟨p, e1⟩ − ⟨e2, e1⟩ ]   [ 0 ]
   [ ⟨p, e3⟩ − ⟨e2, e3⟩ ] = [ 0 ]
2λ1 = 1
2λ3 = −1

However,

⟨p − e2, p − e2⟩ = [1/2, −1, −1/2] [ 2  1  0 ] [  1/2 ]
                                   [ 1  2 −1 ] [  −1  ] = 1 ,
                                   [ 0 −1  2 ] [ −1/2 ]

which yields d(e2, U) = √⟨p − e2, p − e2⟩ = 1.
c. Draw the scenario: standard basis vectors and πU (e2 )
See Figure 3.1.
3.7 Let V be a vector space and π an endomorphism of V .
a. Prove that π is a projection if and only if idV − π is a projection, where
idV is the identity endomorphism on V .
b. Assume now that π is a projection. Calculate Im(idV −π) and ker(idV −π)
as a function of Im(π) and ker(π).
Figure 3.1 The projection πU(e2) of e2 onto U, drawn with the standard basis vectors e1, e2, e3.
b. We have π ◦ (idV − π) = π − π 2 = 0V , where 0V represents the null
endomorphism. Then Im(idV − π) ⊆ ker(π).
Conversely, let x ∈ ker(π). Then
Similarly, we have
(idV − π) ◦ π = π − π 2 = π − π = 0V
ker(idV − π) = Im(π) .
3.8 Using the Gram-Schmidt method, turn the basis B = (b1, b2) of a two-dimensional subspace U ⊆ R³ into an ONB C = (c1, c2) of U, where

b1 := [1, 1, 1]⊤ ,  b2 := [−1, 2, 0]⊤ .
We start by normalizing b1:

c1 := b1/∥b1∥ = (1/√3) [1, 1, 1]⊤ . (3.1)
To get c2, we project b2 onto the subspace spanned by c1. This gives us (since ∥c1∥ = 1)

(c1⊤ b2) c1 = (1/3) [1, 1, 1]⊤ ∈ U .
By subtracting this projection (a multiple of c1) from b2, we get a vector c̃2 that is orthogonal to c1:

c̃2 = [−1, 2, 0]⊤ − (1/3) [1, 1, 1]⊤ = (1/3) [−4, 5, −1]⊤ = (1/3)(−b1 + 3b2) ∈ U .
Normalizing c̃2 yields

c2 = c̃2/∥c̃2∥ = (1/√42) [−4, 5, −1]⊤ .
We see that c1 ⊥ c2 and that ∥c1∥ = 1 = ∥c2∥. Moreover, since c1, c2 ∈ U, it follows that (c1, c2) is a basis of U.
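The two Gram-Schmidt steps can be replayed numerically; this sketch re-derives c1 and c2 and checks orthonormality:

```python
import math

b1, b2 = [1, 1, 1], [-1, 2, 0]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

c1 = [x / norm(b1) for x in b1]                    # b1 / ||b1||
proj = dot(c1, b2)                                 # c1^T b2 (||c1|| = 1)
c2_tilde = [y - proj * x for x, y in zip(c1, b2)]  # b2 minus its projection
c2 = [x / norm(c2_tilde) for x in c2_tilde]

assert abs(dot(c1, c2)) < 1e-12          # c1 ⊥ c2
assert abs(norm(c1) - 1) < 1e-12
assert abs(norm(c2) - 1) < 1e-12
# c2 is parallel to (1/sqrt(42)) * [-4, 5, -1]
assert abs(c2[0] + 4 / math.sqrt(42)) < 1e-12
```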
3.9 Let n ∈ N and let x1, . . . , xn > 0 be n positive real numbers so that x1 + · · · + xn = 1. Use the Cauchy-Schwarz inequality and show that

a. Σ_{i=1}^{n} x_i² ⩾ 1/n
b. Σ_{i=1}^{n} 1/x_i ⩾ n²

Hint: Think about the dot product on Rⁿ. Then, choose specific vectors x, y ∈ Rⁿ and apply the Cauchy-Schwarz inequality.
Recall the Cauchy-Schwarz inequality expressed with the dot product in Rⁿ. Let x = [x1, . . . , xn]⊤ and y = [y1, . . . , yn]⊤ be two vectors of Rⁿ. Cauchy-Schwarz tells us that

⟨x, y⟩² ⩽ ⟨x, x⟩ · ⟨y, y⟩ ,

which, applied with the dot product in Rⁿ, can be rephrased as

(Σ_{i=1}^{n} x_i y_i)² ⩽ (Σ_{i=1}^{n} x_i²) · (Σ_{i=1}^{n} y_i²) .
a. Consider x = [x1, . . . , xn]⊤ as defined in the question. Let us choose y = [1, . . . , 1]⊤. Then, the Cauchy-Schwarz inequality becomes

(Σ_{i=1}^{n} x_i · 1)² ⩽ (Σ_{i=1}^{n} x_i²) · (Σ_{i=1}^{n} 1²)

and thus

1 ⩽ (Σ_{i=1}^{n} x_i²) · n ,

so that Σ_{i=1}^{n} x_i² ⩾ 1/n.
b. Now apply the Cauchy-Schwarz inequality to the vectors [√x1, . . . , √xn]⊤ and [1/√x1, . . . , 1/√xn]⊤, so that

n² = (Σ_{i=1}^{n} √x_i · (1/√x_i))² ⩽ (Σ_{i=1}^{n} x_i) · (Σ_{i=1}^{n} 1/x_i) .

Since Σ_{i=1}^{n} x_i = 1, this yields n² ⩽ Σ_{i=1}^{n} 1/x_i, which gives the expected result.
3.10 Rotate the vectors

x1 := [2, 3]⊤ ,  x2 := [0, −1]⊤

by 30°.
Since 30° = π/6 rad we obtain the rotation matrix

A = [ cos(π/6) −sin(π/6) ]
    [ sin(π/6)  cos(π/6) ]
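Applying A numerically gives the rotated vectors; the exact values (√3 − 3/2, 1 + (3/2)√3) for x1 and (1/2, −√3/2) for x2 follow from cos(π/6) = √3/2 and sin(π/6) = 1/2 (a sketch):

```python
import math

theta = math.pi / 6  # 30 degrees
A = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta),  math.cos(theta)]]

def rotate(v):
    return [A[0][0] * v[0] + A[0][1] * v[1],
            A[1][0] * v[0] + A[1][1] * v[1]]

y1 = rotate([2, 3])   # = (sqrt(3) - 3/2, 1 + (3/2)*sqrt(3))
y2 = rotate([0, -1])  # = (1/2, -sqrt(3)/2)

assert abs(y1[0] - (math.sqrt(3) - 1.5)) < 1e-12
assert abs(y1[1] - (1 + 1.5 * math.sqrt(3))) < 1e-12
assert abs(y2[0] - 0.5) < 1e-12
assert abs(y2[1] + math.sqrt(3) / 2) < 1e-12
```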
Matrix Decompositions
Exercises
4.1 Compute the determinant using the Laplace expansion (using the first row)
and the Sarrus rule for
A = [ 1 3 5 ]
    [ 2 4 6 ] .
    [ 0 2 4 ]

The determinant is

|A| = det(A) = det [ 1 3 5 ]
                   [ 2 4 6 ]
                   [ 0 2 4 ]

= 1 · det[ 4 6 ] − 3 · det[ 2 6 ] + 5 · det[ 2 4 ]
         [ 2 4 ]          [ 0 4 ]          [ 0 2 ]

= 1 · 4 − 3 · 8 + 5 · 4 = 0   (Laplace expansion)
= 16 + 20 + 0 − 0 − 12 − 24 = 0   (Sarrus' rule)
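Both expansion rules are easy to replay in code (a sketch):

```python
A = [[1, 3, 5],
     [2, 4, 6],
     [0, 2, 4]]

def det2(a, b, c, d):
    # Determinant of the 2x2 matrix [[a, b], [c, d]].
    return a * d - b * c

# Laplace expansion along the first row.
laplace = (A[0][0] * det2(A[1][1], A[1][2], A[2][1], A[2][2])
           - A[0][1] * det2(A[1][0], A[1][2], A[2][0], A[2][2])
           + A[0][2] * det2(A[1][0], A[1][1], A[2][0], A[2][1]))

# Sarrus' rule: aei + bfg + cdh - ceg - afh - bdi.
(a, b, c), (d, e, f), (g, h, i) = A
sarrus = a*e*i + b*f*g + c*d*h - c*e*g - a*f*h - b*d*i

assert laplace == 0
assert sarrus == 0
```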
This strategy shows the power of the methods we learned in this and the
previous chapter. We can first apply Gaussian elimination to transform A
into a triangular form, and then use the fact that the determinant of a trian-
gular matrix equals the product of its diagonal elements.
| 2  0 1  2 0 |   | 2  0  1  2 0 |   | 2  0  1  2 0 |
| 2 −1 0  1 1 |   | 0 −1 −1 −1 1 |   | 0 −1 −1 −1 1 |
| 0  1 2  1 2 | = | 0  1  2  1 2 | = | 0  0  1  0 3 |
|−2  0 2 −1 2 |   | 0  0  3  1 2 |   | 0  0  3  1 2 |
| 2  0 0  1 1 |   | 0  0 −1 −1 1 |   | 0  0 −1 −1 1 |
  | 2  0  1  2  0 |   | 2  0  1  2  0 |
  | 0 −1 −1 −1  1 |   | 0 −1 −1 −1  1 |
= | 0  0  1  0  3 | = | 0  0  1  0  3 | = 6 .
  | 0  0  0  1 −7 |   | 0  0  0  1 −7 |
  | 0  0  0 −1  4 |   | 0  0  0  0 −3 |
Alternatively, we can apply the Laplace expansion and arrive at the same
solution:
| 2  0 1  2 0 |   | 2  0  1  2 0 |
| 2 −1 0  1 1 |   | 0 −1 −1 −1 1 |
| 0  1 2  1 2 | = | 0  1  2  1 2 |
|−2  0 2 −1 2 |   | 0  0  3  1 2 |
| 2  0 0  1 1 |   | 0  0 −1 −1 1 |

                  | −1 −1 −1 1 |
= (−1)^{1+1} 2 ·  |  1  2  1 2 |   (expansion along the 1st column)
                  |  0  3  1 2 |
                  |  0 −1 −1 1 |
If we now subtract the fourth row from the first row and add (−2) times the third column to the fourth column, we obtain

    | −1  0  0 0 |
    |  1  2  1 0 |        |  2  1 0 |
2 · |  0  3  1 0 | = −2 · |  3  1 0 | = (−2) · 3 · (−1)^{3+3} · | 2 1 | = 6 ,
    |  0 −1 −1 3 |        | −1 −1 3 |                           | 3 1 |

expanding first along the first row and then along the third column.
b.

B := [ −2 2 ]
     [  2 1 ]
a. For

A = [ 1 0 ]
    [ 1 1 ]

(i) Characteristic polynomial: p(λ) = |A − λI₂| = det[ 1−λ, 0 ; 1, 1−λ ] = (1 − λ)². Therefore λ = 1 is the only root of p and, therefore, the only eigenvalue of A.
(ii) To compute the eigenspace for the eigenvalue λ = 1, we need to compute the null space of A − I:

(A − 1 · I)x = 0 ⟺ [ 0 0 ] x = 0
                   [ 1 0 ]

⟹ E1 = span[[0, 1]⊤]
Of the four matrices above, the first one is diagonalizable and invertible, the second one is diagonalizable but is not invertible, the third one is invertible but is not diagonalizable, and, finally, the fourth one is neither invertible nor diagonalizable.
4.6 Compute the eigenspaces of the following transformation matrices. Are they
diagonalizable?
a. For

A = [ 2 3 0 ]
    [ 1 4 3 ]
    [ 0 0 1 ]
where we subtracted the first row from the second and, subse-
quently, divided the second row by 3 to obtain the reduced row
echelon form. From here, we see that
E1 = span[[3, −1, 0]⊤]
Then,
E5 = span[[1, 1, 0]⊤] .
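Both eigenspace claims can be checked by substitution (a sketch):

```python
A = [[2, 3, 0],
     [1, 4, 3],
     [0, 0, 1]]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

assert matvec(A, [3, -1, 0]) == [3, -1, 0]   # eigenvalue 1: A v = v
assert matvec(A, [1, 1, 0]) == [5, 5, 0]     # eigenvalue 5: A v = 5 v
```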
b. For

A = [ 1 1 0 0 ]
    [ 0 0 0 0 ]
    [ 0 0 0 0 ]
    [ 0 0 0 0 ]
(iii) We can immediately see that the rank of this matrix is 1 since the first and third rows are three times the second. Therefore, the eigenspace dimension is dim(E2) = 3 − 1 = 2, which corresponds
eigenspace dimension is dim(E2 ) = 3 − 1 = 2, which corresponds
to the algebraic multiplicity of the eigenvalue λ = 2 in p(λ). More-
over, we know that the dimension of E1 is 1 since it cannot exceed
its algebraic multiplicity, and the dimension of an eigenspace is at
least 1. Hence, A is diagonalizable.
(iv) The diagonal matrix is easy to determine since it just contains the
eigenvalues (with corresponding multiplicities) on its diagonal:
D = [ 1 0 0 ]
    [ 0 2 0 ] .
    [ 0 0 2 ]
A = [ 1/√2 −1/√2 ]  [ 5 0 0 ]  [   1/√2     1/√2     0    ]
    [ 1/√2  1/√2 ]  [ 0 3 0 ]  [ −1/(3√2)  1/(3√2) −2√2/3 ]
                               [  −2/3      2/3     1/3   ]
        = U           = Σ                 = V⊤
Av1 = (σ1 u1 v1⊤)v1 = σ1 u1 (v1⊤ v1) = σ1 u1 = [2√2, 0]⊤
Av2 = (σ2 u2 v2⊤)v2 = σ2 u2 (v2⊤ v2) = σ2 u2 = [0, √2]⊤

(v) Assemble the left-/right-singular vectors and singular values. The SVD of A is

A = UΣV⊤ = [ 1 0 ] [ √8  0 ] [  1/√2 1/√2 ]
           [ 0 1 ] [ 0  √2 ] [ −1/√2 1/√2 ] .
To find the rank-1 approximation we apply the SVD to A (as in Exercise 4.7) to obtain

U = [ 1/√2 −1/√2 ]
    [ 1/√2  1/√2 ]

V = [ 1/√2 −1/(3√2) −2/3 ]
    [ 1/√2  1/(3√2)  2/3 ] .
    [ 0    −2√2/3    1/3 ]

We use the largest singular value (σ1 = 5, i.e., i = 1) and therefore the first column vectors of U and V, respectively, which then yields

A1 = u1 v1⊤ = [ 1/√2 ] [ 1/√2 1/√2 0 ] = [ 1/2 1/2 0 ]
              [ 1/√2 ]                   [ 1/2 1/2 0 ] .
4.11 Show that for any A ∈ Rm×n the matrices A⊤ A and AA⊤ possess the
same nonzero eigenvalues.
Let us assume that λ is a nonzero eigenvalue of AA⊤ and x is an eigenvector belonging to λ. Thus, the eigenvalue equation

(AA⊤)x = λx

holds. Left-multiplying both sides by A⊤, we can use matrix multiplication associativity to reorder the left-hand side factors: (A⊤A)(A⊤x) = λ(A⊤x). Moreover, A⊤x ≠ 0 (if A⊤x were 0, then λx = AA⊤x = 0 would force λ = 0), so λ is also an eigenvalue of A⊤A with eigenvector A⊤x.
A⊤A = P DP⊤

(ii) Then, with y := P⊤x,

∥Ax∥₂² = x⊤(P DP⊤)x = y⊤Dy = Σ_{i=1}^{n} λ_i y_i² ,

so that

∥Ax∥₂² / ∥x∥₂² ⩽ max_{1⩽j⩽n} λ_j ,
Vector Calculus
Exercises
5.1 Compute the derivative f ′ (x) for
f (x) = log(x4 ) sin(x3 ) .
f′(x) = (4/x) sin(x³) + 12x² log(x) cos(x³)
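A central finite difference provides a quick sanity check of this derivative (a numerical sketch; the evaluation point x = 1.3 is arbitrary):

```python
import math

def f(x):
    return math.log(x ** 4) * math.sin(x ** 3)

def f_prime(x):
    # (4/x) sin(x^3) + 12 x^2 log(x) cos(x^3)
    return 4 / x * math.sin(x ** 3) + 12 * x ** 2 * math.log(x) * math.cos(x ** 3)

x, h = 1.3, 1e-6
numeric = (f(x + h) - f(x - h)) / (2 * h)
assert abs(numeric - f_prime(x)) < 1e-5
```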
f′(x) = exp(x) / (1 + exp(x))²
f2(x, y) = x⊤y = Σ_i x_i y_i

∂f2/∂x = [∂f2/∂x1 · · · ∂f2/∂xn] = [y1 · · · yn] = y⊤ ∈ R^{1×n}
∂f2/∂y = [∂f2/∂y1 · · · ∂f2/∂yn] = [x1 · · · xn] = x⊤ ∈ R^{1×n}

⟹ J = [∂f2/∂x  ∂f2/∂y] = [y⊤  x⊤] ∈ R^{1×2n}

f3 : Rⁿ → R^{n×n}

f3(x) = xx⊤ = [x1 x⊤ ; x2 x⊤ ; · · · ; xn x⊤] = [x x1  x x2  · · ·  x xn] ∈ R^{n×n}

(its i-th row is x_i x⊤ and its i-th column is x x_i)

⟹ ∂f3/∂x1 = [x⊤ ; 0ₙ⊤ ; · · · ; 0ₙ⊤] + [x  0ₙ  · · ·  0ₙ] ∈ R^{n×n}

and, in general,

⟹ ∂f3/∂xi = [0_{(i−1)×n} ; x⊤ ; 0_{(n−i)×n}] + [0_{n×(i−1)}  x  0_{n×(n−i)}] ∈ R^{n×n}

To get the Jacobian, we need to concatenate all partial derivatives ∂f3/∂xi and obtain

J = [∂f3/∂x1 · · · ∂f3/∂xn] ∈ R^{(n×n)×n}
∂f/∂t = cos(log(t⊤t)) · (1/(t⊤t)) · 2t⊤
The trace for T ∈ R^{D×D} is defined as

tr(T) = Σ_{i=1}^{D} T_ii

∂/∂X_ij tr(AXB) = Σ_k A_ki B_jk = (BA)_ji
We know that the size of the gradient needs to be of the same size as X
(i.e., E × F ). Therefore, we have to transpose the result above, such that
we finally obtain
∂/∂X tr(AXB) = A⊤ B⊤ ,  where A⊤ ∈ R^{E×D} and B⊤ ∈ R^{D×F}.
5.7 Compute the derivatives df /dx of the following functions by using the chain
rule. Provide the dimensions of every single partial derivative. Describe your
steps in detail.
a.
f (z) = log(1 + z) , z = x⊤ x , x ∈ RD
b.
f (z) = sin(z) , z = Ax + b , A ∈ RE×D , x ∈ RD , b ∈ RE
a.

df/dx = (∂f/∂z)(∂z/∂x) ∈ R^{1×D} ,  with ∂f/∂z ∈ R and ∂z/∂x ∈ R^{1×D}

∂f/∂z = 1/(1 + z) = 1/(1 + x⊤x)

∂z/∂x = 2x⊤

⟹ df/dx = 2x⊤ / (1 + x⊤x)
b.

df/dx = (∂f/∂z)(∂z/∂x) ∈ R^{E×D} ,  with ∂f/∂z ∈ R^{E×E} and ∂z/∂x ∈ R^{E×D}

sin(z) = [sin z1, . . . , sin zE]⊤

∂ sin z/∂zi = [0, . . . , 0, cos(zi), 0, . . . , 0]⊤ ∈ R^E   (cos(zi) in the i-th position)

⟹ ∂f/∂z = diag(cos(z)) ∈ R^{E×E}

∂z/∂x = A ∈ R^{E×D} : with c := Ax,

c_i = Σ_{j=1}^{D} A_ij x_j ⟹ ∂c_i/∂x_j = A_ij ,  i = 1, . . . , E, j = 1, . . . , D
where x, µ ∈ R^D, S ∈ R^{D×D}.
The desired derivative can be computed using the chain rule:

df/dx = (∂f/∂z)(∂g/∂y)(∂h/∂x) ∈ R^{1×D} ,  with dimensions 1×1, 1×D, and D×D, respectively.

Here

∂f/∂z = −(1/2) exp(−(1/2)z)
∂g/∂y = 2y⊤S⁻¹
∂h/∂x = I_D

so that

df/dx = −exp(−(1/2)(x − µ)⊤S⁻¹(x − µ)) (x − µ)⊤S⁻¹
b.
f (x) = tr(xx⊤ + σ 2 I) , x ∈ RD
Here tr(A) is the trace of A, i.e., the sum of the diagonal elements Aii .
Hint: Explicitly write out the outer product.
Let us have a look at the outer product. We define X = xx⊤ with
Xij = xi xj
c. Use the chain rule. Provide the dimensions of every single partial deriva-
tive. You do not need to compute the product of the partial derivatives
explicitly.
f = tanh(z) ∈ RM
z = Ax + b, x ∈ RN , A ∈ RM ×N , b ∈ RM .
∂f/∂z = diag(1 − tanh²(z)) ∈ R^{M×M}

∂z/∂x = ∂(Ax)/∂x = A ∈ R^{M×N}
for differentiable functions p, q, t. By using the chain rule, compute the gra-
dient
d
g(z, ν) .
dν
d/dν g(z, ν) = d/dν (log p(x, z) − log q(z, ν)) = d/dν log p(x, z) − d/dν log q(z, ν)

= ∂/∂z log p(x, z) · ∂t(ϵ, ν)/∂ν − ∂/∂z log q(z, ν) · ∂t(ϵ, ν)/∂ν − ∂/∂ν log q(z, ν)

= [ ∂/∂z log p(x, z) − ∂/∂z log q(z, ν) ] ∂t(ϵ, ν)/∂ν − ∂/∂ν log q(z, ν)

= [ 1/p(x, z) · ∂/∂z p(x, z) − 1/q(z, ν) · ∂/∂z q(z, ν) ] ∂t(ϵ, ν)/∂ν − ∂/∂ν log q(z, ν)
Exercises
6.1 Consider the following bivariate distribution p(x, y) of two discrete random
variables X and Y .
x1 x2 x3 x4 x5
X
Compute:
a. The marginal distributions p(x) and p(y).
b. The conditional distributions p(x|Y = y1 ) and p(y|X = x3 ).
The marginal and conditional distributions are given by
p(x1) = ∫ ( 0.4 N([10, 2]⊤, [1 0; 0 1]) + 0.6 N([0, 0]⊤, [8.4 2.0; 2.0 1.7]) ) dx2   (6.1)

= 0.4 ∫ N([10, 2]⊤, [1 0; 0 1]) dx2 + 0.6 ∫ N([0, 0]⊤, [8.4 2.0; 2.0 1.7]) dx2   (6.2)

= 0.4 N(10, 1) + 0.6 N(0, 8.4) .

From (6.1) to (6.2), we used the result that the integral of a sum is the sum of the integrals.
Similarly, we can get

p(x2) = 0.4 N(2, 1) + 0.6 N(0, 1.7) .
b. Compute the mean, mode and median for each marginal distribution.
Mean:

E[x1] = ∫ x1 p(x1) dx1
= ∫ x1 ( 0.4 N(x1 | 10, 1) + 0.6 N(x1 | 0, 8.4) ) dx1   (6.3)
= 0.4 ∫ x1 N(x1 | 10, 1) dx1 + 0.6 ∫ x1 N(x1 | 0, 8.4) dx1   (6.4)
= 0.4 · 10 + 0.6 · 0 = 4 .

From step (6.3) to step (6.4), we use the fact that for Y ∼ N(µ, σ²) we have E[Y] = µ.
Similarly,

E[x2] = 0.4 · 2 + 0.6 · 0 = 0.8 .

Mode: In principle, we would need to solve for

dp(x1)/dx1 = 0  and  dp(x2)/dx2 = 0 .

However, we can observe that the modes of each individual distribution are the peaks of the Gaussians, that is, the Gaussian means for each dimension.
Median: The median a satisfies

∫_{−∞}^{a} p(x1) dx1 = 1/2 .
Mode:
The two dimensional distribution has two peaks, and hence there are
Exercises 525
6.3 You have written a computer program that sometimes compiles and sometimes does not (the code does not change). You decide to model the apparent stochasticity (success vs. no success) x of the compiler using a Bernoulli distribution with parameter µ:
Choose a conjugate prior for the Bernoulli likelihood and compute the pos-
terior distribution p(µ | x1 , . . . , xN ).
The posterior is proportional to the product of the likelihood (Bernoulli)
and the prior (Beta). Given the choice of the prior, we already know that
the posterior will be a Beta distribution. With a conjugate Beta prior p(µ) = Beta(a, b), we obtain
p(µ | X) = p(µ) p(X | µ) / p(X) ∝ µ^{a−1} (1 − µ)^{b−1} ∏_{i=1}^{N} µ^{x_i} (1 − µ)^{1−x_i}
∝ µ^{a−1+Σ_i x_i} (1 − µ)^{b−1+N−Σ_i x_i} ∝ Beta(a + Σ_i x_i, b + N − Σ_i x_i) .
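The conjugate update amounts to counting successes and failures; a minimal sketch (the function name is mine):

```python
def beta_posterior(a, b, observations):
    # Conjugate update: Beta(a, b) prior combined with Bernoulli observations
    # yields Beta(a + #successes, b + #failures).
    s = sum(observations)   # number of successes (x_i = 1)
    n = len(observations)
    return a + s, b + n - s
```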
where
p(m) = p(b1) p(m | b1) + p(b2) p(m | b2) = (3/5)(2/3) + (2/5)(1/2) = 2/5 + 1/5 = 3/5   (Evidence)
p(b2) = 2/5   (Prior)
p(m | b2) = 1/2   (Likelihood)
Therefore,
p(b2 | m) = p(m | b2) p(b2) / p(m) = ((1/2)(2/5)) / (3/5) = 1/3 .
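The arithmetic above can be reproduced exactly with rational numbers; the variable names below are mine.

```python
from fractions import Fraction

p_b1, p_b2 = Fraction(3, 5), Fraction(2, 5)        # priors over the two bags
p_m_b1, p_m_b2 = Fraction(2, 3), Fraction(1, 2)    # likelihoods p(m | b)

p_m = p_b1 * p_m_b1 + p_b2 * p_m_b2                # evidence p(m)
p_b2_m = p_m_b2 * p_b2 / p_m                       # posterior p(b2 | m)
```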
where w, v are i.i.d. Gaussian noise variables. Further, assume that p(x0) = N(µ0, Σ0).
a. What is the form of p(x0 , x1 , . . . , xT )? Justify your answer (you do not
have to explicitly compute the joint distribution).
b. Assume that p(xt | y1, …, yt) = N(µt, Σt).
   1. Compute p(x_{t+1} | y1, …, yt).
   2. Compute p(x_{t+1}, y_{t+1} | y1, …, yt).
   3. At time t+1, we observe the value y_{t+1} = ŷ. Compute the conditional distribution p(x_{t+1} | y1, …, y_{t+1}).
Therefore,
p(x_{t+1}, y_{t+1} | y_{1:t}) = N( [µ_{t+1|t}; µ^y_{t+1|t}], [Σ_{t+1|t} Σ_{xy}; Σ_{yx} Σ^y_{t+1|t}] ) .
6.6 Prove the relationship in (6.44), which relates the standard definition of the
variance to the raw-score expression for the variance.
The formula in (6.43) can be converted to the so-called raw-score formula for the variance:
(1/N) Σ_{i=1}^{N} (x_i − µ)² = (1/N) Σ_{i=1}^{N} ( x_i² − 2 x_i µ + µ² )
= (1/N) Σ_{i=1}^{N} x_i² − (2µ/N) Σ_{i=1}^{N} x_i + µ²
= (1/N) Σ_{i=1}^{N} x_i² − µ²
= (1/N) Σ_{i=1}^{N} x_i² − ( (1/N) Σ_{i=1}^{N} x_i )² ,
where we used (1/N) Σ_{i=1}^{N} x_i = µ.
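Both sides of the raw-score identity are easy to check numerically; a small sketch with arbitrary data:

```python
def var_definition(xs):
    # Standard definition: mean squared deviation from the mean.
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)

def var_raw_score(xs):
    # Raw-score formula: mean of squares minus square of the mean.
    n = len(xs)
    return sum(x * x for x in xs) / n - (sum(xs) / n) ** 2

data = [1.0, 2.0, 4.0, 7.0]
```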
6.7 Prove the relationship in (6.45), which relates the pairwise difference be-
tween examples in a dataset with the raw-score expression for the variance.
Consider the pairwise squared deviation of a random variable:
(1/N²) Σ_{i,j=1}^{N} (x_i − x_j)² = (1/N²) Σ_{i,j=1}^{N} ( x_i² + x_j² − 2 x_i x_j )   (6.5)
= (1/N) Σ_{i=1}^{N} x_i² + (1/N) Σ_{j=1}^{N} x_j² − (2/N²) Σ_{i,j=1}^{N} x_i x_j   (6.6)
= (2/N) Σ_{i=1}^{N} x_i² − 2 ( (1/N) Σ_{i=1}^{N} x_i ) ( (1/N) Σ_{j=1}^{N} x_j )   (6.7)
= 2 [ (1/N) Σ_{i=1}^{N} x_i² − ( (1/N) Σ_{i=1}^{N} x_i )² ] .   (6.8)
Going from (6.7) to (6.8), we use that the two single sums are identical, so their product is a square; the double sum was factored in (6.7) as Σ_{i,j} x_i x_j = (Σ_i x_i)(Σ_j x_j).
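Likewise, the pairwise-deviation identity can be verified on concrete data; a minimal check (function names are mine):

```python
def raw_score_var(xs):
    # Raw-score variance: mean of squares minus square of the mean.
    n = len(xs)
    return sum(x * x for x in xs) / n - (sum(xs) / n) ** 2

def mean_pairwise_sq(xs):
    # (1/N^2) sum over all ordered pairs (i, j) of (x_i - x_j)^2.
    n = len(xs)
    return sum((xi - xj) ** 2 for xi in xs for xj in xs) / (n * n)

data = [1.0, 2.0, 4.0, 7.0]
```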
6.8 Express the Bernoulli distribution in the natural parameter form of the ex-
ponential family, see (6.107).
p(x | µ) = µ^x (1 − µ)^{1−x}
= exp[ log( µ^x (1 − µ)^{1−x} ) ]
= exp[ x log µ + (1 − x) log(1 − µ) ]
= exp[ x log( µ / (1 − µ) ) + log(1 − µ) ] .
The corresponding Binomial distribution p(x | N, µ) = (N choose x) µ^x (1 − µ)^{N−x} can likewise be identified as being in exponential family form by observing that
h(x) = (N choose x) for x = 0, 1, …, N, and h(x) = 0 otherwise,
θ = log( µ / (1 − µ) ),
φ(x) = x,
A(θ) = −N log(1 − µ) = N log(1 + e^θ) .
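The two expressions for the log-partition function agree because µ = e^θ / (1 + e^θ); a quick numeric check with arbitrarily chosen values:

```python
import math

N = 10
mu = 0.3
theta = math.log(mu / (1.0 - mu))   # natural parameter log(mu / (1 - mu))

def A_of_theta(th):
    # A(theta) = N log(1 + e^theta)
    return N * math.log(1.0 + math.exp(th))

def A_of_mu(m):
    # A expressed via mu: -N log(1 - mu)
    return -N * math.log(1.0 - m)
```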
Show that the product of the Beta and the Binomial distribution is also a member of the exponential family:
p(x | N, µ) p(µ | α, β) = (N choose x) µ^x (1 − µ)^{N−x} · (1 / B(α, β)) µ^{α−1} (1 − µ)^{β−1}
= (N choose x) (1 / B(α, β)) exp[ (x + α − 1) log µ + (N − x + β − 1) log(1 − µ) ] .
The last line can be identified as being in exponential family form by observing that
h(µ) = 1,
θ = [x + α − 1; N − x + β − 1],
φ(µ) = [log µ; log(1 − µ)],
A(θ) = log B(α, β) − log (N choose x) .
C = (A⁻¹ + B⁻¹)⁻¹
c = C (A⁻¹ a + B⁻¹ b)
c = (2π)^{−D/2} |A + B|^{−1/2} exp( −(1/2) (a − b)⊤ (A + B)⁻¹ (a − b) ) .
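In one dimension the product formula can be verified pointwise: N(x | a, A) · N(x | b, B) should equal c · N(x | c̄, C) for every x, where c̄ denotes the mean above. A sketch with arbitrary numbers:

```python
import math

def npdf(x, mean, var):
    # Univariate Gaussian density N(x | mean, var).
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

a, A = 1.0, 2.0   # first Gaussian N(a, A)
b, B = 3.0, 0.5   # second Gaussian N(b, B)

C = 1.0 / (1.0 / A + 1.0 / B)   # C = (A^-1 + B^-1)^-1
c_mean = C * (a / A + b / B)    # c_mean = C (A^-1 a + B^-1 b)
c_norm = npdf(a, b, A + B)      # scalar c = N(a | b, A + B) in 1D
```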
Recall from Example 6.13 the exponential-family form of the univariate Gaussian distribution; similarly, we can obtain the exponential-family form of the multivariate Gaussian distribution as
p(x | a, A) = exp( −(1/2) x⊤A⁻¹x + x⊤A⁻¹a − (1/2) a⊤A⁻¹a − (D/2) log(2π) − (1/2) log |A| ) .
This can be identified as being in exponential family form by observing that
h(x) = 1,
θ = [ −(1/2) vec(A⁻¹); A⁻¹a ] (as a function of (a, A)),
φ(x) = [ vec(x x⊤); x ] .
E_X[x] = E_Y[ E_X[x | y] ]
⟺ ∫ x p(x) dx = ∫∫ x p(x | y) p(y) dx dy
⟺ ∫ x p(x) dx = ∫∫ x p(x, y) dx dy
⟺ ∫ x p(x) dx = ∫ x ( ∫ p(x, y) dy ) dx = ∫ x p(x) dx ,
b. The distribution p(y) = ∫ p(y | x) p(x) dx is Gaussian. Compute the mean µy and the covariance Σy. Derive your result in detail.
p(y) is Gaussian distributed (an affine transformation of the Gaussian random variable x) with mean µy and covariance matrix Σy. We obtain
µy = E[Ax + b + w] = A E_X[x] + b + E_W[w] = A µx + b .
For the covariance, we use the linearity of the expected value, move all constants out of the expected value, and exploit the independence of w and x:
Σy = A E_X[x x⊤] A⊤ + A E_X[x] b⊤ + b E_X[x⊤] A⊤ + b b⊤ + E_W[w w⊤] − (A µx + b)(A µx + b)⊤
= A E_X[x x⊤] A⊤ + A µx b⊤ + b µx⊤ A⊤ + b b⊤ + Q − A µx µx⊤ A⊤ − A µx b⊤ − b µx⊤ A⊤ − b b⊤
= A ( E_X[x x⊤] − µx µx⊤ ) A⊤ + Q
= A Σx A⊤ + Q ,
where we identified E_X[x x⊤] − µx µx⊤ = Σx.
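The cancellation above is pure matrix algebra and can be checked numerically using E_X[xx⊤] = Σx + µx µx⊤; the small helpers and the concrete numbers below are mine.

```python
def matmul(X, Y):
    # Plain nested-list matrix multiplication.
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(row) for row in zip(*X)]

def madd(X, Y, sign=1):
    # Elementwise X + sign * Y.
    return [[X[i][j] + sign * Y[i][j] for j in range(len(X[0]))]
            for i in range(len(X))]

A = [[1.0, 2.0], [0.0, 1.0]]
b = [[1.0], [2.0]]
mu = [[0.5], [-1.0]]
Sx = [[2.0, 0.3], [0.3, 1.0]]
Q = [[0.1, 0.0], [0.0, 0.2]]

Exx = madd(Sx, matmul(mu, transpose(mu)))   # E[x x^T] = Sigma_x + mu mu^T
Amu_b = madd(matmul(A, mu), b)              # A mu + b

# Left-hand side, term by term as in the first line of the derivation.
lhs = matmul(matmul(A, Exx), transpose(A))
lhs = madd(lhs, matmul(matmul(A, mu), transpose(b)))
lhs = madd(lhs, matmul(b, transpose(matmul(A, mu))))
lhs = madd(lhs, matmul(b, transpose(b)))
lhs = madd(lhs, Q)
lhs = madd(lhs, matmul(Amu_b, transpose(Amu_b)), sign=-1)

# Claimed result: A Sigma_x A^T + Q.
rhs = madd(matmul(matmul(A, Sx), transpose(A)), Q)
```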
z = Cy + v ,
where z ∈ R^F, C ∈ R^{F×E}, and v ∼ N(v | 0, R) is independent Gaussian (measurement) noise.
Write down p(z | y).
p(z | y) = N(z | Cy, R) = (2π)^{−F/2} |R|^{−1/2} exp( −(1/2) (z − Cy)⊤ R⁻¹ (z − Cy) ) .
Compute p(z), i.e., the mean µz and the covariance Σz . Derive your
result in detail.
Remark: Since y is Gaussian and z is a linear transformation of y ,
p(z) is Gaussian, too. Let’s compute its moments (similar to the “time
update”):
We use the linearity of the expected value, move all constants out of the expected value, and exploit the independence of v and y, which yields
µz = C µy ,   Σz = C Σy C⊤ + R .
For the conditional distribution, Gaussian conditioning gives
Σ_{x|y} = Σx − Σxy Σy⁻¹ Σyx .
With
µx µy⊤ = µx µx⊤ A⊤ + µx b⊤ ,
and finally
µ_{x|y} = µx + Σx A⊤ (A Σx A⊤ + Q)⁻¹ (ŷ − A µx − b) ,
Σ_{x|y} = Σx − Σx A⊤ (A Σx A⊤ + Q)⁻¹ A Σx .
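For scalar state and measurement, these final formulas reduce to familiar Kalman-filter expressions; this sketch (variable names are mine) applies them to one hypothetical observation.

```python
def measurement_update(mu_x, S_x, A, b, Q, y_hat):
    # Scalar version of the update: K = Sx A^T (A Sx A^T + Q)^-1.
    K = S_x * A / (A * S_x * A + Q)
    mu_post = mu_x + K * (y_hat - A * mu_x - b)   # posterior mean
    S_post = S_x - K * A * S_x                    # posterior variance
    return mu_post, S_post
```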
where the last line is due to the fact that FX(x) composed with its inverse results in the identity transformation. The statement FY(y) = y, along with the fact that y lies in the interval [0, 1], means that FY is the cdf of the uniform random variable on the unit interval.
Continuous Optimization
Exercises
7.1 Consider the univariate function
f (x) = x3 + 6x2 − 3x − 5.
Find its stationary points and indicate whether they are maximum, mini-
mum, or saddle points.
Given the function f(x), we obtain the following gradient and Hessian:
df/dx = 3x² + 12x − 3 ,
d²f/dx² = 6x + 12 .
We find stationary points by setting the gradient to zero and solving for x. One option is to use the quadratic formula, but below we solve by completing the square. Dividing df/dx = 0 by 3 gives x² + 4x − 1 = 0. Observe that
(x + 2)² = x² + 4x + 4 ,
so x² + 4x − 1 = (x + 2)² − 5 = 0, and hence
x = −2 ± √5 .
Substituting the solutions of df/dx = 0 into the Hessian gives
d²f/dx² (−2 − √5) = −6√5 ≈ −13.4   and   d²f/dx² (−2 + √5) = 6√5 ≈ 13.4 .
This means that the left stationary point is a maximum and the right one is
a minimum. See Figure 7.1 for an illustration.
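A quick numeric confirmation of the stationary points and their types (my sketch):

```python
import math

def f(x):
    return x ** 3 + 6 * x ** 2 - 3 * x - 5

def df(x):
    return 3 * x ** 2 + 12 * x - 3   # gradient

def d2f(x):
    return 6 * x + 12                # Hessian

x_max = -2 - math.sqrt(5)   # local maximum
x_min = -2 + math.sqrt(5)   # local minimum
```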
7.2 Consider the update equation for stochastic gradient descent (Equation (7.15)).
Write down the update when we use a mini-batch size of one.
With a mini-batch of size one, the sum over the batch reduces to a single randomly chosen example k, and the update becomes
θ_{i+1} = θ_i − γ_i (∇L_k(θ_i))⊤ .
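A toy illustration (entirely my construction): stochastic gradient descent with mini-batch size one on L(θ) = Σ_k ½(θ − y_k)², whose minimizer is the mean of the y_k.

```python
import random

# Toy least-squares data: L_k(theta) = 0.5 * (theta - y_k)^2.
ys = [1.0, 2.0, 3.0, 6.0]   # minimizer of the sum is the mean, 3.0

def grad_Lk(theta, k):
    # Gradient of a single summand L_k.
    return theta - ys[k]

random.seed(0)
theta, gamma = 0.0, 0.05
for i in range(2000):
    k = random.randrange(len(ys))          # pick one example: mini-batch of size one
    theta = theta - gamma * grad_Lk(theta, k)
```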
[Figure 7.1: the function f(x) together with its gradient and Hessian.]
We can then combine all of the inequalities into one matrix inequality:
subject to  [1 0 0; 0 1 0; 0 0 −1] [x0; x1; ξ] − [0; 3; 0] ⪯ 0
subject to  [2 2; 2 −4; −2 1; 0 −1; 0 1] [x1; x2] ⪯ [33; 8; 5; −1; 8]
Derive the dual linear program using Lagrange duality.
Write down the Lagrangian:
L(x, λ) = −[5; 3]⊤ x + λ⊤ ( [2 2; 2 −4; −2 1; 0 −1; 0 1] x − [33; 8; 5; −1; 8] )
Rearrange and factorize x:
L(x, λ) = ( −[5; 3]⊤ + λ⊤ [2 2; 2 −4; −2 1; 0 −1; 0 1] ) x − λ⊤ [33; 8; 5; −1; 8]
Differentiate with respect to x and set to zero:
∇x L = −[5; 3]⊤ + λ⊤ [2 2; 2 −4; −2 1; 0 −1; 0 1] = 0
Then substitute back into the Lagrangian to obtain the dual Lagrangian:
D(λ) = −λ⊤ [33; 8; 5; −1; 8]
The dual optimization problem is therefore
max_{λ ∈ R⁵}  −λ⊤ [33; 8; 5; −1; 8]
subject to  −[5; 3]⊤ + λ⊤ [2 2; 2 −4; −2 1; 0 −1; 0 1] = 0
            λ ⪰ 0 .
Alternative derivation
Write down the Lagrangian:
L(x, λ) = (1/2) x⊤ [2 1; 1 4] x + [5; 3]⊤ x + λ⊤ ( [1 0; −1 0; 0 1; 0 −1] x − [1; 1; 1; 1] )
Solve for x:
x⊤ = −( [5; 3]⊤ + λ⊤ [1 0; −1 0; 0 1; 0 −1] ) [2 1; 1 4]⁻¹
Derive the convex conjugate function f∗(s), assuming the standard dot product.
Hint: Take the gradient of an appropriate function and set the gradient to zero.
From the definition of the Legendre–Fenchel conjugate,
f∗(s) = sup_{x ∈ R^D} Σ_{d=1}^{D} ( s_d x_d − x_d log x_d ) .
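Setting the per-coordinate gradient s_d − log x_d − 1 to zero gives x_d = e^{s_d − 1}, hence f∗(s) = Σ_d e^{s_d − 1}. A per-coordinate numeric check by ternary search (my sketch; the search bounds are arbitrary):

```python
import math

def objective(x, s):
    # Per-coordinate objective s*x - x*log(x), concave for x > 0.
    return s * x - x * math.log(x)

def conjugate_numeric(s, lo=1e-9, hi=50.0, iters=200):
    # Ternary search for the maximum of a concave function on (lo, hi).
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if objective(m1, s) < objective(m2, s):
            lo = m1
        else:
            hi = m2
    x = 0.5 * (lo + hi)
    return objective(x, s)
```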