A vector in $\mathbb{R}^d$ is written as

$$x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{pmatrix} \in \mathbb{R}^d.$$

Its Euclidean norm (or $2$-norm) is

$$\|x\|_2 = \sqrt{\sum_{i=1}^d x_i^2} = \sqrt{\langle x, x \rangle}$$
where, for $u, v \in \mathbb{R}^d$,

$$\langle u, v \rangle = \sum_{i=1}^d u_i v_i.$$
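As a quick numerical sanity check (assuming NumPy is available; the vectors are illustrative), the inner product can be computed entrywise and compared against `np.dot`, along with its relation to the norm:

```python
import numpy as np

u = np.array([1., 2., 3.])
v = np.array([4., 5., 6.])

# inner product as an explicit sum, compared to np.dot
ip = sum(u[i] * v[i] for i in range(len(u)))
assert np.isclose(ip, np.dot(u, v))

# the squared 2-norm is the inner product of a vector with itself
assert np.isclose(np.dot(u, u), np.linalg.norm(u) ** 2)

# Cauchy-Schwarz: |<u,v>| <= ||u||_2 ||v||_2
assert abs(np.dot(u, v)) <= np.linalg.norm(u) * np.linalg.norm(v)
```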
The triangle inequality for the $\ell_2$-norm follows from the Cauchy–Schwarz inequality, which states that $|\langle u, v \rangle| \leq \|u\|_2 \|v\|_2$.

A linear combination of vectors $u_1, \ldots, u_m$ with coefficients $\alpha_1, \ldots, \alpha_m$ is

$$z = \sum_{\ell=1}^m \alpha_\ell u_\ell, \qquad \text{with entries} \quad z_i = \sum_{\ell=1}^m \alpha_\ell (u_\ell)_i, \quad i = 1, \ldots, d.$$
The Euclidean distance between two vectors $u$ and $v$ in $\mathbb{R}^d$ is the $\ell_2$-norm of their difference:

$$d(u, v) = \|u - v\|_2.$$
More generally, for $p \geq 1$, the $\ell_p$-norm of $x$ is given by

$$\|x\|_p = \left( \sum_{i=1}^d |x_i|^p \right)^{1/p}.$$
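A sketch of this definition in NumPy (the vector is illustrative); `numpy.linalg.norm` accepts the same $p$ through its `ord` argument:

```python
import numpy as np

x = np.array([3., -4.])

# p-norm computed directly from the definition
def p_norm(x, p):
    return np.sum(np.abs(x) ** p) ** (1 / p)

assert np.isclose(p_norm(x, 1), 7.0)   # |3| + |-4|
assert np.isclose(p_norm(x, 2), 5.0)   # Euclidean norm of (3, -4)
# agrees with numpy.linalg.norm for other p as well
assert np.isclose(p_norm(x, 3), np.linalg.norm(x, 3))
```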
Here is a nice visualization of the unit ball, that is, the set $\{x : \|x\|_p \leq 1\}$, under varying $p$.
(Point-separating): $\ell(u) = 0$ implies $u = 0$.
In [1]:
import numpy as np
In [2]:
u = np.array([1., 3., 5., 7.])
print(u)
[1. 3. 5. 7.]
To obtain the norm of a vector, we can use the function linalg.norm (which requires the
numpy.linalg package):
In [3]: from numpy import linalg as LA
In [4]:
LA.norm(u)
Out[4]: 9.16515138991168
An $n \times m$ matrix $A$ has entry $A_{ij}$ in row $i$ and column $j$. We also refer to a matrix as the collection of all of its entries as follows:

$$A = (A_{ij})_{i \in [n], j \in [m]}.$$

We occasionally simplify the notation to $(A_{ij})_{i,j}$ when the range of the indices is clear from context. We use the notation

$$A_{i,\cdot} = (A_{i1} \; \cdots \; A_{im})$$

to indicate the $i$-th row of $A$ as a row vector, i.e., a matrix with a single row, and similarly

$$A_{\cdot,j} = \begin{pmatrix} A_{1j} \\ \vdots \\ A_{nj} \end{pmatrix},$$

the $j$-th column of $A$ as a column vector, i.e., a matrix with a single column.
Example: Suppose

$$A = \begin{pmatrix} 2 & 5 \\ 3 & 6 \\ 1 & 1 \end{pmatrix}.$$

⊲
Recall that the transpose $A^T$ of a matrix $A \in \mathbb{R}^{n \times m}$ is defined as the matrix in $\mathbb{R}^{m \times n}$ that switches the row and column indices of $A$, that is, its entries are

$$[A^T]_{ij} = A_{ji}, \quad i = 1, \ldots, m, \; j = 1, \ldots, n.$$
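A quick check of the definition in NumPy, using the example matrix above; the `.T` attribute computes the transpose:

```python
import numpy as np

A = np.array([[2., 5.],
              [3., 6.],
              [1., 1.]])   # the 3x2 matrix from the example

# .T switches row and column indices
assert A.T.shape == (2, 3)
assert np.all(A.T == np.array([[2., 3., 1.],
                               [5., 6., 1.]]))
# [A^T]_{ij} = A_{ji} entrywise
for i in range(2):
    for j in range(3):
        assert A.T[i, j] == A[j, i]
```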
The transpose in particular can be used to turn a column vector into a row vector and vice versa. That is, if $b \in \mathbb{R}^n$ is a column vector

$$b = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{pmatrix},$$

then

$$b^T = \begin{pmatrix} b_1 & b_2 & \cdots & b_n \end{pmatrix}.$$
It is common to stack $n$ vectors $x_1, \ldots, x_n \in \mathbb{R}^d$ as the rows of an $n \times d$ matrix:

$$X = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_n^T \end{pmatrix} = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nd} \end{pmatrix}.$$
To stack vectors as the rows of a matrix, we can use numpy.stack (here v = np.array([2., 4., 6., 8.]), defined like u above):

X = np.stack((u,v),axis=0)
print(X)
[[1. 3. 5. 7.]
 [2. 4. 6. 8.]]
Passing more vectors stacks more rows (here w = np.array([9., 8., 7., 6.])):

X = np.stack((u,v,w))
print(X)
[[1. 3. 5. 7.]
 [2. 4. 6. 8.]
 [9. 8. 7. 6.]]
The matrix-vector product $c = Ab$ of $A \in \mathbb{R}^{n \times m}$ and $b \in \mathbb{R}^m$ has entries

$$c_i = (Ab)_i = \sum_{j=1}^m A_{ij} b_j.$$

In vector form,

$$Ab = \sum_{j=1}^m A_{\cdot,j} \, b_j,$$

that is, $Ab$ is a linear combination of the columns of $A$ where the coefficients are the entries of $b$.
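The column-combination view can be verified numerically, using the example matrix $A$ from above and an illustrative $b$:

```python
import numpy as np

A = np.array([[2., 5.],
              [3., 6.],
              [1., 1.]])
b = np.array([1., 0.])

# matrix-vector product as a linear combination of the columns of A
col_combo = sum(A[:, j] * b[j] for j in range(A.shape[1]))
assert np.allclose(A @ b, col_combo)
# with b = (1, 0)^T this picks out the first column of A
assert np.allclose(A @ b, np.array([2., 3., 1.]))
```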
Example (continued): Consider the column vector $b = (1, 0)^T$. Then

$$Ab = \begin{pmatrix} 2(1) + 5(0) \\ 3(1) + 6(0) \\ 1(1) + 1(0) \end{pmatrix} = \begin{pmatrix} 2 \\ 3 \\ 1 \end{pmatrix}.$$
More generally, the matrix-matrix product $C = AB$ of $A \in \mathbb{R}^{n \times m}$ and $B \in \mathbb{R}^{m \times p}$ has entries

$$C_{i\ell} = \sum_{j=1}^m A_{ij} B_{j\ell}.$$

Note that the number of columns of $A$ and the number of rows of $B$ must match. There are many different ways to view this formula that are helpful in interpreting matrix-matrix products in different contexts.

First, we observe that the entry $C_{i\ell}$ is an inner product of the $i$-th row of $A$ and the $\ell$-th column of $B$, that is, $C_{i\ell} = A_{i,\cdot} B_{\cdot,\ell}$.
In matrix form,

$$AB = \begin{pmatrix} A_{1,\cdot} B_{\cdot,1} & A_{1,\cdot} B_{\cdot,2} & \cdots & A_{1,\cdot} B_{\cdot,p} \\ A_{2,\cdot} B_{\cdot,1} & A_{2,\cdot} B_{\cdot,2} & \cdots & A_{2,\cdot} B_{\cdot,p} \\ \vdots & \vdots & \ddots & \vdots \\ A_{n,\cdot} B_{\cdot,1} & A_{n,\cdot} B_{\cdot,2} & \cdots & A_{n,\cdot} B_{\cdot,p} \end{pmatrix}.$$
Second, the $j$-th column of the product,

$$(AB)_{\cdot,j} = \sum_{\ell=1}^m A_{\cdot,\ell} \, B_{\ell j},$$

is a linear combination of the columns of $A$ where the coefficients are the entries in column $j$ of $B$. Similarly, the $i$-th row of the product,

$$(AB)_{i,\cdot} = \sum_{\ell=1}^m A_{i\ell} \, B_{\ell,\cdot},$$

is a linear combination of the rows of $B$ where the coefficients are the entries in row $i$ of $A$. In matrix form,

$$AB = \begin{pmatrix} A_{1,\cdot} B \\ A_{2,\cdot} B \\ \vdots \\ A_{n,\cdot} B \end{pmatrix}.$$
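All three views of the product (entrywise inner products, column view, row view) can be checked against NumPy's `@` operator on small random matrices (shapes and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(4, 2))
C = A @ B

# entrywise: C[i,l] is the inner product of row i of A and column l of B
for i in range(3):
    for l in range(2):
        assert np.isclose(C[i, l], A[i, :] @ B[:, l])

# column view: each column of AB is A times the corresponding column of B
for l in range(2):
    assert np.allclose(C[:, l], A @ B[:, l])

# row view: each row of AB is the corresponding row of A times B
for i in range(3):
    assert np.allclose(C[i, :], A[i, :] @ B)
```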
Matrix norms: We will also need a notion of matrix norm. A natural way to define a norm for matrices is to notice that an $n \times m$ matrix $A$ can be thought of as an $nm$-dimensional vector, with one element for each entry of $A$. Indeed, addition and scalar multiplication work exactly the same way. Hence, we can define the $2$-norm of a matrix in terms of the sum of its squared entries. The resulting norm, the Frobenius norm, is defined as

$$\|A\|_F = \sqrt{\sum_{i=1}^n \sum_{j=1}^m A_{ij}^2}.$$
Using the row notation, we see that the square of the Frobenius norm can be written as the sum of the squared Euclidean norms of the rows:

$$\|A\|_F^2 = \sum_{i=1}^n \|A_{i,\cdot}\|_2^2.$$

Similarly, in terms of the columns $A_{\cdot,j}$, $j = 1, \ldots, m$, of $A$:

$$\|A\|_F^2 = \sum_{j=1}^m \|A_{\cdot,j}\|_2^2.$$
The Frobenius norm of a difference, $\|A - B\|_F$, can be interpreted as a distance between $A$ and $B$, that is, a measure of how dissimilar they are.
NUMERICAL CORNER: In Numpy, the Frobenius norm of a matrix can be computed using the
function numpy.linalg.norm .
In [8]:
A = np.array([[1., 0.],[0., 1.],[0., 0.]])
print(A)
[[1. 0.]
[0. 1.]
[0. 0.]]
In [9]:
LA.norm(A)
Out[9]: 1.4142135623730951
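A sketch verifying the definition and the row/column decompositions on the matrix above (for 2-D arrays, `numpy.linalg.norm` defaults to the Frobenius norm):

```python
import numpy as np

A = np.array([[1., 0.],
              [0., 1.],
              [0., 0.]])

fro = np.sqrt(np.sum(A ** 2))                 # from the definition
assert np.isclose(fro, np.linalg.norm(A))     # matrix default is Frobenius
# squared norm as a sum of squared row norms, and of squared column norms
assert np.isclose(fro ** 2, sum(np.linalg.norm(A[i, :]) ** 2 for i in range(3)))
assert np.isclose(fro ** 2, sum(np.linalg.norm(A[:, j]) ** 2 for j in range(2)))
```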
For $r > 0$, the open $r$-ball around $x = (x_1, \ldots, x_d)^T \in \mathbb{R}^d$ is the set

$$B_r(x) = \{ y \in \mathbb{R}^d : \|y - x\| < r \}.$$
A point $x \in \mathbb{R}^d$ is an interior point of a set $A \subseteq \mathbb{R}^d$ if there exists an $r > 0$ such that $B_r(x) \subseteq A$. A set $A$ is open if it consists entirely of interior points.
A set $A \subseteq \mathbb{R}^d$ is bounded if there exists an $r > 0$ such that $A \subseteq B_r(0)$, where $0 = (0, \ldots, 0)^T$.
Definition (Limits of a Function): Let $f : D \to \mathbb{R}$ be a real-valued function on $D \subseteq \mathbb{R}^d$. Then $f$ is said to have a limit $L \in \mathbb{R}$ as $x$ approaches $a$ if: for any $\varepsilon > 0$, there exists a $\delta > 0$ such that $|f(x) - L| < \varepsilon$ for all $x \in D \cap B_\delta(a) \setminus \{a\}$. This is written as

$$\lim_{x \to a} f(x) = L.$$
Note that we explicitly exclude $a$ itself from having to satisfy the condition $|f(x) - L| < \varepsilon$. In particular, we may have $f(a) \neq L$. We also do not restrict $a$ to be in $D$.
We will not prove the following fundamental analysis result, which will be used repeatedly in this course. (See e.g. Wikipedia for a sketch of the proof.) Suppose $f : D \to \mathbb{R}$ is defined on a set $D \subseteq \mathbb{R}^d$. We say that $f$ attains a maximum value $M$ at $z^*$ if $f(z^*) = M$ and $M \geq f(x)$ for all $x \in D$. Similarly, we say $f$ attains a minimum value $m$ at $z_*$ if $f(z_*) = m$ and $m \leq f(x)$ for all $x \in D$.
Theorem (Extreme Value): Let $f : D \to \mathbb{R}$ be a real-valued, continuous function on a nonempty, closed, bounded set $D \subseteq \mathbb{R}^d$. Then $f$ attains a maximum and a minimum on $D$.
1.2.2.2 Derivatives
We move on to derivatives. Recall that the derivative of a function of a real variable is the rate of
change of the function with respect to the change in the variable. Formally:
Definition (Derivative): Let $f : D \to \mathbb{R}$ where $D \subseteq \mathbb{R}$, and let $x_0 \in D$ be an interior point of $D$. The derivative of $f$ at $x_0$ is

$$f'(x_0) = \lim_{h \to 0} \frac{f(x_0 + h) - f(x_0)}{h},$$

provided the limit exists.
The following lemma encapsulates a key insight about the derivative of $f$ at $x_0$: it tells us where to find smaller values of $f$.

Lemma (Descent Direction): Let $f : D \to \mathbb{R}$ with $D \subseteq \mathbb{R}$, and let $x_0$ be an interior point of $D$ at which $f'(x_0)$ exists. If $f'(x_0) > 0$, then there is an open ball $B_\delta(x_0) \subseteq D$ around $x_0$ such that for each $x$ in $B_\delta(x_0)$:

(a) $f(x) > f(x_0)$ if $x > x_0$,

(b) $f(x) < f(x_0)$ if $x < x_0$.

If instead $f'(x_0) < 0$, the opposite holds.
Proof idea: Follows from the definition of the derivative by taking $\varepsilon$ small enough that $f'(x_0) - \varepsilon > 0$.

Proof: Take $\varepsilon = f'(x_0)/2$. By definition of the derivative, there is $\delta > 0$ such that

$$\left| \frac{f(x_0 + h) - f(x_0)}{h} - f'(x_0) \right| < \varepsilon$$

for all $0 < |h| < \delta$. Rearranging, for all such $h$,

$$\frac{f(x_0 + h) - f(x_0)}{h} > f'(x_0) - \varepsilon = \frac{f'(x_0)}{2} > 0,$$

so $f(x_0 + h) - f(x_0)$ has the same sign as $h$, which proves both claims. $\square$
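A small numerical illustration of the lemma, for the illustrative choice $f(x) = x^2$ at $x_0 = 1$, where $f'(x_0) = 2 > 0$:

```python
# f(x) = x^2 has f'(1) = 2 > 0: nearby points to the right are higher,
# nearby points to the left are lower, as the lemma predicts
f = lambda x: x ** 2
x0, h = 1.0, 1e-3
assert f(x0 + h) > f(x0)   # x > x0 gives f(x) > f(x0)
assert f(x0 - h) < f(x0)   # x < x0 gives f(x) < f(x0)
```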
One implication of the Descent Direction Lemma is the Mean Value Theorem, which will lead us
later to Taylor's Theorem. First, an important special case:
Theorem (Rolle): Let $f : [a, b] \to \mathbb{R}$ be a continuous function and assume that its derivative exists on $(a, b)$. If $f(a) = f(b)$, then there is $a < c < b$ such that $f'(c) = 0$.
Proof idea: Look at an extremum and use the Descent Direction Lemma to get a contradiction.
Proof: If $f(x) = f(a)$ for all $x \in (a, b)$, then $f'(x) = 0$ on $(a, b)$ and we are done. So assume $f$ takes a value strictly larger than $f(a)$ somewhere in $(a, b)$ (otherwise consider the function $-f$). By the Extreme Value Theorem, $f$ attains a maximum value at some $c \in [a, b]$. By our assumption, $a$ and $b$ cannot be the location of the maximum, and it must be that $c \in (a, b)$.

We claim that $f'(c) = 0$. We argue by contradiction. Suppose $f'(c) > 0$. By the Descent Direction Lemma, there is a point $x > c$ arbitrarily close to $c$ with $f(x) > f(c)$, contradicting the maximality of $c$. The case $f'(c) < 0$ is similar, so it must be that $f'(c) = 0$. $\square$
Theorem (Mean Value): Let $f : [a, b] \to \mathbb{R}$ be a continuous function and assume that its derivative exists on $(a, b)$. Then there is $a < c < b$ such that

$$f(b) = f(a) + (b - a) f'(c),$$

or put differently

$$\frac{f(b) - f(a)}{b - a} = f'(c).$$
Proof: Let

$$\phi(x) = f(x) - f(a) - \frac{f(b) - f(a)}{b - a}(x - a).$$

Note that $\phi(a) = \phi(b) = 0$ and

$$\phi'(x) = f'(x) - \frac{f(b) - f(a)}{b - a}$$

for all $x \in (a, b)$. Thus, by Rolle, there is $c \in (a, b)$ such that $\phi'(c) = 0$. That implies

$$f'(c) = \frac{f(b) - f(a)}{b - a},$$

which gives the result. $\square$
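A quick numerical check of the Mean Value Theorem for the illustrative choice $f(x) = x^3$ on $[0, 2]$, where $c$ can be computed in closed form:

```python
import numpy as np

# MVT for f(x) = x^3 on [0, 2]:
# f'(c) = 3c^2 must equal (f(2) - f(0)) / (2 - 0) = 4, so c = sqrt(4/3)
f = lambda x: x ** 3
a, b = 0.0, 2.0
slope = (f(b) - f(a)) / (b - a)
c = np.sqrt(slope / 3)          # solve 3c^2 = slope
assert a < c < b                # c lies strictly inside the interval
assert np.isclose(3 * c ** 2, slope)
```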
We will also use Taylor's Theorem, a generalization of the Mean Value Theorem that provides a
polynomial approximation to a function around a point. We will restrict ourselves to the case of a
linear approximation with second-order error term, which will suffice for our purposes.
That is, for $f$ twice continuously differentiable, there is a $\xi$ between $a$ and $x$ such that

$$f(x) = f(a) + (x - a) f'(a) + \frac{1}{2} (x - a)^2 f''(\xi).$$
One way to think of the proof of that result is the following: we constructed an affine function that agrees with $f$ at $a$ and $x$, then used Rolle to express the coefficient of the linear term using $f'$. Here we do the same with a polynomial of degree $2$. But we now have an extra degree of freedom in choosing this polynomial. Because we are looking for a good approximation close to $a$, we choose to make the first derivative at $a$ also agree. Applying Rolle twice gives the claim.
Proof: Let

$$P(t) = \alpha_0 + \alpha_1 (t - a) + \alpha_2 (t - a)^2.$$

Choose $\alpha_0 = f(a)$ and $\alpha_1 = f'(a)$, so that $P$ and its first derivative agree with those of $f$ at $a$, and pick $\alpha_2$ so that $P(x) = f(x)$. Let $\phi(t) = f(t) - P(t)$; then $\phi(a) = \phi(x) = 0$, so by Rolle there is a $\xi'$ strictly between $a$ and $x$ such that $\phi'(\xi') = 0$. Moreover, $\phi'(a) = 0$. Hence we can apply Rolle again, this time to $\phi'$ on the interval between $a$ and $\xi'$, to get $\xi$ with

$$0 = \phi''(\xi) = f''(\xi) - P''(\xi) = f''(\xi) - 2\alpha_2,$$

so $\alpha_2 = f''(\xi)/2$. Plugging into $P$ and using $\phi(x) = 0$ gives the claim. $\square$
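A numerical sanity check of the second-order error term, for the illustrative choice $f = \exp$ at $a = 0$: the error of the linear approximation is bounded by $(x - a)^2/2$ times the maximum of $f''$ on the interval:

```python
import numpy as np

# Taylor with second-order error for exp at a = 0:
# exp(x) = 1 + x + (x^2 / 2) * exp(xi) for some xi between 0 and x,
# so the linear-approximation error is at most (x^2 / 2) * e^{|x|}
for x in [0.5, 0.1, -0.3]:
    linear = 1 + x                        # f(a) + (x - a) f'(a)
    err = abs(np.exp(x) - linear)
    bound = x ** 2 / 2 * np.exp(abs(x))   # max of f'' on the interval
    assert err <= bound
```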
1.2.2.3 Optimization
As we mentioned before, optimization problems play a central role in data science. Here we look
at unconstrained optimization problems, that is, problems of the form:
$$\min_{x \in \mathbb{R}^d} f(x)$$

where $f : \mathbb{R}^d \to \mathbb{R}$.
Ideally, we would like to find a global minimizer to the optimization problem above.
Definition (Global Minimizer): Let $f : \mathbb{R}^d \to \mathbb{R}$. The point $x^* \in \mathbb{R}^d$ is a global minimizer of $f$ over $\mathbb{R}^d$ if

$$f(x) \geq f(x^*), \quad \forall x \in \mathbb{R}^d.$$
NUMERICAL CORNER: To plot a function $f$, we can use the matplotlib package, and specifically its function matplotlib.pyplot.plot. We also use the function numpy.linspace to create an array of evenly spaced numbers where we evaluate $f$.
In [10]:
import matplotlib.pyplot as plt
In [11]:
x = np.linspace(-2,2,100)
y = x ** 2
plt.plot(x,y)
plt.show()
The function $f(x) = e^x$ over $\mathbb{R}$ has no global minimizer: $f(x) > 0$ for all $x$, yet $f(x) \to 0$ as $x \to -\infty$. Note that $\mathbb{R}$ is not bounded, therefore the Extreme Value Theorem does not apply here.
In [12]:
x = np.linspace(-2,2,100)
y = np.exp(x)
plt.plot(x,y)
plt.ylim(0,5)
plt.show()
The function $f(x) = (x+1)^2 (x-1)^2$ over $\mathbb{R}$ has two global minimizers, at $x^* = -1$ and $x^{**} = 1$. Indeed, $f(x) \geq 0$, and $f(x) = 0$ if and only if $x = x^*$ or $x = x^{**}$.
In [13]:
x = np.linspace(-2,2,100)
y = ((x+1)**2) * ((x-1)**2)
plt.plot(x,y)
plt.ylim(0,5)
plt.show()
In general, finding a global minimizer and certifying that one has been found can be difficult
unless some special structure is present. Therefore weaker notions of solution have been
introduced.
Definition (Local Minimizer): Let $f : \mathbb{R}^d \to \mathbb{R}$. The point $x^* \in \mathbb{R}^d$ is a local minimizer of $f$ over $\mathbb{R}^d$ if there is $\delta > 0$ such that

$$f(x) \geq f(x^*), \quad \forall x \in B_\delta(x^*) \setminus \{x^*\}.$$
In words, $x^*$ is a local minimizer if there is an open ball around $x^*$ where it attains the minimum value. The difference between global and local minimizers is illustrated in the next figure.
Local minimizers can be characterized in terms of derivatives (or more generally, gradients, their
generalization to functions of several variables; we come back to them later in the course). More
precisely, they provide a necessary condition.
Theorem (First-Order Necessary Condition): Let $f : \mathbb{R} \to \mathbb{R}$ be continuously differentiable on $\mathbb{R}$. If $x_0$ is a local minimizer, then $f'(x_0) = 0$.
Proof: We argue by contradiction. Suppose $f'(x_0) \neq 0$, and assume $f'(x_0) > 0$ (the case $f'(x_0) < 0$ being similar). By the Descent Direction Lemma, there is a $\delta > 0$ such that, for each $x$ in $B_\delta(x_0)$, $f(x) < f(x_0)$ if $x < x_0$. So every open ball around $x_0$ has a point achieving a smaller value, contradicting the fact that $x_0$ is a local minimizer. $\square$
1.2.3 Probability
Finally, we review a few key definitions and results from probability theory. Further concepts will
be reviewed later in the course.
1.2.3.1 Expectation, variance and Chebyshev's inequality
Recall that the expectation (or mean) of a function $h$ of a discrete random variable $X$ taking values in $\mathcal{X}$ is given by

$$\mathbb{E}[h(X)] = \sum_{x \in \mathcal{X}} h(x) \, p_X(x),$$

where $p_X$ is the probability mass function (PMF) of $X$.
In the continuous case, letting $f_X$ be the probability density function (PDF) of $X$, we have

$$\mathbb{E}[h(X)] = \int h(x) f_X(x) \, \mathrm{d}x.$$
These definitions extend to functions of multiple variables by using instead the joint PMF or PDF. We sometimes denote the expectation of $X$ by $\mu_X$.
The variance of $X$ is

$$\mathrm{Var}[X] = \mathbb{E}[(X - \mathbb{E}[X])^2]$$

and its standard deviation is $\sigma_X = \sqrt{\mathrm{Var}[X]}$. The variance does not satisfy linearity, but we have the following property:

$$\mathrm{Var}[\alpha X + \beta] = \alpha^2 \, \mathrm{Var}[X].$$
The standard deviation is a measure of the typical deviation of $X$ around its mean, that is, of the spread of its distribution.
Theorem (Chebyshev): For a random variable $X$ with finite variance, we have, for any $\alpha > 0$,

$$\mathbb{P}[|X - \mathbb{E}[X]| \geq \alpha] \leq \frac{\mathrm{Var}[X]}{\alpha^2} = \left( \frac{\sigma_X}{\alpha} \right)^2.$$
The intuition is the following: if the expected squared deviation from the mean is small, then the
absolute deviation from the mean is unlikely to be large.
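A simulation sketch (seed, sample size, and $\alpha$ illustrative) comparing the empirical tail probability of a $\mathrm{U}[0,1]$ sample, whose variance is $1/12$, to the Chebyshev bound:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random(100000)          # U[0,1]: mean 1/2, variance 1/12

alpha = 0.4
# empirical frequency of large deviations from the mean
empirical = np.mean(np.abs(X - 0.5) >= alpha)
# Chebyshev's upper bound Var[X] / alpha^2
chebyshev = (1 / 12) / alpha ** 2
assert empirical <= chebyshev + 0.01   # small slack for sampling error
```

Here the bound is loose (the true tail probability is $0.2$, while the bound is about $0.52$), which is typical: Chebyshev trades sharpness for generality.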
To formalize this we prove a more general inequality, Markov's inequality. In words, if a non-negative random variable has a small expectation, then it is unlikely to be large.
Lemma (Markov): Let $Z$ be a non-negative random variable with finite expectation. Then, for any $\beta > 0$,

$$\mathbb{P}[Z \geq \beta] \leq \frac{\mathbb{E}[Z]}{\beta}.$$
Proof idea: The quantity $\beta \, \mathbb{P}[Z \geq \beta]$ is a lower bound on the expectation of $Z$ restricted to the event $\{Z \geq \beta\}$, which is itself at most $\mathbb{E}[Z]$ by non-negativity.
Proof: Formally, let $\mathbf{1}_A$ be the indicator of the event $A$, that is, the random variable that is $1$ when $A$ occurs and $0$ otherwise. By non-negativity, $Z \geq \beta \, \mathbf{1}_{\{Z \geq \beta\}}$. Taking expectations gives

$$\mathbb{E}[Z] \geq \beta \, \mathbb{E}[\mathbf{1}_{\{Z \geq \beta\}}] = \beta \, \mathbb{P}[Z \geq \beta],$$

and rearranging gives the claim. $\square$

To derive Chebyshev's inequality, apply Markov's inequality to the non-negative random variable $Z = (X - \mathbb{E}[X])^2$ with $\beta = \alpha^2$:

$$\mathbb{P}[|X - \mathbb{E}[X]| \geq \alpha] = \mathbb{P}[(X - \mathbb{E}[X])^2 \geq \alpha^2] \leq \frac{\mathbb{E}[(X - \mathbb{E}[X])^2]}{\alpha^2} = \frac{\mathrm{Var}[X]}{\alpha^2}.$$
Note the absence of absolute values compared to the two-sided form appearing in Chebyshev's inequality. In this case, we can use the fact that the event $\{X - \mathbb{E}[X] \geq \alpha\}$ implies a fortiori that $\{|X - \mathbb{E}[X]| \geq \alpha\}$, so that the probability of the former is smaller than that of the latter by monotonicity, namely,

$$\mathbb{P}[X - \mathbb{E}[X] \geq \alpha] \leq \mathbb{P}[|X - \mathbb{E}[X]| \geq \alpha].$$
(2) In terms of the standard deviation $\sigma_X = \sqrt{\mathrm{Var}[X]}$ of $X$, the inequality can be re-written as

$$\mathbb{P}[|X - \mathbb{E}[X]| \geq \alpha] \leq \frac{\mathrm{Var}[X]}{\alpha^2} = \left( \frac{\sigma_X}{\alpha} \right)^2.$$
So to get a small bound on the right-hand side, one needs the deviation $\alpha$ from the mean to be significantly larger than the standard deviation. In words, a random variable is unlikely to be away from its mean by much more than its standard deviation. This observation is consistent with the interpretation of the standard deviation as the typical spread of a random variable.
Chebyshev's inequality is particularly useful when combined with independence, as we recall
next.
1.2.3.2 Independence and limit theorems
Recall that discrete random variables $X$ and $Y$ are independent if their joint PMF factorizes, that is,

$$p_{X,Y}(x, y) = p_X(x) \, p_Y(y), \quad \forall x, y,$$

where $p_{X,Y}(x, y) = \mathbb{P}[X = x, Y = y]$. Similarly, continuous random variables $X$ and $Y$ are independent if their joint PDF factorizes. One consequence is that expectations of products of single-variable functions factorize as well, that is, for functions $g$ and $h$ we have

$$\mathbb{E}[g(X) \, h(Y)] = \mathbb{E}[g(X)] \, \mathbb{E}[h(Y)].$$
Recall that the covariance of two random variables $X$ and $Y$ with finite variances is defined as

$$\mathrm{Cov}[X, Y] = \mathbb{E}\left[ (X - \mathbb{E}[X])(Y - \mathbb{E}[Y]) \right].$$

If $X$ and $Y$ are independent, then

$$\mathrm{Cov}[X, Y] = \mathbb{E}\left[ (X - \mathbb{E}[X])(Y - \mathbb{E}[Y]) \right] = \mathbb{E}\left[ X - \mathbb{E}[X] \right] \mathbb{E}\left[ Y - \mathbb{E}[Y] \right] = 0,$$

where we used independence in the second equality and the linearity of expectation in the third one.
The covariance leads to a useful identity for the variance of a sum of random variables.
Lemma (Variance of a Sum): Let $X_1, \ldots, X_n$ be random variables with finite means and variances. Then we have

$$\mathrm{Var}[X_1 + \cdots + X_n] = \sum_{i=1}^n \mathrm{Var}[X_i] + 2 \sum_{i < j} \mathrm{Cov}[X_i, X_j].$$
Proof: By definition,

$$\begin{aligned} \mathrm{Var}[X_1 + \cdots + X_n] &= \mathbb{E}\left[ \left( X_1 + \cdots + X_n - \mathbb{E}[X_1] - \cdots - \mathbb{E}[X_n] \right)^2 \right] \\ &= \mathbb{E}\left[ \left( (X_1 - \mathbb{E}[X_1]) + \cdots + (X_n - \mathbb{E}[X_n]) \right)^2 \right] \\ &= \sum_{i=1}^n \mathbb{E}\left[ (X_i - \mathbb{E}[X_i])^2 \right] + \sum_{i \neq j} \mathbb{E}\left[ (X_i - \mathbb{E}[X_i])(X_j - \mathbb{E}[X_j]) \right]. \end{aligned}$$

The claim follows from the definition of the variance and covariance, and the symmetry of the covariance. $\square$
The previous lemma has the following important implication. If $X_1, \ldots, X_n$ are pairwise independent, real-valued random variables, then

$$\mathrm{Var}[X_1 + \cdots + X_n] = \mathrm{Var}[X_1] + \cdots + \mathrm{Var}[X_n].$$

Notice that, unlike the case of the expectation, this linearity property for the variance requires independence.
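A simulation sketch (seed and sample size illustrative) of this additivity for two independent $\mathrm{U}[0,1]$ variables, each with variance $1/12$:

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples = 200000

# independent uniforms: the variance of the sum should be the
# sum of the variances, i.e. 1/12 + 1/12 = 1/6
X = rng.random(n_samples)
Y = rng.random(n_samples)
var_sum = np.var(X + Y)
assert abs(var_sum - 2 / 12) < 0.005   # slack for sampling error
```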
Applied to the sample mean of $n$ independent, identically distributed (i.i.d.) random variables $X_1, \ldots, X_n$, we obtain

$$\mathrm{Var}\left[ \frac{1}{n} \sum_{i=1}^n X_i \right] = \frac{1}{n^2} \sum_{i=1}^n \mathrm{Var}[X_i] = \frac{1}{n^2} \, n \, \mathrm{Var}[X_1] = \frac{\mathrm{Var}[X_1]}{n}.$$
So the variance of the sample mean decreases as $n$ gets large, while its expectation remains the same by linearity:

$$\mathbb{E}\left[ \frac{1}{n} \sum_{i=1}^n X_i \right] = \frac{1}{n} \sum_{i=1}^n \mathbb{E}[X_i] = \frac{1}{n} \, n \, \mathbb{E}[X_1] = \mathbb{E}[X_1].$$
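A simulation sketch (parameters illustrative) of this $1/n$ scaling: the sample mean of $n$ $\mathrm{U}[0,1]$ draws should have expectation $1/2$ and variance $(1/12)/n$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 50, 100000

# reps independent sample means, each over n U[0,1] draws
means = rng.random((reps, n)).mean(axis=1)
assert abs(means.mean() - 0.5) < 0.01            # expectation unchanged
assert abs(np.var(means) - (1 / 12) / n) < 3e-4  # variance shrinks like 1/n
```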
Together with Chebyshev's inequality, we immediately get that the sample mean approaches its
expectation in the following probabilistic sense.
Theorem (Weak Law of Large Numbers): Let $X_1, \ldots, X_n$ be i.i.d. For any $\varepsilon > 0$, as $n \to +\infty$,

$$\mathbb{P}\left[ \left| \frac{1}{n} \sum_{i=1}^n X_i - \mathbb{E}[X_1] \right| \geq \varepsilon \right] \to 0.$$
Proof: By Chebyshev's inequality and the computations above,

$$\mathbb{P}\left[ \left| \frac{1}{n} \sum_{i=1}^n X_i - \mathbb{E}[X_1] \right| \geq \varepsilon \right] = \mathbb{P}\left[ \left| \frac{1}{n} \sum_{i=1}^n X_i - \mathbb{E}\left[ \frac{1}{n} \sum_{i=1}^n X_i \right] \right| \geq \varepsilon \right] \leq \frac{\mathrm{Var}\left[ \frac{1}{n} \sum_{i=1}^n X_i \right]}{\varepsilon^2} = \frac{\mathrm{Var}[X_1]}{n \varepsilon^2} \to 0$$

as $n \to +\infty$. $\square$
NUMERICAL CORNER: We can use simulations to confirm the Weak Law of Large Numbers. Recall that a uniform random variable over the interval $[a, b]$ has density

$$f_X(x) = \begin{cases} \frac{1}{b - a} & x \in [a, b] \\ 0 & \text{otherwise.} \end{cases}$$
We write $X \sim \mathrm{U}[a, b]$. We can obtain a sample from $\mathrm{U}[0, 1]$ by using the numpy.random module in NumPy.
In [14]:
from numpy.random import default_rng
rng = default_rng(12345)
In [15]:
rng.random()
Out[15]: 0.22733602246716966
In [16]:
def lln_unif(n, nsample):
    # each row holds n independent U[0,1] draws; take the mean of each row
    sample_mean = rng.random((nsample, n)).mean(axis=1)
    plt.hist(sample_mean, bins=15)
    plt.xlim(0, 1)
    plt.title(f'n={n}')
    plt.show()
In [17]:
lln_unif(10, 1000)
In [18]:
lln_unif(100, 1000)
Recall that a standard Normal variable $X$ has density

$$f_X(x) = \frac{1}{\sqrt{2\pi}} \exp(-x^2/2).$$
To construct a $d$-dimensional version, we take $d$ independent standard Normal variables $X_1, \ldots, X_d$, that is,

$$f_X(x) = \prod_{i=1}^d \frac{1}{\sqrt{2\pi}} \exp(-x_i^2/2) = \frac{1}{\prod_{i=1}^d \sqrt{2\pi}} \exp\left( -\sum_{i=1}^d x_i^2 / 2 \right) = \frac{1}{(2\pi)^{d/2}} \exp(-\|x\|^2/2).$$
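The chain of equalities can be checked numerically at a random point (dimension and seed illustrative):

```python
import numpy as np

# the product of d univariate standard Normal densities equals
# the multivariate formula exp(-||x||^2 / 2) / (2*pi)^(d/2)
d = 3
rng = np.random.default_rng(3)
x = rng.normal(size=d)

product_form = np.prod(np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi))
norm_form = np.exp(-np.linalg.norm(x) ** 2 / 2) / (2 * np.pi) ** (d / 2)
assert np.isclose(product_form, norm_form)
```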
NUMERICAL CORNER: The following function generates $n$ data points from a spherical $d$-dimensional Gaussian with variance $1$ and mean $w e_1$. We will use it later in the chapter.

In [19]:
def one_cluster(d, n, w):
    # standard Normal entries; shift the first coordinate by w
    X = rng.normal(0, 1, (n, d))
    X[:, 0] += w
    return X

For given values of d, n, and w, we can plot the first two coordinates:

X = one_cluster(d, n, w)
plt.scatter(X[:,0], X[:,1])
plt.show()