
Course: Math 535 - Mathematical Methods in Data Science (MMiDS)

Author: Sebastien Roch, Department of Mathematics, University of Wisconsin-Madison


Updated: April 22, 2022
Copyright: © 2022 Sebastien Roch

Chapter 1 - Introduction and review


1.2 Review++
We first review a few basic mathematical concepts. In this chapter, we focus on vector and
matrix algebra, some basic calculus and optimization, as well as elementary probability
concepts. Further mathematical concepts will be reviewed in the next chapters. Along the way,
we also introduce Python, especially Numpy.
1.2.1 Vectors and matrices
Vectors and norms. For a vector

$$x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{bmatrix} \in \mathbb{R}^d,$$

the Euclidean norm of $x$ is defined as

$$\|x\|_2 = \sqrt{\sum_{i=1}^d x_i^2} = \sqrt{\langle x, x \rangle}$$

where

$$\langle u, v \rangle = \sum_{i=1}^d u_i v_i$$

is the inner product of $u$ and $v$. This is also known as the $\ell_2$-norm.

The triangle inequality for the $\ell_2$-norm follows from the Cauchy–Schwarz inequality, which is useful in proving many facts.

Theorem (Cauchy–Schwarz): For all $u, v \in \mathbb{R}^d$,

$$|\langle u, v \rangle| \leq \|u\|_2 \|v\|_2.$$
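As a quick numerical sanity check of Cauchy–Schwarz (a sketch, not part of the original notes; the random vectors are made up for illustration), we can compare the two sides on arbitrary vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
u, v = rng.standard_normal(5), rng.standard_normal(5)

lhs = abs(np.inner(u, v))                     # |<u, v>|
rhs = np.linalg.norm(u) * np.linalg.norm(v)   # ||u||_2 ||v||_2
print(lhs <= rhs + 1e-12)                     # the inequality holds
```

Equality holds exactly when one vector is a scalar multiple of the other.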

Given a collection of vectors $u_1, \ldots, u_k \in \mathbb{R}^d$ and real numbers $\alpha_1, \ldots, \alpha_k \in \mathbb{R}$, the linear combination of the $u_\ell$'s with coefficients $\alpha_\ell$'s is the vector

$$z = \sum_{\ell=1}^k \alpha_\ell u_\ell,$$

whose entries are

$$z_i = \sum_{\ell=1}^k \alpha_\ell (u_\ell)_i, \qquad i = 1, \ldots, d.$$
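In NumPy, a linear combination is computed entrywise with scalar multiplication and vector addition (a small sketch; the vectors and coefficients below are made up for illustration):

```python
import numpy as np

u1 = np.array([1., 0., 2.])
u2 = np.array([0., 1., 1.])
alpha = [2., -3.]

# z = 2*u1 - 3*u2, computed entry by entry
z = alpha[0] * u1 + alpha[1] * u2
print(z)  # [ 2. -3.  1.]
```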

The Euclidean distance between two vectors $u$ and $v$ in $\mathbb{R}^d$ is the $\ell_2$-norm of their difference

$$d(u, v) = \|u - v\|_2.$$

Throughout we use the notation $\|x\| = \|x\|_2$ to indicate the $\ell_2$-norm of $x$ unless specified otherwise.

More generally, for $p \geq 1$, the $\ell_p$-norm of $x$ is given by

$$\|x\|_p = \left( \sum_{i=1}^d |x_i|^p \right)^{1/p}.$$

Here is a nice visualization of the unit ball, that is, the set $\{x : \|x\|_p \leq 1\}$, under varying $p$.

Finally, the $\ell_\infty$-norm is defined as

$$\|x\|_\infty = \max_{i=1,\ldots,d} |x_i|.$$
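The $\ell_p$-norms above can be computed with the same `linalg.norm` function by passing the order $p$ as its second argument (a quick sketch, assuming NumPy is available):

```python
import numpy as np
from numpy import linalg as LA

u = np.array([1., 3., 5., 7.])

# The second argument of linalg.norm selects the p in the l_p-norm
print(LA.norm(u, 1))       # 16.0, the sum of absolute values
print(LA.norm(u, 2))       # same as the default LA.norm(u)
print(LA.norm(u, np.inf))  # 7.0, the largest absolute entry
```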

There exist other norms. Formally:

Definition (Norm): A norm is a function $\ell$ from $\mathbb{R}^d$ to $\mathbb{R}_+$ that satisfies, for all $a \in \mathbb{R}$, $u, v \in \mathbb{R}^d$:

(Homogeneity): $\ell(au) = |a|\,\ell(u)$

(Triangle inequality): $\ell(u + v) \leq \ell(u) + \ell(v)$

(Point-separating): $\ell(u) = 0$ implies $u = 0$.

NUMERICAL CORNER: In Numpy, a vector is defined as a 1d array.


In [1]:
# Python 3

import numpy as np

In [2]:
u = np.array([1., 3., 5., 7.])

print(u)

[1. 3. 5. 7.]

To obtain the norm of a vector, we can use the function linalg.norm (which requires the
numpy.linalg package):
In [3]: from numpy import linalg as LA

In [4]:
LA.norm(u)

Out[4]: 9.16515138991168

which we check next "by hand"


In [5]:
np.sqrt(np.sum(u ** 2))

Out[5]: 9.16515138991168

In Numpy, ** indicates element-wise exponentiation.


Matrices. For an $n \times m$ matrix $A \in \mathbb{R}^{n \times m}$ with real entries, we denote by $A_{ij}$ its entry in row $i$ and column $j$. We also refer to a matrix as the collection of all of its entries as follows

$$A = (A_{ij})_{i \in [n], j \in [m]}.$$

We occasionally simplify the notation to $(A_{ij})_{i,j}$ when the range of the indices is clear from context. We use the notation

$$A_{i,\cdot} = \begin{pmatrix} A_{i1} & \cdots & A_{im} \end{pmatrix}$$

to indicate the $i$-th row of $A$ - as a row vector, i.e., a matrix with a single row - and similarly

$$A_{\cdot,j} = \begin{pmatrix} A_{1j} \\ \vdots \\ A_{nj} \end{pmatrix}$$

the $j$-th column of $A$ - as a column vector, i.e., a matrix with a single column.

Example: Suppose

$$A = \begin{bmatrix} 2 & 5 \\ 3 & 6 \\ 1 & 1 \end{bmatrix}.$$

Then the second row is

$$A_{2,\cdot} = \begin{bmatrix} 3 & 6 \end{bmatrix},$$

and the second column is

$$A_{\cdot,2} = \begin{bmatrix} 5 \\ 6 \\ 1 \end{bmatrix}.$$


Recall that the transpose $A^T$ of a matrix $A \in \mathbb{R}^{n \times m}$ is defined as the matrix in $\mathbb{R}^{m \times n}$ that switches the row and column indices of $A$, that is, its entries are

$$[A^T]_{ij} = A_{ji}, \qquad i = 1, \ldots, m, \ j = 1, \ldots, n.$$

The transpose in particular can be used to turn a column vector into a row vector and vice versa. That is, if $b \in \mathbb{R}^n$ is a column vector

$$b = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix},$$

then $b^T$ is the corresponding row vector

$$b^T = \begin{bmatrix} b_1 & b_2 & \cdots & b_n \end{bmatrix}.$$

For instance, $\sqrt{b^T b}$ is the Euclidean norm of $b$.
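In NumPy, the transpose of a 2d array is its `.T` attribute, and for 1d arrays the expression `b @ b` computes $b^T b$ (a sketch; the matrix reuses the example $A$ above):

```python
import numpy as np

A = np.array([[2., 5.], [3., 6.], [1., 1.]])
print(A.T)  # rows and columns switched; shape (2, 3)

b = np.array([1., 3., 5., 7.])
# For 1d arrays, b @ b is the inner product b^T b; its square root is the norm
print(np.sqrt(b @ b))
```

Note that transposing a 1d array is a no-op in NumPy; row/column distinctions only exist for 2d arrays.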

NUMERICAL CORNER: We will often work with collections of $n$ vectors $x_1, \ldots, x_n$ in $\mathbb{R}^d$ and it will be convenient to stack them up into a matrix

$$X = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_n^T \end{bmatrix} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nd} \end{bmatrix}.$$
To create a matrix out of two vectors, we use the function numpy.stack .


In [6]:
u = np.array([1., 3., 5., 7.])

v = np.array([2., 4., 6., 8.])

X = np.stack((u,v),axis=0)

print(X)

[[1. 3. 5. 7.]

[2. 4. 6. 8.]]

Quoting the documentation:


The axis parameter specifies the index of the new axis in the dimensions of the
result. For example, if axis=0 it will be the first dimension and if axis=-1 it will be
the last dimension.
The same scheme still works with more than two vectors.
In [7]:
u = np.array([1., 3., 5., 7.])

v = np.array([2., 4., 6., 8.])

w = np.array([9., 8., 7., 6.])

X = np.stack((u,v,w))

print(X)

[[1. 3. 5. 7.]

[2. 4. 6. 8.]

[9. 8. 7. 6.]]

Matrix-vector product. Recall that, for a matrix $A = (A_{ij})_{i \in [n], j \in [m]} \in \mathbb{R}^{n \times m}$ and a column vector $b = (b_i)_{i \in [m]} \in \mathbb{R}^m$, the matrix-vector product $c = Ab$ is the vector with entries

$$c_i = (Ab)_i = \sum_{j=1}^m A_{ij} b_j.$$

In vector form,

$$Ab = \sum_{j=1}^m A_{\cdot,j} b_j,$$

that is, $Ab$ is a linear combination of the columns of $A$ where the coefficients are the entries of $b$.

Example (continued): Consider the column vector $b = (1, 0)^T$. Then

$$Ab = \begin{bmatrix} 2(1) + 5(0) \\ 3(1) + 6(0) \\ 1(1) + 1(0) \end{bmatrix} = \begin{bmatrix} 2 \\ 3 \\ 1 \end{bmatrix},$$

which can also be written in vector form as

$$(1)\begin{bmatrix} 2 \\ 3 \\ 1 \end{bmatrix} + (0)\begin{bmatrix} 5 \\ 6 \\ 1 \end{bmatrix} = \begin{bmatrix} 2 \\ 3 \\ 1 \end{bmatrix}.$$
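The same computation can be checked in NumPy with the `@` operator (a quick sketch, reusing the example matrix $A$ and vector $b$ above):

```python
import numpy as np

A = np.array([[2., 5.], [3., 6.], [1., 1.]])
b = np.array([1., 0.])

# Matrix-vector product via the @ operator
print(A @ b)  # [2. 3. 1.]

# Same result as the linear-combination-of-columns view
print(b[0] * A[:, 0] + b[1] * A[:, 1])  # [2. 3. 1.]
```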

Matrix-matrix product. Recall that, for matrices $A \in \mathbb{R}^{n \times k}$ and $B \in \mathbb{R}^{k \times m}$, their matrix product is defined as the matrix $C = AB \in \mathbb{R}^{n \times m}$ whose entries are

$$C_{i\ell} = (AB)_{i\ell} = \sum_{j=1}^k A_{ij} B_{j\ell}, \qquad i = 1, \ldots, n, \ \ell = 1, \ldots, m.$$

Note that the number of columns of $A$ and the number of rows of $B$ must match. There are many different ways to view this formula that are helpful in interpreting matrix-matrix products in different contexts.

First, we observe that the entry $C_{i\ell}$ is an inner product of the $i$-th row of $A$ and the $\ell$-th column of $B$. That is,

$$C_{i\ell} = A_{i,\cdot} B_{\cdot,\ell}.$$

In matrix form,

$$AB = \begin{bmatrix} A_{1,\cdot} B_{\cdot,1} & A_{1,\cdot} B_{\cdot,2} & \cdots & A_{1,\cdot} B_{\cdot,m} \\ A_{2,\cdot} B_{\cdot,1} & A_{2,\cdot} B_{\cdot,2} & \cdots & A_{2,\cdot} B_{\cdot,m} \\ \vdots & \vdots & \ddots & \vdots \\ A_{n,\cdot} B_{\cdot,1} & A_{n,\cdot} B_{\cdot,2} & \cdots & A_{n,\cdot} B_{\cdot,m} \end{bmatrix}.$$

Alternatively, the previous display can be re-written as

$$AB = \begin{bmatrix} A B_{\cdot,1} & A B_{\cdot,2} & \cdots & A B_{\cdot,m} \end{bmatrix},$$

where we specify a matrix by the collection of its columns.

Put differently, by the matrix-vector product formula above, the $j$-th column of the product $AB$ is a linear combination of the columns of $A$ where the coefficients are the entries in column $j$ of $B$

$$(AB)_{\cdot,j} = A B_{\cdot,j} = \sum_{\ell=1}^k A_{\cdot,\ell} B_{\ell j}.$$

Similarly, the $i$-th row of the product $AB$ is a linear combination of the rows of $B$ where the coefficients are the entries in row $i$ of $A$

$$(AB)_{i,\cdot} = \sum_{\ell=1}^k A_{i\ell} B_{\ell,\cdot}.$$

Writing $AB$ as the collection of its rows, this is

$$AB = \begin{bmatrix} A_{1,\cdot} B \\ A_{2,\cdot} B \\ \vdots \\ A_{n,\cdot} B \end{bmatrix},$$

where recall that $A_{i,\cdot}$ is a row vector.
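The column and row views of the product can be checked numerically (a sketch; the matrix $B$ below is made up for illustration, while $A$ reuses the running example):

```python
import numpy as np

A = np.array([[2., 5.], [3., 6.], [1., 1.]])  # 3 x 2
B = np.array([[1., 0., 2.], [0., 1., 1.]])    # 2 x 3

C = A @ B  # 3 x 3

# Column view: the j-th column of AB is A times the j-th column of B
print(np.allclose(C[:, 2], A @ B[:, 2]))  # True

# Row view: the i-th row of AB is the i-th row of A times B
print(np.allclose(C[0, :], A[0, :] @ B))  # True
```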

Matrix norms. We will also need a notion of matrix norm. A natural way to define a norm for matrices is to notice that an $n \times m$ matrix $A$ can be thought of as an $nm$ vector, with one element for each entry of $A$. Indeed, addition and scalar multiplication work exactly in the same way. Hence, we can define the $2$-norm of a matrix in terms of the sum of its squared entries. (We will encounter other matrix norms later in the course.)

Definition (Frobenius Norm): The Frobenius norm of an $n \times m$ matrix $A \in \mathbb{R}^{n \times m}$ is defined as

$$\|A\|_F = \sqrt{\sum_{i=1}^n \sum_{j=1}^m A_{ij}^2}.$$

Using the row notation, we see that the square of the Frobenius norm can be written as the sum of the squared Euclidean norms of the rows

$$\|A\|_F^2 = \sum_{i=1}^n \|A_{i,\cdot}\|_2^2.$$

Similarly, in terms of the columns $A_{\cdot,j}$, $j = 1, \ldots, m$, of $A$ we have

$$\|A\|_F^2 = \sum_{j=1}^m \|A_{\cdot,j}\|_2^2.$$
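Both identities are easy to verify numerically (a quick sketch, reusing the running example matrix $A$):

```python
import numpy as np
from numpy import linalg as LA

A = np.array([[2., 5.], [3., 6.], [1., 1.]])

fro = LA.norm(A)  # for 2d arrays, the default is the Frobenius norm
row_sum = sum(LA.norm(A[i, :])**2 for i in range(A.shape[0]))
col_sum = sum(LA.norm(A[:, j])**2 for j in range(A.shape[1]))

print(np.isclose(fro**2, row_sum))  # True
print(np.isclose(fro**2, col_sum))  # True
```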

For two matrices $A, B \in \mathbb{R}^{n \times m}$, the Frobenius norm of their difference $\|A - B\|_F$ can be interpreted as a distance between $A$ and $B$, that is, a measure of how dissimilar they are.

NUMERICAL CORNER: In Numpy, the Frobenius norm of a matrix can be computed using the
function numpy.linalg.norm .
In [8]:
A = np.array([[1., 0.],[0., 1.],[0., 0.]])

print(A)

[[1. 0.]

[0. 1.]

[0. 0.]]

In [9]:
LA.norm(A)

Out[9]: 1.4142135623730951

1.2.2 Differential calculus


Next, we review some basic concepts from differential calculus. We focus here on definitions
and results relevant to optimization theory, which plays a central role in data science.
1.2.2.1 Limits and continuity
Throughout this section, we use the Euclidean norm $\|x\| = \sqrt{\sum_{i=1}^d x_i^2}$ for $x = (x_1, \ldots, x_d)^T \in \mathbb{R}^d$.

The open $r$-ball around $x \in \mathbb{R}^d$ is the set of points within Euclidean distance $r$ of $x$, that is,

$$B_r(x) = \{ y \in \mathbb{R}^d : \|y - x\| < r \}.$$

A point $x \in \mathbb{R}^d$ is a limit point (or accumulation point) of a set $A \subseteq \mathbb{R}^d$ if every open ball around $x$ contains an element $a$ of $A$ such that $a \neq x$. A set $A$ is closed if every limit point of $A$ belongs to $A$.

A point $x \in \mathbb{R}^d$ is an interior point of a set $A \subseteq \mathbb{R}^d$ if there exists an $r > 0$ such that $B_r(x) \subseteq A$. A set $A$ is open if it consists entirely of interior points.

A set $A \subseteq \mathbb{R}^d$ is bounded if there exists an $r > 0$ such that $A \subseteq B_r(0)$, where $0 = (0, \ldots, 0)^T$.
Definition (Limits of a Function): Let $f : D \to \mathbb{R}$ be a real-valued function on $D \subseteq \mathbb{R}^d$. Then $f$ is said to have a limit $L \in \mathbb{R}$ as $x$ approaches $a$ if: for any $\varepsilon > 0$, there exists a $\delta > 0$ such that $|f(x) - L| < \varepsilon$ for all $x \in D \cap B_\delta(a) \setminus \{a\}$. This is written as

$$\lim_{x \to a} f(x) = L.$$

Note that we explicitly exclude $a$ itself from having to satisfy the condition $|f(x) - L| < \varepsilon$. In particular, we may have $f(a) \neq L$. We also do not restrict $a$ to be in $D$.

Definition (Continuous Function): Let $f : D \to \mathbb{R}$ be a real-valued function on $D \subseteq \mathbb{R}^d$. Then $f$ is said to be continuous at $a \in D$ if

$$\lim_{x \to a} f(x) = f(a).$$

We will not prove the following fundamental analysis result, which will be used repeatedly in this course. (See e.g. Wikipedia for a sketch of the proof.) Suppose $f : D \to \mathbb{R}$ is defined on a set $D \subseteq \mathbb{R}^d$. We say that $f$ attains a maximum value $M$ at $z^*$ if $f(z^*) = M$ and $M \geq f(x)$ for all $x \in D$. Similarly, we say $f$ attains a minimum value $m$ at $z_*$ if $f(z_*) = m$ and $m \leq f(x)$ for all $x \in D$.

Theorem (Extreme Value): Let $f : D \to \mathbb{R}$ be a real-valued, continuous function on a nonempty, closed, bounded set $D \subseteq \mathbb{R}^d$. Then $f$ attains a maximum and a minimum on $D$.

1.2.2.2 Derivatives
We move on to derivatives. Recall that the derivative of a function of a real variable is the rate of change of the function with respect to the change in the variable. Formally:

Definition (Derivative): Let $f : D \to \mathbb{R}$ where $D \subseteq \mathbb{R}$ and let $x_0 \in D$ be an interior point of $D$. The derivative of $f$ at $x_0$ is

$$f'(x_0) = \frac{df(x_0)}{dx} = \lim_{h \to 0} \frac{f(x_0 + h) - f(x_0)}{h}$$

provided the limit exists. ⊲

The following lemma encapsulates a key insight about the derivative of $f$ at $x_0$: it tells us where to find smaller values.

Lemma (Descent Direction): Let $f : D \to \mathbb{R}$ with $D \subseteq \mathbb{R}$ and let $x_0 \in D$ be an interior point of $D$ where $f'(x_0)$ exists. If $f'(x_0) > 0$, then there is an open ball $B_\delta(x_0) \subseteq D$ around $x_0$ such that for each $x$ in $B_\delta(x_0)$:

(a) $f(x) > f(x_0)$ if $x > x_0$,

(b) $f(x) < f(x_0)$ if $x < x_0$.

If instead $f'(x_0) < 0$, the opposite holds.

Proof idea: Follows from the definition of the derivative by taking $\varepsilon$ small enough that $f'(x_0) - \varepsilon > 0$.

Proof: Take $\varepsilon = f'(x_0)/2$. By definition of the derivative, there is $\delta > 0$ such that

$$\left| \frac{f(x_0 + h) - f(x_0)}{h} - f'(x_0) \right| < \varepsilon$$

for all $0 < h < \delta$. Rearranging gives

$$f(x_0 + h) > f(x_0) + [f'(x_0) - \varepsilon]\, h > f(x_0)$$

by our choice of $\varepsilon$. The other direction is similar. $\square$

One implication of the Descent Direction Lemma is the Mean Value Theorem, which will lead us later to Taylor's Theorem. First, an important special case:

Theorem (Rolle): Let $f : [a, b] \to \mathbb{R}$ be a continuous function and assume that its derivative exists on $(a, b)$. If $f(a) = f(b)$, then there is $a < c < b$ such that $f'(c) = 0$.

Proof idea: Look at an extremum and use the Descent Direction Lemma to get a contradiction.

Proof: If $f(x) = f(a)$ for all $x \in (a, b)$, then $f'(x) = 0$ on $(a, b)$ and we are done. So assume there is $y \in (a, b)$ such that $f(y) \neq f(a)$. Assume without loss of generality that $f(y) > f(a)$ (otherwise consider the function $-f$). By the Extreme Value Theorem, $f$ attains a maximum value at some $c \in [a, b]$. By our assumption, $a$ and $b$ cannot be the location of the maximum and it must be that $c \in (a, b)$.

We claim that $f'(c) = 0$. We argue by contradiction. Suppose $f'(c) > 0$. By the Descent Direction Lemma, there is a $\delta > 0$ such that $f(x) > f(c)$ for each $x \in B_\delta(c)$ with $x > c$, a contradiction. A similar argument holds if $f'(c) < 0$. That concludes the proof. $\square$

Theorem (Mean Value): Let $f : [a, b] \to \mathbb{R}$ be a continuous function and assume that its derivative exists on $(a, b)$. Then there is $a < c < b$ such that

$$f(b) = f(a) + (b - a) f'(c),$$

or put differently

$$\frac{f(b) - f(a)}{b - a} = f'(c).$$

Proof idea: Apply Rolle to

$$\phi(x) = f(x) - \left[ f(a) + \frac{f(b) - f(a)}{b - a}(x - a) \right].$$

Proof: Let $\phi(x) = f(x) - f(a) - \frac{f(b) - f(a)}{b - a}(x - a)$. Note that $\phi(a) = \phi(b) = 0$ and $\phi'(x) = f'(x) - \frac{f(b) - f(a)}{b - a}$ for all $x \in (a, b)$. Thus, by Rolle, there is $c \in (a, b)$ such that $\phi'(c) = 0$. That implies $f'(c) = \frac{f(b) - f(a)}{b - a}$, which gives the result. $\square$

We will also use Taylor's Theorem, a generalization of the Mean Value Theorem that provides a polynomial approximation to a function around a point. We will restrict ourselves to the case of a linear approximation with second-order error term, which will suffice for our purposes.

Theorem (Taylor): Let $f : D \to \mathbb{R}$ where $D \subseteq \mathbb{R}$. Suppose $f$ has a continuous derivative on $[a, b]$ and that its second derivative exists on $(a, b)$. Then for any $x \in [a, b]$

$$f(x) = f(a) + (x - a) f'(a) + \frac{1}{2} (x - a)^2 f''(\xi)$$

for some $a < \xi < x$.

Proof idea: The Mean Value Theorem implies that there is $a < \xi < x$ such that

$$f(x) = f(a) + (x - a) f'(\xi).$$

One way to think of the proof of that result is the following: we constructed an affine function that agrees with $f$ at $a$ and $x$, then used Rolle to express the coefficient of the linear term using $f'$. Here we do the same with a polynomial of degree $2$. But we now have an extra degree of freedom in choosing this polynomial. Because we are looking for a good approximation close to $a$, we choose to make the first derivative at $a$ also agree. Applying Rolle twice gives the claim.

Proof: Let

$$P(t) = \alpha_0 + \alpha_1 (t - a) + \alpha_2 (t - a)^2.$$

We choose the $\alpha_i$'s so that $P(a) = f(a)$, $P'(a) = f'(a)$, and $P(x) = f(x)$. The first two lead to the conditions

$$\alpha_0 = f(a), \qquad \alpha_1 = f'(a).$$

Let $\phi(t) = f(t) - P(t)$. By construction $\phi(a) = \phi(x) = 0$. By Rolle, there is $\xi' \in (a, x)$ such that $\phi'(\xi') = 0$. Moreover, $\phi'(a) = 0$. Hence we can apply Rolle again - this time to $\phi'$ on $[a, \xi']$. It implies that there is $\xi \in (a, \xi')$ such that $\phi''(\xi) = 0$.

The second derivative of $\phi$ at $\xi$ is

$$0 = \phi''(\xi) = f''(\xi) - P''(\xi) = f''(\xi) - 2\alpha_2,$$

so $\alpha_2 = f''(\xi)/2$. Plugging into $P$ and using $\phi(x) = 0$ gives the claim. $\square$

1.2.2.3 Optimization
As we mentioned before, optimization problems play a central role in data science. Here we look at unconstrained optimization problems, that is, problems of the form:

$$\min_{x \in \mathbb{R}^d} f(x)$$

where $f : \mathbb{R}^d \to \mathbb{R}$.

Ideally, we would like to find a global minimizer to the optimization problem above.

Definition (Global Minimizer): Let $f : \mathbb{R}^d \to \mathbb{R}$. The point $x^* \in \mathbb{R}^d$ is a global minimizer of $f$ over $\mathbb{R}^d$ if

$$f(x) \geq f(x^*), \qquad \forall x \in \mathbb{R}^d.$$

NUMERICAL CORNER: The function $f(x) = x^2$ over $\mathbb{R}$ has a global minimizer at $x^* = 0$. Indeed, we clearly have $f(x) \geq 0$ for all $x$ while $f(0) = 0$. To plot the function, we use the matplotlib package, and specifically its function matplotlib.pyplot.plot . We also use the function numpy.linspace to create an array of evenly spaced numbers where we evaluate $f$.

In [10]:
import matplotlib.pyplot as plt

In [11]:
x = np.linspace(-2,2,100)

y = x ** 2

plt.plot(x,y)

plt.show()

The function $f(x) = e^x$ over $\mathbb{R}$ does not have a global minimizer. Indeed, $f(x) > 0$ but no $x$ achieves $0$. And, for any $m > 0$, there is $x$ small enough such that $f(x) < m$. Note that $\mathbb{R}$ is not bounded, therefore the Extreme Value Theorem does not apply here.
In [12]:
x = np.linspace(-2,2,100)

y = np.exp(x)

plt.plot(x,y)

plt.ylim(0,5)

plt.show()

The function $f(x) = (x + 1)^2 (x - 1)^2$ over $\mathbb{R}$ has two global minimizers at $x^* = -1$ and $x^{**} = 1$. Indeed, $f(x) \geq 0$ and $f(x) = 0$ if and only if $x = x^*$ or $x = x^{**}$.

In [13]:
x = np.linspace(-2,2,100)

y = ((x+1)**2) * ((x-1)**2)

plt.plot(x,y)

plt.ylim(0,5)

plt.show()

In general, finding a global minimizer and certifying that one has been found can be difficult unless some special structure is present. Therefore weaker notions of solution have been introduced.

Definition (Local Minimizer): Let $f : \mathbb{R}^d \to \mathbb{R}$. The point $x^* \in \mathbb{R}^d$ is a local minimizer of $f$ over $\mathbb{R}^d$ if there is $\delta > 0$ such that

$$f(x) \geq f(x^*), \qquad \forall x \in B_\delta(x^*) \setminus \{x^*\}.$$

If the inequality is strict, we say that $x^*$ is a strict local minimizer.

In words, $x^*$ is a local minimizer if there is an open ball around $x^*$ where it attains the minimum value. The difference between global and local minimizers is illustrated in the next figure.

Local minimizers can be characterized in terms of derivatives (or more generally, gradients, their generalization to functions of several variables; we come back to them later in the course). More precisely, they provide a necessary condition.

Theorem (First-Order Necessary Condition): Let $f : \mathbb{R} \to \mathbb{R}$ be continuously differentiable on $\mathbb{R}$. If $x_0$ is a local minimizer, then $f'(x_0) = 0$.

Proof: We argue by contradiction. Suppose that $f'(x_0) \neq 0$. Say $f'(x_0) > 0$ (the other case being similar). By the Descent Direction Lemma, there is a $\delta > 0$ such that, for each $x$ in $B_\delta(x_0)$, $f(x) < f(x_0)$ if $x < x_0$. So every open ball around $x_0$ has a point achieving a smaller value than $f(x_0)$. Thus $x_0$ is not a local minimizer, a contradiction. So it must be that $f'(x_0) = 0$. $\square$
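The condition is easy to check numerically with a finite difference (a quick sketch; the central-difference step size is an illustrative choice): at the global minimizer $x^* = 0$ of $f(x) = x^2$, the estimated derivative vanishes.

```python
# Central finite-difference estimate of f'(x) at the minimizer x* = 0 of f(x) = x^2
f = lambda x: x**2
h = 1e-6
deriv_at_min = (f(0 + h) - f(0 - h)) / (2 * h)
print(abs(deriv_at_min) < 1e-6)  # True: the derivative vanishes at the minimizer
```

Remember the condition is only necessary: $f(x) = x^3$ also has $f'(0) = 0$, yet $0$ is not a local minimizer.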

1.2.3 Probability
Finally, we review a few key definitions and results from probability theory. Further concepts will
be reviewed later in the course.
1.2.3.1 Expectation, variance and Chebyshev's inequality
Recall that the expectation (or mean) of a function $h$ of a discrete random variable $X$ taking values in $\mathcal{X}$ is given by

$$\mathbb{E}[h(X)] = \sum_{x \in \mathcal{X}} h(x)\, p_X(x)$$

where $p_X(x) = \mathbb{P}[X = x]$ is the probability mass function (PMF) of $X$. In the continuous case, we have

$$\mathbb{E}[h(X)] = \int h(x) f_X(x)\, dx$$

if $f_X$ is the probability density function (PDF) of $X$.

These definitions extend to functions of multiple variables by using instead the joint PMF or PDF. We sometimes denote the expectation of $X$ by $\mu_X$.

Two key properties of the expectation are:

- linearity, that is, $\mathbb{E}[\alpha_1 h_1(X) + \alpha_2 h_2(Y) + \beta] = \alpha_1 \mathbb{E}[h_1(X)] + \alpha_2 \mathbb{E}[h_2(Y)] + \beta$

- monotonicity, that is, if $h_1(x) \leq h_2(x)$ for all $x$ then $\mathbb{E}[h_1(X)] \leq \mathbb{E}[h_2(X)]$.

The variance of a real-valued random variable $X$ is

$$\mathrm{Var}[X] = \mathbb{E}\left[ (X - \mathbb{E}[X])^2 \right]$$

and its standard deviation is $\sigma_X = \sqrt{\mathrm{Var}[X]}$. The variance does not satisfy linearity, but we have the following property

$$\mathrm{Var}[\alpha X + \beta] = \alpha^2\, \mathrm{Var}[X].$$

The standard deviation is a measure of the typical deviation of $X$ around its mean, that is, of the spread of the distribution. A quantitative version of this statement is given by Chebyshev's inequality.

Theorem (Chebyshev): For a random variable $X$ with finite variance, we have for any $\alpha > 0$

$$\mathbb{P}[|X - \mathbb{E}[X]| \geq \alpha] \leq \frac{\mathrm{Var}[X]}{\alpha^2} = \left( \frac{\sigma_X}{\alpha} \right)^2.$$
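The bound can be checked by simulation (a sketch, not part of the original notes; the uniform distribution and the value of $\alpha$ are illustrative choices). For $X \sim \mathrm{U}[0,1]$ we have $\mathbb{E}[X] = 1/2$ and $\mathrm{Var}[X] = 1/12$:

```python
import numpy as np

rng = np.random.default_rng(12345)

samples = rng.random(100_000)  # X ~ U[0,1]
alpha = 0.4
empirical = np.mean(np.abs(samples - 0.5) >= alpha)  # empirical tail probability
bound = (1 / 12) / alpha**2                          # Chebyshev bound
print(empirical, "<=", bound)
```

Here the true tail probability is $0.2$ while the bound is about $0.52$: Chebyshev is valid but often loose.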

The intuition is the following: if the expected squared deviation from the mean is small, then the absolute deviation from the mean is unlikely to be large.

To formalize this we prove a more general inequality, Markov's inequality. In words, if a non-negative random variable has a small expectation then it is unlikely to be large.

Lemma (Markov): Let $Z$ be a non-negative random variable with finite expectation. Then, for any $\beta > 0$,

$$\mathbb{P}[Z \geq \beta] \leq \frac{\mathbb{E}[Z]}{\beta}.$$

Proof idea: The quantity $\beta\, \mathbb{P}[Z \geq \beta]$ is a lower bound on the expectation of $Z$ restricted to the range $\{Z \geq \beta\}$, which by non-negativity is itself lower bounded by $\mathbb{E}[Z]$.

Proof: Formally, let $\mathbf{1}_A$ be the indicator of the event $A$, that is, the random variable that is $1$ when $A$ occurs and $0$ otherwise. By definition, the expectation of $\mathbf{1}_A$ is

$$\mathbb{E}[\mathbf{1}_A] = 0\, \mathbb{P}[\mathbf{1}_A = 0] + 1\, \mathbb{P}[\mathbf{1}_A = 1] = \mathbb{P}[A].$$

Hence, by linearity and monotonicity,

$$\beta\, \mathbb{P}[Z \geq \beta] = \beta\, \mathbb{E}[\mathbf{1}_{Z \geq \beta}] = \mathbb{E}[\beta \mathbf{1}_{Z \geq \beta}] \leq \mathbb{E}[Z].$$

Rearranging gives the claim. $\square$

Finally we return to the proof of Chebyshev.

Proof idea (Chebyshev): Simply apply Markov to the squared deviation of $X$ from its mean.

Proof (Chebyshev): Let $Z = (X - \mathbb{E}[X])^2$, which is non-negative by definition. Hence, by Markov, for $\beta = \alpha^2 > 0$,

$$\mathbb{P}[|X - \mathbb{E}[X]| \geq \alpha] = \mathbb{P}[(X - \mathbb{E}[X])^2 \geq \alpha^2] = \mathbb{P}[Z \geq \beta] \leq \frac{\mathbb{E}[Z]}{\beta} = \frac{\mathrm{Var}[X]}{\alpha^2},$$

where we used the definition of the variance in the last equality. $\square$

A few important remarks about Chebyshev's inequality:

(1) We sometimes need a one-sided bound of the form

$$\mathbb{P}[X - \mathbb{E}[X] \geq \alpha].$$

Note the absence of absolute values compared to the two-sided form appearing in Chebyshev's inequality. In this case, we can use the fact that the event $\{X - \mathbb{E}[X] \geq \alpha\}$ implies a fortiori that $\{|X - \mathbb{E}[X]| \geq \alpha\}$, so that the probability of the former is smaller than that of the latter by monotonicity, namely,

$$\mathbb{P}[X - \mathbb{E}[X] \geq \alpha] \leq \mathbb{P}[|X - \mathbb{E}[X]| \geq \alpha].$$

We can then use Chebyshev's inequality on the right-hand side to obtain

$$\mathbb{P}[X - \mathbb{E}[X] \geq \alpha] \leq \frac{\mathrm{Var}[X]}{\alpha^2}.$$

Similarly, for the same reasons, we also have

$$\mathbb{P}[X - \mathbb{E}[X] \leq -\alpha] \leq \frac{\mathrm{Var}[X]}{\alpha^2}.$$

(2) In terms of the standard deviation $\sigma_X = \sqrt{\mathrm{Var}[X]}$ of $X$, the inequality can be re-written as

$$\mathbb{P}[|X - \mathbb{E}[X]| \geq \alpha] \leq \frac{\mathrm{Var}[X]}{\alpha^2} = \left( \frac{\sigma_X}{\alpha} \right)^2.$$

So to get a small bound on the right-hand side, one needs the deviation $\alpha$ from the mean to be significantly larger than the standard deviation. In words, a random variable is unlikely to be away from its mean by much more than its standard deviation. This observation is consistent with the interpretation of the standard deviation as the typical spread of a random variable.

Chebyshev's inequality is particularly useful when combined with independence, as we recall next.
next.
1.2.3.2 Independence and limit theorems
Recall that discrete random variables $X$ and $Y$ are independent if their joint PMF factorizes, that is

$$p_{X,Y}(x, y) = p_X(x)\, p_Y(y), \qquad \forall x, y,$$

where $p_{X,Y}(x, y) = \mathbb{P}[X = x, Y = y]$. Similarly, continuous random variables $X$ and $Y$ are independent if their joint PDF factorizes. One consequence is that expectations of products of single-variable functions factorize as well, that is, for functions $g$ and $h$ we have

$$\mathbb{E}[g(X) h(Y)] = \mathbb{E}[g(X)]\, \mathbb{E}[h(Y)],$$

provided the expectations exist.

An important way to quantify the lack of independence of two random variables is the covariance.

Definition (Covariance): The covariance of random variables $X$ and $Y$ with finite means and variances is defined as

$$\mathrm{Cov}[X, Y] = \mathbb{E}\left[ (X - \mathbb{E}[X])(Y - \mathbb{E}[Y]) \right].$$

Note that, by definition, the covariance is symmetric: $\mathrm{Cov}[X, Y] = \mathrm{Cov}[Y, X]$.

When $X$ and $Y$ are independent, their covariance is $0$:

$$\begin{aligned} \mathrm{Cov}[X, Y] &= \mathbb{E}\left[ (X - \mathbb{E}[X])(Y - \mathbb{E}[Y]) \right] \\ &= \mathbb{E}\left[ X - \mathbb{E}[X] \right] \mathbb{E}\left[ Y - \mathbb{E}[Y] \right] \\ &= (\mathbb{E}[X] - \mathbb{E}[X])\, (\mathbb{E}[Y] - \mathbb{E}[Y]) \\ &= 0, \end{aligned}$$

where we used independence on the second line and the linearity of expectations on the third one.
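A quick simulation illustrates this (a sketch, not from the original notes; the sample size and tolerance are illustrative): the sample covariance of two independently generated sequences is close to $0$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample covariance of two independent U[0,1] sequences
x = rng.random(100_000)
y = rng.random(100_000)
cov = np.mean((x - x.mean()) * (y - y.mean()))
print(abs(cov) < 0.01)  # close to 0, as independence predicts
```

The converse fails in general: zero covariance does not imply independence.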
The covariance leads to a useful identity for the variance of a sum of random variables.

Lemma (Variance of a Sum): Let $X_1, \ldots, X_n$ be random variables with finite means and variances. Then we have

$$\mathrm{Var}[X_1 + \cdots + X_n] = \sum_{i=1}^n \mathrm{Var}[X_i] + 2 \sum_{i < j} \mathrm{Cov}[X_i, X_j].$$

Proof: By definition of the variance and linearity of expectations,

$$\begin{aligned} \mathrm{Var}[X_1 + \cdots + X_n] &= \mathbb{E}\left[ \left( X_1 + \cdots + X_n - \mathbb{E}[X_1 + \cdots + X_n] \right)^2 \right] \\ &= \mathbb{E}\left[ \left( X_1 + \cdots + X_n - \mathbb{E}[X_1] - \cdots - \mathbb{E}[X_n] \right)^2 \right] \\ &= \mathbb{E}\left[ \left( (X_1 - \mathbb{E}[X_1]) + \cdots + (X_n - \mathbb{E}[X_n]) \right)^2 \right] \\ &= \sum_{i=1}^n \mathbb{E}\left[ (X_i - \mathbb{E}[X_i])^2 \right] + \sum_{i \neq j} \mathbb{E}\left[ (X_i - \mathbb{E}[X_i])(X_j - \mathbb{E}[X_j]) \right]. \end{aligned}$$

The claim follows from the definition of the variance and covariance, and the symmetry of the covariance. $\square$

The previous lemma has the following important implication. If $X_1, \ldots, X_n$ are pairwise independent, real-valued random variables, then

$$\mathrm{Var}[X_1 + \cdots + X_n] = \mathrm{Var}[X_1] + \cdots + \mathrm{Var}[X_n].$$

Notice that, unlike the case of the expectation, this linearity property for the variance requires independence.
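The additivity of the variance under independence can be checked by simulation (a sketch; the distribution, $n$, and number of repetitions are illustrative choices). Each $\mathrm{U}[0,1]$ variable has variance $1/12$, so a sum of $n$ independent ones should have variance close to $n/12$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Empirical variance of a sum of n independent U[0,1] variables
n, reps = 5, 200_000
X = rng.random((reps, n))        # each row: n independent U[0,1] draws
var_sum = X.sum(axis=1).var()    # empirical variance of the sum
print(var_sum, "vs", n / 12)     # should be close to n * Var[X_1] = 5/12
```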
Applied to the sample mean of $n$ independent, identically distributed (i.i.d.) random variables $X_1, \ldots, X_n$, we obtain

$$\mathrm{Var}\left[ \frac{1}{n} \sum_{i=1}^n X_i \right] = \frac{1}{n^2} \sum_{i=1}^n \mathrm{Var}[X_i] = \frac{1}{n^2}\, n\, \mathrm{Var}[X_1] = \frac{\mathrm{Var}[X_1]}{n}.$$

So the variance of the sample mean decreases as $n$ gets large, while its expectation remains the same by linearity

$$\mathbb{E}\left[ \frac{1}{n} \sum_{i=1}^n X_i \right] = \frac{1}{n} \sum_{i=1}^n \mathbb{E}[X_i] = \frac{1}{n}\, n\, \mathbb{E}[X_1] = \mathbb{E}[X_1].$$

Together with Chebyshev's inequality, we immediately get that the sample mean approaches its expectation in the following probabilistic sense.
Theorem (Weak Law of Large Numbers): Let $X_1, \ldots, X_n$ be i.i.d. For any $\varepsilon > 0$, as $n \to +\infty$,

$$\mathbb{P}\left[ \left| \frac{1}{n} \sum_{i=1}^n X_i - \mathbb{E}[X_1] \right| \geq \varepsilon \right] \to 0.$$

Proof: By Chebyshev and the formulas above,

$$\mathbb{P}\left[ \left| \frac{1}{n} \sum_{i=1}^n X_i - \mathbb{E}[X_1] \right| \geq \varepsilon \right] = \mathbb{P}\left[ \left| \frac{1}{n} \sum_{i=1}^n X_i - \mathbb{E}\left[ \frac{1}{n} \sum_{i=1}^n X_i \right] \right| \geq \varepsilon \right] \leq \frac{\mathrm{Var}\left[ \frac{1}{n} \sum_{i=1}^n X_i \right]}{\varepsilon^2} = \frac{\mathrm{Var}[X_1]}{n \varepsilon^2} \to 0$$

as $n \to +\infty$. $\square$
NUMERICAL CORNER: We can use simulations to confirm the Weak Law of Large Numbers. Recall that a uniform random variable over the interval $[a, b]$ has density

$$f_X(x) = \begin{cases} \frac{1}{b - a} & x \in [a, b] \\ 0 & \text{o.w.} \end{cases}$$

We write $X \sim \mathrm{U}[a, b]$. We can obtain a sample from $\mathrm{U}[0, 1]$ by using the numpy.random module in Numpy.
In [14]:
from numpy.random import default_rng
rng = default_rng(12345)

In [15]:
rng.random()

Out[15]: 0.22733602246716966

Now we take $n$ samples from $\mathrm{U}[0, 1]$ and compute their sample mean. We repeat $k$ times and display the empirical distribution of the sample means using a histogram.


In [16]:
def lln_unif(n, k):

sample_mean = [np.mean(rng.random(n)) for i in range(k)]

plt.hist(sample_mean,bins=15)

plt.xlim(0,1)

plt.title(f'n={n}')

plt.show()

In [17]:
lln_unif(10, 1000)

Taking much larger $n$ leads to more concentration around the mean.

In [18]:
lln_unif(100, 1000)

1.2.3.3 Normal distribution

Recall that a standard Normal variable $X$ has PDF

$$f_X(x) = \frac{1}{\sqrt{2\pi}} \exp(-x^2/2).$$

Its mean is $0$ and its variance is $1$.

To construct a $d$-dimensional version, we take $d$ independent standard Normal variables $X_1, X_2, \ldots, X_d$ and form the vector $X = (X_1, \ldots, X_d)$. We will say that $X$ is a standard Normal $d$-vector. By independence, its joint PDF is given by the product of the PDFs of the $X_i$'s, that is,

$$\begin{aligned} f_X(x) &= \prod_{i=1}^d \frac{1}{\sqrt{2\pi}} \exp(-x_i^2/2) \\ &= \frac{1}{(2\pi)^{d/2}} \exp\left( - \sum_{i=1}^d x_i^2 / 2 \right) \\ &= \frac{1}{(2\pi)^{d/2}} \exp(-\|x\|^2/2). \end{aligned}$$
We can also shift and scale it.

Definition (Spherical Gaussian): Let $Z$ be a standard Normal $d$-vector, let $\mu \in \mathbb{R}^d$ and let $\sigma \in \mathbb{R}_+$. Then we will refer to the transformed random variable $X = \mu + \sigma Z$ as a spherical Gaussian with mean $\mu$ and variance $\sigma^2$. We use the notation $X \sim N_d(\mu, \sigma^2 I)$. ⊲

NUMERICAL CORNER: The following function generates $n$ data points from a spherical $d$-dimensional Gaussian with variance $1$ and mean $w e_1$. We will use it later in the chapter to simulate interesting datasets.

Below, rng.normal(0,1,d) generates a $d$-dimensional spherical Gaussian with mean $0$. We use the function numpy.concatenate to create a vector by concatenating two given vectors. We use [w] to create a vector with a single entry w . We also use the function numpy.zeros to create an all-zero vector.

In [19]:
def one_cluster(d, n, w):

    X = np.stack([np.concatenate(([w], np.zeros(d-1))) + rng.normal(0,1,d) for _ in range(n)])

    return X

We generate 100 data points in dimension d = 2 .


In [20]:
d, n, w = 2, 100, 3.

X = one_cluster(d, n, w)

plt.scatter(X[:,0], X[:,1])

plt.show()
