

Foundations of Probabilities
and Information Theory
for Machine Learning

Random Variables
Some proofs

E[X + Y] = E[X] + E[Y],

where X and Y are random variables of the same type (i.e., either discrete or continuous).

The discrete case:

E[X + Y] = Σ_{ω∈Ω} (X(ω) + Y(ω)) · P(ω)
         = Σ_ω X(ω) · P(ω) + Σ_ω Y(ω) · P(ω) = E[X] + E[Y]

The continuous case:

E[X + Y] = ∫_x ∫_y (x + y) p_{XY}(x, y) dy dx
         = ∫_x ∫_y x p_{XY}(x, y) dy dx + ∫_x ∫_y y p_{XY}(x, y) dy dx
         = ∫_x x (∫_y p_{XY}(x, y) dy) dx + ∫_y y (∫_x p_{XY}(x, y) dx) dy
         = ∫_x x p_X(x) dx + ∫_y y p_Y(y) dy = E[X] + E[Y]
X and Y are independent ⇒ E[XY] = E[X] · E[Y],

X and Y being random variables of the same type (i.e., either discrete or continuous).

The discrete case:

E[XY] = Σ_{x∈Val(X)} Σ_{y∈Val(Y)} x y P(X = x, Y = y) = Σ_{x∈Val(X)} Σ_{y∈Val(Y)} x y P(X = x) · P(Y = y)
      = Σ_{x∈Val(X)} x P(X = x) (Σ_{y∈Val(Y)} y P(Y = y)) = Σ_{x∈Val(X)} x P(X = x) E[Y] = E[X] · E[Y]

The continuous case:

E[XY] = ∫_x ∫_y x y p(X = x, Y = y) dy dx = ∫_x ∫_y x y p(X = x) · p(Y = y) dy dx
      = ∫_x x p(X = x) (∫_y y p(Y = y) dy) dx = ∫_x x p(X = x) E[Y] dx
      = E[Y] · ∫_x x p(X = x) dx = E[X] · E[Y]
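A quick numerical sanity check of both identities (not part of the original proofs; a minimal sketch assuming Python with numpy, with arbitrary illustrative distributions):

    # Monte Carlo check: E[X+Y] = E[X] + E[Y] always, and E[XY] = E[X]·E[Y]
    # when X and Y are independent.
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.exponential(scale=2.0, size=1_000_000)      # E[X] = 2
    y = rng.normal(loc=3.0, scale=1.0, size=1_000_000)  # E[Y] = 3, drawn independently of X

    print(np.mean(x + y), np.mean(x) + np.mean(y))      # both ≈ 5
    print(np.mean(x * y), np.mean(x) * np.mean(y))      # both ≈ 6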

Discrete random variables:


independence, conditional independence, conditional probabilities

CMU, 2015 spring, T. Mitchell, N. Balcan, HW2, pr. 1.d



Let X, Y , and Z be random variables taking values in {0, 1}.


The following table lists the probability of each possible assignment of 0 and 1 to the variables X, Y, and Z:

                 Z = 0                 Z = 1
            X = 0    X = 1        X = 0    X = 1
   Y = 0     1/15     1/15         4/15     2/15
   Y = 1     1/10     1/10         8/45     4/45

For example,
P (X = 0, Y = 1, Z = 0) = 1/10 and P (X = 1, Y = 1, Z = 1) = 4/45.
a. Is X independent of Y ? Why or why not?
b. Is X conditionally independent of Y given Z? Why or why not?
c. Calculate P (X = 0 | X + Y > 0).

Answer
a. No.

P (X = 0) = 1/15 + 1/10 + 4/15 + 8/45 = 11/18,


P (Y = 0) = 1/15 + 1/15 + 4/15 + 2/15 = 8/15,

and
P(X = 0|Y = 0) = P(X = 0, Y = 0) / P(Y = 0) = (1/15 + 4/15) / (8/15) = 5/8.
Since P (X = 0) does not equal P (X = 0|Y = 0), X is not independent of Y .
b. For all pairs y, z ∈ {0, 1}, we need to check that P(X = 0|Y = y, Z = z) = P(X = 0|Z = z).
That the other probabilities are equal follows from the law of total probability.

P(X = 0|Y = 0, Z = 0) = (1/15) / (1/15 + 1/15) = 1/2
P(X = 0|Y = 1, Z = 0) = (1/10) / (1/10 + 1/10) = 1/2
P(X = 0|Y = 0, Z = 1) = (4/15) / (4/15 + 2/15) = 2/3
P(X = 0|Y = 1, Z = 1) = (8/45) / (8/45 + 4/45) = 2/3

and

P(X = 0|Z = 0) = (1/15 + 1/10) / (1/15 + 1/15 + 1/10 + 1/10) = 1/2
P(X = 0|Z = 1) = (4/15 + 8/45) / (4/15 + 2/15 + 8/45 + 4/45) = 2/3.

This shows that X is independent of Y given Z.


c.

P(X = 0 | X + Y > 0) = (1/10 + 8/45) / (1/15 + 1/10 + 1/10 + 2/15 + 4/45 + 8/45) = 5/12.
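All three answers can be re-derived mechanically from the joint table; a minimal sketch in Python using only the standard library (the helper p below is ours, not part of the original problem):

    # Recompute parts a-c from the joint probability table with exact fractions.
    from fractions import Fraction as F

    # P[z][y][x] = P(X=x, Y=y, Z=z), copied from the table
    P = {0: {0: {0: F(1,15), 1: F(1,15)}, 1: {0: F(1,10), 1: F(1,10)}},
         1: {0: {0: F(4,15), 1: F(2,15)}, 1: {0: F(8,45), 1: F(4,45)}}}

    def p(x=None, y=None, z=None):
        # marginal/joint probability, summing over the unconstrained variables
        return sum(P[zz][yy][xx]
                   for zz in (0, 1) for yy in (0, 1) for xx in (0, 1)
                   if (x is None or xx == x) and (y is None or yy == y) and (z is None or zz == z))

    print(p(x=0), p(x=0, y=0) / p(y=0))                          # 11/18 vs 5/8  -> a. not independent
    print(p(x=0, y=0, z=1) / p(y=0, z=1), p(x=0, z=1) / p(z=1))  # both 2/3      -> b. cond. independent
    print(p(x=0, y=1) / (1 - p(x=0, y=0)))                       # 5/12          -> c.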

The correlation coefficient of two random variables:


two properties
Sheldon Ross
A First Course in Probability, 5th ed., Prentice Hall, 1997, page 332

For any two random variables X and Y with Var(X) ≠ 0 and Var(Y) ≠ 0, the correlation coefficient is defined as:

ρ(X, Y) =def Cov(X, Y) / √(Var(X) · Var(Y)).

a. Prove that −1 ≤ ρ(X, Y) ≤ 1.

Consequence: Cov(X, Y) ∈ [−√(Var(X) Var(Y)), +√(Var(X) Var(Y))], if Var(X) ≠ 0 and Var(Y) ≠ 0.

b. Show that if ρ(X, Y) = 1, then Y = aX + b, with a = √(Var(Y)/Var(X)) > 0.
Similarly, if ρ(X, Y) = −1, then Y = aX + b, with a = −√(Var(Y)/Var(X)) < 0.
Remark: Thus, the correlation coefficient is a "measure" of the degree of "linear dependence" between X and Y.
Hints
1. For point a, denoting σ_X = √(Var(X)) and σ_Y = √(Var(Y)), to prove the inequality ρ(X, Y) ≥ −1 we suggest you expand the expression Var(X/σ_X + Y/σ_Y), using the following two properties, valid for any random variables X and Y:

Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y)

and

Cov(aX, bY) = ab Cov(X, Y), for any a, b ∈ R.

Then proceed similarly, expanding the expression Var(X/σ_X − Y/σ_Y), to prove the inequality ρ(X, Y) ≤ 1.

2. For point b, use the fact that for any random variable X, we have Var(X) = 0 if and only if X is constant.
(More precisely, there exists c ∈ R such that P(X = c) = 1, where P is the probability distribution considered when defining the variables in the problem statement.)
Answer
a. To prove the inequality ρ(X, Y) ≥ −1, we proceed according to Hint 1:

Var(X/σ_X + Y/σ_Y) = Var(X/σ_X) + Var(Y/σ_Y) + 2 Cov(X/σ_X, Y/σ_Y)
                   = (1/σ_X²) Var(X) + (1/σ_Y²) Var(Y) + (2/(σ_X σ_Y)) Cov(X, Y)
                   = 1 + 1 + 2ρ(X, Y) = 2[1 + ρ(X, Y)].

Since Var(X) ≥ 0 for any random variable X, it follows that Var(X/σ_X + Y/σ_Y) ≥ 0, hence 1 + ρ(X, Y) ≥ 0, i.e., ρ(X, Y) ≥ −1.

Similarly, we can show that

Var(X/σ_X − Y/σ_Y) = 2[1 − ρ(X, Y)] ≥ 0,

hence ρ(X, Y) ≤ 1.
b. If ρ(X, Y) = −1, then from the first computation at point a it follows that Var(X/σ_X + Y/σ_Y) = 0. We know this happens if X/σ_X + Y/σ_Y is a constant random variable, i.e., more precisely, there exists a′ ∈ R such that X/σ_X + Y/σ_Y = a′ with probability 1. Consequently, we can write Y = a′σ_Y − (σ_Y/σ_X) X. It follows that Y = aX + b, where a = −σ_Y/σ_X < 0 and b = a′σ_Y.

Similarly, if ρ(X, Y) = 1, then from the second computation at point a it follows that Var(X/σ_X − Y/σ_Y) = 0, so there exists a″ ∈ R such that X/σ_X − Y/σ_Y = a″ with probability 1. Thus, Y = −a″σ_Y + (σ_Y/σ_X) X. Renaming, we obtain Y = aX + b, with a = σ_Y/σ_X > 0 and b = −a″σ_Y.
Probability Distributions
Some properties

Binomial distribution: b(r; n, p) =def C_n^r p^r (1 − p)^{n−r}

Significance: b(r; n, p) is the probability of drawing r heads in n independent flips of a coin having the head probability p.

b(r; n, p) indeed represents a probability distribution:

• b(r; n, p) = C_n^r p^r (1 − p)^{n−r} ≥ 0 for all p ∈ [0, 1], n ∈ N and r ∈ {0, 1, . . . , n},
• Σ_{r=0}^n b(r; n, p) = 1:
  (1 − p)^n + C_n^1 p(1 − p)^{n−1} + · · · + C_n^{n−1} p^{n−1}(1 − p) + p^n = [p + (1 − p)]^n = 1
Binomial distribution: calculating the mean

E[b(r; n, p)] =def Σ_{r=0}^n r · b(r; n, p)
  = 1 · C_n^1 p(1 − p)^{n−1} + 2 · C_n^2 p²(1 − p)^{n−2} + · · · + (n − 1) · C_n^{n−1} p^{n−1}(1 − p) + n · p^n
  = p [C_n^1 (1 − p)^{n−1} + 2 · C_n^2 p(1 − p)^{n−2} + · · · + (n − 1) · C_n^{n−1} p^{n−2}(1 − p) + n · p^{n−1}]
  = np [(1 − p)^{n−1} + C_{n−1}^1 p(1 − p)^{n−2} + · · · + C_{n−1}^{n−2} p^{n−2}(1 − p) + C_{n−1}^{n−1} p^{n−1}]   (1)
  = np [p + (1 − p)]^{n−1} = np   (2)

For the (1) equality we used the following property:

k C_n^k = k · n!/(k! (n − k)!) = n!/((k − 1)! (n − k)!) = n · (n − 1)!/((k − 1)! ((n − 1) − (k − 1))!)
        = n C_{n−1}^{k−1}, ∀k = 1, . . . , n.
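A direct numerical check of equality (2) (a sketch, assuming Python with scipy; the values n = 20, p = 0.3 are arbitrary):

    # Check E[X] = np by summing r·b(r; n, p) over the whole support.
    from scipy.stats import binom
    n, p = 20, 0.3
    mean = sum(r * binom.pmf(r, n, p) for r in range(n + 1))
    print(mean, n * p)   # both ≈ 6.0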
Binomial distribution: calculating the variance

following www.proofwiki.org/wiki/Variance of Binomial Distribution, which cites
"Probability: An Introduction", by Geoffrey Grimmett and Dominic Welsh,
Oxford Science Publications, 1986

We will make use of the formula Var[X] = E[X²] − E²[X].

By denoting q = 1 − p, it follows:

E[b²(r; n, p)] =def Σ_{r=0}^n r² C_n^r p^r q^{n−r} = Σ_{r=0}^n r² (n(n − 1) . . . (n − r + 1)/r!) p^r q^{n−r}
  = Σ_{r=1}^n r n ((n − 1) . . . (n − r + 1)/(r − 1)!) p^r q^{n−r} = Σ_{r=1}^n r n C_{n−1}^{r−1} p^r q^{n−r}
  = np Σ_{r=1}^n r C_{n−1}^{r−1} p^{r−1} q^{(n−1)−(r−1)}
Binomial distribution: calculating the variance (cont'd)

By denoting j = r − 1 and m = n − 1, we'll get:

E[b²(r; n, p)] = np Σ_{j=0}^m (j + 1) C_m^j p^j q^{m−j}
  = np [Σ_{j=0}^m j C_m^j p^j q^{m−j} + Σ_{j=0}^m C_m^j p^j q^{m−j}],

where the first sum is E[b(j; n − 1, p)] = (n − 1)p, cf. (2), and the second sum is 1.

Therefore,

E[b²(r; n, p)] = np[(n − 1)p + 1] = n²p² − np² + np.

Finally,

Var[X] = E[b²(r; n, p)] − (E[b(r; n, p)])² = n²p² − np² + np − n²p² = np(1 − p)


Binomial distribution: calculating the variance

Another solution

• it is relatively easy to show that any random variable following the binomial distribution b(r; n, p) can be seen as a sum of n independent variables following the Bernoulli distribution of parameter p; (a)
• we know (or it can be proven immediately) that the variance of the Bernoulli distribution of parameter p is p(1 − p);
• using the additivity property of variance for independent variables (Var[X₁ + X₂ + . . . + Xₙ] = Var[X₁] + Var[X₂] + . . . + Var[Xₙ], if X₁, X₂, . . . , Xₙ are independent), it follows that Var[X] = np(1 − p). A simulation sketch of this view is given below.

(a) See www.proofwiki.org/wiki/Bernoulli Process as Binomial Distribution, which also cites as source "Probability: An Introduction" by Geoffrey Grimmett and Dominic Welsh, Oxford Science Publications, 1986.
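The sketch below (assuming Python with numpy; the parameters are illustrative) simulates exactly this sum-of-Bernoullis view:

    # A Binomial(n, p) draw as the sum of n independent Bernoulli(p) draws.
    import numpy as np
    rng = np.random.default_rng(0)
    n, p = 20, 0.3
    samples = rng.binomial(1, p, size=(500_000, n)).sum(axis=1)
    print(samples.mean(), n * p)            # ≈ 6.0
    print(samples.var(), n * p * (1 - p))   # ≈ 4.2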

The categorical distribution:


Computing probabilities and expectations

CMU, 2009 fall, Geoff Gordon, HW1, pr. 4


Suppose we have n bins and m balls. We throw balls into bins independently at random, so that each ball is equally likely to fall into any of the bins.
a. What is the probability of the first ball falling into the first bin?
b. What is the expected number of balls in the first bin?
Hint (1): Define an indicator random variable representing whether the i-th ball fell into the first bin:

X_i = 1 if the i-th ball fell into the first bin; 0 otherwise.

Hint (2): Use linearity of expectation.

c. What is the probability that the first bin is empty?
d. What is the expected number of empty bins? Hint (3): Define an indicator for the event "bin j is empty" and use linearity of expectations.
Answer
a. It is equally likely for a ball to fall into any of the bins, so the probability that the first ball falls into the first bin is 1/n.
b. Define X_i as described in Hint 1. Let Y be the total number of balls that fall into the first bin: Y = X₁ + . . . + X_m. The expected number of balls is:

E[Y] = Σ_{i=1}^m E[X_i] = Σ_{i=1}^m 1 · P(X_i = 1) = m · 1/n = m/n

c. Let Y and X_i be the same as defined at point b. For the first bin to be empty, none of the balls should fall into the first bin: Y = 0.

P(Y = 0) = P(X₁ = 0, . . . , X_m = 0) = Π_{i=1}^m P(X_i = 0) = (1 − 1/n)^m = ((n − 1)/n)^m

d. For each one of the n bins, we define an indicator random variable Y_j for the event "bin j is empty". Let Z be the random variable denoting the number of empty bins, Z = Y₁ + . . . + Yₙ. Then the expected number of empty bins is:

E[Z] = E[Σ_{j=1}^n Y_j] = Σ_{j=1}^n E[Y_j] = Σ_{j=1}^n 1 · P(Y_j = 1) = Σ_{j=1}^n ((n − 1)/n)^m = n ((n − 1)/n)^m
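All four answers can also be checked by simulation; a minimal sketch assuming Python with numpy (n and m are arbitrary):

    # Simulate throwing m balls into n bins, uniformly and independently.
    import numpy as np
    rng = np.random.default_rng(0)
    n, m, trials = 10, 15, 100_000
    bins = rng.integers(0, n, size=(trials, m))          # bin index of each ball
    first = (bins == 0).sum(axis=1)                      # number of balls in the first bin
    print(first.mean(), m / n)                           # b. ≈ 1.5
    print((first == 0).mean(), (1 - 1/n) ** m)           # c. ≈ 0.206
    empty = np.array([n - len(np.unique(row)) for row in bins])
    print(empty.mean(), n * (1 - 1/n) ** m)              # d. ≈ 2.06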
The univariate Gaussian distribution:

N_{µ,σ}(x) = (1/(√(2π) σ)) e^{−(x − µ)²/(2σ²)}

The plot, when µ = 0:

[bell-curve figure of (1/(σ√(2π))) exp(−x²/(2σ²)): about 34% of the probability mass lies in each of the intervals (−σ, 0) and (0, σ), about 14% in (σ, 2σ), about 2% in (2σ, 3σ), and about 0.1% beyond 3σ, symmetrically on the negative side.]

Source: http://www.texample.net/tikz/examples/standard-deviation/

Proving that N_{µ,σ} is indeed a p.d.f.

1. N_{µ,σ}(x) ≥ 0  ∀x ∈ R (true)

2. ∫_{−∞}^{∞} N_{µ,σ}(x) dx = 1

Note: Concerning the second property, it is enough to prove it for the standard case (µ = 0, σ = 1), because the non-standard case can be reduced to this one:
Using the variable transformation v = (x − µ)/σ will imply x = σv + µ and dx = σ dv, so:

∫_{−∞}^{∞} N_{µ,σ}(x) dx = (1/(√(2π) σ)) ∫_{x=−∞}^{∞} e^{−(x − µ)²/(2σ²)} dx
  = (1/(√(2π) σ)) ∫_{v=−∞}^{∞} e^{−v²/2} σ dv = (1/√(2π)) ∫_{v=−∞}^{∞} e^{−v²/2} dv = ∫_{x=−∞}^{∞} N_{0,1}(x) dx
The standard case: proving that N_{0,1} is indeed a p.d.f.:

(∫_{v=−∞}^{∞} e^{−v²/2} dv)² = (∫_{x=−∞}^{∞} e^{−x²/2} dx) · (∫_{y=−∞}^{∞} e^{−y²/2} dy)
  = ∫_{x=−∞}^{∞} ∫_{y=−∞}^{∞} e^{−(x² + y²)/2} dy dx = ∫∫_{R²} e^{−(x² + y²)/2} dy dx

By switching from x, y to polar coordinates r, θ (see the Note below), it follows:

(∫_{v=−∞}^{∞} e^{−v²/2} dv)² = ∫_{r=0}^{∞} ∫_{θ=0}^{2π} e^{−r²/2} (r dr dθ) = ∫_{r=0}^{∞} r e^{−r²/2} (∫_{θ=0}^{2π} dθ) dr
  = ∫_{r=0}^{∞} r e^{−r²/2} θ|₀^{2π} dr = 2π ∫_{r=0}^{∞} r e^{−r²/2} dr = 2π (−e^{−r²/2})|₀^{∞} = 2π(1 − 0) = 2π

⇒ ∫_{v=−∞}^{∞} e^{−v²/2} dv = √(2π) ⇒ (1/√(2π)) ∫_{v=−∞}^{∞} e^{−v²/2} dv = 1.

Note: x = r cos θ and y = r sin θ, with r ≥ 0 and θ ∈ [0, 2π). Therefore, x² + y² = r², and the Jacobian determinant is

∂(x, y)/∂(r, θ) = | ∂x/∂r  ∂x/∂θ | = | cos θ   −r sin θ | = r cos²θ + r sin²θ = r ≥ 0.   So, dx dy = r dr dθ.
                  | ∂y/∂r  ∂y/∂θ |   | sin θ    r cos θ |

Calculating the mean

E[X] =def ∫_{−∞}^{∞} x N_{µ,σ}(x) dx = (1/(√(2π) σ)) ∫_{−∞}^{∞} x · e^{−(x − µ)²/(2σ²)} dx

Using again the variable transformation v = (x − µ)/σ will imply:

E[X] = (1/(√(2π) σ)) ∫_{−∞}^{∞} (σv + µ) e^{−v²/2} (σ dv)
     = (σ/√(2π)) ∫_{−∞}^{∞} v e^{−v²/2} dv + (µ/√(2π)) ∫_{−∞}^{∞} e^{−v²/2} dv
     = (1/√(2π)) [−σ ∫_{−∞}^{∞} (−v) e^{−v²/2} dv + µ ∫_{−∞}^{∞} e^{−v²/2} dv]
     = (1/√(2π)) [−σ e^{−v²/2}|_{−∞}^{∞} + µ ∫_{−∞}^{∞} e^{−v²/2} dv],   where e^{−v²/2}|_{−∞}^{∞} = 0
     = (µ/√(2π)) ∫_{−∞}^{∞} e^{−v²/2} dv = (µ/√(2π)) √(2π) = µ
Calculating the variance

We will make use of the formula Var[X] = E[X²] − E²[X].

E[X²] = ∫_{−∞}^{∞} x² N_{µ,σ}(x) dx = (1/(√(2π) σ)) ∫_{−∞}^{∞} x² · e^{−(x − µ)²/(2σ²)} dx

Again, using the transformation v = (x − µ)/σ will imply x = σv + µ and dx = σ dv. Therefore,

E[X²] = (1/(√(2π) σ)) ∫_{−∞}^{∞} (σv + µ)² e^{−v²/2} (σ dv)
      = (1/√(2π)) ∫_{−∞}^{∞} (σ²v² + 2σµv + µ²) e^{−v²/2} dv
      = (1/√(2π)) [σ² ∫_{−∞}^{∞} v² e^{−v²/2} dv + 2σµ ∫_{−∞}^{∞} v e^{−v²/2} dv + µ² ∫_{−∞}^{∞} e^{−v²/2} dv]

Note that we have already computed ∫_{−∞}^{∞} v e^{−v²/2} dv = 0 and ∫_{−∞}^{∞} e^{−v²/2} dv = √(2π).

Calculating the variance (cont'd)
Therefore, we only need to compute

∫_{−∞}^{∞} v² e^{−v²/2} dv = ∫_{−∞}^{∞} (−v)(−v e^{−v²/2}) dv = ∫_{−∞}^{∞} (−v)(e^{−v²/2})′ dv
  = (−v) e^{−v²/2}|_{−∞}^{∞} − ∫_{−∞}^{∞} (−1) e^{−v²/2} dv = 0 + ∫_{−∞}^{∞} e^{−v²/2} dv = √(2π).

Here above we used the fact that

lim_{v→∞} v e^{−v²/2} = lim_{v→∞} v/e^{v²/2} =(l'Hôpital)= lim_{v→∞} 1/(v e^{v²/2}) = 0 = lim_{v→−∞} v e^{−v²/2}

So, E[X²] = (1/√(2π)) (σ² √(2π) + 2σµ · 0 + µ² √(2π)) = σ² + µ².

And, finally, Var[X] = E[X²] − (E[X])² = (σ² + µ²) − µ² = σ².
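The three integrals above (normalization, mean, variance) can be confirmed by numerical quadrature; a minimal sketch assuming Python with scipy (µ and σ are arbitrary):

    # Quadrature check of ∫N = 1, ∫xN = µ, ∫(x−µ)²N = σ².
    import numpy as np
    from scipy.integrate import quad

    mu, sigma = 1.5, 2.0
    pdf = lambda x: np.exp(-(x - mu)**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
    print(quad(pdf, -np.inf, np.inf)[0])                             # ≈ 1
    print(quad(lambda x: x * pdf(x), -np.inf, np.inf)[0])            # ≈ µ = 1.5
    print(quad(lambda x: (x - mu)**2 * pdf(x), -np.inf, np.inf)[0])  # ≈ σ² = 4.0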



Vectors of random variables.


A property:
The covariance matrix Σ corresponding to such a vector is
symmetric and positive semi-definite

Chuong Do, Stanford University, 2008

[adapted by Liviu Ciortuz]


Let X₁, . . . , Xₙ be random variables, with X_i : Ω → R for i = 1, . . . , n. The covariance matrix of the vector of random variables X = (X₁, . . . , Xₙ) is a square matrix of size n × n, whose elements are defined as [Cov(X)]_{ij} =def Cov(X_i, X_j), for all i, j ∈ {1, . . . , n}.

Show that Σ =not. Cov(X) is a symmetric and positive semi-definite matrix, the second property meaning that for any vector z ∈ Rⁿ the inequality z⊤Σz ≥ 0 holds. (The vectors z ∈ Rⁿ are considered column vectors, and the symbol ⊤ denotes matrix transposition.)
Cov(X)_{i,j} =def Cov(X_i, X_j), for all i, j ∈ {1, . . . , n}, and

Cov(X_i, X_j) =def E[(X_i − E[X_i])(X_j − E[X_j])] = E[(X_j − E[X_j])(X_i − E[X_i])] = Cov(X_j, X_i),

therefore Cov(X) is a symmetric matrix.

We will show that z⊤Σz ≥ 0 for any z ∈ Rⁿ (seen as a column vector):

z⊤Σz = Σ_{i=1}^n z_i (Σ_{j=1}^n Σ_{ij} z_j) = Σ_{i=1}^n Σ_{j=1}^n (z_i Σ_{ij} z_j) = Σ_{i=1}^n Σ_{j=1}^n (z_i Cov[X_i, X_j] z_j)
     = Σ_{i=1}^n Σ_{j=1}^n (z_i E[(X_i − E[X_i])(X_j − E[X_j])] z_j) = E[Σ_{i=1}^n Σ_{j=1}^n z_i (X_i − E[X_i])(X_j − E[X_j]) z_j]
     = E[(Σ_{i=1}^n z_i (X_i − E[X_i])) (Σ_{j=1}^n (X_j − E[X_j]) z_j)]
     = E[((X − E[X])⊤ · z)²] ≥ 0
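A numerical illustration of both properties (not part of the original proof; a sketch assuming Python with numpy):

    # A sample covariance matrix is symmetric, and z'Σz ≥ 0 for any z.
    import numpy as np
    rng = np.random.default_rng(0)
    X = rng.standard_normal((5000, 4)) @ rng.standard_normal((4, 4))  # correlated columns
    Sigma = np.cov(X, rowvar=False)
    print(np.allclose(Sigma, Sigma.T))                  # True: symmetric
    z = rng.standard_normal((1000, 4))
    quad_forms = np.einsum('ij,jk,ik->i', z, Sigma, z)  # z_i' Σ z_i for each row of z
    print((quad_forms >= -1e-12).all())                 # True: PSD (up to float error)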

Multi-variate Gaussian distributions:


A property:
When the covariance matrix of a multi-variate (d-dimensional) Gaussian
distribution is diagonal, then the p.d.f. (probability density function) of
the respective multi-variate Gaussian is equal to the product of d
independent uni-variate Gaussian densities.

Chuong Do, Stanford University, 2008

[adapted by Liviu Ciortuz]


Let's consider X = [X₁ . . . X_d]⊤, µ ∈ R^d and Σ ∈ S⁺_d, where S⁺_d is the set of symmetric positive definite matrices (which implies |Σ| ≠ 0 and (x − µ)⊤Σ⁻¹(x − µ) > 0, therefore −½(x − µ)⊤Σ⁻¹(x − µ) < 0, for any x ∈ R^d, x ≠ µ).
The probability density function of a multi-variate Gaussian distribution of parameters µ and Σ is:

p(x; µ, Σ) = (1/((2π)^{d/2} |Σ|^{1/2})) exp(−½ (x − µ)⊤Σ⁻¹(x − µ)).

Notation: X ∼ N(µ, Σ).
Show that when the covariance matrix Σ is diagonal, then the p.d.f. (probability density function) of the respective multi-variate Gaussian is equal to the product of d independent uni-variate Gaussian densities.

We will make the proof for d = 2 (generalization to d > 2 will be easy):

x = [x₁; x₂],   µ = [µ₁; µ₂],   Σ = [ σ₁²   0  ]
                                    [  0   σ₂² ]

Note: It is easy to show that if Σ ∈ S⁺_d is diagonal, the elements on the principal diagonal of Σ are indeed strictly positive. (It is enough to consider z = (1, 0) and respectively z = (0, 1) in the formula for positive-definiteness of Σ.) This is why we wrote these elements of Σ as σ₁² and σ₂².
p(x; µ, Σ) = (1/(2π (σ₁²σ₂²)^{1/2})) exp(−½ [x₁ − µ₁, x₂ − µ₂] [ σ₁²  0 ; 0  σ₂² ]⁻¹ [x₁ − µ₁; x₂ − µ₂])
  = (1/(2π σ₁σ₂)) exp(−½ [x₁ − µ₁, x₂ − µ₂] [ 1/σ₁²  0 ; 0  1/σ₂² ] [x₁ − µ₁; x₂ − µ₂])
  = (1/(2π σ₁σ₂)) exp(−½ [x₁ − µ₁, x₂ − µ₂] [(x₁ − µ₁)/σ₁²; (x₂ − µ₂)/σ₂²])
  = (1/(2π σ₁σ₂)) exp(−(x₁ − µ₁)²/(2σ₁²) − (x₂ − µ₂)²/(2σ₂²))
  = p(x₁; µ₁, σ₁²) p(x₂; µ₂, σ₂²).
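A one-point numerical check of this factorization property (a sketch assuming Python with scipy; the chosen µ, σ₁, σ₂ and x are arbitrary):

    # Diagonal-covariance Gaussian density vs. product of univariate densities.
    import numpy as np
    from scipy.stats import multivariate_normal, norm

    mu, s1, s2 = np.array([1.0, -2.0]), 0.7, 1.3
    Sigma = np.diag([s1**2, s2**2])
    x = np.array([0.3, -1.1])
    lhs = multivariate_normal(mu, Sigma).pdf(x)
    rhs = norm(mu[0], s1).pdf(x[0]) * norm(mu[1], s2).pdf(x[1])
    print(lhs, rhs)   # equal up to floating-point error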



Factorizing positive definite matrices, using eigenvectors

Sebastian Ciobanu,
following Chuong Do (from Stanford University),
The Multivariate Gaussian Distribution, 2008
Let X : Ω → R^d be a random variable. In what follows, the elements of R^d will be considered column vectors. We will denote by S⁺_d the set of symmetric positive definite matrices of size d × d.
Definition: Let A ∈ R^{d×d}. An eigenvalue of the matrix A is a complex number λ ∈ C for which there exists a nonzero vector x ∈ C^d such that Ax = λx. This vector x is called an eigenvector associated with the eigenvalue λ.
a. Prove that for any matrix Σ ∈ S⁺_d there exists a matrix B ∈ R^{d×d} such that Σ can be factorized in the following form: Σ = BB⊤.
Remark: This factorization is not unique, i.e., there are several possibilities of writing the matrix Σ as Σ = BB⊤.
Hint: You may use the following properties:
i. Any matrix A ∈ R^{d×d} which is symmetric can be written as A = UΛU⊤, where U ∈ R^{d×d} is an orthonormal matrix containing the eigenvectors of A (required to have norm 1) as columns, and Λ is the diagonal matrix containing the eigenvalues of A in the order corresponding to the columns (i.e., the eigenvectors) of the matrix U. (The fact that U is an orthonormal matrix can be formally written as U⊤U = UU⊤ = I.)
ii. Let A ∈ R^{d×d} be a symmetric matrix. A is positive definite if and only if all its eigenvalues are (real and) positive.
iii. (AB)⊤ = B⊤A⊤, for any matrices A, B ∈ R^{d×d}.
Solution
We know, by hypothesis, that the matrix Σ ∈ S⁺_d, so it is symmetric. According to property (a.i), we can write the following factorization for Σ:

Σ = UΛU⊤,

where U is the orthonormal matrix containing the (norm-1) eigenvectors of Σ as columns, and Λ ∈ R^{d×d} is the diagonal matrix containing the eigenvalues of Σ, in the order corresponding to the columns of the matrix U.
Since Σ ∈ S⁺_d is positive definite, according to property (a.ii) all the eigenvalues of Σ are positive. Consequently, there exists the matrix

Λ^{1/2} = diag(√λ₁, . . . , √λ_d) ∈ R^{d×d}.

Remarks:
(1) It is immediate that we can factorize / decompose the matrix Λ as follows:

Λ^{1/2} · Λ^{1/2} = Λ.   (3)

(2) Λ^{1/2} is a diagonal matrix, hence symmetric. Therefore,

(Λ^{1/2})⊤ = Λ^{1/2}.   (4)


Now we can resume the factorization Σ = UΛU⊤. Elaborating, that is, factorizing the matrix Λ, on the right-hand side of this equality, we obtain a new factorization for the matrix Σ as follows:

Σ = UΛU⊤ =(3)= UΛ^{1/2}Λ^{1/2}U⊤ =(4)= UΛ^{1/2}(Λ^{1/2})⊤U⊤
  =(assoc., a.iii)= UΛ^{1/2}(UΛ^{1/2})⊤ =not.= BB⊤, where B = UΛ^{1/2}.   (5)

Remarks:
(3) Another way to factorize / decompose the matrix Σ in the form Σ = BB⊤ is given by the Cholesky factorization: if A is a symmetric and positive definite matrix, then there exists a unique lower-triangular matrix L ∈ R^{d×d} such that A = LL⊤. (See https://en.wikipedia.org/wiki/Positive-definite matrix.)
(4) The factorization (5) actually holds for positive semidefinite matrices as well. However, the invertibility property (see point c) holds only for positive definite matrices.
b. At this point we will exemplify the notions presented at point a. Considering the matrix

Σ = [ 1    0.5 ]
    [ 0.5   1  ]

(so, d = 2), determine a matrix B ∈ R^{2×2} such that Σ = BB⊤.
Hint: Using the notations from the Definition given for eigenvalues and eigenvectors, we will have:

Ax = λx ⇔ (λI_d − A)x = 0, x ≠ 0 ⇔ det(λI_d − A) = 0,

where 0 is the d-dimensional null column vector, and I_d is the d-dimensional identity matrix.
Solution
First we compute the eigenvalues of the matrix Σ in the statement. So let λ ∈ R and x ∈ R^d, x ≠ 0. We have to solve the equation Σx = λx and, according to the Hint in the statement, this comes down to solving the equation det(λI_d − Σ) = 0.

det(λI_d − Σ) = 0 ⇔ det [ λ − 1   −0.5  ] = 0 ⇔ (λ − 1)² − 0.5² = 0
                        [ −0.5    λ − 1 ]
⇔ (λ − 1 − 0.5)(λ − 1 + 0.5) = 0 ⇔ (λ − 1.5)(λ − 0.5) = 0 ⇔ λ₁ = 0.5, λ₂ = 1.5.

So the eigenvalues of the matrix Σ are λ₁ = 0.5 and λ₂ = 1.5.

Note that λ₁, λ₂ > 0. Therefore, according to property (a.ii), the matrix Σ given in the statement is positive definite. From point a, we now know that there exists a matrix B such that Σ can be factorized in the form Σ = BB⊤. To determine the matrix B, we have to compute the eigenvectors of the matrix Σ. We obtain them by solving the following system, which has infinitely many solutions (because det(λI_d − Σ) = 0):

{ (λ − 1)v − 0.5u = 0        ⇒ v = 0.5u/(λ − 1).
{ −0.5v + (λ − 1)u = 0

So, the eigenvectors are of the form:

x = (v, u) = (0.5u/(λ − 1), u), u ∈ R.

Concretely,

λ₁ = 0.5 ⇒ x₁ = (−u, u), u ∈ R
and
λ₂ = 1.5 ⇒ x₂ = (u, u), u ∈ R.

Since we want the vectors x₁ and x₂ to have norm 1, we "normalize" them. First,

x₁/‖x₁‖₂ = (−u, u)/√(2u²) = (−u, u)/(√2 |u|)   (6)

and, taking u > 0, so |u| = u, it follows that

x₁/‖x₁‖₂ = (−u, u)/(√2 u) = (−1/√2, 1/√2).

Then, similarly we obtain

x₂/‖x₂‖₂ = (1/√2, 1/√2).
From point a we know that we can choose B = UΛ^{1/2}, so

U = [ −1/√2   1/√2 ] ,   Λ^{1/2} = [ √0.5    0   ] = [ 1/√2    0    ] ,
    [  1/√2   1/√2 ]               [  0    √1.5  ]   [  0    √3/√2  ]

B = [ −1/√2   1/√2 ] [ 1/√2    0    ] = [ −1/2   √3/2 ] .
    [  1/√2   1/√2 ] [  0    √3/√2  ]   [  1/2   √3/2 ]

We check that indeed Σ = BB⊤:

BB⊤ = [ −1/2   √3/2 ] [ −1/2   1/2  ] = [ 1/4 + 3/4    −1/4 + 3/4 ] = [ 1    0.5 ] = Σ.
      [  1/2   √3/2 ] [ √3/2   √3/2 ]   [ −1/4 + 3/4    1/4 + 3/4 ]   [ 0.5   1  ]
Remark
In line with the Remark in the problem statement, if in relation (6) we take u < 0, we obtain another solution for the factorization of the matrix Σ:

x₁′/‖x₁′‖₂ = (−u, u)/(√2 |u|) = (1/√2, −1/√2) = −x₁/‖x₁‖₂
x₂′/‖x₂′‖₂ = (u, u)/(√2 |u|) = (−1/√2, −1/√2) = −x₂/‖x₂‖₂,

hence

U′ = −U,   B′ = U′Λ^{1/2} = −UΛ^{1/2} = −B

and, consequently,

B′(B′)⊤ = (−B)(−B)⊤ = BB⊤ = Σ.
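The whole computation at points a-b can be reproduced numerically; a sketch assuming Python with numpy (np.linalg.eigh returns the eigendecomposition of a symmetric matrix):

    # Factorize Σ = B B^T via the eigendecomposition Σ = U Λ U^T, B = U Λ^{1/2}.
    import numpy as np
    Sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
    lam, U = np.linalg.eigh(Sigma)        # eigenvalues [0.5, 1.5] and orthonormal eigenvectors
    B = U @ np.diag(np.sqrt(lam))         # B = U Λ^{1/2}
    print(np.allclose(B @ B.T, Sigma))    # True
    print(np.linalg.cholesky(Sigma))      # a different valid factor L, with Σ = L L^T (Remark (3))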
c. Prove that the matrix B which satisfies the property at point a is invertible.

Hint: You may use the following properties:

i. A matrix A ∈ R^{d×d} is invertible if and only if det(A) ≠ 0.
ii. det(AB) = det(A) det(B), for any matrices A, B ∈ R^{d×d}.
iii. det(A) = det(A⊤), where A ∈ R^{d×d}.
iv. Any positive definite matrix is invertible (and its inverse is also a positive definite matrix).
Solution
From point a it follows that for any matrix Σ ∈ S⁺_d there exists a decomposition / factorization of the form Σ = BB⊤, so

det(Σ) = det(BB⊤) =(c.ii)= det(B) det(B⊤) =(c.iii)= det(B)².

Consequently,

det(B) = ±√(det(Σ)).   (7)

On the other hand, since Σ is a positive definite matrix, it follows by property (c.iv) that Σ is invertible. Further, by property (c.i), we have det(Σ) ≠ 0. Corroborating this with relation (7), it follows that det(B) ≠ 0 and consequently (again, by property (c.i)) that the matrix B is invertible (which was to be proven).

The Gaussian Multivariate Distribution:


Its density function is indeed a p.d.f.

Sebastian Ciobanu,
following Chuong Do (from Stanford University),
The Multivariate Gaussian Distribution, 2008
Definition: We denote by S⁺_d the set of symmetric positive definite matrices of size d × d. We say that the vector random variable X : Ω → R^d, represented in the form X = (X₁ . . . X_d)⊤, follows a multivariate Gaussian distribution, with mean µ ∈ R^d and covariance matrix Σ ∈ S⁺_d, if its [probability] density function has the following analytic form:

p_X(x; µ, Σ) = (1/((2π)^{d/2} det(Σ)^{1/2})) exp(−½ (x − µ)⊤Σ⁻¹(x − µ)),

where the notation det(Σ) designates the determinant of the matrix Σ, and exp( ) designates the exponential function with base e. In short, we will denote this defining property of X in the form X ∼ N(µ, Σ).
Remark (1): In problem . . . we proved that for any symmetric and positive definite matrix Σ ∈ R^{d×d} there exists (though not necessarily in a unique way) a matrix B ∈ R^{d×d} with the property that Σ can be "factorized" in the form Σ = BB⊤. Moreover, we proved that any matrix B which satisfies this property is necessarily invertible.
a. Prove that if we define Z = B⁻¹(X − µ), then Z ∼ N(0, I_d), where 0 is the d-dimensional null column vector, and I_d is the d-dimensional identity matrix.
Remark (2): This property is a generalization of the "standardization" method that we already encountered in problem CMU, 2004 fall, T. Mitchell, Z. Bar-Joseph, HW1, pr. 3, applied to univariate Gaussian distributions.
Hint (1): You may use the following properties:
i. Let X = (X₁ . . . X_d)⊤ ∈ R^d be a vector-type random variable, with the joint distribution given by the density function p_X : R^d → R. If Z = H(X) ∈ R^d, where H is a bijective and differentiable function, then Z has the joint distribution given by the density function p_Z : R^d → R, where

p_Z(z) = p_X(x) · |det(∂x/∂z)|,

and ∂x/∂z denotes the Jacobian matrix of x with respect to z, i.e., the d × d matrix whose (i, j) entry is ∂x_i/∂z_j.
ii. ∂(Ax + b)/∂x = A, where x, b ∈ R^d, A ∈ R^{d×d}, and A and b do not depend on x.
iii. (AB)⁻¹ = B⁻¹A⁻¹, where A, B ∈ R^{d×d}.
iv. (AB)⊤ = B⊤A⊤, where A, B ∈ R^{d×d}.
v. det(AB) = det(A) det(B), where A, B ∈ R^{d×d}.
vi. det(A) = det(A⊤), where A ∈ R^{d×d}.
Solution
The following relations are immediate:

z = B⁻¹(x − µ) ⇒ Bz = x − µ ⇒ x = Bz + µ   (8)

∂x/∂z = ∂(Bz + µ)/∂z =(a.ii)= B   (9)

(det(Σ))^{1/2} = √(det(BB⊤)) =(a.v)= √(det(B) det(B⊤)) =(a.vi)= √((det(B))²) = |det(B)|.   (10)

We now rewrite the expression which constitutes the argument of the exp( ) function in the definition of the p.d.f. of the multivariate Gaussian distribution (see the statement), in terms of the vector z:

(x − µ)⊤Σ⁻¹(x − µ) =(8)= (Bz + µ − µ)⊤(BB⊤)⁻¹(Bz + µ − µ) = (Bz)⊤(BB⊤)⁻¹(Bz)
  =(a.iii)= (Bz)⊤((B⊤)⁻¹B⁻¹)(Bz) =(a.iv)= (z⊤B⊤)((B⊤)⁻¹B⁻¹)(Bz)
  =(assoc.)= z⊤(B⊤(B⊤)⁻¹)(B⁻¹B)z = z⊤ I_d I_d z = z⊤z.   (11)
Applying now property (a.i), we can compute the p.d.f. of the Gaussian distribution associated with the variable Z:

p_Z(z) = p_X(x) · |det(∂x/∂z)|
       = (1/((2π)^{d/2} det(Σ)^{1/2})) exp(−½ (x − µ)⊤Σ⁻¹(x − µ)) · |det(∂x/∂z)|
  =((9), (10), (11))= (1/((2π)^{d/2} |det(B)|)) exp(−½ z⊤z) · |det(B)|
       = (1/(2π)^{d/2}) exp(−½ z⊤z) = (1/((2π)^{d/2} (det(I_d))^{1/2})) exp(−½ z⊤ I_d z)
       = N(z; 0, I_d).
b. Show that the function p_X(x; µ, Σ) given in the statement is indeed a probability density function (p.d.f.).
Hint (2): You may use the following properties:
i. For the case d = 1, the function p_X(x; µ, σ²) is a probability density function.
ii. In the case where the matrix Σ is diagonal,

Σ = diag(σ₁², . . . , σ_d²),   and   µ = (µ₁, . . . , µ_d)⊤,

the expression of the multivariate Gaussian density function is identical to the product of d independent univariate Gaussian density functions, the first having mean µ₁ and variance σ₁², the second having mean µ₂ and variance σ₂², . . . , the d-th having mean µ_d and variance σ_d².
Solution
p_X(x; µ, Σ) is a probability density function (p.d.f.) if:

• p_X(x; µ, Σ) ≥ 0, ∀x ∈ R^d
• I =not. ∫_{−∞}^{∞} ∫_{−∞}^{∞} . . . ∫_{−∞}^{∞} p_X(x; µ, Σ) dx_d . . . dx₂ dx₁ = 1.

The first condition is satisfied, because the denominator of the fraction [which is the normalization factor] in the definition of the probability density function p_X(x; µ, Σ) is positive, and exp(y) > 0 for any y ∈ R.
We now check the second condition.
In the integral I we make the substitution (change of variable) z = B⁻¹(x − µ), where Σ = BB⊤, B ∈ R^{d×d}. The matrix B exists and is invertible, as stated in the problem (see Remark (1)).
From point a it follows immediately that:

I = ∫_{−∞}^{∞} ∫_{−∞}^{∞} . . . ∫_{−∞}^{∞} p_Z(z; 0, I_d) dz_d . . . dz₂ dz₁
  =(b.ii)= (∫_{−∞}^{∞} p_{Z₁}(z₁; 0, 1) dz₁) (∫_{−∞}^{∞} p_{Z₂}(z₂; 0, 1) dz₂) . . . (∫_{−∞}^{∞} p_{Z_d}(z_d; 0, 1) dz_d) =(b.i)= 1 · 1 · . . . · 1 = 1.

So, the second condition is also satisfied, which means that the function p_X(x; µ, Σ) is indeed a probability density function (p.d.f.).

Bi-variate Gaussian distributions. A property:


The conditional distributions X1 |X2 and X2|X1 are also
Gaussians.
The calculation of their parameters

Duda, Hart and Stork, Pattern Classification, 2001,


Appendix A.5.2

[adapted by Liviu Ciortuz]


Let X be a random variable which follows a bivariate Gaussian distribution with parameters µ (the mean vector) and Σ (the covariance matrix). So, µ = (µ₁, µ₂) ∈ R², and Σ ∈ M₂ₓ₂(R).
By definition, Σ = Cov(X, X), where X = (X₁, X₂), so Σ_{ij} = Cov(X_i, X_j) for i, j ∈ {1, 2}. Also, Cov(X_i, X_i) = Var[X_i] =not. σ_i² ≥ 0 for i ∈ {1, 2}, while for i ≠ j we have Cov(X_i, X_j) = Cov(X_j, X_i) =not. σ₁₂.
Finally, if we introduce the "correlation coefficient" ρ =def σ₁₂/(σ₁σ₂), it follows that we can write the covariance matrix as:

Σ = [ σ₁²      ρσ₁σ₂ ]   (12)
    [ ρσ₁σ₂    σ₂²   ]
Prove that the hypothesis X ∼ N(µ, Σ) implies that the conditional distribution X₂|X₁ is Gaussian, namely

X₂|X₁ = x₁ ∼ N(µ_{2|1}, σ²_{2|1}),

with µ_{2|1} = µ₂ + ρ (σ₂/σ₁)(x₁ − µ₁) and σ²_{2|1} = σ₂²(1 − ρ²).

Remark: For X₁|X₂, the result is similar: X₁|X₂ = x₂ ∼ N(µ_{1|2}, σ²_{1|2}), with µ_{1|2} = µ₁ + ρ (σ₁/σ₂)(x₂ − µ₂) and σ²_{1|2} = σ₁²(1 − ρ²).

Source: Pattern Classification, 2nd ed., Appendix A.5.2, R. Duda, P. Hart and D. Stork, 2001
Answer

p_{X₂|X₁}(x₂|x₁) =def p_{X₁,X₂}(x₁, x₂) / p_{X₁}(x₁),   (13)

where

p_{X₁,X₂}(x₁, x₂) = (1/(2π √|Σ|)) exp(−½ (x − µ)⊤Σ⁻¹(x − µ))   and
p_{X₁}(x₁) = (1/(√(2π) σ₁)) exp(−(x₁ − µ₁)²/(2σ₁²)).   (14)

From (12) it follows that |Σ| = σ₁²σ₂²(1 − ρ²). In order that |Σ| and Σ⁻¹ be defined, it follows that ρ ∈ (−1, 1). Moreover, since σ₁, σ₂ > 0, we will have √|Σ| = σ₁σ₂√(1 − ρ²).

Σ⁻¹ = (1/|Σ|) Σ* = (1/(σ₁²σ₂²(1 − ρ²))) [ σ₂²      −ρσ₁σ₂ ]
                                        [ −ρσ₁σ₂    σ₁²   ]

    = (1/(1 − ρ²)) [ 1/σ₁²        −ρ/(σ₁σ₂) ]
                   [ −ρ/(σ₁σ₂)     1/σ₂²    ]
So,

p_{X₁,X₂}(x₁, x₂)
  = (1/(2πσ₁σ₂√(1 − ρ²))) exp(−(1/(2(1 − ρ²))) [x₁ − µ₁, x₂ − µ₂] [ 1/σ₁²       −ρ/(σ₁σ₂) ] [x₁ − µ₁; x₂ − µ₂])
                                                                  [ −ρ/(σ₁σ₂)    1/σ₂²   ]
  = (1/(2πσ₁σ₂√(1 − ρ²))) exp(−(1/(2(1 − ρ²))) [((x₁ − µ₁)/σ₁)² − 2ρ ((x₁ − µ₁)/σ₁)((x₂ − µ₂)/σ₂) + ((x₂ − µ₂)/σ₂)²])   (15)
By substituting (14) and (15) in the definition (13), we get:

p(x₂|x₁) = p_{X₁,X₂}(x₁, x₂) / p_{X₁}(x₁)
  = (1/(2πσ₁σ₂√(1 − ρ²))) exp(−(1/(2(1 − ρ²))) [((x₁ − µ₁)/σ₁)² − 2ρ ((x₁ − µ₁)/σ₁)((x₂ − µ₂)/σ₂) + ((x₂ − µ₂)/σ₂)²])
    · √(2π) σ₁ exp(+½ ((x₁ − µ₁)/σ₁)²)
  = (1/(√(2π) σ₂√(1 − ρ²))) exp(−(1/(2(1 − ρ²))) [(x₂ − µ₂)/σ₂ − ρ (x₁ − µ₁)/σ₁]²)
  = (1/(√(2π) σ₂√(1 − ρ²))) exp(−½ [(x₂ − [µ₂ + ρ (σ₂/σ₁)(x₁ − µ₁)]) / (σ₂√(1 − ρ²))]²)

Therefore,

X₂|X₁ = x₁ ∼ N(µ_{2|1}, σ²_{2|1}) with µ_{2|1} =not. µ₂ + ρ (σ₂/σ₁)(x₁ − µ₁) and σ²_{2|1} =not. σ₂²(1 − ρ²).

Using the Central Limit Theorem (the i.i.d. version)


to compute the real error of a classifier
CMU, 2008 fall, Eric Xing, HW3, pr. 3.3

Chris recently adopted a new (binary) classifier to filter email spams. He wants to quantitatively evaluate how good the classifier is.
He has a small dataset of 100 emails on hand which, you can assume, are randomly drawn from all emails.
He tests the classifier on the 100 emails and gets 83 classified correctly, so the error rate on the small dataset is 17%.
However, the error rate measured on 100 samples could be either higher or lower than the real error rate just by chance.

With a confidence level of 95%, what is likely to be the range of the real error rate? Please write down all important steps.
(Hint: You need some approximation in this problem.)
Notations:
Let X_i, i = 1, . . . , n = 100 be defined as:
X_i = 1 if the email i was incorrectly classified, and 0 otherwise;
E[X_i] =not. µ =not. e_real;   Var(X_i) =not. σ²

e_sample =not. (X₁ + . . . + Xₙ)/n = 0.17

Zₙ = (X₁ + . . . + Xₙ − nµ)/(σ√n)   (the standardized form of X₁ + . . . + Xₙ)

Key insight:
Calculating the real error of the classifier (more exactly, a symmetric interval around the real error p =not. µ) with a "confidence" of 95% amounts to finding a > 0 such that P(|Zₙ| ≤ a) ≥ 0.95.
Calculus:

|Zₙ| ≤ a ⇔ |(X₁ + . . . + Xₙ − nµ)/(σ√n)| ≤ a ⇔ |(X₁ + . . . + Xₙ − nµ)/(nσ)| ≤ a/√n
  ⇔ |(X₁ + . . . + Xₙ − nµ)/n| ≤ aσ/√n ⇔ |(X₁ + . . . + Xₙ)/n − µ| ≤ aσ/√n
  ⇔ |e_sample − e_real| ≤ aσ/√n ⇔ |e_real − e_sample| ≤ aσ/√n
  ⇔ −aσ/√n ≤ e_real − e_sample ≤ aσ/√n
  ⇔ e_sample − aσ/√n ≤ e_real ≤ e_sample + aσ/√n
  ⇔ e_real ∈ [e_sample − aσ/√n, e_sample + aσ/√n]
Important facts:
The Central Limit Theorem: Zₙ → N(0, 1).
Therefore, P(|Zₙ| ≤ a) ≈ P(|X| ≤ a) = Φ(a) − Φ(−a), where X ∼ N(0, 1) and Φ is the cumulative distribution function of N(0, 1).
Calculus:
Φ(−a) + Φ(a) = 1 ⇒ P(|Zₙ| ≤ a) = Φ(a) − Φ(−a) = 2Φ(a) − 1
P(|Zₙ| ≤ a) = 0.95 ⇔ 2Φ(a) − 1 = 0.95 ⇔ Φ(a) = 0.975 ⇔ a ≅ 1.96 (see the Φ table)

σ² =not. Var_real = e_real(1 − e_real), because the X_i are Bernoulli variables.
Furthermore, we can approximate e_real with e_sample, because

E[e_sample] = e_real and Var_sample = (1/n) Var_real → 0 for n → +∞,

cf. CMU, 2011 fall, T. Mitchell, A. Singh, HW2, pr. 1.ab.

Finally:
aσ/√n ≈ 1.96 · √(0.17(1 − 0.17))/√100 ≅ 0.07
|e_real − e_sample| ≤ 0.07 ⇔ |e_real − 0.17| ≤ 0.07 ⇔ −0.07 ≤ e_real − 0.17 ≤ 0.07
⇔ e_real ∈ [0.10, 0.24]
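The numerical part of the solution, as a minimal sketch assuming Python with scipy (norm.ppf gives the Φ quantile, replacing the Φ table):

    # 95% interval for the real error rate around e_sample = 0.17, n = 100.
    import numpy as np
    from scipy.stats import norm

    n, e_sample = 100, 0.17
    a = norm.ppf(0.975)                                # ≈ 1.96
    half = a * np.sqrt(e_sample * (1 - e_sample) / n)  # ≈ 0.07
    print(e_sample - half, e_sample + half)            # ≈ 0.10, 0.24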

Exemplifying
a mixture of categorical distributions;
how to compute its expectation and variance
CMU, 2010 fall, Aarti Singh, HW1, pr. 2.2.1-2

Suppose that I have two six-sided dice, one is fair and the other one is loaded, having:

P(x) = 1/2 if x = 6;  1/10 if x ∈ {1, 2, 3, 4, 5}.

I will toss a coin to decide which die to roll. If the coin flip is heads I will roll the fair die, otherwise the loaded one. The probability that the coin flip is heads is p ∈ (0, 1).
a. What is the expectation of the die roll (in terms of p)?
b. What is the variance of the die roll (in terms of p)?
Solution:

a.
E[X] = Σ_{i=1}^6 i · [P(i|fair) · p + P(i|loaded) · (1 − p)]
     = [Σ_{i=1}^6 i · P(i|fair)] p + [Σ_{i=1}^6 i · P(i|loaded)] (1 − p)
     = (7/2) p + (9/2)(1 − p) = 9/2 − p
b. Recall that we may write Var(X) = E[X²] − (E[X])², therefore:

E[X²] = Σ_{i=1}^6 i² · [P(i|fair) · p + P(i|loaded) · (1 − p)]
      = [Σ_{i=1}^6 i² · P(i|fair)] p + [Σ_{i=1}^6 i² · P(i|loaded)] (1 − p)
      = (91/6) p + (36/2 + 55/10)(1 − p)
      = 47/2 − (25/3) p

Combining this with the result of the previous question yields:

Var(X) = E[X²] − (E[X])² = 141/6 − (50/6) p − (9/2 − p)²
       = 141/6 − (50/6) p − (81/4 − 9p + p²)
       = (141/6 − 81/4) + (9 − 50/6) p − p²
       = 13/4 + (2/3) p − p²
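Both formulas can be checked by simulating the mixture; a sketch assuming Python with numpy (the value p = 0.3 is arbitrary):

    # Simulate: flip the coin, then roll the fair or the loaded die.
    import numpy as np
    rng = np.random.default_rng(0)
    p, N = 0.3, 1_000_000
    heads = rng.random(N) < p
    fair = rng.integers(1, 7, size=N)                                # uniform on {1,...,6}
    loaded = rng.choice(np.arange(1, 7), size=N, p=[0.1]*5 + [0.5])  # the loaded die
    rolls = np.where(heads, fair, loaded)
    print(rolls.mean(), 9/2 - p)                  # ≈ 4.2
    print(rolls.var(), 13/4 + (2/3)*p - p**2)     # ≈ 3.36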

Estimating the parameters of some


probability distributions:
Exemplifications

Estimating the parameter of the Bernoulli


distribution:
the MLE and MAP approaches
CMU, 2015 spring, Tom Mitchell, Nina Balcan, HW2, pr. 2

Suppose we observe the values of n i.i.d. (independent, identically distributed) random variables X₁, . . . , Xₙ drawn from a single Bernoulli distribution with parameter θ. In other words, for each X_i, we know that

P(X_i = 1) = θ   and   P(X_i = 0) = 1 − θ.

Our goal is to estimate the value of θ from the observed values of X₁, . . . , Xₙ.

Reminder: Maximum Likelihood Estimation

For any hypothetical value θ̂, we can compute the probability of observing the outcome X₁, . . . , Xₙ if the true parameter value θ were equal to θ̂.
This probability of the observed data is often called the data likelihood, and the function L(θ̂) that maps each θ̂ to the corresponding likelihood is called the likelihood function.
A natural way to estimate the unknown parameter θ is to choose the θ̂ that maximizes the likelihood function. Formally,

θ̂_MLE = argmax_{θ̂} L(θ̂).

a. Write a formula for the likelihood function, L(θ̂).

Your function should depend on the random variables X₁, . . . , Xₙ and the hypothetical parameter θ̂.
Does the likelihood function depend on the order of the random variables?

Solution:
Since the X_i are independent, we have

L(θ̂) = P_θ̂(X₁, . . . , Xₙ) = Π_{i=1}^n P_θ̂(X_i) = Π_{i=1}^n (θ̂^{X_i} · (1 − θ̂)^{1−X_i}) = θ̂^{#{X_i=1}} · (1 − θ̂)^{#{X_i=0}},

where #{·} counts the number of X_i for which the condition in braces holds true.
Note that in the third equality we used the trick X_i = I_{X_i=1}.
The likelihood function does not depend on the order of the data.
b. Suppose that n = 10 and the data set contains six 1s and four 0s.
Write a short computer program that plots the likelihood function of this data.
For the plot, the x-axis should be θ̂, and the y-axis L(θ̂). Scale your y-axis so that you can see some variation in its value.
Estimate θ̂_MLE by marking on the x-axis the value of θ̂ that maximizes the likelihood.

Solution:
[plot "MLE; n = 10, six 1s, four 0s": the likelihood L(θ̂) = θ̂⁶(1 − θ̂)⁴ over θ̂ ∈ [0, 1], with maximum value ≈ 0.0012, attained at θ̂ = 0.6]
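One possible version of the requested program (a sketch, assuming Python with numpy and matplotlib):

    # Plot L(θ) = θ^6 (1−θ)^4 and mark the maximizing θ on the x-axis.
    import numpy as np
    import matplotlib.pyplot as plt

    theta = np.linspace(0, 1, 501)
    L = theta**6 * (1 - theta)**4
    plt.plot(theta, L)
    plt.axvline(theta[np.argmax(L)], linestyle='--')   # falls at θ ≈ 0.6
    plt.xlabel('theta hat'); plt.ylabel('L(theta hat)')
    plt.show()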
c. Find a closed-form formula for θ̂_MLE, the MLE estimate of θ̂. Does the closed form agree with the plot?
Solution:
Let's consider l(θ̂) = ln(L(θ̂)). Since the ln function is increasing, the θ̂ that maximizes the log-likelihood is the same as the θ̂ that maximizes the likelihood. Using the properties of the ln function, we can rewrite l(θ̂) as follows:

l(θ̂) = ln(θ̂^{n₁} · (1 − θ̂)^{n₀}) = n₁ ln(θ̂) + n₀ ln(1 − θ̂),

where n₁ =not. #{X_i = 1} and n₀ =not. #{X_i = 0}.
Assuming that θ̂ ≠ 0 and θ̂ ≠ 1, the first and second derivatives of l are given by

l′(θ̂) = n₁/θ̂ − n₀/(1 − θ̂)   and   l″(θ̂) = −n₁/θ̂² − n₀/(1 − θ̂)²

Since l″(θ̂) is always negative, the l function is concave, and we can find its maximizer by solving the equation l′(θ̂) = 0.
The solution to this equation is given by θ̂_MLE = n₁/(n₁ + n₀).
d. Create three more likelihood plots: one where n = 5 and the data set contains three 1s and two 0s; one where n = 100 and the data set contains sixty 1s and forty 0s; and one where n = 10 and there are five 1s and five 0s.

Solution:
[plots: "MLE; n = 5, three 1s, two 0s", maximum ≈ 0.035 at θ̂ = 0.6; "MLE; n = 100, sixty 1s, forty 0s", maximum ≈ 7·10⁻³⁰ at θ̂ = 0.6]
e. Describe how the likelihood functions and maximum likelihood estimates compare for the different data sets.

Solution (to part d, cont'd):
[plot: "MLE; n = 10, five 1s, five 0s", maximum ≈ 0.001 at θ̂ = 0.5]

Solution (to part e):
The MLE is equal to the proportion of 1s observed in the data, so for the first three plots the MLE is always at 0.6, while for the last plot it is at 0.5.
As the number of samples n increases, the likelihood function gets more peaked at its maximum value, and the values it takes on decrease.

Reminder: Maximum a Posteriori Probability Estimation


In the maximum likelihood estimate, we treated the true parameter value θ as a fixed
(non-random) number. In cases where we have some prior knowledge about θ, it is
useful to treat θ itself as a random variable, and express our prior knowledge in the
form of a prior probability distribution over θ.
For example, suppose that the X1 , . . . , Xn are generated in the following way:
− First, the value of θ is drawn from a given prior probability distribution
− Second, X1 , . . . , Xn are drawn independently from a Bernoulli distribution using
this value for θ.
Since both θ and the sequence X1 , . . . , Xn are random, they have a joint probability
distribution. In this setting, a natural way to estimate the value of θ is to simply choose
its most probable value given its prior distribution plus the observed data X1 , . . . , Xn .
Definition:

θ̂_MAP = argmax_{θ̂} P(θ = θ̂ | X₁, . . . , Xₙ).

This is called the maximum a posteriori probability (MAP) estimate of θ.



Reminder (cont'd)
Using Bayes rule, we can rewrite the posterior probability as follows:

P(θ = θ̂ | X₁, . . . , Xₙ) = P(X₁, . . . , Xₙ | θ = θ̂) P(θ = θ̂) / P(X₁, . . . , Xₙ).

Since the probability in the denominator does not depend on θ̂, the MAP estimate is given by

θ̂_MAP = argmax_{θ̂} P(X₁, . . . , Xₙ | θ = θ̂) P(θ = θ̂) = argmax_{θ̂} L(θ̂) P(θ = θ̂).

In words, the MAP estimate for θ is the value θ̂ that maximizes the likelihood function multiplied by the prior distribution on θ.

We will consider a Beta(3, 3) prior distribution for θ, which has the density function given by p(θ̂) = θ̂²(1 − θ̂)²/B(3, 3), where B(α, β) is the beta function and B(3, 3) ≈ 0.0333.

f. Suppose, as in part b, that n = 10 and we observed six 1s and four 0s.
Write a short computer program that plots the function θ̂ ↦ L(θ̂) p(θ̂) for the same values of θ̂ as in part b.
Estimate θ̂_MAP by marking on the x-axis the value of θ̂ that maximizes the function.

Solution:
[plot "MAP; n = 10, six 1s, four 0s; Beta(3,3)": the function L(θ̂)p(θ̂) over θ̂ ∈ [0, 1], with maximum ≈ 0.002, attained at θ̂ = 8/14 ≈ 0.57]
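Again, one possible version of the requested program (a sketch, assuming Python with numpy and matplotlib):

    # Plot L(θ)p(θ) = θ^8 (1−θ)^6 / B(3,3) and mark the maximizing θ.
    import numpy as np
    import matplotlib.pyplot as plt

    theta = np.linspace(0, 1, 501)
    g = theta**8 * (1 - theta)**6 / 0.0333             # likelihood times the Beta(3,3) prior
    plt.plot(theta, g)
    plt.axvline(theta[np.argmax(g)], linestyle='--')   # falls at θ = 8/14 ≈ 0.57
    plt.xlabel('theta hat'); plt.ylabel('L · p')
    plt.show()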
g. Find a closed form formula for θ̂_MAP, the MAP estimate of θ̂. Does the closed form agree with the plot?
Solution:
As in the case of the MLE, we will apply the ln function before finding the maximizer. We want to maximize the function

l(θ̂) = ln(L(θ̂) · p(θ̂)) = ln(θ̂^{n₁+2} · (1 − θ̂)^{n₀+2}) − ln(B(3, 3)).

The normalizing constant for the prior appears as an additive constant and therefore the first and second derivatives are identical to those in the case of the MLE (except with n₁ + 2 and n₀ + 2 instead of n₁ and n₀, respectively).
It follows that the closed form formula for the MAP estimate is given by

θ̂_MAP = (n₁ + 2)/(n₁ + n₀ + 4).

This formula agrees with the plot obtained in part f.
h. Compare the MAP estimate to the MLE computed from the same data in part b. Briefly explain any significant difference.
Solution:
The MAP estimate is equal to the MLE with four additional virtual random variables, two that are equal to 1, and two that are equal to 0. This pulls the value of the MAP estimate closer to the value 0.5, which is why θ̂_MAP is smaller than θ̂_MLE.

i. Comment on the relationship between the MAP and MLE estimates as n goes to infinity, while the ratio #{X_i = 1}/#{X_i = 0} remains constant.
Solution:
It is obvious that as n goes to infinity, the influence of the 4 virtual random variables diminishes, and the two estimators become equal.

The MLE estimator for the parameter of


the Bernoulli distribution:
the bias and [an example of ] inadmissibility
CMU, 2004 fall, Tom Mitchell, Ziv Bar-Joseph, HW2, pr. 1

Suppose X is a binary random variable that takes value 0 with probability p and value 1 with probability 1 − p. Let X₁, . . . , Xₙ be i.i.d. samples of X.
a. Compute an MLE estimate of p (denote it by p̂).
Answer:
By way of definition,

p̂ = argmax_p P(X₁, . . . , Xₙ | p) =(i.i.d.)= argmax_p Π_{i=1}^n P(X_i | p) = argmax_p p^k (1 − p)^{n−k}

where k is the number of 0's in x₁, . . . , xₙ.

Furthermore, since ln is a monotonic (strictly increasing) function,

p̂ = argmax_p ln(p^k (1 − p)^{n−k}) = argmax_p (k ln p + (n − k) ln(1 − p))

Computing the first derivative of k ln p + (n − k) ln(1 − p) w.r.t. p leads to:

∂/∂p (k ln p + (n − k) ln(1 − p)) = k/p − (n − k)/(1 − p).

Hence,

∂/∂p (k ln p + (n − k) ln(1 − p)) = 0 ⇔ k/p = (n − k)/(1 − p) ⇔ p̂ = k/n.
b. Is p̂ an unbiased estimate of p? Prove the answer.

Answer:
Since k can be seen as a sum of n [independent] Bernoulli variables of parameter p, we can write:

E[p̂] = E[k/n] = (1/n) E[k] = (1/n) E[n − Σ_{i=1}^n X_i] = (1/n) (n − Σ_{i=1}^n E[X_i])
     = (1/n) (n − Σ_{i=1}^n (1 − p)) = (1/n) (n − n(1 − p)) = (1/n) np = p.

Therefore, p̂ is an unbiased estimator for the parameter p.


c. Compute the expected square error of p̂ in terms of p.

Answer:

E[(p̂ − p)²] = E[p̂²] − 2E[p̂] p + p² = E[k²]/n² − 2p² + p²
            = (Var(k) + E²[k])/n² − p² = (np(1 − p) + (np)²)/n² − p² = (p/n)(1 − p)

We used the fact that Var(k) = E[k²] − E²[k], and also Var(k) = np(1 − p) because, as already said, k can be seen as a sum of n independent Bernoulli variables of parameter p.
Note that E[p̂] = p (cf. part b), therefore

E[(p̂ − p)²] = E[(p̂ − E[p̂])²] = Var[p̂] = (p/n)(1 − p).

This implies that Var[p̂] → 0 for n → ∞.
d. Prove that if you know that p lies in the interval [1/4, 3/4] and you are given only n = 3 samples of X, then p̂ is an inadmissible estimator of p when minimizing the expected square error of estimation.
Note: An estimator δ of a parameter θ is said to be inadmissible when there exists a different estimator δ′ such that R(θ, δ′) ≤ R(θ, δ) for all θ and R(θ, δ′) < R(θ, δ) for some θ, where R(θ, δ) is a risk function and in this problem it is the expected square error of the estimator.

Answer:
Consider another estimator, p̃ = 1/2.
E[(p̃ − p)²] = (1/2 − p)².
For p = 1/2 we have E[(p̃ − p)²] = 0 < E[(p̂ − p)²] = 1/12.
We now need to show that E[(p̃ − p)²] ≤ E[(p̂ − p)²] over p ∈ [1/4, 3/4].

E[(p̃ − p)²] − E[(p̂ − p)²] = (1/2 − p)² − (1/3) p(1 − p) = 1/4 − (4/3) p + (4/3) p².

This is an upward-opening parabola, so we need to show that it lies below or equal to zero for p ∈ [1/4, 3/4].
It is equivalent to showing that it is below or equal to 0 at the boundary points.
In fact it is: 1/4 − (4/3) p + (4/3) p² = 0 for both p = 1/4 and p = 3/4.

Estimating the parameters of the categorical


distribution:
the MLE approach
CMU, 2009 spring, Ziv Bar-Joseph, HW1, pr. 2.3
In this problem we will derive the MLE for the parameters of a categorical distribution where the variable of interest, X, can take on k values, namely a₁, a₂, . . . , a_k, the probability of seeing an event of type j being θ_j for j = 1, . . . , k.
a. Given data describing n independent identically distributed observations of X, namely d₁, . . . , dₙ, each of which can be one of k values, express the likelihood of the data given k − 1 parameters for the distribution over X. Let n_i represent the number of times X takes on value i in the data.
Answer:
The likelihood of the data is:

L(θ) = P(d₁, . . . , dₙ | θ) =(i.i.d.)= Π_{j=1}^n Σ_{i=1}^k (θ_i I_{d_j=a_i})   (I is the indicator function)
     = Π_{i=1}^k θ_i^{n_i}   (n_i =not. Σ_{j=1}^n I_{d_j=a_i})
     = (1 − Σ_{i=1}^{k−1} θ_i)^{n_k} Π_{i=1}^{k−1} θ_i^{n_i}   (since the thetas sum to one: θ_k = 1 − Σ_{i=1}^{k−1} θ_i)
b. Find θ̂_j, the MLE for θ_j, one of the k − 1 parameters, by setting the partial derivative of the likelihood in part a with respect to θ_j equal to zero and solving for it.
Hint: You may want to start by first taking the log of the likelihood from part a before taking its derivative.
Answer:

ln L(θ) = n_k ln(1 − Σ_{i=1}^{k−1} θ_i) + Σ_{i=1}^{k−1} n_i ln θ_i ⇒ ∂ ln L(θ)/∂θ_j = −n_k/(1 − Σ_{i=1}^{k−1} θ_i) + n_j/θ_j

∂ ln L(θ)/∂θ_j = 0 ⇔ −n_k/(1 − θ̂_j − Σ_{i≠j,k} θ_i) + n_j/θ̂_j = 0 ⇔ n_j/θ̂_j = n_k/(1 − θ̂_j − Σ_{i≠j,k} θ_i)
  ⇔ n_j (1 − Σ_{i≠j,k} θ_i) = (n_k + n_j) θ̂_j
  ⇔ θ̂_j = (n_j/(n_j + n_k)) (1 − Σ_{i≠j,k} θ_i)   for all j ∈ {1, . . . , k − 1},

where the sums Σ_{i≠j,k} run over i ∈ {1, . . . , k − 1} \ {j}.
c. At this point you should have k − 1 equations describing MLEs of different parameters. Show how those equations imply that the MLE for a parameter θ_j representing the probability that X takes on value j is equal to n_j/n.
Hint: In order to remove the k-th parameter from the likelihood in part a you had to represent it with an equation, θ_k = f( ). At this point you may find it helpful to replace all occurrences of f( ) with θ_k. After replacing f( ) with θ_k you can substitute all occurrences of each other parameter in f( ) with its MLE from part b. This should allow you to solve for the MLE of θ_k, which can then be used to simplify all of the other equations.
Answer

Since the MLE equations from part b must hold simultaneously for all components of the vector θ̂, the last equation in part b can be written as:

θ̂_j = (n_j/(n_j + n_k)) (1 − Σ_{i≤k−1, i≠j} θ̂_i) ⇔ θ̂_j = (n_j/(n_j + n_k)) (θ̂_j + θ̂_k)   (because θ̂_k = 1 − Σ_{i=1}^{k−1} θ̂_i)

⇔ θ̂_j (1 − n_j/(n_j + n_k)) = (n_j/(n_j + n_k)) θ̂_k ⇔ θ̂_j n_k/(n_j + n_k) = (n_j/(n_j + n_k)) θ̂_k
⇔ θ̂_j n_k = n_j θ̂_k
⇔ θ̂_j = (n_j/n_k) θ̂_k   for all j ∈ {1, . . . , k − 1}.
Finally,

θ̂_k = 1 − θ̂₁ − . . . − θ̂_{k−1} = 1 − (n₁/n_k) θ̂_k − . . . − (n_{k−1}/n_k) θ̂_k
⇒ n_k θ̂_k = n_k − (n₁ + . . . + n_{k−1}) θ̂_k
⇒ θ̂_k (n₁ + . . . + n_{k−1} + n_k) = n_k,   where n₁ + . . . + n_k = n
⇒ θ̂_k = n_k/n
⇒ θ̂_j = (n_j/n_k) · (n_k/n) = n_j/n   for all j ∈ {1, . . . , k − 1}.

Note: Even though [here] we can go from the non-hatted (θ_i) to the hatted form (θ̂_i) of the equation in the first step of c, this will generally not be possible. To solve for a maximum likelihood criterion under additional constraints like the thetas summing to one, a generic and useful method is the method of Lagrange multipliers.

The Gaussian [uni-variate] distribution:


estimating µ when σ 2 is known
CMU, 2011 fall, Tom Mitchell, Aarti Singh, HW2, pr. 1
CMU, 2010 fall, Ziv Bar-Joseph, HW1, pr. 1.2-3
Assume we have n samples, x₁, . . . , xₙ, independently drawn from a normal distribution with known variance σ² and unknown mean µ.
a. Derive the MLE estimator for the mean µ.
Solution:

P(x₁, . . . , xₙ | µ) = Π_{i=1}^n P(x_i | µ) = Π_{i=1}^n (1/(√(2π) σ)) e^{−(x_i − µ)²/(2σ²)}

⇒ ln P(x₁, . . . , xₙ | µ) = Σ_{i=1}^n [ln(1/(√(2π) σ)) − (x_i − µ)²/(2σ²)]

⇒ ∂/∂µ ln P(x₁, . . . , xₙ | µ) = Σ_{i=1}^n (x_i − µ)/σ²

∂/∂µ ln P(x₁, . . . , xₙ | µ) = 0 ⇔ Σ_{i=1}^n (x_i − µ)/σ² = 0 ⇔ Σ_{i=1}^n (x_i − µ) = 0 ⇔ Σ_{i=1}^n x_i = nµ

⇒ µ_MLE = (Σ_{i=1}^n x_i)/n

Remark: It can be easily shown that ln P(x₁, . . . , xₙ | µ) indeed reaches its maximum for µ = µ_MLE.
b. Show that E[µ_MLE] = µ.

Solution:
The sample x₁, . . . , xₙ can be seen as the realization of n independent random variables X₁, . . . , Xₙ of Gaussian distribution of mean µ and variance σ². Then, due to the property of linearity for the expectation of random variables, we get:

E[µ_MLE] = E[(X₁ + . . . + Xₙ)/n] = (E[X₁] + . . . + E[Xₙ])/n = nµ/n = µ

Therefore, the µ_MLE estimator is unbiased.

c. What is Var[µ_MLE]?
Solution:

Var[µ_MLE] = Var[(1/n) Σ_{i=1}^n X_i] =(⋆)= (1/n²) Σ_{i=1}^n Var[X_i] =(i.i.d.)= n (1/n²) Var[X₁] = σ²/n

Therefore, Var[µ_MLE] → 0 as n → ∞.

(⋆) Remember that Var[aX] = a² Var[X].
d. Now derive the MAP estimator for the mean µ. Assume that the prior distribution for the mean is itself a normal distribution with mean ν and variance β².
Solution 1:

P(µ | x₁, . . . , xₙ) =(T. Bayes)= P(x₁, . . . , xₙ | µ) P(µ) / P(x₁, . . . , xₙ)   (16)
  = [Π_{i=1}^n (1/(√(2π) σ)) e^{−(x_i − µ)²/(2σ²)}] · (1/(√(2π) β)) e^{−(µ − ν)²/(2β²)} / C   (17)

where C =not. P(x₁, . . . , xₙ).

⇒ ln P(µ | x₁, . . . , xₙ) = Σ_{i=1}^n [−ln(√(2π) σ) − (x_i − µ)²/(2σ²)] − ln(√(2π) β) − (µ − ν)²/(2β²) − ln C

⇒ ∂/∂µ ln P(µ | x₁, . . . , xₙ) = Σ_{i=1}^n (x_i − µ)/σ² − (µ − ν)/β²

∂/∂µ ln P(µ | x₁, . . . , xₙ) = 0 ⇔ Σ_{i=1}^n (x_i − µ)/σ² = (µ − ν)/β² ⇔ µ (n/σ² + 1/β²) = (Σ_{i=1}^n x_i)/σ² + ν/β²

⇒ µ_MAP = (σ²ν + β² Σ_{i=1}^n x_i) / (σ² + nβ²)
Solution 2:
Instead of computing the derivative of the posterior distribution P(µ | x₁, . . . , xₙ), we will first show that the right hand side of (17) is itself a Gaussian, and then we will use the fact that the mean of a Gaussian is where it achieves its maximum value.

P(µ | x₁, . . . , xₙ) = (1/C) [Π_{i=1}^n (1/(√(2π) σ)) e^{−(x_i − µ)²/(2σ²)}] · (1/(√(2π) β)) e^{−(µ − ν)²/(2β²)}

  = const · exp(−(Σ_{i=1}^n (x_i − µ)²)/(2σ²) − (µ − ν)²/(2β²))

  = const · exp(−(β² Σ_{i=1}^n (x_i − µ)² + σ²(µ − ν)²)/(2σ²β²))

  = const · exp(−((nβ² + σ²)/(2σ²β²)) µ² + ((β² Σ_{i=1}^n x_i + νσ²)/(σ²β²)) µ − (β² Σ_{i=1}^n x_i² + ν²σ²)/(2σ²β²))
P(µ | x₁, . . . , xₙ) =
  = const · exp(−[µ² − 2µ (β² Σ_{i=1}^n x_i + νσ²)/(nβ² + σ²) + (β² Σ_{i=1}^n x_i² + ν²σ²)/(nβ² + σ²)] / (2σ²β²/(nβ² + σ²)))

  = const · exp(−[(µ − (β² Σ_{i=1}^n x_i + νσ²)/(nβ² + σ²))² − ((β² Σ_{i=1}^n x_i + νσ²)/(nβ² + σ²))² + (β² Σ_{i=1}^n x_i² + ν²σ²)/(nβ² + σ²)] / (2σ²β²/(nβ² + σ²)))

  = const · exp(−(µ − (β² Σ_{i=1}^n x_i + νσ²)/(nβ² + σ²))² / (2σ²β²/(nβ² + σ²))) ·
    exp([((β² Σ_{i=1}^n x_i + νσ²)/(nβ² + σ²))² − (β² Σ_{i=1}^n x_i² + ν²σ²)/(nβ² + σ²)] / (2σ²β²/(nβ² + σ²)))

  = const′ · exp(−(µ − (β² Σ_{i=1}^n x_i + νσ²)/(nβ² + σ²))² / (2σ²β²/(nβ² + σ²)))

The exp term in the last equality being a Gaussian of mean (β² Σ_{i=1}^n x_i + νσ²)/(nβ² + σ²) and variance σ²β²/(nβ² + σ²), it follows that its maximum is obtained for

µ = (β² Σ_{i=1}^n x_i + νσ²)/(nβ² + σ²) = µ_MAP.
e. Please comment on what happens to the MLE and MAP estimators for the mean µ as the number of samples n goes to infinity.

Solution:

µ_MLE = (Σ_{i=1}^n x_i)/n

µ_MAP = (σ²ν + β² Σ_{i=1}^n x_i)/(σ² + nβ²) = σ²ν/(σ² + nβ²) + (β² Σ_{i=1}^n x_i)/(σ² + nβ²)
      = σ²ν/(σ² + nβ²) + ((1/n) Σ_{i=1}^n x_i)/(1 + σ²/(nβ²)) = σ²ν/(σ² + nβ²) + µ_MLE/(1 + σ²/(nβ²))

n → ∞ ⇒ σ²ν/(σ² + nβ²) → 0 and σ²/(nβ²) → 0 ⇒ µ_MAP → µ_MLE

The Gaussian [uni-variate] distribution:


estimating σ 2 when µ = 0
CMU, 2009 spring, Ziv Bar-Joseph, HW1, pr. 2.1
Let X be a random variable distributed according to a Normal distribution with 0 mean and σ² variance, i.e. X ∼ N(0, σ²).
a. Find the maximum likelihood estimate for σ², i.e. σ²_MLE.
Solution:
Let X₁, X₂, . . . , Xₙ be drawn i.i.d. ∼ N(0, σ²). Let f be the density function corresponding to X. Then we can write the likelihood function as:

L(X₁, X₂, . . . , Xₙ | σ²) = Π_{i=1}^n f(X_i; µ = 0, σ²)
  = Π_{i=1}^n (1/(√(2π) σ)) exp(−(X_i − 0)²/(2σ²)) = (1/(√(2π) σ))ⁿ exp(−(Σ_{i=1}^n X_i²)/(2σ²))

⇒ ln L = constant − (n/2) ln σ² − (1/(2σ²)) Σ_{i=1}^n X_i²

⇒ ∂ ln L/∂σ² = −n/(2σ²) + (1/(2σ⁴)) Σ_{i=1}^n X_i².   Therefore, ∂ ln L/∂σ² = 0 ⇔ σ²_MLE = (1/n) Σ_{i=1}^n X_i²

Note: It can be easily shown that L(X₁, X₂, . . . , Xₙ | σ²) indeed reaches its maximum for σ² = (1/n) Σ_{i=1}^n X_i².
b. Is the estimator you obtained biased?

Solution:
It is unbiased, since:

E[(1/n) Σ_{i=1}^n X_i²] = (n/n) E[X²] = E[X²]   (since i.i.d.)
                        = Var[X] + (E[X])²
                        = Var[X] = σ²   (since E[X] = 0)

The Gaussian [uni-variate] distribution:


estimating σ 2 (without restrictions on µ)
CMU, 2010 fall, Ziv Bar-Joseph, HW1, pr. 2.1.1-2
Let x =not. (x₁, . . . , xₙ) be observed i.i.d. samples from a Gaussian distribution N(x | µ, σ²).
a. Derive σ²_MLE, the MLE for σ².
Solution:
The p.d.f. for N(x | µ, σ²) has the form f(x) = (1/(√(2π) σ)) e^{−(x − µ)²/(2σ²)}.
The log likelihood function of the data x is:

ln L(x | µ, σ²) = Σ_{i=1}^n ln f(x_i) = Σ_{i=1}^n [−½ ln(2πσ²) − (x_i − µ)²/(2σ²)]
                = −(n/2) ln(2πσ²) − (1/(2σ²)) Σ_{i=1}^n (x_i − µ)²

The partial derivative of ln L w.r.t. σ²:   ∂ ln L(x | µ, σ²)/∂σ² = −n/(2σ²) + (1/(2σ⁴)) Σ_{i=1}^n (x_i − µ)².

Solving the equation ∂ ln L(x | µ, σ²)/∂σ² = 0, we get: σ²_MLE = (1/n) Σ_{i=1}^n (x_i − µ_MLE)².

Note that we had to take into account the optimal value of µ (see problem CMU, 2011 fall, T. Mitchell, A. Singh, HW2, pr. 1).
b. Show that E[σ²_MLE] = ((n − 1)/n) σ².

Solution:

E[σ²_MLE] = E[(1/n) Σ_{i=1}^n (x_i − µ_MLE)²] = E[(x₁ − µ_MLE)²]   (the n terms are identically distributed)
  = E[(x₁ − (1/n) Σ_{i=1}^n x_i)²]
  = E[x₁² − (2/n) x₁ Σ_{i=1}^n x_i + (1/n²) (Σ_{i=1}^n x_i)²]
  = E[x₁² − (2/n) x₁ Σ_{i=1}^n x_i + (1/n²) (Σ_{i=1}^n x_i² + 2 Σ_{i<j} x_i x_j)]
  = E[x₁²] + (1/n²) Σ_{i=1}^n E[x_i²] − (2/n) Σ_{i=1}^n E[x₁ x_i] + (2/n²) Σ_{i<j} E[x_i x_j]
  = E[x₁²] + (1/n²) n E[x₁²] − (2/n) E[x₁²] − (2/n)(n − 1) E[x₁x₂] + (2/n²) (n(n − 1)/2) E[x₁x₂]
  = ((n − 1)/n) E[x₁²] − ((n − 1)/n) E[x₁x₂]
σ² = Var(x₁) = E[x₁²] − (E[x₁])² = E[x₁²] − µ² ⇒ E[x₁²] = σ² + µ²

Because x₁ and x₂ are independent, it follows that Cov(x₁, x₂) = 0. Therefore,

0 = Cov(x₁, x₂) = E[(x₁ − E[x₁])(x₂ − E[x₂])] = E[(x₁ − µ)(x₂ − µ)]
  = E[x₁x₂] − µ E[x₁ + x₂] + µ² = E[x₁x₂] − µ(E[x₁] + E[x₂]) + µ²
  = E[x₁x₂] − µ(2µ) + µ² = E[x₁x₂] − µ²

So, E[x₁x₂] = µ².
By substituting E[x₁²] = σ² + µ² and E[x₁x₂] = µ² into the previously obtained equality (E[σ²_MLE] = ((n − 1)/n) E[x₁²] − ((n − 1)/n) E[x₁x₂]), we get:

E[σ²_MLE] = ((n − 1)/n)(σ² + µ²) − ((n − 1)/n) µ² = ((n − 1)/n) σ²
c. Find an unbiased estimator for σ².

Solution:
It can be immediately proven that (1/(n − 1)) Σ_{i=1}^n (x_i − µ_MLE)² is an unbiased estimator of σ².
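A numerical illustration of the bias and of its correction (a sketch assuming Python with numpy; the ddof argument selects the divisor n − ddof):

    # Average the two variance estimators over many small samples.
    import numpy as np
    rng = np.random.default_rng(0)
    n, sigma2 = 5, 4.0
    x = rng.normal(0.0, np.sqrt(sigma2), size=(200_000, n))
    mle = x.var(axis=1, ddof=0)               # (1/n) Σ (x_i − mean)²
    unbiased = x.var(axis=1, ddof=1)          # (1/(n−1)) Σ (x_i − mean)²
    print(mle.mean(), (n - 1) / n * sigma2)   # ≈ 3.2 : biased downward
    print(unbiased.mean(), sigma2)            # ≈ 4.0 : unbiased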

The Gamma distribution:


Maximum Likelihood Estimation of parameters
Liviu Ciortuz, 2017

The Gamma distribution of parameters r > 0 and β > 0 has the following density function:

Gamma(x | r, β) = (1/(β^r Γ(r))) x^{r−1} e^{−x/β},   for all x > 0,

where the Γ symbol designates Euler's Gamma function.

[plot: Gamma densities p(x) over x ∈ [0, 20] for several parameter settings: r = 1.0, β = 2.0; r = 2.0, β = 2.0; r = 3.0, β = 2.0; r = 5.0, β = 1.0; r = 9.0, β = 0.5; r = 7.5, β = 1.0; r = 0.5, β = 1.0]

Notes:
1. In the above definition, 1/(β^r Γ(r)) is the so-called normalization factor, since it does not depend on x, and ∫₀^{+∞} x^{r−1} e^{−x/β} dx = β^r Γ(r).
2. Euler's Gamma function is defined as follows: Γ(r) = ∫₀^{+∞} t^{r−1} e^{−t} dt, for all r ∈ R except the non-positive integers. Starting from the definition of Γ, it can be easily shown that Γ(r + 1) = rΓ(r) for any r > 0, and also Γ(1) = 1. Therefore, for any positive integer r, Γ(r + 1) = r · Γ(r) = r · (r − 1) · Γ(r − 1) = . . . = r · (r − 1) · . . . · 2 · 1 = r!, which means that the Γ function generalizes the factorial function.
3. The exponential distribution is a member of the Gamma family of distributions. (Just set r to 1 in Gamma's density function.)
112.

Consider x_1, . . . , x_n ∈ R⁺, all of them [having been] generated by one component of the above family of distributions.
Find the maximum likelihood estimation of the parameters r and β.

Solution
• The likelihood function:

L(r, β) = P(x_1, . . . , x_n | r, β) = ∏_{i=1}^n P(x_i | r, β)   (by the i.i.d. assumption)
= β^{−rn} (Γ(r))^{−n} (∏_{i=1}^n x_i)^{r−1} e^{−(1/β) Σ_{i=1}^n x_i}

• The log-likelihood function:

ℓ(r, β) = ln L(r, β) = −rn ln β − n ln Γ(r) + (r − 1) Σ_{i=1}^n ln x_i − (1/β) Σ_{i=1}^n x_i.
113.

Now we will calculate the partial derivative of ℓ(r, β) w.r.t. β, and then equate it to 0:

∂ℓ(r, β)/∂β = −rn/β + (1/β²) Σ_{i=1}^n x_i = (1/β²) [Σ_{i=1}^n x_i − rnβ]

∂ℓ(r, β)/∂β = 0 ⇔ β̂ = (1/(rn)) Σ_{i=1}^n x_i > 0.

By substituting β̂ into ℓ(r, β), we will get:

ℓ(r, β̂) = −rn ln β̂ − n ln Γ(r) + (r − 1) Σ_{i=1}^n ln x_i − (1/β̂) Σ_{i=1}^n x_i
= rn ln(rn) − rn ln(Σ_{i=1}^n x_i) − n ln Γ(r) + (r − 1) Σ_{i=1}^n ln x_i − (rn / Σ_{i=1}^n x_i) · Σ_{i=1}^n x_i
= rn ln(rn) − rn (ln(Σ_{i=1}^n x_i) + 1) − n ln Γ(r) + (r − 1) Σ_{i=1}^n ln x_i
114.

Therefore, by computing the partial derivative of ℓ(r, β̂) with respect to r, and then equating this derivative to 0, we will get:

∂ℓ(r, β̂)/∂r = 0 ⇔ n ln(nr) + n − n(ln(Σ_{i=1}^n x_i) + 1) − n · Γ′(r)/Γ(r) + Σ_{i=1}^n ln x_i = 0
⇔ n(ln r − ψ(r)) = −n ln n − Σ_{i=1}^n ln x_i + n ln(Σ_{i=1}^n x_i)
⇔ ln r − ψ(r) = −ln n − (1/n) Σ_{i=1}^n ln x_i + ln(Σ_{i=1}^n x_i),

where ψ = Γ′/Γ denotes the digamma function.

The solution of the last equation is r̂, the maximum likelihood estimation of the parameter r.
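The last equation has no closed-form solution, but its left-hand side, ln r − ψ(r), is strictly decreasing in r (from +∞ down to 0), so r̂ can be found numerically by bracketing the root. Below is a minimal sketch of this (our addition, not part of the original exercise), assuming SciPy is available; the helper name gamma_mle is ours.

import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

def gamma_mle(x):
    # Right-hand side of: ln r - psi(r) = ln(sum x) - ln n - (1/n) sum ln x
    n = len(x)
    rhs = np.log(np.sum(x)) - np.log(n) - np.mean(np.log(x))
    # ln r - psi(r) decreases strictly from +inf (r -> 0) to 0 (r -> inf),
    # so the root can be bracketed and found with Brent's method.
    f = lambda r: np.log(r) - digamma(r) - rhs
    r_hat = brentq(f, 1e-8, 1e8)
    beta_hat = np.sum(x) / (r_hat * n)   # beta-hat = (1/(rn)) sum x_i
    return r_hat, beta_hat

rng = np.random.default_rng(0)
x = rng.gamma(shape=3.0, scale=2.0, size=10_000)   # r = 3, beta = 2
print(gamma_mle(x))   # approximately (3.0, 2.0)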
115.

The Gaussian multi-variate distribution:


ML estimation of
the mean and the precision matrix, Λ
(Λ is the inverse of the covariance matrix, Σ)

CMU, 2010 fall, Aarti Singh, HW1, pr. 3.2.1


116.

The density function of a d-dimensional Gaussian distribution is as follows:

N(x | µ, Λ⁻¹) = exp(−(1/2)(x − µ)⊤ Λ (x − µ)) / ((2π)^{d/2} |Λ⁻¹|^{1/2}),

where Λ is the inverse of the covariance matrix, or the so-called precision matrix. Let {x_1, x_2, . . . , x_n} be an i.i.d. sample from a d-dimensional Gaussian distribution.
Suppose that n ≫ d. Derive the MLE estimates µ̂ and Λ̂.
117.
Hint
You may find useful the following formulas (taken from Matrix Identities, by Sam Roweis, 1999):

(2b) |A⁻¹| = 1/|A|
(2e) Tr(AB) = Tr(BA);a more generally, Tr(ABC . . .) = Tr(BC . . . A) = Tr(C . . . AB) = . . .
(3b) ∂/∂X Tr(XA) = ∂/∂X Tr(AX) = A⊤
(4b) ∂/∂X ln |X| = (X⁻¹)⊤ = (X⊤)⁻¹
(5c) ∂/∂X a⊤Xb = ab⊤
(5g) ∂/∂X (Xa + b)⊤ C (Xa + b) = (C + C⊤)(Xa + b)a⊤

Tr(A), the trace of an n-by-n square matrix A, is defined as the sum of the elements on the main diagonal (the diagonal from the upper left to the lower right) of A, i.e., a₁₁ + . . . + aₙₙ.
a See Theorem 1.3.d from Matrix Analysis for Statistics, 2017, James R. Schott.
118.
Given the x_1, . . . , x_n data, the log-likelihood function is:

l(µ, Λ) = ln ∏_{i=1}^n N(x_i | µ, Λ⁻¹) = Σ_{i=1}^n ln N(x_i | µ, Λ⁻¹)   (by the i.i.d. assumption)
= −(nd/2) ln(2π) − (n/2) ln |Λ⁻¹| − (1/2) Σ_{i=1}^n (x_i − µ)⊤ Λ (x_i − µ)
= −(nd/2) ln(2π) + (n/2) ln |Λ| − (1/2) Σ_{i=1}^n (x_i − µ)⊤ Λ (x_i − µ)   (by (2b)).

For any fixed positive definite precision matrix Λ, the log-likelihood is a quadratic function of µ with a negative leading coefficient, hence a strictly concave function of µ. We then solve

∇_µ l(µ, Λ) = 0 ⇔ −(1/2)(Λ + Λ⊤) Σ_{i=1}^n (x_i − µ)(−1) = 0 ⇔ Λ Σ_{i=1}^n (x_i − µ) = 0 ⇔ Λ Σ_{i=1}^n x_i = nΛµ,   (18)

where we used (5g).

From (18) we get, by the assumption that Λ is invertible, the following estimate of µ:

µ̂ = (Σ_{i=1}^n x_i)/n,

which coincides with the sample mean x̄ and is constant w.r.t. Λ.
119.
Now that we have

l(µ, Λ) ≤ l(µ̂, Λ) ∀µ ∈ R^d, Λ being positive definite,

we continue to consider Λ by first plugging µ̂ back into the log-likelihood function:

l(µ̂, Λ) = −(nd/2) ln(2π) + (n/2) ln |Λ| − (1/2) Σ_{i=1}^n (x_i − x̄)⊤ Λ (x_i − x̄)   (19)
= −(nd/2) ln(2π) + (n/2)(ln |Λ| − Tr(SΛ)),   (20)

where (20) follows using (2e), and S is the sample covariance matrix: S = (1/n) Σ_{i=1}^n (x_i − x̄)(x_i − x̄)⊤.

Explanation:
(x_i − x̄)⊤ Λ (x_i − x̄) is a 1-by-1 matrix, therefore (x_i − x̄)⊤ Λ (x_i − x̄) = Tr((x_i − x̄)⊤ Λ (x_i − x̄)), and using the (2e) rule, it can be further written as Tr((x_i − x̄)(x_i − x̄)⊤ Λ).
Using another simple rule, Tr(A + B) = Tr(A) + Tr(B) (which can be easily proven), it follows that

Σ_{i=1}^n (x_i − x̄)⊤ Λ (x_i − x̄) = Σ_{i=1}^n Tr((x_i − x̄)⊤ Λ (x_i − x̄)) = Σ_{i=1}^n Tr((x_i − x̄)(x_i − x̄)⊤ Λ)
= Tr(Σ_{i=1}^n (x_i − x̄)(x_i − x̄)⊤ Λ) = Tr((nS)Λ) = n Tr(SΛ).
120.

By the fact that ln |Λ| is strictly concave on the domain of positive definite
Λ,a and that Tr(SΛ) is linear in Λ, we are able to find the maximum of the
expression (20) by solving
∇Λ l(µ̂, Λ) = 0,

which can be provenb to be equivalent to


Λ−1 − S = 0. (Therefore, Σ̂ = Λ̂−1 = S.)
Since n ≫ d, we can safely assume that S is invertible and get the following
estimate:
Λ̂ = S −1 .
a See, for example, Section 3.1.5, Convex Optimization: http://www.stanford.edu/∼boyd/cvxbook/.
b Applying (4b) and (3b) on (20) you’ll get (Λ⊤ )−1 − S ⊤ . Then (Λ⊤ )−1 − S ⊤ = 0 ⇔ (Λ−1 )⊤ = S ⊤ ⇔ Λ−1 = S.
121.
Notes
1. In the above derivation, we have ensured that the estimates µ̂ and Λ̂ are in the parameter space and satisfy

l(µ, Λ) ≤ l(µ̂, Λ) ≤ l(µ̂, Λ̂) ∀µ ∈ R^d, Λ being positive definite,

so they are the MLE estimates.

2. Instead of using the relation (20), i.e., working with the Tr functional, one could directly compute the partial derivative of l(µ̂, Λ) in (19):

∇_Λ l(µ̂, Λ) = (n/2)(Λ⊤)⁻¹ − (1/2) Σ_{i=1}^n (x_i − µ̂)(x_i − µ̂)⊤   (by (4b) and (5c))
= (n/2) Λ⁻¹ − (1/2) Σ_{i=1}^n (x_i − µ̂)(x_i − µ̂)⊤ = (n/2) Λ⁻¹ − (n/2) S.

So,

∇_Λ l(µ̂, Λ) = 0 ⇔ Λ̂⁻¹ = Σ̂ = S = (1/n) Σ_{i=1}^n (x_i − µ̂)(x_i − µ̂)⊤.
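A short numerical illustration of the result µ̂ = x̄ and Λ̂ = S⁻¹ (our addition, not part of the original solution), assuming NumPy; all names in it are ours.

import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 100_000                       # n >> d, as assumed in the problem
mu_true = np.array([1.0, -2.0, 0.5])
A = rng.standard_normal((d, d))
Sigma_true = A @ A.T + d * np.eye(d)    # a positive definite covariance
x = rng.multivariate_normal(mu_true, Sigma_true, size=n)

mu_hat = x.mean(axis=0)                 # the sample mean
xc = x - mu_hat
S = (xc.T @ xc) / n                     # MLE covariance (divide by n, not n - 1)
Lambda_hat = np.linalg.inv(S)           # the estimated precision matrix

print(np.allclose(mu_hat, mu_true, atol=0.05))   # True
print(np.allclose(S, Sigma_true, atol=0.1))      # True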
122.

Elements of Information Theory:


Some examples and then some useful proofs
123.

Computing entropies and specific conditional entropies


for discrete random variables
CMU, 2012 spring, R. Rosenfeld, HW2, pr. 2
124.

On the roll of two six-sided fair dice,

a. Calculate the distribution of the sum (S) of the two dice.

b. The amount of information (or surprise) when seeing the outcome x for a random variable X is defined as log₂ (1/P(X = x)) = −log₂ P(X = x). How surprised are you (in bits) to observe S = 2, S = 11, S = 5, S = 7?

c. Calculate the entropy of S [as the expected value of the random variable −log₂ P(X = x)].

d. Let's say you throw the dice one by one, and the first die shows 4. What is the entropy of S after this observation? Was any information gained / lost in the process? If so, calculate how much information (in bits) was lost or gained.
125.

a.
S      2     3     4     5     6     7     8     9     10    11    12
P(S)   1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36

b.
Information(S = 2) = −log₂(1/36) = log₂ 36 = 2 log₂ 6 = 2(1 + log₂ 3) ≈ 5.169925001 bits
Information(S = 11) = −log₂(2/36) = log₂ 18 = 1 + 2 log₂ 3 ≈ 4.169925001 bits
Information(S = 5) = −log₂(4/36) = log₂ 9 = 2 log₂ 3 ≈ 3.169925001 bits
Information(S = 7) = −log₂(6/36) = log₂ 6 = 1 + log₂ 3 ≈ 2.584962501 bits
126.

c.

H(S) = −Σ_i p_i log₂ p_i
= −[2 · (1/36) log₂(1/36) + 2 · (2/36) log₂(2/36) + 2 · (3/36) log₂(3/36) + 2 · (4/36) log₂(4/36) + 2 · (5/36) log₂(5/36) + (6/36) log₂(6/36)]
= (1/36)(2 log₂ 36 + 4 log₂ 18 + 6 log₂ 12 + 8 log₂ 9 + 10 log₂(36/5) + 6 log₂ 6)
= (1/36)(2 log₂ 6² + 4 log₂(6 · 3) + 6 log₂(6 · 2) + 8 log₂ 3² + 10 log₂(6²/5) + 6 log₂ 6)
= (1/36)(40 log₂ 6 + 20 log₂ 3 + 6 − 10 log₂ 5)
= (1/36)(60 log₂ 3 + 46 − 10 log₂ 5) ≈ 3.274401919 bits.
127.

d.

S          2  3  4  5    6    7    8    9    10   11  12
P(S | ...) 0  0  0  1/6  1/6  1/6  1/6  1/6  1/6  0   0

H(S | First-die-shows-4) = −6 · (1/6) log₂(1/6) = log₂ 6 ≈ 2.58 bits,

IG(S; First-die-shows-4) = H(S) − H(S | First-die-shows-4) = 3.27 − 2.58 = 0.69 bits.
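The whole exercise can be replayed in a few lines of Python (our addition, not part of the original solution):

from fractions import Fraction
from math import log2

# Distribution of the sum S of two fair dice
p = {}
for d1 in range(1, 7):
    for d2 in range(1, 7):
        p[d1 + d2] = p.get(d1 + d2, Fraction(0)) + Fraction(1, 36)

H = -sum(float(q) * log2(q) for q in p.values())
print(round(H, 6))            # 3.274402 bits

# After observing that the first die shows 4, S is uniform over {5, ..., 10}
H_cond = log2(6)
print(round(H - H_cond, 2))   # information gained: 0.69 bits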


128.

Computing entropies and average conditional entropies


for discrete random variables
CMU, 2012 spring, Roni Rosenfeld, HW2, pr. 3
129.

A doctor needs to diagnose whether a person has a cold (C). The primary factor
he considers in his diagnosis is the outside temperature (T). The random
variable C takes two values, yes / no, and the random variable T takes 3
values, sunny, rainy, snowy. The joint distribution of the two variables is
given in the following table.

T = sunny T = rainy T = snowy


C = no 0.30 0.20 0.10
C = yes 0.05 0.15 0.20

a. Calculate the marginal probabilities P (C), P (T ).


Hint: Use P(X = x) = Σ_y P(X = x, Y = y). For example,

P (C = no) = P (C = no, T = sunny) + P (C = no, T = rainy) + P (C = no, T = snowy).

b. Calculate the entropies H(C), H(T ).


c. Calculate the average conditional entropies H(C|T ), H(T |C).
130.

a. P_C = (0.6, 0.4) and P_T = (0.35, 0.35, 0.30).

b.
H(C) = 0.6 log₂(5/3) + 0.4 log₂(5/2) = log₂ 5 − 0.6 log₂ 3 − 0.4 ≈ 0.971 bits
H(T) = 2 · 0.35 log₂(20/7) + 0.3 log₂(10/3)
= 0.7(2 + log₂ 5 − log₂ 7) + 0.3(1 + log₂ 5 − log₂ 3)
= 1.7 + log₂ 5 − 0.7 log₂ 7 − 0.3 log₂ 3 ≈ 1.581 bits.
131.
c.

H(C|T) = Σ_{t∈Val(T)} P(T = t) · H(C|T = t)   (by definition)
= P(T = sunny) · H(C|T = sunny) + P(T = rainy) · H(C|T = rainy) + P(T = snowy) · H(C|T = snowy)
= 0.35 · H(0.30/0.35, 0.05/0.35) + 0.35 · H(0.20/0.35, 0.15/0.35) + 0.30 · H(0.10/0.30, 0.20/0.30)
= (7/20) · H(6/7, 1/7) + (7/20) · H(4/7, 3/7) + (3/10) · H(1/3, 2/3)
= (7/20) · [(6/7) log₂(7/6) + (1/7) log₂ 7] + (7/20) · [(4/7) log₂(7/4) + (3/7) log₂(7/3)] + (3/10) · [(1/3) log₂ 3 + (2/3) log₂(3/2)]
= (7/20) · [log₂ 7 − 6/7 − (6/7) log₂ 3] + (7/20) · [log₂ 7 − 8/7 − (3/7) log₂ 3] + (3/10) · [log₂ 3 − 2/3]
= (7/10) log₂ 7 − (3/20) log₂ 3 − 9/10 ≈ 0.8274 bits.
132.

H(T|C) = Σ_{c∈Val(C)} P(C = c) · H(T|C = c)   (by definition)
= P(C = no) · H(T|C = no) + P(C = yes) · H(T|C = yes)
= 0.60 · H(0.30/0.60, 0.20/0.60, 0.10/0.60) + 0.40 · H(0.05/0.40, 0.15/0.40, 0.20/0.40)
= (3/5) · H(1/2, 1/3, 1/6) + (2/5) · H(1/8, 3/8, 1/2)
= (3/5) · [1/2 + (1/3) log₂ 3 + (1/6)(1 + log₂ 3)] + (2/5) · [(1/8) · 3 + (3/8)(3 − log₂ 3) + 1/2]
= (3/5) · [2/3 + (1/2) log₂ 3] + (2/5) · [2 − (3/8) log₂ 3]
= 6/5 + (3/20) log₂ 3 ≈ 1.43774 bits.
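The same numbers can be recomputed directly from the joint table with a short sketch (our addition), assuming NumPy:

import numpy as np

# Joint distribution P(C, T): rows C = no/yes, columns T = sunny/rainy/snowy
P = np.array([[0.30, 0.20, 0.10],
              [0.05, 0.15, 0.20]])

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

P_C, P_T = P.sum(axis=1), P.sum(axis=0)
H_C_given_T = sum(P_T[j] * H(P[:, j] / P_T[j]) for j in range(3))
H_T_given_C = sum(P_C[i] * H(P[i, :] / P_C[i]) for i in range(2))
print(H(P_C), H(P_T))             # ~0.971 and ~1.581 bits
print(H_C_given_T, H_T_given_C)   # ~0.8274 and ~1.4377 bits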
133.

Computing the entropy of the exponential distribution

CMU, 2011 spring, R. Rosenfeld, HW2, pr. 2.c

[Figure: the exponential p.d.f. p(x) = λe^{−λx} on x ∈ (0, 5), for λ = 0.5, λ = 1 and λ = 1.5.]
134.

For a continuous probability distribution P, the entropy is defined as follows:

H(P) = ∫_{−∞}^{+∞} P(x) log₂ (1/P(x)) dx

Compute the entropy of the continuous exponential distribution of parameter λ > 0. This distribution is defined as follows:

P(x) = λe^{−λx}, if x ≥ 0;  P(x) = 0, if x < 0.

Hint: Whenever P(x) = 0, you will assume −P(x) log₂ P(x) = 0.


135.
Answer

H(P) = ∫_{−∞}^0 P(x) log₂ (1/P(x)) dx + ∫_0^∞ P(x) log₂ (1/P(x)) dx
= ∫_{−∞}^0 0 · log₂ 0 dx + ∫_0^∞ λe^{−λx} log₂ (1/(λe^{−λx})) dx = ∫_0^∞ λe^{−λx} log₂ (1/(λe^{−λx})) dx

(the first integral is 0, by the convention stated in the hint)

⇒ H(P) = (1/ln 2) ∫_0^∞ λe^{−λx} ln (1/(λe^{−λx})) dx = (1/ln 2) ∫_0^∞ λe^{−λx} [ln(1/λ) + ln(1/e^{−λx})] dx
= (1/ln 2) ∫_0^∞ λe^{−λx} (−ln λ + ln e^{λx}) dx = (1/ln 2) ∫_0^∞ λe^{−λx} (−ln λ + λx) dx
= (1/ln 2) ∫_0^∞ λe^{−λx} (−ln λ) dx + (1/ln 2) ∫_0^∞ λe^{−λx} λx dx
= (−ln λ / ln 2) ∫_0^∞ λe^{−λx} dx + (λ/ln 2) ∫_0^∞ λe^{−λx} x dx
= (ln λ / ln 2) ∫_0^∞ (e^{−λx})′ dx − (λ/ln 2) ∫_0^∞ (e^{−λx})′ x dx
136.

The first integral is easy to evaluate:

∫_0^∞ (e^{−λx})′ dx = e^{−λx} |_0^∞ = e^{−∞} − e^0 = 0 − 1 = −1

The second integral can be solved using the integration-by-parts formula:

∫_0^∞ (e^{−λx})′ x dx = e^{−λx} x |_0^∞ − ∫_0^∞ e^{−λx} x′ dx = e^{−λx} x |_0^∞ − ∫_0^∞ e^{−λx} dx

The term e^{−λx} x |_0^∞ cannot be computed directly (because of the 0 · ∞ conflict that occurs when x takes the limit value ∞); instead, it is computed using l'Hôpital's rule:

lim_{x→∞} x e^{−λx} = lim_{x→∞} x/e^{λx} = lim_{x→∞} x′/(e^{λx})′ = lim_{x→∞} 1/(λe^{λx}) = (1/λ) lim_{x→∞} e^{−λx} = (1/λ) · e^{−∞} = 0,

so e^{−λx} x |_0^∞ = 0 − 0 = 0.
137.

The integral ∫_0^∞ e^{−λx} dx is computed easily:

∫_0^∞ e^{−λx} dx = −(1/λ) ∫_0^∞ (e^{−λx})′ dx = −(1/λ) e^{−λx} |_0^∞ = −(1/λ)(0 − 1) = 1/λ

Therefore,

∫_0^∞ (e^{−λx})′ x dx = 0 − 1/λ = −1/λ,

which leads to the final result:

H(P) = (ln λ / ln 2)(−1) − (λ/ln 2)(−1/λ) = −ln λ/ln 2 + 1/ln 2 = (1 − ln λ)/ln 2.
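The closed form H(P) = (1 − ln λ)/ln 2 can be checked by numerical integration (our addition, not part of the original solution), assuming SciPy:

import numpy as np
from scipy.integrate import quad

for lam in (0.5, 1.0, 1.5):
    p = lambda x, lam=lam: lam * np.exp(-lam * x)
    # differential entropy -∫ p(x) log2 p(x) dx over the support (0, ∞)
    H_num, _ = quad(lambda x: -p(x) * np.log2(p(x)), 0, np.inf)
    H_closed = (1 - np.log(lam)) / np.log(2)
    print(lam, round(H_num, 6), round(H_closed, 6))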
138.

Entropy, joint entropy,

conditional entropy, information gain:
definitions and immediate properties
CMU, 2005 fall, T. Mitchell, A. Moore, HW1, pr. 2
139.
Definitions
• The entropy of the variable X:
H(X) = −Σ_i P(X = x_i) log P(X = x_i) = E_X[−log P(X)].
• The specific conditional entropy of the variable Y with respect to the value x_k of the variable X:
H(Y | X = x_k) = −Σ_j P(Y = y_j | X = x_k) log P(Y = y_j | X = x_k) = E_{Y|X=x_k}[−log P(Y | X = x_k)].
• The average conditional entropy of the variable Y with respect to the variable X:
H(Y | X) = Σ_k P(X = x_k) H(Y | X = x_k) = E_X[H(Y | X)].

• The joint entropy of the variables X and Y:

H(X, Y) = −Σ_i Σ_j P(X = x_i, Y = y_j) log P(X = x_i, Y = y_j) = E_{X,Y}[−log P(X, Y)].
• The mutual information of the variables X and Y, also called the information gain of the variable X with respect to the variable Y (or vice versa):
MI(X, Y) = IG(X, Y) = H(X) − H(X | Y) = H(Y) − H(Y | X)
(Remark: the last equality above holds due to the result proven at part c below.)
140.

a.
H(X) ≥ 0.

H(X) = −Σ_i P(X = x_i) log P(X = x_i) = Σ_i P(X = x_i) log (1/P(X = x_i)) ≥ 0,

since every factor P(X = x_i) is ≥ 0 and every factor log(1/P(X = x_i)) is ≥ 0.

Moreover, H(X) = 0 if and only if the variable X is constant:

"⇒" Assume H(X) = 0, i.e., Σ_i P(X = x_i) log (1/P(X = x_i)) = 0. Since every term of this sum is greater than or equal to 0, it follows that H(X) = 0 only if for all i, P(X = x_i) = 0 or log (1/P(X = x_i)) = 0, i.e., only if for all i, P(X = x_i) = 0 or P(X = x_i) = 1. But since Σ_i P(X = x_i) = 1, it follows that there is a single value x_1 of X such that P(X = x_1) = 1, while P(X = x) = 0 for any x ≠ x_1. In other words, the discrete random variable X is constant.
"⇐" Assume the variable X is constant, which means that X takes a single value x_1, with probability P(X = x_1) = 1. Therefore, H(X) = −1 · log 1 = 0.
141.

b.
H(Y | X) = −Σ_i Σ_j P(X = x_i, Y = y_j) log P(Y = y_j | X = x_i)

H(Y | X) = Σ_i P(X = x_i) H(Y | X = x_i)
= Σ_i P(X = x_i) [−Σ_j P(Y = y_j | X = x_i) log P(Y = y_j | X = x_i)]
= −Σ_i Σ_j P(X = x_i) P(Y = y_j | X = x_i) log P(Y = y_j | X = x_i)
= −Σ_i Σ_j P(X = x_i, Y = y_j) log P(Y = y_j | X = x_i),

where we used P(X = x_i) P(Y = y_j | X = x_i) = P(X = x_i, Y = y_j).
142.
c.
H(X, Y) = H(X) + H(Y | X) = H(Y) + H(X | Y)

H(X, Y) = −Σ_i Σ_j p(x_i, y_j) log p(x_i, y_j)
= −Σ_i Σ_j p(x_i) · p(y_j | x_i) log[p(x_i) · p(y_j | x_i)]
= −Σ_i Σ_j p(x_i) · p(y_j | x_i) [log p(x_i) + log p(y_j | x_i)]
= −Σ_i Σ_j p(x_i) · p(y_j | x_i) log p(x_i) − Σ_i Σ_j p(x_i) · p(y_j | x_i) log p(y_j | x_i)
= −Σ_i p(x_i) log p(x_i) · Σ_j p(y_j | x_i) − Σ_i p(x_i) Σ_j p(y_j | x_i) log p(y_j | x_i)
= H(X) + Σ_i p(x_i) H(Y | X = x_i) = H(X) + H(Y | X),

where we used Σ_j p(y_j | x_i) = 1.
143.

More generally (the chain rule):

H(X_1, . . . , X_n) = H(X_1) + H(X_2 | X_1) + . . . + H(X_n | X_1, . . . , X_{n−1})

H(X_1, . . . , X_n) = E[log (1/p(x_1, . . . , x_n))]
= −E_{p(x_1,...,x_n)}[log p(x_1, . . . , x_n)]
= −E_{p(x_1,...,x_n)}[log (p(x_1) · p(x_2 | x_1) · . . . · p(x_n | x_1, . . . , x_{n−1}))]
= −E_{p(x_1,...,x_n)}[log p(x_1) + log p(x_2 | x_1) + . . . + log p(x_n | x_1, . . . , x_{n−1})]
= −E_{p(x_1)}[log p(x_1)] − E_{p(x_1,x_2)}[log p(x_2 | x_1)] − . . . − E_{p(x_1,...,x_n)}[log p(x_n | x_1, . . . , x_{n−1})]
= H(X_1) + H(X_2 | X_1) + . . . + H(X_n | X_1, . . . , X_{n−1})
144.

An upper bound for the entropy of a discrete distribution


CMU, 2003 fall, T. Mitchell, A. Moore, HW1, pr. 1.1
145.

Let X be a discrete random variable taking n values and following the probability distribution P. According to the definition, the entropy of X is

H(X) = −Σ_{i=1}^n P(X = x_i) log₂ P(X = x_i).

Show that H(X) ≤ log₂ n.

Hint: You may use the inequality ln x ≤ x − 1, which holds for any x > 0.

[Figure: the graphs of y = ln x and y = x − 1, illustrating that the line lies above the logarithm.]
Answer
146.

H(X) = −(1/ln 2) Σ_{i=1}^n P(X = x_i) ln P(X = x_i)

Therefore,

H(X) ≤ log₂ n ⇔ −(1/ln 2) Σ_{i=1}^n P(X = x_i) ln P(X = x_i) ≤ log₂ n
⇔ −Σ_{i=1}^n P(x_i) ln P(x_i) ≤ ln n
⇔ Σ_{i=1}^n P(x_i) ln (1/P(x_i)) − (Σ_{i=1}^n P(x_i)) ln n ≤ 0   (using Σ_{i=1}^n P(x_i) = 1)
⇔ Σ_{i=1}^n P(x_i) [ln (1/P(x_i)) − ln n] ≤ 0
⇔ Σ_{i=1}^n P(x_i) ln (1/(n P(x_i))) ≤ 0
147.

Applying the inequality ln x ≤ x − 1 for x = 1/(n P(x_i)), we get:

Σ_{i=1}^n P(x_i) ln (1/(n P(x_i))) ≤ Σ_{i=1}^n P(x_i) (1/(n P(x_i)) − 1) = Σ_{i=1}^n 1/n − Σ_{i=1}^n P(x_i) = 1 − 1 = 0

Remark: This upper bound is indeed "attained". For instance, if a discrete random variable X taking n values follows the uniform distribution, it can be immediately verified that H(X) = log₂ n.
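A tiny numerical illustration (our addition): random distributions over n = 6 values never exceed log₂ n, and the uniform distribution attains the bound. Dirichlet sampling is just a convenient way to draw random distributions.

import numpy as np

rng = np.random.default_rng(0)
n = 6
for _ in range(5):
    p = rng.dirichlet(np.ones(n))            # a random distribution over n values
    print(round(-np.sum(p * np.log2(p)), 4), "<=", round(np.log2(n), 4))

p_unif = np.full(n, 1 / n)                   # the uniform distribution attains the bound
print(-np.sum(p_unif * np.log2(p_unif)))     # exactly log2(6) ≈ 2.585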
148.

The particular form of the chain rule for entropies


when X and Y are independent random variables:

H(X, Y ) = H(X) + H(Y )

CMU, 2012 spring, R. Rosenfeld, HW2, pr. 7.b


149.
Prove that if X and Y are independent random variables, the following property holds:
H(X, Y ) = H(X) + H(Y ).

Answer:

H(X, Y) = −Σ_{x,y} P(X = x, Y = y) log P(X = x, Y = y)
= −Σ_{x,y} P(X = x) P(Y = y) log(P(X = x) P(Y = y))   (by independence)
= −Σ_{x,y} P(X = x) P(Y = y) (log P(X = x) + log P(Y = y))
= −Σ_{x,y} P(X = x) P(Y = y) log P(X = x) − Σ_{x,y} P(X = x) P(Y = y) log P(Y = y)
= −(Σ_y P(Y = y)) (Σ_x P(X = x) log P(X = x)) − (Σ_x P(X = x)) (Σ_y P(Y = y) log P(Y = y))
= −Σ_x P(X = x) log P(X = x) − Σ_y P(Y = y) log P(Y = y)   (since Σ_x P(X = x) = Σ_y P(Y = y) = 1)
= H(X) + H(Y)
150.
If X, Y are continuous and independent random variables, then

p_{X,Y}(X = x, Y = y) = p_X(X = x) p_Y(Y = y)

for any x and y, where p denotes the p.d.f. corresponding to the [c.d.f. of] P. Therefore,

H(X, Y) = −∫_X ∫_Y p_{X,Y}(X = x, Y = y) log p_{X,Y}(X = x, Y = y) dx dy   (by definition)
= −∫_X ∫_Y p_X(X = x) p_Y(Y = y) (log p_X(X = x) + log p_Y(Y = y)) dx dy   (by independence)
= −∫_X ∫_Y p_X(x) p_Y(y) (log p_X(x) + log p_Y(y)) dx dy   (simplified notation)
= −∫_X ∫_Y p_Y(y) p_X(x) log p_X(x) dx dy − ∫_X ∫_Y p_X(x) p_Y(y) log p_Y(y) dx dy
= −∫_X p_X(x) log p_X(x) (∫_Y p_Y(y) dy) dx − ∫_X p_X(x) (∫_Y p_Y(y) log p_Y(y) dy) dx
= −∫_X p_X(x) log p_X(x) dx + H(Y) ∫_X p_X(x) dx   (the first inner integral equals 1, the second equals −H(Y))
= H(X) + H(Y)   (since ∫_X p_X(x) dx = 1)
151.

The conditional form of


the simplest case of the chain rule for entropies (n = 2):
H(X, Y |A) = H(X|A) + H(Y |X, A)

CMU, 2012 spring, Roni Rosenfeld, HW2, pr. 4.b


152.
Prove that for any 3 discrete random variables X, Y and A, the following property holds:

H(X, Y |A) = H(X|A) + H(Y |X, A) = H(Y |A) + H(X|Y, A).

[Figure: a diagram decomposing H(X, Y |A) into H(X|A) and H(Y |X, A), drawn over the entropies H(X), H(Y) and H(A).]

Answer:

H(X, Y |A) = H(X, Y, A) − H(A) = H(X, Y, A) − H(Y, A) + H(Y, A) − H(A)
= (H(X, Y, A) − H(Y, A)) + (H(Y, A) − H(A)) = H(X|Y, A) + H(Y |A)
= H(Y |A) + H(X|Y, A).

We’ve used the fact that H(X, Y ) = H(X|Y ) + H(Y ) (see CMU, 2005 fall, T.
Mitchell, A. Moore, HW1, pr. 2).
153.

Proving [in a direct manner] that


the Information Gain is always positive or 0
and that IG(X, Y ) = 0 ⇔ X is independent of Y
(an indirect proof is made at CMU, 2007 fall, Carlos Guestrin, HW1, pr. 1.2)

Liviu Ciortuz, 2017


154.

The definition of the information gain (or: mutual information) of a random variable X with respect to another random variable Y is

IG(X, Y) = H(X) − H(X | Y) = H(Y) − H(Y | X).

At CMU, 2007 fall, Carlos Guestrin, HW1, pr. 1.2 it was proven — for the case where X and Y are discrete — that IG(X, Y) = KL(P_{X,Y} || P_X P_Y), where KL designates the relative entropy (or: the Kullback-Leibler divergence), P_X and P_Y are the distributions of the variables X and Y respectively, and P_{X,Y} is the joint distribution of these variables. Also at CMU, 2007 fall, Carlos Guestrin, HW1, pr. 1.2 it was shown that the KL divergence is always non-negative. Consequently, IG(X, Y) ≥ 0 for any X and Y.

In this exercise we ask you to prove the inequality IG(X, Y) ≥ 0 in a direct manner, starting from the first definition given above, without appealing [again] to the Kullback-Leibler divergence.
155.
Hint: You may use the following form of Jensen's inequality:

Σ_{i=1}^n a_i log x_i ≤ log (Σ_{i=1}^n a_i x_i)

where the base of the logarithm is taken greater than 1, a_i ≥ 0 for i = 1, . . . , n, and Σ_{i=1}^n a_i = 1.

Remark: The advantage in this problem, compared to CMU, 2007 fall, Carlos Guestrin, HW1, pr. 1.2.a, is that here we work with a single distribution (p), not with two distributions (p and q). However, the proof here will be more laborious.

Answer

Assume that the values of the variable X are x_1, x_2, . . . , x_n, and the values of the variable Y are y_1, y_2, . . . , y_m. We have:

IG(X, Y) = H(X) − H(X|Y)   (by definition)
= −Σ_{i=1}^n P(x_i) log₂ P(x_i) − Σ_{j=1}^m P(y_j) (−Σ_{i=1}^n P(x_i|y_j) log₂ P(x_i|y_j))
156.
−IG(X, Y) = Σ_{i=1}^n P(x_i) log₂ P(x_i) − Σ_{j=1}^m P(y_j) Σ_{i=1}^n P(x_i|y_j) log₂ P(x_i|y_j)
= Σ_{i=1}^n Σ_{j=1}^m P(x_i, y_j) log₂ P(x_i) − Σ_{j=1}^m P(y_j) Σ_{i=1}^n P(x_i|y_j) log₂ P(x_i|y_j)   (marginal probability)
= Σ_{i=1}^n Σ_{j=1}^m P(x_i, y_j) log₂ P(x_i) − Σ_{j=1}^m Σ_{i=1}^n P(y_j) P(x_i|y_j) log₂ P(x_i|y_j)   (distributivity of · over +)
= Σ_{i=1}^n Σ_{j=1}^m P(x_i, y_j) log₂ P(x_i) − Σ_{j=1}^m Σ_{i=1}^n P(x_i, y_j) log₂ P(x_i|y_j)   (def. of conditional probability)
= Σ_{i=1}^n Σ_{j=1}^m P(x_i, y_j) (log₂ P(x_i) − log₂ P(x_i|y_j))
= Σ_{i=1}^n Σ_{j=1}^m P(x_i, y_j) log₂ (P(x_i)/P(x_i|y_j))   (property of logarithms)
= Σ_{i=1}^n Σ_{j=1}^m P(x_i|y_j) P(y_j) log₂ (P(x_i)/P(x_i|y_j))   (multiplication rule)
= Σ_{j=1}^m P(y_j) Σ_{i=1}^n P(x_i|y_j) log₂ (P(x_i)/P(x_i|y_j)),

where, in the inner sum, the coefficients a_i = P(x_i|y_j) will play the role required by Jensen's inequality.
157.

Since on the one hand P(x_i|y_j) ≥ 0 and on the other hand Σ_{i=1}^n P(x_i|y_j) = 1 for each value y_j of Y, we can apply Jensen's inequality to the inner sum in the last expression above — more precisely, for each value of the index j — and we obtain:

−IG(X, Y) ≤ Σ_{j=1}^m P(y_j) log₂ (Σ_{i=1}^n P(x_i|y_j) · P(x_i)/P(x_i|y_j)) = Σ_{j=1}^m P(y_j) log₂ (Σ_{i=1}^n P(x_i)) = 0,

since Σ_{i=1}^n P(x_i) = 1. Therefore, IG(X, Y) ≥ 0.

Also, from the relation −IG(X, Y) = Σ_{j=1}^m P(y_j) Σ_{i=1}^n P(x_i|y_j) log₂ (P(x_i)/P(x_i|y_j)) obtained on the previous slide, the following consequence follows immediately: when X is independent of Y, since P(x_i) = P(x_i|y_j) for all i and j, it follows that log₂ (P(x_i)/P(x_i|y_j)) = 0, hence IG(X, Y) = 0.
158.
Conversely, we will prove here that IG(X, Y) = 0 implies that X is independent of Y.
For this we will use the equality case of Jensen's inequality.
Indeed, from the equality

−IG(X, Y) = Σ_{j=1}^m P(y_j) Σ_{i=1}^n P(x_i|y_j) log₂ (P(x_i)/P(x_i|y_j))

we have

IG(X, Y) = 0 ⇒ Σ_{i=1}^n P(x_i|y_j) log₂ (P(x_i)/P(x_i|y_j)) = 0, ∀j = 1, . . . , m   (since P(y_j) > 0).

It is, however, also true that

log₂ (Σ_{i=1}^n P(x_i|y_j) · P(x_i)/P(x_i|y_j)) = log₂ (Σ_{i=1}^n P(x_i)) = 0, ∀j = 1, . . . , m.

Therefore,

Σ_{i=1}^n P(x_i|y_j) log₂ (P(x_i)/P(x_i|y_j)) = log₂ (Σ_{i=1}^n P(x_i|y_j) · P(x_i)/P(x_i|y_j)) = 0, ∀j = 1, . . . , m.

Applying, for each value of the index j, [the equality case of] Jensen's inequality — which is possible since a_i = P(x_i|y_j) ≥ 0 and Σ_{i=1}^n a_i = 1 for each fixed value of j — it follows that

P(x_i)/P(x_i|y_j) = α (a constant), ∀i = 1, . . . , n.

Thus, P(x_i) = α P(x_i|y_j), ∀i = 1, . . . , n. Summing these equalities for i = 1, . . . , n yields Σ_{i=1}^n P(x_i) = α Σ_{i=1}^n P(x_i|y_j), hence 1 = α. Therefore P(x_i) = P(x_i|y_j), ∀i = 1, . . . , n [and for each j = 1, . . . , m], so X is independent of Y.
159.

Corollary

Since
H(X, Y) = H(X) + H(Y |X) = H(Y) + H(X|Y),
and also due to the definition of the Information Gain,

IG(X, Y) = H(X) − H(X|Y) = H(Y) − H(Y |X),

it follows that

X independent of Y ⇔ H(X) = H(X|Y)
⇔ H(Y) = H(Y |X)
⇔ H(X, Y) = H(X) + H(Y)
160.

Cross-entropy (CH):
definition, some basic properties, and exemplifications
CMU, 2011 spring, Roni Rosenfeld, HW2, pr. 3.c
161.

Cross-entropy, CH(p, q), measures the average number of bits needed to encode an event from a set of possibilities, if a coding scheme is used based on a given probability distribution q, rather than the "true" distribution p.
For discrete distributions p and q this means

CH(p, q) = −Σ_x p(x) log q(x)

The situation for continuous distributions is analogous:

CH(p, q) = −∫_X p(x) log q(x) dx.

a. Can cross-entropy be negative? Either prove or give a counter-example.
162.

Answer

a. No. Here follows the proof:

For every probability value p(x) and q(x), we know from the definition that 0 ≤ p(x) ≤ 1 and 0 ≤ q(x) ≤ 1. From q(x) ≤ 1, we conclude that log q(x) ≤ 0. Given that 0 ≤ p(x) and −log q(x) ≥ 0, we conclude that 0 ≤ −p(x) log q(x). Thus, the sum of these terms will also be greater than or equal to 0, so cross-entropy is never negative.

Note, however, that unlike entropy, cross-entropy is not bounded: it can grow to infinity if for some x, q(x) is zero and p(x) is not zero.
163.

b. In many experiments, the quality of different hypothesis models is compared on a data set. Imagine you derived two different models to predict the probabilities of 7 different possible outcomes, and the probability distributions predicted by the models are q_1 and q_2, as follows:

q_1 = (0.1, 0.1, 0.2, 0.3, 0.2, 0.05, 0.05) and q_2 = (0.05, 0.1, 0.15, 0.35, 0.2, 0.1, 0.05).

The experiments are done on a data set with the following empirical distribution:

p_empirical = (0.05, 0.1, 0.2, 0.3, 0.2, 0.1, 0.05).

Compute the cross-entropies CH(p_empirical, q_1) and CH(p_empirical, q_2).

Which model has a lower cross-entropy? Is this model guaranteed to be a better one? Explain your answer.
164.

Answer

b. Using the cross-entropy formula we see that:

CH(p_empirical, q_1) = −(0.05 log 0.1 + 0.1 log 0.1 + 0.2 log 0.2 + 0.3 log 0.3 + 0.2 log 0.2 + 0.1 log 0.05 + 0.05 log 0.05) ≈ 2.596 bits
CH(p_empirical, q_2) = −(0.05 log 0.05 + 0.1 log 0.1 + 0.2 log 0.15 + 0.3 log 0.35 + 0.2 log 0.2 + 0.1 log 0.1 + 0.05 log 0.05) ≈ 2.56 bits.

The pempirical distribution has a lower cross-entropy with the model q2 , so it is


reasonable to say that q2 is a better choice.

However, it is not guaranteed that this model is always the better model,
because we are working with an “empirical” distribution here, and the “true”
distribution might not be reflected in this empirical distribution completely.
Usually, sampling bias and insufficient training data samples will widen the
gap between the true distribution and the empirical one, so in practice when
designing the experiment, you should always have that in mind, and if possible
use techniques that minimize these risks.
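These computations are easy to reproduce (our addition, not part of the original solution); note also that the cross-entropy of p_empirical with itself is its entropy, the smallest value achievable by any model q:

import numpy as np

p  = np.array([0.05, 0.1, 0.2, 0.3, 0.2, 0.1, 0.05])   # p_empirical
q1 = np.array([0.1, 0.1, 0.2, 0.3, 0.2, 0.05, 0.05])
q2 = np.array([0.05, 0.1, 0.15, 0.35, 0.2, 0.1, 0.05])

def cross_entropy(p, q):
    # average code length (in bits) when events ~ p are encoded assuming q
    return -np.sum(p * np.log2(q))

print(cross_entropy(p, q1))   # ≈ 2.596 bits
print(cross_entropy(p, q2))   # ≈ 2.563 bits
print(cross_entropy(p, p))    # H(p) ≈ 2.546 bits, the minimum over all q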
165.

Relative entropy a.k.a. the Kullback-Leibler divergence,

and the [relationship to] information gain;
some basic properties
CMU, 2007 fall, C. Guestrin, HW1, pr. 1.2
[adapted by Liviu Ciortuz]
166.

The relative entropy — also known as the Kullback-Leibler (KL) divergence — from a distribution p to a distribution q is defined as

KL(p||q) = −Σ_{x∈X} p(x) log (q(x)/p(x))

From an information theory perspective, the KL divergence specifies the number of additional bits required on average to transmit values of X if the values are distributed with respect to p but we encode them assuming the distribution q.
167.

Notes
1. KL is not a distance measure, since it is not symmetric (i.e., in general KL(p||q) ≠ KL(q||p)).
Another measure, defined as JSD(p||q) = (1/2)(KL(p||(p + q)/2) + KL(q||(p + q)/2)) and called the Jensen-Shannon divergence, is symmetric.

2. The quantity

d(X, Y) = H(X, Y) − IG(X, Y) = H(X) + H(Y) − 2 IG(X, Y) = H(X | Y) + H(Y | X),

known as variation of information, is a distance metric, i.e., it is non-negative, symmetric, satisfies the identity of indiscernibles, and satisfies the triangle inequality.
168.
a. Show that KL(p||q) ≥ 0, and KL(p||q) = 0 iff p(x) = q(x) for all x.
(More generally, the smaller the KL divergence, the more similar the two distributions.)

Hint:
To prove this part you may use Jensen's inequality:

If f : R → R is a convex function, then for any t ∈ [0, 1] and any x_1, x_2 ∈ R,

f(t x_1 + (1 − t) x_2) ≤ t f(x_1) + (1 − t) f(x_2).

If f is a strictly convex function, then equality holds only if x_1 = x_2.

[Figure: a convex curve f, with the chord between (x, f(x)) and (y, f(y)) lying above the graph: f(tx + (1−t)y) ≤ t f(x) + (1−t) f(y).]

More generally, for any a_i ≥ 0, i = 1, . . . , n with Σ_i a_i ≠ 0, and any x_i ∈ R, i = 1, . . . , n, we have

f(Σ_i a_i x_i / Σ_j a_j) ≤ Σ_i a_i f(x_i) / Σ_j a_j.

If f is strictly convex, then equality holds only if x_1 = . . . = x_n.
Obviously, similar results can be formulated for concave functions.
169.

Answer
We will prove the inequality KL(p||q) ≥ 0 using Jensen's inequality, in whose expression we substitute f with the convex function −log₂, a_i with p(x_i), and x_i with q(x_i)/p(x_i).
(For convenience, in what follows we drop the index of the variable x.)
We will have:

KL(p||q) = −Σ_x p(x) log (q(x)/p(x))   (by definition)
≥ −log (Σ_x p(x) · q(x)/p(x)) = −log (Σ_x q(x)) = −log 1 = 0   (by Jensen's inequality)

Therefore, KL(p||q) ≥ 0, whatever the (discrete) distributions p and q.


170.

We now prove that KL(p||q) = 0 ⇔ p = q.

"⇐"
The equality p(x) = q(x) implies q(x)/p(x) = 1, hence log (q(x)/p(x)) = 0 for every x, from which KL(p||q) = 0 follows immediately.
"⇒"
We know that in Jensen's inequality equality holds only in the case where x_i = x_j for all i and j.
In the present case, this condition translates into the fact that the ratio q(x)/p(x) is the same constant (α) for every value of x.
Taking into account that Σ_x p(x) = 1 and Σ_x q(x) = Σ_x p(x) · q(x)/p(x) = α Σ_x p(x) = 1, it follows that α = 1, hence p(x) = q(x) for every x, which means that the distributions p and q are identical.
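A minimal sketch (our addition) of computing the KL divergence in bits, illustrating its non-negativity, its asymmetry, and the fact that KL(p||p) = 0; the helper kl is ours and assumes q(x) > 0 wherever p(x) > 0.

import numpy as np

def kl(p, q):
    # KL(p||q) in bits; terms with p(x) = 0 contribute 0 by convention
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p = np.array([0.5, 0.25, 0.25])
q = np.array([0.25, 0.5, 0.25])
print(kl(p, q), kl(q, p))   # both >= 0, and in general unequal (KL is not symmetric)
print(kl(p, p))             # 0.0, attained exactly when the two distributions coincide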
171.
b. We can define the information gain as the KL divergence from the observed joint distribution of X and Y to the product of their observed marginals:

IG(X, Y) = KL(p_{X,Y} || (p_X p_Y)) = −Σ_x Σ_y p_{X,Y}(x, y) log (p_X(x) p_Y(y) / p_{X,Y}(x, y))
= −Σ_x Σ_y p(x, y) log (p(x) p(y) / p(x, y))   (simplified notation)

Prove that this definition of information gain is equivalent to the one given in problem CMU, 2005 fall, T. Mitchell, A. Moore, HW1, pr. 2. That is, show that IG(X, Y) = H[X] − H[X|Y] = H[Y] − H[Y |X], starting from the definition in terms of KL divergence.

Remark:
It follows that

IG(X, Y) = Σ_y p(y) Σ_x p(x | y) log (p(x | y)/p(x)) = Σ_y p(y) KL(p_{X|Y} || p_X) = E_Y[KL(p_{X|Y} || p_X)]
172.

Answer
By making use of the multiplication rule, namely p(x, y) = p(x | y) p(y), we will have:

KL(p_{X,Y} || (p_X p_Y))
= −Σ_x Σ_y p(x, y) log (p(x) p(y) / p(x, y))   (by the definition of KL)
= −Σ_x Σ_y p(x, y) log (p(x) / p(x | y)) = −Σ_x Σ_y p(x, y) [log p(x) − log p(x | y)]
= −Σ_x Σ_y p(x, y) log p(x) − (−Σ_x Σ_y p(x, y) log p(x | y))
= −Σ_x log p(x) (Σ_y p(x, y)) − H[X | Y] = −Σ_x p(x) log p(x) − H[X | Y]
= H[X] − H[X | Y] = IG(X, Y)


173.

c.
A direct consequence of parts a. and b. is that IG(X, Y) ≥ 0 (and therefore H(X) ≥ H(X|Y) and H(Y) ≥ H(Y |X)) for any discrete random variables X and Y.
Prove that IG(X, Y) = 0 iff X and Y are independent.

Answer:

This is also an immediate consequence of parts a. and b. already proven:

IG(X, Y) = 0 ⇔ KL(p_{X,Y} || p_X p_Y) = 0   (by b.) ⇔ X and Y are independent   (by a.).
Remark
174.

We can (re)prove the inequality IG(X, Y) ≥ 0 using (only!) the result from part b. (and not the one from part a.), namely that IG(X, Y) = −Σ_x Σ_y p(x, y) log (p(x) p(y) / p(x, y)). The idea is to apply Jensen's inequality in a slightly generalized form, namely:
− instead of a single index, two indices are considered (so instead of a_i and x_i we will have a_ij and x_ij, respectively);
− we take f = −log₂, a_ij ← p(x_i, y_j) and x_ij ← p(x_i) p(y_j) / p(x_i, y_j);
− finally, we take into account that Σ_i Σ_j p(x_i, y_j) = 1.
Therefore,

IG(X, Y) = −Σ_i Σ_j p(x_i, y_j) log (p(x_i) · p(y_j) / p(x_i, y_j)) = Σ_i Σ_j p(x_i, y_j) [−log (p(x_i) · p(y_j) / p(x_i, y_j))]
≥ −log (Σ_i Σ_j p(x_i, y_j) · p(x_i) · p(y_j) / p(x_i, y_j)) = −log (Σ_i Σ_j p(x_i) · p(y_j))
= −log ((Σ_i p(x_i)) · (Σ_j p(y_j))) = −log 1 = 0

In conclusion, IG(X, Y) ≥ 0.


175.

Remark (cont'd)

Regarding the equivalence IG(X, Y) = 0 ⇔ X and Y are independent:

"⇐"
If X and Y are independent variables, then p(x_i, y_j) = p(x_i) p(y_j) for all i and j.
Consequently, all the logarithms on the right-hand side of the first equality in the computation above are 0, and it follows that IG(X, Y) = 0.
"⇒"
Conversely, assuming that IG(X, Y) = 0, we take into account the fact that we can express the information gain by means of the KL divergence, and we apply a reasoning similar to the one at part a.
It follows that p(x_i) p(y_j) / p(x_i, y_j) = 1, and therefore p(x_i) p(y_j) = p(x_i, y_j) for all i and j.
This is equivalent to saying that the variables X and Y are independent.
176.

Using Information Gain / Mutual Information for doing


Feature Selection
CMU, 2009 spring, Ziv Bar-Joseph, HW5, pr. 6
177.

Given the following observations for input binary features X1 , X2 , X3 , X4 , X5 ,


and output binary label Y , we would like to use a filter approach to reduce
the feature space of {X1 , X2 , X3 , X4 , X5 }.

X1 X2 X3 X4 X5 Y
0 1 1 0 1 0
1 0 0 0 1 0
0 1 0 1 0 1
1 1 1 1 0 1
0 1 1 0 0 1
0 0 0 1 1 1
1 0 0 1 0 1
1 1 1 0 1 1

a. Calculate the mutual information M I(Xi , Y ) for each i.


b. Accordingly choose the smallest subset of features such that the best
classifier trained on the reduced feature set will perform at least as well as
the best classifier trained on the whole feature set. Explain the reasons behind
your choice.
Answer
178.
a. As shown at CMU, 2007 fall, Carlos Guestrin, HW1, pr. 1.2, the mutual information can be calculated as:

MI(X, Y) = −Σ_x Σ_y P(x, y) log (P(x) P(y) / P(x, y)).

The marginal probabilities, estimated (in the MLE sense) from the given data, are shown in the tables below.

            x_i = 0   x_i = 1
P(X1 = x1)  1/2       1/2
P(X2 = x2)  3/8       5/8
P(X3 = x3)  1/2       1/2
P(X4 = x4)  1/2       1/2
P(X5 = x5)  1/2       1/2

y          0     1
P(Y = y)   1/4   3/4

The joint probabilities P(X_i = x_i, Y = y), also estimated from the data, are:

xi = 0, y = 0 xi = 0, y = 1 xi = 1, y = 0 xi = 1, y = 1
P (x1 , y) 1/8 3/8 1/8 3/8
P (x2 , y) 1/8 1/4 1/8 1/2
P (x3 , y) 1/8 3/8 1/8 3/8
P (x4 , y) 1/4 1/4 0 1/2
P (x5 , y) 0 1/2 1/4 1/4

Note: Columns 3 and 5 can be easily computed using columns 2 and 4, respectively (together with the first table above), as P(x_i = 0, y = 1) = P(x_i = 0) − P(x_i = 0, y = 0) and P(x_i = 1, y = 1) = P(x_i = 1) − P(x_i = 1, y = 0).
179.

Using these probabilities and the given formula, the mutual information for
each feature is: M I(X1 , Y ) = 0, M I(X2 , Y ) = 0.01571, M I(X3 , Y ) = 0, M I(X4 , Y ) =
0.3113 and M I(X5 , Y ) = 0.3113.
Note: One could easily check, using the previous tables, that X1 is indepen-
dent of Y , and X3 is also independent of Y (so, we can determine in this way
too that M I(X1 , Y ) and M I(X3 , Y ) are 0).

b. In order to select a set of features, we can prioritize the ones with higher mutual information, because they are more strongly dependent on Y.
By looking at the results of the previous question, we can see that X5 and X4 are the features with the highest mutual information (0.3113), followed by X2 (0.0157) and, finally, X1 and X3, which have zero mutual information with Y.
By inspection of the data, we can see that if we select X5, X4 and X2, there are two samples of different classes with the same features (X2 = 1, X4 = 0, X5 = 1). To avoid this problem, we can add X1 as an extra feature.
Note: Although the mutual information of X1 with Y is zero, it does not mean that the combination of X1 with other features will also have zero mutual information [LC: w.r.t. Y].
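For reproducibility, here is a short sketch (our addition) that recomputes all five mutual-information values from the data table; the helper mi is ours.

import numpy as np

# Data table: columns X1..X5, last column Y
data = np.array([
    [0, 1, 1, 0, 1, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 1, 0, 1, 0, 1],
    [1, 1, 1, 1, 0, 1],
    [0, 1, 1, 0, 0, 1],
    [0, 0, 0, 1, 1, 1],
    [1, 0, 0, 1, 0, 1],
    [1, 1, 1, 0, 1, 1],
])
X, Y = data[:, :5], data[:, 5]

def mi(x, y):
    # empirical mutual information (in bits) of two binary variables
    total = 0.0
    for a in (0, 1):
        for b in (0, 1):
            pxy = np.mean((x == a) & (y == b))
            px, py = np.mean(x == a), np.mean(y == b)
            if pxy > 0:
                total += pxy * np.log2(pxy / (px * py))
    return total

for i in range(5):
    print(f"MI(X{i+1}, Y) = {mi(X[:, i], Y):.4f}")
# 0.0000, 0.0157, 0.0000, 0.3113, 0.3113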
180.

Derivation of entropy definition,


starting from a set of desirable properties
CMU, 2005 fall, T. Mitchell, A. Moore, HW1, pr. 2.2
181.
Remark: The definition we gave for entropy, −Σ_{i=1}^n p_i log p_i, is not very intuitive.

Theorem:
If ψ_n(p_1, . . . , p_n) satisfies the following axioms

A0. [LC:] ψ_n(p_1, . . . , p_n) ≥ 0 for any n ∈ N* and p_1, . . . , p_n, since we view ψ_n as a measure of disorder; also, ψ_1(1) = 0, because in this case there is no disorder;
A1. ψ_n should be continuous in p_i and symmetric in its arguments;
A2. if p_i = 1/n, then ψ_n should be a monotonically increasing function of n;
(If all events are equally likely, then having more events means being more uncertain.)
A3. if a choice among N events is broken down into successive choices, then the entropy should be the weighted sum of the entropy at each stage;

then ψ_n(p_1, . . . , p_n) = −K Σ_i p_i log p_i, where K is a positive constant.
(As we'll see, K depends however on ψ_s(1/s, . . . , 1/s) for a certain s ∈ N*.)

Remark: We will prove the theorem firstly for uniform distributions (p_i = 1/n) and secondly for the case p_i ∈ Q (only!).
182.
Example for the axiom A3:

[Figure: two encodings of a choice among a, b, c with probabilities 1/2, 1/3, 1/6. Encoding 1: a single choice among (a, b, c). Encoding 2: first choose between a and (b, c) with probabilities 1/2 and 1/2; then, within (b, c), choose b or c with probabilities 2/3 and 1/3.]

H(1/2, 1/3, 1/6) = (1/2) log 2 + (1/3) log 3 + (1/6) log 6 = (1/2 + 1/6) log 2 + (1/3 + 1/6) log 3 = 2/3 + (1/2) log 3

H(1/2, 1/2) + (1/2) H(2/3, 1/3) = 1 + (1/2) [(2/3) log(3/2) + (1/3) log 3] = 1 + (1/2) [log 3 − 2/3] = 2/3 + (1/2) log 3
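The equality of the two encodings is easy to verify numerically (our addition):

import numpy as np

def H(*p):
    p = np.array(p)
    return -np.sum(p * np.log2(p))

direct = H(1/2, 1/3, 1/6)                   # encoding 1: one choice among a, b, c
staged = H(1/2, 1/2) + 0.5 * H(2/3, 1/3)    # encoding 2: a vs (b, c), then b vs c
print(direct, staged)                       # both equal 2/3 + (1/2) log2 3 ≈ 1.4591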

The next 3 slides:

Case 1: p_i = 1/n for i = 1, . . . , n; proof steps:

183.
a. Denoting A(n) := ψ_n(1/n, 1/n, . . . , 1/n), show that

A(s^m) = m A(s) for any s, m ∈ N*.   (1)

b. If s, m ∈ N*, s ≠ 1, and t, n ∈ N* are such that s^m ≤ t^n ≤ s^{m+1}, then

|m/n − log t / log s| ≤ 1/n.   (2)

c. For s^m ≤ t^n ≤ s^{m+1} as above, due to A2 it follows (immediately) that

ψ_{s^m}(1/s^m, . . . , 1/s^m) ≤ ψ_{t^n}(1/t^n, . . . , 1/t^n) ≤ ψ_{s^{m+1}}(1/s^{m+1}, . . . , 1/s^{m+1}),

i.e., A(s^m) ≤ A(t^n) ≤ A(s^{m+1}). Show that

|m/n − A(t)/A(s)| ≤ 1/n for s ≠ 1.   (3)

d. Combining (2) + (3) immediately gives

|A(t)/A(s) − log t / log s| ≤ 2/n for s ≠ 1.   (4)

Show that this inequality implies

A(t) = K log t, with K > 0 (due to A2).   (5)
184.
Proof
a.

[Figure: a tree with m levels; at each level every node branches into s equally likely children (probability 1/s each), so that level m has s^m leaves, each with probability 1/s^m.]

Applying the axiom A3 on this encoding gives:

A(s^m) = A(s) + s · (1/s) A(s) + s² · (1/s²) A(s) + . . . + s^{m−1} · (1/s^{m−1}) A(s)
= A(s) + A(s) + A(s) + . . . + A(s) = m A(s)   (m times)
185.
Proof (cont'd)
b.

s^m ≤ t^n ≤ s^{m+1} ⇒ m log s ≤ n log t ≤ (m + 1) log s ⇒
m/n ≤ log t / log s ≤ m/n + 1/n ⇒ 0 ≤ log t / log s − m/n ≤ 1/n ⇒ |log t / log s − m/n| ≤ 1/n

c.

A(s^m) ≤ A(t^n) ≤ A(s^{m+1}) ⇒ m A(s) ≤ n A(t) ≤ (m + 1) A(s)   (by (1), with s ≠ 1) ⇒
m/n ≤ A(t)/A(s) ≤ m/n + 1/n ⇒ 0 ≤ A(t)/A(s) − m/n ≤ 1/n ⇒ |A(t)/A(s) − m/n| ≤ 1/n

d. Consider again s^m ≤ t^n ≤ s^{m+1}, with s and t fixed. If m → ∞ then n → ∞, and from |A(t)/A(s) − log t / log s| ≤ 2/n it follows that A(t)/A(s) − log t / log s → 0.

Therefore A(t)/A(s) − log t / log s = 0, and so A(t)/A(s) = log t / log s.

Finally, A(t) = (A(s)/log s) log t = K log t, where K = A(s)/log s > 0 (if s ≠ 1).
186.
Case 2: p_i ∈ Q for i = 1, . . . , n

Let's consider a set of N ≥ 2 equiprobable random events, and P = (S_1, S_2, . . . , S_k) a partition of this set. Let's denote p_i = |S_i|/N.

[Figure: a two-level tree; at level 1 the block S_i is chosen with probability |S_i|/N, and at level 2 each element of S_i is chosen with probability 1/|S_i|.]

A "natural" two-step encoding (as shown in the figure) leads to A(N) = ψ_k(p_1, . . . , p_k) + Σ_i p_i A(|S_i|), based on the axiom A3.
Finally, using the result A(t) = K log t gives:

K log N = ψ_k(p_1, . . . , p_k) + K Σ_i p_i log |S_i|
⇒ ψ_k(p_1, . . . , p_k) = K [log N − Σ_i p_i log |S_i|]
= K [log N Σ_i p_i − Σ_i p_i log |S_i|] = −K Σ_i p_i log (|S_i|/N) = −K Σ_i p_i log p_i
187.

Exemplifying
The gradient descent method:
finding the minimum of a second-degree (quadratic) real function

University of Utah, 2008 fall, Hal Daumé III, HW4, pr. 1


188.

Suppose we are trying to find the minimum of the function f(x) = 3x² − 2x + 1, for uni-variate (scalar) x.
First, verify that this function is convex (and therefore it has a global minimum).
Second, find the minimum of this function using calculus.
Finally, perform (by hand – show your work) three steps of gradient descent with η = 0.1 and the initial point x₀ = 1. How close does it get to the true solution?
189.

Solution

To study the convexity of the function f(x), we compute the second-order derivative:

f″(x) = 6 > 0, ∀x ∈ R.

It follows that the function f is convex on its entire domain of definition, so it has a (single) minimum point.
To compute the minimum of f(x) = 3x² − 2x + 1 we use the first-order derivative:

f′(x) = 6x − 2.

The minimum point is given by the solution of the equation f′(x) = 0, namely: x = 1/3 ≈ 0.33.

The three gradient descent steps, x_{k+1} = x_k − η f′(x_k) with η = 0.1 and x₀ = 1, are:
x₁ = 1 − 0.1 · 4 = 0.6,  x₂ = 0.6 − 0.1 · 1.6 = 0.44,  x₃ = 0.44 − 0.1 · 0.64 = 0.376,
so after three steps we are within |x₃ − 1/3| ≈ 0.043 of the true solution.

[Figure: the graph of f(x), with the iterates x₀ = 1, x₁, x₂, x₃ approaching the minimum at x = 1/3 ≈ 0.33.]
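The same three steps in code (our addition):

def f(x):  return 3 * x**2 - 2 * x + 1
def df(x): return 6 * x - 2

eta, x = 0.1, 1.0
for k in range(3):
    x = x - eta * df(x)
    print(k + 1, round(x, 4), round(f(x), 4))
# x1 = 0.6, x2 = 0.44, x3 = 0.376 -> approaching the minimum x* = 1/3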
190.

Observations

It is easy to see that:

1. the approach toward the optimum point is fast[er] as long as the value of the first derivative (i.e., the slope of the tangent to the function's graph) is large in absolute value;
2. if the learning rate η was fixed at too large a value, then it is possible that at some point we overshoot the optimum point and then "oscillate" around it. This weak point of the gradient method can be counteracted by dynamically reducing the value of η.
Otherwise, the gradient method has the advantage of being an optimization technique that is conceptually very simple and easy to implement.

Two other weak points of the gradient descent method are:
the impossibility of guaranteeing that the global optimum is found, and
the large number of iterations that must be executed on some real datasets.
