

Foundations of Probabilities
and Information Theory
for Machine Learning

Random Variables
Some proofs

E[X + Y] = E[X] + E[Y],

where X and Y are random variables of the same type (i.e., either discrete or continuous).

The discrete case:

E[X + Y] = Σ_{ω∈Ω} (X(ω) + Y(ω)) · P(ω)
         = Σ_ω X(ω) · P(ω) + Σ_ω Y(ω) · P(ω) = E[X] + E[Y]

The continuous case:

E[X + Y] = ∫_x ∫_y (x + y) p_{XY}(x, y) dy dx
         = ∫_x ∫_y x p_{XY}(x, y) dy dx + ∫_x ∫_y y p_{XY}(x, y) dy dx
         = ∫_x x (∫_y p_{XY}(x, y) dy) dx + ∫_y y (∫_x p_{XY}(x, y) dx) dy
         = ∫_x x p_X(x) dx + ∫_y y p_Y(y) dy = E[X] + E[Y]
X and Y are independent ⇒ E[XY] = E[X] · E[Y],

X and Y being random variables of the same type (i.e., either discrete or continuous).

The discrete case:

E[XY] = Σ_{x∈Val(X)} Σ_{y∈Val(Y)} x y P(X = x, Y = y) = Σ_{x∈Val(X)} Σ_{y∈Val(Y)} x y P(X = x) · P(Y = y)
      = Σ_{x∈Val(X)} x P(X = x) (Σ_{y∈Val(Y)} y P(Y = y)) = Σ_{x∈Val(X)} x P(X = x) E[Y] = E[X] · E[Y]

The continuous case:

E[XY] = ∫_x ∫_y x y p(X = x, Y = y) dy dx = ∫_x ∫_y x y p(X = x) · p(Y = y) dy dx
      = ∫_x x p(X = x) (∫_y y p(Y = y) dy) dx = ∫_x x p(X = x) E[Y] dx
      = E[Y] · ∫_x x p(X = x) dx = E[X] · E[Y]
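A quick numerical sanity check of both identities (not part of the original proofs; a minimal sketch assuming Python with numpy, with arbitrary illustrative distributions):

    # Monte Carlo check: E[X+Y] = E[X] + E[Y] always, and E[XY] = E[X]·E[Y]
    # when X and Y are independent.
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.exponential(scale=2.0, size=1_000_000)      # E[X] = 2
    y = rng.normal(loc=3.0, scale=1.0, size=1_000_000)  # E[Y] = 3, drawn independently of X

    print(np.mean(x + y), np.mean(x) + np.mean(y))      # both ≈ 5
    print(np.mean(x * y), np.mean(x) * np.mean(y))      # both ≈ 6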

Discrete random variables:


independence, conditional independence, conditional probabilities

CMU, 2015 spring, T. Mitchell, N. Balcan, HW2, pr. 1.d



Let X, Y , and Z be random variables taking values in {0, 1}.


The following table lists the probability of each possible assignment of 0 and 1 to the variables X, Y, and Z:

                 Z = 0                 Z = 1
            X = 0    X = 1        X = 0    X = 1
   Y = 0     1/15     1/15         4/15     2/15
   Y = 1     1/10     1/10         8/45     4/45

For example,
P (X = 0, Y = 1, Z = 0) = 1/10 and P (X = 1, Y = 1, Z = 1) = 4/45.
a. Is X independent of Y ? Why or why not?
b. Is X conditionally independent of Y given Z? Why or why not?
c. Calculate P (X = 0 | X + Y > 0).

Answer
a. No.

P (X = 0) = 1/15 + 1/10 + 4/15 + 8/45 = 11/18,


P (Y = 0) = 1/15 + 1/15 + 4/15 + 2/15 = 8/15,

and
P(X = 0|Y = 0) = P(X = 0, Y = 0) / P(Y = 0) = (1/15 + 4/15) / (8/15) = 5/8.
Since P (X = 0) does not equal P (X = 0|Y = 0), X is not independent of Y .
b. For all pairs y, z ∈ {0, 1}, we need to check that P(X = 0|Y = y, Z = z) = P(X = 0|Z = z).
That the other probabilities are equal follows from the law of total probability.

P(X = 0|Y = 0, Z = 0) = (1/15) / (1/15 + 1/15) = 1/2
P(X = 0|Y = 1, Z = 0) = (1/10) / (1/10 + 1/10) = 1/2
P(X = 0|Y = 0, Z = 1) = (4/15) / (4/15 + 2/15) = 2/3
P(X = 0|Y = 1, Z = 1) = (8/45) / (8/45 + 4/45) = 2/3

and

P(X = 0|Z = 0) = (1/15 + 1/10) / (1/15 + 1/15 + 1/10 + 1/10) = 1/2
P(X = 0|Z = 1) = (4/15 + 8/45) / (4/15 + 2/15 + 8/45 + 4/45) = 2/3.

This shows that X is independent of Y given Z.


c.

P(X = 0 | X + Y > 0) = (1/10 + 8/45) / (1/15 + 1/10 + 1/10 + 2/15 + 4/45 + 8/45) = 5/12.
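All three answers can be re-derived mechanically from the joint table; a minimal sketch in Python using only the standard library (the helper p below is ours, not part of the original problem):

    # Recompute parts a-c from the joint probability table with exact fractions.
    from fractions import Fraction as F

    # P[z][y][x] = P(X=x, Y=y, Z=z), copied from the table
    P = {0: {0: {0: F(1,15), 1: F(1,15)}, 1: {0: F(1,10), 1: F(1,10)}},
         1: {0: {0: F(4,15), 1: F(2,15)}, 1: {0: F(8,45), 1: F(4,45)}}}

    def p(x=None, y=None, z=None):
        # marginal/joint probability, summing over the unconstrained variables
        return sum(P[zz][yy][xx]
                   for zz in (0, 1) for yy in (0, 1) for xx in (0, 1)
                   if (x is None or xx == x) and (y is None or yy == y) and (z is None or zz == z))

    print(p(x=0), p(x=0, y=0) / p(y=0))                          # 11/18 vs 5/8  -> a. not independent
    print(p(x=0, y=0, z=1) / p(y=0, z=1), p(x=0, z=1) / p(z=1))  # both 2/3      -> b. cond. independent
    print(p(x=0, y=1) / (1 - p(x=0, y=0)))                       # 5/12          -> c.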

The correlation coefficient of two random variables:


two properties
Sheldon Ross
A First Course in Probability, 5th ed., Prentice Hall, 1997, page 332

For any two random variables X and Y with Var(X) ≠ 0 and Var(Y) ≠ 0, the correlation coefficient is defined as:

ρ(X, Y) =def Cov(X, Y) / √(Var(X) · Var(Y)).

a. Prove that −1 ≤ ρ(X, Y) ≤ 1.

Consequence: Cov(X, Y) ∈ [−√(Var(X) Var(Y)), +√(Var(X) Var(Y))], if Var(X) ≠ 0 and Var(Y) ≠ 0.

b. Show that if ρ(X, Y) = 1, then Y = aX + b, with a = √(Var(Y)/Var(X)) > 0.
Similarly, if ρ(X, Y) = −1, then Y = aX + b, with a = −√(Var(Y)/Var(X)) < 0.
Remark: Thus, the correlation coefficient is a "measure" of the degree of "linear dependence" between X and Y.
Hints
1. For point a, denoting σ_X = √(Var(X)) and σ_Y = √(Var(Y)), to prove the inequality ρ(X, Y) ≥ −1 we suggest you expand the expression Var(X/σ_X + Y/σ_Y), using the following two properties, valid for any random variables X and Y:

Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y)

and

Cov(aX, bY) = ab Cov(X, Y), for any a, b ∈ R.

Then proceed similarly, expanding the expression Var(X/σ_X − Y/σ_Y), to prove the inequality ρ(X, Y) ≤ 1.

2. For point b, use the fact that for any random variable X, we have Var(X) = 0 if and only if X is constant.
(More precisely, there exists c ∈ R such that P(X = c) = 1, where P is the probability distribution considered when defining the variables in the problem statement.)
Answer
a. To prove the inequality ρ(X, Y) ≥ −1, we proceed according to Hint 1:

Var(X/σ_X + Y/σ_Y) = Var(X/σ_X) + Var(Y/σ_Y) + 2 Cov(X/σ_X, Y/σ_Y)
                   = (1/σ_X²) Var(X) + (1/σ_Y²) Var(Y) + (2/(σ_X σ_Y)) Cov(X, Y)
                   = 1 + 1 + 2ρ(X, Y) = 2[1 + ρ(X, Y)].

Since Var(X) ≥ 0 for any random variable X, it follows that Var(X/σ_X + Y/σ_Y) ≥ 0, hence 1 + ρ(X, Y) ≥ 0, i.e., ρ(X, Y) ≥ −1.

Similarly, we can show that

Var(X/σ_X − Y/σ_Y) = 2[1 − ρ(X, Y)] ≥ 0,

hence ρ(X, Y) ≤ 1.
b. If ρ(X, Y) = −1, then from the first computation at point a it follows that Var(X/σ_X + Y/σ_Y) = 0. We know this happens if X/σ_X + Y/σ_Y is a constant random variable, i.e., more precisely, there exists a′ ∈ R such that X/σ_X + Y/σ_Y = a′ with probability 1. Consequently, we can write Y = a′σ_Y − (σ_Y/σ_X) X. It follows that Y = aX + b, where a = −σ_Y/σ_X < 0 and b = a′σ_Y.

Similarly, if ρ(X, Y) = 1, then from the second computation at point a it follows that Var(X/σ_X − Y/σ_Y) = 0, so there exists a″ ∈ R such that X/σ_X − Y/σ_Y = a″ with probability 1. Thus, Y = −a″σ_Y + (σ_Y/σ_X) X. Renaming, we obtain Y = aX + b, with a = σ_Y/σ_X > 0 and b = −a″σ_Y.
Probability Distributions
Some properties

Binomial distribution: b(r; n, p) =def C_n^r p^r (1 − p)^{n−r}

Significance: b(r; n, p) is the probability of drawing r heads in n independent flips of a coin having the head probability p.

b(r; n, p) indeed represents a probability distribution:

• b(r; n, p) = C_n^r p^r (1 − p)^{n−r} ≥ 0 for all p ∈ [0, 1], n ∈ N and r ∈ {0, 1, . . . , n},
• Σ_{r=0}^n b(r; n, p) = 1:
  (1 − p)^n + C_n^1 p(1 − p)^{n−1} + · · · + C_n^{n−1} p^{n−1}(1 − p) + p^n = [p + (1 − p)]^n = 1
Binomial distribution: calculating the mean

E[b(r; n, p)] =def Σ_{r=0}^n r · b(r; n, p)
  = 1 · C_n^1 p(1 − p)^{n−1} + 2 · C_n^2 p²(1 − p)^{n−2} + · · · + (n − 1) · C_n^{n−1} p^{n−1}(1 − p) + n · p^n
  = p [C_n^1 (1 − p)^{n−1} + 2 · C_n^2 p(1 − p)^{n−2} + · · · + (n − 1) · C_n^{n−1} p^{n−2}(1 − p) + n · p^{n−1}]
  = np [(1 − p)^{n−1} + C_{n−1}^1 p(1 − p)^{n−2} + · · · + C_{n−1}^{n−2} p^{n−2}(1 − p) + C_{n−1}^{n−1} p^{n−1}]   (1)
  = np [p + (1 − p)]^{n−1} = np   (2)

For the (1) equality we used the following property:

k C_n^k = k · n!/(k! (n − k)!) = n!/((k − 1)! (n − k)!) = n · (n − 1)!/((k − 1)! ((n − 1) − (k − 1))!)
        = n C_{n−1}^{k−1}, ∀k = 1, . . . , n.
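A direct numerical check of equality (2) (a sketch, assuming Python with scipy; the values n = 20, p = 0.3 are arbitrary):

    # Check E[X] = np by summing r·b(r; n, p) over the whole support.
    from scipy.stats import binom
    n, p = 20, 0.3
    mean = sum(r * binom.pmf(r, n, p) for r in range(n + 1))
    print(mean, n * p)   # both ≈ 6.0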
Binomial distribution: calculating the variance

following www.proofwiki.org/wiki/Variance of Binomial Distribution, which cites
"Probability: An Introduction", by Geoffrey Grimmett and Dominic Welsh,
Oxford Science Publications, 1986

We will make use of the formula Var[X] = E[X²] − E²[X].

By denoting q = 1 − p, it follows:

E[b²(r; n, p)] =def Σ_{r=0}^n r² C_n^r p^r q^{n−r} = Σ_{r=0}^n r² (n(n − 1) . . . (n − r + 1)/r!) p^r q^{n−r}
  = Σ_{r=1}^n r n ((n − 1) . . . (n − r + 1)/(r − 1)!) p^r q^{n−r} = Σ_{r=1}^n r n C_{n−1}^{r−1} p^r q^{n−r}
  = np Σ_{r=1}^n r C_{n−1}^{r−1} p^{r−1} q^{(n−1)−(r−1)}
Binomial distribution: calculating the variance (cont'd)

By denoting j = r − 1 and m = n − 1, we'll get:

E[b²(r; n, p)] = np Σ_{j=0}^m (j + 1) C_m^j p^j q^{m−j}
  = np [Σ_{j=0}^m j C_m^j p^j q^{m−j} + Σ_{j=0}^m C_m^j p^j q^{m−j}],

where the first sum is E[b(j; n − 1, p)] = (n − 1)p, cf. (2), and the second sum is 1.

Therefore,

E[b²(r; n, p)] = np[(n − 1)p + 1] = n²p² − np² + np.

Finally,

Var[X] = E[b²(r; n, p)] − (E[b(r; n, p)])² = n²p² − np² + np − n²p² = np(1 − p)


Binomial distribution: calculating the variance

Another solution

• it is relatively easy to show that any random variable following the binomial distribution b(r; n, p) can be seen as a sum of n independent variables following the Bernoulli distribution of parameter p; (a)
• we know (or it can be proven immediately) that the variance of the Bernoulli distribution of parameter p is p(1 − p);
• using the additivity property of variance for independent variables (Var[X₁ + X₂ + . . . + Xₙ] = Var[X₁] + Var[X₂] + . . . + Var[Xₙ], if X₁, X₂, . . . , Xₙ are independent), it follows that Var[X] = np(1 − p). A simulation sketch of this view is given below.

(a) See www.proofwiki.org/wiki/Bernoulli Process as Binomial Distribution, which also cites as source "Probability: An Introduction" by Geoffrey Grimmett and Dominic Welsh, Oxford Science Publications, 1986.
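The sketch below (assuming Python with numpy; the parameters are illustrative) simulates exactly this sum-of-Bernoullis view:

    # A Binomial(n, p) draw as the sum of n independent Bernoulli(p) draws.
    import numpy as np
    rng = np.random.default_rng(0)
    n, p = 20, 0.3
    samples = rng.binomial(1, p, size=(500_000, n)).sum(axis=1)
    print(samples.mean(), n * p)            # ≈ 6.0
    print(samples.var(), n * p * (1 - p))   # ≈ 4.2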

The categorical distribution:


Computing probabilities and expectations

CMU, 2009 fall, Geoff Gordon, HW1, pr. 4


Suppose we have n bins and m balls. We throw balls into bins independently at random, so that each ball is equally likely to fall into any of the bins.
a. What is the probability of the first ball falling into the first bin?
b. What is the expected number of balls in the first bin?
Hint (1): Define an indicator random variable representing whether the i-th ball fell into the first bin:

X_i = 1 if the i-th ball fell into the first bin; 0 otherwise.

Hint (2): Use linearity of expectation.

c. What is the probability that the first bin is empty?
d. What is the expected number of empty bins? Hint (3): Define an indicator for the event "bin j is empty" and use linearity of expectations.
Answer
a. It is equally likely for a ball to fall into any of the bins, so the probability that the first ball falls into the first bin is 1/n.
b. Define X_i as described in Hint 1. Let Y be the total number of balls that fall into the first bin: Y = X₁ + . . . + X_m. The expected number of balls is:

E[Y] = Σ_{i=1}^m E[X_i] = Σ_{i=1}^m 1 · P(X_i = 1) = m · 1/n = m/n

c. Let Y and X_i be the same as defined at point b. For the first bin to be empty, none of the balls should fall into the first bin: Y = 0.

P(Y = 0) = P(X₁ = 0, . . . , X_m = 0) = Π_{i=1}^m P(X_i = 0) = (1 − 1/n)^m = ((n − 1)/n)^m

d. For each one of the n bins, we define an indicator random variable Y_j for the event "bin j is empty". Let Z be the random variable denoting the number of empty bins, Z = Y₁ + . . . + Yₙ. Then the expected number of empty bins is:

E[Z] = E[Σ_{j=1}^n Y_j] = Σ_{j=1}^n E[Y_j] = Σ_{j=1}^n 1 · P(Y_j = 1) = Σ_{j=1}^n ((n − 1)/n)^m = n ((n − 1)/n)^m
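All four answers can also be checked by simulation; a minimal sketch assuming Python with numpy (n and m are arbitrary):

    # Simulate throwing m balls into n bins, uniformly and independently.
    import numpy as np
    rng = np.random.default_rng(0)
    n, m, trials = 10, 15, 100_000
    bins = rng.integers(0, n, size=(trials, m))          # bin index of each ball
    first = (bins == 0).sum(axis=1)                      # number of balls in the first bin
    print(first.mean(), m / n)                           # b. ≈ 1.5
    print((first == 0).mean(), (1 - 1/n) ** m)           # c. ≈ 0.206
    empty = np.array([n - len(np.unique(row)) for row in bins])
    print(empty.mean(), n * (1 - 1/n) ** m)              # d. ≈ 2.06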
The univariate Gaussian distribution:

N_{µ,σ}(x) = (1/(√(2π) σ)) e^{−(x − µ)²/(2σ²)}

The plot, when µ = 0:

[bell-curve figure of (1/(σ√(2π))) exp(−x²/(2σ²)): about 34% of the probability mass lies in each of the intervals (−σ, 0) and (0, σ), about 14% in (σ, 2σ), about 2% in (2σ, 3σ), and about 0.1% beyond 3σ, symmetrically on the negative side.]

Source: http://www.texample.net/tikz/examples/standard-deviation/

Proving that N_{µ,σ} is indeed a p.d.f.

1. N_{µ,σ}(x) ≥ 0  ∀x ∈ R (true)

2. ∫_{−∞}^{∞} N_{µ,σ}(x) dx = 1

Note: Concerning the second property, it is enough to prove it for the standard case (µ = 0, σ = 1), because the non-standard case can be reduced to this one:
Using the variable transformation v = (x − µ)/σ will imply x = σv + µ and dx = σ dv, so:

∫_{−∞}^{∞} N_{µ,σ}(x) dx = (1/(√(2π) σ)) ∫_{x=−∞}^{∞} e^{−(x − µ)²/(2σ²)} dx
  = (1/(√(2π) σ)) ∫_{v=−∞}^{∞} e^{−v²/2} σ dv = (1/√(2π)) ∫_{v=−∞}^{∞} e^{−v²/2} dv = ∫_{x=−∞}^{∞} N_{0,1}(x) dx
The standard case: proving that N_{0,1} is indeed a p.d.f.:

(∫_{v=−∞}^{∞} e^{−v²/2} dv)² = (∫_{x=−∞}^{∞} e^{−x²/2} dx) · (∫_{y=−∞}^{∞} e^{−y²/2} dy)
  = ∫_{x=−∞}^{∞} ∫_{y=−∞}^{∞} e^{−(x² + y²)/2} dy dx = ∫∫_{R²} e^{−(x² + y²)/2} dy dx

By switching from x, y to polar coordinates r, θ (see the Note below), it follows:

(∫_{v=−∞}^{∞} e^{−v²/2} dv)² = ∫_{r=0}^{∞} ∫_{θ=0}^{2π} e^{−r²/2} (r dr dθ) = ∫_{r=0}^{∞} r e^{−r²/2} (∫_{θ=0}^{2π} dθ) dr
  = ∫_{r=0}^{∞} r e^{−r²/2} θ|₀^{2π} dr = 2π ∫_{r=0}^{∞} r e^{−r²/2} dr = 2π (−e^{−r²/2})|₀^{∞} = 2π(1 − 0) = 2π

⇒ ∫_{v=−∞}^{∞} e^{−v²/2} dv = √(2π) ⇒ (1/√(2π)) ∫_{v=−∞}^{∞} e^{−v²/2} dv = 1.

Note: x = r cos θ and y = r sin θ, with r ≥ 0 and θ ∈ [0, 2π). Therefore, x² + y² = r², and the Jacobian determinant is

∂(x, y)/∂(r, θ) = | ∂x/∂r  ∂x/∂θ | = | cos θ   −r sin θ | = r cos²θ + r sin²θ = r ≥ 0.   So, dx dy = r dr dθ.
                  | ∂y/∂r  ∂y/∂θ |   | sin θ    r cos θ |

Calculating the mean

E[X] =def ∫_{−∞}^{∞} x N_{µ,σ}(x) dx = (1/(√(2π) σ)) ∫_{−∞}^{∞} x · e^{−(x − µ)²/(2σ²)} dx

Using again the variable transformation v = (x − µ)/σ will imply:

E[X] = (1/(√(2π) σ)) ∫_{−∞}^{∞} (σv + µ) e^{−v²/2} (σ dv)
     = (σ/√(2π)) ∫_{−∞}^{∞} v e^{−v²/2} dv + (µ/√(2π)) ∫_{−∞}^{∞} e^{−v²/2} dv
     = (1/√(2π)) [−σ ∫_{−∞}^{∞} (−v) e^{−v²/2} dv + µ ∫_{−∞}^{∞} e^{−v²/2} dv]
     = (1/√(2π)) [−σ e^{−v²/2}|_{−∞}^{∞} + µ ∫_{−∞}^{∞} e^{−v²/2} dv],   where e^{−v²/2}|_{−∞}^{∞} = 0
     = (µ/√(2π)) ∫_{−∞}^{∞} e^{−v²/2} dv = (µ/√(2π)) √(2π) = µ
Calculating the variance

We will make use of the formula Var[X] = E[X²] − E²[X].

E[X²] = ∫_{−∞}^{∞} x² N_{µ,σ}(x) dx = (1/(√(2π) σ)) ∫_{−∞}^{∞} x² · e^{−(x − µ)²/(2σ²)} dx

Again, using the transformation v = (x − µ)/σ will imply x = σv + µ and dx = σ dv. Therefore,

E[X²] = (1/(√(2π) σ)) ∫_{−∞}^{∞} (σv + µ)² e^{−v²/2} (σ dv)
      = (1/√(2π)) ∫_{−∞}^{∞} (σ²v² + 2σµv + µ²) e^{−v²/2} dv
      = (1/√(2π)) [σ² ∫_{−∞}^{∞} v² e^{−v²/2} dv + 2σµ ∫_{−∞}^{∞} v e^{−v²/2} dv + µ² ∫_{−∞}^{∞} e^{−v²/2} dv]

Note that we have already computed ∫_{−∞}^{∞} v e^{−v²/2} dv = 0 and ∫_{−∞}^{∞} e^{−v²/2} dv = √(2π).

Calculating the variance (cont'd)
Therefore, we only need to compute

∫_{−∞}^{∞} v² e^{−v²/2} dv = ∫_{−∞}^{∞} (−v)(−v e^{−v²/2}) dv = ∫_{−∞}^{∞} (−v)(e^{−v²/2})′ dv
  = (−v) e^{−v²/2}|_{−∞}^{∞} − ∫_{−∞}^{∞} (−1) e^{−v²/2} dv = 0 + ∫_{−∞}^{∞} e^{−v²/2} dv = √(2π).

Here above we used the fact that

lim_{v→∞} v e^{−v²/2} = lim_{v→∞} v/e^{v²/2} =(l'Hôpital)= lim_{v→∞} 1/(v e^{v²/2}) = 0 = lim_{v→−∞} v e^{−v²/2}

So, E[X²] = (1/√(2π)) (σ² √(2π) + 2σµ · 0 + µ² √(2π)) = σ² + µ².

And, finally, Var[X] = E[X²] − (E[X])² = (σ² + µ²) − µ² = σ².
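The three integrals above (normalization, mean, variance) can be confirmed by numerical quadrature; a minimal sketch assuming Python with scipy (µ and σ are arbitrary):

    # Quadrature check of ∫N = 1, ∫xN = µ, ∫(x−µ)²N = σ².
    import numpy as np
    from scipy.integrate import quad

    mu, sigma = 1.5, 2.0
    pdf = lambda x: np.exp(-(x - mu)**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
    print(quad(pdf, -np.inf, np.inf)[0])                             # ≈ 1
    print(quad(lambda x: x * pdf(x), -np.inf, np.inf)[0])            # ≈ µ = 1.5
    print(quad(lambda x: (x - mu)**2 * pdf(x), -np.inf, np.inf)[0])  # ≈ σ² = 4.0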



Vectors of random variables.


A property:
The covariance matrix Σ corresponding to such a vector is
symmetric and positive semi-definite

Chuong Do, Stanford University, 2008

[adapted by Liviu Ciortuz]


Let X₁, . . . , Xₙ be random variables, with X_i : Ω → R for i = 1, . . . , n. The covariance matrix of the vector of random variables X = (X₁, . . . , Xₙ) is a square matrix of size n × n, whose elements are defined as [Cov(X)]_{ij} =def Cov(X_i, X_j), for all i, j ∈ {1, . . . , n}.

Show that Σ =not. Cov(X) is a symmetric and positive semi-definite matrix, the second property meaning that for any vector z ∈ Rⁿ the inequality z⊤Σz ≥ 0 holds. (The vectors z ∈ Rⁿ are considered column vectors, and the symbol ⊤ denotes matrix transposition.)
Cov(X)_{i,j} =def Cov(X_i, X_j), for all i, j ∈ {1, . . . , n}, and

Cov(X_i, X_j) =def E[(X_i − E[X_i])(X_j − E[X_j])] = E[(X_j − E[X_j])(X_i − E[X_i])] = Cov(X_j, X_i),

therefore Cov(X) is a symmetric matrix.

We will show that z⊤Σz ≥ 0 for any z ∈ Rⁿ (seen as a column vector):

z⊤Σz = Σ_{i=1}^n z_i (Σ_{j=1}^n Σ_{ij} z_j) = Σ_{i=1}^n Σ_{j=1}^n (z_i Σ_{ij} z_j) = Σ_{i=1}^n Σ_{j=1}^n (z_i Cov[X_i, X_j] z_j)
     = Σ_{i=1}^n Σ_{j=1}^n (z_i E[(X_i − E[X_i])(X_j − E[X_j])] z_j) = E[Σ_{i=1}^n Σ_{j=1}^n z_i (X_i − E[X_i])(X_j − E[X_j]) z_j]
     = E[(Σ_{i=1}^n z_i (X_i − E[X_i])) (Σ_{j=1}^n (X_j − E[X_j]) z_j)]
     = E[((X − E[X])⊤ · z)²] ≥ 0
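A numerical illustration of both properties (not part of the original proof; a sketch assuming Python with numpy):

    # A sample covariance matrix is symmetric, and z'Σz ≥ 0 for any z.
    import numpy as np
    rng = np.random.default_rng(0)
    X = rng.standard_normal((5000, 4)) @ rng.standard_normal((4, 4))  # correlated columns
    Sigma = np.cov(X, rowvar=False)
    print(np.allclose(Sigma, Sigma.T))                  # True: symmetric
    z = rng.standard_normal((1000, 4))
    quad_forms = np.einsum('ij,jk,ik->i', z, Sigma, z)  # z_i' Σ z_i for each row of z
    print((quad_forms >= -1e-12).all())                 # True: PSD (up to float error)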

Multi-variate Gaussian distributions:


A property:
When the covariance matrix of a multi-variate (d-dimensional) Gaussian
distribution is diagonal, then the p.d.f. (probability density function) of
the respective multi-variate Gaussian is equal to the product of d
independent uni-variate Gaussian densities.

Chuong Do, Stanford University, 2008

[adapted by Liviu Ciortuz]


Let's consider X = [X₁ . . . X_d]⊤, µ ∈ R^d and Σ ∈ S⁺_d, where S⁺_d is the set of symmetric positive definite matrices (which implies |Σ| ≠ 0 and (x − µ)⊤Σ⁻¹(x − µ) > 0, therefore −½(x − µ)⊤Σ⁻¹(x − µ) < 0, for any x ∈ R^d, x ≠ µ).
The probability density function of a multi-variate Gaussian distribution of parameters µ and Σ is:

p(x; µ, Σ) = (1/((2π)^{d/2} |Σ|^{1/2})) exp(−½ (x − µ)⊤Σ⁻¹(x − µ)).

Notation: X ∼ N(µ, Σ).
Show that when the covariance matrix Σ is diagonal, then the p.d.f. (probability density function) of the respective multi-variate Gaussian is equal to the product of d independent uni-variate Gaussian densities.

We will make the proof for d = 2 (generalization to d > 2 will be easy):

x = [x₁; x₂],   µ = [µ₁; µ₂],   Σ = [ σ₁²   0  ]
                                    [  0   σ₂² ]

Note: It is easy to show that if Σ ∈ S⁺_d is diagonal, the elements on the principal diagonal of Σ are indeed strictly positive. (It is enough to consider z = (1, 0) and respectively z = (0, 1) in the formula for positive-definiteness of Σ.) This is why we wrote these elements of Σ as σ₁² and σ₂².
p(x; µ, Σ) = (1/(2π (σ₁²σ₂²)^{1/2})) exp(−½ [x₁ − µ₁, x₂ − µ₂] [ σ₁²  0 ; 0  σ₂² ]⁻¹ [x₁ − µ₁; x₂ − µ₂])
  = (1/(2π σ₁σ₂)) exp(−½ [x₁ − µ₁, x₂ − µ₂] [ 1/σ₁²  0 ; 0  1/σ₂² ] [x₁ − µ₁; x₂ − µ₂])
  = (1/(2π σ₁σ₂)) exp(−½ [x₁ − µ₁, x₂ − µ₂] [(x₁ − µ₁)/σ₁²; (x₂ − µ₂)/σ₂²])
  = (1/(2π σ₁σ₂)) exp(−(x₁ − µ₁)²/(2σ₁²) − (x₂ − µ₂)²/(2σ₂²))
  = p(x₁; µ₁, σ₁²) p(x₂; µ₂, σ₂²).
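A one-point numerical check of this factorization property (a sketch assuming Python with scipy; the chosen µ, σ₁, σ₂ and x are arbitrary):

    # Diagonal-covariance Gaussian density vs. product of univariate densities.
    import numpy as np
    from scipy.stats import multivariate_normal, norm

    mu, s1, s2 = np.array([1.0, -2.0]), 0.7, 1.3
    Sigma = np.diag([s1**2, s2**2])
    x = np.array([0.3, -1.1])
    lhs = multivariate_normal(mu, Sigma).pdf(x)
    rhs = norm(mu[0], s1).pdf(x[0]) * norm(mu[1], s2).pdf(x[1])
    print(lhs, rhs)   # equal up to floating-point error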



Factorizing positive definite matrices, using eigenvectors

Sebastian Ciobanu,
following Chuong Do (from Stanford University),
The Multivariate Gaussian Distribution, 2008
Let X : Ω → R^d be a random variable. In what follows, the elements of R^d will be considered column vectors. We will denote by S⁺_d the set of symmetric positive definite matrices of size d × d.
Definition: Let A ∈ R^{d×d}. An eigenvalue of the matrix A is a complex number λ ∈ C for which there exists a nonzero vector x ∈ C^d such that Ax = λx. This vector x is called an eigenvector associated with the eigenvalue λ.
a. Prove that for any matrix Σ ∈ S⁺_d there exists a matrix B ∈ R^{d×d} such that Σ can be factorized in the following form: Σ = BB⊤.
Remark: This factorization is not unique, i.e., there are several possibilities of writing the matrix Σ as Σ = BB⊤.
Hint: You may use the following properties:
i. Any matrix A ∈ R^{d×d} which is symmetric can be written as A = UΛU⊤, where U ∈ R^{d×d} is an orthonormal matrix containing the eigenvectors of A (required to have norm 1) as columns, and Λ is the diagonal matrix containing the eigenvalues of A in the order corresponding to the columns (i.e., the eigenvectors) of the matrix U. (The fact that U is an orthonormal matrix can be formally written as U⊤U = UU⊤ = I.)
ii. Let A ∈ R^{d×d} be a symmetric matrix. A is positive definite if and only if all its eigenvalues are (real and) positive.
iii. (AB)⊤ = B⊤A⊤, for any matrices A, B ∈ R^{d×d}.
Solution
We know, by hypothesis, that the matrix Σ ∈ S⁺_d, so it is symmetric. According to property (a.i), we can write the following factorization for Σ:

Σ = UΛU⊤,

where U is the orthonormal matrix containing the (norm-1) eigenvectors of Σ as columns, and Λ ∈ R^{d×d} is the diagonal matrix containing the eigenvalues of Σ, in the order corresponding to the columns of the matrix U.
Since Σ ∈ S⁺_d is positive definite, according to property (a.ii) all the eigenvalues of Σ are positive. Consequently, there exists the matrix

Λ^{1/2} = diag(√λ₁, . . . , √λ_d) ∈ R^{d×d}.

Remarks:
(1) It is immediate that we can factorize / decompose the matrix Λ as follows:

Λ^{1/2} · Λ^{1/2} = Λ.   (3)

(2) Λ^{1/2} is a diagonal matrix, hence symmetric. Therefore,

(Λ^{1/2})⊤ = Λ^{1/2}.   (4)


Now we can resume the factorization Σ = UΛU⊤. Elaborating, that is, factorizing the matrix Λ, on the right-hand side of this equality, we obtain a new factorization for the matrix Σ as follows:

Σ = UΛU⊤ =(3)= UΛ^{1/2}Λ^{1/2}U⊤ =(4)= UΛ^{1/2}(Λ^{1/2})⊤U⊤
  =(assoc., a.iii)= UΛ^{1/2}(UΛ^{1/2})⊤ =not.= BB⊤, where B = UΛ^{1/2}.   (5)

Remarks:
(3) Another way to factorize / decompose the matrix Σ in the form Σ = BB⊤ is given by the Cholesky factorization: if A is a symmetric and positive definite matrix, then there exists a unique lower-triangular matrix L ∈ R^{d×d} such that A = LL⊤. (See https://en.wikipedia.org/wiki/Positive-definite matrix.)
(4) The factorization (5) actually holds for positive semidefinite matrices as well. However, the invertibility property (see point c) holds only for positive definite matrices.
b. At this point we will exemplify the notions presented at point a. Considering the matrix

Σ = [ 1    0.5 ]
    [ 0.5   1  ]

(so, d = 2), determine a matrix B ∈ R^{2×2} such that Σ = BB⊤.
Hint: Using the notations from the Definition given for eigenvalues and eigenvectors, we will have:

Ax = λx ⇔ (λI_d − A)x = 0, x ≠ 0 ⇔ det(λI_d − A) = 0,

where 0 is the d-dimensional null column vector, and I_d is the d-dimensional identity matrix.
Solution
First we compute the eigenvalues of the matrix Σ in the statement. So let λ ∈ R and x ∈ R^d, x ≠ 0. We have to solve the equation Σx = λx and, according to the Hint in the statement, this comes down to solving the equation det(λI_d − Σ) = 0.

det(λI_d − Σ) = 0 ⇔ det [ λ − 1   −0.5  ] = 0 ⇔ (λ − 1)² − 0.5² = 0
                        [ −0.5    λ − 1 ]
⇔ (λ − 1 − 0.5)(λ − 1 + 0.5) = 0 ⇔ (λ − 1.5)(λ − 0.5) = 0 ⇔ λ₁ = 0.5, λ₂ = 1.5.

So the eigenvalues of the matrix Σ are λ₁ = 0.5 and λ₂ = 1.5.

Note that λ₁, λ₂ > 0. Therefore, according to property (a.ii), the matrix Σ given in the statement is positive definite. From point a, we now know that there exists a matrix B such that Σ can be factorized in the form Σ = BB⊤. To determine the matrix B, we have to compute the eigenvectors of the matrix Σ. We obtain them by solving the following system, which has infinitely many solutions (because det(λI_d − Σ) = 0):

{ (λ − 1)v − 0.5u = 0        ⇒ v = 0.5u/(λ − 1).
{ −0.5v + (λ − 1)u = 0

So, the eigenvectors are of the form:

x = (v, u) = (0.5u/(λ − 1), u), u ∈ R.

Concretely,

λ₁ = 0.5 ⇒ x₁ = (−u, u), u ∈ R
and
λ₂ = 1.5 ⇒ x₂ = (u, u), u ∈ R.

Since we want the vectors x₁ and x₂ to have norm 1, we "normalize" them. First,

x₁/‖x₁‖₂ = (−u, u)/√(2u²) = (−u, u)/(√2 |u|)   (6)

and, taking u > 0, so |u| = u, it follows that

x₁/‖x₁‖₂ = (−u, u)/(√2 u) = (−1/√2, 1/√2).

Then, similarly we obtain

x₂/‖x₂‖₂ = (1/√2, 1/√2).
From point a we know that we can choose B = UΛ^{1/2}, so

U = [ −1/√2   1/√2 ] ,   Λ^{1/2} = [ √0.5    0   ] = [ 1/√2    0    ] ,
    [  1/√2   1/√2 ]               [  0    √1.5  ]   [  0    √3/√2  ]

B = [ −1/√2   1/√2 ] [ 1/√2    0    ] = [ −1/2   √3/2 ] .
    [  1/√2   1/√2 ] [  0    √3/√2  ]   [  1/2   √3/2 ]

We check that indeed Σ = BB⊤:

BB⊤ = [ −1/2   √3/2 ] [ −1/2   1/2  ] = [ 1/4 + 3/4    −1/4 + 3/4 ] = [ 1    0.5 ] = Σ.
      [  1/2   √3/2 ] [ √3/2   √3/2 ]   [ −1/4 + 3/4    1/4 + 3/4 ]   [ 0.5   1  ]
Remark
In line with the Remark in the problem statement, if in relation (6) we take u < 0, we obtain another solution for the factorization of the matrix Σ:

x₁′/‖x₁′‖₂ = (−u, u)/(√2 |u|) = (1/√2, −1/√2) = −x₁/‖x₁‖₂
x₂′/‖x₂′‖₂ = (u, u)/(√2 |u|) = (−1/√2, −1/√2) = −x₂/‖x₂‖₂,

hence

U′ = −U,   B′ = U′Λ^{1/2} = −UΛ^{1/2} = −B

and, consequently,

B′(B′)⊤ = (−B)(−B)⊤ = BB⊤ = Σ.
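The whole computation at points a-b can be reproduced numerically; a sketch assuming Python with numpy (np.linalg.eigh returns the eigendecomposition of a symmetric matrix):

    # Factorize Σ = B B^T via the eigendecomposition Σ = U Λ U^T, B = U Λ^{1/2}.
    import numpy as np
    Sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
    lam, U = np.linalg.eigh(Sigma)        # eigenvalues [0.5, 1.5] and orthonormal eigenvectors
    B = U @ np.diag(np.sqrt(lam))         # B = U Λ^{1/2}
    print(np.allclose(B @ B.T, Sigma))    # True
    print(np.linalg.cholesky(Sigma))      # a different valid factor L, with Σ = L L^T (Remark (3))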
c. Prove that the matrix B which satisfies the property at point a is invertible.

Hint: You may use the following properties:

i. A matrix A ∈ R^{d×d} is invertible if and only if det(A) ≠ 0.
ii. det(AB) = det(A) det(B), for any matrices A, B ∈ R^{d×d}.
iii. det(A) = det(A⊤), where A ∈ R^{d×d}.
iv. Any positive definite matrix is invertible (and its inverse is also a positive definite matrix).
Solution
From point a it follows that for any matrix Σ ∈ S⁺_d there exists a decomposition / factorization of the form Σ = BB⊤, so

det(Σ) = det(BB⊤) =(c.ii)= det(B) det(B⊤) =(c.iii)= det(B)².

Consequently,

det(B) = ±√(det(Σ)).   (7)

On the other hand, since Σ is a positive definite matrix, it follows by property (c.iv) that Σ is invertible. Further, by property (c.i), we have det(Σ) ≠ 0. Corroborating this with relation (7), it follows that det(B) ≠ 0 and consequently (again, by property (c.i)) that the matrix B is invertible (which was to be proven).

The Gaussian Multivariate Distribution:


Its density function is indeed a p.d.f.

Sebastian Ciobanu,
following Chuong Do (from Stanford University),
The Multivariate Gaussian Distribution, 2008
Definition: We denote by S⁺_d the set of symmetric positive definite matrices of size d × d. We say that the vector random variable X : Ω → R^d, represented in the form X = (X₁ . . . X_d)⊤, follows a multivariate Gaussian distribution, with mean µ ∈ R^d and covariance matrix Σ ∈ S⁺_d, if its [probability] density function has the following analytic form:

p_X(x; µ, Σ) = (1/((2π)^{d/2} det(Σ)^{1/2})) exp(−½ (x − µ)⊤Σ⁻¹(x − µ)),

where the notation det(Σ) designates the determinant of the matrix Σ, and exp( ) designates the exponential function with base e. In short, we will denote this defining property of X in the form X ∼ N(µ, Σ).
Remark (1): In problem . . . we proved that for any symmetric and positive definite matrix Σ ∈ R^{d×d} there exists (though not necessarily in a unique way) a matrix B ∈ R^{d×d} with the property that Σ can be "factorized" in the form Σ = BB⊤. Moreover, we proved that any matrix B which satisfies this property is necessarily invertible.
a. Prove that if we define Z = B⁻¹(X − µ), then Z ∼ N(0, I_d), where 0 is the d-dimensional null column vector, and I_d is the d-dimensional identity matrix.
Remark (2): This property is a generalization of the "standardization" method that we already encountered in problem CMU, 2004 fall, T. Mitchell, Z. Bar-Joseph, HW1, pr. 3, applied to univariate Gaussian distributions.
Hint (1): You may use the following properties:
i. Let X = (X₁ . . . X_d)⊤ ∈ R^d be a vector-type random variable, with the joint distribution given by the density function p_X : R^d → R. If Z = H(X) ∈ R^d, where H is a bijective and differentiable function, then Z has the joint distribution given by the density function p_Z : R^d → R, where

p_Z(z) = p_X(x) · |det(∂x/∂z)|,

and ∂x/∂z denotes the Jacobian matrix of x with respect to z, i.e., the d × d matrix whose (i, j) entry is ∂x_i/∂z_j.
ii. ∂(Ax + b)/∂x = A, where x, b ∈ R^d, A ∈ R^{d×d}, and A and b do not depend on x.
iii. (AB)⁻¹ = B⁻¹A⁻¹, where A, B ∈ R^{d×d}.
iv. (AB)⊤ = B⊤A⊤, where A, B ∈ R^{d×d}.
v. det(AB) = det(A) det(B), where A, B ∈ R^{d×d}.
vi. det(A) = det(A⊤), where A ∈ R^{d×d}.
Solution
The following relations are immediate:

z = B⁻¹(x − µ) ⇒ Bz = x − µ ⇒ x = Bz + µ   (8)

∂x/∂z = ∂(Bz + µ)/∂z =(a.ii)= B   (9)

(det(Σ))^{1/2} = √(det(BB⊤)) =(a.v)= √(det(B) det(B⊤)) =(a.vi)= √((det(B))²) = |det(B)|.   (10)

We now rewrite the expression which constitutes the argument of the exp( ) function in the definition of the p.d.f. of the multivariate Gaussian distribution (see the statement), in terms of the vector z:

(x − µ)⊤Σ⁻¹(x − µ) =(8)= (Bz + µ − µ)⊤(BB⊤)⁻¹(Bz + µ − µ) = (Bz)⊤(BB⊤)⁻¹(Bz)
  =(a.iii)= (Bz)⊤((B⊤)⁻¹B⁻¹)(Bz) =(a.iv)= (z⊤B⊤)((B⊤)⁻¹B⁻¹)(Bz)
  =(assoc.)= z⊤(B⊤(B⊤)⁻¹)(B⁻¹B)z = z⊤ I_d I_d z = z⊤z.   (11)
Applying now property (a.i), we can compute the p.d.f. of the Gaussian distribution associated with the variable Z:

p_Z(z) = p_X(x) · |det(∂x/∂z)|
       = (1/((2π)^{d/2} det(Σ)^{1/2})) exp(−½ (x − µ)⊤Σ⁻¹(x − µ)) · |det(∂x/∂z)|
  =((9), (10), (11))= (1/((2π)^{d/2} |det(B)|)) exp(−½ z⊤z) · |det(B)|
       = (1/(2π)^{d/2}) exp(−½ z⊤z) = (1/((2π)^{d/2} (det(I_d))^{1/2})) exp(−½ z⊤ I_d z)
       = N(z; 0, I_d).
b. Show that the function p_X(x; µ, Σ) given in the statement is indeed a probability density function (p.d.f.).
Hint (2): You may use the following properties:
i. For the case d = 1, the function p_X(x; µ, σ²) is a probability density function.
ii. In the case where the matrix Σ is diagonal,

Σ = diag(σ₁², . . . , σ_d²),   and   µ = (µ₁, . . . , µ_d)⊤,

the expression of the multivariate Gaussian density function is identical to the product of d independent univariate Gaussian density functions, the first having mean µ₁ and variance σ₁², the second having mean µ₂ and variance σ₂², . . . , the d-th having mean µ_d and variance σ_d².
Solution
p_X(x; µ, Σ) is a probability density function (p.d.f.) if:

• p_X(x; µ, Σ) ≥ 0, ∀x ∈ R^d
• I =not. ∫_{−∞}^{∞} ∫_{−∞}^{∞} . . . ∫_{−∞}^{∞} p_X(x; µ, Σ) dx_d . . . dx₂ dx₁ = 1.

The first condition is satisfied, because the denominator of the fraction [which is the normalization factor] in the definition of the probability density function p_X(x; µ, Σ) is positive, and exp(y) > 0 for any y ∈ R.
We now check the second condition.
In the integral I we make the substitution (change of variable) z = B⁻¹(x − µ), where Σ = BB⊤, B ∈ R^{d×d}. The matrix B exists and is invertible, as stated in the problem (see Remark (1)).
From point a it follows immediately that:

I = ∫_{−∞}^{∞} ∫_{−∞}^{∞} . . . ∫_{−∞}^{∞} p_Z(z; 0, I_d) dz_d . . . dz₂ dz₁
  =(b.ii)= (∫_{−∞}^{∞} p_{Z₁}(z₁; 0, 1) dz₁) (∫_{−∞}^{∞} p_{Z₂}(z₂; 0, 1) dz₂) . . . (∫_{−∞}^{∞} p_{Z_d}(z_d; 0, 1) dz_d) =(b.i)= 1 · 1 · . . . · 1 = 1.

So, the second condition is also satisfied, which means that the function p_X(x; µ, Σ) is indeed a probability density function (p.d.f.).

Bi-variate Gaussian distributions. A property:


The conditional distributions X1 |X2 and X2|X1 are also
Gaussians.
The calculation of their parameters

Duda, Hart and Stork, Pattern Classification, 2001,


Appendix A.5.2

[adapted by Liviu Ciortuz]


Let X be a random variable which follows a bivariate Gaussian distribution with parameters µ (the mean vector) and Σ (the covariance matrix). So, µ = (µ₁, µ₂) ∈ R², and Σ ∈ M₂ₓ₂(R).
By definition, Σ = Cov(X, X), where X = (X₁, X₂), so Σ_{ij} = Cov(X_i, X_j) for i, j ∈ {1, 2}. Also, Cov(X_i, X_i) = Var[X_i] =not. σ_i² ≥ 0 for i ∈ {1, 2}, while for i ≠ j we have Cov(X_i, X_j) = Cov(X_j, X_i) =not. σ₁₂.
Finally, if we introduce the "correlation coefficient" ρ =def σ₁₂/(σ₁σ₂), it follows that we can write the covariance matrix as:

Σ = [ σ₁²      ρσ₁σ₂ ]   (12)
    [ ρσ₁σ₂    σ₂²   ]
Prove that the hypothesis X ∼ N(µ, Σ) implies that the conditional distribution X₂|X₁ is Gaussian, namely

X₂|X₁ = x₁ ∼ N(µ_{2|1}, σ²_{2|1}),

with µ_{2|1} = µ₂ + ρ (σ₂/σ₁)(x₁ − µ₁) and σ²_{2|1} = σ₂²(1 − ρ²).

Remark: For X₁|X₂, the result is similar: X₁|X₂ = x₂ ∼ N(µ_{1|2}, σ²_{1|2}), with µ_{1|2} = µ₁ + ρ (σ₁/σ₂)(x₂ − µ₂) and σ²_{1|2} = σ₁²(1 − ρ²).

Source: Pattern Classification, 2nd ed., Appendix A.5.2, R. Duda, P. Hart and D. Stork, 2001
Answer

p_{X₂|X₁}(x₂|x₁) =def p_{X₁,X₂}(x₁, x₂) / p_{X₁}(x₁),   (13)

where

p_{X₁,X₂}(x₁, x₂) = (1/(2π √|Σ|)) exp(−½ (x − µ)⊤Σ⁻¹(x − µ))   and
p_{X₁}(x₁) = (1/(√(2π) σ₁)) exp(−(x₁ − µ₁)²/(2σ₁²)).   (14)

From (12) it follows that |Σ| = σ₁²σ₂²(1 − ρ²). In order that |Σ| and Σ⁻¹ be defined, it follows that ρ ∈ (−1, 1). Moreover, since σ₁, σ₂ > 0, we will have √|Σ| = σ₁σ₂√(1 − ρ²).

Σ⁻¹ = (1/|Σ|) Σ* = (1/(σ₁²σ₂²(1 − ρ²))) [ σ₂²      −ρσ₁σ₂ ]
                                        [ −ρσ₁σ₂    σ₁²   ]

    = (1/(1 − ρ²)) [ 1/σ₁²        −ρ/(σ₁σ₂) ]
                   [ −ρ/(σ₁σ₂)     1/σ₂²    ]
So,

p_{X₁,X₂}(x₁, x₂)
  = (1/(2πσ₁σ₂√(1 − ρ²))) exp(−(1/(2(1 − ρ²))) [x₁ − µ₁, x₂ − µ₂] [ 1/σ₁²       −ρ/(σ₁σ₂) ] [x₁ − µ₁; x₂ − µ₂])
                                                                  [ −ρ/(σ₁σ₂)    1/σ₂²   ]
  = (1/(2πσ₁σ₂√(1 − ρ²))) exp(−(1/(2(1 − ρ²))) [((x₁ − µ₁)/σ₁)² − 2ρ ((x₁ − µ₁)/σ₁)((x₂ − µ₂)/σ₂) + ((x₂ − µ₂)/σ₂)²])   (15)
By substituting (14) and (15) in the definition (13), we get:

p(x₂|x₁) = p_{X₁,X₂}(x₁, x₂) / p_{X₁}(x₁)
  = (1/(2πσ₁σ₂√(1 − ρ²))) exp(−(1/(2(1 − ρ²))) [((x₁ − µ₁)/σ₁)² − 2ρ ((x₁ − µ₁)/σ₁)((x₂ − µ₂)/σ₂) + ((x₂ − µ₂)/σ₂)²])
    · √(2π) σ₁ exp(+½ ((x₁ − µ₁)/σ₁)²)
  = (1/(√(2π) σ₂√(1 − ρ²))) exp(−(1/(2(1 − ρ²))) [(x₂ − µ₂)/σ₂ − ρ (x₁ − µ₁)/σ₁]²)
  = (1/(√(2π) σ₂√(1 − ρ²))) exp(−½ [(x₂ − [µ₂ + ρ (σ₂/σ₁)(x₁ − µ₁)]) / (σ₂√(1 − ρ²))]²)

Therefore,

X₂|X₁ = x₁ ∼ N(µ_{2|1}, σ²_{2|1}) with µ_{2|1} =not. µ₂ + ρ (σ₂/σ₁)(x₁ − µ₁) and σ²_{2|1} =not. σ₂²(1 − ρ²).

Using the Central Limit Theorem (the i.i.d. version)


to compute the real error of a classifier
CMU, 2008 fall, Eric Xing, HW3, pr. 3.3

Chris recently adopted a new (binary) classifier to filter email spams. He wants to quantitatively evaluate how good the classifier is.
He has a small dataset of 100 emails on hand which, you can assume, are randomly drawn from all emails.
He tests the classifier on the 100 emails and gets 83 classified correctly, so the error rate on the small dataset is 17%.
However, the error rate measured on 100 samples could be either higher or lower than the real error rate just by chance.

With a confidence level of 95%, what is likely to be the range of the real error rate? Please write down all important steps.
(Hint: You need some approximation in this problem.)
Notations:
Let X_i, i = 1, . . . , n = 100 be defined as:
X_i = 1 if the email i was incorrectly classified, and 0 otherwise;
E[X_i] =not. µ =not. e_real;   Var(X_i) =not. σ²

e_sample =not. (X₁ + . . . + Xₙ)/n = 0.17

Zₙ = (X₁ + . . . + Xₙ − nµ)/(σ√n)   (the standardized form of X₁ + . . . + Xₙ)

Key insight:
Calculating the real error of the classifier (more exactly, a symmetric interval around the real error p =not. µ) with a "confidence" of 95% amounts to finding a > 0 such that P(|Zₙ| ≤ a) ≥ 0.95.
Calculus:

|Zₙ| ≤ a ⇔ |(X₁ + . . . + Xₙ − nµ)/(σ√n)| ≤ a ⇔ |(X₁ + . . . + Xₙ − nµ)/(nσ)| ≤ a/√n
  ⇔ |(X₁ + . . . + Xₙ − nµ)/n| ≤ aσ/√n ⇔ |(X₁ + . . . + Xₙ)/n − µ| ≤ aσ/√n
  ⇔ |e_sample − e_real| ≤ aσ/√n ⇔ |e_real − e_sample| ≤ aσ/√n
  ⇔ −aσ/√n ≤ e_real − e_sample ≤ aσ/√n
  ⇔ e_sample − aσ/√n ≤ e_real ≤ e_sample + aσ/√n
  ⇔ e_real ∈ [e_sample − aσ/√n, e_sample + aσ/√n]
Important facts:
The Central Limit Theorem: Zₙ → N(0, 1).
Therefore, P(|Zₙ| ≤ a) ≈ P(|X| ≤ a) = Φ(a) − Φ(−a), where X ∼ N(0, 1) and Φ is the cumulative distribution function of N(0, 1).
Calculus:
Φ(−a) + Φ(a) = 1 ⇒ P(|Zₙ| ≤ a) = Φ(a) − Φ(−a) = 2Φ(a) − 1
P(|Zₙ| ≤ a) = 0.95 ⇔ 2Φ(a) − 1 = 0.95 ⇔ Φ(a) = 0.975 ⇔ a ≅ 1.96 (see the Φ table)

σ² =not. Var_real = e_real(1 − e_real), because the X_i are Bernoulli variables.
Furthermore, we can approximate e_real with e_sample, because

E[e_sample] = e_real and Var_sample = (1/n) Var_real → 0 for n → +∞,

cf. CMU, 2011 fall, T. Mitchell, A. Singh, HW2, pr. 1.ab.

Finally:
aσ/√n ≈ 1.96 · √(0.17(1 − 0.17))/√100 ≅ 0.07
|e_real − e_sample| ≤ 0.07 ⇔ |e_real − 0.17| ≤ 0.07 ⇔ −0.07 ≤ e_real − 0.17 ≤ 0.07
⇔ e_real ∈ [0.10, 0.24]
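The numerical part of the solution, as a minimal sketch assuming Python with scipy (norm.ppf gives the Φ quantile, replacing the Φ table):

    # 95% interval for the real error rate around e_sample = 0.17, n = 100.
    import numpy as np
    from scipy.stats import norm

    n, e_sample = 100, 0.17
    a = norm.ppf(0.975)                                # ≈ 1.96
    half = a * np.sqrt(e_sample * (1 - e_sample) / n)  # ≈ 0.07
    print(e_sample - half, e_sample + half)            # ≈ 0.10, 0.24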

Exemplifying
a mixture of categorical distributions;
how to compute its expectation and variance
CMU, 2010 fall, Aarti Singh, HW1, pr. 2.2.1-2

Suppose that I have two six-sided dice, one is fair and the other one is loaded, having:

P(x) = 1/2 if x = 6;  1/10 if x ∈ {1, 2, 3, 4, 5}.

I will toss a coin to decide which die to roll. If the coin flip is heads I will roll the fair die, otherwise the loaded one. The probability that the coin flip is heads is p ∈ (0, 1).
a. What is the expectation of the die roll (in terms of p)?
b. What is the variance of the die roll (in terms of p)?
Solution:

a.
E[X] = Σ_{i=1}^6 i · [P(i|fair) · p + P(i|loaded) · (1 − p)]
     = [Σ_{i=1}^6 i · P(i|fair)] p + [Σ_{i=1}^6 i · P(i|loaded)] (1 − p)
     = (7/2) p + (9/2)(1 − p) = 9/2 − p
b. Recall that we may write Var(X) = E[X²] − (E[X])², therefore:

E[X²] = Σ_{i=1}^6 i² · [P(i|fair) · p + P(i|loaded) · (1 − p)]
      = [Σ_{i=1}^6 i² · P(i|fair)] p + [Σ_{i=1}^6 i² · P(i|loaded)] (1 − p)
      = (91/6) p + (36/2 + 55/10)(1 − p)
      = 47/2 − (25/3) p

Combining this with the result of the previous question yields:

Var(X) = E[X²] − (E[X])² = 141/6 − (50/6) p − (9/2 − p)²
       = 141/6 − (50/6) p − (81/4 − 9p + p²)
       = (141/6 − 81/4) + (9 − 50/6) p − p²
       = 13/4 + (2/3) p − p²
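Both formulas can be checked by simulating the mixture; a sketch assuming Python with numpy (the value p = 0.3 is arbitrary):

    # Simulate: flip the coin, then roll the fair or the loaded die.
    import numpy as np
    rng = np.random.default_rng(0)
    p, N = 0.3, 1_000_000
    heads = rng.random(N) < p
    fair = rng.integers(1, 7, size=N)                                # uniform on {1,...,6}
    loaded = rng.choice(np.arange(1, 7), size=N, p=[0.1]*5 + [0.5])  # the loaded die
    rolls = np.where(heads, fair, loaded)
    print(rolls.mean(), 9/2 - p)                  # ≈ 4.2
    print(rolls.var(), 13/4 + (2/3)*p - p**2)     # ≈ 3.36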

Estimating the parameters of some


probability distributions:
Exemplifications

Estimating the parameter of the Bernoulli


distribution:
the MLE and MAP approaches
CMU, 2015 spring, Tom Mitchell, Nina Balcan, HW2, pr. 2

Suppose we observe the values of n i.i.d. (independent, identically distributed) random variables X₁, . . . , Xₙ drawn from a single Bernoulli distribution with parameter θ. In other words, for each X_i, we know that

P(X_i = 1) = θ   and   P(X_i = 0) = 1 − θ.

Our goal is to estimate the value of θ from the observed values of X₁, . . . , Xₙ.

Reminder: Maximum Likelihood Estimation

For any hypothetical value θ̂, we can compute the probability of observing the outcome X₁, . . . , Xₙ if the true parameter value θ were equal to θ̂.
This probability of the observed data is often called the data likelihood, and the function L(θ̂) that maps each θ̂ to the corresponding likelihood is called the likelihood function.
A natural way to estimate the unknown parameter θ is to choose the θ̂ that maximizes the likelihood function. Formally,

θ̂_MLE = argmax_{θ̂} L(θ̂).

a. Write a formula for the likelihood function, L(θ̂).

Your function should depend on the random variables X₁, . . . , Xₙ and the hypothetical parameter θ̂.
Does the likelihood function depend on the order of the random variables?

Solution:
Since the X_i are independent, we have

L(θ̂) = P_θ̂(X₁, . . . , Xₙ) = Π_{i=1}^n P_θ̂(X_i) = Π_{i=1}^n (θ̂^{X_i} · (1 − θ̂)^{1−X_i}) = θ̂^{#{X_i=1}} · (1 − θ̂)^{#{X_i=0}},

where #{·} counts the number of X_i for which the condition in braces holds true.
Note that in the third equality we used the trick X_i = I_{X_i=1}.
The likelihood function does not depend on the order of the data.
b. Suppose that n = 10 and the data set contains six 1s and four 0s.
Write a short computer program that plots the likelihood function of this data.
For the plot, the x-axis should be θ̂, and the y-axis L(θ̂). Scale your y-axis so that you can see some variation in its value.
Estimate θ̂_MLE by marking on the x-axis the value of θ̂ that maximizes the likelihood.

Solution:
[plot "MLE; n = 10, six 1s, four 0s": the likelihood L(θ̂) = θ̂⁶(1 − θ̂)⁴ over θ̂ ∈ [0, 1], with maximum value ≈ 0.0012, attained at θ̂ = 0.6]
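One possible version of the requested program (a sketch, assuming Python with numpy and matplotlib):

    # Plot L(θ) = θ^6 (1−θ)^4 and mark the maximizing θ on the x-axis.
    import numpy as np
    import matplotlib.pyplot as plt

    theta = np.linspace(0, 1, 501)
    L = theta**6 * (1 - theta)**4
    plt.plot(theta, L)
    plt.axvline(theta[np.argmax(L)], linestyle='--')   # falls at θ ≈ 0.6
    plt.xlabel('theta hat'); plt.ylabel('L(theta hat)')
    plt.show()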
c. Find a closed-form formula for θ̂_MLE, the MLE estimate of θ̂. Does the closed form agree with the plot?
Solution:
Let's consider l(θ̂) = ln(L(θ̂)). Since the ln function is increasing, the θ̂ that maximizes the log-likelihood is the same as the θ̂ that maximizes the likelihood. Using the properties of the ln function, we can rewrite l(θ̂) as follows:

l(θ̂) = ln(θ̂^{n₁} · (1 − θ̂)^{n₀}) = n₁ ln(θ̂) + n₀ ln(1 − θ̂),

where n₁ =not. #{X_i = 1} and n₀ =not. #{X_i = 0}.
Assuming that θ̂ ≠ 0 and θ̂ ≠ 1, the first and second derivatives of l are given by

l′(θ̂) = n₁/θ̂ − n₀/(1 − θ̂)   and   l″(θ̂) = −n₁/θ̂² − n₀/(1 − θ̂)²

Since l″(θ̂) is always negative, the l function is concave, and we can find its maximizer by solving the equation l′(θ̂) = 0.
The solution to this equation is given by θ̂_MLE = n₁/(n₁ + n₀).
d. Create three more likelihood plots: one where n = 5 and the data set contains three 1s and two 0s; one where n = 100 and the data set contains sixty 1s and forty 0s; and one where n = 10 and there are five 1s and five 0s.

Solution:
[plots: "MLE; n = 5, three 1s, two 0s", maximum ≈ 0.035 at θ̂ = 0.6; "MLE; n = 100, sixty 1s, forty 0s", maximum ≈ 7·10⁻³⁰ at θ̂ = 0.6]
e. Describe how the likelihood functions and maximum likelihood estimates compare for the different data sets.

Solution (to part d, cont'd):
[plot: "MLE; n = 10, five 1s, five 0s", maximum ≈ 0.001 at θ̂ = 0.5]

Solution (to part e):
The MLE is equal to the proportion of 1s observed in the data, so for the first three plots the MLE is always at 0.6, while for the last plot it is at 0.5.
As the number of samples n increases, the likelihood function gets more peaked at its maximum value, and the values it takes on decrease.

Reminder: Maximum a Posteriori Probability Estimation


In the maximum likelihood estimate, we treated the true parameter value θ as a fixed
(non-random) number. In cases where we have some prior knowledge about θ, it is
useful to treat θ itself as a random variable, and express our prior knowledge in the
form of a prior probability distribution over θ.
For example, suppose that the X1 , . . . , Xn are generated in the following way:
− First, the value of θ is drawn from a given prior probability distribution
− Second, X1 , . . . , Xn are drawn independently from a Bernoulli distribution using
this value for θ.
Since both θ and the sequence X1 , . . . , Xn are random, they have a joint probability
distribution. In this setting, a natural way to estimate the value of θ is to simply choose
its most probable value given its prior distribution plus the observed data X1 , . . . , Xn .
Definition:

θ̂_MAP = argmax_{θ̂} P(θ = θ̂ | X₁, . . . , Xₙ).

This is called the maximum a posteriori probability (MAP) estimate of θ.



Reminder (cont'd)
Using Bayes rule, we can rewrite the posterior probability as follows:

P(θ = θ̂ | X₁, . . . , Xₙ) = P(X₁, . . . , Xₙ | θ = θ̂) P(θ = θ̂) / P(X₁, . . . , Xₙ).

Since the probability in the denominator does not depend on θ̂, the MAP estimate is given by

θ̂_MAP = argmax_{θ̂} P(X₁, . . . , Xₙ | θ = θ̂) P(θ = θ̂) = argmax_{θ̂} L(θ̂) P(θ = θ̂).

In words, the MAP estimate for θ is the value θ̂ that maximizes the likelihood function multiplied by the prior distribution on θ.

We will consider a Beta(3, 3) prior distribution for θ, which has the density function given by p(θ̂) = θ̂²(1 − θ̂)²/B(3, 3), where B(α, β) is the beta function and B(3, 3) ≈ 0.0333.

f. Suppose, as in part b, that n = 10 and we observed six 1s and four 0s.
Write a short computer program that plots the function θ̂ ↦ L(θ̂) p(θ̂) for the same values of θ̂ as in part b.
Estimate θ̂_MAP by marking on the x-axis the value of θ̂ that maximizes the function.

Solution:
[plot "MAP; n = 10, six 1s, four 0s; Beta(3,3)": the function L(θ̂)p(θ̂) over θ̂ ∈ [0, 1], with maximum ≈ 0.002, attained at θ̂ = 8/14 ≈ 0.57]
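Again, one possible version of the requested program (a sketch, assuming Python with numpy and matplotlib):

    # Plot L(θ)p(θ) = θ^8 (1−θ)^6 / B(3,3) and mark the maximizing θ.
    import numpy as np
    import matplotlib.pyplot as plt

    theta = np.linspace(0, 1, 501)
    g = theta**8 * (1 - theta)**6 / 0.0333             # likelihood times the Beta(3,3) prior
    plt.plot(theta, g)
    plt.axvline(theta[np.argmax(g)], linestyle='--')   # falls at θ = 8/14 ≈ 0.57
    plt.xlabel('theta hat'); plt.ylabel('L · p')
    plt.show()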
g. Find a closed form formula for θ̂_MAP, the MAP estimate of θ̂. Does the closed form agree with the plot?
Solution:
As in the case of the MLE, we will apply the ln function before finding the maximizer. We want to maximize the function

l(θ̂) = ln(L(θ̂) · p(θ̂)) = ln(θ̂^{n₁+2} · (1 − θ̂)^{n₀+2}) − ln(B(3, 3)).

The normalizing constant for the prior appears as an additive constant and therefore the first and second derivatives are identical to those in the case of the MLE (except with n₁ + 2 and n₀ + 2 instead of n₁ and n₀, respectively).
It follows that the closed form formula for the MAP estimate is given by

θ̂_MAP = (n₁ + 2)/(n₁ + n₀ + 4).

This formula agrees with the plot obtained in part f.
h. Compare the MAP estimate to the MLE computed from the same data in part b. Briefly explain any significant difference.
Solution:
The MAP estimate is equal to the MLE with four additional virtual random variables, two that are equal to 1, and two that are equal to 0. This pulls the value of the MAP estimate closer to the value 0.5, which is why θ̂_MAP is smaller than θ̂_MLE.

i. Comment on the relationship between the MAP and MLE estimates as n goes to infinity, while the ratio #{X_i = 1}/#{X_i = 0} remains constant.
Solution:
It is obvious that as n goes to infinity, the influence of the 4 virtual random variables diminishes, and the two estimators become equal.

The MLE estimator for the parameter of


the Bernoulli distribution:
the bias and [an example of ] inadmissibility
CMU, 2004 fall, Tom Mitchell, Ziv Bar-Joseph, HW2, pr. 1

Suppose X is a binary random variable that takes value 0 with probability p and value 1 with probability 1 − p. Let X₁, . . . , Xₙ be i.i.d. samples of X.
a. Compute an MLE estimate of p (denote it by p̂).
Answer:
By way of definition,

p̂ = argmax_p P(X₁, . . . , Xₙ | p) =(i.i.d.)= argmax_p Π_{i=1}^n P(X_i | p) = argmax_p p^k (1 − p)^{n−k}

where k is the number of 0's in x₁, . . . , xₙ.

Furthermore, since ln is a monotonic (strictly increasing) function,

p̂ = argmax_p ln(p^k (1 − p)^{n−k}) = argmax_p (k ln p + (n − k) ln(1 − p))

Computing the first derivative of k ln p + (n − k) ln(1 − p) w.r.t. p leads to:

∂/∂p (k ln p + (n − k) ln(1 − p)) = k/p − (n − k)/(1 − p).

Hence,

∂/∂p (k ln p + (n − k) ln(1 − p)) = 0 ⇔ k/p = (n − k)/(1 − p) ⇔ p̂ = k/n.
b. Is p̂ an unbiased estimate of p? Prove the answer.

Answer:
Since k can be seen as a sum of n [independent] Bernoulli variables of parameter p, we can write:

E[p̂] = E[k/n] = (1/n) E[k] = (1/n) E[n − Σ_{i=1}^n X_i] = (1/n) (n − Σ_{i=1}^n E[X_i])
     = (1/n) (n − Σ_{i=1}^n (1 − p)) = (1/n) (n − n(1 − p)) = (1/n) np = p.

Therefore, p̂ is an unbiased estimator for the parameter p.


c. Compute the expected square error of p̂ in terms of p.

Answer:

E[(p̂ − p)²] = E[p̂²] − 2E[p̂] p + p² = E[k²]/n² − 2p² + p²
            = (Var(k) + E²[k])/n² − p² = (np(1 − p) + (np)²)/n² − p² = (p/n)(1 − p)

We used the fact that Var(k) = E[k²] − E²[k], and also Var(k) = np(1 − p) because, as already said, k can be seen as a sum of n independent Bernoulli variables of parameter p.
Note that E[p̂] = p (cf. part b), therefore

E[(p̂ − p)²] = E[(p̂ − E[p̂])²] = Var[p̂] = (p/n)(1 − p).

This implies that Var[p̂] → 0 for n → ∞.
d. Prove that if you know that p lies in the interval [1/4, 3/4] and you are given only n = 3 samples of X, then p̂ is an inadmissible estimator of p when minimizing the expected square error of estimation.
Note: An estimator δ of a parameter θ is said to be inadmissible when there exists a different estimator δ′ such that R(θ, δ′) ≤ R(θ, δ) for all θ and R(θ, δ′) < R(θ, δ) for some θ, where R(θ, δ) is a risk function and in this problem it is the expected square error of the estimator.

Answer:
Consider another estimator, p̃ = 1/2.
E[(p̃ − p)²] = (1/2 − p)².
For p = 1/2 we have E[(p̃ − p)²] = 0 < E[(p̂ − p)²] = 1/12.
We now need to show that E[(p̃ − p)²] ≤ E[(p̂ − p)²] over p ∈ [1/4, 3/4].

E[(p̃ − p)²] − E[(p̂ − p)²] = (1/2 − p)² − (1/3) p(1 − p) = 1/4 − (4/3) p + (4/3) p².

This is an upward-opening parabola, so we need to show that it lies below or equal to zero for p ∈ [1/4, 3/4].
It is equivalent to showing that it is below or equal to 0 at the boundary points.
In fact it is: 1/4 − (4/3) p + (4/3) p² = 0 for both p = 1/4 and p = 3/4.

Estimating the parameters of the categorical


distribution:
the MLE approach
CMU, 2009 spring, Ziv Bar-Joseph, HW1, pr. 2.3
In this problem we will derive the MLE for the parameters of a categorical distribution where the variable of interest, X, can take on k values, namely a₁, a₂, . . . , a_k, the probability of seeing an event of type j being θ_j for j = 1, . . . , k.
a. Given data describing n independent identically distributed observations of X, namely d₁, . . . , dₙ, each of which can be one of k values, express the likelihood of the data given k − 1 parameters for the distribution over X. Let n_i represent the number of times X takes on value i in the data.
Answer:
The likelihood of the data is:

L(θ) = P(d₁, . . . , dₙ | θ) =(i.i.d.)= Π_{j=1}^n Σ_{i=1}^k (θ_i I_{d_j=a_i})   (I is the indicator function)
     = Π_{i=1}^k θ_i^{n_i}   (n_i =not. Σ_{j=1}^n I_{d_j=a_i})
     = (1 − Σ_{i=1}^{k−1} θ_i)^{n_k} Π_{i=1}^{k−1} θ_i^{n_i}   (since the thetas sum to one: θ_k = 1 − Σ_{i=1}^{k−1} θ_i)
b. Find θ̂_j, the MLE for θ_j, one of the k − 1 parameters, by setting the partial derivative of the likelihood in part a with respect to θ_j equal to zero and solving for it.
Hint: You may want to start by first taking the log of the likelihood from part a before taking its derivative.
Answer:

ln L(θ) = n_k ln(1 − Σ_{i=1}^{k−1} θ_i) + Σ_{i=1}^{k−1} n_i ln θ_i ⇒ ∂ ln L(θ)/∂θ_j = −n_k/(1 − Σ_{i=1}^{k−1} θ_i) + n_j/θ_j

∂ ln L(θ)/∂θ_j = 0 ⇔ −n_k/(1 − θ̂_j − Σ_{i≠j,k} θ_i) + n_j/θ̂_j = 0 ⇔ n_j/θ̂_j = n_k/(1 − θ̂_j − Σ_{i≠j,k} θ_i)
  ⇔ n_j (1 − Σ_{i≠j,k} θ_i) = (n_k + n_j) θ̂_j
  ⇔ θ̂_j = (n_j/(n_j + n_k)) (1 − Σ_{i≠j,k} θ_i)   for all j ∈ {1, . . . , k − 1},

where the sums Σ_{i≠j,k} run over i ∈ {1, . . . , k − 1} \ {j}.
c. At this point you should have k − 1 equations describing MLEs of different parameters. Show how those equations imply that the MLE for a parameter θ_j representing the probability that X takes on value j is equal to n_j/n.
Hint: In order to remove the k-th parameter from the likelihood in part a you had to represent it with an equation, θ_k = f( ). At this point you may find it helpful to replace all occurrences of f( ) with θ_k. After replacing f( ) with θ_k you can substitute all occurrences of each other parameter in f( ) with its MLE from part b. This should allow you to solve for the MLE of θ_k, which can then be used to simplify all of the other equations.
Answer

Since the MLE equations from part b must hold simultaneously for all components of the vector θ̂, the last equation in part b can be written as:

θ̂_j = (n_j/(n_j + n_k)) (1 − Σ_{i≤k−1, i≠j} θ̂_i) ⇔ θ̂_j = (n_j/(n_j + n_k)) (θ̂_j + θ̂_k)   (because θ̂_k = 1 − Σ_{i=1}^{k−1} θ̂_i)

⇔ θ̂_j (1 − n_j/(n_j + n_k)) = (n_j/(n_j + n_k)) θ̂_k ⇔ θ̂_j n_k/(n_j + n_k) = (n_j/(n_j + n_k)) θ̂_k
⇔ θ̂_j n_k = n_j θ̂_k
⇔ θ̂_j = (n_j/n_k) θ̂_k   for all j ∈ {1, . . . , k − 1}.
Finally,

θ̂_k = 1 − θ̂₁ − . . . − θ̂_{k−1} = 1 − (n₁/n_k) θ̂_k − . . . − (n_{k−1}/n_k) θ̂_k
⇒ n_k θ̂_k = n_k − (n₁ + . . . + n_{k−1}) θ̂_k
⇒ θ̂_k (n₁ + . . . + n_{k−1} + n_k) = n_k,   where n₁ + . . . + n_k = n
⇒ θ̂_k = n_k/n
⇒ θ̂_j = (n_j/n_k) · (n_k/n) = n_j/n   for all j ∈ {1, . . . , k − 1}.

Note: Even though [here] we can go from the non-hatted (θ_i) to the hatted form (θ̂_i) of the equation in the first step of c, this will generally not be possible. To solve for a maximum likelihood criterion under additional constraints like the thetas summing to one, a generic and useful method is the method of Lagrange multipliers.

The Gaussian [uni-variate] distribution:


estimating µ when σ 2 is known
CMU, 2011 fall, Tom Mitchell, Aarti Singh, HW2, pr. 1
CMU, 2010 fall, Ziv Bar-Joseph, HW1, pr. 1.2-3
Assume we have n samples, x₁, . . . , xₙ, independently drawn from a normal distribution with known variance σ² and unknown mean µ.
a. Derive the MLE estimator for the mean µ.
Solution:

P(x₁, . . . , xₙ | µ) = Π_{i=1}^n P(x_i | µ) = Π_{i=1}^n (1/(√(2π) σ)) e^{−(x_i − µ)²/(2σ²)}

⇒ ln P(x₁, . . . , xₙ | µ) = Σ_{i=1}^n [ln(1/(√(2π) σ)) − (x_i − µ)²/(2σ²)]

⇒ ∂/∂µ ln P(x₁, . . . , xₙ | µ) = Σ_{i=1}^n (x_i − µ)/σ²

∂/∂µ ln P(x₁, . . . , xₙ | µ) = 0 ⇔ Σ_{i=1}^n (x_i − µ)/σ² = 0 ⇔ Σ_{i=1}^n (x_i − µ) = 0 ⇔ Σ_{i=1}^n x_i = nµ

⇒ µ_MLE = (Σ_{i=1}^n x_i)/n

Remark: It can be easily shown that ln P(x₁, . . . , xₙ | µ) indeed reaches its maximum for µ = µ_MLE.
b. Show that E[µ_MLE] = µ.

Solution:
The sample x₁, . . . , xₙ can be seen as the realization of n independent random variables X₁, . . . , Xₙ of Gaussian distribution of mean µ and variance σ². Then, due to the property of linearity for the expectation of random variables, we get:

E[µ_MLE] = E[(X₁ + . . . + Xₙ)/n] = (E[X₁] + . . . + E[Xₙ])/n = nµ/n = µ

Therefore, the µ_MLE estimator is unbiased.

c. What is Var[µ_MLE]?
Solution:

Var[µ_MLE] = Var[(1/n) Σ_{i=1}^n X_i] =(⋆)= (1/n²) Σ_{i=1}^n Var[X_i] =(i.i.d.)= n (1/n²) Var[X₁] = σ²/n

Therefore, Var[µ_MLE] → 0 as n → ∞.

(⋆) Remember that Var[aX] = a² Var[X].
d. Now derive the MAP estimator for the mean µ. Assume that the prior distribution for the mean is itself a normal distribution with mean ν and variance β².
Solution 1:

P(µ | x₁, . . . , xₙ) =(T. Bayes)= P(x₁, . . . , xₙ | µ) P(µ) / P(x₁, . . . , xₙ)   (16)
  = [Π_{i=1}^n (1/(√(2π) σ)) e^{−(x_i − µ)²/(2σ²)}] · (1/(√(2π) β)) e^{−(µ − ν)²/(2β²)} / C   (17)

where C =not. P(x₁, . . . , xₙ).

⇒ ln P(µ | x₁, . . . , xₙ) = Σ_{i=1}^n [−ln(√(2π) σ) − (x_i − µ)²/(2σ²)] − ln(√(2π) β) − (µ − ν)²/(2β²) − ln C

⇒ ∂/∂µ ln P(µ | x₁, . . . , xₙ) = Σ_{i=1}^n (x_i − µ)/σ² − (µ − ν)/β²

∂/∂µ ln P(µ | x₁, . . . , xₙ) = 0 ⇔ Σ_{i=1}^n (x_i − µ)/σ² = (µ − ν)/β² ⇔ µ (n/σ² + 1/β²) = (Σ_{i=1}^n x_i)/σ² + ν/β²

⇒ µ_MAP = (σ²ν + β² Σ_{i=1}^n x_i) / (σ² + nβ²)
Solution 2:
Instead of computing the derivative of the posterior distribution P(µ | x₁, . . . , xₙ), we will first show that the right hand side of (17) is itself a Gaussian, and then we will use the fact that the mean of a Gaussian is where it achieves its maximum value.

P(µ | x₁, . . . , xₙ) = (1/C) [Π_{i=1}^n (1/(√(2π) σ)) e^{−(x_i − µ)²/(2σ²)}] · (1/(√(2π) β)) e^{−(µ − ν)²/(2β²)}

  = const · exp(−(Σ_{i=1}^n (x_i − µ)²)/(2σ²) − (µ − ν)²/(2β²))

  = const · exp(−(β² Σ_{i=1}^n (x_i − µ)² + σ²(µ − ν)²)/(2σ²β²))

  = const · exp(−((nβ² + σ²)/(2σ²β²)) µ² + ((β² Σ_{i=1}^n x_i + νσ²)/(σ²β²)) µ − (β² Σ_{i=1}^n x_i² + ν²σ²)/(2σ²β²))
P(µ | x₁, . . . , xₙ) =
  = const · exp(−[µ² − 2µ (β² Σ_{i=1}^n x_i + νσ²)/(nβ² + σ²) + (β² Σ_{i=1}^n x_i² + ν²σ²)/(nβ² + σ²)] / (2σ²β²/(nβ² + σ²)))

  = const · exp(−[(µ − (β² Σ_{i=1}^n x_i + νσ²)/(nβ² + σ²))² − ((β² Σ_{i=1}^n x_i + νσ²)/(nβ² + σ²))² + (β² Σ_{i=1}^n x_i² + ν²σ²)/(nβ² + σ²)] / (2σ²β²/(nβ² + σ²)))

  = const · exp(−(µ − (β² Σ_{i=1}^n x_i + νσ²)/(nβ² + σ²))² / (2σ²β²/(nβ² + σ²))) ·
    exp([((β² Σ_{i=1}^n x_i + νσ²)/(nβ² + σ²))² − (β² Σ_{i=1}^n x_i² + ν²σ²)/(nβ² + σ²)] / (2σ²β²/(nβ² + σ²)))

  = const′ · exp(−(µ − (β² Σ_{i=1}^n x_i + νσ²)/(nβ² + σ²))² / (2σ²β²/(nβ² + σ²)))

The exp term in the last equality being a Gaussian of mean (β² Σ_{i=1}^n x_i + νσ²)/(nβ² + σ²) and variance σ²β²/(nβ² + σ²), it follows that its maximum is obtained for

µ = (β² Σ_{i=1}^n x_i + νσ²)/(nβ² + σ²) = µ_MAP.
e. Please comment on what happens to the MLE and MAP estimators for the mean µ as the number of samples n goes to infinity.

Solution:

µ_MLE = (Σ_{i=1}^n x_i)/n

µ_MAP = (σ²ν + β² Σ_{i=1}^n x_i)/(σ² + nβ²) = σ²ν/(σ² + nβ²) + (β² Σ_{i=1}^n x_i)/(σ² + nβ²)
      = σ²ν/(σ² + nβ²) + ((1/n) Σ_{i=1}^n x_i)/(1 + σ²/(nβ²)) = σ²ν/(σ² + nβ²) + µ_MLE/(1 + σ²/(nβ²))

n → ∞ ⇒ σ²ν/(σ² + nβ²) → 0 and σ²/(nβ²) → 0 ⇒ µ_MAP → µ_MLE

The Gaussian [uni-variate] distribution:


estimating σ 2 when µ = 0
CMU, 2009 spring, Ziv Bar-Joseph, HW1, pr. 2.1
Let X be a random variable distributed according to a Normal distribution with 0 mean and σ² variance, i.e. X ∼ N(0, σ²).
a. Find the maximum likelihood estimate for σ², i.e. σ²_MLE.
Solution:
Let X₁, X₂, . . . , Xₙ be drawn i.i.d. ∼ N(0, σ²). Let f be the density function corresponding to X. Then we can write the likelihood function as:

L(X₁, X₂, . . . , Xₙ | σ²) = Π_{i=1}^n f(X_i; µ = 0, σ²)
  = Π_{i=1}^n (1/(√(2π) σ)) exp(−(X_i − 0)²/(2σ²)) = (1/(√(2π) σ))ⁿ exp(−(Σ_{i=1}^n X_i²)/(2σ²))

⇒ ln L = constant − (n/2) ln σ² − (1/(2σ²)) Σ_{i=1}^n X_i²

⇒ ∂ ln L/∂σ² = −n/(2σ²) + (1/(2σ⁴)) Σ_{i=1}^n X_i².   Therefore, ∂ ln L/∂σ² = 0 ⇔ σ²_MLE = (1/n) Σ_{i=1}^n X_i²

Note: It can be easily shown that L(X₁, X₂, . . . , Xₙ | σ²) indeed reaches its maximum for σ² = (1/n) Σ_{i=1}^n X_i².
b. Is the estimator you obtained biased?

Solution:
It is unbiased, since:

E[(1/n) Σ_{i=1}^n X_i²] = (n/n) E[X²] = E[X²]   (since i.i.d.)
                        = Var[X] + (E[X])²
                        = Var[X] = σ²   (since E[X] = 0)

The Gaussian [uni-variate] distribution:


estimating σ 2 (without restrictions on µ)
CMU, 2010 fall, Ziv Bar-Joseph, HW1, pr. 2.1.1-2
Let x =not. (x₁, . . . , xₙ) be observed i.i.d. samples from a Gaussian distribution N(x | µ, σ²).
a. Derive σ²_MLE, the MLE for σ².
Solution:
The p.d.f. for N(x | µ, σ²) has the form f(x) = (1/(√(2π) σ)) e^{−(x − µ)²/(2σ²)}.
The log likelihood function of the data x is:

ln L(x | µ, σ²) = Σ_{i=1}^n ln f(x_i) = Σ_{i=1}^n [−½ ln(2πσ²) − (x_i − µ)²/(2σ²)]
                = −(n/2) ln(2πσ²) − (1/(2σ²)) Σ_{i=1}^n (x_i − µ)²

The partial derivative of ln L w.r.t. σ²:   ∂ ln L(x | µ, σ²)/∂σ² = −n/(2σ²) + (1/(2σ⁴)) Σ_{i=1}^n (x_i − µ)².

Solving the equation ∂ ln L(x | µ, σ²)/∂σ² = 0, we get: σ²_MLE = (1/n) Σ_{i=1}^n (x_i − µ_MLE)².

Note that we had to take into account the optimal value of µ (see problem CMU, 2011 fall, T. Mitchell, A. Singh, HW2, pr. 1).
b. Show that E[σ²_MLE] = ((n − 1)/n) σ².

Solution:

E[σ²_MLE] = E[(1/n) Σ_{i=1}^n (x_i − µ_MLE)²] = E[(x₁ − µ_MLE)²]   (the n terms are identically distributed)
  = E[(x₁ − (1/n) Σ_{i=1}^n x_i)²]
  = E[x₁² − (2/n) x₁ Σ_{i=1}^n x_i + (1/n²) (Σ_{i=1}^n x_i)²]
  = E[x₁² − (2/n) x₁ Σ_{i=1}^n x_i + (1/n²) (Σ_{i=1}^n x_i² + 2 Σ_{i<j} x_i x_j)]
  = E[x₁²] + (1/n²) Σ_{i=1}^n E[x_i²] − (2/n) Σ_{i=1}^n E[x₁ x_i] + (2/n²) Σ_{i<j} E[x_i x_j]
  = E[x₁²] + (1/n²) n E[x₁²] − (2/n) E[x₁²] − (2/n)(n − 1) E[x₁x₂] + (2/n²) (n(n − 1)/2) E[x₁x₂]
  = ((n − 1)/n) E[x₁²] − ((n − 1)/n) E[x₁x₂]
σ² = Var(x₁) = E[x₁²] − (E[x₁])² = E[x₁²] − µ² ⇒ E[x₁²] = σ² + µ²

Because x₁ and x₂ are independent, it follows that Cov(x₁, x₂) = 0. Therefore,

0 = Cov(x₁, x₂) = E[(x₁ − E[x₁])(x₂ − E[x₂])] = E[(x₁ − µ)(x₂ − µ)]
  = E[x₁x₂] − µ E[x₁ + x₂] + µ² = E[x₁x₂] − µ(E[x₁] + E[x₂]) + µ²
  = E[x₁x₂] − µ(2µ) + µ² = E[x₁x₂] − µ²

So, E[x₁x₂] = µ².
By substituting E[x₁²] = σ² + µ² and E[x₁x₂] = µ² into the previously obtained equality (E[σ²_MLE] = ((n − 1)/n) E[x₁²] − ((n − 1)/n) E[x₁x₂]), we get:

E[σ²_MLE] = ((n − 1)/n)(σ² + µ²) − ((n − 1)/n) µ² = ((n − 1)/n) σ²
c. Find an unbiased estimator for σ².

Solution:
It can be immediately proven that (1/(n − 1)) Σ_{i=1}^n (x_i − µ_MLE)² is an unbiased estimator of σ².
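A numerical illustration of the bias and of its correction (a sketch assuming Python with numpy; the ddof argument selects the divisor n − ddof):

    # Average the two variance estimators over many small samples.
    import numpy as np
    rng = np.random.default_rng(0)
    n, sigma2 = 5, 4.0
    x = rng.normal(0.0, np.sqrt(sigma2), size=(200_000, n))
    mle = x.var(axis=1, ddof=0)               # (1/n) Σ (x_i − mean)²
    unbiased = x.var(axis=1, ddof=1)          # (1/(n−1)) Σ (x_i − mean)²
    print(mle.mean(), (n - 1) / n * sigma2)   # ≈ 3.2 : biased downward
    print(unbiased.mean(), sigma2)            # ≈ 4.0 : unbiased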

The Gamma distribution:


Maximum Likelihood Estimation of parameters
Liviu Ciortuz, 2017

The Gamma distribution of parameters r > 0 and β > 0 has the following density function:

Gamma(x | r, β) = (1/(β^r Γ(r))) x^{r−1} e^{−x/β},   for all x > 0,

where the Γ symbol designates Euler's Gamma function.

[plot: Gamma densities p(x) over x ∈ [0, 20] for several parameter settings: r = 1.0, β = 2.0; r = 2.0, β = 2.0; r = 3.0, β = 2.0; r = 5.0, β = 1.0; r = 9.0, β = 0.5; r = 7.5, β = 1.0; r = 0.5, β = 1.0]

Notes:
1. In the above definition, 1/(β^r Γ(r)) is the so-called normalization factor, since it does not depend on x, and ∫₀^{+∞} x^{r−1} e^{−x/β} dx = β^r Γ(r).
2. Euler's Gamma function is defined as follows: Γ(r) = ∫₀^{+∞} t^{r−1} e^{−t} dt, for all r ∈ R except the non-positive integers. Starting from the definition of Γ, it can be easily shown that Γ(r + 1) = rΓ(r) for any r > 0, and also Γ(1) = 1. Therefore, for any positive integer r, Γ(r + 1) = r · Γ(r) = r · (r − 1) · Γ(r − 1) = . . . = r · (r − 1) · . . . · 2 · 1 = r!, which means that the Γ function generalizes the factorial function.
3. The exponential distribution is a member of the Gamma family of distributions. (Just set r to 1 in Gamma's density function.)
112.

Consider x_1, . . . , x_n ∈ R⁺, all of them [having been] generated by one component of the above family of distributions.
Find the maximum likelihood estimation of the parameters r and β.

Solution
• The likelihood function:

L(r, β) = P(x_1, . . . , x_n | r, β) = ∏_{i=1}^n P(x_i | r, β)   (by the i.i.d. assumption)
= β^{−rn} (Γ(r))^{−n} (∏_{i=1}^n x_i)^{r−1} e^{−(1/β) Σ_{i=1}^n x_i}

• The log-likelihood function:

ℓ(r, β) = ln L(r, β) = −rn ln β − n ln Γ(r) + (r − 1) Σ_{i=1}^n ln x_i − (1/β) Σ_{i=1}^n x_i.
113.

Now we will calculate the partial derivative of ℓ(r, β) w.r.t. β, and then equate it to 0:

∂ℓ(r, β)/∂β = −rn/β + (1/β²) Σ_{i=1}^n x_i = (1/β²) [Σ_{i=1}^n x_i − rnβ]

∂ℓ(r, β)/∂β = 0 ⇔ β̂ = (1/(rn)) Σ_{i=1}^n x_i > 0.

By substituting β̂ into ℓ(r, β), we will get:

ℓ(r, β̂) = −rn ln β̂ − n ln Γ(r) + (r − 1) Σ_{i=1}^n ln x_i − (1/β̂) Σ_{i=1}^n x_i
= rn ln(rn) − rn ln(Σ_{i=1}^n x_i) − n ln Γ(r) + (r − 1) Σ_{i=1}^n ln x_i − (rn / Σ_{i=1}^n x_i) · Σ_{i=1}^n x_i
= rn ln(rn) − rn (ln(Σ_{i=1}^n x_i) + 1) − n ln Γ(r) + (r − 1) Σ_{i=1}^n ln x_i
114.

Therefore, by computing the partial derivative of ℓ(r, β̂) with respect to r, and then equating this derivative to 0, we will get:

∂ℓ(r, β̂)/∂r = 0 ⇔ n ln(nr) + n − n(ln(Σ_{i=1}^n x_i) + 1) − n · Γ′(r)/Γ(r) + Σ_{i=1}^n ln x_i = 0
⇔ n(ln r − ψ(r)) = −n ln n − Σ_{i=1}^n ln x_i + n ln(Σ_{i=1}^n x_i)
⇔ ln r − ψ(r) = −ln n − (1/n) Σ_{i=1}^n ln x_i + ln(Σ_{i=1}^n x_i),

where ψ = Γ′/Γ denotes the digamma function.

The solution of the last equation is r̂, the maximum likelihood estimation of the parameter r.
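The last equation has no closed-form solution, but its left-hand side, ln r − ψ(r), is strictly decreasing in r (from +∞ down to 0), so r̂ can be found numerically by bracketing the root. Below is a minimal sketch of this (our addition, not part of the original exercise), assuming SciPy is available; the helper name gamma_mle is ours.

import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

def gamma_mle(x):
    # Right-hand side of: ln r - psi(r) = ln(sum x) - ln n - (1/n) sum ln x
    n = len(x)
    rhs = np.log(np.sum(x)) - np.log(n) - np.mean(np.log(x))
    # ln r - psi(r) decreases strictly from +inf (r -> 0) to 0 (r -> inf),
    # so the root can be bracketed and found with Brent's method.
    f = lambda r: np.log(r) - digamma(r) - rhs
    r_hat = brentq(f, 1e-8, 1e8)
    beta_hat = np.sum(x) / (r_hat * n)   # beta-hat = (1/(rn)) sum x_i
    return r_hat, beta_hat

rng = np.random.default_rng(0)
x = rng.gamma(shape=3.0, scale=2.0, size=10_000)   # r = 3, beta = 2
print(gamma_mle(x))   # approximately (3.0, 2.0)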
115.

The Gaussian multi-variate distribution:


ML estimation of
the mean and the precision matrix, Λ
(Λ is the inverse of the covariance matrix, Σ)

CMU, 2010 fall, Aarti Singh, HW1, pr. 3.2.1


116.

The density function of a d-dimensional Gaussian distribution is as follows:

N(x | µ, Λ⁻¹) = exp(−(1/2)(x − µ)⊤ Λ (x − µ)) / ((2π)^{d/2} |Λ⁻¹|^{1/2}),

where Λ is the inverse of the covariance matrix, or the so-called precision matrix. Let {x_1, x_2, . . . , x_n} be an i.i.d. sample from a d-dimensional Gaussian distribution.
Suppose that n ≫ d. Derive the MLE estimates µ̂ and Λ̂.
117.
Hint
You may find useful the following formulas (taken from Matrix Identities, by Sam Roweis, 1999):

(2b) |A⁻¹| = 1/|A|
(2e) Tr(AB) = Tr(BA);a more generally, Tr(ABC . . .) = Tr(BC . . . A) = Tr(C . . . AB) = . . .
(3b) ∂/∂X Tr(XA) = ∂/∂X Tr(AX) = A⊤
(4b) ∂/∂X ln |X| = (X⁻¹)⊤ = (X⊤)⁻¹
(5c) ∂/∂X a⊤Xb = ab⊤
(5g) ∂/∂X (Xa + b)⊤ C (Xa + b) = (C + C⊤)(Xa + b)a⊤

Tr(A), the trace of an n-by-n square matrix A, is defined as the sum of the elements on the main diagonal (the diagonal from the upper left to the lower right) of A, i.e., a₁₁ + . . . + aₙₙ.
a See Theorem 1.3.d from Matrix Analysis for Statistics, 2017, James R. Schott.
118.
Given the x_1, . . . , x_n data, the log-likelihood function is:

l(µ, Λ) = ln ∏_{i=1}^n N(x_i | µ, Λ⁻¹) = Σ_{i=1}^n ln N(x_i | µ, Λ⁻¹)   (by the i.i.d. assumption)
= −(nd/2) ln(2π) − (n/2) ln |Λ⁻¹| − (1/2) Σ_{i=1}^n (x_i − µ)⊤ Λ (x_i − µ)
= −(nd/2) ln(2π) + (n/2) ln |Λ| − (1/2) Σ_{i=1}^n (x_i − µ)⊤ Λ (x_i − µ)   (by (2b)).

For any fixed positive definite precision matrix Λ, the log-likelihood is a quadratic function of µ with a negative leading coefficient, hence a strictly concave function of µ. We then solve

∇_µ l(µ, Λ) = 0 ⇔ −(1/2)(Λ + Λ⊤) Σ_{i=1}^n (x_i − µ)(−1) = 0 ⇔ Λ Σ_{i=1}^n (x_i − µ) = 0 ⇔ Λ Σ_{i=1}^n x_i = nΛµ,   (18)

where we used (5g).

From (18) we get, by the assumption that Λ is invertible, the following estimate of µ:

µ̂ = (Σ_{i=1}^n x_i)/n,

which coincides with the sample mean x̄ and is constant w.r.t. Λ.
119.
Now that we have

l(µ, Λ) ≤ l(µ̂, Λ) ∀µ ∈ R^d, Λ being positive definite,

we continue to consider Λ by first plugging µ̂ back into the log-likelihood function:

l(µ̂, Λ) = −(nd/2) ln(2π) + (n/2) ln |Λ| − (1/2) Σ_{i=1}^n (x_i − x̄)⊤ Λ (x_i − x̄)   (19)
= −(nd/2) ln(2π) + (n/2)(ln |Λ| − Tr(SΛ)),   (20)

where (20) follows using (2e), and S is the sample covariance matrix: S = (1/n) Σ_{i=1}^n (x_i − x̄)(x_i − x̄)⊤.

Explanation:
(x_i − x̄)⊤ Λ (x_i − x̄) is a 1-by-1 matrix, therefore (x_i − x̄)⊤ Λ (x_i − x̄) = Tr((x_i − x̄)⊤ Λ (x_i − x̄)), and using the (2e) rule, it can be further written as Tr((x_i − x̄)(x_i − x̄)⊤ Λ).
Using another simple rule, Tr(A + B) = Tr(A) + Tr(B) (which can be easily proven), it follows that

Σ_{i=1}^n (x_i − x̄)⊤ Λ (x_i − x̄) = Σ_{i=1}^n Tr((x_i − x̄)⊤ Λ (x_i − x̄)) = Σ_{i=1}^n Tr((x_i − x̄)(x_i − x̄)⊤ Λ)
= Tr(Σ_{i=1}^n (x_i − x̄)(x_i − x̄)⊤ Λ) = Tr((nS)Λ) = n Tr(SΛ).
120.

By the fact that ln |Λ| is strictly concave on the domain of positive definite
Λ,a and that Tr(SΛ) is linear in Λ, we are able to find the maximum of the
expression (20) by solving
∇Λ l(µ̂, Λ) = 0,

which can be provenb to be equivalent to


Λ−1 − S = 0. (Therefore, Σ̂ = Λ̂−1 = S.)
Since n ≫ d, we can safely assume that S is invertible and get the following
estimate:
Λ̂ = S −1 .
a See, for example, Section 3.1.5, Convex Optimization: http://www.stanford.edu/∼boyd/cvxbook/.
b Applying (4b) and (3b) on (20) you’ll get (Λ⊤ )−1 − S ⊤ . Then (Λ⊤ )−1 − S ⊤ = 0 ⇔ (Λ−1 )⊤ = S ⊤ ⇔ Λ−1 = S.
121.
Notes
1. In the above derivation, we have ensured that the estimates µ̂ and Λ̂ are in the parameter space and satisfy

l(µ, Λ) ≤ l(µ̂, Λ) ≤ l(µ̂, Λ̂) ∀µ ∈ R^d, Λ being positive definite,

so they are the MLE estimates.

2. Instead of using the relation (20), i.e., working with the Tr functional, one could directly compute the partial derivative of l(µ̂, Λ) in (19):

∇_Λ l(µ̂, Λ) = (n/2)(Λ⊤)⁻¹ − (1/2) Σ_{i=1}^n (x_i − µ̂)(x_i − µ̂)⊤   (by (4b) and (5c))
= (n/2) Λ⁻¹ − (1/2) Σ_{i=1}^n (x_i − µ̂)(x_i − µ̂)⊤ = (n/2) Λ⁻¹ − (n/2) S.

So,

∇_Λ l(µ̂, Λ) = 0 ⇔ Λ̂⁻¹ = Σ̂ = S = (1/n) Σ_{i=1}^n (x_i − µ̂)(x_i − µ̂)⊤.
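A short numerical illustration of the result µ̂ = x̄ and Λ̂ = S⁻¹ (our addition, not part of the original solution), assuming NumPy; all names in it are ours.

import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 100_000                       # n >> d, as assumed in the problem
mu_true = np.array([1.0, -2.0, 0.5])
A = rng.standard_normal((d, d))
Sigma_true = A @ A.T + d * np.eye(d)    # a positive definite covariance
x = rng.multivariate_normal(mu_true, Sigma_true, size=n)

mu_hat = x.mean(axis=0)                 # the sample mean
xc = x - mu_hat
S = (xc.T @ xc) / n                     # MLE covariance (divide by n, not n - 1)
Lambda_hat = np.linalg.inv(S)           # the estimated precision matrix

print(np.allclose(mu_hat, mu_true, atol=0.05))   # True
print(np.allclose(S, Sigma_true, atol=0.1))      # True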
122.

Elements of Information Theory:


Some examples and then some useful proofs
123.

Computing entropies and specific conditional entropies


for discrete random variables
CMU, 2012 spring, R. Rosenfeld, HW2, pr. 2
124.

On the roll of two six-sided fair dice,

a. Calculate the distribution of the sum (S) of the two dice.

b. The amount of information (or surprise) when seeing the outcome x for a random variable X is defined as log₂ (1/P(X = x)) = −log₂ P(X = x). How surprised are you (in bits) to observe S = 2, S = 11, S = 5, S = 7?

c. Calculate the entropy of S [as the expected value of the random variable −log₂ P(X = x)].

d. Let's say you throw the dice one by one, and the first die shows 4. What is the entropy of S after this observation? Was any information gained / lost in the process? If so, calculate how much information (in bits) was lost or gained.
125.

a.
S      2     3     4     5     6     7     8     9     10    11    12
P(S)   1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36

b.
Information(S = 2) = −log₂(1/36) = log₂ 36 = 2 log₂ 6 = 2(1 + log₂ 3) ≈ 5.169925001 bits
Information(S = 11) = −log₂(2/36) = log₂ 18 = 1 + 2 log₂ 3 ≈ 4.169925001 bits
Information(S = 5) = −log₂(4/36) = log₂ 9 = 2 log₂ 3 ≈ 3.169925001 bits
Information(S = 7) = −log₂(6/36) = log₂ 6 = 1 + log₂ 3 ≈ 2.584962501 bits
126.

c.

H(S) = −Σ_i p_i log₂ p_i
= −[2 · (1/36) log₂(1/36) + 2 · (2/36) log₂(2/36) + 2 · (3/36) log₂(3/36) + 2 · (4/36) log₂(4/36) + 2 · (5/36) log₂(5/36) + (6/36) log₂(6/36)]
= (1/36)(2 log₂ 36 + 4 log₂ 18 + 6 log₂ 12 + 8 log₂ 9 + 10 log₂(36/5) + 6 log₂ 6)
= (1/36)(2 log₂ 6² + 4 log₂(6 · 3) + 6 log₂(6 · 2) + 8 log₂ 3² + 10 log₂(6²/5) + 6 log₂ 6)
= (1/36)(40 log₂ 6 + 20 log₂ 3 + 6 − 10 log₂ 5)
= (1/36)(60 log₂ 3 + 46 − 10 log₂ 5) ≈ 3.274401919 bits.
127.

d.

S          2  3  4  5    6    7    8    9    10   11  12
P(S | ...) 0  0  0  1/6  1/6  1/6  1/6  1/6  1/6  0   0

H(S | First-die-shows-4) = −6 · (1/6) log₂(1/6) = log₂ 6 ≈ 2.58 bits,

IG(S; First-die-shows-4) = H(S) − H(S | First-die-shows-4) = 3.27 − 2.58 = 0.69 bits.
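The whole exercise can be replayed in a few lines of Python (our addition, not part of the original solution):

from fractions import Fraction
from math import log2

# Distribution of the sum S of two fair dice
p = {}
for d1 in range(1, 7):
    for d2 in range(1, 7):
        p[d1 + d2] = p.get(d1 + d2, Fraction(0)) + Fraction(1, 36)

H = -sum(float(q) * log2(q) for q in p.values())
print(round(H, 6))            # 3.274402 bits

# After observing that the first die shows 4, S is uniform over {5, ..., 10}
H_cond = log2(6)
print(round(H - H_cond, 2))   # information gained: 0.69 bits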


128.

Computing entropies and average conditional entropies


for discrete random variables
CMU, 2012 spring, Roni Rosenfeld, HW2, pr. 3
129.

A doctor needs to diagnose whether a person has a cold (C). The primary factor
he considers in his diagnosis is the outside temperature (T). The random
variable C takes two values, yes / no, and the random variable T takes 3
values, sunny, rainy, snowy. The joint distribution of the two variables is
given in the following table.

T = sunny T = rainy T = snowy


C = no 0.30 0.20 0.10
C = yes 0.05 0.15 0.20

a. Calculate the marginal probabilities P (C), P (T ).


Hint: Use P(X = x) = Σ_y P(X = x, Y = y). For example,

P (C = no) = P (C = no, T = sunny) + P (C = no, T = rainy) + P (C = no, T = snowy).

b. Calculate the entropies H(C), H(T ).


c. Calculate the average conditional entropies H(C|T ), H(T |C).
130.

a. P_C = (0.6, 0.4) and P_T = (0.35, 0.35, 0.30).

b.
H(C) = 0.6 log₂(5/3) + 0.4 log₂(5/2) = log₂ 5 − 0.6 log₂ 3 − 0.4 ≈ 0.971 bits
H(T) = 2 · 0.35 log₂(20/7) + 0.3 log₂(10/3)
= 0.7(2 + log₂ 5 − log₂ 7) + 0.3(1 + log₂ 5 − log₂ 3)
= 1.7 + log₂ 5 − 0.7 log₂ 7 − 0.3 log₂ 3 ≈ 1.581 bits.
131.
c.

H(C|T) = Σ_{t∈Val(T)} P(T = t) · H(C|T = t)   (by definition)
= P(T = sunny) · H(C|T = sunny) + P(T = rainy) · H(C|T = rainy) + P(T = snowy) · H(C|T = snowy)
= 0.35 · H(0.30/0.35, 0.05/0.35) + 0.35 · H(0.20/0.35, 0.15/0.35) + 0.30 · H(0.10/0.30, 0.20/0.30)
= (7/20) · H(6/7, 1/7) + (7/20) · H(4/7, 3/7) + (3/10) · H(1/3, 2/3)
= (7/20) · [(6/7) log₂(7/6) + (1/7) log₂ 7] + (7/20) · [(4/7) log₂(7/4) + (3/7) log₂(7/3)] + (3/10) · [(1/3) log₂ 3 + (2/3) log₂(3/2)]
= (7/20) · [log₂ 7 − 6/7 − (6/7) log₂ 3] + (7/20) · [log₂ 7 − 8/7 − (3/7) log₂ 3] + (3/10) · [log₂ 3 − 2/3]
= (7/10) log₂ 7 − (3/20) log₂ 3 − 9/10 ≈ 0.8274 bits.
132.

H(T|C) = Σ_{c∈Val(C)} P(C = c) · H(T|C = c)   (by definition)
= P(C = no) · H(T|C = no) + P(C = yes) · H(T|C = yes)
= 0.60 · H(0.30/0.60, 0.20/0.60, 0.10/0.60) + 0.40 · H(0.05/0.40, 0.15/0.40, 0.20/0.40)
= (3/5) · H(1/2, 1/3, 1/6) + (2/5) · H(1/8, 3/8, 1/2)
= (3/5) · [1/2 + (1/3) log₂ 3 + (1/6)(1 + log₂ 3)] + (2/5) · [(1/8) · 3 + (3/8)(3 − log₂ 3) + 1/2]
= (3/5) · [2/3 + (1/2) log₂ 3] + (2/5) · [2 − (3/8) log₂ 3]
= 6/5 + (3/20) log₂ 3 ≈ 1.43774 bits.
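The same numbers can be recomputed directly from the joint table with a short sketch (our addition), assuming NumPy:

import numpy as np

# Joint distribution P(C, T): rows C = no/yes, columns T = sunny/rainy/snowy
P = np.array([[0.30, 0.20, 0.10],
              [0.05, 0.15, 0.20]])

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

P_C, P_T = P.sum(axis=1), P.sum(axis=0)
H_C_given_T = sum(P_T[j] * H(P[:, j] / P_T[j]) for j in range(3))
H_T_given_C = sum(P_C[i] * H(P[i, :] / P_C[i]) for i in range(2))
print(H(P_C), H(P_T))             # ~0.971 and ~1.581 bits
print(H_C_given_T, H_T_given_C)   # ~0.8274 and ~1.4377 bits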
133.

Computing the entropy of the exponential distribution

CMU, 2011 spring, R. Rosenfeld, HW2, pr. 2.c

[Figure: the exponential p.d.f. p(x) = λe^{−λx} on x ∈ (0, 5), for λ = 0.5, λ = 1 and λ = 1.5.]
134.

For a continuous probability distribution P, the entropy is defined as follows:

H(P) = ∫_{−∞}^{+∞} P(x) log₂ (1/P(x)) dx

Compute the entropy of the continuous exponential distribution of parameter λ > 0. This distribution is defined as follows:

P(x) = λe^{−λx}, if x ≥ 0;  P(x) = 0, if x < 0.

Hint: Whenever P(x) = 0, you will assume −P(x) log₂ P(x) = 0.


135.
Answer

H(P) = ∫_{−∞}^0 P(x) log₂ (1/P(x)) dx + ∫_0^∞ P(x) log₂ (1/P(x)) dx
= ∫_{−∞}^0 0 · log₂ 0 dx + ∫_0^∞ λe^{−λx} log₂ (1/(λe^{−λx})) dx = ∫_0^∞ λe^{−λx} log₂ (1/(λe^{−λx})) dx

(the first integral is 0, by the convention stated in the hint)

⇒ H(P) = (1/ln 2) ∫_0^∞ λe^{−λx} ln (1/(λe^{−λx})) dx = (1/ln 2) ∫_0^∞ λe^{−λx} [ln(1/λ) + ln(1/e^{−λx})] dx
= (1/ln 2) ∫_0^∞ λe^{−λx} (−ln λ + ln e^{λx}) dx = (1/ln 2) ∫_0^∞ λe^{−λx} (−ln λ + λx) dx
= (1/ln 2) ∫_0^∞ λe^{−λx} (−ln λ) dx + (1/ln 2) ∫_0^∞ λe^{−λx} λx dx
= (−ln λ / ln 2) ∫_0^∞ λe^{−λx} dx + (λ/ln 2) ∫_0^∞ λe^{−λx} x dx
= (ln λ / ln 2) ∫_0^∞ (e^{−λx})′ dx − (λ/ln 2) ∫_0^∞ (e^{−λx})′ x dx
136.

The first integral is easy to evaluate:

∫_0^∞ (e^{−λx})′ dx = e^{−λx} |_0^∞ = e^{−∞} − e^0 = 0 − 1 = −1

The second integral can be solved using the integration-by-parts formula:

∫_0^∞ (e^{−λx})′ x dx = e^{−λx} x |_0^∞ − ∫_0^∞ e^{−λx} x′ dx = e^{−λx} x |_0^∞ − ∫_0^∞ e^{−λx} dx

The term e^{−λx} x |_0^∞ cannot be computed directly (because of the 0 · ∞ conflict that occurs when x takes the limit value ∞); instead, it is computed using l'Hôpital's rule:

lim_{x→∞} x e^{−λx} = lim_{x→∞} x/e^{λx} = lim_{x→∞} x′/(e^{λx})′ = lim_{x→∞} 1/(λe^{λx}) = (1/λ) lim_{x→∞} e^{−λx} = (1/λ) · e^{−∞} = 0,

so e^{−λx} x |_0^∞ = 0 − 0 = 0.
137.

The integral ∫_0^∞ e^{−λx} dx is computed easily:

∫_0^∞ e^{−λx} dx = −(1/λ) ∫_0^∞ (e^{−λx})′ dx = −(1/λ) e^{−λx} |_0^∞ = −(1/λ)(0 − 1) = 1/λ

Therefore,

∫_0^∞ (e^{−λx})′ x dx = 0 − 1/λ = −1/λ,

which leads to the final result:

H(P) = (ln λ / ln 2)(−1) − (λ/ln 2)(−1/λ) = −ln λ/ln 2 + 1/ln 2 = (1 − ln λ)/ln 2.
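The closed form H(P) = (1 − ln λ)/ln 2 can be checked by numerical integration (our addition, not part of the original solution), assuming SciPy:

import numpy as np
from scipy.integrate import quad

for lam in (0.5, 1.0, 1.5):
    p = lambda x, lam=lam: lam * np.exp(-lam * x)
    # differential entropy -∫ p(x) log2 p(x) dx over the support (0, ∞)
    H_num, _ = quad(lambda x: -p(x) * np.log2(p(x)), 0, np.inf)
    H_closed = (1 - np.log(lam)) / np.log(2)
    print(lam, round(H_num, 6), round(H_closed, 6))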
138.

Entropy, joint entropy,

conditional entropy, information gain:
definitions and immediate properties
CMU, 2005 fall, T. Mitchell, A. Moore, HW1, pr. 2
139.
Definitions
• The entropy of the variable X:
H(X) = −Σ_i P(X = x_i) log P(X = x_i) = E_X[−log P(X)].
• The specific conditional entropy of the variable Y with respect to the value x_k of the variable X:
H(Y | X = x_k) = −Σ_j P(Y = y_j | X = x_k) log P(Y = y_j | X = x_k) = E_{Y|X=x_k}[−log P(Y | X = x_k)].
• The average conditional entropy of the variable Y with respect to the variable X:
H(Y | X) = Σ_k P(X = x_k) H(Y | X = x_k) = E_X[H(Y | X)].

• The joint entropy of the variables X and Y:

H(X, Y) = −Σ_i Σ_j P(X = x_i, Y = y_j) log P(X = x_i, Y = y_j) = E_{X,Y}[−log P(X, Y)].
• The mutual information of the variables X and Y, also called the information gain of the variable X with respect to the variable Y (or vice versa):
MI(X, Y) = IG(X, Y) = H(X) − H(X | Y) = H(Y) − H(Y | X)
(Remark: the last equality above holds due to the result proven at part c below.)
140.

a.
H(X) ≥ 0.

H(X) = −Σ_i P(X = x_i) log P(X = x_i) = Σ_i P(X = x_i) log (1/P(X = x_i)) ≥ 0,

since every factor P(X = x_i) is ≥ 0 and every factor log(1/P(X = x_i)) is ≥ 0.

Moreover, H(X) = 0 if and only if the variable X is constant:

"⇒" Assume H(X) = 0, i.e., Σ_i P(X = x_i) log (1/P(X = x_i)) = 0. Since every term of this sum is greater than or equal to 0, it follows that H(X) = 0 only if for all i, P(X = x_i) = 0 or log (1/P(X = x_i)) = 0, i.e., only if for all i, P(X = x_i) = 0 or P(X = x_i) = 1. But since Σ_i P(X = x_i) = 1, it follows that there is a single value x_1 of X such that P(X = x_1) = 1, while P(X = x) = 0 for any x ≠ x_1. In other words, the discrete random variable X is constant.
"⇐" Assume the variable X is constant, which means that X takes a single value x_1, with probability P(X = x_1) = 1. Therefore, H(X) = −1 · log 1 = 0.
141.

b.
H(Y | X) = −Σ_i Σ_j P(X = x_i, Y = y_j) log P(Y = y_j | X = x_i)

H(Y | X) = Σ_i P(X = x_i) H(Y | X = x_i)
= Σ_i P(X = x_i) [−Σ_j P(Y = y_j | X = x_i) log P(Y = y_j | X = x_i)]
= −Σ_i Σ_j P(X = x_i) P(Y = y_j | X = x_i) log P(Y = y_j | X = x_i)
= −Σ_i Σ_j P(X = x_i, Y = y_j) log P(Y = y_j | X = x_i),

where we used P(X = x_i) P(Y = y_j | X = x_i) = P(X = x_i, Y = y_j).
142.
c.
H(X, Y) = H(X) + H(Y | X) = H(Y) + H(X | Y)

H(X, Y) = −Σ_i Σ_j p(x_i, y_j) log p(x_i, y_j)
= −Σ_i Σ_j p(x_i) · p(y_j | x_i) log[p(x_i) · p(y_j | x_i)]
= −Σ_i Σ_j p(x_i) · p(y_j | x_i) [log p(x_i) + log p(y_j | x_i)]
= −Σ_i Σ_j p(x_i) · p(y_j | x_i) log p(x_i) − Σ_i Σ_j p(x_i) · p(y_j | x_i) log p(y_j | x_i)
= −Σ_i p(x_i) log p(x_i) · Σ_j p(y_j | x_i) − Σ_i p(x_i) Σ_j p(y_j | x_i) log p(y_j | x_i)
= H(X) + Σ_i p(x_i) H(Y | X = x_i) = H(X) + H(Y | X),

where we used Σ_j p(y_j | x_i) = 1.
143.

More generally (the chain rule):

H(X_1, . . . , X_n) = H(X_1) + H(X_2 | X_1) + . . . + H(X_n | X_1, . . . , X_{n−1})

H(X_1, . . . , X_n) = E[log (1/p(x_1, . . . , x_n))]
= −E_{p(x_1,...,x_n)}[log p(x_1, . . . , x_n)]
= −E_{p(x_1,...,x_n)}[log (p(x_1) · p(x_2 | x_1) · . . . · p(x_n | x_1, . . . , x_{n−1}))]
= −E_{p(x_1,...,x_n)}[log p(x_1) + log p(x_2 | x_1) + . . . + log p(x_n | x_1, . . . , x_{n−1})]
= −E_{p(x_1)}[log p(x_1)] − E_{p(x_1,x_2)}[log p(x_2 | x_1)] − . . . − E_{p(x_1,...,x_n)}[log p(x_n | x_1, . . . , x_{n−1})]
= H(X_1) + H(X_2 | X_1) + . . . + H(X_n | X_1, . . . , X_{n−1})
144.

An upper bound for the entropy of a discrete distribution


CMU, 2003 fall, T. Mitchell, A. Moore, HW1, pr. 1.1
145.

Let X be a discrete random variable taking n values and following the probability distribution P. According to the definition, the entropy of X is

H(X) = −Σ_{i=1}^n P(X = x_i) log₂ P(X = x_i).

Show that H(X) ≤ log₂ n.

Hint: You may use the inequality ln x ≤ x − 1, which holds for any x > 0.

[Figure: the graphs of y = ln x and y = x − 1, illustrating that the line lies above the logarithm.]
Answer
146.

H(X) = −(1/ln 2) Σ_{i=1}^n P(X = x_i) ln P(X = x_i)

Therefore,

H(X) ≤ log₂ n ⇔ −(1/ln 2) Σ_{i=1}^n P(X = x_i) ln P(X = x_i) ≤ log₂ n
⇔ −Σ_{i=1}^n P(x_i) ln P(x_i) ≤ ln n
⇔ Σ_{i=1}^n P(x_i) ln (1/P(x_i)) − (Σ_{i=1}^n P(x_i)) ln n ≤ 0   (using Σ_{i=1}^n P(x_i) = 1)
⇔ Σ_{i=1}^n P(x_i) [ln (1/P(x_i)) − ln n] ≤ 0
⇔ Σ_{i=1}^n P(x_i) ln (1/(n P(x_i))) ≤ 0
147.

Applying the inequality ln x ≤ x − 1 for x = 1/(n P(x_i)), we get:

Σ_{i=1}^n P(x_i) ln (1/(n P(x_i))) ≤ Σ_{i=1}^n P(x_i) (1/(n P(x_i)) − 1) = Σ_{i=1}^n 1/n − Σ_{i=1}^n P(x_i) = 1 − 1 = 0

Remark: This upper bound is indeed "attained". For instance, if a discrete random variable X taking n values follows the uniform distribution, it can be immediately verified that H(X) = log₂ n.
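A tiny numerical illustration (our addition): random distributions over n = 6 values never exceed log₂ n, and the uniform distribution attains the bound. Dirichlet sampling is just a convenient way to draw random distributions.

import numpy as np

rng = np.random.default_rng(0)
n = 6
for _ in range(5):
    p = rng.dirichlet(np.ones(n))            # a random distribution over n values
    print(round(-np.sum(p * np.log2(p)), 4), "<=", round(np.log2(n), 4))

p_unif = np.full(n, 1 / n)                   # the uniform distribution attains the bound
print(-np.sum(p_unif * np.log2(p_unif)))     # exactly log2(6) ≈ 2.585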
148.

The particular form of the chain rule for entropies


when X and Y are independent random variables:

H(X, Y ) = H(X) + H(Y )

CMU, 2012 spring, R. Rosenfeld, HW2, pr. 7.b


149.
Prove that if X and Y are independent random variables, the following property holds:
H(X, Y ) = H(X) + H(Y ).

Answer:

H(X, Y) = −Σ_{x,y} P(X = x, Y = y) log P(X = x, Y = y)
= −Σ_{x,y} P(X = x) P(Y = y) log(P(X = x) P(Y = y))   (by independence)
= −Σ_{x,y} P(X = x) P(Y = y) (log P(X = x) + log P(Y = y))
= −Σ_{x,y} P(X = x) P(Y = y) log P(X = x) − Σ_{x,y} P(X = x) P(Y = y) log P(Y = y)
= −(Σ_y P(Y = y)) (Σ_x P(X = x) log P(X = x)) − (Σ_x P(X = x)) (Σ_y P(Y = y) log P(Y = y))
= −Σ_x P(X = x) log P(X = x) − Σ_y P(Y = y) log P(Y = y)   (since Σ_x P(X = x) = Σ_y P(Y = y) = 1)
= H(X) + H(Y)
150.
If X, Y are continuous and independent random variables, then

p_{X,Y}(X = x, Y = y) = p_X(X = x) p_Y(Y = y)

for any x and y, where p denotes the p.d.f. corresponding to the [c.d.f. of] P. Therefore,

H(X, Y) = −∫_X ∫_Y p_{X,Y}(X = x, Y = y) log p_{X,Y}(X = x, Y = y) dx dy   (by definition)
= −∫_X ∫_Y p_X(X = x) p_Y(Y = y) (log p_X(X = x) + log p_Y(Y = y)) dx dy   (by independence)
= −∫_X ∫_Y p_X(x) p_Y(y) (log p_X(x) + log p_Y(y)) dx dy   (simplified notation)
= −∫_X ∫_Y p_Y(y) p_X(x) log p_X(x) dx dy − ∫_X ∫_Y p_X(x) p_Y(y) log p_Y(y) dx dy
= −∫_X p_X(x) log p_X(x) (∫_Y p_Y(y) dy) dx − ∫_X p_X(x) (∫_Y p_Y(y) log p_Y(y) dy) dx
= −∫_X p_X(x) log p_X(x) dx + H(Y) ∫_X p_X(x) dx   (the first inner integral equals 1, the second equals −H(Y))
= H(X) + H(Y)   (since ∫_X p_X(x) dx = 1)
151.

The conditional form of


the simplest case of the chain rule for entropies (n = 2):
H(X, Y |A) = H(X|A) + H(Y |X, A)

CMU, 2012 spring, Roni Rosenfeld, HW2, pr. 4.b


152.
Prove that for any 3 discrete random variables X, Y and A, the following property holds:

H(X, Y |A) = H(X|A) + H(Y |X, A) = H(Y |A) + H(X|Y, A).

[Figure: a diagram decomposing H(X, Y |A) into H(X|A) and H(Y |X, A), drawn over the entropies H(X), H(Y) and H(A).]

Answer:

H(X, Y |A) = H(X, Y, A) − H(A) = H(X, Y, A) − H(Y, A) + H(Y, A) − H(A)
= (H(X, Y, A) − H(Y, A)) + (H(Y, A) − H(A)) = H(X|Y, A) + H(Y |A)
= H(Y |A) + H(X|Y, A).

We’ve used the fact that H(X, Y ) = H(X|Y ) + H(Y ) (see CMU, 2005 fall, T.
Mitchell, A. Moore, HW1, pr. 2).
153.

Proving [in a direct manner] that


the Information Gain is always positive or 0
and that IG(X, Y ) = 0 ⇔ X is independent of Y
(an indirect proof is made at CMU, 2007 fall, Carlos Guestrin, HW1, pr. 1.2)

Liviu Ciortuz, 2017


154.

The definition of the information gain (or: mutual information) of a random variable X with respect to another random variable Y is

IG(X, Y) = H(X) − H(X | Y) = H(Y) − H(Y | X).

At CMU, 2007 fall, Carlos Guestrin, HW1, pr. 1.2 it was proven — for the case where X and Y are discrete — that IG(X, Y) = KL(P_{X,Y} || P_X P_Y), where KL designates the relative entropy (or: the Kullback-Leibler divergence), P_X and P_Y are the distributions of the variables X and Y respectively, and P_{X,Y} is the joint distribution of these variables. Also at CMU, 2007 fall, Carlos Guestrin, HW1, pr. 1.2 it was shown that the KL divergence is always non-negative. Consequently, IG(X, Y) ≥ 0 for any X and Y.

In this exercise we ask you to prove the inequality IG(X, Y) ≥ 0 in a direct manner, starting from the first definition given above, without appealing [again] to the Kullback-Leibler divergence.
155.
Hint: You may use the following form of Jensen's inequality:

Σ_{i=1}^n a_i log x_i ≤ log (Σ_{i=1}^n a_i x_i)

where the base of the logarithm is taken greater than 1, a_i ≥ 0 for i = 1, . . . , n, and Σ_{i=1}^n a_i = 1.

Remark: The advantage in this problem, compared to CMU, 2007 fall, Carlos Guestrin, HW1, pr. 1.2.a, is that here we work with a single distribution (p), not with two distributions (p and q). However, the proof here will be more laborious.

Answer

Assume that the values of the variable X are x_1, x_2, . . . , x_n, and the values of the variable Y are y_1, y_2, . . . , y_m. We have:

IG(X, Y) = H(X) − H(X|Y)   (by definition)
= −Σ_{i=1}^n P(x_i) log₂ P(x_i) − Σ_{j=1}^m P(y_j) (−Σ_{i=1}^n P(x_i|y_j) log₂ P(x_i|y_j))
156.
−IG(X, Y) = Σ_{i=1}^n P(x_i) log₂ P(x_i) − Σ_{j=1}^m P(y_j) Σ_{i=1}^n P(x_i|y_j) log₂ P(x_i|y_j)
= Σ_{i=1}^n Σ_{j=1}^m P(x_i, y_j) log₂ P(x_i) − Σ_{j=1}^m P(y_j) Σ_{i=1}^n P(x_i|y_j) log₂ P(x_i|y_j)   (marginal probability)
= Σ_{i=1}^n Σ_{j=1}^m P(x_i, y_j) log₂ P(x_i) − Σ_{j=1}^m Σ_{i=1}^n P(y_j) P(x_i|y_j) log₂ P(x_i|y_j)   (distributivity of · over +)
= Σ_{i=1}^n Σ_{j=1}^m P(x_i, y_j) log₂ P(x_i) − Σ_{j=1}^m Σ_{i=1}^n P(x_i, y_j) log₂ P(x_i|y_j)   (def. of conditional probability)
= Σ_{i=1}^n Σ_{j=1}^m P(x_i, y_j) (log₂ P(x_i) − log₂ P(x_i|y_j))
= Σ_{i=1}^n Σ_{j=1}^m P(x_i, y_j) log₂ (P(x_i)/P(x_i|y_j))   (property of logarithms)
= Σ_{i=1}^n Σ_{j=1}^m P(x_i|y_j) P(y_j) log₂ (P(x_i)/P(x_i|y_j))   (multiplication rule)
= Σ_{j=1}^m P(y_j) Σ_{i=1}^n P(x_i|y_j) log₂ (P(x_i)/P(x_i|y_j)),

where, in the inner sum, the coefficients a_i = P(x_i|y_j) will play the role required by Jensen's inequality.
157.

Since on the one hand P(x_i|y_j) ≥ 0 and on the other hand Σ_{i=1}^n P(x_i|y_j) = 1 for each value y_j of Y, we can apply Jensen's inequality to the inner sum in the last expression above — more precisely, for each value of the index j — and we obtain:

−IG(X, Y) ≤ Σ_{j=1}^m P(y_j) log₂ (Σ_{i=1}^n P(x_i|y_j) · P(x_i)/P(x_i|y_j)) = Σ_{j=1}^m P(y_j) log₂ (Σ_{i=1}^n P(x_i)) = 0,

since Σ_{i=1}^n P(x_i) = 1. Therefore, IG(X, Y) ≥ 0.

Also, from the relation −IG(X, Y) = Σ_{j=1}^m P(y_j) Σ_{i=1}^n P(x_i|y_j) log₂ (P(x_i)/P(x_i|y_j)) obtained on the previous slide, the following consequence follows immediately: when X is independent of Y, since P(x_i) = P(x_i|y_j) for all i and j, it follows that log₂ (P(x_i)/P(x_i|y_j)) = 0, hence IG(X, Y) = 0.
158.
Conversely, we will prove here that IG(X, Y) = 0 implies that X is independent of Y.
For this we will use the equality case of Jensen's inequality.
Indeed, from the equality

−IG(X, Y) = Σ_{j=1}^m P(y_j) Σ_{i=1}^n P(x_i|y_j) log₂ (P(x_i)/P(x_i|y_j))

we have

IG(X, Y) = 0 ⇒ Σ_{i=1}^n P(x_i|y_j) log₂ (P(x_i)/P(x_i|y_j)) = 0, ∀j = 1, . . . , m   (since P(y_j) > 0).

It is, however, also true that

log₂ (Σ_{i=1}^n P(x_i|y_j) · P(x_i)/P(x_i|y_j)) = log₂ (Σ_{i=1}^n P(x_i)) = 0, ∀j = 1, . . . , m.

Therefore,

Σ_{i=1}^n P(x_i|y_j) log₂ (P(x_i)/P(x_i|y_j)) = log₂ (Σ_{i=1}^n P(x_i|y_j) · P(x_i)/P(x_i|y_j)) = 0, ∀j = 1, . . . , m.

Applying, for each value of the index j, [the equality case of] Jensen's inequality — which is possible since a_i = P(x_i|y_j) ≥ 0 and Σ_{i=1}^n a_i = 1 for each fixed value of j — it follows that

P(x_i)/P(x_i|y_j) = α (a constant), ∀i = 1, . . . , n.

Thus, P(x_i) = α P(x_i|y_j), ∀i = 1, . . . , n. Summing these equalities for i = 1, . . . , n yields Σ_{i=1}^n P(x_i) = α Σ_{i=1}^n P(x_i|y_j), hence 1 = α. Therefore P(x_i) = P(x_i|y_j), ∀i = 1, . . . , n [and for each j = 1, . . . , m], so X is independent of Y.
159.

Corollary

Since
H(X, Y) = H(X) + H(Y |X) = H(Y) + H(X|Y),
and also due to the definition of the Information Gain,

IG(X, Y) = H(X) − H(X|Y) = H(Y) − H(Y |X),

it follows that

X independent of Y ⇔ H(X) = H(X|Y)
⇔ H(Y) = H(Y |X)
⇔ H(X, Y) = H(X) + H(Y)
160.

Cross-entropy (CH):
definition, some basic properties, and exemplifications
CMU, 2011 spring, Roni Rosenfeld, HW2, pr. 3.c
161.

Cross-entropy, CH(p, q), measures the average number of bits needed to encode an event from a set of possibilities, if a coding scheme is used based on a given probability distribution q, rather than the "true" distribution p.
For discrete distributions p and q this means

CH(p, q) = −Σ_x p(x) log q(x)

The situation for continuous distributions is analogous:

CH(p, q) = −∫_X p(x) log q(x) dx.

a. Can cross-entropy be negative? Either prove or give a counter-example.
162.

Answer

a. No. Here follows the proof:

For every probability value p(x) and q(x), we know from the definition that 0 ≤ p(x) ≤ 1 and 0 ≤ q(x) ≤ 1. From q(x) ≤ 1, we conclude that log q(x) ≤ 0. Given that 0 ≤ p(x) and −log q(x) ≥ 0, we conclude that 0 ≤ −p(x) log q(x). Thus, the sum of these terms will also be greater than or equal to 0, so cross-entropy is never negative.

Note, however, that unlike entropy, cross-entropy is not bounded: it can grow to infinity if for some x, q(x) is zero and p(x) is not zero.
163.

b. In many experiments, the quality of different hypothesis models is compared on a data set. Imagine you derived two different models to predict the probabilities of 7 different possible outcomes, and the probability distributions predicted by the models are q_1 and q_2, as follows:

q_1 = (0.1, 0.1, 0.2, 0.3, 0.2, 0.05, 0.05) and q_2 = (0.05, 0.1, 0.15, 0.35, 0.2, 0.1, 0.05).

The experiments are done on a data set with the following empirical distribution:

p_empirical = (0.05, 0.1, 0.2, 0.3, 0.2, 0.1, 0.05).

Compute the cross-entropies CH(p_empirical, q_1) and CH(p_empirical, q_2).

Which model has a lower cross-entropy? Is this model guaranteed to be a better one? Explain your answer.
164.

Answer

b. Using the cross-entropy formula we see that:

CH(p_empirical, q_1) = −(0.05 log 0.1 + 0.1 log 0.1 + 0.2 log 0.2 + 0.3 log 0.3 + 0.2 log 0.2 + 0.1 log 0.05 + 0.05 log 0.05) ≈ 2.596 bits
CH(p_empirical, q_2) = −(0.05 log 0.05 + 0.1 log 0.1 + 0.2 log 0.15 + 0.3 log 0.35 + 0.2 log 0.2 + 0.1 log 0.1 + 0.05 log 0.05) ≈ 2.56 bits.

The pempirical distribution has a lower cross-entropy with the model q2 , so it is


reasonable to say that q2 is a better choice.

However, it is not guaranteed that this model is always the better model,
because we are working with an “empirical” distribution here, and the “true”
distribution might not be reflected in this empirical distribution completely.
Usually, sampling bias and insufficient training data samples will widen the
gap between the true distribution and the empirical one, so in practice when
designing the experiment, you should always have that in mind, and if possible
use techniques that minimize these risks.
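These computations are easy to reproduce (our addition, not part of the original solution); note also that the cross-entropy of p_empirical with itself is its entropy, the smallest value achievable by any model q:

import numpy as np

p  = np.array([0.05, 0.1, 0.2, 0.3, 0.2, 0.1, 0.05])   # p_empirical
q1 = np.array([0.1, 0.1, 0.2, 0.3, 0.2, 0.05, 0.05])
q2 = np.array([0.05, 0.1, 0.15, 0.35, 0.2, 0.1, 0.05])

def cross_entropy(p, q):
    # average code length (in bits) when events ~ p are encoded assuming q
    return -np.sum(p * np.log2(q))

print(cross_entropy(p, q1))   # ≈ 2.596 bits
print(cross_entropy(p, q2))   # ≈ 2.563 bits
print(cross_entropy(p, p))    # H(p) ≈ 2.546 bits, the minimum over all q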
165.

Relative entropy a.k.a. the Kullback-Leibler divergence,

and the [relationship to] information gain;
some basic properties
CMU, 2007 fall, C. Guestrin, HW1, pr. 1.2
[adapted by Liviu Ciortuz]
166.

The relative entropy — also known as the Kullback-Leibler (KL) divergence — from a distribution p to a distribution q is defined as

KL(p||q) = −Σ_{x∈X} p(x) log (q(x)/p(x))

From an information theory perspective, the KL divergence specifies the number of additional bits required on average to transmit values of X if the values are distributed with respect to p but we encode them assuming the distribution q.
167.

Notes
1. KL is not a distance measure, since it is not symmetric (i.e., in general KL(p||q) ≠ KL(q||p)).
Another measure, defined as JSD(p||q) = (1/2)(KL(p||(p + q)/2) + KL(q||(p + q)/2)) and called the Jensen-Shannon divergence, is symmetric.

2. The quantity

d(X, Y) = H(X, Y) − IG(X, Y) = H(X) + H(Y) − 2 IG(X, Y) = H(X | Y) + H(Y | X),

known as variation of information, is a distance metric, i.e., it is non-negative, symmetric, satisfies the identity of indiscernibles, and satisfies the triangle inequality.
168.
a. Show that KL(p||q) ≥ 0, and KL(p||q) = 0 iff p(x) = q(x) for all x.
(More generally, the smaller the KL divergence, the more similar the two distributions.)

Hint:
To prove this part you may use Jensen's inequality:

If f : R → R is a convex function, then for any t ∈ [0, 1] and any x_1, x_2 ∈ R,

f(t x_1 + (1 − t) x_2) ≤ t f(x_1) + (1 − t) f(x_2).

If f is a strictly convex function, then equality holds only if x_1 = x_2.

[Figure: a convex curve f, with the chord between (x, f(x)) and (y, f(y)) lying above the graph: f(tx + (1−t)y) ≤ t f(x) + (1−t) f(y).]

More generally, for any a_i ≥ 0, i = 1, . . . , n with Σ_i a_i ≠ 0, and any x_i ∈ R, i = 1, . . . , n, we have

f(Σ_i a_i x_i / Σ_j a_j) ≤ Σ_i a_i f(x_i) / Σ_j a_j.

If f is strictly convex, then equality holds only if x_1 = . . . = x_n.
Obviously, similar results can be formulated for concave functions.
169.

Answer
We will prove the inequality KL(p||q) ≥ 0 using Jensen's inequality, in whose expression we substitute f with the convex function −log₂, a_i with p(x_i), and x_i with q(x_i)/p(x_i).
(For convenience, in what follows we drop the index of the variable x.)
We will have:

KL(p||q) = −Σ_x p(x) log (q(x)/p(x))   (by definition)
≥ −log (Σ_x p(x) · q(x)/p(x)) = −log (Σ_x q(x)) = −log 1 = 0   (by Jensen's inequality)

Therefore, KL(p||q) ≥ 0, whatever the (discrete) distributions p and q.


170.

We now prove that KL(p||q) = 0 ⇔ p = q.

"⇐"
The equality p(x) = q(x) implies q(x)/p(x) = 1, hence log (q(x)/p(x)) = 0 for every x, from which KL(p||q) = 0 follows immediately.
"⇒"
We know that in Jensen's inequality equality holds only in the case where x_i = x_j for all i and j.
In the present case, this condition translates into the fact that the ratio q(x)/p(x) is the same constant (α) for every value of x.
Taking into account that Σ_x p(x) = 1 and Σ_x q(x) = Σ_x p(x) · q(x)/p(x) = α Σ_x p(x) = 1, it follows that α = 1, hence p(x) = q(x) for every x, which means that the distributions p and q are identical.
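A minimal sketch (our addition) of computing the KL divergence in bits, illustrating its non-negativity, its asymmetry, and the fact that KL(p||p) = 0; the helper kl is ours and assumes q(x) > 0 wherever p(x) > 0.

import numpy as np

def kl(p, q):
    # KL(p||q) in bits; terms with p(x) = 0 contribute 0 by convention
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p = np.array([0.5, 0.25, 0.25])
q = np.array([0.25, 0.5, 0.25])
print(kl(p, q), kl(q, p))   # both >= 0, and in general unequal (KL is not symmetric)
print(kl(p, p))             # 0.0, attained exactly when the two distributions coincide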
171.
b. We can define the information gain as the KL divergence from the observed joint distribution of X and Y to the product of their observed marginals:

IG(X, Y) = KL(p_{X,Y} || (p_X p_Y)) = −Σ_x Σ_y p_{X,Y}(x, y) log (p_X(x) p_Y(y) / p_{X,Y}(x, y))
= −Σ_x Σ_y p(x, y) log (p(x) p(y) / p(x, y))   (simplified notation)

Prove that this definition of information gain is equivalent to the one given in problem CMU, 2005 fall, T. Mitchell, A. Moore, HW1, pr. 2. That is, show that IG(X, Y) = H[X] − H[X|Y] = H[Y] − H[Y |X], starting from the definition in terms of KL divergence.

Remark:
It follows that

IG(X, Y) = Σ_y p(y) Σ_x p(x | y) log (p(x | y)/p(x)) = Σ_y p(y) KL(p_{X|Y} || p_X) = E_Y[KL(p_{X|Y} || p_X)]
172.

Answer
By making use of the multiplication rule, namely p(x, y) = p(x | y) p(y), we will have:

KL(p_{X,Y} || (p_X p_Y))
= −Σ_x Σ_y p(x, y) log (p(x) p(y) / p(x, y))   (by the definition of KL)
= −Σ_x Σ_y p(x, y) log (p(x) / p(x | y)) = −Σ_x Σ_y p(x, y) [log p(x) − log p(x | y)]
= −Σ_x Σ_y p(x, y) log p(x) − (−Σ_x Σ_y p(x, y) log p(x | y))
= −Σ_x log p(x) (Σ_y p(x, y)) − H[X | Y] = −Σ_x p(x) log p(x) − H[X | Y]
= H[X] − H[X | Y] = IG(X, Y)


173.

c.
A direct consequence of parts a. and b. is that IG(X, Y) ≥ 0 (and therefore H(X) ≥ H(X|Y) and H(Y) ≥ H(Y |X)) for any discrete random variables X and Y.
Prove that IG(X, Y) = 0 iff X and Y are independent.

Answer:

This is also an immediate consequence of parts a. and b. already proven:

IG(X, Y) = 0 ⇔ KL(p_{X,Y} || p_X p_Y) = 0   (by b.) ⇔ X and Y are independent   (by a.).
Remark
174.

We can (re)prove the inequality IG(X, Y) ≥ 0 using (only!) the result from part b. (and not the one from part a.), namely that IG(X, Y) = −Σ_x Σ_y p(x, y) log (p(x) p(y) / p(x, y)). The idea is to apply Jensen's inequality in a slightly generalized form, namely:
− instead of a single index, two indices are considered (so instead of a_i and x_i we will have a_ij and x_ij, respectively);
− we take f = −log₂, a_ij ← p(x_i, y_j) and x_ij ← p(x_i) p(y_j) / p(x_i, y_j);
− finally, we take into account that Σ_i Σ_j p(x_i, y_j) = 1.
Therefore,

IG(X, Y) = −Σ_i Σ_j p(x_i, y_j) log (p(x_i) · p(y_j) / p(x_i, y_j)) = Σ_i Σ_j p(x_i, y_j) [−log (p(x_i) · p(y_j) / p(x_i, y_j))]
≥ −log (Σ_i Σ_j p(x_i, y_j) · p(x_i) · p(y_j) / p(x_i, y_j)) = −log (Σ_i Σ_j p(x_i) · p(y_j))
= −log ((Σ_i p(x_i)) · (Σ_j p(y_j))) = −log 1 = 0

In conclusion, IG(X, Y) ≥ 0.


175.

Remark (cont'd)

Regarding the equivalence IG(X, Y) = 0 ⇔ X and Y are independent:

"⇐"
If X and Y are independent variables, then p(x_i, y_j) = p(x_i) p(y_j) for all i and j.
Consequently, all the logarithms on the right-hand side of the first equality in the computation above are 0, and it follows that IG(X, Y) = 0.
"⇒"
Conversely, assuming that IG(X, Y) = 0, we take into account the fact that we can express the information gain by means of the KL divergence, and we apply a reasoning similar to the one at part a.
It follows that p(x_i) p(y_j) / p(x_i, y_j) = 1, and therefore p(x_i) p(y_j) = p(x_i, y_j) for all i and j.
This is equivalent to saying that the variables X and Y are independent.
176.

Using Information Gain / Mutual Information for doing


Feature Selection
CMU, 2009 spring, Ziv Bar-Joseph, HW5, pr. 6
177.

Given the following observations for input binary features X1 , X2 , X3 , X4 , X5 ,


and output binary label Y , we would like to use a filter approach to reduce
the feature space of {X1 , X2 , X3 , X4 , X5 }.

X1 X2 X3 X4 X5 Y
0 1 1 0 1 0
1 0 0 0 1 0
0 1 0 1 0 1
1 1 1 1 0 1
0 1 1 0 0 1
0 0 0 1 1 1
1 0 0 1 0 1
1 1 1 0 1 1

a. Calculate the mutual information M I(Xi , Y ) for each i.


b. Accordingly choose the smallest subset of features such that the best
classifier trained on the reduced feature set will perform at least as well as
the best classifier trained on the whole feature set. Explain the reasons behind
your choice.
Answer
178.
a. As shown at CMU, 2007 fall, Carlos Guestrin, HW1, pr. 1.2, the mutual information can be calculated as:

MI(X, Y) = −Σ_x Σ_y P(x, y) log (P(x) P(y) / P(x, y)).

The marginal probabilities, estimated (in the MLE sense) from the given data, are shown in the tables below.

            x_i = 0   x_i = 1
P(X1 = x1)  1/2       1/2
P(X2 = x2)  3/8       5/8
P(X3 = x3)  1/2       1/2
P(X4 = x4)  1/2       1/2
P(X5 = x5)  1/2       1/2

y          0     1
P(Y = y)   1/4   3/4

The joint probabilities P(X_i = x_i, Y = y), also estimated from the data, are:

xi = 0, y = 0 xi = 0, y = 1 xi = 1, y = 0 xi = 1, y = 1
P (x1 , y) 1/8 3/8 1/8 3/8
P (x2 , y) 1/8 1/4 1/8 1/2
P (x3 , y) 1/8 3/8 1/8 3/8
P (x4 , y) 1/4 1/4 0 1/2
P (x5 , y) 0 1/2 1/4 1/4

Note: Columns 3 and 5 can be easily computed using columns 2 and 4, respectively (together with the first table above), as P(x_i = 0, y = 1) = P(x_i = 0) − P(x_i = 0, y = 0) and P(x_i = 1, y = 1) = P(x_i = 1) − P(x_i = 1, y = 0).
179.

Using these probabilities and the given formula, the mutual information for
each feature is: M I(X1 , Y ) = 0, M I(X2 , Y ) = 0.01571, M I(X3 , Y ) = 0, M I(X4 , Y ) =
0.3113 and M I(X5 , Y ) = 0.3113.
Note: One could easily check, using the previous tables, that X1 is indepen-
dent of Y , and X3 is also independent of Y (so, we can determine in this way
too that M I(X1 , Y ) and M I(X3 , Y ) are 0).

b. In order to select a set of features, we can prioritize the ones with higher mutual information, because they are more strongly dependent on Y.
By looking at the results of the previous question, we can see that X5 and X4 are the features with the highest mutual information (0.3113), followed by X2 (0.0157) and, finally, X1 and X3, which have zero mutual information with Y.
By inspection of the data, we can see that if we select X5, X4 and X2, there are two samples of different classes with the same features (X2 = 1, X4 = 0, X5 = 1). To avoid this problem, we can add X1 as an extra feature.
Note: Although the mutual information of X1 with Y is zero, it does not mean that the combination of X1 with other features will also have zero mutual information [LC: w.r.t. Y].
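For reproducibility, here is a short sketch (our addition) that recomputes all five mutual-information values from the data table; the helper mi is ours.

import numpy as np

# Data table: columns X1..X5, last column Y
data = np.array([
    [0, 1, 1, 0, 1, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 1, 0, 1, 0, 1],
    [1, 1, 1, 1, 0, 1],
    [0, 1, 1, 0, 0, 1],
    [0, 0, 0, 1, 1, 1],
    [1, 0, 0, 1, 0, 1],
    [1, 1, 1, 0, 1, 1],
])
X, Y = data[:, :5], data[:, 5]

def mi(x, y):
    # empirical mutual information (in bits) of two binary variables
    total = 0.0
    for a in (0, 1):
        for b in (0, 1):
            pxy = np.mean((x == a) & (y == b))
            px, py = np.mean(x == a), np.mean(y == b)
            if pxy > 0:
                total += pxy * np.log2(pxy / (px * py))
    return total

for i in range(5):
    print(f"MI(X{i+1}, Y) = {mi(X[:, i], Y):.4f}")
# 0.0000, 0.0157, 0.0000, 0.3113, 0.3113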
180.

Derivation of entropy definition,


starting from a set of desirable properties
CMU, 2005 fall, T. Mitchell, A. Moore, HW1, pr. 2.2
181.
Remark: The definition we gave for entropy, −Σ_{i=1}^n p_i log p_i, is not very intuitive.

Theorem:
If ψ_n(p_1, . . . , p_n) satisfies the following axioms

A0. [LC:] ψ_n(p_1, . . . , p_n) ≥ 0 for any n ∈ N* and p_1, . . . , p_n, since we view ψ_n as a measure of disorder; also, ψ_1(1) = 0, because in this case there is no disorder;
A1. ψ_n should be continuous in p_i and symmetric in its arguments;
A2. if p_i = 1/n, then ψ_n should be a monotonically increasing function of n;
(If all events are equally likely, then having more events means being more uncertain.)
A3. if a choice among N events is broken down into successive choices, then the entropy should be the weighted sum of the entropy at each stage;

then ψ_n(p_1, . . . , p_n) = −K Σ_i p_i log p_i, where K is a positive constant.
(As we'll see, K depends however on ψ_s(1/s, . . . , 1/s) for a certain s ∈ N*.)

Remark: We will prove the theorem firstly for uniform distributions (p_i = 1/n) and secondly for the case p_i ∈ Q (only!).
182.
Example for the axiom A3:

[Figure: two encodings of a choice among a, b, c with probabilities 1/2, 1/3, 1/6. Encoding 1: a single choice among (a, b, c). Encoding 2: first choose between a and (b, c) with probabilities 1/2 and 1/2; then, within (b, c), choose b or c with probabilities 2/3 and 1/3.]

H(1/2, 1/3, 1/6) = (1/2) log 2 + (1/3) log 3 + (1/6) log 6 = (1/2 + 1/6) log 2 + (1/3 + 1/6) log 3 = 2/3 + (1/2) log 3

H(1/2, 1/2) + (1/2) H(2/3, 1/3) = 1 + (1/2) [(2/3) log(3/2) + (1/3) log 3] = 1 + (1/2) [log 3 − 2/3] = 2/3 + (1/2) log 3
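The equality of the two encodings is easy to verify numerically (our addition):

import numpy as np

def H(*p):
    p = np.array(p)
    return -np.sum(p * np.log2(p))

direct = H(1/2, 1/3, 1/6)                   # encoding 1: one choice among a, b, c
staged = H(1/2, 1/2) + 0.5 * H(2/3, 1/3)    # encoding 2: a vs (b, c), then b vs c
print(direct, staged)                       # both equal 2/3 + (1/2) log2 3 ≈ 1.4591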

The next 3 slides:

Case 1: p_i = 1/n for i = 1, . . . , n; proof steps:

183.
a. Denoting A(n) := ψ_n(1/n, 1/n, . . . , 1/n), show that

A(s^m) = m A(s) for any s, m ∈ N*.   (1)

b. If s, m ∈ N*, s ≠ 1, and t, n ∈ N* are such that s^m ≤ t^n ≤ s^{m+1}, then

|m/n − log t / log s| ≤ 1/n.   (2)

c. For s^m ≤ t^n ≤ s^{m+1} as above, due to A2 it follows (immediately) that

ψ_{s^m}(1/s^m, . . . , 1/s^m) ≤ ψ_{t^n}(1/t^n, . . . , 1/t^n) ≤ ψ_{s^{m+1}}(1/s^{m+1}, . . . , 1/s^{m+1}),

i.e., A(s^m) ≤ A(t^n) ≤ A(s^{m+1}). Show that

|m/n − A(t)/A(s)| ≤ 1/n for s ≠ 1.   (3)

d. Combining (2) + (3) immediately gives

|A(t)/A(s) − log t / log s| ≤ 2/n for s ≠ 1.   (4)

Show that this inequality implies

A(t) = K log t, with K > 0 (due to A2).   (5)
184.
Proof
a.

[Figure: a tree with m levels; at each level every node branches into s equally likely children (probability 1/s each), so that level m has s^m leaves, each with probability 1/s^m.]

Applying the axiom A3 on this encoding gives:

A(s^m) = A(s) + s · (1/s) A(s) + s² · (1/s²) A(s) + . . . + s^{m−1} · (1/s^{m−1}) A(s)
= A(s) + A(s) + A(s) + . . . + A(s) = m A(s)   (m times)
185.
Proof (cont'd)
b.

s^m ≤ t^n ≤ s^{m+1} ⇒ m log s ≤ n log t ≤ (m + 1) log s ⇒
m/n ≤ log t / log s ≤ m/n + 1/n ⇒ 0 ≤ log t / log s − m/n ≤ 1/n ⇒ |log t / log s − m/n| ≤ 1/n

c.

A(s^m) ≤ A(t^n) ≤ A(s^{m+1}) ⇒ m A(s) ≤ n A(t) ≤ (m + 1) A(s)   (by (1), with s ≠ 1) ⇒
m/n ≤ A(t)/A(s) ≤ m/n + 1/n ⇒ 0 ≤ A(t)/A(s) − m/n ≤ 1/n ⇒ |A(t)/A(s) − m/n| ≤ 1/n

d. Consider again s^m ≤ t^n ≤ s^{m+1}, with s and t fixed. If m → ∞ then n → ∞, and from |A(t)/A(s) − log t / log s| ≤ 2/n it follows that A(t)/A(s) − log t / log s → 0.

Therefore A(t)/A(s) − log t / log s = 0, and so A(t)/A(s) = log t / log s.

Finally, A(t) = (A(s)/log s) log t = K log t, where K = A(s)/log s > 0 (if s ≠ 1).
186.
Case 2: p_i ∈ Q for i = 1, . . . , n

Let's consider a set of N ≥ 2 equiprobable random events, and P = (S_1, S_2, . . . , S_k) a partition of this set. Let's denote p_i = |S_i|/N.

[Figure: a two-level tree; at level 1 the block S_i is chosen with probability |S_i|/N, and at level 2 each element of S_i is chosen with probability 1/|S_i|.]

A "natural" two-step encoding (as shown in the figure) leads to A(N) = ψ_k(p_1, . . . , p_k) + Σ_i p_i A(|S_i|), based on the axiom A3.
Finally, using the result A(t) = K log t gives:

K log N = ψ_k(p_1, . . . , p_k) + K Σ_i p_i log |S_i|
⇒ ψ_k(p_1, . . . , p_k) = K [log N − Σ_i p_i log |S_i|]
= K [log N Σ_i p_i − Σ_i p_i log |S_i|] = −K Σ_i p_i log (|S_i|/N) = −K Σ_i p_i log p_i
187.

Exemplifying
The gradient descent method:
finding the minimum of a second-degree (quadratic) real function

University of Utah, 2008 fall, Hal Daumé III, HW4, pr. 1


188.

Suppose we are trying to find the minimum of the function f(x) = 3x² − 2x + 1, for uni-variate (scalar) x.
First, verify that this function is convex (and therefore it has a global minimum).
Second, find the minimum of this function using calculus.
Finally, perform (by hand – show your work) three steps of gradient descent with η = 0.1 and the initial point x₀ = 1. How close does it get to the true solution?
189.

Solution

To study the convexity of the function f(x), we compute the second-order derivative:

f″(x) = 6 > 0, ∀x ∈ R.

It follows that the function f is convex on its entire domain of definition, so it has a (single) minimum point.
To compute the minimum of f(x) = 3x² − 2x + 1 we use the first-order derivative:

f′(x) = 6x − 2.

The minimum point is given by the solution of the equation f′(x) = 0, namely: x = 1/3 ≈ 0.33.

The three gradient descent steps, x_{k+1} = x_k − η f′(x_k) with η = 0.1 and x₀ = 1, are:
x₁ = 1 − 0.1 · 4 = 0.6,  x₂ = 0.6 − 0.1 · 1.6 = 0.44,  x₃ = 0.44 − 0.1 · 0.64 = 0.376,
so after three steps we are within |x₃ − 1/3| ≈ 0.043 of the true solution.

[Figure: the graph of f(x), with the iterates x₀ = 1, x₁, x₂, x₃ approaching the minimum at x = 1/3 ≈ 0.33.]
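The same three steps in code (our addition):

def f(x):  return 3 * x**2 - 2 * x + 1
def df(x): return 6 * x - 2

eta, x = 0.1, 1.0
for k in range(3):
    x = x - eta * df(x)
    print(k + 1, round(x, 4), round(f(x), 4))
# x1 = 0.6, x2 = 0.44, x3 = 0.376 -> approaching the minimum x* = 1/3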
190.

Observations

It is easy to see that:

1. the approach toward the optimum point is fast[er] as long as the value of the first derivative (i.e., the slope of the tangent to the function's graph) is large in absolute value;
2. if the learning rate η was fixed at too large a value, then it is possible that at some point we overshoot the optimum point and then "oscillate" around it. This weak point of the gradient method can be counteracted by dynamically reducing the value of η.
Otherwise, the gradient method has the advantage of being an optimization technique that is conceptually very simple and easy to implement.

Two other weak points of the gradient descent method are:
the impossibility of guaranteeing that the global optimum is found, and
the large number of iterations that must be executed on some real datasets.
