Foundations of Probability
and Information Theory
for Machine Learning
Random Variables
Some proofs
                 Z = 0              Z = 1
           X = 0    X = 1     X = 0    X = 1
   Y = 0    1/15     1/15      4/15     2/15
   Y = 1    1/10     1/10      8/45     4/45
For example,
P (X = 0, Y = 1, Z = 0) = 1/10 and P (X = 1, Y = 1, Z = 1) = 4/45.
a. Is X independent of Y ? Why or why not?
b. Is X conditionally independent of Y given Z? Why or why not?
c. Calculate P (X = 0 | X + Y > 0).
Answer
a. No. From the table,
$$P(X=0) = \frac{1}{15}+\frac{1}{10}+\frac{4}{15}+\frac{8}{45} = \frac{11}{18}$$
and
$$P(X=0\mid Y=0) = \frac{P(X=0, Y=0)}{P(Y=0)} = \frac{1/15+4/15}{8/15} = \frac{5}{8}.$$
Since $P(X=0)$ does not equal $P(X=0\mid Y=0)$, X is not independent of Y.
b. Yes. For all pairs $y, z \in \{0,1\}$, we need to check that $P(X=0\mid Y=y, Z=z) = P(X=0\mid Z=z)$. (That the probabilities for $X=1$ are also equal then follows from the law of total probability.)
$$P(X=0\mid Y=0, Z=0) = \frac{1/15}{1/15+1/15} = 1/2$$
$$P(X=0\mid Y=1, Z=0) = \frac{1/10}{1/10+1/10} = 1/2$$
$$P(X=0\mid Y=0, Z=1) = \frac{4/15}{4/15+2/15} = 2/3$$
$$P(X=0\mid Y=1, Z=1) = \frac{8/45}{8/45+4/45} = 2/3$$
and
$$P(X=0\mid Z=0) = \frac{1/15+1/10}{1/15+1/15+1/10+1/10} = 1/2$$
$$P(X=0\mid Z=1) = \frac{4/15+8/45}{4/15+2/15+8/45+4/45} = 2/3.$$
So X is conditionally independent of Y given Z.
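The checks in parts a and b can be verified mechanically with exact rational arithmetic. A minimal sketch: the `joint` table transcribes the problem's values, while the helper `p` is our own convenience for computing marginals.

```python
# Verifying parts a and b numerically, with exact rational arithmetic.
# The joint table transcribes the values from the problem statement.
from fractions import Fraction as F

# P(X=x, Y=y, Z=z)
joint = {
    (0, 0, 0): F(1, 15), (1, 0, 0): F(1, 15),
    (0, 1, 0): F(1, 10), (1, 1, 0): F(1, 10),
    (0, 0, 1): F(4, 15), (1, 0, 1): F(2, 15),
    (0, 1, 1): F(8, 45), (1, 1, 1): F(4, 45),
}

def p(**fixed):
    """Marginal probability of the assignments given in `fixed` (keys x, y, z)."""
    return sum(prob for (x, y, z), prob in joint.items()
               if all({'x': x, 'y': y, 'z': z}[k] == v for k, v in fixed.items()))

# a. X is NOT independent of Y: P(X=0) != P(X=0 | Y=0)
p_x0 = p(x=0)                            # 11/18
p_x0_given_y0 = p(x=0, y=0) / p(y=0)     # 5/8
x_indep_y = (p_x0 == p_x0_given_y0)      # False

# b. X IS conditionally independent of Y given Z
x_condindep_y_given_z = all(
    p(x=0, y=y, z=z) / p(y=y, z=z) == p(x=0, z=z) / p(z=z)
    for y in (0, 1) for z in (0, 1)
)                                        # True
```

Using `Fraction` rather than floats keeps every comparison exact, so equality tests are meaningful.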
For any two random variables X and Y with Var(X) ≠ 0 and Var(Y) ≠ 0, the correlation coefficient is defined as:
$$\rho(X,Y) = \frac{\mathrm{Cov}(X,Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$$
b. Show that if ρ(X, Y) = 1, then Y = aX + b, with $a = \sqrt{\mathrm{Var}(Y)/\mathrm{Var}(X)} > 0$. Similarly, if ρ(X, Y) = −1, then Y = aX + b, with $a = -\sqrt{\mathrm{Var}(Y)/\mathrm{Var}(X)} < 0$.
Remark: The correlation coefficient is therefore a "measure" of the degree of "linear dependence" between X and Y.
Hints
1. For part a, writing $\sigma_X = \sqrt{\mathrm{Var}(X)}$ and $\sigma_Y = \sqrt{\mathrm{Var}(Y)}$, to prove the inequality ρ(X, Y) ≥ −1 we suggest expanding the expression $\mathrm{Var}\!\left(\frac{X}{\sigma_X}+\frac{Y}{\sigma_Y}\right)$, using the following two properties, valid for any random variables X and Y:
$$\mathrm{Var}(X+Y) = \mathrm{Var}(X)+\mathrm{Var}(Y)+2\,\mathrm{Cov}(X,Y)$$
and
$$\mathrm{Cov}(aX, bY) = ab\,\mathrm{Cov}(X,Y), \text{ for any } a, b \in \mathbb{R}.$$
Then proceed similarly, expanding the expression $\mathrm{Var}\!\left(\frac{X}{\sigma_X}-\frac{Y}{\sigma_Y}\right)$, to prove the inequality ρ(X, Y) ≤ 1.
b. If ρ(X, Y) = −1, then from the first computation in part a it follows that $\mathrm{Var}\!\left(\frac{X}{\sigma_X}+\frac{Y}{\sigma_Y}\right) = 0$. We know this happens if and only if $\frac{X}{\sigma_X}+\frac{Y}{\sigma_Y}$ is a constant random variable; more precisely, there exists $a'\in\mathbb{R}$ such that $\frac{X}{\sigma_X}+\frac{Y}{\sigma_Y} = a'$ with probability 1. Therefore we can write $Y = a'\sigma_Y - \frac{\sigma_Y}{\sigma_X}X$. It follows that Y = aX + b, where $a = -\frac{\sigma_Y}{\sigma_X} < 0$ and $b = a'\sigma_Y$.
Similarly, if ρ(X, Y) = 1, then from the second computation in part a it follows that $\mathrm{Var}\!\left(\frac{X}{\sigma_X}-\frac{Y}{\sigma_Y}\right) = 0$, so there exists $a''\in\mathbb{R}$ such that $\frac{X}{\sigma_X}-\frac{Y}{\sigma_Y} = a''$ with probability 1. Hence $Y = -a''\sigma_Y + \frac{\sigma_Y}{\sigma_X}X$. Renaming constants, we obtain Y = aX + b, with $a = \frac{\sigma_Y}{\sigma_X} > 0$ and $b = -a''\sigma_Y$.
Probability Distributions
Some properties
Binomial distribution: $b(r; n, p) \overset{\text{def.}}{=} C_n^r\, p^r (1-p)^{n-r}$
$$E[b(r;n,p)] \overset{\text{def.}}{=} \sum_{r=0}^{n} r\cdot b(r;n,p)$$
$$= 1\cdot C_n^1\, p(1-p)^{n-1} + 2\cdot C_n^2\, p^2(1-p)^{n-2} + \dots + (n-1)\cdot C_n^{n-1}\, p^{n-1}(1-p) + n\cdot p^n$$
$$= np\left[C_{n-1}^0 (1-p)^{n-1} + C_{n-1}^1\, p(1-p)^{n-2} + \dots + C_{n-1}^{n-2}\, p^{n-2}(1-p) + C_{n-1}^{n-1}\, p^{n-1}\right] \quad (1)$$
$$= np\,[p+(1-p)]^{n-1} = np \quad (2)$$
Therefore, the expectation of the binomial distribution is $np$.
Suppose we have n bins and m balls. We throw balls into bins independently
at random, so that each ball is equally likely to fall into any of the bins.
a. What is the probability of the first ball falling into the first bin?
b. What is the expected number of balls in the first bin?
Hint (1): Define an indicator random variable representing whether the i-th
ball fell into the first bin:
$$X_i = \begin{cases} 1 & \text{if the } i\text{-th ball fell into the first bin;} \\ 0 & \text{otherwise.} \end{cases}$$
c. Let Y and $X_i$ be the same as defined in part b. For the first bin to be empty, none of the balls should fall into the first bin: Y = 0.
$$P(Y=0) = P(X_1=0,\dots,X_m=0) = \prod_{i=1}^m P(X_i=0) = (1-1/n)^m = \left(\frac{n-1}{n}\right)^m$$
d. For each one of the n bins, we define an indicator random variable $Y_j$ for the event "bin j is empty". Let Z be the random variable denoting the number of empty bins: $Z = Y_1 + \dots + Y_n$. Then the expected number of empty bins is:
$$E[Z] = E\Big[\sum_{j=1}^n Y_j\Big] = \sum_{j=1}^n E[Y_j] = \sum_{j=1}^n 1\cdot P(Y_j=1) = \sum_{j=1}^n \left(\frac{n-1}{n}\right)^m = n\left(\frac{n-1}{n}\right)^m$$
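A quick Monte Carlo sanity check of the formula $E[Z] = n\left(\frac{n-1}{n}\right)^m$; the values of n, m, and the trial count below are arbitrary illustrative choices.

```python
# Monte Carlo check of E[Z] = n((n-1)/n)^m for the balls-and-bins problem.
import random

def expected_empty_bins_mc(n, m, trials=20000, seed=0):
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        occupied = {rng.randrange(n) for _ in range(m)}  # throw m balls uniformly
        total += n - len(occupied)                       # count empty bins
    return total / trials

n, m = 10, 20
theory = n * ((n - 1) / n) ** m       # ≈ 1.2158 for n=10, m=20
estimate = expected_empty_bins_mc(n, m)
```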
[Figure: the standard normal density curve, with the approximate probability mass per band: 34% in each of (−σ, 0) and (0, σ), 14% in each of (−2σ, −σ) and (σ, 2σ), 2% in each of (−3σ, −2σ) and (2σ, 3σ), and 0.1% in each tail beyond ±3σ.]
Source: http://www.texample.net/tikz/examples/standard-deviation/
Note: Concerning the second property, it is enough to prove it for the standard case (µ = 0, σ = 1), because the non-standard case can be reduced to this one:
Using the variable transformation $v = \frac{x-\mu}{\sigma}$ will imply $x = \sigma v + \mu$ and $dx = \sigma\,dv$, so:
$$\int_{-\infty}^{\infty} N_{\mu,\sigma}(x)\,dx = \frac{1}{\sqrt{2\pi}\,\sigma}\int_{-\infty}^{\infty} e^{-\frac{(x-\mu)^2}{2\sigma^2}}\,dx = \frac{1}{\sqrt{2\pi}\,\sigma}\,\sigma\int_{-\infty}^{\infty} e^{-\frac{v^2}{2}}\,dv = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-\frac{v^2}{2}}\,dv = \int_{-\infty}^{\infty} N_{0,1}(v)\,dv$$
The standard case: proving that $N_{0,1}$ is indeed a p.d.f.:
$$\left(\int_{-\infty}^{\infty} e^{-\frac{v^2}{2}}\,dv\right)^2 = \int_{-\infty}^{\infty} e^{-\frac{x^2}{2}}\,dx \cdot \int_{-\infty}^{\infty} e^{-\frac{y^2}{2}}\,dy = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} e^{-\frac{x^2+y^2}{2}}\,dy\,dx = \iint_{\mathbb{R}^2} e^{-\frac{x^2+y^2}{2}}\,dy\,dx$$
$$= 2\pi\int_0^{\infty} r\,e^{-\frac{r^2}{2}}\,dr = 2\pi\left(-e^{-\frac{r^2}{2}}\right)\Big|_0^{\infty} = 2\pi(1-0) = 2\pi \;\Rightarrow\; \int_{-\infty}^{\infty} e^{-\frac{v^2}{2}}\,dv = \sqrt{2\pi} \;\Rightarrow\; \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty} e^{-\frac{v^2}{2}}\,dv = 1.$$
Note: $x = r\cos\theta$ and $y = r\sin\theta$, with $r \ge 0$ and $\theta\in[0,2\pi)$. Therefore $x^2+y^2 = r^2$, and the Jacobian determinant is
$$\frac{\partial(x,y)}{\partial(r,\theta)} = \begin{vmatrix} \frac{\partial x}{\partial r} & \frac{\partial x}{\partial\theta} \\ \frac{\partial y}{\partial r} & \frac{\partial y}{\partial\theta} \end{vmatrix} = \begin{vmatrix} \cos\theta & -r\sin\theta \\ \sin\theta & r\cos\theta \end{vmatrix} = r\cos^2\theta + r\sin^2\theta = r \ge 0.$$
So $dx\,dy = r\,dr\,d\theta$.
$$E[X^2] = \frac{\sigma}{\sqrt{2\pi}\,\sigma}\int_{-\infty}^{\infty}(\sigma^2 v^2 + 2\sigma\mu v + \mu^2)\,e^{-\frac{v^2}{2}}\,dv = \frac{1}{\sqrt{2\pi}}\left[\sigma^2\int_{-\infty}^{\infty} v^2 e^{-\frac{v^2}{2}}\,dv + 2\sigma\mu\int_{-\infty}^{\infty} v\,e^{-\frac{v^2}{2}}\,dv + \mu^2\int_{-\infty}^{\infty} e^{-\frac{v^2}{2}}\,dv\right]$$
Note that we have already computed $\int_{-\infty}^{\infty} v\,e^{-\frac{v^2}{2}}\,dv = 0$ and $\int_{-\infty}^{\infty} e^{-\frac{v^2}{2}}\,dv = \sqrt{2\pi}$.
Calculating the variance (cont'd)
Therefore, we only need to compute
$$\int_{-\infty}^{\infty} v^2 e^{-\frac{v^2}{2}}\,dv = \int_{-\infty}^{\infty} (-v)\left(e^{-\frac{v^2}{2}}\right)'\,dv = (-v)\,e^{-\frac{v^2}{2}}\Big|_{-\infty}^{\infty} - \int_{-\infty}^{\infty} (-1)\,e^{-\frac{v^2}{2}}\,dv = 0 + \int_{-\infty}^{\infty} e^{-\frac{v^2}{2}}\,dv = \sqrt{2\pi}.$$
So, $E[X^2] = \frac{1}{\sqrt{2\pi}}\left(\sigma^2\sqrt{2\pi} + 2\sigma\mu\cdot 0 + \mu^2\sqrt{2\pi}\right) = \sigma^2 + \mu^2$.
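The moment identities above can be checked by numerical quadrature; a sketch using a simple midpoint rule (the values µ = 1.5 and σ = 2.0 are arbitrary illustrations):

```python
# Checking E[X] = µ and E[X^2] = σ² + µ² for X ~ N(µ, σ²) by quadrature.
import math

def gauss_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def moment(k, mu, sigma, lo=-40.0, hi=40.0, steps=60001):
    # midpoint rule over a wide truncation of the real line
    h = (hi - lo) / steps
    return h * sum((lo + (i + 0.5) * h) ** k * gauss_pdf(lo + (i + 0.5) * h, mu, sigma)
                   for i in range(steps))

mu, sigma = 1.5, 2.0
m0 = moment(0, mu, sigma)   # ≈ 1: the density integrates to 1
m1 = moment(1, mu, sigma)   # ≈ µ
m2 = moment(2, mu, sigma)   # ≈ σ² + µ²
```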
$$\mathrm{Cov}(X)_{i,j} \overset{\text{def.}}{=} \mathrm{Cov}(X_i, X_j), \text{ for all } i, j \in \{1,\dots,n\}, \text{ and}$$
$$\mathrm{Cov}(X_i, X_j) \overset{\text{def.}}{=} E[(X_i-E[X_i])(X_j-E[X_j])] = E[(X_j-E[X_j])(X_i-E[X_i])] = \mathrm{Cov}(X_j, X_i),$$
therefore Cov(X) is a symmetric matrix.
Let's consider $X = [X_1 \dots X_d]^\top$, $\mu\in\mathbb{R}^d$ and $\Sigma\in S_d^+$, where $S_d^+$ is the set of symmetric positive definite matrices (which implies $|\Sigma|\ne 0$ and $(x-\mu)^\top\Sigma^{-1}(x-\mu) > 0$, therefore $-\frac{1}{2}(x-\mu)^\top\Sigma^{-1}(x-\mu) < 0$, for any $x\in\mathbb{R}^d$, $x\ne\mu$).
The probability density function of a multivariate Gaussian distribution of parameters µ and Σ is:
$$p(x;\mu,\Sigma) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu)^\top\Sigma^{-1}(x-\mu)\right).$$
Notation: $X \sim N(\mu, \Sigma)$.
Show that when the covariance matrix Σ is diagonal, the p.d.f. (probability density function) of the respective multivariate Gaussian is equal to the product of d independent univariate Gaussian densities.
We will make the proof for d = 2 (the generalization to d > 2 is easy). Note: it is easy to show that if $\Sigma\in S_d^+$ is diagonal, the elements on the principal diagonal of Σ are indeed strictly positive.
$$p(x;\mu,\Sigma) = \frac{1}{2\pi\begin{vmatrix}\sigma_1^2 & 0\\ 0 & \sigma_2^2\end{vmatrix}^{1/2}}\exp\left(-\frac{1}{2}\begin{pmatrix}x_1-\mu_1\\ x_2-\mu_2\end{pmatrix}^{\!\top}\begin{pmatrix}\sigma_1^2 & 0\\ 0 & \sigma_2^2\end{pmatrix}^{-1}\begin{pmatrix}x_1-\mu_1\\ x_2-\mu_2\end{pmatrix}\right)$$
$$= \frac{1}{2\pi\,\sigma_1\sigma_2}\exp\left(-\frac{1}{2}\begin{pmatrix}x_1-\mu_1\\ x_2-\mu_2\end{pmatrix}^{\!\top}\begin{pmatrix}\frac{1}{\sigma_1^2}(x_1-\mu_1)\\ \frac{1}{\sigma_2^2}(x_2-\mu_2)\end{pmatrix}\right)$$
$$= \frac{1}{2\pi\,\sigma_1\sigma_2}\exp\left(-\frac{1}{2\sigma_1^2}(x_1-\mu_1)^2 - \frac{1}{2\sigma_2^2}(x_2-\mu_2)^2\right)$$
$$= \frac{1}{\sqrt{2\pi}\,\sigma_1}\exp\left(-\frac{(x_1-\mu_1)^2}{2\sigma_1^2}\right)\cdot\frac{1}{\sqrt{2\pi}\,\sigma_2}\exp\left(-\frac{(x_2-\mu_2)^2}{2\sigma_2^2}\right).$$
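The factorization can also be confirmed numerically at a sample point; a sketch for d = 2, where all parameter and point values are arbitrary illustrations:

```python
# For diagonal Σ, the 2-D Gaussian density equals the product of two
# 1-D Gaussian densities. We check this at one (arbitrary) point.
import math

def gauss1d(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def gauss2d_diag(x1, x2, mu1, mu2, s1, s2):
    # p(x; µ, Σ) with Σ = diag(s1², s2²): det Σ = s1²·s2², Σ⁻¹ = diag(1/s1², 1/s2²)
    quad = (x1 - mu1) ** 2 / s1 ** 2 + (x2 - mu2) ** 2 / s2 ** 2
    return math.exp(-quad / 2) / (2 * math.pi * s1 * s2)

mu1, mu2, s1, s2 = 0.3, -1.2, 0.8, 2.5
x1, x2 = 0.5, 0.1
lhs = gauss2d_diag(x1, x2, mu1, mu2, s1, s2)
rhs = gauss1d(x1, mu1, s1) * gauss1d(x2, mu2, s2)
```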
Sebastian Ciobanu,
following Chuong Do (from Stanford University),
The Multivariate Gaussian Distribution, 2008
Let $X : \Omega \to \mathbb{R}^d$ be a random variable. In what follows, the elements of $\mathbb{R}^d$ are considered column vectors. We denote by $S_d^+$ the set of symmetric positive definite matrices of dimension d × d.
Definition: Let $A\in\mathbb{R}^{d\times d}$. An eigenvalue of the matrix A is a complex number $\lambda\in\mathbb{C}$ for which there exists a nonzero vector $x\in\mathbb{C}^d$ such that $Ax = \lambda x$. This vector x is called an eigenvector associated with the eigenvalue λ.
a. Prove that for any matrix $\Sigma\in S_d^+$ there exists a matrix $B\in\mathbb{R}^{d\times d}$ such that Σ can be factorized as $\Sigma = BB^\top$.
Remark: This factorization is not unique, i.e., there are several ways of writing the matrix Σ as $\Sigma = BB^\top$.
Hint: You may use the following properties:
i. Any symmetric matrix $A\in\mathbb{R}^{d\times d}$ can be written as $A = U\Lambda U^\top$, where $U\in\mathbb{R}^{d\times d}$ is an orthonormal matrix having the eigenvectors of A (which we require to have norm 1) as columns, and Λ is the diagonal matrix containing the eigenvalues of A in the order corresponding to the columns (i.e., the eigenvectors) of the matrix U. (The fact that U is an orthonormal matrix can be written formally as $U^\top U = UU^\top = I$.)
ii. Let $A\in\mathbb{R}^{d\times d}$ be a symmetric matrix. A is positive definite if and only if all its eigenvalues are (real and) positive.
iii. $(AB)^\top = B^\top A^\top$, for any matrices $A, B\in\mathbb{R}^{d\times d}$.
Solution
We know, by hypothesis, that $\Sigma\in S_d^+$, so it is symmetric. By property (a.i), we can write the following factorization for Σ:
$$\Sigma = U\Lambda U^\top,$$
where U is the orthonormal matrix having the (unit-norm) eigenvectors of Σ as columns, and $\Lambda\in\mathbb{R}^{d\times d}$ is the diagonal matrix containing the eigenvalues of Σ, in the order corresponding to the columns of the matrix U.
Since $\Sigma\in S_d^+$ is positive definite, by property (a.ii) all the eigenvalues of Σ are positive. Therefore, the matrix
$$\Lambda^{1/2} = \begin{pmatrix}\sqrt{\lambda_1} & \cdots & 0\\ \vdots & \ddots & \vdots\\ 0 & \cdots & \sqrt{\lambda_d}\end{pmatrix}\in\mathbb{R}^{d\times d}$$
exists.
Remarks:
(1) It is immediate that we can factorize / decompose the matrix Λ as $\Lambda = \Lambda^{1/2}\Lambda^{1/2}$. Consequently:
$$\Sigma = U\Lambda U^\top \overset{(3)}{=} U\Lambda^{1/2}\Lambda^{1/2}U^\top \overset{(4)}{=} U\Lambda^{1/2}(\Lambda^{1/2})^\top U^\top \overset{\text{assoc.}}{=} U\Lambda^{1/2}\big((\Lambda^{1/2})^\top U^\top\big) \overset{\text{(a.iii)}}{=} U\Lambda^{1/2}(U\Lambda^{1/2})^\top \overset{\text{not.}}{=} BB^\top, \text{ where } B = U\Lambda^{1/2}. \quad (5)$$
Remarks:
(3) Another way of factorizing / decomposing the matrix Σ as $\Sigma = BB^\top$ is given by the Cholesky factorization: if A is a symmetric positive definite matrix, then there exists a unique lower-triangular matrix $L\in\mathbb{R}^{d\times d}$ such that $A = LL^\top$. (See https://en.wikipedia.org/wiki/Positive-definite matrix.)
(4) The factorization (5) is actually valid for positive semidefinite matrices as well. However, the invertibility property (see part c) holds only for positive definite matrices.
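The construction $B = U\Lambda^{1/2}$ can be checked numerically with NumPy's symmetric eigendecomposition; the matrix Σ below is an arbitrary positive definite example.

```python
# Build B = U Λ^{1/2} from the eigendecomposition of a symmetric positive
# definite Σ, then verify Σ = B Bᵀ.
import numpy as np

sigma = np.array([[4.0, 1.0],
                  [1.0, 3.0]])           # symmetric, positive definite
lam, U = np.linalg.eigh(sigma)           # Σ = U diag(λ) Uᵀ, with λ > 0
B = U @ np.diag(np.sqrt(lam))            # B = U Λ^{1/2}
reconstructed = B @ B.T
```

For the alternative factor mentioned in Remark (3), `np.linalg.cholesky(sigma)` returns the unique lower-triangular L with $\Sigma = LL^\top$.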
$$\frac{x_1}{\|x_1\|_2} = \frac{(-u, u)}{\sqrt{2u^2}} = \frac{(-u, u)}{\sqrt{2}\,|u|} \quad (6)$$
Remark
In line with the Remark in the statement, if in relation (6) we take u < 0, we obtain yet another solution for the factorization of the matrix Σ:
$$\frac{x_1'}{\|x_1'\|_2} = \frac{(-u, u)}{\sqrt{2}\,|u|} = \left(\frac{1}{\sqrt{2}}, -\frac{1}{\sqrt{2}}\right) = -\frac{x_1}{\|x_1\|_2}$$
$$\frac{x_2'}{\|x_2'\|_2} = \frac{(u, u)}{\sqrt{2}\,|u|} = \left(-\frac{1}{\sqrt{2}}, -\frac{1}{\sqrt{2}}\right) = -\frac{x_2}{\|x_2\|_2},$$
hence
$$U' = -U, \quad B' = U'\Lambda^{1/2} = -U\Lambda^{1/2} = -B$$
and, consequently,
$$B'(B')^\top = (-B)(-B)^\top = BB^\top = \Sigma.$$
Solution
From part a it follows that for any matrix $\Sigma\in S_d^+$ there exists a decomposition / factorization of the form $\Sigma = BB^\top$, so
$$\det(\Sigma) = \det(BB^\top) \overset{\text{(c.ii)}}{=} \det(B)\det(B^\top) \overset{\text{(c.iii)}}{=} \det(B)^2.$$
Therefore,
$$\det(B) = \pm\sqrt{\det(\Sigma)}. \quad (7)$$
On the other hand, since Σ is a positive definite matrix, by property (c.iv) Σ is invertible. Further, by property (c.i), we have $\det(\Sigma)\ne 0$. Combining this with relation (7), it follows that $\det(B)\ne 0$ and consequently (again by property (c.i)) that the matrix B is invertible (which was to be proved).
Definition: We denote by $S_d^+$ the set of symmetric positive definite matrices of dimension d × d. We say that the vector random variable $X : \Omega\to\mathbb{R}^d$, represented as $X = (X_1 \dots X_d)^\top$, follows a multivariate Gaussian distribution with mean $\mu\in\mathbb{R}^d$ and covariance matrix $\Sigma\in S_d^+$ if its probability density function has the following analytic form:
$$p_X(x;\mu,\Sigma) = \frac{1}{(2\pi)^{d/2}\det(\Sigma)^{1/2}}\exp\left(-\frac{1}{2}(x-\mu)^\top\Sigma^{-1}(x-\mu)\right),$$
where the notation det(Σ) denotes the determinant of the matrix Σ, and exp(·) denotes the exponential function with base e. In short, we will denote this defining property of X by $X \sim N(\mu, \Sigma)$.
Remark (1): In problem ... we proved that for any symmetric positive definite matrix $\Sigma\in\mathbb{R}^{d\times d}$ there exists (though not necessarily in a unique way) a matrix $B\in\mathbb{R}^{d\times d}$ with the property that Σ can be "factorized" as $\Sigma = BB^\top$. Moreover, we proved that any matrix B satisfying this property is necessarily invertible.
a. Prove that if we define $Z = B^{-1}(X-\mu)$, then $Z \sim N(0, I_d)$, where 0 is the d-dimensional zero column vector and $I_d$ is the d-dimensional identity matrix.
Remark (2): This property is a generalization of the "standardization" method that we have already encountered in problem CMU, 2004 fall, T. Mitchell, Z. Bar-Joseph, HW1, pr. 3, applied in the case of univariate Gaussian distributions.
Hint (1): You may use the following properties:
i. Let $X = (X_1 \dots X_d)^\top\in\mathbb{R}^d$ be a vector random variable with joint distribution given by the density function $p_X : \mathbb{R}^d\to\mathbb{R}$. If $Z = H(X)\in\mathbb{R}^d$, where H is a bijective and differentiable function, then Z has the joint distribution given by the density function $p_Z : \mathbb{R}^d\to\mathbb{R}$, where
$$p_Z(z) = p_X(x)\cdot\left|\det\begin{pmatrix}\frac{\partial x_1}{\partial z_1} & \cdots & \frac{\partial x_1}{\partial z_d}\\ \vdots & \ddots & \vdots\\ \frac{\partial x_d}{\partial z_1} & \cdots & \frac{\partial x_d}{\partial z_d}\end{pmatrix}\right|$$
The matrix above is called the Jacobian matrix of x with respect to z and is denoted, in this case, by $\frac{\partial x}{\partial z}$.
ii. $\frac{\partial (Ax+b)}{\partial x} = A$, where $x, b\in\mathbb{R}^d$, $A\in\mathbb{R}^{d\times d}$, and A and b do not depend on x.
iii. $(AB)^{-1} = B^{-1}A^{-1}$, where $A, B\in\mathbb{R}^{d\times d}$.
iv. $(AB)^\top = B^\top A^\top$, where $A, B\in\mathbb{R}^{d\times d}$.
v. $\det(AB) = \det(A)\det(B)$, where $A, B\in\mathbb{R}^{d\times d}$.
vi. $\det(A) = \det(A^\top)$, where $A\in\mathbb{R}^{d\times d}$.
Solution
The following relations are immediate:
$$z = B^{-1}(x-\mu) \;\Rightarrow\; Bz = x-\mu \;\Rightarrow\; x = Bz+\mu \quad (8)$$
$$\frac{\partial x}{\partial z} = \frac{\partial (Bz+\mu)}{\partial z} \overset{\text{(a.ii)}}{=} B \quad (9)$$
$$\det(\Sigma)^{1/2} = \sqrt{\det(BB^\top)} \overset{\text{(a.v)}}{=} \sqrt{\det(B)\det(B^\top)} \overset{\text{(a.vi)}}{=} \sqrt{(\det(B))^2} = |\det(B)|. \quad (10)$$
We now rewrite the expression constituting the argument of the exp(·) function in the definition of the p.d.f. of the multivariate Gaussian distribution (see the statement), in terms of the vector z:
$$(x-\mu)^\top\Sigma^{-1}(x-\mu) \overset{(8)}{=} (Bz+\mu-\mu)^\top(BB^\top)^{-1}(Bz+\mu-\mu) = (Bz)^\top(BB^\top)^{-1}(Bz)$$
$$\overset{\text{(a.iii)}}{=} (Bz)^\top\big((B^\top)^{-1}B^{-1}\big)(Bz) \overset{\text{(a.iv)}}{=} (z^\top B^\top)\big((B^\top)^{-1}B^{-1}\big)(Bz) \overset{\text{assoc.}}{=} z^\top\big(B^\top(B^\top)^{-1}\big)\big(B^{-1}B\big)z = z^\top I_d I_d z = z^\top z. \quad (11)$$
Applying now property (a.i), we can compute the p.d.f. of the Gaussian distribution associated with the variable Z:
$$p_Z(z) = p_X(x)\cdot\left|\det\frac{\partial x}{\partial z}\right| = \frac{1}{(2\pi)^{d/2}\det(\Sigma)^{1/2}}\exp\left(-\frac{1}{2}(x-\mu)^\top\Sigma^{-1}(x-\mu)\right)\left|\det\frac{\partial x}{\partial z}\right|$$
$$\overset{(9),(10),(11)}{=} \frac{1}{(2\pi)^{d/2}\,|\det(B)|}\exp\left(-\frac{1}{2}z^\top z\right)|\det(B)| = \frac{1}{(2\pi)^{d/2}}\exp\left(-\frac{1}{2}z^\top z\right) = \frac{1}{(2\pi)^{d/2}(\det(I_d))^{1/2}}\exp\left(-\frac{1}{2}z^\top I_d z\right) = N(z; 0, I_d).$$
b. Show that the function $p_X(x;\mu,\Sigma)$ given in the statement is indeed a probability density function (p.d.f.).
Hint (2): You may use the following properties:
i. For the case d = 1, the function $p_X(x;\mu,\sigma^2)$ is a probability density function.
ii. In the case where the matrix Σ is diagonal,
$$\Sigma = \begin{pmatrix}\sigma_1^2 & \cdots & 0\\ \vdots & \ddots & \vdots\\ 0 & \cdots & \sigma_d^2\end{pmatrix}, \quad\text{and } \mu = \begin{pmatrix}\mu_1\\ \vdots\\ \mu_d\end{pmatrix},$$
So the second condition is also satisfied, which means that the function $p_X(x;\mu,\Sigma)$ is indeed a probability density function (p.d.f.).
$$\Sigma = \begin{pmatrix}\sigma_1^2 & \rho\sigma_1\sigma_2\\ \rho\sigma_1\sigma_2 & \sigma_2^2\end{pmatrix}. \quad (12)$$
where
$$p_{X_1,X_2}(x_1,x_2) = \frac{1}{(\sqrt{2\pi})^2\sqrt{|\Sigma|}}\exp\left(-\frac{1}{2}(x-\mu)^\top\Sigma^{-1}(x-\mu)\right) \quad\text{and}$$
$$p_{X_1}(x_1) = \frac{1}{\sqrt{2\pi}\,\sigma_1}\exp\left(-\frac{1}{2\sigma_1^2}(x_1-\mu_1)^2\right). \quad (14)$$
From (12) it follows that $|\Sigma| = \sigma_1^2\sigma_2^2(1-\rho^2)$. In order for $\sqrt{|\Sigma|}$ and $\Sigma^{-1}$ to be defined, it follows that $\rho\in(-1,1)$. Moreover, since $\sigma_1, \sigma_2 > 0$, we will have $\sqrt{|\Sigma|} = \sigma_1\sigma_2\sqrt{1-\rho^2}$.
$$\Sigma^{-1} = \frac{1}{|\Sigma|}\Sigma^{*} = \frac{1}{\sigma_1^2\sigma_2^2(1-\rho^2)}\begin{pmatrix}\sigma_2^2 & -\rho\sigma_1\sigma_2\\ -\rho\sigma_1\sigma_2 & \sigma_1^2\end{pmatrix} = \frac{1}{1-\rho^2}\begin{pmatrix}\frac{1}{\sigma_1^2} & -\frac{\rho}{\sigma_1\sigma_2}\\ -\frac{\rho}{\sigma_1\sigma_2} & \frac{1}{\sigma_2^2}\end{pmatrix}$$
Therefore,
$$X_2\mid X_1 = x_1 \;\sim\; N(\mu_{2|1},\,\sigma_{2|1}^2), \quad\text{with } \mu_{2|1} \overset{\text{not.}}{=} \mu_2 + \rho\,\frac{\sigma_2}{\sigma_1}(x_1-\mu_1) \text{ and } \sigma_{2|1}^2 \overset{\text{not.}}{=} \sigma_2^2(1-\rho^2).$$
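A Monte Carlo sketch of this conditional distribution; the parameter values and the width of the conditioning window are arbitrary choices.

```python
# Check X2 | X1 = x1 ~ N(µ2 + ρ(σ2/σ1)(x1 − µ1), σ2²(1 − ρ²)) by sampling
# a bivariate Gaussian and conditioning on a narrow window around x1.
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
s1, s2, rho = 1.5, 0.7, 0.6
cov = np.array([[s1**2,       rho*s1*s2],
                [rho*s1*s2,   s2**2   ]])

samples = rng.multivariate_normal(mu, cov, size=400_000)
x1_target = 2.0
mask = np.abs(samples[:, 0] - x1_target) < 0.05   # condition on X1 ≈ x1
x2_cond = samples[mask, 1]

mean_theory = mu[1] + rho * (s2 / s1) * (x1_target - mu[0])   # -1.72
var_theory = s2**2 * (1 - rho**2)                             # 0.3136
```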
Notations:
Let $X_i$, $i = 1,\dots,n = 100$ be defined as: $X_i = 1$ if email i was incorrectly classified, and 0 otherwise;
$$E[X_i] \overset{\text{not.}}{=} \mu \overset{\text{not.}}{=} e_{\text{real}}; \quad \mathrm{Var}(X_i) \overset{\text{not.}}{=} \sigma^2$$
$$e_{\text{sample}} \overset{\text{not.}}{=} \frac{X_1+\dots+X_n}{n} = 0.17$$
$$Z_n = \frac{X_1+\dots+X_n-n\mu}{\sqrt{n}\,\sigma} \quad\text{(the standardized form of } X_1+\dots+X_n\text{)}$$
Key insight:
Calculating the real error of the classifier (more exactly, a symmetric interval around the real error $p \overset{\text{not.}}{=} \mu$) with a "confidence" of 95% amounts to finding a > 0 such that $P(|Z_n|\le a)\ge 0.95$.
Calculus:
$$|Z_n|\le a \iff \left|\frac{X_1+\dots+X_n-n\mu}{\sqrt{n}\,\sigma}\right|\le a \iff \left|\frac{X_1+\dots+X_n-n\mu}{n}\right|\le\frac{a\sigma}{\sqrt{n}} \iff \left|\frac{X_1+\dots+X_n}{n}-\mu\right|\le\frac{a\sigma}{\sqrt{n}}$$
$$\iff |e_{\text{sample}}-e_{\text{real}}|\le\frac{a\sigma}{\sqrt{n}} \iff |e_{\text{real}}-e_{\text{sample}}|\le\frac{a\sigma}{\sqrt{n}} \iff -\frac{a\sigma}{\sqrt{n}}\le e_{\text{real}}-e_{\text{sample}}\le\frac{a\sigma}{\sqrt{n}}$$
$$\iff e_{\text{sample}}-\frac{a\sigma}{\sqrt{n}}\le e_{\text{real}}\le e_{\text{sample}}+\frac{a\sigma}{\sqrt{n}} \iff e_{\text{real}}\in\left[e_{\text{sample}}-\frac{a\sigma}{\sqrt{n}},\; e_{\text{sample}}+\frac{a\sigma}{\sqrt{n}}\right]$$
Important facts:
The Central Limit Theorem: $Z_n \to N(0, 1)$
Therefore, $P(|Z_n|\le a) \approx P(|X|\le a) = \Phi(a)-\Phi(-a)$, where $X\sim N(0,1)$ and Φ is the cumulative distribution function of N(0, 1).
Calculus:
$$\Phi(-a)+\Phi(a) = 1 \;\Rightarrow\; P(|Z_n|\le a) = \Phi(a)-\Phi(-a) = 2\Phi(a)-1$$
$$P(|Z_n|\le a) = 0.95 \iff 2\Phi(a)-1 = 0.95 \iff \Phi(a) = 0.975 \iff a \cong 1.96 \text{ (see the } \Phi \text{ table)}$$
$$\sigma^2 \overset{\text{not.}}{=} \mathrm{Var}_{\text{real}} = e_{\text{real}}(1-e_{\text{real}}), \text{ because the } X_i \text{ are Bernoulli variables.}$$
Furthermore, we can approximate $e_{\text{real}}$ with $e_{\text{sample}}$, because
$$E[e_{\text{sample}}] = e_{\text{real}} \quad\text{and}\quad \mathrm{Var}_{\text{sample}} = \frac{1}{n}\mathrm{Var}_{\text{real}} \to 0 \text{ for } n\to+\infty,$$
cf. CMU, 2011 fall, T. Mitchell, A. Singh, HW2, pr. 1.ab.
Finally:
$$\frac{a\sigma}{\sqrt{n}} \approx 1.96\cdot\frac{\sqrt{0.17(1-0.17)}}{\sqrt{100}} \cong 0.07$$
$$|e_{\text{real}}-e_{\text{sample}}|\le 0.07 \iff |e_{\text{real}}-0.17|\le 0.07 \iff -0.07\le e_{\text{real}}-0.17\le 0.07 \iff e_{\text{real}}\in[0.10, 0.24]$$
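The interval computation can be reproduced in a few lines; 1.96 is the 97.5% quantile of the standard normal, and the numbers follow the example above.

```python
# 95% confidence interval for the real error, from the sample error.
import math

n = 100
e_sample = 0.17
z = 1.96                                            # Φ(z) = 0.975
sigma_hat = math.sqrt(e_sample * (1 - e_sample))    # Bernoulli std, using e_sample
half_width = z * sigma_hat / math.sqrt(n)           # ≈ 0.074, rounded to 0.07 above
interval = (e_sample - half_width, e_sample + half_width)
```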
Exemplifying
a mixture of categorical distributions;
how to compute its expectation and variance
CMU, 2010 fall, Aarti Singh, HW1, pr. 2.2.1-2
Suppose that I have two six-sided dice; one is fair and the other one is loaded, having:
$$P(x) = \begin{cases}\dfrac{1}{2} & x = 6\\[4pt] \dfrac{1}{10} & x\in\{1,2,3,4,5\}\end{cases}$$
I will toss a coin to decide which die to roll. If the coin flip is heads I will roll the fair die, otherwise the loaded one. The probability that the coin flip is heads is $p\in(0,1)$.
a. What is the expectation of the die roll (in terms of p)?
b. What is the variance of the die roll (in terms of p)?
Solution:
a.
$$E[X] = \sum_{i=1}^{6} i\,[P(i\mid\text{fair})\,p + P(i\mid\text{loaded})\,(1-p)] = \left[\sum_{i=1}^{6} i\,P(i\mid\text{fair})\right]p + \left[\sum_{i=1}^{6} i\,P(i\mid\text{loaded})\right](1-p) = \frac{7}{2}p + \frac{9}{2}(1-p) = \frac{9}{2}-p$$
b. Recall that we may write $\mathrm{Var}(X) = E[X^2]-(E[X])^2$, therefore:
$$E[X^2] = \sum_{i=1}^{6} i^2\,[P(i\mid\text{fair})\,p + P(i\mid\text{loaded})\,(1-p)] = \left[\sum_{i=1}^{6} i^2\,P(i\mid\text{fair})\right]p + \left[\sum_{i=1}^{6} i^2\,P(i\mid\text{loaded})\right](1-p)$$
$$= \frac{91}{6}p + \left(\frac{36}{2}+\frac{55}{10}\right)(1-p) = \frac{47}{2}-\frac{25}{3}p$$
Combining this with the result of the previous question yields:
$$\mathrm{Var}(X) = E[X^2]-(E[X])^2 = \frac{141}{6}-\frac{50}{6}p-\left(\frac{9}{2}-p\right)^2 = \frac{141}{6}-\frac{50}{6}p-\left(\frac{81}{4}-9p+p^2\right)$$
$$= \left(\frac{141}{6}-\frac{81}{4}\right)+\left(9-\frac{50}{6}\right)p-p^2 = \frac{13}{4}+\frac{2}{3}p-p^2$$
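Both closed forms can be checked directly from the mixture pmf; the value p = 0.3 is an arbitrary choice.

```python
# Verify E[X] = 9/2 - p and Var(X) = 13/4 + (2/3)p - p² for the die mixture.
p = 0.3
fair = {i: 1/6 for i in range(1, 7)}
loaded = {i: (1/2 if i == 6 else 1/10) for i in range(1, 7)}
mix = {i: p * fair[i] + (1 - p) * loaded[i] for i in range(1, 7)}

mean = sum(i * q for i, q in mix.items())
second = sum(i * i * q for i, q in mix.items())
var = second - mean ** 2

mean_formula = 9/2 - p
var_formula = 13/4 + (2/3) * p - p ** 2
```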
Solution:
Since the $X_i$ are independent, we have
$$L(\hat\theta) = P_{\hat\theta}(X_1,\dots,X_n) = \prod_{i=1}^{n} P_{\hat\theta}(X_i) = \prod_{i=1}^{n}\hat\theta^{X_i}(1-\hat\theta)^{1-X_i}$$
Solution:
[Figure: plots of the likelihood function over $\theta\in[0,1]$. Left: MLE; n = 5, three 1s, two 0s (vertical scale up to 0.04). Right: MLE; n = 100, sixty 1s, forty 0s (vertical scale up to $7\times 10^{-30}$).]
[Figure: likelihood plot for the last sample, peaked near θ = 0.5.]
The MLE is equal to the proportion of 1s observed in the data, so for the first three plots the MLE is always at 0.6, while for the last plot it is at 0.5.
As the number of samples n increases, the likelihood function gets more peaked at its maximum value, and the values it takes on decrease.
Reminder (cont'd)
Using Bayes' rule, we can rewrite the posterior probability as follows:
$$P(\hat\theta\mid X_1,\dots,X_n) = \frac{P(X_1,\dots,X_n\mid\hat\theta)\,p(\hat\theta)}{P(X_1,\dots,X_n)}$$
Since the probability in the denominator does not depend on θ̂, the MAP estimate is given by
$$\hat\theta_{\text{MAP}} = \operatorname*{argmax}_{\hat\theta}\; P(X_1,\dots,X_n\mid\hat\theta)\,p(\hat\theta).$$
In words, the MAP estimate for θ is the value θ̂ that maximizes the likelihood function multiplied by the prior distribution on θ.
We will consider a Beta(3, 3) prior distribution for θ, which has the density function given by
$$p(\hat\theta) = \frac{\hat\theta^2(1-\hat\theta)^2}{B(3,3)},$$
where B(α, β) is the beta function and B(3, 3) ≈ 0.0333.
Suppose X is a binary random variable that takes value 0 with probability p and value 1 with probability 1 − p. Let $X_1,\dots,X_n$ be i.i.d. samples of X.
a. Compute an MLE estimate of p (denote it by p̂).
Answer:
By way of definition (with k denoting the number of 0s among $X_1,\dots,X_n$):
$$\hat p = \operatorname*{argmax}_p\, P(X_1,\dots,X_n\mid p) \overset{\text{i.i.d.}}{=} \operatorname*{argmax}_p\,\prod_{i=1}^n P(X_i\mid p) = \operatorname*{argmax}_p\, p^k(1-p)^{n-k}$$
Taking the logarithm and differentiating:
$$\frac{\partial}{\partial p}\big(k\ln p + (n-k)\ln(1-p)\big) = \frac{k}{p}-\frac{n-k}{1-p}.$$
Hence,
$$\frac{\partial}{\partial p}\big(k\ln p + (n-k)\ln(1-p)\big) = 0 \iff \frac{k}{p} = \frac{n-k}{1-p} \iff \hat p = \frac{k}{n}.$$
Answer:
Since k can be seen as a sum of n [independent] Bernoulli variables of parameter p, we can write:
$$E[\hat p] = E\left[\frac{k}{n}\right] = \frac{1}{n}E[k] = \frac{1}{n}E\left[n-\sum_{i=1}^n X_i\right] = \frac{1}{n}\left(n-\sum_{i=1}^n E[X_i]\right) = \frac{1}{n}\left(n-\sum_{i=1}^n(1-p)\right) = \frac{1}{n}\big(n-n(1-p)\big) = \frac{1}{n}np = p.$$
Answer:
Consider another estimator, p̃ = 1/2. Then $E[(\tilde p-p)^2] = (1/2-p)^2$.
For p = 1/2 we have $E[(\tilde p-p)^2] = 0 < E[(\hat p-p)^2] = 1/12$.
We now need to show that $E[(\tilde p-p)^2]\le E[(\hat p-p)^2]$ over $p\in[1/4, 3/4]$.
$$E[(\tilde p-p)^2]-E[(\hat p-p)^2] = \left(\frac{1}{2}-p\right)^2-\frac{1}{3}\,p(1-p) = \frac{1}{4}-\frac{4}{3}p+\frac{4}{3}p^2.$$
This is an upward parabola, so we need to show that it lies below or equal to zero for $p\in[1/4, 3/4]$. It is equivalent to showing that it is below or equal to 0 at the boundary points.
In fact it is: $\frac{1}{4}-\frac{4}{3}p+\frac{4}{3}p^2 = 0$ for both p = 1/4 and p = 3/4.
$$\frac{\partial\ln L(\theta)}{\partial\theta_j} = 0 \iff -\frac{n_k}{1-\hat\theta_j-\sum_{i\ne j,k}^{k-1}\theta_i}+\frac{n_j}{\hat\theta_j} = 0 \iff \frac{n_j}{\hat\theta_j} = \frac{n_k}{1-\hat\theta_j-\sum_{i\ne j,k}^{k-1}\theta_i}$$
$$\iff n_j\Big(1-\sum_{i\ne j,k}^{k-1}\theta_i\Big) = (n_k+n_j)\hat\theta_j \iff \hat\theta_j = \frac{n_j}{n_j+n_k}\Big(1-\sum_{i\ne j,k}^{k-1}\theta_i\Big) \quad\text{for all } j\in\{1,\dots,k-1\}.$$
Answer
As the likelihood function is uniquely optimal for the vector θ, the last equation in part b can be written as:
$$\hat\theta_j = \frac{n_j}{n_j+n_k}\Big(1-\sum_{i\ne j,k}^{k-1}\hat\theta_i\Big) \iff \hat\theta_j = \frac{n_j}{n_j+n_k}(\hat\theta_j+\hat\theta_k) \quad\Big(\text{because } \hat\theta_k = 1-\sum_{i=1}^{k-1}\hat\theta_i\Big)$$
$$\iff \hat\theta_j\Big(1-\frac{n_j}{n_j+n_k}\Big) = \frac{n_j}{n_j+n_k}\hat\theta_k \iff \hat\theta_j\,\frac{n_k}{n_j+n_k} = \frac{n_j}{n_j+n_k}\hat\theta_k \iff \hat\theta_j n_k = n_j\hat\theta_k \iff \hat\theta_j = \frac{n_j}{n_k}\hat\theta_k \quad\text{for all } j\in\{1,\dots,k-1\}.$$
Finally,
$$\hat\theta_k = 1-\hat\theta_1-\dots-\hat\theta_{k-1} = 1-\frac{n_1}{n_k}\hat\theta_k-\dots-\frac{n_{k-1}}{n_k}\hat\theta_k$$
$$\Rightarrow n_k\hat\theta_k = n_k-(n_1+\dots+n_{k-1})\hat\theta_k \;\Rightarrow\; \hat\theta_k\underbrace{(n_1+\dots+n_{k-1}+n_k)}_{n} = n_k \;\Rightarrow\; \hat\theta_k = \frac{n_k}{n}$$
$$\Rightarrow \hat\theta_j = \frac{n_j}{n_k}\cdot\frac{n_k}{n} = \frac{n_j}{n} \quad\text{for all } j\in\{1,\dots,k-1\}.$$
Note: Even though [here] we can go from the non-hatted ($\theta_i$) to the hatted form ($\hat\theta_i$) of the equation in the first step of c, this will generally not be possible. To solve for a maximum likelihood criterion under additional constraints like the thetas summing to one, a generic and useful method is the method of Lagrange multipliers.
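A small numeric sanity check that $\hat\theta_j = n_j/n$ indeed maximizes the multinomial log-likelihood, comparing the closed form against a brute-force grid search over the 2-simplex for k = 3 (the counts are an arbitrary example):

```python
# The closed-form multinomial MLE is theta_j = n_j / n; confirm it against
# a brute-force search over a grid on the probability simplex (k = 3).
import itertools
import math

counts = (5, 3, 2)                        # n_1, n_2, n_3
n = sum(counts)
theta_hat = tuple(c / n for c in counts)  # (0.5, 0.3, 0.2)

def log_lik(theta):
    return sum(c * math.log(t) for c, t in zip(counts, theta))

best = max(
    ((a / 100, b / 100, 1 - a / 100 - b / 100)
     for a, b in itertools.product(range(1, 100), repeat=2)
     if a + b < 100),
    key=log_lik,
)
```

Since the log-likelihood is strictly concave on the simplex and the MLE lies on the grid, the grid search recovers exactly the closed-form solution.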
Solution:
The sample $x_1,\dots,x_n$ can be seen as the realization of n independent random variables $X_1,\dots,X_n$ of Gaussian distribution of mean µ and variance σ². Then, due to the linearity of expectation, we get:
$$E[\mu_{\text{MLE}}] = E\left[\frac{X_1+\dots+X_n}{n}\right] = \frac{E[X_1]+\dots+E[X_n]}{n} = \frac{n\mu}{n} = \mu$$
Therefore, the $\mu_{\text{MLE}}$ estimator is unbiased.
c. What is $\mathrm{Var}[\mu_{\text{MLE}}]$?
Solution:
$$\mathrm{Var}[\mu_{\text{MLE}}] = \mathrm{Var}\left[\frac{1}{n}\sum_{i=1}^n X_i\right] \overset{(\star)}{=} \frac{1}{n^2}\sum_{i=1}^n\mathrm{Var}[X_i] \overset{\text{i.i.d.}}{=} n\cdot\frac{1}{n^2}\mathrm{Var}[X_1] = \frac{\sigma^2}{n}$$
d. Now derive the MAP estimator for the mean µ. Assume that
the prior distribution for the mean is itself a normal distribution
with mean ν and variance β 2 .
Solution 1:
$$P(\mu\mid x_1,\dots,x_n) \overset{\text{T. Bayes}}{=} \frac{P(x_1,\dots,x_n\mid\mu)\,P(\mu)}{P(x_1,\dots,x_n)} \quad (16)$$
$$= \frac{\prod_{i=1}^n\frac{1}{\sqrt{2\pi}\,\sigma}e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}\cdot\frac{1}{\sqrt{2\pi}\,\beta}e^{-\frac{(\mu-\nu)^2}{2\beta^2}}}{C} \quad (17)$$
where $C \overset{\text{not.}}{=} P(x_1,\dots,x_n)$.
$$\Rightarrow \ln P(\mu\mid x_1,\dots,x_n) = \sum_{i=1}^n\left(-\ln(\sqrt{2\pi}\,\sigma)-\frac{(x_i-\mu)^2}{2\sigma^2}\right)-\ln(\sqrt{2\pi}\,\beta)-\frac{(\mu-\nu)^2}{2\beta^2}-\ln C$$
$$\Rightarrow \frac{\partial}{\partial\mu}\ln P(\mu\mid x_1,\dots,x_n) = \sum_{i=1}^n\frac{x_i-\mu}{\sigma^2}-\frac{\mu-\nu}{\beta^2}$$
$$\frac{\partial}{\partial\mu}\ln P(\mu\mid x_1,\dots,x_n) = 0 \iff \sum_{i=1}^n\frac{x_i-\mu}{\sigma^2} = \frac{\mu-\nu}{\beta^2} \iff \mu\left(\frac{1}{\beta^2}+\frac{n}{\sigma^2}\right) = \frac{\sum_{i=1}^n x_i}{\sigma^2}+\frac{\nu}{\beta^2}$$
$$\Rightarrow \mu_{\text{MAP}} = \frac{\sigma^2\nu+\beta^2\sum_{i=1}^n x_i}{\sigma^2+n\beta^2}$$
Solution 2:
Instead of computing the derivative of the posterior distribution $P(\mu\mid x_1,\dots,x_n)$, we will first show that the right-hand side of (17) is itself a Gaussian, and then we will use the fact that the mean of a Gaussian is where it achieves its maximum value.
$$P(\mu\mid x_1,\dots,x_n) = \frac{1}{C}\prod_{i=1}^n\frac{1}{\sqrt{2\pi}\,\sigma}e^{-\frac{(x_i-\mu)^2}{2\sigma^2}}\cdot\frac{1}{\sqrt{2\pi}\,\beta}e^{-\frac{(\mu-\nu)^2}{2\beta^2}} = \text{const}\cdot e^{-\frac{\sum_{i=1}^n(x_i-\mu)^2}{2\sigma^2}-\frac{(\mu-\nu)^2}{2\beta^2}} = \text{const}\cdot e^{-\frac{\beta^2\sum_{i=1}^n(x_i-\mu)^2+\sigma^2(\mu-\nu)^2}{2\sigma^2\beta^2}}$$
$$= \text{const}\cdot e^{-\frac{n\beta^2+\sigma^2}{2\sigma^2\beta^2}\,\mu^2+\frac{\beta^2\sum_{i=1}^n x_i+\nu\sigma^2}{\sigma^2\beta^2}\,\mu-\frac{\beta^2\sum_{i=1}^n x_i^2+\nu^2\sigma^2}{2\sigma^2\beta^2}}$$
$$= \text{const}\cdot\exp\left(-\frac{\mu^2-2\,\frac{\beta^2\sum_{i=1}^n x_i+\nu\sigma^2}{n\beta^2+\sigma^2}\,\mu+\frac{\beta^2\sum_{i=1}^n x_i^2+\nu^2\sigma^2}{n\beta^2+\sigma^2}}{\frac{2\sigma^2\beta^2}{n\beta^2+\sigma^2}}\right)$$
$$= \text{const}\cdot\exp\left(-\frac{\left(\mu-\frac{\beta^2\sum_{i=1}^n x_i+\nu\sigma^2}{n\beta^2+\sigma^2}\right)^2-\left(\frac{\beta^2\sum_{i=1}^n x_i+\nu\sigma^2}{n\beta^2+\sigma^2}\right)^2+\frac{\beta^2\sum_{i=1}^n x_i^2+\nu^2\sigma^2}{n\beta^2+\sigma^2}}{\frac{2\sigma^2\beta^2}{n\beta^2+\sigma^2}}\right)$$
$$= \text{const}'\cdot\exp\left(-\frac{\left(\mu-\frac{\beta^2\sum_{i=1}^n x_i+\nu\sigma^2}{n\beta^2+\sigma^2}\right)^2}{\frac{2\sigma^2\beta^2}{n\beta^2+\sigma^2}}\right)$$
This is, up to normalization, a Gaussian density in µ, so its maximum is obtained for
$$\mu = \frac{\beta^2\sum_{i=1}^n x_i+\nu\sigma^2}{n\beta^2+\sigma^2} = \mu_{\text{MAP}}.$$
Solution:
$$\mu_{\text{MLE}} = \frac{\sum_{i=1}^n x_i}{n}$$
$$\mu_{\text{MAP}} = \frac{\sigma^2\nu+\beta^2\sum_{i=1}^n x_i}{\sigma^2+n\beta^2} = \frac{\sigma^2\nu}{\sigma^2+n\beta^2}+\frac{\beta^2\sum_{i=1}^n x_i}{\sigma^2+n\beta^2} = \frac{\sigma^2\nu}{\sigma^2+n\beta^2}+\frac{\frac{1}{n}\sum_{i=1}^n x_i}{1+\frac{\sigma^2}{n\beta^2}} = \frac{\sigma^2\nu}{\sigma^2+n\beta^2}+\frac{\mu_{\text{MLE}}}{1+\frac{\sigma^2}{n\beta^2}}$$
$$n\to\infty \;\Rightarrow\; \frac{\sigma^2\nu}{\sigma^2+n\beta^2}\to 0 \text{ and } \frac{\sigma^2}{n\beta^2}\to 0 \;\Rightarrow\; \mu_{\text{MAP}}\to\mu_{\text{MLE}}$$
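The convergence $\mu_{\text{MAP}}\to\mu_{\text{MLE}}$ can be observed numerically using the closed forms above; the data-generating values and the deliberately mismatched prior below are arbitrary choices.

```python
# Watch the MAP estimate converge to the MLE as n grows.
import numpy as np

rng = np.random.default_rng(42)
mu_true, sigma = 2.0, 1.0
nu, beta = -5.0, 0.5          # prior N(ν, β²), deliberately far from µ_true

gaps = []
for n in (10, 1000, 100_000):
    x = rng.normal(mu_true, sigma, size=n)
    mu_mle = x.mean()
    mu_map = (sigma**2 * nu + beta**2 * x.sum()) / (sigma**2 + n * beta**2)
    gaps.append(abs(mu_map - mu_mle))   # shrinks roughly like 1/n
```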
b. Show that $E[\sigma^2_{\text{MLE}}] = \frac{n-1}{n}\sigma^2$.
Solution:
$$E[\sigma^2_{\text{MLE}}] = E\left[\frac{1}{n}\sum_{i=1}^n(x_i-\mu_{\text{MLE}})^2\right] = E[(x_1-\mu_{\text{MLE}})^2] = E\left[\Big(x_1-\frac{1}{n}\sum_{i=1}^n x_i\Big)^2\right]$$
$$= E\left[x_1^2-\frac{2}{n}x_1\sum_{i=1}^n x_i+\frac{1}{n^2}\Big(\sum_{i=1}^n x_i\Big)^2\right] = E\left[x_1^2-\frac{2}{n}\sum_{i=1}^n x_1x_i+\frac{1}{n^2}\Big(\sum_{i=1}^n x_i^2+2\sum_{i<j}x_ix_j\Big)\right]$$
$$= E[x_1^2]+\frac{1}{n^2}\sum_{i=1}^n E[x_i^2]-\frac{2}{n}\sum_{i=1}^n E[x_1x_i]+\frac{2}{n^2}\sum_{i<j}E[x_ix_j]$$
$$= E[x_1^2]+\frac{1}{n^2}\,nE[x_1^2]-\frac{2}{n}E[x_1^2]-\frac{2}{n}(n-1)E[x_1x_2]+\frac{2}{n^2}\cdot\frac{n(n-1)}{2}E[x_1x_2]$$
$$= \frac{n-1}{n}E[x_1^2]-\frac{n-1}{n}E[x_1x_2]$$
So, $E[x_1x_2] = \mu^2$.
By substituting $E[x_1^2] = \sigma^2+\mu^2$ and $E[x_1x_2] = \mu^2$ into the previously obtained equality $\left(E[\sigma^2_{\text{MLE}}] = \frac{n-1}{n}E[x_1^2]-\frac{n-1}{n}E[x_1x_2]\right)$, we get:
$$E[\sigma^2_{\text{MLE}}] = \frac{n-1}{n}(\sigma^2+\mu^2)-\frac{n-1}{n}\mu^2 = \frac{n-1}{n}\sigma^2$$
Solution:
It can be immediately proven that $\frac{1}{n-1}\sum_{i=1}^n(x_i-\mu_{\text{MLE}})^2$ is an unbiased estimator of σ².
The Gamma distribution of parameters r > 0 and β > 0 has the following density function:
$$\text{Gamma}(x\mid r,\beta) = \frac{1}{\beta^r\,\Gamma(r)}\,x^{r-1}e^{-\frac{x}{\beta}}, \quad\text{for all } x>0,$$
where the Γ symbol designates Euler's Gamma function.
[Figure: Gamma densities p(x) on x ∈ [0, 20] for (r, β) ∈ {(1.0, 2.0), (2.0, 2.0), (3.0, 2.0), (5.0, 1.0), (9.0, 0.5), (7.5, 1.0), (0.5, 1.0)}.]
Notes:
1. In the above definition, $\frac{1}{\beta^r\Gamma(r)}$ is the so-called normalization factor, since it does not depend on x, and $\int_0^{+\infty} x^{r-1}e^{-\frac{x}{\beta}}\,dx = \beta^r\Gamma(r)$.
2. Euler's Gamma function is defined as follows: $\Gamma(r) = \int_0^{+\infty} t^{r-1}e^{-t}\,dt$, for all $r\in\mathbb{R}$ except for the non-positive integers. Starting from the definition of Γ, it can be easily shown that $\Gamma(r+1) = r\Gamma(r)$ for any r > 0, and also $\Gamma(1) = 1$. Therefore, for any positive integer r, $\Gamma(r+1) = r\cdot\Gamma(r) = r\cdot(r-1)\cdot\Gamma(r-1) = \dots = r\cdot(r-1)\cdot\ldots\cdot 2\cdot 1 = r!$, which means that the Γ function generalizes the factorial function.
3. The exponential distribution is a member of the Gamma family of distributions. (Just set r to 1 in Gamma's density function.)
Solution
• The likelihood function:
$$L(r,\beta) \overset{\text{def.}}{=} P(x_1,\dots,x_n\mid r,\beta) \overset{\text{i.i.d.}}{=} \prod_{i=1}^n P(x_i\mid r,\beta) = \beta^{-rn}\,(\Gamma(r))^{-n}\left(\prod_{i=1}^n x_i\right)^{r-1} e^{-\frac{1}{\beta}\sum_{i=1}^n x_i}$$
Now we will calculate the partial derivative of the log-likelihood $\ell(r,\beta) = \ln L(r,\beta)$ w.r.t. β, and then equate it to 0:
$$\frac{\partial}{\partial\beta}\ell(r,\beta) = -\frac{rn}{\beta}+\frac{1}{\beta^2}\sum_{i=1}^n x_i = \frac{1}{\beta^2}\left[\sum_{i=1}^n x_i - rn\beta\right]$$
$$\frac{\partial}{\partial\beta}\ell(r,\beta) = 0 \iff \hat\beta = \frac{1}{rn}\sum_{i=1}^n x_i > 0.$$
Therefore, by computing the partial derivative of $\ell(r,\hat\beta)$ with respect to r, and then equating this derivative to 0, we will get:
$$\frac{\partial}{\partial r}\ell(r,\hat\beta) = 0 \iff n\ln(nr)-n\ln\left(\sum_{i=1}^n x_i\right)-n\cdot\frac{\Gamma'(r)}{\Gamma(r)}+\sum_{i=1}^n\ln x_i = 0$$
$$\iff n(\ln r-\psi(r)) = -n\ln n-\sum_{i=1}^n\ln x_i+n\ln\left(\sum_{i=1}^n x_i\right) \iff \ln r-\psi(r) = \ln\left(\sum_{i=1}^n x_i\right)-\ln n-\frac{1}{n}\sum_{i=1}^n\ln x_i,$$
where $\psi(r) = \Gamma'(r)/\Gamma(r)$ denotes the digamma function. The solution of the last equation is r̂, the maximum likelihood estimate of the parameter r.
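The equation for r̂ has no closed form, but since $\ln r - \psi(r)$ is strictly decreasing in r, it can be solved by bisection. A sketch using SciPy's `digamma` on simulated data (the true parameter values are arbitrary):

```python
# Solve ln r − ψ(r) = ln(Σxᵢ) − ln n − (1/n)Σ ln xᵢ for r̂ by bisection,
# then recover β̂ = Σxᵢ / (r̂ n). Data are simulated from a known Gamma.
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(1)
r_true, beta_true = 3.0, 2.0
x = rng.gamma(shape=r_true, scale=beta_true, size=200_000)

n = len(x)
rhs = np.log(x.sum()) - np.log(n) - np.log(x).mean()

def f(r):
    # ln r − ψ(r) is strictly decreasing, so f has a unique root
    return np.log(r) - digamma(r) - rhs

lo, hi = 1e-6, 100.0
for _ in range(200):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if f(mid) > 0 else (lo, mid)
r_hat = 0.5 * (lo + hi)
beta_hat = x.sum() / (r_hat * n)
```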
Tr(A), the trace of an n-by-n square matrix A, is defined as the sum of the
elements on the main diagonal (the diagonal from the upper left to the lower
right) of A, i.e., a11 + . . . + ann .
a See Theorem 1.3.d from Matrix Analysis for Statistics, 2017, James R. Schott.
Given the $x_1,\dots,x_n$ data, the log-likelihood function is:
$$l(\mu,\Lambda) = \ln\prod_{i=1}^n N(x_i\mid\mu,\Lambda^{-1}) \overset{\text{i.i.d.}}{=} \sum_{i=1}^n\ln N(x_i\mid\mu,\Lambda^{-1}) = -\frac{nd}{2}\ln(2\pi)-\frac{n}{2}\ln|\Lambda^{-1}|-\frac{1}{2}\sum_{i=1}^n(x_i-\mu)^\top\Lambda(x_i-\mu)$$
$$\overset{\text{(2b)}}{=} -\frac{nd}{2}\ln(2\pi)+\frac{n}{2}\ln|\Lambda|-\frac{1}{2}\sum_{i=1}^n(x_i-\mu)^\top\Lambda(x_i-\mu).$$
For any fixed positive definite precision matrix Λ, the log-likelihood is a quadratic function of µ with a negative leading coefficient, hence a strictly concave function of µ. We then solve
$$\nabla_\mu\, l(\mu,\Lambda) = 0 \overset{\text{(5g)}}{\iff} -\frac{1}{2}(\Lambda+\Lambda^\top)\sum_{i=1}^n(x_i-\mu)(-1) = 0 \iff \Lambda\sum_{i=1}^n(x_i-\mu) = 0 \iff \Lambda\sum_{i=1}^n x_i = n\Lambda\mu. \quad (18)$$
From (18) we get, by the assumption that Λ is invertible, the following estimate of µ:
$$\hat\mu = \frac{\sum_{i=1}^n x_i}{n},$$
which coincides with the sample mean x̄ and is constant w.r.t. Λ.
Now that we have
$$l(\mu,\Lambda)\le l(\hat\mu,\Lambda) \quad \forall\mu\in\mathbb{R}^d,\ \Lambda \text{ being positive definite},$$
we continue to consider Λ by first plugging µ̂ back into the log-likelihood function:
$$l(\hat\mu,\Lambda) = -\frac{nd}{2}\ln(2\pi)+\frac{n}{2}\ln|\Lambda|-\frac{1}{2}\sum_{i=1}^n(x_i-\bar x)^\top\Lambda(x_i-\bar x) \quad (19)$$
$$\overset{\text{(2e)}}{=} -\frac{nd}{2}\ln(2\pi)+\frac{n}{2}\big(\ln|\Lambda|-\mathrm{Tr}(S\Lambda)\big), \quad (20)$$
where S is the sample covariance matrix: $S = \frac{1}{n}\sum_{i=1}^n(x_i-\bar x)(x_i-\bar x)^\top$.
Explanation:
$(x_i-\bar x)^\top\Lambda(x_i-\bar x)$ is a 1-by-1 matrix, therefore $(x_i-\bar x)^\top\Lambda(x_i-\bar x) = \mathrm{Tr}\big((x_i-\bar x)^\top\Lambda(x_i-\bar x)\big)$, and using the (2e) rule, it can be further written as $\mathrm{Tr}\big((x_i-\bar x)(x_i-\bar x)^\top\Lambda\big)$.
Using another simple rule, Tr(A + B) = Tr(A) + Tr(B) (which can be easily proven), it follows that
$$\sum_{i=1}^n(x_i-\bar x)^\top\Lambda(x_i-\bar x) = \sum_{i=1}^n\mathrm{Tr}\big((x_i-\bar x)^\top\Lambda(x_i-\bar x)\big) = \sum_{i=1}^n\mathrm{Tr}\big((x_i-\bar x)(x_i-\bar x)^\top\Lambda\big)$$
$$= \mathrm{Tr}\left(\sum_{i=1}^n(x_i-\bar x)(x_i-\bar x)^\top\Lambda\right) = \mathrm{Tr}\big((nS)\Lambda\big) = n\,\mathrm{Tr}(S\Lambda).$$
By the fact that $\ln|\Lambda|$ is strictly concave on the domain of positive definite Λ,^a and that Tr(SΛ) is linear in Λ, we are able to find the maximum of the expression (20) by solving
$$\nabla_\Lambda\, l(\hat\mu,\Lambda) = 0,$$
c. Calculate the entropy of S [as the expected value of the random variable $-\log_2 P(S=s)$].
d. Let's say you roll the dice one at a time, and the first die shows 4. What is the entropy of S after this observation? Was any information gained / lost in the process? If so, calculate how much information (in bits) was lost or gained.
a.
S 2 3 4 5 6 7 8 9 10 11 12
P (S) 1/36 2/36 3/36 4/36 5/36 6/36 5/36 4/36 3/36 2/36 1/36
b.
c.
$$H(S) = -\sum_{i=1}^n p_i\log_2 p_i = -\left[2\cdot\frac{1}{36}\log_2\frac{1}{36}+2\cdot\frac{2}{36}\log_2\frac{2}{36}+2\cdot\frac{3}{36}\log_2\frac{3}{36}+2\cdot\frac{4}{36}\log_2\frac{4}{36}+2\cdot\frac{5}{36}\log_2\frac{5}{36}+\frac{6}{36}\log_2\frac{6}{36}\right]$$
$$= \frac{1}{36}\left(2\log_2 36+4\log_2 18+6\log_2 12+8\log_2 9+10\log_2\frac{36}{5}+6\log_2 6\right)$$
$$= \frac{1}{36}\left(2\log_2 6^2+4\log_2(6\cdot 3)+6\log_2(6\cdot 2)+8\log_2 3^2+10\log_2\frac{6^2}{5}+6\log_2 6\right)$$
$$= \frac{1}{36}\left(40\log_2 6+20\log_2 3+6-10\log_2 5\right) = \frac{1}{36}\left(60\log_2 3+46-10\log_2 5\right) = 3.274401919 \text{ bits.}$$
d.
S 2 3 4 5 6 7 8 9 10 11 12
P (S|First die shows 4) 0 0 0 1/6 1/6 1/6 1/6 1/6 1/6 0 0
$$H(S\mid\text{First die shows 4}) = -6\cdot\frac{1}{6}\log_2\frac{1}{6} = \log_2 6 \approx 2.58 \text{ bits},$$
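Both entropies can be computed in a few lines:

```python
# Entropy of the sum S of two fair dice, and the conditional entropy
# after observing that the first die shows 4.
from collections import Counter
from math import log2

outcomes = [(i, j) for i in range(1, 7) for j in range(1, 7)]
p_s = Counter(i + j for i, j in outcomes)          # counts out of 36

H_S = -sum((c / 36) * log2(c / 36) for c in p_s.values())       # ≈ 3.2744 bits

# After the observation, S is uniform on {5, ..., 10}
H_S_given_4 = -sum((1 / 6) * log2(1 / 6) for _ in range(6))     # = log2 6
info_gained = H_S - H_S_given_4                                 # ≈ 0.689 bits
```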
A doctor needs to diagnose a person having a cold (C). The primary factor he considers in his diagnosis is the outside temperature (T). The random variable C takes two values, yes / no, and the random variable T takes 3 values, sunny, rainy, snowy. The joint distribution of the two variables is given in the following table.
$$H(T\mid C) \overset{\text{def.}}{=} \sum_{c\in\mathrm{Val}(C)} P(C=c)\cdot H(T\mid C=c)$$
Computing the entropy of the exponential distribution
CMU, 2011 spring, R. Rosenfeld, HW2, pr. 2.c
[Figure: exponential densities p(x) on x ∈ [0, 5] for λ ∈ {0.5, 1, 1.5}.]
To solve the second integral one can use integration by parts:
$$\int_0^\infty\left(e^{-\lambda x}\right)'x\,dx = e^{-\lambda x}x\Big|_0^\infty-\int_0^\infty e^{-\lambda x}\,x'\,dx = e^{-\lambda x}x\Big|_0^\infty-\int_0^\infty e^{-\lambda x}\,dx$$
The definite term $e^{-\lambda x}x\big|_0^\infty$ cannot be computed directly (because of the 0·∞ conflict that occurs when x is assigned the limit value ∞); it is computed using l'Hôpital's rule:
$$\lim_{x\to\infty} xe^{-\lambda x} = \lim_{x\to\infty}\frac{x}{e^{\lambda x}} = \lim_{x\to\infty}\frac{x'}{(e^{\lambda x})'} = \lim_{x\to\infty}\frac{1}{\lambda e^{\lambda x}} = \frac{1}{\lambda}\lim_{x\to\infty}e^{-\lambda x} = 0,$$
hence
$$e^{-\lambda x}x\Big|_0^\infty = 0-0 = 0.$$
The integral $\int_0^\infty e^{-\lambda x}\,dx$ is easily computed:
$$\int_0^\infty e^{-\lambda x}\,dx = -\frac{1}{\lambda}\int_0^\infty\left(e^{-\lambda x}\right)'\,dx = -\frac{1}{\lambda}e^{-\lambda x}\Big|_0^\infty = -\frac{1}{\lambda}(0-1) = \frac{1}{\lambda}$$
Therefore,
$$\int_0^\infty\left(e^{-\lambda x}\right)'x\,dx = 0-\frac{1}{\lambda} = -\frac{1}{\lambda},$$
which leads to the final result:
$$H(P) = (-1)\frac{\ln\lambda}{\ln 2}-\frac{\lambda}{\ln 2}\left(-\frac{1}{\lambda}\right) = -\frac{\ln\lambda}{\ln 2}+\frac{1}{\ln 2} = \frac{1-\ln\lambda}{\ln 2}.$$
a.
$$H(X)\ge 0.$$
$$H(X) = -\sum_i P(X=x_i)\log P(X=x_i) = \sum_i\underbrace{P(X=x_i)}_{\ge 0}\underbrace{\log\frac{1}{P(X=x_i)}}_{\ge 0} \ge 0$$
Moreover, H(X) = 0 if and only if the variable X is constant:
"⇒" Suppose H(X) = 0, i.e., $\sum_i P(X=x_i)\log\frac{1}{P(X=x_i)} = 0$. Since every term of this sum is greater than or equal to 0, it follows that H(X) = 0 only if for all i, $P(X=x_i) = 0$ or $\log\frac{1}{P(X=x_i)} = 0$, that is, only if for all i, $P(X=x_i) = 0$ or $P(X=x_i) = 1$. But since $\sum_i P(X=x_i) = 1$, it follows that there is a single value $x_1$ for X such that $P(X=x_1) = 1$, and $P(X=x) = 0$ for any $x\ne x_1$. In other words, the discrete random variable X is constant.
"⇐" Suppose the variable X is constant, which means that X takes a single value $x_1$, with probability $P(X=x_1) = 1$. Therefore, $H(X) = -1\cdot\log 1 = 0$.
b.
H(Y | X) = -\sum_i \sum_j P(X = x_i, Y = y_j) \log P(Y = y_j | X = x_i)

Starting from the definition H(Y | X) = \sum_i P(X = x_i) H(Y | X = x_i):

H(Y | X) = \sum_i P(X = x_i) \Big[ -\sum_j P(Y = y_j | X = x_i) \log P(Y = y_j | X = x_i) \Big]
         = -\sum_i \sum_j \underbrace{P(X = x_i) P(Y = y_j | X = x_i)}_{= P(X = x_i, Y = y_j)} \log P(Y = y_j | X = x_i)
         = -\sum_i \sum_j P(X = x_i, Y = y_j) \log P(Y = y_j | X = x_i)
c.
H(X, Y) = H(X) + H(Y | X) = H(Y) + H(X | Y)

H(X, Y) = -\sum_i \sum_j p(x_i, y_j) \log p(x_i, y_j)
        = -\sum_i \sum_j p(x_i) \, p(y_j | x_i) \log [ p(x_i) \, p(y_j | x_i) ]
        = -\sum_i \sum_j p(x_i) \, p(y_j | x_i) [ \log p(x_i) + \log p(y_j | x_i) ]
        = -\sum_i \sum_j p(x_i) \, p(y_j | x_i) \log p(x_i) - \sum_i \sum_j p(x_i) \, p(y_j | x_i) \log p(y_j | x_i)
        = -\sum_i p(x_i) \log p(x_i) \underbrace{\sum_j p(y_j | x_i)}_{= 1} - \sum_i p(x_i) \sum_j p(y_j | x_i) \log p(y_j | x_i)
        = H(X) + \sum_i p(x_i) H(Y | X = x_i) = H(X) + H(Y | X)
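Both identities proved above — the two equivalent expressions for H(Y | X) and the chain rule — can be checked on a small joint distribution. A Python sketch (the joint table is a hypothetical example of ours, chosen only so that it sums to 1):

```python
from math import log2

def H(dist):
    """Shannon entropy (in bits) of an iterable of probabilities."""
    return -sum(p * log2(p) for p in dist if p > 0)

# A small joint distribution P(X = x, Y = y), stored as joint[x][y].
joint = {0: {0: 0.125, 1: 0.375}, 1: {0: 0.25, 1: 0.25}}
px = {x: sum(row.values()) for x, row in joint.items()}

# H(Y|X) via the double-sum formula  -sum_{x,y} P(x,y) log2 P(y|x) ...
H_cond = -sum(p * log2(p / px[x])
              for x, row in joint.items() for p in row.values() if p > 0)

# ... and via the weighted average  sum_x P(x) H(Y | X = x).
H_weighted = sum(px[x] * H(p / px[x] for p in joint[x].values()) for x in joint)

# Chain rule: H(X, Y) = H(X) + H(Y|X).
H_XY = H(p for row in joint.values() for p in row.values())
print(H_cond, H_weighted, H_XY, H(px.values()) + H_cond)
```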
H(X_1, \ldots, X_n) = E\left[ \log \frac{1}{p(x_1, \ldots, x_n)} \right]
                    = -E_{p(x_1,\ldots,x_n)} \big[ \log \underbrace{p(x_1, \ldots, x_n)}_{p(x_1)\, p(x_2|x_1) \cdots p(x_n|x_1,\ldots,x_{n-1})} \big]
Let X be a discrete random variable that takes n
values and follows the probability distribution P.
By definition, the entropy of X is

H(X) = -\sum_{i=1}^{n} P(X = x_i) \log_2 P(X = x_i).

Show that H(X) \leq \log_2 n.

[Figure: the graphs of y = \ln x and y = x - 1, illustrating the inequality \ln x \leq x - 1.]
Applying the inequality \ln x \leq x - 1 for x = \frac{1}{n P(x_i)}, we get:

\sum_{i=1}^{n} P(x_i) \ln \frac{1}{n P(x_i)} \leq \sum_{i=1}^{n} P(x_i) \left( \frac{1}{n P(x_i)} - 1 \right) = \underbrace{\sum_{i=1}^{n} \frac{1}{n}}_{1} - \sum_{i=1}^{n} P(x_i) = 1 - 1 = 0

Since the left-hand side equals \sum_i P(x_i) \ln \frac{1}{P(x_i)} - \ln n = (\ln 2) H(X) - \ln n, this proves H(X) \leq \log_2 n.
Remark: This upper bound is actually attained. For instance, when a discrete
random variable X taking n values follows the uniform distribution, one can
immediately check that H(X) = \log_2 n.
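A quick empirical check of the bound and of the equality case (the random distributions below are arbitrary choices of ours):

```python
import random
from math import log2

def H(dist):
    """Shannon entropy (in bits) of an iterable of probabilities."""
    return -sum(p * log2(p) for p in dist if p > 0)

n = 8
# Random distributions never exceed the log2(n) bound ...
for _ in range(1000):
    w = [random.random() for _ in range(n)]
    total = sum(w)
    assert H(wi / total for wi in w) <= log2(n) + 1e-9

# ... and the uniform distribution attains it exactly.
print(H([1 / n] * n), log2(n))  # both are 3.0 for n = 8
```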
Answer:

H(X, Y) = -\sum_{x,y} P(X = x, Y = y) \log P(X = x, Y = y)
        \stackrel{indep.}{=} -\sum_{x,y} P(X = x) P(Y = y) \log ( P(X = x) P(Y = y) )
        = -\sum_{x,y} P(X = x) P(Y = y) ( \log P(X = x) + \log P(Y = y) )
        = -\sum_y P(Y = y) \sum_x P(X = x) \log P(X = x) - \sum_x P(X = x) \sum_y P(Y = y) \log P(Y = y)
        = -\sum_x P(X = x) \log P(X = x) - \sum_y P(Y = y) \log P(Y = y)
        = H(X) + H(Y)

(using \sum_y P(Y = y) = 1 and \sum_x P(X = x) = 1 in the last-but-one step)
If X, Y are continuous and independent random variables, then

p_{X,Y}(X = x, Y = y) = p_X(X = x) \, p_Y(Y = y)

for any x and y, where p denotes the p.d.f. corresponding to the [c.d.f. of] P.
Therefore,

H(X, Y) \stackrel{def.}{=} -\int_X \int_Y p_{X,Y}(X = x, Y = y) \log p_{X,Y}(X = x, Y = y) \, dx \, dy
        \stackrel{indep.}{=} -\int_X \int_Y p_X(X = x) \, p_Y(Y = y) ( \log p_X(X = x) + \log p_Y(Y = y) ) \, dx \, dy
        \stackrel{not.}{=} -\int_X \int_Y p_X(x) \, p_Y(y) ( \log p_X(x) + \log p_Y(y) ) \, dx \, dy
        = -\int_X p_X(x) \log p_X(x) \underbrace{\int_Y p_Y(y) \, dy}_{1} \, dx - \int_X p_X(x) \underbrace{\int_Y p_Y(y) \log p_Y(y) \, dy}_{-H(Y)} \, dx
        = -\int_X p_X(x) \log p_X(x) \, dx + H(Y) \underbrace{\int_X p_X(x) \, dx}_{1}
        = H(X) + H(Y)
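The discrete version of this additivity property can be verified by building the joint as the outer product of two marginals. A short sketch (the two marginal distributions are arbitrary examples of ours):

```python
from math import log2

def H(dist):
    """Shannon entropy (in bits) of an iterable of probabilities."""
    return -sum(p * log2(p) for p in dist if p > 0)

# Two (arbitrarily chosen) marginal distributions.
pX = [0.2, 0.5, 0.3]
pY = [0.6, 0.4]

# Under independence, the joint is the outer product P(x, y) = P(x) P(y).
joint = [px * py for px in pX for py in pY]

print(H(joint), H(pX) + H(pY))  # the two values coincide
```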
Answer: H(A)
We’ve used the fact that H(X, Y ) = H(X|Y ) + H(Y ) (see CMU, 2005 fall, T.
Mitchell, A. Moore, HW1, pr. 2).
In CMU, 2007 fall, Carlos Guestrin, HW1, pr. 1.2 it was proven — for the case in which X
and Y are discrete — that IG(X, Y) = KL(P_{X,Y} || P_X P_Y), where KL denotes the relative entropy (or:
Kullback-Leibler divergence), P_X and P_Y are the distributions of the variables X and Y respectively, and P_{X,Y} is
the joint distribution of these variables. Also in CMU, 2007 fall, Carlos Guestrin, HW1, pr. 1.2 it was
shown that the KL divergence is always non-negative. Consequently, IG(X, Y) \geq 0 for any X
and Y.

Remark: The advantage in this problem, compared to CMU, 2007 fall, Carlos Guestrin,
HW1, pr. 1.2.a, is that here we work with a single distribution (p), not with two distributions (p and q).
However, the proof given here will be more laborious.
-IG(X, Y) \stackrel{def.}{=} \sum_{i=1}^{n} \sum_{j=1}^{m} P(x_i, y_j) \log_2 P(x_i) - \sum_{j=1}^{m} \sum_{i=1}^{n} P(y_j) P(x_i | y_j) \log_2 P(x_i | y_j)

\stackrel{cond.\ prob.}{=} \sum_{i=1}^{n} \sum_{j=1}^{m} P(x_i, y_j) \log_2 P(x_i) - \sum_{j=1}^{m} \sum_{i=1}^{n} P(x_i, y_j) \log_2 P(x_i | y_j)

\stackrel{distributivity}{=} \sum_{i=1}^{n} \sum_{j=1}^{m} P(x_i, y_j) ( \log_2 P(x_i) - \log_2 P(x_i | y_j) )

\stackrel{log.\ rules}{=} \sum_{i=1}^{n} \sum_{j=1}^{m} P(x_i, y_j) \log_2 \frac{P(x_i)}{P(x_i | y_j)}

\stackrel{mult.\ rule}{=} \sum_{i=1}^{n} \sum_{j=1}^{m} P(x_i | y_j) P(y_j) \log_2 \frac{P(x_i)}{P(x_i | y_j)}

\stackrel{distributivity}{=} \sum_{j=1}^{m} P(y_j) \sum_{i=1}^{n} \underbrace{P(x_i | y_j)}_{a_i} \log_2 \frac{P(x_i)}{P(x_i | y_j)}
Since on the one hand P(x_i | y_j) \geq 0 and on the other hand \sum_{i=1}^{n} P(x_i | y_j) = 1 for
each particular value y_j of Y, we can apply Jensen's inequality to
the second sum in the last expression above — more precisely, for
each value of the index j separately — and we obtain:

-IG(X, Y) \leq \sum_{j=1}^{m} P(y_j) \log_2 \left( \sum_{i=1}^{n} P(x_i | y_j) \frac{P(x_i)}{P(x_i | y_j)} \right) = \sum_{j=1}^{m} P(y_j) \log_2 \underbrace{\left( \sum_{i=1}^{n} P(x_i) \right)}_{1} = 0
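The two facts just established — IG(X, Y) = KL(P_{X,Y} || P_X P_Y) and IG(X, Y) \geq 0 — can be checked numerically on random joint tables. A sketch (the helper name and the table shapes are ours):

```python
import random
from math import log2

def ig_two_ways(joint):
    """joint[i][j] = P(X = x_i, Y = y_j).
    Returns (H(X) - H(X|Y), KL(P_XY || P_X P_Y)), both in bits."""
    rows, cols = len(joint), len(joint[0])
    pX = [sum(row) for row in joint]
    pY = [sum(joint[i][j] for i in range(rows)) for j in range(cols)]
    HX = -sum(p * log2(p) for p in pX if p > 0)
    HX_given_Y = -sum(joint[i][j] * log2(joint[i][j] / pY[j])
                      for i in range(rows) for j in range(cols) if joint[i][j] > 0)
    kl = sum(joint[i][j] * log2(joint[i][j] / (pX[i] * pY[j]))
             for i in range(rows) for j in range(cols) if joint[i][j] > 0)
    return HX - HX_given_Y, kl

# The identity IG = KL and the bound IG >= 0 hold for random joint tables.
for _ in range(200):
    w = [[random.random() for _ in range(3)] for _ in range(4)]
    s = sum(map(sum, w))
    ig, kl = ig_two_ways([[v / s for v in row] for row in w])
    assert abs(ig - kl) < 1e-9 and ig >= -1e-12
print("IG = KL(P_XY || P_X P_Y) >= 0 confirmed on 200 random tables")
```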
Corollary
Since
H(X, Y) = H(X) + H(Y | X) = H(Y) + H(X | Y),
and also due to the definition of the Information Gain,
it follows that
IG(X, Y) = H(X) + H(Y) - H(X, Y).
Cross-entropy (CH):
definition, some basic properties, and exemplifications
CMU, 2011 spring, Roni Rosenfeld, HW2, pr. 3.c
q1 = (0.1, 0.1, 0.2, 0.3, 0.2, 0.05, 0.05) and q2 = (0.05, 0.1, 0.15, 0.35, 0.2, 0.1, 0.05).
The experiments are done on a data set with the following empirical distribution:
p_empirical = (0.05, 0.1, 0.2, 0.3, 0.2, 0.1, 0.05).
Answer
CH(p_empirical, q1) = -(0.05 log 0.1 + 0.1 log 0.1 + 0.2 log 0.2 + 0.3 log 0.3 +
0.2 log 0.2 + 0.1 log 0.05 + 0.05 log 0.05) = 2.596 bits
CH(p_empirical, q2) = -(0.05 log 0.05 + 0.1 log 0.1 + 0.2 log 0.15 + 0.3 log 0.35 +
0.2 log 0.2 + 0.1 log 0.1 + 0.05 log 0.05) = 2.563 bits.
Since CH(p_empirical, q2) < CH(p_empirical, q1), the model q2 fits the empirical
distribution better.
However, it is not guaranteed that this model is always the better model,
because we are working with an “empirical” distribution here, and the “true”
distribution might not be reflected in this empirical distribution completely.
Usually, sampling bias and insufficient training data samples will widen the
gap between the true distribution and the empirical one, so in practice when
designing the experiment, you should always have that in mind, and if possible
use techniques that minimize these risks.
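The two cross-entropy values can be reproduced with a few lines of Python (the helper name is ours):

```python
from math import log2

def cross_entropy(p, q):
    """CH(p, q) = -sum_x p(x) log2 q(x), in bits."""
    return -sum(pi * log2(qi) for pi, qi in zip(p, q))

p_emp = [0.05, 0.1, 0.2, 0.3, 0.2, 0.1, 0.05]
q1 = [0.1, 0.1, 0.2, 0.3, 0.2, 0.05, 0.05]
q2 = [0.05, 0.1, 0.15, 0.35, 0.2, 0.1, 0.05]

print(round(cross_entropy(p_emp, q1), 3))  # 2.596 bits
print(round(cross_entropy(p_emp, q2), 3))  # 2.563 bits: q2 is the better fit
```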
KL(p || q) \stackrel{def.}{=} -\sum_{x \in X} p(x) \log \frac{q(x)}{p(x)}
Notes
1. KL is not a distance measure, since it is not symmetric (i.e., in general
KL(p||q) \neq KL(q||p)).
Another measure, defined as JSD(p||q) = \frac{1}{2} \big( KL(p \,||\, (p + q)/2) + KL(q \,||\, (p + q)/2) \big)
and called the Jensen-Shannon divergence, is symmetric.
2. The quantity
d(X, Y) \stackrel{def.}{=} H(X, Y) - IG(X, Y) = H(X) + H(Y) - 2\, IG(X, Y)
       = H(X | Y) + H(Y | X)
is symmetric and, unlike KL, satisfies the axioms of a metric (it is known as the
variation of information).
Hint:
To prove this item you can use Jensen's inequality:

If f : R \to R is a convex function, then for any t \in [0, 1] and any x_1, x_2 \in R,
f(t x_1 + (1 - t) x_2) \leq t f(x_1) + (1 - t) f(x_2),
and, more generally, for any a_1, \ldots, a_n \geq 0 with \sum_i a_i = 1,
f\big( \sum_i a_i x_i \big) \leq \sum_i a_i f(x_i).

If f is strictly convex, then equality holds only if x_1 = \ldots = x_n.
Obviously, similar results can be formulated for concave functions.

[Figure: a convex function f; the chord value t f(x) + (1 - t) f(y) lies above f(t x + (1 - t) y).]
Answer
We will prove the inequality KL(p||q) \geq 0 using Jensen's inequality, in whose
expression we substitute f with the convex function -\log_2, a_i with p(x_i), and x_i with
\frac{q(x_i)}{p(x_i)}.
(For convenience, in what follows we drop the index of the variable x.)
We will have:

KL(p || q) \stackrel{def.}{=} -\sum_x p(x) \log \frac{q(x)}{p(x)}
          \stackrel{Jensen}{\geq} -\log \underbrace{\left( \sum_x p(x) \frac{q(x)}{p(x)} \right)}_{\sum_x q(x) \, = \, 1} = -\log 1 = 0
Prove that this definition of information gain is equivalent to the one given in
problem CMU, 2005 fall, T. Mitchell, A. Moore, HW1, pr. 2. That is, show
that IG(X, Y ) = H[X] − H[X|Y ] = H[Y ] − H[Y |X], starting from the definition
in terms of KL-divergence.
Remark:
It follows that

IG(X, Y) = \sum_y p(y) \sum_x p(x | y) \log \frac{p(x | y)}{p(x)} = \sum_y p(y) \, KL(p_{X|Y} \,||\, p_X)
         = E_Y \big[ KL(p_{X|Y} \,||\, p_X) \big]
Answer
By making use of the multiplication rule, namely p(x, y) = p(x | y) p(y), we will
have:

KL(p_{X,Y} \,||\, (p_X p_Y)) \stackrel{def.\ KL}{=} -\sum_x \sum_y p(x, y) \log \frac{p(x) p(y)}{p(x, y)}
    = -\sum_x \sum_y p(x, y) \log \frac{p(x)}{p(x | y)} = -\sum_x \sum_y p(x, y) [ \log p(x) - \log p(x | y) ]
    = -\sum_x \sum_y p(x, y) \log p(x) - \underbrace{\left( -\sum_x \sum_y p(x, y) \log p(x | y) \right)}_{H[X | Y]}
    = -\sum_x \log p(x) \underbrace{\sum_y p(x, y)}_{= p(x)} - H[X | Y]
    = -\sum_x p(x) \log p(x) - H[X | Y] = H[X] - H[X | Y]
c.
A direct consequence of parts a. and b. is that IG(X, Y ) ≥ 0 (and therefore
H(X) ≥ H(X|Y ) and H(Y ) ≥ H(Y |X)) for any discrete random variables X and
Y.
Prove that IG(X, Y ) = 0 iff X and Y are independent.
Answer:
Remark (cont'd)

 X1  X2  X3  X4  X5 | Y
  0   1   1   0   1 | 0
  1   0   0   0   1 | 0
  0   1   0   1   0 | 1
  1   1   1   1   0 | 1
  0   1   1   0   0 | 1
  0   0   0   1   1 | 1
  1   0   0   1   0 | 1
  1   1   1   0   1 | 1

            x_i = 0, y = 0   x_i = 0, y = 1   x_i = 1, y = 0   x_i = 1, y = 1
 P(x1, y)        1/8              3/8              1/8              3/8
 P(x2, y)        1/8              1/4              1/8              1/2
 P(x3, y)        1/8              3/8              1/8              3/8
 P(x4, y)        1/4              1/4               0               1/2
 P(x5, y)         0               1/2              1/4              1/4

Note: Columns 3 and 5 can be easily computed using columns 2 and 4 respectively
(and also the first table from above), as P(x_i = 0, y = 1) = P(x_i = 0) - P(x_i = 0, y = 0) and
P(x_i = 1, y = 1) = P(x_i = 1) - P(x_i = 1, y = 0).
Using these probabilities and the given formula, the mutual information for
each feature is: MI(X1, Y) = 0, MI(X2, Y) = 0.01571, MI(X3, Y) = 0,
MI(X4, Y) = 0.3113 and MI(X5, Y) = 0.3113.
Note: One could easily check, using the previous tables, that X1 is independent
of Y, and X3 is also independent of Y (so, we can determine in this way
too that MI(X1, Y) and MI(X3, Y) are 0).
b. In order to select a set of features, we can prioritize the ones with higher
mutual information, because they are more strongly dependent on Y.
By looking at the results of the previous question, we can see that X5 and
X4 are the features with the highest mutual information (0.3113), followed by X2
(0.0157) and, finally, X1 and X3, which have no mutual information with Y.
By inspection of the data, we can see that if we select X5 , X4 and X2 there
are two samples of different classes with the same features (X2 = 1, X4 = 0,
X5 = 1). To avoid this problem, we can add X1 as an extra feature.
Note: Although the mutual information of X1 with Y is zero, this does not
mean that the combination of X1 with other features will also have zero
mutual information [LC: w.r.t. Y].
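The mutual-information values above can be recomputed directly from the data table. A sketch (the helper function is ours):

```python
from math import log2

# The data table from the slide: one row per sample, columns (X1..X5, Y).
data = [
    (0, 1, 1, 0, 1, 0),
    (1, 0, 0, 0, 1, 0),
    (0, 1, 0, 1, 0, 1),
    (1, 1, 1, 1, 0, 1),
    (0, 1, 1, 0, 0, 1),
    (0, 0, 0, 1, 1, 1),
    (1, 0, 0, 1, 0, 1),
    (1, 1, 1, 0, 1, 1),
]

def mutual_information(i):
    """Empirical MI(X_{i+1}, Y) in bits, estimated from the 8 samples."""
    n = len(data)
    joint, px, py = {}, {}, {}
    for row in data:
        x, y = row[i], row[-1]
        joint[(x, y)] = joint.get((x, y), 0) + 1 / n
        px[x] = px.get(x, 0) + 1 / n
        py[y] = py.get(y, 0) + 1 / n
    return sum(p * log2(p / (px[x] * py[y])) for (x, y), p in joint.items())

for i in range(5):
    print(f"MI(X{i + 1}, Y) = {mutual_information(i):.5f}")
# MI values: 0, 0.01571, 0, 0.31128, 0.31128
```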
Theorem:
If ψ_n(p_1, \ldots, p_n) satisfies the following axioms
A0. [LC:] ψ_n(p_1, \ldots, p_n) \geq 0 for any n \in N^* and p_1, \ldots, p_n, since we view ψ_n as a
measure of disorder; also, ψ_1(1) = 0, because in this case there is no disorder;
A1. ψ_n should be continuous in p_i and symmetric in its arguments;
A2. if p_i = 1/n, then ψ_n should be a monotonically increasing function of n;
(If all events are equally likely, then having more events means being
more uncertain.)
A3. if a choice among N events is broken down into successive choices, then
the entropy should be the weighted sum of the entropy at each stage;
then ψ_n(p_1, \ldots, p_n) = -K \sum_i p_i \log p_i, where K is a positive constant.
(As we'll see, K depends however on ψ_s(1/s, \ldots, 1/s), for a certain s \in N^*.)

Remark: We will prove the theorem firstly for uniform distributions (p_i = 1/n) and secondly
for the case p_i \in Q (only!).
Example for the axiom A3:

[Figure: two encodings of the distribution (1/2, 1/3, 1/6) over {a, b, c}: Encoding 1 chooses
directly among a, b, c; Encoding 2 first chooses between a and {b, c} with probability 1/2 each,
and then, within {b, c}, chooses b or c with probabilities 2/3 and 1/3.]

H\big(\tfrac{1}{2}, \tfrac{1}{3}, \tfrac{1}{6}\big) = \tfrac{1}{2}\log 2 + \tfrac{1}{3}\log 3 + \tfrac{1}{6}\log 6 = \tfrac{1}{2} + \tfrac{1}{3}\log 3 + \tfrac{1}{6} + \tfrac{1}{6}\log 3 = \tfrac{2}{3} + \tfrac{1}{2}\log 3

H\big(\tfrac{1}{2}, \tfrac{1}{2}\big) + \tfrac{1}{2} H\big(\tfrac{2}{3}, \tfrac{1}{3}\big) = 1 + \tfrac{1}{2}\big( \tfrac{2}{3}\log\tfrac{3}{2} + \tfrac{1}{3}\log 3 \big) = 1 + \tfrac{1}{2}\big( \log 3 - \tfrac{2}{3} \big) = \tfrac{2}{3} + \tfrac{1}{2}\log 3
[Figure: a tree in which every node branches s times; at level 2 there are s branches per
node, and at level m there are s^m leaves in total.]
d. Consider again s^m \leq t^n \leq s^{m+1}, with s, t fixed. If m \to \infty, then n \to \infty, and
from

\left| \frac{A(t)}{A(s)} - \frac{\log t}{\log s} \right| \leq \frac{2}{n}

it follows that \frac{A(t)}{A(s)} - \frac{\log t}{\log s} \to 0.

Therefore \frac{A(t)}{A(s)} - \frac{\log t}{\log s} = 0, and so \frac{A(t)}{A(s)} = \frac{\log t}{\log s}.

Finally, A(t) = \frac{A(s)}{\log s} \log t = K \log t, where K = \frac{A(s)}{\log s} > 0 (if s \neq 1).
Case 2: p_i \in Q for i = 1, \ldots, n

\Rightarrow ψ_k(p_1, \ldots, p_k) = K \Big[ \log N - \sum_i p_i \log |S_i| \Big]
 = K \Big[ \log N \underbrace{\sum_i p_i}_{1} - \sum_i p_i \log |S_i| \Big] = -K \sum_i p_i \log \frac{|S_i|}{N} = -K \sum_i p_i \log p_i
Exemplifying
The gradient descent method:
finding the minimum of a real function of second degree

[Figure: the graph of f(x) = 3x^2 - 2x + 1.]

Since f''(x) = 6 > 0 for all x \in R, the function f is convex on its entire domain
of definition, and therefore it has a (single) minimum point.
To compute the minimum of the function f(x) = 3x^2 - 2x + 1,
we use the first-order derivative:
f'(x) = 6x - 2.
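The minimum x = 1/3 can be reproduced with a plain gradient-descent loop (the learning rate, starting point, and iteration count below are illustrative choices of ours, not from the slides):

```python
def f(x):
    return 3 * x**2 - 2 * x + 1   # f(x) = 3x^2 - 2x + 1

def f_prime(x):
    return 6 * x - 2              # f'(x) = 6x - 2

x = 0.0                           # arbitrary starting point
lr = 0.1                          # illustrative learning rate
for _ in range(100):
    x -= lr * f_prime(x)          # gradient-descent update: x <- x - lr * f'(x)

print(x, f(x))  # x converges to 1/3, where f(1/3) = 2/3
```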
Remarks