Stefan Harmeling
Density estimation
Given $n$ data points $X = [x_1, \ldots, x_n] \in \mathbb{R}^{D \times n}$,
find a probability density function $p(x)$ for $X$.
Dimensionality reduction
Given $n$ data points $X = [x_1, \ldots, x_n] \in \mathbb{R}^{D \times n}$,
find a low-dimensional embedding $Z = [z_1, \ldots, z_n] \in \mathbb{R}^{d \times n}$ with $d < D$.
Probabilistically, e.g. as a mixture of Gaussians (aka Gaussian mixture model), i.e. for each class we have a different Gaussian.
▸ there are many variants of this
▸ not all have an obvious probabilistic interpretation
As a graphical model:
$$z \longrightarrow x$$
with
$$p(z^i = 1 \mid \theta) := \pi_i, \qquad p(x \mid z^i = 1, \theta) := f_i(x \mid \theta_i), \qquad p(x \mid \theta) = \sum_i \pi_i \, \mathcal{N}(x \mid \mu_i, \Sigma_i)$$
▸ parameters $\theta = (\pi_1, \ldots, \pi_K, \mu_1, \ldots, \mu_K, \Sigma_1, \ldots, \Sigma_K)$
What is the probability that $x$ is in the $i$th component?
$$\tau^i := p(z^i = 1 \mid x, \theta) = \frac{p(x \mid z^i = 1, \theta)\, p(z^i = 1 \mid \theta)}{\sum_j p(x \mid z^j = 1, \theta)\, p(z^j = 1 \mid \theta)} = \frac{\pi_i \, \mathcal{N}(x \mid \mu_i, \Sigma_i)}{\sum_j \pi_j \, \mathcal{N}(x \mid \mu_j, \Sigma_j)}$$
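As a sanity check, this posterior is only a few lines of NumPy/SciPy; a minimal sketch with made-up parameter values (all variable names are illustrative, not from the lecture):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Toy mixture with K = 2 components in D = 2 dimensions (illustrative values).
pi = np.array([0.4, 0.6])                        # mixing weights pi_i
mus = [np.zeros(2), np.array([2.0, 2.0])]        # means mu_i
Sigmas = [np.eye(2), 0.5 * np.eye(2)]            # covariances Sigma_i

x = np.array([1.0, 1.5])                         # query point

# Unnormalized posteriors pi_i * N(x | mu_i, Sigma_i) ...
weighted = np.array([pi[i] * multivariate_normal.pdf(x, mus[i], Sigmas[i])
                     for i in range(2)])
# ... normalized by the mixture density sum_j pi_j * N(x | mu_j, Sigma_j).
tau = weighted / weighted.sum()                  # responsibilities tau^i
print(tau, tau.sum())                            # tau sums to one
```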
To fit the parameters we would like to maximize the log likelihood $\ell(\theta) = \sum_n \log \sum_i \pi_i \, \mathcal{N}(x_n \mid \mu_i, \Sigma_i)$.
▸ difficulty: we cannot exchange the inner sum and the logarithm to decouple the parameters
▸ this maximization is a nonlinear problem without a closed-form solution
▸ it can be solved with nonlinear optimization
▸ better approach: introduce latent variables and apply the EM algorithm to Gaussian mixture models
▸ as a motivation, let us first discuss the K-means algorithm
[Figure: Illustration of the K-means algorithm using the re-scaled Old Faithful data set, copied from Fig. 9.1 of Bishop. (a) Green points denote the data set in a two-dimensional Euclidean space; the initial choices for the centres $\mu_1$ and $\mu_2$ are shown by the red and blue crosses. (b) In the initial E step, each data point is assigned either to the red or to the blue cluster.]
EM algorithm for the Gaussian mixture model (1)
Motivated by K-means:
▸ view $z_n$ as random assignment variables (as above)
If we knew the values $z_n$ for each point $x_n$:
▸ estimate the means simply by
$$\hat{\mu}_i = \frac{\sum_n z_n^i x_n}{\sum_n z_n^i}$$
as before in K-means.
However, since we do not know the values $z_n$:
▸ consider the conditional expectation of $z_n^i$ instead:
$$\hat{\mu}_i = \frac{\sum_n \tau_n^i x_n}{\sum_n \tau_n^i}$$
EM algorithm for the Gaussian mixture model (2)
Hard assignment:
$$\hat{\mu}_i = \frac{\sum_n z_n^i x_n}{\sum_n z_n^i}$$
Soft assignment:
$$\hat{\mu}_i = \frac{\sum_n \tau_n^i x_n}{\sum_n \tau_n^i} \qquad \text{with} \qquad \tau_n^i = \frac{\pi_i \, \mathcal{N}(x_n \mid \mu_i, \Sigma_i)}{\sum_j \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}$$
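A vectorized sketch of the soft assignment (E-step) and the resulting mean update, assuming data X of shape (n, D); the helper names are illustrative, not from the lecture:

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, pi, mus, Sigmas):
    """Responsibilities tau[n, i] = pi_i N(x_n|mu_i, Sigma_i) / sum_j pi_j N(x_n|mu_j, Sigma_j)."""
    n, K = X.shape[0], len(pi)
    tau = np.empty((n, K))
    for i in range(K):
        tau[:, i] = pi[i] * multivariate_normal.pdf(X, mus[i], Sigmas[i])
    tau /= tau.sum(axis=1, keepdims=True)    # normalize over components
    return tau

def soft_means(X, tau):
    """mu_i = sum_n tau[n, i] x_n / sum_n tau[n, i] (the soft mean update)."""
    return (tau.T @ X) / tau.sum(axis=0)[:, None]
```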
[Figure: Illustration of the EM algorithm using the Old Faithful data set, as used for the illustration of the K-means algorithm in Figure 9.1; panels show the fit after $L = 1, 2, 5, 20$ iterations. Copied from Fig. 9.8 of Bishop; see the text there for details.]
EM algorithm for the Gaussian mixture model (5)
So far, so good, but:
▸ Does the EM algorithm for the Gaussian mixture model really maximize the log likelihood?
Taking the derivative with respect to $\mu_i$ (abbreviating $f_i(x \mid \theta_i) = \mathcal{N}(x \mid \mu_i, \Sigma_i)$):
$$\frac{\partial}{\partial \mu_i} \ell(\theta \mid D) = \ldots = \Sigma_i^{-1} \Big( \sum_n \tau_n^i x_n - \sum_n \tau_n^i \mu_i \Big) = 0 \quad\Longrightarrow\quad \mu_i = \frac{\sum_n \tau_n^i x_n}{\sum_n \tau_n^i}$$
In detail:
$$\frac{\partial}{\partial \mu_i} \ell(\theta \mid D) = \sum_n \frac{1}{\sum_j \pi_j f_j(x_n \mid \theta_j)} \, \pi_i \frac{\partial}{\partial \mu_i} f_i(x_n \mid \theta_i) \qquad \text{(the derivative picks the summand containing } \mu_i)$$
$$= \sum_n \frac{\pi_i f_i(x_n \mid \theta_i)}{\sum_j \pi_j f_j(x_n \mid \theta_j)} \, \frac{\partial}{\partial \mu_i} \log f_i(x_n \mid \theta_i) \qquad \text{(use } \tfrac{\partial}{\partial a} g(a) = g(a) \, \tfrac{\partial}{\partial a} \log g(a))$$
$$= \sum_n \tau_n^i \, \frac{\partial}{\partial \mu_i} \log f_i(x_n \mid \theta_i) \qquad \text{(plug in } \tau_n^i = \tfrac{\pi_i f_i(x_n \mid \theta_i)}{\sum_j \pi_j f_j(x_n \mid \theta_j)})$$
$$= \sum_n \tau_n^i \, \frac{\partial}{\partial \mu_i} \log \mathcal{N}(x_n \mid \mu_i, \Sigma_i) \qquad \text{(plug in the Gaussian PDF)}$$
$$= \sum_n \tau_n^i \, \Sigma_i^{-1} (x_n - \mu_i) = \Sigma_i^{-1} \Big( \sum_n \tau_n^i x_n - \sum_n \tau_n^i \mu_i \Big)$$
Setting this to zero yields exactly the mean update above.
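One way to double-check this result, including the sign, is a numerical gradient test; a minimal sketch with toy values (all names and numbers illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))                      # toy data (n = 50, D = 2)
pi = np.array([0.5, 0.5])                         # made-up mixture parameters
mus = [np.zeros(2), np.ones(2)]
Sigmas = [np.eye(2), np.eye(2)]

def loglik(mu0):
    """ell(theta) as a function of mu_0, all other parameters fixed."""
    comps = np.stack([pi[0] * multivariate_normal.pdf(X, mu0, Sigmas[0]),
                      pi[1] * multivariate_normal.pdf(X, mus[1], Sigmas[1])])
    return np.log(comps.sum(axis=0)).sum()

# Analytic gradient: Sigma_0^{-1} sum_n tau_n^0 (x_n - mu_0).
comps = np.stack([pi[i] * multivariate_normal.pdf(X, mus[i], Sigmas[i]) for i in range(2)])
tau0 = comps[0] / comps.sum(axis=0)               # responsibilities of component 0
grad = np.linalg.inv(Sigmas[0]) @ (tau0[:, None] * (X - mus[0])).sum(axis=0)

# Central finite differences for comparison.
eps, num = 1e-6, np.zeros(2)
for d in range(2):
    e = np.zeros(2); e[d] = eps
    num[d] = (loglik(mus[0] + e) - loglik(mus[0] - e)) / (2 * eps)
print(np.allclose(grad, num, atol=1e-4))          # True: the gradient formula holds
```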
EM algorithm for the Gaussian mixture model (6)
Update equations:
$$\mu_i = \frac{\sum_n \tau_n^i x_n}{\sum_n \tau_n^i}, \qquad \Sigma_i = \frac{\sum_n \tau_n^i \, (x_n - \mu_i)(x_n - \mu_i)^T}{\sum_n \tau_n^i}, \qquad \pi_i = \frac{1}{N} \sum_n \tau_n^i$$
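In code, the three updates are plain weighted averages; a sketch under the same array conventions as above (data X of shape (n, D), responsibilities tau of shape (n, K); names illustrative):

```python
import numpy as np

def m_step(X, tau):
    """M-step: weighted mixing weight, mean and covariance per component."""
    n, D = X.shape
    Nk = tau.sum(axis=0)                      # effective counts sum_n tau_n^i
    pi = Nk / n                               # pi_i = (1/N) sum_n tau_n^i
    mus = (tau.T @ X) / Nk[:, None]           # mu_i as weighted means
    Sigmas = []
    for i in range(tau.shape[1]):
        Xc = X - mus[i]                       # data centred at mu_i
        Sigmas.append((tau[:, i, None] * Xc).T @ Xc / Nk[i])
    return pi, mus, Sigmas
```

Alternating e_step and m_step until the log likelihood stops improving gives the full EM loop.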
Mixture of Gaussians
The mixture of Gaussians is a probability distribution that is written as a weighted sum of several Gaussian distributions:
$$p(x) = \sum_{k=1}^K \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$$
Its log likelihood on data $D = \{x_1, \ldots, x_n\}$ is
$$\ell(\theta) = \log p(D \mid \theta) = \sum_{i=1}^n \log \sum_{k=1}^K \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)$$
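In practice the outer logarithm is best evaluated with a log-sum-exp for numerical stability; a sketch using SciPy (names illustrative):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_loglik(X, pi, mus, Sigmas):
    """ell(theta) = sum_i log sum_k pi_k N(x_i | mu_k, Sigma_k)."""
    # log of pi_k * N(x_i | mu_k, Sigma_k), shape (n, K)
    logp = np.stack([np.log(pi[k]) + multivariate_normal.logpdf(X, mus[k], Sigmas[k])
                     for k in range(len(pi))], axis=1)
    return logsumexp(logp, axis=1).sum()     # stable inner sum, then sum over i
```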
Setting the derivatives of $\ell(\theta)$ with respect to $\mu_k$ and $\Sigma_k$ to zero:
$$\frac{\partial}{\partial \mu_k} \ell(\theta) = \sum_{i=1}^n \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^K \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)} \, \Sigma_k^{-1} (x_i - \mu_k) = 0$$
$$\frac{\partial}{\partial \Sigma_k} \ell(\theta) = \sum_{i=1}^n \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^K \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)} \, (\ldots) = 0$$
Introduce a latent cluster variable $z$ with
$$p(z = k \mid \theta) = \pi_k, \qquad p(x \mid z = k, \theta) = \mathcal{N}(x \mid \mu_k, \Sigma_k)$$
A: Yes!
▸ Let's write down the so-called complete data log likelihood for
$$D_c = \{(x_1, z_1), \ldots, (x_n, z_n)\}$$
using the Iverson bracket
$$[F] = \begin{cases} 1 & \text{if } F \text{ is true} \\ 0 & \text{otherwise} \end{cases} \qquad \text{e.g.} \qquad [z = k] = \begin{cases} 1 & \text{if } z = k \\ 0 & \text{otherwise} \end{cases}$$
$$\ell_c(\theta) = \log p(D_c \mid \theta) = \sum_{i=1}^n \sum_{k=1}^K [z_i = k] \big( \log \pi_k + \log \mathcal{N}(x_i \mid \mu_k, \Sigma_k) \big)$$
M-step of k-means
Setting the derivative of $\ell_c(\theta)$ wrt. $\mu_k$ to zero we obtain
$$\mu_k = \frac{\sum_{i=1}^n [z_i = k] \, x_i}{\sum_{i=1}^n [z_i = k]},$$
the mean of the points currently assigned to cluster $k$.
E-step of k-means
Similarly, choosing the assignments $z_i$ in $D_c = \{(x_1, z_1), \ldots, (x_n, z_n)\}$ that maximize $\ell_c(\theta)$ assigns each $x_i$ to its most probable cluster. A sketch of both steps follows below.
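A minimal sketch of the two K-means steps in this notation (illustrative names; assumes every cluster keeps at least one point):

```python
import numpy as np

def kmeans_e_step(X, mus):
    """E-step: assign each x_n to its closest centre."""
    d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)  # squared distances (n, K)
    return d2.argmin(axis=1)                                   # hard assignments z_n

def kmeans_m_step(X, z, K):
    """M-step: mu_k = mean of the points currently assigned to cluster k."""
    return np.stack([X[z == k].mean(axis=0) for k in range(K)])
```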
Responsibilities
Given a mixture of Gaussians (i.e. for fixed $\theta$) we can calculate the probability that $x_i$ is in cluster $k$, i.e. that $z_i = k$:
$$\tau_{ik} := p(z_i = k \mid x_i, \theta) = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_j \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}$$
Note:
▸ $\pi_k$ can be seen as a prior probability.
▸ $\tau_{ik}$ can be seen as a posterior probability for location $x_i$.
$$Q(\theta, \theta^0) = \sum_{i=1}^n \sum_{k=1}^K \underbrace{\tau_{ik}}_{\text{depends on } \theta^0} \, \underbrace{\big( \log \pi_k + \log \mathcal{N}(x_i \mid \mu_k, \Sigma_k) \big)}_{\text{depends on } \theta}$$
▸ The responsibilities $\tau_{ik}$ depend on $\theta^0$.
▸ The joint log probabilities
$$\log p(z_i = k, x_i) = \log p(z_i = k) + \log p(x_i \mid z_i = k) = \log \pi_k + \log \mathcal{N}(x_i \mid \mu_k, \Sigma_k)$$
depend on $\theta$.
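To make the two dependencies explicit in code, here is a sketch that evaluates $Q(\theta, \theta^0)$, with the responsibilities precomputed from $\theta^0$ (function and variable names are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def q_function(X, tau_old, pi, mus, Sigmas):
    """Q(theta, theta0) = sum_i sum_k tau_ik (log pi_k + log N(x_i | mu_k, Sigma_k)),
    where tau_old was computed with the previous parameters theta0 (E-step)."""
    K = len(pi)
    logjoint = np.stack([np.log(pi[k]) + multivariate_normal.logpdf(X, mus[k], Sigmas[k])
                         for k in range(K)], axis=1)   # depends on theta
    return (tau_old * logjoint).sum()                  # tau_old depends on theta0
```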
Maximizing $Q(\theta, \theta^0)$ wrt. the parameters gives
$$\pi_k = \frac{1}{n} \sum_{i=1}^n \tau_{ik}, \qquad \mu_k = \frac{\sum_{i=1}^n \tau_{ik} \, x_i}{\sum_{i=1}^n \tau_{ik}}, \qquad \Sigma_k = \frac{\sum_{i=1}^n \tau_{ik} \, (x_i - \mu_k)(x_i - \mu_k)^T}{\sum_{i=1}^n \tau_{ik}}$$
In words: "Maximize the expected complete log likelihood wrt. the parameters $\theta$."
$$\tau_{ik} = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_j \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}$$
In words: "Compute the responsibilities, i.e. the posterior probabilities over the cluster assignments, for the current parameters."
For an arbitrary distribution $q(z)$ over the latent variables,
$$\ell(\theta) = \log \sum_z q(z) \, \frac{p(x, z \mid \theta)}{q(z)} \;\geq\; \sum_z q(z) \log \frac{p(x, z \mid \theta)}{q(z)} =: L(\theta, q)$$
The last expression $L(\theta, q)$ is a lower bound on the incomplete log likelihood.
▸ Q: How does a lower bound help us? A: Instead of maximizing $\ell(\theta)$ we maximize its lower bound (and hope that this also increases $\ell(\theta)$).
Reminder: what is Jensen's inequality? For weights $\theta_i \geq 0$ with $\sum_i \theta_i = 1$, a convex function $f$ satisfies
$$f\Big(\sum_i \theta_i x_i\Big) \leq \sum_i \theta_i f(x_i),$$
while a concave function $g$ (such as the logarithm) satisfies
$$g\Big(\sum_i \theta_i x_i\Big) \geq \sum_i \theta_i g(x_i).$$
The concave case is exactly the step used above; a small numerical check follows below.
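A quick numerical illustration of the convex case, with $f(x) = x^2$ (values illustrative):

```python
import numpy as np

theta = np.array([0.2, 0.3, 0.5])          # convex weights, sum to one
x = np.array([-1.0, 0.5, 2.0])
f = lambda t: t ** 2                       # a convex function

lhs = f(theta @ x)                         # f(sum_i theta_i x_i)
rhs = theta @ f(x)                         # sum_i theta_i f(x_i)
print(lhs <= rhs)                          # True: Jensen for convex f
```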
The lower bound can be rewritten as the expected complete data log likelihood plus an entropy term:
$$\ell(\theta) \geq L(\theta, q) = \sum_z q(z) \log \frac{p(x, z \mid \theta)}{q(z)} = \sum_z q(z) \log p(x, z \mid \theta) - \sum_z q(z) \log q(z) = E_{z \sim q} \, \ell_c(\theta) + H(q)$$
Alternatively, it can be decomposed using the posterior $p(z \mid x, \theta)$:
$$L(\theta, q) = \sum_z q(z) \log \frac{p(x, z \mid \theta)}{q(z)} = \sum_z q(z) \log \frac{p(z \mid x, \theta) \, p(x \mid \theta)}{q(z)}$$
$$= \sum_z q(z) \log \frac{p(z \mid x, \theta)}{q(z)} + \sum_z q(z) \log p(x \mid \theta)$$
$$= -\mathrm{KL}(q(z) \,\|\, p(z \mid x, \theta)) + \log p(x \mid \theta) = -\mathrm{KL}(q(z) \,\|\, p(z \mid x, \theta)) + \ell(\theta)$$
where the Kullback-Leibler divergence is defined as
$$\mathrm{KL}(q \,\|\, p) = \sum_z q(z) \log \frac{q(z)}{p(z)} = -\sum_z q(z) \log \frac{p(z)}{q(z)}$$
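The resulting identity $\ell(\theta) = L(\theta, q) + \mathrm{KL}(q \,\|\, p(z \mid x, \theta))$ can be checked numerically for a discrete latent variable; a toy sketch (all numbers illustrative):

```python
import numpy as np

# Toy joint p(x, z | theta) for a fixed observed x and z in {0, 1}.
p_joint = np.array([0.12, 0.28])               # p(x, z | theta)
p_x = p_joint.sum()                            # p(x | theta), so ell = log p_x
posterior = p_joint / p_x                      # p(z | x, theta)

q = np.array([0.5, 0.5])                       # an arbitrary distribution over z

kl = (q * np.log(q / posterior)).sum()         # KL(q || p(z | x, theta))
lower_bound = (q * np.log(p_joint / q)).sum()  # L(theta, q)

# ell(theta) = L(theta, q) + KL(q || p(z | x, theta))
print(np.isclose(np.log(p_x), lower_bound + kl))   # True
```

Note that the bound is tight exactly when $q(z) = p(z \mid x, \theta)$, since then the KL term vanishes.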