
Machine Learning

Section 14: Gaussian mixture models, k-means, EM

Stefan Harmeling

6./8. December 2021 (WS 2021/22)



What have we seen so far?
Sections:
1. Introduction
2. Plausible reasoning and Bayes Rule
3. From Logic to Probabilities
4. Bayesian networks
5. Continuous Probabilities
6. The Gaussian distribution
7. More on distributions, models, MAP, ML
8. Linear Regression
9. Matrix Differential Calculus
10. Model selection
11. Support Vector Machines
12. PCA, kPCA
13. ISOMAP, LLE



Back to probabilities...



Unsupervised learning
Given n data points X = [x1 , . . . , xn ] ∈ RD×n .

Find a good description of the data.


If the description is succinct (short), we possibly learned something
about the data.
Probabilistically:

Density estimation
Given n data points X = [x1 , . . . , xn ] ∈ RD×n .
Find a probability density function p(x) for X .



Types of unsupervised learning (1)

Dimensionality reduction
Given n data points X = [x1 , . . . , xn ] ∈ RD×n .
find low dimensional embedding Z = [z1 , . . . , zn ] ∈ Rd×n
Probabilistically, e.g.

p(z) = . . . latent space / continuous distribution


p(x∣z) = N (x∣f (z), σ²I)

where f is a linear or nonlinear mapping from a low-dimensional space Rd into RD.
▸ there are many variants of this
▸ not all have an obvious probabilistic interpretation



Types of unsupervised learning (2)

Clustering (aka mixture models)


Given n data points X = [x1 , . . . , xn ] ∈ RD×n .
find class labels Z = [z1 , . . . , zn ] ∈ {1, 2, . . . , k }
Probabilistically, e.g.

p(z) = Cat(θ)    k-sided die / discrete distribution


p(x∣z) = N (x∣µz , Σz )

aka Gaussian mixture model, i.e. for each class we have a different
Gaussian.
▸ there are many variants of this
▸ not all have an obvious probabilistic interpretation



Types of unsupervised learning (3)
More general point of view:

Latent variable modelling


Given n data points X = [x1 , . . . , xn ] ∈ RD×n .
find latent variable Z = [z1 , . . . , zn ]

Probabilistically, e.g.

p(z) = . . . the latent variable


p(x∣z) = . . . simpler density dependent on z

As a graphical model:

z Ð→ x

(Note: today we also write x instead of X for RVs.)

How to fit a latent variable model?



▸ last time: PCA, ISOMAP, LLE, etc
▸ today: clustering, k-means, mixture models, etc



The EM Algorithm

▸ Michael Jordan, “An Introduction to Graphical Models”, 2003.


▸ Chris Bishop, “Pattern Recognition and Machine Learning”, 2007.



Intuitive motivation for mixture models (clustering)
Medical treatments:
▸ a medical doctor tries to cure ill people
plan 1. have a single, complicated procedure that fits all
plan 2. invent syndromes to split the people depending on their type of
illness, have several simple procedures
Statistical modelling:
▸ you want to model some complicated data
plan 1. have a single, complicated model that fits all data
plan 2. invent different classes, have several simple models
Mixture model:
K
p(x) = ∑ πi fi (x∣θi )
i=1

▸ with mixing components fi , each being a density


▸ with mixing proportions πi ≥ 0 and ∑i πi = 1
▸ with parameters π1 , . . . , πK , θ1 , . . . , θK
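
To make the definition concrete, here is a minimal numerical sketch (not from the slides; the parameter values are made up) that evaluates a Gaussian mixture density p(x) = ∑i πi N(x∣µi, σi²) at a point:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """Density of a 1-D Gaussian N(x | mu, sigma2)."""
    return np.exp(-0.5 * (x - mu) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)

# illustrative parameters (made up): K = 2 components
pi = np.array([0.3, 0.7])       # mixing proportions, sum to 1
mu = np.array([-2.0, 1.5])      # component means
sigma2 = np.array([0.5, 1.0])   # component variances

def mixture_pdf(x):
    """p(x) = sum_i pi_i N(x | mu_i, sigma2_i)."""
    return sum(p * gaussian_pdf(x, m, s2) for p, m, s2 in zip(pi, mu, sigma2))

print(mixture_pdf(0.0))         # density of the mixture at x = 0
```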
Represent mixture model as a graphical model
Introduce latent variable z ranging over 1, . . . , K :
z Ð→ x

▸ view z as a K -dimensional vector with binary coordinates z i ,


i.e. z i = 1 if z = i, otherwise z i = 0.
▸ define distribution of z:

p(z i = 1∣θ) ∶= πi

▸ define distribution of x given z:

p(x∣z i = 1, θ) ∶= fi (x∣θi )

▸ this implies joint distribution

p(x, z i = 1∣θ) = πi fi (x∣θi )

▸ and marginal of x (same as the mixture density):

p(x∣θ) = ∑i p(x, z i = 1∣θ) = ∑i πi fi (x∣θi )
Inference in Gaussian mixture models (1)
Gaussian mixture model:

p(x∣θ) = ∑ πi N (x∣µi , Σi )
i

▸ parameters θ = (π1 , . . . , πK , µ1 , . . . , µK , Σ1 , . . . , ΣK )
What is the probability that x is in the ith component?

τ i ∶= p(z i = 1∣x, θ)
    = (p(x∣z i = 1, θ) P(z i = 1∣θ)) / (∑j p(x∣z j = 1, θ) P(z j = 1∣θ))
    = πi N (x∣µi , Σi ) / ∑j πj N (x∣µj , Σj )

▸ thus the model allows us to classify a data point x


▸ πi can be seen as a prior probability, τ i as a posterior probability
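
As an illustration, a small sketch of this posterior computation (assuming 1-D components; the parameter values are made up):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    return np.exp(-0.5 * (x - mu) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)

pi = np.array([0.3, 0.7])        # prior probabilities pi_i
mu = np.array([-2.0, 1.5])
sigma2 = np.array([0.5, 1.0])

def responsibilities(x):
    """tau_i = pi_i N(x | mu_i, Sigma_i) / sum_j pi_j N(x | mu_j, Sigma_j)."""
    weighted = pi * gaussian_pdf(x, mu, sigma2)   # numerator for every component
    return weighted / weighted.sum()              # normalize -> posterior over components

tau = responsibilities(0.0)
print(tau, tau.sum())   # posterior class probabilities for x = 0, they sum to 1
```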



Inference in Gaussian mixture models (2)
Given i.i.d. data D = (x1 , . . . , xN ) estimate the parameters θ:
▸ MLE: maximize log-likelihood function wrt. θ:

l(θ∣D) = log p(D∣θ)


= log ∏ ∑ πi fi (xn ∣θi )
n i
= ∑ log ∑ πi fi (xn ∣θi )
n i

▸ difficulty: can not exchange the inner sum and the logarithm to
decouple the parameters
▸ this maximization is a nonlinear problem without closed-form
solution
▸ can be solved with nonlinear optimization
▸ better approach: introduce latent variables, and apply the EM
algorithm to Gaussian mixture models
▸ for a motivation let us discuss the K-means algorithm



The K-means algorithm (1)
Group data points D = (x1 , . . . , xN ) into K clusters which are
characterized by K means µ1 , . . . , µK

Algorithm 14.1 (k-means clustering)


▸ randomly initialize the K means µ1 , . . . , µK
▸ repeat until convergence
1. assign each data point xn to its closest cluster:

zni = { 1   if i = arg minj ∣∣xn − µj ∣∣2
      { 0   otherwise

where zn are categorical variables as above


2. recompute the means
µi = (∑n zni xn) / (∑n zni)

▸ one can show that K-means minimizes the distortion measure:

J = ∑n ∑i zni ∣∣xn − µi ∣∣2
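
A minimal sketch of Algorithm 14.1 in Python (the toy data and the initialization at randomly chosen data points are assumptions for illustration, not part of the slides):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Minimal k-means: X has shape (n, D); returns means (K, D) and labels (n,)."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]   # init means at random data points
    for _ in range(n_iter):
        # step 1: assign each data point to its closest mean (hard assignment z_n)
        dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (n, K) squared distances
        z = dist.argmin(axis=1)
        # step 2: recompute each mean as the average of its assigned points
        new_mu = np.array([X[z == k].mean(axis=0) if np.any(z == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):                      # converged
            break
        mu = new_mu
    return mu, z

# toy data (made up): two blobs in 2-D
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(5.0, 1.0, (50, 2))])
mu, z = kmeans(X, K=2)
print(mu)
```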

The K-means algorithm (2)


[Figure: Illustration of the K-means algorithm using the re-scaled Old Faithful data set, panels (a)–(i). Green points denote the data set in a two-dimensional Euclidean space; the initial choices for the centres µ1 and µ2 are shown by the red and blue crosses. Copied from Fig. 9.1 of Bishop.]
EM algorithm for the Gaussian mixture model (1)
Motivated by K-means:
▸ view zn as random assignment variables (as above)
If we knew the values zn for each point xn
▸ estimate the means simply by:

µ̂i = (∑n zni xn) / (∑n zni)
as before in K-means
However, since we do not know the value zn :
▸ consider the conditional expectation of zni :

E(zni ∣xn , θ) = 1 ⋅ p(zni = 1∣xn , θ) + 0 ⋅ p(zni = 0∣xn , θ)


= p(zni = 1∣xn , θ) =∶ τni

since zni are binary variables


▸ plug these conditional expectations into the updates:

µ̂i = (∑n τni xn) / (∑n τni)
EM algorithm for the Gaussian mixture model (2)

Hard assignment (K-means):

µ̂i = (∑n zni xn) / (∑n zni)

Soft assignment:

µ̂i = (∑n τni xn) / (∑n τni)

▸ zni ∈ {0, 1}, which is a discrete set, thus hard assignment
▸ τni ∈ [0, 1], which is an interval, thus soft assignment



EM algorithm for the Gaussian mixture model (3)

Algorithm 14.2 (EM for Gaussian mixture models)


▸ randomly initialize the parameters
µ1 , . . . , µK , Σ1 , . . . , ΣK , π1 , . . . , πK
▸ repeat until convergence
1. E-step: calculate soft assignments for each data point xn :

τni = πi N (xn ∣µi , Σi ) / ∑j πj N (xn ∣µj , Σj )

2. M-step: recompute the parameters:


µi = (∑n τni xn) / (∑n τni)
Σi = (∑n τni (xn − µi )(xn − µi )ᵀ) / (∑n τni)
πi = (1/N) ∑n τni
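
A minimal sketch of Algorithm 14.2 (the use of scipy for the Gaussian density, the toy data, the initialization, and the small ridge added to Σi for numerical stability are assumptions for illustration, not part of the slides):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=50, seed=0):
    """Minimal EM for a Gaussian mixture model; X has shape (n, D)."""
    n, D = X.shape
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(n, size=K, replace=False)]     # initialize means at random data points
    Sigma = np.stack([np.eye(D) for _ in range(K)])  # initial covariances
    pi = np.full(K, 1.0 / K)                         # initial mixing proportions
    for _ in range(n_iter):
        # E-step: soft assignments tau[n, i] proportional to pi_i N(x_n | mu_i, Sigma_i)
        tau = np.column_stack([pi[i] * multivariate_normal.pdf(X, mu[i], Sigma[i])
                               for i in range(K)])
        tau /= tau.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters from the soft assignments
        Nk = tau.sum(axis=0)                         # effective number of points per component
        mu = (tau.T @ X) / Nk[:, None]
        for i in range(K):
            diff = X - mu[i]
            # small ridge for numerical stability (an assumption, not part of Algorithm 14.2)
            Sigma[i] = (tau[:, i, None] * diff).T @ diff / Nk[i] + 1e-6 * np.eye(D)
        pi = Nk / n
    return pi, mu, Sigma, tau

# toy data (made up): two blobs in 2-D
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(4.0, 1.0, (100, 2))])
pi, mu, Sigma, tau = em_gmm(X, K=2)
print(pi)
print(mu)
```

The returned tau contains the final soft assignments, which can also be used to classify each data point by taking the component with the largest responsibility.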



EM algorithm for the Gaussian mixture model (4)

[Figure: Illustration of the EM algorithm using the Old Faithful data set as used for the illustration of the K-means algorithm in Figure 9.1; panels show the fit after L = 1, 2, 5, 20 iterations. Copied from Fig. 9.8 of Bishop.]
EM algorithm for the Gaussian mixture model (5)
So far, so good, but:
▸ Does the EM algorithm for the Gaussian mixture model really
maximize the log likelihood?

l(θ∣D) = ∑n log ∑i πi fi (xn ∣θi )

Taking the derivative of l wrt. the parameters and setting it to zero:


∂/∂µi l(θ∣D) = . . . = Σi⁻¹ (∑n τni xn − ∑n τni µi ) = 0

(full derivation of the missing bits on the next page)


Rearranging, we thus get:

µi = (∑n τni xn) / (∑n τni)



∂/∂µi l(θ∣D) = ∑n ∂/∂µi log ∑j πj fj (xn ∣θj )      (exchange partial derivative and summation)
  = ∑n (1 / ∑j πj fj (xn ∣θj )) ∂/∂µi ∑j πj fj (xn ∣θj )      (use ∂/∂a log g(a) = (1/g(a)) ∂/∂a g(a))
  = ∑n (1 / ∑j πj fj (xn ∣θj )) ∂/∂µi πi fi (xn ∣θi )      (the derivative picks the summand containing µi)
  = ∑n (πi fi (xn ∣θi ) / ∑j πj fj (xn ∣θj )) ∂/∂µi log fi (xn ∣θi )      (use g(a) ∂/∂a log g(a) = ∂/∂a g(a))
  = ∑n τni ∂/∂µi log fi (xn ∣θi )      (plug in τni = πi fi (xn ∣θi ) / ∑j πj fj (xn ∣θj ))
  = ∑n τni ∂/∂µi log N (xn ∣µi , Σi )      (plug in the Gaussian PDF)
  = ∑n τni Σi⁻¹ (xn − µi )
  = Σi⁻¹ ∑n τni (xn − µi )
  = Σi⁻¹ (∑n τni xn − ∑n τni µi )
EM algorithm for the Gaussian mixture model (6)

Setting all derivatives of l wrt. the parameters to zero yields:

µi = (∑n τni xn) / (∑n τni)
Σi = (∑n τni (xn − µi )(xn − µi )ᵀ) / (∑n τni)
πi = (1/N) ∑n τni

▸ exactly the same as in the previous more heuristic derivation!


▸ however, since τni depends on the parameters as well, this is just a
set of coupled, nonlinear equations



EM algorithm for the Gaussian mixture model (7)

Update equations:

µi = (∑n τni xn) / (∑n τni)
Σi = (∑n τni (xn − µi )(xn − µi )ᵀ) / (∑n τni)
πi = (1/N) ∑n τni

Three ways to derive them:


1. replace unknown assignments with their expected values, i.e. hard
assignments in K-means with soft-assignments (quite heuristic)
2. set derivatives of the log-likelihood to zero (somewhat heuristic)
3. NEW (more rigorous): consider expected complete log likelihood



We consider the following model:

Mixture of Gaussians
The mixture of Gaussians is a probability distribution which is written
as the weighted sum of several Gaussian distributions:
K
p(x) = ∑ πk N (x∣µk , Σk )
k =1

The parameters are π1 , . . . , πK , µ1 , . . . , µK , Σ1 , . . . , ΣK , which are


abbreviated as θ, thus we sometimes write p(x∣θ).
Question:
▸ Given n data points D = {x1 , . . . , xn } ⊂ RD . How can we estimate
the parameters?



Estimate the parameters via maximum likelihood ?
Given n data points D = {x1 , . . . , xn } ⊂ RD .

Log-likelihood for mixture of Gaussians

n K
`(θ) = log p(D∣θ) = ∑ log ∑ πk N (xi ∣µk , Σk )
i=1 k =1

Derivatives of the log-likelihood


∂/∂πk `(θ) = ∑_{i=1}^n (πk N (xi ∣µk , Σk ) / ∑_{j=1}^K πj N (xi ∣µj , Σj )) (1/πk) = 0      (side constraint ∑k πk = 1 missing)
∂/∂µk `(θ) = ∑_{i=1}^n (πk N (xi ∣µk , Σk ) / ∑_{j=1}^K πj N (xi ∣µj , Σj )) Σk⁻¹ (xi − µk ) = 0
∂/∂Σk `(θ) = ∑_{i=1}^n (πk N (xi ∣µk , Σk ) / ∑_{j=1}^K πj N (xi ∣µj , Σj )) (. . .) = 0

Complicated system of equations, since all parameters are coupled.


(Last time we cheated a bit by assuming πk N (xi ∣µk , Σk ) / ∑_{j=1}^K πj N (xi ∣µj , Σj ) to be constant.)
More precisely. . .
Constrained optimization:
Maximize `(θ) s.t. ∑k πk = 1, which gives us the Lagrangian:
L(θ, λ) = `(θ) − λ (∑k πk − 1) = ∑_{i=1}^n log ∑_{k=1}^K πk N (xi ∣µk , Σk ) − λ (∑k πk − 1)

Derivatives of the Lagrangian wrt. θ


∂/∂πk L(θ) = ∑_{i=1}^n (πk N (xi ∣µk , Σk ) / ∑_{j=1}^K πj N (xi ∣µj , Σj )) (1/πk) − λ = 0
∂/∂µk L(θ) = ∑_{i=1}^n (πk N (xi ∣µk , Σk ) / ∑_{j=1}^K πj N (xi ∣µj , Σj )) Σk⁻¹ (xi − µk ) = 0
∂/∂Σk L(θ) = ∑_{i=1}^n (πk N (xi ∣µk , Σk ) / ∑_{j=1}^K πj N (xi ∣µj , Σj )) (. . .) = 0

Complicated system of equations, since all parameters are coupled.


(Last time we cheated a bit by assuming πk N (xi ∣µk , Σk ) / ∑_{j=1}^K πj N (xi ∣µj , Σj ) to be constant.)



This can be optimized by adding more unknowns!

Mixture of Gaussians as latent variable model


Introduce latent variable z ∈ {1, . . . , K } with

p(z = k ∣θ) = πk
p(x∣z = k , θ) = N (x∣µk , Σk )

These choices let us rewrite the mixture of Gaussians:


p(x∣θ) = ∑_{k=1}^K p(x, z = k ∣θ)
       = ∑_{k=1}^K p(z = k ∣θ) p(x∣z = k , θ)
       = ∑_{k=1}^K πk N (x∣µk , Σk )

Now we have θ and z as unknowns.



Again...

Given n data points D = {x1 , . . . , xn } ⊂ RD .


▸ Q: Does the introduced latent variable help us estimate the parameters of the mixture of Gaussians?

A: Not immediately, since we do not know the values


z1 , . . . , zn ∈ {1, . . . , K } of the latent variable.

▸ Q: Does the problem become easier if we knew these values?

A: Yes!
▸ Let’s write down the so-called complete data log likelihood.



Complete data log likelihood
The locations together with the values of the latent variables are called
the complete data:

Dc = {(x1 , z1 ), . . . , (xn , zn )}

The complete data log likelihood can be written as


`c (θ) = log p(Dc ∣θ) = ∑_{i=1}^n log p(xi , zi ∣θ) = ∑_{i=1}^n log (πzi N (xi ∣µzi , Σzi ))
       = ∑_{i=1}^n log ∏_{k=1}^K (πk N (xi ∣µk , Σk ))^[zi =k]
       = ∑_{i=1}^n ∑_{k=1}^K log (πk N (xi ∣µk , Σk ))^[zi =k]
       = ∑_{i=1}^n ∑_{k=1}^K [zi = k ] (log πk + log N (xi ∣µk , Σk ))
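
A small numerical sketch of `c (θ) for made-up 1-D complete data (all values are illustrative assumptions); since the labels zi are known, each data point simply contributes log πzi + log N (xi ∣µzi , Σzi ):

```python
import numpy as np

def log_gaussian(x, mu, sigma2):
    """log N(x | mu, sigma2) for 1-D data."""
    return -0.5 * (np.log(2 * np.pi * sigma2) + (x - mu) ** 2 / sigma2)

# made-up complete data: locations x_i together with their latent labels z_i
x = np.array([-2.1, -1.8, 1.2, 1.7])
z = np.array([0, 0, 1, 1])            # zero-based class labels
pi = np.array([0.5, 0.5])
mu = np.array([-2.0, 1.5])
sigma2 = np.array([1.0, 1.0])

# l_c(theta) = sum_i ( log pi_{z_i} + log N(x_i | mu_{z_i}, Sigma_{z_i}) )
ell_c = np.sum(np.log(pi[z]) + log_gaussian(x, mu[z], sigma2[z]))
print(ell_c)
```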



As a reminder:
Iverson brackets
The Iverson bracket is a handy notation for the indicator function of a
statement:

[F ] = { 1   if F is true
       { 0   otherwise

where F is a statement that is either true or false. E.g. for z = k

[z = k ] = { 1   if z = k is true
           { 0   otherwise



k-means as a special mixture of Gaussians
Fixing πk = 1/K and Σk = I for all k we get

M-step of k -means
Setting the derivative of `c (θ) wrt. µk to zero we obtain

µk = (∑_{i=1}^n [zi = k ] xi) / (∑_{i=1}^n [zi = k ])

which updates the means given cluster assignments zi for all i.

E-step of k -means
Similarly choosing

zi = argmaxk log N (xi ∣µk , Σk )


= argmink ∥xi − µk ∥2

we maximize `c (θ) wrt. zi .



Back to the
Complete data log likelihood
The locations together with the values of the latent variables are called
the complete data:

Dc = {(x1 , z1 ), . . . , (xn , zn )}

The complete data log likelihood can be written as


n K
`c (θ) = ∑ ∑ [zi = k ] (log πk + log N (xi ∣µk , Σk ))
i=1 k =1

▸ In practice we do not have the complete data.


▸ Now we have a missing data problem.
▸ However, given the incomplete data and the model parameters θ
we can derive a posterior distribution of the unknown latent
variables.



Given n data points D = {x1 , . . . , xn } ⊂ RD .

Responsibilities
Given a mixture of Gaussians (i.e. for fixed θ) we can calculate the
probability that xi is in cluster k , i.e. that zi = k

τik = p(zi = k ∣xi , θ)
    = (p(xi ∣zi = k , θ) P(zi = k ∣θ)) / (∑j p(xi ∣zi = j, θ) P(zi = j∣θ))
    = πk N (xi ∣µk , Σk ) / ∑j πj N (xi ∣µj , Σj )

Note:
▸ πk can be seen as a prior probability.
▸ τik can be seen as a posterior probability for location xi .



Given the distributions p(zi ∣xi , θ0 ) for the latent variables we calculate
the:
Posterior expectation of [zi = k ]
For a single data point the posterior expectation of [zi = k ] is
E_{zi ∼p(zi ∣xi ,θ0 )} [zi = k ] = ∑_{j=1}^K p(zi = j∣xi , θ0 ) [j = k ] = τik

where θ0 are some fixed parameters.


Using this, the expectation of the complete data log likelihood is
Q(θ, θ0 ) = E_{z∼p(z∣x,θ0 )} ∑_{i=1}^n ∑_{k=1}^K [zi = k ] (log πk + log N (xi ∣µk , Σk ))
          = ∑_{i=1}^n ∑_{k=1}^K (E_{z∼p(z∣x,θ0 )} [zi = k ]) (log πk + log N (xi ∣µk , Σk ))
          = ∑_{i=1}^n ∑_{k=1}^K τik (log πk + log N (xi ∣µk , Σk ))



Expected complete data log likelihood

Q(θ, θ0 ) = ∑_{i=1}^n ∑_{k=1}^K τik (log πk + log N (xi ∣µk , Σk ))

where τik depends on θ0 and the term (log πk + log N (xi ∣µk , Σk )) depends on θ.

▸ The responsibilities

τik = p(zi = k ∣xi , θ0 )

depend on θ0 .
▸ The joint log probabilities

log p(zi , xi ) = log p(zi ) + log p(xi ∣zi ) = log πk + log N (xi ∣µk , Σk )

depend on θ.
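
A small sketch (1-D, made-up numbers) that evaluates Q(θ, θ0); for simplicity the responsibilities are computed from the same parameters that are plugged into the joint log probabilities, i.e. θ0 = θ:

```python
import numpy as np

def log_gaussian(x, mu, sigma2):
    """log N(x | mu, sigma2) for 1-D components."""
    return -0.5 * (np.log(2 * np.pi * sigma2) + (x - mu) ** 2 / sigma2)

# made-up data and parameters (K = 2 components, 1-D)
x = np.array([-2.1, -1.8, 1.2, 1.7])
pi = np.array([0.5, 0.5])
mu = np.array([-2.0, 1.5])
sigma2 = np.array([1.0, 1.0])

# log pi_k + log N(x_i | mu_k, Sigma_k), shape (n, K)
log_joint = np.log(pi) + log_gaussian(x[:, None], mu, sigma2)

# responsibilities tau[i, k] from theta_0 (here theta_0 = theta), computed stably
tau = np.exp(log_joint - log_joint.max(axis=1, keepdims=True))
tau /= tau.sum(axis=1, keepdims=True)

# Q(theta, theta_0) = sum_i sum_k tau_ik (log pi_k + log N(x_i | mu_k, Sigma_k))
Q = np.sum(tau * log_joint)
print(Q)
```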



M-step for mixture of Gaussians
For fixed θ0 the derivatives of Q(θ, θ0 ) wrt θ have a simple form and
imply parameter updates:

πk = (1/n) ∑_{i=1}^n τik
µk = (∑_{i=1}^n τik xi) / (∑_{i=1}^n τik)
Σk = (∑_{i=1}^n τik (xi − µk )(xi − µk )ᵀ) / (∑_{i=1}^n τik)

In words:
“Maximize the expected complete log likelihood
wrt. the parameters θ”



E-step for mixture of Gaussians
For fixed θ we update the responsibilities:

τik = πk N (xi ∣µk , Σk ) / ∑j πj N (xi ∣µj , Σj )

In words:

“Calculate the Expected complete log likelihood


by recomputing the responsibilities”
Notes:
▸ The E-step updates θ0 only indirectly, by updating the responsibilities.
▸ The M-step updates θ.
▸ The M-step has been derived from derivatives.
▸ Does the E-step maximize Q(θ, θ0 )?



EM from a more general point of view



General setting of EM algorithm
▸ x is an observable variable
▸ z is a latent variable (non-observable)
▸ p(x, z∣θ) joint probabilistic model
▸ For simplicity we consider only a single data point x. The following
easily generalizes to several points.
(Incomplete) log likelihood:

`(θ) = log p(x∣θ) = log ∑ p(x, z∣θ)


z

We can extend this expression as follows:

`(θ) = log ∑z q(z) p(x, z∣θ)/q(z)

where q(z) is any distribution over z.


Note: summation may be replaced by integration.



Lower bound for the incomplete data log likelihood
Apply Jensen’s inequality to exchange summation and logarithm (a
concave function)

`(θ) = log ∑z q(z) p(x, z∣θ)/q(z)
     ≥ ∑z q(z) log (p(x, z∣θ)/q(z)) =∶ L(θ, q)

The last expression L(θ, q) is a lower bound for the incomplete log
likelihood.
▸ Q: How does a lower bound help us? A: Instead of maximizing
`(θ) we would maximize its lower bound (and hope that this is
maximizing `(θ) as well).
Reminder: what is Jensen’s inequality?



Jensen’s inequality (1)
Let f be a convex function. There are several versions:
▸ By definition of convexity: for any x, y and 0 ≤ α ≤ 1, the function
value at the convex combination of x and y is at most as large as
the corresponding convex combination of f (x) and f (y ),

f (αx + (1 − α)y ) ≤ αf (x) + (1 − α)f (y )

▸ More generally for ∑i θi = 1 for θi ≥ 0 we have

f (∑ θi xi ) ≤ ∑ θi f (xi )
i i

▸ For a PDF p(x) with ∫ p(x) dx = 1 we have

f (∫ p(x) x dx) ≤ ∫ p(x)f (x)dx

f (E x) ≤ E f (x) written as expectations



If f is convex, −f is concave.
Jensen’s inequality (2)
▸ A function f is convex, if the area above the function is convex.
▸ A function g is concave, if the area below the function is convex.
Or: if −g is convex.
Thus we can write down Jensen’s inequality for a convex function f and
similarly for a concave function g (like the logarithm):

f (∑ θi xi ) ≤ ∑ θi f (xi )
i i

g (∑ θi xi ) ≥ ∑ θi g(xi )
i i

Q: can you draw a function that is neither convex nor concave?
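
A quick numerical sanity check of Jensen's inequality for the concave logarithm (the direction used for the lower bound); the weights and points below are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.dirichlet(np.ones(5))      # convex weights: theta_i >= 0, summing to 1
x = rng.uniform(0.1, 10.0, size=5)     # arbitrary positive points

lhs = np.log(np.sum(theta * x))        # g(sum_i theta_i x_i) for the concave g = log
rhs = np.sum(theta * np.log(x))        # sum_i theta_i g(x_i)
print(lhs, rhs, lhs >= rhs)            # Jensen for concave g: lhs >= rhs (prints True)
```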



Back to our lower bound: L(θ, q)



Deep insight into the M-step

How the lower bound is related to the M-step


Let’s rewrite L(θ, q):

`(θ) ≥ L(θ, q) = ∑z q(z) log (p(x, z∣θ)/q(z))
               = ∑z q(z) log p(x, z∣θ) − ∑z q(z) log q(z)
               = Ez∼q `c (θ) + H(q)

where H(q) is the entropy of q. The first summand is the expected


complete data log likelihood.
Conclusion:
▸ The M-step maximizes the lower bound L(θ, q) for fixed q by
maximizing Ez∼q `c (θ) in θ, since the entropy of q does not depend
on θ.



Entropy
The quantity

H(q) = − ∑ q(z) log q(z)


z

is called the entropy. It is a measure of the randomness of random


variable z.
Entropy is the essential concept in information theory.
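
A small sketch computing the entropy of a few discrete distributions (base-2 logarithm is an assumed choice here, so the result is in bits):

```python
import numpy as np

def entropy(q, base=2):
    """H(q) = -sum_z q(z) log q(z); terms with q(z) = 0 contribute 0."""
    q = np.asarray(q, dtype=float)
    nz = q > 0
    return -np.sum(q[nz] * np.log(q[nz])) / np.log(base)

print(entropy([0.5, 0.5]))    # fair coin: 1 bit
print(entropy([0.9, 0.1]))    # biased coin: less random, lower entropy (about 0.47 bits)
print(entropy([1.0, 0.0]))    # deterministic outcome: 0 bits
```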



Entropy — more insights
Q: Why take logs of probabilities?
1. Consider a single fair coin. Once we know the outcome we have
received one bit which is:
− log2 0.5 = 1
Entropy is the expected amount of information in a random
experiment
X ∼ p(x)
H(X ) = E(− log p(X )) = − ∑ p(x) log p(x)
x

2. Consider two independent random experiments X ∼ p(x) and


Y ∼ p(y ), i.e. p(x, y ) = p(x)p(y ). The information (as measured
by entropy) should be additive.
H(X , Y ) = − ∑xy p(x, y ) log p(x, y ) = − ∑xy p(x, y ) log p(x)p(y )
          = − ∑xy p(x, y ) (log p(x) + log p(y ))
          = − ∑x p(x) log p(x) − ∑y p(y ) log p(y ) = H(X ) + H(Y )
Deep insight into the E-step
How the lower bound is related to the E-step (1)
Let’s rewrite L(θ, q) again:

`(θ) ≥ L(θ, q) = ∑z q(z) log (p(x, z∣θ)/q(z))
               = ∑z q(z) log (p(z∣x, θ) p(x∣θ)/q(z))
               = ∑z q(z) log (p(z∣x, θ)/q(z)) + ∑z q(z) log p(x∣θ)
               = −KL(q(z)∣∣p(z∣x, θ)) + log p(x∣θ)
               = −KL(q(z)∣∣p(z∣x, θ)) + `(θ)

where KL is the Kullback-Leibler divergence.


Thus the gap between the lower bound and `(θ) is the KL-divergence
between q(z) and p(z∣x, θ):

`(θ) − L(θ, q) = KL(q(z)∣∣p(z∣x, θ)) ≥ 0
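
A quick numerical check of this identity on a tiny two-component Gaussian mixture with an arbitrary q (all numbers below are made up):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    return np.exp(-0.5 * (x - mu) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)

# tiny mixture model p(x, z | theta) with K = 2 (made-up parameters), a single point x
pi = np.array([0.4, 0.6])
mu = np.array([-1.0, 2.0])
sigma2 = np.array([1.0, 0.5])
x = 0.3

joint = pi * gaussian_pdf(x, mu, sigma2)   # p(x, z | theta) for z = 1, 2
p_x = joint.sum()                          # p(x | theta)
post = joint / p_x                         # p(z | x, theta)

q = np.array([0.8, 0.2])                   # an arbitrary distribution over z

ell = np.log(p_x)                          # l(theta)
L = np.sum(q * np.log(joint / q))          # lower bound L(theta, q)
KL = np.sum(q * np.log(q / post))          # KL(q || p(z | x, theta))

print(ell - L, KL)                         # equal, and >= 0
print(np.isclose(ell - L, KL))             # True
```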


Kullback-Leibler (KL) divergence
The quantity

KL(q∣∣p) = ∑z q(z) log (q(z)/p(z)) = − ∑z q(z) log (p(z)/q(z))

is called the Kullback-Leibler divergence. It is a measure of the


difference between two distributions q and p.

▸ KL(q∣∣p) ≥ 0 with equality for q = p (proof?)


▸ it is not a distance, because KL(q∣∣p) ≠ KL(p∣∣q)
▸ KL(q∣∣p) compares q and p where q is large
▸ KL(p∣∣q) compares q and p where p is large
▸ usually we write KL(p∣∣q) instead of KL(p, q)
▸ KL-divergence is another important concept in information theory.
▸ Entropy is related to the KL divergence to the uniform distribution: for K outcomes, H(q) = log K − KL(q∣∣uniform), so high entropy means q is close to uniform.
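
A small sketch computing KL divergences for discrete distributions (made-up values), illustrating non-negativity, asymmetry, and the relation to entropy mentioned above:

```python
import numpy as np

def kl(q, p):
    """KL(q || p) = sum_z q(z) log(q(z) / p(z)) for discrete distributions with q(z) > 0."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    return np.sum(q * np.log(q / p))

q = np.array([0.7, 0.2, 0.1])
p = np.array([0.3, 0.3, 0.4])
u = np.full(3, 1 / 3)                        # uniform distribution over 3 outcomes

print(kl(q, p), kl(p, q))                    # both >= 0, but not equal: KL is not symmetric
print(kl(q, q))                              # 0 for identical distributions

# relation to entropy: H(q) = log K - KL(q || uniform)
H = -np.sum(q * np.log(q))
print(np.isclose(H, np.log(3) - kl(q, u)))   # True
```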



How the lower bound is related to the E-step (2)
We showed that the gap between `(θ) and L(θ, q) is the KL divergence
between q and p(z∣x, θ)

`(θ) − L(θ, q) = KL(q(z)∣∣p(z∣x, θ)) ≥ 0

The E-step closes this gap by choosing q(z) = p(z∣x, θ):

`(θ) − L(θ, p(z∣x, θ)) = KL(p(z∣x, θ)∣∣p(z∣x, θ)) = 0

▸ E.g. if our current estimate of θ is θ0 , we choose q = p(z∣x, θ0 ).



Algorithm 14.3 (General form of EM)
▸ x is an observable variable
▸ z is a latent variable (non-observable)
▸ p(x, z∣θ) joint probabilistic model
1. randomly initialize θ(0)
2. repeat until convergence

E-step q (t+1) = argmaxq L(θ(t) , q) = p(z∣x, θ(t) )


M-step θ(t+1) = argmaxθ L(θ, q (t+1) ) = argmaxθ Ez∼q (t+1) `c (θ)

Viewed even more generally:


▸ The lower bound L(θ, q) is an example of an auxiliary function.



Auxiliary function
The idea of an auxiliary function is quite general:
▸ suppose maximizing some function, say `(θ) is difficult.
▸ introduce an auxiliary function L(θ, θ0 ),
1. that is easy to maximize in θ,
2. that touches `(θ), i.e. L(θ, θ) = `(θ), and
3. that lowerbounds `(θ), i.e. L(θ, θ0 ) ≤ `(θ) for any θ0 .

Then do M-steps and E-steps:

`(θ(0) ) = L(θ(0) , θ(0) )
         ≤ L(θ(1) , θ(0) )               M-step
         ≤ L(θ(1) , θ(1) ) = `(θ(1) )    E-step
         ≤ L(θ(2) , θ(1) )               M-step
         ≤ L(θ(2) , θ(2) ) = `(θ(2) )    E-step
         ≤ ...



`(θ(0) ) = L(θ(0) , θ(0) )
         ≤ L(θ(1) , θ(0) )               M-step
         ≤ L(θ(1) , θ(1) ) = `(θ(1) )    E-step
         ≤ L(θ(2) , θ(1) )               M-step
         ≤ L(θ(2) , θ(2) ) = `(θ(2) )    E-step
         ≤ ...

Then EM performs greedy hill-climbing for `(θ), i.e. we generate a


sequence θ(0) , θ(1) , θ(2) , . . . , s.t.

`(θ(0) ) ≤ `(θ(1) ) ≤ `(θ(2) ) ≤ . . .

However, we might end up in a local maximum...



Summary of k-means, EM algorithm, and beyond
Algorithm 14.4 (General form of EM)
▸ x is an observable variable
▸ z is a latent variable (non-observable)
▸ p(x, z∣θ) joint probabilistic model
1. randomly initialize θ(0)
2. repeat until convergence

E-step q (t+1) = argmaxq L(θ(t) , q) = p(z∣x, θ(t) )


M-step θ(t+1) = argmaxθ L(θ, q (t+1) ) = argmaxθ Ez∼q (t+1) `c (θ)

▸ k-means is a special case (it was heuristically motivated but is an instance of this general scheme)
▸ fitting a GMM is a special case
▸ more generally: this is an instance of optimization by an auxiliary
function
▸ by the way: this is also a solution for the missing data problem
(missing at random)
