Lecture 22
Mixture Models & EM
Philipp Hennig
6 July 2021
Faculty of Science
Department of Computer Science
Chair for the Methods of Machine Learning
content (Ex n = exercise sheet)
 1 Introduction (Ex 1)
 2 Reasoning under Uncertainty
 3 Continuous Variables (Ex 2)
 4 Monte Carlo
 5 Markov Chain Monte Carlo (Ex 3)
 6 Gaussian Distributions
 7 Parametric Regression (Ex 4)
 8 Learning Representations
 9 Gaussian Processes (Ex 5)
10 Understanding Kernels
11 Gauss-Markov Models (Ex 6)
12 An Example for GP Regression
13 GP Classification (Ex 7)
14 Generalized Linear Models
15 Exponential Families (Ex 8)
16 Graphical Models
17 Factor Graphs (Ex 9)
18 The Sum-Product Algorithm
19 Example: Modelling Topics (Ex 10)
20 Latent Dirichlet Allocation
21 k-means (Ex 11)
22 Mixture Models
23 Expectation Maximization (EM) (Ex 12)
24 Variational Inference
25 Example: Collapsed Inference (Ex 13)
26 Example: EM for Kernel Topic Models
27 Making Decisions
28 Revision
Probabilistic ML — P. Hennig, SS 2021 — Lecture 22: Mixture Models & EM — © Philipp Hennig, 2021 CC BY-NC-SA 3.0 1
k-Means Clustering
Steinhaus, H. (1957). Sur la division des corps matériels en parties. Bull. Acad. Polon. Sci. 4 (12): 801–804.
Given {x_i}_{i=1,…,n}

Init    Set k means {m_k} to random values
Assign  each datum x_i to its nearest mean. One could denote this by an integer variable

    k_i = arg min_k ∥m_k − x_i∥²

or by binary responsibilities

    r_ki = 1 if k_i = k, 0 else

Update  m_k ← (1/R_k) ∑_i r_ki x_i,  where R_k := ∑_i r_ki

[Portrait: Hugo Steinhaus, 1887–1972]
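The Init/Assign/Update loop above can be sketched in a few lines of NumPy. This is a minimal sketch under my own conventions (function name, data-point initialization, and convergence check are not from the lecture):

```python
import numpy as np

def k_means(x, k, n_iter=100, seed=0):
    """Steinhaus/Lloyd iteration: assign each datum to its nearest mean,
    then move each mean to the average of the data assigned to it."""
    rng = np.random.default_rng(seed)
    # Init: use k random data points as initial means
    m = x[rng.choice(len(x), size=k, replace=False)]
    ki = np.zeros(len(x), dtype=int)
    for _ in range(n_iter):
        # Assign: k_i = argmin_k ||m_k - x_i||^2
        d2 = ((x[:, None, :] - m[None, :, :]) ** 2).sum(-1)
        ki = d2.argmin(axis=1)
        # Update: m_k <- (1/R_k) sum_i r_ki x_i (keep empty clusters in place)
        new_m = np.array([x[ki == j].mean(axis=0) if np.any(ki == j) else m[j]
                          for j in range(k)])
        if np.allclose(new_m, m):
            break
        m = new_m
    return m, ki
```

On well-separated data the loop typically converges after a handful of iterations; the `np.allclose` check stops it once the means no longer move.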
k-means has pathologies
figures after DJC MacKay, ITILA, 2003, recreated by Ann-Kathrin Schalkamp
k-means always converges
for an interesting reason …
The existence of a Lyapunov function means that one can think about the algorithm in question as an optimization routine for J. It also guarantees convergence of the algorithm at a local (not necessarily global!) minimum of J.

[Portrait: Aleksandr M. Lyapunov, 1857–1918]
k-means always converges …
for an interesting reason …
1 procedure k-MEANS(x, k)
2     m ← RAND(k)                        ▷ initialize
3     while not converged do
4         r ← FIND(min(∥m − x∥²))        ▷ set responsibilities
5         m ← rx ⊘ r1                    ▷ set means
6     end while
7     return m
8 end procedure

Consider J(r, m) := ∑_{i=1}^{n} ∑_{k=1}^{K} r_ik ∥x_i − m_k∥²

▶ step 4 always decreases J (by definition)
▶ step 5 always decreases J, because

    ∂J(r, m)/∂m_k = −2 ∑_i r_ik (x_i − m_k) = 0  ⇒  m_k = (∑_i r_ik x_i) / (∑_i r_ik),    and    ∂²J(r, m)/∂m_k² = 2 ∑_i r_ik > 0
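The Lyapunov property can be checked numerically: recording J(r, m) after every assignment and after every mean update, the trace must never increase. A small sketch under my own setup (this is not lecture code, just an illustration of the argument):

```python
import numpy as np

def k_means_J_trace(x, k, n_iter=20, seed=0):
    """Run k-means, logging J(r, m) = sum_i ||x_i - m_{k_i}||^2
    after the assignment step (step 4) and after the mean update (step 5)."""
    rng = np.random.default_rng(seed)
    m = x[rng.choice(len(x), size=k, replace=False)]
    trace = []
    for _ in range(n_iter):
        # step 4: responsibilities minimize J over r for fixed m
        d2 = ((x[:, None, :] - m[None, :, :]) ** 2).sum(-1)
        ki = d2.argmin(axis=1)
        trace.append(d2[np.arange(len(x)), ki].sum())
        # step 5: means minimize J over m for fixed r
        for j in range(k):
            if np.any(ki == j):
                m[j] = x[ki == j].mean(axis=0)
        d2 = ((x[:, None, :] - m[None, :, :]) ** 2).sum(-1)
        trace.append(d2[np.arange(len(x)), ki].sum())
    return np.array(trace)
```

Since each step minimizes J over one argument while the other is held fixed, consecutive entries of the trace are non-increasing, which is exactly the convergence argument above.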
▶ k-means is a simple algorithm that always finds a stable clustering
▶ the resulting clusterings can be unintuitive: they capture neither the shape of the clusters nor their number, and they are subject to random fluctuations

A probabilistic interpretation of k-means yields clarity and allows fitting all parameters. As a neat side-effect, it leads to a final entry to our toolbox!
From a Loss to a Generative Model
empirical risk minimization is frequently identified with maximum likelihood

(r, m) = arg min_{r,m} ∑_{i=1}^{n} ∑_{k=1}^{K} r_ik ∥x_i − m_k∥²
       = arg max ∑_{i=1}^{n} ∑_{k=1}^{K} r_ik (−½ σ⁻² ∥x_i − m_k∥²) + const.
       = arg max ∏_{i=1}^{n} ∑_{k=1}^{K} r_ik exp(−½ σ⁻² ∥x_i − m_k∥²)/Z
       = arg max ∏_{i=1}^{n} ∑_{k=1}^{K} r_ik N(x_i; m_k, σ²I)
       = arg max p(x | m, r)

This suggests the Gaussian mixture model

p(x | π, µ, Σ) = ∏_{i=1}^{n} ∑_{j=1}^{k} π_j N(x_i; µ_j, Σ_j),    π_j ∈ [0, 1],    ∑_j π_j = 1
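The mixture density p(x | π, µ, Σ) is straightforward to evaluate numerically. A pure-NumPy sketch (the helper name and log-sum-exp formulation are my own, not from the lecture):

```python
import numpy as np

def log_gmm_likelihood(x, pi, mu, Sigma):
    """log p(x | pi, mu, Sigma) = sum_i log sum_j pi_j N(x_i; mu_j, Sigma_j).
    x: (n, d); pi: (k,); mu: (k, d); Sigma: (k, d, d)."""
    n, d = x.shape
    comp = []
    for j in range(len(pi)):
        diff = x - mu[j]
        Sinv = np.linalg.inv(Sigma[j])
        # Mahalanobis term (x_i - mu_j)^T Sigma_j^{-1} (x_i - mu_j) per datum
        maha = np.einsum('nd,de,ne->n', diff, Sinv, diff)
        log_norm = -0.5 * (d * np.log(2 * np.pi) + np.log(np.linalg.det(Sigma[j])))
        comp.append(np.log(pi[j]) + log_norm - 0.5 * maha)
    # log-sum-exp over components, then sum over data
    return np.logaddexp.reduce(np.stack(comp, axis=1), axis=1).sum()
```

Working in log space with `logaddexp` avoids the underflow that the product of many small densities would otherwise cause.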
Gaussian Mixtures
A generative model for k-means [Figure after Bishop, PRML, 2006, by Ann-Kathrin Schalkamp]

p(x | π, µ, Σ) = ∏_{i=1}^{n} ∑_{j=1}^{k} π_j N(x_i; µ_j, Σ_j),    π_j ∈ [0, 1],    ∑_j π_j = 1
Soft k-means as maximum likelihood
for the Gaussian mixture model

p(x | π, µ, Σ) = ∏_{i=1}^{n} ∑_{j=1}^{k} π_j N(x_i; µ_j, Σ_j)    (⋆)

[Graphical model: plate over k containing (µ_k, Σ_k), plate over n containing x_i]

p(π, µ, Σ | x) = p(x | π, µ, Σ) · p(π, µ, Σ) / p(x)

▶ the likelihood is not an exponential family, so there is no obvious conjugate prior
Soft k-means as maximum likelihood
for the Gaussian mixture model

Recall N(x; µ, Σ) = (2π)^{−d/2} |Σ|^{−1/2} e^{−½(x−µ)ᵀΣ⁻¹(x−µ)}. Setting the gradient of the log likelihood w.r.t. µ_j to zero,

∇_{µ_j} log p(x | π, µ, Σ) = ∑_{i=1}^{n} r_ji Σ_j⁻¹ (x_i − µ_j),    r_ji := π_j N(x_i; µ_j, Σ_j) / ∑_{j′} π_{j′} N(x_i; µ_{j′}, Σ_{j′})

∇_{µ_j} log p = 0  ⇒  µ_j = (1/R_j) ∑_i r_ji x_i,    R_j := ∑_i r_ji
Soft k-means as maximum likelihood
for the Gaussian mixture model

To maximize w.r.t. Σ, set the gradient of the log likelihood to 0 (note ∂|Σ|^{−1/2}/∂Σ = −½|Σ|^{−3/2}|Σ|Σ⁻¹ and ∂(vᵀΣ⁻¹v)/∂Σ = −Σ⁻¹vvᵀΣ⁻¹):

∇_{Σ_j} log p(x | π, µ, Σ) = −½ ∑_{i=1}^{n} [π_j N(x_i; µ_j, Σ_j) / ∑_{j′} π_{j′} N(x_i; µ_{j′}, Σ_{j′})] (Σ_j⁻¹ − Σ_j⁻¹ (x_i − µ_j)(x_i − µ_j)ᵀ Σ_j⁻¹)

where the bracketed ratio is the responsibility r_ji, and

∇_{Σ_j} log p = 0  ⇒  Σ_j = (1/R_j) ∑_i r_ji (x_i − µ_j)(x_i − µ_j)ᵀ,    R_j := ∑_i r_ji
Soft k-means as maximum likelihood
for the Gaussian mixture model

To maximize w.r.t. π, enforce ∑_j π_j = 1 by introducing a Lagrange multiplier λ and optimizing

∇_{π_j} [log p(x | π, µ, Σ) + λ(∑_j π_j − 1)] = ∑_{i=1}^{n} N(x_i; µ_j, Σ_j) / ∑_{j′} π_{j′} N(x_i; µ_{j′}, Σ_{j′}) + λ

0 = ∑_{i=1}^{n} π_j N(x_i; µ_j, Σ_j) / ∑_{j′} π_{j′} N(x_i; µ_{j′}, Σ_{j′}) + λπ_j = ∑_i r_ij + λπ_j

∑_j π_j = 1  ⇒  λ = −n  ⇒  π_j = R_j / n
The EM Algorithm (for Gaussian mixtures)
Refinement of soft k-means and k-means

If we know the responsibilities r_ij, we can optimize µ, Σ, π analytically. And if we know µ, Σ, π, we can set r_ij! Thus:

1. initialize µ, Σ, π (e.g. random µ, uniform π)
2. Set
   r_ij = π_j N(x_i; µ_j, Σ_j) / ∑_{j′=1}^{k} π_{j′} N(x_i; µ_{j′}, Σ_{j′})
3. Set
   R_j = ∑_i r_ij,    µ_j = (1/R_j) ∑_i r_ij x_i,    Σ_j = (1/R_j) ∑_i r_ij (x_i − µ_j)(x_i − µ_j)ᵀ,    π_j = R_j / n

▶ Note that π is essentially given through r_ij, and can thus be incorporated into the first step
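The two alternating steps above translate almost line-by-line into NumPy. A minimal sketch (function names, the small ridge term on Σ for numerical stability, and the optional explicit mean initialization are my additions):

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """N(x; mu, Sigma) evaluated for a batch x of shape (n, d)."""
    d = mu.shape[0]
    diff = x - mu
    L = np.linalg.cholesky(Sigma)
    sol = np.linalg.solve(L, diff.T)                 # (d, n)
    maha = np.sum(sol ** 2, axis=0)                  # Mahalanobis distances
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    return np.exp(-0.5 * (maha + log_det + d * np.log(2 * np.pi)))

def em_gmm(x, k, n_iter=50, seed=0, mu0=None):
    """EM for a Gaussian mixture, following steps 1-3 above."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    # step 1: initialize mu (random data points unless given), Sigma, uniform pi
    mu = x[rng.choice(n, k, replace=False)] if mu0 is None else np.array(mu0, float)
    Sigma = np.stack([np.cov(x.T) + 1e-6 * np.eye(d)] * k)
    pi = np.full(k, 1.0 / k)
    r = np.full((n, k), 1.0 / k)
    for _ in range(n_iter):
        # step 2 (E): responsibilities r_ij proportional to pi_j N(x_i; mu_j, Sigma_j)
        r = np.stack([pi[j] * gaussian_pdf(x, mu[j], Sigma[j]) for j in range(k)], axis=1)
        r /= r.sum(axis=1, keepdims=True)
        # step 3 (M): closed-form updates from the slides
        R = r.sum(axis=0)
        mu = (r.T @ x) / R[:, None]
        for j in range(k):
            diff = x - mu[j]
            Sigma[j] = (r[:, j, None] * diff).T @ diff / R[j] + 1e-6 * np.eye(d)
        pi = R / n
    return pi, mu, Sigma, r
```

Unlike hard k-means, each datum contributes to every component in proportion to its responsibility, and the component shapes Σ_j and weights π_j are fitted along with the means.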
The connection to (soft) k-means
Refinement of soft k-means and k-means with cluster probabilities

3. Set
   R_j = ∑_i r_ij,    µ_j = (1/R_j) ∑_i r_ij x_i,    Σ_j = (1/R_j) ∑_i r_ij (x_i − µ_j)(x_i − µ_j)ᵀ,    π_j = R_j / n
Gaussian Mixtures Revisited
Introducing a Latent Variable Simplifies Things

▶ consider binary z_ij ∈ {0, 1} with ∑_j z_ij = 1 ("one-hot")
▶ what is p(x, z)? Let's write it as p(x, z) = p(x | z) p(z) with

p(z_ij = 1) = π_j  ⇒  p(z) = ∏_{i=1}^{n} ∏_{j=1}^{k} π_j^{z_ij}

p(x_i | z_ij = 1) = N(x_i; µ_j, Σ_j)  ⇒  p(x_i | z_i:) = ∏_{j=1}^{k} N(x_i; µ_j, Σ_j)^{z_ij}

p(x_i) = ∑_j p(z_ij = 1) p(x_i | z_ij = 1) = ∑_{j=1}^{k} π_j N(x_i; µ_j, Σ_j)

[Graphical model: π → z_i: → x_i ← (µ_k, Σ_k); plates over n and k]
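The latent-variable factorization gives a direct recipe for ancestral sampling: draw z_i from the categorical distribution π, then x_i from the selected Gaussian. A short sketch (function name mine):

```python
import numpy as np

def sample_gmm(n, pi, mu, Sigma, seed=0):
    """Ancestral sampling from p(x, z): z_i ~ Categorical(pi),
    then x_i ~ N(mu_{z_i}, Sigma_{z_i})."""
    rng = np.random.default_rng(seed)
    z = rng.choice(len(pi), size=n, p=pi)            # component index per datum
    x = np.array([rng.multivariate_normal(mu[j], Sigma[j]) for j in z])
    return x, z
```

Marginalizing the sampled z out recovers exactly the mixture density p(x_i) = ∑_j π_j N(x_i; µ_j, Σ_j) from the slide.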
Joint Generative Model
For the Gaussian Mixture

p(x, z | π, µ, Σ) = ∏_{i=1}^{n} ∏_{j=1}^{k} π_j^{z_ij} N(x_i; µ_j, Σ_j)^{z_ij}

Given µ, Σ, we have a simple distribution for z. And, given z, the parameters µ, Σ show up in a tractable form.
The Expectation Maximization Algorithm
Refinement of soft k-means and k-means with cluster probabilities

2. EXPECTATION: set responsibilities (compare the soft k-means form with stiffness β)

   r_ij = π_j N(x_i; µ_j, Σ_j) / ∑_{j′=1}^{k} π_{j′} N(x_i; µ_{j′}, Σ_{j′}) = R_j exp(−β∥x_i − m_j∥²) / ∑_{j′} R_{j′} exp(−β∥x_i − m_{j′}∥²)

3. MAXIMIZE Likelihood
   R_j = ∑_i r_ij,    µ_j = (1/R_j) ∑_i r_ij x_i,    Σ_j = (1/R_j) ∑_i r_ij (x_i − µ_j)(x_i − µ_j)ᵀ,    π_j = R_j / n
Taking the easy way out
Just pretend you know the variable that causes trouble

p(x, z | π, µ, Σ) = ∏_{i=1}^{n} ∏_{j=1}^{k} π_j^{z_ij} N(x_i; µ_j, Σ_j)^{z_ij}

If z were observed (write k_i for the index with z_ik_i = 1), maximum likelihood would be straightforward:

p(x | z, π, µ, Σ) = ∏_{i=1}^{n} N(x_i; µ_{k_i}, Σ_{k_i})

π_j ← N_j / n,    N_j = ∑_i z_ij
µ_j ← (1/N_j) ∑_i z_ij x_i,    Σ_j ← (1/N_j) ∑_i z_ij (x_i − µ_j)(x_i − µ_j)ᵀ

[Graphical model: π → z_i → x_i ← (µ_k, Σ_k); plates over n and k]

But we didn't have z! So, for EM, we replaced it with its expectation.
The Toolbox
Framework:

∫ p(x₁, x₂) dx₂ = p(x₁)        p(x₁, x₂) = p(x₁ | x₂) p(x₂)        p(x | y) = p(y | x) p(x) / p(y)

Modelling:
▶ graphical models
▶ Gaussian distributions
▶ (deep) learnt representations
▶ Kernels
▶ Markov Chains
▶ Exponential Families / Conjugate Priors
▶ Factor Graphs & Message Passing

Computation:
▶ Monte Carlo
▶ Linear algebra / Gaussian inference
▶ maximum likelihood / MAP
▶ Laplace approximations
▶ EM
▶ Variational approximations
Generic EM Algorithm
Maximize expected log likelihoods

Setting:
▶ Want to find a maximum likelihood (or MAP) estimate for a model involving a latent variable z:

θ* = arg max_θ [log p(x | θ)] = arg max_θ [log(∑_z p(x, z | θ))]

▶ Assume that the summation inside the log makes analytic optimization intractable,
▶ but that optimization would be analytic if z were known (i.e. if there were only one term in the sum).

Idea: Initialize θ₀, then iterate between
1. Compute p(z | x, θ_old)
2. Set θ_new to the Maximum of the Expectation of the complete-data log likelihood:

θ_new = arg max_θ ∑_z p(z | x, θ_old) log p(x, z | θ) = arg max_θ E_{p(z|x,θ_old)} [log p(x, z | θ)]
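The generic two-step iteration applies beyond Gaussian mixtures. As an illustration, here is a classic toy instance of my own choosing (not from the lecture): each trial of coin flips is generated by one of two coins with unknown biases θ = (θ₁, θ₂), the coin identity z is latent with a uniform prior, and EM estimates θ:

```python
import numpy as np
from math import comb

def em_two_coins(heads, flips, theta0=(0.6, 0.4), n_iter=20):
    """Generic EM on a two-coin mixture: E-step computes p(z | x, theta_old)
    per trial, M-step maximizes the expected complete-data log likelihood."""
    th = np.array(theta0, float)
    heads = np.array(heads, float)
    flips = np.array(flips, float)
    for _ in range(n_iter):
        # E-step: posterior responsibility of each coin for each trial
        lik = np.array([[comb(int(n), int(h)) * t**h * (1 - t)**(n - h) for t in th]
                        for h, n in zip(heads, flips)])
        r = lik / lik.sum(axis=1, keepdims=True)
        # M-step: weighted maximum likelihood for each coin's bias
        th = (r * heads[:, None]).sum(0) / (r * flips[:, None]).sum(0)
    return th
```

The M-step is analytic precisely because, with z "known" in expectation, the complete-data log likelihood decomposes into two independent Bernoulli estimation problems.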
EM for Gaussian Mixtures
re-written in generic form

2. Maximize
   E_{p(z|x,θ)} [log p(x, z | θ)] = ∑_i ∑_j r_ij (log π_j + log N(x_i; µ_j, Σ_j))
The EM algorithm

Instead of trying to maximize

log p(x | θ) = log ∑_z p(x, z | θ) = log ∑_z p(z | x, θ) p(x | θ),

maximize

E_z [log p(x, z | θ)] = ∑_z p(z | x, θ) log p(x, z | θ).