
Machine Learning

Section 14: Gaussian mixture models, k-means, EM

Stefan Harmeling

6./8. December 2021 (WS 2021/22)



What have we seen so far?
Sections:
1. Introduction
2. Plausible reasoning and Bayes Rule
3. From Logic to Probabilities
4. Bayesian networks
5. Continuous Probabilities
6. The Gaussian distribution
7. More on distributions, models, MAP, ML
8. Linear Regression
9. Matrix Differential Calculus
10. Model selection
11. Support Vector Machines
12. PCA, kPCA
13. ISOMAP, LLE



Back to probabilities...



Unsupervised learning
Given n data points X = [x1 , . . . , xn ] ∈ RD×n .

Find a good description of the data.


If the description is succinct (short), we possibly learned something
about the data.
Probabilistically:

Density estimation
Given n data points X = [x1 , . . . , xn ] ∈ RD×n .
Find a probability density function p(x) for X .



Types of unsupervised learning (1)

Dimensionality reduction
Given n data points X = [x1 , . . . , xn ] ∈ RD×n .
find low dimensional embedding Z = [z1 , . . . , zn ] ∈ Rd×n
Probabilistically, e.g.

p(z) = . . . latent space / continuous distribution


p(x∣z) = N (x∣f (z), σ²I)

where f is a linear or nonlinear mapping from a low-dimensional space Rd into RD.
▸ there are many variants of this
▸ not all have an obvious probabilistic interpretation



Types of unsupervised learning (2)

Clustering (aka mixture models)


Given n data points X = [x1 , . . . , xn ] ∈ RD×n .
find class labels Z = [z1 , . . . , zn ] ∈ {1, 2, . . . , k }
Probabilistically, e.g.

p(z) = Cat(θ)    k-sided die / discrete distribution


p(x∣z) = N (x∣µz , Σz )

aka Gaussian mixture model, i.e. for each class we have a different
Gaussian.
▸ there are many variants of this
▸ not all have an obvious probabilistic interpretation



Types of unsupervised learning (3)
More general point of view:

Latent variable modelling


Given n data points X = [x1 , . . . , xn ] ∈ RD×n .
find latent variable Z = [z1 , . . . , zn ]

Probabilistically, e.g.

p(z) = . . . the latent variable


p(x∣z) = . . . simpler density dependent on z

As a graphical model:

z Ð→ x

(Note: today we also write x instead of X for RVs.)

How to fit a latent variable model?



▸ last time: PCA, ISOMAP, LLE, etc
▸ today: clustering, k-means, mixture models, etc



The EM Algorithm

▸ Michael Jordan, “An Introduction to Graphical Models”, 2003.


▸ Chris Bishop, “Pattern Recognition and Machine Learning”, 2007.



Intuitive motivation for mixture models (clustering)
Medical treatments:
▸ a medical doctor tries to cure ill people
plan 1. have a single, complicated procedure that fits all
plan 2. invent syndromes to split the people depending on their type of
illness, have several simple procedures
Statistical modelling:
▸ you want to model some complicated data
plan 1. have a single, complicated model that fits all data
plan 2. invent different classes, have several simple models
Mixture model:
K
p(x) = ∑ πi fi (x∣θi )
i=1

▸ with mixing components fi , each being a density


▸ with mixing proportions πi ≥ 0 and ∑i πi = 1
▸ with parameters π1 , . . . , πK , θ1 , . . . , θK
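
To make the definition concrete, here is a minimal numerical sketch (not from the slides; the parameter values are made up) that evaluates a Gaussian mixture density p(x) = ∑i πi N(x∣µi, σi²) at a point:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """Density of a 1-D Gaussian N(x | mu, sigma2)."""
    return np.exp(-0.5 * (x - mu) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)

# illustrative parameters (made up): K = 2 components
pi = np.array([0.3, 0.7])       # mixing proportions, sum to 1
mu = np.array([-2.0, 1.5])      # component means
sigma2 = np.array([0.5, 1.0])   # component variances

def mixture_pdf(x):
    """p(x) = sum_i pi_i N(x | mu_i, sigma2_i)."""
    return sum(p * gaussian_pdf(x, m, s2) for p, m, s2 in zip(pi, mu, sigma2))

print(mixture_pdf(0.0))         # density of the mixture at x = 0
```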
Represent mixture model as a graphical model
Introduce latent variable z ranging over 1, . . . , K :
z Ð→ x

▸ view z as a K -dimensional vector with binary coordinates z i ,


i.e. z i = 1 if z = i, otherwise z i = 0.
▸ define distribution of z:

p(z i = 1∣θ) ∶= πi

▸ define distribution of x given z:

p(x∣z i = 1, θ) ∶= fi (x∣θi )

▸ this implies joint distribution

p(x, z i = 1∣θ) = πi fi (x∣θi )

▸ and marginal of x (same as the mixture density):

p(x∣θ) = ∑i p(x, z i = 1∣θ) = ∑i πi fi (x∣θi )
Inference in Gaussian mixture models (1)
Gaussian mixture model:

p(x∣θ) = ∑ πi N (x∣µi , Σi )
i

▸ parameters θ = (π1 , . . . , πK , µ1 , . . . , µK , Σ1 , . . . , ΣK )
What is the probability that x is in the ith component?

τ i ∶= p(z i = 1∣x, θ)
    = (p(x∣z i = 1, θ) P(z i = 1∣θ)) / (∑j p(x∣z j = 1, θ) P(z j = 1∣θ))
    = πi N (x∣µi , Σi ) / ∑j πj N (x∣µj , Σj )

▸ thus the model allows us to classify a data point x


▸ πi can be seen as a prior probability, τ i as a posterior probability
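
As an illustration, a small sketch of this posterior computation (assuming 1-D components; the parameter values are made up):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    return np.exp(-0.5 * (x - mu) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)

pi = np.array([0.3, 0.7])        # prior probabilities pi_i
mu = np.array([-2.0, 1.5])
sigma2 = np.array([0.5, 1.0])

def responsibilities(x):
    """tau_i = pi_i N(x | mu_i, Sigma_i) / sum_j pi_j N(x | mu_j, Sigma_j)."""
    weighted = pi * gaussian_pdf(x, mu, sigma2)   # numerator for every component
    return weighted / weighted.sum()              # normalize -> posterior over components

tau = responsibilities(0.0)
print(tau, tau.sum())   # posterior class probabilities for x = 0, they sum to 1
```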



Inference in Gaussian mixture models (2)
Given i.i.d. data D = (x1 , . . . , xN ) estimate the parameters θ:
▸ MLE: maximize log-likelihood function wrt. θ:

l(θ∣D) = log p(D∣θ)


= log ∏ ∑ πi fi (xn ∣θi )
n i
= ∑ log ∑ πi fi (xn ∣θi )
n i

▸ difficulty: can not exchange the inner sum and the logarithm to
decouple the parameters
▸ this maximization is a nonlinear problem without closed-form
solution
▸ can be solved with nonlinear optimization
▸ better approach: introduce latent variables, and apply the EM
algorithm to Gaussian mixture models
▸ for a motivation let us discuss the K-means algorithm



The K-means algorithm (1)
Group data points D = (x1 , . . . , xN ) into K clusters which are
characterized by K means µ1 , . . . , µK

Algorithm 14.1 (k-means clustering)


▸ randomly initialize the K means µ1 , . . . , µK
▸ repeat until convergence
1. assign each data point xn to its closest cluster:

zni = { 1   if i = arg minj ∣∣xn − µj ∣∣2
      { 0   otherwise

where zn are categorical variables as above


2. recompute the means
µi = (∑n zni xn) / (∑n zni)

▸ one can show that K-means minimizes the distortion measure:

J = ∑n ∑i zni ∣∣xn − µi ∣∣2
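
A minimal sketch of Algorithm 14.1 in Python (the toy data and the initialization at randomly chosen data points are assumptions for illustration, not part of the slides):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Minimal k-means: X has shape (n, D); returns means (K, D) and labels (n,)."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]   # init means at random data points
    for _ in range(n_iter):
        # step 1: assign each data point to its closest mean (hard assignment z_n)
        dist = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (n, K) squared distances
        z = dist.argmin(axis=1)
        # step 2: recompute each mean as the average of its assigned points
        new_mu = np.array([X[z == k].mean(axis=0) if np.any(z == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):                      # converged
            break
        mu = new_mu
    return mu, z

# toy data (made up): two blobs in 2-D
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(5.0, 1.0, (50, 2))])
mu, z = kmeans(X, K=2)
print(mu)
```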

The K-means algorithm (2)


[Figure: Illustration of the K-means algorithm using the re-scaled Old Faithful data set, panels (a)–(i). Green points denote the data set in a two-dimensional Euclidean space; the initial choices for the centres µ1 and µ2 are shown by the red and blue crosses. Copied from Fig. 9.1 of Bishop.]
EM algorithm for the Gaussian mixture model (1)
Motivated by K-means:
▸ view zn as random assignment variables (as above)
If we knew the values zn for each point xn
▸ estimate the means simply by:

µ̂i = (∑n zni xn) / (∑n zni)
as before in K-means
However, since we do not know the value zn :
▸ consider the conditional expectation of zni :

E(zni ∣xn , θ) = 1 ⋅ p(zni = 1∣xn , θ) + 0 ⋅ p(zni = 0∣xn , θ)


= p(zni = 1∣xn , θ) =∶ τni

since zni are binary variables


▸ plug these conditional expectations into the updates:

µ̂i = (∑n τni xn) / (∑n τni)
EM algorithm for the Gaussian mixture model (2)

Hard assignment (K-means):

µ̂i = (∑n zni xn) / (∑n zni)

Soft assignment:

µ̂i = (∑n τni xn) / (∑n τni)

▸ zni ∈ {0, 1}, which is a discrete set, thus hard assignment
▸ τni ∈ [0, 1], which is an interval, thus soft assignment



EM algorithm for the Gaussian mixture model (3)

Algorithm 14.2 (EM for Gaussian mixture models)


▸ randomly initialize the parameters
µ1 , . . . , µK , Σ1 , . . . , ΣK , π1 , . . . , πK
▸ repeat until convergence
1. E-step: calculate soft assignments for each data point xn :

τni = πi N (xn ∣µi , Σi ) / ∑j πj N (xn ∣µj , Σj )

2. M-step: recompute the parameters:


µi = (∑n τni xn) / (∑n τni)
Σi = (∑n τni (xn − µi )(xn − µi )ᵀ) / (∑n τni)
πi = (1/N) ∑n τni
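
A minimal sketch of Algorithm 14.2 (the use of scipy for the Gaussian density, the toy data, the initialization, and the small ridge added to Σi for numerical stability are assumptions for illustration, not part of the slides):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=50, seed=0):
    """Minimal EM for a Gaussian mixture model; X has shape (n, D)."""
    n, D = X.shape
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(n, size=K, replace=False)]     # initialize means at random data points
    Sigma = np.stack([np.eye(D) for _ in range(K)])  # initial covariances
    pi = np.full(K, 1.0 / K)                         # initial mixing proportions
    for _ in range(n_iter):
        # E-step: soft assignments tau[n, i] proportional to pi_i N(x_n | mu_i, Sigma_i)
        tau = np.column_stack([pi[i] * multivariate_normal.pdf(X, mu[i], Sigma[i])
                               for i in range(K)])
        tau /= tau.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters from the soft assignments
        Nk = tau.sum(axis=0)                         # effective number of points per component
        mu = (tau.T @ X) / Nk[:, None]
        for i in range(K):
            diff = X - mu[i]
            # small ridge for numerical stability (an assumption, not part of Algorithm 14.2)
            Sigma[i] = (tau[:, i, None] * diff).T @ diff / Nk[i] + 1e-6 * np.eye(D)
        pi = Nk / n
    return pi, mu, Sigma, tau

# toy data (made up): two blobs in 2-D
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(4.0, 1.0, (100, 2))])
pi, mu, Sigma, tau = em_gmm(X, K=2)
print(pi)
print(mu)
```

The returned tau contains the final soft assignments, which can also be used to classify each data point by taking the component with the largest responsibility.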



EM algorithm for the Gaussian mixture model (4)

[Figure: Illustration of the EM algorithm using the Old Faithful data set as used for the illustration of the K-means algorithm in Figure 9.1; panels show the fit after L = 1, 2, 5, 20 iterations. Copied from Fig. 9.8 of Bishop.]
EM algorithm for the Gaussian mixture model (5)
So far, so good, but:
▸ Does the EM algorithm for the Gaussian mixture model really
maximize the log likelihood?

l(θ∣D) = ∑n log ∑i πi fi (xn ∣θi )

Taking the derivative of l wrt. the parameters and setting it to zero:


∂/∂µi l(θ∣D) = . . . = Σi⁻¹ (∑n τni xn − ∑n τni µi ) = 0

(full derivation of the missing bits on the next page)


Rearranging, we thus get:

µi = (∑n τni xn) / (∑n τni)



∂/∂µi l(θ∣D) = ∑n ∂/∂µi log ∑j πj fj (xn ∣θj )      (exchange partial derivative and summation)
  = ∑n (1 / ∑j πj fj (xn ∣θj )) ∂/∂µi ∑j πj fj (xn ∣θj )      (use ∂/∂a log g(a) = (1/g(a)) ∂/∂a g(a))
  = ∑n (1 / ∑j πj fj (xn ∣θj )) ∂/∂µi πi fi (xn ∣θi )      (the derivative picks the summand containing µi)
  = ∑n (πi fi (xn ∣θi ) / ∑j πj fj (xn ∣θj )) ∂/∂µi log fi (xn ∣θi )      (use g(a) ∂/∂a log g(a) = ∂/∂a g(a))
  = ∑n τni ∂/∂µi log fi (xn ∣θi )      (plug in τni = πi fi (xn ∣θi ) / ∑j πj fj (xn ∣θj ))
  = ∑n τni ∂/∂µi log N (xn ∣µi , Σi )      (plug in the Gaussian PDF)
  = ∑n τni Σi⁻¹ (xn − µi )
  = Σi⁻¹ ∑n τni (xn − µi )
  = Σi⁻¹ (∑n τni xn − ∑n τni µi )
EM algorithm for the Gaussian mixture model (6)

Setting all derivatives of l wrt. the parameters to zero yields:

µi = (∑n τni xn) / (∑n τni)
Σi = (∑n τni (xn − µi )(xn − µi )ᵀ) / (∑n τni)
πi = (1/N) ∑n τni

▸ exactly the same as in the previous more heuristic derivation!


▸ however, since τni depends on the parameters as well, this is just a
set of coupled, nonlinear equations



EM algorithm for the Gaussian mixture model (7)

Update equations:

µi = (∑n τni xn) / (∑n τni)
Σi = (∑n τni (xn − µi )(xn − µi )ᵀ) / (∑n τni)
πi = (1/N) ∑n τni

Three ways to derive them:


1. replace unknown assignments with their expected values, i.e. hard
assignments in K-means with soft-assignments (quite heuristic)
2. set derivatives of the log-likelihood to zero (somewhat heuristic)
3. NEW (more rigorous): consider expected complete log likelihood



We consider the following model:

Mixture of Gaussians
The mixture of Gaussians is a probability distribution which is written
as the weighted sum of several Gaussian distributions:
K
p(x) = ∑ πk N (x∣µk , Σk )
k =1

The parameters are π1 , . . . , πK , µ1 , . . . , µK , Σ1 , . . . , ΣK , which are


abbreviated as θ, thus we sometimes write p(x∣θ).
Question:
▸ Given n data points D = {x1 , . . . , xn } ⊂ RD . How can we estimate
the parameters?



Estimate the parameters via maximum likelihood ?
Given n data points D = {x1 , . . . , xn } ⊂ RD .

Log-likelihood for mixture of Gaussians

n K
`(θ) = log p(D∣θ) = ∑ log ∑ πk N (xi ∣µk , Σk )
i=1 k =1

Derivatives of the log-likelihood


∂/∂πk `(θ) = ∑_{i=1}^n (πk N (xi ∣µk , Σk ) / ∑_{j=1}^K πj N (xi ∣µj , Σj )) (1/πk) = 0      (side constraint ∑k πk = 1 missing)
∂/∂µk `(θ) = ∑_{i=1}^n (πk N (xi ∣µk , Σk ) / ∑_{j=1}^K πj N (xi ∣µj , Σj )) Σk⁻¹ (xi − µk ) = 0
∂/∂Σk `(θ) = ∑_{i=1}^n (πk N (xi ∣µk , Σk ) / ∑_{j=1}^K πj N (xi ∣µj , Σj )) (. . .) = 0

Complicated system of equations, since all parameters are coupled.


(Last time we cheated a bit by assuming πk N (xi ∣µk , Σk ) / ∑_{j=1}^K πj N (xi ∣µj , Σj ) to be constant.)
More precisely. . .
Constrained optimization:
Maximize `(θ) s.t. ∑k πk = 1, which gives us the Lagrangian:
L(θ, λ) = `(θ) − λ (∑k πk − 1) = ∑_{i=1}^n log ∑_{k=1}^K πk N (xi ∣µk , Σk ) − λ (∑k πk − 1)

Derivatives of the Lagrangian wrt. θ


∂/∂πk L(θ) = ∑_{i=1}^n (πk N (xi ∣µk , Σk ) / ∑_{j=1}^K πj N (xi ∣µj , Σj )) (1/πk) − λ = 0
∂/∂µk L(θ) = ∑_{i=1}^n (πk N (xi ∣µk , Σk ) / ∑_{j=1}^K πj N (xi ∣µj , Σj )) Σk⁻¹ (xi − µk ) = 0
∂/∂Σk L(θ) = ∑_{i=1}^n (πk N (xi ∣µk , Σk ) / ∑_{j=1}^K πj N (xi ∣µj , Σj )) (. . .) = 0

Complicated system of equations, since all parameters are coupled.


(Last time we cheated a bit by assuming πk N (xi ∣µk , Σk ) / ∑_{j=1}^K πj N (xi ∣µj , Σj ) to be constant.)



This can be optimized by adding more unknowns!

Mixture of Gaussians as latent variable model


Introduce latent variable z ∈ {1, . . . , K } with

p(z = k ∣θ) = πk
p(x∣z = k , θ) = N (x∣µk , Σk )

These choices let us rewrite the mixture of Gaussians:


p(x∣θ) = ∑_{k=1}^K p(x, z = k ∣θ)
       = ∑_{k=1}^K p(z = k ∣θ) p(x∣z = k , θ)
       = ∑_{k=1}^K πk N (x∣µk , Σk )

Now we have θ and z as unknowns.



Again...

Given n data points D = {x1 , . . . , xn } ⊂ RD .


▸ Q: Does the introduced latent variable help us estimate the parameters of the mixture of Gaussians?

A: Not immediately, since we do not know the values


z1 , . . . , zn ∈ {1, . . . , K } of the latent variable.

▸ Q: Does the problem become easier if we knew these values?

A: Yes!
▸ Let’s write down the so-called complete data log likelihood.



Complete data log likelihood
The locations together with the values of the latent variables are called
the complete data:

Dc = {(x1 , z1 ), . . . , (xn , zn )}

The complete data log likelihood can be written as


`c (θ) = log p(Dc ∣θ) = ∑_{i=1}^n log p(xi , zi ∣θ) = ∑_{i=1}^n log (πzi N (xi ∣µzi , Σzi ))
       = ∑_{i=1}^n log ∏_{k=1}^K (πk N (xi ∣µk , Σk ))^[zi =k]
       = ∑_{i=1}^n ∑_{k=1}^K log (πk N (xi ∣µk , Σk ))^[zi =k]
       = ∑_{i=1}^n ∑_{k=1}^K [zi = k ] (log πk + log N (xi ∣µk , Σk ))
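
A small numerical sketch of `c (θ) for made-up 1-D complete data (all values are illustrative assumptions); since the labels zi are known, each data point simply contributes log πzi + log N (xi ∣µzi , Σzi ):

```python
import numpy as np

def log_gaussian(x, mu, sigma2):
    """log N(x | mu, sigma2) for 1-D data."""
    return -0.5 * (np.log(2 * np.pi * sigma2) + (x - mu) ** 2 / sigma2)

# made-up complete data: locations x_i together with their latent labels z_i
x = np.array([-2.1, -1.8, 1.2, 1.7])
z = np.array([0, 0, 1, 1])            # zero-based class labels
pi = np.array([0.5, 0.5])
mu = np.array([-2.0, 1.5])
sigma2 = np.array([1.0, 1.0])

# l_c(theta) = sum_i ( log pi_{z_i} + log N(x_i | mu_{z_i}, Sigma_{z_i}) )
ell_c = np.sum(np.log(pi[z]) + log_gaussian(x, mu[z], sigma2[z]))
print(ell_c)
```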



As a reminder:
Iverson brackets
The Iverson bracket is a handy notation for the indicator function of a
statement:

[F ] = { 1   if F is true
       { 0   otherwise

where F is a statement that is either true or false. E.g. for z = k

[z = k ] = { 1   if z = k is true
           { 0   otherwise



k-means as a special mixture of Gaussians
Fixing πk = 1/K and Σk = I for all k we get

M-step of k -means
Setting the derivative of `c (θ) wrt. µk to zero we obtain

µk = (∑_{i=1}^n [zi = k ] xi) / (∑_{i=1}^n [zi = k ])

which updates the means given cluster assignments zi for all i.

E-step of k -means
Similarly choosing

zi = argmaxk log N (xi ∣µk , Σk )


= argmink ∥xi − µk ∥2

we maximize `c (θ) wrt. zi .



Back to the
Complete data log likelihood
The locations together with the values of the latent variables are called
the complete data:

Dc = {(x1 , z1 ), . . . , (xn , zn )}

The complete data log likelihood can be written as


n K
`c (θ) = ∑ ∑ [zi = k ] (log πk + log N (xi ∣µk , Σk ))
i=1 k =1

▸ In practice we do not have the complete data.


▸ Now we have a missing data problem.
▸ However, given the incomplete data and the model parameters θ
we can derive a posterior distribution of the unknown latent
variables.



Given n data points D = {x1 , . . . , xn } ⊂ RD .

Responsibilities
Given a mixture of Gaussians (i.e. for fixed θ) we can calculate the
probability that xi is in cluster k , i.e. that zi = k

τik = p(zi = k ∣xi , θ)
    = (p(xi ∣zi = k , θ) P(zi = k ∣θ)) / (∑j p(xi ∣zi = j, θ) P(zi = j∣θ))
    = πk N (xi ∣µk , Σk ) / ∑j πj N (xi ∣µj , Σj )

Note:
▸ πk can be seen as a prior probability.
▸ τik can be seen as a posterior probability for location xi .



Given the distributions p(zi ∣xi , θ0 ) for the latent variables we calculate
the:
Posterior expectation of [zi = k ]
For a single data point the posterior expectation of [zi = k ] is
E_{zi ∼p(zi ∣xi ,θ0 )} [zi = k ] = ∑_{j=1}^K p(zi = j∣xi , θ0 ) [j = k ] = τik

where θ0 are some fixed parameters.


Using this, the expectation of the complete data log likelihood is
Q(θ, θ0 ) = E_{z∼p(z∣x,θ0 )} ∑_{i=1}^n ∑_{k=1}^K [zi = k ] (log πk + log N (xi ∣µk , Σk ))
          = ∑_{i=1}^n ∑_{k=1}^K (E_{z∼p(z∣x,θ0 )} [zi = k ]) (log πk + log N (xi ∣µk , Σk ))
          = ∑_{i=1}^n ∑_{k=1}^K τik (log πk + log N (xi ∣µk , Σk ))



Expected complete data log likelihood

Q(θ, θ0 ) = ∑_{i=1}^n ∑_{k=1}^K τik (log πk + log N (xi ∣µk , Σk ))

where τik depends on θ0 and the term (log πk + log N (xi ∣µk , Σk )) depends on θ.

▸ The responsibilities

τik = p(zi = k ∣xi , θ0 )

depend on θ0 .
▸ The joint log probabilities

log p(zi , xi ) = log p(zi ) + log p(xi ∣zi ) = log πk + log N (xi ∣µk , Σk )

depend on θ.
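
A small sketch (1-D, made-up numbers) that evaluates Q(θ, θ0); for simplicity the responsibilities are computed from the same parameters that are plugged into the joint log probabilities, i.e. θ0 = θ:

```python
import numpy as np

def log_gaussian(x, mu, sigma2):
    """log N(x | mu, sigma2) for 1-D components."""
    return -0.5 * (np.log(2 * np.pi * sigma2) + (x - mu) ** 2 / sigma2)

# made-up data and parameters (K = 2 components, 1-D)
x = np.array([-2.1, -1.8, 1.2, 1.7])
pi = np.array([0.5, 0.5])
mu = np.array([-2.0, 1.5])
sigma2 = np.array([1.0, 1.0])

# log pi_k + log N(x_i | mu_k, Sigma_k), shape (n, K)
log_joint = np.log(pi) + log_gaussian(x[:, None], mu, sigma2)

# responsibilities tau[i, k] from theta_0 (here theta_0 = theta), computed stably
tau = np.exp(log_joint - log_joint.max(axis=1, keepdims=True))
tau /= tau.sum(axis=1, keepdims=True)

# Q(theta, theta_0) = sum_i sum_k tau_ik (log pi_k + log N(x_i | mu_k, Sigma_k))
Q = np.sum(tau * log_joint)
print(Q)
```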



M-step for mixture of Gaussians
For fixed θ0 the derivatives of Q(θ, θ0 ) wrt θ have a simple form and
imply parameter updates:

πk = (1/n) ∑_{i=1}^n τik
µk = (∑_{i=1}^n τik xi) / (∑_{i=1}^n τik)
Σk = (∑_{i=1}^n τik (xi − µk )(xi − µk )ᵀ) / (∑_{i=1}^n τik)

In words:
“Maximize the expected complete log likelihood
wrt. the parameters θ”



E-step for mixture of Gaussians
For fixed θ we update the responsibilities:

τik = πk N (xi ∣µk , Σk ) / ∑j πj N (xi ∣µj , Σj )

In words:

“Calculate the Expected complete log likelihood


by recomputing the responsibilities”
Notes:
▸ The E-step updates θ0 only indirectly, by updating the responsibilities.
▸ The M-step updates θ.
▸ The M-step has been derived from derivatives.
▸ Does the E-step maximize Q(θ, θ0 )?



EM from a more general point of view



General setting of EM algorithm
▸ x is an observable variable
▸ z is a latent variable (non-observable)
▸ p(x, z∣θ) joint probabilistic model
▸ For simplicity we consider only a single data point x. The following
easily generalizes to several points.
(Incomplete) log likelihood:

`(θ) = log p(x∣θ) = log ∑ p(x, z∣θ)


z

We can extend this expression as follows:

`(θ) = log ∑z q(z) p(x, z∣θ)/q(z)

where q(z) is any distribution over z.


Note: summation may be replaced by integration.



Lower bound for the incomplete data log likelihood
Apply Jensen’s inequality to exchange summation and logarithm (a
concave function)

`(θ) = log ∑z q(z) p(x, z∣θ)/q(z)
     ≥ ∑z q(z) log (p(x, z∣θ)/q(z)) =∶ L(θ, q)

The last expression L(θ, q) is a lower bound for the incomplete log
likelihood.
▸ Q: How does a lower bound help us? A: Instead of maximizing
`(θ) we would maximize its lower bound (and hope that this is
maximizing `(θ) as well).
Reminder: what is Jensen’s inequality?



Jensen’s inequality (1)
Let f be a convex function. There are several versions:
▸ By definition of convexity: for any x, y and 0 ≤ α ≤ 1, the function
value at the convex combination of x and y is at most as large as
the corresponding convex combination of f (x) and f (y ),

f (αx + (1 − α)y ) ≤ αf (x) + (1 − α)f (y )

▸ More generally for ∑i θi = 1 for θi ≥ 0 we have

f (∑ θi xi ) ≤ ∑ θi f (xi )
i i

▸ For a PDF p(x) with ∫ p(x) dx = 1 we have

f (∫ p(x) x dx) ≤ ∫ p(x)f (x)dx

f (E x) ≤ E f (x) written as expectations



If f is convex, −f is concave.
Jensen’s inequality (2)
▸ A function f is convex, if the area above the function is convex.
▸ A function g is concave, if the area below the function is convex.
Or: if −g is convex.
Thus we can write down Jensen’s inequality for a convex function f and
similarly for a concave function g (like the logarithm):

f (∑ θi xi ) ≤ ∑ θi f (xi )
i i

g (∑ θi xi ) ≥ ∑ θi g(xi )
i i

Q: can you draw a function that is neither convex nor concave?
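
A quick numerical sanity check of Jensen's inequality for the concave logarithm (the direction used for the lower bound); the weights and points below are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.dirichlet(np.ones(5))      # convex weights: theta_i >= 0, summing to 1
x = rng.uniform(0.1, 10.0, size=5)     # arbitrary positive points

lhs = np.log(np.sum(theta * x))        # g(sum_i theta_i x_i) for the concave g = log
rhs = np.sum(theta * np.log(x))        # sum_i theta_i g(x_i)
print(lhs, rhs, lhs >= rhs)            # Jensen for concave g: lhs >= rhs (prints True)
```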



Back to our lower bound: L(θ, q)



Deep insight into the M-step

How the lower bound is related to the M-step


Let’s rewrite L(θ, q):

`(θ) ≥ L(θ, q) = ∑z q(z) log (p(x, z∣θ)/q(z))
               = ∑z q(z) log p(x, z∣θ) − ∑z q(z) log q(z)
               = Ez∼q `c (θ) + H(q)

where H(q) is the entropy of q. The first summand is the expected


complete data log likelihood.
Conclusion:
▸ The M-step maximizes the lower bound L(θ, q) for fixed q by
maximizing Ez∼q `c (θ) in θ, since the entropy of q does not depend
on θ.



Entropy
The quantity

H(q) = − ∑ q(z) log q(z)


z

is called the entropy. It is a measure of the randomness of random


variable z.
Entropy is the essential concept in information theory.
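
A small sketch computing the entropy of a few discrete distributions (base-2 logarithm is an assumed choice here, so the result is in bits):

```python
import numpy as np

def entropy(q, base=2):
    """H(q) = -sum_z q(z) log q(z); terms with q(z) = 0 contribute 0."""
    q = np.asarray(q, dtype=float)
    nz = q > 0
    return -np.sum(q[nz] * np.log(q[nz])) / np.log(base)

print(entropy([0.5, 0.5]))    # fair coin: 1 bit
print(entropy([0.9, 0.1]))    # biased coin: less random, lower entropy (about 0.47 bits)
print(entropy([1.0, 0.0]))    # deterministic outcome: 0 bits
```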



Entropy — more insights
Q: Why take logs of probabilities?
1. Consider a single fair coin. Once we know the outcome we have
received one bit which is:
− log2 0.5 = 1
Entropy is the expected amount of information in a random
experiment
X ∼ p(x)
H(X ) = E(− log p(X )) = − ∑ p(x) log p(x)
x

2. Consider two independent random experiments X ∼ p(x) and


Y ∼ p(y ), i.e. p(x, y ) = p(x)p(y ). The information (as measured
by entropy) should be additive.
H(X , Y ) = − ∑xy p(x, y ) log p(x, y ) = − ∑xy p(x, y ) log p(x)p(y )
          = − ∑xy p(x, y ) (log p(x) + log p(y ))
          = − ∑x p(x) log p(x) − ∑y p(y ) log p(y ) = H(X ) + H(Y )
Deep insight into the E-step
How the lower bound is related to the E-step (1)
Let’s rewrite L(θ, q) again:

`(θ) ≥ L(θ, q) = ∑z q(z) log (p(x, z∣θ)/q(z))
               = ∑z q(z) log (p(z∣x, θ) p(x∣θ)/q(z))
               = ∑z q(z) log (p(z∣x, θ)/q(z)) + ∑z q(z) log p(x∣θ)
               = −KL(q(z)∣∣p(z∣x, θ)) + log p(x∣θ)
               = −KL(q(z)∣∣p(z∣x, θ)) + `(θ)

where KL is the Kullback-Leibler divergence.


Thus the gap between the lower bound and `(θ) is the KL-divergence
between q(z) and p(z∣x, θ):

`(θ) − L(θ, q) = KL(q(z)∣∣p(z∣x, θ)) ≥ 0
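
A quick numerical check of this identity on a tiny two-component Gaussian mixture with an arbitrary q (all numbers below are made up):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    return np.exp(-0.5 * (x - mu) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)

# tiny mixture model p(x, z | theta) with K = 2 (made-up parameters), a single point x
pi = np.array([0.4, 0.6])
mu = np.array([-1.0, 2.0])
sigma2 = np.array([1.0, 0.5])
x = 0.3

joint = pi * gaussian_pdf(x, mu, sigma2)   # p(x, z | theta) for z = 1, 2
p_x = joint.sum()                          # p(x | theta)
post = joint / p_x                         # p(z | x, theta)

q = np.array([0.8, 0.2])                   # an arbitrary distribution over z

ell = np.log(p_x)                          # l(theta)
L = np.sum(q * np.log(joint / q))          # lower bound L(theta, q)
KL = np.sum(q * np.log(q / post))          # KL(q || p(z | x, theta))

print(ell - L, KL)                         # equal, and >= 0
print(np.isclose(ell - L, KL))             # True
```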


Kullback-Leibler (KL) divergence
The quantity

KL(q∣∣p) = ∑z q(z) log (q(z)/p(z)) = − ∑z q(z) log (p(z)/q(z))

is called the Kullback-Leibler divergence. It is a measure of the


difference between two distributions q and p.

▸ KL(q∣∣p) ≥ 0 with equality for q = p (proof?)


▸ it is not a distance, because KL(q∣∣p) ≠ KL(p∣∣q)
▸ KL(q∣∣p) compares q and p where q is large
▸ KL(p∣∣q) compares q and p where p is large
▸ usually we write KL(p∣∣q) instead of KL(p, q)
▸ KL-divergence is another important concept in information theory.
▸ Entropy is related to the KL divergence to the uniform distribution: for K outcomes, H(q) = log K − KL(q∣∣uniform), so high entropy means q is close to uniform.
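
A small sketch computing KL divergences for discrete distributions (made-up values), illustrating non-negativity, asymmetry, and the relation to entropy mentioned above:

```python
import numpy as np

def kl(q, p):
    """KL(q || p) = sum_z q(z) log(q(z) / p(z)) for discrete distributions with q(z) > 0."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    return np.sum(q * np.log(q / p))

q = np.array([0.7, 0.2, 0.1])
p = np.array([0.3, 0.3, 0.4])
u = np.full(3, 1 / 3)                        # uniform distribution over 3 outcomes

print(kl(q, p), kl(p, q))                    # both >= 0, but not equal: KL is not symmetric
print(kl(q, q))                              # 0 for identical distributions

# relation to entropy: H(q) = log K - KL(q || uniform)
H = -np.sum(q * np.log(q))
print(np.isclose(H, np.log(3) - kl(q, u)))   # True
```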



How the lower bound is related to the E-step (2)
We showed that the gap between `(θ) and L(θ, q) is the KL divergence
between q and p(z∣x, θ)

`(θ) − L(θ, q) = KL(q(z)∣∣p(z∣x, θ)) ≥ 0

The E-step closes this gap by choosing q(z) = p(z∣x, θ):

`(θ) − L(θ, p(z∣x, θ)) = KL(p(z∣x, θ)∣∣p(z∣x, θ)) = 0

▸ E.g. if our current estimate of θ is θ0 , we choose q = p(z∣x, θ0 ).



Algorithm 14.3 (General form of EM)
▸ x is an observable variable
▸ z is a latent variable (non-observable)
▸ p(x, z∣θ) joint probabilistic model
1. randomly initialize θ(0)
2. repeat until convergence

E-step q (t+1) = argmaxq L(θ(t) , q) = p(z∣x, θ(t) )


M-step θ(t+1) = argmaxθ L(θ, q (t+1) ) = argmaxθ Ez∼q (t+1) `c (θ)

Viewed even more generally:


▸ The lower bound L(θ, q) is an example of an auxiliary function.



Auxiliary function
The idea of an auxiliary function is quite general:
▸ suppose maximizing some function, say `(θ) is difficult.
▸ introduce an auxiliary function L(θ, θ0 ),
1. that is easy to maximize in θ,
2. that touches `(θ), i.e. L(θ, θ) = `(θ), and
3. that lowerbounds `(θ), i.e. L(θ, θ0 ) ≤ `(θ) for any θ0 .

Then do M-steps and E-steps:

`(θ(0) ) = L(θ(0) , θ(0) )
         ≤ L(θ(1) , θ(0) )               M-step
         ≤ L(θ(1) , θ(1) ) = `(θ(1) )    E-step
         ≤ L(θ(2) , θ(1) )               M-step
         ≤ L(θ(2) , θ(2) ) = `(θ(2) )    E-step
         ≤ ...



`(θ(0) ) = L(θ(0) , θ(0) )
         ≤ L(θ(1) , θ(0) )               M-step
         ≤ L(θ(1) , θ(1) ) = `(θ(1) )    E-step
         ≤ L(θ(2) , θ(1) )               M-step
         ≤ L(θ(2) , θ(2) ) = `(θ(2) )    E-step
         ≤ ...

Then EM performs greedy hill-climbing for `(θ), i.e. we generate a


sequence θ(0) , θ(1) , θ(2) , . . . , s.t.

`(θ(0) ) ≤ `(θ(1) ) ≤ `(θ(2) ) ≤ . . .

However, we might end up in a local maximum...



Summary of k-means, EM algorithm, and beyond
Algorithm 14.4 (General form of EM)
▸ x is an observable variable
▸ z is a latent variable (non-observable)
▸ p(x, z∣θ) joint probabilistic model
1. randomly initialize θ(0)
2. repeat until convergence

E-step q (t+1) = argmaxq L(θ(t) , q) = p(z∣x, θ(t) )


M-step θ(t+1) = argmaxθ L(θ, q (t+1) ) = argmaxθ Ez∼q (t+1) `c (θ)

▸ k-means is a special case (it was heuristically motivated but is an instance of this general scheme)
▸ fitting a GMM is a special case
▸ more generally: this is an instance of optimization by an auxiliary
function
▸ by the way: this is also a solution for the missing data problem
(missing at random)
