
Robert Collins

CSE586

Gaussian Mixtures
and the EM Algorithm
Reading: Chapter 7.4, Prince book
Review: The Gaussian Distribution

•  Multivariate Gaussian

   N(x | µ, Σ) = (2π)^(−d/2) |Σ|^(−1/2) exp{ −(1/2)(x − µ)ᵀ Σ⁻¹ (x − µ) }

   Evaluates to a number. The mean µ is a d×1 vector; the covariance Σ is a d×d matrix.

•  Isotropic (spherical) if the covariance is diag(σ², σ², ..., σ²)
Bishop, 2003
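For concreteness, here is a minimal NumPy sketch (not from the original slides) that evaluates this density; gaussian_pdf is just an illustrative name:

import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Evaluate the multivariate Gaussian density N(x | mu, Sigma)."""
    d = mu.shape[0]
    diff = x - mu
    # normalization constant: (2*pi)^(-d/2) |Sigma|^(-1/2)
    norm = (2 * np.pi) ** (-d / 2) * np.linalg.det(Sigma) ** (-0.5)
    # quadratic form: (x - mu)^T Sigma^{-1} (x - mu)
    quad = diff @ np.linalg.solve(Sigma, diff)
    return norm * np.exp(-0.5 * quad)

# Example: a 2-D isotropic (spherical) Gaussian, covariance = diag(sigma^2, sigma^2)
mu = np.zeros(2)
Sigma = np.diag([1.0, 1.0])
print(gaussian_pdf(np.array([1.0, 0.0]), mu, Sigma))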
Likelihood Function

•  Data set X = {x_1, ..., x_N}

•  Assume the observed data points are generated independently:

   p(X | µ, Σ) = ∏_n N(x_n | µ, Σ)

•  Viewed as a function of the parameters, this is known as the likelihood function
Bishop, 2003
Maximum Likelihood

•  Set the parameters by maximizing the likelihood function

•  Equivalently, maximize the log likelihood

   ln p(X | µ, Σ) = −(Nd/2) ln(2π) − (N/2) ln|Σ| − (1/2) Σ_n (x_n − µ)ᵀ Σ⁻¹ (x_n − µ)
Bishop, 2003
Maximum Likelihood Solution

•  Maximizing w.r.t. the mean gives the sample mean

   µ_ML = (1/N) Σ_n x_n

•  Maximizing w.r.t. the covariance gives the sample covariance

   Σ_ML = (1/N) Σ_n (x_n − µ_ML)(x_n − µ_ML)ᵀ

Note: if N is small you want to divide by N−1 when computing the sample covariance to get an unbiased estimate.

Bishop, 2003
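A minimal NumPy sketch of these two estimators (illustrative only, not from the original slides; gaussian_mle is a made-up name):

import numpy as np

def gaussian_mle(X):
    """ML estimates of a Gaussian's mean and covariance from an N x d data matrix."""
    N = X.shape[0]
    mu = X.mean(axis=0)            # sample mean
    diff = X - mu
    Sigma = diff.T @ diff / N      # ML covariance (divide by N)
    # For small N, dividing by (N - 1) instead gives the unbiased estimate.
    return mu, Sigma

# Example on synthetic 2-D data
rng = np.random.default_rng(0)
X = rng.multivariate_normal([1.0, -2.0], [[2.0, 0.5], [0.5, 1.0]], size=500)
mu_hat, Sigma_hat = gaussian_mle(X)
print(mu_hat, Sigma_hat, sep="\n")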
Comments

Gaussians are well understood and easy to estimate.

However, they are unimodal, so they cannot be used to represent inherently multimodal datasets.

Fitting a single Gaussian to a multimodal dataset is likely to give a mean value in an area of low probability, and to overestimate the covariance.
Old Faithful Data Set

[Scatter plot: duration of eruption (minutes) on the horizontal axis vs. time between eruptions (minutes) on the vertical axis; the data form two well-separated clusters]

Bishop, 2004
Idea: Use a Mixture of Gaussians

•  Convex combination of distributions

   p(x) = Σ_k π_k N(x | µ_k, Σ_k),   k = 1, ..., K

•  Normalization and positivity require

   0 ≤ π_k ≤ 1,   Σ_k π_k = 1

•  Can interpret the mixing coefficients π_k as prior probabilities
Bishop, 2003
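As a small illustration (not from the slides), a self-contained sketch that evaluates such a convex combination, using SciPy for the component densities; mixture_pdf is an illustrative name:

import numpy as np
from scipy.stats import multivariate_normal

def mixture_pdf(x, pis, mus, Sigmas):
    """Evaluate p(x) = sum_k pi_k N(x | mu_k, Sigma_k)."""
    return sum(pi_k * multivariate_normal.pdf(x, mu_k, Sigma_k)
               for pi_k, mu_k, Sigma_k in zip(pis, mus, Sigmas))

# A three-component mixture in 2-D
pis = [0.5, 0.3, 0.2]                                   # nonnegative, sum to 1
mus = [np.zeros(2), np.array([3.0, 3.0]), np.array([-3.0, 2.0])]
Sigmas = [np.eye(2), 0.5 * np.eye(2), 2.0 * np.eye(2)]
print(mixture_pdf(np.array([1.0, 1.0]), pis, mus, Sigmas))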
Example: Mixture of 3 Gaussians

[Figure, panels (a) and (b), on the unit square: the three Gaussian components and the resulting mixture density]

Bishop, 2004
Aside: Sampling from a Mixture Model

[Figure, panel (a): points sampled from the three-component mixture on the unit square]
Generate u = uniform random number between 0 and 1
If u < π1
    generate x ~ N(x | µ1, Σ1)
elseif u < π1 + π2
    generate x ~ N(x | µ2, Σ2)
...
elseif u < π1 + π2 + ... + π(K−1)
    generate x ~ N(x | µ(K−1), Σ(K−1))
else
    generate x ~ N(x | µK, ΣK)
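The same procedure in runnable form, as a NumPy sketch (sample_mixture is a made-up name; choosing the component index with probability π_k is equivalent to the cumulative-sum test on u above):

import numpy as np

def sample_mixture(n_samples, pis, mus, Sigmas, seed=0):
    """Draw samples from a Gaussian mixture: pick a component by its mixing
    weight, then sample from that component's Gaussian."""
    rng = np.random.default_rng(seed)
    K = len(pis)
    ks = rng.choice(K, size=n_samples, p=pis)           # component index for each sample
    X = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in ks])
    return X, ks

# Example: 1000 samples from a three-component 2-D mixture
pis = [0.5, 0.3, 0.2]
mus = [np.zeros(2), np.array([3.0, 3.0]), np.array([-3.0, 2.0])]
Sigmas = [np.eye(2), 0.5 * np.eye(2), 2.0 * np.eye(2)]
X, labels = sample_mixture(1000, pis, mus, Sigmas)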
MLE of Mixture Parameters

•  However, MLE of mixture parameters is HARD!

•  Joint distribution:

   p(X | π, µ, Σ) = ∏_n Σ_k π_k N(x_n | µ_k, Σ_k)

•  Log likelihood:

   ln p(X | π, µ, Σ) = Σ_n ln { Σ_k π_k N(x_n | µ_k, Σ_k) }

   Uh-oh, log of a sum
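As an aside (not from the slides), this "log of a sum" is also what one evaluates numerically when monitoring a mixture fit; a minimal SciPy sketch using log-sum-exp for stability (mixture_log_likelihood is an illustrative name):

import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def mixture_log_likelihood(X, pis, mus, Sigmas):
    """ln p(X) = sum_n ln( sum_k pi_k N(x_n | mu_k, Sigma_k) ),
    computed in log space because the inner sum sits inside the log."""
    # log_probs[n, k] = ln pi_k + ln N(x_n | mu_k, Sigma_k)
    log_probs = np.column_stack([
        np.log(pi_k) + multivariate_normal.logpdf(X, mu_k, Sigma_k)
        for pi_k, mu_k, Sigma_k in zip(pis, mus, Sigmas)
    ])
    return float(logsumexp(log_probs, axis=1).sum())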


EM Algorithm

What makes this estimation problem hard?

1) It is a mixture, so log-likelihood is messy

2) We don’t directly see what the underlying process is


[Figure: the data points, colored by which Gaussian generated them ("labels")]

Suppose some oracle told us which point comes from which Gaussian.

How? By providing a “latent” variable z_nk which is 1 if point n comes from the kth component Gaussian, and 0 otherwise (a 1-of-K representation).
This lets us recover the underlying generating process decomposition:

[full labeled data set] = [points from component 1] + [points from component 2] + [points from component 3]
And we can easily estimate each Gaussian, along with the mixture weights!

[all N points] = [cluster 1: N1 points] + [cluster 2: N2 points] + [cluster 3: N3 points]

From the labels, estimate the mixing weights π_k = N_k / N, and from each cluster separately estimate (µ_1, Σ_1), (µ_2, Σ_2), (µ_3, Σ_3).
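A short sketch of this labeled (complete-data) estimation, assuming the labels are given as an integer array z rather than the 1-of-K matrix z_nk (fit_gmm_with_labels is a made-up name):

import numpy as np

def fit_gmm_with_labels(X, z, K):
    """Complete-data MLE: with known labels, the mixing weights and each
    Gaussian are estimated separately from the points assigned to it."""
    N = X.shape[0]
    pis, mus, Sigmas = [], [], []
    for k in range(K):
        Xk = X[z == k]                      # points labeled as component k
        Nk = Xk.shape[0]
        pis.append(Nk / N)                  # pi_k = N_k / N
        mu_k = Xk.mean(axis=0)
        diff = Xk - mu_k
        mus.append(mu_k)
        Sigmas.append(diff.T @ diff / Nk)   # sample covariance of cluster k
    return pis, mus, Sigmas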
Remember that this was a problem:

   ln p(X | π, µ, Σ) = Σ_n ln { Σ_k π_k N(x_n | µ_k, Σ_k) }

How can I make that inner sum be a product instead?
Again, if an oracle gave us the values of the latent variables (the component that generated each point), we could work with the complete-data likelihood

   p(X, Z | π, µ, Σ) = ∏_n ∏_k [ π_k N(x_n | µ_k, Σ_k) ]^(z_nk)

and the log of that looks much better!

   ln p(X, Z | π, µ, Σ) = Σ_n Σ_k z_nk { ln π_k + ln N(x_n | µ_k, Σ_k) }

Latent Variable View

note: for a given n, there are K of these latent variables z_nk, and only ONE of them is 1 (all the rest are 0)
Because only one z_nk is nonzero for each n, the double sum in the complete-data log likelihood picks out, for each point, exactly the term of the component that generated it. This is thus equivalent to grouping the points by their generating component:

   ln p(X, Z | π, µ, Σ) = Σ_k { N_k ln π_k + Σ_{n in component k} ln N(x_n | µ_k, Σ_k) }
Each per-component term can be estimated separately: the kth term involves only µ_k and Σ_k.

The mixing-weight terms are coupled because the mixing weights all sum to 1, but it is no big deal to solve.
EM Algorithm

Unfortunately, oracles don’t exist (or if they do, they won’t talk to us).

So we don’t know the values of the z_nk variables.

What EM proposes to do:

1)  compute p(Z|X,theta), the posterior distribution over the z_nk, given our current best guess at the values of theta

2)  compute the expected value of the log likelihood ln p(X,Z|theta) with respect to the distribution p(Z|X,theta)

3)  find theta_new that maximizes that function; this is our new best guess at the values of theta

4)  iterate...
Insight

Since we don’t know the latent variables, we instead take the expected value of the log likelihood with respect to their posterior distribution P(z|x,theta). In the GMM case, this is equivalent to “softening” the binary latent variables to continuous ones (the expected values of the latent variables):

   z_nk      unknown discrete value, 0 or 1
   γ(z_nk)   known continuous value between 0 and 1

where γ(z_nk) = P(z_nk = 1 | x_n) is the ownership weight of component k for point x_n.
So now, after replacing the binary latent variables with their continuous expected values:

•  all points contribute to the estimation of all components

•  each point has unit mass to contribute, but splits it across the K components

•  the amount of weight a point contributes to a component is proportional to the relative likelihood that the point was generated by that component
Latent Variable View (with an oracle)

With the true binary labels z_nk, each component’s mean and covariance can be estimated separately; the mixing weights are coupled because they all sum to 1, but it is no big deal to solve.
Latent Variable View (with EM, at iteration i)

With EM, the binary z_nk are replaced by ownership weights computed from the iteration-i parameters. Those weights are constants with respect to the new parameters, so the expected complete-data log likelihood has the same structure: each component can still be estimated separately, and the mixing weights are still coupled only by summing to 1, which is no big deal to solve.
EM Algorithm for GMM

E step — ownership weights:

   γ(z_nk) = π_k N(x_n | µ_k, Σ_k) / Σ_j π_j N(x_n | µ_j, Σ_j)

M step — means, covariances, and mixing probabilities, with N_k = Σ_n γ(z_nk):

   µ_k = (1/N_k) Σ_n γ(z_nk) x_n
   Σ_k = (1/N_k) Σ_n γ(z_nk) (x_n − µ_k)(x_n − µ_k)ᵀ
   π_k = N_k / N
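To make the two steps concrete, here is a bare-bones NumPy/SciPy sketch of the loop, assuming d ≥ 2 dimensional data (illustrative only: fixed iteration count, no convergence test, no safeguards against collapsing components; em_gmm is a made-up name):

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, seed=0):
    """Bare-bones EM for a Gaussian mixture model."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # crude initialization: random data points as means, data covariance, uniform weights
    mus = X[rng.choice(N, K, replace=False)]
    Sigmas = np.array([np.cov(X.T) for _ in range(K)])
    pis = np.full(K, 1.0 / K)

    for _ in range(n_iters):
        # E step: ownership weights gamma[n, k] proportional to pi_k N(x_n | mu_k, Sigma_k)
        gamma = np.column_stack([
            pis[k] * multivariate_normal.pdf(X, mus[k], Sigmas[k]) for k in range(K)
        ])
        gamma /= gamma.sum(axis=1, keepdims=True)

        # M step: weighted means, covariances, and mixing probabilities
        Nk = gamma.sum(axis=0)                    # effective number of points per component
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
        pis = Nk / N

    return pis, mus, Sigmas, gamma

The returned gamma array holds the final soft labels, one row per data point.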
Gaussian Mixture Example

Copyright © 2001, Andrew W. Moore


[Figures: snapshots of the fit after the 1st, 2nd, 3rd, 4th, 5th, 6th, and 20th EM iterations]

Recall: Labeled vs Unlabeled Data

labeled: easy to estimate params (do each color separately)
unlabeled: hard to estimate params (we need to assign colors)
EM produces a “Soft” labeling

each point makes a weighted contribution to the estimation of ALL components
Review of EM for GMMs

E step: ownership weights (soft labels)

M step: means, covariances, mixing weights

Alternate E and M steps to convergence.
From EM to K-Means

Alternative explanation of K-means!

•  Fix all mixing weights to 1/K
   [they drop out of the estimation]

•  Fix all covariances to σ²I
   [they drop out of the estimation, so we only have to estimate the means; each Gaussian likelihood becomes inversely proportional to distance from a mean]

•  Take the limit as σ² goes to 0
   [this forces the soft weights to become binary]
The E step ownership weights (soft labels), after fixing the mixing weights and covariances as described above, become

   γ(z_nk) = exp(−‖x_n − µ_k‖² / 2σ²) / Σ_j exp(−‖x_n − µ_j‖² / 2σ²)

Now divide top and bottom by exp(−‖x_n − µ_j*‖² / 2σ²), where µ_j* is the closest mean to x_n, and take the limit as σ² goes to 0:

   γ(z_nk) →  1  if µ_k is the closest mean to x_n
              0  otherwise

These are hard labels, as in the K-means algorithm.
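A small numerical sketch of this limit (not from the slides): with equal mixing weights and covariance σ²I, the ownership weights sharpen toward hard 0/1 assignments as σ² shrinks (soft_assignments is an illustrative name):

import numpy as np

def soft_assignments(X, mus, sigma2):
    """EM ownership weights with equal mixing weights and covariance sigma2 * I."""
    # squared distances d2[n, k] = ||x_n - mu_k||^2
    d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=-1)
    # subtract the row minimum (the divide-top-and-bottom trick) for stability
    logits = -(d2 - d2.min(axis=1, keepdims=True)) / (2 * sigma2)
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

# As sigma2 shrinks, the soft weights approach the hard 0/1 K-means assignments
X = np.array([[0.0, 0.0], [2.1, 0.0]])
mus = np.array([[0.0, 0.0], [2.0, 0.0]])
for s2 in (1.0, 0.1, 0.001):
    print(s2, np.round(soft_assignments(X, mus, s2), 3))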
K-Means Algorithm

•  Given N data points x1, x2, ..., xN
•  Find K cluster centers µ1, µ2, ..., µK to minimize

   J = Σ_n Σ_k z_nk ‖x_n − µ_k‖²

   (z_nk is 1 if point n belongs to cluster k; 0 otherwise)

•  Algorithm:
   –  initialize K cluster centers µ1, µ2, ..., µK
   –  repeat
      •  [E step]  set the z_nk labels to assign each point to its closest cluster center
      •  [M step]  revise each cluster center µj to be the center of mass of the points in that cluster
   –  until convergence (e.g. the z_nk labels don’t change)
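A compact NumPy sketch of this loop (kmeans is a made-up name; an empty cluster simply keeps its previous center):

import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Plain K-means: alternate hard assignments (E-like step) and
    center-of-mass updates (M-like step)."""
    rng = np.random.default_rng(seed)
    mus = X[rng.choice(X.shape[0], K, replace=False)]      # init centers at random data points
    labels = np.zeros(X.shape[0], dtype=int)
    for _ in range(n_iters):
        # E-like step: assign each point to its closest center
        d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        # M-like step: move each center to the mean of its assigned points
        new_mus = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else mus[k]
                            for k in range(K)])
        if np.allclose(new_mus, mus):                       # centers (hence labels) stopped changing
            break
        mus = new_mus
    return mus, labels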
