
Gaussian Mixture Models (GMM) and Expectation-Maximization (EM)
(and K-Means too!)

Source Material for Lecture
http://research.microsoft.com/~cmbishop/talks.htm
http://www.autonlab.org/tutorials/gmm.html

Review: The Gaussian Distribution

• Multivariate Gaussian, parameterized by a mean and a covariance
• Isotropic (circularly symmetric) if covariance is diag(k,k,...,k)

Bishop, 2003

Likelihood Function

• Data set
• Assume observed data points generated independently
• Viewed as a function of the parameters, the joint density of the data is known as the likelihood function

Bishop, 2003
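The equations on these two slides did not survive extraction. For reference, the standard forms are sketched below; the symbols ($D$ for the dimension, $N$ for the number of data points, $\mathbf{x}_n$ for the n-th point, $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ for mean and covariance) are assumed notation, not recovered from the slides.

```latex
% Multivariate Gaussian density with mean \mu and covariance \Sigma
\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})
  = \frac{1}{(2\pi)^{D/2}\,|\boldsymbol{\Sigma}|^{1/2}}
    \exp\left\{ -\tfrac{1}{2}
      (\mathbf{x}-\boldsymbol{\mu})^{T} \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})
    \right\}

% Likelihood of an independently generated data set X = \{x_1, ..., x_N\}
p(\mathbf{X} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})
  = \prod_{n=1}^{N} \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})
```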

Maximum Likelihood

• Set the parameters by maximizing the likelihood function
• Equivalently maximize the log likelihood

Bishop, 2003

Maximum Likelihood Solution

• Maximizing w.r.t. the mean gives the sample mean
• Maximizing w.r.t. the covariance gives the sample covariance

Bishop, 2003
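Written out, the log likelihood and its maximizers take the following standard form. The notation ($N$ points $\mathbf{x}_n$ in $D$ dimensions, subscript ML for the maximum likelihood estimates) is assumed here, since the slide equations were lost.

```latex
% Log likelihood of N independent Gaussian data points
\ln p(\mathbf{X} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})
  = -\frac{ND}{2}\ln(2\pi) - \frac{N}{2}\ln|\boldsymbol{\Sigma}|
    - \frac{1}{2}\sum_{n=1}^{N}
      (\mathbf{x}_n-\boldsymbol{\mu})^{T}\boldsymbol{\Sigma}^{-1}(\mathbf{x}_n-\boldsymbol{\mu})

% Sample mean
\boldsymbol{\mu}_{ML} = \frac{1}{N}\sum_{n=1}^{N}\mathbf{x}_n

% Sample covariance (note the 1/N factor)
\boldsymbol{\Sigma}_{ML} = \frac{1}{N}\sum_{n=1}^{N}
  (\mathbf{x}_n-\boldsymbol{\mu}_{ML})(\mathbf{x}_n-\boldsymbol{\mu}_{ML})^{T}
```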

Bias of Maximum Likelihood

• Consider the expectations of the maximum likelihood estimates under the Gaussian distribution
• The maximum likelihood solution systematically underestimates the covariance
• This is an example of over-fitting

Bishop, 2003

Intuitive Explanation of Over-fitting

[Figure]

Bishop, 2003

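Unbiased Variance Estimate

• Clearly we can remove the bias by using
$$\widetilde{\boldsymbol{\Sigma}} = \frac{1}{N-1}\sum_{n=1}^{N}(\mathbf{x}_n-\boldsymbol{\mu}_{ML})(\mathbf{x}_n-\boldsymbol{\mu}_{ML})^{T}$$
since this gives
$$\mathbb{E}[\widetilde{\boldsymbol{\Sigma}}] = \boldsymbol{\Sigma}$$
(here $\boldsymbol{\mu}_{ML}$ is the sample mean of the $N$ data points $\mathbf{x}_n$, and $\boldsymbol{\Sigma}$ is the true covariance)
• Arises naturally in a Bayesian treatment
• For an infinite data set the two expressions are equal

Bishop, 2003

Comments

Gaussians are well understood and easy to estimate.

However, they are unimodal, thus cannot be used to represent inherently multimodal datasets.

Fitting a single Gaussian to a multimodal dataset is likely to give a mean value in an area with low probability, and to overestimate the covariance.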
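The bias and its correction can be checked numerically. The following is a minimal sketch, assuming NumPy and a one-dimensional Gaussian; the sample size, number of trials and true variance are arbitrary illustrative choices, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials, true_var = 5, 20000, 4.0   # illustrative values (assumptions)

# Draw many small data sets of N points each from a zero-mean Gaussian.
samples = rng.normal(0.0, np.sqrt(true_var), size=(trials, N))

# Maximum likelihood estimate: divide by N (ddof=0).
ml_var = samples.var(axis=1, ddof=0)
# Unbiased estimate: divide by N - 1 (ddof=1).
unbiased_var = samples.var(axis=1, ddof=1)

# Averaging over many data sets approximates the expectation of each estimator.
mean_ml = ml_var.mean()
mean_unbiased = unbiased_var.mean()
print(mean_ml, mean_unbiased, true_var * (N - 1) / N)
```

The average of the maximum likelihood estimates should land near (N-1)/N times the true variance, while the corrected estimate should land near the true variance itself.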

Old Faithful Data Set

[Figure: time between eruptions (minutes) plotted against duration of eruption (minutes)]

Bishop, 2004

Some Bio Assay data

[Figure: scatter plot with axes labelled "some axis" and "some other axis"]

Copyright © 2001, Andrew W. Moore

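Idea: Use a Mixture of Gaussians

• Linear super-position of Gaussians
$$p(\mathbf{x}) = \sum_{k=1}^{K}\pi_k\,\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$$
• Normalization and positivity require
$$\sum_{k=1}^{K}\pi_k = 1, \qquad 0 \le \pi_k \le 1$$
• Can interpret the mixing coefficients $\pi_k$ as prior probabilities

Bishop, 2003

Example: Mixture of 3 Gaussians

[Figure]

Bishop, 2004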

Sampling from the Gaussian

• To generate a data point:
– first pick one of the components with probability given by its mixing coefficient
– then draw a sample from that component
• Repeat these two steps for each new data point

Bishop, 2003

Sampling from a GMM

• There are k components. The i’th component is called ω_i
• Component ω_i has an associated mean vector μ_i
• Each component generates data from a Gaussian with mean μ_i and covariance matrix σ²I

Assume that each datapoint is generated according to the following recipe:
1. Pick a component at random. Choose component i with probability P(ω_i).
2. Datapoint ~ N(μ_i, σ²I)

Copyright © 2001, Andrew W. Moore

Sampling from a General GMM

• There are k components. The i’th component is called ω_i
• Component ω_i has an associated mean vector μ_i
• Each component generates data from a Gaussian with mean μ_i and covariance matrix Σ_i

Assume that each datapoint is generated according to the following recipe:
1. Pick a component at random. Choose component i with probability P(ω_i).
2. Datapoint ~ N(μ_i, Σ_i)

Copyright © 2001, Andrew W. Moore
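The two-step recipe can be sketched in a few lines. This is only an illustration, assuming NumPy; the function name `sample_gmm` and the example parameters are hypothetical and not part of the original slides.

```python
import numpy as np

def sample_gmm(n, weights, means, covs, rng):
    """Draw n datapoints from a general GMM (hypothetical helper)."""
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    covs = np.asarray(covs, dtype=float)
    # Step 1: pick a component for every datapoint, component i with probability weights[i].
    labels = rng.choice(len(weights), size=n, p=weights)
    # Step 2: draw each datapoint from the Gaussian of its chosen component.
    points = np.empty((n, means.shape[1]))
    for i in range(len(weights)):
        mask = labels == i
        count = int(mask.sum())
        if count:
            points[mask] = rng.multivariate_normal(means[i], covs[i], size=count)
    return points, labels

# Example with two well-separated components (illustrative values only).
rng = np.random.default_rng(0)
points, labels = sample_gmm(
    1000,
    weights=[0.5, 0.5],
    means=[[0.0, 0.0], [10.0, 10.0]],
    covs=[np.eye(2), np.eye(2)],
    rng=rng,
)
```

Setting every covariance to σ²I gives the spherical case from the first slide; passing a different full matrix per component gives the general case.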

GMM describing assay data

[Figure]

Copyright © 2001, Andrew W. Moore

GMM density function

[Figure]

Note: now we have a continuous estimate of the density, so can estimate a value at any point.

Also, could draw constant-probability contours if we wanted to.

Copyright © 2001, Andrew W. Moore

Fitting the Gaussian Mixture

• We wish to invert this process – given the data set, find the corresponding parameters:
– mixing coefficients
– means
– covariances
• If we knew which component generated each data point, the maximum likelihood solution would involve fitting each component to the corresponding cluster
• Problem: the data set is unlabelled
• We shall refer to the labels as latent (= hidden) variables

Bishop, 2003

Maximum Likelihood for the GMM

• The log likelihood function takes the form
$$\ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{n=1}^{N}\ln\left\{\sum_{k=1}^{K}\pi_k\,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)\right\}$$
• Note: sum over components appears inside the log
• There is no closed form solution for maximum likelihood
• However, with labeled data, the story is different

Bishop, 2003
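Although the GMM log likelihood has no closed-form maximizer, it is easy to evaluate for any given parameter setting, which is handy for monitoring a fitting procedure. A minimal sketch, assuming NumPy; `gaussian_pdf` and `gmm_log_likelihood` are hypothetical helper names, and no protection against numerical underflow is included.

```python
import numpy as np

def gaussian_pdf(X, mean, cov):
    """Density of N(mean, cov) evaluated at each row of X."""
    d = X.shape[1]
    diff = X - mean
    # Mahalanobis distances, solving cov @ y = diff^T rather than inverting cov.
    maha = np.sum(diff * np.linalg.solve(cov, diff.T).T, axis=1)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * maha) / norm

def gmm_log_likelihood(X, weights, means, covs):
    """Sum over points of the log of the weighted sum of component densities."""
    mixture = sum(w * gaussian_pdf(X, m, c) for w, m, c in zip(weights, means, covs))
    return float(np.sum(np.log(mixture)))

# A single standard-normal component evaluated at the origin in 2-D.
value = gmm_log_likelihood(np.array([[0.0, 0.0]]), [1.0], [np.zeros(2)], [np.eye(2)])
```

Note how the sum over components is formed before the logarithm is taken, which is exactly why the parameters do not decouple.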

Labeled vs Unlabeled Data

[Figure: two panels, labeled and unlabeled]

labeled: Easy to estimate params (do each color separately)
unlabeled: Hard to estimate params (we need to assign colors)

Bishop, 2003

Side-Trip : Clustering using K-means

K-means is a well-known method of clustering data.

Determines location of clusters (cluster centers), as well as which data points are “owned” by which cluster.

Motivation: K-means may give us some insight into how to label data points by which cluster they come from (i.e. determine ownership or membership)

K-means and Hierarchical Clustering

Andrew W. Moore
Associate Professor
School of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~awm
awm@cs.cmu.edu
412-268-7599

Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew’s tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.

Some Data

[Figure]

This could easily be modeled by a Gaussian Mixture (with 5 components)

But let’s look at a satisfying, friendly and infinitely popular alternative…

K-means

1. Ask user how many clusters they’d like. (e.g. k=5)
2. Randomly guess k cluster Center locations
3. Each datapoint finds out which Center it’s closest to. (Thus each Center “owns” a set of datapoints)
4. Each Center finds the centroid of the points it owns…
5. …and jumps there
6. …Repeat until terminated!

K-means: Start

[Figure]

Example generated by Dan Pelleg’s super-duper fast K-means system: Dan Pelleg and Andrew Moore. Accelerating Exact k-means Algorithms with Geometric Reasoning. Proc. Conference on Knowledge Discovery in Databases 1999, (KDD99) (available on www.autonlab.org/pap.html)
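The six steps above can be sketched directly in code. This is a minimal illustration, assuming NumPy, Euclidean distance, and random datapoints as the initial guesses; the function name `kmeans` and the stopping rule (ownership stops changing, or an iteration cap is reached) are choices made here, not taken from the slides.

```python
import numpy as np

def kmeans(X, k, rng, max_iter=100):
    # Step 2: guess k center locations (here: k distinct datapoints chosen at random).
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    owner = None
    for _ in range(max_iter):
        # Step 3: each datapoint finds the center it is closest to.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_owner = dists.argmin(axis=1)
        # Step 6: terminate once ownership no longer changes.
        if owner is not None and np.array_equal(new_owner, owner):
            break
        owner = new_owner
        # Steps 4-5: each center jumps to the centroid of the points it owns.
        for j in range(k):
            if np.any(owner == j):  # a center that owns nothing stays where it is
                centers[j] = X[owner == j].mean(axis=0)
    return centers, owner

# Two well-separated blobs (illustrative data only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])
centers, owner = kmeans(X, k=2, rng=rng)
```

Step 1 corresponds to the caller supplying `k`. Leaving an empty center in place is one simple policy; re-seeding it at a random datapoint is another.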

K-means continues…

[Figures: successive K-means iterations on the example data]

K-means terminates

[Figure]

K-means Questions

• What is it trying to optimize?
• Are we sure it will terminate?
• Are we sure it will find an optimal clustering?
• How should we start it?
• How could we automatically choose the number of centers?

….we’ll deal with these questions over the next few slides

Distortion

Given..
• an encoder function: ENCODE : $\Re^m \to [1..k]$
• a decoder function: DECODE : $[1..k] \to \Re^m$

Define…
$$\text{Distortion} = \sum_{i=1}^{R}\left(\mathbf{x}_i - \text{DECODE}[\text{ENCODE}(\mathbf{x}_i)]\right)^2$$

We may as well write
$$\text{DECODE}[j] = \mathbf{c}_j$$
so
$$\text{Distortion} = \sum_{i=1}^{R}\left(\mathbf{x}_i - \mathbf{c}_{\text{ENCODE}(\mathbf{x}_i)}\right)^2$$
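As a concrete check of the definition, the distortion can be computed for a tiny data set. This is a sketch assuming NumPy; `encode` and `distortion` are hypothetical helper names, the encoder shown is the nearest-center one (one possible choice of ENCODE), and indices are 0-based rather than 1..k.

```python
import numpy as np

def encode(X, centers):
    """ENCODE: map each row of X to the index of a center (here, the nearest one)."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def distortion(X, centers, codes):
    """Sum over records of the squared distance to the decoded center."""
    return float(((X - centers[codes]) ** 2).sum())

X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0]])
centers = np.array([[0.0, 1.0], [10.0, 0.0]])   # DECODE[j] = centers[j]
codes = encode(X, centers)
total = distortion(X, centers, codes)
print(codes, total)
```

Here the first two records are each at squared distance 1 from the first center and the third record sits exactly on the second center, so the distortion is 2.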

The Minimal Distortion (1)

$$\text{Distortion} = \sum_{i=1}^{R}\left(\mathbf{x}_i - \mathbf{c}_{\text{ENCODE}(\mathbf{x}_i)}\right)^2$$

What properties must centers c1 , c2 , … , ck have when distortion is minimized?

(1) xi must be encoded by its nearest center

….why? Otherwise distortion could be reduced by replacing ENCODE[xi] by the nearest center

$$\mathbf{c}_{\text{ENCODE}(\mathbf{x}_i)} = \arg\min_{\mathbf{c}_j \in \{\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_k\}} (\mathbf{x}_i - \mathbf{c}_j)^2$$

..at the minimal distortion

The Minimal Distortion (2)


(2) The partial derivative of Distortion with respect to each center location must be zero.

$$\text{Distortion} = \sum_{i=1}^{R}\left(\mathbf{x}_i - \mathbf{c}_{\text{ENCODE}(\mathbf{x}_i)}\right)^2 = \sum_{j=1}^{k}\;\sum_{i \in \text{OwnedBy}(\mathbf{c}_j)} (\mathbf{x}_i - \mathbf{c}_j)^2$$

OwnedBy(cj ) = the set of records owned by Center cj .

$$\frac{\partial\,\text{Distortion}}{\partial \mathbf{c}_j} = \frac{\partial}{\partial \mathbf{c}_j}\sum_{i \in \text{OwnedBy}(\mathbf{c}_j)} (\mathbf{x}_i - \mathbf{c}_j)^2 = -2\sum_{i \in \text{OwnedBy}(\mathbf{c}_j)} (\mathbf{x}_i - \mathbf{c}_j) = 0 \quad \text{(for a minimum)}$$

Thus, at a minimum:
$$\mathbf{c}_j = \frac{1}{|\text{OwnedBy}(\mathbf{c}_j)|}\sum_{i \in \text{OwnedBy}(\mathbf{c}_j)} \mathbf{x}_i$$

At the minimum distortion

$$\text{Distortion} = \sum_{i=1}^{R}\left(\mathbf{x}_i - \mathbf{c}_{\text{ENCODE}(\mathbf{x}_i)}\right)^2$$

What properties must centers c1 , c2 , … , ck have when distortion is minimized?
(1) xi must be encoded by its nearest center
(2) Each Center must be at the centroid of points it owns.

Improving a suboptimal configuration…

What properties can be changed for centers c1 , c2 , … , ck when distortion is not minimized?
(1) Change encoding so that xi is encoded by its nearest center
(2) Set each Center to the centroid of points it owns.

There’s no point applying either operation twice in succession.
But it can be profitable to alternate.
…And that’s K-means!

Easy to prove this procedure will terminate in a state at which neither (1) nor (2) changes the configuration. Why?

Improving a suboptimal configuration…

There are only a finite number of ways of partitioning R records into k groups.
So there are only a finite number of possible configurations in which all Centers are the centroids of the points they own.
If the configuration changes on an iteration, it must have improved the distortion.
So each time the configuration changes it must go to a configuration it’s never been to before.
So if it tried to go on forever, it would eventually run out of configurations.

Will we find the optimal configuration?

• Not necessarily.
• Can you invent a configuration that has converged, but does not have the minimum distortion? (Hint: try a fiendish k=3 configuration here…)

Trying to find good optima

• Idea 1: Be careful about where you start
• Idea 2: Do many runs of k-means, each from a different random start configuration
• Many other ideas floating around.

Neat trick:
Place first center on top of randomly chosen datapoint.
Place second center on datapoint that’s as far away as possible from first center
:
Place j’th center on datapoint that’s as far away as possible from the closest of Centers 1 through j-1
:
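The neat trick can be sketched as follows, assuming NumPy and Euclidean distance. The function name `farthest_point_init` is hypothetical, and ties are broken by taking the first farthest datapoint, which the slide does not specify.

```python
import numpy as np

def farthest_point_init(X, k, rng):
    # First center: a randomly chosen datapoint.
    chosen = [int(rng.integers(len(X)))]
    # Distance from every datapoint to its closest already-chosen center.
    closest = np.linalg.norm(X - X[chosen[0]], axis=1)
    for _ in range(1, k):
        # Next center: the datapoint farthest from the closest of the chosen centers.
        nxt = int(closest.argmax())
        chosen.append(nxt)
        closest = np.minimum(closest, np.linalg.norm(X - X[nxt], axis=1))
    return X[chosen].copy()

# Two nearly coincident points plus two far-away points (illustrative data).
X = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 0.0], [10.0, 10.0]])
init = farthest_point_init(X, k=3, rng=np.random.default_rng(0))
```

Whichever datapoint is picked first in this example, the three chosen centers cover both far-away points and only one of the two nearly coincident ones. The method is sensitive to outliers, since an outlier is by construction far from everything.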

Common uses of K-means

• Often used as an exploratory data analysis tool
• In one dimension, a good way to quantize real-valued variables into k non-uniform buckets
• Used on acoustic data in speech understanding to convert waveforms into one of k categories (known as Vector Quantization)
• Also used for choosing color palettes on old fashioned graphical display devices!
• Used to initialize clusters for the EM algorithm!!!

Back to Estimating GMMs
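Recall: Maximum Likelihood for the GMM

• The log likelihood function takes the form
$$\ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{n=1}^{N}\ln\left\{\sum_{k=1}^{K}\pi_k\,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)\right\}$$
• Note: sum over components appears inside the log
• There is no closed form solution for maximum likelihood
• However, with labeled data, the story is different

SHOW THIS ON THE BOARD!!

Bishop, 2003

Latent Variable View of EM

• Binary latent variables $z_{nk}$ describing which component generated each data point ($z_{nk} = 1$ if component k generated $\mathbf{x}_n$, and 0 otherwise)
• If we knew the values for the latent variables, we would maximize the complete-data log likelihood
$$\ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{n=1}^{N}\sum_{k=1}^{K} z_{nk}\left\{\ln \pi_k + \ln \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)\right\}$$
which gives a trivial closed-form solution (fit each component to the corresponding set of data points)
• We don’t know the values of the latent variables
• However, for given parameter values we can compute the expected values of the latent variables

Bishop, 2003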

EM – Latent Variable Viewpoint

[Figure]

Bishop, 2003

Expected Value of Latent Variable

• To compute the expected values of the latent variables, given an observation x, we need to know their distribution.
• P(Z|X) = P(X,Z) / P(X) - distribution of latent variables given x
• Recall: p(x,z)

FOR INTERPRETATION OF THIS QUANTITY WE AGAIN ARE GOING TO THE BOARD!!

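Expected Complete-Data Log Likelihood

• Suppose we make a guess for the parameter values (means, covariances and mixing coefficients)
• Use these to evaluate the responsibilities (ownership weights)
• Consider expected complete-data log likelihood
$$\mathbb{E}_{\mathbf{Z}}\left[\ln p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma})\right] = \sum_{n=1}^{N}\sum_{k=1}^{K}\gamma(z_{nk})\left\{\ln \pi_k + \ln \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)\right\}$$
where responsibilities $\gamma(z_{nk})$ are computed using the guessed parameter values
• We are implicitly ‘filling in’ latent variables with best guess
• Keeping the responsibilities fixed and maximizing with respect to the parameters gives the previous results

Bishop, 2003

Expected Complete-Data Log Likelihood

• To summarize what we just did, we replaced:
$z_{nk}$: unknown discrete value 0 or 1
• With:
$\gamma(z_{nk})$: known continuous value between 0 and 1

Bishop, 2003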

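Posterior Probabilities

• We can think of the mixing coefficients as prior probabilities for the components
• For a given value of x we can evaluate the corresponding posterior probabilities, called responsibilities
• These are given from Bayes’ theorem by
$$\gamma(z_{nk}) = \frac{\pi_k\,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K}\pi_j\,\mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}$$

Bishop, 2003

Posterior Probabilities (colour coded)

[Figure]

Bishop, 2003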
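Computing responsibilities is a direct application of Bayes' theorem: prior times likelihood, normalized over components. A minimal sketch, assuming NumPy; `gaussian_pdf` and `responsibilities` are hypothetical helper names, and the code does not guard against all component densities underflowing to zero for a very distant point.

```python
import numpy as np

def gaussian_pdf(X, mean, cov):
    """Density of N(mean, cov) evaluated at each row of X."""
    d = X.shape[1]
    diff = X - mean
    maha = np.sum(diff * np.linalg.solve(cov, diff.T).T, axis=1)
    return np.exp(-0.5 * maha) / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))

def responsibilities(X, weights, means, covs):
    # Numerator of Bayes' theorem: prior (mixing coefficient) times likelihood.
    joint = np.column_stack(
        [w * gaussian_pdf(X, m, c) for w, m, c in zip(weights, means, covs)]
    )
    # Denominator: the mixture density, i.e. the sum over components.
    return joint / joint.sum(axis=1, keepdims=True)

# Two equally weighted unit-covariance components; the third point is midway between them.
X = np.array([[0.0, 0.0], [4.0, 0.0], [2.0, 0.0]])
gamma = responsibilities(
    X,
    weights=[0.5, 0.5],
    means=[np.array([0.0, 0.0]), np.array([4.0, 0.0])],
    covs=[np.eye(2), np.eye(2)],
)
```

Each row of `gamma` sums to one. Points sitting on a component mean are owned almost entirely by that component, while the midpoint is shared equally.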

EM Algorithm for GMM

E step: compute the ownership weights
M step: update the means, covariances and mixing probabilities

Gaussian Mixture Example: Start

[Figure]

Copyright © 2001, Andrew W. Moore
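Putting the two steps together gives a short loop. The following is only a sketch under stated assumptions: NumPy, initial means supplied by the caller, identity initial covariances, equal initial mixing probabilities, a fixed number of iterations instead of a convergence test, and a small ridge added to each covariance to keep it invertible. The names `gaussian_pdf` and `em_gmm` are hypothetical.

```python
import numpy as np

def gaussian_pdf(X, mean, cov):
    """Density of N(mean, cov) evaluated at each row of X."""
    d = X.shape[1]
    diff = X - mean
    maha = np.sum(diff * np.linalg.solve(cov, diff.T).T, axis=1)
    return np.exp(-0.5 * maha) / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))

def em_gmm(X, init_means, n_iter=50):
    n, d = X.shape
    means = np.asarray(init_means, dtype=float).copy()
    k = len(means)
    covs = np.array([np.eye(d) for _ in range(k)])   # assumed starting covariances
    weights = np.full(k, 1.0 / k)                     # assumed starting mixing probabilities
    resp = None
    for _ in range(n_iter):
        # E step: ownership weights of each component for each point.
        joint = np.column_stack(
            [weights[j] * gaussian_pdf(X, means[j], covs[j]) for j in range(k)]
        )
        resp = joint / joint.sum(axis=1, keepdims=True)
        # M step: weighted sample means, weighted sample covariances, mixing probabilities.
        nk = resp.sum(axis=0)
        means = (resp.T @ X) / nk[:, None]
        for j in range(k):
            diff = X - means[j]
            covs[j] = (resp[:, j, None] * diff).T @ diff / nk[j]
            covs[j] += 1e-6 * np.eye(d)  # small ridge (an assumption) for stability
        weights = nk / n
    return weights, means, covs, resp

# Two well-separated blobs (illustrative data only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(100, 2)),
               rng.normal(5.0, 0.5, size=(100, 2))])
weights, means, covs, resp = em_gmm(X, init_means=[[1.0, 1.0], [4.0, 4.0]])
```

In practice one would also track the log likelihood across iterations and stop when it no longer improves. Initial means could come from a K-means run rather than being hand-picked as here.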

[Figures: the Gaussian mixture example after the first, 2nd, 3rd, 4th, 5th, 6th and 20th iterations]

Copyright © 2001, Andrew W. Moore

Homework
4) Change your code for generating N points from a single multivariate Gaussian to instead generate N points from a mixture of Gaussians. Assume K Gaussians, each of which is specified by a mixing parameter 0<=p_i<=1 (with the p_i summing to 1), a 2x1 mean vector mu_i, and a 2x2 covariance matrix C_i.

5) Write code to perform the K-means algorithm, given a set of N data points and a number K of desired clusters. You can either start the algorithm with random cluster centers, or else try something smarter that you can think up.

6) Write a program to estimate the parameters of a mixture of Gaussians. At the heart of it will be code that looks a lot like the code you wrote last time to estimate MLE parameters of a multivariate Gaussian, except now you are computing weighted sample means, weighted sample covariances, and the mixing parameters, for each of K Gaussian components [this is the “M” step of EM]. You will also need the “E” step, which estimates the ownership weights for each point that the “M” step uses. Starting with N sample points and a number K of desired Gaussian components, use EM to estimate the K mixing weights p_i, K 2x1 vectors mu_i, and K 2x2 covariance matrices C_i. To initialize, you could use a random start (say random selection of K mean vectors, identity matrices for each covariance, and 1/K for each mixing weight p_i). Or, you could first run k-means to find a more appropriate set of cluster centers to use as initial mean vectors.

Copyright © 2001, Andrew W. Moore
