
Robert Collins

CSE586

Gaussian Mixtures
and the EM Algorithm
Reading: Chapter 7.4, Prince book
Review: The Gaussian Distribution

•  Multivariate Gaussian

   N(x | µ, Σ) = (2π)^(−d/2) |Σ|^(−1/2) exp{ −(1/2)(x − µ)ᵀ Σ⁻¹ (x − µ) }

   Evaluates to a number. The mean µ is a d×1 vector; the covariance Σ is a d×d matrix.

•  Isotropic (spherical) if the covariance is diag(σ², σ², ..., σ²)
Bishop, 2003
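For concreteness, here is a minimal NumPy sketch (not from the original slides) that evaluates this density; gaussian_pdf is just an illustrative name:

import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Evaluate the multivariate Gaussian density N(x | mu, Sigma)."""
    d = mu.shape[0]
    diff = x - mu
    # normalization constant: (2*pi)^(-d/2) |Sigma|^(-1/2)
    norm = (2 * np.pi) ** (-d / 2) * np.linalg.det(Sigma) ** (-0.5)
    # quadratic form: (x - mu)^T Sigma^{-1} (x - mu)
    quad = diff @ np.linalg.solve(Sigma, diff)
    return norm * np.exp(-0.5 * quad)

# Example: a 2-D isotropic (spherical) Gaussian, covariance = diag(sigma^2, sigma^2)
mu = np.zeros(2)
Sigma = np.diag([1.0, 1.0])
print(gaussian_pdf(np.array([1.0, 0.0]), mu, Sigma))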
Likelihood Function

•  Data set X = {x_1, ..., x_N}

•  Assume the observed data points are generated independently:

   p(X | µ, Σ) = ∏_n N(x_n | µ, Σ)

•  Viewed as a function of the parameters, this is known as the likelihood function
Bishop, 2003
Maximum Likelihood

•  Set the parameters by maximizing the likelihood function

•  Equivalently, maximize the log likelihood

   ln p(X | µ, Σ) = −(Nd/2) ln(2π) − (N/2) ln|Σ| − (1/2) Σ_n (x_n − µ)ᵀ Σ⁻¹ (x_n − µ)
Bishop, 2003
Maximum Likelihood Solution

•  Maximizing w.r.t. the mean gives the sample mean

   µ_ML = (1/N) Σ_n x_n

•  Maximizing w.r.t. the covariance gives the sample covariance

   Σ_ML = (1/N) Σ_n (x_n − µ_ML)(x_n − µ_ML)ᵀ

Note: if N is small you want to divide by N−1 when computing the sample covariance to get an unbiased estimate.

Bishop, 2003
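A minimal NumPy sketch of these two estimators (illustrative only, not from the original slides; gaussian_mle is a made-up name):

import numpy as np

def gaussian_mle(X):
    """ML estimates of a Gaussian's mean and covariance from an N x d data matrix."""
    N = X.shape[0]
    mu = X.mean(axis=0)            # sample mean
    diff = X - mu
    Sigma = diff.T @ diff / N      # ML covariance (divide by N)
    # For small N, dividing by (N - 1) instead gives the unbiased estimate.
    return mu, Sigma

# Example on synthetic 2-D data
rng = np.random.default_rng(0)
X = rng.multivariate_normal([1.0, -2.0], [[2.0, 0.5], [0.5, 1.0]], size=500)
mu_hat, Sigma_hat = gaussian_mle(X)
print(mu_hat, Sigma_hat, sep="\n")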
Comments

Gaussians are well understood and easy to estimate.

However, they are unimodal, so they cannot be used to represent inherently multimodal datasets.

Fitting a single Gaussian to a multimodal dataset is likely to give a mean value in an area of low probability, and to overestimate the covariance.
Old Faithful Data Set

[Scatter plot: duration of eruption (minutes) on the horizontal axis vs. time between eruptions (minutes) on the vertical axis; the data form two well-separated clusters]

Bishop, 2004
Idea: Use a Mixture of Gaussians

•  Convex combination of distributions

   p(x) = Σ_k π_k N(x | µ_k, Σ_k),   k = 1, ..., K

•  Normalization and positivity require

   0 ≤ π_k ≤ 1,   Σ_k π_k = 1

•  Can interpret the mixing coefficients π_k as prior probabilities
Bishop, 2003
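As a small illustration (not from the slides), a self-contained sketch that evaluates such a convex combination, using SciPy for the component densities; mixture_pdf is an illustrative name:

import numpy as np
from scipy.stats import multivariate_normal

def mixture_pdf(x, pis, mus, Sigmas):
    """Evaluate p(x) = sum_k pi_k N(x | mu_k, Sigma_k)."""
    return sum(pi_k * multivariate_normal.pdf(x, mu_k, Sigma_k)
               for pi_k, mu_k, Sigma_k in zip(pis, mus, Sigmas))

# A three-component mixture in 2-D
pis = [0.5, 0.3, 0.2]                                   # nonnegative, sum to 1
mus = [np.zeros(2), np.array([3.0, 3.0]), np.array([-3.0, 2.0])]
Sigmas = [np.eye(2), 0.5 * np.eye(2), 2.0 * np.eye(2)]
print(mixture_pdf(np.array([1.0, 1.0]), pis, mus, Sigmas))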
Example: Mixture of 3 Gaussians

[Figure, panels (a) and (b), on the unit square: the three Gaussian components and the resulting mixture density]

Bishop, 2004
Aside: Sampling from a Mixture Model

[Figure, panel (a): points sampled from the three-component mixture on the unit square]
Generate u = uniform random number between 0 and 1
If u < π1
    generate x ~ N(x | µ1, Σ1)
elseif u < π1 + π2
    generate x ~ N(x | µ2, Σ2)
...
elseif u < π1 + π2 + ... + π(K−1)
    generate x ~ N(x | µ(K−1), Σ(K−1))
else
    generate x ~ N(x | µK, ΣK)
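The same procedure in runnable form, as a NumPy sketch (sample_mixture is a made-up name; choosing the component index with probability π_k is equivalent to the cumulative-sum test on u above):

import numpy as np

def sample_mixture(n_samples, pis, mus, Sigmas, seed=0):
    """Draw samples from a Gaussian mixture: pick a component by its mixing
    weight, then sample from that component's Gaussian."""
    rng = np.random.default_rng(seed)
    K = len(pis)
    ks = rng.choice(K, size=n_samples, p=pis)           # component index for each sample
    X = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in ks])
    return X, ks

# Example: 1000 samples from a three-component 2-D mixture
pis = [0.5, 0.3, 0.2]
mus = [np.zeros(2), np.array([3.0, 3.0]), np.array([-3.0, 2.0])]
Sigmas = [np.eye(2), 0.5 * np.eye(2), 2.0 * np.eye(2)]
X, labels = sample_mixture(1000, pis, mus, Sigmas)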
MLE of Mixture Parameters

•  However, MLE of mixture parameters is HARD!

•  Joint distribution:

   p(X | π, µ, Σ) = ∏_n Σ_k π_k N(x_n | µ_k, Σ_k)

•  Log likelihood:

   ln p(X | π, µ, Σ) = Σ_n ln { Σ_k π_k N(x_n | µ_k, Σ_k) }

   Uh-oh, log of a sum
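As an aside (not from the slides), this "log of a sum" is also what one evaluates numerically when monitoring a mixture fit; a minimal SciPy sketch using log-sum-exp for stability (mixture_log_likelihood is an illustrative name):

import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def mixture_log_likelihood(X, pis, mus, Sigmas):
    """ln p(X) = sum_n ln( sum_k pi_k N(x_n | mu_k, Sigma_k) ),
    computed in log space because the inner sum sits inside the log."""
    # log_probs[n, k] = ln pi_k + ln N(x_n | mu_k, Sigma_k)
    log_probs = np.column_stack([
        np.log(pi_k) + multivariate_normal.logpdf(X, mu_k, Sigma_k)
        for pi_k, mu_k, Sigma_k in zip(pis, mus, Sigmas)
    ])
    return float(logsumexp(log_probs, axis=1).sum())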


EM Algorithm

What makes this estimation problem hard?

1) It is a mixture, so log-likelihood is messy

2) We don’t directly see what the underlying process is


[Figure: the data points, colored by which Gaussian generated them ("labels")]

Suppose some oracle told us which point comes from which Gaussian.

How? By providing a “latent” variable z_nk which is 1 if point n comes from the kth component Gaussian, and 0 otherwise (a 1-of-K representation).
This lets us recover the underlying generating process decomposition:

[full labeled data set] = [points from component 1] + [points from component 2] + [points from component 3]
And we can easily estimate each Gaussian, along with the mixture weights!

[all N points] = [cluster 1: N1 points] + [cluster 2: N2 points] + [cluster 3: N3 points]

From the labels, estimate the mixing weights π_k = N_k / N, and from each cluster separately estimate (µ_1, Σ_1), (µ_2, Σ_2), (µ_3, Σ_3).
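A short sketch of this labeled (complete-data) estimation, assuming the labels are given as an integer array z rather than the 1-of-K matrix z_nk (fit_gmm_with_labels is a made-up name):

import numpy as np

def fit_gmm_with_labels(X, z, K):
    """Complete-data MLE: with known labels, the mixing weights and each
    Gaussian are estimated separately from the points assigned to it."""
    N = X.shape[0]
    pis, mus, Sigmas = [], [], []
    for k in range(K):
        Xk = X[z == k]                      # points labeled as component k
        Nk = Xk.shape[0]
        pis.append(Nk / N)                  # pi_k = N_k / N
        mu_k = Xk.mean(axis=0)
        diff = Xk - mu_k
        mus.append(mu_k)
        Sigmas.append(diff.T @ diff / Nk)   # sample covariance of cluster k
    return pis, mus, Sigmas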
Remember that this was a problem:

   ln p(X | π, µ, Σ) = Σ_n ln { Σ_k π_k N(x_n | µ_k, Σ_k) }

How can I make that inner sum be a product instead?
Again, if an oracle gave us the values of the latent variables (the component that generated each point), we could work with the complete-data likelihood

   p(X, Z | π, µ, Σ) = ∏_n ∏_k [ π_k N(x_n | µ_k, Σ_k) ]^(z_nk)

and the log of that looks much better!

   ln p(X, Z | π, µ, Σ) = Σ_n Σ_k z_nk { ln π_k + ln N(x_n | µ_k, Σ_k) }

Latent Variable View

note: for a given n, there are K of these latent variables z_nk, and only ONE of them is 1 (all the rest are 0)
Because only one z_nk is nonzero for each n, the double sum in the complete-data log likelihood picks out, for each point, exactly the term of the component that generated it. This is thus equivalent to grouping the points by their generating component:

   ln p(X, Z | π, µ, Σ) = Σ_k { N_k ln π_k + Σ_{n in component k} ln N(x_n | µ_k, Σ_k) }
Each per-component term can be estimated separately: the kth term involves only µ_k and Σ_k.

The mixing-weight terms are coupled because the mixing weights all sum to 1, but it is no big deal to solve.
EM Algorithm

Unfortunately, oracles don’t exist (or if they do, they won’t talk to us).

So we don’t know the values of the z_nk variables.

What EM proposes to do:

1)  compute p(Z|X,theta), the posterior distribution over the z_nk, given our current best guess at the values of theta

2)  compute the expected value of the log likelihood ln p(X,Z|theta) with respect to the distribution p(Z|X,theta)

3)  find theta_new that maximizes that function; this is our new best guess at the values of theta

4)  iterate...
Insight

Since we don’t know the latent variables, we instead take the expected value of the log likelihood with respect to their posterior distribution P(z|x,theta). In the GMM case, this is equivalent to “softening” the binary latent variables to continuous ones (the expected values of the latent variables):

   z_nk      unknown discrete value, 0 or 1
   γ(z_nk)   known continuous value between 0 and 1

where γ(z_nk) = P(z_nk = 1 | x_n) is the ownership weight of component k for point x_n.
So now, after replacing the binary latent variables with their continuous expected values:

•  all points contribute to the estimation of all components

•  each point has unit mass to contribute, but splits it across the K components

•  the amount of weight a point contributes to a component is proportional to the relative likelihood that the point was generated by that component
Latent Variable View (with an oracle)

With the true binary labels z_nk, each component’s mean and covariance can be estimated separately; the mixing weights are coupled because they all sum to 1, but it is no big deal to solve.
Latent Variable View (with EM, at iteration i)

With EM, the binary z_nk are replaced by ownership weights computed from the iteration-i parameters. Those weights are constants with respect to the new parameters, so the expected complete-data log likelihood has the same structure: each component can still be estimated separately, and the mixing weights are still coupled only by summing to 1, which is no big deal to solve.
EM Algorithm for GMM

E step — ownership weights:

   γ(z_nk) = π_k N(x_n | µ_k, Σ_k) / Σ_j π_j N(x_n | µ_j, Σ_j)

M step — means, covariances, and mixing probabilities, with N_k = Σ_n γ(z_nk):

   µ_k = (1/N_k) Σ_n γ(z_nk) x_n
   Σ_k = (1/N_k) Σ_n γ(z_nk) (x_n − µ_k)(x_n − µ_k)ᵀ
   π_k = N_k / N
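To make the two steps concrete, here is a bare-bones NumPy/SciPy sketch of the loop, assuming d ≥ 2 dimensional data (illustrative only: fixed iteration count, no convergence test, no safeguards against collapsing components; em_gmm is a made-up name):

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, seed=0):
    """Bare-bones EM for a Gaussian mixture model."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # crude initialization: random data points as means, data covariance, uniform weights
    mus = X[rng.choice(N, K, replace=False)]
    Sigmas = np.array([np.cov(X.T) for _ in range(K)])
    pis = np.full(K, 1.0 / K)

    for _ in range(n_iters):
        # E step: ownership weights gamma[n, k] proportional to pi_k N(x_n | mu_k, Sigma_k)
        gamma = np.column_stack([
            pis[k] * multivariate_normal.pdf(X, mus[k], Sigmas[k]) for k in range(K)
        ])
        gamma /= gamma.sum(axis=1, keepdims=True)

        # M step: weighted means, covariances, and mixing probabilities
        Nk = gamma.sum(axis=0)                    # effective number of points per component
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
        pis = Nk / N

    return pis, mus, Sigmas, gamma

The returned gamma array holds the final soft labels, one row per data point.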
Gaussian Mixture Example

Copyright © 2001, Andrew W. Moore


[Figures: snapshots of the fit after the 1st, 2nd, 3rd, 4th, 5th, 6th, and 20th EM iterations]

Recall: Labeled vs Unlabeled Data

labeled: easy to estimate params (do each color separately)
unlabeled: hard to estimate params (we need to assign colors)
EM produces a “Soft” labeling

each point makes a weighted contribution to the estimation of ALL components
Review of EM for GMMs

E step: ownership weights (soft labels)

M step: means, covariances, mixing weights

Alternate E and M steps to convergence.
From EM to K-Means

Alternative explanation of K-means!

•  Fix all mixing weights to 1/K
   [they drop out of the estimation]

•  Fix all covariances to σ²I
   [they drop out of the estimation, so we only have to estimate the means; each Gaussian likelihood becomes inversely proportional to distance from a mean]

•  Take the limit as σ² goes to 0
   [this forces the soft weights to become binary]
The E step ownership weights (soft labels), after fixing the mixing weights and covariances as described above, become

   γ(z_nk) = exp(−‖x_n − µ_k‖² / 2σ²) / Σ_j exp(−‖x_n − µ_j‖² / 2σ²)

Now divide top and bottom by exp(−‖x_n − µ_j*‖² / 2σ²), where µ_j* is the closest mean to x_n, and take the limit as σ² goes to 0:

   γ(z_nk) →  1  if µ_k is the closest mean to x_n
              0  otherwise

These are hard labels, as in the K-means algorithm.
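A small numerical sketch of this limit (not from the slides): with equal mixing weights and covariance σ²I, the ownership weights sharpen toward hard 0/1 assignments as σ² shrinks (soft_assignments is an illustrative name):

import numpy as np

def soft_assignments(X, mus, sigma2):
    """EM ownership weights with equal mixing weights and covariance sigma2 * I."""
    # squared distances d2[n, k] = ||x_n - mu_k||^2
    d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=-1)
    # subtract the row minimum (the divide-top-and-bottom trick) for stability
    logits = -(d2 - d2.min(axis=1, keepdims=True)) / (2 * sigma2)
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

# As sigma2 shrinks, the soft weights approach the hard 0/1 K-means assignments
X = np.array([[0.0, 0.0], [2.1, 0.0]])
mus = np.array([[0.0, 0.0], [2.0, 0.0]])
for s2 in (1.0, 0.1, 0.001):
    print(s2, np.round(soft_assignments(X, mus, s2), 3))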
K-Means Algorithm

•  Given N data points x1, x2, ..., xN
•  Find K cluster centers µ1, µ2, ..., µK to minimize

   J = Σ_n Σ_k z_nk ‖x_n − µ_k‖²

   (z_nk is 1 if point n belongs to cluster k; 0 otherwise)

•  Algorithm:
   –  initialize K cluster centers µ1, µ2, ..., µK
   –  repeat
      •  [E step]  set the z_nk labels to assign each point to its closest cluster center
      •  [M step]  revise each cluster center µj to be the center of mass of the points in that cluster
   –  until convergence (e.g. the z_nk labels don’t change)
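A compact NumPy sketch of this loop (kmeans is a made-up name; an empty cluster simply keeps its previous center):

import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Plain K-means: alternate hard assignments (E-like step) and
    center-of-mass updates (M-like step)."""
    rng = np.random.default_rng(seed)
    mus = X[rng.choice(X.shape[0], K, replace=False)]      # init centers at random data points
    labels = np.zeros(X.shape[0], dtype=int)
    for _ in range(n_iters):
        # E-like step: assign each point to its closest center
        d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        # M-like step: move each center to the mean of its assigned points
        new_mus = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else mus[k]
                            for k in range(K)])
        if np.allclose(new_mus, mus):                       # centers (hence labels) stopped changing
            break
        mus = new_mus
    return mus, labels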
