

Mixtures of Gaussians

Sargur Srihari
srihari@cedar.buffalo.edu


9. Mixture Models and EM


0. Mixture Models Overview
1.  K-Means Clustering
2.  Mixtures of Gaussians
3.  An Alternative View of EM
4.  The EM Algorithm in General


Topics in Mixtures of Gaussians


•  Goal of Gaussian Mixture Modeling
•  Latent Variables
•  Maximum Likelihood
•  EM for Gaussian Mixtures

Goal of Gaussian Mixture Modeling

•  A linear superposition of Gaussians in the form


       p(x) = ∑_{k=1}^{K} π_k N(x | µ_k, Σ_k)

•  Goal of modeling: find the maximum likelihood parameters π_k, µ_k, Σ_k
•  Examples of data sets and models
   (Figures: 1-D data with K = 2 subclasses; 2-D data with K = 3)

Each data point is associated with a subclass k with probability π_k:

    k      1       2
    π_k    0.4     0.6
    µ_k    −28     1.86
    σ_k    0.48    0.88
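
A minimal R sketch of evaluating such a two-component 1-D mixture density (the parameter values below are illustrative assumptions for the sketch, not values taken from the figure):

# Evaluate a 1-D, K = 2 Gaussian mixture density p(x) = sum_k pi_k N(x | mu_k, sigma_k^2)
pi_k    <- c(0.4, 0.6)     # mixing coefficients, must sum to 1
mu_k    <- c(-2.8, 1.86)   # component means (assumed values)
sigma_k <- c(0.48, 0.88)   # component standard deviations

mixture_density <- function(x) {
  # weighted component densities for a vector x; one column per component
  comp <- sapply(seq_along(pi_k),
                 function(k) pi_k[k] * dnorm(x, mu_k[k], sigma_k[k]))
  rowSums(comp)              # sum over components
}

curve(mixture_density(x), from = -6, to = 6, ylab = "p(x)")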

GMMs and Latent Variables


•  A GMM is a linear superposition of Gaussian
components
–  Provides a richer class of density models than the
single Gaussian
•  We formulate a GMM in terms of discrete latent
variables
–  This provides deeper insight into this distribution
–  Serves to motivate the EM algorithm
•  Which gives a maximum likelihood solution for the component parameters
   (mixing coefficients, means, and covariances)

Latent Variable Representation


•  Linear superposition of K Gaussians:
       p(x) = ∑_{k=1}^{K} π_k N(x | µ_k, Σ_k)

•  Introduce a K-dimensional binary variable z


–  Use 1-of-K representation (one-hot vector)
•  Let z = (z_1, …, z_K), whose elements satisfy z_k ∈ {0,1} and ∑_k z_k = 1
•  There are K possible states of z, corresponding to the K components
Example (K = 2):    k      1       2
                    z      (1,0)   (0,1)
                    π_k    0.4     0.6
                    µ_k    −28     1.86
                    σ_k    0.48    0.88

Example (K = 3):    k      1         2         3
                    z      (1,0,0)   (0,1,0)   (0,0,1)
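
A tiny R sketch of the 1-of-K (one-hot) encoding (an illustrative helper written for this note):

# 1-of-K (one-hot) encoding of the latent variable z: exactly one element equals 1
one_hot <- function(k, K) { z <- integer(K); z[k] <- 1; z }
one_hot(2, K = 3)   # 0 1 0  -- the state with z_2 = 1, i.e. component 2 selected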

Joint Distribution
•  Define joint distribution of latent variable
and observed variable
–  p(x,z) = p(x|z) p(z)
–  x is observed variable
–  z is the hidden or missing variable
–  Marginal distribution p(z)
–  Conditional distribution p(x|z)


Graphical Representation of Mixture Model

•  The joint distribution p(x,z) is represented in


the form p(z)p(x|z)

(Figure: Bayesian network with latent variable z = [z_1, …, z_K], representing the subclass, and observed variable x)

–  We now specify the marginal p(z) and the conditional p(x|z)


•  Using them we specify p(x) in terms of observed and
latent variables

Specifying the marginal p(z)


•  Associate a probability with each component zk
–  Denote p(z_k = 1) = π_k, where the parameters {π_k} satisfy 0 ≤ π_k ≤ 1 and ∑_k π_k = 1

•  Because z uses the 1-of-K representation, it follows that

       p(z) = ∏_{k=1}^{K} π_k^{z_k}

–  since z_k ∈ {0,1} and exactly one component of z equals 1 (the states are mutually exclusive)

   With one component:    p(z_1) = π_1^{z_1}
   With two components:   p(z_1, z_2) = π_1^{z_1} π_2^{z_2}

Specifying the Conditional p(x|z)


•  For a particular component (a particular value of z):

       p(x | z_k = 1) = N(x | µ_k, Σ_k)

•  Thus p(x|z) can be written in the form

       p(x | z) = ∏_{k=1}^{K} N(x | µ_k, Σ_k)^{z_k}

–  Due to the exponent z_k, all factors in the product except one equal 1


Marginal distribution p(x)


•  The joint distribution p(x,z) is given by p(z)p(x|z)
•  Thus marginal distribution of x is obtained by summing
over all possible states of z to give
       p(x) = ∑_z p(z) p(x | z) = ∑_z ∏_{k=1}^{K} [π_k N(x | µ_k, Σ_k)]^{z_k} = ∑_{k=1}^{K} π_k N(x | µ_k, Σ_k)

–  since z_k ∈ {0,1} and exactly one z_k equals 1 in each state of z
•  This is the standard form of a Gaussian mixture


Value of Introducing Latent Variable


•  If we have observations x1,..,xN
•  Because the marginal distribution is of the form  p(x) = ∑_z p(x, z)

–  It follows that for every observed data point xn there is a


corresponding latent vector zn , i.e., its sub-class
•  Thus we have found a formulation of Gaussian
mixture involving an explicit latent variable
–  We are now able to work with joint distribution p(x,z)
instead of marginal p(x)
•  Leads to significant simplification through the introduction of expectation maximization (EM)

Another conditional probability (Responsibility)


•  In EM the posterior p(z|x) plays a central role
•  The probability p(z_k = 1 | x) is denoted γ(z_k)
•  From Bayes' theorem, using p(x,z) = p(x|z) p(z):

       γ(z_k) ≡ p(z_k = 1 | x) = p(z_k = 1) p(x | z_k = 1) / ∑_{j=1}^{K} p(z_j = 1) p(x | z_j = 1)

                               = π_k N(x | µ_k, Σ_k) / ∑_{j=1}^{K} π_j N(x | µ_j, Σ_j)

•  View p(z_k = 1) = π_k as the prior probability of component k,
   and γ(z_k) = p(z_k = 1 | x) as the corresponding posterior probability
   –  γ(z_k) is also the responsibility that component k takes for explaining the observation x
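
A small R sketch of this computation for 1-D data, reusing the illustrative pi_k, mu_k, sigma_k defined in the earlier sketch:

# Responsibilities gamma(z_k) for each data point, via Bayes' theorem
responsibilities <- function(x) {
  # numerator pi_k * N(x | mu_k, sigma_k) for every component (columns) and data point (rows)
  num <- sapply(seq_along(pi_k),
                function(k) pi_k[k] * dnorm(x, mu_k[k], sigma_k[k]))
  num / rowSums(num)          # normalize over components; each row sums to 1
}

responsibilities(c(-3.0, 0.5, 2.0))   # one row per data point, one column per component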

Plan of Discussion
•  Next we look at
1.  How to synthesize data from a mixture model, and then
2.  Given a data set {x_1, …, x_N}, how to model it using a mixture of Gaussians


Synthesizing data from mixture


(Figures: 500 points drawn from a mixture of three Gaussians, shown as the complete data set (x, z) and as the incomplete data set (x only))

•  Use ancestral sampling
   –  Start with the lowest-numbered node and draw a sample:
      •  generate a sample of z, call it ẑ
      •  move to the successor node and draw a sample given the parent value, etc.
   –  Then generate a value for x from the conditional p(x | ẑ)
•  Samples from p(x, z) are plotted according to the value of x and colored by the value of z (the complete data set)
•  Samples from the marginal p(x) are obtained by ignoring the values of z (the incomplete data set)
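
A minimal R sketch of ancestral sampling for the 1-D case, again using the illustrative parameters from the earlier sketches:

# Ancestral sampling from p(x, z): first draw z, then draw x given z
set.seed(1)
N     <- 500
z_hat <- sample(seq_along(pi_k), N, replace = TRUE, prob = pi_k)  # latent component labels
x     <- rnorm(N, mean = mu_k[z_hat], sd = sigma_k[z_hat])        # x drawn from the chosen component
# (x, z_hat) are samples from p(x, z) (complete data);
# keeping only x gives samples from the marginal p(x) (incomplete data)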

Illustration of responsibilities
•  Evaluate, for every data point, the posterior probability of each component
•  The responsibility γ(z_nk) is associated with data point x_n
•  Color each point using proportions of red, blue, and green ink
   –  If for a data point γ(z_n1) = 1, it is colored red
   –  If for another point γ(z_n2) = γ(z_n3) = 0.5, it has equal blue and green and appears cyan

Maximum Likelihood for GMM


•  We wish to model data set {x1,..xN} using a mixture of
Gaussians (N items each of dimension D)
•  Represent the data by the N × D matrix X, whose nth row is x_n^T:

           ⎡ x_1^T ⎤
       X = ⎢ x_2^T ⎥
           ⎢   ⋮   ⎥
           ⎣ x_N^T ⎦

•  Represent the N latent variables by the N × K matrix Z, whose nth row is z_n^T:

           ⎡ z_1^T ⎤
       Z = ⎢ z_2^T ⎥
           ⎢   ⋮   ⎥
           ⎣ z_N^T ⎦

•  Goal is to state the likelihood function


•  so as to estimate the three sets of parameters
•  by maximizing the likelihood

Graphical representation of GMM


•  For a set of i.i.d. data points {xn} with
corresponding latent points {zn} where n=1,..,N
•  Bayesian Network for p(X,Z) using plate
notation
–  N x D matrix X
–  N x K matrix Z


Likelihood Function for GMM


The mixture density function is

       p(x) = ∑_z p(z) p(x | z) = ∑_{k=1}^{K} π_k N(x | µ_k, Σ_k)

(since z takes the states {z_k} with probabilities {π_k})

Therefore the likelihood function is

       p(X | π, µ, Σ) = ∏_{n=1}^{N} { ∑_{k=1}^{K} π_k N(x_n | µ_k, Σ_k) }

(the product is over the N i.i.d. samples)

Therefore the log-likelihood function is

       ln p(X | π, µ, Σ) = ∑_{n=1}^{N} ln { ∑_{k=1}^{K} π_k N(x_n | µ_k, Σ_k) }

which we wish to maximize; this is a more difficult problem than for a single Gaussian
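
A short R sketch of evaluating this log-likelihood for 1-D data, reusing the illustrative parameters and the sample x from the sketches above:

# Log-likelihood of a 1-D Gaussian mixture for a data vector x
gmm_loglik <- function(x, pi_k, mu_k, sigma_k) {
  comp <- sapply(seq_along(pi_k),
                 function(k) pi_k[k] * dnorm(x, mu_k[k], sigma_k[k]))
  sum(log(rowSums(comp)))   # log of the inner sum over k, summed over the N points
}

gmm_loglik(x, pi_k, mu_k, sigma_k)   # x from the ancestral-sampling sketch above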

Maximization of Log-Likelihood
       ln p(X | π, µ, Σ) = ∑_{n=1}^{N} ln { ∑_{k=1}^{K} π_k N(x_n | µ_k, Σ_k) }

•  Goal is to estimate the three sets of parameters π_k, µ_k, Σ_k
   –  by taking derivatives in turn w.r.t. each while keeping the others constant
   –  but there are no closed-form solutions
•  The task is not straightforward: the summation over components appears inside the logarithm, so the logarithm no longer acts directly on the Gaussians
•  While gradient-based optimization is possible, we consider the iterative EM algorithm

Some issues with GMM m.l.e.


•  Before proceeding with the m.l.e., we briefly mention two technical issues:
1.  Problem of singularities with Gaussian mixtures
2.  Problem of Identifiability of mixtures


Problem of Singularities with Gaussian mixtures


•  Consider a Gaussian mixture whose components have covariance matrices Σ_k = σ_k² I
•  A data point that falls exactly on a mean, µ_j = x_n, contributes to the likelihood the term

       N(x_n | x_n, σ_j² I) = 1 / ((2π)^{1/2} σ_j)      since exp{−(x_n − µ_j)²/(2σ_j²)} = 1

•  As σ_j → 0 this term goes to infinity, while the other components assign finite values to the remaining data points
•  Therefore maximization of the log-likelihood

       ln p(X | π, µ, Σ) = ∑_{n=1}^{N} ln { ∑_{k=1}^{K} π_k N(x_n | µ_k, Σ_k) }

   is not well-posed
   –  This does not happen with a single Gaussian: when it collapses onto one data point, the multiplicative factors contributed by the other data points go to zero, driving the likelihood to zero rather than infinity
   –  It also does not happen in the Bayesian approach
•  In maximum likelihood the problem is avoided using heuristics, e.g., resetting the mean or covariance of a collapsing component
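
A one-line R illustration of the collapsing term in the 1-D case: a component whose mean sits exactly on a data point contributes dnorm(x_n, x_n, σ_j) = 1/((2π)^{1/2} σ_j), which grows without bound as σ_j → 0:

# The density value at the mean blows up as the standard deviation shrinks
sapply(c(1, 0.1, 0.01, 0.001), function(s) dnorm(0, mean = 0, sd = s))
# 0.3989  3.989  39.89  398.9  -> infinity as s -> 0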

Problem of Identifiability
A density p(x | θ) is identifiable if, whenever θ ≠ θ′, there is an x for which p(x | θ) ≠ p(x | θ′)

A K-component mixture will have a total of K! equivalent solutions
–  Corresponding to K! ways of assigning K sets of
parameters to K components
•  E.g., for K=3 K!=6: 123, 132, 213, 231, 312, 321
–  For any given point in the space of parameter values
there will be a further K!-1 additional points all giving
exactly same distribution
•  However, any of the equivalent solutions is as good as any other

   (Figure: two ways of labeling three Gaussian subclasses)

EM for Gaussian Mixtures


•  EM is a method for finding maximum likelihood
solutions for models with latent variables
•  Begin with log-likelihood function
       ln p(X | π, µ, Σ) = ∑_{n=1}^{N} ln { ∑_{k=1}^{K} π_k N(x_n | µ_k, Σ_k) }

–  We wish to find π , µ, Σ that maximize this quantity


–  The task is not straightforward: the summation over components appears inside the logarithm, so the logarithm does not act directly on the Gaussians
•  Take derivatives in turn w.r.t.
   –  the means µ_k and set to zero
   –  the covariance matrices Σ_k and set to zero
   –  the mixing coefficients π_k and set to zero

EM for GMM: Derivative wrt µk


•  Begin with log-likelihood function
       ln p(X | π, µ, Σ) = ∑_{n=1}^{N} ln { ∑_{k=1}^{K} π_k N(x_n | µ_k, Σ_k) }

•  Take the derivative w.r.t. the means µ_k and set it to zero
   –  making use of the exponential form of the Gaussian
   –  using the formulas  d/dx ln u = u'/u  and  d/dx e^u = e^u u'
   –  we get

       0 = ∑_{n=1}^{N} [ π_k N(x_n | µ_k, Σ_k) / ∑_j π_j N(x_n | µ_j, Σ_j) ] Σ_k^{−1} (x_n − µ_k)

     where the bracketed ratio is the posterior probability γ(z_nk) and Σ_k^{−1} is the inverse of the covariance matrix

M.L.E. solution for Means


•  Multiplying by Σ_k (assuming non-singularity):

       µ_k = (1/N_k) ∑_{n=1}^{N} γ(z_nk) x_n

   –  The mean of the kth Gaussian component is the weighted mean of all the points in the data set, where data point x_n is weighted by the posterior probability that component k was responsible for generating x_n
•  Where we have defined

       N_k = ∑_{n=1}^{N} γ(z_nk)

   –  which is the effective number of points assigned to cluster k

M.L.E. solution for Covariance


•  Set the derivative w.r.t. Σ_k to zero
   –  making use of the m.l.e. solution for the covariance matrix of a single Gaussian:

       Σ_k = (1/N_k) ∑_{n=1}^{N} γ(z_nk) (x_n − µ_k)(x_n − µ_k)^T

   –  Similar to the result for a single Gaussian fitted to the data set, but with each data point weighted by the corresponding posterior probability
   –  The denominator is the effective number of points in the component

M.L.E. solution for Mixing Coefficients


•  Maximize ln p(X | π, µ, Σ) w.r.t. π_k
   –  Must take into account that the mixing coefficients sum to one
   –  Achieved using a Lagrange multiplier and maximizing

       ln p(X | π, µ, Σ) + λ ( ∑_{k=1}^{K} π_k − 1 )

   –  Setting the derivative w.r.t. π_k to zero and solving gives

       π_k = N_k / N

Summary of m.l.e. expressions


•  GMM maximum likelihood parameter estimates:

   Means:                 µ_k = (1/N_k) ∑_{n=1}^{N} γ(z_nk) x_n

   Covariance matrices:   Σ_k = (1/N_k) ∑_{n=1}^{N} γ(z_nk) (x_n − µ_k)(x_n − µ_k)^T

   Mixing coefficients:   π_k = N_k / N,   where N_k = ∑_{n=1}^{N} γ(z_nk)

•  All three are expressed in terms of the responsibilities
•  and so we have not completely solved the problem
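
A minimal R sketch of these three updates for the 1-D case, given a responsibility matrix gamma with N rows and K columns (an illustration of the formulas above, not a complete EM implementation):

# M.l.e. updates (1-D case) from a responsibility matrix gamma (N x K)
m_step_1d <- function(x, gamma) {
  Nk    <- colSums(gamma)                               # N_k: effective number of points per component
  mu    <- colSums(gamma * x) / Nk                      # weighted means
  var_k <- colSums(gamma * (outer(x, mu, "-"))^2) / Nk  # weighted variances
  list(pi = Nk / length(x), mu = mu, sigma = sqrt(var_k))
}

# Example: responsibilities from the earlier sketch, data x from the sampling sketch
m_step_1d(x, responsibilities(x))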

EM Formulation
•  The results for µ_k, Σ_k, π_k are not closed-form solutions for the parameters
   –  since the responsibilities γ(z_nk) depend on those parameters in a complex way
•  Results suggest an iterative solution
•  An instance of EM algorithm for the particular
case of GMM


Informal EM for GMM


•  First choose initial values for means, covariances and
mixing coefficients
•  Alternate between following two updates
–  Called E step and M step
•  In the E step, use the current values of the parameters to evaluate the posterior probabilities, or responsibilities
•  In the M step, use these posterior probabilities to re-estimate the means, covariances, and mixing coefficients


EM using Old Faithful


(Figure: EM applied to the Old Faithful data. Panels: data points and the initial mixture model; the initial E step determining responsibilities; parameters re-evaluated after the first M step; results after 2, 5, and 20 cycles)


Comparison with K-Means

(Figure: K-means result vs. EM result)


Animation of EM for Old Faithful Data


•  http://en.wikipedia.org/wiki/File:Em_old_faithful.gif
Code in R:

# initial parameter estimates (chosen to be deliberately bad)
theta <- list(tau    = c(0.5, 0.5),
              mu1    = c(2.8, 75),
              mu2    = c(3.6, 58),
              sigma1 = matrix(c(0.8, 7, 7, 70), ncol = 2),
              sigma2 = matrix(c(0.8, 7, 7, 70), ncol = 2))


Practical Issues with EM


•  EM takes many more iterations than K-means
   –  Each cycle requires significantly more computation
•  It is common to run K-means first in order to find a suitable initialization (see the sketch below)
•  Covariance matrices can be initialized to the covariances of the clusters found by K-means
•  EM is not guaranteed to find global maximum of
log likelihood function
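
A small R sketch of such a K-means based initialization on the Old Faithful data, using the built-in kmeans() function (the variable names are just for this sketch):

# Initialize GMM parameters from a K-means run (Old Faithful data, K = 2)
data(faithful)
X  <- as.matrix(faithful)                 # N x 2 data matrix (eruptions, waiting)
K  <- 2
km <- kmeans(X, centers = K, nstart = 10)

init_pi    <- as.numeric(table(km$cluster)) / nrow(X)            # cluster proportions
init_mu    <- km$centers                                         # K x D matrix of cluster means
init_sigma <- lapply(1:K, function(k) cov(X[km$cluster == k, ])) # per-cluster covariance matrices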


Summary of EM for GMM


•  Given a Gaussian mixture model
•  Goal is to maximize the likelihood function w.r.t.
the parameters (means, covariances and
mixing coefficients)

Step 1: Initialize the means µ_k, covariances Σ_k, and mixing coefficients π_k,
and evaluate the initial value of the log-likelihood


EM continued
•  Step 2 (E step): Evaluate the responsibilities using the current parameter values

       γ(z_nk) = π_k N(x_n | µ_k, Σ_k) / ∑_{j=1}^{K} π_j N(x_n | µ_j, Σ_j)

•  Step 3 (M step): Re-estimate the parameters using the current responsibilities

       µ_k^new = (1/N_k) ∑_{n=1}^{N} γ(z_nk) x_n

       Σ_k^new = (1/N_k) ∑_{n=1}^{N} γ(z_nk) (x_n − µ_k^new)(x_n − µ_k^new)^T

       π_k^new = N_k / N,   where N_k = ∑_{n=1}^{N} γ(z_nk)

EM Continued
•  Step 4: Evaluate the log likelihood

       ln p(X | π, µ, Σ) = ∑_{n=1}^{N} ln { ∑_{k=1}^{K} π_k N(x_n | µ_k, Σ_k) }

–  And check for convergence of either parameters


or log likelihood
•  If the convergence criterion is not satisfied, return to Step 2
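
Putting Steps 1-4 together, a minimal self-contained R sketch of the full EM loop for a K-component GMM (written for this summary; the function and variable names are illustrative, and K-means is used for Step 1 as suggested earlier):

# EM for a K-component, D-dimensional Gaussian mixture (Steps 1-4 above)
em_gmm <- function(X, K, max_iter = 100, tol = 1e-6) {
  N <- nrow(X); D <- ncol(X)

  # Step 1: initialize means, covariances, and mixing coefficients from K-means
  km    <- kmeans(X, centers = K, nstart = 10)
  mu    <- km$centers
  Sigma <- lapply(1:K, function(k) cov(X[km$cluster == k, , drop = FALSE]))
  pi_k  <- as.numeric(table(km$cluster)) / N

  # multivariate Gaussian density N(x | m, S), evaluated for every row of X
  dmvn <- function(X, m, S) {
    diff <- sweep(X, 2, m)
    quad <- rowSums((diff %*% solve(S)) * diff)
    exp(-0.5 * quad) / sqrt((2 * pi)^D * det(S))
  }

  loglik_old <- -Inf
  for (iter in 1:max_iter) {
    # Step 2 (E step): responsibilities gamma(z_nk), an N x K matrix
    dens  <- sapply(1:K, function(k) pi_k[k] * dmvn(X, mu[k, ], Sigma[[k]]))
    gamma <- dens / rowSums(dens)

    # Step 3 (M step): re-estimate the parameters from the responsibilities
    Nk    <- colSums(gamma)
    pi_k  <- Nk / N
    mu    <- t(sapply(1:K, function(k) colSums(gamma[, k] * X) / Nk[k]))
    Sigma <- lapply(1:K, function(k) {
      diff <- sweep(X, 2, mu[k, ])
      t(diff * gamma[, k]) %*% diff / Nk[k]
    })

    # Step 4: evaluate the log-likelihood with the updated parameters and check convergence
    dens_new <- sapply(1:K, function(k) pi_k[k] * dmvn(X, mu[k, ], Sigma[[k]]))
    loglik   <- sum(log(rowSums(dens_new)))
    if (abs(loglik - loglik_old) < tol) break
    loglik_old <- loglik
  }
  list(pi = pi_k, mu = mu, Sigma = Sigma, loglik = loglik, iterations = iter)
}

# Example: fit a two-component GMM to the Old Faithful data
fit <- em_gmm(as.matrix(faithful), K = 2)
fit$pi
fit$mu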
