

Mixtures of Gaussians

Sargur Srihari
srihari@cedar.buffalo.edu


9. Mixture Models and EM


0. Mixture Models Overview
1.  K-Means Clustering
2.  Mixtures of Gaussians
3.  An Alternative View of EM
4.  The EM Algorithm in General


Topics in Mixtures of Gaussians


•  Goal of Gaussian Mixture Modeling
•  Latent Variables
•  Maximum Likelihood
•  EM for Gaussian Mixtures

Goal of Gaussian Mixture Modeling

•  A linear superposition of Gaussians in the form


       p(x) = ∑_{k=1}^{K} π_k N(x | µ_k, Σ_k)

•  Goal of modeling: find the maximum likelihood parameters π_k, µ_k, Σ_k
•  Examples of data sets and models
   (Figures: 1-D data with K = 2 subclasses; 2-D data with K = 3)

Each data point is associated with a subclass k with probability π_k:

    k      1       2
    π_k    0.4     0.6
    µ_k    −28     1.86
    σ_k    0.48    0.88
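
A minimal R sketch of evaluating such a two-component 1-D mixture density (the parameter values below are illustrative assumptions for the sketch, not values taken from the figure):

# Evaluate a 1-D, K = 2 Gaussian mixture density p(x) = sum_k pi_k N(x | mu_k, sigma_k^2)
pi_k    <- c(0.4, 0.6)     # mixing coefficients, must sum to 1
mu_k    <- c(-2.8, 1.86)   # component means (assumed values)
sigma_k <- c(0.48, 0.88)   # component standard deviations

mixture_density <- function(x) {
  # weighted component densities for a vector x; one column per component
  comp <- sapply(seq_along(pi_k),
                 function(k) pi_k[k] * dnorm(x, mu_k[k], sigma_k[k]))
  rowSums(comp)              # sum over components
}

curve(mixture_density(x), from = -6, to = 6, ylab = "p(x)")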

GMMs and Latent Variables


•  A GMM is a linear superposition of Gaussian
components
–  Provides a richer class of density models than the
single Gaussian
•  We formulate a GMM in terms of discrete latent
variables
–  This provides deeper insight into this distribution
–  Serves to motivate the EM algorithm
•  Which gives a maximum likelihood solution for the component parameters
   (mixing coefficients, means, and covariances)

Latent Variable Representation


•  Linear superposition of K Gaussians:
       p(x) = ∑_{k=1}^{K} π_k N(x | µ_k, Σ_k)

•  Introduce a K-dimensional binary variable z


–  Use 1-of-K representation (one-hot vector)
•  Let z = (z_1, …, z_K), whose elements satisfy z_k ∈ {0,1} and ∑_k z_k = 1
•  There are K possible states of z, corresponding to the K components
Example (K = 2):    k      1       2
                    z      (1,0)   (0,1)
                    π_k    0.4     0.6
                    µ_k    −28     1.86
                    σ_k    0.48    0.88

Example (K = 3):    k      1         2         3
                    z      (1,0,0)   (0,1,0)   (0,0,1)
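
A tiny R sketch of the 1-of-K (one-hot) encoding (an illustrative helper written for this note):

# 1-of-K (one-hot) encoding of the latent variable z: exactly one element equals 1
one_hot <- function(k, K) { z <- integer(K); z[k] <- 1; z }
one_hot(2, K = 3)   # 0 1 0  -- the state with z_2 = 1, i.e. component 2 selected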

Joint Distribution
•  Define joint distribution of latent variable
and observed variable
–  p(x,z) = p(x|z) p(z)
–  x is observed variable
–  z is the hidden or missing variable
–  Marginal distribution p(z)
–  Conditional distribution p(x|z)


Graphical Representation of Mixture Model

•  The joint distribution p(x,z) is represented in


the form p(z)p(x|z)

(Figure: Bayesian network with latent variable z = [z_1, …, z_K], representing the subclass, and observed variable x)

–  We now specify the marginal p(z) and the conditional p(x|z)


•  Using them we specify p(x) in terms of observed and
latent variables

Specifying the marginal p(z)


•  Associate a probability with each component zk
–  Denote p(z_k = 1) = π_k, where the parameters {π_k} satisfy 0 ≤ π_k ≤ 1 and ∑_k π_k = 1

•  Because z uses the 1-of-K representation, it follows that

       p(z) = ∏_{k=1}^{K} π_k^{z_k}

–  since z_k ∈ {0,1} and exactly one component of z equals 1 (the states are mutually exclusive)

   With one component:    p(z_1) = π_1^{z_1}
   With two components:   p(z_1, z_2) = π_1^{z_1} π_2^{z_2}

Specifying the Conditional p(x|z)


•  For a particular component (a particular value of z):

       p(x | z_k = 1) = N(x | µ_k, Σ_k)

•  Thus p(x|z) can be written in the form

       p(x | z) = ∏_{k=1}^{K} N(x | µ_k, Σ_k)^{z_k}

–  Due to the exponent z_k, all factors in the product except one equal 1


Marginal distribution p(x)


•  The joint distribution p(x,z) is given by p(z)p(x|z)
•  Thus marginal distribution of x is obtained by summing
over all possible states of z to give
       p(x) = ∑_z p(z) p(x | z) = ∑_z ∏_{k=1}^{K} [π_k N(x | µ_k, Σ_k)]^{z_k} = ∑_{k=1}^{K} π_k N(x | µ_k, Σ_k)

–  since z_k ∈ {0,1} and exactly one z_k equals 1 in each state of z
•  This is the standard form of a Gaussian mixture


Value of Introducing Latent Variable


•  If we have observations x1,..,xN
•  Because the marginal distribution is of the form  p(x) = ∑_z p(x, z)

–  It follows that for every observed data point xn there is a


corresponding latent vector zn , i.e., its sub-class
•  Thus we have found a formulation of Gaussian
mixture involving an explicit latent variable
–  We are now able to work with joint distribution p(x,z)
instead of marginal p(x)
•  Leads to significant simplification through the introduction of expectation maximization (EM)

Another conditional probability (Responsibility)


•  In EM the posterior p(z|x) plays a central role
•  The probability p(z_k = 1 | x) is denoted γ(z_k)
•  From Bayes' theorem, using p(x,z) = p(x|z) p(z):

       γ(z_k) ≡ p(z_k = 1 | x) = p(z_k = 1) p(x | z_k = 1) / ∑_{j=1}^{K} p(z_j = 1) p(x | z_j = 1)

                               = π_k N(x | µ_k, Σ_k) / ∑_{j=1}^{K} π_j N(x | µ_j, Σ_j)

•  View p(z_k = 1) = π_k as the prior probability of component k,
   and γ(z_k) = p(z_k = 1 | x) as the corresponding posterior probability
   –  γ(z_k) is also the responsibility that component k takes for explaining the observation x
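
A small R sketch of this computation for 1-D data, reusing the illustrative pi_k, mu_k, sigma_k defined in the earlier sketch:

# Responsibilities gamma(z_k) for each data point, via Bayes' theorem
responsibilities <- function(x) {
  # numerator pi_k * N(x | mu_k, sigma_k) for every component (columns) and data point (rows)
  num <- sapply(seq_along(pi_k),
                function(k) pi_k[k] * dnorm(x, mu_k[k], sigma_k[k]))
  num / rowSums(num)          # normalize over components; each row sums to 1
}

responsibilities(c(-3.0, 0.5, 2.0))   # one row per data point, one column per component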

Plan of Discussion
•  Next we look at
1.  How to synthesize data from a mixture model, and then
2.  Given a data set {x_1, …, x_N}, how to model it using a mixture of Gaussians


Synthesizing data from mixture


(Figures: 500 points drawn from a mixture of three Gaussians, shown as the complete data set (x, z) and as the incomplete data set (x only))

•  Use ancestral sampling
   –  Start with the lowest-numbered node and draw a sample:
      •  generate a sample of z, call it ẑ
      •  move to the successor node and draw a sample given the parent value, etc.
   –  Then generate a value for x from the conditional p(x | ẑ)
•  Samples from p(x, z) are plotted according to the value of x and colored by the value of z (the complete data set)
•  Samples from the marginal p(x) are obtained by ignoring the values of z (the incomplete data set)
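
A minimal R sketch of ancestral sampling for the 1-D case, again using the illustrative parameters from the earlier sketches:

# Ancestral sampling from p(x, z): first draw z, then draw x given z
set.seed(1)
N     <- 500
z_hat <- sample(seq_along(pi_k), N, replace = TRUE, prob = pi_k)  # latent component labels
x     <- rnorm(N, mean = mu_k[z_hat], sd = sigma_k[z_hat])        # x drawn from the chosen component
# (x, z_hat) are samples from p(x, z) (complete data);
# keeping only x gives samples from the marginal p(x) (incomplete data)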

Illustration of responsibilities
•  Evaluate, for every data point, the posterior probability of each component
•  The responsibility γ(z_nk) is associated with data point x_n
•  Color each point using proportions of red, blue, and green ink
   –  If for a data point γ(z_n1) = 1, it is colored red
   –  If for another point γ(z_n2) = γ(z_n3) = 0.5, it has equal blue and green and appears cyan

Maximum Likelihood for GMM


•  We wish to model data set {x1,..xN} using a mixture of
Gaussians (N items each of dimension D)
•  Represent the data by the N × D matrix X, whose nth row is x_n^T:

           ⎡ x_1^T ⎤
       X = ⎢ x_2^T ⎥
           ⎢   ⋮   ⎥
           ⎣ x_N^T ⎦

•  Represent the N latent variables by the N × K matrix Z, whose nth row is z_n^T:

           ⎡ z_1^T ⎤
       Z = ⎢ z_2^T ⎥
           ⎢   ⋮   ⎥
           ⎣ z_N^T ⎦

•  Goal is to state the likelihood function


•  so as to estimate the three sets of parameters
•  by maximizing the likelihood

Graphical representation of GMM


•  For a set of i.i.d. data points {xn} with
corresponding latent points {zn} where n=1,..,N
•  Bayesian Network for p(X,Z) using plate
notation
–  N x D matrix X
–  N x K matrix Z


Likelihood Function for GMM


The mixture density function is

       p(x) = ∑_z p(z) p(x | z) = ∑_{k=1}^{K} π_k N(x | µ_k, Σ_k)

(since z takes the states {z_k} with probabilities {π_k})

Therefore the likelihood function is

       p(X | π, µ, Σ) = ∏_{n=1}^{N} { ∑_{k=1}^{K} π_k N(x_n | µ_k, Σ_k) }

(the product is over the N i.i.d. samples)

Therefore the log-likelihood function is

       ln p(X | π, µ, Σ) = ∑_{n=1}^{N} ln { ∑_{k=1}^{K} π_k N(x_n | µ_k, Σ_k) }

which we wish to maximize; this is a more difficult problem than for a single Gaussian
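
A short R sketch of evaluating this log-likelihood for 1-D data, reusing the illustrative parameters and the sample x from the sketches above:

# Log-likelihood of a 1-D Gaussian mixture for a data vector x
gmm_loglik <- function(x, pi_k, mu_k, sigma_k) {
  comp <- sapply(seq_along(pi_k),
                 function(k) pi_k[k] * dnorm(x, mu_k[k], sigma_k[k]))
  sum(log(rowSums(comp)))   # log of the inner sum over k, summed over the N points
}

gmm_loglik(x, pi_k, mu_k, sigma_k)   # x from the ancestral-sampling sketch above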

Maximization of Log-Likelihood
       ln p(X | π, µ, Σ) = ∑_{n=1}^{N} ln { ∑_{k=1}^{K} π_k N(x_n | µ_k, Σ_k) }

•  Goal is to estimate the three sets of parameters π_k, µ_k, Σ_k
   –  by taking derivatives in turn w.r.t. each while keeping the others constant
   –  but there are no closed-form solutions
•  The task is not straightforward: the summation over components appears inside the logarithm, so the logarithm no longer acts directly on the Gaussians
•  While gradient-based optimization is possible, we consider the iterative EM algorithm

Some issues with GMM m.l.e.


•  Before proceeding with the m.l.e., we briefly mention two technical issues:
1.  Problem of singularities with Gaussian mixtures
2.  Problem of Identifiability of mixtures


Problem of Singularities with Gaussian mixtures


•  Consider a Gaussian mixture whose components have covariance matrices Σ_k = σ_k² I
•  A data point that falls exactly on a mean, µ_j = x_n, contributes to the likelihood the term

       N(x_n | x_n, σ_j² I) = 1 / ((2π)^{1/2} σ_j)      since exp{−(x_n − µ_j)²/(2σ_j²)} = 1

•  As σ_j → 0 this term goes to infinity, while the other components assign finite values to the remaining data points
•  Therefore maximization of the log-likelihood

       ln p(X | π, µ, Σ) = ∑_{n=1}^{N} ln { ∑_{k=1}^{K} π_k N(x_n | µ_k, Σ_k) }

   is not well-posed
   –  This does not happen with a single Gaussian: when it collapses onto one data point, the multiplicative factors contributed by the other data points go to zero, driving the likelihood to zero rather than infinity
   –  It also does not happen in the Bayesian approach
•  In maximum likelihood the problem is avoided using heuristics, e.g., resetting the mean or covariance of a collapsing component
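
A one-line R illustration of the collapsing term in the 1-D case: a component whose mean sits exactly on a data point contributes dnorm(x_n, x_n, σ_j) = 1/((2π)^{1/2} σ_j), which grows without bound as σ_j → 0:

# The density value at the mean blows up as the standard deviation shrinks
sapply(c(1, 0.1, 0.01, 0.001), function(s) dnorm(0, mean = 0, sd = s))
# 0.3989  3.989  39.89  398.9  -> infinity as s -> 0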

Problem of Identifiability
A density p(x | θ) is identifiable if, whenever θ ≠ θ′, there is an x for which p(x | θ) ≠ p(x | θ′)

A K-component mixture will have a total of K! equivalent solutions
–  Corresponding to K! ways of assigning K sets of
parameters to K components
•  E.g., for K=3 K!=6: 123, 132, 213, 231, 312, 321
–  For any given point in the space of parameter values
there will be a further K!-1 additional points all giving
exactly same distribution
•  However, any of the equivalent solutions is as good as any other

   (Figure: two ways of labeling three Gaussian subclasses)

EM for Gaussian Mixtures


•  EM is a method for finding maximum likelihood
solutions for models with latent variables
•  Begin with log-likelihood function
       ln p(X | π, µ, Σ) = ∑_{n=1}^{N} ln { ∑_{k=1}^{K} π_k N(x_n | µ_k, Σ_k) }

–  We wish to find π , µ, Σ that maximize this quantity


–  The task is not straightforward: the summation over components appears inside the logarithm, so the logarithm does not act directly on the Gaussians
•  Take derivatives in turn w.r.t.
   –  the means µ_k and set to zero
   –  the covariance matrices Σ_k and set to zero
   –  the mixing coefficients π_k and set to zero

EM for GMM: Derivative wrt µk


•  Begin with log-likelihood function
       ln p(X | π, µ, Σ) = ∑_{n=1}^{N} ln { ∑_{k=1}^{K} π_k N(x_n | µ_k, Σ_k) }

•  Take the derivative w.r.t. the means µ_k and set it to zero
   –  making use of the exponential form of the Gaussian
   –  using the formulas  d/dx ln u = u'/u  and  d/dx e^u = e^u u'
   –  we get

       0 = ∑_{n=1}^{N} [ π_k N(x_n | µ_k, Σ_k) / ∑_j π_j N(x_n | µ_j, Σ_j) ] Σ_k^{−1} (x_n − µ_k)

     where the bracketed ratio is the posterior probability γ(z_nk) and Σ_k^{−1} is the inverse of the covariance matrix

M.L.E. solution for Means


•  Multiplying by Σ_k (assuming non-singularity):

       µ_k = (1/N_k) ∑_{n=1}^{N} γ(z_nk) x_n

   –  The mean of the kth Gaussian component is the weighted mean of all the points in the data set, where data point x_n is weighted by the posterior probability that component k was responsible for generating x_n
•  Where we have defined

       N_k = ∑_{n=1}^{N} γ(z_nk)

   –  which is the effective number of points assigned to cluster k

M.L.E. solution for Covariance


•  Set the derivative w.r.t. Σ_k to zero
   –  making use of the m.l.e. solution for the covariance matrix of a single Gaussian:

       Σ_k = (1/N_k) ∑_{n=1}^{N} γ(z_nk) (x_n − µ_k)(x_n − µ_k)^T

   –  Similar to the result for a single Gaussian fitted to the data set, but with each data point weighted by the corresponding posterior probability
   –  The denominator is the effective number of points in the component

M.L.E. solution for Mixing Coefficients


•  Maximize ln p(X | π, µ, Σ) w.r.t. π_k
   –  Must take into account that the mixing coefficients sum to one
   –  Achieved using a Lagrange multiplier and maximizing

       ln p(X | π, µ, Σ) + λ ( ∑_{k=1}^{K} π_k − 1 )

   –  Setting the derivative w.r.t. π_k to zero and solving gives

       π_k = N_k / N

Summary of m.l.e. expressions


•  GMM maximum likelihood parameter estimates:

   Means:                 µ_k = (1/N_k) ∑_{n=1}^{N} γ(z_nk) x_n

   Covariance matrices:   Σ_k = (1/N_k) ∑_{n=1}^{N} γ(z_nk) (x_n − µ_k)(x_n − µ_k)^T

   Mixing coefficients:   π_k = N_k / N,   where N_k = ∑_{n=1}^{N} γ(z_nk)

•  All three are expressed in terms of the responsibilities
•  and so we have not completely solved the problem
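
A minimal R sketch of these three updates for the 1-D case, given a responsibility matrix gamma with N rows and K columns (an illustration of the formulas above, not a complete EM implementation):

# M.l.e. updates (1-D case) from a responsibility matrix gamma (N x K)
m_step_1d <- function(x, gamma) {
  Nk    <- colSums(gamma)                               # N_k: effective number of points per component
  mu    <- colSums(gamma * x) / Nk                      # weighted means
  var_k <- colSums(gamma * (outer(x, mu, "-"))^2) / Nk  # weighted variances
  list(pi = Nk / length(x), mu = mu, sigma = sqrt(var_k))
}

# Example: responsibilities from the earlier sketch, data x from the sampling sketch
m_step_1d(x, responsibilities(x))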

EM Formulation
•  The results for µ_k, Σ_k, π_k are not closed-form solutions for the parameters
   –  since the responsibilities γ(z_nk) depend on those parameters in a complex way
•  Results suggest an iterative solution
•  An instance of EM algorithm for the particular
case of GMM


Informal EM for GMM


•  First choose initial values for means, covariances and
mixing coefficients
•  Alternate between following two updates
–  Called E step and M step
•  In the E step, use the current values of the parameters to evaluate the posterior probabilities, or responsibilities
•  In the M step, use these posterior probabilities to re-estimate the means, covariances, and mixing coefficients


EM using Old Faithful


(Figure: EM applied to the Old Faithful data. Panels: data points and the initial mixture model; the initial E step determining responsibilities; parameters re-evaluated after the first M step; results after 2, 5, and 20 cycles)


Comparison with K-Means

(Figure: K-means result vs. EM result)


Animation of EM for Old Faithful Data


•  http://en.wikipedia.org/wiki/File:Em_old_faithful.gif
Code in R:

# initial parameter estimates (chosen to be deliberately bad)
theta <- list(tau    = c(0.5, 0.5),
              mu1    = c(2.8, 75),
              mu2    = c(3.6, 58),
              sigma1 = matrix(c(0.8, 7, 7, 70), ncol = 2),
              sigma2 = matrix(c(0.8, 7, 7, 70), ncol = 2))


Practical Issues with EM


•  EM takes many more iterations than K-means
   –  Each cycle requires significantly more computation
•  It is common to run K-means first in order to find a suitable initialization (see the sketch below)
•  Covariance matrices can be initialized to the covariances of the clusters found by K-means
•  EM is not guaranteed to find global maximum of
log likelihood function
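
A small R sketch of such a K-means based initialization on the Old Faithful data, using the built-in kmeans() function (the variable names are just for this sketch):

# Initialize GMM parameters from a K-means run (Old Faithful data, K = 2)
data(faithful)
X  <- as.matrix(faithful)                 # N x 2 data matrix (eruptions, waiting)
K  <- 2
km <- kmeans(X, centers = K, nstart = 10)

init_pi    <- as.numeric(table(km$cluster)) / nrow(X)            # cluster proportions
init_mu    <- km$centers                                         # K x D matrix of cluster means
init_sigma <- lapply(1:K, function(k) cov(X[km$cluster == k, ])) # per-cluster covariance matrices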


Summary of EM for GMM


•  Given a Gaussian mixture model
•  Goal is to maximize the likelihood function w.r.t.
the parameters (means, covariances and
mixing coefficients)

Step 1: Initialize the means µ_k, covariances Σ_k, and mixing coefficients π_k,
and evaluate the initial value of the log-likelihood


EM continued
•  Step 2 (E step): Evaluate the responsibilities using the current parameter values

       γ(z_nk) = π_k N(x_n | µ_k, Σ_k) / ∑_{j=1}^{K} π_j N(x_n | µ_j, Σ_j)

•  Step 3 (M step): Re-estimate the parameters using the current responsibilities

       µ_k^new = (1/N_k) ∑_{n=1}^{N} γ(z_nk) x_n

       Σ_k^new = (1/N_k) ∑_{n=1}^{N} γ(z_nk) (x_n − µ_k^new)(x_n − µ_k^new)^T

       π_k^new = N_k / N,   where N_k = ∑_{n=1}^{N} γ(z_nk)

EM Continued
•  Step 4: Evaluate the log likelihood

       ln p(X | π, µ, Σ) = ∑_{n=1}^{N} ln { ∑_{k=1}^{K} π_k N(x_n | µ_k, Σ_k) }

–  And check for convergence of either parameters


or log likelihood
•  If the convergence criterion is not satisfied, return to Step 2
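
Putting Steps 1-4 together, a minimal self-contained R sketch of the full EM loop for a K-component GMM (written for this summary; the function and variable names are illustrative, and K-means is used for Step 1 as suggested earlier):

# EM for a K-component, D-dimensional Gaussian mixture (Steps 1-4 above)
em_gmm <- function(X, K, max_iter = 100, tol = 1e-6) {
  N <- nrow(X); D <- ncol(X)

  # Step 1: initialize means, covariances, and mixing coefficients from K-means
  km    <- kmeans(X, centers = K, nstart = 10)
  mu    <- km$centers
  Sigma <- lapply(1:K, function(k) cov(X[km$cluster == k, , drop = FALSE]))
  pi_k  <- as.numeric(table(km$cluster)) / N

  # multivariate Gaussian density N(x | m, S), evaluated for every row of X
  dmvn <- function(X, m, S) {
    diff <- sweep(X, 2, m)
    quad <- rowSums((diff %*% solve(S)) * diff)
    exp(-0.5 * quad) / sqrt((2 * pi)^D * det(S))
  }

  loglik_old <- -Inf
  for (iter in 1:max_iter) {
    # Step 2 (E step): responsibilities gamma(z_nk), an N x K matrix
    dens  <- sapply(1:K, function(k) pi_k[k] * dmvn(X, mu[k, ], Sigma[[k]]))
    gamma <- dens / rowSums(dens)

    # Step 3 (M step): re-estimate the parameters from the responsibilities
    Nk    <- colSums(gamma)
    pi_k  <- Nk / N
    mu    <- t(sapply(1:K, function(k) colSums(gamma[, k] * X) / Nk[k]))
    Sigma <- lapply(1:K, function(k) {
      diff <- sweep(X, 2, mu[k, ])
      t(diff * gamma[, k]) %*% diff / Nk[k]
    })

    # Step 4: evaluate the log-likelihood with the updated parameters and check convergence
    dens_new <- sapply(1:K, function(k) pi_k[k] * dmvn(X, mu[k, ], Sigma[[k]]))
    loglik   <- sum(log(rowSums(dens_new)))
    if (abs(loglik - loglik_old) < tol) break
    loglik_old <- loglik
  }
  list(pi = pi_k, mu = mu, Sigma = Sigma, loglik = loglik, iterations = iter)
}

# Example: fit a two-component GMM to the Old Faithful data
fit <- em_gmm(as.matrix(faithful), K = 2)
fit$pi
fit$mu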
