
Note to other teachers and users of these slides.

Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.

Clustering with Gaussian Mixtures


Andrew W. Moore Professor School of Computer Science Carnegie Mellon University
www.cs.cmu.edu/~awm awm@cs.cmu.edu 412-268-7599
Copyright 2001, 2004, Andrew W. Moore

Unsupervised Learning
You walk into a bar. A stranger approaches and tells you: "I've got data from k classes. Each class produces observations with a normal distribution and variance σ²I. Standard simple multivariate Gaussian assumptions. I can tell you all the P(wi)'s." So far, looks straightforward. "I need a maximum likelihood estimate of the μi's." No problem. "There's just one thing. None of the data are labeled. I have datapoints, but I don't know what class they're from (any of them!)" Uh oh!!

Gaussian Bayes Classifier Reminder


$$P(y = i \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid y = i)\, P(y = i)}{p(\mathbf{x})}$$

$$P(y = i \mid \mathbf{x}) = \frac{\dfrac{1}{(2\pi)^{m/2} \lVert \Sigma_i \rVert^{1/2}} \exp\!\left( -\tfrac{1}{2} (\mathbf{x}_k - \mu_i)^T \Sigma_i^{-1} (\mathbf{x}_k - \mu_i) \right) p_i}{p(\mathbf{x})}$$
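A minimal sketch (my addition, not from the slides) of evaluating this posterior in code; the helper names and toy class parameters below are made up for illustration.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """N(x; mu, Sigma) = exp(-0.5 (x-mu)^T Sigma^-1 (x-mu)) / ((2 pi)^(m/2) |Sigma|^(1/2))."""
    m, diff = len(mu), x - mu
    norm = (2 * np.pi) ** (m / 2) * np.sqrt(np.linalg.det(sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(sigma, diff)) / norm

def posterior(x, mus, sigmas, priors):
    """P(y=i | x) for every class i, by Bayes' rule."""
    likes = np.array([gaussian_pdf(x, mu, s) * p for mu, s, p in zip(mus, sigmas, priors)])
    return likes / likes.sum()              # the division by p(x)

# toy 2-class, 2-d example
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
sigmas = [np.eye(2), 2 * np.eye(2)]
priors = [0.5, 0.5]
print(posterior(np.array([1.0, 1.0]), mus, sigmas, priors))
```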

How do we deal with that covariance matrix Σi?


Predicting wealth from age



Learning modelyear, mpg → maker

$$\Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1m} \\ \sigma_{12} & \sigma_2^2 & \cdots & \sigma_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{1m} & \sigma_{2m} & \cdots & \sigma_m^2 \end{pmatrix}$$


General: O(m²) parameters

$$\Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1m} \\ \sigma_{12} & \sigma_2^2 & \cdots & \sigma_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{1m} & \sigma_{2m} & \cdots & \sigma_m^2 \end{pmatrix}$$


Aligned: O(m) parameters

$$\Sigma = \begin{pmatrix} \sigma_1^2 & 0 & 0 & \cdots & 0 & 0 \\ 0 & \sigma_2^2 & 0 & \cdots & 0 & 0 \\ 0 & 0 & \sigma_3^2 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & \sigma_{m-1}^2 & 0 \\ 0 & 0 & 0 & \cdots & 0 & \sigma_m^2 \end{pmatrix}$$



Spherical: O(1) cov parameters

$$\Sigma = \begin{pmatrix} \sigma^2 & 0 & 0 & \cdots & 0 & 0 \\ 0 & \sigma^2 & 0 & \cdots & 0 & 0 \\ 0 & 0 & \sigma^2 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & \sigma^2 & 0 \\ 0 & 0 & 0 & \cdots & 0 & \sigma^2 \end{pmatrix}$$
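A quick summary of the exact counts behind these O(·) labels (my note, not on the original slides), for an m-dimensional x:

$$\text{general (symmetric): } \frac{m(m+1)}{2} \text{ covariance parameters}, \qquad \text{aligned (diagonal): } m, \qquad \text{spherical: } 1.$$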



Making a Classifier from a Density Estimator


Inputs                  | Classifier (predict category) | Density Estimator (probability) | Regressor (predict real no.)
Categorical inputs only | Joint BC, Naïve BC            | Joint DE, Naïve DE              |
Real-valued inputs only | Gauss BC                      | Gauss DE                        |
Mixed Real / Cat okay   | Dec Tree                      |                                 |


Next… back to density estimation. What if we want to do density estimation with multimodal or clumpy data?


The GMM assumption


There are k components. The i'th component is called wi. Component wi has an associated mean vector μi.

[Figure: three component means μ1, μ2, μ3 plotted in 2-d space]


The GMM assumption


There are k components. The i'th component is called wi. Component wi has an associated mean vector μi. Each component generates data from a Gaussian with mean μi and covariance matrix σ²I.


Assume that each datapoint is generated according to the following recipe:


The GMM assumption


There are k components. The i'th component is called wi. Component wi has an associated mean vector μi. Each component generates data from a Gaussian with mean μi and covariance matrix σ²I.

Assume that each datapoint is generated according to the following recipe: 1. Pick a component at random. Choose component i with probability P(wi).

The GMM assumption


There are k components. The i'th component is called wi. Component wi has an associated mean vector μi. Each component generates data from a Gaussian with mean μi and covariance matrix σ²I.

Assume that each datapoint is generated according to the following recipe: 1. Pick a component at random. Choose component i with probability P(wi). 2. Datapoint ~ N(μi, σ²I).

The General GMM assumption


There are k components. The i'th component is called wi. Component wi has an associated mean vector μi. Each component generates data from a Gaussian with mean μi and covariance matrix Σi.

Assume that each datapoint is generated according to the following recipe: 1. Pick a component at random. Choose component i with probability P(wi). 2. Datapoint ~ N(μi, Σi). (A sampling sketch of this recipe follows.)
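A minimal sketch of this two-step recipe (my code; the priors, means and covariances below are made-up illustrative numbers, not from the slides).

```python
import numpy as np

rng = np.random.default_rng(0)

priors = np.array([0.5, 0.3, 0.2])                    # P(w_1), P(w_2), P(w_3)
means = [np.array([0.0, 0.0]), np.array([4.0, 4.0]), np.array([-3.0, 3.0])]
covs = [np.eye(2), 0.5 * np.eye(2), np.diag([2.0, 0.3])]

def sample_gmm(n):
    comps = rng.choice(len(priors), size=n, p=priors)             # 1. pick a component
    return np.array([rng.multivariate_normal(means[c], covs[c])   # 2. draw from N(mu_i, Sigma_i)
                     for c in comps])

X = sample_gmm(500)    # 500 unlabeled datapoints, as in the clustering setting
print(X.shape)
```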

Unsupervised Learning: not as hard as it looks


Sometimes easy
In case you're wondering what these diagrams are, they show 2-d unlabeled data (x vectors) distributed in 2-d space. The top one has three very clear Gaussian centers.

Sometimes impossible

and sometimes in between



Computing likelihoods in unsupervised case


We have x1, x2, … xN. We know P(w1), P(w2), …, P(wk). We know σ. P(x | wi, μi, … μk) = the probability that an observation from class wi would have value x given class means μ1 … μk.
Can we write an expression for that?


Computing likelihoods in unsupervised case


We have x1, x2, … xn. We have P(w1) … P(wk). We have σ. We can define, for any x, P(x | wi, μ1, μ2 … μk). Can we define P(x | μ1, μ2 … μk)?

Can we define P(x1, x2, … xn | μ1, μ2 … μk)?

[Yes, if we assume the xi's were drawn independently.]

Unsupervised Learning: Mediumly Good News


We now have a procedure such that if you give me a guess at μ1, μ2 … μk, I can tell you the prob of the unlabeled data given those μ's.

Suppose x's are 1-dimensional. There are two classes, w1 and w2, with P(w1) = 1/3, P(w2) = 2/3, and σ = 1. There are 25 unlabeled datapoints (from Duda and Hart):
x1 = 0.608
x2 = -1.590
x3 = 0.235
x4 = 3.949
:
x25 = -0.712


Duda & Hart's Example


Graph of log P(x1, x2, …, x25 | μ1, μ2) against μ1 (one axis) and μ2 (the other axis).

Max likelihood = (μ1 = -2.13, μ2 = 1.668). Local optimum, but very close to global, at (μ1 = 2.085, μ2 = -1.257)*. (* corresponds to switching w1 and w2.)

Duda & Hart's Example


We can graph the prob. dist. function of data given our μ1 and μ2 estimates. We can also graph the true function from which the data was randomly generated.

They are close. Good. The 2nd solution tries to put the "2/3" hump where the "1/3" hump should go, and vice versa. In this example unsupervised is almost as good as supervised. If the x1 … x25 are given the class which was used to learn them, then the results are (μ1 = -2.176, μ2 = 1.684). Unsupervised got (μ1 = -2.13, μ2 = 1.668).

Finding the max likelihood μ1, μ2 … μk

We can compute P(data | μ1, μ2 … μk). How do we find the μi's which give max likelihood?

The normal max likelihood trick: set ∂ log Prob(…) / ∂μi = 0 and solve for the μi's. Here you get non-linear, non-analytically-solvable equations. Use gradient descent: slow but doable. Or use a much faster, cuter, and recently very popular method…
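As a side illustration (my own, not from the slides) of the gradient-descent option just mentioned: numerical gradient ascent on the two-mean mixture log-likelihood, using only the five datapoints actually listed on the earlier Duda & Hart slide (P(w1)=1/3, P(w2)=2/3, σ=1).

```python
import numpy as np

x = np.array([0.608, -1.590, 0.235, 3.949, -0.712])
priors = np.array([1/3, 2/3])

def loglik(mus):
    comp = np.exp(-0.5 * (x[:, None] - mus[None, :]) ** 2) / np.sqrt(2 * np.pi)
    return np.log(comp @ priors).sum()

mus, lr, eps = np.array([0.0, 1.0]), 0.05, 1e-5
for _ in range(500):
    grad = np.array([(loglik(mus + eps * np.eye(2)[i]) - loglik(mus)) / eps
                     for i in range(2)])
    mus = mus + lr * grad          # ascent, since we are maximizing the likelihood
print(mus)                         # a local optimum for these five points only
```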

Expectation Maximization

DETOUR

The E.M. Algorithm

We'll get back to unsupervised learning soon. But now we'll look at an even simpler case with hidden information. The EM algorithm:
Can do trivial things, such as the contents of the next few slides. An excellent way of doing our unsupervised learning problem, as we'll see. Many, many other uses, including inference of Hidden Markov Models (future lecture).

Silly Example
Let events be grades in a class: w1 = Gets an A, P(A) = ½; w2 = Gets a B, P(B) = μ; w3 = Gets a C, P(C) = 2μ; w4 = Gets a D, P(D) = ½ − 3μ. (Note 0 ≤ μ ≤ 1/6.) Assume we want to estimate μ from data. In a given class there were a A's, b B's, c C's, d D's. What's the maximum likelihood estimate of μ given a, b, c, d?


Trivial Statistics
P(A) = ½, P(B) = μ, P(C) = 2μ, P(D) = ½ − 3μ

$$P(a, b, c, d \mid \mu) = K \left(\tfrac{1}{2}\right)^a (\mu)^b (2\mu)^c \left(\tfrac{1}{2} - 3\mu\right)^d$$

$$\log P(a, b, c, d \mid \mu) = \log K + a \log \tfrac{1}{2} + b \log \mu + c \log 2\mu + d \log\!\left(\tfrac{1}{2} - 3\mu\right)$$

For max like μ, set ∂ log P / ∂μ = 0:

$$\frac{\partial \log P}{\partial \mu} = \frac{b}{\mu} + \frac{2c}{2\mu} - \frac{3d}{\tfrac{1}{2} - 3\mu} = 0
\quad\Longrightarrow\quad
\text{max like } \mu = \frac{b + c}{6\,(b + c + d)}$$

So if the class got A = 14, B = 6, C = 9, D = 10, then max like μ = 1/10.

Boring, but true!
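A quick numeric check (my addition) of the closed form just derived: maximize log P(a,b,c,d | μ) over a grid and compare with (b+c)/(6(b+c+d)).

```python
import numpy as np

a, b, c, d = 14, 6, 9, 10                      # the grade counts from the slide
mus = np.linspace(1e-6, 1/6 - 1e-6, 100_000)   # feasible range 0 < mu < 1/6
loglik = (a * np.log(0.5) + b * np.log(mus)
          + c * np.log(2 * mus) + d * np.log(0.5 - 3 * mus))
print(mus[np.argmax(loglik)])                  # approximately 0.1
print((b + c) / (6 * (b + c + d)))             # 0.1 exactly
```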


Same Problem with Hidden Information


Someone tells us that: Number of High grades (A's + B's) = h, Number of C's = c, Number of D's = d. What is the max. like estimate of μ now?
(Remember: P(A) = ½, P(B) = μ, P(C) = 2μ, P(D) = ½ − 3μ.)


We can answer this question circularly:

EXPECTATION: If we knew the value of μ, we could compute the expected values of a and b, since the ratio a : b should be the same as the ratio ½ : μ:

$$a = \frac{\tfrac{1}{2}}{\tfrac{1}{2} + \mu}\, h \qquad\qquad b = \frac{\mu}{\tfrac{1}{2} + \mu}\, h$$

MAXIMIZATION: If we knew the expected values of a and b, we could compute the maximum likelihood value of μ:

$$\hat\mu = \frac{b + c}{6\,(b + c + d)}$$


E.M. for our Trivial Problem


We begin with a guess for μ. We iterate between EXPECTATION and MAXIMIZATION to improve our estimates of μ and of a and b. Define μ(t) = the estimate of μ on the t-th iteration, and b(t) = the estimate of b on the t-th iteration.

μ(0) = initial guess

E-step:
$$b(t) = \frac{\mu(t)\, h}{\tfrac{1}{2} + \mu(t)} = \mathrm{E}[\, b \mid \mu(t)\,]$$

M-step:
$$\mu(t+1) = \frac{b(t) + c}{6\,(b(t) + c + d)} = \text{max like estimate of } \mu \text{ given } b(t)$$

(Remember: P(A) = ½, P(B) = μ, P(C) = 2μ, P(D) = ½ − 3μ.)

Continue iterating until converged. Good news: converging to a local optimum is assured. Bad news: I said "local" optimum.

E.M. Convergence
Convergence proof based on the fact that Prob(data | μ) must increase or remain the same between each iteration [NOT OBVIOUS], but it can never exceed 1 [OBVIOUS], so it must therefore converge [OBVIOUS].

In our example, suppose we had h = 20, c = 10, d = 10, μ(0) = 0. Convergence is generally linear: error decreases by a constant factor each time step.

t    μ(t)     b(t)
0    0        0
1    0.0833   2.857
2    0.0937   3.158
3    0.0947   3.185
4    0.0948   3.187
5    0.0948   3.187
6    0.0948   3.187
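A few lines (my addition) that reproduce this table, up to rounding, by alternating the E-step and M-step with h = 20, c = 10, d = 10, μ(0) = 0.

```python
h, c, d = 20, 10, 10
mu = 0.0                                   # mu(0) = initial guess
for t in range(7):
    b = mu * h / (0.5 + mu)                # E-step: b(t) = E[b | mu(t)]
    print(t, round(mu, 4), round(b, 3))    # one row of the table
    mu = (b + c) / (6 * (b + c + d))       # M-step: mu(t+1) from b(t)
```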


Back to Unsupervised Learning of GMMs


Remember: We have unlabeled data x1, x2, … xR. We know there are k classes. We know P(w1), P(w2), …, P(wk). We don't know μ1, μ2, … μk. We can write P(data | μ1 … μk):

$$P(\text{data} \mid \mu_1 \ldots \mu_k) = p(x_1 \ldots x_R \mid \mu_1 \ldots \mu_k) = \prod_{i=1}^{R} p(x_i \mid \mu_1 \ldots \mu_k)$$

$$= \prod_{i=1}^{R} \sum_{j=1}^{k} p(x_i \mid w_j, \mu_1 \ldots \mu_k)\, P(w_j) = \prod_{i=1}^{R} \sum_{j=1}^{k} K \exp\!\left( -\frac{1}{2\sigma^2} (x_i - \mu_j)^2 \right) P(w_j)$$
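A sketch (mine) of evaluating this unlabeled-data likelihood in the 1-d, two-class Duda & Hart setting (P(w1)=1/3, P(w2)=2/3, σ=1). Only the five datapoints actually printed on the earlier slide are used here.

```python
import numpy as np

def log_likelihood(x, mus, priors, sigma=1.0):
    """log P(x_1..x_R | mu_1..mu_k) = sum_i log sum_j N(x_i; mu_j, sigma^2) P(w_j)."""
    x, mus = np.asarray(x)[:, None], np.asarray(mus)[None, :]      # R x 1 and 1 x k
    comp = np.exp(-0.5 * ((x - mus) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return np.log(comp @ np.asarray(priors)).sum()

x = [0.608, -1.590, 0.235, 3.949, -0.712]
print(log_likelihood(x, mus=[-2.13, 1.668], priors=[1/3, 2/3]))
```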


E.M. for GMMs


For Max likelihood we know

$$\frac{\partial}{\partial \mu_i} \log \operatorname{Prob}(\text{data} \mid \mu_1 \ldots \mu_k) = 0$$

Some wild 'n' crazy algebra turns this into: for Max likelihood, for each j,

$$\mu_j = \frac{\sum_{i=1}^{R} P(w_j \mid x_i, \mu_1 \ldots \mu_k)\; x_i}{\sum_{i=1}^{R} P(w_j \mid x_i, \mu_1 \ldots \mu_k)}$$

(See http://www.cs.cmu.edu/~awm/doc/gmm-algebra.pdf.)

This is a set of nonlinear equations in the μj's.

If, for each xi, we knew the prob that xi was in class wj, i.e. P(wj | xi, μ1 … μk), then we could easily compute μj. If we knew each μj, then we could easily compute P(wj | xi, μ1 … μk) for each wj and xi.

I feel an EM experience coming on!!



E.M. for GMMs

Iterate. On the t-th iteration let our estimates be λt = { μ1(t), μ2(t) … μc(t) }.

E-step: Compute the expected classes of all datapoints for each class (just evaluate a Gaussian at xk):

$$P(w_i \mid x_k, \lambda_t) = \frac{p(x_k \mid w_i, \lambda_t)\, P(w_i \mid \lambda_t)}{p(x_k \mid \lambda_t)} = \frac{p\!\left(x_k \mid w_i, \mu_i(t), \sigma^2 I\right) p_i(t)}{\sum_{j=1}^{c} p\!\left(x_k \mid w_j, \mu_j(t), \sigma^2 I\right) p_j(t)}$$

M-step: Compute the max-likelihood μ given our data's class membership distributions:

$$\mu_i(t+1) = \frac{\sum_k P(w_i \mid x_k, \lambda_t)\, x_k}{\sum_k P(w_i \mid x_k, \lambda_t)}$$
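A compact sketch (my code, not from the slides) of this E-step / M-step: spherical components with a known, shared σ, so only the means are re-estimated and the Gaussian normalizing constant cancels inside the responsibilities.

```python
import numpy as np

def em_means(X, mus, priors, sigma, iters=20):
    X, mus = np.asarray(X, float), np.asarray(mus, float)    # R x m data, c x m means
    priors = np.asarray(priors, float)                       # p_1..p_c, kept fixed here
    for _ in range(iters):
        # E-step: responsibilities P(w_i | x_k, lambda_t)
        d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(-1)   # R x c squared distances
        resp = np.exp(-0.5 * d2 / sigma ** 2) * priors
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: mu_i(t+1) = responsibility-weighted mean of the data
        mus = (resp.T @ X) / resp.sum(axis=0)[:, None]
    return mus
```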


E.M. Convergence
Your lecturer will (unless out of time) give you a nice intuitive explanation of why this rule works. As with all EM procedures, convergence to a local optimum is guaranteed.
This algorithm is REALLY USED. And in high dimensional state spaces, too. E.g. Vector Quantization for speech data.

E.M. for General GMMs

Iterate. On the t-th iteration let our estimates be λt = { μ1(t), μ2(t) … μc(t), Σ1(t), Σ2(t) … Σc(t), p1(t), p2(t) … pc(t) }, where pi(t) is shorthand for the estimate of P(wi) on the t-th iteration.

E-step: Compute the expected classes of all datapoints for each class (just evaluate a Gaussian at xk):

$$P(w_i \mid x_k, \lambda_t) = \frac{p(x_k \mid w_i, \lambda_t)\, P(w_i \mid \lambda_t)}{p(x_k \mid \lambda_t)} = \frac{p\!\left(x_k \mid w_i, \mu_i(t), \Sigma_i(t)\right) p_i(t)}{\sum_{j=1}^{c} p\!\left(x_k \mid w_j, \mu_j(t), \Sigma_j(t)\right) p_j(t)}$$

M-step: Compute the max-likelihood μ, Σ, and p given our data's class membership distributions:

$$\mu_i(t+1) = \frac{\sum_k P(w_i \mid x_k, \lambda_t)\, x_k}{\sum_k P(w_i \mid x_k, \lambda_t)} \qquad
\Sigma_i(t+1) = \frac{\sum_k P(w_i \mid x_k, \lambda_t)\, \left[ x_k - \mu_i(t+1) \right] \left[ x_k - \mu_i(t+1) \right]^T}{\sum_k P(w_i \mid x_k, \lambda_t)}$$

$$p_i(t+1) = \frac{\sum_k P(w_i \mid x_k, \lambda_t)}{R} \qquad \text{where } R = \#\text{records}$$
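A sketch of one full EM iteration for the general GMM above (my code, not Moore's); it follows the E-step/M-step formulas directly and uses scipy's multivariate normal density for p(xk | wi, μi(t), Σi(t)).

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, mus, sigmas, priors):
    X = np.asarray(X, float)                 # R x m data matrix
    R, c = len(X), len(priors)
    # E-step: responsibilities P(w_i | x_k, lambda_t), an R x c matrix
    resp = np.column_stack([
        priors[i] * multivariate_normal.pdf(X, mean=mus[i], cov=sigmas[i])
        for i in range(c)])
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: re-estimate means, covariances and mixing weights
    Nk = resp.sum(axis=0)                    # sum_k P(w_i | x_k, lambda_t)
    new_mus = (resp.T @ X) / Nk[:, None]
    new_sigmas = []
    for i in range(c):
        diff = X - new_mus[i]
        new_sigmas.append((resp[:, i, None] * diff).T @ diff / Nk[i])
    new_priors = Nk / R                      # p_i(t+1) = sum_k P(w_i | x_k, lambda_t) / R
    return new_mus, new_sigmas, new_priors

# typical usage: repeat until the parameters (or the log-likelihood) stop changing
# for t in range(100): mus, sigmas, priors = em_step(X, mus, sigmas, priors)
```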

Gaussian Mixture Example: Start

Advance apologies: in Black and White this example will be incomprehensible



After first iteration


After 2nd iteration


After 3rd iteration


After 4th iteration


After 5th iteration


After 6th iteration


After 20th iteration


Some Bio Assay data


GMM clustering of the assay data


Resulting Density Estimator


Where are we now?


Task (given Inputs) | Output           | Machinery covered so far
Inference Engine    | P(E1|E2)         | Joint DE, Bayes Net Structure Learning
Classifier          | Predict category | Dec Tree, Sigmoid Perceptron, Sigmoid N.Net, Gauss/Joint BC, Gauss Naïve BC, N.Neigh, Bayes Net Based BC, Cascade Correlation
Density Estimator   | Probability      | Joint DE, Naïve DE, Gauss/Joint DE, Gauss Naïve DE, Bayes Net Structure Learning, GMMs
Regressor           | Predict real no. | Linear Regression, Polynomial Regression, Perceptron, Neural Net, N.Neigh, Kernel, LWR, RBFs, Robust Regression, Cascade Correlation, Regression Trees, GMDH, Multilinear Interp, MARS

The old trick


Task (given Inputs) | Output           | Machinery covered so far
Inference Engine    | P(E1|E2)         | Joint DE, Bayes Net Structure Learning
Classifier          | Predict category | Dec Tree, Sigmoid Perceptron, Sigmoid N.Net, Gauss/Joint BC, Gauss Naïve BC, N.Neigh, Bayes Net Based BC, Cascade Correlation, GMM-BC
Density Estimator   | Probability      | Joint DE, Naïve DE, Gauss/Joint DE, Gauss Naïve DE, Bayes Net Structure Learning, GMMs
Regressor           | Predict real no. | Linear Regression, Polynomial Regression, Perceptron, Neural Net, N.Neigh, Kernel, LWR, RBFs, Robust Regression, Cascade Correlation, Regression Trees, GMDH, Multilinear Interp, MARS


Three classes of assay


(Sorry, this will again be semi-useless in black and white)

(each learned with its own mixture model)


Resulting Bayes Classifier


Resulting Bayes Classifier, using posterior probabilities to alert about ambiguity and anomalousness
Yellow means anomalous; cyan means ambiguous.
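A small sketch (mine, with made-up thresholds) of the idea on this slide: flag a point as anomalous when its mixture density p(x) is very low, and as ambiguous when no single class dominates its posterior P(wi | x).

```python
import numpy as np

def flag_points(px, posteriors, density_thresh=1e-3, margin_thresh=0.6):
    px = np.asarray(px)                            # p(x_k) under the mixture
    post = np.asarray(posteriors)                  # R x c matrix of P(w_i | x_k)
    anomalous = px < density_thresh                # "yellow" points
    ambiguous = post.max(axis=1) < margin_thresh   # "cyan" points
    return anomalous, ambiguous
```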
Unsupervised learning with symbolic attributes

[Figure: a small Bayes net with a hidden ("missing") variable and attributes # KIDS, MARRIED, NATION]

It's just a "learning a Bayes net with known structure but hidden values" problem. Can use Gradient Descent. EASY, fun exercise to do an EM formulation for this case too.

Final Comments

Remember, E.M. can get stuck in local minima, and empirically it DOES. Our unsupervised learning example assumed the P(wi)'s were known, and the variances fixed and known. It is easy to relax this. It's possible to do Bayesian unsupervised learning instead of max. likelihood. There are other algorithms for unsupervised learning: we'll visit K-means soon; hierarchical clustering is also interesting; and neural-net algorithms called "competitive learning" turn out to have interesting parallels with the EM method we saw.

What you should know


How to learn maximum likelihood parameters (locally max. like.) in the case of unlabeled data. Be happy with this kind of probabilistic analysis. Understand the two examples of E.M. given in these notes. For more info, see Duda + Hart. It's a great book. There's much more in the book than in your handout.


Other unsupervised learning methods


K-means (see next lecture)
Hierarchical clustering (e.g. minimum spanning trees) (see next lecture)
Principal Component Analysis: simple, useful tool
Non-linear PCA: neural auto-associators, locally weighted PCA, others