
Copyright 2001, 2004, Andrew W. Moore


Clustering with
Gaussian Mixtures
Andrew W. Moore
Professor
School of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~awm
awm@cs.cmu.edu
412-268-7599
Note to other teachers and users of these slides. Andrew would be delighted if you found
this source material useful in giving your own lectures. Feel free to use these slides
verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you
make use of a significant portion of these slides in your own lecture, please include this
message, or the following link to the source repository of Andrew's tutorials:
http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.
Copyright 2001, 2004, Andrew W. Moore Clustering with Gaussian Mixtures: Slide 2
Unsupervised Learning
You walk into a bar.
A stranger approaches and tells you:
"I've got data from k classes. Each class produces observations with a normal distribution and variance σ²I. Standard simple multivariate Gaussian assumptions. I can tell you all the P(w_i)'s."
So far, looks straightforward.
"I need a maximum likelihood estimate of the μ_i's."
No problem:
"There's just one thing. None of the data are labeled. I have datapoints, but I don't know what class they're from (any of them!)"
Uh oh!!
Copyright 2001, 2004, Andrew W. Moore Clustering with Gaussian Mixtures: Slide 3
Gaussian Bayes Classifier
Reminder
$$
P(y=i \mid \mathbf{x})
= \frac{p(\mathbf{x} \mid y=i)\, P(y=i)}{p(\mathbf{x})}
= \frac{\dfrac{1}{(2\pi)^{m/2}\,\|\Sigma_i\|^{1/2}}
        \exp\!\left(-\tfrac{1}{2}\,(\mathbf{x}-\boldsymbol{\mu}_i)^{T}\,\Sigma_i^{-1}\,(\mathbf{x}-\boldsymbol{\mu}_i)\right) P(y=i)}
       {p(\mathbf{x})}
$$
How do we deal with that?
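For concreteness, here is a minimal numerical sketch of the posterior above (my illustration, not part of the original slides); the class means, covariances, and priors are made-up placeholders.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical 2-class, 2-D setup (placeholder parameters, not from the slides)
means  = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs   = [np.eye(2), 2.0 * np.eye(2)]
priors = [0.4, 0.6]                      # P(y = i)

def class_posterior(x):
    """P(y=i | x) = p(x | y=i) P(y=i) / p(x), with Gaussian class densities."""
    joint = np.array([multivariate_normal.pdf(x, mean=m, cov=c) * p
                      for m, c, p in zip(means, covs, priors)])
    return joint / joint.sum()           # dividing by the sum is dividing by p(x)

print(class_posterior(np.array([1.0, 1.5])))
```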
Copyright 2001, 2004, Andrew W. Moore Clustering with Gaussian Mixtures: Slide 4
Predicting wealth from age
Copyright 2001, 2004, Andrew W. Moore Clustering with Gaussian Mixtures: Slide 5
Predicting wealth from age
Copyright 2001, 2004, Andrew W. Moore Clustering with Gaussian Mixtures: Slide 6
Learning modelyear, mpg ---> maker

$$
\Sigma =
\begin{pmatrix}
\sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1m} \\
\sigma_{12} & \sigma_2^2 & \cdots & \sigma_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{1m} & \sigma_{2m} & \cdots & \sigma_m^2
\end{pmatrix}
$$
Copyright 2001, 2004, Andrew W. Moore Clustering with Gaussian Mixtures: Slide 7
General: O(m²) parameters

$$
\Sigma =
\begin{pmatrix}
\sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1m} \\
\sigma_{12} & \sigma_2^2 & \cdots & \sigma_{2m} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{1m} & \sigma_{2m} & \cdots & \sigma_m^2
\end{pmatrix}
$$
Copyright 2001, 2004, Andrew W. Moore Clustering with Gaussian Mixtures: Slide 8
Aligned: O(m) parameters

$$
\Sigma =
\begin{pmatrix}
\sigma_1^2 & 0 & 0 & \cdots & 0 & 0 \\
0 & \sigma_2^2 & 0 & \cdots & 0 & 0 \\
0 & 0 & \sigma_3^2 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & \sigma_{m-1}^2 & 0 \\
0 & 0 & 0 & \cdots & 0 & \sigma_m^2
\end{pmatrix}
$$
Copyright 2001, 2004, Andrew W. Moore Clustering with Gaussian Mixtures: Slide 9
Aligned: O(m) parameters

$$
\Sigma =
\begin{pmatrix}
\sigma_1^2 & 0 & 0 & \cdots & 0 & 0 \\
0 & \sigma_2^2 & 0 & \cdots & 0 & 0 \\
0 & 0 & \sigma_3^2 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & \sigma_{m-1}^2 & 0 \\
0 & 0 & 0 & \cdots & 0 & \sigma_m^2
\end{pmatrix}
$$
Copyright 2001, 2004, Andrew W. Moore Clustering with Gaussian Mixtures: Slide 10
Spherical: O(1) cov parameters

$$
\Sigma =
\begin{pmatrix}
\sigma^2 & 0 & 0 & \cdots & 0 & 0 \\
0 & \sigma^2 & 0 & \cdots & 0 & 0 \\
0 & 0 & \sigma^2 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & \sigma^2 & 0 \\
0 & 0 & 0 & \cdots & 0 & \sigma^2
\end{pmatrix}
$$
Copyright 2001, 2004, Andrew W. Moore Clustering with Gaussian Mixtures: Slide 11
Spherical: O(1) cov parameters

$$
\Sigma =
\begin{pmatrix}
\sigma^2 & 0 & 0 & \cdots & 0 & 0 \\
0 & \sigma^2 & 0 & \cdots & 0 & 0 \\
0 & 0 & \sigma^2 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & \sigma^2 & 0 \\
0 & 0 & 0 & \cdots & 0 & \sigma^2
\end{pmatrix}
$$
Copyright 2001, 2004, Andrew W. Moore Clustering with Gaussian Mixtures: Slide 12
Making a Classifier from a
Density Estimator
[Diagram: three learning pipelines and the methods seen so far, grouped by input type.]
Inputs → Classifier → Predict category
Inputs → Density Estimator → Probability
Inputs → Regressor → Predict real no.
Methods by input type (categorical inputs only / mixed real & categorical okay / real-valued inputs only): Dec Tree, Joint BC, Naïve BC, Joint DE, Naïve DE, Gauss DE, Gauss BC
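To make the "density estimator → classifier" idea concrete, here is a hedged sketch (my addition, not from the slides): fit one Gaussian density estimator per class and classify with Bayes' rule. Function names and toy data are illustrative only.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gauss_de(X):
    """Gaussian density estimator: MLE mean and covariance for one class."""
    return X.mean(axis=0), np.cov(X, rowvar=False)

def bayes_classify(x, class_data, priors):
    """Turn per-class density estimators into a classifier via Bayes' rule."""
    scores = []
    for X_c, prior in zip(class_data, priors):
        mu, cov = fit_gauss_de(X_c)
        scores.append(prior * multivariate_normal.pdf(x, mean=mu, cov=cov))
    return int(np.argmax(scores))        # class with largest P(class) p(x | class)

# Toy usage with made-up 2-D data for two classes
rng = np.random.default_rng(0)
class_data = [rng.normal([0, 0], 1.0, size=(50, 2)),
              rng.normal([3, 3], 1.0, size=(50, 2))]
print(bayes_classify(np.array([2.5, 2.8]), class_data, priors=[0.5, 0.5]))
```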
Copyright 2001, 2004, Andrew W. Moore Clustering with Gaussian Mixtures: Slide 13
Next: back to Density Estimation
What if we want to do density estimation with multimodal or "clumpy" data?
Copyright 2001, 2004, Andrew W. Moore Clustering with Gaussian Mixtures: Slide 14
The GMM assumption
There are k components. The i'th component is called ω_i.
Component ω_i has an associated mean vector μ_i.
[Figure: component means μ_1, μ_2, μ_3 shown as points in 2-d space]
Copyright 2001, 2004, Andrew W. Moore Clustering with Gaussian Mixtures: Slide 15
The GMM assumption
There are k components. The i'th component is called ω_i.
Component ω_i has an associated mean vector μ_i.
Each component generates data from a Gaussian with mean μ_i and covariance matrix σ²I.
Assume that each datapoint is generated according to the following recipe:
[Figure: component means μ_1, μ_2, μ_3 in 2-d space]
Copyright 2001, 2004, Andrew W. Moore Clustering with Gaussian Mixtures: Slide 16
The GMM assumption
There are k components. The i'th component is called ω_i.
Component ω_i has an associated mean vector μ_i.
Each component generates data from a Gaussian with mean μ_i and covariance matrix σ²I.
Assume that each datapoint is generated according to the following recipe:
1. Pick a component at random. Choose component i with probability P(ω_i).
[Figure: component means μ_1, μ_2, μ_3 in 2-d space, with one component chosen]
Copyright 2001, 2004, Andrew W. Moore Clustering with Gaussian Mixtures: Slide 17
The GMM assumption
There are k components. The i'th component is called ω_i.
Component ω_i has an associated mean vector μ_i.
Each component generates data from a Gaussian with mean μ_i and covariance matrix σ²I.
Assume that each datapoint is generated according to the following recipe:
1. Pick a component at random. Choose component i with probability P(ω_i).
2. Datapoint ~ N(μ_i, σ²I)
[Figure: a datapoint x drawn near the chosen component mean]
Copyright 2001, 2004, Andrew W. Moore Clustering with Gaussian Mixtures: Slide 18
The General GMM assumption
There are k components. The i'th component is called ω_i.
Component ω_i has an associated mean vector μ_i.
Each component generates data from a Gaussian with mean μ_i and covariance matrix Σ_i.
Assume that each datapoint is generated according to the following recipe:
1. Pick a component at random. Choose component i with probability P(ω_i).
2. Datapoint ~ N(μ_i, Σ_i)
[Figure: component means μ_1, μ_2, μ_3, each with its own covariance ellipse]
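The two-step recipe translates directly into a sampler. A minimal sketch (my addition), assuming made-up mixture parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-component 2-D mixture (placeholder parameters)
weights = np.array([0.5, 0.3, 0.2])                                     # P(omega_i)
means   = np.array([[0., 0.], [4., 4.], [-3., 3.]])                     # mu_i
covs    = np.array([np.eye(2), [[1., .8], [.8, 1.]], [[2., 0.], [0., .3]]])  # Sigma_i

def sample_gmm(n):
    """Step 1: pick a component with probability P(omega_i).
       Step 2: draw the datapoint from N(mu_i, Sigma_i)."""
    comps = rng.choice(len(weights), size=n, p=weights)
    X = np.array([rng.multivariate_normal(means[c], covs[c]) for c in comps])
    return X, comps

X, labels = sample_gmm(500)
print(X.shape, np.bincount(labels))
```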
Copyright 2001, 2004, Andrew W. Moore Clustering with Gaussian Mixtures: Slide 19
Unsupervised Learning:
not as hard as it looks
Sometimes easy
Sometimes impossible
and sometimes in between
IN CASE YOU'RE WONDERING WHAT THESE DIAGRAMS ARE, THEY SHOW 2-D UNLABELED DATA (x VECTORS) DISTRIBUTED IN 2-D SPACE. THE TOP ONE HAS THREE VERY CLEAR GAUSSIAN CENTERS.
Copyright 2001, 2004, Andrew W. Moore Clustering with Gaussian Mixtures: Slide 20
Computing likelihoods in
unsupervised case
We have x_1, x_2, ..., x_N.
We know P(w_1), P(w_2), ..., P(w_k).
We know σ.
P(x | w_i, μ_1, ..., μ_k) = Prob that an observation from class w_i would have value x given class means μ_1 through μ_k.
Can we write an expression for that?
Copyright 2001, 2004, Andrew W. Moore Clustering with Gaussian Mixtures: Slide 21
likelihoods in unsupervised case
We have x_1, x_2, ..., x_n.
We have P(w_1) .. P(w_k). We have σ.
We can define, for any x, P(x | w_i, μ_1, μ_2, ..., μ_k).
Can we define P(x | μ_1, μ_2, ..., μ_k)?
Can we define P(x_1, x_2, ..., x_n | μ_1, μ_2, ..., μ_k)?
[YES, IF WE ASSUME THE x_i'S WERE DRAWN INDEPENDENTLY]
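Here is a minimal sketch of that likelihood (my illustration, not from the slides): with known P(w_j) and σ and a guessed set of means, the log-likelihood of independently drawn data is a sum of logs of mixture densities. The guessed means below are placeholders; the data values are the five datapoints listed on the next slide.

```python
import numpy as np
from scipy.stats import norm

def log_likelihood(x, mus, weights, sigma):
    """log P(x_1..x_n | mu_1..mu_k) = sum_i log sum_j P(w_j) N(x_i; mu_j, sigma^2),
       assuming the x_i were drawn independently."""
    # shape (n, k): density of each point under each component
    dens = norm.pdf(x[:, None], loc=np.asarray(mus)[None, :], scale=sigma)
    return np.log(dens @ np.asarray(weights)).sum()

# Toy 1-D usage with a made-up guess at the means
x = np.array([0.608, -1.590, 0.235, 3.949, -0.712])
print(log_likelihood(x, mus=[-2.0, 2.0], weights=[1/3, 2/3], sigma=1.0))
```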
Copyright 2001, 2004, Andrew W. Moore Clustering with Gaussian Mixtures: Slide 22
Unsupervised Learning:
Mediumly Good News
We now have a procedure s.t. if you give me a guess at μ_1, μ_2, ..., μ_k, I can tell you the prob of the unlabeled data given those μ's.
Suppose x's are 1-dimensional.
There are two classes: w_1 and w_2.
P(w_1) = 1/3   P(w_2) = 2/3   σ = 1.
There are 25 unlabeled datapoints:
x_1 = 0.608
x_2 = -1.590
x_3 = 0.235
x_4 = 3.949
  :
x_25 = -0.712
(From Duda and Hart)
Copyright 2001, 2004, Andrew W. Moore Clustering with Gaussian Mixtures: Slide 23
Duda & Hart's Example
Graph of log P(x_1, x_2, ..., x_25 | μ_1, μ_2) against μ_1 and μ_2.
Max likelihood = (μ_1 = -2.13, μ_2 = 1.668).
Local minimum, but very close to global at (μ_1 = 2.085, μ_2 = -1.257)*.
* corresponds to switching w_1 and w_2.
Copyright 2001, 2004, Andrew W. Moore Clustering with Gaussian Mixtures: Slide 24
Duda & Hart's Example
We can graph the prob. dist. function of data given our μ_1 and μ_2 estimates.
We can also graph the true function from which the data was randomly generated.
They are close. Good.
The 2nd solution tries to put the 2/3 hump where the 1/3 hump should go, and vice versa.
In this example unsupervised is almost as good as supervised. If the x_1 .. x_25 are given the class which was used to learn them, then the results are (μ_1 = -2.176, μ_2 = 1.684). Unsupervised got (μ_1 = -2.13, μ_2 = 1.668).
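A quick way to reproduce the flavor of that likelihood surface is to grid-search log P(data | μ_1, μ_2) for a two-component 1-D mixture with the stated weights (1/3, 2/3) and σ = 1. The sketch below (my addition) uses synthetic data in place of the 25 Duda & Hart points, so the exact optimum will differ.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for the 25 unlabeled points (true means here are placeholders)
from_class1 = rng.random(25) < 1/3
x = np.where(from_class1, rng.normal(-2.0, 1.0, 25), rng.normal(2.0, 1.0, 25))

def loglik(mu1, mu2):
    """log P(x_1..x_25 | mu1, mu2) with P(w1)=1/3, P(w2)=2/3, sigma=1."""
    d1 = np.exp(-0.5 * (x - mu1) ** 2) / np.sqrt(2 * np.pi)
    d2 = np.exp(-0.5 * (x - mu2) ** 2) / np.sqrt(2 * np.pi)
    return np.log(d1 / 3 + 2 * d2 / 3).sum()

grid = np.linspace(-4, 4, 81)
surface = np.array([[loglik(m1, m2) for m2 in grid] for m1 in grid])
i, j = np.unravel_index(surface.argmax(), surface.shape)
print("max-likelihood grid point:", grid[i], grid[j])
```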
Copyright 2001, 2004, Andrew W. Moore Clustering with Gaussian Mixtures: Slide 25
Finding the max likelihood μ_1, μ_2, ..., μ_k
We can compute P(data | μ_1, μ_2, ..., μ_k).
How do we find the μ_i's which give max likelihood?
The normal max likelihood trick: set ∂/∂μ_i log Prob(...) = 0 and solve for the μ_i's.
Here you get non-linear, non-analytically-solvable equations.
Use gradient descent: slow but doable.
Or use a much faster, cuter, and recently very popular method...
Copyright 2001, 2004, Andrew W. Moore Clustering with Gaussian Mixtures: Slide 26
Expectation
Maximization
Copyright 2001, 2004, Andrew W. Moore Clustering with Gaussian Mixtures: Slide 27
The E.M. Algorithm
We'll get back to unsupervised learning soon.
But now we'll look at an even simpler case with hidden information.
The EM algorithm
Can do trivial things, such as the contents of the next few slides.
An excellent way of doing our unsupervised learning problem, as we'll see.
Many, many other uses, including inference of Hidden Markov Models (future lecture).
DETOUR
Copyright 2001, 2004, Andrew W. Moore Clustering with Gaussian Mixtures: Slide 28
Silly Example
Let events be grades in a class:
w_1 = Gets an A    P(A) = ½
w_2 = Gets a B     P(B) = μ
w_3 = Gets a C     P(C) = 2μ
w_4 = Gets a D     P(D) = ½ - 3μ
(Note: 0 ≤ μ ≤ 1/6)
Assume we want to estimate μ from data. In a given class there were
a A's
b B's
c C's
d D's
What's the maximum likelihood estimate of μ given a, b, c, d?
Copyright 2001, 2004, Andrew W. Moore Clustering with Gaussian Mixtures: Slide 29
Silly Example
Let events be grades in a class:
w_1 = Gets an A    P(A) = ½
w_2 = Gets a B     P(B) = μ
w_3 = Gets a C     P(C) = 2μ
w_4 = Gets a D     P(D) = ½ - 3μ
(Note: 0 ≤ μ ≤ 1/6)
Assume we want to estimate μ from data. In a given class there were
a A's
b B's
c C's
d D's
What's the maximum likelihood estimate of μ given a, b, c, d?
Copyright 2001, 2004, Andrew W. Moore Clustering with Gaussian Mixtures: Slide 30
Trivial Statistics
P(A) = ½   P(B) = μ   P(C) = 2μ   P(D) = ½ - 3μ

$$P(a,b,c,d \mid \mu) = K\,(\tfrac{1}{2})^a\,(\mu)^b\,(2\mu)^c\,(\tfrac{1}{2}-3\mu)^d$$

$$\log P(a,b,c,d \mid \mu) = \log K + a\log\tfrac{1}{2} + b\log\mu + c\log 2\mu + d\log(\tfrac{1}{2}-3\mu)$$

FOR MAX LIKE μ, SET ∂LogP/∂μ = 0:

$$\frac{\partial \log P}{\partial \mu} = \frac{b}{\mu} + \frac{2c}{2\mu} - \frac{3d}{\tfrac{1}{2}-3\mu} = 0$$

Gives max like μ = (b + c) / (6(b + c + d)).

So if class got A = 14, B = 6, C = 9, D = 10, max like μ = 1/10.

Boring, but true!
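A two-line numerical check of the closed-form estimate (class counts taken from the slide):

```python
# Max-likelihood estimate mu = (b + c) / (6 (b + c + d)) from the derivative above
a, b, c, d = 14, 6, 9, 10
mu = (b + c) / (6 * (b + c + d))
print(mu)                      # 0.1, i.e. the 1/10 quoted on the slide
# Sanity check: the class probabilities stay valid, 0 <= mu <= 1/6
assert 0 <= mu <= 1/6
```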
Copyright 2001, 2004, Andrew W. Moore Clustering with Gaussian Mixtures: Slide 31
Same Problem with Hidden Information
Someone tells us that
Number of High grades (A's + B's) = h
Number of C's = c
Number of D's = d
What is the max. like estimate of μ now?
REMEMBER: P(A) = ½, P(B) = μ, P(C) = 2μ, P(D) = ½ - 3μ
Copyright 2001, 2004, Andrew W. Moore Clustering with Gaussian Mixtures: Slide 32
Same Problem with Hidden Information
Someone tells us that
Number of High grades (A's + B's) = h
Number of C's = c
Number of D's = d
What is the max. like estimate of μ now?
We can answer this question circularly:

EXPECTATION: If we know the value of μ we could compute the expected value of a and b. Since the ratio a : b should be the same as the ratio ½ : μ,

$$a = \frac{\tfrac{1}{2}}{\tfrac{1}{2}+\mu}\,h \qquad b = \frac{\mu}{\tfrac{1}{2}+\mu}\,h$$

MAXIMIZATION: If we know the expected values of a and b we could compute the maximum likelihood value of μ:

$$\mu = \frac{b + c}{6\,(b + c + d)}$$

REMEMBER: P(A) = ½, P(B) = μ, P(C) = 2μ, P(D) = ½ - 3μ
Copyright 2001, 2004, Andrew W. Moore Clustering with Gaussian Mixtures: Slide 33
E.M. for our Trivial Problem
We begin with a guess for μ.
We iterate between EXPECTATION and MAXIMIZATION to improve our estimates of μ and of a and b.
Define μ(t) = the estimate of μ on the t'th iteration, b(t) = the estimate of b on the t'th iteration.
REMEMBER: P(A) = ½, P(B) = μ, P(C) = 2μ, P(D) = ½ - 3μ

μ(0) = initial guess

$$b(t) = \frac{\mu(t)\,h}{\tfrac{1}{2}+\mu(t)} = \mathbb{E}\big[b \mid \mu(t)\big] \qquad \text{(E-step)}$$

$$\mu(t+1) = \frac{b(t) + c}{6\,\big(b(t) + c + d\big)} = \text{max like est of } \mu \text{ given } b(t) \qquad \text{(M-step)}$$

Continue iterating until converged.
Good news: Converging to local optimum is assured.
Bad news: I said local optimum.
Copyright 2001, 2004, Andrew W. Moore Clustering with Gaussian Mixtures: Slide 34
E.M. Convergence
Convergence proof based on fact that Prob(data | μ) must increase or remain the same between each iteration [NOT OBVIOUS]
But it can never exceed 1 [OBVIOUS]
So it must therefore converge [OBVIOUS]

In our example, suppose we had h = 20, c = 10, d = 10, μ(0) = 0.

t    μ(t)     b(t)
0    0        0
1    0.0833   2.857
2    0.0937   3.158
3    0.0947   3.185
4    0.0948   3.187
5    0.0948   3.187
6    0.0948   3.187

Convergence is generally linear: error decreases by a constant factor each time step.
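The table is easy to reproduce. Here is a minimal sketch of the E-step/M-step loop for the grades problem with h = 20, c = 10, d = 10:

```python
# EM for the grades problem: E-step estimates b, M-step re-estimates mu
h, c, d = 20, 10, 10
mu = 0.0                                  # mu(0) = initial guess
for t in range(7):
    b = mu * h / (0.5 + mu)               # E-step: expected number of B's given mu(t)
    print(f"t={t}  mu={mu:.4f}  b={b:.3f}")
    mu = (b + c) / (6 * (b + c + d))      # M-step: max-likelihood mu given b(t)
```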
Copyright 2001, 2004, Andrew W. Moore Clustering with Gaussian Mixtures: Slide 35
Back to Unsupervised Learning of GMMs
Remember:
We have unlabeled data x_1, x_2, ..., x_R.
We know there are k classes.
We know P(w_1), P(w_2), ..., P(w_k).
We don't know μ_1, μ_2, ..., μ_k.
We can write P(data | μ_1, ..., μ_k):

$$
\begin{aligned}
p(x_1 \dots x_R \mid \mu_1 \dots \mu_k)
&= \prod_{i=1}^{R} p(x_i \mid \mu_1 \dots \mu_k) \\
&= \prod_{i=1}^{R} \sum_{j=1}^{k} p(x_i \mid w_j, \mu_1 \dots \mu_k)\, P(w_j) \\
&= \prod_{i=1}^{R} \sum_{j=1}^{k} K \exp\!\left( -\frac{1}{2\sigma^2}\,(x_i - \mu_j)^2 \right) P(w_j)
\end{aligned}
$$
Copyright 2001, 2004, Andrew W. Moore Clustering with Gaussian Mixtures: Slide 36
E.M. for GMMs
For Max likelihood we know

$$\frac{\partial}{\partial \mu_j} \log \mathrm{Prob}(\text{data} \mid \mu_1 \dots \mu_k) = 0$$

Some wild 'n' crazy algebra turns this into: "For Max likelihood, for each j,

$$\mu_j = \frac{\sum_{i=1}^{R} P(w_j \mid x_i, \mu_1 \dots \mu_k)\; x_i}{\sum_{i=1}^{R} P(w_j \mid x_i, \mu_1 \dots \mu_k)}$$

This is k nonlinear equations in the μ_j's."

If, for each x_i, we knew the prob that x_i was in class w_j, namely P(w_j | x_i, μ_1 ... μ_k), then we would easily compute μ_j.
If we knew each μ_j then we could easily compute P(w_j | x_i, μ_1 ... μ_k) for each w_j and x_i.
I feel an EM experience coming on!!
See http://www.cs.cmu.edu/~awm/doc/gmm-algebra.pdf
Copyright 2001, 2004, Andrew W. Moore Clustering with Gaussian Mixtures: Slide 37
E.M. for GMMs
Iterate. On the t'th iteration let our estimates be
λ_t = { μ_1(t), μ_2(t), ..., μ_c(t) }

E-step: Compute expected classes of all datapoints for each class

$$
P(w_i \mid x_k, \lambda_t)
= \frac{p(x_k \mid w_i, \lambda_t)\, P(w_i \mid \lambda_t)}{p(x_k \mid \lambda_t)}
= \frac{p\!\left(x_k \mid w_i, \mu_i(t), \sigma^2 I\right) p_i(t)}{\sum_{j=1}^{c} p\!\left(x_k \mid w_j, \mu_j(t), \sigma^2 I\right) p_j(t)}
$$

(Just evaluate a Gaussian at x_k.)

M-step: Compute Max. like μ given our data's class membership distributions

$$
\mu_i(t+1) = \frac{\sum_{k} P(w_i \mid x_k, \lambda_t)\; x_k}{\sum_{k} P(w_i \mid x_k, \lambda_t)}
$$
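A minimal numpy sketch of the E-step/M-step pair above, for the fixed-σ²I case (my illustration): mixture weights and σ are held fixed, only the means are re-estimated. All data and parameters are toy placeholders.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_means(X, mus, weights, sigma2, n_iter=50):
    """EM for a GMM with known weights and fixed spherical covariance sigma2*I:
       only the component means are estimated."""
    d = X.shape[1]
    mus = np.array(mus, dtype=float)
    for _ in range(n_iter):
        # E-step: responsibilities P(w_i | x_k, lambda_t), shape (n, c)
        dens = np.column_stack([
            multivariate_normal.pdf(X, mean=mu, cov=sigma2 * np.eye(d)) * w
            for mu, w in zip(mus, weights)])
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: mu_i(t+1) = responsibility-weighted mean of the datapoints
        mus = (resp.T @ X) / resp.sum(axis=0)[:, None]
    return mus

# Toy usage: two spherical clusters in 2-D
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([-2, 0], 1.0, (100, 2)), rng.normal([2, 1], 1.0, (100, 2))])
print(em_means(X, mus=[[0., -1.], [1., 2.]], weights=[0.5, 0.5], sigma2=1.0))
```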
Copyright 2001, 2004, Andrew W. Moore Clustering with Gaussian Mixtures: Slide 38
E.M.
Convergence
This algorithm is REALLY USED. And
in high dimensional state spaces, too.
E.G. Vector Quantization for Speech
Data
Your lecturer will
(unless out of
time) give you a
nice intuitive
explanation of
why this rule
works.
As with all EM
procedures,
convergence to a
local optimum
guaranteed.
Copyright 2001, 2004, Andrew W. Moore Clustering with Gaussian Mixtures: Slide 39
E.M. for General GMMs
Iterate. On the t'th iteration let our estimates be
λ_t = { μ_1(t), μ_2(t), ..., μ_c(t), Σ_1(t), Σ_2(t), ..., Σ_c(t), p_1(t), p_2(t), ..., p_c(t) }
(p_i(t) is shorthand for the estimate of P(ω_i) on the t'th iteration.)

E-step: Compute expected classes of all datapoints for each class

$$
P(w_i \mid x_k, \lambda_t)
= \frac{p(x_k \mid w_i, \lambda_t)\, P(w_i \mid \lambda_t)}{p(x_k \mid \lambda_t)}
= \frac{p\!\left(x_k \mid w_i, \mu_i(t), \Sigma_i(t)\right) p_i(t)}{\sum_{j=1}^{c} p\!\left(x_k \mid w_j, \mu_j(t), \Sigma_j(t)\right) p_j(t)}
$$

(Just evaluate a Gaussian at x_k.)

M-step: Compute Max. like μ given our data's class membership distributions

$$
\mu_i(t+1) = \frac{\sum_{k} P(w_i \mid x_k, \lambda_t)\; x_k}{\sum_{k} P(w_i \mid x_k, \lambda_t)}
\qquad
\Sigma_i(t+1) = \frac{\sum_{k} P(w_i \mid x_k, \lambda_t)\,\big[x_k - \mu_i(t+1)\big]\big[x_k - \mu_i(t+1)\big]^{T}}{\sum_{k} P(w_i \mid x_k, \lambda_t)}
$$

$$
p_i(t+1) = \frac{\sum_{k} P(w_i \mid x_k, \lambda_t)}{R} \qquad \text{where } R = \#\text{records}
$$
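Here is a compact numpy sketch of the full update above (my illustration): means, covariances, and weights all re-estimated each iteration. The toy data, random initialization, and the small ridge term added to the covariances are my additions, not part of the slides.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, c, n_iter=100, seed=0):
    """EM for a general GMM: estimates mu_i, Sigma_i, p_i for c components."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mus = X[rng.choice(n, c, replace=False)]          # init means at random datapoints
    covs = np.array([np.cov(X, rowvar=False)] * c)    # init with the global covariance
    ps = np.full(c, 1.0 / c)                          # init equal mixing weights
    for _ in range(n_iter):
        # E-step: responsibilities P(w_i | x_k, lambda_t)
        dens = np.column_stack([
            ps[i] * multivariate_normal.pdf(X, mean=mus[i], cov=covs[i])
            for i in range(c)])
        resp = dens / dens.sum(axis=1, keepdims=True)
        nk = resp.sum(axis=0)                         # effective counts per component
        # M-step: re-estimate means, covariances, and weights
        mus = (resp.T @ X) / nk[:, None]
        for i in range(c):
            diff = X - mus[i]
            covs[i] = (resp[:, i, None] * diff).T @ diff / nk[i] + 1e-6 * np.eye(d)
        ps = nk / n                                   # p_i(t+1) = sum_k resp / R
    return mus, covs, ps

# Toy usage
rng = np.random.default_rng(1)
X = np.vstack([rng.multivariate_normal([0, 0], [[1, .6], [.6, 1]], 150),
               rng.multivariate_normal([4, 3], [[1, 0], [0, .5]], 100)])
mus, covs, ps = em_gmm(X, c=2)
print(mus, ps)
```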
Copyright 2001, 2004, Andrew W. Moore Clustering with Gaussian Mixtures: Slide 40
Gaussian
Mixture
Example:
Start
Advance apologies: in Black
and White this example will be
incomprehensible
Copyright 2001, 2004, Andrew W. Moore Clustering with Gaussian Mixtures: Slide 41
After first
iteration
Copyright 2001, 2004, Andrew W. Moore Clustering with Gaussian Mixtures: Slide 42
After 2nd
iteration
Copyright 2001, 2004, Andrew W. Moore Clustering with Gaussian Mixtures: Slide 43
After 3rd
iteration
Copyright 2001, 2004, Andrew W. Moore Clustering with Gaussian Mixtures: Slide 44
After 4th
iteration
Copyright 2001, 2004, Andrew W. Moore Clustering with Gaussian Mixtures: Slide 45
After 5th
iteration
Copyright 2001, 2004, Andrew W. Moore Clustering with Gaussian Mixtures: Slide 46
After 6th
iteration
Copyright 2001, 2004, Andrew W. Moore Clustering with Gaussian Mixtures: Slide 47
After 20th
iteration
Copyright 2001, 2004, Andrew W. Moore Clustering with Gaussian Mixtures: Slide 48
Some Bio
Assay
data
Copyright 2001, 2004, Andrew W. Moore Clustering with Gaussian Mixtures: Slide 49
GMM
clustering
of the
assay data
Copyright 2001, 2004, Andrew W. Moore Clustering with Gaussian Mixtures: Slide 50
Resulting
Density
Estimator
Copyright 2001, 2004, Andrew W. Moore Clustering with Gaussian Mixtures: Slide 51
Where are we now?
[Diagram: four learning pipelines with the methods seen so far.]
Inputs → Classifier → Predict category: Dec Tree, Sigmoid Perceptron, Sigmoid N.Net, Gauss/Joint BC, Gauss Naïve BC, N.Neigh, Bayes Net Based BC, Cascade Correlation
Inputs → Density Estimator → Probability: Joint DE, Naïve DE, Gauss/Joint DE, Gauss Naïve DE, Bayes Net Structure Learning, GMMs
Inputs → Regressor → Predict real no.: Linear Regression, Polynomial Regression, Perceptron, Neural Net, N.Neigh, Kernel, LWR, RBFs, Robust Regression, Cascade Correlation, Regression Trees, GMDH, Multilinear Interp, MARS
Inputs → Inference Engine → Learn P(E_1 | E_2): Joint DE, Bayes Net Structure Learning
Copyright 2001, 2004, Andrew W. Moore Clustering with Gaussian Mixtures: Slide 52
The old trick
[Diagram: the same four pipelines, with GMM-BC added to the classifier list.]
Inputs → Classifier → Predict category: Dec Tree, Sigmoid Perceptron, Sigmoid N.Net, Gauss/Joint BC, Gauss Naïve BC, N.Neigh, Bayes Net Based BC, Cascade Correlation, GMM-BC
Inputs → Density Estimator → Probability: Joint DE, Naïve DE, Gauss/Joint DE, Gauss Naïve DE, Bayes Net Structure Learning, GMMs
Inputs → Regressor → Predict real no.: Linear Regression, Polynomial Regression, Perceptron, Neural Net, N.Neigh, Kernel, LWR, RBFs, Robust Regression, Cascade Correlation, Regression Trees, GMDH, Multilinear Interp, MARS
Inputs → Inference Engine → Learn P(E_1 | E_2): Joint DE, Bayes Net Structure Learning
Copyright 2001, 2004, Andrew W. Moore Clustering with Gaussian Mixtures: Slide 53
Three
classes of
assay
(each learned with
its own mixture
model)
(Sorry, this will again be
semi-useless in black and
white)
Copyright 2001, 2004, Andrew W. Moore Clustering with Gaussian Mixtures: Slide 54
Resulting
Bayes
Classifier
Copyright 2001, 2004, Andrew W. Moore Clustering with Gaussian Mixtures: Slide 55
Resulting Bayes
Classifier, using
posterior
probabilities to
alert about
ambiguity and
anomalousness
Yellow means
anomalous
Cyan means
ambiguous
Copyright 2001, 2004, Andrew W. Moore Clustering with Gaussian Mixtures: Slide 56
Unsupervised learning with symbolic
attributes
It's just a "learning Bayes net with known structure but hidden values" problem.
Can use Gradient Descent.
EASY, fun exercise to do an EM formulation for this case too (a sketch follows below).
[Figure: a small Bayes net over # KIDS, MARRIED, and NATION, with a hidden ("missing") value]
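As a hedged sketch of that exercise (my own formulation, not the slides'): a mixture of naive-Bayes components over categorical attributes, with the class value hidden, fits straight into the same E-step/M-step pattern. Attribute names and data below are toy placeholders.

```python
import numpy as np

def em_categorical_mixture(X, n_vals, c=2, n_iter=50, seed=0):
    """EM for a mixture of naive-Bayes components over categorical attributes.
       X: (n, d) integer codes; n_vals: number of values per attribute."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    ps = np.full(c, 1.0 / c)                                     # P(class)
    theta = [rng.dirichlet(np.ones(v), size=c) for v in n_vals]  # P(attr = val | class)
    for _ in range(n_iter):
        # E-step: P(class | record), using the naive-Bayes factorization
        logp = np.log(ps) + sum(np.log(theta[j][:, X[:, j]]).T for j in range(d))
        resp = np.exp(logp - logp.max(axis=1, keepdims=True))
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate class priors and per-attribute tables
        ps = resp.mean(axis=0)
        for j, v in enumerate(n_vals):
            counts = np.array([resp[X[:, j] == k].sum(axis=0) for k in range(v)]).T
            theta[j] = (counts + 1e-3) / (counts + 1e-3).sum(axis=1, keepdims=True)
    return ps, theta

# Toy usage: 3 binary attributes (e.g. # KIDS > 0, MARRIED, NATION code), hidden class
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(200, 3))
print(em_categorical_mixture(X, n_vals=[2, 2, 2])[0])
```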
Copyright 2001, 2004, Andrew W. Moore Clustering with Gaussian Mixtures: Slide 57
Final Comments
Remember, E.M. can get stuck in local minima, and empirically it DOES.
Our unsupervised learning example assumed the P(w_i)'s known, and variances fixed and known. Easy to relax this.
It's possible to do Bayesian unsupervised learning instead of max. likelihood.
There are other algorithms for unsupervised learning. We'll visit K-means soon. Hierarchical clustering is also interesting.
Neural-net algorithms called "competitive learning" turn out to have interesting parallels with the EM method we saw.
Copyright 2001, 2004, Andrew W. Moore Clustering with Gaussian Mixtures: Slide 58
What you should know
How to learn maximum likelihood parameters (locally max. like.) in the case of unlabeled data.
Be happy with this kind of probabilistic analysis.
Understand the two examples of E.M. given in these notes.
For more info, see Duda + Hart. It's a great book. There's much more in the book than in your handout.
Copyright 2001, 2004, Andrew W. Moore Clustering with Gaussian Mixtures: Slide 59
Other unsupervised learning
methods
K-means (see next lecture)
Hierarchical clustering (e.g. Minimum spanning
trees) (see next lecture)
Principal Component Analysis
simple, useful tool
Non-linear PCA
Neural Auto-Associators
Locally weighted PCA
Others
