Clustering With Gaussian Mixtures: Unsupervised Learning
Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.
Unsupervised Learning
You walk into a bar. A stranger approaches and tells you: "I've got data from k classes. Each class produces observations with a normal distribution and variance σ²I. Standard simple multivariate Gaussian assumptions. I can tell you all the P(wi)'s." So far, looks straightforward. "I need a maximum likelihood estimate of the μi's." No problem. "There's just one thing. None of the data are labeled. I have datapoints, but I don't know what class they're from (any of them!)" Uh oh!!
$$P(y = i \mid x) = \frac{p_i \exp\!\left(-\tfrac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)\right)}{(2\pi)^{m/2}\,\lVert\Sigma_i\rVert^{1/2}\, p(x)}$$
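To make the formula concrete, here is a minimal sketch (not from the slides) that evaluates this posterior with scipy; the two-class parameters at the bottom are made up for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

def posterior(x, mus, covs, priors):
    # Unnormalized joint p_i * N(x; mu_i, Sigma_i); dividing by the sum
    # implements the 1/p(x) normalizer.
    joint = np.array([pi * multivariate_normal.pdf(x, mean=mu, cov=cov)
                      for mu, cov, pi in zip(mus, covs, priors)])
    return joint / joint.sum()

# Two hypothetical classes in 2-D:
mus = [np.zeros(2), np.array([3.0, 3.0])]
covs = [np.eye(2), 2.0 * np.eye(2)]
priors = [1 / 3, 2 / 3]
print(posterior(np.array([1.0, 1.0]), mus, covs, priors))
```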
Each class-conditional density is a multivariate Gaussian; the variants differ in the form of the covariance matrix.

General (full) covariance:

$$\Sigma = \begin{pmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1m} \\ \sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{m1} & \sigma_{m2} & \cdots & \sigma_m^2 \end{pmatrix}$$

Axis-aligned (diagonal) covariance:

$$\Sigma = \begin{pmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_m^2 \end{pmatrix}$$

Spherical covariance:

$$\Sigma = \sigma^2 I = \begin{pmatrix} \sigma^2 & 0 & \cdots & 0 \\ 0 & \sigma^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2 \end{pmatrix}$$
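For illustration only (not part of the deck), a small numpy sketch constructing the three covariance types; the free-parameter counts (m(m+1)/2, m, and 1) are the standard ones.

```python
import numpy as np

m = 4                                  # input dimension (arbitrary choice)
A = np.random.randn(m, m)
full = A @ A.T                         # general: any symmetric PSD matrix,
                                       # m(m+1)/2 free parameters
aligned = np.diag(np.random.rand(m))   # axis-aligned: m free parameters
spherical = 0.7 * np.eye(m)            # sigma^2 I: 1 free parameter

print(m * (m + 1) // 2, m, 1)          # 10 4 1
```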
[Diagram: "Gauss DE", a Gaussian density estimator fed by the inputs]
Next... back to Density Estimation. What if we want to do density estimation with multimodal or clumpy data?
[Figure: clumpy data generated by three components with centers μ1, μ2, μ3]
Assume that each datapoint is generated according to the following recipe: 1. Pick a component at random. Choose component i with probability P(wi).
[Figure: a datapoint x drawn from component 2]
Assume that each datapoint is generated according to the following recipe: 1. Pick a component at random. Choose component i with probability P(wi). 2. Datapoint ~ N(μi, σ²I).
Assume that each datapoint is generated according to the following recipe: 1. Pick a component at random. Choose component i with probability P(wi). 2. Datapoint ~ N(μi, Σi).
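A minimal Python sketch of this two-step recipe; the priors, means, and covariances below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
priors = np.array([0.5, 0.3, 0.2])               # P(w1), P(w2), P(w3)
mus = np.array([[0, 0], [4, 4], [-4, 3]], dtype=float)
covs = np.array([np.eye(2), 0.5 * np.eye(2), np.eye(2)])

def sample(n):
    comps = rng.choice(len(priors), size=n, p=priors)    # step 1: pick component
    return np.array([rng.multivariate_normal(mus[c], covs[c])
                     for c in comps])                    # step 2: draw from N(mu_i, Sigma_i)

X = sample(500)
print(X.shape)  # (500, 2)
```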
[Figure: example datasets; clustering is sometimes easy, sometimes impossible, and sometimes in between]
Suppose the x's are 1-dimensional. There are two classes, w1 and w2, with P(w1) = 1/3, P(w2) = 2/3, and σ = 1. There are 25 unlabeled datapoints:

x1 = 0.608
x2 = -1.590
x3 = 0.235
x4 = 3.949
...
x25 = -0.712
Max likelihood = (μ1 = -2.13, μ2 = 1.668). A local minimum, but very close to the global one at (μ1 = 2.085, μ2 = -1.257)*. (* corresponds to switching w1 ↔ w2.)
They are close. Good. The 2nd solution tries to put the 2/3 hump where the 1/3 hump should go, and vice versa. In this example, unsupervised is almost as good as supervised. If the x1 .. x25 are given the class which was used to generate them, then the results are (μ1 = -2.176, μ2 = 1.684). Unsupervised got (μ1 = -2.13, μ2 = 1.668).
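A hedged sketch of the computation behind this example: since the slides list only a few of the 25 datapoints, we draw our own sample under the same assumptions (P(w1) = 1/3, P(w2) = 2/3, σ = 1) and grid-search the mixture log-likelihood over (μ1, μ2).

```python
import numpy as np

rng = np.random.default_rng(1)
labels = rng.choice(2, size=25, p=[1 / 3, 2 / 3])
x = np.where(labels == 0, rng.normal(-2.0, 1.0, 25), rng.normal(1.7, 1.0, 25))

def log_lik(mu1, mu2):
    # log prod_i [ P(w1) N(x_i; mu1, 1) + P(w2) N(x_i; mu2, 1) ]
    comp1 = (1 / 3) * np.exp(-0.5 * (x - mu1) ** 2)
    comp2 = (2 / 3) * np.exp(-0.5 * (x - mu2) ** 2)
    return np.sum(np.log((comp1 + comp2) / np.sqrt(2 * np.pi)))

# Brute-force search over a grid of (mu1, mu2) candidates:
grid = np.linspace(-4, 4, 161)
ll = np.array([[log_lik(m1, m2) for m2 in grid] for m1 in grid])
i, j = np.unravel_index(ll.argmax(), ll.shape)
print(grid[i], grid[j])   # maximum-likelihood (mu1, mu2) on the grid
```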
We can compute P(data | μ1, μ2, …, μk). How do we find the μi's which give the maximum likelihood?
The normal max likelihood trick: set

$$\frac{\partial}{\partial \mu_i} \log P(\text{data} \mid \mu_1 \ldots \mu_k) = 0$$

and solve for the μi's. Here you get non-linear, non-analytically-solvable equations. Use gradient descent: slow but doable. Or use a much faster, cuter, and recently very popular method...
Expectation Maximization
DETOUR
We'll get back to unsupervised learning soon. But now we'll look at an even simpler case with hidden information: the EM algorithm.
Can do trivial things, such as the contents of the next few slides. An excellent way of doing our unsupervised learning problem, as we'll see. Many, many other uses, including inference of Hidden Markov Models (future lecture).
Silly Example
Let events be grades in a class:
w1 = Gets an A: P(A) = ½
w2 = Gets a B: P(B) = μ
w3 = Gets a C: P(C) = 2μ
w4 = Gets a D: P(D) = ½ − 3μ
(Note 0 ≤ μ ≤ 1/6.)
Assume we want to estimate μ from data. In a given class there were a A's, b B's, c C's, d D's. What's the maximum likelihood estimate of μ given a, b, c, d?
Trivial Statistics
P(A) = ½, P(B) = μ, P(C) = 2μ, P(D) = ½ − 3μ.

$$P(a, b, c, d \mid \mu) = K \left(\tfrac{1}{2}\right)^a (\mu)^b (2\mu)^c \left(\tfrac{1}{2} - 3\mu\right)^d$$

$$\log P(a, b, c, d \mid \mu) = \log K + a \log \tfrac{1}{2} + b \log \mu + c \log 2\mu + d \log\left(\tfrac{1}{2} - 3\mu\right)$$
For max likelihood, set ∂ log P / ∂μ = 0:

$$\frac{\partial \log P}{\partial \mu} = \frac{b}{\mu} + \frac{c}{\mu} - \frac{3d}{\tfrac{1}{2} - 3\mu} = 0$$

Solving gives the maximum-likelihood estimate

$$\mu = \frac{b + c}{6(b + c + d)}$$

So if a class got 14 A's, 6 B's, 9 C's, and 10 D's, the max-likelihood estimate is μ = (6 + 9) / (6 · 25) = 1/10. Boring, but true!
Now suppose we are told only the number of high grades h = a + b, along with c and d; a and b themselves are hidden. We can estimate μ by iterating:

$$\mu(0) = \text{initial guess}$$

E-step: $$b(t) = \frac{\mu(t)\, h}{\tfrac{1}{2} + \mu(t)} = E[b \mid \mu(t)]$$

M-step: $$\mu(t+1) = \frac{b(t) + c}{6(b(t) + c + d)} = \text{max-likelihood estimate of } \mu \text{ given } b(t)$$
Continue iterating until converged. Good news: Converging to local optimum is assured. Bad news: I said local optimum.
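A few lines of Python implementing exactly this iteration, using the numbers from the convergence example below (h = 20, c = 10, d = 10, μ(0) = 0):

```python
def em_grades(h, c, d, mu=0.0, steps=6):
    for t in range(steps):
        b = mu * h / (0.5 + mu)           # E-step: b(t) = E[b | mu(t)]
        mu = (b + c) / (6 * (b + c + d))  # M-step: max-like mu given b(t)
        print(f"t={t + 1}: b={b:.4f}, mu={mu:.4f}")
    return mu

em_grades(h=20, c=10, d=10)   # converges toward mu ~ 0.0948
```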
E.M. Convergence
Convergence proof is based on the fact that Prob(data | μ) must increase or remain the same between each iteration [NOT OBVIOUS]. But it can never exceed 1 [OBVIOUS]. So it must therefore converge [OBVIOUS].
In our example, suppose we had h = 20, c = 10, d = 10, μ(0) = 0. Convergence is generally linear: the error decreases by a constant factor each time step.
[Table: iterates μ(t) and b(t) for t = 0 … 6, starting from μ(0) = 0 and converging toward μ ≈ 0.0948]
For Gaussian mixtures (spherical Gaussians, known shared σ², known priors), the likelihood of the data is

$$P(\text{data} \mid \mu_1 \ldots \mu_k) = \prod_{i=1}^{R} p(x_i \mid \mu_1 \ldots \mu_k) = \prod_{i=1}^{R} \sum_{j=1}^{k} p(x_i \mid w_j, \mu_1 \ldots \mu_k)\, P(w_j) = \prod_{i=1}^{R} \sum_{j=1}^{k} K \exp\!\left(-\frac{1}{2\sigma^2}(x_i - \mu_j)^2\right) P(w_j)$$

Some wild 'n' crazy algebra turns this into: for max likelihood, for each j,

$$\mu_j = \frac{\sum_{i=1}^{R} P(w_j \mid x_i, \mu_1 \ldots \mu_k)\, x_i}{\sum_{i=1}^{R} P(w_j \mid x_i, \mu_1 \ldots \mu_k)}$$

See http://www.cs.cmu.edu/~awm/doc/gmm-algebra.pdf
Iterate. On the t'th iteration, let our estimates be λt = { μ1(t), μ2(t), …, μc(t) }.

E-step: compute the "expected" classes of all datapoints for each class:

$$P(w_i \mid x_k, \lambda_t) = \frac{p\!\left(x_k \mid w_i, \mu_i(t), \sigma^2 I\right) p_i(t)}{\sum_{j=1}^{c} p\!\left(x_k \mid w_j, \mu_j(t), \sigma^2 I\right) p_j(t)}$$

M-step: compute the max-likelihood μ given our data's class membership distributions:

$$\mu_i(t+1) = \frac{\sum_k P(w_i \mid x_k, \lambda_t)\, x_k}{\sum_k P(w_i \mid x_k, \lambda_t)}$$
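A minimal numpy sketch of this simple-case E.M. loop (spherical Gaussians, shared known σ², known priors; only the means are updated). The data and initialization below are made up; the shared normalizing constant of the Gaussians cancels in the E-step.

```python
import numpy as np

def em_spherical(X, mus, priors, sigma2, iters=50):
    for _ in range(iters):
        # E-step: responsibilities P(w_i | x_k) for every datapoint and class.
        d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)   # (R, k)
        resp = priors * np.exp(-0.5 * d2 / sigma2)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: mu_i(t+1) = sum_k P(w_i|x_k) x_k / sum_k P(w_i|x_k)
        mus = (resp.T @ X) / resp.sum(axis=0)[:, None]
    return mus

# Usage with made-up 2-D data from two clumps:
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
print(em_spherical(X, mus=rng.normal(size=(2, 2)),
                   priors=np.array([0.5, 0.5]), sigma2=1.0))
```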
E.M. Convergence
Your lecturer will (unless out of time) give you a nice intuitive explanation of why this rule works. As with all EM procedures, convergence to a local optimum is guaranteed.
This algorithm is REALLY USED, and in high-dimensional state spaces too, e.g., Vector Quantization for Speech Data.
E.M. for general GMMs: now the covariances Σi and the priors pi(t) (pi(t) is shorthand for the estimate of P(wi) on the t'th iteration) are learned as well.

E-step: compute the "expected" classes of all datapoints for each class:

$$P(w_i \mid x_k, \lambda_t) = \frac{p\!\left(x_k \mid w_i, \mu_i(t), \Sigma_i(t)\right) p_i(t)}{\sum_{j=1}^{c} p\!\left(x_k \mid w_j, \mu_j(t), \Sigma_j(t)\right) p_j(t)}$$

M-step: compute the max-likelihood parameters given the data's class membership distributions:

$$\mu_i(t+1) = \frac{\sum_k P(w_i \mid x_k, \lambda_t)\, x_k}{\sum_k P(w_i \mid x_k, \lambda_t)}$$

$$\Sigma_i(t+1) = \frac{\sum_k P(w_i \mid x_k, \lambda_t)\, \left(x_k - \mu_i(t+1)\right)\left(x_k - \mu_i(t+1)\right)^T}{\sum_k P(w_i \mid x_k, \lambda_t)}$$

$$p_i(t+1) = \frac{\sum_k P(w_i \mid x_k, \lambda_t)}{R}, \qquad R = \#\text{records}$$
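These general updates are what off-the-shelf implementations run. A usage sketch with scikit-learn's GaussianMixture (an illustration on made-up data, not part of the slides):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
X = np.vstack([rng.multivariate_normal([0, 0], np.eye(2), 100),
               rng.multivariate_normal([4, 4], [[2, 1], [1, 2]], 100)])

# EM with full covariances, learned means, and learned mixture weights:
gmm = GaussianMixture(n_components=2, covariance_type="full").fit(X)
print(gmm.weights_)      # p_i estimates
print(gmm.means_)        # mu_i estimates
print(gmm.covariances_)  # Sigma_i estimates
```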
Resulting Bayes Classifier, using posterior probabilities to alert about ambiguity and anomalousness
Yellow means anomalous; cyan means ambiguous.
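One plausible way to implement such alerts (the thresholds here are assumptions, not from the slides): flag a point as anomalous when its mixture density is low, and as ambiguous when no class posterior dominates.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(6, 1, (200, 2))])
gmm = GaussianMixture(n_components=2).fit(X)

def flag(points, gmm, density_thresh=-12.0, margin_thresh=0.6):
    anomalous = gmm.score_samples(points) < density_thresh        # low log p(x)
    ambiguous = gmm.predict_proba(points).max(axis=1) < margin_thresh
    return anomalous, ambiguous

# A point between the clusters (ambiguous) and a far-away one (anomalous):
print(flag(np.array([[3.0, 3.0], [20.0, -20.0]]), gmm))
```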
[Figure: Bayes net with a hidden class node and symbolic attributes, including a "Nation" node]
It's just a "learning a Bayes net with known structure but hidden values" problem. Can use gradient descent. It's an EASY, fun exercise to do an EM formulation for this case too.
Final Comments

Remember, E.M. can get stuck in local minima, and empirically it DOES.
Our unsupervised learning example assumed the P(wi)'s were known, and the variances fixed and known. It is easy to relax this.
It's possible to do Bayesian unsupervised learning instead of max likelihood.
There are other algorithms for unsupervised learning: we'll visit K-means soon, hierarchical clustering is also interesting, and neural-net algorithms called "competitive learning" turn out to have interesting parallels with the EM method we saw.
Non-linear PCA
Neural auto-associators
Locally weighted PCA
Others