
Gaussian Mixture Models and Expectation-Maximization (EM)

(and K-Means too!)

http://research.microsoft.com/~cmbishop/talks.htm

http://www.autonlab.org/tutorials/gmm.html

Maximum Likelihood for a Single Gaussian

• Multivariate Gaussian: $\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}} \exp\left\{-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right\}$, with mean $\mu$ and covariance $\Sigma$
• Data set $X = \{x_1, \ldots, x_N\}$ gives the likelihood function $p(X \mid \mu, \Sigma) = \prod_{n=1}^{N} \mathcal{N}(x_n \mid \mu, \Sigma)$
• Set the parameters by maximizing the likelihood function; equivalently, maximize the log likelihood $\ln p(X \mid \mu, \Sigma) = \sum_{n=1}^{N} \ln \mathcal{N}(x_n \mid \mu, \Sigma)$
• Maximizing w.r.t. the mean gives the sample mean $\mu_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N} x_n$; maximizing w.r.t. the covariance gives $\Sigma_{\mathrm{ML}} = \frac{1}{N}\sum_{n=1}^{N} (x_n - \mu_{\mathrm{ML}})(x_n - \mu_{\mathrm{ML}})^T$
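As an illustrative sketch (not part of the slides), the ML estimates above take only a few lines of NumPy; the data set and the names `mu_ml` and `sigma_ml` are my own:

```python
import numpy as np

# Hypothetical 2-D data set of N points (rows are observations).
X = np.array([[1.0, 2.0], [3.0, 0.0], [2.0, 4.0], [0.0, 2.0]])
N = X.shape[0]

# ML estimate of the mean: the sample mean.
mu_ml = X.mean(axis=0)

# ML estimate of the covariance: average outer product of the
# centered points (note the 1/N factor, not 1/(N-1)).
centered = X - mu_ml
sigma_ml = centered.T @ centered / N

print(mu_ml)      # sample mean
print(sigma_ml)   # biased (ML) covariance estimate
```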


Bias of Maximum Likelihood

• Consider the expectations of the maximum likelihood estimates under the Gaussian distribution: $\mathbb{E}[\mu_{\mathrm{ML}}] = \mu$, but $\mathbb{E}[\Sigma_{\mathrm{ML}}] = \frac{N-1}{N}\Sigma$, so the ML estimate underestimates the covariance
• This is an example of over-fitting
• Clearly we can remove the bias by using $\widetilde{\Sigma} = \frac{N}{N-1}\Sigma_{\mathrm{ML}}$, since this gives $\mathbb{E}[\widetilde{\Sigma}] = \Sigma$
• The correction arises naturally in a Bayesian treatment
• For an infinite data set the two expressions are equal

Intuitive Explanation of Over-fitting

• Gaussians are well understood and easy to estimate, but a single Gaussian is unable to represent inherently multimodal datasets
• Fitting a single Gaussian to a multimodal dataset is likely to give a mean value in an area with low probability, and to overestimate the covariance

Bishop, 2003
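A quick numerical check (my own example, not from the slides) of the N/(N−1) correction, using NumPy's `ddof` parameter:

```python
import numpy as np

# Hypothetical 1-D sample; any data works for the identity below.
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
N = len(x)

# Biased (ML) estimate divides by N; the unbiased one divides by N - 1.
var_ml = np.var(x, ddof=0)        # maximum-likelihood variance
var_unbiased = np.var(x, ddof=1)  # bias-corrected variance

# Multiplying the ML estimate by N/(N-1) gives the unbiased estimate.
corrected = var_ml * N / (N - 1)
print(var_ml, var_unbiased, corrected)
```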

[Figure: example data set, with axes "Time between eruptions (minutes)" and "some other axis"; the data are clearly multimodal.]


Idea: Use a Mixture of Gaussians

• Linear super-position of Gaussians: $p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$
• To generate a data point:
  – first pick one of the components with probability $\pi_k$
  – then draw a sample from that component
• Repeat these two steps for each new data point

Example: Mixture of 3 Gaussians

• There are k components. The i'th component is called $\omega_i$
• Component i has an associated mean vector $\mu_i$
• Each component generates data from a Gaussian with mean $\mu_i$ and covariance matrix $\sigma^2 I$
• Assume that each datapoint is generated according to the following recipe:
  1. Pick a component at random. Choose component i with probability $P(\omega_i)$.


Sampling from a GMM

• The i'th component is called $\omega_i$
• Component i has an associated mean vector $\mu_i$
• Each component generates data from a Gaussian with mean $\mu_i$ and covariance matrix $\sigma^2 I$
• Assume that each datapoint is generated according to the following recipe:
  1. Pick a component at random. Choose component i with probability $P(\omega_i)$.
  2. Datapoint $\sim \mathcal{N}(\mu_i, \sigma^2 I)$

Sampling from a General GMM

• Same recipe, but component i now has its own covariance matrix $\Sigma_i$:
  1. Pick a component at random. Choose component i with probability $P(\omega_i)$.
  2. Datapoint $\sim \mathcal{N}(\mu_i, \Sigma_i)$

Copyright © 2001, Andrew W. Moore
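The two-step sampling recipe can be sketched in NumPy; this is a minimal illustration, and the particular two-component mixture below is my own invented example:

```python
import numpy as np

rng = np.random.default_rng(0)

# A hypothetical 2-component, 2-D mixture: weights, means, covariances.
weights = np.array([0.3, 0.7])
means = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]
covs = [np.eye(2), np.array([[1.0, 0.5], [0.5, 2.0]])]

def sample_gmm(n):
    """Draw n points: pick a component by its weight, then sample it."""
    points = np.empty((n, 2))
    for row in range(n):
        i = rng.choice(len(weights), p=weights)  # step 1: pick a component
        points[row] = rng.multivariate_normal(means[i], covs[i])  # step 2
    return points

X = sample_gmm(500)
print(X.shape)  # (500, 2)
```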

• A GMM gives a continuous estimate of the density, so we can estimate a value at any point
• We could plot constant-probability contours if we wanted to

Maximum Likelihood for the GMM

• We wish to invert this process: given the data set, find the corresponding parameters:
  – mixing coefficients
  – means
  – covariances
• The log likelihood function takes the form $\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\}$
• Note: the sum over components appears inside the log, and there is no closed form solution for maximum likelihood
• If we knew which component generated each data point, the maximum likelihood solution would involve fitting each component to the corresponding cluster
• Problem: the data set is unlabelled
• We shall refer to the labels as latent (= hidden) variables
• However, with labeled data, the story is different


Labeled vs Unlabeled Data

• Labeled: we know which data points are "owned" by which cluster; easy to estimate params (do each color separately)
• Unlabeled: we must label data points by which cluster they come from (i.e. determine ownership or membership); hard to estimate params (we need to assign colors)

Bishop, 2003

Side-Trip: Clustering using K-means

Some Data

This could easily be modeled by a Gaussian Mixture (with 5 components). But let's look at a satisfying, friendly and infinitely popular alternative…

Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.

Andrew W. Moore
Associate Professor
School of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~awm
awm@cs.cmu.edu
412-268-7599

K-means

1. Ask user how many clusters they'd like. (e.g. k=5)
2. Randomly guess k cluster Center locations.
3. Each datapoint finds out which Center it's closest to. (Thus each Center "owns" a set of datapoints.)
4. Each Center finds the centroid of the points it owns…
5. …and jumps there.
6. …Repeat until terminated!

Example generated by Dan Pelleg's super-duper fast K-means system: Dan Pelleg and Andrew Moore. Accelerating Exact k-means Algorithms with Geometric Reasoning. Proc. Conference on Knowledge Discovery in Databases 1999 (KDD99) (available on www.autonlab.org/pap.html)
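The six steps above can be sketched directly in NumPy. This is a minimal illustration (not Pelleg's accelerated system); the function, variable names, and the toy data set are my own:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain K-means following the six steps above."""
    rng = np.random.default_rng(seed)
    # Step 2: randomly guess k cluster Center locations (k distinct points).
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: each datapoint finds out which Center it's closest to.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        owner = dists.argmin(axis=1)
        # Steps 4-5: each Center jumps to the centroid of the points it
        # owns (an empty Center simply stays put).
        new_centers = np.array(
            [X[owner == j].mean(axis=0) if (owner == j).any() else centers[j]
             for j in range(k)])
        # Step 6: repeat until neither step changes the configuration.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, owner

# Two invented, well-separated blobs of four points each.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)
centers, owner = kmeans(X, 2)
print(centers)
```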

K-means continues… (the assignment and re-centering steps repeat over several iterations)

K-means terminates

Questions

• What is it trying to optimize?
• Are we sure it will terminate?
• Are we sure it will find an optimal clustering?
• How should we start it?
• How could we automatically choose the number of centers?

…we'll deal with these questions over the next few slides

Distortion

Given…
• an encoder function: ENCODE : $\Re^m \to [1..k]$
• a decoder function: DECODE : $[1..k] \to \Re^m$

Define…
$$\text{Distortion} = \sum_{i=1}^{R} \big( x_i - \text{DECODE}[\text{ENCODE}(x_i)] \big)^2$$

We restrict the decoder to output one of k centers, $\text{DECODE}[j] = c_j$, so
$$\text{Distortion} = \sum_{i=1}^{R} \big( x_i - c_{\text{ENCODE}(x_i)} \big)^2$$

What properties must centers $c_1, c_2, \ldots, c_k$ have when distortion is minimized?
(1) $x_i$ must be encoded by its nearest center, $c_{\text{ENCODE}(x_i)} = \arg\min_{c_j \in \{c_1, c_2, \ldots, c_k\}} (x_i - c_j)^2$ …why?
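The distortion of a given encoding can be computed directly. A small sketch (the function name and tiny example are mine):

```python
import numpy as np

def distortion(X, centers, owner):
    """Sum of squared distances from each point to the center that
    encodes it: sum_i (x_i - c_ENCODE(x_i))^2."""
    return float(((X - centers[owner]) ** 2).sum())

# Tiny worked example: two 1-D points, both encoded by one center at 1.0.
X = np.array([[0.0], [2.0]])
centers = np.array([[1.0]])
owner = np.array([0, 0])
print(distortion(X, centers, owner))  # (0-1)^2 + (2-1)^2 = 2.0
```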


The Minimal Distortion (1)

$$\text{Distortion} = \sum_{i=1}^{R} \big( x_i - c_{\text{ENCODE}(x_i)} \big)^2$$

What properties must centers $c_1, c_2, \ldots, c_k$ have when distortion is minimized?
(1) $x_i$ must be encoded by its nearest center, $c_{\text{ENCODE}(x_i)} = \arg\min_{c_j \in \{c_1, \ldots, c_k\}} (x_i - c_j)^2$. …why? Otherwise distortion could be reduced by replacing $\text{ENCODE}[x_i]$ by the nearest center.

The Minimal Distortion (2)

(2) The partial derivative of Distortion with respect to each center location must be zero.

(2) The partial derivative of Distortion with respect to each center location must be zero.

$$\text{Distortion} = \sum_{i=1}^{R} \big( x_i - c_{\text{ENCODE}(x_i)} \big)^2 = \sum_{j=1}^{k} \sum_{i \in \text{OwnedBy}(c_j)} (x_i - c_j)^2$$

where $\text{OwnedBy}(c_j)$ = the set of records owned by Center $c_j$.

$$\frac{\partial\,\text{Distortion}}{\partial c_j} = \frac{\partial}{\partial c_j} \sum_{i \in \text{OwnedBy}(c_j)} (x_i - c_j)^2 = -2 \sum_{i \in \text{OwnedBy}(c_j)} (x_i - c_j) = 0 \text{ at a minimum}$$

Thus, at a minimum:
$$c_j = \frac{1}{|\text{OwnedBy}(c_j)|} \sum_{i \in \text{OwnedBy}(c_j)} x_i$$

At the minimum:

What properties must centers $c_1, c_2, \ldots, c_k$ have when distortion is minimized?
(1) $x_i$ must be encoded by its nearest center.
(2) Each Center must be at the centroid of the points it owns.

Improving a suboptimal configuration:

What can be changed for centers $c_1, c_2, \ldots, c_k$ when distortion is not minimized?
(1) Change the encoding so that $x_i$ is encoded by its nearest center.
(2) Set each Center to the centroid of the points it owns.

There's no point applying either operation twice in succession. But it can be profitable to alternate. …And that's K-means!

It is easy to prove this procedure will terminate in a state at which neither (1) nor (2) changes the configuration. Why?

9

Improving a suboptimal configuration…

$$\text{Distortion} = \sum_{i=1}^{R} \big( x_i - c_{\text{ENCODE}(x_i)} \big)^2$$

What can be changed for centers $c_1, c_2, \ldots, c_k$ when distortion is not minimized?
(1) Change the encoding so that $x_i$ is encoded by its nearest center.
(2) Set each Center to the centroid of the points it owns.

There's no point applying either operation twice in succession. But it can be profitable to alternate. …And that's K-means!

It is easy to prove this procedure will terminate in a state at which neither (1) nor (2) changes the configuration. Why? There are only a finite number of ways of partitioning R records into k groups, so there are only a finite number of possible configurations in which all Centers are the centroids of the points they own. If the configuration changes on an iteration, it must have improved the distortion. So each time the configuration changes, it must go to a configuration it's never been to before. So if it tried to go on forever, it would eventually run out of configurations.

Will we find the optimal configuration?

• Not necessarily.
• Can you invent a configuration that has converged, but does not have the minimum distortion?

Will we find the optimal configuration?

• Not necessarily.
• Can you invent a configuration that has converged, but does not have the minimum distortion? (Hint: try a fiendish k=3 configuration here…)
• Idea 1: Be careful about where you start.
• Idea 2: Do many runs of k-means, each from a different random start configuration.
• Many other ideas floating around.

Neat trick:
Place the first center on top of a randomly chosen datapoint.
Place the second center on the datapoint that's as far away as possible from the first center.
⋮
Place the j'th center on the datapoint that's as far away as possible from the closest of Centers 1 through j−1.
⋮
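The "neat trick" (farthest-point initialization) can be sketched as follows. This is my own minimal rendering of the rule stated above; the example data are invented:

```python
import numpy as np

def farthest_point_init(X, k, seed=0):
    """Place the first center on a random datapoint; each subsequent
    center goes on the datapoint farthest from its closest chosen center."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # For every point, distance to the closest already-chosen center.
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    return np.array(centers)

# Invented 1-D example (as column vectors): points at 0, 1, 10 and 20.
X = np.array([[0.0], [1.0], [10.0], [20.0]])
C = farthest_point_init(X, 3)
print(C)
```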


Common uses of K-means

• Often used as an exploratory data analysis tool
• In one dimension, a good way to quantize real-valued variables into k non-uniform buckets
• Used on acoustic data in speech understanding to convert waveforms into one of k categories (known as Vector Quantization)
• Also used for choosing color palettes on old-fashioned graphical display devices!
• Used to initialize clusters for the EM algorithm!!!

Back to Estimating GMMs

• The log likelihood function takes the form $\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\}$
• There is no closed form solution for maximum likelihood
• However, with labeled data, the story is different

• Binary latent variables $z_{nk}$ describe which component generated each data point
• If we knew the values of the latent variables, we would maximize the complete-data log likelihood $\ln p(X, Z \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \ln \big\{ \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \big\}$, which gives a trivial closed-form solution (fit each component to the corresponding set of data points)
• We don't know the values of the latent variables
• However, for given parameter values we can compute the expected values of the latent variables

• To take expectations over the latent variables z given an observation x, we need to know their distribution
• Recall: $p(x, z) = p(z)\, p(x \mid z)$

Bishop, 2003

WE AGAIN ARE GOING TO THE BOARD!!


Expected Complete-Data Log Likelihood

• Suppose we make a guess for the parameter values (means, covariances and mixing coefficients)
• Use these to evaluate the responsibilities (ownership weights) $\gamma(z_{nk})$
• Consider the expected complete-data log likelihood
$$\mathbb{E}_Z\big[\ln p(X, Z \mid \pi, \mu, \Sigma)\big] = \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma(z_{nk}) \ln \big\{ \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \big\}$$
• We are implicitly 'filling in' the latent variables with our best guess
• Keeping the responsibilities fixed and maximizing with respect to the parameters gives the previous results for the means, covariances and mixing probabilities
• To summarize what we just did, we replaced $z_{nk}$, an unknown discrete value (0 or 1), with $\gamma(z_{nk})$, a known continuous value between 0 and 1

• We can think of the mixing coefficients as prior probabilities for the components
• For a given value of $x$ we can evaluate the corresponding posterior probabilities, called responsibilities (ownership weights)
• These are given from Bayes' theorem by
$$\gamma_k \equiv p(k \mid x) = \frac{\pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x \mid \mu_j, \Sigma_j)}$$
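The E and M steps just described can be sketched in NumPy. This is my own minimal illustration, not code from the slides: `e_step` implements the Bayes-rule responsibilities above, `m_step` computes the weighted sample means, weighted sample covariances, and mixing weights, and the two-blob data set is invented:

```python
import numpy as np

def gaussian_pdf(X, mu, cov):
    """Multivariate normal density evaluated at each row of X."""
    d = X.shape[1]
    diff = X - mu
    inv = np.linalg.inv(cov)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * np.einsum('ni,ij,nj->n', diff, inv, diff)) / norm

def e_step(X, weights, means, covs):
    """Responsibilities: posterior probability of each component (Bayes)."""
    r = np.column_stack([w * gaussian_pdf(X, m, c)
                         for w, m, c in zip(weights, means, covs)])
    return r / r.sum(axis=1, keepdims=True)

def m_step(X, r):
    """Weighted ML estimates: mixing weights, means, covariances."""
    nk = r.sum(axis=0)
    weights = nk / len(X)
    means = [r[:, k] @ X / nk[k] for k in range(r.shape[1])]
    covs = []
    for k in range(r.shape[1]):
        diff = X - means[k]
        covs.append((r[:, k, None] * diff).T @ diff / nk[k])
    return weights, means, covs

# Two invented, well-separated blobs; init near each blob.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)
weights = np.array([0.5, 0.5])
means = [np.zeros(2), np.full(2, 10.0)]
covs = [np.eye(2), np.eye(2)]
for _ in range(10):           # alternate E and M steps
    r = e_step(X, weights, means, covs)
    weights, means, covs = m_step(X, r)
print(np.round(means, 2))
```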


[Figures: the GMM fit after the first iteration, after the 2nd iteration, and after the 20th iteration of EM.]

Homework

4) Change your code for generating N points from a single multivariate Gaussian to instead generate N points from a mixture of Gaussians. Assume K Gaussians, each of which is specified by a mixing parameter 0 <= p_i <= 1, a 2x1 mean vector mu_i, and a 2x2 covariance matrix C_i.

5) Write code to perform the K-means algorithm, given a set of N data points and a number K of desired clusters. You can either start the algorithm with random cluster centers, or else try something smarter that you can think up.

6) Write code to perform EM for a mixture of Gaussians. Much of it will be code that looks a lot like the code you wrote last time to estimate MLE parameters of a multivariate Gaussian, except now you are computing weighted sample means, weighted sample covariances, and the mixing parameters, for each of K Gaussian components [this is the "M" step of EM]. Also, you will estimate the ownership weights for each point, in the "E" step, to determine those weights. Starting with N sample points and a number K of desired Gaussian components, use EM to estimate the K mixing weights p_i, K 2x1 vectors mu_i, and K 2x2 covariance matrices C_i. To initialize, you could use a random start (say a random selection of K mean vectors, identity matrices for each covariance, and 1/K for each mixing weight p_i). Or, you could first run k-means to find a more appropriate set of cluster centers to use as initial mean vectors.

