Lecture notes

Andrew Ng

In this set of notes, we discuss the EM (Expectation-Maximization) algorithm for density estimation.

Suppose that we are given a training set $\{x^{(1)}, \ldots, x^{(m)}\}$ as usual. Since

we are in the unsupervised learning setting, these points do not come with

any labels.

We wish to model the data by specifying a joint distribution $p(x^{(i)}, z^{(i)}) = p(x^{(i)} \mid z^{(i)})\, p(z^{(i)})$. Here, $z^{(i)} \sim \mathrm{Multinomial}(\phi)$ (where $\phi_j \geq 0$, $\sum_{j=1}^k \phi_j = 1$, and the parameter $\phi_j$ gives $p(z^{(i)} = j)$), and $x^{(i)} \mid z^{(i)} = j \sim \mathcal{N}(\mu_j, \Sigma_j)$. We let $k$ denote the number of values that the $z^{(i)}$'s can take on. Thus, our model posits that each $x^{(i)}$ was generated by randomly choosing $z^{(i)}$ from $\{1, \ldots, k\}$, and then $x^{(i)}$ was drawn from one of $k$ Gaussians depending on $z^{(i)}$. This is called the mixture of Gaussians model. Also, note that the $z^{(i)}$'s are latent random variables, meaning that they're hidden/unobserved. This is what will make our estimation problem difficult.
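To make the generative story concrete, here is a minimal sketch of sampling from such a mixture in NumPy; the particular values of $k$, $\phi$, $\mu_j$ and $\Sigma_j$ are illustrative assumptions, not anything fixed by the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters for k = 2 Gaussians in R^2 (assumed values,
# not from the notes): phi holds the mixing proportions p(z = j).
phi = np.array([0.3, 0.7])
mu = np.array([[0.0, 0.0], [3.0, 3.0]])          # means mu_j
Sigma = np.array([np.eye(2), 0.5 * np.eye(2)])   # covariances Sigma_j

m = 500
# Generative story: first draw z^(i) ~ Multinomial(phi), then
# x^(i) | z^(i) = j ~ N(mu_j, Sigma_j).
z = rng.choice(len(phi), size=m, p=phi)
x = np.array([rng.multivariate_normal(mu[j], Sigma[j]) for j in z])
```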

The parameters of our model are thus $\phi$, $\mu$ and $\Sigma$. To estimate them, we can write down the likelihood of our data:

$$\ell(\phi, \mu, \Sigma) = \sum_{i=1}^m \log p(x^{(i)}; \phi, \mu, \Sigma) = \sum_{i=1}^m \log \sum_{z^{(i)}=1}^k p(x^{(i)} \mid z^{(i)}; \mu, \Sigma)\, p(z^{(i)}; \phi).$$

However, if we set to zero the derivatives of this formula with respect to the parameters and try to solve, we'll find that it is not possible to find the maximum likelihood estimates of the parameters in closed form. (Try this yourself at home.)
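For concreteness, here is a sketch of evaluating this log-likelihood numerically (the function name and array layout are my own; SciPy supplies the Gaussian density). Note how the log sits outside the inner sum over $z^{(i)}$, coupling all the parameters together; this is exactly what blocks a closed-form solution.

```python
import numpy as np
from scipy.stats import multivariate_normal

def log_likelihood(x, phi, mu, Sigma):
    """l(phi, mu, Sigma) = sum_i log sum_j p(x^(i) | z^(i)=j) p(z^(i)=j)."""
    k = len(phi)
    # joint[i, j] = phi_j * N(x^(i); mu_j, Sigma_j)
    joint = np.column_stack([
        phi[j] * multivariate_normal.pdf(x, mean=mu[j], cov=Sigma[j])
        for j in range(k)
    ])
    # The log of the inner sum over j prevents the terms from separating.
    return np.sum(np.log(joint.sum(axis=1)))
```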

The random variables $z^{(i)}$ indicate which of the $k$ Gaussians each $x^{(i)}$ had come from. Note that if we knew what the $z^{(i)}$'s were, the maximum likelihood problem would have been easy. Specifically, we could then write down the likelihood as

$$\ell(\phi, \mu, \Sigma) = \sum_{i=1}^m \log p(x^{(i)} \mid z^{(i)}; \mu, \Sigma) + \log p(z^{(i)}; \phi).$$

Maximizing this with respect to $\phi$, $\mu$ and $\Sigma$ gives the parameters:

$$\phi_j = \frac{1}{m} \sum_{i=1}^m 1\{z^{(i)} = j\},$$

$$\mu_j = \frac{\sum_{i=1}^m 1\{z^{(i)} = j\}\, x^{(i)}}{\sum_{i=1}^m 1\{z^{(i)} = j\}},$$

$$\Sigma_j = \frac{\sum_{i=1}^m 1\{z^{(i)} = j\}\, (x^{(i)} - \mu_j)(x^{(i)} - \mu_j)^T}{\sum_{i=1}^m 1\{z^{(i)} = j\}}.$$

Indeed, we see that if the $z^{(i)}$'s were known, then maximum likelihood estimation becomes nearly identical to what we had when estimating the parameters of the Gaussian discriminant analysis model, except that here the $z^{(i)}$'s are playing the role of the class labels.[1]

[1] There are other minor differences in the formulas here from what we'd obtained in PS1 with Gaussian discriminant analysis, first because we've generalized the $z^{(i)}$'s to be multinomial rather than Bernoulli, and second because here we are using a different $\Sigma_j$ for each Gaussian.
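As a sketch, these closed-form estimates are just per-component frequencies, sample means, and sample covariances; the helper name below is my own.

```python
import numpy as np

def mle_known_z(x, z, k):
    """Closed-form MLE when each label z^(i) is observed (x: (m, n), z: (m,))."""
    phi = np.array([np.mean(z == j) for j in range(k)])
    mu = np.array([x[z == j].mean(axis=0) for j in range(k)])
    Sigma = []
    for j in range(k):
        d = x[z == j] - mu[j]              # points of component j, centered
        Sigma.append(d.T @ d / (z == j).sum())
    return phi, mu, np.array(Sigma)
```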

However, in our density estimation problem, the $z^{(i)}$'s are not known. What can we do?

The EM algorithm is an iterative algorithm that has two main steps. Applied to our problem, in the E-step, it tries to "guess" the values of the $z^{(i)}$'s. In the M-step, it updates the parameters of our model based on our guesses. Since in the M-step we are pretending that the guesses in the first part were correct, the maximization becomes easy. Here's the algorithm:

Repeat until convergence: {

(E-step) For each $i$, $j$, set

$$w_j^{(i)} := p(z^{(i)} = j \mid x^{(i)}; \phi, \mu, \Sigma).$$

(M-step) Update the parameters:

$$\phi_j := \frac{1}{m} \sum_{i=1}^m w_j^{(i)},$$

$$\mu_j := \frac{\sum_{i=1}^m w_j^{(i)}\, x^{(i)}}{\sum_{i=1}^m w_j^{(i)}},$$

$$\Sigma_j := \frac{\sum_{i=1}^m w_j^{(i)}\, (x^{(i)} - \mu_j)(x^{(i)} - \mu_j)^T}{\sum_{i=1}^m w_j^{(i)}}.$$

}
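In code, a minimal sketch of the M-step updates (the helper name and array layout are my own, with `w[i, j]` holding $w_j^{(i)}$): they are the same weighted proportions, means, and covariances as the known-$z^{(i)}$ formulas above.

```python
import numpy as np

def m_step(x, w):
    """M-step updates; w[i, j] is the soft guess w_j^(i), x has shape (m, n)."""
    m, n = x.shape
    k = w.shape[1]
    phi = w.mean(axis=0)                        # phi_j = (1/m) sum_i w_j^(i)
    mu = (w.T @ x) / w.sum(axis=0)[:, None]     # weighted means
    Sigma = np.empty((k, n, n))
    for j in range(k):
        d = x - mu[j]
        Sigma[j] = (w[:, j, None] * d).T @ d / w[:, j].sum()
    return phi, mu, Sigma
```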

In the E-step, we calculate the posterior probability of the $z^{(i)}$'s, given the $x^{(i)}$'s and using the current setting of our parameters. I.e., using Bayes' rule, we obtain:

$$p(z^{(i)} = j \mid x^{(i)}; \phi, \mu, \Sigma) = \frac{p(x^{(i)} \mid z^{(i)} = j; \mu, \Sigma)\, p(z^{(i)} = j; \phi)}{\sum_{l=1}^k p(x^{(i)} \mid z^{(i)} = l; \mu, \Sigma)\, p(z^{(i)} = l; \phi)}.$$

Here, $p(x^{(i)} \mid z^{(i)} = j; \mu, \Sigma)$ is given by evaluating the density of a Gaussian with mean $\mu_j$ and covariance $\Sigma_j$ at $x^{(i)}$; $p(z^{(i)} = j; \phi)$ is given by $\phi_j$; and so on. The values $w_j^{(i)}$ calculated in the E-step represent our "soft" guesses[2] for the values of $z^{(i)}$.

[2] The term "soft" refers to our guesses being probabilities and taking values in $[0, 1]$; in contrast, a "hard" guess is one that represents a single best guess (such as taking values in $\{0, 1\}$ or $\{1, \ldots, k\}$).
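A minimal sketch of this E-step computation in code (the helper name is my own, paired with the `m_step` sketch above; SciPy provides the Gaussian density):

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(x, phi, mu, Sigma):
    """w[i, j] = p(z^(i) = j | x^(i); phi, mu, Sigma), by Bayes' rule."""
    k = len(phi)
    # Numerator: p(x^(i) | z^(i) = j; mu, Sigma) * p(z^(i) = j; phi).
    w = np.column_stack([
        phi[j] * multivariate_normal.pdf(x, mean=mu[j], cov=Sigma[j])
        for j in range(k)
    ])
    # Denominator: the same terms summed over l, so each row sums to 1.
    return w / w.sum(axis=1, keepdims=True)
```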

Also, you should contrast the updates in the M-step with the formulas we had when the $z^{(i)}$'s were known exactly. They are identical, except that instead of the indicator functions $1\{z^{(i)} = j\}$ indicating from which Gaussian each datapoint had come, we now instead have the $w_j^{(i)}$'s.

The EM algorithm is also reminiscent of the K-means clustering algorithm, except that instead of the "hard" cluster assignments $c(i)$, we instead have the "soft" assignments $w_j^{(i)}$. Similar to K-means, it is also susceptible to local optima, so reinitializing at several different initial parameters may be a good idea.
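Putting the pieces together, here is one way to run the loop with several random restarts, keeping the parameters from the best local optimum found. This reuses the `e_step`, `m_step`, and `log_likelihood` sketches above; the initialization scheme and stopping tolerance are my own assumptions, not prescribed by the notes.

```python
import numpy as np

def fit_mixture(x, k, n_restarts=5, max_iter=200, tol=1e-6, seed=0):
    """EM for a mixture of Gaussians, with random restarts against local optima."""
    rng = np.random.default_rng(seed)
    m, n = x.shape
    best_ll, best_params = -np.inf, None
    for _ in range(n_restarts):
        # Initialize: uniform phi, means at random data points, shared covariance
        # (regularized slightly to keep each Sigma_j invertible).
        phi = np.full(k, 1.0 / k)
        mu = x[rng.choice(m, size=k, replace=False)].copy()
        Sigma = np.array([np.cov(x.T) + 1e-6 * np.eye(n) for _ in range(k)])
        ll_prev = -np.inf
        for _ in range(max_iter):
            w = e_step(x, phi, mu, Sigma)       # guess the z^(i)'s
            phi, mu, Sigma = m_step(x, w)       # maximize as if guesses were right
            ll = log_likelihood(x, phi, mu, Sigma)
            if ll - ll_prev < tol:              # converged: ll has stopped improving
                break
            ll_prev = ll
        if ll > best_ll:
            best_ll, best_params = ll, (phi, mu, Sigma)
    return best_params
```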

It's clear that the EM algorithm has a very natural interpretation of repeatedly trying to guess the unknown $z^{(i)}$'s; but how did it come about, and can we make any guarantees about it, such as regarding its convergence? In the next set of notes, we will describe a more general view of EM, one that will allow us to easily apply it to other estimation problems in which there are also latent variables, and which will allow us to give a convergence guarantee.
