
Lecture 16: 09 March, 2023

Pranabendu Misra
Slides by Madhavan Mukund

Data Mining and Machine Learning


January–April 2023
Generative models

We assume that the data is generated by a probabilistic process.


To use probabilities, need to describe how data is randomly generated
Generative model

Typically, assume a random instance is created as follows


Choose a class cj with probability Pr (cj )
Choose attributes a1 , . . . , ak with probability Pr (a1 , . . . , ak | cj )

Generative model has associated parameters θ = (θ1 , . . . , θm )


Each class probability Pr (cj ) is a parameter
Each conditional probability Pr (a1 , . . . , ak | cj ) is a parameter

We need to estimate these parameters
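As a concrete illustration (not from the slides), here is a minimal Python sketch of this generative process with made-up parameters; it assumes two classes and two binary attributes that are conditionally independent given the class (a naive Bayes style simplification).

```python
import random

# Made-up parameters: class priors Pr(c_j) and per-class attribute
# probabilities Pr(a_i = 1 | c_j), assumed conditionally independent.
prior = {"c1": 0.6, "c2": 0.4}
cond = {"c1": [0.9, 0.2],
        "c2": [0.3, 0.7]}

def sample_instance():
    # Step 1: choose a class c_j with probability Pr(c_j)
    c = random.choices(list(prior), weights=list(prior.values()))[0]
    # Step 2: choose attribute values given the chosen class
    attrs = [1 if random.random() < p else 0 for p in cond[c]]
    return c, attrs

print(sample_instance())
```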



Maximum Likelihood Estimators
We are given some data O = (o1 , o2 , . . . , on )
Our goal is to estimate parameters (probabilities) θ = (θ1 , . . . , θm )
Law of large numbers allows us to estimate probabilities by counting frequencies
Example: Tossing a biased coin, single parameter θ = Pr (heads)
N coin tosses, H heads and T tails
Why is θ̂ = H/N the best estimate?

Likelihood
Actual coin toss sequence is τ = t1 t2 . . . tN
Given an estimate of θ, compute Pr (τ | θ) — likelihood L(θ)

θ̂ = H/N maximizes this likelihood — arg max_θ L(θ) is attained at θ̂ = H/N

Maximum Likelihood Estimator (MLE)
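The slide's question has a short answer; here is a sketch of the standard calculation (not spelled out on the slide), assuming the tosses are independent so the likelihood depends only on the counts H and T:

```latex
\begin{align*}
  L(\theta) &= \Pr(\tau \mid \theta) = \theta^{H}(1-\theta)^{T}, \qquad T = N - H\\
  \ln L(\theta) &= H \ln\theta + T \ln(1-\theta)\\
  \frac{d}{d\theta}\ln L(\theta) &= \frac{H}{\theta} - \frac{T}{1-\theta} = 0
  \;\Longrightarrow\; H(1-\theta) = T\theta
  \;\Longrightarrow\; \hat\theta = \frac{H}{H+T} = \frac{H}{N}
\end{align*}
```

The second derivative of ln L(θ) is negative, so this stationary point is indeed the maximum.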
Mixture models
Probabilistic process — parameters Θ
Tossing a coin with Θ = {Pr(H)} = {p}

Perform an experiment
Toss the coin N times, H T H H · · · T

Estimate parameters from observations


From h heads, estimate p = h/N
Maximum Likelihood Estimator (MLE)

What if we have a mixture of two random processes?


Two coins, c1 and c2 , with Pr(H) = p1 and p2 , respectively
Repeat N times: choose ci with probability 1/2 and toss it
Outcome: N1 tosses of c1 interleaved with N2 tosses of c2 , N1 + N2 = N
Can we estimate p1 and p2 ?
Mixture models . . .
Two coins, c1 and c2 , with Pr(H) = p1 and p2 , respectively
Sequence of N interleaved coin tosses H T H H · · · H H T
If the sequence is labelled (each toss tagged with the coin that produced it), we can estimate p1 , p2 separately
H T T H H T H T H H T H T H T H H T H T
Here c1 accounts for 12 of the tosses with 8 heads, and c2 for the remaining 8 tosses with 3 heads, so p1 = 8/12 = 2/3 and p2 = 3/8

What if the observation is unlabelled?


H T T H H T H T H H T H T H T H H T H T

Iterative algorithm to estimate the parameters


Make an initial guess for the parameters
Compute a (fractional) labelling of the outcomes
Re-estimate the parameters



Expectation Maximization (EM)
Iterative algorithm to estimate the parameters
Make an initial guess for the parameters
Compute a (fractional) labelling of the outcomes
Re-estimate the parameters

H T T H H T H T H H T H T H T H H T H T
Initial guess: p1 = 1/2, p2 = 1/4
Pr (c1 = T ) = q1 = 1/2, Pr (c2 = T ) = q2 = 3/4,
For each H, likelihood it was ci , Pr(ci | H), is pi /(p1 + p2 )
For each T , likelihood it was ci , Pr(ci | T ), is qi /(q1 + q2 )
Assign fractional count Pr(ci | H) to each H: 2/3 × c1 , 1/3 × c2
Likewise, assign fractional count Pr(ci | T ) to each T : 2/5 × c1 , 3/5 × c2



Expectation Maximization (EM)
H T T H H T H T H H T H T H T H H T H T

Initial guess: p1 = 1/2, p2 = 1/4


Fractional counts: each H is 2/3 × c1 , 1/3 × c2 , each T : 2/5 × c1 , 3/5 × c2
Add up the fractional counts
c1 : 11 · (2/3) = 22/3 heads, 9 · (2/5) = 18/5 tails
c2 : 11 · (1/3) = 11/3 heads, 9 · (3/5) = 27/5 tails

Re-estimate the parameters


p1 = (22/3) / (22/3 + 18/5) = 110/164 ≈ 0.67, q1 = 1 − p1 ≈ 0.33
p2 = (11/3) / (11/3 + 27/5) = 55/136 ≈ 0.40, q2 = 1 − p2 ≈ 0.60
Repeat until convergence
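A minimal Python sketch of this iteration (not from the slides), assuming as above that each coin is picked with probability 1/2 so the mixing weights cancel in the E-step; one round from the initial guess reproduces the numbers computed above.

```python
def em_two_coins(tosses, p1, p2, iters=1):
    for _ in range(iters):
        # E-step: fractional responsibility of coin 1 for each H and each T
        h1 = p1 / (p1 + p2)                    # Pr(c1 | H)
        t1 = (1 - p1) / ((1 - p1) + (1 - p2))  # Pr(c1 | T)
        H, T = tosses.count("H"), tosses.count("T")
        # M-step: MLE from the fractional head/tail counts of each coin
        heads1, tails1 = H * h1, T * t1
        heads2, tails2 = H * (1 - h1), T * (1 - t1)
        p1 = heads1 / (heads1 + tails1)
        p2 = heads2 / (heads2 + tails2)
    return p1, p2

tosses = list("HTTHHTHTHHTHTHTHHTHT")            # 11 heads, 9 tails
print(em_two_coins(tosses, 0.5, 0.25, iters=1))  # ≈ (0.67, 0.40), as above
```

Increasing `iters` repeats the expectation and maximization steps until the estimates settle.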
Expectation Maximization (EM)
Mixture of probabilistic models (M1 , M2 , . . . , Mk ) with parameters
Θ = (θ1 , θ2 , . . . , θk )
Observation O = o1 o2 . . . oN
Expectation step
Compute likelihoods Pr (Mi |oj ) for each Mi , oj

Maximization step
Recompute the MLE for each Mi using the fraction of O assigned to it by these likelihoods

Repeat until convergence
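The loop structure, as a generic Python skeleton (a sketch, with hypothetical `e_step` and `m_step` callables supplied by the particular mixture model; `theta` is assumed to be a tuple of numeric parameters):

```python
def em(observations, theta, e_step, m_step, tol=1e-6, max_iters=100):
    for _ in range(max_iters):
        # Expectation: likelihood-based (fractional) assignment of observations
        responsibilities = e_step(observations, theta)
        # Maximization: recompute the MLE for each model from those fractions
        new_theta = m_step(observations, responsibilities)
        if all(abs(a - b) < tol for a, b in zip(new_theta, theta)):
            break
        theta = new_theta
    return theta
```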


Why should it converge?
If the value converges, what have we computed?



EM — another example
Two biased coins: choose a coin and toss it 10 times; repeat this 5 times
If we know the breakup, we can separately compute the MLE for each coin



EM — another example

Expectation-Maximization on this example:
Initial estimates θA = 0.6, θB = 0.5
Compute the likelihood of each sequence under each coin: θ^{n_H} (1 − θ)^{n_T}
Assign each sequence proportionately to the two coins
Converge to θA = 0.8, θB = 0.52



EM — mixture of Gaussians

Sample uniformly from multiple Gaussians N(µi , σi )
For simplicity, assume all σi = σ
N sample points z1 , z2 , . . . , zN
Make an initial guess for each µj
Pr(zi | µj ) ∝ exp(−(zi − µj )² / (2σ²)) (the normalising constant cancels below)
Pr(µj | zi ) = cij = Pr(zi | µj ) / Σ_k Pr(zi | µk )
MLE of µj is the weighted sample mean, Σ_i cij zi / Σ_i cij
Update estimates for µj and repeat
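A minimal Python sketch of these updates (not from the slides) for the one-dimensional case, with a common known σ and uniform mixing weights:

```python
import numpy as np

def em_gaussian_means(z, mus, sigma=1.0, iters=50):
    z, mus = np.asarray(z, dtype=float), np.asarray(mus, dtype=float)
    for _ in range(iters):
        # E-step: c_ij ∝ exp(-(z_i - mu_j)^2 / (2 sigma^2)), normalized over j
        w = np.exp(-((z[:, None] - mus[None, :]) ** 2) / (2 * sigma ** 2))
        c = w / w.sum(axis=1, keepdims=True)
        # M-step: each mu_j becomes the c-weighted sample mean
        mus = (c * z[:, None]).sum(axis=0) / c.sum(axis=0)
    return mus

# Hypothetical data from two well-separated Gaussians
rng = np.random.default_rng(0)
z = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 1, 100)])
print(em_gaussian_means(z, mus=[0.0, 1.0]))  # estimates move towards roughly (-2, 3)
```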



Theoretical foundations of EM
Mixture of probabilistic models (M1 , M2 , . . . , Mk ) with parameters Θ = (θ1 , θ2 , . . . , θk )
Observation O = o1 o2 . . . oN
EM builds a sequence of estimates Θ1 , Θ2 , . . . , Θn
L(Θj ) — log-likelihood function, ln Pr(O | Θj )
EM performs a form of gradient ascent (hill climbing) on the likelihood
Want to extend the sequence with Θn+1 such that L(Θn+1 ) > L(Θn )
If we update Θn to Θ′, we get a lower bound L(Θn ) + ∆(Θ′ , Θn ) on the new likelihood, which we call ℓ(Θ′ | Θn )
Choose Θn+1 to maximize ℓ(Θ′ | Θn )
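For completeness, a sketch of the standard lower-bound argument behind these claims (not spelled out on the slide), assuming hidden variables z that record which model produced each observation:

```latex
% Jensen's inequality applied to the log of the marginal likelihood:
\begin{align*}
  L(\Theta') &= \ln \Pr(O \mid \Theta') = \ln \sum_{z} \Pr(O, z \mid \Theta')\\
  &\ge L(\Theta_n) +
     \underbrace{\sum_{z} \Pr(z \mid O, \Theta_n)\,
       \ln \frac{\Pr(O, z \mid \Theta')}
                {\Pr(z \mid O, \Theta_n)\,\Pr(O \mid \Theta_n)}}_{\Delta(\Theta', \Theta_n)}
     \;=\; \ell(\Theta' \mid \Theta_n)
\end{align*}
```

Since ℓ(Θn | Θn) = L(Θn), choosing Θn+1 to maximize ℓ(· | Θn) gives L(Θn+1) ≥ ℓ(Θn+1 | Θn) ≥ L(Θn), so the likelihood never decreases.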
Semi-supervised learning

Supervised learning requires labelled training data


What if we don’t have enough labelled data?
For a probabilistic classifier we can apply EM
Use available training data to assign initial probabilities
Label the rest of the data using this model — fractional labels
Add up counts and re-estimate the parameters



Semi-supervised topic classification

Each document is a multiset or bag of words over a vocabulary


V = {w1 , w2 , . . . , wm }

Each topic c has probability Pr(c)


Each word wi ∈ V has conditional probability Pr(wi | cj ), for cj ∈ C
Note that Σ_{i=1}^{m} Pr(wi | cj ) = 1

Assume document length is independent of the class


Only a small subset of documents is labelled
Use this subset for initial estimate of Pr(c), Pr(wi | cj )



Semi-supervised topic classification
Current model Pr(c), Pr(wi | cj )
Compute Pr(cj | d) for each unlabelled document d
Normally we assign the maximum among these as the class for d
Here we keep fractional values
Recompute Pr(cj ) = Σ_{d∈D} Pr(cj | d) / |D|
For labelled d, Pr(cj | d) ∈ {0, 1}
For unlabelled d, Pr(cj | d) is fractional value computed from current parameters

Recompute Pr (wi | cj ) — fraction of occurrences of wi in documents labelled cj


nid — occurrences of wi in d
Pr(wi | cj ) = Σ_{d∈D} nid Pr(cj | d) / Σ_{t=1}^{m} Σ_{d∈D} ntd Pr(cj | d)
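A minimal Python sketch of one EM round for this model (not from the slides): `counts[d, i]` holds nid, `resp[d, j]` holds Pr(cj | d) with one-hot rows for labelled documents, and a small smoothing constant (an addition, not on the slides) keeps the conditional probabilities away from zero.

```python
import numpy as np

def m_step(counts, resp, alpha=1e-9):
    # Pr(c_j) = sum_d Pr(c_j | d) / |D|
    prior = resp.sum(axis=0) / resp.shape[0]
    # Pr(w_i | c_j) = sum_d n_id Pr(c_j | d) / sum_t sum_d n_td Pr(c_j | d)
    num = counts.T @ resp + alpha           # (words x classes), lightly smoothed
    cond = num / num.sum(axis=0, keepdims=True)
    return prior, cond

def e_step(counts, prior, cond):
    # Pr(c_j | d) ∝ Pr(c_j) * prod_i Pr(w_i | c_j)^n_id, computed in log space
    log_post = np.log(prior) + counts @ np.log(cond)
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)
```

Alternating `e_step` (for the unlabelled rows only) and `m_step` implements the update above.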



Clustering
Data points from a mixture of Gaussian distributions
Use EM to estimate the parameters of each Gaussian distribution
Assign each point to the “best” Gaussian
Can tweak the shape of the clusters by constraining the covariance matrix
Outliers are those that lie outside kσ for all the Gaussians
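As one way to try this out (an illustrative sketch, not the lecture's code), scikit-learn's GaussianMixture runs EM for such a mixture; `covariance_type` constrains the cluster shapes and low `score_samples` values flag outlier candidates.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic 2-D data from two Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, (200, 2)),
               rng.normal([5, 5], 0.5, (200, 2))])

gm = GaussianMixture(n_components=2, covariance_type="diag").fit(X)
labels = gm.predict(X)           # assign each point to its "best" Gaussian
densities = gm.score_samples(X)  # unusually low log-density points are outlier candidates
```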
