
Lecture 16: 09 March, 2023

Pranabendu Misra
Slides by Madhavan Mukund

Data Mining and Machine Learning


January–April 2023
Generative models

We assume that the data is generated by a probabilistic process.


To use probabilities, need to describe how data is randomly generated
Generative model

Typically, assume a random instance is created as follows


Choose a class cj with probability Pr (cj )
Choose attributes a1 , . . . , ak with probability Pr (a1 , . . . , ak | cj )

Generative model has associated parameters θ = (θ1 , . . . , θm )


Each class probability Pr (cj ) is a parameter
Each conditional probability Pr (a1 , . . . , ak | cj ) is a parameter

We need to estimate these parameters
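As a concrete illustration (not from the slides), here is a minimal Python sketch of this generative process with made-up parameters; it assumes two classes and two binary attributes that are conditionally independent given the class (a naive Bayes style simplification).

```python
import random

# Made-up parameters: class priors Pr(c_j) and per-class attribute
# probabilities Pr(a_i = 1 | c_j), assumed conditionally independent.
prior = {"c1": 0.6, "c2": 0.4}
cond = {"c1": [0.9, 0.2],
        "c2": [0.3, 0.7]}

def sample_instance():
    # Step 1: choose a class c_j with probability Pr(c_j)
    c = random.choices(list(prior), weights=list(prior.values()))[0]
    # Step 2: choose attribute values given the chosen class
    attrs = [1 if random.random() < p else 0 for p in cond[c]]
    return c, attrs

print(sample_instance())
```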



Maximum Likelihood Estimators
We are given some data O = (o1 , o2 , . . . , on )
Our goal is to estimate parameters (probabilities) θ = (θ1 , . . . , θm )
Law of large numbers allows us to estimate probabilities by counting frequencies
Example: Tossing a biased coin, single parameter θ = Pr (heads)
N coin tosses, H heads and T tails
Why is θ̂ = H/N the best estimate?

Likelihood
Actual coin toss sequence is τ = t1 t2 . . . tN
Given an estimate of θ, compute Pr (τ | θ) — likelihood L(θ)

θ̂ = H/N maximizes this likelihood — arg max_θ L(θ) is attained at θ̂ = H/N

Maximum Likelihood Estimator (MLE)
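The slide's question has a short answer; here is a sketch of the standard calculation (not spelled out on the slide), assuming the tosses are independent so the likelihood depends only on the counts H and T:

```latex
\begin{align*}
  L(\theta) &= \Pr(\tau \mid \theta) = \theta^{H}(1-\theta)^{T}, \qquad T = N - H\\
  \ln L(\theta) &= H \ln\theta + T \ln(1-\theta)\\
  \frac{d}{d\theta}\ln L(\theta) &= \frac{H}{\theta} - \frac{T}{1-\theta} = 0
  \;\Longrightarrow\; H(1-\theta) = T\theta
  \;\Longrightarrow\; \hat\theta = \frac{H}{H+T} = \frac{H}{N}
\end{align*}
```

The second derivative of ln L(θ) is negative, so this stationary point is indeed the maximum.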
Mixture models
Probabilistic process — parameters Θ
Tossing a coin with Θ = {Pr(H)} = {p}

Perform an experiment
Toss the coin N times, H T H H · · · T

Estimate parameters from observations


From h heads, estimate p = h/N
Maximum Likelihood Estimator (MLE)

What if we have a mixture of two random processes?


Two coins, c1 and c2 , with Pr(H) = p1 and p2 , respectively
Repeat N times: choose ci with probability 1/2 and toss it
Outcome: N1 tosses of c1 interleaved with N2 tosses of c2 , N1 + N2 = N
Can we estimate p1 and p2 ?
Mixture models . . .
Two coins, c1 and c2 , with Pr(H) = p1 and p2 , respectively
Sequence of N interleaved coin tosses H T H H · · · H H T
If the sequence is labelled (each toss tagged with the coin that produced it), we can estimate p1 , p2 separately
H T T H H T H T H H T H T H T H H T H T
Here c1 accounts for 12 of the tosses with 8 heads, and c2 for the remaining 8 tosses with 3 heads, so p1 = 8/12 = 2/3 and p2 = 3/8

What if the observation is unlabelled?


H T T H H T H T H H T H T H T H H T H T

Iterative algorithm to estimate the parameters


Make an initial guess for the parameters
Compute a (fractional) labelling of the outcomes
Re-estimate the parameters



Expectation Maximization (EM)
Iterative algorithm to estimate the parameters
Make an initial guess for the parameters
Compute a (fractional) labelling of the outcomes
Re-estimate the parameters

H T T H H T H T H H T H T H T H H T H T
Initial guess: p1 = 1/2, p2 = 1/4
Pr (c1 = T ) = q1 = 1/2, Pr (c2 = T ) = q2 = 3/4,
For each H, likelihood it was ci , Pr(ci | H), is pi /(p1 + p2 )
For each T , likelihood it was ci , Pr(ci | T ), is qi /(q1 + q2 )
Assign fractional count Pr(ci | H) to each H: 2/3 × c1 , 1/3 × c2
Likewise, assign fractional count Pr(ci | T ) to each T : 2/5 × c1 , 3/5 × c2



Expectation Maximization (EM)
H T T H H T H T H H T H T H T H H T H T

Initial guess: p1 = 1/2, p2 = 1/4


Fractional counts: each H is 2/3 × c1 , 1/3 × c2 , each T : 2/5 × c1 , 3/5 × c2
Add up the fractional counts
c1 : 11 · (2/3) = 22/3 heads, 9 · (2/5) = 18/5 tails
c2 : 11 · (1/3) = 11/3 heads, 9 · (3/5) = 27/5 tails

Re-estimate the parameters


p1 = (22/3) / (22/3 + 18/5) = 110/164 ≈ 0.67, q1 = 1 − p1 ≈ 0.33
p2 = (11/3) / (11/3 + 27/5) = 55/136 ≈ 0.40, q2 = 1 − p2 ≈ 0.60
Repeat until convergence
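A minimal Python sketch of this iteration (not from the slides), assuming as above that each coin is picked with probability 1/2 so the mixing weights cancel in the E-step; one round from the initial guess reproduces the numbers computed above.

```python
def em_two_coins(tosses, p1, p2, iters=1):
    for _ in range(iters):
        # E-step: fractional responsibility of coin 1 for each H and each T
        h1 = p1 / (p1 + p2)                    # Pr(c1 | H)
        t1 = (1 - p1) / ((1 - p1) + (1 - p2))  # Pr(c1 | T)
        H, T = tosses.count("H"), tosses.count("T")
        # M-step: MLE from the fractional head/tail counts of each coin
        heads1, tails1 = H * h1, T * t1
        heads2, tails2 = H * (1 - h1), T * (1 - t1)
        p1 = heads1 / (heads1 + tails1)
        p2 = heads2 / (heads2 + tails2)
    return p1, p2

tosses = list("HTTHHTHTHHTHTHTHHTHT")            # 11 heads, 9 tails
print(em_two_coins(tosses, 0.5, 0.25, iters=1))  # ≈ (0.67, 0.40), as above
```

Increasing `iters` repeats the expectation and maximization steps until the estimates settle.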
Expectation Maximization (EM)
Mixture of probabilistic models (M1 , M2 , . . . , Mk ) with parameters
Θ = (θ1 , θ2 , . . . , θk )
Observation O = o1 o2 . . . oN
Expectation step
Compute likelihoods Pr (Mi |oj ) for each Mi , oj

Maximization step
Recompute the MLE for each Mi using the fraction of O assigned to it by these likelihoods

Repeat until convergence
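The loop structure, as a generic Python skeleton (a sketch, with hypothetical `e_step` and `m_step` callables supplied by the particular mixture model; `theta` is assumed to be a tuple of numeric parameters):

```python
def em(observations, theta, e_step, m_step, tol=1e-6, max_iters=100):
    for _ in range(max_iters):
        # Expectation: likelihood-based (fractional) assignment of observations
        responsibilities = e_step(observations, theta)
        # Maximization: recompute the MLE for each model from those fractions
        new_theta = m_step(observations, responsibilities)
        if all(abs(a - b) < tol for a, b in zip(new_theta, theta)):
            break
        theta = new_theta
    return theta
```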


Why should it converge?
If the value converges, what have we computed?



EM — another example
Two biased coins: choose a coin and toss it 10 times; repeat this 5 times
If we know the breakup, we can separately compute the MLE for each coin



EM — another example

Expectation-Maximization on this example:
Initial estimates θA = 0.6, θB = 0.5
Compute the likelihood of each sequence under each coin: θ^{n_H} (1 − θ)^{n_T}
Assign each sequence proportionately to the two coins
Converge to θA = 0.8, θB = 0.52



EM — mixture of Gaussians

Sample uniformly from multiple Gaussians N(µi , σi )
For simplicity, assume all σi = σ
N sample points z1 , z2 , . . . , zN
Make an initial guess for each µj
Pr(zi | µj ) ∝ exp(−(zi − µj )² / (2σ²)) (the normalising constant cancels below)
Pr(µj | zi ) = cij = Pr(zi | µj ) / Σ_k Pr(zi | µk )
MLE of µj is the weighted sample mean, Σ_i cij zi / Σ_i cij
Update estimates for µj and repeat
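A minimal Python sketch of these updates (not from the slides) for the one-dimensional case, with a common known σ and uniform mixing weights:

```python
import numpy as np

def em_gaussian_means(z, mus, sigma=1.0, iters=50):
    z, mus = np.asarray(z, dtype=float), np.asarray(mus, dtype=float)
    for _ in range(iters):
        # E-step: c_ij ∝ exp(-(z_i - mu_j)^2 / (2 sigma^2)), normalized over j
        w = np.exp(-((z[:, None] - mus[None, :]) ** 2) / (2 * sigma ** 2))
        c = w / w.sum(axis=1, keepdims=True)
        # M-step: each mu_j becomes the c-weighted sample mean
        mus = (c * z[:, None]).sum(axis=0) / c.sum(axis=0)
    return mus

# Hypothetical data from two well-separated Gaussians
rng = np.random.default_rng(0)
z = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 1, 100)])
print(em_gaussian_means(z, mus=[0.0, 1.0]))  # estimates move towards roughly (-2, 3)
```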



Theoretical foundations of EM
Mixture of probabilistic models (M1 , M2 , . . . , Mk ) with parameters Θ = (θ1 , θ2 , . . . , θk )
Observation O = o1 o2 . . . oN
EM builds a sequence of estimates Θ1 , Θ2 , . . . , Θn
L(Θj ) — log-likelihood function, ln Pr(O | Θj )
EM performs a form of gradient ascent (hill climbing) on the likelihood
Want to extend the sequence with Θn+1 such that L(Θn+1 ) > L(Θn )
If we update Θn to Θ′, we get a lower bound L(Θn ) + ∆(Θ′ , Θn ) on the new likelihood, which we call ℓ(Θ′ | Θn )
Choose Θn+1 to maximize ℓ(Θ′ | Θn )
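For completeness, a sketch of the standard lower-bound argument behind these claims (not spelled out on the slide), assuming hidden variables z that record which model produced each observation:

```latex
% Jensen's inequality applied to the log of the marginal likelihood:
\begin{align*}
  L(\Theta') &= \ln \Pr(O \mid \Theta') = \ln \sum_{z} \Pr(O, z \mid \Theta')\\
  &\ge L(\Theta_n) +
     \underbrace{\sum_{z} \Pr(z \mid O, \Theta_n)\,
       \ln \frac{\Pr(O, z \mid \Theta')}
                {\Pr(z \mid O, \Theta_n)\,\Pr(O \mid \Theta_n)}}_{\Delta(\Theta', \Theta_n)}
     \;=\; \ell(\Theta' \mid \Theta_n)
\end{align*}
```

Since ℓ(Θn | Θn) = L(Θn), choosing Θn+1 to maximize ℓ(· | Θn) gives L(Θn+1) ≥ ℓ(Θn+1 | Θn) ≥ L(Θn), so the likelihood never decreases.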
Semi-supervised learning

Supervised learning requires labelled training data


What if we don’t have enough labelled data?
For a probabilistic classifier we can apply EM
Use available training data to assign initial probabilities
Label the rest of the data using this model — fractional labels
Add up counts and re-estimate the parameters



Semi-supervised topic classification

Each document is a multiset or bag of words over a vocabulary


V = {w1 , w2 , . . . , wm }

Each topic c has probability Pr(c)


Each word wi ∈ V has conditional probability Pr(wi | cj ), for cj ∈ C
Note that Σ_{i=1}^{m} Pr(wi | cj ) = 1

Assume document length is independent of the class


Only a small subset of documents is labelled
Use this subset for initial estimate of Pr(c), Pr(wi | cj )



Semi-supervised topic classification
Current model Pr(c), Pr(wi | cj )
Compute Pr(cj | d) for each unlabelled document d
Normally we assign the maximum among these as the class for d
Here we keep fractional values
Recompute Pr(cj ) = Σ_{d∈D} Pr(cj | d) / |D|
For labelled d, Pr(cj | d) ∈ {0, 1}
For unlabelled d, Pr(cj | d) is fractional value computed from current parameters

Recompute Pr (wi | cj ) — fraction of occurrences of wi in documents labelled cj


nid — occurrences of wi in d
Pr(wi | cj ) = Σ_{d∈D} nid Pr(cj | d) / Σ_{t=1}^{m} Σ_{d∈D} ntd Pr(cj | d)
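A minimal Python sketch of one EM round for this model (not from the slides): `counts[d, i]` holds nid, `resp[d, j]` holds Pr(cj | d) with one-hot rows for labelled documents, and a small smoothing constant (an addition, not on the slides) keeps the conditional probabilities away from zero.

```python
import numpy as np

def m_step(counts, resp, alpha=1e-9):
    # Pr(c_j) = sum_d Pr(c_j | d) / |D|
    prior = resp.sum(axis=0) / resp.shape[0]
    # Pr(w_i | c_j) = sum_d n_id Pr(c_j | d) / sum_t sum_d n_td Pr(c_j | d)
    num = counts.T @ resp + alpha           # (words x classes), lightly smoothed
    cond = num / num.sum(axis=0, keepdims=True)
    return prior, cond

def e_step(counts, prior, cond):
    # Pr(c_j | d) ∝ Pr(c_j) * prod_i Pr(w_i | c_j)^n_id, computed in log space
    log_post = np.log(prior) + counts @ np.log(cond)
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)
```

Alternating `e_step` (for the unlabelled rows only) and `m_step` implements the update above.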



Clustering
Data points from a mixture of Gaussian distributions
Use EM to estimate the parameters of each Gaussian distribution
Assign each point to the “best” Gaussian
Can tweak the shape of the clusters by constraining the covariance matrix
Outliers are those that lie outside kσ for all the Gaussians
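As one way to try this out (an illustrative sketch, not the lecture's code), scikit-learn's GaussianMixture runs EM for such a mixture; `covariance_type` constrains the cluster shapes and low `score_samples` values flag outlier candidates.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic 2-D data from two Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, (200, 2)),
               rng.normal([5, 5], 0.5, (200, 2))])

gm = GaussianMixture(n_components=2, covariance_type="diag").fit(X)
labels = gm.predict(X)           # assign each point to its "best" Gaussian
densities = gm.score_samples(X)  # unusually low log-density points are outlier candidates
```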
