Mixture of Gaussians
EM algorithm
Latent Variables
Maximize the log-likelihood $\ln p(X \mid \pi, \mu, \Sigma)$ w.r.t. $\Theta = \{\pi_k, \mu_k, \Sigma_k\}$

Problems:
- Singularities: arbitrarily large likelihood when a Gaussian explains a single point (see the numeric sketch below)
- Identifiability: the solution is defined only up to permutations of the components

How would you optimize this?
Can we have a closed-form update?
Don't forget to satisfy the constraints on the $\pi_k$
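To see the singularity concretely, here is a small NumPy sketch (the data and parameter values are made up for illustration, not from the lecture): lock one component's mean onto a single datapoint and shrink its variance, and the mixture log-likelihood grows without bound.

```python
import numpy as np
from scipy.stats import norm

# Toy 1-D data (made-up numbers).
x = np.array([-1.0, 0.2, 0.5, 1.3, 4.0])

# Two-component mixture: component 1's mean sits exactly on x[0],
# and we shrink its standard deviation toward zero.
pi = np.array([0.5, 0.5])
mu = np.array([x[0], x[1:].mean()])

for sigma1 in [1.0, 0.1, 0.01, 1e-4]:
    scales = np.array([sigma1, 1.0])
    # p(x_n) = sum_k pi_k N(x_n | mu_k, sigma_k^2)
    comp = pi * norm.pdf(x[:, None], loc=mu, scale=scales)
    ll = np.log(comp.sum(axis=1)).sum()
    print(f"sigma_1 = {sigma1:g}  ->  log-likelihood = {ll:.2f}")
# The log-likelihood diverges as sigma_1 -> 0: the degenerate
# component "explains" the single point x[0] arbitrarily well.
```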
Introduce a hidden variable such that knowing its value would simplify the maximization.

We could introduce a hidden (latent) variable $z$ which would represent which Gaussian generated our observation $x$, with some probability.

Let $z \sim \mathrm{Categorical}(\pi)$ (where $\pi_k \ge 0$, $\sum_k \pi_k = 1$). Then:

$$p(x) = \sum_{k=1}^{K} p(x, z = k) = \sum_{k=1}^{K} \underbrace{p(z = k)}_{\pi_k}\, \underbrace{p(x \mid z = k)}_{\mathcal{N}(x \mid \mu_k, \Sigma_k)}$$
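This generative story is easy to simulate. A minimal NumPy sketch (the parameter values here are made up for illustration): first sample which component $z$, then sample $x$ from that Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mixture parameters: K = 3 components in 2-D.
pi = np.array([0.5, 0.3, 0.2])                   # mixing proportions
mu = np.array([[0., 0.], [3., 3.], [-3., 2.]])   # component means
Sigma = np.stack([np.eye(2), 0.5 * np.eye(2), np.eye(2)])

def sample_mog(n):
    """Ancestral sampling: z ~ Categorical(pi), then x ~ N(mu_z, Sigma_z)."""
    z = rng.choice(len(pi), size=n, p=pi)
    x = np.stack([rng.multivariate_normal(mu[k], Sigma[k]) for k in z])
    return x, z

x, z = sample_mog(500)
print(np.bincount(z) / len(z))   # empirical proportions approach pi
```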
We had: $z \sim \mathrm{Categorical}(\pi)$ (where $\pi_k \ge 0$, $\sum_k \pi_k = 1$)

Joint distribution: $p(x, z) = p(z)\, p(x \mid z)$

Log-likelihood:

$$\ell(\pi, \mu, \Sigma) = \ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln p(x^{(n)} \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \sum_{z^{(n)}=1}^{K} p(x^{(n)} \mid z^{(n)}; \mu, \Sigma)\, p(z^{(n)} \mid \pi)$$
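In code, this log-likelihood is best evaluated in log space, since the per-component densities underflow easily. A sketch using SciPy; the function name and array layout are my own choices:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def mog_log_likelihood(X, pi, mu, Sigma):
    """ln p(X | pi, mu, Sigma) = sum_n ln sum_k pi_k N(x_n | mu_k, Sigma_k).

    Computed with logsumexp for numerical stability.
    X: (N, d), pi: (K,), mu: (K, d), Sigma: (K, d, d).
    """
    log_comp = np.stack([
        np.log(pi[k]) + multivariate_normal.logpdf(X, mu[k], Sigma[k])
        for k in range(len(pi))
    ], axis=1)                       # (N, K): ln pi_k + ln N(x_n | mu_k, Sigma_k)
    return logsumexp(log_comp, axis=1).sum()
```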
EM is an elegant and powerful method for finding maximum likelihood solutions for models with latent variables.

1. E-step:
   - In order to adjust the parameters, we must first solve the inference problem: which Gaussian generated each datapoint?
   - We cannot be sure, so we compute a distribution over all possibilities:
     $$\gamma_k^{(n)} = p(z^{(n)} = k \mid x^{(n)}; \pi, \mu, \Sigma)$$
2. M-step:
   - Each Gaussian gets a certain amount of posterior probability for each datapoint.
   - At the optimum we must satisfy
     $$\frac{\partial \ln p(X \mid \pi, \mu, \Sigma)}{\partial \Theta} = 0$$
$$\gamma_k = p(z = k \mid x) = \frac{p(z = k)\, p(x \mid z = k)}{p(x)} = \frac{p(z = k)\, p(x \mid z = k)}{\sum_{j=1}^{K} p(z = j)\, p(x \mid z = j)} = \frac{\pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x \mid \mu_j, \Sigma_j)}$$
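A vectorized E-step corresponding to this formula, again normalizing in log space for stability (a sketch; the interface mirrors the `mog_log_likelihood` helper above):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def e_step(X, pi, mu, Sigma):
    """gamma[n, k] = pi_k N(x_n | mu_k, Sigma_k) / sum_j pi_j N(x_n | mu_j, Sigma_j)."""
    log_num = np.stack([
        np.log(pi[k]) + multivariate_normal.logpdf(X, mu[k], Sigma[k])
        for k in range(len(pi))
    ], axis=1)                                          # (N, K)
    log_gamma = log_num - logsumexp(log_num, axis=1, keepdims=True)
    return np.exp(log_gamma)                            # rows sum to 1
```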
M-Step: Estimate Parameters

Set derivatives to 0:

$$\frac{\partial \ln p(X \mid \pi, \mu, \Sigma)}{\partial \mu_k} = 0 = \sum_{n=1}^{N} \underbrace{\frac{\pi_k \mathcal{N}(x^{(n)} \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x^{(n)} \mid \mu_j, \Sigma_j)}}_{\gamma_k^{(n)}}\, \Sigma_k^{-1} (x^{(n)} - \mu_k)$$

We used:

$$\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^d |\Sigma|}} \exp\left(-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)\right)$$

and:

$$\frac{\partial (x^T A x)}{\partial x} = x^T (A + A^T)$$
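One can sanity-check this gradient numerically. A small finite-difference check, using random data and parameters that are purely illustrative:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
N, K, d = 20, 3, 2
X = rng.normal(size=(N, d))
pi = np.array([0.2, 0.5, 0.3])
mu = rng.normal(size=(K, d))
Sigma = np.stack([np.diag(rng.uniform(0.5, 2.0, size=d)) for _ in range(K)])

def ll(mu):
    dens = sum(pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k]) for k in range(K))
    return np.log(dens).sum()

# Analytic gradient w.r.t. mu_0: sum_n gamma_0^(n) Sigma_0^{-1} (x^(n) - mu_0)
dens = np.stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k]) for k in range(K)], axis=1)
gamma = dens / dens.sum(axis=1, keepdims=True)
grad = (gamma[:, 0:1] * (X - mu[0]) @ np.linalg.inv(Sigma[0])).sum(axis=0)

# Numerical gradient via central differences
eps, num = 1e-5, np.zeros(d)
for i in range(d):
    mp, mm = mu.copy(), mu.copy()
    mp[0, i] += eps; mm[0, i] -= eps
    num[i] = (ll(mp) - ll(mm)) / (2 * eps)

print(np.allclose(grad, num, atol=1e-5))   # True: the formula checks out
```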
This gives

$$\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma_k^{(n)} x^{(n)}$$

with $N_k$ the effective number of points in cluster $k$:

$$N_k = \sum_{n=1}^{N} \gamma_k^{(n)}$$

We just take the center of gravity of the data that the Gaussian is responsible for.

Just like in K-means, except the data is weighted by the posterior probability under the Gaussian.

The updated mean is guaranteed to lie in the convex hull of the data (which could mean a big initial jump).
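The same kind of derivation also yields closed-form updates for $\pi_k$ and $\Sigma_k$. A sketch of the full M-step; note the slides only derive $\mu_k$ explicitly, so the standard updates $\pi_k = N_k / N$ and the responsibility-weighted covariance are assumed here:

```python
import numpy as np

def m_step(X, gamma):
    """Closed-form M-step updates for a mixture of Gaussians.

    gamma: (N, K) responsibilities from the E-step.
    Returns updated (pi, mu, Sigma).
    """
    N, d = X.shape
    Nk = gamma.sum(axis=0)                   # effective counts, (K,)
    pi = Nk / N                              # pi_k = N_k / N
    mu = (gamma.T @ X) / Nk[:, None]         # weighted centers of gravity
    Sigma = np.empty((len(Nk), d, d))
    for k in range(len(Nk)):
        diff = X - mu[k]                     # (N, d)
        # Sigma_k = (1/N_k) sum_n gamma_k^(n) (x^(n) - mu_k)(x^(n) - mu_k)^T
        Sigma[k] = (gamma[:, k] * diff.T) @ diff / Nk[k]
    return pi, mu, Sigma
```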
1. Initialize $\Theta^{\text{old}}$
2. E-step: evaluate $p(Z \mid X, \Theta^{\text{old}})$
3. M-step:
   $$\Theta^{\text{new}} = \arg\max_{\Theta} Q(\Theta, \Theta^{\text{old}})$$
   where
   $$Q(\Theta, \Theta^{\text{old}}) = \sum_{Z} p(Z \mid X, \Theta^{\text{old}}) \ln p(X, Z \mid \Theta)$$
4. Evaluate the log-likelihood and check for convergence (of the likelihood or of the parameters). If not converged, set $\Theta^{\text{old}} \leftarrow \Theta^{\text{new}}$ and go to step 2.
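Putting the pieces together, a minimal EM loop for the mixture of Gaussians, reusing the `e_step`, `m_step`, and `mog_log_likelihood` sketches from above (the initialization strategy here is one simple choice among many):

```python
import numpy as np

def fit_mog(X, K, n_iter=100, tol=1e-6, seed=0):
    """EM for a mixture of Gaussians, alternating e_step and m_step."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    pi = np.full(K, 1.0 / K)                       # uniform mixing weights
    mu = X[rng.choice(N, size=K, replace=False)]   # random data points as means
    Sigma = np.stack([np.eye(d)] * K)

    prev_ll = -np.inf
    for _ in range(n_iter):
        gamma = e_step(X, pi, mu, Sigma)           # inference over z
        pi, mu, Sigma = m_step(X, gamma)           # closed-form updates
        ll = mog_log_likelihood(X, pi, mu, Sigma)
        if ll - prev_ll < tol:                     # EM never decreases ll
            break
        prev_ll = ll
    return pi, mu, Sigma, ll
```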
Free energy $F$ is a cost function that is reduced by both the E-step and the M-step:

$$F = \text{expected energy} - \text{entropy}$$

For any distribution $q(Z)$ over the latent variables, the log-likelihood decomposes as $\ln p(X \mid \Theta) = L(q, \Theta) + \mathrm{KL}(q \,\|\, p)$, where

$$L(q, \Theta) = \sum_{Z} q(Z) \ln \frac{p(X, Z \mid \Theta)}{q(Z)}$$

$$\mathrm{KL}(q \,\|\, p) = -\sum_{Z} q(Z) \ln \frac{p(Z \mid X, \Theta)}{q(Z)}$$

Since the KL divergence is nonnegative,

$$L(q, \Theta) \le \ln p(X \mid \Theta)$$

In the M-step, the distribution $q(Z)$ is held fixed and the lower bound $L(q, \Theta)$ is maximized with respect to the parameter vector $\Theta$ to give a revised value $\Theta^{\text{new}}$. Because the KL divergence is nonnegative, this causes the log-likelihood $\ln p(X \mid \Theta)$ to increase by at least as much as the lower bound does.
Setting $q(Z) = p(Z \mid X, \Theta^{\text{old}})$:

$$L(q, \Theta) = \sum_{Z} p(Z \mid X, \Theta^{\text{old}}) \ln p(X, Z \mid \Theta) - \sum_{Z} p(Z \mid X, \Theta^{\text{old}}) \ln p(Z \mid X, \Theta^{\text{old}}) = Q(\Theta, \Theta^{\text{old}}) + \text{const}$$

so the free energy is $F = -L(q, \Theta) = \text{expected energy} - \text{entropy}$.
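The decomposition is easy to verify numerically for a single datapoint, where the sum over $Z$ has only $K$ terms. A toy check (the joint values are arbitrary made-up numbers):

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(2)
K = 3
log_joint = rng.normal(size=K)            # hypothetical ln p(x, z=k | Theta)
log_px = logsumexp(log_joint)             # ln p(x | Theta) = ln sum_k p(x, z=k)
post = np.exp(log_joint - log_px)         # exact posterior p(z | x, Theta)

def L(q):
    # L(q, Theta) = sum_z q(z) ln [ p(x, z | Theta) / q(z) ]
    return np.sum(q * (log_joint - np.log(q)))

q_rand = rng.dirichlet(np.ones(K))        # an arbitrary distribution over z
print(np.isclose(L(post), log_px))        # True: KL(q||p) = 0 at the posterior
print(L(q_rand) <= log_px)                # True: L lower-bounds ln p(x|Theta) for any q
```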