Tutorial of EM Algorithm: A Top-down Approach
Hao Wu
Department of Computer Science and Technology, Tsinghua University, Haidian, Beijing, 100084
haowu06@mails.tsinghua.edu.cn
 Abstract
The EM algorithm is an important tool for estimating the parameters of models with hidden variables. In this document we give a brief tutorial on this algorithm. We first investigate why the EM algorithm is needed, and then apply it to the task of estimating the parameters of Gaussian mixture models.
I. INTRODUCTION
The EM algorithm is a powerful tool for estimating the parameters of graphical models with hidden variables. A deep understanding of this algorithm greatly helps in understanding other useful machine learning methods. In this document we give a brief tutorial on the EM algorithm in a top-down manner, which should make it easier to learn. In Section II we investigate why the EM algorithm is needed. Then in Section III we apply the algorithm to the task of estimating the parameters of Gaussian mixture models. Finally, we conclude in Section IV.
II. WHY EM ALGORITHM?
First of all, we have a set of observed data points
$x = (x_1, x_2, \ldots, x_{N_1})$ and a set of unobserved data points $y = (y_1, y_2, \ldots, y_{N_2})$. $x$ and $y$ are generated by a graphical model with parameter $\theta$. Given $x$, we want to estimate the best parameter of the model. In other words, we want to find the $\theta^*$ that maximizes the log-likelihood function $L(\theta) := \log p(x \mid \theta) = \log \sum_y p(x, y \mid \theta)$, i.e.
$$\theta^* = \arg\max_\theta L(\theta) = \arg\max_\theta \log \sum_y p(x, y \mid \theta).$$
But this is an intractable task, since the summation sits inside the logarithm. Can we make it a little easier? Here are some intuitive considerations:
• It would be better to exchange the summation and the logarithm.
• If we can find a lower bound of the log-likelihood $L(\theta)$, then we can maximize this lower bound instead.
For any distribution
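$q(y)$, we have $\sum_y q(y) = 1$. Thus, according to Jensen's inequality [1],
$$\varphi\left(\frac{\sum_i a_i x_i}{\sum_i a_i}\right) \ge \frac{\sum_i a_i \varphi(x_i)}{\sum_i a_i},$$
where $\varphi$ is a concave function and the weights $a_i$ are positive, we have
$$\begin{aligned}
L(\theta) &= \log p(x \mid \theta) \\
&= \log \sum_y p(x, y \mid \theta) \\
&= \log \frac{\sum_y q(y) \frac{p(x, y \mid \theta)}{q(y)}}{\sum_y q(y)} \\
&\ge \frac{\sum_y q(y) \log \frac{p(x, y \mid \theta)}{q(y)}}{\sum_y q(y)} \\
&= \sum_y q(y) \log \frac{p(x, y \mid \theta)}{q(y)}.
\end{aligned}$$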
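The bound above can be checked numerically. The following is a minimal sketch, assuming a hidden variable with three possible values and made-up joint probabilities; the names `joint`, `lower_bound` and `random_distribution` are illustrative and do not come from the text.

```python
import math
import random

# Hypothetical values of p(x, y | theta) for one fixed observation x and
# three possible values of the hidden variable y (illustrative numbers only).
joint = [0.10, 0.25, 0.05]


def log_likelihood(joint):
    """L(theta) = log sum_y p(x, y | theta)."""
    return math.log(sum(joint))


def lower_bound(q, joint):
    """sum_y q(y) log(p(x, y | theta) / q(y)); terms with q(y) = 0 contribute 0."""
    return sum(qy * math.log(pxy / qy) for qy, pxy in zip(q, joint) if qy > 0)


def random_distribution(k, rng):
    """Draw a random probability vector of length k."""
    weights = [rng.random() + 1e-9 for _ in range(k)]
    total = sum(weights)
    return [w / total for w in weights]


rng = random.Random(0)
L = log_likelihood(joint)

# Evaluate the bound for many arbitrary distributions q.
bounds = [lower_bound(random_distribution(len(joint), rng), joint)
          for _ in range(1000)]

# The posterior p(y | x, theta) is the joint renormalised over y.
posterior = [pxy / sum(joint) for pxy in joint]
tight = lower_bound(posterior, joint)
```

Every entry of `bounds` stays at or below `L`, and choosing `q` equal to the posterior makes the bound tight, since each ratio inside the logarithm then equals $p(x \mid \theta)$.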
Let $L(q, \theta) := \sum_y q(y) \log \frac{p(x, y \mid \theta)}{q(y)}$. It can easily be observed that $L(q, \theta)$ is a family of lower bounds of $L(\theta)$, indexed by the distribution $q$. In addition, we also have
$$\begin{aligned}
L(q, \theta) &= \sum_y q(y) \log \frac{p(x, y \mid \theta)}{q(y)} \\
&= \sum_y q(y) \log \frac{p(y \mid x, \theta)\, p(x \mid \theta)}{q(y)} \\
&= \sum_y q(y) \big( \log p(y \mid x, \theta) + \log p(x \mid \theta) - \log q(y) \big) \\
&= \sum_y q(y) \log p(x \mid \theta) + \sum_y q(y) \big( \log p(y \mid x, \theta) - \log q(y) \big) \\
&= \log p(x \mid \theta) + \sum_y q(y) \log \frac{p(y \mid x, \theta)}{q(y)} \\
&= L(\theta) + \sum_y q(y) \log \frac{p(y \mid x, \theta)}{q(y)}.
\end{aligned}$$
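The text does not name the second term, but it is the negative of the Kullback-Leibler divergence from $q$ to the posterior. Written that way (the notation $\mathrm{KL}$ is an addition here, not the author's), the gap between the log-likelihood and the bound is:

```latex
L(\theta) - L(q, \theta)
  = -\sum_y q(y) \log \frac{p(y \mid x, \theta)}{q(y)}
  = \mathrm{KL}\big( q \,\|\, p(\cdot \mid x, \theta) \big) \;\ge\; 0,
\qquad \text{with equality if and only if } q(y) = p(y \mid x, \theta) \text{ for all } y.
```

This is the same lower-bound property obtained from Jensen's inequality, seen from the other side: the bound is loose by exactly the divergence between $q$ and the posterior.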
If we set $q(y) = p(y \mid x, \theta)$ for some fixed value of $\theta$, every logarithm in the second term becomes $\log 1 = 0$, so $L(q, \theta)$ is equal to $L(\theta)$. Thus the task of maximizing $L(\theta)$ with respect to $\theta$ is equivalent to the task of maximizing $L(q, \theta)$ with respect to both $q$ and $\theta$. This is guaranteed by the following lemma.
Lemma 1: If $L(q^*, \theta^*)$ is the global maximum of $L(q, \theta)$, then we have $q^*(y) = p(y \mid x, \theta^*)$, and $L(\theta^*) = L(q^*, \theta^*)$ is the global maximum of $L(\theta)$.
 
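Proof: For fixed $\theta^*$, the bound $L(q, \theta^*)$ never exceeds $L(\theta^*)$ and attains it exactly when $q(y) = p(y \mid x, \theta^*)$, so a global maximizer must satisfy $q^*(y) = p(y \mid x, \theta^*)$ and $L(q^*, \theta^*) = L(\theta^*)$. Now assume that there exists $\theta'$ with $L(\theta') > L(\theta^*) = L(q^*, \theta^*)$. Setting $q'(y) = p(y \mid x, \theta')$, we have $L(q', \theta') = L(\theta') > L(\theta^*) = L(q^*, \theta^*)$, which contradicts the fact that $L(q^*, \theta^*)$ is the global maximum of $L(q, \theta)$.
Generally, finding the global maximum of $L(q, \theta)$ is still intractable, so in practice we only look for a local maximum. If we can find a local maximum $L(q^*, \theta^*)$, then $L(\theta^*)$ is also a local maximum of $L(\theta)$. This is guaranteed by the following lemma.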
Lemma 2: If $L(q^*, \theta^*)$ is a local maximum of $L(q, \theta)$, then we have $q^*(y) = p(y \mid x, \theta^*)$, and $L(\theta^*) = L(q^*, \theta^*)$ is a local maximum of $L(\theta)$.
Proof: Omitted here. An intuitive explanation can be found in [2].
We can introduce an iterative procedure that finds a local maximum by estimating $q$ and $\theta$ separately:
ALGORITHM 1: LocalMaximum($L(q, \theta)$)
INPUT: $L(q, \theta)$: the family of lower bounds of the log-likelihood.
OUTPUT: $L(q^*, \theta^*)$: the local maximum of the lower bound and the corresponding parameters, $q^*$ and $\theta^*$.
1. $\theta^{[0]} \leftarrow$ initial value; $t \leftarrow 0$;
2. do
3.     $t \leftarrow t + 1$;
4.     $q^{[t]} \leftarrow \arg\max_q L(q, \theta^{[t-1]})$; ($t$-th E-step)
5.     $\theta^{[t]} \leftarrow \arg\max_\theta L(q^{[t]}, \theta)$; ($t$-th M-step)
6. until convergence;
7. return $L(q^{[t]}, \theta^{[t]})$.
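The loop of Algorithm 1 can be sketched in Python as follows. This is only an illustration under stated assumptions: the function `local_maximum` and its arguments are hypothetical names, and the toy model (binary observations drawn from two components with known emission probabilities, where $\theta$ is the unknown weight of component A) is invented for the demonstration and is not the example used in this tutorial.

```python
import math


def local_maximum(e_step, m_step, bound, theta0, tol=1e-10, max_iter=1000):
    """Alternate E- and M-steps until the lower bound changes by less than tol."""
    theta = theta0
    previous = -math.inf
    for _ in range(max_iter):
        q = e_step(theta)              # line 4: q[t] maximises L(q, theta[t-1])
        theta = m_step(q)              # line 5: theta[t] maximises L(q[t], theta)
        current = bound(q, theta)
        if current - previous < tol:   # line 6: convergence test
            break
        previous = current
    return q, theta, current


# Toy model (assumed for illustration): P(x = 1 | A) = 0.9, P(x = 1 | B) = 0.2.
P_ONE = {"A": 0.9, "B": 0.2}
data = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]


def emission(x, comp):
    return P_ONE[comp] if x == 1 else 1.0 - P_ONE[comp]


def e_step(theta):
    """q[i] = posterior probability that observation i came from component A."""
    q = []
    for x in data:
        a = theta * emission(x, "A")
        b = (1.0 - theta) * emission(x, "B")
        q.append(a / (a + b))
    return q


def m_step(q):
    """Closed-form maximiser of the bound over theta for this toy model."""
    return sum(q) / len(q)


def bound(q, theta):
    """L(q, theta) = sum over i and y of q_i(y) log(p(x_i, y | theta) / q_i(y))."""
    total = 0.0
    for x, qa in zip(data, q):
        for comp, qy, prior in (("A", qa, theta), ("B", 1.0 - qa, 1.0 - theta)):
            if qy > 0:
                total += qy * math.log(prior * emission(x, comp) / qy)
    return total


q_hat, theta_hat, value = local_maximum(e_step, m_step, bound, theta0=0.5)
```

For this toy data the marginal likelihood is maximized at $\theta = 4/7$ (the value that makes $0.2 + 0.7\theta$ match the observed fraction of ones, $0.6$), and `theta_hat` approaches that value.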
In line 4, we can easily find $q^{[t]}$ by setting $q^{[t]}(y) = p(y \mid x, \theta^{[t-1]})$. In most cases, maximizing $L(q^{[t]}, \theta)$ in line 5 is relatively easy because the logarithm is inside the summation (see the definition of $L(q, \theta)$). The iteration stops when the change of $L(q, \theta)$ is smaller than a given threshold.
Now we have an easier way to approximately estimate the parameter $\theta$. In short, this method, which is called the EM algorithm, is just a procedure to find a local maximum of the lower bound of the log-likelihood. In practice the resulting model parameter $\theta$ is often close to the global maximizer, although this depends on the initial value $\theta^{[0]}$, so the method is widely used today.
III. AN EXAMPLE OF EM ALGORITHM
Now consider a very simple setting. We have $N$ numbers $X = (X_1, X_2, \ldots, X_N)$, each of which is generated by one of $M$ Gaussian mixture components $C = (C_1, C_2, \ldots, C_M)$. We assume that all of the components share the same variance $\sigma^2$, but their means $\mu = (\mu_1, \mu_2, \ldots, \mu_M)$ are different. Each component $C_i$ has a weight $\alpha_i$, and we use $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_M)$ to denote them. The log-likelihood of generating all the numbers $X$ takes the form
$$L(\mu) = \log P(X \mid \mu) = \sum_{i=1}^{N} \log \sum_{j=1}^{M} \alpha_j\, \mathcal{N}(X_i \mid \mu_j, \sigma^2)$$
with respect to the parameter $\mu$ of the model. Our task is to estimate $\mu$ given the data $X$ and the other parameters, $\sigma^2$ and $\alpha$.
First of all, let's calculate
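$\mu$ by using $L(\mu)$ directly. The first-order derivative of $L(\mu)$ takes the form
$$\frac{\partial L(\mu)}{\partial \mu_k} = \sum_{i=1}^{N} \left( \frac{\alpha_k\, \mathcal{N}(X_i \mid \mu_k, \sigma^2)}{\sum_{j=1}^{M} \alpha_j\, \mathcal{N}(X_i \mid \mu_j, \sigma^2)} \cdot \frac{1}{\sigma^2} (X_i - \mu_k) \right).$$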
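As a sanity check on this derivative, the sketch below compares it against a central finite difference of the log-likelihood. The data, means, weights and variance are made-up illustrative values, and the function names are hypothetical.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def log_likelihood(xs, mus, alphas, var):
    """L(mu) = sum_i log sum_j alpha_j N(x_i | mu_j, var)."""
    return sum(
        math.log(sum(a * normal_pdf(x, m, var) for a, m in zip(alphas, mus)))
        for x in xs
    )


def gradient(xs, mus, alphas, var):
    """Analytic dL/dmu_k: responsibility of component k times (x_i - mu_k) / var."""
    grad = [0.0] * len(mus)
    for x in xs:
        weighted = [a * normal_pdf(x, m, var) for a, m in zip(alphas, mus)]
        total = sum(weighted)
        for k, m in enumerate(mus):
            grad[k] += weighted[k] / total * (x - m) / var
    return grad


# Illustrative values only.
xs = [-2.1, -1.8, 0.3, 1.9, 2.4]
mus = [-1.0, 1.5]
alphas = [0.4, 0.6]
var = 1.0

analytic = gradient(xs, mus, alphas, var)

# Central finite differences in each coordinate of mu.
eps = 1e-6
numeric = []
for k in range(len(mus)):
    up = list(mus)
    up[k] += eps
    down = list(mus)
    down[k] -= eps
    diff = log_likelihood(xs, up, alphas, var) - log_likelihood(xs, down, alphas, var)
    numeric.append(diff / (2.0 * eps))
```

The two lists agree to within the finite-difference error. Note that $\mu_k$ enters `gradient` twice, once through the responsibility ratio and once through the factor `(x - m)`, which is what makes setting the derivative to zero awkward.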
The unknown $\mu_k$ appears both in the linear factor and inside the ratio, so the equation $\partial L(\mu) / \partial \mu_k = 0$ is hard to solve in closed form.
Now we use the EM algorithm. First we introduce a set of random variables $Z = \{Z_{ij}\}$. $Z_{ij}$ equals 1 if $X_i$ is generated by $C_j$, and equals 0 otherwise. Each $Z_{ij}$ is a hidden variable and we have little idea about its distribution, so it is convenient to use the EM algorithm here. Let $L(Q, \mu) = \sum_Z Q(Z) \log \frac{P(X, Z \mid \mu)}{Q(Z)}$.
E-step (maximization of $L(Q, \mu^{[t-1]})$ with respect to $Q$):
$$Q^{[t]}(Z) \leftarrow P(Z \mid X, \mu^{[t-1]}).$$
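M-step (maximization of $L(Q^{[t]}, \mu)$ with respect to $\mu$):
$$\begin{aligned}
\mu^{[t]} &\leftarrow \arg\max_\mu L(Q^{[t]}, \mu) \\
&= \arg\max_\mu \Big( \sum_Z Q^{[t]}(Z) \log P(X, Z \mid \mu) - \sum_Z Q^{[t]}(Z) \log Q^{[t]}(Z) \Big) \\
&= \arg\max_\mu \sum_Z Q^{[t]}(Z) \log P(X, Z \mid \mu) \\
&= \arg\max_\mu \sum_Z P(Z \mid X, \mu^{[t-1]}) \log P(X, Z \mid \mu) \\
&= \arg\max_\mu E_Z\big[ \log P(X, Z \mid \mu) \mid X, \mu^{[t-1]} \big] \\
&= \arg\max_\mu \mathcal{Q}(\mu \mid \mu^{[t-1]}).
\end{aligned}$$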
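So the key task is to compute the $\mu$ that maximizes $\mathcal{Q}(\mu \mid \mu^{[t-1]})$. Using $P(X, Z \mid \mu) = P(X \mid Z, \mu) P(Z \mid \mu)$ we have
$$\begin{aligned}
\log P(X, Z \mid \mu) &= \log \prod_{i=1}^{N} P(X_i, Z_i \mid \mu) \\
&= \log \prod_{i=1}^{N} P(X_i \mid Z_i, \mu)\, P(Z_i \mid \mu) \\
&= \log \prod_{i=1}^{N} \Big( \prod_{j=1}^{M} \mathcal{N}(X_i \mid \mu_j, \sigma^2)^{Z_{ij}} \prod_{j=1}^{M} \alpha_j^{Z_{ij}} \Big) \\
&= \log \prod_{i=1}^{N} \prod_{j=1}^{M} \big( \alpha_j\, \mathcal{N}(X_i \mid \mu_j, \sigma^2) \big)^{Z_{ij}} \\
&= \sum_{i=1}^{N} \sum_{j=1}^{M} Z_{ij} \log \big( \alpha_j\, \mathcal{N}(X_i \mid \mu_j, \sigma^2) \big).
\end{aligned}$$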
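To make the E-step and M-step concrete, here is a hedged sketch for this setting (known weights and shared variance, unknown means). The mean update used in `m_step` is the standard closed form obtained by taking the expectation of the expression above and setting its derivative with respect to $\mu_j$ to zero; that derivation is not part of the text up to this point, so treat it as an assumption of the sketch. The synthetic data, seed, starting values and function names are all illustrative, and the loop monitors $L(\mu)$, which coincides with the lower bound right after each E-step.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def log_likelihood(xs, mus, alphas, var):
    """L(mu) = sum_i log sum_j alpha_j N(x_i | mu_j, var)."""
    return sum(
        math.log(sum(a * normal_pdf(x, m, var) for a, m in zip(alphas, mus)))
        for x in xs
    )


def e_step(xs, mus, alphas, var):
    """gammas[i][j] = P(Z_ij = 1 | X_i, mu), the expected value of Z_ij."""
    gammas = []
    for x in xs:
        weighted = [a * normal_pdf(x, m, var) for a, m in zip(alphas, mus)]
        total = sum(weighted)
        gammas.append([w / total for w in weighted])
    return gammas


def m_step(xs, gammas, n_components):
    """Assumed standard update: mu_j = sum_i gamma_ij x_i / sum_i gamma_ij."""
    mus = []
    for j in range(n_components):
        weight = sum(g[j] for g in gammas)
        mus.append(sum(g[j] * x for g, x in zip(gammas, xs)) / weight)
    return mus


def em_means(xs, mus, alphas, var, tol=1e-9, max_iter=500):
    """Iterate E- and M-steps until the log-likelihood gain drops below tol."""
    previous = log_likelihood(xs, mus, alphas, var)
    for _ in range(max_iter):
        gammas = e_step(xs, mus, alphas, var)
        mus = m_step(xs, gammas, len(mus))
        current = log_likelihood(xs, mus, alphas, var)
        if current - previous < tol:
            break
        previous = current
    return mus


# Synthetic data from two well-separated components (illustrative values).
rng = random.Random(0)
true_mus, alphas, var = [-3.0, 3.0], [0.5, 0.5], 1.0
xs = [
    rng.gauss(true_mus[0 if rng.random() < alphas[0] else 1], math.sqrt(var))
    for _ in range(400)
]

initial = [-1.0, 1.0]
estimate = em_means(xs, initial, alphas, var)
```

With components this far apart the estimated means land near the generating values; with overlapping components or a poor starting point the procedure can settle in a different local maximum.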