Proof: Assume that $\exists\,\theta$ such that $L(\theta) > L(\theta^*) = L(q^*, \theta^*)$. Setting $q(y) = p(y \mid x, \theta)$, we have $L(q, \theta) = L(\theta) > L(\theta^*) = L(q^*, \theta^*)$, which contradicts the fact that $L(q^*, \theta^*)$ is the global maximum of $L(q, \theta)$.

Generally, finding the global maximum of $L(q, \theta)$ is still intractable, so in practice we only look for a local maximum. If we can find a local maximum $L(q^*, \theta^*)$, then $L(\theta^*)$ is also a local maximum of $L(\theta)$. This is guaranteed by the following lemma.
Lemma 2: If $L(q^*, \theta^*)$ is a local maximum of $L(q, \theta)$, then we have $q^*(y) = p(y \mid x, \theta^*)$, and $L(\theta^*) = L(q^*, \theta^*)$ is a local maximum of $L(\theta)$.

Proof: Omitted here. An intuitive explanation can be found in [2].

We can introduce an iterative procedure to find the local maximum by estimating $q$ and $\theta$ separately:
ALGORITHM 1: LocalMaximum($L(q, \theta)$)

INPUT: $L(q, \theta)$: the family of lower bounds of the log-likelihood.

OUTPUT: $L(q^*, \theta^*)$: the local maximum of the lower bound and corresponding parameters, $q^*$ and $\theta^*$.

1. $\theta^{[0]} \leftarrow$ initial value; $t \leftarrow 0$;
2. do
3. $t \leftarrow t + 1$;
4. $q^{[t]} \leftarrow \arg\max_{q} L(q, \theta^{[t-1]})$; ($t$-th E-step)
5. $\theta^{[t]} \leftarrow \arg\max_{\theta} L(q^{[t]}, \theta)$; ($t$-th M-step)
6. until convergence;
7. return $L(q^{[t]}, \theta^{[t]})$.
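The loop in Algorithm 1 can be sketched in Python as a generic routine. This is only a minimal sketch under stated assumptions, not the author's implementation: the names `local_maximum`, `e_step`, `m_step`, `bound`, `tol`, and `max_iter` are introduced here for illustration, and the caller is assumed to supply both argmax steps (lines 4 and 5 of Algorithm 1) as functions.

```python
from typing import Callable, Tuple, TypeVar

Q = TypeVar("Q")  # type of the distribution q (hypothetical placeholder)
T = TypeVar("T")  # type of the parameter theta (hypothetical placeholder)


def local_maximum(
    e_step: Callable[[T], Q],
    m_step: Callable[[Q], T],
    bound: Callable[[Q, T], float],
    theta0: T,
    tol: float = 1e-9,
    max_iter: int = 1000,
) -> Tuple[Q, T, float]:
    """Alternate E- and M-steps until the lower bound stops changing."""
    if max_iter < 1:
        raise ValueError("max_iter must be at least 1")
    theta = theta0
    previous = float("-inf")
    for _ in range(max_iter):
        q = e_step(theta)            # Algorithm 1, line 4: argmax over q
        theta = m_step(q)            # Algorithm 1, line 5: argmax over theta
        current = bound(q, theta)    # value of L(q[t], theta[t])
        if abs(current - previous) < tol:  # Algorithm 1, line 6: convergence
            break
        previous = current
    return q, theta, current
```

The `max_iter` cap is an extra safeguard that Algorithm 1 does not have; it only guarantees termination when the bound keeps changing by more than `tol`.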
In line 4 of Algorithm 1, we can easily find $q^{[t]}$ by setting $q^{[t]}(y) = p(y \mid x, \theta^{[t-1]})$. In most cases, maximizing $L(q^{[t]}, \theta)$ in line 5 is relatively easy because the logarithm is inside the summation (see the definition of $L(q, \theta)$). The iteration stops when the change in $L(q, \theta)$ is smaller than a given threshold.

Now we have an easier way to approximately estimate the parameter $\theta$. In short, this method, which is also called the EM algorithm, is just a procedure to find a local maximum of the lower bound of the log-likelihood. In practice the resulting model parameter $\theta^*$ is often very close to the global maximizer, so this method is widely used today.

III. AN EXAMPLE OF EM ALGORITHM
Now consider a very simple setting. We have $M$ numbers $X = (X_1, X_2, \ldots, X_M)$, each of which is generated by one of $N$ Gaussian mixture components $C = (C_1, C_2, \ldots, C_N)$. We assume that all of the components share the same variance $\sigma^2$, but their means $\mu = (\mu_1, \mu_2, \ldots, \mu_N)$ are different. Each component $C_i$ has a weight $\alpha_i$, and we use $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_N)$ to denote the weights. The log-likelihood of generating all the numbers $X$ takes the form

$$L(\mu) = \log P(X \mid \mu) = \sum_{i=1}^{M} \log \sum_{j=1}^{N} \alpha_j \, \mathcal{N}(X_i \mid \mu_j, \sigma^2)$$

with respect to the parameter $\mu$ of the model. Our task is to estimate $\mu$ given the data $X$ and the other parameters, $\sigma^2$ and $\alpha$.

First of all, let's calculate $\mu^*$ by using $L(\mu)$ directly. The first-order derivative of $L(\mu)$ takes the form

$$\frac{\partial L(\mu)}{\partial \mu_k} = \sum_{i=1}^{M} \left( \frac{\alpha_k \, \mathcal{N}(X_i \mid \mu_k, \sigma^2)}{\sum_{j=1}^{N} \alpha_j \, \mathcal{N}(X_i \mid \mu_j, \sigma^2)} \cdot \frac{1}{\sigma^2} (X_i - \mu_k) \right).$$
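The derivative above can be checked numerically against a finite-difference approximation of $L(\mu)$. The following is a small sketch using only the Python standard library; the function names (`normal_pdf`, `log_likelihood`, `gradient`, `numerical_gradient`) and the step size `eps` are assumptions chosen for illustration.

```python
import math


def normal_pdf(x, mu, sigma2):
    """Density of N(x | mu, sigma2)."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)


def log_likelihood(xs, mus, alphas, sigma2):
    """L(mu) = sum_i log sum_j alpha_j N(X_i | mu_j, sigma2)."""
    return sum(
        math.log(sum(a * normal_pdf(x, m, sigma2) for a, m in zip(alphas, mus)))
        for x in xs
    )


def gradient(xs, mus, alphas, sigma2):
    """Analytic dL/dmu_k from the formula in the text."""
    grad = []
    for k, mu_k in enumerate(mus):
        total = 0.0
        for x in xs:
            mixture = sum(a * normal_pdf(x, m, sigma2) for a, m in zip(alphas, mus))
            weight = alphas[k] * normal_pdf(x, mu_k, sigma2) / mixture
            total += weight * (x - mu_k) / sigma2
        grad.append(total)
    return grad


def numerical_gradient(xs, mus, alphas, sigma2, eps=1e-6):
    """Central finite differences of L(mu), one coordinate at a time."""
    grad = []
    for k in range(len(mus)):
        up = list(mus)
        down = list(mus)
        up[k] += eps
        down[k] -= eps
        diff = log_likelihood(xs, up, alphas, sigma2) - log_likelihood(xs, down, alphas, sigma2)
        grad.append(diff / (2.0 * eps))
    return grad
```

Calling `gradient` and `numerical_gradient` on the same inputs should give values that agree up to the finite-difference error.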
The equation $\partial L(\mu) / \partial \mu_k = 0$ is hard to solve: the unknown means appear both inside the ratio and in the factor $(X_i - \mu_k)$, so there is no closed-form solution.

Now we use the EM algorithm. First we introduce a set of random variables $\{Z_{ij}\}$. $Z_{ij}$ equals 1 if $X_i$ is generated by $C_j$, and 0 otherwise. Each $Z_{ij}$ is a hidden variable and we have little idea about its distribution, so it is convenient to use the EM algorithm here. Let

$$L(Q, \mu) = \sum_{Z} Q(Z) \log \frac{P(X, Z \mid \mu)}{Q(Z)}.$$

E-step (maximization of $L(Q, \mu^{[t-1]})$ with respect to $Q$):

$$Q^{[t]}(Z) \leftarrow P(Z \mid X, \mu^{[t-1]}).$$
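For this mixture model the posterior in the E-step can be written out per data point. By Bayes' rule, $P(Z_{ij} = 1 \mid X_i, \mu) = \alpha_j \mathcal{N}(X_i \mid \mu_j, \sigma^2) / \sum_{l} \alpha_l \mathcal{N}(X_i \mid \mu_l, \sigma^2)$; this expression is not stated in the text above and is filled in here as a standard consequence of the model. A minimal sketch, with the names `normal_pdf` and `e_step` chosen for illustration:

```python
import math


def normal_pdf(x, mu, sigma2):
    """Density of N(x | mu, sigma2)."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)


def e_step(xs, mus, alphas, sigma2):
    """Return gammas[i][j] = P(Z_ij = 1 | X_i, mu) for the current means.

    Assumes every point has non-zero density under at least one component;
    otherwise the normalizer underflows to zero and the division fails.
    """
    gammas = []
    for x in xs:
        joint = [a * normal_pdf(x, m, sigma2) for a, m in zip(alphas, mus)]
        total = sum(joint)  # equals P(X_i | mu)
        gammas.append([value / total for value in joint])
    return gammas
```

Each row of the returned table sums to 1, and a point lying midway between two equally weighted components is assigned to each of them with probability 0.5.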
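M-step (maximization of $L(Q^{[t]}, \mu)$ with respect to $\mu$):

$$\begin{aligned}
\mu^{[t]} &\leftarrow \arg\max_{\mu} L(Q^{[t]}, \mu) \\
&= \arg\max_{\mu} \Big( \sum_{Z} Q^{[t]}(Z) \log P(X, Z \mid \mu) - \sum_{Z} Q^{[t]}(Z) \log Q^{[t]}(Z) \Big) \\
&= \arg\max_{\mu} \sum_{Z} Q^{[t]}(Z) \log P(X, Z \mid \mu) \\
&= \arg\max_{\mu} \sum_{Z} P(Z \mid X, \mu^{[t-1]}) \log P(X, Z \mid \mu) \\
&= \arg\max_{\mu} E_{Z}\big[ \log P(X, Z \mid \mu) \mid X, \mu^{[t-1]} \big] \\
&= \arg\max_{\mu} Q(\mu \mid \mu^{[t-1]}).
\end{aligned}$$

The second term in the second line does not depend on $\mu$, so it can be dropped. Here $Q(\mu \mid \mu^{[t-1]})$ denotes the expected complete-data log-likelihood, not the distribution $Q(Z)$.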
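So the key task is to compute the $\mu$ that maximizes $Q(\mu \mid \mu^{[t-1]})$. Using $P(X, Z \mid \mu) = P(X \mid Z, \mu) P(Z \mid \mu)$ we have

$$\begin{aligned}
\log P(X, Z \mid \mu) &= \log \prod_{i=1}^{M} P(X_i, Z_i \mid \mu) \\
&= \log \prod_{i=1}^{M} P(X_i \mid Z_i, \mu) P(Z_i \mid \mu) \\
&= \log \prod_{i=1}^{M} \Big( \prod_{j=1}^{N} \mathcal{N}(X_i \mid \mu_j, \sigma^2)^{Z_{ij}} \prod_{j=1}^{N} \alpha_j^{Z_{ij}} \Big) \\
&= \log \prod_{i=1}^{M} \prod_{j=1}^{N} \big( \alpha_j \, \mathcal{N}(X_i \mid \mu_j, \sigma^2) \big)^{Z_{ij}} \\
&= \sum_{i=1}^{M} \sum_{j=1}^{N} Z_{ij} \log \big( \alpha_j \, \mathcal{N}(X_i \mid \mu_j, \sigma^2) \big).
\end{aligned}$$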
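Putting the pieces together, the whole procedure for this example can be sketched as follows. The mean update used in the M-step, $\mu_j = \sum_i \gamma_{ij} X_i / \sum_i \gamma_{ij}$ with $\gamma_{ij} = P(Z_{ij} = 1 \mid X_i, \mu^{[t-1]})$, is not derived in the text above; it is the standard result of taking the expectation of the last expression and setting its derivative with respect to $\mu_j$ to zero, and is stated here as an assumption. The names `em_means`, `mus0`, `tol`, and `max_iter` are likewise illustrative.

```python
import math


def normal_pdf(x, mu, sigma2):
    """Density of N(x | mu, sigma2)."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)


def em_means(xs, alphas, sigma2, mus0, tol=1e-10, max_iter=500):
    """Estimate the component means with alpha and sigma2 held fixed."""
    mus = list(mus0)
    previous = float("-inf")
    for _ in range(max_iter):
        # E-step: responsibilities gamma_ij under the current means.
        gammas = []
        log_lik = 0.0
        for x in xs:
            joint = [a * normal_pdf(x, m, sigma2) for a, m in zip(alphas, mus)]
            total = sum(joint)
            log_lik += math.log(total)  # accumulates L(mu) at the current means
            gammas.append([value / total for value in joint])
        # Stop when the log-likelihood changes by less than the threshold.
        if abs(log_lik - previous) < tol:
            break
        previous = log_lik
        # M-step: responsibility-weighted average of the data per component.
        # Assumes each component keeps non-zero total responsibility.
        mus = [
            sum(g[j] * x for g, x in zip(gammas, xs)) / sum(g[j] for g in gammas)
            for j in range(len(mus))
        ]
    return mus
```

With two well-separated groups of numbers and equal weights, the returned means should land close to the two group averages, although the result in general depends on the initial values `mus0`.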