The Expectation Maximization Algorithm
A short tutorial
Sean Borman
Comments and corrections to: em-tut at seanborman dot com
July 18, 2004
Last updated January 09, 2009
Revision history

2009-01-09  Corrected grammar in the paragraph which precedes Equation (17). Changed datestamp format in the revision history.
2008-07-05  Corrected caption for Figure (2). Added conditioning on $\theta_n$ for $l$ in the convergence discussion in Section (3.2). Changed email contact info to reduce spam.
2006-10-14  Added explanation and disambiguating parentheses in the development leading to Equation (14). Minor corrections.
2006-06-28  Added Figure (1). Corrected typo above Equation (5). Minor corrections. Added hyperlinks.
2005-08-26  Minor corrections.
2004-07-18  Initial revision.
1 Introduction
This tutorial discusses the Expectation Maximization (EM) algorithm of Dempster, Laird and Rubin [1]. The approach taken follows that of an unpublished note by Stuart Russell, but fleshes out some of the gory details. In order to ensure that the presentation is reasonably self-contained, some of the results on which the derivation of the algorithm is based are presented prior to the main results. The EM algorithm has become a popular tool in statistical estimation problems involving incomplete data, or in problems which can be posed in a similar form, such as mixture estimation [3, 4]. The EM algorithm has also been used in various motion estimation frameworks [5], and variants of it have been used in multiframe superresolution restoration methods which combine motion estimation along the lines of [2].
2 Convex Functions

Definition 1 Let $f$ be a real valued function defined on an interval $I = [a, b]$. $f$ is said to be convex on $I$ if $\forall x_1, x_2 \in I$ and $\lambda \in [0, 1]$,
$$f(\lambda x_1 + (1 - \lambda) x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2). \qquad (1)$$
$f$ is said to be strictly convex if the inequality is strict, and concave (strictly concave) if $-f$ is convex (strictly convex). Intuitively, convexity states that the function never rises above the chord joining $(x_1, f(x_1))$ and $(x_2, f(x_2))$. See Figure (1).

[Figure 1: A convex function: the value $f(\lambda x_1 + (1 - \lambda) x_2)$ lies below the chord value $\lambda f(x_1) + (1 - \lambda) f(x_2)$ at the point $\lambda x_1 + (1 - \lambda) x_2$ between $x_1$ and $x_2$.]

Theorem 1 If $f(x)$ is twice differentiable on $[a, b]$ and $f''(x) \ge 0$ on $[a, b]$, then $f(x)$ is convex on $[a, b]$.

Proof: For $x \le y \in [a, b]$ and $\lambda \in [0, 1]$, let $z = \lambda y + (1 - \lambda) x$. Since $f(z) = \lambda f(z) + (1 - \lambda) f(z)$, the convexity condition $f(z) \le \lambda f(y) + (1 - \lambda) f(x)$ may be rearranged into the equivalent condition
$$\lambda [f(y) - f(z)] \ge (1 - \lambda) [f(z) - f(x)]. \qquad (2)$$
Note that $z - x = \lambda (y - x)$ and $y - z = (1 - \lambda)(y - x)$. By the mean value theorem there exists $s$, $x \le s \le z$, such that
$$f(z) - f(x) = f'(s)(z - x), \qquad (3)$$
and similarly there exists $t$, $z \le t \le y$, such that
$$f(y) - f(z) = f'(t)(y - z). \qquad (4)$$
Since $f''(x) \ge 0$ on $[a, b]$, $f'$ is non-decreasing, so $s \le t$ implies $f'(s) \le f'(t)$. Thus
\begin{align*}
(1 - \lambda)[f(z) - f(x)] &= (1 - \lambda) f'(s)(z - x) && \text{by Equation (3)} \\
&\le f'(t)(1 - \lambda)(z - x) && \text{since } f'(s) \le f'(t) \\
&= \lambda f'(t)(y - z) && \text{since } (1 - \lambda)(z - x) = \lambda (y - z) \\
&= \lambda [f(y) - f(z)] && \text{by Equation (4)},
\end{align*}
which establishes Equation (2), so $f$ is convex.
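As a quick numerical illustration of Theorem 1 (an addition to this writeup, not part of the original note), the following Python snippet spot-checks the defining inequality (1) for $f(x) = -\ln(x)$, which satisfies the hypothesis of the theorem since $f''(x) = 1/x^2 \ge 0$; the sampling interval and tolerance are arbitrary choices:

import math
import random

# f(x) = -ln(x) has f''(x) = 1/x^2 >= 0 on (0, inf),
# so by Theorem 1 it should be convex: the chord lies above the function.
f = lambda x: -math.log(x)

random.seed(0)
for _ in range(10000):
    x1 = random.uniform(0.1, 10.0)
    x2 = random.uniform(0.1, 10.0)
    lam = random.random()
    chord = lam * f(x1) + (1 - lam) * f(x2)
    assert f(lam * x1 + (1 - lam) * x2) <= chord + 1e-12  # Equation (1)

print("Convexity inequality (1) held at all sampled points.")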
Theorem 2 (Jensen's inequality) Let $f$ be a convex function defined on an interval $I$. If $x_1, \ldots, x_n \in I$ and $\lambda_1, \ldots, \lambda_n \ge 0$ with $\sum_{i=1}^{n} \lambda_i = 1$, then
$$f\left(\sum_{i=1}^{n} \lambda_i x_i\right) \le \sum_{i=1}^{n} \lambda_i f(x_i). \qquad (5)$$

Proof: The proof is by induction. For $n = 1$ the result is trivial, and for $n = 2$ it is Equation (1). Suppose the theorem holds for some $n$, and assume $\lambda_{n+1} < 1$ (otherwise the case $n + 1$ is again trivial). Then
\begin{align*}
f\left(\sum_{i=1}^{n+1} \lambda_i x_i\right)
&= f\left(\lambda_{n+1} x_{n+1} + (1 - \lambda_{n+1}) \frac{1}{1 - \lambda_{n+1}} \sum_{i=1}^{n} \lambda_i x_i\right) \\
&\le \lambda_{n+1} f(x_{n+1}) + (1 - \lambda_{n+1}) f\left(\frac{1}{1 - \lambda_{n+1}} \sum_{i=1}^{n} \lambda_i x_i\right) && \text{by Equation (1)} \\
&= \lambda_{n+1} f(x_{n+1}) + (1 - \lambda_{n+1}) f\left(\sum_{i=1}^{n} \frac{\lambda_i}{1 - \lambda_{n+1}} x_i\right) \\
&\le \lambda_{n+1} f(x_{n+1}) + (1 - \lambda_{n+1}) \sum_{i=1}^{n} \frac{\lambda_i}{1 - \lambda_{n+1}} f(x_i) && \text{by Equation (5) for } n \text{ points} \\
&= \lambda_{n+1} f(x_{n+1}) + \sum_{i=1}^{n} \lambda_i f(x_i) = \sum_{i=1}^{n+1} \lambda_i f(x_i),
\end{align*}
where the second inequality is the induction hypothesis, which applies since $\sum_{i=1}^{n} \lambda_i / (1 - \lambda_{n+1}) = 1$. For a concave function the inequality is reversed: $f(\sum_i \lambda_i x_i) \ge \sum_i \lambda_i f(x_i)$.
Since $\ln(x)$ is concave, we may apply Jensen's inequality to obtain the useful result,
$$\ln \sum_{i=1}^{n} \lambda_i x_i \ge \sum_{i=1}^{n} \lambda_i \ln(x_i). \qquad (6)$$
As an application, consider the arithmetic-geometric mean inequality,
$$\frac{1}{n} \sum_{i=1}^{n} x_i \ge \sqrt[n]{x_1 x_2 \cdots x_n}. \qquad (7)$$
Proof: Taking $\lambda_i = 1/n$ in Equation (6) gives
$$\ln \frac{1}{n} \sum_{i=1}^{n} x_i \ge \sum_{i=1}^{n} \frac{1}{n} \ln(x_i) = \frac{1}{n} \ln(x_1 x_2 \cdots x_n) = \ln (x_1 x_2 \cdots x_n)^{1/n}.$$
Thus, since $\ln(x)$ is monotone increasing, exponentiating both sides yields
$$\frac{1}{n} \sum_{i=1}^{n} x_i \ge \sqrt[n]{x_1 x_2 \cdots x_n}.$$
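To make Equations (6) and (7) concrete, here is a small numeric spot-check in Python (an illustrative addition, not part of the original derivation; the weights and data are random):

import math
import random

random.seed(1)
n = 5
x = [random.uniform(0.5, 20.0) for _ in range(n)]

# Random convex weights: lambda_i >= 0 summing to 1.
lam = [random.random() for _ in range(n)]
s = sum(lam)
lam = [l / s for l in lam]

# Equation (6): ln(sum lam_i x_i) >= sum lam_i ln(x_i).
lhs = math.log(sum(l * xi for l, xi in zip(lam, x)))
rhs = sum(l * math.log(xi) for l, xi in zip(lam, x))
assert lhs >= rhs - 1e-12

# Equation (7): arithmetic mean >= geometric mean (lambda_i = 1/n).
arith = sum(x) / n
geom = math.prod(x) ** (1.0 / n)
assert arith >= geom - 1e-12

print(f"(6): {lhs:.4f} >= {rhs:.4f}   (7): {arith:.4f} >= {geom:.4f}")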
3 The Expectation-Maximization Algorithm

3.1 Derivation of the EM-algorithm

Let $X$ denote the observed data. We wish to find the parameter $\theta$ for which $P(X|\theta)$ is a maximum; this is the maximum likelihood (ML) estimate of $\theta$. It is typical to work with the log likelihood function,
$$L(\theta) = \ln P(X|\theta). \qquad (8)$$
Since $\ln(x)$ is strictly increasing, the value of $\theta$ which maximizes $P(X|\theta)$ also maximizes $L(\theta)$. The EM algorithm is an iterative procedure for maximizing $L(\theta)$. Assume that after the $n$th iteration the current estimate for $\theta$ is given by $\theta_n$. We wish to compute an updated estimate $\theta$ such that
$$L(\theta) > L(\theta_n). \qquad (9)$$
Suppose the problem involves hidden or unobserved variables, denoted by the random vector $Z$ with realizations $z$. The total probability $P(X|\theta)$ may be written in terms of the hidden variables as
$$P(X|\theta) = \sum_z P(X|z, \theta) P(z|\theta), \qquad (10)$$
so that the difference to be maximized becomes
$$L(\theta) - L(\theta_n) = \ln \sum_z P(X|z, \theta) P(z|\theta) - \ln P(X|\theta_n). \qquad (11)$$
Notice that this expression involves the logarithm of a sum. In Section (2), using Jensen's inequality, it was shown that
$$\ln \sum_{i=1}^{n} \lambda_i x_i \ge \sum_{i=1}^{n} \lambda_i \ln(x_i)$$
for constants $\lambda_i \ge 0$ with $\sum_{i=1}^{n} \lambda_i = 1$. This result may be applied to Equation (11) provided that the constants $\lambda_i$ can be identified. Consider letting the constants be of the form $P(z|X, \theta_n)$. Since $P(z|X, \theta_n)$ is a probability measure, we have that $P(z|X, \theta_n) \ge 0$ and that $\sum_z P(z|X, \theta_n) = 1$ as required.
Then starting with Equation (11) the constants $P(z|X, \theta_n)$ are introduced as,
\begin{align*}
L(\theta) - L(\theta_n) &= \ln \left( \sum_z P(X|z, \theta) P(z|\theta) \right) - \ln P(X|\theta_n) \\
&= \ln \left( \sum_z P(X|z, \theta) P(z|\theta) \cdot \frac{P(z|X, \theta_n)}{P(z|X, \theta_n)} \right) - \ln P(X|\theta_n) \\
&= \ln \left( \sum_z P(z|X, \theta_n) \frac{P(X|z, \theta) P(z|\theta)}{P(z|X, \theta_n)} \right) - \ln P(X|\theta_n) \\
&\ge \sum_z P(z|X, \theta_n) \ln \left( \frac{P(X|z, \theta) P(z|\theta)}{P(z|X, \theta_n)} \right) - \ln P(X|\theta_n) \qquad (12) \\
&= \sum_z P(z|X, \theta_n) \ln \left( \frac{P(X|z, \theta) P(z|\theta)}{P(z|X, \theta_n) P(X|\theta_n)} \right) \qquad (13) \\
&\triangleq \Delta(\theta|\theta_n). \qquad (14)
\end{align*}
In going from Equation (12) to Equation (13) we made use of the fact that $\sum_z P(z|X, \theta_n) = 1$, so that $\ln P(X|\theta_n) = \sum_z P(z|X, \theta_n) \ln P(X|\theta_n)$, which allows the term $\ln P(X|\theta_n)$ to be brought into the summation.
We continue by writing
$$L(\theta) \ge L(\theta_n) + \Delta(\theta|\theta_n) \qquad (15)$$
and for convenience define,
$$l(\theta|\theta_n) \triangleq L(\theta_n) + \Delta(\theta|\theta_n),$$
so that the relationship in Equation (15) can be made explicit as $L(\theta) \ge l(\theta|\theta_n)$.
The function $l(\theta|\theta_n)$ is therefore bounded above by the likelihood function $L(\theta)$. Additionally, observe that at $\theta = \theta_n$ the two functions are equal:
\begin{align*}
l(\theta_n|\theta_n) &= L(\theta_n) + \Delta(\theta_n|\theta_n) \\
&= L(\theta_n) + \sum_z P(z|X, \theta_n) \ln \frac{P(X|z, \theta_n) P(z|\theta_n)}{P(z|X, \theta_n) P(X|\theta_n)} \\
&= L(\theta_n) + \sum_z P(z|X, \theta_n) \ln \frac{P(X, z|\theta_n)}{P(X, z|\theta_n)} \\
&= L(\theta_n) + \sum_z P(z|X, \theta_n) \ln 1 \\
&= L(\theta_n). \qquad (16)
\end{align*}
The EM algorithm therefore selects the updated estimate $\theta_{n+1}$ as the value of $\theta$ for which $l(\theta|\theta_n)$ is a maximum. This process is illustrated in Figure (2).

[Figure 2: Graphical interpretation of a single iteration of the EM algorithm: the function $l(\theta|\theta_n)$ is bounded above by the likelihood function $L(\theta)$, and the functions are equal at $\theta = \theta_n$. The EM algorithm chooses $\theta_{n+1}$ as the value of $\theta$ for which $l(\theta|\theta_n)$ is a maximum. Since $L(\theta) \ge l(\theta|\theta_n)$, increasing $l(\theta|\theta_n)$ ensures that the value of the likelihood function $L(\theta)$ is increased at each step.]
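The bound $L(\theta) \ge l(\theta|\theta_n)$ and the equality at $\theta = \theta_n$ can also be verified numerically. The following Python sketch (an addition for this tutorial; the two-component mixture with a single unknown mixing weight theta is an arbitrary toy model) evaluates $L$ and $l$ over a grid and asserts Equations (15) and (16):

import math
import random

# Toy model: P(x|theta) = theta*N(x;0,1) + (1-theta)*N(x;4,1),
# with the hidden variable z indicating the mixture component.
def norm_pdf(x, mu):
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

random.seed(2)
data = [random.gauss(0, 1) for _ in range(30)] + [random.gauss(4, 1) for _ in range(70)]

def L(theta):  # log likelihood, summed over the i.i.d. samples
    return sum(math.log(theta * norm_pdf(x, 0) + (1 - theta) * norm_pdf(x, 4)) for x in data)

def l(theta, theta_n):  # l(theta|theta_n) = L(theta_n) + Delta(theta|theta_n)
    val = L(theta_n)
    for x in data:
        px_n = theta_n * norm_pdf(x, 0) + (1 - theta_n) * norm_pdf(x, 4)
        # Posterior responsibilities P(z|x, theta_n) for z in {1, 2}.
        post = [theta_n * norm_pdf(x, 0) / px_n, (1 - theta_n) * norm_pdf(x, 4) / px_n]
        # Joint terms P(x|z, theta) P(z|theta).
        joint = [theta * norm_pdf(x, 0), (1 - theta) * norm_pdf(x, 4)]
        val += sum(p * math.log(j / (p * px_n)) for p, j in zip(post, joint))
    return val

theta_n = 0.5
assert abs(l(theta_n, theta_n) - L(theta_n)) < 1e-9  # Equation (16)
for k in range(1, 20):
    t = k / 20.0
    assert l(t, theta_n) <= L(t) + 1e-9               # Equation (15)
print("l(theta|theta_n) touches L at theta_n and lies below it elsewhere.")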
Formally we have,
\begin{align*}
\theta_{n+1} &= \arg\max_{\theta} \left\{ l(\theta|\theta_n) \right\} \\
&= \arg\max_{\theta} \left\{ L(\theta_n) + \sum_z P(z|X, \theta_n) \ln \frac{P(X|z, \theta) P(z|\theta)}{P(X|\theta_n) P(z|X, \theta_n)} \right\}.
\end{align*}
Now dropping terms which are constant with respect to $\theta$,
\begin{align*}
\theta_{n+1} &= \arg\max_{\theta} \left\{ \sum_z P(z|X, \theta_n) \ln P(X|z, \theta) P(z|\theta) \right\} \\
&= \arg\max_{\theta} \left\{ \sum_z P(z|X, \theta_n) \ln \frac{P(X, z, \theta)}{P(z, \theta)} \frac{P(z, \theta)}{P(\theta)} \right\} \\
&= \arg\max_{\theta} \left\{ \sum_z P(z|X, \theta_n) \ln P(X, z|\theta) \right\} \\
&= \arg\max_{\theta} \left\{ \mathrm{E}_{Z|X, \theta_n} \{ \ln P(X, z|\theta) \} \right\}. \qquad (17)
\end{align*}
In Equation (17) the expectation and maximization steps are apparent. The EM algorithm thus consists of iterating the:
1. E-step: Determine the conditional expectation $\mathrm{E}_{Z|X, \theta_n} \{ \ln P(X, z|\theta) \}$.
2. M-step: Maximize this expression with respect to $\theta$.
At this point it is fair to ask what has been gained, given that we have simply traded the maximization of $L(\theta)$ for the maximization of $l(\theta|\theta_n)$. The answer lies in the fact that $l(\theta|\theta_n)$ takes into account the unobserved or missing data $Z$. In the case where we wish to estimate these variables, the EM algorithm provides a framework for doing so. Also, as alluded to earlier, it may be convenient to introduce such hidden variables so that the maximization of $l(\theta|\theta_n)$, given knowledge of the hidden variables, is simpler than a direct maximization of $L(\theta)$.
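As a concrete instance of these two steps (a sketch added for this tutorial, not code from the original note; all names and initial values are illustrative), the following Python fragment runs EM on a two-component Gaussian mixture with unknown means and mixing weight. With unit variances assumed known, maximizing $\mathrm{E}_{Z|X, \theta_n} \{ \ln P(X, z|\theta) \}$ in the M-step has the familiar closed form used below, and the assertion checks the monotone-likelihood property established above:

import math
import random

def norm_pdf(x, mu):
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

random.seed(3)
data = [random.gauss(-2, 1) for _ in range(60)] + [random.gauss(3, 1) for _ in range(40)]

# Initial guesses for theta = (mixing weight, mu1, mu2); unit variances known.
w, mu1, mu2 = 0.5, -1.0, 1.0
prev_ll = -float("inf")

for it in range(50):
    # E-step: the conditional expectation reduces to the
    # responsibilities P(z|x, theta_n) for each sample.
    resp = []
    for x in data:
        p1 = w * norm_pdf(x, mu1)
        p2 = (1 - w) * norm_pdf(x, mu2)
        resp.append(p1 / (p1 + p2))

    # M-step: maximize the expected complete-data log likelihood in closed form.
    n1 = sum(resp)
    w = n1 / len(data)
    mu1 = sum(r * x for r, x in zip(resp, data)) / n1
    mu2 = sum((1 - r) * x for r, x in zip(resp, data)) / (len(data) - n1)

    # The likelihood never decreases, per Equation (16) and the surrounding argument.
    ll = sum(math.log(w * norm_pdf(x, mu1) + (1 - w) * norm_pdf(x, mu2)) for x in data)
    assert ll >= prev_ll - 1e-9
    prev_ll = ll

print(f"w={w:.3f}, mu1={mu1:.3f}, mu2={mu2:.3f}, log-likelihood={prev_ll:.3f}")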
3.2 Convergence of the EM Algorithm

Since $\theta_{n+1}$ is chosen to maximize $l(\theta|\theta_n)$, we have $l(\theta_{n+1}|\theta_n) \ge l(\theta_n|\theta_n)$, and therefore
$$L(\theta_{n+1}) \ge l(\theta_{n+1}|\theta_n) \ge l(\theta_n|\theta_n) = L(\theta_n),$$
so the likelihood $L(\theta)$ is non-decreasing at each iteration of the algorithm.

3.3 The Generalized EM Algorithm

In the Generalized EM (GEM) algorithm the requirement that $l(\theta|\theta_n)$ be maximized in the M-step is relaxed: it suffices to choose $\theta_{n+1}$ such that $l(\theta_{n+1}|\theta_n) \ge l(\theta_n|\theta_n)$, which by the argument above is still sufficient to ensure that the likelihood is non-decreasing.
References
[1] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B, 39(1):1-38, November 1977.
[2] R. C. Hardie, K. J. Barnard, and E. E. Armstrong. Joint MAP registration and high-resolution image estimation using a sequence of undersampled images. IEEE Transactions on Image Processing, 6(12):1621-1633, December 1997.
[3] Geoffrey McLachlan and Thriyambakam Krishnan. The EM Algorithm and
Extensions. John Wiley & Sons, New York, 1996.
[4] Geoffrey McLachlan and David Peel. Finite Mixture Models. John Wiley &
Sons, New York, 2000.
[5] Yair Weiss. Bayesian motion estimation and segmentation. PhD thesis,
Massachusetts Institute of Technology, May 1998.