Thomas P Minka
Statistics Dept.
Carnegie Mellon University
Pittsburgh, PA 15213
Bayesian inference is often hampered by large computational expense. Fast and accurate approximation methods are therefore very important and can have great impact. This paper presents a new deterministic algorithm, Expectation Propagation, which achieves higher accuracy than existing approximation algorithms with similar computational cost.

In belief networks with loops it is known that approximate marginal distributions can be obtained by iterating the belief propagation recursions, a process known as loopy belief propagation (Frey & MacKay, 1997; Murphy et al., 1999). In section 4, this turns out to be a special case of Expectation Propagation, where the approximation is a completely disconnected network. Expectation Propagation is more general than belief propagation in two ways: (1) like variational methods, it can use approximations which are not completely disconnected, and (2) it can impose useful constraints on functional form, such as multivariate Gaussian.

2 ASSUMED-DENSITY FILTERING

This section reviews the idea of assumed-density filtering (ADF), to lay groundwork for Expectation Propagation. Assumed-density filtering is a general technique for computing approximate posteriors in Bayesian networks and other statistical models. ADF has been independently proposed in the statistics (Lauritzen, 1992), artificial intelligence (Boyen & Koller, 1998; Opper & Winther, 1999), and control (Maybeck, 1982) literatures. "Assumed-density filtering" is the name used in control; other names include "online Bayesian learning," "moment matching," and "weak marginalization." ADF applies when we have postulated a joint distribution p(D, x) where D has been observed and x is hidden. We would like to know the posterior over x, p(x | D), as well as the probability of the observed data (or evidence for the model), p(D). The former is useful for estimation while the latter is useful for model selection.

For example, suppose we have observations from a Gaussian distribution embedded in a sea of unrelated clutter, so that the observation density is a mixture of two Gaussians:

    p(y | x) = (1 - w) N(y; x, I) + w N(y; 0, 10 I)

The first component contains the parameter of interest, while the other component describes clutter. w is the known ratio of clutter. Let the d-dimensional vector x have a Gaussian prior distribution:

    p(x) ~ N(0, 100 I_d)    (1)

The joint distribution of x and n independent observations D = {y_1, ..., y_n} is therefore:

    p(D, x) = p(x) ∏_{i=1}^{n} p(y_i | x)    (2)

The Bayesian network for this problem is simply x pointing to the y_i. But we cannot use belief propagation because the belief state for x is a mixture of 2^n Gaussians. To apply ADF, we write the joint distribution p(D, x) as a product of terms:

    p(D, x) = ∏_i t_i(x),  where t_0(x) = p(x) and t_i(x) = p(y_i | x)

Next we choose an approximating family. In the clutter problem, a spherical Gaussian distribution is reasonable:

    q(x) ~ N(m_x, v_x I)    (3)

Finally, we sequence through and incorporate the terms t_i into the approximate posterior. At each step we move from an old q^{\i}(x) to a new q(x). (To reduce notation, we drop the dependence of q on i.) Initialize with q(x) = 1. Incorporating the prior term is trivial, with no approximation needed. To incorporate a more complicated term t_i(x), take the exact posterior

    p̂(x) = t_i(x) q^{\i}(x) / Z_i,  where Z_i = ∫_x t_i(x) q^{\i}(x) dx    (4)

and minimize the KL-divergence KL(p̂(x) || q(x)) subject to the constraint that q(x) is in the approximating family. This is equivalent to a maximum-likelihood problem with data distribution p̂. For a spherical Gaussian, the solution is given by matching moments:

    E_q[x] = E_p̂[x]    (5)
    E_q[x'x] = E_p̂[x'x]    (6)

With any exponential family, ADF reduces to propagating expectations. Each step also produces a normalizing factor Z_i = ∫_x t_i(x) q^{\i}(x) dx. The product of these normalizing factors estimates p(D). In the clutter problem, we have

    Z_i = (1 - w) N(y_i; m_x, (v_x + 1) I) + w N(y_i; 0, 10 I)    (7)

The final ADF algorithm is:

1. Initialize m_x = 0, v_x = 100 (the prior). Initialize s = 1 (the scale factor).

2. For each data point y_i, update (m_x, v_x, s) according to:

       r_i = 1 - w N(y_i; 0, 10 I) / Z_i
       m_x ← m_x + v_x r_i (y_i - m_x) / (v_x + 1)
       v_x ← v_x - r_i v_x^2 / (v_x + 1) + r_i (1 - r_i) v_x^2 ||y_i - m_x||^2 / (d (v_x + 1)^2)
       s ← s Z_i

This algorithm can be understood in an intuitive way: for each data point we compute its probability r_i of not being clutter, make a soft update to our estimate of x (m_x), and change our confidence in the estimate (v_x). However, it is clear that this algorithm will depend on the order in which the data is processed, because the clutter probability depends on the current estimate of x.
3 EXPECTATION PROPAGATION

This section describes the Expectation Propagation algorithm and demonstrates its use on the clutter problem. Expectation Propagation is based on a novel interpretation of assumed-density filtering. ADF was described as treating each observation term t_i exactly and then approximating the posterior that includes t_i. But we can also think of it as first approximating t_i with some t̃_i and then using an exact posterior with t̃_i. This interpretation is always possible because we can define the approximate term t̃_i to be the ratio of the new posterior to the old posterior times a constant:

    t̃_i(x) = Z_i q(x) / q^{\i}(x)    (8)

Multiplying this approximate term by q^{\i}(x) gives q(x), as desired. An important property is that if the approximate posterior is in an exponential family, then the term approximations will be in the same family.

The algorithm of the previous section can thus be interpreted as sequentially computing a Gaussian approximation t̃_i(x) to every observation term t_i(x). Expectation Propagation improves on this by iteratively refining the term approximations:

1. Initialize the term approximations t̃_i.

2. Compute the posterior for x from the product of the t̃_i:

       q(x) = ∏_i t̃_i(x) / ∫_x ∏_i t̃_i(x) dx

3. Until all t̃_i converge:

   (a) Choose a t̃_i to refine.
   (b) Remove t̃_i from the posterior to get an 'old' posterior q^{\i}(x) ∝ q(x) / t̃_i(x).
   (c) Combine q^{\i}(x) and t_i(x) and minimize the KL-divergence to get a new posterior q(x) with normalizer Z_i.
   (d) Update t̃_i = Z_i q(x) / q^{\i}(x).

4. Use the normalizing constant of ∏_i t̃_i(x) as an approximation to p(D):

       p(D) ≈ ∫_x ∏_i t̃_i(x) dx    (11)

This algorithm always has a fixed point, and sometimes has several. If initialized too far away from a fixed point, it may diverge. This is discussed in section 3.3.

3.1 THE CLUTTER PROBLEM

For the clutter problem of the previous section, the EP algorithm is:

1. The term approximations have the form

       t̃_i(x) = s_i exp(−(1/(2 v_i)) ||x − m_i||^2)    (12)

   Initialize the prior term t̃_0 to itself: v_0 = 100, m_0 = 0, s_0 = (2π v_0)^{−d/2}. Initialize the data terms so that t̃_i(x) = 1: v_i = ∞, m_i = 0, s_i = 1.
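For the same one-dimensional setting, the EP iteration can be sketched as follows. The natural-parameter bookkeeping (`prec`, `h`) and all names are ours, the term scale factors s_i (and hence the evidence estimate) are omitted for brevity, and the refinement step reuses the ADF moment-matching formulas with the cavity distribution in place of the old posterior:

```python
import math

def normal_pdf(y, mean, var):
    """Density of a scalar Gaussian N(y; mean, var)."""
    return math.exp(-(y - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def ep_clutter(data, w=0.5, prior_var=100.0, clutter_var=10.0, sweeps=20):
    """EP for the 1-D clutter problem.

    Each data term is an unnormalized Gaussian stored in natural
    parameters: prec[i] = 1/v_i and h[i] = m_i/v_i.  Data terms start at
    t~_i(x) = 1 (prec = h = 0); the prior term is kept exact.
    """
    n = len(data)
    prec = [0.0] * n
    h = [0.0] * n
    for _ in range(sweeps):
        for i, y in enumerate(data):
            # posterior from the prior and all current term approximations
            post_prec = 1.0 / prior_var + sum(prec)
            post_h = sum(h)
            # (b) remove term i to get the cavity ("old") distribution
            cav_prec = post_prec - prec[i]
            cav_v = 1.0 / cav_prec
            cav_m = (post_h - h[i]) * cav_v
            # (c) moment-match the tilted distribution, as in ADF
            Z = (1 - w) * normal_pdf(y, cav_m, cav_v + 1) \
                + w * normal_pdf(y, 0.0, clutter_var)
            r = 1 - w * normal_pdf(y, 0.0, clutter_var) / Z
            m = cav_m + cav_v * r * (y - cav_m) / (cav_v + 1)
            v = (cav_v - r * cav_v ** 2 / (cav_v + 1)
                 + r * (1 - r) * cav_v ** 2 * (y - cav_m) ** 2
                 / (cav_v + 1) ** 2)
            # (d) update the term: new posterior divided by the cavity
            prec[i] = 1.0 / v - cav_prec
            h[i] = m / v - cav_m * cav_prec
    post_prec = 1.0 / prior_var + sum(prec)
    return sum(h) / post_prec, 1.0 / post_prec   # posterior mean, variance
```

Unlike the single ADF pass, repeated sweeps let each term be refined in the context of all the others, removing the dependence on processing order.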
3.2 RESULTS

EP is compared to Laplace's method, variational Bayes (VB), importance sampling (drawing from the prior as the importance distribution), and Gibbs sampling (by introducing hidden assignment variables). Figure 1 shows a typical run with n = 20 and with n = 200. It plots the accuracy vs. cost of the algorithms. Accuracy is measured by absolute difference from the true evidence or the true posterior mean. Cost is measured by the number of floating point operations (FLOPS) in Matlab, via Matlab's flops function. This is better than using CPU time because FLOPS ignores interpretation overhead.

Figure 1: Cost vs. accuracy curves for expectation propagation (EP), Laplace's method, variational Bayes (VB), importance sampling, and Gibbs sampling on the clutter problem with w = 0.5 and x = 2. The panels plot error in the evidence and in the posterior mean against FLOPS, for n = 20 and n = 200. Each 'x' is one iteration of EP. ADF is the first 'x'.
The deterministic methods EP, Laplace, and VB all try to approximate the posterior with a Gaussian, so they improve substantially with more data (the posterior is more Gaussian with more data). The sampling methods assume very little about the posterior and cannot exploit the fact that it is becoming more Gaussian. However, this is an advantage for sampling when the posterior has a complex shape. Figure 2 shows an atypical run with a small amount of data, in which the deterministic methods converge to an erroneous result that captures only a single mode.

Figure 2: An atypical run on the clutter problem. (left) The exact posterior over theta compared with the EP, VB, and Laplace approximations. (right) Cost vs. error in the posterior mean for each method.
3.3 FIXED POINTS

Suppose the approximating family is an exponential family: q(x) ∝ exp(ν'g(x)). Then the term approximations will be t̃_i(x) ∝ exp(λ_i'g(x)) for some λ_i. Let n be the number of terms t̃_i. The EP fixed points are in one-to-one correspondence with stationary points of the objective

    min_λ max_ν  (n − 1) log ∫_x exp(ν'g(x)) dx − Σ_{i=1}^{n} log ∫_x t_i(x) exp(λ_i'g(x)) dx    (13)

    such that  (n − 1) ν = Σ_{i=1}^{n} λ_i    (14)

Note that min-max cannot be exchanged with max-min in this objective. By taking derivatives we get the stationary conditions ∫_x g(x) p̂_i(x) dx = ∫_x g(x) q(x) dx for each i, where p̂_i is defined by (4). This is an EP fixed point. In reverse, given an EP fixed point we can recover ν and λ from q and q^{\i} to obtain a stationary point of (13).

Assume all terms are bounded: t_i(x) < c. Then the objective is bounded from below, because for any ν we can choose λ_i = (n − 1)ν/n, and then the second part of (13) is at least

    −n log ∫_x c exp(ν'g(x))^{(n−1)/n} dx ≥ −n log c − (n − 1) log ∫_x exp(ν'g(x)) dx − const

by the concavity of the function x^{(n−1)/n}. Therefore there must be stationary points. Sometimes there are multiple fixed points of EP, in which case we can define the 'best' fixed point as the one with minimum energy (13). When canonical EP does not converge, we can minimize (13) by some other scheme, such as gradient descent. In practice, it is found that when canonical EP does not converge, it is for a good reason, namely the approximating family is a poor match to the exact posterior. This happened in the previous example. So before considering alternate ways to carry out EP, one should reconsider the approximating family.

4 LOOPY BELIEF PROPAGATION

Expectation Propagation and assumed-density filtering can be used to approximate a belief network by a simpler network; when the approximation is completely disconnected, the result is loopy belief propagation. Let the hidden variables be x_1, ..., x_K and collect the observed variables into D = {y_1, ..., y_N}. A completely disconnected distribution for x has the form

    q(x) = ∏_{k=1}^{K} q_k(x_k)    (15)

When we minimize the KL-divergence KL(p̂ || q), we will simply preserve the marginals of p̂. This corresponds to an expectation constraint E_q[δ(x_k − a)] = E_p̂[δ(x_k − a)] for all values a of x_k. From this we arrive at the ADF algorithm of Boyen & Koller (1998):

1. Initialize q_k(x_k) = 1.

2. For each term t_i in turn, set each q_k to the k-th marginal of p̂:

       q_k(x_k) ← Σ_{x∖x_k} p̂(x),  p̂(x) = t_i(x) ∏_j q_j(x_j) / Z_i
       where Z_i = Σ_x t_i(x) ∏_j q_j(x_j)

For dynamic Bayesian networks, Boyen & Koller set t_i to the product of all of the conditional probability tables for timeslice i. Now let's turn this into an EP algorithm. From the ratio t̃_i(x) = Z_i q(x)/q^{\i}(x), we see that the approximate terms t̃_i are completely disconnected. The EP algorithm is thus:

1. The term approximations have the form t̃_i(x) = ∏_k t̃_ik(x_k). Initialize t̃_ik(x_k) = 1.

2. Compute the posterior: q_k(x_k) ∝ ∏_i t̃_ik(x_k).

3. Until all t̃_i converge:

   (a) Choose a t̃_i to refine.
   (b) Remove t̃_i from the posterior. For all k:

           q_k^{\i}(x_k) ∝ q_k(x_k) / t̃_ik(x_k)

   (c) Recompute q(x) from q^{\i}(x), as in ADF.
   (d) Update the term approximation:

           t̃_i(x) = Z_i ∏_k q_k(x_k) / q_k^{\i}(x_k)

To make this equivalent to belief propagation, the original terms t_i should correspond to the conditional probability tables of a directed belief network. That is, we should break the joint distribution p(D, x) into

    p(D, x) = ∏_k p(x_k | pa(x_k)) ∏_j p(y_j | pa(y_j))    (16)

where pa(v) is the set of parents of node v. The network has observed nodes y_j and hidden nodes x_k. The parents of an observed node might be hidden, and vice versa. For an undirected network, the terms are the clique potentials. The quantities in EP now have the following interpretations:

- q_k(x_k) is the belief state of node x_k, i.e. the product of all messages into x_k.
- The 'old' posterior q_k^{\i}(x_k) for a particular term i is a partial belief state, i.e. the product of messages into x_k except for those originating from term i.
- When i ≠ k, the function t̃_ik(x_k) is the message that node i (either hidden or observed) sends to its parent x_k in belief propagation. For example, suppose node i is hidden, with term t_i = p(x_i | pa(x_i)). The other parents send their partial belief states, which the child combines with its own partial belief state:

        t̃_ik(x_k) ∝ Σ_{x∖x_k} p(x_i | pa(x_i)) q_i^{\i}(x_i) ∏_{j ∈ pa(x_i), j ≠ k} q_j^{\i}(x_j)

Unlike Pearl's derivation of belief propagation in terms of λ and π messages, this derivation is symmetric with respect to parents and children. In fact, it is the form used in factor graphs (Kschischang et al., 2000). All of the nodes that participate in a conditional probability table p(x_i | pa(x_i)) send messages to each other based on their partial belief states.

Loopy belief propagation does not always converge, but from section 3.3 we know how we could find a fixed point. For an undirected network with pairwise potentials, the EP energy function (13) is a dual representation of the Bethe free energy given by Yedidia et al. (2000).

Alternatively, we can fit an approximate network which is not completely disconnected, such as a tree-structured network. This was done in the ADF context by Frey et al. (2000). A general algorithm for tree-structured approximation using EP is given by Minka (2001).
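The marginal-matching property behind the disconnected approximation (15) can be checked directly on a toy example. This sketch, with an illustrative 2x2 joint distribution of our choosing, shows that the KL-optimal disconnected q reproduces the marginals of p but not the joint:

```python
# A joint distribution p(x1, x2) over two binary variables, as a 2x2 table.
# (The numbers are illustrative.)
p = [[0.30, 0.10],
     [0.15, 0.45]]

# Minimizing KL(p || q1(x1) q2(x2)) over disconnected q simply sets each
# factor to the corresponding marginal of p, as in equation (15).
q1 = [p[0][0] + p[0][1], p[1][0] + p[1][1]]   # marginal over x1
q2 = [p[0][0] + p[1][0], p[0][1] + p[1][1]]   # marginal over x2

# The induced disconnected joint q(x1, x2) = q1(x1) q2(x2)
q = [[q1[i] * q2[j] for j in range(2)] for i in range(2)]
```

The marginals of q agree exactly with those of p, but q[0][0] = 0.18 while p[0][0] = 0.30: the disconnected family discards the correlation between x1 and x2, which is precisely the approximation loopy belief propagation makes.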
Figure 3: (left) Bayes point machine vs. Support Vector Machine on a simple data set. The Bayes point more closely approximates a vote between all linear separators of the data. (right) Cost vs. error in estimating the posterior mean, for the EP, TAP, and mean-field (MF) approximations. ADF is the first 'x' on the EP curve.

Figure 4: Test error rate for the Bayes Point Machine (using EP or the Billiard algorithm) compared to the Support Vector Machine, on the Heart, Thyroid, Ionosphere, and Sonar datasets. Reported is the mean over 40 train-test splits ± two standard deviations. These results are for a Gaussian kernel with standard deviation 3, and will differ with other kernels.

... variance in the training set. The classifiers used zero slack and a Gaussian inner product with standard deviation 3. Billiard was run for 500 iterations. The thyroid dataset was made into a binary classification problem by merging the different classes into normal vs. abnormal. Except for sonar, EP has lower average error than the SVM (with 99% probability), and in all cases EP is at least as good as Billiard. Billiard has the highest running time because it is initialized at the SVM solution.

6 SUMMARY

This paper presented a generalization of belief propagation which is appropriate for hybrid belief networks. Its superior speed and accuracy were demonstrated on a Gaussian mixture network and the Bayes Point Machine. Hopefully it will prove useful for other networks as well. The Expectation Propagation iterations always have a fixed point, which can be found by minimizing an energy function.

References

Blake, C. L., & Merz, C. J. (1998). UCI repository of machine learning databases. www.ics.uci.edu/~mlearn/MLRepository.html.

Boyen, X., & Koller, D. (1998). Tractable inference for complex stochastic processes. Uncertainty in AI.

Frey, B. J., & MacKay, D. J. (1997). A revolution: Belief propagation in graphs with cycles. NIPS 10.

Frey, B. J., Patrascu, R., Jaakkola, T., & Moran, J. (2000). Sequentially fitting inclusive trees for inference in noisy-OR networks. NIPS 13.

Herbrich, R., Graepel, T., & Campbell, C. (1999). Bayes point machines: Estimating the Bayes point in kernel space. IJCAI Workshop Support Vector Machines.

Koller, D., Lerner, U., & Angelov, D. (1999). A general algorithm for approximate inference and its application to hybrid Bayes nets. Uncertainty in AI (pp. 324-333).

Kschischang, F. R., Frey, B. J., & Loeliger, H.-A. (2000). Factor graphs and the sum-product algorithm. IEEE Trans Info Theory, to appear. www.comm.toronto.edu/frank/factor/.

Lauritzen, S. L. (1992). Propagation of probabilities, means and variances in mixed graphical association models. J American Statistical Association, 87, 1098-1108.

Maybeck, P. S. (1982). Stochastic models, estimation and control, chapter 12.7. Academic Press.

Minka, T. P. (2001). A family of algorithms for approximate Bayesian inference. Doctoral dissertation, Massachusetts Institute of Technology. vismod.www.media.mit.edu/~tpminka/.

Murphy, K., Weiss, Y., & Jordan, M. (1999). Loopy-belief propagation for approximate inference: An empirical study. Uncertainty in AI.

Opper, M., & Winther, O. (1999). A Bayesian approach to on-line learning. On-Line Learning in Neural Networks. Cambridge University Press.

Opper, M., & Winther, O. (2000). Gaussian processes for classification: Mean field algorithms. Neural Computation, 12, 2655-2684.

Shachter, R. (1990). A linear approximation method for probabilistic inference. Uncertainty in AI.

Yedidia, J. S., Freeman, W. T., & Weiss, Y. (2000). Generalized belief propagation (Technical Report TR2000-26). MERL. www.merl.com/reports/TR2000-26/index.html.