
Expectation Propagation for Approximate Bayesian Inference

Thomas P Minka
Statistics Dept.
Carnegie Mellon University
Pittsburgh, PA 15213

Abstract

This paper presents a new deterministic approximation technique in Bayesian networks. This method, "Expectation Propagation," unifies two previous techniques: assumed-density filtering, an extension of the Kalman filter, and loopy belief propagation, an extension of belief propagation in Bayesian networks. Loopy belief propagation, because it propagates exact belief states, is useful for a limited class of belief networks, such as those which are purely discrete. Expectation Propagation approximates the belief states by only retaining expectations, such as mean and variance, and iterates until these expectations are consistent throughout the network. This makes it applicable to hybrid networks with discrete and continuous nodes. Experiments with Gaussian mixture models show Expectation Propagation to be convincingly better than methods with similar computational cost: Laplace's method, variational Bayes, and Monte Carlo. Expectation Propagation also provides an efficient algorithm for training Bayes point machine classifiers.

1 INTRODUCTION

Bayesian inference is often hampered by large computational expense. Fast and accurate approximation methods are therefore very important and can have great impact. This paper presents a new deterministic algorithm, Expectation Propagation, which achieves higher accuracy than existing approximation algorithms with similar computational cost.

Expectation Propagation is an extension of assumed-density filtering (ADF), a one-pass, sequential method for computing an approximate posterior distribution. In ADF, observations are processed one by one, updating the posterior distribution, which is then approximated before processing the next observation. For example, we might replace the exact one-step posterior with a Gaussian having the same mean and same variance (Maybeck, 1982; Opper & Winther, 1999). Or we might replace a posterior over many variables with one that renders the variables independent (Boyen & Koller, 1998). The weakness of ADF stems from its sequential nature: information that is discarded early on may turn out to be important later. ADF is also sensitive to observation ordering, which is undesirable in a batch context.

Expectation Propagation (EP) extends ADF to incorporate iterative refinement of the approximations, by making additional passes through the network. The information from later observations refines the choices made earlier, so that the most important information is retained. Iterative refinement has previously been used in conjunction with sampling (Koller et al., 1999) and extended Kalman filtering (Shachter, 1990). Expectation Propagation is faster than sampling and more general than extended Kalman filtering. It is more expensive than ADF by only a constant factor: the number of refinement passes (typically 4 or 5). EP applies to all statistical models to which ADF can be applied and, as shown in section 3.2, is significantly more accurate.

In belief networks with loops it is known that approximate marginal distributions can be obtained by iterating the belief propagation recursions, a process known as loopy belief propagation (Frey & MacKay, 1997; Murphy et al., 1999). In section 4, this turns out to be a special case of Expectation Propagation, where the approximation is a completely disconnected network. Expectation Propagation is more general than belief propagation in two ways: (1) like variational methods, it can use approximations which are not completely disconnected, and (2) it can impose useful constraints on functional form, such as multivariate Gaussian.
2 ASSUMED-DENSITY FILTERING

This section reviews the idea of assumed-density filtering (ADF), to lay groundwork for Expectation Propagation. Assumed-density filtering is a general technique for computing approximate posteriors in Bayesian networks and other statistical models. ADF has been independently proposed in the statistics (Lauritzen, 1992), artificial intelligence (Boyen & Koller, 1998; Opper & Winther, 1999), and control (Maybeck, 1982) literatures. "Assumed-density filtering" is the name used in control; other names include "online Bayesian learning," "moment matching," and "weak marginalization." ADF applies when we have postulated a joint distribution $p(D, x)$ where $D$ has been observed and $x$ is hidden. We would like to know the posterior over $x$, $p(x \mid D)$, as well as the probability of the observed data (or evidence for the model), $p(D)$. The former is useful for estimation while the latter is useful for model selection.

For example, suppose we have observations from a Gaussian distribution embedded in a sea of unrelated clutter, so that the observation density is a mixture of two Gaussians:

$$p(y \mid x) = (1 - w)\,\mathcal{N}(y; x, I) + w\,\mathcal{N}(y; 0, 10 I)$$

The first component contains the parameter of interest, while the other component describes clutter. $w$ is the known ratio of clutter. Let the $d$-dimensional vector $x$ have a Gaussian prior distribution:

$$p(x) \sim \mathcal{N}(0, 100 I_d) \qquad (1)$$

The joint distribution of $x$ and $n$ independent observations $D = \{y_1, \ldots, y_n\}$ is therefore:

$$p(D, x) = p(x) \prod_i p(y_i \mid x) \qquad (2)$$

The Bayesian network for this problem is simply $x$ pointing to the $y_i$. But we cannot use belief propagation because the belief state for $x$ is a mixture of $2^n$ Gaussians. To apply ADF, we write the joint distribution $p(D, x)$ as a product of terms: $p(D, x) = \prod_i t_i(x)$, where $t_0(x) = p(x)$ and $t_i(x) = p(y_i \mid x)$. Next we choose an approximating family. In the clutter problem, a spherical Gaussian distribution is reasonable:

$$q(x) \sim \mathcal{N}(m_x, v_x I) \qquad (3)$$

Finally, we sequence through and incorporate the terms into the approximate posterior. At each step we move from an old $q^{\setminus i}(x)$ to a new $q(x)$. (To reduce notation, we drop the dependence of $q(x)$ on $i$.) Initialize with $q(x) = 1$. Incorporating the prior term is trivial, with no approximation needed. To incorporate a more complicated term $t_i(x)$, take the exact posterior

$$\hat p(x) = \frac{t_i(x)\, q^{\setminus i}(x)}{\int_x t_i(x)\, q^{\setminus i}(x)\, dx} \qquad (4)$$

and minimize the KL-divergence $D(\hat p(x) \,\|\, q(x))$ subject to the constraint that $q(x)$ is in the approximating family. This is equivalent to a maximum-likelihood problem with data distribution $\hat p$. For a spherical Gaussian, the solution is given by matching moments:

$$E_q[x] = E_{\hat p}[x] \qquad (5)$$
$$E_q[x^{\mathsf T} x] = E_{\hat p}[x^{\mathsf T} x] \qquad (6)$$

With any exponential family, ADF reduces to propagating expectations. Each step also produces a normalizing factor $Z_i = \int_x t_i(x)\, q^{\setminus i}(x)\, dx$. The product of these normalizing factors estimates $p(D)$. In the clutter problem, we have

$$Z_i = (1 - w)\,\mathcal{N}(y_i; m_x, (v_x + 1) I) + w\,\mathcal{N}(y_i; 0, 10 I) \qquad (7)$$

The final ADF algorithm is:

1. Initialize $m_x = 0$, $v_x = 100$ (the prior). Initialize $s = 1$ (the scale factor).
2. For each data point $y_i$, update $(m_x, v_x, s)$ according to
   $$s \leftarrow s\, Z_i$$
   $$r_i = 1 - \frac{w}{Z_i}\,\mathcal{N}(y_i; 0, 10 I)$$
   $$m_x \leftarrow m_x + v_x\, r_i\, \frac{y_i - m_x}{v_x + 1}$$
   $$v_x \leftarrow v_x - r_i\, \frac{v_x^2}{v_x + 1} + r_i (1 - r_i)\, \frac{v_x^2\, \|y_i - m_x\|^2}{d\,(v_x + 1)^2}$$

This algorithm can be understood in an intuitive way: for each data point we compute its probability $r_i$ of not being clutter, make a soft update to our estimate of $x$ ($m_x$), and change our confidence in the estimate ($v_x$). However, it is clear that this algorithm will depend on the order in which data is processed, because the clutter probability depends on the current estimate of $x$.
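The ADF recursion above is easy to implement directly. The following is a minimal Python/NumPy sketch of the clutter-problem updates, using the spherical-Gaussian notation of this section; the function and variable names (adf_clutter, gauss) are illustrative choices, not from the original paper.

```python
import numpy as np

def gauss(y, m, v):
    """Density of a spherical Gaussian N(y; m, v*I)."""
    d = len(y)
    diff = y - m
    return np.exp(-0.5 * diff @ diff / v) / (2 * np.pi * v) ** (d / 2)

def adf_clutter(Y, w=0.5, prior_var=100.0):
    """One ADF pass over data Y (n x d) for the clutter model
    p(y|x) = (1-w) N(y; x, I) + w N(y; 0, 10 I)."""
    d = Y.shape[1]
    m_x, v_x, s = np.zeros(d), prior_var, 1.0     # start from the prior term
    for y in Y:
        # normalizer Z_i of the exact one-term posterior, eq. (7)
        Z = (1 - w) * gauss(y, m_x, v_x + 1) + w * gauss(y, np.zeros(d), 10.0)
        s *= Z
        r = 1 - w * gauss(y, np.zeros(d), 10.0) / Z   # prob. of not being clutter
        m_new = m_x + v_x * r * (y - m_x) / (v_x + 1)
        v_new = (v_x - r * v_x**2 / (v_x + 1)
                 + r * (1 - r) * v_x**2 * np.sum((y - m_x)**2) / (d * (v_x + 1)**2))
        m_x, v_x = m_new, v_new
    return m_x, v_x, s      # posterior mean, variance, and evidence estimate
```

Running this sketch on the same rows of Y in a different order generally gives a different $(m_x, v_x)$, which is exactly the order sensitivity that EP removes.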
3 EXPECTATION PROPAGATION

This section describes the Expectation Propagation algorithm and demonstrates its use on the clutter problem. Expectation Propagation is based on a novel interpretation of assumed-density filtering. ADF was described as treating each observation term $t_i$ exactly and then approximating the posterior that includes $t_i$. But we can also think of it as first approximating $t_i$ with some $\tilde t_i$ and then using an exact posterior with $\tilde t_i$. This interpretation is always possible because we can define the approximate term $\tilde t_i$ to be the ratio of the new posterior to the old posterior times a constant:

$$\tilde t_i(x) = Z_i\, \frac{q(x)}{q^{\setminus i}(x)} \qquad (8)$$

Multiplying this approximate term by $q^{\setminus i}(x)$ gives $q(x)$, as desired. An important property is that if the approximate posterior is in an exponential family, then the term approximations will be in the same family.

The algorithm of the previous section can thus be interpreted as sequentially computing a Gaussian approximation $\tilde t_i(x)$ to every observation term $t_i(x)$, then combining these approximations analytically to get a Gaussian posterior on $x$. Under this perspective, the approximations do not have any required order; the ordering only determined how we made the approximations. We are free to go back and refine the approximations, in any order. This gives the general form of Expectation Propagation:

1. Initialize the term approximations $\tilde t_i$.
2. Compute the posterior for $x$ from the product of $\tilde t_i$:
   $$q(x) = \frac{\prod_i \tilde t_i(x)}{\int \prod_i \tilde t_i(x)\, dx} \qquad (9)$$
3. Until all $\tilde t_i$ converge:
   (a) Choose a $\tilde t_i$ to refine.
   (b) Remove $\tilde t_i$ from the posterior to get an 'old' posterior $q^{\setminus i}(x)$, by dividing and normalizing:
   $$q^{\setminus i}(x) \propto \frac{q(x)}{\tilde t_i(x)} \qquad (10)$$
   (c) Combine $q^{\setminus i}(x)$ and $t_i(x)$ and minimize KL-divergence to get a new posterior $q(x)$ with normalizer $Z_i$.
   (d) Update $\tilde t_i = Z_i\, q(x) / q^{\setminus i}(x)$.
4. Use the normalizing constant of $q(x)$ as an approximation to $p(D)$:
   $$p(D) \approx \int \prod_i \tilde t_i(x)\, dx \qquad (11)$$

This algorithm always has a fixed point, and sometimes has several. If initialized too far away from a fixed point, it may diverge. This is discussed in section 3.3.

3.1 THE CLUTTER PROBLEM

For the clutter problem of the previous section, the EP algorithm is:

1. The term approximations have the form
   $$\tilde t_i(x) = s_i \exp\!\left(-\frac{1}{2 v_i}(x - m_i)^{\mathsf T}(x - m_i)\right) \qquad (12)$$
   Initialize the prior term to itself: $v_0 = 100$, $m_0 = 0$, $s_0 = (2\pi v_0)^{-d/2}$. Initialize the data terms so that $\tilde t_i(x) = 1$: $v_i = \infty$, $m_i = 0$, and $s_i = 1$.
2. $m_x = m_0$, $v_x = v_0$.
3. Until all $(m_i, v_i, s_i)$ converge (changes are less than $10^{-4}$): loop $i = 1, \ldots, n$:
   (a) Remove $\tilde t_i$ from the posterior to get an 'old' posterior:
   $$(v_x^{\setminus i})^{-1} = v_x^{-1} - v_i^{-1}$$
   $$m_x^{\setminus i} = m_x + v_x^{\setminus i}\, v_i^{-1}\, (m_x - m_i)$$
   (b) Recompute $(m_x, v_x)$ from $(m_x^{\setminus i}, v_x^{\setminus i})$ as in ADF.
   (c) Update $\tilde t_i$:
   $$v_i^{-1} = v_x^{-1} - (v_x^{\setminus i})^{-1}$$
   $$m_i = m_x^{\setminus i} + (v_i + v_x^{\setminus i})\,(v_x^{\setminus i})^{-1}\,(m_x - m_x^{\setminus i})$$
   $$s_i = \frac{Z_i}{(2\pi v_i)^{d/2}\, \mathcal{N}(m_i;\, m_x^{\setminus i},\, (v_i + v_x^{\setminus i}) I)}$$
4. Compute the normalizing constant:
   $$B = \frac{m_x^{\mathsf T} m_x}{v_x} - \sum_i \frac{m_i^{\mathsf T} m_i}{v_i}$$
   $$p(D) \approx (2\pi v_x)^{d/2} \exp(B/2) \prod_i s_i$$

Because the term approximations start at 1, the result after one pass through the data is identical to ADF.
3.2 RESULTS

EP for the clutter problem is compared with four other algorithms for approximate inference: Laplace's method, variational Bayes, importance sampling (using the prior as the importance distribution), and Gibbs sampling (by introducing hidden variables that determine if a data point is clutter). The goal is to estimate the evidence $p(D)$ and the posterior mean $E[x \mid D]$. Figure 1 shows the results on a typical run with $n = 20$ and with $n = 200$. It plots the accuracy vs. cost of the algorithms. Accuracy is measured by absolute difference from the true evidence or the true posterior mean. Cost is measured by the number of floating point operations (FLOPS) in Matlab, via Matlab's flops function. This is better than using CPU time because FLOPS ignores interpretation overhead.

Figure 1: Cost vs. accuracy curves for expectation propagation (EP), Laplace's method, variational Bayes (VB), importance sampling, and Gibbs sampling on the clutter problem with $w = 0.5$ and $x = 2$. The panels plot the error in the evidence and in the posterior mean against FLOPS, for $n = 20$ and $n = 200$. Each 'x' is one iteration of EP. ADF is the first 'x'.

The deterministic methods EP, Laplace, and VB all try to approximate the posterior with a Gaussian, so they improve substantially with more data (the posterior is more Gaussian with more data). The sampling methods assume very little about the posterior and cannot exploit the fact that it is becoming more Gaussian. However, this is an advantage for sampling when the posterior has a complex shape. Figure 2 shows an atypical run with a small amount of data ($n = 20$) where the true posterior has three distinct modes. Regular EP did not converge, but a restricted version did (Minka, 2001). Unfortunately, all of the deterministic methods converge to an erroneous result that captures only a single mode.

Figure 2: A complex posterior in the clutter problem. (a) Exact posterior $p(\theta, D)$ vs. approximations obtained by EP, Laplace's method, and variational Bayes. (b) Cost vs. accuracy in estimating the posterior mean.
3.3 CONVERGENCE

The EP iterations can be shown to always have a fixed point when the approximations are in an exponential family. The proof is analogous to Yedidia et al. (2000). Let the sufficient statistics be $g_1(x), \ldots, g_K(x)$, so that the family has form $\exp\!\big(\sum_j g_j(x)\, \nu_j\big)$. In the clutter problem we had $g_1(x) = x$ and $g_2(x) = x^{\mathsf T} x$. When we treat the prior exactly, the final approximation will be $q(x) \propto p(x) \exp\!\big(\sum_j g_j(x)\, \nu_j\big)$ for some $\nu$, and the leave-one-out approximations will be $q^{\setminus i}(x) \propto p(x) \exp\!\big(\sum_j g_j(x)\, \lambda_{ij}\big)$ for some $\lambda_i$. Let $n$ be the number of terms $t_i(x)$.

The EP fixed points are in one-to-one correspondence with stationary points of the objective

$$\min_{\nu} \max_{\lambda}\; (n - 1) \log \int_x p(x) \exp\!\Big(\sum_j g_j(x)\, \nu_j\Big) dx \;-\; \sum_i \log \int_x t_i(x)\, p(x) \exp\!\Big(\sum_j g_j(x)\, \lambda_{ij}\Big) dx \qquad (13)$$

such that

$$(n - 1)\, \nu_j = \sum_i \lambda_{ij} \qquad (14)$$

Note that min-max cannot be exchanged with max-min in this objective. By taking derivatives we get the stationary conditions $\int_x g_j(x)\, \hat p_i(x)\, dx = \int_x g_j(x)\, q(x)\, dx$, where $\hat p_i(x)$ is defined by (4). This is an EP fixed point. In reverse, given an EP fixed point we can recover $\nu$ and $\lambda$ from $q(x)$ and $q^{\setminus i}(x)$ to obtain a stationary point of (13).

Assume all terms are bounded: $t_i(x) < c$. Then the objective is bounded from below, because for any $\nu$ we can choose $\lambda_{ij} = \frac{n-1}{n}\,\nu_j$, and then the second part of (13) is at least

$$-n \log c - n \log \int_x p(x) \exp\!\Big(\sum_j g_j(x)\, \nu_j\, \tfrac{n-1}{n}\Big) dx \;\ge\; -n \log c - (n-1) \log \int_x p(x) \exp\!\Big(\sum_j g_j(x)\, \nu_j\Big) dx$$

by the concavity of the function $z^{(n-1)/n}$. Therefore there must be stationary points. Sometimes there are multiple fixed points of EP, in which case we can define the 'best' fixed point as the one with minimum energy (13). When canonical EP does not converge, we can minimize (13) by some other scheme, such as gradient descent. In practice, it is found that when canonical EP does not converge, it is for a good reason, namely the approximating family is a poor match to the exact posterior. This happened in the previous example. So before considering alternate ways to carry out EP, one should reconsider the approximating family.
4 LOOPY BELIEF PROPAGATION

Expectation Propagation and assumed-density filtering can be used to approximate a belief network by a simpler network with fewer edges. This section shows that if the approximation is completely disconnected, then ADF yields the algorithm of Boyen & Koller (1998) and EP yields loopy belief propagation.

Let the hidden variables be $x_1, \ldots, x_V$ and collect the observed variables into $D = \{y_1, \ldots, y_W\}$. A completely disconnected distribution for $x$ has the form

$$q(x) = \prod_k q_k(x_k) \qquad (15)$$

When we minimize the KL-divergence $D(\hat p(x) \,\|\, q(x))$, we will simply preserve the marginals of $\hat p$. This corresponds to an expectation constraint $E_{\hat p}[\delta(x_k - a)] = E_q[\delta(x_k - a)]$ for all values $a$ of $x_k$. From this we arrive at the ADF algorithm of Boyen & Koller (1998):

1. Initialize $q_k(x_k) = 1$.
2. For each term $t_i(x)$ in turn, set $q_k$ to the $k$-th marginal of $\hat p$:
   $$q_k(x_k) = \sum_{x \setminus x_k} \hat p(x) \propto \sum_{x \setminus x_k} t_i(x) \prod_j q_j^{\setminus i}(x_j)$$
   where
   $$Z_i = \sum_x t_i(x) \prod_j q_j^{\setminus i}(x_j)$$

For dynamic Bayesian networks, Boyen & Koller set $t_i$ to the product of all of the conditional probability tables for timeslice $i$. Now let's turn this into an EP algorithm. From the ratio $q / q^{\setminus i}$, we see that the approximate terms $\tilde t_i(x)$ are completely disconnected. The EP algorithm is thus:

1. $\tilde t_i(x) = \prod_k \tilde t_{ik}(x_k)$. Initialize $\tilde t_i(x) = 1$.
2. $q_k(x_k) \propto \prod_i \tilde t_{ik}(x_k)$.
3. Until all $\tilde t_i$ converge:
   (a) Choose a $\tilde t_i$ to refine.
   (b) Remove $\tilde t_i$ from the posterior. For all $k$:
   $$q_k^{\setminus i}(x_k) \propto \frac{q_k(x_k)}{\tilde t_{ik}(x_k)} \propto \prod_{j \neq i} \tilde t_{jk}(x_k)$$
   (c) Recompute $q(x)$ from $q^{\setminus i}(x)$ as in ADF.
   (d) Update the sites:
   $$\tilde t_{ik}(x_k) = Z_i\, \frac{q_k(x_k)}{q_k^{\setminus i}(x_k)} = \sum_{x \setminus x_k} t_i(x) \prod_{j \neq k} q_j^{\setminus i}(x_j)$$

To make this equivalent to belief propagation, the original terms should correspond to the conditional probability tables of a directed belief network. That is, we should break the joint distribution $p(x, D)$ into

$$p(x, D) = \prod_k p(x_k \mid \mathrm{pa}(x_k)) \prod_j p(y_j \mid \mathrm{pa}(y_j)) \qquad (16)$$

where $\mathrm{pa}(v)$ denotes the set of parents of node $v$. The network has observed nodes $y_j$ and hidden nodes $x_k$. The parents of an observed node might be hidden, and vice versa. For an undirected network, the terms are the clique potentials. The quantities in EP now have the following interpretations:

- $q_k(x_k)$ is the belief state of node $x_k$, i.e. the product of all messages into $x_k$.
- The 'old' posterior $q_k^{\setminus i}(x_k)$ for a particular term $i$ is a partial belief state, i.e. the product of messages into $x_k$ except for those originating from term $i$.
- When $i \neq k$, the function $\tilde t_{ik}(x_k)$ is the message that node $i$ (either hidden or observed) sends to its parent $x_k$ in belief propagation. For example, suppose node $i$ is hidden and $t_i(x) = p(x_i \mid \mathrm{pa}(x_i))$. The other parents send their partial belief states, which the child combines with its partial belief state:
  $$\tilde t_{ik}(x_k) = \sum_{x_i} \sum_{x_j:\, j \in \mathrm{pa}(i),\, j \neq k} p(x_i \mid \mathrm{pa}(x_i))\, q_i^{\setminus i}(x_i) \prod_{j \in \mathrm{pa}(i),\, j \neq k} q_j^{\setminus i}(x_j)$$
- When node $i$ is hidden, the function $\tilde t_{ii}(x_i)$ is a combination of messages sent to node $i$ from its parents in belief propagation. Each parent sends its partial belief state, and the child combines them according to
  $$\tilde t_{ii}(x_i) = \sum_{x_j:\, j \in \mathrm{pa}(i)} p(x_i \mid \mathrm{pa}(x_i)) \prod_{j \in \mathrm{pa}(i)} q_j^{\setminus i}(x_j)$$

[Figure: diagrams of the message-passing interpretation, showing a node, its parents and children, and the messages exchanged between them.]

Unlike Pearl's derivation of belief propagation in terms of $\lambda$ and $\pi$ messages, this derivation is symmetric with respect to parents and children. In fact, it is the form used in factor graphs (Kschischang et al., 2000). All of the nodes that participate in a conditional probability table $p(x_i \mid \mathrm{pa}(x_i))$ send messages to each other based on their partial belief states.

Loopy belief propagation does not always converge, but from section 3.3 we know how we could find a fixed point. For an undirected network with pairwise potentials, the EP energy function (13) is a dual representation of the Bethe free energy given by Yedidia et al. (2000).

Alternatively, we can fit an approximate network which is not completely disconnected, such as a tree-structured network. This was done in the ADF context by Frey et al. (2000). A general algorithm for tree-structured approximation using EP is given by Minka (2001).
5 BAYES POINT MACHINE

This section applies Expectation Propagation to inference in the Bayes Point Machine (Herbrich et al., 1999). The Bayes Point Machine (BPM) is a Bayesian approach to linear classification. A linear classifier classifies a point $x$ according to $y = \mathrm{sign}(w^{\mathsf T} x)$ for some parameter vector $w$ (the two classes are $y = \pm 1$). Given a training set $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, the likelihood for $w$ can be written

$$p(D \mid w) = \prod_i p(y_i \mid x_i, w) = \prod_i \phi\!\left(\frac{y_i\, w^{\mathsf T} x_i}{\epsilon}\right) \qquad (17)$$

$$\phi(z) = \int_{-\infty}^{z} \mathcal{N}(t; 0, 1)\, dt \qquad (18)$$

By using $\phi$ instead of a step function, this likelihood tolerates small errors. The allowed 'slack' is controlled by $\epsilon$. To avoid estimating $\epsilon$, which is tangential to this paper, the experiments all use $\epsilon \to 0$, where $\phi$ becomes a step function. The BPM is a hybrid belief network, with $w$ and $x_i$ pointing to $y_i$. Under the Bayesian approach, we also have a prior distribution on $w$, which is taken to be $\mathcal{N}(0, I)$. Given this model, the optimal way to classify a new data point $x$ is to vote all classifiers according to their posterior probability: $E[\mathrm{sign}(w^{\mathsf T} x) \mid D]$. As an approximation to this, the BPM uses the output of the average classifier: $\mathrm{sign}(E[w]^{\mathsf T} x)$.

Using EP, we can make a multivariate Gaussian approximation to the posterior over $w$ and use its mean as the estimated Bayes point. The resulting algorithm is similar to that for the clutter problem. To save notation, $y_i x_i / \epsilon$ is written simply as $x_i$.

1. The term approximations have the form
   $$\tilde t_i(w) = s_i \exp\!\left(-\frac{1}{2 v_i}\,(w^{\mathsf T} x_i - m_i)^2\right)$$
   Initialize with $v_i = \infty$, $m_i = 0$, $s_i = 1$.
2. $q(w) = \mathcal{N}(m_w, V_w)$. Initialize with the prior: $m_w = 0$, $V_w = I$.
3. Until all $(m_i, v_i, s_i)$ converge (changes are less than $10^{-4}$): loop $i = 1, \ldots, n$:
   (a) Remove $\tilde t_i$ from the posterior to get an 'old' posterior:
   $$V_w^{\setminus i} = V_w + \frac{(V_w x_i)(V_w x_i)^{\mathsf T}}{v_i - x_i^{\mathsf T} V_w x_i}$$
   $$m_w^{\setminus i} = m_w + (V_w^{\setminus i} x_i)\, v_i^{-1}\,(x_i^{\mathsf T} m_w - m_i)$$
   (b) Recompute $(m_w, V_w)$ from $(m_w^{\setminus i}, V_w^{\setminus i})$, using ADF:
   $$z_i = \frac{x_i^{\mathsf T} m_w^{\setminus i}}{\sqrt{x_i^{\mathsf T} V_w^{\setminus i} x_i + 1}} \qquad
     \alpha_i = \frac{1}{\sqrt{x_i^{\mathsf T} V_w^{\setminus i} x_i + 1}}\; \frac{\mathcal{N}(z_i; 0, 1)}{\phi(z_i)}$$
   $$m_w = m_w^{\setminus i} + V_w^{\setminus i} x_i\, \alpha_i$$
   $$V_w = V_w^{\setminus i} - (V_w^{\setminus i} x_i)(V_w^{\setminus i} x_i)^{\mathsf T}\; \frac{\alpha_i\,(x_i^{\mathsf T} m_w + \alpha_i)}{x_i^{\mathsf T} V_w^{\setminus i} x_i + 1}$$
   (c) Update $\tilde t_i$:
   $$v_i = \frac{x_i^{\mathsf T} V_w^{\setminus i} x_i + 1}{\alpha_i\,(x_i^{\mathsf T} m_w + \alpha_i)} - x_i^{\mathsf T} V_w^{\setminus i} x_i$$
   $$m_i = x_i^{\mathsf T} m_w^{\setminus i} + (v_i + x_i^{\mathsf T} V_w^{\setminus i} x_i)\, \alpha_i$$
   $$s_i = \phi(z_i)\, \sqrt{1 + \frac{x_i^{\mathsf T} V_w^{\setminus i} x_i}{v_i}}\; \exp\!\left(\frac{(v_i + x_i^{\mathsf T} V_w^{\setminus i} x_i)\, \alpha_i^2}{2}\right)$$
4. Compute the normalizing constant:
   $$B = m_w^{\mathsf T} V_w^{-1} m_w - \sum_i \frac{m_i^2}{v_i}$$
   $$p(D) \approx |V_w|^{1/2} \exp(B/2) \prod_i s_i$$

This algorithm processes each data point in $O(d^2)$ time. Assuming the number of iterations is constant, which seems to be true in practice, computing the Bayes point therefore takes $O(n d^2)$ time. This algorithm can be extended to use an arbitrary inner product function, just as in Gaussian process classifiers and the Support Vector Machine, which changes the running time to $O(n^3)$, regardless of dimensionality. This extension can be found in Minka (2001). Interestingly, Opper & Winther (2000) have derived an equivalent algorithm using statistical physics methods. However, the EP updates tend to be faster than theirs and do not require a stepsize parameter.
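The following is a compact NumPy/SciPy sketch of this EP scheme for the zero-slack BPM. It works in the one-dimensional projected space $w^{\mathsf T} x_i$ for each site and rebuilds the full Gaussian $q(w)$ from the site parameters; this structure (the name bpm_ep, the explicit refit of the covariance from natural parameters) is an illustrative simplification rather than the paper's implementation, which uses the rank-one updates given above.

```python
import numpy as np
from scipy.stats import norm

def bpm_ep(X, y, n_passes=10):
    """EP for the Bayes Point Machine with prior w ~ N(0, I) and
    likelihood prod_i Phi(y_i w^T x_i) (zero-slack limit)."""
    Xy = X * y[:, None]                  # absorb labels: x_i <- y_i x_i
    n, d = Xy.shape
    m_site = np.zeros(n)                 # site means m_i
    p_site = np.zeros(n)                 # site precisions 1/v_i (0 = flat site)
    for _ in range(n_passes):
        for i in range(n):
            x = Xy[i]
            # posterior q(w) from prior plus all sites (natural parameters)
            P = np.eye(d) + (Xy.T * p_site) @ Xy
            V = np.linalg.inv(P)
            m = V @ (Xy.T @ (p_site * m_site))
            # cavity distribution of the scalar u = w^T x
            s_q, mu_q = x @ V @ x, x @ m
            p_cav = 1 / s_q - p_site[i]
            mu_cav = (mu_q / s_q - p_site[i] * m_site[i]) / p_cav
            s_cav = 1 / p_cav
            # moment matching against Phi(u)
            z = mu_cav / np.sqrt(s_cav + 1)
            beta = norm.pdf(z) / norm.cdf(z)
            mu_new = mu_cav + s_cav * beta / np.sqrt(s_cav + 1)
            s_new = s_cav - s_cav**2 * beta * (z + beta) / (s_cav + 1)
            # new site parameters from (cavity, new posterior)
            p_site[i] = 1 / s_new - p_cav
            m_site[i] = (mu_new / s_new - mu_cav / s_cav) / p_site[i]
    P = np.eye(d) + (Xy.T * p_site) @ Xy
    m_w = np.linalg.solve(P, Xy.T @ (p_site * m_site))
    return m_w                           # approximate Bayes point E[w]

# usage sketch: w_hat = bpm_ep(X_train, y_train); preds = np.sign(X_test @ w_hat)
```

Refitting and inverting the full $d \times d$ matrix inside the loop is wasteful; the rank-one updates in the algorithm above bring the per-site cost down to $O(d^2)$, and the kernelized variant replaces $x_i^{\mathsf T} x_j$ by an arbitrary inner product.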
5.1 RESULTS

Figure 3(a) demonstrates the Bayes point classifier vs. the SVM classifier on 3 training points. Besides the two dimensions shown here, each point had a third dimension set at 1. This provides a 'bias' coefficient $w_3$ so that the decision boundary doesn't have to pass through $(0, 0)$. The Bayes point classifier approximates a vote between all linear separators of the data; the Bayes point corresponds to an angle in the middle of the range spanned by these separators.

Figure 3(b) plots cost vs. error for EP versus three other algorithms for estimating the Bayes point: the billiard algorithm of Herbrich et al. (1999), the TAP algorithm of Opper & Winther (2000), and the mean-field (MF) algorithm of Opper & Winther (2000). The error is measured by Euclidean distance to the exact solution found by importance sampling. The error in using the SVM solution is also plotted for reference. Its unusually long running time is due to Matlab's quadprog solver. TAP and MF were slower to converge than EP, even with a large initial step size of 0.5. As expected, EP and TAP converge to the same solution.

Figure 3: (left) Bayes point machine vs. Support Vector Machine on a simple data set. The Bayes point more closely approximates a vote between all linear separators of the data. (right) Cost vs. error in estimating the posterior mean. ADF is the first 'x' on the EP curve.

Figure 4 compares the error rate of EP, Billiard, and SVM on four datasets from the UCI repository (Blake & Merz, 1998). Each dataset was randomly split 40 times into a training set and test set, in the ratio 60%:40%. In each trial, the features were normalized to have zero mean and unit variance in the training set. The classifiers used zero slack and a Gaussian inner product with standard deviation 3. Billiard was run for 500 iterations. The thyroid dataset was made into a binary classification problem by merging the different classes into normal vs. abnormal. Except for sonar, EP has lower average error than the SVM (with 99% probability), and in all cases EP is at least as good as Billiard. Billiard has the highest running time because it is initialized at the SVM solution.

[Figure 4 table: test error rates for EP, Billiard, and SVM on the Heart, Thyroid, Ionosphere, and Sonar datasets.]

Figure 4: Test error rate for the Bayes Point Machine (using EP or the Billiard algorithm) compared to the Support Vector Machine. Reported is the mean over 40 train-test splits ± two standard deviations. These results are for a Gaussian kernel with $\sigma = 3$, and will differ with other kernels.

6 SUMMARY

This paper presented a generalization of belief propagation which is appropriate for hybrid belief networks. Its superior speed and accuracy were demonstrated on a Gaussian mixture network and the Bayes Point Machine. Hopefully it will prove useful for other networks as well. The Expectation Propagation iterations always have a fixed point, which can be found by minimizing an energy function.

Acknowledgment

This work was performed at the MIT Media Lab, supported by the Things That Think and Digital Life consortia.

References

Blake, C. L., & Merz, C. J. (1998). UCI repository of machine learning databases. www.ics.uci.edu/~mlearn/MLRepository.html.

Boyen, X., & Koller, D. (1998). Tractable inference for complex stochastic processes. Uncertainty in AI.

Frey, B. J., & MacKay, D. J. (1997). A revolution: Belief propagation in graphs with cycles. NIPS 10.

Frey, B. J., Patrascu, R., Jaakkola, T., & Moran, J. (2000). Sequentially fitting inclusive trees for inference in noisy-OR networks. NIPS 13.

Herbrich, R., Graepel, T., & Campbell, C. (1999). Bayes point machines: Estimating the Bayes point in kernel space. IJCAI Workshop Support Vector Machines.

Koller, D., Lerner, U., & Angelov, D. (1999). A general algorithm for approximate inference and its application to hybrid Bayes nets. Uncertainty in AI (pp. 324-333).

Kschischang, F. R., Frey, B. J., & Loeliger, H.-A. (2000). Factor graphs and the sum-product algorithm. IEEE Trans Info Theory, to appear. www.comm.toronto.edu/frank/factor/.

Lauritzen, S. L. (1992). Propagation of probabilities, means and variances in mixed graphical association models. J American Statistical Association, 87, 1098-1108.

Maybeck, P. S. (1982). Stochastic models, estimation and control, chapter 12.7. Academic Press.

Minka, T. P. (2001). A family of algorithms for approximate Bayesian inference. Doctoral dissertation, Massachusetts Institute of Technology. vismod.www.media.mit.edu/~tpminka/.

Murphy, K., Weiss, Y., & Jordan, M. (1999). Loopy belief propagation for approximate inference: An empirical study. Uncertainty in AI.

Opper, M., & Winther, O. (1999). A Bayesian approach to on-line learning. On-Line Learning in Neural Networks. Cambridge University Press.

Opper, M., & Winther, O. (2000). Gaussian processes for classification: Mean field algorithms. Neural Computation, 12, 2655-2684.

Shachter, R. (1990). A linear approximation method for probabilistic inference. Uncertainty in AI.

Yedidia, J. S., Freeman, W. T., & Weiss, Y. (2000). Generalized belief propagation (Technical Report TR2000-26). MERL. www.merl.com/reports/TR2000-26/index.html.
