Inference in Bayesian networks
Chapter 14.4–5

Inference by enumeration

Slightly intelligent way to sum out variables from the joint without actually
constructing its explicit representation

Simple query on the burglary network:
P(B | j, m)
   = P(B, j, m)/P(j, m)
   = α P(B, j, m)
   = α Σe Σa P(B, e, a, j, m)

Rewrite full joint entries using products of CPT entries:
P(B | j, m)
   = α Σe Σa P(B) P(e) P(a|B, e) P(j|a) P(m|a)
   = α P(B) Σe P(e) Σa P(a|B, e) P(j|a) P(m|a)

Recursive depth-first enumeration: O(n) space, O(d^n) time
Outline

♦ Exact inference by enumeration
♦ Exact inference by variable elimination
♦ Approximate inference by stochastic simulation
♦ Approximate inference by Markov chain Monte Carlo

Enumeration algorithm

function Enumeration-Ask(X, e, bn) returns a distribution over X
   inputs: X, the query variable
           e, observed values for variables E
           bn, a Bayesian network with variables {X} ∪ E ∪ Y
   Q(X) ← a distribution over X, initially empty
   for each value xi of X do
       extend e with value xi for X
       Q(xi) ← Enumerate-All(Vars[bn], e)
   return Normalize(Q(X))

function Enumerate-All(vars, e) returns a real number
   if Empty?(vars) then return 1.0
   Y ← First(vars)
   if Y has value y in e
       then return P(y | Pa(Y)) × Enumerate-All(Rest(vars), e)
       else return Σy P(y | Pa(Y)) × Enumerate-All(Rest(vars), e_y)
            where e_y is e extended with Y = y
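As a concrete companion to the pseudocode, here is a minimal Python sketch of enumeration on the burglary network. The CPT values are the standard ones for that example (the ¬B rows of the Alarm CPT do not appear on these slides, so 0.29 and 0.001 are taken on the usual assumption); all identifiers (CPTS, PARENTS, ORDER, enumerate_all, enumeration_ask) are illustrative, not from any library.

# A minimal sketch of inference by enumeration on the burglary network.
# Each CPT entry maps an assignment of the parents to P(var = True | parents).
CPTS = {
    "B": {(): 0.001},
    "E": {(): 0.002},
    "A": {(True, True): 0.95, (True, False): 0.94,
          (False, True): 0.29, (False, False): 0.001},
    "J": {(True,): 0.90, (False,): 0.05},
    "M": {(True,): 0.70, (False,): 0.01},
}
PARENTS = {"B": [], "E": [], "A": ["B", "E"], "J": ["A"], "M": ["A"]}
ORDER = ["B", "E", "A", "J", "M"]          # topological order

def prob(var, value, event):
    """P(var = value | parents(var)) read off the CPT."""
    p_true = CPTS[var][tuple(event[p] for p in PARENTS[var])]
    return p_true if value else 1.0 - p_true

def enumerate_all(variables, event):
    """Sum the joint over all unassigned variables, depth-first."""
    if not variables:
        return 1.0
    Y, rest = variables[0], variables[1:]
    if Y in event:
        return prob(Y, event[Y], event) * enumerate_all(rest, event)
    return sum(prob(Y, y, {**event, Y: y}) * enumerate_all(rest, {**event, Y: y})
               for y in (True, False))

def enumeration_ask(X, evidence):
    """Return the normalized distribution P(X | evidence)."""
    q = {x: enumerate_all(ORDER, {**evidence, X: x}) for x in (True, False)}
    norm = sum(q.values())
    return {x: v / norm for x, v in q.items()}

# Example: P(Burglary | JohnCalls = true, MaryCalls = true) ≈ ⟨0.284, 0.716⟩
print(enumeration_ask("B", {"J": True, "M": True}))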
Inference tasks

Simple queries: compute posterior marginal P(Xi | E = e)
   e.g., P(NoGas | Gauge = empty, Lights = on, Starts = false)

Conjunctive queries: P(Xi, Xj | E = e) = P(Xi | E = e) P(Xj | Xi, E = e)

Optimal decisions: decision networks include utility information;
   probabilistic inference required for P(outcome | action, evidence)

Value of information: which evidence to seek next?

Sensitivity analysis: which probability values are most critical?

Explanation: why do I need a new starter motor?

Evaluation tree

[Figure: evaluation tree for the enumeration of P(b | j, m), branching on e and a,
with leaf products of P(b) = .001, P(e) = .002 / P(¬e) = .998,
P(a|b, e) = .95 / P(¬a|b, e) = .05 / P(a|b, ¬e) = .94 / P(¬a|b, ¬e) = .06,
P(j|a) = .90 / P(j|¬a) = .05, and P(m|a) = .70 / P(m|¬a) = .01]

Enumeration is inefficient: repeated computation
   e.g., computes P(j|a) P(m|a) for each value of e
Inference by variable elimination

Variable elimination: carry out summations right-to-left,
storing intermediate results (factors) to avoid recomputation

P(B | j, m)
   = α P(B) Σe P(e) Σa P(a|B, e) P(j|a) P(m|a)    (one factor per variable: B, E, A, J, M)
   = α P(B) Σe P(e) Σa P(a|B, e) P(j|a) fM(a)
   = α P(B) Σe P(e) Σa P(a|B, e) fJ(a) fM(a)
   = α P(B) Σe P(e) Σa fA(a, b, e) fJ(a) fM(a)
   = α P(B) Σe P(e) fĀJM(b, e)      (sum out A)
   = α P(B) fĒĀJM(b)                (sum out E)
   = α fB(b) × fĒĀJM(b)

Irrelevant variables

Consider the query P(JohnCalls | Burglary = true)
   P(J | b) = α P(b) Σe P(e) Σa P(a|b, e) P(J|a) Σm P(m|a)

Sum over m is identically 1; M is irrelevant to the query

Thm 1: Y is irrelevant unless Y ∈ Ancestors({X} ∪ E)

Here, X = JohnCalls, E = {Burglary}, and
Ancestors({X} ∪ E) = {Alarm, Earthquake},
so MaryCalls is irrelevant

(Compare this to backward chaining from the query in Horn clause KBs)
Variable elimination: Basic operations

Summing out a variable from a product of factors:
   move any constant factors outside the summation
   add up submatrices in pointwise product of remaining factors

   Σx f1 × · · · × fk = f1 × · · · × fi Σx fi+1 × · · · × fk = f1 × · · · × fi × f_X̄

   assuming f1, . . . , fi do not depend on X

Pointwise product of factors f1 and f2:
   f1(x1, . . . , xj, y1, . . . , yk) × f2(y1, . . . , yk, z1, . . . , zl)
      = f(x1, . . . , xj, y1, . . . , yk, z1, . . . , zl)
   E.g., f1(a, b) × f2(b, c) = f(a, b, c)

Irrelevant variables contd.

Defn: moral graph of Bayes net: marry all parents and drop arrows

Defn: A is m-separated from B by C iff separated by C in the moral graph

Thm 2: Y is irrelevant if m-separated from X by E

For P(JohnCalls | Alarm = true), both
Burglary and Earthquake are irrelevant
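To make the two operations concrete, here is a minimal Python sketch of pointwise product and summing out, assuming Boolean variables and representing a factor as a (variable list, table) pair; all names and the example numbers are illustrative.

# A minimal sketch of the two basic factor operations for Boolean variables.
# A factor is (variables, table), where table maps a tuple of values
# (one per variable, in order) to a number.
from itertools import product

def pointwise_product(f1, f2):
    """Multiply two factors, matching entries on their shared variables."""
    vars1, t1 = f1
    vars2, t2 = f2
    out_vars = vars1 + [v for v in vars2 if v not in vars1]
    table = {}
    for values in product((True, False), repeat=len(out_vars)):
        assign = dict(zip(out_vars, values))
        v1 = t1[tuple(assign[v] for v in vars1)]
        v2 = t2[tuple(assign[v] for v in vars2)]
        table[values] = v1 * v2
    return (out_vars, table)

def sum_out(var, factor):
    """Sum a factor over all values of var, dropping it from the scope."""
    vars_, t = factor
    i = vars_.index(var)
    out_vars = vars_[:i] + vars_[i + 1:]
    table = {}
    for values, p in t.items():
        key = values[:i] + values[i + 1:]
        table[key] = table.get(key, 0.0) + p
    return (out_vars, table)

# E.g., f1(a, b) × f2(b, c) = f(a, b, c); then Σb f(a, b, c) is a factor on (a, c)
f1 = (["A", "B"], {(True, True): 0.3, (True, False): 0.7,
                   (False, True): 0.9, (False, False): 0.1})
f2 = (["B", "C"], {(True, True): 0.2, (True, False): 0.8,
                   (False, True): 0.6, (False, False): 0.4})
print(sum_out("B", pointwise_product(f1, f2)))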
Variable elimination algorithm

function Elimination-Ask(X, e, bn) returns a distribution over X
   inputs: X, the query variable
           e, evidence specified as an event
           bn, a belief network specifying joint distribution P(X1, . . . , Xn)
   factors ← [ ]; vars ← Reverse(Vars[bn])
   for each var in vars do
       factors ← [Make-Factor(var, e)|factors]
       if var is a hidden variable then factors ← Sum-Out(var, factors)
   return Normalize(Pointwise-Product(factors))

Complexity of exact inference

Singly connected networks (or polytrees):
– any two nodes are connected by at most one (undirected) path
– time and space cost of variable elimination are O(d^k n)

Multiply connected networks:
– can reduce 3SAT to exact inference ⇒ NP-hard
– equivalent to counting 3SAT models ⇒ #P-complete

[Figure: a 3-CNF formula over variables A, B, C, D encoded as a multiply connected
network: one root node per variable with prior 0.5, one node per three-literal
clause (clauses 1, 2, 3), and an AND node combining the clauses]
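A minimal sketch of the elimination loop itself, reusing CPTS/PARENTS/ORDER and prob from the enumeration sketch and pointwise_product/sum_out from the factor sketch above, so it runs only together with those. For simplicity it multiplies all accumulated factors before summing out a hidden variable rather than only the factors that mention it, which is correct but less efficient than a real Sum-Out; names are illustrative.

# A minimal sketch of Elimination-Ask on the burglary network.
from itertools import product

def make_factor(var, evidence):
    """Build the factor for var's CPT, restricted to the evidence."""
    scope = [v for v in [var] + PARENTS[var] if v not in evidence]
    table = {}
    for values in product((True, False), repeat=len(scope)):
        assign = {**evidence, **dict(zip(scope, values))}
        table[values] = prob(var, assign[var], assign)
    return (scope, table)

def elimination_ask(X, evidence):
    factors = []
    for var in reversed(ORDER):                      # right-to-left
        factors.insert(0, make_factor(var, evidence))
        if var != X and var not in evidence:         # hidden variable: sum it out
            f = factors[0]
            for g in factors[1:]:
                f = pointwise_product(f, g)
            factors = [sum_out(var, f)]
    result = factors[0]
    for g in factors[1:]:
        result = pointwise_product(result, g)
    scope, table = result
    norm = sum(table.values())
    return {values[scope.index(X)]: p / norm for values, p in table.items()}

# Same query as before: P(Burglary | JohnCalls = true, MaryCalls = true)
print(elimination_ask("B", {"J": True, "M": True}))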
Inference by stochastic simulation

Basic idea:
1) Draw N samples from a sampling distribution S
2) Compute an approximate posterior probability P̂
3) Show this converges to the true probability P

Outline:
– Sampling from an empty network
– Rejection sampling: reject samples disagreeing with evidence
– Likelihood weighting: use evidence to weight samples
– Markov chain Monte Carlo (MCMC): sample from a stochastic process
  whose stationary distribution is the true posterior

Example

[Figure: the Cloudy/Sprinkler/Rain/WetGrass network used in the sampling examples.
P(C) = .50; P(S|C): C = T → .10, C = F → .50; P(R|C): C = T → .80, C = F → .20;
P(W|S, R): TT → .99, TF → .90, FT → .90, FF → .01.
The Example slides that follow step through Prior-Sample generating the event
Cloudy = true, Sprinkler = false, Rain = true, WetGrass = true.]
Sampling from an empty network

function Prior-Sample(bn) returns an event sampled from bn
   inputs: bn, a belief network specifying joint distribution P(X1, . . . , Xn)
   x ← an event with n elements
   for i = 1 to n do
       xi ← a random sample from P(Xi | parents(Xi))
             given the values of Parents(Xi) in x
   return x
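A minimal Python sketch of Prior-Sample on the Cloudy/Sprinkler/Rain/WetGrass network, using the CPT values from the figure; the identifiers (CPTS, PARENTS, ORDER, prior_sample) are illustrative.

# A minimal sketch of Prior-Sample on the Cloudy network.
import random

CPTS = {
    "Cloudy":    {(): 0.50},
    "Sprinkler": {(True,): 0.10, (False,): 0.50},
    "Rain":      {(True,): 0.80, (False,): 0.20},
    "WetGrass":  {(True, True): 0.99, (True, False): 0.90,
                  (False, True): 0.90, (False, False): 0.01},
}
PARENTS = {"Cloudy": [], "Sprinkler": ["Cloudy"], "Rain": ["Cloudy"],
           "WetGrass": ["Sprinkler", "Rain"]}
ORDER = ["Cloudy", "Sprinkler", "Rain", "WetGrass"]   # topological order

def prior_sample():
    """Sample each variable in topological order from P(Xi | parents(Xi))."""
    event = {}
    for var in ORDER:
        p_true = CPTS[var][tuple(event[p] for p in PARENTS[var])]
        event[var] = random.random() < p_true
    return event

print(prior_sample())   # e.g. {'Cloudy': True, 'Sprinkler': False, 'Rain': True, 'WetGrass': True}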
Sampling from an empty network contd.

Probability that PriorSample generates a particular event
   S_PS(x1 . . . xn) = Π_{i=1..n} P(xi | parents(Xi)) = P(x1 . . . xn)
i.e., the true prior probability

E.g., S_PS(t, f, t, t) = 0.5 × 0.9 × 0.8 × 0.9 = 0.324 = P(t, f, t, t)

Let N_PS(x1 . . . xn) be the number of samples generated for event x1, . . . , xn

Then we have
   lim_{N→∞} P̂(x1, . . . , xn) = lim_{N→∞} N_PS(x1, . . . , xn)/N
                               = S_PS(x1, . . . , xn)
                               = P(x1 . . . xn)

That is, estimates derived from PriorSample are consistent

Shorthand: P̂(x1, . . . , xn) ≈ P(x1 . . . xn)
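A quick sketch of the consistency claim in code: the fraction of Prior-Sample events equal to (t, f, t, t) should approach S_PS = 0.324. It reuses prior_sample from the sketch above; the sample count is arbitrary.

# Empirical check that Prior-Sample hits (t, f, t, t) with frequency ≈ 0.324.
N = 100_000
target = {"Cloudy": True, "Sprinkler": False, "Rain": True, "WetGrass": True}
hits = sum(prior_sample() == target for _ in range(N))
print(hits / N)   # ≈ 0.324 for large N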
Rejection sampling

P̂(X|e) estimated from samples agreeing with e

function Rejection-Sampling(X, e, bn, N) returns an estimate of P(X|e)
   local variables: N, a vector of counts over X, initially zero
   for j = 1 to N do
       x ← Prior-Sample(bn)
       if x is consistent with e then
           N[x] ← N[x] + 1 where x is the value of X in x
   return Normalize(N[X])

E.g., estimate P(Rain | Sprinkler = true) using 100 samples
   27 samples have Sprinkler = true
   Of these, 8 have Rain = true and 19 have Rain = false.

P̂(Rain | Sprinkler = true) = Normalize(⟨8, 19⟩) = ⟨0.296, 0.704⟩

Similar to a basic real-world empirical estimation procedure
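A minimal sketch of rejection sampling for the query on this slide, reusing prior_sample from the sketch above; the sample count and names are illustrative.

# A minimal sketch of rejection sampling on the Cloudy network.
def rejection_sampling(X, evidence, n_samples=10000):
    counts = {True: 0, False: 0}
    for _ in range(n_samples):
        sample = prior_sample()
        if all(sample[var] == val for var, val in evidence.items()):
            counts[sample[X]] += 1           # keep only samples consistent with e
    total = sum(counts.values())
    return {x: c / total for x, c in counts.items()}

# Roughly ⟨0.3, 0.7⟩, matching the hand count ⟨0.296, 0.704⟩ on the slide
print(rejection_sampling("Rain", {"Sprinkler": True}))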
Analysis of rejection sampling

P̂(X|e) = α N_PS(X, e)         (algorithm defn.)
        = N_PS(X, e)/N_PS(e)   (normalized by N_PS(e))
        ≈ P(X, e)/P(e)         (property of PriorSample)
        = P(X|e)               (defn. of conditional probability)

Hence rejection sampling returns consistent posterior estimates

Problem: hopelessly expensive if P(e) is small

P(e) drops off exponentially with number of evidence variables!
Likelihood weighting

Idea: fix evidence variables, sample only nonevidence variables,
and weight each sample by the likelihood it accords the evidence

function Likelihood-Weighting(X, e, bn, N) returns an estimate of P(X|e)
   local variables: W, a vector of weighted counts over X, initially zero
   for j = 1 to N do
       x, w ← Weighted-Sample(bn, e)
       W[x] ← W[x] + w where x is the value of X in x
   return Normalize(W[X])

function Weighted-Sample(bn, e) returns an event and a weight
   x ← an event with n elements; w ← 1
   for i = 1 to n do
       if Xi has a value xi in e
           then w ← w × P(Xi = xi | parents(Xi))
           else xi ← a random sample from P(Xi | parents(Xi))
   return x, w

Likelihood weighting example

[Figure sequence: the Cloudy/Sprinkler/Rain/WetGrass network with evidence
Sprinkler = true, WetGrass = true. Weighted-Sample starts with w = 1.0, samples
Cloudy = true, fixes Sprinkler = true (w = 1.0 × 0.1), samples Rain = true, then
fixes WetGrass = true, giving w = 1.0 × 0.1 × 0.99 = 0.099.]
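A minimal sketch of likelihood weighting on the same network, reusing CPTS/PARENTS/ORDER from the Prior-Sample sketch above; names and the sample count are illustrative.

# A minimal sketch of likelihood weighting on the Cloudy network.
import random

def weighted_sample(evidence):
    """Fix evidence variables, sample the rest, and accumulate the weight."""
    event, w = {}, 1.0
    for var in ORDER:
        p_true = CPTS[var][tuple(event[p] for p in PARENTS[var])]
        if var in evidence:
            event[var] = evidence[var]
            w *= p_true if evidence[var] else 1.0 - p_true
        else:
            event[var] = random.random() < p_true
    return event, w

def likelihood_weighting(X, evidence, n_samples=10000):
    weights = {True: 0.0, False: 0.0}
    for _ in range(n_samples):
        event, w = weighted_sample(evidence)
        weights[event[X]] += w
    total = sum(weights.values())
    return {x: v / total for x, v in weights.items()}

# P(Rain | Sprinkler = true, WetGrass = true)
print(likelihood_weighting("Rain", {"Sprinkler": True, "WetGrass": True}))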
Approximate inference using MCMC

“State” of network = current assignment to all variables.

Generate next state by sampling one variable given its Markov blanket
Sample each variable in turn, keeping evidence fixed

function MCMC-Ask(X, e, bn, N) returns an estimate of P(X|e)
   local variables: N[X], a vector of counts over X, initially zero
                    Z, the nonevidence variables in bn
                    x, the current state of the network, initially copied from e
   initialize x with random values for the variables in Z
   for j = 1 to N do
       for each Zi in Z do
           sample the value of Zi in x from P(Zi | mb(Zi))
                given the values of MB(Zi) in x
           N[x] ← N[x] + 1 where x is the value of X in x
   return Normalize(N[X])

Can also choose a variable to sample at random each time
The Markov chain

With Sprinkler = true, WetGrass = true, there are four states:

[Figure: the four states of the chain, one copy of the Cloudy/Sprinkler/Rain/WetGrass
network per joint value of the nonevidence variables Cloudy and Rain, with transition
arrows between the states]

Wander about for a while, average what you see
Likelihood weighting analysis

Sampling probability for WeightedSample is
   S_WS(z, e) = Π_{i=1..l} P(zi | parents(Zi))
Note: pays attention to evidence in ancestors only
   ⇒ somewhere “in between” prior and posterior distribution

Weight for a given sample z, e is
   w(z, e) = Π_{i=1..m} P(ei | parents(Ei))

Weighted sampling probability is
   S_WS(z, e) w(z, e)
      = Π_{i=1..l} P(zi | parents(Zi)) Π_{i=1..m} P(ei | parents(Ei))
      = P(z, e)    (by standard global semantics of network)

Hence likelihood weighting returns consistent estimates
   but performance still degrades with many evidence variables
   because a few samples have nearly all the total weight

MCMC example contd.

Estimate P(Rain | Sprinkler = true, WetGrass = true)

Sample Cloudy or Rain given its Markov blanket, repeat.
Count number of times Rain is true and false in the samples.

E.g., visit 100 states
   31 have Rain = true, 69 have Rain = false

P̂(Rain | Sprinkler = true, WetGrass = true)
   = Normalize(⟨31, 69⟩) = ⟨0.31, 0.69⟩

Theorem: chain approaches stationary distribution:
long-run fraction of time spent in each state is exactly
proportional to its posterior probability
Markov blanket sampling

Markov blanket of Cloudy is Sprinkler and Rain
Markov blanket of Rain is Cloudy, Sprinkler, and WetGrass

Probability given the Markov blanket is calculated as follows:
   P(x'i | mb(Xi)) = P(x'i | parents(Xi)) Π_{Zj ∈ Children(Xi)} P(zj | parents(Zj))

Easily implemented in message-passing parallel systems, brains

Main computational problems:
1) Difficult to tell if convergence has been achieved
2) Can be wasteful if Markov blanket is large:
   P(Xi | mb(Xi)) won’t change much (law of large numbers)
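A minimal sketch of Markov blanket (Gibbs) sampling on the same network, reusing CPTS/PARENTS/ORDER from the Prior-Sample sketch above. The conditional for each variable is computed from the formula on this slide, renormalized over its two values; names are illustrative, and counting one state per resample follows the MCMC-Ask pseudocode.

# A minimal sketch of Gibbs sampling (MCMC-Ask) on the Cloudy network.
import random

CHILDREN = {v: [c for c in ORDER if v in PARENTS[c]] for v in ORDER}

def cond_prob(var, value, state):
    """P(var = value | parents(var)) read off the CPT, given the current state."""
    p_true = CPTS[var][tuple(state[p] for p in PARENTS[var])]
    return p_true if value else 1.0 - p_true

def markov_blanket_sample(var, state):
    """Sample var from P(var | mb(var)) ∝ P(var | parents) Π_children P(child | parents)."""
    weights = {}
    for value in (True, False):
        s = {**state, var: value}
        w = cond_prob(var, value, s)
        for child in CHILDREN[var]:
            w *= cond_prob(child, s[child], s)
        weights[value] = w
    return random.random() < weights[True] / (weights[True] + weights[False])

def mcmc_ask(X, evidence, n_steps=10000):
    nonevidence = [v for v in ORDER if v not in evidence]
    state = dict(evidence)
    for v in nonevidence:                       # random initial state
        state[v] = random.choice([True, False])
    counts = {True: 0, False: 0}
    for _ in range(n_steps):
        for v in nonevidence:
            state[v] = markov_blanket_sample(v, state)
            counts[state[X]] += 1               # count a state after each resample
    total = sum(counts.values())
    return {x: c / total for x, c in counts.items()}

# P(Rain | Sprinkler = true, WetGrass = true): close to the ⟨0.31, 0.69⟩ estimate on the slide
print(mcmc_ask("Rain", {"Sprinkler": True, "WetGrass": True}))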
Summary
Exact inference by variable elimination:
– polytime on polytrees, NP-hard on general graphs
– space = time, very sensitive to topology
Approximate inference by LW, MCMC:
– LW does poorly when there is lots of (downstream) evidence
– LW, MCMC generally insensitive to topology
– Convergence can be very slow with probabilities close to 1 or 0
– Can handle arbitrary combinations of discrete and continuous variables