
2ND SCHOOL ON FOPSS: LOGIC AND LEARNING

A tutorial on Bayesian learning

Daniel M. Roy
University of Toronto | Vector Institute

The theory of probability is at bottom nothing but common sense
reduced to calculus; it makes one appreciate with exactness that
which accurate minds feel with a sort of instinct, often without
being able to account for it. — Pierre Simon Laplace

Daniel M. Roy 1 / 101


Based on
Freer, Roy, and Tenenbaum (2012). Towards common-sense reasoning via
conditional simulation: Legacies of Turing in Artificial Intelligence. Turing’s
Legacy (ASL Lecture Notes in Logic).

Daniel M. Roy 2 / 101


The goal of this tutorial is to give a “Bayesian perspective” on “learning”.
I What is a “Bayesian perspective”?
I What is “learning”?

Daniel M. Roy 3 / 101


THE BAYESIAN PERSPECTIVE

I Probability is no longer just a limiting relative frequency associated with
  independent identical experiments.
I Instead, probability represents degree of belief.
I P{ I will fly to Vilnius tomorrow } = 0.97.
I P{ I will fly to Canada tomorrow } = 0.001.
I P{ I will fly to Canada and Vilnius tomorrow } = 0.
I P{ I will fly to Canada or Vilnius tomorrow } = 0.971.
I P{ Tom has flu | Tom has fever and muscle aches in Winter } = 0.6.
I P{ Tom has flu | Tom has fever and muscle aches in Summer } = 0.1.
I Probabilities are personal.
I The key structures are joint distributions of multiple random variables.
I X = Patient has flu, Y = has fever, Z = has muscle aches,
S = It is Summer, W = It is Winter.
I A joint distribution is a specification of probabilities
P{X = x, Y = y, Z = z, S = s, W = w} for every possible “event”.

Daniel M. Roy 4 / 101


WHAT IS LEARNING?

I If we do not know P, we can attempt to “learn” it from data. This
  necessarily requires some strong assumptions and relates to the
  problem of induction.
I Learning to classify images: Learn P{Label | Image } from dataset of
labelled images.
I Learning to diagnose: Learn P{Disease, Symptoms } from past patient
data, in order to infer diseases from symptoms in future patients.
I The distributions P above are generally assumed to come from a
parametric family {Pθ }, e.g., θ ∈ Rd . The (“true” or “best”) parameter θ ∗
is assumed unknown.
I An approach is “fully Bayesian” if one uses probability distributions to
express uncertainty for all unknown quantities (such as θ ∗ ), modeling
them as random variables with prior distributions.
I In a “fully Bayesian” approach, learning is probabilistic inference, and
thus everything is probabilistic inference.
I In contrast, in a frequentist approach, one would develop estimators for
θ∗ with good frequentist (sampling) properties.
Daniel M. Roy 5 / 101
UNIVERSAL STOCHASTIC INFERENCE

I The Bayesian framework is conceptually simple:


I represent all knowledge by distributions
I evidence incorporated by conditioning
I I will introduce the Bayesian framework using the computational
framework of universal stochastic inference wherein
I distributions are represented by programs
I evidence is represented by predicates
I conditioning is a higher-order procedure
I Programs can represent distributions by being simulators
I a program P represents a distribution µ by simulating µ,
i.e., generating a random output X whose distribution is µ

Daniel M. Roy 6 / 101


RELATED FRAMEWORKS

This perspective is closely related to


I Probabilistic programming
I Approximate Bayesian Computation (ABC)
I Implicit Generative Models

Daniel M. Roy 7 / 101


STRUCTURE OF THE TUTORIAL

The tutorial centers around a higher-order procedure, called QUERY, which
implements a simple, though generic, form of probabilistic conditioning.
I QUERY operates on complex probabilistic models that are themselves
represented by programs
I By using QUERY appropriately, one can describe various forms of
inference, learning, and decision-making.
I We will introduce inference, learning, and decision making through an
extended example of medical diagnosis, a classic AI problem.
I Elusive “common-sense” behavior arises implicitly from past experience
and models of causal structure and goals, rather than explicitly via rules
or purely deductive reasoning.

Daniel M. Roy 8 / 101


EXAMPLE: SIMPLE PROBABILISTIC PYTHON PROGRAM

1 def binomial(n, p):
2     return sum( [bernoulli(p) for i in range(n)] )

I returns a random integer in {0, . . . , n}.
I defines a family of distributions on {0, . . . , n},
  in particular, the Binomial family.
I represents a statistical model of
  the # of successes among
  n independent and identical experiments

Daniel M. Roy 9 / 101



EXAMPLE: SIMPLE PROBABILISTIC PYTHON PROGRAM

1 def binomial(n, p):
2     return sum( [bernoulli(p) for i in range(n)] )
3 def randomized_trial():
4     p_control = uniform(0,1)    # prior
5     p_treatment = uniform(0,1)  # prior
6     return ( binomial(100, p_control),
7              binomial(10, p_treatment) )

represents a Bayesian model of a randomized trial.

[Figure: the unit square of parameter pairs (p_control, p_treatment). Simulation draws
a pair from the uniform prior (here about (0.67, 0.86)) and produces the data (71, 9);
inference maps the observed data (71, 9) back to a posterior over the parameters,
concentrated near (0.71, 0.9).]

Daniel M. Roy 10 / 101
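
A minimal runnable version of this model in ordinary Python, with the primitives
bernoulli and uniform(0,1) implemented via the standard random module (a sketch; the
slides leave these primitives abstract):

import random

def bernoulli(p):
    # 1 with probability p, else 0
    return 1 if random.random() < p else 0

def binomial(n, p):
    return sum(bernoulli(p) for _ in range(n))

def randomized_trial():
    p_control = random.uniform(0, 1)    # prior
    p_treatment = random.uniform(0, 1)  # prior
    return (binomial(100, p_control), binomial(10, p_treatment))

print(randomized_trial())   # e.g., (71, 9) is one possible output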


INTRODUCING QUERY

1 def QUERY(guesser, checker):
2     # guesser: Unit -> S
3     # checker: S -> Boolean
4     accept = False
5     while (not accept):
6         guess = guesser()
7         accept = checker(guess)
8     return guess

Daniel M. Roy 11 / 101


UNDERSTANDING QUERY

1 def QUERY(guesser, checker):
2     # guesser: Unit -> S
3     # checker: S -> Boolean
4     accept = False
5     while (not accept):
6         guess = guesser()
7         accept = checker(guess)
8     return guess

As a first step towards understanding QUERY, consider the trivial predicate

lambda _: True

Then guesser() has the same meaning (distributional semantics) as

QUERY(guesser, lambda _: True)

Daniel M. Roy 12 / 101


UNDERSTANDING QUERY

1 def QUERY(guesser, checker):
2     # guesser: Unit -> S
3     # checker: S -> Boolean
4     accept = False
5     while (not accept):
6         guess = guesser()
7         accept = checker(guess)
8     return guess

Consider a slightly more interesting example:

def N():
    return uniformInt(range(1,180))

and consider the predicate

def div235(n):
    return isDivBy(n,2) or isDivBy(n,3) or isDivBy(n,5)

What is the meaning of QUERY(N, div235)?


Daniel M. Roy 12 / 101
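
One way to answer this question concretely is to implement QUERY as written and run
it. The sketch below is plain Python; uniformInt and isDivBy are replaced by
random.randint and the % operator. The samples it produces are (up to sampling error)
uniformly distributed over the integers in {1, . . . , 179} that are divisible by 2, 3,
or 5.

import random

def QUERY(guesser, checker):
    accept = False
    while not accept:
        guess = guesser()
        accept = checker(guess)
    return guess

def N():
    return random.randint(1, 179)          # uniform on {1, ..., 179}

def div235(n):
    return n % 2 == 0 or n % 3 == 0 or n % 5 == 0

samples = [QUERY(N, div235) for _ in range(10000)]
assert all(div235(n) for n in samples)      # every accepted guess satisfies the predicate
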
CONDITIONING AS A HIGHER-ORDER PROCEDURE

1 def QUERY(guesser, checker):
2     # guesser: Unit -> S
3     # checker: S -> Boolean
4     accept = False
5     while (not accept):
6         guess = guesser()
7         accept = checker(guess)
8     return guess

QUERY represents the higher-order operation of conditioning. When checker is
deterministic, QUERY denotes the map

    (P, 1A) ↦ P( · | A) := P( · ∩ A) / P(A).    (1)

QUERY halts with probability 1 provided P(A) > 0.

Daniel M. Roy 13 / 101


THE STOCHASTIC INFERENCE PROBLEM

INPUT: guesser and checker probabilistic programs.

OUTPUT: a sample from the same distribution as the program

1 accept = False
2 while (not accept):
3     guess = guesser()
4     accept = checker(guess)
5 return guess

This computation captures Bayesian statistical inference.

“prior” distribution     ←→  distribution of guesser()
“likelihood(g)”          ←→  P{ checker(g) is True }
“posterior” distribution ←→  distribution of return value

Daniel M. Roy 14 / 101


EXAMPLE: INFERRING BIAS OF A COIN

1 accept = False
2 while (not accept):
3     guess = guesser()
4     accept = checker(guess)
5 return guess

Given x1 , . . . , xn ∈ {0, 1}, report probability of xn+1 = 1?  E.g., 0, 0, 1, 0, 0

def guesser():
    p = uniform()
    return p

def checker(p):
    return [0,0,1,0,0] == bernoulli(p,5)

[Figure: histogram (count vs. p) of the accepted values of p, shown after 2, 7, and
roughly 5000 accepted samples; with many samples the histogram concentrates around the
posterior distribution of p given one success in five trials.]
Daniel M. Roy 15 / 101
Let s = x1 + · · · + xn and let U be uniformly distributed.
For all t ∈ [0, 1], we have P{U ≤ t} = t and

    P{checker(t, x) is True} = P{∀i ( Ui ≤ t ⇐⇒ xi = 1 )} = t^s (1 − t)^{n−s}.

[Figure: Pr(accept) as a function of t, for n = 6 and s ∈ {1, 3, 5}; each curve peaks
at t = s/n, and the acceptance probability stays below about 0.07 everywhere.]

    P{checker(U, x) is True} = ∫₀¹ t^s (1 − t)^{n−s} dt = s!(n − s)!/(n + 1)! =: Z(s)

Let p(t)dt be the probability that the accepted θ ∈ [t, t + dt). Then

    p(t)dt ≈ t^s (1 − t)^{n−s} dt + (1 − Z(s)) p(t)dt,   and hence   p(t)dt ≈ (t^s (1 − t)^{n−s} / Z(s)) dt.

The probability that the accepted X = 1 is then ∫ t p(t)dt = (s + 1)/(n + 2).
Daniel M. Roy 16 / 101
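
The closed form (s + 1)/(n + 2) can be checked against the rejection sampler directly.
A small Python sketch (guesser and checker as on the previous slide, written out with
the random module):

import random

def sample_posterior_p(data):
    # rejection sampling: propose p ~ Uniform[0,1]; accept if fresh coin flips
    # with bias p exactly reproduce the observed data
    while True:
        p = random.random()
        if all((random.random() < p) == (x == 1) for x in data):
            return p

data = [0, 0, 1, 0, 0]                       # n = 5, s = 1
samples = [sample_posterior_p(data) for _ in range(20000)]
print(sum(samples) / len(samples))           # ≈ (s + 1)/(n + 2) = 2/7 ≈ 0.286
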
EXAMPLE: INFERRING OBJECTS FROM AN IMAGE

1 accept = False
2 while (not accept):
3     guess = guesser()
4     accept = checker(guess)
5 return guess

How many objects in this image?

[Figure: the observed image of colored blocks, together with a histogram (count vs.
k = # blocks) over the accepted samples.]

1 def guesser():
2     k = geometric()
3     blocks = [ randomblock() for _ in range(k) ]
4     colors = [ randomcolor() for _ in range(k) ]
5     return (k,blocks,colors)
7 def checker(k,blocks,colors):
8     return rasterize(blocks,colors) ==
      (the right-hand side of the comparison is the observed image, shown inline on the slide)

Daniel M. Roy 17 / 101


EXAMPLE: EXTRACTING 3D STRUCTURE FROM IMAGES

1 accept = False
2 while (not accept):
3     guess = guesser()
4     accept = checker(guess)
5 return guess

[Figure: a guessed 3D scene is rendered and compared, via checker, against the
observed 2D image; inference runs in the reverse direction, recovering plausible 3D
structure from the image.]

Daniel M. Roy 18 / 101


QUERY AS AN ALGORITHM

Key point: QUERY is not a serious proposal for an algorithm, but it denotes
the operation we care about in Bayesian analysis.

How efficient is QUERY? Let model() represent a distribution P and pred
represent an indicator function 1A.

Proposition
In expectation, QUERY(model,pred) takes 1/P(A) times as long to run as
pred(model()).

Corollary
If pred(model()) is efficient and P(A) is not too small, then
QUERY(model,pred) is efficient.

Daniel M. Roy 19 / 101
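
A small experiment consistent with the proposition (the particular model and predicate
below are made up purely for illustration):

import random

def query_with_count(guesser, checker):
    # runs QUERY, returning the accepted guess and the number of proposals used
    tries = 0
    while True:
        tries += 1
        guess = guesser()
        if checker(guess):
            return guess, tries

model = lambda: random.randint(1, 100)
pred = lambda x: x <= 5                      # so P(A) = 0.05
counts = [query_with_count(model, pred)[1] for _ in range(10000)]
print(sum(counts) / len(counts))             # close to 1 / P(A) = 20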


MIT-CHURCH AKA TRACE-MCMC

1 def geometric(p):
2     if bernoulli(p) == 1: return 1
3     else: return 1 + geometric(p)

1 def aliased_geometric(p):
2     g = geometric(p)
3     return 1 if g < 3 else 0

Daniel M. Roy 21 / 101

MIT-CHURCH AKA TRACE-MCMC

1 def geometric(p):
2     if bernoulli(p) == 1: return 1
3     else: return 1 + geometric(p)

1 def aliased_geometric(p):
2     g = geometric(p)
3     return 1 if g < 3 else 0

[Figure: the execution traces of geometric(p), one per return value. The trace
returning 1 (probability p) makes a single bernoulli choice; the trace returning 2
(probability p(1 − p)) makes two; the trace returning 3 (probability p(1 − p)^2) makes
three; and in general the trace returning k has probability p(1 − p)^{k−1}.]

Daniel M. Roy 22 / 101


MIT-CHURCH AKA TRACE-MCMC

[Figure: trace-MCMC restricted to the two executions of geometric(p) that satisfy
g < 3, i.e., the trace returning 1 and the trace returning 2 (with p̄ := 1 − p). One
panel ("Proposal") rerandomizes part of the current trace; the other
("Metropolis–Hastings") applies the accept/reject correction. The resulting transition
matrix over the two traces is

    ( 1 − p̄/2    p̄/2 )
    (   1/2      1/2 ),

and its k-step transition probabilities converge, as k → ∞, to

    ( p/(p + p p̄)    p p̄/(p + p p̄) )
    ( p/(p + p p̄)    p p̄/(p + p p̄) ),

i.e., every row tends to the conditional distribution of the trace given g < 3.]

Daniel M. Roy 23 / 101
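
The claimed limit can be checked numerically; a short sketch (the transition matrix is
the one read off the diagram above):

import numpy as np

p = 0.3
pbar = 1 - p

T = np.array([[1 - pbar / 2, pbar / 2],
              [1 / 2,        1 / 2   ]])

# Conditional distribution of the trace given g < 3: (P(g=1 | g<3), P(g=2 | g<3)).
target = np.array([p, p * pbar]) / (p + p * pbar)

print(np.linalg.matrix_power(T, 50)[0])   # first row of T^50
print(target)                             # the two agree to many decimal places
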
QUERY CAN CAPTURE A WIDE RANGE OF AI BEHAVIORS

Despite the apparent simplicity of the QUERY construct, we will see that it
captures the essential structure of a range of common-sense inferences. We
now demonstrate the power of the QUERY formalism by exploring its
behavior in a medical diagnosis example.

Daniel M. Roy 24 / 101


Introduction

QUERY and conditional simulation

Probabilistic inference

Conditional independence and compact representations

Learning parameters via probabilistic inference

Learning conditional independences via probabilistic inference

References

Daniel M. Roy 25 / 101


Introduction

QUERY and conditional simulation

Probabilistic inference

Conditional independence and compact representations

Learning parameters via probabilistic inference

Learning conditional independences via probabilistic inference

References

Daniel M. Roy 26 / 101


MEDICAL DIAGNOSIS MODEL DESCRIPTION, PART I

The remainder of the tutorial will use a medical diagnosis task as a running
example. The goal is to link observed symptoms to (unobserved) diseases.
The stochastic inference problem is:

QUERY(diseasesAndSymptoms,checkSymptoms)

All that remains is to define the two procedures that specify the model.
diseasesAndSymptoms() will produce a random (possibly empty) set
of diseases and associated symptoms, modeling the distribution of diseases
and symptoms of a randomly chosen patient arriving at a clinic.
checkSymptoms(...) checks the hypothesized symptoms against the
list of observed symptoms, accepting if there is a match.

Daniel M. Roy 27 / 101


MEDICAL DIAGNOSIS MODEL DESCRIPTION, PART II

The prior program diseasesAndSymptoms() proceeds as follows:

(1) For each disease n, sample an independent binary random variable Dn
    with mean pn, where

n Disease pn
1 Arthritis 0.06
2 Asthma 0.04
3 Diabetes 0.11
4 Epilepsy 0.002
5 Giardiasis 0.006
6 Influenza 0.08
7 Measles 0.001
8 Meningitis 0.003
9 MRSA 0.001
10 Salmonella 0.002
11 Tuberculosis 0.003

Dn indicates whether or not a patient has disease n.


Daniel M. Roy 28 / 101
MEDICAL DIAGNOSIS MODEL DESCRIPTION, PART III

(2) For each symptom m, sample an independent binary random variable Lm
    with mean ℓm, where

m Symptom ℓm
1 Fever 0.06
2 Cough 0.04
3 Hard breathing 0.001
4 Insulin resistant 0.15
5 Seizures 0.002
6 Aches 0.2
7 Sore neck 0.006

Lm indicates whether or not a patient spontaneously presents symptom m.

Daniel M. Roy 29 / 101


MEDICAL DIAGNOSIS MODEL DESCRIPTION, PART IV

(3) For each pair of disease n and symptom m, sample an independent
    binary random variable Cn,m with mean cn,m, where (rows are diseases
    n = 1, . . . , 11, columns are symptoms m = 1, . . . , 7):

n\m  1   2   3   4   5   6   7
1 .1 .2 .1 .2 .2 .5 .5
2 .1 .4 .8 .3 .1 .0 .1
3 .1 .2 .1 .9 .2 .3 .5
4 .4 .1 .0 .2 .9 .0 .0
5 .6 .3 .2 .1 .2 .8 .5
6 .4 .2 .0 .2 .0 .7 .4
7 .5 .2 .1 .2 .1 .6 .5
8 .8 .3 .0 .3 .1 .8 .9
9 .3 .2 .1 .2 .0 .3 .5
10 .4 .1 .0 .2 .1 .1 .2
11 .3 .2 .1 .2 .2 .3 .5

Conditioned on having disease n, Cn,m indicates whether or not disease n
causes the patient to present symptom m.

Daniel M. Roy 30 / 101


MEDICAL DIAGNOSIS MODEL DESCRIPTION, PART V
For each symptom m, we then define

Sm = max{Lm , D1 · C1,m , . . . , D11 · C11,m },

hence Sm ∈ {0, 1}. (The max operator is playing the role of a logical OR
operation.)

Sm indicates whether or not the patient presents symptom m.

Every term of the form Dn · Cn,m is interpreted as indicating whether (or not)
the patient has disease n and disease n has caused symptom m. The term
Lm captures the possibility that the symptom may present itself despite the
patient having none of the listed diseases.

Finally, define the output of diseasesAndSymptoms to be the vector


(D1 , . . . , D11 , S1 , . . . , S7 ).

Daniel M. Roy 31 / 101
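
A minimal runnable sketch of this generative process in ordinary Python, with the
probabilities taken from the tables above (bernoulli is the stand-in primitive used in
the earlier sketches):

import random

def bernoulli(p):
    return 1 if random.random() < p else 0

# marginal disease probabilities p_n (Arthritis, Asthma, ..., Tuberculosis)
p = [0.06, 0.04, 0.11, 0.002, 0.006, 0.08, 0.001, 0.003, 0.001, 0.002, 0.003]
# leak probabilities l_m (Fever, Cough, Hard breathing, Insulin resistant,
# Seizures, Aches, Sore neck)
l = [0.06, 0.04, 0.001, 0.15, 0.002, 0.2, 0.006]
# causal probabilities c[n][m]: disease n causes symptom m
c = [[.1,.2,.1,.2,.2,.5,.5],
     [.1,.4,.8,.3,.1,.0,.1],
     [.1,.2,.1,.9,.2,.3,.5],
     [.4,.1,.0,.2,.9,.0,.0],
     [.6,.3,.2,.1,.2,.8,.5],
     [.4,.2,.0,.2,.0,.7,.4],
     [.5,.2,.1,.2,.1,.6,.5],
     [.8,.3,.0,.3,.1,.8,.9],
     [.3,.2,.1,.2,.0,.3,.5],
     [.4,.1,.0,.2,.1,.1,.2],
     [.3,.2,.1,.2,.2,.3,.5]]

def diseasesAndSymptoms():
    D = [bernoulli(pn) for pn in p]                        # step (1)
    L = [bernoulli(lm) for lm in l]                        # step (2)
    C = [[bernoulli(cnm) for cnm in row] for row in c]     # step (3)
    # noisy-OR: S_m = max{L_m, D_1*C_{1,m}, ..., D_11*C_{11,m}}
    S = [max([L[m]] + [D[n] * C[n][m] for n in range(11)]) for m in range(7)]
    return D + S

print(diseasesAndSymptoms())    # a length-18 vector (D_1, ..., D_11, S_1, ..., S_7)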


EXPLORING THE MODEL

If we repeatedly evaluate diseasesAndSymptoms, we might see outputs
like those in the following array:

        Diseases                        Symptoms
        1 2 3 4 5 6 7 8 9 10 11        1 2 3 4 5 6 7
   1    0 0 0 0 0 0 0 0 0 0  0         0 0 0 0 0 0 0
   2    0 0 0 0 0 0 0 0 0 0  0         0 0 0 0 0 0 0
   3    0 0 1 0 0 0 0 0 0 0  0         0 0 0 1 0 0 0
   4    0 0 1 0 0 1 0 0 0 0  0         1 0 0 1 0 0 0
   5    0 0 0 0 0 0 0 0 0 0  0         0 0 0 0 0 1 0
   6    0 0 0 0 0 0 0 0 0 0  0         0 0 0 0 0 0 0
   7    0 0 1 0 0 0 0 0 0 0  0         0 0 0 1 0 1 0
   8    0 0 0 0 0 0 0 0 0 0  0         0 0 0 0 0 0 0

The rows correspond to eight randomly sampled patients:
1. diseases and symptom free
2. same
3. suffering from diabetes and presenting insulin resistance;
4. suffering from diabetes and influenza, and presenting a fever and insulin
resistance;
5. suffering from unexplained aches;
6. free from disease and symptoms;
7. suffering from diabetes, and presenting insulin resistance and aches;
8. disease and symptom free.
Daniel M. Roy 32 / 101
DISEASESANDSYMPTOMS

This model is a toy version of the real diagnostic model QMR-DT [Shwe
et al., 1991], built from the Quick Medical Reference (QMR) knowledge base
of hundreds of diseases and thousands of findings (such as symptoms or test
results). A key aspect of this model is the disjunctive relationship between the
diseases and the symptoms, known as a “noisy-OR”.
Shortcomings
As a model of natural patterns of diseases and symptoms in a random
patient, it leaves much to be desired:
I the model assumes that the presence or absence of any two diseases is
independent, although, as we will see later on in our analysis, diseases
are (as expected) typically not independent conditioned on symptoms.
I diseases may cause other diseases, and symptoms may cause diseases
I QMR-DT, like our toy model, was a major advance over earlier expert
systems and probabilistic models, allowing the simultaneous occurrence of
multiple diseases [Shwe et al., 1991].
These caveats notwithstanding, a close inspection of this simplified model will
demonstrate a surprising range of common-sense reasoning phenomena.
Daniel M. Roy 33 / 101
INCORPORATING OBSERVED SYMPTOMS

Consider the predicate that accepts if and only if S1 = 1 and S7 = 1, i.e., if
and only if the patient presents the symptoms of a fever and a sore neck,
ignoring other symptoms.

def checkSymptoms(D1 ,. . . ,D11 ,S1 ,. . . ,S7 ):
    return S1 == 1 and S7 == 1

What does QUERY(diseasesAndSymptoms, checkSymptoms)
produce? We can just run it! (Right?)

Let µ denote the output distribution of diseasesAndSymptoms, and let
A = {(d1 , . . . , d11 , s1 , . . . , s7 ) : s1 = s7 = 1}. Then
QUERY(diseasesAndSymptoms, checkSymptoms) generates
samples from the conditioned distribution µ(· | A).

We will study the conditional distributions of the diseases given the
symptoms. The following calculations may be very familiar to some readers,
but will be less so to others, and so we present them here to give a more
complete picture of the behavior of QUERY.
Daniel M. Roy 34 / 101
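
With the sketch of diseasesAndSymptoms given earlier, this conditional simulation can
be run directly by rejection; the conditioning event here is not especially rare, so
plain QUERY is feasible. (The predicate below is written to take the length-18 vector
returned by that sketch rather than 18 separate arguments.)

def checkSymptoms(sample):
    D, S = sample[:11], sample[11:]
    return S[0] == 1 and S[6] == 1      # S1 = 1 (fever) and S7 = 1 (sore neck)

def QUERY(guesser, checker):
    while True:
        guess = guesser()
        if checker(guess):
            return guess

diagnosis = QUERY(diseasesAndSymptoms, checkSymptoms)   # one posterior sample
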
POSTERIOR CALCULATIONS GIVEN S1 = S7 = 1, PART I

Consider a {0, 1}-assignment dn for each disease n, and write D = d to
denote the event that Dn = dn for every such n. We’d like to be able to
compute P{D = d | S1 = S7 = 1}.

How? We know that the output of diseasesAndSymptoms is produced from
a collection of independent random variables. We will use this fact to
compute P{D = d} and P{S1 = S7 = 1 | D = d}. We will then employ the
following identities:

    µ(B) µ(A | B) = µ(A ∩ B) = µ(B | A) µ(A).    (2)

Rearranging, we obtain Bayes’ rule,

    µ(A | B) = µ(B | A) µ(A) / µ(B).    (3)

Daniel M. Roy 35 / 101


POSTERIOR CALCULATIONS GIVEN S1 = S7 = 1, PART II
Assume for the moment that D = d. Then what is the probability that
checkSymptoms accepts? The probability we are seeking is the
conditional probability

P(S1 = S7 = 1 | D = d) (4)
= P(S1 = 1 | D = d) · P(S7 = 1 | D = d), (5)

where the equality follows from the observation that once the Dn variables
are fixed, the variables S1 and S7 are independent. To see this, recall that

Sm = max{Lm , D1 · C1,m , . . . , D11 · C11,m },

and note that there is no overlap in the variables determining S1 and S7 once D is fixed.

Daniel M. Roy 36 / 101


POSTERIOR CALCULATIONS GIVEN S1 = S7 = 1, PART III

Note that Sm = 1 if and only if Lm = 1 or Cn,m = 1 for some n such that
dn = 1. (Equivalently, Sm = 0 if and only if Lm = 0 and Cn,m = 0 for all n
such that dn = 1.) By the independence of each of these variables, it follows
that

    P(Sm = 1 | D = d) = 1 − (1 − ℓm) ∏_{n : dn = 1} (1 − cn,m).    (6)

It’s difficult to visualize a distribution on an 11-dimensional vector such as D.
Instead, let d, d′ be {0, 1}-assignments specifying different patterns of
diseases. We can characterize the a posteriori odds

    P(D = d | S1 = S7 = 1) / P(D = d′ | S1 = S7 = 1)

of the assignment d versus the assignment d′ in order to understand how
much more likely we are to see d as an explanation versus d′.

Daniel M. Roy 37 / 101


P OSTERIOR CALCULATIONS GIVEN S1 = S7 = 1, PART IV
By Bayes’ rule, this can be rewritten as

[P(S1 = S7 = 1 | D = d) · P(D = d)] / [P(S1 = S7 = 1 | D = d′) · P(D = d′)],   (7)

where P(D = d) = ∏_{n=1}^{11} P(Dn = dn ) by independence. Using (5), (6) and
(7), one may calculate that

P(Patient only has influenza | S1 = S7 = 1)
/ P(Patient has no listed disease | S1 = S7 = 1) ≈ 42,

i.e., an execution of diseasesAndSymptoms that satisfies the predicate
checkSymptoms is forty-two times more likely to posit that the patient
only has the flu than to posit that the patient has no disease at all.
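The calculation behind this ratio can be scripted directly from (5)–(7). The following minimal Python sketch does so for two disease patterns; the parameter values (marginals, leaks ℓm, and cause probabilities cn,m) are placeholders rather than the tutorial’s actual table, so the resulting odds will not be exactly 42.

p_disease = {6: 0.09, 8: 0.01}             # hypothetical marginals P(Dn = 1)
leak = {1: 0.05, 7: 0.02}                  # hypothetical leak probabilities l_m
cause = {(6, 1): 0.8, (6, 7): 0.3,         # hypothetical c_{n,m} for the two
         (8, 1): 0.7, (8, 7): 0.9}         # diseases and symptoms considered

def prob_symptom(m, d):
    # Equation (6): noisy-OR probability that symptom m is present given
    # the disease pattern d (a dict n -> 0/1).
    prod = 1.0 - leak[m]
    for n, present in d.items():
        if present:
            prod *= 1.0 - cause.get((n, m), 0.0)
    return 1.0 - prod

def prior_prob(d):
    # Independence of the diseases: product of the listed marginals.
    p = 1.0
    for n, present in d.items():
        pn = p_disease.get(n, 0.0)
        p *= pn if present else 1.0 - pn
    return p

def joint(d):
    # Equation (5): S1 and S7 are conditionally independent given D.
    return prior_prob(d) * prob_symptom(1, d) * prob_symptom(7, d)

only_flu = {6: 1, 8: 0}
no_disease = {6: 0, 8: 0}
print(joint(only_flu) / joint(no_disease))   # posterior odds, by (7)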

Daniel M. Roy 38 / 101


P OSTERIOR CALCULATIONS GIVEN S1 = S7 = 1, PART V
On the other hand,

P(Patient only has meningitis | S1 = S7 = 1)
/ P(Patient has no listed disease | S1 = S7 = 1) ≈ 6,

and so

P(Patient only has influenza | S1 = S7 = 1)
/ P(Patient only has meningitis | S1 = S7 = 1) ≈ 7,

and hence we would expect to see, over many executions of

QUERY(diseasesAndSymptoms, checkSymptoms), (8)

roughly seven times as many explanations positing only influenza as
positing only meningitis.

Daniel M. Roy 39 / 101


E XPLAINING AWAY

Further investigation reveals some subtle aspects of the model. For example,

P(Patient only has meningitis and influenza | S1 = S7 = 1)
/ P(Patient has meningitis, maybe influenza, but nothing else | S1 = S7 = 1)
= 0.09 ≈ P(Patient has influenza).

Observation 1
Once we have observed some symptoms, diseases are no longer
independent.

Observation 2
Once the symptoms have been “explained” (e.g., as coming from meningitis),
there is little pressure to posit further causes (the posterior probability of
influenza is essentially the prior probability of influenza).

This phenomenon is well-known and is called explaining away; it is also
known to be linked to the computational hardness of computing probabilities
(and generating samples as QUERY does) in models of this variety.

Daniel M. Roy 40 / 101


S IMPLE MODELS CAN YIELD COMPLEX BEHAVIOR , PART I

Despite the simple model and simple query,
QUERY(diseasesAndSymptoms, checkSymptoms) yields a collection
of diagnostic inferences with tremendous internal complexity.

Many more behaviors are available through different predicates:

I The model naturally handles missing data. checkSymptoms leads to
different conclusions than

def checkSymptoms(D1, . . . , D11, S1, . . . , S7):
    return S1 == 1 and S7 == 1 and ∑_{m=1}^{7} Sm == 2

Daniel M. Roy 41 / 101


S IMPLE MODELS CAN YIELD COMPLEX BEHAVIOR , PART II
We need not limit ourselves to reasoning about diseases given symptoms.
I Imagine that we perform a diagnostic test that rules out meningitis. We
could represent our new knowledge using a predicate capturing the
condition

(D8 = 0) ∧ (S1 = S7 = 1) ∧ (S2 = · · · = S6 = 0).

Of course this approach would not take into consideration our


uncertainty regarding the accuracy or mechanism of the diagnostic test
itself, and so, ideally, we might expand the diseasesAndSymptoms
model to account for how the outcomes of diagnostic tests are affected
by the presence of other diseases or symptoms. Later, we will discuss
how such an extended model might be learned from data, rather than
constructed by hand.

Daniel M. Roy 42 / 101


S IMPLE MODELS CAN YIELD COMPLEX BEHAVIOR , PART III
We can also reason in the other direction, about symptoms given diseases.
I For example, public health officials might wish to know about how
frequently those with influenza present no symptoms. This is captured
by the conditional probability

P(S1 = · · · = S7 = 0 | D6 = 1),

and, via QUERY, by the predicate for the condition D6 = 1. Unlike the
earlier examples where we reasoned backwards from effects
(symptoms) to their likely causes (diseases), here we are reasoning in
the same forward direction as the model diseasesAndSymptoms is
expressed.
The possibilities are effectively inexhaustible, including more complex states
of knowledge such as, there are at least two symptoms present, or the patient
does not have both salmonella and tuberculosis. Later, we will consider the
vast number of predicates and the resulting inferences supported by QUERY
and diseasesAndSymptoms, and contrast this with the compact size of
diseasesAndSymptoms and the table of parameters.
Daniel M. Roy 43 / 101
In this section, we illustrated the basic behavior of QUERY, and began to
explore how QUERY can be used to update beliefs in light of observations.
I Inferences need not be explicitly described in terms of rules, but can
arise implicitly via other mechanisms, like QUERY, paired with
appropriate models and predicates.
I In the working example, the diagnostic rules were determined by the
definition of diseasesAndSymptoms and the table of its parameters.
I These inferences, however, are fixed in advance by the model and its parameters.
I We will examine how the underlying table of probabilities might be learned
from data.
I The structure of diseasesAndSymptoms itself encodes strong
structural relationships among the diseases and symptoms. We will study
how to learn this in part 2.
I Finally, many common-sense reasoning tasks involve making a decision,
and not just determining what to believe. Towards the end, we will describe
how to use QUERY to make decisions under uncertainty.

Daniel M. Roy 44 / 101


Introduction

QUERY and conditional simulation

Probabilistic inference

Conditional independence and compact representations

Learning parameters via probabilistic inference

Learning conditional independences via probabilistic inference

References

Daniel M. Roy 45 / 101


C ONDITIONAL INDEPENDENCE AND COMPACT
REPRESENTATIONS

In this section, we return to the medical diagnosis example, and explain the
way in which conditional independence leads to compact representations,
and conversely, the fact that efficient probabilistic programs, like
diseasesAndSymptoms, exhibit many conditional independencies. We
will do so through connections with the Bayesian network formalism, whose
introduction by Pearl [1988] was a major advancement in AI.

Daniel M. Roy 46 / 101


T HE COMBINATORICS OF QUERY
Common-sense reasoning seems to encompass an unbounded range of
responses / behaviors. How are these compactly represented?

In fact, the small number of diseases and symptoms considered in our simple
medical diagnosis model already leads to a combinatorial number of potential
scenarios: among 11 potential diseases and 7 potential symptoms there are

3^11 · 3^7 = 387 420 489

partial assignments to a subset of variables. All of these must be assigned
probabilities!

Luckily, these 387 420 489 probabilities are determined by

2^11 · 2^7 − 1 = 262 143

probabilities, one each for every complete assignment. Still, this number is
exponential in the number of diseases and symptoms. Even if we discretize
the probabilities to some fixed accuracy, a simple counting argument shows
that most such distributions have no short description.
Daniel M. Roy 47 / 101
F EW DISTRIBUTIONS HAVE COMPACT REPRESENTATIONS

Like diseasesAndSymptoms, every probability distribution on 18 binary
variables implicitly defines a large set of probabilities.
I Not feasible to store these probabilities explicitly.
I Necessary to exploit structure to devise more compact representations.
diseasesAndSymptoms is a small efficient program acting on three
tables with

11 + 7 + 11 · 7 = 95

probabilities. In contrast, a generic distribution on 2^18 possibilities has no
short description.
I diseasesAndSymptoms implicitly represents 2^18 − 1 probabilities
via an efficient (and short) simulator.
I What structure suffices to yield compact representations?
I What structure is necessary given efficient representations?

Daniel M. Roy 48 / 101


C ONDITIONAL INDEPENDENCE , PART I

The answer to both questions is conditional independence.

Recall that a collection of random variables {Xi : i ∈ I} is independent
when, for all finite subsets J ⊆ I and measurable sets Ai where i ∈ J, we
have

P( ∧_{i∈J} Xi ∈ Ai ) = ∏_{i∈J} P(Xi ∈ Ai).   (9)

If X and Y were binary random variables, then specifying their distribution
would require 3 probabilities in general, but only 2 if they were independent.
While those savings are small, consider instead m binary random variables
Xj , j = 1, . . . , m, and note that, while a generic distribution over these
random variables would require the specification of 2^m − 1 probabilities, only
m probabilities are needed in the case of full independence.

Full independence is rare and so this factorization is not the whole story.

Daniel M. Roy 49 / 101


C ONDITIONAL INDEPENDENCE , PART II
Conditional independence is arguably more fundamental.

We consider a special case of conditional independence, restricting our
attention to conditional independence with respect to a discrete random
variable N taking values in some countable or finite set N.

We say that a collection of random variables {Xi : i ∈ I} is conditionally
independent given N when, for all n ∈ N, finite subsets J ⊆ I and
measurable sets Ai , for i ∈ J, we have

P( ∧_{i∈J} Xi ∈ Ai | N = n ) = ∏_{i∈J} P(Xi ∈ Ai | N = n).

To illustrate the potential savings that can arise from conditional
independence, consider m binary random variables that are conditionally
independent given a discrete random variable taking k values. In general, the
joint distribution over these m + 1 variables is specified by k · 2^m − 1
probabilities, but, in light of the conditional independence, we need specify
only km + k − 1 probabilities.
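For example (with illustrative numbers, not tied to the diagnosis model): with m = 18 binary variables and k = 10, a generic joint distribution requires 10 · 2^18 − 1 = 2 621 439 probabilities, whereas under the conditional independence only 10 · 18 + 10 − 1 = 189 are needed.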
Daniel M. Roy 50 / 101
C ONDITIONAL INDEPENDENCE IN
D I S E A S E S A N D S Y M P T O M S , PART I

Conditional independence gives rise to compact representations.


Indeed, diseasesAndSymptoms exhibits many conditional
independencies.

To begin to understand the compactness of diseasesAndSymptoms,


note that the 95 variables

{D1 , . . . , D11 ; L1 , . . . , L7 ; C1,1 , C1,2 , C2,1 , C2,2 , . . . , C11,7 }

are independent, and thus their joint distribution is determined by specifying


only 95 probabilities (in particular, those in the tables).

Daniel M. Roy 51 / 101


C ONDITIONAL INDEPENDENCE IN
D I S E A S E S A N D S Y M P T O M S , PART II

Each symptom Sm is a deterministic function of a 23-variable subset

{D1 , . . . , D11 ; Lm ; C1,m , . . . , C11,m }.

The variables Lm ; C1,m , . . . , C11,m are not shared across symptoms,
implying symptoms are conditionally independent given diseases.

However, these facts alone do not fully explain the compactness of
diseasesAndSymptoms. In particular, there are

2^(2^23) > 10^(10^6)

binary functions of 23 binary inputs, and so by a counting argument, most
have no short description. On the other hand, the max operation that defines
Sm does have a compact and efficient implementation.

What’s the connection?

Daniel M. Roy 52 / 101


R EPRESENTATIONS OF CONDITIONAL INDEPENDENCE

A useful way to represent conditional independence among a collection of


random variables is in terms of a directed acyclic graph, where the vertices
stand for random variables, and the collection of edges indicates the
presence of certain conditional independencies. Such a graph is known as a
directed graphical model or Bayesian network. (For more details on Bayesian
networks, see the survey by Pearl [2004].)

Daniel M. Roy 53 / 101


B AYES NET FOR D I S E A S E S A N D S Y M P T O M S

[Figure: Bayes net over the variables D1 , . . . , D11 ; L1 , . . . , L7 ; C1,1 , . . . , C11,7 ; and S1 , . . . , S7 .]

Bayes net for the medical diagnosis example. (Note that the directionality of
the arrows is not rendered for clarity. All arrows point to the symptoms Sm .)
Daniel M. Roy 54 / 101
P ROTO - LANGUAGES FOR B AYES NETS

[Figure: plate notation for the diagnosis model, with vertices Dn , Cn,m , Lm , and Sm , a plate over diseases n, and a plate over symptoms m.]

The repetitive structure can be partially captured by so-called “plate notation”,
which can be interpreted as a primitive for-loop construct. Practitioners
have adopted a number of strategies like plate notation for capturing
complicated structures.

Daniel M. Roy 55 / 101


D -S EPARATION IN B AYES NETS I

In order to understand exactly which conditional independencies are formally


encoded in such a graph, we must introduce the notion of d-separation.

d-separation
A pair (x, y) of vertices are d-separated by a subset of vertices E as follows:
First, mark each vertex in E with a ×, which we will indicate by the symbol
N
. If a vertex with (any type of) mark has an unmarked
L parent, mark the
parent with a +, which we will indicate by the symbol . Repeat until a fixed
J
point is reached. Let indicate unmarked vertices.

Definition. x and y are d-separated by E if, for all (undirected) paths from x
to y , one of the following patterns appears:
J N J J N J
→ → ← ←
J N J J J J
← → → ←
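The marking procedure and the path test translate directly into a short algorithm. Below is a minimal Python sketch under the definition above; the graph encoding (a dict mapping each vertex to a list of its parents) and the function names are illustrative choices, not part of the tutorial, and the path enumeration is only meant for small graphs.

def mark(parents, E):
    # Mark evidence vertices with 'x', then repeatedly mark any unmarked
    # parent of a marked vertex with '+', until a fixed point is reached.
    marks = {v: 'x' for v in E}
    changed = True
    while changed:
        changed = False
        for v in list(marks):
            for p in parents.get(v, []):
                if p not in marks:
                    marks[p] = '+'
                    changed = True
    return marks  # vertices absent from `marks` are unmarked

def path_blocked(path, parents, marks):
    # A path is blocked if some consecutive triple (a, b, c) matches one of
    # the four patterns: a chain or fork through an x-marked vertex, or a
    # collider at a completely unmarked vertex.
    for a, b, c in zip(path, path[1:], path[2:]):
        collider = a in parents.get(b, []) and c in parents.get(b, [])
        if collider and b not in marks:
            return True
        if not collider and marks.get(b) == 'x':
            return True
    return False

def undirected_paths(parents, x, y):
    # Enumerate simple undirected paths from x to y (exponential in general;
    # fine for illustration on small graphs).
    nbrs = {}
    for v, ps in parents.items():
        nbrs.setdefault(v, set()).update(ps)
        for p in ps:
            nbrs.setdefault(p, set()).add(v)
    def dfs(v, path):
        if v == y:
            yield list(path)
            return
        for w in nbrs.get(v, ()):
            if w not in path:
                yield from dfs(w, path + [w])
    yield from dfs(x, [x])

def d_separated(parents, x, y, E):
    marks = mark(parents, E)
    return all(path_blocked(p, parents, marks)
               for p in undirected_paths(parents, x, y))

# Tiny example: a collider D1 -> S1 <- D2, as in the diagnosis network.
parents = {'S1': ['D1', 'D2'], 'D1': [], 'D2': []}
print(d_separated(parents, 'D1', 'D2', set()))    # True: independent a priori
print(d_separated(parents, 'D1', 'D2', {'S1'}))   # False: dependent given S1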

Daniel M. Roy 56 / 101


D -S EPARATION IN B AYES NETS II
More generally, if X and E are disjoint sets of vertices, then the graph
encodes the conditional independence of the vertices X given E if every pair
of vertices in X is d-separated by E.

If we fix a collection V of random variables, then we say that a directed
acyclic graph G over V is a Bayesian network when the random variables in
V indeed possess all of the conditional independencies implied by the graph
by d-separation. Note that a Bayes net G says nothing about which
conditional independencies do not exist among its vertex set.

Daniel M. Roy 57 / 101


[Figure: the same Bayes net over D1 , . . . , D11 ; L1 , . . . , L7 ; C1,1 , . . . , C11,7 ; and S1 , . . . , S7 .]

By deciding d-separation in the Bayes net, we can determine that (1) the
diseases {D1 , . . . , D11 } are independent (i.e., conditionally independent
given E = ∅), that (2) the symptoms {S1 , . . . , S7 } are conditionally
independent given the diseases {D1 , . . . , D11 }, and many more.

Daniel M. Roy 58 / 101


C ONDITIONAL INDEPENDENCE AND FACTORIZATION I

Bayes nets specify a factorization of the joint distribution of the vertex set.

It is a basic fact from probability that

P(X1 = x1 , . . . , Xk = xk )
= P(X1 = x1 ) · P(X2 = x2 | X1 = x1 ) · · · P(Xk = xk | Xj = xj , j < k)
= ∏_{j=1}^{k} P(Xj = xj | Xi = xi , i < j).

Such a factorization provides no advantage when seeking a compact
representation, as the j’th conditional distribution is determined by 2^(j−1)
probabilities.

Daniel M. Roy 59 / 101


C ONDITIONAL INDEPENDENCE AND FACTORIZATION II
On the other hand, if we have a Bayes net over the same variables, then we
may have a much more concise factorization.

Let G be a Bayes net over {X1 , . . . , Xk }, and write Pa(Xj ) for the set of
indices i such that (Xi , Xj ) ∈ G, i.e., Pa(Xj ) indexes the parent vertices of
Xj . Then the joint p.m.f. may be expressed as
P(X1 = x1 , . . . , Xk = xk ) = ∏_{j=1}^{k} P(Xj = xj | Xi = xi , i ∈ Pa(Xj )).

This factorization is determined by only ∑_{j=1}^{k} 2^|Pa(Xj)| probabilities, an
exponential savings potentially.
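As a concrete illustration of how this factorization is used, here is a minimal Python sketch that evaluates the joint p.m.f. of binary variables from per-vertex conditional probability tables; the encoding (a parent map plus tables keyed by parent configurations) is an illustrative choice, not something specified in the tutorial.

def joint_pmf(x, parents, cpt):
    # x: dict vertex -> 0/1 assignment
    # parents: dict vertex -> tuple of parent vertices
    # cpt[v][config] = P(X_v = 1 | parents take the values in `config`)
    p = 1.0
    for v, value in x.items():
        config = tuple(x[u] for u in parents[v])
        p1 = cpt[v][config]
        p *= p1 if value == 1 else (1.0 - p1)
    return p

# Toy two-vertex example: X1 with no parents, X2 depending on X1.
parents = {'X1': (), 'X2': ('X1',)}
cpt = {'X1': {(): 0.3}, 'X2': {(0,): 0.1, (1,): 0.8}}
print(joint_pmf({'X1': 1, 'X2': 0}, parents, cpt))  # 0.3 * (1 - 0.8) = 0.06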

Daniel M. Roy 60 / 101


E FFICIENT REPRESENTATIONS AND CONDITIONAL
INDEPENDENCE , PART I

As we saw at the beginning of this section, models with only a moderate


number of variables can have enormous descriptions. Having introduced the
Bayesian network formalism, we can use diseasesAndSymptoms as an
example to explain why, roughly speaking, the output distributions of efficient
probabilistic programs exhibit many conditional independencies.

Daniel M. Roy 61 / 101


E FFICIENT REPRESENTATIONS AND CONDITIONAL
INDEPENDENCE , PART II
What does the efficiency of diseasesAndSymptoms imply about the
structure of its output distribution?

Assuming the 95 tabulated probabilities are dyadic rationals of the form k/2^m, we may
represent diseasesAndSymptoms as a small boolean circuit whose
inputs are random bits and whose 18 output lines represent the diseases and
symptom indicators. Each circuit element can be restricted to
constant fan-in, and the total number of circuit elements grows only linearly in
the number of diseases and in the number of symptoms, assuming fixed
accuracy for the base probabilities.

Daniel M. Roy 62 / 101


E FFICIENT REPRESENTATIONS AND CONDITIONAL
INDEPENDENCE , PART III
If we view the input lines as random variables, then the output lines of the
logic gates are also random variables, and so we may ask: what conditional
independencies hold among the circuit elements?

It is straightforward to show that the circuit diagram, viewed as a directed


acyclic graph, is a Bayes net capturing conditional independencies among
the inputs, outputs, and internal gates of the circuit implementing
diseasesAndSymptoms. For every gate, the conditional probability mass
function is characterized by the (constant-size) truth table of the logical gate.

Therefore, if an efficient prior program samples from some distribution over a


collection of binary random variables, then those random variables exhibit
many conditional independencies, in the sense that we can introduce a
polynomial number of additional boolean random variables (representing
intermediate computations) such that there exists a constant-fan-in Bayes net
over all the variables.

Daniel M. Roy 63 / 101


G RAPHICAL MODELS IN AI

Graphical models, and, in particular, Bayesian networks, played a critical role


in popularizing probabilistic techniques within AI in the late 1980s and early
1990s.
Two developments were central to this shift:
1. Researchers introduced compact, computer-readable representations of
distributions on large (but still finite) collections of random variables, and
did so by explicitly representing a graph capturing conditional
independencies and exploiting the factorization given above.
2. Researchers introduced efficient graph-based algorithms that operated
on these representations, exploiting the factorization to compute
conditional probabilities.
For the first time, a large class of distributions was given a formal
representation that enabled the design of general-purpose algorithms to
compute useful quantities.

Daniel M. Roy 64 / 101


Introduction

QUERY and conditional simulation

Probabilistic inference

Conditional independence and compact representations

Learning parameters via probabilistic inference

Learning conditional independences via probabilistic inference

References

Daniel M. Roy 65 / 101


L EARNING PARAMETERS VIA PROBABILISTIC INFERENCE I

The 95 tabulated probabilities induce a distribution over 262 144 outcomes.


I Where did these numbers come from?
I And are they any good?
In practice, these 95 “parameters” would themselves be subject to a great
deal of uncertainty, and one might hope to use data from actual diagnostic
situations to learn appropriate values.

Daniel M. Roy 66 / 101


L EARNING AS PROBABILISTIC INFERENCE , PART I

The Bayesian approach to learning the 95 parameters is to place a prior
distribution on them. Whereas different individuals’ diseases and symptoms
were independent before, they are no longer independent.
Thus, this change affects diseasesAndSymptoms in two ways:
1. Rather than using the fixed 95 tabulated probabilities, the program will
start by randomly generating the entries of the table. (Concretely,
assume they are i.i.d. uniform in [0, 1].)
2. The new program, dubbed allDiseasesAndSymptoms, then
evaluates diseasesAndSymptoms() n + 1 times, returning the
resulting (n + 1) × 18 array.
3. Similarly, we modify checkSymptoms to obtain
checkAllSymptoms, which accepts the n + 1 diagnoses generated by
allDiseasesAndSymptoms if and only if the first n agree with n
historical records, and the last one, the current patient’s, is accepted by
checkSymptoms. (A sketch of these appears below.)
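A minimal Python sketch of these pieces follows. The table layout, the helper diseasesAndSymptomsGivenTable, and the use of a historicalRecords list are assumptions made for illustration, not definitions from the tutorial; checkSymptoms is the predicate defined earlier.

import random

historicalRecords = []   # n past diagnoses, each a tuple of 18 binary values

def randomTable():
    # 11 disease marginals, 7 leak probabilities, and an 11 x 7 grid of
    # cause probabilities, all i.i.d. Uniform[0, 1].
    return {'p': [random.random() for _ in range(11)],
            'l': [random.random() for _ in range(7)],
            'c': [[random.random() for _ in range(7)] for _ in range(11)]}

def diseasesAndSymptomsGivenTable(t):
    # The original noisy-OR generative process, run with a supplied table:
    # S_m = max{L_m, D_1 * C_{1,m}, ..., D_11 * C_{11,m}}.
    D = [1 if random.random() < t['p'][n] else 0 for n in range(11)]
    L = [1 if random.random() < t['l'][m] else 0 for m in range(7)]
    C = [[1 if random.random() < t['c'][n][m] else 0 for m in range(7)]
         for n in range(11)]
    S = [max([L[m]] + [D[n] * C[n][m] for n in range(11)]) for m in range(7)]
    return tuple(D + S)

def allDiseasesAndSymptoms():
    n = len(historicalRecords)
    table = randomTable()
    # (n + 1) x 18 array: n rows to compare with history, one for the patient.
    return [diseasesAndSymptomsGivenTable(table) for _ in range(n + 1)]

def checkAllSymptoms(rows):
    n = len(historicalRecords)
    return (all(rows[i] == historicalRecords[i] for i in range(n))
            and checkSymptoms(*rows[n]))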

Daniel M. Roy 67 / 101


L EARNING AS PROBABILISTIC INFERENCE , PART II
The changes to allDiseasesAndSymptoms may seem quite
surprising, and unlikely to work very well. Indeed, run on its own,
allDiseasesAndSymptoms will produce unnatural patterns. The key is
to consider the effect of checkAllSymptoms. What are typical outputs
from QUERY(allDiseasesAndSymptoms, checkAllSymptoms)?

For large n,
I a run is far more likely to be accepted when the hypothesized marginal
probability of each disease is relatively close to the frequency observed in the
historical disease–symptom data.

Thus, when a simulation is eventually accepted by checkAllSymptoms,
I the hypothesized marginal probabilities will closely match the relative
frequencies in the data.

The concentration around the data occurs at an n^(−1/2) rate, and so we would
expect that the typical accepted sample would soon correspond with a latent
table of probabilities that roughly matches the historical record.

Daniel M. Roy 68 / 101


E XACT POSTERIOR DISTRIBUTION OF pj
Fix a disease j and recall that pj ∼ Uniform[0, 1].

The probability that the n sampled values of Dj match the historical record is

pj^k · (1 − pj)^(n−k),   (10)

where k stands for the number of records where disease j is present.

By Bayes’ theorem, in the special case of a uniform prior distribution on pj,
the density of the conditional distribution of pj given the historical evidence is
proportional to the likelihood (10). This implies that, conditionally on the
historical record, pj has a so-called Beta(α1, α0) distribution, with α1 = k + 1
and α0 = n − k + 1, hence mean

α1 / (α1 + α0) = (k + 1) / (n + 2)

and concentration parameter α1 + α0 = n + 2.
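The conjugate update is easy to compute directly. The following minimal sketch, using only the Python standard library, summarizes and samples the posterior over a single disease marginal pj given counts k and n (the specific counts are placeholders).

import random

def posterior_over_marginal(k, n):
    # Uniform prior on p_j plus k successes in n Bernoulli trials gives a
    # Beta(k + 1, n - k + 1) posterior.
    alpha1, alpha0 = k + 1, n - k + 1
    mean = alpha1 / (alpha1 + alpha0)
    concentration = alpha1 + alpha0
    return alpha1, alpha0, mean, concentration

a1, a0, mean, conc = posterior_over_marginal(k=9, n=100)
print(mean)                                  # (9 + 1) / (100 + 2)
print(random.betavariate(a1, a0))            # one posterior sample of p_j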

Daniel M. Roy 69 / 101


E XACT POSTERIOR DISTRIBUTION OF pj
Beta distributions under varying parameterizations, highlighting the fact that,
as the concentration grows, the distribution begins to concentrate around its
mean. As n grows, predictions made by
QUERY(allDiseasesAndSymptoms, checkAllSymptoms) will
likely be those of runs where each disease marginal pj falls near the
observed frequency of the j th disease. In effect, the historical record data
determines the values of the marginals pj.

[Figure: plots of the probability density of Beta(α1, α0) distributions with density
f(x; α1, α0) = Γ(α1 + α0) / (Γ(α1)Γ(α0)) · x^(α1 − 1) (1 − x)^(α0 − 1)
for parameters (1, 1), (3, 1), (30, 3), and (90, 9) (respectively, in height).
For parameters α1, α0 > 1, the distribution is unimodal with mean α1/(α1 + α0).]
Daniel M. Roy 70 / 101
P OSTERIOR CONVERGENCE

A similar analysis can be made of the dynamics of the posterior distribution of


the latent parameters `m and cn,m .

Abstractly speaking, in finite dimensional Bayesian models like this one


satisfying certain regularity conditions, it is possible to show that the
predictions of the model converge to those made by the best possible
approximation within the model to the distribution of the data. (For a
discussion of these issues, see, e.g., Barron [1998].)

Daniel M. Roy 71 / 101


B AYESIAN LEARNING , PART I

I The original diseasesAndSymptoms program makes the same
inferences in each case, even in the face of large amounts of data.
I allDiseasesAndSymptoms learns from data.
The key: the latent table of probabilities, modeled as random variables.

Similar approaches can be used when the patients come from multiple
distinct populations where you do not expect the patterns of diseases and
symptoms to agree.

Daniel M. Roy 72 / 101


B AYESIAN LEARNING , PART II
Note
I allDiseasesAndSymptoms is even more compact than
diseasesAndSymptoms
I the specification of the distribution of the random table is logarithmic in the
size of the table
I On the other hand, allDiseasesAndSymptoms relies on data to
help it reduce its substantial a priori uncertainty regarding these values.
This tradeoff—between, on the one hand, the flexibility and complexity of a
model and, on the other, the amount of data required in order to make
sensible predictions—is seen throughout statistical modeling. We will return
to this point later.
Here we have seen how the parameters in prior programs can be modeled as
random, and thereby learned from data by conditioning on historical
diagnoses. In the next section, we consider the problem of learning not only
the parameterization but the structure of the model’s conditional
independence itself.

Daniel M. Roy 73 / 101


Introduction

QUERY and conditional simulation

Probabilistic inference

Conditional independence and compact representations

Learning parameters via probabilistic inference

Learning conditional independences via probabilistic inference

References

Daniel M. Roy 74 / 101


L EARNING CONDITIONAL INDEPENDENCES I

Key limitation of diseasesAndSymptoms


diseasesAndSymptoms uses a fixed noisy-OR network to perform
inference

Solution
allDiseasesAndSymptoms performs probabilistic inference over the
probabilities to learn the best noisy-OR network from data.

Key limitation of allDiseasesAndSymptoms


Irrespective of how much historical data we have,
allDiseasesAndSymptoms cannot go beyond the conditional
independence assumptions implicit in the structure of the prior program.

Solution
Identify the correct structure of the dependence between symptoms and
disease by probabilistic inference over random conditional independence
structure among the model variables.

Daniel M. Roy 75 / 101


S TRUCTURAL LEARNING VIA PROBABILISTIC INFERENCE

Learning a probabilistic model from data is a quintessential example of


unsupervised learning. Learning conditional independence relationships
among model variables is known as structure learning.

Need to learn the components of this factorization:

k
Y
P(X1 = x1 , . . . , Xk = xk ) = P(Xj = xj | Xi = xi , i ∈ Pa(Xj )).
j=1

Approach
Be “Bayesian”... put prior distributions on graphs and conditional probabilities.

Daniel M. Roy 76 / 101


A RANDOM PROBABILITY DISTRIBUTION I

Consider a prior program, which we will call RPD (for Random Probability
Distribution), that takes two positive integer inputs, n and D , and produces as
output n independent samples drawn from a random probability distribution
on {0, 1}^D.

We will then perform inference as usual

QUERY(RPD(n+1,18), checkAllSymptoms)

Daniel M. Roy 77 / 101


A RANDOM PROBABILITY DISTRIBUTION II
Intuitively, RPD works in the following way:
1. RPD generates a random directed acyclic graph G over the vertex set
{X1 , . . . , XD }.
2. For each vertex j , for each setting v ∈ {0, 1}^Pa(Xj) , RPD generates the
value of the conditional probability
pj|v = P(Xj = 1 | Xi = vi , i ∈ Pa(Xj )) uniformly at random.
3. RPD generates n samples, each a vector of D binary values with the same
distribution as (X1 , . . . , XD ).
The first two steps produce a graph G and a random probability distribution for
which G is a valid Bayes net.
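The following is a minimal Python sketch of such a generator. The tutorial does not specify the distribution over directed acyclic graphs, so the sketch makes one simple, clearly-labeled assumption: draw a uniformly random topological order and include each forward edge independently with probability 1/2.

import random
from itertools import product

def RPD(n, D, edge_prob=0.5):
    # Step 1: a random DAG. Assumption (not from the tutorial): random
    # topological order, each forward edge present independently.
    order = list(range(D))
    random.shuffle(order)
    parents = {j: [] for j in range(D)}
    for a, b in product(range(D), repeat=2):
        if order.index(a) < order.index(b) and random.random() < edge_prob:
            parents[b].append(a)

    # Step 2: uniformly random conditional probability tables.
    cpt = {j: {v: random.random()
               for v in product((0, 1), repeat=len(parents[j]))}
           for j in range(D)}

    # Step 3: n ancestral samples of (X_1, ..., X_D).
    samples = []
    for _ in range(n):
        x = {}
        for j in sorted(range(D), key=order.index):   # respect the DAG order
            v = tuple(x[i] for i in parents[j])
            x[j] = 1 if random.random() < cpt[j][v] else 0
        samples.append([x[j] for j in range(D)])
    return samples

rows = RPD(n=5, D=18)   # five samples from one random distribution on {0,1}^18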

Daniel M. Roy 78 / 101


G ENERAL B AYESIAN LEARNING MACHINE

With RPD fully specified, we’d like to understand the output of

QUERY(RPD(n + 1, 18), checkAllSymptoms) (11)

where checkAllSymptoms is defined earlier, accepting n + 1 diagnoses


if and only if the first n agree with historical records, and the symptoms
associated with the n + 1’st agree with the current patient’s symptoms. (Note
that we are identifying each output (X1 , . . . , X11 , X12 , . . . , X18 ) with a
diagnosis (D1 , . . . , D11 , S1 , . . . , S7 ), and have done so in order to highlight
the generality of RPD.)

Daniel M. Roy 79 / 101


G RAPH - CONDITIONAL POSTERIOR , PART I

Assume we condition on the graph G produced in step 1. Now we’re
essentially back to our earlier analysis.

The expected value of pj|v = P(Xj = 1 | Xi = vi , i ∈ Pa(Xj )) on those
runs accepted by QUERY is

(kj|v + 1) / (nj|v + 2),

where nj|v is the number of times in the historical data where the event
{Xi = vi , i ∈ Pa(Xj )} occurs; and kj|v is the number of times when,
moreover, Xj = 1. This is simply the “smoothed” empirical frequency. In fact,
the probability pj|v is conditionally Beta distributed with concentration
nj|v + 2.

Daniel M. Roy 80 / 101


G RAPH - CONDITIONAL POSTERIOR , PART II
The variance of pj|v is one characterization of our residual uncertainty, and
for each probability pj|v , the variance is easily shown to scale as 1/nj|v , i.e.,
inversely with the number of times in the historical data when the event
{Xi = vi , i ∈ Pa(Xj )} occurs.
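To see the 1/nj|v rate concretely: since pj|v is conditionally Beta(α1, α0) with α1 = kj|v + 1 and α0 = nj|v − kj|v + 1,

Var(pj|v | data) = α1 α0 / ( (α1 + α0)^2 (α1 + α0 + 1) ) ≤ 1 / ( 4 (nj|v + 3) ),

using α1 + α0 = nj|v + 2 and α1 α0 ≤ (α1 + α0)^2 / 4.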

Informally,
I the smaller the parental sets (a property of G), the more certain we are
likely to be regarding the correct parameterization.
I in terms of QUERY, this is reflected in a smaller range of values of pj|v on
accepted runs.
This is our first glimpse at a subtle balance between the simplicity (sparsity)
of the graph G and how well it captures hidden structure in the data.

Daniel M. Roy 81 / 101


A SPECTS OF THE POSTERIOR DISTRIBUTION OF THE
GRAPHICAL STRUCTURE

There are a number of practical roadblocks


I The space of directed acyclic graphs on 18 variables is enormous
I Computational hardness results abound [Cooper, 1990, Dagum and
Luby, 1993, Chandrasekaran et al., 2008].
I Indeed, QUERY would not halt in reasonable time for even small n and
D because the probability of generating the structure that fits the data is
astronomically small.
State of the art structure learning algorithms operate in special subclasses of
distributions and are very sophisticated.
Our goal here is understanding / intuition:
I we can aim to understand conceptual structure of the posterior
distribution of the graph G, perhaps in simple cases
I this example reveals an important aspect of hierarchical Bayesian
models with regard to their ability to avoid “overfitting”, and gives some
insight into why we might expect “simpler” explanations/theories to win
out in the short term over more complex ones.
Daniel M. Roy 82 / 101
L IKELIHOOD OF A GRAPH I

Fix a graph G. Then any distribution of the form

P(X1 = x1 , . . . , Xk = xk ) = ∏_{j=1}^{k} P(Xj = xj | Xi = xi , i ∈ Pa(Xj ))

will be called a model in G.

Lemma. If the edges in G′ are a strict subset of those in G, then the models of
G′ are a strict subset of those of G.

Corollary. The best fitting distribution in G is never worse than that in G′.

Are samples from QUERY(RPD(n + 1, 18), checkAllSymptoms) more
likely to come from G than from G′?

Daniel M. Roy 83 / 101


L IKELIHOOD OF A GRAPH II
Key observations:
I the posterior probability of a particular graph G does not reflect the
best-fitting model in G, but rather reflects the average ability of models
in G to explain the historical data.
I the average is over the Uniform[0, 1] distribution of the pj|v .
I if a spurious edge exists in a graph G′, a typical distribution from G′ is
less likely to explain the data than a typical distribution from the graph
with that edge removed.

Daniel M. Roy 84 / 101


L IKELIHOOD OF A GRAPH III
In order to characterize the posterior distribution of the graph, we must
identify the likelihood that a sample from the prior program is accepted given
a particular graph G.

Every time the pattern {Xi = vi , i ∈ Pa(Xj )} arises in historical data, the
generative process produces the historical value Xj with probability pj|v if
Xj = 1 and 1 − pj|v if Xj = 0.

Conditional on the pj|v ’s, the probability that the historical data is reproduced
is
∏_{j=1}^{D} ∏_{v} pj|v^{kj|v} (1 − pj|v )^{nj|v − kj|v},

where v ranges over the possible {0, 1} assignments to Pa(Xj ) and kj|v
and nj|v are defined as above.

Daniel M. Roy 85 / 101


L IKELIHOOD OF A GRAPH IV
However, we don’t know pj|v , and so the likelihood of the graph G is

score(G) = ∏_{j=1}^{D} ∏_{v} E[ pj|v^{kj|v} (1 − pj|v )^{nj|v − kj|v} ]

         = ∏_{j=1}^{D} ∏_{v} (nj|v + 1)^{−1} C(nj|v , kj|v )^{−1},

where E takes expectations with respect to Uniform[0, 1], and C(n, k) denotes
the binomial coefficient.

Because the graph G was chosen uniformly at random, it follows that the
posterior probability of a particular graph G is proportional to score(G).
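Since the closed form involves ratios of factorials, it is numerically safer to work with log score(G); here is a minimal Python sketch that does so from the counts nj|v and kj|v, with the counts data structure being an illustrative choice.

from math import lgamma

def log_marginal(n, k):
    # log E[ p^k (1 - p)^(n - k) ] for p ~ Uniform[0, 1]
    #   = log Beta(k + 1, n - k + 1) = -log( (n + 1) * C(n, k) ).
    return lgamma(k + 1) + lgamma(n - k + 1) - lgamma(n + 2)

def log_score(counts):
    # counts[j][v] = (n_{j|v}, k_{j|v}) for vertex j and parent configuration v.
    return sum(log_marginal(n, k)
               for per_vertex in counts.values()
               for (n, k) in per_vertex.values())

print(log_score({0: {(): (10, 7)}}))   # equals log(1 / (11 * C(10, 7)))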

Daniel M. Roy 86 / 101


G RAPH POSTERIOR

We can study the preference for one graph G over another G′ by studying
the ratio of their scores:

score(G) / score(G′).

This score ratio is known as the Bayes factor, which I.J. Good termed the
Bayes–Jeffreys–Turing factor [Good, 1968, 1975], and which Turing himself
called the factor in favor of a hypothesis (see [Good, 1968], [Zabell, 2012,
§1.4], and [Turing, 2012]). Its logarithm is sometimes known as the weight of
evidence [Good, 1968].

The form of the score is a product over the local structure of the graph, and thus
the Bayes factor will depend only on the contributions from those parts of the
graphs G and G′ that differ.

Daniel M. Roy 87 / 101


S IMPLE SPECIAL CASE OF GRAPH POSTERIOR , PART I

Consider the following simplified scenario, which captures several features of
learning structure from data: Fix two graphs, G and G′, over the same
collection of random variables, but assume that in G, two of these random
variables, X and Y , have no parents and are thus independent, and in G′
there is an edge from X to Y , and so they are almost surely dependent.

The Bayes factor in favor of independence is

(n1 + 1)(n0 + 1) C(n1 , k1 ) C(n0 , k0 ) / [ (n + 1) C(n, k) ],   (12)

where
I n counts the total number of observations;
I k counts Y = 1;
I n1 counts X = 1;
I k1 counts X = 1 and Y = 1;
I n0 counts X = 0; and
I k0 counts X = 0 and Y = 1.
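The counts just listed are all that is needed to evaluate (12). The following minimal Python sketch computes its logarithm, i.e., the weight of evidence in favor of independence, directly from a list of (x, y) pairs; the function name and the toy data are illustrative.

from math import comb, log
import random

def log_bayes_factor_independence(pairs):
    # pairs: observations (x, y) with x, y in {0, 1}; returns the log of (12),
    # i.e., the weight of evidence in favor of the independent graph G.
    n = len(pairs)
    k = sum(y for _, y in pairs)
    n1 = sum(x for x, _ in pairs)
    k1 = sum(1 for x, y in pairs if x == 1 and y == 1)
    n0, k0 = n - n1, k - k1
    return (log(n1 + 1) + log(n0 + 1) + log(comb(n1, k1)) + log(comb(n0, k0))
            - log(n + 1) - log(comb(n, k)))

data = [(random.randint(0, 1), random.randint(0, 1)) for _ in range(200)]
print(log_bayes_factor_independence(data))   # X, Y independent here, so this
                                             # is typically (not always) positive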
Daniel M. Roy 88 / 101
S IMPLE SPECIAL CASE OF GRAPH POSTERIOR , PART II
First consider the case where G′ is the true underlying graph, i.e., when Y is
indeed dependent on X .

By the law of large numbers, and Stirling’s approximation, we can reason that
the evidence for G′ accumulates rapidly, satisfying

log [ score(G) / score(G′) ] ∼ −C · n,   a.s.,

for some constant C > 0 that depends only on the joint distribution of (X, Y ).

Daniel M. Roy 89 / 101


S IMPLE SPECIAL CASE OF GRAPH POSTERIOR , PART III
Now consider that G is the true underlying graph, i.e., X , Y are independent.

Using a similar approach, we have

log [ score(G) / score(G′) ] ∼ (1/2) log n,   a.s.,

and thus evidence accumulates much more slowly.

The following plots show the typical evolution of the Bayes factors under G′
and under G. Evidence accumulates much more rapidly under G′.

Daniel M. Roy 90 / 101


S IMPLE SPECIAL CASE OF GRAPH POSTERIOR , PART IV

[Figure: two plots of the weight of evidence against the number of observations (1 to 100); in the left panel the values fall roughly linearly to about −1000, in the right panel they rise slowly to about 4.]

Weight of evidence for dependence versus independence (positive values support independence) of
a sequence of pairs of random variables sampled from RPD(n, 2).

(left) When presented with data from a distribution where (X, Y ) are indeed dependent, the weight of
evidence rapidly accumulates for the dependent model, at an asymptotically linear rate in the amount
of data.

(right) When presented with data from a distribution where (X, Y ) are independent, the weight of
evidence slowly accumulates for the independent model, at an asymptotic rate that is logarithmic in
the amount of data.

Note that the dependent model can imitate the independent model, but, on average over random
parameterizations of the conditional probability mass functions, the dependent model is worse at
modeling independent data.

Daniel M. Roy 91 / 101


S IMPLE SPECIAL CASE OF GRAPH POSTERIOR , PART V
Some observations
I In both cases, evidence accumulates for the correct model. In fact, it can
be shown that the expected weight of evidence is always non-negative
for the true hypothesis, a result due to Turing himself [Good, 1991,
p. 93].
I the prior distributions on the pj|v are fixed and do not vary with the
amount of data; thus the weight of evidence will eventually eclipse any
prior information and determine the posterior probability.
I however, evidence accumulates rapidly for dependence and much more
slowly for independence: one should choose the prior distribution on graphs to
reflect this imbalance, preferring graphs with fewer edges a priori.

Daniel M. Roy 92 / 101


B AYES ’ O CCAM ’ S RAZOR

The following two statements may seem contradictory:

I When X and Y are independent, evidence stochastically accumulates
for the simpler graph over the more complex graph.
I There is, with high probability, always a parameterization of the more
complex graph that assigns higher likelihood to the data than any
parameterization of the simpler graph.

Bayes’ Occam’s razor [MacKay, 2003, Ch. 28]

The natural way in which hierarchical models choose explanations of the data
with intermediate complexity, avoiding overfitting.

I If a model has many degrees of modeling freedom, then each configuration
must be assigned, on average, less probability than it would under a simpler
model with fewer degrees of freedom.
I A graph with additional edges has more degrees of freedom.

Daniel M. Roy 93 / 101


C HOOSING MODELS USING DATA

Which model should we use to diagnose: diseasesAndSymptoms,


allDiseasesAndSymptoms or RPD?

We can use the same Bayes’ Occam’s Razor perspective:


I RPD model has many more degrees of freedom than both
diseasesAndSymptoms and allDiseasesAndSymptoms.
I Given enough data, RPD can fit any distribution on a finite collection of
binary variables, as opposed to allDiseasesAndSymptoms, which
cannot because it makes strong and immutable assumptions.
I With only a small amount of training data, RPD model expected to have
high posterior uncertainty.
I One would expect better predictions from
allDiseasesAndSymptoms than RPD, if both were fed data
generated by diseasesAndSymptoms.

Daniel M. Roy 94 / 101


S TATE OF THE ART

I There is a challenge bridging the gap between


allDiseasesAndSymptoms and RPD.
I A lot of focus in Bayesian statistics is on building scalable
approximations to QUERY.
I Key family of algorithms are so-called “variational approximations”.
I There’s also active research on Monte Carlo approximations to QUERY.
I Bespoke hand-crafted models in machine learning are being replaced
by hybrid deep learning ones.
I Probabilistic programming and differentiable programming frameworks
massively accelerate certain approaches.
I There’s also interest in bridging the frequentist–Bayesian divide.

Daniel M. Roy 95 / 101


Introduction

QUERY and conditional simulation

Probabilistic inference

Conditional independence and compact representations

Learning parameters via probabilistic inference

Learning conditional independences via probabilistic inference

References

Daniel M. Roy 96 / 101


F. R. Bach and M. I. Jordan. Learning graphical models with Mercer kernels.
In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural
Information Processing Systems 15 (NIPS 2002), pages 1009–1016. The
MIT Press, Cambridge, MA, 2003.
A. R. Barron. Information-theoretic characterization of Bayes performance
and the choice of priors in parametric and nonparametric problems. In
J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, editors,
Bayesian Statistics 6: Proceedings of the Sixth Valencia International
Meeting, pages 27–52, 1998.
V. Chandrasekaran, N. Srebro, and P. Harsha. Complexity of inference in
graphical models. In Proceedings of the Twenty Fourth Conference on
Uncertainty in Artificial Intelligence (UAI 2008), pages 70–78, Corvalis,
Oregon, 2008. AUAI Press.
G. F. Cooper. The computational complexity of probabilistic inference using
Bayesian belief networks. Artificial Intelligence, 42(2-3):393–405, 1990.
P. Dagum and M. Luby. Approximating probabilistic inference in Bayesian
belief networks is NP-hard. Artificial Intelligence, 60(1):141–153, 1993.
ISSN 0004-3702. doi:
http://dx.doi.org/10.1016/0004-3702(93)90036-B.
Daniel M. Roy 97 / 101
P. Dagum, R. Karp, M. Luby, and S. Ross. An optimal algorithm for Monte
Carlo estimation. SIAM Journal on Computing, 29(5):1484–1496, 2000.
doi: 10.1137/S0097539797315306. URL http://epubs.siam.
org/doi/abs/10.1137/S0097539797315306.
M. H. DeGroot. Optimal Statistical Decisions. Wiley Classics Library. Wiley,
2005. ISBN 9780471726142. URL
http://books.google.co.uk/books?id=dtVieJ245z0C.
I. J. Good. Corroboration, explanation, evolving probability, simplicity and a
sharpened razor. The British Journal for the Philosophy of Science, 19(2):
123–143, 1968.
I. J. Good. Explicativity, corroboration, and the relative odds of hypotheses.
Synthese, 30(1):39–73, 1975.
I. J. Good. Weight of evidence and the Bayesian likelihood ratio. In C. G. G.
Aitken and D. A. Stoney, editors, The Use Of Statistics In Forensic
Science. Ellis Horwood, Chichester, 1991.
N. D. Goodman, V. K. Mansinghka, D. M. Roy, K. Bonawitz, and J. B.
Tenenbaum. Church: A language for generative models. In Proceedings of
the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence (UAI
2008), pages 220–229, Corvalis, Oregon, 2008. AUAI Press.

Daniel M. Roy 98 / 101


T. L. Griffiths and J. B. Tenenbaum. Structure and strength in causal
induction. Cognitive Psychology, 51(4):334–384, 2005. ISSN 0010-0285.
doi: 10.1016/j.cogpsych.2005.05.004. URL
http://www.sciencedirect.com/science/article/pii/
S0010028505000459.
T. L. Griffiths and J. B. Tenenbaum. Optimal predictions in everyday
cognition. Psychological Science, 17(9):767–773, 2006. URL
http://web.mit.edu/cocosci/Papers/
Griffiths-Tenenbaum-PsychSci06.pdf.
T. L. Griffiths and J. B. Tenenbaum. Theory-based causal induction.
Psychological Review, 116(4):661–716, 2009. URL
http://cocosci.berkeley.edu/tom/papers/tbci.pdf.
T. L. Griffiths, C. Kemp, and J. B. Tenenbaum. Bayesian models of cognition.
In Cambridge Handbook of Computational Cognitive Modeling. Cambridge
University Press, 2008. URL http://cocosci.berkeley.edu/
tom/papers/bayeschapter.pdf.
O. Kallenberg. Foundations of modern probability. Probability and its
Applications. Springer, New York, 2nd edition, 2002. ISBN 0-387-95313-2.

Daniel M. Roy 99 / 101


C. Kemp and J. B. Tenenbaum. The discovery of structural form. Proceedings
of the National Academy of Sciences, 105(31):10687–10692, 2008. URL
http://www.psy.cmu.edu/~ckemp/papers/kempt08.pdf.
C. Kemp, P. Shafto, A. Berke, and J. B. Tenenbaum. Combining causal and
similarity-based reasoning. In B. Schölkopf, J. Platt, and T. Hoffman,
editors, Advances in Neural Information Processing Systems 19 (NIPS
2006), pages 681–688. The MIT Press, Cambridge, MA, 2007.
R. D. Luce. Individual Choice Behavior. John Wiley, New York, 1959.
R. D. Luce. The choice axiom after twenty years. Journal of Mathematical
Psychology, 15(3):215–233, 1977.
D. J. C. MacKay. Information Theory, Inference, and Learning Algorithms.
Cambridge University Press, Cambridge, UK, 2003.
V. K. Mansinghka. Natively Probabilistic Computation. PhD thesis,
Massachusetts Institute of Technology, 2009.
V. K. Mansinghka, C. Kemp, J. B. Tenenbaum, and T. L. Griffiths. Structured
priors for structure learning. In Proceedings of the Twenty-Second
Conference on Uncertainty in Artificial Intelligence (UAI 2006), pages
324–331, Arlington, Virginia, 2006. AUAI Press. URL http:
//cocosci.berkeley.edu/tom/papers/structure.pdf.

Daniel M. Roy 100 / 101


J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of
Plausible Inference. Morgan Kaufmann, San Francisco, 1988.
J. Pearl. Graphical models for probabilistic and causal reasoning. In A. B.
Tucker, editor, Computer Science Handbook. CRC Press, 2nd edition,
2004.
M. A. Shwe, B. Middleton, D. E. Heckerman, M. Henrion, E. J. Horvitz, H. P.
Lehmann, and G. F. Cooper. Probabilistic diagnosis using a reformulation
of the INTERNIST-1/QMR knowledge base. Methods of Information in
Medicine, 30:241–255, 1991.
J. B. Tenenbaum, T. L. Griffiths, and C. Kemp. Theory-based Bayesian
models of inductive learning and reasoning. Trends in Cognitive Sciences,
10:309–318, 2006. doi: 10.1016/j.tics.2006.05.009.
M. Toussaint, S. Harmeling, and A. Storkey. Probabilistic inference for solving
(PO)MDPs. Technical Report EDI-INF-RR-0934, University of Edinburgh,
School of Informatics, 2006.
A. M. Turing. The Applications of Probability to Cryptography, c. 1941. UK
National Archives, HW 25/37. 2012.
S. L. Zabell. Commentary on Alan M. Turing: The applications of probability
to cryptography. Cryptologia, 36(3):191–214, 2012. doi:
10.1080/01611194.2012.697811.
Daniel M. Roy 101 / 101
