
2ND SCHOOL ON FOPSS: LOGIC AND LEARNING

A tutorial on Bayesian learning

Daniel M. Roy
University of Toronto | Vector Institute

The theory of probability is at bottom nothing but common sense
reduced to calculus; it makes one appreciate with exactness that
which accurate minds feel with a sort of instinct, often without
being able to account for it. — Pierre Simon Laplace

Daniel M. Roy 1 / 101


Based on
Freer, Roy, and Tenenbaum (2012). Towards common-sense reasoning via
conditional simulation: Legacies of Turing in Artificial Intelligence. Turing’s
Legacy (ASL Lecture Notes in Logic).

Daniel M. Roy 2 / 101


The goal of this tutorial is to give a “Bayesian perspective” on “learning”.
I What is a “Bayesian perspective”?
I What is “learning”?

Daniel M. Roy 3 / 101


THE BAYESIAN PERSPECTIVE

I Probability is no longer just a limiting relative frequency associated with
  independent identical experiments.
I Instead, probability represents degree of belief.
I P{ I will fly to Vilnius tomorrow } = 0.97.
I P{ I will fly to Canada tomorrow } = 0.001.
I P{ I will fly to Canada and Vilnius tomorrow } = 0.
I P{ I will fly to Canada or Vilnius tomorrow } = 0.971.
I P{ Tom has flu | Tom has fever and muscle aches in Winter } = 0.6.
I P{ Tom has flu | Tom has fever and muscle aches in Summer } = 0.1.
I Probabilities are personal.
I The key structures are joint distributions of multiple random variables.
I X = Patient has flu, Y = has fever, Z = has muscle aches,
S = It is Summer, W = It is Winter.
I A joint distribution is a specification of probabilities
P{X = x, Y = y, Z = z, S = s, W = w} for every possible “event”.

Daniel M. Roy 4 / 101


WHAT IS LEARNING?

I If we do not know P, we can attempt to “learn” it from data. This
  necessarily requires some strong assumptions and relates to the
  problem of induction.
I Learning to classify images: Learn P{Label | Image } from dataset of
labelled images.
I Learning to diagnose: Learn P{Disease, Symptoms } from past patient
data, in order to infer diseases from symptoms in future patients.
I The distributions P above are generally assumed to come from a
parametric family {Pθ }, e.g., θ ∈ Rd . The (“true” or “best”) parameter θ ∗
is assumed unknown.
I An approach is “fully Bayesian” if one uses probability distributions to
express uncertainty for all unknown quantities (such as θ ∗ ), modeling
them as random variables with prior distributions.
I In a “fully Bayesian” approach, learning is probabilistic inference, and
thus everything is probabilistic inference.
I In contrast, in a frequentist approach, one would develop estimators for
θ∗ with good frequentist (sampling) properties.
Daniel M. Roy 5 / 101
UNIVERSAL STOCHASTIC INFERENCE

I The Bayesian framework is conceptually simple:


I represent all knowledge by distributions
I evidence incorporated by conditioning
I I will introduce the Bayesian framework using the computational
framework of universal stochastic inference wherein
I distributions are represented by programs
I evidence is represented by predicates
I conditioning is a higher-order procedure
I Programs can represent distributions by being simulators
I a program P represents a distribution µ by simulating µ,
i.e., generating a random output X whose distribution is µ

Daniel M. Roy 6 / 101


RELATED FRAMEWORKS

This perspective is closely related to


I Probabilistic programming
I Approximate Bayesian Computation (ABC)
I Implicit Generative Models

Daniel M. Roy 7 / 101


STRUCTURE OF THE TUTORIAL

The tutorial centers around a higher-order procedure, called QUERY, which
implements a simple, though generic, form of probabilistic conditioning.
I QUERY operates on complex probabilistic models that are themselves
represented by programs
I By using QUERY appropriately, one can describe various forms of
inference, learning, and decision-making.
I We will introduce inference, learning, and decision making through an
extended example of medical diagnosis, a classic AI problem.
I Elusive “common-sense” behavior arises implicitly from past experience
and models of causal structure and goals, rather than explicitly via rules
or purely deductive reasoning.

Daniel M. Roy 8 / 101


EXAMPLE: SIMPLE PROBABILISTIC PYTHON PROGRAM

1 def binomial(n, p):
2     return sum( [bernoulli(p) for i in range(n)] )

I returns a random integer in {0, . . . , n}.
I defines a family of distributions on {0, . . . , n},
  in particular, the Binomial family.
I represents a statistical model of
  the # of successes among
  n independent and identical experiments

Daniel M. Roy 9 / 101



EXAMPLE: SIMPLE PROBABILISTIC PYTHON PROGRAM

1 def binomial(n, p):
2     return sum( [bernoulli(p) for i in range(n)] )
3 def randomized_trial():
4     p_control = uniform(0,1)    # prior
5     p_treatment = uniform(0,1)  # prior
6     return ( binomial(100, p_control),
7              binomial(10, p_treatment) )

represents a Bayesian model of a randomized trial.

[Figure: the unit square of parameter pairs (p_control, p_treatment). Simulation draws
a pair from the uniform prior (here about (0.67, 0.86)) and produces the data (71, 9);
inference maps the observed data (71, 9) back to a posterior over the parameters,
concentrated near (0.71, 0.9).]

Daniel M. Roy 10 / 101
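
A minimal runnable version of this model in ordinary Python, with the primitives
bernoulli and uniform(0,1) implemented via the standard random module (a sketch; the
slides leave these primitives abstract):

import random

def bernoulli(p):
    # 1 with probability p, else 0
    return 1 if random.random() < p else 0

def binomial(n, p):
    return sum(bernoulli(p) for _ in range(n))

def randomized_trial():
    p_control = random.uniform(0, 1)    # prior
    p_treatment = random.uniform(0, 1)  # prior
    return (binomial(100, p_control), binomial(10, p_treatment))

print(randomized_trial())   # e.g., (71, 9) is one possible output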


INTRODUCING QUERY

1 def QUERY(guesser, checker):
2     # guesser: Unit -> S
3     # checker: S -> Boolean
4     accept = False
5     while (not accept):
6         guess = guesser()
7         accept = checker(guess)
8     return guess

Daniel M. Roy 11 / 101


UNDERSTANDING QUERY

1 def QUERY(guesser, checker):
2     # guesser: Unit -> S
3     # checker: S -> Boolean
4     accept = False
5     while (not accept):
6         guess = guesser()
7         accept = checker(guess)
8     return guess

As a first step towards understanding QUERY, consider the trivial predicate

lambda _: True

Then guesser() has the same meaning (distributional semantics) as

QUERY(guesser, lambda _: True)

Daniel M. Roy 12 / 101


UNDERSTANDING QUERY

1 def QUERY(guesser, checker):
2     # guesser: Unit -> S
3     # checker: S -> Boolean
4     accept = False
5     while (not accept):
6         guess = guesser()
7         accept = checker(guess)
8     return guess

Consider a slightly more interesting example:

def N():
    return uniformInt(range(1,180))

and consider the predicate

def div235(n):
    return isDivBy(n,2) or isDivBy(n,3) or isDivBy(n,5)

What is the meaning of QUERY(N, div235)?


Daniel M. Roy 12 / 101
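
One way to answer this question concretely is to implement QUERY as written and run
it. The sketch below is plain Python; uniformInt and isDivBy are replaced by
random.randint and the % operator. The samples it produces are (up to sampling error)
uniformly distributed over the integers in {1, . . . , 179} that are divisible by 2, 3,
or 5.

import random

def QUERY(guesser, checker):
    accept = False
    while not accept:
        guess = guesser()
        accept = checker(guess)
    return guess

def N():
    return random.randint(1, 179)          # uniform on {1, ..., 179}

def div235(n):
    return n % 2 == 0 or n % 3 == 0 or n % 5 == 0

samples = [QUERY(N, div235) for _ in range(10000)]
assert all(div235(n) for n in samples)      # every accepted guess satisfies the predicate
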
CONDITIONING AS A HIGHER-ORDER PROCEDURE

1 def QUERY(guesser, checker):
2     # guesser: Unit -> S
3     # checker: S -> Boolean
4     accept = False
5     while (not accept):
6         guess = guesser()
7         accept = checker(guess)
8     return guess

QUERY represents the higher-order operation of conditioning. When checker is
deterministic, QUERY denotes the map

    (P, 1A) ↦ P( · | A) := P( · ∩ A) / P(A).    (1)

QUERY halts with probability 1 provided P(A) > 0.

Daniel M. Roy 13 / 101


THE STOCHASTIC INFERENCE PROBLEM

INPUT: guesser and checker probabilistic programs.

OUTPUT: a sample from the same distribution as the program

1 accept = False
2 while (not accept):
3     guess = guesser()
4     accept = checker(guess)
5 return guess

This computation captures Bayesian statistical inference.

“prior” distribution     ←→  distribution of guesser()
“likelihood(g)”          ←→  P{ checker(g) is True }
“posterior” distribution ←→  distribution of return value

Daniel M. Roy 14 / 101


EXAMPLE: INFERRING BIAS OF A COIN

1 accept = False
2 while (not accept):
3     guess = guesser()
4     accept = checker(guess)
5 return guess

Given x1 , . . . , xn ∈ {0, 1}, report probability of xn+1 = 1?  E.g., 0, 0, 1, 0, 0

def guesser():
    p = uniform()
    return p

def checker(p):
    return [0,0,1,0,0] == bernoulli(p,5)

[Figure: histogram (count vs. p) of the accepted values of p, shown after 2, 7, and
roughly 5000 accepted samples; with many samples the histogram concentrates around the
posterior distribution of p given one success in five trials.]
Daniel M. Roy 15 / 101
Let s = x1 + · · · + xn and let U be uniformly distributed.
For all t ∈ [0, 1], we have P{U ≤ t} = t and

    P{checker(t, x) is True} = P{∀i ( Ui ≤ t ⇐⇒ xi = 1 )} = t^s (1 − t)^{n−s}.

[Figure: Pr(accept) as a function of t, for n = 6 and s ∈ {1, 3, 5}; each curve peaks
at t = s/n, and the acceptance probability stays below about 0.07 everywhere.]

    P{checker(U, x) is True} = ∫₀¹ t^s (1 − t)^{n−s} dt = s!(n − s)!/(n + 1)! =: Z(s)

Let p(t)dt be the probability that the accepted θ ∈ [t, t + dt). Then

    p(t)dt ≈ t^s (1 − t)^{n−s} dt + (1 − Z(s)) p(t)dt,   and hence   p(t)dt ≈ (t^s (1 − t)^{n−s} / Z(s)) dt.

The probability that the accepted X = 1 is then ∫ t p(t)dt = (s + 1)/(n + 2).
Daniel M. Roy 16 / 101
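
The closed form (s + 1)/(n + 2) can be checked against the rejection sampler directly.
A small Python sketch (guesser and checker as on the previous slide, written out with
the random module):

import random

def sample_posterior_p(data):
    # rejection sampling: propose p ~ Uniform[0,1]; accept if fresh coin flips
    # with bias p exactly reproduce the observed data
    while True:
        p = random.random()
        if all((random.random() < p) == (x == 1) for x in data):
            return p

data = [0, 0, 1, 0, 0]                       # n = 5, s = 1
samples = [sample_posterior_p(data) for _ in range(20000)]
print(sum(samples) / len(samples))           # ≈ (s + 1)/(n + 2) = 2/7 ≈ 0.286
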
EXAMPLE: INFERRING OBJECTS FROM AN IMAGE

1 accept = False
2 while (not accept):
3     guess = guesser()
4     accept = checker(guess)
5 return guess

How many objects in this image?

[Figure: the observed image of colored blocks, together with a histogram (count vs.
k = # blocks) over the accepted samples.]

1 def guesser():
2     k = geometric()
3     blocks = [ randomblock() for _ in range(k) ]
4     colors = [ randomcolor() for _ in range(k) ]
5     return (k,blocks,colors)
7 def checker(k,blocks,colors):
8     return rasterize(blocks,colors) ==
      (the right-hand side of the comparison is the observed image, shown inline on the slide)

Daniel M. Roy 17 / 101


EXAMPLE: EXTRACTING 3D STRUCTURE FROM IMAGES

1 accept = False
2 while (not accept):
3     guess = guesser()
4     accept = checker(guess)
5 return guess

[Figure: a guessed 3D scene is rendered and compared, via checker, against the
observed 2D image; inference runs in the reverse direction, recovering plausible 3D
structure from the image.]

Daniel M. Roy 18 / 101


QUERY AS AN ALGORITHM

Key point: QUERY is not a serious proposal for an algorithm, but it denotes
the operation we care about in Bayesian analysis.

How efficient is QUERY? Let model() represent a distribution P and pred
represent an indicator function 1A.

Proposition
In expectation, QUERY(model,pred) takes 1/P(A) times as long to run as
pred(model()).

Corollary
If pred(model()) is efficient and P(A) is not too small, then
QUERY(model,pred) is efficient.

Daniel M. Roy 19 / 101
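
A small experiment consistent with the proposition (the particular model and predicate
below are made up purely for illustration):

import random

def query_with_count(guesser, checker):
    # runs QUERY, returning the accepted guess and the number of proposals used
    tries = 0
    while True:
        tries += 1
        guess = guesser()
        if checker(guess):
            return guess, tries

model = lambda: random.randint(1, 100)
pred = lambda x: x <= 5                      # so P(A) = 0.05
counts = [query_with_count(model, pred)[1] for _ in range(10000)]
print(sum(counts) / len(counts))             # close to 1 / P(A) = 20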


MIT-CHURCH AKA TRACE-MCMC

1 def geometric(p):
2     if bernoulli(p) == 1: return 1
3     else: return 1 + geometric(p)

1 def aliased_geometric(p):
2     g = geometric(p)
3     return 1 if g < 3 else 0

Daniel M. Roy 21 / 101

MIT-CHURCH AKA TRACE-MCMC

1 def geometric(p):
2     if bernoulli(p) == 1: return 1
3     else: return 1 + geometric(p)

1 def aliased_geometric(p):
2     g = geometric(p)
3     return 1 if g < 3 else 0

[Figure: the execution traces of geometric(p), one per return value. The trace
returning 1 (probability p) makes a single bernoulli choice; the trace returning 2
(probability p(1 − p)) makes two; the trace returning 3 (probability p(1 − p)^2) makes
three; and in general the trace returning k has probability p(1 − p)^{k−1}.]

Daniel M. Roy 22 / 101


MIT-CHURCH AKA TRACE-MCMC

[Figure: trace-MCMC restricted to the two executions of geometric(p) that satisfy
g < 3, i.e., the trace returning 1 and the trace returning 2 (with p̄ := 1 − p). One
panel ("Proposal") rerandomizes part of the current trace; the other
("Metropolis–Hastings") applies the accept/reject correction. The resulting transition
matrix over the two traces is

    ( 1 − p̄/2    p̄/2 )
    (   1/2      1/2 ),

and its k-step transition probabilities converge, as k → ∞, to

    ( p/(p + p p̄)    p p̄/(p + p p̄) )
    ( p/(p + p p̄)    p p̄/(p + p p̄) ),

i.e., every row tends to the conditional distribution of the trace given g < 3.]

Daniel M. Roy 23 / 101
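
The claimed limit can be checked numerically; a short sketch (the transition matrix is
the one read off the diagram above):

import numpy as np

p = 0.3
pbar = 1 - p

T = np.array([[1 - pbar / 2, pbar / 2],
              [1 / 2,        1 / 2   ]])

# Conditional distribution of the trace given g < 3: (P(g=1 | g<3), P(g=2 | g<3)).
target = np.array([p, p * pbar]) / (p + p * pbar)

print(np.linalg.matrix_power(T, 50)[0])   # first row of T^50
print(target)                             # the two agree to many decimal places
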
QUERY CAN CAPTURE A WIDE RANGE OF AI BEHAVIORS

Despite the apparent simplicity of the QUERY construct, we will see that it
captures the essential structure of a range of common-sense inferences. We
now demonstrate the power of the QUERY formalism by exploring its
behavior in a medical diagnosis example.

Daniel M. Roy 24 / 101


Introduction

QUERY and conditional simulation

Probabilistic inference

Conditional independence and compact representations

Learning parameters via probabilistic inference

Learning conditional independences via probabilistic inference

References

Daniel M. Roy 25 / 101


Introduction

QUERY and conditional simulation

Probabilistic inference

Conditional independence and compact representations

Learning parameters via probabilistic inference

Learning conditional independences via probabilistic inference

References

Daniel M. Roy 26 / 101


MEDICAL DIAGNOSIS MODEL DESCRIPTION, PART I

The remainder of the tutorial will use a medical diagnosis task as a running
example. The goal is to link observed symptoms to (unobserved) diseases.
The stochastic inference problem is:

QUERY(diseasesAndSymptoms,checkSymptoms)

All that remains is to define the two procedures that specify the model.
diseasesAndSymptoms() will produce a random (possibly empty) set
of diseases and associated symptoms, modeling the distribution of diseases
and symptoms of a randomly chosen patient arriving at a clinic.
checkSymptoms(...) checks the hypothesized symptoms against the
list of observed symptoms, accepting if there is a match.

Daniel M. Roy 27 / 101


MEDICAL DIAGNOSIS MODEL DESCRIPTION, PART II

The prior program diseasesAndSymptoms() proceeds as follows:

(1) For each disease n, sample an independent binary random variable Dn
    with mean pn, where

n Disease pn
1 Arthritis 0.06
2 Asthma 0.04
3 Diabetes 0.11
4 Epilepsy 0.002
5 Giardiasis 0.006
6 Influenza 0.08
7 Measles 0.001
8 Meningitis 0.003
9 MRSA 0.001
10 Salmonella 0.002
11 Tuberculosis 0.003

Dn indicates whether or not a patient has disease n.


Daniel M. Roy 28 / 101
MEDICAL DIAGNOSIS MODEL DESCRIPTION, PART III

(2) For each symptom m, sample an independent binary random variable Lm
    with mean ℓm, where

m Symptom ℓm
1 Fever 0.06
2 Cough 0.04
3 Hard breathing 0.001
4 Insulin resistant 0.15
5 Seizures 0.002
6 Aches 0.2
7 Sore neck 0.006

Lm indicates whether or not a patient spontaneously presents symptom m.

Daniel M. Roy 29 / 101


MEDICAL DIAGNOSIS MODEL DESCRIPTION, PART IV

(3) For each pair of disease n and symptom m, sample an independent
    binary random variable Cn,m with mean cn,m, where (rows are diseases
    n = 1, . . . , 11, columns are symptoms m = 1, . . . , 7):

n\m  1   2   3   4   5   6   7
1 .1 .2 .1 .2 .2 .5 .5
2 .1 .4 .8 .3 .1 .0 .1
3 .1 .2 .1 .9 .2 .3 .5
4 .4 .1 .0 .2 .9 .0 .0
5 .6 .3 .2 .1 .2 .8 .5
6 .4 .2 .0 .2 .0 .7 .4
7 .5 .2 .1 .2 .1 .6 .5
8 .8 .3 .0 .3 .1 .8 .9
9 .3 .2 .1 .2 .0 .3 .5
10 .4 .1 .0 .2 .1 .1 .2
11 .3 .2 .1 .2 .2 .3 .5

Conditioned on having disease n, Cn,m indicates whether or not disease n
causes the patient to present symptom m.

Daniel M. Roy 30 / 101


MEDICAL DIAGNOSIS MODEL DESCRIPTION, PART V
For each symptom m, we then define

Sm = max{Lm , D1 · C1,m , . . . , D11 · C11,m },

hence Sm ∈ {0, 1}. (The max operator is playing the role of a logical OR
operation.)

Sm indicates whether or not the patient presents symptom m.

Every term of the form Dn · Cn,m is interpreted as indicating whether (or not)
the patient has disease n and disease n has caused symptom m. The term
Lm captures the possibility that the symptom may present itself despite the
patient having none of the listed diseases.

Finally, define the output of diseasesAndSymptoms to be the vector


(D1 , . . . , D11 , S1 , . . . , S7 ).

Daniel M. Roy 31 / 101
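
A minimal runnable sketch of this generative process in ordinary Python, with the
probabilities taken from the tables above (bernoulli is the stand-in primitive used in
the earlier sketches):

import random

def bernoulli(p):
    return 1 if random.random() < p else 0

# marginal disease probabilities p_n (Arthritis, Asthma, ..., Tuberculosis)
p = [0.06, 0.04, 0.11, 0.002, 0.006, 0.08, 0.001, 0.003, 0.001, 0.002, 0.003]
# leak probabilities l_m (Fever, Cough, Hard breathing, Insulin resistant,
# Seizures, Aches, Sore neck)
l = [0.06, 0.04, 0.001, 0.15, 0.002, 0.2, 0.006]
# causal probabilities c[n][m]: disease n causes symptom m
c = [[.1,.2,.1,.2,.2,.5,.5],
     [.1,.4,.8,.3,.1,.0,.1],
     [.1,.2,.1,.9,.2,.3,.5],
     [.4,.1,.0,.2,.9,.0,.0],
     [.6,.3,.2,.1,.2,.8,.5],
     [.4,.2,.0,.2,.0,.7,.4],
     [.5,.2,.1,.2,.1,.6,.5],
     [.8,.3,.0,.3,.1,.8,.9],
     [.3,.2,.1,.2,.0,.3,.5],
     [.4,.1,.0,.2,.1,.1,.2],
     [.3,.2,.1,.2,.2,.3,.5]]

def diseasesAndSymptoms():
    D = [bernoulli(pn) for pn in p]                        # step (1)
    L = [bernoulli(lm) for lm in l]                        # step (2)
    C = [[bernoulli(cnm) for cnm in row] for row in c]     # step (3)
    # noisy-OR: S_m = max{L_m, D_1*C_{1,m}, ..., D_11*C_{11,m}}
    S = [max([L[m]] + [D[n] * C[n][m] for n in range(11)]) for m in range(7)]
    return D + S

print(diseasesAndSymptoms())    # a length-18 vector (D_1, ..., D_11, S_1, ..., S_7)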


EXPLORING THE MODEL

If we repeatedly evaluate diseasesAndSymptoms, we might see outputs
like those in the following array:

        Diseases                        Symptoms
        1 2 3 4 5 6 7 8 9 10 11        1 2 3 4 5 6 7
   1    0 0 0 0 0 0 0 0 0 0  0         0 0 0 0 0 0 0
   2    0 0 0 0 0 0 0 0 0 0  0         0 0 0 0 0 0 0
   3    0 0 1 0 0 0 0 0 0 0  0         0 0 0 1 0 0 0
   4    0 0 1 0 0 1 0 0 0 0  0         1 0 0 1 0 0 0
   5    0 0 0 0 0 0 0 0 0 0  0         0 0 0 0 0 1 0
   6    0 0 0 0 0 0 0 0 0 0  0         0 0 0 0 0 0 0
   7    0 0 1 0 0 0 0 0 0 0  0         0 0 0 1 0 1 0
   8    0 0 0 0 0 0 0 0 0 0  0         0 0 0 0 0 0 0

The rows correspond to eight randomly sampled patients:
1. diseases and symptom free
2. same
3. suffering from diabetes and presenting insulin resistance;
4. suffering from diabetes and influenza, and presenting a fever and insulin
resistance;
5. suffering from unexplained aches;
6. free from disease and symptoms;
7. suffering from diabetes, and presenting insulin resistance and aches;
8. disease and symptom free.
Daniel M. Roy 32 / 101
DISEASESANDSYMPTOMS

This model is a toy version of the real diagnostic model QMR-DT [Shwe
et al., 1991], built from the Quick Medical Reference (QMR) knowledge base
of hundreds of diseases and thousands of findings (such as symptoms or test
results). A key aspect of this model is the disjunctive relationship between the
diseases and the symptoms, known as a “noisy-OR”.
Shortcomings
As a model of natural patterns of diseases and symptoms in a random
patient, it leaves much to be desired:
I the model assumes that the presence or absence of any two diseases is
independent, although, as we will see later on in our analysis, diseases
are (as expected) typically not independent conditioned on symptoms.
I diseases may cause other diseases, and symptoms may cause diseases
I QMR-DT, like our toy model, was a major advance over earlier expert
systems and probabilistic models, allowing the simultaneous occurrence of
multiple diseases [Shwe et al., 1991].
These caveats notwithstanding, a close inspection of this simplified model will
demonstrate a surprising range of common-sense reasoning phenomena.
Daniel M. Roy 33 / 101
INCORPORATING OBSERVED SYMPTOMS

Consider the predicate that accepts if and only if S1 = 1 and S7 = 1, i.e., if
and only if the patient presents the symptoms of a fever and a sore neck,
ignoring other symptoms.

def checkSymptoms(D1 ,. . . ,D11 ,S1 ,. . . ,S7 ):
    return S1 == 1 and S7 == 1

What does QUERY(diseasesAndSymptoms, checkSymptoms)
produce? We can just run it! (Right?)

Let µ denote the output distribution of diseasesAndSymptoms, and let
A = {(d1 , . . . , d11 , s1 , . . . , s7 ) : s1 = s7 = 1}. Then
QUERY(diseasesAndSymptoms, checkSymptoms) generates
samples from the conditioned distribution µ(· | A).

We will study the conditional distributions of the diseases given the
symptoms. The following calculations may be very familiar to some readers,
but will be less so to others, and so we present them here to give a more
complete picture of the behavior of QUERY.
Daniel M. Roy 34 / 101
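
With the sketch of diseasesAndSymptoms given earlier, this conditional simulation can
be run directly by rejection; the conditioning event here is not especially rare, so
plain QUERY is feasible. (The predicate below is written to take the length-18 vector
returned by that sketch rather than 18 separate arguments.)

def checkSymptoms(sample):
    D, S = sample[:11], sample[11:]
    return S[0] == 1 and S[6] == 1      # S1 = 1 (fever) and S7 = 1 (sore neck)

def QUERY(guesser, checker):
    while True:
        guess = guesser()
        if checker(guess):
            return guess

diagnosis = QUERY(diseasesAndSymptoms, checkSymptoms)   # one posterior sample
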
POSTERIOR CALCULATIONS GIVEN S1 = S7 = 1, PART I

Consider a {0, 1}-assignment dn for each disease n, and write D = d to
denote the event that Dn = dn for every such n. We’d like to be able to
compute P{D = d | S1 = S7 = 1}.

How? We know that the output of diseasesAndSymptoms is produced from
a collection of independent random variables. We will use this fact to
compute P{D = d} and P{S1 = S7 = 1 | D = d}. We will then employ the
following identities:

    µ(B) µ(A | B) = µ(A ∩ B) = µ(B | A) µ(A).    (2)

Rearranging, we obtain Bayes’ rule,

    µ(A | B) = µ(B | A) µ(A) / µ(B).    (3)

Daniel M. Roy 35 / 101


POSTERIOR CALCULATIONS GIVEN S1 = S7 = 1, PART II
Assume for the moment that D = d. Then what is the probability that
checkSymptoms accepts? The probability we are seeking is the
conditional probability

P(S1 = S7 = 1 | D = d) (4)
= P(S1 = 1 | D = d) · P(S7 = 1 | D = d), (5)

where the equality follows from the observation that once the Dn variables
are fixed, the variables S1 and S7 are independent. To see this, recall that

Sm = max{Lm , D1 · C1,m , . . . , D11 · C11,m },

and note that there is no overlap in the variables determining S1 and S7 once D is fixed.

Daniel M. Roy 36 / 101


POSTERIOR CALCULATIONS GIVEN S1 = S7 = 1, PART III

Note that Sm = 1 if and only if Lm = 1 or Cn,m = 1 for some n such that
dn = 1. (Equivalently, Sm = 0 if and only if Lm = 0 and Cn,m = 0 for all n
such that dn = 1.) By the independence of each of these variables, it follows
that

    P(Sm = 1 | D = d) = 1 − (1 − ℓm) ∏_{n : dn = 1} (1 − cn,m).    (6)

It’s difficult to visualize a distribution on an 11-dimensional vector such as D.
Instead, let d, d′ be {0, 1}-assignments specifying different patterns of
diseases. We can characterize the a posteriori odds

    P(D = d | S1 = S7 = 1) / P(D = d′ | S1 = S7 = 1)

of the assignment d versus the assignment d′ in order to understand how
much more likely we are to see d as an explanation versus d′.

Daniel M. Roy 37 / 101


P OSTERIOR CALCULATIONS GIVEN S1 = S7 = 1, PART IV
By Bayes’ rule, this can be rewritten as

[P(S1 = S7 = 1 | D = d) · P(D = d)] / [P(S1 = S7 = 1 | D = d′) · P(D = d′)],   (7)

where P(D = d) = ∏_{n=1}^{11} P(Dn = dn ) by independence. Using (5), (6) and
(7), one may calculate that

P(Patient only has influenza | S1 = S7 = 1)
/ P(Patient has no listed disease | S1 = S7 = 1) ≈ 42,

i.e., an execution of diseasesAndSymptoms that satisfies the predicate
checkSymptoms is forty-two times more likely to posit that the patient
only has the flu than to posit that the patient has no disease at all.
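The calculation behind this ratio can be scripted directly from (5)–(7). The following minimal Python sketch does so for two disease patterns; the parameter values (marginals, leaks ℓm, and cause probabilities cn,m) are placeholders rather than the tutorial’s actual table, so the resulting odds will not be exactly 42.

p_disease = {6: 0.09, 8: 0.01}             # hypothetical marginals P(Dn = 1)
leak = {1: 0.05, 7: 0.02}                  # hypothetical leak probabilities l_m
cause = {(6, 1): 0.8, (6, 7): 0.3,         # hypothetical c_{n,m} for the two
         (8, 1): 0.7, (8, 7): 0.9}         # diseases and symptoms considered

def prob_symptom(m, d):
    # Equation (6): noisy-OR probability that symptom m is present given
    # the disease pattern d (a dict n -> 0/1).
    prod = 1.0 - leak[m]
    for n, present in d.items():
        if present:
            prod *= 1.0 - cause.get((n, m), 0.0)
    return 1.0 - prod

def prior_prob(d):
    # Independence of the diseases: product of the listed marginals.
    p = 1.0
    for n, present in d.items():
        pn = p_disease.get(n, 0.0)
        p *= pn if present else 1.0 - pn
    return p

def joint(d):
    # Equation (5): S1 and S7 are conditionally independent given D.
    return prior_prob(d) * prob_symptom(1, d) * prob_symptom(7, d)

only_flu = {6: 1, 8: 0}
no_disease = {6: 0, 8: 0}
print(joint(only_flu) / joint(no_disease))   # posterior odds, by (7)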

Daniel M. Roy 38 / 101


P OSTERIOR CALCULATIONS GIVEN S1 = S7 = 1, PART V
On the other hand,

P(Patient only has meningitis | S1 = S7 = 1)
/ P(Patient has no listed disease | S1 = S7 = 1) ≈ 6,

and so

P(Patient only has influenza | S1 = S7 = 1)
/ P(Patient only has meningitis | S1 = S7 = 1) ≈ 7,

and hence we would expect to see, over many executions of

QUERY(diseasesAndSymptoms, checkSymptoms), (8)

roughly seven times as many explanations positing only influenza as
positing only meningitis.

Daniel M. Roy 39 / 101


E XPLAINING AWAY

Further investigation reveals some subtle aspects of the model. For example,

P(Patient only has meningitis and influenza | S1 = S7 = 1)
/ P(Patient has meningitis, maybe influenza, but nothing else | S1 = S7 = 1)
= 0.09 ≈ P(Patient has influenza).

Observation 1
Once we have observed some symptoms, diseases are no longer
independent.

Observation 2
Once the symptoms have been “explained” (e.g., as coming from meningitis),
there is little pressure to posit further causes (the posterior probability of
influenza is essentially the prior probability of influenza).

This phenomenon is well-known and is called explaining away; it is also
known to be linked to the computational hardness of computing probabilities
(and generating samples as QUERY does) in models of this variety.

Daniel M. Roy 40 / 101


S IMPLE MODELS CAN YIELD COMPLEX BEHAVIOR , PART I

Despite the simple model and simple query,
QUERY(diseasesAndSymptoms, checkSymptoms) yields a collection
of diagnostic inferences with tremendous internal complexity.

Many more behaviors are available through different predicates:

I The model naturally handles missing data. checkSymptoms leads to
different conclusions than

def checkSymptoms(D1, . . . , D11, S1, . . . , S7):
    return S1 == 1 and S7 == 1 and ∑_{m=1}^{7} Sm == 2

Daniel M. Roy 41 / 101


S IMPLE MODELS CAN YIELD COMPLEX BEHAVIOR , PART II
We need not limit ourselves to reasoning about diseases given symptoms.
I Imagine that we perform a diagnostic test that rules out meningitis. We
could represent our new knowledge using a predicate capturing the
condition

(D8 = 0) ∧ (S1 = S7 = 1) ∧ (S2 = · · · = S6 = 0).

Of course this approach would not take into consideration our


uncertainty regarding the accuracy or mechanism of the diagnostic test
itself, and so, ideally, we might expand the diseasesAndSymptoms
model to account for how the outcomes of diagnostic tests are affected
by the presence of other diseases or symptoms. Later, we will discuss
how such an extended model might be learned from data, rather than
constructed by hand.

Daniel M. Roy 42 / 101


S IMPLE MODELS CAN YIELD COMPLEX BEHAVIOR , PART III
We can also reason in the other direction, about symptoms given diseases.
I For example, public health officials might wish to know about how
frequently those with influenza present no symptoms. This is captured
by the conditional probability

P(S1 = · · · = S7 = 0 | D6 = 1),

and, via QUERY, by the predicate for the condition D6 = 1. Unlike the
earlier examples where we reasoned backwards from effects
(symptoms) to their likely causes (diseases), here we are reasoning in
the same forward direction as the model diseasesAndSymptoms is
expressed.
The possibilities are effectively inexhaustible, including more complex states
of knowledge such as, there are at least two symptoms present, or the patient
does not have both salmonella and tuberculosis. Later, we will consider the
vast number of predicates and the resulting inferences supported by QUERY
and diseasesAndSymptoms, and contrast this with the compact size of
diseasesAndSymptoms and the table of parameters.
Daniel M. Roy 43 / 101
In this section, we illustrated the basic behavior of QUERY, and began to
explore how QUERY can be used to update beliefs in light of observations.
I Inferences need not be explicitly described in terms of rules, but can
arise implicitly via other mechanisms, like QUERY, paired with
appropriate models and predicates.
I In the working example, the diagnostic rules were determined by the
definition of diseasesAndSymptoms and the table of its parameters.
I These inferences, however, are fixed in advance by the model and its parameters.
I We will examine how the underlying table of probabilities might be learned
from data.
I The structure of diseasesAndSymptoms itself encodes strong
structural relationships among the diseases and symptoms. We will study
how to learn this in part 2.
I Finally, many common-sense reasoning tasks involve making a decision,
and not just determining what to believe. Towards the end, we will describe
how to use QUERY to make decisions under uncertainty.

Daniel M. Roy 44 / 101


Introduction

QUERY and conditional simulation

Probabilistic inference

Conditional independence and compact representations

Learning parameters via probabilistic inference

Learning conditional independences via probabilistic inference

References

Daniel M. Roy 45 / 101


C ONDITIONAL INDEPENDENCE AND COMPACT
REPRESENTATIONS

In this section, we return to the medical diagnosis example, and explain the
way in which conditional independence leads to compact representations,
and conversely, the fact that efficient probabilistic programs, like
diseasesAndSymptoms, exhibit many conditional independencies. We
will do so through connections with the Bayesian network formalism, whose
introduction by Pearl [1988] was a major advancement in AI.

Daniel M. Roy 46 / 101


T HE COMBINATORICS OF QUERY
Common-sense reasoning seems to encompass an unbounded range of
responses / behaviors. How are these compactly represented?

In fact, the small number of diseases and symptoms considered in our simple
medical diagnosis model already leads to a combinatorial number of potential
scenarios: among 11 potential diseases and 7 potential symptoms there are

3^11 · 3^7 = 387 420 489

partial assignments to a subset of variables. All of these must be assigned
probabilities!

Luckily, these 387 420 489 probabilities are determined by

2^11 · 2^7 − 1 = 262 143

probabilities, one each for every complete assignment. Still, this number is
exponential in the number of diseases and symptoms. Even if we discretize
the probabilities to some fixed accuracy, a simple counting argument shows
that most such distributions have no short description.
Daniel M. Roy 47 / 101
F EW DISTRIBUTIONS HAVE COMPACT REPRESENTATIONS

Like diseasesAndSymptoms, every probability distribution on 18 binary
variables implicitly defines a large set of probabilities.
I Not feasible to store these probabilities explicitly.
I Necessary to exploit structure to devise more compact representations.
diseasesAndSymptoms is a small efficient program acting on three
tables with

11 + 7 + 11 · 7 = 95

probabilities. In contrast, a generic distribution on 2^18 possibilities has no
short description.
I diseasesAndSymptoms implicitly represents 2^18 − 1 probabilities
via an efficient (and short) simulator.
I What structure suffices to yield compact representations?
I What structure is necessary given efficient representations?

Daniel M. Roy 48 / 101


C ONDITIONAL INDEPENDENCE , PART I

The answer to both questions is conditional independence.

Recall that a collection of random variables {Xi : i ∈ I} is independent
when, for all finite subsets J ⊆ I and measurable sets Ai where i ∈ J, we
have

P( ∧_{i∈J} Xi ∈ Ai ) = ∏_{i∈J} P(Xi ∈ Ai).   (9)

If X and Y were binary random variables, then specifying their distribution
would require 3 probabilities in general, but only 2 if they were independent.
While those savings are small, consider instead m binary random variables
Xj , j = 1, . . . , m, and note that, while a generic distribution over these
random variables would require the specification of 2^m − 1 probabilities, only
m probabilities are needed in the case of full independence.

Full independence is rare and so this factorization is not the whole story.

Daniel M. Roy 49 / 101


C ONDITIONAL INDEPENDENCE , PART II
Conditional independence is arguably more fundamental.

We consider a special case of conditional independence, restricting our
attention to conditional independence with respect to a discrete random
variable N taking values in some countable or finite set N.

We say that a collection of random variables {Xi : i ∈ I} is conditionally
independent given N when, for all n ∈ N, finite subsets J ⊆ I and
measurable sets Ai , for i ∈ J, we have

P( ∧_{i∈J} Xi ∈ Ai | N = n ) = ∏_{i∈J} P(Xi ∈ Ai | N = n).

To illustrate the potential savings that can arise from conditional
independence, consider m binary random variables that are conditionally
independent given a discrete random variable taking k values. In general, the
joint distribution over these m + 1 variables is specified by k · 2^m − 1
probabilities, but, in light of the conditional independence, we need specify
only km + k − 1 probabilities.
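For example (with illustrative numbers, not tied to the diagnosis model): with m = 18 binary variables and k = 10, a generic joint distribution requires 10 · 2^18 − 1 = 2 621 439 probabilities, whereas under the conditional independence only 10 · 18 + 10 − 1 = 189 are needed.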
Daniel M. Roy 50 / 101
C ONDITIONAL INDEPENDENCE IN
D I S E A S E S A N D S Y M P T O M S , PART I

Conditional independence gives rise to compact representations.


Indeed, diseasesAndSymptoms exhibits many conditional
independencies.

To begin to understand the compactness of diseasesAndSymptoms,


note that the 95 variables

{D1 , . . . , D11 ; L1 , . . . , L7 ; C1,1 , C1,2 , C2,1 , C2,2 , . . . , C11,7 }

are independent, and thus their joint distribution is determined by specifying


only 95 probabilities (in particular, those in the tables).

Daniel M. Roy 51 / 101


C ONDITIONAL INDEPENDENCE IN
D I S E A S E S A N D S Y M P T O M S , PART II

Each symptom Sm is a deterministic function of a 23-variable subset

{D1 , . . . , D11 ; Lm ; C1,m , . . . , C11,m }.

The variables Lm ; C1,m , . . . , C11,m are not shared across symptoms,
implying symptoms are conditionally independent given diseases.

However, these facts alone do not fully explain the compactness of
diseasesAndSymptoms. In particular, there are

2^(2^23) > 10^(10^6)

binary functions of 23 binary inputs, and so by a counting argument, most
have no short description. On the other hand, the max operation that defines
Sm does have a compact and efficient implementation.

What’s the connection?

Daniel M. Roy 52 / 101


R EPRESENTATIONS OF CONDITIONAL INDEPENDENCE

A useful way to represent conditional independence among a collection of


random variables is in terms of a directed acyclic graph, where the vertices
stand for random variables, and the collection of edges indicates the
presence of certain conditional independencies. Such a graph is known as a
directed graphical model or Bayesian network. (For more details on Bayesian
networks, see the survey by Pearl [2004].)

Daniel M. Roy 53 / 101


B AYES NET FOR D I S E A S E S A N D S Y M P T O M S

[Figure: Bayes net over the variables D1 , . . . , D11 ; L1 , . . . , L7 ; C1,1 , . . . , C11,7 ; and S1 , . . . , S7 .]

Bayes net for the medical diagnosis example. (Note that the directionality of
the arrows is not rendered for clarity. All arrows point to the symptoms Sm .)
Daniel M. Roy 54 / 101
P ROTO - LANGUAGES FOR B AYES NETS

[Figure: plate notation for the diagnosis model, with vertices Dn , Cn,m , Lm , and Sm , a plate over diseases n, and a plate over symptoms m.]

The repetitive structure can be partially captured by so-called “plate notation”,
which can be interpreted as a primitive for-loop construct. Practitioners
have adopted a number of strategies like plate notation for capturing
complicated structures.

Daniel M. Roy 55 / 101


D -S EPARATION IN B AYES NETS I

In order to understand exactly which conditional independencies are formally


encoded in such a graph, we must introduce the notion of d-separation.

d-separation
A pair (x, y) of vertices are d-separated by a subset of vertices E as follows:
First, mark each vertex in E with a ×, which we will indicate by the symbol
N
. If a vertex with (any type of) mark has an unmarked
L parent, mark the
parent with a +, which we will indicate by the symbol . Repeat until a fixed
J
point is reached. Let indicate unmarked vertices.

Definition. x and y are d-separated by E if, for all (undirected) paths from x
to y , one of the following patterns appears:
J N J J N J
→ → ← ←
J N J J J J
← → → ←
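The marking procedure and the path test translate directly into a short algorithm. Below is a minimal Python sketch under the definition above; the graph encoding (a dict mapping each vertex to a list of its parents) and the function names are illustrative choices, not part of the tutorial, and the path enumeration is only meant for small graphs.

def mark(parents, E):
    # Mark evidence vertices with 'x', then repeatedly mark any unmarked
    # parent of a marked vertex with '+', until a fixed point is reached.
    marks = {v: 'x' for v in E}
    changed = True
    while changed:
        changed = False
        for v in list(marks):
            for p in parents.get(v, []):
                if p not in marks:
                    marks[p] = '+'
                    changed = True
    return marks  # vertices absent from `marks` are unmarked

def path_blocked(path, parents, marks):
    # A path is blocked if some consecutive triple (a, b, c) matches one of
    # the four patterns: a chain or fork through an x-marked vertex, or a
    # collider at a completely unmarked vertex.
    for a, b, c in zip(path, path[1:], path[2:]):
        collider = a in parents.get(b, []) and c in parents.get(b, [])
        if collider and b not in marks:
            return True
        if not collider and marks.get(b) == 'x':
            return True
    return False

def undirected_paths(parents, x, y):
    # Enumerate simple undirected paths from x to y (exponential in general;
    # fine for illustration on small graphs).
    nbrs = {}
    for v, ps in parents.items():
        nbrs.setdefault(v, set()).update(ps)
        for p in ps:
            nbrs.setdefault(p, set()).add(v)
    def dfs(v, path):
        if v == y:
            yield list(path)
            return
        for w in nbrs.get(v, ()):
            if w not in path:
                yield from dfs(w, path + [w])
    yield from dfs(x, [x])

def d_separated(parents, x, y, E):
    marks = mark(parents, E)
    return all(path_blocked(p, parents, marks)
               for p in undirected_paths(parents, x, y))

# Tiny example: a collider D1 -> S1 <- D2, as in the diagnosis network.
parents = {'S1': ['D1', 'D2'], 'D1': [], 'D2': []}
print(d_separated(parents, 'D1', 'D2', set()))    # True: independent a priori
print(d_separated(parents, 'D1', 'D2', {'S1'}))   # False: dependent given S1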

Daniel M. Roy 56 / 101


D -S EPARATION IN B AYES NETS II
More generally, if X and E are disjoint sets of vertices, then the graph
encodes the conditional independence of the vertices X given E if every pair
of vertices in X is d-separated by E.

If we fix a collection V of random variables, then we say that a directed
acyclic graph G over V is a Bayesian network when the random variables in
V indeed possess all of the conditional independencies implied by the graph
by d-separation. Note that a Bayes net G says nothing about which
conditional independencies do not exist among its vertex set.

Daniel M. Roy 57 / 101


[Figure: the same Bayes net over D1 , . . . , D11 ; L1 , . . . , L7 ; C1,1 , . . . , C11,7 ; and S1 , . . . , S7 .]

By deciding d-separation in the Bayes net, we can determine that (1) the
diseases {D1 , . . . , D11 } are independent (i.e., conditionally independent
given E = ∅), that (2) the symptoms {S1 , . . . , S7 } are conditionally
independent given the diseases {D1 , . . . , D11 }, and many more.

Daniel M. Roy 58 / 101


C ONDITIONAL INDEPENDENCE AND FACTORIZATION I

Bayes nets specify a factorization of the joint distribution of the vertex set.

It is a basic fact from probability that

P(X1 = x1 , . . . , Xk = xk )
= P(X1 = x1 ) · P(X2 = x2 | X1 = x1 ) · · · P(Xk = xk | Xj = xj , j < k)
= ∏_{j=1}^{k} P(Xj = xj | Xi = xi , i < j).

Such a factorization provides no advantage when seeking a compact
representation, as the j’th conditional distribution is determined by 2^(j−1)
probabilities.

Daniel M. Roy 59 / 101


C ONDITIONAL INDEPENDENCE AND FACTORIZATION II
On the other hand, if we have a Bayes net over the same variables, then we
may have a much more concise factorization.

Let G be a Bayes net over {X1 , . . . , Xk }, and write Pa(Xj ) for the set of
indices i such that (Xi , Xj ) ∈ G, i.e., Pa(Xj ) indexes the parent vertices of
Xj . Then the joint p.m.f. may be expressed as
P(X1 = x1 , . . . , Xk = xk ) = ∏_{j=1}^{k} P(Xj = xj | Xi = xi , i ∈ Pa(Xj )).

This factorization is determined by only ∑_{j=1}^{k} 2^|Pa(Xj)| probabilities, an
exponential savings potentially.
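As a concrete illustration of how this factorization is used, here is a minimal Python sketch that evaluates the joint p.m.f. of binary variables from per-vertex conditional probability tables; the encoding (a parent map plus tables keyed by parent configurations) is an illustrative choice, not something specified in the tutorial.

def joint_pmf(x, parents, cpt):
    # x: dict vertex -> 0/1 assignment
    # parents: dict vertex -> tuple of parent vertices
    # cpt[v][config] = P(X_v = 1 | parents take the values in `config`)
    p = 1.0
    for v, value in x.items():
        config = tuple(x[u] for u in parents[v])
        p1 = cpt[v][config]
        p *= p1 if value == 1 else (1.0 - p1)
    return p

# Toy two-vertex example: X1 with no parents, X2 depending on X1.
parents = {'X1': (), 'X2': ('X1',)}
cpt = {'X1': {(): 0.3}, 'X2': {(0,): 0.1, (1,): 0.8}}
print(joint_pmf({'X1': 1, 'X2': 0}, parents, cpt))  # 0.3 * (1 - 0.8) = 0.06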

Daniel M. Roy 60 / 101


E FFICIENT REPRESENTATIONS AND CONDITIONAL
INDEPENDENCE , PART I

As we saw at the beginning of this section, models with only a moderate


number of variables can have enormous descriptions. Having introduced the
Bayesian network formalism, we can use diseasesAndSymptoms as an
example to explain why, roughly speaking, the output distributions of efficient
probabilistic programs exhibit many conditional independencies.

Daniel M. Roy 61 / 101


E FFICIENT REPRESENTATIONS AND CONDITIONAL
INDEPENDENCE , PART II
What does the efficiency of diseasesAndSymptoms imply about the
structure of its output distribution?

Assuming the 95 tabulated probabilities are dyadic rationals of the form k/2^m, we may
represent diseasesAndSymptoms as a small boolean circuit whose
inputs are random bits and whose 18 output lines represent the diseases and
symptom indicators. Each circuit element can be restricted to
constant fan-in, and the total number of circuit elements grows only linearly in
the number of diseases and in the number of symptoms, assuming fixed
accuracy for the base probabilities.

Daniel M. Roy 62 / 101


E FFICIENT REPRESENTATIONS AND CONDITIONAL
INDEPENDENCE , PART III
If we view the input lines as random variables, then the output lines of the
logic gates are also random variables, and so we may ask: what conditional
independencies hold among the circuit elements?

It is straightforward to show that the circuit diagram, viewed as a directed


acyclic graph, is a Bayes net capturing conditional independencies among
the inputs, outputs, and internal gates of the circuit implementing
diseasesAndSymptoms. For every gate, the conditional probability mass
function is characterized by the (constant-size) truth table of the logical gate.

Therefore, if an efficient prior program samples from some distribution over a


collection of binary random variables, then those random variables exhibit
many conditional independencies, in the sense that we can introduce a
polynomial number of additional boolean random variables (representing
intermediate computations) such that there exists a constant-fan-in Bayes net
over all the variables.

Daniel M. Roy 63 / 101


G RAPHICAL MODELS IN AI

Graphical models, and, in particular, Bayesian networks, played a critical role


in popularizing probabilistic techniques within AI in the late 1980s and early
1990s.
Two developments were central to this shift:
1. Researchers introduced compact, computer-readable representations of
distributions on large (but still finite) collections of random variables, and
did so by explicitly representing a graph capturing conditional
independencies and exploiting the factorization given above.
2. Researchers introduced efficient graph-based algorithms that operated
on these representations, exploiting the factorization to compute
conditional probabilities.
For the first time, a large class of distributions was given a formal
representation that enabled the design of general-purpose algorithms to
compute useful quantities.

Daniel M. Roy 64 / 101


Introduction

QUERY and conditional simulation

Probabilistic inference

Conditional independence and compact representations

Learning parameters via probabilistic inference

Learning conditional independences via probabilistic inference

References

Daniel M. Roy 65 / 101


L EARNING PARAMETERS VIA PROBABILISTIC INFERENCE I

The 95 tabulated probabilities induce a distribution over 262 144 outcomes.


I Where did these numbers come from?
I And are they any good?
In practice, these 95 “parameters” would themselves be subject to a great
deal of uncertainty, and one might hope to use data from actual diagnostic
situations to learn appropriate values.

Daniel M. Roy 66 / 101


L EARNING AS PROBABILISTIC INFERENCE , PART I

The Bayesian approach to learning the 95 parameters is to place a prior
distribution on them. Whereas different individuals’ diseases and symptoms
were independent before, they are no longer independent.
Thus, this change affects diseasesAndSymptoms in two ways:
1. Rather than using the fixed 95 tabulated probabilities, the program will
start by randomly generating the entries of the table. (Concretely,
assume they are i.i.d. uniform in [0, 1].)
2. The new program, dubbed allDiseasesAndSymptoms, then
evaluates diseasesAndSymptoms() n + 1 times, returning the
resulting (n + 1) × 18 array.
3. Similarly, we modify checkSymptoms to obtain
checkAllSymptoms, which accepts the n + 1 diagnoses generated by
allDiseasesAndSymptoms if and only if the first n agree with n
historical records, and the last one, the current patient’s, is accepted by
checkSymptoms. (A sketch of these appears below.)
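A minimal Python sketch of these pieces follows. The table layout, the helper diseasesAndSymptomsGivenTable, and the use of a historicalRecords list are assumptions made for illustration, not definitions from the tutorial; checkSymptoms is the predicate defined earlier.

import random

historicalRecords = []   # n past diagnoses, each a tuple of 18 binary values

def randomTable():
    # 11 disease marginals, 7 leak probabilities, and an 11 x 7 grid of
    # cause probabilities, all i.i.d. Uniform[0, 1].
    return {'p': [random.random() for _ in range(11)],
            'l': [random.random() for _ in range(7)],
            'c': [[random.random() for _ in range(7)] for _ in range(11)]}

def diseasesAndSymptomsGivenTable(t):
    # The original noisy-OR generative process, run with a supplied table:
    # S_m = max{L_m, D_1 * C_{1,m}, ..., D_11 * C_{11,m}}.
    D = [1 if random.random() < t['p'][n] else 0 for n in range(11)]
    L = [1 if random.random() < t['l'][m] else 0 for m in range(7)]
    C = [[1 if random.random() < t['c'][n][m] else 0 for m in range(7)]
         for n in range(11)]
    S = [max([L[m]] + [D[n] * C[n][m] for n in range(11)]) for m in range(7)]
    return tuple(D + S)

def allDiseasesAndSymptoms():
    n = len(historicalRecords)
    table = randomTable()
    # (n + 1) x 18 array: n rows to compare with history, one for the patient.
    return [diseasesAndSymptomsGivenTable(table) for _ in range(n + 1)]

def checkAllSymptoms(rows):
    n = len(historicalRecords)
    return (all(rows[i] == historicalRecords[i] for i in range(n))
            and checkSymptoms(*rows[n]))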

Daniel M. Roy 67 / 101


L EARNING AS PROBABILISTIC INFERENCE , PART II
The changes to allDiseasesAndSymptoms may seem quite
surprising, and unlikely to work very well. Indeed, run on its own,
allDiseasesAndSymptoms will produce unnatural patterns. The key is
to consider the effect of checkAllSymptoms. What are typical outputs
from QUERY(allDiseasesAndSymptoms, checkAllSymptoms)?

For large n,
I a run is far more likely to be accepted when the hypothesized marginal
probability of each disease is relatively close to the frequency observed in the
historical disease–symptom data.

Thus, when a simulation is eventually accepted by checkAllSymptoms,
I the hypothesized marginal probabilities will closely match the relative
frequencies in the data.

The concentration around the data occurs at an n^(−1/2) rate, and so we would
expect that the typical accepted sample would soon correspond with a latent
table of probabilities that roughly matches the historical record.

Daniel M. Roy 68 / 101


E XACT POSTERIOR DISTRIBUTION OF pj
Fix a disease j and recall that pj ∼ Uniform[0, 1].

The probability that the n sampled values of Dj match the historical record is

pj^k · (1 − pj)^(n−k),   (10)

where k stands for the number of records where disease j is present.

By Bayes’ theorem, in the special case of a uniform prior distribution on pj,
the density of the conditional distribution of pj given the historical evidence is
proportional to the likelihood (10). This implies that, conditionally on the
historical record, pj has a so-called Beta(α1, α0) distribution, with α1 = k + 1
and α0 = n − k + 1, hence mean

α1 / (α1 + α0) = (k + 1) / (n + 2)

and concentration parameter α1 + α0 = n + 2.
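The conjugate update is easy to compute directly. The following minimal sketch, using only the Python standard library, summarizes and samples the posterior over a single disease marginal pj given counts k and n (the specific counts are placeholders).

import random

def posterior_over_marginal(k, n):
    # Uniform prior on p_j plus k successes in n Bernoulli trials gives a
    # Beta(k + 1, n - k + 1) posterior.
    alpha1, alpha0 = k + 1, n - k + 1
    mean = alpha1 / (alpha1 + alpha0)
    concentration = alpha1 + alpha0
    return alpha1, alpha0, mean, concentration

a1, a0, mean, conc = posterior_over_marginal(k=9, n=100)
print(mean)                                  # (9 + 1) / (100 + 2)
print(random.betavariate(a1, a0))            # one posterior sample of p_j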

Daniel M. Roy 69 / 101


E XACT POSTERIOR DISTRIBUTION OF pj
Beta distributions under varying parameterizations, highlighting the fact that,
as the concentration grows, the distribution begins to concentrate around its
mean. As n grows, predictions made by
QUERY(allDiseasesAndSymptoms, checkAllSymptoms) will
likely be those of runs where each disease marginal pj falls near the
observed frequency of the j th disease. In effect, the historical record data
determines the values of the marginals pj.

[Figure: plots of the probability density of Beta(α1, α0) distributions with density
f(x; α1, α0) = Γ(α1 + α0) / (Γ(α1)Γ(α0)) · x^(α1 − 1) (1 − x)^(α0 − 1)
for parameters (1, 1), (3, 1), (30, 3), and (90, 9) (respectively, in height).
For parameters α1, α0 > 1, the distribution is unimodal with mean α1/(α1 + α0).]
Daniel M. Roy 70 / 101
P OSTERIOR CONVERGENCE

A similar analysis can be made of the dynamics of the posterior distribution of


the latent parameters `m and cn,m .

Abstractly speaking, in finite dimensional Bayesian models like this one


satisfying certain regularity conditions, it is possible to show that the
predictions of the model converge to those made by the best possible
approximation within the model to the distribution of the data. (For a
discussion of these issues, see, e.g., Barron [1998].)

Daniel M. Roy 71 / 101


B AYESIAN LEARNING , PART I

I The original diseasesAndSymptoms program makes the same
inferences in each case, even in the face of large amounts of data.
I allDiseasesAndSymptoms learns from data.
The key: the latent table of probabilities, modeled as random variables.

Similar approaches can be used when the patients come from multiple
distinct populations where you do not expect the patterns of diseases and
symptoms to agree.

Daniel M. Roy 72 / 101


B AYESIAN LEARNING , PART II
Note
I allDiseasesAndSymptoms is even more compact than
diseasesAndSymptoms
I the specification of the distribution of the random table is logarithmic in the
size of the table
I On the other hand, allDiseasesAndSymptoms relies on data to
help it reduce its substantial a priori uncertainty regarding these values.
This tradeoff—between, on the one hand, the flexibility and complexity of a
model and, on the other, the amount of data required in order to make
sensible predictions—is seen throughout statistical modeling. We will return
to this point later.
Here we have seen how the parameters in prior programs can be modeled as
random, and thereby learned from data by conditioning on historical
diagnoses. In the next section, we consider the problem of learning not only
the parameterization but the structure of the model’s conditional
independence itself.

Daniel M. Roy 73 / 101


Introduction

QUERY and conditional simulation

Probabilistic inference

Conditional independence and compact representations

Learning parameters via probabilistic inference

Learning conditional independences via probabilistic inference

References

Daniel M. Roy 74 / 101


L EARNING CONDITIONAL INDEPENDENCES I

Key limitation of diseasesAndSymptoms


diseasesAndSymptoms uses a fixed noisy-OR network to perform
inference

Solution
allDiseasesAndSymptoms performs probabilistic inference over the
probabilities to learn the best noisy-OR network from data.

Key limitation of allDiseasesAndSymptoms


Irrespective of how much historical data we have,
allDiseasesAndSymptoms cannot go beyond the conditional
independence assumptions implicit in the structure of the prior program.

Solution
Identify the correct structure of the dependence between symptoms and
disease by probabilistic inference over random conditional independence
structure among the model variables.

Daniel M. Roy 75 / 101


S TRUCTURAL LEARNING VIA PROBABILISTIC INFERENCE

Learning a probabilistic model from data is a quintessential example of


unsupervised learning. Learning conditional independence relationships
among model variables is known as structure learning.

Need to learn the components of this factorization:

k
Y
P(X1 = x1 , . . . , Xk = xk ) = P(Xj = xj | Xi = xi , i ∈ Pa(Xj )).
j=1

Approach
Be “Bayesian”... put prior distributions on graphs and conditional probabilities.

Daniel M. Roy 76 / 101


A RANDOM PROBABILITY DISTRIBUTION I

Consider a prior program, which we will call RPD (for Random Probability
Distribution), that takes two positive integer inputs, n and D , and produces as
output n independent samples drawn from a random probability distribution
on {0, 1}^D.

We will then perform inference as usual

QUERY(RPD(n+1,18), checkAllSymptoms)

Daniel M. Roy 77 / 101


A RANDOM PROBABILITY DISTRIBUTION II
Intuitively, RPD works in the following way:
1. RPD generates a random directed acyclic graph G over the vertex set
{X1 , . . . , XD }.
2. For each vertex j , for each setting v ∈ {0, 1}^Pa(Xj) , RPD generates the
value of the conditional probability
pj|v = P(Xj = 1 | Xi = vi , i ∈ Pa(Xj )) uniformly at random.
3. RPD generates n samples, each a vector of D binary values with the same
distribution as (X1 , . . . , XD ).
The first two steps produce a graph G and a random probability distribution for
which G is a valid Bayes net.
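The following is a minimal Python sketch of such a generator. The tutorial does not specify the distribution over directed acyclic graphs, so the sketch makes one simple, clearly-labeled assumption: draw a uniformly random topological order and include each forward edge independently with probability 1/2.

import random
from itertools import product

def RPD(n, D, edge_prob=0.5):
    # Step 1: a random DAG. Assumption (not from the tutorial): random
    # topological order, each forward edge present independently.
    order = list(range(D))
    random.shuffle(order)
    parents = {j: [] for j in range(D)}
    for a, b in product(range(D), repeat=2):
        if order.index(a) < order.index(b) and random.random() < edge_prob:
            parents[b].append(a)

    # Step 2: uniformly random conditional probability tables.
    cpt = {j: {v: random.random()
               for v in product((0, 1), repeat=len(parents[j]))}
           for j in range(D)}

    # Step 3: n ancestral samples of (X_1, ..., X_D).
    samples = []
    for _ in range(n):
        x = {}
        for j in sorted(range(D), key=order.index):   # respect the DAG order
            v = tuple(x[i] for i in parents[j])
            x[j] = 1 if random.random() < cpt[j][v] else 0
        samples.append([x[j] for j in range(D)])
    return samples

rows = RPD(n=5, D=18)   # five samples from one random distribution on {0,1}^18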

Daniel M. Roy 78 / 101


G ENERAL B AYESIAN LEARNING MACHINE

With RPD fully specified, we’d like to understand the output of

QUERY(RPD(n + 1, 18), checkAllSymptoms) (11)

where checkAllSymptoms is defined earlier, accepting n + 1 diagnoses


if and only if the first n agree with historical records, and the symptoms
associated with the n + 1’st agree with the current patient’s symptoms. (Note
that we are identifying each output (X1 , . . . , X11 , X12 , . . . , X18 ) with a
diagnosis (D1 , . . . , D11 , S1 , . . . , S7 ), and have done so in order to highlight
the generality of RPD.)

Daniel M. Roy 79 / 101


G RAPH - CONDITIONAL POSTERIOR , PART I

Assume we condition on the graph G produced in step 1. Now we’re
essentially back to our earlier analysis.

The expected value of pj|v = P(Xj = 1 | Xi = vi , i ∈ Pa(Xj )) on those
runs accepted by QUERY is

(kj|v + 1) / (nj|v + 2),

where nj|v is the number of times in the historical data where the event
{Xi = vi , i ∈ Pa(Xj )} occurs; and kj|v is the number of times when,
moreover, Xj = 1. This is simply the “smoothed” empirical frequency. In fact,
the probability pj|v is conditionally Beta distributed with concentration
nj|v + 2.

Daniel M. Roy 80 / 101


G RAPH - CONDITIONAL POSTERIOR , PART II
The variance of pj|v is one characterization of our residual uncertainty, and
for each probability pj|v , the variance is easily shown to scale as 1/nj|v , i.e.,
inversely with the number of times in the historical data when the event
{Xi = vi , i ∈ Pa(Xj )} occurs.
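To see the 1/nj|v rate concretely: since pj|v is conditionally Beta(α1, α0) with α1 = kj|v + 1 and α0 = nj|v − kj|v + 1,

Var(pj|v | data) = α1 α0 / ( (α1 + α0)^2 (α1 + α0 + 1) ) ≤ 1 / ( 4 (nj|v + 3) ),

using α1 + α0 = nj|v + 2 and α1 α0 ≤ (α1 + α0)^2 / 4.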

Informally,
I the smaller the parental sets (a property of G), the more certain we are
likely to be regarding the correct parameterization.
I in terms of QUERY, this is reflected in a smaller range of values of pj|v on
accepted runs.
This is our first glimpse at a subtle balance between the simplicity (sparsity)
of the graph G and how well it captures hidden structure in the data.

Daniel M. Roy 81 / 101


A SPECTS OF THE POSTERIOR DISTRIBUTION OF THE
GRAPHICAL STRUCTURE

There are a number of practical roadblocks


I The space of directed acyclic graphs on 18 variables is enormous
I Computational hardness results abound [Cooper, 1990, Dagum and
Luby, 1993, Chandrasekaran et al., 2008].
I Indeed, QUERY would not halt in reasonable time for even small n and
D because the probability of generating the structure that fits the data is
astronomically small.
State of the art structure learning algorithms operate in special subclasses of
distributions and are very sophisticated.
Our goal here is understanding / intuition:
I we can aim to understand conceptual structure of the posterior
distribution of the graph G, perhaps in simple cases
I this example reveals an important aspect of hierarchical Bayesian
models with regard to their ability to avoid “overfitting”, and gives some
insight into why we might expect “simpler” explanations/theories to win
out in the short term over more complex ones.
Daniel M. Roy 82 / 101
L IKELIHOOD OF A GRAPH I

Fix a graph G. Then any distribution of the form

P(X1 = x1 , . . . , Xk = xk ) = ∏_{j=1}^{k} P(Xj = xj | Xi = xi , i ∈ Pa(Xj ))

will be called a model in G.

Lemma. If the edges in G′ are a strict subset of those in G, then the models of
G′ are a strict subset of those of G.

Corollary. The best fitting distribution in G is never worse than that in G′.

Are samples from QUERY(RPD(n + 1, 18), checkAllSymptoms) more
likely to come from G than from G′?

Daniel M. Roy 83 / 101


L IKELIHOOD OF A GRAPH II
Key observations:
I the posterior probability of a particular graph G does not reflect the
best-fitting model in G, but rather reflects the average ability of models
in G to explain the historical data.
I the average is over the Uniform[0, 1] distribution of the pj|v .
I if a spurious edge exists in a graph G′, a typical distribution from G′ is
less likely to explain the data than a typical distribution from the graph
with that edge removed.

Daniel M. Roy 84 / 101


L IKELIHOOD OF A GRAPH III
In order to characterize the posterior distribution of the graph, we must
identify the likelihood that a sample from the prior program is accepted given
a particular graph G.

Every time the pattern {Xi = vi , i ∈ Pa(Xj )} arises in historical data, the
generative process produces the historical value Xj with probability pj|v if
Xj = 1 and 1 − pj|v if Xj = 0.

Conditional on the pj|v ’s, the probability that the historical data is reproduced
is
∏_{j=1}^{D} ∏_{v} pj|v^{kj|v} (1 − pj|v )^{nj|v − kj|v},

where v ranges over the possible {0, 1} assignments to Pa(Xj ) and kj|v
and nj|v are defined as above.

Daniel M. Roy 85 / 101


L IKELIHOOD OF A GRAPH IV
However, we don’t know pj|v , and so the likelihood of the graph G is

score(G) = ∏_{j=1}^{D} ∏_{v} E[ pj|v^{kj|v} (1 − pj|v )^{nj|v − kj|v} ]

         = ∏_{j=1}^{D} ∏_{v} (nj|v + 1)^{−1} C(nj|v , kj|v )^{−1},

where E takes expectations with respect to Uniform[0, 1], and C(n, k) denotes
the binomial coefficient.

Because the graph G was chosen uniformly at random, it follows that the
posterior probability of a particular graph G is proportional to score(G).
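Since the closed form involves ratios of factorials, it is numerically safer to work with log score(G); here is a minimal Python sketch that does so from the counts nj|v and kj|v, with the counts data structure being an illustrative choice.

from math import lgamma

def log_marginal(n, k):
    # log E[ p^k (1 - p)^(n - k) ] for p ~ Uniform[0, 1]
    #   = log Beta(k + 1, n - k + 1) = -log( (n + 1) * C(n, k) ).
    return lgamma(k + 1) + lgamma(n - k + 1) - lgamma(n + 2)

def log_score(counts):
    # counts[j][v] = (n_{j|v}, k_{j|v}) for vertex j and parent configuration v.
    return sum(log_marginal(n, k)
               for per_vertex in counts.values()
               for (n, k) in per_vertex.values())

print(log_score({0: {(): (10, 7)}}))   # equals log(1 / (11 * C(10, 7)))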

Daniel M. Roy 86 / 101


G RAPH POSTERIOR

We can study the preference for one graph G over another G′ by studying
the ratio of their scores:

score(G) / score(G′).

This score ratio is known as the Bayes factor, which I.J. Good termed the
Bayes–Jeffreys–Turing factor [Good, 1968, 1975], and which Turing himself
called the factor in favor of a hypothesis (see [Good, 1968], [Zabell, 2012,
§1.4], and [Turing, 2012]). Its logarithm is sometimes known as the weight of
evidence [Good, 1968].

The form of the score is a product over the local structure of the graph, and thus
the Bayes factor will depend only on the contributions from those parts of the
graphs G and G′ that differ.

Daniel M. Roy 87 / 101


S IMPLE SPECIAL CASE OF GRAPH POSTERIOR , PART I

Consider the following simplified scenario, which captures several features of
learning structure from data: Fix two graphs, G and G′, over the same
collection of random variables, but assume that in G, two of these random
variables, X and Y , have no parents and are thus independent, and in G′
there is an edge from X to Y , and so they are almost surely dependent.

The Bayes factor in favor of independence is

(n1 + 1)(n0 + 1) C(n1 , k1 ) C(n0 , k0 ) / [ (n + 1) C(n, k) ],   (12)

where
I n counts the total number of observations;
I k counts Y = 1;
I n1 counts X = 1;
I k1 counts X = 1 and Y = 1;
I n0 counts X = 0; and
I k0 counts X = 0 and Y = 1.
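The counts just listed are all that is needed to evaluate (12). The following minimal Python sketch computes its logarithm, i.e., the weight of evidence in favor of independence, directly from a list of (x, y) pairs; the function name and the toy data are illustrative.

from math import comb, log
import random

def log_bayes_factor_independence(pairs):
    # pairs: observations (x, y) with x, y in {0, 1}; returns the log of (12),
    # i.e., the weight of evidence in favor of the independent graph G.
    n = len(pairs)
    k = sum(y for _, y in pairs)
    n1 = sum(x for x, _ in pairs)
    k1 = sum(1 for x, y in pairs if x == 1 and y == 1)
    n0, k0 = n - n1, k - k1
    return (log(n1 + 1) + log(n0 + 1) + log(comb(n1, k1)) + log(comb(n0, k0))
            - log(n + 1) - log(comb(n, k)))

data = [(random.randint(0, 1), random.randint(0, 1)) for _ in range(200)]
print(log_bayes_factor_independence(data))   # X, Y independent here, so this
                                             # is typically (not always) positive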
Daniel M. Roy 88 / 101
S IMPLE SPECIAL CASE OF GRAPH POSTERIOR , PART II
First consider the case where G′ is the true underlying graph, i.e., when Y is
indeed dependent on X .

By the law of large numbers, and Stirling’s approximation, we can reason that
the evidence for G′ accumulates rapidly, satisfying

log [ score(G) / score(G′) ] ∼ −C · n,   a.s.,

for some constant C > 0 that depends only on the joint distribution of (X, Y ).

Daniel M. Roy 89 / 101


S IMPLE SPECIAL CASE OF GRAPH POSTERIOR , PART III
Now consider that G is the true underlying graph, i.e., X , Y are independent.

Using a similar approach, we have

log [ score(G) / score(G′) ] ∼ (1/2) log n,   a.s.,

and thus evidence accumulates much more slowly.

The following plots show the typical evolution of the Bayes factors under G′
and under G. Evidence accumulates much more rapidly under G′.

Daniel M. Roy 90 / 101


S IMPLE SPECIAL CASE OF GRAPH POSTERIOR , PART IV

[Figure: two plots of the weight of evidence against the number of observations (1 to 100); in the left panel the values fall roughly linearly to about −1000, in the right panel they rise slowly to about 4.]

Weight of evidence for dependence versus independence (positive values support independence) of
a sequence of pairs of random variables sampled from RPD(n, 2).

(left) When presented with data from a distribution where (X, Y ) are indeed dependent, the weight of
evidence rapidly accumulates for the dependent model, at an asymptotically linear rate in the amount
of data.

(right) When presented with data from a distribution where (X, Y ) are independent, the weight of
evidence slowly accumulates for the independent model, at an asymptotic rate that is logarithmic in
the amount of data.

Note that the dependent model can imitate the independent model, but, on average over random
parameterizations of the conditional probability mass functions, the dependent model is worse at
modeling independent data.

Daniel M. Roy 91 / 101


S IMPLE SPECIAL CASE OF GRAPH POSTERIOR , PART V
Some observations
I In both cases, evidence accumulates for the correct model. In fact, it can
be shown that the expected weight of evidence is always non-negative
for the true hypothesis, a result due to Turing himself [Good, 1991,
p. 93].
I the prior distributions on the pj|v are fixed and do not vary with the
amount of data; thus the weight of evidence will eventually eclipse any
prior information and determine the posterior probability.
I however, evidence accumulates rapidly for dependence and much more
slowly for independence: one should choose the prior distribution on graphs to
reflect this imbalance, preferring graphs with fewer edges a priori.

Daniel M. Roy 92 / 101


B AYES ’ O CCAM ’ S RAZOR

The following two statements may seem contradictory:

I When X and Y are independent, evidence stochastically accumulates
for the simpler graph over the more complex graph.
I There is, with high probability, always a parameterization of the more
complex graph that assigns higher likelihood to the data than any
parameterization of the simpler graph.

Bayes’ Occam’s razor [MacKay, 2003, Ch. 28]

The natural way in which hierarchical models choose explanations of the data
with intermediate complexity, avoiding overfitting.

I If a model has many degrees of modeling freedom, then each configuration
must be assigned, on average, less probability than it would under a simpler
model with fewer degrees of freedom.
I A graph with additional edges has more degrees of freedom.

Daniel M. Roy 93 / 101


C HOOSING MODELS USING DATA

Which model should we use to diagnose: diseasesAndSymptoms,


allDiseasesAndSymptoms or RPD?

We can use the same Bayes’ Occam’s Razor perspective:


I RPD model has many more degrees of freedom than both
diseasesAndSymptoms and allDiseasesAndSymptoms.
I Given enough data, RPD can fit any distribution on a finite collection of
binary variables, as opposed to allDiseasesAndSymptoms, which
cannot because it makes strong and immutable assumptions.
I With only a small amount of training data, RPD model expected to have
high posterior uncertainty.
I One would expect better predictions from
allDiseasesAndSymptoms than RPD, if both were fed data
generated by diseasesAndSymptoms.

Daniel M. Roy 94 / 101


S TATE OF THE ART

I There is a challenge bridging the gap between


allDiseasesAndSymptoms and RPD.
I A lot of focus in Bayesian statistics is on building scalable
approximations to QUERY.
I Key family of algorithms are so-called “variational approximations”.
I There’s also active research on Monte Carlo approximations to QUERY.
I Bespoke hand-crafted models in machine learning are being replaced
by hybrid deep learning ones.
I Probabilistic programming and differentiable programming frameworks
massively accelerate certain approaches.
I There’s also interest in bridging the frequentist–Bayesian divide.

Daniel M. Roy 95 / 101


Introduction

QUERY and conditional simulation

Probabilistic inference

Conditional independence and compact representations

Learning parameters via probabilistic inference

Learning conditional independences via probabilistic inference

References

Daniel M. Roy 96 / 101


F. R. Bach and M. I. Jordan. Learning graphical models with Mercer kernels.
In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural
Information Processing Systems 15 (NIPS 2002), pages 1009–1016. The
MIT Press, Cambridge, MA, 2003.
A. R. Barron. Information-theoretic characterization of Bayes performance
and the choice of priors in parametric and nonparametric problems. In
J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, editors,
Bayesian Statistics 6: Proceedings of the Sixth Valencia International
Meeting, pages 27–52, 1998.
V. Chandrasekaran, N. Srebro, and P. Harsha. Complexity of inference in
graphical models. In Proceedings of the Twenty Fourth Conference on
Uncertainty in Artificial Intelligence (UAI 2008), pages 70–78, Corvalis,
Oregon, 2008. AUAI Press.
G. F. Cooper. The computational complexity of probabilistic inference using
Bayesian belief networks. Artificial Intelligence, 42(2-3):393–405, 1990.
P. Dagum and M. Luby. Approximating probabilistic inference in Bayesian
belief networks is NP-hard. Artificial Intelligence, 60(1):141–153, 1993.
ISSN 0004-3702. doi:
http://dx.doi.org/10.1016/0004-3702(93)90036-B.
Daniel M. Roy 97 / 101
P. Dagum, R. Karp, M. Luby, and S. Ross. An optimal algorithm for Monte
Carlo estimation. SIAM Journal on Computing, 29(5):1484–1496, 2000.
doi: 10.1137/S0097539797315306. URL http://epubs.siam.
org/doi/abs/10.1137/S0097539797315306.
M. H. DeGroot. Optimal Statistical Decisions. Wiley Classics Library. Wiley,
2005. ISBN 9780471726142. URL
http://books.google.co.uk/books?id=dtVieJ245z0C.
I. J. Good. Corroboration, explanation, evolving probability, simplicity and a
sharpened razor. The British Journal for the Philosophy of Science, 19(2):
123–143, 1968.
I. J. Good. Explicativity, corroboration, and the relative odds of hypotheses.
Synthese, 30(1):39–73, 1975.
I. J. Good. Weight of evidence and the Bayesian likelihood ratio. In C. G. G.
Aitken and D. A. Stoney, editors, The Use Of Statistics In Forensic
Science. Ellis Horwood, Chichester, 1991.
N. D. Goodman, V. K. Mansinghka, D. M. Roy, K. Bonawitz, and J. B.
Tenenbaum. Church: A language for generative models. In Proceedings of
the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence (UAI
2008), pages 220–229, Corvalis, Oregon, 2008. AUAI Press.

Daniel M. Roy 98 / 101


T. L. Griffiths and J. B. Tenenbaum. Structure and strength in causal
induction. Cognitive Psychology, 51(4):334–384, 2005. ISSN 0010-0285.
doi: 10.1016/j.cogpsych.2005.05.004. URL
http://www.sciencedirect.com/science/article/pii/
S0010028505000459.
T. L. Griffiths and J. B. Tenenbaum. Optimal predictions in everyday
cognition. Psychological Science, 17(9):767–773, 2006. URL
http://web.mit.edu/cocosci/Papers/
Griffiths-Tenenbaum-PsychSci06.pdf.
T. L. Griffiths and J. B. Tenenbaum. Theory-based causal induction.
Psychological Review, 116(4):661–716, 2009. URL
http://cocosci.berkeley.edu/tom/papers/tbci.pdf.
T. L. Griffiths, C. Kemp, and J. B. Tenenbaum. Bayesian models of cognition.
In Cambridge Handbook of Computational Cognitive Modeling. Cambridge
University Press, 2008. URL http://cocosci.berkeley.edu/
tom/papers/bayeschapter.pdf.
O. Kallenberg. Foundations of modern probability. Probability and its
Applications. Springer, New York, 2nd edition, 2002. ISBN 0-387-95313-2.

Daniel M. Roy 99 / 101


C. Kemp and J. B. Tenenbaum. The discovery of structural form. Proceedings
of the National Academy of Sciences, 105(31):10687–10692, 2008. URL
http://www.psy.cmu.edu/~ckemp/papers/kempt08.pdf.
C. Kemp, P. Shafto, A. Berke, and J. B. Tenenbaum. Combining causal and
similarity-based reasoning. In B. Schölkopf, J. Platt, and T. Hoffman,
editors, Advances in Neural Information Processing Systems 19 (NIPS
2006), pages 681–688. The MIT Press, Cambridge, MA, 2007.
R. D. Luce. Individual Choice Behavior. John Wiley, New York, 1959.
R. D. Luce. The choice axiom after twenty years. Journal of Mathematical
Psychology, 15(3):215–233, 1977.
D. J. C. MacKay. Information Theory, Inference, and Learning Algorithms.
Cambridge University Press, Cambridge, UK, 2003.
V. K. Mansinghka. Natively Probabilistic Computation. PhD thesis,
Massachusetts Institute of Technology, 2009.
V. K. Mansinghka, C. Kemp, J. B. Tenenbaum, and T. L. Griffiths. Structured
priors for structure learning. In Proceedings of the Twenty-Second
Conference on Uncertainty in Artificial Intelligence (UAI 2006), pages
324–331, Arlington, Virginia, 2006. AUAI Press. URL http:
//cocosci.berkeley.edu/tom/papers/structure.pdf.

Daniel M. Roy 100 / 101


J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of
Plausible Inference. Morgan Kaufmann, San Francisco, 1988.
J. Pearl. Graphical models for probabilistic and causal reasoning. In A. B.
Tucker, editor, Computer Science Handbook. CRC Press, 2nd edition,
2004.
M. A. Shwe, B. Middleton, D. E. Heckerman, M. Henrion, E. J. Horvitz, H. P.
Lehmann, and G. F. Cooper. Probabilistic diagnosis using a reformulation
of the INTERNIST-1/QMR knowledge base. Methods of Information in
Medicine, 30:241–255, 1991.
J. B. Tenenbaum, T. L. Griffiths, and C. Kemp. Theory-based Bayesian
models of inductive learning and reasoning. Trends in Cognitive Sciences,
10:309–318, 2006. doi: 10.1016/j.tics.2006.05.009.
M. Toussaint, S. Harmeling, and A. Storkey. Probabilistic inference for solving
(PO)MDPs. Technical Report EDI-INF-RR-0934, University of Edinburgh,
School of Informatics, 2006.
A. M. Turing. The Applications of Probability to Cryptography, c. 1941. UK
National Archives, HW 25/37. 2012.
S. L. Zabell. Commentary on Alan M. Turing: The applications of probability
to cryptography. Cryptologia, 36(3):191–214, 2012. doi:
10.1080/01611194.2012.697811.
Daniel M. Roy 101 / 101
