
Probabilistic Programming:

Algorithms, Implementation and Applications


Outline of the tutorial
• PROB: A simple probabilistic programming language
– Syntax and semantics
• Inference
– Inference is program analysis
– Compiler and runtime algorithms, implementation
• Application
– Debugging ML tasks
• Tools: Synthesizing probabilistic programs
Part I: Probabilistic Programs
PROB
Imperative language (think: a safe version of C) with two added features:
1. The ability to sample from a distribution
2. The ability to condition values of variables through observations

Goal of a PROB program: succinctly specify a probability distribution

Goal of inference: infer the distribution specified by a probabilistic program
Simple probabilistic program

bool c1, c2;
c1 = Bernoulli(0.5);
c2 = Bernoulli(0.5);
return(c1, c2);

  c1     c2     probability
  false  false  1/4
  false  true   1/4
  true   false  1/4
  true   true   1/4
Probabilistic program with conditioning

bool c1, c2;
c1 = Bernoulli(0.5);
c2 = Bernoulli(0.5);
observe(c1 || c2);
return(c1, c2);

  c1     c2     probability
  false  false  0
  false  true   1/3
  true   false  1/3
  true   true   1/3
The same distribution can also be specified without observe, by resampling until the condition holds:

bool c1, c2;
c1 = Bernoulli(0.5);
c2 = Bernoulli(0.5);
while (!(c1 || c2)) {
  c1 = Bernoulli(0.5);
  c2 = Bernoulli(0.5);
}
return(c1, c2);
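The equivalence between observe and the resampling loop above can be checked empirically. The following Python sketch (not part of PROB; names and counts are ours) implements observe as rejection sampling and estimates the conditioned distribution:

```python
import random

def run_conditioned(trials=200_000, seed=0):
    """Estimate the distribution of (c1, c2) under observe(c1 || c2)
    by rejection sampling: rerun the program until the observation holds."""
    rng = random.Random(seed)
    counts = {}
    accepted = 0
    for _ in range(trials):
        c1 = rng.random() < 0.5
        c2 = rng.random() < 0.5
        if not (c1 or c2):
            continue  # rejected: the observation failed
        accepted += 1
        counts[(c1, c2)] = counts.get((c1, c2), 0) + 1
    # normalize over the accepted runs only
    return {k: v / accepted for k, v in counts.items()}

dist = run_conditioned()
```

Each of the three surviving outcomes gets probability close to 1/3, and (false, false) never survives, matching the table above.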
Another probabilistic program with a while loop…

bool b, c;
b = false;
c = Bernoulli(0.5);
while (c) {
  b = !b;
  c = Bernoulli(0.5);
}
return(b);

  b      probability
  false  2/3
  true   1/3

P(b = false) = 1/2 + 1/8 + 1/32 + … = (1/2) / (1 − 1/4) = 2/3
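The geometric series above follows because b is false exactly when the loop body runs an even number of times, and k iterations have probability (1/2)^(k+1). A one-line Python check (ours, not from the slides) of the partial sums:

```python
# b is false exactly when the loop body runs an even number of times:
# P(b = false) = sum over even k of (1/2)^(k+1) = 1/2 + 1/8 + 1/32 + ...
# (a geometric series with ratio 1/4)
p_false = sum(0.5 ** (2 * k + 1) for k in range(60))
```

The truncated sum agrees with the closed form 2/3 to machine precision.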
Halo multiplayer

• How are skills modeled?


• Player A beats Player B
–?
TrueSkill

float skillA, skillB, skillC;
float perfA1, perfB1, perfB2,
      perfC2, perfA3, perfC3;

// skills: sample from a noisy distribution
skillA = Gaussian(100, 10);
skillB = Gaussian(100, 10);
skillC = Gaussian(100, 10);

// first game: A vs B, A won
perfA1 = Gaussian(skillA, 15);   // performance: skill plus noise
perfB1 = Gaussian(skillB, 15);
observe(perfA1 > perfB1);        // if perfA1 > perfB1 then A wins, else B wins

// second game: B vs C, B won
perfB2 = Gaussian(skillB, 15);
perfC2 = Gaussian(skillC, 15);
observe(perfB2 > perfC2);

// third game: A vs C, A won
perfA3 = Gaussian(skillA, 15);
perfC3 = Gaussian(skillC, 15);
observe(perfA3 > perfC3);

return(skillA, skillB, skillC);
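The posterior skills can be approximated by naive rejection sampling, the simplest (and least efficient) inference strategy. A Python sketch of this, with sample counts and helper names of our choosing:

```python
import random

def trueskill_posterior(trials=40_000, seed=1):
    """Rejection-sampling sketch of the TrueSkill program: keep only the
    skill samples whose performance draws are consistent with all three
    observed game results (A beat B, B beat C, A beat C)."""
    rng = random.Random(seed)
    kept = []
    for _ in range(trials):
        skillA = rng.gauss(100, 10)
        skillB = rng.gauss(100, 10)
        skillC = rng.gauss(100, 10)
        # first game: A vs B, A won
        if rng.gauss(skillA, 15) <= rng.gauss(skillB, 15):
            continue
        # second game: B vs C, B won
        if rng.gauss(skillB, 15) <= rng.gauss(skillC, 15):
            continue
        # third game: A vs C, A won
        if rng.gauss(skillA, 15) <= rng.gauss(skillC, 15):
            continue
        kept.append((skillA, skillB, skillC))
    n = len(kept)
    means = tuple(sum(s[i] for s in kept) / n for i in range(3))
    return means, n

(meanA, meanB, meanC), n = trueskill_posterior()
```

The posterior means come out ordered meanA > meanB > meanC, reflecting the game outcomes: conditioning has shifted the skill estimates away from the common prior mean of 100.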


Several applications…
• Population models
• Medical diagnostics
• Hidden Markov Models (e.g., for speech recognition)
• Kalman Filters (e.g., in computer vision)
• Markov Random Fields (e.g., in image processing)
• And more applications:
  – Ecology & biology (carbon modeling, evolutionary genetics, …)
  – Security (quantitative information flow, inference attacks)
Part II: Inference

Compiler and Runtime Algorithms

Inference
• Infer the distribution specified by a probabilistic program:
  – Generate samples to test a machine learning algorithm
  – Calculate the expected value of a function wrt the distribution specified by the program
  – Calculate the mode of the distribution specified by the program

Existing tools (shown as logos on the slide): HBC, Hansei, BLOG, FACTORIE, Infer.NET, BUGS, Church, Alchemy, Figaro, PRISM, Stan

• In this talk…
  – Formal semantics of probabilistic programs
  – Inference is program analysis
PROB: A simple imperative language for probabilistic programs

(The grammar productions were typeset as formulas on the slide; the syntactic categories are:)
• types and declarations
• expressions: variables, constants, binary operations, unary operations
• statements: deterministic assignment, probabilistic assignment, observe, skip, sequential composition, conditional composition, loops
• programs
Semantics
• States: valuations to all variables
• The set of all states
• Probabilistic semantics: starting at a state with a given statement, the program runs by generating samples and terminates in a state with some probability. The samples come from a measure space.
Transition rules (1)

Transition rules (2)

(The transition-rule formulas on these two slides are not recoverable from this extraction.)
Semantics = Expectation

Let e be the return expression of the program. The semantics of the program is the expected value of e under the distribution defined by the transition rules (runs that do not terminate contribute 0, the "otherwise" case).
Inference = Program Analysis

• Program Slicing (as a pre-processing step)


• Using “Pre” transformer to avoid rejections
during sampling
• Data flow analysis and ADDs
Slicing
Can we "slice" a probabilistic program, so that for a program P we can produce a "smaller" program Slice(P) such that:

  P ≈ Slice(P)

Note that the equivalence ≈ above is semantic equality, wrt the semantics defined earlier.
Example 1

Original program:
1: bool d, i, s, l, g;
2: d = Bernoulli(0.6);
3: i = Bernoulli(0.7);
4: if (!i && !d)
5:   g = Bernoulli(0.3);
6: else if (!i && d)
7:   g = Bernoulli(0.05);
8: else if (i && !d)
9:   g = Bernoulli(0.9);
10: else
11:   g = Bernoulli(0.5);
12: if (!i)
13:   s = Bernoulli(0.2);
14: else
15:   s = Bernoulli(0.95);
16: if (!g)
17:   l = Bernoulli(0.1);
18: else
19:   l = Bernoulli(0.4);
20: return s;

After slicing (only the statements the return value depends on):
1: bool i, s;
3: i = Bernoulli(0.7);
12: if (!i)
13:   s = Bernoulli(0.2);
14: else
15:   s = Bernoulli(0.95);
20: return s;

(Dependence graph on the slide: d and i are parents of g; i is a parent of s; g is a parent of l.)
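The claim that the slice preserves the semantics can be verified exactly for this example by enumeration. A Python sketch (function names are ours), comparing P(s = true) under the full program and under the slice:

```python
from itertools import product

def p_s_full():
    """Enumerate all worlds of the full Example 1 program; return P(s = true)."""
    total = 0.0
    for d, i, g, s, l in product([False, True], repeat=5):
        p = (0.6 if d else 0.4) * (0.7 if i else 0.3)
        # g's distribution depends on i and d (lines 4-11)
        if not i and not d:
            pg = 0.3
        elif not i and d:
            pg = 0.05
        elif i and not d:
            pg = 0.9
        else:
            pg = 0.5
        p *= pg if g else 1 - pg
        # s depends only on i (lines 12-15)
        ps = 0.95 if i else 0.2
        p *= ps if s else 1 - ps
        # l depends on g (lines 16-19)
        pl = 0.4 if g else 0.1
        p *= pl if l else 1 - pl
        if s:
            total += p
    return total

def p_s_slice():
    """The sliced program keeps only i and s."""
    return 0.7 * 0.95 + 0.3 * 0.2

```

Both give P(s = true) = 0.725, so the slice is exact for the return value.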
Example 2

Same program, but with an observation on l:
1: bool d, i, s, l, g;
2: d = Bernoulli(0.6);
3: i = Bernoulli(0.7);
4: if (!i && !d)
5:   g = Bernoulli(0.3);
6: else if (!i && d)
7:   g = Bernoulli(0.05);
8: else if (i && !d)
9:   g = Bernoulli(0.9);
10: else
11:   g = Bernoulli(0.5);
12: if (!i)
13:   s = Bernoulli(0.2);
14: else
15:   s = Bernoulli(0.95);
16: if (!g)
17:   l = Bernoulli(0.1);
18: else
19:   l = Bernoulli(0.4);
20: observe(l = true);
21: return s;

The slice from Example 1 (keeping only lines 1, 3, 12-15 and the return) is no longer adequate here: because l is observed, the returned value s now also depends on g, and hence on d and i.
Observe dependence

Consider a node z whose value depends on parents x and y:
• There is no dependence between x and y.
• On the other hand, if z (or some descendant of z) is observed, then x depends on y, and vice versa.

A new notion of influence

• One relation maintains the usual control and data dependences.
• A second relation captures the additional dependences, called "observe dependences".
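This "explaining away" effect can be computed exactly for a tiny instance. A Python sketch (ours) with x, y ~ Bernoulli(0.5) and z = x || y, comparing P(x | y = false) with and without observing z:

```python
from itertools import product

def p_x_given_y_false(observe_z):
    """P(x = true | y = false), optionally conditioned on observing z = (x or y)."""
    num = den = 0.0
    for x, y in product([False, True], repeat=2):
        z = x or y
        p = 0.25  # each (x, y) world is equally likely
        if observe_z and not z:
            continue  # condition on the observation z = true
        if not y:
            den += p
            if x:
                num += p
    return num / den

```

Without the observation, P(x | y = false) = 0.5: x and y are independent. With observe(z), knowing y = false forces x = true, so the probability jumps to 1: the observation has created a dependence between the parents.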

Chung-Kil Hur, Aditya V. Nori, Sriram K. Rajamani, and Selva Samuel.
Slicing Probabilistic Programs. In PLDI '14: Programming Language Design and Implementation, June 2014.
Background: Sampling

Problem: estimate the expectation of a function f wrt the distribution p.

If we can sample x(1), …, x(N) from p, estimate the expectation as the empirical average (1/N) Σ_i f(x(i)).

(Figure from D. J. MacKay, Introduction to Monte Carlo Methods)


Pearl's Burglar alarm example

int alarm() {
  char earthquake = Bernoulli(0.001);
  char burglary = Bernoulli(0.01);
  char alarm = earthquake || burglary;
  char phoneWorking =
    (earthquake) ? Bernoulli(0.6) : Bernoulli(0.99);
  char maryWakes;
  if (alarm && earthquake)
    maryWakes = Bernoulli(0.8);
  else if (alarm)
    maryWakes = Bernoulli(0.6);
  else
    maryWakes = Bernoulli(0.2);
  char called = maryWakes && phoneWorking;
  observe(called);
  return burglary;
}

"called" is a low-probability event, and causes a large number of rejections during sampling.
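To see the rejection problem concretely, the alarm program can be run as a naive rejection sampler. A Python sketch (ours; sample counts and names are our choices, not from the slides):

```python
import random

def p_burglary_given_called(trials=300_000, seed=2):
    """Rejection-sampling estimate of P(burglary | called): rerun the
    program and keep only runs where the observation 'called' holds.
    Most runs are rejected, which is the inefficiency Pre() attacks."""
    rng = random.Random(seed)
    hits = accepted = 0
    for _ in range(trials):
        earthquake = rng.random() < 0.001
        burglary = rng.random() < 0.01
        alarm = earthquake or burglary
        phoneWorking = rng.random() < (0.6 if earthquake else 0.99)
        if alarm and earthquake:
            maryWakes = rng.random() < 0.8
        elif alarm:
            maryWakes = rng.random() < 0.6
        else:
            maryWakes = rng.random() < 0.2
        called = maryWakes and phoneWorking
        if not called:
            continue  # rejected run
        accepted += 1
        hits += burglary
    return hits / accepted, accepted / trials

posterior, accept_rate = p_burglary_given_called()
```

The estimate of P(burglary | called) lands near 0.03, and the majority of runs are thrown away, illustrating why propagating observations earlier (the Pre transformation below) pays off.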
Pre transformation

• Let P be any program.
• Let Pre(P) denote the program obtained by propagating observe statements immediately after sample statements.

Theorem: P = Pre(P)

int alarm() {
  bool earthquake, burglary, alarm, phoneWorking, maryWakes, called;
  earthquake = Bernoulli(0.001);
  burglary = Bernoulli(0.01);
  alarm = earthquake || burglary;
  if (earthquake) {
    phoneWorking = Bernoulli(0.6);
    observe(phoneWorking);
  }
  else {
    phoneWorking = Bernoulli(0.99);
    observe(phoneWorking);
  }
  if (alarm && earthquake) {
    maryWakes = Bernoulli(0.8);
    observe(maryWakes && phoneWorking);
  }
  else if (alarm) {
    maryWakes = Bernoulli(0.6);
    observe(maryWakes && phoneWorking);
  }
  else {
    maryWakes = Bernoulli(0.2);
    observe(maryWakes && phoneWorking);
  }
  called = maryWakes && phoneWorking;
  return burglary;
}
Background: MH sampling

Metropolis-Hastings with proposal distribution Q(x'; x):
1. Draw a candidate x' from the proposal Q(x'; x)
2. Compute the acceptance ratio a = (p(x') Q(x; x')) / (p(x) Q(x'; x))
3. If a ≥ 1, accept; else accept with probability a

(Figure from D. J. MacKay, Introduction to Monte Carlo Methods)
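The three steps above can be sketched in a few lines of Python (ours; a generic sampler for a one-dimensional target, not the R2 implementation). With a symmetric Gaussian proposal, the Q terms cancel and the ratio reduces to p(x') / p(x):

```python
import math
import random

def metropolis_hastings(log_p, x0, steps, scale, seed=3):
    """Generic Metropolis-Hastings with a symmetric Gaussian proposal.
    log_p is the log density of the target, up to an additive constant."""
    rng = random.Random(seed)
    x, samples = x0, []
    for _ in range(steps):
        xp = x + rng.gauss(0, scale)          # propose x' ~ Q(x'; x)
        a = math.exp(min(0.0, log_p(xp) - log_p(x)))  # min(1, p(x')/p(x))
        if rng.random() < a:                   # accept with probability a
            x = xp
        samples.append(x)
    return samples

# target: standard normal, log density -x^2/2 (constant dropped)
samples = metropolis_hastings(lambda t: -0.5 * t * t, 0.0, 60_000, 1.0)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

The empirical mean and variance of the chain converge to the target's 0 and 1, which is a quick sanity check of the acceptance rule.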


MH without rejections

During each run, for each sample statement:
• Sample from a proposal sub-distribution conditioned by the observe statements (as propagated by the Pre transformation)
• Calculate the acceptance ratio; if it is at least 1, accept, else accept with that probability

Aditya V. Nori, Chung-Kil Hur, Sriram K. Rajamani, and Selva Samuel.
R2: An Efficient MCMC Sampler for Probabilistic Programs. In AAAI '14: AAAI Conference on Artificial Intelligence, July 2014.
Bayesian inference using data flow analysis:
Boolean Probabilistic Programs

bool c1, c2;
c1 = Bernoulli(0.5);
c2 = Bernoulli(0.5);
observe(c1 || c2);
return (c1, c2);

After c1 = Bernoulli(0.5):     After c2 = Bernoulli(0.5):
  c1  prob                       c1  c2  prob
  t   1/2                        t   t   1/4
  f   1/2                        t   f   1/4
                                 f   t   1/4
                                 f   f   1/4

After observe(c1 || c2), drop the violating row and normalize:
  c1  c2  prob
  t   t   1/3
  t   f   1/3
  f   t   1/3
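The analysis above is just a pair of transfer functions over explicit state distributions. A Python sketch (our names; states are variable valuations, masses are probabilities):

```python
def bernoulli_assign(dist, var, p):
    """Transfer function for `var = Bernoulli(p)`: split every state
    into two successors, weighting by p and 1 - p."""
    out = {}
    for state, mass in dist.items():
        for val, q in ((True, p), (False, 1 - p)):
            s = dict(state)
            s[var] = val
            key = frozenset(s.items())
            out[key] = out.get(key, 0.0) + mass * q
    return out

def observe(dist, pred):
    """Transfer function for observe: drop states violating the
    predicate, then renormalize the surviving mass."""
    kept = {s: m for s, m in dist.items() if pred(dict(s))}
    z = sum(kept.values())
    return {s: m / z for s, m in kept.items()}

dist = {frozenset(): 1.0}                      # initial: empty state, mass 1
dist = bernoulli_assign(dist, "c1", 0.5)
dist = bernoulli_assign(dist, "c2", 0.5)
dist = observe(dist, lambda s: s["c1"] or s["c2"])
```

Running the two samples and the observe reproduces the table exactly: three states, each with mass 1/3.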
Data flow analysis with ADDs

bool c1, c2;
c1 = Bernoulli(0.5);
c2 = Bernoulli(0.5);
observe(c1 || c2);
return (c1, c2);

(The slide shows the same analysis with the state distributions represented as Algebraic Decision Diagrams: after the two samples, an ADD mapping every (c1, c2) valuation to 1/4; after the observe, an ADD mapping (f, f) to 0 and the other valuations to 1/4; after normalization, those valuations to 1/3.)

Guillaume Claret, Sriram K. Rajamani, Aditya V. Nori, Andrew D. Gordon, and Johannes Borgström.
Bayesian Inference Using Data Flow Analysis. In ESEC-FSE '13: Foundations of Software Engineering, August 2013.
Part III: Applications

Debugging Classification Tasks

Background: Classification

(Diagram: in the training phase, a Training Set feeds an ML Training Algorithm, which produces a Classifier; in the evaluation phase, the Classifier maps a Test Set to Class Labels.)

o Training set: labeled examples
o Test set: unlabeled examples
o The classifier is produced by an ML classification training algorithm
Debugging Classification Tasks

Suppose we get an undesirable result using the trained classifier on a test point. This can be due to:
• Bugs in the implementation of the training algorithm
• Incorrect choice of features or training data
• Bugs in the training data
• …
Example

The original classifier is learnt with noisy training data. We would like to fix the errors in the training data and learn a new classifier.

Approach (1): Program Slicing

Trace dependences backward and find out which training points affect the result on the test point:
• The result on the test point depends on the learned classifier
• The learned classifier depends on all of the training set, so slicing does not narrow down the candidates
Approach (2): Experiment!

Pick training points at random and flip them. But:
• We need to experiment with subsets of training points, and the number of subsets is large
• For each subset we need to retrain the classifier, and retraining is expensive!

Two ideas:
• Come up with a ranking function to rank training points: we use Pearl's Causality Theory (the PS score)
• We assume that the training algorithm is "robust". For training algorithms based on gradient descent, we propose a fast approximation to retraining.
Pearl's Counterfactual Causality

Out of 101 people, suppose 55 people vote for A and 46 people vote for B, and A wins. Who caused A to win?

An individual i's vote is a cause if there exists an alternate world (called a counterfactual) where A wins because of i's vote alone.
Causal model

A causal model consists of:
• A set of random input variables Y_1, …, Y_n
• An output variable X
• A structural equation X = f(Y_1, …, Y_n)

An assignment w to the input variables is a world.

Probability of Sufficiency

PS(Y_i) = Σ p(w), summed over worlds w such that
f(w[Y_i ← y_i]) = x  ∧  X ≠ x  ∧  Y_i ≠ y_i

Probabilistic program to compute PS: Take 1, a simple version; Take 2, an optimized version.
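The PS definition can be evaluated directly by enumerating worlds when the model is small. A Python sketch (ours), instantiated for a 3-voter majority with x = "A wins" and y_i = "votes for A":

```python
from itertools import product

def ps_score(f, n, p, i):
    """PS(Y_i): total probability of worlds w where the outcome is not x,
    Y_i is not y_i, but forcing Y_i := y_i flips the outcome to x.
    Inputs are booleans, each True independently with probability p."""
    ps = 0.0
    for w in product([False, True], repeat=n):
        if f(w) or w[i]:
            continue  # need X != x and Y_i != y_i in w
        forced = list(w)
        forced[i] = True  # intervene: set Y_i to y_i
        if f(tuple(forced)):
            pw = 1.0
            for v in w:  # p(w) under independent Bernoulli(p) inputs
                pw *= p if v else 1 - p
            ps += pw
    return ps

majority = lambda w: sum(w) > len(w) // 2
score = ps_score(majority, 3, 0.5, 0)
```

For voter 0 this counts the worlds where voter 0 votes B, A loses, and A would have won had voter 0 voted A: exactly the two worlds where one of the other voters supports A, giving PS = 2/8 = 0.25.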
Background on training

Training learns a function; the classifier is a wrapper around it, returning one class when the function's value is positive and the other class when it is negative.

Given training data, the goal of training is to search for the parameters which minimize the risk R.
Gradient descent

Given training data, the goal of training is to search for the parameters which minimize the risk R:
1. Choose an initial value for the parameters
2. Iterate: take a step against the gradient of R
   until the change falls below a tolerance

Retraining when the data changes slightly

Specifically, assume the new training set differs from the old one in only the label of one training point. Then, instead of retraining from scratch:
1. Choose the initial value of the parameters from the training output on the old data
2. Iterate as before until convergence
Call the result the retrained classifier.
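The warm-start idea above can be demonstrated on a toy least-squares problem. A Python sketch (ours; the quadratic risk and data are illustrative, not from the slides):

```python
def grad_descent(xs, ys, theta0, alpha=0.01, tol=1e-8):
    """Minimize R(theta) = sum_i (theta * x_i - y_i)^2 by gradient descent,
    counting iterations until the gradient magnitude drops below tol."""
    theta, iters = theta0, 0
    while True:
        g = sum(2 * x * (theta * x - y) for x, y in zip(xs, ys))
        if abs(g) < tol:
            return theta, iters
        theta -= alpha * g
        iters += 1

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
theta_old, _ = grad_descent(xs, ys, 0.0)      # original training (optimum: 2)

# flip one label and retrain, warm-starting from the previous optimum
ys2 = [2.0, 5.0, 6.0]
theta_new, warm_iters = grad_descent(xs, ys2, theta_old)
_, cold_iters = grad_descent(xs, ys2, 0.0)    # retraining from scratch
```

Because one label change moves the optimum only slightly (from 2 to 30/14), the warm start reaches convergence in fewer iterations than the cold start, which is exactly the saving the retraining scheme exploits.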
Approximating retraining when the new data is close to the old

Assume that we have calculated, during the original training, the relevant gradient information (a matrix) for all training points. Recall that the new training set differs from the old one in only the label of one training point; using the stored matrix, we can calculate the retrained parameters quickly, without rerunning full training.
PSI: Results

Benchmarks include logistic regression and decision trees.

Aleks Chakarov, Aditya Nori, Sriram K. Rajamani, Selva Samuel, Shayak Sen, and Deepak Vijaykeerthy.
Towards Debugging Machine Learning Tasks. Under review.
Part IV: Tools

Synthesizing Probabilistic Programs


Motivation
• Probabilistic programs are not easy to write.
• Can we have the programmer specify only "sketches" and "synthesize" the program automatically?
• Challenges
  – The number of possible "completions" of a sketch is unbounded
  – For each candidate "completion", computation of the likelihood score is expensive
Synthesis = Search for completions…
We consider a program with a hole, the program with the hole completed, and the likelihood of the completion given the data. By Bayes' rule, the posterior over completions is proportional to the prior times the likelihood, and the likelihood estimation is expensive to compute.
Example
The likelihood is the probability that the program generates the data, assuming each sampled variable has the stated pdf.
Main idea
We can estimate likelihoods symbolically using a Mixture of Gaussians ("MoG") approximation. A MoG is a weighted sum of Gaussian densities.
[V. Maz’ya and G. Schmidt – 1996]
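As a minimal illustration of the MoG representation (ours; the mixture below is an arbitrary example, not from the synthesis paper), a mixture density is just a weighted sum of Gaussian pdfs, and moments like the mean combine linearly:

```python
import math

def mog_pdf(x, components):
    """Density of a mixture of Gaussians: sum of w_k * N(x; mu_k, sigma_k).
    `components` is a list of (weight, mean, stddev) triples."""
    return sum(
        w * math.exp(-0.5 * ((x - mu) / s) ** 2) / (s * math.sqrt(2 * math.pi))
        for w, mu, s in components
    )

mix = [(0.3, -1.0, 0.5), (0.7, 2.0, 1.0)]

# the mixture mean is the weighted sum of component means
mean = sum(w * mu for w, mu, _ in mix)

# numeric check that the density integrates to the total weight (1.0)
step = 0.001
mass = sum(mog_pdf(-10 + step * k, mix) * step for k in range(20_000))
```

Closure properties like these (sums, products approximated back into MoG form) are what let the compiler propagate distributions through a program symbolically.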


Symbolic computations on MoG (1)
Some operations on MoG representations are exact; others are approximate.

Symbolic computations on MoG (2)

Compiler to symbolically compute likelihoods

The compiler takes a statement, an environment, and a set of constraints, and generates a new environment together with the updated set of constraints collected from the observe statements in the statement.
Synthesis Algorithm
MCMC speedup due to MoG
approximation

Aditya V. Nori, Sherjil Ozair, Sriram K. Rajamani, and Deepak Vijaykeerthy.


Efficient Synthesis of Probabilistic Programs. In PLDI '15: Programming Languages
Design and Implementation, June 2015
Summary
• Probabilistic programs
– Succinct ways of specifying probabilistic models
– Diverse applications (machine learning, biology, security, ecology,
graphics, vision,…)
• Probabilistic inference is program analysis
– New application for our techniques!
– Great opportunity for building a programming environment for ML
(with a compiler and runtime) for use by experts with domain
knowledge (but who lack ML expertise)
• Several directions for future work

http://research.microsoft.com/en-us/projects/r2/
Questions?
Motivation
• Programs are increasingly “data driven”
instead of “algorithm driven”
– Use ML to build models from data, and then use
models to make decisions
– Many domains: search engines, social networks,
speech recognition, computer vision, medical
diagnostic aids, etc.
• Our goal: programming abstractions, compiler,
runtime, and tools for such programs
