
Probabilistic Programming:

Algorithms, Implementation and Applications


Outline of the tutorial
• PROB: A simple probabilistic programming language
– Syntax and semantics
• Inference
– Inference is program analysis
– Compiler and runtime algorithms, implementation
• Application
– Debugging ML tasks
• Tools: Synthesizing probabilistic programs
Part I: Probabilistic Programs
PROB
Imperative language (think: a safe version of C) with two added features:
1. The ability to sample from a distribution
2. The ability to condition values of variables through observations

Goal of a PROB program: succinctly specify a probability distribution

Goal of inference: infer the distribution specified by a probabilistic program
Simple probabilistic program

bool c1, c2;
c1 = Bernoulli(0.5);
c2 = Bernoulli(0.5);
return(c1, c2);

  c1     c2     probability
  false  false  1/4
  false  true   1/4
  true   false  1/4
  true   true   1/4
Probabilistic program with conditioning

bool c1, c2;
c1 = Bernoulli(0.5);
c2 = Bernoulli(0.5);
observe(c1 || c2);
return(c1, c2);

  c1     c2     probability
  false  false  0
  false  true   1/3
  true   false  1/3
  true   true   1/3
The same distribution can also be specified without observe, by resampling until the condition holds:

bool c1, c2;
c1 = Bernoulli(0.5);
c2 = Bernoulli(0.5);
while (!(c1 || c2)) {
  c1 = Bernoulli(0.5);
  c2 = Bernoulli(0.5);
}
return(c1, c2);
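The equivalence between observe and the resampling loop above can be checked empirically. The following Python sketch (not part of PROB; names and counts are ours) implements observe as rejection sampling and estimates the conditioned distribution:

```python
import random

def run_conditioned(trials=200_000, seed=0):
    """Estimate the distribution of (c1, c2) under observe(c1 || c2)
    by rejection sampling: rerun the program until the observation holds."""
    rng = random.Random(seed)
    counts = {}
    accepted = 0
    for _ in range(trials):
        c1 = rng.random() < 0.5
        c2 = rng.random() < 0.5
        if not (c1 or c2):
            continue  # rejected: the observation failed
        accepted += 1
        counts[(c1, c2)] = counts.get((c1, c2), 0) + 1
    # normalize over the accepted runs only
    return {k: v / accepted for k, v in counts.items()}

dist = run_conditioned()
```

Each of the three surviving outcomes gets probability close to 1/3, and (false, false) never survives, matching the table above.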
Another probabilistic program with a while loop…

bool b, c;
b = false;
c = Bernoulli(0.5);
while (c) {
  b = !b;
  c = Bernoulli(0.5);
}
return(b);

  b      probability
  false  2/3
  true   1/3

P(b = false) = 1/2 + 1/8 + 1/32 + … = (1/2) / (1 − 1/4) = 2/3
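The geometric series above follows because b is false exactly when the loop body runs an even number of times, and k iterations have probability (1/2)^(k+1). A one-line Python check (ours, not from the slides) of the partial sums:

```python
# b is false exactly when the loop body runs an even number of times:
# P(b = false) = sum over even k of (1/2)^(k+1) = 1/2 + 1/8 + 1/32 + ...
# (a geometric series with ratio 1/4)
p_false = sum(0.5 ** (2 * k + 1) for k in range(60))
```

The truncated sum agrees with the closed form 2/3 to machine precision.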
Halo multiplayer

• How are skills modeled?


• Player A beats Player B
–?
TrueSkill

float skillA, skillB, skillC;
float perfA1, perfB1, perfB2,
      perfC2, perfA3, perfC3;

// skills: sample from a noisy distribution
skillA = Gaussian(100, 10);
skillB = Gaussian(100, 10);
skillC = Gaussian(100, 10);

// first game: A vs B, A won
perfA1 = Gaussian(skillA, 15);   // performance: skill plus noise
perfB1 = Gaussian(skillB, 15);
observe(perfA1 > perfB1);        // if perfA1 > perfB1 then A wins, else B wins

// second game: B vs C, B won
perfB2 = Gaussian(skillB, 15);
perfC2 = Gaussian(skillC, 15);
observe(perfB2 > perfC2);

// third game: A vs C, A won
perfA3 = Gaussian(skillA, 15);
perfC3 = Gaussian(skillC, 15);
observe(perfA3 > perfC3);

return(skillA, skillB, skillC);
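The posterior skills can be approximated by naive rejection sampling, the simplest (and least efficient) inference strategy. A Python sketch of this, with sample counts and helper names of our choosing:

```python
import random

def trueskill_posterior(trials=40_000, seed=1):
    """Rejection-sampling sketch of the TrueSkill program: keep only the
    skill samples whose performance draws are consistent with all three
    observed game results (A beat B, B beat C, A beat C)."""
    rng = random.Random(seed)
    kept = []
    for _ in range(trials):
        skillA = rng.gauss(100, 10)
        skillB = rng.gauss(100, 10)
        skillC = rng.gauss(100, 10)
        # first game: A vs B, A won
        if rng.gauss(skillA, 15) <= rng.gauss(skillB, 15):
            continue
        # second game: B vs C, B won
        if rng.gauss(skillB, 15) <= rng.gauss(skillC, 15):
            continue
        # third game: A vs C, A won
        if rng.gauss(skillA, 15) <= rng.gauss(skillC, 15):
            continue
        kept.append((skillA, skillB, skillC))
    n = len(kept)
    means = tuple(sum(s[i] for s in kept) / n for i in range(3))
    return means, n

(meanA, meanB, meanC), n = trueskill_posterior()
```

The posterior means come out ordered meanA > meanB > meanC, reflecting the game outcomes: conditioning has shifted the skill estimates away from the common prior mean of 100.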


Several applications…
• Population models
• Medical diagnostics
• Hidden Markov Models (e.g., for speech recognition)
• Kalman Filters (e.g., in computer vision)
• Markov Random Fields (e.g., in image processing)
• And more applications:
  – Ecology & biology (carbon modeling, evolutionary genetics, …)
  – Security (quantitative information flow, inference attacks)
Part II: Inference

Compiler and Runtime Algorithms

Inference
• Infer the distribution specified by a probabilistic program:
  – Generate samples to test a machine learning algorithm
  – Calculate the expected value of a function wrt the distribution specified by the program
  – Calculate the mode of the distribution specified by the program

Existing tools (shown as logos on the slide): HBC, Hansei, BLOG, FACTORIE, Infer.NET, BUGS, Church, Alchemy, Figaro, PRISM, Stan

• In this talk…
  – Formal semantics of probabilistic programs
  – Inference is program analysis
PROB: A simple imperative language for probabilistic programs

(The grammar productions were typeset as formulas on the slide; the syntactic categories are:)
• types and declarations
• expressions: variables, constants, binary operations, unary operations
• statements: deterministic assignment, probabilistic assignment, observe, skip, sequential composition, conditional composition, loops
• programs
Semantics
• States: valuations to all variables
• The set of all states
• Probabilistic semantics: starting at a state with a given statement, the program runs by generating samples and terminates in a state with some probability. The samples come from a measure space.
Transition rules (1)

Transition rules (2)

(The transition-rule formulas on these two slides are not recoverable from this extraction.)
Semantics = Expectation

Let e be the return expression of the program. The semantics of the program is the expected value of e under the distribution defined by the transition rules (runs that do not terminate contribute 0, the "otherwise" case).
Inference = Program Analysis

• Program Slicing (as a pre-processing step)


• Using “Pre” transformer to avoid rejections
during sampling
• Data flow analysis and ADDs
Slicing
Can we "slice" a probabilistic program, so that for a program P we can produce a "smaller" program Slice(P) such that:

  P ≈ Slice(P)

Note that the equivalence ≈ above is semantic equality, wrt the semantics defined earlier.
Example 1

Original program:
1: bool d, i, s, l, g;
2: d = Bernoulli(0.6);
3: i = Bernoulli(0.7);
4: if (!i && !d)
5:   g = Bernoulli(0.3);
6: else if (!i && d)
7:   g = Bernoulli(0.05);
8: else if (i && !d)
9:   g = Bernoulli(0.9);
10: else
11:   g = Bernoulli(0.5);
12: if (!i)
13:   s = Bernoulli(0.2);
14: else
15:   s = Bernoulli(0.95);
16: if (!g)
17:   l = Bernoulli(0.1);
18: else
19:   l = Bernoulli(0.4);
20: return s;

After slicing (only the statements the return value depends on):
1: bool i, s;
3: i = Bernoulli(0.7);
12: if (!i)
13:   s = Bernoulli(0.2);
14: else
15:   s = Bernoulli(0.95);
20: return s;

(Dependence graph on the slide: d and i are parents of g; i is a parent of s; g is a parent of l.)
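The claim that the slice preserves the semantics can be verified exactly for this example by enumeration. A Python sketch (function names are ours), comparing P(s = true) under the full program and under the slice:

```python
from itertools import product

def p_s_full():
    """Enumerate all worlds of the full Example 1 program; return P(s = true)."""
    total = 0.0
    for d, i, g, s, l in product([False, True], repeat=5):
        p = (0.6 if d else 0.4) * (0.7 if i else 0.3)
        # g's distribution depends on i and d (lines 4-11)
        if not i and not d:
            pg = 0.3
        elif not i and d:
            pg = 0.05
        elif i and not d:
            pg = 0.9
        else:
            pg = 0.5
        p *= pg if g else 1 - pg
        # s depends only on i (lines 12-15)
        ps = 0.95 if i else 0.2
        p *= ps if s else 1 - ps
        # l depends on g (lines 16-19)
        pl = 0.4 if g else 0.1
        p *= pl if l else 1 - pl
        if s:
            total += p
    return total

def p_s_slice():
    """The sliced program keeps only i and s."""
    return 0.7 * 0.95 + 0.3 * 0.2

```

Both give P(s = true) = 0.725, so the slice is exact for the return value.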
Example 2

Same program, but with an observation on l:
1: bool d, i, s, l, g;
2: d = Bernoulli(0.6);
3: i = Bernoulli(0.7);
4: if (!i && !d)
5:   g = Bernoulli(0.3);
6: else if (!i && d)
7:   g = Bernoulli(0.05);
8: else if (i && !d)
9:   g = Bernoulli(0.9);
10: else
11:   g = Bernoulli(0.5);
12: if (!i)
13:   s = Bernoulli(0.2);
14: else
15:   s = Bernoulli(0.95);
16: if (!g)
17:   l = Bernoulli(0.1);
18: else
19:   l = Bernoulli(0.4);
20: observe(l = true);
21: return s;

The slice from Example 1 (keeping only lines 1, 3, 12-15 and the return) is no longer adequate here: because l is observed, the returned value s now also depends on g, and hence on d and i.
Observe dependence

Consider a node z whose value depends on parents x and y:
• There is no dependence between x and y.
• On the other hand, if z (or some descendant of z) is observed, then x depends on y, and vice versa.

A new notion of influence

• One relation maintains the usual control and data dependences.
• A second relation captures the additional dependences, called "observe dependences".
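This "explaining away" effect can be computed exactly for a tiny instance. A Python sketch (ours) with x, y ~ Bernoulli(0.5) and z = x || y, comparing P(x | y = false) with and without observing z:

```python
from itertools import product

def p_x_given_y_false(observe_z):
    """P(x = true | y = false), optionally conditioned on observing z = (x or y)."""
    num = den = 0.0
    for x, y in product([False, True], repeat=2):
        z = x or y
        p = 0.25  # each (x, y) world is equally likely
        if observe_z and not z:
            continue  # condition on the observation z = true
        if not y:
            den += p
            if x:
                num += p
    return num / den

```

Without the observation, P(x | y = false) = 0.5: x and y are independent. With observe(z), knowing y = false forces x = true, so the probability jumps to 1: the observation has created a dependence between the parents.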

Chung-Kil Hur, Aditya V. Nori, Sriram K. Rajamani, and Selva Samuel.
Slicing Probabilistic Programs. In PLDI '14: Programming Language Design and Implementation, June 2014.
Background: Sampling

Problem: estimate the expectation of a function f wrt the distribution p.

If we can sample x(1), …, x(N) from p, estimate the expectation as the empirical average (1/N) Σ_i f(x(i)).

(Figure from D. J. MacKay, Introduction to Monte Carlo Methods)


Pearl's Burglar alarm example

int alarm() {
  char earthquake = Bernoulli(0.001);
  char burglary = Bernoulli(0.01);
  char alarm = earthquake || burglary;
  char phoneWorking =
    (earthquake) ? Bernoulli(0.6) : Bernoulli(0.99);
  char maryWakes;
  if (alarm && earthquake)
    maryWakes = Bernoulli(0.8);
  else if (alarm)
    maryWakes = Bernoulli(0.6);
  else
    maryWakes = Bernoulli(0.2);
  char called = maryWakes && phoneWorking;
  observe(called);
  return burglary;
}

"called" is a low-probability event, and causes a large number of rejections during sampling.
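To see the rejection problem concretely, the alarm program can be run as a naive rejection sampler. A Python sketch (ours; sample counts and names are our choices, not from the slides):

```python
import random

def p_burglary_given_called(trials=300_000, seed=2):
    """Rejection-sampling estimate of P(burglary | called): rerun the
    program and keep only runs where the observation 'called' holds.
    Most runs are rejected, which is the inefficiency Pre() attacks."""
    rng = random.Random(seed)
    hits = accepted = 0
    for _ in range(trials):
        earthquake = rng.random() < 0.001
        burglary = rng.random() < 0.01
        alarm = earthquake or burglary
        phoneWorking = rng.random() < (0.6 if earthquake else 0.99)
        if alarm and earthquake:
            maryWakes = rng.random() < 0.8
        elif alarm:
            maryWakes = rng.random() < 0.6
        else:
            maryWakes = rng.random() < 0.2
        called = maryWakes and phoneWorking
        if not called:
            continue  # rejected run
        accepted += 1
        hits += burglary
    return hits / accepted, accepted / trials

posterior, accept_rate = p_burglary_given_called()
```

The estimate of P(burglary | called) lands near 0.03, and the majority of runs are thrown away, illustrating why propagating observations earlier (the Pre transformation below) pays off.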
Pre transformation

• Let P be any program.
• Let Pre(P) denote the program obtained by propagating observe statements immediately after sample statements.

Theorem: P = Pre(P)

int alarm() {
  bool earthquake, burglary, alarm, phoneWorking, maryWakes, called;
  earthquake = Bernoulli(0.001);
  burglary = Bernoulli(0.01);
  alarm = earthquake || burglary;
  if (earthquake) {
    phoneWorking = Bernoulli(0.6);
    observe(phoneWorking);
  }
  else {
    phoneWorking = Bernoulli(0.99);
    observe(phoneWorking);
  }
  if (alarm && earthquake) {
    maryWakes = Bernoulli(0.8);
    observe(maryWakes && phoneWorking);
  }
  else if (alarm) {
    maryWakes = Bernoulli(0.6);
    observe(maryWakes && phoneWorking);
  }
  else {
    maryWakes = Bernoulli(0.2);
    observe(maryWakes && phoneWorking);
  }
  called = maryWakes && phoneWorking;
  return burglary;
}
Background: MH sampling

Metropolis-Hastings with proposal distribution Q(x'; x):
1. Draw a candidate x' from the proposal Q(x'; x)
2. Compute the acceptance ratio a = (p(x') Q(x; x')) / (p(x) Q(x'; x))
3. If a ≥ 1, accept; else accept with probability a

(Figure from D. J. MacKay, Introduction to Monte Carlo Methods)
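The three steps above can be sketched in a few lines of Python (ours; a generic sampler for a one-dimensional target, not the R2 implementation). With a symmetric Gaussian proposal, the Q terms cancel and the ratio reduces to p(x') / p(x):

```python
import math
import random

def metropolis_hastings(log_p, x0, steps, scale, seed=3):
    """Generic Metropolis-Hastings with a symmetric Gaussian proposal.
    log_p is the log density of the target, up to an additive constant."""
    rng = random.Random(seed)
    x, samples = x0, []
    for _ in range(steps):
        xp = x + rng.gauss(0, scale)          # propose x' ~ Q(x'; x)
        a = math.exp(min(0.0, log_p(xp) - log_p(x)))  # min(1, p(x')/p(x))
        if rng.random() < a:                   # accept with probability a
            x = xp
        samples.append(x)
    return samples

# target: standard normal, log density -x^2/2 (constant dropped)
samples = metropolis_hastings(lambda t: -0.5 * t * t, 0.0, 60_000, 1.0)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

The empirical mean and variance of the chain converge to the target's 0 and 1, which is a quick sanity check of the acceptance rule.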


MH without rejections

During each run, for each sample statement:
• Sample from a proposal sub-distribution conditioned by the observe statements (as propagated by the Pre transformation)
• Calculate the acceptance ratio; if it is at least 1, accept, else accept with that probability

Aditya V. Nori, Chung-Kil Hur, Sriram K. Rajamani, and Selva Samuel.
R2: An Efficient MCMC Sampler for Probabilistic Programs. In AAAI '14: AAAI Conference on Artificial Intelligence, July 2014.
Bayesian inference using data flow analysis:
Boolean Probabilistic Programs

bool c1, c2;
c1 = Bernoulli(0.5);
c2 = Bernoulli(0.5);
observe(c1 || c2);
return (c1, c2);

After c1 = Bernoulli(0.5):     After c2 = Bernoulli(0.5):
  c1  prob                       c1  c2  prob
  t   1/2                        t   t   1/4
  f   1/2                        t   f   1/4
                                 f   t   1/4
                                 f   f   1/4

After observe(c1 || c2), drop the violating row and normalize:
  c1  c2  prob
  t   t   1/3
  t   f   1/3
  f   t   1/3
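The analysis above is just a pair of transfer functions over explicit state distributions. A Python sketch (our names; states are variable valuations, masses are probabilities):

```python
def bernoulli_assign(dist, var, p):
    """Transfer function for `var = Bernoulli(p)`: split every state
    into two successors, weighting by p and 1 - p."""
    out = {}
    for state, mass in dist.items():
        for val, q in ((True, p), (False, 1 - p)):
            s = dict(state)
            s[var] = val
            key = frozenset(s.items())
            out[key] = out.get(key, 0.0) + mass * q
    return out

def observe(dist, pred):
    """Transfer function for observe: drop states violating the
    predicate, then renormalize the surviving mass."""
    kept = {s: m for s, m in dist.items() if pred(dict(s))}
    z = sum(kept.values())
    return {s: m / z for s, m in kept.items()}

dist = {frozenset(): 1.0}                      # initial: empty state, mass 1
dist = bernoulli_assign(dist, "c1", 0.5)
dist = bernoulli_assign(dist, "c2", 0.5)
dist = observe(dist, lambda s: s["c1"] or s["c2"])
```

Running the two samples and the observe reproduces the table exactly: three states, each with mass 1/3.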
Data flow analysis with ADDs

bool c1, c2;
c1 = Bernoulli(0.5);
c2 = Bernoulli(0.5);
observe(c1 || c2);
return (c1, c2);

(The slide shows the same analysis with the state distributions represented as Algebraic Decision Diagrams: after the two samples, an ADD mapping every (c1, c2) valuation to 1/4; after the observe, an ADD mapping (f, f) to 0 and the other valuations to 1/4; after normalization, those valuations to 1/3.)

Guillaume Claret, Sriram K. Rajamani, Aditya V. Nori, Andrew D. Gordon, and Johannes Borgström.
Bayesian Inference Using Data Flow Analysis. In ESEC-FSE '13: Foundations of Software Engineering, August 2013.
Part III: Applications

Debugging Classification Tasks

Background: Classification

(Diagram: in the training phase, a Training Set feeds an ML Training Algorithm, which produces a Classifier; in the evaluation phase, the Classifier maps a Test Set to Class Labels.)

o Training set: labeled examples
o Test set: unlabeled examples
o The classifier is produced by an ML classification training algorithm
Debugging Classification Tasks

Suppose we get an undesirable result using the trained classifier on a test point. This can be due to:
• Bugs in the implementation of the training algorithm
• Incorrect choice of features or training data
• Bugs in the training data
• …
Example

The original classifier is learnt with noisy training data. We would like to fix the errors in the training data and learn a new classifier.

Approach (1): Program Slicing

Trace dependences backward and find out which training points affect the result on the test point:
• The result on the test point depends on the learned classifier
• The learned classifier depends on all of the training set, so slicing does not narrow down the candidates
Approach (2): Experiment!

Pick training points at random and flip them. But:
• We need to experiment with subsets of training points, and the number of subsets is large
• For each subset we need to retrain the classifier, and retraining is expensive!

Two ideas:
• Come up with a ranking function to rank training points: we use Pearl's Causality Theory (the PS score)
• We assume that the training algorithm is "robust". For training algorithms based on gradient descent, we propose a fast approximation to retraining.
Pearl's Counterfactual Causality

Out of 101 people, suppose 55 people vote for A and 46 people vote for B, and A wins. Who caused A to win?

An individual i's vote is a cause if there exists an alternate world (called a counterfactual) where A wins because of i's vote alone.
Causal model

A causal model consists of:
• A set of random input variables Y_1, …, Y_n
• An output variable X
• A structural equation X = f(Y_1, …, Y_n)

An assignment w to the input variables is a world.

Probability of Sufficiency

PS(Y_i) = Σ p(w), summed over worlds w such that
f(w[Y_i ← y_i]) = x  ∧  X ≠ x  ∧  Y_i ≠ y_i

Probabilistic program to compute PS: Take 1, a simple version; Take 2, an optimized version.
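The PS definition can be evaluated directly by enumerating worlds when the model is small. A Python sketch (ours), instantiated for a 3-voter majority with x = "A wins" and y_i = "votes for A":

```python
from itertools import product

def ps_score(f, n, p, i):
    """PS(Y_i): total probability of worlds w where the outcome is not x,
    Y_i is not y_i, but forcing Y_i := y_i flips the outcome to x.
    Inputs are booleans, each True independently with probability p."""
    ps = 0.0
    for w in product([False, True], repeat=n):
        if f(w) or w[i]:
            continue  # need X != x and Y_i != y_i in w
        forced = list(w)
        forced[i] = True  # intervene: set Y_i to y_i
        if f(tuple(forced)):
            pw = 1.0
            for v in w:  # p(w) under independent Bernoulli(p) inputs
                pw *= p if v else 1 - p
            ps += pw
    return ps

majority = lambda w: sum(w) > len(w) // 2
score = ps_score(majority, 3, 0.5, 0)
```

For voter 0 this counts the worlds where voter 0 votes B, A loses, and A would have won had voter 0 voted A: exactly the two worlds where one of the other voters supports A, giving PS = 2/8 = 0.25.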
Background on training

Training learns a function; the classifier is a wrapper around it, returning one class when the function's value is positive and the other class when it is negative.

Given training data, the goal of training is to search for the parameters which minimize the risk R.
Gradient descent

Given training data, the goal of training is to search for the parameters which minimize the risk R:
1. Choose an initial value for the parameters
2. Iterate: take a step against the gradient of R
   until the change falls below a tolerance

Retraining when the data changes slightly

Specifically, assume the new training set differs from the old one in only the label of one training point. Then, instead of retraining from scratch:
1. Choose the initial value of the parameters from the training output on the old data
2. Iterate as before until convergence
Call the result the retrained classifier.
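The warm-start idea above can be demonstrated on a toy least-squares problem. A Python sketch (ours; the quadratic risk and data are illustrative, not from the slides):

```python
def grad_descent(xs, ys, theta0, alpha=0.01, tol=1e-8):
    """Minimize R(theta) = sum_i (theta * x_i - y_i)^2 by gradient descent,
    counting iterations until the gradient magnitude drops below tol."""
    theta, iters = theta0, 0
    while True:
        g = sum(2 * x * (theta * x - y) for x, y in zip(xs, ys))
        if abs(g) < tol:
            return theta, iters
        theta -= alpha * g
        iters += 1

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
theta_old, _ = grad_descent(xs, ys, 0.0)      # original training (optimum: 2)

# flip one label and retrain, warm-starting from the previous optimum
ys2 = [2.0, 5.0, 6.0]
theta_new, warm_iters = grad_descent(xs, ys2, theta_old)
_, cold_iters = grad_descent(xs, ys2, 0.0)    # retraining from scratch
```

Because one label change moves the optimum only slightly (from 2 to 30/14), the warm start reaches convergence in fewer iterations than the cold start, which is exactly the saving the retraining scheme exploits.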
Approximating retraining when the new data is close to the old

Assume that we have calculated, during the original training, the relevant gradient information (a matrix) for all training points. Recall that the new training set differs from the old one in only the label of one training point; using the stored matrix, we can calculate the retrained parameters quickly, without rerunning full training.
PSI: Results

Benchmarks include logistic regression and decision trees.

Aleks Chakarov, Aditya Nori, Sriram K. Rajamani, Selva Samuel, Shayak Sen, and Deepak Vijaykeerthy.
Towards Debugging Machine Learning Tasks. Under review.
Part IV: Tools

Synthesizing Probabilistic Programs


Motivation
• Probabilistic programs are not easy to write.
• Can we have the programmer specify only "sketches" and "synthesize" the program automatically?
• Challenges
  – The number of possible "completions" of a sketch is unbounded
  – For each candidate "completion", computation of the likelihood score is expensive
Synthesis = Search for completions…
We consider a program with a hole, the program with the hole completed, and the likelihood of the completion given the data. By Bayes' rule, the posterior over completions is proportional to the prior times the likelihood, and the likelihood estimation is expensive to compute.
Example
The likelihood is the probability that the program generates the data, assuming each sampled variable has the stated pdf.
Main idea
We can estimate likelihoods symbolically using a Mixture of Gaussians ("MoG") approximation. A MoG is a weighted sum of Gaussian densities.
[V. Maz’ya and G. Schmidt – 1996]
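As a minimal illustration of the MoG representation (ours; the mixture below is an arbitrary example, not from the synthesis paper), a mixture density is just a weighted sum of Gaussian pdfs, and moments like the mean combine linearly:

```python
import math

def mog_pdf(x, components):
    """Density of a mixture of Gaussians: sum of w_k * N(x; mu_k, sigma_k).
    `components` is a list of (weight, mean, stddev) triples."""
    return sum(
        w * math.exp(-0.5 * ((x - mu) / s) ** 2) / (s * math.sqrt(2 * math.pi))
        for w, mu, s in components
    )

mix = [(0.3, -1.0, 0.5), (0.7, 2.0, 1.0)]

# the mixture mean is the weighted sum of component means
mean = sum(w * mu for w, mu, _ in mix)

# numeric check that the density integrates to the total weight (1.0)
step = 0.001
mass = sum(mog_pdf(-10 + step * k, mix) * step for k in range(20_000))
```

Closure properties like these (sums, products approximated back into MoG form) are what let the compiler propagate distributions through a program symbolically.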


Symbolic computations on MoG (1)
Some operations on MoG representations are exact; others are approximate.

Symbolic computations on MoG (2)

Compiler to symbolically compute likelihoods

The compiler takes a statement, an environment, and a set of constraints, and generates a new environment together with the updated set of constraints collected from the observe statements in the statement.
Synthesis Algorithm
MCMC speedup due to MoG
approximation

Aditya V. Nori, Sherjil Ozair, Sriram K. Rajamani, and Deepak Vijaykeerthy.


Efficient Synthesis of Probabilistic Programs. In PLDI '15: Programming Languages
Design and Implementation, June 2015
Summary
• Probabilistic programs
– Succinct ways of specifying probabilistic models
– Diverse applications (machine learning, biology, security, ecology,
graphics, vision,…)
• Probabilistic inference is program analysis
– New application for our techniques!
– Great opportunity for building a programming environment for ML
(with a compiler and runtime) for use by experts with domain
knowledge (but who lack ML expertise)
• Several directions for future work

http://research.microsoft.com/en-us/projects/r2/
Questions?
Motivation
• Programs are increasingly “data driven”
instead of “algorithm driven”
– Use ML to build models from data, and then use
models to make decisions
– Many domains: search engines, social networks,
speech recognition, computer vision, medical
diagnostic aids, etc.
• Our goal: programming abstractions, compiler,
runtime, and tools for such programs
