
# CSE291D Lecture 5

## Monte Carlo Methods 1: Importance Sampling, Rejection Sampling, Particle Filters

Project
• Project details have been uploaded to Piazza, and are in the handout sent via email
• Start getting a group together and planning. You can use Piazza to search for teammates
Probability and Inference
[Figure: probability runs forward from the data generating process to the observed data; inference runs backward from the observed data to the data generating process. Based on a figure by Larry Wasserman, "All of Statistics".]
Approximate Inference
• In principle, Bayesian inference is a simple application of Bayes’ rule. This has been easy to do for most of the simple models we’ve studied so far.
• However, in general, Bayesian inference is intractable, motivating approximation techniques.
Approximate Inference
• Optimization approaches
  – Cast inference as optimizing an objective function. Maximize or find a fixed point
    • EM
    • Variational inference
      – Variational Bayes, mean field
      – Message passing: loopy BP, TRW, expectation propagation
    • Laplace approximation
Approximate Inference
• Simulation approaches (Monte Carlo methods)
  – Importance sampling, rejection sampling
  – Particle filtering
  – Markov chain Monte Carlo
    • Gibbs sampling, Metropolis-Hastings, Hamiltonian Monte Carlo, …
Monte Carlo Methods
• Suppose we want to approximately compute

$$\mathbb{E}_{P}[f(x)] = \int f(x)\, P(x)\, dx$$

• From the law of large numbers, for sufficiently large S,

$$\mathbb{E}_{P}[f(x)] \approx \frac{1}{S} \sum_{s=1}^{S} f(x^{(s)}), \qquad x^{(s)} \sim P(x)$$
Monte Carlo Methods
– Draw S samples from P(x)
– Compute f(x) for each of the samples
– Approximate E[f(x)] by the sample average
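The three steps above can be sketched in a few lines of NumPy; the distribution and test function here are illustrative choices of mine, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_expectation(f, sampler, S):
    # Simple Monte Carlo: draw S samples from P, then average f over them.
    xs = sampler(S)
    return np.mean(f(xs))

# Illustrative check: E[x^2] under a standard normal is exactly 1.
est = mc_expectation(lambda x: x ** 2, lambda S: rng.standard_normal(S), S=100_000)
print(est)  # close to 1.0
```

The estimator’s error shrinks like 1/√S, so quadrupling S roughly halves the noise.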
Monte Carlo Methods: Example
[Figure: worked example of a Monte Carlo estimate]
Monte Carlo Methods
• In practice, we typically cannot sample from P(x), and need to resort to approximate algorithms
• That’s what we’ll be talking about in the next two lessons
Learning outcomes
By the end of the lesson, you should be able to:
• Apply simple Monte Carlo methods to approximate expectations under distributions, including importance sampling and rejection sampling.
• Distinguish between scenarios where these methods might be expected to perform well or not.
Bayesian Inference: One Computer Scientist’s Perspective
• In theory, the posterior is simply given by Bayes’ rule.
• Bayesian inference, then, involves computing likelihood times prior, and normalizing, for every single possible value.
• But even if we could do this, except for very simple cases we typically couldn’t even store the result of this computation (at least naïvely).
Bayesian Inference: One Computer Scientist’s Perspective
• So, what do we actually mean when we say we are doing Bayesian inference?
  – Answering specific queries with respect to the distribution? (MAP, marginals, posterior predictive, …)
  – Computing a data structure which allows us to answer such queries?
• Posterior samples could be understood as a convenient data structure summarizing the posterior distribution
Sampling: An analogy
• Draw a water sample so that it is equally likely to come from anywhere in the lake
Exhaustive approach
• Visit every point in the lake. Pour a copy of the whole lake into equally-sized jars. Pick one at random.
• As the number of dimensions increases, the size of the “surface of the lake” increases exponentially
Sampling: Challenges
• There could be deep, narrow canyons. How do you make sure you don’t miss them?
Uniform sampling
• Pick S uniform samples, weight according to their relative probability
Uniform sampling
• If you miss a “canyon,” the result will be very inaccurate
• In higher dimensions, it’s more likely you’ll miss the “canyons”
Uniform sampling
• E.g. suppose you have a very good model for documents:
  The quick brown fox jumps over the sly lazy dog
  [5 6 37 1 4 30 5 22 570 12]
• The chance of uniformly picking a coherent document gets exponentially smaller, the longer the document is.
Importance sampling
• Same idea, but pick from a better “proposal”
distribution than uniform.
• Reweight samples to correct for sampling from the
wrong distribution.

Importance sampling
• Rewrite the expectation under P as an expectation under the proposal Q:

$$\mathbb{E}_{P}[f(x)] = \int f(x)\, P(x)\, dx = \int f(x)\, \frac{P(x)}{Q(x)}\, Q(x)\, dx = \mathbb{E}_{Q}\!\left[ f(x)\, \frac{P(x)}{Q(x)} \right]$$

• Estimate with weighted samples from Q:

$$\mathbb{E}_{P}[f(x)] \approx \frac{1}{S} \sum_{s=1}^{S} w^{(s)} f(x^{(s)}), \qquad x^{(s)} \sim Q(x), \quad w^{(s)} = \frac{P(x^{(s)})}{Q(x^{(s)})}$$
Importance sampling without normalization constants
• Often we only know the target and proposal up to their normalization constants: P(x) = P*(x)/Z_P, Q(x) = Q*(x)/Z_Q
• Use unnormalized weights w*(x) = P*(x)/Q*(x) and self-normalize:

$$\mathbb{E}_{P}[f(x)] \approx \frac{\sum_{s=1}^{S} w^{*(s)} f(x^{(s)})}{\sum_{s=1}^{S} w^{*(s)}}, \qquad x^{(s)} \sim Q(x), \quad w^{*(s)} = \frac{P^{*}(x^{(s)})}{Q^{*}(x^{(s)})}$$
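A minimal NumPy sketch of self-normalized importance sampling; the unnormalized Gaussian target centered at 2 and the wide Gaussian proposal are my own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def p_star(x):
    # Unnormalized target: Gaussian centered at 2 (normalizer unknown to us).
    return np.exp(-0.5 * (x - 2.0) ** 2)

def q_pdf(x):
    # Proposal: zero-mean Gaussian with sigma = 3 -- easy to sample from.
    return np.exp(-0.5 * (x / 3.0) ** 2) / (3.0 * np.sqrt(2.0 * np.pi))

S = 200_000
xs = rng.normal(0.0, 3.0, size=S)        # x^(s) ~ Q
w = p_star(xs) / q_pdf(xs)               # unnormalized importance weights
est_mean = np.sum(w * xs) / np.sum(w)    # self-normalized estimate of E_P[x]
print(est_mean)  # close to 2.0
```

Note the wide proposal covers the target’s mass; a proposal narrower than the target would make the weights heavy-tailed, a failure mode discussed below.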
Importance sampling
• Can be used to estimate the ratio of partition functions between p(x) and q(x):

$$\frac{Z_P}{Z_Q} = \int \frac{P^{*}(x)}{Z_Q}\, dx = \int \frac{P^{*}(x)}{Q^{*}(x)}\, Q(x)\, dx \approx \frac{1}{S} \sum_{s=1}^{S} \frac{P^{*}(x^{(s)})}{Q^{*}(x^{(s)})}, \qquad x^{(s)} \sim Q(x)$$
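A small sketch of this partition-function-ratio estimator, with two unnormalized 1-D Gaussians chosen for illustration (by construction the true ratio here is Z_P/Z_Q = 0.5):

```python
import numpy as np

rng = np.random.default_rng(0)

def p_star(x):
    # Unnormalized N(0, 0.5^2): true normalizer Z_P = 0.5 * sqrt(2*pi).
    return np.exp(-0.5 * (x / 0.5) ** 2)

def q_star(x):
    # Unnormalized N(0, 1): true normalizer Z_Q = sqrt(2*pi).
    return np.exp(-0.5 * x ** 2)

S = 200_000
xs = rng.normal(0.0, 1.0, size=S)          # exact samples from Q
ratio = np.mean(p_star(xs) / q_star(xs))   # (1/S) sum P*(x)/Q*(x) -> Z_P/Z_Q
print(ratio)  # close to 0.5
```

This works well here because the target is narrower than the proposal, so the weight P*/Q* is bounded; with the roles reversed the estimator would have infinite variance.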
Heavy tails
• If q(x) goes towards zero faster than p(x), importance weights of rare events will become extremely large
[Figure: importance weight behavior under a Gaussian proposal vs. a Cauchy proposal]
Importance sampling in high dimensions
• As the dimensionality of the space increases, it becomes harder to reliably construct a good proposal distribution
[Figure: spherical Gaussian example]
Sampling Importance Resampling
• We can convert a set of importance-weighted samples to a set of unweighted samples
  – Resample S’ samples from the set of samples, with replacement, proportional to their importance weights
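The resampling step is one call to a weighted categorical draw; a sketch with a toy three-sample set of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(1)

def resample(samples, weights, S_prime):
    # Draw S' indices with replacement, proportional to the importance
    # weights; the returned samples are unweighted.
    probs = np.asarray(weights, dtype=float)
    probs = probs / probs.sum()
    idx = rng.choice(len(probs), size=S_prime, p=probs, replace=True)
    return np.asarray(samples)[idx]

# Toy usage: after resampling, the sample mean approaches the weighted mean
# 0.1*0 + 0.1*1 + 0.8*2 = 1.7.
unweighted = resample([0.0, 1.0, 2.0], weights=[0.1, 0.1, 0.8], S_prime=10_000)
print(unweighted.mean())  # close to 1.7
```

Sampling with replacement means high-weight samples appear many times, which is exactly what particle filters exploit later.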
Rejection Sampling
• Unnormalized proposal distribution cQ*(x) that upper bounds P*(x)
• Sample uniformly under the curve cQ*(x) (with auxiliary “height” u)
• Reject samples that do not fall under the curve of P*(x)
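A sketch of the accept/reject loop; the target p*(x) = x(1 − x) on [0, 1] and the uniform proposal with c = 0.25 are illustrative choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(2)

def p_star(x):
    # Unnormalized target on [0, 1]: p*(x) = x(1 - x), maximum value 0.25.
    return x * (1.0 - x)

def rejection_sample(S):
    # Proposal Q*(x) = 1 on [0, 1]; c = 0.25 makes c*Q*(x) >= p*(x) everywhere.
    c = 0.25
    accepted = []
    while len(accepted) < S:
        x = rng.uniform(0.0, 1.0)   # sample from the proposal
        u = rng.uniform(0.0, c)     # auxiliary "height" under c*Q*(x)
        if u < p_star(x):           # keep only points under the p*(x) curve
            accepted.append(x)
    return np.array(accepted)

xs = rejection_sample(50_000)
print(xs.mean())  # the target is symmetric about 0.5
```

The acceptance rate is the ratio of the areas under p*(x) and cQ*(x) (here 2/3), which foreshadows why a loose bound c is so costly.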
Rejection sampling in high dimensions
• As the dimensionality of the space increases, the constant c gets exponentially larger in general
• Spherical Gaussian example: multiply by c in one dimension, multiply by c^N in N dimensions
Particle Filters
• Dynamical systems (cf. Kalman filters)

  Latent states:  Z1 → Z2 → Z3 → Z4 → Z5
  Observations:   Y1   Y2   Y3   Y4   Y5

• Radar tracking, robot localization, weather forecasting, …
Particle Filters
• Filtering: keeping a running prediction of the current state z_t as observations arrive
Particle Filters
• Particle filters, a.k.a. sequential Monte Carlo, a.k.a. sequential importance sampling
• Basic idea:
  – Perform importance sampling to estimate the latent states z
  – At each timestep t, extend each importance sample to include z_t and update the weights recursively
Updating importance weights
• With a proposal that factorizes over timesteps, the importance weights can be updated recursively:

$$w_t^{(s)} \propto w_{t-1}^{(s)}\, \frac{p(y_t \mid z_t^{(s)})\, p(z_t^{(s)} \mid z_{t-1}^{(s)})}{q(z_t^{(s)} \mid z_{1:t-1}^{(s)}, y_t)}$$

• If we propose from the dynamics model, q(z_t | z_{1:t-1}, y_t) = p(z_t | z_{t-1}), then the update simplifies:

$$w_t^{(s)} \propto w_{t-1}^{(s)}\, p(y_t \mid z_t^{(s)})$$
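A sketch of this simplified update (the "bootstrap" filter) on a 1-D random-walk model; the model, noise scales, and observation sequence are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def bootstrap_step(particles, weights, y_t, dyn_sigma=0.5, obs_sigma=1.0):
    # One step of a bootstrap filter for the 1-D random-walk model
    #   z_t = z_{t-1} + N(0, dyn_sigma^2),   y_t = z_t + N(0, obs_sigma^2).
    # Proposing from the dynamics means each weight is simply multiplied
    # by the likelihood p(y_t | z_t).
    particles = particles + rng.normal(0.0, dyn_sigma, size=len(particles))
    weights = weights * gauss_pdf(y_t, particles, obs_sigma)
    return particles, weights / weights.sum()

S = 5_000
particles = rng.normal(0.0, 5.0, size=S)   # diffuse initial particles
weights = np.full(S, 1.0 / S)
for y in [2.9, 3.1, 3.0, 2.95]:            # observations near 3.0
    particles, weights = bootstrap_step(particles, weights, y)
est = np.sum(weights * particles)          # posterior mean estimate
print(est)  # near 3.0
```

This sketch never resamples, so over many timesteps the weights would degenerate, which motivates the next slide.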
Degeneracy
• As we add more timesteps, the z vector becomes higher dimensional
  – Importance weights select only a few samples
• Solution: Sampling importance resampling!
  – When the “effective sample size” is low, resample new particles proportional to the weights
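One common definition of effective sample size is ESS = 1 / Σ_s (w̃^(s))², with w̃ the normalized weights; a minimal sketch:

```python
import numpy as np

def effective_sample_size(weights):
    # ESS = 1 / sum(w_norm^2): equals S for uniform weights and
    # approaches 1 when a single particle carries all the weight.
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return 1.0 / np.sum(w ** 2)

ess_uniform = effective_sample_size(np.ones(100))              # 100.0
ess_degenerate = effective_sample_size([100.0] + [1e-6] * 99)  # about 1.0
print(ess_uniform, ess_degenerate)
```

A common heuristic is to trigger resampling when ESS drops below some fraction of S, e.g. S/2.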
Illustration of particle filtering
[Figure: particle filtering illustration]
Application: visual object tracking
• Goal: track an object (in this case, a remote-controlled helicopter) in a video sequence
• Linear dynamics model
• Likelihood based on color histogram features
• Proposal distribution: sample from the prior (dynamics model)
• S = 250 samples
Think-pair-share: helicopter tracker
• Your company plans to deploy the helicopter tracking system as part of a mobile phone app in 3 months, but needs it to be more reliable.
• What would you change to improve its performance?