
CSE291D Lecture 5

Monte Carlo Methods 1:
Importance Sampling,
Rejection Sampling,
Particle Filters
Project
• Project details have been uploaded to Piazza, and are in the handout

• Reminder: Project proposals due 4/19, via email

• Start getting a group together and planning your project.
  You can use Piazza to search for teammates
Probability and Inference
[Diagram: "Probability" points from the data generating process to the observed data; "Inference" points back from the observed data to the data generating process.]
Figure based on one by Larry Wasserman, "All of Statistics"
Approximate Inference
• In principle, Bayesian inference is a simple application of Bayes’ rule. This has been easy to do for most of the simple models we’ve studied so far.

• However, in general, Bayesian inference is intractable, motivating approximation techniques
Approximate Inference
• Optimization approaches
  – Cast inference as optimizing an objective function; maximize or find a fixed point
• EM
• Variational inference
  – Variational Bayes, mean field
  – Message passing: loopy BP, TRW, expectation propagation
• Laplace approximation
Approximate Inference
• Simulation approaches (Monte Carlo methods)
  – Approximate a distribution by drawing samples
• Importance sampling, rejection sampling
• Particle filtering
• Markov chain Monte Carlo
  – Gibbs sampling, Metropolis-Hastings, Hamiltonian Monte Carlo, …
Monte Carlo Methods
• Suppose we want to approximately compute an expectation

    E_P[f(x)] = Σ_x f(x) P(x)

• From the law of large numbers, for sufficiently large S,

    E_P[f(x)] ≈ (1/S) Σ_{s=1}^S f(x^(s)),   where x^(s) ~ P(x)
Monte Carlo Methods

• This suggests the procedure:
  – Draw S samples x^(1), …, x^(S) from P(x)
  – Compute f(x^(s)) for each of the samples
  – Approximate E[f(x)] by the sample average
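As a minimal sketch of this procedure (the target P and the function f here are illustrative choices): estimate E[x²] under a standard Gaussian, whose true value is 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative target: P(x) = N(0, 1); estimate E[f(x)] for f(x) = x^2.
# The true value is Var(x) = 1.
S = 100_000
x = rng.standard_normal(S)   # draw S samples from P(x)
estimate = (x ** 2).mean()   # approximate E[f(x)] by the sample average
print(estimate)              # close to 1 for large S
```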
Monte Carlo Methods: Example

Monte Carlo Methods

• In practice, we typically cannot sample from P(x), and


need to resort to approximate algorithms

• That’s what we’ll be talking about in the next two


lessons

10
Learning outcomes
By the end of the lesson, you should be able to:

• Apply simple Monte Carlo methods to approximate expectations under distributions, including importance sampling and rejection sampling.

• Distinguish between scenarios where these methods might be expected to perform well or not.
Bayesian Inference:
One Computer Scientist’s Perspective
• In theory, the posterior is simply given by Bayes’ rule.

• Bayesian inference, then, involves computing likelihood times prior, and normalizing, for every single possible value

• But even if we could do this, except for very simple cases we typically couldn’t even store the result of this computation (at least naïvely).
Bayesian Inference:
One Computer Scientist’s Perspective
• So, what do we actually mean when we say
we are doing Bayesian inference?
– Answering specific queries with respect to the distribution?
  (MAP, marginals, posterior predictive, …)

– Computing a data structure which allows us to answer such queries?
  • Posterior samples could be understood as a convenient data structure summarizing the posterior distribution
Sampling: An analogy

• Draw a water sample so that it is equally likely to


come from anywhere in the lake
16
Exhaustive approach

• Visit every point in the lake. Pour a copy of the


whole lake into equally-sized jars. Pick one at random

• As the number of dimensions increases, the size of


the “surface of the lake” increases exponentially

17
Sampling: Challenges

• We don’t know how deep the lake could be

• It is too expensive to explore the whole lake

• There could be deep, narrow canyons. How do you


make sure you don’t miss them?
18
Uniform sampling

• Pick S uniform samples, weight according to


their relative probability

19
Uniform sampling

• If you miss a “canyon,” the result will be very


bad.
• In higher dimensions, it’s more likely you’ll
miss the “canyons”
20
Uniform sampling
• E.g., suppose you have a very good model for documents:

  The quick brown fox jumps over the sly lazy dog
  [5 6 37 1 4 30 5 22 570 12]

• The chance of uniformly picking a coherent document gets exponentially smaller, the longer the document is.
Importance sampling
• Same idea, but pick from a better “proposal” distribution than uniform.

• Reweight samples to correct for sampling from the wrong distribution.
Importance sampling

• We want E_P[f(x)], but can only draw samples from a proposal Q(x).

• Rewrite the expectation under Q:

    E_P[f(x)] = Σ_x f(x) P(x) = Σ_x f(x) (P(x)/Q(x)) Q(x) = E_Q[f(x) w(x)],

  where w(x) = P(x)/Q(x) is the importance weight.

• Monte Carlo estimate with x^(s) ~ Q(x):

    E_P[f(x)] ≈ (1/S) Σ_{s=1}^S w(x^(s)) f(x^(s))
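A sketch of the reweighting idea, with an illustrative standard-Gaussian target and a wider Gaussian proposal (both chosen arbitrarily for this example):

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2), written out to keep the sketch self-contained."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Illustrative setup: target P(x) = N(0, 1), proposal Q(x) = N(0, 2^2).
# We estimate E_P[x^2] (true value 1) using samples drawn only from Q.
S = 100_000
x = rng.normal(0.0, 2.0, size=S)                     # x^(s) ~ Q(x)
w = gauss_pdf(x, 0.0, 1.0) / gauss_pdf(x, 0.0, 2.0)  # weights w(x) = P(x)/Q(x)
estimate = np.mean(w * x ** 2)                       # reweighted sample average
print(estimate)                                      # close to 1
```

Note that the weights themselves average to 1 under Q, since E_Q[P(x)/Q(x)] = Σ_x P(x) = 1.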
Importance sampling without normalization constants

• Often we can only evaluate unnormalized densities: P(x) = P*(x)/Z_P and Q(x) = Q*(x)/Z_Q, with Z_P, Z_Q unknown.

• Use unnormalized weights w*(x^(s)) = P*(x^(s))/Q*(x^(s)) and normalize them over the drawn samples:

    E_P[f(x)] ≈ Σ_s w*(x^(s)) f(x^(s)) / Σ_s w*(x^(s))

• This self-normalized estimator is biased for finite S, but consistent as S → ∞.
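The self-normalized estimator can be sketched as follows; the unnormalized densities are illustrative, and the code never touches their normalizers:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative unnormalized target P*(x) = exp(-x^2/2): a standard Gaussian
# whose normalizer Z_P the algorithm never uses.
def p_star(x):
    return np.exp(-0.5 * x ** 2)

# Illustrative unnormalized proposal Q*(x) = exp(-x^2/8): an N(0, 2^2) up to
# a constant. We can still draw samples from the normalized Q directly.
def q_star(x):
    return np.exp(-x ** 2 / 8.0)

S = 100_000
x = rng.normal(0.0, 2.0, size=S)            # x^(s) ~ Q(x)
w = p_star(x) / q_star(x)                   # unnormalized weights w* = P*/Q*
estimate = np.sum(w * x ** 2) / np.sum(w)   # self-normalized estimate of E_P[x^2]
print(estimate)                             # close to 1
```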
Importance sampling

• Can be used to estimate the ratio of partition functions between p(x) and q(x):

    Z_P / Z_Q = (1/Z_Q) Σ_x P*(x) = Σ_x (P*(x)/Q*(x)) Q(x) = E_Q[w*(x)] ≈ (1/S) Σ_s w*(x^(s))
Heavy tails

• If q(x) goes towards zero faster than p(x), importance weights of rare events will become extremely large

[Figures: a light-tailed Gaussian proposal vs. a heavy-tailed Cauchy proposal]
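A small numerical illustration of this effect (densities chosen for the sketch): with a proposal whose tails decay faster than the target's, a single rare sample can carry an enormous weight, while a heavy-tailed Cauchy proposal keeps the weights bounded.

```python
import numpy as np

rng = np.random.default_rng(0)
S = 100_000

def gauss_pdf(x, sigma=1.0):
    return np.exp(-0.5 * (x / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def cauchy_pdf(x):
    return 1.0 / (np.pi * (1.0 + x ** 2))

# Target p(x) = N(0, 1).
# Bad proposal: N(0, 0.5^2), whose tails decay faster than the target's,
# so w(x) = p(x)/q(x) = 0.5 * exp(1.5 x^2) blows up for rare tail samples.
x_bad = rng.normal(0.0, 0.5, size=S)
w_bad = gauss_pdf(x_bad) / gauss_pdf(x_bad, sigma=0.5)

# Heavy-tailed Cauchy proposal: the weight p(x)/q(x) is bounded in x.
x_cauchy = rng.standard_cauchy(S)
w_cauchy = gauss_pdf(x_cauchy) / cauchy_pdf(x_cauchy)

print(w_bad.max())      # huge: rare tail samples dominate the estimate
print(w_cauchy.max())   # bounded (the analytic supremum is about 1.52)
```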


Importance sampling in high dimensions

• As the dimensionality of the space increases, it becomes harder to reliably construct a good proposal distribution

[Example: spherical Gaussian target]
Sampling Importance Resampling
• We can convert a set of importance-weighted samples to a set of unweighted samples:

  – Draw S importance samples

  – Resample S’ samples from the set of samples, with replacement, proportional to their importance weights
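A sketch of the resampling step (the weighted-sample setup is illustrative: a Gaussian target and a wider Gaussian proposal, both known only up to constants):

```python
import numpy as np

rng = np.random.default_rng(0)

# Importance-weighted samples: target N(0, 1), proposal N(0, 2^2),
# evaluated through unnormalized densities only.
S = 50_000
x = rng.normal(0.0, 2.0, size=S)
w = np.exp(-0.5 * x ** 2) / np.exp(-x ** 2 / 8.0)   # unnormalized weights P*/Q*

# Resample with replacement, proportional to the importance weights.
probs = w / w.sum()
resampled = rng.choice(x, size=S, replace=True, p=probs)

# The resampled set behaves like unweighted draws from the target N(0, 1).
print(resampled.mean(), resampled.var())
```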
Rejection Sampling

• Unnormalized proposal distribution cQ*(x) that upper bounds P*(x)


• Sample uniformly under the curve cQ*(x) (with auxiliary “height” u)
• Reject samples that do not fall under the curve of P*(x)

44
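A sketch of this procedure for an illustrative 1-D example (unnormalized Gaussian target, wider Gaussian proposal; the constant c is computed analytically for this particular pair, and the names `p_star`/`q_pdf` are just sketch constructs):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative target P*(x) = exp(-x^2/2): a standard Gaussian, normalizer unknown.
def p_star(x):
    return np.exp(-0.5 * x ** 2)

# Proposal Q(x) = N(0, 2^2). For this pair, P*(x)/Q(x) is maximized at x = 0,
# giving c = 2 * sqrt(2*pi), so c * Q(x) >= P*(x) everywhere.
sigma_q = 2.0
def q_pdf(x):
    return np.exp(-0.5 * (x / sigma_q) ** 2) / (sigma_q * np.sqrt(2.0 * np.pi))

c = sigma_q * np.sqrt(2.0 * np.pi)

S = 200_000
x = rng.normal(0.0, sigma_q, size=S)   # x^(s) ~ Q
u = rng.uniform(0.0, c * q_pdf(x))     # uniform "height" under the curve c*Q(x)
accepted = x[u < p_star(x)]            # keep only points under P*(x)

# Accepted samples are exact draws from the normalized target N(0, 1);
# for this pair the acceptance rate is Z_P / c = 1/2.
print(accepted.mean(), accepted.var(), len(accepted) / S)
```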
Rejection sampling in high dimensions

• As the dimensionality of the space increases, the constant c gets exponentially larger in general

• Spherical Gaussian example: if each dimension requires a factor of c, then N dimensions require c^N
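A sketch of why this bites, using a spherical Gaussian N(0, I) target and a slightly wider spherical Gaussian proposal (the width 1.2 is an arbitrary illustration): the per-dimension bound multiplies across dimensions, so the acceptance rate decays as sigma_q^(-N).

```python
import numpy as np

rng = np.random.default_rng(0)

# Target P*(x) = exp(-||x||^2 / 2) (spherical N(0, I), unnormalized);
# proposal Q = spherical N(0, sigma_q^2 I) with sigma_q slightly > 1.
# The tightest bound touches P* at the origin, and the acceptance rate
# Z_P / c = sigma_q^(-N) shrinks exponentially in the dimension N.
sigma_q = 1.2

def acceptance_rate(n_dims, S=20_000):
    x = rng.normal(0.0, sigma_q, size=(S, n_dims))
    # accept iff u < P*(x) / (c * Q(x)); computed in log space for stability
    log_ratio = -0.5 * np.sum(x ** 2, axis=1) + 0.5 * np.sum((x / sigma_q) ** 2, axis=1)
    u = rng.uniform(size=S)
    return np.mean(np.log(u) < log_ratio)

print(acceptance_rate(1))    # about 1/1.2   ≈ 0.83
print(acceptance_rate(20))   # about 1.2^-20 ≈ 0.026
```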
Particle Filters
• Dynamical systems (cf. Kalman filters)

  Latent states:   Z1 → Z2 → Z3 → Z4 → Z5
  Observations:    Y1   Y2   Y3   Y4   Y5

• Radar tracking, robot localization, weather forecasting, …
Particle Filters
• Dynamical systems (cf. Kalman filters)

  Latent states:   Z1 → Z2 → Z3 → Z4 → Z5
  Observations:    Y1   Y2   Y3   Y4   Y5

  Zt = ?

• Filtering: keeping a running prediction on the current state zt
Particle Filters
• Particle filters, a.k.a. sequential Monte Carlo, a.k.a. sequential importance sampling

• Basic idea:
  – Perform importance sampling to estimate the state trajectory z
  – At each timestep t, extend each importance sample to include zt and update the weights recursively
Updating importance weights

• With a proposal q(z_t | z_{t-1}, y_t), the weights can be updated recursively:

    w_t^(s) ∝ w_{t-1}^(s) · p(y_t | z_t^(s)) p(z_t^(s) | z_{t-1}^(s)) / q(z_t^(s) | z_{t-1}^(s), y_t)

• Suppose the proposal is the prior: q(z_t | z_{t-1}, y_t) = p(z_t | z_{t-1})

• Then the update simplifies:

    w_t^(s) ∝ w_{t-1}^(s) · p(y_t | z_t^(s))
Degeneracy
• As we add more timesteps, the z vector becomes higher dimensional
  – Importance weights select only a few samples

• Solution: sampling importance resampling!
  – When the “effective sample size” is low, resample new particles proportional to the weights
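The full loop — propagate from the dynamics, reweight by the likelihood (using the prior as proposal, so the update is just multiplication by the likelihood), and resample when the effective sample size drops — can be sketched on a toy 1-D linear-Gaussian model whose parameters are made up for this example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D state-space model (parameters invented for the sketch):
#   z_t = 0.9 * z_{t-1} + N(0, 1),    y_t = z_t + N(0, 0.5^2)
a, q_sd, r_sd = 0.9, 1.0, 0.5

# Simulate a latent trajectory and noisy observations.
T = 50
z_true = np.zeros(T)
for t in range(1, T):
    z_true[t] = a * z_true[t - 1] + rng.normal(0.0, q_sd)
y = z_true + rng.normal(0.0, r_sd, size=T)

# Bootstrap particle filter: the proposal is the prior (dynamics model).
S = 1000
particles = rng.normal(0.0, 1.0, size=S)
weights = np.full(S, 1.0 / S)
estimates = np.zeros(T)

for t in range(T):
    # Extend each sample by drawing z_t from the dynamics.
    particles = a * particles + rng.normal(0.0, q_sd, size=S)
    # Reweight by the (Gaussian) likelihood p(y_t | z_t) and renormalize.
    weights = weights * np.exp(-0.5 * ((y[t] - particles) / r_sd) ** 2)
    weights /= weights.sum()
    estimates[t] = np.sum(weights * particles)   # filtered mean E[z_t | y_{1:t}]

    # Resample when the effective sample size gets low (degeneracy).
    ess = 1.0 / np.sum(weights ** 2)
    if ess < S / 2:
        idx = rng.choice(S, size=S, replace=True, p=weights)
        particles = particles[idx]
        weights = np.full(S, 1.0 / S)

print(np.mean(np.abs(estimates - z_true)))   # small relative to the state scale
```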
Illustration of particle filtering

Application: visual object tracking
• Goal: track an object (in this case, a remote-controlled helicopter) in a video sequence

• Linear dynamics model
• Likelihood based on color histogram features
• Proposal distribution: sample from the prior (dynamics model)
• S = 250 samples
Think-pair-share: helicopter tracker

• You are an engineer for the RC helicopter company.

• Your company plans to deploy the helicopter tracking system


as part of a mobile phone app in 3 months, but needs it to be
more reliable.

• How would you change the system to improve its


performance?

60