
Markov Chain Monte Carlo Methods

Christian P. Robert
Université Paris Dauphine and CREST-INSEE
http://www.ceremade.dauphine.fr/~xian

3-6 May 2005

Outline
Motivation and leading example
Random variable generation
Monte Carlo Integration
Notions on Markov Chains
The Metropolis-Hastings Algorithm
The Gibbs Sampler
MCMC tools for variable dimension problems
Sequential importance sampling

New [2004] edition:

Motivation and leading example
Introduction
Likelihood methods
Missing variable models
Bayesian Methods
Bayesian troubles
Random variable generation
Monte Carlo Integration
Notions on Markov Chains
The Metropolis-Hastings Algorithm

Introduction
Latent structures make life harder!

Even simple models may lead to computational complications, as in latent variable models

$$f(x \mid \theta) = \int f^\star(x, x^\star \mid \theta)\, dx^\star$$

If $(x, x^\star)$ observed, fine!
If only $x$ observed, trouble!
Example (Mixture models)

Models of mixtures of distributions:

$$X \sim f_j \text{ with probability } p_j\,,$$

for $j = 1, 2, \dots, k$, with overall density

$$p_1 f_1(x) + \cdots + p_k f_k(x)\,.$$

For a sample of independent random variables $(X_1, \dots, X_n)$, sample density

$$\prod_{i=1}^{n} \{p_1 f_1(x_i) + \cdots + p_k f_k(x_i)\}\,.$$

Expanding this product involves $k^n$ elementary terms: prohibitive to compute in large samples.
[Figure: likelihood surface in the case of the $0.3\,\mathcal{N}(\mu_1, 1) + 0.7\,\mathcal{N}(\mu_2, 1)$ mixture]
Likelihood methods

Maximum likelihood methods

For an iid sample $X_1, \dots, X_n$ from a population with density $f(x \mid \theta_1, \dots, \theta_k)$, the likelihood function is

$$L(\theta \mid \mathbf{x}) = L(\theta_1, \dots, \theta_k \mid x_1, \dots, x_n) = \prod_{i=1}^{n} f(x_i \mid \theta_1, \dots, \theta_k)\,.$$

Global justifications from asymptotics.
Computational difficulty depends on structure, e.g. latent variables.
Example (Mixtures again)

For a mixture of two normal distributions,

$$p\,\mathcal{N}(\mu, \tau^2) + (1-p)\,\mathcal{N}(\theta, \sigma^2)\,,$$

likelihood proportional to

$$\prod_{i=1}^{n} \left[ p\,\tau^{-1} \varphi\!\left(\frac{x_i - \mu}{\tau}\right) + (1-p)\,\sigma^{-1} \varphi\!\left(\frac{x_i - \theta}{\sigma}\right) \right]$$

containing $2^n$ terms.
Standard maximization techniques often fail to find the global maximum because of multimodality of the likelihood function.

Example
In the special case

$$f(x \mid \theta, \sigma) = (1 - \epsilon) \exp\{-(1/2) x^2\} + \frac{\epsilon}{\sigma} \exp\{-(1/2\sigma^2)(x - \theta)^2\} \qquad (1)$$

with $\epsilon > 0$ known, whatever $n$, the likelihood is unbounded:

$$\lim_{\sigma \to 0}\; \ell(\theta = x_1, \sigma \mid x_1, \dots, x_n) = \infty$$
Missing variable models

The special case of missing variable models

Consider again a latent variable representation

$$g(x \mid \theta) = \int_{\mathcal{Z}} f(x, z \mid \theta)\, dz$$

Define the completed (but unobserved) likelihood

$$L^c(\theta \mid x, z) = f(x, z \mid \theta)$$

Useful for optimisation algorithms.
The EM Algorithm

Algorithm (Expectation-Maximisation)
Iterate (in $m$)

1. (E step) Compute
$$Q(\theta \mid \theta_{(m)}, \mathbf{x}) = \mathbb{E}[\log L^c(\theta \mid \mathbf{x}, \mathbf{Z}) \mid \theta_{(m)}, \mathbf{x}]\,,$$
2. (M step) Maximise $Q(\theta \mid \theta_{(m)}, \mathbf{x})$ in $\theta$ and take
$$\theta_{(m+1)} = \arg\max_\theta Q(\theta \mid \theta_{(m)}, \mathbf{x})\,,$$

until a fixed point [of $Q$] is reached.
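In R, a minimal sketch of this recursion for a two-component normal mixture with known weight (the function name em_mixture and the fixed value p = 0.7 are illustrative choices, not from the slides):

# EM for the means of p*N(mu1,1) + (1-p)*N(mu2,1), p known
em_mixture <- function(x, p = 0.7, mu = c(-1, 1), niter = 50) {
  for (m in 1:niter) {
    # E step: posterior probability that each x_i comes from component 1
    d1 <- p * dnorm(x, mu[1], 1)
    d2 <- (1 - p) * dnorm(x, mu[2], 1)
    z <- d1 / (d1 + d2)
    # M step: weighted means maximise Q(theta | theta_(m), x)
    mu <- c(sum(z * x) / sum(z), sum((1 - z) * x) / sum(1 - z))
  }
  mu
}

set.seed(1)
x <- c(rnorm(70, 1), rnorm(30, -1))  # sample from .7 N(1,1) + .3 N(-1,1)
em_mixture(x)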

[Figure: histogram of a sample simulated from model (1)]

[Figure: likelihood of $0.7\,\mathcal{N}(\mu_1, 1) + 0.3\,\mathcal{N}(\mu_2, 1)$ and EM steps]

Bayesian Methods

The Bayesian Perspective

In the Bayesian paradigm, the information brought by the data $x$, realization of

$$X \sim f(x \mid \theta)\,,$$

is combined with prior information specified by a prior distribution with density $\pi(\theta)$.

Central tool

Summary in a probability distribution, $\pi(\theta \mid x)$, called the posterior distribution.
Derived from the joint distribution $f(x \mid \theta)\pi(\theta)$, according to

$$\pi(\theta \mid x) = \frac{f(x \mid \theta)\pi(\theta)}{\int f(x \mid \theta)\pi(\theta)\, d\theta}\,,$$

[Bayes Theorem]

where

$$m(x) = \int f(x \mid \theta)\pi(\theta)\, d\theta$$

is the marginal density of $X$.

Central tool... central to Bayesian inference

Posterior defined up to a constant as

$$\pi(\theta \mid x) \propto f(x \mid \theta)\, \pi(\theta)$$

Operates conditional upon the observations.
Integrates simultaneously prior information and information brought by $x$.
Avoids averaging over the unobserved values of $x$.
Coherent updating of the information available on $\theta$, independent of the order in which i.i.d. observations are collected.
Provides a complete inferential scope and a unique motor of inference.

Bayesian troubles

Conjugate bonanza...

Example (Binomial)
For an observation $X \sim \mathcal{B}(n, p)$ the so-called conjugate prior is the family of beta $\mathcal{B}e(a, b)$ distributions.
The classical Bayes estimator is the posterior mean

$$\frac{\Gamma(a+b+n)}{\Gamma(a+x)\Gamma(n-x+b)} \int_0^1 p\; p^{x+a-1} (1-p)^{n-x+b-1}\, dp = \frac{x+a}{a+b+n}\,.$$
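A quick R check of this closed form by simulation (the values a = 2, b = 3, n = 10, x = 7 are arbitrary illustrative choices):

# Monte Carlo check of the Beta-Binomial posterior mean
a <- 2; b <- 3; n <- 10; x <- 7
p <- rbeta(1e6, a + x, b + n - x)   # posterior is Be(a + x, b + n - x)
mean(p)                             # ~ (x + a) / (a + b + n)
(x + a) / (a + b + n)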

Example (Normal)
In the normal $\mathcal{N}(\mu, \sigma^2)$ case, with both $\mu$ and $\sigma$ unknown, conjugate prior on $\theta = (\mu, \sigma^2)$ of the form

$$(\sigma^2)^{-\lambda_\sigma} \exp -\left\{ \lambda_\mu (\mu - \xi)^2 + \alpha \right\}/2\sigma^2$$

since

$$\pi((\mu, \sigma^2) \mid x_1, \dots, x_n) \propto (\sigma^2)^{-\lambda_\sigma} \exp -\left\{ \lambda_\mu (\mu - \xi)^2 + \alpha \right\}/2\sigma^2 \times (\sigma^2)^{-n/2} \exp -\left\{ n(\mu - \bar{x})^2 + s_x^2 \right\}/2\sigma^2$$

$$\propto (\sigma^2)^{-\lambda_\sigma - n/2} \exp -\left\{ (\lambda_\mu + n)(\mu - \xi_x)^2 + \alpha + s_x^2 + \frac{n\lambda_\mu}{n + \lambda_\mu} (\bar{x} - \xi)^2 \right\}/2\sigma^2$$

with $\xi_x = (\lambda_\mu \xi + n\bar{x})/(\lambda_\mu + n)$.

...and conjugate curse

The use of conjugate priors for computational reasons
implies a restriction on the modeling of the available prior information,
may be detrimental to the usefulness of the Bayesian approach,
gives an impression of subjective manipulation of the prior information disconnected from reality.


A typology of Bayes computational problems

(i). use of a complex parameter space, as for instance in constrained parameter sets like those resulting from imposing stationarity constraints in dynamic models;
(ii). use of a complex sampling model with an intractable likelihood, as for instance in missing data and graphical models;
(iii). use of a huge dataset;
(iv). use of a complex prior distribution (which may be the posterior distribution associated with an earlier sample);
(v). use of a complex inferential procedure as for instance, Bayes factors

$$B_{01}^\pi(x) = \frac{P(\theta \in \Theta_0 \mid x)}{P(\theta \in \Theta_1 \mid x)} \Big/ \frac{\pi(\theta \in \Theta_0)}{\pi(\theta \in \Theta_1)}\,.$$

Example (Mixture once again)

Observations from

$$x_1, \dots, x_n \sim f(x \mid \theta) = p\,\varphi(x; \mu_1, \sigma_1) + (1-p)\,\varphi(x; \mu_2, \sigma_2)$$

Prior

$$\mu_i \mid \sigma_i \sim \mathcal{N}(\xi_i, \sigma_i^2/n_i)\,, \qquad \sigma_i^2 \sim \mathcal{IG}(\nu_i/2, s_i^2/2)\,, \qquad p \sim \mathcal{B}e(\alpha, \beta)$$

Posterior

$$\pi(\theta \mid x_1, \dots, x_n) \propto \prod_{j=1}^{n} \left\{ p\,\varphi(x_j; \mu_1, \sigma_1) + (1-p)\,\varphi(x_j; \mu_2, \sigma_2) \right\} \pi(\theta) = \sum_{\ell=0}^{n} \sum_{(k_t)} \omega(k_t)\, \pi(\theta \mid (k_t))$$

Example (Mixture once again (cont'd))

For a given permutation $(k_t)$, conditional posterior distribution

$$\pi(\theta \mid (k_t)) = \mathcal{N}\!\left( \xi_1(k_t), \frac{\sigma_1^2}{n_1 + \ell} \right) \times \mathcal{IG}\!\left( (\nu_1 + \ell)/2, s_1(k_t)/2 \right) \times \mathcal{N}\!\left( \xi_2(k_t), \frac{\sigma_2^2}{n_2 + n - \ell} \right) \times \mathcal{IG}\!\left( (\nu_2 + n - \ell)/2, s_2(k_t)/2 \right) \times \mathcal{B}e(\alpha + \ell, \beta + n - \ell)$$

Example (Mixture once again (cont'd))

where

$$\bar{x}_1(k_t) = \frac{1}{\ell} \sum_{t=1}^{\ell} x_{k_t}\,, \qquad \bar{x}_2(k_t) = \frac{1}{n - \ell} \sum_{t=\ell+1}^{n} x_{k_t}\,,$$

$$\hat{s}_1(k_t) = \sum_{t=1}^{\ell} (x_{k_t} - \bar{x}_1(k_t))^2\,, \qquad \hat{s}_2(k_t) = \sum_{t=\ell+1}^{n} (x_{k_t} - \bar{x}_2(k_t))^2\,,$$

and

$$\xi_1(k_t) = \frac{n_1 \xi_1 + \ell\, \bar{x}_1(k_t)}{n_1 + \ell}\,, \qquad \xi_2(k_t) = \frac{n_2 \xi_2 + (n - \ell)\, \bar{x}_2(k_t)}{n_2 + n - \ell}\,,$$

$$s_1(k_t) = s_1^2 + \hat{s}_1^2(k_t) + \frac{n_1 \ell}{n_1 + \ell} (\xi_1 - \bar{x}_1(k_t))^2\,,$$

$$s_2(k_t) = s_2^2 + \hat{s}_2^2(k_t) + \frac{n_2 (n - \ell)}{n_2 + n - \ell} (\xi_2 - \bar{x}_2(k_t))^2\,,$$

posterior updates of the hyperparameters.

Example (Mixture once again)

Bayes estimator of $\theta$:

$$\delta^\pi(x_1, \dots, x_n) = \sum_{\ell=0}^{n} \sum_{(k_t)} \omega(k_t)\, \mathbb{E}^\pi[\theta \mid x, (k_t)]$$

Too costly: $2^n$ terms

Example (Poly-t priors)

Normal observation $x \sim \mathcal{N}(\theta, 1)$, with conjugate prior $\theta \sim \mathcal{N}(\mu, \epsilon^2)$.
Closed form expression for the posterior mean

$$\int_{-\infty}^{\infty} \theta\, f(x \mid \theta)\, \pi(\theta)\, d\theta \Big/ \int_{-\infty}^{\infty} f(x \mid \theta)\, \pi(\theta)\, d\theta = \frac{\mu + \epsilon^2 x}{1 + \epsilon^2}\,.$$

Example (Poly-t priors (2))

More involved prior distribution: poly-t distribution
[Bauwens, 1985]

$$\pi(\theta) = \prod_{i=1}^{k} \left[ \alpha_i + (\theta - \beta_i)^2 \right]^{-\nu_i}\,, \qquad \alpha_i, \nu_i > 0$$

Computation of $\mathbb{E}[\theta \mid x]$ ???

Example (AR(p) model)

Auto-regressive representation of a time series,

$$x_t = \sum_{i=1}^{p} \theta_i x_{t-i} + \epsilon_t$$

If order $p$ unknown, predictive distribution of $x_{t+1}$ given by

$$\pi(x_{t+1} \mid x_t, \dots, x_1) \propto \int f(x_{t+1} \mid x_t, \dots, x_{t-p+1})\, \pi(\theta, p \mid x_t, \dots, x_1)\, d\theta\, dp\,,$$

Example (AR(p) model (cont'd))

Integration over the parameters of all models

$$\sum_{p=0}^{\infty} \int f(x_{t+1} \mid x_t, \dots, x_{t-p+1})\, \pi(\theta \mid p, x_t, \dots, x_1)\, d\theta\; \pi(p \mid x_t, \dots, x_1)\,.$$

Example (AR(p) model (cont'd))

Multiple layers of complexity
(i). Complex parameter space within each AR(p) model because of stationarity constraint
(ii). if $p$ unbounded, infinity of models
(iii). $\theta$ varies between models AR(p) and AR(p+1), with a different stationarity constraint (except for root reparameterisation).
(iv). if prediction used sequentially, every tick/second/hour/day, posterior distribution $\pi(\theta, p \mid x_t, \dots, x_1)$ must be re-evaluated

Random variable generation

Motivation and leading example
Random variable generation
Basic methods
Uniform pseudo-random generator
Beyond Uniform distributions
Transformation methods
Accept-Reject Methods
Fundamental theorem of simulation
Log-concave densities
Monte Carlo Integration
Notions on Markov Chains


Random variable generation

Rely on the possibility of producing (computer-wise) an endless flow of random variables (usually iid) from well-known distributions.
Given a uniform random number generator, illustration of methods that produce random variables from both standard and nonstandard distributions.
Basic methods

The inverse transform method

For a function $F$ on $\mathbb{R}$, the generalized inverse of $F$, $F^-$, is defined by

$$F^-(u) = \inf\{x;\; F(x) \geq u\}\,.$$

Definition (Probability Integral Transform)
If $U \sim \mathcal{U}_{[0,1]}$, then the random variable $F^-(U)$ has the distribution $F$.

The inverse transform method (2)

To generate a random variable $X \sim F$, simply generate

$$U \sim \mathcal{U}_{[0,1]}$$

and then make the transform

$$x = F^-(u)$$
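A minimal R illustration of the inverse transform, here for the standard Cauchy whose cdf inverts in closed form as $F^-(u) = \tan(\pi(u - 1/2))$ (the choice of example is ours):

# Inverse transform: standard Cauchy draws from uniforms
set.seed(42)
u <- runif(1e5)
x <- tan(pi * (u - 0.5))
# compare empirical quartiles with the exact Cauchy quantiles
quantile(x, c(.25, .5, .75)); qcauchy(c(.25, .5, .75))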

Uniform pseudo-random generator

Desiderata and limitations

Production of a deterministic sequence of values in [0, 1] which imitates a sequence of iid uniform random variables $\mathcal{U}_{[0,1]}$.
Can't use the physical imitation of a random draw [no guarantee of uniformity, no reproducibility].
Random sequence in the sense: Having generated $(X_1, \dots, X_n)$, knowledge of $X_n$ [or of $(X_1, \dots, X_n)$] imparts no discernible knowledge of the value of $X_{n+1}$.
Deterministic: Given the initial value $X_0$, sample $(X_1, \dots, X_n)$ always the same.
Validity of a random number generator based on a single sample $X_1, \dots, X_n$ when $n$ tends to $+\infty$, not on replications

$$(X_{11}, \dots, X_{1n}), (X_{21}, \dots, X_{2n}), \dots, (X_{k1}, \dots, X_{kn})$$

where $n$ fixed and $k$ tends to infinity.

Uniform pseudo-random generator

Algorithm starting from an initial value $0 \leq u_0 \leq 1$ and a transformation $D$, which produces a sequence

$$(u_i) = (D^i(u_0))$$

in [0, 1].
For all $n$,

$$(u_1, \dots, u_n)$$

reproduces the behavior of an iid $\mathcal{U}_{[0,1]}$ sample $(V_1, \dots, V_n)$ when compared through usual tests.
Uniform pseudo-random generator (2)

Validity means the sequence $U_1, \dots, U_n$ leads to accept the hypothesis

$$H : U_1, \dots, U_n \text{ are iid } \mathcal{U}_{[0,1]}\,.$$

The set of tests used is generally of some consequence:
Kolmogorov-Smirnov and other nonparametric tests
Time series methods, for correlation between $U_i$ and $(U_{i-1}, \dots, U_{i-k})$
Marsaglia's battery of tests called Die Hard (!)


Usual generators
In R and S-plus, procedure runif()
The Uniform Distribution
Description:
runif generates random deviates.
Example:
u <- runif(20)
.Random.seed is an integer vector, containing
the random number generator state for random
number generation in R. It can be saved and
restored, but should not be altered by users.

[Figure: a uniform sample (sequence plot and histogram)]


Usual generators (2)


In C, procedure rand() or random()
SYNOPSIS
#include <stdlib.h>
long int random(void);
DESCRIPTION
The random() function uses a non-linear additive
feedback random number generator employing a
default table of size 31 long integers to return
successive pseudo-random numbers in the range
from 0 to RAND_MAX. The period of this random
generator is very large, approximately
16*((2**31)-1).
RETURN VALUE
random() returns a value between 0 and RAND_MAX.

Usual generators (3)
In Scilab, procedure rand()
rand() : with no arguments gives a scalar whose
value changes each time it is referenced. By
default, random numbers are uniformly distributed
in the interval (0,1). rand('normal') switches to
a normal distribution with mean 0 and variance 1.
EXAMPLE
x=rand(10,10,'uniform')

Beyond Uniform distributions

Beyond Uniform generators

Generation of any sequence of random variables can be formally implemented through a uniform generator.
Distributions with explicit $F^-$ (for instance, exponential, and Weibull distributions): use the probability integral transform here.
Case specific methods rely on properties of the distribution (for instance, normal distribution, Poisson distribution).
More generic methods (for instance, accept-reject and ratio-of-uniform).

Simulation of the standard distributions is accomplished quite efficiently by many numerical and statistical programming packages.

Transformation methods

Case where a distribution $F$ is linked in a simple way to another distribution easy to simulate.

Example (Exponential variables)

If $U \sim \mathcal{U}_{[0,1]}$, the random variable

$$X = -\log U / \lambda$$

has distribution

$$P(X \leq x) = P(-\log U \leq \lambda x) = P(U \geq e^{-\lambda x}) = 1 - e^{-\lambda x}\,,$$

the exponential distribution $\mathcal{E}xp(\lambda)$.


Other random variables that can be generated starting from an exponential include

$$Y = -2 \sum_{j=1}^{\nu} \log(U_j) \sim \chi^2_{2\nu}$$

$$Y = -\frac{1}{\lambda} \sum_{j=1}^{a} \log(U_j) \sim \mathcal{G}a(a, \lambda)$$

$$Y = \frac{\sum_{j=1}^{a} \log(U_j)}{\sum_{j=1}^{a+b} \log(U_j)} \sim \mathcal{B}e(a, b)$$
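A direct R transcription of these transforms (a sketch; the parameter values are arbitrary):

# Gamma(a, lambda) and Beta(a, b) from uniforms, integer shapes only
set.seed(1)
a <- 3; b <- 2; lambda <- 1.5; m <- 1e5
U <- matrix(runif(m * (a + b)), m)
gamma_draws <- -rowSums(log(U[, 1:a])) / lambda          # Ga(a, lambda)
beta_draws  <- rowSums(log(U[, 1:a])) / rowSums(log(U))  # Be(a, b)
c(mean(gamma_draws), a / lambda)    # check against theoretical means
c(mean(beta_draws), a / (a + b))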


Points to note

Transformation quite simple to use.
There are more efficient algorithms for gamma and beta random variables.
Cannot generate gamma random variables with a non-integer shape parameter.
For instance, cannot get a $\chi^2_1$ variable, which would get us a $\mathcal{N}(0, 1)$ variable.


Box-Muller Algorithm

Example (Normal variables)
If $(r, \theta)$ are the polar coordinates of $(X_1, X_2)$, then

$$r^2 = X_1^2 + X_2^2 \sim \chi^2_2 = \mathcal{E}xp(1/2) \qquad \text{and} \qquad \theta \sim \mathcal{U}_{[0, 2\pi]}$$

Consequence: If $U_1, U_2$ iid $\mathcal{U}_{[0,1]}$,

$$X_1 = \sqrt{-2\log(U_1)}\, \cos(2\pi U_2)\,, \qquad X_2 = \sqrt{-2\log(U_1)}\, \sin(2\pi U_2)$$

are iid $\mathcal{N}(0, 1)$.


Box-Muller Algorithm (2)

1. Generate $U_1, U_2$ iid $\mathcal{U}_{[0,1]}$;
2. Define
$$x_1 = \sqrt{-2\log(u_1)}\, \cos(2\pi u_2)\,, \qquad x_2 = \sqrt{-2\log(u_1)}\, \sin(2\pi u_2)\,;$$
3. Take $x_1$ and $x_2$ as two independent draws from $\mathcal{N}(0, 1)$.
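A direct R transcription of the algorithm (illustrative sketch):

# Box-Muller: two iid N(0,1) draws per pair of uniforms
box_muller <- function(n) {
  u1 <- runif(n); u2 <- runif(n)
  r <- sqrt(-2 * log(u1))
  c(r * cos(2 * pi * u2), r * sin(2 * pi * u2))
}
set.seed(7)
x <- box_muller(5e4)
c(mean(x), var(x))  # close to 0 and 1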


Box-Muller Algorithm (3)

Unlike algorithms based on the CLT, this algorithm is exact.
Get two normals for the price of two uniforms.
Drawback (in speed): calculating log, cos and sin.


More transforms

Example (Poisson generation)

Poisson-exponential connection:
If $N \sim \mathcal{P}(\lambda)$ and $X_i \sim \mathcal{E}xp(\lambda)$, $i \in \mathbb{N}^*$,

$$P_\lambda(N = k) = P_\lambda(X_1 + \cdots + X_k \leq 1 < X_1 + \cdots + X_{k+1})\,.$$


More Poisson

A Poisson can be simulated by generating $\mathcal{E}xp(\lambda)$ variables until their sum exceeds 1.
This method is simple, but is really practical only for smaller values of $\lambda$.
On average, the number of exponential variables required is $\lambda$.
Other approaches are more suitable for large $\lambda$'s.
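In R, the exponential-sum construction reads (illustrative sketch, function name ours):

# Poisson via the exponential connection: count Exp(lambda) arrivals before time 1
rpois_exp <- function(lambda) {
  k <- 0; s <- rexp(1, lambda)
  while (s <= 1) { k <- k + 1; s <- s + rexp(1, lambda) }
  k
}
set.seed(3)
mean(replicate(1e4, rpois_exp(4)))  # ~ 4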


Atkinson's Poisson

To generate $N \sim \mathcal{P}(\lambda)$:

1. Define $\beta = \pi/\sqrt{3\lambda}$, $\alpha = \beta\lambda$ and $k = \log c - \lambda - \log \beta$;
2. Generate $U_1 \sim \mathcal{U}_{[0,1]}$ and calculate
$$x = \{\alpha - \log\{(1 - u_1)/u_1\}\}/\beta$$
until $x > -0.5$;
3. Define $N = \lfloor x + 0.5 \rfloor$ and generate $U_2 \sim \mathcal{U}_{[0,1]}$;
4. Accept $N$ if
$$\alpha - \beta x + \log\left( u_2 / \{1 + \exp(\alpha - \beta x)\}^2 \right) \leq k + N \log \lambda - \log N!\,.$$

Negative extension

A generator of Poisson random variables can produce negative binomial random variables since

$$Y \sim \mathcal{G}a(n, (1-p)/p)\,, \qquad X \mid y \sim \mathcal{P}(y)$$

implies

$$X \sim \mathcal{N}eg(n, p)$$

Mixture representation

The representation of the negative binomial is a particular case of a mixture distribution.
The principle of a mixture representation is to represent a density $f$ as the marginal of another distribution, for example

$$f(x) = \sum_{i \in \mathcal{Y}} p_i\, f_i(x)\,.$$

If the component distributions $f_i(x)$ can be easily generated, $X$ can be obtained by first choosing $f_i$ with probability $p_i$ and then generating an observation from $f_i$.
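A short R sketch of this two-step scheme, reusing the two-component normal mixture from the first chapter (our parameter choices):

# Sampling from a finite mixture: pick a component, then draw from it
rmix <- function(n, p = 0.7, mu = c(1, -1)) {
  comp <- sample(1:2, n, replace = TRUE, prob = c(p, 1 - p))
  rnorm(n, mean = mu[comp], sd = 1)
}
set.seed(2)
x <- rmix(1e4)
mean(x)  # ~ 0.7*1 + 0.3*(-1) = 0.4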


Partitioned sampling

Special case of mixture sampling when

$$f_i(x) = f(x)\, \mathbb{I}_{A_i}(x) \Big/ \int_{A_i} f(x)\, dx$$

and

$$p_i = \Pr(X \in A_i)$$

for a partition $(A_i)_i$.


Accept-Reject Methods

Accept-Reject algorithm

Many distributions from which it is difficult, or even impossible, to directly simulate.
Another class of methods only requires us to know the functional form of the density $f$ of interest up to a multiplicative constant.
The key to this method is to use a simpler (simulation-wise) density $g$, the instrumental density, from which the simulation from the target density $f$ is actually done.


Fundamental theorem of simulation

Lemma
Simulating

$$X \sim f(x)$$

is equivalent to simulating

$$(X, U) \sim \mathcal{U}\{(x, u) : 0 < u < f(x)\}$$

[Figure: uniform sampling of the region under the graph of $f$]


The Accept-Reject algorithm

Given a density of interest $f$, find a density $g$ and a constant $M$ such that

$$f(x) \leq M g(x)$$

on the support of $f$.

1. Generate $X \sim g$, $U \sim \mathcal{U}_{[0,1]}$;
2. Accept $Y = X$ if $U \leq f(X)/(M g(X))$;
3. Return to 1. otherwise.
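A generic R sketch of this loop (the function name accept_reject and the argument names rg/dg for the instrumental sampler and density are our conventions):

# Generic accept-reject: f known pointwise, envelope M * g
accept_reject <- function(n, f, rg, dg, M) {
  out <- numeric(0)
  while (length(out) < n) {
    x <- rg(n); u <- runif(n)
    out <- c(out, x[u <= f(x) / (M * dg(x))])  # keep accepted draws
  }
  out[1:n]
}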


Validation of the Accept-Reject method

Warranty:
This algorithm produces a variable $Y$ distributed according to $f$.


Two interesting properties

First, it provides a generic method to simulate from any density $f$ that is known up to a multiplicative factor.
Property particularly important in Bayesian calculations where the posterior distribution

$$\pi(\theta \mid x) \propto \pi(\theta)\, f(x \mid \theta)$$

is specified up to a normalizing constant.
Second, the probability of acceptance in the algorithm is $1/M$, i.e., the expected number of trials until a variable is accepted is $M$.


More interesting properties

In cases $f$ and $g$ both probability densities, the constant $M$ is necessarily larger than 1.
The size of $M$, and thus the efficiency of the algorithm, are functions of how closely $g$ can imitate $f$, especially in the tails.
For $f/g$ to remain bounded, it is necessary for $g$ to have tails thicker than those of $f$.
It is therefore impossible to use the A-R algorithm to simulate a Cauchy distribution $f$ using a normal distribution $g$; however, the reverse works quite well.


No Cauchy!

Example (Normal from a Cauchy)

Take

$$f(x) = \frac{1}{\sqrt{2\pi}} \exp(-x^2/2)$$

and

$$g(x) = \frac{1}{\pi} \frac{1}{1 + x^2}\,,$$

densities of the normal and Cauchy distributions.
Then

$$\frac{f(x)}{g(x)} = \sqrt{\frac{\pi}{2}}\, (1 + x^2)\, e^{-x^2/2} \leq \sqrt{\frac{2\pi}{e}} = 1.52\,,$$

attained at $x = \pm 1$.


Example (Normal from a Cauchy (2))

So probability of acceptance

$$1/1.52 = 0.66\,,$$

and, on the average, one out of every three simulated Cauchy variables is rejected.
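Running the accept_reject() sketch above on this example:

# Normal draws from a Cauchy envelope, M = sqrt(2*pi/e) ~ 1.52
M <- sqrt(2 * pi / exp(1))
x <- accept_reject(1e4, f = dnorm, rg = function(n) rcauchy(n),
                   dg = dcauchy, M = M)
c(mean(x), sd(x))  # close to 0 and 1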


No Double!

Example (Normal/Double Exponential)

Generate a $\mathcal{N}(0, 1)$ by using a double-exponential distribution with density

$$g(x \mid \alpha) = (\alpha/2) \exp(-\alpha|x|)$$

Then

$$\frac{f(x)}{g(x \mid \alpha)} \leq \sqrt{\frac{2}{\pi}}\, \alpha^{-1} e^{\alpha^2/2}$$

and the minimum of this bound (in $\alpha$) is attained for $\alpha^\star = 1$.

Example (Normal/Double Exponential (2))

Probability of acceptance

$$\sqrt{\pi/2e} = .76$$

To produce one normal random variable requires on the average $1/.76 \approx 1.3$ uniform variables.


Example (Gamma generation)

Illustrates a real advantage of the Accept-Reject algorithm.
The gamma distribution $\mathcal{G}a(\alpha, \beta)$ can be represented as the sum of exponential random variables only if $\alpha$ is an integer.


Example (Gamma generation (2))

Can use the Accept-Reject algorithm with instrumental distribution $\mathcal{G}a(a, b)$, with $a = \lfloor \alpha \rfloor$. (Without loss of generality, $\beta = 1$.)
Up to a normalizing constant,

$$f/g_b = b^{-a}\, x^{\alpha - a} \exp\{-(1 - b)x\} \leq b^{-a} \left( \frac{\alpha - a}{(1 - b)e} \right)^{\alpha - a}$$

for $b \leq 1$. The maximum is attained at $b = a/\alpha$.


Cheng and Feast's Gamma generator

Gamma $\mathcal{G}a(\alpha, 1)$, $\alpha > 1$ distribution

1. Define $c_1 = \alpha - 1$, $c_2 = (\alpha - (1/6\alpha))/c_1$, $c_3 = 2/c_1$, $c_4 = 1 + c_3$, and $c_5 = 1/\sqrt{\alpha}$.
2. Repeat
   generate $U_1, U_2$
   take $U_1 = U_2 + c_5 (1 - 1.86 U_1)$ if $\alpha > 2.5$
   until $0 < U_1 < 1$.
3. Set $W = c_2 U_2 / U_1$.
4. If $c_3 U_1 + W + W^{-1} \leq c_4$ or $c_3 \log U_1 - \log W + W \leq 1$, take $c_1 W$; otherwise, repeat.
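A direct R transcription of these steps (a sketch, assuming the reconstruction of the constants above):

# Cheng and Feast's Ga(alpha, 1) generator, alpha > 1
rgamma_cf <- function(alpha) {
  c1 <- alpha - 1; c2 <- (alpha - 1/(6*alpha))/c1
  c3 <- 2/c1; c4 <- 1 + c3; c5 <- 1/sqrt(alpha)
  repeat {
    repeat {
      u1 <- runif(1); u2 <- runif(1)
      if (alpha > 2.5) u1 <- u2 + c5 * (1 - 1.86 * u1)
      if (u1 > 0 && u1 < 1) break
    }
    w <- c2 * u2 / u1
    if (c3 * u1 + w + 1/w <= c4 || c3 * log(u1) - log(w) + w <= 1)
      return(c1 * w)
  }
}
set.seed(4)
mean(replicate(1e4, rgamma_cf(3.5)))  # ~ alpha = 3.5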


Truncated Normal simulation

Example (Truncated Normal distributions)
Constraint $x \geq \underline{\mu}$ produces a density proportional to

$$e^{-(x - \mu)^2/2\sigma^2}\, \mathbb{I}_{x \geq \underline{\mu}}$$

for a bound $\underline{\mu}$ large compared with $\mu$.

There exist alternatives far superior to the naive method of generating a $\mathcal{N}(\mu, \sigma^2)$ until exceeding $\underline{\mu}$, which requires an average number of

$$1/\Phi((\mu - \underline{\mu})/\sigma)$$

simulations from $\mathcal{N}(\mu, \sigma^2)$ for a single acceptance.


Example (Truncated Normal distributions (2))

Instrumental distribution: translated exponential distribution, $\mathcal{E}(\alpha, \underline{\mu})$, with density

$$g_\alpha(z) = \alpha\, e^{-\alpha(z - \underline{\mu})}\, \mathbb{I}_{z \geq \underline{\mu}}\,.$$

The ratio $f/g_\alpha$ is bounded by

$$f/g_\alpha \leq \begin{cases} 1/\alpha\, \exp(\alpha^2/2 - \alpha\underline{\mu}) & \text{if } \alpha > \underline{\mu}\,, \\ 1/\alpha\, \exp(-\underline{\mu}^2/2) & \text{otherwise.} \end{cases}$$


Log-concave densities

Log-concave densities (1)

Densities $f$ whose logarithm is concave, for instance Bayesian posterior distributions such that

$$\log \pi(\theta \mid x) = \log \pi(\theta) + \log f(x \mid \theta) + c$$

is concave.


Log-concave densities (2)

Take

$$S_n = \{x_i,\, i = 0, 1, \dots, n+1\} \subset \text{supp}(f)$$

such that $h(x_i) = \log f(x_i)$ known up to the same constant.
By concavity of $h$, the line $L_{i,i+1}$ through $(x_i, h(x_i))$ and $(x_{i+1}, h(x_{i+1}))$ is
below $h$ in $[x_i, x_{i+1}]$ and
above this graph outside this interval.

[Figure: chords $L_{i,i+1}$ of $\log f(x)$ at the points $x_1, \dots, x_4$]


Log-concave densities (3)

For $x \in [x_i, x_{i+1}]$, if

$$\overline{h}_n(x) = \min\{L_{i-1,i}(x), L_{i+1,i+2}(x)\} \qquad \text{and} \qquad \underline{h}_n(x) = L_{i,i+1}(x)\,,$$

the envelopes are

$$\underline{h}_n(x) \leq h(x) \leq \overline{h}_n(x)$$

uniformly on the support of $f$, with

$$\underline{h}_n(x) = -\infty \qquad \text{and} \qquad \overline{h}_n(x) = \min(L_{0,1}(x), L_{n,n+1}(x))$$

on $[x_0, x_{n+1}]^c$.


Log-concave densities (4)

Therefore, if

$$\underline{f}_n(x) = \exp \underline{h}_n(x) \qquad \text{and} \qquad \overline{f}_n(x) = \exp \overline{h}_n(x)$$

then

$$\underline{f}_n(x) \leq f(x) \leq \overline{f}_n(x) = \varpi_n\, g_n(x)\,,$$

where $\varpi_n$ is the normalizing constant of $\overline{f}_n$.


ARS Algorithm

1. Initialize $n$ and $S_n$.
2. Generate $X \sim g_n(x)$, $U \sim \mathcal{U}_{[0,1]}$.
3. If $U \leq \underline{f}_n(X)/\varpi_n\, g_n(X)$, accept $X$; otherwise, if $U \leq f(X)/\varpi_n\, g_n(X)$, accept $X$.


Example (Northern Pintail ducks)

Ducks captured at time $i$ with both probability $p_i$ and size $N$ of the population unknown.
Dataset

$$(n_1, \dots, n_{11}) = (32, 20, 8, 5, 1, 2, 0, 2, 1, 1, 0)$$

Number of recoveries over the years 1957-1968 of $N = 1612$ Northern Pintail ducks banded in 1956.


Example (Northern Pintail ducks (2))

Corresponding conditional likelihood

$$L(p_1, \dots, p_I \mid N, n_1, \dots, n_I) = \frac{N!}{(N - r)!} \prod_{i=1}^{I} p_i^{n_i} (1 - p_i)^{N - n_i}\,,$$

where $I$ is the number of captures, $n_i$ the number of captured animals during the $i$th capture, and $r$ the total number of different captured animals.


Example (Northern Pintail ducks (3))

Prior selection
If

$$N \sim \mathcal{P}(\lambda)$$

and

$$\alpha_i = \log \frac{p_i}{1 - p_i} \sim \mathcal{N}(\mu_i, \sigma^2)\,,$$

[Normal logistic]


Example (Northern Pintail ducks (4))

Posterior distribution

$$\pi(\alpha, N \mid \lambda, n_1, \dots, n_I) \propto \frac{N!}{(N - r)!} \frac{\lambda^N}{N!} \prod_{i=1}^{I} (1 + e^{\alpha_i})^{-N} \exp\left\{ \alpha_i n_i - \frac{1}{2\sigma^2} (\alpha_i - \mu_i)^2 \right\}$$


Example (Northern Pintail ducks (5))

For the conditional posterior distribution

$$\pi(\alpha_i \mid N, n_1, \dots, n_I) \propto \exp\left\{ \alpha_i n_i - \frac{1}{2\sigma^2} (\alpha_i - \mu_i)^2 \right\} (1 + e^{\alpha_i})^{-N}\,,$$

the ARS algorithm can be implemented since

$$\alpha_i n_i - \frac{1}{2\sigma^2} (\alpha_i - \mu_i)^2 - N \log(1 + e^{\alpha_i})$$

is concave in $\alpha_i$.


[Figure: posterior distributions of capture log-odds ratios for the years 1957-1965]

[Figure: true distribution versus histogram of simulated sample (year 1960)]


Monte Carlo Integration

Motivation and leading example
Random variable generation
Monte Carlo Integration
Introduction
Monte Carlo integration
Importance Sampling
Acceleration methods
Bayesian importance sampling
Notions on Markov Chains
The Metropolis-Hastings Algorithm


Introduction

Quick reminder

Two major classes of numerical problems that arise in statistical inference:
Optimization - generally associated with the likelihood approach
Integration - generally associated with the Bayesian approach


Example (Bayesian decision theory)

Bayes estimators are not always posterior expectations, but rather solutions of the minimization problem

$$\min_\delta \int_\Theta L(\delta, \theta)\, \pi(\theta)\, f(x \mid \theta)\, d\theta\,.$$

Proper loss:
For $L(\delta, \theta) = (\delta - \theta)^2$, the Bayes estimator is the posterior mean.
Absolute error loss:
For $L(\delta, \theta) = |\delta - \theta|$, the Bayes estimator is the posterior median.
With no loss function:
use the maximum a posteriori (MAP) estimator

$$\arg\max_\theta\, \ell(\theta \mid x)\, \pi(\theta)$$


Monte Carlo integration

Theme:
Generic problem of evaluating the integral

$$\mathfrak{I} = \mathbb{E}_f[h(X)] = \int_{\mathcal{X}} h(x)\, f(x)\, dx$$

where $\mathcal{X}$ is uni- or multidimensional, $f$ is a closed form, partly closed form, or implicit density, and $h$ is a function.


Monte Carlo integration (2)

Monte Carlo solution
First use a sample $(X_1, \dots, X_m)$ from the density $f$ to approximate the integral $\mathfrak{I}$ by the empirical average

$$\overline{h}_m = \frac{1}{m} \sum_{j=1}^{m} h(x_j)\,,$$

which converges

$$\overline{h}_m \longrightarrow \mathbb{E}_f[h(X)]$$

by the Strong Law of Large Numbers.
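A two-line R illustration of this convergence (the target $\mathbb{E}[X^2] = 1$ under $f = \mathcal{N}(0,1)$ is our choice):

# Monte Carlo estimate of E_f[h(X)] for h(x) = x^2, f = N(0,1)
set.seed(5)
x <- rnorm(1e5)
h_bar <- cumsum(x^2) / seq_along(x)  # running empirical averages
tail(h_bar, 1)                       # ~ 1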


Monte Carlo precision

Estimate the variance with

$$v_m = \frac{1}{m} \frac{1}{m - 1} \sum_{j=1}^{m} [h(x_j) - \overline{h}_m]^2\,,$$

and for $m$ large,

$$\frac{\overline{h}_m - \mathbb{E}_f[h(X)]}{\sqrt{v_m}} \sim \mathcal{N}(0, 1)\,.$$

Note: This can lead to the construction of a convergence test and of confidence bounds on the approximation of $\mathbb{E}_f[h(X)]$.


Example (Cauchy prior / normal sample)

For estimating a normal mean, a robust prior is a Cauchy prior

$$X \sim \mathcal{N}(\theta, 1)\,, \qquad \theta \sim \mathcal{C}(0, 1)\,.$$

Under squared error loss, posterior mean

$$\delta^\pi(x) = \frac{\displaystyle \int_{-\infty}^{\infty} \frac{\theta}{1 + \theta^2}\, e^{-(x - \theta)^2/2}\, d\theta}{\displaystyle \int_{-\infty}^{\infty} \frac{1}{1 + \theta^2}\, e^{-(x - \theta)^2/2}\, d\theta}$$


Example (Cauchy prior / normal sample (2))

Form of $\delta^\pi$ suggests simulating iid variables

$$\theta_1, \dots, \theta_m \sim \mathcal{N}(x, 1)$$

and calculating

$$\hat{\delta}^\pi_m(x) = \sum_{i=1}^{m} \frac{\theta_i}{1 + \theta_i^2} \Big/ \sum_{i=1}^{m} \frac{1}{1 + \theta_i^2}\,.$$

The Law of Large Numbers implies

$$\hat{\delta}^\pi_m(x) \longrightarrow \delta^\pi(x) \quad \text{as } m \to \infty\,.$$
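In R, the estimator reads (function name ours):

# Monte Carlo approximation of the Cauchy-prior posterior mean
delta_hat <- function(x, m = 1e5) {
  theta <- rnorm(m, mean = x)
  sum(theta / (1 + theta^2)) / sum(1 / (1 + theta^2))
}
set.seed(6)
delta_hat(10)  # shrinks x = 10 slightly toward the prior median 0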


[Figure: range of estimators $\hat{\delta}^\pi_m$ over 1000 iterations, for 100 runs and $x = 10$]


Importance Sampling

Importance sampling

Paradox
Simulation from $f$ (the true density) is not necessarily optimal.
Alternative to direct sampling from $f$ is importance sampling, based on the alternative representation

$$\mathbb{E}_f[h(X)] = \int_{\mathcal{X}} \left[ h(x)\, \frac{f(x)}{g(x)} \right] g(x)\, dx\,,$$

which allows us to use other distributions than $f$.


Importance sampling algorithm

Evaluation of

$$\mathbb{E}_f[h(X)] = \int_{\mathcal{X}} h(x)\, f(x)\, dx$$

by
1. Generate a sample $X_1, \dots, X_m$ from a distribution $g$
2. Use the approximation

$$\frac{1}{m} \sum_{j=1}^{m} \frac{f(X_j)}{g(X_j)}\, h(X_j)$$
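A minimal R sketch (target, instrumental and $h$ are our illustrative choices: $f = t_5$, $g$ Cauchy, $h(x) = x^2$, so the true value is $5/3$):

# Importance sampling with a heavier-tailed instrumental
set.seed(8)
m <- 1e5
x <- rcauchy(m)
w <- dt(x, df = 5) / dcauchy(x)  # importance weights f/g
sum(w * x^2) / m                 # IS estimate of E[X^2] = 5/3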


Same thing as before!!!

Convergence of the estimator

$$\frac{1}{m} \sum_{j=1}^{m} \frac{f(X_j)}{g(X_j)}\, h(X_j) \longrightarrow \int_{\mathcal{X}} h(x)\, f(x)\, dx$$

converges for any choice of the distribution $g$
[as long as $\text{supp}(g) \supset \text{supp}(f)$]


Important details

Instrumental distribution $g$ chosen from distributions easy to simulate.
The same sample (generated from $g$) can be used repeatedly, not only for different functions $h$, but also for different densities $f$.
Even dependent proposals can be used, as seen later.


Although $g$ can be any density, some choices are better than others:
Finite variance only when

$$\mathbb{E}_f\left[ h^2(X)\, \frac{f(X)}{g(X)} \right] = \int_{\mathcal{X}} h^2(x)\, \frac{f^2(x)}{g(x)}\, dx < \infty\,.$$

Instrumental distributions with tails lighter than those of $f$ (that is, with $\sup f/g = \infty$) not appropriate.
If $\sup f/g = \infty$, the weights $f(x_j)/g(x_j)$ vary widely, giving too much importance to a few values $x_j$.
If $\sup f/g = M < \infty$, the accept-reject algorithm can be used as well to simulate $f$ directly.


Example (Cauchy target)

Case of Cauchy distribution $\mathcal{C}(0, 1)$ when importance function is Gaussian $\mathcal{N}(0, 1)$.
Ratio of the densities

$$\varrho(x) = \frac{p^\star(x)}{p_0(x)} = \sqrt{2\pi}\, \frac{\exp\{x^2/2\}}{\pi (1 + x^2)}$$

very badly behaved: e.g.,

$$\int_{-\infty}^{\infty} \varrho(x)^2\, p_0(x)\, dx = \infty\,.$$

Poor performances of the associated importance sampling estimator.


[Figure: range and average of 500 replications of IS estimate of $\mathbb{E}[\exp X]$ over 10,000 iterations]


Optimal importance function

The choice of $g$ that minimizes the variance of the importance sampling estimator is

$$g^\star(x) = \frac{|h(x)|\, f(x)}{\int |h(z)|\, f(z)\, dz}\,.$$

Rather formal optimality result since optimal choice of $g^\star(x)$ requires the knowledge of $\mathfrak{I}$, the integral of interest!


Practical impact

$$\sum_{j=1}^{m} h(X_j)\, \frac{f(X_j)}{g(X_j)} \Big/ \sum_{j=1}^{m} \frac{f(X_j)}{g(X_j)}\,,$$

where $f$ and $g$ are known up to constants.
Also converges to $\mathfrak{I}$ by the Strong Law of Large Numbers.
Biased, but the bias is quite small.
In some settings beats the unbiased estimator in squared error loss.
Using the optimal solution does not always work:

$$\frac{\sum_{j=1}^{m} h(x_j)\, f(x_j)/|h(x_j)|\, f(x_j)}{\sum_{j=1}^{m} f(x_j)/|h(x_j)|\, f(x_j)} = \frac{\#\text{positive } h - \#\text{negative } h}{\sum_{j=1}^{m} 1/|h(x_j)|}$$


Self-normalised importance sampling

For ratio estimator

$$\overline{h}_n = \sum_{i=1}^{n} \omega_i\, h(x_i) \Big/ \sum_{i=1}^{n} \omega_i$$

with $X_i \sim g(y)$ and $W_i$ such that

$$\mathbb{E}[W_i \mid X_i = x] = \kappa\, f(x)/g(x)$$
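In R, on the same $t_5$/Cauchy pair as above, with the target density deliberately left unnormalised to show that constants cancel (an illustrative sketch):

# Self-normalised IS: weights known only up to a constant
set.seed(9)
m <- 1e5
x <- rcauchy(m)
w <- (1 + x^2/5)^(-3) / dcauchy(x)  # unnormalised t_5 density over g
sum(w * x^2) / sum(w)               # ~ 5/3, no constants needed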


Self-normalised variance

Then

$$\text{var}(\overline{h}_n) \approx \frac{1}{n^2 \kappa^2} \left( \text{var}(S_h^n) - 2\, \mathbb{E}^\pi[h]\, \text{cov}(S_h^n, S_1^n) + \mathbb{E}^\pi[h]^2\, \text{var}(S_1^n) \right)$$

for

$$S_h^n = \sum_{i=1}^{n} W_i\, h(X_i)\,, \qquad S_1^n = \sum_{i=1}^{n} W_i\,.$$

Rough approximation

$$\text{var}(\overline{h}_n) \approx \frac{1}{n}\, \text{var}_\pi(h(X))\, \{1 + \text{var}_g(W)\}$$


Example (Student's t distribution)

$X \sim \mathcal{T}(\nu, \theta, \sigma^2)$, with density

$$f(x) = \frac{\Gamma((\nu + 1)/2)}{\sigma \sqrt{\nu\pi}\, \Gamma(\nu/2)} \left( 1 + \frac{(x - \theta)^2}{\nu \sigma^2} \right)^{-(\nu + 1)/2}$$

Without loss of generality, take $\theta = 0$, $\sigma = 1$.
Problem: Calculate the integral

$$\int_{2.1}^{\infty} \left( \frac{\sin(x)}{x} \right)^n f(x)\, dx\,.$$


Example (Student's t distribution (2))

Simulation possibilities:
Directly from $f$, since $f$ is the density of the ratio $\mathcal{N}(0, 1) \big/ \sqrt{\chi^2_\nu / \nu}$
Importance sampling using Cauchy $\mathcal{C}(0, 1)$
Importance sampling using a normal $\mathcal{N}(0, 1)$ (expected to be nonoptimal)
Importance sampling using a $\mathcal{U}([0, 1/2.1])$ after a change of variables


[Figure: sampling from $f$ (solid lines), importance sampling with Cauchy instrumental (short dashes), $\mathcal{U}([0, 1/2.1])$ instrumental (long dashes) and normal instrumental (dots)]


IS suffers from curse of dimensionality

As dimension increases, discrepancy between importance and target worsens.

Explanation:
Take target distribution $\pi$ and instrumental distribution $\nu$; set $\pi_n = \pi^{\otimes n}$ and $\nu_n = \nu^{\otimes n}$.
Simulation of $N$ iid samples $\xi^i_{1:n}$ of size $n$ from $\nu_n$.
Importance sampling estimator for $\pi_n(f_n) = \int f_n(x_{1:n})\, \pi_n(dx_{1:n})$:

$$\widehat{\pi_n(f_n)} = \frac{\sum_{i=1}^{N} f_n(\xi^i_{1:n}) \prod_{j=1}^{n} W_j^i}{\sum_{i=1}^{N} \prod_{j=1}^{n} W_j^i}\,,$$

where $W_k^i = \dfrac{d\pi}{d\nu}(\xi_k^i)$, and the $\xi^i$ are iid with distribution $\nu$.
For $\{V_k\}_{k \geq 0}$, sequence of iid nonnegative random variables and for $n \geq 1$, $\mathcal{F}_n = \sigma(V_k;\, k \leq n)$, set

$$U_n = \prod_{k=1}^{n} V_k$$


Since $\mathbb{E}[V_{n+1}] = 1$ and $V_{n+1}$ independent from $\mathcal{F}_n$,

$$\mathbb{E}(U_{n+1} \mid \mathcal{F}_n) = U_n\, \mathbb{E}(V_{n+1} \mid \mathcal{F}_n) = U_n\,,$$

and thus $\{U_n\}_{n \geq 0}$ is a martingale.
Since $x \mapsto \sqrt{x}$ is concave, by Jensen's inequality,

$$\mathbb{E}(\sqrt{U_{n+1}} \mid \mathcal{F}_n) \leq \sqrt{\mathbb{E}(U_{n+1} \mid \mathcal{F}_n)} \leq \sqrt{U_n}\,,$$

and thus $\{\sqrt{U_n}\}_{n \geq 0}$ is a supermartingale.
Assume $\mathbb{E}(\sqrt{V_{n+1}}) < 1$. Then

$$\mathbb{E}(\sqrt{U_n}) = \prod_{k=1}^{n} \mathbb{E}(\sqrt{V_k}) \longrightarrow 0\,, \qquad n \to \infty\,.$$


But $\{\sqrt{U_n}\}_{n \geq 0}$ is a nonnegative supermartingale and thus $\sqrt{U_n}$ converges a.s. to a random variable $Z \geq 0$. By Fatou's lemma,

$$\mathbb{E}(Z) = \mathbb{E}\left( \lim_{n \to \infty} \sqrt{U_n} \right) \leq \liminf_{n \to \infty} \mathbb{E}(\sqrt{U_n}) = 0\,.$$

Hence, $Z = 0$ and $U_n \to 0$ a.s., which implies that the martingale $\{U_n\}_{n \geq 0}$ is not regular.
Apply these results to $V_k = \dfrac{d\pi}{d\nu}(\xi_k^i)$, $i \in \{1, \dots, N\}$:

$$\mathbb{E}\left[ \sqrt{\frac{d\pi}{d\nu}(\xi_k^i)} \right] \leq \sqrt{\mathbb{E}\left[ \frac{d\pi}{d\nu}(\xi_k^i) \right]} = 1\,,$$

with equality iff $\dfrac{d\pi}{d\nu} = 1$, $\nu$-a.e., i.e. $\pi = \nu$.
Thus all importance weights converge to 0.


Monte Carlo Integration
Importance Sampling

too volatile!

Example (Stochastic volatility model)

  yt = exp(xt/2) εt,   εt ~ N(0, 1)

with AR(1) log-variance process (or volatility)

  xt+1 = φxt + σut,   ut ~ N(0, 1)

Markov Chain Monte Carlo Methods


Monte Carlo Integration
Importance Sampling

[Figure] Evolution of IBM stocks (corrected from trend and log-ratio-ed) against time, t = 1, ..., 500.

Markov Chain Monte Carlo Methods


Monte Carlo Integration
Importance Sampling

Example (Stochastic volatility model (2))

Observed likelihood unavailable in closed form.
Joint posterior (or conditional) distribution of the hidden state sequence {Xk}_{1≤k≤K} can be evaluated explicitly as

  Π_{k=2}^K exp −{ σ⁻²(xk − φxk−1)² + exp(−xk)yk² + xk }/2 ,   (2)

up to a normalizing constant.

Markov Chain Monte Carlo Methods


Monte Carlo Integration
Importance Sampling

Computational problems

Example (Stochastic volatility model (3))


Direct simulation from this distribution impossible because of
(a) dependence among the Xk's,
(b) dimension of the sequence {Xk}_{1≤k≤K}, and
(c) the exponential term exp(−xk)yk² within (2).

Markov Chain Monte Carlo Methods


Monte Carlo Integration
Importance Sampling

Importance sampling

Example (Stochastic volatility model (4))

Natural candidate: replace the exponential term with a quadratic approximation to preserve Gaussianity.
E.g., expand exp(−xk) around its conditional expectation φxk−1 as

  exp(−xk) ≈ exp(−φxk−1) { 1 − (xk − φxk−1) + ½(xk − φxk−1)² }

Markov Chain Monte Carlo Methods


Monte Carlo Integration
Importance Sampling

Example (Stochastic volatility model (5))

Corresponding Gaussian importance distribution with mean

  µk = [ φxk−1 {σ⁻² + yk² exp(−φxk−1)/2} − {1 − yk² exp(−φxk−1)}/2 ] ÷ {σ⁻² + yk² exp(−φxk−1)/2}

and variance

  σk² = ( σ⁻² + yk² exp(−φxk−1)/2 )⁻¹

Prior proposal on X1,

  X1 ~ N(0, σ²)

Markov Chain Monte Carlo Methods


Monte Carlo Integration
Importance Sampling

Example (Stochastic volatility model (6))

Simulation starts with X1 and proceeds forward to Xn, each Xk being generated conditional on Yk and the previously generated Xk−1.
Importance weight computed sequentially as the product of

  exp −{ σ⁻²(xk − φxk−1)² + exp(−xk)yk² + xk }/2  ÷  [ exp{−σk⁻²(xk − µk)²/2} σk⁻¹ ]   (1 ≤ k ≤ K)
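A minimal Python sketch of this sequential importance sampler, using the reconstructed proposal mean and variance above; the parameter values φ = 0.9, σ = 0.5, K = 100, N = 10,000 are illustrative assumptions, not fixed on the slides:

```python
import numpy as np

rng = np.random.default_rng(1)
phi, sigma, K, N = 0.9, 0.5, 100, 10_000   # assumed parameter values

# simulate data from the stochastic volatility model
x_true = np.zeros(K)
for k in range(1, K):
    x_true[k] = phi * x_true[k-1] + sigma * rng.standard_normal()
y = np.exp(x_true / 2) * rng.standard_normal(K)

# sequential importance sampling with the Gaussian proposal
x = np.zeros((N, K))
logw = np.zeros(N)
x[:, 0] = sigma * rng.standard_normal(N)        # prior proposal X1 ~ N(0, sigma^2)
for k in range(1, K):
    xp = x[:, k-1]
    prec = 1/sigma**2 + y[k]**2 * np.exp(-phi * xp) / 2          # 1/sigma_k^2
    mu = (phi*xp*prec - (1 - y[k]**2*np.exp(-phi*xp))/2) / prec  # mu_k
    tau = 1 / np.sqrt(prec)
    x[:, k] = mu + tau * rng.standard_normal(N)
    # sequential weight update: target term over proposal density (log scale)
    log_target = -((x[:, k] - phi*xp)**2/sigma**2
                   + np.exp(-x[:, k])*y[k]**2 + x[:, k]) / 2
    log_prop = -(x[:, k] - mu)**2 * prec / 2 - np.log(tau)
    logw += log_target - log_prop

w = np.exp(logw - logw.max()); w /= w.sum()
x_best = w @ x          # weighted "best fit" of the volatility path
```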

Markov Chain Monte Carlo Methods


Monte Carlo Integration
Importance Sampling

[Figure] Histogram of the logarithms of the importance weights (left) and comparison between the true volatility and the best fit, based on 10,000 simulated importance samples (right).

Markov Chain Monte Carlo Methods


Monte Carlo Integration
Importance Sampling

[Figure] Corresponding range of the simulated {Xk}_{1≤k≤100}, compared with the true value.

Markov Chain Monte Carlo Methods


Monte Carlo Integration
Acceleration methods

Correlated simulations

Negative correlation reduces variance.
Special technique but efficient when it applies.
Two samples (X1, ..., Xm) and (Y1, ..., Ym) from f to estimate

  I = ∫_R h(x)f(x)dx

by

  Î1 = (1/m) Σ_{i=1}^m h(Xi)   with mean I and variance σ²

and

  Î2 = (1/m) Σ_{i=1}^m h(Yi)

Markov Chain Monte Carlo Methods


Monte Carlo Integration
Acceleration methods

Variance reduction

Variance of the average

  var( (Î1 + Î2)/2 ) = σ²/2 + (1/2) cov(Î1, Î2).

If the two samples are negatively correlated,

  cov(Î1, Î2) ≤ 0,

they improve on two independent samples of same size.

Markov Chain Monte Carlo Methods


Monte Carlo Integration
Acceleration methods

Antithetic variables

If f symmetric about µ, take Yi = 2µ − Xi

If Xi = F⁻¹(Ui), take Yi = F⁻¹(1 − Ui)

If (Ai)i partition of X, partitioned sampling by sampling the Xj's in each Ai (requires knowing Pr(Ai))
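A minimal Python sketch of the symmetric construction, with an illustrative h (the target and h are assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(2)
m = 100_000

# f = N(0,1) is symmetric about mu = 0, so Y = 2*mu - X = -X has the same law,
# and a monotone h makes cov(h(X), h(-X)) negative.
h = lambda x: np.exp(x)                  # illustrative h; E[h(X)] = exp(1/2)
x = rng.standard_normal(m)
anti = (h(x).mean() + h(-x).mean()) / 2  # antithetic estimator

# empirical comparison: antithetic pairs vs 2m independent draws
pair_var = np.var((h(x) + h(-x)) / 2) / m
iid_var = np.var(h(rng.standard_normal(2 * m))) / (2 * m)
print(anti, pair_var, iid_var)           # pair_var should be the smaller one
```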

Markov Chain Monte Carlo Methods


Monte Carlo Integration
Acceleration methods

Control variates
out of control!

For

  I = ∫ h(x)f(x)dx   unknown

and

  I0 = ∫ h0(x)f(x)dx   known,

I0 estimated by Î0 and I estimated by Î.

Markov Chain Monte Carlo Methods


Monte Carlo Integration
Acceleration methods

Control variates (2)

Combined estimator

  Î* = Î + β(Î0 − I0)

Î* is unbiased for I and

  var(Î*) = var(Î) + β² var(Î0) + 2β cov(Î, Î0)

Markov Chain Monte Carlo Methods


Monte Carlo Integration
Acceleration methods

Optimal control

Optimal choice of β

  β* = − cov(Î, Î0) / var(Î0),

with

  var(Î_{β*}) = (1 − ρ²) var(Î),

where ρ is the correlation between Î and Î0.
Usual solution: regression coefficient of h(xi) over h0(xi).
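A minimal Python sketch of this estimated-β control variate; the target h and control h0 are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
m = 100_000
x = rng.standard_normal(m)

h = lambda x: np.exp(x)      # unknown I = E[exp(X)] = exp(1/2)
h0 = lambda x: x             # control with known I0 = E[X] = 0

hx, h0x = h(x), h0(x)
# beta* = -cov(I_hat, I0_hat)/var(I0_hat), estimated by a regression coefficient
beta = -np.cov(hx, h0x)[0, 1] / np.var(h0x, ddof=1)
I_plain = hx.mean()
I_cv = hx.mean() + beta * (h0x.mean() - 0.0)   # combined estimator
print(I_plain, I_cv)
```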

Markov Chain Monte Carlo Methods


Monte Carlo Integration
Acceleration methods

Example (Quantile Approximation)

Evaluate

  ϱ = Pr(X > a) = ∫_a^∞ f(x)dx

by

  ϱ̂ = (1/n) Σ_{i=1}^n I(Xi > a),

with Xi iid f.
If Pr(X > µ) = 1/2 known

Markov Chain Monte Carlo Methods


Monte Carlo Integration
Acceleration methods


Example (Quantile Approximation (2))

Control variate

  ϱ̃ = (1/n) Σ_{i=1}^n I(Xi > a) + β [ (1/n) Σ_{i=1}^n I(Xi > µ) − Pr(X > µ) ]

improves upon ϱ̂ if

  β < 0   and   |β| < 2 cov(ϱ̂, ϱ̂0)/var(ϱ̂0) ≈ 2 Pr(X > a)/Pr(X > µ).

Markov Chain Monte Carlo Methods


Monte Carlo Integration
Acceleration methods

Integration by conditioning

Use the Rao-Blackwell Theorem

  var( E[δ(X)|Y] ) ≤ var( δ(X) )

Markov Chain Monte Carlo Methods


Monte Carlo Integration
Acceleration methods

Consequence

If Î is an unbiased estimator of I = Ef[h(X)], with X simulated from a joint density f̃(x, y), where

  ∫ f̃(x, y)dy = f(x),

the estimator

  Î* = E_f̃[ Î | Y1, ..., Yn ]

dominates Î(X1, ..., Xn) variance-wise (and is unbiased).

Markov Chain Monte Carlo Methods


Monte Carlo Integration
Acceleration methods

skip expectation

Example (Student's t expectation)

For

  E[h(X)] = E[exp(−X²)]   with   X ~ T(ν, 0, σ²)

a Student's t distribution can be simulated as

  X|y ~ N(µ, σ²y)   and   Y⁻¹ ~ χ²_ν/ν.

Markov Chain Monte Carlo Methods


Monte Carlo Integration
Acceleration methods

Example (Student's t expectation (2))

Empirical distribution

  (1/m) Σ_{j=1}^m exp(−Xj²),

can be improved from the joint sample

  ((X1, Y1), ..., (Xm, Ym))

since

  (1/m) Σ_{j=1}^m E[exp(−X²)|Yj] = (1/m) Σ_{j=1}^m 1/√(2σ²Yj + 1)

is the conditional expectation.

In this example, precision ten times better.
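A minimal Python sketch of this Rao-Blackwellisation, taking (ν, µ, σ) = (4.6, 0, 1) as in the figure that follows:

```python
import numpy as np

rng = np.random.default_rng(4)
nu, sigma, m = 4.6, 1.0, 10_000

# X | y ~ N(0, sigma^2 y) with Y^{-1} ~ chi^2_nu / nu, so X ~ t_nu marginally
y = nu / rng.chisquare(nu, size=m)
x = sigma * np.sqrt(y) * rng.standard_normal(m)

emp = np.exp(-x**2).mean()                        # empirical average
rao = (1 / np.sqrt(2 * sigma**2 * y + 1)).mean()  # conditional expectation
print(emp, rao)   # same mean, the conditional version has smaller variance
```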

Markov Chain Monte Carlo Methods


Monte Carlo Integration
Acceleration methods

[Figure] Estimators of E[exp(−X²)] over 2,000-10,000 simulations: empirical average (full) and conditional expectation (dotted) for (ν, µ, σ) = (4.6, 0, 1).

Markov Chain Monte Carlo Methods


Monte Carlo Integration
Bayesian importance sampling

Bayesian importance functions
directly Markovian

Recall algorithm:
1. Generate θ^(1), ..., θ^(T) from cg(θ), with

  c⁻¹ = ∫ g(θ)dθ

2. Take

  ∫ f(x|θ)π(θ)dθ ≈ (1/T) Σ_{t=1}^T f(x|θ^(t)) π(θ^(t)) / (c g(θ^(t)))
                 ≈ Σ_{t=1}^T f(x|θ^(t)) π(θ^(t))/g(θ^(t))  ÷  Σ_{t=1}^T π(θ^(t))/g(θ^(t))
                 = m̂_IS(x)

Markov Chain Monte Carlo Methods


Monte Carlo Integration
Bayesian importance sampling

Choice of g

g(θ) = π(θ):

  m̂_IS(x) = (1/T) Σ_t f(x|θ^(t))

often inefficient if data informative; impossible if π is improper

g(θ) ∝ f(x|θ)π(θ), c unknown:

  m̂_IS(x) = 1 ÷ [ (1/T) Σ_{t=1}^T 1/f(x|θ^(t)) ]

improper priors allowed

Markov Chain Monte Carlo Methods


Monte Carlo Integration
Bayesian importance sampling

g(θ) = απ(θ) + (1 − α)π(θ|x): defensive mixture, α ≪ 1  OK
[Hesterberg, 1998]

g(θ) = π(θ|x):

  m̂_h(x) = 1 ÷ [ (1/T) Σ_{t=1}^T h(θ^(t)) / {f(x|θ^(t))π(θ^(t))} ]

works for any h; finite variance if

  ∫ h²(θ) / {f(x|θ)π(θ)} dθ < ∞

Markov Chain Monte Carlo Methods


Monte Carlo Integration
Bayesian importance sampling

Bridge sampling
[Chen & Shao, 1997]
Given two models f1(x|θ1) and f2(x|θ2),

  π1(θ1|x) = π1(θ1)f1(x|θ1)/m1(x),   π2(θ2|x) = π2(θ2)f2(x|θ2)/m2(x)

Bayes factor:

  B12(x) = m1(x)/m2(x),

ratio of normalising constants

Markov Chain Monte Carlo Methods


Monte Carlo Integration
Bayesian importance sampling

Bridge sampling (2)

(i) Missing normalising constants:

  π̃1(θ1) ∝ π1(θ1|x),   π̃2(θ2) ∝ π2(θ2|x)

  B12 ≈ (1/n) Σ_{i=1}^n π̃1(θi)/π̃2(θi),   θi ~ π2(·|x)

Markov Chain Monte Carlo Methods


Monte Carlo Integration
Bayesian importance sampling

Bridge sampling (3)

(ii) Still missing normalising constants:

  B12 = ∫ π̃2(θ)α(θ)π1(θ|x)dθ  ÷  ∫ π̃1(θ)α(θ)π2(θ|x)dθ

     ≈ [ (1/n1) Σ_{i=1}^{n1} π̃2(θ1i)α(θ1i) ]  ÷  [ (1/n2) Σ_{i=1}^{n2} π̃1(θ2i)α(θ2i) ]

for any α(·), with θji ~ πj(·|x).

Markov Chain Monte Carlo Methods


Monte Carlo Integration
Bayesian importance sampling

Bridge sampling (4)

Optimal choice

  α*(θ) = (n1 + n2) / ( n1π1(θ|x) + n2π2(θ|x) )

[Chen, Meng & Wong, 2000]

Markov Chain Monte Carlo Methods


Notions on Markov Chains

Notions on Markov Chains


Notions on Markov Chains
Basics
Irreducibility
Transience and Recurrence
Invariant measures
Ergodicity and convergence
Limit theorems
Quantitative convergence rates
Coupling
Renewal and CLT

Markov Chain Monte Carlo Methods


Notions on Markov Chains
Basics

Basics

Definition (Markov chain)
A sequence of random variables whose distribution evolves over time as a function of past realizations.

Chain defined through its transition kernel, a function K defined on X × B(X) such that

∀x ∈ X, K(x, ·) is a probability measure;
∀A ∈ B(X), K(·, A) is measurable.

Markov Chain Monte Carlo Methods


Notions on Markov Chains
Basics

no discrete

When X is a discrete (finite or denumerable) set, the transition kernel simply is a (transition) matrix K with elements

  Pxy = Pr(Xn = y | Xn−1 = x),   x, y ∈ X

Since, for all x ∈ X, K(x, ·) is a probability, we must have

  Pxy ≥ 0   and   K(x, X) = Σ_{y∈X} Pxy = 1

The matrix K is referred to as a Markov transition matrix or a stochastic matrix.

Markov Chain Monte Carlo Methods


Notions on Markov Chains
Basics

In the continuous case, the kernel also denotes the conditional density K(x, x') of the transition K(x, ·):

  Pr(X ∈ A | x) = ∫_A K(x, x')dx'.

Then, for any bounded φ, we may define

  Kφ(x) = K(x, φ) = ∫_X K(x, dy)φ(y).

Note that

  |Kφ(x)| ≤ ∫ K(x, dy)|φ(y)| ≤ ||φ||∞ = sup_{x∈X} |φ(x)|.

We may also associate to a probability measure µ the measure µK, defined as

  µK(A) = ∫_X µ(dx)K(x, A).

Markov Chain Monte Carlo Methods


Notions on Markov Chains
Basics

Markov chains
skip definition

Given a transition kernel K, a sequence X0, X1, ..., Xn, ... of random variables is a Markov chain, denoted by (Xn), if, for any t, the conditional distribution of Xt given xt−1, xt−2, ..., x0 is the same as the distribution of Xt given xt−1. That is,

  Pr(Xk+1 ∈ A | x0, x1, x2, ..., xk) = Pr(Xk+1 ∈ A | xk) = ∫_A K(xk, dx)

Markov Chain Monte Carlo Methods


Notions on Markov Chains
Basics

Note that the entire structure of the chain only depends on


The transition function K

The initial state x0 or initial distribution X0

Markov Chain Monte Carlo Methods


Notions on Markov Chains
Basics

Example (Random walk)

The normal random walk is the kernel K(x, ·) associated with the distribution

  Np(x, τ²Ip)

which means

  Xt+1 = Xt + τεt,

εt being an iid additional noise.
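A minimal Python sketch of this kernel, simulating the path shown in the next figure (τ = 1, 100 steps in R²):

```python
import numpy as np

rng = np.random.default_rng(5)
tau, p, T = 1.0, 2, 100

x = np.zeros((T, p))
for t in range(1, T):
    x[t] = x[t-1] + tau * rng.standard_normal(p)   # X_{t+1} = X_t + tau * eps_t
```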

Markov Chain Monte Carlo Methods


Notions on Markov Chains

Basics

[Figure] 100 consecutive realisations of the random walk in R² with τ = 1.

Markov Chain Monte Carlo Methods


Notions on Markov Chains
Basics
bypass remarks

On a discrete state-space X = {x0, x1, ...},

A function φ on a discrete state space is uniquely defined by the (column) vector φ = (φ(x0), φ(x1), ...)^T and

  Kφ(x) = Σ_{y∈X} Pxy φ(y)

can be interpreted as the x-th component of the product of the transition matrix K and of the vector φ.

A probability distribution on P(X) is defined as a (row) vector µ = (µ(x0), µ(x1), ...) and the probability distribution µK is defined, for each y ∈ X, as

  µK({y}) = Σ_{x∈X} µ({x}) Pxy,

the y-th component of the product of the vector µ and of the transition matrix K.

Markov Chain Monte Carlo Methods


Notions on Markov Chains
Basics

Composition of kernels

Let Q1 and Q2 be two probability kernels. Define, for any x ∈ X and any A ∈ B(X), the product of kernels Q1Q2 as

  Q1Q2(x, A) = ∫_X Q1(x, dy) Q2(y, A)

When the state space X is discrete, the product of Markov kernels coincides with the product of matrices Q1Q2.
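A minimal Python sketch of the discrete case, with two illustrative 3-state transition matrices:

```python
import numpy as np

# Two transition matrices on a 3-state space; rows sum to one
Q1 = np.array([[0.5, 0.5, 0.0],
               [0.1, 0.8, 0.1],
               [0.0, 0.5, 0.5]])
Q2 = np.array([[0.9, 0.1, 0.0],
               [0.2, 0.6, 0.2],
               [0.0, 0.1, 0.9]])

Q = Q1 @ Q2                               # product kernel Q1Q2
Kphi = Q1 @ np.array([1.0, 2.0, 3.0])     # K acting on a function (column vector)
muK = np.array([0.2, 0.5, 0.3]) @ Q1      # a distribution acting on K (row vector)
assert np.allclose(Q.sum(axis=1), 1.0)    # the product is still a Markov kernel
```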

Markov Chain Monte Carlo Methods


Notions on Markov Chains
Irreducibility

Irreducibility

Irreducibility is one measure of the sensitivity of the Markov chain to initial conditions.
It leads to a guarantee of convergence for MCMC algorithms.

Definition (Irreducibility)
In the discrete case, the chain is irreducible if all states communicate, namely if

  Px(τy < ∞) > 0,   ∀x, y ∈ X,

τy being the first (positive) time y is visited.

Markov Chain Monte Carlo Methods


Notions on Markov Chains
Irreducibility

Irreducibility for a continuous chain

In the continuous case, the chain is ψ-irreducible for some measure ψ if, for some n,

  Kⁿ(x, A) > 0

for all x ∈ X and for every A ∈ B(X) with ψ(A) > 0.

Markov Chain Monte Carlo Methods


Notions on Markov Chains
Irreducibility

Minoration condition

Assume there exist a probability measure ν and ε > 0 such that, for all x ∈ X and all A ∈ B(X),

  K(x, A) ≥ εν(A)

This is called a minoration condition.
When K is a Markov chain on a discrete state space, this is equivalent to saying that Pxy > 0 for all x, y ∈ X.

Markov Chain Monte Carlo Methods


Notions on Markov Chains
Irreducibility

Small sets

Definition (Small set)
If there exist C ∈ B(X), ψ(C) > 0, a probability measure ν and ε > 0 such that, for all x ∈ C and all A ∈ B(X),

  K(x, A) ≥ εν(A),

C is called a small set.
For discrete state spaces, atoms are small sets.

Markov Chain Monte Carlo Methods


Notions on Markov Chains
Transience and Recurrence

Towards further stability


Irreducibility: every set A has a chance to be visited by the
Markov chain (Xn )
This property is too weak to ensure that the trajectory of
(Xn ) will enter A often enough.
A Markov chain must enjoy good stability properties to
guarantee an acceptable approximation of the simulated
model.
Formalizing this stability leads to different notions of
recurrence
For discrete chains, recurrence of a state is equivalent to a probability one of return to it.
Always satisfied for irreducible chains on finite spaces.

Markov Chain Monte Carlo Methods


Notions on Markov Chains
Transience and Recurrence

Transience and Recurrence

In a finite state space X, denote the average number of visits to a state ω by

  ηω = Σ_{i=1}^∞ I_ω(Xi)

If Eω[ηω] = ∞, the state is recurrent.
If Eω[ηω] < ∞, the state is transient.
For irreducible chains, recurrence/transience is a property of the chain, not of a particular state.
Similar definitions for the continuous case.

Markov Chain Monte Carlo Methods


Notions on Markov Chains
Transience and Recurrence

Harris recurrence

Stronger form of recurrence:

Definition (Harris recurrence)
A set A is Harris recurrent if

  Px(ηA = ∞) = 1   for all x ∈ A.

The chain (Xn) is Harris recurrent if it is ψ-irreducible and, for every set A with ψ(A) > 0, A is Harris recurrent.

Note that Px(ηA = ∞) = 1 implies Ex[ηA] = ∞.

Markov Chain Monte Carlo Methods


Notions on Markov Chains
Invariant measures

Invariant measures

Stability increases for the chain (Xn) if the marginal distribution of Xn is independent of n.
Requires the existence of a probability distribution π such that

  Xn+1 ~ π   if   Xn ~ π

Definition (Invariant measure)
A measure π is invariant for the transition kernel K(·, ·) if

  π(B) = ∫_X K(x, B) π(dx),   ∀B ∈ B(X).

Markov Chain Monte Carlo Methods


Notions on Markov Chains
Invariant measures

Stability properties and invariance

The chain is positive recurrent if π is a probability measure.
Otherwise it is null recurrent or transient.

If π probability measure, π also called stationary distribution since

  X0 ~ π implies that Xn ~ π for every n

The stationary distribution is unique.

Markov Chain Monte Carlo Methods


Notions on Markov Chains
Invariant measures

Insights
no time for that!

Invariant probability measures are important not merely because they define stationary processes, but also because they turn out to be the measures which define the long-term or ergodic behavior of the chain.
To understand why, consider P_µ(Xn ∈ ·) for a starting distribution µ. If a limiting measure γµ exists such that

  P_µ(Xn ∈ A) → γµ(A)

for all A ∈ B(X), then

Markov Chain Monte Carlo Methods


Notions on Markov Chains
Invariant measures

  γµ(A) = lim_{n→∞} ∫_X µ(dx)Pⁿ(x, A)
        = lim_{n→∞} ∫_X ∫_X µ(dx)P^{n−1}(x, dw)K(w, A)
        = ∫_X γµ(dw)K(w, A)

since setwise convergence of Pⁿ(x, ·) implies convergence of integrals of bounded measurable functions. Hence, if a limiting distribution exists, it is an invariant probability measure; and, obviously, if there is a unique invariant probability measure, the limit γµ will be independent of µ whenever it exists.

Markov Chain Monte Carlo Methods


Notions on Markov Chains
Ergodicity and convergence

Ergodicity and convergence

We finally consider: to what is the chain converging?

The invariant distribution π is a natural candidate for the limiting distribution.
A fundamental property is ergodicity, or independence of initial conditions. In the discrete case, a state ω is ergodic if

  lim_{n→∞} |Kⁿ(ω, ω) − π(ω)| = 0.

Markov Chain Monte Carlo Methods


Notions on Markov Chains
Ergodicity and convergence

Norm and convergence

In general, we establish convergence using the total variation norm

  ||µ1 − µ2||_TV = sup_A |µ1(A) − µ2(A)|

and we want

  || ∫ Kⁿ(x, ·)µ(dx) − π ||_TV = sup_A | ∫ Kⁿ(x, A)µ(dx) − π(A) |

to be small.

skip minoration TV

Markov Chain Monte Carlo Methods


Notions on Markov Chains
Ergodicity and convergence

Total variation distance and minoration

Lemma
Let µ and µ' be two probability measures. Then,

  1 − inf Σ_i µ(Ai) ∧ µ'(Ai) = ||µ − µ'||_TV,

where the infimum is taken over all finite partitions (Ai)i of X.

Markov Chain Monte Carlo Methods


Notions on Markov Chains
Ergodicity and convergence

Total variation distance and minoration (2)

Assume that there exist a probability ν and ε > 0 such that, for all A ∈ B(X), we have

  µ(A) ∧ µ'(A) ≥ εν(A).

Then, for all I and all partitions A1, A2, ..., AI,

  Σ_{i=1}^I µ(Ai) ∧ µ'(Ai) ≥ ε

and the previous result thus implies that

  ||µ − µ'||_TV ≤ (1 − ε).

Markov Chain Monte Carlo Methods


Notions on Markov Chains
Ergodicity and convergence

Harris recurrence and ergodicity

Theorem
If (Xn) Harris positive recurrent and aperiodic, then

  lim_{n→∞} || ∫ Kⁿ(x, ·)µ(dx) − π ||_TV = 0

for every initial distribution µ.

We thus take Harris positive recurrent and aperiodic as equivalent to ergodic.
[Meyn & Tweedie, 1993]

Convergence in total variation implies

  lim_{n→∞} |Eµ[h(Xn)] − Eπ[h(X)]| = 0

for every bounded function h.

no detail of convergence

Markov Chain Monte Carlo Methods


Notions on Markov Chains
Ergodicity and convergence

Convergences

There are different speeds of convergence:

ergodic (fast enough)

geometrically ergodic (faster)

uniformly ergodic (fastest)

Markov Chain Monte Carlo Methods


Notions on Markov Chains
Ergodicity and convergence

Geometric ergodicity

A ψ-irreducible aperiodic Markov kernel P with invariant distribution π is geometrically ergodic if there exist V ≥ 1 and constants ρ < 1, R < ∞ such that (n ≥ 1)

  ||Pⁿ(x, ·) − π(·)||_V ≤ R V(x) ρⁿ

on {V < ∞}, which is full and absorbing.

Markov Chain Monte Carlo Methods


Notions on Markov Chains
Ergodicity and convergence

Geometric ergodicity implies a lot of important results

CLT for additive functionals n^{−1/2} Σ g(Xk) and functions |g| < V

Rosenthal-type inequalities

  Ex | Σ_{k=1}^n g(Xk) |^p ≤ C(p) n^{p/2},   |g|^p ≤ V, p ≥ 2

exponential inequalities (for bounded functions and λ small enough)

  Ex [ exp{ λ Σ_{k=1}^n g(Xk) } ] < ∞

Markov Chain Monte Carlo Methods


Notions on Markov Chains
Ergodicity and convergence

Minoration condition and uniform ergodicity

Under the minoration condition, the kernel K is thus contractant, and standard results in functional analysis show the existence and uniqueness of a fixed point π. The previous relation implies that, for all x ∈ X,

  ||Pⁿ(x, ·) − π||_TV ≤ (1 − ε)ⁿ

Such Markov chains are called uniformly ergodic.

Markov Chain Monte Carlo Methods


Notions on Markov Chains
Ergodicity and convergence

Uniform ergodicity

Theorem
The following conditions are equivalent:

(Xn)n is uniformly ergodic;

there exist ρ < 1 and R < ∞ such that, for all x ∈ X,

  ||Pⁿ(x, ·) − π||_TV ≤ Rρⁿ;

for some n > 0,

  sup_{x∈X} ||Pⁿ(x, ·) − π(·)||_TV < 1.

[Meyn and Tweedie, 1993]

Markov Chain Monte Carlo Methods


Notions on Markov Chains
Limit theorems

Limit theorems

Ergodicity determines the probabilistic properties of average behavior of the chain.
But also need of statistical inference, made by induction from the observed sample.
If ||Pⁿx − π|| close to 0, no direct information about Xn ~ Pⁿx.
We need LLNs and CLTs!!!

Classical LLNs and CLTs not directly applicable due to:
Markovian dependence structure between the observations Xi
Non-stationarity of the sequence

Markov Chain Monte Carlo Methods


Notions on Markov Chains
Limit theorems

The Theorem

Theorem (Ergodic Theorem)
If the Markov chain (Xn) is Harris recurrent, then for any function h with E|h| < ∞,

  lim_{n→∞} (1/n) Σ_i h(Xi) = ∫ h(x)dπ(x),

Markov Chain Monte Carlo Methods


Notions on Markov Chains
Limit theorems

Central Limit Theorem

To get a CLT, we need more assumptions.
skip conditions and results

For MCMC, the easiest is

Definition (reversibility)
A Markov chain (Xn) is reversible if, for all n,

  Xn+1 | Xn+2 = x  ~  Xn+1 | Xn = x

The direction of time does not matter.
[Green, 1995]

Markov Chain Monte Carlo Methods


Notions on Markov Chains
Limit theorems

The CLT

Theorem
If the Markov chain (Xn) is Harris recurrent and reversible,

  (1/√N) Σ_{n=1}^N ( h(Xn) − Eπ[h] )  →_L  N(0, γh²),

where

  0 < γh² = Eπ[h²(X0)] + 2 Σ_{k=1}^∞ Eπ[h(X0)h(Xk)] < +∞.

[Kipnis & Varadhan, 1986]

Markov Chain Monte Carlo Methods


Notions on Markov Chains
Quantitative convergence rates

Quantitative convergence rates
skip detailed results

Let P a Markov transition kernel on (X, B(X)), with P positive recurrent and π its stationary distribution.
Convergence rate: determine, from the kernel, a sequence B(µ, n) such that

  ||µPⁿ − π||_V ≤ B(µ, n)

where V : X → [1, ∞) and, for any signed measure µ,

  ||µ||_V = sup_{|φ|≤V} |µ(φ)|

Markov Chain Monte Carlo Methods


Notions on Markov Chains
Quantitative convergence rates

Practical purposes?

In the 90's, a wealth of contributions on quantitative bounds was triggered by MCMC algorithms, to answer questions like: what is the appropriate burn-in? or how long should the sampling continue after burn-in?
[Douc, Moulines and Rosenthal, 2001]
[Jones and Hobert, 2001]

Markov Chain Monte Carlo Methods


Notions on Markov Chains
Quantitative convergence rates

Tools at hand

For MCMC algorithms, kernels are explicitly known.
Type of quantities (more or less directly) available:

Minoration constants

  K^s(x, A) ≥ εν(A),   for all x ∈ C

Foster-Lyapunov drift conditions

  KV ≤ λV + bI_C

and the goal is to obtain a bound depending explicitly upon ε, λ, b, etc.

Markov Chain Monte Carlo Methods


Notions on Markov Chains
Coupling

Coupling
skip coupling

If X ~ ξ and X' ~ ξ' with ξ ∧ ξ' ≥ εν, one can construct two random variables X̃ and X̃' such that

  X̃ ~ ξ,   X̃' ~ ξ'   and   X̃ = X̃' with probability ε

The basic coupling construction:

with probability ε, draw Z according to ν and set X̃ = X̃' = Z;
with probability 1 − ε, draw X̃ and X̃' under the distributions

  (ξ − εν)/(1 − ε)   and   (ξ' − εν)/(1 − ε),

respectively.
[Thorisson, 2000]

Markov Chain Monte Carlo Methods


Notions on Markov Chains
Coupling

Coupling inequality

X, X' r.v.'s with probability distributions K(x, ·) and K(x', ·), respectively, can be coupled with probability ε if

  K(x, ·) ∧ K(x', ·) ≥ ε ν_{x,x'}(·)

where ν_{x,x'} is a probability measure, or, equivalently,

  ||K(x, ·) − K(x', ·)||_TV ≤ (1 − ε)

Define an (ε, ν)-coupling set as a set C̄ ⊂ X × X satisfying:

  ∀A ∈ B(X), ∀(x, x') ∈ C̄,   K(x, A) ∧ K(x', A) ≥ ε ν_{x,x'}(A)

Markov Chain Monte Carlo Methods


Notions on Markov Chains
Coupling

Small set and coupling sets

C ⊂ X small set if there exist ε > 0 and a probability measure ν such that, for all A ∈ B(X),

  K(x, A) ≥ εν(A),   ∀x ∈ C.   (3)

Small sets always exist when the MC is ψ-irreducible.
[Jain and Jamieson, 1967]
For MCMC kernels, small sets are in general easy to find.
If C is a small set, then C̄ = C × C is a coupling set:

  ∀A ∈ B(X), ∀(x, x') ∈ C̄,   K(x, A) ∧ K(x', A) ≥ εν(A).

Markov Chain Monte Carlo Methods


Notions on Markov Chains
Coupling

Coupling for Markov chains

P̄ Markov transition kernel on X × X such that, for all (x, x') ∉ C̄ (where C̄ is an (ε, ν)-coupling set) and all A ∈ B(X):

  P̄(x, x'; A × X) = K(x, A)   and   P̄(x, x'; X × A) = K(x', A)

For example,

  P̄(x, x'; A × A') = K(x, A)K(x', A')   for (x, x') ∉ C̄.

For all (x, x') ∈ C̄ and all A, A' ∈ B(X), define the residual kernel

  R̄(x, x'; A × X) = (1 − ε)⁻¹ ( K(x, A) − ε ν_{x,x'}(A) ),
  R̄(x, x'; X × A') = (1 − ε)⁻¹ ( K(x', A') − ε ν_{x,x'}(A') ).

Markov Chain Monte Carlo Methods


Notions on Markov Chains
Coupling

Coupling algorithm

Initialisation: Let X0 ~ ξ and X0' ~ ξ', and set d0 = 0.

After coupling: If dn = 1, then draw Xn+1 ~ K(Xn, ·) and set X'n+1 = Xn+1.

Before coupling: If dn = 0 and (Xn, Xn') ∈ C̄,
  with probability ε, draw Xn+1 = X'n+1 ~ ν_{Xn,Xn'} and set dn+1 = 1;
  with probability 1 − ε, draw (Xn+1, X'n+1) ~ R̄(Xn, Xn'; ·) and set dn+1 = 0.
If dn = 0 and (Xn, Xn') ∉ C̄, then draw (Xn+1, X'n+1) ~ P̄(Xn, Xn'; ·).

(Xn, Xn', dn) [where dn is the bell variable which indicates whether the chains have coupled or not] is a Markov chain on X × X × {0, 1}.
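A minimal Python sketch of this algorithm, for an assumed toy kernel built directly as the ε-mixture K = εν + (1 − ε)R, so that the whole space is a coupling set and the residual kernel is explicit; the kernel, ε and starting points are illustrative choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(6)
eps = 0.3
nu = lambda: rng.standard_normal()          # nu = N(0, 1)
R = lambda x: x / 2 + rng.standard_normal() # residual kernel R(.|x) = N(x/2, 1)

def coupled_step(x, xp, d):
    if d == 1:                    # after coupling: the two chains move together
        z = nu() if rng.uniform() < eps else R(x)
        return z, z, 1
    if rng.uniform() < eps:       # coupling attempt: common draw from nu
        z = nu()
        return z, z, 1
    return R(x), R(xp), 0         # residual kernel, chains still uncoupled

x, xp, d, T = 5.0, -5.0, 0, 0
while d == 0:                     # coupling time T = inf{k >= 1 : d_k = 1}
    x, xp, d = coupled_step(x, xp, d)
    T += 1
print("coupled at time", T)       # here P(T > k) = (1 - eps)^k
```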

Markov Chain Monte Carlo Methods


Notions on Markov Chains
Coupling

Coupling inequality (again!)

Define the coupling time T as

  T = inf{k ≥ 1, dk = 1}

Coupling inequality:

  sup_A |ξP^k(A) − ξ'P^k(A)| ≤ P_{ξ,ξ',0}[T > k]

[Pitman, 1976; Lindvall, 1992]

Markov Chain Monte Carlo Methods


Notions on Markov Chains
Coupling

Drift conditions

To exploit the coupling construction, we need to control the hitting time.
Moments of the return time to a set C are most often controlled using a Foster-Lyapunov drift condition:

  PV ≤ λV + bI_C,   V ≥ 1

Mk = λ^{−k} V(Xk) I(τC ≥ k), k ≥ 1, is a supermartingale and thus

  Ex[λ^{−τC}] ≤ V(x) + bλ⁻¹ I_C(x).

Conversely, if there exists a set C such that Ex[λ^{−τC}] < ∞ for all x (in a full and absorbing set), then there exists a drift function verifying the Foster-Lyapunov conditions.
[Meyn and Tweedie, 1993]

Markov Chain Monte Carlo Methods


Notions on Markov Chains
Coupling

If the drift condition is imposed directly on the joint transition kernel P̄, there exist V̄ ≥ 1, 0 < λ < 1 and a set C̄ such that:

  P̄V̄(x, x') ≤ λ V̄(x, x'),   (x, x') ∉ C̄

When P̄(x, x'; A × A') = K(x, A)K(x', A'), one may consider

  V̄(x, x') = (1/2) ( V(x) + V(x') )

where V drift function for P (but not necessarily the best choice).

Markov Chain Monte Carlo Methods


Notions on Markov Chains
Coupling

Explicit bound

Theorem
For any distributions ξ and ξ', and any j ≤ k, then:

  ||ξP^k(·) − ξ'P^k(·)||_TV ≤ (1 − ε)^j + λ^k B^{j−1} E_{ξ,ξ',0}[V̄(X0, X0')]

where

  B = 1 ∨ λ⁻¹(1 − ε) sup_{C̄} R̄V̄.

[DMR, 2001]

Markov Chain Monte Carlo Methods


Notions on Markov Chains
Renewal and CLT

Renewal and CLT

Given a Markov chain (Xn)n, how good an approximation of

  I = ∫ g(x)π(x)dx

is

  ḡn := (1/n) Σ_{i=0}^{n−1} g(Xi) ?

Standard MC if CLT

  √n ( ḡn − Eπ[g(X)] )  →_d  N(0, γg²)

and there exists an easy-to-compute, consistent estimate of γg²...

Markov Chain Monte Carlo Methods


Notions on Markov Chains
Renewal and CLT

Minoration
skip construction

Assume that the kernel density K satisfies, for some density q(·), ε ∈ (0, 1) and a small set C ⊂ X,

  K(y|x) ≥ ε q(y)   for all y ∈ X and x ∈ C

Then split K into a mixture

  K(y|x) = ε q(y) + (1 − ε) R(y|x)

where R is the residual kernel.

Markov Chain Monte Carlo Methods


Notions on Markov Chains
Renewal and CLT

Split chain

Let δ0, δ1, δ2, ... be iid B(ε). Then the split chain

  {(X0, δ0), (X1, δ1), (X2, δ2), ...}

is such that, when Xi ∈ C, δi determines Xi+1:

  Xi+1 ~ q(x)   if δi = 1,   Xi+1 ~ R(x|Xi)   otherwise

[Regeneration] When (Xi, δi) ∈ C × {1}, Xi+1 ~ q
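A minimal Python sketch of the split chain, reusing the assumed toy mixture kernel from the coupling sketch (q = N(0,1), R(.|x) = N(x/2, 1), C = X, so every step is a potential renewal):

```python
import numpy as np

rng = np.random.default_rng(7)
eps, n = 0.3, 10_000
x = np.empty(n)
delta = np.zeros(n, dtype=int)
x[0] = rng.standard_normal()              # X0 ~ q

for i in range(n - 1):
    delta[i] = rng.uniform() < eps        # bell variable delta_i ~ B(eps)
    if delta[i]:
        x[i+1] = rng.standard_normal()    # regeneration: X_{i+1} ~ q
    else:
        x[i+1] = x[i]/2 + rng.standard_normal()   # residual kernel R

renewals = np.flatnonzero(delta)          # tour boundaries for the renewal CLT
```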

Markov Chain Monte Carlo Methods


Notions on Markov Chains
Renewal and CLT

Renewals

For X0 ~ q and R successive renewals, define by τ1 < ... < τR the renewal times.
Then

  √R ( ḡR − Eπ[g(X)] ) = (N̄R)⁻¹ R^{−1/2} Σ_{t=1}^R ( St − Nt Eπ[g(X)] )

where Nt is the length of the t-th tour, and St the sum of the g(Xj)'s over the t-th tour.
Since the (Nt, St) are iid and Eq[St − Nt Eπ[g(X)]] = 0, if Nt and St have finite 2nd moments,

  √R ( ḡR − Eπ[g] ) →_d N(0, σg²)

and there is a simple, consistent estimator of σg².
[Mykland & al., 1995; Robert, 1995]

Markov Chain Monte Carlo Methods


Notions on Markov Chains
Renewal and CLT

Moment conditions
We need to show that, for the minoration condition, Eq[N1²] and Eq[S1²] are finite.
If
1. the chain is geometrically ergodic, and
2. Eπ[|g|^{2+δ}] < ∞ for some δ > 0,
then Eq[N1²] < ∞ and Eq[S1²] < ∞.
[Hobert & al., 2002]

Note that drift + minoration ensures geometric ergodicity.
[Rosenthal, 1995; Roberts & Tweedie, 1999]

Markov Chain Monte Carlo Methods


The Metropolis-Hastings Algorithm

The Metropolis-Hastings Algorithm


Motivation and leading example
Random variable generation
Monte Carlo Integration
Notions on Markov Chains
The Metropolis-Hastings Algorithm
Monte Carlo Methods based on Markov Chains
The MetropolisHastings algorithm
A collection of Metropolis-Hastings algorithms
Extensions

Markov Chain Monte Carlo Methods


The Metropolis-Hastings Algorithm
Monte Carlo Methods based on Markov Chains

Running Monte Carlo via Markov Chains

It is not necessary to use a sample from the distribution f to approximate the integral

  I = ∫ h(x)f(x)dx.

We can obtain X1, ..., Xn ~ f (approx) without directly simulating from f, using an ergodic Markov chain with stationary distribution f.

Markov Chain Monte Carlo Methods


The Metropolis-Hastings Algorithm
Monte Carlo Methods based on Markov Chains

Running Monte Carlo via Markov Chains (2)

Idea
For an arbitrary starting value x^(0), an ergodic chain (X^(t)) is generated using a transition kernel with stationary distribution f.

Ensures the convergence in distribution of (X^(t)) to a random variable from f.

For a large enough T0, X^(T0) can be considered as distributed from f.

Produce a dependent sample X^(T0), X^(T0+1), ..., which is generated from f, sufficient for most approximation purposes.

Problem: how can one build a Markov chain with a given stationary distribution?

Markov Chain Monte Carlo Methods


The Metropolis-Hastings Algorithm
The MetropolisHastings algorithm

The MetropolisHastings algorithm

Basics
The algorithm uses the objective (target) density
f
and a conditional density
q(y|x)
called the instrumental (or proposal) distribution

Markov Chain Monte Carlo Methods


The Metropolis-Hastings Algorithm
The MetropolisHastings algorithm

The MH algorithm

Algorithm (Metropolis-Hastings)
Given x^(t),
1. Generate Yt ~ q(y|x^(t)).
2. Take

  X^(t+1) = Yt with prob. ρ(x^(t), Yt),   x^(t) with prob. 1 − ρ(x^(t), Yt),

where

  ρ(x, y) = min{ (f(y)/f(x)) (q(x|y)/q(y|x)), 1 }
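A minimal Python sketch of this generic update; logf need only be the log target up to an additive constant:

```python
import numpy as np

def metropolis_hastings(logf, q_sample, q_logpdf, x0, T, rng):
    """Generic MH: q_sample(x) draws Y ~ q(.|x); q_logpdf(y, x) = log q(y|x)."""
    x = np.asarray(x0, dtype=float)
    chain = [x]
    for _ in range(T):
        y = q_sample(x)
        log_rho = (logf(y) - logf(x)) + (q_logpdf(x, y) - q_logpdf(y, x))
        if np.log(rng.uniform()) < min(0.0, log_rho):
            x = y                      # accept the proposed value
        chain.append(x)                # otherwise repeat the current value
    return np.array(chain)
```

For a symmetric proposal (g(x) = g(−x), as in the random walk versions below), the two q_logpdf terms cancel and the acceptance ratio reduces to f(y)/f(x).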

Markov Chain Monte Carlo Methods


The Metropolis-Hastings Algorithm
The MetropolisHastings algorithm

Features

Independent of normalizing constants for both f and q(|x)


(ie, those constants independent of x)

Never move to values with f (y) = 0

The chain (x(t) )t may take the same value several times in a
row, even though f is a density wrt Lebesgue measure

The sequence (yt )t is usually not a Markov chain

Markov Chain Monte Carlo Methods


The Metropolis-Hastings Algorithm
The MetropolisHastings algorithm

Convergence properties

1. The M-H Markov chain is reversible, with invariant/stationary density f since it satisfies the detailed balance condition

  f(y) K(y, x) = f(x) K(x, y)

2. As f is a probability measure, the chain is positive recurrent

3. If

  Pr[ (f(Yt)/f(X^(t))) (q(X^(t)|Yt)/q(Yt|X^(t))) ≥ 1 ] < 1,   (1)

that is, the event {X^(t+1) = X^(t)} is possible, then the chain is aperiodic

Markov Chain Monte Carlo Methods


The Metropolis-Hastings Algorithm
The MetropolisHastings algorithm

Convergence properties (2)

4. If

  q(y|x) > 0 for every (x, y),   (2)

the chain is irreducible

5. For M-H, f-irreducibility implies Harris recurrence

6. Thus, for M-H satisfying (1) and (2):
(i) For h with Ef|h(X)| < ∞,

  lim_{T→∞} (1/T) Σ_{t=1}^T h(X^(t)) = ∫ h(x)df(x)   a.e. f.

(ii) and

  lim_{n→∞} || ∫ Kⁿ(x, ·)µ(dx) − f ||_TV = 0

for every initial distribution µ, where Kⁿ(x, ·) denotes the kernel for n transitions.

Markov Chain Monte Carlo Methods


The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms

The Independent Case

The instrumental distribution q is independent of X^(t), and is denoted g by analogy with Accept-Reject.

Algorithm (Independent Metropolis-Hastings)

Given x^(t),
a. Generate Yt ~ g(y)
b. Take

  X^(t+1) = Yt with prob. min{ (f(Yt) g(x^(t))) / (f(x^(t)) g(Yt)), 1 },   x^(t) otherwise.

Markov Chain Monte Carlo Methods


The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms

Properties

The resulting sample is not iid but there exist strong convergence properties:

Theorem (Ergodicity)
The algorithm produces a uniformly ergodic chain if there exists a constant M such that

  f(x) ≤ M g(x),   x ∈ supp f.

In this case,

  ||Kⁿ(x, ·) − f||_TV ≤ (1 − 1/M)ⁿ

[Mengersen & Tweedie, 1996]

Markov Chain Monte Carlo Methods


The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms

Example (Noisy AR(1))

Hidden Markov chain from a regular AR(1) model,

  xt+1 = φxt + εt+1,   εt ~ N(0, τ²)

and observables

  yt|xt ~ N(xt², σ²)

The distribution of xt given xt−1, xt+1 and yt is proportional to

  exp −(1/2τ²){ (xt − φxt−1)² + (xt+1 − φxt)² } − (1/2σ²)(yt − xt²)².

Markov Chain Monte Carlo Methods


The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms

Example (Noisy AR(1) too)

Use for proposal the N(µt, ωt²) distribution, with

  µt = φ(xt−1 + xt+1)/(1 + φ²)   and   ωt² = τ²/(1 + φ²).

Ratio

  π(x)/q_ind(x) ∝ exp −(yt − xt²)²/2σ²

is bounded.

Markov Chain Monte Carlo Methods


The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms

(top) Last 500 realisations of the chain {Xk }k out of 10, 000
iterations; (bottom) histogram of the chain, compared with
the target distribution.

Markov Chain Monte Carlo Methods


The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms

Example (Cauchy by normal)
go random W

Given a Cauchy C(0, 1) distribution, consider a normal N(0, 1) proposal.
The Metropolis-Hastings acceptance ratio is

  [π(ξ')/ν(ξ')] ÷ [π(ξ)/ν(ξ)] = exp{(ξ'² − ξ²)/2} (1 + ξ²)/(1 + ξ'²).

Poor performances: the proposal distribution has lighter tails than the target Cauchy and convergence to the stationary distribution is not even geometric!
[Mengersen & Tweedie, 1996]

Markov Chain Monte Carlo Methods


The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms

[Figure, left] Histogram of Markov chain (ξt)_{1≤t≤5000} against target C(0, 1) distribution.
[Figure, right] Range and average of 1000 parallel runs when initialized with a normal N(0, 100²) distribution.

Markov Chain Monte Carlo Methods


The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms

Random walk MetropolisHastings

Use of a local perturbation as proposal

  Yt = X^(t) + εt,

where εt ~ g, independent of X^(t).
The instrumental density is now of the form g(y − x) and the Markov chain is a random walk if we take g to be symmetric: g(x) = g(−x).

Markov Chain Monte Carlo Methods


The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms

Algorithm (Random walk Metropolis)

Given x^(t),
1. Generate Yt ~ g(y − x^(t))
2. Take

  X^(t+1) = Yt with prob. min{ 1, f(Yt)/f(x^(t)) },   x^(t) otherwise.

Markov Chain Monte Carlo Methods


The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms

Example (Random walk and normal target)

Generate N(0, 1) based on the uniform proposal [−δ, δ]
[Hastings (1970)]
forget History!
The probability of acceptance is then

  ρ(x^(t), yt) = exp{(x^(t)² − yt²)/2} ∧ 1.
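A minimal Python sketch of this example, reproducing the setting of the next table (15,000 iterations, δ ∈ {0.1, 0.5, 1.0}; starting value and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(8)

def rw_uniform_chain(delta, T=15_000):
    x = np.empty(T)
    x[0] = 0.0
    for t in range(T - 1):
        y = x[t] + rng.uniform(-delta, delta)
        # acceptance probability exp{(x^2 - y^2)/2} ^ 1 for the N(0,1) target
        x[t+1] = y if np.log(rng.uniform()) < (x[t]**2 - y**2) / 2 else x[t]
    return x

for delta in (0.1, 0.5, 1.0):
    c = rw_uniform_chain(delta)
    print(delta, round(c.mean(), 3), round(c.var(), 3))
```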

Markov Chain Monte Carlo Methods


The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms

Example (Random walk & normal (2))

Sample statistics

  δ          0.1      0.5      1.0
  mean       0.399   -0.111    0.10
  variance   0.698    1.11     1.06

As δ increases, we get better histograms and a faster exploration of the support of f.

Markov Chain Monte Carlo Methods


The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms

[Figure] Three samples based on U[−δ, δ] with (a) δ = 0.1, (b) δ = 0.5 and (c) δ = 1.0, superimposed with the convergence of the means (15,000 simulations).

Markov Chain Monte Carlo Methods


The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms

Example (Mixture models (again!))

  π(θ|x) ∝ Π_{j=1}^n Σ_{ℓ=1}^k pℓ f(xj|µℓ, σℓ) π(θ)

Metropolis-Hastings proposal:

  θ^(t+1) = θ^(t) + ωε^(t) if u^(t) < ρ^(t),   θ^(t) otherwise

where

  ρ^(t) = π(θ^(t) + ωε^(t)|x) / π(θ^(t)|x) ∧ 1

and ω scaled for good acceptance rate.
Markov Chain Monte Carlo Methods


The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms

[Figure] Random walk sampling (50,000 iterations): general case of a 3-component normal mixture, with traces and histograms of (θ, τ, p).
[Celeux & al., 2000]

Markov Chain Monte Carlo Methods


The Metropolis-Hastings Algorithm

A collection of Metropolis-Hastings algorithms

[Figure] Random walk MCMC output for .7N(µ1, 1) + .3N(µ2, 1).

Markov Chain Monte Carlo Methods


The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms

Example (probit model)
skip probit

Likelihood of the probit model

  Π_{i=1}^n Φ(yi^T β)^{xi} Φ(−yi^T β)^{1−xi}

Random walk proposal

  β^(t+1) = β^(t) + εt,   εt ~ Np(0, Σ)

where, for instance,

  Σ = (Y Y^T)⁻¹

Markov Chain Monte Carlo Methods


The Metropolis-Hastings Algorithm

A collection of Metropolis-Hastings algorithms

[Figure] Likelihood surface and random walk Metropolis-Hastings steps.

Markov Chain Monte Carlo Methods


The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms

Convergence properties

Uniform ergodicity prohibited by random walk structure


At best, geometric ergodicity:

Theorem (Sufficient ergodicity)


For a symmetric density f , log-concave in the tails, and a positive
and symmetric density g, the chain (X (t) ) is geometrically ergodic.
[Mengersen & Tweedie, 1996]
no tail effect

Markov Chain Monte Carlo Methods


The Metropolis-Hastings Algorithm

A collection of Metropolis-Hastings algorithms

Example (Comparison of tail effects)

[Figure] Random-walk Metropolis-Hastings algorithms based on a N(0, 1) instrumental for the generation of (a) a N(0, 1) distribution and (b) a distribution with density ψ(x) ∝ (1 + |x|)⁻³; 90% confidence envelopes of the means, derived from 500 parallel independent chains.

Markov Chain Monte Carlo Methods


The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms

Example (Cauchy by normal continued)

Again, Cauchy C(0, 1) target and Gaussian random walk proposal, ξ' ~ N(ξ, σ²), with acceptance probability

  (1 + ξ²)/(1 + ξ'²) ∧ 1.

Overall fit of the Cauchy density by the histogram satisfactory, but poor exploration of the tails: 99% quantile of C(0, 1) equal to 31.8, but no simulation exceeds 14 out of 10,000!
[Roberts & Tweedie, 2004]

Markov Chain Monte Carlo Methods


The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms

Again, lack of geometric ergodicity!
[Mengersen & Tweedie, 1996]

Slow convergence shown by the non-stable range after 10,000 iterations.

[Figure] Histogram of the 10,000 first steps of a random walk Metropolis-Hastings algorithm using a N(ξ, 1) proposal.

Markov Chain Monte Carlo Methods


The Metropolis-Hastings Algorithm

A collection of Metropolis-Hastings algorithms

[Figure] Range of 500 parallel runs for the same setup (10,000 iterations).

Markov Chain Monte Carlo Methods


The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms

Further convergence properties
skip detailed convergence

Under assumptions

(A1) f is super-exponential, i.e. it is positive with positive continuous first derivative such that lim_{|x|→∞} n(x)'∇log f(x) = −∞, where n(x) := x/|x|.
In words: exponential decay of f in every direction with rate tending to ∞.

(A2) lim sup_{|x|→∞} n(x)'m(x) < 0, where m(x) = ∇f(x)/|∇f(x)|.
In words: non-degeneracy of the contour manifold Cf(y) = {y : f(y) = f(x)}.

Q is geometrically ergodic, and V(x) ∝ f(x)^{−1/2} verifies the drift condition.
[Jarner & Hansen, 2000]

Markov Chain Monte Carlo Methods


The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms

Further [further] convergence properties
skip hyperdetailed convergence

If P ψ-irreducible and aperiodic, for r = (r(n))_{n∈N} a real-valued non-decreasing sequence such that, for all n, m ∈ N,

  r(n + m) ≤ r(n)r(m)

and r(0) = 1, for C a small set, τC = inf{n ≥ 1, Xn ∈ C}, and h ≥ 1, assume

  sup_{x∈C} Ex[ Σ_{k=0}^{τC−1} r(k)h(Xk) ] < ∞,

Markov Chain Monte Carlo Methods


The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms

then,

  S(f, C, r) := { x ∈ X : Ex[ Σ_{k=0}^{τC−1} r(k)h(Xk) ] < ∞ }

is full and absorbing and, for x ∈ S(f, C, r),

  lim_{n→∞} r(n) ||Pⁿ(x, ·) − f||_h = 0.

[Tuominen & Tweedie, 1994]

Markov Chain Monte Carlo Methods


The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms

Comments

[CLT, Rosenthal's inequality...] h-ergodicity implies the CLT for additive (possibly unbounded) functionals of the chain, Rosenthal's inequality and so on...

[Control of the moments of the return-time] The condition implies (because h ≥ 1) that

  sup_{x∈C} Ex[r0(τC)] ≤ sup_{x∈C} Ex[ Σ_{k=0}^{τC−1} r(k)h(Xk) ] < ∞,

where r0(n) = Σ_{l=0}^n r(l). Can be used to derive bounds for the coupling time, an essential step to determine computable bounds, using coupling inequalities.
[Roberts & Tweedie, 1998; Fort & Moulines, 2000]

Markov Chain Monte Carlo Methods


The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms

Alternative conditions

The condition is not really easy to work with...
[Possible alternative conditions]
(a) [Tuominen, Tweedie, 1994] There exists a sequence (Vn)_{n∈N}, Vn ≥ r(n)h, such that
(i) sup_C V0 < ∞,
(ii) {V0 = ∞} ⊂ {V1 = ∞} and
(iii) PVn+1 ≤ Vn − r(n)h + br(n)I_C.
Markov Chain Monte Carlo Methods


The Metropolis-Hastings Algorithm
A collection of Metropolis-Hastings algorithms

(b) [Fort, 2000] There exist V ≥ f ≥ 1 and b < ∞ such that sup_C V < ∞ and

  PV(x) + Ex[ Σ_{k=0}^{σC} Δr(k)f(Xk) ] ≤ V(x) + bI_C(x)

where σC is the hitting time on C and

  Δr(k) = r(k) − r(k − 1), k ≥ 1,   Δr(0) = r(0).

Result: (a) ⇔ (b) ⇔ sup_{x∈C} Ex[ Σ_{k=0}^{τC−1} r(k)f(Xk) ] < ∞.

Markov Chain Monte Carlo Methods


The Metropolis-Hastings Algorithm
Extensions

Extensions

There are many other families of MH algorithms:

Adaptive Rejection Metropolis Sampling
Reversible Jump (later!)
Langevin algorithms

to name just a few...

Markov Chain Monte Carlo Methods


The Metropolis-Hastings Algorithm
Extensions

Langevin Algorithms

Proposal based on the Langevin diffusion: Lt is defined by the stochastic differential equation

  dLt = dBt + (1/2) ∇log f(Lt) dt,

where Bt is the standard Brownian motion.

Theorem
The Langevin diffusion is the only non-explosive diffusion which is reversible with respect to f.

Markov Chain Monte Carlo Methods


The Metropolis-Hastings Algorithm
Extensions

Discretization

Instead, consider the sequence

  x^(t+1) = x^(t) + (σ²/2) ∇log f(x^(t)) + σεt,   εt ~ Np(0, Ip)

where σ² corresponds to the discretization step.

Unfortunately, the discretized chain may be transient, for instance when

  lim_{x→±∞} σ² |∇log f(x)| |x|⁻¹ > 1

Markov Chain Monte Carlo Methods


The Metropolis-Hastings Algorithm
Extensions

MH correction

Accept the new value Yt with probability

  [ f(Yt) exp{ −||x^(t) − Yt − (σ²/2)∇log f(Yt)||²/2σ² } ]
  ÷ [ f(x^(t)) exp{ −||Yt − x^(t) − (σ²/2)∇log f(x^(t))||²/2σ² } ]  ∧ 1.

Choice of the scaling factor σ
Should lead to an acceptance rate of 0.574 to achieve optimal convergence rates (when the components of x are uncorrelated).
[Roberts & Rosenthal, 1998]
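A minimal Python sketch of one Metropolis-adjusted Langevin (MALA) transition; logf and grad_logf are user-supplied, and the target need only be known up to a constant:

```python
import numpy as np

def mala_step(x, logf, grad_logf, sigma, rng):
    """One MALA transition for target f, using the corrected acceptance above."""
    mean_x = x + 0.5 * sigma**2 * grad_logf(x)      # drifted proposal mean at x
    y = mean_x + sigma * rng.standard_normal(np.shape(x))
    mean_y = y + 0.5 * sigma**2 * grad_logf(y)      # reverse-move proposal mean
    log_rho = (logf(y) - logf(x)
               - np.sum((x - mean_y)**2) / (2 * sigma**2)
               + np.sum((y - mean_x)**2) / (2 * sigma**2))
    return y if np.log(rng.uniform()) < log_rho else x
```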

Markov Chain Monte Carlo Methods


The Metropolis-Hastings Algorithm
Extensions

Optimizing the Acceptance Rate

Problem of choice of the transition kernel from a practical point of


view
Most common alternatives:
(a) a fully automated algorithm like ARMS;
(b) an instrumental density g which approximates f , such that
f /g is bounded for uniform ergodicity to apply;
(c) a random walk
In both cases (b) and (c), the choice of g is critical,

Markov Chain Monte Carlo Methods


The Metropolis-Hastings Algorithm
Extensions

Case of the independent Metropolis-Hastings algorithm

Choice of g that maximizes the average acceptance rate

  ρ = E[ min{ (f(Y)g(X)) / (f(X)g(Y)), 1 } ]
    = 2 P( f(Y)/g(Y) ≥ f(X)/g(X) ),   X ~ f, Y ~ g

Related to the speed of convergence of

  (1/T) Σ_{t=1}^T h(X^(t))

to Ef[h(X)] and to the ability of the algorithm to explore any complexity of f.

Markov Chain Monte Carlo Methods


The Metropolis-Hastings Algorithm
Extensions

Case of the independent MetropolisHastings algorithm (2)

Practical implementation
Choose a parameterized instrumental distribution g(·|θ) and
adjust the corresponding parameters θ based on the estimated
acceptance rate

    ρ̂(θ) = (2/m) Σ_{i=1}^m I{ f(y_i) g(x_i) > f(x_i) g(y_i) } ,

where x_1, . . . , x_m is an iid sample from f and y_1, . . . , y_m an iid
sample from g.
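A minimal sketch of this estimate, for a hypothetical pair f = N(0, 1)
(target) and g = Cauchy (instrumental); both choices are illustrative:

m <- 5000
x <- rnorm(m)         # iid sample from f
y <- rcauchy(m)       # iid sample from g
rho_hat <- 2 * mean(dnorm(y) * dcauchy(x) > dnorm(x) * dcauchy(y))
rho_hat               # estimated average acceptance rate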

Markov Chain Monte Carlo Methods


The Metropolis-Hastings Algorithm
Extensions

Example (Inverse Gaussian distribution)

no inverse

Simulation from

    f(z|θ1, θ2) ∝ z^(−3/2) exp{ −θ1 z − θ2/z + 2√(θ1 θ2) + log √(2θ2) } I_R+(z)

based on the Gamma distribution Ga(α, β) with β = √(θ2/θ1)

Markov Chain Monte Carlo Methods


The Metropolis-Hastings Algorithm
Extensions

Example (Inverse Gaussian distribution)

no inverse

Simulation from

    f(z|θ1, θ2) ∝ z^(−3/2) exp{ −θ1 z − θ2/z + 2√(θ1 θ2) + log √(2θ2) } I_R+(z)

based on the Gamma distribution Ga(α, β) with β = √(θ2/θ1)
Since

    f(x)/g(x) ∝ x^(−α−1/2) exp{ (β − θ1) x − θ2/x } ,

the maximum is attained at

    x*_β = [ (α + 1/2) − √( (α + 1/2)² + 4 θ2 (θ1 − β) ) ] / ( 2 (β − θ1) ) .

Markov Chain Monte Carlo Methods


The Metropolis-Hastings Algorithm
Extensions

Example (Inverse Gaussian distribution (2))

The analytical optimization (in β) of

    M(β) = (x*_β)^(−α−1/2) exp{ (β − θ1) x*_β − θ2/x*_β }

is impossible

    β      ρ̂(β)    E[Z]     E[1/Z]
    0.2    0.22    1.137    1.116
    0.5    0.41    1.158    1.108
    0.8    0.54    1.164    1.116
    0.9    0.56    1.154    1.115
    1      0.60    1.133    1.120
    1.1    0.63    1.148    1.126
    1.2    0.64    1.181    1.095
    1.5    0.71    1.148    1.115

(θ1 = 1.5, θ2 = 2, and m = 5000).

Markov Chain Monte Carlo Methods


The Metropolis-Hastings Algorithm
Extensions

Case of the random walk


Different approach to acceptance rates
A high acceptance rate does not indicate that the algorithm is
moving correctly: it may simply mean that the random walk is
moving too slowly on the surface of f .

Markov Chain Monte Carlo Methods


The Metropolis-Hastings Algorithm
Extensions

Case of the random walk


Different approach to acceptance rates
A high acceptance rate does not indicate that the algorithm is
moving correctly: it may simply mean that the random walk is
moving too slowly on the surface of f .
If x^(t) and y_t are close, i.e. f(x^(t)) ≃ f(y_t), then y_t is accepted with
probability

    min{ f(y_t)/f(x^(t)), 1 } ≃ 1 .

For multimodal densities with well-separated modes, the negative
effect of limited moves on the surface of f clearly shows.

Markov Chain Monte Carlo Methods


The Metropolis-Hastings Algorithm
Extensions

Case of the random walk (2)

If the average acceptance rate is low, the successive values of f (yt )


tend to be small compared with f (x(t) ), which means that the
random walk moves quickly on the surface of f since it often
reaches the borders of the support of f

Markov Chain Monte Carlo Methods


The Metropolis-Hastings Algorithm
Extensions

Rule of thumb

In small dimensions, aim at an average acceptance rate of 50%. In


large dimensions, at an average acceptance rate of 25%.
[Gelman, Gilks and Roberts, 1995]

Markov Chain Monte Carlo Methods


The Metropolis-Hastings Algorithm
Extensions

Rule of thumb

In small dimensions, aim at an average acceptance rate of 50%. In


large dimensions, at an average acceptance rate of 25%.
[Gelman, Gilks and Roberts, 1995]
This rule is to be taken with a pinch of salt!

Markov Chain Monte Carlo Methods


The Metropolis-Hastings Algorithm
Extensions

Example (Noisy AR(1) continued)


For a Gaussian random walk with scale small enough, the
random walk never jumps to the other mode. But if the scale is
sufficiently large, the Markov chain explores both modes and gives a
satisfactory approximation of the target distribution.

Markov Chain Monte Carlo Methods


The Metropolis-Hastings Algorithm
Extensions

Markov chain based on a random walk with scale = .1.

Markov Chain Monte Carlo Methods


The Metropolis-Hastings Algorithm
Extensions

Markov chain based on a random walk with scale = .5.

Markov Chain Monte Carlo Methods


The Gibbs Sampler

The Gibbs Sampler

The Gibbs Sampler


General Principles
Completion
Convergence
The Hammersley-Clifford theorem
Hierarchical models
Data Augmentation
Improper Priors

Markov Chain Monte Carlo Methods


The Gibbs Sampler
General Principles

General Principles
A very specific simulation algorithm based on the target
distribution f :
1. Uses the conditional densities f1 , . . . , fp from f

Markov Chain Monte Carlo Methods


The Gibbs Sampler
General Principles

General Principles
A very specific simulation algorithm based on the target
distribution f :
1. Uses the conditional densities f1 , . . . , fp from f
2. Start with the random variable X = (X1 , . . . , Xp )

Markov Chain Monte Carlo Methods


The Gibbs Sampler
General Principles

General Principles
A very specific simulation algorithm based on the target
distribution f :
1. Uses the conditional densities f1 , . . . , fp from f
2. Start with the random variable X = (X1 , . . . , Xp )
3. Simulate from the conditional densities,
Xi |x1 , x2 , . . . , xi1 , xi+1 , . . . , xp

fi (xi |x1 , x2 , . . . , xi1 , xi+1 , . . . , xp )

for i = 1, 2, . . . , p.

Markov Chain Monte Carlo Methods


The Gibbs Sampler
General Principles

Algorithm (Gibbs sampler)

Given x^(t) = (x1^(t), . . . , xp^(t)), generate

1. X1^(t+1) ~ f1(x1 | x2^(t), . . . , xp^(t));

2. X2^(t+1) ~ f2(x2 | x1^(t+1), x3^(t), . . . , xp^(t)),

...

p. Xp^(t+1) ~ fp(xp | x1^(t+1), . . . , x_{p−1}^(t+1))

Then X^(t+1) → X ~ f

Markov Chain Monte Carlo Methods


The Gibbs Sampler
General Principles

Properties

The full conditional densities f1 , . . . , fp are the only densities used


for simulation. Thus, even in a high dimensional problem, all of
the simulations may be univariate

Markov Chain Monte Carlo Methods


The Gibbs Sampler
General Principles

Properties

The full conditional densities f1 , . . . , fp are the only densities used

for simulation. Thus, even in a high dimensional problem, all of
the simulations may be univariate
The Gibbs sampler is not reversible with respect to f . However,
each of its p components is. Besides, it can be turned into a
reversible sampler, either using the Random Scan Gibbs sampler
(see below) or running instead the (double) sequence

    f1, . . . , fp−1, fp, fp−1, . . . , f1

Markov Chain Monte Carlo Methods


The Gibbs Sampler
General Principles

Example (Bivariate Gibbs sampler)


(X, Y) ~ f(x, y)
Generate a sequence of observations by
Set X0 = x0
For t = 1, 2, . . . , generate

    Yt ~ fY|X(·|x_{t−1})
    Xt ~ fX|Y(·|y_t)

where fY |X and fX|Y are the conditional distributions
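A minimal R sketch of this two-stage Gibbs sampler, for the standard
bivariate normal with correlation rho (an illustrative choice of f):

rho <- 0.7; T <- 5000
x <- y <- numeric(T)
x[1] <- 0; y[1] <- rnorm(1, rho * x[1], sqrt(1 - rho^2))
for (t in 2:T) {
  y[t] <- rnorm(1, rho * x[t-1], sqrt(1 - rho^2))   # Y_t ~ f_{Y|X}(.|x_{t-1})
  x[t] <- rnorm(1, rho * y[t],   sqrt(1 - rho^2))   # X_t ~ f_{X|Y}(.|y_t)
}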

Markov Chain Monte Carlo Methods


The Gibbs Sampler
General Principles

A Very Simple Example: Independent N(μ, σ²)
Observations

When Y1, . . . , Yn iid ~ N(y | μ, σ²) with both μ and σ unknown, the
posterior in (μ, σ²) is conjugate outside a standard family

Markov Chain Monte Carlo Methods


The Gibbs Sampler
General Principles

A Very Simple Example: Independent N(μ, σ²)
Observations

When Y1, . . . , Yn iid ~ N(y | μ, σ²) with both μ and σ unknown, the
posterior in (μ, σ²) is conjugate outside a standard family

But...

    μ | Y1:n, σ²  ~  N( (1/n) Σ_{i=1}^n Yi , σ²/n )

    σ² | Y1:n, μ  ~  IG( n/2 − 1, (1/2) Σ_{i=1}^n (Yi − μ)² )

assuming constant (improper) priors on both μ and σ²

Hence we may use the Gibbs sampler for simulating from the
posterior of (μ, σ²)

Markov Chain Monte Carlo Methods


The Gibbs Sampler
General Principles

R Gibbs Sampler for Gaussian posterior


n = length(Y)
S = sum(Y)
mu = S/n
for (i in 1:500) {
  S2 = sum((Y - mu)^2)
  sigma2 = 1/rgamma(1, n/2 - 1, rate = S2/2)   # sigma^2 | Y, mu ~ IG(n/2 - 1, S2/2)
  mu = S/n + sqrt(sigma2/n)*rnorm(1)           # mu | Y, sigma^2 ~ N(mean(Y), sigma^2/n)
}

Markov Chain Monte Carlo Methods


The Gibbs Sampler
General Principles

Example of results with n = 10 observations from the


N(0, 1) distribution

Number of Iterations 1

Markov Chain Monte Carlo Methods


The Gibbs Sampler
General Principles

Example of results with n = 10 observations from the


N(0, 1) distribution

Number of Iterations 1, 2

Markov Chain Monte Carlo Methods


The Gibbs Sampler
General Principles

Example of results with n = 10 observations from the


N(0, 1) distribution

Number of Iterations 1, 2, 3

Markov Chain Monte Carlo Methods


The Gibbs Sampler
General Principles

Example of results with n = 10 observations from the


N(0, 1) distribution

Number of Iterations 1, 2, 3, 4

Markov Chain Monte Carlo Methods


The Gibbs Sampler
General Principles

Example of results with n = 10 observations from the


N(0, 1) distribution

Number of Iterations 1, 2, 3, 4, 5

Markov Chain Monte Carlo Methods


The Gibbs Sampler
General Principles

Example of results with n = 10 observations from the


N(0, 1) distribution

Number of Iterations 1, 2, 3, 4, 5, 10

Markov Chain Monte Carlo Methods


The Gibbs Sampler
General Principles

Example of results with n = 10 observations from the


N(0, 1) distribution

Number of Iterations 1, 2, 3, 4, 5, 10, 25

Markov Chain Monte Carlo Methods


The Gibbs Sampler
General Principles

Example of results with n = 10 observations from the


N(0, 1) distribution

Number of Iterations 1, 2, 3, 4, 5, 10, 25, 50

Markov Chain Monte Carlo Methods


The Gibbs Sampler
General Principles

Example of results with n = 10 observations from the


N(0, 1) distribution

Number of Iterations 1, 2, 3, 4, 5, 10, 25, 50, 100

Markov Chain Monte Carlo Methods


The Gibbs Sampler
General Principles

Example of results with n = 10 observations from the


N(0, 1) distribution

Number of Iterations 1, 2, 3, 4, 5, 10, 25, 50, 100, 500

Markov Chain Monte Carlo Methods


The Gibbs Sampler
General Principles

Limitations of the Gibbs sampler

Formally, a special case of a sequence of 1-D M-H kernels, all with


acceptance rate uniformly equal to 1.
The Gibbs sampler
1. limits the choice of instrumental distributions

Markov Chain Monte Carlo Methods


The Gibbs Sampler
General Principles

Limitations of the Gibbs sampler

Formally, a special case of a sequence of 1-D M-H kernels, all with


acceptance rate uniformly equal to 1.
The Gibbs sampler
1. limits the choice of instrumental distributions
2. requires some knowledge of f

Markov Chain Monte Carlo Methods


The Gibbs Sampler
General Principles

Limitations of the Gibbs sampler

Formally, a special case of a sequence of 1-D M-H kernels, all with


acceptance rate uniformly equal to 1.
The Gibbs sampler
1. limits the choice of instrumental distributions
2. requires some knowledge of f
3. is, by construction, multidimensional

Markov Chain Monte Carlo Methods


The Gibbs Sampler
General Principles

Limitations of the Gibbs sampler

Formally, a special case of a sequence of 1-D M-H kernels, all with


acceptance rate uniformly equal to 1.
The Gibbs sampler
1. limits the choice of instrumental distributions
2. requires some knowledge of f
3. is, by construction, multidimensional
4. does not apply to problems where the number of parameters
varies as the resulting chain is not irreducible.

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Completion

Latent variables are back

The Gibbs sampler can be generalized to a much wider class of settings

A density g is a completion of f if

    ∫_Z g(x, z) dz = f(x)

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Completion

Latent variables are back

The Gibbs sampler can be generalized to a much wider class of settings

A density g is a completion of f if

    ∫_Z g(x, z) dz = f(x)

Note
The variable z may be meaningless for the problem

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Completion

Purpose: g should have full conditionals that are easy to simulate

for a Gibbs sampler to be implemented with g rather than f
For p > 1, write y = (x, z) and denote the conditional densities of
g(y) = g(y1, . . . , yp) by

    Y1 | y2, . . . , yp      ~ g1(y1 | y2, . . . , yp),
    Y2 | y1, y3, . . . , yp  ~ g2(y2 | y1, y3, . . . , yp),
    . . . ,
    Yp | y1, . . . , yp−1    ~ gp(yp | y1, . . . , yp−1).

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Completion

The move from Y^(t) to Y^(t+1) is defined as follows:

Algorithm (Completion Gibbs sampler)

Given (y1^(t), . . . , yp^(t)), simulate

1. Y1^(t+1) ~ g1(y1 | y2^(t), . . . , yp^(t)),

2. Y2^(t+1) ~ g2(y2 | y1^(t+1), y3^(t), . . . , yp^(t)),

...

p. Yp^(t+1) ~ gp(yp | y1^(t+1), . . . , y_{p−1}^(t+1)).
Markov Chain Monte Carlo Methods


The Gibbs Sampler
Completion

Example (Mixtures all over again)


Hierarchical missing data structure:
If

    X1, . . . , Xn ~ Σ_{i=1}^k pi f(x|θi),

then

    X | Z ~ f(x|θZ),
    Z ~ p1 I(z = 1) + . . . + pk I(z = k),

Z is the component indicator associated with observation x

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Completion

Example (Mixtures (2))

Conditionally on (Z1, . . . , Zn) = (z1, . . . , zn):

    π(p1, . . . , pk, θ1, . . . , θk | x1, . . . , xn, z1, . . . , zn)
      ∝ p1^(γ1+n1−1) · · · pk^(γk+nk−1)
        × π(θ1 | y1 + n1 x̄1, λ1 + n1) · · · π(θk | yk + nk x̄k, λk + nk),

with

    ni = Σ_j I(zj = i)   and   x̄i = Σ_{j; zj=i} xj / ni .

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Completion

Algorithm (Mixture Gibbs sampler)

1. Simulate

    θi ~ π(θi | yi + ni x̄i, λi + ni)   (i = 1, . . . , k)
    (p1, . . . , pk) ~ D(γ1 + n1, . . . , γk + nk)

2. Simulate (j = 1, . . . , n)

    Zj | xj, p1, . . . , pk, θ1, . . . , θk ~ Σ_{i=1}^k pij I(zj = i)

with (i = 1, . . . , k)

    pij ∝ pi f(xj | θi)

and update ni and x̄i (i = 1, . . . , k).
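A minimal R sketch of this sampler for k = 2 normal components with
known unit variances, N(0, 10) priors on the means and a uniform
(Dirichlet) prior on the weights — all hypothetical choices made so the
conditionals stay explicit:

set.seed(1)
x <- c(rnorm(70, -1), rnorm(30, 2))        # illustrative two-component data
n <- length(x); mu <- c(-0.5, 0.5); p <- c(0.5, 0.5)
T <- 2000; out <- matrix(0, T, 3)
for (t in 1:T) {
  # 1. allocations Z_j given (p, mu)
  w1 <- p[1] * dnorm(x, mu[1]); w2 <- p[2] * dnorm(x, mu[2])
  z  <- 1 + (runif(n) > w1 / (w1 + w2))
  # 2. weights from the Beta (k = 2 Dirichlet) conditional, gamma_i = 1
  n1 <- sum(z == 1); n2 <- n - n1
  p1 <- rbeta(1, 1 + n1, 1 + n2); p <- c(p1, 1 - p1)
  # 3. means from their normal conditionals (N(0, 10) priors, unit variance)
  for (i in 1:2) {
    ni <- sum(z == i); v <- 1 / (ni + 1/10)
    mu[i] <- rnorm(1, v * sum(x[z == i]), sqrt(v))
  }
  out[t, ] <- c(p[1], mu)
}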

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Completion

[Plug-in density estimates for T = 500, 1000, 2000, 3000, 4000, 5000 iterations]

Estimation of the plug-in density for 3 components and T
iterations for 149 observations of acidity levels in US lakes

Markov Chain Monte Carlo Methods


The Gibbs Sampler

Completion

[Density plot]

Galaxy dataset (82 observations) with k = 2 components:
average density (yellow), and plug-ins:
average (tomato), marginal MAP (green), MAP (maroon)

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Completion

A wee problem

Gibbs started at random

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Completion

A wee problem

[Two sample paths]
Gibbs started at random / Gibbs stuck at the wrong mode
Markov Chain Monte Carlo Methods


The Gibbs Sampler
Completion

Random Scan Gibbs sampler

back to basics

don't do random

Modification of the above Gibbs sampler where, with probability
1/p, the i-th component is drawn from fi(xi | x−i), i.e. when the
components are chosen at random

Motivation
The Random Scan Gibbs sampler is reversible.

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Completion

Slice sampler as generic Gibbs

If f(θ) can be written as a product

    Π_{i=1}^k fi(θ),

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Completion

Slice sampler as generic Gibbs

If f(θ) can be written as a product

    Π_{i=1}^k fi(θ),

it can be completed as

    Π_{i=1}^k I_{0 ≤ ωi ≤ fi(θ)} ,

leading to the following Gibbs algorithm:

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Completion

Algorithm (Slice sampler)

Simulate

1. ω1^(t+1) ~ U[0, f1(θ^(t))];
...
k. ωk^(t+1) ~ U[0, fk(θ^(t))];
k+1. θ^(t+1) ~ U_{A^(t+1)}, with

    A^(t+1) = { y; fi(y) ≥ ωi^(t+1), i = 1, . . . , k }.
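A minimal R sketch for the truncated N(3, 1) target of the following
plots, with the single factor f1(x) = exp(−(x − 3)²/2) on (0, 1), chosen
here because the slice A^(t+1) is then an explicit interval:

T <- 5000; x <- numeric(T); x[1] <- 0.5
for (t in 2:T) {
  omega <- runif(1, 0, exp(-(x[t-1] - 3)^2 / 2))   # omega^(t+1) | x^(t)
  lo <- max(0, 3 - sqrt(-2 * log(omega)))          # A = {y in (0,1): f1(y) >= omega}
  x[t] <- runif(1, lo, 1)
}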

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Completion

[Histogram]

Example of results with a truncated N(3, 1) distribution

Number of Iterations 2

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Completion

[Histogram]

Example of results with a truncated N(3, 1) distribution

Number of Iterations 2, 3

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Completion

[Histogram]

Example of results with a truncated N(3, 1) distribution

Number of Iterations 2, 3, 4

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Completion

[Histogram]

Example of results with a truncated N(3, 1) distribution

Number of Iterations 2, 3, 4, 5

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Completion

[Histogram]

Example of results with a truncated N(3, 1) distribution

Number of Iterations 2, 3, 4, 5, 10

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Completion

[Histogram]

Example of results with a truncated N(3, 1) distribution

Number of Iterations 2, 3, 4, 5, 10, 50

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Completion

[Histogram]

Example of results with a truncated N(3, 1) distribution

Number of Iterations 2, 3, 4, 5, 10, 50, 100

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Completion

Good slices

The slice sampler usually enjoys good theoretical properties (like
geometric ergodicity, and even uniform ergodicity when both f
and X are bounded).
As k increases, the determination of the set A^(t+1) may get
increasingly complex.

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Completion

Example (Stochastic volatility core distribution)

Difficult part of the stochastic volatility model

    π(x) ∝ exp −{ σ²(x − μ)² + β² exp(−x) y² + x } / 2 ,

simplified into exp −{ x² + α exp(−x) }

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Completion

Example (Stochastic volatility core distribution)

Difficult part of the stochastic volatility model

    π(x) ∝ exp −{ σ²(x − μ)² + β² exp(−x) y² + x } / 2 ,

simplified into exp −{ x² + α exp(−x) }
Slice sampling means simulation from a uniform distribution on

    A = { x; exp −{ x² + α exp(−x) } / 2 ≥ u }
      = { x; x² + α exp(−x) ≤ ε }

if we set ε = −2 log u.
Note: Inversion of x² + α exp(−x) = ε needs to be done by
trial-and-error.

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Completion
[Histogram and autocorrelation plot]

Histogram of a Markov chain produced by a slice sampler
and target distribution in overlay.

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Convergence

Properties of the Gibbs sampler


Theorem (Convergence)
For

    (Y1, Y2, . . . , Yp) ~ g(y1, . . . , yp),

if either
[Positivity condition]
(i) g^(i)(yi) > 0 for every i = 1, . . . , p, implies that
g(y1, . . . , yp) > 0, where g^(i) denotes the marginal distribution
of Yi, or
(ii) the transition kernel is absolutely continuous with respect to g,
then the chain is irreducible and positive Harris recurrent.

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Convergence

Properties of the Gibbs sampler (2)

Consequences
(i) If ∫ h(y) g(y) dy < ∞, then

    lim_{T→∞} (1/T) Σ_{t=1}^T h(Y^(t)) = ∫ h(y) g(y) dy   a.e. g.

(ii) If, in addition, (Y^(t)) is aperiodic, then

    lim_{n→∞} ‖ ∫ K^n(y, ·) μ(dy) − f ‖_TV = 0

for every initial distribution μ.

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Convergence

Slice sampler

fast on that slice

For convergence, the properties of (Xt) and of (f(Xt)) are identical

Theorem (Uniform ergodicity)

If f is bounded and supp f is bounded, the simple slice sampler is
uniformly ergodic.
[Mira & Tierney, 1997]

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Convergence

A small set for a slice sampler

no slice detail

For ε_⋆ < ε^⋆,

    C = { x ∈ X ; ε_⋆ < f(x) < ε^⋆ }

is a small set:

    Pr(x, ·) ≥ (ε_⋆/ε^⋆) ν(·)

where

    ν(A) = (1/ε_⋆) ∫_0^{ε_⋆} [ λ(A ∩ L(ω)) / λ(L(ω)) ] dω

and L(ω) = { x ∈ X ; f(x) > ω }
[Roberts & Rosenthal, 1998]

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Convergence

Slice sampler: drift

Under differentiability and monotonicity conditions, the slice
sampler also verifies a drift condition with V(x) = f(x)^(−β), is
geometrically ergodic, and there even exist explicit bounds on the
total variation distance
[Roberts & Rosenthal, 1998]

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Convergence

Slice sampler: drift

Under differentiability and monotonicity conditions, the slice
sampler also verifies a drift condition with V(x) = f(x)^(−β), is
geometrically ergodic, and there even exist explicit bounds on the
total variation distance
[Roberts & Rosenthal, 1998]

Example (Exponential Exp(1))

For n > 23,

    ‖K^n(x, ·) − f(·)‖_TV ≤ .054865 (0.985015)^n (n − 15.7043)

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Convergence

Slice sampler: convergence

no more slice detail

Theorem
For any density such that

    ε (∂/∂ε) λ({ x ∈ X ; f(x) > ε })   is non-increasing

then

    ‖K^523(x, ·) − f(·)‖_TV ≤ .0095

[Roberts & Rosenthal, 1998]

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Convergence

A poor slice sampler

Example
Consider

    f(x) = exp{ −‖x‖ },    x ∈ R^d

Slice sampler equivalent to one-dimensional slice sampler on

    π(z) = z^(d−1) e^(−z),    z > 0

or on

    π(u) = e^(−u^(1/d)),    u > 0

Poor performances when d large (heavy tails)

[Figure: sample runs of log(u) and ACFs for log(u) in dimensions
1, 10, 20 and 100 (Roberts & Rosenthal, 1999)]

Markov Chain Monte Carlo Methods


The Gibbs Sampler
The Hammersley-Clifford theorem

Hammersley-Clifford theorem

An illustration that conditionals determine the joint distribution

Theorem
If the joint density g(y1, y2) has conditional distributions
g1(y1 | y2) and g2(y2 | y1), then

    g(y1, y2) = g2(y2 | y1) / ∫ [ g2(v | y1) / g1(y1 | v) ] dv .

[Hammersley & Clifford, circa 1970]

Markov Chain Monte Carlo Methods


The Gibbs Sampler
The Hammersley-Clifford theorem

General HC decomposition

Under the positivity condition, the joint distribution g satisfies

    g(y1, . . . , yp) ∝ Π_{j=1}^p
        g_{ℓj}(y_{ℓj} | y_{ℓ1}, . . . , y_{ℓ_{j−1}}, y′_{ℓ_{j+1}}, . . . , y′_{ℓ_p})
      / g_{ℓj}(y′_{ℓj} | y_{ℓ1}, . . . , y_{ℓ_{j−1}}, y′_{ℓ_{j+1}}, . . . , y′_{ℓ_p})

for every permutation ℓ on {1, 2, . . . , p} and every y′ ∈ Y .

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Hierarchical models

Hierarchical models
no hierarchy

The Gibbs sampler is particularly well suited to hierarchical models

Example (Animal epidemiology)

Counts of the number of cases of clinical mastitis in 127 dairy
cattle herds over a one year period
Number of cases in herd i

    Xi ~ P(λi),    i = 1, . . . , m

where λi is the underlying rate of infection in herd i

Lack of independence might manifest itself as overdispersion.
Markov Chain Monte Carlo Methods


The Gibbs Sampler
Hierarchical models

Example (Animal epidemiology (2))

Modified model

    Xi ~ P(λi)
    λi ~ Ga(α, βi)
    βi ~ IG(a, b),

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Hierarchical models

Example (Animal epidemiology (2))

Modified model

    Xi ~ P(λi)
    λi ~ Ga(α, βi)
    βi ~ IG(a, b),

The Gibbs sampler corresponds to conditionals

    π(λi | x, α, βi) = Ga( xi + α, [1 + 1/βi]^(−1) )
    π(βi | x, α, a, b, λi) = IG( α + a, [λi + 1/b]^(−1) )
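A minimal R sketch of these two conditional draws, with α, a, b treated
as fixed and simulated counts (all illustrative choices); Ga(·, s) and
IG(·, s) are read with scale s, so rgamma rates are the reciprocals:

m <- 127; alpha <- 2; a <- 2.1; b <- 1            # hypothetical hyperparameters
x <- rpois(m, 2)                                  # illustrative herd counts
beta <- rep(1, m); T <- 1000
for (t in 1:T) {
  lambda <- rgamma(m, x + alpha, rate = 1 + 1/beta)        # Ga(x_i+alpha, [1+1/beta_i]^{-1})
  beta   <- 1 / rgamma(m, alpha + a, rate = lambda + 1/b)  # IG(alpha+a, [lambda_i+1/b]^{-1})
}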

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Hierarchical models

if you hate rats

Example (Rats)
Experiment where rats are intoxicated by a substance, then treated
by either a placebo or a drug:

    xij ~ N(θi, σc²),             1 ≤ j ≤ Ji^c,   control
    yij ~ N(θi + δi, σa²),        1 ≤ j ≤ Ji^a,   intoxication
    zij ~ N(θi + δi + ξi, σt²),   1 ≤ j ≤ Ji^t,   treatment

Additional variable wi , equal to 1 if the rat is treated with the


drug, and 0 otherwise.

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Hierarchical models

Example (Rats (2))

Prior distributions (1 ≤ i ≤ I),

    θi ~ N(μθ, σθ²),    δi ~ N(μδ, σδ²),

and

    ξi ~ N(μP, σP²)   or   ξi ~ N(μD, σD²),

if ith rat treated with a placebo (P) or a drug (D)

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Hierarchical models

Example (Rats (2))

Prior distributions (1 ≤ i ≤ I),

    θi ~ N(μθ, σθ²),    δi ~ N(μδ, σδ²),

and

    ξi ~ N(μP, σP²)   or   ξi ~ N(μD, σD²),

if ith rat treated with a placebo (P) or a drug (D)

Hyperparameters of the model,

    μθ, μδ, μP, μD, σc, σa, σt, σθ, σδ, σP, σD,

associated with Jeffreys noninformative priors.
Alternative prior with two possible levels of intoxication

    δi ~ p N(μδ1, σδ1²) + (1 − p) N(μδ2, σδ2²),

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Hierarchical models

Conditional decompositions

Easy decomposition of the posterior distribution

For instance, if

    θ | θ1 ~ π1(θ | θ1),    θ1 ~ π2(θ1),

then

    π(θ | x) = ∫ π(θ | θ1, x) π(θ1 | x) dθ1,

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Hierarchical models

Conditional decompositions (2)

where

    π(θ | θ1, x) = f(x | θ) π1(θ | θ1) / m1(x | θ1),

    m1(x | θ1) = ∫ f(x | θ) π1(θ | θ1) dθ,

    π(θ1 | x) = m1(x | θ1) π2(θ1) / m(x),

    m(x) = ∫ m1(x | θ1) π2(θ1) dθ1.

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Hierarchical models

Conditional decompositions (3)

Moreover, this decomposition works for the posterior moments,
that is, for every function h,

    E[h(θ) | x] = E^{π(θ1|x)} [ E^{π1}[ h(θ) | θ1, x ] ],

where

    E^{π1}[ h(θ) | θ1, x ] = ∫ h(θ) π(θ | θ1, x) dθ.
Markov Chain Monte Carlo Methods


The Gibbs Sampler
Hierarchical models

Example (Rats inc., continued)

if you still hate rats

Posterior complete distribution given by

    π((θi, δi, ξi)i, μθ, . . . , σc, . . . | D) ∝
      Π_{i=1}^I exp −{ (θi − μθ)²/2σθ² + (δi − μδ)²/2σδ² }
      × Π_{j=1}^{Ji^c} exp −{ (xij − θi)²/2σc² }
      × Π_{j=1}^{Ji^a} exp −{ (yij − θi − δi)²/2σa² }
      × Π_{j=1}^{Ji^t} exp −{ (zij − θi − δi − ξi)²/2σt² }
      × Π_{ℓi=0} exp −{ (ξi − μP)²/2σP² } Π_{ℓi=1} exp −{ (ξi − μD)²/2σD² }
      × (σθ σδ)^(−I−1) σP^(−IP−1) σD^(−ID−1)
      × σc^(−Σi Ji^c −1) σa^(−Σi Ji^a −1) σt^(−Σi Ji^t −1)

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Hierarchical models

Local conditioning property

For the hierarchical model

    π(θ) = ∫_{Θ1×···×Θn} π1(θ | θ1) π2(θ1 | θ2) · · · πn+1(θn) dθ1 · · · dθn .

we have

    π(θi | x, θ, θ1, . . . , θn) = π(θi | θi−1, θi+1)

with the convention θ0 = θ and θn+1 = 0.

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Hierarchical models

Example (Rats inc., terminated)

still this zemmiphobia?!

[Figure: convergence of the posterior means for the control and
intoxication effects over 10,000 iterations, and posteriors of the
placebo and drug effects]

The full conditional distributions correspond to standard
distributions and Gibbs sampling applies.

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Hierarchical models

Posterior Gibbs inference

    Effect      Probability    Confidence
    δ           1.00           [-3.48, -2.17]
    ξD          0.9998         [0.94, 2.50]
    ξP          0.94           [-0.17, 1.24]
    ξD − ξP     0.985          [0.14, 2.20]

Posterior probabilities of significant effects

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Data Augmentation

Data Augmentation
The Gibbs sampler with only two steps is particularly useful

Algorithm (Data Augmentation)

Given y^(t),

1. Simulate Y1^(t+1) ~ g1(y1 | y2^(t));

2. Simulate Y2^(t+1) ~ g2(y2 | y1^(t+1)).

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Data Augmentation

Data Augmentation
The Gibbs sampler with only two steps is particularly useful

Algorithm (Data Augmentation)

Given y^(t),

1. Simulate Y1^(t+1) ~ g1(y1 | y2^(t));

2. Simulate Y2^(t+1) ~ g2(y2 | y1^(t+1)).

Theorem (Markov property)

Both (Y1^(t)) and (Y2^(t)) are Markov chains, with transitions

    Ki(x, x*) = ∫ gi(y | x) g_{3−i}(x* | y) dy,

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Data Augmentation

Example (Grouped counting data)

360 consecutive records of the number of passages per unit time

    Number of passages       0     1     2    3    4 or more
    Number of observations   139   128   55   25   13

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Data Augmentation

Example (Grouped counting data (2))

Feature: Observations with 4 passages and more are grouped
If observations are Poisson P(λ), the likelihood is

    ℓ(λ | x1, . . . , x5) ∝ e^(−347λ) λ^(128+55·2+25·3) ( 1 − e^(−λ) Σ_{i=0}^3 λ^i/i! )^13

which can be difficult to work with.

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Data Augmentation

Example (Grouped counting data (2))

Feature: Observations with 4 passages and more are grouped
If observations are Poisson P(λ), the likelihood is

    ℓ(λ | x1, . . . , x5) ∝ e^(−347λ) λ^(128+55·2+25·3) ( 1 − e^(−λ) Σ_{i=0}^3 λ^i/i! )^13

which can be difficult to work with.

Idea: With a prior π(λ) = 1/λ, complete the vector (y1, . . . , y13) of
the 13 units of size 4 or more

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Data Augmentation

Algorithm (Poisson-Gamma Gibbs)

a. Simulate Yi^(t) ~ P(λ^(t−1)) I_{y≥4},    i = 1, . . . , 13

b. Simulate

    λ^(t) ~ Ga( 313 + Σ_{i=1}^13 yi^(t), 360 )

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Data Augmentation

Algorithm (Poisson-Gamma Gibbs)

a. Simulate Yi^(t) ~ P(λ^(t−1)) I_{y≥4},    i = 1, . . . , 13

b. Simulate

    λ^(t) ~ Ga( 313 + Σ_{i=1}^13 yi^(t), 360 )

The Bayes estimator

    δ = (1/360T) Σ_{t=1}^T ( 313 + Σ_{i=1}^13 yi^(t) )

converges quite rapidly

to R&B

[Figure: histogram and rawplot of λ over 500 iterations]
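A minimal R sketch of this data augmentation scheme; the truncated
Poisson draws in step a use simple rejection:

T <- 500; lambda <- numeric(T); lambda[1] <- 1
y <- numeric(13)
for (t in 2:T) {
  # a. complete the 13 grouped observations: Poisson truncated to {4, 5, ...}
  for (i in 1:13) repeat { y[i] <- rpois(1, lambda[t-1]); if (y[i] >= 4) break }
  # b. lambda | y ~ Ga(313 + sum(y), 360) under pi(lambda) = 1/lambda
  lambda[t] <- rgamma(1, 313 + sum(y), rate = 360)
}
mean(lambda)   # Bayes estimate of lambda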

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Data Augmentation

Rao-Blackwellization
If (y1, y2, . . . , yp)^(t), t = 1, 2, . . . , T is the output from a Gibbs
sampler

    δ0 = (1/T) Σ_{t=1}^T h(y1^(t)) → ∫ h(y1) g(y1) dy1

and is unbiased.

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Data Augmentation

Rao-Blackwellization
If (y1, y2, . . . , yp)^(t), t = 1, 2, . . . , T is the output from a Gibbs
sampler

    δ0 = (1/T) Σ_{t=1}^T h(y1^(t)) → ∫ h(y1) g(y1) dy1

and is unbiased.
The Rao-Blackwellization replaces δ0 with its conditional
expectation

    δrb = (1/T) Σ_{t=1}^T E[ h(Y1) | y2^(t), . . . , yp^(t) ].

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Data Augmentation

Rao-Blackwellization (2)

Then
Both estimators converge to E[h(Y1 )]
Both are unbiased,

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Data Augmentation

Rao-Blackwellization (2)

Then
Both estimators converge to E[h(Y1)]
Both are unbiased,
and

    var( E[ h(Y1^(t)) | Y2^(t), . . . , Yp^(t) ] ) ≤ var(h(Y1)),

so δrb is uniformly better (for Data Augmentation)

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Data Augmentation

Examples of Rao-Blackwellization
Example
Bivariate normal Gibbs sampler

    X | y ~ N(ρy, 1 − ρ²)
    Y | x ~ N(ρx, 1 − ρ²).

Then

    δ0 = (1/T) Σ_{i=1}^T X^(i)

and

    δ1 = (1/T) Σ_{i=1}^T E[ X^(i) | Y^(i) ] = (1/T) Σ_{i=1}^T ρ Y^(i)

estimate E[X] and

    σ²_{δ0} / σ²_{δ1} = 1/ρ² > 1.
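A minimal R sketch comparing both estimators over repeated short Gibbs
runs; note the 1/ρ² ratio is the iid benchmark, so for correlated Gibbs
output the empirical ratio differs, but δ1 remains preferable here:

rho <- 0.9; T <- 500; R <- 200
d0 <- d1 <- numeric(R)
for (r in 1:R) {
  x <- y <- numeric(T); x[1] <- rnorm(1)
  y[1] <- rnorm(1, rho * x[1], sqrt(1 - rho^2))
  for (t in 2:T) {
    y[t] <- rnorm(1, rho * x[t-1], sqrt(1 - rho^2))
    x[t] <- rnorm(1, rho * y[t],   sqrt(1 - rho^2))
  }
  d0[r] <- mean(x)          # delta_0
  d1[r] <- mean(rho * y)    # delta_1, the conditional expectations
}
var(d0) / var(d1)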

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Data Augmentation

Examples of Rao-Blackwellization (2)

Example (Poisson-Gamma Gibbs contd)
Naive estimate

    δ0 = (1/T) Σ_{t=1}^T λ^(t)

and Rao-Blackwellized version

    δrb = (1/T) Σ_{t=1}^T E[ λ^(t) | x1, . . . , x5, y1^(t), . . . , y13^(t) ]
        = (1/360T) Σ_{t=1}^T ( 313 + Σ_{i=1}^13 yi^(t) ),

back to graph

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Data Augmentation

NP Rao-Blackwellization & Rao-Blackwellized NP

Another substantial benefit of Rao-Blackwellization is in the


approximation of densities of different components of y without
nonparametric density estimation methods.

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Data Augmentation

NP Rao-Blackwellization & Rao-Blackwellized NP

Another substantial benefit of Rao-Blackwellization is in the


approximation of densities of different components of y without
nonparametric density estimation methods.
The estimator

    (1/T) Σ_{t=1}^T gi(yi | yj^(t), j ≠ i) → gi(yi),

is unbiased.

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Data Augmentation

The Duality Principle


skip dual part

Ties together the properties of the two Markov chains in Data


Augmentation
Consider a Markov chain (X^(t)) and a sequence (Y^(t)) of random
variables generated from the conditional distributions

    X^(t) | y^(t) ~ π(x | y^(t))
    Y^(t+1) | x^(t), y^(t) ~ f(y | x^(t), y^(t)) .

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Data Augmentation

The Duality Principle


skip dual part

Ties together the properties of the two Markov chains in Data


Augmentation
Consider a Markov chain (X^(t)) and a sequence (Y^(t)) of random
variables generated from the conditional distributions

    X^(t) | y^(t) ~ π(x | y^(t))
    Y^(t+1) | x^(t), y^(t) ~ f(y | x^(t), y^(t)) .

Theorem (Duality properties)


If the chain (Y (t) ) is ergodic then so is (X (t) ) and the duality also
holds for geometric or uniform ergodicity.

Note
The chain (Y (t) ) can be discrete, and the chain (X (t) ) continuous.

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Improper Priors

Improper Priors

Unsuspected danger resulting from careless use of MCMC


algorithms:

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Improper Priors

Improper Priors

Unsuspected danger resulting from careless use of MCMC


algorithms:
It may happen that
all conditional distributions are well defined,

all conditional distributions may be simulated from, but...

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Improper Priors

Improper Priors

Unsuspected danger resulting from careless use of MCMC


algorithms:
It may happen that
all conditional distributions are well defined,

all conditional distributions may be simulated from, but...

the system of conditional distributions may not correspond to


any joint distribution

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Improper Priors

Improper Priors

Unsuspected danger resulting from careless use of MCMC


algorithms:
It may happen that
all conditional distributions are well defined,

all conditional distributions may be simulated from, but...

the system of conditional distributions may not correspond to


any joint distribution
Warning The problem is due to careless use of the Gibbs sampler
in a situation for which the underlying assumptions are violated

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Improper Priors

Example (Conditional exponential distributions)

For the model

    X1 | x2 ~ Exp(x2),    X2 | x1 ~ Exp(x1)

the only candidate f(x1, x2) for the joint density is

    f(x1, x2) ∝ exp(−x1 x2),

but

    ∫ f(x1, x2) dx1 dx2 = ∞

These conditionals do not correspond to a joint
probability distribution
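Running the corresponding Gibbs sampler shows what goes wrong; a
minimal R sketch:

T <- 1000; x1 <- x2 <- numeric(T); x1[1] <- x2[1] <- 1
for (t in 2:T) {
  x1[t] <- rexp(1, rate = x2[t-1])   # X1 | x2 ~ Exp(x2)
  x2[t] <- rexp(1, rate = x1[t])     # X2 | x1 ~ Exp(x1)
}
plot(log(x1), type = "l")  # the chain drifts off: no joint distribution exists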

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Improper Priors

Example (Improper random effects)

Consider

    Yij = μ + αi + εij,    i = 1, . . . , I,  j = 1, . . . , J,

where

    αi ~ N(0, σ²)   and   εij ~ N(0, τ²),

the Jeffreys (improper) prior for the parameters μ, σ and τ is

    π(μ, σ², τ²) = 1 / (σ² τ²) .

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Improper Priors

Example (Improper random effects 2)

The conditional distributions

    αi | y, μ, σ², τ² ~ N( J(ȳi − μ) / (J + τ²σ^(−2)), (Jτ^(−2) + σ^(−2))^(−1) ),

    μ | α, y, σ², τ² ~ N( ȳ − ᾱ, τ²/JI ),

    σ² | α, μ, y, τ² ~ IG( I/2, (1/2) Σ_i αi² ),

    τ² | α, μ, y, σ² ~ IG( IJ/2, (1/2) Σ_{i,j} (yij − αi − μ)² ),

are well-defined and a Gibbs sampler can be easily implemented in
this setting.

Markov Chain Monte Carlo Methods


The Gibbs Sampler

Improper Priors

Example (Improper random effects 2)
[Figure: trace and histogram of the μ^(t)s over 1,000 iterations]
The figure shows the sequence of
μ^(t)s and its histogram over
1,000 iterations. They both fail
to indicate that the
corresponding joint distribution
does not exist

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Improper Priors

Final notes on impropriety

The improper posterior Markov chain


cannot be positive recurrent

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Improper Priors

Final notes on impropriety

The improper posterior Markov chain


cannot be positive recurrent
The major task in such settings is to find indicators that flag that
something is wrong. However, the output of an improper Gibbs
sampler may not differ from a positive recurrent Markov chain.

Markov Chain Monte Carlo Methods


The Gibbs Sampler
Improper Priors

Final notes on impropriety

The improper posterior Markov chain


cannot be positive recurrent
The major task in such settings is to find indicators that flag that
something is wrong. However, the output of an improper Gibbs
sampler may not differ from a positive recurrent Markov chain.

Example
The random effects model was initially treated in Gelfand et al.
(1990) as a legitimate model

Markov Chain Monte Carlo Methods


MCMC tools for variable dimension problems

MCMC tools for variable dimension problems

MCMC tools for variable dimension problems


Introduction
Greens method
Birth and Death processes

Markov Chain Monte Carlo Methods


MCMC tools for variable dimension problems
Introduction

A new brand of problems

There exist setups where


One of the things we do not know is the number
of things we do not know
[Peter Green]

Markov Chain Monte Carlo Methods


MCMC tools for variable dimension problems
Introduction

Bayesian Model Choice

Typical in model choice settings


- model construction (nonparametrics)
- model checking (goodness of fit)
- model improvement (expansion)
- model pruning (contraction)
- model comparison
- hypothesis testing (Science)
- prediction (finance)

Markov Chain Monte Carlo Methods


MCMC tools for variable dimension problems
Introduction

Bayesian Model Choice II

Many areas of application



variable selection

change point(s) determination

image analysis

graphical models and expert systems

variable dimension models

causal inference

Markov Chain Monte Carlo Methods


MCMC tools for variable dimension problems
Introduction

Example (Mixture again, yes!)

[Histogram of the galaxy velocities]

Benchmark dataset: Speed of galaxies
[Roeder, 1990; Richardson & Green, 1997]

Markov Chain Monte Carlo Methods


MCMC tools for variable dimension problems
Introduction

Example (Mixture again (2))

Modelling by a mixture model

    Mi :  xj ~ Σ_{ℓ=1}^i pℓi N(μℓi, σℓi²),    i ∈ I,  (j = 1, . . . , 82)

Markov Chain Monte Carlo Methods


MCMC tools for variable dimension problems
Introduction

Bayesian variable dimension model

Definition
A variable dimension model is defined as a collection of models
(k = 1, . . . , K),

    Mk = { f(·|θk); θk ∈ Θk } ,

associated with a collection of priors on the parameters of these
models,

    πk(θk) ,

and a prior distribution on the indices of these models,

    { ϱ(k), k = 1, . . . , K } .

Alternative notation:

    π(Mk, θk) = ϱ(k) πk(θk)

Markov Chain Monte Carlo Methods


MCMC tools for variable dimension problems
Introduction

Bayesian solution
Formally over:
1. Compute

    p(Mi | x) = pi ∫_{Θi} fi(x|θi) πi(θi) dθi / Σ_j pj ∫_{Θj} fj(x|θj) πj(θj) dθj

2. Take largest p(Mi | x) to determine model, or use

    Σ_j pj ∫_{Θj} fj(x|θj) πj(θj) dθj

as predictive
[Different decision theoretic perspectives]

Markov Chain Monte Carlo Methods


MCMC tools for variable dimension problems
Introduction

Difficulties
Not at

(formal) inference level

parameter space representation

    Θ = ⋃_k Θk ,    [see above]

[even if there are parameters common to several models]

Markov Chain Monte Carlo Methods


MCMC tools for variable dimension problems
Introduction

Difficulties
Not at

(formal) inference level

parameter space representation

    Θ = ⋃_k Θk ,    [see above]

[even if there are parameters common to several models]

Rather at

(practical) inference level:
model separation, interpretation, overfitting, prior modelling,
prior coherence

computational level:
infinity of models, moves between models, predictive
computation

Markov Chain Monte Carlo Methods


MCMC tools for variable dimension problems
Greens method

Green's resolution

Setting up a proper measure-theoretic framework for designing

moves between models Mk
[Green, 1995]

Markov Chain Monte Carlo Methods


MCMC tools for variable dimension problems
Greens method

Green's resolution

Setting up a proper measure-theoretic framework for designing
moves between models Mk
[Green, 1995]
Create a reversible kernel K on H = ⋃_k {k} × Θk such that

    ∫_A ∫_B K(x, dy) π(x) dx = ∫_B ∫_A K(y, dx) π(y) dy

for the invariant density π [x is of the form (k, θ^(k))]

Markov Chain Monte Carlo Methods


MCMC tools for variable dimension problems
Greens method

Green's resolution (2)

Write K as

    K(x, B) = Σ_{m=1}^∞ ∫_B ρm(x, y) qm(x, dy) + ω(x) I_B(x)

where qm(x, dy) is a transition measure to model Mm and
ρm(x, y) the corresponding acceptance probability.

Markov Chain Monte Carlo Methods


MCMC tools for variable dimension problems
Greens method

Green's resolution (2)

Write K as

    K(x, B) = Σ_{m=1}^∞ ∫_B ρm(x, y) qm(x, dy) + ω(x) I_B(x)

where qm(x, dy) is a transition measure to model Mm and
ρm(x, y) the corresponding acceptance probability.
Introduce a symmetric measure ξm(dx, dy) on H² and impose on
π(dx) qm(x, dy) to be absolutely continuous wrt ξm,

    π(dx) qm(x, dy) / ξm(dx, dy) = gm(x, y)

Markov Chain Monte Carlo Methods


MCMC tools for variable dimension problems
Greens method

Green's resolution (2)

Write K as

    K(x, B) = Σ_{m=1}^∞ ∫_B ρm(x, y) qm(x, dy) + ω(x) I_B(x)

where qm(x, dy) is a transition measure to model Mm and
ρm(x, y) the corresponding acceptance probability.
Introduce a symmetric measure ξm(dx, dy) on H² and impose on
π(dx) qm(x, dy) to be absolutely continuous wrt ξm,

    π(dx) qm(x, dy) / ξm(dx, dy) = gm(x, y)

Then

    ρm(x, y) = min{ 1, gm(y, x) / gm(x, y) }

ensures reversibility

Markov Chain Monte Carlo Methods


MCMC tools for variable dimension problems
Greens method

Special case
When contemplating a move between two models, M1 and M2,
the Markov chain being in state θ1 ∈ M1, denote by K1→2(θ1, dθ)
and K2→1(θ2, dθ) the corresponding kernels, under the detailed
balance condition

    π(dθ1) K1→2(θ1, dθ) = π(dθ2) K2→1(θ2, dθ) ,

and take, wlog, dim(M2) > dim(M1).

Markov Chain Monte Carlo Methods


MCMC tools for variable dimension problems
Greens method

Special case
When contemplating a move between two models, M1 and M2,
the Markov chain being in state θ1 ∈ M1, denote by K1→2(θ1, dθ)
and K2→1(θ2, dθ) the corresponding kernels, under the detailed
balance condition

    π(dθ1) K1→2(θ1, dθ) = π(dθ2) K2→1(θ2, dθ) ,

and take, wlog, dim(M2) > dim(M1).
Proposal expressed as

    θ2 = Ψ1→2(θ1, v1→2)

where v1→2 is a random variable of dimension
dim(M2) − dim(M1), generated as

    v1→2 ~ φ1→2(v1→2) .

Markov Chain Monte Carlo Methods


MCMC tools for variable dimension problems
Greens method

Special case (2)

In this case, q1→2(θ1, dθ2) has density

    φ1→2(v1→2) | ∂Ψ1→2(θ1, v1→2) / ∂(θ1, v1→2) |^(−1) ,

by the Jacobian rule.

If probability ϖ1→2 of choosing move to M2 while in M1,
acceptance probability reduces to

    α(θ1, v1→2) = 1 ∧ [ π(M2, θ2) ϖ2→1 / ( π(M1, θ1) ϖ1→2 φ1→2(v1→2) ) ]
                      × | ∂Ψ1→2(θ1, v1→2) / ∂(θ1, v1→2) | .

Markov Chain Monte Carlo Methods


MCMC tools for variable dimension problems
Greens method

Interpretation (1)

The representation puts us back in a fixed dimension setting:

M1 × V1→2 and M2 are in one-to-one relation

Markov Chain Monte Carlo Methods


MCMC tools for variable dimension problems
Greens method

Interpretation (1)

The representation puts us back in a fixed dimension setting:

M1 × V1→2 and M2 are in one-to-one relation

regular Metropolis–Hastings move from the couple (θ1, v1→2)
to θ2 when stationary distributions are

    π(M1, θ1) × φ1→2(v1→2)

and π(M2, θ2), and when proposal distribution is deterministic
(??)

Markov Chain Monte Carlo Methods


MCMC tools for variable dimension problems
Greens method

Interpretation (2)
Consider, instead, the proposals

    θ2 ~ N( Ψ1→2(θ1, v1→2), ε )

and

    Ψ1→2(θ1, v1→2) ~ N(θ2, ε)

Markov Chain Monte Carlo Methods


MCMC tools for variable dimension problems
Greens method

Interpretation (2)
Consider, instead, the proposals

    θ2 ~ N( Ψ1→2(θ1, v1→2), ε )

and

    Ψ1→2(θ1, v1→2) ~ N(θ2, ε)

Reciprocal proposal has density

    φ1→2(v1→2) × (1/√(2πε)) exp{ −(θ2 − Ψ1→2(θ1, v1→2))² / 2ε }
        × | ∂Ψ1→2(θ1, v1→2) / ∂(θ1, v1→2) |

by the Jacobian rule.

Markov Chain Monte Carlo Methods


MCMC tools for variable dimension problems
Greens method

Interpretation (2)
Consider, instead, the proposals

    θ2 ~ N( Ψ1→2(θ1, v1→2), ε )

and

    Ψ1→2(θ1, v1→2) ~ N(θ2, ε)

Reciprocal proposal has density

    φ1→2(v1→2) × (1/√(2πε)) exp{ −(θ2 − Ψ1→2(θ1, v1→2))² / 2ε }
        × | ∂Ψ1→2(θ1, v1→2) / ∂(θ1, v1→2) |

by the Jacobian rule.
Thus Metropolis–Hastings acceptance probability is

    1 ∧ [ π(M2, θ2) / ( π(M1, θ1) φ1→2(v1→2) ) ]
        × | ∂Ψ1→2(θ1, v1→2) / ∂(θ1, v1→2) |

Does not depend on ε: Let ε go to 0

Markov Chain Monte Carlo Methods


MCMC tools for variable dimension problems
Greens method

Saturation
[Brooks, Giudici, Roberts, 2003]
Consider series of models Mi (i = 1, . . . , k) such that

    max_i dim(Mi) = nmax < ∞

Parameter of model Mi then completed with an auxiliary variable
Ui such that

    dim(θi, ui) = nmax

and

    Ui ~ qi(ui)

Posit the following joint distribution for [augmented] model Mi

    π(Mi, θi) qi(ui)

Markov Chain Monte Carlo Methods


MCMC tools for variable dimension problems
Greens method

Back to fixed dimension


Saturation: no varying dimension anymore since (i , ui ) of fixed
dimension.

Markov Chain Monte Carlo Methods


MCMC tools for variable dimension problems
Greens method

Back to fixed dimension


Saturation: no varying dimension anymore since (i , ui ) of fixed
dimension.

Algorithm (Three stage MCMC update)

1. Update the current value of the parameter, θi;
2. Update ui conditional on θi;
3. Update the current model from Mi to Mj using the bijection

    (θj, uj) = Ψi→j(θi, ui)

Markov Chain Monte Carlo Methods


MCMC tools for variable dimension problems
Greens method

Example (Mixture of normal distributions)

    Mk :  Σ_{j=1}^k pjk N(μjk, σjk²)

[Richardson & Green, 1997]

Moves:

Markov Chain Monte Carlo Methods


MCMC tools for variable dimension problems
Greens method

Example (Mixture of normal distributions)

    Mk :  Σ_{j=1}^k pjk N(μjk, σjk²)

[Richardson & Green, 1997]

Moves:
(i) Split

    pjk = pj(k+1) + p(j+1)(k+1)
    pjk μjk = pj(k+1) μj(k+1) + p(j+1)(k+1) μ(j+1)(k+1)
    pjk σjk² = pj(k+1) σj(k+1)² + p(j+1)(k+1) σ(j+1)(k+1)²

(ii) Merge (reverse)

Markov Chain Monte Carlo Methods


MCMC tools for variable dimension problems
Greens method

Example (Mixture (2))

Additional Birth and Death moves for empty components
(created from the prior distribution)
Equivalent proposal:
(i)′ Split

    u1, u2, u3 ~ U(0, 1)
    pj(k+1) = u1 pjk
    μj(k+1) = u2 μjk
    σj(k+1)² = u3 σjk²
Markov Chain Monte Carlo Methods


MCMC tools for variable dimension problems
Greens method

Normalised enzyme dataset

[Histogram and rawplot of k]

Histogram and rawplot of
100,000 k's under the
constraint k ≤ 5.

Markov Chain Monte Carlo Methods


MCMC tools for variable dimension problems
Greens method

Example (Hidden Markov model)

move to birth

Extension of the mixture model

    P(X_{t+1} = j | X_t = i) = wij,    wij = ωij / Σ_ℓ ωiℓ ,

    Yt | Xt = i ~ N(μi, σi²).

Markov Chain Monte Carlo Methods


MCMC tools for variable dimension problems
Greens method






[DAG: · · · → Xt → Xt+1 → · · · , with Xt → Yt and Xt+1 → Yt+1]

Markov Chain Monte Carlo Methods


MCMC tools for variable dimension problems
Greens method

Example (Hidden Markov model (2))

Move to split component j⋆ into j1 and j2:

    ωij1 = ωij⋆ εi,      ωij2 = ωij⋆ (1 − εi),    εi ~ U(0, 1);
    ωj1j = ωj⋆j ξj,      ωj2j = ωj⋆j / ξj,        ξj ~ log N(0, 1);

similar ideas give ωj1j2 etc.;

    μj1 = μj⋆ − 3σj⋆ ε,   μj2 = μj⋆ + 3σj⋆ ε,     ε ~ N(0, 1);
    σj1² = σj⋆² ξ,        σj2² = σj⋆² / ξ,         ξ ~ log N(0, 1).

[Robert & al., 2000]

Markov Chain Monte Carlo Methods


MCMC tools for variable dimension problems
Greens method
[Figure]

Upper panel: First 40,000 values of k for S&P 500 data, plotted
every 20th sweep. Middle panel: estimated posterior distribution
of k for S&P 500 data as a function of number of sweeps. Lower
panel: μ1 and μ2 in first 20,000 sweeps with k = 2 for S&P 500
data.

Markov Chain Monte Carlo Methods


MCMC tools for variable dimension problems
Greens method

Example (Autoregressive model)


move to birth

Typical setting for model choice: determine order p of AR(p)


model

Markov Chain Monte Carlo Methods


MCMC tools for variable dimension problems
Greens method

Example (Autoregressive model)

move to birth

Typical setting for model choice: determine order p of AR(p)
model
Consider the (less standard) representation

    Π_{i=1}^p (1 − λi B) Xt = εt ,    εt ~ N(0, σ²)

where the λi's are within the unit circle if complex and within
[−1, 1] if real.
[Huerta and West, 1998]

Markov Chain Monte Carlo Methods


MCMC tools for variable dimension problems
Greens method

AR(p) reversible jump algorithm

Example (Autoregressive (2))
Uniform priors for the real and complex roots λj,

    1/(⌊k/2⌋ + 1) × Π_{λi ∈ R} (1/2) I_{|λi|<1} × Π_{λi ∉ R} (1/π) I_{|λi|<1}

and (purely birth-and-death) proposals based on these priors

    k → k+1    [Creation of real root]
    k → k+2    [Creation of complex root]
    k → k-1    [Deletion of real root]
    k → k-2    [Deletion of complex root]

Markov Chain Monte Carlo Methods


MCMC tools for variable dimension problems

[Figure]

Reversible jump algorithm based on an AR(3) simulated dataset of
530 points (upper left) with true parameters θi (0.1, 0.3, 0.4)
and σ = 1. First histogram associated with p, the following
histograms with the θi's, for different values of p, and of σ². Final
graph: scatterplot of the complex roots. One before last:
evolution of θ1, θ2, θ3.

Markov Chain Monte Carlo Methods


MCMC tools for variable dimension problems
Birth and Death processes

Birth and Death processes

instant death!

Use of an alternative methodology based on a Birth-and-Death
(point) process
[Preston, 1976; Ripley, 1977; Geyer & Møller, 1994; Stephens, 1999]

Markov Chain Monte Carlo Methods


MCMC tools for variable dimension problems
Birth and Death processes

Birth and Death processes

instant death!

Use of an alternative methodology based on a Birth-and-Death
(point) process
[Preston, 1976; Ripley, 1977; Geyer & Møller, 1994; Stephens, 1999]
Idea: Create a Markov chain in continuous time, i.e. a Markov
jump process, moving between models Mk, by births (to increase
the dimension), deaths (to decrease the dimension), and other
moves.

Markov Chain Monte Carlo Methods


MCMC tools for variable dimension problems
Birth and Death processes

Birth and Death processes


Time till next modification (jump) is exponentially distributed
with rate depending on current state
Remember: if 1 , . . . , v are exponentially distributed, i E(i ),
!
X
i
min i E
i

Markov Chain Monte Carlo Methods


MCMC tools for variable dimension problems
Birth and Death processes

Birth and Death processes

Time till next modification (jump) is exponentially distributed
with rate depending on current state
Remember: if τ1, . . . , τv are exponentially distributed, τi ~ E(λi),

    min_i τi ~ E( Σ_i λi )

Difference with MH-MCMC: Whenever a jump occurs, the
corresponding move is always accepted. Acceptance probabilities
replaced with holding times.
Implausible configurations

    L(θ)π(θ) ≪ 1

die quickly.

Markov Chain Monte Carlo Methods


MCMC tools for variable dimension problems
Birth and Death processes

Balance condition

Sufficient to have detailed balance

    L(θ)π(θ) q(θ, θ′) = L(θ′)π(θ′) q(θ′, θ)    for all θ, θ′

for

    π̃(θ) ∝ L(θ)π(θ)

to be stationary.
Here q(θ, θ′) is the rate of moving from state θ to θ′.
Possibility to add split/merge and fixed-k processes if balance
condition satisfied.

Markov Chain Monte Carlo Methods


MCMC tools for variable dimension problems
Birth and Death processes

Example (Mixture contd)

Stephens' original modelling:

Representation as a (marked) point process

    Φ = { pj, (μj, σj) }_j

Birth rate λ0 (constant)

Birth proposal from the prior

Death rate δj(Φ) for removal of point j

Death proposal removes component and modifies weights
Markov Chain Monte Carlo Methods


MCMC tools for variable dimension problems
Birth and Death processes

Example (Mixture contd (2))

Overall death rate

    Σ_{j=1}^k δj(Φ) = δ(Φ)

Balance condition

    (k+1) d(Φ ∪ {p, (μ, σ)}) L(Φ ∪ {p, (μ, σ)}) = λ0 L(Φ)

with

    d(Φ \ {pj, (μj, σj)}) = δj(Φ)

Case of Poisson prior k ~ Poi(λ1)

    δj(Φ) = λ0 × L(Φ \ {pj, (μj, σj)}) / L(Φ) × π(k) / ( (k+1) π(k+1) )
          = (λ0/λ1) × L(Φ \ {pj, (μj, σj)}) / L(Φ)
Markov Chain Monte Carlo Methods


MCMC tools for variable dimension problems
Birth and Death processes

Stephens' original algorithm

Algorithm (Mixture Birth & Death)

For v = 0, 1, · · · , V
  t ← v
  Run till t > v + 1
  1. Compute

      δj(Φ) = (λ0/λ1) L(Φ \ {pj, (μj, σj)}) / L(Φ)

  2. δ(Φ) ← Σ_{j=1}^k δj(Φ),  ξ ← λ0 + δ(Φ),  u ~ U([0, 1])
  3. t ← t − log(u)/ξ

Markov Chain Monte Carlo Methods


MCMC tools for variable dimension problems
Birth and Death processes

Algorithm (Mixture Birth & Death (contd))

4. With probability δ(Φ)/ξ
     Remove component j with probability δj(Φ)/δ(Φ)
     k ← k − 1
     pℓ ← pℓ/(1 − pj)  (ℓ ≠ j)
   Otherwise,
     Add component j from the prior: (μj, σj) ~ π,  pj ~ Be(γ, kγ)
     pℓ ← pℓ(1 − pj)  (ℓ ≠ j)
     k ← k + 1
5. Run I steps of MCMC(k, Φ, p)

Markov Chain Monte Carlo Methods


MCMC tools for variable dimension problems
Birth and Death processes

Rescaling time
move to HMM

In discrete-time RJMCMC, let the time unit be 1/N, and put

    βk = λk/N   and   δk = 1 − λk/N

As N → ∞, each birth proposal will be accepted, and having k
components, births occur according to a Poisson process with rate
λk while component (w, φ) dies with rate

    lim_{N→∞} N (1/(k+1)) min(A^(−1), 1)
      = (1/likelihood ratio) × (λk/(k+1)) × b(w, φ) / (1−w)^(k−1)

Hence RJMCMC → BDMCMC. This holds more generally.

Markov Chain Monte Carlo Methods


MCMC tools for variable dimension problems
Birth and Death processes

Example (HMM models (contd))


Implementation of the split-and-combine rule of Richardson and
Green (1997) in continuous time

Markov Chain Monte Carlo Methods


MCMC tools for variable dimension problems
Birth and Death processes

Example (HMM models (contd))

Implementation of the split-and-combine rule of Richardson and
Green (1997) in continuous time
Move to split component j into j1 and j2:

    ωij1 = ωij εi,      ωij2 = ωij (1 − εi),    εi ~ U(0, 1);
    ωj1j = ωjj ξj,      ωj2j = ωjj / ξj,        ξj ~ log N(0, 1);

similar ideas give ωj1j2 etc.;

    μj1 = μj − 3σj ε,   μj2 = μj + 3σj ε,       ε ~ N(0, 1);
    σj1² = σj² ξ,       σj2² = σj² / ξ,          ξ ~ log N(0, 1).

[Cappé & al, 2001]

Markov Chain Monte Carlo Methods


MCMC tools for variable dimension problems
Birth and Death processes

Wind intensity in Athens

[Histogram]

Histogram and rawplot of 500 wind intensities in Athens

Markov Chain Monte Carlo Methods


MCMC tools for variable dimension problems
Birth and Death processes

[Figure]

MCMC output on k (histogram and rawplot), corresponding
loglikelihood values (histogram and rawplot), and number of
moves (histogram and rawplot)

Markov Chain Monte Carlo Methods


MCMC tools for variable dimension problems

Birth and Death processes

[Figure]

MCMC sequence of the probabilities ωj of the stationary
distribution (top) and the parameters (bottom) of the
three components when conditioning on k = 3

Markov Chain Monte Carlo Methods


MCMC tools for variable dimension problems

Birth and Death processes

[Figure]

MCMC evaluation of the marginal density of the dataset
(dashes), compared with R nonparametric density estimate
(solid lines).

Markov Chain Monte Carlo Methods


Sequential importance sampling

Sequential importance sampling

basic importance

Sequential importance sampling


Adaptive MCMC
Importance sampling revisited
Dynamic extensions
Population Monte Carlo

Markov Chain Monte Carlo Methods


Sequential importance sampling
Adaptive MCMC

Adaptive MCMC is not possible

Algorithms trained on-line usually invalid:

Markov Chain Monte Carlo Methods


Sequential importance sampling
Adaptive MCMC

Adaptive MCMC is not possible

Algorithms trained on-line usually invalid:


using the whole past of the chain implies that this is not a
Markov chain any longer!

Markov Chain Monte Carlo Methods


Sequential importance sampling
Adaptive MCMC

Example (Poly t distribution)

Consider a t-distribution T(3, θ, 1) sample (x1, . . . , xn) with a flat
prior π(θ) = 1
If we try to fit a normal proposal from the empirical mean and
variance of the chain so far,

    μt = (1/t) Σ_{i=1}^t θ^(i)   and   σt² = (1/t) Σ_{i=1}^t (θ^(i) − μt)²,

Markov Chain Monte Carlo Methods


Sequential importance sampling
Adaptive MCMC

Example (Poly t distribution)

Consider a t-distribution T(3, θ, 1) sample (x1, . . . , xn) with a flat
prior π(θ) = 1
If we try to fit a normal proposal from the empirical mean and
variance of the chain so far,

    μt = (1/t) Σ_{i=1}^t θ^(i)   and   σt² = (1/t) Σ_{i=1}^t (θ^(i) − μt)²,

Metropolis–Hastings algorithm with acceptance probability

    Π_j [ (ν + (xj − θ^(t))²) / (ν + (xj − ξ)²) ]^((ν+1)/2)
      × exp{ −(μt − θ^(t))² / 2σt² } / exp{ −(μt − ξ)² / 2σt² } ,

where ξ ~ N(μt, σt²).

Markov Chain Monte Carlo Methods


Sequential importance sampling
Adaptive MCMC

Example (Poly t distribution (2))

Invalid scheme:

when the range of the initial values is too small, the θ^(i)'s cannot
converge to the target distribution and concentrate on too
small a support.

long-range dependence on past values modifies the
distribution of the sequence.

using past simulations to create a non-parametric
approximation to the target distribution does not work either

Markov Chain Monte Carlo Methods


Sequential importance sampling

Adaptive MCMC

[Figure: adaptive scheme for a sample of 10 x_j ∼ T3 and initial variances of (top) 0.1, (middle) 0.5, and (bottom) 2.5]

Markov Chain Monte Carlo Methods


Sequential importance sampling

Adaptive MCMC

[Figure: comparison of the distribution of an adaptive scheme sample of 25,000 points with initial variance of 2.5 and of the target distribution]

Markov Chain Monte Carlo Methods


Sequential importance sampling

Adaptive MCMC

[Figure: sample produced by 50,000 iterations of a nonparametric adaptive MCMC scheme and comparison of its distribution with the target distribution]

Markov Chain Monte Carlo Methods


Sequential importance sampling
Adaptive MCMC

Simply forget about it!

Warning:
One should not constantly adapt the proposal to past performances.
Either adaptation ceases after a burn-in period,
or the adaptive scheme must be theoretically assessed in its own right.

Markov Chain Monte Carlo Methods


Sequential importance sampling
Importance sampling revisited

Importance sampling revisited

Approximation of integrals

I = ∫ h(x) π(x) dx

by unbiased estimators

Î = (1/n) Σ_{i=1}^n ϱ_i h(x_i)

when

x_1 , . . . , x_n ∼iid q(x)

and

ϱ_i := π(x_i) / q(x_i)
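For concreteness, a small sketch with target π = N(0, 1), proposal q = N(0, 4) and h(x) = x², so that I = 1; names are ours.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n = 10_000
x = rng.normal(0.0, 2.0, size=n)               # x_i ~ q = N(0, 4)
rho = norm.pdf(x) / norm.pdf(x, scale=2.0)     # rho_i = pi(x_i)/q(x_i)
print(np.mean(rho * x ** 2))                   # unbiased estimate of E_pi[X^2] = 1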

Markov Chain Monte Carlo Methods


Sequential importance sampling
Importance sampling revisited

Markov extension
For densities f and g, and importance weight

ω(x) = f (x)/g(x) ,

for any kernel K(x, x′) with stationary distribution f ,

∫ ω(x) K(x, x′) g(x) dx = f (x′) .

[MacEachern, Clyde, and Liu, 1999]

Consequence: an importance sample transformed by MCMC transitions keeps its weights

Unbiasedness preservation:

E[ ω(X) h(X′) ] = ∫ ω(x) h(x′) K(x, x′) g(x) dx dx′ = E_f [h(X)]

Markov Chain Monte Carlo Methods


Sequential importance sampling
Importance sampling revisited

Not so exciting!

The weights do not change!


If x has small weight

ω(x) = f (x)/g(x) ,

then

x′ ∼ K(x, x′)

keeps this small weight.
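A short sketch of this point: pushing weighted points through one Metropolis step with stationary distribution f moves the points but leaves every weight, small or large, untouched (f, g and the kernel are arbitrary choices of ours).

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
f, g = norm(0.0, 1.0), norm(1.0, 1.5)          # target f, importance density g
x = g.rvs(size=5000, random_state=rng)
w = f.pdf(x) / g.pdf(x)                        # weights, frozen from now on

# one random-walk Metropolis step with stationary distribution f
y = x + rng.normal(0.0, 0.5, size=x.size)
accept = np.log(rng.uniform(size=x.size)) < f.logpdf(y) - f.logpdf(x)
x = np.where(accept, y, x)                     # points move, w does not change

print(np.sum(w * x) / np.sum(w))               # still estimates E_f[X] = 0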

Markov Chain Monte Carlo Methods


Sequential importance sampling
Importance sampling revisited

Pros and cons of importance sampling vs. MCMC


- Production of a sample (IS) vs. of a Markov chain (MCMC)
- Dependence on importance function (IS) vs. on previous value (MCMC)
- Unbiasedness (IS) vs. convergence to the true distribution (MCMC)
- Variance control (IS) vs. learning costs (MCMC)
- Recycling of past simulations (IS) vs. progressive adaptability (MCMC)
- Processing of moving targets (IS) vs. handling large dimensional problems (MCMC)
- Non-asymptotic validity (IS) vs. difficult asymptotia for adaptive algorithms (MCMC)

Markov Chain Monte Carlo Methods


Sequential importance sampling
Dynamic extensions

Dynamic importance sampling

Idea
It is possible to generalise importance sampling using random weights ω_t such that

E[ω_t | x_t] = π(x_t)/g(x_t)
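For instance, multiplying the usual ratio by independent noise with unit mean satisfies this constraint and keeps the estimator unbiased; a small sketch with π = N(0, 1) and g = N(0, 4), names ours.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
x = rng.normal(0.0, 2.0, size=10_000)               # x_t ~ g = N(0, 4)
ratio = norm.pdf(x) / norm.pdf(x, scale=2.0)        # pi(x_t)/g(x_t)
omega = ratio * rng.exponential(1.0, size=x.size)   # E[omega_t | x_t] = ratio
print(np.mean(omega * x ** 2))                      # still estimates E_pi[X^2] = 1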

Markov Chain Monte Carlo Methods


Sequential importance sampling
Dynamic extensions

(a) Self-regenerative chains

[Sahu & Zhigljavsky, 1998; Gåsemyr, 2002]
Proposal

Y ∼ p̃(y) ∝ p(y)

and target distribution π(y) ∝ π̃(y)

Ratios

ω(x) = π(x)/p(x)   [Unknown]

and

ω̃(x) = π̃(x)/p̃(x)   [Known]

Acceptance function

α(x) = 1 / (1 + κ ω̃(x)) ,  κ > 0

Markov Chain Monte Carlo Methods


Sequential importance sampling
Dynamic extensions

Geometric jumps

Theorem
If

Y ∼ p̃(y)

and

W | Y = y ∼ Geo(α(y)) ,

then

X_t = · · · = X_{t+W−1} = Y ≠ X_{t+W}

defines a Markov chain with stationary distribution π

Markov Chain Monte Carlo Methods


Sequential importance sampling
Dynamic extensions

Plusses

- Valid for any choice of κ [κ small = large variance and κ large = slow convergence]
- Only depends on current value [Difference with Metropolis]
- Random integer weight W [Similarity with Metropolis]
- Saves on the rejections: always accept [Difference with Metropolis]
- Introduces geometric noise compared with importance sampling:

  σ²_SZ = 2 σ²_IS + (1/κ) μ²

- Can be used with a sequence of proposals p_k and constants κ_k [Adaptivity]

Markov Chain Monte Carlo Methods


Sequential importance sampling
Dynamic extensions

A generalisation
[Gåsemyr, 2002]
Proposal density p(y) and probability q(y) of accepting a jump.

Algorithm (Gåsemyr's dynamic weights)
Generate a sequence of random weights W_n by
1. Generate Y_n ∼ p(y)
2. Generate V_n ∼ B(q(y_n))
3. Generate S_n ∼ Geo(α(y_n))
4. Take W_n = V_n S_n
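A minimal sketch of one draw of these weights; p, q and α below are placeholder choices of ours, with α(y) in (0, 1].

import numpy as np

rng = np.random.default_rng(5)

def gasemyr_weight(p_sample, q, alpha):
    # one draw (Y_n, W_n) of the dynamic weights above
    y = p_sample(rng)                  # Y_n ~ p
    v = rng.uniform() < q(y)           # V_n ~ B(q(y_n))
    s = rng.geometric(alpha(y))        # S_n ~ Geo(alpha(y_n)), support 1, 2, ...
    return y, v * s                    # W_n = V_n S_n

y, w = gasemyr_weight(lambda r: r.normal(),
                      q=lambda y: 0.9,
                      alpha=lambda y: 1.0 / (1.0 + np.exp(-abs(y))))
print(y, w)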

Markov Chain Monte Carlo Methods


Sequential importance sampling
Dynamic extensions

Validation
For

π̄(y) = p(y) q(y) / ∫ p(y) q(y) dy ,

the chain (X_t) associated with the sequence (Y_n , W_n) by

Y_1 = X_1 = · · · = X_{1+W_1 −1} ,  Y_2 = X_{1+W_1} = · · ·

is a Markov chain with transition

K(x, y) = α(x) π̄(y) ,

which has a point mass at y = x with weight 1 − α(x).

Markov Chain Monte Carlo Methods


Sequential importance sampling
Dynamic extensions

Ergodicity for Gåsemyr's scheme
Necessary and sufficient condition:
π is stationary for (X_t) iff

α(y) = q(y) / (κ (π(y)/p(y))) = q(y) / (κ w(y))

for some constant κ.
Implies that

E[W_n | Y_n = y] = κ w(y) .

[Average importance sampling]
Special case: α(y) = 1/(1 + κ w(y)) of Sahu and Zhigljavsky (2001)

Markov Chain Monte Carlo Methods


Sequential importance sampling
Dynamic extensions

Properties
Constraint on κ: for α(y) ≤ 1, κ must be such that

p(y) q(y) / π̃(y) ≤ κ

Reverse of accept-reject conditions (!)
Variance of

Σ_n W_n h(Y_n) / Σ_n W_n

is

∫ {(h(y) − μ)² / q(y)} w(y) π(y) dy − (1/κ) μ² ,  with μ = E_π[h(X)],  (4)

by Cramér-Wold/Slutsky
Still worse than importance sampling.

Markov Chain Monte Carlo Methods


Sequential importance sampling
Dynamic extensions

(b) Dynamic weighting

[Wong & Liang, 1997; Liu, Liang & Wong, 2001; Liang, 2002]

Generalisation of the above: simultaneous generation of points and weights, (θ_t , ω_t), under the constraint

E[ω_t | θ_t] ∝ π(θ_t)  (5)

Same use as importance sampling weights

Markov Chain Monte Carlo Methods


Sequential importance sampling
Dynamic extensions

Algorithm (Liang's dynamic importance sampling)

1. Generate y ∼ K(x, y) and compute

   ϱ = ω π(y) K(y, x) / (π(x) K(x, y))

2. Generate u ∼ U(0, 1) and take

   (x′, ω′) = (y, (1 + δ) ϱ/a)   if u < a
   (x′, ω′) = (x, (1 + δ) θ/(1 − a))   otherwise

where a = ϱ/(ϱ + θ), θ = θ(x, ω), and δ ≥ 0 constant or independent rv
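A minimal sketch of one such transition for a symmetric random-walk kernel K, with δ constant and θ(x, ω) ≡ 1; target and tuning values are placeholder choices of ours.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
pi = norm(0.0, 1.0).pdf                  # target density (unnormalised is enough)

def dynamic_step(x, w, delta=0.0, theta=1.0, step=1.0):
    y = x + rng.normal(0.0, step)        # y ~ K(x, .), so K(y, x)/K(x, y) = 1
    rho = w * pi(y) / pi(x)              # rho = w pi(y)K(y, x)/pi(x)K(x, y)
    a = rho / (rho + theta)
    if rng.uniform() < a:
        return y, (1 + delta) * rho / a
    return x, (1 + delta) * theta / (1 - a)

x, w = 0.0, 1.0
for _ in range(1000):
    x, w = dynamic_step(x, w)
print(x, w)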

Markov Chain Monte Carlo Methods


Sequential importance sampling
Dynamic extensions

Preservation of the equilibrium equation

If g− and g+ denote the distributions of the augmented variable (X, W) before the step and after the step, respectively, then

∫ ω′ g+(x′, ω′) dω′
  = (1 + δ) ∫ [ϱ(ω, x, x′) + θ] {ϱ(ω, x, x′)/(ϱ(ω, x, x′) + θ)} g−(x, ω) K(x, x′) dx dω
    + (1 + δ) ∫ θ {(ϱ(ω, x′, z) + θ)/(ϱ(ω, x′, z) + θ)} g−(x′, ω) K(x′, z) dz dω
  = (1 + δ) [ ∫ ω g−(x, ω) {π(x′) K(x′, x)/π(x)} dx dω + ∫ θ g−(x′, ω) K(x′, z) dz dω ]
  = (1 + δ) [ ∫ π(x′) c′ K(x′, x) dx + c′ π(x′) ]
  = 2 (1 + δ) c′ π(x′) ,

using the equilibrium condition ∫ ω g−(x, ω) dω = c′ π(x) (and its analogue for θ): the weight expectation thus remains proportional to the target π(x′).

Markov Chain Monte Carlo Methods


Sequential importance sampling
Dynamic extensions

Special case: R-move

[Liang, 2002]
δ = 0 and θ ≡ 1, and thus

(x′, ω′) = (y, ϱ + 1)   if u < ϱ/(ϱ + 1),
(x′, ω′) = (x, ϱ + 1)   otherwise,

[Importance sampling]

Markov Chain Monte Carlo Methods


Sequential importance sampling
Dynamic extensions

Special case: W-move

θ ≡ 0, thus a = 1 and

(x′, ω′) = (y, ϱ) .

Q-move
[Liu & al, 2001]

(x′, ω′) = (y, θ ∨ ϱ)   if u < 1 ∧ ϱ/θ ,
(x′, ω′) = (x, a ω)   otherwise,

with a ≥ 1 either a constant or an independent random variable.

Markov Chain Monte Carlo Methods


Sequential importance sampling
Dynamic extensions

Notes

- Updating step in Q and R schemes written as

  (x_{t+1} , ω_{t+1}) = {x_t , ω_t / Pr(R_t = 0)}

  with probability Pr(R_t = 0) and

  (x_{t+1} , ω_{t+1}) = {y_{t+1} , ω_t r(x_t , y_{t+1}) / Pr(R_t = 1)}

  with probability Pr(R_t = 1), where R_t is the move indicator and

  y_{t+1} ∼ K(x_t , y)

Markov Chain Monte Carlo Methods


Sequential importance sampling
Dynamic extensions

Notes (2)

- Geometric structure of the weights:

  Pr(R_t = 0) = ω_t / ω_{t+1}

  and

  Pr(R_t = 1) = ω_t r(x_t , y_t) / (ω_t r(x_t , y_t) + θ) ,  θ > 0,

  for the R scheme

- Number of steps T before an acceptance (a jump) such that

  Pr(T ≥ t) = P(R_1 = 0, . . . , R_{t−1} = 0) = E[ ∏_{j=0}^{t−1} ω_j / ω_{j+1} ] = ω_0 E[1/ω_t] .

Markov Chain Monte Carlo Methods


Sequential importance sampling
Dynamic extensions

Alternative scheme

Preservation of weight expectation:

(x_{t+1} , ω_{t+1}) = (x_t , ε_t ω_t / Pr(R_t = 0))

with probability Pr(R_t = 0) and

(x_{t+1} , ω_{t+1}) = (y_{t+1} , (1 − ε_t) ω_t r(x_t , y_{t+1}) / Pr(R_t = 1))

with probability Pr(R_t = 1).

Markov Chain Monte Carlo Methods


Sequential importance sampling
Dynamic extensions

Alternative scheme (2)

Then

Pr(T = t) = P(R_1 = 0, . . . , R_{t−1} = 0, R_t = 1)
          = E[ ( ∏_{j=0}^{t−1} ε_j ω_j / ω_{j+1} ) (1 − ε_t) ω_{t−1} r(x_0 , Y_t) / ω_t ] ,

which is equal to

ε^{t−1} (1 − ε) E[ ω_0 r(x, Y_t) / ω_t ]

when the ε_j's are constant and deterministic.

Markov Chain Monte Carlo Methods


Sequential importance sampling
Dynamic extensions

Example
Choose a function 0 < ε(·, ·) < 1 and take, while in (x_0 , ω_0),

(x_1 , ω_1) = ( y_1 , (1 − ε(x_0 , y_1)) ω_0 r(x_0 , y_1) / α(x_0 , y_1) )

with probability

α(x_0 , y_1) = min(1, ω_0 r(x_0 , y_1))

and

(x_1 , ω_1) = ( x_0 , ε(x_0 , y_1) ω_0 / (1 − α(x_0 , y_1)) )

with probability 1 − α(x_0 , y_1).

Markov Chain Monte Carlo Methods


Sequential importance sampling
Population Monte Carlo

Population Monte Carlo

Idea
Simulate from the product distribution

π⊗n (x_1 , . . . , x_n) = ∏_{i=1}^n π(x_i)

and apply dynamic importance sampling to the sample (a.k.a. population)

x^(t) = (x_1^(t) , . . . , x_n^(t))

Markov Chain Monte Carlo Methods


Sequential importance sampling
Population Monte Carlo

Iterated importance sampling

As in Markov Chain Monte Carlo (MCMC) algorithms, introduction of a temporal dimension:

x_i^(t) ∼ q_t (x | x_i^(t−1)) ,  i = 1, . . . , n,  t = 1, . . .

and

Î_t = (1/n) Σ_{i=1}^n ϱ_i^(t) h(x_i^(t))

is still unbiased for

ϱ_i^(t) = π(x_i^(t)) / q_t (x_i^(t) | x_i^(t−1)) ,  i = 1, . . . , n

Markov Chain Monte Carlo Methods


Sequential importance sampling
Population Monte Carlo

Fundamental importance equality

Preservation of unbiasedness:

E[ h(X^(t)) π(X^(t)) / q_t (X^(t) | X^(t−1)) ]
  = ∫ h(x) {π(x)/q_t (x|y)} q_t (x|y) g(y) dx dy
  = ∫ h(x) π(x) dx

for any distribution g on X^(t−1)

Markov Chain Monte Carlo Methods


Sequential importance sampling
Population Monte Carlo

Sequential variance decomposition

Furthermore,

var( Î_t ) = (1/n²) Σ_{i=1}^n var( ϱ_i^(t) h(x_i^(t)) ) ,

if var(ϱ_i^(t)) exists, because the x_i^(t)'s are conditionally uncorrelated

Note
This decomposition is still valid for correlated [in i] x_i^(t)'s when incorporating weights ϱ_i^(t)

Markov Chain Monte Carlo Methods


Sequential importance sampling
Population Monte Carlo

Simulation of a population
The importance distribution of the sample (a.k.a. particles) x^(t),

q_t (x^(t) | x^(t−1)) ,

can depend on the previous sample x^(t−1) in any possible way as long as the marginal distributions

q_it (x) = ∫ q_t (x^(t)) dx_{−i}^(t)

can be expressed to build the importance weights

ϱ_it = π(x_i^(t)) / q_it (x_i^(t))

Markov Chain Monte Carlo Methods


Sequential importance sampling
Population Monte Carlo

Special case of the product proposal

If

q_t (x^(t) | x^(t−1)) = ∏_{i=1}^n q_it (x_i^(t) | x^(t−1))

[Independent proposals]

then

var( Î_t ) = (1/n²) Σ_{i=1}^n var( ϱ_i^(t) h(x_i^(t)) ) ,

Markov Chain Monte Carlo Methods


Sequential importance sampling
Population Monte Carlo

Validation

E[ ϱ_i^(t) h(X_i^(t)) ϱ_j^(t) h(X_j^(t)) ]
  = ∫ h(x_i) {π(x_i)/q_it (x_i | x^(t−1))} h(x_j) {π(x_j)/q_jt (x_j | x^(t−1))}
      × q_it (x_i | x^(t−1)) q_jt (x_j | x^(t−1)) dx_i dx_j g(x^(t−1)) dx^(t−1)
  = E_π [h(X)]²

whatever the distribution g on x^(t−1)

Markov Chain Monte Carlo Methods


Sequential importance sampling
Population Monte Carlo

Self-normalised version

In general, π is unscaled and the weight

ϱ_i^(t) ∝ π(x_i^(t)) / q_it (x_i^(t)) ,  i = 1, . . . , n ,

is scaled so that

Σ_i ϱ_i^(t) = 1

Markov Chain Monte Carlo Methods


Sequential importance sampling
Population Monte Carlo

Self-normalised version properties

- Loss of the unbiasedness property and of the variance decomposition
- Normalising constant can be estimated by

  ϖ_t = (1/(t n)) Σ_{τ=1}^t Σ_{i=1}^n π(x_i^(τ)) / q_iτ (x_i^(τ))

- Variance decomposition (approximately) recovered if ϖ_{t−1} is used instead

Markov Chain Monte Carlo Methods


Sequential importance sampling
Population Monte Carlo

Sampling importance resampling


Importance sampling from g can also produce samples from the
target
[Rubin, 1987]

Theorem (Bootstrapped importance sampling)

If a sample (x_i^⋆)_{1≤i≤m} is derived from the weighted sample (x_i , ϱ_i)_{1≤i≤n} by multinomial sampling with weights ϱ_i, then

x_i^⋆ ∼ π(x)

Note
Obviously, the x_i^⋆'s are not iid
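A small sketch of this multinomial resampling step, with g = N(1, 4) and target π = N(0, 1); names are ours.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
pi, g = norm(0.0, 1.0), norm(1.0, 2.0)
x = g.rvs(size=5000, random_state=rng)
rho = pi.pdf(x) / g.pdf(x)
rho /= rho.sum()                                        # normalised weights
x_star = rng.choice(x, size=5000, replace=True, p=rho)  # multinomial resampling
print(x_star.mean(), x_star.std())                      # close to those of N(0, 1)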

Markov Chain Monte Carlo Methods


Sequential importance sampling
Population Monte Carlo

Iterated sampling importance resampling

This principle can be extended to iterated importance sampling: after each iteration, resampling produces a sample from π
[Again, not iid!]

Incentive
Use previous sample(s) to learn about π and q

Markov Chain Monte Carlo Methods


Sequential importance sampling
Population Monte Carlo

Generic Population Monte Carlo

Algorithm (Population Monte Carlo Algorithm)

For t = 1, . . . , T
  For i = 1, . . . , n,
  1. Select the generating distribution q_it (·)
  2. Generate x̃_i^(t) ∼ q_it (x)
  3. Compute ϱ_i^(t) = π(x̃_i^(t)) / q_it (x̃_i^(t))
  Normalise the ϱ_i^(t)'s into ϱ̄_i^(t)'s
  Generate J_i,t ∼ M((ϱ̄_i^(t))_{1≤i≤n}) and set x_i,t = x̃_{J_i,t}^(t)
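A minimal sketch of this algorithm where every q_it is a Gaussian random walk centred at the current particle; tuning choices are ours, and an unnormalised π suffices since the weights are renormalised.

import numpy as np

rng = np.random.default_rng(8)
log_pi = lambda x: -0.5 * x ** 2        # unnormalised N(0, 1) target
N, scale = 1000, 0.5
x = rng.normal(0.0, 3.0, size=N)        # initial population

for t in range(10):
    x_prop = rng.normal(x, scale)                        # 2. generate x~ ~ q_it
    log_q = -0.5 * ((x_prop - x) / scale) ** 2           # log q_it, constants cancel
    log_rho = log_pi(x_prop) - log_q                     # 3. importance weights
    rho = np.exp(log_rho - log_rho.max())
    rho /= rho.sum()                                     # normalisation
    x = rng.choice(x_prop, size=N, replace=True, p=rho)  # multinomial resampling
print(x.mean(), x.var())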

Markov Chain Monte Carlo Methods


Sequential importance sampling
Population Monte Carlo

D-kernels in competition
A general adaptive construction:
Construct q_i,t as a mixture of D different transition kernels depending on x_i^(t−1),

q_i,t = Σ_{ℓ=1}^D p_t,ℓ K_ℓ (x_i^(t−1) , x) ,  Σ_{ℓ=1}^D p_t,ℓ = 1 ,

and adapt the weights p_t,ℓ.

Example
Take p_t,ℓ proportional to the survival rate of the points (a.k.a. particles) x_i^(t) generated from K_ℓ

Markov Chain Monte Carlo Methods


Sequential importance sampling
Population Monte Carlo

Implementation
Algorithm (D-kernel PMC)
For t = 1, . . . , T
  generate (K_i,t)_{1≤i≤N} ∼ M((p_t,k)_{1≤k≤D})
  for 1 ≤ i ≤ N, generate
    x̃_i,t ∼ K_{K_i,t} (x)
  compute and renormalise the importance weights ω̄_i,t
  generate (J_i,t)_{1≤i≤N} ∼ M((ω̄_i,t)_{1≤i≤N})
  take x_i,t = x̃_{J_i,t},t and p_{t+1,d} = Σ_{i=1}^N ω̄_i,t I_d (K_i,t)

Markov Chain Monte Carlo Methods


Sequential importance sampling
Population Monte Carlo

Links with particle filters

- Usually setting where π = π_t changes with t: Population Monte Carlo also adapts to this case
- Can be traced back all the way to Hammersley and Morton (1954) and the self-avoiding random walk problem
- Gilks and Berzuini (2001) produce iterated samples with (SIR) resampling steps, and add an MCMC step: this step must use a π_t-invariant kernel
- Chopin (2001) uses iterated importance sampling to handle large datasets: this is a special case of PMC where the q_it's are the posterior distributions associated with a portion k_t of the observed dataset

Markov Chain Monte Carlo Methods


Sequential importance sampling
Population Monte Carlo

Links with particle filters (2)

- Rubinstein and Kroese's (2004) cross-entropy method is parameterised importance sampling targeted at rare events
- Stavropoulos and Titterington's (1999) smooth bootstrap and Warnes' (2001) kernel coupler use nonparametric kernels on the previous importance sample to build an improved proposal: this is a special case of PMC
- West's (1992) mixture approximation is a precursor of smooth bootstrap
- Mengersen and Robert's (2002) pinball sampler is an MCMC attempt at population sampling
- Del Moral and Doucet's (2003) sequential Monte Carlo samplers also relate to PMC, with a Markovian dependence on the past sample x^(t) but (limited) stationarity constraints

Markov Chain Monte Carlo Methods


Sequential importance sampling
Population Monte Carlo

Things can go wrong

Unexpected behaviour of the mixture weights when the number of particles increases:

Σ_{i=1}^N ω̄_i,t I_{K_i,t = d} →_P 1/D

Conclusion
At each iteration, every weight converges to 1/D:
the algorithm fails to learn from experience!!

Markov Chain Monte Carlo Methods


Sequential importance sampling
Population Monte Carlo

Saved by Rao-Blackwell!!

Modification: Rao-Blackwellisation (= conditioning)

Use the whole mixture in the importance weight:

ω_i,t = π(x̃_i,t) / Σ_{d=1}^D p_t,d K_d (x_i,t−1 , x̃_i,t)

instead of

ω_i,t = π(x̃_i,t) / K_{K_i,t} (x_i,t−1 , x̃_i,t)

Markov Chain Monte Carlo Methods


Sequential importance sampling
Population Monte Carlo

Adapted algorithm
Algorithm (Rao-Blackwellised D-kernel PMC)
At time t (t = 1, . . . , T ),
  Generate (K_i,t)_{1≤i≤N} ∼iid M((p_t,d)_{1≤d≤D});
  Generate (x̃_i,t)_{1≤i≤N} ∼ind K_{K_i,t} (x_i,t−1 , x)
    and set ω_i,t = π(x̃_i,t) / Σ_{d=1}^D p_t,d K_d (x_i,t−1 , x̃_i,t);
  Generate (J_i,t)_{1≤i≤N} ∼iid M((ω̄_i,t)_{1≤i≤N})
    and set x_i,t = x̃_{J_i,t},t and
    p_{t+1,d} = Σ_{i=1}^N ω̄_i,t p_t,d K_d (x_i,t−1 , x̃_i,t) / Σ_{j=1}^D p_t,j K_j (x_i,t−1 , x̃_i,t) .
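A minimal sketch of this Rao-Blackwellised scheme for D = 3 Gaussian random-walk kernels with different scales; the target and the scales are placeholder choices of ours.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(9)
pi = norm(0.0, 1.0).pdf                     # target
scales = np.array([0.1, 1.0, 10.0])         # D = 3 kernel scales
D, N = scales.size, 2000
p = np.full(D, 1.0 / D)                     # kernel weights p_{t,d}
x = rng.normal(0.0, 3.0, size=N)

for t in range(20):
    k = rng.choice(D, size=N, p=p)          # K_{i,t} ~ M(p_t)
    x_prop = rng.normal(x, scales[k])
    # Rao-Blackwellised weight: the whole mixture in the denominator
    mix = sum(p[d] * norm.pdf(x_prop, loc=x, scale=scales[d]) for d in range(D))
    w = pi(x_prop) / mix
    w /= w.sum()
    # kernel-weight update, matching the recursion of the algorithm above
    p = np.array([np.sum(w * p[d] * norm.pdf(x_prop, loc=x, scale=scales[d]) / mix)
                  for d in range(D)])
    p /= p.sum()
    idx = rng.choice(N, size=N, replace=True, p=w)       # multinomial resampling
    x = x_prop[idx]
print(p)                                    # adapted kernel weights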

Markov Chain Monte Carlo Methods


Sequential importance sampling
Population Monte Carlo

Convergence properties
Theorem (LLN)
Under regularity assumptions, for h ∈ L1 and for every t ≥ 1,

(1/N) Σ_{i=1}^N ω̄_i,t h(x_i,t) →_P π(h)

and

p_t,d →_P α_d^t

The limiting coefficients (α_d^t)_{1≤d≤D} are defined recursively as

α_d^t = α_d^{t−1} ∫ { K_d (x, x′) / Σ_{j=1}^D α_j^{t−1} K_j (x, x′) } π(dx, dx′) .

Markov Chain Monte Carlo Methods


Sequential importance sampling
Population Monte Carlo

Recursion on the weights

Set F as

F(α)_d = α_d ∫ { K_d (x, x′) / Σ_{j=1}^D α_j K_j (x, x′) } π(dx, dx′) ,  1 ≤ d ≤ D,

on the simplex

S = { α = (α_1 , . . . , α_D) ; ∀d ∈ {1, . . . , D}, α_d ≥ 0 and Σ_{d=1}^D α_d = 1 } ,

and define the sequence

α^{t+1} = F(α^t)

Markov Chain Monte Carlo Methods


Sequential importance sampling
Population Monte Carlo

Kullback divergence

Definition (Kullback divergence)
For α ∈ S,

KL(α) = ∫ log { π(x) π(x′) / ( π(x) Σ_{d=1}^D α_d K_d (x, x′) ) } π(dx, dx′) ,

the Kullback divergence between π and the mixture.

Goal: obtain the mixture closest to π, i.e., that minimises KL(α)

Markov Chain Monte Carlo Methods


Sequential importance sampling
Population Monte Carlo

Connection with RBDPMCA ??

Theorem
Under the assumption

∀d ∈ {1, . . . , D},  −∞ < ∫ log(K_d (x, x′)) π(dx, dx′) < ∞ ,

for every α ∈ S_D,

KL(F (α)) ≤ KL(α).

Conclusion
The Kullback divergence decreases at every iteration of RBDPMCA

Markov Chain Monte Carlo Methods


Sequential importance sampling
Population Monte Carlo

An integrated EM interpretation

We have

α^min = arg min_α KL(α) = arg max_{α∈S} ∫ log p_α (x̄) π(dx̄)
      = arg max_{α∈S} ∫ log { ∫ p_α (x̄, K) dK } π(dx̄)

for x̄ = (x, x′) and K ∼ M((α_d)_{1≤d≤D}). Then α^{t+1} = F(α^t) means

α^{t+1} = arg max_α ∫∫ E_{α^t} [ log p_α (X̄, K) | X̄ = x̄ ] π(dx̄)

and

lim_t α^t = α^min
Markov Chain Monte Carlo Methods


Sequential importance sampling
Population Monte Carlo

Illustration
Example (A toy example)
Take the target

1/4 N (1, 0.3)(x) + 1/4 N (0, 1)(x) + 1/2 N (3, 2)(x)

and use 3 proposals: N (1, 0.3), N (0, 1) and N (3, 2)
[Surprise!!!]
Then the weight evolution is

t     p_t,1       p_t,2        p_t,3
1     0.0500000   0.05000000   0.9000000
2     0.2605712   0.09970292   0.6397259
6     0.2740816   0.19160178   0.5343166
10    0.2989651   0.19200904   0.5090259
16    0.2651511   0.24129039   0.4935585

Markov Chain Monte Carlo Methods


Sequential importance sampling
Population Monte Carlo

Target and mixture evolution

Markov Chain Monte Carlo Methods


Sequential importance sampling
Population Monte Carlo

Example: PMC for mixtures

Observation of an iid sample x = (x_1 , . . . , x_n) from

p N (μ_1 , σ²) + (1 − p) N (μ_2 , σ²) ,

with p ≠ 1/2 and σ > 0 known.
Usual conjugate N (θ, σ²/λ) prior on μ_1 and μ_2:

π(μ_1 , μ_2 | x) ∝ f (x | μ_1 , μ_2) π(μ_1 , μ_2)

Markov Chain Monte Carlo Methods


Sequential importance sampling
Population Monte Carlo

Algorithm (Mixture PMC)

Step 0: Initialisation
  For j = 1, . . . , n = pm, choose (μ_1)_j^(0) , (μ_2)_j^(0)
  For k = 1, . . . , p, set r_k = m

Step i: Update (i = 1, . . . , I)
  For k = 1, . . . , p,
  1. generate a sample of size r_k as

     (μ_1)_j^(i) ∼ N ((μ_1)_j^(i−1) , v_k)  and  (μ_2)_j^(i) ∼ N ((μ_2)_j^(i−1) , v_k)

  2. compute the weights

     ϱ_j ∝ f (x | (μ_1)_j^(i) , (μ_2)_j^(i)) π((μ_1)_j^(i) , (μ_2)_j^(i))
           / [ φ((μ_1)_j^(i) ; (μ_1)_j^(i−1) , v_k) φ((μ_2)_j^(i) ; (μ_2)_j^(i−1) , v_k) ] ,

     where φ(· ; m, v) denotes the N (m, v) density

  Resample the ((μ_1)_j^(i) , (μ_2)_j^(i))'s using the weights ϱ_j

Markov Chain Monte Carlo Methods


Sequential importance sampling
Population Monte Carlo

Details

After an arbitrary initialisation, use of the previous (importance) sample (after resampling) to build random walk proposals,

N ((μ)_j^(i−1) , v_j) ,

with a multiscale variance v_j within a predetermined set of p scales ranging from 10³ down to 10⁻³, whose importance is proportional to its survival rate in the resampling step.

Markov Chain Monte Carlo Methods


Sequential importance sampling

Population Monte Carlo

[Figure: (u.left) number of resampled points for v_1 = 5 (darker) and v_2 = 2; (u.right) number of resampled points for the other variances; (m.left) variance of the μ_1's along iterations; (m.right) average of the μ_1's over iterations; (l.left) variance of the μ_2's along iterations; (l.right) average of the simulated μ_2's over iterations]

Markov Chain Monte Carlo Methods


Sequential importance sampling

Population Monte Carlo

[Figure: log-posterior distribution and sample of means]
