Monte Carlo Notes From Main Book

Markov Chain Monte Carlo Methods

Christian P. Robert
Universit
e Paris Dauphine and CREST-INSEE
http://www.ceremade.dauphine.fr/~xian
3-6 Mayo 2005
Outline
Motivation and leading example
Random variable generation
Monte Carlo Integration
Notions on Markov Chains
The Metropolis-Hastings Algorithm
The Gibbs Sampler
MCMC tools for variable dimension problems
Sequential importance sampling
New [2004] edition:


Introduction
Likelihood methods
Missing variable models
Bayesian Methods
Bayesian troubles

Introduction
Latent structures make life harder!
Even simple models may lead to computational complications, as

in latent variable models
Z
f (x|) = f ? (x, x? |) dx?

Introduction

Z
f (x|) = f ? (x, x? |) dx?
If (x, x? ) observed, fine!

Introduction

Z
f (x|) = f ? (x, x? |) dx?
If (x, x? ) observed, fine!
If only x observed, trouble!

Introduction
Example (Mixture models)

Models of mixtures of distributions:
X fj with probability pj ,
for j = 1, 2, . . . , k, with overall density
X p1 f1 (x) + + pk fk (x) .

Introduction

X p1 f1 (x) + + pk fk (x) .
For a sample of independent random variables (X1 , , Xn ),
sample density
n
Y
i=1
{p1 f1 (xi ) + + pk fk (xi )} .

Introduction

X p1 f1 (x) + + pk fk (x) .
For a sample of independent random variables (X1 , , Xn ),
sample density
n
Y
i=1
{p1 f1 (xi ) + + pk fk (xi )} .
Expanding this product involves k n elementary terms: prohibitive

to compute in large samples.

1
0
1
Introduction
Case of the 0.3N (1 , 1) + 0.7N (2 , 1) likelihood

Likelihood methods
Maximum likelihood methods

Go Bayes!!
For an iid sample X1 , . . . , Xn from a population with density

f (x|1 , . . . , k ), the likelihood function is
L(|x) = L(1 , . . . , k |x1 , . . . , xn )
Yn
=
f (xi |1 , . . . , k ).
i=1

Likelihood methods

Go Bayes!!

L(|x) = L(1 , . . . , k |x1 , . . . , xn )
Yn
=
f (xi |1 , . . . , k ).
i=1
Global justifications from asymptotics

Likelihood methods

Go Bayes!!

L(|x) = L(1 , . . . , k |x1 , . . . , xn )
Yn
=
f (xi |1 , . . . , k ).
i=1
Global justifications from asymptotics
Computational difficulty depends on structure, eg latent

variables

Likelihood methods
Example (Mixtures again)

For a mixture of two normal distributions,
pN (, 2 ) + (1 p)N (, 2 ) ,
likelihood proportional to
n
Y
i=1
containing 2n terms.
xi
+ (1 p)
xi


Likelihood methods
Standard maximization techniques often fail to find the global

maximum because of multimodality of the likelihood function.
Example
In the special case

f (x|, ) = (1) exp{(1/2)x2 }+ exp{(1/2 2 )(x)2 } (1)
with > 0 known,

Likelihood methods
Standard maximization techniques often fail to find the global

maximum because of multimodality of the likelihood function.
Example
In the special case

f (x|, ) = (1) exp{(1/2)x2 }+ exp{(1/2 2 )(x)2 } (1)
with > 0 known, whatever n, the likelihood is unbounded:

lim `( = x1 , |x1 , . . . , xn ) =

The special case of missing variable models
Consider again a latent variable representation

Z
g(x|) =
f (x, z|) dz
Z


Z
g(x|) =
f (x, z|) dz
Z
Define the completed (but unobserved) likelihood

Lc (|x, z) = f (x, z|)


Z
g(x|) =
f (x, z|) dz
Z
Define the completed (but unobserved) likelihood

Lc (|x, z) = f (x, z|)
Useful for optimisation algorithm

The EM Algorithm
Gibbs connection
Bayes rather than EM
Algorithm (ExpectationMaximisation)
Iterate (in m)
1. (E step) Compute
Q(|(m) , x) = E[log Lc (|x, Z)|(m) , x] ,

The EM Algorithm
Gibbs connection
Bayes rather than EM
Algorithm (ExpectationMaximisation)
Iterate (in m)
1. (E step) Compute
Q(|(m) , x) = E[log Lc (|x, Z)|(m) , x] ,
2. (M step) Maximise Q(|(m) , x) in and take
(m+1) = arg max Q(|(m) , x).
until a fixed point [of Q] is reached

0.0
0.1
0.2
0.3
0.4
0.5
0.6
Echantillon N(0,1)
Sample from (1)

1
0
1
Likelihood of .7N (1 , 1) + .3N (2 , 1) and EM steps

Bayesian Methods
The Bayesian Perspective
In the Bayesian paradigm, the information brought by the data x,

realization of
X f (x|),

Bayesian Methods
The Bayesian Perspective
In the Bayesian paradigm, the information brought by the data x,

realization of
X f (x|),
is combined with prior information specified by prior distribution
with density
()

Bayesian Methods
Central tool
Summary in a probability distribution, (|x), called the posterior
distribution

Bayesian Methods
Central tool
distribution
Derived from the joint distribution f (x|)(), according to
(|x) = R
f (x|)()
,
f (x|)()d
[Bayes Theorem]

Bayesian Methods
Central tool
distribution
Derived from the joint distribution f (x|)(), according to
(|x) = R
where
m(x) =
is the marginal density of X
f (x|)()
,
f (x|)()d
[Bayes Theorem]
f (x|)()d

Bayesian Methods
Central tool...central to Bayesian inference

Posterior defined up to a constant as
(|x) f (x|) ()
I
Operates conditional upon the observations

Bayesian Methods

(|x) f (x|) ()
I
Integrate simultaneously prior information and information

brought by x

Bayesian Methods

(|x) f (x|) ()
I

brought by x
Avoids averaging over the unobserved values of x

Bayesian Methods

(|x) f (x|) ()
I

brought by x
Coherent updating of the information available on ,

independent of the order in which i.i.d. observations are
collected

Bayesian Methods

(|x) f (x|) ()
I

brought by x
Coherent updating of the information available on ,

independent of the order in which i.i.d. observations are
collected
Provides a complete inferential scope and a unique motor of

inference

Bayesian troubles
Conjugate bonanza...
Example (Binomial)
For an observation X B(n, p) so-called conjugate prior is the
family of beta Be(a, b) distributions

Bayesian troubles
Conjugate bonanza...
Example (Binomial)
For an observation X B(n, p) so-called conjugate prior is the
family of beta Be(a, b) distributions
The classical Bayes estimator is the posterior mean
Z 1
(a + b + n)
p px+a1 (1 p)nx+b1 dp
(a + x)(n x + b) 0
x+a
.
=
a+b+n

Bayesian troubles
Example (Normal)
In the normal N (, 2 ) case, with both and unknown,
conjugate prior on = (, 2 ) of the form

( 2 ) exp ( )2 + / 2

Bayesian troubles
Example (Normal)
In the normal N (, 2 ) case, with both and unknown,
conjugate prior on = (, 2 ) of the form

( 2 ) exp ( )2 + / 2
since

((, 2 )|x1 , . . . , xn ) ( 2 ) exp ( )2 + / 2

( 2 )n exp n( x)2 + s2x / 2

2 +n
( )
exp ( + n)( x )2

n
+ + s2x +
/ 2
n +

Bayesian troubles
...and conjugate curse
The use of conjugate priors for computational reasons

implies a restriction on the modeling of the available prior
information

Bayesian troubles

information
may be detrimental to the usefulness of the Bayesian approach

Bayesian troubles

information
may be detrimental to the usefulness of the Bayesian approach
gives an impression of subjective manipulation of the prior
information disconnected from reality.

Bayesian troubles
A typology of Bayes computational problems

(i). use of a complex parameter space, as for instance in
constrained parameter sets like those resulting from imposing
stationarity constraints in dynamic models;

Bayesian troubles

(ii). use of a complex sampling model with an intractable
likelihood, as for instance in missing data and graphical
models;

Bayesian troubles

models;
(iii). use of a huge dataset;

Bayesian troubles

models;
(iv). use of a complex prior distribution (which may be the
posterior distribution associated with an earlier sample);

Bayesian troubles

models;
(iv). use of a complex prior distribution (which may be the
posterior distribution associated with an earlier sample);
(v). use of a complex inferential procedure as for instance, Bayes
factors

P ( 0 | x) ( 0 )
B01 (x) =
.
P ( 1 | x) ( 1 )

Bayesian troubles
Example (Mixture once again)

Observations from
x1 , . . . , xn f (x|) = p(x; 1 , 1 ) + (1 p)(x; 2 , 2 )

Bayesian troubles

Observations from
x1 , . . . , xn f (x|) = p(x; 1 , 1 ) + (1 p)(x; 2 , 2 )
Prior
i |i N (i , i2 /ni ),
i2 I G (i /2, s2i /2),
p Be(, )

Bayesian troubles

Observations from
x1 , . . . , xn f (x|) = p(x; 1 , 1 ) + (1 p)(x; 2 , 2 )
Prior
i |i N (i , i2 /ni ),
i2 I G (i /2, s2i /2),
p Be(, )
Posterior
(|x1 , . . . , xn )
=
n
Y
{p(xj ; 1 , 1 ) + (1 p)(xj ; 2 , 2 )} ()
j=1
n X
X
(kt )(|(kt ))
`=0 (kt )

Bayesian troubles
Example (Mixture once again (contd))

For a given permutation (kt ), conditional posterior distribution

12
I G ((1 + `)/2, s1 (kt )/2)
(|(kt )) = N 1 (kt ),
n1 + `

22
I G ((2 + n `)/2, s2 (kt )/2)
N 2 (kt ),
n2 + n `
Be( + `, + n `)

Bayesian troubles
Example (Mixture once again (contd))

where
x
1 (kt ) =
x
2 (kt ) =
and
1
`
P`
t=1 xkt ,
1 Pn
n`
t=`+1 xkt ,
P
s1 (kt ) = P `t=1 (xkt x
1 (kt ))2 ,
n
s2 (kt ) =
2 (kt ))2
t=`+1 (xkt x
n1 1 + `
x1 (kt )
n2 2 + (n `)
x2 (kt )
,
2 (kt ) =
,
n1 + `
n2 + n `
n1 `
(1 x
1 (kt ))2 ,
s1 (kt ) = s21 + s21 (kt ) +
n1 + `
n2 (n `)
s2 (kt ) = s22 + s22 (kt ) +
(2 x
2 (kt ))2 ,
n2 + n `
1 (kt ) =
posterior updates of the hyperparameters

Bayesian troubles

Bayes estimator of :
(x1 , . . . , xn ) =
n X
X
(kt )E [|x, (kt )]
`=0 (kt )
Too costly: 2n terms

Bayesian troubles
press for AR
Example (Poly-t priors)

Normal observation x N (, 1), with conjugate prior
N (, )
Closed form expression for the posterior mean
Z
Z
f (x|) () d
f (x|) () d =
x + 2
.
1 + 2

Bayesian troubles
Example (Poly-t priors (2))

More involved prior distribution:
poly-t distribution
[Bauwens,1985]
() =
k
Y

i=1
i + ( i )2
i
i , i > 0

Bayesian troubles
Example (Poly-t priors (2))

More involved prior distribution:
poly-t distribution
[Bauwens,1985]
() =
k
Y

i=1
i + ( i )2
Computation of E[|x] ???
i
i , i > 0

Bayesian troubles
Example (AR(p) model)

Auto-regressive representation of a time series,
xt =
p
X
i=1
i xti + t

Bayesian troubles
Example (AR(p) model)

Auto-regressive representation of a time series,
xt =
p
X
i xti + t
i=1
If order p unknown, predictive distribution of xt+1 given by

Z
(xt+1 |xt , . . . , x1 ) f (xt+1 |xt , . . . , xtp+1 )(, p|xt , . . . , x1 )dp d ,

Bayesian troubles
Example (AR(p) model (contd))

Integration over the parameters of all models
Z
X
p=0
f (xt+1 |xt , . . . , xtp+1 )(|p, xt , . . . , x1 ) d (p|xt , . . . , x1 ) .

Bayesian troubles
Example (AR(p) model (contd))

Multiple layers of complexity
(i). Complex parameter space within each AR(p) model because
of stationarity constraint
(ii). if p unbounded, infinity of models
(iii). varies between models AR(p) and AR(p + 1), with a
different stationarity constraint (except for root
reparameterisation).
(iv). if prediction used sequentially, every tick/second/hour/day,
posterior distribution (, p|xt , . . . , x1 ) must be re-evaluated


Basic methods
Uniform pseudo-random generator
Beyond Uniform distributions
Transformation methods
Accept-Reject Methods
Fundamental theorem of simulation
Log-concave densities

Rely on the possibility of producing (computer-wise) an

endless flow of random variables (usually iid) from well-known
distributions

Rely on the possibility of producing (computer-wise) an

endless flow of random variables (usually iid) from well-known
distributions
Given a uniform random number generator, illustration of
methods that produce random variables from both standard
and nonstandard distributions

Basic methods
The inverse transform method
For a function F on R, the generalized inverse of F , F , is defined

by
F (u) = inf {x; F (x) u} .

Basic methods
The inverse transform method
For a function F on R, the generalized inverse of F , F , is defined

by
F (u) = inf {x; F (x) u} .
Definition (Probability Integral Transform)

If U U[0,1] , then the random variable F (U ) has the distribution
F.

Basic methods
The inverse transform method (2)
To generate a random variable X F , simply generate

U U[0,1]

Basic methods
The inverse transform method (2)
To generate a random variable X F , simply generate

U U[0,1]
and then make the transform
x = F (u)

Desiderata and limitations

skip Uniform
Production of a deterministic sequence of values in [0, 1]

which imitates a sequence of iid uniform random variables
U[0,1] .


skip Uniform

U[0,1] .
Cant use the physical imitation of a random draw [no
guarantee of uniformity, no reproducibility]


skip Uniform

U[0,1] .
Random sequence in the sense: Having generated
(X1 , , Xn ), knowledge of Xn [or of (X1 , , Xn )] imparts
no discernible knowledge of the value of Xn+1 .


skip Uniform

U[0,1] .
Deterministic: Given the initial value X0 , sample
(X1 , , Xn ) always the same


skip Uniform

U[0,1] .
Deterministic: Given the initial value X0 , sample
(X1 , , Xn ) always the same
Validity of a random number generator based on a single
sample X1 , , Xn when n tends to +, not on replications
(X11 , , X1n ), (X21 , , X2n ), . . . (Xk1 , , Xkn )
where n fixed and k tends to infinity.

Algorithm starting from an initial value 0 u0 1 and a

transformation D, which produces a sequence
(ui ) = (D i (u0 ))
in [0, 1].

Algorithm starting from an initial value 0 u0 1 and a

transformation D, which produces a sequence
(ui ) = (D i (u0 ))
in [0, 1].
For all n,
(u1 , , un )
reproduces the behavior of an iid U[0,1] sample (V1 , , Vn ) when
compared through usual tests

Uniform pseudo-random generator (2)
Validity means the sequence U1 , , Un leads to accept the

hypothesis
H : U 1 , , Un
are iid
U[0,1] .

Uniform pseudo-random generator (2)
Validity means the sequence U1 , , Un leads to accept the

hypothesis
H : U 1 , , Un
are iid
U[0,1] .
The set of tests used is generally of some consequence

KolmogorovSmirnov and other nonparametric tests
Time series methods, for correlation between Ui and
(Ui1 , , Uik )
Marsaglias battery of tests called Die Hard (!)

Usual generators
In R and S-plus, procedure runif()
The Uniform Distribution
Description:
runif generates random deviates.
Example:
u <- runif(20)
.Random.seed is an integer vector, containing
the random number generator state for random
number generation in R. It can be saved and
restored, but should not be altered by users.

0.0 0.2 0.4 0.6 0.8 1.0
500
520
540
560
580
600
0.6
0.8
1.0
0.0
0.5
1.0
1.5
uniform sample
0.0
0.2
0.4

Usual generators (2)

In C, procedure rand() or random()
SYNOPSIS
#include <stdlib.h>
long int random(void);
DESCRIPTION
The random() function uses a non-linear additive
feedback random number generator employing a
default table of size 31 long integers to return
successive pseudo-random numbers in the range
from 0 to RAND_MAX. The period of this random
generator is very large, approximately
16*((2**31)-1).
RETURN VALUE
random() returns a value between 0 and RAND_MAX.

Usual generators(3)
In Scilab, procedure rand()
rand() : with no arguments gives a scalar whose
value changes each time it is referenced. By
default, random numbers are uniformly distributed
in the interval (0,1). rand(normal) switches to
a normal distribution with mean 0 and variance 1.
EXAMPLE
x=rand(10,10,uniform)

Beyond Uniform generators

Generation of any sequence of random variables can be
formally implemented through a uniform generator
Distributions with explicit F (for instance, exponential, and
Weibull distributions), use the probability integral
transform here


transform here
Case specific methods rely on properties of the distribution (for
instance, normal distribution, Poisson distribution)


transform here
More generic methods (for instance, accept-reject and
ratio-of-uniform)


transform here
More generic methods (for instance, accept-reject and
ratio-of-uniform)
Simulation of the standard distributions is accomplished quite

efficiently by many numerical and statistical programming
packages.

Case where a distribution F is linked in a simple way to another
distribution easy to simulate.
Example (Exponential variables)

If U U[0,1] , the random variable
X = log U/
has distribution
P (X x) = P ( log U x)
= P (U ex ) = 1 ex ,
the exponential distribution E xp().

Other random variables that can be generated starting from an

exponential include
Y = 2
X
j=1
log(Uj ) 22
a
1X
log(Uj ) G a(a, )
Y =
j=1
Pa
j=1 log(Uj )
Y = Pa+b
j=1 log(Uj )
Be(a, b)

Points to note
Transformation quite simple to use
There are more efficient algorithms for gamma and beta

random variables
Cannot generate gamma random variables with a non-integer
shape parameter
For instance, cannot get a 21 variable, which would get us a
N (0, 1) variable.

Box-Muller Algorithm
Example (Normal variables)
If r, polar coordinates of (X1 , X2 ), then,
r2 = X12 + X22 22 = E (1/2)
and U [0, 2]

Box-Muller Algorithm
Example (Normal variables)
If r, polar coordinates of (X1 , X2 ), then,
r2 = X12 + X22 22 = E (1/2)
Consequence: If U1 , U2 iid U[0,1] ,
p
2 log(U1 ) cos(2U2 )
p
=
2 log(U1 ) sin(2U2 )
X1 =
X2
iid N (0, 1).
and U [0, 2]

Box-Muller Algorithm (2)
1. Generate U1 , U2 iid U[0,1] ;

2. Define
x2
2 log(u1 ) cos(2u2 ) ,
p
2 log(u1 ) sin(2u2 ) ;
=
x1 =
3. Take x1 and x2 as two independent draws from

N (0, 1).

Box-Muller Algorithm (3)
Unlike algorithms based on the CLT,

this algorithm is exact
Get two normals for the price of

two uniforms
I Drawback (in speed)
in calculating log, cos and sin.
4
0.0
0.1
0.2
0.3
0.4

More transforms
Reject
Example (Poisson generation)

Poissonexponential connection:
If N P() and Xi E xp(), i N ,
P (N = k) =
P (X1 + + Xk 1 < X1 + + Xk+1 ) .

More Poisson
Skip Poisson
A Poisson can be simulated by generating Exp(1) till their

sum exceeds 1.
This method is simple, but is really practical only for smaller
values of .
On average, the number of exponential variables required is .
Other approaches are more suitable for large s.

Atkinsons Poisson
To generate N P():
1. Define
= / 3,
and
k = log clog ;
2. Generate U1 U[0,1] and calculate

x = { log{(1 u1 )/u1 }}/
until x > 0.5 ;
3. Define N = bx + 0.5c and generate

U2 U[0,1] ;
4. Accept N if
x+log (u2 /{1+exp(x)}2 ) k+N log log N ! .

Negative extension
A generator of Poisson random variables can produce negative

binomial random variables since,
Y Ga(n, (1 p)/p) X|y P(y)
implies
X N eg(n, p)

Mixture representation
The representation of the negative binomial is a particular
case of a mixture distribution
The principle of a mixture representation is to represent a
density f as the marginal of another distribution, for example
X
f (x) =
pi fi (x) ,
iY
If the component distributions fi (x) can be easily generated,

X can be obtained by first choosing fi with probability pi and
then generating an observation from fi .

Partitioned sampling
Special case of mixture sampling when

Z
fi (x) = f (x) IAi (x)
f (x) dx
Ai
and
pi = Pr(X Ai )
for a partition (Ai )i

Accept-Reject Methods
Accept-Reject algorithm
Many distributions from which it is difficult, or even

impossible, to directly simulate.
Another class of methods that only require us to know the
functional form of the density f of interest only up to a
multiplicative constant.
The key to this method is to use a simpler (simulation-wise)
density g, the instrumental density , from which the simulation
from the target density f is actually done.

0.25
0.20
Lemma
f(x)
0.10
X f (x)
0.15
Simulating
(X, U ) U{(x, u) : 0 < u < f (x)}
0.00
0.05
equivalent to simulating
6
x
10

The Accept-Reject algorithm
Given a density of interest f , find a density g and a constant M

such that
f (x) M g(x)
on the support of f .

The Accept-Reject algorithm
Given a density of interest f , find a density g and a constant M

such that
f (x) M g(x)
on the support of f .
1. Generate X g, U U[0,1] ;
2. Accept Y = X if U f (X)/M g(X) ;

3. Return to 1. otherwise.

Validation of the Accept-Reject method

Warranty:
This algorithm produces a variable Y distributed according to f

Two interesting properties

First, it provides a generic method to simulate from any
density f that is known up to a multiplicative factor
Property particularly important in Bayesian calculations where
the posterior distribution
(|x) () f (x|) .
is specified up to a normalizing constant

Two interesting properties

First, it provides a generic method to simulate from any
density f that is known up to a multiplicative factor
Property particularly important in Bayesian calculations where
the posterior distribution
(|x) () f (x|) .
is specified up to a normalizing constant
Second, the probability of acceptance in the algorithm is
1/M , e.g., expected number of trials until a variable is
accepted is M

More interesting properties
In cases f and g both probability densities, the constant M is

necessarily larger that 1.


The size of M , and thus the efficiency of the algorithm, are
functions of how closely g can imitate f , especially in the tails


The size of M , and thus the efficiency of the algorithm, are
functions of how closely g can imitate f , especially in the tails
For f /g to remain bounded, necessary for g to have tails
thicker than those of f .
It is therefore impossible to use the A-R algorithm to simulate
a Cauchy distribution f using a normal distribution g, however
the reverse works quite well.

No Cauchy!
Example (Normal from a Cauchy)

Take
1
f (x) = exp(x2 /2)
2
and
1 1
,
1 + x2
densities of the normal and Cauchy distributions.
g(x) =

No Cauchy!
Example (Normal from a Cauchy)

Take
1
f (x) = exp(x2 /2)
2
and
1 1
,
1 + x2
densities of the normal and Cauchy distributions.
Then
r
r
f (x)
2
2
=
(1 + x2 ) ex /2
= 1.52
g(x)
2
e
g(x) =
attained at x = 1.

Example (Normal from a Cauchy (2))

So probability of acceptance
1/1.52 = 0.66,
and, on the average, one out of every three simulated Cauchy
variables is rejected.

No Double!
Example (Normal/Double Exponential)

Generate a N (0, 1) by using a double-exponential distribution
with density
g(x|) = (/2) exp(|x|)
Then
f (x)
g(x|)
2 1 2 /2
e
and minimum of this bound (in ) attained for

? = 1

Example (Normal/Double Exponential (2))

Probability of acceptance
p
/2e = .76
To produce one normal random variable requires on the average

1/.76 1.3 uniform variables.

truncate
Example (Gamma generation)

Illustrates a real advantage of the Accept-Reject algorithm
The gamma distribution Ga(, ) represented as the sum of
exponential random variables, only if is an integer

Example (Gamma generation (2))

Can use the Accept-Reject algorithm with instrumental distribution
Ga(a, b), with a = [],
(Without loss of generality, = 1.)
0.

Example (Gamma generation (2))

Can use the Accept-Reject algorithm with instrumental distribution
Ga(a, b), with a = [],
0.
(Without loss of generality, = 1.)

Up to a normalizing constant,
f /gb = b
a a
exp{(1 b)x} b
for b 1.
The maximum is attained at b = a/.
a
(1 b)e
a

Cheng and Feasts Gamma generator

Gamma G a(, 1), > 1 distribution
1. Define c1 = 1, c2 = ( (1/6))/c1 ,
c3 = 2/c1 , c4 = 1 + c3 , and c5 = 1/ .
2. Repeat
generate U1 , U2
take U1 = U2 + c5 (1 1.86U1 ) if > 2.5
until 0 < U1 < 1.
3. Set W = c2 U2 /U1 .
4. If c3 U1 + W + W 1 c4 or
c3 log U1 log W + W 1,
take c1 W ;
otherwise, repeat.

Truncated Normal simulation

Example (Truncated Normal distributions)
Constraint x produces density proportional to
e(x)
2 /2 2
for a bound large compared with
Ix

Truncated Normal simulation

Example (Truncated Normal distributions)
Constraint x produces density proportional to
e(x)
2 /2 2
Ix
for a bound large compared with

There exists alternatives far superior to the nave method of
generating a N (, 2 ) until exceeding , which requires an average
number of
1/(( )/)
simulations from N (, 2 ) for a single acceptance.

Example (Truncated Normal distributions (2))

Instrumental distribution: translated exponential distribution,
E (, ), with density
g (z) = e(z) Iz .

Example (Truncated Normal distributions (2))

Instrumental distribution: translated exponential distribution,
E (, ), with density
g (z) = e(z) Iz .
The ratio f /g is bounded by
(
1/ exp(2 /2 ) if > ,
f /g
otherwise.
1/ exp(2 /2)

Log-concave densities (1)
Densities f whose logarithm is concave, for

instance Bayesian posterior distributions such that
move to next chapter
log (|x) = log () + log f (x|) + c

concave

Take
Sn = {xi , i = 0, 1, . . . , n + 1} supp(f )
such that h(xi ) = log f (xi ) known up to the same constant.
By concavity of h, line Li,i+1 through (xi , h(xi )) and
(xi+1 , h(xi+1 ))

Take
Sn = {xi , i = 0, 1, . . . , n + 1} supp(f )
such that h(xi ) = log f (xi ) known up to the same constant.
By concavity of h, line Li,i+1 through (xi , h(xi )) and
(xi+1 , h(xi+1 ))
I
below h in [xi , xi+1 ] and
above this graph outside this interval

L 2,3 (x)
log f(x)
x1
x2
x3
x
x4


For x [xi , xi+1 ], if
hn (x) = min{Li1,i (x), Li+1,i+2 (x)}
and hn (x) = Li,i+1 (x) ,
the envelopes are

hn (x) h(x) hn (x)
uniformly on the support of f , with
hn (x) =
on [x0 , xn+1 ]c .
and
hn (x) = min(L0,1 (x), Ln,n+1 (x))

Therefore, if
f n (x) = exp hn (x) and f n (x) = exp hn (x)
then
f n (x) f (x) f n (x) = $n gn (x) ,
where $n normalizing constant of fn

ARS Algorithm
1. Initialize n and Sn .
2. Generate X gn (x), U U[0,1] .
3. If U f n (X)/$n gn (X), accept X;

otherwise, if U f (X)/$n gn (X), accept X

kill ducks
Example (Northern Pintail ducks)

Ducks captured at time i with both probability pi and size N of
the population unknown.
Dataset
(n1 , . . . , n11 ) = (32, 20, 8, 5, 1, 2, 0, 2, 1, 1, 0)
Number of recoveries over the years 19571968 of N = 1612
Northern Pintail ducks banded in 1956

Example (Northern Pintail ducks (2))

Corresponding conditional likelihood
L(p1 , . . . , pI |N, n1 , . . . , nI ) =
I
Y
N!
pni i (1 pi )N ni ,
(N r)!
i=1
where I number of captures, ni number of captured animals

during the ith capture, and r is the total number of different
captured animals.


Prior selection
If
N P()
and
i = log
pi
1 pi
N (i , 2 ),
[Normal logistic]


Posterior distribution
(, N |, n1 , . . . , nI )
I
N!
N Y
(1 + ei )N
(N r)! N !
i=1

I
Y
1
2
exp i ni 2 (i i )
2
i=1


For the conditional posterior distribution

1
2
(i |N, n1 , . . . , nI ) exp i ni 2 (i i )
(1+ei )N ,
2
the ARS algorithm can be implemented since
i ni
is concave in i .
1
(i i )2 N log(1 + ei )
2 2

Posterior distributions of capture log-odds ratios for the

years 19571965.
1.0
0.8
0.6
0.4
0.2
10
10
1965
0.0
0.2
0.4
0.6
0.8
1.0
1964
0.6
8
0.6
8
0.4
9
1962
0.4
9
0.2
10
0.2
10
0.0
0.0
0.2
0.4
0.6
0.8
1.0
1963
1.0
0.8
10
10
0.0
0.2
0.4
0.6
0.8
1.0
1961
0.0
0.0
0.2
0.4
0.6
0.8
1.0
1960
1.0
0.8
10
1959
0.0
0.2
0.4
0.6
0.8
1.0
1958
0.0
0.0
0.2
0.4
0.6
0.8
1.0
1957
10
10

0.0
0.2
0.4
0.6
0.8
1960
True distribution versus histogram of simulated sample

Monte Carlo integration

Introduction
Importance Sampling
Acceleration methods
Bayesian importance sampling

Introduction
Quick reminder
Two major classes of numerical problems that arise in statistical

inference
Optimization - generally associated with the likelihood
approach

Introduction
Quick reminder
Two major classes of numerical problems that arise in statistical

inference
Optimization - generally associated with the likelihood
approach
Integration- generally associated with the Bayesian approach

Introduction
skip Example!
Example (Bayesian decision theory)

Bayes estimators are not always posterior expectations, but rather
solutions of the minimization problem
Z
L(, ) () f (x|) d .
min
Proper loss:
For L(, ) = ( )2 , the Bayes estimator is the posterior mean

Introduction
skip Example!

Z
L(, ) () f (x|) d .
min
Proper loss:
Absolute error loss:
For L(, ) = | |, the Bayes estimator is the posterior median

Introduction
skip Example!

Z
L(, ) () f (x|) d .
min
Proper loss:
Absolute error loss:
For L(, ) = | |, the Bayes estimator is the posterior median
With no loss function
use the maximum a posteriori (MAP) estimator
arg max `(|x)()

Theme:
Generic problem of evaluating the integral
Z
I = Ef [h(X)] =
h(x) f (x) dx
X
where X is uni- or multidimensional, f is a closed form, partly

closed form, or implicit density, and h is a function

Monte Carlo integration (2)

Monte Carlo solution
First use a sample (X1 , . . . , Xm ) from the density f to
approximate the integral I by the empirical average
hm =
m
1 X
h(xj )
m
j=1

Monte Carlo integration (2)

Monte Carlo solution
First use a sample (X1 , . . . , Xm ) from the density f to
approximate the integral I by the empirical average
hm =
m
1 X
h(xj )
m
j=1
which converges
hm Ef [h(X)]
by the Strong Law of Large Numbers

Monte Carlo precision

Estimate the variance with
m
vm
1 1 X
[h(xj ) hm ]2 ,
=
mm1
j=1
and for m large,

hm Ef [h(X)]
N (0, 1).
vm
Note: This can lead to the construction of a convergence test and
of confidence bounds on the approximation of Ef [h(X)].

Example (Cauchy prior/normal sample)

For estimating a normal mean, a robust prior is a Cauchy prior
X N (, 1),
C(0, 1).
Under squared error loss, posterior mean

Z
2
e(x) /2 d
2
1+
(x) = Z
1
2
e(x) /2 d
2
1
+

Example (Cauchy prior/normal sample (2))

Form of suggests simulating iid variables
1 , , m N (x, 1)
and calculating
m
(x) =
m
X
i=1
i
1 + i2
X
m
i=1
1
.
1 + i2
The Law of Large Numbers implies
m
(x) (x) as m .

9.6
9.8
10.0
10.2
10.4
10.6
200
400
600
800
1000
iterations
for 100 runs and x = 10

Range of estimators m

Importance Sampling
Importance sampling
Paradox
Simulation from f (the true density) is not necessarily optimal

Importance Sampling
Importance sampling
Paradox
Simulation from f (the true density) is not necessarily optimal
Alternative to direct sampling from f is importance sampling,
based on the alternative representation

Z
f (x)
Ef [h(X)] =
g(x) dx .
h(x)
g(x)
X
which allows us to use other distributions than f

Importance Sampling
Importance sampling algorithm

Evaluation of
Ef [h(X)] =
h(x) f (x) dx
X
by
1. Generate a sample X1 , . . . , Xn from a distribution g
2. Use the approximation
m
1 X f (Xj )
h(Xj )
m
g(Xj )
j=1

Importance Sampling
Same thing as before!!!
Convergence of the estimator

Z
m
1 X f (Xj )
h(Xj )
h(x) f (x) dx
m
g(Xj )
X
j=1

Importance Sampling
Same thing as before!!!
Convergence of the estimator

Z
m
1 X f (Xj )
h(Xj )
h(x) f (x) dx
m
g(Xj )
X
j=1
converges for any choice of the distribution g

[as long as supp(g) supp(f )]

Importance Sampling
Important details
Instrumental distribution g chosen from distributions easy to

simulate
The same sample (generated from g) can be used repeatedly,
not only for different functions h, but also for different
densities f
Even dependent proposals can be used, as seen later
PMC chapter

Importance Sampling
Although g can be any density, some choices are better than

others:
Finite variance only when

Z
f (X)
f 2 (X)
2
Ef h (X)
dx < .
h2 (x)
=
g(X)
g(X)
X

Importance Sampling

others:

Z
f (X)
f 2 (X)
2
Ef h (X)
dx < .
h2 (x)
=
g(X)
g(X)
X
Instrumental distributions with tails lighter than those of f
(that is, with sup f /g = ) not appropriate.
If sup f /g = , the weights f (xj )/g(xj ) vary widely, giving

too much importance to a few values xj .

Importance Sampling

others:

Z
f (X)
f 2 (X)
2
Ef h (X)
dx < .
h2 (x)
=
g(X)
g(X)
X
Instrumental distributions with tails lighter than those of f
(that is, with sup f /g = ) not appropriate.
If sup f /g = , the weights f (xj )/g(xj ) vary widely, giving

too much importance to a few values xj .
If sup f /g = M < , the accept-reject algorithm can be used
as well to simulate f directly.

Importance Sampling
Example (Cauchy target)

Case of Cauchy distribution C(0, 1) when importance function is
Gaussian N (0, 1).
Ratio of the densities
%(x) =
very badly behaved: e.g.,
Z
exp x2 /2
p? (x)
= 2
p0 (x)
(1 + x2 )
%(x)2 p0 (x)dx = .
Poor performances of the associated importance sampling

estimator

50
100
150
200
Importance Sampling
2000
4000
6000
8000
10000
iterations
Range and average of 500 replications of IS estimate of

E[exp X] over 10, 000 iterations.

Importance Sampling
Optimal importance function
The choice of g that minimizes the variance of the

importance sampling estimator is
g (x) = R
|h(x)| f (x)
.
|h(z)| f (z) dz

Importance Sampling
Optimal importance function
The choice of g that minimizes the variance of the

importance sampling estimator is
g (x) = R
|h(x)| f (x)
.
|h(z)| f (z) dz
Rather formal optimality result since optimal choice of g (x)

requires the knowledge of I, the integral of interest!

Importance Sampling
Practical impact
Pm
h(Xj ) f (Xj )/g(Xj )

Pm
,
j=1 f (Xj )/g(Xj )
j=1
where f and g are known up to constants.
Also converges to I by the Strong Law of Large Numbers.

Biased, but the bias is quite small

Importance Sampling
Practical impact
Pm
h(Xj ) f (Xj )/g(Xj )

Pm
,
j=1 f (Xj )/g(Xj )
j=1
where f and g are known up to constants.
Also converges to I by the Strong Law of Large Numbers.

Biased, but the bias is quite small
In some settings beats the unbiased estimator in squared error

loss.
Using the optimal solution does not always work:
Pm
#positive h #negative h
j=1 h(xj ) f (xj )/|h(xj )| f (xj )
Pm
Pm
=
j=1 f (xj )/|h(xj )| f (xj )
j=1 1/|h(xj )|

Importance Sampling
Selfnormalised importance sampling
For ratio estimator

hn
n
X
i h(xi )
i=1
X
n
i=1
with Xi g(y) and Wi such that

E[Wi |Xi = x] = f (x)/g(x)

Importance Sampling
Selfnormalised variance
then
var(hn )
for

1
n
n
n
2
n
var(S
)
2E
[h]
cov(S
,
S
)
+
E
[h]
var(S
)
.
1
1
h
h
n2 2
Shn =
n
X
i=1
Wi h(Xi ) ,
S1n =
n
X
Wi
i=1
Rough approximation
varhn
1
var (h(X)) {1 + varg (W )}
n

Importance Sampling
Example (Students t distribution)

X T (, , 2 ), with density
f (x) =
(( + 1)/2)
(/2)
1+
(x )2
2
Without loss of generality, take = 0, = 1.

Problem: Calculate the integral

Z
sin(x) n
f (x)dx.
x
2.1
(+1)/2

Importance Sampling
Example (Students t distribution (2))

Simulation possibilities
Directly from f , since f = N(0,1)

2

Importance Sampling


2
Importance sampling using Cauchy C (0, 1)

Importance Sampling


2

Importance sampling using a normal N (0, 1)
(expected to be nonoptimal)

Importance Sampling


2

Importance sampling using a normal N (0, 1)
(expected to be nonoptimal)
Importance sampling using a U ([0, 1/2.1])
change of variables

5.0
5.5
6.0
6.5
7.0
Importance Sampling
10000
20000
30000
40000
50000
Sampling from f (solid lines), importance sampling with

Cauchy instrumental (short dashes), U ([0, 1/2.1])
instrumental (long dashes) and normal instrumental (dots).

Importance Sampling
IS suffers from curse of dimensionality

As dimension increases, discrepancy between importance and
target worsens
skip explanation

Importance Sampling
IS suffers from curse of dimensionality

As dimension increases, discrepancy between importance and
target worsens
skip explanation
Explanation:
Take target distribution and instrumental distribution
N
n
Simulation of a sample of iid samples of size n
R x1:n from n =
Importance sampling estimator for n (fn ) = fn (x1:n )n (dx1:n )
PN
QN
i
i
i=1 fn (1:n )
j=1 Wj
\
n (fn ) =
,
PN QN
j=1
j=1 Wj
i
i
where Wki = d
d (k ), and j are iid with distribution .
For {Vk }k0 , sequence of iid nonnegative random variables and for
n 1, Fn = (Vk ; k n), set
Un =
n
Y
k=1
Vk

Importance Sampling
Since E[Vn+1 ] = 1 and Vn+1 independent from Fn ,

E(Un+1 | Fn ) = Un E(Vn+1 | Fn ) = Un ,
and thus {Un }n0 martingale
Since x 7 x concave, by Jensens inequality,
p
p
p
E( Un+1 | Fn ) E(Un+1 | Fn ) Un
and thus { Un }n0 supermartingale

Assume E( Vn+1 ) < 1. Then
p
E( Un ) =
n
Y
k=1
p
E( Vk ) 0,
n .

Importance Sampling
But { Un }n0 is a nonnegative supermartingale and thus Un

converges a.s. to a random variable Z 0. By Fatous lemma,

p
E(Z) = E lim
Un lim inf E( U n ) = 0.
n
Hence, Z = 0 and Un 0 a.s., which implies that the martingale

{Un }n0 is not regular.
i
Apply these results to Vk = d
d (k ), i {1, . . . , N }:
"r
#

d i
d i
E
( ) E
( ) = 1.
d k
d k
with equality iff
d
d
= 1, -a.e., i.e. = .
Thus all importance weights converge to 0

Importance Sampling
too volatile!
Example (Stochastic volatility model)

yt = exp (xt /2) t ,
t N (0, 1)
with AR(1) log-variance process (or volatility)

xt+1 = xt + ut ,
ut N (0, 1)

Importance Sampling
10
Evolution of IBM stocks (corrected from trend and log-ratio-ed)
100
200
300
time
400
500

Importance Sampling
Example (Stochastic volatility model (2))

Observed likelihood unavailable in closed from.
Joint posterior (or conditional) distribution of the hidden state
sequence {Xk }1kK can be evaluated explicitly
K
Y
k=2

exp 2 (xk xk1 )2 + 2 exp(xk )yk2 + xk /2 , (2)
up to a normalizing constant.

Importance Sampling
Computational problems

Direct simulation from this distribution impossible because of
(a) dependence among the Xk s,
(b) dimension of the sequence {Xk }1kK , and
(c) exponential term exp(xk )yk2 within (2).

Importance Sampling
Importance sampling

Natural candidate: replace the exponential term with a quadratic
approximation to preserve Gaussianity.
E.g., expand exp(xk ) around its conditional expectation xk1 as

1
2
exp(xk ) exp(xk1 ) 1 (xk xk1 ) + (xk xk1 )
2

Importance Sampling

Corresponding Gaussian importance distribution with mean
k =
xk1 { 2 + yk2 exp(xk1 )/2} {1 yk2 exp(xk1 )}/2

2 + yk2 exp(xk1 )/2
and variance
k2 = ( 2 + yk2 exp(xk1 )/2)1
Prior proposal on X1 ,
X1 N (0, 2 )

Importance Sampling

Simulation starts with X1 and proceeds forward to Xn , each Xk
being generated conditional on Yk and the previously generated
Xk1 .
Importance weight computed sequentially as the product of

exp 2 (xk xk1 )2 + exp(xk )yk2 + xk /2

.
exp k2 (xk k )2 k1
(1 k K)

0.00
0.4
0.02
0.3
0.04
0.2
Density
0.06
0.1
0.08
0.0
0.10
0.1
Importance Sampling
15
weights
10
15
20
40
60
80
100
Histogram of the logarithms of the importance weights (left)

and comparison between the true volatility and the best fit,
based on 10, 000 simulated importance samples.

0.4
0.2
0.0
0.2
0.4
Importance Sampling
20
40
60
80
100
Corresponding range of the simulated {Xk }1k100 ,

compared with the true value.

Correlated simulations
Negative correlation reduces variance
Special technique but efficient when it applies
Two samples (X1 , . . . , Xm ) and (Y1 , . . . , Ym ) from f to estimate
Z
I=
h(x)f (x)dx
R
by
m
1 X
b
h(Xi )
I1 =
m
i=1
with mean I and variance 2
and
m
1 X
b
I2 =
h(Yi )
m
i=1

Variance reduction
Variance of the average

var
b
I1 + b
I2
2
2 1
+ cov(b
I1 , b
I2 ).
2
2
If the two samples are negatively correlated,

cov(b
I1 , b
I2 ) 0 ,
they improve on two independent samples of same size

Antithetic variables
If f symmetric about , take Yi = 2 Xi
If Xi = F 1 (Ui ), take Yi = F 1 (1 Ui )
If (Ai )i partition of X , partitioned sampling by sampling

Xj s in each Ai (requires to know Pr(Ai ))

Control variates
out of control!
For
I=
unknown and
I0 =
known,
I0 estimated by b
I0 and
I estimated by b
I
h(x)f (x)dx
h0 (x)f (x)dx

Control variates (2)
Combined estimator
b
I = b
I + (b
I0 I 0 )
b
I is unbiased for I and
var(b
I ) = var(b
I) + 2 var(b
I) + 2cov(b
I, b
I0 )

Optimal control
Optimal choice of
? =
with
cov(b
I, b
I0 )
,
b0 )
var(I
var(b
I? ) = (1 2 ) var(b
I) ,
b and I
b0
where correlation between I
Usual solution: regression coefficient of h(xi ) over h0 (xi )

Example (Quantile Approximation)

Evaluate
% = Pr(X > a) =
by
with Xi iid f .
If Pr(X > ) =
%b =
1
2
known
f (x)dx
1X
I(Xi > a),
n
i=1

Example (Quantile Approximation (2))

Control variate
n
1X
% =
I(Xi > a) +
n
i=1
improves upon %b if
1X
I(Xi > ) Pr(X > )
n
i=1

Example (Quantile Approximation (2))

Control variate
n
1X
% =
I(Xi > a) +
n
i=1
improves upon %b if
<0
and
1X
I(Xi > ) Pr(X > )
n
|| < 2
i=1
cov(b
%, %b0 ) Pr(X > a)
2
.
var(b
%0 ) Pr(X > )

Integration by conditioning
Use Rao-Blackwell Theorem

var(E[(X)|Y]) var((X))

Consequence
If b
I unbiased estimator of I = Ef [h(X)], with X simulated from a
joint density f(x, y), where
Z
f(x, y)dy = f (x),
the estimator
b
I = Ef[b
I|Y1 , . . . , Yn ]
dominate b
I(X1 , . . . , Xn ) variance-wise (and is unbiased)

skip expectation
Example (Students t expectation)

For
E[h(x)] = E[exp(x2 )] with
X T (, 0, 2 )
a Students t distribution can be simulated as

X|y N (, 2 y)
and
Y 1 2 .

Example (Students t expectation (2))

Empirical distribution
m
1 X
exp(Xj2 ) ,
m
j=1
can be improved from the joint sample

((X1 , Y1 ), . . . , (Xm , Ym ))

Example (Students t expectation (2))

Empirical distribution
m
1 X
exp(Xj2 ) ,
m
j=1
can be improved from the joint sample

((X1 , Y1 ), . . . , (Xm , Ym ))
since
m
m
1 X
1
1 X
p
E[exp(X 2 )|Yj ] =
m
m
2 2 Yj + 1
j=1
j=1
is the conditional expectation.

In this example, precision ten times better

0.50
0.52
0.54
0.56
0.58
0.60
2000
4000
6000
8000
10000
Estimators of E[exp(X 2 )]: empirical average (full) and

conditional expectation (dotted) for (, , ) = (4.6, 0, 1).

Bayesian importance functions

directly Markovian
Recall algorithm:
(1)
(T )
1. Generate 1 , , 1
with
2. Take
Z
from cg()
Z
c1 = g()d
f (x|)()d
T
((t) )
1X
f (x| (t) )
T
cg( (t) )
t=1
T
X
f (x| (t) )
t=1
((t) )
g((t) )
T
X
((t) )
t=1
g((t) )
= mIS (x)

Choice of g
g() = ()
1X
f (x| (t) )
T t
often inefficient if data informative
impossible if is improper
mIS (x) =

Choice of g
g() = ()
1X
f (x| (t) )
T t
often inefficient if data informative
impossible if is improper
mIS (x) =
g() = f (x|)()
c unknown
1
1X
T t=1 f (x| (t) )
improper priors allowed
mIS (x) = 1

g() = () + (1 )(|x)
defensive mixture
1 Ok
[Hestenberg, 1998]

g() = () + (1 )(|x)
defensive mixture
1 Ok
[Hestenberg, 1998]
g() = (|x)
mh (x) =
T
h()
1X
T t=1 f (x|)()
works for any h
finite variance if
Z
h2 ()
d <
f (x|)()

Bridge sampling
[Chen & Shao, 1997]
Given two models f1 (x|1 ) and f2 (x|2 ),
1 (1 |x) =
2 (2 |x) =
1 (1 )f1 (x|1 )
m1 (x)
2 (2 )f2 (x|2 )
m2 (x)
Bayes factor:
B12 (x) =
ratio of normalising constants
m1 (x)
m2 (x)

Bridge sampling (2)
(i) Missing normalising constants:

1 (1 |x)
1 (1 )
2 (2 |x)
2 (2 )
n
B12
1 (i )
1X
2 (i )
i=1
i 2

Bridge sampling (3)

(ii) Still missing normalising constants:
Z
2 ()()1 ()d
B12 = Z
1 ()()2 ()d
n1
1 X
2 (1i )(1i )
n1
i=1
n2
1 X
1 (2i )(2i )
n2
i=1
()
ji j ()

Bridge sampling (4)
Optimal choice
() =
n1 + n 2
n1 1 () + n2 2 ()
[?]
[Chen, Meng & Wong, 2000]


Basics
Irreducibility
Transience and Recurrence
Invariant measures
Ergodicity and convergence
Limit theorems
Quantitative convergence rates
Coupling
Renewal and CLT

Basics
Basics
Definition (Markov chain)

A sequence of random variables whose distribution evolves over
time as a function of past realizations

Basics
Basics
Definition (Markov chain)

A sequence of random variables whose distribution evolves over
time as a function of past realizations
Chain defined through its transition kernel, a function K defined
on X B(X ) such that
I
I
x X , K(x, ) is a probability measure;

A B(X ), K(, A) is measurable.

Basics
no discrete
When X is a discrete (finite or denumerable) set, the

transition kernel simply is a (transition) matrix K with
elements
x, y X
Pxy = Pr(Xn = y|Xn1 = x) ,
Since, for all x X , K(x, ) is a probability, we must have

X
Pxy = 1
Pxy 0 and K(x, X ) =
yX
The matrix K is referred to as a Markov transition matrix

or a stochastic matrix

Basics
In the continuous case, the kernel also denotes the

conditional density K(x, x0 ) of the transition K(x, )
Z
K(x, x0 )dx0 .
Pr(X A|x) =
A
Then, for any bounded , we may define

Z
K(x) = K(x, ) =
K(x, dy)(y).
X
Note that
|K(x)|
K(x, dy)|(y)| || = sup |(x)|.

xX
We may also associate to a probability measure the measure

K, defined as
Z
K(A) =
(dx)K(x, A).
X

Basics
Markov chains
skip definition
Given a transition kernel K, a sequence X0 , X1 , . . . , Xn , . . . of

random variables is a Markov chain denoted by (Xn ), if, for any
t, the conditional distribution of Xt given xt1 , xt2 , . . . , x0 is the
same as the distribution of Xt given xt1 . That is,
Pr(Xk+1 A|x0 , x1 , x2 , . . . , xk ) = Pr(Xk+1 A|xk )
Z
=
K(xk , dx)
A

Basics
Note that the entire structure of the chain only depends on

The transition function K
The initial state x0 or initial distribution X0

Basics
Example (Random walk)

The normal random walk is the kernel K(x, ) associated with the
distribution
Np (x, 2 Ip )
which means
Xt+1 = Xt + t
t being an iid additional noise

10
Basics
100 consecutive realisations of the random walk in R2 with

=1

Basics
bypass remarks
On a discrete state-space X = {x0 , x1 , . . .},

I A function on a discrete state space is uniquely defined by
the (column) vector = ((x0 ), (x1 ), . . . , )T and
X
K(x) =
Pxy (y)
yX
can be interpreted as the xth component of the product of

the transition matrix K and of the vector .

Basics
bypass remarks
On a discrete state-space X = {x0 , x1 , . . .},

I A function on a discrete state space is uniquely defined by
the (column) vector = ((x0 ), (x1 ), . . . , )T and
X
K(x) =
Pxy (y)
yX
can be interpreted as the xth component of the product of

the transition matrix K and of the vector .
A probability distribution on P(X ) is defined as a (row)
vector = ((x0 ), (x1 ), . . .) and the probability distribution
K is defined, for each y X as
X
K({y}) =
({x})Pxy
xX
yth component of the product of the vector and of the

transition matrix K.

Basics
Composition of kernels
Let Q1 and Q2 be two probability kernels. Define, for any x X

and any A B(X ) the product of kernels Q1 Q2 as
Z
Q1 Q2 (x, A) =
Q1 (x, dy)Q2 (y, A)
X
When the state space X is discrete, the product of Markov kernels

coincides with the product of matrices Q1 Q2 .

Irreducibility
Irreducibility
Irreducibility is one measure of the sensitivity of the Markov chain
to initial conditions
It leads to a guarantee of convergence for MCMC algorithms

Irreducibility
Irreducibility
Irreducibility is one measure of the sensitivity of the Markov chain
to initial conditions
It leads to a guarantee of convergence for MCMC algorithms
Definition (Irreducibility)
In the discrete case, the chain is irreducible if all states
communicate, namely if
Px (y < ) > 0 ,
x, y X ,
y being the first (positive) time y is visited

Irreducibility
Irreducibility for a continuous chain
In the continuous case, the chain is -irreducible for some measure

if for some n,
K n (x, A) > 0
I
I
for all x X
for every A B(X ) with (A) > 0

Irreducibility
Minoration condition
Assume there exist a probability measure and > 0 such that,

for all x X and all A B(X ),
K(x, A) (A)
This is called a minoration condition.
When K is a Markov chain on a discrete state space, this is
equivalent to saying that Pxy > 0 for all x, y X .

Irreducibility
Small sets
Definition (Small set)

If there exist C B(X ), (C) > 0, a probability measure and
> 0 such that, for all x C and all A B(X ),
K(x, A) (A)
C is called a small set
For discrete state space, atoms are small sets.

Towards further stability

Irreducibility: every set A has a chance to be visited by the
Markov chain (Xn )
This property is too weak to ensure that the trajectory of
(Xn ) will enter A often enough.

Towards further stability

Irreducibility: every set A has a chance to be visited by the
Markov chain (Xn )
This property is too weak to ensure that the trajectory of
(Xn ) will enter A often enough.
A Markov chain must enjoy good stability properties to
guarantee an acceptable approximation of the simulated
model.
Formalizing this stability leads to different notions of
recurrence
For discrete chains, the recurrence of a state equivalent to
probability one of sure return.
Always satisfied for irreducible chains on finite spaces

In a finite state space X , denote the average number of visits to a

state by
X
=
I (Xi )
i=1


state by
X
=
I (Xi )
i=1
If E [ ] = , the state is recurrent

If E [ ] < , the state is transient
For irreducible chains, recurrence/transience is property of the
chain, not of a particular state


state by
X
=
I (Xi )
i=1
If E [ ] = , the state is recurrent

If E [ ] < , the state is transient
For irreducible chains, recurrence/transience is property of the
chain, not of a particular state
Similar definitions for the continuous case.

Harris recurrence
Stronger form of recurrence:
Definition (Harris recurrence)

A set A is Harris recurrent if
Px (A = ) = 1 for all x A.
The chain (Xn ) is Harris recurrent if it is
irreducible
for every set A with (A) > 0, A is Harris recurrent.

Harris recurrence
Stronger form of recurrence:
Definition (Harris recurrence)

A set A is Harris recurrent if
Px (A = ) = 1 for all x A.
The chain (Xn ) is Harris recurrent if it is
irreducible
for every set A with (A) > 0, A is Harris recurrent.

Note that
Px (A = ) = 1 implies Ex [A ] =

Invariant measures
Invariant measures
Stability increases for the chain (Xn ) if marginal distribution of Xn
independent of n
Requires the existence of a probability distribution such that
Xn+1
if
Xn

Invariant measures
Invariant measures
Stability increases for the chain (Xn ) if marginal distribution of Xn
independent of n
Requires the existence of a probability distribution such that
Xn+1
if
Xn
Definition (Invariant measure)

A measure is invariant for the transition kernel K(, ) if
Z
(B) =
K(x, B) (dx) ,
B B(X ) .
X

Invariant measures
Stability properties and invariance
The chain is positive recurrent if is a probability measure.

Otherwise it is null recurrent or transient

Invariant measures
Stability properties and invariance
The chain is positive recurrent if is a probability measure.

Otherwise it is null recurrent or transient
If probability measure, also called stationary distribution

since
X0 implies that Xn for every n
The stationary distribution is unique

Invariant measures
Insights
no time for that!
Invariant probability measures are important not merely because they define stationary processes, but also because
they turn out to be the measures which define the longterm or ergodic behavior of the chain.
To understand why, consider P (Xn ) for a starting distribution . If
a limiting measure exists such as
P (Xn A) (A)
for all A B(X ), then

Invariant measures
(dx)P n (x, A)
Z Z
= lim
P n1 (x, dw)K(w, A)
n X
Z
=
(dw)K(w, A)
(A) =
lim
R
since setwise convergence of P n (x, ) implies convergence of integrals
of bounded measurable functions. Hence, if a limiting distribution exists,
it is an invariant probability measure; and obviously, if there is a unique
invariant probability measure, the limit will be independent of
whenever it exists.

We finally consider: to what is the chain converging?

The invariant distribution is a natural candidate for the limiting
distribution

We finally consider: to what is the chain converging?

The invariant distribution is a natural candidate for the limiting
distribution
A fundamental property is ergodicity, or independence of initial
conditions. In the discrete case, a state is ergodic if
lim |K n (, ) ()| = 0 .

Norm and convergence

In general , we establish convergence using the total variation norm
k1 2 kTV = sup |1 (A) 2 (A)|
A
and we want
Z

K n (x, )(dx)

TV

Z

n
= sup K (x, A)(dx) (A)
A
to be small.
skip minoration TV

Total variation distance and minoration
Lemma
Let and 0 be two probability measures. Then,
)
(
X
(Ai ) 0 (Ai ) = k 0 kTV .
1 inf
i
where the infimum is taken over all finite partitions (Ai )i of X .

Total variation distance and minoration (2)

Assume that there exist a probability and > 0 such that, for all
A B(X ) we have
(A) 0 (A) (A).
Then, for all I and all partitions A1 , A2 , . . ., AI ,
X
(Ai ) 0 (Ai )
i=1
and the previous result thus implies that

k 0 kTV (1 ).

Harris recurrence and ergodicity

Theorem
If (Xn ) Harris positive recurrent and aperiodic, then
Z

n

K
(x,
)(dx)
lim
=0

n
for every initial distribution .
TV


Theorem
Z

n

K
(x,
)(dx)
lim
=0

n
TV
We thus take Harris positive recurrent and aperiodic as

equivalent to ergodic
[Meyn & Tweedie, 1993]


Theorem
Z

n

K
(x,
)(dx)
lim
=0

n
TV
We thus take Harris positive recurrent and aperiodic as

equivalent to ergodic
[Meyn & Tweedie, 1993]
Convergence in total variation implies
lim |E [h(Xn )] E [h(X)]| = 0
for every bounded function h.
no detail of convergence

Convergences
There are difference speeds of convergence

ergodic (fast enough)
geometrically ergodic (faster)

uniformly ergodic (fastest)

Geometric ergodicity
A -irreducible aperiodic Markov kernel P with invariant

distribution is geometrically ergodic if there exist V 1, and
constants < 1, R < such that (n 1)
kP n (x, .) (.)kV RV (x)n ,
on {V < } which is full and absorbing.

Geometric ergodicity implies a lot of important results

P
I CLT for additive functionals n1/2
g(Xk ) and functions
|g| < V


P
|g| < V
I
Rosenthals type inequalities

p
n
X

Ex
g(Xk ) C(p)np/2 ,

k=1
|g|p 2


P
|g| < V
I
Rosenthals type inequalities

p
n
X

Ex
g(Xk ) C(p)np/2 ,

k=1
|g|p 2
exponential inequalities (for bounded functions and small

enough)
!)
(
n
X
g(Xk )
<
Ex exp
k=1

Minoration condition and uniform ergodicity
Under the minoration condition, the kernel K is thus contractant

and standard results in functional analysis shows the existence and
the unicity of a fixed point . The previous relation implies that,
for all x X .
kP n (x, ) kTV (1 )n
Such Markov chains are called uniformly ergodic.

Uniform ergodicity
Theorem (S&n ergodicity)
The following conditions are equivalent:
I
(Xn )n is uniformly ergodic,
there exist < 1 and R < such that, for all x X ,

kP n (x, ) kTV Rn ,
for some n > 0,

sup kP n (x, ) ()kTV < 1.
xX
[Meyn and Tweedie, 1993]

Limit theorems
Limit theorems
Ergodicity determines the probabilistic properties of average
behavior of the chain.

Limit theorems
Limit theorems
But also need of statistical inference, made by induction from the
observed sample.
If kPxn k close to 0, no direct information about
Xn Pxn
c We need LLNs and CLTs!!!


Limit theorems
Limit theorems
But also need of statistical inference, made by induction from the
observed sample.
If kPxn k close to 0, no direct information about
Xn Pxn
c We need LLNs and CLTs!!!

Classical LLNs and CLTs not directly applicable due to:
Markovian dependence structure between the
observations Xi
Non-stationarity of the sequence

Limit theorems
The Theorem
Theorem (Ergodic Theorem)

If the Markov chain (Xn ) is Harris recurrent, then for any function
h with E|h| < ,
Z
1X
h(Xi ) = h(x)d(x),
lim
n n
i

Limit theorems
Central Limit Theorem

To get a CLT, we need more assumptions.
skip conditions and results

Limit theorems
Central Limit Theorem

To get a CLT, we need more assumptions.
skip conditions and results
For MCMC, the easiest is

P( -> )
Definition (reversibility)
A Markov chain (Xn ) is reversible if for
all n
P(-> )
Xn+1 |Xn+2 = x Xn+1 |Xn = x

The direction of time does not matter
[Green, 1995]

Limit theorems
The CLT
Theorem
If the Markov chain (Xn ) is Harris recurrent and reversible,
!
N
X
1
L
(h(Xn ) E [h]) N (0, h2 ) .

N n=1
where
2
0 < h2 = E [h (X0 )]
X
+2
E [h(X0 )h(Xk )] < +.
k=1
[Kipnis & Varadhan, 1986]


skip detailed results
Let P a Markov transition kernel on (X , B(X )), with P positive

recurrent and its stationary distribution


skip detailed results
Let P a Markov transition kernel on (X , B(X )), with P positive

recurrent and its stationary distribution
Convergence rate Determine, from the kernel, a sequence
B(, n), such that
kP n kV B(, n)
where V : X [1, ) and for any signed measure ,
kkV = sup |()|
||V

Practical purposes?
In the 90s, a wealth of contributions on quantitative bounds

triggered by MCMC algorithms to answer questions like: what is
the appropriate burn in? or how long should the sampling continue
after burn in?
[Douc, Moulines and Rosenthal, 2001]
[Jones and Hobert, 2001]

Tools at hand
For MCMC algorithms, kernels are explicitly known.
Type of quantities (more or less directly) available:
I
Minoration constants
K s (x, A) (A),
for all x C,
Foster-Lyapunov Drift conditions,

KV V + bIC
and goal is to obtain a bound depending explicitly upon , , b,

&tc...

Coupling
Coupling
skip coupling
X0
If X and
and , one can construct two
and X
0 such that
random variables X
, X
0 0
X
=X
0
and X
with probability

Coupling
Coupling
skip coupling
X0
If X and
and , one can construct two
and X
0 such that
random variables X
, X
0 0
X
=X
0
and X
with probability
The basic coupling construction

I
I
with probability , draw Z according to and set

=X
0 = Z.
X
and X 0 under distributions
with probability 1 , draw X
( )/(1 )
and (0 )/(1 ),
respectively.
[Thorisson, 2000]

Coupling
Coupling inequality
X, X 0 r.v.s with probability distribution K(x, .) and K(x0 , .),
respectively, can be coupled with probability if:
K(x, ) K(x0 , ) x,x0 (.)
where x,x0 is a probability measure, or, equivalently,
kK(x, ) K(x0 , )kTV (1 )
Define an -coupling set as a set C X X satisfying :
A B(X ),
(x, x0 ) C,
K(x, A) K(x0 , A) x,x0 (A)

Coupling
Small set and coupling sets

C X small set if there exist > 0 and a probability measure
such that, for all A B(X )
K(x, A) (A),
x C.
(3)
Small sets always exist when the MC is -irreducible

[Jain and Jamieson, 1967]

Coupling
Small set and coupling sets

C X small set if there exist > 0 and a probability measure
such that, for all A B(X )
K(x, A) (A),
x C.
(3)
Small sets always exist when the MC is -irreducible

[Jain and Jamieson, 1967]
For MCMC kernels, small sets in general easy to find.
If C is a small set, then C = C C is a coupling set:
A B(X ),
(x, x0 ) C,
K(x, A) K(x0 , A) (A).

Coupling
Coupling for Markov chains

P Markov transition kernel on X X such that, for all
(x, x0 ) 6 C (where C is an -coupling set) and all A B(X ) :
P (x, x0 ; A X ) = K(x, A) and
P (x, x0 ; X A) = K(x0 , A)

Coupling
Coupling for Markov chains

P Markov transition kernel on X X such that, for all
(x, x0 ) 6 C (where C is an -coupling set) and all A B(X ) :
P (x, x0 ; A X ) = K(x, A) and
P (x, x0 ; X A) = K(x0 , A)
For example,
I
I
P (x, x0 ; A A0 ) = K(x, A)K(x0 , A0 ).

for (x, x0 ) 6 C,
For all (x, x0 ) C and all A, A0 B(X ), define the residual
kernel
x0 ; A X ) = (1 )1 (K(x, A) x,x0 (A))
R(x,
x0 ; X A0 ) = (1 )1 (K(x0 , A) x,x0 (A0 )).
R(x,

Coupling
Coupling algorithm
I
I
I
Initialisation Let X0 and X00 0 and set d0 = 0.
After coupling If dn = 1, then draw Xn+1 K(Xn , ), and

0
set Xn+1
= Xn+1 .
Before coupling If dn = 0 and (Xn , Xn0 ) C,

I
I
I
0
with probability , draw Xn+1 = Xn+1
Xn ,Xn0 and set
dn+1 = 1.
0
n , Xn0 ; ) and
with probability 1 , draw (Xn+1 , Xn+1
) R(X
set dn+1 = 0.
then draw
If dn = 0 and (Xn , Xn0 ) 6 C,
0
0
(Xn+1 , Xn+1 ) P (Xn , Xn ; ).
(Xn , Xn0 , dn ) [where dn is the bell variable which indicates

whether the chains have coupled or not] is a Markov chain on
(X X {0, 1}).

Coupling
Coupling inequality (again!)
Define the coupling time T as

T = inf{k 1, dk = 1}
Coupling inequality
sup |P k (A) 0 P k (A)| P, 0 ,0 [T > k]
A
[Pitman, 1976; Lindvall, 1992]

Coupling
Drift conditions
To exploit the coupling construction, we need to control the hitting
time

Coupling
Drift conditions
time
Moments of the return time to a set C are most often controlled
using Foster-Lyapunov drift condition:
P V V + bIC ,
V 1
Mk = k V (Xk )I(C k), k 1 is a supermartingale and thus

Ex [C ] V (x) + b1 IC (x).

Coupling
Drift conditions
time
Moments of the return time to a set C are most often controlled
using Foster-Lyapunov drift condition:
P V V + bIC ,
V 1
Mk = k V (Xk )I(C k), k 1 is a supermartingale and thus

Ex [C ] V (x) + b1 IC (x).
Conversely, if there exists a set C such that Ex [C ] < for all
x (in a full and absorbing set), then there exists a drift function
verifying the Foster-Lyapunov conditions.
[Meyn and Tweedie, 1993]

Coupling
If the drift condition is imposed directly on the joint transition

kernel P , there exist V 1, 0 < < 1 and a set C such that :
P V (x, x0 ) V (x, x0 )
(x, x0 ) 6 C
When P (x, x0 ; A A0 ) = K(x, A)K(x0 , A0 ), one may consider

V (x, x0 ) = (1/2) V (x) + V (x0 )
where V drift function for P (but not necessarily the best choice)

Coupling
Explicit bound
Theorem
For any distributions and 0 , and any j k, then:
kP k () 0 P k ()kT V (1 )j + k B j1 E, 0 ,0 [V (X0 , X00 )]
where
B = 1 1 (1 ) sup RV.
[DMR,2001]

Renewal and CLT
Renewal and CLT

Given a Markov chain (Xn )n , how good an approximation of
Z
I = g(x)(x)dx
is
n1
1X
g(Xi ) ?
g n :=
n
i=0

Renewal and CLT
Renewal and CLT

Given a Markov chain (Xn )n , how good an approximation of
Z
I = g(x)(x)dx
is
n1
1X
g(Xi ) ?
g n :=
n
i=0
Standard MC if CLT
d
n (g n E [g(X)]) N (0, g2 )
and there exists an easy-to-compute, consistent estimate of g2 ...

Renewal and CLT
Minoration
skip construction
Assume that the kernel density K satisfies, for some density q(),
(0, 1) and a small set C X ,
K(y|x) q(y) for all y X and x C
Then split K into a mixture
K(y|x) = q(y) + (1 ) R(y|x)
where R is residual kernel

Renewal and CLT
Split chain
Let 0 , 1 , 2 , . . . be iid B(). Then the split chain

{(X0 , 0 ), (X1 , 1 ), (X2 , 2 ), . . . }
is such that, when Xi C, i determines Xi+1 :
(
q(x)
if i = 1,
Xi+1
R(x|Xi ) otherwise
[Regeneration] When (Xi , i ) C {1}, Xi+1 q

Renewal and CLT
Renewals
For X0 q and R successive renewals, define by 1 < . . . < R the
renewal times.
Then
#
"
R

R 1 X
R g R E [g(X)] =
(St Nt E [g(X)])
N R
t=1
where Nt length of the t th tour, and St sum of the g(Xj )s over

the t th tour.
Since (Nt , St ) are iid and Eq [St Nt E [g(X)]] = 0, if Nt and St
have finite 2nd moments,
d
I
R g R E g N (0, g2 )
I there is a simple, consistent estimator of g2
[Mykland & al., 1995; Robert, 1995]

Renewal and CLT
Moment conditions
We need to show that, for the minoration condition, Eq [N12 ] and
Eq [S12 ] are finite.
If
1. the chain is geometrically ergodic, and
2. E [|g|2+ ] < for some > 0,
then Eq [N12 ] < and Eq [S12 ] < .
[Hobert & al., 2002]
Note that drift + minoration ensures geometric ergodicity

[Rosenthal, 1995; Roberts & Tweedie, 1999]


Monte Carlo Methods based on Markov Chains
The MetropolisHastings algorithm
A collection of Metropolis-Hastings algorithms
Extensions

Running Monte Carlo via Markov Chains
It is not necessary to use a sample from the distribution f to

approximate the integral
Z
I=
h(x)f (x)dx ,

Running Monte Carlo via Markov Chains
It is not necessary to use a sample from the distribution f to

approximate the integral
Z
I=
h(x)f (x)dx ,
We can obtain X1 , . . . , Xn f (approx) without directly
simulating from f , using an ergodic Markov chain with
stationary distribution f

Running Monte Carlo via Markov Chains (2)

Idea
For an arbitrary starting value x(0) , an ergodic chain (X (t) ) is
generated using a transition kernel with stationary distribution f


Idea
Insures the convergence in distribution of (X (t) ) to a random

variable from f .
For a large enough T0 , X (T0 ) can be considered as

distributed from f
Produce a dependent sample X (T0 ) , X (T0 +1) , . . ., which is

generated from f , sufficient for most approximation purposes.


Idea
Insures the convergence in distribution of (X (t) ) to a random

variable from f .
For a large enough T0 , X (T0 ) can be considered as

distributed from f
Produce a dependent sample X (T0 ) , X (T0 +1) , . . ., which is

generated from f , sufficient for most approximation purposes.
Problem: How can one build a Markov chain with a given

stationary distribution?

Basics
The algorithm uses the objective (target) density
f
and a conditional density
q(y|x)
called the instrumental (or proposal) distribution

The MH algorithm
Algorithm (MetropolisHastings)
Given x(t) ,
1. Generate Yt q(y|x(t) ).
2. Take
(t+1)
Yt
x(t)
with prob. (x(t) , Yt ),

with prob. 1 (x(t) , Yt ),
where
(x, y) = min
f (y) q(x|y)
,1
f (x) q(y|x)

Features
Independent of normalizing constants for both f and q(|x)

(ie, those constants independent of x)
Never move to values with f (y) = 0
The chain (x(t) )t may take the same value several times in a
row, even though f is a density wrt Lebesgue measure
The sequence (yt )t is usually not a Markov chain

Convergence properties
1. The M-H Markov chain is reversible, with
invariant/stationary density f since it satisfies the detailed
balance condition
f (y) K(y, x) = f (x) K(x, y)

balance condition
f (y) K(y, x) = f (x) K(x, y)
2. As f is a probability measure, the chain is positive recurrent

balance condition
f (y) K(y, x) = f (x) K(x, y)
2. As f is a probability measure, the chain is positive recurrent
3. If
"
#
f (Yt ) q(X (t) |Yt )
Pr
1 < 1.
f (X (t) ) q(Yt |X (t) )
(1)
that is, the event {X (t+1) = X (t) } is possible, then the chain
is aperiodic

Convergence properties (2)

4. If
q(y|x) > 0 for every (x, y),
the chain is irreducible
(2)


4. If
5. For M-H, f -irreducibility implies Harris recurrence
(2)


4. If
(2)

5. For M-H, f -irreducibility implies Harris recurrence
6. Thus, for M-H satisfying (1) and (2)
(i) For h, with Ef |h(X)| < ,
lim
(ii) and
Z
T
1X
(t)
h(X ) = h(x)df (x)
T t=1
Z

lim
K n (x, )(dx) f
n
a.e. f.
=0
TV
for every initial distribution , where K n (x, ) denotes the

kernel for n transitions.

The Independent Case

The instrumental distribution q is independent of X (t) , and is
denoted g by analogy with Accept-Reject.

The Independent Case

The instrumental distribution q is independent of X (t) , and is
denoted g by analogy with Accept-Reject.
Algorithm (Independent Metropolis-Hastings)

Given x(t) ,
a Generate Yt g(y)
b Take
X (t+1) =
x(t)
with prob. min

otherwise.
)
f (Yt ) g(x(t) )
,1 ,
f (x(t) ) g(Yt )

Properties
The resulting sample is not iid

Properties
The resulting sample is not iid but there exist strong convergence
properties:
Theorem (Ergodicity)
The algorithm produces a uniformly ergodic chain if there exists a
constant M such that
f (x) M g(x) ,
x supp f.
In this case,
n
kK (x, ) f kT V
1
1
M
n
[Mengersen & Tweedie, 1996]

Example (Noisy AR(1))

Hidden Markov chain from a regular AR(1) model,
xt+1 = xt + t+1
t N (0, 2 )
and observables
yt |xt N (x2t , 2 )

Example (Noisy AR(1))

Hidden Markov chain from a regular AR(1) model,
xt+1 = xt + t+1
t N (0, 2 )
and observables
yt |xt N (x2t , 2 )
The distribution of xt given xt1 , xt+1 and yt is

2
1
2
2
2 2
.
exp 2 (xt xt1 ) + (xt+1 xt ) + 2 (yt xt )
2

Example (Noisy AR(1) too)

Use for proposal the N (t , t2 ) distribution, with
t =
xt1 + xt+1
1 + 2
and
t2 =
2
.
1 + 2

Example (Noisy AR(1) too)

Use for proposal the N (t , t2 ) distribution, with
t =
xt1 + xt+1
1 + 2
and
t2 =
2
.
1 + 2
Ratio
(x)/qind (x) = exp (yt x2t )2 /2 2
is bounded

(top) Last 500 realisations of the chain {Xk }k out of 10, 000
iterations; (bottom) histogram of the chain, compared with
the target distribution.

Example (Cauchy by normal)

Given a Cauchy C (0, 1) distribution, consider a normal
N (0, 1) proposal
go random W


N (0, 1) proposal
The MetropolisHastings acceptance ratio is
go random W

1 + ( 0 )2
( 0 )/( 0 )
= exp 2 ( 0 )2 /2
.
()/())
(1 + 2 )


N (0, 1) proposal
The MetropolisHastings acceptance ratio is
go random W

1 + ( 0 )2
( 0 )/( 0 )
= exp 2 ( 0 )2 /2
.
()/())
(1 + 2 )
Poor perfomances: the proposal distribution has lighter tails than

the target Cauchy and convergence to the stationary distribution is
not even geometric!

0.0
0.1
0.2
Density
0.3
0.4
0
4
Histogram of Markov chain

(t )1t5000 against target
C (0, 1) distribution.
1000
2000
3000
4000
iterations
Range and average of 1000

parallel runs when initialized
with a normal N (0, 1002 )
distribution.
5000

Random walk MetropolisHastings
Use of a local perturbation as proposal

Yt = X (t) + t ,
where t g, independent of X (t) .
The instrumental density is now of the form g(y x) and the
Markov chain is a random walk if we take g to be symmetric
g(x) = g(x)

Algorithm (Random walk Metropolis)

Given x(t)
1. Generate Yt g(y x(t) )
2. Take
X (t+1) =
x(t)

f (Yt )
with prob. min 1,
,
f (x(t) )
otherwise.

Example (Random walk and normal target)

Generate N (0, 1) based on the uniform proposal [, ]
[Hastings (1970)]
The probability of acceptance is then
forget History!
(x(t) , yt ) = exp{(x(t) yt2 )/2} 1.

Example (Random walk & normal (2))

Sample statistics
mean
variance
0.1
0.399
0.698
0.5
-0.111
1.11
1.0
0.10
1.06
c As , we get better histograms and a faster exploration of the

support of f .

-1
1
(a)
0.5
0.5
0.0
0.0
-0.5
-0.5
-1.0
-1.0
100
-2
0
(b)
-1.5
-1.5
-1.5
50
-1.0
100
100
-0.5
200
200
150
0.0
300
200
300
0.5
250
400
400
-3
-2
-1
(c)
Three samples based on U[, ] with (a) = 0.1, (b) = 0.5

and (c) = 1.0, superimposed with the convergence of the
means (15, 000 simulations).

Example (Mixture models (again!))

(|x)
n
k
Y
X
j=1
`=1
p` f (xj |` , ` ) ()

Example (Mixture models (again!))

(|x)
n
k
Y
X
j=1
`=1
p` f (xj |` , ` ) ()
Metropolis-Hastings proposal:
(t)
+ (t) if u(t) < (t)
(t+1)
=
(t)
otherwise
where
(t) =
((t) + (t) |x)

1
((t) |x)
and scaled for good acceptance rate

2
1
theta
0
-1
-1
theta
Random walk sampling (50000 iterations)
0.0
0.2
0.4
0.6
0.8
1.0
0.2
0.4
0.6
0.8
1.0
1.2
tau
1.0
0.0 1.0 2.0
1.2
-1
0.6
tau
0 1 2 3 4 5 6
0.8
theta
0.4
0.0
0.2
0.4
0.6
0.8
1.0
0.2
0 2 4
0.0
0.2
0.4
0.6
p
0.8
1.0
0.2
0.4
0.6
0.8
tau
General case of a 3 component normal mixture

[Celeux & al., 2000]

1
0
Random walk MCMC output for .7N (1 , 1) + .3N (2 , 1)

Example (probit model)

skip probit
Likelihood of the probit model

n
Y
i=1
(yiT )xi (yiT )1xi

Example (probit model)

skip probit
Likelihood of the probit model

n
Y
(yiT )xi (yiT )1xi
i=1
Random walk proposal

(t+1) = (t) + t
where, for instance,
t Np (0, )
= (Y Y T )1

10
10
Likelihood surface and random walk Metropolis-Hastings

steps

Uniform ergodicity prohibited by random walk structure

Uniform ergodicity prohibited by random walk structure

At best, geometric ergodicity:
Theorem (Sufficient ergodicity)

For a symmetric density f , log-concave in the tails, and a positive
and symmetric density g, the chain (X (t) ) is geometrically ergodic.
no tail effect

1.5
1.0
0.5
-1.5
-1.0
-0.5
0.0
0.5
0.0
-0.5
-1.0
Random-walk
MetropolisHastings algorithms
based on a N (0, 1) instrumental
for the generation of (a) a
N (0, 1) distribution and (b) a
distribution with density
(x) (1 + |x|)3
-1.5
Example (Comparison of tail

effects)
1.0
1.5
50
100
(a)
150
200
50
100
150
(b)
90% confidence envelopes of

the means, derived from 500
parallel independent chains
200

Example (Cauchy by normal continued)

Again, Cauchy C (0, 1) target and Gaussian random walk proposal,
0 N (, 2 ), with acceptance probability
1 + 2
1,
1 + ( 0 )2
Overall fit of the Cauchy density by the histogram satisfactory, but
poor exploration of the tails: 99% quantile of C (0, 1) equal to 3,
but no simulation exceeds 14 out of 10, 000!
[Roberts & Tweedie, 2004]

Again, lack of geometric ergodicity!
0.00
0.05
0.10
0.15
Density
0.20
0.25
0.30
0.35

Slow convergence shown by the non-stable range after 10, 000
iterations.
Histogram of the 10, 000 first steps of a random walk

MetropolisHastings algorithm using a N (, 1) proposal

100
50
50
100
2000
4000
6000
8000
iterations
Range of 500 parallel runs for the same setup
10000

Further convergence properties

Under assumptions
I
skip detailed convergence
(A1) f is super-exponential, i.e. it is positive with positive

continuous first derivative such that
lim|x| n(x)0 log f (x) = where n(x) := x/|x|.
In words : exponential decay of f in every direction with rate
tending to
(A2) lim sup|x| n(x)0 m(x) < 0, where
m(x) = f (x)/|f (x)|.
In words: non degeneracy of the countour manifold
Cf (y) = {y : f (y) = f (x)}
Q is geometrically ergodic, and

V (x) f (x)1/2 verifies the drift condition
[Jarner & Hansen, 2000]

Further [further] convergence properties

skip hyperdetailed convergence
If P -irreducible and aperiodic, for r = (r(n))nN real-valued non

decreasing sequence, such that, for all n, m N,
r(n + m) r(n)r(m),
and r(0) = 1, for C a small set, C = inf{n 1, Xn C}, and
h 1, assume
#
" 1
C
X
r(k)h(Xk ) < ,
sup Ex
xC
k=0

then,
S(f, C, r) :=
x X, Ex
( 1
C
X
r(k)h(Xk )
k=0
<
is full and absorbing and for x S(f, C, r),

lim r(n)kP n (x, .) f kh = 0.
[Tuominen & Tweedie, 1994]

Comments
I
[CLT, Rosenthals inequality...] h-ergodicity implies CLT

for additive (possibly unbounded) functionals of the chain,
Rosenthals inequality and so on...
[Control of the moments of the return-time] The

condition implies (because h 1) that
( 1
)
C
X
sup Ex [r0 (C )] sup Ex
r(k)h(Xk ) < ,
xC
xC
Pn
k=0
where r0 (n) = l=0 r(l) Can be used to derive bounds for

the coupling time, an essential step to determine computable
bounds, using coupling inequalities
[Roberts & Tweedie, 1998; Fort & Moulines, 2000]

Alternative conditions
The condition is not really easy to work with...

[Possible alternative conditions]
(a) [Tuominen, Tweedie, 1994] There exists a sequence
(Vn )nN , Vn r(n)h, such that
(i) supC V0 < ,
(ii) {V0 = } {V1 = } and
(iii) P Vn+1 Vn r(n)h + br(n)IC .

(b) [Fort 2000] V f 1 and b < , such that supC V <

and
)
(
C
X
P V (x) + Ex
r(k)f (Xk ) V (x) + bIC (x)
k=0
where C is the hitting time on C and

r(k) = r(k) r(k 1), k 1 and r(0) = r(0).
nP
o
C 1
Result (a) (b) supxC Ex
r(k)f
(X
)
< .
k
k=0

Extensions
Extensions
There are many other families of HM algorithms

Adaptive Rejection Metropolis Sampling
Reversible Jump (later!)
Langevin algorithms
to name just a few...

Extensions
Langevin Algorithms
Proposal based on the Langevin diffusion Lt is defined by the

stochastic differential equation
1
dLt = dBt + log f (Lt )dt,
2
where Bt is the standard Brownian motion
Theorem
The Langevin diffusion is the only non-explosive diffusion which is
reversible with respect to f .

Extensions
Discretization
Instead, consider the sequence
x(t+1) = x(t) +
2
log f (x(t) ) + t ,
2
where 2 corresponds to the discretization step
t Np (0, Ip )

Extensions
Discretization
Instead, consider the sequence
x(t+1) = x(t) +
2
log f (x(t) ) + t ,
2
t Np (0, Ip )
where 2 corresponds to the discretization step

Unfortunately, the discretized chain may be be transient, for
instance when

lim 2 log f (x)|x|1 > 1
x

Extensions
MH correction
Accept the new value Yt with probability

2

2
2
(t)
(t)
2
exp Yt x 2 log f (x )
f (Yt )

1.
2
f (x(t) )

(t)
2
2
2
exp x Yt 2 log f (Yt )
Choice of the scaling factor

Should lead to an acceptance rate of 0.574 to achieve optimal
convergence rates (when the components of x are uncorrelated)
[Roberts & Rosenthal, 1998]

Extensions
Optimizing the Acceptance Rate
Problem of choice of the transition kernel from a practical point of

view
Most common alternatives:
(a) a fully automated algorithm like ARMS;
(b) an instrumental density g which approximates f , such that
f /g is bounded for uniform ergodicity to apply;
(c) a random walk
In both cases (b) and (c), the choice of g is critical,

Extensions
Case of the independent MetropolisHastings algorithm

Choice of g that maximizes the average acceptance rate

f (Y ) g(X)
,1
= E min
f (X) g(Y )

f (Y )
f (X)
= 2P
,
X f, Y g,
g(Y )
g(X)
Related to the speed of convergence of
T
1 X
h(X (t) )
T
t=1
to Ef [h(X)] and to the ability of the algorithm to explore any

complexity of f

Extensions
Case of the independent MetropolisHastings algorithm (2)
Practical implementation
Choose a parameterized instrumental distribution g(|) and
adjusting the corresponding parameters based on the evaluated
acceptance rate
() =
m
2 X
I{f (yi )g(xi )>f (xi )g(yi )} ,
m
i=1
where x1 , . . . , xm sample from f and y1 , . . . , ym iid sample from g.

Extensions
Example (Inverse Gaussian distribution)

no inverse
Simulation from

p
p
2
f (z|1 , 2 ) z
exp 1 z
+ 2 1 2 + log 22 IR+ (z)
z
p
based on the Gamma distribution Ga(, ) with = 2 /1
3/2

Extensions
Example (Inverse Gaussian distribution)

no inverse
Simulation from

p
p
2
f (z|1 , 2 ) z
exp 1 z
+ 2 1 2 + log 22 IR+ (z)
z
p
based on the Gamma distribution Ga(, ) with = 2 /1
Since

2
f (x)
1/2
x
exp ( 1 )x
,
g(x)
x
3/2
the maximum is attained at

x
( + 1/2)
p
( + 1/2)2 + 42 (1 )
.
2( 1 )

Extensions
Example (Inverse Gaussian distribution (2))

The analytical optimization (in ) of
(
M () = (x )1/2
2
exp ( 1 )x
x
is impossible
()
E[Z]
E[1/Z]
0.2
0.22
1.137
1.116
0.5
0.41
1.158
1.108
0.8
0.54
1.164
1.116
0.9
0.56
1.154
1.115
1
0.60
1.133
1.120
(1 = 1.5, 2 = 2, and m = 5000).
1.1
0.63
1.148
1.126
1.2
0.64
1.181
1.095
1.5
0.71
1.148
1.115

Extensions
Case of the random walk

Different approach to acceptance rates
A high acceptance rate does not indicate that the algorithm is
moving correctly since it indicates that the random walk is moving
too slowly on the surface of f .

Extensions
Case of the random walk

Different approach to acceptance rates
A high acceptance rate does not indicate that the algorithm is
moving correctly since it indicates that the random walk is moving
too slowly on the surface of f .
If x(t) and yt are close, i.e. f (x(t) ) ' f (yt ) y is accepted with
probability

f (yt )
min
,1 ' 1 .
f (x(t) )
For multimodal densities with well separated modes, the negative
effect of limited moves on the surface of f clearly shows.

Extensions
Case of the random walk (2)
If the average acceptance rate is low, the successive values of f (yt )

tend to be small compared with f (x(t) ), which means that the
random walk moves quickly on the surface of f since it often
reaches the borders of the support of f

Extensions
Rule of thumb
In small dimensions, aim at an average acceptance rate of 50%. In

large dimensions, at an average acceptance rate of 25%.
[Gelman,Gilks and Roberts, 1995]

Extensions
Rule of thumb
In small dimensions, aim at an average acceptance rate of 50%. In

large dimensions, at an average acceptance rate of 25%.
[Gelman,Gilks and Roberts, 1995]
This rule is to be taken with a pinch of salt!

Extensions
Example (Noisy AR(1) continued)

For a Gaussian random walk with scale small enough, the
random walk never jumps to the other mode. But if the scale is
sufficiently large, the Markov chain explores both modes and give a
satisfactory approximation of the target distribution.

Extensions
Markov chain based on a random walk with scale = .1.

Extensions
Markov chain based on a random walk with scale = .5.

The Gibbs Sampler
The Gibbs Sampler
The Gibbs Sampler

General Principles
Completion
Convergence
The Hammersley-Clifford theorem
Hierarchical models
Data Augmentation
Improper Priors

The Gibbs Sampler
General Principles
General Principles
A very specific simulation algorithm based on the target
distribution f :
1. Uses the conditional densities f1 , . . . , fp from f

The Gibbs Sampler
General Principles
General Principles
distribution f :
2. Start with the random variable X = (X1 , . . . , Xp )

The Gibbs Sampler
General Principles
General Principles
distribution f :
2. Start with the random variable X = (X1 , . . . , Xp )
3. Simulate from the conditional densities,
Xi |x1 , x2 , . . . , xi1 , xi+1 , . . . , xp
fi (xi |x1 , x2 , . . . , xi1 , xi+1 , . . . , xp )
for i = 1, 2, . . . , p.

The Gibbs Sampler
General Principles
Algorithm (Gibbs sampler)

(t)
(t)
Given x(t) = (x1 , . . . , xp ), generate

(t+1)
1. X1
(t+1)
2. X2
(t)
(t)
f1 (x1 |x2 , . . . , xp );
(t+1)
, x3 , . . . , xp ),
(t+1)
, . . . , xp1 )
f2 (x2 |x1
(t)
(t)
...
p.
(t+1)
Xp
fp (xp |x1
(t+1)
X(t+1) X f

The Gibbs Sampler
General Principles
Properties
The full conditionals densities f1 , . . . , fp are the only densities used

for simulation. Thus, even in a high dimensional problem, all of
the simulations may be univariate

The Gibbs Sampler
General Principles
Properties
The full conditionals densities f1 , . . . , fp are the only densities used

for simulation. Thus, even in a high dimensional problem, all of
the simulations may be univariate
The Gibbs sampler is not reversible with respect to f . However,
each of its p components is. Besides, it can be turned into a
reversible sampler, either using the Random Scan Gibbs sampler
see section or running instead the (double) sequence
f1 fp1 fp fp1 f1

The Gibbs Sampler
General Principles
Example (Bivariate Gibbs sampler)

(X, Y ) f (x, y)
Generate a sequence of observations by
Set X0 = x0
For t = 1, 2, . . . , generate
Yt fY |X (|xt1 )
Xt fX|Y (|yt )
where fY |X and fX|Y are the conditional distributions

The Gibbs Sampler
General Principles
A Very Simple Example: Independent N(, 2 )

Observations
iid
When Y1 , . . . , Yn N(y|, 2 ) with both and unknown, the

posterior in (, 2 ) is conjugate outside a standard familly

The Gibbs Sampler
General Principles
A Very Simple Example: Independent N(, 2 )

Observations
iid
When Y1 , . . . , Yn N(y|, 2 ) with both and unknown, the

posterior in (, 2 ) is conjugate outside a standard familly
But...
P
2

|Y0:n , 2 N n1 ni=1 Yi , n )

P
2 |Y1:n , IG 2 n2 1, 21 ni=1 (Yi )2
assuming constant (improper) priors on both and 2

I
Hence we may use the Gibbs sampler for simulating from the
posterior of (, 2 )

The Gibbs Sampler
General Principles
R Gibbs Sampler for Gaussian posterior

n = length(Y);
S = sum(Y);
mu = S/n;
for (i in 1:500)
S2 = sum((Y-mu)^2);
sigma2 = 1/rgamma(1,n/2-1,S2/2);
mu = S/n + sqrt(sigma2/n)*rnorm(1);

The Gibbs Sampler
General Principles
Example of results with n = 10 observations from the

N(0, 1) distribution
Number of Iterations 1

The Gibbs Sampler
General Principles

Number of Iterations 1, 2

The Gibbs Sampler
General Principles

Number of Iterations 1, 2, 3

The Gibbs Sampler
General Principles

Number of Iterations 1, 2, 3, 4

The Gibbs Sampler
General Principles

Number of Iterations 1, 2, 3, 4, 5

The Gibbs Sampler
General Principles

Number of Iterations 1, 2, 3, 4, 5, 10

The Gibbs Sampler
General Principles

Number of Iterations 1, 2, 3, 4, 5, 10, 25

The Gibbs Sampler
General Principles

Number of Iterations 1, 2, 3, 4, 5, 10, 25, 50

The Gibbs Sampler
General Principles

Number of Iterations 1, 2, 3, 4, 5, 10, 25, 50, 100

The Gibbs Sampler
General Principles

Number of Iterations 1, 2, 3, 4, 5, 10, 25, 50, 100, 500

The Gibbs Sampler
General Principles
Limitations of the Gibbs sampler
Formally, a special case of a sequence of 1-D M-H kernels, all with

acceptance rate uniformly equal to 1.
The Gibbs sampler
1. limits the choice of instrumental distributions

The Gibbs Sampler
General Principles

The Gibbs sampler
2. requires some knowledge of f

The Gibbs Sampler
General Principles

The Gibbs sampler
3. is, by construction, multidimensional

The Gibbs Sampler
General Principles

The Gibbs sampler
3. is, by construction, multidimensional
4. does not apply to problems where the number of parameters
varies as the resulting chain is not irreducible.

The Gibbs Sampler
Completion
Latent variables are back
The Gibbs sampler can be generalized in much wider generality

A density g is a completion of f if
Z
g(x, z) dz = f (x)
Z

The Gibbs Sampler
Completion
Latent variables are back
The Gibbs sampler can be generalized in much wider generality

A density g is a completion of f if
Z
g(x, z) dz = f (x)
Z
Note
The variable z may be meaningless for the problem

The Gibbs Sampler
Completion
Purpose g should have full conditionals that are easy to simulate

for a Gibbs sampler to be implemented with g rather than f
For p > 1, write y = (x, z) and denote the conditional densities of
g(y) = g(y1 , . . . , yp ) by
Y1 |y2 , . . . , yp g1 (y1 |y2 , . . . , yp ),
Y2 |y1 , y3 , . . . , yp g2 (y2 |y1 , y3 , . . . , yp ),

...,
Yp |y1 , . . . , yp1 gp (yp |y1 , . . . , yp1 ).

The Gibbs Sampler
Completion
The move from Y (t) to Y (t+1) is defined as follows:
Algorithm (Completion Gibbs sampler)

(t)
(t)
Given (y1 , . . . , yp ), simulate

(t+1)
1. Y1
(t+1)
2. Y2
(t+1)
p. Yp
(t)
(t)
g1 (y1 |y2 , . . . , yp ),
(t+1)
, y3 , . . . , yp ),
(t+1)
, . . . , yp1 ).
g2 (y2 |y1
...
gp (yp |y1
(t)
(t+1)
(t)

The Gibbs Sampler
Completion
Example (Mixtures all over again)

Hierarchical missing data structure:
If
k
X
X1 , . . . , Xn
pi f (x|i ),
i=1
then
X|Z f (x|Z ),
Z p1 I(z = 1) + . . . + pk I(z = k),
Z is the component indicator associated with observation x

The Gibbs Sampler
Completion
Example (Mixtures (2))

Conditionally on (Z1 , . . . , Zn ) = (z1 , . . . , zn ) :
(p1 , . . . , pk , 1 , . . . , k |x1 , . . . , xn , z1 , . . . , zn )
p11 +n1 1 . . . pkk +nk 1
(1 |y1 + n1 x
1 , 1 + n1 ) . . . (k |yk + nk x
k , k + nk ),
with
ni =
X
j
I(zj = i)
and
x
i =
j; zj =i
xj /ni .

The Gibbs Sampler
Completion
Algorithm (Mixture Gibbs sampler)

1. Simulate
i (i |yi + ni x
i , i + ni ) (i = 1, . . . , k)
(p1 , . . . , pk ) D(1 + n1 , . . . , k + nk )
2. Simulate (j = 1, . . . , n)
Zj |xj , p1 , . . . , pk , 1 , . . . , k
k
X
i=1
with (i = 1, . . . , k)
pij pi f (xj |i )
and update ni and x
i (i = 1, . . . , k).
pij I(zj = i)

The Gibbs Sampler
Completion
10
5
0
10
15
T = 1000
15
T = 500
-2
-2
10
5
0
10
15
T = 3000
15
T = 2000
-2
-2
10
5
0
10
15
T = 5000
15
T = 4000
-2
-2
Estimation of the pluggin density for 3 components and T

iterations for 149 observations of acidity levels in US lakes

The Gibbs Sampler
0.00
0.05
0.10
0.15
0.20
0.25
Completion
10
15
20
25
30
35
Galaxy dataset (82 observations) with k = 2 components

average density (yellow), and pluggins:
average (tomato), marginal MAP (green), MAP (marroon)

The Gibbs Sampler
Completion
A wee problem
Gibbs started at random

The Gibbs Sampler
Completion
A wee problem
1
1
Gibbs stuck at the wrong mode
Gibbs started at random
1
1

The Gibbs Sampler
Completion
Random Scan Gibbs sampler
back to basics
dont do random
Modification of the above Gibbs sampler where, with probability

1/p, the i-th component is drawn from fi (xi |Xi ), ie when the
components are chosen at random
Motivation
The Random Scan Gibbs sampler is reversible.

The Gibbs Sampler
Completion
Slice sampler as generic Gibbs

If f () can be written as a product
k
Y
i=1
fi (),

The Gibbs Sampler
Completion
Slice sampler as generic Gibbs

If f () can be written as a product
k
Y
fi (),
i=1
it can be completed as
k
Y
i=1
I0i fi () ,
leading to the following Gibbs algorithm:

The Gibbs Sampler
Completion
Algorithm (Slice sampler)

Simulate
(t+1)
1. 1
U[0,f1 ((t) )] ;
...
k.
k+1.
(t+1)
k
U[0,fk ((t) )] ;
(t+1) UA(t+1) , with
(t+1)
A(t+1) = {y; fi (y) i
, i = 1, . . . , k}.

The Gibbs Sampler
Completion
0.000
0.002
0.004
0.006
0.008
0.010
Example of results with a truncated N(3, 1) distribution
0.0
0.2
0.4
0.6
0.8
Number of Iterations 2
1.0

The Gibbs Sampler
Completion
0.000
0.002
0.004
0.006
0.008
0.010
0.0
0.2
0.4
0.6
0.8
Number of Iterations 2, 3
1.0

The Gibbs Sampler
Completion
0.000
0.002
0.004
0.006
0.008
0.010
0.0
0.2
0.4
0.6
0.8
1.0
Number of Iterations 2, 3, 4

The Gibbs Sampler
Completion
0.000
0.002
0.004
0.006
0.008
0.010
0.0
0.2
0.4
0.6
0.8
1.0
Number of Iterations 2, 3, 4, 5

The Gibbs Sampler
Completion
0.000
0.002
0.004
0.006
0.008
0.010
0.0
0.2
0.4
0.6
0.8
1.0
Number of Iterations 2, 3, 4, 5, 10

The Gibbs Sampler
Completion
0.000
0.002
0.004
0.006
0.008
0.010
0.0
0.2
0.4
0.6
0.8
1.0
Number of Iterations 2, 3, 4, 5, 10, 50

The Gibbs Sampler
Completion
0.000
0.002
0.004
0.006
0.008
0.010
0.0
0.2
0.4
0.6
0.8
1.0
Number of Iterations 2, 3, 4, 5, 10, 50, 100

The Gibbs Sampler
Completion
Good slices
The slice sampler usually enjoys good theoretical properties (like

geometric ergodicity and even uniform ergodicity under bounded f
and bounded X ).
As k increases, the determination of the set A(t+1) may get
increasingly complex.

The Gibbs Sampler
Completion
Example (Stochastic volatility core distribution)

Difficult part of the stochastic volatility model

(x) exp 2 (x )2 + 2 exp(x)y 2 + x /2 ,

simplified in exp x2 + exp(x)

The Gibbs Sampler
Completion
Example (Stochastic volatility core distribution)

Difficult part of the stochastic volatility model

(x) exp 2 (x )2 + 2 exp(x)y 2 + x /2 ,

simplified in exp x2 + exp(x)
Slice sampling means simulation from a uniform distribution on

A = x; exp x2 + exp(x) /2 u

= x; x2 + exp(x)
if we set = 2 log u.
Note Inversion of x2 + exp(x) = needs to be done by
trial-and-error.

The Gibbs Sampler
Completion
1
Density
0.8
0.6
0.4
0.2
0
1
0.5
0.5
1.5
2.5
3.5
90
100
Correlation
0.1
0.05
0
0.05
0.1
10
20
30
40
50
Lag
60
70
80
Histogram of a Markov chain produced by a slice sampler

and target distribution in overlay.

The Gibbs Sampler
Convergence
Properties of the Gibbs sampler

Theorem (Convergence)
For
(Y1 , Y2 , , Yp ) g(y1 , . . . , yp ),
if either
[Positivity condition]
(i) g (i) (yi ) > 0 for every i = 1, , p, implies that
g(y1 , . . . , yp ) > 0, where g (i) denotes the marginal distribution
of Yi , or
(ii) the transition kernel is absolutely continuous with respect to g,
then the chain is irreducible and positive Harris recurrent.

The Gibbs Sampler
Convergence
Properties of the Gibbs sampler (2)

Consequences
(i) If
h(y)g(y)dy < , then

lim
nT
Z
T
1X
(t)
h1 (Y ) = h(y)g(y)dy a.e. g.
T
t=1
(ii) If, in addition, (Y (t) ) is aperiodic, then

Z

n
lim
K (y, )(dx) f
n
=0
TV

The Gibbs Sampler
Convergence
Slice sampler
fast on that slice
For convergence, the properties of Xt and of f (Xt ) are identical
Theorem (Uniform ergodicity)

If f is bounded and suppf is bounded, the simple slice sampler is
uniformly ergodic.
[Mira & Tierney, 1997]

The Gibbs Sampler
Convergence
A small set for a slice sampler

no slice detail
For
?
> ? ,
C = {x X ; ? < f (x) < ? }
is a small set:
Pr(x, )
where
1
(A) =
?
if L() = {x X ; f (x) > }
?
0
?
()
?
(A L())
d
(L())

The Gibbs Sampler
Convergence
Slice sampler: drift
Under differentiability and monotonicity conditions, the slice

sampler also verifies a drift condition with V (x) = f (x) , is
geometrically ergodic, and there even exist explicit bounds on the
total variation distance

The Gibbs Sampler
Convergence
Slice sampler: drift
Under differentiability and monotonicity conditions, the slice

sampler also verifies a drift condition with V (x) = f (x) , is
geometrically ergodic, and there even exist explicit bounds on the
total variation distance
Example (Exponential Exp(1))

For n > 23,
||K n (x, ) f ()||T V .054865 (0.985015)n (n 15.7043)

The Gibbs Sampler
Convergence
Slice sampler: convergence

no more slice detail
Theorem
For any density such that

({x X ; f (x) > })

is non-increasing
then
||K 523 (x, ) f ()||T V .0095

The Gibbs Sampler
Convergence
A poor slice sampler

Example
Consider
400
600
800
1000
10
400
600
800
1000
correlation
0.0 0.2 0.4 0.6 0.8 1.0
30
25
20
200
200
400
600
800
1000
correlation
0.0 0.2 0.4 0.6 0.8 1.0
60
40
20
10
10
Poor performances when d large

(heavy tails)
200
400
600
800
1000
correlation
0.0 0.2 0.4 0.6 0.8 1.0
u>0
40
20
30
40
20
30
40
30
40
100 dimensional acf
100 200 300 400
100 dimensional run
1/d
30
20 dimensional acf
z>0
20
10 dimensional acf
20 dimensional run
or on
(u) = eu
10 dimensional run
(z) = z d1 ez
200
correlation
0.0 0.2 0.4 0.6 0.8 1.0
0
-1
-2
15
Slice sampler equivalent to

one-dimensional slice sampler on
1 dimensional acf
10
x Rd
1 dimensional run
f (x) = exp {||x||}
10
20
Sample runs of log(u) and

ACFs for log(u) (Roberts
& Rosenthal, 1999)

The Gibbs Sampler
Hammersley-Clifford theorem
An illustration that conditionals determine the joint distribution
Theorem
If the joint density g(y1 , y2 ) have conditional distributions
g1 (y1 |y2 ) and g2 (y2 |y1 ), then
g(y1 , y2 ) = R
g2 (y2 |y1 )
.
g2 (v|y1 )/g1 (y1 |v) dv
[Hammersley & Clifford, circa 1970]

The Gibbs Sampler
General HC decomposition
Under the positivity condition, the joint distribution g satisfies

g(y1 , . . . , yp )
p
Y
g`j (y`j |y`1 , . . . , y`j1 , y`0 j+1 , . . . , y`0 p )
j=1
g`j (y`0 j |y`1 , . . . , y`j1 , y`0 j+1 , . . . , y`0 p )
for every permutation ` on {1, 2, . . . , p} and every y 0 Y .

The Gibbs Sampler
Hierarchical models
Hierarchical models
no hierarchy
The Gibbs sampler is particularly well suited to hierarchical models
Example (Animal epidemiology)

Counts of the number of cases of clinical mastitis in 127 dairy
cattle herds over a one year period
Number of cases in herd i
Xi P(i )
i = 1, , m
where i is the underlying rate of infection in herd i

Lack of independence might manifest itself as overdispersion.

The Gibbs Sampler
Hierarchical models
Example (Animal epidemiology (2))

Modified model
Xi P(i )
i G a(, i )
i I G (a, b),

The Gibbs Sampler
Hierarchical models
Example (Animal epidemiology (2))

Modified model
Xi P(i )
i G a(, i )
i I G (a, b),
The Gibbs sampler corresponds to conditionals

i (i |x, , i ) = G a(xi + , [1 + 1/i ]1 )
i (i |x, , a, b, i ) = I G ( + a, [i + 1/b]1 )

The Gibbs Sampler
Hierarchical models
if you hate rats
Example (Rats)
Experiment where rats are intoxicated by a substance, then treated
by either a placebo or a drug:
xij N (i , c2 ),
yij N (i + i , a2 ),
zij N (i + i + i , t2 ),
control
1 j Jic ,
a
1 j Ji , intoxication
1 j Jit ,
treatment
Additional variable wi , equal to 1 if the rat is treated with the

drug, and 0 otherwise.

The Gibbs Sampler
Hierarchical models
Example (Rats (2))

Prior distributions (1 i I),
i N ( , 2 ),
i N ( , 2 ),
and
i N (P , P2 )
or
2
i N (D , D
),
if ith rat treated with a placebo (P) or a drug (D)

The Gibbs Sampler
Hierarchical models
Example (Rats (2))

Prior distributions (1 i I),
i N ( , 2 ),
i N ( , 2 ),
and
i N (P , P2 )
or
2
i N (D , D
),
if ith rat treated with a placebo (P) or a drug (D)

Hyperparameters of the model,
, , P , D , c , a , t , , , P , D ,
associated with Jeffreys noninformative priors.
Alternative prior with two possible levels of intoxication
2
2
i pN (1 , 1
) + (1 p)N (2 , 2
),

The Gibbs Sampler
Hierarchical models
Conditional decompositions
Easy decomposition of the posterior distribution

For instance, if
|1 1 (|1 ),
then
(|x) =
1 2 (1 ),
(|1 , x)(1 |x) d1 ,

The Gibbs Sampler
Hierarchical models
Conditional decompositions (2)

where
f (x|)1 (|1 )
,
m1 (x|1 )
Z
m1 (x|1 ) =
f (x|)1 (|1 ) d,
(|1 , x) =
m1 (x|1 )2 (1 )
,
(1 |x) =
m(x)
Z
m(x) =
m1 (x|1 )2 (1 ) d1 .
1

The Gibbs Sampler
Hierarchical models
Conditional decompositions (3)
Moreover, this decomposition works for the posterior moments,

that is, for every function h,
E [h()|x] = E(1 |x) [E1 [h()|1 , x]] ,
where
1
E [h()|1 , x] =
h()(|1 , x) d.

The Gibbs Sampler
Hierarchical models
Example (Rats inc., continued
if you still hate rats
Posterior complete distribution given by

((i , i , i )i , , . . . , c , . . . |D)
I
Y

i=1
exp {(i )2 /22 + (i )2 /22 }
Ji
Y
j=1
Jit
j=1
exp {(xij i )
/2c2 }
Ji
Y
j=1
exp {(zij i i i )
exp {(i P )2 /2P2 }
exp {(yij i i )2 /2a2 }
/2t2 }
2
exp {(i D )2 /2D
}
ì =0
ì =1
P
P
P
i Jic 1 i Jia 1 i Jit 1
ID 1 IP 1
( )I1 D
P
t
a
c

The Gibbs Sampler
Hierarchical models
Local conditioning property
For the hierarchical model

Z
() =
1 (|1 )2 (1 |2 ) n+1 (n ) d1 dn+1 .
1 ...n
we have
(i |x, , 1 , . . . , n ) = (i |i1 , i+1 )
with the convention 0 = and n+1 = 0.

The Gibbs Sampler
Hierarchical models
Example (Rats inc., terminated
still this zemmiphobia?!
2.0
1.8
1.7
Convergence of the posterior means
0 20 40 60 80100
-4.0 -3.5 -3.0 -2.5 -2.0 -1.5

intoxication
50 100150200250
2000 4000 6000 8000 10000
1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4

control
2.1
2000 4000 6000 8000 10000
1.9
0.60
0.50
0.40
0 20 40 60 80 100120
-2.80
2000 4000 6000 8000 10000
-2.90
0 50 100150200250300
-2.60
-2.70
1.90
1.80
1.70
1.60
2000 4000 6000 8000 10000
0.70
140
The full conditional distributions correspond to standard

distributions and Gibbs sampling applies.
-1
1
placebo
drug
Posteriors of the effects

The Gibbs Sampler
Hierarchical models
Posterior Gibbs inference
Probability
Confidence
1.00
[-3.48,-2.17]
D
0.9998
[0.94,2.50]
P
0.94
[-0.17,1.24]
D P
0.985
[0.14,2.20]
Posterior probabilities of significant effects

The Gibbs Sampler
Data Augmentation
Data Augmentation
The Gibbs sampler with only two steps is particularly useful
Algorithm (Data Augmentation)

Given y (t) ,
(t+1)
1.. Simulate Y1
(t+1)
2.. Simulate Y2
(t)
g1 (y1 |y2 ) ;
(t+1)
g2 (y2 |y1
).

The Gibbs Sampler
Data Augmentation
Data Augmentation
The Gibbs sampler with only two steps is particularly useful
Algorithm (Data Augmentation)

Given y (t) ,
(t+1)
1.. Simulate Y1
(t+1)
2.. Simulate Y2
(t)
g1 (y1 |y2 ) ;
(t+1)
g2 (y2 |y1
).
Theorem (Markov property)

(t)
(t)
Both (Y1 ) and (Y2 ) are Markov chains, with transitions

Z
Ki (x, x ) = gi (y|x)g3i (x |y) dy,

The Gibbs Sampler
Data Augmentation
Example (Grouped counting data)

360 consecutive records of the number of passages per unit time
Number of
passages
0
1
2
3 4 or more
Number of
observations 139 128 55 25
13

The Gibbs Sampler
Data Augmentation
Example (Grouped counting data (2))

Feature Observations with 4 passages and more are grouped
If observations are Poisson P(), the likelihood is
`(|x1 , . . . , x5 )
e
347 128+552+253
which can be difficult to work with.
1e
3
X
i
i!
i=0
!13

The Gibbs Sampler
Data Augmentation
Example (Grouped counting data (2))

Feature Observations with 4 passages and more are grouped
If observations are Poisson P(), the likelihood is
`(|x1 , . . . , x5 )
e
347 128+552+253
1e
3
X
i
i!
i=0
!13
which can be difficult to work with.

Idea With a prior () = 1/, complete the vector (y1 , . . . , y13 ) of
the 13 units larger than 4

The Gibbs Sampler
Data Augmentation
Algorithm (Poisson-Gamma Gibbs)

(t)
a Simulate Yi
b Simulate
P((t1) ) Iy4
(t)
Ga 313 +
i = 1, . . . , 13
13
X
i=1
(t)
yi ,
360

The Gibbs Sampler
Data Augmentation
Algorithm (Poisson-Gamma Gibbs)

(t)
a Simulate Yi
b Simulate
P((t1) ) Iy4
Ga 313 +
13
X
(t)
yi ,
i=1
360
40
1.025
(t)
i = 1, . . . , 13
t=1
converges quite rapidly
10
0
1.023
20
1.024
i=1
(t)
yi
0.9
1.0
lambda
1.1
1.2
1.022
313 +
13
X
to R&B
1.021
T
1 X
=
360T
30
The Bayes estimator
100
200
300
400
500

The Gibbs Sampler
Data Augmentation
Rao-Blackwellization
If (y1 , y2 , . . . , yp )(t) , t = 1, 2, . . . T is the output from a Gibbs
sampler
Z
T
1 X (t)
h y1 h(y1 )g(y1 )dy1
0 =
T
t=1
and is unbiased.

The Gibbs Sampler
Data Augmentation
Rao-Blackwellization
If (y1 , y2 , . . . , yp )(t) , t = 1, 2, . . . T is the output from a Gibbs
sampler
Z
T
1 X (t)
h y1 h(y1 )g(y1 )dy1
0 =
T
t=1
and is unbiased.
The Rao-Blackwellization replaces 0 with its conditional
expectation
rb
T
i
1 X h
(t)
(t)
E h(Y1 )|y2 , . . . , yp .
=
T
t=1

The Gibbs Sampler
Data Augmentation
Rao-Blackwellization (2)
Then
Both estimators converge to E[h(Y1 )]
Both are unbiased,

The Gibbs Sampler
Data Augmentation
Rao-Blackwellization (2)
Then
Both estimators converge to E[h(Y1 )]
Both are unbiased,
and
i
h
(t)
var E h(Y1 )|Y2 , . . . , Yp(t) var(h(Y1 )),
so rb is uniformly better (for Data Augmentation)

The Gibbs Sampler
Data Augmentation
Examples of Rao-Blackwellization
Example
Bivariate normal Gibbs sampler
X | y N (y, 1 2 )
Y | x N (x, 1 2 ).
Then
T
1 X (i)
0 =
X
T
i=1
and
T
T
1X
1 X (i)
(i)
(i)
1 =
E[X |Y ] =
%Y ,
T
T
estimate E[X] and 20 /21 =
i=1
1
2
> 1.
i=1

The Gibbs Sampler
Data Augmentation
Examples of Rao-Blackwellization (2)

Example (Poisson-Gamma Gibbs contd)
Nave estimate
0 =
T
1 X (t)
T
t=1
and Rao-Blackwellized version
T
1 X
(i) (i)
(i)
E[(t) |x1 , x2 , . . . , x5 , y1 , y2 , . . . , y13 ]
T
t=1
!
T
13
X
1 X
(t)
313 +
yi
,
360T
t=1
back to graph
i=1

The Gibbs Sampler
Data Augmentation
NP Rao-Blackwellization & Rao-Blackwellized NP
Another substantial benefit of Rao-Blackwellization is in the

approximation of densities of different components of y without
nonparametric density estimation methods.

The Gibbs Sampler
Data Augmentation
NP Rao-Blackwellization & Rao-Blackwellized NP
Another substantial benefit of Rao-Blackwellization is in the

approximation of densities of different components of y without
nonparametric density estimation methods.
The estimator
T
1X
(t)
gi (yi |yj , j 6= i) gi (yi ),
T
t=1
is unbiased.

The Gibbs Sampler
Data Augmentation
The Duality Principle

skip dual part
Ties together the properties of the two Markov chains in Data

Augmentation
Consider a Markov chain (X (t) ) and a sequence (Y (t) ) of random
variables generated from the conditional distributions
X (t) |y (t) (x|y (t) )
Y (t+1) |x(t) , y (t) f (y|x(t) , y (t) ) .

The Gibbs Sampler
Data Augmentation
The Duality Principle

skip dual part
Ties together the properties of the two Markov chains in Data

Augmentation
Consider a Markov chain (X (t) ) and a sequence (Y (t) ) of random
variables generated from the conditional distributions
X (t) |y (t) (x|y (t) )
Y (t+1) |x(t) , y (t) f (y|x(t) , y (t) ) .
Theorem (Duality properties)

If the chain (Y (t) ) is ergodic then so is (X (t) ) and the duality also
holds for geometric or uniform ergodicity.
Note
The chain (Y (t) ) can be discrete, and the chain (X (t) ) continuous.

The Gibbs Sampler
Improper Priors
Improper Priors
Unsuspected danger resulting from careless use of MCMC

algorithms:

The Gibbs Sampler
Improper Priors
Improper Priors

algorithms:
It may happen that
all conditional distributions are well defined,
all conditional distributions may be simulated from, but...

The Gibbs Sampler
Improper Priors
Improper Priors

algorithms:
It may happen that
the system of conditional distributions may not correspond to

any joint distribution

The Gibbs Sampler
Improper Priors
Improper Priors

algorithms:
It may happen that
the system of conditional distributions may not correspond to

any joint distribution
Warning The problem is due to careless use of the Gibbs sampler
in a situation for which the underlying assumptions are violated

The Gibbs Sampler
Improper Priors
Example (Conditional exponential distributions)

For the model
X1 |x2 E xp(x2 ) ,
X2 |x1 E xp(x1 )
the only candidate f (x1 , x2 ) for the joint density is

f (x1 , x2 ) exp(x1 x2 ),
but
f (x1 , x2 )dx1 dx2 =
c These conditionals do not correspond to a joint

probability distribution

The Gibbs Sampler
Improper Priors
Example (Improper random effects)

Consider
Yij = + i + ij ,
i = 1, . . . , I, j = 1, . . . , J,
where
i N (0, 2 ) and ij N (0, 2 ),
the Jeffreys (improper) prior for the parameters , and is
(, 2 , 2 ) =
1
.
2 2

The Gibbs Sampler
Improper Priors
Example (Improper random effects 2)

The conditional distributions
2
i |y, , ,
|, y, 2 , 2
2 |, , y, 2
2 |, , y, 2

J(
yi )
2
2 1
,
, (J + )
N
J + 2 2
N (
y
, 2 /JI) ,
!
X
IG I/2, (1/2)
i2 ,
i
X
IG IJ/2, (1/2)
(yij i )2 ,

i,j
are well-defined and a Gibbs sampler can be easily implemented in

this setting.

The Gibbs Sampler
-8
-6
10
freq.
15
-4
observations
20
-2
25
30
Improper Priors
-4
-3
-2
(1000 iterations)
-1
Example (Improper random

effects 2)
The figure shows the sequence of
(t) s and its histogram over
1, 000 iterations. They both fail
to indicate that the
corresponding joint distribution
does not exist

The Gibbs Sampler
Improper Priors
Final notes on impropriety
The improper posterior Markov chain

cannot be positive recurrent

The Gibbs Sampler
Improper Priors

The major task in such settings is to find indicators that flag that
something is wrong. However, the output of an improper Gibbs
sampler may not differ from a positive recurrent Markov chain.

The Gibbs Sampler
Improper Priors

The major task in such settings is to find indicators that flag that
something is wrong. However, the output of an improper Gibbs
sampler may not differ from a positive recurrent Markov chain.
Example
The random effects model was initially treated in Gelfand et al.
(1990) as a legitimate model


Introduction
Greens method
Birth and Death processes

Introduction
A new brand of problems
There exist setups where

One of the things we do not know is the number
of things we do not know
[Peter Green]

Introduction
Bayesian Model Choice
Typical in model choice settings

- model construction (nonparametrics)
- model checking (goodness of fit)
- model improvement (expansion)
- model prunning (contraction)
- model comparison
- hypothesis testing (Science)
- prediction (finance)

Introduction
Bayesian Model Choice II
Many areas of application

I
variable selection
change point(s) determination
image analysis
graphical models and expert systems
variable dimension models
causal inference

Introduction
Example (Mixture again, yes!)
0.0
0.5
1.0
1.5
2.0
Benchmark dataset: Speed of galaxies

[Roeder, 1990; Richardson & Green, 1997]
1.0
1.5
2.0
2.5
velocities
3.0
3.5

Introduction
Example (Mixture again (2))

Modelling by a mixture model
Mi : xj
i
X
`=1
2
pì N (ì , ì
)
i?
(j = 1, . . . , 82)

Introduction
Bayesian variable dimension model

Definition
A variable dimension model is defined as a collection of models
(k = 1. . . . , K),
Mk = {f (|k ); k k } ,
associated with a collection of priors on the parameters of these
models,
k (k ) ,
and a prior distribution on the indices of these models,
{%(k) , k = 1, . . . , K} .
Alternative notation:
(Mk , k ) = %(k) k (k )

Introduction
Bayesian solution
Formally over:
1. Compute
pi
p(Mi |x) = X
pj
fi (x|i )i (i )di
i
fj (x|j )j (j )dj
j
2. Take largest p(Mi |x) to determine model, or use

X Z
fj (x|j )j (j )dj
pj
j
as predictive
[Different decision theoretic perspectives]

Introduction
Difficulties
Not at
I
(formal) inference level
parameter space representation

M
k ,
=
[see above]
[even if there are parameters common to several models]

Introduction
Difficulties
Not at
I
(formal) inference level
parameter space representation

M
k ,
=
[see above]
[even if there are parameters common to several models]

Rather at
I
(practical) inference level:

model separation, interpretation, overfitting, prior modelling,
prior coherence
computational level:
infinity of models, moves between models, predictive
computation

Greens method
Greens resolution
Setting up a proper measuretheoretic framework for designing

moves between models Mk
[Green, 1995]

Greens method
Greens resolution
Setting up a proper measuretheoretic framework for designing

moves between models Mk
[Green, 1995]
S
Create a reversible kernel K on H = k {k} k such that
Z Z
Z Z
K(x, dy)(x)dx =
K(y, dx)(y)dy
A
for the invariant density [x is of the form (k, (k) )]

Greens method
Greens resolution (2)

Write K as
K(x, B) =
Z
X
m (x, y)qm (x, dy) + (x)IB (x)
m=1
where qm (x, dy) is a transition measure to model Mm and

m (x, y) the corresponding acceptance probability.

Greens method

Write K as
K(x, B) =
Z
X
m=1

Introduce a symmetric measure m (dx, dy) on H2 and impose on
(dx)qm (x, dy) to be absolutely continuous wrt m ,
(dx)qm (x, dy)
= gm (x, y)
m (dx, dy)

Greens method

Write K as
K(x, B) =
Z
X
m=1

Introduce a symmetric measure m (dx, dy) on H2 and impose on
(dx)qm (x, dy) to be absolutely continuous wrt m ,
(dx)qm (x, dy)
= gm (x, y)
m (dx, dy)
Then
gm (y, x)
m (x, y) = min 1,
gm (x, y)
ensures reversibility

Greens method
Special case
When contemplating a move between two models, M1 and M2 ,
the Markov chain being in state 1 M1 , denote by K12 (1 , d)
and K21 (2 , d) the corresponding kernels, under the detailed
balance condition
(d1 ) K12 (1 , d) = (d2 ) K21 (2 , d) ,
and take, wlog, dim(M2 ) > dim(M1 ).

Greens method
Special case
When contemplating a move between two models, M1 and M2 ,
the Markov chain being in state 1 M1 , denote by K12 (1 , d)
and K21 (2 , d) the corresponding kernels, under the detailed
balance condition
(d1 ) K12 (1 , d) = (d2 ) K21 (2 , d) ,
and take, wlog, dim(M2 ) > dim(M1 ).
Proposal expressed as
2 = 12 (1 , v12 )
where v12 is a random variable of dimension
dim(M2 ) dim(M1 ), generated as
v12 12 (v12 ) .

Greens method
Special case (2)

In this case, q12 (1 , d2 ) has density

12 (1 , v12 ) 1
,

12 (v12 )
(1 , v12 )
by the Jacobian rule.

If probability $12 of choosing move to M2 while in M1 ,
acceptance probability reduces to

12 (1 , v12 )
(M2 , 2 ) $21

.
(1 , v12 ) = 1
(M1 , 1 ) $12 12 (v12 ) (1 , v12 )

Greens method
Interpretation (1)
The representation puts us back in a fixed dimension setting:

I
M1 V12 and M2 are in one-to-one relation

Greens method
Interpretation (1)
The representation puts us back in a fixed dimension setting:

I
I
M1 V12 and M2 are in one-to-one relation
regular MetropolisHastings move from the couple (1 , v12 )

to 2 when stationary distributions are
(M1 , 1 ) 12 (v12 )
and (M2 , 2 ), and when proposal distribution is deterministic
(??)

Greens method
Interpretation (2)
Consider, instead, the proposals
2 N (12 (1 , v12 ), )
and
12 (1 , v12 ) N (2 , )

Greens method
Interpretation (2)
2 N (12 (1 , v12 ), )
and
12 (1 , v12 ) N (2 , )
Reciprocal proposal has density

12 (1 , v12 )
exp (2 12 (1 , v12 ))2 /2

(1 , v12 )
2

Greens method
Interpretation (2)
2 N (12 (1 , v12 ), )
and
12 (1 , v12 ) N (2 , )
Reciprocal proposal has density

12 (1 , v12 )
exp (2 12 (1 , v12 ))2 /2

(1 , v12 )
2
Thus MetropolisHastings acceptance probability is

12 (1 , v12 )
(M2 , 2 )

1
(M1 , 1 ) 12 (v12 ) (1 , v12 )
Does not depend on : Let go to 0

Greens method
Saturation
[Brooks, Giudici, Roberts, 2003]
Consider series of models Mi (i = 1, . . . , k) such that
max dim(Mi ) = nmax <
i
Parameter of model Mi then completed with an auxiliary variable

Ui such that
dim(i , ui ) = nmax
and
Ui qi (ui )
Posit the following joint distribution for [augmented] model Mi

(Mi , i ) qi (ui )

Greens method
Back to fixed dimension

Saturation: no varying dimension anymore since (i , ui ) of fixed
dimension.

Greens method
Back to fixed dimension

Saturation: no varying dimension anymore since (i , ui ) of fixed
dimension.
Algorithm (Three stage MCMC update)

1. Update the current value of the parameter, i ;
2. Update ui conditional on i ;
3. Update the current model from Mi to Mj using the bijection
(j , uj ) = ij (i , ui )

Greens method
Example (Mixture of normal distributions)

Mk :
k
X
j=1
2
pjk N (jk , jk
)
[Richardson & Green, 1997]

Moves:

Greens method
Example (Mixture of normal distributions)

Mk :
k
X
j=1
2
pjk N (jk , jk
)
[Richardson & Green, 1997]

Moves:
(i) Split
= pj(k+1) + p(j+1)(k+1)
pjk
pjk jk = pj(k+1) j(k+1) + p(j+1)(k+1) (j+1)(k+1)
2
2
2
= pj(k+1) j(k+1)
+ p(j+1)(k+1) (j+1)(k+1)
pjk jk
(ii) Merge
(reverse)

Greens method
Example (Mixture (2))

Additional Birth and Death moves for empty components
(created from the prior distribution)
Equivalent
(i). Split
u1 , u2 , u3
p
j(k+1)
(T )
j(k+1)
2
j(k+1)
=
=
=
U(0, 1)
u1 pjk
u2 jk
2
u3 jk

Greens method
Histogram of k
2.5
0.0
0.1
0.2
3.0
0.3
0.4
Normalised enzyme dataset
2.0
1.0
2e+04
4e+04
6e+04
8e+04
Histogram and rawplot of

100, 000 ks under the
constraint k 5.
1e+05
0.5
0e+00
0.0
1.5
Rawplot of k
0.0
0.5
1.0
1.5
2.0
2.5
3.0

Greens method
Example (Hidden Markov model)

move to birth
Extension of the mixture model

P (Xt + 1 = j|Xt = i) = wij ,
wij = ij /
i` ,
Yt |Xt = i N (i , i2 ).

Greens method

6
6

Yt
...
Yt+1
- ...

Xt
Xt+1

Greens method
Example (Hidden Markov model (2))

Move to split component j? into j1 and j2 :
ij1 = ij? i ,
ij2 = ij? (1 i ),
j1 j = j? j j ,
j2 j = j? j /j ,
i U(0, 1);
j log N (0, 1);
similar ideas give j1 j2 etc.;

j1 = j? 3j? ,
j21 = j2 ,
j2 = j? + 3j? ,
j22 = j2 / ,
N (0, 1);
log N (0, 1).

[Robert & al., 2000]

Greens method
5
4
3
2
1
10000
20000
30000
40000
50000
100000
150000
200000
5000
10000
15000
20000
0.5
0
0.015
0.01
0.005
0
Upper panel: First 40,000 values of k for S&P 500 data, plotted
every 20th sweep. Middle panel: estimated posterior distribution
of k for S&P 500 data as a function of number of sweeps. Lower
panel: 1 and 2 in first 20,000 sweeps with k = 2 for S&P 500
data.

Greens method
Example (Autoregressive model)

move to birth
Typical setting for model choice: determine order p of AR(p)

model

Greens method
Example (Autoregressive model)

move to birth
Typical setting for model choice: determine order p of AR(p)

model
Consider the (less standard) representation
p
Y
i=1
(1 i B) Xt = t ,
t N (0, 2 )
where the i s are within the unit circle if complex and within
[1, 1] if real.
[Huerta and West, 1998]

Greens method
AR(p) reversible jump algorithm

Example (Autoregressive (2))
Uniform priors for the real and complex roots j ,
Y 1
Y 1
1
I|i |<1
I
bk/2c + 1
2
|i |<1
i R
i 6R
and (purely birth-and-death) proposals based on these priors

I
I
I
I
k k+1
k k+2
k k-1
k k-2
[Creation of real root]

[Creation of complex root]
[Deletion of real root]
[Deletion of complex root]

200
300
400
500
0 1 2 3 4 5
0
10
15
20
0.4
p=5
0.2
0.0
0.2
0.4
0.2
0.0
0.2
0.4
0.256
0.2
0.0
0.2
0.4
0.6
Im(i)
0.4
1.2
0.4
0.412
0.0 0.2 0.4
8
6
4
2
1.1
2
0.4
means
0.0817
1.0
0.2
0 2 4 6 8 10
0.4
p=3
mean 0.997
0.9
0.0
p=4
1000
2000
3000
Iterations
4000
5000
0.2
0.4
0.2
p=3
0 1 2 3 4 5 6 7
100
0 1 2 3 4 5
0.6 0.2
xt
0.0 0.1 0.2 0.3 0.4 0.5
Greens method
0.1
0.0
0.1
0.2
0.3
0.4
0.5
Re(i)
Reversible jump algorithm based on an AR(3) simulated dataset of

530 points (upper left) with true parameters i (0.1, 0.3, 0.4)
and = 1. First histogram associated with p, the following
histograms with the i s, for different values of p, and of 2 . Final
graph: scatterplot of the complex roots. One before last:
evolution of 1 , 2 , 3 .

instant death!
Use of an alternative methodology based on a Birth&-Death

(point) process
[Preston, 1976; Ripley, 1977; Geyer & Mller, 1994; Stevens, 1999]

instant death!
Use of an alternative methodology based on a Birth&-Death

(point) process
[Preston, 1976; Ripley, 1977; Geyer & Mller, 1994; Stevens, 1999]
Idea: Create a Markov chain in continuous time, i.e. a Markov
jump process, moving between models Mk , by births (to increase
the dimension), deaths (to decrease the dimension), and other
moves.


Time till next modification (jump) is exponentially distributed
with rate depending on current state
Remember: if 1 , . . . , v are exponentially distributed, i E(i ),
!
X
i
min i E
i


Time till next modification (jump) is exponentially distributed
with rate depending on current state
Remember: if 1 , . . . , v are exponentially distributed, i E(i ),
!
X
i
min i E
i
Difference with MH-MCMC: Whenever a jump occurs, the

corresponding move is always accepted. Acceptance probabilities
replaced with holding times.
Implausible configurations
L()() 1
die quickly.

Balance condition
Sufficient to have detailed balance

L()()q(, 0 ) = L( 0 )( 0 )q( 0 , )
for all , 0
for
() L()() to be stationary.
Here q(, 0 ) rate of moving from state to 0 .
Possibility to add split/merge and fixed-k processes if balance
condition satisfied.

Example (Mixture contd)

Stephens original modelling:
I
Representation as a (marked) point process

= {pj , (j , j )}
j
Birth rate 0 (constant)
Birth proposal from the prior
Death rate j () for removal of point j
Death proposal removes component and modifies weights

Example (Mixture contd (2))

I
Overall death rate

k
X
j () = ()
j=1
Balance condition
(k+1) d({p, (, )}) L({p, (, )}) = 0 L()
with
d( \ {pj , (j , j )}) = j ()
Case of Poisson prior k Poi(1 )
j () =
0 L( \ {pj , (j , j )})
1
L()
(k)
(k + 1)

Stephens original algorithm
Algorithm (Mixture Birth& Death)

For v = 0, 1, , V
tv
Run till t > v + 1
1. Compute j () =
2. ()
k
X
j=1
L(|j ) 0
L() 1
j (j ), 0 + (), u U([0, 1])
3. t t u log(u)

Algorithm (Mixture Birth& Death (contd))

4. With probability ()/
Remove component j with probability j ()/()
k k1
p` p` /(1 pj ) (` 6= j)
Otherwise,
Add component j from the prior (j , j ) pj Be(, k)

p` p` (1 pj ) (` 6= j)
k k+1
5. Run I MCMC(k, , p)

Rescaling time
move to HMM
In discrete-time RJMCMC, let the time unit be 1/N ,
put
k = k /N
k = 1 k /N
and
As N , each birth proposal will be accepted, and having k components births occur according to a
Poisson process with rate k while component (w, ) dies with rate
lim
N k+1
1
k+1
min(A
=
=
lim
, 1)
1
k+1
likelihood ratio
likelihood ratio
k
k+1
k
k+1
b(w, )
(1 w)k1
b(w, )
(1 w)k1
Hence RJMCMCBDMCMC. This holds more generally.

Example (HMM models (contd))

Implementation of the split-and-combine rule of Richardson and
Green (1997) in continuous time

Example (HMM models (contd))

Implementation of the split-and-combine rule of Richardson and
Green (1997) in continuous time
Move to split component j into j1 and j2 :
ij1 = ij i ,
ij2 = ij (1 i ),
j1 j = j j j ,
j2 j = j j /j ,
i U(0, 1);
j log N (0, 1);
similar ideas give j1 j2 etc.;

j1 = j 3j ,
j21 = j2 ,
j2 = j + 3j ,
j22 = j2 / ,
N (0, 1);
log N (0, 1).

[Cappe & al, 2001]

0.0
0.1
0.2
0.3
0.4
0.5
Wind intensity in Athens
Histogram and rawplot of 500 wind intensities in Athens

4
3
1
number of states
0.6
0.4
0.2
0.0
Relative Frequency
Number of states
10
500
1000
1500
2000
2500
1500
2000
2500
1500
2000
2500
instants
temp[, 1]
loglikelihood
1400
1200
1000
800
600
400
1400 1000 600
0.020
0.010
0.000
Relative Frequency
0.030
200
Log likelihood values
200
500
1000
instants
temp[, 2]
30
10
15
20
temp[, 3]
25
30
20
Number of moves
5
5 10
0.00 0.05 0.10 0.15
Relative Frequency
Number of moves
500
1000
instants
MCMC output on k (histogram and rawplot), corresponding

loglikelihood values (histogram and rawplot), and number of
moves (histogram and rawplot)

500
1000
1500
500
1000
1500
0.0 0.2 0.4 0.6 0.8 1.0
MCMC sequence of the probabilities j of the stationary

distribution (top) and the parameters (bottom) of the
three components when conditioning on k = 3

0.0
0.1
0.2
0.3
0.4
0.5
0.6
MCMC evaluation of the marginal density of the dataset

(dashes), compared with R nonparametric density estimate
(solid lines).

basic importance

Adaptive MCMC
Importance sampling revisited
Dynamic extensions
Population Monte Carlo

Adaptive MCMC
Adaptive MCMC is not possible
Algorithms trained on-line usually invalid:

Adaptive MCMC
Adaptive MCMC is not possible
Algorithms trained on-line usually invalid:

using the whole past of the chain implies that this is not a
Markov chain any longer!

Adaptive MCMC
Example (Poly t distribution)

Consider a t-distribution T (3, , 1) sample (x1 , . . . , xn ) with a flat
prior () = 1
If we try fit a normal proposal from empirical mean and variance of
the chain so far,
t
1 X (i)
t =
t
i=1
and
t2
t
1 X (i)
=
( t )2 ,
t
i=1

Adaptive MCMC
Example (Poly t distribution)

Consider a t-distribution T (3, , 1) sample (x1 , . . . , xn ) with a flat
prior () = 1
If we try fit a normal proposal from empirical mean and variance of
the chain so far,
t
1 X (i)
t =
t
i=1
and
t2
t
1 X (i)
=
( t )2 ,
t
i=1
MetropolisHastings algorithm with acceptance probability

#(+1)/2
"
n
Y
+ (xj (t) )2
exp (t (t) )2 /2t2
,
+ (xj )2
exp (t )2 /2t2
j=2
where N (t , t2 ).

Adaptive MCMC
Example (Poly t distribution (2))

Invalid scheme:
I
when range of initial values too small, the (i) s cannot

converge to the target distribution and concentrates on too
small a support.
long-range dependence on past values modifies the

distribution of the sequence.
using past simulations to create a non-parametric

approximation to the target distribution does not work either

0.4
0.2
0.0
0.2
Adaptive MCMC
1000
2000
3000
4000
5000
1.5
1.0
0.5
0.0
0.5
0.0
0.2
0.4
0.6
1.5 1.0 0.5 0.0 0.5 1.0 1.5
Iterations
1000
2000
3000
4000
5000
Iterations
0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7
2
1
0
2
1000
2000
3000
Iterations
4000
5000
Adaptive scheme for a sample of 10 xj T3 and initial

variances of (top) 0.1, (middle) 0.5, and (bottom) 2.5.

0.0
0.1
0.2
0.3
0.4
0.5
0.6
Adaptive MCMC
Comparison of the distribution of an adaptive scheme sample

of 25, 000 points with initial variance of 2.5 and of the target
distribution.

1.0
0.0
0.5
0.2
0.0
0.4
0.5
0.6
1.0
0.8
1.5
1.0
Adaptive MCMC
10000
30000
Iterations
50000
1.5
0.5
0.5
1.0
1.5
Sample produced by 50, 000 iterations of a nonparametric

adaptive MCMC scheme and comparison of its distribution
with the target distribution.

Adaptive MCMC
Simply forget about it!
Warning:
One should not constantly adapt the proposal on past
performances
Either adaptation ceases after a period of burnin
or the adaptive scheme must be theoretically assessed on its own
right.


Approximation of integrals
I=
back to basic importance
h(x)(x)dx
by unbiased estimators
n
X
= 1
I
%i h(xi )
n
i=1
when
iid
x1 , . . . , xn q(x)
and
def (xi )
%i =
q(xi )

Markov extension
For densities f and g, and importance weight
(x) = f (x)/g(x) ,
for any kernel K(x, x0 ) with stationary distribution f ,
Z
(x) K(x, x0 ) g(x)dx = f (x0 ) .
[McEachern, Clyde, and Liu, 1999]

Markov extension
For densities f and g, and importance weight
(x) = f (x)/g(x) ,
for any kernel K(x, x0 ) with stationary distribution f ,
Z
(x) K(x, x0 ) g(x)dx = f (x0 ) .
[McEachern, Clyde, and Liu, 1999]
Consequence: An importance sample transformed by MCMC
transitions keeps its weights
Unbiasedness preservation:
Z

0
E (X)h(X ) =
(x) h(x0 ) K(x, x0 ) g(x) dx dx0
= Ef [h(X)]

Not so exciting!
The weights do not change!

Not so exciting!
The weights do not change!

If x has small weight
(x) = f (x)/g(x) ,
then
x0 K(x, x0 )
keeps this small weight.

Pros and cons of importance sampling vs. MCMC

I
Production of a sample (IS) vs. of a Markov chain (MCMC)
Dependence on importance function (IS) vs. on previous value

(MCMC)
Unbiasedness (IS) vs. convergence to the true distribution

(MCMC)
Variance control (IS) vs. learning costs (MCMC)
Recycling of past simulations (IS) vs. progressive adaptability

(MCMC)
Processing of moving targets (IS) vs. handling large

dimensional problems (MCMC)
Non-asymptotic validity (IS) vs. difficult asymptotia for

adaptive algorithms (MCMC)

Dynamic extensions
Dynamic importance sampling
Idea
It is possible to generalise importance sampling using random
weights t

Dynamic extensions
Dynamic importance sampling
Idea
It is possible to generalise importance sampling using random
weights t such that
E[t |xt ] = (xt )/g(xt )

Dynamic extensions
(a) Self-regenerative chains

[Sahu & Zhigljavsky, 1998; Gasemyr, 2002]
Proposal
Y p(y) p(y)
and target distribution (y)
(y)
Ratios
(x) = (x)/p(x)
and
Unknown
(x) =
(x)/
p(x)
Known
Acceptance function
(x) =
1
1 +
(x)
>0

Dynamic extensions
Geometric jumps
Theorem
If
Y p(y)
and
W |Y = y G ((y)) ,
then
Xt = = Xt+W 1 = Y 6= Xt+W
defines a Markov chain with stationary distribution

Dynamic extensions
Plusses
I
Valid for any choice of [ small = large variance and large

= slow convergence]
Only depends on current value [Difference with Metropolis]
Random integer weight W [Similarity with Metropolis]
Saves on the rejections: always accept [Difference with

Metropolis]
Introduces geometric noise compared with importance

sampling
2
2
SZ
= 2 IS
+ (1/)2
Can be used with a sequence of proposals pk and constants

k [Adaptativity]

Dynamic extensions
A generalisation
[G
asemyr, 2002]
Proposal density p(y) and probability q(y) of accepting a jump.

Dynamic extensions
A generalisation
[G
asemyr, 2002]
Proposal density p(y) and probability q(y) of accepting a jump.
Algorithm (G
asemyrs dynamic weights)
Generate a sequence of random weights Wn by
1. Generate Yn p(y)
2. Generate Vn B(q(yn ))
3. Generate Sn Geo((yn ))
4. Take Wn = Vn Sn

Dynamic extensions
Validation
direct to PMC
(y) = R
p(y)q(y)
,
p(y)q(y)dy
the chain (Xt ) associated with the sequence (Yn , Wn ) by

Y1 = X1 = = X1+W1 1 , Y2 = X1+W1 =
is a Markov chain with transition
K(x, y) = (x)(y)
which has a point mass at y = x with weight 1 (x).

Dynamic extensions
Ergodicity for G
asemyrs scheme
Necessary and sufficient condition
is stationary for (Xt ) iff
(y) = q(y)/((y)/p(y)) = q(y)/(w(y))
for some constant .

Dynamic extensions
Ergodicity for G
asemyrs scheme
Necessary and sufficient condition
is stationary for (Xt ) iff
(y) = q(y)/((y)/p(y)) = q(y)/(w(y))
for some constant .
Implies that
E[W n |Y n = y] = w(y) .
[Average importance sampling]
Special case: (y) = 1/(1 + w(y)) of Sahu and Zhigljavski (2001)

Dynamic extensions
Properties
Constraint on : for (y) 1, must be such that
p(y)q(y)
(y)
Reverse of accept-reject conditions (!)
Variance of
X
X
Wn h(Yn )/
Wn
n
is
(h(y) )2
w(y)(y)dy (1/)2 ,
q(y)
by Cramer-Wold/Slutsky
Still worse than importance sampling.
(4)

Dynamic extensions
(b) Dynamic weighting

[Wong & Liang, 1997; Liu, Liang & Wong, 2001; Liang, 2002]
direct to PMC
Generalisation of the above: simultaneous generation of points

and weights, (t , t ), under the constraint
E[t |t ] (t )
Same use as importance sampling weights
(5)

Dynamic extensions
Algorithm (Liangs dynamic importance sampling)

1. Generate y K(x, y) and compute
%=
(y)K(y, x)
(x)K(x, y)
2. Generate u U(0, 1) and take

(
(y, (1 + )%/a)
(x0 , 0 ) =
(x, (1 + )/(1 a)
if u < a
otherwise
where a = %/(% + ), = (x, ), and > 0 constant or

independent rv

Dynamic extensions
Preservation of the equilibrium equation

If g and g+ denote the distributions of the augmented variable
(X, W ) before the step and after the step, respectively, then
Z
0 g+ (x0 , 0 ) d 0 =
%(, x, x0 )
(1 + ) [%(, x, x0 ) + ] g (x, ) K(x, x0 )
dx d
%(, x, x0 ) +
Z
(%(, x0 , z) + )
+ (1 + )
g (x0 , ) K(x, z)
dz d
%(, x0 , z) +
Z
(x0 )K(x0 , x)
= (1 + )
g (x, )
dx d
(x)

Z
+ g (x0 , ) K(x0 , z) dz d

Z
= (1 + ) (x0 ) c0 K(x0 , x) dx + c0 (x0 )
= 2(1 + )c0 (x0 ) ,

Dynamic extensions
Special case: R-move
[Liang, 2002]
= 0 and 1, and thus
(
(y, % + 1)
(x0 , 0 ) =
(x, (% + 1))
if u < %/(% + 1)
otherwise,
[Importance sampling]

Dynamic extensions
Special case: W -move
0, thus a = 1 and
(x0 , 0 ) = (y, %) .
Q-move
[Liu & al, 2001]
(x0 , 0 ) =
(y, %) if u < 1 %/ ,
(x, a)
otherwise,
with a 1 either a constant or an independent random variable.

Dynamic extensions
Notes
I
Updating step in Q and R schemes written as

(xt+1 , t+1 ) = {xt , t /Pr(Rt = 0)}
with probability Pr(Rt = 0) and
(xt+1 , t+1 ) = {yt+1 , t r(xt , yt+1 )/Pr(Rt = 1)}
with probability Pr(Rt = 1), where Rt is the move indicator
and
yt+1 K(xt , y)

Dynamic extensions
Notes (2)
I
Geometric structure of the weights

Pr(Rt = 0) =
and
Pr(Rt = 0) =
for the R scheme
t
.
t+1
t r(xt , yt )
,
t r(xt , yt ) +
> 0,

Dynamic extensions
Notes (2)
I
Geometric structure of the weights

Pr(Rt = 0) =
and
Pr(Rt = 0) =
t
.
t+1
t r(xt , yt )
,
t r(xt , yt ) +
> 0,
for the R scheme

Number of steps T before an acceptance (a jump) such that
Pr (T t) = P (R1 = 0, . . . , Rt1 = 0)
t1
Y
j
E[1/t ] .
= E
j+1
j=0

Dynamic extensions
Alternative scheme
Preservation of weight expectation:
(xt , t t /Pr(Rt = 0))
with probability Pr(R = 0) and

t
(xt+1 , t+1 ) =
(yt+1 , (1 t )t r(xt , yt+1 )/Pr(Rt = 1))
with probability Pr(R = 1).

t

Dynamic extensions
Alternative scheme (2)

Then
Pr (T = t) = P (R1 = 0, . . . , Rt1 = 0, Rt = 1)
t1
Y
j
t1 r(x0 , Yt )
(1 t )
j
= E
j+1
t
j=0
which is equal to
t1 (1 )E[o r(x, Yt )/t ]

when j constant and deterministic.

Dynamic extensions
Example
Choose a function 0 < (, ) < 1 and to take, while in (x0 , 0 ),

0 r(x0 , y1 )
(1 (x0 , y1 )
(x1 , 1 ) = y1 ,
(x0 , y1 )
with probability
min(1, 0 r(x0 , y1 )) = (x0 , y1 )

and
(x1 , 1 ) =
0
(x0 , y1 )
x0 ,
1 (x0 , y1 )
with probability 1 (x0 , y1 ).

Idea
Simulate from the product distribution
(x1 , . . . , xn ) =
n
Y
(xi )
i=1
and apply dynamic importance sampling to the sample

(a.k.a. population)
(t)
x(t) = (x1 , . . . , x(t)

n )

Iterated importance sampling

As in Markov Chain Monte Carlo (MCMC) algorithms,
introduction of a temporal dimension :
(t)
(t1)
xi qt (x|xi
and
i = 1, . . . , n,
t = 1, . . .
n
X
(t)
(t)
t = 1
%i h(xi )
I
n
i=1
is still unbiased for

(t)
%i =
(t)
t (xi )
(t)
(t1)
qt (xi |xi
i = 1, . . . , n

Fundamental importance equality

Preservation of unbiasedness

E h(X
=
=
(t)
(X (t) )
)
qt (X (t) |X (t1) )
h(x)
(x)
qt (x|y) g(y) dx dy
qt (x|y)
h(x) (x) dx
for any distribution g on X (t1)

Sequential variance decomposition

Furthermore,
n

X
(t)
(t)
t = 1
var
%
h(x
)
,
var I
i
i
n2
i=1

(t)
(t)
if var %i
exists, because the xi s are conditionally uncorrelated
Note
(t)
This decomposition is still valid for correlated [in i] xi s when

(t)
incorporating weights %i

Simulation of a population
The importance distribution of the sample (a.k.a. particles) x(t)
qt (x(t) |x(t1) )
can depend on the previous sample x(t1) in any possible way as
long as marginal distributions
Z
(t)
qit (x) = qt (x(t) ) dxi
can be expressed to build importance weights
(t)
%it =
(xi )
(t)
qit (xi )

Special case of the product proposal
If
qt (x(t) |x(t1) ) =
then
n
Y
i=1
(t)
qit (xi |x(t1) )

[Independent proposals]
n

1 X
(t)
(t)
var It = 2
var %i h(xi ) ,
n
i=1

Validation
skip validation
h
i
(t)
(t)
(t)
(t)
E %i h(Xi ) %j h(Xj )
Z
(xj )
(xi )
h(xj )
= h(xi )
(t1)
qit (xi |x
) qjt (xj |x(t1) )
qit (xi |x(t1) ) qjt (xj |x(t1) ) dxi dxj g(x(t1) )dx(t1)
= E [h(X)]2
whatever the distribution g on x(t1)

Self-normalised version
In general, is unscaled and the weight

(t)
%i
is scaled so that
(t)
(xi )
(t)
qit (xi )
X
i
i = 1, . . . , n ,
(t)
%i = 1

Self-normalised version properties
Loss of the unbiasedness property and the variance

decomposition
Normalising constant can be estimated by

t
n
( )
1 X X (xi )
$t =
( )
tn
=1 i=1 qi (xi )
Variance decomposition (approximately) recovered if $t1 is

used instead

Sampling importance resampling

Importance sampling from g can also produce samples from the
target
[Rubin, 1987]


target
[Rubin, 1987]
Theorem (Bootstraped importance sampling)

If a sample (x?i )1im is derived from the weighted sample
(xi , %i )1in by multinomial sampling with weights %i , then
x?i (x)


target
[Rubin, 1987]
Theorem (Bootstraped importance sampling)

If a sample (x?i )1im is derived from the weighted sample
(xi , %i )1in by multinomial sampling with weights %i , then
x?i (x)
Note
Obviously, the x?i s are not iid

Iterated sampling importance resampling
This principle can be extended to iterated importance sampling:

After each iteration, resampling produces a sample from
[Again, not iid!]

Iterated sampling importance resampling
This principle can be extended to iterated importance sampling:

After each iteration, resampling produces a sample from
[Again, not iid!]
Incentive
Use previous sample(s) to learn about and q

Generic Population Monte Carlo
Algorithm (Population Monte Carlo Algorithm)

For t = 1, . . . , T
For i = 1, . . . , n,
1. Select the generating distribution qit ()
(t)
2. Generate x
i qit (x)
(t)
(t)
(t)
3. Compute %i = (
xi )/qit (
xi )
(t)
(t)
Normalise the %i s into %i s

(t)
(t)
Generate Ji,t M((

%i )1iN ) and set xi,t = x
Ji,t

D-kernels in competition
A general adaptive construction:
Construct qi,t as a mixture of D different transition kernels
(t1)
depending on xi
qi,t =
D
X
(t1)
pt,` K` (xi
`=1
and adapt the weights pt,` .
, x),
D
X
`=1
pt,` = 1 ,

D-kernels in competition
A general adaptive construction:
Construct qi,t as a mixture of D different transition kernels
(t1)
depending on xi
qi,t =
D
X
(t1)
pt,` K` (xi
`=1
, x),
D
X
pt,` = 1 ,
`=1
and adapt the weights pt,` .
Example
Take pt,` proportional to the survival rate of the points
(t)
(a.k.a. particles) xi generated from K`

Implementation
Algorithm (D-kernel PMC)
For t = 1, . . . , T
generate (Ki,t )1iN M ((pt,k )1kD )
for 1 i N , generate
x
i,t KKi,t (x)
compute and renormalize the importance weights i,t
generate (Ji,t )1iN M (( i,t )1iN )
P
take xi,t = x
Ji,t ,t and pt+1,d = N
i,t Id (Ki,t )
i=1

Links with particle filters

I
Usually setting where = t changes with t: Population

Monte Carlo also adapts to this case
Can be traced back all the way to Hammersley and Morton

(1954) and the self-avoiding random walk problem
Gilks and Berzuini (2001) produce iterated samples with (SIR)

resampling steps, and add an MCMC step: this step must use
a t invariant kernel
Chopin (2001) uses iterated importance sampling to handle

large datasets: this is a special case of PMC where the qit s
are the posterior distributions associated with a portion kt of
the observed dataset

Links with particle filters (2)

I
Rubinstein and Kroeses (2004) cross-entropy method is

parameterised importance sampling targeted at rare events
Stavropoulos and Titteringtons (1999) smooth bootstrap and

Warnes (2001) kernel coupler use nonparametric kernels on
the previous importance sample to build an improved
proposal: this is a special case of PMC
West (1992) mixture approximation is a precursor of smooth

bootstrap
Mengersen and Robert (2002) pinball sampler is an MCMC

attempt at population sampling
Del Moral and Doucet (2003) sequential Monte Carlo

samplers also relates to PMC, with a Markovian dependence
on the past sample x(t) but (limited) stationarity constraints

Things can go wrong

Unexpected behaviour of the mixture weights when the number of
particles increases
N
X
i=1
i,t IKi,t =d P
1
D

Things can go wrong

Unexpected behaviour of the mixture weights when the number of
particles increases
N
X
i=1
i,t IKi,t =d P
1
D
Conclusion
At each iteration, every weight converges to 1/D:
the algorithm fails to learn from experience!!

Saved by Rao-Blackwell!!
Modification: Rao-Blackwellisation (=conditioning)

Saved by Rao-Blackwell!!
Modification: Rao-Blackwellisation (=conditioning)

Use the whole mixture in the importance weight:
D
X
i,t = (
xi,t )
pt,d Kd (xi,t1 , x
i,t )
d=1
instead of
i,t =
(
xi,t )
KKi,t (xi,t1 , x
i,t )

Adapted algorithm
Algorithm (Rao-Blackwellised D-kernel PMC)
At time t (t = 1, . . . , T ),
Generate
Generate
and set i,t

Generate
and set xi,t
iid
(Ki,t )1iN M((pt,d )1dD );

ind
(
xi,t )1iN KKi,t (xi,t1 , x)

PD
= (
xi,t )
i,t );
d=1 pt,d Kd (xi,t1 , x
iid
(Ji,t )1iN M((

i,t )1iN )
P
=x
Ji,t ,t and pt+1,d = N
i,t pt,d .
i=1

Theorem (LLN)
Under regularity assumptions, for h L1 and for every t 1,
N
1 X
N
i,t h(xi,t ) P (h)

N
k=1
and
pt,d P dt
The limiting coefficients (dt )1dD are defined recursively as

!
Z
0)
K
(x,
x
d
t1
dt = d
(dx, dx0 ).
PD
t1
0)
K
(x,
x
j
j=1 j

Recursion on the weights

Set F as
F () =
on the simplex
(
S=
Z "
PD
Kd (x, x0 )
0
j=1 j Kj (x, x )
(dx, dx0 )
= (1 , . . . , D ); d {1, . . . , D}, d 0
and define the sequence

t+1 = F (t )
1dD
and
D
X
d=1
d = 1 .

Kullback divergence
Definition (Kullback divergence)

For S,
KL() =
Z "
log
(x)(x0 )
P
0
(x) D
d=1 d Kd (x, x )
!#
(dx, dx0 ).
Kullback divergence between and the mixture.

Goal: Obtain the mixture closest to , i.e., that minimises KL()

Connection with RBDPMCA ??

Theorem
Under the assumption
d {1, . . . , D}, <
log(Kd (x, x0 )) (dx, dx0 ) <
for every SD ,
KL(F ()) KL().

Connection with RBDPMCA ??

Theorem
Under the assumption
d {1, . . . , D}, <
log(Kd (x, x0 )) (dx, dx0 ) <
for every SD ,
KL(F ()) KL().
Conclusion
The Kullback divergence decreases at every iteration of RBDPMCA

An integrated EM interpretation
skip interpretation
We have
min
= arg min KL() = arg max

S
= arg max
S
log p (
x) (d
x)
Z
log p (
x, K)dK (d
x)
for x
= (x, x0 ) and K M((d )1dD ). Then t+1 = F (t )
means
ZZ
t+1
K)|X
=x
Et (log p (X,
) (d
x)
= arg max
and
lim t = min

Illustration
Example (A toy example)
Take the target
1/4N (1, 0.3)(x) + 1/4N (0, 1)(x) + 1/2N (3, 2)(x)
and use 3 proposals: N (1, 0.3), N (0, 1) and N (3, 2)
[Surprise!!!]

Illustration
Example (A toy example)
Take the target
1/4N (1, 0.3)(x) + 1/4N (0, 1)(x) + 1/2N (3, 2)(x)
and use 3 proposals: N (1, 0.3), N (0, 1) and N (3, 2)
[Surprise!!!]
Then
1
2
6
10
16
0.0500000 0.05000000
0.2605712 0.09970292
0.2740816 0.19160178
0.2989651 0.19200904
0.2651511 0.24129039
Weight evolution
0.9000000
0.6397259
0.5343166
0.5090259
0.4935585

Target and mixture evolution

Example : PMC for mixtures
Observation of an iid sample x = (x1 , . . . , xn ) from

pN (1 , 2 ) + (1 p)N (2 , 2 ),
with p 6= 1/2 and > 0 known.
Usual N (, 2 /) prior on 1 and 2 :
(1 , 2 |x) f (x|1 , 2 ) (1 , 2 )

Algorithm (Mixture PMC)

Step 0: Initialisation
(0)
(0)
For j = 1, . . . , n = pm, choose (1 )j , (2 )j

For k = 1, . . . , p, set rk = m
Step i: Update (i = 1, . . . , I)
For k = 1, . . . , p,
1. generate a sample of size rk as

(i)
(i1)
and
(1 )j N (1 )j
, vk

(i)
(i1)
(2 )j N (2 )j
, vk
2. compute the weights

(i)
(i)
(i)
(i)
f x (1 )j , (2 )j (1 )j , (2 )j

%j
(i1)
(i)
(i1)
(i)
, vk
(1 )j (1 )j
, vk (2 )j (2 )j

(i)
(i)
using the weights %j ,
Resample the (1 )j , (2 )j

Details
After an arbitrary initialisation, use of the previous (importance)

sample (after resampling) to build random walk proposals,
(i1)
N (()j
, vj )
with a multiscale variance vj within a predetermined set of p scales

ranging from 103 down to 103 , whose importance is proportional
to its survival rate in the resampling step.

100
200
300
400
600
Resampling
0
0 200
600
0 200
Resampling
500
100
1
100
200
300
400
500
100
400
500
200
300
400
500
400
500
Iterations
2.010
0 1 2 3 4
2.020
Iterations
Var(2)
300
0.05 0.15 0.25
Var(1)
200
Iterations
0 1 2 3 4
Iterations
100
200
300
Iterations
400
500
100
200
300
Iterations
(u.left) Number of resampled points for v1 = 5 (darker) and

v2 = 2; (u.right) Number of resampled points for the other
variances; (m.left) Variance of the 1 s along iterations; (m.right)
Average of the 1 s over iterations; (l.left) Variance of the 2 s
along iterations; (l.right) Average of the simulated 2 s over
iterations.

1.5
0.0
0.5
1.0
2.0
2.5
3.0
Log-posterior distribution and sample of means

Monte Carlo Notes From Main Book

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Monte Carlo Notes From Main Book

Uploaded by

Copyright:

Available Formats

Markov Chain Monte Carlo Methods

Markov Chain Monte Carlo Methods

3-6 Mayo 2005

Markov Chain Monte Carlo Methods

Markov Chain Monte Carlo Methods

New [2004] edition:

Markov Chain Monte Carlo Methods

Motivation and leading example

Markov Chain Monte Carlo Methods

Latent structures make life harder!

Even simple models may lead to computational complications, as

Markov Chain Monte Carlo Methods

Latent structures make life harder!

Even simple models may lead to computational complications, as

Markov Chain Monte Carlo Methods

Latent structures make life harder!

Even simple models may lead to computational complications, as

Markov Chain Monte Carlo Methods

Example (Mixture models)

Markov Chain Monte Carlo Methods

Example (Mixture models)

{p1 f1 (xi ) + + pk fk (xi )} .

Markov Chain Monte Carlo Methods

Example (Mixture models)

{p1 f1 (xi ) + + pk fk (xi )} .

Expanding this product involves k n elementary terms: prohibitive

Markov Chain Monte Carlo Methods

Case of the 0.3N (1 , 1) + 0.7N (2 , 1) likelihood

Markov Chain Monte Carlo Methods

Maximum likelihood methods

For an iid sample X1 , . . . , Xn from a population with density

Markov Chain Monte Carlo Methods

Maximum likelihood methods

For an iid sample X1 , . . . , Xn from a population with density

Global justifications from asymptotics

Markov Chain Monte Carlo Methods

Maximum likelihood methods

For an iid sample X1 , . . . , Xn from a population with density

Global justifications from asymptotics

Computational difficulty depends on structure, eg latent

Markov Chain Monte Carlo Methods

Example (Mixtures again)

Markov Chain Monte Carlo Methods

Standard maximization techniques often fail to find the global

with  > 0 known,

Markov Chain Monte Carlo Methods

Standard maximization techniques often fail to find the global

with  > 0 known, whatever n, the likelihood is unbounded:

Markov Chain Monte Carlo Methods

The special case of missing variable models

Consider again a latent variable representation

Markov Chain Monte Carlo Methods

The special case of missing variable models

Consider again a latent variable representation

Define the completed (but unobserved) likelihood

Markov Chain Monte Carlo Methods

The special case of missing variable models

Consider again a latent variable representation

Define the completed (but unobserved) likelihood

Markov Chain Monte Carlo Methods

Bayes rather than EM

Markov Chain Monte Carlo Methods

Bayes rather than EM

until a fixed point [of Q] is reached

with > 0 known,

with > 0 known, whatever n, the likelihood is unbounded: