Outline
Motivation and leading example
Random variable generation
Monte Carlo Integration
Notions on Markov Chains
The Metropolis-Hastings Algorithm
The Gibbs Sampler
MCMC tools for variable dimension problems
Sequential importance sampling
Introduction
For a two-component mixture, the likelihood expands into a sum
containing 2^n terms.

Example
In the special case
f(x|μ, σ) = (1 − p) exp{−(1/2)x²} + (p/σ) exp{−(1/2σ²)(x − μ)²}   (1)
The EM Algorithm
Gibbs connection
Algorithm (Expectation–Maximisation)
Iterate (in m)
1. (E step) Compute
   Q(θ|θ^(m), x) = E[log L^c(θ|x, Z)|θ^(m), x],
2. (M step) Maximise Q(θ|θ^(m), x) in θ and take
   θ^(m+1) = arg max_θ Q(θ|θ^(m), x).
[Figure: histogram of an N(0,1) sample]
Central tool
Summary in a probability distribution, π(θ|x), called the posterior
distribution
Derived from the joint distribution f(x|θ)π(θ), according to
π(θ|x) = f(x|θ)π(θ) / ∫ f(x|θ)π(θ) dθ,
where
m(x) = ∫ f(x|θ)π(θ) dθ
is the marginal density of X
[Bayes' Theorem]
Conjugate bonanza...
Example (Binomial)
For an observation X ~ B(n, p), the so-called conjugate prior is the
family of beta Be(a, b) distributions
The classical Bayes estimator is the posterior mean
[Γ(a + b + n) / (Γ(a + x)Γ(n − x + b))] ∫₀¹ p · p^(x+a−1) (1 − p)^(n−x+b−1) dp
= (x + a)/(a + b + n).
Example (Normal)
In the normal N(μ, σ²) case, with both μ and σ unknown, the
conjugate prior on θ = (μ, σ²) is of the form
(σ²)^(−λ) exp{−[λ_μ(μ − ξ)² + α]/2σ²}
since
π((μ, σ²)|x₁, …, xₙ) ∝ (σ²)^(−λ) exp{−[λ_μ(μ − ξ)² + α]/2σ²}
  × (σ²)^(−n/2) exp{−[n(μ − x̄)² + s²ₓ]/2σ²}
∝ (σ²)^(−λ−n/2) exp{−[(λ_μ + n)(μ − ξ(x̄))² + α + s²ₓ + (nλ_μ/(n + λ_μ))(x̄ − ξ)²]/2σ²}
The Bayes factor for H₀: θ ∈ Θ₀ against H₁: θ ∈ Θ₁ is
B₀₁(x) = [P(θ ∈ Θ₀ | x)/P(θ ∈ Θ₁ | x)] ÷ [π(θ ∈ Θ₀)/π(θ ∈ Θ₁)].
With p ~ Be(α, β) (and conjugate priors on the component parameters):
Posterior
π(θ|x₁, …, xₙ) ∝ ∏_{j=1}^{n} {p φ(xⱼ; μ₁, σ₁) + (1 − p) φ(xⱼ; μ₂, σ₂)} π(θ)
which expands into a sum over all allocations,
π(θ|x₁, …, xₙ) = Σ_{ℓ=0}^{n} Σ_{(k_t)} ω(k_t) π(θ|(k_t)),
where, for an allocation (k_t) assigning ℓ observations to the first component,
x̄₁(k_t) = (1/ℓ) Σ_{t=1}^{ℓ} x_{k_t},   x̄₂(k_t) = (1/(n − ℓ)) Σ_{t=ℓ+1}^{n} x_{k_t},
ŝ₁(k_t) = Σ_{t=1}^{ℓ} (x_{k_t} − x̄₁(k_t))²,   ŝ₂(k_t) = Σ_{t=ℓ+1}^{n} (x_{k_t} − x̄₂(k_t))²,
and the updated hyperparameters are
ξ₁(k_t) = [n₁ξ₁ + ℓ x̄₁(k_t)]/(n₁ + ℓ),   ξ₂(k_t) = [n₂ξ₂ + (n − ℓ) x̄₂(k_t)]/(n₂ + n − ℓ),
s₁(k_t) = s₁² + ŝ₁(k_t) + [n₁ℓ/(n₁ + ℓ)] (ξ₁ − x̄₁(k_t))²,
s₂(k_t) = s₂² + ŝ₂(k_t) + [n₂(n − ℓ)/(n₂ + n − ℓ)] (ξ₂ − x̄₂(k_t))².
δ(x) = (τ²x + μ)/(1 + τ²).
π(μ) ∝ ∏_{i=1}^{k} [λᵢ + (μ − μᵢ)²]^(−νᵢ),   λᵢ, νᵢ > 0
Xₜ = Σ_{i=1}^{p} θᵢ x_{t−i} + εₜ

The pseudo-random numbers U₁, U₂, … produced by a generator are
treated as iid U[0,1].
Usual generators
In R and S-plus, procedure runif()
The Uniform Distribution
Description:
runif generates random deviates.
Example:
u <- runif(20)
.Random.seed is an integer vector, containing
the random number generator state for random
number generation in R. It can be saved and
restored, but should not be altered by users.
[Figure: sequence plot and histogram of a uniform sample]
Usual generators (3)
In Scilab, procedure rand()
rand() : with no arguments gives a scalar whose
value changes each time it is referenced. By
default, random numbers are uniformly distributed
in the interval (0,1). rand('normal') switches to
a normal distribution with mean 0 and variance 1.
EXAMPLE
x=rand(10,10,'uniform')
Transformation methods
Case where a distribution F is linked in a simple way to another
distribution that is easy to simulate.
For instance, if U ~ U[0,1], X = −log U satisfies
P(X ≤ x) = P(U ≥ e^(−x)) = 1 − e^(−x),
i.e. X ~ Exp(1). Similarly,
Y = −2 Σ_{j=1}^{ν} log(Uⱼ) ~ χ²_{2ν},
Y = −(1/λ) Σ_{j=1}^{a} log(Uⱼ) ~ Ga(a, λ),
Y = Σ_{j=1}^{a} log(Uⱼ) / Σ_{j=1}^{a+b} log(Uⱼ) ~ Be(a, b).
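These transformations can be checked numerically. A minimal Python sketch using only the standard library (the function names are ours, and the gamma/beta constructions assume integer shape parameters):

```python
import math
import random

def exp_from_uniform(lam=1.0):
    # If U ~ U[0,1], then -log(U)/lam ~ Exp(lam)
    return -math.log(random.random()) / lam

def gamma_from_uniforms(a, lam=1.0):
    # Sum of a iid Exp(lam) variables is Ga(a, lam), for integer a
    return sum(exp_from_uniform(lam) for _ in range(a))

def beta_from_uniforms(a, b):
    # Be(a, b) as a ratio of log-uniform sums, for integer a, b
    num = sum(math.log(random.random()) for _ in range(a))
    den = num + sum(math.log(random.random()) for _ in range(b))
    return num / den

random.seed(1)
gamma_mean = sum(gamma_from_uniforms(3, 2.0) for _ in range(20000)) / 20000
# gamma_mean is close to a/lam = 1.5
```

The empirical mean of the Ga(3, 2) draws matches the theoretical value a/λ.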
Points to note
Box-Muller Algorithm
Example (Normal variables)
If (r, θ) are the polar coordinates of (X₁, X₂) iid N(0,1), then
r² = X₁² + X₂² ~ χ²₂ = Exp(1/2)
and θ ~ U[0, 2π]
Consequence: If U₁, U₂ iid U[0,1],
X₁ = √(−2 log U₁) cos(2πU₂)
X₂ = √(−2 log U₁) sin(2πU₂)
are iid N(0, 1).
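The Box-Muller transform above can be sketched directly in Python:

```python
import math
import random

def box_muller():
    # Two iid N(0,1) variates from two iid U[0,1] variates
    u1, u2 = random.random(), random.random()
    r = math.sqrt(-2.0 * math.log(u1))
    return r * math.cos(2 * math.pi * u2), r * math.sin(2 * math.pi * u2)

random.seed(0)
xs = [x for _ in range(20000) for x in box_muller()]
mean = sum(xs) / len(xs)
var = sum(x * x for x in xs) / len(xs) - mean ** 2
```

The sample mean and variance are close to 0 and 1, as expected for N(0,1) draws.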
Atkinson's Poisson
To generate N ~ P(λ):
1. Define
β = π/√(3λ),  α = βλ,
and
k = log c − λ − log β;
Negative extension
Mixture representation
The representation of the negative binomial is a particular
case of a mixture distribution
The principle of a mixture representation is to write a
density f as the marginal of another distribution, for example
f(x) = Σ_{i∈Y} pᵢ fᵢ(x),
Partitioned sampling
Special case where fᵢ is the restriction of f to Aᵢ,
fᵢ(x) ∝ f(x) I_{Aᵢ}(x),
and
pᵢ = Pr(X ∈ Aᵢ) = ∫_{Aᵢ} f(x) dx
for a partition (Aᵢ)ᵢ
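The two-stage simulation implied by the mixture representation (first draw the component index, then draw from that component) can be sketched as follows; the weights and component samplers are illustrative:

```python
import random

def sample_mixture(weights, samplers):
    # Draw component i with probability p_i, then draw from f_i:
    # this realises f(x) = sum_i p_i f_i(x) as a two-stage simulation
    u, acc = random.random(), 0.0
    for p, s in zip(weights, samplers):
        acc += p
        if u <= acc:
            return s()
    return samplers[-1]()

random.seed(2)
draws = [sample_mixture([0.3, 0.7],
                        [lambda: random.gauss(0.0, 1.0),
                         lambda: random.gauss(4.0, 1.0)])
         for _ in range(20000)]
mix_mean = sum(draws) / len(draws)  # close to 0.3*0 + 0.7*4 = 2.8
```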
Accept-Reject algorithm
Lemma
Simulating
X ~ f(x)
is equivalent to simulating
(X, U) ~ U{(x, u) : 0 < u < f(x)}
[Figure: density f(x) and the region under its graph]
No Cauchy!
f(x) = (1/√(2π)) exp(−x²/2)
and
g(x) = (1/π) · 1/(1 + x²),
densities of the normal and Cauchy distributions.
Then
f(x)/g(x) = √(π/2) (1 + x²) e^(−x²/2) ≤ √(2π/e) = 1.52
attained at x = ±1.
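The normal-from-Cauchy Accept-Reject scheme, with the bound √(2π/e) computed above, can be sketched as:

```python
import math
import random

M = math.sqrt(2 * math.pi / math.e)  # the bound 1.52... on f/g

def normal_via_cauchy():
    # Accept-Reject: target f = N(0,1), proposal g = Cauchy C(0,1)
    while True:
        x = math.tan(math.pi * (random.random() - 0.5))  # Cauchy draw by inversion
        f = math.exp(-x * x / 2) / math.sqrt(2 * math.pi)
        g = 1.0 / (math.pi * (1 + x * x))
        if random.random() <= f / (M * g):
            return x

random.seed(3)
xs = [normal_via_cauchy() for _ in range(20000)]
```

On average 1/M ≈ 66% of the Cauchy proposals are accepted, and the accepted draws are exactly N(0,1).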
No Double!
For a normal target and a double-exponential proposal
g(x|α) = (α/2) e^(−α|x|),
the ratio f/g is bounded by √(2/π) α⁻¹ e^(α²/2), minimised at α = 1,
but the scheme cannot be truncated without adjustment.
For a Ga(a, λ) target with an Exp(b) proposal, the ratio is
proportional to x^(a−1) exp{−(λ − b)x}/b, bounded only for b ≤ λ.
The bound is minimised at b = λ/a.
1. Define c₁ = a − 1, c₂ = (a − 1/(6a))/c₁,
   c₃ = 2/c₁, c₄ = 1 + c₃, and c₅ = 1/√a.
2. Repeat
   generate U₁, U₂
   take U₁ = U₂ + c₅(1 − 1.86 U₁) if a > 2.5
   until 0 < U₁ < 1.
3. Set W = c₂U₂/U₁.
4. If c₃U₁ + W + W⁻¹ ≤ c₄ or
   c₃ log U₁ − log W + W ≤ 1,
   take c₁W;
   otherwise, repeat.
Take
Sₙ = {xᵢ, i = 0, 1, …, n + 1} ⊂ supp(f)
such that h(xᵢ) = log f(xᵢ) are known up to the same constant.
By concavity of h, the line Lᵢ,ᵢ₊₁ through (xᵢ, h(xᵢ)) and
(xᵢ₊₁, h(xᵢ₊₁)) is below h on [xᵢ, xᵢ₊₁] and above h outside this
interval. The chords thus define a lower envelope h̲ₙ and an upper
envelope h̄ₙ of h.
Therefore, if
f̲ₙ(x) = exp h̲ₙ(x) and f̄ₙ(x) = exp h̄ₙ(x)
then
f̲ₙ(x) ≤ f(x) ≤ f̄ₙ(x) = ϖₙ gₙ(x),
where ϖₙ is the normalizing constant of f̄ₙ.
ARS Algorithm
1. Initialize n and Sₙ.
2. Generate X ~ gₙ(x), U ~ U[0,1].
3. If U ≤ f̲ₙ(X)/ϖₙ gₙ(X), accept X;
   otherwise, if U ≤ f(X)/ϖₙ gₙ(X), accept X
   and update Sₙ to Sₙ₊₁ = Sₙ ∪ {X}.
Example (Capture-recapture)
The likelihood is
[N!/(N − r)!] ∏_{i=1}^{I} pᵢ^(nᵢ) (1 − pᵢ)^(N−nᵢ),
with the logit link
pᵢ/(1 − pᵢ) = e^(αᵢ),   αᵢ ~ N(μᵢ, σ²),
[Normal logistic]
so the posterior is proportional to
[N!/(N − r)!] ∏_{i=1}^{I} (1 + e^(αᵢ))^(−N)
× ∏_{i=1}^{I} exp{αᵢnᵢ − (1/2σ²)(αᵢ − μᵢ)²}
and the full conditional of each αᵢ,
exp{αᵢnᵢ − (αᵢ − μᵢ)²/2σ² − N log(1 + e^(αᵢ))},
is log-concave, so ARS applies.
[Figure: posterior distributions of the capture probabilities by year, 1957–1965]
Quick reminder
Proper loss:
For L(δ, θ) = (δ − θ)², the Bayes estimator is the posterior mean
Absolute error loss:
For L(δ, θ) = |δ − θ|, the Bayes estimator is the posterior median
With no loss function,
use the maximum a posteriori (MAP) estimator
arg max_θ ℓ(θ|x)π(θ)
Theme:
Generic problem of evaluating the integral
I = E_f[h(X)] = ∫_X h(x) f(x) dx
Monte Carlo solution: use a sample (x₁, …, xₘ) from f and approximate I by the empirical average
h̄ₘ = (1/m) Σ_{j=1}^{m} h(xⱼ),
which converges,
h̄ₘ → E_f[h(X)],
by the Strong Law of Large Numbers.
The variance of h̄ₘ can itself be estimated by
vₘ = (1/m) · [1/(m − 1)] Σ_{j=1}^{m} [h(xⱼ) − h̄ₘ]².
Note: This can lead to the construction of a convergence test and
of confidence bounds on the approximation of E_f[h(X)].
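A minimal Python sketch of the Monte Carlo estimator and its variance estimate (the test function h(x) = x² under N(0,1) is our illustrative choice, with true value 1):

```python
import random

def mc_estimate(h, sampler, m):
    # Empirical average h_bar and its estimated variance v_m
    xs = [h(sampler()) for _ in range(m)]
    hbar = sum(xs) / m
    vm = sum((x - hbar) ** 2 for x in xs) / (m * (m - 1))
    return hbar, vm

random.seed(4)
# E_f[h(X)] with f = N(0,1) and h(x) = x^2, whose true value is 1
est, var_est = mc_estimate(lambda x: x * x, lambda: random.gauss(0, 1), 50000)
```

The pair (est, var_est) gives an approximate confidence interval est ± 2·√var_est.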
Example: for x ~ N(θ, 1) with a Cauchy C(0, 1) prior, the posterior mean is
δ(x) = ∫ [θ/(1 + θ²)] e^(−(x−θ)²/2) dθ / ∫ [1/(1 + θ²)] e^(−(x−θ)²/2) dθ.
Monte Carlo approximation: for θ₁, …, θₘ iid N(x, 1),
δₘ(x) = Σ_{i=1}^{m} θᵢ/(1 + θᵢ²) / Σ_{i=1}^{m} 1/(1 + θᵢ²)
and δₘ(x) → δ(x) as m → ∞.
[Figure: convergence of the Monte Carlo approximation δₘ(x) over 1000 iterations]
Importance sampling
Paradox
Simulation from f (the true density) is not necessarily optimal
Alternative to direct sampling from f is importance sampling,
based on the alternative representation
E_f[h(X)] = ∫_X h(x) [f(x)/g(x)] g(x) dx,
which allows us to use distributions other than f to evaluate
∫_X h(x) f(x) dx
by
1. Generate a sample X₁, …, Xₙ from a distribution g
2. Use the approximation
(1/m) Σ_{j=1}^{m} [f(Xⱼ)/g(Xⱼ)] h(Xⱼ)
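The two steps above can be sketched in Python; the particular target N(2,1) and heavier-tailed instrumental N(0,2) are our illustrative choices:

```python
import math
import random

def normal_pdf(x, mu=0.0, sd=1.0):
    return math.exp(-((x - mu) / sd) ** 2 / 2) / (sd * math.sqrt(2 * math.pi))

def importance_sampling(h, f_pdf, g_pdf, g_sampler, m):
    # (1/m) sum_j h(X_j) f(X_j)/g(X_j) with X_j ~ g
    total = 0.0
    for _ in range(m):
        x = g_sampler()
        total += h(x) * f_pdf(x) / g_pdf(x)
    return total / m

random.seed(5)
# E_f[X] = 2 for f = N(2,1), simulating from g = N(0,2)
est = importance_sampling(lambda x: x,
                          lambda x: normal_pdf(x, 2.0, 1.0),
                          lambda x: normal_pdf(x, 0.0, 2.0),
                          lambda: random.gauss(0.0, 2.0),
                          100000)
```

Choosing g with heavier tails than f keeps the weights f/g bounded, which is essential for a finite-variance estimator.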
Important details
The importance weights may have infinite variance: for a Cauchy
target p⋆ simulated from a normal instrumental p₀,
ϱ(x) = p⋆(x)/p₀(x) = √(2π) e^(x²/2) / [π(1 + x²)]
and
∫ ϱ(x)² p₀(x) dx = ∞.
[Figure: instability of the importance sampling estimate over 10,000 iterations]
The variance-optimal instrumental distribution is
g⋆(x) = |h(x)| f(x) / ∫ |h(z)| f(z) dz.
Practical impact
The weights f(xⱼ)/g(xⱼ) need only be known up to a multiplicative
constant, through the self-normalised ratio
Σ_{i=1}^{n} ωᵢ h(xᵢ) / Σ_{i=1}^{n} ωᵢ.
Selfnormalised variance
With
S_h^n = Σ_{i=1}^{n} Wᵢ h(Xᵢ),   S₁^n = Σ_{i=1}^{n} Wᵢ,
then, for h̄ₙ = S_h^n / S₁^n,
var(h̄ₙ) ≈ (1/n²) [var(S_h^n) − 2 E^π[h] cov(S_h^n, S₁^n) + E^π[h]² var(S₁^n)].
Rough approximation
var h̄ₙ ≈ (1/n) var(h(X)) {1 + var_g(W)}
n
Example: Student's t density,
f(x) = [Γ((ν + 1)/2) / (Γ(ν/2) √(νπ) σ)] (1 + (x − θ)²/νσ²)^(−(ν+1)/2)
[Figure: importance sampling estimates over 50,000 iterations]
Explanation:
Take target distribution π and instrumental distribution μ on Xⁿ.
Simulate N iid samples x₁:ₙ from μ.
Importance sampling estimator for πₙ(fₙ) = ∫ fₙ(x₁:ₙ) πₙ(dx₁:ₙ):
π̂ₙ(fₙ) = Σ_{i=1}^{N} fₙ(ξ^i_{1:n}) ∏_{j=1}^{n} W_j^i / Σ_{i=1}^{N} ∏_{j=1}^{n} W_j^i,
where W_k^i = (dπ/dμ)(ξ_k^i), and the ξ^j are iid with distribution μ.
For {Vₖ}ₖ≥₀, a sequence of iid nonnegative random variables with
E(Vₖ) = 1 and, for n ≥ 1, Fₙ = σ(Vₖ; k ≤ n), set
Uₙ = ∏_{k=1}^{n} Vₖ.
Then (Uₙ) is a nonnegative martingale, converging a.s. to some Z, while
E(√Uₙ) = ∏_{k=1}^{n} E(√Vₖ) → 0,   n → ∞,
since E(√Vₖ) < 1 unless Vₖ ≡ 1. Hence, by Fatou,
E(√Z) = E(lim √Uₙ) ≤ lim inf E(√Uₙ) = 0,
so the cumulated weights degenerate unless
dμ/dπ = 1, π-a.e., i.e. μ = π.
Example (Stochastic volatility): with
εₜ ~ N(0, 1) and uₜ ~ N(0, 1),
the posterior is proportional to
∏_{k=2}^{n} exp −{σ⁻²(xₖ − φxₖ₋₁)² + β⁻² exp(−xₖ)yₖ² + xₖ}/2,   (2)
up to a normalizing constant.
Computational problems
Importance sampling
with a proposal of variance
κₖ² = (σ⁻² + yₖ² exp(−xₖ₋₁)/2)⁻¹
Prior proposal on X₁,
X₁ ~ N(0, σ²)
[Figure: importance sampling for the stochastic volatility model — estimated density, importance weights, and sample paths]
Correlated simulations
Negative correlation reduces variance
Special technique, but efficient when it applies
Two samples (X₁, …, Xₘ) and (Y₁, …, Yₘ) from f to estimate
I = ∫_ℝ h(x) f(x) dx
by
Î₁ = (1/m) Σ_{i=1}^{m} h(Xᵢ)
and
Î₂ = (1/m) Σ_{i=1}^{m} h(Yᵢ)
Variance reduction
var[(Î₁ + Î₂)/2] = σ²/2 + (1/2) cov(Î₁, Î₂).
Antithetic variables
If Xᵢ = F⁻¹(Uᵢ), take Yᵢ = F⁻¹(1 − Uᵢ)
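The antithetic construction can be sketched on the toy integral ∫₀¹ eᵘ du = e − 1 (our illustrative choice), where pairing U with 1 − U induces the negative correlation:

```python
import math
import random

def antithetic_estimate(m):
    # Pair U with 1-U: h(U) = exp(U) and h(1-U) are negatively correlated
    total = 0.0
    for _ in range(m // 2):
        u = random.random()
        total += math.exp(u) + math.exp(1.0 - u)
    return total / m

random.seed(6)
est = antithetic_estimate(20000)  # targets the integral e - 1
```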
Control variates
For
I = ∫ h(x) f(x) dx
unknown and
I₀ = ∫ h₀(x) f(x) dx
known,
I₀ estimated by Î₀ and
I estimated by Î
Combined estimator
Î* = Î + β(Î₀ − I₀)
Î* is unbiased for I and
var(Î*) = var(Î) + β² var(Î₀) + 2β cov(Î, Î₀)
Optimal control
Optimal choice of β:
β* = − cov(Î, Î₀) / var(Î₀),
with
var(Î_{β*}) = (1 − ρ²) var(Î),
where ρ is the correlation between Î and Î₀
Usual solution: β taken as (minus) the regression coefficient of h(xᵢ) over h₀(xᵢ)
with Xi iid f .
If Pr(X > ) =
%b =
1
2
known
f (x)dx
1X
I(Xi > a),
n
i=1
1X
% =
I(Xi > a) +
n
i=1
improves upon %b if
1X
I(Xi > ) Pr(X > )
n
i=1
1X
% =
I(Xi > a) +
n
i=1
improves upon %b if
<0
and
1X
I(Xi > ) Pr(X > )
n
|| < 2
i=1
cov(b
%, %b0 ) Pr(X > a)
2
.
var(b
%0 ) Pr(X > )
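A Python sketch of this tail-probability control variate, with N(0,1) as our illustrative f (so μ = 0, Pr(X > 0) = 1/2, and the true value Pr(X > 1) ≈ 0.1587):

```python
import random

def tail_probability(a, n, beta):
    # Plain estimator of Pr(X > a) under N(0,1), plus a control-variate
    # version using I(X > 0), whose expectation 1/2 is known exactly
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]
    plain = sum(x > a for x in xs) / n
    control = sum(x > 0 for x in xs) / n
    return plain, plain + beta * (control - 0.5)

random.seed(7)
plain, controlled = tail_probability(1.0, 100000, beta=-0.3)
# both estimates are close to Pr(X > 1) = 0.1587
```

Both estimators are unbiased; with β < 0 within the stated range, the controlled version has smaller variance.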
Integration by conditioning
Consequence
If Î is an unbiased estimator of I = E_f[h(X)], with X simulated from a
joint density f̃(x, y), where
∫ f̃(x, y) dy = f(x),
the estimator
Î* = E_f̃[Î|Y₁, …, Yₙ]
dominates Î(X₁, …, Xₙ) variance-wise (and is unbiased)
Example: for X ~ T(ν, 0, σ²), estimate E[exp(−X²)] by the empirical average
(1/m) Σ_{j=1}^{m} exp(−Xⱼ²).
[Figure: acceleration methods — convergence of the estimators over 10,000 iterations]
Recall algorithm:
1. Generate θ^(1), …, θ^(T) from cg(θ), with
c⁻¹ = ∫ g(θ) dθ
2. Approximate m(x) = ∫ f(x|θ)π(θ) dθ by
(1/T) Σ_{t=1}^{T} f(x|θ^(t)) π(θ^(t))/cg(θ^(t))
≈ Σ_{t=1}^{T} f(x|θ^(t)) π(θ^(t))/g(θ^(t)) / Σ_{t=1}^{T} π(θ^(t))/g(θ^(t))
= m̂_IS(x)
Choice of g
1. g(θ) = π(θ):
   m̂_IS(x) = (1/T) Σₜ f(x|θ^(t))
   often inefficient if data informative
   impossible if π is improper
2. g(θ) = f(x|θ)π(θ)/m(x), with c = m(x) unknown:
   m̂_IS(x) = 1 / [(1/T) Σₜ 1/f(x|θ^(t))]
   improper priors allowed
3. g(θ) = ρπ(θ) + (1 − ρ)π(θ|x)
   defensive mixture, ρ ≪ 1: Ok
   [Hesterberg, 1998]
4. g(θ) = π(θ|x):
   m̂ₕ(x) = [(1/T) Σ_{t=1}^{T} h(θ^(t)) / f(x|θ^(t))π(θ^(t))]⁻¹
   works for any (normalised) density h
   finite variance if
   ∫ h²(θ) / f(x|θ)π(θ) dθ < ∞
Bridge sampling
[Chen & Shao, 1997]
Given two models f₁(x|θ₁) and f₂(x|θ₂),
π₁(θ₁|x) = π̃₁(θ₁)f₁(x|θ₁)/m₁(x)
π₂(θ₂|x) = π̃₂(θ₂)f₂(x|θ₂)/m₂(x)
Bayes factor:
B₁₂(x) = m₁(x)/m₂(x),
ratio of normalising constants. A first estimator:
B̂₁₂ ≈ (1/n) Σ_{i=1}^{n} π̃₁(θᵢ)/π̃₂(θᵢ),   θᵢ ~ π₂(·|x).
More generally, for any function α,
B₁₂ = ∫ π̃₂(θ)α(θ)π₁(θ|x) dθ / ∫ π̃₁(θ)α(θ)π₂(θ|x) dθ
≈ [(1/n₁) Σ_{i=1}^{n₁} π̃₂(θ₁ᵢ)α(θ₁ᵢ)] / [(1/n₂) Σ_{i=1}^{n₂} π̃₁(θ₂ᵢ)α(θ₂ᵢ)],
θⱼᵢ ~ πⱼ(θ|x)
Optimal choice
α⋆(θ) = (n₁ + n₂) / [n₁ π̃₁(θ) + n₂ π̃₂(θ)]
[?]
Markov chains
Basics
Composition of kernels
Irreducibility
Irreducibility is one measure of the sensitivity of the Markov chain
to initial conditions
It leads to a guarantee of convergence for MCMC algorithms
Definition (Irreducibility)
In the discrete case, the chain is irreducible if all states
communicate, namely if
Pₓ(τ_y < ∞) > 0 for all x, y ∈ X
Minoration condition
Small sets
Number of passages of the chain in A:
η_A = Σ_{i=1}^{∞} I_A(Xᵢ)
Harris recurrence
Stronger form of recurrence:
Invariant measures
Stability increases for the chain (Xₙ) if the marginal distribution of Xₙ
is independent of n
Requires the existence of a probability distribution π such that
Xₙ₊₁ ~ π if Xₙ ~ π
Insights
Invariant probability measures are important not merely because they
define stationary processes, but also because they turn out to be the
measures which define the long-term or ergodic behavior of the chain.
To understand why, consider P_μ(Xₙ ∈ ·) for a starting distribution μ. If
a limiting measure γ_μ exists such that
P_μ(Xₙ ∈ A) → γ_μ(A)
for all A ∈ B(X), then
γ_μ(A) = limₙ ∫_X μ(dx) Pⁿ(x, A)
       = limₙ ∫_X ∫_X μ(dx) Pⁿ⁻¹(x, dw) K(w, A)
       = ∫_X γ_μ(dw) K(w, A)
since setwise convergence of Pⁿ(x, ·) implies convergence of integrals
of bounded measurable functions. Hence, if a limiting distribution exists,
it is an invariant probability measure; and obviously, if there is a unique
invariant probability measure, the limit γ_μ will be independent of μ
whenever it exists.
and we want
‖ ∫ Kⁿ(x, ·) μ(dx) − π ‖_TV = sup_A | ∫ Kⁿ(x, A) μ(dx) − π(A) |
to be small.
Lemma
Let μ and μ′ be two probability measures. Then,
1 − inf Σᵢ μ(Aᵢ) ∧ μ′(Aᵢ) = ‖μ − μ′‖_TV,
the infimum being taken over all finite partitions (Aᵢ).
Convergence then means
limₙ ‖Kⁿ(x, ·) − π‖_TV = 0.
Convergences
Geometric ergodicity
Uniform ergodicity
Theorem (Uniform ergodicity)
The following conditions are equivalent:
- sup_{x∈X} ‖Kⁿ(x, ·) − π‖_TV → 0
Limit theorems
Ergodicity determines the probabilistic properties of the average
behavior of the chain.
But there is also a need for statistical inference, made by induction
from the observed sample.
If ‖Pₓⁿ − π‖ is close to 0, this gives no direct information about
Xₙ ~ Pₓⁿ
We need LLNs and CLTs!!!
Classical LLNs and CLTs are not directly applicable due to:
- the Markovian dependence structure between the observations Xᵢ
- the non-stationarity of the sequence
The Theorem
Definition (reversibility)
A Markov chain (Xₙ) is reversible if, for all n, the distribution of
Xₙ₊₁ given Xₙ₊₂ = x equals the distribution of Xₙ₊₁ given Xₙ = x.
[Green, 1995]
The CLT
Theorem
If the Markov chain (Xₙ) is Harris recurrent and reversible,
(1/√N) Σ_{n=1}^{N} (h(Xₙ) − E^π[h]) →L N(0, γ_h²),
provided
0 < γ_h² = E^π[h²(X₀)] + 2 Σ_{k=1}^{∞} E^π[h(X₀)h(Xₖ)] < +∞.
Practical purposes?
Tools at hand
For MCMC algorithms, kernels are explicitly known.
Types of quantities (more or less directly) available:
- Minoration constants
  K^s(x, A) ≥ ε ν(A),   for all x ∈ C,
Coupling
If X ~ ξ and X′ ~ ξ′, one can construct two random variables X̃ and X̃′
such that X̃ ~ ξ, X̃′ ~ ξ′ and
X̃ = X̃′ with probability 1 − ‖ξ − ξ′‖_TV.
[Thorisson, 2000]
Coupling inequality
X, X′ r.v.s with probability distributions K(x, ·) and K(x′, ·),
respectively, can be coupled with probability ε if
K(x, ·) ∧ K(x′, ·) ≥ ε ν_{x,x′}(·)
where ν_{x,x′} is a probability measure, or, equivalently,
‖K(x, ·) − K(x′, ·)‖_TV ≤ (1 − ε)
Define an ε-coupling set as a set C̄ ⊂ X × X satisfying:
K(x, A) ≥ ε ν(A),  A ∈ B(X), (x, x′) ∈ C̄.   (3)
The coupled kernel P̄ preserves the marginals,
P̄(x, x′; X̃ ∈ A) = K(x, A) and P̄(x, x′; X̃′ ∈ A) = K(x′, A).
Coupling algorithm
If dₙ = 0 and (Xₙ, Xₙ′) ∈ C̄:
- with probability ε, draw Xₙ₊₁ = Xₙ₊₁′ ~ ν_{Xₙ,Xₙ′} and set
  dₙ₊₁ = 1.
- with probability 1 − ε, draw (Xₙ₊₁, Xₙ₊₁′) ~ R̄(Xₙ, Xₙ′; ·) and
  set dₙ₊₁ = 0.
If dₙ = 0 and (Xₙ, Xₙ′) ∉ C̄, then draw Xₙ₊₁ ~ K(Xₙ, ·) and
Xₙ₊₁′ ~ K(Xₙ′, ·) independently.
Drift conditions
To exploit the coupling construction, we need to control the hitting
time of the coupling set.
Moments of the return time to a set C are most often controlled
using the Foster-Lyapunov drift condition:
P V ≤ λV + b I_C,   V ≥ 1,
where V is a drift function for P (but not necessarily the best choice),
with a similar condition on the coupled chain for (x, x′) ∉ C̄.
Explicit bound
Theorem
For any distributions ξ and ξ′, and any j ≤ k:
‖ξPᵏ(·) − ξ′Pᵏ(·)‖_TV ≤ (1 − ε)^j + λᵏ B^(j−1) E_{ξ,ξ′,d₀=0}[V̄(X₀, X₀′)]
where
B = 1 ∨ λ⁻¹(1 − ε) sup_{C̄} R̄V̄.
[DMR, 2001]
Can we approximate E^π[g(X)] by the ergodic average
ḡₙ := (1/n) Σ_{i=0}^{n−1} g(Xᵢ) ?
Standard MC behaviour if a CLT holds,
√n (ḡₙ − E^π[g(X)]) →d N(0, γ_g²),
and there exists an easy-to-compute, consistent estimate of γ_g²...
Minoration
Assume that the kernel density K satisfies, for some density q(·),
ε ∈ (0, 1) and a small set C ⊂ X,
K(y|x) ≥ ε q(y) for all y ∈ X and x ∈ C
Then split K into a mixture
K(y|x) = ε q(y) + (1 − ε) R(y|x)
where R is the residual kernel
Split chain
Renewals
For X₀ ~ q and R successive renewals, define by τ₁ < … < τ_R the
renewal times. Then
√R [ḡ_R − E^π[g(X)]] = (1/N̄_R) (1/√R) Σ_{t=1}^{R} (Sₜ − Nₜ E^π[g(X)]),
where Nₜ is the length of the t-th excursion and Sₜ the sum of g over it, so that
- √R (ḡ_R − E^π[g]) →d N(0, σ_g²)
- there is a simple, consistent estimator of σ_g²
[Mykland & al., 1995; Robert, 1995]
Moment conditions
We need to show that, for the minoration condition, E_q[N₁²] and
E_q[S₁²] are finite. This holds if
1. the chain is geometrically ergodic, and
2. E^π[|g|^(2+ε)] < ∞ for some ε > 0.
Basics
The algorithm uses the objective (target) density
f
and a conditional density
q(y|x)
called the instrumental (or proposal) distribution
The MH algorithm
Algorithm (Metropolis–Hastings)
Given x^(t),
1. Generate Yₜ ~ q(y|x^(t)).
2. Take
X^(t+1) = Yₜ with probability ρ(x^(t), Yₜ), and X^(t+1) = x^(t) otherwise,
where
ρ(x, y) = min{ [f(y) q(x|y)] / [f(x) q(y|x)], 1 }
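The two steps above translate directly into code. A minimal Python sketch of the generic algorithm (the N(0,1) target and random-walk proposal are our illustrative choices):

```python
import math
import random

def metropolis_hastings(f, q_sample, q_pdf, x0, n):
    # Generic MH: propose Y ~ q(.|x), accept with probability
    # min{1, f(y) q(x|y) / (f(x) q(y|x))}
    chain, x = [], x0
    for _ in range(n):
        y = q_sample(x)
        ratio = f(y) * q_pdf(x, y) / (f(x) * q_pdf(y, x))
        if random.random() < min(1.0, ratio):
            x = y
        chain.append(x)
    return chain

random.seed(8)
# Target f proportional to exp(-x^2/2), random-walk proposal N(x, 1)
chain = metropolis_hastings(lambda x: math.exp(-x * x / 2),
                            lambda x: random.gauss(x, 1.0),
                            lambda y, x: math.exp(-(y - x) ** 2 / 2),
                            0.0, 50000)
```

Note that f and q need only be known up to normalizing constants, since these cancel in the ratio.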
Features
The chain (x(t) )t may take the same value several times in a
row, even though f is a density wrt Lebesgue measure
Convergence properties
1. The M-H Markov chain is reversible, with
invariant/stationary density f since it satisfies the detailed
balance condition
f(y) K(y, x) = f(x) K(x, y)
2. As f is a probability measure, the chain is positive recurrent
3. If
Pr[ f(Yₜ) q(X^(t)|Yₜ) ≥ f(X^(t)) q(Yₜ|X^(t)) ] < 1,   (1)
that is, the event {X^(t+1) = X^(t)} is possible, then the chain
is aperiodic
4. If the chain is furthermore irreducible (2), it is Harris recurrent
and ergodic:
(i) (1/T) Σ_{t=1}^{T} h(X^(t)) → ∫ h(x) f(x) dx,   a.e. f,
(ii) limₙ ‖ ∫ Kⁿ(x, ·) μ(dx) − f ‖_TV = 0 for every initial distribution μ.
2. Take
X^(t+1) = Yₜ with probability min{ [f(Yₜ) g(x^(t))] / [f(x^(t)) g(Yₜ)], 1 },
and X^(t+1) = x^(t) otherwise.
Properties
The resulting sample is not iid but there exist strong convergence
properties:
Theorem (Ergodicity)
The algorithm produces a uniformly ergodic chain if there exists a
constant M such that
f(x) ≤ M g(x),   x ∈ supp f.
In this case,
‖Kⁿ(x, ·) − f‖_TV ≤ (1 − 1/M)ⁿ.
Example (Noisy AR):
hidden chain xₜ = φxₜ₋₁ + εₜ, εₜ ~ N(0, σ²),
and observables
yₜ|xₜ ~ N(xₜ², τ²)
The distribution of xₜ given xₜ₋₁, xₜ₊₁ and yₜ is proportional to
exp −(1/2σ²){ (xₜ − φxₜ₋₁)² + (xₜ₊₁ − φxₜ)² + (σ²/τ²)(yₜ − xₜ²)² }.
An independent normal proposal can match the Gaussian part, with
μₜ = φ(xₜ₋₁ + xₜ₊₁)/(1 + φ²)
and
τₜ² = σ²/(1 + φ²).
Ratio
π(x)/q_ind(x) ∝ exp{−(yₜ − xₜ²)²/2τ²}
is bounded
(top) Last 500 realisations of the chain {Xₖ}ₖ out of 10,000
iterations; (bottom) histogram of the chain, compared with
the target distribution.
The corresponding acceptance ratio is
[π(θ′)/q(θ′|θ)] ÷ [π(θ)/q(θ|θ′)] = exp{−τ²(θ′² − θ²)/2} (1 + θ²)/(1 + θ′²).
[Figure: histogram and trace of the chain over 5000 iterations]
For a symmetric (random walk) proposal, the acceptance probability simplifies:
X^(t+1) = Yₜ with prob. min{1, f(Yₜ)/f(x^(t))}, and x^(t) otherwise.
δ     mean     variance
0.1   0.399    0.698
0.5   −0.111   1.11
1.0   0.10     1.06
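The influence of the step size δ on the random-walk sampler can be sketched as follows (the uniform-step proposal and N(0,1) target are our illustrative choices):

```python
import math
import random

def rw_mh(f, delta, n, x0=0.0):
    # Random-walk MH with uniform steps on (-delta, delta);
    # returns the chain and the observed acceptance rate
    x, acc, chain = x0, 0, []
    for _ in range(n):
        y = x + random.uniform(-delta, delta)
        if random.random() < min(1.0, f(y) / f(x)):
            x, acc = y, acc + 1
        chain.append(x)
    return chain, acc / n

random.seed(9)
f = lambda x: math.exp(-x * x / 2)
rates = {d: rw_mh(f, d, 20000)[1] for d in (0.1, 0.5, 1.0)}
```

Smaller δ gives higher acceptance but slower exploration of the target, which is the trade-off behind the table above.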
[Figure: random-walk Metropolis–Hastings samples for three scales δ, panels (a)–(c)]
Example (Mixture posterior)
π(θ|x₁, …, xₙ) ∝ ∏_{j=1}^{n} { Σ_{ℓ=1}^{k} p_ℓ f(xⱼ|μ_ℓ, σ_ℓ) } π(θ)
Metropolis-Hastings proposal:
θ^(t+1) = θ^(t) + ωε^(t) if u^(t) < ρ^(t), and θ^(t) otherwise,
where
ρ^(t) = π(θ^(t) + ωε^(t))/π(θ^(t)) ∧ 1
[Figure: MCMC sample for the mixture posterior in the (θ, τ) and (p, τ) planes]
A multivariate random-walk version uses εₜ ~ N_p(0, Σ) with Σ
calibrated from the data, e.g. Σ̂ ∝ (YᵀY)⁻¹.
Convergence properties
[Figure: Random-walk Metropolis–Hastings algorithms based on a N(0,1)
instrumental for the generation of (a) a N(0,1) distribution and (b) a
distribution with density ψ(x) ∝ (1 + |x|)⁻³]
[Figure: density estimate and trace over 10,000 iterations]
then,
S(f, C, r) := { x ∈ X : Eₓ[ Σ_{k=0}^{σ_C} r(k)h(Xₖ) ] < ∞ }
Comments
Alternative conditions
Extensions
Langevin Algorithms
Theorem
The Langevin diffusion is the only non-explosive diffusion which is
reversible with respect to f .
Discretization
Instead, consider the sequence
x^(t+1) = x^(t) + (σ²/2) ∇log f(x^(t)) + σεₜ,   εₜ ~ N_p(0, I_p)
MH correction
Accept the new value Yₜ with probability
[f(Yₜ)/f(x^(t))]
· exp{−‖x^(t) − Yₜ − (σ²/2)∇log f(Yₜ)‖²/2σ²}
/ exp{−‖Yₜ − x^(t) − (σ²/2)∇log f(x^(t))‖²/2σ²}
∧ 1.
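The discretised Langevin step with its MH correction (MALA) can be sketched in one dimension; the N(0,1) target, for which ∇log f(x) = −x, is our illustrative choice:

```python
import math
import random

def mala(logf, grad_logf, sigma, n, x0=0.0):
    # Langevin proposal y ~ N(x + (sigma^2/2) grad log f(x), sigma^2)
    # followed by the Metropolis-Hastings correction
    x, chain = x0, []
    for _ in range(n):
        mx = x + 0.5 * sigma ** 2 * grad_logf(x)
        y = random.gauss(mx, sigma)
        my = y + 0.5 * sigma ** 2 * grad_logf(y)
        logr = (logf(y) - logf(x)
                - (x - my) ** 2 / (2 * sigma ** 2)
                + (y - mx) ** 2 / (2 * sigma ** 2))
        if math.log(random.random()) < logr:
            x = y
        chain.append(x)
    return chain

random.seed(10)
chain = mala(lambda x: -x * x / 2, lambda x: -x, 1.0, 30000)
```

Working with log f and a log acceptance ratio avoids underflow for light-tailed targets.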
Related to the speed of convergence of
(1/T) Σ_{t=1}^{T} h(X^(t))
Practical implementation
Choose a parameterized instrumental distribution g(·|θ) and adjust
the corresponding parameters based on the evaluated acceptance rate
ρ̂(θ) = (2/m) Σ_{i=1}^{m} I_{f(yᵢ)g(xᵢ) > f(xᵢ)g(yᵢ)},
Simulation from
f(z|θ₁, θ₂) ∝ z^(p−1) exp{−θ₁z − θ₂/z} I_{ℝ₊}(z),
based on the Gamma distribution Ga(α, β) with β = √(θ₂/θ₁).
Since
f(x)/g(x) ∝ x^(p−α−1) exp{−(θ₁ − β)x − θ₂/x},
the ratio is bounded for β < θ₁ and is maximised at
x⋆ = [(p − α − 1) + √((p − α − 1)² + 4θ₂(θ₁ − β))] / 2(θ₁ − β),
while a fully analytic optimisation in (α, β) is impossible.
β     ρ̂(β)   E[Z]    E[1/Z]
0.2   0.22   1.137   1.116
0.5   0.41   1.158   1.108
0.8   0.54   1.164   1.116
0.9   0.56   1.154   1.115
1     0.60   1.133   1.120
1.1   0.63   1.148   1.126
1.2   0.64   1.181   1.095
1.5   0.71   1.148   1.115
Rule of thumb
General Principles
A very specific simulation algorithm based on the target
distribution f:
1. Uses the conditional densities f₁, …, fₚ from f
2. Start with the random variable X = (X₁, …, Xₚ)
3. Simulate from the conditional densities,
   Xᵢ|x₁, x₂, …, xᵢ₋₁, xᵢ₊₁, …, xₚ ~ fᵢ(xᵢ|x₁, x₂, …, xᵢ₋₁, xᵢ₊₁, …, xₚ)
   for i = 1, 2, …, p.
Given x^(t) = (x₁^(t), …, xₚ^(t)), generate
1. X₁^(t+1) ~ f₁(x₁|x₂^(t), …, xₚ^(t));
2. X₂^(t+1) ~ f₂(x₂|x₁^(t+1), x₃^(t), …, xₚ^(t)),
...
p. Xₚ^(t+1) ~ fₚ(xₚ|x₁^(t+1), …, x_{p−1}^(t+1))
Then X^(t+1) → X ~ f
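The sweep above can be sketched for the simplest non-trivial case, a bivariate normal with correlation ρ, whose two full conditionals are both normal:

```python
import math
import random

def gibbs_bivariate_normal(rho, n, burn=1000):
    # Alternate the two full conditionals of a standard bivariate
    # normal with correlation rho:
    #   X | y ~ N(rho*y, 1 - rho^2),  Y | x ~ N(rho*x, 1 - rho^2)
    x = y = 0.0
    sd = math.sqrt(1.0 - rho ** 2)
    out = []
    for t in range(n + burn):
        x = random.gauss(rho * y, sd)
        y = random.gauss(rho * x, sd)
        if t >= burn:
            out.append((x, y))
    return out

random.seed(11)
draws = gibbs_bivariate_normal(0.7, 30000)
```

The output reproduces the joint distribution: unit marginal variances and covariance ρ.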
Properties
Each component is drawn from its full conditional, Xₜ ~ f_{X|Y}(·|yₜ). But...
Example: for a normal sample,
μ|Y₁:ₙ, σ² ~ N( (1/n) Σ_{i=1}^{n} Yᵢ, σ²/n )
σ²|Y₁:ₙ, μ ~ IG( n/2 − 1, (1/2) Σ_{i=1}^{n} (Yᵢ − μ)² )
Hence we may use the Gibbs sampler for simulating from the
posterior of (μ, σ²)
[Figure: Gibbs sample after 1, 2, 3, 4, 5 and 10 iterations]
Note
The variable z may be meaningless for the problem
Completion Gibbs: given (y₁, …, yₚ)^(t), simulate
1. Y₁^(t+1) ~ g₁(y₁|y₂^(t), …, yₚ^(t)),
2. Y₂^(t+1) ~ g₂(y₂|y₁^(t+1), y₃^(t), …, yₚ^(t)),
...
p. Yₚ^(t+1) ~ gₚ(yₚ|y₁^(t+1), …, y_{p−1}^(t+1)).
then
X|Z ~ f(x|Z), and the means are simulated as
(μ₁|ξ₁ + n₁x̄₁, λ₁ + n₁) ⋯ (μₖ|ξₖ + nₖx̄ₖ, λₖ + nₖ),
with
nᵢ = Σⱼ I(zⱼ = i) and x̄ᵢ = Σ_{j; zⱼ=i} xⱼ/nᵢ.
2. Simulate (j = 1, …, n)
Zⱼ|xⱼ, p₁, …, pₖ, μ₁, …, μₖ ~ Σ_{i=1}^{k} pᵢⱼ I(zⱼ = i)
with (i = 1, …, k)
pᵢⱼ ∝ pᵢ f(xⱼ|μᵢ)
and update nᵢ and x̄ᵢ (i = 1, …, k).
[Figure: completion Gibbs output after T = 500, 1000, 2000, 3000, 4000 and 5000 iterations]
A wee problem
Motivation
The Random Scan Gibbs sampler is reversible.
If the target writes as a product,
f(θ) = ∏_{i=1}^{k} fᵢ(θ),
it can be completed as
g(θ, ω₁, …, ωₖ) ∝ ∏_{i=1}^{k} I_{0≤ωᵢ≤fᵢ(θ)},
1. ω₁^(t+1) ~ U[0, f₁(θ^(t))];
...
k. ωₖ^(t+1) ~ U[0, fₖ(θ^(t))];
k+1. θ^(t+1) ~ U_{A^(t+1)}, with
A^(t+1) = {θ : fᵢ(θ) ≥ ωᵢ^(t+1), i = 1, …, k}.
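For a single factor the steps above reduce to the plain slice sampler, sketched here for the illustrative target f(x) = exp(−x), x > 0, where the slice is an interval available in closed form:

```python
import math
import random

def slice_sampler_exp(n, x0=1.0):
    # Slice sampler for f(x) = exp(-x), x > 0:
    #   1. omega ~ U[0, exp(-x)]
    #   2. x ~ U on the slice {x > 0 : exp(-x) >= omega} = (0, -log omega)
    x, out = x0, []
    for _ in range(n):
        omega = random.uniform(0.0, math.exp(-x))
        x = random.uniform(0.0, -math.log(omega))
        out.append(x)
    return out

random.seed(12)
xs = slice_sampler_exp(50000)
```

The stationary distribution is Exp(1), so the empirical mean is close to 1.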
[Figure: slice sampler output after 2, 3, 4, 5 and 10 iterations]
Good slices
The slice is available in closed form if we set ω = −2 log u.
Note: Inversion of x² + exp(x) = ω needs to be done by
trial-and-error.
[Figure: density and histogram of the slice sampler output]
[Figure: autocorrelations of the slice sampler output up to lag 100]
and the ergodic averages converge,
(1/T) Σ_{t=1}^{T} h(Y^(t)) → ∫ h(y) g(y) dy   a.e. g,
with ‖·‖_TV → 0.
Slice sampler
For any ω⋆, the set {x : f(x) > ω⋆} is a small set:
K(x, ·) ≥ ε ν(·),
where
ν(A) = (1/ω⋆) ∫₀^{ω⋆} [μ(A ∩ L(ω))/μ(L(ω))] dω
and L(ω) = {x : f(x) ≥ ω}.
[Roberts & Rosenthal, 1998]
Theorem
For any density such that the slice measure μ(L(ω)) is non-increasing,
‖K⁵²³(x, ·) − f(·)‖_TV ≤ .0095
[Roberts & Rosenthal, 1998]
Example: for a target on x ∈ ℝᵈ, the slice sampler can be run on
(z) ∝ z^(d−1) e^(−z), z > 0, or on (u) ∝ e^(−u), u > 0, using the
1/d transform.
[Figure: runs and autocorrelation functions of the slice sampler in
dimensions 1, 10 and 20]
Hammersley-Clifford theorem
Theorem
If the joint density g(y₁, y₂) has conditional distributions
g₁(y₁|y₂) and g₂(y₂|y₁), then
g(y₁, y₂) = g₂(y₂|y₁) / ∫ [g₂(v|y₁)/g₁(y₁|v)] dv.
[Hammersley & Clifford, circa 1970]
General HC decomposition
g(y₁, …, yₚ) ∝ ∏_{j=1}^{p} g_{ℓⱼ}(y_{ℓⱼ}|y_{ℓ₁}, …, y_{ℓ_{j−1}}, y′_{ℓ_{j+1}}, …, y′_{ℓₚ})
Hierarchical models
Example: for i = 1, …, m,
λᵢ ~ Ga(α, βᵢ),   βᵢ ~ IG(a, b),
the full conditional is
βᵢ ~ π(βᵢ|x, α, a, b, λᵢ) = IG(α + a, [λᵢ + 1/b]⁻¹)
Example (Rats)
Experiment where rats are intoxicated by a substance, then treated
by either a placebo or a drug:
xᵢⱼ ~ N(θᵢ, σ_c²),   1 ≤ j ≤ Jᵢᶜ,   control
yᵢⱼ ~ N(θᵢ + δᵢ, σₐ²),   1 ≤ j ≤ Jᵢᵃ,   intoxication
zᵢⱼ ~ N(θᵢ + δᵢ + ξᵢ, σₜ²),   1 ≤ j ≤ Jᵢᵗ,   treatment
with
θᵢ ~ N(μ_θ, σ_θ²),   δᵢ ~ N(μ_δ, σ_δ²),
and
ξᵢ ~ N(μ_P, σ_P²) or ξᵢ ~ N(μ_D, σ_D²),
Conditional decompositions
Easy decomposition of the posterior: for π(θ) = π₁(θ|θ₁)π₂(θ₁),
π(θ|θ₁, x) = f(x|θ)π₁(θ|θ₁)/m₁(x|θ₁),
π(θ₁|x) = m₁(x|θ₁)π₂(θ₁)/m(x),
where
m(x) = ∫_{Θ₁} m₁(x|θ₁)π₂(θ₁) dθ₁,
and
E^π[h(θ)|θ₁, x] = ∫ h(θ)π(θ|θ₁, x) dθ.
The full posterior factors as
∏ᵢ { ∏_{j=1}^{Jᵢᶜ} exp{−(xᵢⱼ − θᵢ)²/2σ_c²}
     ∏_{j=1}^{Jᵢᵃ} exp{−(yᵢⱼ − θᵢ − δᵢ)²/2σₐ²}
     ∏_{j=1}^{Jᵢᵗ} exp{−(zᵢⱼ − θᵢ − δᵢ − ξᵢ)²/2σₜ²} }
× ∏_{ℓᵢ=0} exp{−(ξᵢ − μ_P)²/2σ_P²} ∏_{ℓᵢ=1} exp{−(ξᵢ − μ_D)²/2σ_D²}
× σ_c^(−Σᵢ Jᵢᶜ − 1) σₐ^(−Σᵢ Jᵢᵃ − 1) σₜ^(−Σᵢ Jᵢᵗ − 1) σ_D^(−I_D − 1) σ_P^(−I_P − 1),
so every full conditional is standard. Moreover we have
π(θᵢ|x, σ, θ₁, …, θₙ) = π(θᵢ|θᵢ₋₁, θᵢ₊₁)
with the convention θ₀ = μ and θₙ₊₁ = 0.
[Figure: Gibbs output for the rats example (placebo vs. drug)]

            Probability   Confidence
δ           1.00          [−3.48, −2.17]
δ_D         0.9998        [0.94, 2.50]
δ_P         0.94          [−0.17, 1.24]
δ_D − δ_P   0.985         [0.14, 2.20]
Data Augmentation
The Gibbs sampler with only two steps is particularly useful
1. Simulate Y₁^(t+1) ~ g₁(y₁|y₂^(t));
2. Simulate Y₂^(t+1) ~ g₂(y₂|y₁^(t+1)).
Example (Grouped Poisson data)
The observed likelihood is proportional to
λ^(313) e^(−347λ) (1 − e^(−λ) Σ_{i=0}^{3} λⁱ/i!)^(13)
Data augmentation: complete the 13 censored observations (known
only to satisfy y ≥ 4) and iterate
a. Simulate Yᵢ^(t) ~ P(λ^(t−1)) I_{y≥4},   i = 1, …, 13
b. Simulate λ^(t) ~ Ga(313 + Σ_{i=1}^{13} yᵢ^(t), 360)
[Figure: histogram of the λ^(t) sample and convergence of the
Rao-Blackwellised average over 500 iterations]
The Rao-Blackwellised estimate is
δ = (1/360T) Σ_{t=1}^{T} (313 + Σ_{i=1}^{13} yᵢ^(t)).
Rao-Blackwellization
If (y₁, y₂, …, yₚ)^(t), t = 1, 2, …, T is the output from a Gibbs
sampler,
δ₀ = (1/T) Σ_{t=1}^{T} h(y₁^(t)) → ∫ h(y₁) g(y₁) dy₁
and is unbiased.
The Rao-Blackwellization replaces δ₀ with its conditional
expectation
δ_rb = (1/T) Σ_{t=1}^{T} E[h(Y₁)|y₂^(t), …, yₚ^(t)].
Rao-Blackwellization (2)
Then
Both estimators converge to E[h(Y₁)]
Both are unbiased,
and
var( E[h(Y₁)|Y₂^(t), …, Yₚ^(t)] ) ≤ var(h(Y₁)),
Examples of Rao-Blackwellization
Example
Bivariate normal Gibbs sampler:
X | y ~ N(ρy, 1 − ρ²)
Y | x ~ N(ρx, 1 − ρ²).
Then
δ₀ = (1/T) Σ_{i=1}^{T} X^(i)
and
δ₁ = (1/T) Σ_{i=1}^{T} E[X^(i)|Y^(i)] = (1/T) Σ_{i=1}^{T} ρY^(i),
with variance ratio 1/ρ² > 1.
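The two estimators in the bivariate normal example can be checked numerically; a minimal Python sketch running the Gibbs chain and computing both:

```python
import math
import random

def gibbs_estimates(rho, T):
    # delta0 averages the X draws; delta1 averages E[X|Y] = rho*Y
    x = y = 0.0
    sd = math.sqrt(1.0 - rho ** 2)
    s0 = s1 = 0.0
    for _ in range(T):
        x = random.gauss(rho * y, sd)
        y = random.gauss(rho * x, sd)
        s0 += x
        s1 += rho * y
    return s0 / T, s1 / T

random.seed(13)
d0, d1 = gibbs_estimates(0.7, 30000)
# both are unbiased estimates of E[X] = 0
```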
In the grouped Poisson example, the Rao-Blackwellised estimate
δ = (1/360T) Σ_{t=1}^{T} E[λ^(t)|x₁, …, x₅, y₁^(t), …, y₁₃^(t)]
  = (1/360T) Σ_{t=1}^{T} (313 + Σ_{i=1}^{13} yᵢ^(t))
is unbiased.
Note
The chain (Y^(t)) can be discrete, and the chain (X^(t)) continuous.
Improper Priors
Conditionals X₁|x₂ ~ Exp(x₂) and X₂|x₁ ~ Exp(x₁) do not correspond
to any joint probability distribution.
Example (Random effects): for
yᵢⱼ = μ + αᵢ + εᵢⱼ,   i = 1, …, I, j = 1, …, J,
where
αᵢ ~ N(0, σ²) and εᵢⱼ ~ N(0, τ²),
the Jeffreys (improper) prior for the parameters μ, σ and τ is
π(μ, σ², τ²) = 1/σ²τ².
The conditionals are
αᵢ|y, μ, σ², τ² ~ N( J(ȳᵢ − μ)/(J + τ²σ⁻²), (Jτ⁻² + σ⁻²)⁻¹ ),
μ|α, y, σ², τ² ~ N(ȳ − ᾱ, τ²/JI),
σ²|α, μ, y, τ² ~ IG( I/2, (1/2) Σᵢ αᵢ² ),
τ²|α, μ, y, σ² ~ IG( IJ/2, (1/2) Σᵢⱼ (yᵢⱼ − αᵢ − μ)² ).
[Figure: histogram of the observations and evolution of (μ^(t)) over 1000 iterations]
Example
The random effects model was initially treated in Gelfand et al.
(1990) as a legitimate model
variable selection
image analysis
causal inference
[Figure: histogram of the 82 galaxy velocities]
Velocities modelled as a mixture with an unknown number of components,
yⱼ ~ Σ_{ℓ=1}^{i} p_{ℓi} N(μ_{ℓi}, σ_{ℓi}²)   (j = 1, …, 82),
with i unknown: which i?
Bayesian solution
Formally over:
1. Compute
p(Mᵢ|x) = pᵢ ∫_{Θᵢ} fᵢ(x|θᵢ)πᵢ(θᵢ) dθᵢ / Σⱼ pⱼ ∫_{Θⱼ} fⱼ(x|θⱼ)πⱼ(θⱼ) dθⱼ
2. Use as predictive the corresponding model average
[Different decision theoretic perspectives]
Difficulties
Not at the conceptual level [see above]
At the computational level:
infinity of models, moves between models, predictive
computation
Green's resolution
Set up a common measure-theoretic framework across models, with
integrals decomposing as Σ_{m=1}^{∞} ∫_{Θₘ}, so that the acceptance
probability
ρₘ(x, y) = min{ 1, gₘ(y, x)/gₘ(x, y) }
ensures reversibility.
Special case
When contemplating a move between two models, M₁ and M₂,
the Markov chain being in state θ₁ ∈ M₁, denote by K₁→₂(θ₁, dθ)
and K₂→₁(θ₂, dθ) the corresponding kernels, under the detailed
balance condition
π(dθ₁) K₁→₂(θ₁, dθ) = π(dθ₂) K₂→₁(θ₂, dθ),
and take, wlog, dim(M₂) > dim(M₁).
Proposal expressed as
θ₂ = Ψ₁→₂(θ₁, v₁→₂)
where v₁→₂ is a random variable of dimension
dim(M₂) − dim(M₁), generated as
v₁→₂ ~ φ₁→₂(v₁→₂).
Interpretation (1)
Interpretation (2)
Consider, instead, the proposals
θ2 ∼ N(Ψ12(θ1, v12), ε)
and
Ψ12(θ1, v12) ∼ N(θ2, ε),
so that the densities of (θ1, v12) and of θ2 correspond through
the Jacobian ∂Ψ12(θ1, v12)/∂(θ1, v12), by the Jacobian rule.
Thus the Metropolis–Hastings acceptance probability is
min{ 1, π(M2, θ2)/[π(M1, θ1) φ12(v12)] × |∂Ψ12(θ1, v12)/∂(θ1, v12)| }.
Saturation
[Brooks, Giudici, Roberts, 2003]
Consider a series of models Mi (i = 1, . . . , k) such that
max_i dim(Mi) = nmax < ∞
and complete each θi with an auxiliary variable Ui ∼ qi(ui) so that all models share the saturated dimension nmax.
For a mixture with k components,
Σ_{j=1}^k p_jk N(μ_jk, σ²_jk),
(i) Split: impose the moment conditions
p_jk = p_j(k+1) + p_(j+1)(k+1),
p_jk μ_jk = p_j(k+1) μ_j(k+1) + p_(j+1)(k+1) μ_(j+1)(k+1),
p_jk σ²_jk = p_j(k+1) σ²_j(k+1) + p_(j+1)(k+1) σ²_(j+1)(k+1).
(ii) Merge (reverse): draw
u1, u2, u3 ∼ U(0, 1)
and take
p_j(k+1) = u1 p_jk,   μ_j(k+1) = u2 μ_jk,   σ²_j(k+1) = u3 σ²_jk.
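Once the first child component is drawn via (u1, u2, u3), the moment conditions determine the second child deterministically. A quick numerical check of this, with arbitrary parent values:

```python
import math
import random

random.seed(1)
p, mu, v = 0.3, 1.5, 2.0      # parent component (p_jk, mu_jk, sigma^2_jk)
u1, u2, u3 = (random.random() for _ in range(3))
p1, mu1, v1 = u1 * p, u2 * mu, u3 * v  # first child, as in the split move
p2 = p - p1                            # solves the weight condition
mu2 = (p * mu - p1 * mu1) / p2         # solves the mean condition
v2 = (p * v - p1 * v1) / p2            # solves the variance condition

assert math.isclose(p1 + p2, p)
assert math.isclose(p1 * mu1 + p2 * mu2, p * mu)
assert math.isclose(p1 * v1 + p2 * v2, p * v)
assert p2 > 0 and v2 > 0  # always well defined: p*v - p1*v1 = p*v*(1 - u1*u3) > 0
print(p2, mu2, v2)
```

Because u1 u3 < 1, the implied second variance is always positive, so the split proposal never leaves the parameter space.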
[Figure: histogram of k and raw plot of k over 10⁵ iterations]
Hidden Markov model: a hidden chain · · · → Xt → Xt+1 → · · · with observables
Yt | Xt = i ∼ N(μi, σi²),
each Yt depending only on the current Xt.
ij2 = ij? (1 i ),
j1 j = j? j j ,
j2 j = j? j /j ,
i U(0, 1);
j2 = j? + 3j? ,
j22 = j2 / ,
N (0, 1);
[Figure]
Upper panel: First 40,000 values of k for S&P 500 data, plotted
every 20th sweep. Middle panel: estimated posterior distribution
of k for S&P 500 data as a function of number of sweeps. Lower
panel: μ1 and μ2 in first 20,000 sweeps with k = 2 for S&P 500
data.
∏_{i=1}^p (1 − λi B) Xt = εt,   εt ∼ N(0, σ²),
where the λi's are within the unit circle if complex and within
[−1, 1] if real.
[Huerta and West, 1998]
Since complex λi ∉ R come in conjugate pairs, the moves on the order are
k → k + 1, k → k + 2, k → k − 1, k → k − 2.
k k-2
200
300
400
500
0 1 2 3 4 5
0
10
15
20
0.4
p=5
0.2
0.0
0.2
0.4
0.2
0.0
0.2
0.4
0.256
0.2
0.0
0.2
0.4
0.6
Im(i)
0.4
1.2
0.4
0.412
8
6
4
2
1.1
2
0.4
means
0.0817
1.0
0.2
0 2 4 6 8 10
0.4
p=3
mean 0.997
0.9
0.0
p=4
1000
2000
3000
Iterations
4000
5000
0.2
0.4
0.2
p=3
0 1 2 3 4 5 6 7
100
0 1 2 3 4 5
0.6 0.2
xt
Greens method
0.1
0.0
0.1
0.2
0.3
0.4
0.5
Re(i)
instant death!
Balance condition
for all θ, θ′, the rates must balance for
π(θ) ∝ L(θ)π(θ)
to be stationary. Here q(θ, θ′) is the rate of moving from state θ to θ′.
Possibility to add split/merge and fixed-k processes if balance
condition satisfied.
Total death rate:
Σ_{j=1}^k δj(θ) = δ(θ).
Balance condition
(k + 1) δ(θ ∪ {p, (μ, σ)}) L(θ ∪ {p, (μ, σ)}) = β0 L(θ),
with
δ(θ \ {pj, (μj, σj)}) = δj(θ).
In general,
δj(θ) = β0 · L(θ \ {pj, (μj, σj)})/L(θ) · π(k)/[(k + 1) π(k + 1)],
which, in the case of a Poisson prior k ∼ Poi(λ1), reduces to
δj(θ) = (β0/λ1) · L(θ \ {pj, (μj, σj)})/L(θ).
Algorithm (Birth-and-death)
Initialise t ← 0 and run till t > v:
1. Compute δj(θ) = (β0/λ1) L(θ \ {pj, (μj, σj)})/L(θ), j = 1, . . . , k
2. Compute λ(θ) = β0 + Σ_{j=1}^k δj(θ)
3. Set t ← t − log(u)/λ(θ), u ∼ U(0, 1)
4. With probability β0/λ(θ), create a new component (birth);
Otherwise, kill component j with probability δj(θ)/λ(θ)
5. Run I MCMC(k, θ, p)
Rescaling time
[move to HMM]
Put
βk = λk/N   and   δk = 1 − λk/N.
As N → ∞, each birth proposal will be accepted and, having k components, births occur according to a
Poisson process with rate λk while component (w, φ) dies with rate
lim_{N→∞} (N/(k + 1)) min(A⁻¹, 1)
= (likelihood ratio)⁻¹ × (λk/(k + 1)) × b(w, φ)/(1 − w)^{k−1}.
ij2 = ij (1 i ),
j1 j = j j j ,
j2 j = j j /j ,
i U(0, 1);
j2 = j + 3j ,
j22 = j2 / ,
N (0, 1);
[Figure: relative frequencies of the number of states and of the number of moves, log-likelihood trace, and parameter traces (temp[, 1], temp[, 2], temp[, 3]) against iteration instants]
From basic importance sampling,
δ̄t = (1/t) Σ_{i=1}^t δ^(i)
and
σt² = (1/t) Σ_{i=1}^t (δ^(i) − δ̄t)².
[Figure: Adaptive MCMC — traces and histograms of the simulated values over 5,000 and 50,000 iterations]
Warning:
One should not constantly adapt the proposal to past
performances.
Either adaptation ceases after a burn-in period,
or the adaptive scheme must be theoretically assessed in its own
right.
Approximate
∫ h(x) π(x) dx
by unbiased estimators
Î = (1/n) Σ_{i=1}^n ϱi h(xi)
when
x1, . . . , xn ∼iid q(x)
and
ϱi =def π(xi)/q(xi).
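A minimal sketch of this estimator, assuming target π = N(0, 1), a hypothetical proposal q = N(0, 4), and h(x) = x², so the true value of the integral is 1:

```python
import math
import random

random.seed(3)

def npdf(x, mu=0.0, sd=1.0):
    # normal density N(mu, sd^2)
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

n = 100_000
xs = [random.gauss(0.0, 2.0) for _ in range(n)]       # x_i ~ q = N(0, 4)
rho = [npdf(x) / npdf(x, 0.0, 2.0) for x in xs]       # weights pi(x_i)/q(x_i)
i_hat = sum(r * x * x for r, x in zip(rho, xs)) / n   # unbiased estimate of E[X^2]
print(i_hat)  # close to 1
```

A wider-than-target proposal keeps the weights bounded; the reverse choice (q lighter-tailed than π) would make the weight variance blow up.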
Markov extension
For densities f and g, and importance weight
ω(x) = f(x)/g(x),
for any kernel K(x, x′) with stationary distribution f,
∫ ω(x) K(x, x′) g(x) dx = f(x′).
[MacEachern, Clyde, and Liu, 1999]
Consequence: An importance sample transformed by MCMC
transitions keeps its weights
Unbiasedness preservation:
E[ω(X) h(X′)] = ∫∫ ω(x) h(x′) K(x, x′) g(x) dx dx′ = Ef[h(X)].
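A numerical illustration of weight preservation, as a sketch with f = N(0, 1), g = N(0, 4), and one random-walk Metropolis step targeting f applied to each particle while the weights are left untouched:

```python
import math
import random

random.seed(5)

def npdf(x, sd=1.0):
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

n = 100_000
xs = [random.gauss(0.0, 2.0) for _ in range(n)]  # x_i ~ g = N(0, 4)
ws = [npdf(x) / npdf(x, 2.0) for x in xs]        # omega(x) = f(x)/g(x)

moved = []
for x in xs:                                     # one MH step, kernel K with
    y = x + random.gauss(0.0, 1.0)               # stationary distribution f
    moved.append(y if random.random() < npdf(y) / npdf(x) else x)

# weights are NOT recomputed, yet the estimator still targets E_f[X^2] = 1
est = sum(w * x * x for w, x in zip(ws, moved)) / n
print(est)  # close to 1
```

The particles have moved but the original weights remain valid, which is exactly the content of the MacEachern–Clyde–Liu identity.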
Not so exciting!
Idea
It is possible to generalise importance sampling using random
weights ωt such that
E[ωt | xt] = π(xt)/g(xt).
Unknown
w(x) = π(x)/p(x)
Known
Acceptance function
α(x) = 1/(1 + c w(x)),   c > 0.
Geometric jumps
Theorem
If
Y ∼ p(y)
and
W | Y = y ∼ Geo(α(y)),
then
Xt = · · · = Xt+W−1 = Y ≠ Xt+W
defines a Markov chain with stationary distribution π.
Plusses
A generalisation
[Gåsemyr, 2002]
Proposal density p(y) and probability q(y) of accepting a jump.

Algorithm (Gåsemyr's dynamic weights)
Generate a sequence of random weights Wn by
1. Generate Yn ∼ p(y)
2. Generate Vn ∼ B(q(yn))
3. Generate Sn ∼ Geo(α(yn))
4. Take Wn = Vn Sn
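A sketch of the scheme with hypothetical choices π = N(0, 1), p = N(0, 0.64) and q ≡ 1 (so Vn ≡ 1 and Wn = Sn), using α(y) = q(y)/(c w(y)) with c = 1.25 chosen so that p(y) ≤ c π(y), hence α(y) ≤ 1, everywhere:

```python
import math
import random

random.seed(9)

def npdf(x, sd=1.0):
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

sp, c = 0.8, 1.25   # proposal N(0, 0.64); c chosen so p(y) <= c pi(y)
num = den = 0.0
for _ in range(200_000):
    y = random.gauss(0.0, sp)               # Y_n ~ p
    alpha = npdf(y, sp) / (c * npdf(y))     # = q(y)/(c w(y)) with q = 1
    w = 1                                   # S_n ~ Geo(alpha), so E[W | y] = c w(y)
    while random.random() > alpha:
        w += 1
    num += w * y * y
    den += w
print(num / den)   # weighted estimate of E_pi[X^2] = 1
```

Note the "reverse accept-reject" flavour: here the target must dominate the proposal (c π ≥ p q), the opposite of the usual envelope condition.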
Validation
[direct to PMC]
π(y) = p(y)q(y) / ∫ p(y)q(y) dy,

Ergodicity for Gåsemyr's scheme
Necessary and sufficient condition
π is stationary for (Xt) iff
α(y) = q(y)/(c π(y)/p(y)) = q(y)/(c w(y))
for some constant c.
Implies that
E[Wn | Yn = y] = c w(y).
[Average importance sampling]
Special case: α(y) = 1/(1 + w(y)) of Sahu and Zhigljavsky (2001)
Properties
Constraint on c: for α(y) ≤ 1, c must be such that
p(y)q(y) ≤ c π(y).
Reverse of accept-reject conditions (!)
Variance of
Σn Wn h(Yn) / Σn Wn
is
∫ ((h(y) − I)²/q(y)) w(y) π(y) dy,
by Cramér–Wold/Slutsky.
Still worse than importance sampling.
With the weight-update ratio
ϱ(ω, x, y) = ω π(y)K(y, x) / (π(x)K(x, y)),
the new state (x′, ω′) is the proposal if u < a and the current value otherwise.
Invariance check:
∫ ω′ g+(x′, ω′) dω′
= ∫ (1 + θ)[ϱ(ω, x, x′) + θ] (ϱ(ω, x, x′)/(ϱ(ω, x, x′) + θ)) g(x, ω) K(x, x′) dx dω
+ (1 + θ) ∫ (ϱ(ω, x′, z) + θ) (θ/(ϱ(ω, x′, z) + θ)) g(x′, ω) K(x′, z) dz dω
= (1 + θ) ∫ ω (π(x′)K(x′, x)/π(x)) g(x, ω) dx dω + θ(1 + θ) ∫ g(x′, ω) K(x′, z) dz dω
= (1 + θ) [ ∫ π(x′) c0 K(x′, x) dx + c0 π(x′) ]
= 2(1 + θ) c0 π(x′).
[Liang, 2002]
For θ = 1, a = ϱ/(ϱ + 1) and thus
(x′, ω′) = { (y, ϱ + 1)   if u < ϱ/(ϱ + 1);   (x, (ϱ + 1))   otherwise }.
[Importance sampling]
For θ → 0, a → 1 and
(x′, ω′) = (y, ϱ).
Q-move
[Liu & al, 2001]
(x′, ω′) = { (y, ϱ)   if u < 1 ∧ ϱ/θ;   (x, aω)   otherwise }.
Notes
Notes (2)
The weight ratio ωt/ωt+1 is tied to the acceptance probability
ωt r(xt, yt) / (ωt r(xt, yt) + θ),   θ > 0,
and, with ω0 = 1, telescoping gives
E[ ∏_{j=0}^{t−1} ωj/ωj+1 ] = E[1/ωt].
Alternative scheme
E[ ∏_{j=0}^{t−1} (ωj/ωj+1) × ωt−1 r(x0, Yt)(1 − at)/ωt ].
Example
Choose a function 0 < a(·, ·) < 1 and take, while in (x0, ω0),
(x1, ω1) = (y1, ω0 r(x0, y1)/a(x0, y1))   with probability a(x0, y1)
and
(x1, ω1) = (x0, ω0/(1 − a(x0, y1)))   otherwise.
Idea
Simulate from the product distribution
π⊗n(x1, . . . , xn) = ∏_{i=1}^n π(xi)
with dynamic kernels
xi^(t) ∼ qt(x | xi^(t−1)),   i = 1, . . . , n,   t = 1, . . .
and importance-sampling estimator
Ît = (1/n) Σ_{i=1}^n ϱi^(t) h(xi^(t)),
where
ϱi^(t) = π(xi^(t)) / qt(xi^(t) | xi^(t−1)),   i = 1, . . . , n.
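A minimal population Monte Carlo sketch, assuming target π = N(0, 1), a random-walk kernel qt(x | y) = N(y, 4), and multinomial resampling between iterations:

```python
import math
import random

random.seed(11)

def npdf(x, mu=0.0, sd=1.0):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

n, T = 5_000, 10
pop = [random.gauss(0.0, 3.0) for _ in range(n)]   # initial particle cloud
for t in range(T):
    prop = [random.gauss(x, 2.0) for x in pop]                  # x_i^(t) ~ q_t(.|x_i^(t-1))
    w = [npdf(x) / npdf(x, y, 2.0) for x, y in zip(prop, pop)]  # rho_i = pi/q_t
    s = sum(w)
    wbar = [v / s for v in w]                                   # self-normalised weights
    i_hat = sum(wb * x * x for wb, x in zip(wbar, prop))        # estimate of E[X^2]
    pop = random.choices(prop, weights=wbar, k=n)               # multinomial resampling
print(i_hat)   # close to E_pi[X^2] = 1
```

The kernel here depends on the previous particle, exactly as allowed by the PMC framework, and resampling keeps the population concentrated where π puts mass.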
Then
E[ϱ^(t) h(X^(t))] = E[ h(X^(t)) π(X^(t)) / qt(X^(t) | X^(t−1)) ]
= ∫∫ h(x) (π(x)/qt(x|y)) qt(x|y) g(y) dx dy
= ∫ h(x) π(x) dx,
if var(ϱi^(t)) exists, because the xi^(t)'s are conditionally uncorrelated.
Note
The importance distribution of the sample (a.k.a. particles) x^(t),
qt(x^(t) | x^(t−1)),
can depend on the previous sample x^(t−1) in any possible way as
long as the marginal distributions
qit(x) = ∫ qt(x^(t)) dx^(t)_{−i}
can be expressed to build importance weights
ϱit = π(xi^(t)) / qit(xi^(t)).
If
qt(x^(t) | x^(t−1)) = ∏_{i=1}^n qit(xi^(t) | x^(t−1)),
then
var(Ît) = (1/n²) Σ_{i=1}^n var( ϱi^(t) h(xi^(t)) ).
Validation
[skip validation]
E[ ϱi^(t) h(Xi^(t)) ϱj^(t) h(Xj^(t)) ]
= ∫ h(xi) (π(xi)/qit(xi | x^(t−1))) h(xj) (π(xj)/qjt(xj | x^(t−1)))
× qit(xi | x^(t−1)) qjt(xj | x^(t−1)) dxi dxj g(x^(t−1)) dx^(t−1)
= E[h(X)]².
Self-normalised version
ϱi^(t) ∝ π(xi^(t)) / qit(xi^(t)),   i = 1, . . . , n,
is scaled so that
Σi ϱi^(t) = 1.
Note
Obviously, the xi^(t)'s are not iid.
Incentive
Use previous sample(s) to learn about π and q.
D-kernels in competition
A general adaptive construction:
Construct qi,t as a mixture of D different transition kernels
depending on xi^(t−1),
qi,t = Σ_{ℓ=1}^D pt,ℓ Kℓ(xi^(t−1), x),   Σ_{ℓ=1}^D pt,ℓ = 1.

Example
Take pt,ℓ proportional to the survival rate of the points
(a.k.a. particles) xi^(t) generated from Kℓ.
Implementation
Algorithm (D-kernel PMC)
For t = 1, . . . , T
generate (Ki,t)_{1≤i≤N} ∼ M((pt,k)_{1≤k≤D})
for 1 ≤ i ≤ N, generate
x̃i,t ∼ K_{Ki,t}(x)
compute and renormalize the importance weights ϱ̄i,t
generate (Ji,t)_{1≤i≤N} ∼ M((ϱ̄i,t)_{1≤i≤N})
take xi,t = x̃_{Ji,t},t and pt+1,d = Σ_{i=1}^N ϱ̄i,t I_d(Ki,t)
Conclusion
At each iteration, every weight converges to 1/D:
the algorithm fails to learn from experience!!
Saved by Rao-Blackwell!!
Use the mixture density in the weights,
ϱi,t = π(x̃i,t) / Σ_{d=1}^D pt,d Kd(xi,t−1, x̃i,t),
instead of
ϱi,t = π(x̃i,t) / K_{Ki,t}(xi,t−1, x̃i,t).
Adapted algorithm
Algorithm (Rao-Blackwellised D-kernel PMC)
At time t (t = 1, . . . , T),
Generate (Ki,t)_{1≤i≤N} iid ∼ M((pt,d)_{1≤d≤D});
Generate (x̃i,t)_{1≤i≤N} iid ∼ K_{Ki,t}(xi,t−1, x)
and set ϱi,t = π(x̃i,t) / Σ_{d=1}^D pt,d Kd(xi,t−1, x̃i,t).
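A toy sketch of the Rao-Blackwellised scheme, with D = 2 hypothetical independence kernels N(−2, 1) and N(2, 1) (so the kernels do not depend on xi,t−1) targeting an equal two-component mixture; the kernel weights should move from the deliberately bad (0.9, 0.1) toward (0.5, 0.5):

```python
import math
import random

random.seed(13)

def npdf(x, mu):
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def target(x):  # equal mixture of N(-2, 1) and N(2, 1)
    return 0.5 * npdf(x, -2.0) + 0.5 * npdf(x, 2.0)

mus = [-2.0, 2.0]          # D = 2 independence kernels K_d = N(mu_d, 1)
p = [0.9, 0.1]             # unbalanced initial mixture weights
N, T = 2_000, 10
for t in range(T):
    ws, post = [], []
    for _ in range(N):
        d = 0 if random.random() < p[0] else 1       # K_i,t ~ M(p_t)
        x = random.gauss(mus[d], 1.0)
        mix = p[0] * npdf(x, mus[0]) + p[1] * npdf(x, mus[1])
        ws.append(target(x) / mix)                   # Rao-Blackwellised weight
        post.append(p[0] * npdf(x, mus[0]) / mix)    # P(K = 1 | x)
    s = sum(ws)
    wbar = [w / s for w in ws]
    p0 = sum(wb * pr for wb, pr in zip(wbar, post))  # updated kernel weight
    p = [p0, 1.0 - p0]
print(p)   # close to [0.5, 0.5]
```

Dividing by the full mixture Σd pt,d Kd rather than the selected kernel is what allows the weights pt,d to move, in line with the Kullback-divergence decrease below.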
Convergence properties
Theorem (LLN)
Under regularity assumptions, for h ∈ L¹ and for every t ≥ 1,
(1/N) Σ_{i=1}^N ϱ̄i,t h(xi,t) →P π(h)
and
pt,d →P p^t_d,
where the limiting weights follow the recursion
p^{t+1}_d = F(p^t)_d = ∫ [ p^t_d Kd(x, x′) / Σ_{j=1}^D p^t_j Kj(x, x′) ] π̄(dx, dx′)
on the simplex
S = { α = (α1, . . . , αD); αd ≥ 0, 1 ≤ d ≤ D, and Σ_{d=1}^D αd = 1 }.
Kullback divergence
KL(α) = ∫ log[ π(x)π(x′) / (π(x) Σ_{d=1}^D αd Kd(x, x′)) ] π̄(dx, dx′),
and, for every α ∈ SD,
KL(F(α)) ≤ KL(α).
Conclusion
The Kullback divergence decreases at every iteration of RBDPMCA.
An integrated EM interpretation
[skip interpretation]
We have
αmin = arg max_{α∈S} ∫ log pα(x̄) π̄(dx̄),   pα(x̄) = ∫ pα(x̄, K) dK,
for x̄ = (x, x′) and K ∼ M((αd)_{1≤d≤D}). Then αt+1 = F(αt)
means
αt+1 = arg max_α ∫ Eαt[ log pα(X̄, K) | X̄ = x̄ ] π̄(dx̄)
and
lim_t αt = αmin.
Illustration
Example (A toy example)
Take the target
1/4 N(1, 0.3)(x) + 1/4 N(0, 1)(x) + 1/2 N(3, 2)(x)
and use 3 proposals: N(1, 0.3), N(0, 1) and N(3, 2)
[Surprise!!!]
Then the weight evolution is

t    pt,1        pt,2         pt,3
1    0.0500000   0.05000000   0.9000000
2    0.2605712   0.09970292   0.6397259
6    0.2740816   0.19160178   0.5343166
10   0.2989651   0.19200904   0.5090259
16   0.2651511   0.24129039   0.4935585
Details
Starting from θ^(0), move the particles by
(μ1)j^(i) ∼ N((μ1)j^(i−1), vj),   (μ2)j^(i) ∼ N((μ2)j^(i−1), vj),
then resample the ((μ1)j, (μ2)j)'s using the weights ϱj^(i).
[Figure: resampling output — traces of (μ1, μ2) and of Var(1), Var(2) over 500 iterations]