This action might not be possible to undo. Are you sure you want to continue?
Cosma and Ludger Evers
Markov Chains and Monte Carlo Methods
Lecture Notes
March 12, 2010
African Institute for Mathematical Sciences
Department of Pure Mathematics
and Mathematical Statistics
These notes are based on notes written originally by Adam M. Johansen and Ludger Evers for a lecture course on
Monte Carlo Methods given at the University of Bristol in 2007.
Commons Deed
You are free:
to copy, distribute and display the work.
to make derivative works.
Under the following conditions:
Attribution. You must give the original authors credit.
Noncommercial. You may not use this work for commercial purposes.
Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting
work only under a license identical to this one.
For any reuse or distribution, you must make clear to others the license terms of this work.
Any of these conditions can be waived if you get permission from the copyright holders. Nothing in this license
impairs or restricts the authors’ moral rights. Your fair dealing and other rights are in no way affected by the
above.
This is a humanreadable summary of the Legal Code (the full license). It can be read online at the following
address:
http://creativecommons.org/licenses/byncsa/2.5/za/legalcode
Note: The Commons Deed is not a license. It is simply a handy reference for understanding the Legal Code
(the full license) — it is a humanreadable expression of some of its key terms. Think of it as the userfriendly
interface to the Legal Code beneath. This Deed itself has no legal value, and its contents do not appear in the
actual license.
The L
A
T
E
X source ﬁles of this document can be obtained from the authors.
Table of Contents
Table of Contents 3
1. Markov Chains 5
1.1 Stochastic processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Discrete Markov chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 General state space Markov chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.4 Ergodic theorems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2. An Introduction to Monte Carlo Methods 23
2.1 What are Monte Carlo Methods? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2 Introductory examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3 A Brief History of Monte Carlo Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4 Pseudorandom numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3. Fundamental Concepts: Transformation, Rejection, and Reweighting 31
3.1 Transformation methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2 Rejection sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.3 Importance sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4. The Gibbs Sampler 43
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.3 The HammersleyClifford Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.4 Convergence of the Gibbs sampler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5. The MetropolisHastings Algorithm 53
5.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.2 Convergence results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.3 The random walk Metropolis algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.4 Choosing the proposal distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
6. Diagnosing convergence 63
6.1 Practical considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
6.2 Tools for monitoring convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
7. Statespace models and the Kalman ﬁlter algorithm 73
7.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
7.2 Statespace models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4 TABLE OF CONTENTS
7.3 The Kalman ﬁlter algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
8. Sequential Monte Carlo 79
8.1 Importance Sampling revisited . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
8.2 Sequential Importance Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
8.3 Sequential Importance Sampling with Resampling . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
8.4 Sequential Importance Sampling with Resampling and MCMC moves . . . . . . . . . . . . . . . . 89
8.5 Smoothing density estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
Bibliography 95
Chapter 1
Markov Chains
This chapter introduces Markov chains
1
, a special kind of random process which is said to have “no memory”: the
evolution of the process in the future depends only on the present state and not on where it has been in the past. In
order to be able to study Markov chains, we ﬁrst need to introduce the concept of a stochastic process.
1.1 Stochastic processes
Deﬁnition 1.1 (Stochastic process). A stochastic process X is a family {X
t
: t ∈ T} of random variables
X
t
: Ω →S. T is hereby called the index set (“time”) and S is called the state space.
We will soon focus on stochastic processes in discrete time, i.e. we assume that T ⊂ N or T ⊂ Z. Other choices
would be T = [0, ∞) or T = R (processes in continuous time) or T = R ×R (spatial process).
An example of a stochastic process in discrete time would be the sequence of temperatures recorded every
morning at Braemar in the Scottish Highlands. Another example would be the price of a share recorded at the
opening of the market every day. During the day we can trace the share price continuously, which would constitute
a stochastic process in continuous time.
We can distinguish between processes not only based on their index set T, but also based on their state space S,
which gives the “range” of possible values the process can take. An important special case arises if the state space
S is a countable set. We shall then call X a discrete process. The reasons for treating discrete processes separately
are the same as for treating discrete random variables separately: we can assume without loss of generality that the
state space are the natural numbers. This special case will turn out to be much simpler than the case of a general
state space.
Deﬁnition 1.2 (Sample Path). For a given realisation ω ∈ Ω the collection {X
t
(ω) : t ∈ T} is called the sample
path of X at ω.
If T = N
0
(discrete time) the sample path is a sequence; if T = R (continuous time) the sample path is a function
from R to S.
Figure 1.1 shows sample paths both of a stochastic process in discrete time (panel (a)), and of two stochastic
processes in continuous time (panels (b) and (c)). The process in panel (b) has a discrete state space, whereas the
process in panel (c) has the real numbers as its state space (“continuous state space”). Note that whilst certain
stochastic processes have sample paths that are (almost surely) continuous or differentiable, this does not need to
be the case.
1
named after the Andrey Andreyevich Markov (1856–1922), a Russian mathematician.
6 1. Markov Chains
1 2 3 4 5
1
2
3
4
5
X
t
(ω
1
)
X
t
(ω
2
)
X
t
t
(a) Two sample paths of a discrete pro
cess in discrete time.
1 2 3 4 5
1
2
3
4
5
Y
t
(ω
1
)
Y
t
(ω
2
)
Y
t
t
(b) Two sample paths of a discrete pro
cess in continuous time.
1 2 3 4 5
1
2
3
4
5
Z
t
(ω
1
)
Z
t
(ω
2
)
Z
t
t
(c) Two sample paths of a continuous
process in continuous time.
Figure 1.1. Examples of sample paths of stochastic processes.
A stochastic process is not only characterised by the marginal distributions of X
t
, but also by the dependency
structure of the process. This dependency structure can be expressed by the ﬁnitedimensional distributions of the
process:
P(X
t
1
∈ A
1
, . . . , X
t
k
∈ A
k
)
where t
1
, . . . , t
k
∈ T, k ∈ N, and A
1
, . . . , A
k
are measurable subsets of S. In the case of S ⊂ R the ﬁnite
dimensional distributions can be represented using their joint distribution functions
F
(t
1
,...,t
k
)
(x
1
, . . . , x
k
) = P(X
t
1
∈ (−∞, x
1
], . . . , X
t
k
∈ (−∞, x
k
]).
This raises the question whether a stochastic process X is fully described by its ﬁnite dimensional distributions.
The answer to this is given by Kolmogorov’s existence theorem. However, in order to be able to formulate the
theorem, we need to introduce the concept of a consistent family of ﬁnitedimensional distributions. To keep things
simple, we will formulate this condition using distributions functions. We shall call a family of ﬁnite dimensional
distribution functions consistent if for any collection of times t
1
, . . . t
k
, for all j ∈ {1, . . . , k}
F
(t
1
,...,t
j−1
,t
j
,t
j+1
,...,t
k
)
(x
1
, . . . , x
j−1
, +∞, x
j+1
, . . . , x
k
) = F
(t
1
,...,t
j−1
,t
j+1
,...,t
k
)
(x
1
, . . . , x
j−1
, x
j+1
, . . . , x
k
)
(1.1)
This consistency condition says nothing else than that lowerdimensional members of the family have to be the
marginal distributions of the higherdimensional members of the family. For a discrete state space, (1.1) corresponds
to
x
j
p
(t
1
,...,t
j−1
,t
j
,t
j+1
,...,t
k
)
(x
1
, . . . , x
j−1
, x
j
, x
j+1
, . . . , x
k
) = p
(t
1
,...,t
j−1
,t
j+1
,...,t
k
)
(x
1
, . . . , x
j−1
, x
j+1
, . . . , x
k
),
where p
(... )
(·) are the joint probability mass functions (p.m.f.). For a continuous state space, (1.1) corresponds to
_
f
(t
1
,...,t
j−1
,t
j
,t
j+1
,...,t
k
)
(x
1
, . . . , x
j−1
, x
j
, x
j+1
, . . . , x
k
) dx
j
= f
(t
1
,...,t
j−1
,t
j+1
,...,t
k
)
(x
1
, . . . , x
j−1
, x
j+1
, . . . , x
k
)
where f
(... )
(·) are the joint probability density functions (p.d.f.).
Without the consistency condition we could obtain different results when computing the same probability using
different members of the family.
Theorem 1.3 (Kolmogorov). Let F
(t
1
,...,t
k
)
be a family of consistent ﬁnitedimensional distribution functions. Then
there exists a probability space and a stochastic process X, such that
F
(t
1
,...,t
k
)
(x
1
, . . . , x
k
) = P(X
t
1
∈ (−∞, x
1
], . . . , X
t
k
∈ (−∞, x
k
]).
Proof. The proof of this theorem can be for example found in (Gihman and Skohorod, 1974).
1.2 Discrete Markov chains 7
Thus we can specify the distribution of a stochastic process by writing down its ﬁnitedimensional distributions.
Note that the stochastic process X is not necessarily uniquely determined by its ﬁnitedimensional distributions.
However, the ﬁnite dimensional distributions uniquely determine all probabilities relating to events involving an at
most countable collection of random variables. This is however, at least as far as this course is concerned, all that
we are interested in.
In what follows we will only consider the case of a stochastic process in discrete time i.e. T = N
0
(or Z).
Initially, we will also assume that the state space is discrete.
1.2 Discrete Markov chains
1.2.1 Introduction
In this section we will deﬁne Markov chains, however we will focus on the special case that the state space S is (at
most) countable. Thus we can assume without loss of generality that the state space S is the set of natural numbers
N (or a subset of it): there exists a bijection that uniquely maps each element to S to a natural number, thus we can
relabel the states 1, 2, 3, . . ..
Deﬁnition 1.4 (Discrete Markov chain). Let X be a stochastic process in discrete time with countable (“discrete”)
state space. X is called a Markov chain (with discrete state space) if X satisﬁes the Markov property
P(X
t+1
= x
t+1
X
t
= x
t
, . . . , X
0
= x
0
) = P(X
t+1
= x
t+1
X
t
= x
t
)
This deﬁnition formalises the idea of the process depending on the past only through the present. If we know the
current state X
t
, then the next state X
t+1
is independent of the past states X
0
, . . . X
t−1
. Figure 1.2 illustrates this
idea.
2
X
t
t −1 t t + 1
Past Future
P
r
e
s
e
n
t
Figure 1.2. Past, present, and future of a Markov chain at t.
Proposition 1.5. The Markov property is equivalent to assuming that for all k ∈ N and all t
1
< . . . < t
k
≤ t
P(X
t+1
= x
t+1
X
t
k
= x
t
k
, . . . , X
t
1
= x
t
1
) = P(X
t+1
= x
t+1
X
t
k
= x
t
k
).
Proof. (homework)
Example 1.1 (Phone line). Consider the simple example of a phone line. It can either be busy (we shall call this state
1) or free (which we shall call 0). If we record its state every minute we obtain a stochastic process {X
t
: t ∈ N
0
}.
2
A similar concept (Markov processes) exists for processes in continuous time. See e.g. http://en.wikipedia.org/
wiki/Markov_process.
8 1. Markov Chains
If we assume that {X
t
: t ∈ N
0
} is a Markov chain, we assume that probability of a new phone call being ended
is independent of how long the phone call has already lasted. Similarly the Markov assumption implies that the
probability of a new phone call being made is independent of how long the phone has been out of use before.
The Markov assumption is compatible with assuming that the usage pattern changes over time. We can assume that
the phone is more likely to be used during the day and more likely to be free during the night. ⊳
Example 1.2 (Random walk on Z). Consider a socalled random walk on Z starting at X
0
= 0. At every time, we
can either stay in the state or move to the next smaller or next larger number. Suppose that independently of the
current state, the probability of staying in the current state is 1−α−β, the probability of moving to the next smaller
number is α and that the probability of moving to the next larger number is β, where α, β ≥ 0 with α + β ≤ 1.
Figure 1.3 illustrates this idea. To analyse this process in more detail we write X
t+1
as
−3
1 −α −β
−2
1 −α −β
−1
1 −α −β
0
1 −α −β
1
1 −α −β
2
1 −α −β
3
1 −α −β
α
β
α
β
α
β
α
β
α
β
α
β
α
β
α
β
· · · · · ·
Figure 1.3. Illustration (“Markov graph”) of the random walk on Z.
X
t+1
= X
t
+E
t
,
with the E
t
being independent and for all t
P(E
t
= −1) = α P(E
t
= 0) = 1 −α −β P(E
t
= 1) = β.
It is easy to see that
P(X
t+1
= x
t
−1X
t
= x
t
) = α P(X
t+1
= x
t
X
t
= x
t
) = 1 −α −β P(X
t+1
= x
t
+ 1X
t
= x
t
) = β
Most importantly, these probabilities do not change when we condition additionally on the past {X
t−1
=
x
t−1
, . . . , X
0
= x
0
}:
P(X
t+1
= x
t+1
X
t
= x
t
, X
t−1
= x
t−1
. . . , X
0
= x
0
)
= P(E
t
= x
t+1
−x
t
E
t−1
= x
t
−x
t−1
, . . . , E
0
= x
1
−x
0
, X
0
= x
o
)
E
s
⊥E
t
= P(E
t
= x
t+1
−x
t
) = P(X
t+1
= x
t+1
X
t
= x
t
)
Thus {X
t
: t ∈ N
0
} is a Markov chain. ⊳
The distribution of a Markov chain is fully speciﬁed by its initial distribution P(X
0
= x
0
) and the transition
probabilities P(X
t+1
= x
t+1
X
t
= x
t
), as the following proposition shows.
Proposition 1.6. For a discrete Markov chain {X
t
: t ∈ N
0
} we have that
P(X
t
= x
t
, X
t−1
= x
t−1
, . . . , X
0
= x
0
) = P(X
0
= x
0
) ·
t−1
τ=0
P(X
τ+1
= x
τ+1
X
τ
= x
τ
).
Proof. From the deﬁnition of conditional probabilities we can derive that
1.2 Discrete Markov chains 9
P(X
t
= x
t
, X
t−1
= x
t−1
, . . . , X
0
= x
0
) = P(X
0
= x
0
)
· P(X
1
= x
1
X
0
= x
0
)
· P(X
2
= x
2
X
1
= x
1
, X
0
= x
0
)
. ¸¸ .
=P(X
2
=x
2
X
1
=x
1
)
· · ·
· P(X
t
= x
t
X
t−1
= x
t−1
, . . . , X
0
= x
0
)
. ¸¸ .
=P(X
t
=x
t
X
t−1
=x
t−1
)
=
t−1
τ=0
P(X
τ+1
= x
τ+1
X
τ
= x
τ
).
Comparing the equation in proposition 1.6 to the ﬁrst equation of the proof (which holds for any sequence of
random variables) illustrates how powerful the Markovian assumption is.
To simplify things even further we will introduce the concept of a homogeneous Markov chain, which is a
Markov chains whose behaviour does not change over time.
Deﬁnition 1.7 (Homogeneous Markov Chain). A Markov chain {X
t
: t ∈ N
0
} is said to be homogeneous if
P(X
t+1
= jX
t
= i) = p
ij
for all i, j ∈ S, and independent of t ∈ N
0
.
In the following we will assume that all Markov chains are homogeneous.
Deﬁnition 1.8 (Transition kernel). The matrix K = (k
ij
)
ij
with k
ij
= P(X
t+1
= jX
t
= i) is called the transi
tion kernel (or transition matrix) of the homogeneous Markov chain X.
We will see that together with the initial distribution, which we might write as a vector λ
0
= (P(X
0
= i))
(i∈S)
,
the transition kernel Kfully speciﬁes the distribution of a homogeneous Markov chain.
However, we start by stating two basic properties of the transition kernel K:
– The entries of the transition kernel are nonnegative (they are probabilities).
– Each row of the transition kernel sums to 1, as
j
k
ij
=
j
P(X
t+1
= jX
t
= i) = P(X
t+1
∈ SX
t
= i) = 1
Example 1.3 (Phone line (continued)). Suppose that in the example of the phone line the probability that someone
makes a new call (if the phone is currently unused) is 10% and the probability that someone terminates an active
phone call is 30%. If we denote the states by 0 (phone not in use) and 1 (phone in use). Then
P(X
t+1
= 0X
t
= 0) = 0.9 P(X
t+1
= 1X
t
= 0) = 0.1
P(X
t+1
= 0X
t
= 1) = 0.3 P(X
t+1
= 1X
t
= 1) = 0.7,
and the transition kernel is
K =
_
0.9 0.1
0.3 0.7
_
.
The transition probabilities are often illustrated using a socalled Markov graph. The Markov graph for this example
is shown in ﬁgure 1.4. Note that knowing K alone is not enough to ﬁnd the distribution of the states: for this we
also need to know the initial distribution λ
0
. ⊳
10 1. Markov Chains
0 1
0.3
0.1
0.9 0.7
Figure 1.4. Markov graph for the phone line example.
Example 1.4 (Random walk on Z (continued)). The transition kernel for the random walk on Z is a Toeplitz matrix
with an inﬁnite number of rows and columns:
K =
_
_
_
_
_
_
_
_
_
_
_
_
_
_
_
_
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
α 1−α−β β 0 0 0
.
.
.
.
.
.
0 α 1−α−β β 0 0
.
.
.
.
.
.
0 0 α 1−α−β β 0
.
.
.
.
.
.
0 0 0 α 1−α−β β
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
_
_
_
_
_
_
_
_
_
_
_
_
_
_
_
_
The Markov graph for this Markov chain was given in ﬁgure 1.3. ⊳
We will now generalise the concept of the transition kernel, which contains the probabilities of moving from
state i to step j in one step, to the mstep transition kernel, which contains the probabilities of moving from state i
to step j in m steps:
Deﬁnition 1.9 (mstep transition kernel). The matrix K
(m)
= (k
(m)
ij
)
ij
with k
(m)
ij
= P(X
t+m
= jX
t
= i) is
called the mstep transition kernel of the homogeneous Markov chain X.
We will now show that the mstep transition kernel is nothing other than the mpower of the transition kernel.
Proposition 1.10. Let X be a homogeneous Markov chain, then
i. K
(m)
= K
m
, and
ii. P(X
m
= j) = (λ
′
0
K
(m)
)
j
.
Proof. i. We will ﬁrst show that for m
1
, m
2
∈ N we have that K
(m
1
+m
2
)
= K
(m
1
)
· K
(m
2
)
:
P(X
t+m
1
+m
2
= kX
t
= i) =
j
P(X
t+m
1
+m
2
= k, X
t+m
1
= jX
t
= i)
=
j
P(X
t+m
1
+m
2
= kX
t+m
1
= j, X
t
= i)
. ¸¸ .
=P(X
t+m
1
+m
2
=kX
t+m
1
=j)=P(X
t+m
2
=kX
t
=j)
P(X
t+m
1
= jX
t
= i)
=
j
P(X
t+m
2
= kX
t
= j)P(X
t+m
1
= jX
t
= i)
=
j
K
(m
1
)
ij
K
(m
2
)
jk
=
_
K
(m
1
)
K
(m
2
)
_
i,k
Thus K
(2)
= K· K = K
2
, and by induction K
(m)
= K
m
.
ii. P(X
m
= j) =
i
P(X
m
= j, X
0
= i) =
i
P(X
m
= jX
0
= i)
. ¸¸ .
=K
(m)
ij
P(X
0
= i)
. ¸¸ .
=(λ
0
)
i
= (λ
′
0
K
m
)
j
Example 1.5 (Phone line (continued)). In the phoneline example, the transition kernel is
K =
_
0.9 0.1
0.3 0.7
_
.
The mstep transition kernel is
1.2 Discrete Markov chains 11
K
(m)
= K
m
=
_
0.9 0.1
0.3 0.7
_m
=
_
_
3+(
3
5
)
m
4
1−(
3
5
)
m
4
1+(
3
5
)
m
4
3−(
3
5
)
m
4
_
_
.
Thus the probability that the phone is free given that it was free 10 hours ago is P(X
t+10
= 0X
t
= 0) = K
(10)
0,0
=
3+(
3
5
)
10
4
=
7338981
9765625
= 0.7515. ⊳
1.2.2 Classiﬁcation of states
Deﬁnition 1.11 (Classiﬁcation of states). (a) A state i is said to lead to a state j (“i j”) if there is an m ≥ 0
such that there is a positive probability of getting from state i to state j in m steps, i.e.
k
(m)
ij
= P(X
t+m
= jX
t
= i) > 0.
(b) Two states i and j are said to communicate (“i ∼ j”) if i j and j i.
From the deﬁnition we can derive for states i, j, k ∈ S:
– i i (as k
(0)
ii
= P(X
t+0
= iX
t
= i) = 1 > 0), thus i ∼ i.
– If i ∼ j, then also j ∼ i.
– If i j and j k, then there exist m
ij
, m
2
≥ 0 such that the probability of getting from state i to state j in
m
ij
steps is positive, i.e. k
(m
ij
)
ij
= P(X
t+m
ij
= jX
t
= i) > 0, as well as the probability of getting from state j
to state k in m
jk
steps, i.e. k
(m
jk
)
jk
= P(X
t+m
jk
= kX
t
= j) = P(X
t+m
ij
+m
jk
= kX
t+m
ij
= j) > 0. Thus
we can get (with positive probability) from state i to state k in m
ij
+m
jk
steps:
k
(m
ij
+m
jk
)
ik
= P(X
t+m
ij
+m
jk
= kX
t
= i) =
ι
P(X
t+m
ij
+m
jk
= kX
t+m
ij
= ι)P(X
t+m
ij
= ιX
t
= i)
≥ P(X
t+m
ij
+m
jk
= kX
t+m
ij
= j)
. ¸¸ .
>0
P(X
t+m
ij
= jX
t
= i)
. ¸¸ .
>0
> 0
Thus i j and j k imply i k. Thus i ∼ j and j ∼ k also imply i ∼ k
Thus ∼ is an equivalence relation and we can partition the state space S into communicating classes, such that all
states in one class communicate and no larger classes can be formed. A class C is called closed if there are no paths
going out of C, i.e. for all i ∈ C we have that i j implies that j ∈ C.
We will see that states within one class have many properties in common.
Example 1.6. Consider a Markov chain with transition kernel
K =
_
_
_
_
_
_
_
_
_
_
_
_
1
2
1
4
0
1
4
0 0
0 0 0 1 0 0
0 0
3
4
0 0
1
4
0 0 0 0 1 0
0
3
4
0 0 0
1
4
0 0
1
2
0 0
1
2
_
_
_
_
_
_
_
_
_
_
_
_
The Markov graph is shown in ﬁgure 1.5. We have that 2 ∼ 4, 2 ∼ 5, 3 ∼ 6, 4 ∼ 5. Thus the communicating
classes are {1}, {2, 4, 5}, and {3, 6}. Only the class {3, 6} is closed. ⊳
Finally, we will introduce the notion of an irreducible chain. This concept will become important when we
analyse the limiting behaviour of the Markov chain.
Deﬁnition 1.12 (Irreducibility). A Markov chain is called irreducible if it only consists of a single class, i.e. all
states communicate.
12 1. Markov Chains
1
2
3
4
1
2
1
4
1
4
3
4
1
1
1
4
1
4
1
2
4 5 6
1 2 3
Figure 1.5. Markov graph of the chain of example 1.6. The communicating classes are {1}, {2, 4, 5}, and {3, 6}.
The Markov chain considered in the phoneline example (examples 1.1,1.3, and 1.5) and the random walk on Z
(examples 1.2 and 1.4) are irreducible chains. The chain of example 1.6 is not irreducible.
In example 1.6 the states 2, 4 and 5 can only be visited in this order: if we are currently in state 2 (i.e. X
t
= 2),
then we can only visit this state again at time t + 3, t + 6, . . . . Such a behaviour is referred to as periodicity.
Deﬁnition 1.13 (Period). (a) A state i ∈ S is said to have period
d(i) = gcd{m ≥ 1 : K
(m)
ii
> 0},
where gcd denotes the greatest common denominator.
(b) If d(i) = 1 the state i is called aperiodic.
(c) If d(i) > 1 the state i is called periodic.
For a periodic state i, the number of steps required to possibly get back to this state must be a multiple of the
period d(i).
To analyse the periodicity of a state i we must check the existence of paths of positive probability and of length
m going from the state i back to i. If no path of length m exists, then K
(m)
ii
= 0. If there exists a single path of
positive probability of length m, then K
(m)
ii
> 0.
Example 1.7 (Example 1.6 continued). In example 1.6 the state 2 has period d(2) = 3, as all paths from 2 back to 2
have a length which is a multiple of 3, thus
K
(3)
22
> 0, K
(6)
22
> 0, K
(9)
22
> 0, . . .
All other K
(m)
22
= 0 (
m
3
∈ N
0
), thus the period is d(2) = 3 (3 being the greatest common denominator of 3, 6, 9, . . .).
Similarly d(4) = 3 and d(5) = 3.
The states 3 and 6 are aperiodic, as there is a positive probability of remaining in these states, thus K
(m)
33
> 0
and K
(m)
66
> 0 for all m, thus d(3) = d(6) = 1. ⊳
In example 1.6 all states within one communicating class had the same period. This holds in general, as the
following proposition shows:
Proposition 1.14. (a) All states within a communicating class have the same period.
(b) In an irreducible chain all states have the same period.
Proof. (a) Suppose i ∼ j. Thus there are paths of positive probability between these two states. Suppose we can
get from i to j in m
ij
steps and from j to i in m
ji
steps. Suppose also that we can get from j back to j in m
jj
steps. Then we can get fromi back to i in m
ij
+m
ji
steps as well as in m
ij
+m
jj
+m
ji
steps. Thus m
ij
+m
ji
and m
ij
+ m
jj
+ m
ji
must be divisible by the period d(i) of state i. Thus m
jj
is also divisible by d(i) (being
the difference of two numbers divisible by d(i)).
1.2 Discrete Markov chains 13
The above argument holds for any path between j and j, thus the length of any path fromj back to j is divisible
by d(i). Thus d(i) ≤ d(j) (d(j) being the greatest common denominator).
Repeating the same argument with the rˆ oles of i and j swapped gives us d(j) ≤ d(i), thus d(i) = d(j).
(b) An irreducible chain consists of a single communicating class, thus (b) is implied by (a).
1.2.3 Recurrence and transience
If we follow the Markov chain of example 1.6 long enough, we will eventually end up switching between state 3
and 6 without ever coming back to the other states Whilst the states 3 and 6 will be visited inﬁnitely often, the other
states will eventually be left forever.
In order to formalise these notions we will introduce the number of visits in state i:
V
i
=
+∞
t=0
1
{X
t
=i}
The expected number of visits in state i given that we start the chain in i is
E(V
i
X
0
= i) = E
_
+∞
t=0
1
{X
t
=i}
¸
¸
¸X
0
= i
_
=
+∞
t=0
E(1
{X
t
=i}
X
0
= i) =
+∞
t=0
P(X
t
= iX
o
= i) =
+∞
t=0
k
(t)
ii
Based on whether the expected number of visits in a state is inﬁnite or not, we will classify states as recurrent
or transient:
Deﬁnition 1.15 (Recurrence and transience). (a) A state i is called recurrent if E(V
i
X
0
= i) = +∞.
(b) A state i is called transient if E(V
i
X
0
= i) < +∞.
One can show that a recurrent state will (almost surely) be visited inﬁnitely often, whereas a transient state will
(almost surely) be visited only a ﬁnite number of times.
In proposition 1.14 we have seen that within a communicating class either all states are aperiodic, or all states
are periodic. A similar dichotomy holds for recurrence and transience.
Proposition 1.16. Within a communicating class, either all states are transient or all states are recurrent.
Proof. Suppose i ∼ j. Then there exists a path of length m
ij
leading from i to j and a path of length m
ji
from j
back to i, i.e. k
(m
ij
)
ij
> 0 and k
(m
ji
)
ji
> 0.
Suppose furthermore that the state i is transient, i.e. E(V
i
X
0
= i) =
+∞
t=0
k
(t)
ii
< +∞.
This implies
E(V
j
X
0
= j) =
+∞
t=0
k
(t)
jj
=
1
k
(m
ij
)
ij
k
(m
ji
)
ji
+∞
t=0
k
(m
ij
)
ij
k
(t)
jj
k
(m
ji
)
ji
. ¸¸ .
≤k
(m+t+n)
ii
≤
1
k
(m
ij
)
ij
k
(m
ji
)
ji
+∞
t=0
k
(m
ij
+t+m
ji
)
≤
1
k
(m
ij
)
ij
k
(m
ji
)
ji
+∞
s=0
k
(s)
ii
< +∞,
thus state j is be transient as well.
Finally we state without proof two simple criteria for determining recurrence and transience.
Proposition 1.17. (a) Every class which is not closed is transient.
(b) Every ﬁnite closed class is recurrent.
Proof. For a proof see (Norris, 1997, sect. 1.5).
14 1. Markov Chains
Example 1.8 (Examples 1.6 and 1.7 continued). The chain of example 1.6 had three classes: {1}, {2, 4, 5}, and
{3, 6}. The classes {1} and {2, 4, 5} are not closed, so they are transient. The class {3, 6} is closed and ﬁnite, thus
recurrent. ⊳
Note that an inﬁnite closed class is not necessarily recurrent. The random walk on Z studied in examples 1.2
and 1.4 is only recurrent if it is symmetric, i.e. α = β, otherwise it drifts off to −∞or +∞. An interesting result is
that a symmetric random walk on Z
p
is only recurrent if p ≤ 2 (see e.g. Norris, 1997, sect. 1.6).
1.2.4 Invariant distribution and equilibrium
In this section we will study the longterm behaviour of Markov chains. A key concept for this is the invariant
distribution.
Deﬁnition 1.18 (Invariant distribution). Let µ = (µ
i
)
i∈S
be a probability distribution on the state space S, and let
X be a Markov chain with transition kernel K. Then µis called the invariant distribution (or stationary distribution)
of the Markov chain X if
3
µ
′
K = µ
′
.
If µ is the stationary distribution of a chain with transition kernel K, then
µ
′
= µ
′
.¸¸.
=µ
′
K
K = µ
′
K
2
= . . . = µ
′
K
m
= µ
′
K
(m)
for all m ∈ N. Thus if X
0
in drawn from µ, then all X
m
have distribution µ: according to proposition 1.10
P(X
m
= j) = (µ
′
K
(m)
)
j
= (µ)
j
for all m. Thus, if the chain has µ as initial distribution, the distribution of X will not change over time.
Example 1.9 (Phone line (continued)). In example 1.1,1.3, and 1.5 we studied a Markov chain with the two states 0
(“free”) and 1 (“in use”) and which modeled whether a phone is free or not. Its transition kernel was
K =
_
0.9 0.1
0.3 0.7
_
.
To ﬁnd the invariant distribution, we need to solve µ
′
K = µ
′
for µ, which is equivalent to solving the following
system of linear equations:
(K
′
−I)µ = 0, i.e.
_
−0.1 0.3
0.1 −0.3
_
·
_
µ
0
µ
1
_
=
_
0
0
_
It is easy to see that the corresponding system is underdetermined and that −µ
0
+ 3µ
1
= 0, i.e. µ = (µ
0
, µ
1
)
′
∝
(3, 1), i.e. µ =
_
3
4
,
1
4
_
′
(as µ has to be a probability distribution, thus µ
0
+µ
1
= 1). ⊳
Not every Markov chain has an invariant distribution. The random walk on Z (studied in examples 1.2 and 1.4)
for example does not have an invariant distribution, as the following example shows:
Example 1.10 (Random walk on Z (continued)). The random walk on Z had the transition kernel (see example 1.4)
K =
_
_
_
_
_
_
_
_
_
_
_
_
_
_
_
_
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
α 1−α−β β 0 0 0
.
.
.
.
.
.
0 α 1−α−β β 0 0
.
.
.
.
.
.
0 0 α 1−α−β β 0
.
.
.
.
.
.
0 0 0 α 1−α−β β
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
_
_
_
_
_
_
_
_
_
_
_
_
_
_
_
_
3
i.e. µ is the left eigenvector of Kcorresponding to the eigenvalue 1.
1.2 Discrete Markov chains 15
As α+(1 −α−β) +β = 1 we have for µ = (. . . , 1, 1, 1, . . .)
′
that µ
′
K = µ
′
, however µ cannot be renormalised
to become a probability distribution. ⊳
We will now show that if a Markov chain is irreducible and aperiodic, its distribution will in the long run tend
to the invariant distribution.
Theorem 1.19 (Convergence to equilibrium). Let X be an irreducible and aperiodic Markov chain with invariant
distribution µ. Then
P(X
t
= i)
t→+∞
−→ µ
i
for all states i.
Outline of the proof. We will explain the outline of the proof using the idea of coupling.
Suppose that X has initial distribution λ and transition kernel K. Deﬁne a new Markov chain Y with initial
distribution µ and same transition kernel K. Let T be the ﬁrst time the two chains “meet” in the state i, i.e.
T = min{t ≥ 0 : X
t
= Y
t
= i}
Then one can show that P(T < ∞) = 1 and deﬁne a new process Z by
Z
t
=
_
X
t
if t ≤ T
Y
t
if t > T
Figure 1.6 illustrates this new chain Z. One can show that Z is a Markov chain with initial distribution λ (as
T
Y
t
Z
t
X
t
Y
t
X
t
t
Figure 1.6. Illustration of the chains X (−−−), Y (— —) and Z (thick line) used in the proof of theorem 1.19.
X
0
= Z
0
) and transition kernel K (as both X and Y have the transition kernel K). Thus X and Z have the same
distribution and for all t ∈ N
0
we have that P(X
t
= j) = P(Z
t
= j) for all states j ∈ S.
The chain Y has its invariant distribution as initial distribution, thus P(Y
t
= j) = µ
j
for all t ∈ N
0
and j ∈ S.
As t →+∞the probability of {Y
t
= Z
t
} tends to 1, thus
P(X
t
= j) = P(Z
t
= j) →P(Y
t
= j) = µ
j
.
A more detailed proof of this theorem can be found in (Norris, 1997, sec. 1.8).
Example 1.11 (Phone line (continued)). We have shown in example 1.9 that the invariant distribution of the Markov
chain modeling the phone line is µ =
_
3
4
,
1
4
_
, thus according to theorem 1.19 P(X
t
= 0) →
3
4
and P(X
t
= 1) →
1
4
. Thus, in the long run, the phone will be free 75% of the time. ⊳
Example 1.12. This example illustrates that the aperiodicity condition in theorem 1.19 is necessary.
Consider a Markov chain X with two states S = {1, 2} and transition kernel
16 1. Markov Chains
K =
_
0 1
1 0
_
.
This Markov chains switches deterministically, thus goes either 1, 0, 1, 0, . . . or 0, 1, 0, 1, . . .. Thus it is
periodic with period 2.
Its invariant distribution is µ
′
=
_
1
2
,
1
2
_
, as
µ
′
K =
_
1
2
,
1
2
_
_
0 1
1 0
_
=
_
1
2
,
1
2
_
= µ
′
.
However if the chain is started in X
0
= 1, i.e. λ = (1, 0), then
P(X
t
= 0) =
_
1 if t is odd
0 if t is even
, P(X
t
= 1) =
_
0 if t is odd
1 if t is even
,
which is different from the invariant distribution, under which all these probabilities would be
1
2
. ⊳
1.2.5 Reversibility and detailed balance
In our study of Markov chains we have so far focused on conditioning on the past. For example, we have deﬁned
the transition kernel to consist of k
ij
= P(X
t+1
= jX
t
= i). What happens if we analyse the distribution of X
t
conditional on the future, i.e we turn the universal clock backwards?
P(X
t
= jX
t+1
= i) =
P(X
t
= j, X
t+1
= i)
P(X
t+1
= i)
= P(X
t+1
= iX
t
= j) ·
P(X
t
= j)
P(X
t+1
= i)
This suggests deﬁning a new Markov chain which goes back in time. As the deﬁning property of a Markov
chain was that the past and future are conditionally independent given the present, the same should hold for the
“backward chain”, just with the rˆ oles of past and future swapped.
Deﬁnition 1.20 (Timereversed chain). For τ ∈ N let {X
t
: t = 0, . . . , τ} be a Markov chain. Then {Y
t
: t =
0, . . . , τ} deﬁned by Y
t
= X
τ−t
is called the timereversed chain corresponding to X.
We have that
P(Y
t
= jY
t−1
= i) = P(X
τ−t
= jX
τ−t+1
= i) = P(X
s
= jX
s+1
= i) =
P(X
s
= j, X
s+1
= i)
P(X
s+1
= i)
= P(X
s+1
= iX
s
= j) ·
P(X
s
= j)
P(X
s+1
= i)
= k
ji
·
P(X
s
= j)
P(X
s+1
= i)
,
thus the timereversed chain is in general not homogeneous, even if the forward chain X is homogeneous.
This changes however if the forward chain X is initialised according to its invariant distribution µ. In this case
P(X
s+1
= i) = µ
i
and P(X
s
= j) = µ
j
for all s, and thus Y is a homogeneous Markov chain with transition
probabilities
P(Y
t
= jY
t−1
= i) = k
ji
·
µ
j
µ
i
. (1.2)
In general, the transition probabilities for the timereversed chain will thus be different from the forward chain.
Example 1.13 (Phone line (continued)). In the example of the phone line (examples 1.1, 1.3, 1.5, 1.9, and 1.11) the
transition matrix was
K =
_
0.9 0.1
0.3 0.7
_
.
The invariant distribution was µ =
_
3
4
,
1
4
_
′
.
If we use the invariant distribution µ as initial distribution for X
0
, then using (1.2)
1.2 Discrete Markov chains 17
P(Y
t
= 0Y
t−1
= 0) = k
00
·
µ
0
µ
0
= k
00
= P(X
t
= 0X
t−1
= 1)
P(Y
t
= 0Y
t−1
= 1) = k
01
·
µ
0
µ
1
= 0.1 ·
3
4
1
4
= 0.3 = k
10
= P(X
t
= 0X
t−1
= 1)
P(Y
t
= 1Y
t−1
= 0) = k
10
·
µ
1
µ
0
= 0.3 ·
1
4
3
4
= 0.1 = k
01
= P(X
t
= 1X
t−1
= 0)
P(Y
t
= 1Y
t−1
= 1) = k
11
·
µ
1
µ
1
= k
11
= P(X
t
= 1X
t−1
= 1)
Thus in this case both the forward chain X and the timereversed chain Y have the same transition probabilities.
We will call such chains timereversible, as their dynamics do not change when time is reversed. ⊳
We will now introduce a criterion for checking whether a chain is timereversible.
Deﬁnition 1.21 (Detailed balance). A transition kernel Kis said to be in detailed balance with a distribution µ if
for all i, j ∈ S
µ
i
k
ij
= µ
j
k
ji
.
It is easy to see that Markov chain studied in the phone line example (see example 1.13) satisﬁes the detailed
balance condition.
The detailedbalance condition is a very important concept that we will require when studying Markov Chain
Monte Carlo (MCMC) algorithms later. The reason for its relevance is the following theorem, which says that if a
Markov chain is in detailed balance with a distribution µ, then the chain is timereversible, and, more importantly,
µ is the invariant distribution. The advantage of the detailedbalance condition over the condition of deﬁnition 1.18
is that the detailedbalance condition is often simpler to check, as it does not involve a sum (or a vectormatrix
product).
Theorem 1.22. Let X be a Markov chain with transition kernel Kwhich is in detailed balance with some distribu
tion µ on the states of the chain. Then
i. µ is the invariant distribution of X.
ii. If initialised according to µ, X is timereversible, i.e. both X and its time reversal have the same transition
kernel.
Proof. i. We have that
(µ
′
K)
i
=
j
µ
j
k
ji
. ¸¸ .
=µ
i
k
ij
= µ
i
j
k
ij
. ¸¸ .
=1
= µ
i
,
thus µ
′
K = µ
′
, i.e. µ is the invariant distribution.
ii. Let Y be the timereversal of X, then using (1.2)
P(Y
t
= jY
t−1
= i) =
µ
i
k
ij
¸ .. ¸
µ
j
k
ji
µ
i
= k
ij
= P(X
t
= jX
t−1
= i),
thus X and Y have the same transition probabilities.
Note that not every chain which has an invariant distribution is timereversible, as the following example shows:
Example 1.14. Consider the following Markov chain on S = {1, 2, 3} with transition matrix
K =
_
_
_
_
0 0.8 0.2
0.2 0 0.8
0.8 0.2 0
_
_
_
_
The corresponding Markov graph is shown in ﬁgure 1.7: The stationary distribution of the chain is µ =
_
1
3
,
1
3
,
1
3
_
.
18 1. Markov Chains
3 2
1
0.8
0.2
0.2
0.8
0.2
0.8
Figure 1.7. Markov graph for the Markov chain of example 1.14.
However the distribution is not timereversible. Using equation (1.2) we can ﬁnd the transition matrix of the time
reversed chain Y , which is
_
_
_
_
0 0.2 0.8
0.8 0 0.2
0.2 0.8 0
_
_
_
_
,
which is equal to K
′
, rather than K. Thus the chains X and its time reversal Y have different transition kernels.
When going forward in time, the chain is much more likely to go clockwise in ﬁgure 1.7; when going backwards in
time however, the chain is much more likely to go counterclockwise. ⊳
1.3 General state space Markov chains
So far, we have restricted our attention to Markov chains with a discrete (i.e. at most countable) state space S. The
main reason for this was that this kind of Markov chain is much easier to analyse than Markov chains having a more
general state space.
However, most applications of Markov Chain Monte Carlo algorithms are concerned with continuous random
variables, i.e. the corresponding Markov chain has a continuous state space S, thus the theory studied in the preced
ing section does not directly apply. Largely, we deﬁned most concepts for discrete state spaces by looking at events
of the type {X
t
= j}, which is only meaningful if the state space is discrete.
In this section we will give a brief overview of the theory underlying Markov chains with general state spaces.
Although the basic principles are not entirely different from the ones we have derived in the discrete case, the study
of general state space Markov chains involves many more technicalities and subtleties, so that we will not present
any proofs here. The interested reader can ﬁnd a more rigorous treatment in (Meyn and Tweedie, 1993), (Nummelin,
1984), or (Robert and Casella, 2004, chapter 6).
Though this section is concerned with general state spaces we will notationally assume that the state space is
S = R
d
.
First of all, we need to generalise our deﬁnition of a Markov chain (deﬁnition 1.4). We deﬁned a Markov chain
to be a stochastic process in which, conditionally on the present, the past and the future are independent. In the
discrete case we formalised this idea using the conditional probability of {X
t
= j} given different collections of
past events.
In a general state space it can be that all events of the type {X
t
= j} have probability 0, as it is the case for
a process with a continuous state space. A process with a continuous state space spreads the probability so thinly
that the probability of exactly hitting one given state is 0 for all states. Thus we have to work with conditional
probabilities of sets of states, rather than individual states.
Deﬁnition 1.23 (Markov chain). Let X be a stochastic process in discrete time with general state space S. X is
called a Markov chain if X satisﬁes the Markov property
P(X
t+1
∈ AX
0
= x
0
, . . . , X
t
= x
t
) = P(X
t+1
∈ AX
t
= x
t
)
for all measurable sets A ⊂ S.
1.3 General state space Markov chains 19
If S is at most countable, this deﬁnition is equivalent to deﬁnition 1.4.
In the following we will assume that the Markov chain is homogeneous, i.e. the probabilities P(X
t+1
∈ AX
t
=
x
t
) are independent of t. For the remainder of this section we shall also assume that we can express the probability
from deﬁnition 1.23 using a transition kernel K : S ×S →R
+
0
:
P(X
t+1
∈ AX
t
= x
t
) =
_
A
K(x
t
, x
t+1
) dx
t+1
(1.3)
where the integration is with respect to a suitable dominating measure, i.e. for example with respect to the Lebesgue
measure if S = R
d
.
4
The transition kernel K(x, y) is thus just the conditional probability density of X
t+1
given
X
t
= x
t
.
We obtain the special case of deﬁnition 1.8 by setting K(i, j) = k
ij
, where k
ij
is the (i, j)th element of the
transition matrix K. For a discrete state space the dominating measure is the counting measure, so integration just
corresponds to summation, i.e. equation (1.3) is equivalent to
P(X
t+1
∈ AX
t
= x
t
) =
x
t+1
∈A
k
x
t
x
t+1
.
We have for measurable set A ⊂ S that
P(X
t+m
∈ AX
t
= x
t
) =
_
A
_
S
· · ·
_
S
K(x
t
, x
t+1
)K(x
t+1
, x
t+2
) · · · K(x
t+m−1
, x
t+m
) dx
t+1
· · · dx
t+m−1
dx
t+m
,
thus the mstep transition kernel is
K
(m)
(x
0
, x
m
) =
_
S
· · ·
_
S
K(x
0
, x
1
) · · · K(x
m−1
, x
m
) dx
m−1
· · · dx
1
The mstep transition kernel allows for expressing the mstep transition probabilities more conveniently:
P(X
t+m
∈ AX
t
= x
t
) =
_
A
K
(m)
(x
t
, x
t+m
) dx
t+m
Example 1.15 (Gaussian random walk on R). Consider the random walk on R deﬁned by
X
t+1
= X
t
+E
t
,
where E
t
∼ N(0, 1), i.e. the probability density function of E
t
is φ(z) =
1
√
2π
exp
_
−
z
2
2
_
. This is equivalent to
assuming that
X
t+1
X
t
= x
t
∼ N(x
t
, 1).
We also assume that E
t
is independent of X
0
, E
1
, . . . , E
t−1
. Suppose that X
0
∼ N(0, 1). In contrast to the random
walk on Z (introduced in example 1.2) the state space of the Gaussian random walk is R. In complete analogy with
example 1.2 we have that
P(X
t+1
∈ AX
t
= x
t
, . . . , X
0
= x
0
) = P(E
t
∈ A−x
t
X
t
= x
t
, . . . , X
0
= x
0
)
= P(E
t
∈ A−x
t
) = P(X
t+1
∈ AX
t
= x
t
),
where A−x
t
= {a −x
t
: a ∈ A}. Thus X is indeed a Markov chain. Furthermore we have that
P(X
t+1
∈ AX
t
= x
t
) = P(E
t
∈ A−x
t
) =
_
A
φ(x
t+1
−x
t
) dx
t+1
Thus the transition kernel (which is nothing other than the conditional density of X
t+1
X
t
= x
t
) is thus
K(x
t
, x
t+1
) = φ(x
t+1
−x
t
)
To ﬁnd the mstep transition kernel we could use equation (1.3). However, the resulting integral is difﬁcult to
compute. Rather we exploit the fact that
4
A more correct way of stating this would be P(X
t+1
∈ AX
t
= x
t
) =
R
A
K(x
t
, dx
t+1
).
20 1. Markov Chains
X
t+m
= X
t
+E
t
+. . . +E
t+m−1
. ¸¸ .
∼N(0,m)
,
thus X
t+m
X
t
= x
t
∼ N(x
t
, m).
P(X
t+m
∈ AX
t
= x
t
) = P(X
t+m
−X
t
∈ A−x
t
) =
_
A
1
√
m
φ
_
x
t+m
−x
t
√
m
_
dx
t+m
Comparing this with (1.3) we can identify
K
(m)
(x
t
, x
t+m
) =
1
√
m
φ
_
x
t+m
−x
t
√
m
_
as mstep transition kernel. ⊳
In section 1.2.2 we deﬁned a Markov chain to be irreducible if there is a positive probability of getting from any
state i ∈ S to any other state j ∈ S, possibly via intermediate steps.
Again, we cannot directly apply deﬁnition 1.12 to Markov chains with general state spaces: it might be — as
it is the case for a continuous state space — that the probability of hitting a given state is 0 for all states. We will
again resolve this by looking at sets of states rather than individual states.
Deﬁnition 1.24 (Irreducibility). Given a distribution µ on the states S, a Markov chain is said to be µirreducible
if for all sets A with µ(A) > 0 and for all x ∈ S, there exists an m ∈ N
0
such that
P(X
t+m
∈ AX
t
= x) =
_
A
K
(m)
(x, y) dy > 0.
If the number of steps m = 1 for all A, then the chain is said to be strongly µirreducible.
Example 1.16 (Gaussian random walk (continued)). In example 1.15 we had that X
t+1
X
t
= x
t
∼ N(x
t
, 1). As
the range of the Gaussian distribution is R, we have that P(X
t+1
∈ AX
t
= x
t
) > 0 for all sets A of nonzero
Lebesgue measure. Thus the chain is strongly irreducible with the respect to any continuous distribution. ⊳
Extending the concepts of periodicity, recurrence, and transience studied in sections 1.2.2 and 1.2.3 from the
discrete case to the general case requires additional technical concepts like atoms and small sets, which are beyond
the scope of this course (for a more rigorous treatment of these concepts see e.g. Robert and Casella, 2004, sections
6.3 and 6.4). Thus we will only generalise the concept of recurrence.
In section 1.2.3 we deﬁned a discrete Markov chain to be recurrent, if all states are (on average) visited inﬁnitely
often. For more general state spaces, we need to consider the number of visits to a set of states rather than single
states. Let V
A
=
+∞
t=0
1
{X
t
∈A}
be the number of visits the chain makes to states in the set A ⊂ S. We then deﬁne
the expected number of visits in A ⊂ S, when we start the chain in x ∈ S:
E(V
A
X
0
= x) = E
_
+∞
t=0
1
{X
t
∈A}
¸
¸
¸X
0
= x
_
=
+∞
t=0
E(1
{X
t
∈A}
X
0
= x) =
+∞
t=0
_
A
K
(t)
(x, y) dy
This allows us to deﬁne recurrence for general state spaces. We start with deﬁning recurrence of sets before extend
ing the deﬁnition of recurrence of an entire Markov chain.
Deﬁnition 1.25 (Recurrence). (a) A set A ⊂ S is said to be recurrent for a Markov chain X if for all x ∈ A
E(V
A
X
0
= x) = +∞,
(b) A Markov chain is said to be recurrent, if
i. The chain is µirreducible for some distribution µ.
ii. Every measurable set A ⊂ S with µ(A) > 0 is recurrent.
1.4 Ergodic theorems 21
According to the deﬁnition a set is recurrent if on average it is visited inﬁnitely often. This is already the case if
there is a nonzero probability of visiting the set inﬁnitely often. A stronger concept of recurrence can be obtained
if we require that the set is visited inﬁnitely often with probability 1. This type of recurrence is referred to as Harris
recurrence.
Deﬁnition 1.26 (Harris Recurrence). (a) A set A ⊂ S is said to be Harrisrecurrent for a Markov chain X if for
all x ∈ A
P(V
A
= +∞X
0
= x) = 1,
(b) A Markov chain is said to be Harrisrecurrent, if
i. The chain is µirreducible for some distribution µ.
ii. Every measurable set A ⊂ S with µ(A) > 0 is Harrisrecurrent.
It is easy to see that Harris recurrence implies recurrence. For discrete state spaces the two concepts are equiva
lent.
Checking recurrence or Harris recurrence can be very difﬁcult. We will state (without) proof a proposition
which establishes that if a Markov chain is irreducible and has a unique invariant distribution, then the chain is also
recurrent.
However, before we can state this proposition, we need to deﬁne invariant distributions for general state spaces.
Deﬁnition 1.27 (Invariant Distribution). A distribution µ with density function f
µ
is said to be the invariant distri
bution of a Markov chain X with transition kernel K if
f
µ
(y) =
_
S
f
µ
(x)K(x, y) dx
for almost all y ∈ S.
Proposition 1.28. Suppose that X is a µirreducible Markov chain having µ as unique invariant distribution. Then
X is also recurrent.
Proof. see (Tierney, 1994, theorem 1) or (Athreya et al., 1992)
Checking the invariance condition of deﬁnition 1.27 requires computing an integral, which can be quite cum
bersome. A simpler (sufﬁcient, but not necessary) condition is, just like in the case discrete case, detailed balance.
Deﬁnition 1.29 (Detailed balance). A transition kernel K is said to be in detailed balance with a distribution µ
with density f
µ
if for almost all x, y ∈ S
f
µ
(x)K(x, y) = f
µ
(y)K(y, x).
In complete analogy with theorem 1.22 one can also show in the general case that if the transition kernel of a
Markov chain is in detailed balance with a distribution µ, then the chain is timereversible and has µ as its invariant
distribution. Thus theorem 1.22 also holds in the general case.
1.4 Ergodic theorems
In this section we will study the question whether we can use observations from a Markov chain to make inferences
about its invariant distribution. We will see that under some regularity conditions it is even enough to follow a single
sample path of the Markov chain.
For independent identically distributed data the Law of Large Numbers is used to justify estimating the expected
value of a functional using empirical averages. A similar result can be obtained for Markov chains. This result is
the reason why Markov Chain Monte Carlo methods work: it allows us to set up simulation algorithms to generate
a Markov chain, whose sample path we can then use for estimating various quantities of interest.
22 1. Markov Chains
Theorem 1.30 (Ergodic Theorem). Let X be a µirreducible, recurrent R
d
valued Markov chain with invariant
distribution µ. Then we have for any integrable function g : R
d
→R that with probability 1
lim
t→∞
1
t
t
i=1
g(X
i
) →E
µ
(g(X)) =
_
S
g(x)f
µ
(x) dx
for almost every starting value X
0
= x. If X is Harrisrecurrent this holds for every starting value x.
Proof. For a proof see (Roberts and Rosenthal, 2004, fact 5), (Robert and Casella, 2004, theorem 6.63), or (Meyn
and Tweedie, 1993, theorem 17.3.2).
Under additional regularity conditions one can also derive a Central Limit Theorem which can be used to justify
Gaussian approximations for ergodic averages of Markov chains. This would however be beyond the scope of this
course.
We conclude by giving an example that illustrates that the conditions of irreducibility and recurrence are neces
sary in theorem 1.30. These conditions ensure that the chain is permanently exploring the entire state space, which
is a necessary condition for the convergence of ergodic averages.
Example 1.17. Consider a discrete chain with two states S = {1, 2} and transition matrix
K =
_
1 0
0 1
_
The corresponding Markov graph is shown in ﬁgure 1.8. This chain will remain in its intial state forever. Any
1 2
1 1
Figure 1.8. Markov graph of the chain of example 1.17
distribution µ on {1, 2} is an invariant distribution, as
µ
′
K = µ
′
I = µ
′
for all µ. However, the chain is not irreducible (or recurrent): we cannot get from state 1 to state 2 and vice versa.
If the initial distribution is µ = (α, 1 −α)
′
with α ∈ [0, 1] then for every t ∈ N
0
we have that
P(X
t
= 1) = α P(X
t
= 2) = 1 −α.
By observing one sample path (which is either 1, 1, 1, . . . or 2, 2, 2, . . .) we can make no inference about the distri
bution of X
t
or the parameter α. The reason for this is that the chain fails to explore the space (i.e. switch between
the states 1 and 2). In order to estimate the parameter α we would need to look at more than one sample path. ⊳
Note that theorem 1.30 does not require the chain to the aperiodic. In example 1.12 we studied a periodic chain.
Due to the periodicity we could not apply theorem 1.19. We can however apply theorem 1.30 to this chain. The
reason for this is that whilst theorem 1.19 was about the distribution of states at a given time t, theorem 1.30 is
about averages, and the periodic behaviour does not affect averages.
Chapter 2
An Introduction to Monte Carlo Methods
2.1 What are Monte Carlo Methods?
This lecture course is concerned with Monte Carlo methods, which are sometimes referred to as stochastic simula
tion (Ripley (1987) for example only uses this term).
Examples of Monte Carlo methods include stochastic integration, where we use a simulationbased method to
evaluate an integral, Monte Carlo tests, where we resort to simulation in order to compute the pvalue, and Markov
Chain Monte Carlo (MCMC), where we construct a Markov chain which (hopefully) converges to the distribution
of interest.
A formal deﬁnition of Monte Carlo methods was given (amongst others) by Halton (1970). He deﬁned a Monte
Carlo method as “representing the solution of a problem as a parameter of a hypothetical population, and using
a random sequence of numbers to construct a sample of the population, from which statistical estimates of the
parameter can be obtained.”
2.2 Introductory examples
Example 2.1 (A raindrop experiment for computing π). Assume we want to compute an Monte Carlo estimate of π
using a simple experiment. Assume that we could produce “uniform rain” on the square [−1, 1] ×[−1, 1], such that
the probability of a raindrop falling into a region R ⊂ [−1, 1]
2
is proportional to the area of R, but independent of
the position of R. It is easy to see that this is the case iff the two coordinates X, Y are i.i.d. realisations of uniform
distributions on the interval [−1, 1] (in short X, Y
i.i.d.
∼ U[−1, 1]).
Now consider the probability that a raindrop falls into the unit circle (see ﬁgure 2.1). It is
P(drop within circle) =
area of the unit circle
area of the square
=
_ _
{x
2
+y
2
≤1}
1 dxdy
_ _
{−1≤x,y≤1}
1 dxdy
=
π
2 · 2
=
π
4
In other words,
π = 4 · P(drop within circle),
i.e. we found a way of expressing the desired quantity π as a function of a probability.
Of course we cannot compute P(drop within circle) without knowing π, however we can estimate the probability
using our raindrop experiment. If we observe n raindrops, then the number of raindrops Z that fall inside the circle
is a binomial random variable:
24 2. An Introduction to Monte Carlo Methods
1 −1
1
−1
Figure 2.1. Illustration of the raindrop experiment for estimating π
Z ∼ B(n, p), with p = P(drop within circle).
Thus we can estimate p by its maximumlikelihood estimate
ˆ p =
Z
n
,
and we can estimate π by
ˆ π = 4ˆ p = 4 ·
Z
n
.
Assume we have observed, as in ﬁgure 2.1, that 77 of the 100 raindrops were inside the circle. In this case, our
estimate of π is
ˆ π =
4 · 77
100
= 3.08,
which is relatively poor.
However the law of large numbers guarantees that our estimate ˆ π converges almost surely to π. Figure 2.2 shows the
estimate obtained after n iterations as a function of n for n = 1, . . . , 2000. You can see that the estimate improves
as n increases.
We can assess the quality of our estimate by computing a conﬁdence interval for π. As we have Z ∼ B(100, p) and
ˆ p =
Z
n
, we use the approximation that Z ∼ N(100p, 100p(1 − p)). Hence, ˆ p ∼ N(p, p(1 − p)/100), and we can
obtain a 95% conﬁdence interval for p using this Normal approximation:
_
0.77 −1.96 ·
_
0.77 · (1 −0.77)
100
, 0.77 + 1.96 ·
_
0.77 · (1 −0.77)
100
_
= [0.6875, 0.8525],
As our estimate of π is four times the estimate of p, we now also have a conﬁdence interval for π:
[2.750, 3.410]
In more general, let ˆ π
n
= 4ˆ p
n
denote the estimate after having observed n raindrops. A (1−2α) conﬁdence interval
for p is then
_
ˆ p
n
−z
1−α
_
ˆ p
n
(1 − ˆ p
n
)
n
, ˆ p
n
+z
1−α
_
ˆ p
n
(1 − ˆ p
n
)
n
_
,
thus a (1 −2α) conﬁdence interval for π is
_
ˆ π
n
−z
1−α
_
ˆ π
n
(4 − ˆ π
n
)
n
, ˆ π
n
+z
1−α
_
ˆ π
n
(4 − ˆ π
n
)
n
_
⊳
2.2 Introductory examples 25
0 500 1000 1500 2000
2
.
0
2
.
5
3
.
0
3
.
5
4
.
0
Sample size
E
s
t
i
m
a
t
e
o
f
π
Monte Carlo estimate of π (with 90% conﬁdence interval)
Figure 2.2. Estimate of π resulting from the raindrop experiment
Let us recall again the different steps we have used in the example:
– We have written the quantity of interest (in our case π) as an expectation.
1
– Second, we have replaced this algebraic representation of the quantity of interest by a sample approximation to it.
The lawof large numbers guaranteed that the sample approximation converges to the algebraic representation, and
thus to the quantity of interest. Furthermore we used the central limit theorem to assess the speed of convergence.
It is of course of interest whether the Monte Carlo methods offer more favourable rates of convergence than
other numerical methods. We will investigate this in the case of Monte Carlo integration using the following simple
example.
Example 2.2 (Monte Carlo Integration). Assume we want to evaluate the integral
_
1
0
f(x) dx with f(x) =
1
27
·
_
−65536x
8
+ 262144x
7
−409600x
6
+ 311296x
5
−114688x
4
+ 16384x
3
_
using a Monte Carlo approach.
2
Figure 2.3 shows the function for x ∈ [0, 1]. Its graph is fully contained in the unit
square [0, 1]
2
.
Once more, we can resort to a raindrop experiment. Assume we can produce uniform rain on the unit square. The
probability that a raindrop falls below the curve is equal to the area below the curve, which of course equals the
integral we want to evaluate (the area of the unit square is 1, so we don’t need to rescale the result).
A more formal justiﬁcation for this is, using the fact that f(x) =
_
f(x)
0
1 dt,
_
1
0
f(x) dx =
_
1
0
_
f(x)
0
1 dt dx =
_ _
{(x,t):t≤f(x)}
1dt dx =
_ _
{(x,t):t≤f(x)}
1 dt dx
_ _
{0≤x,t≤1}
1dt dx
The numerator is nothing other than the dark grey area under the curve, and the denominator is the area of the
unit square (shaded in light grey in ﬁgure 2.3). Thus the expression on the right hand side is the probability that a
1
A probability is a special case of an expectation as P(A) = E(I
A
).
2
As f is a polynomial we can obtain the result analytically, it is
4096
8505
=
2
12
3
5
·5·7
≈ 0.4816.
26 2. An Introduction to Monte Carlo Methods
raindrop falls below the curve.
We have thus reexpressed our quantity of interest as a probability in a statistical model. Figure 2.3 shows the result
obtained when observing 100 raindrops. 52 of them are below the curve, yielding a MonteCarlo estimate of the
integral of 0.52.
If after n raindrops a proportion ˆ p
n
is found to lie below the curve, a (1 − 2α) conﬁdence interval for the value of
the integral is
_
ˆ p
n
−z
1−α
_
ˆ p
n
(1 −p
n
)
n
, ˆ p
n
+z
1−α
_
ˆ p
n
(1 −p
n
)
n
_
Thus the speed of convergence of our (rather crude) Monte Carlo method is O
P
(n
−1/2
). ⊳
1 0
1
x
Figure 2.3. Illustration of the raindrop experiment to compute
R
1
0
f(x)dx
When using Riemann sums (as in ﬁgure 2.4) to approximate the integral from example 2.2 the error is of order
O(n
−1
).
3,4
Recall that our Monte Carlo method was “only” of order O
P
(n
−1/2
). However, it is easy to see that its speed
of convergence is of the same order, regardless of the dimension of the support of f. This is not the case for other
(deterministic) numerical integration methods. For a twodimensional function f the error made by the Riemann
approximation using n function evaluations is O(n
−1/2
).
5
This makes the Monte Carlo methods especially suited for highdimensional problems. Furthermore the Monte
Carlo method offers the advantage of being relatively simple and thus easy to implement on a computer.
2.3 A Brief History of Monte Carlo Methods
Experimental Mathematics is an old discipline: the Old Testament (1 Kings vii. 23 and 2 Chronicles iv. 2) contains
a rough estimate of π (using the columns of King Solomon’s temple). Monte Carlo methods are a somewhat more
recent discipline. One of the ﬁrst documented Monte Carlo experiments is Buffon’s needle experiment (see example
2.3 below). Laplace (1812) suggested that this experiment can be used to approximate π.
3
The error made for each “bar” can be upper bounded by
∆
2
2
max f
′
(x). Let n denote the number evaluations of f (and thus
the number of “bars”). As ∆ is proportional to
1
n
, the error made for each bar is O(n
−2
). As there are n “bars”, the total
error is O(n
−1
).
4
The order of convergence can be improved when using the trapezoid rule and (even more) by using Simpson’s rule.
5
Assume we partition both axes into m segments, i.e. we have to evaluate the function n = m
2
times. The error made for
each “bar” is O(m
−3
) (each of the two sides of the base area of the “bar” is proportional to m
−1
, so is the upper bound on
f(x)−f(ξ
mid
), yielding O(m
−3
)). There are in total m
2
bars, so the total error is only O(m
−1
), or equivalently O(n
−1/2
).
2.3 A Brief History of Monte Carlo Methods 27
1 0
1
∆
ξ
mid
x
f(x) −f(ξ
mid
) <
∆
2
· max f
′
(x) for x−ξ
mid
≤
∆
2
Figure 2.4. Illustration of numerical integration by Riemann sums
Example 2.3 (Buffon’s needle). In 1733, the Comte de Buffon, George Louis Leclerc, asked the following question
(Buffon, 1733): Consider a ﬂoor with equally spaced lines, a distance δ apart. What is the probability that a needle
of length l < δ dropped on the ﬂoor will intersect one of the lines?
Buffon answered the question himself in 1777 (Buffon, 1777).
Assume the needle landed such that its angle is θ (see ﬁgure 2.5). Then the question whether the needle intersects a
line is equivalent to the question whether a box of width l sin θ intersects a line. The probability of this happening
is
P(intersectθ) =
l sin θ
δ
.
Assuming that the angle θ is uniform on [0, π) we obtain
P(intersect) =
_
π
0
P(intersectθ) ·
1
π
dθ =
_
π
0
l sin θ
δ
·
1
π
dθ =
l
πδ
·
_
π
0
sin θ dθ
. ¸¸ .
=2
=
2l
πδ
.
When dropping n needles the expected number of needles crossing a line is thus
2nl
πδ
.
Thus we can estimate π by
δ δ δ
l sin θ
θ
(a) Illustration of the geometry behind
Buffon’s needle
(b) Results of the Buffon’s needle experi
ment using 50 needles. Dark needles inter
sect the thin vertical lines, light needles do
not.
Figure 2.5. Illustration of Buffon’s needle
28 2. An Introduction to Monte Carlo Methods
π ≈
2nl
Xδ
,
where X is the number of needles crossing a line.
The Italian mathematician Mario Lazzarini performed Buffon’s needle experiment in 1901 using a needle of length
l = 2.5cm and lines d = 3cm apart (Lazzarini, 1901). Of 3408 needles 1808 needles crossed a line, so Lazzarini’s
estimate of π was
π ≈
2 · 3408 · 2.5
1808 · 3
=
17040
5424
=
355
133
,
which is nothing other than the best rational approximation to π with at most 4 digits each in the denominator and
the numerator.
6
⊳
Historically, the main drawback of Monte Carlo methods was that they used to be expensive to carry out.
Physical random experiments were difﬁcult to perform and so was the numerical processing of their results.
This however changed fundamentally with the advent of the digital computer. Amongst the ﬁrst to realise this
potential were John von Neuman and Stanisław Ulam, who were then working for the Manhattan project in Los
Alamos. They proposed in 1947 to use a computer simulation for solving the problem of neutron diffusion in
ﬁssionable material (Metropolis, 1987). Enrico Fermi previously considered using Monte Carlo techniques in the
calculation of neutron diffusion, however he proposed to use a mechanical device, the socalled “Fermiac”, for
generating the randomness. The name “Monte Carlo” goes back to Stanisław Ulam, who claimed to be stimulated
by playing poker (Ulam, 1983). In 1949 Metropolis and Ulam published their results in the Journal of the American
Statistical Association (Metropolis and Ulam, 1949). Nonetheless, in the following 30 years Monte Carlo methods
were used and analysed predominantly by physicists, and not by statisticians: it was only in the 1980s — following
the paper by Geman and Geman (1984) proposing the Gibbs sampler — that the relevance of Monte Carlo methods
in the context of (Bayesian) statistics was fully realised.
2.4 Pseudorandom numbers
For any MonteCarlo simulation we need to be able to reproduce randomness by a computer algorithm, which,
by deﬁnition, is deterministic in nature — a philosophical paradox. In the following chapters we will assume that
independent (pseudo)random realisations from a uniform U[0, 1] distribution
7
are readily available. This section
tries to give very brief overview of how pseudorandom numbers can be generated. For a more detailed discussion
of pseudorandom number generators see Ripley (1987) or Knuth (1997).
A pseudorandom number generator (RNG) is an algorithm for whose output the U[0, 1] distribution is a suitable
model. In other words, the number generated by the pseudorandomnumber generator should have the same relevant
statistical properties as independent realisations of a U[0, 1] random variable. Most importantly:
– The numbers generated by the algorithm should reproduce independence, i.e. the numbers X
1
, . . . , X
n
that we
have already generated should not contain any discernible information on the next value X
n+1
. This property is
often referred to as the lack of predictability.
– The numbers generated should be spread out evenly across the interval [0, 1].
In the following we will brieﬂy discuss the linear congruential generator. It is not a particularly powerful gen
erator (so we discourage you from using it in practise), however it is easy enough to allow some insight into how
pseudorandom number generators work.
6
That Lazzarini’s experiment was that precise, however, casts some doubt over the results of his experiments (see Badger,
1994, for a more detailed discussion).
7
We will only use the U(0, 1) distribution as a source of randomness. Samples from other distributions can be derived from
realisations of U(0, 1) random variables using deterministic algorithms.
2.4 Pseudorandom numbers 29
Algorithm 2.1 (Congruential pseudorandom number generator). 1. Choose a, M ∈ N, c ∈ N
0
, and the initial
value (“seed”) Z
0
∈ {1, . . . M −1}.
2. For i = 1, 2, . . .
Set Z
i
= (aZ
i−1
+c) mod M, and X
i
= Z
i
/M.
The integers Z
i
generated by the algorithm are from the set {0, 1, . . . , M−1} and thus the X
i
are in the interval
[0, 1).
It is easy to see that the sequence of pseudorandom numbers only depends on the seed Z
0
. Running the pseudo
random number generator twice with the same seed thus generates exactly the same sequence of pseudorandom
numbers. This can be a very useful feature when debugging your own code.
Example 2.4. Cosider the choice of a = 81, c = 35, M = 256, and seed Z
0
= 4.
Z
1
= (81 · 4 + 35) mod 256 = 359 mod 256 = 103
Z
2
= (81 · 103 + 35) mod 256 = 8378 mod 256 = 186
Z
3
= (81 · 186 + 35) mod 256 = 15101 mod 256 = 253
. . .
The corresponding X
i
are X
1
= 103/256 = 0.4023438, X
2
= 186/256 = 0.72656250, X
1
= 253/256 =
0.98828120. ⊳
The main ﬂaw of the congruential generator its “crystalline” nature (Marsaglia, 1968). If the sequence of gen
erated values X
1
, X
2
, . . . is viewed as points in an ndimension cube
8
, they lie on a ﬁnite, and often very small
number of parallel hyperplanes. Or as Marsaglia (1968) put it: “the points [generated by a congruential generator]
are about as randomly spaced in the unit ncube as the atoms in a perfect crystal at absolute zero.” The number of
hyperplanes depends on the choice of a, c, and M.
An example for a notoriously poor design of a congruential pseudorandom number generator is RANDU,
which was (unfortunately) very popular in the 1970s and used for example in IBM’s System/360 and System/370,
and Digital’s PDP11. It used a = 2
16
+ 3, c = 0, and M = 2
31
. The numbers generated by RANDU lie on only
15 hyperplanes in the 3dimensional unit cube (see ﬁgure 2.6).
Figure 2.7 shows another cautionary example (taken from Ripley, 1987). The lefthand panel shows a plot of
1,000 realisations of a congruential generator with a = 1229, c = 1, and M = 2
11
. The random numbers lie
on only 5 hyperplanes in the unit square. The right hand panel shows the outcome of the BoxMuller method for
transforming two uniform pseudorandom numbers into a pair of Gaussians (see example 3.2).
Due to this ﬂaw of the congruential pseudorandom number generator, it should not be used in Monte Carlo
experiments. For more powerful pseudorandom number generators see e.g. Marsaglia and Zaman (1991) or Mat
sumoto and Nishimura (1998). GNU R (and other environments) provide you with a large choice of powerful
random number generators, see the corresponding help page (?RNGkind) for details.
8
The (k + 1)th point has the coordinates (X
nk+1
, . . . , X
nk+n−1
).
30 2. An Introduction to Monte Carlo Methods
Figure 2.6. 300,000 realisations of the RANDU pseudorandom number generator plotted in 3D. A point corresponds to a triplet
(x
3k−2
, x
3k−1
, x
3k
) for k = 1, . . . , 100000. The data points lie on 15 hyperplanes.
0
.
0
0.0
0
.
2
0.2
0
.
4
0.4
0
.
6
0.6
0
.
8
0.8
1
.
0
1.0
X2k−1
X
2
k
(a) 1,000 realisations of this congruential generator plot
ted in 2D.
10

5
5
0
0
5
5
−2 log(X2k−1) cos(2πX2k)
−
2
l
o
g
(
X
2
k
−
1
)
s
i
n
(
2
π
X
2
k
)
(b) Supposedly bivariate Gaussian pseudorandom num
bers obtained using the pseudorandom numbers shown
in panel (a).
Figure 2.7. Results obtained using a congruential generator with a = 1229, c = 1, and M = 2
11
Chapter 3
Fundamental Concepts: Transformation, Re
jection, and Reweighting
3.1 Transformation methods
In section 2.4 we have seen how to create (pseudo)random numbers from the uniform distribution U[0, 1]. One
of the simplest methods of generating random samples from a distribution with cumulative distribution function
(c.d.f.) F(x) = P(X ≤ x) is based on the inverse of the c.d.f..
F
−
(u)
x
1
u
F(x)
Figure 3.1. Illustration of the deﬁnition of the generalised inverse F
−
of a c.d.f. F
The c.d.f. is an increasing function, however it is not necessarily continuous. Thus we deﬁne the generalised
inverse F
−
(u) = inf{x : F(x) ≥ u}. Figure 3.1 illustrates its deﬁnition. If F is continuous, then F
−
(u) =
F
−1
(u).
Theorem 3.1 (Inversion Method). Let U ∼ U[0, 1] and F be a c.d.f.. Then F
−
(U) has the c.d.f. F.
Proof. It is easy to see (e.g. in ﬁgure 3.1) that F
−
(u) ≤ x is equivalent to u ≤ F(x). Thus for U ∼ U[0, 1]
P(F
−
(U) ≤ x) = P(U ≤ F(x)) = F(x),
thus F is the c.d.f. of X = F
−
(U).
Example 3.1 (Exponential Distribution). The exponential distribution with rate λ > 0 has the c.d.f. F
λ
(x) = 1 −
exp(−λx) for x ≥ 0. Thus F
−
λ
(u) = F
−1
λ
(u) = −log(1 − u)/λ. Thus we can generate random samples from
Expo(λ) by applying the transformation −log(1 −U)/λ to a uniform U[0, 1] random variable U. As U and 1 −U,
of course, have the same distribution we can use −log(U)/λ as well. ⊳
32 3. Fundamental Concepts: Transformation, Rejection, and Reweighting
The Inversion Method is a very efﬁcient tool for generating random numbers. However very few distributions
possess a c.d.f. whose (generalised) inverse can be evaluated efﬁciently. Take the example of the Gaussian distribu
tion, whose c.d.f. is not even available in closed form.
Note however that the generalised inverse of the c.d.f. is just one possible transformation and that there might
be other transformations that yield the desired distribution. An example of such a method is the BoxMuller method
for generating Gaussian random variables.
Example 3.2 (BoxMuller Method for Sampling from Gaussians). When sampling from the normal distribution, one
faces the problem that neither the c.d.f. Φ(·), nor its inverse has a closedform expression. Thus we cannot use the
inversion method.
It turns out however, that if we consider a pair X
1
, X
2
i.i.d.
∼ N(0, 1), as a point (X
1
, X
2
) in the plane, then its
polar coordinates (R, θ) are again independent and have distributions we can easily sample from: θ ∼ U[0, 2π], and
R
2
∼ Expo(1/2).
This can be shown as follows. Assume that θ ∼ U[0, 2π] and R
2
∼ Expo(1/2). Then the joint density of (θ, r
2
)
is
f
(θ,r
2
)
(θ, r
2
) =
1
2π
1
[0,2π]
(θ) ·
1
2
exp
_
−
1
2
r
2
_
=
1
4π
exp
_
−
1
2
r
2
_
· 1
[0,2π]
(θ)
To obtain the probability density function of
X
1
=
√
R
2
· cos(θ), X
2
=
√
R
2
· sin(θ)
we need to use the transformation of densities formula.
f
(X
1
,X
2
)
(x
1
, x
2
) = f
(θ,r
2
)
(θ(x
1
, x
2
), r
2
(x
1
, x
2
)) ·
¸
¸
¸
¸
¸
∂x
1
∂θ
∂x
1
∂r
2
∂x
2
∂θ
∂x
2
∂r
2
¸
¸
¸
¸
¸
−1
=
1
4π
exp
_
−
1
2
(x
2
1
+x
2
2
)
2
_
· 2
=
_
1
√
2π
exp
_
−
1
2
x
2
1
__
·
_
1
√
2π
exp
_
−
1
2
x
2
2
__
as
¸
¸
¸
¸
¸
∂x
1
∂θ
∂x
1
∂r
2
∂x
2
∂θ
∂x
2
∂r
2
¸
¸
¸
¸
¸
=
¸
¸
¸
¸
¸
−r sin(θ)
cos(θ)
2r
r cos(θ)
sin(θ)
2r
¸
¸
¸
¸
¸
=
¸
¸
¸
¸
−
r sin(θ)
2
2r
−
r cos(θ)
2
2r
¸
¸
¸
¸
=
1
2
Thus X
1
, X
2
∼ N(0, 1). As their joint density factorises, X
1
and X
2
are independent, as required.
Thus we only need to generate θ ∼ U[0, 2π], and R
2
∼ Expo(1/2). Using U
1
, U
2
i.i.d.
∼ U[0, 1] and example 3.1
we can generate R =
√
R
2
and θ by
R =
_
−2 log(U
1
), θ = 2πU
2
,
and thus
X
1
=
_
−2 log(U
1
) · cos(2πU
2
), X
2
=
_
−2 log(U
1
) · sin(2πU
2
)
are two independent realisations from a N(0, 1) distribution. ⊳
The idea of transformation methods like the Inversion Method was to generate random samples from a distribu
tion other than the target distribution and to transform them such that they come from the desired target distribution.
In many situations, we cannot ﬁnd such a transformation in closed form. In these cases we have to ﬁnd other ways
of correcting for the fact that we sample from the “wrong” distribution. The next two sections present two such
ideas: rejection sampling and importance sampling.
3.2 Rejection sampling 33
3.2 Rejection sampling
The basic idea of rejection sampling is to sample from an instrumental distribution
1
and reject samples that are
“unlikely” under the target distribution.
Assume that we want to sample from a target distribution whose density f is known to us. The simple idea
underlying rejection sampling (and other Monte Carlo algorithms) is the rather trivial identity
f(x) =
_
f(x)
0
1 du =
_
1
0
1
0<u<f(x)
. ¸¸ .
=f(x,u)
du
Thus f(x) can be interpreted as the marginal density of a uniform distribution on the area under the density f(x)
{(x, u) : 0 ≤ u ≤ f(x)}.
Figure 3.2 illustrates this idea. This suggests that we can generate a sample fromf by sampling from the area under
the curve.
1 0
2.4
u
x
Figure 3.2. Illustration of example 3.3. Sampling from the area under the curve (dark grey) corresponds to sampling from the
Beta(3, 5) density. In example 3.3 we use a uniform distribution of the light grey rectangle as proposal distribution. Empty
circles denote rejected values, ﬁlled circles denote accepted values.
Example 3.3 (Sampling from a Beta distribution). The Beta(a, b) distribution (a, b ≥ 0) has the density
f(x) =
Γ(a +b)
Γ(a)Γ(b)
x
a−1
(1 −x)
b−1
, for 0 < x < 1,
where Γ(a) =
_
+∞
0
t
a−1
exp(−t) dt is the Gamma function. For a, b > 1 the Beta(a, b) density is unimodal with
mode (a − 1)/(a + b − 2). Figure 3.2 shows the density of a Beta(3, 5) distribution. It attains its maximum of
1680/729 ≈ 2.305 at x = 1/3.
Using the above identity we can draw from Beta(3, 5) by drawing from a uniform distribution on the area under the
density {(x, u) : 0 < u < f(x)} (the area shaded in dark gray in ﬁgure 3.2).
In order to sample from the area under the density, we will use a similar trick as in examples 2.1 and 2.2. We will
sample from the light grey rectangle and only keep the samples that fall in the area under the curve. Figure 3.2
illustrates this idea.
Mathematically speaking, we sample independently X ∼ U[0, 1] and U ∼ U[0, 2.4]. We keep the pair (X, U) if
U < f(X), otherwise we reject it.
The conditional probability that a pair (X, U) is kept if X = x is
P(U < f(X)X = x) = P(U < f(x)) = f(x)/2.4
1
The instrumental distribution is sometimes referred to as proposal distribution.
34 3. Fundamental Concepts: Transformation, Rejection, and Reweighting
As X and U were drawn independently we can rewrite our algorithm as: Draw X from U[0, 1] and accept X with
probability f(X)/2.4, otherwise reject X. ⊳
The method proposed in example 3.3 is based on bounding the density of the Beta distribution by a box. Whilst
this is a powerful idea, it cannot be directly applied to other distributions, as the density might be unbounded or
have inﬁnite support. However we might be able to bound the density of f(x) by M · g(x), where g(x) is a density
that we can easily sample from.
Algorithm 3.1 (Rejection sampling). Given two densities f, g with f(x) < M · g(x) for all x, we can generate a
sample from f as follows:
1. Draw X ∼ g
2. Accept X as a sample from f with probability
f(X)
M · g(X)
,
otherwise go back to step 1.
Proof. We have
P(X ∈ X and is accepted) =
_
X
g(x)
f(x)
M · g(x)
. ¸¸ .
=P(X is acceptedX=x)
dx =
_
X
f(x) dx
M
, (3.1)
and thus
2
P(X is accepted) = P(X ∈ S and is accepted) =
1
M
, (3.2)
yielding
P(x ∈ XX is accepted) =
P(X ∈ X and is accepted)
P(X is accepted)
=
_
X
f(x) dx/M
1/M
=
_
X
f(x) dx. (3.3)
Thus the density of the values accepted by the algorithm is f(·).
Remark 3.2. If we know f only up to a multiplicative constant, i.e. if we only know π(x), where f(x) = C · π(x),
we can carry out rejection sampling using
π(X)
M · g(X)
as probability of rejecting X, provided π(x) < M · g(x) for all x. Then by analogy with (3.1)  (3.3) we have
P(X ∈ X and is accepted) =
_
X
g(x)
π(x)
M · g(x)
dx =
_
X
π(x) dx
M
=
_
X
f(x) dx
C · M
,
P(X is accepted) = 1/(C · M), and thus
P(x ∈ XX is accepted) =
_
X
f(x) dx/(C · M)
1/(C · M)
=
_
X
f(x) dx
Example 3.4 (Rejection sampling from the N(0, 1) distribution using a Cauchy proposal). Assume we want to sam
ple from the N(0, 1) distribution with density
f(x) =
1
√
2π
exp
_
−
x
2
2
_
using a Cauchy distribution with density
g(x) =
1
π(1 +x
2
)
3.3 Importance sampling 35
1 2 3 4 5 6 −1 −2 −3 −4 −5 −6
M · g(x)
f(x)
Figure 3.3. Illustration of example 3.3. Sampling from the area under the density f(x) (dark grey) corresponds to sampling from
the N(0, 1) density. The proposal g(x) is a Cauchy(0, 1).
as instrumental distribution.
3
The smallest M we can choose such that f(x) ≤ Mg(x) is M =
√
2π · exp(−1/2).
Figure 3.3 illustrates the results. As before, ﬁlled circles correspond to accepted values whereas open circles corre
spond to rejected values.
Note that it is impossible to do rejection sampling from a Cauchy distribution using a N(0, 1) distribution as
instrumental distribution: there is no M ∈ R such that
1
π(1 +x
2
)
< M ·
1
√
2πσ
2
exp
_
−
x
2
2
_
;
the Cauchy distribution has heavier tails than the Gaussian distribution. ⊳
3.3 Importance sampling
In rejection sampling we have compensated for the fact that we sampled from the instrumental distribution g(x)
instead of f(x) by rejecting some of the values proposed by g(x). Importance sampling is based on the idea of
using weights to correct for the fact that we sample from the instrumental distribution g(x) instead of the target
distribution f(x).
Importance sampling is based on the identity
P(X ∈ A) =
_
A
f(x) dx =
_
A
g(x)
f(x)
g(x)
. ¸¸ .
=:w(x)
dx =
_
A
g(x)w(x) dx (3.4)
for all g(·), such that g(x) > 0 for (almost) all x with f(x) > 0. We can generalise this identity by considering the
expectation E
f
(h(X)) of a measurable function h:
E
f
(h(X)) =
_
S
f(x)h(x) dx =
_
S
g(x)
f(x)
g(x)
. ¸¸ .
=:w(x)
h(x) dx =
_
S
g(x)w(x)h(x) dx = E
g
(w(X) · h(X)), (3.5)
if g(x) > 0 for (almost) all x with f(x) · h(x) = 0.
Assume we have a sample X
1
, . . . , X
n
∼ g. Then, provided E
g
w(X) · h(X) exists,
1
n
n
i=1
w(X
i
)h(X
i
)
a.s.
n→∞
−→ E
g
(w(X) · h(X))
(by the Law of Large Numbers) and thus by (3.5)
1
n
n
i=1
w(X
i
)h(X
i
)
a.s.
n→∞
−→ E
f
(h(X)).
2
We denote by S the set of all possible values X can take, i.e.,
R
S
f(x)dx = 1.
3
There is not much point is using this method is practise. The BoxMuller method is more efﬁcient.
36 3. Fundamental Concepts: Transformation, Rejection, and Reweighting
In other words, we can estimate µ = E
f
(h(X)) by
˜ µ =
1
n
n
i=1
w(X
i
)h(X
i
)
Note that whilst E
g
(w(X)) =
_
S
f(x)
g(x)
g(x) dx =
_
S
f(x) = 1, the weights w
1
(X), . . . , w
n
(X) do not neces
sarily sum up to n, so one might want to consider the selfnormalised version
ˆ µ =
1
n
i=1
w(X
i
)
n
i=1
w(X
i
)h(X
i
).
This gives rise to the following algorithm:
Algorithm 3.2 (Importance Sampling). Choose g such that supp(g) ⊃ supp(f · h).
1. For i = 1, . . . , n:
i. Generate X
i
∼ g.
ii. Set w(X
i
) =
f(X
i
)
g(X
i
)
.
2. Return either
ˆ µ =
n
i=1
w(X
i
)h(X
i
)
n
i=1
w(X
i
)
or
˜ µ =
n
i=1
w(X
i
)h(X
i
)
n
The following theorem gives the bias and the variance of importance sampling.
Theorem 3.3 (Bias and Variance of Importance Sampling). (a) E
g
(˜ µ) = µ
(b) Var
g
(˜ µ) =
Var
g
(w(X) · h(X))
n
(c) E
g
(ˆ µ) = µ +
µVar
g
(w(X)) −Cov
g
(w(X), w(X) · h(X))
n
+O(n
−2
)
(d) Var
g
(ˆ µ) =
Var
g
(w(X) · h(X)) −2µCov
g
(w(X), w(X) · h(X)) +µ
2
Var
g
(w(X))
n
+O(n
−2
)
Proof. (a) E
g
_
1
n
n
i=1
w(X
i
)h(X
i
)
_
=
1
n
n
i=1
E
g
(w(X
i
)h(X
i
)) = E
f
(h(X))
(b) Var
g
_
1
n
n
i=1
w(X
i
)h(X
i
)
_
=
1
n
2
n
i=1
Var
g
(w(X
i
)h(X
i
)) =
Var
g
(w(X)h(X))
n
(c) and (d) see (Liu, 2001, p. 35)
Note that the theorem implies that in contrast to ˜ µ the selfnormalised estimator ˆ µ is biased. The selfnormalised
estimator ˆ µ however might have a lower variance. In addition, it has another advantage: we only need to know
the density up to a multiplicative constant, as it is often the case in hierarchical Bayesian modelling. Assume
f(x) = C · π(x), then
ˆ µ =
n
i=1
w(X
i
)h(X
i
)
n
i=1
w(X
i
)
=
n
i=1
f(X
i
)
g(X
i
)
h(X
i
)
n
i=1
f(X
i
)
g(X
i
)
=
n
i=1
C·π(X
i
)
g(X
i
)
h(X
i
)
n
i=1
C·π(X
i
)
g(X
i
)
=
n
i=1
π(X
i
)
g(X
i
)
h(X
i
)
n
i=1
π(X
i
)
g(X
i
)
,
i.e. the selfnormalised estimator ˆ µ does not depend on the normalisation constant C.
4
On the other hand, as we
have seen in the proof of theorem 3.3 it is a lot harder to analyse the theoretical properties of the selfnormalised
estimator ˆ µ.
Although the above equations (3.4) and (3.5) hold for every g with supp(g) ⊃ supp(f · h) and the importance
sampling algorithm converges for a large choice of such g, one typically only considers choices of g that lead to
ﬁnite variance estimators. The following two conditions are each sufﬁcient (albeit rather restrictive) for a ﬁnite
variance of ˜ µ:
4
By complete analogy one can show that is enough to know g up to a multiplicative constant.
3.3 Importance sampling 37
– f(x) < M · g(x) and Var
f
(h(X)) < +∞.
– S is compact, f is bounded above on S, and g is bounded below on S.
Note that under the ﬁrst condition rejection sampling can also be used to sample from f.
So far we have only studied whether a distribution g leads to a ﬁnitevariance estimator. This leads to the
question which instrumental distribution is optimal, i.e. for which choice Var(˜ µ) is minimal. The following theorem
answers this question:
Theorem 3.4 (Optimal proposal). The proposal distribution g that minimises the variance of ˜ µ is
g
∗
(x) =
h(x)f(x)
_
S
h(t)f(t) dt
.
Proof. We have from theroem 3.3 (b) that
n·Var
g
(˜ µ) = Var
g
(w(X) · h(X)) = Var
g
_
h(X) · f(X)
g(X)
_
= E
g
_
_
h(X) · f(X)
g(X)
_
2
_
−
_
E
g
_
h(X) · f(X)
g(X)
_
. ¸¸ .
=E
g
(˜ µ)=µ
_
2
.
Thus we only have to minimise E
g
_
_
h(X)·f(X)
g(X)
_
2
_
. When plugging in g
⋆
we obtain:
E
g
⋆
_
_
h(X) · f(X)
g
⋆
(X)
_
2
_
=
_
S
h(x)
2
· f(x)
2
g
⋆
(x)
dx =
__
S
h(x)
2
· f(x)
2
h(x)f(x)
dx
_
·
__
S
h(t)f(t) dt
_
=
__
S
h(x)f(x) dx
_
2
On the other hand, we can apply the Jensen inequality
5
to E
g
_
_
h(X)·f(X)
g(X)
_
2
_
yielding
E
g
_
_
h(X) · f(X)
g(X)
_
2
_
≥
_
E
g
_
h(X) · f(X)
g(X)
__
2
=
__
S
h(x)f(x) dx
_
2
.
An important corollary of theorem 3.4 is that importance sampling can be superefﬁcient, i.e. when using the
optimal g
⋆
from theorem 3.4 the variance of ˜ µ is less than the variance obtained when sampling directly from f:
n · Var
f
_
h(X
1
) +. . . +h(X
n
)
n
_
= E
f
(h(X)
2
) −µ
2
≥ (E
f
h(X))
2
−µ
2
=
__
S
h(x)f(x) dx
_
2
−µ
2
= n · Var
g
⋆(˜ µ)
by Jensen’s inequality. Unless h(X) is (almost surely) constant the inequality is strict. There is an intuitive expla
nation to the superefﬁciency of importance sampling. Using g
⋆
instead of f causes us to focus on regions of high
probability where h is large, which contribute most to the integral E
f
(h(X)).
Theorem 3.4 is, however, a rather formal optimality result. When using ˜ µ we need to know the normalisation
constant of g
⋆
, which is exactly the integral we are looking for. Further we need to be able to draw samples from g
⋆
efﬁciently. The practically important corollary of theorem 3.4 is that we should choose an instrumental distribution
g whose shape is close to the one of f · h.
Example 3.5 (Computing E
f
X for X ∼ t
3
). Assume we want to compute E
f
X for X from a tdistribution with
3 degrees of freedom (t
3
) using a Monte Carlo method. Three different schemes are considered
5
If X is realvalued random variable, and ψ a convex function, then ψ(E(X)) ≤ E(ψ(X)).
38 3. Fundamental Concepts: Transformation, Rejection, and Reweighting
– Sampling X
1
, . . . , X
n
directly from t
3
and estimating E
f
X by
1
n
i=1
nX
i
.
– Alternatively we could use importance sampling using a t
1
(which is nothing other than a Cauchy distribution)
as instrumental distribution. The idea behind this choice is that the density g
t
1
(x) of a t
1
distribution is closer to
f(x)x, where f(x) is the density of a t
3
distribution, as ﬁgure 3.4 shows.
– Third, we will consider importance sampling using a N(0, 1) distribution as instrumental distribution.
4 2 0 2 4
0
.
0
0
.
1
0
.
2
0
.
3
0
.
4
x
x · f(x) (Target)
f(x) (direct sampling)
gt
1
(x) (IS t1)
g
N(0,1)
(x) (IS N(0, 1))
Figure 3.4. Illustration of the different instrumental distributions in example 3.5.
Note that the third choice yields weights of inﬁnite variance, as the instrumental distribution (N(0, 1)) has lighter
tails than the distribution we want to sample from (t
3
). The righthand panel of ﬁgure 3.5 illustrates that this choice
yields a very poor estimate of the integral
_
xf(x) dx.
Sampling directly from the t
3
distribution can be seen as importance sampling with all weights w
i
≡ 1, this choice
clearly minimises the variance of the weights. This however does not imply that this yields an estimate of the
integral
_
xf(x) dx of minimal variance. Indeed, after 1500 iterations the empirical standard deviation (over 100
realisations) of the direct estimate is 0.0345, which is larger than the empirical standard deviation of ˜ µ when using
a t
1
distribution as instrumental distribution, which is 0.0182. So using a t
1
distribution as instrumental distribution
is superefﬁcient (see ﬁgure 3.5).
Figure 3.6 somewhat explains why the t
1
distribution is a far better choice than the N(0, 1) distributon. As the
N(0, 1) distribution does not have heavy enough tails, the weight tends to inﬁnity as x →+∞. Thus large x get
large weights, causing the jumps of the estimate ˜ µ shown in ﬁgure 3.5. The t
1
distribution has heavy enough tails,
so the weights are small for large values of x, explaining the small variance of the estimate ˜ µ when using a t
1
distribution as instrumental distribution. ⊳
Example 3.6 (Partially labelled data). Suppose that we are given count data from observations in two groups, such
that
Y
i
∼ Poi(λ
1
) if the ith observation is from group 1
Y
i
∼ Poi(λ
2
) if the ith observation is from group 2
The data is given in the table 3.1. Note that only the ﬁrst ten observations are labelled, the group label is missing
for the remaining ten observations.
We will use a Gamma(α, β) distribution as (conjugate) prior distribution for λ
j
, i.e. the prior density of λ
j
is
3.3 Importance sampling 39
0 0 0 500 500 500 1000 1000 1000 1500 1500 1500
0
.
0
0
.
5
1
.
0
1
.
5
2
.
0
2
.
5
3
.
0
I
S
e
s
t
i
m
a
t
e
o
v
e
r
t
i
m
e
Iteration
Sampling directly from t3 IS using t1 as instrumental distribution IS using N(0, 1) as instrumental distribution
Figure 3.5. Estimates of EX for X ∼ t
3
obtained after 1 to 1500 iterations. The three panels correspond to the three different
sampling schemes used. The areas shaded in grey correspond to the range of 100 replications.
0 0 0 2 2 2 4 4 4 2 2 2 4 4 4
0
.
0
0
.
5
1
.
0
1
.
5
2
.
0
2
.
5
3
.
0
W
e
i
g
h
t
s
W
i
Sample Xi from the instrumental distribution
Sampling directly from t3 IS using t1 as instrumental distribution IS using N(0, 1) as instrumental distribution
Figure 3.6. Weights W
i
obtained for 20 realisations X
i
from the different instrumental distributions.
40 3. Fundamental Concepts: Transformation, Rejection, and Reweighting
Group Count Y
i
Group Count Y
i
Group Count Y
i
Group Count Y
i
1 3 2 14 * 15 * 21
1 6 2 12 * 4 * 11
1 3 2 11 * 1 * 3
1 5 2 19 * 6 * 7
1 9 2 18 * 11 * 18
Table 3.1. Data of example 3.6.
f(λ
j
) =
1
Γ(α)
λ
α−1
j
β
α
j
exp(−βλ
j
).
Furthermore, we believe that a priori each observation is equally likely to stem from group 1 or group 2.
We start with analysing the labelled data only, ignoring the 10 unlabelled observations. In this case, we can analyse
the two groups separately. In group 1 we have that the joint distribution of Y
1
, . . . , Y
5
, λ
1
is given by
f(y
1
, . . . , y
5
, λ
1
) = f(y
1
, . . . , y
5
λ
1
)f(λ
1
) =
_
5
i=1
exp(−λ
1
)λ
y
i
1
y
i
!
_
·
1
Γ(α)
λ
α−1
1
β
α
exp(−βλ)
=
1
5
i=1
y
i
!
·
1
Γ(α)
λ
α+
P
5
i=1
y
i
1
β
α
exp(−(β + 5)λ
1
) ∝ λ
α+
P
5
i=1
y
i
1
exp(−(β + 5)λ
1
)
The posterior distribution of λ
1
given the data from group 1 is
f(λ
1
y
1
, . . . , y
5
) =
f(y
1
, . . . , y
5
, λ
1
)
_
λ
f(y
1
, . . . , y
5
, λ) dλ
∝ f(y
1
, . . . , y
5
, λ
1
)
∝λ
α+
P
5
i=1
y
i
1
exp(−(β + 5)λ
1
)
Comparing this to the density of the Gamma distribution we obtain that
λ
1
Y
1
, . . . , Y
5
∼ Gamma
_
α +
5
i=1
y
i
, β + 5
_
,
and similarly
λ
2
Y
6
, . . . , Y
10
∼ Gamma
_
α +
10
i=6
y
i
, β + 5
_
.
Thus, when only using the labelled data, we do not have to resort to Monte Carlo methods for ﬁnding the posterior
distribution.
This however is not the case any more once we also want to include the unlabelled data. The conditional density of
Y
i
λ
1
, λ
2
for an unlabelled observation (i > 10) is
f(y
i
λ
1
, λ
2
) =
1
2
exp(−λ
1
)λ
y
i
1
y
i
!
+
1
2
exp(−λ
2
)λ
y
i
2
y
i
!
The posterior density for the entire sample (using both labelled and unlabelled data) is
f(λ
1
, λ
2
y
1
, . . . , y
20
)∝f(λ
1
)f(y
1
, . . . , y
5
λ
1
)
. ¸¸ .
∝f(λ
1
y
1
,...,y
5
)
f(λ
2
)f(y
6
, . . . , y
10
λ
2
)
. ¸¸ .
∝f(λ
2
y
6
,...,y
10
)
· f(y
11
, . . . , y
20
λ
1
, λ
2
)
. ¸¸ .
=
Q
20
i=11
f(y
i
λ
1
,λ
2
)
∝ f(λ
1
y
1
, . . . , y
5
)f(λ
2
y
6
, . . . , y
10
)
20
i=11
f(y
i
λ
1
, λ
2
)
This suggests using importance sampling with the product of the distributions of λ
1
Y
1
, . . . , Y
5
and λ
2
Y
6
, . . . , Y
10
as instrumental distributions, i. e. use
g(λ
1
, λ
2
) = f(λ
1
y
1
, . . . , y
5
)f(λ
2
y
6
, . . . , y
10
).
The target distribution is f(λ
1
, λ
2
y
1
, . . . , y
20
), thus the weights are
3.3 Importance sampling 41
w(λ
1
, λ
2
) =
f(λ
1
, λ
2
y
1
, . . . , y
20
)
g(λ
1
, λ
2
)
(3.6)
∝
f(λ
1
y
1
, . . . , y
5
)f(λ
2
y
6
, . . . , y
10
)
20
i=11
f(y
i
λ
1
, λ
2
)
f(λ
1
y
1
, . . . , y
5
)f(λ
2
y
6
, . . . , y
10
)
=
20
i=11
f(y
i
λ
1
, λ
2
) =
20
i=11
_
1
2
exp(−λ
1
)λ
y
i
1
y
i
!
+
1
2
exp(−λ
2
)λ
y
i
2
y
i
!
_
Thus we can draw a weighted sample of size n from the distribution of f(λ
1
, λ
2
y
1
, . . . , y
20
) by repeating the three
steps below n times:
1. Draw λ
1
∼ Gamma
_
α +
5
i=1
y
i
, β + 5
_
2. Draw λ
2
∼ Gamma
_
α +
10
i=6
y
i
, β + 5
_
3. Compute the weight w(λ
1
, λ
2
) using equation (3.6).
From a simulation with n = 50, 000 I obtained 4.4604 as posterior mean of λ
1
and 14.5294 as posterior mean
of λ
2
. The posterior densities are shown in ﬁgure 3.7. ⊳
Posterior density of λ
1
N = 50000 Bandwidth = 0.1036
All data
Labelled data
Posterior density of λ
N = 50000 Bandwidth = 0.1742
All data
Labelled data
r
e
p
l
a
c
e
m
e
n
t
s
2
2 4 6 8
0
.
0
0
.
1
0
.
2
0
.
3
0
.
4
0
.
5
D
e
n
s
i
t
y
D
e
n
s
i
t
y
10 10 15 20
0
.
0
0
0
.
0
5
0
.
1
0
0
.
1
5
0
.
2
0
0
.
2
5
0
.
3
0
Figure 3.7. Posterior distributions of λ
1
and λ
2
in example 3.6. The dashed line is the posterior density obtained only from the
labelled data.
42 3. Fundamental Concepts: Transformation, Rejection, and Reweighting
Chapter 4
The Gibbs Sampler
4.1 Introduction
In section 3.3 we have seen that, using importance sampling, we can approximate an expectation E
f
(h(X)) without
having to sample directly from f. However, ﬁnding an instrumental distribution which allows us to efﬁciently
estimate E
f
(h(X)) can be difﬁcult, especially in large dimensions.
In this chapter and the following chapters we will use a somewhat different approach. We will discuss methods
that allow obtaining an approximate sample fromf without having to sample fromf directly. More mathematically
speaking, we will discuss methods which generate a Markov chain whose stationary distribution is the distribution
of interest f. Such methods are often referred to as Markov Chain Monte Carlo (MCMC) methods.
Example 4.1 (Poisson change point model). Assume the following Poisson model of two regimes for n random
variables Y
1
, . . . , Y
n
.
1
Y
i
∼ Poi(λ
1
) for i = 1, . . . , M
Y
i
∼ Poi(λ
2
) for i = M + 1, . . . , n
A suitable (conjugate) prior distribution for λ
j
is the Gamma(α
j
, β
j
) distribution with density
f(λ
j
) =
1
Γ(α
j
)
λ
α
j
−1
j
β
α
j
j
exp(−β
j
λ
j
).
The joint distribution of Y
1
, . . . , Y
n
, λ
1
, λ
2
, and M is
f(y
1
, . . . , y
n
, λ
1
, λ
2
, M) =
_
M
i=1
exp(−λ
1
)λ
y
i
1
y
i
!
_
·
_
n
i=M+1
exp(−λ
2
)λ
y
i
2
y
i
!
_
·
1
Γ(α
1
)
λ
α
1
−1
1
β
α
1
1
exp(−β
1
λ
1
) ·
1
Γ(α
2
)
λ
α
2
−1
2
β
α
2
2
exp(−β
2
λ
2
).
If M is known, the posterior distribution of λ
1
has the density
f(λ
1
Y
1
, . . . , Y
n
, M) ∝ λ
α
1
−1+
P
M
i=1
y
i
1
exp(−(β
1
+M)λ
1
),
so
λ
1
Y
1
, . . . Y
n
, M ∼ Gamma
_
α
1
+
M
i=1
y
i
, β
1
+M
_
(4.1)
λ
2
Y
1
, . . . Y
n
, M ∼ Gamma
_
α
2
+
n
i=M+1
y
i
, β
2
+n −M
_
. (4.2)
1
The probability distribution function of the Poi(λ) distribution is p(y) =
exp(−λ)λ
y
y!
.
44 4. The Gibbs Sampler
Now assume that we do not know the change point M and that we assume a uniform prior on the set {1, . . . , M −
1}. It is easy to compute the distribution of M given the observations Y
1
, . . . Y
n
, and λ
1
and λ
2
. It is a discrete
distribution with probability density function proportional to
p(MY
1
, . . . , Y
n
, λ
1
, λ
2
) ∝ λ
P
M
i=1
y
i
1
· λ
P
n
i=M+1
y
i
2
· exp((λ
2
−λ
1
) · M) (4.3)
The conditional distributions in (4.1) to (4.3) are all easy to sample from. It is however rather difﬁcult to sample
from the joint posterior of (λ
1
, λ
2
, M). ⊳
The example above suggests the strategy of alternately sampling from the (full) conditional distributions ((4.1)
to (4.3) in the example). This tentative strategy however raises some questions.
– Is the joint distribution uniquely speciﬁed by the conditional distributions?
– Sampling alternately from the conditional distributions yields a Markov chain: the newly proposed values only
depend on the present values, not the past values. Will this approach yield a Markov chain with the correct
invariant distribution? Will the Markov chain converge to the invariant distribution?
As we will see in sections 4.3 and 4.4, the answer to both questions is — under certain conditions — yes. The
next section will however ﬁrst of all state the Gibbs sampling algorithm.
4.2 Algorithm
The Gibbs sampler was ﬁrst proposed by Geman and Geman (1984) and further developed by Gelfand and Smith
(1990). Denote with x
−i
:= (x
1
, . . . , x
i−1
, x
i+1
, . . . , x
p
).
Algorithm 4.1 ((Systematic sweep) Gibbs sampler). Starting with (X
(0)
1
, . . . , X
(0)
p
) iterate for t = 1, 2, . . .
1. Draw X
(t)
1
∼ f
X
1
X
−1
(·X
(t−1)
2
, . . . , X
(t−1)
p
).
. . .
j. Draw X
(t)
j
∼ f
X
j
X
−j
(·X
(t)
1
, . . . , X
(t)
j−1
, X
(t−1)
j+1
, . . . , X
(t−1)
p
).
. . .
p. Draw X
(t)
p
∼ f
X
p
X
−p
(·X
(t)
1
, . . . , X
(t)
p−1
).
Figure 4.1 illustrates the Gibbs sampler. The conditional distributions as used in the Gibbs sampler are often
referred to as full conditionals. Note that the Gibbs sampler is not reversible. Liu et al. (1995) proposed the following
algorithm that yields a reversible chain.
Algorithm 4.2 (Random sweep Gibbs sampler). Starting with (X
(0)
1
, . . . , X
(0)
p
) iterate for t = 1, 2, . . .
1. Draw an index j from a distribution on {1, . . . , p} (e.g. uniform)
2. Draw X
(t)
j
∼ f
X
j
X
−j
(·X
(t−1)
1
, . . . , X
(t−1)
j−1
, X
(t−1)
j+1
, . . . , X
(t−1)
p
), and set X
(t)
ι
:= X
(t−1)
ι
for all ι = j.
4.3 The HammersleyClifford Theorem
An interesting property of the full conditionals, which the Gibbs sampler is based on, is that they fully specify the
joint distribution, as Hammersley and Clifford proved in 1970
2
. Note that the set of marginal distributions does not
have this property.
2
Hammersley and Clifford actually never published this result, as they could not extend the theorem to the case of non
positivity.
4.3 The HammersleyClifford Theorem 45
X
(t)
1
X
(
t
)
2
(X
(0)
1
, X
(0)
2
)
(X
(1)
1
, X
(0)
2
)
(X
(1)
1
, X
(1)
2
)
(X
(2)
1
, X
(1)
2
)
(X
(2)
1
, X
(2)
2
) (X
(3)
1
, X
(2)
2
)
(X
(3)
1
, X
(3)
2
) (X
(4)
1
, X
(3)
2
)
(X
(4)
1
, X
(4)
2
) (X
(5)
1
, X
(4)
2
)
(X
(5)
1
, X
(5)
2
) (X
(6)
1
, X
(5)
2
)
(X
(6)
1
, X
(6)
2
)
Figure 4.1. Illustration of the Gibbs sampler for a twodimensional distribution
Deﬁnition 4.1 (Positivity condition). A distribution with density f(x
1
, . . . , x
p
) and marginal densities f
X
i
(x
i
) is
said to satisfy the positivity condition if f
X
i
(x
i
) > 0 for all x
1
, . . . , x
p
implies that f(x
1
, . . . , x
p
) > 0.
The positivity condition thus implies that the support of the joint density f is the Cartesian product of the support
of the marginals f
X
i
.
Theorem 4.2 (HammersleyClifford). Let (X
1
, . . . , X
p
) satisfy the positivity condition and have joint density
f(x
1
, . . . , x
p
). Then for all (ξ
1
, . . . , ξ
p
) ∈ supp(f)
f(x
1
, . . . , x
p
) ∝
p
j=1
f
X
j
X
−j
(x
j
x
1
, . . . , x
j−1
, ξ
j+1
, . . . , ξ
p
)
f
X
j
X
−j
(ξ
j
x
1
, . . . , x
j−1
, ξ
j+1
, . . . , ξ
p
)
Proof. We have
f(x
1
, . . . , x
p−1
, x
p
) = f
X
p
X
−p
(x
p
x
1
, . . . , x
p−1
)f(x
1
, . . . , x
p−1
) (4.4)
and by complete analogy
f(x
1
, . . . , x
p−1
, ξ
p
) = f
X
p
X
−p
(ξ
p
x
1
, . . . , x
p−1
)f(x
1
, . . . , x
p−1
), (4.5)
thus
f(x
1
, . . . , x
p
)
(4.4)
= f(x
1
, . . . , x
p−1
)
. ¸¸ .
(4.5)
= f(x
1
,...,,x
p−1
,ξ
p
)/f
X
p
X
−p
(ξ
p
x
1
,...,x
p−1
)
f
X
p
X
−p
(x
p
x
1
, . . . , x
p−1
)
= f(x
1
, . . . , x
p−1
, ξ
p
)
f
X
p
X
−p
(x
p
x
1
, . . . , x
p−1
)
f
X
p
X
−p
(ξ
p
x
1
, . . . , x
p−1
)
= . . .
= f(ξ
1
, . . . , ξ
p
)
f
X
1
X
−1
(x
1
ξ
2
, . . . , ξ
p
)
f
X
1
X
−1
(ξ
1
ξ
2
, . . . , ξ
p
)
· · ·
f
X
p
X
−p
(x
p
x
1
, . . . , x
p−1
)
f
X
p
X
−p
(ξ
p
x
1
, . . . , x
p−1
)
The positivity condition guarantees that the conditional densities are nonzero.
Note that the HammersleyClifford theorem does not guarantee the existence of a joint probability distribution
for every choice of conditionals, as the following example shows. In Bayesian modeling such problems mostly arise
when using improper prior distributions.
46 4. The Gibbs Sampler
Example 4.2. Consider the following “model”
X
1
X
2
∼ Expo(λX
2
)
X
2
X
1
∼ Expo(λX
1
),
for which it would be easy to design a Gibbs sampler. Trying to apply the HammersleyClifford theorem, we obtain
f(x
1
, x
2
) ∝
f
X
1
X
2
(x
1
ξ
2
) · f
X
2
X
1
(x
2
x
1
)
f
X
1
X
2
(ξ
1
ξ
2
) · f
X
2
X
1
(ξ
2
x
1
)
=
λξ
2
exp(−λx
1
ξ
2
) · λx
1
exp(−λx
1
x
2
)
λξ
2
exp(−λξ
1
ξ
2
) · λx
1
exp(−λx
1
ξ
2
)
∝ exp(−λx
1
x
2
)
The integral
_ _
exp(−λx
1
x
2
) dx
1
dx
2
however is not ﬁnite, thus there is no twodimensional probability distri
bution with f(x
1
, x
2
) as its density. ⊳
4.4 Convergence of the Gibbs sampler
First of all we have to analyse whether the joint distribution f(x
1
, . . . , x
p
) is indeed the stationary distribution
of the Markov chain generated by the Gibbs sampler
3
. For this we ﬁrst have to determine the transition kernel
corresponding to the Gibbs sampler.
Lemma 4.3. The transition kernel of the Gibbs sampler is
K(x
(t−1)
, x
(t)
) = f
X
1
X
−1
(x
(t)
1
x
(t−1)
2
, . . . , x
(t−1)
p
) · f
X
2
X
−2
(x
(t)
2
x
(t)
1
, x
(t−1)
3
, . . . , x
(t−1)
p
) · . . .
·f
X
p
X
−p
(x
(t)
p
x
(t)
1
, . . . , x
(t)
p−1
)
Proof. We have
P(X
(t)
∈ XX
(t−1)
= x
(t−1)
) =
_
X
f
(X
t
X
(t−1)
)
(x
(t)
x
(t−1)
) dx
(t)
=
_
X
f
X
1
X
−1
(x
(t)
1
x
(t−1)
2
, . . . , x
(t−1)
p
)
. ¸¸ .
corresponds to step 1. of the algorithm
· f
X
2
X
−2
(x
(t)
2
x
(t)
1
, x
(t−1)
3
, . . . , x
(t−1)
p
)
. ¸¸ .
corresponds to step 2. of the algorithm
· . . .
· f
X
p
X
−p
(x
(t)
p
x
(t)
1
, . . . , x
(t)
p−1
)
. ¸¸ .
corresponds to step p. of the algorithm
dx
(t)
3
All the results in this section will be derived for the systematic scan Gibbs sampler (algorithm 4.1). Very similar results hold
for the random scan Gibbs sampler (algorithm 4.2).
4
.
4
C
o
n
v
e
r
g
e
n
c
e
o
f
t
h
e
G
i
b
b
s
s
a
m
p
l
e
r
4
7
Proposition 4.4. The joint distribution f(x
1
, . . . , x
p
) is indeed the invariant distribution of the Markov chain (X
(0)
, X
(1)
, . . .) generated by the Gibbs sampler.
Proof.
_
f(x
(t−1)
)K(x
(t−1)
, x
(t)
) dx
(t−1)
=
_
· · ·
_
f(x
(t−1)
1
, . . . , x
(t−1)
p
) dx
(t−1)
1
. ¸¸ .
=f(x
(t−1)
2
,...,x
(t−1)
p
)
f
X
1
X
−1
(x
(t)
1
x
(t−1)
2
, . . . , x
(t−1)
p
)
. ¸¸ .
=f(x
(t)
1
,x
(t−1)
2
,...,x
(t−1)
p
)
· · · f
X
p
X
−p
(x
(t)
p
x
(t)
1
, . . . , x
(t)
p−1
)dx
(t−1)
2
. . . dx
(t−1)
p
=
_
· · ·
_
f(x
(t)
1
, x
(t−1)
2
, . . . , x
(t−1)
p
) dx
(t−1)
2
. ¸¸ .
=f(x
(t)
1
,x
(t−1)
3
,...,x
(t−1)
p
)
f
X
2
X
−2
(x
(t)
2
x
(t)
1
, x
(t−1)
3
, . . . , x
(t−1)
p
)
. ¸¸ .
=f(x
(t)
1
,x
(t)
2
,x
(t−1)
3
,...,x
(t−1)
p
)
· · · f
X
p
X
−p
(x
(t)
p
x
(t)
1
, . . . , x
(t)
p−1
)dx
(t−1)
3
. . . dx
(t−1)
p
= . . .
=
_
f(x
(t)
1
, . . . , x
(t)
p−1
, x
(t−1)
p
) dx
(t−1)
p
. ¸¸ .
=f(x
(t)
1
,...,x
(t)
p−1
)
f
X
p
X
−p
(x
(t)
p
x
(t)
1
, . . . , x
(t)
p−1
)
. ¸¸ .
=f(x
(t)
1
,...,x
(t)
p
)
= f(x
(t)
1
, . . . , x
(t)
p
)
Thus according to deﬁnition 1.27 f is indeed the invariant distribution.
48 4. The Gibbs Sampler
So far we have established that f is indeed the invariant distribution of the Gibbs sampler. Next, we have to analyse
under which conditions the Markov chain generated by the Gibbs sampler will converge to f.
First of all we have to study under which conditions the resulting Markov chain is irreducible
4
. The following
example shows that this does not need to be the case.
Example 4.3 (Reducible Gibbs sampler). Consider Gibbs sampling from the uniform distribution on C
1
∪ C
2
with
C
1
:= {(x
1
, x
2
) : (x
1
, x
2
) −(1, 1) ≤ 1} and C
2
:= {(x
1
, x
2
) : (x
1
, x
2
) −(−1, −1) ≤ 1}, i.e.
f(x
1
, x
2
) =
1
2π
I
C
1
∪C
2
(x
1
, x
2
)
Figure 4.2 shows the density as well the ﬁrst few samples obtained by starting a Gibbs sampler with X
(0)
1
< 0 and
X
(0)
2
< 0. It is easy to that when the Gibbs sampler is started in C
2
it will stay there and never reach C
1
. The reason

2
2

1
1
0
0
1
1
2
2
X
(t)
1
X
(
t
)
2
Figure 4.2. Illustration of a Gibbs sampler failing to sample from a distribution with unconnected support (uniform distribution
on {(x
1
, x
2
) : (x
1
, x
2
) −(1, 1) ≤ 1} ∪ {(x
1
, x
2
) : (x
1
, x
2
) −(−1, −1) ≤ 1})
for this is that the conditional distribution X
2
X
1
(X
1
X
2
) is for X
1
< 0 (X
2
< 0) entirely concentrated on C
2
. ⊳
The following proposition gives a sufﬁcient condition for irreducibility (and thus the recurrence) of the Markov
chain generated by the Gibbs sampler. There are less strict conditions for the irreducibility and aperiodicity of the
Markov chain generated by the Gibbs sampler (see e.g. Robert and Casella, 2004, Lemma 10.11).
Proposition 4.5. If the joint distribution f(x
1
, . . . , x
p
) satisﬁes the positivity condition, the Gibbs sampler yields
an irreducible, recurrent Markov chain.
Proof. Let X ⊂ supp(f) be a set with
_
X
f(x
(t)
1
, . . . , x
(t)
p
)d(x
(t)
1
, . . . , x
(t)
p
) > 0.
_
X
K(x
(t−1)
, x
(t)
)dx
(t)
=
_
X
f
X
1
X
−1
(x
(t)
1
x
(t−1)
2
, . . . , x
(t−1)
p
)
. ¸¸ .
>0 (on a set of nonzero measure)
· · · f
X
p
X
−p
(x
(t)
p
x
(t)
1
, . . . , x
(t)
p−1
)
. ¸¸ .
>0 (on a set of nonzero measure)
dx
(t)
> 0,
where the conditional densities are nonzero by the positivity condition. Thus the Markov Chain (X
(t)
)
t
is strongly
firreducible. As f is the unique invariant distribution of the Markov chain, it is as well recurrent (proposition
1.28).
4
Here and in the following we understand by “irreducibilty” irreducibility with respect to the target distribution f.
4.4 Convergence of the Gibbs sampler 49
If the transition kernel is absolutely continuous with respect to the dominating measure, then recurrence even
implies Harris recurrence (see e.g. Robert and Casella, 2004, Lemma 10.9).
Now we have established all the necessary ingredients to state an ergodic theorem for the Gibbs sampler, which
is a direct consequence of theorem 1.30.
Theorem 4.6. If the Markov chain generated by the Gibbs sampler is irreducible and recurrent (which is e.g. the
case when the positivity condition holds), then for any integrable function h : E →R
lim
n→∞
1
n
n
t=1
h(X
(t)
) →E
f
(h(X))
for almost every starting value X
(0)
. If the chain is Harris recurrent, then the above result holds for every starting
value X
(0)
.
Theorem 4.6 guarantees that we can approximate expectations E
f
(h(X)) by their empirical counterparts using
a single Markov chain.
Example 4.4. Assume that we want to use a Gibbs sampler to estimate the probability P(X
1
≥ 0, X
2
≥ 0) for a
N
2
__
µ
1
µ
2
_
,
_
σ
2
1
σ
12
σ
12
σ
2
2
__
distribution.
5
The marginal distributions are
X
1
∼ N(µ
1
, σ
2
1
) and X
2
∼ N(µ
2
, σ
2
2
)
In order to construct a Gibbs sampler, we need the conditional distributions X
1
X
2
= x
2
and X
2
X
1
= x
1
. We
have
6
f(x
1
, x
2
) ∝ exp
_
_
−
1
2
__
x
1
x
2
_
−
_
µ
1
µ
2
__′ _
σ
2
1
σ
12
σ
12
σ
2
2
_−1 __
x
1
x
2
_
−
_
µ
1
µ
2
__
_
_
∝ exp
_
−
(x
1
−(µ
1
+σ
12
/σ
2
22
(x
2
−µ
2
)))
2
2(σ
2
1
−(σ
12
)
2
/σ
2
2
)
_
,
i.e.
X
1
X
2
= x
2
∼ N(µ
1
+σ
12
/σ
2
2
(x
2
−µ
2
), σ
2
1
−(σ
12
)
2
/σ
2
2
)
Thus the Gibbs sampler for this problem consists of iterating for t = 1, 2, . . .
1. Draw X
(t)
1
∼ N(µ
1
+σ
12
/σ
2
2
(X
(t−1)
2
−µ
2
), σ
2
1
−(σ
12
)
2
/σ
2
2
)
5
A Gibbs sampler is of course not the optimal way to sample from a N
p
(µ, Σ) distribution. A more efﬁcient way is: draw
Z
1
, . . . , Z
p
i.i.d.
∼ N(0, 1) and set (X
1
, . . . , X
p
)
′
= Σ
1/2
(Z
1
, . . . , Z
p
)
′
+µ
6
We make use of
x
1
x
2
!
−
µ
1
µ
2
!!
′
σ
2
1
σ
12
σ
12
σ
2
2
!
−1
x
1
x
2
!
−
µ
1
µ
2
!!
=
1
σ
2
1
σ
2
2
−(σ
12
)
2
x
1
x
2
!
−
µ
1
µ
2
!!
′
σ
2
2
−σ
12
−σ
12
σ
2
1
!
x
1
x
2
!
−
µ
1
µ
2
!!
=
1
σ
2
1
σ
2
2
−(σ
12
)
2
`
σ
2
2
(x
1
−µ
1
)
2
−2σ
12
(x
1
−µ
1
)(x
2
−µ
2
)
´
+ const
=
1
σ
2
1
σ
2
2
−(σ
12
)
2
`
σ
2
2
x
2
1
−2σ
2
2
x
1
µ
1
−2σ
12
x
1
(x
2
−µ
2
)
´
+ const
=
1
σ
2
1
−(σ
12
)
2
/σ
2
2
`
x
2
1
−2x
1
(µ
1
+σ
12
/σ
2
2
(x
2
−µ
2
))
´
+ const
=
1
σ
2
1
−(σ
12
)
2
/σ
2
2
`
x
1
−(µ
1
+σ
12
/σ
2
2
(x
2
−µ
2
)
´
2
+ const
50 4. The Gibbs Sampler
2. Draw X
(t)
2
∼ N(µ
2
+σ
12
/σ
2
1
(X
(t)
1
−µ
1
), σ
2
2
−(σ
12
)
2
/σ
2
1
).
Now consider the special case µ
1
= µ
2
= 0, σ
2
1
= σ
2
2
= 1 and σ
12
= 0.3. Figure 4.4 shows the sample paths of
this Gibbs sampler.
Using theorem 4.6 we can estimate P(X
1
≥ 0, X
2
≥ 0) by the proportion of samples (X
(t)
1
, X
(t)
2
) with X
(t)
1
≥ 0
and X
(t)
2
≥ 0. Figure 4.3 shows this estimate. ⊳
0 2000 4000 6000 8000 10000
0
.
1
0
.
2
0
.
3
0
.
4
0
.
5
t

{
(
X
(
τ
)
1
,
X
(
τ
)
2
)
:
τ
≤
t
,
X
(
τ
)
1
≥
0
,
X
(
τ
)
2
≥
0
}

/
t
Figure 4.3. Estimate of the P(X
1
≥ 0, X
2
≥ 0) obtained using a Gibbs sampler. The area shaded in grey corresponds to the
range of 100 replications.
Note that the realisations (X
(0)
, X
(1)
, . . .) form a Markov chain, and are thus not independent, but typically
positively correlated. The correlation between the X
(t)
is larger if the Markov chain moves only slowly (the chain
is then said to be slowly mixing). For the Gibbs sampler this is typically the case if the variables X
j
are strongly
(positively or negatively) correlated, as the following example shows.
Example 4.5 (Sampling from a highly correlated bivariate Gaussian). Figure 4.5 shows the results obtained when
sampling from a bivariate Normal distribution as in example 4.4, however with σ
12
= 0.99. This yields a correlation
of ρ(X
1
, X
2
) = 0.99. This Gibbs sampler is a lot slower mixing than the one considered in example 4.4 (and
displayed in ﬁgure 4.4): due to the strong correlation the Gibbs sampler can only perform very small movements.
This makes subsequent samples X
(t−1)
j
and X
(t)
j
highly correlated and thus yields to a slower convergence, as the
plot of the estimated densities show (panels (b) and (c) of ﬁgures 4.4 and 4.5). ⊳
4.4 Convergence of the Gibbs sampler 51

2
.
0
2.0
2
.
0
2.0

1
.
0
1.0
1
.
0
1.0
0
.
0
0.0
X
(t)
1
X
(
t
)
2
(a) First 50 iterations of (X
(t)
1
, X
(t)
2
)
MCMC sample
X
(t)
1
ˆ
fX
1
(x1)
(b) Path of X
(t)
1
and estimated density of X after 1,000
iterations
MCMC sample
X
(t)
2
ˆ
fX
2
(x2)
(c) Path of X
(t)
2
and estimated density of X
2
after 1,000
iterations
Figure 4.4. Gibbs sampler for a bivariate standard normal distribution with correlation ρ(X
1
, X
2
) = 0.3.

1
.
5
1.5

1
.
0
1.0

0
.
5
0.5
0
.
0
0.0
0
.
5
0.5
1
.
0
1.0
1
.
5
1.5
X
(t)
1
X
(
t
)
2
(a) First 50 iterations of (X
(t)
1
, X
(t)
2
)
MCMC sample
X
(t)
1
ˆ
fX
1
(x1)
(b) Path of X
(t)
1
and estimated density of X
1
after
1,000 iterations
MCMC sample
X
(t)
2
ˆ
fX
2
(x2)
(c) Path of X
(t)
2
and estimated density of X
2
after 1,000
iterations
Figure 4.5. Gibbs sampler for a bivariate normal distribution with correlation ρ(X
1
, X
2
) = 0.99.
52 4. The Gibbs Sampler
Chapter 5
The MetropolisHastings Algorithm
5.1 Algorithm
In the previous chapter we have studied the Gibbs sampler, a special case of a Monte Carlo Markov Chain (MCMC)
method: the target distribution is the invariant distribution of the Markov chain generated by the algorithm, to which
it (hopefully) converges.
This chapter will introduce another MCMC method: the MetropolisHastings algorithm, which goes back to
Metropolis et al. (1953) and Hastings (1970). Like the rejection sampling algorithm 3.1, the MetropolisHastings
algorithm is based on proposing values sampled from an instrumental distribution, which are then accepted with a
certain probability that reﬂects how likely it is that they are from the target distribution f.
The main drawback of the rejection sampling algorithm 3.1 is that it is often very difﬁcult to come up with
a suitable proposal distribution that leads to an efﬁcient algorithm. One way around this problem is to allow for
“local updates”, i.e. let the proposed value depend on the last accepted value. This makes it easier to come up
with a suitable (conditional) proposal, however at the price of yielding a Markov chain instead of a sequence of
independent realisations.
Algorithm 5.1 (MetropolisHastings). Starting with X
(0)
:= (X
(0)
1
, . . . , X
(0)
p
) iterate for t = 1, 2, . . .
1. Draw X ∼ q(·X
(t−1)
).
2. Compute
α(XX
(t−1)
) = min
_
1,
f(X) · q(X
(t−1)
X)
f(X
(t−1)
) · q(XX
(t−1)
)
_
. (5.1)
3. With probability α(XX
(t−1)
) set X
(t)
= X, otherwise set X
(t)
= X
(t−1)
.
Figure 5.1 illustrates the MetropolisHasting algorithm. Note that if the algorithm rejects the newly proposed
value (open disks joined by dotted lines in ﬁgure 5.1) it stays at its current value X
(t−1)
. The probability that the
MetropolisHastings algorithm accepts the newly proposed state Xgiven that it currently is in state X
(t−1)
is
a(x
(t−1)
) =
_
α(xx
(t−1)
)q(xx
(t−1)
) dx. (5.2)
Just like the Gibbs sampler, the MetropolisHastings algorithm generates a Markov chain, whose properties will be
discussed in the next section.
Remark 5.1. The probability of acceptance (5.1) does not depend on the normalisation constant, i.e. if f(x) =
C · π(x), then
f(x) · q(x
(t−1)
x)
f(x
(t−1)
) · q(xx
(t−1)
)
=
Cπ(x) · q(x
(t−1)
x)
Cπ(x
(t−1)
) · q(xx
(t−1)
)
=
π(x) · q(x
(t−1)
x)
π(x
(t−1)
) · q(xx
(t−1)
)
54 5. The MetropolisHastings Algorithm
X
(t)
1
X
(
t
)
2
x
(0)
=x
(1)
=x
(2)
x
(3)
=x
(4)
=x
(5)
=x
(6)
=x
(7)
x
(8)
x
(9)
x
(10)
x
(11)
=x
(12)
=x
(13)
=x
(14)
x
(15)
Figure 5.1. Illustration of the MetropolisHastings algorithm. Filled dots denote accepted states, open circles rejected values.
Thus f only needs to be known up to normalisation constant.
1
5.2 Convergence results
Lemma 5.2. The transition kernel of the MetropolisHastings algorithm is
K(x
(t−1)
, x
(t)
) = α(x
(t)
x
(t−1)
)q(x
(t)
x
(t−1)
) + (1 −a(x
(t−1)
))δ
x
(t−1) (x
(t)
), (5.3)
where δ
x
(t−1) (·) denotes Diracmass on {x
(t−1)
}.
Note that the transition kernel (5.3) is not continuous with respect to the Lebesgue measure.
Proof. We have
P(X
(t)
∈ XX
(t−1)
= x
(t−1)
) = P(X
(t)
∈ X, new value acceptedX
(t−1)
= x
(t−1)
)
+P(X
(t)
∈ X, new value rejectedX
(t−1)
= x
(t−1)
)
=
_
X
α(x
(t)
x
(t−1)
)q(x
(t)
x
(t−1)
) dx
(t)
+ I
X
(x
(t−1)
)
. ¸¸ .
=
R
X
δ
x
(t−1)
(dx
(t)
)
P(new value rejectedX
(t−1)
= x
(t−1)
)
. ¸¸ .
=1−a(x
(t−1)
)
. ¸¸ .
=
R
X
(1−a(x
(t−1)
))δ
x
(t−1)
(dx
(t)
)
=
_
X
α(x
(t)
x
(t−1)
)q(x
(t)
x
(t−1)
) dx
(t)
+
_
X
(1 −a(x
(t−1)
))δ
x
(t−1) (dx
(t)
)
Proposition 5.3. The MetropolisHastings kernel (5.3) satisﬁes the detailed balance condition
K(x
(t−1)
, x
(t)
)f(x
(t−1)
) = K(x
(t)
, x
(t−1)
)f(x
(t)
)
and thus f(x) is the invariant distribution of the Markov chain (X
(0)
, X
(1)
, . . .) generated by the Metropolis
Hastings sampler. Furthermore the Markov chain is reversible.
1
On a similar note, it is enough to know q(x
(t−1)
x) up to a multiplicative constant independent of x
(t−1)
and x.
5.2 Convergence results 55
Proof. We have that
α(x
(t)
x
(t−1)
)q(x
(t)
x
(t−1)
)f(x
(t−1)
) = min
_
1,
f(x
(t)
)q(x
(t−1)
x
(t)
)
f(x
(t−1)
)q(x
(t)
x
(t−1)
)
_
q(x
(t)
x
(t−1)
)f(x
(t−1)
)
= min
_
f(x
(t−1)
)q(x
(t)
x
(t−1)
), f(x
(t)
)q(x
(t−1)
x
(t)
)
_
= min
_
f(x
(t−1)
)q(x
(t)
x
(t−1)
)
f(x
(t)
)q(x
(t−1)
x
(t)
)
, 1
_
q(x
(t−1)
x
(t)
)f(x
(t)
) = α(x
(t−1)
x
(t)
)q(x
(t−1)
x
(t)
)f(x
(t)
)
and thus
K(x
(t−1)
, x
(t)
)f(x
(t−1)
) = α(x
(t)
x
(t−1)
)q(x
(t)
x
(t−1)
)f(x
(t−1)
)
. ¸¸ .
=α(x
(t−1)
x
(t)
)q(x
(t−1)
x
(t)
)f(x
(t)
)
+(1 −a(x
(t−1)
)) δ
x
(t−1) (x
(t)
)
. ¸¸ .
=0 if x
(t)
= x
(t−1)
f(x
(t−1)
)
. ¸¸ .
(1−a(x
(t)
))δ
x
(t)
(x
(t−1)
)
= K(x
(t)
, x
(t−1)
)f(x
(t)
)
The other conclusions follow by theorem 1.22, which also applies in the continuous case (see page 21).
Next we need to examine whether the MetropolisHastings algorithm yields an irreducible chain. As with the
Gibbs sampler, this does not need to be the case, as the following example shows.
Example 5.1 (Reducible MetropolisHastings). Consider using a MetropolisHastings algorithm for sampling from
a uniformdistribution on [0, 1]∪[2, 3] and a U(x
(t−1)
−δ, x
(t−1)
+δ) distribution as proposal distribution q(·x
(t−1)
).
Figure 5.2 illustrates this example. It is easy to see that the resulting Markov chain is not irreducible if δ ≤ 1: in
this case the chain either stays in [0, 1] or [2, 3]. ⊳
x
(t−1)
1/(2δ)
q(·x
(t−1)
)
δ δ
f(·)
1 2 3
1/2
Figure 5.2. Illustration of example 5.1
Under mild assumptions on the proposal q(·x
(t−1)
) one can however establish the irreducibility of the resulting
Markov chain:
– If q(x
(t)
x
(t−1)
) is positive for all x
(t−1)
, x
(t)
∈ supp(f), then it is easy to see that we can reach any set of
nonzero probability under f within a single step. The resulting Markov chain is thus strongly irreducible. Even
though this condition seems rather restrictive, many popular choices of q(·x
(t−1)
) like multivariate Gaussians or
tdistributions fulﬁl this condition.
– Roberts and Tweedie (1996) give a more general condition for the irreducibility of the resulting Markov chain:
they only require that
∀ ǫ ∃ δ : q(x
(t)
x
(t−1)
) > ǫ if x
(t−1)
−x
(t)
< δ
together with the boundedness of f on any compact subset of its support.
The Markov chain (X
(0)
, X
(1)
, . . .) is further aperiodic, if there is positive probability that the chain remains in
the current state, i.e. P(X
(t)
= X
(t−1)
) > 0, which is the case if
56 5. The MetropolisHastings Algorithm
P
_
f(X
(t−1)
)q(XX
(t−1)
) > f(X)q(X
(t−1)
X)
_
> 0.
Note that this condition is not met if we use a “perfect” proposal which has f as invariant distribution: in this case
we accept every proposed value with probability 1.
Proposition 5.4. The Markov chain generated by the MetropolisHastings algorithm is Harrisrecurrent if it is
irreducible.
Proof. Recurrence follows from the irreducibility and the fact that f is the unique invariant distribution (using
proposition 1.28). For a proof of Harris recurrence see (Tierney, 1994).
As we have now established (Harris)recurrence, we are now ready to state an ergodic theorem (using theorem
1.30).
Theorem 5.5. If the Markov chain generated by the MetropolisHastings algorithm is irreducible, then for any
integrable function h : E →R
lim
n→∞
1
n
n
t=1
h(X
(t)
) →E
f
(h(X))
for every starting value X
(0)
.
As with the Gibbs sampler the above ergodic theorem allows for inference using a single Markov chain.
5.3 The random walk Metropolis algorithm
In this section we will focus on an important special case of the MetropolisHastings algorithm: the random walk
MetropolisHastings algorithm. Assume that we generate the newly proposed state Xnot using the fairly general
X ∼ q(·X
(t−1)
), (5.4)
from algorithm 5.1, but rather
X = X
(t−1)
+ε, ε ∼ g, (5.5)
with g being a symmetric distribution. It is easy to see that (5.5) is a special case of (5.4) using q(xx
(t−1)
) =
g(x −x
(t−1)
). When using (5.5) the probability of acceptance simpliﬁes to
min
_
1,
f(X) · q(X
(t−1)
X)
f(X
(t−1)
) · q(XX
(t−1)
)
_
= min
_
1,
f(X)
f(X
(t−1)
)
_
,
as q(XX
(t−1)
) = g(X − X
(t−1)
) = g(X
(t−1)
− X) = q(X
(t−1)
X) using the symmetry of g. This yields the
following algorithm which is a special case of algorithm 5.1, which is actually the original algorithm proposed by
Metropolis et al. (1953).
Algorithm 5.2 (Random walk Metropolis). Starting with X
(0)
:= (X
(0)
1
, . . . , X
(0)
p
) and using a symmetric dis
tributon g, iterate for t = 1, 2, . . .
1. Draw ε ∼ g and set X = X
(t−1)
+ε.
2. Compute
α(XX
(t−1)
) = min
_
1,
f(X)
f(X
(t−1)
)
_
. (5.6)
3. With probability α(XX
(t−1)
) set X
(t)
= X, otherwise set X
(t)
= X
(t−1)
.
5.3 The random walk Metropolis algorithm 57
Example 5.2 (Bayesian probit model). In a medical study on infections resulting from birth by Cesarean section
(taken from Fahrmeir and Tutz, 2001) three inﬂuence factors have been studied: an indicator whether the Cesarian
was planned or not (z
i1
), an indicator of whether additional risk factors were present at the time of birth (z
i2
), and
an indicator of whether antibiotics were given as a prophylaxis (z
i3
). The response Y
i
is the number of infections
that were observed amongst n
i
patients having the same inﬂuence factors (covariates). The data is given in table
5.1.
Number of births planned risk factors antibiotics
with infection total
y
i
n
i
z
i1
z
i2
z
i3
11 98 1 1 1
1 18 0 1 1
0 2 0 0 1
23 26 1 1 0
28 58 0 1 0
0 9 1 0 0
8 40 0 0 0
Table 5.1. Data used in example 5.2
The data can be modeled by assuming that
Y
i
∼ Bin(n
i
, π
i
), π = Φ(z
′
i
β),
where z
i
= (1, z
i1
, z
i2
, z
i3
) and Φ(·) being the CDF of the N(0, 1) distribution. Note that Φ(t) ∈ [0, 1] for all t ∈ R.
A suitable prior distribution for the parameter of interest β is β ∼ N(0, I/λ). The posterior density of β is
f(βy
1
, . . . , y
n
) ∝
_
n
i=1
Φ(z
′
i
β)
y
i
· (1 −Φ(z
′
i
β))
n
i
−y
i
_
· exp
_
_
−
λ
2
3
j=0
β
2
j
_
_
We can sample from the above posterior distribution using the following random walk Metropolis algorithm. Start
ing with any β
(0)
iterate for t = 1, 2, . . .:
1. Draw ε ∼ N(0, Σ) and set β = β
(t−1)
+ε.
2. Compute
α(ββ
(t−1)
) = min
_
1,
f(βY
1
, . . . , Y
n
)
f(β
(t−1)
Y
1
, . . . , Y
n
)
_
.
3. With probability α(ββ
(t−1)
) set β
(t)
= β, otherwise set β
(t)
= β
(t−1)
.
To keep things simple, we choose the covariance Σ of the proposal to be 0.08 · I.
Figure 5.3 and table 5.2 show the results obtained using 50,000 samples
2
. Note that the convergence of the β
(t)
j
Posterior mean 95% credible interval
intercept β
0
1.0952 1.4646 0.7333
planned β
1
0.6201 0.2029 1.0413
risk factors β
2
1.2000 0.7783 1.6296
antibiotics β
3
1.8993 2.3636 1.471
Table 5.2. Parameter estimates obtained for the Bayesian probit model from example 5.2
is to a distribution, whereas the cumulative averages
t
τ=1
β
(τ)
j
/t converge, as the ergodic theorem implies, to a
value. For ﬁgure 5.3 and table 5.2 the ﬁrst 10,000 samples have been discarded (“burnin”). ⊳
2
You might want to consider a longer chain in practise.
58 5. The MetropolisHastings Algorithm
0 0
0 0
10000 10000
10000 10000
20000 20000
20000 20000
30000 30000
30000 30000
40000 40000
40000 40000
50000 50000
50000 50000
0
.
5
0
.
5
1
.
0
1
.
5
1
.
5

2
.
0

2
.
0

1
.
5

1
.
0

1
.
0

0
.
5

0
.
5
0
.
0
0
.
0

3
.
0
β
(t)
0
β
(
t
)
0
β
(t)
1
β
(
t
)
1
β
(t)
2
β
(
t
)
2
β
(t)
3
β
(
t
)
3
(a) Sample paths of the β
(t)
j
0 0
0 0
10000 10000
10000 10000
20000 20000
20000 20000
30000 30000
30000 30000
40000 40000
40000 40000
50000 50000
50000 50000

1
.
0

1
.
0

0
.
6

0
.
2
sample sample
sample sample
P
t
τ=1
β
(τ)
0
/t
P
t τ
=
1
β
(
τ
)
0
/
t
0
.
2
0
.
3
0
.
4
0
.
5
0
.
5
0
.
6
0
.
7
P
t
τ=1
β
(τ)
1
/t
P
t τ
=
1
β
(
τ
)
1
/
t
0
.
0
1
.
0
P
t
τ=1
β
(τ)
2
/t
P
t τ
=
1
β
(
τ
)
2
/
t

2
.
0

1
.
5

0
.
5
P
t
τ=1
β
(τ)
3
/t
P
t τ
=
1
β
(
τ
)
3
/
t
(b) Cumulative averages
P
t
τ=1
β
(τ)
j
/t
d
e
n
s
i
t
y
d
e
n
s
i
t
y
d
e
n
s
i
t
y
d
e
n
s
i
t
y
2.0
2.0
1.5
1.5
1.0
1.0 0.5
0
.
0
0
.
0
0
.
0
0.0
0
.
0
0
.
5
0
.
5
0.5
0
.
5
0.5
0
.
5
1
.
0
1
.
0
1.0
1
.
0
1.0
1
.
0
1
.
5
1
.
5
1.5
1
.
5
1.5
1
.
5
β0
β1
2.0
β2
3.0 2.5
β3
(c) Posterior distributions of the β
j
Figure 5.3. Results obtained for the Bayesian probit model from example 5.2
5.4 Choosing the proposal distribution 59
5.4 Choosing the proposal distribution
The efﬁciency of a MetropolisHastings sampler depends on the choice of the proposal distribution q(·x
(t−1)
).
An ideal choice of proposal would lead to a small correlation of subsequent realisations X
(t−1)
and X
(t)
. This
correlation has two sources:
– the correlation between the current state X
(t−1)
and the newly proposed value X ∼ q(·X
(t−1)
), and
– the correlation introduced by retaining a value X
(t)
= X
(t−1)
because the newly generated value X has been
rejected.
Thus we would ideally want a proposal distribution that both allows for fast changes in the X
(t)
and yields a high
probability of acceptance. Unfortunately these are two competing goals. If we choose a proposal distribution with
a small variance, the probability of acceptance will be high, however the resulting Markov chain will be highly
correlated, as the X
(t)
change only very slowly. If, on the other hand, we choose a proposal distribution with a large
variance, the X
(t)
can potentially move very fast, however the probability of acceptance will be rather low.
Example 5.3. Assume we want to sample from a N(0, 1) distribution using a random walk MetropolisHastings
algorithm with ε ∼ N(0, σ
2
). At ﬁrst sight, we might think that setting σ
2
= 1 is the optimal choice, this is
however not the case. In this example we examine the choices: σ
2
= 0.1, σ
2
= 1, σ
2
= 2.38
2
, and σ
2
= 10
2
.
Figure 5.4 shows the sample paths of a single run of the corresponding random walk MetropolisHastings algorithm.
Rejected values are drawn as grey open circles. Table 5.3 shows the average correlation ρ(X
(t−1)
, X
(t)
) as well
as the average probability of acceptance α(XX
(t−1)
) averaged over 100 runs of the algorithm. Choosing σ
2
too
small yields a very high probability of acceptance, however at the price of a chain that is hardly moving. Choosing
σ
2
too large allows the chain to make large jumps, however most of the proposed values are rejected, so the chain
remains for a long time at each accepted value. The results suggest that σ
2
= 2.38
2
is the optimal choice. This
corresponds to the theoretical results of Gelman et al. (1995). ⊳
Autocorrelation ρ(X
(t−1)
, X
(t)
) Probability of acceptance α(X, X
(t−1)
)
Mean 95% CI Mean 95% CI
σ
2
= 0.1
2
0.9901 (0.9891,0.9910) 0.9694 (0.9677,0.9710)
σ
2
= 1 0.7733 (0.7676,0.7791) 0.7038 (0.7014,0.7061)
σ
2
= 2.38
2
0.6225 (0.6162,0.6289) 0.4426 (0.4401,0.4452)
σ
2
= 10
2
0.8360 (0.8303,0.8418) 0.1255 (0.1237,0.1274)
Table 5.3. Average correlation ρ(X
(t−1)
, X
(t)
) and average probability of acceptance α(XX
(t−1)
) found in example 5.3 for
different choices of the proposal variance σ
2
.
Finding the ideal proposal distribution q(·x
(t−1)
) is an art.
3
This is the price we have to pay for the generality
of the MetropolisHastings algorithm. Popular choices for random walk proposals are multivariate Gaussians or
tdistributions. The latter have heavier tails, making them a safer choice. The covariance structure of the proposal
distribution should ideally reﬂect the expected covariance of the (X
1
, . . . , X
p
). Gelman et al. (1997) propose to
adjust the proposal such that the acceptance rate is around 1/2 for one or two dimensional target distributions, and
around 1/4 for larger dimensions, which is in line with the results we obtained in the above simple example and the
guidelines which motivate them. Note however that these are just rough guidelines.
Example 5.4 (Bayesian probit model (continued)). In the Bayesian probit model we studied in example 5.2 we drew
3
The optimal proposal would be sampling directly fromthe target distribution. The very reason for using a MetropolisHastings
algorithm is however that we cannot sample directly from the target!
60 5. The MetropolisHastings Algorithm

6

6

6

6

4

4

4

4

2

2

2

2
0
0
0
0
0
2
2
2
2
4
4
4
4
6
6
6
6
σ
2
=
0
.
1
2
σ
2
=
1
σ
2
=
2
.
3
8
2
200 400 600 800 1000
σ
2
=
1
0
2
Figure 5.4. Sample paths for example 5.3 for different choices of the proposal variance σ
2
. Open grey discs represent rejected
values.
5.4 Choosing the proposal distribution 61
ε ∼ N(0, Σ)
with Σ = 0.08 · I, i.e. we modeled the components of ε to be independent. The proportion of accepted values we
obtained in example 5.2 was 13.9%. Table 5.4 (a) shows the corresponding autocorrelation. The resulting Markov
chain can be made faster mixing by using a proposal distribution that represents the covariance structure of the
posterior distribution of β.
This can be done by resorting to the frequentist theory of generalised linear models (GLM): it suggests that the
asymptotic covariance of the maximum likelihood estimate
ˆ
β is (Z
′
DZ)
−1
, where Z is the matrix of the covariates,
and Dis a suitable diagonal matrix. When using Σ = 2·(Z
′
DZ)
−1
in the algorithm presented in section 5.2 we can
obtain better mixing performance: the autocorrelation is reduced (see table 5.4 (b)), and the proportion of accepted
values obtained increases to 20.0%. Note that the determinant of both choices of Σ was chosen to be the same, so
the improvement of the mixing behaviour is entirely due to a difference in the structure of the the covariance. ⊳
(a) Σ = 0.08 · I
β
0
β
1
β
2
β
3
Autocorrelation ρ(β
(t−1)
j
, β
(t)
j
) 0.9496 0.9503 0.9562 0.9532
(b) Σ = 2 · (Z
′
DZ)
−1
β
0
β
1
β
2
β
3
Autocorrelation ρ(β
(t−1)
j
, β
(t)
j
) 0.8726 0.8765 0.8741 0.8792
Table 5.4. Autocorrelation ρ(β
(t−1)
j
, β
(t)
j
) between subsequent samples for the two choices of the covariance Σ.
62 5. The MetropolisHastings Algorithm
Chapter 6
Diagnosing convergence
6.1 Practical considerations
The theory of Markov chains we have seen in chapter 1 guarantees that a Markov chain that is irreducible and
has invariant distribution f converges to the invariant distribution. The ergodic theorems 4.6 and 5.5 allow for
approximating expectations E
f
(h(X)) by their the corresponding means
1
T
T
t=1
h(X
(t)
) −→E
f
(h(X))
using the entire chain. In practise, however, often only a subset of the chain (X
(t)
)
t
is used:
Burnin Depending on how X
(0)
is chosen, the distribution of (X
(t)
)
t
for small t might still be far from the
stationary distribution f. Thus it might be beneﬁcial to discard the ﬁrst iterations X
(t)
, t = 1, . . . , T
0
. This
early stage of the sampling process is often referred to as burnin period. How large T
0
has to be chosen
depends on how fast mixing the Markov chain (X
(t)
)
t
is. Figure 6.1 illustrates the idea of a burnin period.
burnin period (discarded)
Figure 6.1. Illustration of the idea of a burnin period.
Thinning Markov chain Monte Carlo methods typically yield a Markov chain with positive autocorrelation, i.e.
ρ(X
(t)
k
, X
(t+τ)
k
) is positive for small τ. This suggests building a subchain by only keeping every mth value
(m > 1), i.e. we consider a Markov chain (Y
(t)
)
t
with Y
(t)
= X
(m·t)
instead of (X
(t)
)
t
. If the correlation
ρ(X
(t)
, X
(t+τ)
) decreases monotonically in τ, then
ρ(Y
(t)
k
, Y
(t+τ)
k
) = ρ(X
(t)
k
, X
(m·t+τ)
k
) < ρ(X
(t)
k
, X
(t+τ)
k
),
i.e. the thinned chain (Y
(t)
)
t
exhibits less autocorrelation than the original chain (X
(t)
)
t
. Thus thinning can be
seen as a technique for reducing the autocorrelation, however at the price of yielding a chain (Y
(t)
)
t=1,...⌊T/m⌋
,
64 6. Diagnosing convergence
whose length is reduced to (1/m)th of the length of the original chain (X
(t)
)
t=1,...,T
. Even though thinning is
very popular, it cannot be justiﬁed when the objective is estimating E
f
(h(X)), as the following lemma shows.
Lemma 6.1. Let (X
(t)
)
t=1,...,T
be a sequence of random variables (e.g. from a Markov chain) with X
(t)
∼ f
and (Y
(t)
)
t=1,...,⌊T/m⌋
a second sequence deﬁned by Y
(t)
:= X
(m·t)
. If Var
f
(h(X
(t)
)) < +∞, then
Var
_
1
T
T
t=1
h(X
(t)
)
_
≤ Var
_
_
1
⌊T/m⌋
⌊T/m⌋
t=1
h(Y
(t)
)
_
_
.
Proof. To simplify the proof we assume that T is divisible by m, i.e. T/m ∈ N. Using
T
t=1
h(X
(t)
) =
m−1
τ=0
T/m
t=1
h(X
(t·m+τ)
)
and
Var
_
_
T/m
t=1
h(X
(t·m+τ
1
)
)
_
_
= Var
_
_
T/m
t=1
h(X
(t·m+τ
2
)
)
_
_
for τ
1
, τ
2
∈ {0, . . . , m−1}, we obtain that
Var
_
T
t=1
h(X
(t)
)
_
= Var
_
_
m−1
τ=0
T/m
t=1
h(X
(t·m+τ)
)
_
_
= m· Var
_
_
T/m
t=1
h(X
(t·m)
)
_
_
+
m−1
η=τ=0
Cov
_
_
T/m
t=1
h(X
(t·m+η)
),
T/m
t=1
h(X
(t·m+τ)
)
_
_
. ¸¸ .
≤Var
“
P
T/m
t=1
h(X
(t·m)
)
”
≤ m
2
· Var
_
_
T/m
t=1
h(X
(t·m)
)
_
_
= m
2
· Var
_
_
T/m
t=1
h(Y
(t)
)
_
_
.
Thus
Var
_
1
T
T
t=1
h(X
(t)
)
_
=
1
T
2
Var
_
T
t=1
h(X
(t)
)
_
≤
m
2
T
2
Var
_
_
T/m
t=1
h(Y
(t)
)
_
_
= Var
_
_
1
T/m
T/m
t=1
h(Y
(t)
)
_
_
.
The concept of thinning can be useful for other reasons. If the computer’s memory cannot hold the entire chain
(X
(t)
)
t
, thinning is a good choice. Further, it can be easier to assess the convergence of the thinned chain
(Y
(t)
)
t
as opposed to entire chain (X
(t)
)
t
.
6.2 Tools for monitoring convergence
Although the theory presented in the preceding chapters guarantees the convergence of the Markov chains to the
required distributions, this does not imply that a ﬁnite sample from such a chain yields a good approximation to the
target distribution. As with all approximating methods this must be conﬁrmed in practise.
This section tries to give a brief overview over various approaches to diagnosing convergence. A more detailed
review with many practical examples can be diagnofound in (GuihennecJouyaux et al., 1998) or (Robert and
Casella, 2004, chapter 12). There is an R package (CODA) that provides a vast selection of tools for diagnosing
convergence. Diagnosing convergence is an art. The techniques presented in the following are nothing other than
exploratory tools that help you judging whether the chain has reached its stationary regime. This section contains
several cautionary examples where the different tools for diagnosing convergence fail.
Broadly speaking, convergence assessment can be split into the following three tasks of diagnosing different
aspects of convergence:
6.2 Tools for monitoring convergence 65
Convergence to the target distribution. The ﬁrst, and most important, question is whether (X
(t)
)
t
yields a sample
from the target distribution? In order to answer this question we need to assess . . .
– whether (X
(t)
)
t
has reached a stationary regime, and
– whether (X
(t)
)
t
covers the entire support of the target distribution.
Convergence of the averages. Does
T
t=1
h(X
(t)
)/T provide a good approximation to the expectation E
f
(h(X))
under the target distribution?
Comparison to i.i.d. sampling. How much information is contained in the sample from the Markov chain compared
to i.i.d. sampling?
6.2.1 Basic plots
The most basic approach to diagnosing the output of a Markov Chain Monte Carlo algorithm is to plot the sample
path (X
(t)
)
t
as in ﬁgures 4.4 (b) (c), 4.5 (b) (c), 5.3 (a), and 5.4. Note that the convergence of (X
(t)
)
t
is in dis
tribution, i.e. the sample path is not supposed to converge to a single value. Ideally, the plot should be oscillating
very fast and show very little structure or trend (like for example ﬁgure 4.4). The smoother the plot seems (like for
example ﬁgure 4.5), the slower mixing the resulting chain is.
Note however that this plot suffers from the “you’ve only seen where you’ve been” problem. It is impossible to
see from a plot of the sample path whether the chain has explored the entire support of the distribution.
Example 6.1 (A simple mixture of two Gaussians). In this example we sample from a mixture of two wellseparated
Gaussians
f(x) = 0.4 · φ
(−1,0.2
2
)
(x) + 0.6 · φ
(2,0.3
2
)
(x)
(see ﬁgure 6.2 (a) for a plot of the density) using a random walk Metropolis algorithm with proposed value X =
X
(t−1)
+ ε with ε ∼ N(0, Var(ε)). If we choose the proposal variance Var(ε) too small, we only sample from
one population instead of both. Figure 6.2 shows the sample paths for two choices of Var(ε): Var(ε) = 0.4
2
and
Var(ε) = 1.2
2
. The ﬁrst choice of Var(ε) is too small: the chain is very likely to remain in one of the two modes of
the distribution. Note that it is impossible to tell from ﬁgure 6.2 (b) alone that the chain has not explored the entire
support of the target. ⊳
2 1 0 1 2 3
0
.
0
0
.
2
0
.
4
0
.
6
0
.
8
x
d
e
n
s
i
t
y
f
(
x
)
(a) Density f(x)
2000 4000 6000 8000 10000

1
0
0
1
2
3
sample
s
a
m
p
l
e
p
a
t
h
X
(
t
)
(b) Sample path of a random walk
Metropolis algorithm with proposal vari
ance Var(ε) = 0.4
2
2000 4000 6000 8000 10000

1
0
0
1
2
3
sample
s
a
m
p
l
e
p
a
t
h
X
(
t
)
(c) Sample path of a random walk
Metropolis algorithm with proposal vari
ance Var(ε) = 1.2
2
Figure 6.2. Density of the mixture distribution with two random walk Metropolis samples using two different variances Var(ε)
of the proposal.
In order to diagnose the convergence of the averages, one can look at a plot of the cumulative averages
(
t
τ=1
h(X
(τ)
)/t)
t
. Note that the convergence of the cumulative averages is — as the ergodic theorems sug
gest — to a value (E
f
(h(X)). Figures 4.3, and 5.3 (b) show plots of the cumulative averages. An alternative
66 6. Diagnosing convergence
to plotting the cumulative means is using the socalled CUSUMs
_
¯
h(X
j
) −
t
τ=1
h(X
(τ)
j
)/t
_
t
with
¯
h(X
j
) =
T
τ=1
h(X
(τ)
j
)/T, which is nothing other than the difference between the cumulative averages and the estimate of
the limit E
f
(h(X)).
Example 6.2 (A pathological generator for the Beta distribution). The following MCMC algorithm (for details, see
Robert and Casella, 2004, problem 7.5) yields a sample from the Beta(α, 1) distribution. Starting with any X
(0)
iterate for t = 1, 2, . . .
1. With probability 1 −X
(t−1)
, set X
(t)
= X
(t−1)
.
2. Otherwise draw X
(t)
∼ Beta(α + 1, 1).
This algorithm yields a very slowly converging Markov chain, to which no central limit theorem applies. This slow
convergence can be seen in a plot of the cumulative means (ﬁgure 6.3 (b)). ⊳
0 2000 4000 6000 8000 10000
0
.
0
0
.
2
0
.
4
0
.
6
0
.
8
1
.
0
sample
s
a
m
p
l
e
p
a
t
h
X
(
t
)
(a) Sample path X
(t)
0 2000 4000 6000 8000 10000
0
.
2
0
.
4
0
.
6
0
.
8
1
.
0
sample
c
u
m
u
l
a
t
i
v
e
a
v
e
r
a
g
e
P
t τ
=
1
X
(
τ
)
/
t
(b) Cumulative means
P
t
τ=1
X
(τ)
/t
Figure 6.3. Sample paths and cumulative means obtained for the pathological Beta generator.
Note that it is impossible to tell from a plot of the cumulative means whether the Markov chain has explored the
entire support of the target distribution.
6.2.2 Nonparametric tests of stationarity
This section presents the KolmogorovSmirnov test, which is an example of how nonparametric tests can be used
as a tool for diagnosing whether a Markov chain has already converged.
In its simplest version, it is based on splitting the chain into three parts: (X
(t)
)
t=1,...,⌊T/3⌋
,
(X
(t)
)
t=⌊T/3⌋+1,...,2⌊T/3⌋
, and (X
(t)
)
t=2⌊T/3⌋+1,...,T
. The ﬁrst block is considered to be the burnin period. If
the Markov chain has reached its stationary regime after ⌊T/3⌋ iterations, the second and third block should be
from the same distribution. Thus we should be able to tell whether the chain has converged by comparing the distri
bution of (X
(t)
)
t=⌊T/3⌋+1,...,2⌊T/3⌋
to the one of (X
(t)
)
t=2⌊T/3⌋+1,...,T
using suitable nonparametric twosample
tests. One such test is the KolmogorovSmirnov test.
As the KolmogorovSmirnov test is designed for i.i.d. samples, we do not apply it to the (X
(t)
)
t
directly,
but to a thinned chain (Y
(t)
)
t
with Y
(t)
= X
(m·t)
: the thinned chain is less correlated and thus closer to
being an i.i.d. sample. We can now compare the distribution of (Y
(t)
)
t=⌊T/(3m)⌋+1,...,2⌊T/(3m)⌋
to the one of
6.2 Tools for monitoring convergence 67
(Y
(t)
)
t=2⌊T/(3m)⌋+1,...,⌊T/m⌋
using the KolmogorovSmirnov statistic
1
K = sup
x∈R
¸
¸
¸
ˆ
F
(Y
(t)
)
t=⌊T/(3m)⌋+1,...,2⌊T/(3m)⌋
(x) −
ˆ
F
(Y
(t)
)
t=2⌊T/(3m)⌋+1,...,⌊T/m⌋
(x)
¸
¸
¸ .
As the thinned chain is not an i.i.d. sample, we cannot use the KolmogorovSmirnov test as a formal statistical
test (besides we would run into problems of multiple testing). However, we can use it as an informal tool by
monitoring the standardised statistic
√
tK
t
as a function of t.
2
As long as a signiﬁcant proportion of the values of
the standardised statistic are above the corresponding quantile of the asymptotic distribution, it is safe to assume
that the chain has not yet reached its stationary regime.
Example 6.3 (Gibbs sampling from a bivariate Gaussian (continued)). In this example we consider sampling from a
bivariate Gaussian distribution, once with ρ(X
1
, X
2
) = 0.3 (as in example 4.4) and once with ρ(X
1
, X
2
) = 0.99
(as in example 4.5). The former leads a fast mixing chain, the latter a very slowly mixing chain. Figure 6.4 shows the
plots of the standardised KolmogorovSmirnov statistic. It suggests that the sample size of 10,000 is large enough
for the lowcorrelation setting, but not large enough for the highcorrelation setting. ⊳
0 2000 4000 6000 8000 10000
0
.
6
0
.
8
1
.
0
1
.
2
1
.
4
1
.
6
1
.
8
sample
S
t
a
n
d
a
r
d
i
s
e
d
K
S
s
t
a
t
i
s
t
i
c
√
t
K
t
(a) ρ(X
1
, X
2
) = 0.3
0 2000 4000 6000 8000 10000
1
2
3
4
5
6
7
sample
S
t
a
n
d
a
r
d
i
s
e
d
K
S
s
t
a
t
i
s
t
i
c
√
t
K
t
(b) ρ(X
1
, X
2
) = 0.99
Figure 6.4. Standardised KolmogorovSmirnov statistic for X
(5·t)
1
from the Gibbs sampler from the bivariate Gaussian for two
different correlations.
Note that the KolmogorovSmirnov test suffers from the “you’ve only seen where you’ve been” problem,
as it is based on comparing (Y
(t)
)
t=⌊T/(3m)⌋+1,...,2⌊T/(3m)⌋
and (Y
(t)
)
t=2⌊T/(3m)⌋+1,...,⌊T/m⌋
. A plot of the
KolmogorovSmirnov statistic for the chain with Var(ε) = 0.4 from example 6.1 would not reveal anything un
usual.
1
The twosample KolmogorovSmirnov test for comparing two i.i.d. samples Z
1,1
, . . . , Z
1,n
and Z
2,1
, . . . , Z
2,n
is based on
comparing their empirical CDFs
ˆ
F
k
(z) =
1
n
n
X
i=1
I
(−∞,z]
(Z
k,i
).
The KolmogorovSmirnov test statistic is the maximum difference between the two empirical CDFs:
K = sup
z∈R

ˆ
F
1
(z) −
ˆ
F
2
(z).
For n →∞the CDF of
√
n · K converges to the CDF
R(k) = 1 −
+∞
X
i=1
(−1)
i−1
exp(−2i
2
k
2
).
2
K
t
is hereby the KolmogorovSmirnov statistic obtained from the sample consisting of the ﬁrst t observations only.
68 6. Diagnosing convergence
6.2.3 Riemann sums and control variates
A simple tool for diagnosing convergence of a onedimensional Markov chain can be based on the fact that
_
E
f(x) dx = 1.
We can estimate this integral by the Riemann sum
T
t=2
(X
[t]
−X
[t−1]
)f(X
[t]
), (6.1)
where X
[1]
≤ . . . ≤ X
[T]
is the ordered sample from the Markov chain. If the Markov chain has explored all the
support of f, then (6.1) should be around 1. Note that this method, often referred to as Riemann sums (Philippe and
Robert, 2001), requires that the density f is known inclusive of normalisation constants.
Example 6.4 (A simple mixture of two Gaussians (continued)). In example 6.1 we considered two randomwalk
Metropolis algorithms: one (Var(ε) = 0.4
2
) failed to explore the entire support of the target distribution, whereas
the other one (Var(ε) = 1.2
2
) managed to. The corresponding Riemann sums are 0.598 and 1.001, clearly indicat
ing that the ﬁrst algorithm does not explore the entire support. ⊳
Riemann sums can be seen as a special case of a technique called control variates. The idea of control variates
is comparing several ways of estimating the same quantity. As long as the different estimates disagree, the chain
has not yet converged. Note that the technique of control variates is only useful if the different estimators converge
about as fast as the quantity of interest — otherwise we would obtain an overly optimistic, or an overly conservative
estimate of whether the chain has converged. In the special case of the Riemann sum we compare two quantities:
the constant 1 and the Riemann sum (6.1).
6.2.4 Comparing multiple chains
A family of convergence diagnostics (see e.g. Gelman and Rubin, 1992; Brooks and Gelman, 1998) is based on
running L > 1 chains — which we will denote by (X
(1,t)
)
t
, . . . , (X
(L,t)
)
t
— with overdispersed
3
starting values
X
(1,0)
, . . . , X
(L,0)
, covering at least the support of the target distribution.
All L chains should converge to the same distribution, so comparing the plots from section 6.2.1 for the L
different chains should not reveal any difference. A more formal approach to diagnosing whether the L chains are
all from the same distribution can be based on comparing the interquantile distances.
We can estimate the interquantile distances in two ways. The ﬁrst consists of estimating the interquantile
distance for each of the L chain and averaging over these results, i.e. our estimate is
L
l=1
δ
(l)
α
/L, where δ
(l)
α
is the
distance between the α and (1 −α) quantile of the lth chain(X
(l,t)
)
t
. Alternatively, we can pool the data ﬁrst, and
then compute the distance between the α and (1 − α) quantile of the pooled data. If all chains are a sample from
the same distribution, both estimates should be roughly the same, so their ratio
ˆ
S
interval
α
=
L
l=1
δ
(l)
α
/L
δ
(·)
α
can be used as a tool to diagnose whether all chains sampled from the same distribution, in which case the ratio
should be around 1.
Alternatively, one could compare the variances within the L chains to the pooled estimate of the variance (see
Brooks and Gelman, 1998, for more details).
3
i.e. the variance of the starting values should be larger than the variance of the target distribution.
6.2 Tools for monitoring convergence 69
Example 6.5 (A simple mixture of two Gaussians (continued)). In the example of the mixture of two Gaussians we
will consider L = 8 chains initialised from a N(0, 10
2
) distribution. Figure 6.5 shows the sample paths of the 8
chains for both choices of Var(ε). The corresponding values of
ˆ
S
interval
0.05
are:
Var(ε) = 0.4
2
:
ˆ
S
interval
0.05
=
0.9789992
3.630008
= 0.2696962
Var(ε) = 1.2
2
:
ˆ
S
interval
0.05
=
3.634382
3.646463
= 0.996687. ⊳
2000 4000 6000 8000 10000

2

1
0
0
1
2
3
sample
s
a
m
p
l
e
p
a
t
h
s
X
(
l
,
t
)
(a) Var(ε) = 0.4
2
2000 4000 6000 8000 10000

2

1
0
0
1
2
3
sample
s
a
m
p
l
e
p
a
t
h
s
X
(
l
,
t
)
(b) Var(ε) = 1.2
2
Figure 6.5. Comparison of the sample paths for L = 8 chains for the mixture of two Gaussians.
Note that this method depends crucially on the choice of initial values X
(1,0)
, . . . , X
(L,0)
, and thus can easily
fail, as the following example shows.
Example 6.6 (Witch’s hat distribution). Consider a distribution with the following density:
f(x
1
, x
2
) ∝
_
(1 −δ)φ
(µ,σ
2
·I)
(x
1
, x
2
) +δ if x
1
, x
2
∈ (0, 1)
0 else,
which is a mixture of a Gaussian and a uniform distribution, both truncated to [0, 1] × [0, 1]. Figure 6.6 illustrates
the density. For very small σ
2
, the Gaussian component is concentrated in a very small area around µ.
The conditional distribution of X
1
X
2
is
f(x
1
x
2
) =
_
(1 −δ
x
2
)φ
(µ,σ
2
·I)
(x
1
, x
2
) +δ
x
2
for x
1
∈ (0, 1)
0 else.
with δ
x
2
=
δ
δ + (1 −δ)φ
(µ
2
,σ
2
)
(x
2
)
.
Assume we want to estimate P(0.49 < X
1
, X
2
≤ 0.51) for δ = 10
−3
, µ = (0.5, 0.5)
′
, and σ = 10
−5
using a
Gibbs sampler. Note that 99.9% of the mass of the distribution is concentrated in a very small area around (0.5, 0.5),
i.e. P(0.49 < X
1
, X
2
≤ 0.51) = 0.999.
Nonetheless, it is very unlikely that the Gibbs sampler visits this part of the distribution. This is due to the fact
that unless x
2
(or x
1
) is very close to µ
2
(or µ
1
), δ
x
2
(or δ
x
1
) is almost 1, i.e. the Gibbs sampler only samples
from the uniform component of the distribution. Figure 6.6 shows the samples obtained from 15 runs of the Gibbs
70 6. Diagnosing convergence
sampler (ﬁrst 100 iterations only) all using different initialisations. On average only 0.04% of the sampled values
lie in (0.49, 0.51) × (0.49, 0.51) yielding an estimate of
ˆ
P(0.49 < X
1
, X
2
≤ 0.51) = 0.0004 (as opposed to
P(0.49 < X
1
, X
2
≤ 0.51) = 0.999).
It is however close to impossible to detect this problem with any technique based on multiple initialisations.
The Gibbs sampler shows this behaviour for practically all starting values. In ﬁgure 6.6 all 15 starting values yield
a Gibbs sampler that is stuck in the “brim” of the witch’s hat and thus misses 99.9% of the probability mass of the
target distribution. ⊳
x
1
x
2
d
e
n
s
i
t
y
f
(
x
1
,
x
2
)
(a) Density for δ = 0.2, µ =
(0.5, 0.5)
′
, and σ = 0.05
0.0
0.0
0.2
0.2
0.4
0.4
0.6
0.6
0.8
0.8
1.0
1.0
x1
x
2
(b) First 100 values from 15 samples using different
starting values.
(δ = 10
−3
, µ = (0.5, 0.5)
′
, and σ = 10
−5
)
Figure 6.6. Density and sample from the witch’s hat distribution.
6.2.5 Comparison to i.i.d. sampling and the effective sample size
MCMC algorithms typically yield a positively correlated sample (X
(t)
)
t=1,...,T
, which contains less information
than an i.i.d. sample of size T. If the (X
(t)
)
t=1,...,T
are positively correlated, then the variance of the average
Var
_
1
T
T
t=1
h(X
(t)
)
_
(6.2)
is larger than the variance we would obtain from an i.i.d. sample, which is Var(h(X
(t)
))/T.
The effective sample size (ESS) allows to quantify this loss of information caused by the positive correlation.
The effective sample size is the size an i.i.d. sample would have to have in order to obtain the same variance (6.2)
as the estimate from the Markov chain (X
(t)
)
t=1,...,T
.
In order to compute the variance (6.2) we make the simplifying assumption that (h(X
(t)
))
t=1,...,T
is from a
secondorder stationary time series, i.e. Var(h(X
(t)
)) = σ
2
, and ρ(h(X
(t)
), h(X
(t+τ)
)) = ρ(τ). Then
Var
_
1
T
T
t=1
h(X
(t)
)
_
=
1
T
2
_
_
_
T
t=1
Var(h(X
(t)
))
. ¸¸ .
=σ
2
+2
1≤s<t≤T
Cov(h(X
(s)
), h(X
(t)
))
. ¸¸ .
=σ
2
·ρ(t−s)
_
_
_
=
σ
2
T
2
_
T + 2
T−1
τ=1
(T −τ)ρ(τ)
_
=
σ
2
T
_
1 + 2
T−1
τ=1
_
1 −
τ
T
_
ρ(τ)
_
.
If
+∞
τ=1
ρ(τ) < +∞, then we can obtain from the dominated convergence theorem
4
that
4
see e.g. Brockwell and Davis (1991, theorem 7.1.1) for details.
6.2 Tools for monitoring convergence 71
T · Var
_
1
T
T
t=1
h(X
(t)
)
_
−→σ
2
_
1 + 2
+∞
τ=1
ρ(τ)
_
as T → ∞. Note that the variance would be σ
2
/T
ESS
if we were to use an i.i.d. sample of size T
ESS
. We can now
obtain the effective sample size T
ESS
by equating these two variances and solving for T
ESS
, yielding
T
ESS
=
1
1 + 2
+∞
τ=1
ρ(τ)
· T.
If we assume that (h(X
(t)
))
t=1,...,T
is a ﬁrstorder autoregressive time series (AR(1)), i.e. ρ(τ) = ρ(h(X
(t)
), h(X
(t+τ)
)) =
ρ
τ
, then we obtain using 1 + 2
+∞
τ=1
ρ
τ
= (1 +ρ)/(1 −ρ) that
T
ESS
=
1 −ρ
1 +ρ
· T.
Example 6.7 (Gibbs sampling from a bivariate Gaussian (continued)). In examples 4.4 and 4.5 we obtained for the
lowcorrelation setting that ρ(X
(t−1)
1
, X
(t)
1
) = 0.078, thus the effective sample size is
T
ESS
=
1 −0.078
1 + 0.078
· 10000 = 8547.
For the highcorrelation setting we obtained ρ(X
(t−1)
1
, X
(t)
1
) = 0.979, thus the effective sample size is considerably
smaller:
T
ESS
=
1 −0.979
1 + 0.979
· 10000 = 105.
⊳
72 6. Diagnosing convergence
Chapter 7
Statespace models and the Kalman ﬁlter
algorithm
7.1 Motivation
In many realworld applications, observations arrive sequentially in time, and interest lies in performing online
inference about unknown quantities from the given observations. If prior knowledge about these quantities is avail
able, then it is possible to formulate a Bayesian model that incorporates this knowledge in the form of a prior
distribution on the unknown quantities, and relates these to the observations via a likelihood function. Inference
on the unknown quantities is then based on the posterior distribution obtained from Bayes’ theorem. Examples
include tracking an aircraft using radar measurements, speech recognition using noisy measurements of voice sig
nals, or estimating the volatility of ﬁnancial instruments using stock market data. In these examples, the unknown
quantities of interest might be the location and velocity of the aircraft, the words in the speech signal, and the
variancecovariance structure, respectively. In all three examples, the data is modelled dynamically in the sense that
the underlying distribution evolves in time; these models are known as dynamic models. Sequential Monte Carlo
(SMC) methods are a noniterative, alternative class of algorithms to MCMC, designed speciﬁcally for inference
in dynamic models. A comprehensive introduction to these methods is the book by Doucet et al. (2001). We point
out that SMC methods are applicable in settings beyond dynamic models, such as nonsequential Bayesian infer
ence, rare events simulation, and global optimization, provided that it is possible to deﬁne an evolving sequence of
artiﬁcial distributions from which the distribution of interest is obtained via marginalisation.
Let p
t
(x
t
) denote the distribution at time t ≥ 1, where x
t
= (x
1
, . . . , x
t
) typically increases in dimension with
t, but it is possible that the dimension of x
t
be constant ∀t ≥ 1, or that x
t
have one dimension less than x
t−1
. The
particular feature of dynamic models is the evolving nature of the underlying distribution, where p
t
(x
t
) changes
in time t as new observations are generated. Note that x
t
are the quantities of interest, not the observations; the
observations up to time t determine the form of the distribution, and this is implied by the subscript t in p
t
(·). This
is in contrast to nondynamic models where the distribution is constant as new observations are generated, denoted
by p(x). In the latter case, MCMC methods have proven highly effective in generating approximate samples from
lowdimensional distributions p(x), when exact simulation is not possible. In the dynamic case, at each time step t
a different MCMC sampler with stationary distribution p
t
(x
t
) is required, so the overall computational cost would
increase with t. Moreover, for large t, designing the sampler and assessing its convergence would be increasingly
difﬁcult.
SMC methods are a noniterative alternative to MCMC algorithms, based on the key idea that if p
t−1
(x
t−1
)
does not differ much from p
t
(x
t
), then it is possible to reuse the samples from p
t−1
(x
t−1
) to obtain samples from
74 7. Statespace models and the Kalman ﬁlter algorithm
p
t
(x
t
). In most applications of interest, it is not possible to obtain exact samples from these evolving distributions,
so the goal is to reuse an approximate sample, representative of p
t−1
(x
t−1
), to obtain a good representation of
p
t
(x
t
). Moreover, since inference is to be performed in real time as new observations arrive, it is necessary that
the computational cost be ﬁxed in t. We will see in the sequel that SMC methods are highly ﬂexible, and widely
applicable; we restrict our attention to a particular class of dynamic models called the statespace model (SSM).
7.2 Statespace models
SSMs are a class of dynamic models that consist of an underlying Markov process, usually called the state process,
X
t
, that is hidden, i.e., unobserved, and an observed process, usually called the observation process, Y
t
. Consider
the following notation for a statespace model:
observation: y
t
= a(x
t
, u
t
) ∼ g(·x
t
, φ)
hidden state: x
t
= b(x
t−1
, v
t
) ∼ f(·x
t−1
, θ),
where y
t
and x
t
are generated by functions a(·) and b(·) of the state and noise disturbances, denoted by u
t
and
v
t
, respectively. Assume φ and θ to be known. Let p(x
1
) denote the distribution of the initial state x
1
. The state
process is a Markov chain, i.e., p(x
t
x
1
, . . . , x
t−1
) = p(x
t
x
t−1
) = f(x
t
x
t−1
, θ), and the distribution of the
observation y
t
, conditional on x
t
, is independent of previous values of the state and observation processes, i.e.,
p(y
t
x
1:t
, y
1:t−1
) = p(y
t
x
t
) = g(y
t
x
t
, φ). See Figure 7.1 for illustration.
y
1
x
1
x
t
x
t−1
∼ f(x
t
x
t−1
, θ)
y
t
x
t
∼ g(y
t
x
t
, φ)
y
2
x
2
x
t
x
t−1
∼ f(x
t
x
t−1
, θ)
y
t
x
t
∼ g(y
t
x
t
, φ)
y
3
x
3
x
t
x
t−1
∼ f(x
t
x
t−1
, θ)
y
t
x
t
∼ g(y
t
x
t
, φ)
y
4
x
4
x
t
x
t−1
∼ f(x
t
x
t−1
, θ)
y
t
x
t
∼ g(y
t
x
t
, φ)
y
5
x
5
x
t
x
t−1
∼ f(x
t
x
t−1
, θ)
y
t
x
t
∼ g(y
t
x
t
, φ)
y
6
x
6
x
t
x
t−1
∼ f(x
t
x
t−1
, θ)
y
t
x
t
∼ g(y
t
x
t
, φ)
y
7
x
7
x
t
x
t−1
∼ f(x
t
x
t−1
, θ)
y
t
x
t
∼ g(y
t
x
t
, φ)
· · ·
Figure 7.1. The conditional independence structure of the ﬁrst few states and observations in a hidden Markov Model.
Note that we use the notation x
1:t
to denote x
1
, . . . , x
t
, and similarly for y
1:t
. For simplicity, we drop the explicit
dependence of the state transition and observation densities on θ and φ, and write f(·x
t−1
), and g(·x
t
).
The literature sometimes distinguishes between statespace models where the state process is given by a discrete
Markov chain, called hidden Markov models (HMM), as opposed to a continuous Markov chain. An extensive
monograph on inference for statespace models is the book by Capp´ e et al. (2005), and a more recent overview
is Capp´ e et al. (2007). In the present chapter and the following, we introduce several algorithms for inference in
statespace models, and point out that the algorithms in Chapter 8 apply more generally to dynamic models.
7.2.1 Inference problems in SSMs
Under the notation introduced above, we have the joint density
p(x
1:t
, y
1:t
) = p(x
1
)g(y
1
x
1
)
t
i=2
p(x
i
, y
i
x
1:i−1
, y
1:i−1
) = p(x
1
)g(y
1
x
1
)
t
i=2
f(x
i
x
i−1
)g(y
i
x
i
),
and, by Bayes’ theorem, the density of the distribution of interest
p(x
1:t
y
1:t
) ∝ p(x
1:t
y
1:t−1
)g(y
t
x
t
) = p(x
1:t−1
y
1:t−1
)f(x
t
x
t−1
)g(y
t
x
t
). (7.1)
7.2 Statespace models 75
To connect this with the notation introduced for dynamic models, we can write p
t
(x
1:t
) = p(x
1:t
y
1:t
), but we
believe that stating the dependence on the observations explicitly leads to less confusion.
There exist several inference problems in statespace models that involve computing the posterior distribution
of a collection of state variables conditional on a batch of observations:
– ﬁltering: p(x
t
y
1:t
)
– ﬁxed lag smoothing: p(x
t−l
y
1:t
), for 0 ≤ l ≤ t −1
– ﬁxed interval smoothing: p(x
l:k
y
1:t
), for 1 ≤ l < k ≤ t
– prediction: p(x
l:k
y
1:t
), for k > t and 1 ≤ l ≤ k.
The ﬁrst three inference problems reduce to marginalisation of the full smoothing distribution p(x
1:t
y
1:t
), i.e.,
integrating over the state variables that are not of interest, whereas the fourth reduces to marginalisation of
p(x
1:k
y
1:t
) = p(x
1:t
y
1:t
)
k
i=t+1
f(x
i
x
i−1
).
So far we assumed that the state transition and observation densities are completely characterised, i.e., that the
parameters θ and φ are known. If they are unknown, then Bayesian inference is concerned with the joint posterior
distribution of the hidden states and the parameters:
p(x
1:t
, θ, φy
1:t
) ∝ p(y
1:t
x
1:t
, θ, φ)p(x
1:t
θ, φ)p(θ, φ) = p(θ, φ)p(x
1
)g(y
1
x
1
, φ)
t
i=2
f(x
i
x
i−1
, θ)g(y
i
x
i
, φ).
If interest lies in the posterior distribution of the parameters, then the inference problem is called:
– static parameter estimation: p(θ, φy
1:t
),
which reduces to integrating over the state variables in the joint posterior distribution p(x
1:t
, θ, φy
1:t
).
It is evident, then, that these inference problems depend on the tractability of the posterior distribution
p(x
1:t
y
1:t
), if the parameters are known, or p(x
1:t
, θ, φy
1:t
), otherwise. Notice that equation (7.1) gives the pos
terior distribution up to a normalising constant
_
p(x
1:t
y
1:t−1
)g(y
t
x
t
)dx
1:t
, and it is oftentimes the case that the
posterior distribution is known only up to a constant. In fact, these posterior distributions can be computed in closed
form only in a few speciﬁc cases, such as the hidden Markov model, i.e., when the state process is a discrete Markov
chain, and the linear Gaussian model, i.e., when the functions a() and b() are linear, and the noise disturbances u
t
and v
t
are Gaussian.
For HMMs with discrete state transition and observation distributions, the tutorial of Rabiner (1989) presents
recursive algorithms for the smoothing and static parameter estimation problems. The Viterbi algorithm returns the
optimal sequence of hidden states, i.e., the sequence that maximises the smoothing distribution, and the Expectation
Maximization (EM) algorithm returns parameter values for which the likelihood function of the observations attains
a local maximum. If the observation distribution is continuous, then it can be approximated by a ﬁnite mixture
of Gaussian distributions to insure that the EM algorithm applies to the problem of parameter estimation. These
recursive algorithms involve summations over all states in the model, so they are impractical when the state space
is large.
For the linear Gaussian model, the normalising constant in (7.1) can be computed analytically, and thus the
posterior distribution of interest is known in closed form; in fact, it is the Gaussian distribution. The Kalman ﬁl
ter algorithm (Kalman, 1960) gives recursive expressions for the mean and variance of the ﬁltering distribution
p(x
t
y
1:t
), under the assumption that all parameters in the model are known. Kalman (1960) obtains recursive ex
pressions for the optimal values of the mean and variance parameters via a leastsquares approach. The algorithm
alternates between two steps: a prediction step (i.e., predict the state at time t conditional on y
1:t−1
), and an update
step (i.e., observe y
t
, and update the prediction in light of the new observation). Section 7.3 presents a Bayesian
formulation of the Kalman ﬁlter algorithm following Meinhold and Singpurwalla (1983).
76 7. Statespace models and the Kalman ﬁlter algorithm
When an analytic solution is intractable, exact inference is replaced by inference based on an approximation to
the posterior distribution of interest. Gridbased methods using discrete numerical approximations to these posterior
distributions are severely limited by parameter dimension. Alternatively, sequential Monte Carlo methods are a
simulationbased approach that offer greater ﬂexibility and scale better with increasing dimensionality. The key
idea of SMC methods is to represent the posterior distribution by a weighted set of samples, called particles, that
are ﬁltered in time as new observations arrive, through a combination of sampling and resampling steps. Hence
SMC sampling algorithms are oftentimes called particle ﬁlters (Carpenter et al., 1999). Chapter 8 presents SMC
methods for the problems of ﬁltering and smoothing.
7.3 The Kalman ﬁlter algorithm
From (7.1), the posterior distribution of the state x
t
conditional on the observations y
1:t
is proportional to
p(x
t
y
1:t−1
)g(y
t
x
t
). The ﬁrst term is the distribution of x
t
conditional on the ﬁrst t − 1 observations; comput
ing this distribution is known as the prediction step. The second term is the distribution of the new observation
y
t
conditional on the hidden state at time t. Updating p(x
t
y
1:t−1
) in light of the new observation involves taking
the product of these two terms, and normalising; this is known as the update step. The result is the distribution of
interest:
p(x
t
y
1:t
) =
_
p(x
1:t
y
1:t
)dx
1
. . . dx
t−1
=
p(x
t
y
1:t−1
)g(y
t
x
t
)
_
p(x
t
y
1:t−1
)g(y
t
x
t
)dx
t
.
We now show how the prediction and update stages can be performed exactly for the linear Gaussian statespace
model, which is represented as follows:
observation: y
t
= A
t
x
t
+u
t
∼ N(A
t
x
t
, Φ
2
) (7.2)
hidden state: x
t
= B
t
x
t−1
+v
t
∼ N(B
t
x
t−1
, Θ
2
), (7.3)
where u
t
∼ N(0, Φ
2
) and v
t
∼ N(0, Θ
2
) are independent noise sequences, and the parameters A
t
, B
t
, Φ
2
, and Θ
2
are known. It is also possible to let the noise variances Φ
2
and Θ
2
vary with time; the derivation of the mean and
variance of the posterior distribution follows as detailed below. We assume that both states and observations are
vectors, in which case the parameters are matrices of appropriate sizes.
The Kalman ﬁlter algorithm proceeds as follows. Start with an initial Gaussian distribution on x
1
: x
1
∼
N(µ
1
, Σ
1
). At time t −1, let µ
t−1
and Σ
t−1
be the mean and variance of the Gaussian distribution of x
t−1
condi
tional on y
1:t−1
. Looking forward to time t, we begin by predicting the distribution of x
t
conditional on y
1:t−1
.
Prediction step: From equation (7.3), x
t
= B
t
x
t−1
+ v
t
, where x
t−1
y
1:t−1
∼ N(µ
t−1
, Σ
t−1
), and v
t
∼
N(0, Θ
2
) independently. By results in multivariate statistical analysis (Anderson, 2003), we have that
x
t
y
1:t−1
∼ N(B
t
µ
t−1
, B
t
Σ
t−1
B
T
t
+Θ
2
), (7.4)
where the superscript
T
indicates matrix transpose. This can be thought of as the prior distribution on x
t
.
Update step: Upon observing y
t
, we are interested in
p(x
t
y
1:t
) ∝ p(y
t
x
t
, y
1:t−1
)p(x
t
y
1:t−1
).
Following equation (7.2) and the result in (7.4), consider predicting y
t
by ˆ y
t
= A
t
B
t
µ
t−1
, where B
t
µ
t−1
is the
prior mean on x
t
. The prediction error is e
t
= y
t
− ˆ y
t
= y
t
− A
t
B
t
µ
t−1
, which is equivalent to observing y
t
. So
it follows that p(x
t
y
1:t
) ∝ p(e
t
x
t
, y
1:t−1
)p(x
t
y
1:t−1
). Finally, from (7.2), e
t
= A
t
(x
t
− B
t
µ
t−1
) + u
t
, where
u
t
∼ N(0, Φ
2
), so e
t
x
t
, y
1:t−1
∼ N(A
t
(x
t
−B
t
µ
t−1
), Φ
2
).
We now use the following results from Anderson (2003). Let X
1
an X
2
have a bivariate normal distribution:
7.3 The Kalman ﬁlter algorithm 77
_
X
1
X
2
_
∼ N
__
µ
1
µ
2
_
,
_
Σ
11
Σ
12
Σ
21
Σ
22
__
. (7.5)
If equation (7.5) holds, then the conditional distribution of X
1
given X
2
= x
2
is given by
X
1
X
2
= x
2
∼ N
_
µ
1
+Σ
12
Σ
−1
22
(x
2
−µ
2
), Σ
11
−Σ
12
Σ
−1
22
Σ
21
_
. (7.6)
Conversely, if (7.6) holds, and X
2
∼ N(µ
2
, Σ
22
), then (7.5) is true.
Since e
t
x
t
, y
1:t−1
∼ N(A
t
(x
t
−B
t
µ
t−1
), Φ
2
) and x
t
y
1:t−1
∼ N(B
t
µ
t−1
, B
t
Σ
t−1
B
T
t
+Θ
2
), it follows that
_
x
t
e
t
_¸
¸
¸
¸
¸
y
1:t−1
∼ N
__
B
t
µ
t−1
0
_
,
_
B
t
Σ
t−1
B
T
t
+Θ
2
(B
t
Σ
t−1
B
T
t
+Θ
2
)A
T
t
A
t
(B
t
Σ
t−1
B
T
t
+Θ
2
) A
t
(B
t
Σ
t−1
B
T
t
+Θ
2
)A
T
t
+Φ
2
__
Using the result above, the ﬁltering distribution is p(x
t
y
1:t
) = p(x
t
e
t
, y
1:t−1
) = N(µ
t
, Σ
t
), since observing e
t
is
equivalent to observing y
t
, where
µ
t
= B
t
µ
t−1
+ (B
t
Σ
t−1
B
T
t
+Θ
2
)A
T
t
(A
t
(B
t
Σ
t−1
B
T
t
+Θ
2
)A
T
t
+Φ
2
)
−1
e
t
Σ
t
= B
t
Σ
t−1
B
T
t
+Θ
2
−(B
t
Σ
t−1
B
T
t
+Θ
2
)A
T
t
(A
t
(B
t
Σ
t−1
B
T
t
+Θ
2
)A
T
t
+Φ
2
)
−1
A
t
(B
t
Σ
t−1
B
T
t
+Θ
2
).
Algorithm 1 The Kalman ﬁlter algorithm.
1: Input: µ
1
and Σ
1
.
2: Set t = 2.
3: Compute mean and variance of prediction: ˆ µ
t
= B
t
µ
t−1
,
ˆ
Σ
t
= B
t
Σ
t−1
B
T
t
+Θ
2
.
4: Observe y
t
and compute error in prediction: e
t
= y
t
−A
t
ˆ µ
t
.
5: Compute variance of prediction error: R
t
= A
t
ˆ
Σ
t
A
T
t
+Φ
2
.
6: Update the mean and variance of the posterior distribution:
µ
t
= ˆ µ
t
+
ˆ
Σ
t
A
T
t
R
−1
t
e
t
Σ
t
=
ˆ
Σ
t
−
ˆ
Σ
t
A
T
t
R
−1
t
A
t
ˆ
Σ
t
.
7: Set t = t + 1. Go to step 3.
Example 7.1 (Firstorder, linear autoregressive (AR(1)) model observed with noise). Consider the following AR(1)
model:
x
t
= φx
t−1
+σ
U
u
t
∼ N(φx
t−1
, σ
2
u
)
y
t
= x
t
+σ
V
v
t
∼ N(x
t
, σ
2
V
),
where u
t
∼ N(0, 1) and v
t
∼ N(0, 1) are independent, Gaussian white noise processes. The Markov chain {X
t
}
t≥1
is a Gaussian random walk with transition kernel K(x
t−1
, x
t
) corresponding to the N(φx
t−1
, σ
2
U
) distribution.
A normal distribution N(µ, σ
2
) is stationary for {X
t
}
t≥1
if X
t−1
∼ N(µ, σ
2
) and X
t
X
t−1
= x
t−1
∼
N(φx
t−1
, σ
2
u
) imply that X
t
∼ N(µ, σ
2
). We require that E(X
t
) = φµ = µ and Var(X
t
) = φ
2
σ
2
+ σ
2
U
= σ
2
,
which are satisﬁed by µ = 0 and σ
2
= σ
2
U
/(1 −φ
2
), provided φ < 1. In fact, the N
_
0, σ
2
U
/(1 −φ
2
)
_
distribution
is the unique stationary distribution of the chain.
Start the Kalman ﬁlter algorithm with µ
1
= 0 and Σ
1
= σ
2
U
/(1 −φ
2
). At time t −1, t ≥ 2, let µ
t−1
and Σ
t−1
denote the posterior mean and variance, respectively. Then the mean and variance of the prediction at time t are:
ˆ µ
t
= φµ
t−1
and
ˆ
Σ
t
= φ
2
Σ
t−1
+σ
2
U
. The prediction error is e
t
= y
t
− ˆ µ
t
with variance
ˆ
Σ
t
+σ
2
V
. Finally, update
the mean and variance of the posterior distribution:
µ
t
= ˆ µ
t
+
ˆ
Σ
t
1
ˆ
Σ
t
+σ
2
V
(y
t
− ˆ µ
t
) =
_
1 −
ˆ
Σ
t
ˆ
Σ
t
+σ
2
V
_
ˆ µ
t
+
ˆ
Σ
t
ˆ
Σ
t
+σ
2
V
y
t
Σ
t
=
ˆ
Σ
t
−
ˆ
Σ
t
_
ˆ
Σ
t
ˆ
Σ
t
+σ
2
V
_
ˆ
Σ
t
=
ˆ
Σ
t
_
1 −
ˆ
Σ
t
_
ˆ
Σ
t
ˆ
Σ
t
+σ
2
V
__
.
78 7. Statespace models and the Kalman ﬁlter algorithm
⊳
The Kalman ﬁlter algorithm (see Figure 1 for pseudocode) is not robust to outlying observations y
t
, i.e., when
the prediction error e
t
is large, because the mean µ
t
is an unbounded function of e
t
, and the variance Σ
t
does
not depend on the observed data y
t
. Meinhold and Singpurwalla (1989) let the distributions of the error terms u
t
and v
t
be Studentt, and show that the posterior distribution of x
t
given y
1:t
converges to the prior distribution of
p(x
t
y
1:t−1
) when e
t
is large. In this case, the posterior distribution is no longer known exactly, but is approximated.
The underlying assumptions of the Kalman ﬁlter algorithm are that the state transition and observation equa
tions are linear, and that the error terms are normally distributed. If the linearity assumption is violated, but the
state transition and observation equations are differentiable functions, then the extended Kalman ﬁlter algorithm
propagates the mean and covariance via the Kalman ﬁlter equations by linearizing the underlying nonlinear model.
If this model is highly nonlinear, then this approach will result in very poor estimates of the mean and covariance.
An alternative is the unscented Kalman ﬁlter which takes a deterministic sampling approach, representing the state
transition distribution by a set of sample points that are propagated through the nonlinear model. This approach
improves the accuracy of the posterior mean and covariance; for details, see Wan and van der Merwe (2000).
Chapter 8
Sequential Monte Carlo
In this chapter we introduce sequential Monte Carlo (SMC) methods for sampling from dynamic models; these
methods are based on importance sampling and resampling techniques. In particular, we present SMC methods the
ﬁltering and smoothing problems in statespace models.
8.1 Importance Sampling revisited
In Section 3.3, importance sampling is introduced as a technique for approximating a given integral µ =
_
h(x)f(x)dx under a distribution f, by sampling from an instrumental distribution g with support satisfying
supp(g) ⊃ supp(f · h). This is based on the observation that
µ = E
f
(h(X)) =
_
h(x)f(x)dx =
_
h(x)
f(x)
g(x)
g(x)dx =
_
h(x)w(x)g(x)dx = E
g
(h(X) · w(X)), (8.1)
where the rightmost expectation in (8.1) is approximated by the empirical average of h · w evaluated at n i.i.d.
samples from g.
In practice, we want to select g as close as possible to f such that the estimator of µ has ﬁnite variance. One
sufﬁcient condition is that f(x) < M · g(x) and Var
f
(h(X)) < ∞. Under this condition, it is possible to use
rejection sampling to sample from f and approximate µ. We argue in the following subsection that importance
sampling is more efﬁcient than rejection sampling, in terms of producing weights with smaller variance.
8.1.1 Importance Sampling versus Rejection Sampling
Let E be the support of f. Deﬁne the artiﬁcial target distribution
¯
f(x, y) on E ×[0, 1] as
¯
f(x, y) =
_
_
_
Mg(x) for
_
(x, y) : x ∈ E, y ∈
_
0,
f(x)
Mg(x)
__
0 otherwise
where
f(x) =
_
¯
f(x, y)dy =
_
f(x)
Mg(x)
0
Mg(x)dy.
Consider the instrumental distribution ¯ g(x, y) = g(x)U
[0,1]
(y), for (x, y) ∈ E×[0, 1], where U
[0,1]
(·) is the uniform
distribution on [0, 1]. Then, performing importance sampling on E ×[0, 1] with weights
w(x, y) =
¯
f(x, y)
¯ g(x, y)
=
_
_
_
M for
_
(x, y) : x ∈ E, y ∈
_
0,
f(x)
Mg(x)
__
0 otherwise
80 8. Sequential Monte Carlo
is equivalent to rejection sampling to sample from f using instrumental distribution g. In contrast, importance
sampling from f using instrumental distribution g has weights w(x) = f(x)/g(x).
We now show that the weights for rejection sampling, w(x, y), have higher variance than those for importance
sampling, w(x). For this purpose, we introduce the following technical lemma which relates the variance of a
random variable to its conditional variance and expectation. In the following, any expectation or variance with a
subscript corresponding to a random variable should be interpreted as the expectation or variance with respect to
that random variable.
Lemma 8.1 (Law of Total Variance). Given two random variables, A and B, on the same probability space, such
that Var (A) < ∞, then the following decomposition exists:
Var (A) = E
B
[Var
A
(AB)] + Var
B
(E
A
[AB]) .
Proof. By deﬁnition, and the law of total probability, we have:
Var (A) = E
_
A
2
¸
−E[A]
2
= E
B
_
E
A
_
A
2
B
¸¸
−E
B
[E
A
[AB]]
2
.
Considering the deﬁnition of conditional variance, and then variance, it is clear that:
Var (A) = E
B
_
Var
A
(AB) +E
A
[AB]
2
_
−E
B
[E
A
[AB]]
2
= E
B
[Var
A
(AB)] +E
B
_
E
A
[AB]
2
_
−E
B
[E
A
[AB]]
2
= E
B
[Var
A
(AB)] + Var
B
(E
A
[AB]) .
Returning to importance sampling versus rejection sampling, we have by Lemma 8.1 that
Var (w(X, Y )) = Var (E(w(X, Y )X)) +E(Var (w(X, Y )X)) = Var (w(X)) +E(Var (w(X, Y )X))
≥ Var (w(X)) ,
since
E(w(X, Y )X) =
_
1
0
w(x, y)
¯ g(x, y)
g(x)
dy =
_
f(X)
Mg(X)
0
Mdy =
f(X)
g(X)
= w(X),
and the fact that Var (w(X, Y )X) is a nonnegative function.
8.1.2 Empirical distributions
Consider a collection of i.i.d. points {x
i
}
n
i=1
in E drawn from f. We can approximate f by the following empirical
measure, associated with these points,
ˆ
f(x) =
1
n
n
i=1
I(x = x
i
) =
1
n
n
i=1
δ
x
i
(x),
where, for any x ∈ E, δ
x
i
(x) is the Dirac measure which places all of its mass at point x
i
, i.e., δ
x
i
(x) = 1 if x = x
i
and 0 otherwise. Similarly, we can deﬁne the empirical distribution function
ˆ
F(x) =
1
n
n
i=1
I(x
i
≤ x),
where I(x
i
≤ x) is the indicator of event x
i
≤ x.
If the collection of points has associated positive, realvalued weights {x
i
, w
i
}
n
i=1
, then the empirical measure
is deﬁned as follows
8.2 Sequential Importance Sampling 81
ˆ
f(x) =
n
i=1
w
i
δ
x
i
(x)
n
i=1
w
i
.
For ﬁxed x ∈ E,
ˆ
f(x) and
ˆ
F(x), as functions of a random sample, are a random measure and distribution,
respectively. The strong law of large numbers justiﬁes approximating the true density and distribution functions by
ˆ
f(x) and
ˆ
F(x) as the number of samples n increases to inﬁnity.
From these approximations, we can then estimate integrals with respect to f by integrals with respect to the
associated empirical measure. Let h : E →R be a measureable function. Then
E
f
(h(X)) =
_
h(x)f(x)dx ≈
_
h(x)
ˆ
f(x)dx =
1
n
n
i=1
h(x
i
),
or, if the random sample is weighted,
E
f
(h(X)) =
_
h(x)f(x)dx ≈
_
h(x)
ˆ
f(x)dx =
n
i=1
w
i
h(x
i
)
n
i=1
w
i
.
Sequential Importance Sampling exploits the idea of sequentially approximating an intractable density function
f by the corresponding empirical measure
ˆ
f associated with a random sample from an instrumental distribution g,
and properly weighted with weights w
i
∝ f(x
i
)/g(x
i
).
8.2 Sequential Importance Sampling
Sequential Importance Sampling (SIS) performs importance sampling sequentially to sample from a distribution
p
t
(x
t
) in a dynamic model. Let
_
x
(i)
t
, w
(i)
t
_
n
i=1
be a collection of samples, called particles, that targets p
t
(x
t
). At
time t+1, the target distribution evolves to p
t+1
(x
t+1
); for i = 1, . . . , n, sample the (t+1)st component x
(i)
t+1
from
an instrumental distribution, update the weight w
(i)
t+1
, and append x
(i)
t+1
to x
(i)
t
. The desired result is a collection of
samples
_
x
(i)
t+1
, w
(i)
t+1
_
n
i=1
that targets p
t+1
(x
t+1
).
The idea is to choose an instrumental distribution such that importance sampling can proceed sequentially. Let
q
t+1
(x
t+1
) denote the instrumental distribution, and suppose that it can be factored as follows:
q
t+1
(x
t+1
) = q
1
(x
1
)
t+1
i=2
q
i
(x
i
x
i−1
) = q
t
(x
t
)q
t+1
(x
t+1
x
t
).
Then, the weight w
(i)
t+1
can be computed incrementally from w
(i)
t
. At time t = 1, sample x
(i)
1
∼ q
1
(x
1
) for
i = 1, . . . , n, and set w
(i)
1
= p
1
(x
1
)/q
1
(x
1
). Normalise the weights by dividing them by
n
j=1
w
(j)
1
. At time t > 1,
w
(i)
t
=
p
t
(x
(i)
t
)
q
t
(x
(i)
t
)
,
so, at the following time step, we sample x
(i)
t+1
∼ q
t+1
(x
t+1
x
(i)
t
), and update the weight
w
(i)
t+1
=
p
t+1
(x
(i)
t+1
)
q
t+1
(x
(i)
t+1
)
=
p
t+1
(x
(i)
t+1
)
q
t
(x
(i)
t
)q
t+1
(x
(i)
t+1
x
(i)
t
)
= w
(i)
t
p
t+1
(x
(i)
t+1
)
p
t
(x
(i)
t
)q
t+1
(x
(i)
t+1
x
(i)
t
)
. (8.2)
Normalise the weights. The term
p
t+1
(x
(i)
t+1
)
p
t
(x
(i)
t
)q
t+1
(x
(i)
t+1
x
(i)
t
)
is known as the incremental weight. The intuition is that if the weighted sample
_
x
(i)
t
, w
(i)
t
_
n
i=1
is a good ap
proximation to the target distribution at time t, p
t
(x
t
), then, for appropriately chosen instrumental distribution
q
t+1
(x
t+1
x
t
), the weighted sample
_
x
(i)
t+1
, w
(i)
t+1
_
n
i=1
is also a good approximation to p
t+1
(x
t+1
). For details,
see Liu and Chen (1998).
82 8. Sequential Monte Carlo
8.2.1 Optimal instrumental distribution
Kong et al. (1994) and Doucet et al. (2000) prove that the unconditional variance of the weights increases over
time. So in the long run, a few of the weights contain most of the probability mass, while most of the particles have
normalised weights with near zero values. This phenomenon is know in the literature as weight degeneracy. Chopin
(2004) argues that the SIS algorithm suffers from the curse of dimensionality, in the sense that weight degeneracy
grows exponentially in the dimension t.
So it is natural to seek an instrumental distribution that minimises the variance of the weights. Doucet et al.
(2000) show that the optimal instrumental distribution is q
t+1
(x
t+1
x
(i)
t
) = p
t+1
(x
t+1
x
(i)
t
), in the sense that the
variance of w
(i)
t+1
conditional upon x
(i)
t
is zero. This result appears in the following proposition.
Proposition 8.2. The instrumental distribution q
t+1
(x
t+1
x
(i)
t
) = p
t+1
(x
t+1
x
(i)
t
) minimises the variance of the
weight w
(i)
t+1
conditional upon x
(i)
t
.
Proof. From (8.2),
Var
q
t+1
(x
t+1
x
(i)
t
)
_
w
(i)
t+1
_
=
_
w
(i)
t
_
2
Var
q
t+1
(x
t+1
x
(i)
t
)
_
p
t+1
(x
(i)
t
, x
t+1
)
p
t
(x
(i)
t
)q
t+1
(x
t+1
x
(i)
t
)
_
=
_
w
(i)
t
_
2
_
_
_
_
_
p
t+1
(x
(i)
t
, x
t+1
)
p
t
(x
(i)
t
)
_
2
1
q
t+1
(x
t+1
x
(i)
t
)
dx
t+1
−
_
p
t+1
(x
(i)
t
)
p
t
(x
(i)
t
)
_
2
_
_
_
= 0,
if q
t+1
(x
t+1
x
(i)
t
) = p
t+1
(x
t+1
x
(i)
t
).
More intuitively, Liu and Chen (1998) rewrite the incremental weight as
p
t+1
(x
(i)
t+1
)
p
t
(x
(i)
t
)q
t+1
(x
t+1
x
(i)
t
)
=
p
t+1
(x
(i)
t
)
p
t
(x
(i)
t
)
p
t+1
(x
t+1
x
(i)
t
)
q
t+1
(x
t+1
x
(i)
t
)
,
and interpret the second ratio on the right hand side as correcting the discrepancy between q
t+1
(x
t+1
x
(i)
t
) and
p
t+1
(x
t+1
x
(i)
t
), when they are different. Hence the optimal instrumental distribution is p
t+1
(x
t+1
x
(i)
t
).
In practice, however, sampling from the optimal instrumental distribution is usually not possible, so other
choices of instrumental distributions are considered. Oftentimes it is possible to ﬁnd good approximations to the
optimal instrumental distribution; in such instances, the variance of the corresponding weights is low for small t,
but weight degeneracy still occurs at t increases.
When q
t+1
(x
t+1
x
(i)
t
) = p
t
(x
t+1
x
(i)
t
), i.e., the distribution p
t
(x
t
) is used to predict x
t+1
, then the incremental
weight simpliﬁes to p
t+1
(x
(i)
t+1
)/p
t
(x
(i)
t+1
). The resulting SIS algorithm is known as the bootstrap ﬁlter. It was ﬁrst
introduced by Gordon et al. (1993) in the context of Bayesian ﬁltering for nonlinear, nonGaussian statespace
models.
8.2.2 SIS for statespace models
Recall the statespace model introduced in Section 7.2. For simplicity, assume that all parameters are known.
observation: y
t
= a(x
t
, u
t
) ∼ g(·x
t
)
hidden state: x
t
= b(x
t−1
, v
t
) ∼ f(·x
t−1
).
We present the SIS algorithm to sample approximately from the ﬁltering distribution p
t+1
(x
t+1
) = p(x
t+1
y
1:t+1
),
and the smoothing distribution p
t+1
(x
t+1
) = p(x
1:t+1
y
1:t+1
).
The instrumental distribution is q
t+1
(x
t+1
x
(i)
1:t
), where the subscript t + 1 indicates that the distribution may
incorporate all the data up to time t + 1: y
1:t+1
. For the bootstrap ﬁlter, q
t+1
(x
t+1
x
(i)
1:t
) = p(x
t+1
x
(i)
1:t
, y
1:t
) =
8.2 Sequential Importance Sampling 83
f(x
t+1
x
(i)
t
), i.e., the instrumental distribution does not incorporate the most recent observation y
t+1
, and the
weight is
w
(i)
t+1
= w
(i)
t
p(x
(i)
1:t+1
y
1:t+1
)
p(x
(i)
1:t
y
1:t
)f(x
(i)
t+1
x
(i)
t
)
∝ w
(i)
t
p(x
(i)
1:t
y
1:t
)f(x
(i)
t+1
x
(i)
t
)g(y
t+1
x
(i)
t+1
)
p(x
(i)
1:t
y
1:t
)f(x
(i)
t+1
x
(i)
t
)
= w
(i)
t
g(y
t+1
x
(i)
t+1
)
by equation (7.1). In this case, the incremental weight does not depend on past trajectories x
(i)
1:t
, but only on the
likelihood function of the most recent observation.
Gordon et al. (1993) introduce the bootstrap ﬁlter in a very intuitive way, without reference to SIS, but rather as
a twostage recursive process with a propagate step, followed by an update step (similar in spirit to the recursions
in the Kalman ﬁlter algorithm). Write the ﬁltering density as follows:
p(x
t+1
y
1:t+1
) =
g(y
t+1
x
t+1
)p(x
t+1
y
1:t
)
p(y
t+1
y
1:t
)
∝ g(y
t+1
x
t+1
)
_
p(x
t+1
, x
t
y
1:t
)dx
t
∝ g(y
t+1
x
t+1
)
_
f(x
t+1
x
t
)p(x
t
y
1:t
)dx
t
. (8.3)
Now, let
_
x
(i)
t
, w
(i)
t
_
n
i=1
be a weighted sample representing the ﬁltering density at time t, such that
n
j=1
w
(j)
t
= 1.
Then, ˆ p(x
t
y
1:t
) =
n
i=1
w
(i)
t
δ
x
(i)
t
(x
t
) is the empirical measure approximating p(x
t
y
1:t
), and ˆ p(x
t+1
y
1:t
) =
n
i=1
w
(i)
t
f(x
t+1
x
(i)
t
). Furthermore, by (8.3), we have the approximation
ˆ p(x
t+1
y
1:t+1
) =
n
i=1
w
(i)
t
g(y
t+1
x
t+1
)f(x
t+1
)x
(i)
t
),
and sampling proceeds in two steps:
Propagate step: for i = 1 : n, sample x
(i)
t+1
∼ f(x
t+1
x
(i)
t
).
Update step: for i = 1 : n, weigh x
(i)
t+1
with weight w
(i)
t+1
= w
(i)
t
g(y
t+1
x
(i)
t+1
). Normalise the weights.
In contrast to the instrumental distribution of the bootstrap ﬁlter, the optimal instrumental distribution incorpo
rates the most recent observation:
q
t+1
(x
t+1
x
(i)
1:t
) = p(x
t+1
x
(i)
1:t
, y
1:t+1
)
=
p(y
t+1
x
(i)
1:t
, x
t+1
, y
1:t
)p(x
t+1
x
(i)
1:t
, y
1:t
)
_
p(y
t+1
x
(i)
1:t
, x
t+1
, y
1:t
)p(x
t+1
x
(i)
1:t
, y
1:t
)dx
t+1
=
g(y
t+1
x
t+1
)f(x
t+1
x
(i)
t
)
_
g(y
t+1
x
t+1
)f(x
t+1
x
(i)
t
)dx
t+1
,
where the normalising constant equals the predictive distribution of y
t+1
conditional on x
t
, i.e., p(y
t+1
x
t
). So the
weight function becomes w
(i)
t+1
∝ w
(i)
t
p(y
t+1
x
t
).
Example 8.1 (AR(1) model observed with noise (continued from example 7.1)). The optimal instrumental distribu
tion is
q
t+1
(x
t+1
x
1:t
) ∝ g(y
t+1
x
t+1
)f(x
t+1
x
t
)
∝ exp
_
−
1
2σ
2
V
(y
t+1
−x
t+1
)
2
_
exp
_
−
1
2σ
2
U
(x
t+1
−φx
t
)
2
_
= exp
_
−
1
2
_
x
2
t+1
_
1
σ
2
V
+
1
σ
2
U
_
−2x
t+1
_
y
t+1
σ
2
V
+
φx
t
σ
2
U
_
+
y
2
t+1
σ
2
V
+
φ
2
x
2
t
σ
2
U
__
∝ exp
_
−
1
2σ
2
(x
t+1
−µ)
2
_
,
implying that the distribution is N(µ, σ
2
) with
84 8. Sequential Monte Carlo
µ =
σ
2
U
σ
2
V
σ
2
U
+σ
2
V
_
y
t+1
σ
2
V
+
φx
t
σ
2
U
_
σ
2
=
σ
2
U
σ
2
V
σ
2
U
+σ
2
V
.
The normalising constant of this optimal instrumental distribution is
p(y
t+1
x
t
) =
_
g(y
t+1
x
t+1
)f(x
t+1
x
t
)dx
t+1
∝ exp
_
−
1
2σ
2
V
y
2
t+1
_
exp
_
1
2
_
1
σ
2
U
+
1
σ
2
V
__
σ
2
U
σ
2
V
σ
2
U
+σ
2
V
_
2
_
y
t+1
σ
2
V
+
φx
t
σ
2
U
_
2
_
×
_
1
√
2π
_
1
σ
2
U
+
1
σ
2
V
_
1/2
exp
_
−
1
2
_
1
σ
2
U
+
1
σ
2
V
__
x
t+1
−
σ
2
U
σ
2
V
σ
2
U
+σ
2
V
_
y
t+1
σ
2
V
+
φx
t
σ
2
U
__
2
_
dx
t+1
∝ exp
_
−
1
2
_
y
2
t+1
1
σ
2
U
+σ
2
V
−
2φx
t
σ
2
U
+σ
2
V
y
t+1
__
,
i.e., y
t+1
x
t
∼ N(φx
t
, σ
2
U
+σ
2
V
). ⊳
Let
_
x
(i)
1:t+1
, w
(i)
t+1
_
n
i=1
be a weighted sample, normalised such that
n
j=1
w
(j)
t+1
= 1, from p(x
1:t+1
y
1:t+1
).
Then the ﬁltering and smoothing densities are approximated by the corresponding empirical measures:
ˆ p(x
t+1
y
1:t+1
) =
n
i=1
w
(i)
t+1
δ
x
(i)
t+1
(x
t+1
), ˆ p(x
1:t+1
y
1:t+1
) =
n
i=1
w
(i)
t+1
δ
x
(i)
1:t+1
(x
1:t+1
).
Algorithm 2 is the general SIS algorithm for statespace models. The computational complexity to generate n
particles representing p(x
1:t
y
1:t
) is O(nt).
Algorithm 2 The SIS algorithm for statespace models
1: Set t = 1.
2: For i = 1 : n, sample x
(i)
1
∼ q
1
(x
1
).
3: For i = 1 : n, set w
(i)
1
∝ p(x
(i)
1
)g(y
1
x
(i)
1
)/q
1
(x
(i)
1
). Normalise such that
n
P
j=1
w
(j)
1
= 1.
4: At time t + 1, do:
5: For i = 1 : n, sample x
(i)
t+1
∼ q
t+1
(x
t+1
x
(i)
1:t
).
6: For i = 1 : n, set w
(i)
t+1
∝ w
(i)
t
f(x
(i)
t+1
x
(i)
t
)g(y
t+1
x
(i)
t+1
)/q
t+1
(x
(i)
t+1
x
(i)
1:t
). Normalise such that
n
P
j=1
w
(j)
t+1
= 1.
7: The ﬁltering and smoothing densities at time t + 1 may be approximated by
ˆ p(x
t+1
y
1:t+1
) =
n
X
i=1
w
(i)
t+1
δ
x
(i)
t+1
(x
t+1
), ˆ p(x
1:t+1
y
1:t+1
) =
n
X
i=1
w
(i)
t+1
δ
x
(i)
1:t+1
(x
1:t+1
).
8: Set t = t + 1. Go to step 4.
8.3 Sequential Importance Sampling with Resampling
One approach to limiting the weight degeneracy problem is to choose an instrumental distribution that lowers the
variance of the weights; a second approach is to introduce a resampling step after drawing and weighing the particles
at time t + 1. This idea of rejuvenating the particles by resampling was ﬁrst suggested by Gordon et al. (1993).
The idea is as follows: let
_
x
(i)
t
, w
(i)
t
_
n
i=1
be a weighted sample fromp
t
(x
t
), obtained by importance sampling,
and normalised such that
n
j=1
w
(j)
t
= 1. The empirical measure is ˆ p
t
(x
t
) =
n
i=1
w
(i)
t
δ
x
(i)
t
(x
t
). Under suitable
regularity conditions, the law of large number states that, for any ﬁxed measurable function h, as n →∞,
_
h(x
t
)ˆ p
t
(x
t
)dx
t
=
n
i=1
w
(i)
t
h(x
(i)
t
) →
_
h(x
t
)p
t
(x
t
)dx
t
.
8.3 Sequential Importance Sampling with Resampling 85
Suppose now that we draw a sample of size n
′
from ˆ p
t
(x
t
) with replacement, i.e., for j = 1, . . . , n, ˜ x
(j)
t
= x
(i)
t
with
probability w
(i)
t
. The new particles have equal weights: ˜ w
(i)
t
= 1/n
′
. Again, invoking the law of large numbers, as
n
′
→∞,
1
n
′
n
′
j=1
h(˜ x
(j)
t
) →
n
i=1
w
(i)
t
h(x
(i)
t
).
So, for n
′
large, the integral of the function h with respect to the new empirical measure based on
_
˜ x
(j)
t
, 1/n
′
_
n
′
j=1
is a good approximation to the integral of h with respect to ˆ p
t
(x
t
).
In the SIS algorithm, resampling is applied to the entire trajectory x
(i)
1:t+1
, not simply to the last value x
t+1
; the
new algorithm is known as SIS with Resampling (SISR). The advantage of resampling is that it eliminates particle
trajectories with low weights, and replicates those with large weights; all of the resampled particles then contribute
signiﬁcantly to the importance sampling estimates.
On the other hand, replicating trajectories with large weights reduces diversity by depleting the number of
distinct particle values at any time step in the past. At time t + 1, new values x
(i)
t+1
are sampled and appended to
the particle trajectories; resampling then eliminates some of these trajectories. Since the values at t
′
< t +1 are not
rejuvenated as t increases, their diversity decreases due to resampling. In the extreme case, the smoothing density
p
t+1
(x
1:t+1
) is approximated by a system of particle trajectories with a single common acestor. Figure 8.1 displays
this situation graphically.
In general, at the current time step t +1, we can obtain a good approximation to the ﬁltering density p
t+1
(x
t+1
)
from the particles and their corresponding weights, provided the number of particles is large enough. However,
approximations to the smoothing density p
t+1
(x
1:t+1
) and ﬁxed interval smoothing densities p
t+1
(x
1:t
′ ), for t
′
≪
t, will be poor. Chopin (2004) argues that for smoothing the ﬁrst state x
1
, i.e., approximating p
t+1
(x
1
), the SIS
algorithmis more efﬁcient than the SISRalgorithm, but that the latter can be expected to be more efﬁcient in ﬁltering
the states, i.e., approximating p
t+1
(x
t+1
). In particular, if the instrumental distribution of the SISR algorithm has a
certain abilitiy to “forget the past” (i.e., to forget its initial condition), then the asymptotic variance of the estimator
is bounded from above in t.
2 4 6 8 10
0
1
2
3
4
5
6
t
p
a
r
t
i
c
l
e
s
Figure 8.1. Plot of particle trajectories with a single common ancestor after resampling.
86 8. Sequential Monte Carlo
Moreover, the resampled trajectories are no longer independent, so resampling has the additional effect of in
creasing the Monte Carlo variance of an estimator at the current time step (Chopin, 2004). However, it can reduce the
variance of estimators at later times. So, if one is interested in estimating the integral
_
h(x
t+1
)p
t+1
(x
t+1
)dx
t+1
,
for some measurable function h, then the estimator
n
i=1
w
(i)
t+1
h(x
(i)
t+1
) must be computed before resampling (as it
will have lower Monte Carlo error than if it were computed after the resampling step). So far we discussed resam
pling via multinomial sampling; other resampling schemes exist that introduce lower Monte Carlo variance, and no
additional bias, such as stratiﬁed sampling (Carpenter et al., 1999), and residual sampling (Liu and Chen, 1998).
Chopin (2004) shows that residual sampling always outperforms multinomial sampling: the resulting estimator
using the former sampling scheme has lower asymptotic variance.
8.3.1 Effective sample size
Resampling at every time step introduces unnecesary variation, so a tradeoff is required between reducing the
Monte Carlo variance in the future, and increasing the variance at the recent time step. Following Kong et al.
(1994), we deﬁne the effective sample size (ESS), a measure of the efﬁciency of estimation based on a given SIS
sample, compared to estimation based on a sample of i.i.d. draws from the target distribution p
t+1
(x
t+1
).
Let h(x
t+1
) be a measurable function, and suppose that we’re interested in estimating the mean µ =
E
p
t+1
(x
t+1
)
(h(X
t+1
)). Let
_
x
(i)
t+1
, w
(i)
t+1
_
n
i=1
be a weighted sample approximating p
t+1
(x
t+1
), obtained by SIS,
and let
_
y
(i)
t+1
_
n
i=1
be a sample of i.i.d. draws from p
t+1
(x
t+1
).
The SIS estimator of µ is
ˆ µ
IS
=
n
i=1
w
(i)
t+1
h(x
(i)
t+1
)
n
j=1
w
(j)
t+1
,
and the Monte Carlo estimator is
ˆ µ
MC
=
1
n
n
i=1
h(y
(i)
t+1
).
Then the relative efﬁciency of SIS in estimating µ can be measured by the ratio
Var
_
ˆ µ
IS
_
Var (ˆ µ
MC
)
,
which, in general, cannot be computed exactly. Kong et al. (1994) propose the following approximation that has the
advantage of being independent of the function h:
Var
_
ˆ µ
IS
_
Var (ˆ µ
MC
)
≈ 1 + Var
q
t+1
(x
t+1
)
( ¯ w
t+1
),
where q
t+1
(x
t+1
) is the instrumental distribution in SIS, and ¯ w
t+1
is the normalised version of w
t+1
, i.e.,
_
¯ w
t+1
q
t+1
(x
t+1
)dx
t+1
= 1. In general, Var
q
t+1
(x
t+1
)
( ¯ w
t+1
) is impossible to obtain, but can be approximated
by the sample variance of
_
¯ w
(i)
t+1
_
n
i=1
, where ¯ w
(i)
t+1
= w
(i)
t+1
/
n
j=1
w
(j)
t+1
are the normalised weights.
In practice, the ESS is deﬁned as follows:
ESS =
n
1 + Var
q
t+1
(x
t+1
)
( ¯ w
t+1
)
=
n
E
q
t+1
(x
t+1
)
( ¯ w
t+1
)
2
≈
n
n
n
i=1
_
¯ w
(i)
t+1
_
2
=
_
n
j=1
w
(j)
t+1
_
2
n
i=1
_
w
(i)
t+1
_
2
,
since the weights are normalised to sum to 1, and
E
q
t+1
(x
t+1
)
( ¯ w
t+1
)
2
= E
q
t+1
(x
t+1
)
_
w
t+1
C
_
2
=
1
C
2
E
q
t+1
(x
t+1
)
(w
t+1
)
2
≈
n
−1
n
i=1
_
w
(i)
t+1
_
2
n
−2
_
n
j=1
w
(j)
t+1
_
2
,
where C is the normalising constant. ESS is interpreted as the number of i.i.d. samples from the target distribution
p
t+1
(x
t+1
) that would be required to obtain an estimator with the same variance as the SIS estimator. Since ESS ≤
8.3 Sequential Importance Sampling with Resampling 87
n (Kong et al., 1994), then an ESS value close to n indicates that the SIS sample of size n is approximately as “good”
as an i.i.d. sample of size n from p
t+1
(x
t+1
). In practice, a ﬁxed threshold is chosen (in general, half of the sample
size n), and if the ESS falls below that threshold, then a resampling step is performed.
A word of caution is required at this point. Via the ESS approach, we’re using the importance sampling weights
to evaluate how well the weighted sample
_
x
(i)
t+1
, w
(i)
t+1
_
n
i=1
approximates the target distribution p
t+1
(x
t+1
). It is
possible that the instrumental distribution matches poorly the target distribution, but the weights are similar in value,
and thus have small variance. The ESS would then be large, thus incorrectly indicating a good match between the
instrumental and target distributions. This, for example, could happen if the target distribution places most of its
mass on a small region where the instrumental distribution is ﬂat, and the target distribution is ﬂat in the region of
the mode of the instrumental distribution. Hence, we expect sampling from this instrumental distribution to result in
weights that are similar in value. With small probability, a draw would fall under the mode of the target distribution,
resulting in a strikingly different weight value. This example highlights the importance of sampling a large enough
number of particles such that all modes of the target distribution are explored via sampling from the instrumental
distribution. So choosing an appropriate sample size n depends on knowing the shape of the target distribution,
which, of course, is not known; in practice, we use as many particles as computationally feasible.
8.3.2 SISR for statespace models
Figure 3 presents the SISR algorithm with multinomial sampling for statespace models.
Algorithm 3 The SISR algorithm for statespace models
1: Set t = 1.
2: For i = 1 : n, sample x
(i)
1
∼ q
1
(x
1
).
3: For i = 1 : n, set w
(i)
1
∝ p
1
(x
(i)
1
)g(y
1
x
(i)
1
)/q
1
(x
(i)
1
). Normalise such that
n
P
j=1
w
(j)
1
= 1.
4: At time t + 1, do:
5: Resample step: compute ESS = 1/
P
n
j=1
(w
(j)
t
)
2
.
6: If ESS < threshold, then resample: for i = 1 : n, set ˜ x
(i)
1:t
= x
(j)
1:t
with probability w
(j)
t
, j = 1, . . . , n. Finally, for
i = 1 : n, set x
(i)
1:t
= ˜ x
(i)
1:t
and w
(i)
t
= 1/n.
7: For i = 1 : n, sample x
(i)
t+1
∼ q
t+1
(x
t+1
x
(i)
1:t
).
8: For i = 1 : n, set w
(i)
t+1
∝ w
(i)
t
f(x
(i)
t+1
x
(i)
t
)g(y
t+1
x
(i)
t+1
)/q
t+1
(x
(i)
t+1
x
(i)
t
). Normalise such that
n
P
i=1
w
(i)
t+1
= 1.
9: The ﬁltering and smoothing densities at time t + 1 may be approximated by
ˆ p(x
t+1
y
1:t+1
) =
n
X
i=1
w
(i)
t+1
δ
x
(i)
t+1
(x
t+1
), ˆ p(x
1:t+1
y
1:t+1
) =
n
X
i=1
w
(i)
t+1
δ
x
(i)
1:t+1
(x
1:t+1
).
10: Set t = t + 1. Go to step 4.
For the SISR algorithm for statespace models, Crisan and Doucet (2002) prove that the empirical distributions
converge to their true values almost surely as n → ∞, under weak regularity conditions. Furthemore, they show
convergence of the mean square error for bounded, measurable functions, provided that the weights are upper
bounded; moreover, the rate of convergence is proportional to 1/n. However, only under restrictive assumptions
can they prove that approximation errors do not accumulate over time, so careful implementation and interpretation
is required when dealing with SMC methods. More generally, Chopin (2004) proves a Central Limit Theorem result
for the SISR algorithm under both multinomial sampling, and residual sampling, not restricted to applications to
statespace models.
Example 8.2 (A nonlinear time series model (Capp´ e et al., 2007)). Consider the following nonlinear time series model:
88 8. Sequential Monte Carlo
y
t
=
x
2
t
20
+v
t
∼ g(y
t
x
t
)
x
t
=
x
t−1
2
+ 25
x
t−1
1 +x
2
t−1
+ 8 cos(1.2t) +u
t
∼ f(x
t
x
t−1
),
where v
t
∼ N(0, σ
2
v
), u
t
∼ N(0, σ
2
u
), and parameters σ
2
v
= 1, σ
2
u
= 10. Let x
1
∼ p(x
1
) = N(0, 10). The densities
are
f(x
t
x
t−1
) = N
_
x
t−1
2
+ 25
x
t−1
1 +x
2
t−1
+ 8 cos(1.2t), 10
_
g(y
t
x
t
) = N
_
x
2
t
20
, 1
_
.
⊳
Figure 8.2 shows 100 states x
t
and corresponding observations y
t
generated from this model.
Using these 100 observations, we begin by running the SISR algorithm until t = 9, with n = 10000 par
ticles and resampling whenever ESS < 0.6 × n. The instrumental distribution is the state transition distribu
tion: q
t+1
(x
t+1
x
(i)
1:t
) = f(x
t+1
x
(i)
t
). Figure 8.3 shows the weighted samples
_
x
(i)
9
, w
(i)
9
_
n
i=1
as small dots (with
weights unnormalised), and the kernel density estimate of the ﬁltering distribution as a continuous line. Kernel
density estimation is a nonparameteric approach to estimating the density of a random variable from a (possibly
weighted) sample of values. For details, see Silverman (1986). We use a Gaussian kernel with ﬁxed width of 0.5.
The kernel density estimator takes into account both the value of the weights, and the local density of the particles.
0 20 40 60 80 100
−
2
0
−
1
0
0
1
0
2
0
time(t)
x
_
t
0 20 40 60 80 100
0
5
1
0
1
5
2
0
2
5
time(t)
y
_
t
Figure 8.2. Plot of 100 observations y
t
and hidden states x
t
generated from the above statespace model.
To analyse the effect of resampling, we run the SISR algorithm up to time t = 100 with n = 10000 and
resampling whenever ESS < 0.6 ×n, and the SIS algorithm with n = 10000. Figures 8.4 and 8.5 show the image
intensity plots of the kernel density estimates based on the ﬁlter outputs, with the true state sequence overlaid. In
general, the true state value falls in the high density regions of the density estimate in Figure 8.4, indicating good
performance of the SISR algorithm. Moreover, it is interesting to notice that there is clear evidence of multimodality
and nonGaussianity. In Figure 8.5, however, we remark that the particle distributions are highly degenerate, and do
not track the correct state sequence. Hence, resampling is required for good performance of the SIS algorithm.
8.4 Sequential Importance Sampling with Resampling and MCMC moves 89
−30 −20 −10 0 10 20
0
.
0
0
0
.
0
5
0
.
1
0
0
.
1
5
0
.
2
0
0
.
2
5
0
.
3
0
Filtering density estimate at t=9 (n = 10000)
particles x_9
D
e
n
s
i
t
y
Figure 8.3. Filtering density estimate at t = 9 from SISR algorithm with n = 10000, and ESS threshold = 6000. Weighted
samples
n
x
(i)
9
, w
(i)
9
o
n
i=1
shown as small dots, and kernel density estimate as continuous line.
To make this last point more clear, we look at histograms of the base 10 logarithm of the normalised weights at
various time steps in the SIS algorithm. Figure 8.6 shows that the weights quickly degenerate as t increases; by t =
5, we already observe weights on the order of 10
−300
. Recall that these weights are a measure of the adequacy of the
simulated trajectory, drawn from an instrumental distribution, to the target distribution. In this particular example,
the instrumental distribution is the state transition distribution, which is highly variable (σ
2
u
= 10) compared to the
observation distribution (σ
2
v
= 1). Hence, draws from the state transition distribution are given low weights under
the observation distribution, and, with no resampling to eliminate the particles with very low weights, there is a
quickly growing accumulation of very low weights.
8.4 Sequential Importance Sampling with Resampling and MCMC moves
Gilks and Berzuini (2001) introduce the idea of using MCMC moves to reduce sample impoverishment, and call
their proposed algorithmresamplemove. The algorithmperforms sequential importance resampling with an MCMC
move after the resampling step that rejuvenates the particles. Let
_
x
(i)
1:t+1
, w
(i)
t+1
_
n
i=1
be a weighted set of particles
that targets the distribution p
t+1
(x
1:t+1
). Let q
t+1
(x
1:t+1
) denote the instrumental distribution from which the par
ticles are generated. During resampling, some particles will be replicated (possibly many times), to produce the set
_
˜ x
(i)
1:t+1
, 1/n
_
n
i=1
. Let K
t+1
be a p
t+1
(x
1:t+1
)invariant Markov kernel, i.e., p
t+1
K
t+1
= p
t+1
. The MCMC move
is as follows: for i = 1, . . . , n, draw z
(i)
1:t+1
∼ K
t+1
(˜ x
(i)
1:t+1
, ·). Then
_
z
(i)
1:t+1
, 1/n
_
n
i=1
is a rejuvenated, weighted
set of particles that targets the distribution p
t+1
(z
1:t+1
). If
_
˜ x
(i)
1:t+1
, 1/n
_
n
i=1
is a good particle representation of
p
t+1
(x
1:t+1
), then, provided that the kernel K
t+1
is fast mixing, each ˜ x
(i)
1:t+1
will tend to move to a distinct point
in a high density region of the target distribution, thus improving the particle representation. In the words of Gilks
and Berzuini (2001), the MCMC step helps the particles track the moving target.
Just as in Section 8.1.1 we interpreted rejection sampling as importance sampling on an enlarged space, so can
importance sampling with an MCMC move be interpreted as importance sampling on an enlarged space with in
strumental distribution q
t+1
(x
1:t+1
)K
t+1
(x
1:t+1
, z
1:t+1
) and target distribution p
t+1
(x
1:t+1
)K
t+1
(x
1:t+1
, z
1:t+1
),
90 8. Sequential Monte Carlo
20 40 60 80 100
−
4
0
−
2
0
0
2
0
4
0
Kernel density estimates up to time t=100 (n = 10000)
t
x
_
t
Figure 8.4. Image intensity plot of the kernel density estimates up to t = 100, with diamond symbol indicating the true state
sequence (SISR algorithm).
20 40 60 80 100
−
4
0
−
2
0
0
2
0
4
0
Kernel density estimates up to time t=100, no resampling (n = 10000)
t
x
_
t
Figure 8.5. Image intensity plot of the kernel density estimates up to t = 100, with diamond symbol indicating the true state
sequence (SIS algorithm).
8.4 Sequential Importance Sampling with Resampling and MCMC moves 91
t=1
F
r
e
q
u
e
n
c
y
−14 −12 −10 −8 −6 −4
0
1
0
0
0
3
0
0
0
5
0
0
0
t=5
F
r
e
q
u
e
n
c
y
−350 −300 −250 −200 −150 −100 −50 0
0
5
0
0
1
0
0
0
t=10
F
r
e
q
u
e
n
c
y
−350 −300 −250 −200 −150 −100 −50 0
0
2
0
0
6
0
0
1
0
0
0
t=50
F
r
e
q
u
e
n
c
y
−350 −300 −250 −200 −150 −100 −50 0
0
5
0
1
0
0
2
0
0
t=75
F
r
e
q
u
e
n
c
y
−350 −300 −250 −200 −150 −100 −50 0
0
5
0
1
0
0
1
5
0
t=100
F
r
e
q
u
e
n
c
y
−350 −300 −250 −200 −150 −100 −50 0
0
5
1
0
1
5
2
0
2
5
Figure 8.6. Histogram of the base 10 logarithm of the normalised weights for the ﬁltering distributions (SIS algorithm).
where
_
p
t+1
(x
1:t+1
)K
t+1
(x
1:t+1
, z
1:t+1
)dx
1:t+1
= p
t+1
(z
1:t+1
) by invariance of K
t+1
. This explanation omits
the resampling step, but the latter does not alter the justiﬁcation of the algorithm. Gilks and Berzuini (2001) present
a Central Limit Theorem result in the number of particles, and an explicit formulation for the asymptotic variance as
the number of particles tends to inﬁnity, for t ﬁxed. They argue that, in the extreme case, rejuvenation via a perfectly
mixing kernel at step t, i.e., K
t
(x
1:t
, z
1:t
) = p
t
(z
1:t
x
1:t
), can reduce the asymptotic variance of estimators at later
time steps. This is similar to the idea from Section 8.3 that resampling, although it introduces extra Monte Carlo
variation at the current time step, can reduce the variance at later times.
Chopin (2004) states that MCMC moves may lead to more stable algorithms for the ﬁltering problem in terms
of the asymptotic variance of estimators (although theoretical results to support this are lacking); however, he is
not as hopeful regarding the smoothing problem. He suggests investigating the degeneracy of a given particle ﬁlter
algorithm by running m, say m = 10, independent particle ﬁlters in parallel, computing the estimates from each
output, and monitoring the empirical variance of these m estimates as t increases.
8.4.1 SISR with MCMC moves for statespace models
Algorithm 4 is the SISR algorithm with MCMC moves for statespace models. First, notice that except for the re
quirement of invariance, there are no constraints on the choice of kernel; it can be a Gibbs sampling, or a Metropolis
Hastings kernel. Second, only one MCMC step will sufﬁce (i.e., no burnin period is required), since it is assumed
that the particle set at step t has “converged”, i.e., it is a good representation of p(x
1:t
y
1:t
). Third, notice that the
MCMC move is applied to the entire state trajectory up to time t. Hence, the dimension of the kernel increases with
t, thus increasing the computational cost of performing the move. Also, as t increases, it is increasingly difﬁcult
to construct a fastmixing Markov kernel of dimension t. In practice, the MCMC move is applied only to the last
component x
t+1
with kernel K
t+1
that is invariant with respect to p(x
t+1
y
1:t+1
).
92 8. Sequential Monte Carlo
Algorithm 4 The SISR algorithm with MCMC moves for statespace models
1: Set t = 1.
2: For i = 1 : n, sample x
(i)
1
∼ q
1
(x
1
).
3: For i = 1 : n, set w
(i)
1
∝ p
1
(x
(i)
1
)g(y
1
x
(i)
1
)/q
1
(x
(i)
1
). Normalise such that
n
P
j=1
w
(j)
1
= 1.
4: At time t + 1, do:
5: Resample step: compute ESS = 1/
P
n
j=1
(w
(j)
t
)
2
.
6: If ESS < threshold, then resample: for i = 1 : n, set ˜ x
(i)
1:t
= x
(j)
1:t
with probability w
(j)
t
, j = 1, . . . , n. Finally, for
i = 1 : n, set x
(i)
1:t
= ˜ x
(i)
1:t
and w
(i)
t
= 1/n.
7: MCMC step: for i = 1 : n, sample z
(i)
1:t
∼ K
t
(x
(i)
1:t
, ·). Then, for i = 1 : n, set x
(i)
1:t
= z
(i)
1:t
.
8: For i = 1 : n, sample x
(i)
t+1
∼ q
t+1
(x
t+1
x
(i)
1:t
).
9: For i = 1 : n, set w
(i)
t+1
∝ w
(i)
t
f(x
(i)
t+1
x
(i)
t
)g(y
t+1
x
(i)
t+1
)/q
t+1
(x
(i)
t+1
x
(i)
t
). Normalise such that
n
P
i=1
w
(i)
t+1
= 1.
10: The ﬁltering and smoothing densities at time t + 1 may be approximated by
ˆ p(x
t+1
y
1:t+1
) =
n
X
i=1
w
(i)
t+1
δ
x
(i)
t+1
(x
t+1
), ˆ p(x
1:t+1
y
1:t+1
) =
n
X
i=1
w
(i)
t+1
δ
x
(i)
1:t+1
(x
1:t+1
).
11: Set t = t + 1. Go to step 4.
8.5 Smoothing density estimation
As we have seen, the SISR algorithm is prone to suffer from sample impoverishment as t grows; hence the parti
cle trajectories do not offer reliable approximations to the smoothing density. This section presents a Monte Carlo
smoothing algorithm for statespace models that is carried out in a forwardﬁltering, backwardsmoothing proce
dure, i.e., a ﬁltering procedure such as SISR is applied forward in time, followed by a smoothing procedure applied
backward in t (Godsill et al., 2004).
Godsill et al. (2004) consider the joint smoothing density
p(x
1:T
y
1:T
) = p(x
T
y
1:T
)
T−1
t=1
p(x
t
x
t+1:T
, y
1:T
) = p(x
T
y
1:T
)
T−1
t=1
p(x
t
x
t+1
, y
1:t
)
∝ p(x
T
y
1:T
)
T−1
t=1
p(x
t
y
1:t
)f(x
t+1
x
t
).
Let
_
x
(i)
t
, w
(i)
t
_
n
i=1
be a particle representation to the ﬁltering density p(x
t
y
1:t
), that is, ˆ p(x
t
y
1:t
) =
n
i=1
w
(i)
t
δ
x
(i)
t
(x
t
). Then, it is possible to approximate p(x
t
x
t+1
, y
1:t
) by
ˆ p(x
t
x
t+1
, y
1:t
) =
n
i=1
w
(i)
t
f(x
t+1
x
(i)
t
)δ
x
(i)
t
(x
t
)
n
j=1
w
(j)
t
f(x
t+1
x
(j)
t
)
. (8.4)
So the idea is to run the particle ﬁlter (e.g., the SISR algorithm) forward in time, to obtain particle approximations
to p(x
t
y
1:t
), for t = 1, . . . , T, and then to apply the backward smoothing recursion (8.4) for t = T − 1 to t = 1.
The draws (˜ x
1
, . . . , ˜ x
T
) form an approximate realization from p(x
1:T
y
1:T
).
Algorithm 5 returns one realization fromp(x
1:T
y
1:T
) via this forwardﬁltering, backwardsmoothing approach.
The computational complexity is O(nT), for ﬁltering density approximations with n particles. So for n realizations,
the computational complexity is O(n
2
T), i.e., it is quadratic in n, compared to the computational complexity of
SISR, for n realizations, which is linear in n. This smoothing algorithm has the advantage that it uses particle
approximations to the ﬁltering densities (under the assumption that good approximations can be obtained via SISR,
for example), as opposed to resampled particle trajectories x
1:t
which suffer from sample impoverishment as t
increases.
Example 8.3 (A nonlinear time series model (continued)). We continue the example of the nonlinear time series
model. We carry out smoothing via Algorithm 5, implementing the SISR algorithm as before, with n = 10000
8.5 Smoothing density estimation 93
Algorithm 5 The forwardﬁltering, backwardsmoothing algorithm for statespace models
1: Run a particle ﬁlter algorithm to obtain weighted particle approximations
n
x
(i)
t
, w
(i)
t
o
n
i=1
to the ﬁltering distributions
p(x
t
y
1:t
) for t = 1, . . . , T.
2: Set t = T.
3: Choose ˜ x
T
= x
(i)
T
with probability w
(i)
T
.
4: Set t = t −1.
5: At t ≥ 1
6: For i = 1 : n, compute w
(i)
tt+1
∝ w
(i)
t
f(˜ x
t+1
x
(i)
t
). Normalise the weights.
7: Choose ˜ x
t
= x
(i)
t
with probability w
(i)
tt+1
.
8: Go to step 3.
9: (˜ x
1
, . . . , ˜ x
T
) is an approximate realization from p(x
1:T
y
1:T
).
particles. Figure 8.7 displays the 10000 smoothing trajectories drawn from p(x
1:100
y
1:100
) with the true state se
quence overlaid. Multimodality in the smoothing distributions is shown in Figure 8.8.
0 20 40 60 80 100
−
2
0
−
1
0
0
1
0
2
0
t
p
a
r
t
i
c
l
e
t
r
a
j
e
c
t
o
r
y
x
_
t
Figure 8.7. 10000 smoothing trajectories drawn from p(x
1:100
y
1:100
) with dimond symbol indicating the true sequence.
Finally, since the algorithm returns entire smoothing trajectories, as opposed to simply returning smoothing
marginals, it is possible to visualize characteristics of the multivariate smoothing distribution. Figure 8.9 shows the
kernel density estimate for p(x
11:12
y
1:100
).
⊳
94 8. Sequential Monte Carlo
20 40 60 80 100
−
2
0
−
1
0
0
1
0
2
0
Kernel density estimates up to time t=100 (n = 10000)
t
x
_
t
Figure 8.8. Image intensity plot of the kernel density estimates of smoothing densities with dimond symbol indicating the true
state sequence.
Figure 8.9. Kernel density estimate for p(x
11:12
y
1:100
).
Bibliography
Anderson, T. W. (2003) An introduction to multivariate statistical analysis. Wiley series in probability and statistics.
WileyInterscience.
Athreya, K. B., Doss, H. and Sethuraman, J. (1992) A proof of convergence of the markov chain simulation method.
Tech. Rep. 868, Dept. of Statistics, Florida State Univeristy.
Badger, L. (1994) Lazzarini’s lucky approximation of π. Mathematics Magazine, 67, 83–91.
Brockwell, P. J. and Davis, R. A. (1991) Time series: theory and methods. New York: Springer, 2 edn.
Brooks, S. and Gelman, A. (1998) General methods for monitoring convergence of iterative simulations. Journal
of Computational and Graphical Statistics, 7, 434–455.
Buffon, G. (1733) Editor’s note concerning a lecture given 1733 to the Royal Academy of Sciences in Paris. Histoire
de l’Acad´ emie Royale des Sciences, 43–45.
— (1777) Essai d’arithm´ etique morale. Histoire naturelle, g´ en´ erale et particuli` ere, Suppl´ ement 4, 46–123.
Capp´ e, O., Godsill, S. J. and Moulines, E. (2007) An overview of existing methods and recent advances in sequential
Monte Carlo. In IEEE Proceedings, vol. 95, 1–21.
Capp´ e, O., Moulines, E. and Ryd´ en, T. (2005) Inference in Hidden Markov Models. Springer Series in Statistics.
Springer.
Carpenter, J., Clifford, P. and Fearnhead, P. (1999) An improved particle ﬁlter for nonlinear problems. IEE Pro
ceedings, Radar, Sonar and Navigation, 146, 2–7.
Chopin, N. (2004) Central limit theorem for sequential Monte Carlo methods and its applications to Bayesian
inference. Annals of Statistics, 32, 2385–2411.
Crisan, D. and Doucet, A. (2002) A survey of convergence results on particle ﬁltering methods for practitioners.
IEEE Transactions on Signal Processing, 50, 736–746.
Doucet, A., de Freitas, N. and Gordon, N. (eds.) (2001) Sequential Monte Carlo Methods in Practice. Statistics for
Engineering and Information Science. New York: Springer.
Doucet, A., Godsill, S. and Andrieu, C. (2000) On sequential simulationbased methods for Bayesian ﬁltering.
Statistics and Computing, 10, 197–208.
Fahrmeir, L. and Tutz, G. (2001) Multivariate Statistical Modelling Based on Generalised Linear Models. New
York: Springer, 2 edn.
Gelfand, A. E. and Smith, A. F. M. (1990) Samplingbased approaches to calculating marginal densities. Journal
of the American Statistical Association, 85, 398–409.
Gelman, A., Gilks, W. R. and Roberts, G. O. (1997) Weak convergence and optimal scaling of random walk
Metropolis algorithms. Annals of. Applied Probability, 7, 110–120.
Gelman, A., Roberts, G. O. and Gilks, W. R. (1995) Efﬁcient Metropolis jumping rules. In Bayesian Statistics (eds.
J. M. Bernado, J. Berger, A. Dawid and A. Smith), vol. 5. Oxford: Oxford University Press.
Gelman, A. and Rubin, B. D. (1992) Inference from iterative simulation using multiple sequences. Statistical
Science, 7, 457–472.
96 BIBLIOGRAPHY
Geman, S. and Geman, D. (1984) Stochastic relaxation, Gibbs distribution, and the Bayesian restoration of images.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721–741.
Gihman, I. and Skohorod, A. (1974) The Theory of Stochastic Processes, vol. 1. Springer.
Gilks, W. R. and Berzuini, C. (2001) Following a moving target – Monte Carlo inference for dynamic Bayesian
models. Journal of the Royal Statistical Society B, 63, 127–146.
Godsill, S. J., Doucet, A. and West, M. (2004) Monte Carlo smoothing for nonlinear time series. Journal of the
American Statistical Association, 99, 156–168.
Gordon, N. J., Salmond, S. J. and Smith, A. F. M. (1993) Novel approach to nonlinear/nonGaussian Bayesian state
estimation. IEE ProceedingsF, 140, 107–113.
GuihennecJouyaux, C., Mengersen, K. L. and Robert, C. P. (1998) Mcmc convergence diagnostics: A ”reviewww”.
Tech. Rep. 9816, Institut National de la Statistique et des Etudes Economiques.
Halton, J. H. (1970) A retrospective and prospective survey of the Monte Carlo method. SIAM Review, 12, 1–63.
Hastings, W. K. (1970) Monte Carlo sampling methods using Markov chains and their applications. Biometrika,
57, 97–109.
Kalman, R. E. (1960) A new approach to linear ﬁltering and prediction problems. Transactions of the ASME 
Journal of Basic Engineering, 82 (Series D), 35–45.
Knuth, D. (1997) The Art of Computer Programming, vol. 1. Reading, MA: AddisonWesley Professional.
Kong, A., Liu, J. S. and Wong, W. H. (1994) Sequential imputations and Bayesian missing data problems. Journal
of the American Statistical Association, 89, 278–288.
Laplace, P. S. (1812) Th´ eorie Analytique des Probabilit´ es. Paris: Courcier.
Lazzarini, M. (1901) Un’ applicazione del calcolo della probabilit` a alla ricerca esperimentale di un valore approssi
mato di π. Periodico di Matematica, 4.
Liu, J. S. (2001) Monte Carlo Strategies in Scientiﬁc Computing. New York: Springer.
Liu, J. S. and Chen, R. (1998) Sequential Monte Carlo methods for dynamic systems. Journal of the American
Statistical Association, 93, 1032–1044.
Liu, J. S., Wong, W. H. and Kong, A. (1995) Covariance structure and convergence rate of the Gibbs sampler with
various scans. Journal of the Royal Statistical Society B, 57, 157–169.
Marsaglia, G. (1968) Random numbers fall mainly in the planes. Proceedings of the National Academy of Sciences
of the United States of America, 61, 25–28.
Marsaglia, G. and Zaman, A. (1991) A new class of random number generators. The Annals of Applied Probability,
1, 462–480.
Matsumoto, M. and Nishimura, T. (1998) Mersenne twister: a 623dimensionally equidistributed uniform pseudo
random number generator. ACM Trans. Model. Comput. Simul., 8, 3–30.
Meinhold, R. J. and Singpurwalla, N. D. (1983) Understanding the Kalman ﬁlter. The American Statistician, 37,
123–127.
— (1989) Robustiﬁcation of Kalman Filter Methods. Journal of the American Statistical Association, 84, 479–486.
Metropolis, N. (1987) The beginning of the Monte Carlo method. Los Alamos Science, 15, 122–143.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. B., Teller, A. H. and Teller, E. (1953) Equation of state calcula
tions by fast computing machines. The Journal of Chemical Physics, 21, 1087–1092.
Metropolis, N. and Ulam, S. (1949) The Monte Carlo method. Journal of the American Statistical Association, 44,
335–341.
Meyn, S. P. and Tweedie, R. L. (1993) Markov Chains and Stochastic Stability. Springer, New York, Inc. URL
http://probability.ca/MT/.
Norris, J. (1997) Markov Chains. Cambridge Univeristy Press.
BIBLIOGRAPHY 97
Nummelin, E. (1984) General Irreducible Markov Chains and NonNegative Operators. No. 83 in Cambridge
Tracts in Mathematics. Cambridge University Press, 1st paperback edn.
Philippe, A. and Robert, C. P. (2001) Riemann sums for mcmc estimation and convergence monitoring. Statistics
and Computing, 11, 103–115.
Rabiner, L. R. (1989) A tutorial on hidden Markov models and selected applications in speech recognition. In IEEE
Proceedings, vol. 77, 257–286.
Ripley, B. D. (1987) Stochastic simulation. New York: Wiley.
Robert, C. P. and Casella, G. (2004) Monte Carlo Statistical Methods. Secaucus, NJ, USA: Springer, New York,
Inc., 2 edn.
Roberts, G. and Tweedie, R. (1996) Geometric convergence and central limit theorems for multivariate Hastings
and Metropolis algorithms. Biometrika, 83, 95–110.
Roberts, G. O. and Rosenthal, J. S. (2004) General state space Markov chains and MCMC algorithms. Probability
Surveys, 1, 20–71.
Silverman, B. W. (1986) Density estimation for statistics and data analysis. Monographs on statistics and applied
probability, 26. Chapman & Hall/CRC.
Tierney, L. (1994) Markov Chains for exploring posterior distributions. The Annals of Statistics, 22, 1701–1762.
Ulam, S. (1983) Adventures of a Mathematician. New York: Charles Scribner’s Sons.
Wan, E. A. and van der Merwe, R. (2000) The unscented Kalman ﬁlter for nonlinear estimation. In Proceedings of
Symposium, 153–158. Citeseer.
These notes are based on notes written originally by Adam M. Johansen and Ludger Evers for a lecture course on Monte Carlo Methods given at the University of Bristol in 2007.
Commons Deed
You are free: to copy, distribute and display the work. to make derivative works. Under the following conditions:
Attribution. You must give the original authors credit. Noncommercial. You may not use this work for commercial purposes. Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting
work only under a license identical to this one. For any reuse or distribution, you must make clear to others the license terms of this work. Any of these conditions can be waived if you get permission from the copyright holders. Nothing in this license impairs or restricts the authors’ moral rights. Your fair dealing and other rights are in no way affected by the above. This is a humanreadable summary of the Legal Code (the full license). It can be read online at the following address: http://creativecommons.org/licenses/byncsa/2.5/za/legalcode Note: The Commons Deed is not a license. It is simply a handy reference for understanding the Legal Code (the full license) — it is a humanreadable expression of some of its key terms. Think of it as the userfriendly interface to the Legal Code beneath. This Deed itself has no legal value, and its contents do not appear in the actual license.
A The LTEX source ﬁles of this document can be obtained from the authors.
Table of Contents
Table of Contents 1. Markov Chains
3 5
1.1 1.2 1.3 1.4
Stochastic processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Discrete Markov chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
5 7
General state space Markov chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 Ergodic theorems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
23
2. An Introduction to Monte Carlo Methods
2.1 2.2 2.3 2.4
What are Monte Carlo Methods? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 Introductory examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 A Brief History of Monte Carlo Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 Pseudorandom numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
31
3. Fundamental Concepts: Transformation, Rejection, and Reweighting
3.1 3.2 3.3
Transformation methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 Rejection sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 Importance sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
43
4. The Gibbs Sampler
4.1 4.2 4.3 4.4
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 The HammersleyClifford Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 Convergence of the Gibbs sampler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
53
5. The MetropolisHastings Algorithm
5.1 5.2 5.3 5.4
Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 Convergence results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 The random walk Metropolis algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 Choosing the proposal distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
63
6. Diagnosing convergence
6.1 6.2
Practical considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 Tools for monitoring convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
73
7. Statespace models and the Kalman ﬁlter algorithm
7.1 7.2
Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 Statespace models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
. . . . . . . . . . . . . . . . . . . . . . . . .1 8. . . . . . . . . . . . . . 76 79 8. 81 Sequential Importance Sampling with Resampling . . . . . . . . Sequential Monte Carlo 8. . . . . . . . . . . . . . . . . . . . . . . . . 79 Sequential Importance Sampling . . . . . 92 95 Bibliography . . . .3 The Kalman ﬁlter algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2 8. . . . . .3 8. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5 Importance Sampling revisited . . . . . . . . . . . . . . . . . . . . .4 TABLE OF CONTENTS 7. 84 Sequential Importance Sampling with Resampling and MCMC moves . . . . . . . . . . . 89 Smoothing density estimation .4 8. . . . . . . .
∞) or T = R (processes in continuous time) or T = R × R (spatial process). which would constitute a stochastic process in continuous time.1 (Stochastic process). a Russian mathematician. whereas the process in panel (c) has the real numbers as its state space (“continuous state space”). . 1. Deﬁnition 1.1 Stochastic processes Deﬁnition 1. This special case will turn out to be much simpler than the case of a general state space. Other choices An example of a stochastic process in discrete time would be the sequence of temperatures recorded every morning at Braemar in the Scottish Highlands. i. During the day we can trace the share price continuously. In order to be able to study Markov chains. Figure 1. We can distinguish between processes not only based on their index set T .1 shows sample paths both of a stochastic process in discrete time (panel (a)). if T = R (continuous time) the sample path is a function from R to S.e. We shall then call X a discrete process. we assume that T ⊂ N or T ⊂ Z. For a given realisation ω ∈ Ω the collection {Xt (ω) : t ∈ T } is called the sample path of X at ω. and of two stochastic processes in continuous time (panels (b) and (c)). An important special case arises if the state space S is a countable set. The process in panel (b) has a discrete state space. which gives the “range” of possible values the process can take. a special kind of random process which is said to have “no memory”: the evolution of the process in the future depends only on the present state and not on where it has been in the past. Another example would be the price of a share recorded at the opening of the market every day. t ∈ T } of random variables would be T = [0. but also based on their state space S. Note that whilst certain stochastic processes have sample paths that are (almost surely) continuous or differentiable.Chapter 1 Markov Chains This chapter introduces Markov chains1 .2 (Sample Path). We will soon focus on stochastic processes in discrete time. 1 named after the Andrey Andreyevich Markov (1856–1922). T is hereby called the index set (“time”) and S is called the state space. If T = N0 (discrete time) the sample path is a sequence. we ﬁrst need to introduce the concept of a stochastic process. A stochastic process X is a family {Xt : Xt : Ω → S. The reasons for treating discrete processes separately are the same as for treating discrete random variables separately: we can assume without loss of generality that the state space are the natural numbers. this does not need to be the case.
..tk ) (x1 .f.. .. .tj−1 .. . Theorem 1.tj . . xj−1 .tk ) (x1 . This dependency structure can be expressed by the ﬁnitedimensional distributions of the process: P(Xt1 ∈ A1 .f..tk ) (x1 . +∞. . xj where p(. ..1) corresponds to p(t1 ..tk ) (x1 . (1. ) (·) are the joint probability mass functions (p. . .. .tj ... xj .. .1. but also by the dependency structure of the process. ) (·) are the joint probability density functions (p... xj−1 .. .. xj+1 .3 (Kolmogorov). . . . To keep things simple. . xk ) dxj = f(t1 . A stochastic process is not only characterised by the marginal distributions of Xt .. xj+1 . .tj+1 . Markov Chains Xt 5 4 Xt (ω2 ) 3 2 Xt (ω1 ) 1 t 1 2 3 4 5 (a) Two sample paths of a discrete process in discrete time.... and A1 . . . . xk ) = p(t1 .. . Without the consistency condition we could obtain different results when computing the same probability using different members of the family.. k ∈ N...tj−1 . ..tj−1 . xk ]). . Let F(t1 .tj−1 . . Xtk ∈ (−∞. Ak are measurable subsets of S. . ..tj−1 .. .1) This consistency condition says nothing else than that lowerdimensional members of the family have to be the marginal distributions of the higherdimensional members of the family.. xk ).. xj+1 .tk ) (x1 . xj−1 . xj ..1) corresponds to f(t1 .... . such that F(t1 . .. . . xj+1 . . . . Xtk ∈ Ak ) where t1 . Then there exists a probability space and a stochastic process X.... .tj−1 . This raises the question whether a stochastic process X is fully described by its ﬁnite dimensional distributions.tk ) (x1 .. For a discrete state space. . xk ) (1. . Examples of sample paths of stochastic processes. .tj+1 . However. In the case of S ⊂ R the ﬁnitedimensional distributions can be represented using their joint distribution functions F(t1 . xj−1 . . The proof of this theorem can be for example found in (Gihman and Skohorod. . xj+1 ..tk ) (x1 . . xj−1 . xk ) = P(Xt1 ∈ (−∞. in order to be able to formulate the theorem. . . .. we need to introduce the concept of a consistent family of ﬁnitedimensional distributions. . . 1974). .tk ) (x1 .. . we will formulate this condition using distributions functions.tj+1 . For a continuous state space. . k} F(t1 .. The answer to this is given by Kolmogorov’s existence theorem.tj+1 . xk ) = F(t1 . .d. . ...tk ) be a family of consistent ﬁnitedimensional distribution functions. ... .).tj+1 . We shall call a family of ﬁnite dimensional distribution functions consistent if for any collection of times t1 .tj . ... . .. . . xj−1 . .. . . . . xk ]). xk ) where f(. . 1 3 2 5 4 Yt 5 Yt (ω2 ) 4 Zt Zt (ω2 ) 3 Yt (ω1 ) 2 1 Zt (ω1 ) t 1 2 3 4 5 (b) Two sample paths of a discrete process in continuous time.). Xtk ∈ (−∞. . . x1 ].6 1. . xj+1 . . . for all j ∈ {1.. . tk . . Proof.tj+1 . . .. . t 1 2 3 4 5 (c) Two sample paths of a continuous process in continuous time. .. .. xk ) = P(Xt1 ∈ (−∞. .m. . . . . ... Figure 1.... tk ∈ T . . x1 ]..... .. (1.
at least as far as this course is concerned. If we record its state every minute we obtain a stochastic process {Xt : t ∈ N0 }. then the next state Xt+1 is independent of the past states X0 .2 Xt Past Present Future t−1 Figure 1. Xt−1 . 1. we will also assume that the state space is discrete.4 (Discrete Markov chain). 2. all that we are interested in. present. Past. thus we can relabel the states 1. . (homework) Example 1. Proof. Consider the simple example of a phone line.1. .g. This is however. . In what follows we will only consider the case of a stochastic process in discrete time i. If we know the current state Xt . Let X be a stochastic process in discrete time with countable (“discrete”) state space. . . . It can either be busy (we shall call this state 1) or free (which we shall call 0).1 (Phone line).org/ wiki/Markov_process. . . . The Markov property is equivalent to assuming that for all k ∈ N and all t1 < . Figure 1.5.2 Discrete Markov chains 1.e. .2. 3.1 Introduction In this section we will deﬁne Markov chains. See e. X is called a Markov chain (with discrete state space) if X satisﬁes the Markov property P(Xt+1 = xt+1 Xt = xt . Note that the stochastic process X is not necessarily uniquely determined by its ﬁnitedimensional distributions. . However. . 2 A similar concept (Markov processes) exists for processes in continuous time. Deﬁnition 1. . t t+1 Proposition 1. however we will focus on the special case that the state space S is (at most) countable. Thus we can assume without loss of generality that the state space S is the set of natural numbers N (or a subset of it): there exists a bijection that uniquely maps each element to S to a natural number. < tk ≤ t P(Xt+1 = xt+1 Xtk = xtk . the ﬁnite dimensional distributions uniquely determine all probabilities relating to events involving an at most countable collection of random variables. Initially. http://en. .wikipedia.2 Discrete Markov chains 7 Thus we can specify the distribution of a stochastic process by writing down its ﬁnitedimensional distributions.2.2 illustrates this idea.. and future of a Markov chain at t. . T = N0 (or Z). . . X0 = x0 ) = P(Xt+1 = xt+1 Xt = xt ) This deﬁnition formalises the idea of the process depending on the past only through the present. Xt1 = xt1 ) = P(Xt+1 = xt+1 Xtk = xtk ).
. ⊳ If we assume that {Xt : t ∈ N0 } is a Markov chain.2 (Random walk on Z). From the deﬁnition of conditional probabilities we can derive that . where α. .3 illustrates this idea. We can assume that the phone is more likely to be used during the day and more likely to be free during the night. X0 = x0 ) = Es ⊥Et P(Et = xt+1 − xt Et−1 = xt − xt−1 . X0 = x0 }: P(Xt+1 = xt+1 Xt = xt . . the probability of staying in the current state is 1 − α − β. Consider a socalled random walk on Z starting at X0 = 0. . . For a discrete Markov chain {Xt : t ∈ N0 } we have that t−1 The distribution of a Markov chain is fully speciﬁed by its initial distribution P(X0 = x0 ) and the transition P(Xt = xt . Proposition 1. Similarly the Markov assumption implies that the probability of a new phone call being made is independent of how long the phone has been out of use before. Suppose that independently of the number is α and that the probability of moving to the next larger number is β. Illustration (“Markov graph”) of the random walk on Z.6. . with the Et being independent and for all t P(Et = −1) = α It is easy to see that P(Xt+1 = xt − 1Xt = xt ) = α P(Xt+1 = xt Xt = xt ) = 1 − α − β P(Xt+1 = xt + 1Xt = xt ) = β P(Et = 0) = 1 − α − β P(Et = 1) = β. . Xt+1 = Xt + Et . β ≥ 0 with α + β ≤ 1. Xt−1 = xt−1 . Markov Chains is independent of how long the phone call has already lasted. Most importantly. . . we assume that probability of a new phone call being ended Example 1. as the following proposition shows. . At every time. X0 = xo ) P(Et = xt+1 − xt ) = P(Xt+1 = xt+1 Xt = xt ) ⊳ = Thus {Xt : t ∈ N0 } is a Markov chain. .8 1.3. . the probability of moving to the next smaller Figure 1. . Proof. E0 = x1 − x0 . these probabilities do not change when we condition additionally on the past {Xt−1 = xt−1 . . The Markov assumption is compatible with assuming that the usage pattern changes over time. X0 = x0 ) = P(X0 = x0 ) · τ =0 P(Xτ +1 = xτ +1 Xτ = xτ ). . To analyse this process in more detail we write Xt+1 as 1−α−β β ··· α −3 β α −2 1−α−β β α −1 1−α−β β α 0 1−α−β β α 1 1−α−β β α 2 1−α−β β α 1−α−β β 3 α ··· Figure 1. Xt−1 = xt−1 . probabilities P(Xt+1 = xt+1 Xt = xt ). current state. we can either stay in the state or move to the next smaller or next larger number.
The matrix K = (kij )ij with kij = P(Xt+1 = jXt = i) is called the transi tion kernel (or transition matrix) of the homogeneous Markov chain X. X0 = x0 ) =P(Xt =xt Xt−1 =xt−1 ) t−1 = τ =0 P(Xτ +1 = xτ +1 Xτ = xτ ). P(Xt+1 = 1Xt = 0) = 0. Deﬁnition 1. which is a Markov chains whose behaviour does not change over time. .4.6 to the ﬁrst equation of the proof (which holds for any sequence of random variables) illustrates how powerful the Markovian assumption is. .1 0.7 (Homogeneous Markov Chain). Suppose that in the example of the phone line the probability that someone makes a new call (if the phone is currently unused) is 10% and the probability that someone terminates an active phone call is 30%.9 P(Xt+1 = 0Xt = 1) = 0. .7 .1 P(Xt+1 = 1Xt = 1) = 0. Then P(Xt+1 = 0Xt = 0) = 0. Deﬁnition 1. which we might write as a vector λ0 = (P(X0 = i))(i∈S) . the transition kernel K fully speciﬁes the distribution of a homogeneous Markov chain.3 0. . A Markov chain {Xt : t ∈ N0 } is said to be homogeneous if P(Xt+1 = jXt = i) = pij for all i. We will see that together with the initial distribution. – Each row of the transition kernel sums to 1.3 (Phone line (continued)). The transition probabilities are often illustrated using a socalled Markov graph. If we denote the states by 0 (phone not in use) and 1 (phone in use). ⊳ . j ∈ S. To simplify things even further we will introduce the concept of a homogeneous Markov chain. as kij = j j P(Xt+1 = jXt = i) = P(Xt+1 ∈ SXt = i) = 1 Example 1.9 0. However.2 Discrete Markov chains 9 P(Xt = xt .3 and the transition kernel is K= 0.1. X0 = x0 ) = P(X0 = x0 ) · P(X1 = x1 X0 = x0 ) · P(X2 = x2 X1 = x1 . Note that knowing K alone is not enough to ﬁnd the distribution of the states: for this we also need to know the initial distribution λ0 .7. Xt−1 = xt−1 . In the following we will assume that all Markov chains are homogeneous. . . Comparing the equation in proposition 1. The Markov graph for this example is shown in ﬁgure 1. we start by stating two basic properties of the transition kernel K: – The entries of the transition kernel are nonnegative (they are probabilities). . X0 = x0 ) =P(X2 =x2 X1 =x1 ) ··· · P(Xt = xt Xt−1 = xt−1 .8 (Transition kernel). and independent of t ∈ N0 . .
. . .1 0. P(Xm = j... Xt = i) j =P(Xt+m1 +m2 =kXt+m1 =j)=P(Xt+m2 =kXt =j) = P(Xt+m1 = jXt = i) = j P(Xt+m2 = kXt = j)P(Xt+m1 = jXt = i) Kij j (m1 ) = Kjk 2 = K(m1 ) K(m2 ) (m ) i. .4.. then i.9 0. m2 ∈ N we have that K(m1 +m2 ) = K(m1 ) · K(m2 ) : P(Xt+m1 +m2 = kXt = i) = j P(Xt+m1 +m2 = k. X0 = i) = i i =Kij P(Xm = jX0 = i) P(X0 = i) = (λ′ Km )j 0 (m) =(λ0 )i Example 1. .. P(Xm = j) = Thus K(2) = K · K = K2 .. We will now show that the mstep transition kernel is nothing other than the mpower of the transition kernel. .3. . . 0 0 .. .10 1.. . i.10. and by induction K(m) = Km ..3 0. . . 0 α K= .. ... P(Xm = j) = (λ′ K(m) )j . .. The Markov graph for this Markov chain was given in ﬁgure 1. Let X be a homogeneous Markov chain.... and ii. . . Markov graph for the phone line example. . to the mstep transition kernel. . . .1 0 0.. Xt+m1 = jXt = i) P(Xt+m1 +m2 = kXt+m1 = j.9 (mstep transition kernel).5 (Phone line (continued)).. .. . The transition kernel for the random walk on Z is a Toeplitz matrix with an inﬁnite number of rows and columns: . . which contains the probabilities of moving from state i to step j in one step..9 0. In the phoneline example. . K(m) = Km . 0 0 β 1−α−β . which contains the probabilities of moving from state i to step j in m steps: Deﬁnition 1... . Markov Chains 0. .3 Figure 1. α 1−α−β . . 0 Proof. .4 (Random walk on Z (continued)). . β 1−α−β α 0 . 0. We will ﬁrst show that for m1 . ⊳ We will now generalise the concept of the transition kernel.. . the transition kernel is K= The mstep transition kernel is 0. 0 0 0 β . . .k ii. . The matrix K(m) = (kij )ij with kij (m) (m) = P(Xt+m = jXt = i) is called the mstep transition kernel of the homogeneous Markov chain X. 0 β 1−α−β α . . .7 . 0 0 . .. Proposition 1. . . .. .7 1 Example 1.
j and j i (as kii = P(Xt+0 = iXt = i) = 1 > 0). then also j ∼ i. such that all Example 1. This concept will become important when we Deﬁnition 1. k ∈ S: – i – If i – If i ∼ j. A Markov chain is called irreducible if it only consists of a single class.7515.2 Discrete Markov chains m 11 K(m) = Km = 0. Thus we can get (with positive probability) from state i to state k in mij + mjk steps: kik (mij +mjk ) = P(Xt+mij +mjk = kXt = i) = ι P(Xt+mij +mjk = kXt+mij = ι)P(Xt+mij = ιXt = i) ≥ P(Xt+mij +mjk = kXt+mij = j) P(Xt+mij = jXt = i) > 0 >0 >0 Thus i j and j k imply i k. We have that 2 ∼ 4. ⊳ 4 (10) 1.e. (b) Two states i and j are said to communicate (“i ∼ j”) if i From the deﬁnition we can derive for states i. 6} is closed. thus i ∼ i. 4 ∼ 5.7 = 3 3+( 5 ) 4 m 1+( 3 ) 5 4 m 3 1−( 5 ) 4 3 m 3−( 5 ) 4 m .9 0. (a) A state i is said to lead to a state j (“i such that there is a positive probability of getting from state i to state j in m steps.2. A class C is called closed if there are no paths going out of C. as well as the probability of getting from state j (m ) kjk jk = P(Xt+mjk = kXt = j) = P(Xt+mij +mjk = kXt+mij = j) > 0. i. 2 ∼ 5. Thus ∼ is an equivalence relation and we can partition the state space S into communicating classes. i. analyse the limiting behaviour of the Markov chain. i. j. all states communicate. then there exist mij . j and j i. Consider a Markov chain with transition kernel classes are {1}. m2 ≥ 0 such that the probability of getting from state i to state j in (mij ) = P(Xt+mij = jXt = i) > 0. .6. Thus the communicating K= 1 2 1 4 0 0 3 4 1 4 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 3 4 1 0 0 0 0 0 0 1 2 0 0 1 4 0 1 4 1 2 ⊳ Finally.0 = 10 3+( 3 ) 7338981 5 = 9765625 = 0. 3 ∼ 6. and {3.e. i. we will introduce the notion of an irreducible chain. for all i ∈ C we have that i We will see that states within one class have many properties in common. Thus the probability that the phone is free given that it was free 10 hours ago is P(Xt+10 = 0Xt = 0) = K0.1 0.1.e. 4.e. kij to state k in mjk steps. Thus i ∼ j and j ∼ k also imply i ∼ k states in one class communicate and no larger classes can be formed.11 (Classiﬁcation of states).3 0. 6}. {2. (0) mij steps is positive.e. The Markov graph is shown in ﬁgure 1. Only the class {3. 5}.12 (Irreducibility).2 Classiﬁcation of states j”) if there is an m ≥ 0 Deﬁnition 1. k. j implies that j ∈ C. i.5. kij (m) = P(Xt+m = jXt = i) > 0.
Suppose we can get from i to j in mij steps and from j to i in mji steps. Markov graph of the chain of example 1. as all paths from 2 back to 2 have a length which is a multiple of 3. . thus K33 > 0 and K66 > 0 for all m. 6}. then we can only visit this state again at time t + 3.13 (Period). thus the period is d(2) = 3 (3 being the greatest common denominator of 3. If no path of length m exists.5) and the random walk on Z (examples 1. Xt = 2). . the number of steps required to possibly get back to this state must be a multiple of the period d(i).e. In example 1. The communicating classes are {1}. Suppose also that we can get from j back to j in mjj steps. (a) All states within a communicating class have the same period.6. 4. (m) (3) K22 > 0. (9) . (a) A state i ∈ S is said to have period d(i) = gcd{m ≥ 1 : Kii where gcd denotes the greatest common denominator.). . The chain of example 1. (c) If d(i) > 1 the state i is called periodic. In example 1. (m) > 0}. thus K22 > 0. 9.5. 3 Similarly d(4) = 3 and d(5) = 3. as the following proposition shows: Proposition 1. t + 6. For a periodic state i. thus d(3) = d(6) = 1.. (m) The states 3 and 6 are aperiodic. Markov Chains 1 2 3 4 1 1 4 1 4 2 3 4 1 4 3 1 2 1 4 1 5 1 4 6 1 2 Figure 1.1. then Kii positive probability of length m. . Proof.1. Example 1.6 the state 2 has period d(2) = 3. Deﬁnition 1. If there exists a single path of > 0.4) are irreducible chains. This holds in general. and 1. as there is a positive probability of remaining in these states. Such a behaviour is referred to as periodicity. and {3. (6) K22 > 0. then (m) Kii (m) = 0.6 is not irreducible. .6 continued).12 1. {2. To analyse the periodicity of a state i we must check the existence of paths of positive probability and of length m going from the state i back to i. All other K22 = 0 ( m ∈ N0 ). . Thus there are paths of positive probability between these two states.. 6. . . Thus mij + mji and mij + mjj + mji must be divisible by the period d(i) of state i. Thus mjj is also divisible by d(i) (being the difference of two numbers divisible by d(i)).14. The Markov chain considered in the phoneline example (examples 1. (b) If d(i) = 1 the state i is called aperiodic. (a) Suppose i ∼ j.2 and 1. ⊳ (m) In example 1. (b) In an irreducible chain all states have the same period. 4 and 5 can only be visited in this order: if we are currently in state 2 (i.6 the states 2.3. Then we can get from i back to i in mij + mji steps as well as in mij + mjj + mji steps.6 all states within one communicating class had the same period.7 (Example 1. 5}.
thus d(i) = d(j). either all states are transient or all states are recurrent. (a) Every class which is not closed is transient.1. i. (almost surely) be visited only a ﬁnite number of times. Proof. Proposition 1. Then there exists a path of length mij leading from i to j and a path of length mji from j (mij ) back to i. E(Vi X0 = i) = This implies +∞ +∞ kii < +∞. i.14 we have seen that within a communicating class either all states are aperiodic. 1.15 (Recurrence and transience).6 long enough. For a proof see (Norris.5). (b) A state i is called transient if E(Vi X0 = i) < +∞. (b) Every ﬁnite closed class is recurrent.e. the other states will eventually be left forever. kij > 0 and kji (mji ) > 0. whereas a transient state will In proposition 1. Thus d(i) ≤ d(j) (d(j) being the greatest common denominator). we will classify states as recurrent or transient: Deﬁnition 1. o by d(i). thus the length of any path from j back to j is divisible Repeating the same argument with the rˆ les of i and j swapped gives us d(j) ≤ d(i). Proposition 1. +∞ Suppose furthermore that the state i is transient.3 Recurrence and transience If we follow the Markov chain of example 1.e. In order to formalise these notions we will introduce the number of visits in state i: +∞ Vi = t=0 1{Xt =i} The expected number of visits in state i given that we start the chain in i is +∞ +∞ +∞ +∞ E(Vi X0 = i) = E 1{Xt =i} X0 = i t=0 = t=0 E(1{Xt =i} X0 = i) = P(Xt = iXo = i) = t=0 t=0 kii (t) Based on whether the expected number of visits in a state is inﬁnite or not. 1. Finally we state without proof two simple criteria for determining recurrence and transience. . Proof. One can show that a recurrent state will (almost surely) be visited inﬁnitely often. A similar dichotomy holds for recurrence and transience. sect.2 Discrete Markov chains 13 The above argument holds for any path between j and j.16. (b) An irreducible chain consists of a single communicating class. thus (b) is implied by (a). Suppose i ∼ j. or all states are periodic. we will eventually end up switching between state 3 and 6 without ever coming back to the other states Whilst the states 3 and 6 will be visited inﬁnitely often. (s) thus state j is be transient as well. t=0 (t) E(Vj X0 = j) = kjj = t=0 (t) 1 kij (mij ) (mji ) kji t=0 +∞ kij (mij ) (t) (mji ) kjj kji ≤kii (m+t+n) ≤ 1 kij +∞ (mij ) (mji ) kji t=0 k (mij +t+mji ) ≤ 1 kij (mij ) (mji ) kji s=0 kii < +∞.17. (a) A state i is called recurrent if E(Vi X0 = i) = +∞. 1997.2. Within a communicating class.
. . µ = (µ0 .18 (Invariant distribution). The classes {1} and {2.6). and 1.9 0.e. Norris. .3. .4 Invariant distribution and equilibrium In this section we will study the longterm behaviour of Markov chains.. = µ′ Km = µ′ K(m) =µ′ K for all m ∈ N. 1997. Not every Markov chain has an invariant distribution.e..2 that a symmetric random walk on Zp is only recurrent if p ≤ 2 (see e.10 (Random walk on Z (continued)). The class {3. −0.4) 3 i. 4. To ﬁnd the invariant distribution. . which is equivalent to solving the following system of linear equations: (K′ − I)µ = 0.7 continued). An interesting result is 1. Markov Chains Example 1. . . .1. Deﬁnition 1. In example 1. 4. .e. we need to solve µ′ K = µ′ for µ.3 −0.. . ⊳ Note that an inﬁnite closed class is not necessarily recurrent. . . i..1 0. 1). α 0 0 0 . thus µ0 + µ1 = 1).. If µ is the stationary distribution of a chain with transition kernel K. 4 i. 6} is closed and ﬁnite. . {2.e. . . . The random walk on Z studied in examples 1. . . The random walk on Z (studied in examples 1. µ is the left eigenvector of K corresponding to the eigenvalue 1. . . Example 1. thus recurrent..1 0. µ = It is easy to see that the corresponding system is underdetermined and that −µ0 + 3µ1 = 0. Then µ is called the invariant distribution (or stationary distribution) of the Markov chain X if3 µ′ K = µ′ .2.1 0. . β 1−α−β α 0 .14 1.3 · µ0 µ1 = 0 0 ⊳ (3. .. so they are transient.e.. then µ′ = µ′ K = µ′ K2 = . . ... . 6}.6 had three classes: {1}. α = β.7 . . . otherwise it drifts off to −∞ or +∞. .. 3 1 ′ 4. i. 5} are not closed. sect.... Thus if X0 in drawn from µ. 0 0 0 β . Thus.5 we studied a Markov chain with the two states 0 (“free”) and 1 (“in use”) and which modeled whether a phone is free or not. . 1. then all Xm have distribution µ: according to proposition 1. . . as the following example shows: Example 1.3 0. .. if the chain has µ as initial distribution... µ1 )′ ∝ (as µ has to be a probability distribution. Let µ = (µi )i∈S be a probability distribution on the state space S. i..4) for example does not have an invariant distribution. and 1.. The random walk on Z had the transition kernel (see example 1. .4 is only recurrent if it is symmetric. The chain of example 1.. .1.2 and 1. 0 0 β 1−α−β .9 (Phone line (continued)).6 and 1. and let X be a Markov chain with transition kernel K. 0 β 1−α−β α . . K= . A key concept for this is the invariant distribution. 5}. . . . the distribution of X will not change over time... . .8 (Examples 1. . . . 1−α−β α 0 0 ..g.. . and {3.10 P(Xm = j) = (µ′ K(m) )j = (µ)j for all m. Its transition kernel was K= 0.
2 Discrete Markov chains 15 to become a probability distribution.19 is necessary. Example 1.8).19 P(Xt = 0) → 3 4 and P(Xt = 1) → ⊳ Example 1. Then P(Xt = i) −→ µi for all states i. thus according to theorem 1. .19 (Convergence to equilibrium). 3 1 4. Outline of the proof. 1. Let X be an irreducible and aperiodic Markov chain with invariant distribution µ. sec. 1. thus The chain Y has its invariant distribution as initial distribution. t→+∞ Suppose that X has initial distribution λ and transition kernel K. i. As t → +∞ the probability of {Yt = Zt } tends to 1. 4 Thus. One can show that Z is a Markov chain with initial distribution λ (as Xt Xt Zt Yt Yt T t Figure 1. X0 = Z0 ) and transition kernel K (as both X and Y have the transition kernel K). .11 (Phone line (continued)). Theorem 1.1. . . 1. however µ cannot be renormalised ⊳ We will now show that if a Markov chain is irreducible and aperiodic.)′ that µ′ K = µ′ . We will explain the outline of the proof using the idea of coupling. Let T be the ﬁrst time the two chains “meet” in the state i.6 illustrates this new chain Z. 1997.9 that the invariant distribution of the Markov chain modeling the phone line is µ = 1 4. thus P(Yt = j) = µj for all t ∈ N0 and j ∈ S. A more detailed proof of this theorem can be found in (Norris. T = min{t ≥ 0 : Xt = Yt = i} Then one can show that P(T < ∞) = 1 and deﬁne a new process Z by Zt = Xt Yt if t ≤ T if t > T Figure 1. in the long run. As α + (1 − α − β) + β = 1 we have for µ = (. . We have shown in example 1.12. Y (— —) and Z (thick line) used in the proof of theorem 1. This example illustrates that the aperiodicity condition in theorem 1. 1.6. Thus X and Z have the same distribution and for all t ∈ N0 we have that P(Xt = j) = P(Zt = j) for all states j ∈ S. . its distribution will in the long run tend to the invariant distribution.19. 2} and transition kernel . Illustration of the chains X (− − −). . the phone will be free 75% of the time. Consider a Markov chain X with two states S = {1. P(Xt = j) = P(Zt = j) → P(Yt = j) = µj .e. Deﬁne a new Markov chain Y with initial distribution µ and same transition kernel K.
1. In this case P(Xs+1 = i) = µi and P(Xs = j) = µj for all s. . we have deﬁned the transition kernel to consist of kij = P(Xt+1 = jXt = i). .e. 0. 0). then P(Xt = 0) = 1 if t is odd .1 0. or 0. 0. 2 . then using (1. i.e we turn the universal clock backwards? P(Xt = j) P(Xt = j. Thus it is periodic with period 2. (1. . . . just with the rˆ les of past and future swapped. as 1 1 . ⊳ 0 if t is even 1 if t is even 1 which is different from the invariant distribution. 0. 1. If we use the invariant distribution µ as initial distribution for X0 . under which all these probabilities would be 2 . thus goes either 1. We have that P(Yt = jYt−1 = i) = P(Xτ −t = jXτ −t+1 = i) = P(Xs = jXs+1 = i) = = P(Xs+1 P(Xs = j. the same should hold for the “backward chain”. . even if the forward chain X is homogeneous. .11) the probabilities transition matrix was K= The invariant distribution was µ = 3 1 ′ 4. Xt+1 = i) = P(Xt+1 = iXt = j) · P(Xt+1 = i) P(Xt+1 = i) This suggests deﬁning a new Markov chain which goes back in time. . 2 2 = µ′ . i. P(Xt = 1) = 0 if t is odd . and 1. Its invariant distribution is µ′ = 1 1 2. In the example of the phone line (examples 1.13 (Phone line (continued)).9. This changes however if the forward chain X is initialised according to its invariant distribution µ. .16 1. What happens if we analyse the distribution of Xt conditional on the future.9 0. As the deﬁning property of a Markov P(Xt = jXt+1 = i) = chain was that the past and future are conditionally independent given the present. 1.5 Reversibility and detailed balance In our study of Markov chains we have so far focused on conditioning on the past.5. µ′ K = However if the chain is started in X0 = 1.7 . 1. 1. τ } deﬁned by Yt = Xτ −t is called the timereversed chain corresponding to X. Xs+1 = i) P(Xs+1 = i) P(Xs = j) P(Xs = j) = kji · . = iXs = j) · P(Xs+1 = i) P(Xs+1 = i) thus the timereversed chain is in general not homogeneous.2) .2. For τ ∈ N let {Xt : t = 0. τ } be a Markov chain. 0. . . Markov Chains K= 0 1 1 0 . This Markov chains switches deterministically.2) µi In general.3 0. and thus Y is a homogeneous Markov chain with transition µj . 1. ..1. . .20 (Timereversed chain). 4 . o Deﬁnition 1. P(Yt = jYt−1 = i) = kji · Example 1. λ = (1. the transition probabilities for the timereversed chain will thus be different from the forward chain. 1. For example. 2 2 0 1 1 0 = 1 1 .3. Then {Yt : t = 0.
21 (Detailed balance).8 0 0.8 0 1 1 1 3. Note that not every chain which has an invariant distribution is timereversible.2 0.22. µ is the invariant distribution.8 0 0. then using (1. 3} with transition matrix The corresponding Markov graph is shown in ﬁgure 1.e. The detailedbalance condition is a very important concept that we will require when studying Markov Chain Monte Carlo (MCMC) algorithms later. 3. We have that (µ′ K)i = j µj kji = µi =µi kij j kij = µi . ii. Theorem 1.2) µi kij ′ ′ P(Yt = jYt−1 = i) = µj kji = kij = P(Xt = jXt−1 = i). more importantly. as their dynamics do not change when time is reversed. µi thus X and Y have the same transition probabilities. as the following example shows: Example 1. Deﬁnition 1.14. The advantage of the detailedbalance condition over the condition of deﬁnition 1.1 · 4 = 0. Let Y be the timereversal of X. Consider the following Markov chain on S = {1. .e. =1 thus µ K = µ . We will now introduce a criterion for checking whether a chain is timereversible. j ∈ S µi kij = µj kji .2 0.13) satisﬁes the detailedbalance condition. We will call such chains timereversible. Proof.7: The stationary distribution of the chain is µ = K = 0. as it does not involve a sum (or a vectormatrix product). 3 . ii. Then i.2 Discrete Markov chains 17 P(Yt = 0Yt−1 = 0) = k00 · P(Yt = 0Yt−1 µ0 = k00 = P(Xt = 0Xt−1 = 1) µ0 3 µ0 = 0. then the chain is timereversible.1. which says that if a Markov chain is in detailed balance with a distribution µ.18 is that the detailedbalance condition is often simpler to check. Let X be a Markov chain with transition kernel K which is in detailed balance with some distribu tion µ on the states of the chain.1 = k01 = P(Xt = 1Xt−1 = 0) 3 µ0 4 µ1 = 1) = k11 · = k11 = P(Xt = 1Xt−1 = 1) µ1 Thus in this case both the forward chain X and the timereversed chain Y have the same transition probabilities.3 = k10 = P(Xt = 0Xt−1 = 1) = 1) = k01 · 1 µ1 4 P(Yt = 1Yt−1 = 0) = k10 · P(Yt = 1Yt−1 1 µ1 = 0. 2.2 0. i. and. It is easy to see that Markov chain studied in the phone line example (see example 1. both X and its time reversal have the same transition kernel. µ is the invariant distribution. X is timereversible. If initialised according to µ. A transition kernel K is said to be in detailed balance with a distribution µ if ⊳ for all i. i. The reason for its relevance is the following theorem.3 · 4 = 0. µ is the invariant distribution of X. i.
we have restricted our attention to Markov chains with a discrete (i. Thus the chains X and its time reversal Y have different transition kernels. rather than K.2) we can ﬁnd the transition matrix of the timereversed chain Y . Xt = xt ) = P(Xt+1 ∈ AXt = xt ) for all measurable sets A ⊂ S.8 0. . Markov graph for the Markov chain of example 1.18 1. or (Robert and Casella.2 0. However the distribution is not timereversible. which is only meaningful if the state space is discrete. rather than individual states. the chain is much more likely to go clockwise in ﬁgure 1.3 General state space Markov chains So far. i. the corresponding Markov chain has a continuous state space S. When going forward in time. Largely. as it is the case for a process with a continuous state space. chapter 6).8 Figure 1. Thus we have to work with conditional probabilities of sets of states. when going backwards in time however.8 0. However. (Nummelin. conditionally on the present. A process with a continuous state space spreads the probability so thinly that the probability of exactly hitting one given state is 0 for all states.2 0. 2004. In a general state space it can be that all events of the type {Xt = j} have probability 0. Although the basic principles are not entirely different from the ones we have derived in the discrete case. most applications of Markov Chain Monte Carlo algorithms are concerned with continuous random variables.7. We deﬁned a Markov chain to be a stochastic process in which. In this section we will give a brief overview of the theory underlying Markov chains with general state spaces. Markov Chains 1 0. . the study of general state space Markov chains involves many more technicalities and subtleties. thus the theory studied in the preceding section does not directly apply. 0 1.14. ⊳ 0. 1984). The main reason for this was that this kind of Markov chain is much easier to analyse than Markov chains having a more general state space.2 0.8 which is equal to K′ .2 .8 2 0. we need to generalise our deﬁnition of a Markov chain (deﬁnition 1. which is 0 0. at most countable) state space S. the past and the future are independent. The interested reader can ﬁnd a more rigorous treatment in (Meyn and Tweedie. . Let X be a stochastic process in discrete time with general state space S. 1993). Deﬁnition 1. Though this section is concerned with general state spaces we will notationally assume that the state space is S = Rd . .2 0 0.e. we deﬁned most concepts for discrete state spaces by looking at events of the type {Xt = j}. . Using equation (1.7.2 3 0. the chain is much more likely to go counterclockwise. First of all. In the discrete case we formalised this idea using the conditional probability of {Xt = j} given different collections of past events.4).23 (Markov chain).8 0. X is called a Markov chain if X satisﬁes the Markov property P(Xt+1 ∈ AX0 = x0 .e. so that we will not present any proofs here.
We also assume that Et is independent of X0 . . . We obtain the special case of deﬁnition 1. 1). z2 1 .8 by setting K(i. xt+2 ) · · · K(xt+m−1 . . For a discrete state space the dominating measure is the counting measure. equation (1.4 The transition kernel K(x.15 (Gaussian random walk on R). j)th element of the transition matrix K.e. xt+1 ∈A ··· S K(xt . X0 = x0 ) = P(Et ∈ A − xt ) = P(Xt+1 ∈ AXt = xt ). y) is thus just the conditional probability density of Xt+1 given Xt = xt . . 1). dxt+1 ). the probability density function of Et is φ(z) = √ exp − 2 2π assuming that Xt+1 Xt = xt ∼ N(xt . xt+1 ) = φ(xt+1 − xt ) To ﬁnd the mstep transition kernel we could use equation (1. where A − xt = {a − xt : a ∈ A}.1. . In contrast to the random example 1.3) where the integration is with respect to a suitable dominating measure. xt+1 ) dxt+1 A In the following we will assume that the Markov chain is homogeneous. i. thus the mstep transition kernel is K (m) (x0 . the probabilities P(Xt+1 ∈ AXt = (1. xm ) dxm−1 · · · dx1 The mstep transition kernel allows for expressing the mstep transition probabilities more conveniently: P(Xt+m ∈ AXt = xt ) = K (m) (xt . xt ) are independent of t.3 General state space Markov chains 19 If S is at most countable. xm ) = S ··· S K(x0 . 1). For the remainder of this section we shall also assume that we can express the probability from deﬁnition 1. However. this deﬁnition is equivalent to deﬁnition 1.23 using a transition kernel K : S × S → R+ : 0 P(Xt+1 ∈ AXt = xt ) = K(xt . Consider the random walk on R deﬁned by Xt+1 = Xt + Et . i. xt+m ) dxt+1 · · · dxt+m−1 dxt+m . Et−1 . Thus X is indeed a Markov chain. This is equivalent to where Et ∼ N(0. j) = kij . Rather we exploit the fact that 4 A more correct way of stating this would be P(Xt+1 ∈ AXt = xt ) = R A K(xt . .4.e. x1 ) · · · K(xm−1 . In complete analogy with φ(xt+1 − xt ) dxt+1 Thus the transition kernel (which is nothing other than the conditional density of Xt+1 Xt = xt ) is thus K(xt . the resulting integral is difﬁcult to compute. . where kij is the (i.e. . Furthermore we have that P(Xt+1 ∈ AXt = xt ) = P(Et ∈ A − xt ) = A walk on Z (introduced in example 1. . i. so integration just corresponds to summation. E1 . xt+1 )K(xt+1 . . . . Suppose that X0 ∼ N(0.3).2) the state space of the Gaussian random walk is R. i.3) is equivalent to P(Xt+1 ∈ AXt = xt ) = We have for measurable set A ⊂ S that P(Xt+m ∈ AXt = xt ) = A S kxt xt+1 . for example with respect to the Lebesgue measure if S = Rd . . X0 = x0 ) = P(Et ∈ A − xt Xt = xt .2 we have that P(Xt+1 ∈ AXt = xt . xt+m ) dxt+m A Example 1.e.
We then deﬁne +∞ E(VA X0 = x) = E 1{Xt ∈A} X0 = x t=0 = t=0 E(1{Xt ∈A} X0 = x) = K (t) (x.15 we had that Xt+1 Xt = xt ∼ N(xt . If the number of steps m = 1 for all A. 2004. 1). recurrence.4). As Lebesgue measure.3 from the discrete case to the general case requires additional technical concepts like atoms and small sets. Thus the chain is strongly irreducible with the respect to any continuous distribution.3 we deﬁned a discrete Markov chain to be recurrent. y) dy > 0. ∼N(0. which are beyond the scope of this course (for a more rigorous treatment of these concepts see e. we need to consider the number of visits to a set of states rather than single states. ii. Deﬁnition 1.3) we can identify 1 K (m) (xt .24 (Irreducibility). . then the chain is said to be strongly µirreducible.g.2.12 to Markov chains with general state spaces: it might be — as it is the case for a continuous state space — that the probability of hitting a given state is 0 for all states.2 we deﬁned a Markov chain to be irreducible if there is a positive probability of getting from any state i ∈ S to any other state j ∈ S. xt+m ) = √ φ m as mstep transition kernel. we have that P(Xt+1 ∈ AXt = xt ) > 0 for all sets A of nonzero ⊳ Extending the concepts of periodicity. We start with deﬁning recurrence of sets before extending the deﬁnition of recurrence of an entire Markov chain. Deﬁnition 1. For more general state spaces. Again.16 (Gaussian random walk (continued)). (a) A set A ⊂ S is said to be recurrent for a Markov chain X if for all x ∈ A E(VA X0 = x) = +∞. y) dy t=0 A This allows us to deﬁne recurrence for general state spaces.20 1. if all states are (on average) visited inﬁnitely often. m). there exists an m ∈ N0 such that P(Xt+m ∈ AXt = x) = A K (m) (x.3 and 6. a Markov chain is said to be µirreducible if for all sets A with µ(A) > 0 and for all x ∈ S.2. when we start the chain in x ∈ S: +∞ +∞ 1{Xt ∈A} be the number of visits the chain makes to states in the set A ⊂ S. . P(Xt+m ∈ AXt = xt ) = P(Xt+m − Xt ∈ A − xt ) = Comparing this with (1. + Et+m−1 . Markov Chains Xt+m = Xt + Et + . The chain is µirreducible for some distribution µ. and transience studied in sections 1. . Robert and Casella. Let VA = +∞ t=0 the expected number of visits in A ⊂ S.2 and 1. we cannot directly apply deﬁnition 1. the range of the Gaussian distribution is R. xt+m − xt √ m ⊳ A 1 √ φ m xt+m − xt √ m dxt+m In section 1. Example 1.m) thus Xt+m Xt = xt ∼ N(xt .25 (Recurrence). possibly via intermediate steps. Given a distribution µ on the states S. We will again resolve this by looking at sets of states rather than individual states. In section 1. In example 1.2. Thus we will only generalise the concept of recurrence. Every measurable set A ⊂ S with µ(A) > 0 is recurrent. (b) A Markov chain is said to be recurrent.2. if i. sections 6.
detailed balance. but not necessary) condition is. This result is the reason why Markov Chain Monte Carlo methods work: it allows us to set up simulation algorithms to generate a Markov chain. then the chain is also recurrent. see (Tierney.4 Ergodic theorems In this section we will study the question whether we can use observations from a Markov chain to make inferences about its invariant distribution. Deﬁnition 1.27 (Invariant Distribution). Suppose that X is a µirreducible Markov chain having µ as unique invariant distribution. The chain is µirreducible for some distribution µ.. whose sample path we can then use for estimating various quantities of interest.22 one can also show in the general case that if the transition kernel of a Markov chain is in detailed balance with a distribution µ.22 also holds in the general case. A transition kernel K is said to be in detailed balance with a distribution µ with density fµ if for almost all x. if i.27 requires computing an integral. Proof. ii. we need to deﬁne invariant distributions for general state spaces. A similar result can be obtained for Markov chains. This type of recurrence is referred to as Harris recurrence. y) dx for almost all y ∈ S. Thus theorem 1. This is already the case if there is a nonzero probability of visiting the set inﬁnitely often. (a) A set A ⊂ S is said to be Harrisrecurrent for a Markov chain X if for all x ∈ A P(VA = +∞X0 = x) = 1.26 (Harris Recurrence). However. 1. A stronger concept of recurrence can be obtained if we require that the set is visited inﬁnitely often with probability 1. before we can state this proposition. x). We will see that under some regularity conditions it is even enough to follow a single sample path of the Markov chain. then the chain is timereversible and has µ as its invariant distribution. A simpler (sufﬁcient.4 Ergodic theorems 21 According to the deﬁnition a set is recurrent if on average it is visited inﬁnitely often. Deﬁnition 1. For independent identically distributed data the Law of Large Numbers is used to justify estimating the expected value of a functional using empirical averages. Deﬁnition 1.28. . y) = fµ (y)K(y. Every measurable set A ⊂ S with µ(A) > 0 is Harrisrecurrent. just like in the case discrete case. A distribution µ with density function fµ is said to be the invariant distri bution of a Markov chain X with transition kernel K if fµ (y) = S fµ (x)K(x. theorem 1) or (Athreya et al. Checking recurrence or Harris recurrence can be very difﬁcult. which can be quite cumbersome. Then X is also recurrent. y ∈ S fµ (x)K(x.1.29 (Detailed balance). For discrete state spaces the two concepts are equivalent. In complete analogy with theorem 1. (b) A Markov chain is said to be Harrisrecurrent. Proposition 1. We will state (without) proof a proposition which establishes that if a Markov chain is irreducible and has a unique invariant distribution. It is easy to see that Harris recurrence implies recurrence. 1994. 1992) Checking the invariance condition of deﬁnition 1.
. fact 5). which is a necessary condition for the convergence of ergodic averages. 2004. For a proof see (Roberts and Rosenthal. or 2.3.30 is about averages. Example 1. 2004. However. If the initial distribution is µ = (α. Let X be a µirreducible. The reason for this is that whilst theorem 1.19. 1] then for every t ∈ N0 we have that P(Xt = 1) = α P(Xt = 2) = 1 − α. recurrent Rd valued Markov chain with invariant distribution µ. 2} and transition matrix K= 1 0 0 1 The corresponding Markov graph is shown in ﬁgure 1. 1 − α)′ with α ∈ [0. theorem 1. In order to estimate the parameter α we would need to look at more than one sample path. Proof.30 (Ergodic Theorem).12 we studied a periodic chain. . Due to the periodicity we could not apply theorem 1.8. . We can however apply theorem 1. .8. Consider a discrete chain with two states S = {1. Markov graph of the chain of example 1.30. 1993. (Robert and Casella. switch between the states 1 and 2). or (Meyn and Tweedie. as µ′ K = µ′ I = µ′ for all µ. and the periodic behaviour does not affect averages. 2. .30 to this chain. Under additional regularity conditions one can also derive a Central Limit Theorem which can be used to justify Gaussian approximations for ergodic averages of Markov chains.22 1. This chain will remain in its intial state forever. 2} is an invariant distribution.63).17.19 was about the distribution of states at a given time t. The reason for this is that the chain fails to explore the space (i. 2. If X is Harrisrecurrent this holds for every starting value x. Markov Chains Theorem 1. In example 1. These conditions ensure that the chain is permanently exploring the entire state space. By observing one sample path (which is either 1. 1. Then we have for any integrable function g : Rd → R that with probability 1 1 t→∞ t lim t i=1 g(Xi ) → Eµ (g(X)) = g(x)fµ (x) dx S for almost every starting value X0 = x. 1. .e.30 does not require the chain to the aperiodic. Any 1 1 1 Figure 1. theorem 6.2). We conclude by giving an example that illustrates that the conditions of irreducibility and recurrence are necessary in theorem 1.) we can make no inference about the distribution of Xt or the parameter α.17 2 distribution µ on {1. the chain is not irreducible (or recurrent): we cannot get from state 1 to state 2 and vice versa. theorem 17. . ⊳ Note that theorem 1. This would however be beyond the scope of this course.
1). which are sometimes referred to as stochastic simulation (Ripley (1987) for example only uses this term). but independent of the position of R. Y ∼ U[−1.d. where we use a simulationbased method to evaluate an integral. 1]). i. He deﬁned a Monte Carlo method as “representing the solution of a problem as a parameter of a hypothetical population. A formal deﬁnition of Monte Carlo methods was given (amongst others) by Halton (1970). If we observe n raindrops.Chapter 2 An Introduction to Monte Carlo Methods 2. however we can estimate the probability using our raindrop experiment. then the number of raindrops Z that fall inside the circle is a binomial random variable: {x2 +y 2 ≤1} using a simple experiment. and using a random sequence of numbers to construct a sample of the population. such that distributions on the interval [−1. Monte Carlo tests. 1]. 1] (in short X.2 Introductory examples Example 2. i. Assume that we could produce “uniform rain” on the square [−1.e.y≤1} = π π = 2·2 4 .1 (A raindrop experiment for computing π ). 1 dxdy {−1≤x.i. where we resort to simulation in order to compute the pvalue.1 What are Monte Carlo Methods? This lecture course is concerned with Monte Carlo methods. It is 1 dxdy area of the unit circle P(drop within circle) = = area of the square In other words. 1]2 is proportional to the area of R. π = 4 · P(drop within circle). Examples of Monte Carlo methods include stochastic integration.i. where we construct a Markov chain which (hopefully) converges to the distribution of interest.” 2. we found a way of expressing the desired quantity π as a function of a probability. It is easy to see that this is the case iff the two coordinates X. Y are i.d. and MarkovChain Monte Carlo (MCMC). 1] × [−1. Of course we cannot compute P(drop within circle) without knowing π. realisations of uniform Now consider the probability that a raindrop falls into the unit circle (see ﬁgure 2. from which statistical estimates of the parameter can be obtained. Assume we want to compute an Monte Carlo estimate of π the probability of a raindrop falling into a region R ⊂ [−1.
p). . with p = P(drop within circle).750. An Introduction to Monte Carlo Methods 1 −1 −1 Figure 2. let πn = 4ˆn denote the estimate after having observed n raindrops. p) and Z n. .1. 2000. p= ˆ We can assess the quality of our estimate by computing a conﬁdence interval for π. n Z . n .8525]. As we have Z ∼ B(100. 3.6875. 100 As our estimate of π is four times the estimate of p. Illustration of the raindrop experiment for estimating π 1 Z ∼ B(n. as in ﬁgure 2.77 + 1. . Figure 2.96 · 100 we use the approximation that Z ∼ N(100p.2 shows the ˆ estimate obtained after n iterations as a function of n for n = 1.410] In more general.08.77 − 1.77) . .77 · (1 − 0.1.24 2. πn + z1−α ˆ n πn (4 − πn ) ˆ ˆ n ⊳ pn (1 − pn ) ˆ ˆ . In this case. and we can ˆ 0. pn + z1−α ˆ n pn (1 − pn ) ˆ ˆ . Hence. 100p(1 − p)). that 77 of the 100 raindrops were inside the circle. n Assume we have observed.77 · (1 − 0.96 · 0. 4 · 77 = 3. 100 obtain a 95% conﬁdence interval for p using this Normal approximation: 0. A (1−2α) conﬁdence interval ˆ p for p is then pn − z1−α ˆ thus a (1 − 2α) conﬁdence interval for π is πn − z1−α ˆ πn (4 − πn ) ˆ ˆ . p ∼ N(p.77) = [0. 0. However the law of large numbers guarantees that our estimate π converges almost surely to π. our estimate of π is π= ˆ which is relatively poor. You can see that the estimate improves as n increases. we now also have a conﬁdence interval for π: [2. Thus we can estimate p by its maximumlikelihood estimate p= ˆ and we can estimate π by π = 4ˆ = 4 · ˆ p Z . 0. p(1 − p)/100).
t):t≤f (x)} 1dt dx {0≤x.t):t≤f (x)} f (x) dx = 0 0 0 1 dt dx = 1dt dx = {(x.0 2. Assume we can produce uniform rain on the unit square. 1]2 . we can resort to a raindrop experiment.t≤1} The numerator is nothing other than the dark grey area under the curve.0 Estimate of π 2. 1 dt dx {(x. Furthermore we used the central limit theorem to assess the speed of convergence. Estimate of π resulting from the raindrop experiment Let us recall again the different steps we have used in the example: – We have written the quantity of interest (in our case π) as an expectation.1 – Second. Once more. and the denominator is the area of the unit square (shaded in light grey in ﬁgure 2.2 Introductory examples 25 Monte Carlo estimate of π (with 90% conﬁdence interval) 4.5 0 500 1000 Sample size 1500 2000 Figure 2. . 1].2 Figure 2.5 3. which of course equals the integral we want to evaluate (the area of the unit square is 1. so we don’t need to rescale the result).2. we have replaced this algebraic representation of the quantity of interest by a sample approximation to it.0 3. using the fact that f (x) = 1 1 f (x) f (x) 0 1 dt.2. Example 2. The law of large numbers guaranteed that the sample approximation converges to the algebraic representation. The probability that a raindrop falls below the curve is equal to the area below the curve. it is 8505 = 212 35 ·5·7 ≈ 0.2 (Monte Carlo Integration).4816. It is of course of interest whether the Monte Carlo methods offer more favourable rates of convergence than other numerical methods. 4096 As f is a polynomial we can obtain the result analytically.3). Its graph is fully contained in the unit square [0. Assume we want to evaluate the integral 1 f (x) dx 0 with f (x) = 1 · −65536x8 + 262144x7 − 409600x6 + 311296x5 − 114688x4 + 16384x3 27 using a Monte Carlo approach. We will investigate this in the case of Monte Carlo integration using the following simple example.3 shows the function for x ∈ [0. A more formal justiﬁcation for this is. Thus the expression on the right hand side is the probability that a 1 2 A probability is a special case of an expectation as P(A) = E(IA ). and thus to the quantity of interest.
2. For a twodimensional function f the error made by the Riemann approximation using n function evaluations is O(n−1/2 ). it is easy to see that its speed of convergence is of the same order. so is the upper bound on f (x)−f (ξmid ).4) to approximate the integral from example 2. Laplace (1812) suggested that this experiment can be used to approximate π.2 the error is of order O(n−1 ). we have to evaluate the function n = m2 times. We have thus reexpressed our quantity of interest as a probability in a statistical model. or equivalently O(n−1/2 ).26 2.4 Recall that our Monte Carlo method was “only” of order OP (n−1/2 ). 5 This makes the Monte Carlo methods especially suited for highdimensional problems. Let n denote the number evaluations of f (and thus each “bar” is O(m−3 ) (each of the two sides of the base area of the “bar” is proportional to m−1 . An Introduction to Monte Carlo Methods raindrop falls below the curve. Assume we partition both axes into m segments. There are in total m2 bars. 3 ∆2 2 The error made for each “bar” can be upper bounded by 4 5 1 the number of “bars”). yielding O(m−3 )). The order of convergence can be improved when using the trapezoid rule and (even more) by using Simpson’s rule. This is not the case for other (deterministic) numerical integration methods.52.3. Figure 2. the total −1 error is O(n ). yielding a MonteCarlo estimate of the integral of 0. 2) contains a rough estimate of π (using the columns of King Solomon’s temple). regardless of the dimension of the support of f . However. As ∆ is proportional to n . pn + z1−α ˆ n pn (1 − pn ) ˆ n ⊳ Thus the speed of convergence of our (rather crude) Monte Carlo method is OP (n−1/2 ). 1 x 0 Figure 2. Monte Carlo methods are a somewhat more recent discipline. The error made for max f ′ (x). 52 of them are below the curve.3 A Brief History of Monte Carlo Methods Experimental Mathematics is an old discipline: the Old Testament (1 Kings vii.3. One of the ﬁrst documented Monte Carlo experiments is Buffon’s needle experiment (see example 2. the integral is If after n raindrops a proportion pn is found to lie below the curve. Furthermore the Monte Carlo method offers the advantage of being relatively simple and thus easy to implement on a computer. . so the total error is only O(m−1 ).3 below).e. Illustration of the raindrop experiment to compute 1 R1 0 f (x)dx When using Riemann sums (as in ﬁgure 2. i. As there are n “bars”.3 shows the result obtained when observing 100 raindrops. a (1 − 2α) conﬁdence interval for the value of ˆ pn − z1−α ˆ pn (1 − pn ) ˆ . 23 and 2 Chronicles iv. the error made for each bar is O(n−2 ).
δ π P(intersect) = 0 P(intersectθ) · 1 dθ = π π 0 l sin θ 1 l · dθ = · δ π πδ sin θ dθ = 0 =2 2l . Dark needles intersect the thin vertical lines.3 (Buffon’s needle). π) we obtain π l sin θ . light needles do not.2.5. πδ When dropping n needles the expected number of needles crossing a line is thus 2nl .3 A Brief History of Monte Carlo Methods 27 1 ∆ f (x) − f (ξmid ) < ∆ 2 · max f ′ (x) for x−ξmid ≤ ∆ 2 x 0 ξmid 1 Figure 2. George Louis Leclerc. 1777). What is the probability that a needle of length l < δ dropped on the ﬂoor will intersect one of the lines? Buffon answered the question himself in 1777 (Buffon. The probability of this happening is P(intersectθ) = Assuming that the angle θ is uniform on [0. asked the following question (Buffon. In 1733. the Comte de Buffon. 1733): Consider a ﬂoor with equally spaced lines.5).4. Illustration of Buffon’s needle . Figure 2. πδ Thus we can estimate π by δ δ δ θ l sin θ (a) Illustration of the geometry behind Buffon’s needle (b) Results of the Buffon’s needle experiment using 50 needles. a distance δ apart. Then the question whether the needle intersects a line is equivalent to the question whether a box of width l sin θ intersects a line. Assume the needle landed such that its angle is θ (see ﬁgure 2. Illustration of numerical integration by Riemann sums Example 2.
Most importantly: – The numbers generated by the algorithm should reproduce independence. It is not a particularly powerful generator (so we discourage you from using it in practise). An Introduction to Monte Carlo Methods π≈ where X is the number of needles crossing a line. Physical random experiments were difﬁcult to perform and so was the numerical processing of their results. casts some doubt over the results of his experiments (see Badger. . 1] distribution is a suitable model. 1901). This section tries to give very brief overview of how pseudorandom numbers can be generated. the main drawback of Monte Carlo methods was that they used to be expensive to carry out. 1808 · 3 5424 133 which is nothing other than the best rational approximation to π with at most 4 digits each in the denominator and ⊳ Historically. however it is easy enough to allow some insight into how pseudorandom number generators work. 1987). who claimed to be stimulated by playing poker (Ulam. In other words. . however. i. Amongst the ﬁrst to realise this potential were John von Neuman and Stanisław Ulam. In the following we will brieﬂy discuss the linear congruential generator. for a more detailed discussion). 2. A pseudorandom number generator (RNG) is an algorithm for whose output the U[0. 1983). by deﬁnition. 6 7 That Lazzarini’s experiment was that precise. Xδ The Italian mathematician Mario Lazzarini performed Buffon’s needle experiment in 1901 using a needle of length l = 2. is deterministic in nature — a philosophical paradox.5cm and lines d = 3cm apart (Lazzarini. . Of 3408 needles 1808 needles crossed a line. In the following chapters we will assume that independent (pseudo)random realisations from a uniform U[0. so Lazzarini’s estimate of π was π≈ the numerator. 1]. for generating the randomness. which.5 17040 355 = = . This however changed fundamentally with the advent of the digital computer. 1] random variable. They proposed in 1947 to use a computer simulation for solving the problem of neutron diffusion in ﬁssionable material (Metropolis. . the numbers X1 .e. 1) random variables using deterministic algorithms. 1) distribution as a source of randomness. Xn that we have already generated should not contain any discernible information on the next value Xn+1 . 2nl . We will only use the U(0. Samples from other distributions can be derived from realisations of U(0. – The numbers generated should be spread out evenly across the interval [0. 1994. In 1949 Metropolis and Ulam published their results in the Journal of the American Statistical Association (Metropolis and Ulam. the number generated by the pseudorandom number generator should have the same relevant statistical properties as independent realisations of a U[0. This property is often referred to as the lack of predictability. the socalled “Fermiac”.6 2 · 3408 · 2. 1949). 1] distribution7 are readily available.4 Pseudorandom numbers For any MonteCarlo simulation we need to be able to reproduce randomness by a computer algorithm. Nonetheless. The name “Monte Carlo” goes back to Stanisław Ulam. and not by statisticians: it was only in the 1980s — following the paper by Geman and Geman (1984) proposing the Gibbs sampler — that the relevance of Monte Carlo methods in the context of (Bayesian) statistics was fully realised. . who were then working for the Manhattan project in Los Alamos. Enrico Fermi previously considered using Monte Carlo techniques in the calculation of neutron diffusion. however he proposed to use a mechanical device. in the following 30 years Monte Carlo methods were used and analysed predominantly by physicists.28 2. For a more detailed discussion of pseudorandom number generators see Ripley (1987) or Knuth (1997).
2. is viewed as points in an ndimension cube8 . . 1968). and Digital’s PDP11. It used a = 216 + 3.98828120. ⊳ The main ﬂaw of the congruential generator its “crystalline” nature (Marsaglia. c = 1. .4 Pseudorandom numbers 29 Algorithm 2. c = 0. . Set Zi = (aZi−1 + c) mod M . c ∈ N0 . 1. M − 1}. .4023438. It is easy to see that the sequence of pseudorandom numbers only depends on the seed Z0 . mod 256 = 359 mod 256 = 103 mod 256 = 186 mod 256 = 253 mod 256 = 8378 mod 256 = 15101 The corresponding Xi are X1 = 103/256 = 0.2).. Due to this ﬂaw of the congruential pseudorandom number generator. . X2 = 186/256 = 0. The lefthand panel shows a plot of 1. . 1). The right hand panel shows the outcome of the BoxMuller method for transforming two uniform pseudorandom numbers into a pair of Gaussians (see example 3. M ∈ N. value (“seed”) Z0 ∈ {1. and seed Z0 = 4..g. 1. and M . For i = 1. X2 . Marsaglia and Zaman (1991) or Matsumoto and Nishimura (1998).4. Xnk+n−1 ). M − 1} and thus the Xi are in the interval [0. . GNU R (and other environments) provide you with a large choice of powerful random number generators.72656250. Z1 Z2 Z3 = = = (81 · 4 + 35) (81 · 103 + 35) (81 · 186 + 35) . see the corresponding help page (?RNGkind) for details. . they lie on a ﬁnite.1 (Congruential pseudorandom number generator). . and often very small number of parallel hyperplanes. X1 = 253/256 = 0. The integers Zi generated by the algorithm are from the set {0. . . c. .6). and the initial 2. Example 2. M = 256.000 realisations of a congruential generator with a = 1229. . and M = 211 . which was (unfortunately) very popular in the 1970s and used for example in IBM’s System/360 and System/370. . Cosider the choice of a = 81. Or as Marsaglia (1968) put it: “the points [generated by a congruential generator] are about as randomly spaced in the unit ncube as the atoms in a perfect crystal at absolute zero. . For more powerful pseudorandom number generators see e. 2. Figure 2. The random numbers lie on only 5 hyperplanes in the unit square. 8 The (k + 1)th point has the coordinates (Xnk+1 . it should not be used in Monte Carlo experiments. and Xi = Zi /M .” The number of hyperplanes depends on the choice of a. Running the pseudorandom number generator twice with the same seed thus generates exactly the same sequence of pseudorandom numbers. c = 35. . Choose a. . 1987). An example for a notoriously poor design of a congruential pseudorandom number generator is RANDU. This can be a very useful feature when debugging your own code. and M = 231 . The numbers generated by RANDU lie on only 15 hyperplanes in the 3dimensional unit cube (see ﬁgure 2.7 shows another cautionary example (taken from Ripley. . If the sequence of generated values X1 .
The data points lie on 15 hyperplanes.0 0.000 realisations of the RANDU pseudorandom number generator plotted in 3D.0 −2 log(X2k−1 ) sin(2πX2k ) 5 0 5 10 X2k 0. (b) Supposedly bivariate Gaussian pseudorandom numbers obtained using the pseudorandom numbers shown in panel (a). . Results obtained using a congruential generator with a = 1229. x3k−1 . A point corresponds to a triplet (x3k−2 . 5 0 5 −2 log(X2k−1 ) cos(2πX2k ) Figure 2.6. and M = 211 . .6 (a) 1.4 0.6 X2k−1 0. An Introduction to Monte Carlo Methods Figure 2. 100000.30 2. .8 1.0 0. 0.8 1. x3k ) for k = 1. . 300. c = 1.4 0.2 0.0 0.7.2 0.000 realisations of this congruential generator plotted in 2D.
d.d.d. ⊳ − −1 exp(−λx) for x ≥ 0. Let U ∼ U[0. is an increasing function. of X = F − (U ). Expo(λ) by applying the transformation − log(1 − U )/λ to a uniform U[0.1) that F − (u) ≤ x is equivalent to u ≤ F (x). 1] P(F − (U ) ≤ x) = P(U ≤ F (x)) = F (x). One of the simplest methods of generating random samples from a distribution with cumulative distribution function (c. in ﬁgure 3.1. F .f. 1 F (x) u F − (u) Figure 3. As U and 1 − U . 1]. The exponential distribution with rate λ > 0 has the c..f.4 we have seen how to create (pseudo)random numbers from the uniform distribution U[0. Example 3. inverse F − (u) = inf{x : F (x) ≥ u}.f.f.1 Transformation methods In section 2.d. however it is not necessarily continuous. thus F is the c.f.1 (Exponential Distribution). 1] and F be a c. and Reweighting 3. Fλ (x) = 1 − of course.f.. It is easy to see (e. then F − (u) = Theorem 3.f. Proof.d. Then F − (U ) has the c. Thus Fλ (u) = Fλ (u) = − log(1 − u)/λ. Rejection. Illustration of the deﬁnition of the generalised inverse F − x of a c.1 illustrates its deﬁnition. F The c.d.d.1 (Inversion Method).d. Thus for U ∼ U[0. have the same distribution we can use − log(U )/λ as well. Thus we deﬁne the generalised F −1 (u).) F (x) = P(X ≤ x) is based on the inverse of the c. 1] random variable U . Thus we can generate random samples from . Figure 3.g.f.Chapter 3 Fundamental Concepts: Transformation. If F is continuous.
d. X2 ) in the plane. The idea of transformation methods like the Inversion Method was to generate random samples from a distribution other than the target distribution and to transform them such that they come from the desired target distribution. Example 3. θ = 2πU2 .r2 ) (θ. as a point (X1 . In many situations. r2 ) = 1 1 1 1[0. Using U1 . . then its i. Fundamental Concepts: Transformation. 1). X2 = −2 log(U1 ) · sin(2πU2 ) ⊳ −2 log(U1 ). 2π]. 2π] and R2 ∼ Expo(1/2).2π] (θ) To obtain the probability density function of X1 = X2 = √ R2 · sin(θ) we need to use the transformation of densities formula. x2 ) = f(θ. is not even available in closed form. as required. Note however that the generalised inverse of the c.d. Rejection. The next two sections present two such ideas: rejection sampling and importance sampling.32 3.r2 ) (θ(x1 . An example of such a method is the BoxMuller method for generating Gaussian random variables.2 (BoxMuller Method for Sampling from Gaussians).X2 ) (x1 . U2 ∼ U[0. f(X1 .i. 2π]. Φ(·). However very few distributions possess a c. is just one possible transformation and that there might be other transformations that yield the desired distribution. are two independent realisations from a N(0. X1 and X2 are independent. and R2 ∼ Expo(1/2). whose c. and Reweighting The Inversion Method is a very efﬁcient tool for generating random numbers. nor its inverse has a closedform expression. x2 ). and R2 ∼ Expo(1/2). 1) distribution. Take the example of the Gaussian distribution. is This can be shown as follows.f.f. one faces the problem that neither the c. It turns out however. x2 )) · = as ∂x1 ∂θ ∂x2 ∂θ ∂x1 ∂r 2 ∂x2 ∂r 2 2 ∂x1 ∂θ ∂x2 ∂θ ∂x1 ∂r 2 ∂x2 ∂r 2 −1 = 1 1 exp − (x2 + x2 )2 2 4π 2 1 ·2 1 1 √ exp − x2 2 1 2π · 1 1 √ exp − x2 2 2 2π = −r sin(θ) r cos(θ) cos(θ) 2r sin(θ) 2r = − 1 r cos(θ)2 r sin(θ)2 = − 2r 2r 2 Thus we only need to generate θ ∼ U[0.f. Assume that θ ∼ U[0.d. that if we consider a pair X1 . X2 ∼ N(0. Then the joint density of (θ. whose (generalised) inverse can be evaluated efﬁciently. In these cases we have to ﬁnd other ways of correcting for the fact that we sample from the “wrong” distribution.d. we cannot ﬁnd such a transformation in closed form. 1] and example 3. r2 ) f(θ. θ) are again independent and have distributions we can easily sample from: θ ∼ U[0.d.d. r (x1 . X2 ∼ N(0.i.f.2π] (θ) · exp − r2 2π 2 2 √ R2 · cos(θ). Thus we cannot use the inversion method. Thus X1 . When sampling from the normal distribution. As their joint density factorises. polar coordinates (R. = 1 1 exp − r2 4π 2 · 1[0. i.1 √ we can generate R = R2 and θ by R= and thus X1 = −2 log(U1 ) · cos(2πU2 ). 1).
3. otherwise we reject it. u) : 0 ≤ u ≤ f (x)}.3.4 x 0 1 Figure 3. we sample independently X ∼ U[0.u) Thus f (x) can be interpreted as the marginal density of a uniform distribution on the area under the density f (x) {(x. .2. Example 3. ﬁlled circles denote accepted values. We keep the pair (X. For a. u 2. b > 1 the Beta(a. The simple idea underlying rejection sampling (and other Monte Carlo algorithms) is the rather trivial identity f (x) 1 f (x) = 0 1 du = 0 10<u<f (x) du =f (x. The Beta(a. 1] and U ∼ U[0. Using the above identity we can draw from Beta(3.3 (Sampling from a Beta distribution).2 Rejection sampling 33 3.2. b ≥ 0) has the density f (x) = where Γ (a) = +∞ a−1 t 0 Γ (a + b) a−1 x (1 − x)b−1 . We will sample from the light grey rectangle and only keep the samples that fall in the area under the curve. b) distribution (a. 5) by drawing from a uniform distribution on the area under the density {(x. u) : 0 < u < f (x)} (the area shaded in dark gray in ﬁgure 3. Assume that we want to sample from a target distribution whose density f is known to us. b) density is unimodal with mode (a − 1)/(a + b − 2). In order to sample from the area under the density.2 illustrates this idea. Figure 3. 5) density.2).2 shows the density of a Beta(3.3 we use a uniform distribution of the light grey rectangle as proposal distribution.4 1 The instrumental distribution is sometimes referred to as proposal distribution.4]. 5) distribution. Figure 3.1 and 2. U ) if U < f (X). It attains its maximum of 1680/729 ≈ 2.2 Rejection sampling The basic idea of rejection sampling is to sample from an instrumental distribution1 and reject samples that are “unlikely” under the target distribution. Illustration of example 3. we will use a similar trick as in examples 2. This suggests that we can generate a sample from f by sampling from the area under the curve.2 illustrates this idea. Mathematically speaking. Figure 3. The conditional probability that a pair (X. exp(−t) dt is the Gamma function. Γ (a)Γ (b) for 0 < x < 1. Empty circles denote rejected values. 2. In example 3. U ) is kept if X = x is P(U < f (X)X = x) = P(U < f (x)) = f (x)/2. Sampling from the area under the curve (dark grey) corresponds to sampling from the Beta(3.305 at x = 1/3.
If we know f only up to a multiplicative constant. provided π(x) < M · g(x) for all x. Given two densities f. we can generate a sample from f as follows: 1. However we might be able to bound the density of f (x) by M · g(x).(3. M (3. Then by analogy with (3. and Reweighting As X and U were drawn independently we can rewrite our algorithm as: Draw X from U[0.3) we have P(X ∈ X and is accepted) = P(X is accepted) = 1/(C · M ). ⊳ The method proposed in example 3.1) and thus2 P(X is accepted) = P(X ∈ S and is accepted) = yielding P(x ∈ X X is accepted) = P(X ∈ X and is accepted) = P(X is accepted) X 1 .1) . it cannot be directly applied to other distributions. as the density might be unbounded or have inﬁnite support.34 3. M · g(X) otherwise go back to step 1. 1) distribution with density x2 1 f (x) = √ exp − 2 2π using a Cauchy distribution with density g(x) = 1 π(1 + x2 ) . Assume we want to sam ple from the N(0. We have P(X ∈ X and is accepted) = g(x) X f (x) M · g(x) =P(X is acceptedX=x) dx = X f (x) dx . Proof. Algorithm 3. 1] and accept X with probability f (X)/2. we can carry out rejection sampling using as probability of rejecting X. Fundamental Concepts: Transformation.4 (Rejection sampling from the N(0.1 (Rejection sampling). Draw X ∼ g 2.e. X (3. Remark 3. otherwise reject X. Rejection. i. M (3. where g(x) is a density that we can easily sample from.2.3) Thus the density of the values accepted by the algorithm is f (·). 1) distribution using a Cauchy proposal). Accept X as a sample from f with probability f (X) .3 is based on bounding the density of the Beta distribution by a box.2) f (x) dx/M = 1/M f (x) dx. Whilst this is a powerful idea. g with f (x) < M · g(x) for all x. C ·M f (x) dx/(C · M ) = 1/(C · M ) f (x) dx X Example 3.4. if we only know π(x). where f (x) = C · π(x). and thus P(x ∈ X X is accepted) = X π(X) M · g(X) g(x) X π(x) dx = M · g(x) X π(x) dx = M X f (x) dx .
i. Importance sampling is based on the identity P(X ∈ A) = f (x) dx = A A g(x) f (x) dx = g(x) =:w(x) g(x)w(x) dx A (3. a. Xn ∼ g. 1). n→∞ R We denote by S the set of all possible values X can take. . Note that it is impossible to do rejection sampling from a Cauchy distribution using a N(0.s.5) 1 n 2 3 n i=1 w(Xi )h(Xi ) −→ Ef (h(X)).3 Importance sampling In rejection sampling we have compensated for the fact that we sampled from the instrumental distribution g(x) instead of f (x) by rejecting some of the values proposed by g(x). provided Eg w(X) · h(X) exists. . We can generalise this identity by considering the expectation Ef (h(X)) of a measurable function h: Ef (h(X)) = S f (x)h(x) dx = S g(x) f (x) h(x) dx = g(x) =:w(x) S g(x)w(x)h(x) dx = Eg (w(X) · h(X)). . S f (x)dx = 1. 1) distribution as instrumental distribution: there is no M ∈ R such that 1 1 x2 <M·√ exp − 2) π(1 + x 2 2πσ 2 as instrumental distribution. Then. . 3. ⊳ the Cauchy distribution has heavier tails than the Gaussian distribution..4) for all g(·). There is not much point is using this method is practise.s. (3. Figure 3. n→∞ (by the Law of Large Numbers) and thus by (3. ﬁlled circles correspond to accepted values whereas open circles correspond to rejected values. such that g(x) > 0 for (almost) all x with f (x) > 0.3. w(Xi )h(Xi ) −→ Eg (w(X) · h(X)) a. Importance sampling is based on the idea of using weights to correct for the fact that we sample from the instrumental distribution g(x) instead of the target distribution f (x).5) if g(x) > 0 for (almost) all x with f (x) · h(x) = 0.e. . 1) density.3 The smallest M we can choose such that f (x) ≤ M g(x) is M = √ 2π · exp(−1/2).3 Importance sampling 35 M · g(x) f (x) −6 −5 −4 −3 −2 −1 1 2 3 4 5 6 Figure 3. The BoxMuller method is more efﬁcient. Sampling from the area under the density f (x) (dark grey) corresponds to sampling from the N(0.3 illustrates the results. . 1 n n i=1 Assume we have a sample X1 . As before. Illustration of example 3.3.3. The proposal g(x) is a Cauchy(0.
3 (Bias and Variance of Importance Sampling). . 35) Varg (w(Xi )h(Xi )) = i=1 Varg (w(X)h(X)) n Note that the theorem implies that in contrast to µ the selfnormalised estimator µ is biased. . Set w(Xi ) = 2. i=1 1.4 On the other hand. as we ˆ have seen in the proof of theorem 3.3 it is a lot harder to analyse the theoretical properties of the selfnormalised estimator µ. p. Fundamental Concepts: Transformation. . (a) Eg (˜) = µ µ Varg (w(X) · h(X)) n µVarg (w(X)) − Covg (w(X). The selfnormalised ˜ ˆ estimator µ however might have a lower variance.4) and (3. Return either µ= ˆ or i. Assume f (x) = C · π(x). (a) Eg n 1 n n w(Xi )h(Xi ) i=1 = 1 n n n Eg (w(Xi )h(Xi )) = Ef (h(X)) i=1 (b) Varg 1 1 w(Xi )h(Xi ) = 2 n i=1 n (c) and (d) see (Liu. . wn (X) do not necesn sarily sum up to n. as it is often the case in hierarchical Bayesian modelling. . .2 (Importance Sampling). one typically only considers choices of g that lead to ﬁnite variance estimators. and Reweighting In other words. 2001. so one might want to consider the selfnormalised version µ= ˆ This gives rise to the following algorithm: Algorithm 3. the weights w1 (X). w(X) · h(X)) + O(n−2 ) (c) Eg (ˆ) = µ + µ n Varg (w(X) · h(X)) − 2µCovg (w(X). it has another advantage: we only need to know ˆ the density up to a multiplicative constant. n i=1 w(Xi )h(Xi ) n i=1 w(Xi ) n i=1 w(Xi )h(Xi ) n The following theorem gives the bias and the variance of importance sampling. Generate Xi ∼ g. f (Xi ) g(Xi ) . . µ= ˜ Theorem 3. The following two conditions are each sufﬁcient (albeit rather restrictive) for a ﬁnite variance of µ: ˜ 4 Although the above equations (3. . we can estimate µ = Ef (h(X)) by µ= ˜ Note that whilst Eg (w(X)) = f (x) g(x) S g(x) 1 n n w(Xi )h(Xi ) i=1 S dx = f (x) = 1. ˆ sampling algorithm converges for a large choice of such g. the selfnormalised estimator µ does not depend on the normalisation constant C. Rejection. 1 n i=1 w(Xi ) w(Xi )h(Xi ). For i = 1.36 3.e.5) hold for every g with supp(g) ⊃ supp(f · h) and the importance By complete analogy one can show that is enough to know g up to a multiplicative constant. n: ii. In addition. Choose g such that supp(g) ⊃ supp(f · h). w(X) · h(X)) + µ2 Varg (w(X)) + O(n−2 ) (d) Varg (ˆ) = µ n (b) Varg (˜) = µ Proof. n π(Xi ) i=1 g(Xi ) i. then µ= ˆ n i=1 w(Xi )h(Xi ) n i=1 w(Xi ) = n f (Xi ) i=1 g(Xi ) h(Xi ) n f (Xi ) i=1 g(Xi ) = n C·π(Xi ) i=1 g(Xi ) h(Xi ) n C·π(Xi ) i=1 g(Xi ) = n π(Xi ) i=1 g(Xi ) h(Xi ) . .
f is bounded above on S. Further we need to be able to draw samples from g ⋆ efﬁciently. When plugging in g ⋆ we obtain: h(x)2 · f (x)2 dx · h(x)f (x) Eg ⋆ = S h(x)2 · f (x)2 dx = g ⋆ (x) 2 S S h(t)f (t) dt = S h(x)f (x) dx h(X)·f (X) g(X) 2 2 On the other hand. When using µ we need to know the normalisation ˜ constant of g ⋆ . i. An important corollary of theorem 3. h(t)f (t) dt S 2 n·Varg (˜) = Varg (w(X) · h(X)) = Varg µ h(X) · f (X) g(X) = Eg h(X) · f (X) g(X) 2 − Eg h(X) · f (X) g(X) =Eg (˜ )=µ µ . for which choice Var(˜) is minimal. . which is exactly the integral we are looking for. we can apply the Jensen inequality 5 to Eg h(X) · f (X) g(X) 2 yielding 2 Eg ≥ Eg h(X) · f (X) g(X) = S h(x)f (x) dx . So far we have only studied whether a distribution g leads to a ﬁnitevariance estimator. Using g ⋆ instead of f causes us to focus on regions of high probability where h is large. Unless h(X) is (almost surely) constant the inequality is strict. + h(Xn ) n 2 S = Ef (h(X)2 ) − µ2 2 ≥ (Ef h(X)) − µ2 = h(x)f (x) dx − µ2 = n · Varg⋆ (˜) µ by Jensen’s inequality. Theorem 3.5 (Computing Ef X for X ∼ t3 ). Three different schemes are considered 5 If X is realvalued random variable. then ψ(E(X)) ≤ E(ψ(X)). There is an intuitive explanation to the superefﬁciency of importance sampling.3 (b) that h(x)f (x) . which contribute most to the integral Ef (h(X)). a rather formal optimality result.4 is that we should choose an instrumental distribution g whose shape is close to the one of f · h. Assume we want to compute Ef X for X from a tdistribution with 3 degrees of freedom (t3 ) using a Monte Carlo method. The proposal distribution g that minimises the variance of µ is ˜ – f (x) < M · g(x) and Varf (h(X)) < +∞. Example 3. Note that under the ﬁrst condition rejection sampling can also be used to sample from f . when using the optimal g ⋆ from theorem 3. Thus we only have to minimise Eg h(X) · f (X) g ⋆ (X) 2 h(X)·f (X) g(X) 2 . The practically important corollary of theorem 3.4 (Optimal proposal). . and ψ a convex function. This leads to the question which instrumental distribution is optimal. g ∗ (x) = Proof. i.4 the variance of µ is less than the variance obtained when sampling directly from f : ˜ n · Varf h(X1 ) + .3.4 is that importance sampling can be superefﬁcient.3 Importance sampling 37 – S is compact. however. . We have from theroem 3. The following theorem µ answers this question: Theorem 3. and g is bounded below on S.4 is.e.e.
Figure 3. So using a t1 distribution as instrumental distribution is superefﬁcient (see ﬁgure 3. . Indeed.6 (Partially labelled data). as the instrumental distribution (N(0. – Alternatively we could use importance sampling using a t1 (which is nothing other than a Cauchy distribution) as instrumental distribution.5. in Note that the third choice yields weights of inﬁnite variance. after 1500 iterations the empirical standard deviation (over 100 realisations) of the direct estimate is 0. which is larger than the empirical standard deviation of µ when using ˜ a t1 distribution as instrumental distribution. such that Yi ∼ Poi(λ1 ) Yi ∼ Poi(λ2 ) for the remaining ten observations. As the large weights. This however does not imply that this yields an estimate of the xf (x) dx of minimal variance. The t1 distribution has heavy enough tails. 1)) 0. Xn directly from t3 and estimating Ef X by 1 n i=1 nXi . this choice integral xf (x) dx. 0. we will consider importance sampling using a N(0. causing the jumps of the estimate µ shown in ﬁgure 3.5.e. The righthand panel of ﬁgure 3. i. 1)) has lighter tails than the distribution we want to sample from (t3 ).2 0. Suppose that we are given count data from observations in two groups. explaining the small variance of the estimate µ when using a t1 ˜ ⊳ N(0.4 shows.4 x · f (x) (Target) f (x) (direct sampling) gt1 (x) (IS t1 ) gN(0. – Third. β) distribution as (conjugate) prior distribution for λj . 1) distribution as instrumental distribution. as ﬁgure 3. Note that only the ﬁrst ten observations are labelled. The idea behind this choice is that the density gt1 (x) of a t1 distribution is closer to f (x)x.6 somewhat explains why the t1 distribution is a far better choice than the N(0. and Reweighting – Sampling X1 .3 4 2 0 2 4 Figure 3. Thus large x get Example 3. Rejection. where f (x) is the density of a t3 distribution.5). the weight tends to inﬁnity as x → +∞.0182. Fundamental Concepts: Transformation. which is 0.0345. Illustration of the different instrumental distributions x example 3. so the weights are small for large values of x. clearly minimises the variance of the weights. the prior density of λj is if the ith observation is from group 1 if the ith observation is from group 2 The data is given in the table 3.1 0. We will use a Gamma(α. 1) distribution does not have heavy enough tails.1. ˜ distribution as instrumental distribution.5 illustrates that this choice yields a very poor estimate of the integral Sampling directly from the t3 distribution can be seen as importance sampling with all weights wi ≡ 1. 1) distributon. . . .38 3.1) (x) (IS N(0.0 0.4. the group label is missing .
3.3 Importance sampling
39
Sampling directly from t3 3.0
IS using t1 as instrumental distribution
IS using N(0, 1) as instrumental distribution
IS estimate over time
0.0
0.5
1.0
1.5
2.0
2.5
0
500
1000
1500 0
500 Iteration
1000
1500 0
500
1000
1500
Figure 3.5. Estimates of EX for X ∼ t3 obtained after 1 to 1500 iterations. The three panels correspond to the three different
sampling schemes used. The areas shaded in grey correspond to the range of 100 replications.
Sampling directly from t3 3.0
IS using t1 as instrumental distribution
IS using N(0, 1) as instrumental distribution
Weights Wi
0.0
0.5
1.0
1.5
2.0
2.5
4
2
0
2
4
4
2
0
2
4
4
2
0
2
4
Sample Xi from the instrumental distribution
Figure 3.6. Weights Wi obtained for 20 realisations Xi from the different instrumental distributions.
40
3. Fundamental Concepts: Transformation, Rejection, and Reweighting
Group 1 1 1 1 1
Count Yi 3 6 3 5 9
Group 2 2 2 2 2
Count Yi 14 12 11 19 18
Group * * * * *
Count Yi 15 4 1 6 11
Group * * * * *
Count Yi 21 11 3 7 18
Table 3.1. Data of example 3.6.
f (λj ) =
1 α λα−1 βj exp(−βλj ). Γ (α) j
Furthermore, we believe that a priori each observation is equally likely to stem from group 1 or group 2. We start with analysing the labelled data only, ignoring the 10 unlabelled observations. In this case, we can analyse the two groups separately. In group 1 we have that the joint distribution of Y1 , . . . , Y5 , λ1 is given by f (y1 , . . . , y5 , λ1 ) = f (y1 , . . . , y5 λ1 )f (λ1 ) = = 1 · exp(−λ1 )λyi 1 yi ! i=1
5
·
1 λα−1 β α exp(−βλ) Γ (α) 1
P P 1 α+ 5 yi α+ 5 yi α i=1 i=1 exp(−(β + 5)λ1 ) β exp(−(β + 5)λ1 ) ∝ λ1 λ1 5 yi ! Γ (α) i=1
The posterior distribution of λ1 given the data from group 1 is f (λ1 y1 , . . . , y5 ) = ∝λ1 f (y1 , . . . , y5 , λ1 ) ∝ f (y1 , . . . , y5 , λ1 ) f (y1 , . . . , y5 , λ) dλ λ
α+ P5
i=1
yi
exp(−(β + 5)λ1 )
Comparing this to the density of the Gamma distribution we obtain that
5
λ1 Y1 , . . . , Y5 ∼ Gamma α + and similarly
yi , β + 5 ,
i=1
10
λ2 Y6 , . . . , Y10 ∼ Gamma α +
yi , β + 5 .
i=6
Thus, when only using the labelled data, we do not have to resort to Monte Carlo methods for ﬁnding the posterior distribution. This however is not the case any more once we also want to include the unlabelled data. The conditional density of Yi λ1 , λ2 for an unlabelled observation (i > 10) is f (yi λ1 , λ2 ) = 1 exp(−λ1 )λyi 1 exp(−λ2 )λyi 1 2 + 2 yi ! 2 yi !
The posterior density for the entire sample (using both labelled and unlabelled data) is f (λ1 , λ2 y1 , . . . , y20 )∝ f (λ1 )f (y1 , . . . , y5 λ1 ) f (λ2 )f (y6 , . . . , y10 λ2 ) · f (y11 , . . . , y20 λ1 , λ2 )
∝f (λ1 y1 ,...,y5 ) ∝f (λ2 y6 ,...,y10 ) 20 = Q20
i=11
f (yi λ1 ,λ2 )
∝ f (λ1 y1 , . . . , y5 )f (λ2 y6 , . . . , y10 )
i=11
f (yi λ1 , λ2 )
as instrumental distributions, i. e. use
This suggests using importance sampling with the product of the distributions of λ1 Y1 , . . . , Y5 and λ2 Y6 , . . . , Y10 g(λ1 , λ2 ) = f (λ1 y1 , . . . , y5 )f (λ2 y6 , . . . , y10 ).
The target distribution is f (λ1 , λ2 y1 , . . . , y20 ), thus the weights are
3.3 Importance sampling
41
w(λ1 , λ2 ) = ∝ =
f (λ1 , λ2 y1 , . . . , y20 ) g(λ1 , λ2 ) f (λ1 y1 , . . . , y5 )f (λ2 y6 , . . . , y10 ) i=11 f (yi λ1 , λ2 ) f (λ1 y1 , . . . , y5 )f (λ2 y6 , . . . , y10 )
20 20 i=11 20
(3.6)
f (yi λ1 , λ2 ) =
i=11
1 exp(−λ2 )λyi 1 exp(−λ1 )λyi 1 2 + 2 yi ! 2 yi !
Thus we can draw a weighted sample of size n from the distribution of f (λ1 , λ2 y1 , . . . , y20 ) by repeating the three steps below n times: 1. Draw λ1 ∼ Gamma α + 2. Draw λ2 ∼ Gamma α +
5 i=1 10 i=6
yi , β + 5 yi , β + 5
3. Compute the weight w(λ1 , λ2 ) using equation (3.6). From a simulation with n = 50, 000 I obtained 4.4604 as posterior mean of λ1 and 14.5294 as posterior mean of λ2 . The posterior densities are shown in ﬁgure 3.7. ⊳
replacements
Posterior density of λ1
All data Labelled data
Posterior density of λ2
All data Labelled data
0.5
0.4
Density
0.3
Density 2 4
0.2
0.1
0.00
0.0
0.05
0.10
0.15
0.20
0.25
0.30
6
8
10
10
15
20
N = 50000 Bandwidth = 0.1036
N = 50000 Bandwidth = 0.1742
Figure 3.7. Posterior distributions of λ1 and λ2 in example 3.6. The dashed line is the posterior density obtained only from the labelled data.
Fundamental Concepts: Transformation. Rejection.42 3. and Reweighting .
. . yi . . βj ) distribution with density f (λj ) = 1 α −1 α λ j βj j exp(−βj λj ). . . M 1 ∼ Gamma α1 + ∼ Gamma α2 + yi .1 Yi ∼ Poi(λ1 ) Yi ∼ Poi(λ2 ) for for i = 1. and M is f (y1 . . Γ (αj ) j The joint distribution of Y1 . However. We will discuss methods that allow obtaining an approximate sample from f without having to sample from f directly. . . . Yn .1 Introduction In section 3. . λ2 . .Chapter 4 The Gibbs Sampler 4. β1 + M i=1 n i=M +1 (4. yn . .1 (Poisson change point model). . . M ) ∝ λ1 1 so M α −1+ PM i=1 yi exp(−(β1 + M )λ1 ).3 we have seen that. . . Yn . the posterior distribution of λ1 has the density f (λ1 Y1 . (4.1) . using importance sampling. Yn . λ1 . . . . . Example 4. Yn . Assume the following Poisson model of two regimes for n random variables Y1 . n A suitable (conjugate) prior distribution for λj is the Gamma(αj . λ1 Y1 . λ1 . . . we will discuss methods which generate a Markov chain whose stationary distribution is the distribution of interest f . . Yn . M λ2 Y1 . Γ (α1 ) 1 Γ (α2 ) 2 If M is known. . λ2 . . M i = M + 1. . . . . In this chapter and the following chapters we will use a somewhat different approach. M ) = exp(−λ1 )λyi 1 yi ! i=1 M n · i=M +1 exp(−λ2 )λyi 2 yi ! 1 1 α α · λα1 −1 β1 1 exp(−β1 λ1 ) · λα2 −1 β2 2 exp(−β2 λ2 ). we can approximate an expectation Ef (h(X)) without having to sample directly from f . . Such methods are often referred to as Markov Chain Monte Carlo (MCMC) methods. especially in large dimensions. β2 + n − M . ﬁnding an instrumental distribution which allows us to efﬁciently estimate Ef (h(X)) can be difﬁcult.2) The probability distribution function of the Poi(λ) distribution is p(y) = exp(−λ)λy y! . More mathematically speaking.
. It is easy to compute the distribution of M given the observations Y1 . Note that the set of marginal distributions does not have this property. (1995) proposed the following algorithm that yields a reversible chain. . . .1) to (4. . is that they fully specify the joint distribution. .. . . . Draw Xj ∼ fXj X−j (·X1 . Xj+1 . Xp (t) (t−1) ).44 4. . Note that the Gibbs sampler is not reversible. . Starting with (X1 . Algorithm 4. . . The conditional distributions as used in the Gibbs sampler are often referred to as full conditionals. . Xp ) iterate for t = 1.2 Algorithm The Gibbs sampler was ﬁrst proposed by Geman and Geman (1984) and further developed by Gelfand and Smith (1990). p} (e. . 4.3) The conditional distributions in (4.. . . . This tentative strategy however raises some questions. and set Xι (t) := Xι (t−1) for all ι = j. which the Gibbs sampler is based on. 2.3) are all easy to sample from. 2 Hammersley and Clifford actually never published this result.4.1 illustrates the Gibbs sampler. 2. . M ). λ2 ) ∝ λ1 PM i=1 1}. Xj−1 . . Will this approach yield a Markov chain with the correct invariant distribution? Will the Markov chain converge to the invariant distribution? As we will see in sections 4. .3 and 4. . . Draw Xp ∼ fXp X−p (·X1 . Xp−1 ).2 (Random sweep Gibbs sampler). . not the past values. . . It is however rather difﬁcult to sample from the joint posterior of (λ1 . . (0) (0) . The next section will however ﬁrst of all state the Gibbs sampling algorithm. as they could not extend the theorem to the case of nonpositivity. . . . It is a discrete yi · λ2 Pn i=M +1 yi · exp((λ2 − λ1 ) · M ) (4. Xp (t) (t) (t) (t−1) ). Draw Xj (t) ∼ fXj X−j (·X1 (t−1) .1 ((Systematic sweep) Gibbs sampler). Xp ) iterate for t = 1. Draw X1 ∼ fX1 X−1 (·X2 (t) (t) (t−1) . The Gibbs Sampler Now assume that we do not know the change point M and that we assume a uniform prior on the set {1. . M − distribution with probability density function proportional to p(M Y1 . . Algorithm 4. xp ). . . . . p. Xp (t−1) (t−1) (t−1) ). Yn .3) in the example). .. uniform) 2. . . λ1 . . Figure 4. . . (t−1) j. . . the answer to both questions is — under certain conditions — yes. Liu et al. . λ2 . . . (0) (0) (t) 1. as Hammersley and Clifford proved in 19702 . . ⊳ The example above suggests the strategy of alternately sampling from the (full) conditional distributions ((4. – Is the joint distribution uniquely speciﬁed by the conditional distributions? – Sampling alternately from the conditional distributions yields a Markov chain: the newly proposed values only depend on the present values. Draw an index j from a distribution on {1.1) to (4. . xi+1 . Xj−1 .. . Yn . . . .g. . Starting with (X1 . . . .3 The HammersleyClifford Theorem An interesting property of the full conditionals. 1. . Xj+1 . Denote with x−i := (x1 . 4. . . . . and λ1 and λ2 . xi−1 . .
. . .3 The HammersleyClifford Theorem 45 (X1 . . .. xp−1 ) fX1 X−1 (x1 ξ2 . . xp ) > 0. . . . . . . . .. . .5) fXp X−p (xp x1 . . .1 (Positivity condition). . .1. xp−1 ) (4. thus f (x1 . . . X2 ) (X1 . X2 ) (6) (6) (2) (1) (X1 . X2 ) (0) (0) (X1 . . . . Xp ) satisfy the positivity condition and have joint density f (x1 . Then for all (ξ1 . Let (X1 . . X2 ) (1) (0) (3) (3) (X1 .4. . . . . xp−1 )f (x1 . xp implies that f (x1 . . X2 ) (2) (2) (X1 . . ..4) (4. . . .. X2 ) (1) (1) (3) (2) (X1 . We have fXj X−j (xj x1 . xp−1 ) The positivity condition guarantees that the conditional densities are nonzero. . ξp ) ∈ supp(f ) f (x1 . ξj+1 . . . . . . . . . . . . A distribution with density f (x1 . xp−1 ) = f (x1 . . xp−1 . xp ) (4. . X2 ) (X1 .xp−1 . . X2 ) (6) (5) (X1 . . . . .. Theorem 4.xp−1 ) = = = f (x1 . . X2 ) (5) (5) (X1 .. . . . . xp−1 ) fXp X−p (ξp x1 . ξp ) . .. ξj+1 . xp ). In Bayesian modeling such problems mostly arise when using improper prior distributions. xp−1 ) fX X (xp x1 . X2 ) (4) (4) (X1 . . . xp ) ∝ Proof. xj−1 . . . .. X2 ) (4) (3) X1 (t) Figure 4. f (ξ1 . . . . as the following example shows. . . xp−1 )f (x1 . . . X2 ) (X1 . . ξp ) fXp X−p (ξp x1 . . . . . xp ) = fXp X−p (xp x1 .. . xp ) and marginal densities fXi (xi ) is said to satisfy the positivity condition if fXi (xi ) > 0 for all x1 . . Note that the HammersleyClifford theorem does not guarantee the existence of a joint probability distribution for every choice of conditionals. xp−1 ) and by complete analogy f (x1 . . . . . . . xp−1 . . . The positivity condition thus implies that the support of the joint density f is the Cartesian product of the support of the marginals fXi . . . . . . ξp ) f j=1 Xj X−j j 1 (4. xp−1 ). . Illustration of the Gibbs sampler for a twodimensional distribution Deﬁnition 4. . . . xj−1 .5) = f (x1 .4) p f (x1 . . ξp ) · · · p −p fX1 X−1 (ξ1 ξ2 . X2 ) (5) (4) X2 (t) (X1 . .2 (HammersleyClifford). . . ξp ) fXp X−p (xp x1 . xp−1 .ξp )/fXp X−p (ξp x1 . . .. . .. . . . . ξp ) = fXp X−p (ξp x1 . ξp ) (ξ x . . . . .
. . x(t) ) = fX1 X−1 (x1 x2 (t) (t−1) .3. . . Very similar results hold for the random scan Gibbs sampler (algorithm 4. we obtain f (x1 . Trying to apply the HammersleyClifford theorem. . p corresponds to step 1. xp ) is indeed the stationary distribution of the Markov chain generated by the Gibbs sampler3 .46 4. for which it would be easy to design a Gibbs sampler. . of the algorithm corresponds to step 2. x(t−1) ) · fX2 X−2 (x2 x1 . x(t−1) ) · . xp−1 ) p Proof. . . . 4.2). . thus there is no twodimensional probability distri bution with f (x1 . The transition kernel of the Gibbs sampler is K(x(t−1) . . x(t−1) ) · . . . of the algorithm 3 All the results in this section will be derived for the systematic scan Gibbs sampler (algorithm 4. . x(t−1) ) · fX2 X−2 (x2 x1 .2. x3 p (t) (t) (t) (t) (t−1) . We have P(X(t) ∈ X X(t−1) = x(t−1) ) = = X X f(Xt X(t−1) ) (x(t) x(t−1) ) dx(t) (t) (t) (t−1) fX1 X−1 (x1 x2 (t) (t−1) . . . . . . . Lemma 4. . . . x2 ) as its density. x3 p dx (t) . . x2 ) ∝ The integral fX1 X2 (x1 ξ2 ) · fX2 X1 (x2 x1 ) λξ2 exp(−λx1 ξ2 ) · λx1 exp(−λx1 x2 ) = ∝ exp(−λx1 x2 ) fX1 X2 (ξ1 ξ2 ) · fX2 X1 (ξ2 x1 ) λξ2 exp(−λξ1 ξ2 ) · λx1 exp(−λx1 ξ2 ) ⊳ exp(−λx1 x2 ) dx1 dx2 however is not ﬁnite. . of the algorithm · (t) (t) fXp X−p (x(t) x1 . .4 Convergence of the Gibbs sampler First of all we have to analyse whether the joint distribution f (x1 . . . . . For this we ﬁrst have to determine the transition kernel corresponding to the Gibbs sampler. The Gibbs Sampler Example 4. . .1). Consider the following “model” X1 X2 X2 X1 ∼ Expo(λX2 ) ∼ Expo(λX1 ). p ·fXp X−p (x(t) x1 . xp−1 ) p corresponds to step p. .
. .. .x3 . .. .. x(t) ) dx(t−1) = ··· f (x1 (t−1) . X(1) .. .27 f is indeed the invariant distribution. .. . The joint distribution f (x1 . .. . xp−1 )dx2 p p (t) (t) (t−1) .xp (t−1) ) (t) (t) (t−1) = ··· f (x1 . .. .. . .xp−1 ) =f (x1 . .x2 . . xp−1 ) p p p =f (x1 . . dx(t−1) p =f (x2 . .. x(t) ) p Thus according to deﬁnition 1. x(t−1) ) · · · fXp X−p (x(t) x1 .. .. .xp (t−1) ) f (x1 . . . ... . f (x(t−1) )K(x(t−1) . x(t−1) ) dx(t−1) fXp X−p (x(t) x1 .. x(t−1) ) dx2 p . .. . x2 (t) (t−1) ..xp (t−1) (t−1) fX2 X−2 (x2 x1 . . .xp (t−1) ) (t) (t−1) =f (x1 .. .. . .. .Proposition 4.x2 . . Proof. .) generated by the Gibbs sampler. . .xp ) (t) (t) (t) (t) (t) (t) = f (x1 .4 Convergence of the Gibbs sampler 47 . .. . x3 .x3 (t) (t−1) ) (t) (t) (t−1) = . . dx(t−1) p =f (x1 . = (t) (t) =f (x1 . x(t−1) ) dx1 p (t−1) (t−1) fX1 X−1 (x1 x2 (t) (t−1) .. .4. .. xp ) is indeed the invariant distribution of the Markov chain (X(0) . xp−1 )dx3 p p (t) (t) (t−1) . . (t) 4. .. xp−1 .. .. x(t−1) ) · · · fXp X−p (x(t) x1 . . . . .
. . −1) ≤ 1}. i. As f is the unique invariant distribution of the Markov chain. Next. we have to analyse under which conditions the Markov chain generated by the Gibbs sampler will converge to f . Illustration of a Gibbs sampler failing to sample from a distribution with unconnected support (uniform distribution on {(x1 . ⊳ The following proposition gives a sufﬁcient condition for irreducibility (and thus the recurrence) of the Markov chain generated by the Gibbs sampler. Proof. It is easy to that when the Gibbs sampler is started in C2 it will stay there and never reach C1 .3 (Reducible Gibbs sampler). f (x1 .e. The Gibbs Sampler So far we have established that f is indeed the invariant distribution of the Gibbs sampler. 1) ≤ 1} and C2 := {(x1 . . . There are less strict conditions for the irreducibility and aperiodicity of the Markov chain generated by the Gibbs sampler (see e. .11). 4 Here and in the following we understand by “irreducibilty” irreducibility with respect to the target distribution f . .g. recurrent Markov chain.28). . p p >0 (on a set of nonzero measure) (t) (t) >0 (on a set of nonzero measure) where the conditional densities are nonzero by the positivity condition. x2 ) 2π 1 2 Figure 4. Example 4. x(t−1) ) · · · fXp X−p (x(t) x1 . −1) ≤ 1}) for this is that the conditional distribution X2 X1 (X1 X2 ) is for X1 < 0 (X2 < 0) entirely concentrated on C2 . (t) (t−1) (t) (t) (t) (t) K(x(t−1) . . x2 ) = 1 IC ∪C (x1 . x2 ) − (1. Let X ⊂ supp(f ) be a set with X f (x1 . x2 ) − (−1. . xp−1 ) dx(t) > 0.5. Proposition 4. Consider Gibbs sampling from the uniform distribution on C1 ∪ C2 with C1 := {(x1 .2 shows the density as well the ﬁrst few samples obtained by starting a Gibbs sampler with X1 (0) X2 (0) < 0 and < 0. xp )d(x1 . The reason X2 (t) 2 2 1 0 1 2 1 0 (t) X1 1 2 Figure 4. x2 ) : (x1 . Thus the Markov Chain (X(t) )t is strongly f irreducible. x(t) )dx(t) = X X fX1 X−1 (x1 x2 . x2 ) : (x1 . . . x2 ) − (−1.2. 2004. . . . . The following example shows that this does not need to be the case. 1) ≤ 1} ∪ {(x1 . the Gibbs sampler yields an irreducible. .48 4. . First of all we have to study under which conditions the resulting Markov chain is irreducible4 . x2 ) : (x1 . If the joint distribution f (x1 . x2 ) : (x1 . xp ) satisﬁes the positivity condition. . it is as well recurrent (proposition 1. Robert and Casella. Lemma 10. x2 ) − (1. . xp ) > 0. . .
Draw X1 ∼ N(µ1 + σ12 /σ2 (X2 5 (t) (t−1) 2 2 − µ2 ). − µ1 µ2 ∝ exp − i. If the chain is Harris recurrent. Zp ∼ N (0. Now we have established all the necessary ingredients to state an ergodic theorem for the Gibbs sampler.e.4. .d.i. σ1 − (σ12 )2 /σ2 ) 6 A Gibbs sampler is of course not the optimal way to sample from a Np (µ. 2004. Theorem 4.g. we need the conditional distributions X1 X2 = x2 and X2 X1 = x1 . x2 ) − σ12 2 σ2 −1 x1 x2 σ12 .5 The marginal distributions are 2 X1 ∼ N(µ1 . 2 (x1 − (µ1 + σ12 /σ22 (x2 − µ2 )))2 2 − (σ )2 /σ 2 ) 2(σ1 12 2 2 2 2 X1 X2 = x2 ∼ N(µ1 + σ12 /σ2 (x2 − µ2 ). Robert and Casella. . . 2 σ1 σ12 2 σ2 σ12 distribution. . Z1 . . 2 1. We have6 1 ∝ exp − 2 x1 x2 µ1 µ2 ′ 2 σ1 f (x1 . 2. . . A more efﬁcient way is: draw i. . Theorem 4. σ1 ) and 2 X2 ∼ N(µ2 . . Zp )′ + µ We make use of ! !!′ !−1 ! !! 2 x1 µ1 σ1 σ12 x1 µ1 − − 2 x2 µ2 σ12 σ2 x2 µ2 ! !!′ ! ! !! 2 x1 µ1 σ2 −σ12 x1 µ1 1 − = − 2 2 2 σ1 σ2 − (σ12 )2 x2 µ2 −σ12 σ1 x2 µ2 ` 2 ´ 1 σ2 (x1 − µ1 )2 − 2σ12 (x1 − µ1 )(x2 − µ2 ) + const = 2 2 σ1 σ2 − (σ12 )2 ` 2 2 ´ 1 2 = σ2 x1 − 2σ2 x1 µ1 − 2σ12 x1 (x2 − µ2 ) + const 2 2 σ1 σ2 − (σ12 )2 ` 2 ´ 1 2 x1 − 2x1 (µ1 + σ12 /σ2 (x2 − µ2 )) + const = 2 2 σ1 − (σ12 )2 /σ2 ` ´2 1 2 = x1 − (µ1 + σ12 /σ2 (x2 − µ2 ) + const 2 2 σ1 − (σ12 )2 /σ2 . Σ) distribution. then for any integrable function h : E → R 1 n→∞ n lim n t=1 h(X(t) ) → Ef (h(X)) for almost every starting value X(0) .9). .30. Example 4. σ1 − (σ12 )2 /σ2 ) Thus the Gibbs sampler for this problem consists of iterating for t = 1. If the Markov chain generated by the Gibbs sampler is irreducible and recurrent (which is e. Assume that we want to use a Gibbs sampler to estimate the probability P(X1 ≥ 0. . then the above result holds for every starting value X(0) . Lemma 10. σ2 ) In order to construct a Gibbs sampler. X2 ≥ 0) for a N2 µ1 µ2 . Xp )′ = Σ 1/2 (Z1 . which is a direct consequence of theorem 1. . 1) and set (X1 .6 guarantees that we can approximate expectations Ef (h(X)) by their empirical counterparts using a single Markov chain. .g. . then recurrence even implies Harris recurrence (see e.6. . the case when the positivity condition holds).4 Convergence of the Gibbs sampler 49 If the transition kernel is absolutely continuous with respect to the dominating measure.4.
Figure 4.4. and are thus not independent.5 0 2000 4000 6000 8000 10000 t range of 100 replications.5 shows the results obtained when sampling from a bivariate Normal distribution as in example 4. X2 0. Figure 4.) form a Markov chain. This makes subsequent samples Xj (t−1) and Xj highly correlated and thus yields to a slower convergence. Example 4.3 (τ ) (τ ) (τ ) 0.4 shows the sample paths of (t) (t) this Gibbs sampler.99. σ2 − (σ12 )2 /σ1 ). For the Gibbs sampler this is typically the case if the variables Xj are strongly (positively or negatively) correlated.4 (and displayed in ﬁgure 4.1 0. as the following example shows.5).6 we can estimate P(X1 ≥ 0. X1 (τ ) ≥ 0. Estimate of the P(X1 ≥ 0. X2 ) with X1 ≥ 0 and X2 ≥ 0.3.4 and 4. σ1 = σ2 = 1 and σ12 = 0. X2 ) : τ ≤ t. . as the ⊳ (t) plot of the estimated densities show (panels (b) and (c) of ﬁgures 4. X2 ) = 0. Draw X2 ∼ N(µ2 + σ12 /σ1 (X1 − µ1 ).3 shows this estimate. . The area shaded in grey corresponds to the Note that the realisations (X(0) .2 0. The correlation between the X(t) is larger if the Markov chain moves only slowly (the chain is then said to be slowly mixing). Figure 4.50 4.4): due to the strong correlation the Gibbs sampler can only perform very small movements. . The Gibbs Sampler 2 2 2 2. This Gibbs sampler is a lot slower mixing than the one considered in example 4. 2 2 Now consider the special case µ1 = µ2 = 0. however with σ12 = 0.5 (Sampling from a highly correlated bivariate Gaussian). Using theorem 4. but typically positively correlated. X2 ≥ 0) by the proportion of samples (X1 . . This yields a correlation of ρ(X1 . X2 ≥ 0) obtained using a Gibbs sampler.4 0.99.3. X(1) . (t) (t) (t) (t) ⊳ ≥ 0}/t {(X1 . Figure 4.
0 ˆ fX1 (x1 ) 1.5 1. X2 ) = 0.0 (t) X1 1. Gibbs sampler for a bivariate standard normal distribution with correlation ρ(X1 .3.4. 1.0 (t) MCMC sample (t) X2 (b) Path of X1 and estimated density of X after 1.5 0.000 iterations Figure 4. Gibbs sampler for a bivariate normal distribution with correlation ρ(X1 .99. .0 and estimated density of X1 after ˆ fX2 (x2 ) 1.0 (t) X1 0.0 1.5 1.0 2.5 0.000 iterations 2.0 0.0 1.5 X1 (t) (a) First 50 iterations of (X1 .0 0. X2 ) (t) (t) X2 (t) MCMC sample (t) (c) Path of X2 and estimated density of X2 after 1.0 0.0 0.000 iterations X2 1.5 X1 (t) (t) MCMC sample (t) (b) Path of X1 1. X2 ) (t) (t) X2 (t) MCMC sample (t) (c) Path of X2 and estimated density of X2 after 1.0 ˆ fX2 (x2 ) 2.0 1.5 1.000 iterations Figure 4.4 Convergence of the Gibbs sampler 51 2.5 ˆ fX1 (x1 ) 1.0 0. X2 ) = 0.0 (a) First 50 iterations of (X1 .5.4.
The Gibbs Sampler .52 4.
Compute α(XX(t−1) ) = min 1. f (X) · q(X(t−1) X) . Xp ) iterate for t = 1. then f (x) · q(x(t−1) x) Cπ(x) · q(x(t−1) x) π(x) · q(x(t−1) x) = = f (x(t−1) ) · q(xx(t−1) ) Cπ(x(t−1) ) · q(xx(t−1) ) π(x(t−1) ) · q(xx(t−1) ) . One way around this problem is to allow for “local updates”. . the MetropolisHastings algorithm is based on proposing values sampled from an instrumental distribution. The probability that the MetropolisHastings algorithm accepts the newly proposed state X given that it currently is in state X(t−1) is a(x(t−1) ) = α(xx(t−1) )q(xx(t−1) ) dx.2) Just like the Gibbs sampler. otherwise set X(t) = X(t−1) . This chapter will introduce another MCMC method: the MetropolisHastings algorithm.1 is that it is often very difﬁcult to come up with a suitable proposal distribution that leads to an efﬁcient algorithm. . which goes back to Metropolis et al. . . Starting with X(0) := (X1 .Chapter 5 The MetropolisHastings Algorithm 5. (0) (0) 1.1 illustrates the MetropolisHasting algorithm. the MetropolisHastings algorithm generates a Markov chain. (1953) and Hastings (1970). Note that if the algorithm rejects the newly proposed value (open disks joined by dotted lines in ﬁgure 5. which are then accepted with a certain probability that reﬂects how likely it is that they are from the target distribution f . Remark 5. The probability of acceptance (5. whose properties will be discussed in the next section. Like the rejection sampling algorithm 3. (5. however at the price of yielding a Markov chain instead of a sequence of independent realisations.1) it stays at its current value X(t−1) . The main drawback of the rejection sampling algorithm 3. . . 2. i. Draw X ∼ q(·X(t−1) ). a special case of a Monte Carlo Markov Chain (MCMC) method: the target distribution is the invariant distribution of the Markov chain generated by the algorithm. . if f (x) = C · π(x). (5. Algorithm 5.1.1 (MetropolisHastings).1) does not depend on the normalisation constant. With probability α(XX(t−1) ) set X(t) = X. This makes it easier to come up with a suitable (conditional) proposal.e. 2.1) f (X(t−1) ) · q(XX(t−1) ) 3.e. let the proposed value depend on the last accepted value. i.1 Algorithm In the previous chapter we have studied the Gibbs sampler. to which it (hopefully) converges.1. Figure 5.
3) satisﬁes the detailed balance condition K(x(t−1) . Illustration of the MetropolisHastings algorithm.3) P(X(t) ∈ X X(t−1) = x(t−1) ) = P(X(t) ∈ X . . where δx(t−1) (·) denotes Diracmass on {x(t−1) }. We have (5. 1 On a similar note.3. X(1) . new value rejectedX(t−1) = x(t−1) ) = X α(x(t) x(t−1) )q(x(t) x(t−1) ) dx(t) IX (x(t−1) ) = R X + P(new value rejectedX(t−1) = x(t−1) ) =1−a(x(t−1) ) (1−a(x(t−1) ))δx(t−1) (dx(t) ) δx(t−1) (dx(t) ) = R X = X α(x(t) x(t−1) )q(x(t) x(t−1) ) dx(t) + X (1 − a(x(t−1) ))δx(t−1) (dx(t) ) Proposition 5.1 5. it is enough to know q(x(t−1) x) up to a multiplicative constant independent of x(t−1) and x. . open circles rejected values. new value acceptedX(t−1) = x(t−1) ) + P(X(t) ∈ X . The MetropolisHastings kernel (5. Thus f only needs to be known up to normalisation constant. x(t−1) )f (x(t) ) and thus f (x) is the invariant distribution of the Markov chain (X(0) . The transition kernel of the MetropolisHastings algorithm is K(x(t−1) . Note that the transition kernel (5. Filled dots denote accepted states.54 5.1. . x(t) )f (x(t−1) ) = K(x(t) .3) is not continuous with respect to the Lebesgue measure. x(t) ) = α(x(t) x(t−1) )q(x(t) x(t−1) ) + (1 − a(x(t−1) ))δx(t−1) (x(t) ).2 Convergence results Lemma 5. Proof. .2. The MetropolisHastings Algorithm x(3) = x(4) = x(5) = x(6) = x(7) x(15) (t) x(8) (14) X2 x (11) =x (12) =x (13) =x x(0) = x(1) = x(2) x(10) x(9) X1 (t) Figure 5. Furthermore the Markov chain is reversible.) generated by the MetropolisHastings sampler.
which also applies in the continuous case (see page 21). which is the case if . 1] or [2. x(t−1) +δ) distribution as proposal distribution q(·x(t−1) ). 3] and a U(x(t−1) −δ. It is easy to see that the resulting Markov chain is not irreducible if δ ≤ 1: in ⊳ δ q(·x(t−1) ) f (·) x(t−1) 1 Figure 5. The resulting Markov chain is thus strongly irreducible. 3]. as the following example shows.22. . if there is positive probability that the chain remains in the current state. this case the chain either stays in [0. f (x(t) )q(x(t−1) x(t) ) = min and thus f (x(t) )q(x(t−1) x(t) ) f (x(t−1) )q(x(t) x(t−1) ) q(x(t) x(t−1) )f (x(t−1) ) f (x(t−1) )q(x(t) x(t−1) ) . P(X(t) = X(t−1) ) > 0.) is further aperiodic.1 (Reducible MetropolisHastings). – Roberts and Tweedie (1996) give a more general condition for the irreducibility of the resulting Markov chain: they only require that ∀ ǫ ∃ δ : q(x(t) x(t−1) ) > ǫ if x(t−1) − x(t) < δ together with the boundedness of f on any compact subset of its support. Consider using a MetropolisHastings algorithm for sampling from a uniform distribution on [0. many popular choices of q(·x(t−1) ) like multivariate Gaussians or tdistributions fulﬁl this condition. Example 5. 1 q(x(t−1) x(t) )f (x(t) ) = α(x(t−1) x(t) )q(x(t−1) x(t) )f (x(t) ) f (x(t) )q(x(t−1) x(t) ) K(x(t−1) .2 Convergence results 55 Proof.2. x(t) )f (x(t−1) ) = α(x(t) x(t−1) )q(x(t) x(t−1) )f (x(t−1) ) =α(x(t−1) x(t) )q(x(t−1) x(t) )f (x(t) ) + (1 − a(x(t−1) )) δx(t−1) (x(t) ) f (x(t−1) ) =0 if x(t) = x(t−1) (1−a(x(t) ))δx(t) (x(t−1) ) = K(x(t) . Even though this condition seems rather restrictive. δ 1/(2δ) 1/2 Figure 5.1 2 3 Under mild assumptions on the proposal q(·x(t−1) ) one can however establish the irreducibility of the resulting Markov chain: – If q(x(t) x(t−1) ) is positive for all x(t−1) .2 illustrates this example. this does not need to be the case. x(t−1) )f (x(t) ) The other conclusions follow by theorem 1.5. 1]∪[2. i. . x(t) ∈ supp(f ). As with the Gibbs sampler. then it is easy to see that we can reach any set of nonzero probability under f within a single step. Illustration of example 5. Next we need to examine whether the MetropolisHastings algorithm yields an irreducible chain. We have that α(x(t) x(t−1) )q(x(t) x(t−1) )f (x(t−1) ) = min 1. = min f (x(t−1) )q(x(t) x(t−1) ).e. The Markov chain (X(0) . X(1) . .
2. For a proof of Harris recurrence see (Tierney. as q(XX(t−1) ) = g(X − X(t−1) ) = g(X(t−1) − X) = q(X(t−1) X) using the symmetry of g. . Proposition 5. As we have now established (Harris)recurrence. from algorithm 5. Theorem 5. The Markov chain generated by the MetropolisHastings algorithm is Harrisrecurrent if it is irreducible.5) (5. f (X f (X) · q(X(t−1) X) (t−1) ) · q(XX (t−1) ) = min 1.5) the probability of acceptance simpliﬁes to min 1. (5.6) 2. otherwise set X(t) = X(t−1) . Assume that we generate the newly proposed state X not using the fairly general X ∼ q(·X(t−1) ). Proof.5. This yields the following algorithm which is a special case of algorithm 5.4.4) using q(xx(t−1) ) = g(x − x(t−1) ). 1994). (1953).2 (Random walk Metropolis). The MetropolisHastings Algorithm P f (X(t−1) )q(XX(t−1) ) > f (X)q(X(t−1) X) > 0. When using (5. With probability α(XX(t−1) ) set X(t) = X. Algorithm 5.56 5. .3 The random walk Metropolis algorithm In this section we will focus on an important special case of the MetropolisHastings algorithm: the random walk MetropolisHastings algorithm. Xp ) and using a symmetric dis(0) (0) tributon g.1. we are now ready to state an ergodic theorem (using theorem 1. . Draw ε ∼ g and set X = X(t−1) + ε. .1. 1.30). Recurrence follows from the irreducibility and the fact that f is the unique invariant distribution (using proposition 1. f (X) f (X(t−1) ) . 5. . Starting with X(0) := (X1 . . .28). . As with the Gibbs sampler the above ergodic theorem allows for inference using a single Markov chain. f (X) f (X(t−1) ) .4) with g being a symmetric distribution. iterate for t = 1. (5. which is actually the original algorithm proposed by Metropolis et al. but rather X = X(t−1) + ε. If the Markov chain generated by the MetropolisHastings algorithm is irreducible. Compute 3. then for any integrable function h : E → R 1 lim n→∞ n n t=1 h(X(t) ) → Ef (h(X)) for every starting value X(0) . α(XX(t−1) ) = min 1. Note that this condition is not met if we use a “perfect” proposal which has f as invariant distribution: in this case we accept every proposed value with probability 1. It is easy to see that (5.5) is a special case of (5. ε ∼ g.
Data used in example 5. and an indicator of whether antibiotics were given as a prophylaxis (zi3 ).2 (Bayesian probit model). For ﬁgure 5. . The response Yi is the number of infections that were observed amongst ni patients having the same inﬂuence factors (covariates). we choose the covariance Σ of the proposal to be 0. . 1] for all t ∈ R. Number of births with infection total yi ni 11 1 0 23 28 0 8 Table 5.6296 1.3636 1.6201 95% credible interval 1.2 the ﬁrst 10. With probability α(ββ (t−1) ) set β (t) = β. To keep things simple. Σ) and set β = β (t−1) + ε.2000 0. . 2. as the ergodic theorem implies. . . zi2 .0413 risk factors β2 1. 2. You might want to consider a longer chain in practise.2. πi ). . .3 and table 5. yn ) ∝ Φ(z′ β)yi · (1 − Φ(z′ β))ni −yi · exp − β2 i i 2 j=0 j i=1 We can sample from the above posterior distribution using the following random walk Metropolis algorithm.7333 1.: 1. Draw ε ∼ N (0.1.2029 0. i where zi = (1.8993 2.2 is to a distribution.08 · I. f (βY1 . . . Compute α(ββ (t−1) ) = min 1. . . I/λ). . an indicator of whether additional risk factors were present at the time of birth (zi2 ). (t) Figure 5. Note that Φ(t) ∈ [0. zi1 . The data is given in table 5. .1. In a medical study on infections resulting from birth by Cesarean section (taken from Fahrmeir and Tutz.000 samples2 . A suitable prior distribution for the parameter of interest β is β ∼ N (0. The posterior density of β is n 3 λ f (βy1 . . π = Φ(z′ β). . whereas the cumulative averages 2 t τ =1 βj /t converge. to a ⊳ (τ ) value.471 antibiotics β3 Table 5. Yn ) .4646 0.3 The random walk Metropolis algorithm 57 Example 5. otherwise set β (t) = β (t−1) .2 show the results obtained using 50. 3.3 and table 5. 1) distribution. Note that the convergence of the βj Posterior mean intercept planned β0 β1 1. zi3 ) and Φ(·) being the CDF of the N(0. Yn ) f (β (t−1) Y1 . Starting with any β (0) iterate for t = 1.0952 0.000 samples have been discarded (“burnin”). Parameter estimates obtained for the Bayesian probit model from example 5.5.7783 1. . 2001) three inﬂuence factors have been studied: an indicator whether the Cesarian was planned or not (zi1 ).2 planned zi1 1 0 0 1 0 1 0 risk factors zi2 1 1 0 1 1 0 0 antibiotics zi3 1 1 1 0 0 0 0 98 18 2 26 58 9 40 The data can be modeled by assuming that Yi ∼ Bin(ni .
0 0.0 20000 30000 40000 50000 0 10000 20000 30000 40000 50000 β2 (t) β3 (t) (t) β2 0.3 0.0 1.5 1.58 5.0 0.4 0.5 1.5 0.0 1.0 density 1.5 1.0 0.5 0. The MetropolisHastings Algorithm β0 2.5 β3 (t) 0.6 0.5 1.5 1.0 0.0 1.0 (c) Posterior distributions of the βj Figure 5.0 0.3.5 1.5 Pt τ =1 βj /t β1 (τ ) density density 2.5 1.5 1.0 1.5 (t) (t) β0 β1 (t) 0 10000 0.0 Pt 1.0 0.5 2.0 0.0 3.5 0 10000 3.0 (t) β1 1.5 1.0 1.7 (τ ) Pt τ =1 β1 /t (τ ) 0.5 1.5 1.2 .0 0.0 (τ ) 20000 30000 40000 50000 0 10000 20000 30000 40000 50000 sample sample (b) Cumulative averages β0 0. Results obtained for the Bayesian probit model from example 5.0 0.0 1.6 τ =1 1.0 2.5 0.5 20000 30000 40000 50000 0 10000 (t) 20000 30000 40000 50000 (a) Sample paths of the βj Pt 0.2 0.0 Pt 0 10000 Pt τ =1 β1 /t (τ ) 20000 30000 40000 50000 0 10000 20000 30000 40000 50000 sample Pt 1.0 Pt 0 10000 2.5 1.0 2.5 β3 density 1.5 0.0 0.0 β2 /t (τ ) τ =1 sample Pt τ =1 β2 /t 0.0 1.5 0.5 (τ ) β3 /t (τ ) β3 /t τ =1 τ =1 0.5 β2 1.2 β0 /t (τ ) τ =1 β0 /t 0.0 0.5 2.
X (t) ) Mean 95% CI σ 2 = 0.4426 0. If we choose a proposal distribution with a small variance.0.4 Choosing the proposal distribution 59 5. At ﬁrst sight. (1995).3 for different choices of the proposal variance σ 2 .3. . This corresponds to the theoretical results of Gelman et al.0. and σ 2 = 102 .1274) ⊳ Table 5. X (t) ) and average probability of acceptance α(XX (t−1) ) found in example 5.0.9677. This correlation has two sources: – the correlation between the current state X(t−1) and the newly proposed value X ∼ q(·X(t−1) ). on the other hand. we might think that setting σ 2 = 1 is the optimal choice. which is in line with the results we obtained in the above simple example and the guidelines which motivate them. Thus we would ideally want a proposal distribution that both allows for fast changes in the X(t) and yields a high probability of acceptance. Rejected values are drawn as grey open circles.9910) (0. The covariance structure of the proposal distribution should ideally reﬂect the expected covariance of the (X1 . X (t) ) as well as the average probability of acceptance α(XX (t−1) ) averaged over 100 runs of the algorithm. and rejected. Xp ).3 This is the price we have to pay for the generality of the MetropolisHastings algorithm.382 σ 2 = 102 0.6162. however the probability of acceptance will be rather low. In this example we examine the choices: σ 2 = 0. Autocorrelation ρ(X (t−1) .9694 0. The results suggest that σ 2 = 2. however at the price of a chain that is hardly moving.2 we drew 3 The optimal proposal would be sampling directly from the target distribution. σ 2 = 1.1.0.6225 0. Unfortunately these are two competing goals. If.1237. (1997) propose to adjust the proposal such that the acceptance rate is around 1/2 for one.9710) (0. Choosing σ 2 too large allows the chain to make large jumps.7038 0. Example 5.7791) (0.12 σ2 = 1 σ 2 = 2.9891.4 shows the sample paths of a single run of the corresponding random walk MetropolisHastings algorithm. this is however not the case. Gelman et al. The very reason for using a MetropolisHastings algorithm is however that we cannot sample directly from the target! .4452) (0. An ideal choice of proposal would lead to a small correlation of subsequent realisations X(t−1) and X(t) .0. Figure 5.8360 (0.382 .1255 (0. Popular choices for random walk proposals are multivariate Gaussians or tdistributions.0.3. In the Bayesian probit model we studied in example 5.5. Average correlation ρ(X (t−1) .7061) (0. 1) distribution using a random walk MetropolisHastings – the correlation introduced by retaining a value X(t) = X(t−1) because the newly generated value X has been algorithm with ε ∼ N (0. Assume we want to sample from a N (0. . making them a safer choice. Finding the ideal proposal distribution q(·x(t−1) ) is an art. σ 2 ).8418) Probability of acceptance α(X. σ 2 = 2. Table 5. so the chain remains for a long time at each accepted value. Choosing σ 2 too small yields a very high probability of acceptance.or two dimensional target distributions.4 Choosing the proposal distribution The efﬁciency of a MetropolisHastings sampler depends on the choice of the proposal distribution q(·x(t−1) ).3 shows the average correlation ρ(X (t−1) . Note however that these are just rough guidelines. Example 5.7676. the probability of acceptance will be high. the X (t) can potentially move very fast.7014.382 is the optimal choice. and around 1/4 for larger dimensions.9901 0.4401. we choose a proposal distribution with a large variance.4 (Bayesian probit model (continued)). .7733 0. X (t−1) ) Mean 95% CI 0.0. however most of the proposed values are rejected. The latter have heavier tails.0. however the resulting Markov chain will be highly correlated.8303. . as the X (t) change only very slowly.6289) (0.
Sample paths for example 5. Open grey discs represent rejected values. The MetropolisHastings Algorithm σ 2 = 0.12 σ2 = 1 σ 2 = 2.3 for different choices of the proposal variance σ 2 .4.60 5.382 σ 2 = 102 6 4 2 0 2 4 6 6 4 2 0 2 4 6 6 4 2 0 2 4 6 6 4 2 0 2 4 6 0 200 400 600 800 1000 Figure 5. .
(a) Σ = 0. i.9503 β2 0. chain can be made faster mixing by using a proposal distribution that represents the covariance structure of the This can be done by resorting to the frequentist theory of generalised linear models (GLM): it suggests that the ˆ asymptotic covariance of the maximum likelihood estimate β is (Z′ DZ)−1 .8792 . so the improvement of the mixing behaviour is entirely due to a difference in the structure of the the covariance. When using Σ = 2·(Z′ DZ)−1 in the algorithm presented in section 5. The proportion of accepted values we obtained in example 5.e.4 (a) shows the corresponding autocorrelation. The resulting Markov posterior distribution of β.4 (b)). Table 5.2 we can obtain better mixing performance: the autocorrelation is reduced (see table 5. βj ) (t) 0.08 · I β0 Autocorrelation ρ(βj (t−1) ⊳ β1 0.0%.2 was 13.5.8726 . we modeled the components of ε to be independent. Note that the determinant of both choices of Σ was chosen to be the same. where Z is the matrix of the covariates.8765 β2 0.9%. βj ) between subsequent samples for the two choices of the covariance Σ. Autocorrelation ρ(βj (t−1) (t) (t−1) β1 0. Σ) with Σ = 0.8741 β3 0.9532 .08 · I.9562 β3 0.4 Choosing the proposal distribution 61 ε ∼ N(0. .4. and the proportion of accepted values obtained increases to 20. βj ) (t) 0.9496 (b) Σ = 2 · (Z′ DZ)−1 β0 Autocorrelation ρ(βj Table 5. and D is a suitable diagonal matrix.
62 5. The MetropolisHastings Algorithm .
Xk (t) (m·t+τ ) ) < ρ(Xk . .e. If the correlation ρ(X(t) . Figure 6. often only a subset of the chain (X(t) )t is used: Burnin Depending on how X(0) is chosen..1 Practical considerations The theory of Markov chains we have seen in chapter 1 guarantees that a Markov chain that is irreducible and has invariant distribution f converges to the invariant distribution. In practise. Thinning Markov chain Monte Carlo methods typically yield a Markov chain with positive autocorrelation.6 and 5. the distribution of (X(t) )t for small t might still be far from the stationary distribution f . Illustration of the idea of a burnin period.5 allow for approximating expectations Ef (h(X)) by their the corresponding means 1 T T t=1 h(X(t) ) −→ Ef (h(X)) using the entire chain. Thus it might be beneﬁcial to discard the ﬁrst iterations X(t) . Xk (t) (t+τ ) ) is positive for small τ . i.1 illustrates the idea of a burnin period. t = 1. . we consider a Markov chain (Y(t) )t with Y(t) = X(m·t) instead of (X(t) )t . Thus thinning can be seen as a technique for reducing the autocorrelation. . Yk (t) (t+τ ) ) = ρ(Xk . The ergodic theorems 4. however at the price of yielding a chain (Y(t) )t=1. .Chapter 6 Diagnosing convergence 6.e. the thinned chain (Y(t) )t exhibits less autocorrelation than the original chain (X(t) )t . i. burnin period (discarded) Figure 6. then ρ(Yk . How large T0 has to be chosen depends on how fast mixing the Markov chain (X(t) )t is. Xk (t) (t+τ ) ). however. This early stage of the sampling process is often referred to as burnin period. i. ρ(Xk .1. . X(t+τ ) ) decreases monotonically in τ . T0 .⌊T /m⌋ . This suggests building a subchain by only keeping every mth value (m > 1)...e.
1 = 2 Var T T h(X(t) ) t=1 T /m T /m m2 1 h(Y(t) ) .. this does not imply that a ﬁnite sample from such a chain yields a good approximation to the target distribution... we obtain that T m−1 T /m τ =0 t=1 Var T /m T /m t=1 h(X(t·m+τ1 ) ) = Var h(X(t·m+τ ) ) m−1 t=1 h(X(t·m+τ2 ) ) Var h(X(t) ) = t=1 Var = m · Var T /m t=1 h(X(t·m) ) + η=τ =0 Cov T /m T /m h(X(t·m+η) ). If the computer’s memory cannot hold the entire chain (X(t) )t . convergence assessment can be split into the following three tasks of diagnosing different aspects of convergence: . then ⌊T /m⌋ T 1 1 h(Y(t) ) . as the following lemma shows.1. 2004. This section tries to give a brief overview over various approaches to diagnosing convergence. Further. The techniques presented in the following are nothing other than exploratory tools that help you judging whether the chain has reached its stationary regime. i. . ..64 6. Even though thinning is very popular. ≤ 2 Var h(Y(t) ) = Var T T /m t=1 t=1 The concept of thinning can be useful for other reasons.T be a sequence of random variables (e. As with all approximating methods this must be conﬁrmed in practise.. h(X(t) ) ≤ Var Var T t=1 ⌊T /m⌋ t=1 h(X(t) ) = t=1 τ =0 t=1 h(X(t·m+τ ) ) and for τ1 ...⌊T /m⌋ a second sequence deﬁned by Y(t) := X(m·t) .. from a Markov chain) with X(t) ∼ f Proof. A more detailed review with many practical examples can be diagnofound in (GuihennecJouyaux et al. 6.T . thinning is a good choice. Let (X(t) )t=1.2 Tools for monitoring convergence Although the theory presented in the preceding chapters guarantees the convergence of the Markov chains to the required distributions.. There is an R package (CODA) that provides a vast selection of tools for diagnosing convergence. τ2 ∈ {0. Broadly speaking. 1998) or (Robert and Casella. t=1 ≤Var “P T /m t=1 t=1 h(X(t·m) ) h(X(t·m+τ ) ) ” Thus Var 1 T T ≤ m2 · Var h(X(t) ) t=1 T /m T /m t=1 h(X(t·m) ) = m2 · Var t=1 h(Y(t) ) . chapter 12). Using T m−1 T /m and (Y(t) )t=1.. it can be easier to assess the convergence of the thinned chain (Y(t) )t as opposed to entire chain (X(t) )t . Diagnosing convergence is an art. it cannot be justiﬁed when the objective is estimating Ef (h(X)). T /m ∈ N. Diagnosing convergence whose length is reduced to (1/m)th of the length of the original chain (X(t) )t=1. Lemma 6. ...g. This section contains several cautionary examples where the different tools for diagnosing convergence fail. m − 1}. To simplify the proof we assume that T is divisible by m. If Varf (h(X(t) )) < +∞.e. ..
Density of the mixture distribution with two random walk Metropolis samples using two different variances Var(ε) of the proposal. Example 6. – whether (X(t) )t has reached a stationary regime. The ﬁrst. Note that it is impossible to tell from ﬁgure 6. . It is impossible to see from a plot of the sample path whether the chain has explored the entire support of the distribution. and most important.0.22 .3 (b) show plots of the cumulative averages.2 (b) alone that the chain has not explored the entire support of the target. and 5. How much information is contained in the sample from the Markov chain compared to i. Note that the convergence of (X(t) )t is in distribution. the sample path is not supposed to converge to a single value.5). The ﬁrst choice of Var(ε) is too small: the chain is very likely to remain in one of the two modes of the distribution.0 2 1 0 x 1 2 1 1 0 1 2 3 0 2000 6000 8000 10000 0 2000 4000 sample 6000 8000 10000 (a) Density f (x) (b) Sample path of a random walk (c) Sample path of a random walk Metropolis algorithm with proposal vari. If we choose the proposal variance Var(ε) too small.i.2 Tools for monitoring convergence 65 Convergence to the target distribution.d. Does under the target distribution? Comparison to i.2 0 0.22 ) (x) + 0.2 (a) for a plot of the density) using a random walk Metropolis algorithm with proposed value X = one population instead of both.i.22 Figure 6. one can look at a plot of the cumulative averages ( t τ =1 h(X (τ ) )/t)t .e. Figure 6.6 sample path X (t) sample path X (t) 4000 sample density f (x) 1 0. Convergence of the averages.0. we only sample from Var(ε) = 1. Note that the convergence of the cumulative averages is — as the ergodic theorems sug gest — to a value (Ef (h(X)). Ideally. i.8 3 3 ⊳ 2 0.4 0. In order to diagnose the convergence of the averages.42 ance Var(ε) = 1. .4 (b) (c).2. sampling? T t=1 h(X(t) )/T provide a good approximation to the expectation Ef (h(X)) 6. In this example we sample from a mixture of two wellseparated Gaussians f (x) = 0.Metropolis algorithm with proposal variance Var(ε) = 0. Var(ε)).d. The smoother the plot seems (like for example ﬁgure 4. and – whether (X(t) )t covers the entire support of the target distribution.3.6.42 and X (t−1) + ε with ε ∼ N(0. and 5.32 ) (x) (see ﬁgure 6. Note however that this plot suffers from the “you’ve only seen where you’ve been” problem. 4. question is whether (X(t) )t yields a sample from the target distribution? In order to answer this question we need to assess .4 · φ(−1.1 (A simple mixture of two Gaussians).2.4). An alternative . the slower mixing the resulting chain is. Figures 4.2 shows the sample paths for two choices of Var(ε): Var(ε) = 0. 5. 0.4.5 (b) (c).3 (a).1 Basic plots The most basic approach to diagnosing the output of a Markov Chain Monte Carlo algorithm is to plot the sample path (X(t) )t as in ﬁgures 4. the plot should be oscillating very fast and show very little structure or trend (like for example ﬁgure 4. sampling.6 · φ(2.
. 2004.d. ⊳ 0....6 Pt τ =1 X (τ ) /t 0. (X(t) )t=⌊T /3⌋+1. we do not apply it to the (X(t) )t directly...2 (A pathological generator for the Beta distribution). Pt τ =1 X (τ ) /t Note that it is impossible to tell from a plot of the cumulative means whether the Markov chain has explored the entire support of the target distribution. Sample paths and cumulative means obtained for the pathological Beta generator. In its simplest version. The ﬁrst block is considered to be the burnin period.8 1.2..0 0 2000 4000 8000 10000 0 2000 4000 sample 6000 8000 10000 (a) Sample path X (t) (b) Cumulative means Figure 6.8 1. sample. .2⌊T /3⌋ to the one of (X(t) )t=2⌊T /3⌋+1. This slow convergence can be seen in a plot of the cumulative means (ﬁgure 6...i. the second and third block should be bution of (X(t) )t=⌊T /3⌋+1. and (X(t) )t=2⌊T /3⌋+1.2 Nonparametric tests of stationarity This section presents the KolmogorovSmirnov test. . 1).4 0. 1. 2. set X (t) = X (t−1) .. problem 7.d.4 0.. but to a thinned chain (Y(t) )t with Y(t) = X(m·t) : the thinned chain is less correlated and thus closer to being an i.66 6. Thus we should be able to tell whether the chain has converged by comparing the distri .. which is an example of how nonparametric tests can be used as a tool for diagnosing whether a Markov chain has already converged. 1) distribution..T . Otherwise draw X (t) ∼ Beta(α + 1. samples.. Example 6.2 0. 2.T using suitable nonparametric twosample tests. As the KolmogorovSmirnov test is designed for i. If the Markov chain has reached its stationary regime after ⌊T /3⌋ iterations.3. The following MCMC algorithm (for details.2⌊T /(3m)⌋ to the one of from the same distribution.⌊T /3⌋ .2⌊T /3⌋ . This algorithm yields a very slowly converging Markov chain.6 cumulative average 6000 sample 0.. One such test is the KolmogorovSmirnov test..0 sample path X (t) 0..i.... which is nothing other than the difference between the cumulative averages and the estimate of the limit Ef (h(X)). to which no central limit theorem applies. With probability 1 − X (t−1) . We can now compare the distribution of (Y(t) )t=⌊T /(3m)⌋+1.. see Robert and Casella.3 (b)).0 0. Diagnosing convergence ¯ to plotting the cumulative means is using the socalled CUSUMs h(Xj ) − T τ =1 (τ ) t τ =1 h(Xj )/t (τ ) t ¯ with h(Xj ) = h(Xj )/T .. it is based on splitting the chain into three parts: (X(t) )t=1..5) yields a sample from the Beta(α.2 0.. . 6. Starting with any X (0) iterate for t = 1.
Example 6. .6 0.000 is large enough for the lowcorrelation setting.i..99 (as in example 4.. .4 shows the plots of the standardised KolmogorovSmirnov statistic.4) and once with ρ(X1 .⌊T /m⌋ (x) .z] (Zk.n is based on comparing their empirical CDFs n 1X ˆ Fk (z) = I(−∞..5).2⌊T /(3m)⌋ (x) − F(Y(t) )t=2⌊T /(3m)⌋+1.4 from example 6.2 tKt 6 7 0 2000 4000 8000 10000 0 2000 4000 sample 6000 8000 10000 (a) ρ(X1 . 1.d.n and Z2.2 Tools for monitoring convergence 67 (Y(t) )t=2⌊T /(3m)⌋+1. the latter a very slowly mixing chain. as it is based on comparing (Y(t) )t=⌊T /(3m)⌋+1..i. we can use it as an informal tool by √ monitoring the standardised statistic tKt as a function of t. Z1. ... sample...99 from the Gibbs sampler from the bivariate Gaussian for two different correlations.1 .. Z2. we cannot use the KolmogorovSmirnov test as a formal statistical test (besides we would run into problems of multiple testing).3 Figure 6. X2 ) = 0. i=1 2 Kt is hereby the KolmogorovSmirnov statistic obtained from the sample consisting of the ﬁrst t observations only.0 1.. . x∈R As the thinned chain is not an i.8 1..8 ⊳ 1... A plot of the KolmogorovSmirnov statistic for the chain with Var(ε) = 0.4 Standardised KS statistic 6000 sample Standardised KS statistic √ 1 2 3 4 5 0. it is safe to assume that the chain has not yet reached its stationary regime. In this example we consider sampling from a bivariate Gaussian distribution. X2 ) = 0. ..⌊T /m⌋ using the KolmogorovSmirnov statistic 1 ˆ ˆ K = sup F(Y(t) )t=⌊T /(3m)⌋+1.6. but not large enough for the highcorrelation setting. It suggests that the sample size of 10.3 (as in example 4. Figure 6. X2 ) = 0. z∈R For n → ∞ the CDF of √ n · K converges to the CDF R(k) = 1 − +∞ X (−1)i−1 exp(−2i2 k2 ). . The former leads a fast mixing chain. . However.d.. .3 (Gibbs sampling from a bivariate Gaussian (continued)). samples Z1.4..⌊T /m⌋ .2 As long as a signiﬁcant proportion of the values of the standardised statistic are above the corresponding quantile of the asymptotic distribution. once with ρ(X1 .1 would not reveal anything unusual.i ). Standardised KolmogorovSmirnov statistic for X1 (5·t) (b) ρ(X1 . n i=1 The KolmogorovSmirnov test statistic is the maximum difference between the two empirical CDFs: ˆ ˆ K = sup F1 (z) − F2 (z).. .6 tKt √ 1. X2 ) = 0. Note that the KolmogorovSmirnov test suffers from the “you’ve only seen where you’ve been” problem... 1 The twosample KolmogorovSmirnov test for comparing two i.2⌊T /(3m)⌋ and (Y(t) )t=2⌊T /(3m)⌋+1.1 ..
6. Note that the technique of control variates is only useful if the different estimators converge about as fast as the quantity of interest — otherwise we would obtain an overly optimistic.68 6. Alternatively. Diagnosing convergence 6. .001.t) )t .2.1 for the L different chains should not reveal any difference. for more details). requires that the density f is known inclusive of normalisation constants. . . so comparing the plots from section 6.4 (A simple mixture of two Gaussians (continued)).42 ) failed to explore the entire support of the target distribution. . Alternatively.3 Riemann sums and control variates A simple tool for diagnosing convergence of a onedimensional Markov chain can be based on the fact that f (x) dx = 1. then (6. and can be used as a tool to diagnose whether all chains sampled from the same distribution.t) )t — with overdispersed3 starting values X(1. Gelman and Rubin. where δα is the (l) then compute the distance between the α and (1 − α) quantile of the pooled data.0) .e.1). The corresponding Riemann sums are 0. 2001). one could compare the variances within the L chains to the pooled estimate of the variance (see Brooks and Gelman. both estimates should be roughly the same. Note that this method. If all chains are a sample from the same distribution.4 Comparing multiple chains A family of convergence diagnostics (see e. . . (6. .22 ) managed to. (X(L. In example 6. X(L. our estimate is (l) L l=1 δα /L.2. 1998. covering at least the support of the target distribution. clearly indicating that the ﬁrst algorithm does not explore the entire support. The ﬁrst consists of estimating the interquantile distance for each of the L chain and averaging over these results. 1998) is based on running L > 1 chains — which we will denote by (X(1. often referred to as Riemann sums (Philippe and Robert. All L chains should converge to the same distribution. .1 we considered two randomwalk where X [1] ≤ . In the special case of the Riemann sum we compare two quantities: the constant 1 and the Riemann sum (6.2. i.e. ≤ X [T ] is the ordered sample from the Markov chain. E We can estimate this integral by the Riemann sum T t=2 (X [t] − X [t−1] )f (X [t] ). ⊳ Riemann sums can be seen as a special case of a technique called control variates. . we can pool the data ﬁrst. 3 i.1) support of f . If the Markov chain has explored all the Metropolis algorithms: one (Var(ε) = 0. The idea of control variates is comparing several ways of estimating the same quantity. We can estimate the interquantile distances in two ways.g. so their ratio ˆinterval = Sα (l) L l=1 δα /L (·) δα distance between the α and (1 − α) quantile of the lth chain(X (l. Example 6. As long as the different estimates disagree. 1992.1) should be around 1. . or an overly conservative estimate of whether the chain has converged. the chain has not yet converged.t) )t . A more formal approach to diagnosing whether the L chains are all from the same distribution can be based on comparing the interquantile distances. whereas the other one (Var(ε) = 1. the variance of the starting values should be larger than the variance of the target distribution.598 and 1.0) . in which case the ratio should be around 1. Brooks and Gelman. .
0) . X2 ≤ 0. This is due to the fact that unless x2 (or x1 ) is very close to µ2 (or µ1 ).0) .t) 2000 1 0 1 2 2 1 0 1 2 3 0 4000 sample 6000 8000 10000 0 2000 4000 sample 6000 8000 10000 (a) Var(ε) = 0. . 1] × [0. In the example of the mixture of two Gaussians we will consider L = 8 chains initialised from a N(0. X(L.σ2 ·I) (x1 .5)′ . 0.5. Figure 6. Gibbs sampler.6 (Witch’s hat distribution). both truncated to [0. The conditional distribution of X1 X2 is f (x1 x2 ) = with δx2 = (1 − δx2 )φ(µ. δx2 (or δx1 ) is almost 1. δ + (1 − δ)φ(µ2 .646463 ⊳ 3 2 sample paths X (l. and σ = 10−5 using a Nonetheless. Note that 99. 3.6 shows the samples obtained from 15 runs of the Gibbs .2 Tools for monitoring convergence 69 Example 6.51) = 0. 102 ) distribution.49 < X1 . .42 (b) Var(ε) = 1. it is very unlikely that the Gibbs sampler visits this part of the distribution. δ .6.e. x2 ∈ (0. 1) else. 1) else.5 shows the sample paths of the 8 ˆ chains for both choices of Var(ε).2696962 Var(ε) = 0.5). 1].49 < X1 . as the following example shows.9789992 ˆinterval = 0.42 : S0.05 = 3. The corresponding values of S interval are: 0.05 0. which is a mixture of a Gaussian and a uniform distribution. x2 ) + δx2 0 for x1 ∈ (0. .5. 0. Example 6.6 illustrates the density.996687.t) sample paths X (l. and thus can easily fail. P(0.22 Figure 6.22 : S0.e. x2 ) + δ 0 if x1 .630008 3. µ = (0.5. Note that this method depends crucially on the choice of initial values X(1.634382 ˆinterval Var(ε) = 1. Figure 6.9% of the mass of the distribution is concentrated in a very small area around (0.999.5 (A simple mixture of two Gaussians (continued)). . Comparison of the sample paths for L = 8 chains for the mixture of two Gaussians. i.σ2 ) (x2 ) Assume we want to estimate P(0. the Gaussian component is concentrated in a very small area around µ.σ2 ·I) (x1 . x2 ) ∝ (1 − δ)φ(µ. Consider a distribution with the following density: f (x1 . For very small σ 2 . i. Figure 6. the Gibbs sampler only samples from the uniform component of the distribution.51) for δ = 10−3 . X2 ≤ 0.05 = = 0.
Then Var 1 T T If 4 +∞ τ =1 see e. sample.70 6. 0.999).0 x1 (a) Density for δ = 0. sample of size T . 0. 6. X2 ≤ 0. and σ = 0.05 = (b) First 100 values from 15 samples using different starting values. Diagnosing convergence sampler (ﬁrst 100 iterations only) all using different initialisations. which is Var(h(X(t) ))/T .8 1.6. x2 ) 0.T is from a secondorder stationary time series..2.1) for details.2) we make the simplifying assumption that (h(X(t) ))t=1.49.. If the (X(t) )t=1.5)′ . which contains less information than an i. Var(h(X(t) )) = σ 2 . ⊳ x2 x1 x2 0...9% of the probability mass of the target distribution. (δ = 10−3 .04% of the sampled values ˆ lie in (0.0 0.. density f (x1 . µ = (0.. ρ(τ ) < +∞. In ﬁgure 6.i. then we can obtain from the dominated convergence theorem4 that .4 0. It is however close to impossible to detect this problem with any technique based on multiple initialisations.6 all 15 starting values yield a Gibbs sampler that is stuck in the “brim” of the witch’s hat and thus misses 99.51) = 0.. 0.6 0. In order to compute the variance (6.51) = 0...2) h(X(t) ) = t=1 1 Var(h(X(t) )) +2 T 2 t=1 =σ 2 T 1≤s<t≤T = σ2 T2 T −1 τ =1 T +2 (T − τ )ρ(τ ) = σ2 T Cov(h(X(s) ).. and σ = 10−5 ) h(X(t) ) t=1 (6.d. h(X(t) )) =σ 2 ·ρ(t−s) T −1 τ =1 1+2 1− τ T ρ(τ ) . theorem 7.d...i. The Gibbs sampler shows this behaviour for practically all starting values. Brockwell and Davis (1991.i.2) as the estimate from the Markov chain (X(t) )t=1. The effective sample size is the size an i..T .5..51) × (0.T . h(X(t+τ ) )) = ρ(τ ). sample would have to have in order to obtain the same variance (6.e.2.2 0.. X2 ≤ 0.d.0 Figure 6. then the variance of the average Var 1 T T is larger than the variance we would obtain from an i.49. i.8 1.6 0.49 < X1 .g.49 < X1 . sampling and the effective sample size MCMC algorithms typically yield a positively correlated sample (X(t) )t=1. On average only 0. and ρ(h(X(t) ).5)′ . µ (0.5.51) yielding an estimate of P(0. Density and sample from the witch’s hat distribution.0004 (as opposed to P(0.2 0.0 0.T are positively correlated..4 0.i. The effective sample size (ESS) allows to quantify this loss of information caused by the positive correlation.1.d. 0.5 Comparison to i.
yielding TESS = 1 1+2 +∞ τ =1 as T → ∞. thus the effective sample size is TESS = 1 − 0. ρ(τ ) = ρ(h(X(t) ).e.5 we obtained for the lowcorrelation setting that ρ(X1 (t−1) . i. Note that the variance would be σ 2 /TESS if we were to use an i.i. X1 ) = 0.2 Tools for monitoring convergence 71 T · Var 1 T T +∞ h(X(t) ) t=1 −→ σ 2 1+2 τ =1 ρ(τ ) obtain the effective sample size TESS by equating these two variances and solving for TESS .078. h(X(t+τ ) )) = ρτ  .. X1 ) = 0.. sample of size TESS . then we obtain using 1 + 2 +∞ τ =1 ρτ = (1 + ρ)/(1 − ρ) that TESS = 1−ρ · T.d.7 (Gibbs sampling from a bivariate Gaussian (continued)). We can now ρ(τ ) · T. 1+ρ Example 6..979 ⊳ .979 · 10000 = 105. 1 + 0.T is a ﬁrstorder autoregressive time series (AR(1)).4 and 4.6. 1 + 0..979. In examples 4. If we assume that (h(X(t) ))t=1.078 (t−1) (t) For the highcorrelation setting we obtained ρ(X1 smaller: TESS = .078 · 10000 = 8547. thus the effective sample size is considerably (t) 1 − 0.
72 6. Diagnosing convergence .
or estimating the volatility of ﬁnancial instruments using stock market data. and the variancecovariance structure.Chapter 7 Statespace models and the Kalman ﬁlter algorithm 7. and this is implied by the subscript t in pt (·). then it is possible to reuse the samples from pt−1 (xt−1 ) to obtain samples from . We point out that SMC methods are applicable in settings beyond dynamic models. when exact simulation is not possible. and interest lies in performing online inference about unknown quantities from the given observations. denoted by p(x). The particular feature of dynamic models is the evolving nature of the underlying distribution.1 Motivation In many realworld applications. where xt = (x1 . . SMC methods are a noniterative alternative to MCMC algorithms. then it is possible to formulate a Bayesian model that incorporates this knowledge in the form of a prior distribution on the unknown quantities. for large t. This is in contrast to nondynamic models where the distribution is constant as new observations are generated. not the observations. Moreover. Note that xt are the quantities of interest. rare events simulation. Examples include tracking an aircraft using radar measurements. A comprehensive introduction to these methods is the book by Doucet et al. but it is possible that the dimension of xt be constant ∀t ≥ 1. observations arrive sequentially in time. Sequential Monte Carlo (SMC) methods are a noniterative. and global optimization. the words in the speech signal. In the latter case. and relates these to the observations via a likelihood function. the data is modelled dynamically in the sense that the underlying distribution evolves in time. alternative class of algorithms to MCMC. (2001). In all three examples. the unknown quantities of interest might be the location and velocity of the aircraft. designed speciﬁcally for inference in dynamic models. MCMC methods have proven highly effective in generating approximate samples from lowdimensional distributions p(x). t. . the observations up to time t determine the form of the distribution. so the overall computational cost would increase with t. or that xt have one dimension less than xt−1 . speech recognition using noisy measurements of voice signals. such as nonsequential Bayesian inference. these models are known as dynamic models. at each time step t a different MCMC sampler with stationary distribution pt (xt ) is required. xt ) typically increases in dimension with in time t as new observations are generated. If prior knowledge about these quantities is available. In these examples. Inference on the unknown quantities is then based on the posterior distribution obtained from Bayes’ theorem. based on the key idea that if pt−1 (xt−1 ) does not differ much from pt (xt ). designing the sampler and assessing its convergence would be increasingly difﬁcult. respectively. where pt (xt ) changes Let pt (xt ) denote the distribution at time t ≥ 1. . provided that it is possible to deﬁne an evolving sequence of artiﬁcial distributions from which the distribution of interest is obtained via marginalisation. . In the dynamic case.
and. ut ) ∼ g(·xt .1) . it is not possible to obtain exact samples from these evolving distributions. yi x1:i−1 . See Figure 7. φ). to obtain a good representation of pt (xt ). . . that is hidden. so the goal is to reuse an approximate sample.. respectively. we introduce several algorithms for inference in e statespace models. The literature sometimes distinguishes between statespace models where the state process is given by a discrete Markov chain. conditional on xt . In most applications of interest. The conditional independence structure of the ﬁrst few states and observations in a hidden Markov Model. (2005). θ). and similarly for y1:t .2. we restrict our attention to a particular class of dynamic models called the statespace model (SSM).e. and widely applicable. An extensive monograph on inference for statespace models is the book by Capp´ et al. (7. Yt . xt . .1 Inference problems in SSMs Under the notation introduced above. The state process is a Markov chain.e. y1 y2 y3 y4 y5 y6 y7 yt xt ∼ g(yt xt . representative of pt−1 (xt−1 ). and write f (·xt−1 ). as opposed to a continuous Markov chain.1 for illustration. i. and an observed process. xt−1 ) = p(xt xt−1 ) = f (xt xt−1 .2 Statespace models SSMs are a class of dynamic models that consist of an underlying Markov process. Consider the following notation for a statespace model: observation: hidden state: yt = a(xt . denoted by ut and vt . . xt xt−1 ∼ f (xt xt−1 . since inference is to be performed in real time as new observations arrive. usually called the state process. by Bayes’ theorem. where yt and xt are generated by functions a(·) and b(·) of the state and noise disturbances. 7. Statespace models and the Kalman ﬁlter algorithm pt (xt ). and the distribution of the p(yt x1:t . We will see in the sequel that SMC methods are highly ﬂexible. For simplicity. p(xt x1 . Moreover. In the present chapter and the following. . i. y1:t−1 ) = p(yt xt ) = g(yt xt . y1:i−1 ) = p(x1 )g(y1 x1 ) i=2 f (xi xi−1 )g(yi xi ). Note that we use the notation x1:t to denote x1 .. called hidden Markov models (HMM). 7. (2007). we have the joint density t t p(x1:t . and a more recent overview e is Capp´ et al. . Xt . usually called the observation process.74 7. unobserved. is independent of previous values of the state and observation processes. vt ) ∼ f (·xt−1 . . Let p(x1 ) denote the distribution of the initial state x1 . . φ) xt = b(xt−1 .1. i.. and g(·xt ). φ) Figure 7. θ). we drop the explicit dependence of the state transition and observation densities on θ and φ. θ) x1 x2 x3 x4 x5 x6 x7 ··· observation yt .e. it is necessary that the computational cost be ﬁxed in t. y1:t ) = p(x1 )g(y1 x1 ) i=2 p(xi . Assume φ and θ to be known. the density of the distribution of interest p(x1:t y1:t ) ∝ p(x1:t y1:t−1 )g(yt xt ) = p(x1:t−1 y1:t−1 )f (xt xt−1 )g(yt xt ). and point out that the algorithms in Chapter 8 apply more generally to dynamic models.
θ)g(yi xi .e. then it can be approximated by a ﬁnite mixture of Gaussian distributions to insure that the EM algorithm applies to the problem of parameter estimation. θ. under the assumption that all parameters in the model are known. when the state process is a discrete Markov chain. if the parameters are known. for 1 ≤ l < k ≤ t – prediction: p(xl:k y1:t ).e. these posterior distributions can be computed in closed form only in a few speciﬁc cases. φ)p(θ. i. pressions for the optimal values of the mean and variance parameters via a leastsquares approach.e. and the ExpectationMaximization (EM) algorithm returns parameter values for which the likelihood function of the observations attains a local maximum. we can write pt (x1:t ) = p(x1:t y1:t ). θ. The Viterbi algorithm returns the optimal sequence of hidden states. otherwise. of a collection of state variables conditional on a batch of observations: – ﬁltering: p(xt y1:t ) There exist several inference problems in statespace models that involve computing the posterior distribution – ﬁxed interval smoothing: p(xl:k y1:t ). so they are impractical when the state space is large.e. the normalising constant in (7. i. It is evident. φy1:t ).. For HMMs with discrete state transition and observation distributions. φ)p(x1 )g(y1 x1 .2 Statespace models 75 To connect this with the notation introduced for dynamic models.7. it is the Gaussian distribution. φy1:t ) ∝ p(y1:t x1:t . So far we assumed that the state transition and observation densities are completely characterised.. the tutorial of Rabiner (1989) presents recursive algorithms for the smoothing and static parameter estimation problems. These recursive algorithms involve summations over all states in the model. predict the state at time t conditional on y1:t−1 ). Kalman (1960) obtains recursive exalternates between two steps: a prediction step (i. φ).e. integrating over the state variables that are not of interest. The ﬁrst three inference problems reduce to marginalisation of the full smoothing distribution p(x1:t y1:t ). i. and an update step (i. and thus the posterior distribution of interest is known in closed form. If they are unknown. and update the prediction in light of the new observation). but we believe that stating the dependence on the observations explicitly leads to less confusion... the sequence that maximises the smoothing distribution.1) gives the posterior distribution up to a normalising constant posterior distribution is known only up to a constant. The Kalman ﬁlter algorithm (Kalman. i. then. θ. then the inference problem is called: – static parameter estimation: p(θ.1) can be computed analytically.3 presents a Bayesian formulation of the Kalman ﬁlter algorithm following Meinhold and Singpurwalla (1983). 1960) gives recursive expressions for the mean and variance of the ﬁltering distribution p(xt y1:t ). In fact. i.. and the noise disturbances ut and vt are Gaussian. φ)p(x1:t θ. observe yt . when the functions a() and b() are linear. then Bayesian inference is concerned with the joint posterior distribution of the hidden states and the parameters: t p(x1:t .e. for k > t and 1 ≤ l ≤ k. φ) = p(θ. Notice that equation (7.. that these inference problems depend on the tractability of the posterior distribution p(x1:t y1:t ). and the linear Gaussian model. or p(x1:t . that the parameters θ and φ are known. φy1:t ). such as the hidden Markov model.e. θ. in fact. If the observation distribution is continuous. Section 7. φ) i=2 f (xi xi−1 . The algorithm p(x1:t y1:t−1 )g(yt xt )dx1:t . whereas the fourth reduces to marginalisation of k – ﬁxed lag smoothing: p(xt−l y1:t ). for 0 ≤ l ≤ t − 1 p(x1:k y1:t ) = p(x1:t y1:t ) i=t+1 f (xi xi−1 ).. and it is oftentimes the case that the . If interest lies in the posterior distribution of the parameters. φy1:t ). For the linear Gaussian model. which reduces to integrating over the state variables in the joint posterior distribution p(x1:t .
called particles.2). Start with an initial Gaussian distribution on x1 : x1 ∼ Prediction step: From equation (7. . Θ2 ) are independent noise sequences. where ut ∼ N(0. Bt Σt−1 Bt + Θ2 ).4) where the superscript T indicates matrix transpose. and normalising. through a combination of sampling and resampling steps. We assume that both states and observations are vectors. the derivation of the mean and variance of the posterior distribution follows as detailed below. the posterior distribution of the state xt conditional on the observations y1:t is proportional to p(xt y1:t−1 )g(yt xt ). . The prediction error is et = yt − yt = yt − At Bt µt−1 .3) are known. 2003). Σ1 ). The key idea of SMC methods is to represent the posterior distribution by a weighted set of samples. Φ2 ). N(µ1 . Σt−1 ).2) and the result in (7. y1:t−1 )p(xt y1:t−1 ). Φ2 ) xt = Bt xt−1 + vt ∼ N(Bt xt−1 . where Bt µt−1 is the ˆ it follows that p(xt y1:t ) ∝ p(et xt . that are ﬁltered in time as new observations arrive. which is equivalent to observing yt . 1999). et = At (xt − Bt µt−1 ) + ut . Bt . Statespace models and the Kalman ﬁlter algorithm When an analytic solution is intractable. Hence SMC sampling algorithms are oftentimes called particle ﬁlters (Carpenter et al. y1:t−1 ∼ N(At (xt − Bt µt−1 ). let µt−1 and Σt−1 be the mean and variance of the Gaussian distribution of xt−1 conditional on y1:t−1 . Finally. Φ2 ) and vt ∼ N(0. N(0. where xt−1 y1:t−1 ∼ N(µt−1 . The ﬁrst term is the distribution of xt conditional on the ﬁrst t − 1 observations.2) (7. and the parameters At .4). Φ2 . . p(xt y1:t−1 )g(yt xt )dxt the product of these two terms. (7. Update step: Upon observing yt . in which case the parameters are matrices of appropriate sizes. The second term is the distribution of the new observation yt conditional on the hidden state at time t. which is represented as follows: observation: hidden state: yt = At xt + ut ∼ N(At xt . we begin by predicting the distribution of xt conditional on y1:t−1 . y1:t−1 )p(xt y1:t−1 ). where We now use the following results from Anderson (2003). sequential Monte Carlo methods are a simulationbased approach that offer greater ﬂexibility and scale better with increasing dimensionality. xt = Bt xt−1 + vt . It is also possible to let the noise variances Φ2 and Θ2 vary with time.3 The Kalman ﬁlter algorithm From (7. Θ2 ) independently. and vt ∼ (7. Following equation (7. Let X1 an X2 have a bivariate normal distribution: prior mean on xt . Gridbased methods using discrete numerical approximations to these posterior distributions are severely limited by parameter dimension. exact inference is replaced by inference based on an approximation to the posterior distribution of interest. we are interested in p(xt y1:t ) ∝ p(yt xt . 7. we have that T xt y1:t−1 ∼ N(Bt µt−1 . Alternatively. dxt−1 = p(xt y1:t−1 )g(yt xt ) .76 7. At time t − 1.1).3). So ˆ ut ∼ N(0. Φ2 ). This can be thought of as the prior distribution on xt . The result is the distribution of We now show how the prediction and update stages can be performed exactly for the linear Gaussian statespace model. Θ2 ). from (7. and Θ2 The Kalman ﬁlter algorithm proceeds as follows. computing this distribution is known as the prediction step.. Looking forward to time t. Updating p(xt y1:t−1 ) in light of the new observation involves taking interest: p(xt y1:t ) = p(x1:t y1:t )dx1 . consider predicting yt by yt = At Bt µt−1 . Chapter 8 presents SMC methods for the problems of ﬁltering and smoothing. this is known as the update step. By results in multivariate statistical analysis (Anderson. so et xt .
σ 2 ). if (7. We require that E(Xt ) = φµ = µ and Var(Xt ) = φ2 σ 2 + σU = σ 2 . Σ11 − Σ12 Σ22 Σ21 . Σ22 ). ˆ ˆ = Σt − Σt ˆ ˆ Σt = Σt ˆ 1 − Σt 2 ˆ Σt + σV . where ut ∼ N(0. Example 7. σ 2 ) is stationary for {Xt }t≥1 if Xt−1 ∼ N(µ. σ 2 ) and Xt Xt−1 = xt−1 ∼ is the unique stationary distribution of the chain. Consider the following AR(1) model: xt yt 2 = φxt−1 + σU ut ∼ N(φxt−1 . Σt ). xt ) corresponding to the N(φxt−1 .7.5) is true. Gaussian white noise processes. Σt = Bt Σt−1 Bt + Θ2 . linear autoregressive (AR(1)) model observed with noise). Bt Σt−1 Bt + Θ2 ). update ˆ ˆ U V the mean and variance of the posterior distribution: µt Σt ˆ = µt + Σt ˆ 1 (y − µt ) = ˆ ˆt + σ 2 t Σ V ˆ Σt 2 ˆ Σt + σV 1− ˆ Σt ˆt + σ 2 Σ V µt + ˆ ˆ Σt ˆ Σt y ˆt + σ 2 t Σ V . T (Bt Σt−1 Bt + Θ2 )AT t T At (Bt Σt−1 Bt + Θ2 )AT + Φ2 t T At (Bt Σt−1 Bt + Θ2 ) equivalent to observing yt . 2: Set t = 2. t t Algorithm 1 The Kalman ﬁlter algorithm. Then the mean and variance of the prediction at time t are: ˆ ˆ µt = φµt−1 and Σt = φ2 Σt−1 + σ 2 . provided φ < 1. Finally. y1:t−1 ∼ N(At (xt − Bt µt−1 ). The Markov chain {Xt }t≥1 2 is a Gaussian random walk with transition kernel K(xt−1 .1 (Firstorder. ˆ ˆt AT + Φ2 .3 The Kalman ﬁlter algorithm 77 X1 X2 ∼N µ1 µ2 . ˆ 4: Observe yt and compute error in prediction: et = yt − At µt . 1: Input: µ1 and Σ1 . 2 2 which are satisﬁed by µ = 0 and σ 2 = σU /(1 − φ2 ). σV ). y1:t−1 ) = N(µt . (7. Φ2 ) and xt y1:t−1 ∼ N(Bt µt−1 . 1) and vt ∼ N(0. σu ) 2 = xt + σV vt ∼ N(xt . σu ) imply that Xt ∼ N(µ.5) If equation (7. σU ) distribution.6) holds. and X2 ∼ N(µ2 . it follows that y1:t−1 ∼ N . The prediction error is et = yt − µt with variance Σt + σ 2 .5) holds. 1) are independent. where µt Σt Using the result above. Σ11 Σ21 Σ12 Σ22 . then the conditional distribution of X1 given X2 = x2 is given by −1 −1 X1 X2 = x2 ∼ N µ1 + Σ12 Σ22 (x2 − µ2 ). t ≥ 2. Go to step 3. A normal distribution N(µ. the N 0. In fact. the ﬁltering distribution is p(xt y1:t ) = p(xt et . At time t − 1. (7. xt et Bt µt−1 0 T Bt Σt−1 Bt + Θ2 T Since et xt .6) Conversely. respectively. σU /(1 − φ2 ) distribution 2 Start the Kalman ﬁlter algorithm with µ1 = 0 and Σ1 = σU /(1 − φ2 ). T ˆ 3: Compute mean and variance of prediction: µt = Bt µt−1 . then (7. since observing et is T T = Bt µt−1 + (Bt Σt−1 Bt + Θ2 )AT (At (Bt Σt−1 Bt + Θ2 )AT + Φ2 )−1 et t t T T T T = Bt Σt−1 Bt + Θ2 − (Bt Σt−1 Bt + Θ2 )AT (At (Bt Σt−1 Bt + Θ2 )AT + Φ2 )−1 At (Bt Σt−1 Bt + Θ2 ). let µt−1 and Σt−1 denote the posterior mean and variance. 2 2 N(φxt−1 . = = ˆ t −1 µt + Σt AT Rt et ˆ ˆ ˆ t −1 ˆ Σt − Σt AT Rt At Σt . 5: Compute variance of prediction error: Rt = At Σ t 6: Update the mean and variance of the posterior distribution: µt Σt 7: Set t = t + 1.
78 7. If this model is highly nonlinear. . An alternative is the unscented Kalman ﬁlter which takes a deterministic sampling approach. and that the error terms are normally distributed. but is approximated. The underlying assumptions of the Kalman ﬁlter algorithm are that the state transition and observation equa tions are linear. i. Meinhold and Singpurwalla (1989) let the distributions of the error terms ut and vt be Studentt. because the mean µt is an unbounded function of et . the posterior distribution is no longer known exactly. then this approach will result in very poor estimates of the mean and covariance. Statespace models and the Kalman ﬁlter algorithm ⊳ The Kalman ﬁlter algorithm (see Figure 1 for pseudocode) is not robust to outlying observations yt .e. then the extended Kalman ﬁlter algorithm propagates the mean and covariance via the Kalman ﬁlter equations by linearizing the underlying nonlinear model. for details. If the linearity assumption is violated. but the state transition and observation equations are differentiable functions. when the prediction error et is large. representing the state transition distribution by a set of sample points that are propagated through the nonlinear model. This approach improves the accuracy of the posterior mean and covariance. In this case. and show that the posterior distribution of xt given y1:t converges to the prior distribution of p(xt y1:t−1 ) when et is large. see Wan and van der Merwe (2000). and the variance Σt does not depend on the observed data yt ..
Under this condition. This is based on the observation that µ = Ef (h(X)) = h(x)f (x)dx = h(x) f (x) g(x)dx = g(x) h(x)w(x)g(x)dx = Eg (h(X) · w(X)). for (x. 1] as M g(x) for (x.d. it is possible to use 8. we want to select g as close as possible to f such that the estimator of µ has ﬁnite variance. y) = g(x)U[0. In particular.1) samples from g. sufﬁcient condition is that f (x) < M · g(x) and Varf (h(X)) < ∞. y) : x ∈ E.i. f (x) M g(x) ¯ f (x. 1] with weights f (x) ¯(x.3. we present SMC methods the ﬁltering and smoothing problems in statespace models. by sampling from an instrumental distribution g with support satisfying supp(g) ⊃ supp(f · h). In practice. We argue in the following subsection that importance sampling is more efﬁcient than rejection sampling. y) : x ∈ E. y) = 0 otherwise where f (x) = ¯ f (x. performing importance sampling on E × [0. in terms of producing weights with smaller variance. y) ∈ E×[0.1) is approximated by the empirical average of h · w evaluated at n i. where the rightmost expectation in (8. M g(x) f = w(x. (8.1 Importance Sampling versus Rejection Sampling ¯ Let E be the support of f .1] (y). importance sampling is introduced as a technique for approximating a given integral µ = h(x)f (x)dx under a distribution f . y)dy = 0 f (x) M g(x) M g(x)dy. where U[0. y) M for (x.Chapter 8 Sequential Monte Carlo In this chapter we introduce sequential Monte Carlo (SMC) methods for sampling from dynamic models. 8.1] (·) is the uniform ¯ distribution on [0. Consider the instrumental distribution g (x. y) on E × [0. One rejection sampling to sample from f and approximate µ. y ∈ 0. Deﬁne the artiﬁcial target distribution f (x. y) = g (x. these methods are based on importance sampling and resampling techniques.1 Importance Sampling revisited In Section 3.1. y ∈ 0. 1]. Then. 1]. y) 0 ¯ otherwise .
. y) ¯ w(x. associated with these points. we introduce the following technical lemma which relates the variance of a random variable to its conditional variance and expectation. i=1 where. Returning to importance sampling versus rejection sampling. w(x.i. w(x). Similarly. we have: Var (A) = E A2 − E [A] = EB EA A2 B 2 − EB [EA [AB]] . importance sampling from f using instrumental distribution g has weights w(x) = f (x)/g(x).d.1. Proof. y) dy = g(x) f (X) M g(X) M dy = 0 f (X) = w(X). we have by Lemma 8. 2 Considering the deﬁnition of conditional variance. it is clear that: Var (A) = EB VarA (AB) + EA [AB] 2 − EB [EA [AB]] 2 2 2 = EB [VarA (AB)] + EB EA [AB] − EB [EA [AB]] = EB [VarA (AB)] + VarB (EA [AB]) . is deﬁned as follows n i=1 I(xi ≤ x). on the same probability space. then the empirical measure i=1 . δxi (x) = 1 if x = xi and 0 otherwise. 8. Y )X) = 0 g (x. Y )X)) ≥ Var (w(X)) . For this purpose. Y )X) is a nonnegative function. y). we can deﬁne the empirical distribution function 1 ˆ F (x) = n where I(xi ≤ x) is the indicator of event xi ≤ x. We now show that the weights for rejection sampling.80 8.1 (Law of Total Variance). have higher variance than those for importance sampling. points {xi }n in E drawn from f .1 that Var (w(X. wi }n . We can approximate f by the following empirical i=1 measure. Sequential Monte Carlo is equivalent to rejection sampling to sample from f using instrumental distribution g. Y )X)) + E (Var (w(X. δxi (x) is the Dirac measure which places all of its mass at point xi . g(X) and the fact that Var (w(X. and the law of total probability. Y )) = Var (E (w(X.2 Empirical distributions Consider a collection of i. any expectation or variance with a subscript corresponding to a random variable should be interpreted as the expectation or variance with respect to that random variable. such that Var (A) < ∞. i. for any x ∈ E. 1 ˆ f (x) = n n I(x = xi ) = i=1 1 n n δxi (x). Y )X)) = Var (w(X)) + E (Var (w(X. In contrast. If the collection of points has associated positive. In the following. since 1 E (w(X. then the following decomposition exists: Var (A) = EB [VarA (AB)] + VarB (EA [AB]) . By deﬁnition. realvalued weights {xi . A and B. and then variance. Given two random variables.e. Lemma 8.
sample x1 i = 1. i=1 Sequential Importance Sampling exploits the idea of sequentially approximating an intractable density function ˆ f by the corresponding empirical measure f associated with a random sample from an instrumental distribution g. we can then estimate integrals with respect to f by integrals with respect to the h(x)f (x)dx ≈ ˆ h(x)f (x)dx = 1 n n h(xi ). . wt+1 (i) (i) n (i) (i) (i) xt . the target distribution evolves to pt+1 (xt+1 ).2 Sequential Importance Sampling n i=1 81 ˆ f (x) = wi δxi (x) . . Then Ef (h(X)) = or. . wt (i) (i) n i=1 is a good ap proximation to the target distribution at time t. For details. see Liu and Chen (1998). . The idea is to choose an instrumental distribution such that importance sampling can proceed sequentially. ∼ q1 (x1 ) for pt (xt ) qt (xt ) (i) (i) (i) . and suppose that it can be factored as follows: t+1 qt+1 (xt+1 ) = q1 (x1 ) i=2 (i) qi (xi xi−1 ) = qt (xt )qt+1 (xt+1 xt ). Let qt+1 (xt+1 ) denote the instrumental distribution. The desired result is a collection of samples xt+1 . for appropriately chosen instrumental distribution i=1 is also a good approximation to pt+1 (xt+1 ). pt (xt ). if the random sample is weighted. sample the (t+1)st component xt+1 from an instrumental distribution. The intuition is that if the weighted sample qt+1 (xt+1 xt ). n. and properly weighted with weights wi ∝ f (xi )/g(xi ). . f (x) and F (x). . At time t > 1. . The strong law of large numbers justiﬁes approximating the true density and distribution functions by ˆ ˆ f (x) and F (x) as the number of samples n increases to inﬁnity. Ef (h(X)) = h(x)f (x)dx ≈ ˆ h(x)f (x)dx = n i=1 wi h(xi ) . . the weighted sample xt+1 . are a random measure and distribution.2 Sequential Importance Sampling Sequential Importance Sampling (SIS) performs importance sampling sequentially to sample from a distribution pt (xt ) in a dynamic model. we sample xt+1 ∼ qt+1 (xt+1 xt ). update the weight wt+1 . called particles. then. for i = 1.2) Normalise the weights. At (i) (i) (i) (i) time t+1. 8. so. wt (i) (i) n i=1 be a collection of samples. n i=1 wi ˆ ˆ For ﬁxed x ∈ E. and update the weight wt+1 = (i) pt+1 (xt+1 ) qt+1 (xt+1 ) (i) (i) = pt+1 (xt+1 ) qt (xt )qt+1 (xt+1 xt ) pt+1 (xt+1 ) (i) (i) (i) (i) (i) = wt (i) pt+1 (xt+1 ) pt (xt )qt+1 (xt+1 xt ) (i) (i) (i) (i) . as functions of a random sample. From these approximations. that targets pt (xt ). and append xt+1 to xt . wt+1 (i) (i) n i=1 that targets pt+1 (xt+1 ). The term pt (xt )qt+1 (xt+1 xt ) is known as the incremental weight. (i) (i) Then. Let xt .8. . Normalise the weights by dividing them by wt = (i) (i) n j=1 (j) w1 . At time t = 1. n. the weight wt+1 can be computed incrementally from wt . n i=1 wi respectively. at the following time step. associated empirical measure. and set (i) w1 = p1 (x1 )/q1 (x1 ). (8. Let h : E → R be a measureable function.
Doucet et al. So it is natural to seek an instrumental distribution that minimises the variance of the weights. a few of the weights contain most of the probability mass. vt ) ∼ f (·xt−1 ). the distribution pt (xt ) is used to predict xt+1 . and interpret the second ratio on the right hand side as correcting the discrepancy between qt+1 (xt+1 xt ) and (i) In practice. (i) pt+1 (xt+1 xt ). nonGaussian statespace models.1 Optimal instrumental distribution Kong et al. This phenomenon is know in the literature as weight degeneracy. So in the long run. For the bootstrap ﬁlter. where the subscript t + 1 indicates that the distribution may (i) (i) incorporate all the data up to time t + 1: y1:t+1 . xt+1 ) pt (xt )qt+1 (xt+1 xt ) 2 (i) (i) (i) 2 (i) = = (i) wt 0. (i) (i) (i) (i) (2000) show that the optimal instrumental distribution is qt+1 (xt+1 xt ) = pt+1 (xt+1 xt ). Sequential Monte Carlo 8.2. The instrumental distribution is qt+1 (xt+1 x1:t ). xt+1 ) pt (xt ) (i) (i) 1 qt+1 (xt+1 xt ) (i) dxt+1 − pt+1 (xt ) pt (xt ) (i) if qt+1 (xt+1 xt ) = pt+1 (xt+1 xt ). weight simpliﬁes to pt+1 (xt+1 )/pt (xt+1 ). Proof.2. The resulting SIS algorithm is known as the bootstrap ﬁlter. when they are different.2). From (8. in such instances. Varqt+1 (xt+1 x(i) ) wt+1 t (i) = wt (i) 2 Varqt+1 (xt+1 x(i) ) t pt+1 (xt . When qt+1 (xt+1 xt ) = pt (xt+1 xt ). and the smoothing distribution pt+1 (xt+1 ) = p(x1:t+1 y1:t+1 ).2. sampling from the optimal instrumental distribution is usually not possible.e. (1994) and Doucet et al. observation: yt = a(xt .2 SIS for statespace models Recall the statespace model introduced in Section 7. assume that all parameters are known. the variance of the corresponding weights is low for small t. in the sense that weight degeneracy grows exponentially in the dimension t. Hence the optimal instrumental distribution is pt+1 (xt+1 xt ). (i) (i) 2 pt+1 (xt . (2000) prove that the unconditional variance of the weights increases over time. i. Chopin (2004) argues that the SIS algorithm suffers from the curse of dimensionality. y1:t ) = . ut ) ∼ g(·xt ) hidden state: xt = b(xt−1 . so other choices of instrumental distributions are considered. The instrumental distribution qt+1 (xt+1 xt ) = pt+1 (xt+1 xt ) minimises the variance of the weight wt+1 conditional upon xt . then the incremental (i) (i) (i) (i) 8.2. (1993) in the context of Bayesian ﬁltering for nonlinear. Oftentimes it is possible to ﬁnd good approximations to the optimal instrumental distribution. variance of wt+1 conditional upon xt is zero. (i) We present the SIS algorithm to sample approximately from the ﬁltering distribution pt+1 (xt+1 ) = p(xt+1 y1:t+1 ). For simplicity. It was ﬁrst introduced by Gordon et al. however. but weight degeneracy still occurs at t increases.. More intuitively. Liu and Chen (1998) rewrite the incremental weight as pt+1 (xt+1 ) pt (xt )qt+1 (xt+1 xt ) (i) (i) (i) (i) = pt+1 (xt ) pt+1 (xt+1 xt ) pt (xt ) qt+1 (xt+1 xt ) (i) (i) (i) (i) . while most of the particles have normalised weights with near zero values. qt+1 (xt+1 x1:t ) = p(xt+1 x1:t . This result appears in the following proposition.82 8. in the sense that the (i) (i) (i) (i) Proposition 8.
sample xt+1 ∼ f (xt+1 xt ).1)).e. y1:t ) (i) (i) (i) ... p(yt+1 xt ). So the (i) (i) Example 8.8. (i) (i) for i = 1 : n. tion is where the normalising constant equals the predictive distribution of yt+1 conditional on xt . y1:t+1 ) (i) (i) (i) = = p(yt+1 x1:t .1). and p(xt+1 y1:t ) = ˆ we have the approximation n p(xt+1 y1:t+1 ) = ˆ and sampling proceeds in two steps: Propagate step: Update step: i=1 wt g(yt+1 xt+1 )f (xt+1 )xt ).e.3) wt (j) be a weighted sample representing the ﬁltering density at time t.2 Sequential Importance Sampling 83 f (xt+1 xt ). xt y1:t )dxt f (xt+1 xt )p(xt y1:t )dxt . weight function becomes wt+1 ∝ wt p(yt+1 xt ). the incremental weight does not depend on past trajectories x1:t . by (8. (1993) introduce the bootstrap ﬁlter in a very intuitive way. σ 2 ) with . y1:t )p(xt+1 x1:t . Write the ﬁltering density as follows: p(xt+1 y1:t+1 ) = g(yt+1 xt+1 )p(xt+1 y1:t ) p(yt+1 y1:t ) ∝ g(yt+1 xt+1 ) ∝ g(yt+1 xt+1 ) Now. the optimal instrumental distribution incorporates the most recent observation: qt+1 (xt+1 x1:t ) = p(xt+1 x1:t . n j=1 (8. weigh xt+1 with weight wt+1 = wt g(yt+1 xt+1 ). In this case. wt (i) (i) n i=1 (i) n Then. such that = 1.1 (AR(1) model observed with noise (continued from example 7. i. followed by an update step (similar in spirit to the recursions in the Kalman ﬁlter algorithm). 2σ implying that the distribution is N(µ. Furthermore.3). (i) (i) In contrast to the instrumental distribution of the bootstrap ﬁlter. (i) (i) (i) (i) for i = 1 : n. y1:t )p(xt+1 x1:t . but only on the likelihood function of the most recent observation. i=1 p(xt+1 . The optimal instrumental distribu qt+1 (xt+1 x1:t ) ∝ g(yt+1 xt+1 )f (xt+1 xt ) ∝ exp − = exp − 1 1 2 exp − 2 (xt+1 − φxt )2 2 (yt+1 − xt+1 ) 2σV 2σU − 2xt+1 yt+1 φxt 2 + σ2 σV U + 2 yt+1 φ2 x2 t 2 + σ2 σV U 1 1 2 1 x 2 + σ2 2 t+1 σV U 1 ∝ exp − 2 (xt+1 − µ)2 . but rather as a twostage recursive process with a propagate step. xt+1 . the instrumental distribution does not incorporate the most recent observation yt+1 . Gordon et al. i. empirical measure approximating p(xt y1:t ). Normalise the weights. without reference to SIS. xt+1 . and the weight is p(x1:t+1 y1:t+1 ) (i) (i) (i) (i) (i) (i) (i) p(x1:t y1:t )f (xt+1 xt )g(yt+1 xt+1 ) (i) (i) (i) p(x1:t y1:t )f (xt+1 xt ) (i) wt+1 = wt (i) (i) p(x1:t y1:t )f (xt+1 xt ) (i) (i) ∝ wt = wt g(yt+1 xt+1 ) (i) (i) (i) by equation (7. let xt . p(xt y1:t ) = ˆ i=1 wt δx(i) (xt ) is the t (i) (i) n wt f (xt+1 xt ). y1:t )dxt+1 g(yt+1 xt+1 )f (xt+1 xt )dxt+1 g(yt+1 xt+1 )f (xt+1 xt ) (i) (i) p(yt+1 x1:t .
This idea of rejuvenating the particles by resampling was ﬁrst suggested by Gordon et al. n X i=1 n X i=1 j=1 n P wt+1 = 1. n i=1 = 1. as n → ∞. set wt+1 ∝ wt f (xt+1 xt )g(yt+1 xt+1 )/qt+1 (xt+1 x1:t ). t+1 (i) p(x1:t+1 y1:t+1 ) = ˆ wt+1 δx(i) (i) (x1:t+1 ).. The idea is as follows: let xt . h(xt )ˆt (xt )dxt = p i=1 wt h(xt ) → (i) (i) h(xt )pt (xt )dxt . set w1 ∝ p(x1 )g(y1 x1 )/q1 (x1 ). yt+1 xt ∼ N(φxt . wt+1 (i) (i) n i=1 be a weighted sample.3 Sequential Importance Sampling with Resampling One approach to limiting the weight degeneracy problem is to choose an instrumental distribution that lowers the variance of the weights. obtained by importance sampling.84 8. σU + σV ). Algorithm 2 The SIS algorithm for statespace models 1: Set t = 1. i=1 t+1 (i) p(x1:t+1 y1:t+1 ) = ˆ wt+1 δx(i) (x1:t+1 ). (i) (i) 3: For i = 1 : n. i=1 1:t+1 Algorithm 2 is the general SIS algorithm for statespace models. The computational complexity to generate n particles representing p(x1:t y1:t ) is O(nt). Normalise such that 4: At time t + 1. (i) (i) (i) (i) (i) (i) (i) (i) (i) (i) (i) j=1 n P w1 = 1. (j) wt+1 δx(i) (xt+1 ). xt+1 − yt+1 φxt 2 + σ2 σV U dxt+1 ∝ exp − 1 2φxt 1 2 y 2 2 − σ 2 + σ 2 yt+1 2 t+1 σU + σV U V 2 2 i. Go to step 4. Under suitable t (i) regularity conditions. (i) 2: For i = 1 : n. normalised such that n Then the ﬁltering and smoothing densities are approximated by the corresponding empirical measures: n wt+1 = 1. (j) 6: For i = 1 : n. wt and normalised such that n j=1 (j) wt (i) (i) n i=1 be a weighted sample from pt (xt ). do: 5: For i = 1 : n. (i) (j) p(xt+1 y1:t+1 ) = ˆ wt+1 δx(i) (xt+1 ). 1:t+1 8. sample x1 ∼ q1 (x1 ). . sample xt+1 ∼ qt+1 (xt+1 x1:t ). ⊳ n j=1 Let x1:t+1 . a second approach is to introduce a resampling step after drawing and weighing the particles at time t + 1. for any ﬁxed measurable function h. The empirical measure is pt (xt ) = ˆ n wt δx(i) (xt ). Sequential Monte Carlo 2 2 σU σV 2 + σV µ= σ2 = 2 σU 2 σU yt+1 φxt 2 + σ2 σV U 2 2 σU σV 2 . the law of large number states that.e. + σV The normalising constant of this optimal instrumental distribution is p(yt+1 xt ) = g(yt+1 xt+1 )f (xt+1 xt )dxt+1 1 2 exp 2 y 2σV t+1 1/2 ∝ exp − 1 √ 2π 1 2 1 1 2 + σ2 σU V 1 2 2 2 σU σV 2 + σ2 σU V 2 yt+1 φxt 2 + σ2 σV U 2 2 σU σV 2 2 σU + σV 2 × 2 1 1 2 + σ2 σU V exp − 1 1 2 + σ2 σU V . (1993). Normalise such that 7: The ﬁltering and smoothing densities at time t + 1 may be approximated by p(xt+1 y1:t+1 ) = ˆ 8: Set t = t + 1. from p(x1:t+1 y1:t+1 ).
In the extreme case. xt ˆ ˜ (i) (i) (j) = xt with (i) probability wt . but that the latter can be expected to be more efﬁcient in ﬁltering the states. to forget its initial condition). The advantage of resampling is that it eliminates particle trajectories with low weights..3 Sequential Importance Sampling with Resampling 85 Suppose now that we draw a sample of size n′ from pt (xt ) with replacement. the SIS algorithm is more efﬁcient than the SISR algorithm. Chopin (2004) argues that for smoothing the ﬁrst state x1 . replicating trajectories with large weights reduces diversity by depleting the number of distinct particle values at any time step in the past. . invoking the law of large numbers. and replicates those with large weights. for t′ ≪ t. the integral of the function h with respect to the new empirical measure based on xt . as ˜ n′ → ∞. .e. The new particles have equal weights: wt = 1/n′ .e. Again.e. However. i.1. approximations to the smoothing density pt+1 (x1:t+1 ) and ﬁxed interval smoothing densities pt+1 (x1:t′ ). On the other hand. i=1 (j) n′ j=1 So. . 1/n′ ˜ is a good approximation to the integral of h with respect to pt (xt ). provided the number of particles is large enough. their diversity decreases due to resampling.. At time t + 1.. n. not simply to the last value xt+1 .8. then the asymptotic variance of the estimator is bounded from above in t. new values xt+1 are sampled and appended to the particle trajectories. resampling then eliminates some of these trajectories. at the current time step t + 1. we can obtain a good approximation to the ﬁltering density pt+1 (xt+1 ) from the particles and their corresponding weights. Since the values at t′ < t + 1 are not rejuvenated as t increases. i. (i) particles 0 1 2 3 4 5 6 2 4 t 6 8 10 Figure 8. will be poor. the smoothing density pt+1 (x1:t+1 ) is approximated by a system of particle trajectories with a single common acestor. In general.1 displays this situation graphically. if the instrumental distribution of the SISR algorithm has a certain abilitiy to “forget the past” (i. . resampling is applied to the entire trajectory x1:t+1 . for j = 1. for n′ large. Figure 8. the new algorithm is known as SIS with Resampling (SISR). In particular. . approximating pt+1 (xt+1 ). Plot of particle trajectories with a single common ancestor after resampling. 1 n′ n′ (j) h(˜t ) x j=1 n (i) (i) → wt h(xt ). approximating pt+1 (x1 ). all of the resampled particles then contribute signiﬁcantly to the importance sampling estimates.. i. ˆ (i) In the SIS algorithm.e.
and residual sampling (Liu and Chen. other resampling schemes exist that introduce lower Monte Carlo variance.. wt+1 and let yt+1 (i) n i=1 (i) (i) n i=1 be a weighted sample approximating pt+1 (xt+1 ). samples from the target distribution pt+1 (xt+1 ) that would be required to obtain an estimator with the same variance as the SIS estimator.86 8. such as stratiﬁed sampling (Carpenter et al.i. if one is interested in estimating the integral for some measurable function h. and wt+1 is the normalised version of wt+1 . since the weights are normalised to sum to 1. then the estimator n i=1 (i) (i) wt+1 h(xt+1 ) h(xt+1 )pt+1 (xt+1 )dxt+1 . and increasing the variance at the recent time step. where wt+1 = wt+1 / ¯ (i) (i) n j=1 wt+1 are the normalised weights. it can reduce the variance of estimators at later times. and suppose that we’re interested in estimating the mean µ = Ept+1 (xt+1 ) (h(Xt+1 )). Following Kong et al.i. Varqt+1 (xt+1 ) (wt+1 ) is impossible to obtain. So. ¯ wt+1 qt+1 (xt+1 )dxt+1 = 1. So far we discussed resampling via multinomial sampling. Let h(xt+1 ) be a measurable function..i. so a tradeoff is required between reducing the Monte Carlo variance in the future. (1994). but can be approximated ¯ ¯ by the sample variance of wt+1 ¯ (i) n i=1 . Var (ˆM C ) µ which. 8. the resampled trajectories are no longer independent.d. obtained by SIS. we deﬁne the effective sample size (ESS). draws from pt+1 (xt+1 ). and the Monte Carlo estimator is µ ˆ MC 1 = n n h(yt+1 ). where C is the normalising constant. i. Sequential Monte Carlo Moreover. and no additional bias. 2004). cannot be computed exactly. 1999). compared to estimation based on a sample of i.1 Effective sample size Resampling at every time step introduces unnecesary variation. and Eqt+1 (xt+1 ) (wt+1 )2 = Eqt+1 (xt+1 ) ¯ wt+1 C 2 n−1 1 2 = 2 Eqt+1 (xt+1 ) (wt+1 ) ≈ C n−2 n i=1 n j=1 wt+1 wt+1 (j) (i) 2 2. Kong et al. 1998). ESS is interpreted as the number of i. In general. in general. However. draws from the target distribution pt+1 (xt+1 ). the ESS is deﬁned as follows: n n = ≈ 1 + Varqt+1 (xt+1 ) (wt+1 ) ¯ Eqt+1 (xt+1 ) (wt+1 )2 ¯ n n n i=1 n j=1 n i=1 ESS = wt+1 ¯ (i) 2 = wt+1 (i) (j) wt+1 . Let xt+1 . so resampling has the additional effect of increasing the Monte Carlo variance of an estimator at the current time step (Chopin. be a sample of i. a measure of the efﬁciency of estimation based on a given SIS sample.3.d. Chopin (2004) shows that residual sampling always outperforms multinomial sampling: the resulting estimator using the former sampling scheme has lower asymptotic variance.d. 2 2 (j) In practice.e. ¯ Var (ˆM C ) µ where qt+1 (xt+1 ) is the instrumental distribution in SIS. µIS = ˆ n i=1 The SIS estimator of µ is wt+1 h(xt+1 ) n j=1 (i) (i) wt+1 (i) (j) . Since ESS ≤ . i=1 Then the relative efﬁciency of SIS in estimating µ can be measured by the ratio Var µIS ˆ . must be computed before resampling (as it will have lower Monte Carlo error than if it were computed after the resampling step). (1994) propose the following approximation that has the advantage of being independent of the function h: Var µIS ˆ ≈ 1 + Varqt+1 (xt+1 ) (wt+1 ).
a draw would fall under the mode of the target distribution. 8.3 Sequential Importance Sampling with Resampling 87 n (Kong et al. . Normalise such that (i) (i) j=1 4: At time t + 1. ´ Example 8. (i) wt+1 δx(i) (xt+1 ). which. 1994). sample x1 ∼ q1 (x1 ). Go to step 4. they show bounded. we expect sampling from this instrumental distribution to result in weights that are similar in value. resulting in a strikingly different weight value. So choosing an appropriate sample size n depends on knowing the shape of the target distribution. 2: For i = 1 : n. (i) (i) (i) 3: For i = 1 : n. so careful implementation and interpretation is required when dealing with SMC methods. . is not known. and thus have small variance. and the target distribution is ﬂat in the region of the mode of the instrumental distribution.. With small probability.2 SISR for statespace models Figure 3 presents the SISR algorithm with multinomial sampling for statespace models. j=1 (i) (j) (j) 6: If ESS < threshold. a ﬁxed threshold is chosen (in general. The ESS would then be large. in practice. 1:t+1 For the SISR algorithm for statespace models. but the weights are similar in value. thus incorrectly indicating a good match between the instrumental and target distributions. Via the ESS approach. half of the sample size n). wt+1 (i) (i) n i=1 approximates the target distribution pt+1 (xt+1 ). then resample: for i = 1 : n. the rate of convergence is proportional to 1/n.d.2 (A nonlinear time series model (Cappe et al. Algorithm 3 The SISR algorithm for statespace models 1: Set t = 1. j = 1. This. More generally. and residual sampling. 2007)). A word of caution is required at this point. ˜ (i) (i) 7: For i = 1 : n. sample xt+1 ∼ qt+1 (xt+1 x1:t ). could happen if the target distribution places most of its mass on a small region where the instrumental distribution is ﬂat. However. It is possible that the instrumental distribution matches poorly the target distribution.3. n X i=1 n X i=1 (i) (i) (i) i=1 n P wt+1 = 1. only under restrictive assumptions can they prove that approximation errors do not accumulate over time. set x1:t = x1:t with probability wt .8.i. t+1 (i) p(x1:t+1 y1:t+1 ) = ˆ wt+1 δx(i) (i) (x1:t+1 ). sample of size n from pt+1 (xt+1 ). This example highlights the importance of sampling a large enough number of particles such that all modes of the target distribution are explored via sampling from the instrumental distribution. under weak regularity conditions. Consider the following nonlinear time series model: . Chopin (2004) proves a Central Limit Theorem result for the SISR algorithm under both multinomial sampling. Hence. (i) (i) (i) (i) (i) (i) (i) n P w1 = 1. set wt+1 ∝ wt f (xt+1 xt )g(yt+1 xt+1 )/qt+1 (xt+1 xt ). . In practice. Finally. measurable functions. then an ESS value close to n indicates that the SIS sample of size n is approximately as “good” as an i. of course. we use as many particles as computationally feasible. Normalise such that 9: The ﬁltering and smoothing densities at time t + 1 may be approximated by p(xt+1 y1:t+1 ) = ˆ 10: Set t = t + 1. then a resampling step is performed. and if the ESS falls below that threshold. set x1:t = x1:t and wt = 1/n. not restricted to applications to statespace models. we’re using the importance sampling weights to evaluate how well the weighted sample xt+1 .. . Crisan and Doucet (2002) prove that the empirical distributions convergence of the mean square error for bounded. n. set w1 ∝ p1 (x1 )g(y1 x1 )/q1 (x1 ). for ˜ i = 1 : n. for example. provided that the weights are upper converge to their true values almost surely as n → ∞. (j) 8: For i = 1 : n. Furthemore. moreover. do: P (j) 5: Resample step: compute ESS = 1/ n (wt )2 .
Kernel density estimation is a nonparameteric approach to estimating the density of a random variable from a (possibly weighted) sample of values.6 × n. In general. it is interesting to notice that there is clear evidence of multimodality and nonGaussianity. The densities are f (xt xt−1 ) = N g(yt xt ) = N xt−1 xt−1 + 25 + 8 cos(1. Sequential Monte Carlo yt xt = = x2 t + vt ∼ g(yt xt ) 20 xt−1 xt−1 + 25 + 8 cos(1. 10). and the kernel density estimate of the ﬁltering distribution as a continuous line. w9 ticles and resampling whenever ESS < 0. see Silverman (1986). and parameters σv = 1. x_t −20 −10 0 10 20 0 20 40 time(t) 60 80 100 0 5 10 15 20 25 y_t 0 20 40 time(t) 60 80 100 Figure 8. 10 2 1 + x2 t−1 x2 t .3 shows the weighted samples x9 .4 and 8. We use a Gaussian kernel with ﬁxed width of 0.2t) + ut ∼ f (xt xt−1 ).1 .5 show the image . The kernel density estimator takes into account both the value of the weights. Moreover.4. Figure 8.2. resampling whenever ESS < 0. we run the SISR algorithm up to time t = 100 with n = 10000 and intensity plots of the kernel density estimates based on the ﬁlter outputs.2t).2 shows 100 states xt and corresponding observations yt generated from this model. Using these 100 observations. with the true state sequence overlaid. we begin by running the SISR algorithm until t = 9. σv ). Plot of 100 observations yt and hidden states xt generated from the above statespace model. resampling is required for good performance of the SIS algorithm.5.5.6 × n. 2 1 + x2 t−1 2 2 2 2 where vt ∼ N(0. with n = 10000 partion: qt+1 (xt+1 x1:t ) = f (xt+1 xt ). σu ). 20 ⊳ Figure 8. Let x1 ∼ p(x1 ) = N(0. and do not track the correct state sequence. ut ∼ N(0. the true state value falls in the high density regions of the density estimate in Figure 8. In Figure 8. indicating good performance of the SISR algorithm. Hence. σu = 10. The instrumental distribution is the state transition distribu(i) (i) (i) (i) n i=1 as small dots (with weights unnormalised). To analyse the effect of resampling. For details. we remark that the particle distributions are highly degenerate. and the local density of the particles. Figures 8.88 8. however. and the SIS algorithm with n = 10000.
which is highly variable (σu = 10) compared to the 2 observation distribution (σv = 1). draws from the state transition distribution are given low weights under the observation distribution. ·). Figure 8. drawn from an instrumental distribution. w9 (i) (i) n i=1 shown as small dots.30 Density 0.1 we interpreted rejection sampling as importance sampling on an enlarged space.6 shows that the weights quickly degenerate as t increases.. we already observe weights on the order of 10−300 .05 0. pt+1 Kt+1 = pt+1 . i.8. Hence. 1/n ˜ pt+1 (x1:t+1 ). to produce the set x1:t+1 . each (i) n i=1 (i) x1:t+1 will ˜ is a rejuvenated.15 0. then. Let qt+1 (x1:t+1 ) denote the instrumental distribution from which the particles are generated.e. . To make this last point more clear.3. 8. so can importance sampling with an MCMC move be interpreted as importance sampling on an enlarged space with instrumental distribution qt+1 (x1:t+1 )Kt+1 (x1:t+1 . Recall that these weights are a measure of the adequacy of the simulated trajectory. weighted is a good particle representation of tend to move to a distinct point in a high density region of the target distribution.00 −30 0.25 −20 −10 0 particles x_9 10 20 Figure 8. Let Kt+1 be a pt+1 (x1:t+1 )invariant Markov kernel. wt+1 (i) (i) n that targets the distribution pt+1 (x1:t+1 ). 1/n ˜ (i) n i=1 i=1 be a weighted set of particles . the MCMC step helps the particles track the moving target. 1/n x set of particles that targets the distribution pt+1 (z1:t+1 ). and call their proposed algorithm resamplemove. Filtering density estimate at t = 9 from SISR algorithm with n = 10000. During resampling. . to the target distribution.10 0. . thus improving the particle representation. Then z1:t+1 . n. with no resampling to eliminate the particles with very low weights. and kernel density estimate as continuous line. provided that the kernel Kt+1 is fast mixing. draw z1:t+1 ∼ Kt+1 (˜1:t+1 .20 0. In this particular example. and. z1:t+1 ). we look at histograms of the base 10 logarithm of the normalised weights at various time steps in the SIS algorithm. by t = 5. there is a quickly growing accumulation of very low weights. and ESS threshold = 6000. If x1:t+1 .4 Sequential Importance Sampling with Resampling and MCMC moves Gilks and Berzuini (2001) introduce the idea of using MCMC moves to reduce sample impoverishment. The algorithm performs sequential importance resampling with an MCMC move after the resampling step that rejuvenates the particles. Weighted n o samples x9 . In the words of Gilks and Berzuini (2001). . 2 the instrumental distribution is the state transition distribution. z1:t+1 ) and target distribution pt+1 (x1:t+1 )Kt+1 (x1:t+1 . Just as in Section 8. The MCMC move (i) (i) (i) n i=1 is as follows: for i = 1. . some particles will be replicated (possibly many times). Let x1:t+1 .1.4 Sequential Importance Sampling with Resampling and MCMC moves 89 Filtering density estimate at t=9 (n = 10000) 0.
with diamond symbol indicating the true state sequence (SIS algorithm). Kernel density estimates up to time t=100. no resampling (n = 10000) 40 x_t −40 −20 0 20 20 40 t 60 80 100 Figure 8. Image intensity plot of the kernel density estimates up to t = 100. .5.4. with diamond symbol indicating the true state sequence (SISR algorithm).90 8. Image intensity plot of the kernel density estimates up to t = 100. Sequential Monte Carlo Kernel density estimates up to time t=100 (n = 10000) 40 x_t −40 −20 0 20 20 40 t 60 80 100 Figure 8.
but the latter does not alter the justiﬁcation of the algorithm. Histogram of the base 10 logarithm of the normalised weights for the ﬁltering distributions (SIS algorithm).3 that resampling. he is not as hopeful regarding the smoothing problem. Third. it is a good representation of p(x1:t y1:t ). Hence. Chopin (2004) states that MCMC moves may lead to more stable algorithms for the ﬁltering problem in terms of the asymptotic variance of estimators (although theoretical results to support this are lacking).. computing the estimates from each output.1 SISR with MCMC moves for statespace models Algorithm 4 is the SISR algorithm with MCMC moves for statespace models. although it introduces extra Monte Carlo variation at the current time step. say m = 10.e. thus increasing the computational cost of performing the move. as t increases. notice that the t. MCMC move is applied to the entire state trajectory up to time t. since it is assumed that the particle set at step t has “converged”. z1:t ) = pt (z1:t x1:t ). i. Also. only one MCMC step will sufﬁce (i.4.4 Sequential Importance Sampling with Resampling and MCMC moves 91 t=1 5000 t=5 Frequency Frequency −14 −12 −10 −8 −6 −4 3000 0 1000 0 −350 500 1000 −300 −250 −200 −150 −100 −50 0 t=10 1000 200 t=50 Frequency Frequency −350 −300 −250 −200 −150 −100 −50 0 600 200 0 0 −350 50 100 −300 −250 −200 −150 −100 −50 0 t=75 25 t=100 150 Frequency Frequency −350 −300 −250 −200 −150 −100 −50 0 100 50 0 0 −350 5 10 15 20 −300 −250 −200 −150 −100 −50 0 Figure 8. it can be a Gibbs sampling. for t ﬁxed. it is increasingly difﬁcult to construct a fastmixing Markov kernel of dimension t. independent particle ﬁlters in parallel. the dimension of the kernel increases with . notice that except for the requirement of invariance. and monitoring the empirical variance of these m estimates as t increases.6. and an explicit formulation for the asymptotic variance as the number of particles tends to inﬁnity. mixing kernel at step t. In practice. He suggests investigating the degeneracy of a given particle ﬁlter algorithm by running m. Second. First. or a MetropolisHastings kernel. where pt+1 (x1:t+1 )Kt+1 (x1:t+1 .e. there are no constraints on the choice of kernel.. however. no burnin period is required).. This is similar to the idea from Section 8. in the extreme case. They argue that. rejuvenation via a perfectly time steps. This explanation omits the resampling step. Kt (x1:t .8. Gilks and Berzuini (2001) present a Central Limit Theorem result in the number of particles. the MCMC move is applied only to the last component xt+1 with kernel Kt+1 that is invariant with respect to p(xt+1 y1:t+1 ). can reduce the asymptotic variance of estimators at later 8. i. can reduce the variance at later times. z1:t+1 )dx1:t+1 = pt+1 (z1:t+1 ) by invariance of Kt+1 .e.
Algorithm 5 returns one realization from p(x1:T y1:T ) via this forwardﬁltering. (i) (i) (i) (i) 6: If ESS < threshold... x ˜ to p(xt y1:t ). set x1:t = z1:t . j=1 (i) (i) j=1 n P w1 = 1. y1:t ) = ˆ wt f (xt+1 xt )δx(i) (xt ) t n j=1 wt f (xt+1 xt ) (j) (j) . j = 1. to obtain particle approximations The draws (˜1 .4) for t = T − 1 to t = 1. sample xt+1 ∼ qt+1 (xt+1 x1:t ). xT ) form an approximate realization from p(x1:T y1:T ). for ˜ (i) (i) (i) i = 1 : n. . backwardsmoothing procedure.. . i. y1:T ) = p(xT y1:T ) p(xt y1:t )f (xt+1 xt ). (i) (i) (i) (i) (i) n P (i) (j) (j) 9: For i = 1 : n. . t+1 (i) p(x1:t+1 y1:t+1 ) = ˆ wt+1 δx(i) (i) (x1:t+1 ). compared to the computational complexity of SISR. . . We carry out smoothing via Algorithm 5. Example 8. n X i=1 n X i=1 i=1 wt+1 = 1. Finally. the SISR algorithm is prone to suffer from sample impoverishment as t grows. ˜ (i) (i) (i) (i) 7: MCMC step: for i = 1 : n.e. for ﬁltering density approximations with n particles. followed by a smoothing procedure applied backward in t (Godsill et al. Then. set w1 ∝ p1 (x1 )g(y1 x1 )/q1 (x1 ). This section presents a Monte Carlo smoothing algorithm for statespace models that is carried out in a forwardﬁltering. for i = 1 : n. wt t (i) (i) n i=1 (i) wt δx(i) (xt ). . . sample x1 ∼ q1 (x1 ). it is quadratic in n. y1:t ) xt . The computational complexity is O(nT ). for example). do: P (j) 5: Resample step: compute ESS = 1/ n (wt )2 . This smoothing algorithm has the advantage that it uses particle approximations to the ﬁltering densities (under the assumption that good approximations can be obtained via SISR. that is. 2004). set x1:t = x1:t and wt = 1/n. T . implementing the SISR algorithm as before. with n = 10000 . the computational complexity is O(n2 T ). then resample: for i = 1 : n. set x1:t = x1:t with probability wt . Normalise such that 10: The ﬁltering and smoothing densities at time t + 1 may be approximated by p(xt+1 y1:t+1 ) = ˆ 11: Set t = t + 1. a ﬁltering procedure such as SISR is applied forward in time. (i) wt+1 δx(i) (xt+1 ). as opposed to resampled particle trajectories x1:t which suffer from sample impoverishment as t increases. Sequential Monte Carlo Algorithm 4 The SISR algorithm with MCMC moves for statespace models 1: Set t = 1. . . 1:t+1 8. n. and then to apply the backward smoothing recursion (8. . set wt+1 ∝ wt f (xt+1 xt )g(yt+1 xt+1 )/qt+1 (xt+1 xt ). We continue the example of the nonlinear time series model.3 (A nonlinear time series model (continued)). the SISR algorithm) forward in time. be a particle representation to the ﬁltering density p(xt y1:t ). So for n realizations. for n realizations.5 Smoothing density estimation As we have seen. backwardsmoothing approach.g. p(xt y1:t ) = ˆ n i=1 (i) (i) Then. (j) 8: For i = 1 : n. t=1 p(xt xt+1 . Godsill et al. (2004) consider the joint smoothing density T −1 T −1 p(x1:T y1:T ) = p(xT y1:T ) ∝ p(xT y1:T ) Let n i=1 t=1 T −1 t=1 p(xt xt+1:T . ·). which is linear in n. Normalise such that 4: At time t + 1. (8. for t = 1. hence the particle trajectories do not offer reliable approximations to the smoothing density. .4) So the idea is to run the particle ﬁlter (e.92 8. 2: For i = 1 : n. y1:t ) by p(xt xt+1 . Go to step 4.e. sample z1:t ∼ Kt (x1:t . . it is possible to approximate p(xt xt+1 .. i. (i) (i) (i) 3: For i = 1 : n.
Normalise the weights. 9: (˜1 .9 shows the kernel density estimate for p(x11:12 y1:100 ). Figure 8.8. T . . particle trajectory x_t −20 −10 0 10 20 0 20 40 t 60 80 100 Figure 8. xT ) is an approximate realization from p(x1:T y1:T ). ⊳ . it is possible to visualize characteristics of the multivariate smoothing distribution.7. Finally. backwardsmoothing algorithm for statespace models 1: Run a particle ﬁlter algorithm to obtain weighted particle approximations p(xt y1:t ) for t = 1. Figure 8.8. wt i=1 to the ﬁltering distributions 7: Choose xt = xt with probability wtt+1 . . . . ˜ 8: Go to step 3. as opposed to simply returning smoothing marginals. . 5: At t ≥ 1 (i) (i) (i) 6: For i = 1 : n. . Multimodality in the smoothing distributions is shown in Figure 8. ˜ n on (i) (i) xt . (i) (i) 3: Choose xT = xT with probability wT . x ˜ 4: Set t = t − 1. compute wtt+1 ∝ wt f (˜t+1 xt ). x (i) (i) particles. 2: Set t = T . since the algorithm returns entire smoothing trajectories.5 Smoothing density estimation 93 Algorithm 5 The forwardﬁltering. .7 displays the 10000 smoothing trajectories drawn from p(x1:100 y1:100 ) with the true state sequence overlaid. . 10000 smoothing trajectories drawn from p(x1:100 y1:100 ) with dimond symbol indicating the true sequence.
94
8. Sequential Monte Carlo
Kernel density estimates up to time t=100 (n = 10000)
x_t
−20
−10
0
10
20
20
40 t
60
80
100
Figure 8.8. Image intensity plot of the kernel density estimates of smoothing densities with dimond symbol indicating the true
state sequence.
Figure 8.9. Kernel density estimate for p(x11:12 y1:100 ).
Bibliography
Anderson, T. W. (2003) An introduction to multivariate statistical analysis. Wiley series in probability and statistics. WileyInterscience. Athreya, K. B., Doss, H. and Sethuraman, J. (1992) A proof of convergence of the markov chain simulation method. Tech. Rep. 868, Dept. of Statistics, Florida State Univeristy. Badger, L. (1994) Lazzarini’s lucky approximation of π. Mathematics Magazine, 67, 83–91. Brockwell, P. J. and Davis, R. A. (1991) Time series: theory and methods. New York: Springer, 2 edn. Brooks, S. and Gelman, A. (1998) General methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics, 7, 434–455. Buffon, G. (1733) Editor’s note concerning a lecture given 1733 to the Royal Academy of Sciences in Paris. Histoire de l’Acad´ mie Royale des Sciences, 43–45. e — (1777) Essai d’arithm´ tique morale. Histoire naturelle, g´ n´ rale et particuli` re, Suppl´ ment 4, 46–123. e e e e e Capp´ , O., Godsill, S. J. and Moulines, E. (2007) An overview of existing methods and recent advances in sequential e Monte Carlo. In IEEE Proceedings, vol. 95, 1–21. Capp´ , O., Moulines, E. and Ryd´ n, T. (2005) Inference in Hidden Markov Models. Springer Series in Statistics. e e Springer. Carpenter, J., Clifford, P. and Fearnhead, P. (1999) An improved particle ﬁlter for nonlinear problems. IEE Proceedings, Radar, Sonar and Navigation, 146, 2–7. Chopin, N. (2004) Central limit theorem for sequential Monte Carlo methods and its applications to Bayesian inference. Annals of Statistics, 32, 2385–2411. Crisan, D. and Doucet, A. (2002) A survey of convergence results on particle ﬁltering methods for practitioners. IEEE Transactions on Signal Processing, 50, 736–746. Doucet, A., de Freitas, N. and Gordon, N. (eds.) (2001) Sequential Monte Carlo Methods in Practice. Statistics for Engineering and Information Science. New York: Springer. Doucet, A., Godsill, S. and Andrieu, C. (2000) On sequential simulationbased methods for Bayesian ﬁltering. Statistics and Computing, 10, 197–208. Fahrmeir, L. and Tutz, G. (2001) Multivariate Statistical Modelling Based on Generalised Linear Models. New York: Springer, 2 edn. Gelfand, A. E. and Smith, A. F. M. (1990) Samplingbased approaches to calculating marginal densities. Journal of the American Statistical Association, 85, 398–409. Gelman, A., Gilks, W. R. and Roberts, G. O. (1997) Weak convergence and optimal scaling of random walk Metropolis algorithms. Annals of. Applied Probability, 7, 110–120. Gelman, A., Roberts, G. O. and Gilks, W. R. (1995) Efﬁcient Metropolis jumping rules. In Bayesian Statistics (eds. J. M. Bernado, J. Berger, A. Dawid and A. Smith), vol. 5. Oxford: Oxford University Press. Gelman, A. and Rubin, B. D. (1992) Inference from iterative simulation using multiple sequences. Statistical Science, 7, 457–472.
96
BIBLIOGRAPHY
Geman, S. and Geman, D. (1984) Stochastic relaxation, Gibbs distribution, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721–741. Gihman, I. and Skohorod, A. (1974) The Theory of Stochastic Processes, vol. 1. Springer. Gilks, W. R. and Berzuini, C. (2001) Following a moving target – Monte Carlo inference for dynamic Bayesian models. Journal of the Royal Statistical Society B, 63, 127–146. Godsill, S. J., Doucet, A. and West, M. (2004) Monte Carlo smoothing for nonlinear time series. Journal of the American Statistical Association, 99, 156–168. Gordon, N. J., Salmond, S. J. and Smith, A. F. M. (1993) Novel approach to nonlinear/nonGaussian Bayesian state estimation. IEE ProceedingsF, 140, 107–113. GuihennecJouyaux, C., Mengersen, K. L. and Robert, C. P. (1998) Mcmc convergence diagnostics: A ”reviewww”. Tech. Rep. 9816, Institut National de la Statistique et des Etudes Economiques. Halton, J. H. (1970) A retrospective and prospective survey of the Monte Carlo method. SIAM Review, 12, 1–63. Hastings, W. K. (1970) Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57, 97–109. Kalman, R. E. (1960) A new approach to linear ﬁltering and prediction problems. Transactions of the ASME Journal of Basic Engineering, 82 (Series D), 35–45. Knuth, D. (1997) The Art of Computer Programming, vol. 1. Reading, MA: AddisonWesley Professional. Kong, A., Liu, J. S. and Wong, W. H. (1994) Sequential imputations and Bayesian missing data problems. Journal of the American Statistical Association, 89, 278–288. Laplace, P. S. (1812) Th´ orie Analytique des Probabilit´ s. Paris: Courcier. e e Lazzarini, M. (1901) Un’ applicazione del calcolo della probabilit` alla ricerca esperimentale di un valore approssia mato di π. Periodico di Matematica, 4. Liu, J. S. (2001) Monte Carlo Strategies in Scientiﬁc Computing. New York: Springer. Liu, J. S. and Chen, R. (1998) Sequential Monte Carlo methods for dynamic systems. Journal of the American Statistical Association, 93, 1032–1044. Liu, J. S., Wong, W. H. and Kong, A. (1995) Covariance structure and convergence rate of the Gibbs sampler with various scans. Journal of the Royal Statistical Society B, 57, 157–169. Marsaglia, G. (1968) Random numbers fall mainly in the planes. Proceedings of the National Academy of Sciences of the United States of America, 61, 25–28. Marsaglia, G. and Zaman, A. (1991) A new class of random number generators. The Annals of Applied Probability, 1, 462–480. Matsumoto, M. and Nishimura, T. (1998) Mersenne twister: a 623dimensionally equidistributed uniform pseudorandom number generator. ACM Trans. Model. Comput. Simul., 8, 3–30. Meinhold, R. J. and Singpurwalla, N. D. (1983) Understanding the Kalman ﬁlter. The American Statistician, 37, 123–127. — (1989) Robustiﬁcation of Kalman Filter Methods. Journal of the American Statistical Association, 84, 479–486. Metropolis, N. (1987) The beginning of the Monte Carlo method. Los Alamos Science, 15, 122–143. Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. B., Teller, A. H. and Teller, E. (1953) Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21, 1087–1092. Metropolis, N. and Ulam, S. (1949) The Monte Carlo method. Journal of the American Statistical Association, 44, 335–341. Meyn, S. P. and Tweedie, R. L. (1993) Markov Chains and Stochastic Stability. Springer, New York, Inc. URL http://probability.ca/MT/. Norris, J. (1997) Markov Chains. Cambridge Univeristy Press.
Rabiner. In Proceedings of Symposium. J. 2 edn. Chapman & Hall/CRC. Ulam. New York. Citeseer. 1st paperback edn. 1701–1762. 95–110. R. Secaucus. B. 11. P. A. L. Statistics and Computing. . USA: Springer. (1984) General Irreducible Markov Chains and NonNegative Operators. Tierney. Ripley. Probability Surveys. (1989) A tutorial on hidden Markov models and selected applications in speech recognition. No. 257–286. Biometrika. 83. L. (1986) Density estimation for statistics and data analysis. R. New York: Wiley. Philippe. (1987) Stochastic simulation. NJ. Robert. Roberts. E. A. Inc. (1996) Geometric convergence and central limit theorems for multivariate Hastings and Metropolis algorithms. G. Cambridge University Press. 153–158. (1994) Markov Chains for exploring posterior distributions.. Roberts. O. and Casella. (1983) Adventures of a Mathematician. C. and Rosenthal. 20–71. 83 in Cambridge Tracts in Mathematics.BIBLIOGRAPHY 97 Nummelin. P. C. The Annals of Statistics. G. (2004) Monte Carlo Statistical Methods. D. 77. R. Silverman. G. and Robert. (2000) The unscented Kalman ﬁlter for nonlinear estimation. 22. New York: Charles Scribner’s Sons. Monographs on statistics and applied probability. E. 103–115. 26. and van der Merwe. S. B. (2004) General state space Markov chains and MCMC algorithms. Wan. (2001) Riemann sums for mcmc estimation and convergence monitoring. vol. S. 1. In IEEE Proceedings. and Tweedie. W.
This action might not be possible to undo. Are you sure you want to continue?
We've moved you to where you read on your other device.
Get the full title to continue listening from where you left off, or restart the preview.