4 Gibbs Sampling
Gibbs sampling⁶ is an MCMC sampler introduced by Geman and Geman [22]. Let $x^{(t)} \in \mathbb{R}^m$ denote the
current sample. Then Gibbs sampling proceeds as follows:
1. Pick an index $k \in \{1, \ldots, m\}$ either via a deterministic round-robin format or at random.
2. Set $x_j^{(t+1)} = x_j^{(t)}$ for $j \neq k$, i.e. $x_{-k}^{(t+1)} = x_{-k}^{(t)}$.
3. Generate $x_k^{(t+1)} \sim p(x_k \mid x_{-k}^{(t)})$.
In Gibbs sampling only one component of $x$ is updated at a time. It is common to simply order the $m$ components and
update them sequentially. We can then let $x^{(t+1)}$ be the value of the chain after all $m$ updates rather than
after each individual update. Gibbs sampling is a very popular method for applications where the conditional
distributions, $p(x_k \mid x_{-k}^{(t)})$, are easy to simulate from. This is the case for conditionally conjugate models,
among others.
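The three steps above can be sketched in a few lines of Python. This is a minimal sketch, not the notes' own implementation: the `gibbs` function and its `conditional_samplers` interface are hypothetical names introduced here, and the target (a standard bivariate normal with correlation `rho`, whose conditionals are $x_1 \mid x_2 \sim N(\rho x_2, 1 - \rho^2)$ and symmetrically for $x_2$) is an illustrative assumption chosen only because its conditionals are easy to simulate from.

```python
import numpy as np

def gibbs(x0, conditional_samplers, n_sweeps, rng, random_scan=False):
    """Run a Gibbs sampler and record the state after every m updates.

    conditional_samplers[k](x, rng) must return a draw of x[k] given the
    other components of x (hypothetical interface, not from the notes).
    """
    x = np.array(x0, dtype=float)
    m = len(x)
    out = np.empty((n_sweeps, m))
    for t in range(n_sweeps):
        for step in range(m):
            # Step 1: pick the coordinate, at random or in round-robin order
            k = rng.integers(m) if random_scan else step
            # Steps 2-3: leave x_{-k} unchanged and redraw x_k from its conditional
            x[k] = conditional_samplers[k](x, rng)
        out[t] = x
    return out

# Illustrative target (an assumption): standard bivariate normal, correlation rho
rho = 0.8
sd = np.sqrt(1.0 - rho**2)
samplers = [
    lambda x, rng: rng.normal(rho * x[1], sd),  # x_1 | x_2
    lambda x, rng: rng.normal(rho * x[0], sd),  # x_2 | x_1
]
rng = np.random.default_rng(0)
draws = gibbs([0.0, 0.0], samplers, n_sweeps=20000, rng=rng)
# After discarding an initial stretch, the sample correlation should be near rho
print(np.corrcoef(draws[1000:].T)[0, 1])
```

Passing `random_scan=True` switches from the sequential ordering to random coordinate selection; in that case a recorded row corresponds to $m$ random updates rather than one pass over every coordinate.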
It is easy to see that Gibbs sampling is a special case of Metropolis-Hastings sampling. When the updated
coordinate is drawn randomly we have transition matrix
$$Q(y \mid x) = \begin{cases} \pi_k \, p(y_k \mid x_{-k}), & y_{-k} = x_{-k} \\ 0, & \text{otherwise} \end{cases}$$
where $\pi_k$ is the probability of the $k$th coordinate being updated. If all the coordinates are updated
deterministically in round-robin fashion then the transition matrix for an individual update will depend on $k$, the
coordinate being updated. This doesn't fit with the MH framework but an easy remedy is to consider a full
sweep of all coordinate updates as a single transition with the transition matrix determined appropriately.
Regardless of how the coordinates are selected for updating, however, we can easily see that each component
update will be accepted with probability 1. One must also be careful that the component-wise Markov chain
is ergodic⁷ as discussed earlier.
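The claim that every component update is accepted with probability 1 can be checked directly. The following is a short worked derivation, assuming the usual Metropolis-Hastings acceptance probability (written here as $\alpha(y \mid x)$, a symbol not used in this section) and a proposed move $y \neq x$ that changes only coordinate $k$, so that $y_{-k} = x_{-k}$:

```latex
\begin{align*}
\alpha(y \mid x)
  &= \min\left\{1,\ \frac{p(y)\, Q(x \mid y)}{p(x)\, Q(y \mid x)}\right\} \\
  &= \min\left\{1,\ \frac{p(y_k \mid x_{-k})\, p(x_{-k}) \cdot \pi_k\, p(x_k \mid x_{-k})}
                          {p(x_k \mid x_{-k})\, p(x_{-k}) \cdot \pi_k\, p(y_k \mid x_{-k})}\right\} \\
  &= 1 .
\end{align*}
```

The second line uses $p(y) = p(y_k \mid y_{-k})\, p(y_{-k})$ together with $y_{-k} = x_{-k}$, and the same factorization for $p(x)$; every factor then cancels.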
It is hard to simulate directly from p(x, y) but the conditional distributions are easy to work with. We see
that
• $x \mid y \sim \text{Bin}(n, y)$
• $y \mid x \sim \text{Beta}(x + \alpha, \, n - x + \beta)$
and since it’s easy to simulate from each conditional, it’s easy to run a Gibbs sampler to simulate from the
joint distribution.
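A minimal sketch of this two-block Gibbs sampler is given below. The function name `beta_binomial_gibbs`, the starting value `y0` and the parameter values $n = 16$, $\alpha = 2$, $\beta = 4$ are illustrative assumptions, not values taken from the text. As a sanity check, the two conditionals imply that $y$ has a $\text{Beta}(\alpha, \beta)$ marginal, so the sample mean of $y$ should be close to $\alpha / (\alpha + \beta)$.

```python
import numpy as np

def beta_binomial_gibbs(n, alpha, beta, n_iter, rng, y0=0.5):
    """Alternate draws from x | y ~ Bin(n, y) and y | x ~ Beta(x + alpha, n - x + beta)."""
    xs = np.empty(n_iter, dtype=int)
    ys = np.empty(n_iter)
    y = y0  # arbitrary starting point in (0, 1)
    for t in range(n_iter):
        x = rng.binomial(n, y)                  # update x given the current y
        y = rng.beta(x + alpha, n - x + beta)   # update y given the new x
        xs[t], ys[t] = x, y
    return xs, ys

# Illustrative parameter values (assumptions, not from the notes)
rng = np.random.default_rng(1)
xs, ys = beta_binomial_gibbs(n=16, alpha=2.0, beta=4.0, n_iter=20000, rng=rng)

burn_in = 2000  # discard early draws that still depend on y0
print(xs[burn_in:].mean(), ys[burn_in:].mean())
```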
6 Robert and Casella [39] provide a detailed treatment of Gibbs sampling as well as numerous examples and exercises that go
well beyond what we cover here.
7 See Figure 27.5 from Barber [2] in Section 4.5 for an example where the chain is not ergodic. In this case the Gibbs sampler
Definition 6 We say a sequence of random variables $X_1, X_2, \ldots$ is infinitely exchangeable iff for all $n \in \mathbb{N}$
the distribution of $(X_1, \ldots, X_n)$ is identical to the distribution of $(X_{s(1)}, \ldots, X_{s(n)})$ for every permutation
$s : \{1, \ldots, n\} \to \{1, \ldots, n\}$.
Note that IID random variables are always exchangeable but exchangeable random variables need not be
(and generally are not) IID. For a simple example, suppose $X_1, X_2, \ldots$ are IID and let $Z$ be a non-trivial
random variable independent of $X_1, X_2, \ldots$. Let $Y_i := Z + X_i$. Then the $Y_i$'s are not IID but they are
exchangeable. The importance of exchangeable random variables lies in de Finetti's Theorem.
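The example $Y_i = Z + X_i$ can be explored by simulation. The sketch below assumes, purely for illustration, that $Z$ and the $X_i$ are standard normal; in that case $\text{Corr}(Y_i, Y_j) = 1/2$ for every pair $i \neq j$. A common non-zero correlation shows the $Y_i$ are dependent (hence not IID), and the fact that it is the same for every pair is consistent with, though of course not a proof of, exchangeability.

```python
import numpy as np

rng = np.random.default_rng(2)
n_draws = 100_000

# Distributional choices below are assumptions made only for this illustration
z = rng.normal(size=(n_draws, 1))   # shared component Z, one value per replication
x = rng.normal(size=(n_draws, 3))   # IID X_1, X_2, X_3
y = z + x                           # Y_i = Z + X_i, broadcasting Z across columns

corr = np.corrcoef(y.T)             # 3 x 3 sample correlation matrix of (Y_1, Y_2, Y_3)
print(corr.round(2))                # off-diagonal entries should all be near 0.5
```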
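Theorem 5 (de Finetti) A sequence of random variables $X_1, X_2, \ldots$ is infinitely exchangeable iff for all
$n \in \mathbb{N}$
$$p(x_1, \ldots, x_n) = \int \prod_{i=1}^{n} p(x_i \mid \theta) \, P(d\theta) \qquad (14)$$
for some measure $P$ on the parameter $\theta$.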
That (14) implies exchangeability is clear since the product inside the integral doesn't depend on the ordering
of the random variables. That exchangeability implies (14) is a much deeper result. The significance of de
Finetti's Theorem is that if we have exchangeable random variables (which are generally not IID), then
there exists a (random) parameter $\theta$ that renders the random variables independent conditional on $\theta$. The
concept of exchangeability and de Finetti's Theorem help motivate a very rich class of models where $\theta$ can
be a multi-dimensional or even infinite-dimensional parameter. The class of hierarchical or multi-level models fits into
this framework. Example 7 is our first such example and we will see others in Section 6.
$$y_{ij} \mid \theta_j \sim N(\theta_j, \sigma^2).$$
Diet   Measurements
A      62, 60, 63, 59
B      63, 67, 71, 64, 65, 66
C      68, 66, 71, 67, 68, 68
D      56, 62, 60, 61, 63, 64, 63, 59
Table 1: Coagulation time in seconds for blood drawn from 24 animals randomly allocated to four different
diets. Different treatments have different numbers of observations because the randomization was unrestricted.
From Box, Hunter, and Hunter (1978), who adjusted the data so that the averages are integers, a complication
we ignore in our analysis.
The total number of observations is $n = \sum_{j=1}^{J} n_j$. Group means are assumed to follow a normal distribution
with unknown mean $\mu$ and variance $\tau^2$ so that
$$\theta_j \sim N(\mu, \tau^2).$$
A uniform prior is assumed⁹ for $(\mu, \log \sigma, \tau)$ which is equivalent to assuming (why?) that $p(\mu, \log \sigma, \log \tau) \propto \tau$.
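One way to answer the "why?" is a change of variables. The following short derivation introduces $u := \log \tau$ as a shorthand (a symbol not used elsewhere in these notes) and leaves $\mu$ and $\log \sigma$ untouched:

```latex
\begin{align*}
p(\mu, \log\sigma, \tau) &\propto 1, \qquad \tau = e^{u}, \qquad \frac{d\tau}{du} = e^{u} = \tau, \\
p(\mu, \log\sigma, u) &= p\bigl(\mu, \log\sigma, \tau(u)\bigr) \left|\frac{d\tau}{du}\right|
  \;\propto\; 1 \cdot \tau \;=\; \tau .
\end{align*}
```

The factor $\tau$ is simply the Jacobian of the map from $\log \tau$ to $\tau$.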
The posterior is then given by
$$p(\theta, \mu, \log \sigma, \log \tau \mid y) \propto \tau \prod_{j=1}^{J} N(\theta_j \mid \mu, \tau^2) \prod_{j=1}^{J} \prod_{i=1}^{n_j} N(y_{ij} \mid \theta_j, \sigma^2). \qquad (15)$$
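We can see from (15) that all of the conditional distributions required for the Gibbs sampler have simple conjugate
forms:
$$\theta_j \mid (\mu, \sigma, \tau, y) \sim N(\hat{\theta}_j, V_{\theta_j}) \qquad (16)$$
where
$$\hat{\theta}_j := \frac{\frac{1}{\tau^2} \mu + \frac{n_j}{\sigma^2} \bar{y}_{.j}}{\frac{1}{\tau^2} + \frac{n_j}{\sigma^2}} \quad \text{and} \quad V_{\theta_j} := \frac{1}{\frac{1}{\tau^2} + \frac{n_j}{\sigma^2}}.$$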
These conditional distributions are independent so generating the $\theta_j$'s one at a time is equivalent to
drawing $\theta$ all at once.
$$\mu \mid (\theta, \sigma, \tau, y) \sim N\left(\hat{\mu}, \frac{\tau^2}{J}\right) \qquad (17)$$
where $\hat{\mu} := \frac{1}{J} \sum_{j=1}^{J} \theta_j$.
$$\sigma^2 \mid (\theta, \mu, \tau, y) \sim \text{Inv-}\chi^2(n, \hat{\sigma}^2) \qquad (18)$$
where $\hat{\sigma}^2 := \frac{1}{n} \sum_{j=1}^{J} \sum_{i=1}^{n_j} (y_{ij} - \theta_j)^2$.
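$$\tau^2 \mid (\theta, \mu, \sigma, y) \sim \text{Inv-}\chi^2(J - 1, \hat{\tau}^2) \qquad (19)$$
where $\hat{\tau}^2 := \frac{1}{J-1} \sum_{j=1}^{J} (\theta_j - \mu)^2$.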
To start the Gibbs sampler we only (why?) need starting points for θ and µ and then we use (16) to (19) to
repeatedly generate samples from the conditional distributions.
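The sampler just described can be sketched as follows for the data in Table 1. This is a minimal sketch under stated assumptions rather than a definitive implementation: the function names `scaled_inv_chi2` and `hierarchical_gibbs` are hypothetical, the starting values (group sample means for $\theta$ and their average for $\mu$) are one arbitrary choice, the third diet is taken to be C, and the $\tau^2$ update is assumed to be a scaled $\text{Inv-}\chi^2(J - 1, \hat{\tau}^2)$ draw. A scaled inverse chi-square draw $\text{Inv-}\chi^2(\nu, s^2)$ is generated as $\nu s^2 / X$ with $X \sim \chi^2_\nu$.

```python
import numpy as np

# Coagulation times from Table 1, one array per diet (A, B, C, D)
groups = [
    np.array([62, 60, 63, 59], dtype=float),
    np.array([63, 67, 71, 64, 65, 66], dtype=float),
    np.array([68, 66, 71, 67, 68, 68], dtype=float),
    np.array([56, 62, 60, 61, 63, 64, 63, 59], dtype=float),
]

def scaled_inv_chi2(nu, s2, rng):
    """Draw from a scaled inverse chi-square as nu * s2 / chi2_nu."""
    return nu * s2 / rng.chisquare(nu)

def hierarchical_gibbs(groups, n_iter, rng):
    """Return draws with columns (theta_1..theta_J, mu, sigma, tau)."""
    J = len(groups)
    n_j = np.array([len(g) for g in groups])
    n = n_j.sum()
    ybar = np.array([g.mean() for g in groups])
    # Starting points are needed only for theta and mu, because the
    # sigma^2 and tau^2 updates below depend on nothing else.
    theta = ybar.copy()
    mu = theta.mean()
    out = np.empty((n_iter, J + 3))
    for t in range(n_iter):
        # sigma^2 | theta, mu, tau, y
        sigma2_hat = sum(((g - th) ** 2).sum() for g, th in zip(groups, theta)) / n
        sigma2 = scaled_inv_chi2(n, sigma2_hat, rng)
        # tau^2 | theta, mu, sigma, y (assumed Inv-chi^2 with J - 1 degrees of freedom)
        tau2_hat = ((theta - mu) ** 2).sum() / (J - 1)
        tau2 = scaled_inv_chi2(J - 1, tau2_hat, rng)
        # theta_j | mu, sigma, tau, y: independent normals, drawn all at once
        prec = 1.0 / tau2 + n_j / sigma2
        theta_hat = (mu / tau2 + n_j * ybar / sigma2) / prec
        theta = rng.normal(theta_hat, np.sqrt(1.0 / prec))
        # mu | theta, sigma, tau, y
        mu = rng.normal(theta.mean(), np.sqrt(tau2 / J))
        out[t] = np.concatenate([theta, [mu, np.sqrt(sigma2), np.sqrt(tau2)]])
    return out

rng = np.random.default_rng(3)
draws = hierarchical_gibbs(groups, n_iter=6000, rng=rng)[1000:]  # drop burn-in
print(np.median(draws, axis=0).round(1))  # posterior medians of theta, mu, sigma, tau
```

Updating $\sigma^2$ and $\tau^2$ first is what makes starting values for $\theta$ and $\mu$ sufficient; any other ordering of the four updates would need additional starting points.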
distribution. It’s ok to use an improper prior if the resulting posterior is well-defined as a distribution but this is not guaranteed
and so care must be taken to ensure the posterior is indeed a distribution. In this example if we had assigned a uniform prior
to log τ then the posterior would be improper. It must be emphasized that even if the ‘posterior’ has legitimate conditional
distributions that does not guarantee that the posterior is indeed a distribution. In this case while a Gibbs sampler can be
implemented it will not converge to any distribution. See Example 10.