4 Gibbs Sampling
Gibbs sampling$^6$ is an MCMC sampler introduced by Geman and Geman [22]. Let $x^{(t)} \in \mathbb{R}^m$ denote the current sample. Then Gibbs sampling proceeds as follows:

1. Pick an index $k \in \{1, \ldots, m\}$, either via a deterministic round-robin format or at random.
2. Set $x_j^{(t+1)} = x_j^{(t)}$ for $j \neq k$, i.e. $x_{-k}^{(t+1)} = x_{-k}^{(t)}$.
3. Generate $x_k^{(t+1)} \sim p(x_k \mid x_{-k}^{(t)})$.
In Gibbs sampling only one component of $x$ is updated at a time. It is common to simply order the $m$ components and update them sequentially. We can then let $x^{(t+1)}$ be the value of the chain after all $m$ updates rather than after each individual update. Gibbs sampling is a very popular method for applications where the conditional distributions, $p(x_k \mid x_{-k})$, are easy to simulate from. This is the case for conditionally conjugate models and others. A minimal code sketch of a full Gibbs sweep is given below.
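The following is a minimal sketch in Python, not an implementation from the original text. The callback `sample_conditional(k, x)` is a hypothetical placeholder standing in for whatever model-specific routine draws from $p(x_k \mid x_{-k})$.

```python
import numpy as np

def gibbs_sweep(x, sample_conditional):
    """One systematic-scan (round-robin) Gibbs sweep over all m components.

    `sample_conditional(k, x)` is assumed to draw from p(x_k | x_{-k})
    given the current state x; it is a placeholder for the model-specific
    conditional samplers.
    """
    x = np.array(x, dtype=float)
    for k in range(len(x)):
        x[k] = sample_conditional(k, x)  # update component k given the rest
    return x

def gibbs_sampler(x0, sample_conditional, n_sweeps):
    """Run n_sweeps full sweeps from x0, recording the state after each sweep."""
    samples = [np.array(x0, dtype=float)]
    for _ in range(n_sweeps):
        samples.append(gibbs_sweep(samples[-1], sample_conditional))
    return np.array(samples)
```

Recording the state only after each full sweep corresponds to the convention above of letting $x^{(t+1)}$ denote the chain after all $m$ updates.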
It is easy to see that Gibbs sampling is a special case of Metropolis-Hastings sampling. When the updated coordinate is drawn randomly we have the transition matrix
$$
Q(y \mid x) =
\begin{cases}
\pi_k \, p(y_k \mid x_{-k}), & y_{-k} = x_{-k} \\
0, & \text{otherwise}
\end{cases}
$$
where $\pi_k$ is the probability of the $k^{\text{th}}$ coordinate being updated. If all the coordinates are updated deterministically in round-robin fashion then the transition matrix for an individual update will depend on $k$, the coordinate being updated. This doesn't fit with the MH framework, but an easy remedy is to consider a full sweep of all coordinate updates as a single transition with the transition matrix determined appropriately. Regardless of how the coordinates are selected for updating, however, we can easily see that each component update will be accepted with probability 1; a short verification is given below. One must also be careful that the component-wise Markov chain is ergodic$^7$ as discussed earlier.
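To make the claim explicit, here is a short verification (a step the text leaves to the reader) for the random-scan case. If $y$ is proposed from $Q(\cdot \mid x)$ above, so that $y_{-k} = x_{-k}$, the Metropolis-Hastings acceptance ratio for target $p$ is
$$
\frac{p(y)\, Q(x \mid y)}{p(x)\, Q(y \mid x)}
= \frac{p(y_k \mid x_{-k})\, p(x_{-k}) \cdot \pi_k\, p(x_k \mid x_{-k})}
       {p(x_k \mid x_{-k})\, p(x_{-k}) \cdot \pi_k\, p(y_k \mid x_{-k})} = 1,
$$
since every factor in the numerator also appears in the denominator. The same cancellation applies to each coordinate update in the deterministic-scan case (without the $\pi_k$ factors).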

Example 6 (A Binomial Likelihood and a Beta Prior)


Consider the distribution
$$
p(x, y) = \frac{n!}{(n - x)!\, x!}\, y^{x+\alpha-1} (1 - y)^{n-x+\beta-1}, \qquad x \in \{0, \ldots, n\}, \; y \in [0, 1]. \tag{13}
$$

It is hard to simulate directly from p(x, y) but the conditional distributions are easy to work with. We see
that

• p(x | y) ∼ Bin(n, y)
• p(y | x) ∼ Beta(x + α, n − x + β)

and since it’s easy to simulate from each conditional, it’s easy to run a Gibbs sampler to simulate from the
joint distribution.
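As a concrete sketch, the two conditional draws can be alternated directly; the values of $n$, $\alpha$ and $\beta$ below are illustrative choices, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n, alpha, beta = 20, 2.0, 3.0   # illustrative values only

def gibbs_binomial_beta(n_iters, y0=0.5):
    """Gibbs sampler for (13): alternate x | y ~ Bin(n, y) and y | x ~ Beta(x+alpha, n-x+beta)."""
    y = y0
    samples = np.empty((n_iters, 2))
    for t in range(n_iters):
        x = rng.binomial(n, y)                 # draw x | y
        y = rng.beta(x + alpha, n - x + beta)  # draw y | x
        samples[t] = (x, y)
    return samples

draws = gibbs_binomial_beta(10_000)
```

As a check on the output, the marginal of $y$ under (13) is Beta($\alpha, \beta$) and the marginal of $x$ is the corresponding beta-binomial distribution, so (after discarding a burn-in period) the sampled $y$'s should look like Beta($\alpha, \beta$) draws.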

4.1 Hierarchical Models and de Finetti’s Theorem


Gibbs sampling is particularly suited to hierarchical$^8$ models, an important class of models that is regularly used throughout the sciences and social sciences. The concept of exchangeability and de Finetti's Theorem are often mentioned in the context of these models since, in some sense, they provide us with the equivalent of the "IID assumption" that is so common in non-Bayesian methods.
$^6$ The algorithm is named after the physicist J. W. Gibbs, who died approximately 80 years earlier, in 1903. Chapters 9 and 10 of Robert and Casella [39] provide a detailed treatment of Gibbs sampling as well as numerous examples and exercises that go well beyond what we cover here.

$^7$ See Figure 27.5 from Barber [2] in Section 4.5 for an example where the chain is not ergodic. In this case the Gibbs sampler would fail to converge to the desired stationary distribution.

$^8$ Hierarchical models are also known as multi-level models.


For example, suppose we have random variables $X_1, \ldots, X_n$ whose distribution depends on some unknown parameter $\mu$. In a Bayesian model we would like to place a prior distribution on $\mu$, but then $X_1, \ldots, X_n$ are clearly dependent (once $\mu$ is integrated out), so how do we write out their likelihood? We can answer this via the concept of exchangeability.

Definition 6 We say a sequence of random variables X1 , X2 , . . . is infinitely exchangeable iff for all n ∈ N
the distribution of (X1 , . . . , Xn ) is identical to the distribution of (Xs(1) , . . . , Xs(n) ) for every permutation
s : {1, . . . , n} → {1, . . . , n}.

Note that IID random variables are always exchangeable but exchangeable random variables need not be (and generally are not) IID. For a simple example, suppose $X_1, X_2, \ldots$ are IID and let $Z$ be a non-trivial random variable independent of $X_1, X_2, \ldots$. Let $Y_i := Z + X_i$. Then the $Y_i$'s are not IID (for instance, if $Z$ has positive, finite variance then $\mathrm{Cov}(Y_i, Y_j) = \mathrm{Var}(Z) > 0$ for $i \neq j$), but they are exchangeable since their joint distribution is invariant to permuting the indices. The importance of exchangeable random variables lies in de Finetti's Theorem.

Theorem 5 (de Finetti) A sequence of random variables $X_1, X_2, \ldots$ is infinitely exchangeable iff, for all $n \in \mathbb{N}$,
$$
p(x_1, \ldots, x_n) = \int \prod_{i=1}^{n} p(x_i \mid \theta)\, P(d\theta) \tag{14}
$$
for some probability measure $P$ on $\theta$.

That (14) implies exchangeability is clear since the product inside the integral doesn't depend on the ordering of the random variables. That exchangeability implies (14) is a much deeper result. The significance of de Finetti's Theorem is that if we have exchangeable random variables (which are generally not IID), then there exists a (random) parameter $\theta$ that renders the random variables independent conditional on $\theta$. The concept of exchangeability and de Finetti's Theorem help motivate a very rich class of models where $\theta$ can be a multi-dimensional or even infinite-dimensional parameter. The class of hierarchical or multi-level models fits into this framework; Example 7 is our first such example, and we will see others in Section 6.

Example 7 (Blood Coagulation Times and Diets)


We consider here an example from Gelman et al. [19]. The data, which concerns blood coagulation times for animals randomly allocated to four different diets, is presented in Table 1. The data points $y_{ij}$, for $i = 1, \ldots, n_j$ and $j = 1, \ldots, J$, are assumed to be independently normally distributed within each of the $J$ groups, with means $\theta_j$ and common variance $\sigma^2$. That is,
$$
y_{ij} \mid \theta_j \sim N(\theta_j, \sigma^2).
$$

Diet    Measurements
A       62, 60, 63, 59
B       63, 67, 71, 64, 65, 66
C       68, 66, 71, 67, 68, 68
D       56, 62, 60, 61, 63, 64, 63, 59

Table 1: Coagulation time in seconds for blood drawn from 24 animals randomly allocated to four different diets. Different treatments have different numbers of observations because the randomization was unrestricted. From Box, Hunter, and Hunter (1978), who adjusted the data so that the averages are integers, a complication we ignore in our analysis.

The total number of observations is $n = \sum_{j=1}^{J} n_j$. Group means are assumed to follow a normal distribution with unknown mean $\mu$ and variance $\tau^2$, so that
$$
\theta_j \sim N(\mu, \tau^2).
$$


A uniform prior is assumed$^9$ for $(\mu, \log \sigma, \tau)$, which is equivalent to assuming (why?) that $p(\mu, \log \sigma, \log \tau) \propto \tau$. The posterior is then given by
$$
p(\theta, \mu, \log \sigma, \log \tau \mid y) \;\propto\; \tau \prod_{j=1}^{J} N\!\left(\theta_j \mid \mu, \tau^2\right) \prod_{j=1}^{J} \prod_{i=1}^{n_j} N\!\left(y_{ij} \mid \theta_j, \sigma^2\right). \tag{15}
$$

We will see from (15) that all of the conditional distributions required for the Gibbs sampler have simple conjugate forms:

1. Conditional Posterior Distribution of Each $\theta_j$

We simply need to gather the terms (from the posterior in (15)) that only involve $\theta_j$ and then simplify to obtain
$$
\theta_j \mid (\theta_{-j}, \mu, \sigma, \tau, y) \sim N\!\left(\hat{\theta}_j, V_{\theta_j}\right) \tag{16}
$$
where
$$
\hat{\theta}_j := \frac{\frac{1}{\tau^2}\,\mu + \frac{n_j}{\sigma^2}\,\bar{y}_{.j}}{\frac{1}{\tau^2} + \frac{n_j}{\sigma^2}}
\qquad \text{and} \qquad
V_{\theta_j} := \frac{1}{\frac{1}{\tau^2} + \frac{n_j}{\sigma^2}},
$$
with $\bar{y}_{.j} := \frac{1}{n_j} \sum_{i=1}^{n_j} y_{ij}$ denoting the sample mean of group $j$. These conditional distributions are independent, so generating the $\theta_j$'s one at a time is equivalent to drawing $\theta$ all at once.

2. Conditional Posterior Distribution of $\mu$

Again, we simply gather terms from the posterior that only involve $\mu$ and then simplify to obtain
$$
\mu \mid (\theta, \sigma, \tau, y) \sim N\!\left(\hat{\mu}, \frac{\tau^2}{J}\right) \tag{17}
$$
where $\hat{\mu} := \frac{1}{J} \sum_{j=1}^{J} \theta_j$.

3. Conditional Posterior Distribution of $\sigma^2$

Gathering terms from the posterior that only involve $\sigma$ and then simplifying, we obtain
$$
\sigma^2 \mid (\theta, \mu, \tau, y) \sim \text{Inv-}\chi^2\!\left(n, \hat{\sigma}^2\right) \tag{18}
$$
where $\hat{\sigma}^2 := \frac{1}{n} \sum_{j=1}^{J} \sum_{i=1}^{n_j} (y_{ij} - \theta_j)^2$. (See the note following this list for how to simulate from this scaled inverse-$\chi^2$ distribution.)

4. Conditional Posterior Distribution of $\tau^2$

Again, we gather terms from the posterior that only involve $\tau$ and then simplify to obtain
$$
\tau^2 \mid (\theta, \mu, \sigma, y) \sim \text{Inv-}\chi^2\!\left(J - 1, \hat{\tau}^2\right) \tag{19}
$$
where $\hat{\tau}^2 := \frac{1}{J - 1} \sum_{j=1}^{J} (\theta_j - \mu)^2$.
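A practical note not stated in the text: a draw from the scaled inverse-$\chi^2$ distribution $\text{Inv-}\chi^2(\nu, s^2)$ appearing in (18) and (19) can be generated from an ordinary $\chi^2$ draw, since
$$
X \sim \chi^2_\nu \quad \Longrightarrow \quad \frac{\nu s^2}{X} \sim \text{Inv-}\chi^2(\nu, s^2).
$$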

To start the Gibbs sampler we only (why?) need starting points for θ and µ and then we use (16) to (19) to
repeatedly generate samples from the conditional distributions.
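Putting the pieces together, the following is a sketch of the full Gibbs sampler in Python, using the data of Table 1 and the conditionals (16) to (19). The starting values and variable names are illustrative choices rather than anything prescribed in the text.

```python
import numpy as np

rng = np.random.default_rng(1)

# Coagulation data from Table 1, grouped by diet (A, B, C, D).
groups = [
    np.array([62., 60, 63, 59]),                  # A
    np.array([63., 67, 71, 64, 65, 66]),          # B
    np.array([68., 66, 71, 67, 68, 68]),          # C
    np.array([56., 62, 60, 61, 63, 64, 63, 59]),  # D
]
J = len(groups)
n_j = np.array([len(g) for g in groups])
ybar_j = np.array([g.mean() for g in groups])
n = n_j.sum()

def scaled_inv_chi2(nu, s2):
    """Draw from Inv-chi^2(nu, s2) as nu * s2 / X with X ~ chi^2_nu."""
    return nu * s2 / rng.chisquare(nu)

def gibbs(n_iters):
    theta = ybar_j.copy()   # starting point for theta
    mu = theta.mean()       # starting point for mu
    draws = []
    for _ in range(n_iters):
        # (18): sigma^2 | theta, mu, tau, y
        sigma2_hat = sum(((g - th) ** 2).sum() for g, th in zip(groups, theta)) / n
        sigma2 = scaled_inv_chi2(n, sigma2_hat)
        # (19): tau^2 | theta, mu, sigma, y
        tau2_hat = ((theta - mu) ** 2).sum() / (J - 1)
        tau2 = scaled_inv_chi2(J - 1, tau2_hat)
        # (16): theta_j | mu, sigma, tau, y  (independent across j)
        V = 1.0 / (1.0 / tau2 + n_j / sigma2)
        theta = (mu / tau2 + n_j * ybar_j / sigma2) * V + np.sqrt(V) * rng.standard_normal(J)
        # (17): mu | theta, sigma, tau, y
        mu = rng.normal(theta.mean(), np.sqrt(tau2 / J))
        draws.append(np.concatenate([theta, [mu, sigma2, tau2]]))
    return np.array(draws)

samples = gibbs(5_000)   # discard an initial burn-in portion before summarizing
```

The update order within a sweep is a design choice; any order that cycles through all of the conditionals yields a valid Gibbs sampler for (15).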

4.2 Other Variations of Gibbs Sampling


We now briefly discuss some variations on the basic Gibbs sampling algorithm.
$^9$ A uniform prior on the real line (or any interval of infinite length) is a so-called improper prior because there is no such distribution. It's OK to use an improper prior if the resulting posterior is well-defined as a distribution, but this is not guaranteed and so care must be taken to ensure the posterior is indeed a distribution. In this example, if we had assigned a uniform prior to $\log \tau$ then the posterior would be improper. It must be emphasized that even if the 'posterior' has legitimate conditional distributions, that does not guarantee that the posterior is indeed a distribution. In this case, while a Gibbs sampler can be implemented, it will not converge to any distribution. See Example 10.
