

Bayesian Data Analysis
Herbert Hoijtink
Department of Methodology and Statistics
Utrecht University
P.O.Box 80140
3508TC Utrecht
The Netherlands
h.hoijtink@uu.nl
March 25, 2008

Abstract
This chapter will provide an introduction to Bayesian data anal-
ysis. Using an analysis of covariance model as the point of depar-
ture, Bayesian parameter estimation (based on the Gibbs sampler),
Bayesian hypothesis testing (using posterior predictive inference), and
Bayesian model selection (via the Bayes factor) will be introduced.
The chapter will be concluded with a short discussion of Bayesian
hierarchical modelling and references for further reading.

Key words: Bayes Factor, Bayesian Statistics, Gibbs Sampler, Posterior
Predictive Inference, Prior Distribution, Posterior Distribution.

Research supported by a grant (NWO 453-05-002) of the Dutch Organization
for Scientific Research.

1 Introduction
It is impossible to give a comprehensive introduction to Bayesian data anal-
ysis in just one chapter. In the sequel I will present what I consider to be
the most important components of Bayesian data analysis: parameter esti-
mation based on the Gibbs sampler; the Bayesian counterpart of hypothesis
testing (posterior predictive inference); and model selection using the Bayes
factor. The chapter will be concluded with a short discussion of Bayesian
hierarchical modelling and references to topics that will not be discussed in
this chapter. For accessible introductions to Bayesian data analysis the in-
terested reader is referred to Gill (2002) and Lee (1997). Throughout the
chapter references for further reading will be given both to these two books
and to more advanced material.
It would be easy to fill a whole chapter with a description and discussion
of the differences between Bayesian data analysis and the classical frequentist
data analysis that most readers will be acquainted with. Since this chapter is
rather applied in nature (how to do Bayesian estimation, hypothesis testing
and model selection) I will here and in the sequel highlight two differences
that are important for these applications. Consider, for example, a simple
estimation problem: estimate the mean weight (the parameter of interest)
of 18 year old Dutch females. A frequentist would obtain a sample from
the population of 18 year old Dutch females, compute the sample average
and use this as an estimate of the mean weight in the population. Besides
this sample, a Bayesian would also use his prior expectations (that is, his
expectations with respect to the mean weight before the data are sampled)
to estimate the mean weight. These expectations are quantified in a so-called
prior distribution. For the example at hand this prior distribution could be
a normal distribution with a mean of 60 kilogram and a standard deviation
of 5 kilogram. Bayesians combine the information in the sample and the
prior distribution to estimate the average weight. Suppose, for example, that
the average weight in the sample is 58 kilogram with a standard error of
2 kilogram; in that case the Bayesian estimate would be an average weight
of 58.27 kilogram (a weighted average of 60 and 58 with weights proportional
to 1/5² and 1/2², respectively). Stated otherwise, Bayesians use two sources of information
when making inferences: the data and prior distributions. Throughout this
chapter this difference between Bayesian and classical frequentist inference
will be highlighted.
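The 58.27 above can be checked directly: it is a precision-weighted average, each source of information being weighted by the inverse of its squared spread. A minimal sketch (the variable names are illustrative, not from the chapter):

```python
# Precision-weighted combination of the prior (mean 60, sd 5) and the
# sample (mean 58, standard error 2); the weights are 1/5^2 and 1/2^2.
prior_mean, prior_sd = 60.0, 5.0
sample_mean, sample_se = 58.0, 2.0

w_prior = 1.0 / prior_sd ** 2
w_data = 1.0 / sample_se ** 2
estimate = (w_prior * prior_mean + w_data * sample_mean) / (w_prior + w_data)
print(round(estimate, 2))  # 58.28, i.e. the 58.27 in the text up to rounding
```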
The second difference between frequentist and Bayesian data analysis is
the computational means that are used to obtain estimates, p-values and
other quantities that are useful when making statistical inferences. Where
maximum likelihood is the main tool in classical inference, Bayesians pre-
fer sampling methods. Sampling methods will be elaborated in each of the
sections dealing with estimation, model checking and model selection in this
chapter.
All the concepts and procedures to be introduced in this chapter will be
discussed in the context of and illustrated with a data set previously discussed
by Tabachnick and Fidell (1996, p. 426-428, 436-437). They use analysis of
covariance (Tabachnick and Fidell, 1996, Chapter 8) to determine whether
or not the self-esteem of women depends on the degree of femininity (which
is coded low/high) and masculinity (also coded low/high) of the women.
Note that the observed scores for self-esteem are in the range 8-29, where 8
denotes a high and 29 a low self-esteem. Social economic status (observed
scores in the range 0-81 where 0 denotes a low social economic status) will
be used as a covariate. Observed means, standard deviations and sample
sizes are presented in Table 1. The main research questions for these data
are: (a) whether high (h) feminine women have a higher self-esteem than low
(l) feminine women; (b) whether high masculine women have a higher self-
esteem than low masculine women; and, (c) whether there is a joint effect
of scoring high or low on both variables. Note that self-esteem is scored
inversely, that is, higher values denote a smaller self-esteem. Let µ denote the
mean of self-esteem adjusted for the covariate social economic status. The
hypotheses corresponding to (a), (b) and (c) are then: H1a : {µhl , µhh } <
{µll , µlh }, where the first index denotes the degree of femininity and the
second index the degree of masculinity; H1b : {µlh , µhh } < {µll , µhl } ; and,
H1c : µhh < {µhl , µlh } < µll , respectively. The traditional null-hypothesis
H0 : µll = µhl = µlh = µhh represents the possibility that neither the degree
of femininity nor masculinity has an effect on self-esteem.
Note that the set of hypotheses specified differs from the traditional null-
hypothesis H0 , that is, ”nothing is going on” and alternative hypothesis
H2 : not H0 , that is, ”something is going on but I don’t know what”. Loosely
formulated, if H2 is preferred over H0 it is still not clear what is going on,
however, if either one of H1a , H1b , H1c is preferred over H0 it is clear which
of the underlying theories is the best. This is an example of the use of prior
knowledge (what is the relative order of the four adjusted means) in statistical
inference. Instead of having a rather general and non-specific alternative like
H2 , prior knowledge with respect to the possible state of affairs in the population
is incorporated in three specific and competing alternative hypotheses.

Table 1: Sample Means, Standard Deviations and Sample Sizes

              Means          Standard Deviations    Sample Sizes
Masc./Fem.   Low    High       Low      High         Low    High
Low         17.86  16.40       3.71     3.45          68     168
High        13.80  13.33       3.94     3.08          36      86


The analyses of covariance that will be executed in this chapter are based
on the following linear model:
yi = Σ_{g=1}^G µg dig + β xi + ei ,    (1)

where yi and xi are the scores of person i = 1, . . . , N on the criterion variable
and covariate, respectively. The group membership of a person is denoted by
dig : the value 1 denotes that person i is a member of group g, the
value 0 denotes that person i is not a member of group g. The number
of groups is denoted by G. The relation between xi and yi is denoted by β,
and the residuals ei are assumed to come from a normal distribution with
mean zero and unknown variance σ 2 .
In the next section Bayesian estimation will be introduced using a simple
binomial example. In the subsequent section we will return to the analysis
of covariance model (1) and explain how Bayesian estimation can be used to
estimate the parameters of that model.

2 Bayesian Estimation Using a Simple Binomial Example
Consider an experiment in which a regular coin is flipped N = 10 times
and comes up heads x = 2 times. The goal is to estimate the probability
π that the coin comes up heads. Three ingredients are needed for Bayesian
estimation. The first is the distribution of the data which represents the
information in the data with respect to π. For ”flips of a coin” this is a

binomial distribution:

f (x | N, π) = (N choose x) π^x (1 − π)^{N−x} .    (2)

Figure 1 displays this distribution, which is often called the likelihood, as a
function of π. As can be seen, the most likely value of π is .2. At this value
the likelihood attains its maximum.
The second ingredient needed for Bayesian estimation is the prior distri-
bution. It represents the knowledge a researcher has with respect to π before
the data are observed. A standard choice for the prior distribution h(π) is
the beta distribution. The interested reader is referred to Lee (1997, pp.
77-82) for further elaboration and visualization. The functional form of the
beta distribution is
h(π) ∝ π^{α−1} (1 − π)^{δ−1} ,    (3)

where α and δ denote the parameters of the beta distribution and
∝ denotes ”proportional to”. If α = δ = 1, the beta distribution is uniform
on the interval [0, 1], that is, it is uninformative because a priori each value
of π is equally likely. However, when flipping a regular coin we know that π
should be in the neighborhood of .5. This can be reflected using a subjective
prior with α = 6 and δ = 6. The information in this prior is equivalent to
6+6-2=10 flips with a coin of which 6-1=5 come up heads. This subjective
prior distribution is also displayed in Figure 1. As can be seen, according to
the prior the most likely value of π is .5.
The third ingredient is the posterior distribution. The posterior is a
summary of the information with respect to π available in the data and the
prior distribution. The posterior distribution g(π | N, x) is proportional to
the product of the distribution of the data and the prior distribution:

g(π | N, x) ∝ π^x (1 − π)^{N−x} π^{α−1} (1 − π)^{δ−1} = π^{x+α−1} (1 − π)^{N−x+δ−1} ,    (4)

for data and prior at hand this leads to

g(π | N, x) ∝ π^{8−1} (1 − π)^{14−1} .    (5)

The posterior distribution is also displayed in Figure 1. As can clearly be
seen, the posterior is a compromise between the information contained in the
data and the information contained in the prior distribution. The mode and
expectation of a beta posterior distribution can easily be computed (Gelman,
Carlin, Stern and Rubin, 2004, pp. 576-577). The mode is obtained at
π = (8 − 1)/(8 − 1 + 14 − 1) = .35, the expectation at π = 8/22 = .364. The
mode is an equally weighted average (both the sample size of the data and
the prior sample size are equal to 10) of the value of π in the sample (.2)
and prior (.5). This illustrates how the posterior combines the information
available in the distribution of the data and the prior distribution.

Figure 1: Likelihood, Prior and Posterior Densities for the Binomial Example
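These computations can be sketched in a few lines, using the Beta(x + α, N − x + δ) form of the posterior in (4); the function name is illustrative, not from the chapter:

```python
def beta_posterior_summary(x, N, alpha, delta):
    """Return (mode, mean) of the Beta posterior for a binomial proportion."""
    a = x + alpha                  # posterior alpha: 2 + 6 = 8
    d = N - x + delta              # posterior delta: 8 + 6 = 14
    mode = (a - 1) / (a + d - 2)   # (8 - 1) / (8 + 14 - 2) = .35
    mean = a / (a + d)             # 8 / 22 = .364
    return mode, mean

mode, mean = beta_posterior_summary(x=2, N=10, alpha=6, delta=6)
print(round(mode, 3), round(mean, 3))  # 0.35 0.364
```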

3 Estimation: Exploring the Posterior Using the Gibbs Sampler
3.1 Distribution of the Data, Prior and Posterior for
the Analysis of Covariance Model
The distribution of the data given the parameters of the statistical model at
hand is an important concept in both classical and Bayesian statistics. It is
a formal representation of the information contained in the data with respect

to the unknown model parameters. For (1) the distribution is

f (y | D, x, µ, β, σ²) = Π_{i=1}^N N (yi | Σ_{g=1}^G µg dig + β xi , σ²)    (6)

where y = {y1 , . . . , yN }, x = {x1 , . . . , xN }, D = {d1 , . . . , dG } with dg =
{d1g , . . . , dN g }, and µ = {µ1 , . . . , µG }. What can be seen in (6) is that the
distribution of each yi has a mean that is determined by the group member-
ship of person i and person i’s score on the covariate xi , and a variance that
is equal to σ 2 .
As was illustrated in the previous section, in a Bayesian analysis, besides
the distribution of the data, the prior distribution of the parameters also
has to be specified. This is elaborated by, for example, Gill (2002, Chapter
5), Lee (1997, pp. 59-61) and Gelman, Carlin, Stern and Rubin (2004, pp.
39-43). The prior distribution reflects what is known about the parameters
of the statistical model before the data are collected. It is one of the main
differences between Bayesian and classical statistics because the latter do not
use the prior distribution. The specification of the prior distribution is not
always easy. The interested reader is referred to Kass and Wasserman (1996)
for an elaborate discussion. One of the issues that causes a lot of discussion is
whether the prior should be objective or subjective. Inferences (estimation,
hypothesis testing and model selection) are objective if they do not depend
on the choice of the prior, and subjective otherwise. For estimation inferences
are virtually independent of the prior if ”the data dominate the prior”, that
is, if the amount of information with respect to the parameters in the data is
much larger than the amount of information in the prior (Gill, 2002, p. 125-
126). In this section this will be achieved using an uninformative or vague
prior (Gelman, Carlin, Stern and Rubin, 2004, pp. 61-65).
In Section 1 five models were introduced: H0 , H1a , H1b , H1c and H2 :
µll , µhl , µlh , µhh , that is, a model without equality or inequality constraints
among the adjusted means. Each of these models is based on the analysis of
covariance model (1). We will now first of all specify the prior distribution for
H2 , subsequently it will be shown how this prior distribution can be used to
derive the prior distributions for the other hypotheses under consideration.
The general form of the prior distribution that will be used for H2 is
h(µ, β, σ² | H2 ) = Π_{g=1}^G N (µg | µ0 , τ0²) N (β | β0 , γ0²) Inv-χ²(σ² | ν0 , λ0²).    (7)

As can be seen, the same prior is used for each µg , that is, a normal dis-
tribution with mean µ0 and variance τ02 . A vague prior for µg is obtained
using e.g. τ02 = 100000. A normal distribution with such a large variance is
almost flat, implying that a priori each possible value of µg is equally likely.
The prior for β is also a normal distribution with mean β0 and variance γ02 .
Again a vague prior is obtained using e.g. γ02 = 100000. The prior for σ 2
is a so called scaled inverse chi-square distribution. The interested reader is
referred to Gelman, Carlin, Stern and Rubin (2004, pp. 50, 547, 580) for a
further specification of the scaled inverse chi-square distribution with scale
parameter λ20 and degrees of freedom ν0 . A vague prior is obtained using
ν0 = 1, see, for example, the figures in Lee (1997, pp. 51-53).
Prior distributions for inequality constrained and null models can easily
be derived from the prior distribution of the unconstrained model. Let θm
denote {µ, β, σ 2 ∈ Hm }, that is, the set of parameter values allowed given
the restriction imposed by model Hm , then
h(µ, β, σ² | Hm ) = h(µ, β, σ² | H2 ) I_{θm ∈ Hm} / ∫ h(µ, β, σ² | H2 ) I_{θm ∈ Hm} dθm ,    (8)

where I_{θm ∈ Hm} equals 1 if θm is in accordance with the restriction in Hm and
0 otherwise. The main feature of (8) is that it assigns a prior probability
of zero to values of θm that are not in accordance with the restrictions of
model Hm . This manner to specify prior distributions is closely related to
the conditioning method described in Dawid and Lauritzen (2000).
The posterior distribution is the Bayesian way to summarize the infor-
mation with respect to the parameters in the data and the prior. Like for
the simple binomial example from the previous section it is the product of
the distribution of the data and the prior distribution:

g(µ, β, σ² | y, D, x, Hm ) ∝ f (y | D, x, µ, β, σ²) h(µ, β, σ² | Hm ).    (9)

3.2 The Gibbs Sampler


In the simple binomial example the model of interest contained one parameter
(the probability of a coin flip coming up heads). For such simple models the
Bayesian computation of parameter estimates is usually rather easy. How-
ever, for multidimensional models like the analysis of covariance model that
contains six parameters (four means, a regression coefficient and a residual
variance) it is rather complicated. A solution that has become rather popular
is to obtain a sample from the posterior, and to use this sample to compute
parameter estimates and credibility intervals (the Bayesian counterpart of
a confidence interval). For the simple binomial example this sample could
consist of, for example, 1000 values of π sampled from the posterior distri-
bution g(π | N, x). The expected value of π (called the expected a posteriori
(EAP) estimate) is then simply the average of these 1000 values. A 95% cen-
tral credibility interval is obtained using the 2.5-th and 97.5-th percentile of
the 1000 values ordered from smallest to largest. The error in estimate and
credibility interval caused by using a sample from the posterior is called the
Monte Carlo error (Gelman, Carlin, Stern and Rubin, 2004, pp. 277-278).
Increasing the size of the sample will reduce this error.
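As an illustration of these computations (the posterior draws are simulated here from the Beta(8, 14) posterior of the coin example, since no code accompanies the chapter):

```python
import random

# EAP estimate and 95% central credibility interval from a sample of
# 1000 posterior draws, here drawn from the Beta(8, 14) posterior.
random.seed(1)
draws = sorted(random.betavariate(8, 14) for _ in range(1000))

eap = sum(draws) / len(draws)  # expected a posteriori estimate
lower = draws[24]              # 2.5-th percentile: 25th of the ordered values
upper = draws[974]             # 97.5-th percentile: 975th of the ordered values
print(f"EAP = {eap:.3f}, 95% CCI = [{lower:.3f}, {upper:.3f}]")
```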
Obtaining a sample from the posterior is not always as easy as in the sim-
ple binomial example. The latter can be obtained from many software pack-
ages, for example, in SPSS using COMPUTE with RV.BETA(.). A popular
method to obtain a sample from a multidimensional posterior distribution
is the Gibbs sampler (Gelman, Carlin, Stern and Rubin, 2004, pp. 287-289;
Gill, 2002, pp. 311-313; Lee, 1997, pp. 259-268; Hoijtink, 2000). Gibbs sam-
plers can be programmed using, for example, Fortran or C++, or, using pack-
ages especially developed for the construction of Gibbs samplers like WinBUGS
(Spiegelhalter, Thomas, Best and Lunn, 2004) or MCMCpack (Martin
and Quinn, 2005) combined with the R-package (http://www.r-project.org/)
and OpenBugs (Thomas, 2004) in combination with the R-package (BRugs,
http://cran.r-project.org/src/contrib/Descriptions/BRugs.html). The Gibbs
sampler is an iterative procedure. Each iteration consists of a number of steps
in which each parameter is sampled from its distribution conditional on the
current values of the other parameters. This will be exemplified using (9). For
notational convenience, let g = 1, . . . , 4 = ll, hl, lh, hh.

• Initialization Step: Each parameter is set at a value that is allowed in
model Hm . For H1a : {µhl , µhh } < {µll , µlh } for the self-esteem data,
the values could be 1, 1, 2, 2 for µhl , µhh , µll and µlh , respectively, β = 0
and σ 2 = 1.

Subsequently the Gibbs sampler iterates across the following three steps for
t = 1, . . . , T iterations:

• Step 1: For g = 1, . . . , 4 Sample µg from

g(µg | µ1 , . . . , µg−1 , µg+1 , . . . , µG , β, σ², y, D, x, Hm ),    (10)

which can be shown (Klugkist, Laudy and Hoijtink, 2005) to be a
N (µg | ag , bg , L, U ) distribution where ag and bg denote the mean and
variance of this normal distribution, respectively, and
L = max{µhl , µhh } if g ∈ {ll, lh} and − ∞ otherwise
denotes the lowerbound on µg implied by the restriction H1a and
U = min{µll , µlh } if g ∈ {hl, hh} and ∞ otherwise
denotes the upperbound on µg . The mean and variance are:
ag = [µ0 /τ0² + (1/σ²)(Σ_{i=1}^N dig yi − β Σ_{i=1}^N dig xi )] / [1/τ0² + (1/σ²) Σ_{i=1}^N dig ],

and

bg = 1 / [1/τ0² + (1/σ²) Σ_{i=1}^N dig ].
Using inverse probability sampling it is easy to sample a deviate from
this truncated distribution: (a) sample a random number u from a
uniform distribution on the interval [0,1]; (b) compute the proportions
v and w that are not admissible due to L and U :
v = ∫_{−∞}^L N (µg | ag , bg ) dµg ,    (11)

and,

w = ∫_U^∞ N (µg | ag , bg ) dµg ;    (12)

(c) compute µg such that it is the deviate associated with the u-th
percentile of the admissible part of the posterior of µg :

v + u(1 − v − w) = ∫_{−∞}^{µg} N (µg | ag , bg ) dµg .    (13)
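Steps (a)-(c) can be sketched as follows; the values chosen for ag , bg , L and U are illustrative, and the normal inverse CDF is computed by simple bisection so that only the standard library is needed:

```python
import math, random

def norm_cdf(z, mean, sd):
    """Normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf((z - mean) / (sd * math.sqrt(2.0))))

def norm_ppf(p, mean, sd):
    """Invert the normal CDF by bisection on a wide standardized bracket."""
    lo, hi = -50.0, 50.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if norm_cdf(mean + mid * sd, mean, sd) < p:
            lo = mid
        else:
            hi = mid
    return mean + 0.5 * (lo + hi) * sd

def sample_truncated_normal(a_g, b_g, L, U):
    """Steps (a)-(c): inverse probability sampling from N(a_g, b_g) on [L, U]."""
    sd = math.sqrt(b_g)
    u = random.random()                  # (a) uniform deviate on [0, 1]
    v = norm_cdf(L, a_g, sd)             # (b) inadmissible mass below L
    w = 1.0 - norm_cdf(U, a_g, sd)       #     inadmissible mass above U
    return norm_ppf(v + u * (1.0 - v - w), a_g, sd)  # (c) admissible percentile

random.seed(3)
draws = [sample_truncated_normal(16.0, 1.0, L=15.0, U=17.0) for _ in range(500)]
print(min(draws) >= 15.0 and max(draws) <= 17.0)  # True: all draws obey the bounds
```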

• Step 2: Sample β from


g(β | µ, σ², y, D, x, Hm ),    (14)
which can be shown to be a N (β | c, d) distribution where
c = [β0 /γ0² + (1/σ²)(Σ_{i=1}^N yi xi − Σ_{i=1}^N Σ_{g=1}^G xi µg dig )] / [1/γ0² + (1/σ²) Σ_{i=1}^N xi²],

Table 2: Gibbs Sample and EAP Estimates for the Parameters of Model H1a
with Social Economic Status as a Covariate

t µll µhl µlh µhh β σ2


...
1093 17.88 15.79 16.01 13.95 .00 13.36
1094 17.93 15.99 16.13 12.90 .01 12.51
...
2411 18.23 15.97 15.97 13.83 -.01 12.53
...
EAP 17.93 16.00 16.15 13.44 -.00 12.75
2.5% 17.08 15.37 15.50 12.62 -.01 11.26
97.5% 18.79 16.65 16.84 14.26 .01 14.40

and

d = 1 / [1/γ0² + (1/σ²) Σ_{i=1}^N xi²].

• Step 3: Sample σ 2 from

g(σ² | µ, β, y, D, x, Hm ),    (15)

which can be shown to be a scaled inverse chi-square distribution with
degrees of freedom

ν = ν0 + N,

and scale parameter

λ² = (ν0 λ0² + N e)/(ν0 + N),
where

e = (1/N) Σ_{i=1}^N (yi − Σ_{g=1}^G µg dig − β xi )².
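The three steps can be sketched end-to-end for the unconstrained model H2 , where Step 1 needs no truncation. This is an illustrative reimplementation on simulated data, not the chapter's own program; all names and generating values are assumptions:

```python
import math, random

# Gibbs sampler for model (1) under H2 with vague priors (mu_0 = beta_0 = 0).
random.seed(7)
G, N = 2, 200
group = [i % G for i in range(N)]               # group index standing in for d_ig
x = [random.gauss(0.0, 1.0) for _ in range(N)]  # covariate
y = [10.0 + 3.0 * group[i] + 0.5 * x[i] + random.gauss(0.0, 1.0) for i in range(N)]

tau0sq = gamma0sq = 1e5      # vague normal priors for mu_g and beta
nu0, lam0sq = 1.0, 1.0       # vague scaled inverse chi-square prior for sigma^2
mu, beta, sig2 = [0.0] * G, 0.0, 1.0
sxx = sum(xi * xi for xi in x)
keep = []

for t in range(2000):
    # Step 1: mu_g | rest ~ N(a_g, b_g)
    for g in range(G):
        n_g = sum(1 for i in range(N) if group[i] == g)
        s_g = sum(y[i] - beta * x[i] for i in range(N) if group[i] == g)
        b_g = 1.0 / (1.0 / tau0sq + n_g / sig2)
        a_g = b_g * s_g / sig2                   # prior mean mu_0 = 0
        mu[g] = random.gauss(a_g, math.sqrt(b_g))
    # Step 2: beta | rest ~ N(c, d)
    sxy = sum(x[i] * (y[i] - mu[group[i]]) for i in range(N))
    d = 1.0 / (1.0 / gamma0sq + sxx / sig2)
    c = d * sxy / sig2                           # prior mean beta_0 = 0
    beta = random.gauss(c, math.sqrt(d))
    # Step 3: sigma^2 | rest ~ scaled Inv-chi^2(nu, lambda^2)
    e = sum((y[i] - mu[group[i]] - beta * x[i]) ** 2 for i in range(N)) / N
    nu = nu0 + N
    lamsq = (nu0 * lam0sq + N * e) / nu
    sig2 = nu * lamsq / random.gammavariate(nu / 2.0, 2.0)  # chi^2_nu deviate
    if t >= 500:                                 # discard burn-in
        keep.append((mu[0], mu[1], beta, sig2))

eap = [sum(col) / len(keep) for col in zip(*keep)]  # EAPs of mu_1, mu_2, beta, sigma^2
print([round(v, 1) for v in eap])                   # near 10.0, 13.0, 0.5, 1.0
```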

In Table 2 a part of the sample obtained for the ’self-esteem’ data using
social economic status as a covariate is displayed for H1a . The number of
iterations was T = 6000, of which the first 1000 were used as the burn-in
period (see the next section).

As can be seen in Table 2, the 95% central credibility interval for β
contains the value zero. This implies that the adjusted means will not change
a lot if the covariate social economic status is removed from the model. As
can be seen from the observed means in Table 1, the restriction µhl < µlh
does not appear to be in accordance with the data. This is nicely reflected in
Table 2 where µhl is forced to be smaller than µlh , but is never much smaller
than µlh . This is also reflected by the EAP estimates (simply the average
of the corresponding column), and the largely overlapping central credibility
intervals (simply the 2.5-th and 97.5 percentile of the corresponding column)
for µhl and µlh . Note that the EAP estimates were computed after deletion
of 1000 iterations burn-in, and, after a check of convergence of the Gibbs
sampler. Both burn-in and convergence will be elaborated in the next section.

3.3 Burn-In and Convergence


Before parameter estimates and credibility intervals can be computed and
the sample obtained can be used for any other purposes, it has to be verified
that the Gibbs sampler has converged, that is, that the resulting sample ad-
equately reflects the information in the posterior distribution. Two steps are
needed: discarding the burn-in phase (arising because of the relatively arbi-
trary choice of initial values); and, a convergence check. Using the R-CODA
package (http://cran.r-project.org/src/contrib/Descriptions/coda.html) out-
put from the Gibbs sampler as displayed in the top panel of Table 2 can easily
be processed. For each parameter the values sampled can be plotted against
iteration number, as is done in Figure 2 for µll for t = 1, . . . , 2000. As can be
seen, in the first few iterations the values sampled are far outside the band
width of the values that are sampled later on. This is caused by the relatively
arbitrary choice of initial values. For the inequality constrained ANCOVA
models discussed in this chapter, often within a relatively small number of
iterations the effect of the initial values vanishes and the sample converges
to the desired posterior distribution. The size of the burn-in period can be
determined by looking at plots like Figure 2 for each of the parameters. Here
a burn-in period of 1000 iterations should be more than sufficient to remove
the effect of the initial values.
The remaining question is then whether iterations 1001 until 6000 are a
representative sample from the posterior distribution. There is no fail safe
method that can be used to verify this so-called ”convergence of the Gibbs
sampler”. A comprehensive overview of convergence diagnostics is presented

Figure 2: The First 2000 Iterations of the Gibbs Sampler for µll

in Cowles and Carlin (1996) and Gill (2002, Chapter 11). Especially in more
complicated models, there is always the possibility that the Gibbs sampler did
not visit the whole domain of the posterior distribution. The consequence is
that some regions may be under-represented in the Gibbs sample. The prob-
ability that this happens can be reduced by running k = 1, . . . , K parallel Gibbs
samplers, each starting from different initial values. For each parameter this
would result (after discarding a burn-in phase) in, for example, k = 1, . . . , 5
vectors of sampled values, that can be summarized in a matrix with elements
ξtk for t = 1, . . . , 1000. If the posterior distribution of a model is a uni-modal
distribution (as is the case for all the models discussed in this chapter) there
is no need for multiple parallel chains of the Gibbs sampler. If the chain
is long enough (usually a few thousand iterations of the Gibbs sampler is
sufficient) the Gibbs sampler will almost certainly converge to the desired
posterior distribution. However, in order to check convergence, that is, to
check whether the number of iterations is large enough, it is still convenient to
collect the values sampled in a matrix with elements ξtk . In this case k = 1
refers to iterations 1001,...,2000, k = 2 to 2001,...,3000 etc., that is, for each
of the sequences T = 1000.
Iterations 1001,...,2000 are displayed in the bottom panel of Figure 2.

For k = 2, . . . , 5 almost identical displays are obtained. That is, according
to an eye ball test the Gibbs sampler has converged. Gelman, Carlin, Stern
and Rubin (2004, pp. 294-299) present a diagnostic that has become quite
popular as a more formal way to check convergence. First of all, for each
parameter the so-called between and within sequence variance is computed:
B = (T /(K − 1)) Σ_{k=1}^K (ξ̄.k − ξ̄.. )²,    (16)

where ξ̄.k = (1/T ) Σ_{t=1}^T ξtk and ξ̄.. = (1/K) Σ_{k=1}^K ξ̄.k , and

W = (1/K) Σ_{k=1}^K (1/(T − 1)) Σ_{t=1}^T (ξtk − ξ̄.k )².    (17)

The posterior variance of ξ can be estimated using ((T − 1)/T ) W + (1/T ) B, which
is unbiased under stationarity of the Gibbs sampler, or using W (which
approaches the posterior variance if T → ∞). Consequently, the quantity

R̂ = √( [((T − 1)/T ) W + (1/T ) B] / W )    (18)

approaches 1.0 if T → ∞. According to Gelman, Carlin, Stern and Rubin
(2004, pp. 296-297) values of R̂ smaller than 1.1 are indicative of convergence
of the Gibbs sampler.
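A sketch of (16)-(18) as applied here: a single stationary chain of 5000 post-burn-in draws (simulated, standing in for the µll column of the Gibbs sample) is split into K = 5 sequences of T = 1000:

```python
import math, random

def r_hat(sequences):
    """Gelman-Rubin diagnostic (16)-(18) for K sequences of length T."""
    K, T = len(sequences), len(sequences[0])
    means = [sum(s) / T for s in sequences]                  # xi-bar_.k
    grand = sum(means) / K                                   # xi-bar_..
    B = T / (K - 1) * sum((m - grand) ** 2 for m in means)   # between-sequence (16)
    W = sum(sum((v - m) ** 2 for v in s) / (T - 1)
            for s, m in zip(sequences, means)) / K           # within-sequence (17)
    return math.sqrt(((T - 1) / T * W + B / T) / W)          # (18)

random.seed(11)
chain = [random.gauss(17.9, 0.45) for _ in range(5000)]      # stand-in for mu_ll draws
seqs = [chain[k * 1000:(k + 1) * 1000] for k in range(5)]
print(r_hat(seqs) < 1.1)  # True for this converged, stationary chain
```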
In Table 3 the values of R̂ are displayed for each of the parameters of the
model currently under investigation. As can be seen, all values are smaller
than 1.1. The R̂-values, figures like the bottom panel of Figure 2 for K = 5
series of 1000 iterations, and the knowledge that we are sampling from a uni-
modal posterior distribution, provide convincing evidence for convergence
of the Gibbs sampler.

3.4 The Metropolis-Hastings Algorithm and Data Augmentation
In the example elaborated in Section 3.2 it is easy to sample from the con-
ditional distributions (10), (14) and (15) because they can be shown to be
standard distributions. However, not all conditional distributions that you
will encounter will be standard. Using the Metropolis Hastings Algorithm

Table 3: R̂ for H1a Using Social Economic Status as a Covariate

Parameter R̂
µll 1.01
µhl 1.01
µlh 1.02
µhh 1.01
β 1.01
σ2 1.03

(Chib and Greenberg, 1995; Tierney, 1998; Gelman, Carlin, Stern and Rubin,
2004, pp. 290-292; Gill, 2002, pp. 317-325) it is easy to sample from non-
standard distributions. Here we will focus on the Metropolis Hastings within
Gibbs algorithm. In this algorithm within one or more steps of the Gibbs
sampler the Metropolis Hastings algorithm is used to sample the conditional
distribution at hand (Gelman, Carlin, Stern and Rubin, 2004, p. 292).
Suppose, for example, that the conditional distribution in Step 2 of our
Gibbs sampler is not a tractable standard distribution. What often can be done
in such a situation is evaluation of g(β | µ, σ 2 , y, D, x, Hm ) = g(β | .) for each value of
β (just evaluate (14) for a specific value of β with all the other parameters
fixed at their current values). What subsequently is needed is an approximation
of the target distribution g(β | .) by means of a standard distribution.
Especially for models that contain many parameters the choice of the ap-
proximating distribution is important: the closer the resemblance between
approximation and target the faster the Metropolis-Hastings within Gibbs
sampler will converge (Gelman, Carlin, Stern and Rubin, 2004, pp.305-307).
A basic idea is to use an approximating distribution depending on the val-
ues sampled in the previous iteration, q(β t | β t−1 ). The interested reader
is referred to Robert and Casella (2004, Chapter 7.3) for an elaboration of
this idea. A so-called independent Metropolis-Hastings algorithm (Robert
and Casella, 2004, Chapter 7.4) is obtained if the approximating distribution
does not depend on the values sampled in the previous iteration. A rather
unintelligent idea that would nevertheless work quite well in the situation at
hand is to use an approximation q(β t | β t−1 ) = q(β t ) ∼ N (0, 1). After spec-
ification of the approximating distribution three steps are needed to sample
a value from the target distribution:

1. In iteration t, sample a value β t from q(β t | β t−1 ).
2. Compute the ratio r = [g(β t )/q(β t | β t−1 )] / [g(β t−1 )/q(β t−1 | β t )].

3. Accept β t as a draw from the target distribution with probability
min{r, 1}; otherwise set β t = β t−1 .
This basically solves the problem of sampling from (conditional) distributions
that are not standard distributions.
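The three steps can be sketched as follows. Since the conditional distribution (14) itself is not available here, an illustrative unnormalized target g(β), a N(.3, .2²) density, stands in for it, and the deliberately crude N(0, 1) proposal of the text is used:

```python
import math, random

def g(beta):
    """Illustrative unnormalized target: N(0.3, 0.2^2) density."""
    return math.exp(-0.5 * ((beta - 0.3) / 0.2) ** 2)

def q(beta):
    """Independent proposal density, N(0, 1), unnormalized."""
    return math.exp(-0.5 * beta ** 2)

random.seed(5)
beta_t, draws = 0.0, []
for _ in range(20000):
    cand = random.gauss(0.0, 1.0)                        # 1. sample from q
    r = (g(cand) / q(cand)) / (g(beta_t) / q(beta_t))    # 2. compute the ratio
    if random.random() < min(r, 1.0):                    # 3. accept with prob min{r, 1}
        beta_t = cand                                    #    ... else keep beta_t
    draws.append(beta_t)

mean = sum(draws[2000:]) / len(draws[2000:])             # discard 2000 as burn-in
print(round(mean, 2))  # close to the target mean .3
```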
Another problem that can occur during the construction of a Gibbs sam-
pler is the presence of missing data, random or latent variables. Both prob-
lems can usually be handled using a so called data augmented Gibbs sampler
(Tanner and Wong, 1987; Gill, 2002, pp. 325-327; Zeger and Karim, 1991). The latter
is obtained via the addition of a step to the Gibbs sampler in which the
missing data or latent variables are sampled. Suppose, for example, that (1)
is used to analyze data with missing yi for some of the persons. This can
easily be dealt with via the addition of a fourth step to the Gibbs sampler
in which the missing data are augmented:
• Step 4: Sample each of the missing yi ’s from
g(yi | di1 , . . . , diG , xi , µ, β, σ²),    (19)

which can be shown to be a normal distribution with mean Σ_{g=1}^G µg dig +
β xi and variance σ².
Under the assumption that given D and x the data are missing at random
(Schafer, 1997, pp. 10-13), this renders a sample from the correct poste-
rior distribution. For a simple example where data augmentation is used to
handle random variables the interested reader is referred to Hoijtink (2000).
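Step 4 can be sketched as follows, with illustrative current values for the parameters; each call draws one missing yi from its model distribution:

```python
import math, random

random.seed(9)
mu = [10.0, 13.0]        # current values of mu_g (illustrative)
beta, sig2 = 0.5, 1.0    # current values of beta and sigma^2 (illustrative)

def augment(group_i, x_i):
    """Draw a missing y_i from N(mu_g + beta * x_i, sigma^2)."""
    mean = mu[group_i] + beta * x_i
    return random.gauss(mean, math.sqrt(sig2))

filled = [augment(0, 2.0) for _ in range(1000)]
print(round(sum(filled) / 1000.0, 1))  # close to 10 + 0.5 * 2 = 11
```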

4 Model Checking: Posterior Predictive Inference
In the previous section estimation using Bayesian computational methods
was discussed. In this section Bayesian hypothesis testing or model checking
will be discussed. First of all the basic problem of null-hypothesis testing
(nuisance parameters) and the Bayesian solution to this problem will be
discussed. An illustration will be provided using a test of homogeneity of
within group residual variances. Finally, the frequency properties of the
Bayesian solution will be discussed.

Figure 3: Null Hypothesis Testing Using Pivots

4.1 The Basic Problem of Null-Hypothesis Testing: Nuisance Parameters
The definition of a p-value (see, for example, Meng (1994)) is probably well-
known:
p = P (T (·) > t(·) | H0 )    (20)

for one-sided tests. Stated otherwise, a p-value is the probability that a test
statistic T (·) computed for a data set sampled from the null-population H0 is
larger than the same test statistic t(·) computed for the observed data. This
procedure is visualized in Figure 3 for testing H0 : µ1 = µ2 versus µ1 ≠ µ2
using student’s t-test:
y1 − y2
t= r , (21)
(N1 −1)s21 +(N2 −1)s22 1 1
N1 +N2 −2
( N1 + N2
)

where N1 and N2 denote the sample sizes in group 1 and 2, respectively, and
ȳ1, ȳ2, s1² and s2² the corresponding sample averages and variances. Note that
in the null-population µ1 = µ2 , that is, both means have the same value. In
the sequel this value will be denoted by µ. As can be seen in Figure 3, first

of all data matrices have to be sampled from the null-population. This is
problematic because under H0 the values µ and σ 2 have to be known in order
to be able to sample data. Here µ and σ 2 are nuisance parameters, stated
otherwise, there are many values for µ and σ 2 that are in accordance with
H0 which leaves the problem from which of the many null-populations the
data matrices should be sampled.
In many standard situations (analysis of variance, multiple regression)
nuisance parameters can easily be handled because the test statistic is a
pivot, that is, the distribution of the test statistic does not depend on the
actual values of the nuisance parameters. This is illustrated in Figure 3:
whatever the actual values of µ and σ² the t-test always has a t-distribution with N1 + N2 − 2 degrees of freedom. Stated otherwise, the two-sided p-value

p = P(|T(·)| > |t(·)| | H0)  (22)

does not depend on the actual null-population from which data matrices are replicated, because the distribution of T(·) is always t_{N1+N2−2}.
Pivots are among the most elegant achievements of classical statistics.
For many situations, however, pivotal test statistics do not exist. Classical solutions for this situation are so-called plug-in p-values (Bayarri and Berger, 2000) or asymptotic p-values (Robins, van der Vaart and Ventura, 2000), that is,
p-values computed assuming that the sample size is very large. However,
since this chapter is on Bayesian data analysis we will limit ourselves to the
Bayesian way to deal with nuisance parameters in the absence of pivotal test
statistics: posterior predictive p-values.

4.2 Posterior Predictive p-values


Posterior predictive p-values are discussed in Meng (1994), Gelman, Meng
and Stern (1996), Gill (2002, pp. 179-181) and Gelman, Carlin, Stern and
Rubin (2004, pp. 159-177). Let θ0 denote the nuisance parameters of the null-model, let Z denote the observed data (for our analysis of covariance example Z = {y, D, x}), and let Z^rep denote a replicate that is sampled from the null-population. Then

p = P(T(θ0, Z^rep) > t(θ0, Z) | H0, Z),  (23)

that is, in accordance with the Bayesian tradition computations are per-
formed conditional on the data that are observed. This opens the possibility

Figure 4: The Posterior Predictive p-value (schematic: θ0,1, . . . , θ0,T are sampled from g(θ0 | Z, H0), a replicate Z^rep_t is generated from each draw, and the p-value is the proportion of replicates for which T(θ0,t, Z^rep_t) exceeds t(θ0,t, Z))

to integrate out the nuisance parameters during the computation of the p-value:

p = ∫_{θ0} P(T(θ0, Z^rep) > t(θ0, Z) | θ0) g(θ0 | Z, H0) dθ0.  (24)
Another difference with the classical approach is the use of discrepancy mea-
sures instead of statistics. As can be seen in (24) the discrepancy measures
T (·) and t(·) can be a function of both the data and the unknown model
parameters. A visual illustration of (24) is provided in Figure 4. As can be
seen (24) can be computed in three steps:
• Step 1, sample parameter vectors θ 0,1 , . . . , θ 0,T from their posterior
distribution. This sample reflects the knowledge with respect to θ 0
that is available after observing the data.
• Step 2, replicate a data matrix Z^rep_t using θ0,t for t = 1, . . . , T. The result is called the posterior predictive distribution of the data matrices. This is the distribution of data matrices that can be expected if the null-model provides a correct description of the observed data.
• Step 3, compute the posterior predictive p-value simply by counting the proportion of replicated data matrices for which T(θ0,t, Z^rep_t) > t(θ0,t, Z).
Posterior predictive inference will be illustrated using (1) and the self-esteem
data. The question we will investigate is whether or not the within group

residual variances are equal.

Table 4: The Computation of Posterior Predictive P-values

  t    µ1     µ2     µ3     µ4     σ²     β     T(·)   t(·)
  1    18.20  16.62  12.46  13.18  12.62  .00   1.86   1.60
  ...
  6    18.44  16.51  14.97  13.02  12.25  .01   1.76   1.64
  ...

This will be investigated using the following
discrepancy measure:
t(·) = s²_largest / s²_smallest,  (25)

where

s²_g = (1/Ng) Σ_{i=1}^{Ng} (yi − µg − βxi)²  (26)
denotes the within group residual variance, of which the smallest and largest observed in the four groups are used in the test statistic. Note that Ng denotes the sample size in group g. Note furthermore that (26), and thus (25), depends on both the data and the unknown model parameters µ and β.
This measure is chosen to show that the posterior predictive approach
enables a researcher to construct any test statistic without having to derive
its distribution under the null-hypothesis. As will be elaborated in the next section, t(.) can be evaluated using a posterior predictive p-value, or using its distance to the distribution of T(.). The latter approach is called model checking (Gelman, Meng and Stern, 1996): even if the p-value is rather small, a researcher may conclude that the distance between t(.) and the distribution of T(.) is so small that it is not necessary to adjust the model used, e.g., that it is not necessary to use a model with group dependent within group residual variances. It is interesting to note a rule of thumb from the context of analysis of variance (Tabachnick and Fidell, 1996, p. 80): if the sample sizes per group are within a ratio of 4:1, t(.) may be as large as 10 before heterogeneity of within group variances becomes a problem.
First of all the Gibbs sampler was used to obtain a sample from the pos-
terior distribution of the null-model. A part of the results is displayed in
Table 4. Subsequently, for t = 1, . . . , T a data matrix is replicated from the
null-population. Finally, t(·) and T (·) are computed using the observed and
replicated data matrices, respectively, and µ and β. The posterior predictive

p-value is then simply the proportion of T (·) larger than t(·) resulting in the
value .88. This implies that the discrepancies computed for the observed data
are in accordance with the posterior predictive distribution of the discrep-
ancies under the hypothesis of equal within group residual variances. The
range of the observed discrepancies was [1.51,1.72], the range of the repli-
cated discrepancies [1.02,3.58]. As can be seen the observed discrepancies
are well within the range of the replicated discrepancies. Furthermore, the
values of the observed discrepancies are much smaller than the rule-of-thumb bound of 10 (derived for analysis of variance, whereas here we look at an analysis of covariance) for group sizes that differ by less than a factor of 4:1. The
conclusion is that a model with equal within group residual variances is more
than reasonable.

4.3 Frequency Properties


A potential flaw of posterior predictive inference, that is, of the Bayesian way to deal with nuisance parameters, is that the data are used twice: once to compute t(θ0, Z) and once to determine g(θ0 | Z). As noted by
Meng (1994) and more elaborately discussed by Bayarri and Berger (2000),
the frequency properties of posterior predictive inference may not be optimal.
If a data matrix is repeatedly sampled from a null-population, resulting in
a sequence Z j for j = 1, . . . , J then the distribution of the corresponding
sequence of p-values p1 , . . . , pJ may not be uniform. Stated otherwise, where
the equality P (p < α | H0 ) = α holds for all values of α in the interval
[0, 1] for Student’s t-test, this equality does not hold for posterior predictive
p-values in general. This makes it difficult to interpret a posterior predictive
p-value.
There are several ways to deal with this problem:

• Kato and Hoijtink (2004) investigated the frequency properties of posterior predictive p-values used to test model assumptions in a simple
multilevel model. Using a simulation study, that is, repeatedly sam-
pling a data matrix from the null-population and computing a p-value
for each data matrix, they evaluated among others classical asymptotic
p-values, posterior predictive p-values for test statistics (a function of
only the data) and posterior predictive p-values for discrepancy measures (a function of both the data and the unknown model parameters).
From these only the posterior predictive p-values for the discrepancy

measures were (almost) uniform, that is, P(p < α | H0) ≈ α.
Also in other situations researchers can execute such a simulation study
to determine if their posterior predictive p-values have acceptable fre-
quency properties or not.

• Bayarri and Berger (2000) present two new types of p-values that explicitly account for the fact that the data are used twice: the conditional predictive p-value and the partial posterior predictive p-value. In their
examples the frequency properties of these p-values are excellent. However, their examples are rather simple, and it may be difficult or even
impossible to compute these p-values for more elaborate examples like
the example given in the previous section.

• Bayarri and Berger (2000) note and exemplify that so-called 'plug-in' p-values appear to have better frequency properties than posterior predictive p-values. These p-values can be obtained using the parametric
bootstrap, that is, replace θ 0,1 , . . . , θ 0,T in Step 1 of the computation
of the posterior predictive p-value by the maximum likelihood estimate
of the model parameters, θ̂0, for t = 1, . . . , T. Note that, although the frequency properties of plug-in p-values appear to be better than those of posterior predictive p-values, it has to be determined for each new situation how good they actually are.

• Bayarri and Berger (2000) also note that p-values can be calibrated. In
its simplest form this entails the simulation of a sequence Z 1 , . . . , Z J
from a null population, and subsequent computation of the sequence
p1 , . . . , pJ . If the latter is not uniformly distributed, it does not hold
that P (p < α | H0 ) = α. However, using the sequence p1 , . . . , pJ for
each α a value α∗ can be found such that P (p < α∗ | H0 ) = α. If,
subsequently, it is desired to test the null-hypothesis with α = .05 for
empirical data, the null hypothesis should be rejected if the p-value is
smaller than the α∗ corresponding to α = .05.

• Last but not least, Gelman, Meng and Stern (1996) are not in the least
worried about the frequency properties of posterior predictive p-values.
They suggest using discrepancies simply to assess the discrepancy between a model and the data. A quote from Tiao and Xu (1993) clarifies
what they mean: ”... development of diagnostic tools with a greater
emphasis on assessing the usefulness of an assumed model for specific

purposes at hand, rather than on whether the model is true”. They
also suggest not to worry about the power that can be achieved using a specific discrepancy, but to choose the discrepancy such that it reflects ”how the model fits in aspects that are important for our problems at hand”. Stated otherwise, although posterior predictive inference is not a straightforward alternative to the classical approach to hypothesis testing (is H0 true or not?), it can be used for model checking. It allows researchers to define discrepancies between model and
data such that they are relevant for the problem at hand (as was done
in the previous section to investigate equality of within group residual
variances). Subsequently the observed size of these discrepancies can
be compared with the sizes that are expected if the model is true via
the posterior predictive distribution of these discrepancies. Finally, the
researcher at hand has to decide whether the differences between the
observed and replicated discrepancies are so large that it is worthwhile
to adjust the model.
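The calibration of p-values described above can be sketched as follows. The simulated null p-values are artificial stand-ins for p1, . . . , pJ; squaring uniforms mimics a conservative, non-uniform null distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

# stand-in for p_1, ..., p_J computed from J data sets simulated
# under the null-population
p_null = rng.uniform(size=10000) ** 2

def calibrated_threshold(p_null, alpha):
    """alpha* such that P(p < alpha* | H0) = alpha: the alpha-quantile
    of the simulated null p-values."""
    return np.quantile(p_null, alpha)

alpha_star = calibrated_threshold(p_null, 0.05)
# for empirical data, reject H0 at level .05 only if the posterior
# predictive p-value is smaller than alpha_star
```

The same simulation also reveals whether calibration is needed at all: if the p_null sequence is already (almost) uniform, α* will be close to α.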

5 Model Selection: Marginal Likelihood, the Bayes Factor and Posterior Probabilities
5.1 Introduction
So far in this chapter Bayes' theorem has not explicitly been discussed, although it was implicitly used when the posterior distribution (9) was introduced. It states that:
g(θm | Z, Hm) = f(Z | θm) h(θm | Hm) / m(Z | Hm).  (27)
The posterior distribution, distribution of the data and prior distribution are
denoted by g(·), f (·) and h(·), respectively. New is the so called marginal
likelihood m(Z | Hm ). It is called marginal because of the conditioning on
the model Hm at hand instead of on θ m as is done in the distribution of the
data. It is defined as follows:
m(Z | Hm) = ∫_{θm} f(Z | θm) h(θm | Hm) dθm.  (28)
The marginal likelihood can be seen as a Bayesian information criterion.
Information criteria can be used to select the best of a set of competing

Figure 5: Marginal Likelihood: Ockham's Razor Illustrated (isodensity contours of f(y | D, µ1, µ2) on [−2, 2] × [−2, 2], with the square prior of H1 and the lower-triangle prior of H2)

models. Classical information criteria like AIC (Akaike, 1987) and CAIC
(Bozdogan, 1987) consist of two parts:

• The first part is −2 log f (Z | θ̂ m ), that is, the distribution of the data
or likelihood evaluated using the maximum likelihood estimate θ̂ m of
θ m . The smaller the value of the first part, the better the fit of the
model.

• The second part is a penalty for model size which is a function of the
number of parameters P in a model. For AIC this penalty is 2P , for
CAIC the penalty is (log N + 1)P . The smaller the penalty, the more
parsimonious the model.

An information criterion results from the addition of fit and penalty, the
smaller the resulting number, the better the model at hand.
As will now be illustrated, fit and penalty are (although implicitly) also
important parts of the marginal likelihood (28). It is therefore a fully auto-
matic Ockham’s razor (Smith and Spiegelhalter, 1980; Jefferys and Berger,
1992; Kass and Raftery, 1995) in the sense that model fit and model size are
automatically accounted for. Consider, for example, the situation displayed
in Figure 5. There are two models under investigation:
H1: yi = Σ_{g=1}^{2} µg dig + ei, with ei ∼ N(0, 1),  (29)

and

H2: yi = Σ_{g=1}^{2} µg dig + ei, with ei ∼ N(0, 1) and µ1 > µ2.  (30)

The ellipses in Figure 5 represent the isodensity contours of f (y | D, µ1 , µ2 ),


that is, a simplification of (6). The square represents the prior distribution h(µ1, µ2 | H1) for H1, which is chosen to be uniform, that is, a density of 1/16 over the two-dimensional space bounded by the values −2 and +2. The lower triangle represents the prior distribution h(µ1, µ2 | H2) = 2/16 of H2, which can be derived from h(µ1, µ2 | H1) using (8). Applied to the situation at hand, (28) reduces to
m(y | Hm) = ∫_{µ1,µ2} f(y | D, µ1, µ2) h(µ1, µ2 | Hm) dµ1 dµ2.  (31)

As can be seen in Figure 5, the fit of both models is the same because, loosely speaking, both H1 and H2 support the maximum of f(y | D, µ1, µ2). However, when (31) is evaluated it turns out to be larger for H2 than for H1, that is, H2 is preferred to H1. This can be seen as follows: denote the integrated
density of f (·) over the upper triangle by a and over the lower triangle by
b. Since a is smaller than b, it follows that m(y | H1 ) = 1/16a + 1/16b is
smaller than m(y | H2 ) = 2/16b. Stated otherwise, the marginal likelihood
prefers H2 over H1 because the fit of both models is about the same, but the
parameter space of H2 is smaller than the parameter space of H1 .
The ratio of two marginal likelihoods is called the Bayes factor (Kass
and Raftery, 1995; Gill, 2002, Chapter 7; Lee, 1997, Chapter 4; Lavine and
Schervish, 1999), that is,
BF_{m,m′} = m(Z | Hm) / m(Z | Hm′) = [P(Hm | Z) / P(Hm′ | Z)] / [P(Hm) / P(Hm′)].  (32)
As can be seen, the Bayes factor is equal to the ratio of posterior to prior model odds. This means that the Bayes factor represents the change in belief from prior to posterior model odds. Stated otherwise, if BF_{m,m′} = 4, model m has become four times as likely as model m′ after observing the data. A more straightforward interpretation of the marginal likelihood is obtained using posterior model probabilities computed under the assumption that the prior model probabilities are P(Hm) = 1/M for m = 1, . . . , M:
P(Hm | Z) = m(Z | Hm) / Σ_{m′=1}^{M} m(Z | Hm′).  (33)

If BF_{m,m′} = 4 then with equal prior probabilities the posterior probabilities of model m and m′ are .80 and .20, respectively.
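The computation in (33) is elementary; as a sketch (the function name is hypothetical):

```python
def posterior_probabilities(bayes_factors):
    """Posterior model probabilities (33) under equal prior model
    probabilities, from Bayes factors of each model against a
    common reference model."""
    total = sum(bayes_factors)
    return [bf / total for bf in bayes_factors]

# BF_{m,m'} = 4 for two models gives posterior probabilities .80 / .20
probs = posterior_probabilities([4.0, 1.0])   # -> [0.8, 0.2]
```

The same function works for any number of models, since (33) only requires the marginal likelihoods (or Bayes factors against one common reference) up to a shared constant.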

5.2 Specification of the Prior Distributions


An important step in model selection using the marginal likelihood is spec-
ification of the prior distributions. When the goal is to estimate model pa-
rameters, prior distributions are often dominated by the data and have little
influence on the resulting estimates. The same holds for posterior predictive model checking. Consequently, the use of uninformative or vague prior
distributions is not a problem. However, as is exemplified by the Bartlett-Lindley paradox (Lindley, 1957; Howson, 2002), the marginal likelihood is very sensitive to the specification of prior distributions, and one should not use uninformative or vague prior distributions. Consider the following two
models:
H1 : yi = 0 + ei , with ei ∼ N (0, 1), (34)
and,
H2 : yi = µ + ei , with ei ∼ N (0, 1). (35)
The main research question is whether µ equals 0 or not. The normal curve
in Figure 6 displays the normal distribution of the data for H2 which has a
mean of -1.5 and a variance of 1. The height of the normal curve at µ = 0 is
the marginal likelihood (.1295) of H1 (since there are no unknown parameters
under H1 the prior distribution is a point mass of 1.0 at µ = 0). The marginal
likelihood for H2 is obtained if the distribution of the data is integrated with
respect to the prior distribution chosen. If the prior distribution is uniform in
a certain interval [−d, d], the marginal likelihood is equal to density under the
normal curve in the interval [-d,d] multiplied with the prior density. The two
boxes in Figure 6 are prior distributions with d = 2 and d = 3, respectively.
For d = 2 the marginal likelihood is .1652. for d = 3 the marginal likelihood
is .1645, for d = 20 the marginal likelihood is .0214. As can be seen, the
marginal likelihood decreases if d increases. If d → ∞ then m(H2 | y) → 0
and BF12 → ∞. Stated otherwise, the support for H1 depends completely
on the prior chosen and is not influenced by the data!
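The paradox is easy to reproduce numerically. The sketch below uses a single observation y = −1.5, so the exact numbers differ from those quoted above (which belong to the chapter's own example), but the pattern is the same: m(y | H2) shrinks toward zero as d grows.

```python
from math import erf, exp, pi, sqrt

def Phi(z):
    """Standard normal cdf."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

y = -1.5                                    # a single observation
m_H1 = exp(-y * y / 2.0) / sqrt(2.0 * pi)   # N(y | 0, 1): about .1295

def m_H2(d):
    """Marginal likelihood of H2 under a uniform prior on [-d, d]:
    (1 / 2d) times the integral of N(y | mu, 1) over mu in [-d, d]."""
    return (Phi(y + d) - Phi(y - d)) / (2.0 * d)

marginals = [m_H2(d) for d in (2.0, 3.0, 20.0, 200.0)]
# m(y | H2) decreases toward 0 as d grows, so BF_12 = m_H1 / m_H2(d)
# grows without bound: the Bartlett-Lindley paradox
```

Running this shows that the evidence for H1 can be made arbitrarily strong simply by widening the prior on µ, regardless of the data.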
As has become clear in the previous paragraph, Bayesian model selection
using the marginal likelihood requires a careful selection and specification
of prior distributions. Here a general and a specific approach will be elab-
orated: training data and encompassing priors. The idea behind training

Figure 6: An Illustration of the Bartlett-Lindley Paradox (the N(−1.5, 1) distribution of the data under H2, with uniform prior boxes of half-width d = 2 and d = 3 on the µ-axis)

data (Berger and Perricchi, 1996, 2004; Perez and Berger, 2002) is to use as
small a part of the data as possible to construct a prior distribution for each
model under consideration. This will render a prior distribution that is in agreement with the population from which the data are sampled, and that is informative enough to avoid the Bartlett-Lindley paradox. This smallest possible part of the data is called a minimal training sample and will be
denoted by Z(l). A minimal training sample is the smallest sample for which
the posterior prior is proper:
h(θm | Z(l), Hm) = f(Z(l) | θm) h(θm | Hm) / m(Z(l) | Hm).  (36)

A standard (but not the only possible) choice for h(θ m | Hm ) is a reference
prior (Kass and Wasserman, 1996). For the example in (34) and (35) the size
of the minimal training sample is one, because one observation is sufficient
to obtain a proper posterior prior for µ: f (z(l) | µ) = N (z(l) | µ, 1), h(µ |
H2 ) ∝ constant, resulting in h(µ | z(l), H2 ) = N (µ | z(l), 1). Note that for
H1 the (posterior) prior is a point mass of one at µ = 0.
The posterior prior distribution depends on the training sample chosen.
One way to avoid this arbitrariness is to randomly select many training sam-
ples from the observed data. The two most important ways in which these
training samples can be processed to render one Bayes factor are averaged
intrinsic Bayes factors (Berger and Perricchi, 1996, 2004) and expected pos-
terior priors (Berger and Perricchi, 2004; Perez and Berger, 2002). For each
training sample the intrinsic Bayes factor of model m to m0 can be computed:
IBF_{m,m′} = ∫_{θm} f(Z(−l) | θm) h(θm | Z(l), Hm) dθm / ∫_{θm′} f(Z(−l) | θm′) h(θm′ | Z(l), Hm′) dθm′,  (37)

where Z(−l) denotes the data matrix excluding the observations that are
part of the training sample. The average of the IBF’s resulting for each of
the training samples is the averaged intrinsic Bayes factor. Bayes factors can
also be computed using (28) for each model m with h(θ m | Hm ) replaced by
the expected posterior prior:
(1/L) Σ_{l=1}^{L} h(θm | Z(l), Hm),  (38)
where L denotes the number of training samples.
Both intrinsic Bayes factors and the approach using expected posterior
priors are general methods that can be applied in many situations. The
encompassing prior approach (Klugkist, Laudy and Hoijtink, 2005; Klugkist, Kato and Hoijtink, 2005; Kato and Hoijtink, 2006; Laudy and Hoijtink,
2006) was developed specifically to deal with the selection of the best of a
set of inequality constrained hypotheses (see Section 1 for an elaboration
of inequality constrained hypotheses for the self-esteem data). Since (8) is
used to derive the prior distribution for constrained models, only the encom-
passing prior (7), that is, the prior for the unconstrained model, has to be
specified. This is in agreement with the principle of compatibility (Dawid
and Lauritzen, 2000) which is best illustrated using a quote from Leucari
and Consonni (2003) and Roverato and Consonni (2004): ”If nothing was
elicited to indicate that the two priors should be different, then it is sensible
to specify [the prior of constrained models] to be, . . ., as close as possible
to [the prior of the unconstrained model]. In this way the resulting Bayes
factor should be least influenced by dissimilarities between the two priors due
to differences in the construction processes, and could thus more faithfully
represent the strength of the support that the data lend to each model”.
As can be seen in (7), each mean has the same prior distribution; this ensures that the encompassing model does not favor any of the models being
compared. Furthermore, the encompassing prior should assign a substantial
probability to values of µ, β and σ 2 that are in agreement with the data at
hand, and very small probabilities to values that are not. Since it is a priori
unknown which values are in agreement with the data, these values will be
derived using the data. This is reasonable because the compatibility of the
priors ensures that this information is used in a similar manner for each of
the models under investigation. The following procedure is used:
• The prior distribution for σ² is an Inv-χ²(σ² | ν0, λ0²). We will use ν0 = 1 and λ0² = 12.1 (the least squares estimate of σ²).
• The prior distribution for β is a N(β | β0, γ0²). The lower (l) and upper bound (u) of the 99.7% confidence interval for the least squares estimate of β are used to determine the prior distribution: β0 = (u + l)/2 and γ0² = ((u − l)/2)². Stated otherwise, prior mean and variance are chosen such that the mean minus one standard deviation equals l, and the mean plus one standard deviation equals u. The resulting numbers for β0 and γ0² are 0 and .0004, respectively.
• For g = 1, . . . , G the prior distribution for µg is N(µg | µ0, τ0²). As for β, for each mean the lower and upper bound of the 99.7% confidence interval for the least squares estimate are determined. The smallest lower bound becomes l and the largest upper bound becomes u. Subsequently, µ0 and τ0² are determined in the same way as β0 and γ0². The resulting numbers for µ0 and τ0² are 15.7 and 13.7, respectively.
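The three bullets above can be sketched as follows. The data are simulated stand-ins for the self-esteem data, and the design-matrix set-up and the use of 3 standard errors for the 99.7% interval are assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# simulated stand-in for the self-esteem data: G = 4 groups, n = 30
G, n = 4, 30
group = np.repeat(np.arange(G), n)
x = rng.normal(0.0, 10.0, G * n)                 # centered covariate
y = np.array([16.0, 14.7, 14.0, 12.9])[group] + rng.normal(0.0, 3.5, G * n)

# least squares fit of model (1): y_i = mu_{g(i)} + beta * x_i + e_i
D = np.zeros((G * n, G + 1))
D[np.arange(G * n), group] = 1.0                 # group indicators
D[:, G] = x                                      # covariate column
est, rss, *_ = np.linalg.lstsq(D, y, rcond=None)
s2 = rss[0] / (G * n - (G + 1))                  # lambda_0^2 for sigma^2
se = np.sqrt(s2 * np.diag(np.linalg.inv(D.T @ D)))

# 99.7% interval (3 standard errors) for beta -> prior N(beta0, gamma02)
l, u = est[G] - 3 * se[G], est[G] + 3 * se[G]
beta0, gamma02 = (u + l) / 2, ((u - l) / 2) ** 2

# smallest lower / largest upper bound over the means -> N(mu0, tau02)
lows, ups = est[:G] - 3 * se[:G], est[:G] + 3 * se[:G]
mu0, tau02 = (ups.max() + lows.min()) / 2, ((ups.max() - lows.min()) / 2) ** 2
```

By construction β0 equals the least squares estimate of β, and µ0 sits midway between the most extreme interval bounds of the four means.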
To summarize this section: researchers who want to use Bayes factors to select the best of a number of competing models should not choose reference, vague or uninformative priors. This was exemplified using the Bartlett-Lindley paradox. Instead, researchers should either use subjective priors, or priors constructed using the data, like the posterior prior or the encompassing prior.

5.3 Computation of the Marginal Likelihood


In general the computation of (28) and consequently also (32) and (33) is
not easy. The interested reader is referred to: Kass and Raftery (1995) for
an overview of approximations and methods based on importance sampling;
Chib (1995) who uses (27) as the point of departure to develop an estimator;
and, Carlin and Chib (1995) who develop a Markov Chain Monte Carlo
procedure in which not only the parameters of all models under investigation
are sampled, but also the model indicator m. The most straightforward
estimate of (28) is the approximation
m̂(Z | Hm) = (1/S) Σ_{s=1}^{S} f(Z | θm,s),  (39)
where, θ m,s for s = 1, . . . , S denotes a sample from the prior distribution
h(θ m | Hm ). However, as noted by Kass and Raftery (1995), this estimate is

rather inefficient, that is, often a huge sample from h(θ m | Hm ) is needed to
avoid that m̂(·) depends strongly on the sample at hand. An improvement
of (39) is the harmonic mean estimator (Kass and Raftery, 1995):

m̂(Z | Hm) = ( (1/T) Σ_{t=1}^{T} f(Z | θm,t)^{−1} )^{−1},  (40)

where θm,t for t = 1, . . . , T denotes a sample from the posterior distribution g(θm | Z, Hm). According to Kass and Raftery (1995) it is more stable
than (39). However, the presence (or absence) of values of θ m,t with a small
f (Z | θ m,t ) in the sample at hand has a large effect on m̂(·). The consequence
is that the harmonic mean estimator should only be used if the model at hand contains only a few parameters and is well-behaved (e.g. a uni-modal likelihood function). For more complicated models one of the methods referred to at the beginning of this section should be used.
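A sketch of (40), computed on the log scale for numerical stability and checked on a toy conjugate model where the marginal likelihood is known exactly (all names and numbers are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def log_harmonic_mean(loglik):
    """Harmonic mean estimate (40) of log m(Z | H), computed on the
    log scale: log m-hat = log T - logsumexp(-loglik)."""
    a = -np.asarray(loglik)
    m = a.max()
    return np.log(a.size) - (m + np.log(np.exp(a - m).sum()))

# toy conjugate check: y ~ N(mu, 1) with prior mu ~ N(0, 1); then the
# exact marginal likelihood of a single y is N(y | 0, 2), and the
# exact posterior is N(y / 2, 1 / 2)
y = 0.7
mu_draws = rng.normal(y / 2.0, np.sqrt(0.5), 200000)
loglik = -0.5 * np.log(2.0 * np.pi) - 0.5 * (y - mu_draws) ** 2
estimate = log_harmonic_mean(loglik)
exact = -0.5 * np.log(2.0 * np.pi * 2.0) - y ** 2 / 4.0
```

In this well-behaved one-parameter model the estimate is close to the exact value; in models with many parameters or heavy-tailed likelihood ratios the estimator can be far less stable, as warned above.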
Here only the estimator of BF_{1m} will be presented. Note that m = 1 denotes the encompassing (unconstrained) model and m = 2, . . . , M denote inequality constrained models nested in H1. Using (32) and (27) it can be seen that:

BF_{1m} = m(Z | H1) / m(Z | Hm) = [ f(Z | θ1) h(θ1 | H1) / g(θ1 | Z, H1) ] / [ f(Z | θm) h(θm | Hm) / g(θm | Z, Hm) ].  (41)
From (8) it follows that h(θ 1 | H1 ) = cm × h(θ m | Hm ) and g(θ 1 | Z, H1 ) =
dm × g(θ m | Z, Hm ) for any value of θ m in both the encompassing and the
constrained model (Klugkist, Kato and Hoijtink, 2005). This reduces the
Bayes factor to
BF_{1m} = cm / dm,  (42)
where cm denotes the proportion of the encompassing prior in agreement with the prior of constrained model m, and dm the proportion of the encompassing posterior in agreement with the constrained posterior of model m. Using a sufficiently large Gibbs sample from h(θ1 | H1) and g(θ1 | Z, H1), the proportion of each sample in agreement with model m provides cm and dm, respectively. Posterior probabilities for each of the models under
investigation can then be obtained using
P(Hm | Z) = (1/BF_{1m}) / (1/BF_{11} + 1/BF_{12} + · · · + 1/BF_{1M}).  (43)
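As a sketch, for two group means with encompassing model H1 and the constrained model H2: µ1 > µ2; the simulated data, the N(0, 10) encompassing prior and the known error variance of 1 are all assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

n, S = 25, 100000
y1 = rng.normal(0.6, 1.0, n)         # group 1, error variance known (= 1)
y2 = rng.normal(0.0, 1.0, n)         # group 2

# encompassing prior H1: mu_1, mu_2 independent N(0, 10)
prior = rng.normal(0.0, np.sqrt(10.0), size=(S, 2))

# conjugate posterior of each mean under the encompassing model H1
def posterior_draws(y):
    var = 1.0 / (1.0 / 10.0 + n)     # posterior variance
    return rng.normal(var * y.sum(), np.sqrt(var), S)

post = np.column_stack([posterior_draws(y1), posterior_draws(y2)])

# H2: mu_1 > mu_2
c = np.mean(prior[:, 0] > prior[:, 1])   # prior mass, about .5
d = np.mean(post[:, 0] > post[:, 1])     # posterior mass
BF_12 = c / d                            # equation (42)
# posterior probabilities of H1 and H2 via (43), with BF_11 = 1
probs = np.array([1.0, 1.0 / BF_12]) / (1.0 + 1.0 / BF_12)
```

Note that only the unconstrained prior and posterior need to be sampled; the constrained models enter solely through the proportions c and d.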

Table 5: Posterior Probabilities for Four Models for the Self-Esteem Data

  Model                               Post. Prob.
  H1a: {µhl, µhh} < {µll, µlh}        .00
  H1b: {µlh, µhh} < {µll, µhl}        .40
  H1c: µhh < {µhl, µlh} < µll         .60
  H0:  µll ≈ µlh ≈ µhl ≈ µhh          .00

5.4 Example
In the introduction of this chapter the self-esteem data were introduced. The
four hypotheses that were specified for these data are listed in Table 5. As
can be seen, the hypothesis that the four means are equal is replaced by the
hypothesis that the four means are about equal. The main reason for this
substitution is that (42) is not defined for models in which two or more of
the parameters are exactly equal. Another reason is that the traditional null-hypothesis does not always describe a state of affairs in the population that is relevant for the research project at hand. See, for example, Cohen (1994) for an elaboration of this point of view. In these situations the traditional
null-hypothesis can be replaced by a hypothesis that states that the four
means are about equal, where about equal is operationalized as:

|µg − µg′| < ε for g, g′ = ll, lh, hl, hh.  (44)

Further motivation for this choice can be found in Berger and Delampady
(1987). For the computation of the posterior probabilities presented in Table
5, ε = .1. This number is about one quarter of the posterior standard error of the means if (1) is used to analyze the self-esteem data without constraints
on the parameters. Results in Berger and Delampady (1987) suggest that
use of such small values of ε in (44) renders results that are similar to using
ε = 0, that is, using exact equality constraints. As can be seen, the data
provide support for H1b and H1c , and not for H1a and H0 . Given posterior
probabilities of .40 and .60 for H1b and H1c, respectively, it is hard to decide which hypothesis is the best. Choosing H1c implies that the probability of incorrectly rejecting H1b is .40, which is a rather large conditional error probability. It is much more realistic to acknowledge that both models have their merits, or to use a technique called model averaging (Hoeting, Madigan, Raftery and Volinsky, 1999) which can, loosely speaking, be used to combine

both models. Whatever method is used, looking at Table 5 it can be seen
that both models agree that µhh < µll . However, there is disagreement about
the position of µlh and µhl . It is interesting to see (note that the EAP of
β was about zero for all models under investigation) that the restrictions
of both H1b and H1c are in agreement with the observed means in Table 1.
Probably H1c has a higher posterior probability than H1b because it contains
one more inequality constraint, that is, it is a smaller model and thus the
implicit penalty for model size in the marginal likelihood is smaller.

6 Further Reading
This chapter provided an introduction, discussion and illustration of Bayesian
estimation using the Gibbs sampler, model checking using posterior predic-
tive inference, and model selection using posterior probabilities. As noted
before, I consider these to be the most important components of Bayesian
data analysis. Below I will shortly discuss other important components of
Bayesian data analysis that did not receive attention in this chapter.
Hierarchical modelling (Gill, 2002, Chapter 10; Gelman, Carlin, Stern
and Rubin, 2004, Chapter 5; Lee, 1997, Chapter 8) is an important Bayesian
tool for model construction. Consider, for example, a sample of g = 1, . . . , G
schools, where within each school the IQ (denoted by yi|g) of i = 1, . . . , N children is measured. For g = 1, . . . , G it can be assumed that yi|g ∼ N(yi|g | µg, 15).
A hierarchical model is obtained if it is assumed that the µg have a common
distribution: µg ∼ N (µg | µ, σ 2 ). For µ and σ 2 a so-called hyper prior has
to be specified, e.g., h(µ, σ 2 ) ∼ N (µ | µ0 , τ02 )Inv-χ2 (σ 2 | ν0 , λ20 ). This setup
renders the joint posterior distribution of µ1 , . . . , µG , µ and σ 2 as:
g(µ1, . . . , µG, µ, σ² | y1, . . . , yG) ∝ [ Π_{g=1}^{G} ( Π_{i=1}^{N} N(yi|g | µg, 15) ) N(µg | µ, σ²) ] N(µ | µ0, τ0²) Inv-χ²(σ² | ν0, λ0²).  (45)
Using a data augmented Gibbs sampler this posterior is easily sampled, iterating across the following two steps:
• Data augmentation: for g = 1, . . . , G sample µg from

g(µg | µ, σ², y1|g, . . . , yN|g) ∝ [ Π_{i=1}^{N} N(yi|g | µg, 15) ] N(µg | µ, σ²).  (46)

• Sample µ and σ² from

g(µ, σ² | µ1, . . . , µG) ∝ [ Π_{g=1}^{G} N(µg | µ, σ²) ] N(µ | µ0, τ0²) Inv-χ²(σ² | ν0, λ0²).  (47)

As illustrated in this chapter, this sample can be used for estimation, model
checking and model selection.
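For concreteness, the two steps above can be sketched in code. The sketch below is my own minimal illustration, not the chapter's implementation: it treats the 15 in N(yi|g | µg, 15) as the known within-school variance, splits the draw from (47) into two conditional draws (µ given σ², then σ² given µ, which is convenient because the priors on µ and σ² are independent), and uses arbitrary hyperparameter defaults.

```python
import numpy as np

def gibbs_hierarchical(y, s2=15.0, mu0=100.0, tau02=100.0,
                       nu0=1.0, lam02=100.0, iters=2000, seed=0):
    """Data-augmented Gibbs sampler for the hierarchical IQ model.

    y is a (G, N) array of observations; s2 is the known within-school
    variance (the 15 in the chapter's notation, assumed to be a variance);
    mu0, tau02, nu0, lam02 are illustrative hyperprior values.
    Returns arrays with the sampled values of mu and sigma^2.
    """
    rng = np.random.default_rng(seed)
    G, N = y.shape
    mu, sig2 = mu0, lam02                  # arbitrary starting values
    ybar = y.mean(axis=1)                  # per-school means
    mu_draws, sig2_draws = [], []
    for _ in range(iters):
        # Step 1 (data augmentation): mu_g | mu, sig2, data is normal,
        # combining the N likelihood terms with the N(mu, sig2) prior.
        prec = N / s2 + 1.0 / sig2
        mean = (N * ybar / s2 + mu / sig2) / prec
        mug = rng.normal(mean, np.sqrt(1.0 / prec))   # one draw per school
        # Step 2a: mu | mu_1..mu_G, sig2 is normal.
        prec_mu = G / sig2 + 1.0 / tau02
        mean_mu = (mug.sum() / sig2 + mu0 / tau02) / prec_mu
        mu = rng.normal(mean_mu, np.sqrt(1.0 / prec_mu))
        # Step 2b: sig2 | mu_1..mu_G, mu is scaled inverse-chi-square,
        # sampled here as (nu0*lam02 + SS) divided by a chi-square draw.
        ss = ((mug - mu) ** 2).sum()
        sig2 = (nu0 * lam02 + ss) / rng.chisquare(nu0 + G)
        mu_draws.append(mu)
        sig2_draws.append(sig2)
    return np.array(mu_draws), np.array(sig2_draws)
```

After discarding a burn-in, the draws of µ and σ² can be summarized by their means and quantiles, exactly as in the estimation examples earlier in the chapter.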
In Section 4 posterior predictive inference was presented. The interested
reader is referred to Box (1980), who discusses prior predictive inference.
Prior predictive inference is obtained if in Figure 4 the posterior distribution
g(θ0 | Z, H0) is replaced by the prior distribution h(θ0 | H0). See Gelman,
Meng and Stern (1996) for a comparison of both methods.
Besides posterior probabilities there are other Bayesian methods that
can be used for model selection. The Bayesian information criterion (BIC;
Kass and Raftery, 1995; Gill, 2002, pp. 223-224) is an approximation of
−2 log m(Z | Hm) that is similar to the CAIC: −2 log f(Z | θ̂m) + P log N.
The deviance information criterion (DIC; Spiegelhalter, Best, Carlin and van
der Linde, 2002) is an information criterion that can be computed using a
sample of parameter vectors from g(θm | Z, Hm). As with the marginal likeli-
hood, the penalty for model complexity does not have to be specified in terms
of the number of parameters, but is determined using "the mean of the
deviances minus the deviance of the mean" as a measure of the size of the
parameter space. The posterior predictive L-criterion (Laud and Ibrahim, 1995;
Gelfand and Ghosh, 1998) is a measure of the distance between the observed data
and the posterior predictive distribution of the data for each model under inves-
tigation. It can be used to select the model that best predicts the observed
data in terms of the specific L-criterion chosen.
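The "mean of the deviances minus the deviance of the mean" computation is easy to illustrate. The sketch below is my own toy example, not taken from the chapter: a normal model with known variance and a single unknown mean, where the effective number of parameters pD should come out close to 1. The posterior draws are generated analytically as a stand-in for Gibbs output.

```python
import numpy as np

rng = np.random.default_rng(1)

def deviance(y, mu, sigma2):
    """-2 times the log likelihood of y under N(mu, sigma2), sigma2 known."""
    n = len(y)
    return n * np.log(2 * np.pi * sigma2) + ((y - mu) ** 2).sum() / sigma2

# Toy data for a N(mu, 1) model with only mu unknown.
y = rng.normal(0.3, 1.0, size=50)
# Stand-in for Gibbs output: the exact posterior of mu under a flat prior.
mu_draws = rng.normal(y.mean(), 1.0 / np.sqrt(len(y)), size=5000)

devs = np.array([deviance(y, m, 1.0) for m in mu_draws])
mean_dev = devs.mean()                            # mean of the deviances
dev_at_mean = deviance(y, mu_draws.mean(), 1.0)   # deviance of the mean
p_D = mean_dev - dev_at_mean    # effective number of parameters, approx. 1
dic = mean_dev + p_D            # DIC = mean deviance + complexity penalty
# The BIC-style penalty, in contrast, uses the parameter count directly:
bic = dev_at_mean + 1 * np.log(len(y))            # P log N with P = 1
```

In contrast with the BIC's explicit P log N term, pD is estimated from the posterior sample itself, which is what makes the DIC usable for hierarchical models where the number of free parameters is ambiguous.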

References
Akaike, H. (1987). Factor analysis and AIC. Psychometrika, 52, 317-332.

Bayarri, M.J. and Berger, J.O. (2000). P-values for composite null models.
Journal of the American Statistical Association, 95, 1127-1142.

Berger, J.O. and Delampady, M. (1987). Testing precise hypotheses. Statistical
Science, 3, 317-352.

Berger, J.O. and Pericchi, L. (1996). The intrinsic Bayes factor for model
selection and prediction. Journal of the American Statistical Association,
91, 109-122.

Berger, J.O. and Pericchi, L. (2004). Training samples in objective Bayesian
model selection. Annals of Statistics, 32, 841-869.

Box, G.E.P. (1980). Sampling and Bayesian inference in scientific modelling
and robustness. Journal of the Royal Statistical Society, Series A,
143, 383-430.

Bozdogan, H. (1987). Model selection and Akaike's information criterion
(AIC): The general theory and its analytic extensions. Psychometrika,
52, 345-370.

Chib, S. and Greenberg, E. (1995). Understanding the Metropolis-Hastings
algorithm. The American Statistician, 49, 327-335.

Cohen, J. (1994). The earth is round (p < .05). American Psychologist,
49, 997-1003.

Cowles, M.K. and Carlin, B.P. (1996). Markov chain Monte Carlo convergence
diagnostics: a comparative review. Journal of the American Statistical
Association, 91, 883-904.

Dawid, A.P. and Lauritzen, S.L. (2000). Compatible prior distributions.
In: E.I. George (Ed.), Bayesian Methods with Applications to Science Policy
and Official Statistics. Selected Papers from ISBA 2000: The Sixth
World Meeting of the International Society for Bayesian Analysis, pp.
109-118.

Gelfand, A.E. and Ghosh, S.K. (1998). Model choice: a minimum posterior
predictive loss approach. Biometrika, 85, 1-11.

Gelman, A., Carlin, J.B., Stern, H.S. and Rubin, D.B. (2004). Bayesian
Data Analysis. London: Chapman and Hall.

Gelman, A., Meng, X.L. and Stern, H. (1996). Posterior predictive assess-
ment of model fitness via realized discrepancies. Statistica Sinica, 6,
733-807.

Gill, J. (2002). Bayesian Methods: A Social and Behavioral Sciences
Approach. London: Chapman and Hall.

Hoeting, J.A., Madigan, D., Raftery, A.E. and Volinsky, C.T. (1999). Bayesian
model averaging: a tutorial. Statistical Science, 14, 382-417.

Hoijtink, H. (2000). Posterior inference in the random intercept model
based on samples obtained with Markov chain Monte Carlo methods.
Computational Statistics, 3, 315-336.

Howson, C. (2002). Bayesianism in statistics. In: R. Swinburne (Ed.),
Bayes Theorem, pp. 39-69. Oxford: Oxford University Press.

Jefferys, W. and Berger, J. (1992). Ockham's razor and Bayesian analysis.
American Scientist, 80, 64-72.

Kass, R.E. and Raftery, A.E. (1995). Bayes factors. Journal of the American
Statistical Association, 90, 773-795.

Kass, R.E. and Wasserman, L. (1996). The selection of prior distributions
by formal rules. Journal of the American Statistical Association, 91,
1343-1370.

Kato, B.S. and Hoijtink, H. (2004). Testing homogeneity in a random inter-
cept model using asymptotic, posterior predictive and plug-in p-values.
Statistica Neerlandica, 58, 179-196.

Kato, B.S. and Hoijtink, H. (2006). A Bayesian approach to inequality
constrained hierarchical models: estimation and model selection. Statistical
Modelling, 6, 1-19.

Klugkist, I., Laudy, O. and Hoijtink, H. (2005). Inequality constrained
analysis of variance: a Bayesian approach. Psychological Methods, 10,
477-493.

Klugkist, I., Kato, B. and Hoijtink, H. (2005). Bayesian model selection
using encompassing priors. Statistica Neerlandica, 59, 57-69.

Laud, P. and Ibrahim, J. (1995). Predictive model selection. Journal of the
Royal Statistical Society, Series B, 57, 247-262.

Laudy, O. and Hoijtink, H. (2006). Bayesian methods for the analysis of in-
equality constrained contingency tables. Statistical Methods in Medical
Research, 15, 1-16.

Lavine, M. and Schervish, M.J. (1999). Bayes factors: what they are and
what they are not. The American Statistician, 53, 119-122.

Lee, P.M. (1997). Bayesian Statistics: An Introduction. London: Arnold.

Leucari, V. and Consonni, G. (2003). Compatible priors for causal Bayesian
networks. In: J.M. Bernardo, M.J. Bayarri, J.O. Berger, A.P. Dawid,
D. Heckerman, A.F.M. Smith and M. West (Eds.), Bayesian Statistics 7,
pp. 597-606. Oxford: Clarendon Press.

Lindley, D.V. (1957). A statistical paradox. Biometrika, 44, 187-192.

Martin, A.D. and Quinn, K.M. (2005). MCMCpack: Markov chain Monte
Carlo (MCMC) package. R package version 0.6-3. URL
http://mcmcpack.wustl.edu.

Meng, X.L. (1994). Posterior predictive p-values. The Annals of Statistics,
22, 1142-1160.

Perez, J.M. and Berger, J.O. (2002). Expected-posterior prior distributions
for model selection. Biometrika, 89, 491-511.

Robert, C.P. and Casella, G. (2004). Monte Carlo Statistical Methods. New
York: Springer.

Robins, J.M., van der Vaart, A. and Ventura, V. (2000). Asymptotic distri-
bution of p-values in composite null models. Journal of the American
Statistical Association, 95, 1143-1156.

Roverato, A. and Consonni, G. (2004). Compatible prior distributions for
DAG models. Journal of the Royal Statistical Society, Series B, 66,
47-62.

Schafer, J.L. (1997). Analysis of Incomplete Multivariate Data. London:
Chapman and Hall.

Smith, A.F.M. and Spiegelhalter, D.J. (1980). Bayes factors and choice cri-
teria for linear models. Journal of the Royal Statistical Society, Series
B, 42, 213-220.

Spiegelhalter, D., Thomas, A., Best, N. and Lunn, D. (2004). WinBUGS,
version 1.4.1. URL http://www.mrc-bsu.cam.ac.uk/bugs/winbugs/.

Spiegelhalter, D.J., Best, N.G., Carlin, B.P. and van der Linde, A. (2002).
Bayesian measures of model complexity and fit. Journal of the Royal
Statistical Society, Series B, 64, 583-639.

Tabachnick, B.G. and Fidell, L.S. (1996). Using Multivariate Statistics.
New York: Harper Collins.

Tanner, M.A. and Wong, W.H. (1987). The calculation of posterior dis-
tributions by data augmentation. Journal of the American Statistical
Association, 82, 528-550.

Thomas, A. (2004). OpenBUGS. URL http://mathstat.helsinki.fi.openbugs/.

Tiao, G.C. and Xu, D. (1993). Robustness of maximum likelihood es-
timates for multi-step predictions: the exponential smoothing case.
Biometrika, 80, 623-641.

Tierney, L. (1998). A note on the Metropolis-Hastings algorithm for general
state spaces. Annals of Applied Probability, 8, 1-9.

Zeger, S.L. and Karim, M.R. (1991). Generalized linear models with random
effects: a Gibbs sampling approach. Journal of the American Statistical
Association, 86, 79-86.