This series provides applied researchers and students with analysis and research design
books that emphasize the use of methods to answer research questions. Rather than
emphasizing statistical theory, each volume in the series illustrates when a technique should
(and should not) be used and how the output from available software programs should (and
should not) be interpreted. Common pitfalls as well as areas of further development are
clearly articulated.
Series Editor's Note
Anytime I see one of my series authors put together a second edition it’s like
falling in love again because I know two things: It’s a labor of love for the
author and you become even more enriched than the first time around. David
Kaplan’s second edition is simply lovely. As I said in the first edition, Kaplan is
in a very elite class of scholar. He is a methodological innovator who is guid-
ing and changing the way that researchers conduct their research and ana-
lyze their data. He is also a distinguished educational researcher whose work
shapes educational policy and practice. I see David Kaplan’s book as a reflec-
tion of his sophistication as both a researcher and statistician; it shows depth
of understanding that even dedicated quantitative specialists may not have
and, in my view, it will have an enduring impact on research practice. Kaplan’s
research profile and research skills are renowned internationally and his repu-
tation is globally recognized. His profile as a prominent power player in the
field brings instant credibility. As a result, when Kaplan says Bayesian is the
way to go, researchers listen. As with the first edition, his book brings his voice
to you in an engaging and highly serviceable manner.
Why is the Bayesian approach to statistics seeing a resurgence across the
social and behavioral sciences (it’s an approach that has been around for some
time)? One reason for the delay in adopting Bayes is technological. Bayesian
estimation can be computer intensive and, until about a score of years ago, the
computational demands limited the widespread application. Another reason
is that the social and behavioral sciences needed an accessible translation of
Bayes for these fields so that we could understand not only the benefits of
Bayes but also how to apply a Bayesian approach. Kaplan is clear and practical
in his presentation and shares with us his experiences and helpful/pragmatic
recommendations. I think a Bayesian perspective will continue to see wide-
spread usage now that David has updated and expanded upon this indispens-
able resource. In many ways, the zeitgeist for Bayes is still favorable given that
researchers are asking and attempting to answer more complex questions. This
second edition provides researchers with the means to address well the intri-
cate nuances of applying a Bayesian perspective to test intertwined theory-
driven hypotheses.
This second edition brings a wealth of new material that builds nicely on
what was already a thorough and formidable foundation. Kaplan uses the R
interface to Stan that provides a fast and stable software environment, which is
great news because the inadequacies of other software environments were an
impediment to adopting a Bayesian approach. Each of the prior chapters has
been expanded with new material. For example, Chapter 1 adds a discussion
of coherence, Dutch book bets, and the calibration of probability assessments.
In Chapter 2, there is an extended discussion of prior distributions, which is at
the heart of Bayesian estimation. Chapter 3 continues with coverage of Jeffreys’
prior and the LKJ prior for correlation matrices. In Chapter 4, the different
algorithms utilized in the Stan software platform are explained, including the
Metropolis-Hastings algorithm, the Hamiltonian Monte Carlo algorithm, and
the No-U-Turn sampler, as well as an updated discussion of convergence diag-
nostics.
Other chapters have new material such as new missing data material on
the problem of model uncertainty in multiple imputation, expanded coverage
of continuous and categorical latent variables, factor analysis, and latent class
analysis, as well as coverage of multinomial, Poisson, and negative binomial
regression. In addition, the important topics of model evaluation and model
comparison are given their own chapter in the second edition. New chapters
on other critical topics have been added—including variable selection and
sparsity, the Bayesian decision theory framework to explain model averag-
ing, and the method of Bayesian stacking as a means of combining predictive
distributions—and a remarkably insightful chapter on Bayesian workflow for
statistical modeling in the social sciences. All lovely additions to the second
edition, which, as in the first edition, was already a treasure trove of all things
Bayesian. As always, enjoy!
Todd D. Little
At My Wit’s End in Montana (the name of my home)
Preface to the Second Edition
Since the publication of the first edition of Bayesian Statistics for the Social
Sciences in 2014, Bayesian statistics is, arguably, still not the norm in the
formal quantitative methods training of social scientists. Typically, the only
introduction that a student might have to Bayesian ideas is a brief overview
of Bayes’ theorem while studying probability in an introductory statistics
class. This is not surprising. First, until relatively recently, it was not feasible
to conduct statistical modeling from a Bayesian perspective owing to its
complexity and lack of available software. Second, Bayesian statistics rep-
resents a powerful alternative to frequentist (conventional) statistics, and,
therefore, can be controversial, especially in the context of null hypothesis
significance testing.1 However, over the last 20 years or so, considerable
progress has been made in the development and application of complex
Bayesian statistical methods, due mostly to developments and availability
of proprietary and open-source statistical software tools. And, although
Bayesian statistics is not quite yet an integral part of the quantitative train-
ing of social scientists, there has been increasing interest in the application
of Bayesian methods, and it is not unreasonable to say that in terms of
theoretical developments and substantive applications, Bayesian statistics
has arrived.
Because of extensive developments in Bayesian theory and computa-
tion since the publication of the first edition of this book, I felt there was a
pressing need for a thorough update of the material to reflect new devel-
opments in Bayesian methodology and software. The basic foundations
of Bayesian statistics remain more or less the same, but this second edi-
tion encompasses many new extensions and so the order of the chapters
has changed in some instances, with some chapters heavily revised, some
chapters updated, and some chapters containing all new material.
1. We will use the term frequentist to describe the paradigm of statistics commonly used today, which represents the counterpart to the Bayesian paradigm of statistics. Historically, however, Bayesian statistics predates frequentist statistics by about 150 years.
Data Sources
As in the first edition, the examples provided will primarily utilize large-scale assessment data, and in particular data from the OECD Program for International Student Assessment (PISA).
Software
For this edition, I will demonstrate Bayesian concepts and provide applica-
tions using primarily the Stan (Stan Development Team, 2021a) software
program and its R interface RStan (Stan Development Team, 2020). Stan is
a high-level probabilistic programming language written in C++. Stan is
named after Stanislaw Ulam, one of the major developers of Monte Carlo
methods. With Stan, the user can specify log density functions and, of relevance to this book, obtain fully Bayesian inference through Hamiltonian Monte Carlo and the No-U-Turn algorithm (discussed in Chapter 4).
In some cases, other interfaces for Stan, such as rstanarm (Goodrich, Gabry,
Ali, & Brilleman, 2022) and brms (Bürkner, 2021), will be used. These pro-
grams also call in other programs, and for cross-validation, we will be
using the loo program (Vehtari, Gabry, Yao, & Gelman, 2019). For peda-
gogical purposes, I have written the code to be as explicit as possible. The
Stan programming language is quite flexible, and many different ways
of writing the same code are possible. However, it should be emphasized
that this book is not a manual on Bayesian inference with Stan. For more
information on the intricacies of the Stan programming language, see Stan
Development Team (2021a).
Finally, all code will be integrated into the text and fully annotated. In
addition, all software code in the form of R files and data can be found on
the Guilford companion website. Note that due to the probabilistic nature
of Bayesian statistical computing, a reanalysis of the examples may not
yield precisely the same numerical results as found in the book.
Philosophical Stance
In the previous edition of this book, I wrote a full chapter on various
philosophical views underlying Bayesian statistics, including subjective
versus objective Bayesian inference, as well as a position I took arguing
for an evidence-based view of subjective Bayesian statistics. As a good
Bayesian, I have updated my views since that time, and in the interest of
space, I would rather add more practical material and concentrate less on
philosophical matters. However, whether one likes it or not, the applica-
tion of statistical methods, Bayesian or otherwise, betrays a philosophical
stance, and it may be useful to know the philosophical stance that encom-
passes this second edition. In particular, my position regarding an evidence-
based view of Bayesian modeling remains more or less unchanged, but my
updated view is also consistent with that of Gelman and Shalizi (2013) and
summarized in Haig (2018), namely, a neo-Popperian view that Bayesian
statistical inference is fundamentally (or should be) deductive in nature
and that the "usual story" of Bayesian inference, characterized by updating knowledge inductively from priors to posteriors, is probably a fiction, at least with respect to typical statistical practice (Gelman & Shalizi, 2013, p. 8).
To be clear, the philosophical position that I take in this book can be
summarized by five general points. First, statistical modeling takes place
in a state of pervasive uncertainty. This uncertainty is inherently epistemic
in that it is our knowledge about parameters and models that is imperfect.
Attempting to address this uncertainty is valuable insofar as it impacts
one’s findings whether it is directly addressed or not. Second, parameters
and models, by their very definition, are unknown quantities and the only
language we have for expressing our uncertainty about parameters and
models is probability. Third, prior distributions encode our uncertainty by
quantifying our current knowledge and assumptions about the parameters
and models of interest through the use of continuous or categorical prob-
ability distributions. Fourth, our current knowledge and assumptions are
propagated via Bayes’ theorem to the posterior distribution which provides
a rich way to describe results and to test models for violations through the
use of posterior predictive checking via the posterior predictive distribu-
tion. Fifth, posterior predictive checking provides a way to probe deficien-
cies in a model, both globally and locally, and while I may hold a more
sanguine view of model averaging than Gelman and Shalizi (2013), I con-
cur that posterior predictive checking is an essential part of the Bayesian
workflow (Gelman et al., 2020) for both explanatory and predictive uses of
models.
Target Audience
Positioning a book for a particular audience is always a tricky process. The
goal is to first decide on the type of reader one hopes to attract, and then to
continuously keep that reader in mind when writing the book. For this edi-
tion, the readers I have in mind are advanced graduate students or research-
ers in the social sciences (e.g., education, psychology, and sociology) who
are either focusing on the development of quantitative methods in those
areas or who are interested in using quantitative methods to advance sub-
stantive theory in those areas. Such individuals would be expected to have
good foundational knowledge of the theory and application of regression
analysis in the social sciences and have had some exposure to mathematical
statistics and calculus. It would also be expected that such readers would
have had some exposure to methodologies that are now widely applied to
social science data, in particular multilevel models and latent variable mod-
els. Familiarity with R would also be expected, but it is not assumed that
the reader would have knowledge of Stan. It is not expected that readers
would have been exposed to Bayesian statistics, but at the same time, this
is not an introductory book. Nevertheless, given the presumed background
knowledge, the fundamental principles of Bayesian statistical theory and
practice are self-contained in this book.
Acknowledgments
I would like to thank the many individuals in the Stan community (https://discourse.mc-stan.org) who patiently and kindly answered many questions that I had regarding the implementation of Stan.
I would also like to thank the reviewers who were initially anonymous:
Rens van de Schoot, Department of Methodology and Statistics, Utrecht
University; Irini Moustaki, Department of Statistics, London School of Eco-
nomics and Political Science; Insu Paek, Senior Scientist, Human Resources
Research Organization; and David Rindskopf, Departments of Educational
Psychology and Psychology, The Graduate Center, The City University of
New York. All of these scholars’ comments have greatly improved the qual-
ity and accessibility of the book. Of course, any errors of commission or
omission are strictly my responsibility.
I am indebted to my editor C. Deborah Laughton. I say, “my editor”
because C. Deborah not only edited this edition and the previous edi-
tion, but also the first edition of my book on structural equation modeling
(Kaplan, 2000) and my handbook on quantitative methods in the social sci-
ences (Kaplan, 2004) when she was editor at another publishing house. My
loyalty to C. Deborah stems from my first-hand knowledge of her extraordinary skill and dedication as an editor.
PART I. FOUNDATIONS
1 • PROBABILITY CONCEPTS AND BAYES' THEOREM
1.1 Relevant Probability Axioms
1.1.1 The Kolmogorov Axioms of Probability
1.1.2 The Rényi Axioms of Probability
1.2 Frequentist Probability
1.3 Epistemic Probability
1.3.1 Coherence and the Dutch Book
1.3.2 Calibrating Epistemic Probability Assessments
1.4 Bayes' Theorem
1.4.1 The Monty Hall Problem
1.5 Summary
REFERENCES
FOUNDATIONS

1

Probability Concepts and Bayes' Theorem
1. p(A) ≥ 0.
2. The probability of the sample space is 1.0.
3. Countable additivity: If A and B are mutually exclusive, then p(A or B) ≡ p(A ∪ B) = p(A) + p(B). Or, more generally, for mutually exclusive events A_1, A_2, . . .,

p(⋃_{j=1}^∞ A_j) = Σ_{j=1}^∞ p(A_j)  (1.1)
A number of other axioms of probability can be derived from these three basic
axioms. Nevertheless, these three axioms can be used to deal with the relatively
easy case of the coin flipping example mentioned above. For example, if we toss a
fair coin an infinite number of times, we expect it to land heads 50% of the time.1
This probability, and others like it, satisfy the first axiom that probabilities must be
greater than or equal to zero. The second axiom states that over an infinite number
of coin flips, the sum of all possible outcomes (in this case, heads and tails) is equal
to one. Indeed, the number of possible outcomes represents the sample space, and
the sum of probabilities over the sample space is one. Finally, with regard to
the third axiom, assuming that one outcome precludes the occurrence of another
outcome (e.g., the coin landing heads precludes the occurrence of the coin landing
tails), then the probability of the joint event p(A ∪ B) is the sum of the separate
probabilities, that is, p(A ∪ B) = p(A) + p(B).
We may wish to add to these three axioms a fourth axiom that deals with the
notion of independent events. If two events are independent, then the occurrence of
one event does not influence the probability of another event. For example, with
two coins A and B, the probability of A resulting in “heads,” does not influence the
result of a flip of B. Formally, we define independence as p(A and B) ≡ p(A ∩ B) =
p(A)p(B). The notion that the joint probability of independent events is simply the product of the individual probabilities plays a critical role in the derivation of Bayes' theorem.
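These axioms, and the definition of independence, can be checked directly by enumerating a small sample space. The following is an illustrative sketch (not from the book) using two fair coins:

```python
from itertools import product
from fractions import Fraction

# Enumerate the sample space for two fair coin flips; each of the
# four outcomes (H,H), (H,T), (T,H), (T,T) has probability 1/4.
space = list(product("HT", repeat=2))
prob = Fraction(1, len(space))

p_A = sum(prob for a, b in space if a == "H")             # coin A lands heads
p_B = sum(prob for a, b in space if b == "H")             # coin B lands heads
p_A_and_B = sum(prob for a, b in space if a == "H" and b == "H")

print(p_A, p_B, p_A_and_B)       # 1/2 1/2 1/4
print(p_A_and_B == p_A * p_B)    # True: p(A ∩ B) = p(A)p(B)

# Axiom 3: "A heads" and "A tails" are mutually exclusive, so their
# probabilities add; Axiom 2: the whole sample space has probability 1.
p_A_tails = sum(prob for a, b in space if a == "T")
print(p_A + p_A_tails == 1)      # True
```

Exact rational arithmetic via `fractions` avoids any floating-point ambiguity in checking the axioms.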
p(C | S) = p(C ∩ S) / p(S)  (1.2)
1
Interestingly, this expectation is not based on having actually tossed the coin an infinite
number times. Rather, this expectation is a prior belief, and arguably, this is one example
of how Bayesian thinking is automatically embedded in frequentist logic.
The denominator on the right-hand side of Equation (1.2) shows that the sample
space associated with p(C ∩ S) is reduced by knowing S. Notice that if C and S
were independent, then
p(C | S) = p(C ∩ S) / p(S) = p(C)p(S) / p(S) = p(C)  (1.3)
Neyman–Pearson hypothesis testing (Neyman & Pearson, 1928) and Fisherian statistics (e.g., Fisher, 1941/1925) are based on the conception of probability as long-run frequency.
Our conclusions regarding null and alternative hypotheses presuppose the idea
that we could conduct the same study an infinite number of times under perfectly
reproducible conditions. Moreover, our interpretation of confidence intervals also
assumes a fixed parameter with the confidence intervals varying over an infinitely
large number of identical studies.
A bettor whose probabilities violate the axioms can be offered a sequence of bets in which they are guaranteed to lose regardless of the outcome. In other words, the epistemic beliefs of the bettor do not cohere with the axioms of probability. This type of bet is referred to as a Dutch book or lock.
Table 1.1 below shows a sequence of bets that one of the following teams goes
to the World Series.
Consider the first bet in the sequence, namely, that the odds of the Chicago Cubs going to the World Series are even. This is the same as saying that the probability implied by the odds is 0.50. The bookie sets the bet price at $100. If the Cubs do go to the World Series, then the bettor gets back the $100 plus the bet price.
However, this is a sequence of bets that also includes the Red Sox, Dodgers, and
Yankees. Taken together, we see that the implied probabilities sum to greater than
1.0, which is a clear violation of Kolmogorov’s Axiom 2. As a result, the bookie
will pay out $200 regardless of who goes to the World Series while the bettor has
paid $210 for the bet, a net loss of $10.
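Although Table 1.1 itself is not reproduced here, the arithmetic of the Dutch book can be sketched in a few lines. The implied probabilities below are hypothetical values chosen to match the figures in the text ($100 for the Cubs bet, $210 paid in total, $200 paid out):

```python
from fractions import Fraction

# Hypothetical implied probabilities for the four bets (the actual
# Table 1.1 values are not reproduced here); the Cubs bet at even
# odds implies 0.50, and the rest are chosen to match the text.
implied = {
    "Cubs":    Fraction(1, 2),   # even odds
    "Red Sox": Fraction(1, 4),   # 3:1 against
    "Dodgers": Fraction(1, 5),   # 4:1 against
    "Yankees": Fraction(1, 10),  # 9:1 against
}

payout = 200  # every winning ticket returns $200 in total

# The price of each bet is the implied probability times the payout.
prices = {team: p * payout for team, p in implied.items()}
total_paid = sum(prices.values())

print(sum(implied.values()))  # 21/20 = 1.05 > 1: violates Axiom 2
print(total_paid)             # 210
print(total_paid - payout)    # 10: guaranteed loss, whoever wins
```

Because the implied probabilities sum to more than one, the bettor pays more for the book of bets than any single outcome can return.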
A probability assessor is said to be well calibrated if, over the long run, the assigned probability of the outcome matches the actual proportion of times that the outcome occurred (Dawid, 1982).
A scoring rule is a utility function (Gneiting & Raftery, 2007), and the goal
of the assessor is to be honest and provide a forecast that will maximize his/her
utility. The idea of scoring rules is quite general, but one can consider scoring rules
from a subjectivist Bayesian perspective. Here, Winkler (1996) quotes de Finetti
(1962, p. 359):
The scoring rule is constructed according to the basic idea that the
resulting device should oblige each participant to express his true
feelings, because any departure from his own personal probability
results in a diminution of his own average score as he sees it.
Because scoring rules only require the stated probabilities and realized out-
comes, they can be developed for ex-post or ex-ante probability evaluations. Ex-post
probability assessments utilize the existing historical probability assessments to
gauge accuracy whereas ex-ante probability assessments are true forecasts into the
future before the realization of the outcome. However, as suggested by Winkler
(1996), the ex-ante perspective of probability evaluation should lead us to consider
strictly proper scoring rules because these rules are maximized if and only if the
assessor is honest in reporting their scores.
Following the discussion and notation given in Winkler (1996; see also Jose, Nau, & Winkler, 2008), let p ≡ (p_1, . . . , p_n) represent the assessor's epistemic probability distribution of the outcomes of interest, let r ≡ (r_1, . . . , r_n) represent the assessor's reported epistemic probability of the outcomes of interest, and let e_i
represent the probability distribution that assigns a probability of 1 if the event i
occurs and a probability of 0 for all other events. Then, a scoring rule, denoted as
S(r, p), provides the score S(r, e_i) if event i occurs. The expected score obtained
when the assessor reports r when their true distribution is p is
S(r, p) = Σ_i p_i S(r, e_i).  (1.4)
The scoring rule is strictly proper if S(p, p) ≥ S(r, p) for every r and p with equality
when r = p (Jose et al., 2008, p. 1147). We will discuss scoring rules in more detail
in Chapter 11 when we consider model averaging. Suffice it to say here that there
are three popular types of scoring rules.
1. Quadratic scoring rule (Brier score):

S_k(r) = 2r_k − Σ_{i=1}^{n} r_i²  (1.5)
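The strict propriety of the quadratic rule can be checked numerically. In this illustrative sketch (not from the book, and with hypothetical probability vectors), an honest report r = p yields a higher expected score than a hedged report:

```python
def quadratic_score(r, k):
    # Brier-type score S_k(r) = 2 r_k - sum_i r_i^2, awarded when event k occurs
    return 2 * r[k] - sum(ri ** 2 for ri in r)

def expected_score(r, p):
    # Expected score sum_i p_i S(r, e_i) when the assessor's true distribution is p
    return sum(pi * quadratic_score(r, i) for i, pi in enumerate(p))

p = [0.6, 0.3, 0.1]          # the assessor's true (epistemic) probabilities
r_hedged = [0.4, 0.4, 0.2]   # a less-than-honest, hedged report

honest = expected_score(p, p)
hedged = expected_score(r_hedged, p)
print(honest > hedged)       # True: honesty maximizes the expected score
```

Here the honest report earns an expected score of about 0.46 versus about 0.40 for the hedged report, consistent with S(p, p) ≥ S(r, p).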
p(S | C) = p(S ∩ C) / p(C)  (1.8)
Because of the symmetry of the joint probabilities, we obtain Bayes' theorem

p(C | S) = p(S | C)p(C) / p(S)

The denominator, p(S), obtained via the law of total probability, serves as a normalizing constant, ensuring that the conditional probabilities sum to one in line with the second axiom described earlier. Thus, it is not uncommon to see Bayes' theorem written as

p(C | S) ∝ p(S | C)p(C)  (1.11)
Equation (1.11) states that the probability of lung cancer given smoking is proportional to the probability of smoking given lung cancer times the marginal probability of lung cancer.
It is interesting to note that Bayesian reasoning resolves the so-called base-rate fallacy, that is, the tendency to equate p(C | S) with p(S | C). Specifically, without knowledge of the base rate p(C) (the prior probability) and the total amount of evidence in the observation p(S), it is a fallacy to believe that p(C | S) = p(S | C).
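As a quick numerical sketch (the figures below are hypothetical, not from the book), Bayes' theorem shows how different p(C | S) and p(S | C) can be when the base rate is small:

```python
# Hypothetical numbers: suppose 80% of lung cancer patients were
# smokers, 1% of the population has lung cancer, and 30% of the
# population smokes.
p_S_given_C = 0.80   # p(S | C)
p_C = 0.01           # base rate (prior probability)
p_S = 0.30           # total evidence in the observation

# Bayes' theorem: p(C | S) = p(S | C) p(C) / p(S)
p_C_given_S = p_S_given_C * p_C / p_S

print(round(p_C_given_S, 4))   # 0.0267 -- far from p(S | C) = 0.80
```

Equating the two conditional probabilities here would overstate the risk by a factor of about thirty.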
The final probability is due to the fact that there is only one door for Monty to
choose given that the contestant chose door A and the prize is behind door B.
Let M represent Monty opening door B. Then, the joint probabilities can be obtained as follows:

p(M ∩ A) = p(M | A)p(A) = 1/2 × 1/3 = 1/6
p(M ∩ B) = p(M | B)p(B) = 0 × 1/3 = 0
p(M ∩ C) = p(M | C)p(C) = 1 × 1/3 = 1/3
Before applying Bayes' theorem, note that we have to obtain the marginal probability of Monty opening door B. This is

p(M) = 1/6 + 0 + 1/3 = 1/2

Then,

p(A | M) = p(M | A)p(A) / p(M) = (1/2 × 1/3) / (1/2) = 1/3

p(C | M) = p(M | C)p(C) / p(M) = (1 × 1/3) / (1/2) = 2/3
Thus, from Bayes’ theorem, the best strategy on the part of the contestant is to
switch doors. Crucially, this winning strategy is conceived of in terms of long-run
frequency. That is, if the game were played an infinite number of times, then
switching doors would lead to the prize approximately 66% of the time. This is an
example of where long-run frequency can serve to calibrate Bayesian probability
assessments (Dawid, 1982), as we discussed in Section 1.3.2.
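The Monty Hall calculation can be verified with exact arithmetic. The following is an illustrative sketch (not the book's code) using Python's `fractions` module:

```python
from fractions import Fraction

# Prior: the prize is equally likely behind doors A, B, or C.
prior = {d: Fraction(1, 3) for d in "ABC"}

# The contestant picks door A; M is the event "Monty opens door B".
# Monty never opens the contestant's door or the prize door.
p_M_given = {"A": Fraction(1, 2),  # prize behind A: Monty picks B or C at random
             "B": Fraction(0),     # prize behind B: Monty cannot open B
             "C": Fraction(1)}     # prize behind C: Monty must open B

joint = {d: p_M_given[d] * prior[d] for d in "ABC"}
p_M = sum(joint.values())                      # marginal probability of M = 1/2
posterior = {d: joint[d] / p_M for d in "ABC"}

print(posterior["A"], posterior["C"])          # 1/3 2/3: switching wins
```

The exact posterior of 2/3 for door C matches the long-run frequency of roughly 67% that a repeated-play simulation would produce.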
1.5 Summary
This chapter provided a brief introduction to probabilistic concepts relevant to
Bayesian statistical inference. Although the notion of epistemic probability pre-
dates the frequentist conception of probability, it had not significantly impacted the
practice of applied statistics until computational developments brought Bayesian
inference back into the limelight. This chapter also highlighted the conceptual
differences between the frequentist and epistemic notions of probability. The
importance of understanding the differences between these two conceptions of
probability is more than just a philosophical exercise. Rather, their differences
are manifest in the elements of the statistical machinery needed for advancing a
Bayesian perspective for research in the social sciences. We discuss the statistical
elements of Bayes’ theorem in the following chapter.
2
Statistical Elements of Bayes’ Theorem
The material presented thus far concerned frequentist and epistemic conceptions
of probability, leading to Bayes’ theorem. The goal of this chapter is to present
the role of Bayes’ theorem as it pertains specifically to statistical inference. Set-
ting the foundations of Bayesian statistical inference provides the framework for
applications to a variety of substantive problems in the social sciences.
The first part of this chapter introduces Bayes’ theorem using the notation
of random variables and parameters. This is followed by a discussion of the
assumption of exchangeability. Following that, we extend Bayes’ theorem to more
general hierarchical models. Next are three sections that break down the elements
of Bayes’ theorem with discussions of the prior distribution, the likelihood, and
the posterior distribution. The final section introduces the Bayesian central limit
theorem and Bayesian shrinkage.
As an aside, for complex models with many parameters, Equation (2.4) will be
very hard to evaluate, and it is for this reason we need the computational methods
that will be discussed in Chapter 4.
In line with Equation (1.11), the denominator of Equation (2.2) does not involve model parameters, so we can omit the term and obtain the unnormalized posterior distribution:

p(θ | y) ∝ p(y | θ)p(θ)
This leads to a very natural way to conceive of multilevel models, which we will take up in Chapter 7.
p(1, 0, 1, 1, 0, 1, 0, 1, 0, 0) (2.11a)
p(1, 1, 0, 0, 1, 1, 1, 0, 0, 0) (2.11b)
p(1, 0, 0, 0, 0, 0, 1, 1, 1, 1) (2.11c)
We have just presented three possible patterns, but notice that there are 2^10 = 1,024 possible patterns of agreement and disagreement among the 10 students. If
our task were to assign probabilities to all possible outcomes, this could become
Because the right-hand side can be multiplied in any order, it follows that the left-
hand side is symmetric, and hence exchangeable. However, exchangeability does
not necessarily imply iid. A simple example of this idea is the case of drawing balls
from an urn without replacement (Suppes, 1986, p. 348). Specifically, suppose we
have an urn containing one red ball and two white balls and we are told to draw
one ball out without replacement. Then,
y_i = 1 if the ith ball is red, and 0 otherwise  (2.13)
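The urn example can be checked by enumerating the equally likely orderings of the three balls. This illustrative sketch (not from the book) shows that the draws are exchangeable, with equal marginal probabilities, yet not independent:

```python
from itertools import permutations
from fractions import Fraction

# One red ball (R) and two white balls (W), drawn without replacement.
# Each ordering of the three balls is equally likely.
orderings = list(permutations(["R", "W", "W"]))
prob = Fraction(1, len(orderings))

p_first_red = sum(prob for o in orderings if o[0] == "R")
p_second_red = sum(prob for o in orderings if o[1] == "R")
p_both_red = sum(prob for o in orderings if o[0] == "R" and o[1] == "R")

print(p_first_red, p_second_red)   # 1/3 1/3: the sequence is exchangeable
print(p_both_red)                  # 0, not 1/3 x 1/3: the draws are not independent
```

With only one red ball, both draws cannot be red, so the joint probability is zero even though each marginal probability is 1/3.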
U(α, β) over some sensible range of values from α to β. Application of the uni-
form distribution is based on the Principle of Insufficient Reason, first articulated
by Laplace (1774/1951), which states that in the absence of any relevant (prior)
evidence, one should assign their degrees-of-belief equally among all the possible
outcomes. In this case, the uniform distribution essentially indicates that our as-
sumption regarding the value of a parameter of interest is that it lies in the range
β − α and that all possible values have equal probability. Care must be taken
in the choice of the range of values over the uniform distribution. For example,
U[−∞, ∞] is an improper prior distribution insofar as it does not integrate to 1.0 as
required of probability distributions. We will discuss the uniform distribution in
more detail in Chapter 3.
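To illustrate the effect of a uniform prior, the following sketch (hypothetical data, not from the book) computes a posterior over a grid of candidate values for a binomial proportion. With a flat prior, the posterior is simply the renormalized likelihood:

```python
import math

# A grid of candidate values for a binomial proportion theta
grid = [i / 100 for i in range(1, 100)]

# Uniform prior over the grid (Principle of Insufficient Reason):
# every candidate value gets equal prior probability.
prior = [1 / len(grid)] * len(grid)

# Hypothetical data: y = 7 correct answers out of n = 10 items
n, y = 10, 7
likelihood = [math.comb(n, y) * t**y * (1 - t) ** (n - y) for t in grid]

# Unnormalized posterior, then normalize so it sums to one
unnorm = [lik * p for lik, p in zip(likelihood, prior)]
posterior = [u / sum(unnorm) for u in unnorm]

# With a flat prior the posterior mode sits at the sample proportion y/n
mode = grid[posterior.index(max(posterior))]
print(mode)   # 0.7
```

Grid approximation is only practical for one or two parameters, which is one motivation for the sampling methods discussed in Chapter 4.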
p(ϕ) = p(θ) |dθ/dϕ|  (2.17)
On the basis of the relationship in Equation (2.17), Jeffreys (1961) developed a non-informative prior distribution that is invariant under transformations, written as p(θ) ∝ [I(θ)]^{1/2}, where I(θ) denotes the Fisher information.
particularly for hierarchical models, and so although one may have information
about, say, higher level variance terms, such terms may not be substantively
important, and/or they may be difficult to estimate. Therefore, providing weakly
informative prior information may help stabilize the analysis without impacting
inferences.
Finally, as discussed in Gelman, Carlin, et al. (2014), weakly informative priors
can be useful in theory testing where it may appear unfair to specify strong priors
in the direction of one’s theory. Rather, specifying weakly informative priors in
the opposite direction of a theory would then require the theory to pass a higher
standard of evidence.
In suggesting an approach to constructing weakly informative priors, Gelman,
Carlin, et al. (2014) consider two procedures: (1) Start with non-informative priors
and then shift to trying to place reasonable bounds on the parameters according to
the substantive situation. (2) Start with highly informative priors and then shift to
trying to elicit a more honest assessment of uncertainty around those values. From
the standpoint of specifying weakly informative priors, the first approach seems
the most sensible. The second approach appears more useful when engaging in
sensitivity analyses — a topic we will take up later.
I beseech you, in the bowels of Christ, think it possible that you may
be mistaken.
That is, there must be some allowance for the possibility, however slim, that you are mistaken. If we look closely at Bayes' theorem, we can see why. Recall that Bayes' theorem is written as

p(θ | y) = p(y | θ)p(θ) / p(y)  (2.23)
and so, if you state your prior probability of an outcome to be exactly zero, then
p(θ | y) = (p(y | θ) × 0) / p(y) = 0  (2.24)
and thus no amount of evidence to the contrary would change your mind.
What if you state your prior probability to be exactly 1? In this case recall that
the denominator p(y) is a marginal distribution across all possible values of θ. So,
if p(θ) = 1, the denominator p(y) collapses to only your hypothesis p(y | θ) and
therefore
p(θ | y) = p(y | θ) / p(y | θ) = 1  (2.25)
Again, no amount of evidence to the contrary would change your mind; the prob-
ability of your hypothesis is 1.0 and you’re sticking to your hypothesis, no matter
what. Cromwell’s rule states that unless the statements are deductions of logic,
then one should leave some doubt (however small) in probability assessments.
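Cromwell's rule is easy to demonstrate numerically. In this illustrative sketch (hypothetical numbers, not from the book), a discrete Bayes update moves a 50/50 prior almost to certainty, but leaves prior probabilities of exactly 0 or 1 untouched:

```python
def posterior(prior, lik_if_true, lik_if_false):
    # Discrete Bayes update for a single hypothesis H:
    # p(H | y) = p(y | H) p(H) / [p(y | H) p(H) + p(y | ~H) p(~H)]
    evidence = lik_if_true * prior + lik_if_false * (1 - prior)
    return lik_if_true * prior / evidence

# Overwhelming evidence in favor of the hypothesis
strong_lik, weak_lik = 0.999, 0.001

print(posterior(0.5, strong_lik, weak_lik))   # near 1: evidence moves a 50/50 prior
print(posterior(0.0, strong_lik, weak_lik))   # 0.0: a zero prior never moves
print(posterior(1.0, strong_lik, weak_lik))   # 1.0: a prior of one never moves
```

No matter how extreme the likelihood ratio, a dogmatic prior of 0 or 1 is immune to the data, which is precisely Cromwell's rule.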
2.5 Likelihood
Whereas the prior distribution encodes our accumulated knowledge/assumptions
of the parameters of interest, this prior information must, of course, be moderated
by the data in hand before yielding the posterior distribution — the source of our
current inferences. In Equation (2.5) we noted that the probability distribution of the data given the model parameters, p(y | θ), could be written equivalently as L(θ | y), the likelihood of the parameters given the data.
The concept of likelihood is extremely important for both the frequentist and
Bayesian schools of statistics. Excellent discussions of likelihood can be found in
Edwards (1992) and Royall (1997). In this section, we briefly review the law of
likelihood and then present simple expressions of the likelihood for the binomial
probability and normal sampling models.
Definition 2.5.1. If hypothesis θ1 implies that Y takes on the value y with prob-
ability p(y | θ1 ) while hypothesis θ2 implies that Y takes on the value y with
probability p(y | θ2 ), then the law of likelihood states that the realization Y = y is
evidence in support of θ1 over θ2 if and only if L(θ1 | y) > L(θ2 | y). The likelihood
ratio, L(θ1 | y)/L(θ2 | y), measures the strength of that evidence.
Notice that the law of likelihood implies that only the information in the data, as summarized by the likelihood, serves as evidence in corroboration (or refutation) of a hypothesis. This latter idea is referred to as the likelihood principle. Notice
also that frequentist notions of repeated sampling do not enter into the law of
likelihood or the likelihood principle. The issue of conditioning on data that was
not observed will be revisited in Chapter 6 when we take up the problem of null
hypothesis significance testing.
First, consider the number of correct answers on a test of length n. Each item
on the test represents a Bernoulli trial, with outcomes 0 = wrong and 1 = right.
The natural probability model for data arising from a sequence of n Bernoulli trials is the binomial sampling model. Under the assumption of exchangeability – meaning that the indexes 1, ..., n provide no relevant information – we can summarize the total
number of successes by y. Letting θ be the proportion of correct responses in the
population, the binomial sampling model can be written as
p(y | θ) = Bin(y | n, θ) = (n choose y) θ^y (1 − θ)^(n−y)
         = L(θ | n, y)   (2.27)
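The law of likelihood and the binomial likelihood above are easy to check numerically. The following sketch is in Python rather than the R/Stan used later in the book, and the data values are hypothetical:

```python
from math import comb

def binom_lik(theta, n, y):
    # Binomial likelihood: L(theta | n, y) = C(n, y) * theta^y * (1 - theta)^(n - y)
    return comb(n, y) * theta**y * (1.0 - theta)**(n - y)

# Hypothetical data: y = 7 correct answers on a test of length n = 10.
n, y = 10, 7
L1 = binom_lik(0.7, n, y)  # hypothesis theta_1 = 0.7
L2 = binom_lik(0.5, n, y)  # hypothesis theta_2 = 0.5

# By the law of likelihood, y = 7 supports theta_1 over theta_2 because L1 > L2;
# the likelihood ratio L1 / L2 measures the strength of that evidence.
ratio = L1 / L2
```

Here the observed proportion 7/10 favors θ₁ = 0.7, and the likelihood ratio of roughly 2.3 quantifies how strongly.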
Next consider the likelihood function for the parameters of the simple normal
distribution which we write as
p(y | µ, σ²) = (1/√(2πσ²)) exp(−(y − µ)²/(2σ²))   (2.28)
Statistical Elements of Bayes’ Theorem 25
where µ is the population mean and σ2 is the population variance. Under the
assumption of independent observations, we can write Equation (2.28) as
p(y₁, y₂, ..., yₙ | µ, σ²) = ∏_{i=1}^{n} p(yᵢ | µ, σ²)
                           = (1/√(2πσ²))ⁿ exp(−Σ_{i=1}^{n} (yᵢ − µ)²/(2σ²))
                           = L(θ | y)   (2.29)
where θ = (µ, σ). Thus, under the assumption of independence, the likelihood of the model parameters given the data is simply the product of the individual probabilities of the data given the parameters.
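As a quick numerical sketch of this product form (Python, with hypothetical data values), the joint likelihood under independence is the product of the individual Gaussian densities, usually computed on the log scale for numerical stability:

```python
from math import exp, log, pi, sqrt

def normal_pdf(y, mu, sigma2):
    # Gaussian density p(y | mu, sigma^2) from Equation (2.28)
    return (1.0 / sqrt(2.0 * pi * sigma2)) * exp(-(y - mu) ** 2 / (2.0 * sigma2))

data = [4.8, 5.1, 5.3, 4.9, 5.0]  # hypothetical observations

# Likelihood as a product of densities ...
lik = 1.0
for yi in data:
    lik *= normal_pdf(yi, mu=5.0, sigma2=0.04)

# ... and, equivalently, as a sum of log densities.
loglik = sum(log(normal_pdf(yi, 5.0, 0.04)) for yi in data)
```

The two computations agree, but for large n the product of many small densities underflows, which is why software works with log likelihoods.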
Consider again the binomial distribution used to estimate probabilities for successes and failures, such as those obtained from responses to multiple-choice questions scored right/wrong. As an example of a conjugate prior, consider estimating the
number of correct responses y on a test of length n. Let θ be the proportion of cor-
rect responses. We first assume that the responses are independent of one another.
The binomial sampling model was given in Equation (2.27) and reproduced here:
p(y | θ) = Bin(y | n, θ) = (n choose y) θ^y (1 − θ)^(n−y).   (2.30)
One choice of a conjugate prior distribution for θ is the beta(a,b) distribution. The
beta distribution is a continuous distribution appropriate for variables that range
from 0 to 1. The terms a and b are the two shape parameters of the beta distribution. As the term implies, they control the shape of the distribution, including where its mass is concentrated and how spread out, or peaked, it is. For this example, a and b
26 Bayesian Statistics for the Social Sciences
will serve as hyperparameters because the beta distribution is being used as a prior
distribution for the binomial distribution. The form of the beta(a,b) distribution is
p(θ; a, b) = [Γ(a + b) / (Γ(a)Γ(b))] θ^(a−1) (1 − θ)^(b−1)   (2.31)
where Γ(·) is the gamma function. Multiplying Equation (2.30) by Equation (2.31) and ignoring terms that do not involve the model parameters, we obtain the posterior distribution
p(θ | y) = [Γ(n + a + b) / (Γ(y + a)Γ(n − y + b))] θ^(y+a−1) (1 − θ)^(n−y+b−1)   (2.32)
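Conjugacy means this update never has to be computed by brute force: the posterior is again a beta distribution, Beta(y + a, n − y + b). A small Python sketch with hypothetical numbers:

```python
# Hypothetical data: y = 15 correct out of n = 20, with a Beta(2, 2) prior on theta.
n, y = 20, 15
a, b = 2.0, 2.0

# The posterior is Beta(y + a, n - y + b); no integration required.
post_a, post_b = y + a, n - y + b

prior_mean = a / (a + b)                      # 0.5
mle = y / n                                   # 0.75
posterior_mean = post_a / (post_a + post_b)   # (y + a) / (n + a + b)
```

The posterior mean (17/24 ≈ 0.708) falls between the prior mean and the maximum likelihood estimate, a first glimpse of the shrinkage idea that recurs throughout the chapter.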
This next example explores the Gaussian prior distribution for the Gaussian
sampling model in which the variance σ2 is assumed to be known. Thus, the
problem is one of estimating the mean µ. Let y denote a data vector of size n.
We assume that y follows a Gaussian distribution shown in Equation (2.28) and
reproduced here:
p(y | µ, σ²) = (1/√(2πσ²)) exp(−(y − µ)²/(2σ²))   (2.33)
Consider that our prior distribution on the mean is Gaussian with mean and variance hyperparameters κ and τ², respectively, which for this example are assumed known. The prior distribution can be written as
p(µ | κ, τ²) = (1/√(2πτ²)) exp(−(µ − κ)²/(2τ²))   (2.34)
and variance

σ̂²_µ = τ²σ²/(σ² + nτ²)   (2.37)
We see from Equations (2.35), (2.36), and (2.37) that the Gaussian prior is conjugate
for the likelihood, yielding a Gaussian posterior.
Thus, as the sample size increases to infinity, the expected a posteriori estimate µ̂ converges to the maximum likelihood estimate ȳ.
In terms of the variance, first let 1/τ² and n/σ² refer to the prior precision
and data precision, respectively. The role of these two measures of precision can
be seen by once again examining the variance term for the normal distribution in
Equation (2.37). Specifically, letting n approach infinity, we obtain
lim_{n→∞} σ̂²_µ = lim_{n→∞} 1/(1/τ² + n/σ²)
              = lim_{n→∞} σ²/(σ²/τ² + n) = σ²/n   (2.39)
µ̂ = [σ²/(σ² + nτ²)] κ + [nτ²/(σ² + nτ²)] ȳ   (2.40)
Thus, the posterior mean is a weighted combination of the prior mean and
observed data mean. These weights are bounded by 0 and 1 and together
are referred to as the shrinkage factor. The shrinkage factor represents the
proportional distance that the posterior mean has shrunk back to the prior
mean, κ, and away from the maximum likelihood estimator, ȳ. Notice that
if the sample size is large, the weight associated with κ will approach zero
and the weight associated with ȳ will approach one. Thus µ̂ will approach ȳ.
Similarly, if the data variance, σ2 , is very large relative to the prior variance,
τ2 , this suggests little precision in the data relative to the prior and therefore
the posterior mean will approach the prior mean, κ. Conversely, if the prior
variance is very large relative to the data variance, this suggests greater precision
in the data compared to the prior and therefore the posterior mean will approach ȳ.
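The weighted-combination form of the posterior mean is easy to verify numerically. In this Python sketch (all values hypothetical), the weight on the prior mean κ is σ²/(σ² + nτ²), and it vanishes as n grows:

```python
def posterior_mean(ybar, n, sigma2, kappa, tau2):
    # Posterior mean of mu as a weighted combination of the prior mean kappa
    # and the sample mean ybar; w is the shrinkage weight on kappa.
    w = sigma2 / (sigma2 + n * tau2)
    return w * kappa + (1.0 - w) * ybar, w

ybar, sigma2, kappa, tau2 = 52.0, 100.0, 50.0, 25.0

mu_4, w_4 = posterior_mean(ybar, 4, sigma2, kappa, tau2)
mu_400, w_400 = posterior_mean(ybar, 400, sigma2, kappa, tau2)
# With n = 4 the posterior mean is pulled toward kappa; with n = 400 it is
# essentially the sample mean ybar.
```

With n = 4 the weight on κ is 0.5 and the posterior mean sits halfway between κ and ȳ; with n = 400 the weight is about 0.01 and the posterior mean has all but converged to ȳ.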
Perhaps a more realistic situation that arises in practice is when the mean
and variance of the Gaussian distribution are unknown. In this case, we need to
specify a full probability model for both the mean µ and variance σ2 . If we assume
that µ and σ2 are independent of one another, then we can factor the joint prior
distribution of µ and σ2 as
p(µ, σ2 ) = p(µ)p(σ2 ) (2.41)
We now need to specify the prior distribution of σ2 . There are two approaches that
we can take to specify the prior for σ2 . First, we can specify a uniform prior on
µ and log(σ2 ), because when converting the uniform prior on log(σ2 ) to a density
for σ2 , we obtain p(σ2 ) = 1/σ2 .1 With uniform priors on both µ and σ2 , the joint
prior distribution p(µ, σ2 ) ∝ 1/σ2 . However, the problem with this first approach is
that the uniform prior over the real line is an improper prior. Therefore, a second
approach would be to provide proper informative priors, but with a choice of hyperparameters such that the resulting priors are quite diffuse. First, again we
assume as before that y ∼ N(µ, σ2 ) and that µ ∼ N(κ, τ2 ). As will be discussed in the
next chapter, the variance parameter, σ2 , follows an inverse-gamma distribution
with shape and scale parameters, a and b, respectively. Succinctly, σ2 ∼ IG(a, b)
and the probability density function for σ2 can be written as
p(σ² | a, b) ∝ (σ²)^(−(a+1)) e^(−b/σ²)   (2.42)
Even though Equation (2.42) is a proper distribution for σ2 , we can see that as
a and b approach 0, the proper prior approaches the non-informative prior 1/σ2 .
Thus, very small values of a and b can suffice to provide a prior on σ2 to be used
to estimate the joint posterior distribution of µ and σ2 .
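The claim that the IG(a, b) prior approaches the non-informative prior 1/σ² as a and b approach zero can be checked directly by comparing log kernels (a Python sketch; the specific values are arbitrary):

```python
from math import log

def ig_log_kernel(sigma2, a, b):
    # Log of the inverse-gamma kernel: -(a + 1) * log(sigma2) - b / sigma2
    return -(a + 1.0) * log(sigma2) - b / sigma2

a, b = 1e-8, 1e-8  # nearly non-informative hyperparameters

# Compare how each prior changes between two values of sigma^2; the kernel of
# the 1/sigma^2 prior changes by log(1/4) over the same interval.
diff_ig = ig_log_kernel(4.0, a, b) - ig_log_kernel(1.0, a, b)
diff_ref = log(1.0 / 4.0) - log(1.0 / 1.0)
```

With a and b this small the two log kernels change by essentially the same amount, so the priors are practically indistinguishable over the range of σ² that matters.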
The final step in this example is to obtain the joint posterior distribution for µ
and σ2 . Assuming that the joint prior distribution is 1/σ2 , then the joint posterior
distribution can be written as
p(µ, σ² | y) ∝ (1/σ²) ∏_{i=1}^{n} (1/√(2πσ²)) exp(−(yᵢ − µ)²/(2σ²))   (2.43)
1. Following Lynch (2007), this result is obtained via a change-of-variable calculus. Specifically, let k = log(σ²) and p(k) ∝ constant. The change of variable involves the Jacobian J = dk/dσ² = 1/σ². Therefore, p(σ²) ∝ constant × J.
Notice, however, that Equation (2.43) involves two parameters µ and σ2 . The solu-
tion to this problem is discussed in Lynch (2007). First, the posterior distribution
of µ obtained from Equation (2.43) can be written as
p(µ | y, σ²) ∝ exp(−(nµ² − 2nȳµ)/(2σ²))   (2.44)

which, after completing the square in µ, can be expressed as

p(µ | y, σ²) ∝ exp(−(µ − ȳ)²/(2σ²/n))   (2.45)
The first term on the right-hand side of Equation (2.46) was solved above assuming
σ2 is known. The second term on the right hand side of Equation (2.46) is the
marginal posterior distribution of σ2 . An exact expression for p(σ2 | y) can be
obtained by integrating the joint distribution over µ – that is,
p(σ² | y) = ∫ p(µ, σ² | y) dµ   (2.47)
Although this discussion shows that analytic expressions are possible for this simple case, in practice the advent of MCMC algorithms renders the solution to the joint posterior distribution of model parameters quite straightforward.
2.8 Summary
With reference to any parameter of interest, be it a mean, variance, regression
coefficient, or a factor loading, Bayes’ theorem is composed of three parts: (1) the
prior distribution representing our cumulative knowledge about the parameter
of interest; (2) the likelihood representing the data in hand; and (3) the posterior
distribution, representing our updated knowledge based on the moderation of the
prior distribution by the likelihood. By carefully decomposing Bayes’ theorem into
its constituent parts, we also can see its relationship to frequentist statistics, partic-
ularly through the Bayesian central limit theorem and the notion of shrinkage. In
the next chapter we focus on the relationship between the likelihood and the prior.
Specifically, we examine a variety of common data distributions used in social and
behavioral science research and define their conjugate prior distributions.
3
Common Probability Distributions and
Their Priors
will also describe the posterior distribution that is derived from applying Bayes’
theorem to each of these distributions. Finally, we provide Jeffreys’ prior for
each of the univariate distributions. We will not devote space to describing more
technical details of these distributions, such as the moment generating functions or
characteristic functions. A succinct summary of the technical details of these and
many more distributions can be found in Evans, Hastings, and Peacock (2000).1
y ∼ N(µ, σ²)   (3.1)
where
E[y] = µ   (3.2)
V[y] = σ²   (3.3)
where E[·] and V[·] are the expectation and variance operators, respectively. The
probability density function of the Gaussian distribution was given in Equation
(2.28) and reproduced here:
p(y | µ, σ²) = (1/√(2πσ²)) exp(−(y − µ)²/(2σ²))   (3.4)
In what follows, we consider conjugate priors for two cases of the Gaussian
distribution. The first case is where the mean of the distribution is unknown
but the variance is known and the second case is where the mean is known but
variance is unknown. The Gaussian distribution is often used as a conjugate
prior for parameters that are assumed to be Gaussian in the population, such as
regression coefficients.
p(µ | µ₀, σ₀²) ∝ (1/σ₀) exp(−(µ − µ₀)²/(2σ₀²)),   (3.5)
where µ₀ and σ₀² are hyperparameters.
Figure 3.1 below illustrates the Gaussian distribution with unknown prior
mean and known variance under varying conjugate priors. For each plot, the dark
dashed line is the Gaussian likelihood, which remains the same in each plot. The light dotted line is the Gaussian prior distribution, which becomes increasingly diffuse. The gray line is the resulting posterior distribution.
FIGURE 3.1. Gaussian distribution, mean unknown/variance known with varying conju-
gate priors. Note how the posterior distribution begins to align with the distribution of the
data as the prior becomes increasingly flat.
These cases make quite clear the relationship between the prior distribution and the posterior distribution. Specifically, the smaller the variance of the prior distribution of the mean (upper left plot), the more closely the posterior matches the prior distribution. In the case of a very flat prior distribution (lower right plot), the posterior distribution instead matches the data distribution.
Y ∼ U(α, β)   (3.6)
where α and β are the lower and upper limits of the uniform distribution, respectively. The standard uniform distribution has α = 0 and β = 1, so that 0 ≤ y ≤ 1. Under the uniform distribution,
E[y] = (α + β)/2   (3.7)
V[y] = (β − α)²/12   (3.8)
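These moments can be verified with a quick Monte Carlo check (a Python sketch; the bounds are arbitrary):

```python
import random

alpha, beta = 2.0, 6.0
mean_formula = (alpha + beta) / 2.0        # (alpha + beta) / 2
var_formula = (beta - alpha) ** 2 / 12.0   # (beta - alpha)^2 / 12

random.seed(1)
draws = [random.uniform(alpha, beta) for _ in range(200_000)]
mc_mean = sum(draws) / len(draws)
mc_var = sum((d - mc_mean) ** 2 for d in draws) / len(draws)
```

With 200,000 draws the Monte Carlo mean and variance agree with the closed-form values to roughly two decimal places.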
Generally speaking, it is useful to incorporate the uniform distribution as a
non-informative prior for a distribution that has bounded support, such as (−1, 1).
As an example of the use of the uniform distribution as a prior, consider its role in forming the posterior distribution for a Gaussian likelihood.
Again, this would be the case where a researcher lacks prior information regarding
the distribution of the parameter of interest.
Figure 3.2 below shows how different bounds on the uniform prior result in different posterior distributions.
We see from Figure 3.2 that the effect of the uniform prior on the posterior distri-
bution is dependent on the bounds of the prior. For a prior with relatively narrow
bounds (upper left of figure), this is akin to having a fair amount of information,
and therefore, the prior and posterior roughly match up. However, as in the case
of Figure 3.1, if the uniform prior has very wide bounds indicating virtually no
prior information (lower right figure), the posterior distribution will match the
data distribution.
E(σ²) = b/(a − 1), for a > 1   (3.11)
and
V(σ²) = b²/((a − 1)²(a − 2)), for a > 2   (3.12)
respectively.
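These moments can likewise be checked by simulation, using the fact that if X ∼ Gamma(a) with rate b, then 1/X ∼ IG(a, b) (a Python sketch with arbitrary hyperparameter values):

```python
import random

a, b = 5.0, 4.0
mean_formula = b / (a - 1.0)                          # requires a > 1
var_formula = b ** 2 / ((a - 1.0) ** 2 * (a - 2.0))   # requires a > 2

random.seed(7)
# random.gammavariate takes a *scale* parameter, so scale = 1 / b gives rate b.
draws = [1.0 / random.gammavariate(a, 1.0 / b) for _ in range(200_000)]
mc_mean = sum(draws) / len(draws)
```

For a = 5 and b = 4 the closed-form mean is exactly 1, and the Monte Carlo estimate lands within a few hundredths of it.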
Figure 3.3 below shows the posterior distribution of the variance for different
inverse-gamma priors that differ only with respect to their shape.
FIGURE 3.3. Inverse-gamma prior for variance of Gaussian distribution. Note that the
“peakedness” of the posterior distribution of the variance is dependent on the shape of the
inverse-gamma prior.
Figure 3.4 below shows the C+ distribution for various values of the scale
parameter δ.
Then, following our discussion in Section 2.4.2, Jeffreys’ prior is the square root of
the determinant of the information matrix - viz.,
det[ 1/σ²  0 ; 0  1/(2σ⁴) ]^(1/2) = (1/(2σ⁶))^(1/2) ∝ 1/σ³   (3.14)
Often we see Jeffreys’ prior for this case written as 1/σ². This stems from the transformation found in Equation (2.18). Namely, the prior 1/σ³ based on p(µ, σ²) is the same as having the prior 1/σ² on p(µ, σ). To see this, note from Equation (2.18) that
p(k | θ) = e^(−θ) θ^k / k!,   k = 0, 1, 2, ...,  θ > 0   (3.18)
Figure 3.5 below shows the posterior distribution under the Poisson likelihood
with varying gamma-density priors.
Here again we see the manner in which the data distribution moderates the influence of the prior distribution to obtain a posterior distribution that balances the data in hand with the prior information we can bring regarding the parameters
of interest. The upper left of Figure 3.5 shows this perhaps most clearly with the
posterior distribution balanced between the prior distribution and the data distri-
bution. And again, in the case of a relatively non-informative Gamma distribution,
the posterior distribution matches up to the likelihood (lower right of Figure 3.5).
∂²/∂θ² log p(k | θ) = −k/θ²   (3.21)
Thus the information matrix can be written as
I(θ) = −E[∂²/∂θ² log p(k | θ)]   (3.22)
     = θ/θ²  (since E[k] = θ)   (3.23)
     = 1/θ   (3.24)
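The result I(θ) = 1/θ can be confirmed numerically by taking the expectation of k/θ² under the Poisson distribution (a Python sketch; θ = 3 is an arbitrary choice):

```python
from math import exp, factorial

def poisson_pmf(k, theta):
    # Poisson probability mass function: e^(-theta) * theta^k / k!
    return exp(-theta) * theta ** k / factorial(k)

theta = 3.0
# Fisher information: E[k / theta^2], with the sum truncated far in the tail.
info = sum(poisson_pmf(k, theta) * (k / theta ** 2) for k in range(100))
```

The truncated sum reproduces 1/θ to well beyond the precision we need, since the Poisson tail beyond k = 100 is negligible for θ = 3.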
Note also that the U(0, 1) distribution is equivalent to the B(1,1) distribution.
The beta distribution is typically used as the prior distribution when data are
assumed to be generated from the binomial distribution, such as the example in
Section 2.4, because the binomial parameter θ is continuous and ranges between
zero and one.
Figure 3.6 below shows the posterior distribution under the binomial likeli-
hood with varying beta priors.
FIGURE 3.6. Binomial distribution with varying beta priors. Note that the figure on the
lower right is Jeffreys’ prior for the binomial distribution.
We see that the role of the beta prior on the posterior distribution is quite similar
to the role of the Gaussian prior on the posterior distribution in Figure 3.1. Note
that the B(1, 1) prior distribution in the lower left-hand corner of Figure 3.6 is
equivalent to the U(0, 1) distribution. The lower right-hand display is Jeffreys’
prior, which is discussed next.
Jeffreys’ prior is then the square root of Equation (3.32d). That is,
p(X₁ = x₁, ..., X_C = x_C)   (3.34)
    = n!/(x₁! ··· x_C!) · π₁^{x₁} ··· π_C^{x_C},  when Σ_{c=1}^{C} x_c = n
    = 0,  otherwise
and the covariance between any two categories c and d can be written as
where
B(a) = ∏_{c=1}^{C} Γ(a_c) / Γ(Σ_{c=1}^{C} a_c)   (3.39)
Example 3.7: Multinomial Likelihood with Varying Precision on the Dirichlet Prior
Figure 3.7 below shows the multinomial likelihood and posterior distributions with varying degrees of precision on the Dirichlet prior.
As in the other cases, we find that a highly informative Dirichlet prior (top row)
yields a posterior distribution that is relatively precise with a shape that is similar
to that of the prior. For a relatively diffuse Dirichlet prior (bottom row), the posterior more closely resembles the likelihood.
and
V = 2ψ²_PP / ((ν − P − 1)²(ν − P − 3))   (3.42)
From Equation (3.41) and Equation (3.42) we see that the informativeness of the IW prior depends on the scale matrix Ψ and the degrees of freedom ν – that is, the IW prior becomes more informative as either the elements in Ψ become smaller or ν becomes larger. Finding the balance between these elements is tricky, and so common practice is to set Ψ = I and vary the values of ν.
Σ = σRσ (3.43)
Figure 3.8 below shows the probability density plots for the LKJ distribution
with different values of η. Notice that higher values of η place less mass on extreme
correlations.
3.7 Summary
This chapter presented the most common distributions encountered in the social
sciences along with their conjugate priors and associated Jeffreys’ priors. We
also discussed the LKJ prior for correlation matrices which is quite useful when
specifying inverse-Wishart priors for covariance matrices. The manner in which
the prior and the data distributions balance each other to result in the posterior
distribution is the key point of this chapter. When priors are very precise, the
posterior distribution will have a shape closer to that of the prior. When the prior
distribution is non-informative, the posterior distribution will adopt the shape
of the data distribution. This finding can be deduced from an inspection of the
shrinkage factor given in Equation (2.40). In the next chapter we focus our attention
on the computational machinery for summarizing the posterior distribution.
4
Obtaining and Summarizing the
Posterior Distribution
As stated in the Introduction, the key reason for the increased popularity of
Bayesian methods in the social sciences has been the (re)discovery of numerical
algorithms for estimating posterior distributions of the model parameters given
the data. Prior to these developments, it was virtually impossible to derive sum-
mary measures of the posterior distribution, particularly for complex models with
many parameters. The numerical algorithms that we will describe in this chapter
involve Monte Carlo integration using Markov chains – also referred to as Markov
chain Monte Carlo (MCMC) sampling. These algorithms have a rather long his-
tory, arising out of statistical physics and image analysis (Geman & Geman, 1984;
Metropolis, Rosenbluth, Rosenbluth, Teller, & Teller, 1953). For a nice introduction
to the history of MCMC, see Robert and Casella (2011).
For the purposes of this chapter, we will consider three of the most common
algorithms that are available in both open source as well as commercially available
software – the random walk Metropolis-Hastings algorithm, the Gibbs sampler, and
Hamiltonian Monte Carlo. First, however, we will introduce some of the general
features of MCMC. Then, we will turn to a discussion of the individual algorithms,
and finally we will discuss the criteria used to evaluate the quality of an MCMC
sample. A full example using the Hamiltonian Monte Carlo algorithm will be
provided, which also introduces the Stan software package.
Assuming the samples are independent of one another, the law of large numbers
ensures that the approximation in Equation (4.1) will be increasingly accurate as
S increases. Indeed, under independent samples, this process describes ordinary
Monte Carlo sampling. However, an important feature of Monte Carlo integration,
and of particular relevance to Bayesian inference, is that the samples do not have
to be drawn independently. All that is required is that the sequence θs , (s =
1, . . . , S) yields samples that have explored the support of the distribution (Gilks,
Richardson, & Spiegelhalter, 1996a).1
One approach to sampling throughout the support of a distribution while also
relaxing the assumption of independent sampling is through the use of a Markov
chain. Formally, a Markov chain is a sequence of dependent random variables {θ^s}
θ^0, θ^1, ..., θ^s, ...   (4.2)
such that the conditional probability of θs given all of the past variables depends
only on θs−1 – that is, only on the immediate past variable. This conditional
probability for the continuous case is referred to as the transition kernel of the
Markov chain. For discrete random variables this is referred to as the transition
matrix.
The Markov chain has a number of very important properties, not the least
of which is that over a long sequence, the chain will forget its initial state θ0 and
converge to its stationary distribution p(θ | y), which does not depend either on
the number of samples S or on the initial state θ0 . The number of iterations prior
to the stability of the distribution is referred to as the warmup samples. Letting m
1. The support of a distribution is the smallest closed interval (or set in the multivariate case) whose elements are members of the distribution. Outside the support of the distribution, the probability of an element is zero. Technically, MCMC algorithms explore the typical set of a probability distribution. The concept of the typical set will be taken up when discussing Hamiltonian Monte Carlo.
represent the initial number of warmup samples, we can obtain an ergodic average
of the posterior distribution p(θ | y) as
p(θ | y) = (1/(S − m)) Σ_{s=m+1}^{S} p(θ^s | y)   (4.3)
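A minimal Python sketch of discarding warmup draws and averaging the rest, using a simulated chain whose early draws drift in from a poor starting value (all numbers hypothetical):

```python
import random

random.seed(42)
S, m = 5_000, 500  # total draws and warmup draws

# Simulate a chain targeting a standard normal: the first m draws drift in
# from a poor initial state, and the rest behave like draws from the target.
chain = []
for s in range(S):
    drift = 5.0 * (1.0 - s / m) if s < m else 0.0
    chain.append(drift + random.gauss(0.0, 1.0))

# Ergodic average over the post-warmup draws only.
post = chain[m:]
ergodic_mean = sum(post) / len(post)
```

Averaging over all S draws would be contaminated by the warmup drift; averaging over the post-warmup draws recovers the target mean of zero.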
The idea of conducting Monte Carlo sampling through the construction of Markov chains defines MCMC. The question that we need to address is how to construct the Markov chain – that is, how to move from one parameter value to the next. Three popular algorithms have been developed for this purpose, and we take them up next.
⋮
P. Sample θ_P^s ∼ p(θ_P | θ_1^s, θ_2^s, ..., θ_{P−1}^s)
So, for example, in Step 1, a value for the first parameter θ1 at iteration s = 1 is
drawn from the conditional distribution of θ1 given other parameters with start
values at iteration 0 and the data y. At Step 2, the algorithm draws a value for
the second parameter θ2 at iteration s = 1 from the conditional distribution of θ2
given the value of θ1 drawn in Step 1, the remaining parameters with start values
at iteration zero, and the data. This process continues until ultimately a sequence
of dependent vectors are formed:
θ^1 = {θ_1^1, ..., θ_P^1}
θ^2 = {θ_1^2, ..., θ_P^2}
⋮
θ^S = {θ_1^S, ..., θ_P^S}
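A minimal Gibbs sampler sketch in Python for a two-parameter toy problem: a bivariate standard normal with correlation ρ, where each full conditional is itself Gaussian (all settings hypothetical):

```python
import random

random.seed(11)
rho = 0.6                   # correlation of the bivariate normal target
cond_sd = (1.0 - rho ** 2) ** 0.5
S, m = 20_000, 1_000        # iterations and warmup draws

theta1, theta2 = -4.0, 4.0  # deliberately poor starting values
draws = []
for s in range(S):
    # Step 1: sample theta1 from its full conditional given the current theta2.
    theta1 = random.gauss(rho * theta2, cond_sd)
    # Step 2: sample theta2 from its full conditional given the new theta1.
    theta2 = random.gauss(rho * theta1, cond_sd)
    draws.append((theta1, theta2))

post = draws[m:]
mean1 = sum(t1 for t1, _ in post) / len(post)
mean2 = sum(t2 for _, t2 in post) / len(post)
```

Despite the poor starting values, the post-warmup draws recover the target means (both zero); each sweep conditions only on the most recently drawn values, exactly as in the steps above.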
The difficulty with the M-H and Gibbs algorithms is that although they will eventually explore the typical set of a distribution, they might do so slowly enough that computing resources are exhausted. This problem is due to the random walk nature of these algorithms. For example, in the ideal situation with a small number of parameters, the proposal distribution of the M-H algorithm (usually a Gaussian proposal distribution) will be biased toward the tails of the distribution where the volume is high, while the algorithm will reject proposed values where the density is small. This pushes the M-H algorithm toward the typical set, as desired. However, as the number of parameters increases, the volume outside the typical set comes to dominate the volume inside the typical set, and thus the Markov chain will mostly land outside the typical set, yielding proposals with low probabilities and hence more rejections by the algorithm. This results in the Markov chain getting stuck outside the typical set and thus moving very slowly, as is often observed when employing M-H in practice. The same problem holds for the Gibbs sampler as well.
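For concreteness, here is a minimal random walk Metropolis-Hastings sketch in Python for a standard normal target (all tuning values hypothetical). The acceptance step is the min(1, ratio) rule, computed on the log scale:

```python
import math
import random

random.seed(3)

def log_target(theta):
    # Log of an unnormalized standard normal target density.
    return -0.5 * theta ** 2

S, m, scale = 20_000, 2_000, 1.0
theta = 8.0  # deliberately poor starting value
chain, accepted = [], 0
for s in range(S):
    proposal = random.gauss(theta, scale)  # random walk Gaussian proposal
    log_ratio = log_target(proposal) - log_target(theta)
    # Accept with probability min(1, p(proposal) / p(theta)).
    if random.random() < math.exp(min(0.0, log_ratio)):
        theta = proposal
        accepted += 1
    chain.append(theta)

post = chain[m:]
post_mean = sum(post) / len(post)
accept_rate = accepted / S
```

In one dimension this works well, but the random walk behavior described above means that in high dimensions the same recipe produces many rejections and very slow exploration.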
The solution to the problem of the Markov chain getting stuck outside the
typical set is to come up with an approach that is capable of making large jumps
across regions of the typical set, such that the typical set is fully explored with-
out the algorithm jumping outside the typical set. This is the goal of HMC.
Specifically, HMC exploits the geometry of the typical set and constructs transi-
tions that “...glide across the typical set towards new, unexplored neighborhoods”
(Betancourt, 2018b, p. 18). To accomplish this controlled sojourn across the typical
set, HMC exploits the correspondence between probabilistic systems and physical
systems. As discussed in Betancourt (2018b), the physical analogy is one of placing
a satellite in a stable orbit around Earth. A balance must be struck between the
momentum of the satellite and the gravity of Earth. Too much momentum and
the satellite will fly off into space. Too little, and the satellite will crash into Earth.
Thus, the key to gliding across the typical set is to carefully add an auxiliary momentum parameter to the probabilistic system. This momentum parameter is essentially a first-order gradient calculated from the log-posterior distribution.
In general, we expect that the lag-1 autocorrelation will be close to 1.0. However,
we also expect that the components of the Markov chain will become independent
as l increases. Thus, we prefer that the autocorrelation decrease quickly over the
number of iterations. If this is not the case, it is evidence that the chain is “stuck”
and thus not providing a full exploration over the support of the target distribution.
In general, positive autocorrelation will be observed, but in some cases negative
autocorrelation is possible, indicating fast convergence of the estimated value to
the equilibrium value.
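The lag-l autocorrelation of a chain can be computed directly. In this Python sketch the chain is a simulated AR(1) process with coefficient 0.8, so the autocorrelation should decay roughly geometrically (all settings hypothetical):

```python
import random

def autocorr(chain, lag):
    # Sample lag-`lag` autocorrelation of a chain.
    n = len(chain)
    mean = sum(chain) / n
    var = sum((x - mean) ** 2 for x in chain) / n
    cov = sum((chain[i] - mean) * (chain[i + lag] - mean) for i in range(n - lag)) / n
    return cov / var

random.seed(5)
chain = [0.0]
for _ in range(50_000):
    chain.append(0.8 * chain[-1] + random.gauss(0.0, 1.0))

rho1 = autocorr(chain, 1)  # should be near 0.8
rho5 = autocorr(chain, 5)  # should be near 0.8**5, i.e., decaying quickly
```

A chain whose autocorrelation falls off like this is mixing reasonably well; a chain that is "stuck" would show autocorrelations that stay near 1.0 across many lags.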
Divergent Transitions
Much of what has been described so far represents the ideal case of the typical
set being a nice smooth surface that HMC can easily explore. However, in more
complex Bayesian models, particularly Bayesian hierarchical models often applied
in the social and behavioral sciences, the surface of the typical set is not always
smooth. In particular, there can be regions of the typical set that have very high curvature. Algorithms such as Metropolis-Hastings might jump over such a region,
but the problem is that there is information in that region, and if it is ignored, then
the resulting parameter estimates may be biased. However, to compensate for not
exactly exploring this region, MCMC algorithms will instead get very close to the
boundary of this region and hover there for a long time. This can be seen through
a careful inspection of trace plots where, instead of a nice horizontal band, one sees a sudden jump in the plot followed by a long sequence of iterations at that jump point. After a while, the algorithm will jump back and, in fact, overcorrect.
In principle, if the algorithm were allowed to run forever, these discontinuities
would cancel each other out, and the algorithm would converge to the posterior
distribution under the central limit theorem. However, in finite time, the resulting
estimates will likely be biased. Excellent graphical descriptions of this issue can
be found in Betancourt (2018b).
The difficulty with the problem just described is that the M-H and Gibbs algorithms do not provide feedback as to the conditions under which they got stuck in the region of high curvature. With HMC as implemented in Stan, if the algorithm
diverges sharply from the trajectory through the typical set, it will throw an error
message that some number of transitions diverged. In Stan the error might read
The mean and variance provide two simple summary values of the posterior
distribution. Another common summary measure would be the mode of the pos-
terior distribution – referred to as the maximum a posteriori (MAP) estimate. The
MAP begins with the idea of maximum likelihood estimation. Maximum likelihood estimation obtains the value of θ, say θ̂_ML, that maximizes the likelihood function L(θ | y), written succinctly as
where argmax stands for the value of the argument for which the function attains
its maximum. In Bayesian inference, however, we treat θ as random and specify
a prior distribution on θ to reflect our uncertainty about θ. By adding the prior
distribution to the problem, we obtain
Recalling that the posterior distribution satisfies p(θ | y) ∝ L(θ | y)p(θ), we see that Equation (4.11) provides the maximum value of the posterior density of θ given y, corresponding to the mode of the posterior density.
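A Python sketch of the MAP idea for the beta-binomial example of Section 2.4: maximize the log posterior kernel over a grid and compare with the closed-form mode of the Beta(y + a, n − y + b) posterior (the data and prior values are hypothetical):

```python
from math import log

n, y = 20, 15    # hypothetical binomial data
a, b = 2.0, 2.0  # Beta(2, 2) prior

def log_post_kernel(theta):
    # log L(theta | y) + log p(theta), dropping terms that do not involve theta
    return (y + a - 1.0) * log(theta) + (n - y + b - 1.0) * log(1.0 - theta)

grid = [i / 10_000 for i in range(1, 10_000)]
theta_map = max(grid, key=log_post_kernel)

# Mode of the Beta(y + a, n - y + b) posterior, available in closed form.
closed_form = (y + a - 1.0) / (n + a + b - 2.0)
```

The grid search lands on the closed-form mode, 16/22 ≈ 0.727; in models without closed-form posteriors the same maximization is carried out by numerical optimization.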
In words, the first part says that given the data y, the probability that θ is in a
particular region is equal to 1−α, where α is determined ahead of time. The second
part says that for two different values of θ, denoted as θ1 and θ2 , if θ1 is in the
region defined by 1 − α, but θ2 is not, then θ1 has a higher probability than θ2 given
the data. Note that for unimodal and symmetric distributions, such as the uniform
distribution or the Gaussian distribution, the HPD is formed by choosing tails of
equal density. The advantage of the HPD arises when densities are not symmetric
and/or are multi-modal. A multi-modal distribution, for example, could arise as a
consequence of a mixture of two distributions. Following G. Box and Tiao (1973), if
p(θ | y) is not uniform over every region in θ, then the HPD region 1 − α is unique.
Also, if p(θ₁ | y) = p(θ₂ | y), then these points are included (or excluded) by a 1 − α HPD region. The opposite is true as well; namely, if p(θ₁ | y) ≠ p(θ₂ | y), then a 1 − α HPD region includes one point but not the other (G. Box & Tiao, 1973, p. 123).
Figure 4.1 below shows the HPDs for a symmetric distribution centered at
zero on the left, and an asymmetric distribution on the right.
FIGURE 4.1. Highest posterior density plot for symmetric and nonsymmetric distributions.
We see that for the symmetric distribution, the 95% HPD aligns with the 95% confidence interval as
well as the posterior probability interval, as expected. Perhaps more importantly,
we see the role of the HPD in the case of the asymmetric distribution on the right.
Such distributions could arise from the mixture of two Gaussian distributions.
Here, the value of the posterior probability interval would be misleading. The
HPD, by contrast, indicates that, due to the asymmetric nature of this particular
distribution, there is very little difference in the probability that the parameter of
interest lies within the 95% or 99% intervals of the highest posterior density.
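An empirical HPD interval can be computed from posterior draws as the shortest interval containing the desired mass. A Python sketch using draws from a symmetric (standard normal) posterior, where the 95% HPD should align with the familiar equal-tailed interval:

```python
import random

def hpd_interval(draws, prob=0.95):
    # Shortest interval containing `prob` of the draws (an empirical HPD).
    sorted_draws = sorted(draws)
    n = len(sorted_draws)
    k = int(prob * n)
    i = min(range(n - k), key=lambda j: sorted_draws[j + k] - sorted_draws[j])
    return sorted_draws[i], sorted_draws[i + k]

random.seed(9)
draws = [random.gauss(0.0, 1.0) for _ in range(100_000)]
lo, hi = hpd_interval(draws, 0.95)
# For this symmetric posterior the interval should be close to (-1.96, 1.96).
```

For an asymmetric or multimodal posterior, the same shortest-interval criterion produces the off-center intervals that the equal-tailed construction misses.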
To introduce some of the basic elements of Stan and its R interface RStan
we begin by estimating the distribution of reading literacy from the United States
sample of the Program for International Student Assessment (PISA) (OECD, 2019).
For this example, we examine the distribution of the first plausible value of reading for the U.S. sample. For detailed discussions of the design of the PISA
assessment, see Kaplan and Kuger (2016) and von Davier (2013).
In the following block of code we load RStan, bayesplot, read in the PISA
data, and do some subsetting to extract the first plausible value of the reading
competency assessment.4 We also create an object called data.list that allows us to
rename variables and provide information to be used later. Note that we refer to
the reading score as readscore in the code.
library(rstan)
library(bayesplot)
PISA18Data <- read.csv(file.choose(), header=TRUE)
PISA18.read <- subset(PISA18Data, select=c(PV1READ))
data.list <- with(PISA18.read, list(readscore=PV1READ,
n = nrow(PISA18.read)))
Next we write the Stan code, beginning with the command ReadingLit =" which
provides the name for the string of Stan code (other names can be chosen). We
declare the sample size to be an integer denoted by n with a lower bound of zero.
Note that Stan uses // for comments.
4. In fact, when loading RStan, the program bayesplot and other required programs will be loaded automatically.
Obtaining and Summarizing the Posterior Distribution 61
ReadingLit ="
data {
int<lower=0> n; // Declare sample size
vector[n] readscore; // Declare reading outcome
}
In the parameters block below, notice that we declare the lower bound of the mean of the
reading distribution to be 100. This is because we know that the scores cannot fall below
100 by virtue of how the reading assessment was scaled.
In the following parameters block we declare the names and dimensions of
the parameters of the reading distribution.
parameters {
real<lower=100> mu;
real<lower=0> sigma;
}
In the model block we write out the probability model (likelihood) for the reading
outcome followed by the specification of the prior distributions for the mean and
standard deviation.
model {
readscore ~ normal(mu, sigma);
mu ~ normal(500, 10);
sigma ~ cauchy(0, 6);
}
"
Regarding the standard deviation, the PISA 2009 U.S. results show that the
standard deviation of the reading scale was 97. To elicit a prior for the standard
deviation, recall that one choice for a prior on the standard deviation is the C+
distribution. An ad hoc approach to obtaining the scale parameter of the C+
distribution is as follows: First, compute the interquartile range of the
outcome variable, which can be calculated in R by typing IQR(variable name).
For the PISA 2009 reading scale, the interquartile range
is 151, and this serves as a rough estimate of the scale parameter for the Cauchy
distribution. Because we are dealing with the C+ distribution, it is reasonable
to take the square root of this Cauchy scale parameter and then one-half of that
result to yield the scale parameter for the C+ distribution. For our case, the scale
parameter for the C+ is set to 6. Note that this example uses a large sample size, so
most choices for the Cauchy scale are fine.
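The ad hoc elicitation just described amounts to a one-line calculation. A small sketch in Python (in R, IQR() would supply the interquartile range), using the IQR of 151 reported above:

```python
import math

# Ad hoc elicitation of the C+ scale described in the text,
# using the PISA 2009 U.S. reading IQR of 151 reported there.
iqr = 151                                   # IQR(PV1READ) in R
rough_cauchy_scale = iqr                    # rough estimate of the Cauchy scale
cplus_scale = math.sqrt(rough_cauchy_scale) / 2   # square root, then one-half
print(round(cplus_scale))                   # → 6
```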
We now define the information needed for the algorithm. We are requesting
four chains (nChains), 30,000 iterations (nIter) with 10 thinning steps (thinSteps).5
The number of warmup iterations (warmupSteps) will be half the number of iterations.
Thus, the results will be based on 6,000 draws from the posterior distribution.
nChains = 4
nIter= 30000
thinSteps = 10
warmupSteps = floor(nIter/2)
readscore = data.list$readscore
myfitRead = stan(data=data.list,model_code=ReadingLit,chains=nChains,
iter=nIter,warmup=warmupSteps,thin=thinSteps)
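The draw count stated above can be verified with a quick bit of arithmetic; a small Python sketch mirroring the sampler settings:

```python
# Bookkeeping for the sampler settings above: with warmup set to half of
# nIter, each chain keeps (nIter - warmup) / thin post-warmup draws.
n_chains, n_iter, thin = 4, 30_000, 10
warmup = n_iter // 2
draws_per_chain = (n_iter - warmup) // thin   # 1,500 per chain
total_draws = n_chains * draws_per_chain
print(total_draws)  # → 6000
```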
Convergence Diagnostics
As noted earlier, it is a matter of necessity to inspect the diagnostics to ensure
convergence before examining the results. We begin with the trace plots using the
mcmc_trace command.
color_scheme_set("gray")
stantraceMu <- mcmc_trace(myfitRead,inc_warmup=T, pars="mu")
stantraceMu + ylim(490,510)
5. For this simple problem, considerably fewer iterations would be needed, but it was of interest to show diagnostics in the “best case scenario.”
FIGURE 4.2. Trace plots for the mean and standard deviation.
An inspection of the trace plots in Figure 4.2 above reveals a reasonably tight band
across the history of the chains. We do not see any serious separations among the
chains. We conclude from these plots that there is no evidence of non-convergence.
Next we request the posterior density plots for the mean and standard deviation.
We find that the density plots for mu and sigma in Figure 4.3 have a relatively
smooth bell shape. Next, we request the autocorrelation plots for mu and sigma.
Notice that the autocorrelations in Figure 4.4 decrease very quickly to zero, as
desired. A small amount of negative autocorrelation is observed, again indicating
fast convergence to the posterior distribution.
In this next chunk of code we print the results of this simple example shown
below in Table 4.1.
print(myfitRead,pars=c("mu","sigma"))
Note that the ratio of the effective sample size (n_eff) to the total number of post-warmup
iterations is very close to 1.0 for both the mean and standard deviation.
This indicates that a very large percentage of the post-warmup draws are independent.
Note also that with 6,000 draws over 4 chains, the split-R̂ (denoted as
Rhat) was calculated by splitting each chain in half, yielding 8 chains of 750 draws
each. The Rhat value is 1.0 for both the mean and standard deviation, which provides
yet another indicator of convergence of the chains to stationary posterior distributions.
The results in Table 4.1 indicate that under the assumptions made by the
chosen priors, the posterior mean 2018 reading score is estimated to be 500.18 and
the standard deviation is estimated to be 108.46. For each parameter, in addition
to the posterior mean, we also obtain the standard deviation of the posterior
distribution for that parameter. So, the standard deviation of the posterior
distribution of mu is 1.52 and for sigma it is 1.10. The Monte Carlo standard
error of the posterior mean is then obtained by dividing the posterior standard
deviation by the square root of the effective sample size, SE_mean = SD/√n_eff.
The smaller the standard error, the closer the estimate is expected to be to the true
value.
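The formula can be applied directly to the posterior standard deviations reported above. In this Python sketch (the book's analyses are in R), the n_eff of 6,000 is an assumed value for illustration — the table reports the actual effective sample sizes:

```python
import math

def mcse_mean(posterior_sd, n_eff):
    """Monte Carlo standard error of the posterior mean: SD / sqrt(n_eff)."""
    return posterior_sd / math.sqrt(n_eff)

# Posterior SDs of 1.52 (mu) and 1.10 (sigma) come from Table 4.1;
# n_eff = 6000 is an assumption for illustration.
print(round(mcse_mean(1.52, 6000), 3))  # mu:    ≈ 0.02
print(round(mcse_mean(1.10, 6000), 3))  # sigma: ≈ 0.014
```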
Turning to the posterior probability intervals (PPIs), we find that there is a 0.95
probability that the true reading mean is between 497.18 and 503.10. It may be
interesting to note that the actual 2018 mean reading score for the U.S. was 505. Thus
the analysis here yielded a very close estimate of the obtained PISA reading score
for the U.S., which is most likely due to the very large sample size.
library(coda)
newMyFitRead <- As.mcmc.list(myfitRead)
HPDinterval(newMyFitRead, prob=0.95)
[[2]]
         lower  upper
mu      497.21 503.20
sigma   106.45 110.69

[[3]]
         lower  upper
mu      497.13 503.17
sigma   106.28 110.53

[[4]]
         lower  upper
mu      497.29 502.93
sigma   106.37 110.65
The values of the HPDs for both parameters are quite similar across chains and
also similar to the 95% PPI in Table 4.1. This is expected given the mostly smooth
bell shape of the posterior distributions in Figure 4.3.
A tail-area calculation shows that about 55% of the U.S. sample have reading literacy
scores above the OECD international average. We may also wish to know the percentage
of the U.S. sample that lies between the top-performing country in 2009 (Shanghai-China,
with a reading score of 556) and the U.S. average. This calculation shows that about
19% of the U.S. sample have reading scores between the U.S. mean of 500.18 and the
Shanghai-China mean of 556. It's important to emphasize that
obtaining these types of interval summaries is possible due solely to the fact that
we are working directly with the posterior distribution of the model parameters.
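Such tail areas follow from the estimated score distribution N(500.18, 108.46). The book computes them with R's pnorm; here is a Python sketch in which the OECD average of 487 is an assumed reference value for illustration:

```python
from statistics import NormalDist

# Estimated reading-score distribution from the posterior summaries above.
scores = NormalDist(mu=500.18, sigma=108.46)

oecd_avg = 487   # assumed 2018 OECD reading average (illustration only)
above_oecd = 1 - scores.cdf(oecd_avg)

# Probability mass between the U.S. mean and the Shanghai-China mean of 556
between = scores.cdf(556) - scores.cdf(500.18)

print(f"{above_oecd:.2f}")   # ≈ 0.55
print(f"{between:.2f}")      # ≈ 0.20
```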
to as variational inference (see e.g. Jordan, Ghahramani, Jaakkola, & Saul, 1999).
This section provides a brief intuitive introduction to classical VB as implemented
in Stan. Excellent tutorials on VB geared toward the statistics community can be
found in Blei, Kucukelbir, and McAuliffe (2017) and in Tran, Nguyen, and Dao
(2021). A nice introduction to VB with applications to item response theory in the
psychometric literature can be found in Ulitzsch and Nestler (2022).
The basic idea behind VB is to use optimization rather than sampling to
approximate the target posterior distribution. In the first step, a family of approximate
densities F over the (possibly vector-valued) parameters θ is proposed. These
distributions are controlled by a set of auxiliary parameters, denoted by γ,
that are used in the approximation. For example, the approximating distributions
could be Gaussian with parameters γ. Next, the algorithm attempts to find a
specific distribution, say, fγ ∈ F that minimizes the Kullback-Leibler divergence
(KLD) between fγ (θ | y) and the posterior distribution p(θ | y) defined in Section
2.1. Briefly, the KLD between two distributions, fγ (θ | y) and p(θ | y) can be written
as

KLD(f | p) = − ∫ f_γ(θ | y) log [ p(θ | y) / f_γ(θ | y) ] dθ    (4.13)
where, in this case, KLD( f | p) is the information lost when fγ (θ | y) is used to
approximate p(θ | y). The objective is to minimize the divergence from fγ (θ | y) to
p(θ | y).
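For two Gaussians, the KLD in Equation (4.13) has a well-known closed form, which makes the objective concrete. A Python sketch (the closed-form expression is standard, not taken from the book):

```python
import math

def kld_gaussians(f_mu, f_sigma, p_mu, p_sigma):
    """Closed-form KLD(f || p) for two univariate Gaussian densities."""
    return (math.log(p_sigma / f_sigma)
            + (f_sigma**2 + (f_mu - p_mu)**2) / (2 * p_sigma**2)
            - 0.5)

# The divergence is zero only when the approximation matches the target
print(kld_gaussians(0, 1, 0, 1))   # → 0.0
print(kld_gaussians(0, 1, 2, 1))   # → 2.0
```

VB searches over γ (here the mean and standard deviation of f) to make this quantity as small as possible.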
Minimizing this divergence can be shown to be equivalent to maximizing a quantity
known as the ELBO, where ELBO stands for the evidence lower bound, so called
because it is a lower bound on the evidence p(y).
The next step in VB is to specify the approximating distributions to p(θ | y).
The VB algorithm, as implemented in Stan, offers two choices: so-called mean field
VB and full rank VB. For mean-field VB, first we assume that f (θ) can be factored
into the product of individual parameters
f(θ) = ∏_{p=1}^{P} f(θ_p)    (4.16)
Notice that Equation (4.17) has a form very similar to Gibbs sampling discussed
in Section 4.3 where the expectation is taken with respect to parameters not under
consideration (Ulitzsch & Nestler, 2022). In contrast, full-rank VB relaxes the
assumption of independent parameters, which, while more realistic, becomes somewhat
burdensome for complex high-dimensional models. For this discussion, we will focus
our attention on mean-field VB.
The importance sampling estimate for the function h is based on finding a simpler
auxiliary distribution g(θ) and obtaining
ĥ = [ Σ_{s=1}^{S} r_s h(θ_s) ] / [ Σ_{s=1}^{S} r_s ]    (4.19)

where

r_s = r(θ_s) = p(θ_s) / g(θ_s)    (4.20)
are the importance ratios. The idea behind Equations (4.19) and (4.20) is that the
importance ratios serve as weights to correct for the fact that g(θ) is being used
instead of p(θ) to draw inferences about h(θ).
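The self-normalized estimate in Equations (4.19) and (4.20) takes only a few lines. A Python sketch with an assumed standard-normal target (known only up to a constant) and an N(0, 2) proposal, estimating the second moment h(θ) = θ²:

```python
import numpy as np

rng = np.random.default_rng(7)

def target_unnorm(x):
    """Target p, here N(0, 1) known only up to a normalizing constant."""
    return np.exp(-0.5 * x**2)

S = 100_000
theta = rng.normal(0, 2, S)                  # draws from the proposal g = N(0, 2)
g_pdf = np.exp(-0.5 * (theta / 2)**2) / 2    # proposal density (up to a constant)
r = target_unnorm(theta) / g_pdf             # importance ratios, Eq. (4.20)
h = theta**2                                 # h(theta): second moment of the target
estimate = np.sum(r * h) / np.sum(r)         # self-normalized estimate, Eq. (4.19)
print(round(estimate, 2))                    # ≈ 1.0, the variance of N(0, 1)
```

Because the estimate is self-normalized, the target density only needs to be known up to a constant — the same property that makes the method attractive for posterior inference.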
A difficulty with importance sampling is that the importance weights rs can be
noisy, and this can happen when the proposal distribution g(θ) is quite different
from the target distribution p(θ). To remedy this issue, Yao et al. (2018b) advocate
the use of the generalized Pareto distribution, whose density is defined as

p(y | µ, σ, k) = (1/σ) [1 + k(y − µ)/σ]^(−1/k − 1),   k ≠ 0    (4.21)
p(y | µ, σ, k) = (1/σ) exp[−(y − µ)/σ],               k = 0
where k is a shape parameter. The idea is that the generalized Pareto distribution
is fit to the L largest importance ratios, where L is set at min(S/5, 3√S), where,
again, S is the total number of samples from the posterior distribution. Then, the
L largest importance ratios are replaced by their expected values under the Pareto
distribution. This defines Pareto-smoothed importance sampling (PSIS).
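The tail size L is simple bookkeeping; a Python sketch of the rule just stated:

```python
import math

def psis_tail_size(S):
    """Number of largest importance ratios fit by the generalized Pareto:
    L = min(S/5, 3*sqrt(S)), per Yao et al. (2018b)."""
    return int(min(S / 5, 3 * math.sqrt(S)))

print(psis_tail_size(4000))   # → 189
print(psis_tail_size(100))    # → 20
```

For large S the 3√S term dominates, so the fitted tail stays a small fraction of the total draws.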
As pointed out by Yao et al. (2018b), the shape parameter k can be used
as a diagnostic for VB. Specifically, the value of k determines the finite sample
convergence rate for PSIS and signals how well the proposal distribution f(θ)
approximates p(θ | y). Studies by Vehtari, Simpson, et al. (2021) have shown
that k̂ < 0.5 suggests fast convergence such that f (θ) can be safely used as an
approximation to p(θ | y). If 0.5 < k̂ < 0.7, then the approximation may be useful,
but it is not perfect. Finally, if k̂ > 0.7, then convergence will be quite slow and
results from VB should not be trusted. Indeed, Stan will generate a warning that
perhaps MCMC should be used instead. We can summarize the approach in Stan
following the steps outlined in Yao et al. (2018b):
1. Run VB to obtain f(θ);
2. Take samples θs (s = 1, . . . , S) from f(θ);
3. Calculate the importance ratios rs;
4. Fit the generalized Pareto distribution to the L largest importance ratios;
5. Note the shape parameter k̂;
6. If k̂ < 0.5, conclude that the VB approximation f(θ) is close enough to p(θ | y) to be used in its place;
7. If 0.5 ≤ k̂ < 0.7, conclude that the approximation may be useful though not perfect;
8. If k̂ ≥ 0.7, conclude that the approximation may not be reliable and that convergence will be noticeably slow.
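The diagnostic decision rule can be expressed as a small function; a Python sketch (the message strings here are illustrative, not Stan's actual warnings):

```python
def psis_khat_diagnostic(k_hat):
    """Decision rule for the Pareto shape diagnostic (Yao et al., 2018b)."""
    if k_hat < 0.5:
        return "good: VB approximation can be used in place of the posterior"
    elif k_hat < 0.7:
        return "caution: approximation may be useful though not perfect"
    else:
        return "unreliable: slow convergence, prefer MCMC"

print(psis_khat_diagnostic(0.3))
print(psis_khat_diagnostic(0.65))
print(psis_khat_diagnostic(0.9))
```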
It should be noted that the VB algorithm as implemented in Stan is somewhat
experimental and should be used with caution. Indeed, a recent paper by Ulitzsch
and Nestler (2022) that focused on item response theory models found via a
simulation study that some, but not all, parameters were severely biased when
compared to marginal maximum likelihood and MCMC, and these results were
further corroborated with a case study using data from PISA 2018. They conclude
that Stan’s implementation of VB was not viable for multidimensional IRT models,
and may be further challenged by more complicated models. Ulitzsch and Nestler
(2022) do call for additional research on VB for complex problems insofar as
MCMC is computationally demanding (though somewhat more stable) in complex
large-scale data scenarios. Indeed, research continues on developing fast and
stable VB estimation (see, e.g., Tomasetti, Forbes, & Panagiotelis, 2022; Dang &
Maestrini, 2022). Nevertheless, given these concerns, and the fact that we will
be demonstrating Bayesian methods primarily with large-scale data, we will not
demonstrate VB with the examples presented throughout this book. However, a
situation where VB seems to work well is in addressing the so-called label-switching
problem in Bayesian latent class analysis, and we will demonstrate an application
of VB as implemented in Stan to this problem in Chapter 8.
4.9 Summary
Markov chain Monte Carlo sampling and Hamiltonian Monte Carlo have revolu-
tionized Bayesian statistical practice by making it possible to accurately estimate
the posterior distribution of model parameters. Three algorithms were reviewed
in this chapter — the Metropolis-Hastings algorithm, the Gibbs sampler, and
Hamiltonian Monte Carlo with No-U-Turn sampling. Convergence diagnostics
were presented along with a simple example using the Stan software language.
The importance of monitoring convergence cannot be overstated, insofar as MCMC can
be computationally intensive, especially for complex models. We also briefly discussed
an alternative algorithm — variational Bayes — which has considerable advantages in
terms of speed but should be used with caution when accurate inference is of specific concern.
Part II
BAYESIAN
MODEL
BUILDING
5
Bayesian Linear and Generalized
Models
This chapter focuses on Bayesian linear and generalized linear models and sets
the groundwork for later chapters insofar as many, if not most, of the statistical
methodologies used in the social sciences have, at their core, the linear or gener-
alized linear regression model. We begin with the linear model after which we
will examine generalized linear models, including logistic regression, multinomial
logistic regression, Poisson, and negative binomial regression.
y = Xβ + u (5.1)
u ∼ N(0, σ2 I) (5.3)
The assumptions in Equations (5.2) and (5.3) give rise to the Gaussian linear
regression model with homoskedastic disturbances (see e.g. Fox, 2008).
From standard linear regression theory, the probability of the data X and y
given the model parameters β and σ² can be written as

p(X, y | β, σ²) = (2πσ²)^(−n/2) exp{ −(1/2σ²) (y − Xβ)′(y − Xβ) }    (5.4)
Notice that estimation of β hinges on minimizing the residual sum of squares
(y − Xβ)′(y − Xβ) in the exponent of Equation (5.4). Expanding the residual sum
of squares and taking the derivative with respect to β, we obtain

∂/∂β (y′y − 2β′X′y + β′X′Xβ) = −2X′y + 2X′Xβ    (5.6)

Setting Equation (5.6) to zero yields the normal equations X′Xβ = X′y, from which we obtain

β̂ = (X′X)⁻¹X′y    (5.7)

σ̂² = (y − Xβ̂)′(y − Xβ̂) / n    (5.8)
We recognize that Equation (5.7) is the same as that obtained under ordinary
least squares. However, Equation (5.8) differs from the unbiased least squares
estimator (y − Xβ̂)′(y − Xβ̂)/(n − Q).
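These closed-form estimators are easy to verify numerically. A minimal Python sketch with simulated data (the book's own analyses are in R; the data here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a small regression problem: n observations, Q coefficients
n, Q = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, Q - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(0, 1.5, n)

# Eq. (5.7): solve the normal equations X'X beta = X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
sigma2_ml = (resid @ resid) / n          # Eq. (5.8), the ML estimator
sigma2_ols = (resid @ resid) / (n - Q)   # unbiased least squares version

print(beta_hat.round(2))
print(round(sigma2_ml, 2), round(sigma2_ols, 2))
```

The two variance estimators differ only in the divisor, so the unbiased version is always slightly larger in finite samples.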
Recall from Chapter 2 that the first step in a Bayesian analysis is the specifi-
cation of the prior distributions for all model parameters. Recall also that we can
consider three broad classes of prior distributions: (1) non-informative priors that
reflect no prior knowledge or information about the location or shape of the dis-
tribution of the model parameters, (2) weakly-informative priors that place some
constraints on the model parameters for the purposes of aiding computation, but
otherwise mostly encode little information, and (3) informative-conjugate priors
that specify prior knowledge or information about model parameters. Multiplying
the likelihood by a conjugate prior yields a posterior distribution that is in
the same family of distributions as the prior.
over a very large range. A Gaussian distribution with a mean of zero and a
large standard deviation would accomplish this goal. Next, we might choose to
assign a C+ prior to σ. From here, the joint posterior distribution of the model
parameters is obtained by multiplying the prior distributions of β and σ² by the
likelihood given in Equation (5.4).
For this example we use data from PISA 2018 to estimate a model relating read-
ing proficiency to a set of background, attitudinal, and reading strategy variables.
The sample comes from 4,838 PISA-eligible students in the United States. Variables
included in this model are FEMALE (female=1, male = 0); economic, social, and
cultural status of the student (ESCS); an index measuring the awareness of using
summarizing strategies in obtaining information from text (METASUM); an index
on the perceived extent to which the teacher gives feedback on students’ strengths
in reading (PERFEED); an index of student enjoyment of reading (JOYREAD); an
index of perceived teacher’s adaptivity of instruction to student and class learning
needs (ADAPTIVITY); a measure of the extent to which the teacher is interested
in every students’ learning (TEACHINT); a scale of students’ mastery-approach
orientation of achievement goals (MASTGOAL); an index of perceived difficulty
in reading (SCREADDIFF); and an index of perceived competency in reading
(SCREADCOMP). The first plausible value of the reading assessment was used
as the dependent variable (READSCORE).1 For more detail on these scales, see
OECD (2018).
The linear model can be written as

READSCORE_i = α + β1 FEMALE_i + β2 ESCS_i + β3 METASUM_i + β4 PERFEED_i + β5 JOYREAD_i + β6 MASTGOAL_i + β7 ADAPTIVITY_i + β8 TEACHINT_i + β9 SCREADDIFF_i + β10 SCREADCOMP_i + u_i

We begin the analysis by loading the required R packages.

library(rstan)
library(loo)
library(bayesplot)
library(dplyr)
1. Plausible values were developed as a means of obtaining consistent estimates of population characteristics in large-scale assessments such as PISA, where students are administered too few items to allow precise ability estimates. Plausible values represent random draws from an empirical proficiency distribution conditioned on the observed responses to the assessment items and background variables (Mislevy, 1991).
The Stan code comes next in which we specify the dimensions and types of
variables in the data block.
ReadingReg = "
data {
int<lower=0> n;
vector [n] readscore;
vector [n] Female; vector [n] ESCS;
vector [n] METASUM; vector [n] PERFEED;
vector [n] JOYREAD; vector [n] MASTGOAL;
vector [n] ADAPTIVITY; vector [n] TEACHINT;
vector [n] SCREADDIFF; vector [n] SCREADCOMP;
}
In the following parameters block, we specify each parameter in the model and
provide its scale of measurement. Notice that defining sigma as real<lower=0>
will ensure that sigma is on the positive real line.
parameters {
real alpha;
real beta1; real beta6;
real beta2; real beta7;
real beta3; real beta8;
real beta4; real beta9;
real beta5; real beta10;
real<lower=0> sigma;
}
Bayesian Linear and Generalized Models 77
In the following model block, we write out our regression model, provide priors
for each of the model parameters, and specify the likelihood. For this example,
non-informative N(0, 10) priors were specified for the intercept and all regression
coefficients. The standard deviation of the disturbance term was given a C+(0, 6)
prior. Because the scale of sigma was specified as positive in the parameters
block, the cauchy command yields a C+ distribution.
model {
real mu[n];
for (i in 1:n)
mu[i] = alpha + beta1*Female[i] + beta2*ESCS[i]
+ beta3*METASUM[i]
+ beta4*PERFEED[i] + beta5*JOYREAD[i] + beta6*MASTGOAL[i]
+ beta7*ADAPTIVITY[i] + beta8*TEACHINT[i]
+ beta9*SCREADDIFF[i] + beta10*SCREADCOMP[i] ;
// Non-informative Priors
alpha ~ normal(0, 10);
beta1 ~ normal(0, 10); beta6 ~ normal(0, 10);
beta2 ~ normal(0, 10); beta7 ~ normal(0, 10);
beta3 ~ normal(0, 10); beta8 ~ normal(0, 10);
beta4 ~ normal(0, 10); beta9 ~ normal(0, 10);
beta5 ~ normal(0, 10); beta10 ~ normal(0, 10);
sigma ~ cauchy(0, 6);
// Likelihood
readscore ˜ normal(mu, sigma);
}
"
For this analysis we use four chains with 5,000 warmup iterations and 5,000 post-warmup
iterations per chain and a thinning interval of 10. Posterior results are
therefore based on a total of 2,000 draws. The analysis is run using the following
commands:
nChains = 4
nIter= 10000
thinSteps = 10
warmupSteps = floor(nIter/2)
readscore = data.list$readscore
myfit = stan(data=data.list,model_code=ReadingReg,
chains=nChains,iter=nIter,warmup=warmupSteps,thin=thinSteps)
FIGURE 5.1. Trace plots for regression example under non-informative priors.
For each parameter, we find a nice tight band for each chain and good mixing.
This is not terribly surprising insofar as the data are well behaved and the model
is not particularly complex. We can combine this information with the split-chain
Rhat statistics in Table 5.1 below, which are based on eight split half-chains, and
conclude that there is good mixing and no evidence of non-stationarity of the chains.
Figure 5.2 below uses the stan_dens command to display the posterior density
plots for each of the parameters in the model.
stan_dens(myfit2NonInf, fill="gray",pars=c("alpha","beta1",
"beta2", "beta3","beta4", "beta5","beta6","beta7", "beta8","beta9",
"beta10","sigma"))
FIGURE 5.2. Density plots for regression example under non-informative priors.
Not all density plots exhibit a smooth bell-shaped curve as desired, especially for
beta6. A larger number of iterations might help improve the shape but, as we'll
see, other diagnostics suggest that the shape of these density plots might
not be much of a problem.
Using the stan_ac command, Figure 5.3 below displays the autocorrelation
plots.
FIGURE 5.3. ACF plots for regression example under non-informative priors.
The ACF plots show that the algorithm achieves mostly independent samples very
quickly. This information can be combined with the effective sample sizes for each
parameter (Table 5.1) to further gauge the extent of independent samples. We find
that the ratio of n_eff to the total number of draws of 2,000 is never greater
than 103% (beta9) and never less than 93% (sigma).
This further indicates that the samples are effectively independent.
print(myfit2NonInf,pars=c("alpha","beta1", "beta2",
"beta3", "beta4", "beta5", "beta6","beta7", "beta8",
"beta9","beta10","sigma"),probs = c(0.025, 0.975))
TABLE 5.1. Posterior results for reading literacy score under non-informative priors
The results show that for all but beta7 (the effect of ADAPTIVITY on READSCORE),
the posterior probability intervals do not cover zero. As discussed in Chapter 4, it
is possible to assess the probability that beta7 is greater than zero, despite zero
being inside its 95% posterior probability interval. Using the pnorm
function in R, we find that the probability that beta7 exceeds zero, given its
posterior mean of 2.05, is approximately 0.89. Thus, even though zero is
in the 95% credible interval, the majority of the posterior distribution lies to the
right of zero. It may, however, be of interest to see how close the posterior mean
is to zero. This too can be obtained from the posterior distribution using the
pnorm function in R. The value we obtain is 0.39. Content area expertise would be
needed to decide whether this result is substantively important, but it is crucial
to note that this calculation is not possible in the frequentist framework and
highlights the nuanced analyses that can be conducted within the Bayesian framework.
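The 0.89 figure follows from a normal approximation to beta7's posterior. In this Python sketch (the book uses R's pnorm), the posterior SD of 1.67 is an assumed value chosen for illustration — Table 5.1 reports the actual SD:

```python
from statistics import NormalDist

# Posterior mean of 2.05 comes from Table 5.1; the SD of 1.67 is an
# assumed value for illustration only.
beta7 = NormalDist(mu=2.05, sigma=1.67)
p_positive = 1 - beta7.cdf(0)
print(f"{p_positive:.2f}")   # ≈ 0.89
```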
Example 5.2: Bayesian Linear Regression Model Using PISA 2018 Data:
Informative Priors
prior distribution, and in particular on the value of the precision of the
hyperparameters. Informative priors for this example come from an analysis of the
same model using data from PISA 2009.
The most sensible conjugate prior distribution for the individual regression
coefficients in β is the Gaussian prior. The argument for using the Gaussian
distribution as the prior for each β lies in the fact that the asymptotic distribution
of regression coefficients is Gaussian (Fox, 2008). Moreover, the Gaussian prior is
a conjugate distribution for the regression coefficients and will result in a Gaussian
posterior distribution for these model parameters.
The conditional prior distribution of the vector β given σ² can be written as

p(β | σ²) = (2π)^(−Q/2) |Σ|^(−1/2) exp{ −(1/2)(β − B)′Σ⁻¹(β − B) }    (5.10)

where Q is the number of variables, B is the vector of mean hyperparameters
assigned to β, and Σ = σ²I is the diagonal matrix of constant disturbance variances.
The conjugate prior for the variance of the disturbance term σ2 is (from Chapter
3), the inverse-gamma distribution, with hyperparameters a and b. We write the
conjugate prior density for σ² as

p(σ²) ∝ (σ²)^−(a+1) e^(−b/σ²)    (5.11)
With the likelihood L(β, σ2 | X, y) defined in Equation (5.4) as well as the prior
distributions p(β | σ2 ) and p(σ2 ), we have the necessary components to obtain the
joint posterior distribution of the model parameters given the data. Specifically,
the joint posterior distribution of the parameters β and σ² is given as

p(β, σ² | X, y) ∝ L(β, σ² | X, y) p(β | σ²) p(σ²)
With regard to the Stan code, we leave all of the code intact with the exception
of the block of code in the model statement that specifies the prior distributions.
Here we place either weakly-informative or informative priors on the model pa-
rameters as follows:
// Informative Priors
alpha ~ normal(500, 5);
beta1 ~ normal(10, 2); beta6 ~ normal(0, 2);
beta2 ~ normal(30, 5); beta7 ~ normal(0, 2);
FIGURE 5.4. Trace plots for regression example under informative priors.
FIGURE 5.5. Density plots for regression example under informative priors.
FIGURE 5.6. ACF plots for regression example under informative priors.
TABLE 5.2. Posterior results for reading literacy score under informative priors
From a substantive point of view, the results have changed in important ways,
suggesting a degree of sensitivity to the choice of priors. For example, for the
non-informative priors case in Table 5.1, we found that the posterior probability
interval for the effect of ADAPTIVITY on reading score (beta7) covered zero. In
Table 5.2, we find that the PPI for beta7 does not cover zero, and moreover, the probability
that the posterior mean estimate of 2.59 is greater than zero is 0.49. Other estimates
exhibited relatively large changes. A direct comparison of these two models in
terms of model fit and model selection is deferred to Chapter 6.
bution might not be Gaussian at all. For example, an outcome of interest might be
dichotomous, such as a response to the question “Did you vote in the last national
election? (Yes/No)”. Or, an outcome might be in the form of a count – for example,
a response to a question such as “How many times this week did you read to your
child?”
In both of these examples, the use of Gaussian theory-based linear regression
would lead to biased and inefficient estimates and, moreover, a loss in the
richness of interpretation if the Gaussian model were used for these
data. Rather, it is best to apply models that explicitly account for the probability
model generating the data. In the example of the dichotomous outcome, the
appropriate probability model would be based on the binomial distribution, and
in the example of the count outcome, the appropriate probability model would be
based on the Poisson distribution. Both distributions (along with their conjugate
priors) were discussed in Chapter 3. Incorporating alternative probability models
in the context of regression has led to the generalized linear model.
In this section, we describe the so-called link function, which provides a convenient
framework for moving between non-linear and linear models (McCullagh
& Nelder, 1989). We then provide an empirical example utilizing Bayesian logistic
regression.
yi | θ ∼ bin(n, θ) (5.14)
Recall that the link function is the logarithm of the odds ratio, that is,

ln[µ/(1 − µ)]    (5.15)
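The link and its inverse can be sketched in a few lines of Python (in the book's models, Stan's bernoulli_logit handles this mapping internally):

```python
import math

def logit(mu):
    """Log-odds link: ln(mu / (1 - mu)), as in Eq. (5.15)."""
    return math.log(mu / (1 - mu))

def inv_logit(eta):
    """Inverse link, mapping a linear predictor back to a probability."""
    return 1 / (1 + math.exp(-eta))

print(logit(0.5))                          # → 0.0 (even odds)
print(round(inv_logit(logit(0.8)), 1))     # → 0.8 (round trip)
```

The linear predictor β0 + β1 x1 + · · · + βQ xQ lives on the unbounded logit scale, and the inverse link maps it back to the (0, 1) probability scale.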
As usual, the goal is to find the joint posterior distribution of the model
parameters via Bayes’ theorem, that is,
From here, different types of priors could be chosen for the intercept β0 and
the slopes β1 , . . . , βQ reflecting a lesser or greater degree of prior informa-
tion on the model parameters. The resulting joint posterior distribution,
p(β0 . . . βQ | y1 , . . . , yn ), will be Gaussian.
ReadLogistic ="
data {
int <lower=0> n;
vector[n] Female; vector[n] ESCS;
vector[n] COMPETE; vector[n] GFOFAIL;
vector[n] MASTGOAL; vector[n] BELONG;
int <lower=0,upper=1> SWBP[n];
}
parameters {
real alpha;
real beta1; real beta4;
real beta2; real beta5;
real beta3; real beta6;
}
Next we define the logistic regression model in the model block by modeling
subjective well-being using the bernoulli_logit function in Stan. This function has
been found to be more numerically stable when models for dichotomous outcomes
are parameterized on the logit scale (Stan Development Team, 2021a). Note that
non-informative priors are specified for the model parameters in this example.
model {
for (i in 1:n) {
SWBP[i] ~ bernoulli_logit(alpha + beta1*Female[i] +
beta2*ESCS[i] + beta3*COMPETE[i] + beta4*GFOFAIL[i]
+ beta5*MASTGOAL[i] + beta6*BELONG[i]);
}
alpha ~ normal(0, 1);
beta1 ~ normal(0, 10); beta4 ~ normal(1, 5);
beta2 ~ normal(0, 10); beta5 ~ normal(1, 5);
beta3 ~ normal(0, 10); beta6 ~ normal(1, 5);
}
"
Convergence Diagnostics
In Figures 5.7 - 5.10 below we show the convergence diagnostics for the logistic
regression model based on 2,000 draws from the posterior distribution (10,000
total draws, four chains, and a thinning interval of 10).
FIGURE 5.7. Trace plots for logistic regression example under informative priors.
The trace plots do exhibit some degree of “stickiness,” but overall the chains seem to
mix well.
FIGURE 5.8. Density plots for logistic regression example under informative priors.
The density plots exhibit a relatively nice bell shape with the possible exception of
beta3.
FIGURE 5.9. ACF plots for logistic regression example under informative priors.
The ACF plots exhibit an immediate transition to almost independent draws. The
posterior results for Bayesian logistic regression are shown below in Table 5.4.
We find that zero is in the 95% PPI for Female only. The pnorm function discussed
in Chapter 4 can be used to examine the probability that any of these effects are
greater than zero.
n <- nrow(Canadareg2)
f <- as.formula("DOWELL ~ Female + booksHome + motivRead")
m <- model.matrix(f,Canadareg2)
data.list <- list(n=nrow(Canadareg2),
C=length(unique(Canadareg2[,1])),
P=ncol(m), x=m, Female=Female, booksHome=booksHome, motivRead=motivRead,
DOWELL=as.numeric(Canadareg2[,1]))
The Stan code comes next, in which the data block defines the data matrix, which is
of order n × Q, where Q is the number of predictors. Note again that C represents
the number of unique categories of the outcome variable, and lower=2 specifies
that this variable has at least two categories.
We next multiply the x matrix by the beta matrix. Note that x is n × Q and beta is
Q × C, and so the resulting x_beta matrix is n × C.
parameters {
  matrix[P, C] beta;
}
transformed parameters {
  matrix[n, C] x_beta = x * beta;
}
In the following model block, we use the to_vector utility in Stan, which vectorizes
the matrix beta and allows assigning the same prior to all elements of beta.
model {
  to_vector(beta) ~ normal(0, 2);
  for (i in 1:n) {
    DOWELL[i] ~ categorical_logit(x_beta[i]');
  }
}
generated quantities {
  int<lower=1, upper=C> DOWELL_rep[n];
  vector[n] log_lik;
  for (i in 1:n) {
    DOWELL_rep[i] = categorical_logit_rng(x_beta[i]');
    log_lik[i] = categorical_logit_lpmf(DOWELL[i] | x_beta[i]');
  }
}
TABLE 5.5. Posterior results for multinomial logistic regression on self-reported reading
competency
The regression coefficients in Table 5.5 are interpreted as follows. The first index
of each beta (e.g., beta[1,1]) refers to the variable in the model and the second
index refers to the level of the categorical outcome. So, beta[1,1] is the intercept
for the first category of DOWELL. Similarly, beta[2,1] is the Female effect on the
first category of DOWELL, and so on. As these regression coefficients are in the
logit metric, they can be converted into probabilities to reflect the probability that,
say, a male would answer category c using the formula
$$p(y = c) = \frac{e^{\beta_c}}{\sum_{c=1}^{C} e^{\beta_c}} \tag{5.19}$$
where βc is the coefficient for the particular categorical outcome and level of x.
For example, recalling that Females are coded 1 and that the first category of
DOWELL is “strongly agree,” we find that the proportion of females endorsing
this category is $e^{0.54}/(e^{0.54} + e^{0.02} + e^{-0.28} + e^{-0.23}) = 0.40$, indicating that less
than half of the females in this sample strongly agree that they do well in reading.
In contrast, examining the last category, only 19% of the females in this sample
strongly disagree that they do well in reading. If we calculate these probabilities
for all C = 4 categories for females and add them up, they will sum to 1.0.
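These category probabilities are just a softmax over the coefficients in Equation (5.19), which can be verified in a few lines (Python here for illustration, using the coefficient values reported above):

```python
import math

# Posterior mean coefficients for Females across the C = 4 categories of
# DOWELL, in the logit metric, as reported in the text.
betas = [0.54, 0.02, -0.28, -0.23]

# Equation (5.19): p(y = c) = exp(beta_c) / sum over c of exp(beta_c)
denom = sum(math.exp(b) for b in betas)
probs = [math.exp(b) / denom for b in betas]

print([round(p, 2) for p in probs])  # first category ~0.40, last ~0.19
```

By construction, the four probabilities sum to 1.0, matching the remark above.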
$$p(k) = \frac{e^{-\theta}\,\theta^{k}}{k!}, \qquad k = 0, 1, 2, \ldots, \quad \theta > 0 \tag{5.20}$$
where k ranges over the whole numbers representing the counts of events. The
link function for the Poisson regression in Table 5.3 allows us to model the count
data in terms of chosen predictors.
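Equation (5.20) can be sketched directly (Python; θ = 5 is an arbitrary illustrative rate, not an estimate from these data):

```python
import math

def poisson_pmf(k: int, theta: float) -> float:
    """Poisson probability mass function of Equation (5.20)."""
    return math.exp(-theta) * theta**k / math.factorial(k)

# With theta = 5 expected absences (an arbitrary illustrative rate),
# the probability of exactly 3 absences:
p3 = poisson_pmf(3, 5.0)
print(round(p3, 4))  # → 0.1404

# The pmf sums to 1 over the whole numbers (truncated sum here)
total = sum(poisson_pmf(k, 5.0) for k in range(100))
```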
This analysis utilizes data from a school administrator’s study of the atten-
dance behavior of high school juniors at two schools. The outcome is a count
of the number of days absent and the predictors include gender of the student
and standardized test scores in math and language arts. The source of the data
is unknown, but it is widely used as an example of regression for count data in a
number of R programs. The data are available on the accompanying website.
We begin by reading in the data and creating the data list for Stan.
We see that the distribution of number of days absent has the typical look of a
Poisson distribution. Next, we set up the Stan code. Notice that in the data block,
we indicate that the variable daysabs is an integer with a lower bound of zero,
which would be the proper metric for a count.
PoissonModel = "
data {
  int<lower=0> n;
  vector[n] math;
  vector[n] langarts;
  vector[n] male;
  int<lower=0> daysabs[n];
}
Next, in the parameters block, we specify a vector for the regression coefficients,
which is an efficient way to provide the same prior to all of them.
parameters {
  vector[4] beta;
}
In the next transformed parameters block, we specify the expected value of days
absent as the exponential of the linear predictor,
transformed parameters {
  vector[n] mu = exp(beta[1] + beta[2]*math
                     + beta[3]*langarts + beta[4]*male);
}
and in the model block next, we specify the Poisson likelihood and place standard
Gaussian priors on the regression coefficients.
model {
  daysabs ~ poisson(mu);
  beta ~ std_normal();
}
generated quantities {
  int<lower=0> daysabs_pred[n] = poisson_rng(mu);
  vector[n] log_lik;
  for (i in 1:n) {
    log_lik[i] = poisson_lpmf(daysabs[i] | mu[i]);
  }
}
"
"
For completeness, the next set of code reads in the relevant information to begin
estimation, summarize the results, obtain necessary diagnostic plots, conduct pos-
terior predictive checking, and obtain loo cross-validation measures to be used
for model comparison, which we will show in the following section on negative
binomial regression. Notice that for this analysis, the total number of draws will
be 2,000.
nChains = 4
nIter = 10000
thinSteps = 10
burnInSteps = floor(nIter/2)
daysabs = data.list$daysabs

PoissRegfit = stan(data=data.list, model_code=PoissonModel,
                   chains=nChains, iter=nIter, warmup=burnInSteps,
                   thin=thinSteps)

stan_plot(PoissRegfit, pars=c("beta"))
stan_trace(PoissRegfit, inc_warmup=T, pars=c("beta"))
stan_dens(PoissRegfit, pars=c("beta"))
stan_ac(PoissRegfit, pars=c("beta"))
print(PoissRegfit, pars=c("beta"))
All diagnostics indicated good convergence of the algorithm. Table 5.6 below
presents the results.
TABLE 5.6. Posterior results for Poisson regression of days absent from school
Variable     Parameter    Mean     SD   2.5% PPI   97.5% PPI   n_eff   Rhat
Intercept    alpha        2.75   0.07       2.61        2.89    2063   1.00
Math         beta1       −0.01   0.00      −0.01        0.00    2170   1.00
Lang. Arts   beta2        0.00  −0.01       0.00        0.03    2047   1.00
Male         beta3       −0.35   0.05      −0.45       −0.26    1920   1.00
We find that the expected number of days absent is about 30% lower for males
than for females: exp(−0.35) = 0.70 and 1 − 0.70 = 0.30. We also find that
a 10-unit increase in language arts scores results in about a 10% decrease in the
expected number of days absent: exp(−0.01 × 10) = 0.90 and 1 − 0.90 = 0.10.
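These back-transformations are worth scripting when reporting results (Python here for illustration; the coefficient values are the posterior means reported in the text):

```python
import math

# Posterior means from the Poisson regression (log metric)
beta_male = -0.35
beta_langarts = -0.01

# Rate ratio for males relative to females
rr_male = math.exp(beta_male)
print(round(rr_male, 2), round(1 - rr_male, 2))  # → 0.7 0.3

# Rate ratio for a 10-unit increase in language arts scores
rr_lang10 = math.exp(beta_langarts * 10)
print(round(rr_lang10, 2), round(1 - rr_lang10, 2))  # → 0.9 0.1
```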
The Stan code for Bayesian negative binomial regression is identical to that for
the Poisson regression in Example 5.5 except for a small change to the parameters
block, where we specify the dispersion parameter ϕ (phi), add it to the likelihood
in the model block, and use it in the generated quantities block to obtain posterior
predictive checks and cross-validation measures. Note that in the model block we
give phi a C+(0, 1) prior. Other priors for the dispersion term are also possible.
NegBinomModel = "
data {
  int<lower=0> n;
  vector[n] math;
  vector[n] langarts;
  vector[n] male;
  int<lower=0> daysabs[n];
}
parameters {
  vector[4] beta;
  real<lower=0> phi;  // dispersion parameter
}
The results for the negative binomial regression are shown below in Table 5.7.
TABLE 5.7. Posterior results for negative binomial regression of days absent from school
Variable     Parameter    Mean     SD   2.5% PPI   97.5% PPI   n_eff   Rhat
Intercept    alpha        2.64   0.22       2.49        3.08    2152   1.00
Math         beta1       −0.01   0.01      −0.02        0.00    2039   1.00
Lang. Arts   beta2       −0.01   0.01      −0.02        0.00    1767   1.00
Male         beta3       −0.34   0.14      −0.43       −0.08    1974   1.00
Dispersion   phi          0.79   0.08       0.66        0.95    1992   1.00
5.7 Summary
This chapter provided the first complete set of analyses of substantive problems
using Bayesian linear and generalized linear regression. The general approach
follows closely the steps of model building, evaluation, and selection within the
frequentist domain. The key differences between the Bayesian and frequentist
approaches to model building are (1) incorporation of prior knowledge as en-
coded into the prior distribution, and (2) interpretation, particularly of posterior
probability intervals.
An example of Bayesian generalized linear modeling was also provided for
logistic regression, Poisson regression, and negative binomial regression. From a
Bayesian perspective, as long as the probability model for the outcome variable is
correctly specified, the issue then centers on the specification of the priors.
This chapter did not look at the specification of interaction terms in the linear
or generalized linear model. The omission of interaction terms in our examples
was for simplicity, but adding interaction terms would not yield any additional
complications. As with conventional frequentist regression models, it might be
necessary to center the variables involved in the interaction. Doing so would
involve taking care to specify the hyperparameters of the prior distributions of
the coefficients associated with the interactions so as to represent the appropriate
metrics of the association between the interaction terms and the outcome.
The next chapter takes up the topic of Bayesian approaches to model evaluation
and model comparison, beginning with a discussion of the problem of hypothesis
testing in the frequentist framework, followed by the Bayesian perspective on the
problem.
6
Model Evaluation and Comparison
when it is false, denoted as β, where 1 − β denotes the power of the test). As Raftery
(1995) has pointed out, the dimensions of this hypothesis are not relevant – that
is, the problem can be as simple as the difference between a treatment group and
a control group, or as complex as a structural equation model. The point remains
that only two hypotheses are of interest in the conventional practice. Moreover,
as Raftery (1995) also notes, it is often far from the case that only two hypotheses
are of interest. This is particularly true in the early stages of a research program,
when a large number of models might be of interest to explore, with equally large
numbers of variables that can be plausibly entertained as relevant to the problem.
The goal is not, typically, the comparison of any one model taken as “true” against
an alternative model. Rather, it is whether the data provide evidence in support
of one of the competing models.
The conflation of Fisherian and Neyman-Pearson hypothesis testing lies in the
use and interpretation of the p-value. In Fisher’s paradigm, the p-value is a matter
of convention with the resulting outcome being based on the data. In contrast, in
the Neyman-Pearson paradigm, α and β are determined prior to the experiment
being conducted and refer to a consideration of the cost of making one or the other
decision error. Indeed, in the Neyman-Pearson approach, the problem is one of
finding a balance between α, power, and sample size. However, even a casual
perusal of the top journals in the social sciences will reveal that this balance is vir-
tually always ignored and α is taken to be 0.05, with the 0.05 level itself being the
result of Fisher's experience with small agricultural experiments and never intended
to be a universal standard. The point is that the p-value and α are not the same
thing. This confusion is made worse by the fact that statistical software packages
often report a number of p-values that a researcher can choose from after having
conducted the analysis (e.g., .001, .01, .05). This can lead a researcher to set α ahead
of time (perhaps according to an experimental design), but then communicate a
different level of “significance” after running the test. This different level of signif-
icance would have corresponded to a different effect size, sample size, and power,
all of which were not part of the experimenter’s original design. The conventional
practice is even worse than described as evidenced by nonsensical phrases such as
results “trending toward significance,” or “approaching significance,” or “nearly
significant.”1
One could argue that a poor understanding and questionable practice of NHST
is not sufficient as a criticism against its use. However, it has been argued by
Jeffreys (1961; see also Wagenmakers, 2007) that the statistical logic of the p-
value underlying NHST is fundamentally flawed on its own terms. Consider
any test statistic t(y) that is a function of the data y (such as the t-statistic). The
p-value is computed from the sampling distribution p[t(y) | H0 ], specifically from
that part of the sampling distribution t(yrep | H0 ) more extreme than the observed
t(y). A p-value is obtained from
the distribution of a test statistic over hypothetical replications (i.e., the sampling
distribution). The p-value is the sum or integral over values of the test statistic
that are at least as extreme as the one that is actually observed. In other words,
1. For a perhaps not-so-humorous account of the number of different phrases that have
been used in the literature, see https://mchankins.wordpress.com/2013/04/21/still-not-significant-2/.
the p-value is the probability of observing data at least as extreme as the data that
were actually observed, computed under the assumption that the null hypothesis
is true. However, data points more extreme were never actually observed, and
conditioning on them thus constitutes a violation of the likelihood principle, a
foundational principle in
statistics (Birnbaum, 1962) that states that in drawing inferences or decisions about
a parameter after the data are observed, all relevant observational information is
contained in the likelihood function for the observed data, p(y | θ). This issue was
echoed by Kadane (2011, p. 439), who wrote:
Significance testing violates the Likelihood Principle, which states
that, having observed the data, inference must rely only on what
happened, and not on what might have happened but did not.
Kadane (2011, p. 439) goes on to write:
But the probability statement...is a statement about X̄n before it is ob-
served. After it is observed, the event |X̄n| > 1.96/√n either happened
or did not happen and hence has probability either one or zero.
To be specific, if we observe an effect, say y = 5, then the significance calculations
involve not just y = 5 but also more extreme values, y > 5. But y > 5 was not
observed and it might not even be possible to observe it in reality! To quote Jeffreys
(1961, p. 385),
I have always considered the arguments for the use of P [sic] absurd.
They amount to saying that a hypothesis that may or may not be
true is rejected because a greater departure from the trial value was
improbable; that is, that it has not predicted something that has not
happened. ...This seems a remarkable procedure.
a full probability model for the data and the parameters of the model, where the
latter requires the specification of the prior distribution. The notion of model fit,
therefore, implies that the full probability model fits the data. Lack of model fit
may well be due to incorrect specification of likelihood, the prior distribution, or
both.
Arguably, another difference between the Bayesian and frequentist goals of
model building relates to the justification for choosing a particular model among
a set of competing models. Specifically, model building and model choice in the
frequentist domain is based primarily on choosing the model that best fits the data.
This has certainly been the key motivation for model building, respecification,
and model choice in the context of popular methods in the social sciences such as
structural equation modeling (see, e.g., Kaplan, 2009). In the Bayesian domain,
the choice among a set of competing models is based on which model provides the
best posterior predictions. That is, the choice among a set of competing models
should be based on which model will best predict what actually happened.
In this chapter, we examine the components of what might be termed a Bayesian
workflow (see, e.g., Bayesian Workflow, 2020; Schad, Betancourt, & Vasishth, 2019;
Depaoli & van de Schoot, 2017), which includes a series of steps that can be taken
by a researcher before and after estimation of the model. We discuss a possible
Bayesian workflow in more detail in Chapter 12. These steps include (1) prior pre-
dictive checking, which can aid in identifying priors that might be in serious conflict
with the distribution of the data, and (2) post-estimation steps including model
assessment, model comparison, and model selection. We begin with a discus-
sion of prior predictive checking, which essentially generates data from the prior
distribution before the data are observed. Next, we discuss posterior predictive
checking as a flexible approach to assessing the overall fit of a model as well as
the fit of the model to specific features of the data. Then, we discuss methods
of model comparison including Bayes factors, the related Bayesian information
criterion (BIC), the deviance information criterion (DIC), the widely applicable
information criterion (WAIC), and the leave-one-out cross-validation information
criterion (LOO-IC).
which we see is simply generating replications of the data from the prior
distribution.2
library(rstan)
library(loo)
library(bayesplot)
library(ggplot2)

## Read in data ##
PISA18Data <- read.csv("~/desktop/pisa2018.BayesBook.csv", header=TRUE)
PISA18.read <- subset(PISA18Data, select=c(PV1READ))
data.list <- with(PISA18.read, list(readscore=PV1READ,
                                    n=nrow(PISA18.read)))
priorpredIncorrect <- "
data {
  int n;
}
parameters {
  real<lower=0> mu;
  real<lower=0> sigma;
}
model {
  // priors
  mu ~ normal(400, 1);
  sigma ~ normal(100, 5);
}
generated quantities {
  vector[n] prior_rep;
  for (i in 1:n) {
    prior_rep[i] = normal_rng(mu, sigma);
  }
}
"
2. Gelman, Carlin, et al. (2014, p. 7) point out that the marginal distribution p(y) given in
Equation (2.4) is more accurately referred to as the prior predictive distribution, as it is not
conditioned on any prior observation, and predictive because it is the distribution of an
observable quantity.
priorpredIncorrect <- stan(model_code=priorpredIncorrect,
data=data.list,
chains = 4,
iter = 5000)
color_scheme_set("gray")
prior_rep <- as.matrix(priorpredIncorrect,pars="prior_rep")
plot <- ppd_dens_overlay(ypred=prior_rep[1:100,])
plot + lims(x=c(200,800),y=c(0,.009))
priorpredCorrect <- "
data {
  int n;
}
parameters {
  real<lower=0> mu;
  real<lower=0> sigma;
}
model {
  // priors
  mu ~ normal(500, 10);  // mean based on 2009 PISA US results
  sigma ~ normal(109, 10);
}
generated quantities {
  vector[n] prior_rep;
  for (i in 1:n) {
    prior_rep[i] = normal_rng(mu, sigma);
  }
}
"
priorpredCorrect <- stan(model_code=priorpredCorrect,
data=data.list,
chains = 4,
iter = 5000)
color_scheme_set("gray")
prior_rep <- as.matrix(priorpredCorrect,pars="prior_rep")
plot <- ppd_dens_overlay(ypred=prior_rep[1:100,])
plot + lims(x=c(200,800),y=c(0,.009))
Figure 6.1 below presents the plots of the prior predictive distributions.
FIGURE 6.1. Prior predictive checking plots. The plot on the left represents the results of an
elicitation without substantive knowledge of previous results from PISA 2009; these priors
would be somewhat incorrect. The plot on the right is based on hypothetical expert opinion
informed by results from PISA 2009.
In the Bayesian context, the approach to examining model fit and specification
utilizes the posterior predictive distribution of replicated data. Following Gelman,
Carlin, et al. (2014), let yrep be data replicated from our current model. That is,
$$p(y^{rep} \mid y) = \int p(y^{rep} \mid \theta)\, p(\theta \mid y)\, d\theta \tag{6.2a}$$
$$= \int p(y^{rep} \mid \theta)\, p(y \mid \theta)\, p(\theta)\, d\theta \tag{6.2b}$$
Notice that the posterior predictive distribution derives from the fact that the
second term inside the integral in Equation (6.2a) is simply the posterior
distribution of the model parameters. In words, Equation (6.2a) states that the
distribution of future observations given the present data, p(yrep | y), is equal
to the probability distribution of the future observations given the parameters,
p(yrep | θ), weighted by the posterior distribution of the model parameters. This is
then integrated (or summed) over the model parameters yielding the distribution
of future observations given the present data. Thus, posterior predictive checking
accounts for the uncertainty in the model parameters and the uncertainty in the
data.
As a means of assessing the fit of the model, posterior predictive checking
implies that the replicated data should match the observed data quite closely
if we are to conclude that the model fits the data. One approach to quantifying
model fit in the context of posterior predictive checking is to calculate the posterior
predictive p-value (PPp). Denote by T(y) a test statistic based on the data, and let
T(yrep) be the same test statistic but defined for the replicated data. Then, the PPp
is defined to be
$$\mathrm{PPp} = p[T(y^{rep}) \geq T(y) \mid y] \tag{6.3}$$
Equation (6.3) measures the proportion of test statistics based on replicated data
that equal or exceed the test statistic based on the actual data.
For the examples presented in this book, the interpretation of the posterior
predictive p-value is as follows. First, as noted by Gelman (2013), when the
uncertainty in the model parameters is passed to the test statistic T through
the posterior predictive distribution then the resulting p-values will concentrate
around 0.5, under the assumption that the model is true. Therefore, values
closer to 0 or 1 are indicative of a model with poor posterior predictive qualities.
Because the focus is on assessing the predictive quality of a model, the degree of
deviation from 0.5 that would constitute “poor predictive quality” is a matter of
content area judgment and will depend, in part, on expected uses of the model.
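The logic of Equation (6.3) and the 0.5 benchmark can be illustrated with simulated draws standing in for MCMC output (a Python sketch; all numbers and distributions are invented for illustration):

```python
import random
import statistics

random.seed(1)

# Observed data and a test statistic T(y): here, the sample mean.
y = [random.gauss(500, 100) for _ in range(200)]
T_obs = statistics.mean(y)

# Each "posterior draw" supplies (mu_t, sigma_t); replicate a data set from
# each draw and compare T(y_rep) with T(y), as in Equation (6.3).
n_draws = 1000
count = 0
for _ in range(n_draws):
    mu_t = random.gauss(500, 7)          # stand-in posterior draw of mu
    sigma_t = abs(random.gauss(100, 5))  # stand-in posterior draw of sigma
    y_rep = [random.gauss(mu_t, sigma_t) for _ in range(len(y))]
    if statistics.mean(y_rep) >= T_obs:
        count += 1

ppp = count / n_draws
print(ppp)  # a well-calibrated model concentrates this near 0.5
```

In practice the replicated draws come from the fitted model's generated quantities block rather than being simulated by hand.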
Let’s return to the linear regression example using PISA 2018 data in Section
5.1.1. The code necessary to obtain posterior predictive checks of the overall
quality of the model requires the use of the generated quantities block within the
Stan model string.
generated quantities {
  vector[n] readscore_rep;
  for (i in 1:n) {
    readscore_rep[i] = normal_rng(alpha + beta1*Female[i]
      + beta2*ESCS[i] + beta3*METASUM[i]
      + beta4*PERFEED[i] + beta5*JOYREAD[i]
      + beta6*MASTGOAL[i] + beta7*ADAPTIVITY[i]
      + beta8*TEACHINT[i] + beta9*SCREADDIFF[i]
      + beta10*SCREADCOMP[i], sigma);
  }
}
A variety of plots can be obtained to judge the quality of the model via PPC. An
important plot is the overlay of the density of the outcome in comparison with the
randomly generated densities from the above code. Below in Figure 6.2 we display
the overlay density plot based on 1,000 randomly generated reading scores from
the model. We observe a small degree of misfit of the model.
In Figure 6.3 we see very poor fit of the model to the mean of the reading distribution.
We can also evaluate our model with respect to the variance of the posterior reading
distribution, as shown below in Figure 6.4.
Relative to the mean of the distribution, the model does a somewhat better job of
predicting the variance of the reading distribution; however, some degree of misfit
still remains.
The flexibility of posterior predictive checking should not be underestimated.
In addition to examining the posterior predictive performance of the model in
predicting the mean or variance, any quantile of the distribution can be examined.
For example, suppose an investigator is concerned with the ability of the model
to predict the 25th quantile of the reading distribution as this quantile is of policy
relevance in identifying very poor reading performance. Below in Figure 6.5 we
show the posterior predictive performance of the model in predicting the 25th
quantile.
FIGURE 6.5. Histogram for prediction of 25th quantile of the reading distribution.
As with the mean and variance of the reading distribution, this model does not do
a very good job in predicting the 25th quantile, and should probably not be used
for this purpose.
$$p(M_1 \mid y) = \frac{p(y \mid M_1)\, p(M_1)}{p(y \mid M_1)\, p(M_1) + p(y \mid M_2)\, p(M_2)} \tag{6.4}$$
Notice that p(y | M1 ) does not contain model parameters θ1 , so to obtain p(y | M1 )
requires integrating over θ1 . That is,
$$p(y \mid M_1) = \int p(y \mid \theta_1, M_1)\, p(\theta_1 \mid M_1)\, d\theta_1 \tag{6.5}$$
where the terms inside the integral are the likelihood and the prior, respectively.
The quantity p(y | M1 ) is referred to as the marginal likelihood for model M1 (Raftery,
1995). A similar expression can be written for M2 .
With these expressions, we can move to the comparison of our two models,
M1 and M2 . The goal is to develop a quantity that expresses the extent to which
a posteriori, the data support M1 over M2 . One quantity could be the posterior
odds of M1 over M2 , expressed as
$$\frac{p(M_1 \mid y)}{p(M_2 \mid y)} = \left[\frac{p(y \mid M_1)}{p(y \mid M_2)}\right] \times \left[\frac{p(M_1)}{p(M_2)}\right] \tag{6.6}$$
Notice that the first term on the right hand side of Equation (6.6) is the ratio of two
marginal likelihoods. This ratio is referred to as the Bayes factor (BF) for M1 over M2 ,
denoted here as B12 , which is an odds ratio. In line with Kass and Raftery (1995, p.
776), our prior opinion regarding the odds of M1 over M2 , given by p(M1 )/p(M2 ),
is weighted by our consideration of the data, given by p(y | M1 )/p(y | M2 ). This
weighting gives rise to our updated view of evidence provided by the data for
either hypothesis, denoted as p(M1 | y)/p(M2 | y). If the posterior odds are greater
than 1.0, then the evidence supports Model 1. If the posterior odds are less than
1.0, then the evidence supports Model 2.
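In code, the updating in Equation (6.6) is a one-liner once the marginal likelihoods are in hand (a Python sketch; the log marginal likelihood values are invented for illustration):

```python
import math

# Log marginal likelihoods for two hypothetical models; the log scale
# avoids numerical underflow (values invented for illustration).
log_ml_m1 = -28540.0
log_ml_m2 = -28545.0

# Bayes factor B12 = p(y | M1) / p(y | M2)
bf_12 = math.exp(log_ml_m1 - log_ml_m2)

# Posterior odds = Bayes factor x prior odds (Equation 6.6);
# neutral prior odds p(M1) = p(M2) = 0.5 give a prior odds ratio of 1.
prior_odds = 0.5 / 0.5
posterior_odds = bf_12 * prior_odds

print(round(bf_12, 1))  # → 148.4, evidence favoring M1
```

Working on the log scale is essential here: marginal likelihoods of real models are often far too small to represent directly.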
In practice, there might not be a prior preference for one model over the other
and, of course, this is the default setting in software programs that produce the
BF. In this case, the prior odds are neutral, p(M1) = p(M2) = 0.5, and the prior
odds ratio equals 1, in which case the posterior odds equal the BF. The inter-
pretation of the BF as the ratio of the integrated likelihoods is straightforward:
it expresses the relative evidence in the data in support of one model over another.
Model                                     BF
1  Female + ESCS                          3.269e+146
2  JOYREAD + SCREADCOMP + SCREADDIFF      4.168e+208
3  METASUM + MASTGOAL + ADAPTIVITY        2.486e+212
4  PERFEED + TEACHINT                     2.946e+42
Recalling that the intercept-only model is the baseline for comparison, we find
that Model 3, which involves meta-cognitive strategies that a student might use
when reading a passage, is preferred by the data.
and is referred to as the Bayesian information criterion (BIC), also referred to as the
Schwarz criterion (Schwarz, 1978). A detailed mathematical derivation for the BIC
can be found in Raftery (1995), who also examines generalizations of the BIC to a
broad class of statistical models.
Consider again our two models, M1 and M2 , with M2 nested in M1 . Here
again, M1 could represent a set of predictors in a regression model and M2 could
be a subset of those predictors. Or, M1 could be an initially specified structural
equation model and M2 could be the same model with one path deleted. Under
conditions where there is little prior information, Raftery (1995) has shown that
an approximation of the Bayes factor can be written as
TABLE 6.2. Rules of thumb for the BIC and Bayes factors with M1 as the reference model
As with all rules of thumb, those in Table 6.2 should be used with caution and not
without content area knowledge to support decisions about model selection.
information theory. The BIC is not derived from the Kullback-Leibler divergence
(see Section 4.8).
Aside from the misnomer, there are more important criticisms of the BIC
when used for model selection (see Weakliem, 1999). First, recall that the BIC
is an approximation of the Bayes factor defined in Equation (6.6), and note that
the Bayes factor requires that the prior odds of, say, M1 against M2 , written as
p(M1 )/p(M2 ), be specified. In practice, and by default, this ratio is set to 1.0, which
itself might not be what a researcher truly believes about the prior odds of the
models. For example, consider the case where M1 is a model that has had good
empirical support in the past, and M2 is a competitor model. Equal prior odds in
this case might not make sense. If, however, we wish to account for differences
in our knowledge and experience with these models, then each model will also
have different prior distributions on the model parameters. In large samples, these
priors might not have much of an effect alone, but the Bayes factor can be quite
sensitive to these prior distributions.
The importance of this point is that the BIC implies specific (and identical)
distributions for each model’s set of parameters and these priors might not be what
a researcher truly believes about the two models taken separately. Specifically,
the BIC assumes so-called unit information priors (UIP) for the model parameters
(Raftery, 1995). We take up unit information priors in Chapter 11, but suffice to
say that the UIP is a data-dependent prior that is Gaussian with a mean set at the
maximum likelihood estimate and precision equal to the information provided by
one observation. Thus, although there can be an infinite number of Bayes factors
corresponding to an infinite number of prior beliefs, there is only one Bayes factor
implied by the use of the BIC, and this might not accurately describe a researcher’s
prior belief. At the very least, researchers who use the BIC for model selection
should interpret their findings with caution.
where P is the number of parameters and where 2P serves as a penalty term for
overfitting. Note, however, that the AIC is based on a plug-in point estimate
of θ̂ obtained from maximum likelihood estimation of the model and not from
posterior samples. An advantage of the DIC over the AIC is that the DIC is
based on posterior samples and is designed to choose a model that gives the
smallest expected Kullback-Leibler divergence between the true data generating
process (DGP) and a predictive distribution (see Section 4.8 for a discussion of the
Kullback-Leibler divergence).
The DIC can be defined as
$$\mathrm{DIC} = -2 \log p(y \mid \hat{\theta}_{Bayes}) + 2 P_{DIC} \tag{6.11}$$
where θ̂Bayes is the posterior mean E(θ | y) derived from the MCMC draws, and
PDIC is the effective number of parameters obtained as
$$P_{DIC} = 2\left[\log p(y \mid \hat{\theta}_{Bayes}) - E\left(\log p(y \mid \theta)\right)\right] \tag{6.12}$$
where the expectation E(log p(y | θ)) is taken over the T draws from the posterior
distribution.
For calculation purposes, Equation (6.12) can be written as
$$\widehat{P}_{DIC} = 2\left[\log p(y \mid \hat{\theta}_{Bayes}) - \frac{1}{T}\sum_{t=1}^{T} \log p(y \mid \theta^{t})\right] \tag{6.13}$$
Notice that Equations (6.12) and (6.13) are essentially expressions of variation
such that the more uncertainty there is in the parameters the greater the overall
penalization against the fit as measured by log p(y | θ̂Bayes ), particularly when
compared to the penalty term for the AIC.
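Equation (6.13) can be implemented directly from the posterior draws. A minimal Python sketch under a deliberately simple normal(mu, 1) likelihood (the data and draws are simulated stand-ins, not output from a fitted model):

```python
import math
import random

random.seed(2)

# Illustrative setup: data y assumed normal(mu, 1), with T stand-in
# posterior draws of mu (none of this comes from a fitted model).
y = [random.gauss(0.0, 1.0) for _ in range(50)]
draws = [random.gauss(0.0, 0.15) for _ in range(2000)]

def log_lik(mu):
    """log p(y | mu) under a normal(mu, 1) likelihood."""
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (yi - mu) ** 2 for yi in y)

theta_bayes = sum(draws) / len(draws)  # posterior mean of mu
mean_log_lik = sum(log_lik(m) for m in draws) / len(draws)

# Effective number of parameters, Equation (6.13)
p_dic = 2 * (log_lik(theta_bayes) - mean_log_lik)

# DIC on the deviance scale: -2 log p(y | theta_hat) plus the penalty
dic = -2 * log_lik(theta_bayes) + 2 * p_dic

print(round(p_dic, 2), round(dic, 1))
```

Because the log-likelihood is concave in mu here, Jensen's inequality guarantees p_dic is positive: more posterior uncertainty yields a larger penalty.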
The DIC in Equation (6.11) is presented here in terms of log predictive density,
but it is often expressed in terms of deviance as
Finally, the maximum posterior density will be obtained when the posterior mean
coincides with the maximum a posteriori (MAP) value. The DIC is not available in
Stan but is available in the R package rjags (Plummer, 2022).
where pt ( ỹi ) represents the distribution of the true but unknown data-generating
process for the predicted values ỹi . For computation purposes we need the log
pointwise predictive density, defined as
$$\mathrm{lpd}_{WAIC} = \sum_{i=1}^{n} \log p(y_i \mid y) = \sum_{i=1}^{n} \log \int p(y_i \mid \theta)\, p(\theta \mid y)\, d\theta \tag{6.16}$$
To calculate this predictive density, we use draws from the posterior distribution,
yielding
$$\widehat{\mathrm{lpd}}_{WAIC} = \sum_{i=1}^{n} \log\left(\frac{1}{S}\sum_{s=1}^{S} p(y_i \mid \theta^{s})\right) \tag{6.17}$$
With these definitions in hand, the WAIC uses Equation (6.17) and, as with the
DIC, adds a term to correct for overfitting. Specifically, this correction factor can
be written as
$$P_{WAIC} = 2\sum_{i=1}^{n}\left(\log E\left[p(y_i \mid \theta)\right] - E\left[\log p(y_i \mid \theta)\right]\right) \tag{6.18}$$
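Equations (6.17) and (6.18) reduce to simple loops over the pointwise likelihood evaluated at each posterior draw. A Python sketch under a stand-in normal(mu, 1) model (the data and draws are simulated for illustration only):

```python
import math
import random

random.seed(3)

# Illustrative setup: pointwise log-likelihoods log p(y_i | theta^s) for n
# data points and S posterior draws, under a stand-in normal(mu, 1) model.
y = [random.gauss(0.0, 1.0) for _ in range(40)]
draws = [random.gauss(0.0, 0.15) for _ in range(1000)]
S = len(draws)

def lpdf(yi, mu):
    return -0.5 * math.log(2 * math.pi) - 0.5 * (yi - mu) ** 2

lpd_hat = 0.0  # Equation (6.17): sum_i log( (1/S) sum_s p(y_i | theta^s) )
p_waic = 0.0   # Equation (6.18): 2 sum_i ( log E[p] - E[log p] )
for yi in y:
    ll = [lpdf(yi, mu) for mu in draws]
    log_mean_lik = math.log(sum(math.exp(l) for l in ll) / S)
    mean_ll = sum(ll) / S
    lpd_hat += log_mean_lik
    p_waic += 2 * (log_mean_lik - mean_ll)

waic = -2 * (lpd_hat - p_waic)  # deviance scale
print(round(p_waic, 2), round(waic, 1))
```

In practice one extracts the log_lik matrix from the generated quantities block and hands it to the loo package, which performs exactly this computation (with more numerical care).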
As with the AIC and DIC, the WAIC can be written in deviance form as
where
$$p(y_i \mid y_{-i}) = \int p(y_i \mid \theta)\, p(\theta \mid y_{-i})\, d\theta \tag{6.22}$$
is the LOO predictive density given the data with the ith data point left out (Vehtari
et al., 2017).
It is useful to note that an information criterion based on LOO, referred to as
the LOO-IC, can be easily derived as
$$\mathrm{LOO\text{-}IC} = -2 \times \widehat{\mathrm{elpd}}_{LOO} \tag{6.23}$$
which places the LOO-IC on the deviance scale. Among a set of competing models,
the one with the smallest LOO-IC is considered best from an out-of-sample point-
wise predictive point of view. In addition, it may also be interesting to note that
under maximum likelihood estimation, LOO-CV is asymptotically equivalent to
the AIC (Stone, 1977; see also Yao, Vehtari, Simpson, & Gelman, 2018a).
As pointed out by Vehtari et al. (2017), LOO is asymptotically equivalent to
the WAIC, but in the case of finite samples with weak priors and/or influential
observations, a more robust method for calculating the LOO-CV might be desired.
To this end, Vehtari et al. (2017) developed a fast and stable approach to obtaining
LOO-CV using Pareto-smoothed importance sampling, which we discussed in
Section 4.8.2. As applied to LOO, the importance ratios presented in Equation
(4.20) are obtained as
r_i^t = p(θ^t | y_−i) / p(θ^t | y)  (6.24)
From here, the importance sampling LOO predictive distribution is obtained as
p(ỹ_i | y_−i) ≈ ( Σ_{t=1}^T r_i^t p(ỹ_i | θ^t) ) / ( Σ_{t=1}^T r_i^t )  (6.25)
⁴ A distinction is sometimes made between LOO-CV and Bayesian LOO-CV. The former
implies that LOO can be applied in any cross-validation context, whereas Bayesian LOO-
CV deals explicitly with posterior predictive distributions. For this book, we use LOO-CV
to mean Bayesian LOO-CV.
Model Evaluation and Comparison 121
The density of the held-out data point is obtained from the T posterior samples as
p(y_i | y_−i) ≈ 1 / ( (1/T) Σ_{t=1}^T 1 / p(y_i | θ^t) )  (6.26)
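The estimator in Equation (6.26) is simply a harmonic mean of the pointwise densities over the posterior draws. A toy numerical sketch (in Python with made-up values; this is plain importance sampling, without the Pareto smoothing that stabilizes the ratios in practice):

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy setup: T posterior draws of a normal mean and one held-out observation
T = 4000
theta = rng.normal(loc=0.0, scale=0.3, size=T)  # draws from p(theta | y)
y_i = 0.5

# Pointwise densities p(y_i | theta^t) under a N(theta, 1) likelihood
dens = np.exp(-0.5 * (y_i - theta) ** 2) / np.sqrt(2.0 * np.pi)

# Harmonic mean of the pointwise densities estimates the LOO predictive
# density p(y_i | y_-i)
loo_dens = 1.0 / np.mean(1.0 / dens)

# The full-data posterior predictive density, for comparison
post_dens = np.mean(dens)
print(loo_dens <= post_dens)  # → True (harmonic mean <= arithmetic mean)
```

As expected, the leave-one-out density never exceeds the full-data predictive density, reflecting the penalty for not having seen the held-out point.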
generated quantities {
  vector[n] readscore_rep;
  vector[n] log_lik;
  for (i in 1:n) {
    readscore_rep[i] = normal_rng(alpha + beta1*Female[i] + beta2*ESCS[i]
        + beta3*METASUM[i] + beta4*PERFEED[i] + beta5*JOYREAD[i]
        + beta6*MASTGOAL[i] + beta7*ADAPTIVITY[i] + beta8*TEACHINT[i]
        + beta9*SCREADDIFF[i] + beta10*SCREADCOMP[i], sigma);
    log_lik[i] = normal_lpdf(readscore[i] | alpha + beta1*Female[i]
        + beta2*ESCS[i] + beta3*METASUM[i] + beta4*PERFEED[i]
        + beta5*JOYREAD[i] + beta6*MASTGOAL[i] + beta7*ADAPTIVITY[i]
        + beta8*TEACHINT[i] + beta9*SCREADDIFF[i] + beta10*SCREADCOMP[i],
        sigma);
  }
}
"
Then, outside of the modelString we add the lines for the model with non-
informative priors.
> print(loo_noninf)
Estimate SE
elpd_loo -28537.8 48.3
p_loo 12.2 0.3
looic 57075.6 96.6
------
Monte Carlo SE of elpd_loo is 0.1.
Estimate SE
elpd_waic -28537.8 48.3
p_waic 12.1 0.3
waic 57075.6 96.6
The results for the LOO and WAIC are identical because of the large sample size,
and they are also identical to the results for the informative priors case. The LOO,
however, provides diagnostic information based on the Pareto-k statistic indicating
that the results can be trusted.
Turning to the negative binomial model, it may be of interest to compare the LOO values
for both models. The LOO value for the Poisson regression model is 2991.9.5 The
LOO value for the negative binomial regression was found to be 1752.0 which is
substantially lower than the value found for the Poisson regression indicating that
the negative binomial regression model shows much better out-of-sample point-
wise prediction of number of days absent from school compared to the Poisson
regression model.
6.4 Summary
In this chapter we covered Bayesian methods for model evaluation and com-
parison. The chapter began with an overview of the Bayesian critique of null
hypothesis significance testing. We argue that over and above the consistent mis-
use of the NHST in the social and behavioral sciences, the method itself may be
fundamentally flawed insofar as it appears to violate the likelihood principle. We
then discussed the Bayesian alternative of model assessment through posterior
predictive checks. We agree with Gelman and Shalizi (2013) that posterior pre-
dictive checks represent a powerful approach to fully probing the adequacy of a
model around its intended use, and we advocate its routine application in Bayesian
practice. We next covered issues in model comparison and provided a review of
Bayes factors (and an aside on the BIC), the DIC, WAIC, and LOO-CV/LOO-IC.
Following Gelman and Rubin, we agree that while model comparison might be
useful, in the case of Bayes factors and BIC, they probably should not be used
for model selection insofar as the purposes of the model are not embedded as
part of the decision on which to select a model. The DIC, WAIC, and LOO-IC,
on the other hand, are an improvement insofar as they are directed toward model
selection based on predictive criteria. Also, although the WAIC and LOO-CV have
been shown to be asymptotically equivalent (Watanabe, 2010), the implementation
of LOO-CV in the loo package is more robust in finite samples with weak priors
or influential observations (Vehtari et al., 2017) with the LOO-IC perhaps having
the most solid underlying motivation if it is to be used for model selection. In
summary, for model evaluation and comparison/selection, we advocate substan-
tively guided posterior predictive checks for model evaluation and the LOO-IC for
model selection. These will be presented across the examples used in this book.
⁵ Note that the initial calculation of the LOO led to a number of “bad” or “very bad” Pareto-k
values. On inspection, this was due to one outlier observation which was removed, after
which all Pareto-k values were below 0.7. A discussion of the Pareto-k diagnostic can be
found in Sections 4.8.2 and 7.3.4.
7
Bayesian Multilevel Modeling
A common feature of data collection in the social sciences is that units of anal-
ysis (e.g., students or employees) are often nested in higher level organizational
units (e.g., schools or companies, respectively). Indeed, in many instances, the
substantive problem concerns specifically an understanding of the role that units
at both levels play in explaining or predicting outcomes of interest. For example,
the OECD/PISA study deliberately samples schools (within a country) and then
takes an age-based sample of 15-year-olds within sampled schools. Such data
collection plans are generically referred to as clustered sampling designs. Data from
clustered sampling designs are then collected at both levels for the purposes of un-
derstanding each level separately, but also to understand the inputs and processes
of student and school level variables as they predict both school and student level
outcomes. Higher levels of nesting are, of course, possible, e.g., students nested in
schools, which in turn are nested in local educational authorities, such as school
districts.
It is probably no exaggeration to say that one of the most important contributions
to the empirical analysis of data arising from such data collection efforts
has been the development of so-called multilevel models. Original contributions to
the theory of multilevel modeling for the social sciences can be found in Burstein
(1980), Goldstein (2011), and Raudenbush and Bryk (2002), among others.
The purpose of this chapter is to highlight the fact that multilevel models can
be conceptualized as Bayesian hierarchical models. Apart from the advantages
gained from being able to incorporate priors directly into a multilevel model,
the Bayesian conception of multilevel modeling has another advantage – namely
it clears up a great deal of confusion in the presentation of multilevel models.
Specifically, the literature on multilevel modeling attempts to make a distinction
between so-called “fixed effects” and “random effects.” Indeed, Gelman and Hill
(2003) have recognized this issue and present five different definitions of fixed and
random effects. Moreover, there are differences in the presentation of multilevel
models. For example, Raudenbush and Bryk (2002) provide a pedagogically useful
representation of multilevel modeling as one of modeling different organizational
levels. Others (e.g., Pinheiro & Bates, 2000) represent multilevel modeling as
a single-level “mixed effects” model. Although these two representations are
mathematically equivalent, such differences in presentation and the varying uses
of terminology can be confusing.
126 Bayesian Statistics for the Social Sciences
Consider student reading scores, denoted as y_ig, and the school means, denoted
as β_0g. Assuming exchangeable schools allows us to assign the same prior
distribution to the parameters β_0g. In other words, lacking any information about
the G schools, exchangeability of the β_0g is a reasonable assumption.
Now, however, consider the situation in which we learn that some subset of
the G schools are public schools and the remainder are private schools. Given this
knowledge, it might not be appropriate to specify the same prior distribution for
these different types of schools. Instead, we might be able to argue that conditional
on school type, the β0g s are exchangeable – that is, we might feel comfortable
assigning the same prior distribution within public and private schools. For this
to be reasonable, we would need to directly add school type to our random effects
model, yielding a more general multilevel model.
Yet another implication of exchangeability in the context of multilevel models
concerns the notion of borrowing strength (see, e.g., Jackman, 2009, p. 307). The cen-
tral idea is that inferences regarding the school means β0g come from two sources.
The first source is information coming from school g itself. However, under ex-
changeability, another source of information arises from the remaining schools via
the prior distribution on β0g . Specifically, given that the prior distribution on the
random school means is generated from the hyperparameters mean µ and variance
σ2, and given that these parameters are unknown, in essence the data coming from
school g are being used to update the priors on µ and σ2. As Jackman (2009) points
out, the phenomenon of borrowing strength is (1) a consequence of hierarchical
modeling and partial pooling, discussed in Section 2.2, and (2) possible only under
exchangeability of β0g .
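Borrowing strength can be made concrete in the simple normal-normal case with known variances. In the sketch below (in Python with purely illustrative numbers, not values from the book), the posterior mean for school g is a precision-weighted compromise between that school's own sample mean and the grand mean:

```python
import numpy as np

# Toy normal-normal setup with known variances (illustrative values only)
sigma2 = 100.0   # within-school variance
tau2 = 25.0      # between-school (prior) variance of the school means
mu = 500.0       # prior (grand) mean
n_g = 20         # students sampled in school g
ybar_g = 470.0   # observed sample mean in school g

# Precision-weighted posterior mean: shrinks ybar_g toward the grand mean
w = (n_g / sigma2) / (n_g / sigma2 + 1.0 / tau2)
post_mean = w * ybar_g + (1.0 - w) * mu
print(round(post_mean, 1))  # → 475.0
```

The weight w grows with the within-school sample size n_g, so well-sampled schools are shrunk less toward the grand mean; information from the other schools enters through the hyperparameters µ and τ².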
The school means β_0g can themselves be modeled as

β_0g = γ_00 + u_0g  (7.2)

where γ_00 is a grand mean and u_0g is a homoskedastic error term with variance σ²_β0
that picks up the school effect over and above the grand mean. Inserting Equation
(7.2) into Equation (7.1) yields

readscore_ig = γ_00 + u_0g + r_ig  (7.3)

indicating that the reading score for student i in school g can be decomposed into
an overall grand mean γ_00, a component due to the school effects u_0g, and a random
error component r_ig.
An important measure used to evaluate the necessity of multilevel modeling is
the so-called intra-class correlation (ICC). The ICC yields the proportion of variance
in the outcome that can be attributed to differences among schools and is defined
as
ICC = σ²_β0 / (σ²_β0 + σ²_read)  (7.4)
What constitutes a large ICC is, of course, a matter of substantive judgment, but a
benefit of Bayesian multilevel modeling is that the ICC is obtained from the draws
of the posterior distribution of the variance terms and thus not only encodes
uncertainty, but allows one to explore credible intervals for the ICC.
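Because the ICC in Equation (7.4) is a deterministic function of the variance parameters, each posterior draw of the variances yields a draw of the ICC. A sketch (in Python, with simulated draws standing in for the actual Stan output; the values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated stand-ins for posterior draws of sigma_beta0 and sigma_read;
# these are not estimates from the fitted PISA model
S = 4000
sigma_beta0 = np.abs(rng.normal(30.0, 3.0, size=S))
sigma_read = np.abs(rng.normal(90.0, 5.0, size=S))

# ICC computed draw by draw, mirroring Equation (7.4)
icc = sigma_beta0**2 / (sigma_beta0**2 + sigma_read**2)

# The full posterior of the ICC is available, so credible intervals are direct
ci_lo, ci_hi = np.percentile(icc, [2.5, 97.5])
print(f"posterior mean ICC = {icc.mean():.3f}, "
      f"95% CI = ({ci_lo:.3f}, {ci_hi:.3f})")
```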
Recall that a fully Bayesian perspective requires specifying the prior distribu-
tions on all model parameters. For the model in Equation (7.3), we first specify the
distribution of the reading score given the school effect u_0g and the within-school
variance σ²_read. Specifically,

readscore_ig ∼ N(γ_00 + u_0g, σ²_read)  (7.5)
Prior distributions on the remaining model parameters can be specified as
where µ0 and τ0 are the mean and standard deviation hyperparameters on µ that
are assumed to be fixed and known. Other choices for prior distributions, especially
for the level 1 and level 2 variance terms, are possible (see Gelman, 2006).
To see how this specification fits into a Bayesian hierarchical model, note that
we can arrange all of the parameters of the random effects ANOVA model into a
vector θ and write the prior distribution as
For this example we run a simple Bayesian random effects ANOVA on reading
literacy for the United States sample of PISA 2018 (n = 4,838). The analysis simply
examines whether there are school differences in the average reading performance
of students within schools.
To begin, we read in the data and select the reading score along with the school
identification code which is necessary for the multilevel analysis.
library(rstan)
library(loo)
library(bayesplot)
library(dplyr)

# Read in data and select the reading score and school ID
PISA <- read.csv(file.choose(), header = TRUE)  # browse to select data
PISA <- subset(PISA, select = c(readscore, SchoolID))
In the following section is the Stan code for the random effects ANOVA model.
In the data block we read in the sample size, the number of groups, a school
identification number for each student, and the reading score.
RandomEffectsAnova = "
data {
int<lower=0> n; // number of observations
int<lower=0> G; // number of groups
int<lower=1,upper=G> SchoolID[n]; // discrete group indicators
vector[n] readscore; // real valued observations
}
In the following parameter block we define the parameters in the model. This
is followed by a transformed parameters block that allows us to obtain components
for calculating the ICC. Recall that Stan works with standard deviations, and so
these elements must be squared to obtain variances.
parameters {
real mu0; // grand mean
real<lower=0> sigma_beta0; // level 2 std.
real<lower=0> sigma_read; // level 1 std.
vector[G] mu; // school means
}
transformed parameters {
real <lower=0> sigma2_read;
real <lower=0> sigma2_beta0;
real <lower=0> ICC;
sigma2_read= sigma_readˆ2;
sigma2_beta0 = sigma_beta0ˆ2;
ICC = sigma2_beta0/(sigma2_read + sigma2_beta0);
}
Next we specify the priors and the likelihood in the model block. In this example
we know that a non-informative prior on the intercept mu would be quite incorrect
given that the international reading mean in PISA 2018 is not zero. Thus we give
a highly diffused prior around a more sensible mean. Finally, notice that the
expression in the likelihood mu[SchoolID] signals the program to obtain the mean
reading score for each school as indexed by the school identification number.
model {
mu0 ~ normal(400, 100); // Prior based on PISA international scale
sigma_read ~ cauchy(0, 1); // Weakly informative prior
sigma_beta0 ~ cauchy(0, 1); // Weakly informative prior
mu ~ normal(mu0, sigma_beta0); // School means centered at the grand mean
readscore ~ normal(mu[SchoolID], sigma_read); // Likelihood
}
generated quantities {
vector[n] readscore_rep;
vector[n] log_lik_ANOVA;
for(i in 1:n) {
readscore_rep[i] = normal_rng(mu[SchoolID[i]], sigma_read);
log_lik_ANOVA[i] = normal_lpdf(readscore[i] | mu[SchoolID[i]], sigma_read);
}
}
"
We next specify the information needed for the analysis, plots, results, posterior
predictive checks, and cross-validation assessment.
nChains = 4
nIter= 10000
thinSteps = 10
warmupSteps = floor(nIter/2)
readscore = data.list$readscore
RFAnova = stan(data=data.list, model_code=RandomEffectsAnova,
               chains=nChains, iter=nIter, warmup=warmupSteps,
               thin=thinSteps)
stan_plot(RFAnova, pars=c("mu0","sigma_read","sigma_beta0","ICC"))
stan_trace(RFAnova, pars=c("mu0","sigma_read","sigma_beta0","ICC"))
stan_dens(RFAnova, fill="gray", pars=c("mu0","sigma_read","sigma_beta0","ICC"))
stan_ac(RFAnova, fill="gray", pars=c("mu0","sigma_read","sigma_beta0","ICC"))
Convergence Diagnostics
Below are the convergence diagnostics along with the results in Table 7.1 below.
As seen in Figures 7.1 - 7.3 below, the diagnostic information suggests that for each
parameter, the algorithm converged adequately to the posterior distribution.
FIGURE 7.1. Trace plots for random effects regression example under informative priors.
FIGURE 7.2. ACF plots for random effects regression example under informative priors.
FIGURE 7.3. Density plots for random effects regression example under informative pri-
ors.
FIGURE 7.4. Density plots for random effects regression example under informative priors.
We find that there is a small degree of misfit as judged by the posterior density
plot. However, the posterior prediction of the reading mean under the specific
priors in this model is quite good, with a posterior p-value of 0.507. Finally,
the LOO-IC value for this model is 58358.8, which we will use as a basis of
comparison with the more complex models that follow.
Here we provide only the Stan code for the varying intercept model. The
surrounding R code for reading in the data and summarizing the results is
virtually the same as in Example 7.1. The complete code is available on the
book's companion website.
InterceptOutcome = "
data {
int<lower=1> n; // number of students
int<lower=1> G; // number of schools
int SchoolID[n]; // school indices
vector[n] readscore; // reading outcome variable
vector[n] STAFFSHORT; // Measure of staff shortage in the school
}
parameters {
vector[G] beta0;
real gamma00;
real gamma01;
real<lower=0> sigma_read;
real<lower=0> sigma_beta0;
}
In the following transformed parameters block we define and calculate the intra-
class correlation.
transformed parameters {
real <lower=0> sigma2_read;
real <lower=0> sigma2_beta0;
real <lower=0> ICC;
sigma2_read = sigma_read^2;
sigma2_beta0 = sigma_beta0^2;
ICC = sigma2_beta0/(sigma2_read + sigma2_beta0);
}
In the following model block notice that we now specify the model relating the
school means beta0[g] to the staff shortage variable, denoted as STAFFSHORT.
model {
gamma00 ˜ normal(400,100); // Informative prior
gamma01 ˜ normal(0, 2);
sigma_read ˜ cauchy(0,1);
sigma_beta0 ˜ cauchy(0,1);
for(i in 1:n) {
readscore[i] ˜ normal(beta0[SchoolID[i]], sigma_read);
}
for(g in 1:G) {
beta0[g] ˜ normal(gamma00 + gamma01*STAFFSHORT[g], sigma_beta0);
}
}
generated quantities {
real readscore_rep[n];
vector[n] log_lik_M1;
real beta0_rep[G];
for(i in 1:n) {
readscore_rep[i] = normal_rng(beta0[SchoolID[i]],sigma_read);
log_lik_M1[i] = normal_lpdf(readscore[i] | beta0[SchoolID[i]],
sigma_read);
}
for(g in 1:G) {
beta0_rep[g] = normal_rng(gamma00 + gamma01 * STAFFSHORT[g],
sigma_beta0);
}
}
"
The results of the varying intercept model are displayed below in Table 7.2.
For this example, we do not show the trace plots, density plots, or ACF plots.
However, inspecting the n_eff and Rhat values, we find that the model shows good
convergence. The posterior p-value is 0.48, indicating excellent predictive fit of
the model to the mean reading score. The LOO-IC for this model is 58355.6,
indicating that the varying intercept model yields slightly better out-of-sample
predictive accuracy compared to the random effects ANOVA model. The mean
effect of staff shortage (0.61) has a 95% posterior probability interval that
contains zero; however, the probability of the effect being greater than zero is 0.63.
This suggests that zero lies relatively close to the mean, with 63% of the
distribution lying to the right of 0. To
check this, we can use the full posterior distribution to determine the percentage
of the distribution that lies between zero and the mean value of 0.61. This value
turns out to be 0.13, which, indeed, is relatively close to 0. Here again, such a
nuanced analysis would not be possible in a frequentist setting, but, still, content
area judgment is needed to warrant the importance of the effect.
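The posterior summaries described above are simple tail-area computations on the draws. A hypothetical sketch (in Python, with simulated draws standing in for the actual posterior of the staff-shortage effect):

```python
import numpy as np

rng = np.random.default_rng(11)

# Simulated stand-in for posterior draws of the staff-shortage effect,
# centered near the reported mean of 0.61 purely for illustration
draws = rng.normal(loc=0.61, scale=1.8, size=8000)

mean_effect = draws.mean()
p_gt_zero = np.mean(draws > 0)                            # P(effect > 0 | y)
p_between = np.mean((draws > 0) & (draws < mean_effect))  # mass in (0, mean)

print(f"P(effect > 0) = {p_gt_zero:.2f}, mass in (0, mean) = {p_between:.2f}")
```

Any event probability of substantive interest can be computed the same way, by averaging an indicator function over the draws.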
We denote school-level predictors as z_g. For the following example, we again use the
single school-level predictor of principal-reported teacher shortage in the school
(STAFFSHORT). Allowing the intercept β_0g to be modeled as a function of
STAFFSHORT and allowing the slope of readscore on ESCS to be modeled as a
function of STAFFSHORT, we
can write the model
where γ’s are the coefficients relating β jg to the school-level predictors. From
Raudenbush and Bryk (2002), the model in Equations (7.13a) - (7.13b) is referred to
as the “level 2” model. The coefficient γ00 captures the school average reading score
for schools with no staff shortage; γ01 captures the relationship between school staff
shortage and school-level reading performance; γ10 captures the school average
ESCS-reading score relationship for schools with no staff shortage; and γ11 captures
the moderating effect of staff shortage on the school average ESCS-reading score
relationship.
As with the random effects ANOVA model, we can substitute Equations (7.13a)
- (7.13b) into Equation (7.12) and rearrange to yield the full model:
where prior distributions would have to be chosen for σ²_read, the γ coefficients, and
the variances σ²_β0 and σ²_β1.
IntSlopeOutcome = "
data {
int<lower=1> n; // number of students
int<lower=1> G; // number of schools
int SchoolID[n]; // school indices
vector[n] readscore; // reading outcome variable
vector[n] ESCS; // student socioeconomic status
vector[n] STAFFSHORT; // measure of staff shortage in the school
}
The parameters block adds the coefficient gamma10 for the main effect of ESCS,
and gamma11 for the interaction of ESCS with STAFFSHORT.
parameters {
vector[G] beta0;
vector[G] beta1;
real gamma00;
real gamma01;
real gamma10;
real gamma11;
real<lower=0> sigma_read;
real<lower=0> sigma_beta0;
real<lower=0> sigma_beta1;
}
The following model block simply adds non-informative priors to the newly added
parameters from the previous block and specifies the full model.
model {
gamma00 ˜ normal(400,100);
gamma01 ˜ normal(0, 2);
gamma10 ˜ normal(0, 2);
gamma11 ˜ normal(0, 2);
sigma_read ˜ cauchy(0,1);
sigma_beta0 ˜ cauchy(0,1);
sigma_beta1 ˜ cauchy(0,1);
for(i in 1:n) {
readscore[i] ˜ normal(beta0[SchoolID[i]] +
beta1[SchoolID[i]]*ESCS[i], sigma_read);
}
for(g in 1:G) {
beta0[g] ˜ normal(gamma00 + gamma01*STAFFSHORT[g],
sigma_beta0);
beta1[g] ˜ normal(gamma10 + gamma11*STAFFSHORT[g],
sigma_beta1);
}
}
Finally, the generated quantities block sets up the code necessary to obtain the
posterior predictive checks and LOO-IC information.
generated quantities {
real readscore_rep[n];
vector[n] log_lik_M2;
real beta0_rep[G];
real beta1_rep[G];
for(i in 1:n) {
readscore_rep[i] = normal_rng(beta0[SchoolID[i]]
    + beta1[SchoolID[i]]*ESCS[i], sigma_read);
log_lik_M2[i] = normal_lpdf(readscore[i] | beta0[SchoolID[i]]
    + beta1[SchoolID[i]]*ESCS[i], sigma_read);
}
for(g in 1:G) {
beta0_rep[g] = normal_rng(gamma00 + gamma01*STAFFSHORT[g], sigma_beta0);
beta1_rep[g] = normal_rng(gamma10 + gamma11*STAFFSHORT[g], sigma_beta1);
}
}
"
An inspection of the n_eff and Rhat values (as well as diagnostic plots not shown)
suggests adequate convergence of the algorithm. The results of the intercepts and
slopes as outcomes model are displayed below in Table 7.3.
The results in Table 7.3 reveal a rather small effect of staff shortage on reading.
Here, however, we are interested in two important predictors of reading perfor-
mance: (1) the impact of ESCS and (2) the moderating effect of staff shortage on
the relationship between ESCS and reading. From Table 7.3 we find a posterior
effect of ESCS of gamma10 = 13.31, with a standard deviation of 1.20. The
probability that this effect is greater than zero is approximately one. The impact of the
moderating effect of staff shortage on the relationship between ESCS and reading
is represented by the coefficient gamma11 = 3.33 with a standard deviation of
1.49. Working with the full posterior distribution, we find that the probability
that 3.33 is greater than 0 is approximately 0.98. In both cases, the percentage of
the distribution that lies between 0 and the obtained estimates is also reasonably
large (0.50 and 0.49, respectively). It appears that ESCS impacts not only average
school reading performance but also interacts with school-level staff shortages in
impacting reading performance. It should also be noted that the LOO-IC for this
model is 58093.8 which is substantially lower than the varying intercepts model,
indicating substantial improvement in the prediction of reading when accounting
for student socioeconomic status, a finding that is not surprising.
7.5 Summary
Multilevel modeling has become an extremely important and powerful tool in
the array of methodologies for the social sciences, by virtue of the fact that many
research studies in the social sciences result in data with some sort of clustering.
The conventional approach to multilevel modeling is based on some variant of
the mixed effects model. A pedagogically useful approach conceives of multi-
level models in terms of levels, as in the work of Raudenbush and Bryk (2002)
and colleagues. The Bayesian perspective of multilevel modeling is to treat the
problem as one of a hierarchy of parameters treated as unknown and where our
uncertainty about the parameters is described by probability distributions. This
chapter attempted to maintain the discussion of multilevel models as levels but
also to show that they are essentially Bayesian hierarchical models. We also point
out that the assumption of exchangeability requires careful consideration in the
context of Bayesian hierarchical models.
8
Bayesian Latent Variable Modeling
As noted in the Preface, a recent book by Sarah Depaoli (2021) provides a detailed
overview of Bayesian structural equation modeling. Depaoli primarily covers
Bayesian structural equation modeling using Mplus (Muthén & Muthén, 1998–
2017), BUGS (Lunn et al., 2009), and blavaan (Merkle et al., 2021) but does not
cover the Stan software package which has been the primary software package
for the examples in this book. Thus, for this chapter, we will examine two simple
but popular latent variable models: (1) confirmatory factor analysis (CFA) and (2)
latent class analysis (LCA), and provide examples that utilize interesting features
of the Stan programming language.
Our focus specifically on CFA and not the full structural equation model
(SEM) stems from the fact that (recursive) SEM models can be shown to be a
special case of CFA (Kaplan, 2009), and Bayesian estimation of CFA can be readily
translated to the SEM context. We focus on LCA insofar as Bayesian estimation
in this case introduces some interesting problems that lead to new computational
solutions.
Σ = ΛΦΛ′ + Ξ (8.2)
a priori number and location of (typically) zero values in the factor loading matrix
Λ. In the conventional approach to factor analysis, the additional restrictions
placed on Λ preclude rotation to simple structure (Lawley & Maxwell, 1971).
where µ and Ω are the mean and variance hyperparameters, respectively, of the
normal prior. The covariance matrix of the common factors, Φ, and the unique-
ness covariance matrix, Ξ, are assumed to follow an inverse-Wishart distribution.
Specifically,
θIW ∼ IW(Ψ, ν) (8.4)
where Ψ is a positive definite matrix, and ν > P − 1, where P is the number of
observed variables. Different choices for Ψ and ν will yield different degrees
of “informativeness” for the inverse-Wishart distribution. The inverse-Wishart
distribution was discussed in Chapter 3. Note that other prior distributions
for θIW can be chosen. For example, Φ can be standardized to a correlation
matrix and the LKJ(η) prior can be applied. Also, if the uniqueness covariance
matrix Ξ is assumed to be a diagonal matrix of unique variances (which is
typically the case), then the elements of Ξ can be given IG(α, β) priors, where α
and β are shape and scale parameters, or C+(0, β) priors, where β is the scale
parameter of the half-Cauchy distribution and the location x0 is set to zero by
definition (see Section 3.1.3).
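As a quick numerical illustration (in Python; a sketch, not from the book), half-Cauchy draws can be generated as absolute values of Cauchy draws, making visible the moderate center and heavy right tail that motivate the C+(0, β) prior for scale parameters:

```python
import numpy as np

rng = np.random.default_rng(5)

# C+(0, beta) draws: absolute values of Cauchy(0, beta) draws
beta = 1.0
draws = np.abs(beta * rng.standard_cauchy(size=10000))

# The half-Cauchy median equals beta, but the distribution retains a heavy
# right tail, allowing large scale values without forcing them a priori
print(float(np.median(draws)))
```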
The confirmatory factor model in this example was specified to have two fac-
tors. The first factor is labeled IntrinsicMotiv measured by enjoyread, lookforward,
enjoy, and interest. The second factor is labeled ExtrinsicMotiv and is measured by
effort, career, important, and job. The Stan code follows.
To begin, we read in and select the data. Missing data are deleted listwise.
library(rstan)
library(loo)
library(bayesplot)
library(dplyr)
# Read in data
PISAcfa <-read.csv(file.choose(),header=TRUE) # browse to select data
PISAcfa <- subset(PISAcfa, select=c(enjoyread,effort,lookforward,
enjoy,career,interest,important,job))
PISAcfa[PISAcfa==999]=NA
PISAcfa <- na.omit(PISAcfa)
The following Stan code was drawn from DeWitt (2018). We next create a list
called patternMat that will be used to assign items to factors.
BayesCFA = "
data {
int<lower=1> n; //sample size
int<lower=1> k; //number of items
int<lower=1> n_fac; // number of factors
matrix[n,k] y; // matrix of outcomes
int<lower=1, upper=n_fac> patternMat[k];
}
transformed data {
matrix[n,k] scaled_y;
for (j in 1:k){
scaled_y[,j] = (y[,j] - mean(y[,j]))/sd(y[,j]);
}
}
In the following parameters block we first define a matrix with rows equal
to the number of factors and columns equal to the sample size which we name
N01prior, to be used to assign an N(0, 1) prior to the factor loadings. This is
followed by Stan's cholesky_factor_corr type, which will provide
the Cholesky decomposition to be used later to obtain the factor correlations.
parameters {
matrix[n_fac,n] N01prior;
cholesky_factor_corr[n_fac] fac_cor_helper;
vector[k] scaled_y_means;
vector<lower=0>[k] scaled_y_unique;
vector<lower=0>[k] lambda;
}
transformed parameters {
matrix[n,n_fac] FacScores;
FacScores = transpose(fac_cor_helper * N01prior);
}
In the model block next, we use Stan's to_vector function to assign an N(0, 1)
prior to N01prior. We specify fac_cor_helper to have a non-informative LKJ prior
via the function lkj_corr_cholesky(1) (see Chapter 3, Section 3.6). The remaining
non-informative/weakly informative priors are then assigned to the means,
uniquenesses, and factor loadings. The final line in this section specifies the
likelihood of the data.
model {
to_vector(N01prior) ˜ normal(0,1);
fac_cor_helper ˜ lkj_corr_cholesky(1);
scaled_y_means ˜ normal(0,1);
scaled_y_unique ˜ cauchy(0,1);
lambda ˜ normal(0,1);
// Likelihood
for (j in 1:k) {
scaled_y[, j] ˜ normal(scaled_y_means[j] +
FacScores[,patternMat[j]] * lambda[j],
scaled_y_unique[j]);
}
}
generated quantities {
corr_matrix[n_fac] fac_cor ;
vector[k] y_means;
vector[k] y_unique;
fac_cor = multiply_lower_tri_self_transpose(fac_cor_helper);
for(j in 1:k){
y_means[j] = scaled_y_means[j]*sd(y[,j]) + mean(y[,j]);
y_unique[j] = scaled_y_unique[j]*sd(y[,j]);
}
}
"
Finally, we provide the information necessary to begin the estimation of the model.
nChains = 4
nIter= 10000
thinSteps = 10
warmupSteps = floor(nIter/2)
patternMat = c(1,1,1,1,2,2,2,2)
data.list = list(n=nrow(PISAcfa), k=8, n_fac=2,
                 y=as.matrix(PISAcfa), patternMat=patternMat)
BayesCFAfit = stan(data=data.list, model_code=BayesCFA, chains=nChains,
                   iter=nIter, warmup=warmupSteps, thin=thinSteps)
Figures 8.1 - 8.3 below provide the trace plots, density plots, and autocorrela-
tion plots, respectively, for the Bayesian CFA analysis. We see that in each case,
there is reasonable evidence of convergence with the chains mixing well.
The results of the Bayesian CFA can be seen below in Table 8.1.
Along with the results, we also see additional evidence of convergence through
inspection of the n_eff values, which are close to 2000, and the Rhat values,
which are all 1.0.
p(y_i) = Σ_{c=1}^C α_c Π_{q=1}^I p(y_iq | z_i = c)  (8.5)
where zi is the class that individual i belongs to, and αc is the probability of being
in class c.
We focus on binary responses (mastery/non-mastery) and therefore we specify
yiq ∼ Bernoulli(pcq ) where pcq is the probability that an individual in class c masters
skill level q. Note that zi is the latent class indicator and so requires a probability
distribution. We write the Bayesian hierarchical latent class model as
zi ∼ Categorical(αc ) (8.6a)
yi | zi = c ∼ Bernoulli(pc ) (8.6b)
where α is a simplex (α_1, . . . , α_C)′ with Σ_{c=1}^C α_c = 1, and p_c is the class-specific
parameter vector.
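The marginalization in Equation (8.5) is exactly what the Stan code later in this section implements with log_sum_exp. A numerical sketch (in Python, with made-up parameter values rather than estimates from the ECLS-K data):

```python
import numpy as np

# Illustrative three-class model for five binary mastery indicators;
# these values are invented for demonstration only
alpha = np.array([0.5, 0.3, 0.2])             # class probabilities (simplex)
p = np.array([[0.9, 0.8, 0.7, 0.6, 0.5],      # P(mastery | class c)
              [0.5, 0.4, 0.3, 0.2, 0.1],
              [0.1, 0.1, 0.1, 0.1, 0.1]])
y = np.array([1, 1, 0, 1, 0])                 # one observed response pattern

# log p(y | z = c): sum of Bernoulli log-pmfs across the five indicators
log_lik_c = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p), axis=1)

# Equation (8.5) marginalized over classes: log p(y) computed stably with
# log-sum-exp, mirroring the lmix vector in the Stan model block
lmix = np.log(alpha) + log_lik_c
log_marg = np.logaddexp.reduce(lmix)

# Posterior class-membership probabilities, as in generated quantities
post_class = np.exp(lmix - log_marg)
print(abs(post_class.sum() - 1.0) < 1e-9)  # → True
```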
For this example, we focus on Bayesian latent class analysis. The data come
from the Early Childhood Longitudinal Study – Kindergarten Cohort of 1998
(ECLS-K:1998-99) (NCES, 2018), which provide a nationally representative sample
of children attending kindergarten in 1998–99 who were periodically assessed until
they reached 8th grade.
We focus on the performance of students in the third wave of the study cor-
responding to students entering the first grade. Five components of reading were
assessed: letter recognition, beginning sounds, ending sounds, word recognition,
and reading in context. For each component, a binary mastery score is assigned
if the student gets three out of four items in the set correct. Less than that, and
the student is deemed not to have mastered that particular skill. On the basis of
previous research (Kaplan, 2008), a three-class model was deemed to fit the data
well, and we will demonstrate Bayesian LCA with a three-class model.
The Stan code is as follows. First, we read in the data and create a list that
specifies the sample size, the number of variables, and the hypothesized number
of classes. Here we specify a three-class model.
options(mc.cores = 4)
reading <- read.csv(file.choose(), header = TRUE)
read3rd <- subset(reading, select = c(letterrec3, beginning3, ending3,
                                      words3, reading3))
lca_data <- list(N = nrow(read3rd), I = 5, C = 3, y = as.matrix(read3rd))
In the following data block we specify the dimensions of the items, respondents,
number of latent classes, and the response matrix.
data {
int<lower=1> I; // number of items
int<lower=1> N; // number of respondents
int<lower=1> C; // number of latent classes
int y[N, I]; // response matrix
}
parameters {
simplex[C] alpha; // probabilities of being in one latent class
real <lower = 0, upper = 1> p[C, I];
}
In the model block we define the vector lmix[C], which will contain the contri-
butions to the marginal probabilities from each latent class. We then use the
log_sum_exp function in Stan to compute the log of the sum of the exponentiated
elements of lmix[C]. Finally, we use the target += statement to increment the
resulting log posterior up to an additive constant.1
model {
real lmix[C];
for (i in 1:N){
for (c in 1: C){
lmix[c] = log(alpha[c]) + bernoulli_lpmf(y[i, ] | p[c,]);
}
target += log_sum_exp(lmix);
}
}
1. For more information about the target += statement in Stan, see Stan Development Team (2021a).
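The marginalization performed in the model block above can be sketched in plain Python for a single respondent (hypothetical values): the discrete class indicator is summed out by accumulating the class contributions on the log scale.

```python
import math

# Sketch of what the model block computes for one respondent
def log_sum_exp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

alpha = [0.2, 0.3, 0.5]                     # class probabilities
p = [[0.9, 0.8], [0.5, 0.4], [0.1, 0.2]]    # 3 classes, 2 items
y_i = [1, 0]                                # one respondent's responses

lmix = []
for c in range(3):
    # log(alpha[c]) + bernoulli_lpmf(y_i | p[c])
    ll = math.log(alpha[c])
    for j in range(2):
        ll += math.log(p[c][j] if y_i[j] == 1 else 1.0 - p[c][j])
    lmix.append(ll)

log_marginal = log_sum_exp(lmix)  # this is what target += accumulates
print(round(log_marginal, 4))
```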
generated quantities {
int<lower = 1> pred_class_dis[N];
simplex[C] pred_class[N];
real lmix[C];
for (i in 1:N){
for (c in 1: C){
lmix[c] = log(alpha[c]) + bernoulli_lpmf(y[i, ] | p[c,]);
}
for (c in 1: C){
pred_class[i][c] = exp((lmix[c])-log_sum_exp(lmix));
}
pred_class_dis[i] = categorical_rng(pred_class[i]);
}
}
"
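The logic of the generated quantities block can likewise be sketched in plain Python (hypothetical lmix values): the posterior probability of membership in class c is the softmax of the per-class log contributions, exp(lmix[c] − log_sum_exp(lmix)).

```python
import math

# Sketch of the generated quantities logic with made-up lmix values
lmix = [-3.32, -2.41, -3.22]
m = max(lmix)
log_marg = m + math.log(sum(math.exp(x - m) for x in lmix))
pred_class = [math.exp(x - log_marg) for x in lmix]

print([round(q, 3) for q in pred_class])  # probabilities summing to one
```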
Finally, we specify the necessary code to run the analysis and produce diagnostic
plots.
nChains = 4
nIter= 5000
thinSteps = 10
burnInSteps = floor(nIter/2)
Label switching arises because the likelihood is invariant to permutations of
the class labels: for example, a two-class solution with α1 = .6, α2 = .4,
p1 = (.6, .1, .1, .2), and p2 = (.2, .4, .3, .1) fits the data exactly as well as
the same solution with the class labels swapped.
In Figure 8.4 below, we can clearly see the label-switching issue in the trace plots,
where the different chains separate for different parameters of the model.
FIGURE 8.4. Trace plots for Bayesian LCA demonstrating the label-switching problem.
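The source of the label-switching problem can be demonstrated directly (a plain Python sketch with hypothetical values): the marginal likelihood is unchanged when the class labels are permuted, so the MCMC chains can drift between the equivalent modes.

```python
# Marginal likelihood of one response pattern under a two-class model
def marginal_lik(y, alpha, p):
    total = 0.0
    for c in range(len(alpha)):
        lik_c = alpha[c]
        for j, yj in enumerate(y):
            lik_c *= p[c][j] if yj == 1 else 1.0 - p[c][j]
        total += lik_c
    return total

y = [1, 0, 1, 0]
alpha = [0.6, 0.4]
p = [[0.6, 0.1, 0.1, 0.2],
     [0.2, 0.4, 0.3, 0.1]]

lik = marginal_lik(y, alpha, p)
lik_relabeled = marginal_lik(y, alpha[::-1], p[::-1])  # swap the two labels
print(lik == lik_relabeled)
```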
In the call to vb() below, the output_samples argument controls the number
of samples that will be saved for summaries, and the tol_rel_obj argument
controls the convergence tolerance.
vb_fit <-
vb(
stan_vb,
data = lca_data,
iter = 40000,
elbo_samples = 1000,
algorithm = c("meanfield"),
output_samples = 1000,
tol_rel_obj = 0.000001
)
We see from Figure 8.5 below that although the trace plots are not perfect, the
variational Bayes procedure eliminates the label-switching problem.
FIGURE 8.5. Trace plots for variational Bayes LCA demonstrating the removal of the
label-switching problem.
The results shown in Table 8.3 below indicate that latent class 1 is composed of
approximately 12% of the sample and is made up of children who have more or less
mastered letter recognition but have not yet quite mastered the remaining skills.
Latent class 2 is composed of approximately 26% of the sample and is made up
of children who have mastered almost all skills except fully reading. Latent class
3, constituting 63% of the sample, is made up of children who have more or less
mastered letter recognition, beginning sounds, and ending sounds. It is important
to point out that these results are not optimal but may be practically useful insofar
as the Pareto-k values, denoted as khat in the table, were all approximately 0.7.
Recall from Chapter 4 that the Pareto-k values for variational Bayes reflect the
quality of the approximation to p(θ | y) (Yao et al., 2018b). Also, again note that, as
of this writing, the implementation of variational Bayes in Stan is still experimental
and the results should be treated with caution.
TABLE 8.3. Parameter estimates for the variational Bayes LCA model
install.packages("poLCA")
library(poLCA)
reading <- read.csv("~/desktop/reading.csv",header=T)
readvars <- subset(reading,select=c(letterrec3,
beginning3,ending3,words3,reading3))
First, we need to recode so that the values are 1 and 2 instead of 0 and 1,
where 2 represents mastery; for example,
readvars <- readvars + 1
This provides the code for a simple latent class analysis without regressing latent
class membership onto a predictor. Finally, we estimate the 3-class model and the
results are shown below in Table 8.4.
TABLE 8.4. Parameter estimates for maximum likelihood LCA model using poLCA
Variable/class Prob(1) Prob(2)
letterrec3
class 1: 0.20 0.80
class 2: 0.00 1.00
class 3: 0.00 1.00
beginning3
class 1: 0.78 0.22
class 2: 0.01 0.99
class 3: 0.07 0.93
ending3
class 1: 0.96 0.04
class 2: 0.02 0.98
class 3: 0.26 0.74
words3
class 1: 1.00 0.00
class 2: 0.02 0.98
class 3: 0.95 0.05
reading3
class 1: 1.00 0.00
class 2: 0.57 0.43
class 3: 1.00 0.00
Estimated class population shares:
0.11 0.26 0.63
Predicted class memberships (by modal posterior prob.):
0.10 0.27 0.63
We find that the results are very close to those obtained using variational
Bayes. Although the class enumeration differs (which is trivial, because class
labels are arbitrary), the estimated class population shares, as well as the
predicted class memberships by modal posterior probability, closely match the
variational Bayes results.
8.3 Summary
This chapter discussed Bayesian approaches to latent variable modeling with spe-
cial focus on confirmatory factor analysis and latent class analysis. Our example
of confirmatory factor analysis involved some unique Stan coding, but it should
be pointed out that the R program blavaan (Merkle et al., 2021) provides a very
simple interface for confirmatory factor analysis (and structural equation model-
ing generally) with Stan running in the background. For examples of how to run
Bayesian structural equation models, including CFA, using blavaan, see Depaoli,
Kaplan, and Winter (2023).
Regarding latent class analysis, we demonstrated the common problem of
label switching and then applied variational Bayes to address the problem. The
results from variational Bayes were, in this instance, quite close to the results
obtained from the frequentist maximum likelihood analysis using poLCA.
ADVANCED TOPICS AND METHODS

9
Missing Data from a Bayesian Perspective
Let M be a missing data indicator that takes the value 1 if the data
are observed, and 0 if the data are missing. Further, let y be the complete data,
yobs represent observed data, and ymiss represent missing data. Finally, let ϕ be
the scalar or vector-valued parameter describing the process that generates the
missing data. In the first instance, the missing data on education and income
might be unrelated to the age, education, or income level of the participants. In
this instance, we say that the missing data are missing completely at random or
MCAR. More formally, MCAR implies that
f (M | y, ϕ) = f (M | ϕ)   (9.1)
which is to say that the missing data indicator is unrelated to the data, missing
or observed, and only related to some unknown missing data-generating mech-
anism. Conditions in which the missing data might be MCAR include random
coding errors, instances of missing by design, such as occurs with balanced in-
complete block spiraling designs (Kaplan, 1995; Kaplan & Su, 2016), or statistical
matching/data fusion (see e.g. Rässler, 2002). It has been recognized that MCAR
is a fairly unrealistic assumption in most social science data.
In the second instance, missing data on, say education, may be due to the age
or income of the respondents. Similarly, missing data on income may be due to
the age or education of the respondents. So, for example, a parent may not reveal
their income level based on their age and/or education, regardless of their income.
In this case, we say that the missing data are missing at random or MAR. Again, in
terms of our notation, MAR implies that
f (M | y, ϕ) = f (M | yobs , ϕ)   (9.2)
which states that the missing data mechanism is unrelated to variables that are
missing, but could be related to other observed variables in the analysis. Generally,
MAR is a more realistic assumption than MCAR.
Finally, missing data on, say income, might be related to the income of the
respondents and not necessarily on their age or education level. That is, individ-
uals choose to omit their response on the income question because of their level
of income, regardless of their age or education level. In this case, we say that the
missing data are not missing at random or NMAR. More formally,

f (M | y, ϕ) = f (M | ymiss , yobs , ϕ)   (9.3)

meaning that the missing data are related to the variable on which there is missing
data as well as, possibly, the observed data. It has been argued that NMAR is
probably the most realistic scenario of why omitted responses are occurring.
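These three mechanisms can be illustrated with a small simulation (a Python sketch with made-up parameter values for the chapter's age/education/income example): complete-case means are roughly unbiased under MCAR but biased under NMAR.

```python
import random
from statistics import mean

# Hypothetical simulation of the three missingness mechanisms for income;
# we keep only the cases whose income would be observed.
random.seed(7)
data = []
for _ in range(10000):
    age = random.gauss(45, 10)
    educ = random.gauss(14, 2)
    income = 1000 * age + 2000 * educ + random.gauss(0, 5000)
    data.append((age, educ, income))

full_mean = mean(r[2] for r in data)

# MCAR: the probability that income is missing is a constant (30%)
mcar_obs = [r for r in data if random.random() >= 0.3]
# MAR: missingness on income depends only on the observed age
mar_obs = [r for r in data if random.random() >= (0.5 if r[0] > 45 else 0.1)]
# NMAR: missingness on income depends on income itself
nmar_obs = [r for r in data if random.random() >= (0.5 if r[2] > full_mean else 0.1)]

print(round(mean(r[2] for r in mcar_obs) - full_mean),
      round(mean(r[2] for r in nmar_obs) - full_mean))
```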
Two types of imputation are possible. The first, discussed next, involves
imputing a single value for each missing data point; the methods considered are
mean imputation, regression imputation, and stochastic regression imputation.
This will be followed by a brief discussion of multiple imputation, which will
set the stage for Bayesian methods. It should be noted that this section does not
present an exhaustive review of single imputation methods.
educi = β0 + β1 (agei ) + β2 (incomei ) + ei,educ   (9.4)

and

incomei = β0 + β1 (agei ) + β2 (educi ) + ei,income   (9.5)

From here, predicted values of education and income are obtained as

educˆ i = βˆ0 + βˆ1 (agei ) + βˆ2 (incomei )   (9.6)

and

incomeˆ i = βˆ0 + βˆ1 (agei ) + βˆ2 (educi )   (9.7)
and these predicted values are imputed for the corresponding missing data point.
Although single regression imputation is an improvement upon mean im-
putation and the ad hoc deletion methods, it suffers from one major drawback.
Specifically, the predicted values based on the regression in Equations (9.6) and
(9.7) will, by definition, lie exactly on the regression line. This implies that among
the subset of observations for which there are missing data, the correlations among
the variables of interest will be 1.0. As a result, the overall R2 value will be overes-
timated. Second, as with mean imputation, it is presumed that the imputed values
would be the ones observed had there been no missing data. For this to be true,
the regression model would have to have been correctly specified.
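The drawback just described is easy to verify by simulation (a Python sketch with simulated age and income): deterministic regression imputations lie exactly on the fitted line, so the correlation among the imputed cases is 1.0, whereas stochastic regression imputation restores the residual scatter.

```python
import random
from statistics import mean, stdev

# Simulate "observed" cases and fit income ~ age by ordinary least squares
random.seed(3)
n = 5000
age = [random.gauss(45, 10) for _ in range(n)]
income = [1000 * a + random.gauss(0, 8000) for a in age]

abar, ibar = mean(age), mean(income)
b1 = sum((a - abar) * (v - ibar) for a, v in zip(age, income)) \
     / sum((a - abar) ** 2 for a in age)
b0 = ibar - b1 * abar
resid_sd = stdev([v - (b0 + b1 * a) for a, v in zip(age, income)])

def cor(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

new_age = [random.gauss(45, 10) for _ in range(1000)]
det_imp = [b0 + b1 * a for a in new_age]                         # on the line
sto_imp = [b0 + b1 * a + random.gauss(0, resid_sd) for a in new_age]

print(round(cor(new_age, det_imp), 3), round(cor(new_age, sto_imp), 3))
```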
This approach would result in partitioning the sample into cells that can be used
for conventional hot deck matching as described above. However, other metrics,
such as the metric of maximum deviation, can be defined as well.
Multiple imputation is considered Bayesianly proper (Rubin, 1987) insofar as the imputations reflect uncertainty about the missing
data as well as uncertainty about the unknown model parameters. Moreover, the
Bayesian view of statistical inference allows for the incorporation of prior knowl-
edge, which can further reduce uncertainty in model parameters. It is important to
point out that although the method of stochastic regression imputation described
above has a Bayesian flavor, it is not Bayesianly proper insofar as it does not
account for parameter uncertainty, but rather only uncertainty in the predicted
missing data values.
In this section, we will discuss Bayesian approaches to multiple imputation.
To begin, we first consider the data augmentation algorithm of Tanner and Wong
(1987), which is a Bayesian approach to addressing missing data problems and
which is similar to Gibbs sampling (see Section 4.3). We then discuss an approach
to multiple imputation using the chained equation algorithm of van Buuren (2012).
From there, we consider two more modern approaches to multiple imputation.
The first of these is based on the EM algorithm, and the second is based on
a combination of the so-called Bayesian bootstrap and predictive mean matching
discussed earlier.
In other words, we use the current value θ(s) and the observed data yobs to generate
a value for the missing data from the predictive distribution of the missing data
p(ymiss | yobs , θ(s) ). The I-step is followed by the P-step, which draws a new value of
θ, namely θ(s+1) , from the posterior distribution of θ given the observed data yobs
and the simulated missing data from the previous step, ymiss(s+1) . Formally,

P-step: Draw θ(s+1) from p(θ | yobs , ymiss(s+1) )
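The I-step and P-step can be sketched for a toy model (a Python illustration, not the norm or mi implementations): a N(θ, 1) model with known variance, a flat prior on θ, and half the observations missing completely at random.

```python
import random
import statistics

# Toy data augmentation: 200 observed values, 200 MCAR missing values
random.seed(11)
y_obs = [random.gauss(2.0, 1.0) for _ in range(200)]
n_miss = 200

theta = 0.0
draws = []
for s in range(2000):
    # I-step: draw y_miss from p(y_miss | y_obs, theta^(s)) = N(theta^(s), 1)
    y_miss = [random.gauss(theta, 1.0) for _ in range(n_miss)]
    # P-step: draw theta^(s+1) from p(theta | y_obs, y_miss^(s+1)); under a
    # flat prior this is N(mean of the completed data, 1/sqrt(n))
    y_full = y_obs + y_miss
    theta = random.gauss(statistics.mean(y_full), 1.0 / len(y_full) ** 0.5)
    if s >= 500:
        draws.append(theta)

# The stationary distribution of theta is its posterior given y_obs alone
print(round(statistics.mean(draws), 2))
```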
Results of the comparative study for multiple imputation under data augmen-
tation are given below in Table 9.1. The analysis uses the R program norm (Schafer, 2012).
Results of the comparative study for multiple imputation using chained equa-
tions are given in Table 9.2 below. The analysis uses the R program mi (Su, Gelman,
Hill, & Yajima, 2011).
Following Little and Rubin (2020), the basic idea behind the EM algorithm is
as follows. We recognize that the missing data ymiss contains information relevant
to estimating a parameter θ, and that given an estimate of θ, we can obtain
information regarding ymiss . Thus, a sensible approach would be to start with
an initial value of θ, say θ(0) , estimate ymiss based on that value, and then with
the “filled-in” data, re-estimate θ via maximum likelihood, referring to this new
estimate as θ(1) . The process then continues until the s iterations (s = 0, 1, 2, . . .)
converge.
More formally, the EM algorithm has two steps: the (E)xpectation step and
the (M)aximization step. The E-step begins with an initial value of the parameter,
θ(s) , treating it as θ, and obtains the expected complete data log-likelihood:

Q(θ | θ(s) ) = ∫ l(θ | y) p(ymiss | yobs , θ(s) ) dymiss   (9.14)
The M-step then obtains θ(s+1) via maximum likelihood estimation of the expected
complete data log-likelihood in Equation (9.14). Dempster et al. (1977; see also
Schafer, 1997) showed that θ(s+1) is a better estimate than θ(s) insofar as the observed
data log-likelihood under θ(s+1) is at least as large as that obtained under θ(s) , that
is,

l(θ(s+1) | yobs ) ≥ l(θ(s) | yobs )
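The EM iteration just described can be sketched for a toy model (a Python illustration; for this simple model the E-step reduces to filling in the missing values with their expectations, and the fixed point is the observed-data maximum likelihood estimate).

```python
import random
import statistics

# EM for the mean theta of a N(theta, 1) variable with half the values
# MCAR missing (a deliberately minimal, hypothetical setup)
random.seed(5)
y_obs = [random.gauss(2.0, 1.0) for _ in range(50)]
n_miss = 50

theta = 0.0  # starting value theta^(0)
for s in range(100):
    # E-step: E[y_miss | theta^(s)] = theta^(s), so the expected complete-data
    # log-likelihood is maximized using the "filled-in" data
    filled = y_obs + [theta] * n_miss
    # M-step: maximum likelihood estimate of theta from the completed data
    theta_new = statistics.mean(filled)
    converged = abs(theta_new - theta) < 1e-12
    theta = theta_new
    if converged:
        break

# EM converges to the observed-data maximum likelihood estimate
print(round(theta, 4), round(statistics.mean(y_obs), 4))
```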
The EM algorithm has been extended to handle the problem of multiple im-
putation without the need for computationally intensive draws from the posterior
distribution, as with the data augmentation approach. The idea is to extend the EM
algorithm using a bootstrap approach. This approach is labeled EMB (Honaker
& King, 2010) and implemented in the R program Amelia (Honaker, King, &
Blackwell, 2011), which we will use in our analyses below.
Following Honaker and King (2010) and Honaker (personal communication,
June 2011), the first step is to bootstrap the dataset to create m versions of the
incomplete data, where m ranges typically from three to five as in other multiple
imputation approaches. Bootstrap resampling involves taking a sample of size n
with replacement from the original dataset. Second, the EM algorithm is run and it
is here that Honaker and King (2010) allow for the inclusion of prior distributions
on the model parameters estimated via the EM algorithm. Notice that because m
bootstrapped samples are obtained, and each EM run on these samples may
include priors, the converged parameter estimates will differ across the m runs.
Indeed, with priors, the final results are the maximum a posteriori (MAP)
estimates, which are the Bayesian counterpart of the maximum likelihood estimates.
Finally, missing values are imputed based on the final converged estimates for each
of the m datasets. These m versions can then be used in subsequent analyses.
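The three steps of the EMB procedure can be sketched as follows (a Python toy model, not the Amelia implementation; here the "EM estimate" for a N(θ, 1) model with MCAR missingness reduces to the observed-data mean).

```python
import random
import statistics

# Hypothetical incomplete data: 100 values, 30 missing (None)
random.seed(13)
y = [random.gauss(2.0, 1.0) for _ in range(100)]
missing = set(random.sample(range(100), 30))
data = [None if i in missing else y[i] for i in range(100)]

m = 5
completed_means = []
for _ in range(m):
    # Step 1: bootstrap resample of size n from the incomplete data
    boot = [data[random.randrange(100)] for _ in range(100)]
    # Step 2: run EM on the bootstrap sample (trivial for this toy model)
    theta = statistics.mean(v for v in boot if v is not None)
    # Step 3: impute the original missing cells from the converged estimate
    completed = [random.gauss(theta, 1.0) if v is None else v for v in data]
    completed_means.append(statistics.mean(completed))

print(len(completed_means))  # m completed datasets for subsequent analyses
```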
Results of the comparative study for multiple imputation under the EM boot-
strap are given below in Table 9.3. The analysis uses the R program Amelia
(Honaker et al., 2011).
TABLE 9.4. Multiple imputation using Bayesian bootstrap predictive mean matching
Parameter Non-informative prior Informative prior Frequentist
EAP SD EAP SD Coeff. SE
Full Model
INTERCEPT 487.71 3.32 482.44 2.52 487.72 3.36
READING on GENDER 6.27 2.30 10.96 1.78 6.27 2.30
READING on NATIVE −6.62 3.87 −6.12 3.04 −6.63 3.87
READING on SLANG 7.79 4.55 11.03 3.27 7.77 4.59
READING on ESCS 31.34 1.31 33.09 1.00 31.35 1.30
READING on JOYREAD 28.76 1.26 25.39 1.00 28.76 1.25
READING on DIVREAD −4.51 1.19 −1.83 0.93 −4.51 1.19
READING on MEMOR −19.10 1.31 −18.82 1.07 −19.09 1.30
READING on ELAB −15.12 1.27 −14.47 1.05 −15.11 1.26
READING on CSTRAT 28.12 1.44 27.10 1.15 28.12 1.45
Note. EAP = Expected A Posteriori. SD = Standard Deviation.
An inspection of Tables 9.1–9.4 reveals similar results across methods of imputation
for both the non-informative and frequentist cases, but sizable differences
when comparing the informative case to the non-informative and frequentist cases,
particularly for standard deviations, as expected. Of course, it is difficult to draw
generalizations about these methods when based on real data, but the results do
serve as a caution that important differences can occur depending on whether and
how missing data are handled.
9.5 Summary
This chapter presented an overview of advanced methods for handling problems
of missing data. Given theoretical developments discussed in Little and Rubin
(2020), extended by Schafer (1997), Rässler (2002), and van Buuren (2012), among
others, and summarized in Enders (2022), there is no defensible reason to resort to
ad hoc methods such as listwise and pairwise deletion. The central idea of multiple
imputation originated by Rubin (1987) is essentially Bayesian, and the various
algorithms described in this chapter, such as data augmentation and chained
equations, now allow for a fully Bayesian approach to addressing uncertainty in
missing data and for analyzing multiply imputed datasets using fully Bayesian
methods. The next chapter takes up Bayesian approaches to variable selection
and sparsity.
10
Bayesian Variable Selection and
Sparsity
10.1 Introduction
Over the past three decades a great deal of attention has been paid to the problem
of variable selection. Specifically, in considering a relatively long list of predictors
such as shown in the linear regression example in Chapter 5, concern focuses on
the trade-off between the bias that could occur if important variables are omitted
from the model and the variance that could occur from overfitting the model
with variables that do not play a very important role in the prediction of the
outcome. Variable selection methods are designed to yield so-called sparse models
that contain, more or less, the important predictors of the outcome.
This chapter concentrates on Bayesian methods for variable selection, although
the two methods discussed here can be implemented in a frequentist framework
and the results are often comparable. However, as pointed out by van Erp (2020),
there are a number of important benefits in adopting a Bayesian framework for
variable selection. First, as we will see, variable selection can be easily imple-
mented through the priors placed on model parameters, and these are generically
referred to as shrinkage priors or sparsity-inducing priors. Shrinkage priors can
be specified to shrink small coefficients toward zero while allowing large coeffi-
cients to remain large. Sparsity is induced by specifying certain hyperparameters
within the priors set on the model parameters. These hyperparameters are defined
through their own hyperprior distributions. The hyperpriors can be manipulated
to increase or decrease the amount of shrinkage in the estimated effects.
The second benefit of adopting a Bayesian perspective to variable selection is
that the penalty term is estimated in the same step as the other model parameters.
In other words, the penalty term is built into the model estimation process because
it is incorporated directly into the model via a prior. In turn, that prior can be
specified in a flexible manner through different settings, controlling for the degree
of shrinkage as the researcher sees fit.
Finally, the third benefit of estimating Bayesian penalty terms is that many dif-
ferent forms of penalties can be implemented. There are frequentist-based penalty
techniques, such as the ridge and lasso methods described below, which have their
Bayesian counterparts. In addition, there are methods that are strictly Bayesian,
such as the spike-and-slab prior and the horseshoe prior (see van de Schoot et al.,
2021, for more information on these priors).
In this chapter, we focus on Bayesian variable selection methods in the context
of linear models and consider four methods for variable selection: (1) the ridge
prior (A. E. Hoerl & Kennard, 1970; Hsiang, 1975), (2) the lasso prior (Park &
Casella, 2008; Tibshirani, 1996), (3) the horseshoe prior (Carvalho, Polson, &
Scott, 2010), and (4) the regularized horseshoe prior (Piironen & Vehtari, 2017).
The first two can also
be implemented in a frequentist setting, but we will concentrate on their Bayesian
counterparts. Although there are many more that could be considered (see, e.g.,
Hastie, Tibshirani, & Friedman, 2009), these methods are chosen to highlight the
issues of variable selection and lead naturally into our discussion of Bayesian
model averaging in Chapter 11. A representation of the different shrinkage prior
distributions is given below in Figure 10.1, and a comparison of the performance
of these priors will be given in Section 10.6 below.
FIGURE 10.1. Four types of shrinkage priors. Top row left: Ridge prior N(0,1); top row
right: Laplace prior with location=0, scale=4; bottom row left: Horseshoe prior with λp ∼
C+ (0,1) and τ ∼ C+ (0,1); bottom row right: Regularized horseshoe prior.
β̂ridge = argmin_β [ y′y − β′x′x + λ ∑_{p=1}^{P} β_p² ]   (10.1)
Preliminary analyses with the full sample reveal virtually no differences
among the methods, as would be expected.1
In what follows, only the data and parameter blocks are provided insofar as
the remaining code is the same as that in Example 5.1 and also across all other
methods. For the ridge priors, we give an N(0, 1) prior to the regression coefficients
and a C+ (0,1) distribution to the standard deviation of the residuals. The likelihood
follows the specification of the priors.
RidgeString = "
data {
int<lower=0> n;
vector [n] readscore;
vector [n] Female; vector [n] ESCS;
vector [n] METASUM; vector [n] PERFEED;
vector [n] JOYREAD; vector [n] MASTGOAL;
vector [n] ADAPTIVITY; vector [n] TEACHINT;
vector [n] SCREADDIFF; vector [n] SCREADCOMP;
}
parameters {
real alpha;
real beta1; real beta6;
real beta2; real beta7;
real beta3; real beta8;
real beta4; real beta9;
real beta5; real beta10;
real<lower=0> sigma;
}
model {
real mu[n];
for (i in 1:n)
mu[i] = alpha + beta1*Female[i] + beta2*ESCS[i] + beta3*METASUM[i]
+ beta4*PERFEED[i] + beta5*JOYREAD[i] + beta6*MASTGOAL[i]
+ beta7*ADAPTIVITY[i] + beta8*TEACHINT[i]
+ beta9*SCREADDIFF[i] + beta10*SCREADCOMP[i] ;
// Priors
alpha ˜ normal(0, 1);
beta1 ˜ normal(0, 1); beta6 ˜ normal(0, 1);
beta2 ˜ normal(0, 1); beta7 ˜ normal(0, 1);
beta3 ˜ normal(0, 1); beta8 ˜ normal(0, 1);
beta4 ˜ normal(0, 1); beta9 ˜ normal(0, 1);
beta5 ˜ normal(0, 1); beta10 ˜ normal(0, 1);
sigma ˜ cauchy(0,1);
1. We do not sample students within schools; thus, this example should not be taken as a serious model of reading proficiency.
// Likelihood
readscore ˜ normal(mu, sigma);
The important points to note about this code are that, first, the data should be
standardized before estimation. Second, note that the specification of the N(0, 1)
priors induces the ridge shrinkage in the sense that regression coefficients that
are close to zero will be shrunk toward the prior mean of zero, whereas large
coefficients should be relatively unaffected by the prior. Again, as noted above,
the extent of the shrinkage is determined by the value of λ.
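The shrinkage induced by the penalty in Equation (10.1) can be seen in a one-predictor sketch (Python, simulated data, no intercept; a toy illustration rather than the book's PISA analysis): the minimizer of the penalized least squares criterion is the least squares estimate shrunk by a factor of Sxx / (Sxx + λ).

```python
import random

# Simulate a single standardized predictor and outcome
random.seed(2)
x = [random.gauss(0, 1) for _ in range(200)]
y = [0.5 * xi + random.gauss(0, 1) for xi in x]

sxx = sum(xi * xi for xi in x)
sxy = sum(xi * yi for xi, yi in zip(x, y))
beta_ols = sxy / sxx  # ordinary least squares (lambda = 0)

for lam in [0.0, 10.0, 100.0, 1000.0]:
    beta_ridge = sxy / (sxx + lam)  # shrinks toward zero as lambda grows
    print(lam, round(beta_ridge, 4))
```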
β̂lasso = argmin_β [ y′y − β′x′x + λ ∑_{p=1}^{P} |βp| ]   (10.3)

The term λ ∑_{p=1}^{P} |βp| is referred to as an L1-norm penalty, which allows less
important coefficients to be set to zero, and thus the lasso provides for both
shrinkage and variable selection.
Bayesian lasso penalization uses a different shrinkage prior as compared to the
Bayesian ridge approach. Specifically, Tibshirani (1996) showed that |βp | is propor-
tional to minus the log-density of the double exponential (Laplace) distribution.
That is, the lasso estimate of the posterior mode of βp can be obtained by using the
prior

p(βp ) = (1 / 2τ) exp( −|βp | / τ )   (10.4)

where τ = 1/λ.
The top right of Figure 10.1 shows the double exponential distribution. We see
that the double exponential distribution is ideal because it peaks at zero, which
shrinks small coefficients toward zero. However, the double exponential can be
set to have thick tails (in both directions), allowing the larger coefficients to remain
large. Given that the distribution is centered at zero to control shrinkage toward
zero, the mean hyperparameter setting is fixed to zero. The scale, or dispersion,
of the double exponential distribution is the hyperparameter that researchers can
alter when implementing the shrinkage. This defines the amount of spread and
the thickness of the tails, which controls the degree of shrinkage in coefficients.
Again, a C+ (0,1) prior can be specified on the standard deviation of the residuals,
if desired.
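The correspondence between the Laplace prior and the L1 penalty can be checked numerically (a short Python sketch): minus the log of the Laplace density equals |β|/τ plus a constant, so the posterior mode under this prior is the lasso estimate.

```python
import math

# Minus the log of the Laplace density in Equation (10.4)
def neg_log_laplace(beta, tau):
    density = (1.0 / (2.0 * tau)) * math.exp(-abs(beta) / tau)
    return -math.log(density)

tau = 0.5
const = math.log(2.0 * tau)
for beta in [-2.0, -0.5, 0.0, 1.0, 3.0]:
    # equals the L1 penalty term |beta| / tau up to an additive constant
    assert abs(neg_log_laplace(beta, tau) - (abs(beta) / tau + const)) < 1e-12
print("minus the log Laplace density is |beta|/tau plus a constant")
```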
Although the ridge and lasso approaches are similarly implemented in
the Bayesian framework, these techniques can produce different amounts of
shrinkage depending on the hyperparameter settings. That is, the lasso approach
can result in more shrinkage for the small estimates, but less shrinkage for the
large estimates. This result is a function of the double exponential distribution
implemented in the lasso approach. The double exponential distribution is more
peaked around zero and it has heavier tails compared to the normal distribution
used in the ridge approach. Regardless of the approach implemented, Bayesian
penalization can be a useful tool when attempting to avoid overfitting a complex
model to small samples. Indeed, the lasso is simultaneously a shrinkage and
variable selection method. In addition, these approaches further highlight
the modeling flexibility that Bayesian methods provide through the flexible
implementation of priors. Next follows the specification for the lasso priors.
Below we show the Stan code for the lasso prior. Note in the model block
the use of the double_exponential(0, 1) distribution on the regression coefficients
to induce the lasso (the parameters block is the same as in the ridge model above).
Also, notice that we do not attempt to induce as much shrinkage in the intercept alpha.
modelString = "
data {
int<lower=0> n;
vector [n] readscore;
vector [n] Female; vector [n] ESCS;
vector [n] METASUM; vector [n] PERFEED;
vector [n] JOYREAD; vector [n] MASTGOAL;
vector [n] ADAPTIVITY; vector [n] TEACHINT;
vector [n] SCREADDIFF; vector [n] SCREADCOMP;
}
model {
real mu[n];
for (i in 1:n)
mu[i] = alpha + beta1*Female[i] + beta2*ESCS[i] +
beta3*METASUM[i]
+ beta4*PERFEED[i] + beta5*JOYREAD[i] + beta6*MASTGOAL[i]
+ beta7*ADAPTIVITY[i] + beta8*TEACHINT[i]
+ beta9*SCREADDIFF[i] + beta10*SCREADCOMP[i] ;
// Priors
alpha ˜ normal(0, 1);
beta1 ˜ double_exponential(0, 1); beta6 ˜ double_exponential(0, 1);
beta2 ˜ double_exponential(0, 1); beta7 ˜ double_exponential(0, 1);
beta3 ˜ double_exponential(0, 1); beta8 ˜ double_exponential(0, 1);
beta4 ˜ double_exponential(0, 1); beta9 ˜ double_exponential(0, 1);
beta5 ˜ double_exponential(0, 1); beta10 ˜ double_exponential(0, 1);
sigma ˜ cauchy(0, 1);
// Likelihood
readscore ˜ normal(mu, sigma);
The lasso is not without limitations (see van Erp, Oberski, & Mulder, 2019).
First, when the number of variables p is greater than the sample size n (which
we might encounter in “big data” problems), the model selection algorithm will
stop after selecting n variables because the model will no longer be identified.
Second, if there are
groups of variables that are highly pairwise correlated, the lasso will select only
one of the variables from that group rather arbitrarily. Third, when n > p, which
is the motivating case in this chapter, and when variables are highly correlated,
it has been shown that ridge regression will outperform the lasso with respect to
predictive performance.
2. The horseshoe prior gets its name from the fact that, under certain conditions, the probability distribution of the shrinkage parameter associated with the horseshoe prior reduces to a Beta(1/2, 1/2) distribution, which has the shape of a horseshoe.
For this example, we specify λp as the local prior for each of the p regression
coefficients and τ as the global prior in the Stan parameter block, where we set
τ0 = 1. Note that in the Stan model block, the regression coefficients have mean
zero and a scale mixture τλp .
Horseshoe = "
data {
int<lower=1> n; // Number of data
int<lower=1> p; // Number of covariates
matrix[n,p] X;
real readscore[n];
}
parameters {
vector[p] beta;
vector<lower=0>[p] lambda; // Local prior
real<lower=0> tau; // Global prior
real alpha;
real<lower=0> sigma;
}
model {
beta ˜ normal(0, tau * lambda); // Scale mixture
tau ˜ cauchy(0, 1);
lambda ˜ cauchy(0, 1);
alpha ˜ normal(0, 1);
sigma ˜ cauchy(0, 1);
// Likelihood (assumed here; follows the pattern of the ridge and lasso models)
readscore ˜ normal(alpha + X * beta, sigma);
}
where s2 is the variance for each of the p predictor variables. As pointed out by
Piironen and Vehtari (2017), those variables that have large variances would be
considered more relevant a priori, and while it is possible to provide predictor
specific values for s2 , generally we scale the variables ahead of time so that s2 =
1. Finally, c2 is the slab width which controls the size of the large regression
coefficients.
To gain an intuition of the regularized horseshoe, first note that the form
of Equation (10.6a) is quite similar to the horseshoe prior, however λ̃p places a
control on the size of the coefficients by introducing a slab width c2 in Equation
(10.6b). Following Piironen and Vehtari (2017), notice that if τ2 λ2p ≪ c2 , then this
means that βp is close to zero and λ̃p → λp , which is the original horseshoe in
Section 10.4. However, if τ2 λ2p ≫ c2 , then λ̃2p → c2 /τ2 and the prior begins to
approach the N(0, c2 ), where, again, the choice of c2 controls the size of the large
coefficients. Because c2 is a slab width that might not be well known, it follows
that it should be given a prior distribution, and Piironen and Vehtari (2017)
recommend the inverse-gamma distribution in Equation (10.6d), which induces a
relatively non-informative Student's t slab when coefficients are far from zero.
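These limiting cases can be checked numerically (a Python sketch using the same formula as the lambda_tilde line in the transformed parameters block shown below):

```python
import math

# lambda_tilde_p^2 = c^2 * lambda_p^2 / (c^2 + tau^2 * lambda_p^2)
def lambda_tilde(lam, tau, c):
    return math.sqrt(c * c * lam * lam / (c * c + tau * tau * lam * lam))

tau, c = 0.1, 2.0  # hypothetical global scale and slab width

# tau^2 lambda^2 << c^2: behaves like the original horseshoe, lambda_tilde -> lambda
small = lambda_tilde(0.5, tau, c)
# tau^2 lambda^2 >> c^2: the slab takes over; the implied prior scale
# tau * lambda_tilde approaches c, i.e., the prior approaches N(0, c^2)
large = tau * lambda_tilde(1e6, tau, c)
print(round(small, 4), round(large, 4))
```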
In setting up Stan, first recall that, as with all of the methods for sparsity, the
data are first standardized to have a mean of zero and standard deviation of
one. Also, recall that Stan works with standard deviations and not variances or
precisions. To start, for the regularized horseshoe we first need to indicate our
belief regarding the number of large coefficients. This is required because the
global scale parameter τ0 inside the transformed parameter block is a function of
the number of large coefficients assumed by the researcher ahead of analyzing the
data. In the transformed data block, this is indicated by the line real p0=5;.
n <- nrow(PISA18sampleScale)
X <- PISA18sampleScale[,2:11]
readscore <- PISA18sampleScale[,1]
p <- ncol(X)
modelString = "
data {
int <lower=1> n; // number of observations
int <lower=1> p; // number of predictors
real readscore[n]; // outcome
matrix[n,p] X; // inputs
}
transformed data {
real p0 = 5;
}
Next, in the parameters block, we define the parameters of the regularized horse-
shoe given in Equations (10.6a) - (10.6e).
parameters {
vector[p] beta;
vector<lower=0>[p] lambda;
real<lower=0> c2;
real<lower=0> tau;
real alpha;
real<lower=0> sigma;
In the transformed parameters we specify tau0 in line with Betancourt (2018a) and
we write λ̃ as in Equation (10.6d).
transformed parameters {
real tau0 = (p0 / (p - p0)) * (sigma / sqrt(1.0 * n));
vector[p] lambda_tilde =
sqrt(c2) * lambda ./ sqrt(c2 + square(tau) * square(lambda));
}
model {
beta ˜ normal(0, tau * lambda_tilde);
lambda ˜ cauchy(0, 1);
c2 ˜ inv_gamma(2,8);
tau ˜ cauchy(0, tau0);
}
// For posterior predictive checking and loo cross-validation
generated quantities {
vector[n] readscore_rep;
vector[n] log_lik;
for (i in 1:n) {
readscore_rep[i] = normal_rng(alpha + X[i,:] * beta, sigma);
log_lik[i] = normal_lpdf(readscore[i] | alpha + X[i,:]
* beta, sigma);
}
}
"
First, note that the horseshoe and regularized horseshoe methods generated a
warning of divergent transitions after warmup. This message needs to be taken
seriously and implies that the model is complex enough that the HMC/NUTS
algorithm cannot track small changes in the curvature of the log posterior. As
such, the estimates may be biased. A possible solution to this problem is to
increase the adapt_delta setting beyond its default of 0.8 and the max_treedepth
setting beyond its default value of 10, and of course to check the model and priors.
For this example, we set adapt_delta = .9999 and max_treedepth = 20 and still had
divergent transitions. Generally speaking, however, if other diagnostics such as
n_eff and Rhat look good, then one can proceed to interpret the results, albeit
with caution. For more information on Stan program warnings, see https://
mc-stan.org/misc/warnings.html.
With this caveat in mind, a visual inspection of the results in Table 10.1 in-
dicates that the ridge and lasso priors provide results that are somewhat similar
to Bayesian linear regression with non-informative priors that we found in Table
5.1 (when standardized). On the other hand, the original horseshoe prior and
regularized horseshoe achieve slightly more shrinkage in the posterior estimates
and standard deviations with the regularized horseshoe yielding the most shrink-
age, and indeed shrinking some of the larger coefficients (e.g., beta2 and beta3), as
expected. In terms of cross-validation, however, we find that the horseshoe prior
yields the lowest value of the LOO-IC followed closely by the regularized horse-
shoe. A comparative analysis of this kind might be worthwhile in practice if the
goal of the analysis is not only variable selection but also comparative predictive
performance.
β_p | λ_p, c ∼ λ_p N(0, c²)        (10.8a)

λ_p ∼ Bernoulli(π)        (10.8b)
The result of this setup is that λ_p is a discrete parameter that takes on only two values, λ_p ∈ {0, 1}. It is important to note that Stan cannot directly sample discrete parameters. However, studies have shown the similarity in performance between the spike-and-slab prior and the horseshoe prior (see, e.g., Carvalho et al., 2010; Polson & Scott, 2011).
Finally, the spike-and-slab prior is similar to the regularized horseshoe prior when
the slab width c < ∞, thus providing some regularization on large coefficients.
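A small simulation (Python here; the values of π and c are hypothetical) illustrates the discrete nature of the spike-and-slab prior in Equations (10.8a)-(10.8b): roughly a 1 − π share of the coefficients is set exactly to zero.

```python
import random

rng = random.Random(42)
pi, c = 0.3, 2.0  # hypothetical inclusion probability and slab standard deviation

def spike_and_slab_draw():
    # lambda_p ~ Bernoulli(pi); beta_p | lambda_p, c ~ lambda_p * N(0, c^2)
    lam = 1 if rng.random() < pi else 0
    return lam * rng.gauss(0.0, c)

draws = [spike_and_slab_draw() for _ in range(10_000)]
zero_frac = sum(d == 0.0 for d in draws) / len(draws)  # roughly 1 - pi
```

The horseshoe achieves a continuous approximation to this behavior, which is why it can be implemented directly in Stan while the spike-and-slab cannot.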
10.7 Summary
This chapter considered the problem of Bayesian variable selection and sparsity. Many variable selection methods can be implemented in both the frequentist and Bayesian frameworks, and some are explicitly Bayesian. However, both simulation
studies and real data analyses seem to point to the original horseshoe prior or
regularized horseshoe prior as the preferred methods for inducing sparsity, par-
ticularly with respect to out-of-sample predictive performance. As usual, in the
case of large sample sizes, application of sparsity-inducing priors will likely lead
to similar conclusions. Nevertheless, it may be prudent to examine results using
different priors and choose the model that yields desirable shrinkage along with
acceptable out-of-sample predictive performance.
In the end, however, a single model is selected for interpretation, and although
the predictive performance of Bayesian shrinkage methods is often better than
regression modeling without inducing sparsity, these methods do not account
for the uncertainty that underlies the choice of a single model. An approach that addresses this model uncertainty directly is the subject of the next chapter.
192 Bayesian Statistics for the Social Sciences
11
Model Uncertainty
11.1 Introduction
In the previous chapter, we discussed Bayesian approaches to model regularization
that have the effect of balancing the bias-variance trade-off by shrinking regression
coefficients close to, or equal to, zero and allowing large coefficients to remain large.
In the end, regardless of whether one is using variable selection methods or not,
typically a final model is selected, and this model is often discussed as though
it was the model chosen ahead of time. The Bayesian framework recognizes,
however, that model selection is conducted under pervasive uncertainty insofar
as a particular model is typically chosen among a set of competing models that
could also have generated the data. This problem has been recognized for over 40
years. Early on, Leamer (1978, p. 91) noted that
...ambiguity over a model should dilute information about the re-
gression coefficients, since part of the evidence is spent to specify the
model.
Similar observations were made later by Draper et al. (1987, p. iii) who stated
This [model selection] tends to underestimate Your actual uncer-
tainty, with the result that Your actions both inferentially in science
and predictively in decision-making, are not sufficiently conservative. [Capitalization authors'.]
Furthermore, Hoeting, Madigan, Raftery, and Volinsky (1999) wrote
Standard statistical practice ignores model uncertainty. Data analysts
typically select a model from some class of models and then proceed
as if the selected model had generated the data. This approach ig-
nores the uncertainty in model selection, leading to over-confident
inferences and decisions that are more risky than one thinks they are.
(p. 382)
As the quotes by Leamer (1978), Draper et al. (1987), and Hoeting et al. (1999)
suggest, it is risky to settle on a single model for predictive or explanatory pur-
poses. Rather, it may be prudent to draw predictive strength through combining
1. Portions of this chapter are based on Kaplan (2021).
models. The purpose of this chapter is to provide an overview and some examples
of methods to address the problem of model uncertainty. First, we will discuss
the elements of predictive modeling that set the foundation for our discussion of
model uncertainty. We then turn to the method of Bayesian model averaging (BMA)
as a classical approach to addressing model uncertainty. We then point out that
Bayesian model averaging rests on a strong assumption regarding the existence
of a true model among those to be averaged, and so we will discuss the notion of
true models in the general context of M-frameworks (Bernardo & Smith, 2000).
Relaxing this assumption will lead us to a discussion and example of Bayesian
stacking.
The organization of this chapter is as follows. In the next section, we discuss
Bayesian predictive modeling as embedded in Bayesian decision theory. Here
we discuss the concepts of expected utility and expected loss, and frame these
ideas within the use of information-theoretic methods for judging decisions. We
show that the action which optimizes the expected utility is the BMA solution.
Then, we discuss the statistical elements of BMA, including connections to Bayes factors, computational considerations, and the problem of choosing parameter and model priors.
This is followed by a simple example of BMA in linear regression modeling and
a comparison of results based on different parameter and model prior settings.
The next section is a presentation of methods for evaluating the quality of a
solution based on BMA, including the use of scoring rules and how they tie back
to the information-theoretic concepts discussed earlier in the chapter. Finally, we
discuss the main problem associated with conventional BMA — namely, that BMA
assumes that the true data-generating model is contained in the set of models that
are being averaged and demonstrate the method of Bayesian stacking that directly
deals with this assumption. A simple example of Bayesian stacking is provided.
Bayesian decision theory casts the problem of predictive evaluation in the context
of maximizing the expected utility of a model – that is, the benefit that is accrued
from using a particular model to predict future observations. The greater the ex-
pected utility, the better the model is at predictive performance in comparison to
other models.
The idea here is to take an action a that maximizes the utility u when the future
observation is ỹ. Clyde and Iversen (2013) show that the optimal decision obtains
when a∗ = E( ỹ | D), which is the posterior predictive mean of ỹ given the data D.
Under the assumption that the true data-generating model exists and is among
the set of models under consideration, this can be expressed as
E(ỹ | D) = ∑_{k=1}^{K} E(ỹ | M_k, D) p(M_k | D) = ∑_{k=1}^{K} p(M_k | D) ŷ_{M_k}        (11.3)
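Numerically, Equation (11.3) is just a weighted average of model-specific predictions, with weights given by the posterior model probabilities. A minimal sketch (Python; the posterior model probabilities and per-model predictions are hypothetical):

```python
# Hypothetical posterior model probabilities p(M_k | D) and model-specific
# posterior predictive means E(ytilde | M_k, D) for one future observation
pmp   = [0.35, 0.25, 0.20, 0.20]      # sums to 1 over the model space
preds = [478.0, 482.5, 476.0, 480.0]  # e.g., predicted reading scores

# Equation (11.3): the BMA prediction is the PMP-weighted average
bma_pred = sum(w * yhat for w, yhat in zip(pmp, preds))
```

No single model's prediction is used on its own; every model contributes in proportion to its posterior probability.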
The popularity of BMA across many different disciplines is due to the fact that BMA provides better out-of-sample predictive performance than any single model under consideration, as measured by the logarithmic scoring rule (Raftery & Zheng, 2003). In addition, Bayesian model averaging has been
implemented in the R software programs BMA (Raftery, Hoeting, Volinsky, Painter,
& Yeung, 2020), BMS (Zeugner & Feldkircher, 2015), and BAS (Clyde, 2017).
These packages are quite general, allowing Bayesian model averaging over linear
models, generalized linear models, and survival models, with flexible handling of
parameter and model priors.
Two approaches have been proposed for reducing the size of the model space: one is based on Occam's window (Madigan & Raftery, 1994) and the other is based on a Metropolis sampler referred to as Markov chain Monte Carlo model composition (MC3; Madigan & York, 1995).
Occam’s Window
To motivate the idea behind Occam’s window, consider the problem of finding
the best subset of predictors in a linear regression model. Following closely the
discussion given in Raftery et al. (1997), we would initially start with a very
large number of predictors; but the goal would be to pare this down to a smaller
number of predictors that provide accurate predictions. As noted in the earlier
quote by Hoeting et al. (1999), the concern in drawing inferences from a single best
model is that the choice of a single set of predictors ignores uncertainty in model
selection. Occam’s window provides an approach to BMA that reduces the subset
of models under consideration, but instead of settling on a final "best" model, we
instead integrate over the parameters of the smaller set with weights reflecting the
posterior uncertainty in each model.
The algorithm proceeds as follows (Raftery et al., 1997). In the initial step,
the space of possible models is reduced by implementing the so-called leaps and
bounds algorithm developed by Furnival and Wilson, Jr. (1974) in the context of
best subsets regression (see also Raftery, 1995). This initial step can substantially
reduce the number of models, after which Occam’s window can then be employed.
The general idea is that models are eliminated from Equation (11.4) if they predict
the data less well than the model that provides the best predictions based on a
caliper value C chosen in advance by the analyst. The caliper C sets the width of
Occam's window. Formally, consider again a set of models {M_k}_{k=1}^{K}. Then, the set A′ is defined as

A′ = { M_k : max_l {p(M_l | y)} / p(M_k | y) ≤ C }        (11.7)
In words, Equation (11.7) compares the posterior probability of a given model, p(M_k | y), to that of the model with the largest posterior model probability, max_l {p(M_l | y)}. If the ratio in Equation (11.7) is greater than the chosen value C, then model M_k is discarded from the set A′ of models to be included in the model averaging. Notice that the set of models
contained in A′ is based on Bayes factor values.
The set A′ now contains models to be considered for model averaging. In the
second, optional, step, models are discarded from A′ if they receive less support
from the data than simpler submodels. Formally, models are further excluded
from Equation (11.4) if they belong to the set
B = { M_k : ∃ M_l ∈ A′, M_l ⊂ M_k, p(M_l | y) / p(M_k | y) > 1 }        (11.8)
In words, Equation (11.8) states that model M_k belongs to B if there exists a model M_l within the set A′ that is simpler than M_k (M_l ⊂ M_k) yet receives more support from the data. If a complex model receives less support from the data than a simpler submodel, again based on the Bayes factor, then it is placed in B and thereby excluded from the averaging. Notice that the second step corresponds to the principle of Occam's razor (Madigan & Raftery, 1994).
Model Uncertainty 199
With Step 1 and Step 2, the problem of reducing the size of the model space
for BMA is simplified by replacing Equation (11.4) with
p(ỹ | y, A) = ∑_{M_k ∈ A} p(ỹ | M_k, y) p(M_k | y, A)        (11.9)
In other words, models under consideration for BMA are those that are in A′ but
not in B.
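The two-step reduction can be sketched as follows (Python; the models, their posterior probabilities, and the caliper C are all hypothetical):

```python
# Hypothetical posterior model probabilities, keyed by each model's predictor set
pmp = {frozenset({"x1"}):             0.05,
       frozenset({"x1", "x2"}):       0.50,
       frozenset({"x1", "x3"}):       0.30,
       frozenset({"x1", "x2", "x3"}): 0.15}
C = 20  # caliper chosen in advance by the analyst

best = max(pmp.values())

# Step 1, Eq. (11.7): keep M_k only if best / p(M_k | y) <= C
A = {m for m, p in pmp.items() if best / p <= C}

# Step 2, Eq. (11.8): drop M_k if a simpler submodel in A has higher posterior probability
B = {m for m in A if any(l < m and pmp[l] > pmp[m] for l in A)}

keep = A - B  # the models actually averaged in Eq. (11.9)
```

Here the full model {x1, x2, x3} is dropped in the second step because its submodel {x1, x2} has a higher posterior probability, which is exactly Occam's razor at work.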
Madigan and Raftery (1994) outline an approach to the choice between two
models to be considered for Bayesian model averaging. To make the approach
clear, consider the case of just two models M1 and M0 , where M0 is the simpler of the
two models. This could be the case where M0 contains fewer predictors than M1 in
a regression analysis. In terms of the log posterior odds, if the odds are large and positive, indicating support for M1, then we reject M0. If the log posterior odds are large and negative, then we reject M1 in favor of M0. Finally, if the log posterior odds lie between the pre-set bounds, then both models are retained. For linear regression models, the
leaps and bounds algorithm combined with Occam’s window is available in the
bicreg option in the R program BMA (Raftery et al., 2020).
In MC3, the chain moves from the current model M to a candidate model M′ with probability

min{ 1, pr(M′ | y) / pr(M | y) }        (11.10)

otherwise, the chain stays in model M. We recognize the term inside Equation (11.10) as the Metropolis acceptance ratio presented in Equation (4.4).
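A toy MC3 chain (Python; the two models and their posterior probabilities are hypothetical) shows that this acceptance rule visits models in proportion to pr(M | y):

```python
import random

rng = random.Random(7)
pmp = {"M0": 0.2, "M1": 0.8}            # hypothetical posterior model probabilities
neighbors = {"M0": ["M1"], "M1": ["M0"]}  # each model's proposal neighborhood

def mc3_step(current):
    # Propose a neighboring model; accept with probability min(1, pr(M'|y)/pr(M|y))
    proposal = rng.choice(neighbors[current])
    if rng.random() < min(1.0, pmp[proposal] / pmp[current]):
        return proposal
    return current

state, visits = "M0", {"M0": 0, "M1": 0}
for _ in range(20_000):
    state = mc3_step(state)
    visits[state] += 1

share_m1 = visits["M1"] / 20_000  # approaches pr(M1 | y) = 0.8
```

In practice the model space is enormous and the neighborhoods consist of models differing by one predictor, but the acceptance step is exactly this simple.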
Parameter Priors
A large number of choices for parameter priors are available in the R software
program BMS (Zeugner & Feldkircher, 2015) and are based on variations of Zell-
ner’s g-prior (Zellner, 1986). Specifically, Zellner introduced a natural-conjugate
normal-gamma g-prior for regression coefficients β under the normal linear re-
gression model, written as
y_i = x′_i β + ε_i        (11.11)

where ε_i is iid N(0, σ²). For a given model, say M_k, Zellner's g-prior can be written as

β_k | σ², M_k, g ∼ N( 0, σ² g (x′_k x_k)^{-1} )        (11.12)
Feldkircher and Zeugner (2009) have argued for using the g-prior for two reasons:
its consistency in asymptotically uncovering the true model, and its role as a
penalty term for model size.
The g-prior has been the subject of some criticism. In particular, Feldkircher
and Zeugner (2009) have pointed out that the particular choice of g can have a
very large impact on posterior inferences drawn from BMA. Specifically, small
values of g can yield a posterior mass that is spread out across many models while
large values of g can yield a posterior mass that is concentrated on fewer models.
Feldkircher and Zeugner (2009) use the term supermodel effect to describe how
values of g impact the posterior statistics, including posterior model probabilities
(PMPs) and posterior inclusion probabilities (PIPs).
To account for the supermodel effect, researchers such as Fernández et al.
(2001a), Liang et al. (2008), Eicher et al. (2011), and Feldkircher and Zeugner
(2009) have proposed alternative priors based on extensions of the work of Zellner
(1986). Generally speaking, these alternatives can be divided into two categories:
fixed priors and flexible priors. Examples of fixed parameter priors include the
unit information prior, the risk inflation criterion prior, the benchmark risk inflation
criterion, and the Hannan-Quinn prior (Hannan & Quinn, 1979). Examples of
flexible parameter priors include the local empirical Bayes prior (E. George & Foster,
2000; Liang et al., 2008; Hansen & Yu, 2001), and the family of hyper-g priors
(Feldkircher & Zeugner, 2009).
Model Priors
In addition to parameter priors, it is essential to consider priors over the space of
possible models, which concerns our prior belief regarding whether the true model
lies within the space of possible models. Among those implemented in BMS, the uniform model prior is a common default which specifies that if there are Q candidate predictors, then each of the 2^Q possible models receives prior probability 2^{-Q}. The problem
with the uniform model prior is that the expected model size is Q/2, when in
fact there are many more models of intermediate size than there are models with
extreme sizes. For example, with six variables, there are more models of size 2 or
5 than there are 1 or 6. As a result, the uniform prior ends up placing more mass
on models of intermediate size. An alternative is the binomial model prior which
proposes placing a fixed inclusion probability θ on each predictor of the model.
The problem is that θ is treated as fixed, and so to remedy this problem, Ley and
Steel (2009) proposed a beta-binomial model prior which treats θ as random specified
by a beta distribution. Unlike the uniform model prior, the beta-binomial prior places equal prior mass across model sizes.
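The size bias of the uniform model prior is easy to verify by counting the models of each size (Python; Q = 6 as in the six-variable example above):

```python
from math import comb

Q = 6  # number of candidate predictors, as in the six-variable example above

# Uniform prior over all 2^Q models: each model gets probability 2^-Q, so the
# implied prior on model *size* s is C(Q, s) / 2^Q
size_prior = [comb(Q, s) / 2**Q for s in range(Q + 1)]

expected_size = sum(s * p for s, p in enumerate(size_prior))  # Q/2 = 3
```

The implied size distribution is binomial, peaking at intermediate sizes with mean Q/2, which is precisely the behavior the beta-binomial prior is designed to flatten out.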
where, for example, ỹ_i is the predictive density for the ith person, x and y represent
the model information for the remaining individuals, and x̃i is the information on
the predictors for individual i. The model with the lowest log predictive score is
deemed best in terms of long-run predictive performance.
We will focus again on the reading literacy results from PISA 2018. The list of
variables used in this example is given below in Table 11.1.
For this example, we use the software package BMS (Zeugner & Feldkircher,
2015), which implements the so-called Birth/Death (BD) algorithm as a default for
conducting MC3. See Zeugner and Feldkircher (2015) for more details on the BD
algorithm.
The analysis steps for this example are as follows:
1. We begin by implementing BMA with default unit information priors for
the model parameters and the uniform prior on the model space. We will
outline the major components of the results including the posterior model
probabilities and the posterior inclusion probabilities.
2. We next examine the results under different combinations of parameter and
model priors available in BMS and compare results using the LPS and KLD.
BMA Results
We first call BMS using the unit information prior (g = "uip") and the uniform model prior (mprior = "uniform").
plotModelsize(PISAbmsMod1, col="black") # prior and posterior distributions of model size
density(PISAbmsMod1)                    # marginal posterior densities of the coefficients
The Bayesian model averaging results under unit information priors for model
parameters and the uniform prior for the model space are shown in Tables 11.2
and 11.3. We note that there are 19 predictors and thus 2^19 = 524,288 models in the full space of models to be visited. Table 11.2 presents a summary of the BD algorithm used to implement MC3 in BMS. We find that the algorithm visited only 471 models (0.09%) out of the total model space; however, these models accounted for 100% of the posterior model mass.2 The column labeled "Avg # predictors"
shows that across all of the models explored by the algorithm, the average number
of predictors was 11.8 out of 19.
2. This percentage is obtained by summing over the PMPs for all models explored by the algorithm and dividing by the total number of those models.
In the second row of Table 11.2 below we present the posterior model prob-
abilities associated with the top 5 models out of the 471 models explored by the
algorithm. It is important to note that Model 1 would also be associated with the
lowest Bayesian information criterion. Hence, on the basis of the low PMP for
Model 1 (0.35), we can see that selecting Model 1 and acting as though this is the
model we considered ahead of time considerably underestimates the uncertainty
in our model choice. Moreover, as Clyde and Iversen (2013) remind us, this model
might not be the one closest to the BMA solution.
TABLE 11.2. Summary of birth/death algorithm and top posterior model probabilities
Summary of Algorithm
It may be interesting to examine the impact of the model prior on the posterior
distribution of model sizes. The results are shown below in Figure 11.1.
FIGURE 11.1. Posterior model size under unit information parameters priors and uniform
model prior.
Notice in Figure 11.1 that although the code called for the uniform model prior, the implied prior distribution over model size is not actually uniform. This is because, as discussed in the earlier section on model priors, the expected model size under the uniform prior is Q/2, here approximately nine. That is, there are more models containing nine predictors than, say, containing two predictors, and so the uniform prior over models ends up placing more mass on models of intermediate size. Note also that the mean posterior model size is approximately 11, somewhat larger than the mean prior model size, suggesting that after encountering the data the posterior places greater importance on slightly larger models.
Table 11.3 below presents a summary of the BMA results. The column labeled
“PIP” shows the posterior inclusion probabilities for each variable, referring to
the sum of the posterior model probabilities for all models for which the variable
was included. For example, the PIP for ESCS is 1.00, meaning that 100% of
the posterior model mass rests on models that include ESCS. In contrast, only
0.09% of the model mass rests on ATTLNACT. The PIP thus provides a different
perspective on variable importance. The columns labeled “Post Mean” and “Post
SD” are the posterior estimates of the regression coefficients and their posterior
standard deviations, respectively. The column labeled “Cond. Pos. Sign” refers
to the probability that the sign of the respective regression coefficient is positive
conditional on its inclusion in the model. We find, for example, that the sign of
ESCS is positive in 100% of the models in which ESCS appears. By contrast, the probability that the sign of the PISADIFF effect is positive is zero, meaning that in 100% of the models visited by the algorithm, the sign of PISADIFF is negative.3
Finally, we present the frequentist p-values from a simple ordinary least squares regression applied to these data.
3. The probabilities listed under the Cond. Pos. Sign column will often range from zero to one, but for these results, it appears that all 471 models show clarity with respect to the sign of the posterior coefficients.
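The PIP computation described above amounts to summing the posterior model probabilities over the models that contain a given variable. A minimal sketch (Python; the three models and their PMPs are hypothetical):

```python
# Hypothetical posterior model probabilities, keyed by each model's predictor set
models = [({"ESCS", "JOYREAD"}, 0.40),
          ({"ESCS"},            0.35),
          ({"ESCS", "GFOFAIL"}, 0.25)]

def pip(var):
    # Posterior inclusion probability: total PMP of the models containing var
    return sum(p for preds, p in models if var in preds)
```

In this toy setup ESCS appears in every model, so its PIP is 1, while JOYREAD inherits only the mass of the single model that includes it, which is exactly how the PIP column in Table 11.3 is read.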
TABLE 11.3. Summary of BMA with unit information parameter priors and uniform model
priors
Predictor PIP Post. Coeff. Post. SD Cond. pos. sign Freq. p-value
ESCS 1.00 18.97 1.30 1.00 0.000
METASUM 1.00 27.99 1.29 1.00 0.000
TEACHINT 1.00 12.53 1.51 1.00 0.000
JOYREAD 1.00 10.33 1.32 1.00 0.000
GFOFAIL 1.00 11.06 1.21 1.00 0.000
MASTGOAL 1.00 −13.34 1.50 0.00 0.000
SCREADCOMP 1.00 10.13 1.49 1.00 0.000
PISADIFF 1.00 −29.71 1.46 0.00 0.000
PERFEED 0.98 −5.00 1.55 0.00 0.001
SWBP 0.87 −3.95 1.92 0.00 0.015
WORKMAST 0.86 4.28 2.16 1.00 0.008
FEMALE 0.64 5.14 4.37 1.00 0.006
ADAPTIVE 0.12 0.45 1.31 1.00 0.053
SCREADDIFF 0.08 −0.19 0.78 0.00 0.007
COMPETE 0.07 0.19 0.79 1.00 0.038
BELONG 0.05 −0.15 0.72 0.00 0.020
HOMEPOS 0.02 0.02 0.29 1.00 0.046
ICTRES 0.01 −0.02 0.23 0.00 0.020
ATTLNACT 0.00 0.00 0.09 1.00 0.500
We find that the first 12 predictors (ESCS through FEMALE) have relatively high PIPs. The majority of these predictors have PIPs of 1.0, indicating their importance, and these are also associated with statistically significant p-values. It is also interesting to note that these predictors contain a mix of demographic measures (e.g., ESCS, FEMALE), attitudes/perceptions (e.g., TEACHINT, JOYREAD, SCREADCOMP), and cognitive strategies involved in reading (e.g., METASUM). Perhaps most importantly, we find that some coefficients that are statistically significant in the ordinary least squares regression have very small posterior inclusion probabilities when accounting for model uncertainty. For example, the variable SCREADDIFF (self-perception of reading difficulty) has an OLS estimate of −4.5 (not shown) and is statistically significant (p = .007). However, when accounting for model uncertainty, this coefficient is −0.19 with a posterior inclusion probability of 0.08. This finding is what is meant by "...over-confident inferences..." (Hoeting et al., 1999).
TABLE 11.4. Summary of birth/death algorithm and top posterior model probabilities:
Model 2
Summary of Algorithm
The posterior model probabilities are uniformly smaller under this set of prior
specifications. The impact of the beta-binomial model prior on the distribution of
model size is shown below in Figure 11.2.
FIGURE 11.2. Posterior model size under unit information parameters priors and beta-
binomial model prior.
Here we see that the prior distribution of model size is completely flat, and the posterior model size under the beta-binomial prior is slightly higher than under the uniform model prior. Finally, the results of the BMA are displayed below in Table 11.5.
TABLE 11.5. Summary of BMA with unit information parameter priors and random model
priors: Model 2
We find that the results are virtually identical to those in Table 11.3 under uni-
form model priors, with differences among variables with much smaller inclusion
probabilities. This can be seen by comparing the PIPs in Figure 11.3 below.
# KLD and log predictive score for Model 1
KLD(predicted.values.Mod1, PISAdata[,1])
lps.bma(predDensM1, realized.y = PISAdata[,1])
# KLD and log predictive score for Model 2
KLD(predicted.values.Mod2, PISAdata[,1])
lps.bma(predDensM2, realized.y = PISAdata[,1])
The results of this comparison are shown in Kaplan (2021), in which the KLD and LPS measures for both models are virtually identical. The KLD values for both models were 0.015 and the LPS scores for both models were 5.84. This finding suggests that the choice of these parameter and model priors does not impact the predictive performance of these models, which is perhaps not surprising given the large sample size for this example.
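For reference, the KLD used here is, for discrete densities, the familiar sum Σ p log(p/q); a self-contained sketch (Python, with made-up densities; the LaplacesDemon KLD function applies the same idea to the predicted and observed densities):

```python
import math

def kld(p, q):
    # Kullback-Leibler divergence of q from p: sum of p_i * log(p_i / q_i)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.20, 0.50, 0.30]  # made-up "observed" density
q = [0.25, 0.45, 0.30]  # made-up "predicted" density

divergence = kld(p, q)  # nonnegative; 0 only when p and q coincide
```

Smaller values indicate that the predicted distribution is closer to the observed one, which is how the KLD values of 0.015 above are interpreted.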
and will equal the true model in an infinitely large sample. If the true model is not contained in the set of models under consideration, then BMA is not consistent.
y = f_k(x) + ε        (11.15)
where fk are different models of the reading literacy outcome, for example, some
models may include only demographic predictors, while others may include vari-
ous combinations of attitudes and behaviors related to reading literacy. Indeed, fk
might even reflect a non-linear model of reading literacy. Predictions from these
separate models are then combined (stacked) as (see Le & Clarke, 2017)
ỹ = ∑_{k=1}^{K} ŵ_k f̂_k(x)        (11.16)
where f̂_k estimates f_k, and f̂_{k,−i} below denotes the estimate of f_k obtained with observation i left out. The weights ŵ = (ŵ_1, ŵ_2, …, ŵ_K) are obtained as

ŵ = argmin_w ∑_{i=1}^{n} ( y_i − ∑_{k=1}^{K} w_k f̂_{k,−i}(x_i) )²        (11.17)
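Equation (11.17) can be sketched for two models (Python; the outcome values and leave-one-out predictions are hypothetical, and a simple grid search over the weight simplex stands in for a proper constrained optimizer):

```python
# Hypothetical outcome values and leave-one-out predictions from two models
y      = [10.0, 12.0, 9.0, 11.0, 10.5]
f1_loo = [9.0, 11.0, 8.5, 10.0, 10.0]    # model 1: f_hat_{1,-i}(x_i)
f2_loo = [11.5, 13.0, 10.0, 12.5, 11.5]  # model 2: f_hat_{2,-i}(x_i)

def sse(w1):
    # Leave-one-out squared error of the stacked prediction with weights (w1, 1 - w1)
    w2 = 1.0 - w1
    return sum((yi - (w1 * a + w2 * b)) ** 2
               for yi, a, b in zip(y, f1_loo, f2_loo))

# Minimize Eq. (11.17) over the weight simplex; a coarse grid search suffices here
w1_hat = min((i / 1000 for i in range(1001)), key=sse)
```

Because model 1 underpredicts and model 2 overpredicts in this toy example, the optimal stacked weight lands strictly between 0 and 1, blending the two models rather than selecting one.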
For this chapter, we demonstrate Bayesian stacking using the software pro-
gram loo with the same PISA 2018 dataset used to demonstrate BMA. The analysis
steps for this demonstration are as follows:
1. Specify four models of reading literacy. From Table 11.1, Model 1 includes only demographic measures (FEMALE, ESCS, HOMEPOS, ICTRES); Model 2 includes only attitudes and behaviors specifically directed toward reading (JOYREAD, PISADIFF, SCREADCOMP, SCREADDIF); Model 3 includes predictors related to academic mindset as well as general well-being (METASUM, GFOFAIL, MASTGOAL, SWBP, WORKMAST, ADAPTIVITY, COMPETE); and Model 4 includes attitudes toward school (PERFEED, TEACHINT, BELONG).
2. Obtain results from log-score stacking weights, pseudo-BMA weights, and
pseudo-BMA+ weights.
3. Obtain posterior predictive distributions using the R software program rstanarm (Goodrich et al., 2022).
4. Obtain KLD measures comparing the predicted distribution of reading
scores to the observed distribution.
The following code can be used to implement Bayesian stacking. To begin,
we require rstanarm, loo, and LaplacesDemon.
library(rstanarm)
library(loo)
library(LaplacesDemon)
After reading in the data, we write a list containing the models to be compared.
+ BELONG)
With the weights in hand, we next produce a weighted combination of the posterior predictions under each type of weight. The command posterior_predict comes from rstanarm.
We now have the predicted values which we can compare to the actual reading
scores using the KLD scoring rule in the following code.
Table 11.6 below presents the results for Bayesian stacking with different
choices of weights.
TABLE 11.6. Log-score stacking, pseudo-BMA, and pseudo-BMA+ weights along with
LOO-IC and Kullback-Leibler divergence
We find that Model 2, which includes predictors related to attitudes and behaviors directed toward reading, has the highest weight regardless of how the weights were calculated. We find that pseudo-BMA and pseudo-BMA+ place almost all of the weight on Model 2, whereas the stacking weights based on the log predictive score are somewhat more spread out, with Model 3 having the next highest weight. We also find that Model 2 has the lowest LOO-IC value.
The bottom row of Table 11.6 presents the KLD measures obtained from com-
paring the distribution of predicted reading scores to the observed reading scores
for each method of obtaining weights. Keep in mind that the predicted distribution
under stacking is based on mixing the predicted distributions from the different
models with mixing proportions equal to the weights. Here we find that the lowest
KLD value is obtained under the log-score stacking weights. Overall, we find that stacking using LOO-based weights provides the best predictive performance. It
may be interesting to note that the KLD values for the BMA results are uniformly
lower compared to the KLD values in Table 11.6 although it needs to be reiterated
that BMA assumes an M-closed framework.
11.6 Summary
Although the orientation of this chapter was focused on Bayesian methods for
quantifying model uncertainty, it should be pointed out that issues of model un-
certainty and model averaging have been addressed within the frequentist domain.
The topic of frequentist model averaging (FMA) has been covered extensively in
Hjort and Claeskens (2003), Claeskens and Hjort (2008), and Fletcher (2018). Our
focus on Bayesian model averaging is based on some important advantages over
FMA. As noted by Steel (2020), (1) BMA is optimal (under M-closed) in terms of
prediction as measured by the log predictive density score; (2) BMA is easier to
implement in situations where the model space is large due to very fast algorithms such as MC3; (3) BMA naturally leads to substantively valuable interpretations of
posterior model probabilities and posterior inclusion probabilities; and (4) in the
majority of content domains wherein model averaging is required, BMA is more
frequently used than FMA.
12
Closing Thoughts
2. Specify the functional form of the relationship between the outcome and the
predictors. For the social sciences, this will most likely be a type of linear or
generalized linear model, but more complex models are, of course, possible.
Chapter 8, for example, discusses Bayesian methods for continuous and
categorical latent variables. As an aside, it is important to note that there
may be more than one model that could have plausibly generated the data.
Keeping the problem of model uncertainty in the back of one’s mind is quite
important, depending on the goals of the analysis.
3. Take note of the complexities of the data structure, for example, are the data
generated from a clustered sampling design? Are there sampling weights?
Accounting for the complexities of the data structure can be handled by
careful specification of a Bayesian hierarchical model, and this was discussed
in Chapter 7 with examples involving multilevel modeling. This book did
not cover the use of sampling weights, but these can be easily incorporated
in Stan-based programs such as rstanarm (Goodrich et al., 2022) and brms
(Bürkner, 2021).
4. Decide on the prior distributions for all parameters in the model. As discussed in Chapter 2, these priors will be either non-informative, weakly informative, informative, or a mix of the three. In addition, the goals of an
analysis might be to induce sparsity and therefore the choices of shrink-
age priors discussed in Chapter 10 are available. Because our examples
were based on large-sample data, we found little impact of prior choice
on posterior results, but of course this will not always be the case. Thus,
an important activity at this step is to generate data according to the prior
distributions and gauge the sensibility and sensitivity of the priors, so-called prior predictive checking, discussed in Chapter 6. Finally, in the spirit
of research transparency, the origin of all priors must be communicated to
the research community.
5. After running the analysis, it is essential to check the convergence criteria
of the algorithm. The basics of Bayesian computation, along with conver-
gence criteria, were discussed in Chapter 4. Note that results cannot be
communicated unless there is overwhelming evidence from a variety of di-
agnostics that the algorithm converged. There are instances where there
may be contradictory evidence of convergence. For example, trace plots
may appear fine, but Rhat values may be somewhat problematic. All at-
tempts should be made to improve these diagnostics before communicating
the results. In most cases, if the effective sample size and Rhat values are
reasonable, then one can proceed with communicating the results. This
is because these diagnostics together capture autocorrelation, mixing, and
trend in the iterations.
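For intuition, the basic (non-split, non-rank-normalized) Rhat computation can be written in a few lines of standard-library Python. This is a teaching sketch of the original Gelman-Rubin diagnostic, not the refined estimator used by modern Stan interfaces:

```python
import random
import statistics

def rhat(chains):
    """Basic potential scale reduction factor for a list of equal-length chains."""
    m = len(chains)                      # number of chains
    n = len(chains[0])                   # iterations per chain
    chain_means = [statistics.mean(c) for c in chains]
    grand_mean = statistics.mean(chain_means)
    # Between-chain variance B and mean within-chain variance W
    b = n / (m - 1) * sum((cm - grand_mean) ** 2 for cm in chain_means)
    w = statistics.mean(statistics.variance(c) for c in chains)
    var_hat = (n - 1) / n * w + b / n    # pooled estimate of the posterior variance
    return (var_hat / w) ** 0.5

random.seed(1)
# Well-mixed chains sampling the same distribution give Rhat near 1;
# chains stuck in different regions give Rhat well above 1
mixed = [[random.gauss(0, 1) for _ in range(500)] for _ in range(2)]
stuck = [[random.gauss(0, 1) for _ in range(500)],
         [random.gauss(5, 1) for _ in range(500)]]
print(round(rhat(mixed), 2), round(rhat(stuck), 2))
```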
6. Given evidence of convergence, and with the results in hand, posterior
predictive checking is a necessary step in the Bayesian workflow. Posterior
predictive checks can be set up to gauge overall model fit, but, depending on
the goals of the analysis, they can also target specific aspects of the posterior
predictive distribution, such as the fit of the model to extreme values.
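As a generic sketch of such a check (not code from the book), the following simulates a replicated dataset from each retained posterior draw and compares a discrepancy statistic, here the sample maximum, so that the check targets fit to extreme values. The posterior draws below are hypothetical stand-ins for output from an actual sampler:

```python
import random

random.seed(7)

# Observed data and the discrepancy statistic of interest (the maximum)
observed = [random.gauss(0, 1) for _ in range(100)]
obs_stat = max(observed)

# Hypothetical posterior draws for (mu, sigma); in practice these come from MCMC
posterior_draws = [(random.gauss(0, 0.1), abs(random.gauss(1, 0.05)))
                   for _ in range(2000)]

# For each draw, simulate a replicated dataset and compare its maximum
# with the observed maximum
exceed = 0
for mu, sigma in posterior_draws:
    y_rep = [random.gauss(mu, sigma) for _ in range(len(observed))]
    if max(y_rep) >= obs_stat:
        exceed += 1

# Posterior predictive p-values near 0 or 1 signal misfit in the aspect
# of the data that the statistic targets; here, the extreme upper tail
ppp = exceed / len(posterior_draws)
print(ppp)
```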
Closing Thoughts
12.2.1 Coherence
The rules of probability, as described in Chapter 1 and manifested in Bayes'
theorem, cohere in the sense of being internally consistent, providing only one
method for obtaining an answer. The rules of probability, together with Bayes'
theorem, are coherent because they align with the axioms of rational decision making.
Practically speaking, coherence allows one to avoid a Dutch book: a set of bets, each
acceptable under one's stated probabilities, that together guarantee a loss no matter
the outcome.
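The Dutch book argument can be illustrated with simple arithmetic (a standard textbook example, not one from this chapter). Suppose an agent's stated probabilities violate the sum rule, say P(A) = 0.6 and P(not-A) = 0.6; then a bookie can sell the agent bets at those prices and guarantee the agent a loss in every state of the world:

```python
def bet_payoff(price, stake, wins):
    """Net payoff of buying, for `price`, a ticket that pays `stake`
    if the event occurs."""
    return (stake if wins else 0.0) - price

# Incoherent beliefs: P(A) = 0.6 and P(not-A) = 0.6 sum to 1.2 > 1,
# violating the rules of probability. A bookie sells the agent a $1
# ticket on A for $0.60 and a $1 ticket on not-A for $0.60.
outcomes = {}
for a_occurs in (True, False):
    outcomes[a_occurs] = (bet_payoff(0.60, 1.0, a_occurs)
                          + bet_payoff(0.60, 1.0, not a_occurs))
print(outcomes)  # the agent's total payoff is negative in every state
```

The agent loses $0.20 whether or not A occurs; coherent probabilities, which sum to 1, make such a guaranteed loss impossible.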
itself with inferences based on the data in hand, not on inferences based on data
that have never been observed. Finally, as discussed at length in Chapter 6, the
goal of hypothesis testing in the Bayesian framework is not to make statements
in support or refutation of a null hypothesis, but rather to fully summarize the
distribution of the parameters of interest and to examine the predictive quality
of a proposed model. Posterior predictive checking provides a way of probing
whether a model can predict data that actually have occurred, and falls squarely
into neo-Popperian theory as noted by Gelman and Shalizi (2013) and discussed
in the Preface.
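In practice, "fully summarizing the distribution of the parameters of interest" often amounts to reporting posterior quantiles. A minimal standard-library Python sketch, with hypothetical stand-in draws, might look like this:

```python
import random

def credible_interval(draws, level=0.95):
    """Equal-tailed credible interval computed from posterior draws."""
    s = sorted(draws)
    lo_idx = int((1 - level) / 2 * len(s))
    hi_idx = int((1 + level) / 2 * len(s)) - 1
    return s[lo_idx], s[hi_idx]

random.seed(3)
# Hypothetical posterior draws for a parameter of interest
draws = [random.gauss(0.4, 0.1) for _ in range(4000)]
lo, hi = credible_interval(draws)
print(f"95% credible interval: [{lo:.2f}, {hi:.2f}]")
```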
12.2.4 Validity
It has been argued by Gigerenzer et al. (2004) that Bayesian statistics provides
the inferences that analysts actually care about: researchers wish to state, and often
do report, results with respect to their hypotheses of interest. In other words,
the analyst wishes to make statements about the probability of a particular hypothesis
of interest. However, the Neyman-Pearson framework, with its requirement of
setting the probability of a Type I error before any data are collected, and the
Fisherian framework of interpreting the p-value as the strength of evidence
against the null hypothesis,¹ both preclude this wish. The analyst's wish can come
true, however, in a Bayesian framework, because that framework provides probability
assessments of the scientific hypothesis actually under consideration. Indeed, access
to the posterior distribution of the parameter(s) of interest provides a much richer
and more nuanced description of the research question than the relatively artificial
dichotomization of the research result into "significant" or "non-significant."

¹ However, as pointed out by Wagenmakers et al. (2008), Fisher himself held that the p-value
postulate was correct.
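Given posterior draws, the probability of a directional hypothesis is a one-line computation. The sketch below uses hypothetical stand-in draws for a regression coefficient; in an actual analysis they would come from the sampler:

```python
import random

random.seed(11)
# Hypothetical posterior draws for a regression coefficient; in an actual
# analysis these would come from MCMC output
beta_draws = [random.gauss(0.3, 0.2) for _ in range(5000)]

# P(beta > 0 | data): a probability statement about the hypothesis itself,
# not a p-value computed under a null hypothesis
prob_positive = sum(b > 0 for b in beta_draws) / len(beta_draws)
print(prob_positive)
```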
Abbreviations
M-H Metropolis-Hastings
MAR Missing at random
MC3 Markov chain Monte Carlo model composition
MCAR Missing completely at random
MCMC Markov chain Monte Carlo
VB Variational Bayes
References
Furnival, G. M., & Wilson, Jr., R. W. (1974). Regressions by leaps and bounds.
Technometrics, 16, 499–511.
Geisser, S., & Eddy, W. F. (1979). A predictive approach to model selection. Journal
of the American Statistical Association, 74, 153–160.
Gelfand, A. (1996). Model determination using sampling-based methods. In
W. R. Gilks, S. Richardson, & D. J. Spiegelhalter (Eds.), Markov chain Monte
Carlo in practice (pp. 145–161). Chapman & Hall.
Gelman, A. (1996). Inference and monitoring convergence. In W. R. Gilks,
S. Richardson, & D. J. Spiegelhalter (Eds.), Markov chain Monte Carlo in
practice (pp. 131–143). Chapman & Hall.
Gelman, A. (2006). Prior distributions for variance parameters in hierarchical
models. Bayesian Analysis, 1, 515–533.
Gelman, A. (2013). Understanding posterior p-values. Electronic Journal of Statis-
tics.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D., Vehtari, A., & Rubin, D. B.
(2014). Bayesian data analysis (3rd ed.). Chapman & Hall.
Gelman, A., & Hill, J. (2003). Data analysis using regression and multilevel/hierarchical
models. Cambridge University Press.
Gelman, A., Hwang, J., & Vehtari, A. (2014). Understanding predictive information
criteria for Bayesian models. Statistics and Computing, 24, 997–1016.
Gelman, A., Meng, X.-L., & Stern, H. (1996). Posterior predictive assessment
of model fitness via realized discrepancies: With commentary. Statistica
Sinica, 6, 733–807.
Gelman, A., & Rubin, D. B. (1992a). Inference from iterative simulation using
multiple sequences. Statistical Science, 7, 457–511.
Gelman, A., & Rubin, D. B. (1992b). A single series from the Gibbs sampler
provides a false sense of security. In J. M. Bernardo, J. O. Berger, A. P. Dawid,
& A. F. M. Smith (Eds.), Bayesian statistics 4 (pp. 625–631). Oxford University
Press.
Gelman, A., & Rubin, D. B. (1995). Avoiding model selection in Bayesian social
research. Sociological Methodology, 25, 165–173.
Gelman, A., & Shalizi, C. R. (2013). Philosophy and the practice of Bayesian
statistics. British Journal of Mathematical and Statistical Psychology, 66, 8–38.
Gelman, A., Simpson, D., & Betancourt, M. (2017). The prior can often only be
understood in the context of the likelihood. Entropy, 19.
Gelman, A., Vehtari, A., Simpson, D., Margossian, C. C., Carpenter, B., Yao, Y., . . .
Modrák, M. (2020). Bayesian workflow. arXiv.
Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions and
the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 6, 721–741.
George, E., & Foster, D. (2000). Calibration and empirical Bayes variable selection.
Biometrika, 87, 731–747.
George, E. I., & McCulloch, R. E. (1993). Variable selection via Gibbs sampling.
Journal of the American Statistical Association, 88, 881–889.
Gigerenzer, G., Krauss, S., & Vitouch, O. (2004). The null ritual: What you
always wanted to know about significance testing but were afraid to ask.
In D. Kaplan (Ed.), The Sage handbook of quantitative methodology for the social
sciences (pp. 391–408). Sage.
Gilks, W. R., Richardson, S., & Spiegelhalter, D. J. (1996a). Introducing Markov
chain Monte Carlo. In W. R. Gilks, S. Richardson, & D. J. Spiegelhalter (Eds.),
Markov chain Monte Carlo in practice (pp. 1–19). Chapman & Hall.
Gilks, W. R., Richardson, S., & Spiegelhalter, D. J. (Eds.). (1996b). Markov chain
Monte Carlo in practice. Chapman & Hall.
Gneiting, T., & Raftery, A. (2007). Strictly proper scoring rules, prediction, and
estimation. Journal of the American Statistical Association, 102, 359–378.
Goldstein, H. (2011). Multilevel statistical models (4th ed.). Wiley.
Good, I. J. (1952). Rational decisions. Journal of the Royal Statistical Society. Series B
(Methodological), 14, 107–114.
Goodrich, B., Gabry, J., Ali, I., & Brilleman, S. (2022). rstanarm: Bayesian applied
regression modeling via Stan. https://mc-stan.org/rstanarm/
Haig, B. D. (2018). The philosophy of quantitative methods: Understanding statistics.
Oxford University Press.
Hanea, A., Nane, G., Bedford, T., & French, S. (2021). Expert judgement in risk and
decision analysis. Springer Nature.
Hannan, E. J., & Quinn, B. G. (1979). The determination of the order of an
autoregression. Journal of the Royal Statistical Society. Series B (Methodological),
41(2), 190–195.
Hansen, M. H., & Yu, B. (2001). Model selection and the principle of minimum
description length. Journal of the American Statistical Association, 96, 746–774.
Harlow, L. L., Mulaik, S. A., & Steiger, J. H. (1997). What if there were no significance
tests? Erlbaum.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning.
Springer.
Heckman, J. J., & Kautz, T. (2012). Hard evidence on soft skills. Labour Economics,
19, 451–464.
Hinne, M., Gronau, Q. F., van den Bergh, D., & Wagenmakers, E.-J. (2020). A
conceptual introduction to Bayesian model averaging. Advances in Methods
and Practices in Psychological Science, 3, 200-215.
Hjort, N. L., & Claeskens, G. (2003). Frequentist model average estimators. Journal
of the American Statistical Association, 98, 879–899.
Hoerl, A. E., & Kennard, R. W. (1970). Ridge regression: Biased estimation for
nonorthogonal problems. Technometrics, 12(1), 55–67.
Hoerl, R. W. (1985). Ridge analysis 25 years later. The American Statistician, 39(3),
186–192.
Hoeting, J. A., Madigan, D., Raftery, A., & Volinsky, C. T. (1999). Bayesian model
averaging: A tutorial. Statistical Science, 14, 382–417.
Hoffman, M. D., & Gelman, A. (2014). The No-U-Turn sampler: Adaptively
setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning
Research, 15, 1593-1623. http://jmlr.org/papers/v15/hoffman14a.html
Honaker, J., & King, G. (2010). What to do about missing values in time-series
cross-section data. American Journal of Political Science, 54, 561–581.
Honaker, J., King, G., & Blackwell, M. (2011). Amelia II: A program for missing
data. Journal of Statistical Software, 45(7), 1–47. http://www.jstatsoft.org/v45/i07/
Howson, C., & Urbach, P. (2006). Scientific reasoning: The Bayesian approach. Open
Court.
Hsiang, T. C. (1975). A Bayesian view on ridge regression. Journal of the Royal
Statistical Society. Series D (The Statistician), 24, 267–268.
Jackman, S. (2009). Bayesian analysis for the social sciences. Wiley.
Jackman, S. (2012). pscl: Classes and methods for R developed in the political
science computational laboratory [Computer software manual]. http://
github.com/atahk/pscl
Jeffreys, H. (1961). Theory of probability (3rd ed.). Oxford University Press.
Jordan, M., Ghahramani, Z., Jaakkola, T., & Saul, L. (1999). An introduction to
variational methods for graphical models. Machine Learning, 37, 183-233.
doi: 10.1023/A:1007665907178
Jöreskog, K. G. (1967). Some contributions to maximum likelihood factor analysis.
Psychometrika, 32, 443-482.
Jose, V. R. R., Nau, R. F., & Winkler, R. L. (2008). Scoring rules, generalized entropy,
and utility maximization. Operations Research, 56, 1146–1157.
Kadane, J. B. (2011). Principles of uncertainty. Chapman & Hall/CRC Press.
Kaplan, D. (1995). The impact of BIB spiraling-induced missing data patterns on
goodness-of-fit tests in factor analysis. Journal of Educational and Behavioral
Statistics, 20, 69-82.
Kaplan, D. (2000). Structural equation modeling: Foundations and extensions. Sage.
Kaplan, D. (2004). The SAGE handbook of quantitative methodology for the social
sciences. Sage.
Kaplan, D. (2008). An overview of Markov chain methods for the study of stage-
sequential developmental processes. Developmental Psychology, 44, 457–467.
Kaplan, D. (2009). Structural equation modeling: Foundations and extensions. (2nd
ed.). Sage.
Kaplan, D. (2021). On the quantification of model uncertainty: A Bayesian Per-
spective. Psychometrika, 86, 215–238.
Kaplan, D., & Chen, J. (2014). Bayesian model averaging for propensity score
analysis. Multivariate Behavioral Research, 49, 505–517.
Kaplan, D., & Depaoli, S. (2012). Bayesian structural equation modeling. In
R. Hoyle (Ed.), Handbook of structural equation modeling (pp. 650–673). Guil-
ford Press.
Kaplan, D., & Huang, M. (2021). Bayesian probabilistic forecasting with large-scale
educational trend data: a case study using NAEP. Large-scale Assessments in
Education, 9.
Kaplan, D., & Kuger, S. (2016). The methodology of PISA: Past, present, and
future. In S. Kuger, E. Klieme, N. Jude, & D. Kaplan (Eds.), Assessing contexts
of learning world-wide – Extended context assessment frameworks. Springer.
Kaplan, D., & Lee, C. (2016). Bayesian model averaging over directed acyclic
graphs with implications for the predictive performance of structural equa-
tion models. Structural Equation Modeling, 23, 343–353.
Kaplan, D., & Su, D. (2016). On matrix sampling and imputation of context
questionnaires with implications for the generation of plausible values in
Madigan, D., & York, J. (1995). Bayesian graphical models for discrete data.
International Statistical Review, 63, 215–232.
Makowski, D., Ben-Shachar, M. S., & Lüdecke, D. (2019). bayestestR: Describing
effects and their uncertainty, existence and significance within the Bayesian
framework. Journal of Open Source Software, 4, 1541.
Martin, A. D., Quinn, K. M., & Park, J. H. (2011). MCMCpack: Markov chain
Monte Carlo in R. Journal of Statistical Software, 42, 22.
McCullagh, P., & Nelder, J. A. (1989). Generalized linear models (2nd ed.). Chapman
& Hall/CRC.
Meinfelder, F. (2011). BaBooN: Bayesian bootstrap predictive mean matching –
multiple and single imputation for discrete data [Computer software man-
ual]. https://cran.r-project.org/src/contrib/Archive/BaBooN/
Mengersen, K. L., Robert, C. P., & Guihenneuc-Jouyaux, C. (1999). MCMC conver-
gence diagnostics: A review. Bayesian Statistics, 6, 415–440.
Merkle, E. C., Fitzsimmons, E., Uanhoro, J., & Goodrich, B. (2021). Efficient
Bayesian structural equation modeling in Stan. Journal of Statistical Software,
100, 1-22.
Merkle, E. C., & Steyvers, M. (2013). Choosing a strictly proper scoring rule.
Decision Analysis, 10, 292–304.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., & Teller, E.
(1953). Equation of state calculations by fast computing machines. Journal
of Chemical Physics, 21, 1087–1091.
Mislevy, R. J. (1991). Randomization-based inference about latent variables from
complex samples. Psychometrika, 56, 177–196.
Mitchell, T. J., & Beauchamp, J. J. (1988). Bayesian variable selection in linear
regression. Journal of the American Statistical Association, 83(404), 1023–1032.
Montgomery, J. M., & Nyhan, B. (2010). Bayesian model averaging: Theoretical
developments and practical applications. Political Analysis, 18, 245–270.
Morey, R. D., Romeijn, J.-W., & Rouder, J. N. (2016). The philosophy of Bayes
factors and the quantification of statistical evidence. Journal of Mathematical
Psychology, 72, 6–18.
Mullis, I. V. S., & Martin, M. O. (2015). PIRLS 2016 assessment framework (2nd ed.).
TIMSS & PIRLS International Study Center, Boston College. http://
timssandpirls.bc.edu/pirls2016/framework.html
Muthén, L. K., & Muthén, B. (1998–2017). Mplus user’s guide (Eighth ed.). Muthén
& Muthén.
NCES. (2001). Early childhood longitudinal study: Kindergarten class of 1998-99: Base
year public-use data files user’s manual (Tech. Rep. No. NCES 2001-029). U.S.
Government Printing Office.
NCES. (2018). Early Childhood Longitudinal Program (ECLS) - Overview. https://
nces.ed.gov/ecls/
Neyman, J., & Pearson, E. S. (1928). On the use and interpretation of certain test
criteria for purposes of statistical inference. Biometrika, 20A, Part I, 175–240.
OECD. (2002). PISA 2000 technical report. Organization for Economic Cooperation
and Development.
OECD. (2010). PISA 2009 Results (Vol. I-VI). OECD.
OECD. (2017). PISA 2015 Technical Report. OECD.
Royall, R. M. (1986). The effect of sample size on the meaning of significance tests.
The American Statistician, 40(4), 313–315.
Royall, R. M. (1997). Statistical evidence: A likelihood paradigm. Chapman & Hall.
Rubin, D. B. (1981). The Bayesian bootstrap. Annals of Statistics, 9, 130–134.
Rubin, D. B. (1986). Statistical matching using file concatenation with adjusted
weights and multiple imputation. Journal of Business and Economic Statistics,
4, 87–95.
Rubin, D. B. (1987). Multiple imputation in nonresponse surveys. Wiley.
Ryan, R. M., & Deci, E. L. (2009). Promoting self-determined school engagement:
Motivation, learning, and well-being. In K. R. Wenzel & A. Wigfield (Eds.),
Handbook of motivation at school (p. 171-195). Routledge/Taylor & Francis
Group.
Savage, L. J. (1954). The foundations of statistics. Wiley.
Schad, D. J., Betancourt, M., & Vasishth, S. (2019). Toward a principled Bayesian
workflow in cognitive science. arXiv. https://arxiv.org/abs/1904.12765
Schafer, J. L. (1997). Analysis of incomplete multivariate data. Chapman & Hall/CRC.
Schwarz, G. E. (1978). Estimating the dimension of a model. Annals of Statistics, 6,
461–464.
Schafer, J. L. (2012). norm: Analysis of multivariate normal datasets with miss-
ing values [Computer software manual]. http://CRAN.R-project.org/
package=norm (R package version 1.0-9.4; ported to R by Alvaro A. Novo.)
Silvey, S. D. (1975). Statistical inference. CRC Press.
Sloughter, J. M., Gneiting, T., & Raftery, A. (2013). Probabilistic wind vector fore-
casting using ensembles and Bayesian model averaging. Monthly Weather
Review, 141, 2107–2119.
Spiegelhalter, D. J., Best, N. G., Carlin, B. P., & van der Linde, A. (2002). Bayesian
measures of model complexity and fit (with discussion). Journal of the Royal
Statistical Society, Series B (Statistical Methodology), 64, 583–639.
Stan Development Team. (2020). RStan: the R interface to Stan. http://mc-stan
.org/ (R package version 2.21.1)
Stan Development Team. (2021a). Stan modeling language users guide and
reference manual,version 2.26 [Computer software manual]. https://
mc-stan.org (ISBN 3-900051-07-0)
Stan Development Team. (2021b). Stan reference manual,version 2.30 [Computer
software manual]. https://mc-stan.org/docs/reference-manual/index
.html (ISBN 3-900051-07-0)
Statisticat, LLC. (2021). LaplacesDemon: Complete environment for Bayesian
inference [Computer software manual]. Bayesian-Inference.com. https://
cran.r-project.org/web/packages/LaplacesDemon/index.html
Steel, M. F. J. (2020). Model averaging and its use in economics. Journal of Economic
Literature, 58, 644–719.
Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions.
Journal of the Royal Statistical Society. Series B (Methodological), 36, 111–147.
Stone, M. (1977). An asymptotic equivalence of choice of model by cross-validation
and akaike’s criterion. Journal of the Royal Statistical Society. Series B (Method-
ological), 39, 44–47.
Su, Y.-S., Gelman, A., Hill, J., & Yajima, M. (2011). Multiple imputation with
diagnostics (mi) in R: Opening windows into the black box. Journal of
Statistical Software, 45(2), 1–31. http://www.jstatsoft.org/v45/i02/
Suppes, P. (1986). Comment on Fishburn, 1986. Statistical Science, 1, 347–350.
Tanner, M. A., & Wong, W. H. (1987). The calculation of posterior distributions
by data augmentation (with discussion). Journal of the American Statistical
Association, 82, 528–550.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of
the Royal Statistical Society. Series B (Methodological), 58, 267–288.
Tomasetti, N., Forbes, C. S., & Panagiotelis, A. (2022). Updating variational Bayes:
Fast sequential posterior inference. Statistics and Computing, 32.
Tourangeau, K., Nord, C., Lê, T., Sorongon, A. G., & Najarian, M. (2009). Early child-
hood longitudinal study, kindergarten class of 1998–99 (ECLS-K), combined user’s
manual for the ECLS-K eighth-grade and K–8 full sample data files and electronic
codebooks (NCES 2009–004). National Center for Education Statistics.
Tran, M.-N., Nguyen, T.-N., & Dao, V.-H. (2021). A practical tutorial on variational
Bayes. arXiv. https://arxiv.org/abs/2103.01327
Ulitzsch, E., & Nestler, S. (2022). Evaluating Stan’s variational Bayes algorithm
for estimating multidimensional IRT models. Psych, 4, 73–88. https://
www.mdpi.com/2624-8611/4/1/7
van Buuren, S. (2012). Flexible imputation of missing data. Chapman & Hall.
van de Schoot, R., Depaoli, S., King, R., Kramer, B., Märtens, K., Tadesse, M.,
. . . Yau, C. (2021). Bayesian statistical modelling. Nature Reviews Methods
Primers, 1, 1-26.
van Erp, S. (2020). A tutorial on Bayesian penalized regression with shrinkage
priors for small sample sizes. In R. van de Schoot & M. Miočević (Eds.),
Small sample size solutions (p. 71-84). Taylor & Francis.
van Erp, S., Oberski, D. L., & Mulder, J. (2019). Shrinkage priors for Bayesian
penalized regression. Journal of Mathematical Psychology, 89, 31-50.
Vehtari, A., Gabry, J., Yao, Y., & Gelman, A. (2019). loo: Efficient leave-one-out cross-
validation and WAIC for Bayesian models. https://CRAN.R-project.org/
package=loo (R package version 2.1.0)
Vehtari, A., Gelman, A., & Gabry, J. (2017). Practical Bayesian model evaluation
using leave-one-out cross-validation and WAIC. Statistics and Computing,
27, 1413–1432.
Vehtari, A., Gelman, A., Simpson, D., Carpenter, B., & Bürkner, P.-C. (2021).
Rank-normalization, folding, and localization: An improved R̂ for assessing
convergence of MCMC. Bayesian Analysis.
Vehtari, A., & Ojanen, J. (2012). A survey of Bayesian predictive methods for
model assessment, selection and comparison. Statistics Surveys, 6, 142-228.
DOI:10.1214/12-SS102
Vehtari, A., Simpson, D., Gelman, A., Yao, Y., & Gabry, J. (2021). Pareto smoothed
importance sampling. https://arxiv.org/abs/1507.02646
von Davier, M. (2013). Imputing proficiency data under planned missingness in
population models. In L. Rutkowski, M. von Davier, & D. Rutkowski (Eds.),
Handbook of international large-scale assessment: Background, technical issues,
and methods of data analysis. Chapman & Hall/CRC Press.